Learning Objectives
- Conduct and interpret a two-sample t-test for independent groups
- Conduct and interpret a paired t-test for dependent samples
- Compare two proportions using a two-proportion z-test
- Construct confidence intervals for the difference between two groups
- Choose the correct test based on study design and data type
In This Chapter
- Chapter Overview
- 16.1 A Puzzle Before We Start (Productive Struggle)
- 16.2 The Big Picture: From One Group to Two
- 16.3 The Two-Sample t-Test for Independent Groups
- 16.4 The Paired t-Test: When Data Come in Pairs
- 16.5 Paired vs. Independent: The Critical Choice
- 16.6 The Two-Proportion z-Test
- 16.7 Confidence Intervals for Differences: The Full Picture
- 16.8 Choosing the Right Test: A Decision Flowchart
- 16.9 Python: Two-Group Tests
- 16.10 Excel: Two-Group Tests
- 16.11 Mathematical Details: Formulas at a Glance
- 16.12 Progressive Project: Compare Two Groups Within Your Dataset
- 16.13 Common Mistakes and How to Avoid Them
- 16.14 Chapter Summary
Chapter 16: Comparing Two Groups
"The purpose of computing is insight, not numbers." — Richard Hamming
Chapter Overview
Here's the question that drives most research in the world: is there a difference between these two groups?
Not "what's the average?" Not "is this number different from a benchmark?" Those are useful questions — and you've spent the last three chapters answering them. But the question that fills research journals, shapes medical practice, drives business decisions, and determines public policy is almost always a comparison: Is the new drug better than the old one? Do men and women earn different salaries? Did the policy change affect outcomes? Is the algorithm biased against one group compared to another?
Alex Rivera has been waiting for this chapter since Chapter 1. StreamVibe ran an A/B test — randomly assigning users to the old recommendation algorithm or a new one — and Alex needs to know: did the new algorithm actually increase watch time? That's a two-group comparison. The old algorithm is one group. The new algorithm is the other. Same question, different lens.
Dr. Maya Chen wants to compare disease rates between two communities — an industrial neighborhood and a suburban control community. Same disease, two populations. Is the difference real, or could it be explained by random variation?
Sam Okafor has an interesting wrinkle. He wants to compare Daria's shooting performance this season versus last season. But here's the thing: it's the same player. The two "groups" aren't independent — they're matched by the person doing the shooting. That changes everything about how we analyze the data.
And Professor James Washington wants to compare recidivism rates between defendants whose bail was set by an algorithm versus those whose bail was set by a judge. Two groups, one categorical outcome (re-arrested or not), and enormous consequences for the people involved.
Every single one of these questions requires a different variant of the same fundamental idea: measure the difference between two groups, then ask whether that difference is larger than what random variation alone could produce.
In this chapter, you will learn to:
- Conduct and interpret a two-sample t-test for independent groups
- Conduct and interpret a paired t-test for dependent samples
- Compare two proportions using a two-proportion z-test
- Construct confidence intervals for the difference between two groups
- Choose the correct test based on study design and data type
Fast Track: If you've done two-sample tests before, skim Sections 16.1–16.3, then jump to Section 16.8 (choosing the right test). Complete quiz questions 1, 10, and 18 to verify.
Deep Dive: After this chapter, read Case Study 1 (Alex's A/B test — the full analysis) for a complete tech industry application, then Case Study 2 (James's algorithmic bail study) for a deep look at how two-group comparisons reveal algorithmic bias. Both include full worked solutions.
16.1 A Puzzle Before We Start (Productive Struggle)
Before we jump into formulas, try this thought experiment.
The Training Program
A company wants to test whether a new employee training program improves customer satisfaction scores. They try two study designs:
Design A: Take 50 employees who went through the new training and 50 employees who went through the old training. Compare their average customer satisfaction scores.
Design B: Measure 50 employees' customer satisfaction scores before the new training, then measure the same 50 employees again after the new training. Compare the before and after scores.
(a) Both designs compare two groups. What's fundamentally different about them?
(b) In Design A, Employee #1 in the new-training group might naturally be more charismatic than Employee #1 in the old-training group. How does this affect the comparison?
(c) In Design B, you're comparing each employee to themselves. Why might this be more powerful?
(d) Here's the twist: in Design B, if the company announced the new training program with great fanfare and employees knew they were being evaluated, could the improvement be due to something other than the training itself?
Take 3 minutes. Part (c) is the key insight for this chapter.
Here's what I hope you noticed:
For part (a), the fundamental difference is independence. In Design A, the two groups are separate people — the new-training group and the old-training group have no overlap. In Design B, the two "groups" are the same people measured twice. Each data point in the "before" group has a natural partner in the "after" group.
Part (b) gets at a crucial problem with Design A. Different employees have different baseline abilities. Some are naturally charming, others are more reserved. These person-to-person differences create noise that makes it harder to detect the training effect. If the new-training group happens to include more charismatic employees, the comparison is confounded.
Part (c) reveals the power of paired designs. When you compare each employee to themselves, you eliminate all the person-to-person variability. You don't care whether Employee #7 is naturally better than Employee #23 — you only care whether Employee #7 improved relative to their own baseline. This dramatically reduces noise and often makes real effects easier to detect.
And part (d) is a healthy reminder from Chapter 4: study design matters. The improvement in Design B could reflect a Hawthorne effect (people perform better when they know they're being watched) or a practice effect (scores improve just from repeating the evaluation). The paired design controls for person-level variability but doesn't automatically guarantee a causal interpretation.
You've just identified the core ideas of this chapter: independent vs. paired comparisons, the tradeoffs between them, and the importance of study design in interpreting results. Now let's formalize them.
16.2 The Big Picture: From One Group to Two
🔄 Spaced Review 1 (from Ch.13): The Hypothesis Testing Framework
In Chapter 13, you learned the five-step procedure for hypothesis testing:
- State $H_0$ and $H_a$
- Check conditions
- Compute the test statistic
- Find the p-value
- Conclude in context
Every test in this chapter follows exactly the same framework. The only thing that changes is what goes into the test statistic. In Chapter 13, the test statistic measured how far a sample statistic was from a single hypothesized value. Now, the test statistic will measure how far the difference between two groups is from zero (or from some other hypothesized difference). Same logic. Bigger question.
In Chapters 14 and 15, you tested claims about a single population parameter:
- Chapter 14: Is the population proportion $p$ equal to $p_0$?
- Chapter 15: Is the population mean $\mu$ equal to $\mu_0$?
Now we're asking a fundamentally different question: is there a difference between two populations? The parameter of interest shifts from a single value to a difference:
- Is $\mu_1 - \mu_2 = 0$? (difference in means)
- Is $p_1 - p_2 = 0$? (difference in proportions)
The general form of the test statistic stays the same:
$$\text{test statistic} = \frac{\text{observed difference} - \text{hypothesized difference}}{\text{standard error of the difference}}$$
Usually, the hypothesized difference is zero (no difference between groups), so this simplifies to:
$$\text{test statistic} = \frac{\text{observed difference}}{\text{standard error of the difference}}$$
The challenge — and the subject of this chapter — is figuring out what "standard error of the difference" means in each situation. It depends on:
- What you're comparing: means or proportions?
- How the data are structured: independent groups or paired observations?
This gives us three main scenarios:
| Scenario | Data Type | Structure | Test |
|---|---|---|---|
| Two independent groups, numerical data | Means | Independent | Two-sample t-test |
| Same subjects measured twice, numerical data | Means | Paired | Paired t-test |
| Two independent groups, categorical data | Proportions | Independent | Two-proportion z-test |
Let's tackle each one.
16.3 The Two-Sample t-Test for Independent Groups
When to Use It
Use the two-sample t-test (also called the independent-samples t-test) when you want to compare the means of two separate, unrelated groups. The key word is independent — knowing the value for one observation in Group 1 tells you nothing about any observation in Group 2.
Concept 1: Independent Samples
Two samples are independent when the individuals in one sample are completely unrelated to the individuals in the other sample. Random assignment to two treatment groups creates independent samples. Comparing men vs. women, or treatment vs. control, or new algorithm vs. old algorithm — all independent samples (assuming no matching or pairing). The observations in one group do not constrain or determine the observations in the other.
Examples of independent samples:
- Patients randomly assigned to a drug group vs. a placebo group
- Students at School A vs. students at School B
- Users who see Algorithm A vs. users who see Algorithm B (Alex's A/B test!)
- Crime outcomes under algorithm-based bail vs. judge-based bail (James's study!)
The Hypotheses
For comparing two population means $\mu_1$ and $\mu_2$:
| Test Type | $H_0$ | $H_a$ |
|---|---|---|
| Two-tailed | $\mu_1 - \mu_2 = 0$ | $\mu_1 - \mu_2 \neq 0$ |
| Right-tailed | $\mu_1 - \mu_2 = 0$ | $\mu_1 - \mu_2 > 0$ |
| Left-tailed | $\mu_1 - \mu_2 = 0$ | $\mu_1 - \mu_2 < 0$ |
Or equivalently: $H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 \neq \mu_2$ (or $>$ or $<$).
The Standard Error of the Difference
🔄 Spaced Review 2 (from Ch.11): Standard Error — Now for Differences
In Chapter 11, you learned that the standard error measures how much a statistic varies from sample to sample. For a single sample mean: $SE_{\bar{x}} = \sigma / \sqrt{n}$, estimated by $s / \sqrt{n}$.
Now we need the standard error of the difference between two sample means. Here's the beautiful mathematical fact: when two random variables are independent, the variance of their difference equals the sum of their variances.
$$\text{Var}(\bar{X}_1 - \bar{X}_2) = \text{Var}(\bar{X}_1) + \text{Var}(\bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
This is why the standard error adds the two variances (not the standard deviations). The standard error of the difference is larger than the standard error of either group alone — which makes sense. When you compare two groups, there are two sources of sampling variability instead of one.
The standard error of the difference in means is:
$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
This is the standard error of the difference for the two-sample t-test. Notice that it combines the variability from both groups.
Key Term: Standard Error of the Difference
The standard error of the difference combines the sampling variability from both groups into a single measure of how much the difference $\bar{x}_1 - \bar{x}_2$ is expected to vary from sample to sample. For independent samples: $SE = \sqrt{s_1^2/n_1 + s_2^2/n_2}$. (This is the unpooled form. The word "pooled" is reserved for the classic equal-variance t-test, which combines both samples' variances into a single estimate — more on that distinction below.)
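If the variance-addition fact feels abstract, a short simulation makes it concrete. The sketch below (illustrative only — all population parameters are arbitrary demo values, not from the chapter) repeats the two-sample experiment many times and compares the empirical variance of $\bar{x}_1 - \bar{x}_2$ to $\sigma_1^2/n_1 + \sigma_2^2/n_2$:

```python
import numpy as np

# Simulation sketch: for independent samples,
# Var(xbar1 - xbar2) should equal sigma1^2/n1 + sigma2^2/n2.
# All population parameters here are arbitrary choices for the demo.
rng = np.random.default_rng(42)
sigma1, sigma2 = 8.0, 12.0
n1, n2 = 40, 60
reps = 50_000

# Many replicate experiments: draw both samples, record the difference in means
xbar1 = rng.normal(50, sigma1, size=(reps, n1)).mean(axis=1)
xbar2 = rng.normal(50, sigma2, size=(reps, n2)).mean(axis=1)
diffs = xbar1 - xbar2

empirical_var = diffs.var()
theoretical_var = sigma1**2 / n1 + sigma2**2 / n2  # 64/40 + 144/60 = 4.0

print(empirical_var, theoretical_var)
```

The two printed values should agree closely — the variances add, even though the samples are subtracted.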
The Test Statistic
$$\boxed{t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}}$$
In plain English: The t-statistic measures how many standard errors the observed difference in sample means is from zero (no difference). A large t-value means the groups differ by more than we'd expect from random variation alone.
Welch's t-Test: The Default Choice
The formula above is called Welch's t-test (also called the unequal-variances t-test). It does not assume that the two populations have equal variances. This is important because in real data, groups usually don't have equal variances — and incorrectly assuming they do can give misleading results.
The degrees of freedom for Welch's t-test are calculated using the Welch-Satterthwaite approximation:
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$
Don't panic. You will never compute this by hand. Python and Excel handle it automatically. The formula exists so you know what's happening under the hood.
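If you're curious what "under the hood" looks like, the formula is a one-liner. A sketch (the summary statistics in the example call are made-up numbers for illustration):

```python
def welch_df(s1: float, n1: int, s2: float, n2: int) -> float:
    """Welch-Satterthwaite degrees of freedom for a two-sample t-test."""
    v1 = s1**2 / n1  # group 1's contribution to the variance of the difference
    v2 = s2**2 / n2  # group 2's contribution
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Made-up summary statistics for illustration
print(welch_df(s1=10.0, n1=40, s2=15.0, n2=35))  # roughly 58
```

A nice sanity check: when the two groups have equal variances and sizes, the formula collapses to the familiar $n_1 + n_2 - 2$.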
Why Welch's, not Student's?
The "classic" two-sample t-test (sometimes called the equal-variances or pooled t-test) assumes $\sigma_1 = \sigma_2$ and pools the two sample variances into one estimate. This was useful when computation was expensive, but modern research strongly recommends Welch's version as the default:
- Welch's test gives correct results whether or not variances are equal
- The classic test can give inflated Type I error rates when variances are unequal
- When variances are equal, Welch's test gives nearly identical results to the classic test
Bottom line: Use Welch's t-test by default. In Python, `scipy.stats.ttest_ind()` runs Welch's test when you pass `equal_var=False` (note that its default is the pooled test, so set this argument explicitly), and Excel's `T.TEST` offers it as Type 3. There's no good reason to assume equal variances unless you have strong prior justification.
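Here's a quick sketch of the API on simulated data (all parameters are arbitrary demo values), running both versions side by side:

```python
import numpy as np
from scipy import stats

# Simulated groups with unequal spreads (arbitrary demo parameters)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=15, size=35)

# Welch's t-test: pass equal_var=False explicitly
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Classic pooled t-test (assumes equal variances), for comparison
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

print(welch.statistic, welch.pvalue)
print(pooled.statistic, pooled.pvalue)
```

With unequal spreads and unequal sample sizes, the two versions give different p-values — which is exactly why the choice matters.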
Conditions for the Two-Sample t-Test
The conditions mirror the one-sample t-test from Chapter 15, applied to both groups:
| Condition | What to Check |
|---|---|
| 1. Independence (between groups) | The two samples are independent of each other; random assignment or separate populations |
| 2. Independence (within groups) | Observations within each group are independent; 10% condition for sampling without replacement |
| 3. Normality | Sampling distribution of $\bar{x}_1 - \bar{x}_2$ is approximately normal; same guidelines as Ch.15: each group needs $n \geq 30$, or approximate normality in each group |
Normality guidelines for two-sample t-tests:
| Group Sizes | Requirement |
|---|---|
| Both $n_1, n_2 \geq 30$ | CLT handles most shapes in both groups |
| Either $n_i < 30$ | Check that group for approximate normality (histogram, QQ-plot) |
| Both $n_i < 15$ | Both groups need to be approximately normal |
Complete Worked Example: Alex's A/B Test
This is the moment Alex has been waiting for since Chapter 1. StreamVibe randomly assigned users to one of two recommendation algorithms and measured average watch time per session.
The Data:
| | Old Algorithm (Control) | New Algorithm (Treatment) |
|---|---|---|
| Sample size | $n_1 = 247$ | $n_2 = 253$ |
| Sample mean | $\bar{x}_1 = 42.3$ min | $\bar{x}_2 = 46.8$ min |
| Sample SD | $s_1 = 18.5$ min | $s_2 = 21.2$ min |
The observed difference is $\bar{x}_2 - \bar{x}_1 = 46.8 - 42.3 = 4.5$ minutes. Is this difference real, or could it be explained by chance?
Step 1: State the Hypotheses
Alex wants to know whether the new algorithm performs differently from the old one (in either direction), so this is two-tailed:
$$H_0: \mu_{\text{new}} - \mu_{\text{old}} = 0$$ $$H_a: \mu_{\text{new}} - \mu_{\text{old}} \neq 0$$
(Alex could justify a one-tailed test — "does the new algorithm increase watch time?" — but the two-tailed approach is more conservative and catches unexpected decreases too.)
Step 2: Check the Conditions
- Independence between groups: Users were randomly assigned to algorithms. The two groups are independent. ✓
- Independence within groups: Individual viewing sessions are independent (each user counted once). Both samples are less than 10% of all StreamVibe users. ✓
- Normality: Both groups have $n > 30$ (247 and 253). By the CLT, the sampling distribution of the difference in means is approximately normal, even though individual watch times are likely right-skewed. ✓
All conditions met.
Step 3: Compute the Test Statistic
$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{18.5^2}{247} + \frac{21.2^2}{253}} = \sqrt{\frac{342.25}{247} + \frac{449.44}{253}}$$
$$SE = \sqrt{1.3856 + 1.7763} = \sqrt{3.1619} = 1.778$$
$$t = \frac{46.8 - 42.3}{1.778} = \frac{4.5}{1.778} = 2.530$$
Step 4: Find the P-Value
Using the Welch-Satterthwaite degrees of freedom (which Python computes automatically, approximately $df \approx 491$):
$$p\text{-value} = 2 \times P(T_{491} \geq 2.530) \approx 0.012$$
Step 5: Conclude in Context
At $\alpha = 0.05$: Since $p = 0.012 < 0.05$, we reject $H_0$.
Conclusion: There is statistically significant evidence that the average watch time differs between the two algorithms ($t = 2.53$, $p = 0.012$). Users assigned to the new algorithm watched an average of 4.5 minutes longer per session than users assigned to the old algorithm.
The confidence interval: A 95% CI for $\mu_{\text{new}} - \mu_{\text{old}}$:
$$(\bar{x}_2 - \bar{x}_1) \pm t^* \cdot SE = 4.5 \pm 1.965 \times 1.778 = 4.5 \pm 3.49$$
$$95\% \text{ CI: } (1.01, 7.99) \text{ minutes}$$
The CI tells us that the true difference is plausibly between about 1 minute and 8 minutes of additional watch time. Notice that zero is not in this interval — consistent with rejecting $H_0$.
Alex's Reaction: "Four and a half minutes more per session! That sounds small, but StreamVibe has 12 million active users. If each user watches an average of 3 sessions per day, that's 162 million additional minutes of watch time per day. At our average ad revenue rate, that translates to roughly $1.8 million per month in incremental revenue. The algorithm change is worth it."
This is the kind of practical significance that matters in business — and it only became visible because Alex used a proper two-sample test on randomized data. This is the A/B testing thread from Chapter 1, Chapter 4, and Chapter 15 — now fully resolved.
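The hand calculation above can be checked from the summary statistics alone — `scipy.stats.ttest_ind_from_stats` accepts means, SDs, and sample sizes directly. A verification sketch using the numbers from the table:

```python
from scipy import stats

# Summary statistics from Alex's A/B test
result = stats.ttest_ind_from_stats(
    mean1=46.8, std1=21.2, nobs1=253,  # new algorithm
    mean2=42.3, std2=18.5, nobs2=247,  # old algorithm
    equal_var=False,                   # Welch's t-test
)

print(result.statistic)  # about 2.53
print(result.pvalue)     # about 0.012
```

This matches the worked solution: $t \approx 2.53$, $p \approx 0.012$.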
🔄 Spaced Review 3 (from Ch.4): Experimental vs. Observational Design — Why Alex Can Say "Caused"
In Chapter 4, you learned the critical distinction between experiments (where the researcher controls the treatment) and observational studies (where the researcher merely observes). Only randomized experiments support causal conclusions.
Alex's A/B test is a randomized experiment: users were randomly assigned to algorithms. This means the statistically significant difference can be interpreted causally — the new algorithm caused the increase in watch time, because randomization balanced all other variables (device type, time of day, user preferences) between the two groups.
If Alex had instead compared users who chose the new algorithm to those who stayed with the old one, it would be an observational study. Users who actively switch algorithms might be more engaged viewers generally, creating a confound. Same statistical test, same p-value, very different conclusion.
The test tells you whether the difference is real. The study design tells you whether you can call it causal.
16.4 The Paired t-Test: When Data Come in Pairs
When to Use It
Use the paired t-test when your observations come in natural pairs. This happens when:
- Before-and-after designs: The same subjects are measured twice (e.g., blood pressure before and after medication)
- Matched pairs: Subjects are matched on key characteristics (e.g., twins, siblings, or participants matched on age and gender)
- Repeated measures: The same experimental units are tested under two conditions (e.g., left eye vs. right eye, morning vs. evening)
Concept 2: Dependent Samples
Two samples are dependent (also called paired or matched) when each observation in one sample has a natural partner in the other sample. The value of one observation is related to the value of its partner — not because of the treatment, but because of an underlying connection (same person, same location, same time period). Dependent samples require a different analysis than independent samples because the two measurements are correlated.
Key Term: Matched Pairs
In a matched pairs design, each observation in one group is linked to a specific observation in the other group. The pairing creates a natural one-to-one correspondence. Examples: the same student's test score before and after tutoring, the same city's crime rate before and after a policy change, or the same product's sales in two different seasons.
The Brilliant Insight Behind the Paired t-Test
Here's the key idea: a paired t-test is just a one-sample t-test on the differences.
Instead of comparing two groups directly, you:
1. Compute the difference $d_i = x_{i,\text{after}} - x_{i,\text{before}}$ for each pair
2. Treat those differences as a single sample
3. Test whether the mean difference $\mu_d$ equals zero using the one-sample t-test from Chapter 15
That's it. You already know how to do this. The paired t-test isn't a new procedure — it's the one-sample t-test applied to a specific kind of data.
The Formula
For $n$ pairs with differences $d_1, d_2, \ldots, d_n$:
$$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i \qquad s_d = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2}$$
$$\boxed{t = \frac{\bar{d} - 0}{s_d / \sqrt{n}} = \frac{\bar{d}}{s_d / \sqrt{n}}}$$
with $df = n - 1$ (where $n$ is the number of pairs, not the total number of observations).
In plain English: Compute each within-pair difference. Then test whether the average difference is significantly different from zero. If it is, the treatment had an effect.
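The formulas translate directly into code. A minimal sketch from scratch (the eight differences below are made-up numbers for illustration, not data from the chapter):

```python
import math

# Made-up within-pair differences for illustration
d = [2, -1, 3, 4, 0, 2, 1, 3]
n = len(d)

d_bar = d_sum = sum(d) / n                                   # mean difference
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))  # SD of differences
t = d_bar / (s_d / math.sqrt(n))                             # paired t-statistic
df = n - 1                                                   # degrees of freedom

print(d_bar, s_d, t, df)  # 1.75, about 1.67, about 2.97, 7
```

Note that $n$ counts pairs, so the degrees of freedom here are 7 even though 16 measurements were taken.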
The Hypotheses
| Test Type | $H_0$ | $H_a$ |
|---|---|---|
| Two-tailed | $\mu_d = 0$ | $\mu_d \neq 0$ |
| Right-tailed | $\mu_d = 0$ | $\mu_d > 0$ |
| Left-tailed | $\mu_d = 0$ | $\mu_d < 0$ |
Why Paired Tests Are Often More Powerful
The paired t-test eliminates between-subject variability. In Alex's A/B test (independent samples), part of the variation in watch time comes from the fact that different people have different viewing habits — some binge-watch, others check in for 10 minutes. That person-to-person variation is noise that makes the treatment effect harder to detect.
In a paired design, each subject serves as their own control. The differences capture only the within-person change, stripping away the between-person noise. The result: the standard deviation of the differences ($s_d$) is often much smaller than the individual standard deviations ($s_1$ or $s_2$), which means a smaller standard error, which means a more powerful test.
The tradeoff: Paired designs only work when natural pairing exists. You can't pair data retroactively if subjects weren't matched or measured twice.
Conditions for the Paired t-Test
| Condition | What to Check |
|---|---|
| 1. Paired data | Each observation in one group has a natural partner in the other group |
| 2. Random | The pairs are randomly selected from the population, or subjects are randomly assigned to the order of treatments |
| 3. Independence of pairs | The differences $d_i$ are independent of each other (one pair's difference doesn't influence another's) |
| 4. Normality of differences | The distribution of the differences $d_i$ is approximately normal; same guidelines: $n \geq 30$ pairs or check for normality |
Notice: we check normality of the differences, not of the original measurements. The differences might be approximately normal even when the original data aren't.
Complete Worked Example: Sam's Shooting Comparison
Sam Okafor wants to compare Daria Williams's shooting performance this season versus last season. He has data from 12 games where he can match games by opponent — Daria played each of these 12 opponents in both seasons, so he can create natural pairs.
The Data:
| Opponent | Last Season (pts) | This Season (pts) | Difference ($d_i$) |
|---|---|---|---|
| Hawks | 18 | 22 | +4 |
| Wolves | 24 | 28 | +4 |
| Panthers | 15 | 19 | +4 |
| Eagles | 22 | 20 | −2 |
| Bears | 20 | 25 | +5 |
| Falcons | 16 | 21 | +5 |
| Tigers | 28 | 26 | −2 |
| Lions | 19 | 24 | +5 |
| Sharks | 21 | 23 | +2 |
| Cobras | 17 | 22 | +5 |
| Stallions | 23 | 27 | +4 |
| Vipers | 14 | 18 | +4 |
The differences: 4, 4, 4, −2, 5, 5, −2, 5, 2, 5, 4, 4
Summary statistics for the differences: $$n = 12, \quad \bar{d} = 3.17, \quad s_d = 2.55$$
Step 1: State the Hypotheses
Sam wants to know if Daria's scoring has improved (increased), so this is one-tailed:
$$H_0: \mu_d = 0 \quad (\text{no change in scoring})$$ $$H_a: \mu_d > 0 \quad (\text{scoring has improved})$$
Step 2: Check the Conditions
- Paired data: Each game is matched by opponent across seasons. ✓
- Random: The 12 opponents represent a convenience sample (not randomly selected), but they're a reasonable representation of typical opponents. ✓ (with caveat)
- Independence of pairs: Each game is independent of other games. ✓
- Normality of differences: With $n = 12$ (in the $< 15$ range), we need approximate normality. The differences range from −2 to +5 with no extreme outliers. A histogram would show rough symmetry. ✓ (marginally)
Step 3: Compute the Test Statistic
$$SE_d = \frac{s_d}{\sqrt{n}} = \frac{2.55}{\sqrt{12}} = \frac{2.55}{3.464} = 0.736$$
$$t = \frac{\bar{d} - 0}{SE_d} = \frac{3.17}{0.736} = 4.31$$
Step 4: Find the P-Value
Using the t-distribution with $df = 11$:
$$p\text{-value} = P(T_{11} \geq 4.31) \approx 0.0006$$
Step 5: Conclude in Context
At $\alpha = 0.05$: Since $p = 0.0006 < 0.05$, we reject $H_0$.
Conclusion: There is strong statistical evidence that Daria's scoring has improved this season compared to last season ($t = 4.31$, $p < 0.001$). When matched by opponent, Daria scored an average of 3.17 points more per game this season than last season.
95% CI for the mean difference:
$$\bar{d} \pm t^*_{11} \cdot SE_d = 3.17 \pm 2.201 \times 0.736 = 3.17 \pm 1.62$$
$$95\% \text{ CI: } (1.55, 4.79) \text{ points per game}$$
Sam's Insight: "So the improvement is real — somewhere between 1.5 and 5 points per game when we control for opponent difficulty. But here's the thing I almost got wrong: I was about to run a two-sample t-test, treating each season's scores as independent samples. That would have been the wrong test! The same player against the same opponents — that's paired data. And look at how strong the result is when we use the right test."
Why the paired test worked better: If Sam had treated the data as independent samples, the two-sample standard error would have included all the game-to-game variability (some opponents are tougher, some games are blowouts). The paired design eliminated that opponent-level variability by comparing Daria to herself, game by game. The differences ($s_d = 2.55$) are much less variable than the raw scores ($s_1 \approx 4.1$, $s_2 \approx 3.2$), giving a more powerful test.
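Sam's analysis can be reproduced in a few lines, and the "paired test = one-sample test on the differences" insight can be verified directly. A sketch using the game data from the table (the `alternative` argument assumes scipy ≥ 1.6):

```python
from scipy import stats

# Daria's points per game, matched by opponent (from the table above)
last_season = [18, 24, 15, 22, 20, 16, 28, 19, 21, 17, 23, 14]
this_season = [22, 28, 19, 20, 25, 21, 26, 24, 23, 22, 27, 18]

# Paired t-test, one-tailed: has scoring improved?
paired = stats.ttest_rel(this_season, last_season, alternative='greater')

# The same test, run as a one-sample t-test on the within-pair differences
diffs = [t - l for t, l in zip(this_season, last_season)]
one_sample = stats.ttest_1samp(diffs, popmean=0, alternative='greater')

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)  # identical to the paired test
```

The two calls print identical results — the paired t-test really is just the Chapter 15 one-sample test applied to the differences.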
16.5 Paired vs. Independent: The Critical Choice
⚠️ Common Mistake Alert
The #1 mistake in two-group comparisons is using the wrong test — treating paired data as independent, or treating independent data as paired. This isn't just a technical error. It can dramatically change your results.
- Using an independent test on paired data ignores the pairing and includes unnecessary noise. You lose power and might miss a real effect.
- Using a paired test on independent data is even worse. The "differences" you compute are meaningless because there's no natural pairing, and your results could go in either direction — too liberal or too conservative.
The test you use must match the study design, not the format of the data.
The Decision Rule
Ask yourself: does each observation in one group have a specific, natural partner in the other group?
| If the answer is... | Then use... | Because... |
|---|---|---|
| Yes — same person, same location, matched pair | Paired t-test | The pairing captures within-pair change |
| No — different people, different units, no matching | Two-sample t-test | The groups are independent |
Sam's Almost-Mistake: A Cautionary Tale
Let's see what would have happened if Sam had used the wrong test on Daria's data.
Correct analysis (paired t-test):
- $\bar{d} = 3.17$, $SE_d = 0.736$, $t = 4.31$, $p = 0.0006$
- Conclusion: Strong evidence of improvement. ✓
Incorrect analysis (independent two-sample t-test):
- $\bar{x}_{\text{this}} = 22.92$, $s_{\text{this}} = 3.18$, $n_{\text{this}} = 12$
- $\bar{x}_{\text{last}} = 19.75$, $s_{\text{last}} = 4.09$, $n_{\text{last}} = 12$
- $SE_{\text{ind}} = \sqrt{3.18^2/12 + 4.09^2/12} = \sqrt{0.843 + 1.394} = 1.496$
- $t_{\text{ind}} = (22.92 - 19.75) / 1.496 = 2.12$, $p \approx 0.046$
The paired test gives $p = 0.0006$. The independent test gives $p = 0.046$. Both reject $H_0$ at $\alpha = 0.05$ in this case, but the paired test produces much stronger evidence — because it properly accounts for the pairing. With slightly noisier data, the independent test might fail to reject while the paired test still would.
Key takeaway: When paired data exist, always use the paired test. You're leaving statistical power on the table if you don't.
Quick Reference: Is It Paired?
| Scenario | Paired? | Why? |
|---|---|---|
| Blood pressure before and after medication | Yes | Same patients measured twice |
| Men's vs. women's salaries at a company | No | Different people |
| Left eye vs. right eye measurements | Yes | Same patients, natural pairing |
| Test scores of tutored vs. untutored students | No (usually) | Different students, unless matched |
| A restaurant's ratings on Yelp vs. Google | Yes | Same restaurant on two platforms |
| Crime rates in 2020 vs. 2024 across 50 states | Yes | Same states measured in both years |
| Satisfaction scores for Product A vs. Product B | Depends | Paired if same people rated both; independent if different people rated each |
16.6 The Two-Proportion z-Test
When to Use It
Use the two-proportion z-test when you want to compare proportions from two independent groups. The outcome variable is categorical (success/failure), and you're asking: is the proportion of successes different in the two groups?
Concept 3: Difference in Proportions
The difference in proportions $\hat{p}_1 - \hat{p}_2$ estimates the true difference $p_1 - p_2$ between two population proportions. Just as with means, we test whether this observed difference is large enough to be statistically significant — too large to be explained by random variation alone.
Examples:
- Is the cure rate higher for Drug A than Drug B? (Maya's world)
- Is the click-through rate different for two website designs? (Alex's world)
- Is the recidivism rate different for algorithm-recommended vs. judge-recommended bail decisions? (James's world!)
The Hypotheses
| Test Type | $H_0$ | $H_a$ |
|---|---|---|
| Two-tailed | $p_1 - p_2 = 0$ | $p_1 - p_2 \neq 0$ |
| Right-tailed | $p_1 - p_2 = 0$ | $p_1 - p_2 > 0$ |
| Left-tailed | $p_1 - p_2 = 0$ | $p_1 - p_2 < 0$ |
The Pooled Proportion
Under $H_0$, we assume $p_1 = p_2$ — the two groups have the same underlying proportion. So we estimate this common proportion by pooling the data from both groups:
$$\hat{p}_{\text{pooled}} = \frac{X_1 + X_2}{n_1 + n_2}$$
where $X_1$ and $X_2$ are the number of successes in each group.
The Standard Error (Under $H_0$)
$$SE = \sqrt{\hat{p}_{\text{pooled}}(1 - \hat{p}_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
The Test Statistic
$$\boxed{z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{\text{pooled}}(1 - \hat{p}_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}}$$
Why z, not t? When comparing proportions, we use the z-distribution (standard normal) rather than the t-distribution. This is because the standard error formula for proportions is derived directly from the binomial distribution's variance, and the CLT gives us normality when the success-failure condition is met. There's no "$s$ estimating $\sigma$" issue that requires the t-distribution's heavier tails.
Conditions for the Two-Proportion z-Test
| Condition | What to Check |
|---|---|
| 1. Independence (between groups) | The two samples are independent |
| 2. Independence (within groups) | Observations within each group are independent; 10% condition |
| 3. Success-failure condition | Each group needs at least 10 successes AND 10 failures: $n_1\hat{p}_{\text{pooled}} \geq 10$, $n_1(1-\hat{p}_{\text{pooled}}) \geq 10$, and similarly for group 2 |
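The success-failure counts can be checked mechanically. Here's a minimal sketch (the helper name is our own, not a library function), using the counts from James's worked example below:

```python
def success_failure_ok(x1, n1, x2, n2, threshold=10):
    """Check that n * p_pooled and n * (1 - p_pooled) meet the threshold in both groups."""
    p_pool = (x1 + x2) / (n1 + n2)
    counts = [n1 * p_pool, n1 * (1 - p_pool),
              n2 * p_pool, n2 * (1 - p_pool)]
    return all(c >= threshold for c in counts), counts

# James's bail data: 89 successes of 412, and 107 of 388
ok, counts = success_failure_ok(89, 412, 107, 388)
print(ok, [round(c, 1) for c in counts])  # True [100.9, 311.1, 95.1, 292.9]
```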
Confidence Interval for the Difference in Proportions
For the CI, we use the unpooled standard error (because we're not assuming $p_1 = p_2$):
$$SE_{\text{CI}} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
$$(\hat{p}_1 - \hat{p}_2) \pm z^* \cdot SE_{\text{CI}}$$
Note the subtle difference: The test uses the pooled proportion (because we assume $H_0: p_1 = p_2$). The confidence interval uses the unpooled standard error (because we're estimating what $p_1 - p_2$ actually is, without assuming they're equal). This is analogous to how in Chapter 14, the test used $p_0$ in the SE while the CI used $\hat{p}$.
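The pooled-versus-unpooled distinction is easy to make concrete in code. The sketch below (the function name is our own; only `scipy.stats.norm` is assumed) computes the test with the pooled SE and the CI with the unpooled SE, using James's counts from the worked example that follows:

```python
import numpy as np
from scipy import stats

def two_prop_test_and_ci(x1, n1, x2, n2, conf=0.95):
    """Two-proportion z-test (pooled SE) plus CI for p1 - p2 (unpooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    # Test: pool the successes, because H0 assumes p1 = p2
    p_pool = (x1 + x2) / (n1 + n2)
    se_test = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_test
    p_value = 2 * stats.norm.sf(abs(z))
    # CI: unpooled SE, because we estimate p1 - p2 without assuming equality
    se_ci = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = stats.norm.ppf(1 - (1 - conf) / 2)
    diff = p1 - p2
    return z, p_value, (diff - z_star * se_ci, diff + z_star * se_ci)

# James's bail data: 89/412 vs. 107/388
z, p, ci = two_prop_test_and_ci(89, 412, 107, 388)
print(f"z = {z:.3f}, p = {p:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

With unrounded proportions this gives z ≈ −1.96 and a CI that just barely excludes zero, matching the worked example up to rounding.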
Complete Worked Example: James's Algorithmic Bail Study
Professor James Washington has obtained data comparing recidivism outcomes for two groups of defendants released on bail:
- Group 1 (Algorithm-recommended): An algorithmic risk assessment tool recommended bail.
- Group 2 (Judge-recommended): A human judge determined bail without the algorithm.
The question: is the recidivism rate different between the two groups?
The Data:
| | Algorithm-Recommended | Judge-Recommended |
|---|---|---|
| Total defendants | $n_1 = 412$ | $n_2 = 388$ |
| Re-arrested within 2 years | $X_1 = 89$ | $X_2 = 107$ |
| Recidivism rate | $\hat{p}_1 = 89/412 = 0.216$ | $\hat{p}_2 = 107/388 = 0.276$ |
The observed difference: $\hat{p}_1 - \hat{p}_2 = 0.216 - 0.276 = -0.060$. The algorithm group has a recidivism rate 6.0 percentage points lower than the judge group.
Step 1: State the Hypotheses
James uses a two-tailed test (the algorithm might perform better or worse than judges):
$$H_0: p_{\text{alg}} - p_{\text{judge}} = 0$$ $$H_a: p_{\text{alg}} - p_{\text{judge}} \neq 0$$
Step 2: Check the Conditions
- Independence between groups: Defendants were assigned to algorithm-recommended or judge-recommended bail through a quasi-experimental design (different courtrooms in different months). The groups are effectively independent. ✓
- Independence within groups: Individual recidivism outcomes are independent. ✓
- Success-failure condition:
$$\hat{p}_{\text{pooled}} = \frac{89 + 107}{412 + 388} = \frac{196}{800} = 0.245$$
- Group 1: $412 \times 0.245 = 100.9 \geq 10$ ✓ and $412 \times 0.755 = 311.1 \geq 10$ ✓
- Group 2: $388 \times 0.245 = 95.1 \geq 10$ ✓ and $388 \times 0.755 = 292.9 \geq 10$ ✓
All conditions met.
Step 3: Compute the Test Statistic
$$SE = \sqrt{0.245 \times 0.755 \times \left(\frac{1}{412} + \frac{1}{388}\right)}$$
$$SE = \sqrt{0.18498 \times (0.002427 + 0.002577)} = \sqrt{0.18498 \times 0.005004}$$
$$SE = \sqrt{0.000926} = 0.03043$$
$$z = \frac{0.216 - 0.276}{0.03043} = \frac{-0.060}{0.03043} = -1.972$$
Step 4: Find the P-Value
$$p\text{-value} = 2 \times P(Z \leq -1.972) = 2 \times 0.0243 = 0.0486$$
Step 5: Conclude in Context
At $\alpha = 0.05$: Since $p = 0.049 < 0.05$, we reject $H_0$.
Conclusion: There is statistically significant evidence that recidivism rates differ between algorithm-recommended and judge-recommended bail decisions ($z = -1.97$, $p = 0.049$). The algorithm group had a recidivism rate of 21.6% compared to 27.6% for the judge group — a difference of 6.0 percentage points.
95% CI for the difference in proportions (using unpooled SE):
$$SE_{\text{CI}} = \sqrt{\frac{0.216 \times 0.784}{412} + \frac{0.276 \times 0.724}{388}} = \sqrt{\frac{0.1693}{412} + \frac{0.1998}{388}}$$
$$SE_{\text{CI}} = \sqrt{0.000411 + 0.000515} = \sqrt{0.000926} = 0.03044$$
$$(-0.0598) \pm 1.960 \times 0.03044 = -0.0598 \pm 0.0597$$
$$95\% \text{ CI: } (-0.1194, -0.0001)$$
(We keep an extra decimal place here: rounding the difference to $-0.060$ first would put the upper endpoint exactly at zero.)
The CI just barely excludes zero, consistent with the borderline p-value.
James's Analysis: "The algorithm group has a 6-percentage-point lower recidivism rate. That's statistically significant, but barely — and the confidence interval nearly includes zero. This suggests the algorithm might be slightly better than judges at predicting who will re-offend, but the evidence isn't overwhelming.
More importantly, this overall comparison doesn't tell us whether the algorithm works equally well across racial groups. The next question — the one that really matters for justice — is whether the false positive rate differs for Black defendants versus white defendants. That requires stratified comparisons, which I'll need separate two-proportion tests for each subgroup."
Theme 2 Connection: Comparing Groups Is Where Bias Becomes Visible
Here's something profound about this chapter's methods: comparing two groups is often how we discover that a system treats people unequally.
James's overall comparison (algorithm vs. judge) is informative, but the real power of two-group comparisons comes when you break the data down by demographics and ask: does the algorithm work equally well for everyone? If the false positive rate is 15% for white defendants and 30% for Black defendants, that disparity only becomes visible when you compare two groups.
This is why two-group inference is the methodological foundation of fairness audits, pay equity analyses, and discrimination studies. Every time someone asks "is there a gap?" — a gender wage gap, a racial achievement gap, a health disparity — they're running a two-group comparison. The methods in this chapter are tools for justice as much as tools for science.
16.7 Confidence Intervals for Differences: The Full Picture
Concept 4: Confidence Intervals for Differences
Just as a confidence interval for a single parameter tells you how large the parameter plausibly is, a confidence interval for a difference tells you how large the difference between two groups plausibly is. A CI for $\mu_1 - \mu_2$ or $p_1 - p_2$ that contains zero is consistent with "no difference" — equivalent to failing to reject $H_0$.
Summary of CI Formulas
| Scenario | CI Formula |
|---|---|
| Independent means | $(\bar{x}_1 - \bar{x}_2) \pm t^* \cdot \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$ |
| Paired means | $\bar{d} \pm t^*_{n-1} \cdot \dfrac{s_d}{\sqrt{n}}$ |
| Independent proportions | $(\hat{p}_1 - \hat{p}_2) \pm z^* \cdot \sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ |
Interpreting CIs for Differences
The interpretation follows the same logic as Chapter 12, extended to differences:
- If the CI contains zero: The observed difference is consistent with no difference between groups. We fail to reject $H_0$.
- If the CI is entirely positive: Group 1's value is plausibly greater than Group 2's.
- If the CI is entirely negative: Group 1's value is plausibly less than Group 2's.
- Width of the CI: Narrow CIs indicate precise estimates; wide CIs indicate substantial uncertainty about the true difference.
Always report the CI alongside the test. The test tells you WHETHER there's a difference. The CI tells you HOW BIG the difference might be. Both pieces of information matter.
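The interpretation rules above reduce to a tiny helper. This is a sketch with a name of our own choosing, shown with James's CI from Section 16.6:

```python
def interpret_diff_ci(lower, upper):
    """Interpret a confidence interval for a difference (group 1 minus group 2)."""
    if lower > 0:
        return "entirely positive: Group 1 plausibly greater than Group 2"
    if upper < 0:
        return "entirely negative: Group 1 plausibly less than Group 2"
    return "contains zero: consistent with no difference (fail to reject H0)"

print(interpret_diff_ci(-0.120, -0.001))  # entirely negative: Group 1 plausibly less than Group 2
```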
Maya's Community Comparison
Dr. Maya Chen compares hospital admission rates for respiratory illness between two communities:
- Industrial neighborhood: 847 residents surveyed, 127 reported at least one respiratory-related hospital visit in the past year. $\hat{p}_1 = 127/847 = 0.150$.
- Suburban control community: 792 residents surveyed, 81 reported a respiratory-related hospital visit. $\hat{p}_2 = 81/792 = 0.102$.
The observed difference: $\hat{p}_1 - \hat{p}_2 = 0.150 - 0.102 = 0.048$ (4.8 percentage points higher in the industrial neighborhood).
Two-proportion z-test:
$$\hat{p}_{\text{pooled}} = \frac{127 + 81}{847 + 792} = \frac{208}{1639} = 0.1269$$
$$SE = \sqrt{0.1269 \times 0.8731 \times \left(\frac{1}{847} + \frac{1}{792}\right)} = \sqrt{0.1108 \times 0.002442} = \sqrt{0.000271} = 0.01645$$
$$z = \frac{0.048}{0.01645} = 2.918$$
$$p\text{-value} = 2 \times P(Z \geq 2.918) = 2 \times 0.0018 = 0.0035$$
95% CI for the difference (unpooled SE):
$$SE_{\text{CI}} = \sqrt{\frac{0.150 \times 0.850}{847} + \frac{0.102 \times 0.898}{792}} = \sqrt{0.0001504 + 0.0001156} = 0.01631$$
$$0.048 \pm 1.960 \times 0.01631 = 0.048 \pm 0.032$$
$$95\% \text{ CI: } (0.016, 0.080)$$
Maya's Conclusion: "The respiratory illness rate in the industrial neighborhood is 4.8 percentage points higher than in the suburban community, and this difference is statistically significant ($z = 2.92$, $p = 0.004$). The 95% confidence interval suggests the true difference is between 1.6 and 8.0 percentage points. This is consistent with environmental health literature on the effects of industrial air pollution on respiratory outcomes.
Of course, this is an observational study — I can't conclude that industrial pollution caused the higher rates. There could be confounding factors: income differences, access to healthcare, smoking rates, age distributions. But the statistically significant gap justifies further investigation, including air quality monitoring and a multivariate analysis that controls for these potential confounders."
Theme 5 Connection: Correlation vs. Causation in Group Comparisons
Maya's caution is exactly right. Finding a statistically significant difference between two groups does not automatically mean one group's condition caused the difference. In observational studies, the difference could be driven by confounding variables.
- Alex's A/B test (randomized experiment) → significant difference → causal claim justified
- Maya's community comparison (observational study) → significant difference → association only
- James's bail study (quasi-experiment) → significant difference → cautious causal interpretation
The statistical test tells you the difference is real. The study design determines what kind of "real" it is.
16.8 Choosing the Right Test: A Decision Flowchart
Here's the complete decision process for choosing among the three two-group tests:
Comparing two groups?
          │
          ▼
What type of variable?
      ╱         ╲
Numerical      Categorical
 (means)      (proportions)
    │               │
    ▼               ▼
Are the data    Two-proportion
  paired?       z-test (§16.6)
   ╱   ╲
 Yes    No
  │      │
  ▼      ▼
Paired t-test   Two-sample t-test
   (§16.4)      (Welch's, §16.3)
     │                 │
     ▼                 ▼
Compute d_i      Use raw group
differences      data directly
     │                 │
     ▼                 ▼
One-sample       t = (x̄₁ - x̄₂) / SE(difference)
t-test on d
Quick Decision Table
| Question to Ask | If Yes → | If No → |
|---|---|---|
| Is the outcome numerical (means)? | Go to "Are data paired?" | Use two-proportion z-test |
| Are the data paired (same subjects, matched pairs)? | Use paired t-test | Use two-sample t-test (Welch's) |
| Are sample sizes large ($n_1, n_2 \geq 30$)? | CLT handles normality | Check distributions for normality |
| Are variances equal? | Welch's still works fine | Definitely use Welch's |
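The flowchart's logic fits in a few lines of Python. A minimal sketch (the function name is our own):

```python
def choose_test(outcome, paired=False):
    """Pick the two-group test per the Section 16.8 flowchart.
    outcome: 'numerical' (comparing means) or 'categorical' (comparing proportions)."""
    if outcome == 'categorical':
        return 'two-proportion z-test'
    return 'paired t-test' if paired else "two-sample t-test (Welch's)"

print(choose_test('categorical'))              # two-proportion z-test
print(choose_test('numerical', paired=True))   # paired t-test
print(choose_test('numerical'))                # two-sample t-test (Welch's)
```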
Common Traps
| Trap | What Goes Wrong | How to Avoid It |
|---|---|---|
| Using paired test on independent data | "Differences" are meaningless → invalid results | Ask: "Does each observation have a natural partner?" |
| Using independent test on paired data | Ignores pairing → loses power → might miss real effect | Ask: "Were these the same subjects measured twice?" |
| Running two separate one-sample tests instead of one two-group test | Inflated Type I error, can't properly compare | Always use a single test that directly compares the groups |
| Forgetting to check conditions | Invalid p-values and CIs | Check all conditions before computing |
16.9 Python: Two-Group Tests
Here's how to run all three tests in Python.
Two-Sample t-Test (Independent Groups)
import numpy as np
from scipy import stats
# --- Alex's A/B Test ---
# If you have raw data:
np.random.seed(2026)
old_algo = np.random.normal(loc=42.3, scale=18.5, size=247)
new_algo = np.random.normal(loc=46.8, scale=21.2, size=253)
# Welch's t-test (default: equal_var=False)
t_stat, p_value = stats.ttest_ind(new_algo, old_algo, equal_var=False)
print("=== Alex's A/B Test (Welch's t-test) ===")
print(f"Old algorithm: n={len(old_algo)}, mean={np.mean(old_algo):.2f}, "
f"SD={np.std(old_algo, ddof=1):.2f}")
print(f"New algorithm: n={len(new_algo)}, mean={np.mean(new_algo):.2f}, "
f"SD={np.std(new_algo, ddof=1):.2f}")
print(f"Difference: {np.mean(new_algo) - np.mean(old_algo):.2f} minutes")
print(f"t = {t_stat:.4f}")
print(f"p-value (two-tailed) = {p_value:.4f}")
# For one-tailed (SciPy >= 1.7):
result = stats.ttest_ind(new_algo, old_algo, equal_var=False,
alternative='greater')
print(f"p-value (one-tailed, new > old) = {result.pvalue:.4f}")
# --- Confidence interval for the difference ---
diff = np.mean(new_algo) - np.mean(old_algo)
se = np.sqrt(np.var(old_algo, ddof=1)/len(old_algo)
+ np.var(new_algo, ddof=1)/len(new_algo))
# Use large-sample z* for simplicity, or compute Welch df
z_star = 1.960 # 95% CI
ci_lower = diff - z_star * se
ci_upper = diff + z_star * se
print(f"95% CI for difference: ({ci_lower:.2f}, {ci_upper:.2f})")
Two-Sample t-Test from Summary Statistics
from scipy import stats
import numpy as np
def two_sample_t_from_summary(x1_bar, s1, n1, x2_bar, s2, n2,
alternative='two-sided'):
"""
Welch's two-sample t-test from summary statistics.
Parameters:
-----------
x1_bar, s1, n1 : mean, SD, and size of group 1
x2_bar, s2, n2 : mean, SD, and size of group 2
alternative : 'two-sided', 'greater' (group 1 > group 2), or 'less'
Returns:
--------
dict with t-statistic, p-value, Welch df, and 95% CI
"""
se = np.sqrt(s1**2/n1 + s2**2/n2)
t_stat = (x1_bar - x2_bar) / se
# Welch-Satterthwaite degrees of freedom
num = (s1**2/n1 + s2**2/n2)**2
denom = (s1**2/n1)**2/(n1-1) + (s2**2/n2)**2/(n2-1)
df = num / denom
if alternative == 'two-sided':
p_value = 2 * stats.t.sf(abs(t_stat), df)
elif alternative == 'greater':
p_value = stats.t.sf(t_stat, df)
elif alternative == 'less':
p_value = stats.t.cdf(t_stat, df)
# 95% CI
t_star = stats.t.ppf(0.975, df)
diff = x1_bar - x2_bar
ci = (diff - t_star * se, diff + t_star * se)
return {
't_statistic': t_stat,
'p_value': p_value,
'df_welch': df,
'se': se,
'ci_95': ci,
'difference': diff
}
# Alex's data
result = two_sample_t_from_summary(
x1_bar=46.8, s1=21.2, n1=253,
x2_bar=42.3, s2=18.5, n2=247,
alternative='two-sided'
)
print("=== Alex's A/B Test (from summary stats) ===")
print(f"Difference: {result['difference']:.1f} min")
print(f"t = {result['t_statistic']:.4f}")
print(f"Welch df = {result['df_welch']:.1f}")
print(f"p-value = {result['p_value']:.4f}")
print(f"95% CI: ({result['ci_95'][0]:.2f}, {result['ci_95'][1]:.2f})")
Paired t-Test
import numpy as np
from scipy import stats
# --- Sam's Shooting Data ---
last_season = np.array([18, 24, 15, 22, 20, 16, 28, 19, 21, 17, 23, 14])
this_season = np.array([22, 28, 19, 20, 25, 21, 26, 24, 23, 22, 27, 18])
# Compute differences
differences = this_season - last_season
print("Differences:", differences)
print(f"Mean difference: {np.mean(differences):.2f}")
print(f"SD of differences: {np.std(differences, ddof=1):.2f}")
# Paired t-test (equivalent to one-sample t-test on differences)
t_stat, p_value = stats.ttest_rel(this_season, last_season)
print(f"\n=== Sam's Paired t-Test ===")
print(f"t = {t_stat:.4f}")
print(f"p-value (two-tailed) = {p_value:.4f}")
# One-tailed (improvement = this season > last season)
result = stats.ttest_rel(this_season, last_season, alternative='greater')
print(f"p-value (one-tailed, improvement) = {result.pvalue:.4f}")
# Equivalent: one-sample t-test on differences
t2, p2 = stats.ttest_1samp(differences, popmean=0, alternative='greater')
print(f"\nEquivalent one-sample t-test on differences:")
print(f"t = {t2:.4f}, p = {p2:.4f}") # Same results!
# Confidence interval for mean difference
n = len(differences)
d_bar = np.mean(differences)
s_d = np.std(differences, ddof=1)
t_star = stats.t.ppf(0.975, df=n-1)
margin = t_star * s_d / np.sqrt(n)
print(f"\n95% CI for mean difference: ({d_bar - margin:.2f}, "
f"{d_bar + margin:.2f})")
Two-Proportion z-Test
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep
# --- James's Bail Study ---
# Number of "successes" (re-arrested) and sample sizes
count = np.array([89, 107]) # re-arrests in each group
nobs = np.array([412, 388]) # total in each group
# Two-proportion z-test
z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')
print("=== James's Two-Proportion z-Test ===")
print(f"Algorithm group: {count[0]}/{nobs[0]} = {count[0]/nobs[0]:.3f}")
print(f"Judge group: {count[1]}/{nobs[1]} = {count[1]/nobs[1]:.3f}")
print(f"Difference: {count[0]/nobs[0] - count[1]/nobs[1]:.3f}")
print(f"z = {z_stat:.4f}")
print(f"p-value = {p_value:.4f}")
# Confidence interval for the difference in proportions
ci_low, ci_upp = confint_proportions_2indep(
count[0], nobs[0], count[1], nobs[1], method='wald'
)
print(f"95% CI for p1 - p2: ({ci_low:.4f}, {ci_upp:.4f})")
# --- Maya's Community Comparison ---
count_maya = np.array([127, 81])
nobs_maya = np.array([847, 792])
z_maya, p_maya = proportions_ztest(count_maya, nobs_maya)
print(f"\n=== Maya's Community Comparison ===")
print(f"Industrial: {count_maya[0]}/{nobs_maya[0]} = "
f"{count_maya[0]/nobs_maya[0]:.3f}")
print(f"Suburban: {count_maya[1]}/{nobs_maya[1]} = "
f"{count_maya[1]/nobs_maya[1]:.3f}")
print(f"z = {z_maya:.4f}, p = {p_maya:.4f}")
16.10 Excel: Two-Group Tests
Two-Sample t-Test
Excel's T.TEST function handles two-sample tests directly:
| Syntax | =T.TEST(array1, array2, tails, type) |
|---|---|
| array1 | Data range for Group 1 |
| array2 | Data range for Group 2 |
| tails | 1 for one-tailed, 2 for two-tailed |
| type | 1 = paired, 2 = equal variance, 3 = Welch's (recommended) |
Examples:
| What You Want | Formula |
|---|---|
| Welch's two-tailed p-value | =T.TEST(A2:A248, B2:B254, 2, 3) |
| Welch's one-tailed p-value | =T.TEST(A2:A248, B2:B254, 1, 3) |
| Paired two-tailed p-value | =T.TEST(A2:A13, B2:B13, 2, 1) |
| Equal-variance t-test (use rarely) | =T.TEST(A2:A248, B2:B254, 2, 2) |
Computing from Summary Statistics in Excel
| What You Need | Formula |
|---|---|
| Difference in means | =AVERAGE(A2:A248) - AVERAGE(B2:B254) |
| SE of the difference (unpooled, Welch) | =SQRT(VAR.S(A2:A248)/COUNT(A2:A248) + VAR.S(B2:B254)/COUNT(B2:B254)) |
| t-statistic | =difference / SE |
Two-Proportion z-Test in Excel
Excel doesn't have a built-in two-proportion z-test, but you can compute it with formulas:
| What You Need | Formula |
|---|---|
| $\hat{p}_1$ | =successes1 / n1 |
| $\hat{p}_2$ | =successes2 / n2 |
| $\hat{p}_{\text{pooled}}$ | =(successes1 + successes2) / (n1 + n2) |
| SE | =SQRT(p_pooled*(1-p_pooled)*(1/n1 + 1/n2)) |
| z-statistic | =(p1 - p2) / SE |
| p-value (two-tailed) | =2*(1-NORM.S.DIST(ABS(z), TRUE)) |
| p-value (one-tailed, right) | =1-NORM.S.DIST(z, TRUE) |
16.11 Mathematical Details: Formulas at a Glance
For reference, here are all the formulas from this chapter in one place.
Two-Sample t-Test (Independent Groups, Welch's)
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}$$
95% CI:
$$(\bar{x}_1 - \bar{x}_2) \pm t^*_{df} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
Paired t-Test
$$t = \frac{\bar{d}}{s_d / \sqrt{n}} \qquad df = n - 1$$
where $d_i = x_{i,1} - x_{i,2}$, $\bar{d} = \frac{1}{n}\sum d_i$, $s_d = \sqrt{\frac{1}{n-1}\sum(d_i - \bar{d})^2}$
95% CI:
$$\bar{d} \pm t^*_{n-1} \cdot \frac{s_d}{\sqrt{n}}$$
Two-Proportion z-Test
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{\text{pooled}}(1 - \hat{p}_{\text{pooled}})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad \hat{p}_{\text{pooled}} = \frac{X_1 + X_2}{n_1 + n_2}$$
95% CI (unpooled SE):
$$(\hat{p}_1 - \hat{p}_2) \pm z^* \cdot \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
16.12 Progressive Project: Compare Two Groups Within Your Dataset
Time to apply two-group comparisons to your own dataset.
Your Task
1. Identify a meaningful two-group comparison in your dataset. Ideas:
   - Compare a numerical variable between two subgroups (e.g., income for college graduates vs. non-graduates; life expectancy for two continents)
   - Compare a proportion between two subgroups (e.g., smoking rates for men vs. women; graduation rates for public vs. private universities)
   - If your dataset has a time component, create a before/after comparison
2. Determine whether your comparison is independent or paired. Most comparisons in public datasets will be independent, but if you're comparing the same entities across two time periods (e.g., same countries in 2010 vs. 2020), that's paired.
3. Choose and execute the appropriate test. Use the decision flowchart from Section 16.8.
4. Compute and interpret a confidence interval for the difference.
5. Discuss whether the difference is causal or associational, based on the study design.
Template Code
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Load your data
df = pd.read_csv('your_dataset.csv')
# ====== OPTION A: Compare means (independent groups) ======
group1 = df[df['grouping_variable'] == 'Group 1']['numerical_variable'].dropna()
group2 = df[df['grouping_variable'] == 'Group 2']['numerical_variable'].dropna()
# Visualize both groups
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(group1, bins=20, alpha=0.7, label='Group 1', color='steelblue')
axes[0].hist(group2, bins=20, alpha=0.7, label='Group 2', color='coral')
axes[0].legend()
axes[0].set_title('Overlaid Histograms')
axes[1].boxplot([group1, group2], labels=['Group 1', 'Group 2'])
axes[1].set_title('Side-by-Side Box Plots')
plt.tight_layout()
plt.show()
# Summary statistics
print(f"Group 1: n={len(group1)}, mean={np.mean(group1):.3f}, "
f"SD={np.std(group1, ddof=1):.3f}")
print(f"Group 2: n={len(group2)}, mean={np.mean(group2):.3f}, "
f"SD={np.std(group2, ddof=1):.3f}")
# Welch's two-sample t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"\nWelch's t-test: t = {t_stat:.4f}, p = {p_value:.4f}")
# Confidence interval
diff = np.mean(group1) - np.mean(group2)
se = np.sqrt(np.var(group1, ddof=1)/len(group1)
+ np.var(group2, ddof=1)/len(group2))
ci = (diff - 1.96*se, diff + 1.96*se)
print(f"Difference: {diff:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
# ====== OPTION B: Compare proportions ======
# from statsmodels.stats.proportion import proportions_ztest
# count = np.array([successes1, successes2])
# nobs = np.array([n1, n2])
# z_stat, p_value = proportions_ztest(count, nobs)
What to Write in Your Notebook
Add a new section titled "Chapter 16: Two-Group Comparisons" to your Data Detective Portfolio. Include:
- Your research question and the two groups being compared
- Justification for whether the data are independent or paired
- Condition checks with visualizations
- Test results with full interpretation
- Confidence interval with practical interpretation
- Discussion of whether the finding is causal or associational (and why)
- A 2-3 sentence reflection: What does this comparison reveal about your dataset?
16.13 Common Mistakes and How to Avoid Them
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Using a paired test on independent data | Creates meaningless "differences"; results are invalid | Ask "Is there a natural pairing?" before choosing the test |
| Using an independent test on paired data | Ignores the pairing, adds unnecessary noise, reduces power | Compute within-pair differences and use the paired t-test |
| Assuming equal variances without checking | The classic (pooled) t-test can give wrong p-values when variances differ | Use Welch's t-test by default |
| Comparing overlapping CIs and concluding "no difference" | Two CIs can overlap even when the difference is significant | Always compute the CI for the difference, not separate CIs for each group |
| Running two one-sample tests instead of one two-group test | Multiple testing inflates Type I error; doesn't directly answer the comparison question | Use a single test that compares the groups directly |
| Ignoring study design when interpreting results | Finding a difference ≠ proving causation | State whether the study is experimental or observational; causal claims require randomization |
| Forgetting to check conditions for both groups | Violations in either group can invalidate the test | Check normality and independence for each group separately |
16.14 Chapter Summary
Take a step back and see what you've accomplished. You now have three powerful tools for comparing two groups — the most common analysis in applied statistics:
- The two-sample t-test (Welch's) compares means from two independent groups. It's the workhorse of A/B testing, clinical trials, and any study comparing two separate populations.
- The paired t-test compares paired observations by reducing the problem to a one-sample t-test on the differences. It's the go-to for before-and-after designs and matched-pairs studies, and it's often more powerful than the independent-samples test because it eliminates between-subject variability.
- The two-proportion z-test compares proportions from two independent groups. It's essential for public health comparisons, fairness audits, and any study with a yes/no outcome across two populations.
All three tests follow the same logical framework: compute an observed difference, calculate the standard error of that difference, form a test statistic, and compare to a reference distribution. The choice among them depends on two questions: what type of data? (means or proportions) and what type of design? (independent or paired).
And here's the deeper insight: comparing groups is where statistics gets real. One-sample tests are useful for checking benchmarks, but two-group comparisons answer the questions that drive research, policy, and business decisions: Does the treatment work better than the control? Is there a disparity between groups? Did the change make a difference?
Alex's A/B test showed that the new algorithm increases watch time by about 4.5 minutes — a small individual effect with massive business implications. Sam's paired analysis revealed that Daria's improvement is real when you control for opponent difficulty. James's proportion comparison uncovered a statistically significant gap between algorithmic and human bail decisions. And Maya's community comparison documented a health disparity that demands further investigation.
What's Next: Chapter 17 will tackle a critical question lurking behind every test in this chapter: how big is the effect, and did we have enough data to find it? You'll learn about statistical power (the probability of detecting a real effect), effect sizes (how to measure practical significance), and what "statistically significant" really means when you look at the full picture. Sam's borderline results from Chapter 15 — and the question of how many games he'd need to confirm Daria's improvement — will finally get a rigorous answer.
"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey