Learning Objectives
- Conduct and interpret a two-sample t-test for independent groups
- Conduct and interpret a paired t-test for dependent samples
- Compare two proportions using a two-proportion z-test
- Construct confidence intervals for the difference between two groups
- Choose the correct test based on study design and data type
In This Chapter
- Chapter Overview
- 16.1 A Puzzle Before We Start (Productive Struggle)
- 16.2 The Big Picture: From One Group to Two
- 16.3 The Two-Sample t-Test for Independent Groups
- 16.4 The Paired t-Test: When Data Come in Pairs
- 16.5 Paired vs. Independent: The Critical Choice
- 16.6 The Two-Proportion z-Test
- 16.7 Confidence Intervals for Differences: The Full Picture
- 16.8 Choosing the Right Test: A Decision Flowchart
- 16.9 Python: Two-Group Tests
- 16.10 Excel: Two-Group Tests
- 16.11 Mathematical Details: Formulas at a Glance
- 16.12 Progressive Project: Compare Two Groups Within Your Dataset
- 16.13 Common Mistakes and How to Avoid Them
- 16.14 Chapter Summary
Chapter 16: Comparing Two Groups
"The purpose of computing is insight, not numbers." — Richard Hamming
Chapter Overview
Here's the question that drives most research in the world: is there a difference between these two groups?
Not "what's the average?" Not "is this number different from a benchmark?" Those are useful questions — and you've spent the last three chapters answering them. But the question that fills research journals, shapes medical practice, drives business decisions, and determines public policy is almost always a comparison: Is the new drug better than the old one? Do men and women earn different salaries? Did the policy change affect outcomes? Is the algorithm biased against one group compared to another?
Alex Rivera has been waiting for this chapter since Chapter 1. StreamVibe ran an A/B test — randomly assigning users to the old recommendation algorithm or a new one — and Alex needs to know: did the new algorithm actually increase watch time? That's a two-group comparison. The old algorithm is one group. The new algorithm is the other. Same question, different lens.
Dr. Maya Chen wants to compare disease rates between two communities — an industrial neighborhood and a suburban control community. Same disease, two populations. Is the difference real, or could it be explained by random variation?
Sam Okafor has an interesting wrinkle. He wants to compare Daria's shooting performance this season versus last season. But here's the thing: it's the same player. The two "groups" aren't independent — they're matched by the person doing the shooting. That changes everything about how we analyze the data.
And Professor James Washington wants to compare recidivism rates between defendants whose bail was set by an algorithm versus those whose bail was set by a judge. Two groups, one categorical outcome (re-arrested or not), and enormous consequences for the people involved.
Every single one of these questions requires a different variant of the same fundamental idea: measure the difference between two groups, then ask whether that difference is larger than what random variation alone could produce.
In this chapter, you will learn to:
- Conduct and interpret a two-sample t-test for independent groups
- Conduct and interpret a paired t-test for dependent samples
- Compare two proportions using a two-proportion z-test
- Construct confidence intervals for the difference between two groups
- Choose the correct test based on study design and data type
Fast Track: If you've done two-sample tests before, skim Sections 16.1–16.3, then jump to Section 16.8 (choosing the right test). Complete quiz questions 1, 10, and 18 to verify.
Deep Dive: After this chapter, read Case Study 1 (Alex's A/B test — the full analysis) for a complete tech industry application, then Case Study 2 (James's algorithmic bail study) for a deep look at how two-group comparisons reveal algorithmic bias. Both include full worked solutions.
16.1 A Puzzle Before We Start (Productive Struggle)
Before we jump into formulas, try this thought experiment.
The Training Program
A company wants to test whether a new employee training program improves customer satisfaction scores. They try two study designs:
Design A: Take 50 employees who went through the new training and 50 employees who went through the old training. Compare their average customer satisfaction scores.
Design B: Measure 50 employees' customer satisfaction scores before the new training, then measure the same 50 employees again after the new training. Compare the before and after scores.
(a) Both designs compare two groups. What's fundamentally different about them?
(b) In Design A, Employee #1 in the new-training group might naturally be more charismatic than Employee #1 in the old-training group. How does this affect the comparison?
(c) In Design B, you're comparing each employee to themselves. Why might this be more powerful?
(d) Here's the twist: in Design B, if the company announced the new training program with great fanfare and employees knew they were being evaluated, could the improvement be due to something other than the training itself?
Take 3 minutes. Part (c) is the key insight for this chapter.
Here's what I hope you noticed:
For part (a), the fundamental difference is independence. In Design A, the two groups are separate people — the new-training group and the old-training group have no overlap. In Design B, the two "groups" are the same people measured twice. Each data point in the "before" group has a natural partner in the "after" group.
Part (b) gets at a crucial problem with Design A. Different employees have different baseline abilities. Some are naturally charming, others are more reserved. These person-to-person differences create noise that makes it harder to detect the training effect. If the new-training group happens to include more charismatic employees, the comparison is confounded.
Part (c) reveals the power of paired designs. When you compare each employee to themselves, you eliminate all the person-to-person variability. You don't care whether Employee #7 is naturally better than Employee #23 — you only care whether Employee #7 improved relative to their own baseline. This dramatically reduces noise and often makes real effects easier to detect.
And part (d) is a healthy reminder from Chapter 4: study design matters. The improvement in Design B could reflect a Hawthorne effect (people perform better when they know they're being watched) or a practice effect (scores improve just from repeating the evaluation). The paired design controls for person-level variability but doesn't automatically guarantee a causal interpretation.
You've just identified the core ideas of this chapter: independent vs. paired comparisons, the tradeoffs between them, and the importance of study design in interpreting results. Now let's formalize them.
16.2 The Big Picture: From One Group to Two
🔄 Spaced Review 1 (from Ch.13): The Hypothesis Testing Framework
In Chapter 13, you learned the five-step procedure for hypothesis testing:
- State $H_0$ and $H_a$
- Check conditions
- Compute the test statistic
- Find the p-value
- Conclude in context
Every test in this chapter follows exactly the same framework. The only thing that changes is what goes into the test statistic. In Chapter 13, the test statistic measured how far a sample statistic was from a single hypothesized value. Now, the test statistic will measure how far the difference between two groups is from zero (or from some other hypothesized difference). Same logic. Bigger question.
In Chapters 14 and 15, you tested claims about a single population parameter:
- Chapter 14: Is the population proportion $p$ equal to $p_0$?
- Chapter 15: Is the population mean $\mu$ equal to $\mu_0$?
Now we're asking a fundamentally different question: is there a difference between two populations? The parameter of interest shifts from a single value to a difference:
- Is $\mu_1 - \mu_2 = 0$? (difference in means)
- Is $p_1 - p_2 = 0$? (difference in proportions)
The general form of the test statistic stays the same:
$$\text{test statistic} = \frac{\text{observed difference} - \text{hypothesized difference}}{\text{standard error of the difference}}$$
Usually, the hypothesized difference is zero (no difference between groups), so this simplifies to:
$$\text{test statistic} = \frac{\text{observed difference}}{\text{standard error of the difference}}$$
The challenge — and the subject of this chapter — is figuring out what "standard error of the difference" means in each situation. It depends on:
- What you're comparing: means or proportions?
- How the data are structured: independent groups or paired observations?
This gives us three main scenarios:
| Scenario | Data Type | Structure | Test |
|---|---|---|---|
| Two independent groups, numerical data | Means | Independent | Two-sample t-test |
| Same subjects measured twice, numerical data | Means | Paired | Paired t-test |
| Two independent groups, categorical data | Proportions | Independent | Two-proportion z-test |
Let's tackle each one.
16.3 The Two-Sample t-Test for Independent Groups
When to Use It
Use the two-sample t-test (also called the independent-samples t-test) when you want to compare the means of two separate, unrelated groups. The key word is independent — knowing the value for one observation in Group 1 tells you nothing about any observation in Group 2.
Concept 1: Independent Samples
Two samples are independent when the individuals in one sample are completely unrelated to the individuals in the other sample. Random assignment to two treatment groups creates independent samples. Comparing men vs. women, or treatment vs. control, or new algorithm vs. old algorithm — all independent samples (assuming no matching or pairing). The observations in one group do not constrain or determine the observations in the other.
Examples of independent samples:
- Patients randomly assigned to a drug group vs. a placebo group
- Students at School A vs. students at School B
- Users who see Algorithm A vs. users who see Algorithm B (Alex's A/B test!)
- Crime outcomes under algorithm-based bail vs. judge-based bail (James's study!)
The Hypotheses
For comparing two population means $\mu_1$ and $\mu_2$:
| Test Type | $H_0$ | $H_a$ |
|---|---|---|
| Two-tailed | $\mu_1 - \mu_2 = 0$ | $\mu_1 - \mu_2 \neq 0$ |
| Right-tailed | $\mu_1 - \mu_2 = 0$ | $\mu_1 - \mu_2 > 0$ |
| Left-tailed | $\mu_1 - \mu_2 = 0$ | $\mu_1 - \mu_2 < 0$ |
Or equivalently: $H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 \neq \mu_2$ (or $>$ or $<$).
The Standard Error of the Difference
🔄 Spaced Review 2 (from Ch.11): Standard Error — Now for Differences
In Chapter 11, you learned that the standard error measures how much a statistic varies from sample to sample. For a single sample mean: $SE_{\bar{x}} = \sigma / \sqrt{n}$, estimated by $s / \sqrt{n}$.
Now we need the standard error of the difference between two sample means. Here's the beautiful mathematical fact: when two random variables are independent, the variance of their difference equals the sum of their variances.
$$\text{Var}(\bar{X}_1 - \bar{X}_2) = \text{Var}(\bar{X}_1) + \text{Var}(\bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
This is why the standard error adds the two variances (not the standard deviations). The standard error of the difference is larger than the standard error of either group alone — which makes sense. When you compare two groups, there are two sources of sampling variability instead of one.
The standard error of the difference in means is:
$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
This is the standard error of the difference for the two-sample t-test. Notice that it combines the variability from both groups.
Key Term: Standard Error of the Difference
The standard error of the difference combines the sampling variability from both groups into a single measure of how much the difference $\bar{x}_1 - \bar{x}_2$ is expected to vary from sample to sample. For independent samples: $SE = \sqrt{s_1^2/n_1 + s_2^2/n_2}$. (This is the unpooled form. The word "pooled" is reserved for the classic equal-variance t-test, which combines both samples' variances into a single estimate — more on that distinction below.)
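If the variance-addition fact feels abstract, a short simulation makes it concrete. The sketch below (illustrative only — all population parameters are arbitrary demo values, not from the chapter) repeats the two-sample experiment many times and compares the empirical variance of $\bar{x}_1 - \bar{x}_2$ to $\sigma_1^2/n_1 + \sigma_2^2/n_2$:

```python
import numpy as np

# Simulation sketch: for independent samples,
# Var(xbar1 - xbar2) should equal sigma1^2/n1 + sigma2^2/n2.
# All population parameters here are arbitrary choices for the demo.
rng = np.random.default_rng(42)
sigma1, sigma2 = 8.0, 12.0
n1, n2 = 40, 60
reps = 50_000

# Many replicate experiments: draw both samples, record the difference in means
xbar1 = rng.normal(50, sigma1, size=(reps, n1)).mean(axis=1)
xbar2 = rng.normal(50, sigma2, size=(reps, n2)).mean(axis=1)
diffs = xbar1 - xbar2

empirical_var = diffs.var()
theoretical_var = sigma1**2 / n1 + sigma2**2 / n2  # 64/40 + 144/60 = 4.0

print(empirical_var, theoretical_var)
```

The two printed values should agree closely — the variances add, even though the samples are subtracted.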
The Test Statistic
$$\boxed{t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}}$$
In plain English: The t-statistic measures how many standard errors the observed difference in sample means is from zero (no difference). A large t-value means the groups differ by more than we'd expect from random variation alone.
Welch's t-Test: The Default Choice
The formula above is called Welch's t-test (also called the unequal-variances t-test). It does not assume that the two populations have equal variances. This is important because in real data, groups usually don't have equal variances — and incorrectly assuming they do can give misleading results.
The degrees of freedom for Welch's t-test are calculated using the Welch-Satterthwaite approximation:
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$
Don't panic. You will never compute this by hand. Python and Excel handle it automatically. The formula exists so you know what's happening under the hood.
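If you're curious what "under the hood" looks like, the formula is a one-liner. A sketch (the summary statistics in the example call are made-up numbers for illustration):

```python
def welch_df(s1: float, n1: int, s2: float, n2: int) -> float:
    """Welch-Satterthwaite degrees of freedom for a two-sample t-test."""
    v1 = s1**2 / n1  # group 1's contribution to the variance of the difference
    v2 = s2**2 / n2  # group 2's contribution
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Made-up summary statistics for illustration
print(welch_df(s1=10.0, n1=40, s2=15.0, n2=35))  # roughly 58
```

A nice sanity check: when the two groups have equal variances and sizes, the formula collapses to the familiar $n_1 + n_2 - 2$.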
Why Welch's, not Student's?
The "classic" two-sample t-test (sometimes called the equal-variances or pooled t-test) assumes $\sigma_1 = \sigma_2$ and pools the two sample variances into one estimate. This was useful when computation was expensive, but modern research strongly recommends Welch's version as the default:
- Welch's test gives correct results whether or not variances are equal
- The classic test can give inflated Type I error rates when variances are unequal
- When variances are equal, Welch's test gives nearly identical results to the classic test
Bottom line: Use Welch's t-test by default. In Python, `scipy.stats.ttest_ind()` runs Welch's test when you pass `equal_var=False` (note that its default is the pooled test, so set this argument explicitly), and Excel's `T.TEST` offers it as Type 3. There's no good reason to assume equal variances unless you have strong prior justification.
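Here's a quick sketch of the API on simulated data (all parameters are arbitrary demo values), running both versions side by side:

```python
import numpy as np
from scipy import stats

# Simulated groups with unequal spreads (arbitrary demo parameters)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=15, size=35)

# Welch's t-test: pass equal_var=False explicitly
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Classic pooled t-test (assumes equal variances), for comparison
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

print(welch.statistic, welch.pvalue)
print(pooled.statistic, pooled.pvalue)
```

With unequal spreads and unequal sample sizes, the two versions give different p-values — which is exactly why the choice matters.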
Conditions for the Two-Sample t-Test
The conditions mirror the one-sample t-test from Chapter 15, applied to both groups:
| Condition | What to Check |
|---|---|
| 1. Independence (between groups) | The two samples are independent of each other; random assignment or separate populations |
| 2. Independence (within groups) | Observations within each group are independent; 10% condition for sampling without replacement |
| 3. Normality | Sampling distribution of $\bar{x}_1 - \bar{x}_2$ is approximately normal; same guidelines as Ch.15: each group needs $n \geq 30$, or approximate normality in each group |
Normality guidelines for two-sample t-tests:
| Group Sizes | Requirement |
|---|---|
| Both $n_1, n_2 \geq 30$ | CLT handles most shapes in both groups |
| Either $n_i < 30$ | Check that group for approximate normality (histogram, QQ-plot) |
| Both $n_i < 15$ | Both groups need to be approximately normal |
Complete Worked Example: Alex's A/B Test
This is the moment Alex has been waiting for since Chapter 1. StreamVibe randomly assigned users to one of two recommendation algorithms and measured average watch time per session.
The Data:
| | Old Algorithm (Control) | New Algorithm (Treatment) |
|---|---|---|
| Sample size | $n_1 = 247$ | $n_2 = 253$ |
| Sample mean | $\bar{x}_1 = 42.3$ min | $\bar{x}_2 = 46.8$ min |
| Sample SD | $s_1 = 18.5$ min | $s_2 = 21.2$ min |
The observed difference is $\bar{x}_2 - \bar{x}_1 = 46.8 - 42.3 = 4.5$ minutes. Is this difference real, or could it be explained by chance?
Step 1: State the Hypotheses
Alex wants to know whether the new algorithm performs differently from the old one (in either direction), so this is two-tailed:
$$H_0: \mu_{\text{new}} - \mu_{\text{old}} = 0$$ $$H_a: \mu_{\text{new}} - \mu_{\text{old}} \neq 0$$
(Alex could justify a one-tailed test — "does the new algorithm increase watch time?" — but the two-tailed approach is more conservative and catches unexpected decreases too.)
Step 2: Check the Conditions
- Independence between groups: Users were randomly assigned to algorithms. The two groups are independent. ✓
- Independence within groups: Individual viewing sessions are independent (each user counted once). Both samples are less than 10% of all StreamVibe users. ✓
- Normality: Both groups have $n > 30$ (247 and 253). By the CLT, the sampling distribution of the difference in means is approximately normal, even though individual watch times are likely right-skewed. ✓
All conditions met.
Step 3: Compute the Test Statistic
$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{18.5^2}{247} + \frac{21.2^2}{253}} = \sqrt{\frac{342.25}{247} + \frac{449.44}{253}}$$
$$SE = \sqrt{1.3856 + 1.7763} = \sqrt{3.1619} = 1.778$$
$$t = \frac{46.8 - 42.3}{1.778} = \frac{4.5}{1.778} = 2.530$$
Step 4: Find the P-Value
Using the Welch-Satterthwaite degrees of freedom (which Python computes automatically, approximately $df \approx 491$):
$$p\text{-value} = 2 \times P(T_{491} \geq 2.530) \approx 0.012$$
Step 5: Conclude in Context
At $\alpha = 0.05$: Since $p = 0.012 < 0.05$, we reject $H_0$.
Conclusion: There is statistically significant evidence that the average watch time differs between the two algorithms ($t = 2.53$, $p = 0.012$). Users assigned to the new algorithm watched an average of 4.5 minutes longer per session than users assigned to the old algorithm.
The confidence interval: A 95% CI for $\mu_{\text{new}} - \mu_{\text{old}}$:
$$(\bar{x}_2 - \bar{x}_1) \pm t^* \cdot SE = 4.5 \pm 1.965 \times 1.778 = 4.5 \pm 3.49$$
$$95\% \text{ CI: } (1.01, 7.99) \text{ minutes}$$
The CI tells us that the true difference is plausibly between about 1 minute and 8 minutes of additional watch time. Notice that zero is not in this interval — consistent with rejecting $H_0$.
Alex's Reaction: "Four and a half minutes more per session! That sounds small, but StreamVibe has 12 million active users. If each user watches an average of 3 sessions per day, that's 162 million additional minutes of watch time per day. At our average ad revenue rate, that translates to roughly $1.8 million per month in incremental revenue. The algorithm change is worth it."
This is the kind of practical significance that matters in business — and it only became visible because Alex used a proper two-sample test on randomized data. This is the A/B testing thread from Chapter 1, Chapter 4, and Chapter 15 — now fully resolved.
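The hand calculation above can be checked from the summary statistics alone — `scipy.stats.ttest_ind_from_stats` accepts means, SDs, and sample sizes directly. A verification sketch using the numbers from the table:

```python
from scipy import stats

# Summary statistics from Alex's A/B test
result = stats.ttest_ind_from_stats(
    mean1=46.8, std1=21.2, nobs1=253,  # new algorithm
    mean2=42.3, std2=18.5, nobs2=247,  # old algorithm
    equal_var=False,                   # Welch's t-test
)

print(result.statistic)  # about 2.53
print(result.pvalue)     # about 0.012
```

This matches the worked solution: $t \approx 2.53$, $p \approx 0.012$.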
🔄 Spaced Review 3 (from Ch.4): Experimental vs. Observational Design — Why Alex Can Say "Caused"
In Chapter 4, you learned the critical distinction between experiments (where the researcher controls the treatment) and observational studies (where the researcher merely observes). Only randomized experiments support causal conclusions.
Alex's A/B test is a randomized experiment: users were randomly assigned to algorithms. This means the statistically significant difference can be interpreted causally — the new algorithm caused the increase in watch time, because randomization balanced all other variables (device type, time of day, user preferences) between the two groups.
If Alex had instead compared users who chose the new algorithm to those who stayed with the old one, it would be an observational study. Users who actively switch algorithms might be more engaged viewers generally, creating a confound. Same statistical test, same p-value, very different conclusion.
The test tells you whether the difference is real. The study design tells you whether you can call it causal.
16.4 The Paired t-Test: When Data Come in Pairs
When to Use It
Use the paired t-test when your observations come in natural pairs. This happens when:
- Before-and-after designs: The same subjects are measured twice (e.g., blood pressure before and after medication)
- Matched pairs: Subjects are matched on key characteristics (e.g., twins, siblings, or participants matched on age and gender)
- Repeated measures: The same experimental units are tested under two conditions (e.g., left eye vs. right eye, morning vs. evening)
Concept 2: Dependent Samples
Two samples are dependent (also called paired or matched) when each observation in one sample has a natural partner in the other sample. The value of one observation is related to the value of its partner — not because of the treatment, but because of an underlying connection (same person, same location, same time period). Dependent samples require a different analysis than independent samples because the two measurements are correlated.
Key Term: Matched Pairs
In a matched pairs design, each observation in one group is linked to a specific observation in the other group. The pairing creates a natural one-to-one correspondence. Examples: the same student's test score before and after tutoring, the same city's crime rate before and after a policy change, or the same product's sales in two different seasons.
The Brilliant Insight Behind the Paired t-Test
Here's the key idea: a paired t-test is just a one-sample t-test on the differences.
Instead of comparing two groups directly, you:
1. Compute the difference $d_i = x_{i,\text{after}} - x_{i,\text{before}}$ for each pair
2. Treat those differences as a single sample
3. Test whether the mean difference $\mu_d$ equals zero using the one-sample t-test from Chapter 15
That's it. You already know how to do this. The paired t-test isn't a new procedure — it's the one-sample t-test applied to a specific kind of data.
The Formula
For $n$ pairs with differences $d_1, d_2, \ldots, d_n$:
$$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i \qquad s_d = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2}$$
$$\boxed{t = \frac{\bar{d} - 0}{s_d / \sqrt{n}} = \frac{\bar{d}}{s_d / \sqrt{n}}}$$
with $df = n - 1$ (where $n$ is the number of pairs, not the total number of observations).
In plain English: Compute each within-pair difference. Then test whether the average difference is significantly different from zero. If it is, the treatment had an effect.
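The formulas translate directly into code. A minimal sketch from scratch (the eight differences below are made-up numbers for illustration, not data from the chapter):

```python
import math

# Made-up within-pair differences for illustration
d = [2, -1, 3, 4, 0, 2, 1, 3]
n = len(d)

d_bar = d_sum = sum(d) / n                                   # mean difference
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))  # SD of differences
t = d_bar / (s_d / math.sqrt(n))                             # paired t-statistic
df = n - 1                                                   # degrees of freedom

print(d_bar, s_d, t, df)  # 1.75, about 1.67, about 2.97, 7
```

Note that $n$ counts pairs, so the degrees of freedom here are 7 even though 16 measurements were taken.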
The Hypotheses
| Test Type | $H_0$ | $H_a$ |
|---|---|---|
| Two-tailed | $\mu_d = 0$ | $\mu_d \neq 0$ |
| Right-tailed | $\mu_d = 0$ | $\mu_d > 0$ |
| Left-tailed | $\mu_d = 0$ | $\mu_d < 0$ |
Why Paired Tests Are Often More Powerful
The paired t-test eliminates between-subject variability. In Alex's A/B test (independent samples), part of the variation in watch time comes from the fact that different people have different viewing habits — some binge-watch, others check in for 10 minutes. That person-to-person variation is noise that makes the treatment effect harder to detect.
In a paired design, each subject serves as their own control. The differences capture only the within-person change, stripping away the between-person noise. The result: the standard deviation of the differences ($s_d$) is often much smaller than the individual standard deviations ($s_1$ or $s_2$), which means a smaller standard error, which means a more powerful test.
The tradeoff: Paired designs only work when natural pairing exists. You can't pair data retroactively if subjects weren't matched or measured twice.
Conditions for the Paired t-Test
| Condition | What to Check |
|---|---|
| 1. Paired data | Each observation in one group has a natural partner in the other group |
| 2. Random | The pairs are randomly selected from the population, or subjects are randomly assigned to the order of treatments |
| 3. Independence of pairs | The differences $d_i$ are independent of each other (one pair's difference doesn't influence another's) |
| 4. Normality of differences | The distribution of the differences $d_i$ is approximately normal; same guidelines: $n \geq 30$ pairs or check for normality |
Notice: we check normality of the differences, not of the original measurements. The differences might be approximately normal even when the original data aren't.
Complete Worked Example: Sam's Shooting Comparison
Sam Okafor wants to compare Daria Williams's shooting performance this season versus last season. He has data from 12 games where he can match games by opponent — Daria played each of these 12 opponents in both seasons, so he can create natural pairs.
The Data:
| Opponent | Last Season (pts) | This Season (pts) | Difference ($d_i$) |
|---|---|---|---|
| Hawks | 18 | 22 | +4 |
| Wolves | 24 | 28 | +4 |
| Panthers | 15 | 19 | +4 |
| Eagles | 22 | 20 | −2 |
| Bears | 20 | 25 | +5 |
| Falcons | 16 | 21 | +5 |
| Tigers | 28 | 26 | −2 |
| Lions | 19 | 24 | +5 |
| Sharks | 21 | 23 | +2 |
| Cobras | 17 | 22 | +5 |
| Stallions | 23 | 27 | +4 |
| Vipers | 14 | 18 | +4 |
The differences: 4, 4, 4, −2, 5, 5, −2, 5, 2, 5, 4, 4
Summary statistics for the differences: $$n = 12, \quad \bar{d} = 3.17, \quad s_d = 2.55$$
Step 1: State the Hypotheses
Sam wants to know if Daria's scoring has improved (increased), so this is one-tailed:
$$H_0: \mu_d = 0 \quad (\text{no change in scoring})$$ $$H_a: \mu_d > 0 \quad (\text{scoring has improved})$$
Step 2: Check the Conditions
- Paired data: Each game is matched by opponent across seasons. ✓
- Random: The 12 opponents represent a convenience sample (not randomly selected), but they're a reasonable representation of typical opponents. ✓ (with caveat)
- Independence of pairs: Each game is independent of other games. ✓
- Normality of differences: With $n = 12$ (in the $< 15$ range), we need approximate normality. The differences range from −2 to +5 with no extreme outliers. A histogram would show rough symmetry. ✓ (marginally)
Step 3: Compute the Test Statistic
$$SE_d = \frac{s_d}{\sqrt{n}} = \frac{2.55}{\sqrt{12}} = \frac{2.55}{3.464} = 0.736$$
$$t = \frac{\bar{d} - 0}{SE_d} = \frac{3.17}{0.736} = 4.31$$
Step 4: Find the P-Value
Using the t-distribution with $df = 11$:
$$p\text{-value} = P(T_{11} \geq 4.31) \approx 0.0006$$
Step 5: Conclude in Context
At $\alpha = 0.05$: Since $p = 0.0006 < 0.05$, we reject $H_0$.
Conclusion: There is strong statistical evidence that Daria's scoring has improved this season compared to last season ($t = 4.31$, $p < 0.001$). When matched by opponent, Daria scored an average of 3.17 points more per game this season than last season.
95% CI for the mean difference:
$$\bar{d} \pm t^*_{11} \cdot SE_d = 3.17 \pm 2.201 \times 0.736 = 3.17 \pm 1.62$$
$$95\% \text{ CI: } (1.55, 4.79) \text{ points per game}$$
Sam's Insight: "So the improvement is real — somewhere between 1.5 and 5 points per game when we control for opponent difficulty. But here's the thing I almost got wrong: I was about to run a two-sample t-test, treating each season's scores as independent samples. That would have been the wrong test! The same player against the same opponents — that's paired data. And look at how strong the result is when we use the right test."
Why the paired test worked better: If Sam had treated the data as independent samples, the two-sample standard error would have included all the game-to-game variability (some opponents are tougher, some games are blowouts). The paired design eliminated that opponent-level variability by comparing Daria to herself, game by game. The differences ($s_d = 2.55$) are much less variable than the raw scores ($s_1 \approx 4.1$, $s_2 \approx 3.2$), giving a more powerful test.
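Sam's analysis can be reproduced in a few lines, and the "paired test = one-sample test on the differences" insight can be verified directly. A sketch using the game data from the table (the `alternative` argument assumes scipy ≥ 1.6):

```python
from scipy import stats

# Daria's points per game, matched by opponent (from the table above)
last_season = [18, 24, 15, 22, 20, 16, 28, 19, 21, 17, 23, 14]
this_season = [22, 28, 19, 20, 25, 21, 26, 24, 23, 22, 27, 18]

# Paired t-test, one-tailed: has scoring improved?
paired = stats.ttest_rel(this_season, last_season, alternative='greater')

# The same test, run as a one-sample t-test on the within-pair differences
diffs = [t - l for t, l in zip(this_season, last_season)]
one_sample = stats.ttest_1samp(diffs, popmean=0, alternative='greater')

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)  # identical to the paired test
```

The two calls print identical results — the paired t-test really is just the Chapter 15 one-sample test applied to the differences.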
16.5 Paired vs. Independent: The Critical Choice
⚠️ Common Mistake Alert
The #1 mistake in two-group comparisons is using the wrong test — treating paired data as independent, or treating independent data as paired. This isn't just a technical error. It can dramatically change your results.
- Using an independent test on paired data ignores the pairing and includes unnecessary noise. You lose power and might miss a real effect.
- Using a paired test on independent data is even worse. The "differences" you compute are meaningless because there's no natural pairing, and your results could go in either direction — too liberal or too conservative.
The test you use must match the study design, not the format of the data.
The Decision Rule
Ask yourself: does each observation in one group have a specific, natural partner in the other group?
| If the answer is... | Then use... | Because... |
|---|---|---|
| Yes — same person, same location, matched pair | Paired t-test | The pairing captures within-pair change |
| No — different people, different units, no matching | Two-sample t-test | The groups are independent |
Sam's Almost-Mistake: A Cautionary Tale
Let's see what would have happened if Sam had used the wrong test on Daria's data.
Correct analysis (paired t-test):
- $\bar{d} = 3.17$, $SE_d = 0.736$, $t = 4.31$, $p = 0.0006$
- Conclusion: Strong evidence of improvement. ✓
Incorrect analysis (independent two-sample t-test):
- $\bar{x}_{\text{this}} = 22.92$, $s_{\text{this}} = 3.18$, $n_{\text{this}} = 12$
- $\bar{x}_{\text{last}} = 19.75$, $s_{\text{last}} = 4.09$, $n_{\text{last}} = 12$
- $SE_{\text{ind}} = \sqrt{3.18^2/12 + 4.09^2/12} = \sqrt{0.843 + 1.394} = 1.496$
- $t_{\text{ind}} = (22.92 - 19.75) / 1.496 = 2.12$, $p \approx 0.046$
The paired test gives $p = 0.0006$. The independent test gives $p = 0.046$. Both reject $H_0$ at $\alpha = 0.05$ in this case, but the paired test produces much stronger evidence — because it properly accounts for the pairing. With slightly noisier data, the independent test might fail to reject while the paired test still would.
Key takeaway: When paired data exist, always use the paired test. You're leaving statistical power on the table if you don't.
Quick Reference: Is It Paired?
| Scenario | Paired? | Why? |
|---|---|---|
| Blood pressure before and after medication | Yes | Same patients measured twice |
| Men's vs. women's salaries at a company | No | Different people |
| Left eye vs. right eye measurements | Yes | Same patients, natural pairing |
| Test scores of tutored vs. untutored students | No (usually) | Different students, unless matched |
| A restaurant's ratings on Yelp vs. Google | Yes | Same restaurant on two platforms |
| Crime rates in 2020 vs. 2024 across 50 states | Yes | Same states measured in both years |
| Satisfaction scores for Product A vs. Product B | Depends | Paired if same people rated both; independent if different people rated each |
16.6 The Two-Proportion z-Test
When to Use It
Use the two-proportion z-test when you want to compare proportions from two independent groups. The outcome variable is categorical (success/failure), and you're asking: is the proportion of successes different in the two groups?
Concept 3: Difference in Proportions
The difference in proportions $\hat{p}_1 - \hat{p}_2$ estimates the true difference $p_1 - p_2$ between two population proportions. Just as with means, we test whether this observed difference is large enough to be statistically significant — too large to be explained by random variation alone.
Examples:
- Is the cure rate higher for Drug A than Drug B? (Maya's world)
- Is the click-through rate different for two website designs? (Alex's world)
- Is the recidivism rate different for algorithm-recommended vs. judge-recommended bail decisions? (James's world!)
The Hypotheses
| Test Type | $H_0$ | $H_a$ |
|---|---|---|
| Two-tailed | $p_1 - p_2 = 0$ | $p_1 - p_2 \neq 0$ |
| Right-tailed | $p_1 - p_2 = 0$ | $p_1 - p_2 > 0$ |
| Left-tailed | $p_1 - p_2 = 0$ | $p_1 - p_2 < 0$ |
The Pooled Proportion
Under $H_0$, we assume $p_1 = p_2$ — the two groups have the same underlying proportion. So we estimate this common proportion by pooling the data from both groups:
$$\hat{p}_{\text{pooled}} = \frac{X_1 + X_2}{n_1 + n_2}$$
where $X_1$ and $X_2$ are the number of successes in each group.
The Standard Error (Under $H_0$)
$$SE = \sqrt{\hat{p}_{\text{pooled}}(1 - \hat{p}_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
The Test Statistic
$$\boxed{z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{\text{pooled}}(1 - \hat{p}_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}}$$
Why z, not t? When comparing proportions, we use the z-distribution (standard normal) rather than the t-distribution. This is because the standard error formula for proportions is derived directly from the binomial distribution's variance, and the CLT gives us normality when the success-failure condition is met. There's no "$s$ estimating $\sigma$" issue that requires the t-distribution's heavier tails.
Conditions for the Two-Proportion z-Test
| Condition | What to Check |
|---|---|
| 1. Independence (between groups) | The two samples are independent |
| 2. Independence (within groups) | Observations within each group are independent; 10% condition |
| 3. Success-failure condition | Each group needs at least 10 successes AND 10 failures: $n_1\hat{p}_{\text{pooled}} \geq 10$, $n_1(1-\hat{p}_{\text{pooled}}) \geq 10$, and similarly for group 2 |
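The success-failure counts can be checked mechanically. Here's a minimal sketch (the helper name is our own, not a library function), using the counts from James's worked example below:

```python
def success_failure_ok(x1, n1, x2, n2, threshold=10):
    """Check that n * p_pooled and n * (1 - p_pooled) meet the threshold in both groups."""
    p_pool = (x1 + x2) / (n1 + n2)
    counts = [n1 * p_pool, n1 * (1 - p_pool),
              n2 * p_pool, n2 * (1 - p_pool)]
    return all(c >= threshold for c in counts), counts

# James's bail data: 89 successes of 412, and 107 of 388
ok, counts = success_failure_ok(89, 412, 107, 388)
print(ok, [round(c, 1) for c in counts])  # True [100.9, 311.1, 95.1, 292.9]
```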
Confidence Interval for the Difference in Proportions
For the CI, we use the unpooled standard error (because we're not assuming $p_1 = p_2$):
$$SE_{\text{CI}} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
$$(\hat{p}_1 - \hat{p}_2) \pm z^* \cdot SE_{\text{CI}}$$
Note the subtle difference: The test uses the pooled proportion (because we assume $H_0: p_1 = p_2$). The confidence interval uses the unpooled standard error (because we're estimating what $p_1 - p_2$ actually is, without assuming they're equal). This is analogous to how in Chapter 14, the test used $p_0$ in the SE while the CI used $\hat{p}$.
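The pooled-versus-unpooled distinction is easy to make concrete in code. The sketch below (the function name is our own; only `scipy.stats.norm` is assumed) computes the test with the pooled SE and the CI with the unpooled SE, using James's counts from the worked example that follows:

```python
import numpy as np
from scipy import stats

def two_prop_test_and_ci(x1, n1, x2, n2, conf=0.95):
    """Two-proportion z-test (pooled SE) plus CI for p1 - p2 (unpooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    # Test: pool the successes, because H0 assumes p1 = p2
    p_pool = (x1 + x2) / (n1 + n2)
    se_test = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_test
    p_value = 2 * stats.norm.sf(abs(z))
    # CI: unpooled SE, because we estimate p1 - p2 without assuming equality
    se_ci = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = stats.norm.ppf(1 - (1 - conf) / 2)
    diff = p1 - p2
    return z, p_value, (diff - z_star * se_ci, diff + z_star * se_ci)

# James's bail data: 89/412 vs. 107/388
z, p, ci = two_prop_test_and_ci(89, 412, 107, 388)
print(f"z = {z:.3f}, p = {p:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

With unrounded proportions this gives z ≈ −1.96 and a CI that just barely excludes zero, matching the worked example up to rounding.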
Complete Worked Example: James's Algorithmic Bail Study
Professor James Washington has obtained data comparing recidivism outcomes for two groups of defendants released on bail:
- Group 1 (Algorithm-recommended): An algorithmic risk assessment tool recommended bail.
- Group 2 (Judge-recommended): A human judge determined bail without the algorithm.
The question: is the recidivism rate different between the two groups?
The Data:
| | Algorithm-Recommended | Judge-Recommended |
|---|---|---|
| Total defendants | $n_1 = 412$ | $n_2 = 388$ |
| Re-arrested within 2 years | $X_1 = 89$ | $X_2 = 107$ |
| Recidivism rate | $\hat{p}_1 = 89/412 = 0.216$ | $\hat{p}_2 = 107/388 = 0.276$ |
The observed difference: $\hat{p}_1 - \hat{p}_2 = 0.216 - 0.276 = -0.060$. The algorithm group has a recidivism rate 6.0 percentage points lower than the judge group.
Step 1: State the Hypotheses
James uses a two-tailed test (the algorithm might perform better or worse than judges):
$$H_0: p_{\text{alg}} - p_{\text{judge}} = 0$$ $$H_a: p_{\text{alg}} - p_{\text{judge}} \neq 0$$
Step 2: Check the Conditions
- Independence between groups: Defendants were assigned to algorithm-recommended or judge-recommended bail through a quasi-experimental design (different courtrooms in different months). The groups are effectively independent. ✓
- Independence within groups: Individual recidivism outcomes are independent. ✓
- Success-failure condition:
$$\hat{p}_{\text{pooled}} = \frac{89 + 107}{412 + 388} = \frac{196}{800} = 0.245$$
- Group 1: $412 \times 0.245 = 100.9 \geq 10$ ✓ and $412 \times 0.755 = 311.1 \geq 10$ ✓
- Group 2: $388 \times 0.245 = 95.1 \geq 10$ ✓ and $388 \times 0.755 = 292.9 \geq 10$ ✓
All conditions met.
Step 3: Compute the Test Statistic
$$SE = \sqrt{0.245 \times 0.755 \times \left(\frac{1}{412} + \frac{1}{388}\right)}$$
$$SE = \sqrt{0.18498 \times (0.002427 + 0.002577)} = \sqrt{0.18498 \times 0.005004}$$
$$SE = \sqrt{0.000926} = 0.03043$$
$$z = \frac{0.216 - 0.276}{0.03043} = \frac{-0.060}{0.03043} = -1.972$$
Step 4: Find the P-Value
$$p\text{-value} = 2 \times P(Z \leq -1.972) = 2 \times 0.0243 = 0.0486$$
Step 5: Conclude in Context
At $\alpha = 0.05$: Since $p = 0.049 < 0.05$, we reject $H_0$.
Conclusion: There is statistically significant evidence that recidivism rates differ between algorithm-recommended and judge-recommended bail decisions ($z = -1.97$, $p = 0.049$). The algorithm group had a recidivism rate of 21.6% compared to 27.6% for the judge group — a difference of 6.0 percentage points.
95% CI for the difference in proportions (using unpooled SE):
$$SE_{\text{CI}} = \sqrt{\frac{0.216 \times 0.784}{412} + \frac{0.276 \times 0.724}{388}} = \sqrt{\frac{0.1693}{412} + \frac{0.1998}{388}}$$
$$SE_{\text{CI}} = \sqrt{0.000411 + 0.000515} = \sqrt{0.000926} = 0.03044$$
$$(-0.0598) \pm 1.960 \times 0.03044 = -0.0598 \pm 0.0597$$
$$95\% \text{ CI: } (-0.1194, -0.0001)$$
(We keep an extra decimal place here: rounding the difference to $-0.060$ first would put the upper endpoint exactly at zero.)
The CI just barely excludes zero, consistent with the borderline p-value.
James's Analysis: "The algorithm group has a 6-percentage-point lower recidivism rate. That's statistically significant, but barely — and the confidence interval nearly includes zero. This suggests the algorithm might be slightly better than judges at predicting who will re-offend, but the evidence isn't overwhelming.
More importantly, this overall comparison doesn't tell us whether the algorithm works equally well across racial groups. The next question — the one that really matters for justice — is whether the false positive rate differs for Black defendants versus white defendants. That requires stratified comparisons, which I'll need separate two-proportion tests for each subgroup."
Theme 2 Connection: Comparing Groups Is Where Bias Becomes Visible
Here's something profound about this chapter's methods: comparing two groups is often how we discover that a system treats people unequally.
James's overall comparison (algorithm vs. judge) is informative, but the real power of two-group comparisons comes when you break the data down by demographics and ask: does the algorithm work equally well for everyone? If the false positive rate is 15% for white defendants and 30% for Black defendants, that disparity only becomes visible when you compare two groups.
This is why two-group inference is the methodological foundation of fairness audits, pay equity analyses, and discrimination studies. Every time someone asks "is there a gap?" — a gender wage gap, a racial achievement gap, a health disparity — they're running a two-group comparison. The methods in this chapter are tools for justice as much as tools for science.
16.7 Confidence Intervals for Differences: The Full Picture
Concept 4: Confidence Intervals for Differences
Just as a confidence interval for a single parameter tells you how large the parameter plausibly is, a confidence interval for a difference tells you how large the difference between two groups plausibly is. A CI for $\mu_1 - \mu_2$ or $p_1 - p_2$ that contains zero is consistent with "no difference" — equivalent to failing to reject $H_0$.
Summary of CI Formulas
| Scenario | CI Formula |
|---|---|
| Independent means | $(\bar{x}_1 - \bar{x}_2) \pm t^* \cdot \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$ |
| Paired means | $\bar{d} \pm t^*_{n-1} \cdot \dfrac{s_d}{\sqrt{n}}$ |
| Independent proportions | $(\hat{p}_1 - \hat{p}_2) \pm z^* \cdot \sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ |
Interpreting CIs for Differences
The interpretation follows the same logic as Chapter 12, extended to differences:
- If the CI contains zero: The observed difference is consistent with no difference between groups. We fail to reject $H_0$.
- If the CI is entirely positive: Group 1's value is plausibly greater than Group 2's.
- If the CI is entirely negative: Group 1's value is plausibly less than Group 2's.
- Width of the CI: Narrow CIs indicate precise estimates; wide CIs indicate substantial uncertainty about the true difference.
Always report the CI alongside the test. The test tells you WHETHER there's a difference. The CI tells you HOW BIG the difference might be. Both pieces of information matter.
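The interpretation rules above reduce to a tiny helper. This is a sketch with a name of our own choosing, shown with James's CI from Section 16.6:

```python
def interpret_diff_ci(lower, upper):
    """Interpret a confidence interval for a difference (group 1 minus group 2)."""
    if lower > 0:
        return "entirely positive: Group 1 plausibly greater than Group 2"
    if upper < 0:
        return "entirely negative: Group 1 plausibly less than Group 2"
    return "contains zero: consistent with no difference (fail to reject H0)"

print(interpret_diff_ci(-0.120, -0.001))  # entirely negative: Group 1 plausibly less than Group 2
```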
Maya's Community Comparison
Dr. Maya Chen compares hospital admission rates for respiratory illness between two communities:
- Industrial neighborhood: 847 residents surveyed, 127 reported at least one respiratory-related hospital visit in the past year. $\hat{p}_1 = 127/847 = 0.150$.
- Suburban control community: 792 residents surveyed, 81 reported a respiratory-related hospital visit. $\hat{p}_2 = 81/792 = 0.102$.
The observed difference: $\hat{p}_1 - \hat{p}_2 = 0.150 - 0.102 = 0.048$ (4.8 percentage points higher in the industrial neighborhood).
Two-proportion z-test:
$$\hat{p}_{\text{pooled}} = \frac{127 + 81}{847 + 792} = \frac{208}{1639} = 0.1269$$
$$SE = \sqrt{0.1269 \times 0.8731 \times \left(\frac{1}{847} + \frac{1}{792}\right)} = \sqrt{0.1108 \times 0.002442} = \sqrt{0.000271} = 0.01645$$
$$z = \frac{0.048}{0.01645} = 2.918$$
$$p\text{-value} = 2 \times P(Z \geq 2.918) = 2 \times 0.0018 = 0.0035$$
95% CI for the difference (unpooled SE):
$$SE_{\text{CI}} = \sqrt{\frac{0.150 \times 0.850}{847} + \frac{0.102 \times 0.898}{792}} = \sqrt{0.0001504 + 0.0001156} = 0.01631$$
$$0.048 \pm 1.960 \times 0.01631 = 0.048 \pm 0.032$$
$$95\% \text{ CI: } (0.016, 0.080)$$
Maya's Conclusion: "The respiratory illness rate in the industrial neighborhood is 4.8 percentage points higher than in the suburban community, and this difference is statistically significant ($z = 2.92$, $p = 0.004$). The 95% confidence interval suggests the true difference is between 1.6 and 8.0 percentage points. This is consistent with environmental health literature on the effects of industrial air pollution on respiratory outcomes.
Of course, this is an observational study — I can't conclude that industrial pollution caused the higher rates. There could be confounding factors: income differences, access to healthcare, smoking rates, age distributions. But the statistically significant gap justifies further investigation, including air quality monitoring and a multivariate analysis that controls for these potential confounders."
Theme 5 Connection: Correlation vs. Causation in Group Comparisons
Maya's caution is exactly right. Finding a statistically significant difference between two groups does not automatically mean one group's condition caused the difference. In observational studies, the difference could be driven by confounding variables.
- Alex's A/B test (randomized experiment) → significant difference → causal claim justified
- Maya's community comparison (observational study) → significant difference → association only
- James's bail study (quasi-experiment) → significant difference → cautious causal interpretation
The statistical test tells you the difference is real. The study design determines what kind of "real" it is.
16.8 Choosing the Right Test: A Decision Flowchart
Here's the complete decision process for choosing among the three two-group tests:
Comparing two groups?
          │
          ▼
What type of variable?
      ╱         ╲
Numerical      Categorical
 (means)      (proportions)
    │               │
    ▼               ▼
Are the data    Two-proportion
  paired?       z-test (§16.6)
   ╱   ╲
 Yes    No
  │      │
  ▼      ▼
Paired t-test   Two-sample t-test
   (§16.4)      (Welch's, §16.3)
     │                 │
     ▼                 ▼
Compute d_i      Use raw group
differences      data directly
     │                 │
     ▼                 ▼
One-sample       t = (x̄₁ - x̄₂) / SE(difference)
t-test on d
Quick Decision Table
| Question to Ask | If Yes → | If No → |
|---|---|---|
| Is the outcome numerical (means)? | Go to "Are data paired?" | Use two-proportion z-test |
| Are the data paired (same subjects, matched pairs)? | Use paired t-test | Use two-sample t-test (Welch's) |
| Are sample sizes large ($n_1, n_2 \geq 30$)? | CLT handles normality | Check distributions for normality |
| Are variances equal? | Welch's still works fine | Definitely use Welch's |
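The flowchart's logic fits in a few lines of Python. A minimal sketch (the function name is our own):

```python
def choose_test(outcome, paired=False):
    """Pick the two-group test per the Section 16.8 flowchart.
    outcome: 'numerical' (comparing means) or 'categorical' (comparing proportions)."""
    if outcome == 'categorical':
        return 'two-proportion z-test'
    return 'paired t-test' if paired else "two-sample t-test (Welch's)"

print(choose_test('categorical'))              # two-proportion z-test
print(choose_test('numerical', paired=True))   # paired t-test
print(choose_test('numerical'))                # two-sample t-test (Welch's)
```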
Common Traps
| Trap | What Goes Wrong | How to Avoid It |
|---|---|---|
| Using paired test on independent data | "Differences" are meaningless → invalid results | Ask: "Does each observation have a natural partner?" |
| Using independent test on paired data | Ignores pairing → loses power → might miss real effect | Ask: "Were these the same subjects measured twice?" |
| Running two separate one-sample tests instead of one two-group test | Inflated Type I error, can't properly compare | Always use a single test that directly compares the groups |
| Forgetting to check conditions | Invalid p-values and CIs | Check all conditions before computing |
16.9 Python: Two-Group Tests
Here's how to run all three tests in Python.
Two-Sample t-Test (Independent Groups)
import numpy as np
from scipy import stats
# --- Alex's A/B Test ---
# If you have raw data:
np.random.seed(2026)
old_algo = np.random.normal(loc=42.3, scale=18.5, size=247)
new_algo = np.random.normal(loc=46.8, scale=21.2, size=253)
# Welch's t-test (default: equal_var=False)
t_stat, p_value = stats.ttest_ind(new_algo, old_algo, equal_var=False)
print("=== Alex's A/B Test (Welch's t-test) ===")
print(f"Old algorithm: n={len(old_algo)}, mean={np.mean(old_algo):.2f}, "
f"SD={np.std(old_algo, ddof=1):.2f}")
print(f"New algorithm: n={len(new_algo)}, mean={np.mean(new_algo):.2f}, "
f"SD={np.std(new_algo, ddof=1):.2f}")
print(f"Difference: {np.mean(new_algo) - np.mean(old_algo):.2f} minutes")
print(f"t = {t_stat:.4f}")
print(f"p-value (two-tailed) = {p_value:.4f}")
# For one-tailed (SciPy >= 1.7):
result = stats.ttest_ind(new_algo, old_algo, equal_var=False,
alternative='greater')
print(f"p-value (one-tailed, new > old) = {result.pvalue:.4f}")
# --- Confidence interval for the difference ---
diff = np.mean(new_algo) - np.mean(old_algo)
se = np.sqrt(np.var(old_algo, ddof=1)/len(old_algo)
+ np.var(new_algo, ddof=1)/len(new_algo))
# Use large-sample z* for simplicity, or compute Welch df
z_star = 1.960 # 95% CI
ci_lower = diff - z_star * se
ci_upper = diff + z_star * se
print(f"95% CI for difference: ({ci_lower:.2f}, {ci_upper:.2f})")
Two-Sample t-Test from Summary Statistics
from scipy import stats
import numpy as np
def two_sample_t_from_summary(x1_bar, s1, n1, x2_bar, s2, n2,
alternative='two-sided'):
"""
Welch's two-sample t-test from summary statistics.
Parameters:
-----------
x1_bar, s1, n1 : mean, SD, and size of group 1
x2_bar, s2, n2 : mean, SD, and size of group 2
alternative : 'two-sided', 'greater' (group 1 > group 2), or 'less'
Returns:
--------
dict with t-statistic, p-value, Welch df, and 95% CI
"""
se = np.sqrt(s1**2/n1 + s2**2/n2)
t_stat = (x1_bar - x2_bar) / se
# Welch-Satterthwaite degrees of freedom
num = (s1**2/n1 + s2**2/n2)**2
denom = (s1**2/n1)**2/(n1-1) + (s2**2/n2)**2/(n2-1)
df = num / denom
if alternative == 'two-sided':
p_value = 2 * stats.t.sf(abs(t_stat), df)
elif alternative == 'greater':
p_value = stats.t.sf(t_stat, df)
elif alternative == 'less':
p_value = stats.t.cdf(t_stat, df)
# 95% CI
t_star = stats.t.ppf(0.975, df)
diff = x1_bar - x2_bar
ci = (diff - t_star * se, diff + t_star * se)
return {
't_statistic': t_stat,
'p_value': p_value,
'df_welch': df,
'se': se,
'ci_95': ci,
'difference': diff
}
# Alex's data
result = two_sample_t_from_summary(
x1_bar=46.8, s1=21.2, n1=253,
x2_bar=42.3, s2=18.5, n2=247,
alternative='two-sided'
)
print("=== Alex's A/B Test (from summary stats) ===")
print(f"Difference: {result['difference']:.1f} min")
print(f"t = {result['t_statistic']:.4f}")
print(f"Welch df = {result['df_welch']:.1f}")
print(f"p-value = {result['p_value']:.4f}")
print(f"95% CI: ({result['ci_95'][0]:.2f}, {result['ci_95'][1]:.2f})")
Paired t-Test
import numpy as np
from scipy import stats
# --- Sam's Shooting Data ---
last_season = np.array([18, 24, 15, 22, 20, 16, 28, 19, 21, 17, 23, 14])
this_season = np.array([22, 28, 19, 20, 25, 21, 26, 24, 23, 22, 27, 18])
# Compute differences
differences = this_season - last_season
print("Differences:", differences)
print(f"Mean difference: {np.mean(differences):.2f}")
print(f"SD of differences: {np.std(differences, ddof=1):.2f}")
# Paired t-test (equivalent to one-sample t-test on differences)
t_stat, p_value = stats.ttest_rel(this_season, last_season)
print(f"\n=== Sam's Paired t-Test ===")
print(f"t = {t_stat:.4f}")
print(f"p-value (two-tailed) = {p_value:.4f}")
# One-tailed (improvement = this season > last season)
result = stats.ttest_rel(this_season, last_season, alternative='greater')
print(f"p-value (one-tailed, improvement) = {result.pvalue:.4f}")
# Equivalent: one-sample t-test on differences
t2, p2 = stats.ttest_1samp(differences, popmean=0, alternative='greater')
print(f"\nEquivalent one-sample t-test on differences:")
print(f"t = {t2:.4f}, p = {p2:.4f}") # Same results!
# Confidence interval for mean difference
n = len(differences)
d_bar = np.mean(differences)
s_d = np.std(differences, ddof=1)
t_star = stats.t.ppf(0.975, df=n-1)
margin = t_star * s_d / np.sqrt(n)
print(f"\n95% CI for mean difference: ({d_bar - margin:.2f}, "
f"{d_bar + margin:.2f})")
Two-Proportion z-Test
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep
# --- James's Bail Study ---
# Number of "successes" (re-arrested) and sample sizes
count = np.array([89, 107]) # re-arrests in each group
nobs = np.array([412, 388]) # total in each group
# Two-proportion z-test
z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')
print("=== James's Two-Proportion z-Test ===")
print(f"Algorithm group: {count[0]}/{nobs[0]} = {count[0]/nobs[0]:.3f}")
print(f"Judge group: {count[1]}/{nobs[1]} = {count[1]/nobs[1]:.3f}")
print(f"Difference: {count[0]/nobs[0] - count[1]/nobs[1]:.3f}")
print(f"z = {z_stat:.4f}")
print(f"p-value = {p_value:.4f}")
# Confidence interval for the difference in proportions
ci_low, ci_upp = confint_proportions_2indep(
count[0], nobs[0], count[1], nobs[1], method='wald'
)
print(f"95% CI for p1 - p2: ({ci_low:.4f}, {ci_upp:.4f})")
# --- Maya's Community Comparison ---
count_maya = np.array([127, 81])
nobs_maya = np.array([847, 792])
z_maya, p_maya = proportions_ztest(count_maya, nobs_maya)
print(f"\n=== Maya's Community Comparison ===")
print(f"Industrial: {count_maya[0]}/{nobs_maya[0]} = "
f"{count_maya[0]/nobs_maya[0]:.3f}")
print(f"Suburban: {count_maya[1]}/{nobs_maya[1]} = "
f"{count_maya[1]/nobs_maya[1]:.3f}")
print(f"z = {z_maya:.4f}, p = {p_maya:.4f}")
16.10 Excel: Two-Group Tests
Two-Sample t-Test
Excel's T.TEST function handles two-sample tests directly:
| Syntax | =T.TEST(array1, array2, tails, type) |
|---|---|
| array1 | Data range for Group 1 |
| array2 | Data range for Group 2 |
| tails | 1 for one-tailed, 2 for two-tailed |
| type | 1 = paired, 2 = equal variance, 3 = Welch's (recommended) |
Examples:
| What You Want | Formula |
|---|---|
| Welch's two-tailed p-value | =T.TEST(A2:A248, B2:B254, 2, 3) |
| Welch's one-tailed p-value | =T.TEST(A2:A248, B2:B254, 1, 3) |
| Paired two-tailed p-value | =T.TEST(A2:A13, B2:B13, 2, 1) |
| Equal-variance t-test (use rarely) | =T.TEST(A2:A248, B2:B254, 2, 2) |
Computing from Summary Statistics in Excel
| What You Need | Formula |
|---|---|
| Difference in means | =AVERAGE(A2:A248) - AVERAGE(B2:B254) |
| SE of the difference (unpooled, Welch) | =SQRT(VAR.S(A2:A248)/COUNT(A2:A248) + VAR.S(B2:B254)/COUNT(B2:B254)) |
| t-statistic | =difference / SE |
Two-Proportion z-Test in Excel
Excel doesn't have a built-in two-proportion z-test, but you can compute it with formulas:
| What You Need | Formula |
|---|---|
| $\hat{p}_1$ | =successes1 / n1 |
| $\hat{p}_2$ | =successes2 / n2 |
| $\hat{p}_{\text{pooled}}$ | =(successes1 + successes2) / (n1 + n2) |
| SE | =SQRT(p_pooled*(1-p_pooled)*(1/n1 + 1/n2)) |
| z-statistic | =(p1 - p2) / SE |
| p-value (two-tailed) | =2*(1-NORM.S.DIST(ABS(z), TRUE)) |
| p-value (one-tailed, right) | =1-NORM.S.DIST(z, TRUE) |
16.11 Mathematical Details: Formulas at a Glance
For reference, here are all the formulas from this chapter in one place.
Two-Sample t-Test (Independent Groups, Welch's)
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}$$
95% CI:
$$(\bar{x}_1 - \bar{x}_2) \pm t^*_{df} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
Paired t-Test
$$t = \frac{\bar{d}}{s_d / \sqrt{n}} \qquad df = n - 1$$
where $d_i = x_{i,1} - x_{i,2}$, $\bar{d} = \frac{1}{n}\sum d_i$, $s_d = \sqrt{\frac{1}{n-1}\sum(d_i - \bar{d})^2}$
95% CI:
$$\bar{d} \pm t^*_{n-1} \cdot \frac{s_d}{\sqrt{n}}$$
Two-Proportion z-Test
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{\text{pooled}}(1 - \hat{p}_{\text{pooled}})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad \hat{p}_{\text{pooled}} = \frac{X_1 + X_2}{n_1 + n_2}$$
95% CI (unpooled SE):
$$(\hat{p}_1 - \hat{p}_2) \pm z^* \cdot \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
16.12 Progressive Project: Compare Two Groups Within Your Dataset
Time to apply two-group comparisons to your own dataset.
Your Task
1. Identify a meaningful two-group comparison in your dataset. Ideas:
   - Compare a numerical variable between two subgroups (e.g., income for college graduates vs. non-graduates; life expectancy for two continents)
   - Compare a proportion between two subgroups (e.g., smoking rates for men vs. women; graduation rates for public vs. private universities)
   - If your dataset has a time component, create a before/after comparison
2. Determine whether your comparison is independent or paired. Most comparisons in public datasets will be independent, but if you're comparing the same entities across two time periods (e.g., same countries in 2010 vs. 2020), that's paired.
3. Choose and execute the appropriate test. Use the decision flowchart from Section 16.8.
4. Compute and interpret a confidence interval for the difference.
5. Discuss whether the difference is causal or associational, based on the study design.
Template Code
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Load your data
df = pd.read_csv('your_dataset.csv')
# ====== OPTION A: Compare means (independent groups) ======
group1 = df[df['grouping_variable'] == 'Group 1']['numerical_variable'].dropna()
group2 = df[df['grouping_variable'] == 'Group 2']['numerical_variable'].dropna()
# Visualize both groups
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(group1, bins=20, alpha=0.7, label='Group 1', color='steelblue')
axes[0].hist(group2, bins=20, alpha=0.7, label='Group 2', color='coral')
axes[0].legend()
axes[0].set_title('Overlaid Histograms')
axes[1].boxplot([group1, group2], labels=['Group 1', 'Group 2'])
axes[1].set_title('Side-by-Side Box Plots')
plt.tight_layout()
plt.show()
# Summary statistics
print(f"Group 1: n={len(group1)}, mean={np.mean(group1):.3f}, "
f"SD={np.std(group1, ddof=1):.3f}")
print(f"Group 2: n={len(group2)}, mean={np.mean(group2):.3f}, "
f"SD={np.std(group2, ddof=1):.3f}")
# Welch's two-sample t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"\nWelch's t-test: t = {t_stat:.4f}, p = {p_value:.4f}")
# Confidence interval
diff = np.mean(group1) - np.mean(group2)
se = np.sqrt(np.var(group1, ddof=1)/len(group1)
+ np.var(group2, ddof=1)/len(group2))
ci = (diff - 1.96*se, diff + 1.96*se)
print(f"Difference: {diff:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
# ====== OPTION B: Compare proportions ======
# from statsmodels.stats.proportion import proportions_ztest
# count = np.array([successes1, successes2])
# nobs = np.array([n1, n2])
# z_stat, p_value = proportions_ztest(count, nobs)
What to Write in Your Notebook
Add a new section titled "Chapter 16: Two-Group Comparisons" to your Data Detective Portfolio. Include:
- Your research question and the two groups being compared
- Justification for whether the data are independent or paired
- Condition checks with visualizations
- Test results with full interpretation
- Confidence interval with practical interpretation
- Discussion of whether the finding is causal or associational (and why)
- A 2-3 sentence reflection: What does this comparison reveal about your dataset?
16.13 Common Mistakes and How to Avoid Them
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Using a paired test on independent data | Creates meaningless "differences"; results are invalid | Ask "Is there a natural pairing?" before choosing the test |
| Using an independent test on paired data | Ignores the pairing, adds unnecessary noise, reduces power | Compute within-pair differences and use the paired t-test |
| Assuming equal variances without checking | The classic (pooled) t-test can give wrong p-values when variances differ | Use Welch's t-test by default |
| Comparing overlapping CIs and concluding "no difference" | Two CIs can overlap even when the difference is significant | Always compute the CI for the difference, not separate CIs for each group |
| Running two one-sample tests instead of one two-group test | Multiple testing inflates Type I error; doesn't directly answer the comparison question | Use a single test that compares the groups directly |
| Ignoring study design when interpreting results | Finding a difference ≠ proving causation | State whether the study is experimental or observational; causal claims require randomization |
| Forgetting to check conditions for both groups | Violations in either group can invalidate the test | Check normality and independence for each group separately |
16.14 Chapter Summary
Take a step back and see what you've accomplished. You now have three powerful tools for comparing two groups — the most common analysis in applied statistics:
- The two-sample t-test (Welch's) compares means from two independent groups. It's the workhorse of A/B testing, clinical trials, and any study comparing two separate populations.
- The paired t-test compares paired observations by reducing the problem to a one-sample t-test on the differences. It's the go-to for before-and-after designs and matched-pairs studies, and it's often more powerful than the independent-samples test because it eliminates between-subject variability.
- The two-proportion z-test compares proportions from two independent groups. It's essential for public health comparisons, fairness audits, and any study with a yes/no outcome across two populations.
All three tests follow the same logical framework: compute an observed difference, calculate the standard error of that difference, form a test statistic, and compare to a reference distribution. The choice among them depends on two questions: what type of data? (means or proportions) and what type of design? (independent or paired).
And here's the deeper insight: comparing groups is where statistics gets real. One-sample tests are useful for checking benchmarks, but two-group comparisons answer the questions that drive research, policy, and business decisions: Does the treatment work better than the control? Is there a disparity between groups? Did the change make a difference?
Alex's A/B test showed that the new algorithm increases watch time by about 4.5 minutes — a small individual effect with massive business implications. Sam's paired analysis revealed that Daria's improvement is real when you control for opponent difficulty. James's proportion comparison uncovered a statistically significant gap between algorithmic and human bail decisions. And Maya's community comparison documented a health disparity that demands further investigation.
What's Next: Chapter 17 will tackle a critical question lurking behind every test in this chapter: how big is the effect, and did we have enough data to find it? You'll learn about statistical power (the probability of detecting a real effect), effect sizes (how to measure practical significance), and what "statistically significant" really means when you look at the full picture. Sam's borderline results from Chapter 15 — and the question of how many games he'd need to confirm Daria's improvement — will finally get a rigorous answer.
"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey