Quiz: Comparing Two Groups
Test your understanding of two-sample t-tests, paired t-tests, two-proportion z-tests, and how to choose the right test. Try to answer each question before revealing the answer.
1. The two-sample t-test (independent samples) is used when:
(a) You want to compare means from two related groups (b) You want to compare means from two unrelated groups (c) You want to compare proportions from two groups (d) You want to test a single mean against a benchmark
Answer
**(b) You want to compare means from two unrelated groups.** The two-sample t-test compares means from two *independent* (unrelated) groups — where individuals in one group have no connection to individuals in the other. If the groups were related (e.g., same subjects measured twice), you'd use a paired t-test. If you're comparing proportions, you'd use a two-proportion z-test. If you're testing a single mean against a benchmark, that's a one-sample t-test (Chapter 15).
2. Which of the following is an example of paired (dependent) data?
(a) Comparing average salaries of engineers in California vs. Texas (b) Comparing blood pressure readings of patients before and after medication (c) Comparing test scores of students in two different schools (d) Comparing defect rates at two different factories
Answer
**(b) Comparing blood pressure readings of patients before and after medication.** The blood pressure data are paired because each patient is measured twice — before and after. Each "before" measurement has a natural partner: the "after" measurement for the *same patient*. The other options all involve separate, unrelated groups of individuals (engineers in different states, students at different schools, products from different factories).
3. Welch's t-test is preferred over the equal-variances (pooled) t-test because:
(a) It always produces smaller p-values (b) It works correctly whether or not the population variances are equal (c) It requires fewer assumptions about normality (d) It can only be used with large samples
Answer
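A small simulation illustrates the Type I error claim (a sketch; the sample sizes and standard deviations are illustrative, chosen so the smaller group has the larger variance, the worst case for the pooled test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, reps = 0.05, 2000
rej_pooled = rej_welch = 0

# H0 is true: both groups share mean 0. The smaller group (n=10)
# has the much larger standard deviation.
for _ in range(reps):
    a = rng.normal(0, 5, size=10)
    b = rng.normal(0, 1, size=50)
    rej_pooled += stats.ttest_ind(a, b, equal_var=True).pvalue < alpha
    rej_welch += stats.ttest_ind(a, b, equal_var=False).pvalue < alpha

# The pooled test's rejection rate is far above the nominal 5%;
# Welch's stays close to it.
print(rej_pooled / reps, rej_welch / reps)
```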
**(b) It works correctly whether or not the population variances are equal.** Welch's t-test does not assume equal population variances. When variances *are* equal, Welch's test gives nearly identical results to the pooled test. When variances are unequal, the pooled test can give incorrect p-values (inflated Type I error rates), but Welch's test remains accurate. Since you rarely know whether variances are truly equal, Welch's is the safer default. It does NOT always produce smaller p-values, and it has the same normality assumptions as the pooled test.
4. In a two-sample t-test comparing Group 1 ($n_1 = 40$, $\bar{x}_1 = 85$, $s_1 = 10$) and Group 2 ($n_2 = 35$, $\bar{x}_2 = 80$, $s_2 = 12$), the standard error of the difference is:
(a) $\sqrt{10^2 + 12^2}$ (b) $\sqrt{10^2/40 + 12^2/35}$ (c) $(10 + 12) / \sqrt{40 + 35}$ (d) $\sqrt{(10/40)^2 + (12/35)^2}$
Answer
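The arithmetic can be verified directly; this sketch uses only Python's standard library:

```python
import math

# Question 4's summary numbers: Group 1 (n=40, s=10), Group 2 (n=35, s=12)
n1, s1 = 40, 10.0
n2, s2 = 35, 12.0

# SE of the difference in means: add the VARIANCES, each over its own n
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
print(round(se, 3))  # → 2.572
```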
**(b) $\sqrt{10^2/40 + 12^2/35}$.** The standard error of the difference in means for independent samples is $SE = \sqrt{s_1^2/n_1 + s_2^2/n_2}$. We add the *variances* (not the standard deviations), each divided by its respective sample size. This gives $SE = \sqrt{100/40 + 144/35} = \sqrt{2.5 + 4.114} = \sqrt{6.614} \approx 2.572$.
5. A paired t-test is equivalent to:
(a) Two separate one-sample t-tests, one for each group (b) A one-sample t-test on the within-pair differences (c) A two-sample t-test with equal sample sizes (d) A z-test on the pooled data
Answer
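This equivalence is easy to demonstrate with SciPy; the before/after data below are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements for 25 subjects
rng = np.random.default_rng(0)
before = rng.normal(120, 10, size=25)
after = before - rng.normal(3, 2, size=25)

# Paired t-test on the two columns...
t_rel, p_rel = stats.ttest_rel(before, after)
# ...equals a one-sample t-test on the within-pair differences
t_one, p_one = stats.ttest_1samp(before - after, popmean=0)

print(t_rel, t_one)  # identical statistics
```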
**(b) A one-sample t-test on the within-pair differences.** This is the key insight of the paired t-test. You compute $d_i = x_{i,1} - x_{i,2}$ for each pair, then test whether the mean of these differences ($\bar{d}$) is significantly different from zero using the one-sample t-test from Chapter 15: $t = \bar{d} / (s_d / \sqrt{n})$ with $df = n - 1$.
6. The degrees of freedom for a paired t-test with 25 pairs of observations is:
(a) 48 (b) 50 (c) 25 (d) 24
Answer
**(d) 24.** For a paired t-test, $df = n - 1$ where $n$ is the number of *pairs*, not the total number of observations. With 25 pairs, $df = 25 - 1 = 24$. A common mistake is to use $n_1 + n_2 - 2 = 48$, which would be the degrees of freedom for the equal-variance independent-samples t-test (a different test entirely).
7. In the two-proportion z-test, we use the pooled proportion $\hat{p}_{\text{pooled}}$ in the standard error because:
(a) It gives a smaller standard error (b) Under $H_0$, we assume the two populations have the same proportion (c) It corrects for unequal sample sizes (d) The z-distribution requires pooled estimates
Answer
**(b) Under $H_0$, we assume the two populations have the same proportion.** When testing $H_0: p_1 = p_2$, we assume there's one common proportion $p$ shared by both groups. The best estimate of this common proportion is the pooled proportion $\hat{p}_{\text{pooled}} = (X_1 + X_2)/(n_1 + n_2)$. For the confidence interval (where we don't assume $p_1 = p_2$), we use the unpooled standard error instead.
8. A researcher compares the pass rates on a licensing exam: 72 out of 100 candidates from Program A passed, and 58 out of 90 from Program B passed. The pooled proportion is:
(a) $(72 + 58) / (100 + 90) = 0.684$ (b) $(0.72 + 0.644) / 2 = 0.682$ (c) $72/100 = 0.72$ (d) $(72 \times 100 + 58 \times 90) / (100 + 90)^2$
Answer
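The pooled proportion and the (incorrect) unweighted average can be computed side by side in a short standard-library sketch:

```python
x1, n1 = 72, 100   # Program A: 72 of 100 passed
x2, n2 = 58, 90    # Program B: 58 of 90 passed

# Pooled proportion: combine the raw counts (weights by sample size)
p_pooled = (x1 + x2) / (n1 + n2)
print(round(p_pooled, 3))  # → 0.684

# Contrast with the unweighted average of the two sample proportions,
# which wrongly gives both groups equal weight
p_avg = (x1 / n1 + x2 / n2) / 2
print(round(p_avg, 3))     # → 0.682
```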
**(a) $(72 + 58) / (100 + 90) = 130/190 = 0.684$.** The pooled proportion combines the raw counts from both groups: $\hat{p}_{\text{pooled}} = (X_1 + X_2) / (n_1 + n_2)$. Note that this is NOT the average of the two sample proportions (choice b), which would give equal weight to both groups regardless of sample size. The pooled proportion weights by sample size, which is the correct approach.
9. If a 95% CI for $\mu_1 - \mu_2$ is $(−3.2, 8.6)$, you can conclude that:
(a) Group 1's mean is definitely higher than Group 2's mean (b) There is no difference between the groups (c) The difference is not statistically significant at $\alpha = 0.05$ (d) Group 1's mean is exactly 2.7 higher than Group 2's mean
Answer
**(c) The difference is not statistically significant at $\alpha = 0.05$.** Because the CI contains zero, we cannot rule out the possibility that there's no difference between the groups. This is consistent with failing to reject $H_0: \mu_1 - \mu_2 = 0$ at $\alpha = 0.05$. The CI tells us the true difference could be anywhere from 3.2 units favoring Group 2 to 8.6 units favoring Group 1. We can't say there's "no difference" (that would require proving $H_0$, which we can't), only that the evidence is insufficient to conclude there IS a difference.
10. Alex's A/B test found a 4.5-minute increase in watch time with $p = 0.012$. This means:
(a) There is a 1.2% chance the new algorithm doesn't work (b) If there were truly no difference between algorithms, there's a 1.2% chance of observing a difference this large or larger (c) 98.8% of users preferred the new algorithm (d) The new algorithm increases watch time by exactly 4.5 minutes
Answer
**(b) If there were truly no difference between algorithms, there's a 1.2% chance of observing a difference this large or larger.** The p-value is $P(\text{data} | H_0)$, not $P(H_0 | \text{data})$. It tells us how surprising the observed difference would be if the null hypothesis (no difference) were true. It does NOT tell us the probability that the algorithms are the same (a), the percentage of users who prefer the new algorithm (c), or the exact magnitude of the true difference (d). This is the same p-value interpretation from Chapter 13, now applied to a two-group comparison.
11. When should you use a paired t-test instead of a two-sample t-test?
(a) When the two groups have the same sample size (b) When each observation in one group has a natural partner in the other group (c) When the data are normally distributed (d) When you want more statistical power
Answer
**(b) When each observation in one group has a natural partner in the other group.** The choice between paired and independent tests depends on the *study design*, not on sample sizes (a), normality (c), or a desire for more power (d). If the data are genuinely paired (same subjects before/after, matched pairs, same locations at two times), you must use the paired test. While paired tests often are more powerful (d), that's a *consequence* of properly accounting for the pairing, not a *reason* to misclassify independent data as paired.
12. A researcher uses a two-sample t-test on before/after data from the same 30 subjects. What is the likely consequence?
(a) The test will be too liberal (reject $H_0$ too often) (b) The test will be too conservative (fail to reject when it should) (c) The test will give identical results to the paired t-test (d) The test cannot be computed on paired data
Answer
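A short simulation makes the power loss concrete (a sketch with made-up before/after data; the subject-level spread and the 4-point true effect are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30

# Each subject has their own baseline (large between-subject spread),
# plus a true 4-point drop from before to after.
subject_level = rng.normal(130, 15, size=n)
before = subject_level + rng.normal(0, 3, size=n)
after = subject_level - 4 + rng.normal(0, 3, size=n)

p_paired = stats.ttest_rel(before, after).pvalue
p_wrong = stats.ttest_ind(before, after, equal_var=False).pvalue

# The paired test detects the effect; the (misapplied) two-sample
# test drowns it in between-subject variability.
print(p_paired, p_wrong)
```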
**(b) The test will be too conservative (fail to reject when it should).** By treating paired data as independent, the researcher ignores the correlation between the before and after measurements. The two-sample test includes between-subject variability that the paired test would have eliminated, producing a larger standard error, a smaller test statistic, and a larger p-value. The researcher loses power and might miss a real effect. This is one of the most common mistakes in applied statistics.
13. In a two-proportion z-test with $\hat{p}_1 = 0.35$, $n_1 = 200$, $\hat{p}_2 = 0.28$, $n_2 = 250$, which conditions must be checked?
(a) Normality of both groups using histograms (b) Independence between groups and success-failure condition for each group (c) Equal variances in both groups (d) That both groups have the same sample size
Answer
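The success-failure condition can be checked directly from the numbers in the question (a sketch; the success counts are reconstructed from the given proportions):

```python
n1, n2 = 200, 250
x1, x2 = 0.35 * n1, 0.28 * n2        # 70 and 70 successes
p_pool = (x1 + x2) / (n1 + n2)

# At least 10 expected successes and 10 failures per group,
# using the pooled proportion
checks = [n1 * p_pool, n1 * (1 - p_pool), n2 * p_pool, n2 * (1 - p_pool)]
print([round(c, 1) for c in checks])  # → [62.2, 137.8, 77.8, 172.2]
print(all(c >= 10 for c in checks))   # → True
```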
**(b) Independence between groups and success-failure condition for each group.** For the two-proportion z-test, we need: (1) independence between groups, (2) independence within each group (10% condition), and (3) the success-failure condition — each group must have at least 10 successes and 10 failures using the pooled proportion. Since proportions are categorical, there's no normality condition to check with histograms (a). Equal variances (c) and equal sample sizes (d) are not required.
14. The confidence interval for the difference in proportions uses a different standard error than the test statistic because:
(a) Confidence intervals require more precision (b) The test assumes $p_1 = p_2$ (pooled SE); the CI does not (unpooled SE) (c) Confidence intervals use the t-distribution instead of the z-distribution (d) The CI formula has a mathematical error in most textbooks
Answer
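Using the exam-pass numbers from question 8, the two standard errors can be computed side by side (a sketch; with these data the two values happen to nearly coincide because $\hat{p}_1$ and $\hat{p}_2$ are similar):

```python
import math

x1, n1, x2, n2 = 72, 100, 58, 90
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)

# Pooled SE: used by the test, which assumes p1 = p2 under H0
se_test = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
# Unpooled SE: used by the confidence interval, no equality assumed
se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(round(se_test, 4), round(se_ci, 4))
```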
**(b) The test assumes $p_1 = p_2$ (pooled SE); the CI does not (unpooled SE).** When testing $H_0: p_1 = p_2$, we assume the two proportions are equal and pool the data. But when constructing a CI, we're estimating what $p_1 - p_2$ actually equals — without assuming they're equal. So the CI uses each group's own $\hat{p}$ in the standard error: $SE_{\text{CI}} = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$.
15. A study compares two teaching methods. Method A: $n = 45$, $\bar{x} = 78$, $s = 15$. Method B: $n = 50$, $\bar{x} = 82$, $s = 14$. The 95% CI for $\mu_A - \mu_B$ is approximately:
(a) $(−4 \pm 5.9) = (−9.9, 1.9)$ (b) $(−4 \pm 2.9) = (−6.9, −1.1)$ (c) $(4 \pm 5.9) = (−1.9, 9.9)$ (d) $(−4 \pm 8.5) = (−12.5, 4.5)$
Answer
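The full computation, including the Welch-Satterthwaite degrees of freedom, can be reproduced in a few lines (a sketch using SciPy only for the critical value):

```python
import math
from scipy import stats

n1, x1bar, s1 = 45, 78.0, 15.0   # Method A
n2, x2bar, s2 = 50, 82.0, 14.0   # Method B

se = math.sqrt(s1**2 / n1 + s2**2 / n2)

# Welch-Satterthwaite degrees of freedom
v1, v2 = s1**2 / n1, s2**2 / n2
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
t_star = stats.t.ppf(0.975, df)

diff = x1bar - x2bar
lo, hi = diff - t_star * se, diff + t_star * se
print(round(lo, 1), round(hi, 1))  # → -9.9 1.9
```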
**(a) $(−4 \pm 5.9) = (−9.9, 1.9)$.** $SE = \sqrt{15^2/45 + 14^2/50} = \sqrt{5.0 + 3.92} = \sqrt{8.92} = 2.987$. With large df, $t^* \approx 1.98$. Margin = $1.98 \times 2.987 \approx 5.9$. Difference = $78 - 82 = -4$. CI: $(-4 - 5.9, -4 + 5.9) = (-9.9, 1.9)$. Since the CI contains zero, the difference is not statistically significant at $\alpha = 0.05$.
16. Which of the following is NOT a valid reason to prefer a paired design over an independent-samples design?
(a) It eliminates between-subject variability (b) It often requires fewer total subjects to detect the same effect (c) It guarantees the results will be statistically significant (d) Each subject serves as their own control
Answer
**(c) It guarantees the results will be statistically significant.** No study design guarantees significant results. A paired design *can* be more powerful because it eliminates between-subject variability (a), often requiring fewer subjects (b), with each subject serving as their own control (d). But if there's no real treatment effect, or if the effect is very small relative to within-subject variability, the paired test will (correctly) fail to reject $H_0$.
17. The SciPy function for a Welch's two-sample t-test on raw data is:
(a) stats.ttest_1samp(data1, data2)
(b) stats.ttest_ind(data1, data2, equal_var=False)
(c) stats.ttest_rel(data1, data2)
(d) stats.ttest_ind(data1, data2, equal_var=True)
Answer
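A minimal usage sketch (the data here are simulated for illustration, loosely echoing the summary numbers from question 4):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data1 = rng.normal(85, 10, size=40)
data2 = rng.normal(80, 12, size=35)

# equal_var=False must be passed explicitly to get Welch's test
result = stats.ttest_ind(data1, data2, equal_var=False)
print(result.statistic, result.pvalue)
```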
**(b) `stats.ttest_ind(data1, data2, equal_var=False)`.** `ttest_ind` performs the independent-samples t-test. Passing `equal_var=False` selects Welch's method; note that SciPy's default is `equal_var=True` (the pooled test), so you must set it explicitly. `ttest_1samp` (a) is for one-sample tests. `ttest_rel` (c) is for paired tests. `equal_var=True` (d) gives the classic equal-variance (pooled) t-test, which is not the recommended default choice.
18. Professor Washington finds that the algorithm's false positive rate is 13.3% for white defendants and 31.2% for Black defendants ($z = 4.67$, $p < 0.001$). Which of the following conclusions is MOST appropriate?
(a) The algorithm is racist and should be immediately banned (b) There is strong statistical evidence that the false positive rate differs by race, warranting further investigation into the sources of the disparity (c) The difference is due to confounding variables, not the algorithm itself (d) The p-value is so small that the algorithm must be intentionally biased
Answer
**(b) There is strong statistical evidence that the false positive rate differs by race, warranting further investigation into the sources of the disparity.** The statistical test establishes that the disparity is real (not due to chance), but it doesn't tell us *why* the disparity exists. The algorithm might be using features that serve as proxies for race (e.g., neighborhood, employment history). The difference might reflect systemic biases in the training data. Further investigation is needed to understand the causal mechanism. Choice (a) jumps to policy conclusions beyond what the test supports. Choice (c) dismisses the finding without evidence. Choice (d) conflates statistical significance with intentionality.
19. If you have before-and-after data for 20 subjects and you incorrectly use a two-sample t-test instead of a paired t-test, the degrees of freedom will be:
(a) The same as the paired test ($df = 19$) (b) Larger than the paired test ($df = 38$ instead of 19) (c) Smaller than the paired test (d) It depends on the data
Answer
**(b) Larger than the paired test ($df = 38$ instead of 19).** The two-sample t-test (with equal variances) uses $df = n_1 + n_2 - 2 = 20 + 20 - 2 = 38$, while the paired test uses $df = n - 1 = 20 - 1 = 19$. But having more degrees of freedom doesn't help here — the wrong test has a larger standard error (because it ignores the pairing), leading to a smaller test statistic and larger p-value. The extra degrees of freedom don't compensate for the lost power.
20. Two 95% confidence intervals are reported: Group 1 mean: (40, 50), Group 2 mean: (44, 54). A colleague concludes: "The intervals overlap, so there's no significant difference between the groups." This conclusion is:
(a) Correct — overlapping CIs always mean no significant difference (b) Incorrect — overlapping individual CIs do NOT necessarily mean the difference is non-significant (c) Correct, but only if both groups have the same sample size (d) Incorrect — you need to check if one CI is entirely inside the other
Answer
**(b) Incorrect — overlapping individual CIs do NOT necessarily mean the difference is non-significant.** The right way to compare two means is a single confidence interval for the difference (or a two-sample test), not a visual comparison of the separate intervals. Two individual 95% CIs can overlap even when the difference is statistically significant, because the standard error of the difference, $\sqrt{SE_1^2 + SE_2^2}$, is smaller than the sum $SE_1 + SE_2$ that the overlap rule implicitly uses. Non-overlapping CIs do imply a significant difference, but overlap by itself tells you nothing, and checking whether one interval lies inside the other (d) is not a valid criterion either.
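A numerical illustration of why the overlap rule fails (the means and margins below are hypothetical, chosen so the individual intervals overlap while the interval for the difference excludes zero):

```python
import math

# Hypothetical summary: two group means, each with a 95% CI half-width of 5
m1, m2, margin = 45.0, 53.0, 5.0
se = margin / 1.96                 # SE implied by each group's CI

# The individual CIs overlap...
ci1 = (m1 - margin, m1 + margin)   # (40, 50)
ci2 = (m2 - margin, m2 + margin)   # (48, 58)
print(ci1[1] > ci2[0])             # → True (they overlap)

# ...yet the 95% CI for the DIFFERENCE excludes zero
se_diff = math.sqrt(se**2 + se**2)
lo = (m1 - m2) - 1.96 * se_diff
hi = (m1 - m2) + 1.96 * se_diff
print(round(lo, 2), round(hi, 2))  # both endpoints negative → significant
```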