Further Reading: Comparing Two Groups
Books
For Deeper Understanding
Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (2013) Wheelan's chapters on hypothesis testing include clear, intuitive explanations of A/B testing and two-group comparisons. His example of comparing fertilizer treatments on lawns makes the independent-samples t-test memorable. Previously recommended for the t-test (Chapter 15), confidence intervals (Chapter 12), and hypothesis testing (Chapter 13) — his two-sample coverage is a natural continuation of those discussions.
David Moore, George McCabe, and Bruce Craig, The Practice of Statistics for Business and Economics, 5th edition (2021) Chapters 17-18 provide one of the clearest textbook treatments of two-sample procedures. The business context (market research, quality control, A/B testing) makes it particularly relevant for Alex's StreamVibe scenario. The authors are careful about the distinction between pooled and Welch's t-tests, and they recommend Welch's as the default — exactly the approach taken in this chapter.
Jessica Utts, Seeing Through Statistics, 4th edition (2015) Utts excels at making statistical reasoning accessible, and her treatment of paired vs. independent designs is particularly insightful. She uses compelling real-world examples (including medical studies and consumer research) to illustrate why choosing the correct test matters. Her discussion of the "paired versus unpaired" decision is among the clearest available.
David Freedman, Robert Pisani, and Roger Purves, Statistics, 4th edition (2007) Chapters 27-28 cover two-sample tests with Freedman's characteristic rigor and skepticism. His insistence on checking conditions carefully and his nuanced discussion of when statistical significance does and doesn't support causal claims directly complement Sections 16.3 and 16.7. This is the reference for understanding what can go wrong with two-group comparisons.
For the Mathematically Curious
George Casella and Roger Berger, Statistical Inference, 2nd edition (2002) Chapter 8 develops the theory of two-sample inference, including the derivation of Welch's approximation for degrees of freedom and the mathematical basis for why the variance of a difference equals the sum of the variances (when the samples are independent). Requires calculus and probability theory.
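The two results this entry highlights can be stated compactly. For independent samples, the variance of the difference in sample means is the sum of the per-group variances of the means:

$$\operatorname{Var}(\bar{X}_1 - \bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2},$$

and Welch's approximation replaces the pooled degrees of freedom with the Welch-Satterthwaite estimate

$$\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$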
Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference (2004) Chapter 10 covers two-sample inference with mathematical precision. Wasserman's treatment of Welch's correction is particularly thorough, and his discussion of when and why the equal-variance assumption fails in practice is illuminating. A good bridge between this chapter and graduate-level inference.
On Algorithmic Fairness
Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (2016) Essential context for James's case study. O'Neil, a mathematician and data scientist, documents how algorithms — from criminal justice risk assessments to teacher evaluations to credit scoring — systematically disadvantage vulnerable populations. Her central argument — that algorithmic systems often encode and amplify historical biases — is demonstrated quantitatively in Case Study 2. Highly readable and deeply unsettling.
Virginia Eubanks, Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor (2018) Where O'Neil provides a broad survey, Eubanks provides three deep case studies of how automated decision-making systems harm low-income Americans. Her chapter on the Allegheny Family Screening Tool — a predictive model used to identify child abuse risk — raises questions that echo James's analysis: even when an algorithm "works" on average, who bears the cost of its errors?
Ruha Benjamin, Race After Technology: Abolitionist Tools for the New Jim Code (2019) Benjamin introduces the concept of the "New Jim Code" — technology that reinforces racial hierarchies while appearing neutral. Her analysis of how data-driven systems reproduce inequality provides essential theoretical grounding for the statistical evidence in Case Study 2.
Articles and Papers
Delacre, M., Lakens, D., & Leys, C. (2017). "Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test." International Review of Social Psychology, 30(1), 92-101. The definitive paper on why Welch's t-test should be the default. The authors demonstrate through simulation that (1) the pooled t-test has inflated Type I error rates when variances are unequal, (2) Welch's test controls Type I error correctly regardless of variance equality, and (3) the power loss from using Welch's when variances ARE equal is negligible. This paper directly supports the recommendation in Section 16.3 to use Welch's by default.
Ruxton, G. D. (2006). "The Unequal Variance t-Test Is an Underused Alternative to Student's t-Test and the Mann-Whitney U Test." Behavioral Ecology, 17(4), 688-690. A short but influential paper arguing that biologists (and other scientists) should abandon the pooled t-test and the practice of testing for equal variances before choosing a test. Ruxton's argument is simple: Welch's test works whether or not variances are equal, so why bother checking?
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). "Machine Bias." ProPublica, May 23, 2016. The landmark investigative report that analyzed the COMPAS recidivism prediction algorithm in Broward County, Florida. ProPublica found that the algorithm's false positive rate for Black defendants (44.9%) was nearly double the rate for white defendants (23.5%) — a finding directly paralleled by James's analysis in Case Study 2. This article launched a national debate about algorithmic fairness that continues today. Available at https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Flores, A. W., Bechtel, K., & Lowenkamp, C. T. (2016). "False Positives, False Negatives, and False Analyses: A Rejoinder to 'Machine Bias: There's Software Used Across the Country to Predict Future Criminals, and It's Biased Against Blacks.'" Federal Probation, 80(2), 38-46. Northpointe's (the COMPAS creator's) response to ProPublica. They argue that COMPAS achieves predictive parity (same positive predictive value across racial groups), which is a different definition of fairness. This paper illustrates the mathematical tension between competing fairness definitions — a tension that cannot be resolved statistically, only ethically.
Chouldechova, A. (2017). "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data, 5(2), 153-163. A rigorous mathematical proof that when base rates differ between groups, it is impossible to simultaneously achieve equal false positive rates, equal false negative rates, and predictive parity. This impossibility result provides the theoretical foundation for the fairness dilemma in Case Study 2 and explains why the algorithm can be "fair" by one definition and "unfair" by another.
Boneau, C. A. (1960). "The Effects of Violations of Assumptions Underlying the t-Test." Psychological Bulletin, 57(1), 49-64. The classic study on t-test robustness, now relevant for the two-sample case. Boneau showed that the pooled t-test is sensitive to unequal variances (especially with unequal sample sizes), while being relatively robust to non-normality. This finding directly motivates the preference for Welch's test in Section 16.3. Previously recommended in Chapter 15 — the two-sample results extend Boneau's one-sample findings.
Online Resources
Interactive Tools
Seeing Theory — Frequentist Inference https://seeing-theory.brown.edu/frequentist-inference/ Brown University's interactive visualization now extends to two-sample comparisons. You can set up two groups, watch the sampling distribution of the difference in means, and see how the p-value and confidence interval respond to changes in sample size and effect size. Particularly powerful for building intuition about when differences are "real."
Interactive Two-Sample t-Test Explorer https://rpsychologist.com/d3/nhst/ Kristoffer Magnusson's interactive tool lets you adjust means, standard deviations, and sample sizes for two groups and see in real time how the test statistic, p-value, power, and confidence interval change. The overlap between the two distributions and the effect size (Cohen's d) are displayed visually. This is the single best tool for developing intuition about two-sample comparisons.
StatKey: Two-Sample Randomization Test http://www.lock5stat.com/StatKey/ StatKey provides both formula-based and randomization-based two-sample tests. The randomization approach (shuffling group labels and recomputing the difference thousands of times) gives an intuitive sense of what the null distribution looks like. Try both approaches on the same data to see how they compare — the randomization approach is a preview of Chapter 18.
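The label-shuffling procedure StatKey animates is easy to replicate in code. The sketch below uses made-up watch-time data (illustrative only, not the chapter's dataset): shuffle the pooled observations, reassign group labels, and count how often the shuffled difference in means is at least as extreme as the observed one.

```python
import random

# Hypothetical data: watch times (minutes) in two groups (illustrative only).
group_a = [18.2, 21.5, 19.7, 23.1, 20.4, 22.8]
group_b = [24.9, 26.2, 23.7, 27.4, 25.1, 26.8]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(group_b) - mean(group_a)

# Under H0 the group labels are arbitrary, so any relabeling of the pooled
# data is equally likely. Shuffle many times and recompute the difference.
pooled = group_a + group_b
rng = random.Random(42)
n_shuffles = 10_000
count_extreme = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = mean(pooled[len(group_a):]) - mean(pooled[:len(group_a)])
    if abs(diff) >= abs(observed):
        count_extreme += 1

# Two-sided randomization p-value: the fraction of shuffles at least as
# extreme as what we actually observed.
p_value = count_extreme / n_shuffles
print(p_value)
```

With data this cleanly separated, almost no shuffles reach the observed difference, so the randomization p-value is tiny, mirroring what the formula-based test would report.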
Video Resources
StatQuest with Josh Starmer: "Two-Sample t-Test: Clearly Explained" (YouTube) Josh Starmer's characteristically energetic walkthrough of the two-sample t-test, with clear visual explanations of the pooled standard error and Welch's correction. His separate video on paired t-tests is equally good. Watch both in sequence for a complete overview.
3Blue1Brown: "But What Is a p-Value?" (YouTube) While not specific to two-sample tests, Grant Sanderson's visual explanation of p-values is essential background. Understanding p-values conceptually makes the two-sample extension straightforward — you're still asking "how surprising is this data under $H_0$?" but now "this data" is a difference between groups.
Khan Academy: "Two-Sample t-Test" and "Paired t-Test" (YouTube/khanacademy.org) Sal Khan provides multiple worked examples of both independent-samples and paired t-tests. His step-by-step approach is ideal for students who want additional practice. The paired t-test videos clearly show how computing differences transforms a paired problem into a one-sample problem.
Crash Course Statistics: "t-Tests: A Matched Pair Made in Heaven" (YouTube) A fast-paced overview covering both independent and paired t-tests. The animation showing how pairing reduces variability is particularly effective. Good for a quick review or as supplementary viewing.
jbstatistics: "Independent-Samples t-Test" and "Paired t-Test" (YouTube) Jeremy Balka provides mathematically precise yet accessible videos on both tests. His discussion of when to use Welch's vs. the pooled t-test is especially clear, and he works through the Welch-Satterthwaite degrees of freedom calculation step by step.
Algorithmic Fairness Resources
ProPublica COMPAS Analysis — Full Methodology https://github.com/propublica/compas-analysis The complete code and data from ProPublica's COMPAS investigation. The analysis uses the exact same two-proportion z-tests covered in this chapter, applied to false positive and false negative rates by race. A remarkable example of statistical methods in investigative journalism.
Google's What-If Tool https://pair-code.github.io/what-if-tool/ An interactive visualization tool for exploring machine learning model fairness. You can examine how a model's predictions differ across demographic groups — essentially an automated fairness audit using the two-group comparison logic from this chapter.
Software Documentation
SciPy: scipy.stats.ttest_ind
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
Full documentation for the independent-samples t-test. Key parameters: equal_var=False (Welch's, the recommended default in this chapter) and alternative ('two-sided', 'less', 'greater'). Note: SciPy's own default is equal_var=True (the pooled test), so you must pass equal_var=False explicitly to get Welch's.
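A minimal usage sketch, with made-up data for two hypothetical StreamVibe variants (illustrative numbers, not the chapter's):

```python
from scipy import stats

# Hypothetical watch times (minutes) for two independent groups.
group_a = [21.3, 18.7, 24.1, 19.5, 22.8, 20.4, 23.9, 17.6]
group_b = [25.2, 27.8, 24.6, 29.1, 26.3, 28.4, 25.9, 27.1]

# equal_var=False requests Welch's t-test; equal_var=True would give
# the pooled test. alternative='two-sided' is the two-tailed test.
result = stats.ttest_ind(group_a, group_b, equal_var=False,
                         alternative='two-sided')
print(result.statistic, result.pvalue)
```

Because group_a comes first, a negative t-statistic here means group_a's mean is lower than group_b's.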
SciPy: scipy.stats.ttest_rel
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html
Full documentation for the paired t-test. Requires two arrays of equal length, where corresponding elements form pairs. Supports the alternative parameter for one-tailed tests. Equivalent to ttest_1samp(a - b, popmean=0).
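The equivalence noted above is easy to verify directly. A sketch with hypothetical before/after measurements on the same six subjects:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements: same six subjects, before and after.
before = np.array([12.1, 14.3, 11.8, 13.5, 12.9, 14.0])
after  = np.array([13.0, 15.1, 12.2, 14.4, 13.1, 15.0])

paired = stats.ttest_rel(before, after)

# Computing the differences reduces the paired problem to a
# one-sample test against a mean of zero.
one_sample = stats.ttest_1samp(before - after, popmean=0.0)

print(paired.statistic, one_sample.statistic)
```

The two calls return the same t-statistic and p-value, which is exactly why pairing is taught as a reduction to the one-sample case.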
statsmodels: proportions_ztest https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_ztest.html The two-proportion z-test function used in this chapter. Accepts arrays of counts and sample sizes. Supports one-tailed and two-tailed alternatives. Returns the z-statistic and p-value.
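The pooled z-statistic that proportions_ztest returns can also be computed by hand, which makes the formula behind the function concrete. A sketch with hypothetical counts (not data from the chapter):

```python
import math

# Hypothetical counts: 45 of 200 "successes" in group 1, 30 of 210 in group 2.
x1, n1 = 45, 200
x2, n2 = 30, 210

p1, p2 = x1 / n1, x2 / n2

# Under H0: p1 = p2, both groups share a common proportion, so the
# standard error uses the pooled estimate.
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Two-sided p-value from the standard normal CDF (via the error function).
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 3), round(p_value, 4))
```

Passing the same counts and sample sizes to proportions_ztest (as arrays of counts and nobs) should reproduce this z-statistic and p-value.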
statsmodels: confint_proportions_2indep https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.confint_proportions_2indep.html Confidence interval for the difference in two independent proportions. Supports multiple methods: 'wald' (standard), 'agresti-caffo' (improved), and 'newcombe' (Wilson-based). The 'newcombe' method generally has better coverage properties.
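The 'wald' method this function offers is simple enough to write out by hand, which clarifies what the fancier methods are improving on. A sketch with hypothetical counts (illustrative only):

```python
import math

# Hypothetical counts: 120 of 400 successes in group 1, 90 of 400 in group 2.
x1, n1 = 120, 400
x2, n2 = 90, 400

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# The Wald standard error uses each group's own (unpooled) proportion --
# unlike the z-test, the CI does not assume the proportions are equal.
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit = 1.959964  # 97.5th percentile of the standard normal, for a 95% CI
lo, hi = diff - z_crit * se, diff + z_crit * se
print(round(lo, 4), round(hi, 4))
```

Wilson-based methods adjust this interval so its actual coverage stays closer to 95% with small samples or extreme proportions, which is why the entry above recommends 'newcombe' over plain 'wald'.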
pingouin: ttest
https://pingouin-stats.org/build/html/generated/pingouin.ttest.html
The pingouin library provides a unified ttest() function that handles one-sample, independent-samples, and paired tests. It returns the t-statistic, p-value, Cohen's d (effect size), 95% CI, power, and Bayes factor — all in a single, clean DataFrame output. Particularly useful for following the recommendation to report effect sizes alongside p-values. Preview for Chapter 17.
What's Coming Next
Chapter 17 will tackle two critical questions that this chapter raised but couldn't answer:
- Power: Sam's borderline results (from Chapter 15) and James's barely-significant overall comparison ($p = 0.049$) both raise the question: did we have enough data to reliably detect a real effect? Power analysis answers this prospectively (how many observations do I need?) and retrospectively (how likely was I to find the effect?).
- Effect sizes: Alex's 4.5-minute difference was statistically significant but might or might not be practically significant. Cohen's d, the odds ratio, and other effect size measures provide standardized ways to quantify how large an effect is, independent of sample size.
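As a small preview of Chapter 17, Cohen's d for two groups is the difference in means divided by a pooled standard deviation. A sketch with hypothetical summary statistics (illustrative numbers, not Alex's data):

```python
import math

# Hypothetical summary statistics for two groups (illustrative only).
mean1, sd1, n1 = 26.5, 6.0, 100
mean2, sd2, n2 = 22.0, 5.5, 100

# Pooled standard deviation, weighting each group's variance by its
# degrees of freedom.
sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                      / (n1 + n2 - 2))

# Cohen's d: how many pooled standard deviations apart the means are.
d = (mean1 - mean2) / sd_pooled
print(round(d, 3))
```

Unlike a p-value, d does not shrink or grow with sample size, which is what makes it useful for judging practical significance.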
Resources to preview:
- StatQuest: "Statistical Power" (YouTube) — clear visual explanation of what power is and why it matters
- Khan Academy: "Effect Size" (khanacademy.org) — Cohen's d and its interpretation
- Seeing Theory: Power module — interactive power curve visualization