Further Reading: Hypothesis Testing: Making Decisions with Data

Books

For Deeper Understanding

Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (2013) Wheelan's chapters on hypothesis testing and p-values are among the most accessible popular treatments available. He explains the courtroom analogy with vivid real-world examples and addresses the common misconceptions head-on. If Section 13.6 (p-value misconceptions) resonated with you, Wheelan's treatment will deepen your understanding with additional examples. Previously recommended for confidence intervals (Chapter 12) — the hypothesis testing chapters are equally strong.

David Salsburg, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (2001) The title story is the origin of hypothesis testing itself: Ronald Fisher designed a test to determine whether a woman could really distinguish between tea poured into milk versus milk poured into tea. Salsburg traces the development of hypothesis testing from Fisher's agricultural experiments through its adoption across medicine, psychology, and industry. The personal rivalries between Fisher and Neyman-Pearson — who disagreed fundamentally about what hypothesis testing means — are told with novelistic flair. Essential reading for understanding why the framework looks the way it does.

Stuart Ritchie, Science Fictions: How Fraud, Bias, Negligence, and Hype Undermine the Search for Truth (2020) A comprehensive, accessible account of the replication crisis and its causes. Ritchie covers p-hacking, publication bias, HARKing, and outright fraud with detailed case studies. Directly relevant to Case Study 1 and Section 13.12. If the replication crisis discussion in this chapter left you wanting more, Ritchie's book is the definitive popular treatment.

David Freedman, Robert Pisani, and Roger Purves, Statistics, 4th edition (2007) Chapters 26-29 provide perhaps the most careful textbook treatment of hypothesis testing available. Freedman was meticulous about the distinction between what p-values say and what people think they say. His examples in criminal justice and medicine are particularly relevant to this chapter's themes. Previously recommended for confidence intervals — the hypothesis testing chapters are equally essential.

For the Mathematically Curious

George Casella and Roger Berger, Statistical Inference, 2nd edition (2002) Chapter 8 provides the rigorous mathematical theory of hypothesis testing: the Neyman-Pearson lemma, uniformly most powerful tests, and the likelihood ratio framework. If you want to understand why the z-test and t-test are the optimal tests for normal data, Casella and Berger show the mathematical proof. Heavy going but deeply satisfying for the mathematically inclined.

Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference (2004) Chapter 10 covers hypothesis testing with mathematical precision and connects it to confidence intervals through the "duality principle" discussed in Section 13.11. Wasserman's treatment of multiple testing (Chapter 10.6) goes beyond Bonferroni to cover the False Discovery Rate (FDR), which has become the dominant approach in genomics and other data-rich fields.
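The FDR approach Wasserman describes can be made concrete. Below is an illustrative sketch of the Benjamini-Hochberg step-up procedure (the p-values are invented for the example, and the code is our own, not Wasserman's):

```python
# Benjamini-Hochberg procedure: control the false discovery rate at level q.
def benjamini_hochberg(pvals, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    # ... and reject the k smallest p-values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, q=0.05))
# → [True, True, False, False, False, False, False, False]
```

Note the contrast with Bonferroni: at the corrected threshold $0.05/8 = 0.00625$, Bonferroni rejects only the first hypothesis, while Benjamini-Hochberg rejects two. This extra power at a controlled false discovery rate is why FDR dominates in data-rich fields.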

Articles and Papers

Wasserstein, R. L., & Lazar, N. A. (2016). "The ASA Statement on Statistical Significance and P-Values." The American Statistician, 70(2), 129-133. The landmark American Statistical Association statement referenced in Section 13.6. The six principles about p-values are essential reading for anyone who uses or interprets statistical results. The companion editorial commentary includes perspectives from 20+ statisticians — fascinating reading that reveals how much disagreement exists even among experts. Available free at: https://doi.org/10.1080/00031305.2016.1154108

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). "Moving to a World Beyond 'p < 0.05'." The American Statistician, 73(sup1), 1-19. The ASA's follow-up to the 2016 statement, part of a special issue with 43 invited papers on "Statistical Inference in the 21st Century." The editorial's recommendation: "Don't say 'statistically significant.'" Whether or not you agree with this provocative proposal, the arguments are essential for understanding the current debate about the role of p-values in science. The entire special issue is available free online.

Ioannidis, J. P. A. (2005). "Why Most Published Research Findings Are False." PLOS Medicine, 2(8), e124. One of the most-cited papers in all of science. Ioannidis uses a framework similar to Case Study 1's false discovery rate analysis to show that, under realistic assumptions about base rates, power, and bias, most published positive findings are likely false. The paper was controversial when published but is now widely accepted as a seminal contribution. Directly relevant to Section 13.12 and Case Study 1.
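Ioannidis's core argument rests on a short Bayesian calculation. A minimal sketch with illustrative numbers (the prior and power below are assumptions chosen for the example, not figures from the paper):

```python
# Positive predictive value (PPV) of a "significant" finding:
# PPV = (power * prior) / (power * prior + alpha * (1 - prior))
def ppv(prior, power, alpha):
    """Fraction of significant results that reflect true effects."""
    true_pos = power * prior          # real effects correctly detected
    false_pos = alpha * (1 - prior)   # null effects that pass the test anyway
    return true_pos / (true_pos + false_pos)

# Illustrative scenario: only 1 in 10 tested hypotheses is true,
# studies have 50% power, and the usual alpha = 0.05 threshold.
print(round(ppv(prior=0.10, power=0.50, alpha=0.05), 3))  # → 0.526
```

Even before accounting for bias or p-hacking, barely half of the "significant" findings in this scenario are real; adding those distortions pushes the PPV lower still, which is the heart of Ioannidis's claim.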

Benjamin, D. J., et al. (2018). "Redefine Statistical Significance." Nature Human Behaviour, 2, 6-10. A proposal signed by 72 prominent researchers across many fields to lower the default significance threshold from $p < 0.05$ to $p < 0.005$. The authors argue that this would substantially reduce the false positive rate. The paper generated significant debate — Lakens et al. (2018) published a direct rebuttal arguing that the solution is effect sizes and confidence intervals, not stricter thresholds. Reading both papers together gives excellent insight into the ongoing debate about how to fix hypothesis testing. See Exercise H.4 for the debate framework.

Gelman, A., & Loken, E. (2014). "The Statistical Crisis in Science." American Scientist, 102(6), 460-465. The paper that popularized "the garden of forking paths" metaphor used in Section 13.12. Gelman and Loken explain how researcher degrees of freedom — the many small choices made during data analysis — can produce "significant" results even without intentional p-hacking. Their key insight: the problem isn't that researchers are dishonest; it's that the standard framework doesn't account for the flexibility inherent in data analysis.

Nuzzo, R. (2014). "Statistical Errors: P Values, the 'Gold Standard' of Statistical Validity, Are Not as Reliable as Many Scientists Assume." Nature, 506, 150-152. A highly readable overview of p-value problems for a general scientific audience. Its publication in Nature, one of the world's most prestigious scientific journals, signaled how seriously the scientific community was taking the crisis. Good entry point for readers who want a quick but authoritative summary.

Greenland, S., Senn, S. J., Rothman, K. J., et al. (2016). "Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations." European Journal of Epidemiology, 31(4), 337-350. A comprehensive catalog of 25 common misinterpretations of p-values, confidence intervals, and statistical power. Each misinterpretation is stated clearly and then corrected. The paper serves as an excellent self-test: read each misinterpretation and see if you can identify the error before reading the correction. If you can, you've truly mastered this chapter.

Online Resources

Interactive Tools

Seeing Theory — Hypothesis Testing https://seeing-theory.brown.edu/frequentist-inference/ Brown University's interactive visualization, previewed in Chapter 12's further reading. The hypothesis testing module lets you set up null and alternative hypotheses, collect data, and watch the p-value change in real time. Seeing the p-value as an area under the curve — and watching it shrink as sample size increases — is far more intuitive than any static diagram.

Understanding P-Values Through Simulation https://rpsychologist.com/pvalue/ Kristoffer Magnusson's interactive tool lets you explore what happens to the p-value as you change the effect size, sample size, and significance level. You can watch the sampling distributions under $H_0$ and $H_a$ overlap or separate, and see how this affects Type I errors, Type II errors, and power. The best interactive tool available for building intuition about hypothesis testing.

StatKey: Randomization Tests http://www.lock5stat.com/StatKey/ StatKey's randomization test modules let you conduct hypothesis tests using simulation rather than formulas. This approach (which you'll learn formally in Chapter 18) provides a more intuitive way to understand what p-values measure. Try the "Test for Single Mean" and "Test for Single Proportion" modules with the built-in datasets.

Video Resources

StatQuest with Josh Starmer: "Hypothesis Testing and p-values" (YouTube) Josh Starmer's characteristically clear and enthusiastic explanation of hypothesis testing. He covers the logic, the notation, and the common misconceptions in about 15 minutes. His separate video on p-value misconceptions ("P Values, Clearly Explained") is particularly good. Previewed in Chapter 12's further reading — now you're ready for the full treatment.

3Blue1Brown: "Bayes theorem, the geometry of changing beliefs" (YouTube) While not directly about hypothesis testing, Grant Sanderson's visualization of Bayes' theorem is essential background for understanding why $P(\text{data} \mid H_0) \neq P(H_0 \mid \text{data})$. The geometric intuition he builds makes the p-value misconception almost impossible to commit once you've internalized it. Previously recommended in Chapter 9 — revisiting it now with your hypothesis testing knowledge will deepen the connection.

Veritasium: "The Replication Crisis" (YouTube) Derek Muller's documentary-style exploration of the replication crisis, with interviews with key researchers including Brian Nosek (co-founder of the Open Science Collaboration). At about 20 minutes, it's an efficient and engaging complement to Case Study 1.

Khan Academy: "Hypothesis Testing" (YouTube/khanacademy.org) Sal Khan provides a thorough walkthrough of hypothesis testing with multiple worked examples. His series covers the z-test, t-test, Type I/II errors, and p-values across several focused videos. Particularly good for students who want more practice with the computational steps.

CrashCourse Statistics: "P-Values" (YouTube) A fast-paced, visual overview of p-values and hypothesis testing in about 12 minutes. Good for a quick review or as a complement to the textbook. The animation of the sampling distribution and p-value area is particularly effective.

Software Documentation

SciPy: scipy.stats.ttest_1samp https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html Documentation for the one-sample t-test function used in Section 13.13. Key detail: by default the function returns a two-tailed p-value. For one-tailed tests, divide by 2 (if the test statistic is in the expected direction) or compute $1 - p/2$ (if it's in the opposite direction). The alternative parameter (added in SciPy 1.7) lets you specify 'less', 'greater', or 'two-sided' directly, making the manual conversion unnecessary.
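A short sketch of both routes to a one-tailed p-value (the sample below is simulated purely for illustration):

```python
import numpy as np
from scipy import stats

# Simulated sample; test H0: mu = 100 against Ha: mu > 100.
rng = np.random.default_rng(0)
sample = rng.normal(loc=103, scale=10, size=30)

# Modern SciPy (>= 1.7): request the one-tailed test directly.
t_stat, p_one = stats.ttest_1samp(sample, popmean=100, alternative='greater')

# Older SciPy: convert the default two-tailed p-value by hand.
t_stat2, p_two = stats.ttest_1samp(sample, popmean=100)
p_manual = p_two / 2 if t_stat2 > 0 else 1 - p_two / 2

print(np.isclose(p_one, p_manual))  # the two routes agree
```

The manual conversion works because the two-tailed p-value is twice the area in one tail; the `1 - p/2` branch handles the case where the statistic points away from the alternative.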

SciPy: scipy.stats.norm https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html The standard normal distribution functions used for manual z-tests. Key methods: .cdf(z) for $P(Z \leq z)$, .sf(z) for $P(Z > z)$ (the survival function, more numerically stable than 1 - cdf(z) for large z), and .ppf(q) for finding critical values.
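These three methods are all you need for a manual z-test. A sketch with invented numbers (not data from the chapter):

```python
from scipy import stats

# Hypothetical one-sample z-test: H0: mu = 50, known sigma = 12,
# sample of n = 36 with xbar = 53.5.
xbar, mu0, sigma, n = 53.5, 50, 12, 36
z = (xbar - mu0) / (sigma / n ** 0.5)     # z = 1.75

p_upper = stats.norm.sf(z)                # P(Z > z), right-tailed p-value
p_two = 2 * stats.norm.sf(abs(z))         # two-tailed p-value
z_crit = stats.norm.ppf(0.975)            # critical value for alpha = 0.05, ~1.96

print(round(z, 2), round(p_upper, 4), round(p_two, 4), round(z_crit, 2))
```

Using .sf(z) instead of 1 - .cdf(z) matters in the far tail: for z around 10 and beyond, 1 - cdf(z) rounds to exactly zero in floating point while sf(z) still returns a meaningful small probability.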

statsmodels: proportions_ztest https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_ztest.html The proportions_ztest() function conducts z-tests for one or two proportions. Supports one-tailed and two-tailed alternatives. More convenient than manual calculation for proportion tests like Sam's Daria analysis.
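A sketch of a one-sample use, assuming statsmodels is installed (the counts are invented for illustration, not Sam's actual data):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical data: 34 successes in 50 trials; test H0: p = 0.5
# against Ha: p > 0.5.
z_stat, p_value = proportions_ztest(count=34, nobs=50, value=0.5,
                                    alternative='larger')
print(round(z_stat, 3), round(p_value, 4))
```

One caveat: by default the function's standard error is based on the sample proportion rather than the hypothesized value (see its prop_var argument), so results can differ slightly from the textbook one-sample formula that uses $p_0$ in the denominator.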

pingouin: Statistical Testing Library https://pingouin-stats.org/ A Python library designed for user-friendly statistical testing. The ttest() function returns not just p-values but also effect sizes (Cohen's d), confidence intervals, and Bayes factors — all in one output. Excellent for following the ASA's recommendation to report multiple measures of evidence. Preview for Chapter 17's discussion of effect sizes.

Historical Notes

The Feuding Founders

Hypothesis testing as we know it is actually a hybrid of two competing frameworks developed by two rival camps, and understanding this history helps explain why the framework sometimes feels internally inconsistent.

Ronald Fisher (1890-1962) developed the significance test approach. Fisher viewed the p-value as a continuous measure of evidence — the smaller the p-value, the stronger the evidence against $H_0$. He did not use fixed significance levels or formal decision rules, and his framework had no formal alternative hypothesis. For Fisher, the p-value was a tool for scientific reasoning, not a decision procedure. The question was: "Are these data surprising under the null model?" The 0.05 threshold was a convenient guideline, not a rigid rule.

Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980) developed the hypothesis test approach. Neyman and Pearson formalized the decision framework: two hypotheses ($H_0$ and $H_a$), a fixed significance level ($\alpha$), Type I and Type II errors, and the concept of power. For them, hypothesis testing was about making optimal long-run decisions, not about measuring evidence in any single study.

The framework taught in modern textbooks — including this one — is a merger of these two approaches. We use Fisher's p-value as a measure of evidence and Neyman-Pearson's fixed-$\alpha$ decision rule. The tension between these philosophies is real: Fisher would likely be horrified by the rigid "significant at 0.05" culture, while Neyman and Pearson would question using p-values as evidence measures.

Understanding this history helps explain why the p-value debate is so persistent. The framework was never fully coherent to begin with — it's a pragmatic compromise between two fundamentally different philosophies of inference. The ongoing discussion about "moving beyond $p < 0.05$" is, in some sense, a continuation of the Fisher-Neyman-Pearson debate.

What's Coming Next

Chapter 14 will apply hypothesis testing to proportions — the most common type of test in everyday applications. You'll learn:

  • The one-sample z-test for a proportion (formalizing Sam's Daria test)
  • The two-sample z-test for comparing two proportions (Professor Washington's algorithm audit)
  • Conditions, assumptions, and when the z-test breaks down
  • The connection to confidence intervals for proportions from Chapter 12

Chapter 15 will then introduce the t-test for means — the version you'll use in virtually all real-world applications where $\sigma$ is unknown.

Resources to preview:

  • StatQuest: "One-Proportion Z-Test" (YouTube) — focused walkthrough of the proportion test
  • Khan Academy: "Hypothesis Test for a Proportion" (khanacademy.org) — multiple worked examples
  • Seeing Theory: Hypothesis Testing module — interactive p-value visualization for proportions