Further Reading: Hypothesis Testing
Hypothesis testing is one of those topics where reading more doesn't just deepen your understanding — it can genuinely change how you think about evidence and claims. The resources below range from accessible introductions to sobering critiques.
Tier 1: Verified Sources
Jacob Cohen, Statistical Power Analysis for the Behavioral Sciences (Routledge, 2nd edition, 1988). Cohen literally wrote the book on effect sizes and statistical power. His framework for classifying effects as small, medium, and large remains the standard. The book is technical but foundational. If you want to understand why we use d = 0.2, 0.5, and 0.8 as benchmarks, start here.
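Cohen's d is also easy to compute yourself, which makes the benchmarks concrete. A minimal sketch (the two samples here are made up for illustration):

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    var1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical measurements from a treatment and a control group
treatment = [5.1, 5.8, 6.2, 5.5, 6.0, 5.7]
control = [4.8, 5.0, 5.3, 4.9, 5.2, 5.1]

d = cohens_d(treatment, control)
print(round(d, 2))  # compare against Cohen's 0.2 / 0.5 / 0.8 benchmarks
```

Because d is expressed in standard-deviation units, the same benchmarks apply regardless of the measurement scale — which is exactly why Cohen's classification caught on.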
David Spiegelhalter, The Art of Statistics: How to Learn from Data (Basic Books, 2019). Spiegelhalter's chapters on hypothesis testing, p-values, and the replication crisis are among the clearest and most balanced treatments available. He's particularly good at explaining what p-values do and don't mean in plain English. Highly recommended if the interpretation sections of our chapter left you wanting more.
Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (W.W. Norton, 2013). Wheelan explains hypothesis testing with humor and real-world examples. His chapter on the logic of inference uses courtroom and medical testing analogies that many readers find illuminating. A great confidence-builder if the topic feels intimidating.
Russ Poldrack, Statistical Thinking for the 21st Century (online textbook, 2022+). A modern open-access statistics textbook that covers hypothesis testing with an emphasis on simulation, replication, and common misunderstandings. Poldrack is a neuroscientist who has been vocal about the replication crisis, and his treatment of the topic is informed by that experience. Available freely online.
Sander Greenland et al., "Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations," European Journal of Epidemiology (2016). A comprehensive catalog of 25 common misinterpretations of p-values, confidence intervals, and power. Reading this paper is like getting a vaccination against statistical misunderstanding. It's aimed at researchers but accessible to anyone who has completed our Chapter 23.
Tier 2: Attributed Resources
American Statistical Association Statement on P-Values (2016). In a historic move, the ASA issued six principles about p-values to combat widespread misinterpretation. The statement is short (about 6 pages), accessible, and essential reading. Search "ASA Statement on Statistical Significance and P-Values" or look in The American Statistician, Volume 70, Issue 2.
John Ioannidis, "Why Most Published Research Findings Are False," PLOS Medicine (2005). The paper that ignited the replication crisis discussion. Ioannidis uses mathematical models to show that under common research conditions (low power, small effects, many researchers testing the same hypotheses), the majority of published positive findings may be false. The paper is freely available and has been cited over 10,000 times.
Open Science Collaboration, "Estimating the Reproducibility of Psychological Science," Science (2015). The landmark replication study discussed in Case Study 2. The paper documents the attempt to replicate 100 psychology studies and reports the sobering results. Freely available through Science's website.
Simmons, Nelson, and Simonsohn, "False-Positive Psychology," Psychological Science (2011). The paper that demonstrated how researcher degrees of freedom (p-hacking) can produce any result you want — including the "finding" that listening to a Beatles song makes people younger. A vivid and entertaining demonstration of the multiple testing problem.
StatQuest with Josh Starmer (YouTube). Starmer's videos on p-values, hypothesis testing, t-tests, and statistical power are excellent. His "P-values, clearly explained" video has helped millions of people understand what p-values actually mean. Search "StatQuest hypothesis testing" or "StatQuest p-values."
xkcd comic #882, "Significant." Randall Munroe's comic strip illustrates the multiple testing problem with jelly beans: scientists test whether jelly beans of 20 different colors cause acne, find that green jelly beans are "significant" (p < 0.05), and publish "GREEN JELLY BEANS LINKED TO ACNE." A perfect illustration of the problem in four panels. Search "xkcd significant."
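The jelly-bean arithmetic is easy to check by simulation. Under a true null hypothesis, a well-calibrated p-value is uniformly distributed on [0, 1], so the chance that at least one of 20 independent null tests comes out "significant" at the 0.05 level is 1 − 0.95^20, roughly 64%. A sketch (not from the comic itself; the simulation setup is our own):

```python
import random

random.seed(42)

n_sims = 10_000
n_colors = 20  # one test per jelly-bean color, all nulls true
hits = 0

for _ in range(n_sims):
    # Under the null, each p-value is a uniform draw on [0, 1]
    p_values = [random.random() for _ in range(n_colors)]
    if min(p_values) < 0.05:
        hits += 1  # at least one "GREEN JELLY BEANS LINKED TO ACNE" headline

print(round(hits / n_sims, 2))  # close to 1 - 0.95**20, about 0.64
```

Run enough tests and a false positive is not a risk — it's an expectation.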
Andrew Gelman's blog (Statistical Modeling, Causal Inference, and Social Science). Gelman, a statistician at Columbia, has been one of the most thoughtful and persistent critics of how hypothesis testing is used in practice. His blog is full of detailed analyses of specific studies that illustrate the pitfalls discussed in this chapter.
Recommended Next Steps
- If p-value interpretation still feels tricky: Read the ASA statement and Greenland et al.'s misinterpretation guide. Then watch StatQuest's p-value video. Encountering the correct interpretation from multiple angles helps it stick.
- If the replication crisis fascinates you: Read Ioannidis (2005) first, then the Open Science Collaboration (2015). Together they provide the theoretical argument and the empirical evidence for why the crisis exists.
- If you want to understand power more deeply: Cohen's book is the definitive reference. For a more modern and accessible treatment, search for G*Power (a free power analysis software tool) and work through its tutorials.
- If you're interested in alternatives to p-values: Look into Bayesian hypothesis testing, which replaces p-values with Bayes factors — a more direct measure of evidence for or against a hypothesis. Spiegelhalter's book has a gentle introduction. For a deeper treatment, search for "Bayesian Data Analysis" by Gelman et al. (Chapman & Hall/CRC, 3rd edition, 2013).
- If you want to see hypothesis testing done well in practice: Look at the CONSORT guidelines for reporting randomized controlled trials (search "CONSORT statement"). They represent best practices for transparent, complete reporting of hypothesis tests in clinical research.
- If you're ready to move on: Chapter 24 explores correlation and causation — the relationship between variables and the critical distinction between "these things go together" and "one thing causes another." It builds directly on the hypothesis testing framework from this chapter.
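Power analysis need not involve closed-form tables at all: you can estimate power by simulating the experiment many times and counting how often it rejects. A minimal sketch, using Cohen's classic case of d = 0.5 with 64 subjects per group (the critical value 1.96 and the simulation settings are our simplifying assumptions, not anything from a specific tool):

```python
import math
import random

random.seed(7)

def two_sample_t(a, b):
    """Welch-style t statistic for two independent samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def estimated_power(effect_size, n_per_group, crit=1.96, sims=4000):
    """Fraction of simulated experiments where |t| exceeds the critical value."""
    hits = 0
    for _ in range(sims):
        a = [random.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        b = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(two_sample_t(a, b)) > crit:
            hits += 1
    return hits / sims

# d = 0.5, n = 64 per group: textbook power is approximately 0.80
power = estimated_power(0.5, 64)
print(round(power, 2))
```

The simulation approach generalizes to designs that no table covers, which is one reason modern treatments of power lean on it.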
A Personal Note
Hypothesis testing has been called "the most useful and the most abused concept in statistics." Both descriptions are accurate. The framework is powerful when used honestly — it gives you a principled way to distinguish signal from noise. But it's also easy to misuse, and the consequences of misuse (in medicine, policy, science, and business) can be severe.
The antidote to misuse isn't to abandon hypothesis testing — it's to understand it deeply enough to use it responsibly. That's what this chapter aimed to give you. The further reading above will take you even deeper. And the habits you build now — reporting effect sizes, thinking about power, being honest about uncertainty — will serve you throughout your career.