Further Reading: Power, Effect Sizes, and What "Significant" Really Means

Books

For Deeper Understanding

Jacob Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edition (1988) The foundational text on statistical power and effect sizes. Cohen introduced the d, f, and h effect size measures and the small/medium/large benchmarks used throughout this chapter. While the book is technical, its opening chapters are remarkably accessible and full of practical wisdom. Cohen's insistence that "the primary product of a research inquiry is one or more measures of effect size, not p-values" was decades ahead of its time. If you read one additional source on this chapter's topics, make it Cohen's introduction.

Geoff Cumming, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (2012) Cumming makes a compelling case for replacing the traditional null-hypothesis significance testing framework with what he calls the "new statistics" — a focus on effect sizes, confidence intervals, and meta-analysis. The book includes interactive simulations (available at www.thenewstatistics.com) that bring concepts like power, effect size inflation, and the dance of the p-values vividly to life. Particularly relevant to Sections 17.4–17.8.

Stuart Ritchie, Science Fictions: How Fraud, Bias, Negligence, and Hype Undermine the Search for Truth (2020) Previously recommended in Chapter 13 for its coverage of the replication crisis. Now even more relevant: Ritchie provides detailed accounts of how underpowered studies, publication bias, and p-hacking have distorted entire fields of research. His chapter on the "decline effect" — where published effect sizes shrink over time as larger, better studies replace smaller ones — directly illustrates the winner's curse discussed in Section 17.6.

Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (2013) Wheelan's chapters on regression and inference include intuitive discussions of effect sizes and practical significance. His example of a statistically significant but meaningless difference in fuel economy between two car models is a perfect illustration of the "significant but trivial" problem from Section 17.2. A great companion for readers who want the concepts without the formulas.

For the Mathematically Curious

Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference (2004) Chapter 10 covers power and the Neyman-Pearson framework with mathematical rigor. Wasserman derives the power function for various tests and shows how the uniformly most powerful test relates to the likelihood ratio. For students who want to understand why the power formulas work.

George Casella and Roger Berger, Statistical Inference, 2nd edition (2002) Chapter 8 develops the theory of hypothesis testing, including the formal relationship between $\alpha$, $\beta$, sample size, and the alternative hypothesis. The proof that no test can simultaneously minimize both Type I and Type II error rates — the fundamental tradeoff underlying all of power analysis — is elegantly presented.

On the Replication Crisis

Brian Nosek et al., Estimating the Reproducibility of Psychological Science (2015) The Open Science Collaboration's landmark paper, published in Science, that attempted to replicate 100 psychology studies. This is the primary source for the statistics cited throughout this chapter: 97% of original studies significant vs. 36% of replications; average replication effect size roughly half the original. The supplementary materials contain a treasure trove of data for anyone interested in meta-science.

John Ioannidis, Why Most Published Research Findings Are False (2005) One of the most-cited papers in the history of science. Ioannidis uses a Bayesian framework (similar to the false discovery rate model in Chapter 13's case study) to argue that most published research findings are false when studies are small, effects are small, and researcher degrees of freedom are high. This paper catalyzed the reform movement that produced the ASA statement and the push for pre-registration.
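
To make Ioannidis's argument concrete, here is a minimal sketch of the positive predictive value calculation his framework rests on. The prior, power, and alpha values are illustrative assumptions, not figures from the paper.

```python
# Minimal sketch of the positive predictive value (PPV) logic behind
# Ioannidis (2005). All input numbers are illustrative assumptions.

def ppv(prior, power, alpha):
    """P(effect is real | test came out significant)."""
    true_positives = prior * power          # real effects that reach significance
    false_positives = (1 - prior) * alpha   # null effects that reach significance
    return true_positives / (true_positives + false_positives)

# A speculative field: 10% of tested hypotheses are true, studies run at 21% power.
print(ppv(prior=0.10, power=0.21, alpha=0.05))  # ~0.32: most "findings" are false
print(ppv(prior=0.10, power=0.80, alpha=0.05))  # ~0.64: adequate power helps a lot
```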

Articles and Papers

Wasserstein, R. L., & Lazar, N. A. (2016). "The ASA Statement on Statistical Significance and P-Values." The American Statistician, 70(2), 129-133. The American Statistical Association's 2016 statement, introduced in Chapter 13 and revisited with deeper engagement in Section 17.8. The statement's six principles are the backbone of this chapter's message. Free to read at https://doi.org/10.1080/00031305.2016.1154108

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). "Moving to a World Beyond 'p < 0.05'." The American Statistician, 73(sup1), 1-19. The 2019 follow-up, which proposed retiring the term "statistical significance" entirely. This editorial introduces a special issue of The American Statistician in which dozens of contributed papers chart paths toward a post-p-value future. The editorial's key recommendation: "Don't say 'statistically significant.' Don't say 'not statistically significant.' Don't use p-value thresholds." A provocative and important read.

Cohen, J. (1994). "The Earth Is Round (p < .05)." American Psychologist, 49(12), 997-1003. One of the most influential papers in the history of statistics, and one of the most entertaining. Cohen argues passionately against the ritual of null-hypothesis significance testing, using humor and sharp examples to show why "the earth is round ($p < .05$)" is a ridiculous conclusion. Essential reading for anyone who wants to understand the intellectual roots of this chapter's message.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Psychological Science, 22(11), 1359-1366. A landmark paper that demonstrated how easily p-hacking can produce false positives. The authors showed that common research practices (measuring multiple dependent variables, adding participants until significance is reached, testing subgroups selectively) can inflate false positive rates to over 60%. They then "proved" that listening to the Beatles song "When I'm Sixty-Four" literally makes people younger — a deliberately absurd finding that could not be more relevant to the p-hacking simulation in Section 17.9.
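
In the same spirit, here is a minimal simulation of just one of those practices: measuring several outcomes and reporting whichever comes out significant. The group sizes and number of outcomes are illustrative assumptions, not the paper's design.

```python
# Sketch of one p-hacking mechanism from Simmons et al. (2011): test several
# outcomes under a true null and count how often at least one "works".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, n_outcomes = 2_000, 20, 5

false_positives = 0
for _ in range(n_sims):
    # Both groups come from the same distribution, so every effect is null.
    a = rng.normal(size=(n_outcomes, n_per_group))
    b = rng.normal(size=(n_outcomes, n_per_group))
    pvals = [stats.ttest_ind(a[i], b[i]).pvalue for i in range(n_outcomes)]
    if min(pvals) < 0.05:        # "report the outcome that worked"
        false_positives += 1

print(false_positives / n_sims)  # ~0.23, far above the nominal 0.05
```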

Button, K. S., et al. (2013). "Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience." Nature Reviews Neuroscience, 14(5), 365-376. A systematic review of statistical power in neuroscience, finding that the median power to detect a typical effect was about 21%. Yes, 21% — meaning roughly 4 out of 5 real effects are missed. The paper also demonstrates that underpowered studies that do find significance produce inflated effect sizes (the winner's curse). Essential context for Section 17.6.
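
The winner's curse is easy to reproduce in a few lines: give many underpowered studies the same small true effect, keep only the ones that reach significance, and compare their average estimated effect to the truth. The numbers below are illustrative, not Button et al.'s data.

```python
# Sketch of the winner's curse: at low power, the studies that happen to
# reach significance systematically overestimate the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n = 0.3, 20                      # small true effect, small samples

significant_estimates = []
for _ in range(5_000):
    a = rng.normal(true_d, 1, size=n)    # treatment group
    b = rng.normal(0.0, 1, size=n)       # control group
    if stats.ttest_ind(a, b).pvalue < 0.05:
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        significant_estimates.append((a.mean() - b.mean()) / pooled_sd)

print(np.mean(significant_estimates))    # ~0.75, well above the true d of 0.3
```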

Lakens, D. (2013). "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science." Frontiers in Psychology, 4, 863. A practical tutorial on calculating and reporting effect sizes. Lakens covers Cohen's d, eta-squared, and omega-squared with clear worked examples and R code. The paper also discusses when to use which measure and how to handle unequal sample sizes. A useful reference for the exercises in this chapter.

Benjamin, D. J., et al. (2018). "Redefine Statistical Significance." Nature Human Behaviour, 2, 6-10. A paper signed by 72 prominent researchers proposing to change the default significance threshold from 0.05 to 0.005. The authors argue that $p < 0.05$ produces too many false positives and that $p < 0.005$ would bring the false positive rate more in line with expectations. This proposal is one of the positions in the debate framework from Section 17.8.
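
A back-of-the-envelope version of their argument, using an assumed 10% prior and holding power fixed for simplicity (in practice, a stricter threshold also costs some power at a fixed sample size):

```python
# Share of "significant" results that are false positives at each threshold,
# assuming 10% of tested hypotheses are true and 80% power (assumed numbers).
prior, power = 0.10, 0.80
for alpha in (0.05, 0.005):
    fp = (1 - prior) * alpha        # null effects declared significant
    tp = prior * power              # real effects declared significant
    print(alpha, fp / (fp + tp))    # ~0.36 at 0.05, ~0.05 at 0.005
```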

Online Resources

Interactive Tools

Kristoffer Magnusson's Interactive Visualizations https://rpsychologist.com/d3/nhst/ Previously recommended in Chapter 16 — now even more relevant. This interactive tool lets you adjust effect sizes, sample sizes, and significance levels, and see in real time how power, p-values, and the overlap between distributions change. The "Understanding Statistical Power" module is the single best tool for building intuition about this chapter's concepts.

Kristoffer Magnusson's Cohen's d Visualization https://rpsychologist.com/cohend/ An interactive tool specifically for understanding Cohen's d. You can set the effect size and see the distribution overlap, the probability of superiority, and the number needed to treat. Drag the slider from $d = 0$ to $d = 2$ and watch the distributions separate.
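
Under the tool's assumption of two normal populations with equal variance, those overlap quantities follow directly from d; a minimal sketch:

```python
# How Cohen's d maps to two of the quantities the visualization displays,
# assuming two normal populations with equal variance.
from scipy.stats import norm

d = 0.8  # a "large" effect by Cohen's benchmarks

# Probability of superiority: the chance that a random draw from the
# higher-scoring group beats a random draw from the lower-scoring group.
prob_superiority = norm.cdf(d / 2**0.5)   # ~0.714

# Overlap: the shared area under the two density curves.
overlap = 2 * norm.cdf(-d / 2)            # ~0.689

print(prob_superiority, overlap)
```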

G*Power (Free Power Analysis Software) https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower A widely used free application for conducting power analyses for many types of statistical tests. G*Power has a graphical interface, supports all common tests (t-tests, ANOVA, chi-square, regression), and can generate publication-quality power curves. If you want to move beyond Python's statsmodels for power analysis, G*Power is the standard.

Seeing Theory — Statistical Power Module https://seeing-theory.brown.edu/frequentist-inference/ Brown University's interactive visualization platform includes a module on hypothesis testing that beautifully illustrates the relationship between the null distribution, the alternative distribution, $\alpha$, $\beta$, and power. Watch how changing the effect size or sample size shifts the alternative distribution and changes the power.

Video Resources

StatQuest with Josh Starmer: "Statistical Power, Clearly Explained" (YouTube) Josh Starmer's characteristically energetic walkthrough of statistical power. The visualization of the null and alternative distributions, with the rejection region and the power shaded, is one of the clearest explanations available. Watch this alongside Section 17.6 for maximum understanding.

StatQuest: "Effect Size, Clearly Explained" (YouTube) A companion video covering Cohen's d with visual intuition. Starmer shows how effect size is independent of sample size and why it's essential for interpreting test results.

3Blue1Brown: "Bayes Theorem" (YouTube) While not directly about effect sizes, Grant Sanderson's explanation of Bayes' theorem is essential for understanding false discovery rates and why so many "significant" findings are false (Section 17.10). The visual approach makes the connection between base rates, power, and false positives immediately intuitive.

Crash Course Statistics: "P-Hacking: Crash Course Statistics #30" (YouTube) A fast-paced, entertaining overview of p-hacking with clear examples. Good for reinforcing the concepts from Section 17.9 in a different format.

Daniel Lakens's Coursera Course: "Improving Your Statistical Inferences" https://www.coursera.org/learn/statistical-inferences A free online course by one of the leading experts on effect sizes and statistical reform. The modules on effect sizes, power analysis, and the replication crisis go well beyond what's covered in this chapter. Highly recommended for students who want to develop advanced critical thinking about statistical evidence.

Software Documentation

statsmodels: TTestIndPower https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestIndPower.html The power analysis class used throughout this chapter. Supports solving for any one of the four parameters (effect size, sample size, power, $\alpha$) given the other three. Also supports power curve plotting via the .plot_power() method.
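
A minimal usage sketch; the effect size, power, and alpha below are illustrative:

```python
# Solve for whichever of the four parameters is left unspecified.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many subjects per group to detect d = 0.5 with 80% power at alpha = 0.05?
print(analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05))  # ~64 per group

# And what power does n = 30 per group give for that same effect?
print(analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05))   # ~0.48
```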

statsmodels: NormalIndPower https://www.statsmodels.org/stable/generated/statsmodels.stats.power.NormalIndPower.html Power analysis for z-tests, used for proportion comparisons (James's bail study, Sam's shooting analysis). Same solve-for-any-parameter interface as TTestIndPower.
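
A minimal sketch with illustrative proportions (not the numbers from James's or Sam's analyses):

```python
# Power analysis for comparing two proportions with a z-test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Convert two proportions into Cohen's h, the effect size the class expects.
h = proportion_effectsize(0.60, 0.50)   # ~0.20

# Sample size per group for 80% power at alpha = 0.05.
print(NormalIndPower().solve_power(effect_size=h, power=0.8, alpha=0.05))  # ~388
```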

SciPy: scipy.stats.ttest_ind https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html The t-test function from Chapters 15-16 that produces the test statistic and p-value. Use the t-statistic and degrees of freedom to compute $r^2$ as shown in Section 17.5.
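
A minimal sketch of that conversion, on simulated data:

```python
# Convert a t-test result into an r-squared effect size via
# r^2 = t^2 / (t^2 + df). The data here are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.5, 1, size=40)
b = rng.normal(0.0, 1, size=40)

result = stats.ttest_ind(a, b)     # default equal-variance t-test
df = len(a) + len(b) - 2           # degrees of freedom for this test
r_squared = result.statistic**2 / (result.statistic**2 + df)
print(r_squared)                   # proportion of variance tied to group membership
```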

pingouin: ttest (with built-in effect sizes) https://pingouin-stats.org/build/html/generated/pingouin.ttest.html The pingouin library computes Cohen's d, the 95% CI for the difference, the Bayes factor, and the power — all in a single function call. Returns a clean DataFrame with all the information from Section 17.11's reporting checklist. Highly recommended for efficient, comprehensive reporting.
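
A minimal usage sketch on simulated data:

```python
# One call returns the t-statistic, Cohen's d, the CI, the Bayes factor,
# and the achieved power together in a DataFrame.
import numpy as np
import pingouin as pg

rng = np.random.default_rng(3)
a = rng.normal(0.5, 1, size=40)
b = rng.normal(0.0, 1, size=40)

# Columns include T, dof, p-val, CI95%, cohen-d, BF10, and power.
print(pg.ttest(a, b))
```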

effect_size (Python package) https://pypi.org/project/effect-size/ A lightweight Python package for calculating various effect size measures (Cohen's d, Glass's delta, Hedges' g, eta-squared, omega-squared, and more). Useful when you need effect sizes beyond Cohen's d.

What's Coming Next

Chapter 18 will introduce a radically different approach to inference — one that doesn't require formulas, distributional assumptions, or power analysis at all:

  • The bootstrap generates confidence intervals and hypothesis tests by resampling your actual data thousands of times. No normality assumption, no standard error formula, no reference distribution. You let the data speak for itself (a minimal sketch follows this list).

  • Simulation-based inference provides p-values by simulating what the null hypothesis looks like, then comparing your data to the simulation. It's the computational equivalent of the hypothesis testing framework — and it often gives better results for small samples and non-normal data.
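
As a taste of what's ahead, here is a minimal percentile-bootstrap confidence interval on illustrative (deliberately non-normal) data:

```python
# Preview of Chapter 18: a 95% percentile bootstrap CI for a mean,
# built entirely by resampling the observed data. Illustrative data.
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=50)   # skewed, decidedly non-normal

boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ci_low, ci_high)   # no normality assumption, no standard error formula
```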

Resources to preview:

  • StatQuest: "Bootstrapping, Clearly Explained" (YouTube) — clear visual explanation of the bootstrap
  • StatKey (http://www.lock5stat.com/StatKey/) — interactive bootstrap and randomization test tool
  • Tim Hesterberg's "What Teachers Should Know About the Bootstrap" (2015) — accessible overview for learners