Further Reading: Chapter 3
Experimental Design and A/B Testing
The Essential Text
1. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing --- Ron Kohavi, Diane Tang, Ya Xu (Cambridge University Press, 2020) This is the book on A/B testing in industry. Written by veterans of Microsoft, Google, and LinkedIn's experimentation platforms, it covers everything from sample size calculation to organizational pitfalls to advanced topics like network effects and long-term experiments. Chapters 1-5 are directly relevant to this chapter. Chapters 19-22 cover the pitfalls (peeking, multiple testing, Simpson's paradox) with real-world examples from production systems at scale. If you read one resource from this list, read this book. It is practical, opinionated, and informed by decades of collective experience running millions of experiments.
Statistical Foundations
2. "Statistical Power Analysis for the Behavioral Sciences" --- Jacob Cohen (2nd edition, 1988)
The foundational text on effect size and power analysis. Cohen introduced the effect size measures (Cohen's d, Cohen's f) and the conventions for small, medium, and large effects that are still widely used. Chapter 2 covers the logic of power analysis. Chapter 3 covers the specific case of comparing two means (the t-test), which is the most common A/B test scenario. Dense but precise. The statsmodels.stats.power module in Python implements Cohen's formulas.
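Cohen's framework translates directly into code via the statsmodels module mentioned above. A minimal sketch for the two-sample t-test scenario his Chapter 3 covers (the effect size, alpha, and power values here are illustrative choices, not from the book):

```python
# How many users per group are needed to detect a "small" effect
# (Cohen's d = 0.2) with 80% power at alpha = 0.05?
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.2,  # Cohen's d: "small" by his conventions
    alpha=0.05,       # two-sided significance level
    power=0.8,        # desired probability of detecting the effect
)
print(round(n_per_group))  # on the order of 400 per group
```

Note how quickly the required sample size grows as the effect shrinks: halving d to 0.1 roughly quadruples the sample size, which is why small-effect A/B tests need so much traffic.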
3. "The New Statistics: Why and How" --- Geoff Cumming (Psychological Science, 2014) A persuasive argument for moving beyond p-values to confidence intervals and effect sizes. Cumming introduces "estimation thinking" as a replacement for "hypothesis testing thinking" and makes the case that confidence intervals communicate uncertainty more effectively than binary significance decisions. Particularly relevant to Section 3.4 of this chapter, where we discussed reporting with confidence intervals rather than p-values alone. Freely available online.
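Cumming's "estimation thinking" is straightforward to put into practice. A sketch of CI-centered reporting for a difference in means, using synthetic data and a normal approximation (both my choices, not from the paper):

```python
# Report an effect estimate with a 95% confidence interval rather
# than a bare significance verdict.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.10, 0.05, size=5000)  # control metric (synthetic)
b = rng.normal(0.11, 0.05, size=5000)  # treatment metric (synthetic)

diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
z = stats.norm.ppf(0.975)              # ~1.96 for a 95% interval
lo, hi = diff - z * se, diff + z * se
print(f"difference = {diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

The interval communicates both the size of the effect and the precision of the estimate, which a p-value alone cannot.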
Peeking and Sequential Testing
4. "Peeking at A/B Tests: Why It Matters, and What to Do About It" --- Ramesh Johari, Pete Koomen, Leonid Pekelis, David Walsh (KDD, 2017) The definitive paper on the peeking problem in online experimentation, from the team at Optimizely. The paper formalizes how continuous monitoring inflates false positive rates and introduces the mSPRT (mixture Sequential Probability Ratio Test) as a solution. The simulation in Case Study 2 of this chapter demonstrates the same phenomenon. If you want to understand the mathematical foundations of "always-valid" p-values, start here. Available on arXiv.
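The inflation Johari et al. formalize is easy to reproduce in a few lines. A small simulation sketch in the spirit of Case Study 2 (the batch size and number of looks are arbitrary choices for illustration):

```python
# Under the null (no true effect), checking a t-test after every
# batch of data inflates the false positive rate far above the
# nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_looks, batch = 1000, 20, 50
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_looks * batch)  # both arms drawn from the
    b = rng.normal(size=n_looks * batch)  # same distribution
    for look in range(1, n_looks + 1):
        _, p = stats.ttest_ind(a[:look * batch], b[:look * batch])
        if p < 0.05:            # "peek" and stop at first significance
            false_positives += 1
            break
print(false_positives / n_experiments)  # well above the nominal 0.05
```

With 20 interim looks, the realized false positive rate typically lands several times above the nominal level, which is exactly the phenomenon the mSPRT is designed to fix.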
5. "Safe Testing" --- Peter Grünwald, Rianne de Heide, Wouter Koolen (Journal of the Royal Statistical Society: Series B, 2024) A comprehensive treatment of safe, anytime-valid hypothesis testing. This paper introduces e-values and e-processes as alternatives to p-values that allow continuous monitoring without inflating error rates. More theoretical than Johari et al. but provides a rigorous foundation for modern sequential testing. For practitioners, the key insight is that safe tests can be stopped at any time and still maintain their statistical guarantees --- exactly what peeking-prone organizations need.
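To give a feel for the idea with a toy example of my own (much simpler than the mixture constructions in the paper): for a Bernoulli null, the likelihood ratio against a fixed simple alternative is an e-process, and Ville's inequality guarantees that under the null it exceeds 1/alpha with probability at most alpha, no matter how often you look:

```python
# A toy e-process: test H0 "the coin is fair" (p = 0.5) against a
# fixed alternative p1 = 0.6 by tracking a likelihood ratio. It can
# be monitored after every single observation and stopped at any time.
import numpy as np

rng = np.random.default_rng(3)
p0, p1, alpha = 0.5, 0.6, 0.05
flips = rng.random(2000) < 0.6   # data actually drawn from the alternative

e = 1.0
for heads in flips:
    e *= (p1 / p0) if heads else ((1 - p1) / (1 - p0))
    if e >= 1 / alpha:           # reject H0; the guarantee still holds
        break
print(e >= 1 / alpha)            # typically True under this alternative
```

The paper's safe tests are built from richer families of such e-processes (mixtures over alternatives), but the anytime-valid guarantee works the same way.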
Variance Reduction
6. "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data" --- Alex Deng, Ya Xu, Ron Kohavi, Toby Walker (WSDM, 2013) The original CUPED paper. The authors show that adjusting experiment metrics using pre-experiment covariates can reduce variance by 50% or more, dramatically increasing the sensitivity of A/B tests. The method is simple (linear regression adjustment), well-understood, and implemented in most production experimentation platforms. Section 3.7 of this chapter implements CUPED from scratch. This paper gives the theoretical justification and demonstrates the gains on real Microsoft experiments.
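The core of CUPED really is a one-line linear adjustment: subtract theta times the centered pre-experiment covariate from the metric, with theta = cov(Y, X) / var(X). A minimal sketch on synthetic data (the covariate strength here is an arbitrary choice):

```python
# CUPED-style variance reduction using a pre-experiment covariate.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(10, 2, size=n)         # pre-experiment metric (covariate)
y = x + rng.normal(0, 1, size=n)      # in-experiment metric, correlated with x

theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())  # same mean as y, lower variance

print(np.var(y_cuped) / np.var(y))    # variance ratio well below 1
```

Because the adjustment term has mean zero, the treatment effect estimate is unchanged; only its variance shrinks, by a factor of roughly one minus the squared correlation between Y and X.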
Multiple Testing
7. "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing" --- Yoav Benjamini, Yosef Hochberg (Journal of the Royal Statistical Society, 1995)
The paper that introduced the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR). In contrast to the Bonferroni correction, which controls the probability of any false positive, BH controls the expected proportion of false positives among all significant results. This is the right framework when you are testing many metrics and expect some to be truly affected. The procedure is simple enough to implement in a few lines of Python (or use statsmodels.stats.multitest.multipletests). A landmark paper in statistics that changed how researchers think about multiple comparisons.
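As promised, the procedure fits in a few lines. A sketch (cross-checkable against statsmodels.stats.multitest.multipletests with method="fdr_bh"; the p-values below are made up for illustration):

```python
# The Benjamini-Hochberg step-up procedure for FDR control.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array: which hypotheses are rejected at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # q * i / m for i = 1..m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # largest i with p_(i) <= q*i/m
        rejected[order[:k + 1]] = True         # reject all smaller p-values too
    return rejected

p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p_vals))  # rejects the two smallest p-values
```

Note the step-up logic: once the largest qualifying rank k is found, every hypothesis with a smaller p-value is rejected as well, even if its own comparison failed.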
Practical Implementation
8. "Online Experimentation at Microsoft" --- Ron Kohavi, Thomas Crook, Roger Longbotham, et al. (2009) An early and candid paper describing how Microsoft built its experimentation platform. Covers the organizational challenges (getting teams to trust the platform), the technical challenges (logging infrastructure, randomization correctness), and the cultural challenges (getting PMs to accept negative results). Valuable for understanding that A/B testing is not just a statistical problem but a systems and organizational problem. Many of the pitfalls described in this chapter were first documented here.
9. Designing Machine Learning Systems --- Chip Huyen (O'Reilly, 2022), Chapter 7: "Model Deployment and Monitoring" Huyen's treatment of A/B testing in the context of ML deployment is the best practical bridge between offline model evaluation and online experimentation. She covers canary deployments, shadow mode testing, and the progression from offline metrics to online A/B tests. Particularly useful for understanding how A/B testing fits into the broader ML deployment lifecycle that we will cover in Part VI of this book.
Simpson's Paradox and Causal Reasoning
10. "Simpson's Paradox in the Wild" --- Various Sources The Wikipedia article on Simpson's paradox is surprisingly well-written and includes several famous real-world examples, including the UC Berkeley gender admissions case and the kidney stone treatment study. For a more rigorous treatment, see Judea Pearl's The Book of Why (2018), Chapter 6, which explains Simpson's paradox through the lens of causal reasoning and argues that the paradox reveals the fundamental difference between statistical and causal thinking.
Blog Posts and Practitioner Perspectives
11. Evan Miller --- "How Not to Run an A/B Test" (Blog Post) A concise, widely cited blog post that explains the peeking problem with visual clarity. Miller shows how the false positive rate increases with repeated checking and introduces the concept of sequential testing as a fix. The post is responsible for convincing many industry practitioners to take the peeking problem seriously. Available at evanmiller.org.
12. Emily Robinson and David Robinson --- "Build a Career in Data Science" (2020), Chapter 14 While not a technical resource, this chapter covers the organizational dynamics of being a data scientist who says "the test was not significant" to a stakeholder who does not want to hear it. Practical advice on how to have the hard conversation described in Section 3.6 of this chapter, including how to frame negative results constructively and maintain credibility while being honest.
How to Use This List
If you read nothing else, read Kohavi, Tang, and Xu (item 1). It is the single most comprehensive resource on A/B testing in industry, and it covers every topic in this chapter in greater depth and with more examples.
If you want to understand the math behind power analysis, start with Cohen (item 2) and then move to the CUPED paper (item 6) for variance reduction.
If your organization struggles with peeking, read Johari et al. (item 4) and Evan Miller's blog post (item 11). Together they explain the problem and the solution in under two hours.
If you are responsible for building or choosing an experimentation platform, read Kohavi et al. 2009 (item 8) for the systems perspective and Deng et al. (item 6) for the variance reduction you should demand from any platform.
This reading list supports Chapter 3: Experimental Design and A/B Testing. Return to the chapter to review the concepts before diving into these readings.