Chapter 33: Further Reading

Essential Sources

1. Ron Kohavi, Diane Tang, and Ya Xu, "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing" (Cambridge University Press, 2020)

The definitive practitioner's reference on online experimentation at scale. Kohavi, Tang, and Xu draw on decades of combined experience building experimentation platforms at Microsoft, Google, and LinkedIn to cover the full lifecycle: experiment design, implementation, analysis, and organizational culture. The book is unusual in devoting substantial attention to the non-statistical challenges — SRM diagnostics, instrumentation bugs, experiment interactions, and the organizational politics of data-driven decision-making.

Reading guidance: Part I (Chapters 1-5) covers foundational concepts and is accessible to readers without a statistics background. Chapter 3's treatment of "Twyman's Law" (any figure that looks interesting or different is usually wrong) is an essential mindset calibration for experimenters. The statistical methodology that corresponds to Sections 33.2-33.11 of this chapter is concentrated in Part V (Chapters 17-23): Chapter 18 covers variance reduction (CUPED, stratification, regression adjustment), Chapter 22 covers leakage and interference between variants, and Chapter 17 covers the statistical pitfalls of continuous monitoring that motivate sequential testing. Organizational and cultural aspects, including the HiPPO problem (Section 33.14) and the challenge of building experimentation culture, run through Part I, especially Chapter 4 on experimentation platforms and culture. The book's case studies of experiments that produced counterintuitive results — features that everyone expected to work but didn't, and vice versa — provide compelling evidence for the necessity of rigorous testing. The book is intentionally light on mathematical derivations; readers seeking proofs should supplement with the primary papers cited below. For a cross-company survey of open problems in building experimentation platforms, see Gupta et al., "Top Challenges from the First Practical Online Controlled Experiments Summit" (SIGKDD Explorations, 2019).
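The SRM diagnostics the book emphasizes reduce to a chi-square goodness-of-fit test on assignment counts. A minimal sketch, with illustrative counts (the `srm_check` helper is our own, not from the book):

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square goodness-of-fit test for sample ratio mismatch (SRM).

    Returns the chi-square statistic and its p-value (1 degree of freedom).
    A tiny p-value signals a broken assignment or logging pipeline,
    not a real treatment effect."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # Chi-square(1 df) survival function: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# A 50/50 split that drifted: 10,000 vs 10,300 users looks innocuous
# (a 1.5% imbalance) but is statistically implausible under 50/50.
chi2, p = srm_check(10_000, 10_300)
```

The practical point is that with hundreds of thousands of users, even a fraction-of-a-percent imbalance yields a minuscule p-value, so SRM checks make a sensitive automated guardrail.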

2. Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker, "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data" (WSDM, 2013)

The paper that introduced CUPED (Controlled-experiment Using Pre-Experiment Data) — the variance reduction technique that has become standard in production experimentation platforms at Microsoft, Netflix, Airbnb, Uber, and most major technology companies. Deng et al. derive the optimal linear adjustment for pre-experiment covariates, prove unbiasedness under random assignment, and demonstrate 30-50% variance reduction on real Bing experiments.

Reading guidance: Section 3 derives the CUPED estimator from the theory of control variates, making the connection to Monte Carlo simulation explicit. The optimal $\theta$ (the coefficient that minimizes variance) is the OLS regression coefficient of the post-experiment metric on the pre-experiment metric — a result that is simple but has profound practical implications because the pre-post correlation for engagement metrics is typically 0.5-0.8. Section 4 extends to multiple covariates, showing that multivariate CUPED is equivalent to regression adjustment. The experimental results on Bing (Section 5) demonstrate variance reductions of 33-50% on real metrics, with no degradation in bias. For the theoretical grounding of regression adjustment in randomized experiments, see Winston Lin, "Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman's Critique" (Annals of Applied Statistics, 2013), which proves that the fully interacted regression $Y \sim D + X + D \times X$ is consistent and cannot hurt asymptotic precision even when the linear model is misspecified — a stronger guarantee than CUPED alone provides. For an extension that replaces the pre-experiment covariate with a machine-learned prediction of the outcome, see "Improving Experimental Power through Control Using Predictions as Covariate (CUPAC)" (DoorDash Engineering Blog, 2020).
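The adjustment itself is a few lines once $\theta$ is estimated. A minimal self-contained sketch on simulated data (the function names and the simulation parameters are illustrative, not from the paper):

```python
import random

def cuped_adjust(y, x):
    """CUPED: y_adj_i = y_i - theta * (x_i - mean(x)), where theta is the
    OLS slope of the post-period metric y on the pre-period metric x.
    Centering x preserves the mean of y, so the treatment-effect estimate
    is unchanged while its variance shrinks by roughly rho^2."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    theta = cov / (sum((xi - mx) ** 2 for xi in x) / n)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

random.seed(0)
# Pre-period metric x and a correlated post-period metric y (rho^2 ~ 0.47)
x = [random.gauss(10, 2) for _ in range(5000)]
y = [0.7 * xi + random.gauss(3, 1.5) for xi in x]
y_adj = cuped_adjust(y, x)
reduction = 1 - variance(y_adj) / variance(y)  # close to rho^2
```

In a real experiment the adjustment is computed per arm (or with a pooled $\theta$, as the paper recommends) before running the usual t-test on the adjusted metric.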

3. Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh, "Peeking at A/B Tests: Why It Matters, and What to Do About It" (KDD, 2017)

The paper that formalized the peeking problem for online A/B tests and proposed the mSPRT (mixture Sequential Probability Ratio Test) as a practical solution. Johari et al. prove that continuous monitoring with fixed-horizon p-values inflates the type I error rate to 5-10 times the nominal level and derive always-valid p-values that maintain type I error control under arbitrary stopping rules.

Reading guidance: Section 2 contains the key simulation that motivates the entire paper: checking a two-sample z-test daily for 30 days inflates the type I error from 5% to approximately 22%. This result is reproducible and robust — it is the single most convincing argument for sequential testing. Section 3 introduces the mSPRT, which replaces the fixed-horizon likelihood ratio with a mixture over the alternative hypothesis, producing a test statistic that is a non-negative martingale under the null. Ville's inequality then guarantees that the probability of ever exceeding $1/\alpha$ is at most $\alpha$. Section 4 provides the practical implementation details: how to choose the mixing parameter $\tau^2$ (which controls the tradeoff between early and late stopping power), how to compute always-valid confidence intervals (by inverting the mSPRT), and how to handle variance estimation. For the theoretical foundations of confidence sequences (a generalization of always-valid confidence intervals), see Howard, Ramdas, McAuliffe, and Sekhon, "Time-uniform, Nonparametric, Nonasymptotic Confidence Sequences" (Annals of Statistics, 2021), which provides nonparametric confidence sequences based on sub-Gaussian and sub-exponential assumptions. For the Bayesian perspective on optional stopping, see Rouder, "Optional Stopping: No Problem for Bayesians" (Psychonomic Bulletin & Review, 2014) — though note that this result holds for Bayesian inference but not for frequentist error rates.
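The Section 2 simulation is easy to reproduce. A minimal sketch under the null (the sample sizes and the 30-day check schedule are illustrative choices, not the paper's exact configuration):

```python
import math
import random

def peeking_false_positive_rate(n_sims=1000, days=30, n_per_day=50, z_crit=1.96):
    """Simulate A/A experiments (true effect = 0), applying a two-sample
    z-test at nominal alpha = 0.05 after each day's data arrives and
    stopping the first time it rejects. The fraction of runs that ever
    reject is the effective type I error under daily peeking."""
    rejections = 0
    for _ in range(n_sims):
        sa = ssa = sb = ssb = 0.0
        n = 0
        for _ in range(days):
            for _ in range(n_per_day):
                a, b = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
                sa += a; ssa += a * a
                sb += b; ssb += b * b
            n += n_per_day
            var_a = ssa / n - (sa / n) ** 2
            var_b = ssb / n - (sb / n) ** 2
            z = ((sa - sb) / n) / math.sqrt((var_a + var_b) / n)
            if abs(z) > z_crit:
                rejections += 1
                break
    return rejections / n_sims

random.seed(7)
rate = peeking_false_positive_rate()  # several times the nominal 5%
```

Replacing the fixed threshold with an always-valid boundary (the mSPRT's $1/\alpha$ crossing rule) restores type I error control under this same check-every-day loop.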

4. Michael Hudgens and M. Elizabeth Halloran, "Toward Causal Inference with Interference" (Journal of the American Statistical Association, 2008)

The foundational paper on causal inference under interference. Hudgens and Halloran formalize the partial interference assumption (interference within clusters, not between clusters), define the direct effect, indirect effect (spillover), total effect, and overall effect as distinct causal estimands, and derive identification conditions and estimators for each under cluster randomization.

Reading guidance: Section 2 establishes the notation for potential outcomes under interference: $Y_i(\mathbf{d})$ depends on the full treatment vector $\mathbf{d}$, not just $d_i$. The key insight is that under partial interference, the potential outcomes reduce to $Y_i(\mathbf{d}_c)$ where $\mathbf{d}_c$ is the treatment vector within unit $i$'s cluster — making the problem tractable. Section 3 defines the four causal estimands. The direct effect is the effect of treating unit $i$ while holding the cluster's treatment allocation fixed; the indirect effect (spillover) is the effect of increasing the cluster's treatment allocation while holding unit $i$'s assignment fixed. Section 4 derives Horvitz-Thompson and Hajek estimators for each. The paper is rigorous but accessible — the notation is consistent with the potential outcomes framework from Chapter 16 of this textbook. For the extension to network interference (where the cluster structure is not known a priori but inferred from the social graph), see Athey, Eckles, and Imbens, "Exact P-values for Network Interference" (JASA, 2018). For applications to marketplace experiments, see Blake and Coey, "Why Marketplace Experimentation Is Harder than It Seems: The Role of Test-Control Interference" (EC, 2014), which demonstrates test-control interference among competing buyers in eBay marketplace experiments. For the connection between cluster randomization and geo-experiments, see Vaver and Koehler, "Measuring Ad Effectiveness Using Geo Experiments" (Google Technical Report, 2011).
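In notation consistent with the paper's two-stage setup (writing $\bar{Y}(d;\psi)$ for the population-average potential outcome under individual assignment $d$ when clusters receive allocation strategy $\psi$, and up to sign conventions), the four estimands compare two allocation strategies $\psi$ and $\phi$:

$$\overline{DE}(\psi) = \bar{Y}(1;\psi) - \bar{Y}(0;\psi), \qquad \overline{IE}(\psi,\phi) = \bar{Y}(0;\phi) - \bar{Y}(0;\psi),$$

$$\overline{TE}(\psi,\phi) = \bar{Y}(1;\phi) - \bar{Y}(0;\psi), \qquad \overline{OE}(\psi,\phi) = \bar{Y}(\phi) - \bar{Y}(\psi).$$

Adding and subtracting $\bar{Y}(0;\phi)$ gives $\overline{TE}(\psi,\phi) = \overline{DE}(\phi) + \overline{IE}(\psi,\phi)$, which is why the total effect decomposes cleanly into a direct component and a spillover component.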

5. Alberto Abadie, Alexis Diamond, and Jens Hainmueller, "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program" (Journal of the American Statistical Association, 2010)

The paper that introduced the synthetic control method and demonstrated its application to policy evaluation. Abadie, Diamond, and Hainmueller estimate the effect of California's 1988 Proposition 99 (a large tobacco tax increase) on per-capita cigarette sales by constructing a "synthetic California" from a weighted combination of control states that matches California's pre-treatment trajectory.

Reading guidance: Section 2 defines the method formally: the synthetic control weights solve a constrained optimization problem that minimizes the distance between the treated unit's pre-treatment outcomes and the weighted combination of control units' pre-treatment outcomes. The constraint that weights be non-negative and sum to one prevents extrapolation. Section 3 applies the method to California's tobacco program, showing that synthetic California closely tracks actual California from 1970 to 1988 (the pre-treatment period) and then diverges sharply after 1988, with annual per-capita cigarette sales estimated at roughly 26 packs lower than they would have been without Proposition 99. Section 4 introduces placebo tests — the key inferential tool for synthetic control. By applying the method to each control state as if it were treated, the authors construct a distribution of "placebo effects" and show that California's effect is an extreme outlier, providing a permutation-based p-value. For the method's original application, see Abadie and Gardeazabal, "The Economic Costs of Conflict: A Case Study of the Basque Country" (AER, 2003). For the use of synthetic control in tech company geo-experiments, see Brodersen et al., "Inferring Causal Impact Using Bayesian Structural Time-Series Models" (Annals of Applied Statistics, 2015), which implements a Bayesian version of synthetic control in Google's CausalImpact R package. For the application to climate policy evaluation, see Aldy and Stavins, "The Promise and Problems of Pricing Carbon: Theory and Experience" (Journal of Environment and Development, 2012).
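The weight-finding step in Section 2 is a constrained least-squares problem. A minimal sketch that keeps the weights on the simplex via an exponentiated-gradient update (the toy series and the optimizer choice are illustrative; Abadie et al. also match on covariates and weight predictors with a separate matrix, which is omitted here):

```python
import math

def synthetic_control_weights(y_treated, y_controls, steps=50_000, lr=0.01):
    """Minimize || y_treated - sum_j w_j * y_controls[j] ||^2 over the
    pre-treatment periods, subject to w_j >= 0 and sum_j w_j = 1.
    The multiplicative (exponentiated-gradient) update keeps w on the
    simplex at every step, so no projection is needed."""
    J, T = len(y_controls), len(y_treated)
    w = [1.0 / J] * J
    for _ in range(steps):
        # Residual of the current synthetic control in each pre-period t
        r = [y_treated[t] - sum(w[j] * y_controls[j][t] for j in range(J))
             for t in range(T)]
        # Gradient of the squared loss with respect to each weight
        g = [-2.0 * sum(r[t] * y_controls[j][t] for t in range(T))
             for j in range(J)]
        w = [wj * math.exp(-lr * gj) for wj, gj in zip(w, g)]
        s = sum(w)
        w = [wj / s for wj in w]  # renormalize to sum exactly to one
    return w

# Toy pre-treatment trajectories: the "treated" series is an exact
# simplex combination (0.5, 0.3, 0.2) of three control series.
controls = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [2, 1, 4, 3, 5]]
treated = [2.4, 2.4, 3.2, 3.2, 3.8]
w = synthetic_control_weights(treated, controls)
```

The treatment-effect estimate is then the post-period gap between the treated series and the weighted combination; the placebo test of Section 4 simply reruns this fit with each control unit playing the treated role.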