Chapter 33: Key Takeaways
- Interference is the norm in online experiments — ignoring it biases results by 15-40%. When users interact (sharing content, competing for resources, influencing recommendations), the Stable Unit Treatment Value Assumption (SUTVA) fails, and the standard difference-in-means estimator is biased. Positive spillover (sharing) biases the estimate downward; negative spillover (competition) biases it upward. The StreamRec case study demonstrated a 17.7% underestimation of the true direct effect from content sharing alone. Cluster-randomized designs (randomize friend groups, geographic regions, or marketplace zones) restore validity by assigning entire clusters to treatment or control, internalizing interference within clusters. Switchback designs (alternating the entire population between treatment and control across time periods) address global interference in two-sided marketplaces. Synthetic control constructs counterfactuals when there are too few treated units for traditional inference. The choice of design depends on the interference structure — and identifying that structure requires domain knowledge, not just statistical methodology.
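To make the downward bias from positive spillover concrete, here is a minimal simulation. The effect sizes, the dense-mixing spillover model, and the cluster structure are all hypothetical (not the StreamRec numbers); it is a sketch of the mechanism, not of any real system.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, cluster_size = 2_000, 50
direct, spill = 1.0, 0.3                     # hypothetical effect sizes
n = n_clusters * cluster_size

def outcomes(treated, friend_treated_frac):
    # Stylized sharing model: outcome = direct effect of own treatment
    # + positive spillover from treated friends + noise.
    return direct * treated + spill * friend_treated_frac + rng.standard_normal(treated.size)

# (1) User-level randomization: each user's friends are a ~50/50 mix in
# both arms, so the spillover lifts both groups equally and cancels out.
# The naive estimate recovers only the direct effect, understating the
# global effect of launching to everyone (direct + spill = 1.3) by ~23%.
t_user = rng.integers(0, 2, n)
y_user = outcomes(t_user, np.full(n, t_user.mean()))
naive = y_user[t_user == 1].mean() - y_user[t_user == 0].mean()    # ~1.0

# (2) Cluster randomization: entire friend groups share one assignment,
# so treated users also receive spillover from their treated friends.
t_cl = np.repeat(rng.integers(0, 2, n_clusters), cluster_size)
y_cl = outcomes(t_cl, t_cl.astype(float))
clustered = y_cl[t_cl == 1].mean() - y_cl[t_cl == 0].mean()        # ~1.3
# (Standard errors for the clustered design must be computed at the
# cluster level, not the user level.)
```
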
- CUPED reduces variance by 30-50% for free — it should be the default in every experimentation platform. CUPED adjusts the post-experiment outcome by subtracting a linear function of a pre-experiment covariate, removing the predictable component of outcome variance. With a pre-post correlation of $\rho = 0.65$ (typical for engagement metrics), variance is reduced by 42%, equivalent to running the experiment 1.73 times longer. The key insight is that CUPED is unbiased: it removes the same amount from both treatment and control (in expectation) because the pre-experiment covariate is, by definition, unaffected by the treatment. Multivariate CUPED (regression adjustment with multiple covariates) provides additional reductions. Lin (2013) showed that the fully interacted regression estimator is asymptotically efficient and valid even if the regression model is misspecified, making it the theoretically optimal default for variance reduction in randomized experiments.
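The adjustment itself is only a few lines. A sketch on simulated data matching the chapter's $\rho = 0.65$ (the sample size and effect size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, effect = 200_000, 0.65, 0.1          # hypothetical sizes

x = rng.standard_normal(n)                   # pre-experiment covariate
t = rng.integers(0, 2, n)                    # random assignment
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n) + effect * t

# CUPED: subtract the predictable component theta * (x - mean(x)).
# x predates the experiment, so the subtraction is identical in
# expectation for both arms -> the estimator stays unbiased.
theta = np.cov(x, y)[0, 1] / np.var(x)       # OLS slope of y on x
y_cuped = y - theta * (x - x.mean())

naive = y[t == 1].mean() - y[t == 0].mean()
cuped = y_cuped[t == 1].mean() - y_cuped[t == 0].mean()  # same estimand, less noise
reduction = 1 - np.var(y_cuped) / np.var(y)  # ~ rho**2 = 0.4225
```
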
- Peeking inflates the false positive rate from 5% to 20%+ — sequential testing fixes this at no cost. Continuous monitoring of A/B tests with fixed-horizon p-values violates the assumptions of the test. Under the null hypothesis, the running test statistic follows a random walk that will eventually cross any fixed boundary. Daily checks on a 14-day experiment inflate the type I error to approximately 22-26%. Always-valid confidence sequences and the mSPRT (mixture Sequential Probability Ratio Test) provide valid inference under arbitrary stopping rules: the analyst can check results at any time, and the type I error remains at the nominal level. The tradeoff is wider confidence intervals at each individual time point — the price of validity under continuous monitoring. Sequential testing should be the default in production experimentation platforms.
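The inflation is easy to reproduce in an A/A simulation. The daily-z-increment model below is a simplification (equal-sized daily batches, normal statistics), so the exact number is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, days = 20_000, 14

# A/A test: no true effect. Each day contributes an i.i.d. N(0,1)
# z-increment; the analyst checks the cumulative z-statistic every
# day against the fixed-horizon critical value 1.96 (alpha = 0.05).
increments = rng.standard_normal((n_sims, days))
z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, days + 1))

peeking_fpr = (np.abs(z) > 1.96).any(axis=1).mean()  # ~0.22, not 0.05
fixed_fpr = (np.abs(z[:, -1]) > 1.96).mean()         # ~0.05, as designed
```

Only the single look at day 14 preserves the nominal error rate; taking the best of 14 looks roughly quadruples it.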
- Multiple testing is unavoidable at scale — Benjamini-Hochberg controls the proportion of false discoveries, not the probability of any false discovery. With 50 concurrent experiments and 8 metrics each, there are 400 comparisons. Without correction, the probability of at least one false positive is essentially 1.0. Bonferroni controls the familywise error rate (FWER) but is extremely conservative — testing each comparison at $\alpha/400 = 0.000125$ misses many true effects. The Benjamini-Hochberg procedure controls the false discovery rate (FDR) — the expected proportion of false positives among all discoveries — at a target level (e.g., 5%). This is the right error guarantee for experimentation platforms: some false positives are acceptable as long as the proportion is controlled. BH is more powerful than Bonferroni and should be the default correction for multi-metric, multi-experiment analyses.
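The step-up procedure is simple to implement: sort the p-values, compare each to its rank-scaled threshold $q \cdot k/m$, and reject everything up to the largest p-value under its threshold. The p-values below are invented to show the power gap versus Bonferroni:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of discoveries, controlling the FDR at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up rule: find the largest k with p_(k) <= q * k / m.
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True       # reject the k smallest p-values
    return reject

pvals = [0.001, 0.009, 0.012, 0.014, 0.2, 0.4, 0.6, 0.9]
bh = benjamini_hochberg(pvals, q=0.05)                # 4 discoveries
bonferroni = np.asarray(pvals) <= 0.05 / len(pvals)   # only 1 discovery
```

On the same eight p-values, Bonferroni's per-test threshold of $0.05/8 = 0.00625$ keeps only the smallest, while BH keeps four while still bounding the expected fraction of false discoveries at 5%.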
- SRM is a data quality alarm, not a statistical nuisance — it means the experiment is broken. A sample ratio mismatch (the observed treatment/control ratio deviating significantly from the configured ratio) indicates that the randomization mechanism has a systematic bias. This is not random variation — it is evidence of a bug (redirect differences, performance discrepancies, differential bot filtering). The experiment results under SRM are unreliable because the treatment and control groups are no longer comparable. The correct response is to stop the experiment, investigate the root cause, fix it, and re-run. No statistical correction can salvage an SRM-contaminated experiment. Every experiment should have an automated SRM check that runs daily.
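The automated check reduces to a two-sided test of the observed split against the configured ratio. A stdlib-only sketch — the very strict `alpha = 0.001` is a common convention for SRM alarms (to avoid crying wolf on daily checks), not a universal standard:

```python
import math

def srm_check(n_treatment, n_control, expected_frac=0.5, alpha=0.001):
    """Flag a sample ratio mismatch via a two-sided z-test on the split."""
    n = n_treatment + n_control
    se = math.sqrt(expected_frac * (1 - expected_frac) / n)
    z = (n_treatment / n - expected_frac) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value < alpha, p_value

# 51,000 vs 49,000 under a 50/50 config: a 2-point shortfall in control
# on 100k users is wildly improbable by chance -> stop and investigate.
broken, p_bad = srm_check(51_000, 49_000)
# 50,100 vs 49,900: well within random variation -> no alarm.
fine, p_ok = srm_check(50_100, 49_900)
```
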
- Novelty and primacy effects cause short-run experiments to misestimate long-run treatment effects — run experiments for at least 2-4 weeks. Novelty effects (users engage more with new features because they are new) inflate the first-week treatment effect; primacy effects (users' established habits are disrupted) deflate it. Both are transient. The StreamRec case study showed a 99% overestimation of the long-run effect from novelty in the first week. Detection requires plotting the daily treatment effect over time and testing for a trend. Mitigation requires running experiments long enough for the transient to dissipate, or restricting analysis to new users who have no prior experience with the old system.
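Detection can be as simple as regressing the daily effect estimates on the day number and testing the slope. A sketch on simulated daily effects — the decay rate, effect sizes, and noise level are all hypothetical, chosen so the week-1 average roughly doubles the long-run effect:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
days = np.arange(1, 29)            # a four-week experiment

# Daily treatment effects: a novelty spike decaying toward the true
# long-run effect of +1%, plus daily estimation noise.
long_run, novelty = 0.01, 0.02
daily_effect = (long_run + novelty * np.exp(-days / 5)
                + rng.normal(0, 0.001, days.size))

# A significantly negative slope flags a decaying (novelty) transient:
# the week-1 average overstates the near-steady-state week-4 average.
fit = linregress(days, daily_effect)
week1 = daily_effect[:7].mean()
week4 = daily_effect[21:].mean()
```

A positive slope would instead suggest a primacy effect, with the early estimate understating the long-run effect.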
- Experimentation culture determines whether rigorous methodology matters — the HiPPO (the Highest Paid Person's Opinion) is the most dangerous threat to evidence-based decisions. The most sophisticated experimentation platform is useless if the organization does not trust its results. Building experimentation culture requires executive sponsorship, default-to-experiment policies, acceptance of surprising results, and investment in statistical education for non-technical stakeholders. At scale, approximately two-thirds of tested ideas show no significant effect. An organization that expects every experiment to succeed will stop running experiments when results are "disappointing" — and will lose the ability to distinguish effective changes from ineffective ones.