Key Takeaways: Chapter 3

Experimental Design and A/B Testing


  1. A model that performs well offline is a hypothesis, not a conclusion. Offline evaluation tells you the model might work. An A/B test tells you whether it actually does. The gap between holdout set performance and real-world impact is where most ML projects lose credibility. Design the experiment before you build the model, not after.

  2. Power analysis is a negotiation tool, not a formality. Sample size calculation translates business questions ("how small an effect do we care about?") into engineering constraints ("how long do we need to run?"). When a stakeholder says "detect a 0.5% lift," show them the power curve and let the math guide the conversation. The relationship between minimum detectable effect, sample size, and duration is the most important intuition in experimental design.
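A minimal sketch of that calculation, using the normal approximation for a two-proportion test; the baseline rate (10%) and minimum detectable effect (0.5 percentage points) are illustrative numbers, not figures from the chapter:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_alt = p_base + mde
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Illustrative: 10% baseline conversion, 0.5-percentage-point absolute MDE.
n = sample_size_per_arm(0.10, 0.005)        # roughly 58K users per arm
# Halving the MDE roughly quadruples the required sample size --
# this is the lever to show the stakeholder.
n_half = sample_size_per_arm(0.10, 0.0025)
```

Dividing the per-arm sample size by daily traffic gives the duration estimate that anchors the negotiation.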

  3. Choose one primary metric and pre-register it before the experiment starts. Multiple metrics invite cherry-picking. If you test six metrics and celebrate whichever one is significant, your effective false positive rate is not 5% --- it is closer to 26%. Designate one metric for the launch decision. All others are secondary and should be corrected for multiple testing or treated as exploratory.
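The 26% figure follows directly from the math, assuming six independent metrics each tested at alpha = 0.05:

```python
alpha = 0.05
k = 6  # number of metrics tested

# Probability that at least one of k independent tests is a false positive.
family_wise_fpr = 1 - (1 - alpha) ** k   # ~0.265, not 0.05
# A simple correction: Bonferroni divides alpha across the tests.
bonferroni_alpha = alpha / k             # test each secondary metric at ~0.0083
```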

  4. Peeking at results inflates false positive rates far beyond the nominal alpha. Checking daily for a 21-day experiment can push your effective false positive rate to 15-25%, even when alpha is set to 5%. If you must monitor during the experiment, use sequential testing methods that adjust the significance threshold at each check. If you do not have sequential testing infrastructure, do not look until the pre-registered analysis date.
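The inflation is easy to reproduce by simulation. The sketch below models an A/A test (no true effect) where the cumulative z-statistic is checked daily for 21 days; it treats each day's contribution as an iid standard normal increment, a simplification that captures the mechanism rather than any particular traffic pattern:

```python
import random

random.seed(0)

def peeking_fpr(n_days=21, n_sims=5000, z_crit=1.96):
    """Simulate daily peeking under the null; count runs flagged at any check."""
    false_positives = 0
    for _ in range(n_sims):
        cum = 0.0
        for day in range(1, n_days + 1):
            cum += random.gauss(0, 1)   # day's iid N(0,1) contribution
            z = cum / day ** 0.5        # cumulative z-statistic after `day` days
            if abs(z) > z_crit:         # "significant" at this peek -> stop
                false_positives += 1
                break
    return false_positives / n_sims

fpr = peeking_fpr()  # well above the nominal 0.05
```

The exact value depends on how traffic accrues, but with 21 looks it lands far above 5%, in line with the range quoted above.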

  5. Guardrail metrics protect you from winning the metric battle and losing the product war. A recommendation algorithm that increases click-through rate but slows page load time by 200ms is a net loss. Define guardrail metrics --- latency, error rates, customer complaints, return rates --- before the experiment starts. If any guardrail degrades significantly, pause and investigate regardless of how good the primary metric looks.

  6. Statistical significance is not practical significance. A p-value of 0.001 means the observed effect is very unlikely to be a fluke of randomization. It does not mean the effect is large enough to matter. Always pair statistical tests with confidence intervals and explicitly compare the effect size to the minimum threshold that justifies action. A statistically significant $0.02 lift in revenue per user is not worth $500K in engineering effort.
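A sketch of how the two checks can be paired in an analysis script; the function name and the $0.10 practical-significance bar are illustrative assumptions:

```python
from statistics import NormalDist

def launch_verdict(lift, se, min_practical_effect, alpha=0.05):
    """Pair the significance test with an explicit practical-significance check."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci_low, ci_high = lift - z * se, lift + z * se
    statistically_sig = ci_low > 0 or ci_high < 0  # CI excludes zero
    practically_sig = abs(lift) >= min_practical_effect
    return ci_low, ci_high, statistically_sig, practically_sig

# Illustrative: a $0.02 lift measured very precisely, against a $0.10 bar.
low, high, stat_sig, prac_sig = launch_verdict(lift=0.02, se=0.005,
                                               min_practical_effect=0.10)
# stat_sig is True, prac_sig is False: significant, but not worth launching.
```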

  7. Simpson's paradox will find you. A treatment that wins overall can lose in every meaningful segment (or vice versa) if the segment mix is unbalanced between groups. Always break down results by key segments --- platform, geography, user tier, new vs. returning. Stratified randomization prevents the worst cases; stratified analysis catches the rest.
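A small hypothetical dataset makes the paradox concrete: the treatment wins within every segment, yet loses overall because its traffic skews toward the low-converting segment.

```python
# Hypothetical (conversions, users) per segment and arm -- not chapter data.
control = {"mobile": (100, 1000), "desktop": (10, 500)}
treatment = {"mobile": (55, 500), "desktop": (45, 1500)}

def rate(conv, users):
    return conv / users

# Treatment wins within every segment...
for seg in control:
    assert rate(*treatment[seg]) > rate(*control[seg])

# ...but loses overall: treatment traffic is concentrated on desktop,
# where conversion rates are low in both arms.
overall_control = sum(c for c, _ in control.values()) / sum(u for _, u in control.values())
overall_treatment = sum(c for c, _ in treatment.values()) / sum(u for _, u in treatment.values())
# overall_control ~ 7.3%, overall_treatment = 5.0%
```

Proper randomization should keep the segment mix balanced between arms; the stratified breakdown is how you verify that it did.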

  8. Novelty effects inflate short-term results. Users interact more with new experiences because they are new, not because they are better. Run experiments for at least 2-3 full weeks and analyze the treatment effect by week of exposure. The stabilized effect (week 3+) is a better predictor of long-term impact than the inflated week 1 number.

  9. "The test says no difference" is a valid and valuable result. It means you avoided launching a change that does not work. It means you did not waste engineering resources maintaining a system that provides no benefit. It means you can redirect effort to changes with higher potential impact. The hard part is not the statistics --- it is the organizational courage to accept the result when the stakeholder has already promised the board a launch.

  10. CUPED can dramatically increase your experiment's power. By adjusting for pre-experiment user behavior, CUPED reduces metric variance by 20-50% without requiring additional traffic. This means shorter experiments, smaller minimum detectable effects, or both. If your platform supports pre-experiment data joins, implement CUPED.
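The adjustment itself is a few lines. This is a minimal sketch on simulated data, with the pre-experiment metric built to have about half the variance of the in-experiment metric; the function names are mine, not a library API:

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), theta = cov(y, x) / var(x)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [b - theta * (a - mx) for a, b in zip(x, y)]

# Simulated users: pre-experiment metric x, in-experiment metric y correlated with x.
n = 20_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]     # rho^2 ~ 0.5 by construction

y_adj = cuped_adjust(y, x)
reduction = 1 - variance(y_adj) / variance(y)  # ~0.5 here; equals rho^2 in general
```

The variance reduction equals the squared correlation between the pre- and in-experiment metrics, which is why strongly habitual metrics benefit most.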

  11. Pre-registration is a contract, not a bureaucratic formality. Writing down the hypothesis, primary metric, sample size calculation, and analysis date before the experiment starts protects the analyst from stakeholder pressure, prevents post-hoc rationalization, and creates institutional memory. The strongest experimentation cultures treat pre-registration as non-negotiable.


If You Remember One Thing

An A/B test is the only reliable way to establish that your model caused the outcome you observed. Everything else --- offline metrics, before-after comparisons, correlational analysis --- is suggestive at best and misleading at worst. The four-month model that launched without an experiment and was rolled back six months later is not a cautionary tale. It is the default outcome when experimentation is skipped. Design the experiment.


These takeaways summarize Chapter 3: Experimental Design and A/B Testing. Return to the chapter for full context.