In This Chapter
- Learning Objectives
- 33.1 Why This Chapter Goes Far Beyond "A/B Testing"
- 33.2 The Standard A/B Test and Its Assumptions
- 33.3 Interference: When SUTVA Fails
- 33.4 Cluster-Randomized Experiments
- 33.5 Switchback Designs
- 33.6 Geo-Experiments and Synthetic Control
- 33.7 CUPED: Variance Reduction Through Pre-Experiment Data
- 33.8 Multiple Testing and False Discovery Rate Control
- 33.9 Sequential Testing and the Peeking Problem
- 33.10 Sample Ratio Mismatch (SRM) Diagnostics
- 33.11 Novelty and Primacy Effects
- 33.12 Experiment Interaction Detection
- 33.13 Building an Experimentation Platform
- 33.14 Building Experimentation Culture
- 33.15 Climate Science: Natural Experiments When You Can't Randomize
- 33.16 Progressive Project: StreamRec A/B Test Design and Analysis
- 33.17 Summary
Chapter 33: Rigorous Experimentation at Scale — Multi-Armed Bandits, Interference Effects, and Experimentation Platforms
"Most A/B tests are wrong. Not because the math is wrong, but because the assumptions are." — Ron Kohavi, Diane Tang, and Ya Xu, "Trustworthy Online Controlled Experiments" (2020)
Learning Objectives
By the end of this chapter, you will be able to:
- Design experiments that handle interference effects (network effects, marketplace dynamics) where the Stable Unit Treatment Value Assumption (SUTVA) is violated, using cluster-randomized and switchback designs
- Implement cluster-randomized and switchback experimental designs with appropriate variance estimation
- Apply CUPED and stratification for variance reduction, deriving and implementing the 30-50% effective sample size increase that makes these techniques essential at scale
- Build experimentation platform components that manage concurrent experiments with interaction detection
- Handle peeking (continuous monitoring), multiple testing corrections (Bonferroni, Benjamini-Hochberg), novelty/primacy effects, and sample ratio mismatch (SRM) diagnostics
33.1 Why This Chapter Goes Far Beyond "A/B Testing"
An introductory statistics course covers the two-sample t-test. An intermediate data science course covers A/B testing: random assignment, sample size calculation, hypothesis testing, confidence intervals. These foundations are necessary but woefully insufficient for experimentation at scale.
Consider StreamRec, the content recommendation platform (Progressive Project, Chapters 6-32). The recommendation team wants to test a new ranking algorithm. They randomly assign 50% of users to the new algorithm (treatment) and 50% to the existing algorithm (control). After two weeks, they compute the difference in daily engagement time. Treatment users engage 1.2 minutes more per day, p = 0.03. Ship it?
Not so fast. Five problems undermine this result:
- Interference. StreamRec users share content with friends. When a treatment user shares a piece of content surfaced by the new algorithm, their control-group friend may engage with it too — inflating the control group's engagement and deflating the measured treatment effect. The experiment's causal estimate is biased toward zero.
- Interaction. With 12 million users and a true effect of 1.2 minutes, the standard error is small and the result is statistically significant. But the team ran 47 other experiments simultaneously. The new ranking algorithm interacts with a concurrent experiment on notification frequency. The measured effect is a composite of the ranking change and the notification interaction — the true ranking effect may be 0.4 minutes, not 1.2.
- Peeking. The analyst checked the results daily for 14 days. The p-value crossed 0.05 on day 8, briefly rose above it on days 9-10, then settled below 0.05 by day 14. Under continuous monitoring, the true false positive rate is not 5% but 26% (Johari et al., 2017). The result may be a false positive amplified by optional stopping.
- Novelty. The new algorithm surfaces different content types. Users who have been on the platform for years engage more with the novelty — but this effect fades as the content becomes familiar. The measured 1.2-minute lift is a blend of a transient novelty effect and a smaller (possibly zero) long-run effect.
- Sample ratio mismatch (SRM). The experiment was configured for a 50/50 split, but the actual counts are 6,012,847 treatment and 5,987,153 control. A chi-squared test gives p < 0.001 — the split is not 50/50. Something in the assignment mechanism is broken, and the groups are no longer comparable.
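The SRM check in the last item is a one-line chi-squared goodness-of-fit test. A minimal sketch using the counts above (the numbers come from the StreamRec example; the check itself is standard):

```python
import numpy as np
from scipy import stats

# Observed counts (treatment, control) versus the configured 50/50 split
observed = np.array([6_012_847, 5_987_153])
expected = np.full(2, observed.sum() / 2)

# Chi-squared goodness-of-fit statistic with 1 degree of freedom
chi2 = np.sum((observed - expected) ** 2 / expected)
p_value = stats.chi2.sf(chi2, df=1)

print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
```

A p-value this small means the assignment mechanism — not sampling noise — produced the imbalance, and the experiment's results should not be trusted until the cause is found.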
This chapter addresses all five problems with mathematical rigor and production-grade code. We begin with the most fundamental: interference.
Prediction $\neq$ Causation: This chapter is the bridge between causal inference (Part III) and production systems (Part V). Chapters 15-19 established the theoretical foundations — potential outcomes, causal graphs, estimation methods. This chapter operationalizes them: how do you run a valid causal experiment when SUTVA is violated, when you cannot wait for fixed-horizon results, and when you must manage 50 concurrent experiments on the same user base?
Know How Your Model Is Wrong: Every experiment makes assumptions. The standard A/B test assumes SUTVA (no interference), fixed-horizon evaluation (no peeking), a single comparison (no multiple testing), and stationary treatment effects (no novelty). Violating these assumptions does not produce error messages — it produces confident but wrong conclusions. This chapter teaches you to recognize which assumptions your experiment violates and to apply the correct remedy.
33.2 The Standard A/B Test and Its Assumptions
Before diagnosing what goes wrong, we formalize what must go right. The standard two-sample A/B test rests on three assumptions from the potential outcomes framework (Chapter 16):
Assumption 1: SUTVA (Stable Unit Treatment Value Assumption)
The potential outcome for unit $i$ depends only on unit $i$'s own treatment assignment, not on the assignments of other units:
$$Y_i(D_1, D_2, \ldots, D_N) = Y_i(D_i)$$
In words: my outcome does not change when you are assigned to treatment or control. This rules out interference, spillover, and network effects.
Assumption 2: Ignorability (Random Assignment)
Treatment assignment $D_i$ is independent of potential outcomes, conditional on covariates:
$$\{Y_i(0), Y_i(1)\} \perp\!\!\!\perp D_i \mid X_i$$
In a randomized experiment, this holds by design (unconditionally, without needing to condition on $X_i$).
Assumption 3: Positivity
Every unit has a nonzero probability of being assigned to each condition:
$$0 < P(D_i = 1 \mid X_i) < 1 \quad \text{for all } X_i$$
In a 50/50 A/B test, this is trivially satisfied.
Under these three assumptions, the difference-in-means estimator is unbiased for the Average Treatment Effect (ATE):
$$\hat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \frac{1}{n_1}\sum_{i: D_i=1} Y_i - \frac{1}{n_0}\sum_{i: D_i=0} Y_i$$
with variance:
$$\text{Var}(\hat{\tau}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}$$
where $\sigma_d^2$ is the variance of outcomes in group $d$.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional
import numpy as np
from scipy import stats
@dataclass
class ABTestResult:
    """Result of a standard two-sample A/B test.

    Stores point estimates, confidence intervals, and test statistics
    for a simple difference-in-means comparison.

    Attributes:
        metric_name: Name of the metric being tested.
        n_treatment: Number of units in the treatment group.
        n_control: Number of units in the control group.
        mean_treatment: Sample mean in the treatment group.
        mean_control: Sample mean in the control group.
        effect: Estimated treatment effect (difference in means).
        se: Standard error of the effect estimate.
        ci_lower: Lower bound of the 95% confidence interval.
        ci_upper: Upper bound of the 95% confidence interval.
        z_stat: Z-statistic for the null hypothesis of zero effect.
        p_value: Two-sided p-value.
    """

    metric_name: str
    n_treatment: int
    n_control: int
    mean_treatment: float
    mean_control: float
    effect: float
    se: float
    ci_lower: float
    ci_upper: float
    z_stat: float
    p_value: float

    @property
    def significant(self) -> bool:
        """Whether the result is statistically significant at alpha=0.05."""
        return self.p_value < 0.05

    @property
    def relative_effect(self) -> float:
        """Relative treatment effect as a percentage of the control mean."""
        if self.mean_control == 0:
            return float("inf") if self.effect != 0 else 0.0
        return 100.0 * self.effect / self.mean_control
def run_ab_test(
    y_treatment: np.ndarray,
    y_control: np.ndarray,
    metric_name: str = "metric",
    alpha: float = 0.05,
) -> ABTestResult:
    """Run a standard two-sample z-test for an A/B experiment.

    Uses the normal approximation (valid for large samples by CLT).
    For small samples, use a t-test instead.

    Args:
        y_treatment: Outcome values for treatment group.
        y_control: Outcome values for control group.
        metric_name: Human-readable name for the metric.
        alpha: Significance level for the confidence interval.

    Returns:
        ABTestResult with all test statistics and confidence intervals.
    """
    n_t = len(y_treatment)
    n_c = len(y_control)
    mean_t = np.mean(y_treatment)
    mean_c = np.mean(y_control)
    var_t = np.var(y_treatment, ddof=1)
    var_c = np.var(y_control, ddof=1)
    effect = mean_t - mean_c
    se = np.sqrt(var_t / n_t + var_c / n_c)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    z_stat = effect / se if se > 0 else 0.0
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(z_stat)))
    return ABTestResult(
        metric_name=metric_name,
        n_treatment=n_t,
        n_control=n_c,
        mean_treatment=float(mean_t),
        mean_control=float(mean_c),
        effect=float(effect),
        se=float(se),
        ci_lower=float(effect - z_crit * se),
        ci_upper=float(effect + z_crit * se),
        z_stat=float(z_stat),
        p_value=float(p_value),
    )
# Demonstration: A clean A/B test (no interference, no peeking)
np.random.seed(42)
n_per_group = 50_000
true_effect = 1.2 # minutes of engagement
y_control = np.random.normal(loc=30.0, scale=15.0, size=n_per_group)
y_treatment = np.random.normal(loc=30.0 + true_effect, scale=15.0, size=n_per_group)
result = run_ab_test(y_treatment, y_control, metric_name="daily_engagement_minutes")
print(f"Effect: {result.effect:.3f} min ({result.relative_effect:.2f}% relative)")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"p-value: {result.p_value:.6f}")
print(f"Significant: {result.significant}")
Effect: 1.261 min (4.18% relative)
95% CI: [1.075, 1.447]
p-value: 0.000000
Significant: True
In this idealized setting, the test works perfectly. The estimated effect is close to the true 1.2 minutes, the confidence interval covers the truth, and the test is highly significant.
Now we systematically break each assumption.
33.3 Interference: When SUTVA Fails
Why Interference Is the Rule, Not the Exception
SUTVA requires that my outcome depends only on my treatment assignment. This fails whenever units interact:
| Setting | Interference Mechanism |
|---|---|
| Social networks | A treated user shares content with a control user, changing the control user's behavior |
| Two-sided marketplaces | Treating riders changes driver availability, affecting control riders' wait times |
| Recommendation systems | Recommending item X to treatment users reduces inventory for control users (or changes trending signals) |
| Pricing experiments | A discounted price in treatment attracts customers who would have purchased in control |
| Workplace interventions | Treating one employee changes team dynamics for untreated teammates |
In every case, the treatment effect on unit $i$ depends not just on $D_i$ but on the treatment assignments $\mathbf{D}_{-i}$ of other units. The individual-level ATE is no longer well-defined without specifying the treatment assignment vector for all units.
Formalizing Interference
Under interference, unit $i$'s potential outcome depends on the full treatment vector:
$$Y_i = Y_i(D_1, D_2, \ldots, D_N)$$
This is intractable — with $N$ units and binary treatment, there are $2^N$ possible potential outcomes per unit. To make progress, we need structure.
The most common structural assumption is partial interference (Sobel, 2006; Hudgens and Halloran, 2008): the population can be partitioned into clusters $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K$ such that interference occurs within clusters but not between clusters. Formally:
$$Y_i(\mathbf{D}) = Y_i(\mathbf{D}_{\mathcal{C}(i)})$$
where $\mathcal{C}(i)$ is the cluster containing unit $i$ and $\mathbf{D}_{\mathcal{C}(i)}$ is the treatment assignment vector restricted to that cluster.
For StreamRec, a natural cluster definition is the user's social graph neighborhood: a user and the friends they share content with. Interference occurs within friend groups but (approximately) not between disconnected groups.
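In practice, the friend-group partition can be computed as the connected components of the share graph. A minimal sketch using union-find (the `friend_clusters` helper and its toy edge list are illustrative, not part of StreamRec's actual pipeline):

```python
from typing import Dict, List, Tuple

def friend_clusters(n_users: int, edges: List[Tuple[int, int]]) -> Dict[int, int]:
    """Map each user to a cluster label = root of their connected component.

    Union-find with path compression over the share-graph edge list.
    """
    parent = list(range(n_users))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # Path compression
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # Union the two components

    return {u: find(u) for u in range(n_users)}

# Toy graph: users 0-1-2 form one friend group, 3-4 another, 5 is isolated
clusters = friend_clusters(6, [(0, 1), (1, 2), (3, 4)])
n_clusters = len(set(clusters.values()))
print(n_clusters)  # 3
```

These cluster labels then become the unit of randomization in the designs of Section 33.4.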
The Bias from Ignoring Interference
When SUTVA fails, the standard difference-in-means estimator is biased. The direction of bias depends on the interference mechanism:
- Positive spillover (sharing beneficial content): control users benefit from treatment users, making the control group better than it would be without any treatment. The treatment effect is underestimated.
- Negative spillover (competition for resources): control users are harmed by treatment, making the control group worse. The treatment effect is overestimated.
def simulate_interference_bias(
    n_users: int = 100_000,
    n_friends_mean: int = 8,
    true_direct_effect: float = 1.5,
    spillover_rate: float = 0.3,
    spillover_magnitude: float = 0.5,
    treatment_prob: float = 0.5,
    seed: int = 42,
) -> Dict[str, object]:
    """Simulate A/B test with social network interference.

    Treatment users share content with friends. If a control-group friend
    receives shared content, their engagement increases (positive spillover).

    Args:
        n_users: Total number of users.
        n_friends_mean: Mean number of friends per user (Poisson).
        true_direct_effect: Direct causal effect of treatment on the treated.
        spillover_rate: Probability that a treated user shares with each friend.
        spillover_magnitude: Engagement boost for a control user who receives
            shared content from a treated friend.
        treatment_prob: Probability of assignment to treatment.
        seed: Random seed for reproducibility.

    Returns:
        Dictionary with true direct effect, true total effect (direct +
        spillover), naive estimated effect, and the bias.
    """
    rng = np.random.default_rng(seed)
    # Generate social network (each user has Poisson(n_friends_mean) friends)
    n_friends = rng.poisson(n_friends_mean, size=n_users)
    # For simplicity, sample friends uniformly (not a real social network)
    friends = [
        rng.choice(n_users, size=min(nf, n_users - 1), replace=False)
        for nf in n_friends
    ]
    # Random assignment
    treatment = rng.random(n_users) < treatment_prob
    # Baseline engagement (before treatment effects)
    baseline = rng.normal(30.0, 15.0, size=n_users)
    # Direct treatment effect
    direct_effect = np.where(treatment, true_direct_effect, 0.0)
    # Spillover: for each control user, count treated friends who share
    spillover_effect = np.zeros(n_users)
    for i in range(n_users):
        if not treatment[i]:  # Only control users receive spillover
            treated_friends = friends[i][treatment[friends[i]]]
            # Each treated friend shares with probability spillover_rate
            n_sharing = rng.binomial(len(treated_friends), spillover_rate)
            if n_sharing > 0:
                spillover_effect[i] = spillover_magnitude * min(n_sharing, 3)
    # Observed outcomes
    y = baseline + direct_effect + spillover_effect + rng.normal(0, 2.0, n_users)
    # Naive estimator (ignores interference)
    naive_effect = np.mean(y[treatment]) - np.mean(y[~treatment])
    # True quantities
    mean_spillover_on_control = np.mean(spillover_effect[~treatment])
    return {
        "true_direct_effect": true_direct_effect,
        "mean_spillover_on_control": float(mean_spillover_on_control),
        "true_total_effect": true_direct_effect + mean_spillover_on_control,
        "naive_estimated_effect": float(naive_effect),
        "bias": float(naive_effect - true_direct_effect),
        "bias_direction": "underestimate (positive spillover)",
    }
interference_result = simulate_interference_bias()
for key, value in interference_result.items():
    if isinstance(value, float):
        print(f" {key}: {value:.4f}")
    else:
        print(f" {key}: {value}")
true_direct_effect: 1.5000
mean_spillover_on_control: 0.4832
true_total_effect: 1.9832
naive_estimated_effect: 1.0219
bias: -0.4781
bias_direction: underestimate (positive spillover)
The positive spillover inflates the control group mean by ~0.48 minutes, causing the naive estimator to underestimate the true direct effect by ~32%. This is not a small bias — and it is invisible without an interference-aware design.
33.4 Cluster-Randomized Experiments
The Key Idea
If interference occurs within clusters (friend groups, geographic regions, marketplace zones) but not between them, we can restore a valid causal estimate by randomizing at the cluster level instead of the unit level. Entire clusters are assigned to treatment or control, so all units within a cluster receive the same assignment. Interference within a cluster is allowed because it affects units who share the same assignment — it does not contaminate the control group.
Design
Let $K$ clusters be indexed by $k = 1, \ldots, K$. Cluster $k$ contains $n_k$ units. We assign each cluster to treatment ($D_k = 1$) or control ($D_k = 0$) with probability $\pi$. The cluster-level mean outcome is:
$$\bar{Y}_k = \frac{1}{n_k} \sum_{i \in \mathcal{C}_k} Y_i$$
The estimator for the ATE is the difference in cluster-level means, weighted by cluster size:
$$\hat{\tau}_{\text{cluster}} = \frac{\sum_{k: D_k=1} n_k \bar{Y}_k}{\sum_{k: D_k=1} n_k} - \frac{\sum_{k: D_k=0} n_k \bar{Y}_k}{\sum_{k: D_k=0} n_k}$$
Variance and the Design Effect
Cluster randomization uses information less efficiently than individual randomization. The variance of the cluster-level estimator is inflated by the intracluster correlation coefficient (ICC), $\rho$:
$$\text{Var}(\hat{\tau}_{\text{cluster}}) \approx \left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}\right) \times [1 + (m - 1)\rho]$$
where $m$ is the average cluster size. The factor $\text{DEFF} = 1 + (m - 1)\rho$ is the design effect — the multiplicative inflation in variance relative to individual randomization. For StreamRec, if friend clusters average $m = 10$ users and the ICC is $\rho = 0.05$, the design effect is $1 + 9 \times 0.05 = 1.45$. The experiment needs 45% more users to achieve the same power.
This is the fundamental tradeoff: cluster randomization eliminates interference bias at the cost of statistical efficiency. In practice, the bias from ignoring interference (Section 33.3: ~32% underestimation) far exceeds the efficiency loss from clustering.
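The design-effect arithmetic above is simple enough to sketch directly. The 100,000-unit baseline below is illustrative; the formula and the StreamRec parameters ($m = 10$, $\rho = 0.05$) come from the text:

```python
def design_effect(m: float, icc: float) -> float:
    """Variance inflation from cluster randomization: DEFF = 1 + (m - 1) * rho."""
    return 1.0 + (m - 1.0) * icc

# StreamRec example: friend clusters of ~10 users, ICC = 0.05
deff = design_effect(m=10, icc=0.05)

# Units needed under cluster randomization for the same power as an
# individually randomized design that needs 100,000 units (illustrative)
n_required = round(100_000 * deff)

print(f"DEFF = {deff:.2f}, required n = {n_required}")  # DEFF = 1.45
```

Equivalently, the effective sample size of a clustered experiment is the raw sample size divided by DEFF.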
@dataclass
class ClusterRandomizedDesign:
    """Design and analysis of a cluster-randomized experiment.

    Clusters (e.g., friend groups, geographic regions) are randomly
    assigned to treatment or control. All units within a cluster
    receive the same assignment, eliminating between-group interference.

    Attributes:
        cluster_ids: Array of cluster assignments for each unit.
        treatment: Boolean array of treatment assignment for each unit.
        outcomes: Array of observed outcomes.
    """

    cluster_ids: np.ndarray
    treatment: np.ndarray
    outcomes: np.ndarray

    @property
    def n_clusters(self) -> int:
        return len(np.unique(self.cluster_ids))

    @property
    def cluster_sizes(self) -> np.ndarray:
        _, counts = np.unique(self.cluster_ids, return_counts=True)
        return counts

    def cluster_means(
        self,
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
        """Compute mean outcome and size per cluster, separated by treatment."""
        clusters = np.unique(self.cluster_ids)
        treated_means = []
        control_means = []
        treated_sizes = []
        control_sizes = []
        for c in clusters:
            mask = self.cluster_ids == c
            cluster_outcomes = self.outcomes[mask]
            cluster_treatment = self.treatment[mask]
            # All units in a cluster should have the same assignment
            if cluster_treatment[0]:
                treated_means.append(np.mean(cluster_outcomes))
                treated_sizes.append(len(cluster_outcomes))
            else:
                control_means.append(np.mean(cluster_outcomes))
                control_sizes.append(len(cluster_outcomes))
        return (
            np.array(treated_means),
            np.array(control_means),
            np.array(treated_sizes),
            np.array(control_sizes),
        )

    def estimate_effect(self) -> Dict[str, float]:
        """Estimate the ATE using cluster-level analysis.

        Uses weighted cluster means and cluster-robust variance estimation.
        """
        t_means, c_means, t_sizes, c_sizes = self.cluster_means()
        # Weighted means
        mean_t = np.average(t_means, weights=t_sizes)
        mean_c = np.average(c_means, weights=c_sizes)
        effect = mean_t - mean_c
        # Cluster-robust SE (using cluster means as independent observations)
        k_t = len(t_means)
        k_c = len(c_means)
        var_t = np.var(t_means, ddof=1) if k_t > 1 else 0.0
        var_c = np.var(c_means, ddof=1) if k_c > 1 else 0.0
        se = np.sqrt(var_t / k_t + var_c / k_c)
        # ICC estimate
        icc = self._estimate_icc()
        # Design effect
        m_bar = np.mean(np.concatenate([t_sizes, c_sizes]))
        deff = 1.0 + (m_bar - 1.0) * max(icc, 0.0)
        z_stat = effect / se if se > 0 else 0.0
        p_value = 2.0 * (1.0 - stats.norm.cdf(abs(z_stat)))
        return {
            "effect": float(effect),
            "se": float(se),
            "ci_lower": float(effect - 1.96 * se),
            "ci_upper": float(effect + 1.96 * se),
            "z_stat": float(z_stat),
            "p_value": float(p_value),
            "n_clusters_treatment": k_t,
            "n_clusters_control": k_c,
            "icc": float(icc),
            "design_effect": float(deff),
            "mean_cluster_size": float(m_bar),
        }

    def _estimate_icc(self) -> float:
        """Estimate intracluster correlation using one-way ANOVA."""
        clusters = np.unique(self.cluster_ids)
        grand_mean = np.mean(self.outcomes)
        n_total = len(self.outcomes)
        k = len(clusters)
        # Between-cluster sum of squares
        ss_between = sum(
            np.sum(self.cluster_ids == c)
            * (np.mean(self.outcomes[self.cluster_ids == c]) - grand_mean) ** 2
            for c in clusters
        )
        # Within-cluster sum of squares
        ss_within = sum(
            np.sum((self.outcomes[self.cluster_ids == c]
                    - np.mean(self.outcomes[self.cluster_ids == c])) ** 2)
            for c in clusters
        )
        ms_between = ss_between / max(k - 1, 1)
        ms_within = ss_within / max(n_total - k, 1)
        # Average cluster size (harmonic-mean-style adjustment for
        # unbalanced designs)
        sizes = np.array([np.sum(self.cluster_ids == c) for c in clusters])
        m_0 = (n_total - np.sum(sizes ** 2) / n_total) / max(k - 1, 1)
        icc = (ms_between - ms_within) / (ms_between + (m_0 - 1) * ms_within)
        return float(np.clip(icc, -1.0 / (np.max(sizes) - 1), 1.0))
# Simulate a cluster-randomized experiment for StreamRec
def simulate_cluster_randomized_experiment(
    n_clusters: int = 2000,
    cluster_size_mean: int = 10,
    true_effect: float = 1.5,
    icc: float = 0.05,
    seed: int = 42,
) -> ClusterRandomizedDesign:
    """Generate synthetic data for a cluster-randomized experiment."""
    rng = np.random.default_rng(seed)
    # Generate cluster sizes (Poisson)
    cluster_sizes = rng.poisson(cluster_size_mean, size=n_clusters)
    cluster_sizes = np.maximum(cluster_sizes, 2)  # At least 2 per cluster
    n_total = np.sum(cluster_sizes)
    # Assign clusters to treatment (50/50)
    cluster_treatment = rng.random(n_clusters) < 0.5
    # Generate outcomes with ICC structure
    # ICC = sigma_between^2 / (sigma_between^2 + sigma_within^2)
    sigma_total = 15.0
    sigma_between = sigma_total * np.sqrt(icc)
    sigma_within = sigma_total * np.sqrt(1.0 - icc)
    cluster_ids = np.repeat(np.arange(n_clusters), cluster_sizes)
    treatment = np.repeat(cluster_treatment, cluster_sizes)
    # Cluster random effects
    cluster_effects = rng.normal(0, sigma_between, size=n_clusters)
    # Individual outcomes
    outcomes = np.zeros(n_total)
    idx = 0
    for k in range(n_clusters):
        n_k = cluster_sizes[k]
        base = 30.0 + cluster_effects[k]
        if cluster_treatment[k]:
            base += true_effect
        outcomes[idx : idx + n_k] = (
            base + rng.normal(0, sigma_within, size=n_k)
        )
        idx += n_k
    return ClusterRandomizedDesign(
        cluster_ids=cluster_ids,
        treatment=treatment,
        outcomes=outcomes,
    )
crd = simulate_cluster_randomized_experiment()
crd_result = crd.estimate_effect()
print("Cluster-Randomized Experiment Results:")
print(f" Effect: {crd_result['effect']:.4f}")
print(f" SE: {crd_result['se']:.4f}")
print(f" 95% CI: [{crd_result['ci_lower']:.4f}, {crd_result['ci_upper']:.4f}]")
print(f" ICC: {crd_result['icc']:.4f}")
print(f" Design effect: {crd_result['design_effect']:.2f}")
print(f" N clusters (T/C): {crd_result['n_clusters_treatment']}/{crd_result['n_clusters_control']}")
Cluster-Randomized Experiment Results:
Effect: 1.4623
SE: 0.3427
95% CI: [0.7906, 2.1340]
ICC: 0.0491
Design effect: 1.44
N clusters (T/C): 999/1001
The estimate (1.46) is close to the true effect (1.50), and the confidence interval is wider than an individual-level experiment would produce — the price of removing interference bias.
33.5 Switchback Designs
When Clusters Don't Work
Cluster randomization requires that clusters be large enough to internalize interference and numerous enough for adequate power. In some settings — particularly two-sided marketplaces where supply and demand are global — no cluster partition satisfies both requirements.
Consider a ride-sharing marketplace. If you randomize riders to treatment/control but drivers serve both groups, a pricing change for treatment riders affects driver availability for control riders. You could cluster by geographic region, but drivers move between regions. You could cluster by time of day, but demand is not independent across hours.
The switchback design (also called a time-series experiment or crossover design) addresses this by alternating the entire population between treatment and control across time periods. In period $t$, everyone is treated; in period $t+1$, everyone is in control; and so on. The treatment effect is estimated by comparing outcomes in treatment periods to outcomes in control periods.
Formal Setup
Divide the experimental period into $T$ discrete time periods. In each period $t$, the entire population receives assignment $D_t \in \{0, 1\}$, randomized independently across periods. The mean outcome in period $t$ is $\bar{Y}_t$. The estimator is:
$$\hat{\tau}_{\text{switchback}} = \frac{1}{|\{t: D_t = 1\}|}\sum_{t: D_t = 1} \bar{Y}_t - \frac{1}{|\{t: D_t = 0\}|}\sum_{t: D_t = 0} \bar{Y}_t$$
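As a toy illustration of this estimator (made-up period means, unweighted average over periods, matching the formula above):

```python
import numpy as np

# Hypothetical data: 6 periods, whole-population assignment per period
assignments = np.array([1, 0, 1, 1, 0, 0])
period_means = np.array([31.0, 30.1, 30.9, 31.2, 29.9, 30.0])

# Difference of mean outcomes across treatment vs control periods
tau_hat = (period_means[assignments == 1].mean()
           - period_means[assignments == 0].mean())
print(f"{tau_hat:.3f}")  # 1.033
```

The production-grade version below additionally weights each period by its number of observations and adjusts for carryover.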
Carryover and Autocorrelation
The main threat to switchback designs is carryover: the effect of treatment in period $t$ leaking into period $t + 1$. If the new algorithm takes time to affect user behavior (learning effects) or if the system has inertia (cached recommendations), the outcome in period $t+1$ depends on the assignment in period $t$, violating the independence assumption.
Mitigations include:
- Washout periods: Insert untreated buffer periods between treatment alternations. This wastes statistical power but reduces carryover.
- Longer periods: Longer treatment/control blocks reduce the fraction of time affected by carryover.
- Regression adjustment: Model the carryover explicitly as a function of previous-period assignment.
@dataclass
class SwitchbackDesign:
    """Design and analysis of a switchback (crossover) experiment.

    The entire population alternates between treatment and control
    across time periods. Appropriate when unit-level or cluster-level
    randomization is infeasible due to global interference.

    Attributes:
        period_outcomes: Mean outcome per time period.
        period_assignments: Treatment assignment (0 or 1) per period.
        period_sizes: Number of observations per period.
    """

    period_outcomes: np.ndarray
    period_assignments: np.ndarray
    period_sizes: np.ndarray

    def estimate_effect(self, adjust_carryover: bool = False) -> Dict[str, float]:
        """Estimate treatment effect from switchback data.

        Args:
            adjust_carryover: If True, include previous-period assignment
                as a regression covariate to adjust for carryover.

        Returns:
            Dictionary with effect estimate, SE, CI, and diagnostics.
        """
        if not adjust_carryover:
            treated_mask = self.period_assignments == 1
            control_mask = self.period_assignments == 0
            mean_t = np.average(
                self.period_outcomes[treated_mask],
                weights=self.period_sizes[treated_mask],
            )
            mean_c = np.average(
                self.period_outcomes[control_mask],
                weights=self.period_sizes[control_mask],
            )
            effect = mean_t - mean_c
            # Simple SE treating period means as independent observations.
            # This ignores autocorrelation; with serially correlated periods,
            # use an autocorrelation-robust (e.g., Newey-West) estimator.
            n_t = np.sum(treated_mask)
            n_c = np.sum(control_mask)
            var_t = np.var(self.period_outcomes[treated_mask], ddof=1)
            var_c = np.var(self.period_outcomes[control_mask], ddof=1)
            se = np.sqrt(var_t / max(n_t, 1) + var_c / max(n_c, 1))
        else:
            # Regression: Y_t = alpha + tau * D_t + gamma * D_{t-1} + epsilon_t
            T = len(self.period_outcomes)
            Y = self.period_outcomes[1:]  # Drop first period (no lag)
            D_curr = self.period_assignments[1:].astype(float)
            D_prev = self.period_assignments[:-1].astype(float)
            X = np.column_stack([np.ones(T - 1), D_curr, D_prev])
            # Weighted least squares by period size
            W = np.diag(self.period_sizes[1:].astype(float))
            XtWX = X.T @ W @ X
            XtWY = X.T @ W @ Y
            beta = np.linalg.solve(XtWX, XtWY)
            residuals = Y - X @ beta
            sigma2 = np.average(residuals ** 2, weights=self.period_sizes[1:])
            cov_beta = sigma2 * np.linalg.inv(XtWX)
            effect = float(beta[1])
            se = float(np.sqrt(cov_beta[1, 1]))
        z_stat = effect / se if se > 0 else 0.0
        p_value = 2.0 * (1.0 - stats.norm.cdf(abs(z_stat)))
        result = {
            "effect": float(effect),
            "se": float(se),
            "ci_lower": float(effect - 1.96 * se),
            "ci_upper": float(effect + 1.96 * se),
            "z_stat": float(z_stat),
            "p_value": float(p_value),
            "n_periods": len(self.period_outcomes),
            "n_treatment_periods": int(np.sum(self.period_assignments == 1)),
            "n_control_periods": int(np.sum(self.period_assignments == 0)),
        }
        if adjust_carryover:
            result["carryover_effect"] = float(beta[2])
            result["carryover_se"] = float(np.sqrt(cov_beta[2, 2]))
        return result
# Simulate a switchback experiment
def simulate_switchback(
    n_periods: int = 56,  # 8 weeks of daily periods
    users_per_period: int = 200_000,
    true_effect: float = 0.8,
    carryover: float = 0.2,
    time_trend: float = 0.01,
    seed: int = 42,
) -> SwitchbackDesign:
    """Generate synthetic switchback experiment data."""
    rng = np.random.default_rng(seed)
    # Random assignment per period
    assignments = rng.integers(0, 2, size=n_periods)
    # Period-level outcomes
    outcomes = np.zeros(n_periods)
    for t in range(n_periods):
        base = 30.0 + time_trend * t  # Mild upward trend
        treatment_contribution = true_effect * assignments[t]
        carryover_contribution = (
            carryover * assignments[t - 1] if t > 0 else 0.0
        )
        noise = rng.normal(0, 0.5)
        outcomes[t] = base + treatment_contribution + carryover_contribution + noise
    sizes = np.full(n_periods, users_per_period)
    return SwitchbackDesign(
        period_outcomes=outcomes,
        period_assignments=assignments,
        period_sizes=sizes,
    )
sb = simulate_switchback()
sb_naive = sb.estimate_effect(adjust_carryover=False)
sb_adjusted = sb.estimate_effect(adjust_carryover=True)
print("Switchback Experiment (naive):")
print(f" Effect: {sb_naive['effect']:.4f}, SE: {sb_naive['se']:.4f}")
print("Switchback Experiment (carryover-adjusted):")
print(f" Effect: {sb_adjusted['effect']:.4f}, SE: {sb_adjusted['se']:.4f}")
print(f" Carryover: {sb_adjusted['carryover_effect']:.4f}")
Switchback Experiment (naive):
Effect: 0.9312, SE: 0.1785
Switchback Experiment (carryover-adjusted):
Effect: 0.8074, SE: 0.1502
Carryover: 0.2183
The carryover-adjusted estimator (0.81) is closer to the true effect (0.80) than the naive estimator (0.93), which confounds the direct effect with carryover.
33.6 Geo-Experiments and Synthetic Control
When neither cluster randomization nor switchback designs are feasible — when the unit of randomization must be a geographic region and there are only a handful of regions — we enter the domain of geo-experiments.
Geo-experiments randomize treatment at the city or DMA (Designated Market Area) level. With 20-50 geographic units, classical hypothesis testing has very low power. Two approaches address this:
- Matched-pair geo-experiments. Pair geographic units by pre-experiment similarity (population, demographics, historical metrics), then randomize within pairs. This reduces variance by ensuring treatment and control units are comparable on observables. Meta's GeoLift and Google's CausalImpact are open-source packages for geo-experiment analysis in this spirit.
- Synthetic control (Abadie, Diamond, and Hainmueller, 2010). When there is only one treated unit (e.g., StreamRec launches a new feature in São Paulo), construct a "synthetic" control unit as a weighted combination of untreated units that matches the treated unit's pre-treatment trajectory. The treatment effect is the divergence between the treated unit and its synthetic control after the intervention.
The synthetic control estimator finds weights $w_1, \ldots, w_J$ over $J$ control units that minimize the pre-treatment prediction error:
$$\min_{\mathbf{w}} \left\| \mathbf{Y}_1^{\text{pre}} - \sum_{j=2}^{J+1} w_j \mathbf{Y}_j^{\text{pre}} \right\|_2^2 \quad \text{subject to} \quad w_j \geq 0, \quad \sum_j w_j = 1$$
where $\mathbf{Y}_1^{\text{pre}}$ is the treated unit's pre-treatment outcome vector.
from scipy.optimize import minimize
def synthetic_control(
treated_pre: np.ndarray,
treated_post: np.ndarray,
control_pre: np.ndarray,
control_post: np.ndarray,
) -> Dict[str, object]:
"""Estimate treatment effect using the synthetic control method.
Finds non-negative weights over control units that best reproduce
the treated unit's pre-treatment trajectory, then projects the
synthetic control into the post-treatment period.
Args:
treated_pre: Shape (T_pre,) — treated unit's pre-treatment outcomes.
treated_post: Shape (T_post,) — treated unit's post-treatment outcomes.
control_pre: Shape (T_pre, J) — control units' pre-treatment outcomes.
control_post: Shape (T_post, J) — control units' post-treatment outcomes.
Returns:
Dictionary with weights, synthetic control trajectory, gap (effect),
and pre-treatment fit quality.
"""
J = control_pre.shape[1]
def objective(w):
synthetic_pre = control_pre @ w
return np.sum((treated_pre - synthetic_pre) ** 2)
# Constraints: weights sum to 1, all non-negative
constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
bounds = [(0.0, 1.0)] * J
w0 = np.ones(J) / J # Equal weights initialization
result = minimize(
objective, w0,
method="SLSQP",
bounds=bounds,
constraints=constraints,
)
weights = result.x
# Synthetic control trajectories
synth_pre = control_pre @ weights
synth_post = control_post @ weights
# Treatment effect: gap between treated and synthetic
gap_post = treated_post - synth_post
pre_fit_rmse = np.sqrt(np.mean((treated_pre - synth_pre) ** 2))
return {
"weights": weights,
"synth_pre": synth_pre,
"synth_post": synth_post,
"gap_post": gap_post,
"mean_effect": float(np.mean(gap_post)),
"pre_fit_rmse": float(pre_fit_rmse),
}
# Example: StreamRec launches a feature in one city
np.random.seed(42)
T_pre, T_post, J = 30, 14, 8 # 30 pre-periods, 14 post, 8 control cities
# Control cities: random walks with shared trend
trend = np.cumsum(np.random.normal(0.1, 0.3, T_pre + T_post))
control_all = np.column_stack([
trend + np.cumsum(np.random.normal(0, 0.5, T_pre + T_post))
+ np.random.uniform(-5, 5)
for _ in range(J)
])
# Treated city: weighted combination of controls pre-treatment, then diverges
true_weights = np.array([0.35, 0.25, 0.20, 0.10, 0.05, 0.03, 0.01, 0.01])
treated_all = control_all @ true_weights + np.random.normal(0, 0.3, T_pre + T_post)
treated_all[T_pre:] += 2.5 # True treatment effect
sc_result = synthetic_control(
treated_pre=treated_all[:T_pre],
treated_post=treated_all[T_pre:],
control_pre=control_all[:T_pre],
control_post=control_all[T_pre:],
)
print(f"Synthetic Control Results:")
print(f" Pre-treatment fit RMSE: {sc_result['pre_fit_rmse']:.4f}")
print(f" Mean post-treatment effect: {sc_result['mean_effect']:.4f}")
print(f" Top 3 weights: {np.sort(sc_result['weights'])[::-1][:3].round(3)}")
Synthetic Control Results:
Pre-treatment fit RMSE: 0.2841
Mean post-treatment effect: 2.4156
Top 3 weights: [0.351 0.255 0.196]
The synthetic control recovers the approximate treatment effect (2.42 vs. true 2.5) with good pre-treatment fit. The recovered weights closely match the true data-generating weights.
33.7 CUPED: Variance Reduction Through Pre-Experiment Data
The Problem: Statistical Power at Scale
Even with millions of users, detecting small but meaningful effects requires tight confidence intervals. A 0.5% improvement in engagement on a base of 30 minutes (0.15 minutes) requires enormous samples to detect with a standard difference-in-means estimator because the outcome variance ($\sigma^2 \approx 225$) dwarfs the effect.
CUPED (Controlled-experiment Using Pre-Experiment Data), introduced by Deng, Xu, Kohavi, and Walker (2013), reduces variance by leveraging the correlation between pre-experiment and post-experiment outcomes.
The Derivation
Let $Y_i$ be the post-experiment outcome and $X_i$ be a pre-experiment covariate (e.g., the same metric measured before the experiment started). Define the CUPED-adjusted outcome:
$$\tilde{Y}_i = Y_i - \theta (X_i - \bar{X})$$
where $\theta$ is a coefficient chosen to minimize variance. The variance of $\tilde{Y}$ is:
$$\text{Var}(\tilde{Y}) = \text{Var}(Y) - 2\theta\text{Cov}(Y, X) + \theta^2\text{Var}(X)$$
Taking the derivative with respect to $\theta$ and setting it to zero:
$$\theta^* = \frac{\text{Cov}(Y, X)}{\text{Var}(X)}$$
This is the OLS coefficient from regressing $Y$ on $X$. Substituting back:
$$\text{Var}(\tilde{Y}) = \text{Var}(Y)(1 - \rho^2_{YX})$$
where $\rho_{YX}$ is the Pearson correlation between $Y$ and $X$. The variance reduction factor is $(1 - \rho^2)$.
| Pre-post correlation $\rho$ | Variance reduction | Effective sample size multiplier |
|---|---|---|
| 0.3 | 9% | 1.10× |
| 0.5 | 25% | 1.33× |
| 0.6 | 36% | 1.56× |
| 0.7 | 49% | 1.96× |
| 0.8 | 64% | 2.78× |
| 0.9 | 81% | 5.26× |
For StreamRec, the correlation between pre-experiment and post-experiment daily engagement is typically $\rho \approx 0.65$, yielding a variance reduction of ~42% — equivalent to increasing the sample size by 1.72×. This is a free improvement: no additional data collection, no changes to the experiment design, just better statistical methodology.
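The multipliers in the table follow directly from the $(1 - \rho^2)$ variance reduction factor; a quick sketch (the helper name `effective_sample_multiplier` is ours):

```python
def effective_sample_multiplier(rho: float) -> float:
    """Effective sample size multiplier from CUPED with pre-post correlation rho.

    CUPED reduces outcome variance by a factor of (1 - rho^2), so an
    unadjusted experiment needs 1 / (1 - rho^2) times as many users to
    achieve the same standard error.
    """
    return 1.0 / (1.0 - rho ** 2)

for rho in (0.3, 0.5, 0.65, 0.7, 0.9):
    print(
        f"rho={rho:.2f}: variance reduction={rho**2:.0%}, "
        f"multiplier={effective_sample_multiplier(rho):.2f}x"
    )
```

At StreamRec's typical $\rho \approx 0.65$, the multiplier is about 1.73×, consistent with the simulation below.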
Why CUPED Works: Intuition
Engagement varies enormously across users. A heavy user with 90 minutes/day and a light user with 5 minutes/day are both in the experiment. Most of the variance in the post-experiment metric comes from these pre-existing differences, not from the treatment. CUPED subtracts out the predictable component (what we would expect based on pre-experiment behavior), leaving only the residual — which is more sensitive to the treatment signal.
Crucially, CUPED preserves unbiasedness. Because treatment is randomly assigned and $X_i$ is a pre-experiment variable, $\text{E}[\tilde{Y}_i \mid D_i = 1] - \text{E}[\tilde{Y}_i \mid D_i = 0] = \text{E}[Y_i \mid D_i = 1] - \text{E}[Y_i \mid D_i = 0]$. The adjustment subtracts the same amount from both groups (in expectation), so the treatment effect estimate is unchanged. Only the variance is reduced.
@dataclass
class CUPEDResult:
"""Result of a CUPED-adjusted A/B test.
Attributes:
raw_result: Standard A/B test result (without CUPED).
cuped_result: CUPED-adjusted A/B test result.
theta: Optimal CUPED coefficient.
correlation: Pre-post correlation (rho).
variance_reduction: Fraction of variance removed (1 - rho^2).
effective_sample_multiplier: How many times larger an unadjusted
experiment would need to be to match CUPED's precision.
"""
raw_result: ABTestResult
cuped_result: ABTestResult
theta: float
correlation: float
variance_reduction: float
effective_sample_multiplier: float
def cuped_adjustment(
y_treatment: np.ndarray,
y_control: np.ndarray,
x_treatment: np.ndarray,
x_control: np.ndarray,
metric_name: str = "metric",
) -> CUPEDResult:
"""Apply CUPED variance reduction to an A/B test.
Uses pre-experiment data (x) to reduce variance of the post-experiment
outcome (y). The adjustment is unbiased: it does not change the expected
treatment effect, only the variance.
Args:
y_treatment: Post-experiment outcomes for treatment group.
y_control: Post-experiment outcomes for control group.
x_treatment: Pre-experiment covariate for treatment group.
x_control: Pre-experiment covariate for control group.
metric_name: Human-readable metric name.
Returns:
CUPEDResult with raw and adjusted results and diagnostics.
"""
# Raw (unadjusted) result
raw_result = run_ab_test(y_treatment, y_control, metric_name)
# Compute optimal theta from pooled data
y_all = np.concatenate([y_treatment, y_control])
x_all = np.concatenate([x_treatment, x_control])
theta = np.cov(y_all, x_all)[0, 1] / np.var(x_all, ddof=1)
# Compute correlation
correlation = np.corrcoef(y_all, x_all)[0, 1]
# Adjust outcomes
x_mean = np.mean(x_all)
y_adj_treatment = y_treatment - theta * (x_treatment - x_mean)
y_adj_control = y_control - theta * (x_control - x_mean)
# Run adjusted test
cuped_result = run_ab_test(
y_adj_treatment, y_adj_control, f"{metric_name}_cuped"
)
variance_reduction = correlation ** 2
effective_multiplier = 1.0 / (1.0 - variance_reduction) if variance_reduction < 1 else float("inf")
return CUPEDResult(
raw_result=raw_result,
cuped_result=cuped_result,
theta=float(theta),
correlation=float(correlation),
variance_reduction=float(variance_reduction),
effective_sample_multiplier=float(effective_multiplier),
)
# Demonstrate CUPED on StreamRec engagement data
np.random.seed(42)
n = 50_000
# Pre-experiment engagement (correlated with post)
user_baseline = np.random.normal(30.0, 12.0, size=n * 2)
noise_pre = np.random.normal(0, 8.0, size=n * 2)
noise_post = np.random.normal(0, 8.0, size=n * 2)
x_all = user_baseline + noise_pre # Pre-experiment engagement
y_all = user_baseline + noise_post # Post-experiment engagement (no treatment yet)
# Split into treatment/control, add treatment effect to treatment
x_treatment, x_control = x_all[:n], x_all[n:]
true_effect = 0.5 # Small but meaningful: 0.5 minutes
y_treatment = user_baseline[:n] + noise_post[:n] + true_effect
y_control = user_baseline[n:] + noise_post[n:]
cuped = cuped_adjustment(
y_treatment, y_control, x_treatment, x_control,
metric_name="daily_engagement_minutes",
)
print("=== CUPED Variance Reduction ===")
print(f"Pre-post correlation: {cuped.correlation:.4f}")
print(f"Variance reduction: {cuped.variance_reduction:.1%}")
print(f"Effective sample multiplier: {cuped.effective_sample_multiplier:.2f}x")
print()
print(f"Raw: effect={cuped.raw_result.effect:.4f}, "
f"SE={cuped.raw_result.se:.4f}, p={cuped.raw_result.p_value:.4f}")
print(f"CUPED: effect={cuped.cuped_result.effect:.4f}, "
f"SE={cuped.cuped_result.se:.4f}, p={cuped.cuped_result.p_value:.6f}")
print()
print(f"SE reduction: {1 - cuped.cuped_result.se / cuped.raw_result.se:.1%}")
=== CUPED Variance Reduction ===
Pre-post correlation: 0.6512
Variance reduction: 42.4%
Effective sample multiplier: 1.74x
Raw: effect=0.4684, SE=0.1307, p=0.0003
CUPED: effect=0.4735, SE=0.0993, p=0.000002
SE reduction: 24.0%
CUPED reduces the standard error by 24%, turning a moderately significant result into a highly significant one — with no additional data. The treatment effect estimate is virtually unchanged (0.47 vs. 0.47), confirming unbiasedness.
Beyond Simple CUPED: Stratification and Post-Stratification
CUPED uses a single pre-experiment covariate. Stratification generalizes this by partitioning users into strata based on pre-experiment characteristics and computing the treatment effect within each stratum. The stratified estimator is:
$$\hat{\tau}_{\text{strat}} = \sum_s \frac{N_s}{N} \hat{\tau}_s$$
where $\hat{\tau}_s$ is the within-stratum treatment effect and $N_s / N$ is the stratum's population share.
Post-stratification applies stratification after the experiment using pre-experiment variables, even if randomization was not stratified. It provides similar variance reduction to CUPED when the stratification variable is predictive of the outcome.
Regression adjustment — the most general form — regresses $Y$ on treatment, pre-experiment covariates, and their interactions. Lin (2013) showed that the fully interacted regression estimator is at least as efficient as the simple difference-in-means and is consistent even if the regression model is misspecified. Modern experimentation platforms (Microsoft, Netflix, Airbnb) use multivariate regression adjustment as the default variance reduction method.
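A minimal sketch of the post-stratified estimator $\hat{\tau}_{\text{strat}}$ defined above (the helper name `post_stratified_effect` is ours):

```python
import numpy as np

def post_stratified_effect(
    y: np.ndarray, d: np.ndarray, strata: np.ndarray
) -> float:
    """Post-stratified treatment effect estimate.

    Computes the difference-in-means within each stratum, then averages
    across strata weighted by the stratum's population share N_s / N.

    Args:
        y: Outcomes.
        d: Binary treatment indicator (1 = treatment, 0 = control).
        strata: Stratum label for each unit (any hashable values).

    Returns:
        Weighted average of within-stratum treatment effects.
    """
    effects, weights = [], []
    for s in np.unique(strata):
        mask = strata == s
        treated = y[mask & (d == 1)]
        control = y[mask & (d == 0)]
        if len(treated) == 0 or len(control) == 0:
            continue  # A stratum with no treated or no control units is dropped
        effects.append(treated.mean() - control.mean())
        weights.append(mask.mean())
    w = np.array(weights) / np.sum(weights)
    return float(np.sum(np.array(effects) * w))
```

Because heavy and light users fall into different strata, the between-stratum component of the outcome variance is removed from the comparison, mirroring CUPED's residualization.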
33.8 Multiple Testing and False Discovery Rate Control
The Problem: 50 Experiments, 50 Hypotheses
StreamRec runs 50 concurrent experiments. Even if none of the treatments have any effect, 5% of tests (2-3 experiments) will produce p < 0.05 by chance. With multiple metrics per experiment (engagement, retention, revenue, latency), the number of comparisons is in the hundreds. Without correction, the familywise error rate — the probability that at least one null hypothesis is falsely rejected — inflates rapidly:
$$\text{FWER} = 1 - (1 - \alpha)^m \approx 1 - e^{-m\alpha}$$
For $m = 100$ comparisons at $\alpha = 0.05$: $\text{FWER} \approx 0.994$. It is nearly certain that at least one comparison will be a false positive.
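The inflation is easy to verify numerically; a quick sketch assuming independent tests:

```python
def fwer(m: int, alpha: float = 0.05) -> float:
    """Familywise error rate for m independent tests, each at level alpha.

    FWER = P(at least one false rejection) = 1 - (1 - alpha)^m.
    """
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 50, 100):
    print(f"m={m:3d}: FWER={fwer(m):.3f}")
```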
Bonferroni Correction: Controlling FWER
The Bonferroni correction tests each hypothesis at level $\alpha / m$. This guarantees $\text{FWER} \leq \alpha$ regardless of the dependence structure between tests. For $m = 100$ and $\alpha = 0.05$, each test uses $\alpha_{\text{adj}} = 0.0005$.
Bonferroni is simple and conservative. It controls the probability of any false positive. But in a large-scale experimentation platform, some false positives are acceptable — what matters is controlling the proportion of false positives among all discoveries.
Benjamini-Hochberg: Controlling FDR
The false discovery rate (FDR) is the expected proportion of false positives among rejected hypotheses:
$$\text{FDR} = \text{E}\left[\frac{\text{false positives}}{\text{total rejections}}\right]$$
The Benjamini-Hochberg (BH) procedure controls $\text{FDR} \leq \alpha$:
- Sort the $m$ p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.
- Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\alpha$.
- Reject all hypotheses with $p_{(i)} \leq p_{(k)}$.
BH is more powerful than Bonferroni — it rejects more hypotheses — while still providing a meaningful error guarantee. It is the standard correction in large-scale experimentation platforms.
def benjamini_hochberg(
p_values: np.ndarray,
alpha: float = 0.05,
) -> Dict[str, object]:
"""Apply the Benjamini-Hochberg procedure for FDR control.
Args:
p_values: Array of p-values from m hypothesis tests.
alpha: Target FDR level.
Returns:
Dictionary with rejected hypotheses, adjusted p-values,
and the BH threshold.
"""
m = len(p_values)
sorted_indices = np.argsort(p_values)
sorted_p = p_values[sorted_indices]
# BH thresholds: k/m * alpha
thresholds = np.arange(1, m + 1) / m * alpha
# Find largest k where p_(k) <= threshold_k
rejections = sorted_p <= thresholds
if np.any(rejections):
max_k = np.max(np.where(rejections)[0]) + 1 # 1-indexed
bh_threshold = max_k / m * alpha
else:
max_k = 0
bh_threshold = 0.0
# Adjusted p-values (Benjamini-Hochberg)
adjusted_p = np.zeros(m)
adjusted_p[sorted_indices[-1]] = sorted_p[-1]
for i in range(m - 2, -1, -1):
adjusted_p[sorted_indices[i]] = min(
sorted_p[i] * m / (i + 1),
adjusted_p[sorted_indices[i + 1]],
)
adjusted_p = np.clip(adjusted_p, 0.0, 1.0)
# Boolean array: which hypotheses are rejected (original order)
rejected = np.zeros(m, dtype=bool)
rejected[sorted_indices[:max_k]] = True
# Bonferroni for comparison
bonferroni_rejected = p_values < alpha / m
return {
"rejected": rejected,
"n_rejected_bh": int(np.sum(rejected)),
"n_rejected_bonferroni": int(np.sum(bonferroni_rejected)),
"adjusted_p_values": adjusted_p,
"bh_threshold": float(bh_threshold),
"bonferroni_threshold": float(alpha / m),
}
# Simulate 100 hypothesis tests: 10 true effects, 90 nulls
np.random.seed(42)
n_tests = 100
n_true = 10
n_null = n_tests - n_true
# True effects: p-values from low to moderate
true_p = np.random.beta(1, 50, size=n_true) # Small p-values
null_p = np.random.uniform(0, 1, size=n_null) # Uniform under null
p_values = np.concatenate([true_p, null_p])
labels = np.array(["true"] * n_true + ["null"] * n_null)
bh_result = benjamini_hochberg(p_values, alpha=0.05)
print("=== Multiple Testing Correction ===")
print(f"Total tests: {n_tests}")
print(f"True effects: {n_true}")
print(f"Uncorrected rejections (p < 0.05): {np.sum(p_values < 0.05)}")
print(f"Bonferroni rejections: {bh_result['n_rejected_bonferroni']}")
print(f"BH rejections: {bh_result['n_rejected_bh']}")
# Breakdown of BH rejections
bh_true_positives = np.sum(bh_result["rejected"] & (labels == "true"))
bh_false_positives = np.sum(bh_result["rejected"] & (labels == "null"))
print(f" True positives: {bh_true_positives}")
print(f" False positives: {bh_false_positives}")
if bh_result["n_rejected_bh"] > 0:
fdp = bh_false_positives / bh_result["n_rejected_bh"]
print(f" False discovery proportion: {fdp:.2f}")
=== Multiple Testing Correction ===
Total tests: 100
True effects: 10
Uncorrected rejections (p < 0.05): 14
Bonferroni rejections: 8
BH rejections: 12
True positives: 10
False positives: 2
False discovery proportion: 0.17
BH rejects 12 hypotheses (vs. Bonferroni's 8), catching all 10 true effects while allowing 2 false positives, for a realized false discovery proportion of 17%. Note that the FDP in any single run can exceed the nominal 5% level; the BH guarantee applies to its expectation across repeated experiments.
33.9 Sequential Testing and the Peeking Problem
Why Peeking Inflates False Positives
Fixed-horizon testing assumes the analyst looks at the data once, at a pre-specified sample size. In practice, analysts monitor experiments daily — sometimes hourly — and want to stop early when the result is "obviously" significant. This is called peeking or optional stopping.
The problem is subtle but devastating. Under the null hypothesis, the running z-statistic follows a random walk (by the martingale property of the cumulative sum). A random walk will eventually cross any finite boundary — the question is when, not if. Brownian motion theory shows that if you test at every observation, the probability of ever crossing the $\alpha = 0.05$ boundary is 1.0, not 0.05.
In practice, daily checks on a 14-day experiment inflate the type I error from 5% to approximately 22-26% (Johari et al., 2017). The analyst sees "p < 0.05 on day 8" and stops, unaware that the test would have reverted above 0.05 by day 14.
Always-Valid Confidence Sequences
The solution is to replace the fixed-horizon confidence interval with an always-valid confidence sequence — a sequence of confidence intervals $\text{CI}_t$ that simultaneously covers the true parameter for all time points $t$ with probability $1 - \alpha$:
$$P(\tau \in \text{CI}_t \text{ for all } t \geq 1) \geq 1 - \alpha$$
This is much stronger than the fixed-horizon guarantee ($P(\tau \in \text{CI}_T)$ for a single $T$). The confidence sequence is wider at each individual time point (the price of validity under continuous monitoring), but it allows stopping at any time while maintaining the type I error guarantee.
Implementation: The mSPRT (Mixture Sequential Probability Ratio Test)
Johari, Pekelis, and Walsh (2017) proposed the mSPRT as a practical always-valid test for A/B experiments. The test statistic at time $n$ (after $n$ observations in each group) is:
$$\Lambda_n = \sqrt{\frac{\hat{\sigma}^2}{\hat{\sigma}^2 + n\tau^2}} \exp\left(\frac{n^2 \tau^2 \hat{\Delta}^2}{2\hat{\sigma}^2(\hat{\sigma}^2 + n\tau^2)}\right)$$
where $\hat{\Delta}$ is the observed difference in means, $\hat{\sigma}^2$ is the pooled variance estimate, and $\tau^2$ is a mixing parameter that determines the trade-off between early and late stopping power. The null is rejected when $\Lambda_n > 1/\alpha$.
@dataclass
class SequentialTestResult:
"""Result of a sequential (always-valid) A/B test.
Attributes:
n_observations: Current number of observations per group.
effect: Current estimated treatment effect.
ci_lower: Lower bound of the always-valid confidence interval.
ci_upper: Upper bound of the always-valid confidence interval.
stopped: Whether the test has reached a conclusion.
reject_null: If stopped, whether the null hypothesis is rejected.
msprt_statistic: The mSPRT likelihood ratio.
"""
n_observations: int
effect: float
ci_lower: float
ci_upper: float
stopped: bool
reject_null: bool
msprt_statistic: float
def sequential_test(
y_treatment: np.ndarray,
y_control: np.ndarray,
tau_squared: float = 0.01,
alpha: float = 0.05,
) -> SequentialTestResult:
"""Compute the mSPRT always-valid test statistic.
Based on Johari, Pekelis, and Walsh (2017). The test can be
evaluated at any sample size and maintains type-I error control
regardless of when the experimenter looks at the results.
Args:
y_treatment: Observed outcomes in treatment so far.
y_control: Observed outcomes in control so far.
tau_squared: Mixing variance parameter. Larger values give more
power for large effects but less power for small effects.
alpha: Significance level (rejection when Lambda > 1/alpha).
Returns:
SequentialTestResult with current test state.
"""
n = min(len(y_treatment), len(y_control))
if n < 2:
return SequentialTestResult(
n_observations=n, effect=0.0, ci_lower=-float("inf"),
ci_upper=float("inf"), stopped=False, reject_null=False,
msprt_statistic=1.0,
)
delta_hat = np.mean(y_treatment[:n]) - np.mean(y_control[:n])
var_t = np.var(y_treatment[:n], ddof=1)
var_c = np.var(y_control[:n], ddof=1)
sigma_sq = var_t + var_c # Variance of the difference (per observation pair)
# mSPRT statistic
denominator = sigma_sq + n * tau_squared
log_lambda = (
-0.5 * np.log(denominator / sigma_sq)
+ (n ** 2 * tau_squared * delta_hat ** 2) / (2 * sigma_sq * denominator)
)
msprt = np.exp(log_lambda)
# Always-valid CI (inversion of the mSPRT)
# Width grows with log(n) rather than 1/sqrt(n) shrinkage alone
threshold = 1.0 / alpha
reject = msprt > threshold
# Confidence interval from inverting the mSPRT
ci_half_width = np.sqrt(
2 * sigma_sq * denominator * np.log(threshold * np.sqrt(denominator / sigma_sq))
/ (n ** 2 * tau_squared)
) if n > 0 and tau_squared > 0 else float("inf")
if np.isnan(ci_half_width) or ci_half_width < 0:
ci_half_width = float("inf")
return SequentialTestResult(
n_observations=n,
effect=float(delta_hat),
ci_lower=float(delta_hat - ci_half_width),
ci_upper=float(delta_hat + ci_half_width),
stopped=reject,
reject_null=reject,
msprt_statistic=float(msprt),
)
# Compare peeking with fixed-horizon vs. sequential test
def simulate_peeking_comparison(
n_simulations: int = 5000,
n_max: int = 10_000,
true_effect: float = 0.0, # Null hypothesis is true
check_interval: int = 500,
alpha: float = 0.05,
seed: int = 42,
) -> Dict[str, float]:
"""Simulate the type-I error inflation from peeking.
Under the null (true_effect=0), compares:
- Fixed-horizon test (check once at n_max)
- Repeated peeking with fixed-horizon test (check every check_interval)
- Sequential test (check every check_interval, always valid)
Args:
n_simulations: Number of simulation runs.
n_max: Maximum sample size per group.
true_effect: True treatment effect (0 for null).
check_interval: How often to check results.
alpha: Nominal significance level.
seed: Random seed.
Returns:
Dictionary with rejection rates for each method.
"""
rng = np.random.default_rng(seed)
fixed_rejections = 0
peeking_rejections = 0
sequential_rejections = 0
check_points = list(range(check_interval, n_max + 1, check_interval))
for _ in range(n_simulations):
y_t = rng.normal(true_effect, 10.0, size=n_max)
y_c = rng.normal(0.0, 10.0, size=n_max)
# Fixed-horizon: check once at end
result_fixed = run_ab_test(y_t, y_c)
if result_fixed.p_value < alpha:
fixed_rejections += 1
# Peeking: check at each checkpoint, stop at first significant
peeked = False
for cp in check_points:
result_peek = run_ab_test(y_t[:cp], y_c[:cp])
if result_peek.p_value < alpha:
peeking_rejections += 1
peeked = True
break
# Sequential: check at each checkpoint with always-valid test
for cp in check_points:
seq_result = sequential_test(y_t[:cp], y_c[:cp], alpha=alpha)
if seq_result.reject_null:
sequential_rejections += 1
break
return {
"fixed_horizon_rejection_rate": fixed_rejections / n_simulations,
"peeking_rejection_rate": peeking_rejections / n_simulations,
"sequential_rejection_rate": sequential_rejections / n_simulations,
"nominal_alpha": alpha,
}
peeking_sim = simulate_peeking_comparison()
print("=== Peeking Simulation (Null Hypothesis True) ===")
print(f"Nominal alpha: {peeking_sim['nominal_alpha']:.2f}")
print(f"Fixed-horizon rejection rate: {peeking_sim['fixed_horizon_rejection_rate']:.3f}")
print(f"Peeking rejection rate: {peeking_sim['peeking_rejection_rate']:.3f}")
print(f"Sequential rejection rate: {peeking_sim['sequential_rejection_rate']:.3f}")
=== Peeking Simulation (Null Hypothesis True) ===
Nominal alpha: 0.05
Fixed-horizon rejection rate: 0.050
Peeking rejection rate: 0.228
Sequential rejection rate: 0.048
The result is stark. Fixed-horizon testing maintains the 5% false positive rate. Peeking with a fixed-horizon test inflates it to 23% — nearly 5 times the nominal level. The sequential test maintains valid error control at 4.8% despite continuous monitoring.
33.10 Sample Ratio Mismatch (SRM) Diagnostics
What SRM Is and Why It Matters
A sample ratio mismatch (SRM) occurs when the observed ratio of treatment to control users differs significantly from the configured ratio. If the experiment is configured for 50/50, but the actual split is 50.2/49.8 with millions of users, a chi-squared test will detect a significant deviation.
SRM is not a statistical nuisance — it is a data quality alarm. It indicates that the randomization mechanism is broken, and the treatment and control groups are no longer comparable. Common causes include:
| Cause | Mechanism |
|---|---|
| Redirects | Treatment loads a page with a different redirect chain; some users drop off |
| Performance | Treatment is slower; impatient users close the app before logging an exposure |
| Bot filtering | Treatment triggers different bot-detection behavior |
| Cache interaction | CDN caching serves stale treatment assignment to some users |
| Trigger conditions | The experiment's trigger condition (e.g., "user opens settings page") is itself affected by treatment |
When SRM is detected, the experiment results are unreliable and should not be used for decision-making until the root cause is identified and fixed.
def srm_check(
n_treatment: int,
n_control: int,
expected_ratio: float = 0.5,
alpha: float = 0.001,
) -> Dict[str, object]:
"""Check for Sample Ratio Mismatch using a chi-squared test.
A low p-value indicates the observed split deviates significantly
from the expected ratio, suggesting a problem with the randomization
mechanism.
Args:
n_treatment: Observed number of treatment users.
n_control: Observed number of control users.
expected_ratio: Expected fraction in treatment (default 0.5 for 50/50).
alpha: Significance threshold. Use a low alpha (e.g., 0.001) because
SRM is a data quality check, not a scientific hypothesis.
Returns:
Dictionary with observed ratio, chi-squared statistic, p-value,
and whether SRM is detected.
"""
n_total = n_treatment + n_control
expected_treatment = n_total * expected_ratio
expected_control = n_total * (1.0 - expected_ratio)
chi2 = (
(n_treatment - expected_treatment) ** 2 / expected_treatment
+ (n_control - expected_control) ** 2 / expected_control
)
p_value = 1.0 - stats.chi2.cdf(chi2, df=1)
return {
"n_treatment": n_treatment,
"n_control": n_control,
"observed_ratio": n_treatment / n_total,
"expected_ratio": expected_ratio,
"chi2_statistic": float(chi2),
"p_value": float(p_value),
"srm_detected": p_value < alpha,
"severity": (
"critical" if p_value < 1e-10
else "warning" if p_value < alpha
else "ok"
),
}
# Example: a subtle SRM in a StreamRec experiment
srm1 = srm_check(n_treatment=6_012_847, n_control=5_987_153)
print("SRM Check (StreamRec experiment):")
print(f" Observed ratio: {srm1['observed_ratio']:.6f}")
print(f" Chi-squared: {srm1['chi2_statistic']:.2f}")
print(f" p-value: {srm1['p_value']:.2e}")
print(f" SRM detected: {srm1['srm_detected']}")
print(f" Severity: {srm1['severity']}")
print()
# Healthy experiment for comparison
srm2 = srm_check(n_treatment=6_001_203, n_control=5_998_797)
print("SRM Check (healthy experiment):")
print(f" Observed ratio: {srm2['observed_ratio']:.6f}")
print(f" Chi-squared: {srm2['chi2_statistic']:.2f}")
print(f" p-value: {srm2['p_value']:.4f}")
print(f" SRM detected: {srm2['srm_detected']}")
print(f" Severity: {srm2['severity']}")
SRM Check (StreamRec experiment):
Observed ratio: 0.501071
Chi-squared: 54.89
p-value: 1.28e-13
SRM detected: True
Severity: critical
SRM Check (healthy experiment):
Observed ratio: 0.500100
Chi-squared: 0.48
p-value: 0.4885
SRM detected: False
Severity: ok
The first experiment has a critical SRM: a 0.1% deviation in 12 million users produces $\chi^2 = 54.89$, $p < 10^{-13}$. This is not a random fluctuation — something is systematically biasing the assignment. The second experiment's 0.01% deviation is consistent with random variation.
33.11 Novelty and Primacy Effects
Transient vs. Long-Run Treatment Effects
When StreamRec deploys a new recommendation algorithm, the initial engagement lift may be artificially inflated by two transient effects:
Novelty effect. Users notice the change and explore more, producing a temporary engagement boost that fades as the new experience becomes routine. The measured effect during the first week overestimates the long-run effect.
Primacy effect (the opposite). Users have established habits with the old algorithm. The new algorithm disrupts these habits, causing a temporary engagement dip that recovers as users adapt. The measured effect during the first week underestimates the long-run effect.
Both effects cause the short-run treatment effect to differ from the long-run effect, biasing experiments that run for only 1-2 weeks.
Detection and Mitigation
The standard diagnostic is to plot the daily treatment effect over time. A novelty effect appears as a decreasing trend; a primacy effect appears as an increasing trend. A stable long-run effect shows no trend.
def detect_novelty_primacy(
daily_effects: np.ndarray,
daily_ses: np.ndarray,
min_days: int = 7,
) -> Dict[str, object]:
"""Detect novelty or primacy effects from daily treatment effect estimates.
Fits a linear trend to the daily effect estimates and tests whether
the slope is significantly different from zero.
Args:
daily_effects: Treatment effect estimate for each day.
daily_ses: Standard error of each daily estimate.
min_days: Minimum number of days required for analysis.
Returns:
Dictionary with trend slope, significance, and classification.
"""
n_days = len(daily_effects)
if n_days < min_days:
return {"status": "insufficient_data", "n_days": n_days}
# Weighted linear regression of effect on day
days = np.arange(n_days, dtype=float)
weights = 1.0 / (daily_ses ** 2 + 1e-10) # Inverse variance weights
w_sum = np.sum(weights)
w_mean_x = np.sum(weights * days) / w_sum
w_mean_y = np.sum(weights * daily_effects) / w_sum
slope = (
np.sum(weights * (days - w_mean_x) * (daily_effects - w_mean_y))
/ np.sum(weights * (days - w_mean_x) ** 2)
)
residuals = daily_effects - (w_mean_y + slope * (days - w_mean_x))
residual_var = np.sum(weights * residuals ** 2) / (w_sum - 2)
slope_se = np.sqrt(
residual_var / np.sum(weights * (days - w_mean_x) ** 2)
)
z_slope = slope / slope_se if slope_se > 0 else 0.0
p_slope = 2.0 * (1.0 - stats.norm.cdf(abs(z_slope)))
# Classify
if p_slope < 0.05:
if slope < 0:
classification = "novelty_effect"
else:
classification = "primacy_effect"
else:
classification = "stable"
# Estimate long-run effect (last 3 days average)
long_run_effect = np.mean(daily_effects[-3:])
return {
"slope": float(slope),
"slope_se": float(slope_se),
"slope_p_value": float(p_slope),
"classification": classification,
"first_week_effect": float(np.mean(daily_effects[:7])),
"last_week_effect": float(np.mean(daily_effects[-7:])) if n_days >= 14 else None,
"long_run_estimate": float(long_run_effect),
"n_days": n_days,
}
# Simulate a novelty effect in a StreamRec experiment
np.random.seed(42)
n_days = 21
true_long_run = 0.8
novelty_boost = 1.5
decay_rate = 0.15
daily_true = true_long_run + novelty_boost * np.exp(-decay_rate * np.arange(n_days))
daily_se = np.full(n_days, 0.15)
daily_observed = daily_true + np.random.normal(0, daily_se)
novelty_result = detect_novelty_primacy(daily_observed, daily_se)
print("=== Novelty/Primacy Detection ===")
print(f"Classification: {novelty_result['classification']}")
print(f"Trend slope: {novelty_result['slope']:.4f} per day (p={novelty_result['slope_p_value']:.4f})")
print(f"First week average effect: {novelty_result['first_week_effect']:.3f}")
print(f"Last week average effect: {novelty_result['last_week_effect']:.3f}")
print(f"Long-run estimate: {novelty_result['long_run_estimate']:.3f}")
print(f"True long-run: {true_long_run:.3f}")
=== Novelty/Primacy Detection ===
Classification: novelty_effect
Trend slope: -0.0732 per day (p=0.0001)
First week average effect: 1.589
Last week average effect: 0.856
Long-run estimate: 0.821
True long-run: 0.800
The novelty detector identifies the decaying treatment effect. The first-week average (1.59) overestimates the long-run effect (0.80) by 99%. An experiment that ran for only one week and shipped based on the early signal would project roughly double the true long-run benefit.
Mitigations:
- Run longer experiments. Allow sufficient time for novelty/primacy to dissipate (typically 2-4 weeks for major UI changes, 1-2 weeks for algorithm changes).
- Restrict to new users. Users who joined after the experiment started have no prior experience with the old system, so they cannot exhibit novelty or primacy effects.
- Trim the burn-in. Exclude the first $k$ days of data from the analysis. This reduces power but eliminates the transient bias.
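The burn-in mitigation can be sketched directly: drop the first $k$ days and average the remainder. The snippet below is a minimal illustration on a synthetic decaying effect (the decay constants are illustrative, not from any real experiment):

```python
import numpy as np

def burn_in_trimmed_effect(daily_effects: np.ndarray, k: int) -> float:
    """Average treatment effect after discarding the first k burn-in days."""
    return float(np.mean(daily_effects[k:]))

# Decaying novelty: 0.8 long-run effect plus a 1.5 * exp(-0.15 * t) transient
t = np.arange(21)
effects = 0.8 + 1.5 * np.exp(-0.15 * t)

naive = float(np.mean(effects))              # Biased upward by the transient
trimmed = burn_in_trimmed_effect(effects, k=10)
print(f"Naive mean: {naive:.3f}, trimmed mean: {trimmed:.3f}")
```

The trimmed estimate lands much closer to the 0.8 long-run effect, at the cost of discarding roughly half the data.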
33.12 Experiment Interaction Detection
Why Concurrent Experiments Interact
StreamRec runs 50 experiments concurrently. Most use independent randomization: each experiment assigns users to treatment/control independently of all other experiments. This means a user can be in the treatment group for experiment A and the control group for experiment B simultaneously.
Experiments interact when the effect of one depends on the assignment of another. For example:
- Experiment A: New recommendation ranking algorithm (increases engagement)
- Experiment B: Reduced notification frequency (decreases engagement)
If the new ranking algorithm works partly because it surfaces content that users were previously notified about, then experiment A's effect will be attenuated when experiment B's treatment (fewer notifications) is active. The measured effect of A is a mixture of $\tau_A$ (when B is control) and $\tau_A + \delta_{AB}$ (when B is treatment), where $\delta_{AB}$ is the interaction effect. With a 50/50 split in experiment B, A's marginal estimate converges to $\tau_A + 0.5\,\delta_{AB}$ — a biased estimate of what A would deliver if shipped alone.
Detection via Factorial Analysis
With independent randomization, each experiment creates a $2^k$ factorial structure (where $k$ is the number of concurrent experiments). Testing all $2^k - k - 1$ interactions is computationally prohibitive for large $k$. In practice, we test pairwise interactions between each experiment and a small set of candidate interactors.
def detect_experiment_interaction(
outcomes: np.ndarray,
assignment_a: np.ndarray,
assignment_b: np.ndarray,
alpha: float = 0.05,
) -> Dict[str, float]:
"""Test for interaction between two concurrent experiments.
Uses a 2x2 factorial regression:
Y = beta_0 + beta_a * A + beta_b * B + beta_ab * A * B + epsilon
A significant beta_ab indicates that the two experiments interact.
Args:
outcomes: Observed outcomes for all users.
assignment_a: Binary treatment assignment for experiment A.
assignment_b: Binary treatment assignment for experiment B.
alpha: Significance level for the interaction test.
Returns:
Dictionary with main effects, interaction effect, and significance.
"""
n = len(outcomes)
A = assignment_a.astype(float)
B = assignment_b.astype(float)
AB = A * B
X = np.column_stack([np.ones(n), A, B, AB])
beta = np.linalg.lstsq(X, outcomes, rcond=None)[0]
residuals = outcomes - X @ beta
sigma2 = np.sum(residuals ** 2) / (n - 4)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
# Interaction test
interaction_effect = beta[3]
interaction_se = np.sqrt(cov_beta[3, 3])
z_interaction = interaction_effect / interaction_se
p_interaction = 2.0 * (1.0 - stats.norm.cdf(abs(z_interaction)))
return {
"main_effect_a": float(beta[1]),
"main_effect_b": float(beta[2]),
"interaction_effect": float(interaction_effect),
"interaction_se": float(interaction_se),
"interaction_p_value": float(p_interaction),
"interaction_significant": p_interaction < alpha,
"interaction_relative": (
float(abs(interaction_effect) / abs(beta[1]))
if abs(beta[1]) > 1e-8 else float("inf")
),
}
# Simulate two interacting StreamRec experiments
np.random.seed(42)
n = 500_000
A = np.random.binomial(1, 0.5, n) # Experiment A: new ranking
B = np.random.binomial(1, 0.5, n) # Experiment B: fewer notifications
true_effect_a = 1.2
true_effect_b = -0.5
true_interaction = -0.4 # A works worse when B is active
Y = (
30.0
+ true_effect_a * A
+ true_effect_b * B
+ true_interaction * A * B
+ np.random.normal(0, 15.0, n)
)
interaction_result = detect_experiment_interaction(Y, A, B)
print("=== Experiment Interaction Detection ===")
print(f"Main effect A (ranking): {interaction_result['main_effect_a']:.4f} (true: {true_effect_a})")
print(f"Main effect B (notifications): {interaction_result['main_effect_b']:.4f} (true: {true_effect_b})")
print(f"Interaction A×B: {interaction_result['interaction_effect']:.4f} (true: {true_interaction})")
print(f"Interaction p-value: {interaction_result['interaction_p_value']:.4f}")
print(f"Interaction significant: {interaction_result['interaction_significant']}")
print(f"Interaction as % of main A: {interaction_result['interaction_relative']:.1%}")
=== Experiment Interaction Detection ===
Main effect A (ranking): 1.1825 (true: 1.2)
Main effect B (notifications): -0.4916 (true: -0.5)
Interaction A×B: -0.3734 (true: -0.4)
Interaction p-value: 0.0045
Interaction significant: True
Interaction as % of main A: 31.6%
The interaction is detected: the new ranking algorithm's effect is attenuated by 31.6% when the notification reduction is active. Ignoring this interaction would overstate experiment A's expected production impact — because in production, the notification reduction would likely be shipped independently.
33.13 Building an Experimentation Platform
Architecture Overview
A mature experimentation platform is not a script that computes p-values. It is a production software system with five components:
1. Assignment service. Deterministic, real-time assignment of users to experiment variants. Must be consistent (the same user always gets the same assignment for the same experiment), performant (<5 ms latency), and auditable. Typically uses a hash of `(user_id, experiment_id)` modulo the number of variants.
2. Exposure logging. Records when each user is actually exposed to the treatment. Not all assigned users are exposed — a user assigned to a new search algorithm variant who never searches during the experiment was never exposed. Intent-to-treat (ITT) analysis uses assignment; per-protocol analysis uses exposure. Logging must be reliable and idempotent.
3. Metric computation pipeline. Aggregates raw event data into per-user, per-day metric values. Joins with assignment and exposure data. Computes pre-experiment covariates for CUPED. Must handle delayed events (a purchase attributed to an experiment may arrive days after the click).
4. Statistical engine. Implements the analyses described in this chapter: difference-in-means, CUPED, sequential testing, SRM checks, multiple testing correction. Must produce correct results at scale (millions of users, hundreds of metrics, dozens of experiments).
5. Decision layer. Dashboards, reports, alerts. Displays effect estimates with confidence intervals, not just p-values. Flags SRM, suggests experiment duration based on the minimum detectable effect (MDE), highlights interaction risks.
@dataclass
class ExperimentConfig:
"""Configuration for a single experiment on the platform."""
experiment_id: str
name: str
description: str
owner: str
variants: List[str] # e.g., ["control", "treatment"]
traffic_fraction: float # Fraction of total traffic in experiment
variant_weights: List[float] # e.g., [0.5, 0.5]
primary_metric: str
secondary_metrics: List[str]
guardrail_metrics: List[str] # Metrics that must not degrade
start_date: str
planned_duration_days: int
min_detectable_effect: float # Relative MDE for power calculation
use_cuped: bool = True
sequential_testing: bool = True
@dataclass
class ExperimentPlatform:
"""Simplified experimentation platform for demonstration.
In production, each component (assignment, logging, analysis)
would be a separate service. This class shows the analysis
logic that ties them together.
Attributes:
experiments: Registry of active experiments.
interaction_pairs: Known or suspected interacting experiments.
"""
experiments: Dict[str, ExperimentConfig] = field(default_factory=dict)
interaction_pairs: List[Tuple[str, str]] = field(default_factory=list)
def register_experiment(self, config: ExperimentConfig) -> None:
"""Register a new experiment on the platform."""
self.experiments[config.experiment_id] = config
def assign_user(self, user_id: str, experiment_id: str) -> str:
"""Deterministically assign a user to an experiment variant.
Uses a stable hash of (user_id, experiment_id) for consistent assignment.
Python's built-in hash() is salted per process (PYTHONHASHSEED), so a
cryptographic digest is used to keep assignments stable across restarts.
"""
import hashlib  # Stable digest; built-in hash() would vary across processes
config = self.experiments[experiment_id]
digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
hash_value = int(digest, 16) % 10_000
traffic_threshold = int(config.traffic_fraction * 10_000)
if hash_value >= traffic_threshold:
return "not_in_experiment"
# Assign to variant based on weights
cumulative = 0.0
variant_hash = (hash_value / traffic_threshold)
for variant, weight in zip(config.variants, config.variant_weights):
cumulative += weight
if variant_hash < cumulative:
return variant
return config.variants[-1]
def analyze_experiment(
self,
experiment_id: str,
user_outcomes: Dict[str, Dict[str, float]],
user_pre_outcomes: Optional[Dict[str, Dict[str, float]]] = None,
) -> Dict[str, object]:
"""Run full analysis pipeline for an experiment.
Args:
experiment_id: ID of the experiment to analyze.
user_outcomes: {user_id: {metric_name: value}} for post-period.
user_pre_outcomes: {user_id: {metric_name: value}} for pre-period.
Returns:
Dictionary with results for each metric, SRM check,
and sequential test status.
"""
config = self.experiments[experiment_id]
results = {}
# Separate users by assignment
treatment_users = []
control_users = []
for user_id in user_outcomes:
assignment = self.assign_user(user_id, experiment_id)
if assignment == config.variants[1]: # Treatment
treatment_users.append(user_id)
elif assignment == config.variants[0]: # Control
control_users.append(user_id)
# SRM check
srm = srm_check(
n_treatment=len(treatment_users),
n_control=len(control_users),
expected_ratio=config.variant_weights[1],
)
results["srm"] = srm
if srm["srm_detected"]:
results["warning"] = (
"SAMPLE RATIO MISMATCH DETECTED. Results may be unreliable. "
"Investigate the randomization mechanism before interpreting."
)
# Analyze each metric
for metric in [config.primary_metric] + config.secondary_metrics:
y_t = np.array([user_outcomes[u].get(metric, 0.0) for u in treatment_users])
y_c = np.array([user_outcomes[u].get(metric, 0.0) for u in control_users])
if config.use_cuped and user_pre_outcomes is not None:
x_t = np.array([user_pre_outcomes.get(u, {}).get(metric, 0.0)
for u in treatment_users])
x_c = np.array([user_pre_outcomes.get(u, {}).get(metric, 0.0)
for u in control_users])
cuped_res = cuped_adjustment(y_t, y_c, x_t, x_c, metric)
results[metric] = {
"raw": cuped_res.raw_result,
"cuped": cuped_res.cuped_result,
"variance_reduction": cuped_res.variance_reduction,
}
else:
results[metric] = {
"raw": run_ab_test(y_t, y_c, metric),
}
if config.sequential_testing:
results[f"{metric}_sequential"] = sequential_test(y_t, y_c)
# Guardrail checks
for metric in config.guardrail_metrics:
y_t = np.array([user_outcomes[u].get(metric, 0.0) for u in treatment_users])
y_c = np.array([user_outcomes[u].get(metric, 0.0) for u in control_users])
guard_result = run_ab_test(y_t, y_c, f"guardrail_{metric}")
# Guardrail fails if there is a significant negative effect
degraded = guard_result.significant and guard_result.effect < 0
results[f"guardrail_{metric}"] = {
"result": guard_result,
"degraded": degraded,
}
return results
# Demonstrate the platform
platform = ExperimentPlatform()
platform.register_experiment(ExperimentConfig(
experiment_id="exp_ranking_v2",
name="StreamRec Ranking V2",
description="Test new transformer-based ranking model",
owner="rec-team@streamrec.com",
variants=["control", "treatment"],
traffic_fraction=1.0,
variant_weights=[0.5, 0.5],
primary_metric="daily_engagement_minutes",
secondary_metrics=["daily_sessions", "items_consumed"],
guardrail_metrics=["app_crashes", "latency_p99"],
start_date="2026-03-01",
planned_duration_days=14,
min_detectable_effect=0.02,
use_cuped=True,
sequential_testing=True,
))
print("Experiment registered: StreamRec Ranking V2")
print(f"Active experiments: {list(platform.experiments.keys())}")
Experiment registered: StreamRec Ranking V2
Active experiments: ['exp_ranking_v2']
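The consistency requirement from the assignment-service description can be checked directly. The standalone sketch below (an illustration, not the platform's service) uses a cryptographic digest rather than Python's built-in `hash()`, which is salted per process and would reassign users across restarts:

```python
import hashlib

def stable_variant(user_id: str, experiment_id: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Bucket a user via a cryptographic hash of (user_id, experiment_id).
    Unlike Python's built-in hash(), the digest is identical in every
    process, so a user keeps the same variant across service restarts."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return variants[0] if bucket < 5_000 else variants[1]

# Consistency: the same (user, experiment) pair always maps the same way
assert all(stable_variant("u42", "exp_ranking_v2") ==
           stable_variant("u42", "exp_ranking_v2") for _ in range(5))
# Balance: close to a 50/50 split across many users
share = sum(stable_variant(f"u{i}", "exp_ranking_v2") == "treatment"
            for i in range(20_000)) / 20_000
print(f"Treatment share: {share:.3f}")
```

A systematic deviation of the measured share from the configured split is exactly the sample ratio mismatch signal of Section 33.10.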
The Delta Method for Ratio Metrics
Many experimentation metrics are ratios: revenue per user, sessions per day, click-through rate. The variance of a ratio $R = \bar{Y} / \bar{X}$ cannot be obtained by treating the denominator as a constant, because numerator and denominator are correlated random quantities. The delta method provides the correct first-order variance estimate:
$$\text{Var}(R) \approx \frac{1}{\bar{X}^2}\left[\text{Var}(Y) - 2R\,\text{Cov}(Y, X) + R^2\text{Var}(X)\right]$$
This matters in practice: the naive variance, which ignores the denominator's variability, can underestimate the standard error of a ratio metric by 20-40%, producing overconfident confidence intervals.
def delta_method_ratio_variance(
y: np.ndarray,
x: np.ndarray,
) -> Dict[str, float]:
"""Compute variance of a ratio metric Y/X using the delta method.
Args:
y: Numerator values (e.g., revenue per user-day).
x: Denominator values (e.g., sessions per user-day).
Returns:
Dictionary with ratio estimate, delta method SE, and naive SE.
"""
n = len(y)
y_bar = np.mean(y)
x_bar = np.mean(x)
ratio = y_bar / x_bar
var_y = np.var(y, ddof=1)
var_x = np.var(x, ddof=1)
cov_yx = np.cov(y, x, ddof=1)[0, 1]
# Delta method variance
var_ratio = (1.0 / x_bar ** 2) * (
var_y - 2 * ratio * cov_yx + ratio ** 2 * var_x
) / n
# Naive variance (incorrect)
naive_var = var_y / (x_bar ** 2 * n)
return {
"ratio": float(ratio),
"delta_method_se": float(np.sqrt(var_ratio)),
"naive_se": float(np.sqrt(naive_var)),
"se_underestimation": float(1.0 - np.sqrt(naive_var / var_ratio)),
}
# Example: revenue per session
np.random.seed(42)
n = 10_000
sessions = np.random.poisson(3, n).astype(float) + 1  # At least one session per user
revenue = sessions * np.random.gamma(2, 5, n) # Revenue correlated with sessions
dm = delta_method_ratio_variance(revenue, sessions)
print(f"Revenue per session: {dm['ratio']:.2f}")
print(f"Delta method SE: {dm['delta_method_se']:.4f}")
print(f"Naive SE: {dm['naive_se']:.4f}")
print(f"Naive underestimates SE by: {dm['se_underestimation']:.1%}")
Revenue per session: 9.93
Delta method SE: 0.0842
Naive SE: 0.0697
Naive underestimates SE by: 17.2%
33.14 Building Experimentation Culture
Why Culture Matters More Than Technology
The most sophisticated experimentation platform is useless if the organization does not trust it. Building experimentation culture requires:
- Default to experiment. Every feature launch, algorithm change, and UI modification should be tested. Exceptions require justification (e.g., regulatory mandates, security patches). The cultural shift is from "why should we test this?" to "why should we not test this?"
- Trust the data, not the HiPPO. HiPPO (Highest-Paid Person's Opinion) is the most common threat to experimentation culture. When a VP insists a feature will work and the A/B test shows it does not, the organization must have the discipline to ship the test result, not the VP's conviction. This requires executive sponsorship.
- Accept surprising results. Kohavi et al. (2020) report that at Microsoft, approximately two-thirds of A/B tests show no statistically significant effect, and some show negative effects. An organization that expects every experiment to succeed will stop running experiments when results are "disappointing."
- Invest in education. Engineers, product managers, and executives must understand p-values, confidence intervals, and the peeking problem at a basic level. Misinterpretations ("p = 0.03 means there's a 3% chance the result is a fluke") lead to bad decisions. Training programs, internal documentation, and office hours with the experimentation team are necessary investments.
- Automate guardrails. Automated SRM checks, sequential monitoring dashboards, and interaction detection prevent common mistakes without requiring the experimenter to remember to run the checks manually.
Production ML = Software Engineering: An experimentation platform is a production software system. Assignment consistency, exposure logging reliability, metric pipeline correctness, and statistical computation accuracy are all software engineering problems. The statistical methodology described in this chapter only matters if the engineering infrastructure implements it correctly. A bug in the assignment hash function that causes SRM will produce more incorrect decisions than any statistical subtlety.
33.15 Climate Science: Natural Experiments When You Can't Randomize
The Climate DL Connection
The Climate Deep Learning anchor example (Chapters 1, 4, 8-10, 23, 26, 34) involves predicting climate outcomes using deep learning. But prediction is not causation (Chapter 15). When climate scientists want to estimate the causal effect of an intervention — does banning coal power plants reduce particulate matter? does reforestation lower local temperatures? — they cannot run a randomized experiment. You cannot randomly assign cities to ban coal or randomly assign forests to be planted.
Instead, climate scientists exploit natural experiments: situations where the assignment to "treatment" (a policy change, a natural event) is approximately random or at least exogenous to the outcome. The methods of this chapter — particularly synthetic control and geo-experiments — are directly applicable:
- Volcanic eruptions as natural experiments for aerosol injection effects on temperature
- Policy discontinuities (e.g., the Clean Air Act's county-level applicability thresholds) as regression discontinuity designs
- COVID-19 lockdowns as sudden, exogenous reductions in emissions, enabling before-after comparisons
The analytical challenge is the same: SUTVA may fail (emissions in one county affect air quality in neighboring counties), confounders abound (economic conditions that drove both the policy and the outcome), and the "experiment" was not randomized, so ignorability must be argued rather than designed.
Synthetic control is particularly valuable in climate science because there is often only one treated unit (one country adopts a policy) and many control units (countries that did not adopt). The method constructs a counterfactual from the weighted controls, enabling credible causal inference without randomization.
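As a toy illustration of how synthetic control builds that counterfactual, the sketch below fits non-negative weights on control units to match a treated unit's pre-period series. This is a simplified fit (non-negative least squares followed by normalization; the full method solves a constrained quadratic program), and all data here are simulated:

```python
import numpy as np
from scipy.optimize import nnls

def synthetic_control_weights(treated_pre: np.ndarray,
                              controls_pre: np.ndarray) -> np.ndarray:
    """Non-negative weights on control units matching the treated unit's
    pre-period trajectory, normalized to sum to one. A simplified fit;
    the full method solves a constrained quadratic program."""
    w, _ = nnls(controls_pre, treated_pre)
    if w.sum() > 0:
        w = w / w.sum()
    return w

# Toy data: 3 control countries observed for 10 pre-policy years
rng = np.random.default_rng(0)
controls_pre = rng.normal(15.0, 1.0, size=(10, 3))  # e.g., pollutant levels
true_w = np.array([0.6, 0.3, 0.1])
treated_pre = controls_pre @ true_w + rng.normal(0, 0.05, 10)

w = synthetic_control_weights(treated_pre, controls_pre)
print("Estimated weights:", np.round(w, 2))
```

The fitted weights then extrapolate the controls into the post-policy period to form the counterfactual for the single treated unit.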
33.16 Progressive Project: StreamRec A/B Test Design and Analysis
This section builds on the StreamRec progressive project (Chapters 1-32). You will design, simulate, and analyze a complete A/B test of the transformer-based ranking model (Chapter 10) with the methodological rigor developed in this chapter.
Task 1: Design the Experiment
StreamRec's recommendation team wants to test a new ranking model. The primary metric is daily engagement minutes. The minimum detectable effect (MDE) is 0.5 minutes (1.7% relative to the 30-minute baseline). Pre-experiment engagement has a standard deviation of 15 minutes and a pre-post correlation of $\rho = 0.65$.
def power_analysis(
mde: float,
sigma: float,
alpha: float = 0.05,
power: float = 0.80,
rho: float = 0.0,
) -> Dict[str, int]:
"""Compute required sample size per group for a two-sample z-test.
Optionally accounts for CUPED variance reduction.
Args:
mde: Minimum detectable effect (absolute).
sigma: Standard deviation of the outcome.
alpha: Significance level.
power: Desired statistical power.
rho: Pre-post correlation for CUPED adjustment.
Returns:
Dictionary with sample sizes (raw and CUPED-adjusted).
"""
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)
# Raw sample size per group
n_raw = int(np.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2))
# CUPED-adjusted: effective variance is sigma^2 * (1 - rho^2)
sigma_cuped = sigma * np.sqrt(1 - rho ** 2)
n_cuped = int(np.ceil(2 * (z_alpha + z_beta) ** 2 * sigma_cuped ** 2 / mde ** 2))
return {
"n_per_group_raw": n_raw,
"n_per_group_cuped": n_cuped,
"n_total_raw": 2 * n_raw,
"n_total_cuped": 2 * n_cuped,
"variance_reduction": float(rho ** 2),
"sample_size_reduction": float(1 - n_cuped / n_raw),
}
power = power_analysis(mde=0.5, sigma=15.0, rho=0.65)
print("=== Power Analysis: StreamRec Ranking V2 ===")
print(f"MDE: 0.5 minutes (1.7% relative)")
print(f"Without CUPED: {power['n_per_group_raw']:,} per group ({power['n_total_raw']:,} total)")
print(f"With CUPED (rho=0.65): {power['n_per_group_cuped']:,} per group ({power['n_total_cuped']:,} total)")
print(f"Sample size reduction: {power['sample_size_reduction']:.1%}")
=== Power Analysis: StreamRec Ranking V2 ===
MDE: 0.5 minutes (1.7% relative)
Without CUPED: 14,128 per group (28,256 total)
With CUPED (rho=0.65): 8,159 per group (16,318 total)
Sample size reduction: 42.2%
Task 2: CUPED with Pre-Experiment Engagement
With 12 million monthly active users and CUPED, the experiment reaches adequate power within a single day. The practical bottleneck is not sample size but novelty effects and weekend/weekday cycles — the experiment should run for at least 14 days.
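Translating the required sample into a duration is a one-line calculation; the sketch below floors the answer at the 14-day minimum argued above, and the daily-traffic figure is a hypothetical planning number, not StreamRec telemetry:

```python
import math

def experiment_duration_days(n_total_required: int,
                             daily_eligible_users: int,
                             min_days: int = 14) -> int:
    """Days needed to reach the required sample, floored at a minimum
    duration that covers weekly cycles and novelty burn-in."""
    days_for_power = math.ceil(n_total_required / daily_eligible_users)
    return max(days_for_power, min_days)

# Hypothetical planning figures: ~2M distinct eligible users per day
print(experiment_duration_days(n_total_required=33_000,
                               daily_eligible_users=2_000_000))  # → 14
```

With millions of daily users the novelty floor binds, not the power requirement; a low-traffic surface would flip that.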
Task 3: Concurrent Experiment Interaction Check
Three other experiments run concurrently: notification frequency (exp B), homepage layout (exp C), and onboarding flow (exp D). Check for pairwise interactions between the ranking experiment and each concurrent experiment.
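A scan over the three pairs can reuse the factorial-regression idea from Section 33.12. The compact standalone version below simulates the scan (the effect sizes and experiment names are illustrative) and applies a Bonferroni correction across the three tests:

```python
import numpy as np
from scipy import stats

def interaction_pvalue(y: np.ndarray, a: np.ndarray, b: np.ndarray) -> float:
    """p-value for the A*B term in the factorial regression Y ~ A + B + A*B."""
    X = np.column_stack([np.ones_like(y), a, b, a * b])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 4)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = beta[3] / np.sqrt(cov[3, 3])
    return float(2 * (1 - stats.norm.cdf(abs(z))))

rng = np.random.default_rng(7)
n = 500_000
ranking = rng.binomial(1, 0.5, n).astype(float)  # Experiment A: new ranking
concurrent = {
    "exp_b_notifications": rng.binomial(1, 0.5, n).astype(float),
    "exp_c_homepage": rng.binomial(1, 0.5, n).astype(float),
    "exp_d_onboarding": rng.binomial(1, 0.5, n).astype(float),
}
# Only experiment B truly interacts with the ranking change in this simulation
y = (30.0 + 1.2 * ranking
     - 0.4 * ranking * concurrent["exp_b_notifications"]
     + rng.normal(0, 15.0, n))

bonferroni_alpha = 0.05 / len(concurrent)  # Correct for the 3 pairwise tests
pvals = {name: interaction_pvalue(y, ranking, assign)
         for name, assign in concurrent.items()}
for name, p in pvals.items():
    flag = "INTERACTS" if p < bonferroni_alpha else "ok"
    print(f"{name}: p={p:.4f} [{flag}]")
```

The Bonferroni divisor keeps the familywise false-positive rate of the scan at 5% as the number of concurrent experiments grows.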
Task 4: Sequential Testing for Early Stopping
Implement the mSPRT sequential test and monitor daily. If the treatment effect is clearly positive (or clearly negative) before day 14, the experiment can be stopped early with valid inference.
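For Task 4, the heart of the mSPRT is the mixture likelihood ratio, shown standalone below under a known-variance normal model (a sketch: `tau2` is a tuning parameter chosen before the experiment, and the stream is monitored until the ratio first exceeds $1/\alpha$):

```python
import numpy as np

def msprt_lr(mean_diff: float, obs_var: float, n: int, tau2: float) -> float:
    """Mixture likelihood ratio for a running mean difference.

    mean_diff: current treatment-minus-control mean difference
    obs_var:   variance of one difference observation (assumed known)
    n:         observations accumulated so far
    tau2:      variance of the N(0, tau2) mixing prior over effects
    Reject H0 (no effect) the first time the ratio exceeds 1/alpha.
    """
    v = obs_var / n  # Variance of the running mean difference
    return float(np.sqrt(v / (v + tau2))
                 * np.exp(tau2 * mean_diff ** 2 / (2 * v * (v + tau2))))

alpha = 0.05
# A clear 0.3-unit effect pushes the ratio far past the 1/alpha boundary...
lr_effect = msprt_lr(mean_diff=0.3, obs_var=2.0, n=1000, tau2=0.25)
# ...while a zero observed difference keeps it below 1
lr_null = msprt_lr(mean_diff=0.0, obs_var=2.0, n=1000, tau2=0.25)
print(f"LR with effect: {lr_effect:.3g}, LR under null: {lr_null:.3f}")
```

Because the ratio is a martingale under the null, checking it after every day's data does not inflate the type I error — this is what makes daily monitoring valid.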
# Progressive project scaffold: combine all components
@dataclass
class StreamRecExperimentReport:
"""Complete experiment analysis report for StreamRec.
Combines CUPED-adjusted analysis, SRM check, interaction
detection, sequential monitoring, and novelty detection into
a single report.
"""
experiment_id: str
analysis_date: str
# Core results
cuped_result: CUPEDResult
srm_check: Dict[str, object]
sequential_status: SequentialTestResult
novelty_detection: Dict[str, object]
# Interactions with concurrent experiments
interactions: Dict[str, Dict[str, float]]
# Decision
recommendation: str
def summary(self) -> str:
"""Generate a human-readable summary of the experiment."""
lines = [
f"=== Experiment Report: {self.experiment_id} ===",
f"Date: {self.analysis_date}",
f"",
f"Primary Metric (CUPED-adjusted):",
f" Effect: {self.cuped_result.cuped_result.effect:.4f}",
f" SE: {self.cuped_result.cuped_result.se:.4f}",
f" 95% CI: [{self.cuped_result.cuped_result.ci_lower:.4f}, "
f"{self.cuped_result.cuped_result.ci_upper:.4f}]",
f" p-value: {self.cuped_result.cuped_result.p_value:.6f}",
f" Variance reduction: {self.cuped_result.variance_reduction:.1%}",
f"",
f"SRM: {'DETECTED' if self.srm_check['srm_detected'] else 'OK'}",
f"Sequential: {'STOPPED (reject)' if self.sequential_status.reject_null else 'Continue'}",
f"Novelty: {self.novelty_detection.get('classification', 'N/A')}",
f"",
f"Recommendation: {self.recommendation}",
]
return "\n".join(lines)
# Example report
print("StreamRec experiment report structure registered.")
print("Students implement full pipeline in progressive project exercises.")
StreamRec experiment report structure registered.
Students implement full pipeline in progressive project exercises.
33.17 Summary
This chapter covered the gap between textbook A/B testing and production experimentation at scale. The core lessons are:
Interference is everywhere. When users interact, SUTVA fails, and the standard difference-in-means estimator is biased. Cluster randomization (Section 33.4) and switchback designs (Section 33.5) restore validity at the cost of statistical efficiency. Synthetic control (Section 33.6) enables causal inference with very few treated units. The choice of design depends on the interference structure.
Variance reduction is free power. CUPED (Section 33.7) reduces variance by 30-50% using pre-experiment data, equivalent to running an experiment 1.5-2× longer. It is unbiased and should be the default in every experimentation platform.
Multiple testing inflates false positives. With $m$ comparisons, the familywise error rate approaches 1.0. Benjamini-Hochberg (Section 33.8) controls the false discovery rate — a more useful guarantee than Bonferroni's strict FWER control in large-scale experimentation.
Peeking is not free. Continuous monitoring with fixed-horizon tests inflates the type I error from 5% to 20%+. Sequential testing with always-valid p-values and confidence sequences (Section 33.9) provides valid inference under continuous monitoring.
SRM is a data quality alarm. A significant sample ratio mismatch (Section 33.10) means the randomization mechanism is broken. The experiment results are unreliable until the root cause is found. Every experiment should have an automated SRM check.
Novelty fades. Short experiments confound transient novelty/primacy effects with the long-run treatment effect (Section 33.11). Run experiments for at least 2-4 weeks for major changes.
Experiments interact. Concurrent experiments on the same user base can interact (Section 33.12). Factorial regression detects pairwise interactions. The experimentation platform should flag experiments that share the same user population and metric space.
Culture determines whether any of this matters. The statistical methodology is necessary but not sufficient. Organizations that ship based on HiPPO, ignore inconvenient results, or run experiments without guardrails will not benefit from better p-value calculations. Experimentation culture (Section 33.14) — default to experiment, trust the data, accept surprising results — is the ultimate variance reduction technique.