Chapter 33: Rigorous Experimentation at Scale — Multi-Armed Bandits, Interference Effects, and Experimentation Platforms

"Most A/B tests are wrong. Not because the math is wrong, but because the assumptions are." — Ron Kohavi, Diane Tang, and Ya Xu, "Trustworthy Online Controlled Experiments" (2020)


Learning Objectives

By the end of this chapter, you will be able to:

  1. Design experiments that handle interference effects (network effects, marketplace dynamics) where the Stable Unit Treatment Value Assumption (SUTVA) is violated, using cluster-randomized and switchback designs
  2. Implement cluster-randomized and switchback experimental designs with appropriate variance estimation
  3. Apply CUPED and stratification for variance reduction, deriving and implementing techniques that can increase effective sample size by 30-50% — gains that make them essential at scale
  4. Build experimentation platform components that manage concurrent experiments with interaction detection
  5. Handle peeking (continuous monitoring), multiple testing corrections (Bonferroni, Benjamini-Hochberg), novelty/primacy effects, and sample ratio mismatch (SRM) diagnostics

33.1 Why This Chapter Goes Far Beyond "A/B Testing"

An introductory statistics course covers the two-sample t-test. An intermediate data science course covers A/B testing: random assignment, sample size calculation, hypothesis testing, confidence intervals. These foundations are necessary but woefully insufficient for experimentation at scale.

Consider StreamRec, the content recommendation platform (Progressive Project, Chapters 6-32). The recommendation team wants to test a new ranking algorithm. They randomly assign 50% of users to the new algorithm (treatment) and 50% to the existing algorithm (control). After two weeks, they compute the difference in daily engagement time. Treatment users engage 1.2 minutes more per day, p = 0.03. Ship it?

Not so fast. Five problems undermine this result:

  1. Interference. StreamRec users share content with friends. When a treatment user shares a piece of content surfaced by the new algorithm, their control-group friend may engage with it too — inflating the control group's engagement and deflating the measured treatment effect. The experiment's causal estimate is biased toward zero.

  2. Interactions. With 12 million users and a true effect of 1.2 minutes, the standard error is small and the result is statistically significant. But the team ran 47 other experiments simultaneously. The new ranking algorithm interacts with a concurrent experiment on notification frequency. The measured effect is a composite of the ranking change and the notification interaction — the true ranking effect may be 0.4 minutes, not 1.2.

  3. Peeking. The analyst checked the results daily for 14 days. The p-value crossed 0.05 on day 8, briefly rose above it on days 9-10, then settled below 0.05 by day 14. Under continuous monitoring, the true false positive rate is not 5% but 26% (Johari et al., 2017). The result may be a false positive amplified by optional stopping.

  4. Novelty. The new algorithm surfaces different content types. Users who have been on the platform for years engage more with the novelty — but this effect fades as the content becomes familiar. The measured 1.2-minute lift is a blend of a transient novelty effect and a smaller (possibly zero) long-run effect.

  5. Sample ratio mismatch (SRM). The experiment was configured for a 50/50 split, but the actual counts are 6,012,847 treatment and 5,987,153 control. A chi-squared test gives p < 0.001 — the split is not 50/50. Something in the assignment mechanism is broken, and the groups are no longer comparable.
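The SRM diagnostic in item 5 is a chi-squared goodness-of-fit test on the assignment counts. A minimal sketch with scipy, using the counts from the example above:

```python
import numpy as np
from scipy import stats

# Observed assignment counts from the example above (treatment, control)
observed = np.array([6_012_847, 5_987_153])
expected = np.full(2, observed.sum() / 2)  # 50/50 design

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
# A tiny p-value flags sample ratio mismatch: the assignment mechanism
# is broken, so debug it before trusting any outcome comparison.
```

A split this lopsided yields a chi-squared statistic around 55 and a vanishingly small p-value, confirming the assignment mechanism is not delivering 50/50.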

This chapter addresses all five problems with mathematical rigor and production-grade code. We begin with the most fundamental: interference.

Prediction $\neq$ Causation: This chapter is the bridge between causal inference (Part III) and production systems (Part V). Chapters 15-19 established the theoretical foundations — potential outcomes, causal graphs, estimation methods. This chapter operationalizes them: how do you run a valid causal experiment when SUTVA is violated, when you cannot wait for fixed-horizon results, and when you must manage 50 concurrent experiments on the same user base?

Know How Your Model Is Wrong: Every experiment makes assumptions. The standard A/B test assumes SUTVA (no interference), fixed-horizon evaluation (no peeking), a single comparison (no multiple testing), and stationary treatment effects (no novelty). Violating these assumptions does not produce error messages — it produces confident but wrong conclusions. This chapter teaches you to recognize which assumptions your experiment violates and to apply the correct remedy.


33.2 The Standard A/B Test and Its Assumptions

Before diagnosing what goes wrong, we formalize what must go right. The standard two-sample A/B test rests on three assumptions from the potential outcomes framework (Chapter 16):

Assumption 1: SUTVA (Stable Unit Treatment Value Assumption)

The potential outcome for unit $i$ depends only on unit $i$'s own treatment assignment, not on the assignments of other units:

$$Y_i(D_1, D_2, \ldots, D_N) = Y_i(D_i)$$

In words: my outcome does not change when you are assigned to treatment or control. This rules out interference, spillover, and network effects.

Assumption 2: Ignorability (Random Assignment)

Treatment assignment $D_i$ is independent of potential outcomes, conditional on covariates:

$$\{Y_i(0), Y_i(1)\} \perp\!\!\!\perp D_i \mid X_i$$

In a randomized experiment, this holds by design (unconditionally, without needing to condition on $X_i$).

Assumption 3: Positivity

Every unit has a nonzero probability of being assigned to each condition:

$$0 < P(D_i = 1 \mid X_i) < 1 \quad \text{for all } X_i$$

In a 50/50 A/B test, this is trivially satisfied.

Under these three assumptions, the difference-in-means estimator is unbiased for the Average Treatment Effect (ATE):

$$\hat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \frac{1}{n_1}\sum_{i: D_i=1} Y_i - \frac{1}{n_0}\sum_{i: D_i=0} Y_i$$

with variance:

$$\text{Var}(\hat{\tau}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}$$

where $\sigma_d^2$ is the variance of outcomes in group $d$.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional
import numpy as np
from scipy import stats


@dataclass
class ABTestResult:
    """Result of a standard two-sample A/B test.

    Stores point estimates, confidence intervals, and test statistics
    for a simple difference-in-means comparison.

    Attributes:
        metric_name: Name of the metric being tested.
        n_treatment: Number of units in the treatment group.
        n_control: Number of units in the control group.
        mean_treatment: Sample mean in the treatment group.
        mean_control: Sample mean in the control group.
        effect: Estimated treatment effect (difference in means).
        se: Standard error of the effect estimate.
        ci_lower: Lower bound of the 95% confidence interval.
        ci_upper: Upper bound of the 95% confidence interval.
        z_stat: Z-statistic for the null hypothesis of zero effect.
        p_value: Two-sided p-value.
    """

    metric_name: str
    n_treatment: int
    n_control: int
    mean_treatment: float
    mean_control: float
    effect: float
    se: float
    ci_lower: float
    ci_upper: float
    z_stat: float
    p_value: float

    @property
    def significant(self) -> bool:
        """Whether the result is statistically significant at alpha=0.05."""
        return self.p_value < 0.05

    @property
    def relative_effect(self) -> float:
        """Relative treatment effect as a percentage of the control mean."""
        if self.mean_control == 0:
            return float("inf") if self.effect != 0 else 0.0
        return 100.0 * self.effect / self.mean_control


def run_ab_test(
    y_treatment: np.ndarray,
    y_control: np.ndarray,
    metric_name: str = "metric",
    alpha: float = 0.05,
) -> ABTestResult:
    """Run a standard two-sample z-test for an A/B experiment.

    Uses the normal approximation (valid for large samples by CLT).
    For small samples, use a t-test instead.

    Args:
        y_treatment: Outcome values for treatment group.
        y_control: Outcome values for control group.
        metric_name: Human-readable name for the metric.
        alpha: Significance level for the confidence interval.

    Returns:
        ABTestResult with all test statistics and confidence intervals.
    """
    n_t = len(y_treatment)
    n_c = len(y_control)
    mean_t = np.mean(y_treatment)
    mean_c = np.mean(y_control)
    var_t = np.var(y_treatment, ddof=1)
    var_c = np.var(y_control, ddof=1)

    effect = mean_t - mean_c
    se = np.sqrt(var_t / n_t + var_c / n_c)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    z_stat = effect / se if se > 0 else 0.0
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(z_stat)))

    return ABTestResult(
        metric_name=metric_name,
        n_treatment=n_t,
        n_control=n_c,
        mean_treatment=float(mean_t),
        mean_control=float(mean_c),
        effect=float(effect),
        se=float(se),
        ci_lower=float(effect - z_crit * se),
        ci_upper=float(effect + z_crit * se),
        z_stat=float(z_stat),
        p_value=float(p_value),
    )


# Demonstration: A clean A/B test (no interference, no peeking)
np.random.seed(42)
n_per_group = 50_000
true_effect = 1.2  # minutes of engagement

y_control = np.random.normal(loc=30.0, scale=15.0, size=n_per_group)
y_treatment = np.random.normal(loc=30.0 + true_effect, scale=15.0, size=n_per_group)

result = run_ab_test(y_treatment, y_control, metric_name="daily_engagement_minutes")
print(f"Effect: {result.effect:.3f} min ({result.relative_effect:.2f}% relative)")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"p-value: {result.p_value:.6f}")
print(f"Significant: {result.significant}")

Effect: 1.261 min (4.18% relative)
95% CI: [1.075, 1.447]
p-value: 0.000000
Significant: True

In this idealized setting, the test works perfectly. The estimated effect is close to the true 1.2 minutes, the confidence interval covers the truth, and the test is highly significant.

Now we systematically break each assumption.


33.3 Interference: When SUTVA Fails

Why Interference Is the Rule, Not the Exception

SUTVA requires that my outcome depends only on my treatment assignment. This fails whenever units interact:

| Setting | Interference mechanism |
| --- | --- |
| Social networks | A treated user shares content with a control user, changing the control user's behavior |
| Two-sided marketplaces | Treating riders changes driver availability, affecting control riders' wait times |
| Recommendation systems | Recommending item X to treatment users reduces inventory for control users (or changes trending signals) |
| Pricing experiments | A discounted price in treatment attracts customers who would have purchased in control |
| Workplace interventions | Treating one employee changes team dynamics for untreated teammates |

In every case, the treatment effect on unit $i$ depends not just on $D_i$ but on the treatment assignments $\mathbf{D}_{-i}$ of other units. The individual-level ATE is no longer well-defined without specifying the treatment assignment vector for all units.

Formalizing Interference

Under interference, unit $i$'s potential outcome depends on the full treatment vector:

$$Y_i = Y_i(D_1, D_2, \ldots, D_N)$$

This is intractable — with $N$ units and binary treatment, there are $2^N$ possible potential outcomes per unit. To make progress, we need structure.

The most common structural assumption is partial interference (Sobel, 2006; Hudgens and Halloran, 2008): the population can be partitioned into clusters $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K$ such that interference occurs within clusters but not between clusters. Formally:

$$Y_i(\mathbf{D}) = Y_i(\mathbf{D}_{\mathcal{C}(i)})$$

where $\mathcal{C}(i)$ is the cluster containing unit $i$ and $\mathbf{D}_{\mathcal{C}(i)}$ is the treatment assignment vector restricted to that cluster.

For StreamRec, a natural cluster definition is the user's social graph neighborhood: a user and the friends they share content with. Interference occurs within friend groups but (approximately) not between disconnected groups.

The Bias from Ignoring Interference

When SUTVA fails, the standard difference-in-means estimator is biased. The direction of bias depends on the interference mechanism:

  • Positive spillover (sharing beneficial content): control users benefit from treatment users, making the control group better than it would be without any treatment. The treatment effect is underestimated.
  • Negative spillover (competition for resources): control users are harmed by treatment, making the control group worse. The treatment effect is overestimated.

def simulate_interference_bias(
    n_users: int = 100_000,
    n_friends_mean: int = 8,
    true_direct_effect: float = 1.5,
    spillover_rate: float = 0.3,
    spillover_magnitude: float = 0.5,
    treatment_prob: float = 0.5,
    seed: int = 42,
) -> Dict[str, float]:
    """Simulate A/B test with social network interference.

    Treatment users share content with friends. If a control-group friend
    receives shared content, their engagement increases (positive spillover).

    Args:
        n_users: Total number of users.
        n_friends_mean: Mean number of friends per user (Poisson).
        true_direct_effect: Direct causal effect of treatment on the treated.
        spillover_rate: Probability that a treated user shares with each friend.
        spillover_magnitude: Engagement boost for a control user who receives
            shared content from a treated friend.
        treatment_prob: Probability of assignment to treatment.
        seed: Random seed for reproducibility.

    Returns:
        Dictionary with true direct effect, true total effect (direct +
        spillover), naive estimated effect, and the bias.
    """
    rng = np.random.default_rng(seed)

    # Generate social network (each user has Poisson(n_friends_mean) friends)
    n_friends = rng.poisson(n_friends_mean, size=n_users)
    # For simplicity, sample friends uniformly (not a real social network)
    friends = [
        rng.choice(n_users, size=min(nf, n_users - 1), replace=False)
        for nf in n_friends
    ]

    # Random assignment
    treatment = rng.random(n_users) < treatment_prob

    # Baseline engagement (before treatment effects)
    baseline = rng.normal(30.0, 15.0, size=n_users)

    # Direct treatment effect
    direct_effect = np.where(treatment, true_direct_effect, 0.0)

    # Spillover: for each control user, count treated friends who share
    spillover_effect = np.zeros(n_users)
    for i in range(n_users):
        if not treatment[i]:  # Only control users receive spillover
            treated_friends = friends[i][treatment[friends[i]]]
            # Each treated friend shares with probability spillover_rate
            n_sharing = rng.binomial(len(treated_friends), spillover_rate)
            if n_sharing > 0:
                spillover_effect[i] = spillover_magnitude * min(n_sharing, 3)

    # Observed outcomes
    y = baseline + direct_effect + spillover_effect + rng.normal(0, 2.0, n_users)

    # Naive estimator (ignores interference)
    naive_effect = np.mean(y[treatment]) - np.mean(y[~treatment])

    # True quantities
    mean_spillover_on_control = np.mean(spillover_effect[~treatment])

    return {
        "true_direct_effect": true_direct_effect,
        "mean_spillover_on_control": float(mean_spillover_on_control),
        "true_total_effect": true_direct_effect + mean_spillover_on_control,
        "naive_estimated_effect": float(naive_effect),
        "bias": float(naive_effect - true_direct_effect),
        "bias_direction": "underestimate (positive spillover)",
    }


interference_result = simulate_interference_bias()
for key, value in interference_result.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

  true_direct_effect: 1.5000
  mean_spillover_on_control: 0.4832
  true_total_effect: 1.9832
  naive_estimated_effect: 1.0219
  bias: -0.4781
  bias_direction: underestimate (positive spillover)

The positive spillover inflates the control group mean by ~0.48 minutes, causing the naive estimator to underestimate the true direct effect by ~32%. This is not a small bias — and it is invisible without an interference-aware design.


33.4 Cluster-Randomized Experiments

The Key Idea

If interference occurs within clusters (friend groups, geographic regions, marketplace zones) but not between them, we can restore a valid causal estimate by randomizing at the cluster level instead of the unit level. Entire clusters are assigned to treatment or control, so all units within a cluster receive the same assignment. Interference within a cluster is allowed because it affects units who share the same assignment — it does not contaminate the control group.

Design

Let $K$ clusters be indexed by $k = 1, \ldots, K$. Cluster $k$ contains $n_k$ units. We assign each cluster to treatment ($D_k = 1$) or control ($D_k = 0$) with probability $\pi$. The cluster-level mean outcome is:

$$\bar{Y}_k = \frac{1}{n_k} \sum_{i \in \mathcal{C}_k} Y_i$$

The estimator for the ATE is the difference in cluster-level means, weighted by cluster size:

$$\hat{\tau}_{\text{cluster}} = \frac{\sum_{k: D_k=1} n_k \bar{Y}_k}{\sum_{k: D_k=1} n_k} - \frac{\sum_{k: D_k=0} n_k \bar{Y}_k}{\sum_{k: D_k=0} n_k}$$

Variance and the Design Effect

Cluster randomization uses information less efficiently than individual randomization. The variance of the cluster-level estimator is inflated by the intracluster correlation coefficient (ICC), $\rho$:

$$\text{Var}(\hat{\tau}_{\text{cluster}}) \approx \left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}\right) \times [1 + (m - 1)\rho]$$

where $m$ is the average cluster size. The factor $\text{DEFF} = 1 + (m - 1)\rho$ is the design effect — the multiplicative inflation in variance relative to individual randomization. For StreamRec, if friend clusters average $m = 10$ users and the ICC is $\rho = 0.05$, the design effect is $1 + 9 \times 0.05 = 1.45$. The experiment needs 45% more users to achieve the same power.

This is the fundamental tradeoff: cluster randomization eliminates interference bias at the cost of statistical efficiency. In practice, the bias from ignoring interference (Section 33.3: ~32% underestimation) far exceeds the efficiency loss from clustering.
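To make the power cost concrete, the standard two-sample size formula can be inflated by DEFF. A sketch using the StreamRec numbers from above (sigma = 15, minimum detectable effect 1.5 minutes, m = 10, rho = 0.05); the helper name is mine, not a library function:

```python
import numpy as np
from scipy import stats

def required_n_per_arm(sigma: float, delta: float, m: float, rho: float,
                       alpha: float = 0.05, power: float = 0.8):
    """Per-arm sample size for a two-sample z-test, inflated by the design effect."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n_srs = 2 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2  # individual randomization
    deff = 1 + (m - 1) * rho  # variance inflation from clustering
    return n_srs, n_srs * deff, deff

n_srs, n_cluster, deff = required_n_per_arm(sigma=15.0, delta=1.5, m=10, rho=0.05)
print(f"Individual randomization: {n_srs:,.0f} users per arm")
print(f"Cluster randomization:    {n_cluster:,.0f} users per arm (DEFF = {deff:.2f})")
```

With these inputs the clustered design needs roughly 45% more users per arm — exactly the DEFF of 1.45 derived above.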

@dataclass
class ClusterRandomizedDesign:
    """Design and analysis of a cluster-randomized experiment.

    Clusters (e.g., friend groups, geographic regions) are randomly
    assigned to treatment or control. All units within a cluster
    receive the same assignment, eliminating between-group interference.

    Attributes:
        cluster_ids: Array of cluster assignments for each unit.
        treatment: Boolean array of treatment assignment for each unit.
        outcomes: Array of observed outcomes.
    """

    cluster_ids: np.ndarray
    treatment: np.ndarray
    outcomes: np.ndarray

    @property
    def n_clusters(self) -> int:
        return len(np.unique(self.cluster_ids))

    @property
    def cluster_sizes(self) -> np.ndarray:
        _, counts = np.unique(self.cluster_ids, return_counts=True)
        return counts

    def cluster_means(
        self,
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
        """Compute mean outcome and cluster size per cluster, split by treatment."""
        clusters = np.unique(self.cluster_ids)
        treated_means = []
        control_means = []
        treated_sizes = []
        control_sizes = []

        for c in clusters:
            mask = self.cluster_ids == c
            cluster_outcomes = self.outcomes[mask]
            cluster_treatment = self.treatment[mask]

            # All units in a cluster should have the same assignment
            if cluster_treatment[0]:
                treated_means.append(np.mean(cluster_outcomes))
                treated_sizes.append(len(cluster_outcomes))
            else:
                control_means.append(np.mean(cluster_outcomes))
                control_sizes.append(len(cluster_outcomes))

        return (
            np.array(treated_means),
            np.array(control_means),
            np.array(treated_sizes),
            np.array(control_sizes),
        )

    def estimate_effect(self) -> Dict[str, float]:
        """Estimate the ATE using cluster-level analysis.

        Uses weighted cluster means and cluster-robust variance estimation.
        """
        t_means, c_means, t_sizes, c_sizes = self.cluster_means()

        # Weighted means
        mean_t = np.average(t_means, weights=t_sizes)
        mean_c = np.average(c_means, weights=c_sizes)
        effect = mean_t - mean_c

        # Cluster-robust SE (using cluster means as independent observations)
        k_t = len(t_means)
        k_c = len(c_means)
        var_t = np.var(t_means, ddof=1) if k_t > 1 else 0.0
        var_c = np.var(c_means, ddof=1) if k_c > 1 else 0.0
        se = np.sqrt(var_t / k_t + var_c / k_c)

        # ICC estimate
        icc = self._estimate_icc()

        # Design effect
        m_bar = np.mean(np.concatenate([t_sizes, c_sizes]))
        deff = 1.0 + (m_bar - 1.0) * max(icc, 0.0)

        z_stat = effect / se if se > 0 else 0.0
        p_value = 2.0 * (1.0 - stats.norm.cdf(abs(z_stat)))

        return {
            "effect": float(effect),
            "se": float(se),
            "ci_lower": float(effect - 1.96 * se),
            "ci_upper": float(effect + 1.96 * se),
            "z_stat": float(z_stat),
            "p_value": float(p_value),
            "n_clusters_treatment": k_t,
            "n_clusters_control": k_c,
            "icc": float(icc),
            "design_effect": float(deff),
            "mean_cluster_size": float(m_bar),
        }

    def _estimate_icc(self) -> float:
        """Estimate intracluster correlation using one-way ANOVA."""
        clusters = np.unique(self.cluster_ids)
        grand_mean = np.mean(self.outcomes)
        n_total = len(self.outcomes)
        k = len(clusters)

        # Between-cluster sum of squares
        ss_between = sum(
            np.sum(self.cluster_ids == c)
            * (np.mean(self.outcomes[self.cluster_ids == c]) - grand_mean) ** 2
            for c in clusters
        )

        # Within-cluster sum of squares
        ss_within = sum(
            np.sum((self.outcomes[self.cluster_ids == c]
                     - np.mean(self.outcomes[self.cluster_ids == c])) ** 2)
            for c in clusters
        )

        ms_between = ss_between / max(k - 1, 1)
        ms_within = ss_within / max(n_total - k, 1)

        # Average cluster size (harmonic mean for unbalanced designs)
        sizes = np.array([np.sum(self.cluster_ids == c) for c in clusters])
        m_0 = (n_total - np.sum(sizes ** 2) / n_total) / max(k - 1, 1)

        icc = (ms_between - ms_within) / (ms_between + (m_0 - 1) * ms_within)
        return float(np.clip(icc, -1.0 / (np.max(sizes) - 1), 1.0))


# Simulate a cluster-randomized experiment for StreamRec
def simulate_cluster_randomized_experiment(
    n_clusters: int = 2000,
    cluster_size_mean: int = 10,
    true_effect: float = 1.5,
    icc: float = 0.05,
    seed: int = 42,
) -> ClusterRandomizedDesign:
    """Generate synthetic data for a cluster-randomized experiment."""
    rng = np.random.default_rng(seed)

    # Generate cluster sizes (Poisson)
    cluster_sizes = rng.poisson(cluster_size_mean, size=n_clusters)
    cluster_sizes = np.maximum(cluster_sizes, 2)  # At least 2 per cluster
    n_total = np.sum(cluster_sizes)

    # Assign clusters to treatment (50/50)
    cluster_treatment = rng.random(n_clusters) < 0.5

    # Generate outcomes with ICC structure
    # ICC = sigma_between^2 / (sigma_between^2 + sigma_within^2)
    sigma_total = 15.0
    sigma_between = sigma_total * np.sqrt(icc)
    sigma_within = sigma_total * np.sqrt(1.0 - icc)

    cluster_ids = np.repeat(np.arange(n_clusters), cluster_sizes)
    treatment = np.repeat(cluster_treatment, cluster_sizes)

    # Cluster random effects
    cluster_effects = rng.normal(0, sigma_between, size=n_clusters)

    # Individual outcomes
    outcomes = np.zeros(n_total)
    idx = 0
    for k in range(n_clusters):
        n_k = cluster_sizes[k]
        base = 30.0 + cluster_effects[k]
        if cluster_treatment[k]:
            base += true_effect
        outcomes[idx : idx + n_k] = (
            base + rng.normal(0, sigma_within, size=n_k)
        )
        idx += n_k

    return ClusterRandomizedDesign(
        cluster_ids=cluster_ids,
        treatment=treatment,
        outcomes=outcomes,
    )


crd = simulate_cluster_randomized_experiment()
crd_result = crd.estimate_effect()
print("Cluster-Randomized Experiment Results:")
print(f"  Effect: {crd_result['effect']:.4f}")
print(f"  SE: {crd_result['se']:.4f}")
print(f"  95% CI: [{crd_result['ci_lower']:.4f}, {crd_result['ci_upper']:.4f}]")
print(f"  ICC: {crd_result['icc']:.4f}")
print(f"  Design effect: {crd_result['design_effect']:.2f}")
print(f"  N clusters (T/C): {crd_result['n_clusters_treatment']}/{crd_result['n_clusters_control']}")

Cluster-Randomized Experiment Results:
  Effect: 1.4623
  SE: 0.3427
  95% CI: [0.7906, 2.1340]
  ICC: 0.0491
  Design effect: 1.44
  N clusters (T/C): 999/1001

The estimate (1.46) is close to the true effect (1.50), and the confidence interval is wider than an individual-level experiment would produce — the price of removing interference bias.


33.5 Switchback Designs

When Clusters Don't Work

Cluster randomization requires that clusters be large enough to internalize interference and numerous enough for adequate power. In some settings — particularly two-sided marketplaces where supply and demand are global — no cluster partition satisfies both requirements.

Consider a ride-sharing marketplace. If you randomize riders to treatment/control but drivers serve both groups, a pricing change for treatment riders affects driver availability for control riders. You could cluster by geographic region, but drivers move between regions. You could cluster by time of day, but demand is not independent across hours.

The switchback design (also called a time-series experiment or crossover design) addresses this by alternating the entire population between treatment and control across time periods. In period $t$, everyone is treated; in period $t+1$, everyone is in control; and so on. The treatment effect is estimated by comparing outcomes in treatment periods to outcomes in control periods.

Formal Setup

Divide the experimental period into $T$ discrete time periods. In each period $t$, the entire population receives assignment $D_t \in \{0, 1\}$, randomized independently across periods. The mean outcome in period $t$ is $\bar{Y}_t$. The estimator is:

$$\hat{\tau}_{\text{switchback}} = \frac{1}{|\{t: D_t = 1\}|}\sum_{t: D_t = 1} \bar{Y}_t - \frac{1}{|\{t: D_t = 0\}|}\sum_{t: D_t = 0} \bar{Y}_t$$

Carryover and Autocorrelation

The main threat to switchback designs is carryover: the effect of treatment in period $t$ leaking into period $t + 1$. If the new algorithm takes time to affect user behavior (learning effects) or if the system has inertia (cached recommendations), the outcome in period $t+1$ depends on the assignment in period $t$, violating the independence assumption.

Mitigations include:

  1. Washout periods: Insert buffer periods between treatment alternations and exclude them from the analysis. This sacrifices statistical power but reduces carryover contamination.
  2. Longer periods: Longer treatment/control blocks reduce the fraction of time affected by carryover.
  3. Regression adjustment: Model the carryover explicitly as a function of previous-period assignment.
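Mitigation 3 is implemented in the class that follows. A related analysis-side variant of mitigation 1 is easy to sketch: drop each period that immediately follows a switch, so every retained period sits in a steady-state assignment. Note that under one-period carryover this estimator targets the steady-state effect (direct plus carryover), not the instantaneous direct effect. The function name is illustrative:

```python
import numpy as np

def washout_estimate(period_outcomes, period_assignments, period_sizes):
    """Difference in means after dropping the first period following each switch.

    Retained periods share their predecessor's assignment, so (under
    one-period carryover) carryover aligns with the current treatment and
    the estimator targets the steady-state effect: direct + carryover.
    """
    a = np.asarray(period_assignments)
    y = np.asarray(period_outcomes)
    w = np.asarray(period_sizes, dtype=float)

    keep = np.zeros(len(a), dtype=bool)
    keep[1:] = a[1:] == a[:-1]  # drop period 0 and every post-switch period

    y, w, d = y[keep], w[keep], a[keep]
    return (np.average(y[d == 1], weights=w[d == 1])
            - np.average(y[d == 0], weights=w[d == 0]))

# Deterministic check: direct effect 0.8, one-period carryover 0.2, no noise
a = np.array([0, 0, 1, 1, 0, 0, 1, 1])
prev = np.concatenate([[0], a[:-1]])
y = 30.0 + 0.8 * a + 0.2 * prev
est = washout_estimate(y, a, np.ones(8))
print(f"{est:.3f}")  # prints 1.000 — the steady-state (direct + carryover) effect
```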

@dataclass
class SwitchbackDesign:
    """Design and analysis of a switchback (crossover) experiment.

    The entire population alternates between treatment and control
    across time periods. Appropriate when unit-level or cluster-level
    randomization is infeasible due to global interference.

    Attributes:
        period_outcomes: Mean outcome per time period.
        period_assignments: Treatment assignment (0 or 1) per period.
        period_sizes: Number of observations per period.
    """

    period_outcomes: np.ndarray
    period_assignments: np.ndarray
    period_sizes: np.ndarray

    def estimate_effect(self, adjust_carryover: bool = False) -> Dict[str, float]:
        """Estimate treatment effect from switchback data.

        Args:
            adjust_carryover: If True, include previous-period assignment
                as a regression covariate to adjust for carryover.

        Returns:
            Dictionary with effect estimate, SE, CI, and diagnostics.
        """
        if not adjust_carryover:
            treated_mask = self.period_assignments == 1
            control_mask = self.period_assignments == 0

            mean_t = np.average(
                self.period_outcomes[treated_mask],
                weights=self.period_sizes[treated_mask],
            )
            mean_c = np.average(
                self.period_outcomes[control_mask],
                weights=self.period_sizes[control_mask],
            )
            effect = mean_t - mean_c

            # SE treating period means as independent observations; with
            # strong autocorrelation, a Newey-West (HAC) SE is preferable.
            n_t = np.sum(treated_mask)
            n_c = np.sum(control_mask)
            var_t = np.var(self.period_outcomes[treated_mask], ddof=1)
            var_c = np.var(self.period_outcomes[control_mask], ddof=1)
            se = np.sqrt(var_t / max(n_t, 1) + var_c / max(n_c, 1))
        else:
            # Regression: Y_t = alpha + tau * D_t + gamma * D_{t-1} + epsilon_t
            T = len(self.period_outcomes)
            Y = self.period_outcomes[1:]  # Drop first period (no lag)
            D_curr = self.period_assignments[1:].astype(float)
            D_prev = self.period_assignments[:-1].astype(float)

            X = np.column_stack([np.ones(T - 1), D_curr, D_prev])
            # Weighted least squares by period size
            W = np.diag(self.period_sizes[1:].astype(float))
            XtWX = X.T @ W @ X
            XtWY = X.T @ W @ Y
            beta = np.linalg.solve(XtWX, XtWY)

            residuals = Y - X @ beta
            sigma2 = np.average(residuals ** 2, weights=self.period_sizes[1:])
            cov_beta = sigma2 * np.linalg.inv(XtWX)

            effect = float(beta[1])
            se = float(np.sqrt(cov_beta[1, 1]))

        z_stat = effect / se if se > 0 else 0.0
        p_value = 2.0 * (1.0 - stats.norm.cdf(abs(z_stat)))

        result = {
            "effect": float(effect),
            "se": float(se),
            "ci_lower": float(effect - 1.96 * se),
            "ci_upper": float(effect + 1.96 * se),
            "z_stat": float(z_stat),
            "p_value": float(p_value),
            "n_periods": len(self.period_outcomes),
            "n_treatment_periods": int(np.sum(self.period_assignments == 1)),
            "n_control_periods": int(np.sum(self.period_assignments == 0)),
        }

        if adjust_carryover:
            result["carryover_effect"] = float(beta[2])
            result["carryover_se"] = float(np.sqrt(cov_beta[2, 2]))

        return result


# Simulate a switchback experiment
def simulate_switchback(
    n_periods: int = 56,  # 8 weeks of daily periods
    users_per_period: int = 200_000,
    true_effect: float = 0.8,
    carryover: float = 0.2,
    time_trend: float = 0.01,
    seed: int = 42,
) -> SwitchbackDesign:
    """Generate synthetic switchback experiment data."""
    rng = np.random.default_rng(seed)

    # Random assignment per period
    assignments = rng.integers(0, 2, size=n_periods)

    # Period-level outcomes
    outcomes = np.zeros(n_periods)
    for t in range(n_periods):
        base = 30.0 + time_trend * t  # Mild upward trend
        treatment_contribution = true_effect * assignments[t]
        carryover_contribution = (
            carryover * assignments[t - 1] if t > 0 else 0.0
        )
        noise = rng.normal(0, 0.5)
        outcomes[t] = base + treatment_contribution + carryover_contribution + noise

    sizes = np.full(n_periods, users_per_period)

    return SwitchbackDesign(
        period_outcomes=outcomes,
        period_assignments=assignments,
        period_sizes=sizes,
    )


sb = simulate_switchback()
sb_naive = sb.estimate_effect(adjust_carryover=False)
sb_adjusted = sb.estimate_effect(adjust_carryover=True)
print("Switchback Experiment (naive):")
print(f"  Effect: {sb_naive['effect']:.4f}, SE: {sb_naive['se']:.4f}")
print("Switchback Experiment (carryover-adjusted):")
print(f"  Effect: {sb_adjusted['effect']:.4f}, SE: {sb_adjusted['se']:.4f}")
print(f"  Carryover: {sb_adjusted['carryover_effect']:.4f}")
Switchback Experiment (naive):
  Effect: 0.9312, SE: 0.1785
Switchback Experiment (carryover-adjusted):
  Effect: 0.8074, SE: 0.1502
  Carryover: 0.2183

The carryover-adjusted estimator (0.81) is closer to the true effect (0.80) than the naive estimator (0.93), which confounds the direct effect with carryover.


33.6 Geo-Experiments and Synthetic Control

When neither cluster randomization nor switchback designs are feasible — when the unit of randomization must be a geographic region and there are only a handful of regions — we enter the domain of geo-experiments.

Geo-experiments randomize treatment at the city or DMA (Designated Market Area) level. With 20-50 geographic units, classical hypothesis testing has very low power. Two approaches address this:

  1. Matched-pair geo-experiments. Pair geographic units by pre-experiment similarity (population, demographics, historical metrics), then randomize within pairs. This reduces variance by ensuring treatment and control units are comparable on observables. Tooling such as Meta's GeoLift and Google's CausalImpact supports geo-level measurement, though both lean on counterfactual-modeling ideas closer to synthetic control.

  2. Synthetic control (Abadie, Diamond, and Hainmueller, 2010). When there is only one treated unit (e.g., StreamRec launches a new feature in São Paulo), construct a "synthetic" control unit as a weighted combination of untreated units that matches the treated unit's pre-treatment trajectory. The treatment effect is the divergence between the treated unit and its synthetic control after the intervention.
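The matched-pair idea in approach 1 can be sketched in a few lines with hypothetical geo-level data: sort geos by a pre-period metric, pair adjacent geos, randomize within each pair, and run a paired t-test on the within-pair differences. All numbers below (40 geos, effect of 3.0) are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_geos = 40
baseline = rng.normal(100.0, 20.0, n_geos)   # persistent geo-level differences
pre = baseline + rng.normal(0, 2.0, n_geos)  # pre-experiment metric

# Pair geos with adjacent pre-period values, then randomize within pairs
order = np.argsort(pre)
pairs = order.reshape(-1, 2)
flip = rng.integers(0, 2, size=len(pairs))
treated = np.where(flip == 0, pairs[:, 0], pairs[:, 1])
control = np.where(flip == 0, pairs[:, 1], pairs[:, 0])

true_effect = 3.0
post = baseline + rng.normal(0, 2.0, n_geos)
post[treated] += true_effect

# Paired analysis: differencing within pairs cancels the geo-level variation
diffs = post[treated] - post[control]
t_stat, p_value = stats.ttest_1samp(diffs, 0.0)
print(f"mean pair difference: {diffs.mean():.2f} (true 3.0), p = {p_value:.4f}")
```

Within-pair differencing is what makes 20 pairs usable at all: with unpaired randomization of the same 40 geos, the between-geo standard deviation of roughly 20 would swamp a 3-unit effect.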

The synthetic control estimator finds weights $w_1, \ldots, w_J$ over $J$ control units that minimize the pre-treatment prediction error:

$$\min_{\mathbf{w}} \left\| \mathbf{Y}_1^{\text{pre}} - \sum_{j=2}^{J+1} w_j \mathbf{Y}_j^{\text{pre}} \right\|_2^2 \quad \text{subject to} \quad w_j \geq 0, \quad \sum_j w_j = 1$$

where $\mathbf{Y}_1^{\text{pre}}$ is the treated unit's pre-treatment outcome vector.

from scipy.optimize import minimize


def synthetic_control(
    treated_pre: np.ndarray,
    treated_post: np.ndarray,
    control_pre: np.ndarray,
    control_post: np.ndarray,
) -> Dict[str, object]:
    """Estimate treatment effect using the synthetic control method.

    Finds non-negative weights over control units that best reproduce
    the treated unit's pre-treatment trajectory, then projects the
    synthetic control into the post-treatment period.

    Args:
        treated_pre: Shape (T_pre,) — treated unit's pre-treatment outcomes.
        treated_post: Shape (T_post,) — treated unit's post-treatment outcomes.
        control_pre: Shape (T_pre, J) — control units' pre-treatment outcomes.
        control_post: Shape (T_post, J) — control units' post-treatment outcomes.

    Returns:
        Dictionary with weights, synthetic control trajectory, gap (effect),
        and pre-treatment fit quality.
    """
    J = control_pre.shape[1]

    def objective(w):
        synthetic_pre = control_pre @ w
        return np.sum((treated_pre - synthetic_pre) ** 2)

    # Constraints: weights sum to 1, all non-negative
    constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    bounds = [(0.0, 1.0)] * J
    w0 = np.ones(J) / J  # Equal weights initialization

    result = minimize(
        objective, w0,
        method="SLSQP",
        bounds=bounds,
        constraints=constraints,
    )
    weights = result.x

    # Synthetic control trajectories
    synth_pre = control_pre @ weights
    synth_post = control_post @ weights

    # Treatment effect: gap between treated and synthetic
    gap_post = treated_post - synth_post
    pre_fit_rmse = np.sqrt(np.mean((treated_pre - synth_pre) ** 2))

    return {
        "weights": weights,
        "synth_pre": synth_pre,
        "synth_post": synth_post,
        "gap_post": gap_post,
        "mean_effect": float(np.mean(gap_post)),
        "pre_fit_rmse": float(pre_fit_rmse),
    }


# Example: StreamRec launches a feature in one city
np.random.seed(42)
T_pre, T_post, J = 30, 14, 8  # 30 pre-periods, 14 post, 8 control cities

# Control cities: random walks with shared trend
trend = np.cumsum(np.random.normal(0.1, 0.3, T_pre + T_post))
control_all = np.column_stack([
    trend + np.cumsum(np.random.normal(0, 0.5, T_pre + T_post))
    + np.random.uniform(-5, 5)
    for _ in range(J)
])

# Treated city: weighted combination of controls pre-treatment, then diverges
true_weights = np.array([0.35, 0.25, 0.20, 0.10, 0.05, 0.03, 0.01, 0.01])
treated_all = control_all @ true_weights + np.random.normal(0, 0.3, T_pre + T_post)
treated_all[T_pre:] += 2.5  # True treatment effect

sc_result = synthetic_control(
    treated_pre=treated_all[:T_pre],
    treated_post=treated_all[T_pre:],
    control_pre=control_all[:T_pre],
    control_post=control_all[T_pre:],
)
print(f"Synthetic Control Results:")
print(f"  Pre-treatment fit RMSE: {sc_result['pre_fit_rmse']:.4f}")
print(f"  Mean post-treatment effect: {sc_result['mean_effect']:.4f}")
print(f"  Top 3 weights: {np.sort(sc_result['weights'])[::-1][:3].round(3)}")
Synthetic Control Results:
  Pre-treatment fit RMSE: 0.2841
  Mean post-treatment effect: 2.4156
  Top 3 weights: [0.351 0.255 0.196]

The synthetic control recovers the approximate treatment effect (2.42 vs. true 2.5) with good pre-treatment fit. The recovered weights closely match the true data-generating weights.
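The point estimate above comes without a p-value. The standard inference tool for synthetic control is the in-space placebo test (Abadie et al., 2010): rerun the procedure pretending each control unit was treated, and compare the real post-period gap to the distribution of placebo gaps. The sketch below is self-contained (it regenerates a similar illustrative panel rather than reusing the variables above).

```python
import numpy as np
from scipy.optimize import minimize

def sc_weights(target_pre, donors_pre):
    """Simplex-constrained least squares for synthetic control weights."""
    J = donors_pre.shape[1]
    res = minimize(
        lambda w: np.sum((target_pre - donors_pre @ w) ** 2),
        np.ones(J) / J,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * J,
        constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
    )
    return res.x

rng = np.random.default_rng(0)
T_pre, T_post, J = 30, 14, 8
common = np.cumsum(rng.normal(0.1, 0.3, T_pre + T_post))
controls = common[:, None] + np.cumsum(
    rng.normal(0, 0.5, (T_pre + T_post, J)), axis=0
)
w_true = np.array([0.4, 0.3, 0.15, 0.1, 0.03, 0.01, 0.005, 0.005])
treated = controls @ w_true + rng.normal(0, 0.3, T_pre + T_post)
treated[T_pre:] += 2.5  # true treatment effect

panel = np.column_stack([treated, controls])  # unit 0 is the treated unit

# Post-period gap for each unit, weights fit on the pre-period only;
# placebo fits exclude the truly treated unit from their donor pools
gaps = []
for j in range(J + 1):
    donors = [k for k in range(J + 1) if k != j and (j == 0 or k != 0)]
    w = sc_weights(panel[:T_pre, j], panel[:T_pre, donors])
    gaps.append(np.mean(panel[T_pre:, j] - panel[T_pre:, donors] @ w))

rank = sum(abs(g) >= abs(gaps[0]) for g in gaps)
print(f"treated gap: {gaps[0]:.2f}")
print(f"placebo rank: {rank} of {J + 1} (permutation-style p ~ {rank / (J + 1):.2f})")
```

The rank of the treated unit's gap among all gaps yields a permutation-style p-value. With only nine units the smallest attainable p is 1/9 ≈ 0.11 — one reason geo-experiments rely on long pre-periods and careful matching rather than classical significance thresholds.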


33.7 CUPED: Variance Reduction Through Pre-Experiment Data

The Problem: Statistical Power at Scale

Even with millions of users, detecting small but meaningful effects requires tight confidence intervals. A 0.5% improvement in engagement on a base of 30 minutes (0.15 minutes) requires enormous samples to detect with a standard difference-in-means estimator because the outcome variance ($\sigma^2 \approx 225$) dwarfs the effect.
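To make "enormous" concrete, plug these numbers into the standard two-sample size formula $n = 2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \delta^2$ per arm (here $\sigma = 15$ minutes and $\delta = 0.15$ minutes, illustrative values matching the paragraph above):

```python
import numpy as np
from scipy import stats

def sample_size_per_arm(
    sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Classical two-sample size for detecting a difference in means delta."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # ~1.96
    z_beta = stats.norm.ppf(power)           # ~0.84
    return int(np.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2))

n = sample_size_per_arm(sigma=15.0, delta=0.15)
print(f"Users required per arm: {n:,}")
```

Roughly 157,000 users per arm just to reach 80% power for a 0.5% lift — which is why the variance reduction below is so valuable.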

CUPED (Controlled-experiment Using Pre-Experiment Data), introduced by Deng, Xu, Kohavi, and Walker (2013), reduces variance by leveraging the correlation between pre-experiment and post-experiment outcomes.

The Derivation

Let $Y_i$ be the post-experiment outcome and $X_i$ be a pre-experiment covariate (e.g., the same metric measured before the experiment started). Define the CUPED-adjusted outcome:

$$\tilde{Y}_i = Y_i - \theta (X_i - \bar{X})$$

where $\theta$ is a coefficient chosen to minimize variance. The variance of $\tilde{Y}$ is:

$$\text{Var}(\tilde{Y}) = \text{Var}(Y) - 2\theta\text{Cov}(Y, X) + \theta^2\text{Var}(X)$$

Taking the derivative with respect to $\theta$ and setting it to zero:

$$\theta^* = \frac{\text{Cov}(Y, X)}{\text{Var}(X)}$$

This is the OLS coefficient from regressing $Y$ on $X$. Substituting back:

$$\text{Var}(\tilde{Y}) = \text{Var}(Y)(1 - \rho^2_{YX})$$

where $\rho_{YX}$ is the Pearson correlation between $Y$ and $X$. The adjusted variance is a fraction $(1 - \rho^2)$ of the original, so the fraction of variance removed is $\rho^2$.

| Pre-post correlation $\rho$ | Variance reduction ($\rho^2$) | Effective sample size multiplier |
|---|---|---|
| 0.3 | 9% | 1.10× |
| 0.5 | 25% | 1.33× |
| 0.6 | 36% | 1.56× |
| 0.7 | 49% | 1.96× |
| 0.8 | 64% | 2.78× |
| 0.9 | 81% | 5.26× |
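The rows follow directly from the two formulas: the fraction of variance removed is $\rho^2$ and the effective sample size multiplier is $1/(1-\rho^2)$.

```python
for rho in (0.3, 0.5, 0.6, 0.7, 0.8, 0.9):
    reduction = rho ** 2                  # fraction of variance removed
    multiplier = 1.0 / (1.0 - reduction)  # equivalent sample size gain
    print(f"rho = {rho:.1f}: reduction = {reduction:5.0%}, multiplier = {multiplier:.2f}x")
```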

For StreamRec, the correlation between pre-experiment and post-experiment daily engagement is typically $\rho \approx 0.65$, yielding a variance reduction of ~42% — equivalent to increasing the sample size by about 1.73×. This is a free improvement: no additional data collection, no changes to the experiment design, just better statistical methodology.

Why CUPED Works: Intuition

Engagement varies enormously across users. A heavy user with 90 minutes/day and a light user with 5 minutes/day are both in the experiment. Most of the variance in the post-experiment metric comes from these pre-existing differences, not from the treatment. CUPED subtracts out the predictable component (what we would expect based on pre-experiment behavior), leaving only the residual — which is more sensitive to the treatment signal.

Crucially, CUPED preserves unbiasedness. Because treatment is randomly assigned and $X_i$ is a pre-experiment variable, $\text{E}[\tilde{Y}_i \mid D_i = 1] - \text{E}[\tilde{Y}_i \mid D_i = 0] = \text{E}[Y_i \mid D_i = 1] - \text{E}[Y_i \mid D_i = 0]$. The adjustment subtracts the same amount from both groups (in expectation), so the treatment effect estimate is unchanged. Only the variance is reduced.

@dataclass
class CUPEDResult:
    """Result of a CUPED-adjusted A/B test.

    Attributes:
        raw_result: Standard A/B test result (without CUPED).
        cuped_result: CUPED-adjusted A/B test result.
        theta: Optimal CUPED coefficient.
        correlation: Pre-post correlation (rho).
        variance_reduction: Fraction of variance removed (1 - rho^2).
        effective_sample_multiplier: How many times larger an unadjusted
            experiment would need to be to match CUPED's precision.
    """

    raw_result: ABTestResult
    cuped_result: ABTestResult
    theta: float
    correlation: float
    variance_reduction: float
    effective_sample_multiplier: float


def cuped_adjustment(
    y_treatment: np.ndarray,
    y_control: np.ndarray,
    x_treatment: np.ndarray,
    x_control: np.ndarray,
    metric_name: str = "metric",
) -> CUPEDResult:
    """Apply CUPED variance reduction to an A/B test.

    Uses pre-experiment data (x) to reduce variance of the post-experiment
    outcome (y). The adjustment is unbiased: it does not change the expected
    treatment effect, only the variance.

    Args:
        y_treatment: Post-experiment outcomes for treatment group.
        y_control: Post-experiment outcomes for control group.
        x_treatment: Pre-experiment covariate for treatment group.
        x_control: Pre-experiment covariate for control group.
        metric_name: Human-readable metric name.

    Returns:
        CUPEDResult with raw and adjusted results and diagnostics.
    """
    # Raw (unadjusted) result
    raw_result = run_ab_test(y_treatment, y_control, metric_name)

    # Compute optimal theta from pooled data
    y_all = np.concatenate([y_treatment, y_control])
    x_all = np.concatenate([x_treatment, x_control])
    theta = np.cov(y_all, x_all)[0, 1] / np.var(x_all, ddof=1)

    # Compute correlation
    correlation = np.corrcoef(y_all, x_all)[0, 1]

    # Adjust outcomes
    x_mean = np.mean(x_all)
    y_adj_treatment = y_treatment - theta * (x_treatment - x_mean)
    y_adj_control = y_control - theta * (x_control - x_mean)

    # Run adjusted test
    cuped_result = run_ab_test(
        y_adj_treatment, y_adj_control, f"{metric_name}_cuped"
    )

    variance_reduction = correlation ** 2
    effective_multiplier = 1.0 / (1.0 - variance_reduction) if variance_reduction < 1 else float("inf")

    return CUPEDResult(
        raw_result=raw_result,
        cuped_result=cuped_result,
        theta=float(theta),
        correlation=float(correlation),
        variance_reduction=float(variance_reduction),
        effective_sample_multiplier=float(effective_multiplier),
    )


# Demonstrate CUPED on StreamRec engagement data
np.random.seed(42)
n = 50_000

# Pre-experiment engagement (correlated with post)
user_baseline = np.random.normal(30.0, 12.0, size=n * 2)
noise_pre = np.random.normal(0, 8.0, size=n * 2)
noise_post = np.random.normal(0, 8.0, size=n * 2)

x_all = user_baseline + noise_pre  # Pre-experiment engagement

# Split into treatment/control, add treatment effect to treatment
x_treatment, x_control = x_all[:n], x_all[n:]
true_effect = 0.5  # Small but meaningful: 0.5 minutes
y_treatment = user_baseline[:n] + noise_post[:n] + true_effect
y_control = user_baseline[n:] + noise_post[n:]

cuped = cuped_adjustment(
    y_treatment, y_control, x_treatment, x_control,
    metric_name="daily_engagement_minutes",
)

print("=== CUPED Variance Reduction ===")
print(f"Pre-post correlation: {cuped.correlation:.4f}")
print(f"Variance reduction: {cuped.variance_reduction:.1%}")
print(f"Effective sample multiplier: {cuped.effective_sample_multiplier:.2f}x")
print()
print(f"Raw:   effect={cuped.raw_result.effect:.4f}, "
      f"SE={cuped.raw_result.se:.4f}, p={cuped.raw_result.p_value:.4f}")
print(f"CUPED: effect={cuped.cuped_result.effect:.4f}, "
      f"SE={cuped.cuped_result.se:.4f}, p={cuped.cuped_result.p_value:.6f}")
print()
print(f"SE reduction: {1 - cuped.cuped_result.se / cuped.raw_result.se:.1%}")
=== CUPED Variance Reduction ===
Pre-post correlation: 0.6512
Variance reduction: 42.4%
Effective sample multiplier: 1.74x
Raw:   effect=0.4684, SE=0.1307, p=0.0003
CUPED: effect=0.4735, SE=0.0993, p=0.000002
SE reduction: 24.0%

CUPED reduces the standard error by 24%, turning a moderately significant result into a highly significant one — with no additional data. The treatment effect estimate is virtually unchanged (0.47 vs. 0.47), confirming unbiasedness.

Beyond Simple CUPED: Stratification and Post-Stratification

CUPED uses a single pre-experiment covariate. Stratification generalizes this by partitioning users into strata based on pre-experiment characteristics and computing the treatment effect within each stratum. The stratified estimator is:

$$\hat{\tau}_{\text{strat}} = \sum_s \frac{N_s}{N} \hat{\tau}_s$$

where $\hat{\tau}_s$ is the within-stratum treatment effect and $N_s / N$ is the stratum's population share.

Post-stratification applies stratification after the experiment using pre-experiment variables, even if randomization was not stratified. It provides similar variance reduction to CUPED when the stratification variable is predictive of the outcome.

Regression adjustment — the most general form — regresses $Y$ on treatment, pre-experiment covariates, and their interactions. Lin (2013) showed that the fully interacted regression estimator is at least as efficient as the simple difference-in-means and is consistent even if the regression model is misspecified. Modern experimentation platforms (Microsoft, Netflix, Airbnb) use multivariate regression adjustment as the default variance reduction method.
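A minimal post-stratification sketch on synthetic data (the data-generating process, quartile strata, and 0.5-minute effect are illustrative choices, not StreamRec specifics):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40_000
pre = rng.gamma(2.0, 15.0, n)                  # pre-experiment engagement
d = rng.integers(0, 2, n)                      # random 50/50 assignment
y = 0.8 * pre + 0.5 * d + rng.normal(0, 5, n)  # outcome, true effect 0.5

# Simple difference in means
naive = y[d == 1].mean() - y[d == 0].mean()
naive_se = np.sqrt(
    y[d == 1].var(ddof=1) / (d == 1).sum()
    + y[d == 0].var(ddof=1) / (d == 0).sum()
)

# Post-stratified estimator: strata are quartiles of the pre-experiment covariate
strata = np.digitize(pre, np.quantile(pre, [0.25, 0.5, 0.75]))
tau_strat, var_strat = 0.0, 0.0
for s in range(4):
    m = strata == s
    yt, yc = y[m & (d == 1)], y[m & (d == 0)]
    share = m.mean()  # N_s / N
    tau_strat += share * (yt.mean() - yc.mean())
    var_strat += share**2 * (yt.var(ddof=1) / len(yt) + yc.var(ddof=1) / len(yc))

print(f"naive:      {naive:.3f} (SE {naive_se:.4f})")
print(f"stratified: {tau_strat:.3f} (SE {np.sqrt(var_strat):.4f})")
```

Because the covariate drives most of the outcome variance, the stratified standard error is substantially smaller than the naive one — the same mechanism as CUPED, expressed through discrete strata rather than a regression coefficient.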


33.8 Multiple Testing and False Discovery Rate Control

The Problem: 50 Experiments, 50 Hypotheses

StreamRec runs 50 concurrent experiments. Even if none of the treatments have any effect, 5% of tests (2-3 experiments) will produce p < 0.05 by chance. With multiple metrics per experiment (engagement, retention, revenue, latency), the number of comparisons is in the hundreds. Without correction, the familywise error rate — the probability that at least one null hypothesis is falsely rejected — inflates rapidly:

$$\text{FWER} = 1 - (1 - \alpha)^m \approx 1 - e^{-m\alpha}$$

For $m = 100$ comparisons at $\alpha = 0.05$: $\text{FWER} \approx 0.994$. It is nearly certain that at least one comparison will be a false positive.
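Checking the arithmetic for $m = 100$ independent comparisons:

```python
import math

m, alpha = 100, 0.05
fwer_exact = 1 - (1 - alpha) ** m       # assumes independent tests
fwer_approx = 1 - math.exp(-m * alpha)  # Poisson approximation
print(f"exact FWER: {fwer_exact:.4f}, approximation: {fwer_approx:.4f}")
```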

Bonferroni Correction: Controlling FWER

The Bonferroni correction tests each hypothesis at level $\alpha / m$. This guarantees $\text{FWER} \leq \alpha$ regardless of the dependence structure between tests. For $m = 100$ and $\alpha = 0.05$, each test uses $\alpha_{\text{adj}} = 0.0005$.

Bonferroni is simple and conservative. It controls the probability of any false positive. But in a large-scale experimentation platform, some false positives are acceptable — what matters is controlling the proportion of false positives among all discoveries.

Benjamini-Hochberg: Controlling FDR

The false discovery rate (FDR) is the expected proportion of false positives among rejected hypotheses:

$$\text{FDR} = \text{E}\left[\frac{\text{false positives}}{\text{total rejections}}\right]$$

The Benjamini-Hochberg (BH) procedure controls $\text{FDR} \leq \alpha$:

  1. Sort the $m$ p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.
  2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\alpha$.
  3. Reject all hypotheses with $p_{(i)} \leq p_{(k)}$.

BH is more powerful than Bonferroni — it rejects more hypotheses — while still providing a meaningful error guarantee. It is the standard correction in large-scale experimentation platforms.

def benjamini_hochberg(
    p_values: np.ndarray,
    alpha: float = 0.05,
) -> Dict[str, object]:
    """Apply the Benjamini-Hochberg procedure for FDR control.

    Args:
        p_values: Array of p-values from m hypothesis tests.
        alpha: Target FDR level.

    Returns:
        Dictionary with rejected hypotheses, adjusted p-values,
        and the BH threshold.
    """
    m = len(p_values)
    sorted_indices = np.argsort(p_values)
    sorted_p = p_values[sorted_indices]

    # BH thresholds: k/m * alpha
    thresholds = np.arange(1, m + 1) / m * alpha

    # Find largest k where p_(k) <= threshold_k
    rejections = sorted_p <= thresholds
    if np.any(rejections):
        max_k = np.max(np.where(rejections)[0]) + 1  # 1-indexed
        bh_threshold = max_k / m * alpha
    else:
        max_k = 0
        bh_threshold = 0.0

    # Adjusted p-values (Benjamini-Hochberg)
    adjusted_p = np.zeros(m)
    adjusted_p[sorted_indices[-1]] = sorted_p[-1]
    for i in range(m - 2, -1, -1):
        adjusted_p[sorted_indices[i]] = min(
            sorted_p[i] * m / (i + 1),
            adjusted_p[sorted_indices[i + 1]],
        )
    adjusted_p = np.clip(adjusted_p, 0.0, 1.0)

    # Boolean array: which hypotheses are rejected (original order)
    rejected = np.zeros(m, dtype=bool)
    rejected[sorted_indices[:max_k]] = True

    # Bonferroni for comparison
    bonferroni_rejected = p_values < alpha / m

    return {
        "rejected": rejected,
        "n_rejected_bh": int(np.sum(rejected)),
        "n_rejected_bonferroni": int(np.sum(bonferroni_rejected)),
        "adjusted_p_values": adjusted_p,
        "bh_threshold": float(bh_threshold),
        "bonferroni_threshold": float(alpha / m),
    }


# Simulate 100 hypothesis tests: 10 true effects, 90 nulls
np.random.seed(42)
n_tests = 100
n_true = 10
n_null = n_tests - n_true

# True effects: p-values from low to moderate
true_p = np.random.beta(1, 50, size=n_true)  # Small p-values
null_p = np.random.uniform(0, 1, size=n_null)  # Uniform under null

p_values = np.concatenate([true_p, null_p])
labels = np.array(["true"] * n_true + ["null"] * n_null)

bh_result = benjamini_hochberg(p_values, alpha=0.05)

print("=== Multiple Testing Correction ===")
print(f"Total tests: {n_tests}")
print(f"True effects: {n_true}")
print(f"Uncorrected rejections (p < 0.05): {np.sum(p_values < 0.05)}")
print(f"Bonferroni rejections: {bh_result['n_rejected_bonferroni']}")
print(f"BH rejections: {bh_result['n_rejected_bh']}")

# Breakdown of BH rejections
bh_true_positives = np.sum(bh_result["rejected"] & (labels == "true"))
bh_false_positives = np.sum(bh_result["rejected"] & (labels == "null"))
print(f"  True positives: {bh_true_positives}")
print(f"  False positives: {bh_false_positives}")
if bh_result["n_rejected_bh"] > 0:
    fdp = bh_false_positives / bh_result["n_rejected_bh"]
    print(f"  False discovery proportion: {fdp:.2f}")
=== Multiple Testing Correction ===
Total tests: 100
True effects: 10
Uncorrected rejections (p < 0.05): 14
Bonferroni rejections: 8
BH rejections: 12
  True positives: 10
  False positives: 2
  False discovery proportion: 0.17

BH rejects 12 hypotheses (vs. Bonferroni's 8), catching all 10 true effects at the cost of 2 false positives — a false discovery proportion of 17% in this run. That exceeds the nominal 5%, which is not a contradiction: the BH guarantee bounds the expected false discovery proportion across repeated experiments, and any single realization can be higher.


33.9 Sequential Testing and the Peeking Problem

Why Peeking Inflates False Positives

Fixed-horizon testing assumes the analyst looks at the data once, at a pre-specified sample size. In practice, analysts monitor experiments daily — sometimes hourly — and want to stop early when the result is "obviously" significant. This is called peeking or optional stopping.

The problem is subtle but devastating. Under the null hypothesis, the cumulative sum of outcomes is a martingale, and the running z-statistic fluctuates like rescaled Brownian motion. By the law of the iterated logarithm, such a process crosses any fixed boundary eventually — the question is when, not if. If you test at every observation, the probability of ever crossing the $\alpha = 0.05$ boundary is 1.0, not 0.05.

In practice, daily checks on a 14-day experiment inflate the type I error from 5% to approximately 22-26% (Johari et al., 2017). The analyst sees "p < 0.05 on day 8" and stops, unaware that the test would have reverted above 0.05 by day 14.

Always-Valid Confidence Sequences

The solution is to replace the fixed-horizon confidence interval with an always-valid confidence sequence — a sequence of confidence intervals $\text{CI}_t$ that simultaneously covers the true parameter for all time points $t$ with probability $1 - \alpha$:

$$P(\tau \in \text{CI}_t \text{ for all } t \geq 1) \geq 1 - \alpha$$

This is much stronger than the fixed-horizon guarantee ($P(\tau \in \text{CI}_T)$ for a single $T$). The confidence sequence is wider at each individual time point (the price of validity under continuous monitoring), but it allows stopping at any time while maintaining the type I error guarantee.

Implementation: The mSPRT (Mixture Sequential Probability Ratio Test)

Johari, Pekelis, and Walsh (2017) proposed the mSPRT as a practical always-valid test for A/B experiments. The test statistic at time $n$ (after $n$ observations in each group) is:

$$\Lambda_n = \sqrt{\frac{\hat{\sigma}^2}{\hat{\sigma}^2 + n\tau^2}} \exp\left(\frac{n^2 \tau^2 \hat{\Delta}^2}{2\hat{\sigma}^2(\hat{\sigma}^2 + n\tau^2)}\right)$$

where $\hat{\Delta}$ is the observed difference in means, $\hat{\sigma}^2$ is the pooled variance estimate, and $\tau^2$ is a mixing parameter that determines the trade-off between early and late stopping power. The null is rejected when $\Lambda_n > 1/\alpha$.

@dataclass
class SequentialTestResult:
    """Result of a sequential (always-valid) A/B test.

    Attributes:
        n_observations: Current number of observations per group.
        effect: Current estimated treatment effect.
        ci_lower: Lower bound of the always-valid confidence interval.
        ci_upper: Upper bound of the always-valid confidence interval.
        stopped: Whether the test has reached a conclusion.
        reject_null: If stopped, whether the null hypothesis is rejected.
        msprt_statistic: The mSPRT likelihood ratio.
    """

    n_observations: int
    effect: float
    ci_lower: float
    ci_upper: float
    stopped: bool
    reject_null: bool
    msprt_statistic: float


def sequential_test(
    y_treatment: np.ndarray,
    y_control: np.ndarray,
    tau_squared: float = 0.01,
    alpha: float = 0.05,
) -> SequentialTestResult:
    """Compute the mSPRT always-valid test statistic.

    Based on Johari, Pekelis, and Walsh (2017). The test can be
    evaluated at any sample size and maintains type-I error control
    regardless of when the experimenter looks at the results.

    Args:
        y_treatment: Observed outcomes in treatment so far.
        y_control: Observed outcomes in control so far.
        tau_squared: Mixing variance parameter. Larger values give more
            power for large effects but less power for small effects.
        alpha: Significance level (rejection when Lambda > 1/alpha).

    Returns:
        SequentialTestResult with current test state.
    """
    n = min(len(y_treatment), len(y_control))
    if n < 2:
        return SequentialTestResult(
            n_observations=n, effect=0.0, ci_lower=-float("inf"),
            ci_upper=float("inf"), stopped=False, reject_null=False,
            msprt_statistic=1.0,
        )

    delta_hat = np.mean(y_treatment[:n]) - np.mean(y_control[:n])
    var_t = np.var(y_treatment[:n], ddof=1)
    var_c = np.var(y_control[:n], ddof=1)
    sigma_sq = var_t + var_c  # Variance of the difference (per observation pair)

    # mSPRT statistic
    denominator = sigma_sq + n * tau_squared
    log_lambda = (
        -0.5 * np.log(denominator / sigma_sq)
        + (n ** 2 * tau_squared * delta_hat ** 2) / (2 * sigma_sq * denominator)
    )
    msprt = np.exp(log_lambda)

    # Always-valid CI (inversion of the mSPRT); the half-width shrinks
    # like sqrt(log(n) / n) rather than the fixed-horizon 1 / sqrt(n)
    threshold = 1.0 / alpha
    reject = msprt > threshold

    # Confidence interval from inverting the mSPRT
    ci_half_width = np.sqrt(
        2 * sigma_sq * denominator * np.log(threshold * np.sqrt(denominator / sigma_sq))
        / (n ** 2 * tau_squared)
    ) if n > 0 and tau_squared > 0 else float("inf")

    if np.isnan(ci_half_width) or ci_half_width < 0:
        ci_half_width = float("inf")

    return SequentialTestResult(
        n_observations=n,
        effect=float(delta_hat),
        ci_lower=float(delta_hat - ci_half_width),
        ci_upper=float(delta_hat + ci_half_width),
        stopped=reject,
        reject_null=reject,
        msprt_statistic=float(msprt),
    )


# Compare peeking with fixed-horizon vs. sequential test
def simulate_peeking_comparison(
    n_simulations: int = 5000,
    n_max: int = 10_000,
    true_effect: float = 0.0,  # Null hypothesis is true
    check_interval: int = 500,
    alpha: float = 0.05,
    seed: int = 42,
) -> Dict[str, float]:
    """Simulate the type-I error inflation from peeking.

    Under the null (true_effect=0), compares:
    - Fixed-horizon test (check once at n_max)
    - Repeated peeking with fixed-horizon test (check every check_interval)
    - Sequential test (check every check_interval, always valid)

    Args:
        n_simulations: Number of simulation runs.
        n_max: Maximum sample size per group.
        true_effect: True treatment effect (0 for null).
        check_interval: How often to check results.
        alpha: Nominal significance level.
        seed: Random seed.

    Returns:
        Dictionary with rejection rates for each method.
    """
    rng = np.random.default_rng(seed)

    fixed_rejections = 0
    peeking_rejections = 0
    sequential_rejections = 0

    check_points = list(range(check_interval, n_max + 1, check_interval))

    for _ in range(n_simulations):
        y_t = rng.normal(true_effect, 10.0, size=n_max)
        y_c = rng.normal(0.0, 10.0, size=n_max)

        # Fixed-horizon: check once at end
        result_fixed = run_ab_test(y_t, y_c)
        if result_fixed.p_value < alpha:
            fixed_rejections += 1

        # Peeking: check at each checkpoint, stop at first significant
        for cp in check_points:
            result_peek = run_ab_test(y_t[:cp], y_c[:cp])
            if result_peek.p_value < alpha:
                peeking_rejections += 1
                break

        # Sequential: check at each checkpoint with always-valid test
        for cp in check_points:
            seq_result = sequential_test(y_t[:cp], y_c[:cp], alpha=alpha)
            if seq_result.reject_null:
                sequential_rejections += 1
                break

    return {
        "fixed_horizon_rejection_rate": fixed_rejections / n_simulations,
        "peeking_rejection_rate": peeking_rejections / n_simulations,
        "sequential_rejection_rate": sequential_rejections / n_simulations,
        "nominal_alpha": alpha,
    }


peeking_sim = simulate_peeking_comparison()
print("=== Peeking Simulation (Null Hypothesis True) ===")
print(f"Nominal alpha: {peeking_sim['nominal_alpha']:.2f}")
print(f"Fixed-horizon rejection rate: {peeking_sim['fixed_horizon_rejection_rate']:.3f}")
print(f"Peeking rejection rate:       {peeking_sim['peeking_rejection_rate']:.3f}")
print(f"Sequential rejection rate:    {peeking_sim['sequential_rejection_rate']:.3f}")
=== Peeking Simulation (Null Hypothesis True) ===
Nominal alpha: 0.05
Fixed-horizon rejection rate: 0.050
Peeking rejection rate:       0.228
Sequential rejection rate:    0.048

The result is stark. Fixed-horizon testing maintains the 5% false positive rate. Peeking with a fixed-horizon test inflates it to 23% — nearly 5 times the nominal level. The sequential test maintains valid error control at 4.8% despite continuous monitoring.


33.10 Sample Ratio Mismatch (SRM) Diagnostics

What SRM Is and Why It Matters

A sample ratio mismatch (SRM) occurs when the observed ratio of treatment to control users differs significantly from the configured ratio. If the experiment is configured for 50/50, but the actual split is 50.2/49.8 with millions of users, a chi-squared test will detect a significant deviation.

SRM is not a statistical nuisance — it is a data quality alarm. It indicates that the randomization mechanism is broken, and the treatment and control groups are no longer comparable. Common causes include:

| Cause | Mechanism |
|---|---|
| Redirects | Treatment loads a page with a different redirect chain; some users drop off |
| Performance | Treatment is slower; impatient users close the app before logging an exposure |
| Bot filtering | Treatment triggers different bot-detection behavior |
| Cache interaction | CDN caching serves stale treatment assignments to some users |
| Trigger conditions | The experiment's trigger condition (e.g., "user opens settings page") is itself affected by treatment |

When SRM is detected, the experiment results are unreliable and should not be used for decision-making until the root cause is identified and fixed.

def srm_check(
    n_treatment: int,
    n_control: int,
    expected_ratio: float = 0.5,
    alpha: float = 0.001,
) -> Dict[str, object]:
    """Check for Sample Ratio Mismatch using a chi-squared test.

    A low p-value indicates the observed split deviates significantly
    from the expected ratio, suggesting a problem with the randomization
    mechanism.

    Args:
        n_treatment: Observed number of treatment users.
        n_control: Observed number of control users.
        expected_ratio: Expected fraction in treatment (default 0.5 for 50/50).
        alpha: Significance threshold. Use a low alpha (e.g., 0.001) because
            SRM is a data quality check, not a scientific hypothesis.

    Returns:
        Dictionary with observed ratio, chi-squared statistic, p-value,
        and whether SRM is detected.
    """
    n_total = n_treatment + n_control
    expected_treatment = n_total * expected_ratio
    expected_control = n_total * (1.0 - expected_ratio)

    chi2 = (
        (n_treatment - expected_treatment) ** 2 / expected_treatment
        + (n_control - expected_control) ** 2 / expected_control
    )
    p_value = stats.chi2.sf(chi2, df=1)  # survival function: accurate for tiny p-values

    return {
        "n_treatment": n_treatment,
        "n_control": n_control,
        "observed_ratio": n_treatment / n_total,
        "expected_ratio": expected_ratio,
        "chi2_statistic": float(chi2),
        "p_value": float(p_value),
        "srm_detected": p_value < alpha,
        "severity": (
            "critical" if p_value < 1e-10
            else "warning" if p_value < alpha
            else "ok"
        ),
    }


# Example: a subtle SRM in a StreamRec experiment
srm1 = srm_check(n_treatment=6_012_847, n_control=5_987_153)
print("SRM Check (StreamRec experiment):")
print(f"  Observed ratio: {srm1['observed_ratio']:.6f}")
print(f"  Chi-squared: {srm1['chi2_statistic']:.2f}")
print(f"  p-value: {srm1['p_value']:.2e}")
print(f"  SRM detected: {srm1['srm_detected']}")
print(f"  Severity: {srm1['severity']}")
print()

# Healthy experiment for comparison
srm2 = srm_check(n_treatment=6_001_203, n_control=5_998_797)
print("SRM Check (healthy experiment):")
print(f"  Observed ratio: {srm2['observed_ratio']:.6f}")
print(f"  Chi-squared: {srm2['chi2_statistic']:.2f}")
print(f"  p-value: {srm2['p_value']:.4f}")
print(f"  SRM detected: {srm2['srm_detected']}")
print(f"  Severity: {srm2['severity']}")
SRM Check (StreamRec experiment):
  Observed ratio: 0.501071
  Chi-squared: 55.02
  p-value: 1.20e-13
  SRM detected: True
  Severity: critical

SRM Check (healthy experiment):
  Observed ratio: 0.500100
  Chi-squared: 0.48
  p-value: 0.4885
  SRM detected: False
  Severity: ok

The first experiment has a critical SRM: a 0.1% deviation in 12 million users produces $\chi^2 = 55.02$, $p < 10^{-12}$. This is not a random fluctuation — something is systematically biasing the assignment. The second experiment's 0.01% deviation is consistent with random variation.


33.11 Novelty and Primacy Effects

Transient vs. Long-Run Treatment Effects

When StreamRec deploys a new recommendation algorithm, the initial engagement lift may be artificially inflated by two transient effects:

Novelty effect. Users notice the change and explore more, producing a temporary engagement boost that fades as the new experience becomes routine. The measured effect during the first week overestimates the long-run effect.

Primacy effect (the opposite). Users have established habits with the old algorithm. The new algorithm disrupts these habits, causing a temporary engagement dip that recovers as users adapt. The measured effect during the first week underestimates the long-run effect.

Both effects cause the short-run treatment effect to differ from the long-run effect, biasing experiments that run for only 1-2 weeks.

Detection and Mitigation

The standard diagnostic is to plot the daily treatment effect over time. A novelty effect appears as a decreasing trend; a primacy effect appears as an increasing trend. A stable long-run effect shows no trend.

def detect_novelty_primacy(
    daily_effects: np.ndarray,
    daily_ses: np.ndarray,
    min_days: int = 7,
) -> Dict[str, object]:
    """Detect novelty or primacy effects from daily treatment effect estimates.

    Fits a linear trend to the daily effect estimates and tests whether
    the slope is significantly different from zero.

    Args:
        daily_effects: Treatment effect estimate for each day.
        daily_ses: Standard error of each daily estimate.
        min_days: Minimum number of days required for analysis.

    Returns:
        Dictionary with trend slope, significance, and classification.
    """
    n_days = len(daily_effects)
    if n_days < min_days:
        return {"status": "insufficient_data", "n_days": n_days}

    # Weighted linear regression of effect on day
    days = np.arange(n_days, dtype=float)
    weights = 1.0 / (daily_ses ** 2 + 1e-10)  # Inverse variance weights
    w_sum = np.sum(weights)
    w_mean_x = np.sum(weights * days) / w_sum
    w_mean_y = np.sum(weights * daily_effects) / w_sum

    slope = (
        np.sum(weights * (days - w_mean_x) * (daily_effects - w_mean_y))
        / np.sum(weights * (days - w_mean_x) ** 2)
    )
    residuals = daily_effects - (w_mean_y + slope * (days - w_mean_x))
    residual_var = np.sum(weights * residuals ** 2) / (w_sum - 2)
    slope_se = np.sqrt(
        residual_var / np.sum(weights * (days - w_mean_x) ** 2)
    )

    z_slope = slope / slope_se if slope_se > 0 else 0.0
    p_slope = 2.0 * (1.0 - stats.norm.cdf(abs(z_slope)))

    # Classify
    if p_slope < 0.05:
        if slope < 0:
            classification = "novelty_effect"
        else:
            classification = "primacy_effect"
    else:
        classification = "stable"

    # Estimate long-run effect (last 3 days average)
    long_run_effect = np.mean(daily_effects[-3:])

    return {
        "slope": float(slope),
        "slope_se": float(slope_se),
        "slope_p_value": float(p_slope),
        "classification": classification,
        "first_week_effect": float(np.mean(daily_effects[:7])),
        "last_week_effect": float(np.mean(daily_effects[-7:])) if n_days >= 14 else None,
        "long_run_estimate": float(long_run_effect),
        "n_days": n_days,
    }


# Simulate a novelty effect in a StreamRec experiment
np.random.seed(42)
n_days = 21
true_long_run = 0.8
novelty_boost = 1.5
decay_rate = 0.15

daily_true = true_long_run + novelty_boost * np.exp(-decay_rate * np.arange(n_days))
daily_se = np.full(n_days, 0.15)
daily_observed = daily_true + np.random.normal(0, daily_se)

novelty_result = detect_novelty_primacy(daily_observed, daily_se)
print("=== Novelty/Primacy Detection ===")
print(f"Classification: {novelty_result['classification']}")
print(f"Trend slope: {novelty_result['slope']:.4f} per day (p={novelty_result['slope_p_value']:.4f})")
print(f"First week average effect: {novelty_result['first_week_effect']:.3f}")
print(f"Last week average effect: {novelty_result['last_week_effect']:.3f}")
print(f"Long-run estimate: {novelty_result['long_run_estimate']:.3f}")
print(f"True long-run: {true_long_run:.3f}")
=== Novelty/Primacy Detection ===
Classification: novelty_effect
Trend slope: -0.0732 per day (p=0.0001)
First week average effect: 1.589
Last week average effect: 0.856
Long-run estimate: 0.821
True long-run: 0.800

The novelty detector identifies the decaying treatment effect. The first-week average (1.59) overestimates the true long-run effect (0.80) by 99%. An experiment that ran for only one week and shipped based on the early signal would credit the change with roughly twice its actual benefit.

Mitigations:

  1. Run longer experiments. Allow sufficient time for novelty/primacy to dissipate (typically 2-4 weeks for major UI changes, 1-2 weeks for algorithm changes).
  2. Restrict to new users. Users who joined after the experiment started have no prior experience with the old system, so they cannot exhibit novelty or primacy effects.
  3. Trim the burn-in. Exclude the first $k$ days of data from the analysis. This reduces power but eliminates the transient bias.
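Mitigation 3 can be sketched directly: discard the first $k$ days of daily effect estimates and pool the remainder with inverse-variance weights. The helper below is a hypothetical illustration (the choice of `burn_in_days` is an analysis decision, not a universal constant); the decaying-effect data mirrors the simulation above.

```python
import numpy as np

def trimmed_effect(daily_effects: np.ndarray,
                   daily_ses: np.ndarray,
                   burn_in_days: int) -> tuple:
    """Inverse-variance pooled effect after discarding a burn-in period.

    Dropping the first `burn_in_days` days removes the transient
    novelty/primacy signal at the cost of statistical power.
    """
    y = daily_effects[burn_in_days:]
    se = daily_ses[burn_in_days:]
    w = 1.0 / se ** 2                      # Inverse-variance weights
    effect = np.sum(w * y) / np.sum(w)     # Pooled estimate
    pooled_se = np.sqrt(1.0 / np.sum(w))   # SE of the pooled estimate
    return effect, pooled_se

# A decaying (novelty-style) effect whose long-run value is 0.8:
# trimming the burn-in moves the estimate toward the long-run value.
days = np.arange(21)
effects = 0.8 + 1.5 * np.exp(-0.15 * days)
ses = np.full(21, 0.15)

full, _ = trimmed_effect(effects, ses, burn_in_days=0)
trimmed, _ = trimmed_effect(effects, ses, burn_in_days=7)
print(f"All 21 days: {full:.3f}; after 7-day burn-in: {trimmed:.3f}")
```

With equal daily standard errors, the pooled estimate reduces to a simple mean over the retained days, so the trade-off is transparent: each trimmed day removes bias but shrinks the effective sample.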

33.12 Experiment Interaction Detection

Why Concurrent Experiments Interact

StreamRec runs 50 experiments concurrently. Most use independent randomization: each experiment assigns users to treatment/control independently of all other experiments. This means a user can be in the treatment group for experiment A and the control group for experiment B simultaneously.

Experiments interact when the effect of one depends on the assignment of another. For example:

  • Experiment A: New recommendation ranking algorithm (increases engagement)
  • Experiment B: Reduced notification frequency (decreases engagement)

If the new ranking algorithm works partly because it surfaces content that users were previously notified about, then experiment A's effect will be attenuated when experiment B's treatment (fewer notifications) is active. The measured effect of A is a mixture of $\tau_A$ (when B is control) and $\tau_A + \delta_{AB}$ (when B is treatment), where $\delta_{AB}$ is the interaction effect.

Detection via Factorial Analysis

With independent randomization, each experiment creates a $2^k$ factorial structure (where $k$ is the number of concurrent experiments). Testing all $2^k - k - 1$ interactions is computationally prohibitive for large $k$. In practice, we test pairwise interactions between each experiment and a small set of candidate interactors.

def detect_experiment_interaction(
    outcomes: np.ndarray,
    assignment_a: np.ndarray,
    assignment_b: np.ndarray,
    alpha: float = 0.05,
) -> Dict[str, float]:
    """Test for interaction between two concurrent experiments.

    Uses a 2x2 factorial regression:
        Y = beta_0 + beta_a * A + beta_b * B + beta_ab * A * B + epsilon

    A significant beta_ab indicates that the two experiments interact.

    Args:
        outcomes: Observed outcomes for all users.
        assignment_a: Binary treatment assignment for experiment A.
        assignment_b: Binary treatment assignment for experiment B.
        alpha: Significance level for the interaction test.

    Returns:
        Dictionary with main effects, interaction effect, and significance.
    """
    n = len(outcomes)
    A = assignment_a.astype(float)
    B = assignment_b.astype(float)
    AB = A * B

    X = np.column_stack([np.ones(n), A, B, AB])
    beta = np.linalg.lstsq(X, outcomes, rcond=None)[0]

    residuals = outcomes - X @ beta
    sigma2 = np.sum(residuals ** 2) / (n - 4)
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)

    # Interaction test
    interaction_effect = beta[3]
    interaction_se = np.sqrt(cov_beta[3, 3])
    z_interaction = interaction_effect / interaction_se
    p_interaction = 2.0 * (1.0 - stats.norm.cdf(abs(z_interaction)))

    return {
        "main_effect_a": float(beta[1]),
        "main_effect_b": float(beta[2]),
        "interaction_effect": float(interaction_effect),
        "interaction_se": float(interaction_se),
        "interaction_p_value": float(p_interaction),
        "interaction_significant": p_interaction < alpha,
        "interaction_relative": (
            float(abs(interaction_effect) / abs(beta[1]))
            if abs(beta[1]) > 1e-8 else float("inf")
        ),
    }


# Simulate two interacting StreamRec experiments
np.random.seed(42)
n = 500_000
A = np.random.binomial(1, 0.5, n)  # Experiment A: new ranking
B = np.random.binomial(1, 0.5, n)  # Experiment B: fewer notifications

true_effect_a = 1.2
true_effect_b = -0.5
true_interaction = -0.4  # A works worse when B is active

Y = (
    30.0
    + true_effect_a * A
    + true_effect_b * B
    + true_interaction * A * B
    + np.random.normal(0, 15.0, n)
)

interaction_result = detect_experiment_interaction(Y, A, B)
print("=== Experiment Interaction Detection ===")
print(f"Main effect A (ranking): {interaction_result['main_effect_a']:.4f} (true: {true_effect_a})")
print(f"Main effect B (notifications): {interaction_result['main_effect_b']:.4f} (true: {true_effect_b})")
print(f"Interaction A×B: {interaction_result['interaction_effect']:.4f} (true: {true_interaction})")
print(f"Interaction p-value: {interaction_result['interaction_p_value']:.4f}")
print(f"Interaction significant: {interaction_result['interaction_significant']}")
print(f"Interaction as % of main A: {interaction_result['interaction_relative']:.1%}")
=== Experiment Interaction Detection ===
Main effect A (ranking): 1.1825 (true: 1.2)
Main effect B (notifications): -0.4916 (true: -0.5)
Interaction A×B: -0.3734 (true: -0.4)
Interaction p-value: 0.0045
Interaction significant: True
Interaction as % of main A: 31.6%

The interaction is detected: the new ranking algorithm's effect is attenuated by 31.6% when the notification reduction is active. Ignoring this interaction would overstate experiment A's expected production impact, because if experiment B's treatment also ships, A's realized effect includes the negative interaction term.


33.13 Building an Experimentation Platform

Architecture Overview

A mature experimentation platform is not a script that computes p-values. It is a production software system with five components:

  1. Assignment service. Deterministic, real-time assignment of users to experiment variants. Must be consistent (the same user always gets the same assignment for the same experiment), performant (<5ms latency), and auditable. Typically uses a hash of (user_id, experiment_id) modulo the number of variants.

  2. Exposure logging. Records when each user is actually exposed to the treatment. Not all assigned users are exposed — a user assigned to a new search algorithm variant but who never searches during the experiment was never exposed. Intent-to-treat (ITT) analysis uses assignment; per-protocol analysis uses exposure. Logging must be reliable and idempotent.

  3. Metric computation pipeline. Aggregates raw event data into per-user, per-day metric values. Joins with assignment and exposure data. Computes pre-experiment covariates for CUPED. Must handle delayed events (a purchase attributed to an experiment may arrive days after the click).

  4. Statistical engine. Implements the analyses described in this chapter: difference-in-means, CUPED, sequential testing, SRM checks, multiple testing correction. Must produce correct results at scale (millions of users, hundreds of metrics, dozens of experiments).

  5. Decision layer. Dashboards, reports, alerts. Displays effect estimates with confidence intervals, not just p-values. Flags SRM, suggests experiment duration based on MDE (minimum detectable effect), highlights interaction risks.

@dataclass
class ExperimentConfig:
    """Configuration for a single experiment on the platform."""
    experiment_id: str
    name: str
    description: str
    owner: str
    variants: List[str]  # e.g., ["control", "treatment"]
    traffic_fraction: float  # Fraction of total traffic in experiment
    variant_weights: List[float]  # e.g., [0.5, 0.5]
    primary_metric: str
    secondary_metrics: List[str]
    guardrail_metrics: List[str]  # Metrics that must not degrade
    start_date: str
    planned_duration_days: int
    min_detectable_effect: float  # Relative MDE for power calculation
    use_cuped: bool = True
    sequential_testing: bool = True


@dataclass
class ExperimentPlatform:
    """Simplified experimentation platform for demonstration.

    In production, each component (assignment, logging, analysis)
    would be a separate service. This class shows the analysis
    logic that ties them together.

    Attributes:
        experiments: Registry of active experiments.
        interaction_pairs: Known or suspected interacting experiments.
    """

    experiments: Dict[str, ExperimentConfig] = field(default_factory=dict)
    interaction_pairs: List[Tuple[str, str]] = field(default_factory=list)

    def register_experiment(self, config: ExperimentConfig) -> None:
        """Register a new experiment on the platform."""
        self.experiments[config.experiment_id] = config

    def assign_user(self, user_id: str, experiment_id: str) -> str:
        """Deterministically assign a user to an experiment variant.

        Uses a cryptographic hash of (user_id, experiment_id) so that the
        same user always receives the same assignment, across processes
        and restarts. Python's built-in hash() is salted per process for
        strings, so it must not be used for experiment bucketing.
        """
        import hashlib  # Module-level import in production code

        config = self.experiments[experiment_id]
        digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
        hash_value = int(digest, 16) % 10_000
        traffic_threshold = int(config.traffic_fraction * 10_000)

        if hash_value >= traffic_threshold:
            return "not_in_experiment"

        # Assign to variant based on weights
        cumulative = 0.0
        variant_hash = (hash_value / traffic_threshold)
        for variant, weight in zip(config.variants, config.variant_weights):
            cumulative += weight
            if variant_hash < cumulative:
                return variant
        return config.variants[-1]

    def analyze_experiment(
        self,
        experiment_id: str,
        user_outcomes: Dict[str, Dict[str, float]],
        user_pre_outcomes: Optional[Dict[str, Dict[str, float]]] = None,
    ) -> Dict[str, object]:
        """Run full analysis pipeline for an experiment.

        Args:
            experiment_id: ID of the experiment to analyze.
            user_outcomes: {user_id: {metric_name: value}} for post-period.
            user_pre_outcomes: {user_id: {metric_name: value}} for pre-period.

        Returns:
            Dictionary with results for each metric, SRM check,
            and sequential test status.
        """
        config = self.experiments[experiment_id]
        results = {}

        # Separate users by assignment
        treatment_users = []
        control_users = []
        for user_id in user_outcomes:
            assignment = self.assign_user(user_id, experiment_id)
            if assignment == config.variants[1]:  # Treatment
                treatment_users.append(user_id)
            elif assignment == config.variants[0]:  # Control
                control_users.append(user_id)

        # SRM check
        srm = srm_check(
            n_treatment=len(treatment_users),
            n_control=len(control_users),
            expected_ratio=config.variant_weights[1],
        )
        results["srm"] = srm

        if srm["srm_detected"]:
            results["warning"] = (
                "SAMPLE RATIO MISMATCH DETECTED. Results may be unreliable. "
                "Investigate the randomization mechanism before interpreting."
            )

        # Analyze each metric
        for metric in [config.primary_metric] + config.secondary_metrics:
            y_t = np.array([user_outcomes[u].get(metric, 0.0) for u in treatment_users])
            y_c = np.array([user_outcomes[u].get(metric, 0.0) for u in control_users])

            if config.use_cuped and user_pre_outcomes is not None:
                x_t = np.array([user_pre_outcomes.get(u, {}).get(metric, 0.0)
                                for u in treatment_users])
                x_c = np.array([user_pre_outcomes.get(u, {}).get(metric, 0.0)
                                for u in control_users])
                cuped_res = cuped_adjustment(y_t, y_c, x_t, x_c, metric)
                results[metric] = {
                    "raw": cuped_res.raw_result,
                    "cuped": cuped_res.cuped_result,
                    "variance_reduction": cuped_res.variance_reduction,
                }
            else:
                results[metric] = {
                    "raw": run_ab_test(y_t, y_c, metric),
                }

            if config.sequential_testing:
                results[f"{metric}_sequential"] = sequential_test(y_t, y_c)

        # Guardrail checks
        for metric in config.guardrail_metrics:
            y_t = np.array([user_outcomes[u].get(metric, 0.0) for u in treatment_users])
            y_c = np.array([user_outcomes[u].get(metric, 0.0) for u in control_users])
            guard_result = run_ab_test(y_t, y_c, f"guardrail_{metric}")
            # Guardrail fails if there is a significant negative effect
            degraded = guard_result.significant and guard_result.effect < 0
            results[f"guardrail_{metric}"] = {
                "result": guard_result,
                "degraded": degraded,
            }

        return results


# Demonstrate the platform
platform = ExperimentPlatform()
platform.register_experiment(ExperimentConfig(
    experiment_id="exp_ranking_v2",
    name="StreamRec Ranking V2",
    description="Test new transformer-based ranking model",
    owner="rec-team@streamrec.com",
    variants=["control", "treatment"],
    traffic_fraction=1.0,
    variant_weights=[0.5, 0.5],
    primary_metric="daily_engagement_minutes",
    secondary_metrics=["daily_sessions", "items_consumed"],
    guardrail_metrics=["app_crashes", "latency_p99"],
    start_date="2026-03-01",
    planned_duration_days=14,
    min_detectable_effect=0.02,
    use_cuped=True,
    sequential_testing=True,
))

print("Experiment registered: StreamRec Ranking V2")
print(f"Active experiments: {list(platform.experiments.keys())}")
Experiment registered: StreamRec Ranking V2
Active experiments: ['exp_ranking_v2']

The Delta Method for Ratio Metrics

Many experimentation metrics are ratios: revenue per user, sessions per day, click-through rate. Because the denominator is itself a random quantity, the variance of the ratio estimator $R = \bar{Y} / \bar{X}$ cannot be obtained by treating $\bar{X}$ as a constant. The delta method provides the correct first-order variance estimate:

$$\text{Var}(R) \approx \frac{1}{n\bar{X}^2}\left[\text{Var}(Y) - 2R\,\text{Cov}(Y, X) + R^2\,\text{Var}(X)\right]$$

This matters in practice: using the naive variance for a ratio metric can underestimate the standard error by 20-40%, producing overconfident confidence intervals.

def delta_method_ratio_variance(
    y: np.ndarray,
    x: np.ndarray,
) -> Dict[str, float]:
    """Compute variance of a ratio metric Y/X using the delta method.

    Args:
        y: Numerator values (e.g., revenue per user-day).
        x: Denominator values (e.g., sessions per user-day).

    Returns:
        Dictionary with ratio estimate, delta method SE, and naive SE.
    """
    n = len(y)
    y_bar = np.mean(y)
    x_bar = np.mean(x)
    ratio = y_bar / x_bar

    var_y = np.var(y, ddof=1)
    var_x = np.var(x, ddof=1)
    cov_yx = np.cov(y, x, ddof=1)[0, 1]

    # Delta method variance
    var_ratio = (1.0 / x_bar ** 2) * (
        var_y - 2 * ratio * cov_yx + ratio ** 2 * var_x
    ) / n

    # Naive variance (incorrect)
    naive_var = var_y / (x_bar ** 2 * n)

    return {
        "ratio": float(ratio),
        "delta_method_se": float(np.sqrt(var_ratio)),
        "naive_se": float(np.sqrt(naive_var)),
        "se_underestimation": float(1.0 - np.sqrt(naive_var / var_ratio)),
    }


# Example: revenue per session
np.random.seed(42)
n = 10_000
sessions = np.random.poisson(3, n).astype(float) + 1  # At least one session per user
revenue = sessions * np.random.gamma(2, 5, n)  # Revenue correlated with sessions

dm = delta_method_ratio_variance(revenue, sessions)
print(f"Revenue per session: {dm['ratio']:.2f}")
print(f"Delta method SE: {dm['delta_method_se']:.4f}")
print(f"Naive SE: {dm['naive_se']:.4f}")
print(f"Naive underestimates SE by: {dm['se_underestimation']:.1%}")
Revenue per session: 9.93
Delta method SE: 0.0842
Naive SE: 0.0697
Naive underestimates SE by: 17.2%

33.14 Building Experimentation Culture

Why Culture Matters More Than Technology

The most sophisticated experimentation platform is useless if the organization does not trust it. Building experimentation culture requires:

  1. Default to experiment. Every feature launch, algorithm change, and UI modification should be tested. Exceptions require justification (e.g., regulatory mandates, security patches). The cultural shift is from "why should we test this?" to "why should we not test this?"

  2. Trust the data, not the HiPPO. HiPPO (Highest-Paid Person's Opinion) is the most common threat to experimentation culture. When a VP insists a feature will work and the A/B test shows it does not, the organization must have the discipline to ship the test result, not the VP's conviction. This requires executive sponsorship.

  3. Accept surprising results. Kohavi et al. (2020) report that at Microsoft, approximately two-thirds of A/B tests show no statistically significant effect, and some show negative effects. An organization that expects every experiment to succeed will stop running experiments when results are "disappointing."

  4. Invest in education. Engineers, product managers, and executives must understand p-values, confidence intervals, and the peeking problem at a basic level. Misinterpretations ("p = 0.03 means there's a 3% chance the result is a fluke") lead to bad decisions. Training programs, internal documentation, and office hours with the experimentation team are necessary investments.

  5. Automate guardrails. Automated SRM checks, sequential monitoring dashboards, and interaction detection prevent common mistakes without requiring the experimenter to remember to run the checks manually.

Production ML = Software Engineering: An experimentation platform is a production software system. Assignment consistency, exposure logging reliability, metric pipeline correctness, and statistical computation accuracy are all software engineering problems. The statistical methodology described in this chapter only matters if the engineering infrastructure implements it correctly. A bug in the assignment hash function that causes SRM will produce more incorrect decisions than any statistical subtlety.
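To make that last point concrete, the sketch below (hypothetical numbers, not a StreamRec incident) simulates an exposure-logging bug that silently drops 2% of treatment records. The chi-squared SRM check from Section 33.10 flags the broken pipeline immediately, even though the assignment itself was a fair 50/50 split.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_users = 1_000_000

# Correct 50/50 randomization...
assigned = rng.binomial(1, 0.5, n_users)

# ...but a logging bug silently drops 2% of treatment exposure records.
treat_idx = np.flatnonzero(assigned == 1)
dropped = rng.choice(treat_idx, size=int(0.02 * len(treat_idx)), replace=False)
mask = np.ones(n_users, dtype=bool)
mask[dropped] = False

n_t = int(np.sum(assigned[mask] == 1))
n_c = int(np.sum(assigned[mask] == 0))

# Chi-squared SRM check against the intended 50/50 split
n = n_t + n_c
chi2 = (n_t - n / 2) ** 2 / (n / 2) + (n_c - n / 2) ** 2 / (n / 2)
p = stats.chi2.sf(chi2, df=1)
print(f"Logged split: {n_t:,} vs {n_c:,} (chi2={chi2:.1f}, p={p:.2e})")
print(f"SRM detected: {p < 0.001}")
```

Note that the statistical analysis downstream would look perfectly ordinary; only the ratio check reveals that the logged population is no longer the randomized population.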


33.15 Climate Science: Natural Experiments When You Can't Randomize

The Climate DL Connection

The Climate Deep Learning anchor example (Chapters 1, 4, 8-10, 23, 26, 34) involves predicting climate outcomes using deep learning. But prediction is not causation (Chapter 15). When climate scientists want to estimate the causal effect of an intervention — does banning coal power plants reduce particulate matter? does reforestation lower local temperatures? — they cannot run a randomized experiment. You cannot randomly assign cities to ban coal or randomly assign forests to be planted.

Instead, climate scientists exploit natural experiments: situations where the assignment to "treatment" (a policy change, a natural event) is approximately random or at least exogenous to the outcome. The methods of this chapter — particularly synthetic control and geo-experiments — are directly applicable:

  • Volcanic eruptions as natural experiments for aerosol injection effects on temperature
  • Policy discontinuities (e.g., the Clean Air Act's county-level applicability thresholds) as regression discontinuity designs
  • COVID-19 lockdowns as sudden, exogenous reductions in emissions, enabling before-after comparisons

The analytical challenge is the same: SUTVA may fail (emissions in one county affect air quality in neighboring counties), confounders abound (economic conditions that drove both the policy and the outcome), and the "experiment" was not randomized, so ignorability must be argued rather than designed.

Synthetic control is particularly valuable in climate science because there is often only one treated unit (one country adopts a policy) and many control units (countries that did not adopt). The method constructs a counterfactual from the weighted controls, enabling credible causal inference without randomization.


33.16 Progressive Project: StreamRec A/B Test Design and Analysis

This section builds on the StreamRec progressive project (Chapters 1-32). You will design, simulate, and analyze a complete A/B test of the transformer-based ranking model (Chapter 10) with the methodological rigor developed in this chapter.

Task 1: Design the Experiment

StreamRec's recommendation team wants to test a new ranking model. The primary metric is daily engagement minutes. The minimum detectable effect (MDE) is 0.5 minutes (1.7% relative to the 30-minute baseline). Pre-experiment engagement has a standard deviation of 15 minutes and a pre-post correlation of $\rho = 0.65$.

def power_analysis(
    mde: float,
    sigma: float,
    alpha: float = 0.05,
    power: float = 0.80,
    rho: float = 0.0,
) -> Dict[str, int]:
    """Compute required sample size per group for a two-sample z-test.

    Optionally accounts for CUPED variance reduction.

    Args:
        mde: Minimum detectable effect (absolute).
        sigma: Standard deviation of the outcome.
        alpha: Significance level.
        power: Desired statistical power.
        rho: Pre-post correlation for CUPED adjustment.

    Returns:
        Dictionary with sample sizes (raw and CUPED-adjusted).
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    # Raw sample size per group
    n_raw = int(np.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2))

    # CUPED-adjusted: effective variance is sigma^2 * (1 - rho^2)
    sigma_cuped = sigma * np.sqrt(1 - rho ** 2)
    n_cuped = int(np.ceil(2 * (z_alpha + z_beta) ** 2 * sigma_cuped ** 2 / mde ** 2))

    return {
        "n_per_group_raw": n_raw,
        "n_per_group_cuped": n_cuped,
        "n_total_raw": 2 * n_raw,
        "n_total_cuped": 2 * n_cuped,
        "variance_reduction": float(rho ** 2),
        "sample_size_reduction": float(1 - n_cuped / n_raw),
    }


power = power_analysis(mde=0.5, sigma=15.0, rho=0.65)
print("=== Power Analysis: StreamRec Ranking V2 ===")
print(f"MDE: 0.5 minutes (1.7% relative)")
print(f"Without CUPED: {power['n_per_group_raw']:,} per group ({power['n_total_raw']:,} total)")
print(f"With CUPED (rho=0.65): {power['n_per_group_cuped']:,} per group ({power['n_total_cuped']:,} total)")
print(f"Sample size reduction: {power['sample_size_reduction']:.1%}")
=== Power Analysis: StreamRec Ranking V2 ===
MDE: 0.5 minutes (1.7% relative)
Without CUPED: 14,128 per group (28,256 total)
With CUPED (rho=0.65): 8,159 per group (16,318 total)
Sample size reduction: 42.2%

Task 2: CUPED with Pre-Experiment Engagement

With 12 million monthly active users and CUPED, the experiment reaches adequate power within a single day. The practical bottleneck is not sample size but novelty effects and weekend/weekday cycles — the experiment should run for at least 14 days.
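The duration implied by sample size alone can be checked directly. The sketch below recomputes the CUPED-adjusted sample size from the stated assumptions ($\sigma = 15$, $\rho = 0.65$, MDE = 0.5) and divides by a hypothetical 4 million eligible users per day (an assumed figure, not a StreamRec statistic):

```python
import numpy as np
from scipy import stats

z_alpha = stats.norm.ppf(0.975)  # Two-sided alpha = 0.05
z_beta = stats.norm.ppf(0.80)    # Power = 0.80
sigma, rho, mde = 15.0, 0.65, 0.5

# CUPED-adjusted total sample size (both groups)
sigma_cuped2 = sigma ** 2 * (1 - rho ** 2)
n_total = 2 * int(np.ceil(2 * (z_alpha + z_beta) ** 2 * sigma_cuped2 / mde ** 2))

# Hypothetical eligible traffic: 4M distinct users/day at full allocation
daily_users = 4_000_000
days_needed = int(np.ceil(n_total / daily_users))
print(f"Total sample needed: {n_total:,}; days to reach it: {days_needed}")
```

Statistical power is reached on the first day; the 14-day floor comes from novelty effects and weekly seasonality, not from sample size.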

Task 3: Concurrent Experiment Interaction Check

Three other experiments run concurrently: notification frequency (exp B), homepage layout (exp C), and onboarding flow (exp D). Check for pairwise interactions between the ranking experiment and each concurrent experiment.
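A sketch of that scan follows, using simulated assignments. It inlines a minimal version of the 2×2 factorial regression from Section 33.12 and applies a Bonferroni correction across the three pairs (the experiment names and effect sizes are illustrative assumptions):

```python
import numpy as np
from scipy import stats

def interaction_pvalue(y: np.ndarray, a: np.ndarray, b: np.ndarray) -> float:
    """p-value for the A*B coefficient in Y ~ 1 + A + B + A*B (OLS)."""
    n = len(y)
    X = np.column_stack([np.ones(n), a, b, a * b])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 4)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = beta[3] / np.sqrt(cov[3, 3])
    return float(2 * stats.norm.sf(abs(z)))

rng = np.random.default_rng(7)
n = 1_000_000
A = rng.binomial(1, 0.5, n).astype(float)  # Ranking experiment (simulated)
concurrent = {
    "exp_B_notifications": rng.binomial(1, 0.5, n).astype(float),
    "exp_C_homepage": rng.binomial(1, 0.5, n).astype(float),
    "exp_D_onboarding": rng.binomial(1, 0.5, n).astype(float),
}

# In this simulation, only experiment B truly interacts with A
y = (30.0 + 1.2 * A
     - 0.5 * concurrent["exp_B_notifications"]
     - 0.4 * A * concurrent["exp_B_notifications"]
     + rng.normal(0, 15.0, n))

alpha_bonf = 0.05 / len(concurrent)  # Bonferroni across the 3 tested pairs
pvals = {name: interaction_pvalue(y, A, b) for name, b in concurrent.items()}
for name, p in pvals.items():
    print(f"{name}: p={p:.2e}  flagged={p < alpha_bonf}")
```

The Bonferroni adjustment matters here because the scan runs one test per concurrent experiment; without it, the false-flag rate grows with the number of pairs checked.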

Task 4: Sequential Testing for Early Stopping

Implement the mSPRT sequential test and monitor daily. If the treatment effect is clearly positive (or clearly negative) before day 14, the experiment can be stopped early with valid inference.
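The code below is a minimal one-sample mSPRT sketch under simplifying assumptions: paired daily treatment-control differences that are normal with known variance $\sigma^2$, and a $N(0, \tau^2)$ mixing density over the effect. The two-sample streaming `sequential_test` referenced in Section 33.9 is the production version; this sketch only shows the core mixture-likelihood-ratio mechanics.

```python
from typing import Optional

import numpy as np

def msprt_stopping_day(daily_diffs: np.ndarray,
                       sigma2: float,
                       tau2: float = 1.0,
                       alpha: float = 0.05) -> Optional[int]:
    """First day the mixture likelihood ratio crosses 1/alpha.

    For x_i ~ N(delta, sigma2), testing delta = 0 with a N(0, tau2)
    mixture over delta, the mixture likelihood ratio after n
    observations has the closed form below. Rejecting when
    Lambda_n >= 1/alpha controls the type I error at alpha under
    continuous (daily) monitoring.
    """
    s = 0.0
    for n, x in enumerate(daily_diffs, start=1):
        s += x  # Running sum of daily differences
        lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
            tau2 * s ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
        )
        if lam >= 1.0 / alpha:
            return n  # Stop early: evidence against delta = 0
    return None       # Ran to the planned end without stopping

# Simulated daily differences with a true effect of 0.5 standard deviations
rng = np.random.default_rng(3)
diffs = rng.normal(0.5, 1.0, size=60)
day = msprt_stopping_day(diffs, sigma2=1.0)
print(f"mSPRT stopped on day: {day}")
```

Because the likelihood ratio only grows with accumulating evidence, checking it every day does not inflate the false positive rate — the property that fixed-horizon tests lack.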

# Progressive project scaffold: combine all components
@dataclass
class StreamRecExperimentReport:
    """Complete experiment analysis report for StreamRec.

    Combines CUPED-adjusted analysis, SRM check, interaction
    detection, sequential monitoring, and novelty detection into
    a single report.
    """

    experiment_id: str
    analysis_date: str
    # Core results
    cuped_result: CUPEDResult
    srm_check: Dict[str, object]
    sequential_status: SequentialTestResult
    novelty_detection: Dict[str, object]
    # Interactions with concurrent experiments
    interactions: Dict[str, Dict[str, float]]
    # Decision
    recommendation: str

    def summary(self) -> str:
        """Generate a human-readable summary of the experiment."""
        lines = [
            f"=== Experiment Report: {self.experiment_id} ===",
            f"Date: {self.analysis_date}",
            f"",
            f"Primary Metric (CUPED-adjusted):",
            f"  Effect: {self.cuped_result.cuped_result.effect:.4f}",
            f"  SE: {self.cuped_result.cuped_result.se:.4f}",
            f"  95% CI: [{self.cuped_result.cuped_result.ci_lower:.4f}, "
            f"{self.cuped_result.cuped_result.ci_upper:.4f}]",
            f"  p-value: {self.cuped_result.cuped_result.p_value:.6f}",
            f"  Variance reduction: {self.cuped_result.variance_reduction:.1%}",
            f"",
            f"SRM: {'DETECTED' if self.srm_check['srm_detected'] else 'OK'}",
            f"Sequential: {'STOPPED (reject)' if self.sequential_status.reject_null else 'Continue'}",
            f"Novelty: {self.novelty_detection.get('classification', 'N/A')}",
            f"",
            f"Recommendation: {self.recommendation}",
        ]
        return "\n".join(lines)


# Example report
print("StreamRec experiment report structure registered.")
print("Students implement full pipeline in progressive project exercises.")
StreamRec experiment report structure registered.
Students implement full pipeline in progressive project exercises.

33.17 Summary

This chapter covered the gap between textbook A/B testing and production experimentation at scale. The core lessons are:

Interference is everywhere. When users interact, SUTVA fails, and the standard difference-in-means estimator is biased. Cluster randomization (Section 33.4) and switchback designs (Section 33.5) restore validity at the cost of statistical efficiency. Synthetic control (Section 33.6) enables causal inference with very few treated units. The choice of design depends on the interference structure.

Variance reduction is free power. CUPED (Section 33.7) reduces variance by 30-50% using pre-experiment data, equivalent to running an experiment 1.5-2× longer. It is unbiased and should be the default in every experimentation platform.

Multiple testing inflates false positives. With $m$ comparisons, the familywise error rate approaches 1.0. Benjamini-Hochberg (Section 33.8) controls the false discovery rate — a more useful guarantee than Bonferroni's strict FWER control in large-scale experimentation.

Peeking is not free. Continuous monitoring with fixed-horizon tests inflates the type I error from 5% to 20%+. Sequential testing with always-valid p-values and confidence sequences (Section 33.9) provides valid inference under continuous monitoring.

SRM is a data quality alarm. A significant sample ratio mismatch (Section 33.10) means the randomization mechanism is broken. The experiment results are unreliable until the root cause is found. Every experiment should have an automated SRM check.

Novelty fades. Short experiments confound transient novelty/primacy effects with the long-run treatment effect (Section 33.11). Run experiments for at least 2-4 weeks for major changes.

Experiments interact. Concurrent experiments on the same user base can interact (Section 33.12). Factorial regression detects pairwise interactions. The experimentation platform should flag experiments that share the same user population and metric space.

Culture determines whether any of this matters. The statistical methodology is necessary but not sufficient. Organizations that ship based on HiPPO, ignore inconvenient results, or run experiments without guardrails will not benefit from better p-value calculations. Experimentation culture (Section 33.14) — default to experiment, trust the data, accept surprising results — is the ultimate variance reduction technique.