Case Study 1: StreamRec A/B Testing at Scale — Interference, CUPED, and the Experiment That Almost Shipped a Null Effect
Context
StreamRec's recommendation team has developed a new transformer-based ranking model (Progressive Project, Chapter 10). Offline evaluation is promising: NDCG@20 improves from 0.182 to 0.211, a 16% relative gain. But offline metrics are predictions, not causal estimates (Chapter 15). The team needs an A/B test to estimate the causal effect of the new model on user engagement.
The experiment parameters:
| Parameter | Value |
|---|---|
| Primary metric | Daily engagement minutes |
| Baseline mean | 30.2 minutes |
| Baseline standard deviation | 14.8 minutes |
| Minimum detectable effect (MDE) | 0.5 minutes (1.7% relative) |
| Pre-post correlation ($\rho$) | 0.67 |
| Total eligible users | 11.8 million |
| Planned traffic allocation | 50% treatment, 50% control |
| Planned duration | 14 days |
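These parameters are comfortably powered. A sketch of the standard two-sample calculation (the $\alpha = 0.05$ and 80% power choices here are assumptions, not stated in the table):

```python
from math import ceil

from scipy.stats import norm


def users_per_arm(mde: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Required sample size per arm to detect `mde` with a two-sample z-test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (sd * z / mde) ** 2)


n_raw = users_per_arm(mde=0.5, sd=14.8)
# CUPED with rho = 0.67 shrinks the variance by rho^2, and the required n with it
n_cuped = ceil(n_raw * (1 - 0.67 ** 2))
print(n_raw, n_cuped)  # roughly 13,754 and 7,580 users per arm
```

With 11.8 million eligible users, the experiment is vastly over-powered for the stated MDE, which is why even a 0.35-minute effect shows up as highly significant within a week.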
The team configures the experiment on StreamRec's internal experimentation platform and launches it on a Monday morning.
Phase 1: The Initial Result (Day 7)
After one week, the dashboard shows:
```python
from dataclasses import dataclass


@dataclass
class StreamRecWeekOneResults:
    """Week 1 results from the StreamRec ranking experiment."""
    n_treatment: int = 5_903_412
    n_control: int = 5_896_588
    mean_treatment: float = 30.58
    mean_control: float = 30.23
    std_treatment: float = 14.92
    std_control: float = 14.71
    raw_effect: float = 0.35         # mean_treatment - mean_control
    raw_se: float = 0.0272
    raw_p_value: float = 0.0000
    cuped_effect: float = 0.37
    cuped_se: float = 0.0198
    cuped_p_value: float = 0.0000
    sequential_msprt: float = 487.3  # always-valid mixture-SPRT statistic
    sequential_reject: bool = True
    srm_p_value: float = 0.42        # sample ratio mismatch check


week1 = StreamRecWeekOneResults()

print("=== StreamRec Ranking V2: Week 1 Dashboard ===")
print(f"Raw effect: +{week1.raw_effect:.2f} min "
      f"({100*week1.raw_effect/week1.mean_control:.1f}% relative)")
print(f"CUPED effect: +{week1.cuped_effect:.2f} min, p < 0.0001")
print(f"Sequential test: REJECT (mSPRT = {week1.sequential_msprt:.1f} > 20)")
print(f"SRM check: OK (p = {week1.srm_p_value:.2f})")
```

```
=== StreamRec Ranking V2: Week 1 Dashboard ===
Raw effect: +0.35 min (1.2% relative)
CUPED effect: +0.37 min, p < 0.0001
Sequential test: REJECT (mSPRT = 487.3 > 20)
SRM check: OK (p = 0.42)
```
The result looks clean: a 0.37-minute improvement (1.2% relative), highly significant under both the raw and the CUPED-adjusted analysis; the sequential test has already rejected the null, and there is no sample ratio mismatch. The product manager asks to ship.
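The sequential column on the dashboard comes from a mixture sequential probability ratio test, which stays valid under continuous monitoring. A minimal sketch of the statistic for normally distributed data (this is the normal-mixture form; the variance inputs and mixing variance below are illustrative assumptions, not StreamRec's platform configuration):

```python
from math import exp, sqrt


def msprt_lambda(ybar_diff: float, n: int, sigma2: float, tau2: float) -> float:
    """Mixture-SPRT statistic for H0: zero mean difference, mixing prior N(0, tau2).

    Always-valid: rejecting whenever the statistic exceeds 1/alpha (20 for
    alpha = 0.05) controls the type I error at any stopping time.
    """
    v = sigma2 / n  # variance of the observed mean difference after n users per arm
    return sqrt(v / (v + tau2)) * exp(ybar_diff ** 2 * tau2 / (2 * v * (v + tau2)))


# No observed difference: the statistic stays below 1, far from the threshold
print(msprt_lambda(ybar_diff=0.0, n=1_000, sigma2=2 * 14.8**2, tau2=1.0))
# A 0.35-minute difference at scale: the statistic blows past 20
print(msprt_lambda(ybar_diff=0.35, n=100_000, sigma2=2 * 14.8**2, tau2=1.0))
```

This is why a value like 487.3 on day 7 is unremarkable: at StreamRec's traffic, even a modest true effect pushes the statistic far beyond the rejection boundary.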
Phase 2: The Interference Problem
A senior data scientist raises a concern. StreamRec has a "share with friends" feature. When a treatment user discovers content surfaced by the new ranking model, they may share it with friends — some of whom are in the control group. This shared content appears in the control user's "shared with you" feed, potentially increasing their engagement. If the control group's engagement is inflated by spillover, the measured treatment effect underestimates the true direct effect.
The team investigates. In the experiment data, they find:
```python
@dataclass
class InterferenceAnalysis:
    """Analysis of content sharing interference in the ranking experiment."""
    # Sharing rates
    shares_per_user_treatment: float = 1.42
    shares_per_user_control: float = 1.31
    share_increase: float = 8.4  # percent
    # Cross-group sharing
    treatment_to_control_shares: int = 3_287_000
    control_to_treatment_shares: int = 2_916_000
    # Engagement from shared content (control users only)
    control_users_receiving_treatment_shares: int = 2_145_000
    avg_engagement_from_shared_content: float = 0.18  # minutes per user


interference = InterferenceAnalysis()

print("=== Interference Analysis ===")
print(f"Treatment users share {interference.share_increase:.1f}% more content")
print(f"Cross-group shares (T→C): {interference.treatment_to_control_shares:,}")
print(f"Control users affected: {interference.control_users_receiving_treatment_shares:,} "
      f"({100*interference.control_users_receiving_treatment_shares/5_896_588:.0f}% of control)")
print(f"Average spillover per affected control user: "
      f"+{interference.avg_engagement_from_shared_content:.2f} min")
print()

# Estimate spillover bias
spillover_on_all_control = (
    interference.control_users_receiving_treatment_shares
    * interference.avg_engagement_from_shared_content
    / 5_896_588
)
print(f"Average spillover across all control users: +{spillover_on_all_control:.4f} min")
print(f"Bias-corrected effect estimate: {0.37 + spillover_on_all_control:.4f} min")
print(f"Bias as % of measured effect: {100*spillover_on_all_control/0.37:.1f}%")
```

```
=== Interference Analysis ===
Treatment users share 8.4% more content
Cross-group shares (T→C): 3,287,000
Control users affected: 2,145,000 (36% of control)
Average spillover per affected control user: +0.18 min

Average spillover across all control users: +0.0655 min
Bias-corrected effect estimate: 0.4355 min
Bias as % of measured effect: 17.7%
```
The spillover inflates the control group's engagement by 0.07 minutes, causing the naive estimator to underestimate the true direct effect by 17.7%. The bias-corrected estimate is 0.44 minutes, not 0.37.
This is a lower bound on the bias. It only counts the first-order spillover (treatment → control via sharing). It does not count second-order effects (the control user who received shared content may themselves engage differently, affecting recommendations for other control users).
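The correction the team applied can be written compactly. With $\tau_{\text{direct}}$ the direct effect and $\bar{s}$ the average (first-order) spillover onto control users,

$$\mathbb{E}[\hat{\tau}_{\text{naive}}] = \tau_{\text{direct}} - \bar{s}, \qquad \hat{\tau}_{\text{corrected}} = \hat{\tau}_{\text{naive}} + \hat{\bar{s}} = 0.37 + 0.0655 \approx 0.44.$$

Because $\hat{\bar{s}}$ counts only first-order sharing, the corrected estimate is itself still a lower bound on $\tau_{\text{direct}}$.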
Phase 3: The CUPED Deep Dive
The CUPED adjustment reduced the standard error from 0.0272 to 0.0198 — a 27% reduction, corresponding to a variance reduction of $1 - (0.0198/0.0272)^2 = 47\%$. This matches the expected $\rho^2 = 0.67^2 = 45\%$.
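Mechanically, CUPED is a one-line covariate adjustment: subtract $\theta(X - \bar{X})$ from the outcome, where $\theta$ is the regression slope of $Y$ on the pre-period covariate $X$. A self-contained sketch on simulated data (the $\rho = 0.67$, the effect size, and the sample size are illustrative, not StreamRec's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, tau = 200_000, 0.67, 0.37

pre = rng.normal(size=n)                    # standardized pre-period engagement
treat = rng.integers(0, 2, size=n)          # 50/50 assignment
y = rho * pre + np.sqrt(1 - rho**2) * rng.normal(size=n) + tau * treat

theta = np.cov(y, pre)[0, 1] / pre.var()    # CUPED coefficient
y_cuped = y - theta * (pre - pre.mean())    # same mean difference, less variance


def effect_and_se(outcome):
    t, c = outcome[treat == 1], outcome[treat == 0]
    return t.mean() - c.mean(), np.sqrt(t.var() / t.size + c.var() / c.size)


eff_raw, se_raw = effect_and_se(y)
eff_cuped, se_cuped = effect_and_se(y_cuped)
print(f"raw:   {eff_raw:.3f} ± {se_raw:.4f}")
print(f"CUPED: {eff_cuped:.3f} ± {se_cuped:.4f}")
# The SE ratio should be close to sqrt(1 - rho^2) ≈ 0.74
```

Because assignment is independent of the pre-period covariate, the adjustment leaves the effect estimate unbiased while removing the $\rho^2$ share of the outcome variance.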
But the team realizes CUPED can do more. They have not one but four pre-experiment covariates available:
```python
@dataclass
class MultivariateCUPEDResult:
    """Results from multivariate CUPED with four covariates."""
    covariates: list
    correlations_with_y: list
    se_univariate_cuped: float    # Best single covariate
    se_multivariate_cuped: float  # All four covariates
    se_raw: float


multi_cuped = MultivariateCUPEDResult(
    covariates=[
        "pre_engagement_minutes",
        "pre_sessions_per_day",
        "pre_items_consumed",
        "account_age_days",
    ],
    correlations_with_y=[0.67, 0.61, 0.58, 0.32],
    se_univariate_cuped=0.0198,
    se_multivariate_cuped=0.0172,
    se_raw=0.0272,
)

print("=== Multivariate CUPED ===")
for cov, corr in zip(multi_cuped.covariates, multi_cuped.correlations_with_y):
    print(f"  {cov}: rho = {corr}")
print()
print(f"Raw SE: {multi_cuped.se_raw:.4f}")
print(f"Univariate CUPED SE: {multi_cuped.se_univariate_cuped:.4f} "
      f"({100*(1-multi_cuped.se_univariate_cuped/multi_cuped.se_raw):.0f}% reduction)")
print(f"Multivariate CUPED SE: {multi_cuped.se_multivariate_cuped:.4f} "
      f"({100*(1-multi_cuped.se_multivariate_cuped/multi_cuped.se_raw):.0f}% reduction)")
```

```
=== Multivariate CUPED ===
  pre_engagement_minutes: rho = 0.67
  pre_sessions_per_day: rho = 0.61
  pre_items_consumed: rho = 0.58
  account_age_days: rho = 0.32

Raw SE: 0.0272
Univariate CUPED SE: 0.0198 (27% reduction)
Multivariate CUPED SE: 0.0172 (37% reduction)
```
Multivariate CUPED provides an additional 10 percentage points of SE reduction beyond the single-covariate version. The incremental improvement is modest because the covariates are correlated with each other (users with high engagement also have many sessions and items consumed). Account age adds unique variance reduction because it captures a different dimension — older accounts are more stable.
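Multivariate CUPED is the regression generalization of the single-covariate version: replace the scalar $\theta$ with the OLS coefficients of $Y$ on all pre-period covariates and subtract the fitted component. A sketch on simulated, deliberately correlated covariates (all numbers illustrative) showing why the incremental gain over the best single covariate is modest:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Three pre-period covariates; the first two are strongly correlated with each other
z = rng.normal(size=(n, 3))
mix = np.array([[1.0, 0.8, 0.1],
                [0.0, 0.6, 0.1],
                [0.0, 0.0, 1.0]])
x = z @ mix.T
y = 0.6 * x[:, 0] + 0.3 * x[:, 1] + 0.1 * x[:, 2] + rng.normal(size=n)


def variance_kept(covs, y):
    """Fraction of outcome variance remaining after residualizing on `covs`."""
    xc = covs - covs.mean(axis=0)
    theta, *_ = np.linalg.lstsq(xc, y - y.mean(), rcond=None)
    return (y - xc @ theta).var() / y.var()


print(f"best single covariate: {variance_kept(x[:, :1], y):.2f} of variance kept")
print(f"all three covariates:  {variance_kept(x, y):.2f} of variance kept")
```

Adding more covariates can never increase the residual variance, but when the covariates share most of their information, the marginal reduction shrinks quickly, as the team observed.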
Phase 4: Concurrent Experiment Interaction
Three other experiments are running simultaneously. The team checks for pairwise interactions:
```python
@dataclass
class InteractionCheckResult:
    """Pairwise interaction test with a concurrent experiment."""
    concurrent_experiment: str
    interaction_effect: float
    interaction_se: float
    interaction_p_value: float
    interaction_as_pct_of_main: float


interactions = [
    InteractionCheckResult("notification_frequency", -0.082, 0.039, 0.036, 22.2),
    InteractionCheckResult("homepage_layout_v3", 0.011, 0.038, 0.773, 3.0),
    InteractionCheckResult("onboarding_redesign", -0.003, 0.041, 0.942, 0.8),
]

print("=== Concurrent Experiment Interactions ===")
for ix in interactions:
    sig = "*" if ix.interaction_p_value < 0.05 else ""
    print(f"  {ix.concurrent_experiment}:")
    print(f"    Interaction: {ix.interaction_effect:+.3f} min "
          f"(SE={ix.interaction_se:.3f}, p={ix.interaction_p_value:.3f}) {sig}")
    print(f"    As % of main effect: {ix.interaction_as_pct_of_main:.1f}%")
```

```
=== Concurrent Experiment Interactions ===
  notification_frequency:
    Interaction: -0.082 min (SE=0.039, p=0.036) *
    As % of main effect: 22.2%
  homepage_layout_v3:
    Interaction: +0.011 min (SE=0.038, p=0.773)
    As % of main effect: 3.0%
  onboarding_redesign:
    Interaction: -0.003 min (SE=0.041, p=0.942)
    As % of main effect: 0.8%
```
The notification frequency experiment has a significant interaction: the new ranking algorithm's effect is 22% smaller when notification frequency is reduced. This makes sense — the new ranking model surfaces content that notifications previously highlighted. With fewer notifications, users miss some of this content, attenuating the ranking improvement.
The implication: if the notification reduction ships independently (and it likely will — its own test shows a significant reduction in notification fatigue), the production effect of the ranking model will be smaller than the A/B test suggests.
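The pairwise check itself is a two-by-two factorial contrast: split users by their assignment in both experiments, then compare the ranking effect across the two notification arms. A sketch of the estimator from the four cell summaries (the cell statistics below are hypothetical, chosen only to reproduce an interaction of roughly the observed size):

```python
from math import sqrt


def interaction(cells: dict) -> tuple:
    """Interaction = (ranking effect | notif=1) - (ranking effect | notif=0).

    `cells` maps (ranking_arm, notif_arm) -> (mean, std, n).
    """
    def effect_and_var(notif):
        (m1, s1, n1), (m0, s0, n0) = cells[(1, notif)], cells[(0, notif)]
        return m1 - m0, s1**2 / n1 + s0**2 / n0

    e1, v1 = effect_and_var(1)
    e0, v0 = effect_and_var(0)
    return e1 - e0, sqrt(v1 + v0)


cells = {  # (ranking, notif) -> (mean minutes, sd, users) -- illustrative values
    (0, 0): (30.20, 14.7, 1_475_000), (1, 0): (30.60, 14.9, 1_475_000),
    (0, 1): (30.25, 14.7, 1_475_000), (1, 1): (30.57, 14.9, 1_475_000),
}
est, se = interaction(cells)
print(f"interaction: {est:+.3f} min (SE = {se:.3f})")
```

Because each cell holds only a quarter of the traffic, interaction estimates are roughly twice as noisy as main effects, which is why only fairly large interactions reach significance even at this scale.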
Phase 5: The Final Decision
The team presents the full analysis to the product review:
```python
@dataclass
class FinalExperimentReport:
    """Complete analysis combining all findings."""
    measured_effect: float = 0.37             # CUPED-adjusted A/B estimate
    interference_bias_correction: float = 0.07
    corrected_effect: float = 0.44            # 0.37 + 0.07 spillover correction
    notification_interaction: float = -0.08
    expected_production_effect: float = 0.36  # 0.44 - 0.08 if notifications ship
    cuped_se: float = 0.0172
    cuped_ci_lower: float = 0.34              # 0.37 - 1.96 * 0.0172
    cuped_ci_upper: float = 0.40              # 0.37 + 1.96 * 0.0172
    novelty_classification: str = "mild_novelty"
    novelty_day1_7_effect: float = 0.42
    novelty_day8_14_effect: float = 0.33
    long_run_estimate: float = 0.31
    srm_status: str = "OK"


final = FinalExperimentReport()

print("=== Final Experiment Report: StreamRec Ranking V2 ===")
print()
print("1. MEASURED EFFECT")
print(f"   CUPED-adjusted: +{final.measured_effect:.2f} min (before corrections)")
print()
print("2. INTERFERENCE CORRECTION")
print(f"   Spillover bias: +{final.interference_bias_correction:.2f} min")
print(f"   Corrected: +{final.corrected_effect:.2f} min")
print()
print("3. INTERACTION ADJUSTMENT")
print(f"   Notification experiment interaction: {final.notification_interaction:+.2f} min")
print(f"   Expected production effect: +{final.expected_production_effect:.2f} min")
print()
print("4. NOVELTY ASSESSMENT")
print(f"   Classification: {final.novelty_classification}")
print(f"   Week 1 effect: +{final.novelty_day1_7_effect:.2f} min")
print(f"   Week 2 effect: +{final.novelty_day8_14_effect:.2f} min")
print(f"   Long-run estimate: +{final.long_run_estimate:.2f} min")
print()
print("5. RECOMMENDATION")
print("   Ship with monitoring. Expected long-run lift: +0.31 to +0.36 min")
print("   (0.31 min novelty-adjusted, 0.36 min interaction-adjusted)")
print("   Monitor for 30 days post-launch with holdback group.")
```

```
=== Final Experiment Report: StreamRec Ranking V2 ===

1. MEASURED EFFECT
   CUPED-adjusted: +0.37 min (before corrections)

2. INTERFERENCE CORRECTION
   Spillover bias: +0.07 min
   Corrected: +0.44 min

3. INTERACTION ADJUSTMENT
   Notification experiment interaction: -0.08 min
   Expected production effect: +0.36 min

4. NOVELTY ASSESSMENT
   Classification: mild_novelty
   Week 1 effect: +0.42 min
   Week 2 effect: +0.33 min
   Long-run estimate: +0.31 min

5. RECOMMENDATION
   Ship with monitoring. Expected long-run lift: +0.31 to +0.36 min
   (0.31 min novelty-adjusted, 0.36 min interaction-adjusted)
   Monitor for 30 days post-launch with holdback group.
```
Lessons
The naive A/B test estimated +0.37 minutes. The rigorous analysis reveals a more nuanced picture:
- Interference correction moves the estimate up to +0.44 (the naive test underestimated the direct effect because of positive spillover).
- Interaction adjustment moves the expected production impact down to +0.36 (the ranking effect is attenuated when the notification reduction ships concurrently).
- Novelty correction moves the long-run estimate down further to +0.31 (the week 1 lift was inflated by novelty).
The final estimated long-run production impact (+0.31 min) is 16% smaller than the raw A/B test suggested (+0.37 min). Without interference analysis, the team would not know the true direct effect. Without interaction detection, they would overstate the production impact. Without novelty analysis, they would overstate the long-run benefit.
The experiment ships — but with realistic expectations, a 30-day holdback group for long-run measurement, and a commitment to re-evaluate if the notification experiment also ships.
This is what rigorous experimentation looks like: not a single p-value, but a systematic accounting of every assumption and every source of bias.