Case Study 1: StreamRec A/B Testing at Scale — Interference, CUPED, and the Experiment That Almost Shipped a Null Effect
Context
StreamRec's recommendation team has developed a new transformer-based ranking model (Progressive Project, Chapter 10). Offline evaluation is promising: NDCG@20 improves from 0.182 to 0.211, a 16% relative gain. But offline metrics are predictions, not causal estimates (Chapter 15). The team needs an A/B test to estimate the causal effect of the new model on user engagement.
The experiment parameters:
| Parameter | Value |
|---|---|
| Primary metric | Daily engagement minutes |
| Baseline mean | 30.2 minutes |
| Baseline standard deviation | 14.8 minutes |
| Minimum detectable effect (MDE) | 0.5 minutes (1.7% relative) |
| Pre-post correlation ($\rho$) | 0.67 |
| Total eligible users | 11.8 million |
| Planned traffic allocation | 50% treatment, 50% control |
| Planned duration | 14 days |
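These parameters are comfortably powered. A sketch of the standard two-sample calculation (the $\alpha = 0.05$ and 80% power choices here are assumptions, not stated in the table):

```python
from math import ceil

from scipy.stats import norm


def users_per_arm(mde: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Required sample size per arm to detect `mde` with a two-sample z-test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (sd * z / mde) ** 2)


n_raw = users_per_arm(mde=0.5, sd=14.8)
# CUPED with rho = 0.67 shrinks the variance by rho^2, and the required n with it
n_cuped = ceil(n_raw * (1 - 0.67 ** 2))
print(n_raw, n_cuped)  # roughly 13,754 and 7,580 users per arm
```

With 11.8 million eligible users, the experiment is vastly over-powered for the stated MDE, which is why even a 0.35-minute effect shows up as highly significant within a week.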
The team configures the experiment on StreamRec's internal experimentation platform and launches it on a Monday morning.
Phase 1: The Initial Result (Day 7)
After one week, the dashboard shows:
```python
from dataclasses import dataclass


@dataclass
class StreamRecWeekOneResults:
    """Week 1 results from the StreamRec ranking experiment."""
    n_treatment: int = 5_903_412
    n_control: int = 5_896_588
    mean_treatment: float = 30.58
    mean_control: float = 30.23
    std_treatment: float = 14.92
    std_control: float = 14.71
    raw_effect: float = 0.35         # mean_treatment - mean_control
    raw_se: float = 0.0272
    raw_p_value: float = 0.0000
    cuped_effect: float = 0.37
    cuped_se: float = 0.0198
    cuped_p_value: float = 0.0000
    sequential_msprt: float = 487.3  # always-valid mixture-SPRT statistic
    sequential_reject: bool = True
    srm_p_value: float = 0.42        # sample ratio mismatch check


week1 = StreamRecWeekOneResults()

print("=== StreamRec Ranking V2: Week 1 Dashboard ===")
print(f"Raw effect: +{week1.raw_effect:.2f} min "
      f"({100*week1.raw_effect/week1.mean_control:.1f}% relative)")
print(f"CUPED effect: +{week1.cuped_effect:.2f} min, p < 0.0001")
print(f"Sequential test: REJECT (mSPRT = {week1.sequential_msprt:.1f} > 20)")
print(f"SRM check: OK (p = {week1.srm_p_value:.2f})")
```

```
=== StreamRec Ranking V2: Week 1 Dashboard ===
Raw effect: +0.35 min (1.2% relative)
CUPED effect: +0.37 min, p < 0.0001
Sequential test: REJECT (mSPRT = 487.3 > 20)
SRM check: OK (p = 0.42)
```
The result looks clean: a 0.37-minute improvement (1.2% relative), highly significant under both the raw and the CUPED-adjusted analysis; the sequential test has already rejected the null, and there is no sample ratio mismatch. The product manager asks to ship.
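The sequential column on the dashboard comes from a mixture sequential probability ratio test, which stays valid under continuous monitoring. A minimal sketch of the statistic for normally distributed data (this is the normal-mixture form; the variance inputs and mixing variance below are illustrative assumptions, not StreamRec's platform configuration):

```python
from math import exp, sqrt


def msprt_lambda(ybar_diff: float, n: int, sigma2: float, tau2: float) -> float:
    """Mixture-SPRT statistic for H0: zero mean difference, mixing prior N(0, tau2).

    Always-valid: rejecting whenever the statistic exceeds 1/alpha (20 for
    alpha = 0.05) controls the type I error at any stopping time.
    """
    v = sigma2 / n  # variance of the observed mean difference after n users per arm
    return sqrt(v / (v + tau2)) * exp(ybar_diff ** 2 * tau2 / (2 * v * (v + tau2)))


# No observed difference: the statistic stays below 1, far from the threshold
print(msprt_lambda(ybar_diff=0.0, n=1_000, sigma2=2 * 14.8**2, tau2=1.0))
# A 0.35-minute difference at scale: the statistic blows past 20
print(msprt_lambda(ybar_diff=0.35, n=100_000, sigma2=2 * 14.8**2, tau2=1.0))
```

This is why a value like 487.3 on day 7 is unremarkable: at StreamRec's traffic, even a modest true effect pushes the statistic far beyond the rejection boundary.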
Phase 2: The Interference Problem
A senior data scientist raises a concern. StreamRec has a "share with friends" feature. When a treatment user discovers content surfaced by the new ranking model, they may share it with friends — some of whom are in the control group. This shared content appears in the control user's "shared with you" feed, potentially increasing their engagement. If the control group's engagement is inflated by spillover, the measured treatment effect underestimates the true direct effect.
The team investigates. In the experiment data, they find:
```python
@dataclass
class InterferenceAnalysis:
    """Analysis of content sharing interference in the ranking experiment."""
    # Sharing rates
    shares_per_user_treatment: float = 1.42
    shares_per_user_control: float = 1.31
    share_increase: float = 8.4  # percent
    # Cross-group sharing
    treatment_to_control_shares: int = 3_287_000
    control_to_treatment_shares: int = 2_916_000
    # Engagement from shared content (control users only)
    control_users_receiving_treatment_shares: int = 2_145_000
    avg_engagement_from_shared_content: float = 0.18  # minutes per user


interference = InterferenceAnalysis()

print("=== Interference Analysis ===")
print(f"Treatment users share {interference.share_increase:.1f}% more content")
print(f"Cross-group shares (T→C): {interference.treatment_to_control_shares:,}")
print(f"Control users affected: {interference.control_users_receiving_treatment_shares:,} "
      f"({100*interference.control_users_receiving_treatment_shares/5_896_588:.0f}% of control)")
print(f"Average spillover per affected control user: "
      f"+{interference.avg_engagement_from_shared_content:.2f} min")
print()

# Estimate spillover bias
spillover_on_all_control = (
    interference.control_users_receiving_treatment_shares
    * interference.avg_engagement_from_shared_content
    / 5_896_588
)
print(f"Average spillover across all control users: +{spillover_on_all_control:.4f} min")
print(f"Bias-corrected effect estimate: {0.37 + spillover_on_all_control:.4f} min")
print(f"Bias as % of measured effect: {100*spillover_on_all_control/0.37:.1f}%")
```

```
=== Interference Analysis ===
Treatment users share 8.4% more content
Cross-group shares (T→C): 3,287,000
Control users affected: 2,145,000 (36% of control)
Average spillover per affected control user: +0.18 min

Average spillover across all control users: +0.0655 min
Bias-corrected effect estimate: 0.4355 min
Bias as % of measured effect: 17.7%
```
The spillover inflates the control group's engagement by 0.07 minutes, causing the naive estimator to underestimate the true direct effect by 17.7%. The bias-corrected estimate is 0.44 minutes, not 0.37.
This is a lower bound on the bias. It only counts the first-order spillover (treatment → control via sharing). It does not count second-order effects (the control user who received shared content may themselves engage differently, affecting recommendations for other control users).
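The correction the team applied can be written compactly. With $\tau_{\text{direct}}$ the direct effect and $\bar{s}$ the average (first-order) spillover onto control users,

$$\mathbb{E}[\hat{\tau}_{\text{naive}}] = \tau_{\text{direct}} - \bar{s}, \qquad \hat{\tau}_{\text{corrected}} = \hat{\tau}_{\text{naive}} + \hat{\bar{s}} = 0.37 + 0.0655 \approx 0.44.$$

Because $\hat{\bar{s}}$ counts only first-order sharing, the corrected estimate is itself still a lower bound on $\tau_{\text{direct}}$.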
Phase 3: The CUPED Deep Dive
The CUPED adjustment reduced the standard error from 0.0272 to 0.0198 — a 27% reduction, corresponding to a variance reduction of $1 - (0.0198/0.0272)^2 = 47\%$. This matches the expected $\rho^2 = 0.67^2 = 45\%$.
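Mechanically, CUPED is a one-line covariate adjustment: subtract $\theta(X - \bar{X})$ from the outcome, where $\theta$ is the regression slope of $Y$ on the pre-period covariate $X$. A self-contained sketch on simulated data (the $\rho = 0.67$, the effect size, and the sample size are illustrative, not StreamRec's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, tau = 200_000, 0.67, 0.37

pre = rng.normal(size=n)                    # standardized pre-period engagement
treat = rng.integers(0, 2, size=n)          # 50/50 assignment
y = rho * pre + np.sqrt(1 - rho**2) * rng.normal(size=n) + tau * treat

theta = np.cov(y, pre)[0, 1] / pre.var()    # CUPED coefficient
y_cuped = y - theta * (pre - pre.mean())    # same mean difference, less variance


def effect_and_se(outcome):
    t, c = outcome[treat == 1], outcome[treat == 0]
    return t.mean() - c.mean(), np.sqrt(t.var() / t.size + c.var() / c.size)


eff_raw, se_raw = effect_and_se(y)
eff_cuped, se_cuped = effect_and_se(y_cuped)
print(f"raw:   {eff_raw:.3f} ± {se_raw:.4f}")
print(f"CUPED: {eff_cuped:.3f} ± {se_cuped:.4f}")
# The SE ratio should be close to sqrt(1 - rho^2) ≈ 0.74
```

Because assignment is independent of the pre-period covariate, the adjustment leaves the effect estimate unbiased while removing the $\rho^2$ share of the outcome variance.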
But the team realizes CUPED can do more. They have not one but four pre-experiment covariates available:
```python
@dataclass
class MultivariateCUPEDResult:
    """Results from multivariate CUPED with four covariates."""
    covariates: list
    correlations_with_y: list
    se_univariate_cuped: float    # Best single covariate
    se_multivariate_cuped: float  # All four covariates
    se_raw: float


multi_cuped = MultivariateCUPEDResult(
    covariates=[
        "pre_engagement_minutes",
        "pre_sessions_per_day",
        "pre_items_consumed",
        "account_age_days",
    ],
    correlations_with_y=[0.67, 0.61, 0.58, 0.32],
    se_univariate_cuped=0.0198,
    se_multivariate_cuped=0.0172,
    se_raw=0.0272,
)

print("=== Multivariate CUPED ===")
for cov, corr in zip(multi_cuped.covariates, multi_cuped.correlations_with_y):
    print(f"  {cov}: rho = {corr}")
print()
print(f"Raw SE: {multi_cuped.se_raw:.4f}")
print(f"Univariate CUPED SE: {multi_cuped.se_univariate_cuped:.4f} "
      f"({100*(1-multi_cuped.se_univariate_cuped/multi_cuped.se_raw):.0f}% reduction)")
print(f"Multivariate CUPED SE: {multi_cuped.se_multivariate_cuped:.4f} "
      f"({100*(1-multi_cuped.se_multivariate_cuped/multi_cuped.se_raw):.0f}% reduction)")
```

```
=== Multivariate CUPED ===
  pre_engagement_minutes: rho = 0.67
  pre_sessions_per_day: rho = 0.61
  pre_items_consumed: rho = 0.58
  account_age_days: rho = 0.32

Raw SE: 0.0272
Univariate CUPED SE: 0.0198 (27% reduction)
Multivariate CUPED SE: 0.0172 (37% reduction)
```
Multivariate CUPED provides an additional 10 percentage points of SE reduction beyond the single-covariate version. The incremental improvement is modest because the covariates are correlated with each other (users with high engagement also have many sessions and items consumed). Account age adds unique variance reduction because it captures a different dimension — older accounts are more stable.
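Multivariate CUPED is the regression generalization of the single-covariate version: replace the scalar $\theta$ with the OLS coefficients of $Y$ on all pre-period covariates and subtract the fitted component. A sketch on simulated, deliberately correlated covariates (all numbers illustrative) showing why the incremental gain over the best single covariate is modest:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Three pre-period covariates; the first two are strongly correlated with each other
z = rng.normal(size=(n, 3))
mix = np.array([[1.0, 0.8, 0.1],
                [0.0, 0.6, 0.1],
                [0.0, 0.0, 1.0]])
x = z @ mix.T
y = 0.6 * x[:, 0] + 0.3 * x[:, 1] + 0.1 * x[:, 2] + rng.normal(size=n)


def variance_kept(covs, y):
    """Fraction of outcome variance remaining after residualizing on `covs`."""
    xc = covs - covs.mean(axis=0)
    theta, *_ = np.linalg.lstsq(xc, y - y.mean(), rcond=None)
    return (y - xc @ theta).var() / y.var()


print(f"best single covariate: {variance_kept(x[:, :1], y):.2f} of variance kept")
print(f"all three covariates:  {variance_kept(x, y):.2f} of variance kept")
```

Adding more covariates can never increase the residual variance, but when the covariates share most of their information, the marginal reduction shrinks quickly, as the team observed.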
Phase 4: Concurrent Experiment Interaction
Three other experiments are running simultaneously. The team checks for pairwise interactions:
```python
@dataclass
class InteractionCheckResult:
    """Pairwise interaction test with a concurrent experiment."""
    concurrent_experiment: str
    interaction_effect: float
    interaction_se: float
    interaction_p_value: float
    interaction_as_pct_of_main: float


interactions = [
    InteractionCheckResult("notification_frequency", -0.082, 0.039, 0.036, 22.2),
    InteractionCheckResult("homepage_layout_v3", 0.011, 0.038, 0.773, 3.0),
    InteractionCheckResult("onboarding_redesign", -0.003, 0.041, 0.942, 0.8),
]

print("=== Concurrent Experiment Interactions ===")
for ix in interactions:
    sig = "*" if ix.interaction_p_value < 0.05 else ""
    print(f"  {ix.concurrent_experiment}:")
    print(f"    Interaction: {ix.interaction_effect:+.3f} min "
          f"(SE={ix.interaction_se:.3f}, p={ix.interaction_p_value:.3f}) {sig}")
    print(f"    As % of main effect: {ix.interaction_as_pct_of_main:.1f}%")
```

```
=== Concurrent Experiment Interactions ===
  notification_frequency:
    Interaction: -0.082 min (SE=0.039, p=0.036) *
    As % of main effect: 22.2%
  homepage_layout_v3:
    Interaction: +0.011 min (SE=0.038, p=0.773)
    As % of main effect: 3.0%
  onboarding_redesign:
    Interaction: -0.003 min (SE=0.041, p=0.942)
    As % of main effect: 0.8%
```
The notification frequency experiment has a significant interaction: the new ranking algorithm's effect is 22% smaller when notification frequency is reduced. This makes sense — the new ranking model surfaces content that notifications previously highlighted. With fewer notifications, users miss some of this content, attenuating the ranking improvement.
The implication: if the notification reduction ships independently (and it likely will — its own test shows a significant reduction in notification fatigue), the production effect of the ranking model will be smaller than the A/B test suggests.
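The pairwise check itself is a two-by-two factorial contrast: split users by their assignment in both experiments, then compare the ranking effect across the two notification arms. A sketch of the estimator from the four cell summaries (the cell statistics below are hypothetical, chosen only to reproduce an interaction of roughly the observed size):

```python
from math import sqrt


def interaction(cells: dict) -> tuple:
    """Interaction = (ranking effect | notif=1) - (ranking effect | notif=0).

    `cells` maps (ranking_arm, notif_arm) -> (mean, std, n).
    """
    def effect_and_var(notif):
        (m1, s1, n1), (m0, s0, n0) = cells[(1, notif)], cells[(0, notif)]
        return m1 - m0, s1**2 / n1 + s0**2 / n0

    e1, v1 = effect_and_var(1)
    e0, v0 = effect_and_var(0)
    return e1 - e0, sqrt(v1 + v0)


cells = {  # (ranking, notif) -> (mean minutes, sd, users) -- illustrative values
    (0, 0): (30.20, 14.7, 1_475_000), (1, 0): (30.60, 14.9, 1_475_000),
    (0, 1): (30.25, 14.7, 1_475_000), (1, 1): (30.57, 14.9, 1_475_000),
}
est, se = interaction(cells)
print(f"interaction: {est:+.3f} min (SE = {se:.3f})")
```

Because each cell holds only a quarter of the traffic, interaction estimates are roughly twice as noisy as main effects, which is why only fairly large interactions reach significance even at this scale.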
Phase 5: The Final Decision
The team presents the full analysis to the product review:
```python
@dataclass
class FinalExperimentReport:
    """Complete analysis combining all findings."""
    measured_effect: float = 0.37             # CUPED-adjusted A/B estimate
    interference_bias_correction: float = 0.07
    corrected_effect: float = 0.44            # 0.37 + 0.07 spillover correction
    notification_interaction: float = -0.08
    expected_production_effect: float = 0.36  # 0.44 - 0.08 if notifications ship
    cuped_se: float = 0.0172
    cuped_ci_lower: float = 0.34              # 0.37 - 1.96 * 0.0172
    cuped_ci_upper: float = 0.40              # 0.37 + 1.96 * 0.0172
    novelty_classification: str = "mild_novelty"
    novelty_day1_7_effect: float = 0.42
    novelty_day8_14_effect: float = 0.33
    long_run_estimate: float = 0.31
    srm_status: str = "OK"


final = FinalExperimentReport()

print("=== Final Experiment Report: StreamRec Ranking V2 ===")
print()
print("1. MEASURED EFFECT")
print(f"   CUPED-adjusted: +{final.measured_effect:.2f} min (before corrections)")
print()
print("2. INTERFERENCE CORRECTION")
print(f"   Spillover bias: +{final.interference_bias_correction:.2f} min")
print(f"   Corrected: +{final.corrected_effect:.2f} min")
print()
print("3. INTERACTION ADJUSTMENT")
print(f"   Notification experiment interaction: {final.notification_interaction:+.2f} min")
print(f"   Expected production effect: +{final.expected_production_effect:.2f} min")
print()
print("4. NOVELTY ASSESSMENT")
print(f"   Classification: {final.novelty_classification}")
print(f"   Week 1 effect: +{final.novelty_day1_7_effect:.2f} min")
print(f"   Week 2 effect: +{final.novelty_day8_14_effect:.2f} min")
print(f"   Long-run estimate: +{final.long_run_estimate:.2f} min")
print()
print("5. RECOMMENDATION")
print("   Ship with monitoring. Expected long-run lift: +0.31 to +0.36 min")
print("   (0.31 min novelty-adjusted, 0.36 min interaction-adjusted)")
print("   Monitor for 30 days post-launch with holdback group.")
```

```
=== Final Experiment Report: StreamRec Ranking V2 ===

1. MEASURED EFFECT
   CUPED-adjusted: +0.37 min (before corrections)

2. INTERFERENCE CORRECTION
   Spillover bias: +0.07 min
   Corrected: +0.44 min

3. INTERACTION ADJUSTMENT
   Notification experiment interaction: -0.08 min
   Expected production effect: +0.36 min

4. NOVELTY ASSESSMENT
   Classification: mild_novelty
   Week 1 effect: +0.42 min
   Week 2 effect: +0.33 min
   Long-run estimate: +0.31 min

5. RECOMMENDATION
   Ship with monitoring. Expected long-run lift: +0.31 to +0.36 min
   (0.31 min novelty-adjusted, 0.36 min interaction-adjusted)
   Monitor for 30 days post-launch with holdback group.
```
Lessons
The naive A/B test estimated +0.37 minutes. The rigorous analysis reveals a more nuanced picture:
- Interference correction moves the estimate up to +0.44 (the naive test underestimated the direct effect because of positive spillover).
- Interaction adjustment moves the expected production impact down to +0.36 (the ranking effect is attenuated when the notification reduction ships concurrently).
- Novelty correction moves the long-run estimate down further to +0.31 (the week 1 lift was inflated by novelty).
The final estimated long-run production impact (+0.31 min) is 16% smaller than the raw A/B test suggested (+0.37 min). Without interference analysis, the team would not know the true direct effect. Without interaction detection, they would overstate the production impact. Without novelty analysis, they would overstate the long-run benefit.
The experiment ships — but with realistic expectations, a 30-day holdback group for long-run measurement, and a commitment to re-evaluate if the notification experiment also ships.
This is what rigorous experimentation looks like: not a single p-value, but a systematic accounting of every assumption and every source of bias.