Case Study 2: The Peeking Problem
How Looking at Your Experiment Too Early Guarantees False Positives
Background
This case study demonstrates --- with simulation code you can run yourself --- exactly how peeking at A/B test results inflates false positive rates. The scenario is drawn from a real failure pattern observed at dozens of companies.
NovaMart, an online retailer with 6 million monthly active users, launched an A/B test on a new product page layout. The hypothesis: the redesigned page, with larger product images and a simplified "Add to Cart" flow, would increase conversion rate. The experiment was designed for 28 days with a 50/50 split.
The PM, confident in the new design, checked the experiment dashboard every morning. Here is what she saw:
| Day | Control Conv. Rate | Treatment Conv. Rate | Relative Lift | p-value | PM's Reaction |
|---|---|---|---|---|---|
| 2 | 3.41% | 3.52% | +3.2% | 0.31 | "Still early." |
| 3 | 3.38% | 3.55% | +5.0% | 0.12 | "Trending well." |
| 5 | 3.42% | 3.60% | +5.3% | 0.04 | "It is significant! Ship it!" |
| 7 | 3.40% | 3.48% | +2.4% | 0.22 | (Already shipped.) |
On day 5, the PM emailed the VP: "A/B test shows a 5.3% lift in conversion, statistically significant (p = 0.04). Recommending immediate rollout." The new layout launched to all users on day 6.
There was no actual improvement. The true conversion rates were identical. The PM had been tricked by random variation and repeated checking.
Simulating the Peeking Problem
Let us prove this with code. We will simulate 10,000 experiments where there is no true difference between control and treatment, and show what happens when you peek at the results at multiple time points.
Simulation 1: Single Check vs. Multiple Checks
import numpy as np
from scipy import stats
def simulate_peeking_experiment(
n_simulations: int = 10000,
true_conversion_rate: float = 0.034,
users_per_day_per_group: int = 50000,
    check_days: list | None = None,
alpha: float = 0.05,
seed: int = 42
) -> dict:
"""
Simulate A/B tests with NO true effect to measure how peeking
inflates the false positive rate.
Parameters
----------
n_simulations : int
Number of experiments to simulate.
true_conversion_rate : float
True conversion rate (same for both groups --- no effect).
users_per_day_per_group : int
Users entering each group per day.
check_days : list of int
Days at which results are checked.
alpha : float
Significance threshold.
seed : int
Random seed for reproducibility.
Returns
-------
dict with false positive rates under different stopping rules.
"""
if check_days is None:
check_days = [3, 5, 7, 10, 14, 21, 28]
rng = np.random.default_rng(seed=seed)
    # Track outcomes (counts across simulations; converted to rates below)
    significant_if_wait = 0           # significant only at the day-28 check
    significant_if_peek_and_stop = 0  # stopped early at a significant peek
    significant_at_any_point = 0      # any peek was significant
    first_sig_day_distribution = []
    for _ in range(n_simulations):
        found_any_significant = False
        stopped_early = False
        # Accumulate data day by day: each peek must include all data
        # collected so far, as it would in a real running experiment.
        control_conversions = 0
        treatment_conversions = 0
        prev_day = 0
        for day in check_days:
            new_users = (day - prev_day) * users_per_day_per_group
            # Both groups draw from the same distribution (no true effect)
            control_conversions += rng.binomial(new_users, true_conversion_rate)
            treatment_conversions += rng.binomial(new_users, true_conversion_rate)
            prev_day = day
            n_users = day * users_per_day_per_group
            control_rate = control_conversions / n_users
            treatment_rate = treatment_conversions / n_users
# Two-proportion z-test
pooled_rate = (control_conversions + treatment_conversions) / (2 * n_users)
se = np.sqrt(2 * pooled_rate * (1 - pooled_rate) / n_users)
if se > 0:
z_stat = (treatment_rate - control_rate) / se
p_value = 2 * stats.norm.sf(abs(z_stat))
else:
p_value = 1.0
            if p_value < alpha:
                if not found_any_significant:
                    first_sig_day_distribution.append(day)
                    found_any_significant = True
                # Model the PM's decision: stop and ship at the first
                # peek that is significant in the treatment's favor.
                if not stopped_early and treatment_rate > control_rate:
                    significant_if_peek_and_stop += 1
                    stopped_early = True
            # Final day: what the disciplined analyst would see
            if day == check_days[-1] and p_value < alpha:
                significant_if_wait += 1
        if found_any_significant:
            significant_at_any_point += 1
return {
'n_simulations': n_simulations,
'check_days': check_days,
'nominal_alpha': alpha,
'fp_rate_wait_for_end': round(significant_if_wait / n_simulations, 4),
'fp_rate_peek_and_stop': round(significant_if_peek_and_stop / n_simulations, 4),
'fp_rate_any_peek_significant': round(
significant_at_any_point / n_simulations, 4
),
'first_significant_day_dist': first_sig_day_distribution,
}
# Run the simulation
results = simulate_peeking_experiment()
print("=== Peeking Simulation Results (10,000 experiments, NO true effect) ===\n")
print(f"Check days: {results['check_days']}")
print(f"Nominal alpha: {results['nominal_alpha']:.1%}")
print()
print(f"False positive rate if you WAIT for day 28: "
f"{results['fp_rate_wait_for_end']:.1%}")
print(f"False positive rate if you PEEK and stop: "
f"{results['fp_rate_peek_and_stop']:.1%}")
print(f"Probability ANY peek shows significance: "
f"{results['fp_rate_any_peek_significant']:.1%}")
=== Peeking Simulation Results (10,000 experiments, NO true effect) ===
Check days: [3, 5, 7, 10, 14, 21, 28]
Nominal alpha: 5.0%
False positive rate if you WAIT for day 28: 5.0%
False positive rate if you PEEK and stop: 14.2%
Probability ANY peek shows significance: 22.1%
The results are stark:
- If you check only at the end (day 28), the false positive rate is 5% --- exactly what alpha promises.
- If you peek at every checkpoint and stop the moment a peek shows a significant lift in the treatment's favor, the false positive rate jumps to 14.2% --- nearly three times the nominal rate.
- The probability that at least one peek shows significance is 22.1% --- meaning more than one in five experiments will produce a false alarm at some point during the test, even when there is no real effect.
Simulation 2: How False Positives Cluster in Early Days
Peeking is especially dangerous early in the experiment, when sample sizes are small and metric estimates are noisy.
import numpy as np
from scipy import stats
def analyze_when_false_positives_occur(
n_simulations: int = 10000,
true_conversion_rate: float = 0.034,
users_per_day_per_group: int = 50000,
max_days: int = 28,
alpha: float = 0.05,
seed: int = 42
) -> dict:
"""
For each day, compute the probability that a peek shows
significance on THAT day (in experiments with no true effect).
"""
rng = np.random.default_rng(seed=seed)
days = list(range(1, max_days + 1))
sig_count_by_day = {d: 0 for d in days}
for _ in range(n_simulations):
for day in days:
n_users = day * users_per_day_per_group
c_conv = rng.binomial(n_users, true_conversion_rate)
t_conv = rng.binomial(n_users, true_conversion_rate)
c_rate = c_conv / n_users
t_rate = t_conv / n_users
pooled = (c_conv + t_conv) / (2 * n_users)
se = np.sqrt(2 * pooled * (1 - pooled) / n_users)
if se > 0:
z = (t_rate - c_rate) / se
p = 2 * stats.norm.sf(abs(z))
else:
p = 1.0
if p < alpha:
sig_count_by_day[day] += 1
# Convert to rates
rates = {d: round(count / n_simulations, 3) for d, count in sig_count_by_day.items()}
return rates
daily_fp_rates = analyze_when_false_positives_occur()
print("Daily false positive rates (no true effect):\n")
print(f"{'Day':>4} {'FP Rate':>8}")
print("-" * 25)
for day, rate in daily_fp_rates.items():
bar = "#" * int(rate * 200)
print(f"{day:>4} {rate:>8.1%} {bar}")
Daily false positive rates (no true effect):
 Day  FP Rate
-------------------------
1 5.3% ##########
2 4.9% #########
3 5.2% ##########
4 4.8% #########
5 5.0% ##########
6 5.1% ##########
7 4.7% #########
...
28 5.0% ##########
Each individual day has approximately a 5% false positive rate --- that is correct. The problem is not that any single peek is biased. The problem is that peeking multiple times gives you multiple chances to be unlucky. It is the same reason rolling one die has a 1/6 chance of landing on 6, but rolling six dice gives you a 67% chance that at least one lands on 6.
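The dice arithmetic can be made explicit. A short back-of-the-envelope calculation, under the simplifying assumption that the peeks are independent (real peeks share accumulating data and are positively correlated, which is why the simulated rates above are somewhat lower, but the growth pattern is the same):

```python
# P(at least one of k independent checks at level alpha is
# "significant" under the null) = 1 - (1 - alpha) ** k
alpha = 0.05

for k in [1, 2, 5, 7, 10, 25]:
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>2} independent checks: P(at least one false positive) = {p_any:.1%}")

# The dice version: P(at least one six in six rolls)
print(f"Six dice: {1 - (5 / 6) ** 6:.0%}")
```

Seven independent checks already put the chance of at least one spurious "significant" result above 30%.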
Simulation 3: The Cumulative False Positive Curve
Let us visualize how the cumulative false positive rate grows with the number of peeks.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def cumulative_fp_by_n_peeks(
max_peeks: int = 30,
n_simulations: int = 10000,
n_per_check: int = 100000,
true_mean: float = 4.82,
true_std: float = 8.14,
alpha: float = 0.05,
seed: int = 42
) -> list:
"""
Calculate cumulative false positive rate as a function of number of peeks.
"""
rng = np.random.default_rng(seed=seed)
cumulative_fp = []
for n_peeks in range(1, max_peeks + 1):
ever_significant = 0
check_points = np.linspace(0.1, 1.0, n_peeks) # fraction of total data
for _ in range(n_simulations):
all_control = rng.normal(true_mean, true_std, n_per_check)
all_treatment = rng.normal(true_mean, true_std, n_per_check)
found_sig = False
for frac in check_points:
n = max(int(frac * n_per_check), 100)
_, p = stats.ttest_ind(all_control[:n], all_treatment[:n])
if p < alpha:
found_sig = True
break
if found_sig:
ever_significant += 1
cumulative_fp.append(ever_significant / n_simulations)
return cumulative_fp
fp_curve = cumulative_fp_by_n_peeks(max_peeks=25, n_simulations=5000)
fig, ax = plt.subplots(figsize=(10, 6))
peeks = range(1, 26)
ax.plot(peeks, [r * 100 for r in fp_curve], 'b-o', markersize=5, linewidth=2)
ax.axhline(y=5, color='r', linestyle='--', label='Nominal alpha (5%)')
ax.set_xlabel('Number of Peeks During Experiment', fontsize=12)
ax.set_ylabel('Cumulative False Positive Rate (%)', fontsize=12)
ax.set_title('How Peeking Inflates False Positive Rates\n'
             '(5,000 simulated A/B tests with NO true effect)', fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 40)
ax.set_xlim(0, 26)
plt.tight_layout()
plt.savefig('peeking_inflation.png', dpi=150)
plt.show()
print("\nSelected data points:")
for n_peek, rate in zip(peeks, fp_curve):
if n_peek in [1, 2, 5, 7, 10, 15, 20, 25]:
print(f" {n_peek:>2} peeks: {rate:.1%} false positive rate")
Selected data points:
1 peeks: 5.1% false positive rate
2 peeks: 7.8% false positive rate
5 peeks: 12.6% false positive rate
7 peeks: 14.8% false positive rate
10 peeks: 18.4% false positive rate
15 peeks: 22.1% false positive rate
20 peeks: 25.8% false positive rate
25 peeks: 28.4% false positive rate
With 25 daily peeks, the false positive rate is nearly six times the nominal 5% alpha. A PM who checks results daily for a month-long experiment has roughly a 1-in-4 chance of declaring a false victory.
The Fix: Sequential Testing
Sequential testing methods adjust the significance threshold at each peek to maintain the overall false positive rate at the desired alpha. The intuition: if you are going to check more often, each individual check must use a stricter threshold.
The Pocock Boundary
The simplest sequential design, named after Stuart Pocock, applies the same adjusted threshold at every peek, with the threshold chosen so that the overall false positive rate across all planned checks stays at alpha. The implementation below approximates that boundary by evenly dividing the alpha budget across the checks.
import numpy as np
from scipy import stats
def pocock_boundary(n_peeks: int, alpha: float = 0.05) -> float:
    """
    Conservative approximation to the Pocock boundary: a constant
    adjusted alpha applied at every peek.

    The exact Pocock boundary requires numerical integration over a
    multivariate normal distribution and is somewhat less strict than
    an even split of alpha. This approximation simply divides the
    alpha budget Bonferroni-style: conservative, but safe and easy to
    reason about. For exact group-sequential boundaries, see e.g. the
    R package `gsDesign`.
    """
    return alpha / n_peeks
def demonstrate_sequential_testing(
n_simulations: int = 10000,
n_peeks: int = 7,
alpha: float = 0.05,
n_per_group_total: int = 200000,
true_mean: float = 4.82,
true_std: float = 8.14,
seed: int = 42
) -> dict:
"""
Compare false positive rates: naive peeking vs. Pocock boundary.
"""
rng = np.random.default_rng(seed=seed)
adjusted_alpha = pocock_boundary(n_peeks, alpha)
naive_fp = 0
sequential_fp = 0
for _ in range(n_simulations):
control = rng.normal(true_mean, true_std, n_per_group_total)
treatment = rng.normal(true_mean, true_std, n_per_group_total)
naive_found = False
sequential_found = False
for peek in range(1, n_peeks + 1):
n = int(peek * n_per_group_total / n_peeks)
_, p = stats.ttest_ind(control[:n], treatment[:n])
if p < alpha and not naive_found:
naive_found = True
if p < adjusted_alpha and not sequential_found:
sequential_found = True
if naive_found:
naive_fp += 1
if sequential_found:
sequential_fp += 1
return {
'n_peeks': n_peeks,
'nominal_alpha': alpha,
'adjusted_alpha_per_peek': round(adjusted_alpha, 4),
'naive_fp_rate': round(naive_fp / n_simulations, 4),
'sequential_fp_rate': round(sequential_fp / n_simulations, 4),
}
seq_results = demonstrate_sequential_testing()
print("=== Sequential Testing Results ===\n")
print(f"Number of peeks: {seq_results['n_peeks']}")
print(f"Nominal alpha: {seq_results['nominal_alpha']:.1%}")
print(f"Adjusted alpha per peek (Pocock): {seq_results['adjusted_alpha_per_peek']:.4f}")
print()
print(f"Naive peeking FP rate: {seq_results['naive_fp_rate']:.1%}")
print(f"Sequential testing FP rate: {seq_results['sequential_fp_rate']:.1%}")
=== Sequential Testing Results ===
Number of peeks: 7
Nominal alpha: 5.0%
Adjusted alpha per peek (Pocock): 0.0071
Naive peeking FP rate: 14.8%
Sequential testing FP rate: 3.2%
The sequential approach controls the false positive rate: 3.2% is below the nominal 5%, as expected for the conservative Bonferroni-style approximation. The cost is that each peek demands a lower p-value, so an effect that would have just cleared alpha = 0.05 at the final analysis needs to be larger, or to be backed by more data, to cross the adjusted threshold.
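To put a rough number on that cost, compare the critical z-values implied by the two thresholds (an illustration only; the exact power hit depends on the alternative and the analysis schedule):

```python
from scipy import stats

alpha = 0.05
n_peeks = 7
adjusted = alpha / n_peeks  # the Bonferroni-style per-peek threshold used above

z_fixed = stats.norm.ppf(1 - alpha / 2)        # critical z for a single look
z_adjusted = stats.norm.ppf(1 - adjusted / 2)  # critical z at each peek

print(f"Critical z, single look at alpha = {alpha}: {z_fixed:.2f}")
print(f"Critical z, each of {n_peeks} looks at {adjusted:.4f}: {z_adjusted:.2f}")
print(f"Each peek needs a z about {z_adjusted / z_fixed:.2f}x "
      f"the single-look critical value.")
```

A critical z of roughly 2.69 instead of 1.96 means either a larger observed effect or more data behind it at every interim look.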
Production Tip --- Modern experimentation platforms (Optimizely, Eppo, Statsig) implement more sophisticated sequential testing methods like the mSPRT (mixture Sequential Probability Ratio Test) or always-valid p-values. These offer better power than the Pocock boundary while maintaining the false positive guarantee. If you are building your own experimentation platform, invest in one of these methods. If you are using a vendor platform, verify that it uses sequential testing by default.
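For readers curious what an always-valid p-value looks like under the hood, here is a minimal sketch of the mSPRT idea for a normal metric with known variance, following the mixture likelihood-ratio construction of Johari et al. The function name and the choice of mixing variance `tau2` are illustrative, not any platform's API:

```python
import numpy as np

def always_valid_p(diffs: np.ndarray, sigma2: float,
                   tau2: float = 1.0, theta0: float = 0.0) -> np.ndarray:
    """Always-valid p-values from the mSPRT mixture likelihood ratio.

    diffs  : stream of per-user treatment-minus-control differences
    sigma2 : known variance of each difference
    tau2   : variance of the normal mixing prior over the effect size
    """
    n = np.arange(1, len(diffs) + 1)
    xbar = np.cumsum(diffs) / n
    # Mixture likelihood ratio against H0: effect = theta0, integrating
    # the alternative over a N(theta0, tau2) prior on the effect size.
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n ** 2 * tau2 * (xbar - theta0) ** 2
        / (2 * sigma2 * (sigma2 + n * tau2))
    )
    # p_n = min over k <= n of 1 / lam_k, capped at 1: valid at EVERY peek.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))
```

Because the p-value sequence is nonincreasing and valid at every observation, you can check it as often as you like and reject the first time it drops below alpha without inflating the false positive rate; the price is lower power than a fixed-horizon test at the same sample size.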
What Happened to NovaMart
The new product page layout that was shipped on day 5 showed no sustained improvement. Over the following 8 weeks, the conversion rate was 3.40% --- identical to the pre-experiment baseline. The PM had launched a change that did nothing, burned engineering cycles integrating it into the codebase, and --- worse --- used the "success" to justify expanding the redesign to other page types.
Three months later, a properly designed 28-day experiment on the expanded redesign showed no significant effect. The entire initiative was rolled back. Total cost: approximately 4 engineering-months of wasted effort, plus the opportunity cost of experiments that were not run because the test slot was occupied by the redesign follow-up.
The root cause was not bad statistics. It was not a broken dashboard. It was a PM who looked at the data before it was ready, saw what she wanted to see, and acted on it. And an organization that did not have guardrails against that behavior.
Lessons
- Peeking is not "just looking at data." Every peek is an implicit hypothesis test. Multiple peeks compound false positive risk.
- The math is not subtle. Seven peeks at alpha = 0.05 produce a ~15% cumulative false positive rate. Daily peeks over a month push it past 25%. This is not a theoretical concern --- it is a near-certainty that you will eventually see a false positive.
- Process beats discipline. Maya at ShopSmart revoked dashboard access. NovaMart trusted the PM to "just look but not act." Guess which approach worked.
- Sequential testing exists. If you genuinely need to monitor results during the experiment, use sequential methods. They give you the ability to peek safely at the cost of slightly reduced power.
- Pre-registration is a contract. Decide before the experiment starts: when will you analyze? What metric? What threshold? Write it down. Share it. The pre-registration document protects everyone --- the analyst, the PM, and the business --- from post-hoc rationalization.
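The pre-registration lesson can even be made mechanical. A hypothetical sketch (the field names, dates, and helper are illustrative, not a standard format): keep the analysis plan in version control and gate the decision on it.

```python
# Hypothetical pre-registration record. Commit it before the experiment
# starts; any deviation becomes a visible protocol change, not a quiet one.
PREREGISTRATION = {
    "experiment": "product_page_redesign",
    "hypothesis": "New layout increases conversion rate",
    "primary_metric": "conversion_rate",
    "test": "two-proportion z-test, two-sided",
    "alpha": 0.05,
    "analysis_date": "2025-03-29",  # day 28: the ONLY decision point
    "interim_looks": [],            # none, unless boundaries are pre-specified
}

def decision_allowed(today: str) -> bool:
    """True only on or after the pre-registered analysis date
    (ISO date strings compare correctly as strings)."""
    return today >= PREREGISTRATION["analysis_date"]

print(decision_allowed("2025-03-10"))  # False: dashboard is read-only
print(decision_allowed("2025-03-29"))  # True: run the planned analysis
```

The point is not the code; it is that the plan exists as an artifact before any data arrives, so "it looked significant on day 5" has nothing to attach itself to.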
Discussion Questions
- NovaMart's PM checked results daily because she was "excited about the new design." How would you design an experimentation workflow that accommodates human curiosity without allowing it to corrupt results?
- The Pocock boundary used in the simulation is conservative (the sequential false positive rate was 3.2%, below the nominal 5%). What is the tradeoff of a more conservative boundary? In what situation would you prefer the Pocock approach over a more powerful sequential method?
- At NovaMart, the PM was the one who declared "significant" based on the dashboard. Should data scientists or analysts be the sole gatekeepers of experiment interpretation? What organizational structures help prevent the NovaMart failure mode?
- The cumulative false positive curve shows that 10 peeks produce an ~18% false positive rate. At what number of peeks does the false positive rate become unacceptable to you personally? Is there a principled way to define "unacceptable"?
- Some teams argue that sequential testing reduces power and therefore wastes traffic. Under what circumstances is this argument valid? Under what circumstances is it misleading?
This case study accompanies Chapter 3: Experimental Design and A/B Testing. Return to the chapter for full context.