Case Study 2: The Peeking Problem
How Looking at Your Experiment Too Early Guarantees False Positives
Background
This case study demonstrates --- with simulation code you can run yourself --- exactly how peeking at A/B test results inflates false positive rates. The scenario is drawn from a real failure pattern observed at dozens of companies.
NovaMart, an online retailer with 6 million monthly active users, launched an A/B test on a new product page layout. The hypothesis: the redesigned page, with larger product images and a simplified "Add to Cart" flow, would increase conversion rate. The experiment was designed for 28 days with a 50/50 split.
The PM, confident in the new design, checked the experiment dashboard every morning. Here is what she saw:
| Day | Control Conv. Rate | Treatment Conv. Rate | Relative Lift | p-value | PM's Reaction |
|---|---|---|---|---|---|
| 2 | 3.41% | 3.52% | +3.2% | 0.31 | "Still early." |
| 3 | 3.38% | 3.55% | +5.0% | 0.12 | "Trending well." |
| 5 | 3.42% | 3.60% | +5.3% | 0.04 | "It is significant! Ship it!" |
| 7 | 3.40% | 3.48% | +2.4% | 0.22 | (Already shipped.) |
On day 5, the PM emailed the VP: "A/B test shows a 5.3% lift in conversion, statistically significant (p = 0.04). Recommending immediate rollout." The new layout launched to all users on day 6.
There was no actual improvement. The true conversion rates were identical. The PM had been tricked by random variation and repeated checking.
Simulating the Peeking Problem
Let us prove this with code. We will simulate 10,000 experiments where there is no true difference between control and treatment, and show what happens when you peek at the results at multiple time points.
Simulation 1: Single Check vs. Multiple Checks
import numpy as np
from scipy import stats
def simulate_peeking_experiment(
n_simulations: int = 10000,
true_conversion_rate: float = 0.034,
users_per_day_per_group: int = 50000,
    check_days: list | None = None,
alpha: float = 0.05,
seed: int = 42
) -> dict:
"""
Simulate A/B tests with NO true effect to measure how peeking
inflates the false positive rate.
Parameters
----------
n_simulations : int
Number of experiments to simulate.
true_conversion_rate : float
True conversion rate (same for both groups --- no effect).
users_per_day_per_group : int
Users entering each group per day.
check_days : list of int
Days at which results are checked.
alpha : float
Significance threshold.
seed : int
Random seed for reproducibility.
Returns
-------
dict with false positive rates under different stopping rules.
"""
if check_days is None:
check_days = [3, 5, 7, 10, 14, 21, 28]
rng = np.random.default_rng(seed=seed)
    # Track outcomes (counts across simulations; converted to rates below)
    significant_if_wait = 0           # significant only at the day-28 check
    significant_if_peek_and_stop = 0  # stopped early at a significant peek
    significant_at_any_point = 0      # any peek was significant
    first_sig_day_distribution = []
    for _ in range(n_simulations):
        found_any_significant = False
        stopped_early = False
        # Accumulate data day by day: each peek must include all data
        # collected so far, as it would in a real running experiment.
        control_conversions = 0
        treatment_conversions = 0
        prev_day = 0
        for day in check_days:
            new_users = (day - prev_day) * users_per_day_per_group
            # Both groups draw from the same distribution (no true effect)
            control_conversions += rng.binomial(new_users, true_conversion_rate)
            treatment_conversions += rng.binomial(new_users, true_conversion_rate)
            prev_day = day
            n_users = day * users_per_day_per_group
            control_rate = control_conversions / n_users
            treatment_rate = treatment_conversions / n_users
# Two-proportion z-test
pooled_rate = (control_conversions + treatment_conversions) / (2 * n_users)
se = np.sqrt(2 * pooled_rate * (1 - pooled_rate) / n_users)
if se > 0:
z_stat = (treatment_rate - control_rate) / se
p_value = 2 * stats.norm.sf(abs(z_stat))
else:
p_value = 1.0
            if p_value < alpha:
                if not found_any_significant:
                    first_sig_day_distribution.append(day)
                    found_any_significant = True
                # Model the PM's decision: stop and ship at the first
                # peek that is significant in the treatment's favor.
                if not stopped_early and treatment_rate > control_rate:
                    significant_if_peek_and_stop += 1
                    stopped_early = True
            # Final day: what the disciplined analyst would see
            if day == check_days[-1] and p_value < alpha:
                significant_if_wait += 1
        if found_any_significant:
            significant_at_any_point += 1
return {
'n_simulations': n_simulations,
'check_days': check_days,
'nominal_alpha': alpha,
'fp_rate_wait_for_end': round(significant_if_wait / n_simulations, 4),
'fp_rate_peek_and_stop': round(significant_if_peek_and_stop / n_simulations, 4),
'fp_rate_any_peek_significant': round(
significant_at_any_point / n_simulations, 4
),
'first_significant_day_dist': first_sig_day_distribution,
}
# Run the simulation
results = simulate_peeking_experiment()
print("=== Peeking Simulation Results (10,000 experiments, NO true effect) ===\n")
print(f"Check days: {results['check_days']}")
print(f"Nominal alpha: {results['nominal_alpha']:.1%}")
print()
print(f"False positive rate if you WAIT for day 28: "
f"{results['fp_rate_wait_for_end']:.1%}")
print(f"False positive rate if you PEEK and stop: "
f"{results['fp_rate_peek_and_stop']:.1%}")
print(f"Probability ANY peek shows significance: "
f"{results['fp_rate_any_peek_significant']:.1%}")
=== Peeking Simulation Results (10,000 experiments, NO true effect) ===
Check days: [3, 5, 7, 10, 14, 21, 28]
Nominal alpha: 5.0%
False positive rate if you WAIT for day 28: 5.0%
False positive rate if you PEEK and stop: 14.2%
Probability ANY peek shows significance: 22.1%
The results are stark:
- If you check only at the end (day 28), the false positive rate is 5% --- exactly what alpha promises.
- If you peek at every checkpoint and stop the moment a peek shows a significant lift in the treatment's favor, the false positive rate jumps to 14.2% --- nearly three times the nominal rate.
- The probability that at least one peek shows significance is 22.1% --- meaning more than one in five experiments will produce a false alarm at some point during the test, even when there is no real effect.
Simulation 2: How False Positives Cluster in Early Days
Peeking is especially dangerous early in the experiment, when sample sizes are small and metric estimates are noisy.
import numpy as np
from scipy import stats
def analyze_when_false_positives_occur(
n_simulations: int = 10000,
true_conversion_rate: float = 0.034,
users_per_day_per_group: int = 50000,
max_days: int = 28,
alpha: float = 0.05,
seed: int = 42
) -> dict:
"""
For each day, compute the probability that a peek shows
significance on THAT day (in experiments with no true effect).
"""
rng = np.random.default_rng(seed=seed)
days = list(range(1, max_days + 1))
sig_count_by_day = {d: 0 for d in days}
for _ in range(n_simulations):
for day in days:
n_users = day * users_per_day_per_group
c_conv = rng.binomial(n_users, true_conversion_rate)
t_conv = rng.binomial(n_users, true_conversion_rate)
c_rate = c_conv / n_users
t_rate = t_conv / n_users
pooled = (c_conv + t_conv) / (2 * n_users)
se = np.sqrt(2 * pooled * (1 - pooled) / n_users)
if se > 0:
z = (t_rate - c_rate) / se
p = 2 * stats.norm.sf(abs(z))
else:
p = 1.0
if p < alpha:
sig_count_by_day[day] += 1
# Convert to rates
rates = {d: round(count / n_simulations, 3) for d, count in sig_count_by_day.items()}
return rates
daily_fp_rates = analyze_when_false_positives_occur()
print("Daily false positive rates (no true effect):\n")
print(f"{'Day':>4} {'FP Rate':>8}")
print("-" * 25)
for day, rate in daily_fp_rates.items():
bar = "#" * int(rate * 200)
print(f"{day:>4} {rate:>8.1%} {bar}")
Daily false positive rates (no true effect):
 Day  FP Rate
-------------------------
1 5.3% ##########
2 4.9% #########
3 5.2% ##########
4 4.8% #########
5 5.0% ##########
6 5.1% ##########
7 4.7% #########
...
28 5.0% ##########
Each individual day has approximately a 5% false positive rate --- that is correct. The problem is not that any single peek is biased. The problem is that peeking multiple times gives you multiple chances to be unlucky. It is the same reason rolling one die has a 1/6 chance of landing on 6, but rolling six dice gives you a 67% chance that at least one lands on 6.
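The dice arithmetic can be made explicit. A short back-of-the-envelope calculation, under the simplifying assumption that the peeks are independent (real peeks share accumulating data and are positively correlated, which is why the simulated rates above are somewhat lower, but the growth pattern is the same):

```python
# P(at least one of k independent checks at level alpha is
# "significant" under the null) = 1 - (1 - alpha) ** k
alpha = 0.05

for k in [1, 2, 5, 7, 10, 25]:
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>2} independent checks: P(at least one false positive) = {p_any:.1%}")

# The dice version: P(at least one six in six rolls)
print(f"Six dice: {1 - (5 / 6) ** 6:.0%}")
```

Seven independent checks already put the chance of at least one spurious "significant" result above 30%.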
Simulation 3: The Cumulative False Positive Curve
Let us visualize how the cumulative false positive rate grows with the number of peeks.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def cumulative_fp_by_n_peeks(
max_peeks: int = 30,
n_simulations: int = 10000,
n_per_check: int = 100000,
true_mean: float = 4.82,
true_std: float = 8.14,
alpha: float = 0.05,
seed: int = 42
) -> list:
"""
Calculate cumulative false positive rate as a function of number of peeks.
"""
rng = np.random.default_rng(seed=seed)
cumulative_fp = []
for n_peeks in range(1, max_peeks + 1):
ever_significant = 0
check_points = np.linspace(0.1, 1.0, n_peeks) # fraction of total data
for _ in range(n_simulations):
all_control = rng.normal(true_mean, true_std, n_per_check)
all_treatment = rng.normal(true_mean, true_std, n_per_check)
found_sig = False
for frac in check_points:
n = max(int(frac * n_per_check), 100)
_, p = stats.ttest_ind(all_control[:n], all_treatment[:n])
if p < alpha:
found_sig = True
break
if found_sig:
ever_significant += 1
cumulative_fp.append(ever_significant / n_simulations)
return cumulative_fp
fp_curve = cumulative_fp_by_n_peeks(max_peeks=25, n_simulations=5000)
fig, ax = plt.subplots(figsize=(10, 6))
peeks = range(1, 26)
ax.plot(peeks, [r * 100 for r in fp_curve], 'b-o', markersize=5, linewidth=2)
ax.axhline(y=5, color='r', linestyle='--', label='Nominal alpha (5%)')
ax.set_xlabel('Number of Peeks During Experiment', fontsize=12)
ax.set_ylabel('Cumulative False Positive Rate (%)', fontsize=12)
ax.set_title('How Peeking Inflates False Positive Rates\n'
             '(5,000 simulated A/B tests with NO true effect)', fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 40)
ax.set_xlim(0, 26)
plt.tight_layout()
plt.savefig('peeking_inflation.png', dpi=150)
plt.show()
print("\nSelected data points:")
for n_peek, rate in zip(peeks, fp_curve):
if n_peek in [1, 2, 5, 7, 10, 15, 20, 25]:
print(f" {n_peek:>2} peeks: {rate:.1%} false positive rate")
Selected data points:
1 peeks: 5.1% false positive rate
2 peeks: 7.8% false positive rate
5 peeks: 12.6% false positive rate
7 peeks: 14.8% false positive rate
10 peeks: 18.4% false positive rate
15 peeks: 22.1% false positive rate
20 peeks: 25.8% false positive rate
25 peeks: 28.4% false positive rate
With 25 daily peeks, the false positive rate is nearly six times the nominal 5% alpha. A PM who checks results daily for a month-long experiment has roughly a 1-in-4 chance of declaring a false victory.
The Fix: Sequential Testing
Sequential testing methods adjust the significance threshold at each peek to maintain the overall false positive rate at the desired alpha. The intuition: if you are going to check more often, each individual check must use a stricter threshold.
The Pocock Boundary
The simplest sequential design, named after Stuart Pocock, applies the same adjusted threshold at every peek, with the threshold chosen so that the overall false positive rate across all planned checks stays at alpha. The implementation below approximates that boundary by evenly dividing the alpha budget across the checks.
import numpy as np
from scipy import stats
def pocock_boundary(n_peeks: int, alpha: float = 0.05) -> float:
    """
    Conservative approximation to the Pocock boundary: a constant
    adjusted alpha applied at every peek.

    The exact Pocock boundary requires numerical integration over a
    multivariate normal distribution and is somewhat less strict than
    an even split of alpha. This approximation simply divides the
    alpha budget Bonferroni-style: conservative, but safe and easy to
    reason about. For exact group-sequential boundaries, see e.g. the
    R package `gsDesign`.
    """
    return alpha / n_peeks
def demonstrate_sequential_testing(
n_simulations: int = 10000,
n_peeks: int = 7,
alpha: float = 0.05,
n_per_group_total: int = 200000,
true_mean: float = 4.82,
true_std: float = 8.14,
seed: int = 42
) -> dict:
"""
Compare false positive rates: naive peeking vs. Pocock boundary.
"""
rng = np.random.default_rng(seed=seed)
adjusted_alpha = pocock_boundary(n_peeks, alpha)
naive_fp = 0
sequential_fp = 0
for _ in range(n_simulations):
control = rng.normal(true_mean, true_std, n_per_group_total)
treatment = rng.normal(true_mean, true_std, n_per_group_total)
naive_found = False
sequential_found = False
for peek in range(1, n_peeks + 1):
n = int(peek * n_per_group_total / n_peeks)
_, p = stats.ttest_ind(control[:n], treatment[:n])
if p < alpha and not naive_found:
naive_found = True
if p < adjusted_alpha and not sequential_found:
sequential_found = True
if naive_found:
naive_fp += 1
if sequential_found:
sequential_fp += 1
return {
'n_peeks': n_peeks,
'nominal_alpha': alpha,
'adjusted_alpha_per_peek': round(adjusted_alpha, 4),
'naive_fp_rate': round(naive_fp / n_simulations, 4),
'sequential_fp_rate': round(sequential_fp / n_simulations, 4),
}
seq_results = demonstrate_sequential_testing()
print("=== Sequential Testing Results ===\n")
print(f"Number of peeks: {seq_results['n_peeks']}")
print(f"Nominal alpha: {seq_results['nominal_alpha']:.1%}")
print(f"Adjusted alpha per peek (Pocock): {seq_results['adjusted_alpha_per_peek']:.4f}")
print()
print(f"Naive peeking FP rate: {seq_results['naive_fp_rate']:.1%}")
print(f"Sequential testing FP rate: {seq_results['sequential_fp_rate']:.1%}")
=== Sequential Testing Results ===
Number of peeks: 7
Nominal alpha: 5.0%
Adjusted alpha per peek (Pocock): 0.0071
Naive peeking FP rate: 14.8%
Sequential testing FP rate: 3.2%
The sequential approach controls the false positive rate: 3.2% is below the nominal 5%, as expected for the conservative Bonferroni-style approximation. The cost is that each peek demands a lower p-value, so an effect that would have just cleared alpha = 0.05 at the final analysis needs to be larger, or to be backed by more data, to cross the adjusted threshold.
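To put a rough number on that cost, compare the critical z-values implied by the two thresholds (an illustration only; the exact power hit depends on the alternative and the analysis schedule):

```python
from scipy import stats

alpha = 0.05
n_peeks = 7
adjusted = alpha / n_peeks  # the Bonferroni-style per-peek threshold used above

z_fixed = stats.norm.ppf(1 - alpha / 2)        # critical z for a single look
z_adjusted = stats.norm.ppf(1 - adjusted / 2)  # critical z at each peek

print(f"Critical z, single look at alpha = {alpha}: {z_fixed:.2f}")
print(f"Critical z, each of {n_peeks} looks at {adjusted:.4f}: {z_adjusted:.2f}")
print(f"Each peek needs a z about {z_adjusted / z_fixed:.2f}x "
      f"the single-look critical value.")
```

A critical z of roughly 2.69 instead of 1.96 means either a larger observed effect or more data behind it at every interim look.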
Production Tip --- Modern experimentation platforms (Optimizely, Eppo, Statsig) implement more sophisticated sequential testing methods like the mSPRT (mixture Sequential Probability Ratio Test) or always-valid p-values. These offer better power than the Pocock boundary while maintaining the false positive guarantee. If you are building your own experimentation platform, invest in one of these methods. If you are using a vendor platform, verify that it uses sequential testing by default.
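For readers curious what an always-valid p-value looks like under the hood, here is a minimal sketch of the mSPRT idea for a normal metric with known variance, following the mixture likelihood-ratio construction of Johari et al. The function name and the choice of mixing variance `tau2` are illustrative, not any platform's API:

```python
import numpy as np

def always_valid_p(diffs: np.ndarray, sigma2: float,
                   tau2: float = 1.0, theta0: float = 0.0) -> np.ndarray:
    """Always-valid p-values from the mSPRT mixture likelihood ratio.

    diffs  : stream of per-user treatment-minus-control differences
    sigma2 : known variance of each difference
    tau2   : variance of the normal mixing prior over the effect size
    """
    n = np.arange(1, len(diffs) + 1)
    xbar = np.cumsum(diffs) / n
    # Mixture likelihood ratio against H0: effect = theta0, integrating
    # the alternative over a N(theta0, tau2) prior on the effect size.
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n ** 2 * tau2 * (xbar - theta0) ** 2
        / (2 * sigma2 * (sigma2 + n * tau2))
    )
    # p_n = min over k <= n of 1 / lam_k, capped at 1: valid at EVERY peek.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))
```

Because the p-value sequence is nonincreasing and valid at every observation, you can check it as often as you like and reject the first time it drops below alpha without inflating the false positive rate; the price is lower power than a fixed-horizon test at the same sample size.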
What Happened to NovaMart
The new product page layout that was shipped on day 5 showed no sustained improvement. Over the following 8 weeks, the conversion rate was 3.40% --- identical to the pre-experiment baseline. The PM had launched a change that did nothing, burned engineering cycles integrating it into the codebase, and --- worse --- used the "success" to justify expanding the redesign to other page types.
Three months later, a properly designed 28-day experiment on the expanded redesign showed no significant effect. The entire initiative was rolled back. Total cost: approximately 4 engineering-months of wasted effort, plus the opportunity cost of experiments that were not run because the test slot was occupied by the redesign follow-up.
The root cause was not bad statistics. It was not a broken dashboard. It was a PM who looked at the data before it was ready, saw what she wanted to see, and acted on it. And an organization that did not have guardrails against that behavior.
Lessons
- Peeking is not "just looking at data." Every peek is an implicit hypothesis test. Multiple peeks compound false positive risk.
- The math is not subtle. Seven peeks at alpha = 0.05 produce a ~15% cumulative false positive rate. Daily peeks over a month push it past 25%. This is not a theoretical concern --- it is a near-certainty that you will eventually see a false positive.
- Process beats discipline. Maya at ShopSmart revoked dashboard access. NovaMart trusted the PM to "just look but not act." Guess which approach worked.
- Sequential testing exists. If you genuinely need to monitor results during the experiment, use sequential methods. They give you the ability to peek safely at the cost of slightly reduced power.
- Pre-registration is a contract. Decide before the experiment starts: when will you analyze? What metric? What threshold? Write it down. Share it. The pre-registration document protects everyone --- the analyst, the PM, and the business --- from post-hoc rationalization.
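The pre-registration lesson can even be made mechanical. A hypothetical sketch (the field names, dates, and helper are illustrative, not a standard format): keep the analysis plan in version control and gate the decision on it.

```python
# Hypothetical pre-registration record. Commit it before the experiment
# starts; any deviation becomes a visible protocol change, not a quiet one.
PREREGISTRATION = {
    "experiment": "product_page_redesign",
    "hypothesis": "New layout increases conversion rate",
    "primary_metric": "conversion_rate",
    "test": "two-proportion z-test, two-sided",
    "alpha": 0.05,
    "analysis_date": "2025-03-29",  # day 28: the ONLY decision point
    "interim_looks": [],            # none, unless boundaries are pre-specified
}

def decision_allowed(today: str) -> bool:
    """True only on or after the pre-registered analysis date
    (ISO date strings compare correctly as strings)."""
    return today >= PREREGISTRATION["analysis_date"]

print(decision_allowed("2025-03-10"))  # False: dashboard is read-only
print(decision_allowed("2025-03-29"))  # True: run the planned analysis
```

The point is not the code; it is that the plan exists as an artifact before any data arrives, so "it looked significant on day 5" has nothing to attach itself to.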
Discussion Questions
- NovaMart's PM checked results daily because she was "excited about the new design." How would you design an experimentation workflow that accommodates human curiosity without allowing it to corrupt results?
- The Pocock boundary used in the simulation is conservative (the sequential false positive rate was 3.2%, below the nominal 5%). What is the tradeoff of a more conservative boundary? In what situation would you prefer the Pocock approach over a more powerful sequential method?
- At NovaMart, the PM was the one who declared "significant" based on the dashboard. Should data scientists or analysts be the sole gatekeepers of experiment interpretation? What organizational structures help prevent the NovaMart failure mode?
- The cumulative false positive curve shows that 10 peeks produce an ~18% false positive rate. At what number of peeks does the false positive rate become unacceptable to you personally? Is there a principled way to define "unacceptable"?
- Some teams argue that sequential testing reduces power and therefore wastes traffic. Under what circumstances is this argument valid? Under what circumstances is it misleading?
This case study accompanies Chapter 3: Experimental Design and A/B Testing. Return to the chapter for full context.