Chapter 3: Experimental Design and A/B Testing

The Foundation of Data-Driven Decisions


Learning Objectives

By the end of this chapter, you will be able to:

  1. Design a proper A/B test with hypothesis, control, treatment, and success metric
  2. Calculate required sample size using power analysis
  3. Identify and avoid common A/B testing pitfalls (peeking, multiple testing, Simpson's paradox)
  4. Interpret A/B test results with statistical rigor
  5. Handle the case where the test says "no difference" but stakeholders want a launch

The Model That Worked But Nobody Could Prove It

War Story --- ShopSmart, a mid-size e-commerce marketplace with 14 million monthly users, spent four months building a new recommendation algorithm. The offline metrics looked great: a 12% improvement in mean reciprocal rank on the holdout set, a 9% lift in simulated click-through rate, and improved coverage across long-tail products. The ML team was confident. The VP of Product was excited. They launched the new algorithm to all 14 million users on a Tuesday.

Revenue went up 3.2% that week. Celebration. Bonuses. A company-wide email from the CEO crediting the "AI team."

Except: there was a site-wide promotional campaign that same week. And a competitor went down for 18 hours on Wednesday. And it was the first week of a new month, which historically shows higher engagement. Was the revenue lift caused by the new algorithm? By the promotion? By the competitor outage? By the calendar? Nobody could say. The team had spent four months building a model and zero days designing an experiment to prove it worked.

Six months later, the algorithm was quietly rolled back after a sustained performance decline that nobody could attribute to anything specific --- because there had never been a controlled experiment to establish the baseline.

This chapter is about the experiment. Specifically, it is about A/B testing --- the randomized controlled experiment that is the only reliable way to establish that your model, your feature, your intervention caused the outcome you observed.

Most ML textbooks skip experimentation entirely. They teach you to build models, evaluate them on holdout sets, and stop there. But a model that performs well offline is a hypothesis, not a conclusion. The conclusion comes from a properly designed experiment that isolates your model's effect from the dozens of confounding factors that plague real-world systems.

A data scientist who cannot design experiments is a data scientist who cannot prove their model works.


3.1 Why A/B Testing Matters for Data Scientists

If you have taken an introductory statistics course, you have encountered hypothesis testing. You know about null hypotheses, p-values, and Type I errors. You may be wondering why we are spending an entire chapter on something you already know.

Here is why: the gap between understanding hypothesis testing in a statistics class and running a reliable A/B test in production is enormous. Statistics class gives you the math. Production gives you stakeholders who peek at results on day 2, engineers who accidentally break the randomization, product managers who add three more variants "since we're already testing," and a VP who announces the results to the board before the test reaches significance.

A/B testing in production is not a statistics problem. It is a systems problem that uses statistics.

What an A/B Test Actually Is

An A/B test is a randomized controlled experiment. You split your users (or sessions, or pageviews) into two groups:

  • Control (A): The existing experience. The status quo.
  • Treatment (B): The new experience. The thing you want to test.

Users are assigned randomly. You measure an outcome metric for both groups. You use statistical inference to determine whether the difference you observe is real or noise.

That is it. The concept is simple. The execution is where organizations stumble.

The Three Questions Every A/B Test Answers

  1. Is the effect real? --- Statistical significance. Is the observed difference larger than what we would expect from random chance?
  2. How big is the effect? --- Effect size. Even if the effect is real, is it large enough to matter?
  3. Are we confident in the direction? --- Confidence interval. What is the plausible range of the true effect?

Notice that "is the effect positive?" is not on this list. An A/B test can tell you the new algorithm hurts performance. That is not a failure of the experiment. That is the experiment doing its job.
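The three questions map directly onto three numbers. As a minimal sketch with synthetic data (the distributions and the 3% lift here are illustrative, not ShopSmart's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
control = rng.normal(5.0, 2.0, 10_000)      # synthetic metric values
treatment = rng.normal(5.15, 2.0, 10_000)   # true 3% lift

# Q1: Is the effect real? (statistical significance)
t_stat, p_value = stats.ttest_ind(treatment, control)

# Q2: How big is the effect? (effect size)
diff = treatment.mean() - control.mean()

# Q3: What range of effects is plausible? (95% confidence interval)
se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
             control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"p-value: {p_value:.4f}, difference: {diff:.3f}, "
      f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

All three come from the same two arrays; the discipline is in reporting all three, not just the p-value.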


3.2 The Anatomy of a Proper A/B Test

Let us design a real experiment. ShopSmart has built a new recommendation algorithm --- call it RecV2 --- and wants to know whether it increases revenue per user compared to the existing algorithm (RecV1).

Step 1: Define the Hypothesis

Every experiment starts with two hypotheses:

  • Null hypothesis (H0): RecV2 has no effect on revenue per user compared to RecV1. Any observed difference is due to random variation.
  • Alternative hypothesis (H1): RecV2 changes revenue per user compared to RecV1. The observed difference is not due to random variation.

Notice that the alternative hypothesis is two-sided: it says RecV2 changes revenue, not that it increases revenue. You should almost always use a two-sided test. If you use a one-sided test, you are assuming the treatment cannot make things worse --- and in practice, it often can.

Common Mistake --- Using a one-sided test because "we only care if it's better." A one-sided test has more statistical power in one direction, but it blinds you to harm in the other direction. If your new algorithm actively hurts revenue, a one-sided test for improvement will not detect it. Use two-sided tests unless you have a strong, pre-specified reason not to.
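To see the blindness concretely, here is a sketch with synthetic numbers: a treatment that truly hurts the metric, tested both ways.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
control = rng.normal(4.82, 8.14, 200_000)
treatment = rng.normal(4.82 * 0.97, 8.14, 200_000)  # true 3% DROP

# One-sided test for improvement only: "is treatment greater than control?"
_, p_one_sided = stats.ttest_ind(treatment, control, alternative='greater')

# Two-sided test: "is treatment different from control?"
_, p_two_sided = stats.ttest_ind(treatment, control)

print(f"one-sided (improvement only): p = {p_one_sided:.3f}")  # large: harm invisible
print(f"two-sided:                    p = {p_two_sided:.3f}")  # small: harm detected
```

The one-sided test reports nothing interesting while the algorithm quietly loses money; the two-sided test flags the regression.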

Step 2: Choose the Primary Metric

The primary metric is the single number that will determine whether the experiment succeeds. Choosing this metric well is an act of problem framing, and everything we discussed in Chapter 1 applies here.

For ShopSmart, the candidates are:

| Metric | Pros | Cons |
| --- | --- | --- |
| Click-through rate (CTR) | Easy to measure, high volume | Users can click without buying |
| Conversion rate | Closer to revenue | Does not capture order value |
| Revenue per user (RPU) | Directly tied to business value | Higher variance, needs more samples |
| Revenue per session | Granular | Sessions are noisy |

We choose revenue per user over the experimental period as the primary metric. It is the metric the business actually cares about. Yes, it has higher variance than click-through rate, which means we need more samples. That is a cost we pay for measuring what matters.

Production Tip --- Never let "easy to measure" override "measures the right thing." CTR is seductive because it reaches significance fast. But a recommendation algorithm that increases clicks on low-margin products while decreasing purchases of high-margin products will show improved CTR and decreased revenue. You will celebrate a metric while the business loses money.

Step 3: Define Guardrail Metrics

The primary metric tells you whether the experiment wins. Guardrail metrics tell you whether the experiment is safe. These are metrics that must not degrade meaningfully, even if the primary metric improves.

For the ShopSmart experiment:

  • Page load time. If RecV2 is slower, it could hurt user experience regardless of recommendation quality.
  • Return rate. If RecV2 increases revenue by recommending products that get returned, the net effect is negative.
  • Search usage. If users cannot find what they want through recommendations and fall back to search, the algorithm is failing.
  • Customer support tickets. A sudden spike in complaints signals something is wrong.

Guardrail metrics are not optimized. They are monitored. If any guardrail metric moves significantly in the wrong direction, the experiment is paused for investigation --- even if the primary metric looks great.

Step 4: Define the Randomization Unit

What gets randomly assigned? This seems obvious --- users --- but it is a real design decision.

| Randomization Unit | When to Use | Risk |
| --- | --- | --- |
| User | Most experiments | Users with multiple devices may see inconsistent experiences |
| Session | When user identity is unreliable | Same user can be in both groups |
| Pageview | Rarely; for layout experiments | Extreme inconsistency for individual users |
| Cookie | When logins are rare | Cookie deletion resets assignment |
| Device | Multi-device products | Same user, different groups |

For ShopSmart, we randomize by user ID. Every logged-in user is assigned to either control or treatment, and they stay in that group for the entire experiment. This ensures a consistent experience and prevents cross-contamination.

Common Mistake --- Randomizing by session instead of user for long-running experiments. If a user has 15 sessions during the experiment and 8 land in the treatment group while 7 land in control, you have contaminated both groups. The user's behavior in "control" sessions is influenced by their "treatment" sessions. Randomize by user unless you have a specific reason not to.

Step 5: Calculate the Required Sample Size

This is where most teams either skip ahead or get the math wrong. Sample size calculation answers a critical question: How many users do we need, and how long do we need to run the experiment, to detect the effect we care about?

To calculate sample size, you need four inputs:

  1. Baseline metric value. Current average revenue per user: $4.82 per week.
  2. Minimum detectable effect (MDE). The smallest improvement worth detecting. ShopSmart decides that anything less than a 2% lift ($0.096) is not worth the engineering effort to maintain RecV2. So MDE = 2%.
  3. Statistical significance level (alpha). The probability of a false positive --- detecting an effect that does not exist. Convention: alpha = 0.05.
  4. Statistical power (1 - beta). The probability of detecting a real effect if it exists. Convention: power = 0.80. Many mature experimentation platforms use 0.90.

We also need the standard deviation of the metric. Revenue per user at ShopSmart has a standard deviation of $8.14 (revenue data is typically right-skewed with high variance).

import numpy as np
from statsmodels.stats.power import NormalIndPower

# ShopSmart experiment parameters
baseline_rpu = 4.82       # $ per user per week
mde_relative = 0.02       # 2% minimum detectable effect
mde_absolute = baseline_rpu * mde_relative  # $0.0964

std_dev = 8.14             # standard deviation of revenue per user

# Cohen's d: effect size in standard deviation units
effect_size = mde_absolute / std_dev
print(f"Effect size (Cohen's d): {effect_size:.4f}")

# Power analysis
power_analysis = NormalIndPower()
sample_size_per_group = power_analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative='two-sided'
)

print(f"Required sample size per group: {int(np.ceil(sample_size_per_group)):,}")
print(f"Total sample size (both groups): {int(np.ceil(sample_size_per_group)) * 2:,}")
Effect size (Cohen's d): 0.0118
Required sample size per group: 111,927
Total sample size (both groups): 223,854

ShopSmart has 14 million monthly users, which means roughly 3.5 million weekly active users. With a 50/50 split, each group gets 1.75 million users per week. We need roughly 112,000 per group, which means a single week of traffic is more than enough to reach statistical power.

But that is the minimum. In practice, we should run for at least two full weeks to account for day-of-week effects and to have margin for users who do not visit during the experimental period.

# How long do we need to run?
weekly_active_users = 3_500_000
users_per_group_per_week = weekly_active_users // 2

min_weeks = np.ceil(sample_size_per_group / users_per_group_per_week)
recommended_weeks = max(min_weeks + 1, 2)  # add buffer, minimum 2 weeks

print(f"Users per group per week: {users_per_group_per_week:,}")
print(f"Minimum weeks needed: {int(min_weeks)}")
print(f"Recommended duration: {int(recommended_weeks)} weeks")
Users per group per week: 1,750,000
Minimum weeks needed: 1
Recommended duration: 2 weeks

Production Tip --- Always round up your experiment duration to full weeks. E-commerce behavior varies dramatically by day of week. Running from Monday to Thursday will systematically miss weekend shoppers, whose behavior is materially different. Stopping "as soon as we hit the sample size" mid-week is not the same as running Monday-to-Sunday for two full weeks.

Step 6: The Power Analysis Cheat Sheet

Let us build a reusable function for power analysis, because you will do this constantly.

import numpy as np
from statsmodels.stats.power import NormalIndPower, TTestIndPower

def calculate_sample_size(
    baseline_value: float,
    mde_relative: float,
    std_dev: float,
    alpha: float = 0.05,
    power: float = 0.80,
    alternative: str = 'two-sided'
) -> dict:
    """
    Calculate required sample size for an A/B test.

    Parameters
    ----------
    baseline_value : float
        Current value of the metric (e.g., revenue per user).
    mde_relative : float
        Minimum detectable effect as a relative change (e.g., 0.02 for 2%).
    std_dev : float
        Standard deviation of the metric.
    alpha : float
        Significance level (probability of Type I error).
    power : float
        Statistical power (1 - probability of Type II error).
    alternative : str
        'two-sided' or 'larger' or 'smaller'.

    Returns
    -------
    dict with sample size per group, total, effect size, and MDE absolute.
    """
    mde_absolute = baseline_value * mde_relative
    effect_size = mde_absolute / std_dev

    analysis = NormalIndPower()
    n_per_group = int(np.ceil(
        analysis.solve_power(
            effect_size=effect_size,
            alpha=alpha,
            power=power,
            alternative=alternative
        )
    ))

    return {
        'n_per_group': n_per_group,
        'n_total': n_per_group * 2,
        'effect_size_cohens_d': round(effect_size, 4),
        'mde_absolute': round(mde_absolute, 4),
        'mde_relative': mde_relative,
        'alpha': alpha,
        'power': power,
    }


# ShopSmart example
result = calculate_sample_size(
    baseline_value=4.82,
    mde_relative=0.02,
    std_dev=8.14,
    alpha=0.05,
    power=0.80
)

for key, value in result.items():
    print(f"{key}: {value:,}" if isinstance(value, int) else f"{key}: {value}")
n_per_group: 111,927
n_total: 223,854
effect_size_cohens_d: 0.0118
mde_absolute: 0.0964
mde_relative: 0.02
alpha: 0.05
power: 0.8

Try It --- Modify the power analysis above for a higher bar: alpha = 0.01 and power = 0.90. How does the required sample size change? Then try MDE = 5% instead of 2%. What happens? The relationship between these parameters is one of the most important intuitions in experimental design.

The Power Curve

It helps to visualize how sample size changes with MDE, because this visualization is what you will show stakeholders when they ask "can we detect a 0.5% improvement?"

import matplotlib.pyplot as plt
import numpy as np
from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()
mde_range = np.arange(0.005, 0.101, 0.005)  # 0.5% to 10%
baseline = 4.82
std_dev = 8.14

sample_sizes_80 = []
sample_sizes_90 = []

for mde in mde_range:
    effect_size = (baseline * mde) / std_dev
    n_80 = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                 power=0.80, alternative='two-sided')
    n_90 = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                 power=0.90, alternative='two-sided')
    sample_sizes_80.append(n_80)
    sample_sizes_90.append(n_90)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(mde_range * 100, np.array(sample_sizes_80) / 1e6, 'b-o',
        label='Power = 0.80', markersize=4)
ax.plot(mde_range * 100, np.array(sample_sizes_90) / 1e6, 'r-s',
        label='Power = 0.90', markersize=4)
ax.set_xlabel('Minimum Detectable Effect (%)', fontsize=12)
ax.set_ylabel('Sample Size per Group (millions)', fontsize=12)
ax.set_title('Required Sample Size vs. Minimum Detectable Effect\n'
             'ShopSmart Revenue per User (baseline=$4.82, SD=$8.14)',
             fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 10.5)
plt.tight_layout()
plt.savefig('power_curve.png', dpi=150)
plt.show()

This curve is your negotiation tool. When a PM says "I want to detect a 0.5% improvement," you can show them the chart and say: "That requires about 1.8 million users per group. That is nearly every weekly active user on the site, and because the same users return week after week, accumulating that many distinct users per group takes several weeks. Are you willing to wait that long?" The answer is usually "what if we accept 2%?" and now you are having the right conversation.


3.3 Running the Experiment

Implementing Randomization

Randomization is the backbone of the experiment. If the randomization is flawed, everything downstream is invalid.

import hashlib

def assign_variant(user_id: str, experiment_name: str, salt: str = '') -> str:
    """
    Deterministic assignment of a user to a variant using hashing.

    This ensures:
    - Same user always gets the same variant (consistency)
    - Different experiments assign independently (salt)
    - Assignment is uniformly distributed (hash properties)

    Parameters
    ----------
    user_id : str
        Unique user identifier.
    experiment_name : str
        Name of the experiment (acts as namespace).
    salt : str
        Additional salt for independence between experiments.

    Returns
    -------
    'control' or 'treatment'
    """
    hash_input = f"{experiment_name}.{salt}.{user_id}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100  # 0-99

    if bucket < 50:
        return 'control'
    else:
        return 'treatment'


# Example usage
for uid in ['user_10042', 'user_55813', 'user_99201', 'user_33107']:
    variant = assign_variant(uid, experiment_name='rec_v2_test', salt='2025q1')
    print(f"{uid} -> {variant}")
user_10042 -> treatment
user_55813 -> control
user_99201 -> control
user_33107 -> treatment

Production Tip --- Never use random.random() for experiment assignment. It is not deterministic --- the same user would get different assignments on different requests. Hash-based assignment guarantees consistency. Most production experimentation platforms (Optimizely, LaunchDarkly, Statsig, Eppo) use this approach internally.

Validating Randomization: The A/A Test

Before running the real experiment, run an A/A test. This is an experiment where both groups receive the same experience. Its purpose: verify that the randomization is working and the measurement infrastructure is correct.

If an A/A test shows a significant difference, something is broken --- and it is broken in a way that would invalidate any subsequent A/B test.

import numpy as np
from scipy import stats

def run_aa_test(n_simulations: int = 1000, n_per_group: int = 50000,
                true_mean: float = 4.82, true_std: float = 8.14,
                alpha: float = 0.05) -> dict:
    """
    Simulate A/A tests to verify that the false positive rate
    matches the significance level.

    Parameters
    ----------
    n_simulations : int
        Number of A/A tests to simulate.
    n_per_group : int
        Users per group in each simulated test.
    true_mean : float
        True mean of the metric (same for both groups).
    true_std : float
        True standard deviation.
    alpha : float
        Significance threshold.

    Returns
    -------
    dict with false positive rate and expected rate.
    """
    rng = np.random.default_rng(seed=42)
    significant_count = 0

    for _ in range(n_simulations):
        group_a = rng.normal(true_mean, true_std, n_per_group)
        group_b = rng.normal(true_mean, true_std, n_per_group)
        _, p_value = stats.ttest_ind(group_a, group_b)
        if p_value < alpha:
            significant_count += 1

    false_positive_rate = significant_count / n_simulations

    return {
        'n_simulations': n_simulations,
        'false_positive_rate': round(false_positive_rate, 3),
        'expected_rate': alpha,
        'within_tolerance': abs(false_positive_rate - alpha) < 0.02
    }


aa_result = run_aa_test()
print(f"False positive rate: {aa_result['false_positive_rate']:.1%}")
print(f"Expected rate: {aa_result['expected_rate']:.1%}")
print(f"Within tolerance: {aa_result['within_tolerance']}")
False positive rate: 4.7%
Expected rate: 5.0%
Within tolerance: True

The false positive rate should be close to alpha (5%). If it is significantly higher --- say, 12% or 15% --- your randomization or measurement is broken. Common causes include:

  • Randomization that correlates with user characteristics (e.g., hashing user IDs that are sequential integers)
  • Metric computation bugs that introduce systematic bias
  • Bot traffic that is distributed unevenly across groups
  • Caching effects where one group's pages are served differently

Stratified Randomization

Simple random assignment works well for large samples. For smaller samples, or when you know certain user segments behave very differently, stratified randomization reduces variance and increases power.

import pandas as pd
import hashlib

def assign_variant_stratified(user_id: str, stratum: str,
                               experiment_name: str) -> str:
    """
    Assign variant within strata to ensure balanced groups.
    """
    hash_input = f"{experiment_name}.{stratum}.{user_id}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100
    return 'control' if bucket < 50 else 'treatment'


# Example: stratify by user tier at ShopSmart
user_tiers = {
    'user_10042': 'high_value',
    'user_55813': 'medium_value',
    'user_99201': 'low_value',
    'user_33107': 'high_value',
    'user_77444': 'medium_value',
    'user_22018': 'low_value',
}

for uid, tier in user_tiers.items():
    variant = assign_variant_stratified(uid, tier, 'rec_v2_test')
    print(f"{uid} ({tier}) -> {variant}")

Stratification keeps each value tier (high, medium, low) split close to 50/50 within both groups. Without it, random chance could put 55% of high-value users in the treatment group, inflating the treatment's revenue even if the algorithm has no effect. Note that hash-based assignment within each stratum balances the groups in expectation; if you need exact balance, use blocked assignment within each stratum.
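A quick way to check the balance claim is to simulate assignment for a synthetic population and tabulate the split within each tier. The compact `assign` helper below re-implements the stratified hash assignment so the sketch is self-contained; the tier proportions are made up.

```python
import hashlib
import numpy as np
import pandas as pd

def assign(user_id: str, stratum: str, experiment: str) -> str:
    # Same hash-bucket scheme as the stratified assignment function above
    h = hashlib.md5(f"{experiment}.{stratum}.{user_id}".encode()).hexdigest()
    return 'control' if int(h[:8], 16) % 100 < 50 else 'treatment'

rng = np.random.default_rng(seed=5)
tiers = rng.choice(['high_value', 'medium_value', 'low_value'],
                   size=30_000, p=[0.2, 0.3, 0.5])
df = pd.DataFrame({'user_id': [f'user_{i}' for i in range(30_000)],
                   'tier': tiers})
df['variant'] = [assign(u, t, 'rec_v2_test')
                 for u, t in zip(df['user_id'], df['tier'])]

# Each tier should split close to 50/50 between variants
print(df.groupby('tier')['variant'].value_counts(normalize=True).round(3))
```

Every tier lands near a 50/50 split, even the smallest one; that is the variance reduction stratification buys you.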


3.4 Analyzing the Results

The experiment has run for 21 days. Time to analyze.

The Standard Analysis

import numpy as np
import pandas as pd
from scipy import stats

# Simulated experiment results (in practice, this comes from your data warehouse)
rng = np.random.default_rng(seed=42)

n_control = 850_000
n_treatment = 848_000

# Simulate: treatment has a true 2.5% lift
# (lognormal parameters calibrated so the theoretical mean is ~$4.82,
#  matching the baseline, with a standard deviation of ~$8.6)
control_rpu = rng.lognormal(mean=0.85, sigma=1.2, size=n_control)
treatment_rpu = rng.lognormal(mean=0.85, sigma=1.2, size=n_treatment) * 1.025

# Compute summary statistics
results = pd.DataFrame({
    'group': ['Control', 'Treatment'],
    'n_users': [n_control, n_treatment],
    'mean_rpu': [np.mean(control_rpu), np.mean(treatment_rpu)],
    'median_rpu': [np.median(control_rpu), np.median(treatment_rpu)],
    'std_rpu': [np.std(control_rpu), np.std(treatment_rpu)],
})
results['mean_rpu'] = results['mean_rpu'].round(4)
results['median_rpu'] = results['median_rpu'].round(4)
results['std_rpu'] = results['std_rpu'].round(4)

print(results.to_string(index=False))

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(treatment_rpu, control_rpu)
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")

# Relative lift
lift = (np.mean(treatment_rpu) - np.mean(control_rpu)) / np.mean(control_rpu)
print(f"Relative lift: {lift:.2%}")

# 95% confidence interval for the difference in means
diff = np.mean(treatment_rpu) - np.mean(control_rpu)
se_diff = np.sqrt(
    np.var(treatment_rpu, ddof=1) / n_treatment +
    np.var(control_rpu, ddof=1) / n_control
)
ci_lower = diff - 1.96 * se_diff
ci_upper = diff + 1.96 * se_diff
print(f"Difference in means: ${diff:.4f}")
print(f"95% CI: [${ci_lower:.4f}, ${ci_upper:.4f}]")
     group  n_users  mean_rpu  median_rpu  std_rpu
   Control   850000    4.8091      2.3389   8.6224
 Treatment   848000    4.9293      2.3974   8.8380

t-statistic: 8.9698
p-value: 0.000000
Relative lift: 2.50%
Difference in means: $0.1202
95% CI: [$0.0939, $0.1465]

Interpreting the Results

The results tell us:

  1. Statistical significance. The p-value is far below 0.05. We reject the null hypothesis. The difference is extremely unlikely to be due to chance.
  2. Effect size. The treatment group's revenue per user is $0.12 higher, a 2.5% relative lift.
  3. Confidence interval. The true effect is plausibly between $0.09 and $0.15 per user. The interval does not include zero, which is consistent with the significant p-value.
  4. Practical significance. A 2.5% lift on a ~$4.81 baseline, across 14M monthly users, translates to roughly $1.7M in additional monthly revenue. That is practically significant.

Common Mistake --- Confusing statistical significance with practical significance. A p-value of 0.001 does not mean the effect is large. With enough data, you can detect a $0.01 difference with high significance. Always ask: "Is the effect large enough to justify the cost of implementing and maintaining the change?" A 0.1% improvement that costs $500K in engineering time is statistically significant and practically worthless.
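The practical-significance question can be made concrete with back-of-the-envelope arithmetic. The sketch below uses a hypothetical `breakeven_months` helper, and the build and maintenance costs are illustrative assumptions; plug in your own.

```python
def breakeven_months(lift_per_user: float, monthly_users: int,
                     build_cost: float, monthly_maintenance: float) -> float:
    """Months until incremental revenue pays back the investment (rough)."""
    monthly_gain = lift_per_user * monthly_users
    net_monthly = monthly_gain - monthly_maintenance
    if net_monthly <= 0:
        return float('inf')  # the change never pays for itself
    return build_cost / net_monthly

# A 2.5% lift on a ~$4.82 RPU is roughly $0.12 per user (assumed costs)
print(f"2.5% lift: {breakeven_months(0.12, 14_000_000, 500_000, 50_000):.2f} months")

# A 0.1% lift (~$0.005 per user): statistically detectable, but worth it?
print(f"0.1% lift: {breakeven_months(0.005, 14_000_000, 500_000, 50_000):.2f} months")
```

The large lift pays back in days; the tiny-but-significant lift takes years. This is the calculation to run before celebrating a p-value.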

Reporting with Confidence Intervals, Not Just P-Values

P-values tell you whether an effect exists. Confidence intervals tell you how big it might be. The interval is almost always more useful.

def ab_test_report(control_data: np.ndarray, treatment_data: np.ndarray,
                    metric_name: str = 'RPU',
                    confidence_level: float = 0.95) -> dict:
    """
    Generate a complete A/B test report with confidence intervals.

    Parameters
    ----------
    control_data : np.ndarray
        Metric values for the control group.
    treatment_data : np.ndarray
        Metric values for the treatment group.
    metric_name : str
        Name of the metric for display.
    confidence_level : float
        Confidence level for the interval (default 0.95).

    Returns
    -------
    dict with test results.
    """
    n_c, n_t = len(control_data), len(treatment_data)
    mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
    std_c, std_t = np.std(control_data, ddof=1), np.std(treatment_data, ddof=1)

    diff = mean_t - mean_c
    relative_lift = diff / mean_c

    se = np.sqrt(std_c**2 / n_c + std_t**2 / n_t)
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)

    ci_lower = diff - z * se
    ci_upper = diff + z * se

    t_stat, p_value = stats.ttest_ind(treatment_data, control_data)

    return {
        'metric': metric_name,
        'control_mean': round(mean_c, 4),
        'treatment_mean': round(mean_t, 4),
        'absolute_difference': round(diff, 4),
        'relative_lift': f"{relative_lift:.2%}",
        'ci_lower': round(ci_lower, 4),
        'ci_upper': round(ci_upper, 4),
        'p_value': round(p_value, 6),
        'significant': p_value < (1 - confidence_level),
        'n_control': n_c,
        'n_treatment': n_t,
    }


report = ab_test_report(control_rpu, treatment_rpu, metric_name='Revenue per User')
for key, value in report.items():
    print(f"  {key}: {value}")

3.5 The Pitfalls: Where A/B Tests Go Wrong

This is the section that separates textbook experimenters from production experimenters. Every pitfall described here has cost real companies real money.

Pitfall 1: Peeking (The Most Common Sin)

War Story --- The ShopSmart PM checked the experiment dashboard every morning. On day 3 of a planned 21-day test, the dashboard showed a p-value of 0.03. The PM emailed the VP: "The test is significant! Let's ship it." By day 7, the p-value had drifted to 0.11. By day 14, it was 0.08. The final result at day 21: p = 0.06. Not significant. But the PM had already promised the VP a launch based on the day 3 number, and the VP had already mentioned it to the board.

This is peeking. It is the single most common way A/B tests produce false positives.

Peeking means checking your experiment results repeatedly and stopping when you see a significant result. It sounds harmless --- you are just looking at the data. But it is deeply dangerous, because it inflates your false positive rate far beyond the nominal alpha level.

Why? Because each time you check, you are conducting another hypothesis test. Even under the null hypothesis (no true effect), random fluctuations will temporarily produce low p-values. If you check daily for 21 days, you have 21 chances to observe a p-value below 0.05 by pure chance. The cumulative probability of seeing at least one "significant" result is much higher than 5%.

import numpy as np
from scipy import stats

def simulate_peeking(n_simulations: int = 10000, n_per_group: int = 105_000,
                      check_days: list = None, users_per_day: int = 5000,
                      alpha: float = 0.05) -> dict:
    """
    Simulate how peeking inflates false positive rates.

    Both groups are drawn from the same distribution (no true effect).
    We check for significance at multiple time points.

    Parameters
    ----------
    n_simulations : int
        Number of simulated experiments.
    n_per_group : int
        Total users per group if the experiment runs to completion.
    check_days : list of int
        Days at which we peek at the results.
    users_per_day : int
        Users entering each group per day.
    alpha : float
        Significance level.

    Returns
    -------
    dict with false positive rates at each check point and overall.
    """
    if check_days is None:
        check_days = [3, 5, 7, 10, 14, 21]

    rng = np.random.default_rng(seed=42)
    ever_significant = 0
    fp_at_check = {day: 0 for day in check_days}

    for _ in range(n_simulations):
        # Generate all data upfront (no true effect)
        control_all = rng.normal(4.82, 8.14, n_per_group)
        treatment_all = rng.normal(4.82, 8.14, n_per_group)

        found_significant = False
        for day in check_days:
            n_available = min(day * users_per_day, n_per_group)
            control_sample = control_all[:n_available]
            treatment_sample = treatment_all[:n_available]

            _, p_val = stats.ttest_ind(control_sample, treatment_sample)
            if p_val < alpha:
                fp_at_check[day] += 1
                found_significant = True

        if found_significant:
            ever_significant += 1

    result = {
        'overall_false_positive_rate': round(ever_significant / n_simulations, 3),
        'nominal_alpha': alpha,
        'check_points': {}
    }
    for day in check_days:
        result['check_points'][f'day_{day}'] = round(
            fp_at_check[day] / n_simulations, 3
        )

    return result


peeking_result = simulate_peeking()
print(f"Nominal alpha: {peeking_result['nominal_alpha']:.1%}")
print(f"Actual false positive rate (peeking): "
      f"{peeking_result['overall_false_positive_rate']:.1%}")
print("\nFalse positive rate at each check:")
for day, rate in peeking_result['check_points'].items():
    print(f"  {day}: {rate:.1%}")
Nominal alpha: 5.0%
Actual false positive rate (peeking): 14.9%

False positive rate at each check:
  day_3: 4.8%
  day_5: 5.1%
  day_7: 5.2%
  day_10: 4.9%
  day_14: 5.0%
  day_21: 5.1%

Each individual check has about a 5% false positive rate --- that is correct. But the probability that at least one check produces a false positive across all six peeks is roughly 15%. You thought you were running a test with a 5% error rate. You were actually running one with a 15% error rate.

Solutions to peeking:

  1. Pre-register the analysis date. Decide before the experiment starts when you will analyze. Do not look before that date.
  2. Use sequential testing. Methods like the Wald sequential probability ratio test or the mSPRT adjust the significance threshold at each peek to maintain the overall error rate. This is what platforms like Optimizely and Eppo use.
  3. Lock the dashboard. If you cannot trust yourself not to peek, remove access to the results until the analysis date.
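A quick simulation makes the fix concrete. This is a hedged sketch, not a production sequential test: it simply splits the 5% error budget evenly across the six planned looks (a crude Bonferroni-style alpha-spending rule, simpler than mSPRT). The sample sizes and daily traffic below are illustrative assumptions, not the parameters of the simulation above.

```python
import numpy as np
from scipy import stats

def peeking_with_alpha_spending(n_simulations=2000, n_per_group=10_000,
                                users_per_day=500, alpha=0.05,
                                check_days=(3, 5, 7, 10, 14, 21)):
    """Simulate peeking where each look uses alpha / n_looks."""
    per_look_alpha = alpha / len(check_days)  # split the error budget
    rng = np.random.default_rng(seed=0)
    ever_significant = 0
    for _ in range(n_simulations):
        control = rng.normal(0, 1, n_per_group)    # no true effect
        treatment = rng.normal(0, 1, n_per_group)
        for day in check_days:
            n = min(day * users_per_day, n_per_group)
            _, p = stats.ttest_ind(control[:n], treatment[:n])
            if p < per_look_alpha:                 # stop at the first "win"
                ever_significant += 1
                break
    return ever_significant / n_simulations

rate = peeking_with_alpha_spending()
print(f"Overall false positive rate with corrected looks: {rate:.1%}")
```

Because the looks reuse overlapping data, the corrected overall rate lands below the nominal 5% --- conservative, which is one reason experimentation platforms prefer purpose-built sequential methods over a flat Bonferroni split.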

Pitfall 2: Multiple Testing

Multiple testing occurs when you test many hypotheses simultaneously. ShopSmart's experiment had one primary metric (RPU), but the PM also wants to know the effect on CTR, conversion rate, average order value, items per cart, and time on site. That is six tests.

If each test has a 5% false positive rate and the metrics are independent, the probability of at least one false positive is:

1 - (1 - 0.05)^6 = 26.5%

A better-than-one-in-four chance of a false positive somewhere is far too high, and the PM will cherry-pick the metric that "won."

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def demonstrate_multiple_testing(n_metrics: int = 6,
                                  n_per_group: int = 100000,
                                  alpha: float = 0.05) -> None:
    """
    Demonstrate the multiple testing problem and corrections.
    """
    rng = np.random.default_rng(seed=42)

    # Simulate: NO true effect on any metric
    p_values = []
    metric_names = ['RPU', 'CTR', 'Conversion', 'AOV', 'Items/Cart', 'Time on Site']

    print("Uncorrected results (no true effect on any metric):\n")
    for name in metric_names:
        control = rng.normal(0, 1, n_per_group)
        treatment = rng.normal(0, 1, n_per_group)
        _, p_val = stats.ttest_ind(control, treatment)
        p_values.append(p_val)
        sig = "***" if p_val < alpha else ""
        print(f"  {name:15s}: p = {p_val:.4f} {sig}")

    print(f"\nUncorrected significant results: "
          f"{sum(1 for p in p_values if p < alpha)} / {n_metrics}")

    # Bonferroni correction
    bonferroni_alpha = alpha / n_metrics
    print(f"\n--- Bonferroni correction (adjusted alpha = {bonferroni_alpha:.4f}) ---")
    for name, p in zip(metric_names, p_values):
        sig = "***" if p < bonferroni_alpha else ""
        print(f"  {name:15s}: p = {p:.4f} {sig}")

    # Benjamini-Hochberg (FDR) correction
    rejected, corrected_p, _, _ = multipletests(p_values, method='fdr_bh')
    print(f"\n--- Benjamini-Hochberg (FDR) correction ---")
    for name, p_orig, p_corr, rej in zip(metric_names, p_values,
                                           corrected_p, rejected):
        sig = "***" if rej else ""
        print(f"  {name:15s}: p_original = {p_orig:.4f}, "
              f"p_corrected = {p_corr:.4f} {sig}")


demonstrate_multiple_testing()

Correction methods:

  • Bonferroni correction: Divide alpha by the number of tests. Simple and conservative. If you are testing 6 metrics at alpha = 0.05, each test uses alpha = 0.0083. This controls the family-wise error rate (FWER) --- the probability of any false positive.
  • Benjamini-Hochberg (FDR): Controls the false discovery rate --- the expected proportion of false positives among all significant results. Less conservative than Bonferroni, more appropriate when you are testing many metrics and some are expected to be truly significant.
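To see the mechanics, here is a hand-rolled sketch of the Benjamini-Hochberg step-up procedure next to Bonferroni, on a set of illustrative p-values (not drawn from the ShopSmart experiment):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up BH procedure: returns a boolean mask of rejected hypotheses."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * k / m for rank k
    passing = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passing.any():
        k = np.nonzero(passing)[0].max()           # largest rank that passes
        rejected[order[:k + 1]] = True             # reject it and all smaller p's
    return rejected

p_vals = [0.001, 0.010, 0.015, 0.035, 0.20, 0.60]  # illustrative only
n_bonferroni = sum(p < 0.05 / len(p_vals) for p in p_vals)
n_bh = int(benjamini_hochberg(p_vals).sum())
print(f"Bonferroni rejects {n_bonferroni} of 6; BH rejects {n_bh} of 6")
```

Bonferroni keeps only the strongest result here; BH also keeps 0.010 and 0.015 because they clear their rank-scaled thresholds. That extra power is what makes FDR control attractive when you track many metrics.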

Production Tip --- Designate one primary metric before the experiment starts. This metric is evaluated at the full alpha level. All other metrics are secondary and should be corrected for multiple testing or treated as exploratory. If you declare victory based on whichever metric looks best after the fact, you are not running an experiment. You are running a fishing expedition.

Pitfall 3: Simpson's Paradox

Simpson's paradox occurs when a trend that appears in several different groups reverses when the groups are combined. It is not rare. It is not a curiosity. It will happen to you.

import pandas as pd
import numpy as np

# ShopSmart experiment: Simpson's paradox example
# Treatment appears to lose overall, but wins in every segment

data = {
    'segment': ['Mobile', 'Mobile', 'Desktop', 'Desktop'],
    'variant': ['Control', 'Treatment', 'Control', 'Treatment'],
    'users': [600_000, 200_000, 250_000, 650_000],
    'total_revenue': [1_980_000, 700_000, 1_500_000, 4_225_000],
}

df = pd.DataFrame(data)
df['rpu'] = df['total_revenue'] / df['users']

print("Per-segment results:")
print(df[['segment', 'variant', 'users', 'rpu']].to_string(index=False))

# Overall
overall = df.groupby('variant').agg(
    total_users=('users', 'sum'),
    total_revenue=('total_revenue', 'sum')
).reset_index()
overall['rpu'] = overall['total_revenue'] / overall['total_users']

print("\nOverall results (aggregated):")
print(overall[['variant', 'total_users', 'rpu']].to_string(index=False))
Per-segment results:
 segment   variant   users  rpu
  Mobile   Control  600000  3.3
  Mobile Treatment  200000  3.5
 Desktop   Control  250000  6.0
 Desktop Treatment  650000  6.5

Overall results (aggregated):
   variant  total_users       rpu
   Control       850000  4.094118
 Treatment       850000  5.794118

In this constructed example, the treatment wins in both segments. But consider a scenario where the randomization was not properly balanced across segments --- perhaps more treatment users were assigned from the mobile segment (lower RPU). The overall treatment average would be pulled down by the segment mix, even though the treatment wins within each segment.

Common Mistake --- Trusting the overall result without checking segments. Always break down your experiment results by key segments (platform, geography, user tier, new vs. returning). If the treatment wins overall but loses in your highest-value segment, that is information you need before deciding to launch.

The defense against Simpson's paradox is stratified analysis. Compute the effect within each stratum and then combine using a weighted average or a mixed-effects model.
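A minimal sketch of that stratified combination, using the per-segment RPU figures from the example above and weighting each segment's within-segment difference by its share of users (a mixed-effects model is the heavier-duty alternative):

```python
# Per-segment RPUs from the example; 'users' is the segment total across arms.
segments = {
    'Mobile':  {'control_rpu': 3.3, 'treatment_rpu': 3.5, 'users': 800_000},
    'Desktop': {'control_rpu': 6.0, 'treatment_rpu': 6.5, 'users': 900_000},
}
total_users = sum(s['users'] for s in segments.values())

# Weighted average of within-segment treatment effects
stratified_effect = sum(
    (s['treatment_rpu'] - s['control_rpu']) * s['users'] / total_users
    for s in segments.values()
)
print(f"Stratified treatment effect: +${stratified_effect:.3f} RPU")
```

Unlike the naive pooled comparison, this estimate is immune to the segment-mix imbalance: it answers "what is the effect within a segment, averaged over segments," not "what happens when the arms have different populations."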

Pitfall 4: Novelty and Primacy Effects

When you launch a new feature, users may interact with it more (or less) simply because it is new. This is the novelty effect. The reverse --- established users initially engaging less because they prefer the familiar interface --- is the primacy effect, sometimes called change aversion.

Both distort short-term A/B test results. A new recommendation layout might get more clicks in week 1 because users are curious. By week 3, the novelty has worn off and engagement returns to baseline. If you measured only week 1, you would conclude the new layout works. It does not. It was just shiny.

Mitigation strategies:

  1. Run longer experiments. At least 2-3 weeks to let novelty effects decay.
  2. Analyze by cohort. Compare users who entered the experiment in week 1 vs. week 2 vs. week 3. If the treatment effect diminishes over time, novelty is likely at play.
  3. Restrict analysis to established users. Exclude users who saw the treatment for the first time in the analysis window. This is harder to implement but eliminates novelty bias.

import pandas as pd
import numpy as np

# Detecting novelty effect: effect size by week of exposure
weeks = [1, 2, 3, 4]
treatment_lift = [4.8, 3.1, 2.2, 2.0]  # % lift vs control

novelty_df = pd.DataFrame({
    'week': weeks,
    'lift_pct': treatment_lift
})

print("Treatment lift by week of exposure:")
print(novelty_df.to_string(index=False))
print(f"\nWeek 1 lift: {treatment_lift[0]}%")
print(f"Week 4 lift (stabilized): {treatment_lift[-1]}%")
print(f"Novelty inflation: {treatment_lift[0] - treatment_lift[-1]:.1f} "
      f"percentage points")
Treatment lift by week of exposure:
 week  lift_pct
    1       4.8
    2       3.1
    3       2.2
    4       2.0

Week 1 lift: 4.8%
Week 4 lift (stabilized): 2.0%
Novelty inflation: 2.8 percentage points

The "true" effect is closer to 2.0% once novelty decays. A one-week experiment would have overestimated the effect by 140%.


3.6 The Hard Conversation

This is the scenario no A/B testing tutorial prepares you for.

The experiment ran for 21 days. The analysis is complete. The result: p = 0.23. The new recommendation algorithm shows a 0.8% lift in revenue per user, but it is not statistically significant. The confidence interval spans from -0.5% to +2.1%. You cannot rule out zero effect.

And then this happens:

The PM walks into your office. "What are the results?"

"Not significant. We cannot conclude that RecV2 performs better than RecV1."

Pause.

"But the lift is positive, right? 0.8% is still positive."

"Yes, but the confidence interval includes zero. We cannot distinguish 0.8% from random noise."

Longer pause.

"Look, we already told the board we're launching the new algorithm next quarter. The engineering team has been working on the production integration for three weeks. We need this to work."

This is the moment that defines your credibility as a data scientist.

What You Should NOT Do

  • Do not cherry-pick a subgroup where the effect is significant. If you test 20 segments, one will be significant by chance. "It works for mobile users in the Pacific time zone" is not a finding. It is p-hacking.
  • Do not re-run the test with a one-sided alternative. Switching from two-sided to one-sided after seeing the data direction cuts your p-value in half. It is also scientific fraud.
  • Do not extend the test until it reaches significance. Deciding to keep collecting data because the result is not yet significant is just another form of peeking, and it inflates the false positive rate. If you genuinely need more power, pre-register a new, longer experiment with a fresh analysis date.
  • Do not report "directionally positive" as if it were a conclusion. Directionally positive and not significant means: we do not know.

What You SHOULD Do

  1. Present the results honestly. "The experiment did not reach statistical significance. We cannot confirm that RecV2 improves revenue. The observed 0.8% lift is within the range of random variation."

  2. Explain what the confidence interval means. "The true effect is plausibly anywhere between -0.5% and +2.1%. If the true effect is at the high end of that range, it is worth launching. If it is at the low end, it could be hurting revenue. We simply do not have enough evidence to tell."

  3. Offer actionable next steps:
     - Increase power. Run a longer experiment or increase traffic allocation. If the true effect is 0.8%, we need a much larger sample to detect it.
     - Reduce metric variance. Use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance by adjusting for pre-experiment behavior.
     - Test a bigger change. If RecV2 is only marginally different from RecV1, maybe the change was too small to detect. Can the ML team make a bolder version?
     - Accept the result. Sometimes "no significant difference" is the answer. The algorithm is not better. That is valuable information. It prevents the company from maintaining two recommendation systems for no benefit.

  4. Document everything. Write a formal experiment report. Include the hypothesis, the design, the results, and the recommendation. If the business chooses to launch anyway, the report protects you and creates institutional memory.
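To ground the "increase power" option in numbers, here is a back-of-envelope sketch using the hypothetical figures from the scenario (0.8% observed lift, 95% CI of [-0.5%, +2.1%]). The standard error shrinks as 1/sqrt(n), so the required sample grows with the square of the SE ratio:

```python
from scipy import stats

# The observed 95% CI around the 0.8 pp lift implies the current standard error
ci_low, ci_high = -0.5, 2.1
observed_se = (ci_high - ci_low) / (2 * 1.96)        # in percentage points

# SE needed to detect 0.8 pp at two-sided alpha = 0.05 with 80% power
z_alpha = stats.norm.ppf(0.975)
z_beta = stats.norm.ppf(0.80)
required_se = 0.8 / (z_alpha + z_beta)

# n scales with 1 / SE^2, so the sample must grow by (SE ratio)^2
sample_multiplier = (observed_se / required_se) ** 2
print(f"Observed SE: {observed_se:.3f} pp, required SE: {required_se:.3f} pp")
print(f"Need roughly {sample_multiplier:.1f}x the original sample size")
```

Roughly five times the traffic or duration --- useful context when choosing between rerunning the test and accepting the result.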

Production Tip --- Establish launch criteria before the experiment starts. For example: "We will launch if the primary metric shows a statistically significant positive effect at p < 0.05 and the lower bound of the 95% confidence interval is above -0.5%." If the criteria are agreed upon in advance, the post-hoc conversation is much easier.
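One way to make pre-agreed criteria stick is to encode them as an explicit check before the experiment starts. A hedged sketch with the example thresholds from the tip above (the function name and thresholds are illustrative):

```python
def meets_launch_criteria(p_value: float, effect_pct: float, ci_low_pct: float,
                          alpha: float = 0.05, ci_floor_pct: float = -0.5) -> bool:
    """Pre-registered launch rule: significant, positive, downside bounded."""
    return p_value < alpha and effect_pct > 0 and ci_low_pct >= ci_floor_pct

# The RecV2 result from the scenario: p = 0.23, +0.8% lift, CI [-0.5%, +2.1%]
print(meets_launch_criteria(0.23, 0.8, -0.5))   # False: not significant
# A hypothetical clear win
print(meets_launch_criteria(0.01, 1.2, 0.3))    # True
```

The point is not the three lines of logic; it is that the decision rule exists in writing (and in code review history) before anyone has seen the results.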


3.7 Variance Reduction with CUPED

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique used by Netflix, Microsoft, and most mature experimentation platforms. The idea: if you know a user's pre-experiment behavior, you can use it to reduce the noise in the experiment measurement.

import numpy as np
from scipy import stats

def cuped_analysis(pre_control: np.ndarray, post_control: np.ndarray,
                    pre_treatment: np.ndarray, post_treatment: np.ndarray) -> dict:
    """
    Apply CUPED variance reduction to an A/B test.

    CUPED adjusts the post-experiment metric using pre-experiment data
    to reduce variance and increase statistical power.

    Parameters
    ----------
    pre_control : np.ndarray
        Pre-experiment metric values for control users.
    post_control : np.ndarray
        Post-experiment metric values for control users.
    pre_treatment : np.ndarray
        Pre-experiment metric values for treatment users.
    post_treatment : np.ndarray
        Post-experiment metric values for treatment users.

    Returns
    -------
    dict with standard and CUPED-adjusted results.
    """
    # Standard analysis (no adjustment)
    diff_standard = np.mean(post_treatment) - np.mean(post_control)
    se_standard = np.sqrt(
        np.var(post_treatment, ddof=1) / len(post_treatment) +
        np.var(post_control, ddof=1) / len(post_control)
    )

    # CUPED adjustment
    # theta = Cov(Y_post, Y_pre) / Var(Y_pre)
    all_pre = np.concatenate([pre_control, pre_treatment])
    all_post = np.concatenate([post_control, post_treatment])
    theta = np.cov(all_post, all_pre)[0, 1] / np.var(all_pre, ddof=1)

    # Adjusted metrics: Y_adjusted = Y_post - theta * (Y_pre - mean(Y_pre))
    mean_pre = np.mean(all_pre)
    adj_control = post_control - theta * (pre_control - mean_pre)
    adj_treatment = post_treatment - theta * (pre_treatment - mean_pre)

    diff_cuped = np.mean(adj_treatment) - np.mean(adj_control)
    se_cuped = np.sqrt(
        np.var(adj_treatment, ddof=1) / len(adj_treatment) +
        np.var(adj_control, ddof=1) / len(adj_control)
    )

    variance_reduction = 1 - (se_cuped / se_standard) ** 2

    return {
        'standard_diff': round(diff_standard, 4),
        'standard_se': round(se_standard, 4),
        'standard_p': round(2 * stats.norm.sf(abs(diff_standard / se_standard)), 6),
        'cuped_diff': round(diff_cuped, 4),
        'cuped_se': round(se_cuped, 4),
        'cuped_p': round(2 * stats.norm.sf(abs(diff_cuped / se_cuped)), 6),
        'variance_reduction': f"{variance_reduction:.1%}",
        'theta': round(theta, 4),
    }


# Simulate ShopSmart data with pre/post correlation
rng = np.random.default_rng(seed=42)
n = 200_000

# Pre-experiment RPU (4 weeks before)
pre_c = rng.lognormal(1.0, 1.2, n)
pre_t = rng.lognormal(1.0, 1.2, n)

# Post-experiment RPU with correlation to pre and a true 2% lift for treatment
noise_c = rng.normal(0, 5.0, n)
noise_t = rng.normal(0, 5.0, n)
post_c = 0.6 * pre_c + noise_c + 2.0
post_t = (0.6 * pre_t + noise_t + 2.0) * 1.02  # 2% true lift

cuped_result = cuped_analysis(pre_c, post_c, pre_t, post_t)

print("Standard analysis:")
print(f"  Difference: ${cuped_result['standard_diff']}")
print(f"  Std Error:  ${cuped_result['standard_se']}")
print(f"  p-value:    {cuped_result['standard_p']}")

print("\nCUPED-adjusted analysis:")
print(f"  Difference: ${cuped_result['cuped_diff']}")
print(f"  Std Error:  ${cuped_result['cuped_se']}")
print(f"  p-value:    {cuped_result['cuped_p']}")

print(f"\nVariance reduction: {cuped_result['variance_reduction']}")
Standard analysis:
  Difference: $0.1143
  Std Error:  $0.0244
  p-value:    0.000003

CUPED-adjusted analysis:
  Difference: $0.1138
  Std Error:  $0.0198
  p-value:    0.000000

Variance reduction: 34.2%

CUPED reduced the variance by roughly a third (the standard error fell from $0.0244 to $0.0198), which means you can detect the same effect with fewer users or detect smaller effects with the same users. In practice, CUPED typically reduces variance by 20-50%, depending on how correlated pre- and post-experiment behavior are.

Try It --- Modify the simulation above to increase the correlation between pre and post data (change the coefficient from 0.6 to 0.9). How does the variance reduction change? Now decrease it to 0.2. What happens? This will build your intuition for when CUPED helps most.


3.8 Practical Significance vs. Statistical Significance

A result can be statistically significant but practically meaningless. And a result can be practically meaningful but statistically non-significant (because you lacked power to detect it).

This two-by-two matrix is essential:

  • Statistically significant and practically significant: Ship it. The effect is real and large enough to matter.
  • Not statistically significant but practically significant: Inconclusive. You may lack power. Consider running longer or using CUPED.
  • Statistically significant but not practically significant: Do not ship. The effect is real but too small to justify the cost.
  • Neither statistically nor practically significant: Do nothing. No evidence of a meaningful effect.

The minimum detectable effect (MDE) you set during power analysis is your definition of practical significance. If you said "we need at least a 2% lift," and the experiment shows a statistically significant 0.3% lift, that is a successful experiment --- it successfully told you the effect is too small to bother with.


3.9 The Experimentation Maturity Model

Not every A/B test follows the textbook pattern. Organizations mature in their experimentation practices over time:

Level 1: Ad Hoc. Experiments are run occasionally, without a platform or consistent methodology. Sample sizes are gut-feel. Results are analyzed in notebooks. No one checks for peeking or multiple testing.

Level 2: Standardized. A shared experimentation platform exists. Power analysis is required before launching. Primary metrics are pre-registered. Results are reported with confidence intervals.

Level 3: Automated. Sequential testing enables continuous monitoring without peeking penalties. CUPED or similar variance reduction is applied automatically. Guardrail metrics are monitored in real-time with automated alerts.

Level 4: Cultural. Every product change is tested. "We believe" is replaced with "the experiment showed." Negative results are valued because they prevent bad launches. The experimentation platform is a core part of the decision-making infrastructure.

ShopSmart, after the RecV2 debacle, invested in moving from Level 1 to Level 2. The recommendation algorithm experiment described in this chapter was their first properly designed test.


3.10 Progressive Project: Design the StreamFlow Retention A/B Test

Throughout Part I, we have been building toward a churn prediction system for StreamFlow ($180M ARR, 2.4 million subscribers, 8.2% monthly churn). In Chapter 1, we framed the problem. In Chapter 2, we designed the ML workflow. Now we need to design the experiment that will prove whether the model's retention offers actually reduce churn.

This is the bridge between building a model and proving it works.

Your Assignment

Design (on paper, then in code) the A/B test that will validate StreamFlow's retention intervention. Your design document should include:

1. Hypothesis Statement

Write the null and alternative hypotheses. Be precise about the metric.

Suggested framing:

  • H0: Sending a 20% discount offer to subscribers identified as high-risk by the churn model does not reduce 30-day churn rate compared to no offer.
  • H1: Sending a 20% discount offer to high-risk subscribers changes the 30-day churn rate.

2. Primary Metric and Guardrail Metrics

  • Primary: 30-day churn rate among high-risk subscribers.
  • Guardrails: Revenue per subscriber (the discount costs money), support ticket volume, downstream renewal rate.

3. Randomization Design

  • Who is eligible? Only subscribers flagged as high-risk (churn probability > 0.7) by the model.
  • How do you randomize? By subscriber ID, 50/50 split.
  • What does control receive? No offer (standard experience).
  • What does treatment receive? 20% discount offer for 3 months, delivered via email and in-app notification.
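"Randomize by subscriber ID" in practice usually means deterministic hash-based bucketing, so a subscriber lands in the same arm on every session and device. A minimal sketch (the salt string is an illustrative assumption):

```python
import hashlib

def assign_variant(subscriber_id: str,
                   salt: str = "streamflow_retention_v1") -> str:
    """Stable 50/50 assignment: hash the salted ID into one of 100 buckets."""
    digest = hashlib.sha256(f"{salt}:{subscriber_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < 50 else "control"

# Assignment is deterministic and splits roughly evenly across IDs
assignments = [assign_variant(str(i)) for i in range(10_000)]
share = assignments.count("treatment") / len(assignments)
print(f"Treatment share over 10,000 IDs: {share:.1%}")  # ~50%
```

Salting by experiment name matters: it decorrelates this experiment's split from every other experiment's split, so subscribers are not systematically co-assigned across tests.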

4. Sample Size Calculation

import numpy as np
from statsmodels.stats.power import NormalIndPower

# StreamFlow retention experiment parameters
# Among high-risk subscribers (prob > 0.7), the baseline churn rate is ~40%
# (these are the highest-risk segment)
baseline_churn = 0.40

# We want to detect at least a 5 percentage point reduction (40% -> 35%)
mde_absolute = 0.05

# Standard deviation of a binary outcome: sqrt(p * (1-p))
std_dev = np.sqrt(baseline_churn * (1 - baseline_churn))

effect_size = mde_absolute / std_dev

analysis = NormalIndPower()
n_per_group = int(np.ceil(
    analysis.solve_power(
        effect_size=effect_size,
        alpha=0.05,
        power=0.80,
        alternative='two-sided'
    )
))

print(f"Baseline churn rate: {baseline_churn:.0%}")
print(f"MDE: {mde_absolute:.0%} absolute ({mde_absolute/baseline_churn:.1%} relative)")
print(f"Effect size (Cohen's d): {effect_size:.4f}")
print(f"Required per group: {n_per_group:,}")
print(f"Total required: {n_per_group * 2:,}")

# How many high-risk subscribers does StreamFlow flag per month?
monthly_high_risk = int(2_400_000 * 0.082 * 0.30)  # ~30% of predicted churners
# are above 0.7 threshold
print(f"\nEstimated high-risk subscribers per month: {monthly_high_risk:,}")
print(f"Weeks needed: {np.ceil(n_per_group * 2 / (monthly_high_risk / 4)):.0f}")
Baseline churn rate: 40%
MDE: 5% absolute (12.5% relative)
Effect size (Cohen's d): 0.1021
Required per group: 1,503
Total required: 3,006

Estimated high-risk subscribers per month: 59,040
Weeks needed: 1

5. Duration and Timing

Even though the sample size can be reached in one week, the experiment must run for at least 30 days --- because the primary metric (30-day churn rate) requires 30 days to observe. You cannot measure 30-day churn in 7 days.

Additionally, add at least one extra week of enrollment to ensure sufficient sample in both arms, and account for the 30-day observation window. Total experiment timeline: approximately 5-6 weeks from start to final analysis.

6. Analysis Plan

Pre-register: the primary metric, the statistical test (chi-squared test or two-proportion z-test for the binary churn outcome), the significance level (0.05), and the analysis date. No peeking.
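When the analysis date arrives, the pre-registered test itself is a few lines. A sketch with hypothetical counts (the churn numbers below are placeholders, not results):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical final counts: churners out of 1,503 enrolled per arm
churned = np.array([590, 640])        # treatment, control
enrolled = np.array([1_503, 1_503])

# Two-proportion z-test on the binary 30-day churn outcome
z_stat, p_value = proportions_ztest(churned, enrolled)
print(f"Churn: treatment {churned[0]/enrolled[0]:.1%}, "
      f"control {churned[1]/enrolled[1]:.1%}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

With these placeholder counts the 3.3-point churn difference is suggestive but not significant at 0.05 --- exactly the kind of result the pre-registered analysis plan tells you how to report.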

Try It --- Complete this design document with actual numbers. Consider: what happens if the model's high-risk threshold is wrong? What if the discount offer cannibalizes revenue from subscribers who would have stayed anyway? How would you design the experiment to detect these risks? Write out the full design in your notebook, then compare with a classmate or colleague.


3.11 Summary

Experimental design is not a side skill for data scientists. It is a core competency. A model that cannot be tested in a controlled experiment is a model that cannot be proven to work. And a model that cannot be proven to work will eventually be blamed for something it did not cause --- or credited for something it did not achieve.

The key workflow:

  1. Frame the hypothesis. Null vs. alternative, two-sided by default.
  2. Choose the metric. One primary, several guardrails. Pre-register.
  3. Calculate the sample size. Power analysis with realistic parameters.
  4. Run the A/A test. Validate randomization and infrastructure.
  5. Run the experiment. Full duration, no peeking.
  6. Analyze with rigor. Confidence intervals, not just p-values. Check for multiple testing. Look for Simpson's paradox. Check for novelty effects.
  7. Report honestly. Even when the answer is not what the stakeholder wanted.

The ShopSmart story that opened this chapter ended badly because the team skipped all seven steps. They launched without a controlled experiment, could not attribute the revenue change to their algorithm, and eventually rolled it back without ever knowing whether it worked.

Do not be that team. Design the experiment.


Next chapter: Chapter 4 --- The Math Behind ML, where we build the mathematical intuition for the algorithms you will use in Parts III through V.