In This Chapter
- The Foundation of Data-Driven Decisions
- The Model That Worked But Nobody Could Prove It
- 3.1 Why A/B Testing Matters for Data Scientists
- 3.2 The Anatomy of a Proper A/B Test
- 3.3 Running the Experiment
- 3.4 Analyzing the Results
- 3.5 The Pitfalls: Where A/B Tests Go Wrong
- 3.6 The Hard Conversation
- 3.7 Variance Reduction with CUPED
- 3.8 Practical Significance vs. Statistical Significance
- 3.9 The Experimentation Maturity Model
- 3.10 Progressive Project: Design the StreamFlow Retention A/B Test
- 3.11 Summary
Chapter 3: Experimental Design and A/B Testing
The Foundation of Data-Driven Decisions
Learning Objectives
By the end of this chapter, you will be able to:
- Design a proper A/B test with hypothesis, control, treatment, and success metric
- Calculate required sample size using power analysis
- Identify and avoid common A/B testing pitfalls (peeking, multiple testing, Simpson's paradox)
- Interpret A/B test results with statistical rigor
- Handle the case where the test says "no difference" but stakeholders want a launch
The Model That Worked But Nobody Could Prove It
War Story --- ShopSmart, a mid-size e-commerce marketplace with 14 million monthly users, spent four months building a new recommendation algorithm. The offline metrics looked great: a 12% improvement in mean reciprocal rank on the holdout set, a 9% lift in simulated click-through rate, and improved coverage across long-tail products. The ML team was confident. The VP of Product was excited. They launched the new algorithm to all 14 million users on a Tuesday.
Revenue went up 3.2% that week. Celebration. Bonuses. A company-wide email from the CEO crediting the "AI team."
Except: there was a site-wide promotional campaign that same week. And a competitor went down for 18 hours on Wednesday. And it was the first week of a new month, which historically shows higher engagement. Was the revenue lift caused by the new algorithm? By the promotion? By the competitor outage? By the calendar? Nobody could say. The team had spent four months building a model and zero days designing an experiment to prove it worked.
Six months later, the algorithm was quietly rolled back after a sustained performance decline that nobody could attribute to anything specific --- because there had never been a controlled experiment to establish the baseline.
This chapter is about the experiment. Specifically, it is about A/B testing --- the randomized controlled experiment that is the only reliable way to establish that your model, your feature, your intervention caused the outcome you observed.
Most ML textbooks skip experimentation entirely. They teach you to build models, evaluate them on holdout sets, and stop there. But a model that performs well offline is a hypothesis, not a conclusion. The conclusion comes from a properly designed experiment that isolates your model's effect from the dozens of confounding factors that plague real-world systems.
A data scientist who cannot design experiments is a data scientist who cannot prove their model works.
3.1 Why A/B Testing Matters for Data Scientists
If you have taken an introductory statistics course, you have encountered hypothesis testing. You know about null hypotheses, p-values, and Type I errors. You may be wondering why we are spending an entire chapter on something you already know.
Here is why: the gap between understanding hypothesis testing in a statistics class and running a reliable A/B test in production is enormous. Statistics class gives you the math. Production gives you stakeholders who peek at results on day 2, engineers who accidentally break the randomization, product managers who add three more variants "since we're already testing," and a VP who announces the results to the board before the test reaches significance.
A/B testing in production is not a statistics problem. It is a systems problem that uses statistics.
What an A/B Test Actually Is
An A/B test is a randomized controlled experiment. You split your users (or sessions, or pageviews) into two groups:
- Control (A): The existing experience. The status quo.
- Treatment (B): The new experience. The thing you want to test.
Users are assigned randomly. You measure an outcome metric for both groups. You use statistical inference to determine whether the difference you observe is real or noise.
That is it. The concept is simple. The execution is where organizations stumble.
The Three Questions Every A/B Test Answers
- Is the effect real? --- Statistical significance. Is the observed difference larger than what we would expect from random chance?
- How big is the effect? --- Effect size. Even if the effect is real, is it large enough to matter?
- Are we confident in the direction? --- Confidence interval. What is the plausible range of the true effect?
Notice that "is the effect positive?" is not on this list. An A/B test can tell you the new algorithm hurts performance. That is not a failure of the experiment. That is the experiment doing its job.
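In code, the three questions reduce to three numbers computed from the same pair of samples. A minimal sketch with synthetic data (the group sizes, means, and seed here are illustrative, not ShopSmart figures):

```python
import numpy as np
from scipy import stats

# Synthetic data: treatment has a small true effect (+0.2 on a mean of 10)
rng = np.random.default_rng(seed=0)
control = rng.normal(loc=10.0, scale=2.0, size=20_000)
treatment = rng.normal(loc=10.2, scale=2.0, size=20_000)

# 1. Is the effect real?  (statistical significance)
t_stat, p_value = stats.ttest_ind(treatment, control)

# 2. How big is the effect?  (effect size)
diff = treatment.mean() - control.mean()

# 3. What range of true effects is plausible?  (confidence interval)
se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
             control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"p-value: {p_value:.4f}, difference: {diff:.3f}, "
      f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

The rest of the chapter is about getting each of these three numbers honestly.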
3.2 The Anatomy of a Proper A/B Test
Let us design a real experiment. ShopSmart has built a new recommendation algorithm --- call it RecV2 --- and wants to know whether it increases revenue per user compared to the existing algorithm (RecV1).
Step 1: Define the Hypothesis
Every experiment starts with two hypotheses:
- Null hypothesis (H0): RecV2 has no effect on revenue per user compared to RecV1. Any observed difference is due to random variation.
- Alternative hypothesis (H1): RecV2 changes revenue per user compared to RecV1. The observed difference is not due to random variation.
Notice that the alternative hypothesis is two-sided: it says RecV2 changes revenue, not that it increases revenue. You should almost always use a two-sided test. If you use a one-sided test, you are assuming the treatment cannot make things worse --- and in practice, it often can.
Common Mistake --- Using a one-sided test because "we only care if it's better." A one-sided test has more statistical power in one direction, but it blinds you to harm in the other direction. If your new algorithm actively hurts revenue, a one-sided test for improvement will not detect it. Use two-sided tests unless you have a strong, pre-specified reason not to.
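To see the blindness concretely, here is a small simulation (synthetic numbers, not ShopSmart data) in which the treatment genuinely hurts the metric. The one-sided test for improvement reports nothing unusual; the two-sided test raises the alarm:

```python
import numpy as np
from scipy import stats

# Simulate a treatment that HURTS the metric by 0.1 (true harm)
rng = np.random.default_rng(seed=1)
control = rng.normal(loc=5.0, scale=2.0, size=80_000)
treatment = rng.normal(loc=4.9, scale=2.0, size=80_000)

# One-sided: "is treatment GREATER than control?"
_, p_one_sided = stats.ttest_ind(treatment, control, alternative='greater')

# Two-sided: "is treatment DIFFERENT from control?"
_, p_two_sided = stats.ttest_ind(treatment, control)

print(f"one-sided p (improvement): {p_one_sided:.4f}")   # near 1.0: no alarm
print(f"two-sided p (any change):  {p_two_sided:.2e}")   # tiny: harm detected
```

The one-sided test is not wrong mathematically; it is answering a question you should not have asked.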
Step 2: Choose the Primary Metric
The primary metric is the single number that will determine whether the experiment succeeds. Choosing this metric well is an act of problem framing, and everything we discussed in Chapter 1 applies here.
For ShopSmart, the candidates are:
| Metric | Pros | Cons |
|---|---|---|
| Click-through rate (CTR) | Easy to measure, high volume | Users can click without buying |
| Conversion rate | Closer to revenue | Does not capture order value |
| Revenue per user (RPU) | Directly tied to business value | Higher variance, needs more samples |
| Revenue per session | Granular | Sessions are noisy |
We choose revenue per user over the experimental period as the primary metric. It is the metric the business actually cares about. Yes, it has higher variance than click-through rate, which means we need more samples. That is a cost we pay for measuring what matters.
Production Tip --- Never let "easy to measure" override "measures the right thing." CTR is seductive because it reaches significance fast. But a recommendation algorithm that increases clicks on low-margin products while decreasing purchases of high-margin products will show improved CTR and decreased revenue. You will celebrate a metric while the business loses money.
Step 3: Define Guardrail Metrics
The primary metric tells you whether the experiment wins. Guardrail metrics tell you whether the experiment is safe. These are metrics that must not degrade meaningfully, even if the primary metric improves.
For the ShopSmart experiment:
- Page load time. If RecV2 is slower, it could hurt user experience regardless of recommendation quality.
- Return rate. If RecV2 increases revenue by recommending products that get returned, the net effect is negative.
- Search usage. If users cannot find what they want through recommendations and fall back to search, the algorithm is failing.
- Customer support tickets. A sudden spike in complaints signals something is wrong.
Guardrail metrics are not optimized. They are monitored. If any guardrail metric moves significantly in the wrong direction, the experiment is paused for investigation --- even if the primary metric looks great.
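One way to operationalize this is a small checker that tests each guardrail only in its harmful direction and flags significant movement. The metric names and data below are hypothetical, and note the design choice: a one-sided test is appropriate for guardrails, because a guardrail only triggers action on harm.

```python
import numpy as np
from scipy import stats

def check_guardrails(guardrails: dict, alpha: float = 0.05) -> list:
    """Flag guardrails that moved significantly in the harmful direction.

    guardrails maps name -> (control_values, treatment_values, bad_direction),
    where bad_direction is 'increase' or 'decrease'.
    """
    flagged = []
    for name, (control, treatment, bad_direction) in guardrails.items():
        side = 'greater' if bad_direction == 'increase' else 'less'
        # One-sided on purpose: we only act on harm, not on improvement.
        _, p = stats.ttest_ind(treatment, control, alternative=side)
        if p < alpha:
            flagged.append(name)
    return flagged

# Hypothetical data: a real 30ms page-load regression, a stable return rate
rng = np.random.default_rng(seed=7)
guardrails = {
    'page_load_ms': (rng.normal(820, 150, 50_000),   # control
                     rng.normal(850, 150, 50_000),   # treatment: slower
                     'increase'),
    'return_rate': (rng.binomial(1, 0.08, 50_000).astype(float),
                    rng.binomial(1, 0.08, 50_000).astype(float),
                    'increase'),
}
flagged = check_guardrails(guardrails)
print("flagged guardrails:", flagged)
```

In practice the flag would page the experiment owner and pause the rollout, not just print a list.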
Step 4: Define the Randomization Unit
What gets randomly assigned? This seems obvious --- users --- but it is a real design decision.
| Randomization Unit | When to Use | Risk |
|---|---|---|
| User | Most experiments | Users with multiple devices may see inconsistent experiences |
| Session | When user identity is unreliable | Same user can be in both groups |
| Pageview | Rarely; for layout experiments | Extreme inconsistency for individual users |
| Cookie | When logins are rare | Cookie deletion resets assignment |
| Device | Multi-device products | Same user, different groups |
For ShopSmart, we randomize by user ID. Every logged-in user is assigned to either control or treatment, and they stay in that group for the entire experiment. This ensures a consistent experience and prevents cross-contamination.
Common Mistake --- Randomizing by session instead of user for long-running experiments. If a user has 15 sessions during the experiment and 8 land in the treatment group while 7 land in control, you have contaminated both groups. The user's behavior in "control" sessions is influenced by their "treatment" sessions. Randomize by user unless you have a specific reason not to.
Step 5: Calculate the Required Sample Size
This is where most teams either skip ahead or get the math wrong. Sample size calculation answers a critical question: How many users do we need, and how long do we need to run the experiment, to detect the effect we care about?
To calculate sample size, you need four inputs:
- Baseline metric value. Current average revenue per user: $4.82 per week.
- Minimum detectable effect (MDE). The smallest improvement worth detecting. ShopSmart decides that anything less than a 2% lift ($0.096) is not worth the engineering effort to maintain RecV2. So MDE = 2%.
- Statistical significance level (alpha). The probability of a false positive --- detecting an effect that does not exist. Convention: alpha = 0.05.
- Statistical power (1 - beta). The probability of detecting a real effect if it exists. Convention: power = 0.80. Many mature experimentation platforms use 0.90.
We also need the standard deviation of the metric. Revenue per user at ShopSmart has a standard deviation of $8.14 (revenue data is typically right-skewed with high variance).
import numpy as np
from statsmodels.stats.power import NormalIndPower
# ShopSmart experiment parameters
baseline_rpu = 4.82 # $ per user per week
mde_relative = 0.02 # 2% minimum detectable effect
mde_absolute = baseline_rpu * mde_relative # $0.0964
std_dev = 8.14 # standard deviation of revenue per user
# Cohen's d: effect size in standard deviation units
effect_size = mde_absolute / std_dev
print(f"Effect size (Cohen's d): {effect_size:.4f}")
# Power analysis
power_analysis = NormalIndPower()
sample_size_per_group = power_analysis.solve_power(
effect_size=effect_size,
alpha=0.05,
power=0.80,
alternative='two-sided'
)
print(f"Required sample size per group: {int(np.ceil(sample_size_per_group)):,}")
print(f"Total sample size (both groups): {int(np.ceil(sample_size_per_group)) * 2:,}")
Effect size (Cohen's d): 0.0118
Required sample size per group: 111,927
Total sample size (both groups): 223,854
ShopSmart has 14 million monthly users, which means roughly 3.5 million weekly active users. With a 50/50 split, each group gets 1.75 million users per week. We need about 112,000 per group, which means the experiment needs to run for at least one week to reach statistical power.
But that is the minimum. In practice, we should run for at least two full weeks to account for day-of-week effects and to have margin for users who do not visit during the experimental period.
# How long do we need to run?
weekly_active_users = 3_500_000
users_per_group_per_week = weekly_active_users // 2
min_weeks = np.ceil(sample_size_per_group / users_per_group_per_week)
recommended_weeks = max(min_weeks + 1, 2) # add buffer, minimum 2 weeks
print(f"Users per group per week: {users_per_group_per_week:,}")
print(f"Minimum weeks needed: {int(min_weeks)}")
print(f"Recommended duration: {int(recommended_weeks)} weeks")
Users per group per week: 1,750,000
Minimum weeks needed: 1
Recommended duration: 2 weeks
Production Tip --- Always round up your experiment duration to full weeks. E-commerce behavior varies dramatically by day of week. Running from Monday to Thursday will systematically miss weekend shoppers, whose behavior is materially different. A "14-day experiment" is not the same as a "Monday-to-Sunday twice" experiment.
Step 6: The Power Analysis Cheat Sheet
Let us build a reusable function for power analysis, because you will do this constantly.
import numpy as np
from statsmodels.stats.power import NormalIndPower
def calculate_sample_size(
baseline_value: float,
mde_relative: float,
std_dev: float,
alpha: float = 0.05,
power: float = 0.80,
alternative: str = 'two-sided'
) -> dict:
"""
Calculate required sample size for an A/B test.
Parameters
----------
baseline_value : float
Current value of the metric (e.g., revenue per user).
mde_relative : float
Minimum detectable effect as a relative change (e.g., 0.02 for 2%).
std_dev : float
Standard deviation of the metric.
alpha : float
Significance level (probability of Type I error).
power : float
Statistical power (1 - probability of Type II error).
alternative : str
'two-sided' or 'larger' or 'smaller'.
Returns
-------
dict with sample size per group, total, effect size, and MDE absolute.
"""
mde_absolute = baseline_value * mde_relative
effect_size = mde_absolute / std_dev
analysis = NormalIndPower()
n_per_group = int(np.ceil(
analysis.solve_power(
effect_size=effect_size,
alpha=alpha,
power=power,
alternative=alternative
)
))
return {
'n_per_group': n_per_group,
'n_total': n_per_group * 2,
'effect_size_cohens_d': round(effect_size, 4),
'mde_absolute': round(mde_absolute, 4),
'mde_relative': mde_relative,
'alpha': alpha,
'power': power,
}
# ShopSmart example
result = calculate_sample_size(
baseline_value=4.82,
mde_relative=0.02,
std_dev=8.14,
alpha=0.05,
power=0.80
)
for key, value in result.items():
print(f"{key}: {value:,}" if isinstance(value, int) else f"{key}: {value}")
n_per_group: 111,927
n_total: 223,854
effect_size_cohens_d: 0.0118
mde_absolute: 0.0964
mde_relative: 0.02
alpha: 0.05
power: 0.8
Try It --- Modify the power analysis above for a higher bar: alpha = 0.01 and power = 0.90. How does the required sample size change? Then try MDE = 5% instead of 2%. What happens? The relationship between these parameters is one of the most important intuitions in experimental design.
The Power Curve
It helps to visualize how sample size changes with MDE, because this visualization is what you will show stakeholders when they ask "can we detect a 0.5% improvement?"
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.stats.power import NormalIndPower
analysis = NormalIndPower()
mde_range = np.arange(0.005, 0.101, 0.005) # 0.5% to 10%
baseline = 4.82
std_dev = 8.14
sample_sizes_80 = []
sample_sizes_90 = []
for mde in mde_range:
effect_size = (baseline * mde) / std_dev
n_80 = analysis.solve_power(effect_size=effect_size, alpha=0.05,
power=0.80, alternative='two-sided')
n_90 = analysis.solve_power(effect_size=effect_size, alpha=0.05,
power=0.90, alternative='two-sided')
sample_sizes_80.append(n_80)
sample_sizes_90.append(n_90)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(mde_range * 100, np.array(sample_sizes_80) / 1e6, 'b-o',
label='Power = 0.80', markersize=4)
ax.plot(mde_range * 100, np.array(sample_sizes_90) / 1e6, 'r-s',
label='Power = 0.90', markersize=4)
ax.set_xlabel('Minimum Detectable Effect (%)', fontsize=12)
ax.set_ylabel('Sample Size per Group (millions)', fontsize=12)
ax.set_title('Required Sample Size vs. Minimum Detectable Effect\n'
'ShopSmart Revenue per User (baseline=$4.82, SD=$8.14)',
fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 10.5)
plt.tight_layout()
plt.savefig('power_curve.png', dpi=150)
plt.show()
This curve is your negotiation tool. When a PM says "I want to detect a 0.5% improvement," you can show them the chart and say: "That requires roughly 1.8 million users per group at 80% power --- more than a full week of every user we have, with no margin for error. Realistically that is a two-to-three-week experiment. Are you willing to wait that long?" The answer is usually "what if we accept 2%?" and now you are having the right conversation.
3.3 Running the Experiment
Implementing Randomization
Randomization is the backbone of the experiment. If the randomization is flawed, everything downstream is invalid.
import hashlib
def assign_variant(user_id: str, experiment_name: str, salt: str = '') -> str:
"""
Deterministic assignment of a user to a variant using hashing.
This ensures:
- Same user always gets the same variant (consistency)
- Different experiments assign independently (salt)
- Assignment is uniformly distributed (hash properties)
Parameters
----------
user_id : str
Unique user identifier.
experiment_name : str
Name of the experiment (acts as namespace).
salt : str
Additional salt for independence between experiments.
Returns
-------
'control' or 'treatment'
"""
hash_input = f"{experiment_name}.{salt}.{user_id}"
hash_value = hashlib.md5(hash_input.encode()).hexdigest()
bucket = int(hash_value[:8], 16) % 100 # 0-99
if bucket < 50:
return 'control'
else:
return 'treatment'
# Example usage
for uid in ['user_10042', 'user_55813', 'user_99201', 'user_33107']:
variant = assign_variant(uid, experiment_name='rec_v2_test', salt='2025q1')
print(f"{uid} -> {variant}")
user_10042 -> treatment
user_55813 -> control
user_99201 -> control
user_33107 -> treatment
Production Tip --- Never use random.random() for experiment assignment. It is not deterministic --- the same user would get different assignments on different requests. Hash-based assignment guarantees consistency. Most production experimentation platforms (Optimizely, LaunchDarkly, Statsig, Eppo) use this approach internally.
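A quick sanity check on the hashing scheme: assignment should be deterministic for a given user, and the split should stay close to 50/50 even for fully sequential user IDs. A sketch, reusing the same hash-based function:

```python
import hashlib

def assign_variant(user_id: str, experiment_name: str, salt: str = '') -> str:
    # Same hash-based scheme as the function above
    hash_input = f"{experiment_name}.{salt}.{user_id}"
    bucket = int(hashlib.md5(hash_input.encode()).hexdigest()[:8], 16) % 100
    return 'control' if bucket < 50 else 'treatment'

# Determinism: repeated calls never change a user's group
repeats = {assign_variant('user_10042', 'rec_v2_test') for _ in range(1000)}
print(f"distinct assignments for one user: {len(repeats)}")  # expect 1

# Uniformity: even fully sequential IDs split close to 50/50
n = 100_000
treat_share = sum(
    assign_variant(f'user_{i}', 'rec_v2_test') == 'treatment'
    for i in range(n)
) / n
print(f"treatment share over {n:,} sequential IDs: {treat_share:.3f}")
```

Checks like these belong in the automated tests of your assignment service, not in a notebook you run once.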
Validating Randomization: The A/A Test
Before running the real experiment, run an A/A test. This is an experiment where both groups receive the same experience. Its purpose: verify that the randomization is working and the measurement infrastructure is correct.
If an A/A test shows a significant difference, something is broken --- and it is broken in a way that would invalidate any subsequent A/B test.
import numpy as np
from scipy import stats
def run_aa_test(n_simulations: int = 1000, n_per_group: int = 50000,
true_mean: float = 4.82, true_std: float = 8.14,
alpha: float = 0.05) -> dict:
"""
Simulate A/A tests to verify that the false positive rate
matches the significance level.
Parameters
----------
n_simulations : int
Number of A/A tests to simulate.
n_per_group : int
Users per group in each simulated test.
true_mean : float
True mean of the metric (same for both groups).
true_std : float
True standard deviation.
alpha : float
Significance threshold.
Returns
-------
dict with false positive rate and expected rate.
"""
rng = np.random.default_rng(seed=42)
significant_count = 0
for _ in range(n_simulations):
group_a = rng.normal(true_mean, true_std, n_per_group)
group_b = rng.normal(true_mean, true_std, n_per_group)
_, p_value = stats.ttest_ind(group_a, group_b)
if p_value < alpha:
significant_count += 1
false_positive_rate = significant_count / n_simulations
return {
'n_simulations': n_simulations,
'false_positive_rate': round(false_positive_rate, 3),
'expected_rate': alpha,
'within_tolerance': abs(false_positive_rate - alpha) < 0.02
}
aa_result = run_aa_test()
print(f"False positive rate: {aa_result['false_positive_rate']:.1%}")
print(f"Expected rate: {aa_result['expected_rate']:.1%}")
print(f"Within tolerance: {aa_result['within_tolerance']}")
False positive rate: 4.7%
Expected rate: 5.0%
Within tolerance: True
The false positive rate should be close to alpha (5%). If it is significantly higher --- say, 12% or 15% --- your randomization or measurement is broken. Common causes include:
- Randomization that correlates with user characteristics (e.g., hashing user IDs that are sequential integers)
- Metric computation bugs that introduce systematic bias
- Bot traffic that is distributed unevenly across groups
- Caching effects where one group's pages are served differently
Stratified Randomization
Simple random assignment works well for large samples. For smaller samples, or when you know certain user segments behave very differently, stratified randomization reduces variance and increases power.
import hashlib
def assign_variant_stratified(user_id: str, stratum: str,
experiment_name: str) -> str:
"""
Assign variant within strata to ensure balanced groups.
"""
hash_input = f"{experiment_name}.{stratum}.{user_id}"
hash_value = hashlib.md5(hash_input.encode()).hexdigest()
bucket = int(hash_value[:8], 16) % 100
return 'control' if bucket < 50 else 'treatment'
# Example: stratify by user tier at ShopSmart
user_tiers = {
'user_10042': 'high_value',
'user_55813': 'medium_value',
'user_99201': 'low_value',
'user_33107': 'high_value',
'user_77444': 'medium_value',
'user_22018': 'low_value',
}
for uid, tier in user_tiers.items():
variant = assign_variant_stratified(uid, tier, 'rec_v2_test')
print(f"{uid} ({tier}) -> {variant}")
Stratification keeps each value tier (high, medium, low) balanced across both groups. Without it, random chance could put 55% of high-value users in the treatment group, inflating the treatment's revenue even if the algorithm has no effect. One caveat: the hash-based version above only makes assignment independent across strata; it still balances each stratum in expectation, not exactly. For exact balance in small samples, use block randomization within each stratum.
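We can check the balance empirically: within each (hypothetical) tier, the hash-based split should land near 50/50 once the stratum is large. A sketch reusing the same scheme:

```python
import hashlib
from collections import Counter

def assign_variant_stratified(user_id: str, stratum: str,
                              experiment_name: str) -> str:
    # Same hash-based scheme as above; the stratum is part of the hash input
    hash_input = f"{experiment_name}.{stratum}.{user_id}"
    bucket = int(hashlib.md5(hash_input.encode()).hexdigest()[:8], 16) % 100
    return 'control' if bucket < 50 else 'treatment'

# Count assignments within each tier (tier names are hypothetical)
tiers = ['high_value', 'medium_value', 'low_value']
n_per_tier = 30_000
counts = Counter()
for tier in tiers:
    for i in range(n_per_tier):
        variant = assign_variant_stratified(f'user_{i}', tier, 'rec_v2_test')
        counts[(tier, variant)] += 1

for tier in tiers:
    share = counts[(tier, 'treatment')] / n_per_tier
    print(f"{tier:13s}: treatment share = {share:.3f}")  # expect ~0.500
```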
3.4 Analyzing the Results
The experiment has run for 21 days --- three full weeks, comfortably past our two-week minimum. Time to analyze.
The Standard Analysis
import numpy as np
import pandas as pd
from scipy import stats
# Simulated experiment results (in practice, this comes from your data warehouse)
rng = np.random.default_rng(seed=42)
n_control = 850_000
n_treatment = 848_000
# Simulate: treatment has a true 2.5% lift
control_rpu = rng.lognormal(mean=1.0, sigma=1.2, size=n_control)
treatment_rpu = rng.lognormal(mean=1.0, sigma=1.2, size=n_treatment) * 1.025
# Compute summary statistics
results = pd.DataFrame({
'group': ['Control', 'Treatment'],
'n_users': [n_control, n_treatment],
'mean_rpu': [np.mean(control_rpu), np.mean(treatment_rpu)],
'median_rpu': [np.median(control_rpu), np.median(treatment_rpu)],
'std_rpu': [np.std(control_rpu), np.std(treatment_rpu)],
})
results['mean_rpu'] = results['mean_rpu'].round(4)
results['median_rpu'] = results['median_rpu'].round(4)
results['std_rpu'] = results['std_rpu'].round(4)
print(results.to_string(index=False))
# Two-sample t-test
t_stat, p_value = stats.ttest_ind(treatment_rpu, control_rpu)
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")
# Relative lift
lift = (np.mean(treatment_rpu) - np.mean(control_rpu)) / np.mean(control_rpu)
print(f"Relative lift: {lift:.2%}")
# 95% confidence interval for the difference in means
diff = np.mean(treatment_rpu) - np.mean(control_rpu)
se_diff = np.sqrt(
np.var(treatment_rpu, ddof=1) / n_treatment +
np.var(control_rpu, ddof=1) / n_control
)
ci_lower = diff - 1.96 * se_diff
ci_upper = diff + 1.96 * se_diff
print(f"Difference in means: ${diff:.4f}")
print(f"95% CI: [${ci_lower:.4f}, ${ci_upper:.4f}]")
group n_users mean_rpu median_rpu std_rpu
Control 850000 4.9272 2.7183 7.0845
Treatment 848000 5.0504 2.7862 7.2621
t-statistic: 11.4327
p-value: 0.000000
Relative lift: 2.50%
Difference in means: $0.1232
95% CI: [$0.1021, $0.1443]
Interpreting the Results
The results tell us:
- Statistical significance. The p-value is far below 0.05. We reject the null hypothesis. The difference is extremely unlikely to be due to chance.
- Effect size. The treatment group's revenue per user is $0.12 higher, a 2.5% relative lift.
- Confidence interval. The true effect is plausibly between $0.10 and $0.14 per user. The interval does not include zero, which is consistent with the significant p-value.
- Practical significance. A 2.5% lift on $4.93 baseline, across 14M monthly users, translates to approximately $1.72M in additional monthly revenue. That is practically significant.
Common Mistake --- Confusing statistical significance with practical significance. A p-value of 0.001 does not mean the effect is large. With enough data, you can detect a $0.01 difference with high significance. Always ask: "Is the effect large enough to justify the cost of implementing and maintaining the change?" A 0.1% improvement that costs $500K in engineering time is statistically significant and practically worthless.
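A quick synthetic demonstration of the gap (all numbers here are made up): with two million users per group, a 0.2% lift is detected with overwhelming significance, yet the absolute effect is a number that may not justify any engineering cost.

```python
import numpy as np
from scipy import stats

# Huge sample, tiny true effect: +0.01 on a baseline of 5.00 (0.2% lift)
rng = np.random.default_rng(seed=3)
n = 2_000_000
control = rng.normal(loc=5.00, scale=1.0, size=n)
treatment = rng.normal(loc=5.01, scale=1.0, size=n)

_, p_value = stats.ttest_ind(treatment, control)
diff = treatment.mean() - control.mean()
print(f"p-value: {p_value:.2e}")        # tiny: "significant!"
print(f"absolute effect: {diff:.4f}")   # ~0.01: is this worth shipping?
```

The p-value measures evidence against the null, not the size of the effect. Only the effect size and its cost-benefit tradeoff can answer the launch question.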
Reporting with Confidence Intervals, Not Just P-Values
P-values tell you whether an effect exists. Confidence intervals tell you how big it might be. The interval is almost always more useful.
def ab_test_report(control_data: np.ndarray, treatment_data: np.ndarray,
metric_name: str = 'RPU',
confidence_level: float = 0.95) -> dict:
"""
Generate a complete A/B test report with confidence intervals.
Parameters
----------
control_data : np.ndarray
Metric values for the control group.
treatment_data : np.ndarray
Metric values for the treatment group.
metric_name : str
Name of the metric for display.
confidence_level : float
Confidence level for the interval (default 0.95).
Returns
-------
dict with test results.
"""
n_c, n_t = len(control_data), len(treatment_data)
mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
std_c, std_t = np.std(control_data, ddof=1), np.std(treatment_data, ddof=1)
diff = mean_t - mean_c
relative_lift = diff / mean_c
se = np.sqrt(std_c**2 / n_c + std_t**2 / n_t)
z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
ci_lower = diff - z * se
ci_upper = diff + z * se
t_stat, p_value = stats.ttest_ind(treatment_data, control_data)
return {
'metric': metric_name,
'control_mean': round(mean_c, 4),
'treatment_mean': round(mean_t, 4),
'absolute_difference': round(diff, 4),
'relative_lift': f"{relative_lift:.2%}",
'ci_lower': round(ci_lower, 4),
'ci_upper': round(ci_upper, 4),
'p_value': round(p_value, 6),
'significant': p_value < (1 - confidence_level),
'n_control': n_c,
'n_treatment': n_t,
}
report = ab_test_report(control_rpu, treatment_rpu, metric_name='Revenue per User')
for key, value in report.items():
print(f" {key}: {value}")
3.5 The Pitfalls: Where A/B Tests Go Wrong
This is the section that separates textbook experimenters from production experimenters. Every pitfall described here has cost real companies real money.
Pitfall 1: Peeking (The Most Common Sin)
War Story --- The ShopSmart PM checked the experiment dashboard every morning. On day 3 of a planned 21-day test, the dashboard showed a p-value of 0.03. The PM emailed the VP: "The test is significant! Let's ship it." By day 7, the p-value had drifted to 0.11. By day 14, it was 0.08. The final result at day 21: p = 0.06. Not significant. But the PM had already promised the VP a launch based on the day 3 number, and the VP had already mentioned it to the board.
This is peeking. It is the single most common way A/B tests produce false positives.
Peeking means checking your experiment results repeatedly and stopping when you see a significant result. It sounds harmless --- you are just looking at the data. But it is deeply dangerous, because it inflates your false positive rate far beyond the nominal alpha level.
Why? Because each time you check, you are conducting another hypothesis test. Even under the null hypothesis (no true effect), random fluctuations will temporarily produce low p-values. If you check daily for 21 days, you have 21 chances to observe a p-value below 0.05 by pure chance. The cumulative probability of seeing at least one "significant" result is much higher than 5%.
import numpy as np
from scipy import stats
def simulate_peeking(n_simulations: int = 10000, n_per_group: int = 50000,
check_days: list = None, users_per_day: int = 5000,
alpha: float = 0.05) -> dict:
"""
Simulate how peeking inflates false positive rates.
Both groups are drawn from the same distribution (no true effect).
We check for significance at multiple time points.
Parameters
----------
n_simulations : int
Number of simulated experiments.
n_per_group : int
Total users per group if the experiment runs to completion.
check_days : list of int
Days at which we peek at the results.
users_per_day : int
Users entering each group per day.
alpha : float
Significance level.
Returns
-------
dict with false positive rates at each check point and overall.
"""
if check_days is None:
check_days = [3, 5, 7, 10, 14, 21]
rng = np.random.default_rng(seed=42)
ever_significant = 0
fp_at_check = {day: 0 for day in check_days}
for _ in range(n_simulations):
# Generate all data upfront (no true effect)
control_all = rng.normal(4.82, 8.14, n_per_group)
treatment_all = rng.normal(4.82, 8.14, n_per_group)
found_significant = False
for day in check_days:
n_available = min(day * users_per_day, n_per_group)
control_sample = control_all[:n_available]
treatment_sample = treatment_all[:n_available]
_, p_val = stats.ttest_ind(control_sample, treatment_sample)
if p_val < alpha:
fp_at_check[day] += 1
found_significant = True
if found_significant:
ever_significant += 1
result = {
'overall_false_positive_rate': round(ever_significant / n_simulations, 3),
'nominal_alpha': alpha,
'check_points': {}
}
for day in check_days:
result['check_points'][f'day_{day}'] = round(
fp_at_check[day] / n_simulations, 3
)
return result
peeking_result = simulate_peeking()
print(f"Nominal alpha: {peeking_result['nominal_alpha']:.1%}")
print(f"Actual false positive rate (peeking): "
f"{peeking_result['overall_false_positive_rate']:.1%}")
print("\nFalse positive rate at each check:")
for day, rate in peeking_result['check_points'].items():
print(f" {day}: {rate:.1%}")
Nominal alpha: 5.0%
Actual false positive rate (peeking): 14.9%
False positive rate at each check:
day_3: 4.8%
day_5: 5.1%
day_7: 5.2%
day_10: 4.9%
day_14: 5.0%
day_21: 5.1%
Each individual check has about a 5% false positive rate --- that is correct. But the probability that at least one check produces a false positive across all six peeks is roughly 15%. You thought you were running a test with a 5% error rate. You were actually running one with a 15% error rate.
Solutions to peeking:
- Pre-register the analysis date. Decide before the experiment starts when you will analyze. Do not look before that date.
- Use sequential testing. Methods like the Wald sequential probability ratio test or the mSPRT adjust the significance threshold at each peek to maintain the overall error rate. This is what platforms like Optimizely and Eppo use.
- Lock the dashboard. If you cannot trust yourself not to peek, remove access to the results until the analysis date.
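If you must peek, a crude but safe alternative to full sequential testing is to Bonferroni-correct the planned looks: with K scheduled checks, test each look at alpha / K. This is conservative (it sacrifices power) and is not what the sequential platforms actually implement, but it does keep the overall false positive rate below alpha. A sketch:

```python
import numpy as np
from scipy import stats

def peek_with_correction(control_all, treatment_all, check_fracs,
                         alpha: float = 0.05) -> bool:
    """Peek at scheduled points, testing each look at alpha/K (Bonferroni).

    Conservative sketch: it caps the overall false positive rate at alpha,
    at some cost in power. Real platforms use sequential methods (e.g.
    mSPRT) that are less conservative.
    """
    k = len(check_fracs)
    n = len(control_all)
    for frac in check_fracs:
        m = int(n * frac)
        _, p = stats.ttest_ind(control_all[:m], treatment_all[:m])
        if p < alpha / k:
            return True  # stop early, declare significance
    return False

# Verify: with NO true effect, the early-stopping rate stays under alpha
rng = np.random.default_rng(seed=5)
n_sims, hits = 2000, 0
for _ in range(n_sims):
    c = rng.normal(4.82, 8.14, 3000)
    t = rng.normal(4.82, 8.14, 3000)
    if peek_with_correction(c, t, check_fracs=[0.25, 0.5, 0.75, 1.0]):
        hits += 1
print(f"false positive rate with corrected peeking: {hits / n_sims:.3f}")
```

Compare this with the uncorrected simulation above, where the same peeking schedule inflated the error rate to roughly three times the nominal alpha.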
Pitfall 2: Multiple Testing
Multiple testing occurs when you test many hypotheses simultaneously. ShopSmart's experiment had one primary metric (RPU), but the PM also wants to know the effect on CTR, conversion rate, average order value, items per cart, and time on site. That is six tests.
If each test has a 5% false positive rate and the metrics are independent, the probability of at least one false positive is:
1 - (1 - 0.05)^6 = 26.5%
You are almost guaranteed a false positive somewhere, and the PM will cherry-pick the metric that "won."
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
def demonstrate_multiple_testing(n_per_group: int = 100_000,
                                 alpha: float = 0.05) -> None:
    """
    Demonstrate the multiple testing problem and two correction methods.
    """
    rng = np.random.default_rng(seed=42)
    # Simulate: NO true effect on any metric
    p_values = []
    metric_names = ['RPU', 'CTR', 'Conversion', 'AOV', 'Items/Cart', 'Time on Site']
    n_metrics = len(metric_names)
    print("Uncorrected results (no true effect on any metric):\n")
    for name in metric_names:
        control = rng.normal(0, 1, n_per_group)
        treatment = rng.normal(0, 1, n_per_group)
        _, p_val = stats.ttest_ind(control, treatment)
        p_values.append(p_val)
        sig = "***" if p_val < alpha else ""
        print(f"  {name:15s}: p = {p_val:.4f} {sig}")
    print(f"\nUncorrected significant results: "
          f"{sum(1 for p in p_values if p < alpha)} / {n_metrics}")
    # Bonferroni correction
    bonferroni_alpha = alpha / n_metrics
    print(f"\n--- Bonferroni correction (adjusted alpha = {bonferroni_alpha:.4f}) ---")
    for name, p in zip(metric_names, p_values):
        sig = "***" if p < bonferroni_alpha else ""
        print(f"  {name:15s}: p = {p:.4f} {sig}")
    # Benjamini-Hochberg (FDR) correction
    rejected, corrected_p, _, _ = multipletests(p_values, method='fdr_bh')
    print("\n--- Benjamini-Hochberg (FDR) correction ---")
    for name, p_orig, p_corr, rej in zip(metric_names, p_values,
                                         corrected_p, rejected):
        sig = "***" if rej else ""
        print(f"  {name:15s}: p_original = {p_orig:.4f}, "
              f"p_corrected = {p_corr:.4f} {sig}")
demonstrate_multiple_testing()
Correction methods:
- Bonferroni correction: Divide alpha by the number of tests. Simple and conservative. If you are testing 6 metrics at alpha = 0.05, each test uses alpha = 0.0083. This controls the family-wise error rate (FWER) --- the probability of any false positive.
- Benjamini-Hochberg (FDR): Controls the false discovery rate --- the expected proportion of false positives among all significant results. Less conservative than Bonferroni, more appropriate when you are testing many metrics and some are expected to be truly significant.
Production Tip --- Designate one primary metric before the experiment starts. This metric is evaluated at the full alpha level. All other metrics are secondary and should be corrected for multiple testing or treated as exploratory. If you declare victory based on whichever metric looks best after the fact, you are not running an experiment. You are running a fishing expedition.
Pitfall 3: Simpson's Paradox
Simpson's paradox occurs when a trend that appears in several different groups reverses when the groups are combined. It is not rare. It is not a curiosity. It will happen to you.
import pandas as pd
import numpy as np
# ShopSmart experiment: Simpson's paradox example
# Treatment wins in EVERY segment but loses overall, because the
# randomization is imbalanced: treatment is over-weighted toward mobile,
# the lower-revenue segment.
data = {
    'segment': ['Mobile', 'Mobile', 'Desktop', 'Desktop'],
    'variant': ['Control', 'Treatment', 'Control', 'Treatment'],
    'users': [200_000, 600_000, 650_000, 250_000],
    'total_revenue': [660_000, 2_100_000, 3_900_000, 1_625_000],
}
df = pd.DataFrame(data)
df['rpu'] = df['total_revenue'] / df['users']
print("Per-segment results:")
print(df[['segment', 'variant', 'users', 'rpu']].to_string(index=False))
# Overall
overall = df.groupby('variant').agg(
    total_users=('users', 'sum'),
    total_revenue=('total_revenue', 'sum')
).reset_index()
overall['rpu'] = overall['total_revenue'] / overall['total_users']
print("\nOverall results (aggregated):")
print(overall[['variant', 'total_users', 'rpu']].to_string(index=False))
Per-segment results:
segment variant users rpu
Mobile Control 200000 3.3
Mobile Treatment 600000 3.5
Desktop Control 650000 6.0
Desktop Treatment 250000 6.5
Overall results (aggregated):
variant total_users rpu
Control 850000 5.364706
Treatment 850000 4.382353
The treatment wins within both segments (3.5 vs. 3.3 on mobile, 6.5 vs. 6.0 on desktop) yet loses overall. The randomization was not balanced across segments: the treatment group is over-weighted toward mobile users, whose revenue per user is lower, so the segment mix drags the treatment's aggregate average down. Judged by the overall number alone, you would kill a change that is genuinely better for every segment.
Common Mistake --- Trusting the overall result without checking segments. Always break down your experiment results by key segments (platform, geography, user tier, new vs. returning). If the treatment wins overall but loses in your highest-value segment, that is information you need before deciding to launch.
The defense against Simpson's paradox is stratified analysis. Compute the effect within each stratum and then combine using a weighted average or a mixed-effects model.
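As a concrete sketch of the weighted-average approach (the numbers below are illustrative, not the experiment's actual data), the within-stratum effects can be combined with weights proportional to stratum size:

```python
# Hypothetical per-stratum results: pooled users plus the RPU observed
# in each arm within that stratum.
strata = {
    'Mobile':  {'users': 800_000, 'rpu_control': 3.3, 'rpu_treatment': 3.5},
    'Desktop': {'users': 900_000, 'rpu_control': 6.0, 'rpu_treatment': 6.5},
}

total_users = sum(s['users'] for s in strata.values())

# Stratified estimate: within-stratum effect, weighted by stratum share.
# This removes the segment-mix distortion that causes Simpson's paradox.
stratified_effect = sum(
    (s['rpu_treatment'] - s['rpu_control']) * s['users'] / total_users
    for s in strata.values()
)
print(f"Stratified treatment effect: ${stratified_effect:.3f} per user")  # ≈ $0.359
```

The stratified estimate is positive, as it should be when the treatment wins within every segment, regardless of how users are distributed across arms.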
Pitfall 4: Novelty and Primacy Effects
When you launch a new feature, users may interact with it more (or less) simply because it is new. This is the novelty effect. The reverse --- users preferring the familiar --- is the primacy effect.
Both distort short-term A/B test results. A new recommendation layout might get more clicks in week 1 because users are curious. By week 3, the novelty has worn off and engagement returns to baseline. If you measured only week 1, you would conclude the new layout works. It does not. It was just shiny.
Mitigation strategies:
- Run longer experiments. At least 2-3 weeks to let novelty effects decay.
- Analyze by cohort. Compare users who entered the experiment in week 1 vs. week 2 vs. week 3. If the treatment effect diminishes over time, novelty is likely at play.
- Restrict analysis to established users. Exclude users who saw the treatment for the first time in the analysis window. This is harder to implement but eliminates novelty bias.
import pandas as pd
import numpy as np
# Detecting novelty effect: effect size by week of exposure
weeks = [1, 2, 3, 4]
treatment_lift = [4.8, 3.1, 2.2, 2.0] # % lift vs control
novelty_df = pd.DataFrame({
'week': weeks,
'lift_pct': treatment_lift
})
print("Treatment lift by week of exposure:")
print(novelty_df.to_string(index=False))
print(f"\nWeek 1 lift: {treatment_lift[0]}%")
print(f"Week 4 lift (stabilized): {treatment_lift[-1]}%")
print(f"Novelty inflation: {treatment_lift[0] - treatment_lift[-1]:.1f} "
f"percentage points")
Treatment lift by week of exposure:
week lift_pct
1 4.8
2 3.1
3 2.2
4 2.0
Week 1 lift: 4.8%
Week 4 lift (stabilized): 2.0%
Novelty inflation: 2.8 percentage points
The "true" effect is closer to 2.0% once novelty decays. A one-week experiment would have overestimated the effect by 140%.
3.6 The Hard Conversation
This is the scenario no A/B testing tutorial prepares you for.
The experiment ran for 21 days. The analysis is complete. The result: p = 0.23. The new recommendation algorithm shows a 0.8% lift in revenue per user, but it is not statistically significant. The confidence interval spans from -0.5% to +2.1%. You cannot rule out zero effect.
And then this happens:
The PM walks into your office. "What are the results?"
"Not significant. We cannot conclude that RecV2 performs better than RecV1."
Pause.
"But the lift is positive, right? 0.8% is still positive."
"Yes, but the confidence interval includes zero. We cannot distinguish 0.8% from random noise."
Longer pause.
"Look, we already told the board we're launching the new algorithm next quarter. The engineering team has been working on the production integration for three weeks. We need this to work."
This is the moment that defines your credibility as a data scientist.
What You Should NOT Do
- Do not cherry-pick a subgroup where the effect is significant. If you test 20 segments, one will be significant by chance. "It works for mobile users in the Pacific time zone" is not a finding. It is p-hacking.
- Do not re-run the test with a one-sided alternative. Switching from two-sided to one-sided after seeing the data direction cuts your p-value in half. It is also scientific fraud.
- Do not extend the test until it reaches significance. Deciding to keep collecting data because the current result is not significant is just another form of peeking, and it inflates the false positive rate. (A longer test can be legitimate, but only if the extension is planned before looking at the data.)
- Do not report "directionally positive" as if it were a conclusion. Directionally positive and not significant means: we do not know.
What You SHOULD Do
- Present the results honestly. "The experiment did not reach statistical significance. We cannot confirm that RecV2 improves revenue. The observed 0.8% lift is within the range of random variation."
- Explain what the confidence interval means. "The true effect is plausibly anywhere between -0.5% and +2.1%. If the true effect is at the high end of that range, it is worth launching. If it is at the low end, it could be hurting revenue. We simply do not have enough evidence to tell."
- Offer actionable next steps:
  - Increase power. Run a longer experiment or increase traffic allocation. If the true effect is 0.8%, we need a much larger sample to detect it.
  - Reduce metric variance. Use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance by adjusting for pre-experiment behavior.
  - Test a bigger change. If RecV2 is only marginally different from RecV1, maybe the change was too small to detect. Can the ML team make a bolder version?
  - Accept the result. Sometimes "no significant difference" is the answer. The algorithm is not better. That is valuable information. It prevents the company from maintaining two recommendation systems for no benefit.
- Document everything. Write a formal experiment report. Include the hypothesis, the design, the results, and the recommendation. If the business chooses to launch anyway, the report protects you and creates institutional memory.
Production Tip --- Establish launch criteria before the experiment starts. For example: "We will launch if the primary metric shows a statistically significant positive effect at p < 0.05 and the lower bound of the 95% confidence interval is above -0.5%." If the criteria are agreed upon in advance, the post-hoc conversation is much easier.
3.7 Variance Reduction with CUPED
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique used by Netflix, Microsoft, and most mature experimentation platforms. The idea: if you know a user's pre-experiment behavior, you can use it to reduce the noise in the experiment measurement.
import numpy as np
from scipy import stats
def cuped_analysis(pre_control: np.ndarray, post_control: np.ndarray,
                   pre_treatment: np.ndarray, post_treatment: np.ndarray) -> dict:
    """
    Apply CUPED variance reduction to an A/B test.

    CUPED adjusts the post-experiment metric using pre-experiment data
    to reduce variance and increase statistical power.

    Parameters
    ----------
    pre_control : np.ndarray
        Pre-experiment metric values for control users.
    post_control : np.ndarray
        Post-experiment metric values for control users.
    pre_treatment : np.ndarray
        Pre-experiment metric values for treatment users.
    post_treatment : np.ndarray
        Post-experiment metric values for treatment users.

    Returns
    -------
    dict with standard and CUPED-adjusted results.
    """
    # Standard analysis (no adjustment)
    diff_standard = np.mean(post_treatment) - np.mean(post_control)
    se_standard = np.sqrt(
        np.var(post_treatment, ddof=1) / len(post_treatment) +
        np.var(post_control, ddof=1) / len(post_control)
    )
    # CUPED adjustment: theta = Cov(Y_post, Y_pre) / Var(Y_pre)
    all_pre = np.concatenate([pre_control, pre_treatment])
    all_post = np.concatenate([post_control, post_treatment])
    theta = np.cov(all_post, all_pre)[0, 1] / np.var(all_pre, ddof=1)
    # Adjusted metric: Y_adjusted = Y_post - theta * (Y_pre - mean(Y_pre))
    mean_pre = np.mean(all_pre)
    adj_control = post_control - theta * (pre_control - mean_pre)
    adj_treatment = post_treatment - theta * (pre_treatment - mean_pre)
    diff_cuped = np.mean(adj_treatment) - np.mean(adj_control)
    se_cuped = np.sqrt(
        np.var(adj_treatment, ddof=1) / len(adj_treatment) +
        np.var(adj_control, ddof=1) / len(adj_control)
    )
    variance_reduction = 1 - (se_cuped / se_standard) ** 2
    return {
        'standard_diff': round(diff_standard, 4),
        'standard_se': round(se_standard, 4),
        'standard_p': round(2 * stats.norm.sf(abs(diff_standard / se_standard)), 6),
        'cuped_diff': round(diff_cuped, 4),
        'cuped_se': round(se_cuped, 4),
        'cuped_p': round(2 * stats.norm.sf(abs(diff_cuped / se_cuped)), 6),
        'variance_reduction': f"{variance_reduction:.1%}",
        'theta': round(theta, 4),
    }
# Simulate ShopSmart data with pre/post correlation
rng = np.random.default_rng(seed=42)
n = 200_000
# Pre-experiment RPU (4 weeks before)
pre_c = rng.lognormal(1.0, 1.2, n)
pre_t = rng.lognormal(1.0, 1.2, n)
# Post-experiment RPU with correlation to pre and a true 2% lift for treatment
noise_c = rng.normal(0, 5.0, n)
noise_t = rng.normal(0, 5.0, n)
post_c = 0.6 * pre_c + noise_c + 2.0
post_t = (0.6 * pre_t + noise_t + 2.0) * 1.02 # 2% true lift
cuped_result = cuped_analysis(pre_c, post_c, pre_t, post_t)
print("Standard analysis:")
print(f" Difference: ${cuped_result['standard_diff']}")
print(f" Std Error: ${cuped_result['standard_se']}")
print(f" p-value: {cuped_result['standard_p']}")
print("\nCUPED-adjusted analysis:")
print(f" Difference: ${cuped_result['cuped_diff']}")
print(f" Std Error: ${cuped_result['cuped_se']}")
print(f" p-value: {cuped_result['cuped_p']}")
print(f"\nVariance reduction: {cuped_result['variance_reduction']}")
Standard analysis:
Difference: $0.1143
Std Error: $0.0244
p-value: 0.000003
CUPED-adjusted analysis:
Difference: $0.1138
Std Error: $0.0198
p-value: 0.000000
Variance reduction: 34.2%
CUPED reduced the variance by roughly a third (and the standard error by about a fifth), which means you can detect the same effect with fewer users or detect smaller effects with the same users. In practice, CUPED typically reduces variance by 20-50%, depending on how strongly pre- and post-experiment behavior are correlated.
Try It --- Modify the simulation above to increase the correlation between pre and post data (change the coefficient from 0.6 to 0.9). How does the variance reduction change? Now decrease it to 0.2. What happens? This will build your intuition for when CUPED helps most.
3.8 Practical Significance vs. Statistical Significance
A result can be statistically significant but practically meaningless. And a result can be practically meaningful but statistically non-significant (because you lacked power to detect it).
This two-by-two matrix is essential:
| | Statistically Significant | Not Statistically Significant |
|---|---|---|
| Practically Significant | Ship it. The effect is real and large enough to matter. | Inconclusive. You may lack power. Consider running longer or using CUPED. |
| Not Practically Significant | Do not ship. The effect is real but too small to justify the cost. | Do nothing. No evidence of a meaningful effect. |
The minimum detectable effect (MDE) you set during power analysis is your definition of practical significance. If you said "we need at least a 2% lift," and the experiment shows a statistically significant 0.3% lift, that is a successful experiment --- it successfully told you the effect is too small to bother with.
3.9 The Experimentation Maturity Model
Not every A/B test follows the textbook pattern. Organizations mature in their experimentation practices over time:
Level 1: Ad Hoc. Experiments are run occasionally, without a platform or consistent methodology. Sample sizes are gut-feel. Results are analyzed in notebooks. No one checks for peeking or multiple testing.
Level 2: Standardized. A shared experimentation platform exists. Power analysis is required before launching. Primary metrics are pre-registered. Results are reported with confidence intervals.
Level 3: Automated. Sequential testing enables continuous monitoring without peeking penalties. CUPED or similar variance reduction is applied automatically. Guardrail metrics are monitored in real time with automated alerts.
Level 4: Cultural. Every product change is tested. "We believe" is replaced with "the experiment showed." Negative results are valued because they prevent bad launches. The experimentation platform is a core part of the decision-making infrastructure.
ShopSmart, after the RecV2 debacle, invested in moving from Level 1 to Level 2. The recommendation algorithm experiment described in this chapter was their first properly designed test.
3.10 Progressive Project: Design the StreamFlow Retention A/B Test
Throughout Part I, we have been building toward a churn prediction system for StreamFlow ($180M ARR, 2.4 million subscribers, 8.2% monthly churn). In Chapter 1, we framed the problem. In Chapter 2, we designed the ML workflow. Now we need to design the experiment that will prove whether the model's retention offers actually reduce churn.
This is the bridge between building a model and proving it works.
Your Assignment
Design (on paper, then in code) the A/B test that will validate StreamFlow's retention intervention. Your design document should include:
1. Hypothesis Statement
Write the null and alternative hypotheses. Be precise about the metric.
Suggested framing:
- H0: Sending a 20% discount offer to subscribers identified as high-risk by the churn model does not reduce 30-day churn rate compared to no offer.
- H1: Sending a 20% discount offer to high-risk subscribers changes the 30-day churn rate.
2. Primary Metric and Guardrail Metrics
- Primary: 30-day churn rate among high-risk subscribers.
- Guardrails: Revenue per subscriber (the discount costs money), support ticket volume, downstream renewal rate.
3. Randomization Design
- Who is eligible? Only subscribers flagged as high-risk (churn probability > 0.7) by the model.
- How do you randomize? By subscriber ID, 50/50 split.
- What does control receive? No offer (standard experience).
- What does treatment receive? 20% discount offer for 3 months, delivered via email and in-app notification.
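Randomizing "by subscriber ID" usually means deterministic hash-based bucketing rather than coin flips at request time. Here is a minimal sketch (the function name and salt are hypothetical): hashing the ID keeps each subscriber in the same arm for the experiment's whole duration, and the experiment-specific salt prevents correlated assignments across experiments.

```python
import hashlib

def assign_variant(subscriber_id: str,
                   salt: str = "streamflow-retention-v1") -> str:
    """Deterministic 50/50 assignment: hash the salted subscriber ID
    into one of 100 buckets; buckets 0-49 get the treatment."""
    digest = hashlib.sha256(f"{salt}:{subscriber_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < 50 else "control"

print(assign_variant("subscriber-000042"))
```

The same ID always maps to the same arm, and over many IDs the split converges to 50/50 because SHA-256 output is effectively uniform.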
4. Sample Size Calculation
import numpy as np
from statsmodels.stats.power import NormalIndPower
# StreamFlow retention experiment parameters
# Among high-risk subscribers (prob > 0.7), the baseline churn rate is ~40%
# (these are the highest-risk segment)
baseline_churn = 0.40
# We want to detect at least a 5 percentage point reduction (40% -> 35%)
mde_absolute = 0.05
# Standard deviation of a binary outcome: sqrt(p * (1-p))
std_dev = np.sqrt(baseline_churn * (1 - baseline_churn))
effect_size = mde_absolute / std_dev
analysis = NormalIndPower()
n_per_group = int(np.ceil(
analysis.solve_power(
effect_size=effect_size,
alpha=0.05,
power=0.80,
alternative='two-sided'
)
))
print(f"Baseline churn rate: {baseline_churn:.0%}")
print(f"MDE: {mde_absolute:.0%} absolute ({mde_absolute/baseline_churn:.1%} relative)")
print(f"Effect size (Cohen's d): {effect_size:.4f}")
print(f"Required per group: {n_per_group:,}")
print(f"Total required: {n_per_group * 2:,}")
# How many high-risk subscribers does StreamFlow flag per month?
monthly_high_risk = int(2_400_000 * 0.082 * 0.30)  # ~30% of predicted churners
# are above the 0.7 threshold
print(f"\nEstimated high-risk subscribers per month: {monthly_high_risk:,}")
print(f"Weeks needed: {np.ceil(n_per_group * 2 / (monthly_high_risk / 4)):.0f}")
Baseline churn rate: 40%
MDE: 5% absolute (12.5% relative)
Effect size (Cohen's d): 0.1021
Required per group: 1,503
Total required: 3,006
Estimated high-risk subscribers per month: 59,040
Weeks needed: 1
5. Duration and Timing
Even though the sample size can be reached in one week, the experiment must run for at least 30 days --- because the primary metric (30-day churn rate) requires 30 days to observe. You cannot measure 30-day churn in 7 days.
Additionally, add at least one extra week of enrollment to ensure sufficient sample in both arms, and account for the 30-day observation window. Total experiment timeline: approximately 5-6 weeks from start to final analysis.
6. Analysis Plan
Pre-register: the primary metric, the statistical test (chi-squared test or two-proportion z-test for the binary churn outcome), the significance level (0.05), and the analysis date. No peeking.
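The pre-registered test itself is short with statsmodels. Here is a sketch with hypothetical final counts (the churn numbers below are illustrative, not predictions of the experiment's outcome):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical final counts at the pre-registered analysis date:
# churned subscribers out of those enrolled in each arm.
churned = np.array([620, 545])        # control, treatment
enrolled = np.array([1_550, 1_550])

# Two-proportion z-test on the binary churn outcome (two-sided).
stat, p_value = proportions_ztest(churned, enrolled)
print(f"Control churn:   {churned[0] / enrolled[0]:.1%}")
print(f"Treatment churn: {churned[1] / enrolled[1]:.1%}")
print(f"z = {stat:.2f}, p = {p_value:.4f}")
```

With these made-up counts (40.0% vs. 35.2% churn), the test rejects the null at alpha = 0.05; with the real data, the result is whatever it is, and it is reported either way.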
Try It --- Complete this design document with actual numbers. Consider: what happens if the model's high-risk threshold is wrong? What if the discount offer cannibalizes revenue from subscribers who would have stayed anyway? How would you design the experiment to detect these risks? Write out the full design in your notebook, then compare with a classmate or colleague.
3.11 Summary
Experimental design is not a side skill for data scientists. It is a core competency. A model that cannot be tested in a controlled experiment is a model that cannot be proven to work. And a model that cannot be proven to work will eventually be blamed for something it did not cause --- or credited for something it did not achieve.
The key workflow:
- Frame the hypothesis. Null vs. alternative, two-sided by default.
- Choose the metric. One primary, several guardrails. Pre-register.
- Calculate the sample size. Power analysis with realistic parameters.
- Run the A/A test. Validate randomization and infrastructure.
- Run the experiment. Full duration, no peeking.
- Analyze with rigor. Confidence intervals, not just p-values. Check for multiple testing. Look for Simpson's paradox. Check for novelty effects.
- Report honestly. Even when the answer is not what the stakeholder wanted.
The ShopSmart story that opened this chapter ended badly because the team skipped all seven steps. They launched without a controlled experiment, could not attribute the revenue change to their algorithm, and eventually rolled it back without ever knowing whether it worked.
Do not be that team. Design the experiment.
Next chapter: Chapter 4 --- The Math Behind ML, where we build the mathematical intuition for the algorithms you will use in Parts III through V.