Exercises: Chapter 3
Experimental Design and A/B Testing
Exercise 1: Hypothesis Formulation (Conceptual)
A ride-sharing company wants to test whether showing drivers an estimated earnings badge on the dispatch screen increases the acceptance rate for ride requests. Currently, 72% of requests are accepted.
a) Write the null hypothesis and alternative hypothesis. Be precise about the metric.
b) Should this be a one-sided or two-sided test? Justify your answer.
c) What is the randomization unit? Why?
d) Name two guardrail metrics that should be monitored alongside the primary metric.
Exercise 2: Power Analysis Calculation (Applied)
A SaaS company wants to test a new onboarding flow. Their current trial-to-paid conversion rate is 12%. They want to detect a 2 percentage point increase (12% to 14%). They have 8,000 new trial signups per week.
a) Calculate the required sample size per group using statsmodels. Use alpha = 0.05 and power = 0.80.
```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

# Fill in the blanks
baseline = ___
mde_absolute = ___

std_dev = np.sqrt(baseline * (1 - baseline))
effect_size = mde_absolute / std_dev

analysis = NormalIndPower()
n_per_group = int(np.ceil(
    analysis.solve_power(
        effect_size=effect_size,
        alpha=___,
        power=___,
        alternative='two-sided'
    )
))
print(f"Required per group: {n_per_group:,}")
```
b) How many weeks would the experiment need to run? Show your calculation.
c) The VP of Product says: "Two weeks is too long. Can we cut the test to one week?" What would you need to change about the experimental design to make a one-week test viable? What are the tradeoffs?
d) Re-run the power analysis with power = 0.90 instead of 0.80. How does the required sample size change? Why does this matter?
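As a cross-check on your statsmodels answer, the same sample size can be computed in closed form from normal quantiles. This sketch uses the exercise's numbers and assumes equal group sizes and the baseline-variance approximation from the skeleton above; it is a sanity check, not the only valid formula:

```python
import numpy as np
from scipy.stats import norm

def n_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Closed-form two-sample size under the normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)            # quantile for the desired power
    variance = baseline * (1 - baseline)
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2))

n80 = n_per_group(0.12, 0.02, power=0.80)
n90 = n_per_group(0.12, 0.02, power=0.90)
print(n80, n90)  # higher power demands a larger sample
```

Comparing `n80` and `n90` directly answers part (d): the jump from 80% to 90% power is not free.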
Exercise 3: Identify the Peeking Problem (Conceptual)
A data analyst runs an A/B test on a new checkout page design. The experiment is planned for 28 days. The analyst checks the results every day. Here is a log of the p-values:
| Day | p-value |
|---|---|
| 3 | 0.42 |
| 5 | 0.18 |
| 7 | 0.04 |
| 10 | 0.09 |
| 14 | 0.12 |
| 21 | 0.07 |
| 28 | 0.06 |
a) On day 7, the analyst reports to the team: "The new checkout page significantly increases conversion (p = 0.04). We should ship it." What is wrong with this conclusion?
b) What is the approximate actual false positive rate if the analyst stops the experiment the first time the p-value drops below 0.05?
c) If the company wanted to monitor results during the experiment without inflating the false positive rate, what approach should they use?
d) The final result at day 28 is p = 0.06 (not significant at alpha = 0.05). Does this mean the new checkout page definitely does not work? Explain.
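To build intuition for part (b), you can simulate the peeking strategy directly. The sketch below is my own illustration (the daily traffic numbers are arbitrary, not from the chapter): it runs many null experiments, applies a t-test at weekly interim looks, stops at the first p < 0.05, and reports how often a "significant" result is declared even though no effect exists:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_sims = 500
looks = [7, 14, 21, 28]   # interim analysis days
per_day = 50              # assumed users per group per day

false_positives = 0
for _ in range(n_sims):
    # Null experiment: both groups drawn from the same distribution
    a = rng.normal(0.0, 1.0, looks[-1] * per_day)
    b = rng.normal(0.0, 1.0, looks[-1] * per_day)
    for day in looks:
        k = day * per_day
        if stats.ttest_ind(a[:k], b[:k]).pvalue < 0.05:
            false_positives += 1  # stopped early on a false signal
            break

rate = false_positives / n_sims
print(f"Realized false positive rate with peeking: {rate:.3f}")
```

With four looks, the realized rate lands well above the nominal 5%, which is exactly the peeking problem.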
Exercise 4: Multiple Testing Correction (Applied)
An e-commerce team runs a single A/B test but analyzes 10 different metrics. The p-values are:
| Metric | p-value |
|---|---|
| Revenue per user | 0.032 |
| Conversion rate | 0.061 |
| Average order value | 0.088 |
| Items per cart | 0.041 |
| Click-through rate | 0.003 |
| Bounce rate | 0.210 |
| Time on site | 0.154 |
| Pages per session | 0.048 |
| Return rate | 0.520 |
| Customer satisfaction | 0.390 |
a) Without any correction, how many metrics are "significant" at alpha = 0.05?
b) Apply the Bonferroni correction. Which metrics remain significant?
```python
alpha = 0.05
n_tests = 10
bonferroni_alpha = alpha / n_tests

p_values = [0.032, 0.061, 0.088, 0.041, 0.003,
            0.210, 0.154, 0.048, 0.520, 0.390]
metrics = ['RPU', 'Conv Rate', 'AOV', 'Items/Cart', 'CTR',
           'Bounce', 'Time', 'Pages', 'Return Rate', 'CSAT']

# Which metrics survive Bonferroni correction?
for metric, p in zip(metrics, p_values):
    pass  # Your analysis here
```
c) Apply the Benjamini-Hochberg (FDR) procedure using statsmodels.stats.multitest.multipletests. Which metrics are significant under FDR control at 0.05?
d) The PM wants to declare victory based on CTR (p = 0.003). Is this justified? Why or why not?
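If statsmodels is not at hand, the Benjamini-Hochberg step-up rule is simple enough to implement directly. This hand-rolled version is an illustrative sketch for checking your part (c) answer; for these inputs it agrees with `multipletests(method='fdr_bh')`:

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest rank with p_(k) <= (k / m) * q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest passing rank
        reject[order[: k + 1]] = True
    return reject

p_values = [0.032, 0.061, 0.088, 0.041, 0.003,
            0.210, 0.154, 0.048, 0.520, 0.390]
metrics = ['RPU', 'Conv Rate', 'AOV', 'Items/Cart', 'CTR',
           'Bounce', 'Time', 'Pages', 'Return Rate', 'CSAT']
significant = [m for m, r in zip(metrics, bh_reject(p_values)) if r]
print(significant)  # → ['CTR']
```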
Exercise 5: Simpson's Paradox (Conceptual)
A healthcare company tests a new patient onboarding flow. The overall results show:
| Group | Patients | Completed Onboarding | Completion Rate |
|---|---|---|---|
| Control | 10,000 | 6,200 | 62.0% |
| Treatment | 10,000 | 5,900 | 59.0% |
The treatment appears to hurt onboarding completion. But when broken down by patient age:
| Age Group | Group | Patients | Completed | Rate |
|---|---|---|---|---|
| Under 50 | Control | 2,000 | 1,400 | 70.0% |
| Under 50 | Treatment | 7,000 | 5,040 | 72.0% |
| 50+ | Control | 8,000 | 4,800 | 60.0% |
| 50+ | Treatment | 3,000 | 860 | 28.7% |
a) Verify: does the treatment win or lose in each age segment?
b) Why does the treatment lose overall despite winning in the Under 50 segment? (Hint: look at the allocation of patients to age groups.)
c) If the randomization was done correctly, how could this imbalance in age group distribution occur?
d) What should the analyst report: the overall result or the segmented result? Justify your answer.
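For part (b), it helps to see each overall rate as an allocation-weighted average of that group's segment rates. A quick check using the table's numbers:

```python
# Segment (patients, completion rate) pairs taken from the table
control = {'under50': (2_000, 0.700), 'over50': (8_000, 0.600)}
treatment = {'under50': (7_000, 0.720), 'over50': (3_000, 860 / 3_000)}

def overall_rate(group):
    """Allocation-weighted average of the segment completion rates."""
    total = sum(n for n, _ in group.values())
    return sum(n * rate for n, rate in group.values()) / total

print(f"Control overall:   {overall_rate(control):.3f}")    # 0.620
print(f"Treatment overall: {overall_rate(treatment):.3f}")  # 0.590
```

The weights (how many patients land in each age segment) differ sharply between arms, which is what drives the reversal.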
Exercise 6: Choosing the Right Metric (Applied)
A food delivery app wants to A/B test a new search ranking algorithm. Propose primary and guardrail metrics for each scenario:
a) Scenario A: The goal is to increase the number of orders placed per user.
b) Scenario B: The goal is to increase customer satisfaction with search results.
c) Scenario C: The goal is to increase revenue for the platform (which earns a commission on each order).
For each scenario, explain why your chosen primary metric is better than at least one alternative metric.
Exercise 7: Design the Experiment (Applied)
A B2B software company wants to test whether AI-generated email subject lines increase open rates for their marketing campaigns. Currently, the average open rate is 22%, and they send 150,000 emails per week. They consider a 2 percentage point increase (22% to 24%) worth detecting.
Write a complete experiment design document including:
- Null and alternative hypotheses
- Primary metric and at least two guardrail metrics
- Randomization unit and method
- Sample size calculation (show your code)
- Planned duration
- Analysis plan (what statistical test, what significance level, when you will analyze)
- One potential pitfall specific to this experiment and how you would mitigate it
Exercise 8: Interpreting Results (Conceptual)
An A/B test produces the following results:
- Control mean: $14.82 per user
- Treatment mean: $15.01 per user
- Relative lift: 1.28%
- p-value: 0.041
- 95% CI for the difference: [$0.01, $0.38]
- n_control: 120,000
- n_treatment: 118,000
a) Is the result statistically significant at alpha = 0.05?
b) The confidence interval includes values very close to zero ($0.01). What does this tell you about the strength of the evidence?
c) If the company's minimum threshold for a "worthwhile" improvement is 2%, should they launch the treatment? Why or why not?
d) What would you recommend as a next step?
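One way to approach part (c) is to express the 2% threshold in absolute dollars and compare it against the reported confidence interval. A small sketch using the numbers above:

```python
control_mean = 14.82
threshold = 0.02 * control_mean      # 2% of control, in dollars per user
ci_low, ci_high = 0.01, 0.38         # 95% CI for the difference

print(f"2% threshold: ${threshold:.3f}")
if ci_low < threshold < ci_high:
    print("The CI straddles the threshold: the data cannot rule the "
          "improvement in or out relative to the 2% bar.")
```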
Exercise 9: A/A Test Validation (Applied)
Write a simulation to validate that your experimentation infrastructure has a correct false positive rate. Your simulation should:
- Generate 5,000 A/A tests (both groups drawn from the same distribution)
- Use a normal distribution with mean = 100 and standard deviation = 25
- Use 10,000 observations per group
- Count how many tests produce a p-value below 0.05
- Verify the false positive rate is approximately 5% (i.e., between 3.5% and 6.5%)
```python
import numpy as np
from scipy import stats

def validate_aa_tests():
    rng = np.random.default_rng(seed=42)
    # Your code here
    pass

validate_aa_tests()
```
Bonus: Modify the simulation to introduce a subtle bug (make group B's mean 0.5 higher than group A's) and show that the measured rejection rate climbs well above the nominal 5%. This is exactly how an A/A test catches a broken assignment pipeline.
Exercise 10: CUPED Implementation (Applied)
StreamFlow has pre-experiment data (hours watched in the 4 weeks before the experiment) and post-experiment data (hours watched during the 3-week experiment) for 50,000 users per group.
a) Implement CUPED variance reduction for this experiment. Use the following simulated data:
```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 50_000

# Pre-experiment hours watched (both groups from same population)
pre_control = rng.lognormal(2.5, 0.8, n)
pre_treatment = rng.lognormal(2.5, 0.8, n)

# Post-experiment (treatment has a true 3% lift)
noise_c = rng.normal(0, 4.0, n)
noise_t = rng.normal(0, 4.0, n)
post_control = 0.7 * pre_control + noise_c + 5.0
post_treatment = (0.7 * pre_treatment + noise_t + 5.0) * 1.03

# Standard analysis
# Your code here

# CUPED analysis
# Your code here
```
b) What is the variance reduction achieved by CUPED?
c) If the standard analysis gives a p-value of 0.03, what does the CUPED analysis give? Is the improvement meaningful?
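One possible structure for part (a): estimate theta as cov(post, pre) / var(pre) on the pooled data, subtract theta * (pre - mean(pre)) from each user's post-period value, then compare the variance of the raw and adjusted metrics. The sketch below regenerates data with the same seed as the snippet above; exact figures depend on that seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n = 50_000
pre_control = rng.lognormal(2.5, 0.8, n)
pre_treatment = rng.lognormal(2.5, 0.8, n)
noise_c = rng.normal(0, 4.0, n)
noise_t = rng.normal(0, 4.0, n)
post_control = 0.7 * pre_control + noise_c + 5.0
post_treatment = (0.7 * pre_treatment + noise_t + 5.0) * 1.03

# Standard analysis: plain two-sample t-test on the post metric
t_std, p_std = stats.ttest_ind(post_treatment, post_control)

# CUPED: theta = cov(post, pre) / var(pre), estimated on pooled data
pre_all = np.concatenate([pre_control, pre_treatment])
post_all = np.concatenate([post_control, post_treatment])
cov = np.cov(post_all, pre_all)
theta = cov[0, 1] / cov[1, 1]
adj_control = post_control - theta * (pre_control - pre_all.mean())
adj_treatment = post_treatment - theta * (pre_treatment - pre_all.mean())
t_cuped, p_cuped = stats.ttest_ind(adj_treatment, adj_control)

reduction = 1 - adj_control.var() / post_control.var()
print(f"Variance reduction: {reduction:.1%}, p {p_std:.3g} -> {p_cuped:.3g}")
```

Because the post metric here is strongly correlated with the pre metric, the variance reduction is substantial; in production the gain depends on that correlation.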
Exercise 11: Experiment Duration Planning (Applied)
You are planning an experiment for a mobile game that has the following characteristics:
- 500,000 daily active users
- Primary metric: average revenue per daily active user (ARPDAU) = $0.42
- Standard deviation of ARPDAU: $2.80
- Minimum detectable effect: 5% relative lift
- Weekend revenue is 40% higher than weekday revenue
a) Calculate the required sample size per group.
b) How many days would the experiment need to run, assuming 50/50 traffic split?
c) Explain why you should round up to a multiple of 7 days, even if the sample size calculation says you only need 4 days.
d) A PM asks: "Can we just run on weekends to detect the effect faster since revenue is higher?" What is wrong with this suggestion?
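A starting point for parts (a) and (b), using the closed-form normal approximation for a difference in means. The exercise does not state alpha or power, so 0.05 and 0.80 are assumed here; the duration line also assumes every DAU is enrolled, which in practice overstates how fast you accrue sample:

```python
import numpy as np
from scipy.stats import norm

mean_arpdau, sd, relative_mde = 0.42, 2.80, 0.05
dau, alpha, power = 500_000, 0.05, 0.80

delta = relative_mde * mean_arpdau                  # $0.021 absolute MDE
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
n_per_group = int(np.ceil(2 * (sd / delta) ** 2 * z ** 2))

# Naive duration if all DAU were split 50/50; real enrollment is slower
days = int(np.ceil(2 * n_per_group / dau))
print(f"n per group: {n_per_group:,}, minimum days (naive): {days}")
```

Whatever the raw day count, part (c) explains why the run should still be padded out to full weeks.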
Exercise 12: The Stakeholder Conversation (Conceptual)
You run an A/B test for a new ML-powered pricing algorithm. Results:
- Primary metric (revenue): +1.1%, p = 0.08
- Guardrail metric (customer complaints): +23%, p = 0.002
- Guardrail metric (return rate): +8%, p = 0.04
The Head of Revenue says: "The revenue result is almost significant and the complaints will settle down once customers get used to the new prices. Let's launch."
a) Write a 3-4 sentence response explaining why you would not recommend launching.
b) What additional analysis would you propose before making a final decision?
c) Draft an alternative recommendation that addresses the stakeholder's desire for revenue improvement while respecting the experimental evidence.
Exercise 13: Designing for Interference (Advanced)
A social media platform wants to test a new content recommendation algorithm. The problem: if user A is in the treatment group and shares a recommended post with user B (who is in the control group), user B's experience has been contaminated by the treatment. This is called network interference or spillover.
a) Explain why standard A/B testing assumptions are violated in the presence of network interference.
b) Propose two different experimental designs that could mitigate this problem. (Hint: consider cluster randomization and time-based designs.)
c) What are the tradeoffs of each design you proposed?
Return to the chapter for full context on these exercises.