Exercises: Chapter 3
Experimental Design and A/B Testing
Exercise 1: Hypothesis Formulation (Conceptual)
A ride-sharing company wants to test whether showing drivers an estimated earnings badge on the dispatch screen increases the acceptance rate for ride requests. Currently, 72% of requests are accepted.
a) Write the null hypothesis and alternative hypothesis. Be precise about the metric.
b) Should this be a one-sided or two-sided test? Justify your answer.
c) What is the randomization unit? Why?
d) Name two guardrail metrics that should be monitored alongside the primary metric.
Exercise 2: Power Analysis Calculation (Applied)
A SaaS company wants to test a new onboarding flow. Their current trial-to-paid conversion rate is 12%. They want to detect a 2 percentage point increase (12% to 14%). They have 8,000 new trial signups per week.
a) Calculate the required sample size per group using statsmodels. Use alpha = 0.05 and power = 0.80.
```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

# Fill in the blanks
baseline = ___
mde_absolute = ___

std_dev = np.sqrt(baseline * (1 - baseline))
effect_size = mde_absolute / std_dev

analysis = NormalIndPower()
n_per_group = int(np.ceil(
    analysis.solve_power(
        effect_size=effect_size,
        alpha=___,
        power=___,
        alternative='two-sided'
    )
))
print(f"Required per group: {n_per_group:,}")
```
b) How many weeks would the experiment need to run? Show your calculation.
c) The VP of Product says: "Two weeks is too long. Can we cut the test to one week?" What would you need to change about the experimental design to make a one-week test viable? What are the tradeoffs?
d) Re-run the power analysis with power = 0.90 instead of 0.80. How does the required sample size change? Why does this matter?
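As a cross-check on your statsmodels answer, the same sample size can be computed in closed form from normal quantiles. This sketch uses the exercise's numbers and assumes equal group sizes and the baseline-variance approximation from the skeleton above; it is a sanity check, not the only valid formula:

```python
import numpy as np
from scipy.stats import norm

def n_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Closed-form two-sample size under the normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)            # quantile for the desired power
    variance = baseline * (1 - baseline)
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2))

n80 = n_per_group(0.12, 0.02, power=0.80)
n90 = n_per_group(0.12, 0.02, power=0.90)
print(n80, n90)  # higher power demands a larger sample
```

Comparing `n80` and `n90` directly answers part (d): the jump from 80% to 90% power is not free.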
Exercise 3: Identify the Peeking Problem (Conceptual)
A data analyst runs an A/B test on a new checkout page design. The experiment is planned for 28 days. The analyst checks the results every day. Here is a log of the p-values:
| Day | p-value |
|---|---|
| 3 | 0.42 |
| 5 | 0.18 |
| 7 | 0.04 |
| 10 | 0.09 |
| 14 | 0.12 |
| 21 | 0.07 |
| 28 | 0.06 |
a) On day 7, the analyst reports to the team: "The new checkout page significantly increases conversion (p = 0.04). We should ship it." What is wrong with this conclusion?
b) What is the approximate actual false positive rate if the analyst stops the experiment the first time the p-value drops below 0.05?
c) If the company wanted to monitor results during the experiment without inflating the false positive rate, what approach should they use?
d) The final result at day 28 is p = 0.06 (not significant at alpha = 0.05). Does this mean the new checkout page definitely does not work? Explain.
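To build intuition for part (b), you can simulate the peeking strategy directly. The sketch below is my own illustration (the daily traffic numbers are arbitrary, not from the chapter): it runs many null experiments, applies a t-test at weekly interim looks, stops at the first p < 0.05, and reports how often a "significant" result is declared even though no effect exists:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_sims = 500
looks = [7, 14, 21, 28]   # interim analysis days
per_day = 50              # assumed users per group per day

false_positives = 0
for _ in range(n_sims):
    # Null experiment: both groups drawn from the same distribution
    a = rng.normal(0.0, 1.0, looks[-1] * per_day)
    b = rng.normal(0.0, 1.0, looks[-1] * per_day)
    for day in looks:
        k = day * per_day
        if stats.ttest_ind(a[:k], b[:k]).pvalue < 0.05:
            false_positives += 1  # stopped early on a false signal
            break

rate = false_positives / n_sims
print(f"Realized false positive rate with peeking: {rate:.3f}")
```

With four looks, the realized rate lands well above the nominal 5%, which is exactly the peeking problem.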
Exercise 4: Multiple Testing Correction (Applied)
An e-commerce team runs a single A/B test but analyzes 10 different metrics. The p-values are:
| Metric | p-value |
|---|---|
| Revenue per user | 0.032 |
| Conversion rate | 0.061 |
| Average order value | 0.088 |
| Items per cart | 0.041 |
| Click-through rate | 0.003 |
| Bounce rate | 0.210 |
| Time on site | 0.154 |
| Pages per session | 0.048 |
| Return rate | 0.520 |
| Customer satisfaction | 0.390 |
a) Without any correction, how many metrics are "significant" at alpha = 0.05?
b) Apply the Bonferroni correction. Which metrics remain significant?
```python
alpha = 0.05
n_tests = 10
bonferroni_alpha = alpha / n_tests

p_values = [0.032, 0.061, 0.088, 0.041, 0.003,
            0.210, 0.154, 0.048, 0.520, 0.390]
metrics = ['RPU', 'Conv Rate', 'AOV', 'Items/Cart', 'CTR',
           'Bounce', 'Time', 'Pages', 'Return Rate', 'CSAT']

# Which metrics survive Bonferroni correction?
for metric, p in zip(metrics, p_values):
    pass  # Your analysis here
```
c) Apply the Benjamini-Hochberg (FDR) procedure using statsmodels.stats.multitest.multipletests. Which metrics are significant under FDR control at 0.05?
d) The PM wants to declare victory based on CTR (p = 0.003). Is this justified? Why or why not?
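If statsmodels is not at hand, the Benjamini-Hochberg step-up rule is simple enough to implement directly. This hand-rolled version is an illustrative sketch for checking your part (c) answer; for these inputs it agrees with `multipletests(method='fdr_bh')`:

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest rank with p_(k) <= (k / m) * q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest passing rank
        reject[order[: k + 1]] = True
    return reject

p_values = [0.032, 0.061, 0.088, 0.041, 0.003,
            0.210, 0.154, 0.048, 0.520, 0.390]
metrics = ['RPU', 'Conv Rate', 'AOV', 'Items/Cart', 'CTR',
           'Bounce', 'Time', 'Pages', 'Return Rate', 'CSAT']
significant = [m for m, r in zip(metrics, bh_reject(p_values)) if r]
print(significant)  # → ['CTR']
```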
Exercise 5: Simpson's Paradox (Conceptual)
A healthcare company tests a new patient onboarding flow. The overall results show:
| Group | Patients | Completed Onboarding | Completion Rate |
|---|---|---|---|
| Control | 10,000 | 6,200 | 62.0% |
| Treatment | 10,000 | 5,900 | 59.0% |
The treatment appears to hurt onboarding completion. But when broken down by patient age:
| Age Group | Group | Patients | Completed | Rate |
|---|---|---|---|---|
| Under 50 | Control | 2,000 | 1,400 | 70.0% |
| Under 50 | Treatment | 7,000 | 5,040 | 72.0% |
| 50+ | Control | 8,000 | 4,800 | 60.0% |
| 50+ | Treatment | 3,000 | 860 | 28.7% |
a) Verify: does the treatment win or lose in each age segment?
b) Why does the treatment lose overall despite winning in the Under 50 segment? (Hint: look at the allocation of patients to age groups.)
c) If the randomization was done correctly, how could this imbalance in age group distribution occur?
d) What should the analyst report: the overall result or the segmented result? Justify your answer.
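For part (b), it helps to see each overall rate as an allocation-weighted average of that group's segment rates. A quick check using the table's numbers:

```python
# Segment (patients, completion rate) pairs taken from the table
control = {'under50': (2_000, 0.700), 'over50': (8_000, 0.600)}
treatment = {'under50': (7_000, 0.720), 'over50': (3_000, 860 / 3_000)}

def overall_rate(group):
    """Allocation-weighted average of the segment completion rates."""
    total = sum(n for n, _ in group.values())
    return sum(n * rate for n, rate in group.values()) / total

print(f"Control overall:   {overall_rate(control):.3f}")    # 0.620
print(f"Treatment overall: {overall_rate(treatment):.3f}")  # 0.590
```

The weights (how many patients land in each age segment) differ sharply between arms, which is what drives the reversal.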
Exercise 6: Choosing the Right Metric (Applied)
A food delivery app wants to A/B test a new search ranking algorithm. Propose primary and guardrail metrics for each scenario:
a) Scenario A: The goal is to increase the number of orders placed per user.
b) Scenario B: The goal is to increase customer satisfaction with search results.
c) Scenario C: The goal is to increase revenue for the platform (which earns a commission on each order).
For each scenario, explain why your chosen primary metric is better than at least one alternative metric.
Exercise 7: Design the Experiment (Applied)
A B2B software company wants to test whether AI-generated email subject lines increase open rates for their marketing campaigns. Currently, the average open rate is 22%, and they send 150,000 emails per week. They consider a 2 percentage point increase (22% to 24%) worth detecting.
Write a complete experiment design document including:
- Null and alternative hypotheses
- Primary metric and at least two guardrail metrics
- Randomization unit and method
- Sample size calculation (show your code)
- Planned duration
- Analysis plan (what statistical test, what significance level, when you will analyze)
- One potential pitfall specific to this experiment and how you would mitigate it
Exercise 8: Interpreting Results (Conceptual)
An A/B test produces the following results:
- Control mean: $14.82 per user
- Treatment mean: $15.01 per user
- Relative lift: 1.28%
- p-value: 0.041
- 95% CI for the difference: [$0.01, $0.38]
- n_control: 120,000
- n_treatment: 118,000
a) Is the result statistically significant at alpha = 0.05?
b) The confidence interval includes values very close to zero ($0.01). What does this tell you about the strength of the evidence?
c) If the company's minimum threshold for a "worthwhile" improvement is 2%, should they launch the treatment? Why or why not?
d) What would you recommend as a next step?
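One way to approach part (c) is to express the 2% threshold in absolute dollars and compare it against the reported confidence interval. A small sketch using the numbers above:

```python
control_mean = 14.82
threshold = 0.02 * control_mean      # 2% of control, in dollars per user
ci_low, ci_high = 0.01, 0.38         # 95% CI for the difference

print(f"2% threshold: ${threshold:.3f}")
if ci_low < threshold < ci_high:
    print("The CI straddles the threshold: the data cannot rule the "
          "improvement in or out relative to the 2% bar.")
```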
Exercise 9: A/A Test Validation (Applied)
Write a simulation to validate that your experimentation infrastructure has a correct false positive rate. Your simulation should:
- Generate 5,000 A/A tests (both groups drawn from the same distribution)
- Use a normal distribution with mean = 100 and standard deviation = 25
- Use 10,000 observations per group
- Count how many tests produce a p-value below 0.05
- Verify the false positive rate is approximately 5% (i.e., between 3.5% and 6.5%)
```python
import numpy as np
from scipy import stats

def validate_aa_tests():
    rng = np.random.default_rng(seed=42)
    # Your code here
    pass

validate_aa_tests()
```
Bonus: Modify the simulation to introduce a subtle bug (make group B's mean 0.5 higher than group A's) and show that the measured rejection rate climbs well above the nominal 5%. This is exactly how an A/A test catches a broken assignment pipeline.
Exercise 10: CUPED Implementation (Applied)
StreamFlow has pre-experiment data (hours watched in the 4 weeks before the experiment) and post-experiment data (hours watched during the 3-week experiment) for 50,000 users per group.
a) Implement CUPED variance reduction for this experiment. Use the following simulated data:
```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 50_000

# Pre-experiment hours watched (both groups from same population)
pre_control = rng.lognormal(2.5, 0.8, n)
pre_treatment = rng.lognormal(2.5, 0.8, n)

# Post-experiment (treatment has a true 3% lift)
noise_c = rng.normal(0, 4.0, n)
noise_t = rng.normal(0, 4.0, n)
post_control = 0.7 * pre_control + noise_c + 5.0
post_treatment = (0.7 * pre_treatment + noise_t + 5.0) * 1.03

# Standard analysis
# Your code here

# CUPED analysis
# Your code here
```
b) What is the variance reduction achieved by CUPED?
c) If the standard analysis gives a p-value of 0.03, what does the CUPED analysis give? Is the improvement meaningful?
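One possible structure for part (a): estimate theta as cov(post, pre) / var(pre) on the pooled data, subtract theta * (pre - mean(pre)) from each user's post-period value, then compare the variance of the raw and adjusted metrics. The sketch below regenerates data with the same seed as the snippet above; exact figures depend on that seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n = 50_000
pre_control = rng.lognormal(2.5, 0.8, n)
pre_treatment = rng.lognormal(2.5, 0.8, n)
noise_c = rng.normal(0, 4.0, n)
noise_t = rng.normal(0, 4.0, n)
post_control = 0.7 * pre_control + noise_c + 5.0
post_treatment = (0.7 * pre_treatment + noise_t + 5.0) * 1.03

# Standard analysis: plain two-sample t-test on the post metric
t_std, p_std = stats.ttest_ind(post_treatment, post_control)

# CUPED: theta = cov(post, pre) / var(pre), estimated on pooled data
pre_all = np.concatenate([pre_control, pre_treatment])
post_all = np.concatenate([post_control, post_treatment])
cov = np.cov(post_all, pre_all)
theta = cov[0, 1] / cov[1, 1]
adj_control = post_control - theta * (pre_control - pre_all.mean())
adj_treatment = post_treatment - theta * (pre_treatment - pre_all.mean())
t_cuped, p_cuped = stats.ttest_ind(adj_treatment, adj_control)

reduction = 1 - adj_control.var() / post_control.var()
print(f"Variance reduction: {reduction:.1%}, p {p_std:.3g} -> {p_cuped:.3g}")
```

Because the post metric here is strongly correlated with the pre metric, the variance reduction is substantial; in production the gain depends on that correlation.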
Exercise 11: Experiment Duration Planning (Applied)
You are planning an experiment for a mobile game that has the following characteristics:
- 500,000 daily active users
- Primary metric: average revenue per daily active user (ARPDAU) = $0.42
- Standard deviation of ARPDAU: $2.80
- Minimum detectable effect: 5% relative lift
- Weekend revenue is 40% higher than weekday revenue
a) Calculate the required sample size per group.
b) How many days would the experiment need to run, assuming 50/50 traffic split?
c) Explain why you should round up to a multiple of 7 days, even if the sample size calculation says you only need 4 days.
d) A PM asks: "Can we just run on weekends to detect the effect faster since revenue is higher?" What is wrong with this suggestion?
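A starting point for parts (a) and (b), using the closed-form normal approximation for a difference in means. The exercise does not state alpha or power, so 0.05 and 0.80 are assumed here; the duration line also assumes every DAU is enrolled, which in practice overstates how fast you accrue sample:

```python
import numpy as np
from scipy.stats import norm

mean_arpdau, sd, relative_mde = 0.42, 2.80, 0.05
dau, alpha, power = 500_000, 0.05, 0.80

delta = relative_mde * mean_arpdau                  # $0.021 absolute MDE
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
n_per_group = int(np.ceil(2 * (sd / delta) ** 2 * z ** 2))

# Naive duration if all DAU were split 50/50; real enrollment is slower
days = int(np.ceil(2 * n_per_group / dau))
print(f"n per group: {n_per_group:,}, minimum days (naive): {days}")
```

Whatever the raw day count, part (c) explains why the run should still be padded out to full weeks.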
Exercise 12: The Stakeholder Conversation (Conceptual)
You run an A/B test for a new ML-powered pricing algorithm. Results:
- Primary metric (revenue): +1.1%, p = 0.08
- Guardrail metric (customer complaints): +23%, p = 0.002
- Guardrail metric (return rate): +8%, p = 0.04
The Head of Revenue says: "The revenue result is almost significant and the complaints will settle down once customers get used to the new prices. Let's launch."
a) Write a 3-4 sentence response explaining why you would not recommend launching.
b) What additional analysis would you propose before making a final decision?
c) Draft an alternative recommendation that addresses the stakeholder's desire for revenue improvement while respecting the experimental evidence.
Exercise 13: Designing for Interference (Advanced)
A social media platform wants to test a new content recommendation algorithm. The problem: if user A is in the treatment group and shares a recommended post with user B (who is in the control group), user B's experience has been contaminated by the treatment. This is called network interference or spillover.
a) Explain why standard A/B testing assumptions are violated in the presence of network interference.
b) Propose two different experimental designs that could mitigate this problem. (Hint: consider cluster randomization and time-based designs.)
c) What are the tradeoffs of each design you proposed?
Return to the chapter for full context on these exercises.