Quiz: Chapter 3
Experimental Design and A/B Testing
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
In an A/B test, the null hypothesis states that:
- A) The treatment group will have a higher metric value than the control group
- B) There is no difference between the treatment and control groups; any observed difference is due to random variation
- C) The experiment was not properly randomized
- D) The sample size is too small to detect a difference
Answer: B) There is no difference between the treatment and control groups; any observed difference is due to random variation. The null hypothesis is the default assumption that the treatment has no effect. The purpose of the experiment is to gather evidence against this assumption.
Question 2 (Multiple Choice)
Which of the following correctly describes a Type I error?
- A) Failing to detect a real effect
- B) Concluding there is an effect when none exists (false positive)
- C) Using the wrong statistical test for the data
- D) Running the experiment for too short a duration
Answer: B) Concluding there is an effect when none exists (false positive). A Type I error occurs when you reject the null hypothesis even though it is true. The probability of a Type I error is controlled by the significance level (alpha), typically set at 0.05.
Question 3 (Short Answer)
Explain the difference between statistical significance and practical significance. Give an example where a result is statistically significant but not practically significant.
Answer: Statistical significance means the observed difference is unlikely to be due to random chance (p-value below alpha). Practical significance means the effect is large enough to matter for the business or application. For example, an A/B test on a website with 10 million users might detect a statistically significant 0.01% increase in click-through rate (p = 0.02), but a 0.01% lift translates to only 1,000 additional clicks per month --- far too small to justify the engineering cost of implementing the change.
Question 4 (Multiple Choice)
A data scientist plans an experiment and determines they need 50,000 users per group. The site has 20,000 daily active users. What is the minimum experiment duration?
- A) 2.5 days
- B) 5 days
- C) 7 days (one full week)
- D) 14 days (two full weeks)
Answer: C) 7 days (one full week). While the raw calculation suggests 5 days (50,000 per group / 10,000 per group per day with a 50/50 split), experiments should always run for at least one full week to capture day-of-week effects. User behavior on weekdays differs significantly from weekends, and a partial week would introduce systematic bias.
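The duration arithmetic above can be sketched as a small helper. The function name, the even-split assumption, and the `traffic_fraction` parameter are illustrative, not from the chapter:

```python
import math

def min_duration_days(users_per_group, daily_active_users, groups=2,
                      traffic_fraction=1.0):
    """Days needed to fill all groups, rounded up to whole weeks.

    Assumes daily active users are split evenly across groups and each
    user is counted once (repeat visits do not add to the count).
    """
    per_group_per_day = daily_active_users * traffic_fraction / groups
    raw_days = math.ceil(users_per_group / per_group_per_day)
    # Round up to a full number of weeks to capture day-of-week effects.
    return max(7, math.ceil(raw_days / 7) * 7)

print(min_duration_days(50_000, 20_000))  # raw calculation gives 5 -> 7 days
```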
Question 5 (Multiple Choice)
What is the primary purpose of an A/A test?
- A) To determine whether the treatment is better than the control
- B) To validate that the randomization and measurement infrastructure are working correctly
- C) To increase statistical power by combining two control groups
- D) To test two different treatments against each other
Answer: B) To validate that the randomization and measurement infrastructure are working correctly. In an A/A test, both groups receive the same experience. If the test shows a significant difference, something is broken in the randomization or measurement system. The expected false positive rate of an A/A test should match the significance level (alpha).
Question 6 (Short Answer)
A PM checks A/B test results daily. On day 4 of a 21-day experiment, they see p = 0.03 and want to ship the treatment immediately. Explain why this is problematic and what the PM should do instead.
Answer: This is the peeking problem. Checking results repeatedly and stopping as soon as you see significance inflates the false positive rate well beyond the nominal alpha level. The daily checks are not independent of one another (each re-uses all the data seen so far), but each one is another opportunity to cross the significance threshold by chance, and across 21 daily checks the cumulative false positive rate can reach 15-25% even when the nominal alpha is 5%. The PM should wait until the pre-specified analysis date (day 21) to evaluate results, or the team should implement sequential testing methods (such as the mSPRT) that adjust significance thresholds for continuous monitoring.
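The inflation from peeking can be demonstrated with a quick Monte Carlo sketch. All parameters here are illustrative: a two-sample z-test with known unit variance and no true effect, re-checked after every daily batch of users.

```python
import math
import random

def peeking_false_positive_rate(n_checks=21, users_per_check=100,
                                z_crit=1.96, n_sims=2000, seed=0):
    """Fraction of null experiments 'shipped' when a two-sample z-test
    (known variance 1, no true effect) is re-run after every batch and
    the experiment stops at the first |z| above z_crit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0  # users per group so far
        for _ in range(n_checks):
            sum_a += sum(rng.gauss(0, 1) for _ in range(users_per_check))
            sum_b += sum(rng.gauss(0, 1) for _ in range(users_per_check))
            n += users_per_check
            # z = (mean_a - mean_b) / sqrt(1/n + 1/n), which simplifies to:
            z = (sum_a - sum_b) / math.sqrt(2 * n)
            if abs(z) > z_crit:
                hits += 1  # a "significant" result under the null
                break
    return hits / n_sims

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

A single analysis at the end would reject about 5% of the time; stopping at the first significant daily check rejects several times as often.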
Question 7 (Multiple Choice)
An experiment tests a new feature across 8 different metrics simultaneously, without applying any multiple testing correction. The true false positive rate for finding at least one "significant" result (when none of the metrics are truly affected) is approximately:
- A) 5%
- B) 8%
- C) 20%
- D) 34%
Answer: D) 34%. The probability of at least one false positive is 1 - (1 - 0.05)^8 = 1 - 0.95^8 = 1 - 0.6634 = 0.3366, approximately 34%. This assumes the metrics are independent; correlated metrics would produce a somewhat lower rate, but the inflation is still substantial.
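The calculation generalizes to any alpha and any number of independent metrics:

```python
def familywise_error_rate(alpha, n_metrics):
    """P(at least one false positive) across independent tests,
    each run at significance level alpha with no true effects."""
    return 1 - (1 - alpha) ** n_metrics

print(round(familywise_error_rate(0.05, 8), 4))  # 0.3366
```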
Question 8 (Multiple Choice)
The Bonferroni correction for multiple testing works by:
- A) Increasing the sample size proportionally to the number of tests
- B) Dividing the significance level (alpha) by the number of tests
- C) Averaging the p-values across all tests
- D) Restricting analysis to the single best-performing metric
Answer: B) Dividing the significance level (alpha) by the number of tests. With 10 tests and alpha = 0.05, the Bonferroni-corrected threshold is 0.05 / 10 = 0.005. Only results below this threshold are considered significant. This controls the family-wise error rate but is conservative --- the Benjamini-Hochberg (FDR) procedure is a less conservative alternative.
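A minimal sketch of the correction (the helper name and the example p-values are illustrative):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each result as significant only if its p-value falls
    below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

print(bonferroni_significant([0.003, 0.02, 0.04], alpha=0.05))
# threshold is 0.05 / 3 ~= 0.0167 -> [True, False, False]
```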
Question 9 (Short Answer)
Define guardrail metrics and explain why they are important in A/B testing. Give two examples for an e-commerce website testing a new checkout flow.
Answer: Guardrail metrics are metrics that must not degrade meaningfully during an experiment, even if the primary metric improves. They protect against unintended negative consequences of a change. For an e-commerce checkout test, guardrails might include: (1) page load time (a slow new checkout could frustrate users even if conversion improves for those who complete it), and (2) customer support contact rate (an increase in support tickets suggests the new flow is confusing, even if some users convert more successfully).
Question 10 (Multiple Choice)
Simpson's paradox occurs when:
- A) The sample size is too small to detect an effect
- B) A trend that appears in several subgroups reverses when the subgroups are combined
- C) The control and treatment groups have different sizes
- D) The primary metric and guardrail metrics show contradictory results
Answer: B) A trend that appears in several subgroups reverses when the subgroups are combined. This happens because of unequal representation of subgroups across treatment and control. The defense is to analyze results within key segments and use stratified analysis to ensure the aggregate result is not misleading.
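A tiny numeric illustration with made-up counts: the treatment wins within each segment, yet loses in aggregate because its users skew toward the low-converting segment.

```python
# Hypothetical (conversions, users) per (group, segment).
data = {
    ("treatment", "new_users"):       (80, 1000),   # 8.0%
    ("treatment", "returning_users"): (12, 100),    # 12.0%
    ("control",   "new_users"):       (5, 100),     # 5.0%
    ("control",   "returning_users"): (100, 1000),  # 10.0%
}

def rate(group, segment=None):
    """Conversion rate for a group, optionally within one segment."""
    items = [(c, n) for (g, s), (c, n) in data.items()
             if g == group and (segment is None or s == segment)]
    return sum(c for c, _ in items) / sum(n for _, n in items)

# Treatment is better in every segment...
assert rate("treatment", "new_users") > rate("control", "new_users")
assert rate("treatment", "returning_users") > rate("control", "returning_users")
# ...but worse in aggregate: treatment users are concentrated in the
# low-converting new_users segment.
assert rate("treatment") < rate("control")
print(round(rate("treatment"), 3), round(rate("control"), 3))  # 0.084 0.095
```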
Question 11 (Multiple Choice)
Increasing statistical power from 0.80 to 0.90 while keeping all other parameters constant will:
- A) Decrease the required sample size
- B) Increase the required sample size
- C) Have no effect on sample size
- D) Reduce the false positive rate
Answer: B) Increase the required sample size. Higher power means a higher probability of detecting a true effect, which requires more data. Under the standard normal-approximation sample size formula, raising power from 0.80 to 0.90 increases the required sample size by roughly a third (about 34%).
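The roughly-one-third figure follows from the normal-approximation sample size formula for a two-sided two-sample z-test. This sketch uses Python's `statistics.NormalDist`; the effect size and standard deviation values are illustrative:

```python
from statistics import NormalDist

def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Sample size per group: n = 2 * ((z_{1-a/2} + z_{power}) * sd / effect)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return 2 * ((z_alpha + z_power) * sd / effect) ** 2

n80 = n_per_group(effect=0.1, sd=1.0, power=0.80)
n90 = n_per_group(effect=0.1, sd=1.0, power=0.90)
print(round(n90 / n80, 2))  # ~1.34: about a third more samples
```

Note the ratio does not depend on the effect size or standard deviation; those cancel out.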
Question 12 (Short Answer)
An A/B test shows a 2.1% lift in revenue with p = 0.04 and a 95% confidence interval of [0.1%, 4.1%]. The PM asks: "So the true lift is 2.1%?" How would you correct this interpretation?
Answer: The 2.1% is the point estimate --- the best single guess for the true effect. But the true effect could plausibly be anywhere within the confidence interval of [0.1%, 4.1%] (and, across repeated experiments, intervals constructed this way will miss the true value about 5% of the time). The correct interpretation is: "We are 95% confident that the true lift is between 0.1% and 4.1%. Our best estimate is 2.1%, but the actual improvement could be as low as 0.1% or as high as 4.1%." The wide interval suggests meaningful uncertainty about the effect size.
Question 13 (Multiple Choice)
CUPED (Controlled-experiment Using Pre-Experiment Data) improves A/B testing by:
- A) Increasing the size of the treatment effect
- B) Eliminating the need for a control group
- C) Reducing metric variance using pre-experiment data, thereby increasing statistical power
- D) Correcting for multiple testing across metrics
Answer: C) Reducing metric variance using pre-experiment data, thereby increasing statistical power. CUPED uses the correlation between pre-experiment and during-experiment metrics to remove noise, reducing the standard error of the treatment effect estimate. This allows experiments to reach significance faster or detect smaller effects with the same sample size.
Question 14 (Multiple Choice)
Which of the following is the best reason to use a two-sided test instead of a one-sided test?
- A) Two-sided tests require less data
- B) Two-sided tests can detect both improvements and degradations
- C) Two-sided tests have lower false positive rates
- D) Two-sided tests are easier to compute
Answer: B) Two-sided tests can detect both improvements and degradations. A one-sided test only has power to detect effects in one direction and cannot detect harm in the other direction. In practice, treatments can have unexpected negative effects, and failing to detect harm is a serious risk. Two-sided tests provide a more complete picture at the cost of slightly less power in any single direction.
Question 15 (Short Answer)
An experiment on a new recommendation algorithm shows a 4.5% lift in click-through rate during week 1, but only a 1.8% lift during week 3. What phenomenon might explain this pattern, and how should it affect your decision about whether to launch?
Answer: This pattern is consistent with a novelty effect, where users interact more with a new experience simply because it is unfamiliar and interesting. As the novelty wears off, engagement returns closer to baseline. The sustained effect (week 3 onward, approximately 1.8%) is a better estimate of the long-term impact than the week 1 number. The launch decision should be based on the stabilized effect, not the inflated initial result. If 1.8% still exceeds the minimum threshold for practical significance, the launch may be justified; if the team was counting on a 4.5% lift, the economics may no longer work.
This quiz covers Chapter 3: Experimental Design and A/B Testing. Return to the chapter for full context.