Chapter 33: Quiz
Test your understanding of rigorous experimentation at scale. Answers follow each question.
Question 1
What is the Stable Unit Treatment Value Assumption (SUTVA), and why is it the most commonly violated assumption in online A/B tests?
Answer
SUTVA states that the potential outcome for unit $i$ depends only on unit $i$'s own treatment assignment, not on the assignments of other units: $Y_i(D_1, D_2, \ldots, D_N) = Y_i(D_i)$. It is the most commonly violated assumption in online experiments because users interact: they share content (social network interference), compete for scarce resources (marketplace interference), or are served by shared systems (algorithmic interference). When treatment users share content with control users, the control group's behavior is affected by the treatment, biasing the estimated treatment effect. Unlike ignorability (which randomization guarantees) and positivity (which the design trivially satisfies), SUTVA depends on the structure of the system being experimented on and cannot be guaranteed by the experimental design alone.
Question 2
A naive A/B test on a social platform shows a treatment effect of 1.0 minute of daily engagement. The true direct effect is 1.5 minutes, but positive spillover inflates the control group's engagement. Explain why the naive estimate is biased downward — specifically, decompose the bias into its components.
Answer
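A minimal simulation sketch of the bias (Gaussian noise assumed; $\tau = 1.5$ and $\delta = 0.5$ are the question's values, while the baseline mean and noise scale are arbitrary):

```python
import random

random.seed(0)
mu0, tau, delta = 10.0, 1.5, 0.5  # baseline mean, direct effect, spillover
n = 200_000

# Treated users get the direct effect; control users absorb the spillover.
y_treat = [mu0 + tau + random.gauss(0, 2) for _ in range(n)]
y_ctrl = [mu0 + delta + random.gauss(0, 2) for _ in range(n)]

naive = sum(y_treat) / n - sum(y_ctrl) / n
print(round(naive, 2))  # near tau - delta = 1.0, not the true tau = 1.5
```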
The naive estimator computes $\hat{\tau} = \bar{Y}_T - \bar{Y}_C$. Under positive spillover, treated users share content with control users, increasing the control group's engagement. Let $\delta$ be the average spillover effect on control users. Then $\bar{Y}_C = \mu_0 + \delta$ (where $\mu_0$ is the true control mean without any treatment in the system) and $\bar{Y}_T = \mu_0 + \tau$ (the true direct effect). The naive estimate is $\hat{\tau} = (\mu_0 + \tau) - (\mu_0 + \delta) = \tau - \delta$. Since $\delta > 0$ (positive spillover), the estimate is $\tau - \delta < \tau$ — biased downward. In this example, $\tau = 1.5$ and $\delta = 0.5$, so the naive estimate is $1.5 - 0.5 = 1.0$, underestimating the true direct effect by 33%.
Question 3
Describe the cluster-randomized experimental design. What is the design effect, and how does it relate to the intracluster correlation coefficient (ICC)?
Answer
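The design-effect arithmetic as a one-liner (values from the worked example in the answer):

```python
def design_effect(m: float, icc: float) -> float:
    """Variance inflation of cluster vs. individual randomization: 1 + (m - 1) * ICC."""
    return 1 + (m - 1) * icc

deff = design_effect(m=10, icc=0.05)
print(round(deff, 2))  # 1.45: need 45% more users than individual randomization
```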
In a cluster-randomized experiment, entire clusters (e.g., friend groups, geographic regions) are randomly assigned to treatment or control, rather than individual units. All units within a cluster receive the same assignment. If interference occurs only within clusters, this eliminates contamination between treatment and control, because treated and control users never share a cluster. The **design effect** is $\text{DEFF} = 1 + (m - 1)\rho$, where $m$ is the average cluster size and $\rho$ is the ICC (the fraction of total variance that is between-cluster rather than within-cluster). The design effect represents the multiplicative inflation in the variance of the treatment effect estimator relative to individual randomization. For example, with $m = 10$ and $\rho = 0.05$, DEFF = 1.45, meaning the experiment needs 45% more total users to achieve the same statistical power as individual randomization.
Question 4
When would you use a switchback design instead of a cluster-randomized design? What is the main threat to the validity of a switchback design, and how is it mitigated?
Answer
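A sketch of the regression-adjustment mitigation on simulated period-level data (the effect sizes $\tau = 2.0$ and $\gamma = 0.8$ are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000                                 # number of switchback periods
d = rng.integers(0, 2, size=T)           # assignment D_t in each period
d_lag = np.concatenate(([0], d[:-1]))    # previous period's assignment D_{t-1}
y = 5.0 + 2.0 * d + 0.8 * d_lag + rng.normal(0, 1, T)  # carryover baked in

# Fit Y_t = mu + tau * D_t + gamma * D_{t-1} + eps by least squares.
X = np.column_stack([np.ones(T), d, d_lag])
mu_hat, tau_hat, gamma_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(round(tau_hat, 1), round(gamma_hat, 1))  # recovers roughly 2.0 and 0.8
```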
A switchback design is preferred when **no cluster partition adequately internalizes interference** — particularly in two-sided marketplaces where supply and demand are globally connected (e.g., ride-sharing, where drivers move between regions). In a switchback, the entire population alternates between treatment and control across time periods. The main threat is **carryover**: the treatment effect in period $t$ leaks into period $t+1$ because behavior or system state does not reset instantly. This is mitigated by: (1) **washout periods** — inserting untreated buffer periods between alternations; (2) **longer periods** — reducing the fraction of time contaminated by carryover; and (3) **regression adjustment** — explicitly modeling the carryover as a function of the previous period's assignment, e.g., $Y_t = \mu + \tau D_t + \gamma D_{t-1} + \epsilon_t$, where $\gamma$ captures the carryover effect.
Question 5
Explain the synthetic control method. When is it appropriate, and what is the key assumption?
Answer
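A deliberately simplified sketch of the mechanics on simulated trajectories. Note the caveat: the actual method constrains the weights to be nonnegative and sum to one, whereas this illustration substitutes unconstrained least squares on the pre-treatment periods, which suffices here only because the simulated treated unit is built as an exact mixture of controls:

```python
import numpy as np

rng = np.random.default_rng(2)
T_pre, T_post, n_ctrl = 40, 10, 8

# Control units' outcome trajectories as random walks (rows = periods).
controls = rng.normal(0, 1, (T_pre + T_post, n_ctrl)).cumsum(axis=0)

# Treated unit: a fixed mixture of controls, plus a post-period effect of -3.
true_w = np.array([0.5, 0.3, 0.2] + [0.0] * (n_ctrl - 3))
treated = controls @ true_w
treated[T_pre:] += -3.0

# Fit weights on pre-treatment data only (unconstrained LS as a stand-in
# for the constrained synthetic-control weight fit).
w, *_ = np.linalg.lstsq(controls[:T_pre], treated[:T_pre], rcond=None)
synthetic = controls @ w

effect = (treated[T_pre:] - synthetic[T_pre:]).mean()
print(round(effect, 1))  # recovers the -3.0 post-treatment divergence
```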
The synthetic control method constructs a "synthetic" version of a treated unit as a weighted combination of untreated units that matches the treated unit's pre-treatment trajectory. The treatment effect is estimated as the divergence between the treated unit and its synthetic control after the intervention. It is appropriate when there is **only one (or very few) treated unit(s)** and many potential controls — e.g., a policy adopted in one city or country. The key assumption is that the weighted combination of control units would continue to track the treated unit's trajectory in the absence of treatment (**parallel trends in the synthetic control space**). This assumption is made credible by demonstrating good pre-treatment fit: if the synthetic control closely reproduces the treated unit's outcomes for many pre-treatment periods, it is plausible (though not guaranteed) that it would continue to do so without treatment. Placebo tests (applying the method to each control unit) provide additional evidence of validity.
Question 6
Derive the CUPED-adjusted outcome $\tilde{Y}_i = Y_i - \theta(X_i - \bar{X})$. What is the optimal $\theta$, and what variance reduction does it achieve?
Answer
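A numerical check of the $\rho^2$ variance-reduction result, with $(Y, X)$ constructed to have correlation $\rho = 0.65$ (the construction is standard; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 100_000, 0.65

# Construct (Y, X) with correlation rho by mixing independent normals.
x = rng.normal(0, 1, n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)

cov = np.cov(y, x)             # 2x2 sample covariance matrix
theta = cov[0, 1] / cov[1, 1]  # theta* = Cov(Y, X) / Var(X)
y_adj = y - theta * (x - x.mean())

reduction = 1 - y_adj.var() / y.var()
print(round(reduction, 2))  # close to rho**2 = 0.4225
```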
The goal is to choose $\theta$ to minimize $\text{Var}(\tilde{Y})$. Expanding: $\text{Var}(\tilde{Y}) = \text{Var}(Y) - 2\theta\text{Cov}(Y, X) + \theta^2\text{Var}(X)$. Taking the derivative with respect to $\theta$: $\frac{d}{d\theta}\text{Var}(\tilde{Y}) = -2\text{Cov}(Y, X) + 2\theta\text{Var}(X) = 0$. Solving: $\theta^* = \text{Cov}(Y, X) / \text{Var}(X)$, which is the OLS coefficient from regressing $Y$ on $X$. Substituting $\theta^*$ back: $\text{Var}(\tilde{Y}) = \text{Var}(Y) - \text{Cov}(Y,X)^2 / \text{Var}(X) = \text{Var}(Y)(1 - \rho^2)$, where $\rho = \text{Corr}(Y, X)$. The variance reduction factor is $\rho^2$. For StreamRec with $\rho = 0.65$, variance is reduced by 42%, and the effective sample size multiplier is $1/(1-0.42) = 1.72\times$.
Question 7
Why does CUPED preserve the unbiasedness of the treatment effect estimator? Would CUPED be valid if the covariate $X$ were measured during the experiment?
Answer
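A simulation sketch contrasting a pre-experiment covariate with a treatment-affected one. The data-generating process — including the way treatment shifts the in-experiment covariate — is invented for illustration; the true effect is $\tau = 1.0$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
d = rng.integers(0, 2, n)                               # random assignment
x_pre = rng.normal(0, 1, n)                             # pre-experiment covariate
y = 2.0 + 1.0 * d + 0.8 * x_pre + rng.normal(0, 1, n)   # true tau = 1.0
x_post = x_pre + 0.5 * d + rng.normal(0, 0.5, n)        # shifted by treatment!

def cuped_diff(y, x, d):
    """CUPED-adjusted difference in means between treatment and control."""
    c = np.cov(y, x)
    theta = c[0, 1] / c[1, 1]
    y_adj = y - theta * (x - x.mean())
    return y_adj[d == 1].mean() - y_adj[d == 0].mean()

print(round(cuped_diff(y, x_pre, d), 2))   # near the true tau = 1.0
print(round(cuped_diff(y, x_post, d), 2))  # biased well below 1.0
```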
CUPED is unbiased because $X$ is a **pre-experiment** variable and treatment is randomly assigned. By random assignment, $\text{E}[X \mid D=1] = \text{E}[X \mid D=0] = \text{E}[X]$, so subtracting $\theta(X_i - \bar{X})$ removes the same amount (in expectation) from both groups: $\text{E}[\tilde{Y} \mid D=1] - \text{E}[\tilde{Y} \mid D=0] = \text{E}[Y \mid D=1] - \text{E}[Y \mid D=0] = \tau$. If $X$ were measured **during** the experiment, it could be affected by the treatment: $\text{E}[X \mid D=1] \neq \text{E}[X \mid D=0]$. The adjustment would then remove different amounts from treatment and control, biasing the estimator. This is a form of post-treatment bias — adjusting for a variable on the causal pathway between treatment and outcome. CUPED requires that $X$ be a pre-treatment variable.
Question 8
What is the false discovery rate (FDR), and how does the Benjamini-Hochberg procedure control it? Why is FDR control preferred over FWER control in large-scale experimentation platforms?
Answer
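A minimal implementation sketch of the BH step-up rule (the p-values are made up):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):  # rank = position in sorted order
        if pvals[i] <= rank / m * alpha:
            k = rank                           # largest rank passing the test
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.012, 0.041, 0.042, 0.060, 0.200, 0.500]
print(benjamini_hochberg(pvals))  # [0, 1, 2] -- three discoveries
# Bonferroni at alpha/m = 0.00625 would reject only the first hypothesis.
```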
The **FDR** is the expected proportion of false positives among all rejected hypotheses: $\text{FDR} = \text{E}[\text{false positives} / \text{total rejections}]$ (defined as 0 when there are no rejections). The BH procedure sorts $m$ p-values in ascending order and finds the largest $k$ such that $p_{(k)} \leq (k/m)\alpha$, then rejects the hypotheses with the $k$ smallest p-values. This guarantees $\text{FDR} \leq \alpha$ under independence (or positive dependence). FDR control is preferred over FWER control (e.g., Bonferroni) because in large-scale experimentation, some false positives are tolerable — what matters is that the *proportion* of shipped false positives is small. Bonferroni at $\alpha/m$ becomes extremely conservative as $m$ grows (e.g., $\alpha/100 = 0.0005$), rejecting very few true effects. BH maintains meaningful power while controlling the proportion of errors among discoveries.
Question 9
Explain the peeking problem. Under the null hypothesis, if an analyst checks the p-value daily for 14 days and stops when p < 0.05, what is the approximate true type I error rate?
Answer
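A simulation sketch of daily peeking under the null (simulation sizes chosen for speed; the exact inflation depends on the check frequency and horizon):

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, days, n_per_day = 1000, 14, 500

false_positives = 0
for _ in range(n_sims):
    # Under the null, both arms draw from the same distribution.
    a = rng.normal(0, 1, days * n_per_day)
    b = rng.normal(0, 1, days * n_per_day)
    for day in range(1, days + 1):
        n = day * n_per_day
        z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > 1.96:            # "significant" at nominal 5%
            false_positives += 1
            break                    # analyst stops at first significance

rate = false_positives / n_sims
print(round(rate, 2))  # far above the nominal 0.05
```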
The **peeking problem** (optional stopping) occurs when an analyst repeatedly checks a running A/B test and stops as soon as the p-value crosses the significance threshold. Under the null hypothesis, the cumulative test statistic follows a random walk, which will eventually cross any fixed boundary. Each check is an opportunity for the random walk to cross the threshold by chance. Simulations (Johari et al., 2017; confirmed in Section 33.9) show that with daily checks over 14 days, the true type I error rate inflates from the nominal 5% to approximately **22-26%** — roughly 5 times the intended rate. The more frequently the analyst checks, the higher the inflation. The root cause is that the fixed-horizon p-value is valid only at the pre-specified sample size; using it at multiple sample sizes violates the assumptions of the test.
Question 10
What is an always-valid confidence sequence, and how does it differ from a standard confidence interval?
Answer
A standard confidence interval $\text{CI}_T$ is valid at a single pre-specified time point $T$: $P(\tau \in \text{CI}_T) \geq 1 - \alpha$. An **always-valid confidence sequence** $\{\text{CI}_t\}_{t \geq 1}$ provides simultaneous coverage for all time points: $P(\tau \in \text{CI}_t \text{ for all } t \geq 1) \geq 1 - \alpha$. This is a much stronger guarantee — it means the analyst can check the confidence interval at any time (daily, hourly, continuously) and the coverage guarantee still holds. The tradeoff is that each individual confidence interval $\text{CI}_t$ is wider than the corresponding fixed-horizon interval at the same sample size. The extra width is the price of validity under continuous monitoring. The mSPRT (mixture sequential probability ratio test) provides a practical construction of always-valid tests for A/B experiments.
Question 11
What is a sample ratio mismatch (SRM), and why is it considered a data quality alarm rather than a statistical nuisance?
Answer
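A sketch of the SRM check as a one-degree-of-freedom chi-squared test (counts invented; uses the identity that the $\chi^2_1$ survival function equals $\mathrm{erfc}(\sqrt{x/2})$):

```python
import math

def srm_check(n_treat, n_control, expected=0.5, alpha=0.001):
    """Chi-squared goodness-of-fit test (1 df) for sample ratio mismatch."""
    total = n_treat + n_control
    exp_t, exp_c = total * expected, total * (1 - expected)
    chi2 = (n_treat - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi2 with 1 df
    return p, p < alpha

print(srm_check(500_300, 499_700))  # small wobble on 1M users: no alarm
print(srm_check(502_000, 498_000))  # 50.2% vs 49.8% on 1M users: SRM
```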
An SRM occurs when the observed ratio of treatment to control users differs significantly from the configured ratio (e.g., 50.1%/49.9% instead of 50.0%/50.0%, detected via chi-squared test with p < 0.001). It is a **data quality alarm** because it indicates the randomization mechanism is broken — treatment and control groups are no longer comparable. If the assignment mechanism has a systematic bias (e.g., treatment pages load slower, causing some users to drop off before logging), then the remaining users in each group differ systematically in ways that confound the treatment effect. SRM does not just add noise; it introduces **selection bias** that no statistical correction can fix. Common causes include redirect differences, performance discrepancies, bot filtering, cache interactions, and trigger condition dependencies. When SRM is detected, the experiment results should be considered unreliable until the root cause is identified and fixed.
Question 12
Describe the difference between a novelty effect and a primacy effect. How would you detect each in an A/B test?
Answer
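A sketch of the detection recipe on simulated daily effects (the decaying-novelty series is invented; a real analysis would also adjust for day-of-week effects):

```python
import numpy as np

rng = np.random.default_rng(4)
days = np.arange(28)

# Simulated daily treatment effects: novelty decaying from ~2.0 toward 0.5.
daily_effect = 0.5 + 1.5 * np.exp(-days / 7) + rng.normal(0, 0.05, 28)

# A negative fitted slope suggests novelty; a positive slope suggests primacy.
slope, intercept = np.polyfit(days, daily_effect, 1)
print("novelty" if slope < 0 else "primacy")
```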
A **novelty effect** occurs when users are attracted to the newness of the treatment, producing a temporary engagement boost that fades as the change becomes familiar — the daily treatment effect decreases over time. A **primacy effect** is the opposite: users have established habits with the old system, and the treatment disrupts these habits, causing a temporary engagement dip that recovers as users adapt — the daily treatment effect increases over time. Detection involves plotting the daily treatment effect over the experiment's duration and fitting a trend: a significant negative slope indicates novelty; a significant positive slope indicates primacy. The analysis should control for day-of-week effects to avoid confounding seasonal patterns with temporal trends. The practical implication is that the first-week effect estimate is biased for the long-run treatment effect — novelty overestimates it, primacy underestimates it — so experiments should run at least 2-4 weeks.
Question 13
StreamRec runs 50 concurrent experiments with independent randomization. Two experiments are suspected of interacting. Describe how you would test for this interaction using a factorial regression.
Answer
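A sketch of the factorial test on simulated assignments (the main effects and the $-0.3$ interaction are invented ground truth):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
A = rng.integers(0, 2, n)  # independent assignment for experiment A
B = rng.integers(0, 2, n)  # independent assignment for experiment B

# Assumed ground truth for illustration: main effects 1.0 and 0.5,
# interaction -0.3 (B dampens A's effect).
y = 10 + 1.0 * A + 0.5 * B - 0.3 * A * B + rng.normal(0, 2, n)

# Fit Y = b0 + bA*A + bB*B + bAB*A*B by least squares.
X = np.column_stack([np.ones(n), A, B, A * B])
b0, bA, bB, bAB = np.linalg.lstsq(X, y, rcond=None)[0]
print(round(bAB, 1))                 # interaction estimate, near -0.3
print(round(abs(bAB) / abs(bA), 2))  # relative magnitude vs A's main effect
```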
With independent randomization, users are independently assigned to each experiment, creating a $2 \times 2$ factorial structure for any pair of experiments. Fit the regression: $Y_i = \beta_0 + \beta_A A_i + \beta_B B_i + \beta_{AB} A_i B_i + \epsilon_i$, where $A_i$ and $B_i$ are the binary treatment assignments for experiments A and B. The coefficient $\beta_{AB}$ is the **interaction effect** — the amount by which the treatment effect of A differs when B is active vs. inactive. Test $H_0: \beta_{AB} = 0$. A significant interaction means the combined effect of A and B is not the sum of their individual effects. In practice, the interaction is compared to the main effect of A as a relative magnitude: $|\beta_{AB}| / |\beta_A|$ measures how much experiment B modifies A's effect. Interactions exceeding 10-20% of the main effect are typically considered operationally significant.
Question 14
The delta method provides the variance of a ratio metric $R = Y/X$. Why is this necessary, and what error does the naive approach introduce?
Answer
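A sketch comparing the naive and delta-method standard errors on simulated per-user data (the sessions/clicks data-generating process is invented; here $Y$ and $X$ are positively correlated, so the naive SE overshoots):

```python
import numpy as np

def ratio_se(y, x):
    """Delta-method standard error for R = mean(y) / mean(x)."""
    n, xbar = len(y), x.mean()
    R = y.mean() / xbar
    cov = np.cov(y, x)  # [[Var(Y), Cov(Y,X)], [Cov(Y,X), Var(X)]]
    var_r = (cov[0, 0] - 2 * R * cov[0, 1] + R**2 * cov[1, 1]) / (xbar**2 * n)
    return np.sqrt(var_r)

rng = np.random.default_rng(5)
n = 100_000
x = rng.poisson(5, n) + 1.0            # e.g. sessions per user
y = 0.4 * x + rng.normal(0, 1, n)      # e.g. clicks, correlated with sessions

# Naive SE ignores the denominator's variance and the Y-X covariance.
naive_se = np.sqrt(np.var(y, ddof=1) / n) / x.mean()
print(round(naive_se / ratio_se(y, x), 2))  # ratio well away from 1.0
```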
Many experiment metrics are ratios: revenue per user, click-through rate (clicks/impressions), sessions per day. The naive approach computes the variance as $\text{Var}(Y) / \bar{X}^2$, ignoring the variability in the denominator $X$ and the covariance between $Y$ and $X$. The delta method gives the correct variance: $\text{Var}(R) \approx (1/\bar{X}^2)[\text{Var}(Y) - 2R\text{Cov}(Y,X) + R^2\text{Var}(X)] / n$. When $Y$ and $X$ are positively correlated (as they typically are — users who have more sessions also generate more revenue), the covariance term reduces the variance, and the naive approach **overestimates** the SE. When the denominator has high variance, the $R^2\text{Var}(X)$ term inflates the variance, and the naive approach **underestimates** the SE. In practice, the naive SE is typically 15-40% too small, producing overconfident confidence intervals and inflated false positive rates.
Question 15
What are the five components of a mature experimentation platform? For each, give one example of what goes wrong if the component fails.
Answer
**(1) Assignment service:** Deterministic, consistent mapping of users to variants. **Failure:** A hash collision causes 0.5% of users to switch variants mid-experiment, introducing SRM and contaminating the treatment effect estimate.
**(2) Exposure logging:** Records when users are actually exposed to the treatment. **Failure:** Treatment exposure events are logged asynchronously and some are dropped during a backend incident; the intent-to-treat analysis dilutes the treatment effect by including unexposed users.
**(3) Metric computation pipeline:** Aggregates raw events into per-user metrics, joins with assignment data. **Failure:** Delayed purchase events from a payment processor are not attributed to the correct experiment day, causing the first few days to undercount revenue.
**(4) Statistical engine:** Implements CUPED, sequential testing, SRM checks, multiple testing correction. **Failure:** The CUPED implementation uses a post-experiment covariate instead of a pre-experiment one, introducing post-treatment bias.
**(5) Decision layer:** Dashboards, reports, alerts. **Failure:** The dashboard shows p-values without confidence intervals, leading product managers to fixate on "p < 0.05" without understanding the effect size.
Question 16
Explain the concept of experimentation culture. What is the HiPPO problem, and why is it the biggest threat to evidence-based decision-making?
Answer
**Experimentation culture** is an organizational commitment to testing changes before shipping them, trusting data over opinions, and accepting surprising or disappointing results. The **HiPPO** (Highest-Paid Person's Opinion) problem occurs when a senior leader's intuition overrides experimental evidence — e.g., a VP insists a feature will work despite an A/B test showing no effect (or a negative effect), and the organization ships the feature anyway. This is the biggest threat because it undermines the entire experimentation infrastructure: if data is overridden when it contradicts authority, teams stop running rigorous experiments (why invest in methodology if the result won't matter?), and the organization loses the ability to distinguish effective changes from ineffective ones. Kohavi et al. report that at Microsoft, approximately two-thirds of tested ideas have no measurable effect — an organization that ships based on HiPPO would ship all of them, accumulating technical debt and codebase complexity with no measurable benefit.
Question 17
A climate scientist wants to estimate the causal effect of a carbon tax on emissions in Country A. They have 20 years of pre-tax data and 5 years of post-tax data for Country A and 30 comparison countries. Why is synthetic control a better approach than a simple before-after comparison?
Answer
A simple before-after comparison attributes all post-tax changes in emissions to the carbon tax, ignoring other factors that changed simultaneously: global economic conditions, technological changes, natural gas price fluctuations, other environmental policies. Any of these confounders could explain the observed change. **Synthetic control** constructs a counterfactual — what Country A's emissions *would have been* without the carbon tax — by finding a weighted combination of control countries that closely tracked Country A's emissions for 20 years before the tax. If the synthetic control diverges from Country A's actual emissions after the tax, this divergence is credibly attributable to the tax (because the synthetic control accounts for all factors that affect both Country A and the control countries). The 20 years of pre-treatment fit provides evidence that the synthetic control is a valid counterfactual. A before-after comparison provides no such evidence and is vulnerable to any time-varying confounder.
Question 18
CUPED uses pre-experiment data to reduce variance. Could you use the same idea within a cluster-randomized experiment? What would the covariate $X$ be?
Answer
Yes. In a cluster-randomized experiment, the outcome is the cluster-level mean $\bar{Y}_k$, and variance reduction is even more valuable because the design effect has already inflated the variance. The pre-experiment covariate $X_k$ would be the **cluster-level mean of the same metric measured before the experiment started** — e.g., the average daily engagement of all users in cluster $k$ during the 2-4 weeks before the experiment. The CUPED adjustment is $\tilde{\bar{Y}}_k = \bar{Y}_k - \theta(X_k - \bar{X})$, reducing the cluster-level outcome variance by the fraction $\rho^2$ where $\rho$ is the correlation between pre- and post-experiment cluster means. Because cluster means are more stable over time than individual-level outcomes (averaging reduces noise), the pre-post correlation for cluster means is often higher ($\rho \approx 0.7\text{-}0.9$) than for individuals ($\rho \approx 0.5\text{-}0.7$), making CUPED even more effective in the cluster-randomized setting.
Question 19
An experiment shows a statistically significant positive effect (p = 0.02) on engagement but also triggers the SRM check (p = 0.0001). Should you ship the treatment? Explain your reasoning.
Answer
**No.** The SRM detection means the randomization mechanism is broken — the treatment and control groups are not comparable. The positive engagement result could be an artifact of the selection bias that caused the SRM. For example, if treatment users with slower internet connections dropped off before logging an exposure (causing the treatment group to be enriched with users who have faster connections, who tend to engage more regardless of the treatment), the measured "treatment effect" is actually a confounding effect from differential attrition. The correct action is: (1) **do not ship** the treatment based on these results; (2) **investigate the SRM root cause** — check for platform-specific differences, redirect chains, performance discrepancies, and bot filtering; (3) **fix the root cause** and re-run the experiment; (4) only trust the engagement result if the re-run passes SRM. An SRM-contaminated experiment cannot be salvaged by statistical adjustment — the problem is in the data, not the analysis.
Question 20
Compare and contrast the following variance reduction techniques: CUPED, stratification, and regression adjustment. When is each most appropriate?