Chapter 33: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario.
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods.
  • Three stars (***): Derive a result, implement from scratch, or design a system component.
  • Four stars (****): Research-level problems that connect to open questions in the field.


Interference and SUTVA Violations

Exercise 33.1 (*)

A food delivery marketplace wants to test a new pricing algorithm. They randomize individual customers to treatment (new pricing) or control (old pricing).

(a) Explain why SUTVA is violated in this setting. Identify the specific interference mechanism.

(b) If the new pricing algorithm offers lower prices to treatment customers, predict the direction of bias in the naive difference-in-means estimator. Will it overestimate or underestimate the true effect on order volume? Explain your reasoning.

(c) Propose an experimental design that addresses the interference. What are the tradeoffs?


Exercise 33.2 (*)

A social media platform tests a new content moderation algorithm. Treatment users see fewer toxic comments in their feed; control users see the standard feed.

(a) Describe two interference mechanisms through which the treatment could affect control users.

(b) If the platform randomizes at the user level, will the measured effect on "time spent reading comments" be biased upward or downward? Why?

(c) The platform has 500 million users. Would cluster randomization by friend group be practical? What challenges would arise?


Exercise 33.3 (**)

Modify the simulate_interference_bias function from Section 33.3 to model negative spillover (competition for limited resources) instead of positive spillover.

(a) Define a scenario where treatment users consume a scarce resource, reducing availability for control users. Implement the simulation.

(b) Show that the naive estimator now overestimates the true direct effect. Report the bias as a percentage.

(c) Run the simulation with spillover_magnitude values of 0.1, 0.3, 0.5, 0.8, and 1.0. Plot the bias as a function of spillover magnitude. At what magnitude does the naive estimate exceed the true effect by more than 50%?
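The simulate_interference_bias function from Section 33.3 is not reproduced here, so a minimal hypothetical version for the negative-spillover case might look like the following (the function name, default values, and the linear spillover model are illustrative assumptions, not the book's implementation):

```python
import numpy as np

def simulate_negative_spillover(n=100_000, treat_frac=0.5, direct_effect=1.0,
                                spillover_magnitude=0.5, sigma=1.0, seed=0):
    """Treated users consume a scarce resource (e.g. driver supply), so each
    control user loses spillover_magnitude * direct_effect * treat_frac."""
    rng = np.random.default_rng(seed)
    treated = rng.random(n) < treat_frac
    baseline = rng.normal(10.0, sigma, n)
    outcome = (baseline
               + direct_effect * treated
               - spillover_magnitude * direct_effect * treat_frac * (~treated))
    # Naive difference-in-means: inflated because controls are pushed down
    naive = outcome[treated].mean() - outcome[~treated].mean()
    return naive
```

With these defaults the naive estimate is roughly direct_effect * (1 + spillover_magnitude * treat_frac), about 1.25 against a true direct effect of 1.0, a 25% overestimate, which is the phenomenon part (b) asks you to demonstrate.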


Exercise 33.4 (**)

StreamRec's social graph has the following structure: 12 million users, average degree 8, clustering coefficient 0.15. The community detection algorithm identifies 800,000 clusters with a median size of 12 and a maximum size of 340.

(a) Compute the design effect $\text{DEFF} = 1 + (m-1)\rho$ for ICCs of 0.01, 0.03, 0.05, and 0.10, assuming median cluster size. How many clusters are needed for 80% power to detect a 0.5-minute effect (assuming $\sigma = 15$)?

(b) The top 1% of clusters contain 20% of all users. How does this cluster size heterogeneity affect the variance of the cluster-level estimator? (Hint: consider the effective number of clusters.)

(c) Propose a hybrid design: cluster-randomize the largest 10% of clusters and individually randomize users in the remaining 90%. What assumption about interference must hold for this design to be valid?
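For part (a), a small helper using the normal-approximation sample size formula may be a useful starting point (the two-sided alpha = 0.05, power = 0.80 defaults are assumptions matching the exercise):

```python
import math

def design_effect(m, icc):
    """DEFF = 1 + (m - 1) * rho for cluster size m and intracluster correlation rho."""
    return 1.0 + (m - 1) * icc

def clusters_needed(delta, sigma, m, icc, z_alpha=1.959964, z_beta=0.841621):
    """Clusters per arm for a two-sample z-test, inflated by the design effect."""
    n_per_arm = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2  # individuals
    n_per_arm *= design_effect(m, icc)
    return math.ceil(n_per_arm / m)
```

For example, design_effect(12, 0.05) = 1.55, and clusters_needed(0.5, 15, 12, 0.05) comes out to roughly 1,800 clusters per arm.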


Cluster and Switchback Designs

Exercise 33.5 (*)

A ride-sharing company runs a switchback experiment with 4-hour periods over 14 days (84 periods total). The treatment is a new surge pricing algorithm.

(a) How many treatment periods and control periods are expected with equal randomization?

(b) If carryover effects last approximately 30 minutes, what fraction of each 4-hour period is "contaminated" by carryover from the previous period?

(c) The company considers using 1-hour periods instead to increase the effective sample size. What is the tradeoff? Would you recommend this?


Exercise 33.6 (**)

Implement a switchback experiment analyzer with the following features:

class SwitchbackAnalyzer:
    """Extended switchback experiment analyzer.

    Supports:
    - Carryover adjustment (lag-1 and lag-2)
    - Day-of-week fixed effects
    - Newey-West standard errors for autocorrelation
    """

    def __init__(self, outcomes, assignments, period_info):
        """
        Args:
            outcomes: Array of period-level mean outcomes.
            assignments: Binary array of period assignments.
            period_info: DataFrame with columns ['period', 'day_of_week',
                         'hour', 'n_observations'].
        """
        pass  # Implement

    def estimate_effect(
        self,
        carryover_lags: int = 1,
        day_of_week_fe: bool = True,
        newey_west_lags: int = 3,
    ) -> dict:
        """Estimate treatment effect with the specified adjustments."""
        pass  # Implement

(a) Implement the class with weighted least squares regression.

(b) Generate synthetic data with known carryover (lag-1 = 0.3, lag-2 = 0.1) and day-of-week effects. Verify that your analyzer recovers the true treatment effect.

(c) Compare Newey-West standard errors (lags 1, 3, 5) with OLS standard errors. By how much do the OLS standard errors underestimate the true uncertainty?
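A sketch of the synthetic data generator for part (b); the baseline level, day-of-week effects, and noise scale below are illustrative choices:

```python
import numpy as np

def make_switchback_data(n_periods=84, tau=1.0, carry1=0.3, carry2=0.1,
                         sigma=0.5, seed=0):
    """Period-level outcomes with lag-1/lag-2 carryover and day-of-week effects."""
    rng = np.random.default_rng(seed)
    d = rng.integers(0, 2, n_periods)               # random period assignments
    day_of_week = (np.arange(n_periods) // 6) % 7   # six 4-hour periods per day
    dow_effect = np.array([0.0, 0.2, 0.1, 0.0, -0.1, 0.5, 0.6])
    y = 10.0 + tau * d + dow_effect[day_of_week] + rng.normal(0, sigma, n_periods)
    y[1:] += carry1 * tau * d[:-1]                  # lag-1 carryover
    y[2:] += carry2 * tau * d[:-2]                  # lag-2 carryover
    return y, d, day_of_week
```

Regressing y on the current assignment, its two lags, and day-of-week dummies should recover tau; comparing that fit against a no-lag regression illustrates why the carryover adjustment matters.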


Exercise 33.7 (***)

Derive the variance of the switchback estimator under the following model:

$$Y_t = \mu + \tau D_t + \gamma D_{t-1} + \epsilon_t$$

where $\epsilon_t$ follows an AR(1) process: $\epsilon_t = \phi \epsilon_{t-1} + \nu_t$ with $\nu_t \sim \mathcal{N}(0, \sigma^2)$.

(a) Write the variance of $\hat{\tau}$ as a function of $T$ (number of periods), $\phi$ (autocorrelation), and $\sigma^2$.

(b) Show that positive autocorrelation ($\phi > 0$) increases the variance. By how much for $\phi = 0.3$ and $T = 84$?

(c) Show that the Newey-West variance estimator with $L$ lags is consistent for this model when $L \geq 1$.
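For part (a), a standard fact about stationary AR(1) errors is a useful starting point:

$$\gamma_k = \text{Cov}(\epsilon_t, \epsilon_{t-k}) = \frac{\phi^{|k|} \sigma^2}{1 - \phi^2}$$

so the variance of any linear estimator $\hat{\tau} = \sum_t c_t Y_t$ is $\sum_{s,t} c_s c_t \gamma_{|s-t|}$, which collapses to the familiar iid expression when $\phi = 0$.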


Synthetic Control

Exercise 33.8 (**)

StreamRec launches a new content moderation policy in Brazil. Using data from 7 other Latin American markets as controls:

(a) Implement the synthetic control method from Section 33.6. Use 90 days of pre-treatment data and 30 days of post-treatment data. Generate plausible synthetic data for 8 markets (1 treated, 7 controls) with a true treatment effect of +3% engagement.

(b) Report the estimated treatment effect and the pre-treatment RMSE. How close is the estimate to the true effect?

(c) Conduct a placebo test: apply the synthetic control method to each control unit as if it were treated. If the method is valid, the placebo effects should be near zero. Rank the actual effect within the distribution of placebo effects; this rank yields a permutation-based p-value.

(d) One control market (Argentina) underwent an unrelated regulatory change during the post-treatment period. How would you modify the analysis to handle this?
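One way to sketch part (a) is to fit simplex-constrained weights by constrained least squares. Everything below is an illustrative assumption, not the Section 33.6 implementation: the one-factor data-generating process, the SciPy SLSQP solver, and an additive effect of 3.0 units standing in for the +3% lift.

```python
import numpy as np
from scipy.optimize import minimize

def fit_synthetic_control(y1_pre, Y0_pre):
    """Weights on the simplex minimizing pre-treatment squared error."""
    k = Y0_pre.shape[1]
    res = minimize(lambda w: np.sum((y1_pre - Y0_pre @ w) ** 2),
                   np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x

# One-factor synthetic data: 90 pre days, 30 post days, 1 treated + 7 controls.
rng = np.random.default_rng(0)
T_pre, T_post, k = 90, 30, 7
factor = np.cumsum(rng.normal(0, 0.1, T_pre + T_post)) + 50.0
loadings = np.r_[1.0, np.linspace(0.6, 1.4, k)]   # treated loading is spanned
Y = np.outer(factor, loadings) + rng.normal(0, 0.05, (T_pre + T_post, k + 1))
y1, Y0 = Y[:, 0].copy(), Y[:, 1:]
y1[T_pre:] += 3.0                                 # true treatment effect

w = fit_synthetic_control(y1[:T_pre], Y0[:T_pre])
gap = y1[T_pre:] - Y0[T_pre:] @ w                 # post-treatment effect path
pre_rmse = np.sqrt(np.mean((y1[:T_pre] - Y0[:T_pre] @ w) ** 2))
```

The placebo test in part (c) reuses fit_synthetic_control with each control column swapped into the treated slot.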


CUPED and Variance Reduction

Exercise 33.9 (*)

A streaming platform runs an A/B test on its search algorithm. The primary metric is search success rate (proportion of searches that result in a click within 60 seconds). Pre-experiment search success rate is available for all users.

(a) The pre-post correlation for search success rate is $\rho = 0.45$. What is the variance reduction from CUPED? What is the effective sample size multiplier?

(b) The platform also has pre-experiment data on total searches per day ($\rho = 0.55$ with post-experiment search success rate) and account age ($\rho = 0.20$). Can you use multiple covariates in CUPED? If so, what is the maximum variance reduction achievable?

(c) Implement multivariate CUPED using OLS regression of $Y$ on $(X_1, X_2, X_3)$ and compute the adjusted outcomes. Verify that the treatment effect estimate is unchanged.


Exercise 33.10 (**)

CUPED assumes a linear relationship between pre- and post-experiment outcomes.

(a) Generate data where the true relationship is nonlinear: $Y = X^2 + \epsilon + \tau D$. Apply standard CUPED. How much variance reduction do you achieve compared to the theoretical maximum?

(b) Implement a nonlinear CUPED that uses a gradient-boosted tree (e.g., LightGBM) to predict $Y$ from $X$, then uses the residual $\tilde{Y} = Y - \hat{f}(X)$ as the adjusted outcome.

(c) Show empirically that nonlinear CUPED achieves greater variance reduction for the nonlinear data from (a). What is the risk of overfitting, and how would you mitigate it?


Exercise 33.11 (***)

Prove that the CUPED estimator is unbiased for the ATE under random assignment.

(a) Start from the definition: $\hat{\tau}_{\text{CUPED}} = \bar{\tilde{Y}}_1 - \bar{\tilde{Y}}_0$ where $\tilde{Y}_i = Y_i - \theta(X_i - \bar{X})$. Show that $\text{E}[\hat{\tau}_{\text{CUPED}}] = \tau$ under random assignment.

(b) Derive the variance $\text{Var}(\hat{\tau}_{\text{CUPED}})$ and show it equals $\frac{2\sigma^2(1-\rho^2)}{n}$ for a balanced design with $n$ users per group.

(c) Show that the optimal $\theta$ minimizes this variance. What happens if $\theta$ is estimated from the experimental data rather than known? Does unbiasedness still hold?


Exercise 33.12 (**)

Stratified estimation. Partition StreamRec users into 5 strata based on pre-experiment engagement: very low (<5 min), low (5-15), medium (15-30), high (30-60), very high (>60 min). The stratum population proportions are (0.15, 0.25, 0.30, 0.20, 0.10).

(a) Implement the stratified estimator $\hat{\tau}_{\text{strat}} = \sum_s (N_s/N)\hat{\tau}_s$. Generate synthetic data where the treatment effect is heterogeneous: the effect is +0.3 min for very low users, +0.5 for low, +0.8 for medium, +1.0 for high, and +0.5 for very high users.

(b) Compute the true ATE (weighted average of stratum effects). Verify that both the stratified estimator and the simple difference-in-means recover this ATE (in expectation).

(c) Compare the standard errors of the stratified and unstratified estimators. How much variance reduction does stratification achieve?
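A sketch of the stratified estimator for part (a); the stratum baselines and noise level are illustrative assumptions, while the weights and effects come from the exercise:

```python
import numpy as np

def stratified_estimate(y, d, strata, weights):
    """tau_strat = sum_s (N_s/N) * tau_s, with a plug-in standard error."""
    taus, variances = [], []
    for s in range(len(weights)):
        ys, ds = y[strata == s], d[strata == s]
        y1, y0 = ys[ds == 1], ys[ds == 0]
        taus.append(y1.mean() - y0.mean())
        variances.append(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    w = np.asarray(weights)
    return w @ np.asarray(taus), float(np.sqrt(w ** 2 @ np.asarray(variances)))

rng = np.random.default_rng(0)
n = 50_000
weights = np.array([0.15, 0.25, 0.30, 0.20, 0.10])
effects = np.array([0.3, 0.5, 0.8, 1.0, 0.5])      # heterogeneous effects (minutes)
strata = rng.choice(5, size=n, p=weights)
d = rng.integers(0, 2, n)
base = np.array([2.0, 10.0, 22.0, 45.0, 75.0])     # stratum baseline engagement
y = base[strata] + effects[strata] * d + rng.normal(0, 5.0, n)

tau_hat, se = stratified_estimate(y, d, strata, weights)
true_ate = weights @ effects                        # 0.66
```

Comparing se against the unstratified standard error quantifies the variance reduction requested in part (c).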


Multiple Testing

Exercise 33.13 (*)

StreamRec analyzes 8 metrics for each A/B test: daily engagement minutes, daily sessions, items consumed, search queries, shares, saves, app opens, and revenue. The experiment has no true effect on any metric.

(a) What is the probability of at least one false positive at $\alpha = 0.05$ without correction?

(b) Apply Bonferroni correction. What is the adjusted $\alpha$ per test?

(c) Apply Benjamini-Hochberg at FDR = 0.05. Simulate 10,000 experiments (all null) and compute the actual false discovery rate. Verify it is controlled at 5%.
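A minimal Benjamini-Hochberg implementation usable in the part (c) simulation (written for the independent-tests setting this exercise assumes):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean rejection mask controlling FDR at level q under independence."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.nonzero(below)[0].max()   # largest k with p_(k) <= q * k / m
        reject[order[:k_max + 1]] = True
    return reject
```

Applying this mask to each batch of 8 null p-values and averaging the false discovery proportion over the 10,000 simulated experiments gives the realized FDR.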


Exercise 33.14 (**)

A company runs 200 A/B tests per quarter. Of these, 30 have a true positive effect and 170 are null. Effect sizes for the 30 true positives are drawn from $\mathcal{N}(0.5, 0.2^2)$ (in standard deviation units). Each test has $n = 10,000$ per group.

(a) Simulate the 200 p-values. How many tests are significant at $\alpha = 0.05$ without correction?

(b) Apply BH at FDR = 0.10. How many true positives are detected? How many false positives are included? What is the realized false discovery proportion?

(c) Repeat with Bonferroni at FWER = 0.05. Compare the number of true positives detected. Which procedure would you recommend for this company, and why?


Exercise 33.15 (***)

The Benjamini-Hochberg procedure controls FDR under independence of the test statistics. In practice, experiment metrics are correlated (engagement and sessions are strongly correlated).

(a) Prove that BH controls FDR at level $\alpha$ when the null p-values are independent of each other and of the non-null p-values. (Benjamini and Yekutieli, 2001, extend the guarantee to positive regression dependence, the PRDS condition.)

(b) Simulate 8 correlated metrics with correlation matrix $\Sigma$ where all off-diagonal elements are 0.5. All 8 metrics are null. Apply BH at FDR = 0.05 and compute the realized FDR over 10,000 simulations. Is it controlled?

(c) Now set 2 of the 8 metrics to have true effects. Does the BH procedure still control FDR under this correlation structure?


Sequential Testing and Peeking

Exercise 33.16 (*)

An experiment is planned for 14 days. The analyst checks the p-value every day.

(a) Using the simulation from Section 33.9, estimate the inflated type I error rate when checking at days 3, 5, 7, 10, and 14 only (5 checks instead of 14). How does the inflation compare to daily checking?

(b) The product manager asks to check hourly (336 checks). Estimate the type I error inflation.

(c) At what checking frequency does the inflation exceed 3× the nominal alpha?
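The Section 33.9 simulation is not reproduced here, so a minimal stand-in for estimating the inflated type I error under repeated checking might look like this (the per-day sample size and unit-variance outcomes are illustrative assumptions):

```python
import numpy as np

def peeking_error_rate(check_days, n_days=14, n_per_day=1000,
                       n_sims=4000, z_crit=1.96, seed=0):
    """Fraction of null experiments rejected at ANY of the listed checkpoints."""
    rng = np.random.default_rng(seed)
    # Daily difference in means under the null (sigma = 1 per arm).
    daily = rng.normal(0, np.sqrt(2.0 / n_per_day), (n_sims, n_days))
    days = np.arange(1, n_days + 1)
    cum_mean = np.cumsum(daily, axis=1) / days           # running estimate
    z = cum_mean / np.sqrt(2.0 / (n_per_day * days))     # running z-statistic
    checks = np.array(list(check_days)) - 1
    return (np.abs(z[:, checks]) > z_crit).any(axis=1).mean()
```

peeking_error_rate(range(1, 15)) simulates daily checking; peeking_error_rate([3, 5, 7, 10, 14]) gives the five-check schedule in part (a).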


Exercise 33.17 (**)

Implement a complete sequential monitoring dashboard.

import numpy as np
from typing import Tuple

class SequentialMonitor:
    """Monitor an experiment with always-valid inference.

    Updates daily with new data and provides:
    - Always-valid confidence interval
    - mSPRT statistic and stopping decision
    - Expected remaining duration
    """

    def __init__(
        self,
        experiment_id: str,
        alpha: float = 0.05,
        tau_squared: float = 0.01,
        min_days: int = 7,
    ):
        pass  # Implement

    def update(self, day: int, y_treatment: np.ndarray, y_control: np.ndarray) -> dict:
        """Add one day of data and update all statistics."""
        pass  # Implement

    def should_stop(self) -> Tuple[bool, str]:
        """Whether the experiment should stop and why."""
        pass  # Implement

(a) Implement the class with cumulative tracking of outcomes.

(b) Simulate an experiment with true effect = 0.8 minutes and generate daily monitoring output. On which day does the sequential test first reject?

(c) Compare the stopping time distribution for the mSPRT with different $\tau^2$ values: 0.001, 0.01, 0.1. Which value stops earliest for a true effect of 0.8? For a true effect of 0.1?
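The mSPRT statistic itself has a closed form as a Gaussian mixture likelihood ratio. A sketch of the per-update computation, assuming a known outcome variance (in the class above you would estimate it from the accumulated data):

```python
import numpy as np

def msprt_statistic(y_treatment, y_control, tau_squared=0.01, sigma_squared=1.0):
    """Lambda_n for H0: no difference, with a N(0, tau^2) mixture over the effect."""
    n = min(len(y_treatment), len(y_control))
    delta_hat = y_treatment[:n].mean() - y_control[:n].mean()
    v_n = 2.0 * sigma_squared / n                  # variance of delta_hat
    return (np.sqrt(v_n / (v_n + tau_squared))
            * np.exp(delta_hat ** 2 * tau_squared / (2.0 * v_n * (v_n + tau_squared))))
```

should_stop can then compare Lambda_n with 1/alpha, and the always-valid p-value is min(1, 1/Lambda_n).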


Exercise 33.18 (***)

Johari et al. (2017) show that the always-valid p-value $p_n = 1/\Lambda_n$ (where $\Lambda_n$ is the mSPRT statistic) satisfies:

$$\Pr\left(\exists\, n \geq 1: p_n \leq \alpha\right) \leq \alpha \quad \text{under } H_0$$

(a) Prove this result using Ville's inequality: if $\{M_n\}$ is a non-negative supermartingale with $\text{E}[M_0] = 1$, then $\Pr(\exists\, n: M_n \geq c) \leq 1/c$.

(b) Show that $\Lambda_n$ under $H_0$ (no treatment effect) is a non-negative martingale with $\text{E}[\Lambda_0] = 1$. (Hint: use the tower property of conditional expectation.)

(c) Verify this empirically: simulate 10,000 null experiments and compute the fraction where $\Lambda_n$ ever exceeds $1/\alpha = 20$.


SRM and Diagnostics

Exercise 33.19 (*)

Three StreamRec experiments show the following user counts:

  Experiment          Treatment    Control      Expected Ratio
  A (ranking)         3,001,542    2,998,458    50/50
  B (notifications)   1,205,812    4,794,188    20/80
  C (homepage)        5,987,324    6,012,676    50/50

(a) Run an SRM check for each experiment. Which experiments have SRM?

(b) For the experiment(s) with SRM, list three possible root causes and describe how you would investigate each.

(c) Experiment C's SRM is caused by a browser extension that blocks a specific JavaScript tag. This tag is present only in the treatment variant, causing 0.2% of treatment users to not log an exposure. Should the experiment results be trusted? Why or why not?
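A minimal SRM check for part (a) is a one-degree-of-freedom chi-square goodness-of-fit test against the configured split; 10.828 is the df = 1 critical value at the conventional SRM threshold of alpha = 0.001:

```python
def srm_check(n_treatment, n_control, expected_treatment_share, crit=10.828):
    """Chi-square goodness-of-fit test against the configured split (df = 1)."""
    n = n_treatment + n_control
    e_t = n * expected_treatment_share
    e_c = n - e_t
    chi2 = (n_treatment - e_t) ** 2 / e_t + (n_control - e_c) ** 2 / e_c
    return chi2, chi2 > crit
```

For example, srm_check(3_001_542, 2_998_458, 0.5) evaluates experiment A.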


Exercise 33.20 (**)

Build an automated SRM monitoring system.

(a) Write a function that checks SRM at multiple granularities: overall, by platform (iOS, Android, web), by country, and by new vs. returning users. If the overall check passes but a segment fails, what does this indicate?

(b) The SRM check is run daily throughout the experiment. What is the probability of a false SRM alarm at $\alpha = 0.001$ over 14 daily checks? How would you adjust the threshold?

(c) Design an alerting policy: when should an SRM detection trigger an immediate experiment pause vs. an investigation ticket?


Novelty and Primacy Effects

Exercise 33.21 (**)

StreamRec runs two experiments simultaneously:

  • Experiment A: New algorithm — shows a 2.0 min/day lift in week 1 that decays to 0.8 min/day by week 3.
  • Experiment B: UI redesign — shows a 0.3 min/day lift in week 1 that grows to 1.1 min/day by week 3.

(a) Classify each experiment's temporal pattern (novelty or primacy).

(b) If you could only run each experiment for 10 days, how much would your estimate differ from the long-run effect? Compute the bias for each experiment.

(c) The product manager wants to ship experiment A based on the week 1 results. Write a quantitative argument for waiting until week 3. What is the expected revenue impact of the decision error?


Exercise 33.22 (***)

Design and implement a test for novelty/primacy effects that accounts for day-of-week seasonality.

(a) The standard trend test (Section 33.11) may detect a "novelty effect" that is actually a day-of-week artifact (if the experiment starts on a Monday, the first 7 days include one full week cycle, but days 1-3 may look different from days 4-7 due to weekday/weekend patterns). Implement a test that includes day-of-week fixed effects.

(b) Generate data with both a novelty effect (exponential decay, half-life 5 days) and strong day-of-week effects (weekends are 20% higher). Show that the standard test and the day-adjusted test give different results.

(c) Propose a regression model: $\hat{\tau}_t = \beta_0 + \beta_1 t + \sum_{d=1}^{6} \gamma_d \mathbf{1}[\text{day}_t = d] + \epsilon_t$. Test $H_0: \beta_1 = 0$.
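The part (c) regression can be sketched with ordinary least squares; the function below returns the slope and its t-statistic, and everything about the simulated data in the usage comment is an illustrative assumption:

```python
import numpy as np

def trend_test(tau_by_day, start_dow=0):
    """Regress daily effect estimates on time plus day-of-week dummies;
    return (beta_1, t-statistic) for H0: beta_1 = 0."""
    T = len(tau_by_day)
    t = np.arange(T, dtype=float)
    dow = (start_dow + np.arange(T)) % 7
    X = np.column_stack([np.ones(T), t] +
                        [(dow == d).astype(float) for d in range(1, 7)])
    beta, *_ = np.linalg.lstsq(X, tau_by_day, rcond=None)
    resid = tau_by_day - X @ beta
    sigma2 = resid @ resid / (T - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], beta[1] / np.sqrt(cov[1, 1])
```

A clearly negative t-statistic indicates a decaying (novelty) pattern after the day-of-week structure has been absorbed by the dummies.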


Experimentation Platforms

Exercise 33.23 (**)

Design the assignment service for an experimentation platform that supports:

  • Up to 500 concurrent experiments
  • Multi-variant experiments (2-10 variants)
  • Mutual exclusion groups (experiments that cannot overlap)
  • Layered experiments (experiments in different "layers" are independent; within a layer, experiments are mutually exclusive)

(a) Implement the hash-based assignment function. Ensure consistency (same user, same experiment = same assignment across calls) and uniformity (assignments are balanced within 0.01% of target ratios for 1M+ users).

(b) Implement mutual exclusion: if experiment A and experiment B are in the same exclusion group, a user assigned to A cannot be assigned to B.

(c) Implement layering: traffic is first divided into layers (e.g., "ranking," "UI," "notifications"). Each layer is independently randomized. This allows a user to be in one ranking experiment, one UI experiment, and one notification experiment simultaneously.
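A hedged sketch of the layered, hash-based assignment in parts (a)-(c); the bucket count, salting scheme, and traffic-splitting convention are illustrative design choices, not the only correct ones:

```python
import hashlib

N_BUCKETS = 10_000

def bucket(user_id: str, salt: str) -> int:
    """Deterministic, approximately uniform bucket in [0, N_BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:12], 16) % N_BUCKETS

def assign(user_id: str, layer: str, experiments) -> tuple:
    """experiments: list of (name, traffic_share, n_variants) within one layer.
    Hashing once per layer makes experiments in the same layer mutually
    exclusive; different layers use different salts, so they are independent."""
    b = bucket(user_id, salt=layer)
    low = 0
    for name, share, n_variants in experiments:
        high = low + int(share * N_BUCKETS)
        if low <= b < high:
            variant = bucket(user_id, salt=f"{layer}/{name}") % n_variants
            return name, variant
        low = high
    return None, None   # user not in any experiment in this layer
```

Mutual exclusion groups fall out naturally: place the conflicting experiments in a dedicated layer and a user can land in at most one of them.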


Exercise 33.24 (***)

The delta_method_ratio_variance function in Section 33.13 handles the case of a single ratio metric. Extend it to handle the difference in ratio metrics between treatment and control groups.

(a) Derive the variance of $\hat{\tau}_R = R_T - R_C$ where $R_T = \bar{Y}_T / \bar{X}_T$ and $R_C = \bar{Y}_C / \bar{X}_C$ are the treatment and control group ratio metrics.

(b) Implement the estimator with delta-method confidence intervals.

(c) Compare with a bootstrap confidence interval (1,000 replicates). Are the delta method and bootstrap intervals similar? Under what conditions would they diverge?
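A sketch of parts (a)-(b), assuming the treatment and control samples are independent so the variances add:

```python
import numpy as np

def ratio_mean_and_var(y, x):
    """Delta-method mean and variance of R = ybar / xbar."""
    n = len(y)
    r = y.mean() / x.mean()
    cov = np.cov(y, x, ddof=1)   # [[var_y, cov_yx], [cov_yx, var_x]]
    var_r = (cov[0, 0] - 2 * r * cov[0, 1] + r ** 2 * cov[1, 1]) / (n * x.mean() ** 2)
    return r, var_r

def diff_in_ratios(y_t, x_t, y_c, x_c, z=1.959964):
    """tau = R_T - R_C with a delta-method confidence interval."""
    r_t, v_t = ratio_mean_and_var(y_t, x_t)
    r_c, v_c = ratio_mean_and_var(y_c, x_c)
    tau = r_t - r_c
    se = np.sqrt(v_t + v_c)      # independent groups
    return tau, (tau - z * se, tau + z * se)
```

The part (c) bootstrap resamples users (not individual observations) within each group and recomputes tau on each replicate.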


Exercise 33.25 (***)

Implement the full experiment interaction detection pipeline for StreamRec's 50 concurrent experiments.

(a) With 50 experiments, there are $\binom{50}{2} = 1,225$ pairwise interactions to test. Apply BH correction at FDR = 0.10. Simulate 50 experiments (3 pairs have true interactions) and compute the detection rate and false discovery proportion.

(b) The computational cost of testing 1,225 interactions is high. Propose a screening strategy: first identify candidate interacting pairs using a cheap proxy (e.g., shared metrics or overlapping user populations), then test only the candidates. Implement the screening strategy.

(c) A detected interaction between experiment A and experiment B means the measured effect of A depends on B's assignment. How should the experimentation platform report A's effect: (i) the marginal effect averaging over B's assignment, (ii) the effect conditional on B=0 (control), or (iii) both? Argue for your recommendation.


Integration and Climate Applications

Exercise 33.26 (**)

Apply synthetic control to estimate the effect of a carbon tax on CO2 emissions. Use publicly available data (or generate plausible synthetic data) for 20 countries over 30 years. One country introduces a carbon tax in year 15.

(a) Construct the synthetic control and estimate the treatment effect. Report the pre-treatment RMSE.

(b) Run placebo tests for all control countries. What is the estimated p-value?

(c) The treatment country also experienced an economic recession in year 17 (2 years after the carbon tax). How would you disentangle the carbon tax effect from the recession effect?


Exercise 33.27 (***)

Design a complete experimentation workflow for a hypothetical climate intervention: a city-level policy requiring 30% green roof coverage on new commercial buildings.

(a) Identify the target estimand: what causal quantity do you want to estimate? Define the potential outcomes.

(b) You cannot randomize cities. Describe three quasi-experimental designs that could credibly estimate the causal effect: (i) synthetic control, (ii) difference-in-differences, (iii) regression discontinuity. For each, state the identifying assumption and the most plausible threat to that assumption.

(c) Which of the three designs would you recommend for this specific intervention, and why?


Exercise 33.28 (****)

Open research question. The mSPRT (Section 33.9) uses a Gaussian mixture over the alternative hypothesis. Recent work (e.g., Howard et al., 2021, "Time-uniform, nonparametric, nonasymptotic confidence sequences") proposes nonparametric confidence sequences based on sub-Gaussian or sub-exponential tail assumptions.

(a) Implement the sub-Gaussian confidence sequence from Howard et al., Theorem 1. Compare its width to the mSPRT confidence interval at time points $n = 100, 1000, 10000$.

(b) The sub-Gaussian confidence sequence requires a bound on the sub-Gaussian parameter $\sigma^2$. In practice, this is estimated from data. How does estimation error affect the validity of the confidence sequence? Propose a conservative approach.

(c) Design a simulation study comparing the mSPRT and the sub-Gaussian confidence sequence in terms of: type I error under the null, power at various effect sizes, average stopping time, and robustness to heavy-tailed outcomes (Student-t with 5 degrees of freedom). Write a 1-page report summarizing your findings.


Exercise 33.29 (****)

Open research question. Interference in large-scale recommendation systems is typically modeled as positive spillover through content sharing. But there is a second mechanism: algorithmic interference. If the recommendation algorithm uses collaborative filtering, treating some users changes the recommendations shown to all users (because the model is retrained on the treatment group's behavior).

(a) Formalize algorithmic interference in the potential outcomes framework. How does it differ from social network interference?

(b) Neither cluster randomization nor switchback designs fully address algorithmic interference. Explain why.

(c) Propose a design that isolates the direct treatment effect from algorithmic interference. (Hint: consider holding the recommendation model fixed during the experiment and only varying the post-scoring re-ranking policy.) What are the limitations of this approach?


Exercise 33.30 (****)

Open research question. CUPED achieves variance reduction proportional to $\rho^2$ using a linear covariate adjustment. The theoretical maximum variance reduction using any function of $X$ is determined by the expected conditional variance $\text{E}[\text{Var}(Y \mid X)]$.

(a) Show that $1 - \rho^2 \geq \text{E}[\text{Var}(Y \mid X)] / \text{Var}(Y)$, with equality when $\text{E}[Y \mid X]$ is linear in $X$.

(b) Propose and implement a nonparametric CUPED estimator using gradient-boosted trees. Prove or argue that the resulting test is valid (controls type I error) under random assignment, even if the regression function is misspecified.

(c) Lin (2013) showed that the fully interacted regression $Y \sim D + X + D \times X$ is asymptotically efficient and valid. Does this result extend to nonparametric regression? Under what conditions?