Chapter 16: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario.
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods.
  • Three stars (***): Derive a result, implement from scratch, or design a system component.
  • Four stars (****): Research-level problems that connect to open questions in the field.


Potential Outcomes Foundations

Exercise 16.1 (*)

Consider a job training program evaluated on subsequent annual earnings. There are 6 individuals:

Person   $Y_i(0)$   $Y_i(1)$   $D_i$
1         22,000     28,000      1
2         35,000     34,000      0
3         18,000     25,000      1
4         40,000     42,000      0
5         15,000     26,000      1
6         30,000     31,000      0

(a) Compute the individual treatment effect $\tau_i = Y_i(1) - Y_i(0)$ for each person. Which person benefits most from the training program? Which person is harmed?

(b) Compute the ATE, ATT, and ATU. Are they equal? Explain the economic interpretation of each.

(c) What is the observed outcome $Y_i$ for each person? Compute the naive difference in means $\hat{\Delta}_{\text{naive}} = \bar{Y}_{D=1} - \bar{Y}_{D=0}$. Is this a good estimate of the ATE? Decompose the naive estimate into ATT + selection bias.

(d) A policymaker wants to decide whether to expand the program to the untreated group. Which estimand — ATE, ATT, or ATU — is most relevant for this decision? Why?
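
Parts (a)-(c) can be checked with a few lines of plain Python. The values below are copied from the table, and the last line mirrors the decomposition in part (c):

```python
# Values from the table above; a quick check for parts (a)-(c).
y0 = [22_000, 35_000, 18_000, 40_000, 15_000, 30_000]
y1 = [28_000, 34_000, 25_000, 42_000, 26_000, 31_000]
d = [1, 0, 1, 0, 1, 0]

tau = [a - b for a, b in zip(y1, y0)]            # individual effects
ate = sum(tau) / len(tau)
att = sum(t for t, di in zip(tau, d) if di == 1) / sum(d)
atu = sum(t for t, di in zip(tau, d) if di == 0) / (len(d) - sum(d))

# Observed outcomes and the naive difference in means
y_obs = [b if di else a for a, b, di in zip(y0, y1, d)]
mean_treated = sum(y for y, di in zip(y_obs, d) if di) / sum(d)
mean_control = sum(y for y, di in zip(y_obs, d) if not di) / (len(d) - sum(d))
naive = mean_treated - mean_control

# Naive = ATT + selection bias, so the bias is recoverable by subtraction
selection_bias = naive - att
```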


Exercise 16.2 (*)

For the MediCore Drug X scenario, define the potential outcomes for the following alternative outcomes (instead of readmission):

(a) Length of hospital stay (continuous, in days). Write the ATE in terms of $Y(0)$ and $Y(1)$. What would a negative ATE mean?

(b) 30-day mortality (binary, 0/1). Write the ATT. Why is the ATT particularly relevant when the drug is given only to high-risk patients?

(c) Quality-adjusted life years (QALYs) over 5 years. Why might $Y_i(1)$ depend on when during the disease progression Drug X is administered, and what does this imply for the consistency assumption?


Exercise 16.3 (*)

For the StreamRec recommendation system, consider three different definitions of "treatment":

  • T1: Item appears anywhere in the user's homepage.
  • T2: Item appears in the top-3 positions of the homepage carousel.
  • T3: User receives a push notification about the item.

(a) For each treatment definition, explain what $Y_i(1)$ and $Y_i(0)$ represent. Are the potential outcomes the same across definitions?

(b) Which treatment definition is most likely to satisfy the consistency assumption? Why?

(c) Which treatment definition is most likely to violate positivity? Why?

(d) For T2 specifically, what is the difference between $\mathbb{E}[Y \mid D = 1]$ (average engagement for items shown in top-3) and $\text{ATE}$ (causal effect of top-3 placement)? Which is larger, and why?


Exercise 16.4 (*)

Determine whether SUTVA is plausible in each scenario. If it is violated, state which component fails and explain the mechanism:

(a) Estimating the effect of a new textbook on student test scores, where the textbook is assigned to individual students within the same classroom.

(b) Estimating the effect of a promotional discount on purchase probability, where the discount is randomly sent to individual users of an e-commerce platform.

(c) Estimating the effect of a ride-sharing surge pricing algorithm on driver supply, where the algorithm operates at the city level.

(d) Estimating the effect of a COVID-19 vaccine on individual infection risk.

(e) Estimating the effect of a specific movie recommendation on a user's watch time, where the user shares a household account with three other family members.


Estimands and Selection Bias

Exercise 16.5 (**)

Prove that $\text{ATE} = P(D=1) \cdot \text{ATT} + P(D=0) \cdot \text{ATU}$.

Hint: Start from $\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$ and apply the law of total expectation, conditioning on $D$.
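
Before writing the proof, it can help to see the identity hold exactly on a simulated finite population with selection on gains (a hypothetical data-generating process; this checks the algebra, it is not a substitute for the proof):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 10_000
y0 = rng.normal(0, 1, n)
tau = rng.normal(1, 2, n)                            # heterogeneous effects
y1 = y0 + tau
d = rng.binomial(1, 1 / (1 + np.exp(-(tau - 1))))    # selection on gains

ate = tau.mean()
att = tau[d == 1].mean()
atu = tau[d == 0].mean()
p1 = d.mean()

# The identity holds exactly, sample by sample, not just in expectation
assert abs(ate - (p1 * att + (1 - p1) * atu)) < 1e-9
```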


Exercise 16.6 (**)

A hospital introduces a new discharge protocol. Doctors selectively apply the protocol to patients they judge are at high risk of readmission. The observational data show that patients receiving the new protocol have higher readmission rates than patients receiving standard care.

(a) Can we conclude that the new protocol causes more readmissions? Formally decompose the naive comparison into ATT + selection bias and explain the likely direction of the selection bias.

(b) Suppose the true ATT is $-0.05$ (the protocol reduces readmission by 5 percentage points among those who receive it). What does this imply about the magnitude of the selection bias?

(c) A hospital administrator proposes comparing readmission rates "before and after" the protocol was introduced (all patients in January vs. all patients in February). Under what conditions would this give an unbiased estimate of the ATE? What are the threats?


Exercise 16.7 (**)

Consider a setting where the treatment effect is heterogeneous across a binary covariate $G \in \{0, 1\}$:

Group    $P(G)$   $\text{ATE}(G)$   $P(D=1 \mid G)$
$G=0$     0.6           2.0               0.3
$G=1$     0.4           8.0               0.9

(a) Compute the population ATE.

(b) Compute the ATT. Show that it differs from the ATE.

(c) Compute the ATU. Is there a case for expanding treatment to the untreated group?

(d) A naive analyst computes the overall difference in means and finds an estimate of 6.5. Compute the selection bias and explain its source.
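
Once you have worked parts (a)-(c) by hand, a short script can confirm the numbers. It assumes, as the table implies, that the average effect within each group does not further depend on $D$, and reweights groups via Bayes' rule:

```python
# Subgroup quantities from the table
p_g = {0: 0.6, 1: 0.4}
ate_g = {0: 2.0, 1: 8.0}
p_d1_g = {0: 0.3, 1: 0.9}

p_d1 = sum(p_g[g] * p_d1_g[g] for g in (0, 1))            # P(D=1)
p_g_d1 = {g: p_g[g] * p_d1_g[g] / p_d1 for g in (0, 1)}   # P(G=g | D=1)
p_g_d0 = {g: p_g[g] * (1 - p_d1_g[g]) / (1 - p_d1) for g in (0, 1)}

ate = sum(p_g[g] * ate_g[g] for g in (0, 1))
att = sum(p_g_d1[g] * ate_g[g] for g in (0, 1))
atu = sum(p_g_d0[g] * ate_g[g] for g in (0, 1))

# Consistency check against the decomposition from Exercise 16.5
assert abs(ate - (p_d1 * att + (1 - p_d1) * atu)) < 1e-12
```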


Exercise 16.8 (**)

Sign of the selection bias. For each scenario, predict the direction of the selection bias in the naive comparison (positive bias = overstates the effect, negative bias = understates it) and explain your reasoning:

(a) Estimating the effect of attending an elite university on lifetime earnings, comparing elite university graduates to non-elite graduates.

(b) Estimating the effect of a new cancer treatment on survival, where the treatment is offered only to patients whose cancer has not responded to standard therapy.

(c) Estimating the effect of a recommendation algorithm on purchase conversion, where the algorithm recommends items users are already likely to purchase.

(d) Estimating the effect of exercise on cardiovascular health, comparing people who exercise regularly to those who do not.


Identification Assumptions

Exercise 16.9 (**)

For MediCore's Drug X study, a colleague argues: "We have 47 covariates in the EHR, so we can control for everything. Conditional ignorability holds."

(a) Explain why having many covariates is neither necessary nor sufficient for conditional ignorability.

(b) Name three plausible confounders that might not be recorded in an EHR but could affect both Drug X prescribing and 30-day readmission.

(c) Name a variable that IS recorded in the EHR but that you should NOT control for because doing so would introduce bias rather than remove it. Hint: think about variables that are caused by the treatment.


Exercise 16.10 (**)

Positivity violations in practice. Consider a study of a chemotherapy drug where:

  • Patients under 30: $P(D=1) = 0.02$
  • Patients 30-65: $P(D=1) = 0.45$
  • Patients over 65: $P(D=1) = 0.88$

(a) Is positivity violated? Distinguish between deterministic and practical violations.

(b) For which age group(s) is causal estimation most reliable? Why?

(c) A researcher proposes restricting the analysis to the 30-65 age group. What is gained and what is lost?

(d) What does the estimated ATE from the 30-65 group generalize to? Is it the population ATE?
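
For part (a), a minimal screen for practical positivity violations (the 0.05 cutoff is a common rule of thumb, not a chapter prescription):

```python
def flag_positivity(stratum_probs: dict[str, float], eps: float = 0.05) -> dict[str, bool]:
    """Flag strata whose treatment probability falls outside [eps, 1 - eps]."""
    return {s: not (eps <= p <= 1 - eps) for s, p in stratum_probs.items()}

# Treatment probabilities from the chemotherapy study above
flags = flag_positivity({"under_30": 0.02, "30_to_65": 0.45, "over_65": 0.88})
```

Note that the over-65 stratum passes this 0.05 screen despite $P(D=1) = 0.88$; part (a) asks you to reason about whether such a cutoff is comfortable.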


Exercise 16.11 (***)

Formalizing the adjustment formula. Prove that under SUTVA, conditional ignorability, and positivity:

$$\text{ATE} = \mathbb{E}_{\mathbf{X}}\!\left[\mathbb{E}[Y \mid D=1, \mathbf{X}] - \mathbb{E}[Y \mid D=0, \mathbf{X}]\right]$$

Hint: Start from $\text{ATE} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]$. Apply the law of iterated expectations to condition on $\mathbf{X}$. Use conditional ignorability to replace $\mathbb{E}[Y(d) \mid \mathbf{X}]$ with $\mathbb{E}[Y(d) \mid D = d, \mathbf{X}]$. Use consistency to replace $Y(d)$ with $Y$ when $D = d$.
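
Written out, the chain of equalities that the hint describes is:

```latex
\begin{aligned}
\mathbb{E}[Y(d)]
  &= \mathbb{E}_{\mathbf{X}}\!\left[\mathbb{E}[Y(d) \mid \mathbf{X}]\right]
     && \text{(iterated expectations)} \\
  &= \mathbb{E}_{\mathbf{X}}\!\left[\mathbb{E}[Y(d) \mid D = d, \mathbf{X}]\right]
     && \text{(conditional ignorability; positivity makes the inner term well defined)} \\
  &= \mathbb{E}_{\mathbf{X}}\!\left[\mathbb{E}[Y \mid D = d, \mathbf{X}]\right]
     && \text{(consistency)}
\end{aligned}
```

Subtracting the $d = 0$ expression from the $d = 1$ expression gives the adjustment formula; the remaining work in the exercise is justifying each step from the formal definitions.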


Exercise 16.12 (***)

Interference in marketplace experiments. StreamRec decides to run an A/B test where 50% of users see Algorithm A and 50% see Algorithm B. However, the platform has a "trending" section that surfaces items that many users engage with.

(a) Explain how SUTVA (no interference) is violated: how can one user's treatment assignment affect another user's outcome?

(b) In which direction would this interference bias the estimated treatment effect? Hint: think about what happens when Algorithm A drives engagement with item X, which then appears in the "trending" section for users assigned to Algorithm B.

(c) Propose a modified experimental design that mitigates this interference. What are the tradeoffs?


Regression Adjustment and OVB

Exercise 16.13 (**)

Modify the simulate_recommendation_data function from Section 16.4 to include a second confounder: device_type (binary: 0 = mobile, 1 = desktop). Desktop users have higher baseline engagement and are more likely to be recommended items.

(a) Simulate 10,000 observations. Compute the naive difference in means.

(b) Run regression adjustment controlling for preference only. Then run regression adjustment controlling for both preference and device_type. Compare the bias in each case.

(c) How large is the omitted variable bias when device_type is excluded? Verify that it matches the OVB formula $\beta_2 \cdot \delta$.

def simulate_with_device(
    n_users: int = 10000,
    true_ate: float = 2.0,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate data with two confounders: preference and device_type.

    Args:
        n_users: Number of observations.
        true_ate: True average treatment effect.
        seed: Random seed.

    Returns:
        DataFrame with both confounders and potential outcomes.
    """
    rng = np.random.RandomState(seed)

    preference = rng.normal(0, 1, n_users)
    device_type = rng.binomial(1, 0.4, n_users)  # 40% desktop

    # Treatment depends on both confounders
    rec_prob = 1 / (1 + np.exp(-(
        1.5 * preference + 0.8 * device_type
    )))
    treatment = rng.binomial(1, rec_prob)

    # Outcome depends on both confounders
    y0 = (
        10
        + 3 * preference
        + 4 * device_type  # Desktop users have higher engagement
        + rng.normal(0, 2, n_users)
    )
    y1 = y0 + true_ate
    y_obs = treatment * y1 + (1 - treatment) * y0

    return pd.DataFrame({
        "preference": preference,
        "device_type": device_type,
        "treatment": treatment,
        "y0": y0, "y1": y1,
        "y_obs": y_obs,
        "true_ite": y1 - y0,
    })
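
For part (c), the OVB identity can be verified numerically. The sketch below re-simulates the two-confounder design inline (same coefficients as simulate_with_device) and uses plain least squares so it stands alone:

```python
import numpy as np

# Re-simulate the two-confounder design and verify the in-sample OVB identity.
rng = np.random.RandomState(42)
n = 20_000
preference = rng.normal(0, 1, n)
device = rng.binomial(1, 0.4, n)
prob = 1 / (1 + np.exp(-(1.5 * preference + 0.8 * device)))
d = rng.binomial(1, prob)
y = 10 + 2.0 * d + 3 * preference + 4 * device + rng.normal(0, 2, n)

def ols(target, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(target)), *cols])
    return np.linalg.lstsq(X, target, rcond=None)[0]

long_fit = ols(y, d, preference, device)   # [const, tau, beta1, beta2]
short_fit = ols(y, d, preference)          # device_type omitted
aux_fit = ols(device, d, preference)       # omitted variable on included ones

tau_long, beta2 = long_fit[1], long_fit[3]
tau_short, delta = short_fit[1], aux_fit[1]

# OVB identity (exact in-sample): tau_short = tau_long + beta2 * delta
bias = tau_short - tau_long
```

The identity `tau_short = tau_long + beta2 * delta` is exact OLS algebra, so it holds to machine precision even in a finite sample, while `tau_long` recovers the true ATE only up to sampling noise.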

Exercise 16.14 (***)

Derive the OVB formula. Consider the true model $Y = \alpha + \tau D + \beta_1 X + \beta_2 U + \varepsilon$ and the short regression $Y = \tilde{\alpha} + \tilde{\tau} D + \tilde{\beta}_1 X + \tilde{\varepsilon}$ where $U$ is omitted.

(a) Using the Frisch-Waugh-Lovell theorem or direct algebra, show that the probability limit of $\tilde{\tau}$ is:

$$\text{plim}(\tilde{\tau}) = \tau + \beta_2 \cdot \frac{\text{Cov}(\tilde{D}, U)}{\text{Var}(\tilde{D})}$$

where $\tilde{D}$ is the residual from regressing $D$ on $X$ (and a constant).

(b) Under what conditions is the OVB zero even when $U$ is a confounder (affects both $D$ and $Y$)?

(c) Suppose $\beta_2 > 0$ (the omitted confounder increases the outcome) and $\text{Cov}(\tilde{D}, U) > 0$ (the omitted confounder increases treatment probability, conditional on $X$). Is the naive estimate biased upward or downward? Connect this to the MediCore example where healthier patients are more likely to receive Drug X.


Exercise 16.15 (***)

Regression adjustment with interaction effects. The linear regression $Y = \alpha + \tau D + \boldsymbol{\beta}^\top \mathbf{X} + \varepsilon$ assumes a constant treatment effect $\tau$. When treatment effects are heterogeneous, this is misspecified.

(a) Show that the saturated linear model with interactions:

$$Y = \alpha + \tau D + \boldsymbol{\beta}^\top \mathbf{X} + D \cdot \boldsymbol{\gamma}^\top \mathbf{X} + \varepsilon$$

allows for heterogeneous treatment effects $\tau + \boldsymbol{\gamma}^\top \mathbf{X}$.

(b) Under conditional ignorability, what does $\hat{\tau}$ estimate in this model? Is it the ATE, the ATT, or something else?

(c) Write a Python function that estimates the ATE from the interacted regression by averaging the predicted treatment effects $\hat{\tau} + \hat{\boldsymbol{\gamma}}^\top \mathbf{X}_i$ across all units:

def regression_adjustment_interacted(
    y: np.ndarray,
    treatment: np.ndarray,
    X: np.ndarray,
) -> dict:
    """Estimate ATE using regression with treatment-covariate interactions.

    Fits Y = alpha + tau*D + beta'X + gamma'(D*X) + epsilon, then
    averages the individual predicted effects tau + gamma'X_i.

    Args:
        y: Observed outcomes of shape (n,).
        treatment: Binary treatment vector of shape (n,).
        X: Covariate matrix of shape (n, p).

    Returns:
        Dictionary with the ATE estimate, the main-effect coefficient
        tau_hat, and the interaction coefficients gamma_hat.
    """
    n, p = X.shape
    interactions = treatment.reshape(-1, 1) * X

    design = np.column_stack([
        np.ones(n), treatment, X, interactions
    ])
    model = sm.OLS(y, design).fit(cov_type="HC1")

    # Individual treatment effects: tau + gamma' X_i
    tau_hat = model.params[1]
    gamma_hat = model.params[2 + p: 2 + 2 * p]
    individual_effects = tau_hat + X @ gamma_hat

    return {
        "ate_estimate": float(individual_effects.mean()),
        "tau_hat": float(tau_hat),
        "gamma_hat": gamma_hat.tolist(),
    }

Apply this to the simulate_streamrec_causal data from Section 16.12 and compare the ATE estimate to the non-interacted regression.


Exercise 16.16 (**)

Covariate balance as a diagnostic. Using the simulate_streamrec_causal data from Section 16.12:

(a) Compute the standardized mean difference (SMD) for all five covariates. Which covariates are most imbalanced?

(b) Create a Love plot (dot plot of SMDs before adjustment). The conventional threshold is $|\text{SMD}| < 0.1$. How many covariates are below this threshold?

(c) After regression adjustment, compute the "effective" balance by examining the partial correlations between treatment and each covariate, conditional on all other covariates. Does regression adjustment improve balance?
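
Part (a) requires an SMD helper. A minimal version, using the common convention of dividing by the pooled standard deviation of the two groups (some references divide by the treated-group standard deviation instead):

```python
import numpy as np

def standardized_mean_difference(x: np.ndarray, treatment: np.ndarray) -> float:
    """SMD: difference in group means over the pooled standard deviation."""
    xt, xc = x[treatment == 1], x[treatment == 0]
    pooled_sd = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2)
    return float((xt.mean() - xc.mean()) / pooled_sd)
```

Apply it column by column to the covariate matrix to build the Love plot in part (b).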


Simulation Studies

Exercise 16.17 (***)

Monte Carlo evaluation of the difference-in-means estimator. Write a simulation that:

(a) Generates 1,000 datasets from the simulate_recommendation_data function (varying the seed). For each dataset, computes the naive difference in means. Plot the sampling distribution of the naive estimator. Is it centered on the true ATE?

(b) Repeats the exercise with randomized treatment assignment (using simulate_rct). Compare the two sampling distributions. What does this demonstrate about bias vs. variance?

(c) Computes the coverage probability of the 95% CI for both estimators: what fraction of the 1,000 CIs contain the true ATE? A well-calibrated CI should have coverage close to 95%.

def monte_carlo_bias_study(
    n_simulations: int = 1000,
    n_users: int = 5000,
    true_ate: float = 2.0,
) -> pd.DataFrame:
    """Run Monte Carlo study comparing naive and RCT estimators.

    Args:
        n_simulations: Number of simulation repetitions.
        n_users: Sample size per simulation.
        true_ate: True average treatment effect.

    Returns:
        DataFrame with estimates from each simulation.
    """
    results = []

    for sim in range(n_simulations):
        # Observational (biased)
        df_obs = simulate_recommendation_data(n_users=n_users, seed=sim)
        res_obs = difference_in_means(
            df_obs["y_obs"].values, df_obs["treatment"].values
        )

        # Randomized (unbiased)
        df_rct = simulate_rct(n_users=n_users, true_ate=true_ate, seed=sim)
        res_rct = difference_in_means(
            df_rct["y_obs"].values, df_rct["treatment"].values
        )

        results.append({
            "sim": sim,
            "obs_estimate": res_obs["estimate"],
            "obs_covers": res_obs["ci_lower"] <= true_ate <= res_obs["ci_upper"],
            "rct_estimate": res_rct["estimate"],
            "rct_covers": res_rct["ci_lower"] <= true_ate <= res_rct["ci_upper"],
        })

    return pd.DataFrame(results)

Exercise 16.18 (***)

Sensitivity to unmeasured confounding. Using the simulate_omitted_confounder function from Section 16.9:

(a) Vary gamma_u_to_d (the effect of $U$ on treatment) from 0 to 2 in steps of 0.25, holding beta_u_to_y = 3.0 fixed. Plot the regression estimate (controlling for $X$ only) against gamma_u_to_d. How does the bias scale with the confounding strength?

(b) Vary beta_u_to_y (the effect of $U$ on outcome) from 0 to 5 in steps of 0.5, holding gamma_u_to_d = 1.0 fixed. Plot the regression estimate against beta_u_to_y. How does the bias scale?

(c) Create a 2D heatmap with gamma_u_to_d on one axis and beta_u_to_y on the other, showing the bias of the regression estimate. Overlay the contour where the estimated effect changes sign. Interpret this contour in terms of sensitivity analysis: "The estimated effect is robust unless the unmeasured confounder has $\gamma > \_ $ and $\beta > \_$."
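
A self-contained starting point for part (a), using a hypothetical inline data-generating process that mirrors the structure of simulate_omitted_confounder (true ATE fixed at 2.0; the coefficients here are illustrative, not the chapter's):

```python
import numpy as np

def bias_at(gamma_u_to_d: float, beta_u_to_y: float = 3.0,
            n: int = 50_000, seed: int = 0) -> float:
    """Bias of the X-only regression when U is unmeasured."""
    rng = np.random.RandomState(seed)
    x = rng.normal(0, 1, n)
    u = rng.normal(0, 1, n)                       # unmeasured confounder
    p = 1 / (1 + np.exp(-(x + gamma_u_to_d * u)))
    d = rng.binomial(1, p)
    y = 2.0 * d + 3 * x + beta_u_to_y * u + rng.normal(0, 1, n)
    design = np.column_stack([np.ones(n), d, x])  # U omitted from the design
    tau_hat = np.linalg.lstsq(design, y, rcond=None)[0][1]
    return float(tau_hat - 2.0)                   # bias relative to true ATE

biases = [bias_at(g) for g in np.arange(0, 2.25, 0.25)]
```

At `gamma_u_to_d = 0` the omitted $U$ affects only the outcome, so the bias should vanish; it grows as the $U \to D$ channel strengthens.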


Exercise 16.19 (***)

Positivity failure simulation. Modify simulate_recommendation_data so that treatment is deterministic for extreme values of preference:

def simulate_near_positivity_violation(
    n_users: int = 10000,
    deterministic_threshold: float = 1.5,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate data with near-positivity violations.

    Users with preference > threshold receive treatment with probability
    0.99; users with preference < -threshold, with probability 0.01.

    Args:
        n_users: Number of observations.
        deterministic_threshold: Threshold for deterministic assignment.
        seed: Random seed.

    Returns:
        DataFrame with potential positivity violations.
    """
    rng = np.random.RandomState(seed)
    preference = rng.normal(0, 1, n_users)

    # Probabilistic in the middle, deterministic at extremes
    rec_prob = np.where(
        preference > deterministic_threshold, 0.99,
        np.where(
            preference < -deterministic_threshold, 0.01,
            1 / (1 + np.exp(-preference))
        ),
    )
    treatment = rng.binomial(1, rec_prob)

    y0 = 10 + 3 * preference + rng.normal(0, 2, n_users)
    y1 = y0 + 2.0
    y_obs = treatment * y1 + (1 - treatment) * y0

    return pd.DataFrame({
        "preference": preference,
        "treatment": treatment,
        "rec_prob": rec_prob,
        "y0": y0, "y1": y1,
        "y_obs": y_obs,
    })

(a) Estimate the ATE using regression adjustment on the full sample. What is the bias?

(b) Restrict the sample to the "overlap region" ($|\text{preference}| < 1.5$) and re-estimate. How does the bias change?

(c) What estimand does the restricted estimate identify? Is it the population ATE or a local ATE? Why might this still be useful?


Advanced and Applied Problems

Exercise 16.20 (***)

Super-population vs. finite-population inference. The chapter presented the finite-population perspective where potential outcomes are fixed numbers. In the super-population perspective, units are i.i.d. draws from an infinite population, and potential outcomes are random variables.

(a) Under the super-population model $(Y(0), Y(1), \mathbf{X}) \sim P$, write the ATE as a population parameter. How does this differ from the finite-population ATE $\frac{1}{N} \sum_{i=1}^N [Y_i(1) - Y_i(0)]$?

(b) Show that the difference-in-means estimator is consistent for the super-population ATE under random sampling and randomized treatment assignment.

(c) Under what conditions do finite-population and super-population inference give different answers? When does the distinction matter in practice?


Exercise 16.21 (****)

The Neyman variance estimator is conservative. Show that:

$$\widehat{\text{Var}}(\hat{\tau}) = \frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}$$

weakly overestimates the true variance of the difference-in-means estimator, and is unbiased exactly when the treatment effect is constant ($\tau_i = \tau$ for all $i$).

Hint: Compare the Neyman variance to the exact variance $\frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_{01}^2}{N}$, where $S_{01}^2$ is the finite-population variance of the unit-level effects $\tau_i$. The sample variances $s_1^2$ and $s_0^2$ are unbiased for $S_1^2$ and $S_0^2$, so the Neyman estimator exceeds the true variance in expectation by $\frac{S_{01}^2}{N} \geq 0$. What happens to $S_{01}^2$ when the treatment effect is constant?


Exercise 16.22 (****)

Regression adjustment in randomized experiments: Lin (2013). Lin showed that in randomized experiments with heterogeneous treatment effects, the standard regression adjustment $Y = \alpha + \tau D + \boldsymbol{\beta}^\top \mathbf{X} + \varepsilon$ can be biased for the ATE if the relationship between $\mathbf{X}$ and $Y$ differs across treatment groups.

(a) Read Lin (2013), "Agnostic notes on regression adjustments to experimental data." Explain why the interacted regression $Y = \alpha + \tau D + \boldsymbol{\beta}^\top (\mathbf{X} - \bar{\mathbf{X}}) + D \cdot \boldsymbol{\gamma}^\top (\mathbf{X} - \bar{\mathbf{X}}) + \varepsilon$ (with demeaned covariates) is guaranteed to be asymptotically at least as efficient as the unadjusted estimator, even under misspecification.

(b) Implement Lin's estimator in Python:

def lin_regression_adjustment(
    y: np.ndarray,
    treatment: np.ndarray,
    X: np.ndarray,
) -> dict[str, float]:
    """Lin (2013) regression adjustment with demeaned interactions.

    Centers covariates at their grand mean, includes treatment-covariate
    interactions, and uses HC2 standard errors.

    Args:
        y: Observed outcomes.
        treatment: Binary treatment.
        X: Covariates.

    Returns:
        Dictionary with ATE estimate and robust standard error.
    """
    X_demeaned = X - X.mean(axis=0)
    interactions = treatment.reshape(-1, 1) * X_demeaned

    design = sm.add_constant(
        np.column_stack([treatment, X_demeaned, interactions])
    )
    model = sm.OLS(y, design).fit(cov_type="HC2")

    return {
        "ate_estimate": float(model.params[1]),
        "se": float(model.bse[1]),
        "ci_lower": float(model.conf_int()[1, 0]),
        "ci_upper": float(model.conf_int()[1, 1]),
    }

(c) Using simulate_streamrec_causal with randomized treatment, compare three estimators: unadjusted difference in means, standard regression adjustment, and Lin's estimator. Show that Lin's estimator has lower variance than the unadjusted estimator.


Exercise 16.23 (**)

Connecting potential outcomes to prediction. A data scientist at MediCore builds a predictive model $\hat{f}(\mathbf{X}) = \hat{P}(Y = 1 \mid \mathbf{X})$ to predict 30-day readmission.

(a) Is $\hat{f}(\mathbf{X})$ an estimate of $P(Y(0) = 1 \mid \mathbf{X})$, $P(Y(1) = 1 \mid \mathbf{X})$, or neither? Explain.

(b) The hospital uses $\hat{f}(\mathbf{X})$ to decide who receives Drug X: patients with $\hat{f}(\mathbf{X}) > 0.3$ get the drug. After deployment, readmission rates drop among treated patients. Can we conclude that Drug X is effective? Why or why not?

(c) What additional information would be needed to convert this predictive model into a causal estimate of Drug X's effect? Connect your answer to the assumptions of this chapter.


Exercise 16.24 (***)

Implementing a complete causal analysis pipeline. Write a function that takes observational data and performs a full causal analysis following the identification checklist from Section 16.11:

def causal_analysis_pipeline(
    y: np.ndarray,
    treatment: np.ndarray,
    X: np.ndarray,
    feature_names: list[str],
) -> dict:
    """Complete causal analysis pipeline.

    Performs:
    1. Naive difference in means
    2. Covariate balance check
    3. Positivity diagnostics
    4. Regression adjustment
    5. Comparison of estimates

    Args:
        y: Observed outcomes.
        treatment: Binary treatment.
        X: Covariate matrix.
        feature_names: Names for each covariate.

    Returns:
        Dictionary with all analysis results.
    """
    results = {}

    # Step 1: Naive estimate
    results["naive"] = difference_in_means(y, treatment)

    # Step 2: Covariate balance
    results["balance"] = check_covariate_balance(
        X, treatment, feature_names
    ).to_dict("records")

    # Step 3: Positivity
    results["positivity"] = diagnose_positivity(
        X, treatment, feature_names
    )

    # Step 4: Regression adjustment
    results["regression"] = regression_adjustment(
        y, treatment, X, feature_names
    )

    # Step 5: Summary
    results["bias_reduction"] = (
        abs(results["naive"]["estimate"] - results["regression"]["estimate"])
    )

    return results

Apply this pipeline to the simulate_streamrec_causal data. Interpret every component of the output.


Exercise 16.25 (****)

Fundamental limits of observational inference. Consider the following thought experiment. Two researchers analyze the same MediCore dataset:

  • Researcher A controls for $\mathbf{X}_A = \{\text{age}, \text{severity}, \text{comorbidities}\}$ and estimates $\hat{\tau}_A = -0.08$.
  • Researcher B controls for $\mathbf{X}_B = \{\text{age}, \text{severity}, \text{comorbidities}, \text{physician preference}\}$ and estimates $\hat{\tau}_B = -0.03$.

(a) Can both estimates be correct (unbiased for the ATE)? Under what conditions?

(b) Researcher B argues: "My estimate is better because I control for more confounders." Under what conditions is this wrong? Hint: think about what happens if physician preference is an instrumental variable rather than a confounder.

(c) Design a simulation that demonstrates a case where controlling for an additional variable increases bias rather than decreasing it. This should involve a collider or an intermediate variable on the causal path. Hint: If $Z$ is a collider ($D \to Z \leftarrow Y$), conditioning on $Z$ opens a non-causal path between $D$ and $Y$.

(d) What does this example imply about the common advice to "control for as many variables as possible"? When is this advice correct, and when is it harmful?
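
For part (c), one possible data-generating process: treatment is randomized and has zero effect on the outcome, and $Z$ is a pure collider. "Controlling" for $Z$ then manufactures a nonzero coefficient on $D$:

```python
import numpy as np

# Hypothetical DGP: randomized D with ZERO true effect on Y, and a
# collider Z caused by both treatment and outcome (D -> Z <- Y).
rng = np.random.RandomState(1)
n = 50_000
d = rng.binomial(1, 0.5, n).astype(float)
y = rng.normal(0, 1, n)                  # independent of d: true effect is 0
z = d + y + rng.normal(0, 1, n)          # collider

def coef_on_d(*controls):
    """Coefficient on d from an OLS regression of y on d and controls."""
    X = np.column_stack([np.ones(n), d, *controls])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

unadjusted = coef_on_d()        # close to the true effect, zero
adjusted = coef_on_d(z)         # conditioning on the collider: biased
```

For this particular DGP the adjusted coefficient converges to $-0.5$ even though the true effect is zero: adding the "control" creates the bias rather than removing it.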


Exercise 16.26 (***)

The table of science. Reproduce Holland's (1986) "table of science" for the StreamRec setting. Create a table with 10 hypothetical user-item pairs showing:

  • User features (preference, activity)
  • Item features (popularity)
  • Both potential outcomes $Y_i(0)$, $Y_i(1)$
  • Treatment assignment $D_i$
  • Observed outcome $Y_i$
  • Missing (counterfactual) outcome marked with "?"

Compute the true ATE, ATT, and ATU from the table. Then compute the naive difference in means using only the observed data. Explain to a non-technical stakeholder why the naive estimate differs from the true ATE.


Exercise 16.27 (**)

Randomization inference. Fisher's exact test uses permutation of treatment labels to test the sharp null hypothesis $H_0: Y_i(1) = Y_i(0)$ for all $i$ (zero effect for every unit).

(a) Using the RCT data from simulate_rct (with n_users=20), implement Fisher's randomization test:

def fisher_exact_test(
    y: np.ndarray,
    treatment: np.ndarray,
    n_permutations: int = 10000,
    seed: int = 42,
) -> dict:
    """Fisher's randomization test for the sharp null.

    Under H0: Y_i(1) = Y_i(0) for all i, the treatment labels are
    exchangeable. The p-value is the fraction of permutations that
    produce a test statistic as extreme as the observed.

    Args:
        y: Observed outcomes.
        treatment: Binary treatment.
        n_permutations: Number of random permutations.
        seed: Random seed.

    Returns:
        Dictionary with observed statistic, p-value, and null distribution.
    """
    rng = np.random.RandomState(seed)
    n1 = int(treatment.sum())

    observed_diff = y[treatment == 1].mean() - y[treatment == 0].mean()

    null_diffs = np.zeros(n_permutations)
    for k in range(n_permutations):
        perm = rng.permutation(len(y))
        perm_treated = perm[:n1]
        perm_control = perm[n1:]
        null_diffs[k] = y[perm_treated].mean() - y[perm_control].mean()

    p_value = np.mean(np.abs(null_diffs) >= np.abs(observed_diff))

    return {
        "observed_diff": float(observed_diff),
        "p_value": float(p_value),
        "null_diffs": null_diffs,
    }

(b) Run the test with $N = 20$ and true ATE = 2.0. Do you reject the sharp null at $\alpha = 0.05$? How does the result change with $N = 200$?

(c) The sharp null ($Y_i(1) = Y_i(0)$ for all $i$) is stronger than the weak null ($\mathbb{E}[Y(1)] = \mathbb{E}[Y(0)]$). Give an example where the weak null holds but the sharp null does not.


Exercise 16.28 (**)

Connecting to Chapter 15. Return to the Simpson's paradox example from Chapter 15. Recast it in the potential outcomes framework:

(a) Define $Y(0)$, $Y(1)$, $D$, and the confounder $\mathbf{X}$.

(b) Show that the aggregate comparison (ignoring $\mathbf{X}$) is biased because $\{Y(0), Y(1)\} \not\perp\!\!\!\perp D$, while the stratified comparison (conditioning on $\mathbf{X}$) is unbiased because $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid \mathbf{X}$.

(c) Compute the ATE using the adjustment formula and verify it matches the stratified analysis from Chapter 15.


Exercise 16.29 (***)

Heterogeneous treatment effect detection. Using the simulate_streamrec_causal data (which has heterogeneous effects by design):

(a) Estimate the ATE for three subgroups: low activity ($< 0.5$), medium activity ($0.5$-$1.5$), and high activity ($> 1.5$). Use regression adjustment within each subgroup. How does the estimated ATE vary across subgroups?

(b) This is a preview of Chapter 19 (Causal Machine Learning). Run a regression of the true ITE on the covariates. Which covariates predict the largest treatment effects? Are these the same covariates that predict the outcome $Y$?

(c) Explain why predicting $Y$ well does not imply predicting $\tau_i$ well. What is the fundamental difference between a predictive model and a causal model in the context of heterogeneous effects?


Exercise 16.30 (****)

The stable unit treatment value assumption in network settings. Consider StreamRec's social features: users can follow each other and share recommendations.

(a) Formally define potential outcomes under interference. Let $\mathbf{D} = (D_1, \ldots, D_N)$ be the full treatment vector. Unit $i$'s potential outcome is $Y_i(\mathbf{D})$. How many potential outcomes does each unit have? Why is this intractable for even moderate $N$?

(b) Define the "neighborhood treatment" $G_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} D_j$ as the fraction of $i$'s friends who are treated. Under the assumption that $Y_i(\mathbf{D}) = Y_i(D_i, G_i)$ (outcomes depend only on own treatment and neighborhood treatment fraction), how many potential outcomes does each unit have?

(c) Define the direct effect $\mathbb{E}[Y(1, g) - Y(0, g)]$ (effect of own treatment, holding neighbors' treatment fixed) and the spillover effect $\mathbb{E}[Y(d, g') - Y(d, g)]$ (effect of neighbors' treatment, holding own treatment fixed). Why are both of these relevant for evaluating StreamRec's recommendation algorithm?