Chapter 19: Exercises

Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field


Meta-Learners

Exercise 19.1 (*)

Consider an e-commerce platform running a randomized promotion (20% discount) to a random 40% of users. The outcome is binary purchase ($Y \in \{0, 1\}$). You have 10 user covariates.

(a) Which meta-learner (S, T, X, R) would you use as a first attempt? Justify your choice considering the treatment/control ratio and outcome type.

(b) The platform reports that the overall conversion rate is 8% in control and 11% in treatment (ATE = 3 percentage points). Would you expect the S-learner to perform well or poorly in this setting? Explain in terms of regularization bias.

(c) A colleague suggests using a logistic regression as the base learner for the S-learner, with $D$ included as a feature. Show that the estimated CATE under this model is constant across all $x$. What modeling assumption makes this happen?


Exercise 19.2 (*)

For the MediCore Drug X scenario, suppose you have observational data with $n = 50,000$ patients, of whom 8,000 received Drug X ($D = 1$). The outcome is 30-day readmission (binary).

(a) Explain why the T-learner is particularly inefficient for this setting. What is the effective sample size for training $\hat{\mu}_1(x)$?

(b) Describe how the X-learner addresses this sample imbalance. Which group's outcome model ($\hat{\mu}_0$ or $\hat{\mu}_1$) will be more reliable, and how does the X-learner exploit this?

(c) Suppose the propensity score $e(x)$ ranges from 0.02 to 0.85. Which meta-learner is most sensitive to the near-zero propensity scores, and why?


Exercise 19.3 (**)

Implement the DR-Learner (Doubly Robust Learner), which uses the AIPW pseudo-outcome:

$$\tilde{Y}_i^{DR} = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)}$$

(a) Show that $\mathbb{E}[\tilde{Y}_i^{DR} \mid X_i] = \tau(X_i)$ when either the outcome models or the propensity model is correctly specified (double robustness).

(b) Implement the DR-learner in Python: (i) estimate $\hat{\mu}_1$, $\hat{\mu}_0$, and $\hat{e}$ with cross-fitting, (ii) compute $\tilde{Y}^{DR}$, (iii) regress $\tilde{Y}^{DR}$ on $X$ using gradient boosting.

(c) Compare the DR-learner to the T-learner and R-learner on the simulate_heterogeneous_treatment data from Section 19.1, using RMSE of the CATE as the evaluation metric.
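Step (ii) of part (b) reduces to a one-liner once the nuisance predictions are in hand. A minimal NumPy sketch (the function name is ours; `mu1_hat`, `mu0_hat`, and `e_hat` are assumed to be cross-fitted, out-of-fold predictions):

```python
import numpy as np

def dr_pseudo_outcome(y, d, mu1_hat, mu0_hat, e_hat):
    """AIPW pseudo-outcome; E[psi | X] = tau(X) under double robustness.

    All three nuisance arrays are assumed to be out-of-fold predictions.
    """
    return (mu1_hat - mu0_hat
            + d * (y - mu1_hat) / e_hat
            - (1 - d) * (y - mu0_hat) / (1 - e_hat))
```

Regressing the returned pseudo-outcome on $X$ with any flexible learner (step iii) gives the DR-learner CATE.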


Exercise 19.4 (**)

The R-learner involves dividing by the treatment residual $\tilde{D}_i = D_i - \hat{e}(X_i)$.

(a) What happens when $\hat{e}(X_i) \approx 1$ for a treated unit? What is $\tilde{D}_i$ in this case? What is the resulting pseudo-outcome $\tilde{Y}_i / \tilde{D}_i$?

(b) Propose two strategies to mitigate the numerical instability when $|\tilde{D}_i|$ is small. Implement both and compare their effect on CATE estimation accuracy.

(c) In the R-learner, why is it critical that $\hat{m}(x)$ and $\hat{e}(x)$ are estimated with cross-fitting rather than on the full sample? Illustrate with a simulation where in-sample nuisance estimation leads to biased CATE estimates.
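Two standard mitigations for part (b), sketched in NumPy (the threshold `eps` is a tuning choice, not a canonical value):

```python
import numpy as np

def clip_propensity(e_hat, eps=0.01):
    # Strategy 1: clip e_hat into [eps, 1 - eps] so the residual
    # D - e_hat stays bounded away from zero for all units
    return np.clip(e_hat, eps, 1.0 - eps)

def overlap_mask(e_hat, eps=0.05):
    # Strategy 2: trim units with poor overlap before fitting
    # the R-learner (returns a boolean mask of units to keep)
    return (e_hat >= eps) & (e_hat <= 1.0 - eps)
```

Clipping keeps every unit but biases the weights slightly; trimming changes the estimand to the overlap population. Comparing both on CATE accuracy is the point of the exercise.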


Exercise 19.5 (*)

You are given the following meta-learner estimates for 5 patients:

| Patient | S-learner $\hat{\tau}_S$ | T-learner $\hat{\tau}_T$ | X-learner $\hat{\tau}_X$ | R-learner $\hat{\tau}_R$ |
|---------|--------------------------|--------------------------|--------------------------|--------------------------|
| A       | 0.02                     | 0.15                     | 0.12                     | 0.10                     |
| B       | 0.01                     | -0.08                    | -0.05                    | -0.04                    |
| C       | 0.03                     | 0.22                     | 0.18                     | 0.16                     |
| D       | 0.01                     | 0.05                     | 0.04                     | 0.03                     |
| E       | 0.02                     | -0.12                    | -0.09                    | -0.07                    |

(a) Which meta-learner appears to shrink the CATE estimates toward zero? What mechanism causes this?

(b) If these are MediCore patients and the treatment cost is equivalent to an effect of $\tau = 0.05$ (so treatment is worthwhile only when $\hat{\tau} > 0.05$), which patients should be treated under each meta-learner's targeting policy?

(c) The meta-learners disagree on whether patient B should be treated. What would you recommend to resolve this disagreement in a clinical setting?


Causal Forests

Exercise 19.6 (*)

Explain the difference between a standard random forest splitting criterion and a causal forest splitting criterion.

(a) Write the splitting criterion for a standard regression tree (minimizing within-node variance of $Y$).

(b) Write the splitting criterion for a causal tree (maximizing across-node variance of $\hat{\tau}$).

(c) Construct a simple example (6 observations with $X$, $D$, $Y$) where the optimal predictive split and the optimal causal split use different features.


Exercise 19.7 (**)

Honesty in causal forests.

(a) Define "honesty" in the context of causal forests. Why is it necessary for valid confidence intervals?

(b) A colleague argues that honesty wastes data by using half for splitting and half for estimation, and proposes using all data for both. Explain the specific bias that arises when the same data is used for splitting and estimation, using a concrete example.

(c) Does honesty affect the point estimate $\hat{\tau}(x)$, the confidence interval width, or both? Explain.


Exercise 19.8 (**)

Using EconML's CausalForestDML, fit a causal forest to the simulated MediCore data.

(a) Plot the distribution of estimated CATEs. What fraction of patients have $\hat{\tau}(x) > 0$ (treatment beneficial)? What fraction have $\hat{\tau}(x) < -0.05$ (treatment harmful)?

(b) Extract the feature importances for treatment effect heterogeneity. Which features drive the most variation in treatment effects? Compare with the feature importances from a standard gradient boosting model predicting $Y$ — are the top features the same?

(c) For a patient with $\text{genetic\_marker} = 1$, $\text{eGFR} = 80$, $\text{age} = 55$, compute the estimated CATE and its 95% confidence interval. Is the effect statistically significant at $\alpha = 0.05$?


Exercise 19.9 (***)

Implement a simplified causal tree (single tree, not forest) from scratch.

(a) Write a function that, for a given node, evaluates all candidate splits and selects the one that maximizes treatment effect heterogeneity:

$$\Delta = \frac{n_L \cdot n_R}{n} \cdot (\hat{\tau}_L - \hat{\tau}_R)^2$$

(b) Implement honest estimation: use a separate estimation sample to compute $\hat{\tau}$ in each leaf.

(c) Grow a depth-3 causal tree on the simulation data. Visualize the tree and interpret the splits. Do the top splits correspond to the true effect modifiers ($X_0$ and $X_1$)?
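One way to structure part (a) is to evaluate $\Delta$ at every candidate threshold of a feature, with a guard so both treatment arms appear in both children. A minimal sketch (function names are ours; `tau_hat` is the difference-in-means estimate within a node):

```python
import numpy as np

def tau_hat(y, d):
    # difference-in-means CATE estimate within a node
    return y[d == 1].mean() - y[d == 0].mean()

def best_split(x_col, y, d, min_per_arm=2):
    """Return (threshold, delta) maximizing the heterogeneity criterion
    delta = (n_L * n_R / n) * (tau_L - tau_R)^2 for one feature column."""
    n = len(y)
    best_thresh, best_delta = None, -np.inf
    for thresh in np.unique(x_col)[:-1]:
        left = x_col <= thresh
        right = ~left
        # both children need enough treated and control units
        counts = [d[left].sum(), (1 - d[left]).sum(),
                  d[right].sum(), (1 - d[right]).sum()]
        if min(counts) < min_per_arm:
            continue
        delta = (left.sum() * right.sum() / n
                 * (tau_hat(y[left], d[left]) - tau_hat(y[right], d[right])) ** 2)
        if delta > best_delta:
            best_thresh, best_delta = thresh, delta
    return best_thresh, best_delta
```

Recursing on the two children (and stopping at depth 3) gives the tree for part (c); the honest variant in part (b) re-estimates `tau_hat` in each leaf on a held-out estimation sample.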


Exercise 19.10 (**)

The causal forest produces confidence intervals $[\hat{\tau}(x) \pm z_{\alpha/2} \hat{\sigma}(x)]$.

(a) On the simulation data (where $\tau(x)$ is known), compute the empirical coverage of the 95% confidence intervals. Is it close to 95%?

(b) Plot the confidence interval width $2 \cdot 1.96 \cdot \hat{\sigma}(x)$ against the number of observations in the leaf (or local neighborhood). Does the relationship match the expected $O(1/\sqrt{n_{\text{local}}})$ scaling?

(c) Identify regions of the covariate space where the confidence intervals are widest. What do these regions have in common (in terms of propensity score, sample density, etc.)?
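Part (a) is a one-liner once the forest has produced $\hat{\tau}(x)$ and $\hat{\sigma}(x)$; a sketch (the function name is ours):

```python
import numpy as np

def empirical_coverage(tau_true, tau_hat, sigma_hat, z=1.96):
    # fraction of points whose nominal 95% CI contains the true CATE
    return np.mean(np.abs(tau_true - tau_hat) <= z * sigma_hat)
```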


Double Machine Learning

Exercise 19.11 (**)

Consider the partially linear model $Y = \theta D + g(X) + \epsilon$, where $X \in \mathbb{R}^{200}$ and $g(X)$ is sparse (depends on only 10 of the 200 features).

(a) Implement the naive "partialling out" approach: fit $\hat{g}(X)$ using LASSO, then regress $Y - \hat{g}(X)$ on $D$. Show that the estimate of $\theta$ is biased due to LASSO regularization.

(b) Implement the full DML estimator with cross-fitting. Verify that the DML estimate is less biased than the naive approach.

(c) Vary the number of cross-fitting folds $K \in \{2, 3, 5, 10\}$. Does the estimate of $\theta$ change substantially? What about the standard error?
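A cross-fitted DML sketch for part (b), using scikit-learn's `LassoCV` for both nuisance functions (assumes a continuous, or at least residualizable, treatment; the function name is ours):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def dml_plm(x, d, y, n_folds=5, seed=0):
    """Cross-fitted DML for Y = theta*D + g(X) + eps with LASSO nuisances."""
    y_res = np.empty(len(y))
    d_res = np.empty(len(d))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(x):
        # out-of-fold residualization of both Y and D
        y_res[test] = y[test] - LassoCV(cv=3).fit(x[train], y[train]).predict(x[test])
        d_res[test] = d[test] - LassoCV(cv=3).fit(x[train], d[train]).predict(x[test])
    # final stage: residual-on-residual OLS (the Neyman-orthogonal moment)
    return (d_res @ y_res) / (d_res @ d_res)
```

For part (a)'s naive variant, drop the residualization of $D$ and regress $Y - \hat{g}(X)$ on raw $D$; comparing the two bias patterns is the point of the exercise.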


Exercise 19.12 (***)

Neyman orthogonality derivation.

Consider the moment condition:

$$\psi(\theta; g, m) = (Y - g(X) - \theta D) \cdot (D - m(X))$$

(a) Show that $\mathbb{E}[\psi(\theta_0; g_0, m_0)] = 0$, where $\theta_0$, $g_0$, $m_0$ are the true values.

(b) Compute the pathwise derivative $\frac{\partial}{\partial r} \mathbb{E}[\psi(\theta_0; g_0 + r \cdot \delta_g, m_0)] \big|_{r=0}$ for an arbitrary perturbation $\delta_g$. Show that it equals zero (Neyman orthogonality with respect to $g$).

(c) Compute the same derivative with respect to $m$ and verify that it is also zero.

(d) Explain in plain language what Neyman orthogonality means: "Small errors in $\hat{g}$ and $\hat{m}$ do not cause first-order errors in $\hat{\theta}$."
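As a pointer for part (b): the moment is linear in $g$, so the pathwise derivative can be taken directly under the expectation. The key step, sketched (conditioning on $X$ does the work):

$$\frac{\partial}{\partial r} \, \mathbb{E}\left[(Y - g_0(X) - r\,\delta_g(X) - \theta_0 D)(D - m_0(X))\right] \Big|_{r=0} = -\,\mathbb{E}\left[\delta_g(X)(D - m_0(X))\right] = -\,\mathbb{E}\left[\delta_g(X)\,\mathbb{E}[D - m_0(X) \mid X]\right] = 0,$$

since $m_0(X) = \mathbb{E}[D \mid X]$. The derivative with respect to $m$ in part (c) follows the same pattern.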


Exercise 19.13 (**)

Compare the DML estimator with the unadjusted difference in means and OLS regression adjustment.

(a) Simulate data with $p = 50$ confounders and $\theta = 2$. Compute: (i) the unadjusted difference in means, (ii) OLS with all 50 covariates, (iii) DML with LASSO for nuisance estimation.

(b) Repeat (a) 500 times to obtain the sampling distribution of each estimator. Compare bias, variance, and MSE.

(c) Now increase $p$ to 200 (with $n = 500$, so $p > n$). Which estimators still work? Why does OLS fail?


Exercise 19.14 (*)

Using EconML's LinearDML:

(a) Fit a LinearDML to the simulated data with all 10 covariates as both $X$ (effect modifiers) and $W$ (confounders). Report the ATE and its 95% confidence interval.

(b) Now separate the covariates: put $X_0$ and $X_1$ in $X$ (effect modifiers) and the rest in $W$ (confounders). How does this change the CATE estimates?

(c) Use the coef__inference method to test which effect modifiers have statistically significant interactions with the treatment. Compare with the true data-generating process.


Uplift Modeling

Exercise 19.15 (*)

An online retailer runs an A/B test for a promotional email (50/50 split, $n = 100,000$). Outcomes:

| Group     | $n$    | Purchasers | Conversion rate |
|-----------|--------|------------|-----------------|
| Control   | 50,000 | 3,200      | 6.4%            |
| Treatment | 50,000 | 4,100      | 8.2%            |

(a) What is the ATE (in percentage points and as a relative lift)?

(b) The marketing team wants to send the email to the top 20% of users by uplift score. If the uplift model perfectly identifies the Persuadable segment, and Persuadables constitute 20% of the population, what would the conversion rate be in the targeted treated group vs. the control?

(c) If the uplift model is random (no better than chance), what would the conversion rate be in the targeted treated group? Compare with (b).


Exercise 19.16 (**)

Implement the transformed outcome approach for uplift modeling.

(a) For the simulated data, compute the transformed outcome $Y^* = D Y / e - (1 - D)Y / (1 - e)$ assuming a known propensity of $e = 0.5$ (RCT).

(b) Train a gradient boosting model on $(X, Y^*)$ and predict CATEs on a test set. Compute the RMSE against the true CATEs.

(c) Now use estimated propensity scores instead of the known $e = 0.5$. Does the CATE estimation improve or worsen? Under what condition (estimated vs. known propensity) would you expect improvement?
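The transform in part (a) is a one-liner; a NumPy sketch (the function name is ours):

```python
import numpy as np

def transformed_outcome(y, d, e):
    # Y* = D*Y/e - (1-D)*Y/(1-e); E[Y* | X] = tau(X) when e is the
    # true propensity, so Y* can be regressed on X with any learner
    return d * y / e - (1 - d) * y / (1 - e)
```

Note that $Y^*$ is unbiased for $\tau(x)$ but very noisy (it takes values like $\pm 2Y$ when $e = 0.5$), which is why flexible regression on $(X, Y^*)$ is needed to average the noise away.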


Exercise 19.17 (**)

The four quadrants in practice.

Generate simulated data for 10,000 users with binary outcomes:

  • 20% Persuadables: $Y(0) = 0$, $Y(1) = 1$
  • 30% Sure Things: $Y(0) = 1$, $Y(1) = 1$
  • 40% Lost Causes: $Y(0) = 0$, $Y(1) = 0$
  • 10% Sleeping Dogs: $Y(0) = 1$, $Y(1) = 0$

(a) What is the ATE? Is it positive, negative, or zero?

(b) If you target the top 50% by a predictive model of $P(Y = 1 \mid D = 1)$, which types are you most likely to select? What is the uplift in the targeted group?

(c) If you target the top 50% by a perfect uplift model, which types are selected? What is the uplift? Compare with (b).
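A sketch of the simulation setup (type labels and proportions follow the list above; the 50/50 randomized assignment is our assumption for the A/B framing):

```python
import numpy as np

def simulate_quadrants(n=10_000, seed=0):
    rng = np.random.default_rng(seed)
    # 0: Persuadable, 1: Sure Thing, 2: Lost Cause, 3: Sleeping Dog
    kind = rng.choice(4, size=n, p=[0.2, 0.3, 0.4, 0.1])
    y0 = np.isin(kind, [1, 3]).astype(int)  # respond without treatment
    y1 = np.isin(kind, [0, 1]).astype(int)  # respond with treatment
    d = rng.integers(0, 2, size=n)          # 50/50 randomization
    y = np.where(d == 1, y1, y0)
    return kind, d, y, y1 - y0
```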


Exercise 19.18 (**)

Sleeping Dogs in recommendation systems.

StreamRec observes that for some users, showing recommendations causes them to leave the platform faster (recommendation fatigue). These are "Sleeping Dogs."

(a) Define the potential outcomes precisely: $Y(1)$ = session length with recommendations, $Y(0)$ = session length without. For a Sleeping Dog, what is the sign of $\tau_i$?

(b) Describe how you would detect Sleeping Dogs using a causal forest. What output from the forest identifies them?

(c) The product team asks: "Shouldn't we just remove recommendations for Sleeping Dogs?" Discuss the tradeoffs, considering short-term and long-term effects, and whether the CATE estimate captures both.


Evaluation

Exercise 19.19 (**)

Construct Qini curves for the four meta-learners on simulated data.

(a) Using simulate_heterogeneous_treatment, fit S, T, X, and R learners. Compute the Qini curve and Qini coefficient for each.

(b) Add a "Random" baseline (random ordering) and an "Oracle" baseline (ranking by the true $\tau(x)$). How close are the meta-learners to the Oracle?

(c) The Qini coefficient for the X-learner is 0.85, but its CATE RMSE is higher than the R-learner's (which has Qini = 0.82). Explain how a model can rank better but estimate magnitudes worse.
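A minimal Qini-curve sketch for RCT data (this is the "incremental gains" form; scaling conventions differ across libraries, so treat this as one common choice, and the function name as ours):

```python
import numpy as np

def qini_curve(scores, y, d):
    """Cumulative incremental conversions when treating the top-k by score."""
    order = np.argsort(-scores)
    y, d = y[order], d[order]
    n_t = np.cumsum(d)                 # treated units seen so far
    n_c = np.cumsum(1 - d)             # control units seen so far
    gain_t = np.cumsum(y * d)          # conversions among treated
    gain_c = np.cumsum(y * (1 - d))    # conversions among control
    # incremental gain, rescaling control conversions to the treated count
    return gain_t - gain_c * n_t / np.maximum(n_c, 1)
```

The Qini coefficient is then the (normalized) area between this curve and the straight line from the origin to its endpoint, which is what random targeting achieves in expectation.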


Exercise 19.20 (**)

CATE calibration.

(a) For a fitted causal forest, create a CATE calibration plot: bin observations by $\hat{\tau}(x)$ into deciles, compute the actual treatment effect within each bin using AIPW, and plot predicted vs. actual.

(b) Is the causal forest well-calibrated? If not, suggest a recalibration approach.

(c) Why is CATE calibration harder to assess than classification calibration? What makes the AIPW estimator within bins noisy?
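A sketch of the binning step for part (a), using the AIPW pseudo-outcome as the within-bin "actual" effect (function name is ours; all nuisance arrays are assumed to be cross-fitted predictions):

```python
import numpy as np

def cate_calibration(tau_hat, y, d, e_hat, mu1_hat, mu0_hat, n_bins=10):
    """Predicted vs. AIPW-estimated CATE within quantile bins of tau_hat."""
    # AIPW pseudo-outcome: unbiased for tau(X) given correct nuisances
    psi = (mu1_hat - mu0_hat
           + d * (y - mu1_hat) / e_hat
           - (1 - d) * (y - mu0_hat) / (1 - e_hat))
    edges = np.quantile(tau_hat, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(tau_hat, edges[1:-1]), 0, n_bins - 1)
    pred = np.array([tau_hat[idx == b].mean() for b in range(n_bins)])
    actual = np.array([psi[idx == b].mean() for b in range(n_bins)])
    return pred, actual
```

Plotting `actual` against `pred` with a 45-degree reference line gives the calibration plot; deviations from the line indicate miscalibrated CATEs.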


Exercise 19.21 (***)

Sensitivity analysis for CATE estimation.

(a) Extend the Cinelli and Hazlett (2020) sensitivity analysis framework from the ATE setting (Chapter 18) to the CATE setting. Specifically: for a given $x$, how strong would an unmeasured confounder need to be to change the sign of $\hat{\tau}(x)$?

(b) Implement this sensitivity analysis for the MediCore causal forest. For the subgroup "genetic marker = 1, eGFR > 60," compute the robustness value — the minimum confounder strength needed to nullify the estimated positive CATE.

(c) Compare the robustness value across subgroups. Which subgroup's CATE estimate is most robust to unmeasured confounding? Which is most fragile?


Production and Design

Exercise 19.22 (**)

Method selection decision tree.

Create a flowchart (or decision tree) for selecting a causal ML method based on:

- Data source (RCT vs. observational)
- Treatment balance (balanced vs. imbalanced)
- Number of covariates ($p < 20$ vs. $p > 100$)
- Desired output (rankings only vs. calibrated CATEs vs. constant ATE)
- Interpretability requirement (black box OK vs. must explain to stakeholders)

For each terminal node of the flowchart, name the recommended method and cite the corresponding section of this chapter.


Exercise 19.23 (***)

EconML production pipeline.

Build a complete CATE estimation pipeline using EconML for the StreamRec data:

(a) Data preprocessing: handle the categorical variable subscription_tier and the cyclical variable hour_of_day.

(b) Model selection: fit LinearDML, SparseLinearDML, and CausalForestDML. Compare using the Qini coefficient on a held-out set.

(c) Policy construction: use SingleTreePolicyInterpreter to create a depth-3 targeting policy. Export the policy as a set of human-readable rules.

(d) Policy evaluation: estimate the expected value of the policy vs. treat-all and treat-none. Compute 95% confidence intervals for the policy value using bootstrap.
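For the cyclical variable in part (a), sine/cosine encoding is a standard choice; a sketch (the function name is ours; subscription_tier would get one-hot encoding separately):

```python
import numpy as np

def encode_hour(hour):
    # map hour-of-day onto the unit circle so 23:00 and 00:00 are neighbors,
    # which a tree or linear model cannot infer from the raw integer 0..23
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])
```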


Exercise 19.24 (***)

The curse of dimensionality for CATEs.

(a) Generate data with $p = 50$ covariates but only 2 true effect modifiers. Fit a causal forest using all 50 covariates. How does the CATE RMSE compare to a causal forest using only the 2 true modifiers?

(b) Propose a two-stage approach: (i) use a variable importance method to select the top $k$ effect modifiers, (ii) fit a causal forest using only those $k$ variables. Implement this and compare with (a).

(c) Discuss the risk of this two-stage approach: could the variable selection step introduce bias? Under what conditions?


Exercise 19.25 (****)

Causal ML under interference.

Chapter 16 identified SUTVA violations in the StreamRec setting: recommending item $A$ to user $i$ affects that user's engagement with item $B$.

(a) Define the potential outcomes under interference: $Y_i(\mathbf{d})$ where $\mathbf{d}$ is the full recommendation vector for user $i$. Why is the CATE $\tau_j(x) = \mathbb{E}[Y(\mathbf{d}^{+j}) - Y(\mathbf{d}^{-j}) \mid X = x]$ well-defined only if we specify the "baseline" recommendations $\mathbf{d}^{-j}$?

(b) Propose an approach to estimate item-level CATEs that accounts for interference between items in the same recommendation slate. Consider the concept of "marginal CATE" — the effect of adding item $j$ to the slate, averaging over the other items.

(c) Discuss whether the meta-learner and causal forest methods from this chapter can be adapted to the interference setting, or whether fundamentally different methods are required.


Exercise 19.26 (****)

Connecting DML to semiparametric efficiency.

(a) State the semiparametric efficiency bound for the ATE in the partially linear model under conditional ignorability. How does it relate to the variance of the DML estimator?

(b) Show that the DML estimator using the efficient influence function achieves this bound (is semiparametrically efficient).

(c) Under what conditions does the causal forest also achieve semiparametric efficiency for the CATE? Discuss the role of honest estimation and the bias-variance tradeoff in achieving the optimal rate.


Exercise 19.27 (**)

Cross-fitting sensitivity.

(a) Implement the DML estimator with $K = 2$ cross-fitting folds. Run it 100 times with different random splits. Plot the distribution of $\hat{\theta}$.

(b) Repeat with $K = 5$ and $K = 10$. How does the variability across random splits change with $K$?

(c) The median-of-means variant (Chernozhukov et al., 2018) runs $K$-fold DML multiple times and takes the median. Implement this and show that it reduces sensitivity to the specific fold split.


Exercise 19.28 (***)

Continuous treatments.

The methods in this chapter focused on binary treatments ($D \in \{0, 1\}$). Extend to a continuous treatment (e.g., dosage of Drug X).

(a) Define the CATE for a continuous treatment: $\tau(x, d) = \frac{\partial}{\partial d} \mathbb{E}[Y(d) \mid X = x]$. How does this differ from the binary case?

(b) Implement a DML estimator for the dose-response function using the partially linear model $Y = \theta(X) \cdot D + g(X) + \epsilon$, where $D$ is now continuous.

(c) Use EconML's CausalForestDML with a continuous treatment on simulated dose-response data. Estimate $\hat{\tau}(x)$ and interpret it as the marginal effect of increasing dosage.