Chapter 19: Quiz
Test your understanding of causal machine learning. Answers follow each question.
Question 1
What is the conditional average treatment effect (CATE), and how does it differ from the ATE?
Answer
The CATE is defined as $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$ — the average treatment effect for individuals with covariate values $X = x$. The ATE is $\mathbb{E}[Y(1) - Y(0)]$ — the average over the entire population. The ATE is a single number; the CATE is a *function* that maps each covariate vector to a treatment effect. The ATE equals the expectation of the CATE over the covariate distribution: $\text{ATE} = \mathbb{E}_X[\tau(X)]$. The CATE reveals *who* benefits from treatment and by how much, while the ATE only reveals the average effect.
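A minimal numpy sketch makes the relationship concrete. The CATE function $\tau(x) = 1 + 2x$ and the uniform covariate distribution are made up for illustration; averaging the CATE over the covariates recovers the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(0, 1, n)

# Hypothetical CATE function: the effect grows with the covariate.
def tau(x):
    return 1.0 + 2.0 * x

# The ATE is the average of the CATE over the covariate distribution.
ate = tau(X).mean()
print(round(ate, 2))  # ≈ 2.0, since E[1 + 2X] = 1 + 2 * 0.5
```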
Question 2
Explain the "regularization bias" problem of the S-learner. Why might a gradient-boosted S-learner estimate $\hat{\tau}(x) \approx 0$ even when the true effect is nonzero?
Answer
The S-learner trains a single model $\hat{\mu}(x, d)$ to predict $Y$ from both covariates $X$ and treatment $D$. Regularized models (LASSO, limited-depth trees, gradient boosting with early stopping) penalize all coefficients, including the interaction between $D$ and $X$ that encodes the treatment effect. If the treatment effect is small relative to the main effect of covariates on $Y$, the regularization incentivizes the model to ignore $D$ — it is more efficient (in terms of predictive loss) to focus on the strong predictors in $X$ than to learn the small differential contribution of $D$. The model has no way to "know" that $D$ is a treatment variable that deserves special protection from shrinkage. This results in CATE estimates $\hat{\tau}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0) \approx 0$ even when the true effect is nonzero.
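The shrinkage mechanism can be sketched in a stylized setting. For an approximately orthonormal design, the lasso solution soft-thresholds the OLS coefficients, so a small treatment coefficient is zeroed out while a strong covariate survives. All numbers (effect sizes, penalty level) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
X = rng.standard_normal(n)                 # strong predictor of Y
D = rng.integers(0, 2, n)                  # randomized treatment
Y = 5.0 * X + 0.2 * D + rng.standard_normal(n)

# Standardize the treatment column so the design is ~orthonormal; in that
# case the lasso solution soft-thresholds each OLS coefficient.
Ds = (D - D.mean()) / D.std()
b_x = (X * Y).mean() / (X * X).mean()      # ~5.0
b_d = (Ds * Y).mean() / (Ds * Ds).mean()   # ~0.1 on the standardized scale

lam = 0.3                                  # illustrative penalty level
soft = lambda b: np.sign(b) * max(abs(b) - lam, 0.0)
print(soft(b_x))   # large covariate coefficient survives (~4.7)
print(soft(b_d))   # small treatment coefficient is shrunk to exactly 0.0
```

The regularizer treats the treatment coefficient like any other: too small to "pay" the penalty, it is set to zero, and the implied CATE vanishes.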
Question 3
Describe the T-learner. What is its main advantage over the S-learner, and what is its main disadvantage?
Answer
The T-learner trains two separate models: $\hat{\mu}_1(x)$ on treated observations and $\hat{\mu}_0(x)$ on control observations. The CATE estimate is $\hat{\tau}_T(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)$.
**Advantage:** The treatment effect cannot be shrunk to zero by regularization, because the effect is the difference between two completely separate models (not a coefficient within a single model). The T-learner can capture arbitrarily complex treatment effect heterogeneity.
**Disadvantage:** Each model uses only a fraction of the data (treated or control), wasting statistical efficiency. If the outcome functions $\mu_1(x)$ and $\mu_0(x)$ share structure (e.g., both depend on the same covariates in a similar way), the T-learner does not exploit this shared structure. The CATE estimate is the difference of two potentially noisy estimates, which amplifies variance.
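A minimal T-learner sketch on simulated data, using linear fits as stand-ins for the two ML models (the data-generating process, with true CATE $\tau(x) = 0.5 + x$, is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
X = rng.uniform(0, 1, n)
D = rng.integers(0, 2, n)                    # randomized treatment
tau = 0.5 + X                                # true CATE (known only in simulation)
Y = 1 + 2 * X + D * tau + 0.5 * rng.standard_normal(n)

# T-learner: one model per arm, each fit on that arm's data only.
b1 = np.polyfit(X[D == 1], Y[D == 1], 1)     # treated-arm outcome model
b0 = np.polyfit(X[D == 0], Y[D == 0], 1)     # control-arm outcome model

x = 0.5
tau_hat = np.polyval(b1, x) - np.polyval(b0, x)
print(round(tau_hat, 2))  # ≈ 1.0 = 0.5 + 0.5
```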
Question 4
How does the X-learner improve upon the T-learner for imbalanced treatment groups?
Answer
The X-learner (Künzel et al., 2019) imputes individual treatment effects using the *other* group's model. For treated units, it computes $\tilde{\tau}_i^T = Y_i - \hat{\mu}_0(X_i)$ (observed outcome minus imputed counterfactual from the control model). For control units, it computes $\tilde{\tau}_i^C = \hat{\mu}_1(X_i) - Y_i$. It then trains separate CATE models on each set of imputed effects, and combines them using propensity score weights: $\hat{\tau}_X(x) = g(x) \hat{\tau}_0(x) + (1-g(x)) \hat{\tau}_1(x)$. When treatment groups are imbalanced (e.g., 90% control, 10% treated), the large control group yields a reliable $\hat{\mu}_0$, which is then used to impute treatment effects for the small treated group. The propensity-weighted combination automatically puts more weight on the estimate derived from the larger group in each region of the covariate space. This adapts to the sample imbalance rather than ignoring it.
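A sketch of the X-learner's stages on simulated, imbalanced data. Linear fits stand in for the ML models, and the propensity is known by construction (10% treated); the data-generating process is made up:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
X = rng.uniform(0, 1, n)
D = (rng.uniform(0, 1, n) < 0.1).astype(int)    # imbalanced: ~10% treated
Y = 1 + 2 * X + D * (0.5 + X) + 0.5 * rng.standard_normal(n)

# Stage 1: outcome model per arm (linear fits as stand-ins for ML models).
b1 = np.polyfit(X[D == 1], Y[D == 1], 1)
b0 = np.polyfit(X[D == 0], Y[D == 0], 1)

# Stage 2: impute individual effects using the *other* arm's model.
d1 = Y[D == 1] - np.polyval(b0, X[D == 1])      # treated: Y - mu0_hat(X)
d0 = np.polyval(b1, X[D == 0]) - Y[D == 0]      # control: mu1_hat(X) - Y

# Stage 3: a CATE model per set of imputed effects.
t1 = np.polyfit(X[D == 1], d1, 1)
t0 = np.polyfit(X[D == 0], d0, 1)

# Stage 4: propensity-weighted combination (propensity known here: 0.1).
g = 0.1
x = 0.5
tau_hat = g * np.polyval(t0, x) + (1 - g) * np.polyval(t1, x)
print(round(tau_hat, 2))  # ≈ 1.0, the true CATE at x = 0.5
```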
Question 5
What is the key property that distinguishes the R-learner from the other meta-learners?
Answer
The R-learner directly targets the CATE through a loss function that is **Neyman orthogonal** with respect to the nuisance parameters (the marginal outcome model $m(x)$ and the propensity score $e(x)$). This means that first-stage estimation errors in $\hat{m}$ and $\hat{e}$ affect the CATE estimate only at a second-order rate. If the nuisance estimates converge at rate $n^{-1/4}$, the CATE estimate converges at the faster rate $n^{-1/2}$. The other meta-learners (S, T, X) are not Neyman orthogonal — errors in the outcome models directly affect the CATE estimate at first order. The R-learner's loss function, based on Robinson's (1988) partial residualization, inherently debiases against nuisance estimation errors, making it particularly suitable when combined with flexible ML methods for nuisance estimation.
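A sketch of Robinson-style residualization on simulated data. To isolate the R-loss itself, the true nuisance functions are plugged in (a real R-learner would estimate them with ML and cross-fitting); the linear CATE specification $\tau(x) = a + bx$ and the data-generating process are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
X = rng.uniform(0, 1, n)
e = 0.3 + 0.4 * X                               # true propensity e(x)
D = (rng.uniform(0, 1, n) < e).astype(int)
tau = 0.5 + X                                   # true CATE
Y = 1 + 2 * X + D * tau + 0.5 * rng.standard_normal(n)

# Nuisances (true functional forms used in place of ML fits):
m = 1 + 2 * X + e * tau                         # m(x) = E[Y | X = x]
Yt, Dt = Y - m, D - e                           # Robinson residuals

# Minimize the R-loss for tau(x) = a + b*x via least squares on the
# residualized data: (Yt - (a + b*X) * Dt)^2.
A = np.column_stack([Dt, Dt * X])
a, b = np.linalg.lstsq(A, Yt, rcond=None)[0]
print(round(a, 2), round(b, 2))  # ≈ 0.5, 1.0
```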
Question 6
Explain what "honesty" means in a causal forest and why it is necessary.
Answer
Honesty means that the data used to determine the tree structure (where to split) is different from the data used to estimate the treatment effect in each leaf (what to predict). The training data is randomly split into a splitting sample and an estimation sample. The splitting sample determines the partition of the covariate space; the estimation sample computes $\hat{\tau}$ within each leaf. Honesty is necessary because without it, the adaptive splitting process introduces bias: the tree finds splits that look like large treatment effects *in the training data*, including splits that capitalize on noise. Using the same data for splitting and estimation produces over-optimistic treatment effect estimates and invalid confidence intervals. By using separate data for estimation, honesty ensures that the treatment effect estimates are unbiased conditional on the tree structure, which in turn enables valid asymptotic confidence intervals.
Question 7
How does a causal forest's splitting criterion differ from a standard random forest's?
Answer
A standard random forest selects splits that minimize within-node prediction error: $\sum_{i \in L}(Y_i - \bar{Y}_L)^2 + \sum_{i \in R}(Y_i - \bar{Y}_R)^2$. A causal forest selects splits that maximize across-node treatment effect heterogeneity: $\Delta = n_L(\hat{\tau}_L - \hat{\tau})^2 + n_R(\hat{\tau}_R - \hat{\tau})^2$, where $\hat{\tau}_L$ and $\hat{\tau}_R$ are the estimated treatment effects in the left and right children, and $\hat{\tau}$ is the parent's treatment effect. A standard forest partitions the space to make $Y$ homogeneous within each leaf. A causal forest partitions the space to make the *treatment effect* $\tau$ different across leaves. A split that perfectly separates outcomes but creates leaves with identical treatment effects is useless for a causal forest; a split that creates leaves with different treatment effects but similar outcome distributions is ideal.
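A small numeric illustration of the heterogeneity criterion $\Delta$. The child-node counts and effect estimates are made up; the point is that a split separating different treatment effects scores high, while a split producing identical effects scores zero:

```python
# Delta = n_L * (tau_L - tau_parent)^2 + n_R * (tau_R - tau_parent)^2,
# with the parent effect taken as the size-weighted child average.
def crit(n_l, tau_l, n_r, tau_r):
    tau_p = (n_l * tau_l + n_r * tau_r) / (n_l + n_r)
    return n_l * (tau_l - tau_p) ** 2 + n_r * (tau_r - tau_p) ** 2

# Split A: children with very different treatment effects -> high score.
print(crit(4, 0.8, 4, 0.2))   # 0.72
# Split B: children with identical treatment effects -> zero, useless.
print(crit(4, 0.5, 4, 0.5))   # 0.0
```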
Question 8
What is Neyman orthogonality, and why is it important for DML?
Answer
Neyman orthogonality is a property of the moment condition used to estimate the causal parameter. A moment condition $\mathbb{E}[\psi(\theta; \eta)] = 0$ (where $\eta$ represents nuisance parameters) is Neyman orthogonal if the pathwise derivative with respect to $\eta$, evaluated at the true values, is zero:
$$\frac{\partial}{\partial r} \mathbb{E}[\psi(\theta_0; \eta_0 + r \delta)] \bigg|_{r=0} = 0 \quad \text{for all perturbations } \delta$$
This means that small errors in the nuisance estimates $\hat{\eta}$ (the outcome model $\hat{g}$ and propensity $\hat{m}$) affect the estimate of $\theta$ only at second order. Specifically, the bias in $\hat{\theta}$ is proportional to the *product* of errors in $\hat{g}$ and $\hat{m}$, not the sum. If each nuisance estimator converges at rate $n^{-1/4}$, the product of errors is $O(n^{-1/2})$, which allows $\hat{\theta}$ to achieve the parametric $\sqrt{n}$ convergence rate. Without Neyman orthogonality, regularization bias from ML nuisance estimators would prevent $\hat{\theta}$ from converging at the parametric rate.
Question 9
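The second-order property can be checked by simulation: solve the orthogonal moment with deliberately perturbed nuisances and watch the bias appear only when *both* are wrong. The linear model and perturbation sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
theta = 1.0
X = rng.standard_normal(n)
D = 0.5 * X + rng.standard_normal(n)           # true m0(x) = 0.5 * x
Y = theta * D + X + rng.standard_normal(n)     # true g0(x) = x

def theta_hat(err_g, err_m):
    # Deliberately perturbed nuisances: ghat = g0 + err_g*x, mhat = m0 + err_m*x.
    g = X + err_g * X
    m = 0.5 * X + err_m * X
    # Solve the orthogonal moment E[(Y - g - theta*D)(D - m)] = 0 for theta.
    return ((Y - g) * (D - m)).sum() / (D * (D - m)).sum()

b10 = theta_hat(0.3, 0.0) - theta   # only g wrong  -> bias ~ 0
b01 = theta_hat(0.0, 0.3) - theta   # only m wrong  -> bias ~ 0
b11 = theta_hat(0.3, 0.3) - theta   # both wrong    -> bias ~ product of errors
print(round(b10, 3), round(b01, 3), round(b11, 3))
```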
Question 9
What is cross-fitting, and why is it used in DML?
Answer
Cross-fitting is a sample-splitting procedure used in DML to prevent overfitting bias in nuisance estimation. The data is split into $K$ folds. For each fold $k$, the nuisance functions $\hat{g}^{-k}$ and $\hat{m}^{-k}$ are estimated using all data *except* fold $k$, and residuals are computed for observations in fold $k$ using these out-of-fold predictions. The residuals are then pooled across all folds to estimate the causal parameter. Cross-fitting is necessary because flexible ML models can overfit the training data. If the same data is used to fit $\hat{g}(X_i)$ and then to compute $Y_i - \hat{g}(X_i)$, the residual $Y_i - \hat{g}(X_i)$ is artificially small (the model has memorized observation $i$), which biases the causal estimate. Out-of-fold prediction eliminates this: $\hat{g}^{-k}(X_i)$ was not trained on observation $i$, so the residual is an honest estimate of the approximation error.
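A minimal cross-fitting sketch with $K = 5$ folds on simulated data, using polynomial fits as stand-in nuisance learners and the partialling-out (residual-on-residual) step on the pooled out-of-fold residuals (the data-generating process is made up):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
theta = 2.0
X = rng.uniform(-1, 1, n)
D = np.sin(2 * X) + 0.5 * rng.standard_normal(n)      # confounded treatment
Y = theta * D + X ** 2 + 0.5 * rng.standard_normal(n)

K = 5
folds = rng.integers(0, K, n)
g_hat = np.empty(n)   # out-of-fold predictions of E[Y | X]
m_hat = np.empty(n)   # out-of-fold predictions of E[D | X]
for k in range(K):
    tr, te = folds != k, folds == k   # train on all folds except k
    g_hat[te] = np.polyval(np.polyfit(X[tr], Y[tr], 5), X[te])
    m_hat[te] = np.polyval(np.polyfit(X[tr], D[tr], 5), X[te])

# Residual-on-residual regression on the pooled out-of-fold residuals.
Yt, Dt = Y - g_hat, D - m_hat
theta_dml = (Yt * Dt).sum() / (Dt * Dt).sum()
print(round(theta_dml, 2))  # ≈ 2.0
```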
Question 10
In the DML moment condition $\psi(\theta; g, m) = (Y - g(X) - \theta D)(D - m(X))$, what is the intuition for multiplying two residuals?
Answer
The moment condition performs a "residual-on-residual" regression. The first factor $(Y - g(X) - \theta D)$ is the outcome residual after removing the effect of confounders $g(X)$ and treatment $\theta D$. The second factor $(D - m(X))$ is the treatment residual after removing the systematic part of treatment assignment $m(X)$. The intuition: $D - m(X)$ isolates the "as-if-random" component of treatment — the part of treatment variation that is not predicted by confounders. Projecting the outcome residual onto this quasi-random treatment variation identifies the causal effect, analogous to how an instrumental variable isolates exogenous variation. The product makes the estimate doubly robust: if either $g$ or $m$ is estimated correctly, the moment condition is still satisfied at $\theta = \theta_0$.
Question 11
What are the "four quadrants" of treatment response, and why does uplift modeling distinguish them while predictive modeling does not?
Answer
The four quadrants classify individuals by their potential outcomes:

- **Persuadables**: $Y(0) = 0, Y(1) = 1$ — treatment causes the positive outcome. Target these.
- **Sure Things**: $Y(0) = 1, Y(1) = 1$ — positive outcome regardless. Treatment is wasted.
- **Lost Causes**: $Y(0) = 0, Y(1) = 0$ — no outcome regardless. Treatment is wasted.
- **Sleeping Dogs**: $Y(0) = 1, Y(1) = 0$ — treatment prevents the positive outcome. Avoid treating.

A predictive model estimates $P(Y = 1 \mid X, D = 1)$, which is high for both Persuadables (true positives of targeting) and Sure Things (false positives of targeting). It cannot distinguish between them because both groups have high outcomes under treatment. An uplift model estimates $P(Y(1) = 1 \mid X) - P(Y(0) = 1 \mid X)$, which is positive for Persuadables and zero for Sure Things, correctly identifying who benefits from treatment.
Question 12
Define the transformed outcome $Y^*$. Why does $\mathbb{E}[Y^* \mid X] = \tau(X)$?
Answer
The transformed outcome is:
$$Y^* = \frac{D \cdot Y}{e(X)} - \frac{(1-D) \cdot Y}{1-e(X)}$$
where $e(X) = P(D=1 \mid X)$ is the propensity score. To see why $\mathbb{E}[Y^* \mid X] = \tau(X)$:
$$\mathbb{E}[Y^* \mid X] = \mathbb{E}\left[\frac{DY}{e(X)} \mid X\right] - \mathbb{E}\left[\frac{(1-D)Y}{1-e(X)} \mid X\right]$$
For the first term: $\mathbb{E}[DY \mid X] = \mathbb{E}[Y(1) \mid X, D=1] \cdot P(D=1 \mid X) = \mathbb{E}[Y(1) \mid X] \cdot e(X)$ (using ignorability). Dividing by $e(X)$ gives $\mathbb{E}[Y(1) \mid X]$. Similarly, the second term equals $\mathbb{E}[Y(0) \mid X]$. Therefore $\mathbb{E}[Y^* \mid X] = \mathbb{E}[Y(1) \mid X] - \mathbb{E}[Y(0) \mid X] = \tau(X)$. This means a regression of $Y^*$ on $X$ estimates the CATE function.
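A quick numeric check of $\mathbb{E}[Y^* \mid X] = \tau(X)$ on simulated data with a binary covariate, covariate-dependent propensities, and a made-up CATE of 2.0 vs. 0.5 in the two groups:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
X = rng.integers(0, 2, n)                  # binary covariate
e = np.where(X == 1, 0.7, 0.3)             # propensity depends on X
D = (rng.uniform(0, 1, n) < e).astype(int)
tau = np.where(X == 1, 2.0, 0.5)           # true CATE by group
Y = 1 + X + D * tau + rng.standard_normal(n)

# Transformed outcome: IPW contrast of the observed outcome.
Y_star = D * Y / e - (1 - D) * Y / (1 - e)

# Its conditional mean recovers the group-level CATE (noisily per unit,
# but consistently on average).
print(Y_star[X == 1].mean())   # ≈ 2.0
print(Y_star[X == 0].mean())   # ≈ 0.5
```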
Question 13
What is the Qini curve, and how is it analogous to the ROC curve?
Answer
The Qini curve evaluates a CATE model's ability to rank individuals by treatment benefit. Individuals are sorted by descending $\hat{\tau}(x)$, and the curve plots the incremental number of positive outcomes (above what would occur under no treatment) as we treat progressively larger fractions of the population. The analogy to the ROC curve: the ROC evaluates a classifier's ability to rank positive and negative examples, plotting true positive rate vs. false positive rate as the classification threshold varies. The Qini evaluates an uplift model's ability to rank high-benefit and low-benefit individuals, plotting cumulative uplift vs. fraction treated. A perfect uplift model produces a Qini curve that rises steeply (treating the highest-benefit individuals first), while a random model produces a diagonal line. The Qini coefficient (area between the curve and the diagonal) is analogous to the AUC — it quantifies ranking quality without requiring knowledge of individual treatment effects.
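A sketch of computing Qini-curve points from simulated randomized data, where a made-up uplift score is correlated with the true effect. At each fraction treated, the incremental positives are treated successes minus the control success count rescaled to the treated group's size:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
tau_score = rng.uniform(0, 1, n)          # the model's uplift ranking score
D = rng.integers(0, 2, n)                 # randomized assignment (RCT data)
p = 0.1 + D * 0.3 * tau_score             # true uplift correlated with score
Y = (rng.uniform(0, 1, n) < p).astype(int)

order = np.argsort(-tau_score)            # highest predicted uplift first
D_s, Y_s = D[order], Y[order]

qini = []
for frac in (0.2, 0.4, 0.6, 0.8, 1.0):
    top = slice(0, int(frac * n))
    n_t = D_s[top].sum()
    n_c = (1 - D_s[top]).sum()
    # Incremental positives among the top-ranked fraction:
    q = Y_s[top][D_s[top] == 1].sum() - Y_s[top][D_s[top] == 0].sum() * n_t / n_c
    qini.append(q)
print([round(q) for q in qini])   # steep early gains, then flattening
```

A good ranking shows large early increments (the curve rises steeply) and diminishing gains as lower-benefit individuals are added.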
Question 14
Why can't you evaluate a CATE model using standard MSE ($\frac{1}{n}\sum(\hat{\tau}_i - \tau_i)^2$) on real data?
Answer
Standard MSE requires observing the true CATE $\tau_i = Y_i(1) - Y_i(0)$ for each individual. But the fundamental problem of causal inference (Chapter 16) means we never observe both $Y_i(1)$ and $Y_i(0)$ — we observe only one potential outcome per individual. Therefore $\tau_i$ is never directly observable, and we cannot compute $\hat{\tau}_i - \tau_i$ for any individual. This is why causal ML evaluation uses indirect metrics like the Qini curve (which evaluates ranking without needing individual-level ground truth), the AIPW-based calibration plot (which estimates average effects within bins), or the "Rank-Weighted Average Treatment Effect" (RATE). In simulations where $\tau_i$ is known by construction, MSE can be computed — which is why simulations play a central role in causal ML research.
Question 15
In EconML, what is the distinction between $X$ (effect modifiers) and $W$ (confounders)?
Answer
$X$ variables are **effect modifiers** — covariates that may cause the treatment effect to vary. The CATE $\tau(x)$ is estimated as a function of $X$. Including a variable in $X$ means "I believe the treatment effect may differ at different values of this variable." $W$ variables are **confounders** (or additional controls) — variables needed for causal identification (conditional ignorability) but not expected to modify the treatment effect. They are conditioned on for identification but do not appear in the CATE function. Putting too many variables in $X$ increases the curse of dimensionality for CATE estimation (estimating a function in a high-dimensional space requires more data). Putting too few variables in $X$ restricts the CATE to be constant (or linear) in the retained modifiers. The optimal split requires domain knowledge: which variables plausibly *modify* the treatment effect (e.g., genetic markers for drug response) vs. which variables merely confound the treatment-outcome relationship (e.g., hospital quality).
Question 16
A causal forest estimates $\hat{\tau}(x) = 0.12$ with 95% CI $[0.03, 0.21]$ for a patient. The treatment cost is equivalent to $\tau = 0.08$. Should you treat?
Answer
The point estimate $\hat{\tau}(x) = 0.12$ exceeds the treatment cost of $0.08$, suggesting treatment is net beneficial. The 95% confidence interval $[0.03, 0.21]$ is entirely above zero, so the treatment effect is statistically significant — we can reject $H_0: \tau(x) = 0$. However, the lower bound of the CI is $0.03$, which is *below* the treatment cost of $0.08$. This means we cannot be confident at the 95% level that the treatment benefit exceeds the cost. For a risk-neutral decision maker: treat (expected net benefit = $0.12 - 0.08 = 0.04 > 0$). For a risk-averse decision maker or in a high-stakes clinical setting: the uncertainty about whether benefit exceeds cost may warrant additional data collection or a more conservative threshold. The decision depends on the loss function, not just the statistical significance.
Question 17
Explain why "the features that predict the outcome" and "the features that predict treatment effect heterogeneity" can be completely different.
Answer
A feature that predicts $Y$ strongly is one that explains variation in outcomes across all individuals, regardless of treatment status. A feature that predicts $\tau(X)$ is one that explains variation in *treatment effects* — it moderates how much the treatment changes the outcome. Example: In the MediCore case, disease severity strongly predicts readmission ($Y$) — sicker patients are readmitted more often. But disease severity may not predict *treatment effect heterogeneity* if Drug X helps equally across severity levels. Conversely, a genetic marker may barely predict readmission (carriers and non-carriers have similar baseline risk) but strongly predict the treatment effect (carriers respond to Drug X while non-carriers do not). Causal forest feature importance measures the latter — which features drive variation in $\tau(x)$ — while standard predictive feature importance measures the former. The two lists can have zero overlap, which is one of the most practically useful insights from CATE analysis.
Question 18
What assumption does every method in this chapter rely on that cannot be tested from data?
Answer
**Conditional ignorability** (also called unconfoundedness or the no-unmeasured-confounders assumption):
$$Y(0), Y(1) \perp D \mid X$$
This states that, conditional on observed covariates $X$, treatment assignment $D$ is independent of potential outcomes. In other words, there are no unmeasured confounders. This assumption is untestable: we cannot verify from the data alone that we have measured all relevant confounders. It must be argued from domain knowledge and the causal structure of the problem. DML provides robustness to *observed* confounders (through flexible ML estimation), but it cannot fix the problem of *missing* confounders. Sensitivity analysis (e.g., Cinelli and Hazlett, 2020) quantifies how strong an unmeasured confounder would need to be to change the conclusions.
Question 19
Describe a scenario where treating the top 20% by predicted outcome $P(Y=1 \mid X, D=1)$ leads to a worse outcome than treating a random 20%.
Answer
Consider a marketing campaign where:

- The top 20% by $P(\text{purchase} \mid \text{campaign})$ are mostly "Sure Things" — loyal customers who would purchase anyway. The campaign has zero incremental effect on them.
- Among the bottom 80%, there are "Persuadables" who would not purchase without the campaign but would with it.
- Additionally, 5% of the top predictive group are "Sleeping Dogs" — the campaign actually causes them to unsubscribe, reducing future purchases.

Targeting the top 20% by prediction: net uplift = 0 (Sure Things) + some negative uplift (Sleeping Dogs) < 0. Targeting a random 20%: net uplift = a small but positive fraction of Persuadables > 0. The prediction-targeted group performs worse because predictive models select for high baseline probability, not high treatment effect, and the two are uncorrelated or negatively correlated in this scenario.
Question 20
You have estimated CATEs for 50,000 StreamRec users. The mean CATE is 2.1 minutes of additional engagement. The standard deviation of estimated CATEs is 1.8 minutes. What should you investigate before deploying a targeting policy based on these estimates?