In This Chapter
- Learning Objectives
- 19.1 Beyond Average Treatment Effects: Why Heterogeneity Matters
- 19.2 Meta-Learners: The Modular Approach to CATE Estimation
- 19.3 Causal Forests: Partitioning for Heterogeneity
- 19.4 Double/Debiased Machine Learning
- 19.5 Uplift Modeling: The Business of Targeting
- 19.6 Evaluating Causal ML: The Qini Curve and Beyond
- 19.7 EconML: The Production Framework
- 19.8 Putting It All Together: The MediCore Personalized Treatment Pipeline
- 19.9 Progressive Project M7: Causal Forests for StreamRec Targeting
- 19.10 When to Use Which Method
- Chapter Summary
Chapter 19: Causal Machine Learning — Heterogeneous Treatment Effects, Uplift Modeling, and Double Machine Learning
"The average treatment effect is a useful fiction. In reality, the treatment helps some people, hurts others, and does nothing for most. The question is: which ones?" — Susan Athey and Guido Imbens, "Machine Learning Methods That Economists Should Know About" (2019)
Learning Objectives
By the end of this chapter, you will be able to:
- Estimate heterogeneous treatment effects (HTEs) using causal forests and meta-learners (S, T, X, R)
- Implement double/debiased machine learning (DML) for causal estimation with high-dimensional confounders
- Apply uplift modeling to optimize treatment assignment in business contexts
- Use EconML to estimate CATEs and build targeting policies
- Evaluate causal ML models using metrics appropriate for causal (not predictive) tasks
19.1 Beyond Average Treatment Effects: Why Heterogeneity Matters
Chapter 16 introduced the average treatment effect:
$$\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$$
Chapter 18 provided tools to estimate it: propensity score methods, instrumental variables, difference-in-differences, regression discontinuity. Each method estimates a single number — the average effect of treatment across the population (or a well-defined subgroup).
But a single number conceals enormous variation. When MediCore Pharmaceuticals estimates that Drug X reduces 30-day readmission by 4.2 percentage points on average, the average masks the reality that the drug may reduce readmission by 12 points for patients with a specific genetic marker, have zero effect for patients with mild hypertension, and actually increase readmission risk for patients with compromised renal function. The ATE of $-0.042$ is not wrong — it is incomplete. And for the physician deciding whether to prescribe Drug X to the patient in front of them, an average across 2.1 million patients is almost useless.
The same logic applies to StreamRec's recommendation system. Chapter 16 established the causal question: does recommending item $j$ to user $i$ cause additional engagement? Chapter 18 estimated the average recommendation effect using IPW. But the average conceals the crucial variation: some users discover genuinely new content through recommendations (large positive treatment effect), while others are shown content they would have found organically (near-zero effect). A targeting policy that recommends to everyone wastes positions on users who would have engaged anyway and misses the opportunity to reach users for whom the recommendation creates genuine value.
This chapter moves from "What is the average effect?" to "Who benefits most, and how much?"
The Conditional Average Treatment Effect
The conditional average treatment effect (CATE) is the treatment effect for a specific subgroup defined by covariates $X$:
$$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$$
While the ATE is a single number, the CATE $\tau(x)$ is a function — it maps every covariate vector $x$ to a treatment effect. The ATE is the expectation of the CATE over the covariate distribution:
$$\text{ATE} = \mathbb{E}_X[\tau(X)]$$
Mathematical Foundation: The individual treatment effect $\tau_i = Y_i(1) - Y_i(0)$ remains unobservable (the fundamental problem of causal inference). The CATE $\tau(x)$ averages over individuals who share the same covariate values $X = x$, which means it can, in principle, be estimated from data — provided we have enough observations with similar $x$ in both treatment and control groups. The CATE is not the ITE; it is the best we can do given the fundamental problem.
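The distinction can be made concrete with a toy table (hypothetical numbers, with both potential outcomes visible — something real data never allows):

```python
import numpy as np

# Hypothetical table where BOTH potential outcomes are visible
# (never possible in practice) -- columns: x, Y(1), Y(0)
data = np.array([
    [1, 5.0, 3.0],   # ITE = +2
    [1, 4.0, 4.0],   # ITE =  0
    [1, 6.0, 2.0],   # ITE = +4
    [1, 3.0, 5.0],   # ITE = -2
    [2, 7.0, 4.0],   # ITE = +3
    [2, 6.0, 3.0],   # ITE = +3
])
x, y1, y0 = data[:, 0], data[:, 1], data[:, 2]
ite = y1 - y0                 # individual effects (unobservable in reality)

cate_x1 = ite[x == 1].mean()  # tau(1): average ITE within the x=1 group
cate_x2 = ite[x == 2].mean()  # tau(2)
ate = ite.mean()              # ATE = expectation of the CATE over X

print(cate_x1)  # 1.0
print(cate_x2)  # 3.0
print(ate)      # (4*1.0 + 2*3.0)/6 = 1.666...
```

Within the $x = 1$ group the individual effects range from $-2$ to $+4$, yet $\tau(1) = 1$: the CATE averages over heterogeneity that the covariates cannot explain.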
From ATE to CATE: What Changes
| Aspect | ATE estimation (Chapter 18) | CATE estimation (This chapter) |
|---|---|---|
| Target | Single number: $\mathbb{E}[Y(1) - Y(0)]$ | Function: $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$ |
| Methods | IPW, AIPW, IV, DiD, RD | Meta-learners, causal forests, DML |
| Evaluation | Bias, variance, coverage | No ground truth for $\tau(x)$; indirect metrics (Qini, AUUC) |
| Application | "Does the treatment work?" | "For whom does the treatment work?" |
| Assumptions | Standard causal identification | Same, plus sufficient overlap and smoothness of $\tau(x)$ |
The Role of Machine Learning
The phrase "causal machine learning" is precise: it refers to using machine learning models to estimate causal quantities — specifically, CATEs and related objects. Standard ML predicts $Y$; causal ML predicts $Y(1) - Y(0)$. This distinction is fundamental and shapes every design choice.
Standard ML optimizes predictive accuracy: minimize $\mathbb{E}[(Y - \hat{Y})^2]$. Causal ML must optimize a quantity that is never observed: minimize $\mathbb{E}[(\tau(X) - \hat{\tau}(X))^2]$. Since $\tau_i$ is never observed for any individual $i$, causal ML cannot be evaluated the way predictive ML is evaluated. This is the central challenge of the chapter.
import numpy as np
import pandas as pd
from typing import Tuple, Dict, Optional, List, Callable
from dataclasses import dataclass
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.linear_model import LassoCV, LogisticRegressionCV
import matplotlib.pyplot as plt
def simulate_heterogeneous_treatment(
n: int = 10000,
p: int = 10,
seed: int = 42,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
"""Simulate data with heterogeneous treatment effects.
The true CATE depends on X[:, 0] and X[:, 1]:
tau(x) = 2 * x_0 + x_1^2 - 1
This creates subgroups where the treatment is beneficial (tau > 0),
harmful (tau < 0), or neutral (tau ≈ 0).
Args:
n: Number of observations.
p: Number of covariates.
seed: Random seed.
Returns:
X: Covariates (n, p).
D: Treatment assignment (n,).
Y: Observed outcome (n,).
tau_true: True CATE for each unit (n,).
Y0: True potential outcome under control (n,).
"""
rng = np.random.RandomState(seed)
X = rng.normal(0, 1, (n, p))
# True CATE depends on first two covariates
tau_true = 2 * X[:, 0] + X[:, 1] ** 2 - 1
# Treatment assignment depends on confounders (non-random)
propensity = 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.3 * X[:, 2])))
D = rng.binomial(1, propensity)
# Potential outcomes
Y0 = 3 + X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 1, n)
Y1 = Y0 + tau_true
# Observed outcome
Y = D * Y1 + (1 - D) * Y0
return X, D, Y, tau_true, Y0
19.2 Meta-Learners: The Modular Approach to CATE Estimation
Meta-learners estimate CATEs by combining standard supervised learning models in specific ways. The "meta" prefix means they sit above the base learner — any supervised learning algorithm (gradient boosting, random forest, neural network, LASSO) can be plugged in as the base. This modularity is their strength: advances in supervised learning immediately transfer to CATE estimation.
Four meta-learners dominate the literature: S-learner, T-learner, X-learner, and R-learner. Each makes different tradeoffs between bias and variance, and each has settings where it excels and settings where it fails.
S-Learner (Single Model)
The S-learner trains a single model $\hat{\mu}(x, d)$ that predicts the outcome $Y$ from both covariates $X$ and treatment $D$:
$$\hat{\tau}_S(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$$
Intuition: Include treatment $D$ as a feature alongside covariates $X$. The CATE estimate is the model's prediction when we set $D = 1$ minus its prediction when we set $D = 0$, holding $X = x$ fixed.
from sklearn.base import BaseEstimator, RegressorMixin
class SLearner(BaseEstimator, RegressorMixin):
"""S-Learner for CATE estimation.
Trains a single outcome model mu(x, d) and estimates the CATE
as mu(x, 1) - mu(x, 0).
Attributes:
base_learner: Any scikit-learn regressor.
"""
def __init__(self, base_learner: Optional[BaseEstimator] = None):
self.base_learner = base_learner or GradientBoostingRegressor(
n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
def fit(self, X: np.ndarray, D: np.ndarray, Y: np.ndarray) -> "SLearner":
"""Fit the outcome model on (X, D) -> Y.
Args:
X: Covariates (n, p).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
Returns:
self
"""
XD = np.column_stack([X, D])
self.base_learner.fit(XD, Y)
return self
def predict(self, X: np.ndarray) -> np.ndarray:
"""Estimate CATE for each observation.
Args:
X: Covariates (n, p).
Returns:
Estimated CATE (n,).
"""
X1 = np.column_stack([X, np.ones(X.shape[0])])
X0 = np.column_stack([X, np.zeros(X.shape[0])])
return self.base_learner.predict(X1) - self.base_learner.predict(X0)
Strengths: Simple to implement. Uses all the data in a single model, sharing statistical strength between treated and control units. Works well when the treatment effect is simple relative to the outcome function — for example, near-constant or exactly zero.
Weaknesses: Regularized models (and tree-based models with limited depth) can shrink the treatment effect toward zero. If the treatment effect is small but real, the S-learner may miss it because the model finds it more efficient to ignore $D$ and focus on the strong predictors of $Y$. This is the regularization bias problem: the base learner's inductive bias penalizes the treatment effect just as it penalizes any small coefficient.
Common Misconception: "The S-learner just needs a flexible enough base model." Flexibility is necessary but not sufficient. Even a perfectly flexible model trained on finite data will struggle to separate the treatment effect from the main outcome function when the effect is small, because the squared-loss objective does not distinguish between signal from $X$ and signal from $D$. The S-learner has no way to "know" that $D$ is special — that it represents a manipulation rather than a feature.
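The regularization-bias failure mode can be shown with a deliberately penalized base learner. In this sketch (an illustrative DGP, not the chapter's simulation) the outcome is dominated by one covariate and the true effect is a small constant $0.3$; a LASSO-based S-learner soft-thresholds the treatment coefficient all the way to zero, while the same S-learner with an unpenalized base model recovers it:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.RandomState(0)
n = 5_000
x = rng.normal(0, 1, (n, 1))
d = rng.binomial(1, 0.5, n)
y = 5.0 * x[:, 0] + 0.3 * d + rng.normal(0, 1, n)   # true effect: 0.3
xd = np.column_stack([x, d])

# S-learner with a penalized base model: D is just another feature,
# and its small coefficient is soft-thresholded away
tau_lasso = Lasso(alpha=0.15).fit(xd, y).coef_[1]

# Unpenalized base model: the coefficient on D recovers the effect
tau_ols = LinearRegression().fit(xd, y).coef_[1]

print(f"penalized S-learner:   {tau_lasso:.3f}")    # shrunk to 0.000
print(f"unpenalized S-learner: {tau_ols:.3f}")      # close to 0.3
```

For a linear base model, $\hat{\mu}(x, 1) - \hat{\mu}(x, 0)$ is exactly the coefficient on $D$, so the penalty on that one coefficient is the entire CATE bias.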
T-Learner (Two Models)
The T-learner trains separate outcome models for the treated and control groups:
$$\hat{\mu}_1(x) = \hat{\mathbb{E}}[Y \mid X = x, D = 1] \quad \text{(treated model)}$$
$$\hat{\mu}_0(x) = \hat{\mathbb{E}}[Y \mid X = x, D = 0] \quad \text{(control model)}$$
$$\hat{\tau}_T(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)$$
class TLearner(BaseEstimator, RegressorMixin):
"""T-Learner for CATE estimation.
Trains separate outcome models for treated and control groups.
CATE = mu_1(x) - mu_0(x).
Attributes:
base_learner_t: Regressor for the treated group.
base_learner_c: Regressor for the control group.
"""
def __init__(
self,
base_learner_t: Optional[BaseEstimator] = None,
base_learner_c: Optional[BaseEstimator] = None,
):
self.base_learner_t = base_learner_t or GradientBoostingRegressor(
n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
self.base_learner_c = base_learner_c or GradientBoostingRegressor(
n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
def fit(self, X: np.ndarray, D: np.ndarray, Y: np.ndarray) -> "TLearner":
"""Fit separate models for treated and control.
Args:
X: Covariates (n, p).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
Returns:
self
"""
self.base_learner_t.fit(X[D == 1], Y[D == 1])
self.base_learner_c.fit(X[D == 0], Y[D == 0])
return self
def predict(self, X: np.ndarray) -> np.ndarray:
"""Estimate CATE for each observation.
Args:
X: Covariates (n, p).
Returns:
Estimated CATE (n,).
"""
return self.base_learner_t.predict(X) - self.base_learner_c.predict(X)
Strengths: Does not shrink the treatment effect toward zero (the effect is the difference between two separately fitted models, not a single penalized coefficient). Works well when treatment effects are large and both groups have sufficient data.
Weaknesses: Uses only half the data for each model. When the treatment and control outcome functions share structure (both $\mu_1(x)$ and $\mu_0(x)$ are smooth functions of $x$ that differ only slightly), the T-learner does not exploit this shared structure. It can also overfit in regions where one group is sparse — if positivity is approximately violated for some covariate values, the T-learner's predictions in that region are unreliable.
X-Learner (Cross-Group Imputation)
The X-learner (Künzel et al., 2019) addresses the T-learner's sample efficiency problem by using each group's model to impute missing potential outcomes for the other group:
Step 1: Fit outcome models for treated and control (same as T-learner): $$\hat{\mu}_0(x) \quad \text{and} \quad \hat{\mu}_1(x)$$
Step 2: Impute individual treatment effects:
- For treated units: $\tilde{\tau}_i^T = Y_i - \hat{\mu}_0(X_i)$ (observed outcome minus imputed counterfactual)
- For control units: $\tilde{\tau}_i^C = \hat{\mu}_1(X_i) - Y_i$ (imputed counterfactual minus observed outcome)
Step 3: Fit CATE models on the imputed ITEs:
$$\hat{\tau}_1(x) \text{ trained on } \{(X_i, \tilde{\tau}_i^T)\}_{D_i = 1}$$
$$\hat{\tau}_0(x) \text{ trained on } \{(X_i, \tilde{\tau}_i^C)\}_{D_i = 0}$$
Step 4: Combine using propensity score weights: $$\hat{\tau}_X(x) = g(x) \cdot \hat{\tau}_0(x) + (1 - g(x)) \cdot \hat{\tau}_1(x)$$
where $g(x) = P(D = 1 \mid X = x)$ is the propensity score. This weighting ensures that the estimate relies more on the model trained with more data in each region of the covariate space.
class XLearner(BaseEstimator, RegressorMixin):
"""X-Learner for CATE estimation.
Uses cross-group imputation of individual treatment effects
and propensity-weighted combination.
Attributes:
outcome_learner_t: Regressor for treated outcome.
outcome_learner_c: Regressor for control outcome.
effect_learner_t: Regressor for treatment effect among treated.
effect_learner_c: Regressor for treatment effect among controls.
propensity_model: Classifier for propensity score.
"""
    def __init__(
        self,
        base_learner: Optional[BaseEstimator] = None,
        propensity_model: Optional[BaseEstimator] = None,
    ):
        from sklearn.base import clone  # clone() builds unfitted copies of an estimator
        template = base_learner or GradientBoostingRegressor(
            n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
        )
        # Clone the template so the four internal models are fit independently
        self.outcome_learner_t = clone(template)
        self.outcome_learner_c = clone(template)
        self.effect_learner_t = clone(template)
        self.effect_learner_c = clone(template)
        self.propensity_model = propensity_model or GradientBoostingClassifier(
            n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
        )
def fit(self, X: np.ndarray, D: np.ndarray, Y: np.ndarray) -> "XLearner":
"""Fit the X-learner.
Args:
X: Covariates (n, p).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
Returns:
self
"""
# Step 1: Fit outcome models
self.outcome_learner_t.fit(X[D == 1], Y[D == 1])
self.outcome_learner_c.fit(X[D == 0], Y[D == 0])
# Step 2: Impute individual treatment effects
# Treated: observed Y_i(1) - imputed Y_i(0)
tau_t = Y[D == 1] - self.outcome_learner_c.predict(X[D == 1])
# Control: imputed Y_i(1) - observed Y_i(0)
tau_c = self.outcome_learner_t.predict(X[D == 0]) - Y[D == 0]
# Step 3: Fit CATE models on imputed effects
self.effect_learner_t.fit(X[D == 1], tau_t)
self.effect_learner_c.fit(X[D == 0], tau_c)
# Fit propensity model for weighting
self.propensity_model.fit(X, D)
return self
def predict(self, X: np.ndarray) -> np.ndarray:
"""Estimate CATE using propensity-weighted combination.
Args:
X: Covariates (n, p).
Returns:
Estimated CATE (n,).
"""
tau_from_treated = self.effect_learner_t.predict(X)
tau_from_control = self.effect_learner_c.predict(X)
g = self.propensity_model.predict_proba(X)[:, 1]
# Weight by propensity: use control-side estimate where
# treated units are dense, treated-side where control is dense
return g * tau_from_control + (1 - g) * tau_from_treated
Strengths: Adapts to imbalanced treatment groups. When 90% of units are in the control group, the X-learner leverages the large control group to estimate $\hat{\mu}_0$ well, then imputes counterfactuals for the small treated group. The propensity-weighted combination ensures each region of the covariate space uses the more reliable estimate.
Weaknesses: More complex, with more hyperparameters. The imputed ITEs are noisy (they inherit error from $\hat{\mu}_0$ and $\hat{\mu}_1$), which means the second-stage models are trained on noisy targets. Errors compound across stages.
R-Learner (Residual-on-Residual)
The R-learner (Nie and Wager, 2021) takes a fundamentally different approach. Rather than modeling outcomes and computing differences, it directly targets the CATE through a loss function that is robust to estimation errors in nuisance components.
Step 1: Estimate nuisance functions:
- $\hat{m}(x) = \hat{\mathbb{E}}[Y \mid X = x]$ (marginal outcome model, ignoring treatment)
- $\hat{e}(x) = \hat{P}(D = 1 \mid X = x)$ (propensity score)
Step 2: Form residuals:
- $\tilde{Y}_i = Y_i - \hat{m}(X_i)$ (outcome residual)
- $\tilde{D}_i = D_i - \hat{e}(X_i)$ (treatment residual)
Step 3: Minimize the R-learner loss:
$$\hat{\tau}_R = \arg\min_{\tau} \sum_{i=1}^{n} \left( \tilde{Y}_i - \tau(X_i) \cdot \tilde{D}_i \right)^2 + \Lambda(\tau)$$
where $\Lambda(\tau)$ is a regularization penalty.
Mathematical Foundation: The R-learner loss has a deep connection to Robinson's (1988) partially linear model. Consider the model:
$$Y = \tau(X) \cdot D + f(X) + \epsilon, \quad \mathbb{E}[\epsilon \mid X, D] = 0$$
where $f(X)$ is the baseline outcome function. Taking the expectation conditional on $X$, with $m(X) = \mathbb{E}[Y \mid X]$ and $e(X) = \mathbb{E}[D \mid X]$:
$$m(X) = \tau(X) \cdot e(X) + f(X)$$
Subtracting the second equation from the first:
$$Y - m(X) = \tau(X) \cdot (D - e(X)) + \epsilon$$
This is a regression of the outcome residual on the treatment residual, with coefficient $\tau(X)$. The R-learner operationalizes this by replacing expectations with estimates and minimizing the resulting loss. The key property: errors in $\hat{m}(x)$ and $\hat{e}(x)$ enter the loss only through products of residuals, which means they affect the CATE estimate at a second-order rate. This is an instance of Neyman orthogonality — the same principle underlying DML (Section 19.4).
class RLearner(BaseEstimator, RegressorMixin):
"""R-Learner for CATE estimation.
Residual-on-residual regression with cross-fitting for nuisance
estimation. Directly targets the CATE through a loss function
that is robust to first-stage estimation errors.
Attributes:
outcome_model: Regressor for marginal outcome E[Y|X].
propensity_model: Classifier for propensity score P(D=1|X).
effect_model: Regressor for CATE tau(X).
n_splits: Number of cross-fitting folds.
"""
def __init__(
self,
outcome_model: Optional[BaseEstimator] = None,
propensity_model: Optional[BaseEstimator] = None,
effect_model: Optional[BaseEstimator] = None,
n_splits: int = 5,
):
self.outcome_model = outcome_model or GradientBoostingRegressor(
n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
self.propensity_model = propensity_model or GradientBoostingClassifier(
n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
)
self.effect_model = effect_model or GradientBoostingRegressor(
n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
self.n_splits = n_splits
def fit(self, X: np.ndarray, D: np.ndarray, Y: np.ndarray) -> "RLearner":
"""Fit the R-learner with cross-fitted nuisance estimates.
Args:
X: Covariates (n, p).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
Returns:
self
"""
# Cross-fitted nuisance estimates to avoid overfitting bias
m_hat = cross_val_predict(
self.outcome_model, X, Y, cv=self.n_splits
)
e_hat = cross_val_predict(
self.propensity_model, X, D, cv=self.n_splits, method="predict_proba"
)[:, 1]
# Clip propensity to avoid extreme weights
e_hat = np.clip(e_hat, 0.01, 0.99)
# Form residuals
Y_tilde = Y - m_hat
D_tilde = D - e_hat
# Fit effect model: minimize sum( (Y_tilde - tau(X) * D_tilde)^2 )
# This is equivalent to weighted regression of Y_tilde/D_tilde on X
# with weights D_tilde^2
pseudo_outcome = Y_tilde / D_tilde
weights = D_tilde ** 2
self.effect_model.fit(X, pseudo_outcome, sample_weight=weights)
        # Refit nuisance models on the full data; predict() uses only the
        # effect model, but the fitted nuisances are useful for diagnostics
self.outcome_model.fit(X, Y)
self.propensity_model.fit(X, D)
return self
def predict(self, X: np.ndarray) -> np.ndarray:
"""Estimate CATE.
Args:
X: Covariates (n, p).
Returns:
Estimated CATE (n,).
"""
return self.effect_model.predict(X)
Strengths: The R-learner directly targets the CATE, avoiding the subtraction of two large quantities (which amplifies noise in T- and X-learners). The Neyman orthogonality property provides second-order robustness to nuisance estimation errors. Cross-fitting prevents overfitting bias in nuisance estimation.
Weaknesses: Requires dividing by treatment residuals $\tilde{D}_i = D_i - \hat{e}(X_i)$, which can be near zero when the propensity score is close to 0 or 1. This division amplifies noise in regions of low overlap. The pseudo-outcome transformation can produce extreme values that destabilize the second-stage regression.
Comparing Meta-Learners
def compare_meta_learners(
n: int = 10000,
seed: int = 42,
) -> pd.DataFrame:
"""Compare meta-learners on simulated data with known CATEs.
Args:
n: Number of observations.
seed: Random seed.
Returns:
DataFrame with RMSE and bias for each learner.
"""
X, D, Y, tau_true, _ = simulate_heterogeneous_treatment(n=n, seed=seed)
# Train-test split
n_train = int(0.7 * n)
X_tr, D_tr, Y_tr = X[:n_train], D[:n_train], Y[:n_train]
X_te, tau_te = X[n_train:], tau_true[n_train:]
learners = {
"S-Learner": SLearner(),
"T-Learner": TLearner(),
"X-Learner": XLearner(),
"R-Learner": RLearner(),
}
results = []
for name, learner in learners.items():
learner.fit(X_tr, D_tr, Y_tr)
tau_hat = learner.predict(X_te)
rmse = np.sqrt(np.mean((tau_hat - tau_te) ** 2))
bias = np.mean(tau_hat - tau_te)
correlation = np.corrcoef(tau_hat, tau_te)[0, 1]
results.append({
"Learner": name,
"RMSE": round(rmse, 3),
"Bias": round(bias, 3),
"Correlation(tau_hat, tau)": round(correlation, 3),
})
return pd.DataFrame(results)
Implementation Note: The comparison above uses gradient-boosted trees as the base learner for all meta-learners. In practice, the choice of base learner matters. LASSO-based meta-learners work well in high-dimensional sparse settings. Neural-network-based CATE estimators (e.g., DragonNet, CEVAE) are appropriate when the outcome and treatment functions have complex nonlinear structure. The meta-learner framework is agnostic to the base learner — this modularity is its defining feature.
| Setting | Best meta-learner | Why |
|---|---|---|
| Small treatment effect, good overlap | R-learner | Direct targeting avoids regularization bias |
| Imbalanced treatment groups (e.g., 95/5) | X-learner | Propensity-weighted combination adapts to imbalance |
| Simple treatment effect, complex outcome | S-learner | Shares statistical strength across groups |
| Large treatment effect, moderate $n$ | T-learner | Simple, avoids unnecessary complexity |
| High-dimensional confounders | R-learner with cross-fitting | Neyman orthogonality provides robustness |
19.3 Causal Forests: Partitioning for Heterogeneity
Causal forests (Athey and Imbens, 2016; Wager and Athey, 2018; Athey, Tibshirani, and Wager, 2019) are purpose-built for CATE estimation. While meta-learners repurpose standard ML models, causal forests modify the random forest algorithm itself so that splitting is driven by treatment effect heterogeneity rather than predictive accuracy.
Intuition: Honest Splitting for Treatment Effects
A standard random forest splits to minimize prediction error (e.g., sum of squared residuals in each leaf). A causal forest splits to maximize the difference in treatment effects across leaves.
Consider a node containing $n$ observations with covariates $X$, treatments $D$, and outcomes $Y$. For a candidate split on feature $j$ at threshold $c$, the causal forest computes the treatment effect in each child:
$$\hat{\tau}_L = \frac{\sum_{i \in L, D_i=1} Y_i}{n_{L,1}} - \frac{\sum_{i \in L, D_i=0} Y_i}{n_{L,0}} \quad \text{(left child)}$$
$$\hat{\tau}_R = \frac{\sum_{i \in R, D_i=1} Y_i}{n_{R,1}} - \frac{\sum_{i \in R, D_i=0} Y_i}{n_{R,0}} \quad \text{(right child)}$$
The split is chosen to maximize:
$$\Delta = n_L \cdot (\hat{\tau}_L - \hat{\tau})^2 + n_R \cdot (\hat{\tau}_R - \hat{\tau})^2$$
where $\hat{\tau}$ is the treatment effect estimate in the parent node. This criterion rewards splits that create children with different treatment effects, regardless of whether the split improves prediction of $Y$ itself.
Honesty and Asymptotic Validity
A key innovation of causal forests is honesty: the data used to determine the tree structure (where to split) is different from the data used to estimate the treatment effect in each leaf (what to predict). This separation ensures that the treatment effect estimates are not biased by the adaptive splitting process.
Honest splitting procedure:
1. Randomly split the training data into two halves: the splitting sample and the estimation sample.
2. Use the splitting sample to determine the tree structure (where to split, at what thresholds).
3. Use the estimation sample to compute the treatment effect estimate in each leaf.
4. For a new observation $x$, drop it down the tree and return the treatment effect estimate from the leaf it lands in.
This honesty property, combined with subsampling (each tree is built on a subsample of the data), gives causal forests valid asymptotic confidence intervals for $\tau(x)$ — a guarantee that few other CATE estimators provide without additional assumptions.
Research Insight: Wager and Athey (2018) proved that causal forests are consistent and asymptotically normal under regularity conditions. Specifically, if the tree is built with honesty and subsampling, the CATE estimate $\hat{\tau}(x)$ converges to $\tau(x)$ as $n \to \infty$, and $(\hat{\tau}(x) - \tau(x)) / \hat{\sigma}(x) \to N(0, 1)$. This means you can construct valid confidence intervals and test $H_0: \tau(x) = 0$ for any $x$. This is a remarkable result — it converts a black-box machine learning method into a proper statistical estimator with quantified uncertainty.
The Generalized Random Forest (GRF)
The grf package (Athey, Tibshirani, and Wager, 2019) generalizes causal forests to a broader framework called generalized random forests. The key idea: any estimating equation of the form
$$\mathbb{E}[\psi_{\theta(x)}(O_i) \mid X_i = x] = 0$$
can be solved locally at each $x$ using the forest's adaptive neighborhood weights. For CATE estimation, the estimating equation is:
$$\psi_{\tau(x)}(Y_i, D_i, X_i) = (Y_i - \mu(X_i)) - \tau(x) \cdot (D_i - e(X_i))$$
where $\mu(X_i)$ and $e(X_i)$ are nuisance components. GRF solves this locally — the forest defines a data-adaptive kernel that gives nearby points (in a treatment-effect-relevant sense) higher weight.
Implementation with EconML
from econml.dml import CausalForestDML
def fit_causal_forest(
X: np.ndarray,
D: np.ndarray,
Y: np.ndarray,
n_estimators: int = 500,
min_samples_leaf: int = 10,
max_depth: Optional[int] = None,
) -> CausalForestDML:
"""Fit a causal forest using EconML's CausalForestDML.
CausalForestDML combines causal forest estimation with
double/debiased machine learning for nuisance estimation.
Args:
X: Covariates (n, p).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
n_estimators: Number of trees in the forest.
min_samples_leaf: Minimum number of samples per leaf.
max_depth: Maximum tree depth (None for unlimited).
Returns:
Fitted CausalForestDML model.
"""
model = CausalForestDML(
model_y=GradientBoostingRegressor(
n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
),
model_t=GradientBoostingClassifier(
n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
),
n_estimators=n_estimators,
min_samples_leaf=min_samples_leaf,
max_depth=max_depth,
random_state=42,
criterion="het", # Heterogeneity-driven splitting
)
model.fit(Y, D, X=X)
return model
def analyze_causal_forest(
model: CausalForestDML,
X: np.ndarray,
tau_true: Optional[np.ndarray] = None,
) -> Dict[str, np.ndarray]:
"""Analyze causal forest estimates and extract insights.
Args:
model: Fitted CausalForestDML.
X: Covariates for prediction (n, p).
tau_true: True CATEs (for evaluation, None in real applications).
Returns:
Dictionary with CATE estimates, confidence intervals,
and subgroup analysis.
"""
# Point estimates
tau_hat = model.effect(X).flatten()
# Confidence intervals
lb, ub = model.effect_interval(X, alpha=0.05)
lb, ub = lb.flatten(), ub.flatten()
# Feature importance for treatment effect heterogeneity
importances = model.feature_importances_
results = {
"tau_hat": tau_hat,
"ci_lower": lb,
"ci_upper": ub,
"feature_importances": importances,
}
if tau_true is not None:
rmse = np.sqrt(np.mean((tau_hat - tau_true) ** 2))
coverage = np.mean((tau_true >= lb) & (tau_true <= ub))
results["rmse"] = rmse
results["coverage"] = coverage
return results
Interpreting Causal Forest Output
The causal forest produces three essential outputs:
- CATE estimates $\hat{\tau}(x)$ — the estimated treatment effect for each individual, which enables personalized treatment decisions.
- Confidence intervals $[\hat{\tau}(x) \pm z_{\alpha/2} \cdot \hat{\sigma}(x)]$ — the uncertainty in each CATE estimate. A wide interval means insufficient data to distinguish the individual's effect from zero.
- Feature importances — which covariates drive treatment effect heterogeneity. Unlike standard feature importance (which features predict $Y$), causal forest feature importance tells you which features predict treatment effect variation.
Production Reality: Causal forest feature importance often looks nothing like standard predictive feature importance. A feature that is the strongest predictor of the outcome (e.g., disease severity predicts readmission) may have zero importance for treatment effect heterogeneity (the treatment works equally well regardless of disease severity). Conversely, a feature that barely predicts the outcome (e.g., a genetic marker) may be the strongest driver of treatment effect heterogeneity (the drug works only for carriers of that marker). This inversion is one of the most practically useful insights from CATE estimation.
19.4 Double/Debiased Machine Learning
Double/debiased machine learning (DML; Chernozhukov et al., 2018) solves a specific problem: estimating a causal parameter when the nuisance functions (outcome model, propensity score) are estimated by machine learning. The problem is that standard ML estimators have regularization bias — they converge to the true function at a slower rate than parametric estimators — and this bias "leaks" into the causal parameter estimate.
The Problem: Regularization Bias
Consider the partially linear model:
$$Y = \theta \cdot D + g(X) + \epsilon, \quad \mathbb{E}[\epsilon \mid X, D] = 0$$
$$D = m(X) + \eta, \quad \mathbb{E}[\eta \mid X] = 0$$
Here $\theta$ is the (constant) treatment effect, $g(X)$ is an arbitrary function of confounders (the "nuisance" function), and $m(X)$ is the propensity score. We want to estimate $\theta$.
A naive approach: estimate $g(X)$ with a flexible ML model $\hat{g}(X)$, then regress $Y - \hat{g}(X)$ on $D$. The problem: if $\hat{g}(X)$ has bias $b_g$ (e.g., from regularization), then the estimate of $\theta$ inherits that bias. For a LASSO with $p \gg n$, the bias is $O(\sqrt{s \log p / n})$, where $s$ is the sparsity — not the $O(1/\sqrt{n})$ rate we need for valid inference.
The Solution: Neyman Orthogonality and Cross-Fitting
DML solves this through two innovations:
1. Neyman orthogonality. Instead of estimating $\theta$ from the residualized outcome alone, DML uses a doubly-robust moment condition:
$$\psi(\theta; g, m) = (Y - g(X) - \theta \cdot D) \cdot (D - m(X))$$
Setting $\mathbb{E}[\psi(\theta; g, m)] = 0$ and solving for $\theta$ gives the DML estimator. The critical property: the derivative of $\mathbb{E}[\psi]$ with respect to the nuisance parameters $(g, m)$, evaluated at the true values, is zero:
$$\frac{\partial}{\partial g} \mathbb{E}[\psi(\theta; g, m)] \bigg|_{g=g_0, m=m_0} = 0$$
$$\frac{\partial}{\partial m} \mathbb{E}[\psi(\theta; g, m)] \bigg|_{g=g_0, m=m_0} = 0$$
This means small errors in $\hat{g}$ and $\hat{m}$ affect $\hat{\theta}$ only at a second-order rate. If $\hat{g}$ and $\hat{m}$ each converge at rate $n^{-1/4}$ (which most reasonable ML estimators achieve), then $\hat{\theta}$ converges at rate $n^{-1/2}$ — the parametric rate. This is the debiasing.
2. Cross-fitting. To prevent overfitting bias (where the same data is used to estimate nuisance functions and to construct the causal estimate), DML uses sample splitting:
- Split data into $K$ folds.
- For each fold $k$, estimate $\hat{g}^{-k}$ and $\hat{m}^{-k}$ on all data except fold $k$.
- Compute residuals for observations in fold $k$ using $\hat{g}^{-k}$ and $\hat{m}^{-k}$.
- Pool residuals across all folds and solve for $\theta$.
Cross-fitting ensures that the nuisance predictions are out-of-sample, eliminating the overfitting bias that would arise if we predicted $\hat{g}(X_i)$ using a model trained on data including observation $i$.
class DoubleMachineLearning:
"""Double/Debiased Machine Learning for causal estimation.
    Implements the DML2 estimator (residuals pooled across all folds,
    then solved once) with cross-fitting for the partially linear
    model: Y = theta * D + g(X) + epsilon.
Attributes:
model_y: ML model for E[Y|X] (nuisance: outcome).
model_d: ML model for E[D|X] (nuisance: propensity).
n_splits: Number of cross-fitting folds.
"""
def __init__(
self,
model_y: Optional[BaseEstimator] = None,
model_d: Optional[BaseEstimator] = None,
n_splits: int = 5,
):
self.model_y = model_y or GradientBoostingRegressor(
n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
self.model_d = model_d or GradientBoostingClassifier(
n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
)
self.n_splits = n_splits
self.theta_hat_: float = 0.0
self.se_: float = 0.0
def fit(self, X: np.ndarray, D: np.ndarray, Y: np.ndarray) -> "DoubleMachineLearning":
"""Fit the DML estimator with cross-fitting.
Args:
X: Covariates (n, p).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
Returns:
self
"""
n = len(Y)
kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=42)
# Storage for residuals
Y_res = np.zeros(n)
D_res = np.zeros(n)
for train_idx, test_idx in kf.split(X):
# Fit nuisance models on training fold
self.model_y.fit(X[train_idx], Y[train_idx])
self.model_d.fit(X[train_idx], D[train_idx])
# Predict on held-out fold
Y_hat = self.model_y.predict(X[test_idx])
if hasattr(self.model_d, "predict_proba"):
D_hat = self.model_d.predict_proba(X[test_idx])[:, 1]
else:
D_hat = self.model_d.predict(X[test_idx])
# Residualize
Y_res[test_idx] = Y[test_idx] - Y_hat
D_res[test_idx] = D[test_idx] - D_hat
# Solve for theta: regress Y residuals on D residuals
# theta_hat = sum(D_res * Y_res) / sum(D_res^2)
self.theta_hat_ = np.sum(D_res * Y_res) / np.sum(D_res ** 2)
# Standard error (heteroskedasticity-robust)
epsilon = Y_res - self.theta_hat_ * D_res
V = np.mean((D_res ** 2) * (epsilon ** 2)) / (np.mean(D_res ** 2) ** 2)
self.se_ = np.sqrt(V / n)
return self
@property
def theta(self) -> float:
"""Estimated treatment effect."""
return self.theta_hat_
@property
def confidence_interval(self) -> Tuple[float, float]:
"""95% confidence interval for theta."""
return (self.theta_hat_ - 1.96 * self.se_, self.theta_hat_ + 1.96 * self.se_)
def summary(self) -> str:
"""Print estimation results."""
ci = self.confidence_interval
return (
f"DML Estimate: theta_hat = {self.theta_hat_:.4f}\n"
f"Standard Error: {self.se_:.4f}\n"
f"95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]\n"
f"t-stat: {self.theta_hat_ / self.se_:.2f}"
)
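The cross-fitting recipe is easy to sanity-check end to end. The sketch below is a compact, self-contained version of the same estimator on simulated data with an assumed true effect $\theta = 2.0$ and nonlinear confounding through $X_0$ (the data-generating process is ours, chosen for illustration). The naive difference in means absorbs the confounding; the orthogonalized, cross-fitted estimate does not:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
n = 4000
X = rng.normal(size=(n, 5))
g = 2.0 * np.sin(X[:, 0]) + X[:, 1]           # nuisance g(X)
m = 1.0 / (1.0 + np.exp(-X[:, 0]))            # true propensity m(X)
D = rng.binomial(1, m)
theta_true = 2.0
Y = theta_true * D + g + rng.normal(0, 1, n)

# Cross-fitting: every nuisance prediction is out-of-fold
Y_res, D_res = np.zeros(n), np.zeros(n)
for tr, te in KFold(5, shuffle=True, random_state=0).split(X):
    gy = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
    gd = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
    gy.fit(X[tr], Y[tr])
    gd.fit(X[tr], D[tr])
    Y_res[te] = Y[te] - gy.predict(X[te])
    D_res[te] = D[te] - gd.predict_proba(X[te])[:, 1]

# Residual-on-residual regression recovers theta
theta_hat = np.sum(D_res * Y_res) / np.sum(D_res ** 2)
naive = Y[D == 1].mean() - Y[D == 0].mean()   # confounded benchmark
print(f"DML: {theta_hat:.3f}, naive: {naive:.3f}")
```

The naive estimate overshoots because treated units have systematically higher $g(X)$; residualizing both $Y$ and $D$ removes that shared dependence on $X$ before the final regression.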
From Constant Effects to Heterogeneous Effects with DML
The basic DML estimator above recovers a constant $\theta$. For heterogeneous effects, we extend to the interactive model:
$$Y = g(X, D) + \epsilon$$
where the treatment effect $\tau(x) = g(x, 1) - g(x, 0)$ varies with $x$. EconML's LinearDML, SparseLinearDML, and CausalForestDML implement this extension. The key idea: use the DML framework for nuisance estimation (outcome and propensity models), then estimate the heterogeneous $\tau(x)$ using the orthogonalized residuals.
from econml.dml import LinearDML, SparseLinearDML, CausalForestDML
def dml_with_heterogeneity(
X: np.ndarray,
D: np.ndarray,
Y: np.ndarray,
W: Optional[np.ndarray] = None,
) -> Dict[str, object]:
"""Estimate heterogeneous treatment effects using DML.
Uses LinearDML when the CATE is linear in effect modifiers,
and CausalForestDML for nonparametric CATE estimation.
Args:
X: Effect modifiers (variables that modify the treatment effect).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
W: Additional confounders (variables needed for identification
but not expected to modify the treatment effect).
Returns:
Dictionary with fitted models and estimates.
"""
# Linear heterogeneity: tau(x) = x^T * beta
linear_dml = LinearDML(
model_y=GradientBoostingRegressor(
n_estimators=200, max_depth=4, random_state=42
),
model_t=GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
),
random_state=42,
)
linear_dml.fit(Y, D, X=X, W=W)
# Nonparametric heterogeneity: tau(x) estimated by causal forest
forest_dml = CausalForestDML(
model_y=GradientBoostingRegressor(
n_estimators=200, max_depth=4, random_state=42
),
model_t=GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
),
n_estimators=500,
min_samples_leaf=10,
random_state=42,
)
forest_dml.fit(Y, D, X=X, W=W)
return {
"linear_dml": linear_dml,
"forest_dml": forest_dml,
"ate_linear": linear_dml.ate(X),
"ate_forest": forest_dml.ate(X),
}
Advanced Sidebar: Chernozhukov et al. (2018) prove a general result: any causal parameter identified by a Neyman orthogonal moment condition can be estimated at the $\sqrt{n}$ rate using ML nuisance estimators, provided the nuisance estimators converge at the $n^{-1/4}$ rate (or more precisely, that the product of the nuisance estimation errors converges at the $n^{-1/2}$ rate). This result is remarkably general — it covers not just the partially linear model but also IV regression, panel data models, and sample selection models. The DML framework has become the standard approach for combining flexible ML methods with rigorous causal inference.
19.5 Uplift Modeling: The Business of Targeting
Uplift modeling is the applied-business realization of CATE estimation. The terminology shifts — marketing campaigns instead of drug treatments, customers instead of patients — but the mathematics is identical. The core question: Which individuals should we treat?
The Four Quadrants of Treatment Response
For a binary outcome (purchase, conversion, engagement), individuals fall into four categories based on their potential outcomes:
|  | $Y(0) = 0$ (no outcome without treatment) | $Y(0) = 1$ (outcome without treatment) |
|---|---|---|
| $Y(1) = 1$ | Persuadable ($\tau_i = 1$): Treatment causes the outcome. Target these. | Sure Thing ($\tau_i = 0$): Would have the outcome regardless. Don't waste treatment. |
| $Y(1) = 0$ | Lost Cause ($\tau_i = 0$): No outcome regardless of treatment. Don't waste treatment. | Sleeping Dog ($\tau_i = -1$): Treatment prevents the outcome. Definitely don't treat. |
Standard predictive modeling conflates Persuadables with Sure Things (both have high $P(Y = 1)$ given treatment). Uplift modeling distinguishes them by estimating $\tau_i = Y_i(1) - Y_i(0)$.
Production Reality: In practice, the "Sleeping Dog" quadrant is more common than most marketers expect. A promotional email may cause some customers to unsubscribe (treatment-induced churn). A recommendation may cause some users to leave the platform (recommendation fatigue). Identifying and avoiding Sleeping Dogs can be more valuable than finding Persuadables, because the negative effect is often larger in magnitude than the positive effect.
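The conflation of Persuadables and Sure Things is easy to demonstrate numerically. In this toy randomized simulation (segment labels and response probabilities are illustrative, not from any real campaign), the two segments are indistinguishable on the predictive quantity $P(Y = 1 \mid D = 1)$ but separate cleanly on uplift:

```python
import numpy as np

rng = np.random.RandomState(0)
# Segment: (P(Y=1 | D=1), P(Y=1 | D=0)) -- illustrative values
segments = {
    "persuadable":  (0.90, 0.10),
    "sure_thing":   (0.90, 0.90),
    "lost_cause":   (0.05, 0.05),
    "sleeping_dog": (0.10, 0.60),
}
n_per = 5000
results = {}
for name, (p1, p0) in segments.items():
    D = rng.binomial(1, 0.5, n_per)                       # randomized treatment
    Y = np.where(D == 1,
                 rng.binomial(1, p1, n_per),
                 rng.binomial(1, p0, n_per))
    response_rate = Y[D == 1].mean()                      # what a predictive model ranks on
    uplift = Y[D == 1].mean() - Y[D == 0].mean()          # what an uplift model targets
    results[name] = (response_rate, uplift)
    print(f"{name:13s} P(Y=1|D=1)={response_rate:.2f}  uplift={uplift:+.2f}")
```

A response model scores Persuadables and Sure Things identically (both around 0.90) and would happily spend budget on customers who convert anyway; only the uplift column distinguishes them, and only it exposes the negative effect on Sleeping Dogs.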
The Transformed Outcome Approach
The transformed outcome (Athey and Imbens, 2016) provides a simple unbiased estimator of the ITE for randomized experiments:
$$Y_i^* = \frac{D_i \cdot Y_i}{e(X_i)} - \frac{(1 - D_i) \cdot Y_i}{1 - e(X_i)}$$
where $e(X_i) = P(D_i = 1 \mid X_i)$ is the propensity score (known in a randomized experiment, estimated in observational data).
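The construction is unbiased for the CATE by design: under unconfoundedness ($D_i \perp (Y_i(0), Y_i(1)) \mid X_i$), we have $\mathbb{E}[D_i Y_i \mid X_i] = e(X_i)\,\mathbb{E}[Y_i(1) \mid X_i]$, so each term inverse-weights the observed arm by its assignment probability and the weights cancel:

$$\mathbb{E}[Y_i^* \mid X_i] = \frac{e(X_i)\,\mathbb{E}[Y_i(1) \mid X_i]}{e(X_i)} - \frac{(1 - e(X_i))\,\mathbb{E}[Y_i(0) \mid X_i]}{1 - e(X_i)} = \tau(X_i)$$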
The key property: $\mathbb{E}[Y_i^* \mid X_i] = \tau(X_i)$. The transformed outcome is a noisy but unbiased signal for the CATE (not for the individual effect $\tau_i$ itself, which remains unobservable). This means we can train any regression model on $(X_i, Y_i^*)$ pairs to estimate $\tau(x)$:
def compute_transformed_outcome(
Y: np.ndarray,
D: np.ndarray,
e: np.ndarray,
) -> np.ndarray:
"""Compute the transformed outcome for uplift estimation.
Y* = D*Y/e(X) - (1-D)*Y/(1-e(X))
E[Y*|X] = tau(X), so regressing Y* on X estimates the CATE.
Args:
Y: Observed outcome (n,).
D: Treatment indicator (n,).
e: Propensity score P(D=1|X) (n,).
Returns:
Transformed outcome (n,).
"""
e = np.clip(e, 0.01, 0.99) # Clip to avoid division by near-zero
return D * Y / e - (1 - D) * Y / (1 - e)
def fit_uplift_model(
X: np.ndarray,
D: np.ndarray,
Y: np.ndarray,
propensity_model: Optional[BaseEstimator] = None,
effect_model: Optional[BaseEstimator] = None,
) -> Tuple[BaseEstimator, np.ndarray]:
"""Fit an uplift model using the transformed outcome approach.
Args:
X: Covariates (n, p).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
        propensity_model: Model for P(D=1|X). If None, assumes an RCT and
            uses the empirical treatment rate as a constant propensity.
effect_model: Regressor for tau(X).
Returns:
Fitted effect model and transformed outcomes.
"""
# Estimate propensity
if propensity_model is not None:
propensity_model.fit(X, D)
e = propensity_model.predict_proba(X)[:, 1]
else:
e = np.full(len(D), D.mean()) # Assume RCT with constant propensity
# Compute transformed outcome
Y_star = compute_transformed_outcome(Y, D, e)
# Fit regression on transformed outcome
if effect_model is None:
effect_model = GradientBoostingRegressor(
n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
effect_model.fit(X, Y_star)
return effect_model, Y_star
Building a Targeting Policy
A targeting policy $\pi(x) \in \{0, 1\}$ assigns treatment based on covariates. The optimal policy treats individual $i$ if and only if the treatment benefit exceeds the cost:
$$\pi^*(x) = \mathbb{1}[\tau(x) > c]$$
where $c$ is the per-unit treatment cost (potentially zero for a digital intervention like a recommendation). In practice, we estimate $\tau(x)$ and construct $\hat{\pi}(x) = \mathbb{1}[\hat{\tau}(x) > c]$.
For resource-constrained settings (e.g., budget for treating $k$ out of $N$ individuals), the optimal policy treats the top-$k$ individuals ranked by $\hat{\tau}(x)$.
@dataclass
class TargetingPolicy:
"""A targeting policy that assigns treatment based on estimated CATEs.
Attributes:
cate_model: A fitted model that predicts tau(x).
threshold: Treatment cost / minimum beneficial effect.
budget_fraction: Maximum fraction of population to treat (None = no budget constraint).
"""
cate_model: BaseEstimator
threshold: float = 0.0
budget_fraction: Optional[float] = None
def assign(self, X: np.ndarray) -> np.ndarray:
"""Assign treatment based on estimated CATEs.
Args:
X: Covariates (n, p).
Returns:
Treatment assignments (n,). 1 = treat, 0 = don't treat.
"""
tau_hat = self.cate_model.predict(X)
if self.budget_fraction is not None:
# Treat the top budget_fraction by estimated CATE
            k = max(int(self.budget_fraction * len(X)), 1)
            threshold = np.sort(tau_hat)[::-1][min(k, len(tau_hat)) - 1]
return (tau_hat >= threshold).astype(int)
else:
return (tau_hat > self.threshold).astype(int)
def expected_value(
self,
X: np.ndarray,
tau_true: np.ndarray,
treatment_cost: float = 0.0,
) -> Dict[str, float]:
"""Evaluate the policy using true CATEs (simulation only).
Args:
X: Covariates (n, p).
tau_true: True CATEs (n,).
treatment_cost: Cost per treated unit.
Returns:
Policy evaluation metrics.
"""
assignments = self.assign(X)
n_treated = assignments.sum()
n_total = len(X)
total_benefit = np.sum(assignments * tau_true)
total_cost = n_treated * treatment_cost
net_value = total_benefit - total_cost
# Compare to treat-all and treat-none
treat_all_value = np.sum(tau_true) - n_total * treatment_cost
treat_none_value = 0.0
return {
"n_treated": int(n_treated),
"fraction_treated": n_treated / n_total,
"total_benefit": total_benefit,
"net_value": net_value,
"treat_all_value": treat_all_value,
"treat_none_value": treat_none_value,
"value_vs_treat_all": net_value - treat_all_value,
"value_vs_treat_none": net_value - treat_none_value,
}
19.6 Evaluating Causal ML: The Qini Curve and Beyond
Evaluating causal ML models is fundamentally harder than evaluating predictive models. For prediction, we observe the target $Y$ and compute $\hat{Y} - Y$. For causation, we never observe $\tau_i = Y_i(1) - Y_i(0)$ — we observe either $Y_i(1)$ or $Y_i(0)$, never both.
Why Standard ML Metrics Fail
- MSE / RMSE: Would require observing $\tau_i$, which is impossible.
- AUC / accuracy: Would require knowing the true binary treatment response, which is impossible.
- R-squared: Would require the variance of $\tau_i$, which is unobservable.
We need evaluation metrics designed for the causal setting, where the ground truth for individual treatment effects is fundamentally unobservable.
The Qini Curve: The Causal Analog of the ROC Curve
The Qini curve (Radcliffe, 2007) evaluates a CATE model by its ability to rank individuals by treatment benefit. It does not require knowing $\tau_i$; it only requires experimental or quasi-experimental data.
Construction:
1. Rank all individuals by $\hat{\tau}(x_i)$ in descending order (highest estimated benefit first).
2. For each fraction $f$ of the population (from $f = 0$ to $f = 1$), compute the incremental number of positive outcomes from treating the top-$f$ fraction, relative to treating nobody.
Formally, for a population of $n$ individuals sorted by $\hat{\tau}(x_{(i)})$, the Qini value at fraction $f$ is:
$$Q(f) = \left( \frac{\sum_{i=1}^{fn} Y_i D_i}{\sum_{i=1}^{fn} D_i} - \frac{\sum_{i=1}^{fn} Y_i (1-D_i)}{\sum_{i=1}^{fn} (1-D_i)} \right) \cdot fn$$
where the sums are over the top-$f$ fraction of individuals ranked by $\hat{\tau}$.
The Qini curve rises steeply if the model successfully identifies high-benefit individuals first. A random model produces a straight line (diagonal). The Qini coefficient (area between the model's curve and the diagonal) quantifies uplift ranking quality.
def qini_curve(
tau_hat: np.ndarray,
Y: np.ndarray,
D: np.ndarray,
n_bins: int = 100,
) -> Tuple[np.ndarray, np.ndarray, float]:
"""Compute the Qini curve for a CATE model.
The Qini curve measures how well the model ranks individuals
by treatment benefit. It does not require knowing the true
individual treatment effects.
Args:
tau_hat: Estimated CATEs (n,).
Y: Observed outcomes (n,).
D: Treatment indicators (n,).
n_bins: Number of points on the curve.
Returns:
fractions: Population fractions (n_bins,).
qini_values: Qini values at each fraction (n_bins,).
qini_coefficient: Area between curve and random baseline.
"""
n = len(Y)
# Sort by descending estimated CATE
order = np.argsort(-tau_hat)
Y_sorted = Y[order]
D_sorted = D[order]
fractions = np.linspace(0, 1, n_bins + 1)[1:]
qini_values = np.zeros(n_bins)
for i, f in enumerate(fractions):
k = int(f * n)
Y_top = Y_sorted[:k]
D_top = D_sorted[:k]
n_treated = D_top.sum()
n_control = k - n_treated
if n_treated > 0 and n_control > 0:
uplift = (
Y_top[D_top == 1].sum() / n_treated
- Y_top[D_top == 0].sum() / n_control
)
qini_values[i] = uplift * k
else:
qini_values[i] = 0.0
    # Qini coefficient: area between curve and diagonal (manual
    # trapezoidal rule; np.trapz was removed in NumPy 2.0)
    random_line = qini_values[-1] * fractions
    diff = qini_values - random_line
    qini_coefficient = float(np.sum((diff[1:] + diff[:-1]) * np.diff(fractions)) / 2)
return fractions, qini_values, qini_coefficient
def plot_qini_curves(
models: Dict[str, np.ndarray],
Y: np.ndarray,
D: np.ndarray,
figsize: Tuple[int, int] = (10, 6),
) -> None:
"""Plot Qini curves for multiple CATE models.
Args:
models: Dictionary mapping model name to estimated CATEs.
Y: Observed outcomes (n,).
D: Treatment indicators (n,).
figsize: Figure size.
"""
fig, ax = plt.subplots(figsize=figsize)
for name, tau_hat in models.items():
fractions, qini_values, qini_coef = qini_curve(tau_hat, Y, D)
ax.plot(fractions, qini_values, label=f"{name} (Qini={qini_coef:.3f})")
# Random baseline
fractions_random = np.linspace(0, 1, 100)
ax.plot(
fractions_random,
qini_values[-1] * fractions_random,
"--",
color="gray",
label="Random",
)
ax.set_xlabel("Fraction of population treated")
ax.set_ylabel("Incremental positive outcomes")
ax.set_title("Qini Curves: Ranking Individuals by Treatment Benefit")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
The AUUC and the Uplift Curve
The Area Under the Uplift Curve (AUUC) is an alternative metric. The uplift curve plots the cumulative uplift (treatment effect) as we treat progressively larger fractions of the population, sorted by the model's ranking:
$$\text{Uplift}(f) = \frac{1}{n_T(f)} \sum_{i=1}^{fn} Y_i D_i - \frac{1}{n_C(f)} \sum_{i=1}^{fn} Y_i(1 - D_i)$$
where $n_T(f)$ and $n_C(f)$ are the numbers of treated and control units in the top-$f$ fraction. The uplift curve starts high (if the model correctly identifies high-benefit individuals) and declines toward the ATE as $f \to 1$.
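Unlike the Qini curve, the uplift curve has no implementation earlier in this section, so here is a minimal sketch (the function name `uplift_curve` is ours, not from a library). It applies the formula above at each fraction and integrates the curve with a manual trapezoidal rule for the AUUC:

```python
import numpy as np

def uplift_curve(tau_hat, Y, D, n_bins=100):
    """Cumulative uplift at each treated fraction, plus AUUC (trapezoidal area)."""
    order = np.argsort(-tau_hat)              # best-ranked individuals first
    Y_s, D_s = Y[order], D[order]
    n = len(Y)
    fractions = np.linspace(0, 1, n_bins + 1)[1:]
    uplift = np.zeros(n_bins)
    for i, f in enumerate(fractions):
        k = int(f * n)
        Yt, Dt = Y_s[:k], D_s[:k]
        nT = Dt.sum()
        nC = k - nT
        if nT > 0 and nC > 0:
            uplift[i] = Yt[Dt == 1].mean() - Yt[Dt == 0].mean()
    # Trapezoidal area under the curve (np.trapz was removed in NumPy 2.0)
    auuc = float(np.sum((uplift[1:] + uplift[:-1]) * np.diff(fractions)) / 2)
    return fractions, uplift, auuc

# Toy RCT: ranking by the true tau starts near the max effect, declines to the ATE
rng = np.random.RandomState(0)
n = 20000
tau = rng.uniform(0, 2, n)                    # heterogeneous true effects, ATE = 1.0
D = rng.binomial(1, 0.5, n)
Y = 1.0 + D * tau + rng.normal(0, 0.5, n)
fr, up, auuc = uplift_curve(tau, Y, D)
```

With a perfect ranker, the curve at small $f$ sits near the largest effects (about 2.0 here) and decays toward the ATE of 1.0 at $f = 1$, exactly the shape described above.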
Calibration for CATEs
A well-calibrated CATE model satisfies:
$$\mathbb{E}[Y(1) - Y(0) \mid \hat{\tau}(X) = t] = t$$
In words: among individuals for whom the model estimates a treatment effect of $t$, the actual average treatment effect should be $t$. This can be assessed using a CATE calibration plot (analogous to a reliability diagram for classification):
def cate_calibration_plot(
tau_hat: np.ndarray,
Y: np.ndarray,
D: np.ndarray,
e: np.ndarray,
n_bins: int = 10,
figsize: Tuple[int, int] = (8, 6),
) -> None:
"""Plot CATE calibration: predicted vs. actual treatment effects.
Uses the doubly-robust AIPW estimator within bins of tau_hat
to estimate the actual CATE for each bin.
Args:
tau_hat: Estimated CATEs (n,).
Y: Observed outcomes (n,).
D: Treatment indicators (n,).
e: Propensity scores P(D=1|X) (n,).
n_bins: Number of calibration bins.
figsize: Figure size.
"""
e = np.clip(e, 0.01, 0.99)
# Bin by predicted CATE
bins = pd.qcut(tau_hat, n_bins, labels=False, duplicates="drop")
predicted_means = []
actual_means = []
for b in sorted(np.unique(bins)):
mask = bins == b
predicted_means.append(tau_hat[mask].mean())
# AIPW estimator within bin
Y_b, D_b, e_b = Y[mask], D[mask], e[mask]
aipw = (
D_b * Y_b / e_b - (1 - D_b) * Y_b / (1 - e_b)
).mean()
actual_means.append(aipw)
fig, ax = plt.subplots(figsize=figsize)
ax.scatter(predicted_means, actual_means, s=80, zorder=5)
lims = [
min(min(predicted_means), min(actual_means)) - 0.5,
max(max(predicted_means), max(actual_means)) + 0.5,
]
ax.plot(lims, lims, "--", color="gray", label="Perfect calibration")
ax.set_xlabel("Predicted CATE")
ax.set_ylabel("Actual CATE (AIPW)")
ax.set_title("CATE Calibration Plot")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Common Misconception: "If my uplift model has a higher Qini coefficient than the baseline, it is well-calibrated." The Qini curve evaluates ranking — whether the model correctly orders individuals by treatment benefit. A model can rank perfectly while being poorly calibrated (e.g., estimating $\hat{\tau} = 10$ when the true effect is $\tau = 1$, but preserving the correct ordering). For decisions that depend on the magnitude of the treatment effect (e.g., cost-benefit analysis), calibration matters. For decisions that depend only on the ranking (e.g., treat the top-$k$), the Qini coefficient suffices.
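The ranking-versus-calibration distinction can be made concrete in a few lines. In this sketch (values are illustrative), scaling every CATE estimate by 10 leaves the ordering, and hence any rank-based metric like the Qini coefficient, unchanged while destroying the magnitudes a cost-benefit decision would rely on:

```python
import numpy as np

rng = np.random.RandomState(0)
tau_true = rng.uniform(0, 1, 1000)        # pretend these are true CATEs
tau_hat = 10.0 * tau_true                 # miscalibrated by a factor of 10

# Ranking is identical, so Qini/AUUC (which depend only on the ordering) agree
same_order = np.array_equal(np.argsort(tau_hat), np.argsort(tau_true))

# Calibration is terrible: mean absolute error is ~9x the mean true effect
mag_error = np.mean(np.abs(tau_hat - tau_true))
print(same_order, round(mag_error, 2))
```

If the treatment costs $c = 0.8$ per unit, the miscalibrated model would recommend treating nearly everyone ($\hat{\tau} > 0.8$ almost surely) even though only a fifth of the population truly clears the bar.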
19.7 EconML: The Production Framework
EconML (Microsoft Research) is the most comprehensive library for causal ML in Python. It implements meta-learners, DML, causal forests, doubly-robust learners, and instrumental variable methods under a unified API.
The EconML Workflow
from econml.dml import CausalForestDML, LinearDML
from econml.metalearners import SLearner as EconSLearner
from econml.metalearners import TLearner as EconTLearner
from econml.metalearners import XLearner as EconXLearner
from econml.cate_interpreter import SingleTreeCateInterpreter
def econml_full_pipeline(
X: np.ndarray,
D: np.ndarray,
Y: np.ndarray,
feature_names: Optional[List[str]] = None,
) -> Dict[str, object]:
"""Complete EconML pipeline for CATE estimation.
Estimates CATEs using multiple methods, evaluates them,
and produces interpretable summaries.
Args:
X: Covariates (n, p).
D: Treatment indicator (n,).
Y: Observed outcome (n,).
feature_names: Names for covariates.
Returns:
Dictionary with fitted models, estimates, and diagnostics.
"""
if feature_names is None:
feature_names = [f"X{i}" for i in range(X.shape[1])]
# --- Method 1: Causal Forest ---
cf = CausalForestDML(
model_y=GradientBoostingRegressor(
n_estimators=200, max_depth=4, random_state=42
),
model_t=GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
),
n_estimators=500,
min_samples_leaf=10,
random_state=42,
)
cf.fit(Y, D, X=X)
# CATE estimates and CIs
tau_cf = cf.effect(X).flatten()
lb_cf, ub_cf = cf.effect_interval(X, alpha=0.05)
# --- Method 2: LinearDML ---
ldml = LinearDML(
model_y=GradientBoostingRegressor(
n_estimators=200, max_depth=4, random_state=42
),
model_t=GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
),
random_state=42,
)
ldml.fit(Y, D, X=X)
tau_ldml = ldml.effect(X).flatten()
# --- ATE and summary ---
ate_cf = cf.ate(X)
ate_inference_cf = cf.ate_inference(X)
# --- Interpretable summary: single tree approximation ---
interpreter = SingleTreeCateInterpreter(
include_model_uncertainty=True,
max_depth=3,
)
interpreter.interpret(cf, X)
# --- Feature importances ---
importances = cf.feature_importances_
return {
"causal_forest": cf,
"linear_dml": ldml,
"tau_cf": tau_cf,
"tau_ldml": tau_ldml,
"ci_lower": lb_cf.flatten(),
"ci_upper": ub_cf.flatten(),
"ate": ate_cf,
"ate_inference": ate_inference_cf,
"interpreter": interpreter,
"feature_importances": dict(zip(feature_names, importances)),
}
EconML's Unified Inference API
One of EconML's strengths is its consistent inference API across methods:
def demonstrate_econml_inference(
model: CausalForestDML,
X: np.ndarray,
) -> None:
"""Demonstrate EconML's unified inference capabilities.
Args:
model: A fitted EconML model.
X: Covariates for inference.
"""
# Point estimate
tau = model.effect(X)
print(f"CATE estimates shape: {tau.shape}")
# Confidence intervals
lb, ub = model.effect_interval(X, alpha=0.05)
print(f"95% CI width (median): {np.median(ub - lb):.3f}")
# Average treatment effect
ate = model.ate(X)
ate_inf = model.ate_inference(X)
print(f"ATE: {ate:.4f}")
print(f"ATE 95% CI: [{ate_inf.conf_int()[0][0]:.4f}, {ate_inf.conf_int()[1][0]:.4f}]")
print(f"ATE p-value: {ate_inf.pvalue()[0]:.4e}")
    # Constant marginal effect of the treatment, averaged over X
    # (not a per-feature effect on the CATE)
    marginal = model.const_marginal_ate(X)
print(f"Marginal CATE effects shape: {marginal.shape}")
Building a Treatment Policy with EconML
EconML provides SingleTreePolicyInterpreter for constructing interpretable treatment policies:
from econml.cate_interpreter import SingleTreePolicyInterpreter
def build_targeting_policy_econml(
model: CausalForestDML,
X: np.ndarray,
treatment_cost: float = 0.0,
feature_names: Optional[List[str]] = None,
max_depth: int = 3,
) -> SingleTreePolicyInterpreter:
"""Build an interpretable targeting policy from a causal forest.
The policy is a shallow decision tree that approximates the
optimal treatment rule: treat if tau(x) > cost.
Args:
model: Fitted CausalForestDML.
X: Covariates (n, p).
treatment_cost: Per-unit treatment cost.
feature_names: Names for covariates.
max_depth: Maximum depth of the policy tree.
Returns:
Fitted policy interpreter.
"""
if feature_names is None:
feature_names = [f"X{i}" for i in range(X.shape[1])]
    # Treatment cost enters through sample_treatment_costs in interpret()
    # below; risk_level controls CI-based conservatism, not cost
    policy = SingleTreePolicyInterpreter(max_depth=max_depth)
policy.interpret(model, X, sample_treatment_costs=treatment_cost)
# The policy tree can be exported for inspection
print("Policy tree structure:")
print(f" Recommend treatment for {policy.treat(X).mean():.1%} of population")
return policy
Implementation Note: EconML distinguishes between $X$ (effect modifiers — variables that may modify the treatment effect) and $W$ (confounders — variables needed for identification but not expected to modify the effect). When you specify both $X$ and $W$, EconML conditions on $W$ for identification but estimates $\tau(X)$ as a function of $X$ only. This distinction matters: including too many variables in $X$ increases variance (curse of dimensionality for the CATE function), while putting everything in $W$ assumes a constant treatment effect. The art is choosing which variables should modify the treatment effect — clinician judgment, domain knowledge, or a preliminary variable-importance analysis.
19.8 Putting It All Together: The MediCore Personalized Treatment Pipeline
We now combine the methods of this chapter in a realistic pipeline for MediCore's personalized treatment allocation.
from econml.dml import CausalForestDML
from sklearn.model_selection import train_test_split
def medicore_personalized_treatment_pipeline(
n_patients: int = 20000,
seed: int = 42,
) -> Dict[str, object]:
"""End-to-end personalized treatment pipeline for MediCore.
Estimates CATEs for Drug X, identifies subgroups with largest
effects, and builds a targeting policy.
Args:
n_patients: Number of patients in the observational dataset.
seed: Random seed.
Returns:
Pipeline results including CATE estimates, subgroup analysis,
and targeting policy evaluation.
"""
rng = np.random.RandomState(seed)
# --- Simulate patient data ---
# Clinical features
age = rng.normal(65, 10, n_patients)
bmi = rng.normal(28, 5, n_patients)
egfr = rng.normal(70, 20, n_patients) # Renal function (mL/min)
hba1c = rng.normal(6.5, 1.2, n_patients) # Glycemic control
systolic_bp = rng.normal(145, 20, n_patients)
num_comorbidities = rng.poisson(2, n_patients)
genetic_marker = rng.binomial(1, 0.3, n_patients) # 30% prevalence
prior_readmission = rng.binomial(1, 0.2, n_patients)
medication_adherence = rng.beta(3, 2, n_patients) # 0-1 scale
social_support = rng.normal(0, 1, n_patients)
X = np.column_stack([
age, bmi, egfr, hba1c, systolic_bp,
num_comorbidities, genetic_marker, prior_readmission,
medication_adherence, social_support,
])
feature_names = [
"age", "bmi", "egfr", "hba1c", "systolic_bp",
"num_comorbidities", "genetic_marker", "prior_readmission",
"medication_adherence", "social_support",
]
# True CATE: Drug X helps patients with the genetic marker
# and adequate renal function. Harms patients with poor renal function.
tau_true = (
0.08 * genetic_marker # Large benefit for marker carriers
+ 0.003 * (egfr - 60) # Benefit increases with renal function
- 0.001 * (age - 65) # Slight decrease for older patients
+ 0.02 * medication_adherence # More benefit with adherence
- 0.04 # Baseline (small average effect)
)
# Confounded treatment assignment: sicker patients more likely to get drug
propensity = 1 / (1 + np.exp(-(
0.03 * (systolic_bp - 145) + 0.5 * prior_readmission
+ 0.3 * genetic_marker - 0.01 * egfr
)))
D = rng.binomial(1, propensity)
# Potential outcomes
Y0 = (
0.25 - 0.002 * egfr + 0.003 * age + 0.05 * prior_readmission
+ 0.02 * num_comorbidities + rng.normal(0, 0.15, n_patients)
)
Y1 = Y0 + tau_true
Y = D * Y1 + (1 - D) * Y0
# --- Split data ---
train_mask = np.zeros(n_patients, dtype=bool)
train_mask[:int(0.7 * n_patients)] = True
rng.shuffle(train_mask)
test_mask = ~train_mask
# --- Fit causal forest ---
cf = CausalForestDML(
model_y=GradientBoostingRegressor(
n_estimators=200, max_depth=4, random_state=seed
),
model_t=GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=seed
),
n_estimators=500,
min_samples_leaf=20,
random_state=seed,
)
cf.fit(Y[train_mask], D[train_mask], X=X[train_mask])
# --- Evaluate ---
tau_hat = cf.effect(X[test_mask]).flatten()
tau_test = tau_true[test_mask]
rmse = np.sqrt(np.mean((tau_hat - tau_test) ** 2))
correlation = np.corrcoef(tau_hat, tau_test)[0, 1]
# --- Subgroup analysis ---
# By genetic marker
marker_pos = X[test_mask][:, 6] == 1
ate_marker_pos = tau_hat[marker_pos].mean()
ate_marker_neg = tau_hat[~marker_pos].mean()
# --- Targeting policy ---
# Treat only patients with estimated positive effect
policy_assignments = (tau_hat > 0).astype(int)
policy_value = np.sum(policy_assignments * tau_test)
treat_all_value = np.sum(tau_test)
return {
"rmse": rmse,
"correlation": correlation,
"ate_marker_positive": ate_marker_pos,
"ate_marker_negative": ate_marker_neg,
"policy_fraction_treated": policy_assignments.mean(),
"policy_value": policy_value,
"treat_all_value": treat_all_value,
"value_improvement": (policy_value - treat_all_value) / abs(treat_all_value),
}
19.9 Progressive Project M7: Causal Forests for StreamRec Targeting
This milestone estimates heterogeneous recommendation effects for StreamRec and builds a targeting policy that recommends items only when they cause genuine engagement.
Task 1: Simulate StreamRec Data with Heterogeneous Effects
def simulate_streamrec_heterogeneous(
n_users: int = 30000,
seed: int = 42,
) -> pd.DataFrame:
"""Simulate StreamRec user data with heterogeneous recommendation effects.
Treatment effect heterogeneity:
- New users (low tenure) benefit most from recommendations (discovery).
- Power users (high tenure) benefit least (already find content organically).
    - Users in niche categories benefit more (harder to discover niche content).
    - Treatment effect depends on time of day (commute vs. evening browsing).

    Args:
        n_users: Number of users.
        seed: Random seed.

    Returns:
        DataFrame with user features, treatment, outcomes, and true CATEs.
    """
    rng = np.random.RandomState(seed)

    # User features
    tenure_months = rng.exponential(12, n_users)
    daily_sessions = rng.poisson(3, n_users) + 1
    category_diversity = rng.beta(2, 5, n_users)  # Low = niche, high = broad
    subscription_tier = rng.choice([0, 1, 2], n_users, p=[0.5, 0.35, 0.15])
    age = rng.normal(32, 10, n_users).clip(18, 70)
    hour_of_day = rng.choice(24, n_users)
    device_type = rng.choice([0, 1], n_users, p=[0.6, 0.4])  # 0=mobile, 1=desktop
    content_completion_rate = rng.beta(3, 2, n_users)

    # True CATE: tau(x) = recommendation effect on engagement (minutes)
    commute_hours = ((hour_of_day >= 7) & (hour_of_day <= 9)) | \
                    ((hour_of_day >= 17) & (hour_of_day <= 19))
    tau_true = (
        3.0 * np.exp(-tenure_months / 6)     # New users benefit most
        + 2.0 * (1 - category_diversity)     # Niche users benefit
        + 1.5 * commute_hours.astype(float)  # Commute time = receptive
        + 0.5 * subscription_tier            # Premium users slightly more receptive
        - 0.5                                # Baseline offset
    )

    # Confounding: algorithm recommends more to engaged users
    propensity = 1 / (1 + np.exp(-(
        0.3 * np.log1p(daily_sessions) + 0.5 * content_completion_rate
        - 0.02 * tenure_months + 0.2 * subscription_tier
    )))
    D = rng.binomial(1, propensity)

    # Potential outcomes
    Y0 = (
        5.0 + 0.5 * np.log1p(tenure_months) + 2.0 * content_completion_rate
        + 0.3 * daily_sessions + rng.normal(0, 2.0, n_users)
    ).clip(0, None)
    Y1 = (Y0 + tau_true).clip(0, None)
    Y = D * Y1 + (1 - D) * Y0

    return pd.DataFrame({
        "tenure_months": tenure_months,
        "daily_sessions": daily_sessions,
        "category_diversity": category_diversity,
        "subscription_tier": subscription_tier,
        "age": age,
        "hour_of_day": hour_of_day,
        "device_type": device_type,
        "content_completion_rate": content_completion_rate,
        "treatment": D,
        "outcome": Y,
        "tau_true": tau_true,
        "Y0": Y0,
        "Y1": Y1,
    })
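Before fitting any estimator, it is worth confirming on a toy example why the naive treated-minus-control gap fails on data like this. The sketch below is not the StreamRec generator; it is a one-confounder simplification in which engagement raises both the chance of being treated and the baseline outcome, so the naive contrast overstates a true effect of 1.0:

```python
import numpy as np

# One-confounder toy (a simplification of the StreamRec setup, not the same DGP):
# engagement raises both the treatment propensity and the baseline outcome.
rng = np.random.RandomState(0)
n = 50_000
engagement = rng.normal(0, 1, n)              # confounder
propensity = 1 / (1 + np.exp(-engagement))    # engaged users get treated more
D = rng.binomial(1, propensity)
tau = 1.0                                     # true (constant) effect
Y = 2.0 * engagement + tau * D + rng.normal(0, 1, n)

# The naive contrast absorbs the engagement gap between the two groups.
naive = Y[D == 1].mean() - Y[D == 0].mean()
```

The naive gap lands well above the true effect of 1.0, which is exactly the direction of bias the StreamRec propensity model builds in: more engaged users are both more treated and more active at baseline.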
Task 2: Estimate CATEs and Build Targeting Policy
def streamrec_causal_analysis(
    df: pd.DataFrame,
) -> Dict[str, object]:
    """Full causal ML analysis for StreamRec.

    Estimates CATEs using causal forests, identifies subgroups
    with the largest effects, and builds a targeting policy.

    Args:
        df: StreamRec DataFrame from simulate_streamrec_heterogeneous.

    Returns:
        Analysis results.
    """
    feature_cols = [
        "tenure_months", "daily_sessions", "category_diversity",
        "subscription_tier", "age", "hour_of_day", "device_type",
        "content_completion_rate",
    ]
    X = df[feature_cols].values
    D = df["treatment"].values
    Y = df["outcome"].values
    tau_true = df["tau_true"].values

    # Train-test split
    n_train = int(0.7 * len(df))
    X_tr, D_tr, Y_tr = X[:n_train], D[:n_train], Y[:n_train]
    X_te = X[n_train:]
    tau_te = tau_true[n_train:]

    # Fit causal forest
    cf = CausalForestDML(
        model_y=GradientBoostingRegressor(
            n_estimators=200, max_depth=4, random_state=42
        ),
        model_t=GradientBoostingClassifier(
            n_estimators=100, max_depth=3, random_state=42
        ),
        discrete_treatment=True,  # Binary treatment: model_t is a classifier
        n_estimators=500,
        min_samples_leaf=20,
        random_state=42,
    )
    cf.fit(Y_tr, D_tr, X=X_tr)
    tau_hat = cf.effect(X_te).flatten()

    # Subgroup analysis
    df_test = df.iloc[n_train:].copy()
    df_test["tau_hat"] = tau_hat
    df_test["tau_true_val"] = tau_te

    # New vs. established users
    new_users = df_test["tenure_months"] < 6
    established = df_test["tenure_months"] >= 24
    subgroups = {
        "new_users": {
            "tau_hat_mean": tau_hat[new_users.values].mean(),
            "tau_true_mean": tau_te[new_users.values].mean(),
        },
        "established_users": {
            "tau_hat_mean": tau_hat[established.values].mean(),
            "tau_true_mean": tau_te[established.values].mean(),
        },
    }

    # Targeting policy: recommend only when the estimated effect exceeds the
    # cost of a recommendation slot (here, 1 minute of engagement). Policy
    # values are evaluated against the simulated ground-truth CATEs, net of
    # that cost, so a selective policy can beat treat-all.
    slot_cost = 1.0
    policy = (tau_hat > slot_cost).astype(int)
    policy_value = np.sum(policy * (tau_te - slot_cost))
    treat_all_value = np.sum(tau_te - slot_cost)
    treat_none_value = 0.0

    # Feature importances for treatment effect heterogeneity
    importances = dict(zip(feature_cols, cf.feature_importances_))

    return {
        "rmse": np.sqrt(np.mean((tau_hat - tau_te) ** 2)),
        "correlation": np.corrcoef(tau_hat, tau_te)[0, 1],
        "subgroups": subgroups,
        "feature_importances": importances,
        "policy_fraction_treated": policy.mean(),
        "policy_value": policy_value,
        "treat_all_value": treat_all_value,
        "treat_none_value": treat_none_value,
        "improvement_over_treat_all": (
            (policy_value - treat_all_value) / abs(treat_all_value)
        ),
    }
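The policy comparison above leans on the simulated `tau_true`, which no real deployment observes. When the propensity $e(x)$ is known (an RCT, or logged randomization), the transformed outcome $Y^* = Y \cdot (D - e(x)) / (e(x)(1 - e(x)))$ has conditional mean $\tau(x)$, so averaging $\pi(x) \cdot Y^*$ gives an unbiased per-user value for any policy $\pi$. A self-contained sketch with a made-up one-feature DGP:

```python
import numpy as np

# Value a targeting policy without observing individual treatment effects.
rng = np.random.RandomState(1)
n = 200_000
x = rng.uniform(0, 1, n)
tau = 2.0 * x - 0.5                   # heterogeneous effect, negative for x < 0.25
e = 0.5                               # known propensity (RCT)
D = rng.binomial(1, e, n)
Y = x + tau * D + rng.normal(0, 1, n)

# Transformed outcome: E[y_star | x] = tau(x)
y_star = Y * (D - e) / (e * (1 - e))
policy = (x > 0.25).astype(int)       # treat only where tau(x) > 0

est_policy_value = np.mean(policy * y_star)   # feasible estimate
true_policy_value = np.mean(policy * tau)     # oracle benchmark
```

The transformed-outcome estimate tracks the oracle value closely at this sample size; with estimated propensities you would plug in $\hat{e}(x)$ and add overlap trimming.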
Task 3: Report and Interpret
The analysis should produce:
- CATE distribution: A histogram of $\hat{\tau}(x)$ showing the range of estimated treatment effects. The distribution is unlikely to be symmetric — expect a right skew if most users benefit at least slightly from recommendations.
- Subgroup comparison: New users ($< 6$ months tenure) should have significantly higher CATEs than established users ($> 24$ months). This directly informs the product team's allocation of recommendation slots.
- Feature importances: Tenure and category diversity should emerge as the top drivers of treatment effect heterogeneity — they determine who benefits, not who engages.
- Targeting policy value: The selective policy (recommend only when $\hat{\tau} > c$) should outperform treat-all by reallocating recommendation slots from users who would have engaged anyway to users where the recommendation creates genuine value.
19.10 When to Use Which Method
| Situation | Recommended method | Why |
|---|---|---|
| RCT data, moderate $n$, want interpretable CATE | Causal forest | Valid CIs, feature importances, no parametric assumptions |
| Observational data, high-dimensional confounders | CausalForestDML | Combines DML debiasing with causal forest CATE estimation |
| Want constant ATE with ML confounding adjustment | Basic DML | Parametric rate ($\sqrt{n}$), valid CIs, robust to first-stage errors |
| Imbalanced treatment, binary outcome | X-learner | Adapts to treatment group imbalance |
| Business targeting with RCT data | Uplift model (transformed outcome) | Simple, directly actionable ranking |
| Want linear CATE in high dimensions | SparseLinearDML | LASSO-regularized CATE with valid inference |
| Need to explain CATE to stakeholders | SingleTreePolicyInterpreter | Interpretable rules from any CATE model |
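A cheap stand-in for tree-based policy interpretation, useful when econml's `SingleTreePolicyInterpreter` is not available, is to fit a shallow tree to the CATE estimates themselves and read the splits as rules. The depth-1 case is a single threshold search, sketched below in plain NumPy on synthetic CATE estimates (the tenure DGP and noise level are invented for illustration):

```python
import numpy as np

# Depth-1 "policy tree": scan thresholds on one feature and keep the split
# that minimizes within-group variance of the estimated CATEs.
rng = np.random.RandomState(2)
n = 5_000
tenure = rng.exponential(12, n)                               # months, synthetic
tau_hat = 3.0 * np.exp(-tenure / 6) + rng.normal(0, 0.2, n)   # noisy CATE estimates

best_c, best_sse = None, np.inf
for c in np.quantile(tenure, np.linspace(0.05, 0.95, 50)):
    left, right = tau_hat[tenure <= c], tau_hat[tenure > c]
    sse = left.size * left.var() + right.size * right.var()   # within-group SSE
    if sse < best_sse:
        best_c, best_sse = c, sse

rule = f"recommend aggressively if tenure_months <= {best_c:.1f}"
```

A real interpreter does this greedily over all features and depths; the point is that the output is a rule a product team can act on, not a vector of 5,000 individual estimates.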
What Can Go Wrong
- Positivity violations. If some covariate strata lack treated or control observations, CATE estimates in those regions are extrapolations. Causal forests partially address this (they will not split into empty regions), but near-violations still cause high variance.
- Unmeasured confounding. All methods in this chapter assume conditional ignorability (no unmeasured confounders). DML debiases against observed confounders; it cannot fix unmeasured ones. Sensitivity analysis (Chapter 18, Cinelli and Hazlett framework) should accompany every CATE analysis.
- Overfitting to noise. CATE estimation is inherently noisier than outcome prediction because the signal (treatment effect variation) is typically much smaller than the noise (outcome variation). Honest splitting (causal forests) and cross-fitting (DML, R-learner) help, but in small samples, estimated heterogeneity may be artifactual.
- SUTVA violations. If one user's recommendation affects another user's engagement (e.g., through social sharing), the potential outcomes framework breaks down. Network effects require specialized methods (Chapter 33 discusses interference in experiments).
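The overfitting concern is easy to demonstrate. In the sketch below the true effect is exactly 1.0 for everyone, yet subgroup difference-in-means estimates over ten arbitrary bins spread widely, which is pure noise masquerading as heterogeneity (bin count, sample size, and noise scale are all invented for illustration):

```python
import numpy as np

# Constant true effect, small sample: subgroup estimates still "find" heterogeneity.
rng = np.random.RandomState(3)
n = 500
x_bin = rng.randint(0, 10, n)                   # ten arbitrary subgroups
D = rng.binomial(1, 0.5, n)                     # randomized treatment
Y = 1.0 * D + rng.normal(0, 3, n)               # tau(x) = 1.0 everywhere

bin_effects = np.array([
    Y[(x_bin == b) & (D == 1)].mean() - Y[(x_bin == b) & (D == 0)].mean()
    for b in range(10)
])
spread = bin_effects.max() - bin_effects.min()  # spurious spread around 1.0
```

At roughly 50 users per bin, the bin-level standard error is near 0.85, so a gap of two or more between the "best" and "worst" subgroup is expected even with zero true heterogeneity.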
Prediction $\neq$ Causation: This chapter's methods estimate $Y(1) - Y(0)$, not $Y$. A model that predicts $Y$ almost perfectly may be useless for treatment targeting, and a model that predicts $Y$ poorly may be excellent at identifying treatment effect heterogeneity. The evaluation metrics, the loss functions, and the interpretation are all different. Conflating predictive and causal tasks remains the most common error in applied data science.
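A small simulation makes the distinction concrete. In this invented DGP, `x1` drives the outcome level and `x2` drives the treatment effect; a predictor built on `x1` looks excellent by predictive metrics yet targets no one worth treating:

```python
import numpy as np

# x1 predicts Y; x2 predicts Y(1) - Y(0). These are different jobs.
rng = np.random.RandomState(4)
n = 100_000
x1 = rng.normal(0, 1, n)
x2 = rng.binomial(1, 0.5, n)
D = rng.binomial(1, 0.5, n)
tau = 2.0 * x2                          # effect exists only where x2 = 1
Y = 5.0 * x1 + tau * D + rng.normal(0, 1, n)

corr_pred = np.corrcoef(x1, Y)[0, 1]    # x1 is a near-perfect predictor of Y

# Per-user value of treating the half of users selected by each feature:
value_target_x2 = np.mean((x2 == 1) * tau)   # captures all of the effect
value_target_x1 = np.mean((x1 > 0) * tau)    # no better than a random half
```

Both rules treat about half the population, but only the `x2` rule concentrates treatment where the effect lives.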
Chapter Summary
This chapter bridged causal inference and machine learning. Four families of methods estimate CATEs:
- Meta-learners (S, T, X, R) repurpose standard ML models for causal estimation. The S-learner is simplest but suffers from regularization bias. The T-learner avoids this but wastes data. The X-learner adapts to imbalanced treatment groups. The R-learner directly targets the CATE through a Neyman-orthogonal loss, providing robustness to nuisance estimation errors.
- Causal forests modify random forests to split on treatment effect heterogeneity rather than prediction accuracy. Honesty (separate splitting and estimation samples) provides valid asymptotic confidence intervals — a property unique among ML-based CATE estimators.
- Double/debiased machine learning enables rigorous causal inference with high-dimensional confounders. The key innovation — Neyman orthogonality combined with cross-fitting — ensures that regularization bias in nuisance estimation does not leak into the causal parameter estimate. The causal estimate achieves the parametric $\sqrt{n}$ convergence rate even when nuisance functions are estimated by slow-converging ML methods.
- Uplift modeling translates CATE estimation into business targeting. The Qini curve evaluates uplift models without observing individual treatment effects. Targeting policies allocate treatment to individuals where the causal effect is largest, outperforming both treat-all and predict-then-treat strategies.
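The DML recipe compresses into a dozen lines of NumPy. The sketch below uses cubic polynomial fits as stand-ins for the ML nuisance learners and a single confounder; it is the partialling-out estimator with 2-fold cross-fitting, not econml's implementation:

```python
import numpy as np

# Partialling-out DML with 2-fold cross-fitting. Cubic polynomials stand in
# for the ML nuisance learners m(x) = E[Y|x] and e(x) = E[D|x].
rng = np.random.RandomState(5)
n = 20_000
x = rng.uniform(-2, 2, n)
D = np.sin(x) + rng.normal(0, 1, n)      # continuous treatment, confounded by x
theta = 1.5                              # true constant effect
Y = theta * D + x ** 2 + rng.normal(0, 1, n)

folds = np.arange(n) % 2
res_Y, res_D = np.empty(n), np.empty(n)
for k in (0, 1):
    tr, te = folds != k, folds == k      # fit nuisances off-fold, residualize on-fold
    res_Y[te] = Y[te] - np.polyval(np.polyfit(x[tr], Y[tr], 3), x[te])
    res_D[te] = D[te] - np.polyval(np.polyfit(x[tr], D[tr], 3), x[te])

# Final stage: regress outcome residuals on treatment residuals.
theta_hat = np.sum(res_Y * res_D) / np.sum(res_D ** 2)
```

Swapping the polynomial fits for gradient boosting or random forests changes nothing structural; cross-fitting is what keeps nuisance overfitting out of `theta_hat`.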
The central message: standard ML predicts $Y$; causal ML predicts $Y(1) - Y(0)$. This distinction reshapes the loss function, the evaluation metric, the feature importance interpretation, and the business decision. Every model that informs an intervention — a drug prescription, a marketing campaign, a recommendation — should be evaluated on its causal effect, not its predictive accuracy.
Chapter 20 shifts from "what is the effect" to "what do we believe about the effect" — Bayesian inference provides a principled framework for incorporating prior knowledge, quantifying uncertainty, and updating beliefs as data accumulates.