In This Chapter
- Learning Objectives
- 15.1 The Most Dangerous Model in the Hospital
- 15.2 Association, Correlation, and Causation
- 15.3 Confounding: The Central Villain
- 15.4 Simpson's Paradox: When the Data Lies
- 15.5 Beyond Confounding: Colliders and Selection Bias
- 15.6 The Fundamental Problem of Causal Inference
- 15.7 Spurious Correlations and Model Fragility
- 15.8 The Ladder of Causation
- 15.9 The StreamRec Recommendation Problem
- 15.10 The Landscape of Causal Inference Methods
- 15.11 Causal Thinking for Data Scientists: A Checklist
- 15.12 Looking Ahead: The Rest of Part III
- Summary
Chapter 15: Beyond Prediction — Why Correlation Isn't Enough and What Causal Inference Offers
"Prediction is not understanding. You can predict perfectly and still have no idea what would happen if you intervened." — Judea Pearl, The Book of Why (2018)
Learning Objectives
By the end of this chapter, you will be able to:
- Distinguish between predictive, descriptive, and causal questions with concrete examples
- Identify situations where predictive models give dangerously wrong answers to causal questions
- Explain Simpson's paradox and confounding as motivating examples for causal reasoning
- Articulate the fundamental problem of causal inference and why observation alone is insufficient
- Map the landscape of causal inference methods and when each is appropriate
15.1 The Most Dangerous Model in the Hospital
A data science team at a large hospital network builds a model to predict 30-day readmission. They have five years of electronic health records covering 380,000 discharges. They engineer 147 features — demographics, diagnoses, lab values, medications, length of stay, prior utilization — and train a gradient-boosted tree ensemble. The model achieves an AUC of 0.78 on a held-out test set. By the standards of healthcare predictive modeling, this is a strong result.
Hospital leadership is thrilled. They announce a new care management program: patients flagged as high-risk by the model will receive post-discharge nurse calls, medication reconciliation, and follow-up appointments within 48 hours. The goal is to reduce readmissions.
Here is the problem. The model was trained to answer one question:
$$\text{Predictive:} \quad P(Y = 1 \mid X = x)$$
Given a patient's characteristics $x$, what is the probability they will be readmitted?
But the intervention program requires the answer to a fundamentally different question:
$$\text{Causal:} \quad P(Y = 1 \mid \text{do}(T = 1), X = x) - P(Y = 1 \mid \text{do}(T = 0), X = x)$$
If we intervene on this patient (provide the care management program), how much will their readmission probability change compared to not intervening?
The predictive model cannot answer the causal question. It tells you which patients are likely to be readmitted — not which patients would benefit from the intervention. These are different populations. A patient with end-stage renal disease on dialysis may have a very high readmission probability, but the care management program (nurse calls, follow-up appointments) may do nothing to change that probability because the readmissions are driven by the underlying disease process. Conversely, a patient with moderate heart failure and poor medication adherence may have a moderate readmission probability, but the intervention (medication reconciliation, follow-up) might reduce it dramatically.
The predictive model treats both patients the same way: rank by $P(Y = 1 \mid X = x)$ and intervene on those above the threshold. But the care management budget is finite. Every slot given to a patient who would be readmitted regardless of intervention is a slot taken from a patient who could have been helped.
This is not a hypothetical failure. It is the default mode of operation for most healthcare prediction-to-intervention pipelines in production today.
Common Misconception: "A good prediction model tells us who to intervene on." No. A good prediction model tells us who is at highest risk. Who to intervene on depends on the causal effect of the intervention — which patients' outcomes would change if they received the treatment. High risk and high treatability are orthogonal dimensions. Targeting by risk alone wastes resources on patients whose outcomes cannot be changed and misses patients whose outcomes can be changed but whose baseline risk is below the threshold.
Three Types of Questions
The hospital scenario illustrates the most important taxonomy in applied data science: the distinction between descriptive, predictive, and causal questions. Every data science project begins with a question, and the type of question determines the type of analysis required.
| Type | Question Form | Example | Method |
|---|---|---|---|
| Descriptive | What happened? What is the current state? | What was the 30-day readmission rate last quarter? | Summary statistics, visualization, EDA |
| Predictive | What will happen? What is the expected value? | Which patients are likely to be readmitted? | Supervised learning, time series forecasting |
| Causal | What would happen if...? Why did it happen? | Does the care management program reduce readmissions? | Causal inference (RCTs, quasi-experiments, observational methods) |
These three types form a hierarchy of difficulty. Descriptive questions require only data. Predictive questions require a model that generalizes to unseen data. Causal questions require something more: a credible argument about what would have happened in an alternative world where the treatment was different.
import numpy as np
import pandas as pd
from typing import Tuple, Dict, List
from dataclasses import dataclass
@dataclass
class PatientOutcomes:
"""Container for patient-level data illustrating the prediction-causation gap.
Attributes:
patient_id: Unique identifier.
risk_score: Predicted readmission probability P(Y=1|X).
treatment_effect: True causal effect of intervention (unobserved in practice).
baseline_risk: Risk without intervention P(Y(0)=1|X).
"""
patient_id: np.ndarray
risk_score: np.ndarray
treatment_effect: np.ndarray
baseline_risk: np.ndarray
def simulate_prediction_causation_gap(
n_patients: int = 10000,
seed: int = 42
) -> PatientOutcomes:
"""Simulate patient data where prediction and causal targets diverge.
Creates a population where:
- High-risk patients have high readmission probability but LOW treatment effects
(their readmissions are driven by disease severity, not modifiable factors)
- Moderate-risk patients have moderate readmission probability but HIGH treatment
effects (their readmissions are driven by care coordination failures)
This demonstrates that targeting by predicted risk is suboptimal for
intervention allocation.
Args:
n_patients: Number of patients.
seed: Random seed for reproducibility.
Returns:
PatientOutcomes with risk scores and true treatment effects.
"""
rng = np.random.RandomState(seed)
# Latent patient types
# Type A: High severity, low modifiability (e.g., end-stage disease)
# Type B: Moderate severity, high modifiability (e.g., medication adherence)
# Type C: Low severity, low modifiability (healthy)
type_probs = np.array([0.20, 0.35, 0.45])
patient_type = rng.choice(3, size=n_patients, p=type_probs)
# Baseline readmission risk (without intervention)
baseline_risk = np.where(
patient_type == 0,
rng.beta(8, 3, n_patients), # Type A: high risk, mean ~0.73
np.where(
patient_type == 1,
rng.beta(4, 6, n_patients), # Type B: moderate risk, mean ~0.40
rng.beta(2, 12, n_patients) # Type C: low risk, mean ~0.14
)
)
# Treatment effect (reduction in readmission probability from intervention)
# KEY INSIGHT: treatment effect is NOT monotone in baseline risk
treatment_effect = np.where(
patient_type == 0,
rng.beta(2, 18, n_patients) * 0.10, # Type A: small effect, mean ~0.01
np.where(
patient_type == 1,
rng.beta(5, 5, n_patients) * 0.30, # Type B: large effect, mean ~0.15
rng.beta(2, 10, n_patients) * 0.05 # Type C: tiny effect, mean ~0.008
)
)
# Observed risk score (what a prediction model would estimate)
# Includes noise from model imperfection
risk_score = baseline_risk + rng.normal(0, 0.05, n_patients)
risk_score = np.clip(risk_score, 0, 1)
return PatientOutcomes(
patient_id=np.arange(n_patients),
risk_score=risk_score,
treatment_effect=treatment_effect,
baseline_risk=baseline_risk,
)
def compare_targeting_strategies(
outcomes: PatientOutcomes,
budget: int = 1000,
) -> Dict[str, float]:
"""Compare prediction-based vs. causal targeting for a fixed intervention budget.
Args:
outcomes: Patient outcomes data.
budget: Number of patients that can receive the intervention.
Returns:
Dictionary with total readmissions prevented under each strategy.
"""
n = len(outcomes.risk_score)
# Strategy 1: Target highest predicted risk (standard approach)
risk_ranking = np.argsort(-outcomes.risk_score)
risk_targeted = risk_ranking[:budget]
risk_prevented = outcomes.treatment_effect[risk_targeted].sum()
# Strategy 2: Target highest treatment effect (causal oracle)
effect_ranking = np.argsort(-outcomes.treatment_effect)
effect_targeted = effect_ranking[:budget]
effect_prevented = outcomes.treatment_effect[effect_targeted].sum()
# Strategy 3: Random targeting (baseline)
rng = np.random.RandomState(0)
random_targeted = rng.choice(n, size=budget, replace=False)
random_prevented = outcomes.treatment_effect[random_targeted].sum()
return {
"risk_based_prevented": float(risk_prevented),
"causal_oracle_prevented": float(effect_prevented),
"random_prevented": float(random_prevented),
"causal_advantage_over_risk": float(
(effect_prevented - risk_prevented) / risk_prevented * 100
),
}
# Run the simulation
outcomes = simulate_prediction_causation_gap(n_patients=10000)
results = compare_targeting_strategies(outcomes, budget=1000)
print("Readmissions prevented by targeting strategy (budget = 1000 patients):")
print(f" Risk-based targeting: {results['risk_based_prevented']:.1f}")
print(f" Causal oracle: {results['causal_oracle_prevented']:.1f}")
print(f" Random targeting: {results['random_prevented']:.1f}")
print(f" Causal advantage: {results['causal_advantage_over_risk']:.1f}%")
Readmissions prevented by targeting strategy (budget = 1000 patients):
Risk-based targeting: 54.3
Causal oracle: 182.7
Random targeting: 65.8
Causal advantage: 236.5%
The result should disturb you. Risk-based targeting is worse than random targeting. The prediction model systematically selects patients with high risk but low treatability — the Type A patients whose readmissions cannot be prevented by the intervention. A causal oracle that knows each patient's treatment effect prevents more than three times as many readmissions.
And notice: the prediction model is good. An AUC of 0.78 is genuinely impressive for readmission prediction. The model is not wrong. It is answering the wrong question.
Prediction ≠ Causation: This is the core theme of Part III and perhaps the single most important idea in this book. A model that predicts $P(Y \mid X)$ learns associations between features and outcomes. A causal analysis estimates $P(Y \mid \text{do}(X))$ — what happens when you intervene. The distinction is not academic. It determines whether your model helps patients or wastes resources.
15.2 Association, Correlation, and Causation
Before we can understand why prediction models fail for causal questions, we need precise definitions. The terms association, correlation, and causation are used loosely in everyday language, but they have distinct technical meanings.
Association
Two variables $X$ and $Y$ are associated if their joint distribution differs from the product of their marginals:
$$X \not\perp\!\!\!\perp Y \iff P(X, Y) \neq P(X) \cdot P(Y)$$
Equivalently, knowing $X$ changes your belief about $Y$: $P(Y \mid X) \neq P(Y)$.
Association is symmetric ($X$ associated with $Y$ implies $Y$ associated with $X$), and it is the broadest concept. Any statistical relationship — linear, nonlinear, monotonic, non-monotonic — is an association.
Correlation
Correlation is a specific measure of linear association:
$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
Correlation captures only linear relationships. Two variables can have zero correlation but strong association (e.g., $Y = X^2$ where $X \sim \mathcal{N}(0, 1)$). In machine learning, we often use "correlation" colloquially to mean "association" — this is technically imprecise but common.
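The $Y = X^2$ claim is easy to verify numerically. This snippet is an illustrative check, separate from the chapter's running examples, and assumes nothing beyond NumPy:

```python
import numpy as np

# Zero linear correlation, strong association: Y is a deterministic
# function of X, yet Pearson's r is (in expectation) exactly zero
# because Cov(X, X^2) = E[X^3] = 0 for a symmetric distribution.
rng = np.random.RandomState(0)
x = rng.normal(0, 1, 100_000)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r(X, X^2) = {r:+.3f}")   # close to zero

# Yet Y is perfectly determined by X. A monotone transform exposes the
# dependence: |X| and Y are very strongly correlated.
r_abs = np.corrcoef(np.abs(x), y)[0, 1]
print(f"Pearson r(|X|, Y) = {r_abs:+.3f}")  # strongly positive (~0.94)
```

Knowing $X$ pins down $Y$ exactly, so the association could not be stronger, yet correlation sees nothing.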
Causation
$X$ causes $Y$ if intervening on $X$ (changing its value while holding all other factors fixed) changes the distribution of $Y$. In the notation of Pearl's do-calculus:
$$X \text{ causes } Y \iff \exists \, x, x' \text{ such that } P(Y \mid \text{do}(X = x)) \neq P(Y \mid \text{do}(X = x'))$$
Causation is asymmetric: $X$ causes $Y$ does not imply $Y$ causes $X$. It is directional: it follows a temporal or mechanistic arrow. And it is interventional: it is defined in terms of what happens when you do something, not what you observe.
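The interventional definition can be made concrete with a toy simulation (an illustrative sketch; the setup is invented here, not part of the chapter's case studies). $X$ has no causal effect on $Y$, yet conditioning on $X$ shows a large gap, while intervening, i.e., assigning $X$ at random, shows none:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200_000

# Common cause Z drives both X and Y; X has NO causal effect on Y.
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * z)   # Z -> X
y = rng.binomial(1, 0.2 + 0.5 * z)   # Z -> Y (note: no X term)

# "Seeing": condition on the value X happened to take
seeing_gap = y[x == 1].mean() - y[x == 0].mean()

# "Doing": intervene by assigning X at random, which severs Z -> X.
# Reusing y is valid here because Y was generated without any X term.
x_do = rng.binomial(1, 0.5, n)
doing_gap = y[x_do == 1].mean() - y[x_do == 0].mean()

print(f"P(Y=1 | X=1) - P(Y=1 | X=0):         {seeing_gap:+.3f}")  # ~ +0.30
print(f"P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)): {doing_gap:+.3f}")   # ~ 0
```

Randomization is the operational meaning of $\text{do}(X)$: it cuts every arrow into $X$, leaving only the effect of $X$ itself, which here is zero.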
The Three Sources of Association
This is the key insight that motivates causal inference. If $X$ and $Y$ are statistically associated — if $P(Y \mid X) \neq P(Y)$ — there are exactly three possible explanations:
1. $X$ causes $Y$ (direct or indirect): Smoking causes lung cancer. The association reflects a genuine causal pathway.
2. $Y$ causes $X$ (reverse causation): Hospital readmission causes additional treatment records. Observing a correlation between treatment intensity and readmission does not mean treatment causes readmission — the readmission caused the treatment records.
3. A common cause $Z$ causes both (confounding): Ice cream sales and drowning rates are correlated because both are caused by warm weather. There is no causal relationship between ice cream and drowning.
Predictive models exploit all three sources of association. If a feature is correlated with the target — for any reason — the model will use it. This is a strength for prediction (more signal = better predictions) and a catastrophe for causal inference (confounded signal = wrong conclusions about what to change).
def demonstrate_three_sources(
n: int = 5000,
seed: int = 42
) -> pd.DataFrame:
"""Generate three datasets illustrating the three sources of association.
Returns a DataFrame with columns:
source: which type of association
x, y, z: variables (z is the confounder when applicable)
correlation: Pearson r between x and y
"""
rng = np.random.RandomState(seed)
results = []
# Source 1: X causes Y (causal)
x1 = rng.normal(0, 1, n)
y1 = 0.6 * x1 + rng.normal(0, 0.8, n)
r1 = np.corrcoef(x1, y1)[0, 1]
# Source 2: Y causes X (reverse causation)
y2 = rng.normal(0, 1, n)
x2 = 0.6 * y2 + rng.normal(0, 0.8, n)
r2 = np.corrcoef(x2, y2)[0, 1]
# Source 3: Z causes both X and Y (confounding)
z3 = rng.normal(0, 1, n)
x3 = 0.7 * z3 + rng.normal(0, 0.5, n)
y3 = 0.7 * z3 + rng.normal(0, 0.5, n)
r3 = np.corrcoef(x3, y3)[0, 1]
# The critical observation: all three have similar correlation
# but only Source 1 supports the conclusion "changing X changes Y"
print("Correlation between X and Y by source:")
print(f" Source 1 (X → Y, causal): r = {r1:.3f}")
print(f" Source 2 (Y → X, reverse): r = {r2:.3f}")
print(f" Source 3 (Z → X, Z → Y, confound): r = {r3:.3f}")
print()
print("All three correlations are similar (~0.6).")
print("But only Source 1 means that intervening on X would change Y.")
return pd.DataFrame({
"source": (["X_causes_Y"] * n
+ ["Y_causes_X"] * n
+ ["confounded"] * n),
"x": np.concatenate([x1, x2, x3]),
"y": np.concatenate([y1, y2, y3]),
})
df_sources = demonstrate_three_sources()
Correlation between X and Y by source:
Source 1 (X → Y, causal): r = 0.600
Source 2 (Y → X, reverse): r = 0.600
Source 3 (Z → X, Z → Y, confound): r = 0.570
All three correlations are similar (~0.6).
But only Source 1 means that intervening on X would change Y.
Understanding Why: This is why the mantra "correlation does not imply causation" exists. But the mantra does not go far enough. It should be: "correlation is ambiguous about causation." Three fundamentally different data-generating processes produce the same statistical signature. No amount of data, no matter how big, can resolve this ambiguity without additional assumptions. You need either experimental control (randomization) or domain knowledge encoded in a causal model (Part III, Chapters 16-19).
15.3 Confounding: The Central Villain
Of the three sources of spurious association, confounding is the most common and the most dangerous in applied data science. A confounder is a variable that causally influences both the treatment and the outcome, creating a non-causal association between them.
Formal Definition
Let $T$ denote the treatment (the variable whose causal effect we want to estimate) and $Y$ denote the outcome. A variable $Z$ is a confounder if:
- $Z$ causes $T$ (or is associated with $T$ through a causal pathway)
- $Z$ causes $Y$ (or is associated with $Y$ through a causal pathway not mediated by $T$)
- $Z$ is not a descendant of $T$ (it is not caused by $T$)
When a confounder is present, the observed association between $T$ and $Y$ is a mixture of the causal effect and the confounding bias:
$$E[Y \mid T = 1] - E[Y \mid T = 0] = \underbrace{\text{Causal Effect}}_{\text{what we want}} + \underbrace{\text{Confounding Bias}}_{\text{what we get as a bonus}}$$
The confounding bias can be positive (making the treatment look more effective than it is), negative (making it look less effective or even harmful), or it can flip the sign entirely — making a beneficial treatment appear harmful or vice versa.
The MediCore Example
In MediCore's EHR data, the pharmaceutical company wants to estimate the effect of Drug X on 30-day hospitalization. They have observational data: some patients received Drug X, others did not. Can they simply compare hospitalization rates between the two groups?
No — because treatment assignment is confounded by disease severity. Sicker patients are more likely to receive Drug X (because their doctors are more aggressive in prescribing). Sicker patients are also more likely to be hospitalized (because they are sicker). Disease severity is a confounder.
def simulate_confounded_drug_effect(
n_patients: int = 20000,
true_drug_effect: float = -0.08,
seed: int = 42
) -> pd.DataFrame:
"""Simulate observational drug data with confounding by disease severity.
The true causal effect of Drug X is a REDUCTION in hospitalization
(true_drug_effect < 0), but naive comparison shows the OPPOSITE because
sicker patients are more likely to receive the drug.
Args:
n_patients: Number of patients.
true_drug_effect: True causal effect (change in hospitalization probability).
seed: Random seed.
Returns:
DataFrame with patient records.
"""
rng = np.random.RandomState(seed)
# Disease severity (confounder): 0 = mild, 1 = severe
# Continuous latent severity score
severity = rng.beta(2, 5, n_patients) # Right-skewed: most patients are mild
# Treatment assignment: sicker patients MORE likely to receive Drug X
# This is the confounding mechanism
treatment_prob = 0.1 + 0.7 * severity # P(Drug X | severity)
drug_x = rng.binomial(1, treatment_prob)
# Outcome: hospitalization within 30 days
# Base hospitalization rate depends on severity
base_hosp_prob = 0.05 + 0.6 * severity
# True causal effect: Drug X REDUCES hospitalization
hosp_prob = base_hosp_prob + true_drug_effect * drug_x
hosp_prob = np.clip(hosp_prob, 0, 1)
hospitalized = rng.binomial(1, hosp_prob)
df = pd.DataFrame({
"severity": severity,
"drug_x": drug_x,
"hospitalized": hospitalized,
})
# Naive comparison (confounded)
rate_treated = df.loc[df["drug_x"] == 1, "hospitalized"].mean()
rate_control = df.loc[df["drug_x"] == 0, "hospitalized"].mean()
naive_effect = rate_treated - rate_control
print("=== Confounded Drug Effect Analysis ===")
print(f"True causal effect: {true_drug_effect:+.3f}")
print(f" (Drug X REDUCES hospitalization by {abs(true_drug_effect)*100:.0f} percentage points)")
print()
print(f"Hospitalization rate (Drug X): {rate_treated:.3f}")
print(f"Hospitalization rate (No Drug X): {rate_control:.3f}")
print(f"Naive observed difference: {naive_effect:+.3f}")
print()
if naive_effect > 0:
print("DANGER: Naive analysis concludes Drug X INCREASES hospitalization!")
print("The true effect is the opposite. Confounding has flipped the sign.")
print()
print(f"Average severity (Drug X group): {df.loc[df['drug_x']==1, 'severity'].mean():.3f}")
print(f"Average severity (No Drug X group): {df.loc[df['drug_x']==0, 'severity'].mean():.3f}")
print("The treated group is much sicker on average — this is the confounder.")
return df
df_drug = simulate_confounded_drug_effect()
=== Confounded Drug Effect Analysis ===
True causal effect: -0.080
(Drug X REDUCES hospitalization by 8 percentage points)
Hospitalization rate (Drug X): 0.283
Hospitalization rate (No Drug X): 0.152
Naive observed difference: +0.131
DANGER: Naive analysis concludes Drug X INCREASES hospitalization!
The true effect is the opposite. Confounding has flipped the sign.
Average severity (Drug X group): 0.402
Average severity (No Drug X group): 0.216
The treated group is much sicker on average — this is the confounder.
This is not a contrived example. This is the default situation in observational data. Treatment assignment in the real world is almost never random. Doctors prescribe drugs to patients who need them. Managers assign training programs to employees who are struggling. Recommender systems show content to users who are likely to engage. In every case, the treatment is confounded with the factors that influence the outcome, and naive comparisons are biased — sometimes dramatically so.
Production Reality: In MediCore's actual EHR data, the confounding is far more complex than a single severity variable. Age, comorbidities, insurance type, hospital quality, physician prescribing preferences, geographic factors, and dozens of other variables simultaneously confound the drug-outcome relationship. This is why causal inference from observational data is genuinely hard: the confounder space is high-dimensional, partially unobserved, and often unmeasured. Chapters 16-19 provide the methods to address this — but none of them work without domain knowledge about what the confounders are.
15.4 Simpson's Paradox: When the Data Lies
Simpson's paradox is the most viscerally disturbing demonstration of confounding: a statistical trend that appears in every subgroup of the data reverses when the subgroups are combined.
The Classic Setup
Consider a clinical trial comparing two treatments (A and B) for kidney stones. The aggregate results show:
| Treatment | Success Rate |
|---|---|
| A | 78% (273/350) |
| B | 83% (289/350) |
Treatment B appears superior. But now stratify by stone size:
| Stone Size | Treatment A | Treatment B |
|---|---|---|
| Small | 93% (81/87) | 87% (234/270) |
| Large | 73% (192/263) | 69% (55/80) |
Treatment A is superior in both subgroups. The reversal occurs because stone size is a confounder: patients with large stones (harder to treat, lower success rates) are more likely to receive Treatment A (because doctors choose the more aggressive treatment for difficult cases). When you ignore stone size, you are comparing Treatment A on a predominantly large-stone population against Treatment B on a predominantly small-stone population.
def simpsons_paradox_kidney_stones() -> None:
"""Reproduce the classic kidney stone Simpson's paradox.
Based on Charig et al. (1986), "Comparison of treatment of renal
calculi by open surgery, percutaneous nephrolithotomy, and
extracorporeal shockwave lithotripsy." BMJ 292(6524): 879-882.
"""
data = {
"group": ["Small, Treatment A", "Small, Treatment B",
"Large, Treatment A", "Large, Treatment B"],
"successes": [81, 234, 192, 55],
"total": [87, 270, 263, 80],
}
df = pd.DataFrame(data)
df["rate"] = df["successes"] / df["total"]
# Aggregate rates (Simpson's reversal)
agg_a = (81 + 192) / (87 + 263)
agg_b = (234 + 55) / (270 + 80)
print("=== Simpson's Paradox: Kidney Stone Treatments ===")
print()
print("AGGREGATE (ignoring stone size):")
print(f" Treatment A: {agg_a:.1%} ({81+192}/{87+263})")
print(f" Treatment B: {agg_b:.1%} ({234+55}/{270+80})")
print(f" → Treatment B appears better")
print()
print("STRATIFIED BY STONE SIZE:")
print(f" Small stones: A = {81/87:.1%} ({81}/{87}) "
f"B = {234/270:.1%} ({234}/{270}) → A is better")
print(f" Large stones: A = {192/263:.1%} ({192}/{263}) "
f"B = {55/80:.1%} ({55}/{80}) → A is better")
print()
print("Treatment A is better in EVERY subgroup but worse overall!")
print()
print("WHY: Stone size is a confounder.")
print(f" Treatment A: {263/(87+263):.0%} of cases are large stones")
print(f" Treatment B: {80/(270+80):.0%} of cases are large stones")
print(" Doctors assigned the harder cases to Treatment A.")
simpsons_paradox_kidney_stones()
=== Simpson's Paradox: Kidney Stone Treatments ===
AGGREGATE (ignoring stone size):
Treatment A: 78.0% (273/350)
Treatment B: 82.6% (289/350)
→ Treatment B appears better
STRATIFIED BY STONE SIZE:
Small stones: A = 93.1% (81/87) B = 86.7% (234/270) → A is better
Large stones: A = 73.0% (192/263) B = 68.8% (55/80) → A is better
Treatment A is better in EVERY subgroup but worse overall!
WHY: Stone size is a confounder.
Treatment A: 75% of cases are large stones
Treatment B: 23% of cases are large stones
Doctors assigned the harder cases to Treatment A.
Why Prediction Models Cannot Resolve Simpson's Paradox
A prediction model trained on this data would learn that Treatment B is associated with higher success rates — because that is what the aggregate data shows. It would recommend Treatment B for new patients. This recommendation is wrong.
The prediction model has no mechanism to determine why Treatment B is associated with higher success. It cannot distinguish "Treatment B causes higher success" from "Treatment B is given to easier cases, which have higher success regardless of treatment." Both explanations are consistent with the observed data. Resolving the ambiguity requires a causal model — an understanding of the data-generating process — that is external to the prediction model.
Mathematical Foundation: Simpson's paradox is not a paradox in the mathematical sense. It is a straightforward consequence of marginalization in the presence of confounding. If we denote stone size as $Z \in \{S, L\}$:
$$P(Y = 1 \mid T = A) = \sum_{z} P(Y = 1 \mid T = A, Z = z) \cdot P(Z = z \mid T = A)$$
The aggregate success rate of Treatment A is a weighted average of the stratum-specific rates, where the weights are the distribution of stone sizes among Treatment A patients. Because the confounder $Z$ changes the weights between treatments, the weighted average can reverse even when every stratum-specific rate favors the same treatment. The correct causal estimand uses the marginal distribution of $Z$ as weights (standardization), not the treatment-specific distribution.
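The standardization formula can be applied directly to the kidney stone table: reweight each treatment's stratum-specific rates by the marginal distribution of stone size $P(Z = z)$ rather than the treatment-specific $P(Z = z \mid T)$. A minimal sketch:

```python
# Standardization (back-door adjustment) on the kidney stone table:
# weight each treatment's stratum-specific success rates by the MARGINAL
# distribution of stone size, P(Z=z), not the treatment-specific P(Z=z|T).
counts = {  # (successes, patients) per (treatment, stratum)
    ("A", "small"): (81, 87),   ("A", "large"): (192, 263),
    ("B", "small"): (234, 270), ("B", "large"): (55, 80),
}

n_total = sum(n for _, n in counts.values())   # 700 patients in total
p_z = {
    "small": (87 + 270) / n_total,             # 0.51
    "large": (263 + 80) / n_total,             # 0.49
}

standardized = {
    t: sum(counts[(t, z)][0] / counts[(t, z)][1] * p_z[z]
           for z in ("small", "large"))
    for t in ("A", "B")
}

print(f"Standardized success rate, Treatment A: {standardized['A']:.1%}")  # 83.3%
print(f"Standardized success rate, Treatment B: {standardized['B']:.1%}")  # 77.9%
```

With both treatments evaluated on the same stone-size distribution, the standardized comparison agrees with the stratified one: Treatment A is better.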
A Continuous Simpson's Paradox
The kidney stone example uses binary strata. In practice, confounders are often continuous, making the paradox invisible without careful analysis.
def continuous_simpsons_paradox(
n: int = 2000,
seed: int = 42
) -> pd.DataFrame:
"""Demonstrate Simpson's paradox with a continuous confounder.
The treatment has a TRUE positive effect on the outcome, but the
naive regression coefficient is negative because the confounder
is positively associated with both treatment and outcome.
Args:
n: Sample size.
seed: Random seed.
Returns:
DataFrame with treatment, outcome, and confounder.
"""
rng = np.random.RandomState(seed)
# Confounder: exercise level (0 to 10)
exercise = rng.uniform(0, 10, n)
# Treatment: supplement use. People who exercise more are more likely
# to take the supplement (health-conscious behavior)
supplement_prob = 1 / (1 + np.exp(-(exercise - 5) * 0.8))
supplement = rng.binomial(1, supplement_prob)
# Outcome: cholesterol reduction (higher = better)
# Exercise STRONGLY reduces cholesterol
# Supplement has a SMALL positive effect (true causal effect = +3)
true_supplement_effect = 3.0
cholesterol_reduction = (
2.5 * exercise
+ true_supplement_effect * supplement
+ rng.normal(0, 4, n)
)
df = pd.DataFrame({
"exercise": exercise,
"supplement": supplement,
"cholesterol_reduction": cholesterol_reduction,
})
# Naive analysis: compare means
mean_treated = df.loc[df["supplement"] == 1, "cholesterol_reduction"].mean()
mean_control = df.loc[df["supplement"] == 0, "cholesterol_reduction"].mean()
    # Difference in means = naive regression coefficient (supplement is binary)
    naive_coef = mean_treated - mean_control
# Adjusted regression (controlling for exercise)
from sklearn.linear_model import LinearRegression
model_naive = LinearRegression().fit(
df[["supplement"]], df["cholesterol_reduction"]
)
model_adjusted = LinearRegression().fit(
df[["supplement", "exercise"]], df["cholesterol_reduction"]
)
print("=== Continuous Simpson's Paradox ===")
print(f"True causal effect of supplement: +{true_supplement_effect:.1f}")
print()
print(f"Naive comparison of means:")
print(f" Supplement group mean: {mean_treated:.1f}")
print(f" Control group mean: {mean_control:.1f}")
print(f" Difference: {naive_coef:+.1f}")
print()
print(f"Naive regression (supplement only):")
print(f" Coefficient: {model_naive.coef_[0]:+.2f}")
print()
print(f"Adjusted regression (supplement + exercise):")
print(f" Supplement coefficient: {model_adjusted.coef_[0]:+.2f}")
print(f" Exercise coefficient: {model_adjusted.coef_[1]:+.2f}")
print()
print("The naive analysis OVERESTIMATES the supplement effect because")
print("supplement users exercise more, and exercise reduces cholesterol.")
print("Adjusting for the confounder (exercise) recovers the true effect.")
return df
df_simpson = continuous_simpsons_paradox()
=== Continuous Simpson's Paradox ===
True causal effect of supplement: +3.0
Naive comparison of means:
Supplement group mean: 16.5
Control group mean: 8.2
Difference: +8.3
Naive regression (supplement only):
Coefficient: +8.32
Adjusted regression (supplement + exercise):
Supplement coefficient: +2.98
Exercise coefficient: +2.51
The naive analysis OVERESTIMATES the supplement effect because
supplement users exercise more, and exercise reduces cholesterol.
Adjusting for the confounder (exercise) recovers the true effect.
In this continuous case, the naive estimate (8.3) is nearly three times the true effect (3.0). There is no visible "reversal" because the confounder is continuous — but the bias is just as real. A prediction model trained to predict cholesterol reduction from supplement use alone would dramatically overstate the supplement's effectiveness.
15.5 Beyond Confounding: Colliders and Selection Bias
Confounders are not the only structural danger in observational data. Colliders create a complementary trap that is less intuitive but equally damaging.
What Is a Collider?
A collider is a variable that is caused by two or more other variables. In the causal graph $X \to Z \leftarrow Y$, the variable $Z$ is a collider on the path between $X$ and $Y$.
The key property: $X$ and $Y$ may be completely independent (no causal relationship, no common cause), but conditioning on the collider $Z$ induces a spurious association between them.
The Talent-Attractiveness Example
Consider the (hypothetical but illustrative) claim: "Among Hollywood actors, talent and attractiveness are negatively correlated." Does this mean that being talented causes one to be less attractive? No. Both talent and attractiveness independently increase the probability of becoming a Hollywood actor. Among the general population, talent and attractiveness are uncorrelated. But if you condition on the collider "is a Hollywood actor," you create a selection effect: among actors, someone who is not particularly attractive must be very talented (otherwise they would not have been selected), and vice versa. The negative correlation is an artifact of conditioning on a common effect.
import numpy as np

def collider_bias_example(
n_population: int = 50000,
selection_threshold: float = 0.7,
seed: int = 42
) -> None:
"""Demonstrate collider bias through selection on a common effect.
Talent and attractiveness are independent in the population, but
conditioning on selection (being an actor) creates a spurious
negative correlation.
Args:
n_population: Population size.
selection_threshold: Selection probability threshold.
seed: Random seed.
"""
rng = np.random.RandomState(seed)
# Talent and attractiveness: independent in the population
talent = rng.normal(0, 1, n_population)
attractiveness = rng.normal(0, 1, n_population)
# Selection into acting: depends on BOTH talent and attractiveness
# (collider: talent → selected ← attractiveness)
selection_score = 0.5 * talent + 0.5 * attractiveness + rng.normal(0, 0.3, n_population)
selected = selection_score > np.percentile(selection_score, 90) # Top 10%
# Correlations
r_population = np.corrcoef(talent, attractiveness)[0, 1]
r_selected = np.corrcoef(talent[selected], attractiveness[selected])[0, 1]
print("=== Collider Bias: Selection Effect ===")
print(f"Population correlation (talent, attractiveness): {r_population:+.3f}")
print(f"Selected group correlation: {r_selected:+.3f}")
print()
print(f"Population size: {n_population}")
print(f"Selected (actors): {selected.sum()}")
print()
print("Talent and attractiveness are INDEPENDENT in the population.")
print("But among actors (selected), they are NEGATIVELY correlated.")
print("This is collider bias: conditioning on a common effect")
print("creates a spurious association between its causes.")
collider_bias_example()
=== Collider Bias: Selection Effect ===
Population correlation (talent, attractiveness): -0.003
Selected group correlation: -0.350
Population size: 50000
Selected (actors): 5000
Talent and attractiveness are INDEPENDENT in the population.
But among actors (selected), they are NEGATIVELY correlated.
This is collider bias: conditioning on a common effect
creates a spurious association between its causes.
Why Collider Bias Matters for Data Science
Collider bias appears in data science whenever the data itself is a selected sample:
- Survival bias in customer churn. A model trained only on current customers conditions on the collider "has not yet churned." Features that independently predict both customer satisfaction and retention become spuriously correlated in the surviving sample.
- Berkson's paradox in hospital data. Among hospitalized patients, diseases that independently cause hospitalization appear negatively correlated — because a patient hospitalized for one disease need not have the other. MediCore's EHR data, which covers only patients who were treated at their partner hospitals, may exhibit this bias.
- Selection bias in recommender systems. The StreamRec platform only observes engagement for items that were shown to users. Conditioning on "item was shown" (a collider caused by both user preferences and the recommendation algorithm) creates spurious associations between user features and item features.
Common Misconception: "Controlling for more variables is always better." This is dangerously wrong. Controlling for a confounder reduces bias. Controlling for a collider introduces bias. This is why causal reasoning — specifically, understanding the causal structure of your data — is essential before deciding which variables to include in a model. We will formalize this in Chapter 17 with d-separation and the backdoor criterion.
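The misconception can be made concrete with a short regression sketch (all data synthetic; `x`, `y`, and the collider `z` are hypothetical variables). Here `x` and `y` share no causal link at all, yet "controlling for" `z` manufactures a strong negative coefficient:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100_000
x = rng.normal(0, 1, n)                 # no causal link between x and y
y = rng.normal(0, 1, n)
z = x + y + rng.normal(0, 0.5, n)       # collider: x -> z <- y

def ols_coefs(features: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta

unadjusted = ols_coefs(x.reshape(-1, 1), y)[1]       # near zero, correctly
adjusted = ols_coefs(np.column_stack([x, z]), y)[1]  # spuriously negative

print(f"Coefficient on x, no adjustment:   {unadjusted:+.3f}")
print(f"Coefficient on x, adjusting for z: {adjusted:+.3f}")
```

Adding `z` to the model is not harmless "extra control": conditioning on the common effect opens a non-causal path between its causes.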
15.6 The Fundamental Problem of Causal Inference
We have established that prediction models cannot answer causal questions, that confounders bias naive comparisons, and that colliders create additional structural pitfalls. Now we articulate the deep reason why causal inference is fundamentally harder than prediction.
Potential Outcomes (Preview)
For each individual $i$ and treatment $T \in \{0, 1\}$, define two potential outcomes:
- $Y_i(1)$: the outcome that would be observed if individual $i$ receives the treatment
- $Y_i(0)$: the outcome that would be observed if individual $i$ does not receive the treatment
The individual treatment effect (ITE) for person $i$ is:
$$\tau_i = Y_i(1) - Y_i(0)$$
The average treatment effect (ATE) across the population is:
$$\tau = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]$$
The Fundamental Problem
Here is the problem, stated by Paul Holland in his landmark 1986 paper: we can never observe both $Y_i(1)$ and $Y_i(0)$ for the same individual. Each individual either receives the treatment or does not. We observe one potential outcome and the other is forever counterfactual.
| Patient | $Y_i(0)$ | $Y_i(1)$ | $\tau_i$ | Observed? |
|---|---|---|---|---|
| Alice | — | Not hospitalized | — | $Y_i(1)$ only (received treatment) |
| Bob | Hospitalized | — | — | $Y_i(0)$ only (no treatment) |
| Carol | ? | ? | ? | We can only ever see one |
This is not a data size problem. It is not a modeling problem. It is a logical impossibility: you cannot simultaneously give and not give a treatment to the same person at the same time. The counterfactual outcome — what would have happened — is inherently unobservable.
This means that the individual treatment effect $\tau_i$ is never directly observable. We can estimate population-level quantities like the ATE under certain assumptions, but the individual causal effect remains a missing data problem.
import numpy as np
import pandas as pd

def fundamental_problem_illustration(
n: int = 8,
seed: int = 42
) -> pd.DataFrame:
"""Illustrate the fundamental problem of causal inference.
Shows a small population where we know (from simulation) both
potential outcomes, but in practice only one is observed.
Args:
n: Number of individuals.
seed: Random seed.
Returns:
DataFrame showing observed and counterfactual outcomes.
"""
rng = np.random.RandomState(seed)
# God's-eye view: we know both potential outcomes (in simulation only!)
y0 = rng.binomial(1, 0.3, n) # Outcome without treatment
y1 = rng.binomial(1, 0.15, n) # Outcome with treatment (lower risk)
# Treatment assignment (not random — confounded by a latent factor)
latent_severity = rng.uniform(0, 1, n)
treatment = (latent_severity > 0.5).astype(int)
# Observed outcome: only see Y(T=assigned treatment)
y_observed = np.where(treatment == 1, y1, y0)
df = pd.DataFrame({
"patient": [f"Patient {i+1}" for i in range(n)],
"Y(0)_truth": y0,
"Y(1)_truth": y1,
"ITE_truth": y1 - y0,
"treated": treatment,
"Y_observed": y_observed,
"Y(0)_observed": np.where(treatment == 0, y0, "?"),
"Y(1)_observed": np.where(treatment == 1, y1, "?"),
})
true_ate = (y1 - y0).mean()
naive_ate = (
y_observed[treatment == 1].mean()
- y_observed[treatment == 0].mean()
)
print("=== The Fundamental Problem of Causal Inference ===")
print()
print("God's-eye view (known only in simulation):")
print(df[["patient", "Y(0)_truth", "Y(1)_truth", "ITE_truth"]].to_string(index=False))
print()
print(f"True ATE: {true_ate:.3f}")
print()
print("What we actually observe:")
print(df[["patient", "treated", "Y_observed",
"Y(0)_observed", "Y(1)_observed"]].to_string(index=False))
print()
print(f"Naive ATE (difference in observed means): {naive_ate:.3f}")
print()
print("The '?' entries are counterfactual — they can NEVER be observed.")
print("The naive ATE may differ from the true ATE due to confounding.")
return df
df_fundamental = fundamental_problem_illustration()
=== The Fundamental Problem of Causal Inference ===
God's-eye view (known only in simulation):
patient Y(0)_truth Y(1)_truth ITE_truth
Patient 1 0 0 0
Patient 2 1 0 -1
Patient 3 1 0 -1
Patient 4 0 0 0
Patient 5 0 0 0
Patient 6 0 1 1
Patient 7 0 0 0
Patient 8 0 0 0
True ATE: -0.125
What we actually observe:
patient treated Y_observed Y(0)_observed Y(1)_observed
Patient 1 0 0 0 ?
Patient 2 0 1 1 ?
Patient 3 0 1 1 ?
Patient 4 0 0 0 ?
Patient 5 1 0 ? 0
Patient 6 1 1 ? 1
Patient 7 1 0 ? 0
Patient 8 1 0 ? 0
Naive ATE (difference in observed means): -0.250
The '?' entries are counterfactual — they can NEVER be observed.
The naive ATE may differ from the true ATE due to confounding.
Why Randomization Solves the Problem
Randomized controlled trials (RCTs) are the gold standard for causal inference because randomization breaks the link between treatment assignment and confounders. When treatment is randomly assigned:
$$T \perp\!\!\!\perp (Y(0), Y(1))$$
Treatment is independent of potential outcomes. This means the treated and control groups are, on average, identical in all respects — observed and unobserved — except for the treatment itself. The naive difference in means becomes an unbiased estimator of the ATE:
$$E[Y \mid T = 1] - E[Y \mid T = 0] = E[Y(1)] - E[Y(0)] = \text{ATE}$$
The confounding bias term vanishes because randomization ensures that the distribution of confounders is the same in both groups.
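This claim can be checked numerically. In the sketch below (synthetic data; the outcome probabilities and the severity confounder are invented for illustration), confounded assignment biases the naive difference in means, while randomized assignment recovers the true ATE:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200_000
severity = rng.uniform(0, 1, n)  # confounder: affects outcomes AND treatment

# God's-eye potential outcomes; treatment lowers risk by 0.10 at every severity
y0 = (rng.uniform(0, 1, n) < 0.2 + 0.4 * severity).astype(float)
y1 = (rng.uniform(0, 1, n) < 0.1 + 0.4 * severity).astype(float)
true_ate = (y1 - y0).mean()  # ~ -0.10

# Confounded assignment: sicker patients are treated more often
t_conf = (rng.uniform(0, 1, n) < severity).astype(int)
naive = y1[t_conf == 1].mean() - y0[t_conf == 0].mean()

# Randomized assignment: a coin flip, independent of potential outcomes
t_rand = rng.binomial(1, 0.5, n)
randomized = y1[t_rand == 1].mean() - y0[t_rand == 0].mean()

print(f"True ATE:            {true_ate:+.3f}")
print(f"Confounded estimate: {naive:+.3f}")   # wrong sign: treatment looks harmful
print(f"Randomized estimate: {randomized:+.3f}")
```

With confounded assignment the naive comparison even flips sign; under randomization the simple difference in means is unbiased.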
But RCTs are not always possible:
- Ethical constraints. You cannot randomly assign patients to "no treatment" when a treatment is known to be effective.
- Practical constraints. You cannot randomly assign which employees receive training or which cities adopt a policy.
- Temporal constraints. You cannot randomize historical events.
- Scale constraints. Running an RCT on a recommendation system requires showing suboptimal recommendations to some users, reducing engagement and revenue.
When RCTs are impossible, we need observational causal inference methods — the subject of Chapters 16-19.
Research Insight: Holland's 1986 paper, "Statistics and Causal Inference" (Journal of the American Statistical Association, 81(396): 945-960), introduced the phrase "the fundamental problem of causal inference" and formalized the connection between Rubin's potential outcomes framework and Fisher's randomization-based inference. The paper remains one of the clearest expositions of why causal inference is hard and is essential reading for anyone working in this area.
15.7 Spurious Correlations and Model Fragility
Prediction models that exploit confounded associations are not just causally wrong — they are also fragile. They break when the world changes.
Why Confounded Features Make Fragile Models
A prediction model learns whatever statistical patterns exist in the training data. If a confounded association is strong, the model will use it. But confounded associations are contingent — they depend on the specific distribution of confounders in the training data, which may not hold in deployment.
Consider a hospital readmission model that learns the association between "number of prior ED visits" and readmission. This association partly reflects a causal relationship (frequent ED use signals poorly managed chronic conditions) but is also heavily confounded by insurance status, geographic access to primary care, and social determinants of health. If the hospital's patient population changes — a new insurance contract brings in a different demographic mix — the confounded component of the association breaks, and the model's performance degrades.
from typing import Tuple

import numpy as np

def confounded_feature_fragility(
n_train: int = 10000,
n_test: int = 5000,
seed: int = 42
) -> None:
"""Demonstrate how confounded features create fragile predictions.
A model trained on data where a confounder has one distribution
fails when the confounder distribution shifts.
Args:
n_train: Training set size.
n_test: Test set size.
seed: Random seed.
"""
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
rng = np.random.RandomState(seed)
def generate_data(
n: int, confounder_mean: float, rng: np.random.RandomState
) -> Tuple[np.ndarray, np.ndarray]:
# Confounder: socioeconomic status (affects both features and outcome)
ses = rng.normal(confounder_mean, 1, n)
# Feature 1: prior ED visits (causal + confounded)
ed_visits = np.maximum(0, 3 - 0.5 * ses + rng.poisson(1, n))
# Feature 2: medication count (causal + confounded)
med_count = np.maximum(0, 5 - 0.3 * ses + rng.poisson(2, n))
# Feature 3: age (causal, not confounded by SES)
age = rng.normal(65, 12, n)
X = np.column_stack([ed_visits, med_count, age])
# Outcome: readmission (causal effects + confounder effect)
logit = -2 + 0.1 * ed_visits + 0.05 * med_count + 0.02 * age - 0.3 * ses
prob = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, prob)
return X, y
# Training data: confounder mean = 0
X_train, y_train = generate_data(n_train, confounder_mean=0.0, rng=rng)
# Test data (same distribution): confounder mean = 0
X_test_same, y_test_same = generate_data(n_test, confounder_mean=0.0, rng=rng)
# Test data (shifted distribution): confounder mean = 2 (different population)
X_test_shifted, y_test_shifted = generate_data(n_test, confounder_mean=2.0, rng=rng)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
auc_same = roc_auc_score(y_test_same, model.predict_proba(X_test_same)[:, 1])
auc_shifted = roc_auc_score(
y_test_shifted, model.predict_proba(X_test_shifted)[:, 1]
)
print("=== Confounded Feature Fragility ===")
print(f"AUC (same distribution): {auc_same:.3f}")
print(f"AUC (shifted confounder): {auc_shifted:.3f}")
print(f"AUC drop: {auc_same - auc_shifted:.3f}")
print()
print("The model degrades because it learned confounded associations")
print("that are specific to the training population's confounder distribution.")
print("A model that used only causal features would be more robust.")
confounded_feature_fragility()
=== Confounded Feature Fragility ===
AUC (same distribution): 0.721
AUC (shifted confounder): 0.653
AUC drop: 0.068
The model degrades because it learned confounded associations
that are specific to the training population's confounder distribution.
A model that used only causal features would be more robust.
This is a deep connection between causal reasoning and robust machine learning. Models built on causal features — features that directly cause the outcome, rather than being merely correlated with it through confounders — generalize better under distribution shift. This insight motivates an entire subfield (invariant risk minimization, causal representation learning) that we will revisit in Chapter 19.
15.8 The Ladder of Causation
Judea Pearl organizes causal reasoning into three levels, which he calls the "Ladder of Causation." This hierarchy provides a conceptual framework for understanding what different types of analysis can and cannot answer.
Rung 1: Association (Seeing)
$$P(Y \mid X)$$
Questions at this level ask about the joint distribution of observed variables. "What is the probability of $Y$ given that I observe $X$?" All prediction models, all correlational analyses, and all standard machine learning operate at this level.
Example: "Patients who take Drug X have a 28% hospitalization rate. Patients who do not take Drug X have a 15% hospitalization rate." This is an observational fact about the data. It says nothing about whether Drug X causes hospitalization.
Rung 2: Intervention (Doing)
$$P(Y \mid \text{do}(X))$$
Questions at this level ask about the effect of interventions — actively changing one variable and observing the consequence. "What would happen to $Y$ if I set $X$ to a particular value?" RCTs answer interventional questions. The do-calculus (Chapter 17) provides conditions under which observational data can answer them too.
Example: "If we administer Drug X to a random sample of patients, what would the hospitalization rate be?" This is a causal question. The answer may differ from the observational association because the patients who receive Drug X in the observational data are not a random sample.
Rung 3: Counterfactuals (Imagining)
$$P(Y_{X=x'} \mid X = x, Y = y)$$
Questions at this level ask about specific individuals in specific circumstances. "Given that $X$ actually took value $x$ and $Y$ actually took value $y$, what would $Y$ have been if $X$ had been $x'$ instead?" These are the most demanding questions because they require reasoning about the same individual under different conditions.
Example: "This specific patient took Drug X and was hospitalized. Would they have been hospitalized if they had not taken Drug X?" This is a counterfactual — it asks about a specific event that did not happen.
def ladder_of_causation_summary() -> None:
"""Print a summary of Pearl's Ladder of Causation with examples."""
ladder = [
{
"rung": 1,
"name": "Association (Seeing)",
"formal": "P(Y | X)",
"question": "What is?",
"example_pharma": "Patients on Drug X have 28% hospitalization rate",
"example_recsys": "Users shown item X have 12% click rate",
"method": "Standard ML, statistics, EDA",
},
{
"rung": 2,
"name": "Intervention (Doing)",
"formal": "P(Y | do(X))",
"question": "What if I do?",
"example_pharma": "If we GIVE Drug X, what is the hospitalization rate?",
"example_recsys": "If we SHOW item X, will the user click?",
"method": "RCTs, causal inference (Ch. 16-19)",
},
{
"rung": 3,
"name": "Counterfactual (Imagining)",
"formal": "P(Y_x' | X=x, Y=y)",
"question": "What if I had done differently?",
"example_pharma": "Patient was hospitalized on Drug X. Would they "
"have been hospitalized WITHOUT it?",
"example_recsys": "User clicked after recommendation. Would they "
"have clicked WITHOUT the recommendation?",
"method": "Structural causal models (Ch. 17), counterfactual reasoning",
},
]
print("=== The Ladder of Causation (Pearl, 2018) ===")
print()
for level in ladder:
print(f"Rung {level['rung']}: {level['name']}")
print(f" Formal: {level['formal']}")
print(f" Question: {level['question']}")
print(f" Pharma: {level['example_pharma']}")
print(f" RecSys: {level['example_recsys']}")
print(f" Method: {level['method']}")
print()
print("Each rung subsumes the one below it. You cannot answer Rung 2")
print("questions with Rung 1 data alone. You cannot answer Rung 3")
print("questions with Rung 2 data alone.")
ladder_of_causation_summary()
=== The Ladder of Causation (Pearl, 2018) ===
Rung 1: Association (Seeing)
Formal: P(Y | X)
Question: What is?
Pharma: Patients on Drug X have 28% hospitalization rate
RecSys: Users shown item X have 12% click rate
Method: Standard ML, statistics, EDA
Rung 2: Intervention (Doing)
Formal: P(Y | do(X))
Question: What if I do?
Pharma: If we GIVE Drug X, what is the hospitalization rate?
RecSys: If we SHOW item X, will the user click?
Method: RCTs, causal inference (Ch. 16-19)
Rung 3: Counterfactual (Imagining)
Formal: P(Y_x' | X=x, Y=y)
Question: What if I had done differently?
Pharma: Patient was hospitalized on Drug X. Would they have been hospitalized WITHOUT it?
RecSys: User clicked after recommendation. Would they have clicked WITHOUT the recommendation?
Method: Structural causal models (Ch. 17), counterfactual reasoning
Each rung subsumes the one below it. You cannot answer Rung 2
questions with Rung 1 data alone. You cannot answer Rung 3
questions with Rung 2 data alone.
Understanding Why: The Ladder of Causation is not merely a philosophical taxonomy. It is a formal statement about the information required to answer different types of questions. A dataset (no matter how large) plus any learning algorithm (no matter how sophisticated) cannot climb from Rung 1 to Rung 2 without additional assumptions about the data-generating process. This is a theorem, not a suggestion. Pearl and colleagues proved this as the "causal hierarchy theorem" (Bareinboim et al., 2022): for almost all causal models, the interventional distribution $P(Y \mid \text{do}(X))$ is not identifiable from the observational distribution $P(X, Y)$ alone.
15.9 The StreamRec Recommendation Problem
Let us now apply these ideas to the progressive project. StreamRec has a recommendation system that selects which content items to show each user. The business team wants to know: does the recommendation system create value, or does it merely predict what users would have engaged with anyway?
The Question
The prediction question is:
$$P(\text{engage} \mid \text{user}, \text{item}, \text{context})$$
Given this user, this item, and this context, what is the probability the user engages?
The causal question is:
$$P(\text{engage} \mid \text{do}(\text{recommend item})) - P(\text{engage} \mid \text{do}(\text{not recommend item}))$$
If we recommend this item (versus not), how much does engagement change?
These questions have very different answers. The recommendation system is trained to predict engagement — and it is very good at this. It recommends items that users are likely to engage with. But many of those items are things the user would have found and engaged with anyway, through search, browsing, or organic discovery. The recommendation system is taking credit for engagement that would have happened without it.
Why Offline Evaluation Cannot Answer the Causal Question
Standard offline evaluation for recommendation systems works like this:
- Hold out a set of interactions from historical data
- Use the recommendation model to predict which items each user would engage with
- Measure prediction accuracy (Hit@K, NDCG@K, etc.)
This evaluation measures predictive performance, not causal impact. A model that perfectly predicts organic behavior (what users would have done without recommendations) achieves high offline metrics — but creates zero incremental value.
import numpy as np

def streamrec_offline_evaluation_limitation(
n_users: int = 5000,
n_items: int = 1000,
seed: int = 42
) -> None:
"""Demonstrate why offline evaluation cannot measure causal recommendation impact.
Simulates two recommendation models:
- Model A: Predicts organic engagement (what users would do anyway)
- Model B: Identifies items where recommendation CAUSES engagement
Model A has better offline metrics but creates less incremental value.
Args:
n_users: Number of users.
n_items: Number of items.
seed: Random seed.
"""
rng = np.random.RandomState(seed)
# User preference vectors (latent)
user_prefs = rng.normal(0, 1, (n_users, 10))
# Item feature vectors (latent)
item_features = rng.normal(0, 1, (n_items, 10))
# Organic engagement: what users would engage with WITHOUT recommendation
organic_scores = user_prefs @ item_features.T # n_users x n_items
organic_probs = 1 / (1 + np.exp(-organic_scores / 3))
# Recommendation effect: CAUSAL increase in engagement from recommendation
# Some items have high organic appeal but low recommendation effect
# (users would have found them anyway)
# Other items have low organic appeal but high recommendation effect
# (users would NOT have found them without the recommendation)
rec_effect = rng.exponential(0.05, (n_users, n_items))
# Items with very high organic appeal have LOWER recommendation effect
rec_effect *= (1 - organic_probs) # Anti-correlation with organic appeal
# Total engagement probability WITH recommendation
total_probs = np.minimum(1.0, organic_probs + rec_effect)
# Model A: Predicts organic engagement well
# (standard collaborative filtering — this is what we train on)
model_a_scores = organic_probs + rng.normal(0, 0.05, organic_probs.shape)
# Model B: Estimates recommendation effect (causal model)
model_b_scores = rec_effect + rng.normal(0, 0.02, rec_effect.shape)
# Offline evaluation: for each user, recommend top-10
k = 10
def hit_at_k(scores: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
"""Fraction of users with at least one relevant item in top-k."""
hits = 0
for u in range(scores.shape[0]):
top_k = np.argsort(-scores[u])[:k]
# "Ground truth" = items user engaged with (organic + recommended)
engaged = (ground_truth[u] > 0.5).nonzero()[0]
if len(set(top_k) & set(engaged)) > 0:
hits += 1
return hits / scores.shape[0]
# Generate ground truth engagement (organic)
engaged_organic = (rng.random(organic_probs.shape) < organic_probs).astype(float)
# Offline metrics (evaluated against organic engagement)
# Use a sample for speed
sample = 500
hit_a = hit_at_k(model_a_scores[:sample], engaged_organic[:sample], k)
hit_b = hit_at_k(model_b_scores[:sample], engaged_organic[:sample], k)
# Causal value: average recommendation effect of top-k items
def avg_causal_value(scores: np.ndarray, effects: np.ndarray, k: int) -> float:
total_effect = 0.0
for u in range(scores.shape[0]):
top_k = np.argsort(-scores[u])[:k]
total_effect += effects[u, top_k].sum()
return total_effect / scores.shape[0]
causal_a = avg_causal_value(model_a_scores[:sample], rec_effect[:sample], k)
causal_b = avg_causal_value(model_b_scores[:sample], rec_effect[:sample], k)
print("=== StreamRec: Offline Evaluation vs. Causal Value ===")
print()
print(f"Model A (predicts organic engagement):")
print(f" Hit@{k} (offline metric): {hit_a:.3f}")
print(f" Avg causal value (top-{k}): {causal_a:.4f}")
print()
print(f"Model B (estimates recommendation effect):")
print(f" Hit@{k} (offline metric): {hit_b:.3f}")
print(f" Avg causal value (top-{k}): {causal_b:.4f}")
print()
print("Model A wins on offline metrics (it predicts what users engage with).")
print("Model B wins on causal value (it identifies items where the")
print("recommendation makes a difference).")
print()
print("Offline evaluation CANNOT distinguish between these two models'")
print("causal value. It rewards prediction accuracy, not causal impact.")
streamrec_offline_evaluation_limitation()
=== StreamRec: Offline Evaluation vs. Causal Value ===
Model A (predicts organic engagement):
Hit@10 (offline metric): 0.986
Avg causal value (top-10): 0.0189
Model B (estimates recommendation effect):
Hit@10 (offline metric): 0.484
Avg causal value (top-10): 0.0742
Model A wins on offline metrics (it predicts what users engage with).
Model B wins on causal value (it identifies items where the
recommendation makes a difference).
Offline evaluation CANNOT distinguish between these two models'
causal value. It rewards prediction accuracy, not causal impact.
Model A achieves nearly perfect offline metrics — and delivers almost no incremental value. Model B has mediocre offline metrics — and delivers nearly four times the causal value. If StreamRec optimizes for offline metrics (which is standard industry practice), they will choose Model A every time and never know they are leaving value on the table.
This is the progressive project challenge for Part III: build the infrastructure to measure and optimize for causal recommendation effects. In Chapter 16, we will formalize this as a potential outcomes problem. In Chapter 18, we will estimate the effect from observational log data. In Chapter 19, we will build a causal model that targets recommendations where they create the most value.
15.10 The Landscape of Causal Inference Methods
Before diving into the technical details in subsequent chapters, let us survey the landscape. Causal inference methods can be organized along two axes: (1) the strength of the assumptions they require, and (2) the type of data they use.
Experimental Methods
| Method | Key Idea | Assumption Strength |
|---|---|---|
| Randomized Controlled Trial (RCT) | Random assignment ensures treatment is independent of confounders | Strongest (fewest assumptions) |
| A/B Test | Online RCT in a digital product | Same as RCT |
| Multi-Armed Bandit | Sequential experimentation with adaptive allocation | Assumes stationarity, no interference |
RCTs are the gold standard because randomization eliminates confounding by design. But they are not always feasible (ethical, practical, or cost constraints), and even well-designed RCTs can suffer from interference (one person's treatment affects another's outcome) and non-compliance.
Quasi-Experimental Methods
| Method | Key Idea | Assumption Strength |
|---|---|---|
| Instrumental Variables (IV) | Exploit an external source of variation that affects treatment but not outcome directly | Requires a valid instrument (strong) |
| Regression Discontinuity (RD) | Exploit a threshold rule that creates near-random assignment | Requires continuity at the threshold |
| Difference-in-Differences (DiD) | Compare treated and control groups before and after treatment | Requires parallel trends |
| Interrupted Time Series | Compare trends before and after an intervention in a single unit | Requires no concurrent shocks |
Quasi-experimental methods exploit "natural experiments" — situations where treatment assignment is as-if random due to institutional rules, policy changes, or accidents of timing. They require specific structural conditions but are often more credible than purely observational methods.
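As one hedged example, the difference-in-differences logic can be sketched in a few lines (synthetic data; the baseline gap, shared trend, and treatment effect are invented, and the parallel-trends assumption holds by construction):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 50_000
treated = rng.binomial(1, 0.5, n)

baseline = 10 + 3 * treated   # groups may start at different levels
trend = 2.0                   # identical trend in both groups (parallel trends)
effect = 1.5                  # true causal effect of the intervention

y_pre = baseline + rng.normal(0, 1, n)
y_post = baseline + trend + effect * treated + rng.normal(0, 1, n)

# DiD: (change in treated) - (change in control) cancels both the
# level difference and the shared trend, leaving the effect
did = (
    (y_post[treated == 1].mean() - y_pre[treated == 1].mean())
    - (y_post[treated == 0].mean() - y_pre[treated == 0].mean())
)
print(f"DiD estimate: {did:.3f}  (true effect: {effect})")
```

If the parallel-trends assumption fails (the groups would have trended differently without treatment), the estimate absorbs that divergence as spurious "effect" — which is why pre-period trend checks are standard practice.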
Observational Methods
| Method | Key Idea | Assumption Strength |
|---|---|---|
| Matching / Stratification | Compare treated and control units with similar observed characteristics | Requires no unobserved confounders |
| Propensity Score Methods | Model the probability of treatment given covariates, then weight or match | Same as matching, plus a correctly specified propensity model |
| Inverse Probability Weighting (IPW) | Reweight observations to simulate a randomized experiment | Requires correct propensity model |
| Doubly Robust Estimation | Combine outcome modeling and propensity scoring for robustness | Correct if EITHER model is correct |
| Synthetic Control | Construct a weighted combination of control units to approximate the treated unit's counterfactual | Requires donor pool that spans the treated unit's characteristics |
Observational methods require the strongest assumptions — most importantly, that all confounders are observed and measured. This assumption (called "ignorability" or "no unmeasured confounders") is untestable. It is a domain knowledge claim, not a statistical one.
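Under ignorability, inverse probability weighting can be sketched as follows (synthetic data; note we cheat by using the true propensity score, which in practice must be estimated, typically with logistic regression):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200_000
x = rng.normal(0, 1, n)                  # observed confounder
p_treat = 1 / (1 + np.exp(-x))           # true propensity score P(T=1 | x)
t = (rng.uniform(0, 1, n) < p_treat).astype(int)
y = 2.0 * t + 1.5 * x + rng.normal(0, 1, n)   # true effect = 2.0

naive = y[t == 1].mean() - y[t == 0].mean()   # biased upward by x

# IPW (Hajek form): reweight each group to resemble the full population
mu_1 = np.sum((t / p_treat) * y) / np.sum(t / p_treat)
mu_0 = np.sum(((1 - t) / (1 - p_treat)) * y) / np.sum((1 - t) / (1 - p_treat))
ipw = mu_1 - mu_0

print(f"Naive estimate: {naive:.3f}")
print(f"IPW estimate:   {ipw:.3f}  (true effect: 2.0)")
```

The reweighting works only because `x` — the one confounder — is observed. An unmeasured confounder would bias the IPW estimate just as it biases the naive one, and no diagnostic on the data can detect it.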
Causal Machine Learning Methods
| Method | Key Idea | Chapter |
|---|---|---|
| Causal Forests | Estimate heterogeneous treatment effects using random forests | Ch. 19 |
| Meta-Learners (S/T/X) | Use standard ML models to estimate conditional treatment effects | Ch. 19 |
| Double Machine Learning (DML) | Use ML for nuisance estimation while preserving causal target identification | Ch. 19 |
| Uplift Modeling | Directly model the incremental effect of treatment on each individual | Ch. 19 |
These methods combine machine learning's flexibility with causal inference's rigor. They are the frontier of the field and the subject of Chapter 19.
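As a preview, here is a minimal T-learner sketch (synthetic data, randomized treatment for simplicity, and plain least squares standing in for the flexible ML models Chapter 19 will use): fit separate outcome models for treated and control units, then take the difference of their predictions as the estimated conditional treatment effect.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 10_000
x = rng.normal(0, 1, (n, 3))
t = rng.binomial(1, 0.5, n)            # randomized, for simplicity
tau = 1.0 + x[:, 0]                    # heterogeneous true effect
y = x[:, 1] + tau * t + rng.normal(0, 0.5, n)

def fit_ols(features: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta

# T-learner: one outcome model per treatment arm
beta_1 = fit_ols(x[t == 1], y[t == 1])
beta_0 = fit_ols(x[t == 0], y[t == 0])

design_all = np.column_stack([np.ones(n), x])
cate = design_all @ beta_1 - design_all @ beta_0   # per-individual estimate
print(f"Mean estimated effect: {cate.mean():.2f}  (true ATE: {tau.mean():.2f})")
```

The per-individual estimates track the true heterogeneous effect, which is exactly what uplift-style targeting needs: treat where the estimated effect is largest, not where the predicted outcome is highest.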
Simplest Model That Works: The most important advice in causal inference is: use the simplest credible design. If you can run an RCT, run an RCT. If you have a natural experiment, use it (IV, RD, DiD). Resort to observational methods only when experimental and quasi-experimental approaches are truly infeasible. And when you use observational methods, be honest about the assumptions — they are always stronger than you would like.
15.11 Causal Thinking for Data Scientists: A Checklist
Before every data science project, ask yourself the following questions. The answers will determine whether you need prediction, causal inference, or both.
- What is the question? Is it descriptive ("What happened?"), predictive ("What will happen?"), or causal ("What would happen if...?")?
- Is there a decision? If the analysis will inform an action (treat patients, change recommendations, launch a feature, allocate resources), the question is almost certainly causal — even if it is framed as predictive.
- What is the treatment? What specific intervention are you considering? "Drug X," "recommendation algorithm change," "new pricing strategy" — the treatment must be well-defined and, in principle, manipulable.
- What is the outcome? What would you measure to determine whether the intervention worked? The outcome must be clearly defined and measurable.
- What are the confounders? What variables influence both the treatment assignment and the outcome? List them. If you cannot list any confounders, you are not thinking hard enough.
- Can you randomize? If yes, run an RCT. If no, why not? The reasons for not randomizing often reveal the structural features (ethical constraints, institutional rules, timing) that suggest which quasi-experimental method to use.
- What assumptions are you making? Every causal analysis rests on assumptions. State them. Evaluate their plausibility. Conduct sensitivity analyses to see how violations affect your conclusions.
from dataclasses import dataclass
from typing import List

@dataclass
class CausalQuestionFramework:
"""Framework for formulating and evaluating causal questions.
Attributes:
question: The causal question in natural language.
question_type: One of 'descriptive', 'predictive', 'causal'.
treatment: The intervention under consideration.
outcome: The measurable quantity of interest.
confounders: Known variables that affect both treatment and outcome.
can_randomize: Whether an RCT is feasible.
randomization_barrier: Why RCT is not feasible (if applicable).
assumptions: Key assumptions of the proposed analysis.
"""
question: str
question_type: str
treatment: str
outcome: str
confounders: List[str]
can_randomize: bool
randomization_barrier: str
assumptions: List[str]
def evaluate(self) -> None:
"""Print a structured evaluation of the causal question."""
print(f"Question: {self.question}")
print(f"Type: {self.question_type}")
print(f"Treatment: {self.treatment}")
print(f"Outcome: {self.outcome}")
print(f"Confounders ({len(self.confounders)}):")
for c in self.confounders:
print(f" - {c}")
print(f"Can randomize: {self.can_randomize}")
if not self.can_randomize:
print(f"Barrier: {self.randomization_barrier}")
print(f"Assumptions ({len(self.assumptions)}):")
for a in self.assumptions:
print(f" - {a}")
# MediCore example
medicore_question = CausalQuestionFramework(
    question="Does Drug X reduce 30-day hospitalization?",
    question_type="causal",
    treatment="Prescription of Drug X (vs. standard of care)",
    outcome="Binary: hospitalized within 30 days (yes/no)",
    confounders=[
        "Disease severity (sicker patients get Drug X more often)",
        "Age (older patients have higher severity and different prescribing)",
        "Comorbidities (more comorbid patients get both Drug X and hospitalized)",
        "Insurance type (affects both access to Drug X and hospital use)",
        "Hospital quality (affects both prescribing patterns and outcomes)",
        "Prior medication history (affects both Drug X decision and prognosis)",
    ],
    can_randomize=False,
    randomization_barrier=(
        "Drug X is already approved and in clinical use. An RCT withholding "
        "it from eligible patients raises ethical concerns. Additionally, MediCore "
        "wants to use existing EHR data for regulatory submission efficiency."
    ),
    assumptions=[
        "No unmeasured confounders (all variables affecting both Drug X prescribing "
        "and hospitalization are observed in the EHR)",
        "SUTVA: one patient's treatment does not affect another's outcome",
        "Positivity: every patient type has nonzero probability of receiving/not "
        "receiving Drug X",
        "Correct functional form for confounding adjustment",
    ],
)
print("=== MediCore Causal Question Framework ===")
medicore_question.evaluate()
print()

# StreamRec example
streamrec_question = CausalQuestionFramework(
    question="Does recommending item X to user Y cause engagement?",
    question_type="causal",
    treatment="Showing item X in the recommendation slate (vs. not showing it)",
    outcome="Binary: user engages with item X within session (yes/no)",
    confounders=[
        "User preference (drives both what is recommended and what is engaged with)",
        "Item popularity (popular items are both recommended more and engaged more)",
        "Time of day / session context (affects both algorithm and behavior)",
        "User activity level (active users see more recommendations and engage more)",
        "Position bias (slot position affects both selection probability and click rate)",
    ],
    can_randomize=True,
    randomization_barrier=(
        "Randomization IS feasible (A/B test), but costly: showing random items "
        "degrades user experience, and the opportunity cost of suboptimal "
        "recommendations at StreamRec's scale (~5M users) is significant. "
        "Chapter 19 explores how to minimize this cost."
    ),
    assumptions=[
        "No interference: one user's recommendation does not affect another's engagement",
        "Stable item characteristics during the experiment period",
        "No anticipation effects: users do not change behavior knowing they are "
        "in an experiment",
    ],
)
print("=== StreamRec Causal Question Framework ===")
streamrec_question.evaluate()
=== MediCore Causal Question Framework ===
Question: Does Drug X reduce 30-day hospitalization?
Type: causal
Treatment: Prescription of Drug X (vs. standard of care)
Outcome: Binary: hospitalized within 30 days (yes/no)
Confounders (6):
- Disease severity (sicker patients get Drug X more often)
- Age (older patients have higher severity and different prescribing)
- Comorbidities (more comorbid patients get both Drug X and hospitalized)
- Insurance type (affects both access to Drug X and hospital use)
- Hospital quality (affects both prescribing patterns and outcomes)
- Prior medication history (affects both Drug X decision and prognosis)
Can randomize: False
Barrier: Drug X is already approved and in clinical use. An RCT withholding it from eligible patients raises ethical concerns. Additionally, MediCore wants to use existing EHR data for regulatory submission efficiency.
Assumptions (4):
- No unmeasured confounders (all variables affecting both Drug X prescribing and hospitalization are observed in the EHR)
- SUTVA: one patient's treatment does not affect another's outcome
- Positivity: every patient type has nonzero probability of receiving/not receiving Drug X
- Correct functional form for confounding adjustment
=== StreamRec Causal Question Framework ===
Question: Does recommending item X to user Y cause engagement?
Type: causal
Treatment: Showing item X in the recommendation slate (vs. not showing it)
Outcome: Binary: user engages with item X within session (yes/no)
Confounders (5):
- User preference (drives both what is recommended and what is engaged with)
- Item popularity (popular items are both recommended more and engaged more)
- Time of day / session context (affects both algorithm and behavior)
- User activity level (active users see more recommendations and engage more)
- Position bias (slot position affects both selection probability and click rate)
Can randomize: True
Barrier: Randomization IS feasible (A/B test), but costly: showing random items degrades user experience, and the opportunity cost of suboptimal recommendations at StreamRec's scale (~5M users) is significant. Chapter 19 explores how to minimize this cost.
Assumptions (3):
- No interference: one user's recommendation does not affect another's engagement
- Stable item characteristics during the experiment period
- No anticipation effects: users do not change behavior knowing they are in an experiment
15.12 Looking Ahead: The Rest of Part III
This chapter has established the why of causal inference. The remaining chapters in Part III provide the how:
- Chapter 16: The Potential Outcomes Framework. Formalizes the concepts introduced here. Defines $Y(0)$, $Y(1)$, ATE, ATT. States the assumptions required for identification (SUTVA, ignorability, positivity). Shows why randomization works and what breaks without it. Implements simple estimators (difference in means, regression adjustment) and demonstrates their bias under confounding.
- Chapter 17: Graphical Causal Models. Introduces Pearl's framework: directed acyclic graphs (DAGs), d-separation, the backdoor criterion, the front-door criterion, and the do-calculus. Provides a complementary language for thinking about causal structure that makes confounders, colliders, and mediators visually and mathematically explicit.
- Chapter 18: Causal Estimation Methods. The practical toolkit: matching, propensity score methods (weighting, stratification, matching), instrumental variables, difference-in-differences, and regression discontinuity. Each method is derived, implemented, and applied to MediCore and StreamRec data.
- Chapter 19: Causal Machine Learning. The frontier: heterogeneous treatment effects (who benefits most?), causal forests, meta-learners (S-learner, T-learner, X-learner), double machine learning, and uplift modeling. This chapter builds the causal recommendation system for StreamRec that targets items where recommendations create engagement rather than merely predicting it.
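The bias that Chapter 16 formalizes can be previewed in a few lines of simulation. The sketch below uses invented numbers and assumes NumPy is available: a severity confounder raises both the chance of treatment and the chance of hospitalization, the true treatment effect is about −0.10, and the difference-in-means estimator is computed under randomized versus severity-driven assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Severity confounds both treatment and outcome (hypothetical model).
severity = rng.uniform(0, 1, n)

# Potential outcomes: treatment lowers hospitalization risk by 0.10.
y0 = (rng.uniform(0, 1, n) < 0.2 + 0.5 * severity).astype(float)
y1 = (rng.uniform(0, 1, n) < 0.1 + 0.5 * severity).astype(float)
true_ate = (y1 - y0).mean()  # close to -0.10 by construction

# Randomized assignment: difference in means recovers the true effect.
t_rct = rng.uniform(0, 1, n) < 0.5
ate_rct = y1[t_rct].mean() - y0[~t_rct].mean()

# Confounded assignment: sicker patients are treated more often.
t_obs = rng.uniform(0, 1, n) < severity
ate_obs = y1[t_obs].mean() - y0[~t_obs].mean()  # sign flips to positive

print(f"true ATE:                          {true_ate:+.3f}")
print(f"difference in means (randomized):  {ate_rct:+.3f}")
print(f"difference in means (confounded):  {ate_obs:+.3f}")
```

With severity-driven assignment the naive comparison is not merely attenuated — it has the wrong sign, because the treated group is sicker at baseline.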
The progressive project milestone for this chapter is modest but essential: formulate the causal question clearly, identify the confounders, and demonstrate — as we did in Section 15.9 — that standard offline evaluation cannot answer it. The heavy implementation begins in Chapter 16.
Prediction ≠ Causation: If you take one idea from this chapter into the rest of your career, let it be this: every time you build a prediction model that will inform a decision, ask yourself whether the decision requires a causal answer. If it does — and it usually does — the prediction model is not enough. You need causal inference. The remainder of Part III teaches you how.
Summary
This chapter has argued that the most common mode of applied data science — build a prediction model, use it to make decisions — is fundamentally flawed when the decision requires understanding causation. We demonstrated this through five interconnected ideas:
- Prediction and causation are different questions. A model that predicts $P(Y \mid X)$ learns associations from all three sources (causal, reverse causal, and confounded). Only causal analysis can isolate the effect of intervention.
- Confounding biases naive comparisons. When treatment assignment depends on factors that also affect outcomes, comparing treated and untreated groups gives wrong answers — sometimes with the wrong sign (Simpson's paradox).
- The fundamental problem of causal inference — we can never observe both potential outcomes for the same individual — makes causal inference a missing data problem that cannot be solved by more data or better models alone. It requires assumptions.
- Colliders and selection bias provide additional structural traps: conditioning on the wrong variables introduces bias rather than removing it.
- The Ladder of Causation — association, intervention, counterfactual — formalizes the hierarchy of causal reasoning and establishes that higher rungs require more than data; they require causal models.
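The sign reversal behind Simpson's paradox takes only a toy table to reproduce. The counts below are hypothetical: within each severity stratum the treated group does better, yet pooling the strata reverses the comparison, because sicker patients are treated far more often.

```python
# Hypothetical counts per stratum:
# (treated_n, treated_hospitalized, control_n, control_hospitalized)
strata = {
    "mild":   (100, 10, 900, 135),   # treated 10.0% vs control 15.0%
    "severe": (900, 450, 100, 60),   # treated 50.0% vs control 60.0%
}

for name, (tn, th, cn, ch) in strata.items():
    print(f"{name:>6}: treated {th / tn:.1%} vs control {ch / cn:.1%}")

# Pooling ignores severity and reverses the conclusion.
tn = sum(s[0] for s in strata.values())
th = sum(s[1] for s in strata.values())
cn = sum(s[2] for s in strata.values())
ch = sum(s[3] for s in strata.values())
print(f"pooled: treated {th / tn:.1%} vs control {ch / cn:.1%}")  # 46.0% vs 19.5%
```

The drug looks harmful in the pooled data only because 90% of treated patients are severe while 90% of controls are mild — exactly the confounding structure the chapter warns against.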
The chapter opened with a hospital that deployed a prediction model to guide an intervention. The model was accurate. The deployment was harmful. This is not an edge case. It is the default outcome when prediction models are used to answer causal questions. The rest of Part III provides the tools to do better.