In This Chapter
- Learning Objectives
- 17.1 Two Frameworks for Causation
- 17.2 Directed Acyclic Graphs
- 17.3 Structural Causal Models
- 17.4 The Three Elementary Junction Types
- 17.5 Paths and d-Separation
- 17.6 The Markov Condition and Faithfulness
- 17.7 The Backdoor Criterion
- 17.8 Good Controls vs. Bad Controls
- 17.9 The Front-Door Criterion
- 17.10 The do-Operator and Interventions
- 17.11 DoWhy: Implementing Graphical Causal Models in Python
- 17.12 Climate Systems: When DAGs Break Down
- 17.13 Progressive Project: StreamRec Causal DAG
- 17.14 Summary
Chapter 17: Graphical Causal Models — DAGs, d-Separation, and Structural Causal Models
"You cannot answer a question about interventions using only probability distributions. You need a causal model." — Judea Pearl, Causality (2009)
Learning Objectives
By the end of this chapter, you will be able to:
- Construct causal DAGs from domain knowledge and identify causal, confounding, and collider paths
- Apply d-separation to determine the conditional independencies implied by a causal graph
- Use the backdoor criterion to identify valid adjustment sets for causal estimation
- Apply the front-door criterion when the backdoor criterion fails
- Formalize interventions using the do-calculus and distinguish $\text{do}(X = x)$ from conditioning on $X = x$
17.1 Two Frameworks for Causation
Chapter 16 introduced the potential outcomes framework: define $Y(0)$ and $Y(1)$, state assumptions (SUTVA, ignorability, positivity), estimate the ATE. The framework is powerful, but it treats the causal structure — which variables are confounders, which are colliders, which are mediators — as something the analyst must reason about informally, outside the mathematics. The question "which covariates should I adjust for?" has no formal answer within the potential outcomes framework alone.
This chapter introduces a complementary framework that makes causal structure its central object: graphical causal models, developed primarily by Judea Pearl and colleagues (Pearl, 1988, 1995, 2000, 2009; Spirtes, Glymour, and Scheines, 1993). Where Rubin's framework asks "what are the potential outcomes?", Pearl's framework asks "what is the causal graph?" — and from the graph, derives which variables to adjust for, which adjustment strategies are valid, and what the causal effect is as a function of observational distributions.
The two frameworks are not competing theories. They are complementary languages for the same underlying reality. Rubin's framework excels at defining estimands precisely and formalizing the assumptions needed for estimation. Pearl's framework excels at encoding domain knowledge about causal structure and deriving what is and is not identifiable from observational data. A serious practitioner needs both.
Prediction $\neq$ Causation: The potential outcomes framework (Chapter 16) formalized what a causal effect is and when it can be estimated. This chapter formalizes how to reason about causal structure — which paths transmit causal information and which transmit spurious associations — using directed acyclic graphs. The graph encodes assumptions that determine whether a causal effect is identifiable and what statistical adjustments make it so. Without this structural reasoning, the advice "control for confounders" is incomplete: you must know which variables are confounders and which are colliders, and getting this wrong introduces bias rather than removing it.
17.2 Directed Acyclic Graphs
Definition
A directed acyclic graph (DAG) is a graph $\mathcal{G} = (V, E)$ consisting of:
- A set of vertices (nodes) $V = \{V_1, V_2, \ldots, V_p\}$, each representing a random variable.
- A set of directed edges $E \subseteq V \times V$, where $(V_i, V_j)$ denotes an arrow from $V_i$ to $V_j$.
- The acyclicity constraint: there is no directed path from any vertex back to itself. That is, there is no sequence $V_i \to V_{j_1} \to V_{j_2} \to \cdots \to V_i$.
Acyclicity is not a mere technical convenience: it encodes the assumption that the causal system contains no feedback loops at the time scale of analysis. A thermostat causes temperature changes, and temperature causes thermostat adjustments, but we can still represent this system as a DAG by unrolling it over time (temperature at time $t$ causes the thermostat setting at time $t+1$, which causes temperature at time $t+2$). Truly simultaneous feedback requires different formalisms (see Section 17.12 on limitations).
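The acyclicity constraint can be checked mechanically with a topological sort. The sketch below uses Kahn's algorithm on a plain edge list; the helper name `is_acyclic` and the thermostat node labels are illustrative assumptions, not part of any library.

```python
from collections import defaultdict, deque


def is_acyclic(edges: list[tuple[str, str]]) -> bool:
    """Return True if the directed graph given by `edges` has no directed cycle.

    Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    The graph is a DAG exactly when every node can be removed this way.
    """
    children = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1
        nodes.update((parent, child))

    queue = deque(node for node in nodes if in_degree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)
    return removed == len(nodes)


# Feedback loop written without time indices: contains a directed cycle.
feedback = [("thermostat", "temperature"), ("temperature", "thermostat")]

# The same system unrolled over time steps: no node can reach itself.
unrolled = [
    ("temperature_t", "thermostat_t+1"),
    ("thermostat_t+1", "temperature_t+2"),
]

print(is_acyclic(feedback))  # False
print(is_acyclic(unrolled))  # True
```

Unrolling works because every edge points forward in time, so no directed path can return to its starting node.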
Terminology
For a DAG with an edge $A \to B$:
- $A$ is a parent of $B$; $B$ is a child of $A$.
- If there is a directed path from $A$ to $B$ (possibly through intermediaries), $A$ is an ancestor of $B$ and $B$ is a descendant of $A$.
- A node with no parents is a root or exogenous variable.
- A node with parents is an endogenous variable.
A First Example: The MediCore DAG
MediCore Pharmaceuticals is evaluating Drug X's effect on hospitalization. Based on domain knowledge (clinical expertise, prior literature, institutional knowledge of prescribing behavior), the clinical team constructs the following causal DAG:
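Edges of the MediCore DAG:
- Disease Severity $\to$ Drug X, Disease Severity $\to$ Biomarker, Disease Severity $\to$ Hospitalization
- Age $\to$ Drug X, Age $\to$ Hospitalization, Age $\to$ Insurance Status
- Comorbidities $\to$ Hospitalization, Comorbidities $\to$ Insurance Status
- Drug X $\to$ Biomarker
- Biomarker $\to$ Hospitalization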
Reading this DAG:
- Disease Severity causes Drug X prescription (sicker patients receive Drug X), causes biomarker levels, and directly causes hospitalization.
- Drug X changes the Biomarker, which in turn affects Hospitalization; the net effect along this path is to reduce Hospitalization. The causal path of interest is Drug X $\to$ Biomarker $\to$ Hospitalization.
- Age affects both Drug X prescription (prescribing patterns differ for older patients) and Hospitalization (older patients have a higher hospitalization risk). Age also affects Insurance Status.
- Comorbidities affect Hospitalization directly and affect Insurance Status.
- Insurance Status is caused by Age and Comorbidities — it is a collider on the path Age $\to$ Insurance Status $\leftarrow$ Comorbidities.
This graph encodes substantive claims about the world. Each missing edge is an assertion of no direct causal effect. The assertion that there is no arrow from Insurance Status to Drug X prescribing, for example, is a domain knowledge claim that can be debated. The power of the DAG is that it makes these claims explicit and allows us to derive their statistical consequences.
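To make the terminology concrete, the sketch below encodes the MediCore DAG as an edge list transcribed from the description above and computes roots, parents, descendants, and ancestors. The lowercase node names and the helper functions are illustrative choices, not an established API.

```python
# Edge list transcribed from the "Reading this DAG" description (parent, child).
MEDICORE_EDGES = [
    ("severity", "drug"),
    ("severity", "biomarker"),
    ("severity", "hospitalization"),
    ("age", "drug"),
    ("age", "hospitalization"),
    ("age", "insurance"),
    ("comorbidities", "hospitalization"),
    ("comorbidities", "insurance"),
    ("drug", "biomarker"),
    ("biomarker", "hospitalization"),
]


def parents(node, edges):
    """Direct causes of `node`."""
    return {a for a, b in edges if b == node}


def descendants(node, edges):
    """All nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for a, b in edges:
            if a == current and b not in found:
                found.add(b)
                stack.append(b)
    return found


def ancestors(node, edges):
    """All nodes with a directed path into `node` (descendants in the reversed graph)."""
    return descendants(node, [(b, a) for a, b in edges])


nodes = sorted({n for edge in MEDICORE_EDGES for n in edge})
roots = [n for n in nodes if not parents(n, MEDICORE_EDGES)]

print(roots)
# ['age', 'comorbidities', 'severity']
print(sorted(parents("hospitalization", MEDICORE_EDGES)))
# ['age', 'biomarker', 'comorbidities', 'severity']
print(sorted(descendants("drug", MEDICORE_EDGES)))
# ['biomarker', 'hospitalization']
print(sorted(ancestors("insurance", MEDICORE_EDGES)))
# ['age', 'comorbidities']
```

The absence of `("insurance", "drug")` from the list is exactly the kind of missing-edge assumption discussed above: changing it means editing one line, and every downstream derivation changes with it.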
Understanding Why: The DAG is not merely a visualization. It is a formal mathematical object from which we can derive testable predictions (conditional independencies), identify which variables to adjust for (backdoor criterion), and compute causal effects from observational data (do-calculus). A DAG without formal analysis is just a pretty picture. A DAG with formal analysis is a reasoning engine.
17.3 Structural Causal Models
The DAG is the graphical representation of a deeper mathematical object: the structural causal model (SCM).
Definition
A structural causal model $\mathcal{M} = (U, V, F, P(U))$ consists of:
- Exogenous variables $U = \{U_1, U_2, \ldots\}$: variables determined outside the model. Each captures all unmodeled causes of an endogenous variable.
- Endogenous variables $V = \{V_1, V_2, \ldots, V_p\}$: variables determined within the model.
- Structural equations $F = \{f_1, f_2, \ldots, f_p\}$: one equation per endogenous variable:
$$V_i = f_i(\text{pa}(V_i), U_i)$$
where $\text{pa}(V_i)$ denotes the parents of $V_i$ in the DAG, and $U_i$ is the exogenous noise term for $V_i$.
- Distribution over exogenous variables $P(U)$: typically, the $U_i$ are assumed mutually independent.
The structural equations are not regression equations. They represent causal mechanisms — autonomous processes that generate the value of each variable from its direct causes plus noise. The direction of the arrow is not a statistical assertion (correlation) but a causal one (mechanism).
MediCore SCM
A simplified SCM for the MediCore system:
$$\text{Severity} = U_S$$
$$\text{Drug X} = f_D(\text{Severity}, \text{Age}, U_D)$$
$$\text{Biomarker} = f_B(\text{Severity}, \text{Drug X}, U_B)$$
$$\text{Hosp} = f_H(\text{Severity}, \text{Biomarker}, \text{Age}, \text{Comorbidities}, U_H)$$
Each equation says: the variable on the left is determined by its parents plus noise. The drug changes the biomarker, and the biomarker in turn affects hospitalization. The total causal effect of Drug X on Hospitalization flows through the mediating path Drug X $\to$ Biomarker $\to$ Hospitalization.
Linear SCM Example
For expository clarity, consider a linear SCM:
import numpy as np
import pandas as pd
from numpy.linalg import lstsq


def generate_medicore_linear_scm(
    n: int = 10000,
    beta_drug_biomarker: float = -2.0,
    beta_biomarker_hosp: float = 0.5,
    beta_severity_drug: float = 1.5,
    beta_severity_biomarker: float = 1.0,
    beta_severity_hosp: float = 2.0,
    beta_age_drug: float = 0.3,
    beta_age_hosp: float = 0.5,
    seed: int = 42,
) -> pd.DataFrame:
    """Generate data from a linear SCM for MediCore.

    The true causal effect of Drug X on Hospitalization is:
        beta_drug_biomarker * beta_biomarker_hosp = -2.0 * 0.5 = -1.0
    (Drug X reduces hospitalization risk through the biomarker pathway.)

    Severity confounds Drug X and Hospitalization.
    Age confounds Drug X and Hospitalization.

    Args:
        n: Number of patients.
        beta_drug_biomarker: Causal effect of drug on biomarker.
        beta_biomarker_hosp: Causal effect of biomarker on hospitalization.
        beta_severity_drug: Effect of severity on drug assignment.
        beta_severity_biomarker: Effect of severity on biomarker.
        beta_severity_hosp: Effect of severity on hospitalization.
        beta_age_drug: Effect of age on drug assignment.
        beta_age_hosp: Effect of age on hospitalization.
        seed: Random seed.

    Returns:
        DataFrame with all variables.
    """
    rng = np.random.RandomState(seed)

    # Exogenous variables
    severity = rng.normal(0, 1, n)
    age = rng.normal(60, 10, n)
    comorbidities = rng.poisson(2, n).astype(float)

    # Drug assignment (confounded by severity and age)
    drug_logit = (
        beta_severity_drug * severity
        + beta_age_drug * (age - 60) / 10
        + rng.normal(0, 1, n)
    )
    drug = (drug_logit > 0).astype(float)

    # Biomarker (caused by drug and severity)
    biomarker = (
        beta_drug_biomarker * drug
        + beta_severity_biomarker * severity
        + rng.normal(0, 1, n)
    )

    # Hospitalization (caused by biomarker, severity, age, comorbidities)
    hosp_latent = (
        beta_biomarker_hosp * biomarker
        + beta_severity_hosp * severity
        + beta_age_hosp * (age - 60) / 10
        + 0.3 * comorbidities
        + rng.normal(0, 1, n)
    )
    hospitalization = hosp_latent  # continuous for clarity

    return pd.DataFrame({
        "severity": severity,
        "age": age,
        "comorbidities": comorbidities,
        "drug": drug,
        "biomarker": biomarker,
        "hospitalization": hospitalization,
    })


data = generate_medicore_linear_scm()

# True causal effect of Drug X on Hospitalization:
# Drug -> Biomarker -> Hosp = (-2.0) * (0.5) = -1.0
print("True causal effect of Drug X on Hospitalization: -1.0")
print(f" (via Drug -> Biomarker -> Hosp = {-2.0} * {0.5} = {-2.0 * 0.5})")

# Naive regression of Hosp on Drug (ignoring confounders)
X_naive = np.column_stack([np.ones(len(data)), data["drug"].values])
beta_naive, _, _, _ = lstsq(X_naive, data["hospitalization"].values, rcond=None)
print(f"\nNaive regression (Hosp ~ Drug):")
print(f" Drug coefficient: {beta_naive[1]:.3f} (biased by confounding)")

# Correct adjustment: control for severity and age (backdoor criterion)
X_adj = np.column_stack([
    np.ones(len(data)),
    data["drug"].values,
    data["severity"].values,
    data["age"].values,
])
beta_adj, _, _, _ = lstsq(X_adj, data["hospitalization"].values, rcond=None)
print(f"\nAdjusted regression (Hosp ~ Drug + Severity + Age):")
print(f" Drug coefficient: {beta_adj[1]:.3f} (close to true effect -1.0)")
True causal effect of Drug X on Hospitalization: -1.0
(via Drug -> Biomarker -> Hosp = -2.0 * 0.5 = -1.0)
Naive regression (Hosp ~ Drug):
Drug coefficient: approx. 2.4 (biased by confounding)
Adjusted regression (Hosp ~ Drug + Severity + Age):
Drug coefficient: -1.011 (close to true effect -1.0)
The naive regression yields a positive coefficient — suggesting that Drug X increases hospitalization — because disease severity confounds the relationship. Sicker patients receive Drug X and are also more likely to be hospitalized. Adjusting for severity and age eliminates the confounding and recovers the true causal effect ($\approx -1.0$). But how do we know that adjusting for severity and age is correct? That is the question the rest of this chapter answers.
Common Misconception: "Structural equations are just regression equations with arrows drawn on them." They are not. A regression $Y = \alpha + \beta X + \varepsilon$ is a statistical statement about the conditional expectation of $Y$ given $X$. A structural equation $Y = f(X, U_Y)$ is a causal statement: the mechanism that generates $Y$ involves $X$ as a direct cause. The distinction matters because the structural equation tells us what happens under intervention (if we set $X = x$ by force, $Y$ changes to $f(x, U_Y)$), while the regression only tells us what to expect when we observe $X = x$ (which may differ due to confounding).
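The difference between observing $X = x$ and setting $X = x$ can be sketched numerically. The toy SCM below ($Z = U_Z$, $X = Z + U_X$, $Y = 2X + 3Z + U_Y$) and the helper `simulate` are assumptions made purely for illustration; the intervention is implemented by replacing the structural equation for $X$ with a constant while leaving every other mechanism untouched.

```python
import numpy as np


def simulate(n, rng, do_x=None):
    """Sample from the toy SCM  Z = U_Z,  X = Z + U_X,  Y = 2 X + 3 Z + U_Y.

    If `do_x` is given, the structural equation for X is replaced by the
    constant `do_x`; the equations for Z and Y are left unchanged.
    """
    z = rng.normal(size=n)
    if do_x is None:
        x = z + rng.normal(size=n)       # observational mechanism for X
    else:
        x = np.full(n, float(do_x))      # intervention: X set by force
    y = 2.0 * x + 3.0 * z + rng.normal(size=n)
    return x, y


rng = np.random.default_rng(0)
n = 200_000

# Observational regime: regression slope of Y on X (what we "expect when we see" X).
x_obs, y_obs = simulate(n, rng)
slope_obs = np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs, ddof=1)

# Interventional regime: contrast E[Y | do(X=1)] - E[Y | do(X=0)].
_, y_do1 = simulate(n, rng, do_x=1.0)
_, y_do0 = simulate(n, rng, do_x=0.0)
effect_do = y_do1.mean() - y_do0.mean()

print(f"Observational slope of Y on X: {slope_obs:.2f}")   # population value is 3.5
print(f"Interventional contrast:       {effect_do:.2f}")   # population value is 2.0
```

The structural coefficient on $X$ is 2, and the interventional contrast recovers it. The observational slope is inflated to about 3.5 because $Z$ drives both $X$ and $Y$.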
17.4 The Three Elementary Junction Types
Every path in a DAG passes through nodes via one of three elementary junctions. These three structures — and only these three — determine how causal and statistical information flows through the graph. Mastering them is the key to reading any causal DAG.
Junction 1: The Fork (Common Cause)
X <-- Z --> Y
$Z$ is a common cause of $X$ and $Y$. The fork creates a statistical association between $X$ and $Y$ — not because either causes the other, but because they share a common ancestor.
Statistical behavior:
- $X$ and $Y$ are marginally dependent: $X \not\!\perp\!\!\!\perp Y$.
- $X$ and $Y$ are conditionally independent given $Z$: $X \perp\!\!\!\perp Y \mid Z$.
Conditioning on the common cause "blocks" the flow of information through the fork. Once you know $Z$, learning $X$ tells you nothing new about $Y$.
MediCore example: Disease Severity is a common cause of Drug X assignment and Hospitalization. Drug X and Hospitalization are statistically associated (in the observational data) partly because of the fork through Severity. Conditioning on Severity blocks this spurious path.
Intuition: If it is raining ($Z$), the sidewalk is wet ($X$) and people carry umbrellas ($Y$). Wet sidewalks and umbrella use are therefore correlated. But if you know it is raining, learning that the sidewalk is wet tells you nothing additional about umbrella usage. The common cause explains both effects.
def demonstrate_fork(n: int = 50000, seed: int = 42) -> None:
    """Demonstrate the fork junction: X <- Z -> Y.

    X and Y are marginally dependent but conditionally
    independent given Z.
    """
    rng = np.random.RandomState(seed)

    # Common cause
    z = rng.normal(0, 1, n)

    # X and Y caused by Z (plus independent noise)
    x = 2 * z + rng.normal(0, 1, n)
    y = 3 * z + rng.normal(0, 1, n)

    # Marginal correlation
    corr_marginal = np.corrcoef(x, y)[0, 1]

    # Conditional correlation (residualize on Z)
    x_resid = x - 2 * z  # remove Z's effect on X
    y_resid = y - 3 * z  # remove Z's effect on Y
    corr_conditional = np.corrcoef(x_resid, y_resid)[0, 1]

    print("Fork: X <- Z -> Y")
    print(f" Marginal corr(X, Y): {corr_marginal:.4f}")
    print(f" Conditional corr(X, Y | Z): {corr_conditional:.4f}")
    print(f" (Expected: ~0 after conditioning on Z)")


demonstrate_fork()
Fork: X <- Z -> Y
Marginal corr(X, Y): approx. 0.85
Conditional corr(X, Y | Z): 0.0012
(Expected: ~0 after conditioning on Z)
Junction 2: The Chain (Mediation)
X --> Z --> Y
$Z$ is a mediator between $X$ and $Y$. The causal effect of $X$ on $Y$ flows through $Z$.
Statistical behavior:
- $X$ and $Y$ are marginally dependent: $X \not\!\perp\!\!\!\perp Y$.
- $X$ and $Y$ are conditionally independent given $Z$: $X \perp\!\!\!\perp Y \mid Z$.
Conditioning on the mediator "blocks" the causal path from $X$ to $Y$. This is a critical point: conditioning on a mediator in a causal DAG removes the very effect you are trying to estimate.
MediCore example: Drug X $\to$ Biomarker $\to$ Hospitalization. The drug's effect on hospitalization operates through the biomarker. If you condition on the biomarker (e.g., control for it in a regression), you block the causal pathway. Once the confounders are also adjusted for, the coefficient on Drug X in a regression such as $\text{Hosp} \sim \text{Drug} + \text{Biomarker} + \text{Severity} + \text{Age}$ will be near zero: not because Drug X has no effect, but because you have conditioned away the mechanism through which it operates.
Intuition: Weight loss ($X$) lowers blood pressure ($Z$), which reduces heart attack risk ($Y$). If you control for blood pressure, the effect of weight loss on heart attacks disappears, because you have blocked the mediating pathway.
def demonstrate_chain(n: int = 50000, seed: int = 42) -> None:
    """Demonstrate the chain junction: X -> Z -> Y.

    X and Y are marginally dependent but conditionally
    independent given Z.
    """
    rng = np.random.RandomState(seed)

    x = rng.normal(0, 1, n)
    z = 2 * x + rng.normal(0, 1, n)    # X causes Z
    y = 1.5 * z + rng.normal(0, 1, n)  # Z causes Y

    # True total effect of X on Y = 2 * 1.5 = 3.0
    corr_marginal = np.corrcoef(x, y)[0, 1]

    # Conditioning on Z blocks the causal path
    y_resid = y - 1.5 * z
    corr_conditional = np.corrcoef(x, y_resid)[0, 1]

    print("Chain: X -> Z -> Y")
    print(f" Marginal corr(X, Y): {corr_marginal:.4f}")
    print(f" Conditional corr(X, Y | Z): {corr_conditional:.4f}")
    print(f" (Expected: ~0 after conditioning on Z)")


demonstrate_chain()
Chain: X -> Z -> Y
Marginal corr(X, Y): approx. 0.86
Conditional corr(X, Y | Z): -0.0011
(Expected: ~0 after conditioning on Z)
Junction 3: The Collider (Common Effect)
X --> Z <-- Y
$Z$ is a common effect of $X$ and $Y$ — a collider on this path.
Statistical behavior:
- $X$ and $Y$ are marginally independent: $X \perp\!\!\!\perp Y$.
- $X$ and $Y$ are conditionally dependent given $Z$: $X \not\!\perp\!\!\!\perp Y \mid Z$.
This is the opposite of forks and chains. Conditioning on a collider opens a path that was previously blocked, creating a spurious association between $X$ and $Y$.
MediCore example: Age $\to$ Insurance Status $\leftarrow$ Comorbidities. Age and Comorbidities may be independent in the population (or weakly dependent). But among patients with the same insurance status, age and comorbidities become associated: a young patient with expensive insurance likely has severe comorbidities, while an old patient with expensive insurance may have it due to age alone. Conditioning on insurance status creates a spurious link.
Intuition: Talent ($X$) and luck ($Y$) are independent. Both contribute to becoming famous ($Z$). Among famous people, talent and luck are negatively correlated: a famous person who is not talented must have been very lucky, and vice versa. Conditioning on the collider "fame" creates an association between its independent causes.
def demonstrate_collider(n: int = 50000, seed: int = 42) -> None:
    """Demonstrate the collider junction: X -> Z <- Y.

    X and Y are marginally independent but conditionally
    dependent given Z.
    """
    rng = np.random.RandomState(seed)

    # Independent causes
    x = rng.normal(0, 1, n)
    y = rng.normal(0, 1, n)

    # Collider: caused by both X and Y
    z = x + y + rng.normal(0, 0.5, n)

    # Marginal correlation (should be ~0)
    corr_marginal = np.corrcoef(x, y)[0, 1]

    # Conditional correlation (residualize on Z)
    # After conditioning on Z, X and Y become dependent
    from numpy.linalg import lstsq as np_lstsq

    # Regress X on Z and Y on Z, then correlate residuals
    A_x = np.column_stack([np.ones(n), z])
    beta_x, _, _, _ = np_lstsq(A_x, x, rcond=None)
    x_resid = x - A_x @ beta_x
    beta_y, _, _, _ = np_lstsq(A_x, y, rcond=None)
    y_resid = y - A_x @ beta_y
    corr_conditional = np.corrcoef(x_resid, y_resid)[0, 1]

    print("Collider: X -> Z <- Y")
    print(f" Marginal corr(X, Y): {corr_marginal:.4f}")
    print(f" Conditional corr(X, Y | Z): {corr_conditional:.4f}")
    print(f" (Expected: substantial negative correlation after conditioning)")


demonstrate_collider()
Collider: X -> Z <- Y
Marginal corr(X, Y): 0.0034
Conditional corr(X, Y | Z): approx. -0.80
(Expected: substantial negative correlation after conditioning)
Summary of Junction Types
| Junction | Structure | Marginal | Conditional on middle node |
|---|---|---|---|
| Fork (common cause) | $X \leftarrow Z \rightarrow Y$ | $X \not\!\perp\!\!\!\perp Y$ | $X \perp\!\!\!\perp Y \mid Z$ |
| Chain (mediation) | $X \rightarrow Z \rightarrow Y$ | $X \not\!\perp\!\!\!\perp Y$ | $X \perp\!\!\!\perp Y \mid Z$ |
| Collider (common effect) | $X \rightarrow Z \leftarrow Y$ | $X \perp\!\!\!\perp Y$ | $X \not\!\perp\!\!\!\perp Y \mid Z$ |
The critical pattern: forks and chains are open by default and closed by conditioning; colliders are closed by default and opened by conditioning. This asymmetry is the engine of d-separation.
Research Insight: The collider bias is not merely a theoretical curiosity. Griffith, Morris, and Thorn (2023) identify collider bias as the explanation for several famous epidemiological paradoxes, including the "obesity paradox" (obese patients appear to have better outcomes after heart attack, because hospitalization is a collider: being both obese and having a heart attack selects for otherwise healthy obese individuals). Collider bias appears whenever analysis is restricted to a selected subpopulation. See also Elwert and Winship (2014), "Endogenous Selection Bias: The Problem of Conditioning on a Collider Variable."
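Restricting analysis to a selected subpopulation is a form of conditioning on a collider. The sketch below reuses the talent, luck, and fame story with made-up parameters: fame is the sum of talent, luck, and noise, and "famous" is taken (as an assumption for illustration) to mean fame above 2.0. The correlation between talent and luck is computed in the full population and again among the selected individuals only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Independent causes of the collider.
talent = rng.normal(size=n)
luck = rng.normal(size=n)

# Collider: fame depends on both causes plus independent noise.
fame = talent + luck + rng.normal(scale=0.5, size=n)

# Selection: keep only individuals whose fame exceeds a threshold.
selected = fame > 2.0

corr_all = np.corrcoef(talent, luck)[0, 1]
corr_selected = np.corrcoef(talent[selected], luck[selected])[0, 1]

print(f"Share selected:                 {selected.mean():.3f}")
print(f"corr(talent, luck), everyone:   {corr_all:.3f}")       # close to 0
print(f"corr(talent, luck), famous:     {corr_selected:.3f}")  # clearly negative
```

No regression adjustment is involved here: simply filtering the rows on the collider is enough to induce the negative association.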
17.5 Paths and d-Separation
Paths in a DAG
A path in a DAG is any sequence of adjacent nodes connected by edges, regardless of edge direction. A path from $X$ to $Y$ may follow arrows forward, backward, or both.
Consider a DAG with edges $A \to B \to C \leftarrow D$. The sequence $A - B - C - D$ is a path. The path passes through $B$ as part of a chain ($A \to B \to C$) and through $C$ as a collider ($B \to C \leftarrow D$).
A directed path (or causal path) follows all arrows in the forward direction: $A \to B \to C$ is a directed path from $A$ to $C$.
A backdoor path from $X$ to $Y$ is any path that starts with an arrow into $X$ (i.e., $X \leftarrow \cdots$). Backdoor paths are non-causal: they transmit statistical association between $X$ and $Y$ that does not reflect the causal effect of $X$ on $Y$. The backdoor criterion (Section 17.7) provides a systematic way to block these paths.
d-Separation: The Formal Algorithm
d-separation (directional separation) is the algorithm for reading conditional independence relations from a DAG. It answers the question: "Given that we condition on a set of variables $\mathbf{Z}$, are $X$ and $Y$ conditionally independent?"
Mathematical Foundation: $X$ and $Y$ are d-separated by a set $\mathbf{Z}$ in a DAG $\mathcal{G}$ (written $X \perp_{\mathcal{G}} Y \mid \mathbf{Z}$) if and only if every path between $X$ and $Y$ is blocked by $\mathbf{Z}$. A path is blocked by $\mathbf{Z}$ if it contains at least one of the following:
- A chain $\cdots \to M \to \cdots$ or a fork $\cdots \leftarrow M \rightarrow \cdots$ where $M \in \mathbf{Z}$ (the middle node is conditioned on).
- A collider $\cdots \to M \leftarrow \cdots$ where $M \notin \mathbf{Z}$ and no descendant of $M$ is in $\mathbf{Z}$.
Equivalently, a path is active (d-connected) given $\mathbf{Z}$ if:
- Every non-collider on the path is not in $\mathbf{Z}$.
- Every collider on the path is in $\mathbf{Z}$ (or has a descendant in $\mathbf{Z}$).
The algorithm is mechanical: enumerate all paths between $X$ and $Y$, check each path for blocking, and declare d-separation if and only if every path is blocked.
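The path-enumeration procedure can be written down almost verbatim. The following is a minimal sketch, not an optimized implementation: it enumerates every simple path in the undirected skeleton and applies the two blocking rules to each interior node. The edge set is the MediCore DAG transcribed from Section 17.2, and all function names are illustrative.

```python
MEDICORE_EDGES = {
    ("severity", "drug"), ("severity", "biomarker"),
    ("severity", "hospitalization"), ("age", "drug"),
    ("age", "hospitalization"), ("age", "insurance"),
    ("comorbidities", "hospitalization"), ("comorbidities", "insurance"),
    ("drug", "biomarker"), ("biomarker", "hospitalization"),
}


def descendants(node, edges):
    """Return every node reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def all_paths(x, y, edges):
    """Yield every simple path between x and y, ignoring edge direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)

    def extend(path):
        if path[-1] == y:
            yield path
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                yield from extend(path + [nxt])

    yield from extend([x])


def is_blocked(path, z, edges):
    """Apply the two blocking rules to each interior node of `path`."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edges and (nxt, mid) in edges
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is in z.
            if mid not in z and not (descendants(mid, edges) & z):
                return True
        elif mid in z:
            # A chain or fork node blocks when it is conditioned on.
            return True
    return False


def d_separated(x, y, z, edges):
    """True if every path between x and y is blocked by the set z."""
    z = set(z)
    return all(is_blocked(path, z, edges) for path in all_paths(x, y, edges))


print(d_separated("age", "comorbidities", set(), MEDICORE_EDGES))          # True
print(d_separated("age", "comorbidities", {"insurance"}, MEDICORE_EDGES))  # False
print(d_separated("drug", "hospitalization", set(), MEDICORE_EDGES))       # False
```

Path enumeration grows exponentially with graph size, so this approach is only suitable for small teaching graphs; it is used here because it mirrors the definition step by step.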
d-Separation: Worked Examples
Consider the MediCore DAG. Let us check d-separation for several queries.
Query 1: Is Drug X d-separated from Hospitalization by $\emptyset$ (no conditioning)?
Paths from Drug X to Hospitalization:
- Drug X $\to$ Biomarker $\to$ Hospitalization (chain; active with no conditioning)
- Drug X $\leftarrow$ Severity $\to$ Hospitalization (fork through Severity; active, since Severity is not conditioned on)
- Drug X $\leftarrow$ Severity $\to$ Biomarker $\to$ Hospitalization (fork at Severity followed by a chain through Biomarker; active)
- Drug X $\leftarrow$ Age $\to$ Hospitalization (fork through Age; active)
- Drug X $\to$ Biomarker $\leftarrow$ Severity $\to$ Hospitalization (collider at Biomarker; blocked with no conditioning)
- Drug X $\leftarrow$ Age $\to$ Insurance Status $\leftarrow$ Comorbidities $\to$ Hospitalization (collider at Insurance Status; blocked with no conditioning)
Four paths are active, so Drug X and Hospitalization are d-connected given $\emptyset$. The observed association between Drug X and Hospitalization is a mixture of causal and confounding information.
Query 2: Is Drug X d-separated from Hospitalization by $\{$Severity, Age$\}$?
Re-examine each path:
- Drug X $\to$ Biomarker $\to$ Hospitalization: a chain, and Biomarker is not in $\mathbf{Z}$, so this causal path remains active. This is the path we want to keep open.
- Drug X $\leftarrow$ Severity $\to$ Hospitalization: fork through Severity, and Severity $\in \mathbf{Z}$, so this path is blocked.
- Drug X $\leftarrow$ Severity $\to$ Biomarker $\to$ Hospitalization: Severity $\in \mathbf{Z}$, so the fork is blocked. Path is blocked.
- Drug X $\leftarrow$ Age $\to$ Hospitalization: fork through Age, and Age $\in \mathbf{Z}$, so this path is blocked.
- Drug X $\to$ Biomarker $\leftarrow$ Severity $\to$ Hospitalization: the collider at Biomarker is not conditioned on, and Severity $\in \mathbf{Z}$. Path is blocked.
- Drug X $\leftarrow$ Age $\to$ Insurance Status $\leftarrow$ Comorbidities $\to$ Hospitalization: Age $\in \mathbf{Z}$, and the collider at Insurance Status is not conditioned on. Path is blocked.
After blocking, only the causal path remains active. Drug X and Hospitalization are d-connected given $\{$Severity, Age$\}$, but the only active path is the causal one. This means the conditional association $P(\text{Hosp} \mid \text{Drug}, \text{Severity}, \text{Age})$ reflects the causal effect.
Query 3: Is Drug X d-separated from Hospitalization by $\{$Severity, Age, Biomarker$\}$?
Path 1: Drug X $\to$ Biomarker $\to$ Hospitalization. Biomarker $\in \mathbf{Z}$, so this chain is blocked. We have blocked the causal pathway!
Every other path is still blocked by Severity or Age, so Drug X and Hospitalization are d-separated given this set. The adjustment set is wrong for estimating the total effect: it conditions on the mediator, removing the causal signal. The regression coefficient on Drug X, controlling for Biomarker, Severity, and Age, will be near zero.
# Demonstrate the danger of conditioning on a mediator
X_bad = np.column_stack([
    np.ones(len(data)),
    data["drug"].values,
    data["severity"].values,
    data["age"].values,
    data["biomarker"].values,  # mediator -- should NOT control for this
])
beta_bad, _, _, _ = lstsq(X_bad, data["hospitalization"].values, rcond=None)
print("Conditioning on the mediator (biomarker):")
print(f" Drug coefficient: {beta_bad[1]:.3f} (true total effect is -1.0)")
print(f" Biomarker coeff: {beta_bad[4]:.3f}")
print(" The drug's effect has been 'explained away' by the mediator.")
Conditioning on the mediator (biomarker):
Drug coefficient: -0.023 (true total effect is -1.0)
Biomarker coeff: 0.502
The drug's effect has been 'explained away' by the mediator.
Common Misconception: "Always control for as many variables as possible." This is one of the most dangerous heuristics in applied data science. Controlling for a mediator blocks the causal path. Controlling for a collider opens a spurious path. The correct adjustment set depends on the causal structure — which is why the DAG is essential. Section 17.8 formalizes the distinction between good controls and bad controls.
17.6 The Markov Condition and Faithfulness
The connection between the DAG and the joint probability distribution is formalized by two assumptions.
The Causal Markov Condition
Given a causal DAG $\mathcal{G}$, every variable $V_i$ is conditionally independent of its non-descendants, given its parents:
$$V_i \perp\!\!\!\perp \text{NonDescendants}(V_i) \mid \text{Parents}(V_i)$$
Equivalently, the joint distribution factors as:
$$P(V_1, V_2, \ldots, V_p) = \prod_{i=1}^{p} P(V_i \mid \text{pa}(V_i))$$
The Markov condition says: once you know a variable's direct causes, no other non-descendant variable provides additional information about it. This is the probabilistic consequence of the structural equations.
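As a worked instance, write $S$ (Severity), $A$ (Age), $C$ (Comorbidities), $D$ (Drug X), $B$ (Biomarker), $I$ (Insurance Status), and $H$ (Hospitalization) for the nodes of the MediCore DAG from Section 17.2; these single-letter abbreviations are introduced here only for compactness. Reading each node's parents off the graph, the factorization becomes:

$$P(S, A, C, D, B, I, H) = P(S)\,P(A)\,P(C)\,P(D \mid S, A)\,P(B \mid S, D)\,P(I \mid A, C)\,P(H \mid S, A, C, B)$$

The three root nodes contribute marginal terms, and every other node is conditioned only on its parents. The local Markov statement for Insurance Status, which has no descendants, reads:

$$I \perp\!\!\!\perp \{S, D, B, H\} \mid \{A, C\}$$

That is, once Age and Comorbidities are known, Insurance Status carries no further information about any other variable in the graph.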
The Faithfulness Assumption
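The Markov condition tells us that d-separation implies conditional independence. The faithfulness assumption is the converse: if two variables are conditionally independent in the distribution, then they are d-separated in the graph. Formally:
$$X \perp\!\!\!\perp Y \mid \mathbf{Z} \quad \Rightarrow \quad X \perp_{\mathcal{G}} Y \mid \mathbf{Z}$$
Taken together, the Markov condition and faithfulness give the equivalence $X \perp\!\!\!\perp Y \mid \mathbf{Z} \Leftrightarrow X \perp_{\mathcal{G}} Y \mid \mathbf{Z}$.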
Faithfulness rules out "accidental" cancellations — situations where two active paths transmit exactly opposite associations that cancel to zero, making conditionally dependent variables appear independent. Such cancellations are possible in principle (they form a measure-zero set in the parameter space) but are typically assumed not to occur.
Advanced Sidebar: Faithfulness violations are not merely theoretical. In systems with fine-tuned parameters — such as engineered control systems, evolutionary equilibria, or models with exact symmetries — causal effects along different paths can cancel precisely. Uhler, Raskutti, Buhlmann, and Yu (2013) show that near-faithfulness violations (approximate cancellations) are more common and can mislead causal discovery algorithms. For applied work, faithfulness is a reasonable default assumption, but one to revisit if conditional independence tests yield surprising results.
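The kind of cancellation that faithfulness rules out is easy to construct by hand. In the sketch below (a toy linear SCM with coefficients chosen for illustration), $X$ affects $Y$ directly and through a mediator $M$. When the direct coefficient is exactly the negative of the indirect product $2 \times 1.5 = 3$, the two active paths cancel and $X$ and $Y$ appear uncorrelated even though they are d-connected.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.normal(size=n)
m = 2.0 * x + rng.normal(size=n)          # X -> M with coefficient 2

# Fine-tuned case: direct effect -3 exactly offsets the indirect effect 2 * 1.5 = 3.
y_cancel = -3.0 * x + 1.5 * m + rng.normal(size=n)

# Generic case: direct effect -2 leaves a net total effect of 1.
y_generic = -2.0 * x + 1.5 * m + rng.normal(size=n)

corr_cancel = np.corrcoef(x, y_cancel)[0, 1]
corr_generic = np.corrcoef(x, y_generic)[0, 1]

print(f"corr(X, Y), exact cancellation: {corr_cancel:.3f}")   # close to 0
print(f"corr(X, Y), generic parameters: {corr_generic:.3f}")  # clearly positive
```

A conditional independence test applied to the fine-tuned data would suggest that no path connects $X$ and $Y$, which is the mistake a discovery algorithm relying on faithfulness would make.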
17.7 The Backdoor Criterion
The backdoor criterion provides a sufficient condition for identifying a causal effect from observational data by specifying which variables to adjust for.
Definition
A set of variables $\mathbf{Z}$ satisfies the backdoor criterion relative to an ordered pair $(X, Y)$ in a DAG $\mathcal{G}$ if:
- No node in $\mathbf{Z}$ is a descendant of $X$.
- $\mathbf{Z}$ blocks every path between $X$ and $Y$ that contains an arrow into $X$ (i.e., every backdoor path).
The Backdoor Adjustment Formula
If $\mathbf{Z}$ satisfies the backdoor criterion relative to $(X, Y)$, then the causal effect is identified by:
$$P(Y \mid \text{do}(X = x)) = \sum_{\mathbf{z}} P(Y \mid X = x, \mathbf{Z} = \mathbf{z}) \, P(\mathbf{Z} = \mathbf{z})$$
For continuous variables, the sum becomes an integral:
$$P(Y \mid \text{do}(X = x)) = \int P(Y \mid X = x, \mathbf{z}) \, P(\mathbf{z}) \, d\mathbf{z}$$
This formula says: the causal effect of $X$ on $Y$ is the association between $X$ and $Y$ conditional on $\mathbf{Z}$, averaged over the distribution of $\mathbf{Z}$ in the population. It is the formal justification for regression adjustment (Chapter 16), propensity score methods (Chapter 18), and any other method that "controls for" confounders.
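The adjustment formula can be evaluated directly when all variables are discrete. The sketch below simulates a binary confounder $Z$, treatment $X$, and outcome $Y$ with made-up probabilities chosen so that the true effect of $X$ on $P(Y = 1)$ is 0.2, then compares the naive contrast with the backdoor-adjusted one. The function names and parameter values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400_000

# Z confounds X and Y; X raises P(Y = 1) by 0.2 at every level of Z.
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = rng.binomial(1, 0.1 + 0.2 * x + 0.5 * z)
df = pd.DataFrame({"z": z, "x": x, "y": y})


def naive(df: pd.DataFrame, x_val: int) -> float:
    """P(Y = 1 | X = x): conditioning only."""
    return float(df.loc[df["x"] == x_val, "y"].mean())


def backdoor(df: pd.DataFrame, x_val: int) -> float:
    """sum_z P(Y = 1 | X = x, Z = z) P(Z = z): the backdoor adjustment."""
    p_z = df["z"].value_counts(normalize=True)                    # P(Z = z)
    p_y_given_xz = df[df["x"] == x_val].groupby("z")["y"].mean()  # P(Y = 1 | x, z)
    return float((p_y_given_xz * p_z).sum())                      # aligned on z


naive_contrast = naive(df, 1) - naive(df, 0)
backdoor_contrast = backdoor(df, 1) - backdoor(df, 0)

print(f"Naive contrast:    {naive_contrast:.3f}")     # population value is 0.5
print(f"Backdoor contrast: {backdoor_contrast:.3f}")  # population value is 0.2
```

The key difference is the weighting: the naive estimate weights each stratum by $P(Z = z \mid X = x)$, while the backdoor formula weights by the population distribution $P(Z = z)$.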
MediCore: Applying the Backdoor Criterion
For the effect of Drug X on Hospitalization:
Backdoor paths (paths that begin with an arrow into Drug X):
- Drug X $\leftarrow$ Severity $\to$ Hospitalization
- Drug X $\leftarrow$ Severity $\to$ Biomarker $\to$ Hospitalization
- Drug X $\leftarrow$ Age $\to$ Hospitalization
- Drug X $\leftarrow$ Age $\to$ Insurance Status $\leftarrow$ Comorbidities $\to$ Hospitalization (blocked by the collider at Insurance Status unless Insurance Status is conditioned on)
Valid adjustment sets:
- $\{$Severity, Age$\}$: Blocks the first three backdoor paths; the fourth is blocked at Age and at the unconditioned collider. Does not include descendants of Drug X. Valid.
- $\{$Severity, Age, Comorbidities$\}$: Blocks all backdoor paths. Comorbidities is not a descendant of Drug X (it is a cause of Hospitalization but not on any causal path from Drug X). Valid, but includes a variable that is not needed for identification.
Invalid adjustment sets:
- $\{$Severity$\}$: Blocks paths 1 and 2, but path 3 (through Age) remains open. Not valid in this graph.
- $\{$Biomarker$\}$: Biomarker is a descendant of Drug X (Drug X $\to$ Biomarker). Violates the first condition. Conditioning on it blocks the causal pathway.
- $\{$Insurance Status$\}$: Leaves the backdoor paths through Severity and Age open. Insurance Status is also a collider (Age $\to$ Insurance Status $\leftarrow$ Comorbidities), so conditioning on it opens path 4 as well.
A cautionary case:
- $\{$Severity, Age, Insurance Status$\}$: Conditioning on Insurance Status opens the collider on path 4, but that path is still blocked at Age, so this set satisfies the backdoor criterion in this graph. Insurance Status contributes nothing to identification, however, and the set is fragile: if Age were dropped or poorly measured, the opened collider path would introduce bias.
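Checking the two conditions of the backdoor criterion can also be automated. The sketch below is one possible implementation, not a library routine: it tests the descendant condition directly, then removes the arrows leaving the treatment and tests d-separation with the standard ancestral moral graph construction (keep the ancestors of all involved nodes, connect co-parents, drop directions, delete the conditioning set, and test connectivity). Node names follow the MediCore DAG from Section 17.2; all helper names are illustrative.

```python
MEDICORE_EDGES = [
    ("severity", "drug"), ("severity", "biomarker"),
    ("severity", "hospitalization"), ("age", "drug"),
    ("age", "hospitalization"), ("age", "insurance"),
    ("comorbidities", "hospitalization"), ("comorbidities", "insurance"),
    ("drug", "biomarker"), ("biomarker", "hospitalization"),
]


def descendants_of(node, edges):
    """All nodes reachable from `node` along directed edges."""
    result, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for a, b in edges:
            if a == current and b not in result:
                result.add(b)
                frontier.append(b)
    return result


def ancestors_of(nodes, edges):
    """Return `nodes` together with all of their ancestors."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        current = frontier.pop()
        for a, b in edges:
            if b == current and a not in result:
                result.add(a)
                frontier.append(a)
    return result


def d_separated(x, y, z, edges):
    """d-separation test via the moralized ancestral graph."""
    keep = ancestors_of({x, y} | z, edges)
    sub = [(a, b) for a, b in edges if a in keep and b in keep]
    undirected = {frozenset(edge) for edge in sub}
    for child in keep:  # moralize: connect every pair of co-parents
        pars = [a for a, b in sub if b == child]
        for i, p in enumerate(pars):
            for q in pars[i + 1:]:
                undirected.add(frozenset((p, q)))
    reached, frontier = {x}, [x]
    while frontier:  # search from x, never stepping onto a node in z
        current = frontier.pop()
        for edge in undirected:
            if current in edge:
                (other,) = edge - {current}
                if other not in z and other not in reached:
                    reached.add(other)
                    frontier.append(other)
    return y not in reached


def satisfies_backdoor(x, y, z, edges):
    """Check both conditions of the backdoor criterion for the set z."""
    z = set(z)
    if z & descendants_of(x, edges):
        return False  # condition 1: no descendants of the treatment
    no_out_edges = [(a, b) for a, b in edges if a != x]
    return d_separated(x, y, z, no_out_edges)  # condition 2: backdoor paths blocked


for candidate in [{"severity", "age"}, {"severity"}, {"biomarker"}, {"insurance"}]:
    ok = satisfies_backdoor("drug", "hospitalization", candidate, MEDICORE_EDGES)
    print(sorted(candidate), "->", ok)
# ['age', 'severity'] -> True
# ['severity'] -> False
# ['biomarker'] -> False
# ['insurance'] -> False
```

Removing the treatment's outgoing arrows leaves only backdoor paths between treatment and outcome, so a single d-separation query in the modified graph covers the second condition.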
def backdoor_adjustment_demo(
    data: pd.DataFrame,
    treatment_col: str = "drug",
    outcome_col: str = "hospitalization",
    adjustment_sets: dict[str, list[str]] | None = None,
    true_effect: float = -1.0,
) -> pd.DataFrame:
    """Demonstrate the backdoor criterion with multiple adjustment sets.

    Estimates the treatment effect using OLS regression for each
    specified adjustment set and compares to the true causal effect.

    Args:
        data: DataFrame from the SCM.
        treatment_col: Name of the treatment column.
        outcome_col: Name of the outcome column.
        adjustment_sets: Dict mapping set name to list of covariate names.
        true_effect: True causal effect for comparison.

    Returns:
        DataFrame comparing each adjustment set's estimate.
    """
    if adjustment_sets is None:
        adjustment_sets = {
            "None (naive)": [],
            "{Severity, Age}": ["severity", "age"],
            "{Severity, Age, Comorbidities}": ["severity", "age", "comorbidities"],
            "{Severity, Age, Biomarker} (BAD)": [
                "severity", "age", "biomarker",
            ],
        }

    results = []
    for name, covs in adjustment_sets.items():
        # Design matrix: intercept, treatment, then the adjustment covariates
        X_cols = [np.ones(len(data)), data[treatment_col].values]
        for c in covs:
            X_cols.append(data[c].values)
        X_mat = np.column_stack(X_cols)
        beta, _, _, _ = lstsq(X_mat, data[outcome_col].values, rcond=None)
        results.append({
            "adjustment_set": name,
            "estimate": beta[1],
            "bias": beta[1] - true_effect,
        })
    return pd.DataFrame(results)


results = backdoor_adjustment_demo(data)
print("Backdoor Criterion: Adjustment Set Comparison")
print("=" * 65)
print(f"True causal effect: {-1.0}")
print()
print(results.to_string(index=False))
Backdoor Criterion: Adjustment Set Comparison
=================================================================
True causal effect: -1.0
adjustment_set estimate bias
None (naive) 0.748 1.748
{Severity, Age} -1.011 -0.011
{Severity, Age, Comorbidities} -1.013 -0.013
{Severity, Age, Biomarker} (BAD) -0.023 0.977
The valid adjustment sets ($\{$Severity, Age$\}$ and $\{$Severity, Age, Comorbidities$\}$) both recover the true effect. Adding Comorbidities does not hurt (it is not a descendant of treatment) but does not help either (it is not on any backdoor path). The invalid set that includes Biomarker produces an estimate near zero — it has blocked the causal pathway while trying to block the confounding pathway.
17.8 Good Controls vs. Bad Controls
The backdoor criterion provides a formal answer to the question that Chapter 16 left informal: "Which variables should I control for?" The answer is nuanced, and the common heuristic "control for everything available" is wrong.
Good Controls
A variable is a good control if it satisfies the backdoor criterion — it helps block backdoor paths without opening new spurious paths.
- Confounders ($Z \to X$ and $Z \to Y$): Always good to control for. They create spurious association that must be blocked.
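- Causes of treatment only ($Z \to X$, $Z \not\to Y$): Controlling for them does not introduce bias when no unmeasured confounding remains (they do not affect the outcome, so they lie on no backdoor path). They do, however, cost precision: they explain variation in treatment assignment, which shrinks the residual variance of the treatment and inflates the variance of the effect estimate. When unmeasured confounding is present they can also amplify the remaining bias, so treat them as neutral at best rather than as helpful.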
- Causes of outcome only ($Z \not\to X$, $Z \to Y$): Controlling for them does not affect bias (they are not on any backdoor path). They improve precision by explaining variation in the outcome, reducing the residual variance of the estimator.
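The precision consequences of the two bias-neutral cases can be checked by simulation. The sketch below is a minimal illustration under assumed conditions: a linear model with made-up coefficients, no confounding, and a hypothetical helper `coef_sd` that reports how much the estimated treatment coefficient varies across repeated samples.

```python
import numpy as np


def coef_sd(include: str, n: int = 200, reps: int = 500, seed: int = 0) -> float:
    """Std. dev. across simulated samples of the OLS coefficient on x.

    include: "none", "treatment_cause", or "outcome_cause".
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        z_t = rng.normal(size=n)                 # causes treatment only
        z_y = rng.normal(size=n)                 # causes outcome only
        x = 2.0 * z_t + rng.normal(size=n)       # treatment
        y = 1.0 * x + 2.0 * z_y + rng.normal(size=n)  # true effect of x is 1.0
        cols = [np.ones(n), x]
        if include == "treatment_cause":
            cols.append(z_t)
        elif include == "outcome_cause":
            cols.append(z_y)
        beta, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
        estimates.append(beta[1])
    return float(np.std(estimates))


sd_none = coef_sd("none")
sd_treat = coef_sd("treatment_cause")
sd_out = coef_sd("outcome_cause")
print(f"SD of estimate, no controls:          {sd_none:.3f}")
print(f"SD of estimate, + cause of treatment: {sd_treat:.3f}")
print(f"SD of estimate, + cause of outcome:   {sd_out:.3f}")
```

All three specifications are unbiased in this setup, so only the spread differs. Including the outcome-only cause removes noise from $Y$ and tightens the estimate; including the treatment-only cause removes variation in $X$ that the estimate relies on and widens it.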
Bad Controls
A variable is a bad control if conditioning on it introduces or amplifies bias.
- Mediators ($X \to M \to Y$): Conditioning on a mediator blocks the causal path. For the total effect, never control for variables on the causal pathway between $X$ and $Y$.
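- Colliders ($X \to Z \leftarrow Y$, a common effect of treatment and outcome, or $W_1 \to Z \leftarrow W_2$ where $W_1$ is an ancestor of $X$ and $W_2$ is an ancestor of $Y$): Conditioning on a collider opens a spurious path between its causes. In the second case the opened path is a backdoor path, and it biases the estimate unless another variable in the adjustment set blocks it.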
- Descendants of mediators ($X \to M \to D_M$ and $M \to Y$): Conditioning on a descendant of a mediator partially conditions on the mediator itself, blocking some of the causal signal.
- Descendants of colliders: Conditioning on a descendant of a collider partially opens the collider path.
Research Insight: Cinelli, Forney, and Pearl (2022), "A Crash Course in Good and Bad Controls," provide a comprehensive taxonomy of control variable selection. They identify 18 distinct graph structures and classify each as "good control," "bad control," or "neutral." Their key message is that the decision to include or exclude a variable from the adjustment set is a causal decision that requires understanding the graph — it cannot be made on the basis of statistical criteria (significance, correlation, variance inflation) alone. The paper is essential reading for any applied researcher conducting observational analyses.
Extended Example: StreamRec Good and Bad Controls
For the StreamRec recommendation system, consider the following variables:
- User preference ($U$): Causes both recommendation ($D$) and engagement ($Y$). Confounder — good control.
- Item popularity ($P$): Causes recommendation (popular items are recommended more) and engagement (popular items get more engagement). Confounder — good control.
- Content quality ($Q$): Causes engagement but not recommendation (the algorithm does not observe quality directly). Cause of outcome only — good control (improves precision).
- Watch time on recommended item ($W$): Caused by recommendation and user preference. If $W$ is downstream of $D$, it is a mediator or descendant of treatment — bad control.
- User satisfaction rating ($S$): Caused by recommendation quality and engagement. Collider between recommendation and engagement — bad control.
def good_vs_bad_controls_demo(n: int = 10000, seed: int = 42) -> None:
    """Demonstrate the effect of good vs. bad control variables.

    Generates data where:
    - U (preference) is a confounder: U -> D, U -> Y
    - M (watch duration) is a mediator: D -> M -> Y
    - S (satisfaction) is a collider: D -> S <- Y
    Shows that controlling for the mediator or collider biases the estimate.
    """
    from numpy.linalg import lstsq

    rng = np.random.RandomState(seed)
    # Confounder: user preference
    u = rng.normal(0, 1, n)
    # Treatment: recommendation
    d_prob = 1 / (1 + np.exp(-(1.0 * u + rng.normal(0, 0.5, n))))
    d = rng.binomial(1, d_prob).astype(float)
    # Mediator: watch duration (caused by D and U, plus noise)
    m = 2.0 * d + 0.5 * u + rng.normal(0, 1, n)
    # Outcome: engagement score; M carries the causal effect of D
    y = 1.5 * m + 2.0 * u + rng.normal(0, 1, n)
    # True total effect of D on Y: 2.0 * 1.5 = 3.0 (via mediator)
    true_ate = 2.0 * 1.5
    # Collider: satisfaction rating (caused by both D and Y)
    s = 0.5 * d + 0.3 * y + rng.normal(0, 1, n)

    def _ols_coef(*covariates: np.ndarray) -> float:
        """OLS coefficient on D when regressing Y on D and the covariates."""
        X = np.column_stack([np.ones(n), d, *covariates])
        beta, _, _, _ = lstsq(X, y, rcond=None)
        return beta[1]

    b_none = _ols_coef()       # no covariates
    b_good = _ols_coef(u)      # good: confounder U
    b_med = _ols_coef(u, m)    # bad: adds mediator M
    b_coll = _ols_coef(u, s)   # bad: adds collider S

    print(f"Good vs. Bad Controls (True ATE = {true_ate:.1f})")
    print("=" * 55)
    print(f" No controls: {b_none:.3f} (confounding bias)")
    print(f" Confounder (U): {b_good:.3f} (GOOD)")
    print(f" Confounder + Mediator: {b_med:.3f} (BAD - mediator)")
    print(f" Confounder + Collider: {b_coll:.3f} (BAD - collider)")


good_vs_bad_controls_demo()
Good vs. Bad Controls (True ATE = 3.0)
=======================================================
No controls: 4.821 (confounding bias)
Confounder (U): 3.007 (GOOD)
Confounder + Mediator: 0.041 (BAD - mediator)
Confounder + Collider: 2.463 (BAD - collider)
Controlling for the confounder alone recovers the true effect. Adding the mediator kills the estimate (it blocks the causal path). Adding the collider introduces new bias (it opens a spurious path).
17.9 The Front-Door Criterion
When the Backdoor Criterion Fails
The backdoor criterion requires adjusting for a set that blocks all backdoor paths. But what if a confounding variable is unmeasured?
Consider a simplified version of the MediCore DAG:
          U (unmeasured)
     /                     \
   v                         v
Drug X --> Biomarker --> Hospitalization
Here $U$ (e.g., unobserved genetic predisposition) confounds Drug X and Hospitalization. There is no measured variable that blocks the backdoor path Drug X $\leftarrow U \to$ Hospitalization. The backdoor criterion fails.
But notice the mediator: Biomarker. The front-door criterion exploits the mediating variable to identify the causal effect even in the presence of unmeasured confounding.
Definition
A set of variables $\mathbf{M}$ satisfies the front-door criterion relative to $(X, Y)$ if:
1. $\mathbf{M}$ intercepts all directed paths from $X$ to $Y$ (every causal path from $X$ to $Y$ passes through $\mathbf{M}$).
2. There is no unblocked backdoor path from $X$ to $\mathbf{M}$ (equivalently, there is no unblocked path from $X$ to $\mathbf{M}$ that does not go through a descendant of $X$).
3. All backdoor paths from $\mathbf{M}$ to $Y$ are blocked by $X$.
The Front-Door Adjustment Formula
If $\mathbf{M}$ satisfies the front-door criterion, the causal effect is:
$$P(Y \mid \text{do}(X = x)) = \sum_{m} P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x') \, P(X = x')$$
This formula decomposes the identification into two steps:
- Estimate the effect of $X$ on $M$: $P(M \mid X)$. This is identifiable because there is no backdoor path from $X$ to $M$ (condition 2).
- Estimate the effect of $M$ on $Y$: $P(Y \mid \text{do}(M = m)) = \sum_{x'} P(Y \mid M = m, X = x') P(X = x')$. This is identifiable because $X$ blocks all backdoor paths from $M$ to $Y$ (condition 3).
The total effect is the composition of these two identified effects.
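The formula can be verified exactly on a small discrete model. The sketch below builds a binary SCM with the front-door structure ($U \to X$, $U \to Y$, $X \to M$, $M \to Y$), hides $U$ from the analyst, and compares the front-door expression computed from the observable joint distribution with the ground truth computed from the full model. All probability tables are made-up illustrative numbers, and `front_door`, `truth`, and `naive` are hypothetical helper names.

```python
import numpy as np

# Made-up binary SCM: U -> X, U -> Y, X -> M, M -> Y (U is unobserved).
p_u = np.array([0.6, 0.4])            # P(U = u)
p_x_u = np.array([[0.8, 0.2],         # P(X = x | U = 0)
                  [0.3, 0.7]])        # P(X = x | U = 1)
p_m_x = np.array([[0.9, 0.1],         # P(M = m | X = 0)
                  [0.2, 0.8]])        # P(M = m | X = 1)
p_y1_mu = np.array([[0.1, 0.5],       # P(Y = 1 | M = 0, U = u)
                    [0.4, 0.9]])      # P(Y = 1 | M = 1, U = u)

# Observable quantities only: U is summed out of every table below.
p_xm_y1 = np.einsum("u,ux,xm,mu->xm", p_u, p_x_u, p_m_x, p_y1_mu)  # P(X, M, Y=1)
p_xm = np.einsum("u,ux,xm->xm", p_u, p_x_u, p_m_x)                 # P(X, M)
p_x = p_xm.sum(axis=1)                                             # P(X)
p_m_given_x = p_xm / p_x[:, None]                                  # P(M | X)
p_y1_given_xm = p_xm_y1 / p_xm                                     # P(Y=1 | X, M)


def front_door(x: int) -> float:
    """sum_m P(m | x) * sum_x' P(Y = 1 | m, x') P(x')."""
    inner = (p_y1_given_xm * p_x[:, None]).sum(axis=0)  # indexed by m
    return float((p_m_given_x[x] * inner).sum())


def truth(x: int) -> float:
    """Ground truth from the SCM: sum_u P(u) sum_m P(m | x) P(Y = 1 | m, u)."""
    return float(np.einsum("u,m,mu->", p_u, p_m_x[x], p_y1_mu))


def naive(x: int) -> float:
    """P(Y = 1 | X = x), the purely observational quantity."""
    return float(p_xm_y1[x].sum() / p_x[x])


for x in (0, 1):
    print(f"x={x}: front-door={front_door(x):.4f}  "
          f"truth={truth(x):.4f}  naive={naive(x):.4f}")
```

Because the computation uses exact probabilities rather than samples, the front-door value and the ground truth agree to floating-point precision, while the naive conditional probability does not.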
Front-Door Example: MediCore with Unmeasured Confounding
def front_door_demo(
    n: int = 20000,
    true_drug_bio: float = -2.0,
    true_bio_hosp: float = 0.5,
    seed: int = 42,
) -> None:
    """Demonstrate the front-door criterion.

    Data-generating process:
        U (unmeasured) -> Drug, U -> Hospitalization
        Drug -> Biomarker -> Hospitalization
    The backdoor criterion fails (U is unmeasured).
    The front-door criterion works (Biomarker is the mediator).
    True causal effect = true_drug_bio * true_bio_hosp = -1.0

    Args:
        n: Sample size.
        true_drug_bio: Causal effect of drug on biomarker.
        true_bio_hosp: Causal effect of biomarker on hospitalization.
        seed: Random seed.
    """
    rng = np.random.RandomState(seed)
    # Unmeasured confounder
    u = rng.normal(0, 1, n)
    # Drug assignment (confounded by U)
    drug = (1.5 * u + rng.normal(0, 1, n) > 0).astype(float)
    # Biomarker (caused by drug only, not confounded)
    biomarker = true_drug_bio * drug + rng.normal(0, 1, n)
    # Hospitalization (caused by biomarker and U)
    hosp = true_bio_hosp * biomarker + 2.0 * u + rng.normal(0, 1, n)
    true_effect = true_drug_bio * true_bio_hosp
    print(f"True causal effect of Drug on Hospitalization: {true_effect:.1f}")
    print()

    # Naive regression (biased -- cannot control for U)
    X_naive = np.column_stack([np.ones(n), drug])
    b_naive, _, _, _ = lstsq(X_naive, hosp, rcond=None)
    print(f"Naive regression (no controls): {b_naive[1]:.3f} (biased)")
    # Backdoor adjustment is impossible (U unmeasured)
    print("Backdoor adjustment: IMPOSSIBLE (U unmeasured)")

    # Front-door adjustment (two-step procedure)
    # Step 1: Drug -> Biomarker (no confounding on this path)
    X_step1 = np.column_stack([np.ones(n), drug])
    b_step1, _, _, _ = lstsq(X_step1, biomarker, rcond=None)
    effect_drug_bio = b_step1[1]
    # Step 2: Biomarker -> Hospitalization (control for Drug to block
    # the backdoor Biomarker <- Drug <- U -> Hospitalization)
    X_step2 = np.column_stack([np.ones(n), biomarker, drug])
    b_step2, _, _, _ = lstsq(X_step2, hosp, rcond=None)
    effect_bio_hosp = b_step2[1]
    # Compose: total effect = effect_drug_bio * effect_bio_hosp
    front_door_estimate = effect_drug_bio * effect_bio_hosp

    print("\nFront-door criterion (two-step):")
    print(f" Step 1 (Drug -> Biomarker): {effect_drug_bio:.3f}")
    print(f" Step 2 (Biomarker -> Hosp): {effect_bio_hosp:.3f}")
    print(f" Composed effect: {front_door_estimate:.3f}")
    print(f" True effect: {true_effect:.1f}")


front_door_demo()
True causal effect of Drug on Hospitalization: -1.0
Naive regression (no controls): 0.956 (biased)
Backdoor adjustment: IMPOSSIBLE (U unmeasured)
Front-door criterion (two-step):
Step 1 (Drug -> Biomarker): -2.006
Step 2 (Biomarker -> Hosp): 0.500
Composed effect: -1.003
True effect: -1.0
The naive regression is heavily biased (positive, suggesting the drug is harmful). The front-door criterion recovers the true effect ($-1.0$) despite unmeasured confounding, by decomposing the identification through the mediator.
Production Reality: The front-door criterion is elegant but rarely applicable in practice. It requires: (1) a mediator that fully mediates the causal effect (no direct $X \to Y$ path bypassing $M$), (2) no confounding of $X \to M$, and (3) that $X$ blocks all backdoor paths from $M$ to $Y$. These conditions are difficult to satisfy simultaneously. In the MediCore example, the criterion works because we assume Drug X affects Hospitalization only through the Biomarker — an assumption that may not hold if Drug X has direct physiological effects on readmission risk beyond those captured by the biomarker. The front-door criterion is most valuable as a conceptual tool: it demonstrates that the causal effect can sometimes be identified even with unmeasured confounders, and it motivates the search for mediating mechanisms in observational studies.
17.10 The do-Operator and Interventions
Conditioning vs. Intervention
The most important conceptual distinction in causal inference is between observing that $X = x$ and setting $X = x$ by intervention. Probabilistically:
- Observing (conditioning): $P(Y \mid X = x)$ is the probability of $Y$ given that we see $X = x$.
- Intervening (do-operator): $P(Y \mid \text{do}(X = x))$ is the probability of $Y$ if we force $X$ to be $x$, regardless of what would have caused $X$ otherwise.
Mathematical Foundation: The do-operator, introduced by Pearl (1995), formalizes intervention as graph surgery. The interventional distribution $P(Y \mid \text{do}(X = x))$ is the distribution of $Y$ in the modified SCM $\mathcal{M}_x$ where:
- The structural equation for $X$ is replaced by $X = x$ (a constant).
- All other structural equations remain unchanged.
- Graphically: all arrows into $X$ are deleted.
This captures the idea that an intervention sets $X$ to a value regardless of $X$'s usual causes. When a clinical trial randomizes patients to Drug X, it severs the connection between disease severity and drug assignment — severity no longer determines who gets the drug. The intervention removes the confounding.
Why $P(Y \mid \text{do}(X)) \neq P(Y \mid X)$
In a causal graph where $X$ is confounded with $Y$ through a common cause $Z$:
Z --> X --> Y
 \          ^
  \--------/
$P(Y \mid X = x)$ includes the effect of $Z$ on $Y$ (because observing $X = x$ provides information about $Z$). $P(Y \mid \text{do}(X = x))$ does not (because forcing $X = x$ severs the $Z \to X$ link, so the value of $X$ no longer carries any information about $Z$).
def do_vs_conditioning_demo(n: int = 50000, seed: int = 42) -> None:
    """Demonstrate P(Y|X=x) != P(Y|do(X=x)).

    DAG: Z -> X -> Y, Z -> Y
    The causal effect of X on Y is 1.0.
    The conditional association is inflated by confounding through Z.
    """
    rng = np.random.RandomState(seed)
    # Confounder
    z = rng.normal(0, 1, n)
    # X caused by Z
    x = 2.0 * z + rng.normal(0, 1, n)
    # Y caused by X and Z
    y = 1.0 * x + 3.0 * z + rng.normal(0, 1, n)

    # P(Y | X = x): naive regression of Y on X
    X_cond = np.column_stack([np.ones(n), x])
    b_cond, _, _, _ = lstsq(X_cond, y, rcond=None)
    # P(Y | do(X = x)): regression controlling for Z (backdoor)
    X_do = np.column_stack([np.ones(n), x, z])
    b_do, _, _, _ = lstsq(X_do, y, rcond=None)

    print("P(Y|X=x) vs P(Y|do(X=x))")
    print("=" * 45)
    print(" True causal effect (do): 1.000")
    print(f" Conditioning P(Y|X): {b_cond[1]:.3f} (inflated)")
    print(f" Intervention P(Y|do(X)): {b_do[1]:.3f} (correct)")
    print(f"\n Difference: {b_cond[1] - b_do[1]:.3f}")
    print(" (The gap is entirely due to confounding through Z)")


do_vs_conditioning_demo()
P(Y|X=x) vs P(Y|do(X=x))
=============================================
True causal effect (do): 1.000
Conditioning P(Y|X): 2.199 (inflated)
Intervention P(Y|do(X)): 1.003 (correct)
Difference: 1.196
(The gap is entirely due to confounding through Z)
The Three Rules of do-Calculus
Pearl's do-calculus consists of three rules that allow algebraic manipulation of expressions containing the do-operator. These rules, combined with the graph structure, are complete: any identifiable causal effect can be computed using these rules.
Let $\mathcal{G}$ be a causal DAG, and let $X$, $Y$, $Z$, $W$ be disjoint sets of variables. Let $\mathcal{G}_{\overline{X}}$ denote the graph with all arrows into $X$ deleted, $\mathcal{G}_{\underline{X}}$ denote the graph with all arrows out of $X$ deleted, and $\mathcal{G}_{\overline{X}\underline{Z}}$ denote the graph with incoming arrows to $X$ deleted and outgoing arrows from $Z$ deleted.
Rule 1 (Insertion/deletion of observations):
$$P(Y \mid \text{do}(X), Z, W) = P(Y \mid \text{do}(X), W) \quad \text{if } (Y \perp_{\mathcal{G}_{\overline{X}}} Z \mid X, W)$$
If $Z$ is d-separated from $Y$ in the manipulated graph (where $X$ is intervened on), then conditioning on $Z$ is irrelevant.
Rule 2 (Action/observation exchange):
$$P(Y \mid \text{do}(X), \text{do}(Z), W) = P(Y \mid \text{do}(X), Z, W) \quad \text{if } (Y \perp_{\mathcal{G}_{\overline{X}\underline{Z}}} Z \mid X, W)$$
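If $Z$ is d-separated from $Y$ given $X$ and $W$ once the arrows into $X$ and out of $Z$ are removed (so that only backdoor paths from $Z$ to $Y$ remain, and all of them are blocked), then intervening on $Z$ has the same effect on $Y$ as observing $Z$.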
Rule 3 (Insertion/deletion of actions):
$$P(Y \mid \text{do}(X), \text{do}(Z), W) = P(Y \mid \text{do}(X), W) \quad \text{if } (Y \perp_{\mathcal{G}_{\overline{X}\,\overline{Z(W)}}} Z \mid X, W)$$
where $Z(W)$ is the set of nodes in $Z$ that are not ancestors of any $W$-node in $\mathcal{G}_{\overline{X}}$.
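As a worked illustration of how the rules combine, the backdoor adjustment formula can be derived in three lines. Assume $Z$ satisfies the backdoor criterion relative to $(X, Y)$; lowercase letters denote values of the corresponding variables.

$$\begin{aligned}
P(y \mid \text{do}(x)) &= \sum_{z} P(y \mid \text{do}(x), z)\, P(z \mid \text{do}(x)) && \text{(law of total probability)} \\
&= \sum_{z} P(y \mid \text{do}(x), z)\, P(z) && \text{(Rule 3: } (Z \perp X) \text{ in } \mathcal{G}_{\overline{X}}\text{)} \\
&= \sum_{z} P(y \mid x, z)\, P(z) && \text{(Rule 2: } (Y \perp X \mid Z) \text{ in } \mathcal{G}_{\underline{X}}\text{)}
\end{aligned}$$

The Rule 3 step holds because $Z$ contains no descendants of $X$, so intervening on $X$ cannot change the distribution of $Z$. The Rule 2 step holds because, once the arrows out of $X$ are removed, the only remaining paths between $X$ and $Y$ are backdoor paths, and $Z$ blocks all of them.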
Advanced Sidebar: The completeness of do-calculus (Huang and Valtorta, 2006; Shpitser and Pearl, 2006) means that if a causal effect is identifiable from a given DAG, the three rules above are sufficient to derive the identification formula. If the rules cannot derive an expression for the causal effect, the effect is not identifiable from observational data alone — you need an experiment or additional assumptions. The complete identification algorithm is implemented in software packages like DoWhy and Ananke.
17.11 DoWhy: Implementing Graphical Causal Models in Python
The DoWhy library (Sharma and Kiciman, 2020) implements the full causal inference pipeline: define a causal model, identify the estimand using graphical criteria, select an estimator, and run robustness checks.
Step 1: Define the Causal Graph
import dowhy
from dowhy import CausalModel
# Define the MediCore causal DAG
medicore_graph = """
digraph {
severity -> drug;
severity -> biomarker;
severity -> hospitalization;
age -> drug;
age -> hospitalization;
age -> insurance_status;
comorbidities -> hospitalization;
comorbidities -> insurance_status;
drug -> biomarker;
biomarker -> hospitalization;
}
"""
# Generate data from the linear SCM
data = generate_medicore_linear_scm(n=5000, seed=42)
# Create the causal model
model = CausalModel(
data=data,
treatment="drug",
outcome="hospitalization",
graph=medicore_graph,
)
# Visualize the graph (in notebook environments)
# model.view_model()
Step 2: Identify the Causal Estimand
DoWhy automatically applies the backdoor criterion (and other identification strategies) to determine which adjustment sets are valid.
# Identify the estimand using the graph
identified_estimand = model.identify_effect(
proceed_when_unidentifiable=True,
)
print("Identified estimand:")
print(identified_estimand)
Identified estimand:
Estimand type: EstimandType.NONPARAMETRIC_ATE
### Estimand : 1
Estimand name: backdoor
Estimand expression:
d
──(E[hospitalization|drug,severity,age])
d[drug]
Estimand assumption 1, Unconfoundedness: If U→{drug} and U→{hospitalization}
then P(hospitalization|drug,severity,age,U) = P(hospitalization|drug,severity,age)
### Estimand : 2
Estimand name: iv
No such variable(s) found!
### Estimand : 3
Estimand name: frontdoor
Estimand expression:
d
──(E[hospitalization|biomarker]) * E[biomarker|drug]
d[biomarker]
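DoWhy reports both a backdoor estimand (adjust for severity and age) and a front-door estimand (through biomarker). The backdoor estimand is preferred when the confounders are measured. Note that in this graph Severity also causes Biomarker, so the path Drug X $\leftarrow$ Severity $\to$ Biomarker violates condition 2 of the front-door criterion in Section 17.9 unless Severity is adjusted for; treat the front-door expression with caution here.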
Step 3: Estimate the Causal Effect
# Estimate using the backdoor criterion (linear regression)
estimate_lr = model.estimate_effect(
identified_estimand,
method_name="backdoor.linear_regression",
)
print(f"Linear regression estimate: {estimate_lr.value:.3f}")
print(f"True causal effect: -1.000")
Linear regression estimate: -1.011
True causal effect: -1.000
Step 4: Robustness Check — Refutation Tests
DoWhy provides automated refutation tests that check whether the causal estimate is robust:
# Placebo treatment test: replace drug with a randomly permuted copy
refute_placebo = model.refute_estimate(
identified_estimand,
estimate_lr,
method_name="placebo_treatment_refuter",
placebo_type="permute",
num_simulations=100,
)
print("Placebo treatment refutation:")
print(f" Original estimate: {refute_placebo.estimated_effect:.3f}")
print(f" Placebo estimate: {refute_placebo.new_effect:.3f}")
print(f" p-value: {refute_placebo.refutation_result['p_value']:.4f}")
# Random common cause: add a random confounder
refute_random = model.refute_estimate(
identified_estimand,
estimate_lr,
method_name="random_common_cause",
num_simulations=100,
)
print(f"\nRandom common cause refutation:")
print(f" Original estimate: {refute_random.estimated_effect:.3f}")
print(f" New estimate: {refute_random.new_effect:.3f}")
Placebo treatment refutation:
Original estimate: -1.011
Placebo estimate: 0.003
p-value: 0.0000
Random common cause refutation:
Original estimate: -1.011
New estimate: -1.009
The placebo test shows that replacing Drug X with a randomly permuted copy yields an effect near zero, so the original estimate is tied to the actual treatment assignment rather than to an artifact of the estimator. The random common cause test shows that adding a randomly generated covariate, which by construction has no real effect, leaves the estimate essentially unchanged.
Implementation Note: DoWhy's refutation tests are not formal proofs of causal validity. The placebo test checks internal consistency (is the estimate driven by something specific to the treatment?), and the random common cause test checks whether the estimate is sensitive to adding irrelevant variables. Neither test can detect real unmeasured confounders. Combine DoWhy refutations with the domain-knowledge-based sensitivity analysis discussed in Chapter 16 (Cinelli and Hazlett, 2020) for a more complete robustness assessment.
17.12 Climate Systems: When DAGs Break Down
The Climate Deep Learning anchor provides an instructive example of the DAG framework's limitations. Consider the causal structure of climate feedback:
CO2 emissions --> Atmospheric CO2 --> Global temperature
|
+---------------+
|
v
Ice melt --> Albedo change
|
v
(amplified warming)
|
+---------+
|
v
Global temperature (again)
This system contains feedback loops: temperature causes ice melt, which changes albedo, which amplifies temperature. The DAG assumption (acyclicity) is violated.
Handling Feedback: Temporal Unrolling
The standard solution is to unroll the feedback loop over time:
CO2(t) --> Temp(t) --> Ice(t) --> Albedo(t) --> Temp(t+1)
CO2(t+1) --> Temp(t+1) --> Ice(t+1) --> Albedo(t+1) --> Temp(t+2)
Each time slice is a DAG. The feedback operates across time steps, not within a single time step. This temporal DAG represents the same system but satisfies acyclicity.
The cost of temporal unrolling is that it requires specifying the time scale at which causal effects operate. If the feedback is faster than the observation frequency, the temporal DAG may miss important dynamics.
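Temporal unrolling can be sketched as a small graph transformation. The code below is a minimal illustration, not climate-model code: it represents the feedback system as an edge list, confirms that drawing the feedback inside one time step creates a cycle, and shows that the unrolled version is acyclic. The helper names `is_acyclic` and `unroll` and the split into `within` and `lagged` edges are assumptions made for this example.

```python
from collections import defaultdict, deque

Edge = tuple[str, str]


def is_acyclic(edges: list[Edge]) -> bool:
    """Kahn's algorithm: the graph is a DAG iff all nodes can be sorted."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))
    queue = deque(v for v in nodes if indegree[v] == 0)
    seen = 0
    while queue:
        v = queue.popleft()
        seen += 1
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return seen == len(nodes)


def unroll(within: list[Edge], lagged: list[Edge], steps: int) -> list[Edge]:
    """Copy within-slice edges for each t; connect slices with lagged edges."""
    out = []
    for t in range(steps):
        out += [(f"{a}({t})", f"{b}({t})") for a, b in within]
        if t + 1 < steps:
            out += [(f"{a}({t})", f"{b}({t + 1})") for a, b in lagged]
    return out


within = [("CO2", "Temp"), ("Temp", "Ice"), ("Ice", "Albedo")]
lagged = [("Albedo", "Temp")]      # feedback acts on the next time step

cyclic = within + lagged           # feedback drawn inside one time step
temporal = unroll(within, lagged, steps=3)
print("single-slice graph acyclic:", is_acyclic(cyclic))
print("unrolled graph acyclic:    ", is_acyclic(temporal))
```

The choice of which edges are `lagged` is exactly the time-scale assumption discussed above: moving an edge from `lagged` to `within` asserts that the effect operates faster than the sampling interval.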
Implications for Climate DL
For the Pacific Climate Research Consortium:
- DAG-compatible questions: "What is the causal effect of doubling CO2 on equilibrium temperature, holding all feedback loops at their equilibrium values?" This is a well-defined intervention that can be modeled in a temporal DAG.
- DAG-incompatible questions: "What is the instantaneous causal effect of temperature on itself through the ice-albedo feedback?" This requires a simultaneous-equations framework (structural equation models in the econometric sense, or differential equation models from physics).
Understanding Why: The DAG is not a universal framework for all causal reasoning. It assumes a fixed set of variables with no simultaneous feedback among them. For many data science applications (recommendation systems, clinical medicine, marketing experiments) this is a reasonable assumption. For physical systems with continuous dynamics and feedback (climate, ecology, neural circuits), richer formalisms are needed. Recognizing the boundaries of the DAG framework is as important as mastering it.
17.13 Progressive Project: StreamRec Causal DAG
This section applies graphical causal models to the StreamRec recommendation system, building on the potential outcomes framework established in Chapter 16.
Constructing the StreamRec DAG
Based on domain knowledge of how recommendation systems operate:
User Preference (U)
/ | \
v v v
User History --> Rec Algorithm --> Recommendation (D)
^ |
| v
Item Features Engagement (Y)
/ ^ ^
v | /
Item Popularity | /
\ | /
v | /
Content Quality --+-------------------/
Key structural assumptions:
- User Preference ($U$) is largely unobserved. It drives User History, the Recommendation Algorithm's predictions, and organic Engagement.
- Recommendation ($D$) is determined by the algorithm, which uses User History and Item Features.
- Engagement ($Y$) is caused by the Recommendation (the causal effect we want to estimate), User Preference (organic engagement), and Content Quality.
- Item Popularity is caused by Item Features and Content Quality, and may be used by the algorithm.
Identifying Confounders, Colliders, and Backdoor Paths
Confounders:
- User Preference: causes both Recommendation (through the algorithm's prediction) and Engagement (organic behavior). This is the primary confounder.
- Item Features: may cause both Recommendation (algorithm input) and Engagement (if features like genre directly affect engagement).
Colliders:
- User History: if both User Preference and past Recommendations cause User History, conditioning on User History opens a spurious path.
Backdoor paths from Recommendation (D) to Engagement (Y):
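- $D \leftarrow \text{Rec Algorithm} \leftarrow U \to Y$ (User Preference feeds the algorithm's predictions directly)
- $D \leftarrow \text{Rec Algorithm} \leftarrow \text{User History} \leftarrow U \to Y$
- $D \leftarrow \text{Rec Algorithm} \leftarrow \text{Item Features} \to Y$ (if Item Features directly affects engagement)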
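Backdoor paths can also be enumerated mechanically. The sketch below walks the graph's skeleton and keeps every simple path from treatment to outcome whose first edge points into the treatment. It uses the edge list from the StreamRec `CausalDAG` defined in this section's code, which omits a direct Item Features $\to$ Engagement edge; `backdoor_paths` is a hypothetical helper name. The function lists paths whether or not they are blocked.

```python
EDGES = [
    ("user_preference", "user_history"),
    ("user_preference", "rec_algorithm"),
    ("user_preference", "engagement"),
    ("user_history", "rec_algorithm"),
    ("item_features", "rec_algorithm"),
    ("item_features", "item_popularity"),
    ("content_quality", "item_popularity"),
    ("content_quality", "engagement"),
    ("rec_algorithm", "recommendation"),
    ("recommendation", "engagement"),
]


def backdoor_paths(edges, treatment: str, outcome: str) -> list[str]:
    """All simple paths treatment ... outcome that start with an arrow INTO treatment."""
    edge_set = set(edges)
    neighbors: dict[str, set[str]] = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    paths: list[str] = []

    def walk(node: str, visited: set[str], text: str) -> None:
        if node == outcome:
            paths.append(text)
            return
        for nxt in sorted(neighbors[node]):
            if nxt in visited:
                continue
            # Record the true edge direction while walking the skeleton.
            arrow = "->" if (node, nxt) in edge_set else "<-"
            walk(nxt, visited | {nxt}, f"{text} {arrow} {nxt}")

    for parent in sorted(p for p, c in edges if c == treatment):
        walk(parent, {treatment, parent}, f"{treatment} <- {parent}")
    return paths


paths = backdoor_paths(EDGES, "recommendation", "engagement")
for p in paths:
    print(p)
```

With this edge list the function finds three backdoor paths: two through User Preference, and one through Item Features and Content Quality that passes through the collider Item Popularity and is therefore blocked unless Item Popularity is conditioned on.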
Formal Analysis
from dataclasses import dataclass, field


@dataclass
class CausalDAG:
    """Simple representation of a causal DAG for analysis.

    Stores nodes and directed edges, and provides methods
    for identifying paths and checking d-separation conditions.
    """

    nodes: list[str] = field(default_factory=list)
    edges: list[tuple[str, str]] = field(default_factory=list)

    def parents(self, node: str) -> list[str]:
        """Return the parents of a node."""
        return [src for src, dst in self.edges if dst == node]

    def children(self, node: str) -> list[str]:
        """Return the children of a node."""
        return [dst for src, dst in self.edges if src == node]

    def descendants(self, node: str) -> set[str]:
        """Return all descendants of a node (BFS)."""
        visited = set()
        queue = self.children(node)
        while queue:
            current = queue.pop(0)
            if current not in visited:
                visited.add(current)
                queue.extend(self.children(current))
        return visited

    def is_valid_backdoor_set(
        self,
        treatment: str,
        outcome: str,
        adjustment_set: set[str],
    ) -> tuple[bool, str]:
        """Check if a set satisfies the backdoor criterion.

        Verifies condition 1 (no descendants of treatment) only.
        Full d-separation checking requires a more complete
        implementation (see DoWhy for production use).

        Args:
            treatment: Treatment variable name.
            outcome: Outcome variable name.
            adjustment_set: Set of variable names to check.

        Returns:
            Tuple of (is_valid, reason).
        """
        desc_treatment = self.descendants(treatment)
        for var in adjustment_set:
            if var in desc_treatment:
                return (
                    False,
                    f"'{var}' is a descendant of treatment '{treatment}'. "
                    f"Conditioning on descendants of treatment violates "
                    f"the backdoor criterion.",
                )
        return (True, "Passes descendant check (condition 1).")


# Define the StreamRec DAG
streamrec_dag = CausalDAG(
    nodes=[
        "user_preference", "user_history", "item_features",
        "item_popularity", "content_quality",
        "rec_algorithm", "recommendation", "engagement",
    ],
    edges=[
        ("user_preference", "user_history"),
        ("user_preference", "rec_algorithm"),
        ("user_preference", "engagement"),
        ("user_history", "rec_algorithm"),
        ("item_features", "rec_algorithm"),
        ("item_features", "item_popularity"),
        ("content_quality", "item_popularity"),
        ("content_quality", "engagement"),
        ("rec_algorithm", "recommendation"),
        ("recommendation", "engagement"),
    ],
)

# Check adjustment sets
adjustment_sets = {
    "{user_preference}": {"user_preference"},
    "{user_preference, content_quality}": {"user_preference", "content_quality"},
    "{user_preference, user_history}": {"user_preference", "user_history"},
    "{engagement} (BAD)": {"engagement"},
}

print("StreamRec: Backdoor Criterion Analysis")
print("=" * 60)
print("Treatment: recommendation")
print("Outcome: engagement")
print(f"Descendants of treatment: {streamrec_dag.descendants('recommendation')}")
print()
for name, adj_set in adjustment_sets.items():
    valid, reason = streamrec_dag.is_valid_backdoor_set(
        "recommendation", "engagement", adj_set,
    )
    status = "VALID" if valid else "INVALID"
    print(f" {name}: {status}")
    print(f" {reason}")
    print()
StreamRec: Backdoor Criterion Analysis
============================================================
Treatment: recommendation
Outcome: engagement
Descendants of treatment: {'engagement'}
{user_preference}: VALID
Passes descendant check (condition 1).
{user_preference, content_quality}: VALID
Passes descendant check (condition 1).
{user_preference, user_history}: VALID
Passes descendant check (condition 1).
{engagement} (BAD): INVALID
'engagement' is a descendant of treatment 'recommendation'. Conditioning on descendants of treatment violates the backdoor criterion.
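The check above covers condition 1 only, so a "VALID" result means "passes the descendant check", not "blocks every backdoor path". The sketch below adds condition 2. It tests d-separation with the standard moralized-ancestral-graph construction, applied to the graph with the arrows out of the treatment removed. It works on a plain edge list rather than the `CausalDAG` class, and the function names (`ancestors`, `d_separated`, `satisfies_backdoor`) are hypothetical; for production work a library such as DoWhy is the safer choice.

```python
from itertools import combinations

Edge = tuple[str, str]


def ancestors(edges: list[Edge], nodes: set[str]) -> set[str]:
    """Return `nodes` together with all of their ancestors."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        child = frontier.pop()
        for src, dst in edges:
            if dst == child and src not in result:
                result.add(src)
                frontier.append(src)
    return result


def d_separated(edges: list[Edge], x: str, y: str, z: set[str]) -> bool:
    """Test whether z d-separates x and y via the moralized ancestral graph."""
    keep = ancestors(edges, {x, y} | z)
    sub = [(a, b) for a, b in edges if a in keep and b in keep]
    # Moralize: link parents that share a child, then drop edge directions.
    links = {frozenset(e) for e in sub}
    for child in keep:
        parents = [a for a, b in sub if b == child]
        links |= {frozenset(pair) for pair in combinations(parents, 2)}
    # Remove z and check whether x can still reach y.
    reached, frontier = {x}, [x]
    while frontier:
        node = frontier.pop()
        for link in links:
            if node in link:
                (other,) = link - {node}
                if other not in z and other not in reached:
                    reached.add(other)
                    frontier.append(other)
    return y not in reached


def satisfies_backdoor(edges: list[Edge], treatment: str, outcome: str,
                       z: set[str]) -> tuple[bool, str]:
    nodes = {n for edge in edges for n in edge}
    desc = {n for n in nodes
            if n != treatment and treatment in ancestors(edges, {n})}
    if z & desc:  # condition 1
        return False, f"contains descendants of treatment: {sorted(z & desc)}"
    # Condition 2: with arrows OUT of the treatment removed, only backdoor
    # paths remain, so z must d-separate treatment and outcome.
    backdoor_graph = [(a, b) for a, b in edges if a != treatment]
    if not d_separated(backdoor_graph, treatment, outcome, z):
        return False, "leaves a backdoor path open"
    return True, "blocks all backdoor paths"


EDGES = [
    ("user_preference", "user_history"), ("user_preference", "rec_algorithm"),
    ("user_preference", "engagement"), ("user_history", "rec_algorithm"),
    ("item_features", "rec_algorithm"), ("item_features", "item_popularity"),
    ("content_quality", "item_popularity"), ("content_quality", "engagement"),
    ("rec_algorithm", "recommendation"), ("recommendation", "engagement"),
]
for z in [set(), {"user_preference"}, {"user_preference", "item_popularity"},
          {"engagement"}]:
    print(sorted(z), satisfies_backdoor(EDGES, "recommendation", "engagement", z))
```

Under this edge list, the empty set fails because the path through User Preference is open, and $\{$user_preference$\}$ passes. Adding Item Popularity to a passing set makes it fail: Item Popularity is a collider between Item Features and Content Quality, and conditioning on it opens a path from the algorithm to engagement that was previously blocked.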
The Practical Challenge
The key difficulty for StreamRec is that User Preference is largely unobserved. We observe user history, demographics, and past behavior, but the latent preference that drives both what the algorithm recommends and what the user would have engaged with organically is not directly measurable.
This means the backdoor criterion may not be satisfiable with the available data. Chapter 18 addresses this with alternative identification strategies (instrumental variables, difference-in-differences). Chapter 19 provides machine learning approaches (double machine learning) that are more flexible about the functional form of confounding adjustment.
Prediction $\neq$ Causation: The StreamRec DAG makes the confounding problem visible. The algorithm recommends items it predicts the user will engage with — but that prediction is driven by user preferences that also cause organic engagement. Without blocking the backdoor path through user preference, the naive comparison $\mathbb{E}[Y \mid D=1] - \mathbb{E}[Y \mid D=0]$ confounds the causal effect of the recommendation with the organic engagement that would have occurred anyway. The DAG does not solve the problem, but it articulates it precisely and identifies what must be assumed (or what additional information is needed) for the causal effect to be identified.
17.14 Summary
This chapter introduced graphical causal models as a complement to the potential outcomes framework:
- Directed acyclic graphs (DAGs) encode causal structure: each node is a variable, each directed edge represents a direct causal effect, and each missing edge asserts no direct effect.
- Structural causal models (SCMs) formalize DAGs with structural equations $V_i = f_i(\text{pa}(V_i), U_i)$ that define the data-generating process. Structural equations are causal mechanisms, not statistical regressions.
- Three junction types (fork, chain, collider) determine how information flows through a graph. Forks and chains are open by default and blocked by conditioning; colliders are blocked by default and opened by conditioning. This asymmetry is the fundamental rule of graphical causal reasoning.
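- d-separation provides the algorithm for reading conditional independencies from the graph. If every path between $X$ and $Y$ is blocked by $\mathbf{Z}$, then $X \perp\!\!\!\perp Y \mid \mathbf{Z}$ in every distribution that satisfies the Markov condition with respect to the graph. Faithfulness supplies the converse: under it, every conditional independence in the distribution corresponds to a d-separation in the graph.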
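- The backdoor criterion identifies valid adjustment sets: condition on a set $\mathbf{Z}$ that blocks all backdoor paths without including descendants of treatment. The backdoor adjustment formula, $P(Y \mid \text{do}(X = x)) = \sum_{\mathbf{z}} P(Y \mid X = x, \mathbf{Z} = \mathbf{z})\, P(\mathbf{Z} = \mathbf{z})$, then expresses the causal effect entirely in terms of observational quantities.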
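- The front-door criterion identifies the causal effect through a mediator even when confounders are unmeasured. It requires complete mediation (the mediator intercepts every directed path from treatment to outcome), no unblocked backdoor path from treatment to mediator, and that treatment blocks every backdoor path from mediator to outcome.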
- The do-operator formalizes intervention as graph surgery: $P(Y \mid \text{do}(X = x))$ is the distribution of $Y$ in the modified graph where all arrows into $X$ are deleted and $X$ is fixed at $x$. The three rules of do-calculus are complete: whenever a causal effect is identifiable from the graph and observational data, some sequence of the rules derives it.
- Good controls (confounders, causes of treatment, causes of outcome) reduce bias or improve precision. Bad controls (mediators, colliders, their descendants) introduce or amplify bias. The DAG tells you which is which.
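The junction rules in the summary can be checked numerically. The sketch below assumes linear-Gaussian toy mechanisms with unit coefficients (an illustrative choice, not taken from the chapter), so that a zero partial correlation corresponds to conditional independence. The helper `partial_corr` is a hypothetical name introduced only for this example.

```python
import numpy as np


def partial_corr(a, b, c):
    """Correlation between a and b after linearly regressing c out of both."""
    C = np.column_stack([np.ones_like(c), c])
    res_a = a - C @ np.linalg.lstsq(C, a, rcond=None)[0]
    res_b = b - C @ np.linalg.lstsq(C, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]


rng = np.random.default_rng(0)
n = 100_000
e1, e2, e3 = rng.normal(size=(3, n))  # independent noise terms

# Fork: X <- Z -> Y
z_fork = e1
x_fork = z_fork + e2
y_fork = z_fork + e3

# Chain: X -> Z -> Y
x_chain = e1
z_chain = x_chain + e2
y_chain = z_chain + e3

# Collider: X -> Z <- Y
x_coll = e1
y_coll = e2
z_coll = x_coll + y_coll + e3

# For each junction: (marginal correlation of X and Y, correlation given Z).
junctions = {
    "fork": (np.corrcoef(x_fork, y_fork)[0, 1], partial_corr(x_fork, y_fork, z_fork)),
    "chain": (np.corrcoef(x_chain, y_chain)[0, 1], partial_corr(x_chain, y_chain, z_chain)),
    "collider": (np.corrcoef(x_coll, y_coll)[0, 1], partial_corr(x_coll, y_coll, z_coll)),
}

for name, (marginal, given_z) in junctions.items():
    print(f"{name:9s} corr(X, Y) = {marginal:+.3f}   corr(X, Y | Z) = {given_z:+.3f}")
```

Under these assumptions, the fork and chain show a clear marginal association that vanishes once $Z$ is regressed out, while the collider shows no marginal association and a negative one after conditioning on $Z$.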
The next chapter applies these graphical insights to specific estimation methods: propensity scores, instrumental variables, difference-in-differences, and regression discontinuity — each of which can be understood as a strategy for satisfying the identification conditions derived from the causal graph.