Chapter 17: Key Takeaways

  1. A causal DAG encodes domain knowledge as a formal mathematical object. The DAG is not a visualization of correlations — it is a set of testable causal claims: each edge represents a direct causal mechanism, and each missing edge asserts that no such mechanism exists. Constructing the DAG requires domain expertise (clinical knowledge, platform engineering understanding, subject matter theory), not data mining. The DAG then serves as a reasoning engine: from it, we can derive which conditional independencies should hold in the data (the causal Markov condition), which variables to adjust for (the backdoor criterion), and whether the causal effect is identifiable at all (do-calculus completeness). In the MediCore example, the DAG made explicit that Disease Severity and Age are confounders (requiring adjustment), Biomarker is a mediator (requiring exclusion), and Insurance Status is a collider (requiring exclusion). Without the DAG, these classifications would depend on informal judgment — with the DAG, they follow from formal graph-theoretic rules.
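
The classification step can be sketched in code. This is a toy illustration: the edge list is an assumed rendering of the MediCore DAG (names like `Severity` and `Insurance` are hypothetical), and the classifier looks only at each node's local parents and children for this specific graph rather than running a full d-separation test.

```python
# Assumed edge list for the MediCore DAG (variable names are illustrative).
edges = {
    ("Severity", "Treatment"), ("Severity", "Outcome"),
    ("Age", "Treatment"), ("Age", "Outcome"),
    ("Treatment", "Biomarker"), ("Biomarker", "Outcome"),
    ("Age", "Insurance"), ("Comorbidities", "Insurance"),
}

def classify(z, x="Treatment", y="Outcome"):
    """Classify covariate z relative to the x -> y effect, using only
    z's local parents and children (sufficient for this toy DAG)."""
    parents = {a for a, b in edges if b == z}
    children = {b for a, b in edges if a == z}
    if x in children and y in children:
        return "confounder (adjust)"      # z -> x and z -> y: backdoor path
    if x in parents and y in children:
        return "mediator (exclude)"       # x -> z -> y: on the causal path
    if len(parents) >= 2 and not children:
        return "collider (exclude)"       # arrows collide at z: blocked until conditioned on
    return "other"

for z in ["Severity", "Age", "Biomarker", "Insurance"]:
    print(z, "->", classify(z))
```

The point of the sketch is that each role follows mechanically from the graph's edges, not from any statistical property of the data.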

  2. Three junction types — fork, chain, collider — are the complete vocabulary for how information flows through a causal graph. Forks ($X \leftarrow Z \rightarrow Y$) and chains ($X \rightarrow Z \rightarrow Y$) are active by default and blocked by conditioning on the middle node. Colliders ($X \rightarrow Z \leftarrow Y$) are blocked by default and opened by conditioning on the middle node (or its descendants). This asymmetry is the key to all of graphical causal reasoning: it explains why conditioning on a confounder removes bias (blocking a fork), why conditioning on a mediator removes the causal signal (blocking a chain), and why conditioning on a collider introduces bias (opening a previously blocked path). Every complex DAG analysis reduces to classifying each node on each path as one of these three junction types.
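
The fork/collider asymmetry is easy to verify by simulation. A minimal numpy sketch, using linear Gaussian variables so that conditioning reduces to residualization (a chain behaves statistically like a fork, so only the fork and collider are shown):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Fork: X <- Z -> Y. Open by default; conditioning on Z blocks it.
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = Z + rng.normal(size=n)
fork_open = np.corrcoef(X, Y)[0, 1]             # substantial (~0.5)
fork_blocked = np.corrcoef(X - Z, Y - Z)[0, 1]  # near 0 after removing Z

# Collider: X -> C <- Y. Blocked by default; conditioning on C opens it.
X2, Y2 = rng.normal(size=n), rng.normal(size=n)
C = X2 + Y2 + rng.normal(size=n)
col_blocked = np.corrcoef(X2, Y2)[0, 1]         # near 0: no open path
bx = np.cov(X2, C)[0, 1] / C.var()              # residualize X2 and Y2 on C
by = np.cov(Y2, C)[0, 1] / C.var()
col_open = np.corrcoef(X2 - bx * C, Y2 - by * C)[0, 1]  # negative: collider bias

print(round(fork_open, 2), round(fork_blocked, 2),
      round(col_blocked, 2), round(col_open, 2))
```

Note the mirror image: the same operation (conditioning on the middle node) destroys the association in the fork and creates one in the collider.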

  3. "Control for everything" is one of the most dangerous heuristics in applied data science. The backdoor criterion formalizes which variables should be adjusted for: those that block backdoor paths without being descendants of treatment. Conditioning on a mediator blocks the causal effect you are trying to measure (in the MediCore example, the treatment coefficient drops from $-1.01$ to $-0.02$ when the mediator Biomarker is added to the regression). Conditioning on a collider opens spurious paths that introduce bias (Insurance Status creates an artificial link between Age and Comorbidities). Conditioning on a descendant of either partially induces the same problems. The decision to include or exclude a variable from the adjustment set is a causal decision that requires understanding the graph — it cannot be made on the basis of statistical criteria (significance, correlation, variance inflation) alone.
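
The mediator pitfall can be reproduced with a toy simulation in which the treatment effect flows entirely through the mediator. The data-generating process below is assumed for illustration, not MediCore's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

T = rng.binomial(1, 0.5, n).astype(float)  # treatment
M = T + rng.normal(size=n)                 # mediator: T -> M
Y = -1.0 * M + rng.normal(size=n)          # outcome: effect flows only through M

def ols_coef(y, *cols):
    """Coefficient on the first regressor in an OLS fit with intercept."""
    A = np.column_stack([np.ones(n)] + list(cols))
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

b_total = ols_coef(Y, T)       # recovers the total causal effect (about -1.0)
b_blocked = ols_coef(Y, T, M)  # near 0: adding the mediator blocked the chain

print(round(b_total, 2), round(b_blocked, 2))
```

The model with the mediator is not "more controlled" — it answers a different question (the direct effect holding the mediator fixed), which here is zero by construction.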

  4. The do-operator formalizes intervention as graph surgery, and $P(Y \mid \text{do}(X = x)) \neq P(Y \mid X = x)$ is the most important inequality in causal inference. Conditioning on $X = x$ (observing) provides information about $X$'s causes and thus conflates causal and confounding associations. Intervening on $X = x$ (doing) severs $X$ from its causes, isolating the causal effect. The do-operator captures this distinction mathematically: $\text{do}(X = x)$ replaces $X$'s structural equation with a constant and deletes all incoming arrows. The three rules of do-calculus are complete — any identifiable causal effect can be derived from observational distributions using these rules — and this completeness is one of the foundational results of modern causal inference.
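
The seeing/doing gap can be demonstrated by simulating the same structural equations twice: once letting the confounder drive treatment, once performing the graph surgery by hand. All coefficients here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

def simulate(do_x=None):
    """Structural equations with confounder Z; do_x severs Z -> X."""
    Z = rng.normal(size=n)
    if do_x is None:
        X = (Z + rng.normal(size=n) > 0).astype(float)  # Z causes X
    else:
        X = np.full(n, float(do_x))                     # graph surgery: X set by fiat
    Y = 2.0 * X + 3.0 * Z + rng.normal(size=n)          # true effect of X is 2.0
    return X, Y

# Seeing: condition on X in observational data (confounded by Z).
X, Y = simulate()
seeing = Y[X == 1].mean() - Y[X == 0].mean()            # far from 2.0

# Doing: intervene, deleting the incoming arrow into X.
doing = simulate(do_x=1)[1].mean() - simulate(do_x=0)[1].mean()  # about 2.0

print(round(seeing, 2), round(doing, 2))
```

The observational contrast mixes the causal effect with the information that high $X$ carries about high $Z$; the interventional contrast isolates the causal effect.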

  5. The front-door criterion demonstrates that unmeasured confounding does not always prevent causal identification. When the backdoor criterion fails (because a confounder is unmeasured), the front-door criterion can sometimes identify the causal effect through a mediating variable. In the MediCore example with unmeasured genetic confounding, the two-step front-door procedure through Biomarker recovered the true causal effect ($-1.0$) despite the naive estimate being heavily biased ($+0.96$). The front-door criterion is rarely applicable in practice (it requires complete mediation and specific graphical conditions), but it demonstrates a crucial principle: the identifiability of a causal effect depends on the full graph structure, not just on whether the direct confounders are measured.
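
A linear-Gaussian sketch of the two-step front-door procedure follows. The structural coefficients are invented to roughly mimic the numbers quoted above (true effect $-1.0$, naive estimate near $+1$), and the mediator is complete by construction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

U = rng.normal(size=n)                        # unmeasured genetic confounder
X = U + rng.normal(size=n)                    # treatment (confounded by U)
M = X + rng.normal(size=n)                    # Biomarker: complete mediator, U does not touch M
Y = -1.0 * M + 4.0 * U + rng.normal(size=n)   # true effect of X on Y is -1.0 (via M)

def ols_coef(y, *cols):
    """Coefficient on the first regressor in an OLS fit with intercept."""
    A = np.column_stack([np.ones(n)] + list(cols))
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

naive = ols_coef(Y, X)           # badly biased by U (near +1 here)
step1 = ols_coef(M, X)           # effect of X on M (no backdoor into M given this DAG)
step2 = ols_coef(Y, M, X)        # effect of M on Y, adjusting for X to block M <- X <- U -> Y
front_door = step1 * step2       # recovers the true effect (about -1.0)

print(round(naive, 2), round(front_door, 2))
```

Each step is identified by a different part of the graph, which is why the combination works even though no single adjustment set exists.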

  6. Pearl's graphical framework and Rubin's potential outcomes framework are complementary, not competing. The potential outcomes framework (Chapter 16) excels at defining estimands precisely (ATE, ATT, ATU), formalizing the identification assumptions (SUTVA, ignorability, positivity), and grounding the fundamental problem of causal inference. The graphical framework excels at encoding domain knowledge about causal structure, automating the search for valid adjustment sets, and determining whether causal effects are identifiable. The backdoor criterion in the graphical framework corresponds exactly to conditional ignorability in the potential outcomes framework. A practitioner who uses only one framework is working with an incomplete toolkit.

  7. For the StreamRec recommendation system, the primary confounder (User Preference) is the same variable that makes the algorithm effective, and it is largely unobserved, so the causal effect cannot be cleanly identified from observational data alone. The DAG reveals that every backdoor path from Recommendation to Engagement passes through User Preference. Since User Preference is not directly observed, the backdoor criterion cannot be cleanly satisfied. Proxy variables (User History) reduce but do not eliminate bias, with the residual bias depending on proxy quality. This motivates the alternative identification strategies developed in Chapters 18 (instrumental variables using position randomization, difference-in-differences using algorithm changes) and 19 (double machine learning with flexible confounding adjustment). The DAG does not solve the identification problem, but it makes the problem precise: it specifies exactly what is assumed, what is identified, and what remains uncertain.
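
The proxy point can be made concrete with a simulation: adjusting for a noisy history proxy shrinks the bias relative to the naive estimate but falls short of the (infeasible) adjustment on preference itself. All names and coefficients below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

pref = rng.normal(size=n)                     # User Preference (unobserved in practice)
history = pref + 0.5 * rng.normal(size=n)     # User History: noisy proxy for preference
rec = pref + rng.normal(size=n)               # Recommendation driven by preference
eng = 0.3 * rec + pref + rng.normal(size=n)   # true effect of rec on engagement is 0.3

def ols_coef(y, *cols):
    """Coefficient on the first regressor in an OLS fit with intercept."""
    A = np.column_stack([np.ones(n)] + list(cols))
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

naive = ols_coef(eng, rec)             # heavily biased upward (near 0.8 here)
proxy = ols_coef(eng, rec, history)    # closer to 0.3, but residual bias remains
oracle = ols_coef(eng, rec, pref)      # about 0.3: only possible if pref were observed

print(round(naive, 2), round(proxy, 2), round(oracle, 2))
```

The residual bias in the proxy estimate is exactly the gap the DAG predicts: the proxy blocks only part of the backdoor path, with the remainder proportional to the proxy's noise.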