Chapter 15: Key Takeaways
- Prediction and causation are fundamentally different questions, and using prediction models to answer causal questions can produce outcomes worse than random. A prediction model learns $P(Y \mid X)$ — the conditional probability of an outcome given features — by exploiting all statistical associations in the data, including those from confounding and reverse causation. A causal analysis estimates $P(Y \mid \text{do}(X))$ — the effect of an intervention. The hospital readmission example demonstrates this concretely: risk-based targeting selects patients with the highest readmission probability (driven by unmodifiable disease severity) rather than patients whose outcomes would change with the intervention (driven by modifiable care gaps). A causal oracle prevents three times as many readmissions with the same budget.
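The gap between risk-based and effect-based targeting can be sketched with a toy simulation. All numbers here are hypothetical (they are not the chapter's hospital data): severity drives baseline risk but is unmodifiable, while the intervention only closes the care gap, so targeting by risk and targeting by effect select different patients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, budget = 100_000, 10_000

severity = rng.uniform(0, 1, n)   # unmodifiable driver of risk
care_gap = rng.uniform(0, 1, n)   # modifiable; intervention closes it

p_base = 0.1 + 0.6 * severity + 0.2 * care_gap   # P(readmit) without intervention
p_treat = 0.1 + 0.6 * severity                   # P(readmit) with intervention
effect = p_base - p_treat                        # individual risk reduction

def prevented(scores):
    # Treat the `budget` patients with the highest scores;
    # return how many readmissions that prevents in expectation.
    top = np.argsort(scores)[-budget:]
    return effect[top].sum()

risk_prevented = prevented(p_base)    # risk-based targeting (highest P(Y|X))
causal_prevented = prevented(effect)  # causal oracle (largest treatment effect)
print(risk_prevented, causal_prevented)
```

With these toy parameters the oracle still prevents substantially more readmissions for the same budget, because the risk score is dominated by severity, which the intervention cannot change.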
- Confounding is the central obstacle: it biases naive comparisons and can reverse the apparent direction of an effect. A confounder is a variable that causes both the treatment and the outcome, creating a non-causal statistical association between them. In the MediCore drug example, disease severity confounds the drug-hospitalization relationship: sicker patients receive the drug and are also more likely to be hospitalized, making a beneficial drug appear harmful. Simpson's paradox — where a trend reverses when data is aggregated — is the most dramatic manifestation of confounding, but continuous confounders create equally severe bias without the visible reversal.
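The sign reversal is easy to reproduce in a toy version of this setup (the probabilities below are illustrative, not the MediCore numbers): a drug that truly lowers hospitalization risk appears harmful in the naive comparison because sicker patients are far more likely to receive it, and stratifying on severity recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

severe = rng.random(n) < 0.5                        # confounder
drug = rng.random(n) < np.where(severe, 0.8, 0.2)   # sicker patients get the drug
hosp = rng.random(n) < 0.1 + 0.6 * severe - 0.1 * drug  # drug truly lowers risk by 0.1

# Naive comparison: confounded, makes the drug look harmful
naive = hosp[drug].mean() - hosp[~drug].mean()

# Compare like with like: difference within each severity stratum, then average
adjusted = np.mean([
    hosp[drug & (severe == s)].mean() - hosp[~drug & (severe == s)].mean()
    for s in (True, False)
])
print(naive, adjusted)  # naive is positive (harmful); adjusted is near -0.1
```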
- Colliders are the mirror image of confounders: controlling for them introduces bias rather than removing it. A collider is a variable caused by two or more other variables. Conditioning on a collider (e.g., analyzing only hospitalized patients, only hired employees, only recommended items) creates spurious associations between the collider's causes. The implication is that "control for everything" is dangerous advice — you must understand the causal structure to determine which variables to adjust for and which to leave unadjusted.
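Collider bias can be demonstrated with the hiring example in a few lines (a toy simulation with made-up parameters): two traits are independent in the full population, but among hired employees they become negatively correlated, because hiring is caused by both.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

skill = rng.normal(size=n)
charisma = rng.normal(size=n)   # independent of skill by construction
# Hiring is a collider: caused by both skill and charisma
hired = skill + charisma + rng.normal(scale=0.5, size=n) > 1.0

overall = np.corrcoef(skill, charisma)[0, 1]                    # near zero
among_hired = np.corrcoef(skill[hired], charisma[hired])[0, 1]  # spuriously negative
print(overall, among_hired)
```

Conditioning on `hired` (analyzing only hired employees) manufactures the association: among those who cleared the bar, high skill makes high charisma less necessary, and vice versa.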
- The fundamental problem of causal inference is that we can never observe both potential outcomes for the same individual. Each person either receives the treatment or does not; the counterfactual outcome is forever unobservable. This makes causal inference a missing data problem that cannot be solved by larger datasets or more powerful models — it requires assumptions about the data-generating process. Randomization solves the problem by ensuring treatment assignment is independent of potential outcomes, making the naive comparison unbiased. When randomization is not possible, observational methods (Chapters 16-19) provide alternatives under stronger assumptions.
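Because a simulation lets us see both potential outcomes, it can show directly why randomization works. In this hypothetical setup the true effect is 2.0; confounded assignment makes the naive difference in means badly biased, while randomized assignment recovers the truth with the same naive estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
true_effect = 2.0

x = rng.normal(size=n)            # confounder
y0 = 5 * x + rng.normal(size=n)   # potential outcome without treatment
y1 = y0 + true_effect             # potential outcome with treatment

# Confounded assignment: high-x units are more likely to be treated
t_conf = rng.random(n) < 1 / (1 + np.exp(-2 * x))
y_conf = np.where(t_conf, y1, y0)   # we only ever see one outcome per unit
biased = y_conf[t_conf].mean() - y_conf[~t_conf].mean()

# Randomized assignment: independent of (y0, y1)
t_rand = rng.random(n) < 0.5
y_rand = np.where(t_rand, y1, y0)
unbiased = y_rand[t_rand].mean() - y_rand[~t_rand].mean()
print(biased, unbiased)  # biased is far above 2.0; unbiased is close to 2.0
```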
- Prediction models that exploit confounded associations are not just causally wrong — they are fragile. Confounded features create associations that are contingent on the specific distribution of confounders in the training data. When the population changes (new patient demographics, new user segments, new market conditions), the confounded associations break and the model's performance degrades. Models built on causal features — those that directly cause the outcome — generalize better under distribution shift.
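The fragility is mechanical, and a minimal sketch makes it visible (all parameters here are invented): a feature with no causal effect on the outcome predicts well only while its coupling to the hidden confounder matches the training distribution. When that coupling weakens at deployment, the model's error jumps.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

def make_data(coupling):
    z = rng.normal(size=n)                  # hidden confounder
    x = coupling * z + rng.normal(size=n)   # feature: associated with y only via z
    y = 3 * z + rng.normal(size=n)          # x has no causal effect on y
    return x, y

x_tr, y_tr = make_data(coupling=1.0)
w = (x_tr @ y_tr) / (x_tr @ x_tr)   # OLS slope learned from the confounded feature

mses = {}
for c in (1.0, 0.2):                # coupling to the confounder weakens at deployment
    x_te, y_te = make_data(coupling=c)
    mses[c] = np.mean((y_te - w * x_te) ** 2)
print(mses)  # error is substantially higher under the shifted distribution
```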
- Standard offline evaluation of recommendation systems measures predictive accuracy, not causal impact. A model that perfectly predicts organic engagement (what users would do without recommendations) achieves excellent offline metrics but creates zero incremental value. The StreamRec case study shows that roughly 41% of observed engagement on recommended items is organic. Optimizing for total engagement rather than incremental engagement leads the recommendation system to take credit for existing behavior rather than creating new value.
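The same targeting logic applies to recommendations, sketched here with toy item-level probabilities (not the StreamRec data): ranking by total predicted engagement surfaces items users would mostly watch anyway, while ranking by incremental lift surfaces items where the recommendation actually changes behavior.

```python
import numpy as np

rng = np.random.default_rng(5)
n_items = 50_000
k = 5_000   # recommendation slots

p_organic = rng.beta(2, 5, n_items)   # engagement without a recommendation
lift = rng.beta(2, 20, n_items)       # incremental effect of recommending
p_total = np.clip(p_organic + lift, 0, 1)

by_total = np.argsort(p_total)[-k:]   # standard ranking: total engagement
by_lift = np.argsort(lift)[-k:]       # causal ranking: incremental engagement

# Incremental value created by each policy
print(lift[by_total].sum(), lift[by_lift].sum())
```

The total-engagement policy reports higher observed engagement, but much of it is organic; the lift-based policy creates more engagement that would not have happened otherwise.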
- Before building any model that will inform a decision, ask: does this decision require a causal answer? If the analysis will inform an action — treating patients, changing recommendations, allocating resources, modifying prices — the decision almost certainly requires causal reasoning. The causal question checklist (Section 15.11) provides a structured framework: identify the treatment, the outcome, the confounders, whether randomization is feasible, and what assumptions you are making. This discipline is the entry point to Part III's methodology (potential outcomes, causal graphs, estimation methods, causal ML).
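One lightweight way to enforce this discipline is to record the checklist answers as a structured object before any modeling begins. The class and method below are hypothetical (the chapter's checklist is prose, not code); the example answers are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class CausalQuestion:
    """Hypothetical record of the Section 15.11 checklist answers."""
    treatment: str
    outcome: str
    confounders: list[str] = field(default_factory=list)
    randomization_feasible: bool = False
    assumptions: list[str] = field(default_factory=list)

    def ready_for_analysis(self) -> bool:
        # If we cannot randomize, we must at least name the confounders
        # we will adjust for and the assumptions we are making.
        return self.randomization_feasible or (
            bool(self.confounders) and bool(self.assumptions)
        )

q = CausalQuestion(
    treatment="post-discharge follow-up call",
    outcome="30-day readmission",
    confounders=["disease severity"],
    assumptions=["no unmeasured confounding given severity"],
)
print(q.ready_for_analysis())
```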