Chapter 18: Key Takeaways
- Propensity score methods (PSM, IPW, AIPW) reduce confounding by balancing observed covariates, but they cannot correct for unmeasured confounders. The propensity score $e(X) = P(D = 1 \mid X)$ is a balancing score: conditioning on it makes covariates independent of treatment. This reduces the matching problem from $p$ dimensions to one. PSM creates matched pairs; IPW reweights all observations. Both require conditional ignorability (all confounders measured) and positivity (overlap in propensity scores). The propensity score is a tool for balance, not prediction — a model with high AUC can signal positivity violations rather than good estimation. Always check covariate balance after adjustment using standardized mean differences ($|SMD| < 0.1$) and inspect propensity score overlap. When in doubt about the propensity or outcome model, use AIPW: it is doubly robust, requiring only one of the two models to be correctly specified.
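The SMD balance check can be sketched on synthetic data (the data-generating process below is illustrative, not from the chapter); using the true propensity score sidesteps model fitting:

```python
import numpy as np

def smd(x, d, w=None):
    """Standardized mean difference of covariate x between treated (d=1)
    and control (d=0), optionally with weights w (e.g. IPW weights)."""
    if w is None:
        w = np.ones_like(x, dtype=float)
    m1 = np.average(x[d == 1], weights=w[d == 1])
    m0 = np.average(x[d == 0], weights=w[d == 0])
    v1 = np.average((x[d == 1] - m1) ** 2, weights=w[d == 1])
    v0 = np.average((x[d == 0] - m0) ** 2, weights=w[d == 0])
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                   # true propensity: depends on x
d = (rng.uniform(size=n) < e).astype(int)

w = np.where(d == 1, 1 / e, 1 / (1 - e))   # inverse-propensity weights

print(abs(smd(x, d)))                      # large imbalance before weighting
print(abs(smd(x, d, w)))                   # below 0.1 after weighting
```

After weighting, the imbalance on `x` drops well below the $|SMD| < 0.1$ threshold; before weighting, it does not.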
- Inverse probability weighting reweights observations to create a pseudo-population free of confounding, but extreme weights destabilize estimates. The IPW estimator $\hat{\tau} = \frac{1}{N} \sum_i \left[\frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1-D_i)Y_i}{1-\hat{e}(X_i)}\right]$ reweights treated units by $1/\hat{e}(X)$ and control units by $1/(1-\hat{e}(X))$. When propensity scores are extreme (near 0 or 1), a single observation can dominate the entire estimate. Diagnostics are essential: effective sample sizes should exceed half the actual sample size, and the top 5% of weights should not carry a majority of the total weight. Trimming propensity scores (typically at 0.01-0.05 and 0.95-0.99) stabilizes estimates at the cost of changing the estimand to the overlap population. The Hajek (normalized) estimator is preferred over the Horvitz-Thompson (unnormalized) estimator in practice due to substantially lower variance.
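A minimal numpy sketch of the two estimators and the effective-sample-size diagnostic, on synthetic data with a known propensity (the simulation is illustrative; in practice $\hat{e}(X)$ comes from a fitted model):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                     # true propensity score
d = (rng.uniform(size=n) < e).astype(int)
y = 2.0 * d + x + rng.normal(size=n)         # true ATE = 2

w = np.where(d == 1, 1 / e, 1 / (1 - e))     # inverse-propensity weights

# Horvitz-Thompson (unnormalized) vs Hajek (normalized) ATE estimators
ht = np.mean(d * y / e - (1 - d) * y / (1 - e))
hajek = (np.sum(d * w * y) / np.sum(d * w)
         - np.sum((1 - d) * w * y) / np.sum((1 - d) * w))

# Diagnostic: effective sample size should exceed n / 2
ess = w.sum() ** 2 / np.sum(w ** 2)

# Trimming: clip extreme propensity scores before forming weights
# (changes the estimand to the overlap population)
e_trim = np.clip(e, 0.05, 0.95)

print(ht, hajek, ess)
```

Both estimators recover the true ATE of 2 here; the Hajek estimator's advantage in variance shows up when propensity scores are more extreme than in this mild simulation.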
- Instrumental variables identify causal effects even with unmeasured confounders, but the estimate is local (LATE) and the exclusion restriction is untestable. An instrument $Z$ must be relevant ($Z$ predicts $D$, testable via first-stage $F > 10$), satisfy the exclusion restriction ($Z$ affects $Y$ only through $D$, untestable), and be independent of unmeasured confounders (also untestable). The 2SLS estimator divides the instrument's effect on the outcome (reduced form) by its effect on treatment (first stage). The result is a LATE — the effect among compliers, whose treatment changes in response to the instrument. This is not the ATE: it excludes always-takers and never-takers, who may respond differently to treatment. Weak instruments ($F < 10$) bias 2SLS toward OLS, defeating the purpose of the analysis. The instrument's validity rests entirely on domain knowledge; no statistical test can verify the exclusion restriction.
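With a single binary instrument and no covariates, 2SLS reduces to the Wald ratio (reduced form over first stage), which can be sketched directly; the data-generating process below is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
u = rng.normal(size=n)                        # unmeasured confounder
z = rng.integers(0, 2, size=n)                # binary instrument
d = ((0.8 * z + u + rng.normal(size=n)) > 0.5).astype(int)
y = 1.5 * d + 2.0 * u + rng.normal(size=n)    # true effect of D on Y = 1.5

# Naive comparison is confounded by u (biased upward here)
naive = y[d == 1].mean() - y[d == 0].mean()

# Wald / 2SLS with one binary instrument: reduced form / first stage
first_stage = d[z == 1].mean() - d[z == 0].mean()
reduced_form = y[z == 1].mean() - y[z == 0].mean()
wald = reduced_form / first_stage

# First-stage F statistic: squared t of the slope of d on z
b = np.cov(z, d)[0, 1] / np.var(z)
resid = d - d.mean() - b * (z - z.mean())
se_b = np.sqrt(resid.var() / (n * np.var(z)))
F = (b / se_b) ** 2

print(naive, wald, F)
```

The naive difference in means is badly biased by the confounder `u`, while the Wald ratio recovers the true effect; the first-stage F is far above 10, so weak-instrument bias is not a concern in this simulation.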
- Difference-in-differences exploits treatment variation across groups and over time, but requires the parallel trends assumption and is vulnerable to staggered adoption problems. DiD compares the change in outcomes over time between treated and control groups: $\hat{\tau}_{\text{DiD}} = (\bar{Y}_{1,\text{post}} - \bar{Y}_{1,\text{pre}}) - (\bar{Y}_{0,\text{post}} - \bar{Y}_{0,\text{pre}})$. The first difference removes time-invariant group differences; the second removes common time trends. The parallel trends assumption — that treated and control groups would have followed the same trajectory absent treatment — is untestable for the post-treatment period but assessable with pre-treatment data via event study plots. Pre-treatment coefficients near zero support the assumption; trending pre-treatment coefficients undermine it. For staggered adoption (treatment starts at different times for different groups), the standard TWFE estimator can produce biased estimates because it uses already-treated units as controls; use robust alternatives (Callaway-Sant'Anna, Sun-Abraham, de Chaisemartin-D'Haultfoeuille).
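The 2x2 difference-in-differences estimator is just four cell means; a sketch on synthetic data where parallel trends holds by construction (the group effect, time trend, and treatment effect below are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
group = rng.integers(0, 2, size=n)     # 1 = treated group
post = rng.integers(0, 2, size=n)      # 1 = post period
tau = -0.04                            # true treatment effect

# group fixed effect + common time trend + treatment effect + noise
y = (0.3 * group + 0.1 * post + tau * group * post
     + 0.05 * rng.normal(size=n))

def did(y, group, post):
    """2x2 DiD: (treated post - pre) minus (control post - pre)."""
    y11 = y[(group == 1) & (post == 1)].mean()
    y10 = y[(group == 1) & (post == 0)].mean()
    y01 = y[(group == 0) & (post == 1)].mean()
    y00 = y[(group == 0) & (post == 0)].mean()
    return (y11 - y10) - (y01 - y00)

print(did(y, group, post))             # ≈ -0.04
```

Differencing removes both the group fixed effect (0.3) and the common time trend (0.1), leaving only the treatment effect.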
- Regression discontinuity is the most credible quasi-experimental design when a treatment is assigned by a cutoff on a continuous variable. RD estimates the causal effect at the cutoff by comparing outcomes just above and just below: $\hat{\tau}_{\text{RD}} = \lim_{x \downarrow c} \mathbb{E}[Y \mid X=x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X=x]$. The identifying assumption — continuity of potential outcomes at the cutoff — is transparent, and the diagnostics are concrete: the McCrary density test checks for manipulation of the running variable; covariate balance at the cutoff checks that nothing other than treatment changes discontinuously; bandwidth sensitivity shows whether the estimate is robust. RD's limitation is locality: the estimate applies only at the cutoff. Sharp RD (deterministic treatment assignment) directly estimates the effect; fuzzy RD (probabilistic jump in treatment) uses the cutoff as an instrument and estimates a LATE among compliers at the threshold.
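A sharp-RD sketch using local linear fits on each side of the cutoff; the data, the cutoff at zero, and the bandwidth `h=0.2` are all illustrative choices (in practice the bandwidth would be chosen by a data-driven rule and varied as a sensitivity check):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
x = rng.uniform(-1, 1, size=n)             # running variable, cutoff at 0
d = (x >= 0).astype(int)                   # sharp RD: treated above cutoff
tau = 0.5
y = tau * d + 1.0 * x + 0.3 * x**2 + 0.2 * rng.normal(size=n)

def rd_estimate(x, y, c=0.0, h=0.2):
    """Local linear fit within bandwidth h on each side of the cutoff c;
    the RD estimate is the gap between the two fits evaluated at c."""
    left = (x >= c - h) & (x < c)
    right = (x >= c) & (x <= c + h)
    fit_left = np.polyfit(x[left], y[left], 1)
    fit_right = np.polyfit(x[right], y[right], 1)
    return np.polyval(fit_right, c) - np.polyval(fit_left, c)

print(rd_estimate(x, y))                   # ≈ 0.5 (true tau)
```

Fitting separate lines on each side lets the slope differ across the cutoff; rerunning `rd_estimate` at several bandwidths is the bandwidth-sensitivity check the takeaway describes.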
- No single method is universally best — the choice depends on data structure and defensible assumptions. If a sharp cutoff exists, use RD (fewest assumptions). If treatment varies across groups over time, use DiD (requires parallel trends). If a valid instrument exists, use IV (handles unmeasured confounders but estimates LATE). If all confounders are observed, use AIPW (doubly robust ATE). In practice, applying multiple methods as robustness checks is the strongest strategy: convergent estimates under different assumptions are more credible than any single estimate. The MediCore case demonstrated this: IPW ($-0.046$), IV ($-0.037$), and DiD ($-0.042$) all converged on a 4-5 percentage point reduction in readmission, providing evidence that would survive regulatory scrutiny despite each method's individual limitations.
- Diagnostics are not optional — they are integral to the causal argument. Every method has specific diagnostics that evaluate its identifying assumptions: covariate balance and propensity score overlap (PSM/IPW), first-stage F-statistic and exclusion restriction arguments (IV), event study plots and pre-trend tests (DiD), density tests, covariate continuity, and bandwidth sensitivity (RD). Reporting a causal estimate without the corresponding diagnostics is like reporting a confidence interval without specifying the confidence level. Sensitivity analysis — quantifying how strong an unobserved confounder would need to be to change the conclusion — should accompany every observational causal estimate.
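One widely used summary of that sensitivity question (not named in the chapter, so treat this as one option among several) is the E-value of VanderWeele and Ding, computable in a few lines:

```python
import math

def e_value(rr):
    """E-value: the minimum strength of association, on the risk-ratio
    scale, that an unmeasured confounder would need with both treatment
    and outcome to fully explain away an observed risk ratio rr."""
    rr = max(rr, 1 / rr)          # protective effects: use RR >= 1
    return rr + math.sqrt(rr * (rr - 1))

print(e_value(1.8))               # ≈ 3.0
```

Reading the result: to explain away an observed risk ratio of 1.8, an unmeasured confounder would need risk-ratio associations of about 3 with both treatment and outcome, a concrete bar the causal argument can be judged against.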