Appendix J: Causal Inference Identification Guide

From Research Question to Estimation Strategy: A Decision Framework


This appendix provides a structured decision framework for selecting causal identification strategies. It bridges the potential outcomes framework (Chapter 16) and the graphical causal model framework (Chapter 17), connects them to the estimation methods (Chapter 18) and causal ML techniques (Chapter 19), and gives you a systematic process for moving from a causal question to a credible estimate.

The single most important principle in causal inference is this: identification comes before estimation. No amount of statistical sophistication can rescue an identification strategy that is fundamentally flawed. A perfectly executed instrumental variables analysis with an invalid instrument produces a precise, confident, wrong answer. The decision framework in this appendix is designed to prevent that failure.


J.1 — The Identification Decision Tree

Begin every causal analysis by answering these questions in order. Each answer constrains the set of available methods.

STEP 1: What is the causal question?
  │
  ├── "What is the average effect of treatment T on outcome Y?"
  │     → Target: ATE, ATT, or ATU
  │
  ├── "Which subgroups benefit most from treatment T?"
  │     → Target: CATE / HTE → Jump to Section J.8
  │
  └── "What would happen if we changed the policy?"
        → Target: Counterfactual prediction → Structural model needed
              (Section J.7)

STEP 2: Can you randomize treatment?
  │
  ├── YES → Randomized Controlled Trial (Section J.2)
  │         Is there interference between units?
  │           ├── NO → Standard RCT
  │           └── YES → Cluster-randomized or switchback design
  │                     (Chapter 33)
  │
  └── NO → Proceed to Step 3

STEP 3: Draw the causal DAG. What is the source of endogeneity?
  │
  ├── Observable confounders only (backdoor criterion satisfied)
  │     → Matching, IPW, AIPW (Section J.3)
  │
  ├── Unobservable confounders present
  │     │
  │     ├── Is there a valid instrument?
  │     │     → Instrumental Variables (Section J.4)
  │     │
  │     ├── Is there a policy change or natural experiment
  │     │   creating a treatment/control contrast over time?
  │     │     → Difference-in-Differences (Section J.5)
  │     │
  │     ├── Is there a sharp or fuzzy threshold that
  │     │   determines treatment assignment?
  │     │     → Regression Discontinuity (Section J.6)
  │     │
  │     ├── Is there a single treated unit with many
  │     │   untreated comparison units over time?
  │     │     → Synthetic Control Method (Section J.7)
  │     │
  │     └── None of the above
  │           → Sensitivity analysis with observational methods
  │             (report how much unobserved confounding would
  │              be required to overturn the result)
  │
  └── UNSURE → Consult domain experts. The DAG encodes domain
               knowledge, not statistical knowledge.

J.2 — Randomized Controlled Trial (RCT)

When to Use

The RCT is the gold standard. Use it whenever randomization is feasible, ethical, and the unit of analysis is clear. In the StreamRec system, A/B tests randomize recommendation policies across users. In the MediCore Pharma context, clinical trials randomize drug assignment across patients.

Key Assumptions

| Assumption | Formal Statement | What Can Go Wrong |
|---|---|---|
| Randomization | $W \perp\!\!\!\perp (Y(0), Y(1))$ | Broken randomization (implementation bugs, self-selection). Check with a Sample Ratio Mismatch (SRM) test. |
| SUTVA: no interference | $Y_i(W_1, \ldots, W_n) = Y_i(W_i)$ | User $i$'s outcome depends on user $j$'s treatment. Violated in social networks, marketplaces, and shared content platforms. |
| SUTVA: consistency | Treatment is well-defined | Multiple versions of treatment (e.g., different recommendation algorithms triggered by the same treatment flag). |
| Compliance | All subjects receive assigned treatment | Non-compliance: some treated subjects don't take the drug; some control subjects obtain it elsewhere. Use ITT or IV (LATE). |

Assumption Checklist

  • [ ] Randomization was implemented correctly (SRM check passes: observed split ratio matches intended ratio within statistical tolerance)
  • [ ] Units are independent (no interference via social connections, shared resources, or marketplace dynamics)
  • [ ] Treatment is well-defined and consistently applied across all treated units
  • [ ] Outcome measurement is identical for treatment and control groups
  • [ ] No differential attrition (dropout rates are similar across arms)
  • [ ] Pre-treatment covariates are balanced (check with standardized mean differences < 0.1)
  • [ ] Analysis plan was pre-registered (prevents $p$-hacking and HARKing)

Estimation

Under valid randomization, the ATE is identified by the simple difference in means:

$$\widehat{\text{ATE}} = \bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}}$$

For improved precision, use regression adjustment or CUPED (Chapter 33):

$$\widehat{\text{ATE}}_{\text{CUPED}} = (\bar{Y}_T - \bar{Y}_C) - \theta(\bar{X}_T - \bar{X}_C)$$

where $X$ is a pre-treatment covariate and $\theta$ is the coefficient from regressing $Y$ on $X$ in the control group.
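
The CUPED adjustment above is a few lines of numpy. A minimal sketch (the function name is illustrative; $\theta$ is estimated in the control group as in the text, though implementations often estimate it on pooled pre-experiment data instead):

```python
import numpy as np

def cuped_ate(y_t, y_c, x_t, x_c):
    """CUPED-adjusted ATE estimate (sketch).

    y_t, y_c: outcomes in treatment and control.
    x_t, x_c: pre-treatment covariate in treatment and control.
    """
    # theta = Cov(Y, X) / Var(X), estimated in the control group
    theta = np.cov(y_c, x_c)[0, 1] / np.var(x_c, ddof=1)
    # Difference in means, adjusted by pre-treatment covariate imbalance
    return (y_t.mean() - y_c.mean()) - theta * (x_t.mean() - x_c.mean())
```

Because $X$ is measured before treatment, subtracting $\theta(\bar{X}_T - \bar{X}_C)$ removes variance without introducing bias.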

Common Mistakes

  1. Peeking without correction. Checking results daily and stopping when $p < 0.05$ inflates the false positive rate to 20-30%. Use sequential testing (always-valid $p$-values) or commit to a fixed sample size.
  2. Ignoring interference. In the StreamRec system, if user A shares a recommended article with user B (who is in the control group), the treatment "leaks." Use cluster randomization (randomize by household, region, or social cluster).
  3. Novelty and primacy effects. Users may initially engage more with a new recommendation algorithm simply because it is new. Run experiments long enough (2+ weeks) to capture the steady-state effect.

J.3 — Selection on Observables: Matching and Weighting

When to Use

When randomization is infeasible but all confounders are measured. This is the assumption that, conditional on observed covariates $X$, treatment assignment is as good as random. In the MediCore Pharma case, this means: after controlling for age, disease severity, comorbidities, and hospital, the remaining variation in who receives drug X is unrelated to the potential outcomes.

Key Assumptions

| Assumption | Formal Statement | What Can Go Wrong |
|---|---|---|
| Conditional ignorability (unconfoundedness) | $(Y(0), Y(1)) \perp\!\!\!\perp W \mid X$ | Unmeasured confounders exist. This assumption is untestable. |
| Positivity (overlap) | $0 < P(W=1 \mid X=x) < 1 \; \forall x$ | Some covariate values perfectly predict treatment. Propensity scores near 0 or 1 produce extreme weights and unstable estimates. |
| Correct model specification | The outcome model (for regression adjustment) or the propensity model (for IPW) is correctly specified | Model misspecification biases the estimate. |

Assumption Checklist

  • [ ] All plausible confounders are included in $X$ (requires domain knowledge — consult the causal DAG from Chapter 17)
  • [ ] No variable in $X$ is a collider or a descendant of treatment (conditioning on these introduces bias — the "bad controls" problem)
  • [ ] No variable in $X$ is a mediator of the treatment effect (conditioning on mediators blocks the effect you are trying to measure)
  • [ ] Positivity holds: propensity score distribution has adequate overlap between treatment and control groups (inspect histogram of propensity scores)
  • [ ] Propensity scores are not extreme (< 0.01 or > 0.99 — trim or truncate if necessary)
  • [ ] Covariate balance after matching/weighting is adequate (standardized mean differences < 0.1 for all covariates)
  • [ ] Sensitivity analysis conducted: how much unobserved confounding would be needed to overturn the result? (Use the E-value or Rosenbaum bounds)
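
The balance criterion in the checklist can be computed directly. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def standardized_mean_diff(x_treat, x_ctrl):
    """Standardized mean difference for one covariate (sketch).

    Uses the pooled standard deviation; |SMD| < 0.1 is the
    conventional balance threshold.
    """
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_ctrl.var(ddof=1)) / 2)
    return (x_treat.mean() - x_ctrl.mean()) / pooled_sd
```

Compute this for every covariate before and after matching/weighting; after adjustment, all values should fall below 0.1 in absolute value.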

Methods

Propensity Score Matching (PSM):

  • Estimate $e(x) = P(W=1 \mid X=x)$ using logistic regression, gradient boosting, or a neural network
  • Match each treated unit to the nearest control unit(s) in propensity score space
  • Estimate the effect from the matched sample (matching treated units to nearest controls targets the ATT, not the ATE)

Inverse Probability Weighting (IPW):

$$\widehat{\text{ATE}}_{\text{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{W_i Y_i}{\hat{e}(X_i)} - \frac{(1-W_i) Y_i}{1 - \hat{e}(X_i)} \right]$$

Augmented IPW (AIPW / Doubly Robust):

$$\widehat{\text{ATE}}_{\text{AIPW}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{W_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-W_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]$$

where $\hat{\mu}_w(x) = \hat{\mathbb{E}}[Y \mid X=x, W=w]$ is the estimated outcome model.

Research Insight: AIPW is doubly robust: it is consistent if either the propensity model or the outcome model is correctly specified (not necessarily both). In practice, always use AIPW over plain IPW or plain regression adjustment. There is no reason to use a singly-robust estimator when a doubly-robust alternative exists.

Common Mistakes

  1. Claiming unconfoundedness without justification. The assumption is untestable. You must argue, based on domain knowledge, that you have measured all relevant confounders. The causal DAG (Chapter 17) is the tool for this argument.
  2. Ignoring extreme propensity scores. A propensity score of 0.001 produces a weight of 1000. A single unit can dominate the estimate. Trim the sample to units with propensity scores in $[0.01, 0.99]$ or use stabilized weights.
  3. Not checking covariate balance. Matching or weighting should make the treatment and control groups comparable. If standardized mean differences remain large after weighting, the method is not working.
  4. Conditioning on post-treatment variables. Never include variables that are affected by the treatment in the propensity model or outcome model. This introduces post-treatment bias.

A reference implementation (if no fitted models are supplied, simple defaults are fit in-sample; in practice, fit the nuisance models with cross-fitting to avoid overfitting bias):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression


def aipw_ate(
    y: np.ndarray,
    w: np.ndarray,
    X: np.ndarray,
    propensity_model=None,
    outcome_model_treated=None,
    outcome_model_control=None,
) -> tuple[float, float]:
    """Compute AIPW estimate of ATE with standard error.

    Args:
        y: Outcome vector.
        w: Binary treatment indicator.
        X: Covariate matrix.
        propensity_model: Fitted model with .predict_proba(X);
            a logistic regression is fit if None.
        outcome_model_treated: Fitted model for E[Y|X, W=1];
            a linear regression is fit if None.
        outcome_model_control: Fitted model for E[Y|X, W=0];
            a linear regression is fit if None.

    Returns:
        Tuple of (ate_estimate, standard_error).
    """
    n = len(y)
    if propensity_model is None:
        propensity_model = LogisticRegression(max_iter=1000).fit(X, w)
    if outcome_model_treated is None:
        outcome_model_treated = LinearRegression().fit(X[w == 1], y[w == 1])
    if outcome_model_control is None:
        outcome_model_control = LinearRegression().fit(X[w == 0], y[w == 0])

    e_hat = propensity_model.predict_proba(X)[:, 1]
    mu1_hat = outcome_model_treated.predict(X)
    mu0_hat = outcome_model_control.predict(X)

    # Clip propensity scores for stability
    e_hat = np.clip(e_hat, 0.01, 0.99)

    # AIPW (influence-function) scores
    scores = (
        mu1_hat - mu0_hat
        + w * (y - mu1_hat) / e_hat
        - (1 - w) * (y - mu0_hat) / (1 - e_hat)
    )

    ate = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(n)
    return ate, se
```

J.4 — Instrumental Variables (IV)

When to Use

When unobserved confounders exist but you have an instrument $Z$ that affects treatment $W$ but has no direct effect on outcome $Y$ except through $W$. In the MediCore Pharma context: distance from the patient's home to the nearest hospital that prescribes drug X. Patients who live closer are more likely to receive the drug, but distance does not directly affect health outcomes (conditional on other covariates).

Key Assumptions

| Assumption | Formal Statement | What Can Go Wrong |
|---|---|---|
| Relevance | $Z$ is correlated with $W$: $\text{Cov}(Z, W) \neq 0$ | Weak instrument: $Z$ barely predicts $W$. Testable (first-stage F-statistic > 10). |
| Exclusion restriction | $Z$ affects $Y$ only through $W$ | $Z$ has a direct effect on $Y$ not mediated by $W$. Untestable; must argue from domain knowledge. |
| Independence | $Z \perp\!\!\!\perp (Y(0), Y(1), W(0), W(1))$ | $Z$ shares confounders with $Y$. Must argue from domain knowledge. |
| Monotonicity (for LATE) | $W_i(z=1) \geq W_i(z=0) \; \forall i$ | Existence of "defiers" (people who would take treatment when not encouraged but refuse when encouraged). |

Assumption Checklist

  • [ ] The instrument is relevant: first-stage F-statistic > 10 (Staiger and Stock rule of thumb). For weak instruments, use Anderson-Rubin confidence intervals.
  • [ ] The exclusion restriction is plausible: argue from domain knowledge that $Z$ has no direct effect on $Y$. This is the most critical and most frequently violated assumption.
  • [ ] The instrument is not correlated with unobserved confounders. Consider: could the instrument be proxying for something else? In the distance instrument example, distance correlates with urban/rural status, which may independently affect health.
  • [ ] Monotonicity is plausible: there are no "defiers."
  • [ ] The estimand is clear: IV estimates the LATE (Local Average Treatment Effect) — the effect on compliers, not the ATE on the entire population. If the policy question requires the ATE, LATE may not answer it.

Estimation: Two-Stage Least Squares (2SLS)

Stage 1: Regress $W$ on $Z$ (and covariates $X$):

$$W_i = \alpha_0 + \alpha_1 Z_i + \alpha_2' X_i + \nu_i$$

Stage 2: Regress $Y$ on predicted $\hat{W}$ (and covariates $X$):

$$Y_i = \beta_0 + \beta_1 \hat{W}_i + \beta_2' X_i + \epsilon_i$$

The coefficient $\hat{\beta}_1$ is the IV estimate of the causal effect (LATE).
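
The two stages can be sketched in a few lines of numpy (the function name is illustrative). Note that the naive stage-2 standard errors are incorrect because $\hat{W}$ is estimated; for inference, use a dedicated package such as linearmodels:

```python
import numpy as np

def two_stage_least_squares(y, w, z, X=None):
    """Manual 2SLS sketch: one endogenous regressor, one instrument.

    Returns (iv_estimate, first_stage_F).
    """
    n = len(y)
    C = np.ones((n, 1)) if X is None else np.column_stack([np.ones(n), X])
    # Stage 1: regress W on Z and covariates; keep fitted values
    Z1 = np.column_stack([C, z])
    alpha, *_ = np.linalg.lstsq(Z1, w, rcond=None)
    w_hat = Z1 @ alpha
    # First-stage F-statistic for the excluded instrument
    rss1 = np.sum((w - w_hat) ** 2)
    gamma, *_ = np.linalg.lstsq(C, w, rcond=None)
    rss0 = np.sum((w - C @ gamma) ** 2)
    f_stat = (rss0 - rss1) / (rss1 / (n - Z1.shape[1]))
    # Stage 2: regress Y on fitted W-hat and covariates
    X2 = np.column_stack([C, w_hat])
    beta, *_ = np.linalg.lstsq(X2, y, rcond=None)
    return beta[-1], f_stat
```

On simulated data with an unobserved confounder, the returned coefficient recovers the structural effect while plain OLS of $Y$ on $W$ does not.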

Common Mistakes

  1. Using a weak instrument. With a weak instrument, 2SLS is biased toward the OLS estimate and standard errors are unreliable. Always report the first-stage F-statistic.
  2. Asserting the exclusion restriction without argument. "Distance to hospital is a valid instrument" is a claim, not a fact. You must argue why distance has no direct effect on health outcomes after controlling for observed covariates. Reviewers and regulatory agencies will challenge this.
  3. Interpreting LATE as ATE. The IV estimate applies to compliers — individuals whose treatment status changes with the instrument. Compliers may differ systematically from always-takers and never-takers. Clearly state the estimand.
  4. Over-identification without testing. If you have more instruments than endogenous variables, use the Hansen J-test for over-identification. Rejection suggests at least one instrument is invalid.

J.5 — Difference-in-Differences (DiD)

When to Use

When a treatment (policy change, intervention, product launch) affects some units at a specific time, and you have pre- and post-treatment data for both treated and untreated units. In the Meridian Financial context: a regulatory change in one state affects lending practices in that state but not in neighboring states.

Key Assumptions

| Assumption | Formal Statement | What Can Go Wrong |
|---|---|---|
| Parallel trends | $\mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid \text{treated}] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid \text{control}]$ | The treatment and control groups were on different trajectories before the intervention. |
| No anticipation | Treatment does not affect outcomes before the treatment period | Units change behavior in anticipation of the policy. |
| No spillover | Treatment of one group does not affect the control group | Regulatory change in one state causes firms to relocate to control states. |
| Stable composition | The composition of treatment and control groups does not change over time | Selective migration: people move into or out of the treated region in response to the policy. |

Assumption Checklist

  • [ ] Parallel trends: plot pre-treatment trends for treatment and control groups. They should be approximately parallel. Formal test: estimate the DiD model with leads (pre-treatment period interactions) — these should be statistically insignificant.
  • [ ] Event study plot: estimate effects for each period relative to treatment. Pre-treatment estimates should be near zero; post-treatment estimates should show the effect.
  • [ ] No anticipation: is there a change in trends in the period immediately before treatment? If so, the treatment may have been anticipated.
  • [ ] Stable composition: the treatment and control groups should not change composition over the study period due to selective entry or exit.
  • [ ] If treatment is staggered across units: use the Callaway-Sant'Anna or Sun-Abraham estimator, not two-way fixed effects. Standard TWFE with staggered adoption can produce biased estimates.

Estimation

Basic two-period DiD:

$$Y_{it} = \beta_0 + \beta_1 \cdot \text{Treated}_i + \beta_2 \cdot \text{Post}_t + \beta_3 \cdot (\text{Treated}_i \times \text{Post}_t) + \epsilon_{it}$$

$\hat{\beta}_3$ is the DiD estimate of the causal effect.
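
In the two-period, two-group case, the interaction coefficient $\hat{\beta}_3$ equals a difference of group-mean differences, which a short sketch makes explicit (the function name is illustrative):

```python
import numpy as np

def did_estimate(y, treated, post):
    """Two-period, two-group DiD (sketch).

    Equivalent to the OLS interaction coefficient beta_3:
    (treated post - treated pre) - (control post - control pre).
    """
    t, p = treated.astype(bool), post.astype(bool)
    return (
        (y[t & p].mean() - y[t & ~p].mean())
        - (y[~t & p].mean() - y[~t & ~p].mean())
    )
```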

Event study specification:

$$Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq -1} \delta_k \cdot \mathbb{1}[t - t_i^* = k] + \epsilon_{it}$$

where $\alpha_i$ are unit fixed effects, $\gamma_t$ are time fixed effects, $t_i^*$ is the treatment time for unit $i$, and $k = -1$ is the reference period. The $\delta_k$ coefficients trace the treatment effect over time, with pre-treatment coefficients serving as a placebo test.

Common Mistakes

  1. Failing to test parallel trends. The parallel trends assumption is the foundation of DiD. If pre-treatment trends diverge, the DiD estimate is biased. Always produce the event study plot.
  2. Using two-way fixed effects with staggered treatment. When different units receive treatment at different times, standard TWFE estimates a weighted average of treatment effects where some weights are negative. This can produce estimates with the wrong sign. Use the Callaway-Sant'Anna estimator (2021) or the Sun-Abraham estimator (2021).
  3. Ignoring serial correlation. DiD with panel data exhibits serial correlation, which inflates standard errors. Cluster standard errors at the unit level (Bertrand, Duflo, and Mullainathan, 2004).
  4. Too few treated or control clusters. With fewer than approximately 20 clusters, cluster-robust standard errors are unreliable. Use the wild cluster bootstrap.

J.6 — Regression Discontinuity (RD)

When to Use

When treatment assignment is determined (fully or partially) by whether a continuous "running variable" crosses a threshold. In the Meridian Financial case: credit applicants above a score threshold of 680 are approved; those below are denied. Applicants just above and just below the threshold are similar in all respects except treatment.

Key Assumptions

| Assumption | Formal Statement | What Can Go Wrong |
|---|---|---|
| Continuity of potential outcomes | $\mathbb{E}[Y(0) \mid X=c]$ and $\mathbb{E}[Y(1) \mid X=c]$ are continuous at the cutoff $c$ | A discontinuity in the outcome at $c$ that is not caused by the treatment (e.g., another policy also has a threshold at $c$). |
| No manipulation | Units cannot precisely control their running variable to sort across the threshold | Credit applicants take actions to barely exceed the threshold. Detectable with the McCrary density test. |
| Local randomization | Near the cutoff, assignment is "as good as random" | Not a formal assumption of continuity-based RD, but the intuition behind why RD works. |

Assumption Checklist

  • [ ] Running variable is continuous and not subject to precise manipulation by subjects.
  • [ ] McCrary density test: no discontinuity in the density of the running variable at the cutoff. A jump in density suggests manipulation.
  • [ ] Covariate balance: pre-treatment covariates should be continuous at the cutoff. Test by running "placebo" RD regressions with covariates as the outcome.
  • [ ] Bandwidth selection: use data-driven methods (Imbens-Kalyanaraman, Calonico-Cattaneo-Titiunik). Results should be robust to bandwidth choices (report estimates at multiple bandwidths).
  • [ ] Sharp vs. fuzzy: in sharp RD, treatment is deterministic at the cutoff. In fuzzy RD, the probability of treatment jumps at the cutoff but is not 0 or 1 on either side. Fuzzy RD requires IV estimation at the cutoff.

Estimation

Sharp RD:

Estimate the local polynomial regression on each side of the cutoff and compare:

$$\hat{\tau}_{\text{RD}} = \lim_{x \downarrow c} \hat{\mathbb{E}}[Y \mid X = x] - \lim_{x \uparrow c} \hat{\mathbb{E}}[Y \mid X = x]$$

In practice, use the rdrobust package (R) or the rdrobust Python port, which implements bias-corrected local polynomial estimation with robust confidence intervals.
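
For intuition, the sharp RD estimate can be sketched as two local linear fits with a uniform kernel and a fixed bandwidth (illustrative only; rdrobust adds a triangular kernel, data-driven bandwidth selection, and bias-corrected inference):

```python
import numpy as np

def sharp_rd_local_linear(y, x, cutoff, bandwidth):
    """Sharp RD via separate local linear fits on each side (sketch)."""
    def fit_at_cutoff(mask):
        # Local linear fit, evaluated at the cutoff (x - cutoff = 0)
        coef = np.polyfit(x[mask] - cutoff, y[mask], deg=1)
        return np.polyval(coef, 0.0)
    above = (x >= cutoff) & (x <= cutoff + bandwidth)
    below = (x < cutoff) & (x >= cutoff - bandwidth)
    return fit_at_cutoff(above) - fit_at_cutoff(below)
```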

Fuzzy RD:

Use IV estimation at the cutoff, with the threshold indicator as the instrument for treatment:

$$\hat{\tau}_{\text{fuzzy}} = \frac{\lim_{x \downarrow c} \hat{\mathbb{E}}[Y \mid X = x] - \lim_{x \uparrow c} \hat{\mathbb{E}}[Y \mid X = x]}{\lim_{x \downarrow c} \hat{\mathbb{E}}[W \mid X = x] - \lim_{x \uparrow c} \hat{\mathbb{E}}[W \mid X = x]}$$

Common Mistakes

  1. Using a bandwidth that is too wide. A wide bandwidth includes units far from the cutoff, where the "local randomization" intuition breaks down. Always use data-driven bandwidth selection.
  2. Failing to test for manipulation. If applicants can manipulate their credit score to exceed 680, the RD design is invalid. The McCrary test detects manipulation, but subtle manipulation may evade detection.
  3. Overfitting with high-degree polynomials. Higher-degree polynomials can produce dramatic overfitting near the cutoff, creating artificial discontinuities. Use local linear or local quadratic regression, not global polynomials.
  4. Extrapolating beyond the cutoff neighborhood. The RD estimate is local — it applies to units near the cutoff. Do not generalize it to units far from the threshold.

J.7 — Synthetic Control Method (SCM)

When to Use

When a single unit (a state, country, company) receives treatment and you want to estimate the counterfactual: "What would have happened without treatment?" SCM constructs a weighted combination of untreated comparison units that closely matches the treated unit's pre-treatment trajectory.

Key Assumptions

| Assumption | Formal Statement | What Can Go Wrong |
|---|---|---|
| No spillover | Treatment of the target unit does not affect comparison units | Regulatory change in California affects Nevada businesses. |
| Convex hull | The treated unit's pre-treatment outcomes lie within the convex hull of the comparison units' outcomes | The treated unit is an outlier with no close comparison. |
| Good pre-treatment fit | The synthetic control closely tracks the treated unit before treatment | Poor pre-treatment fit invalidates the counterfactual. Inspect the fit plot. |

Assumption Checklist

  • [ ] The donor pool contains units unaffected by the treatment (no spillover)
  • [ ] The pre-treatment fit is close (e.g., the treated unit's pre-treatment MSPE is no more than 2-5x the median pre-treatment MSPE of the placebo fits)
  • [ ] Placebo tests: run the SCM analysis for each untreated unit (as if it were treated). The treated unit's effect should be an outlier relative to the placebo distribution.
  • [ ] The weights are non-negative and sum to one (standard SCM constraint)
  • [ ] No single donor unit receives excessive weight (a synthetic control that is 95% one state is effectively a single comparison unit, not a synthetic one)
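
The weight constraints in the checklist define a constrained least-squares problem. A minimal sketch using scipy (illustrative; production SCM implementations such as the Synth package also match on pre-treatment covariates, not just outcomes):

```python
import numpy as np
from scipy.optimize import minimize

def synthetic_control_weights(y1_pre, Y0_pre):
    """Solve for SCM weights (sketch): w >= 0, sum(w) = 1,
    minimizing pre-treatment MSPE.

    y1_pre: (T_pre,) pre-treatment outcomes of the treated unit.
    Y0_pre: (T_pre, J) pre-treatment outcomes of J donor units.
    """
    J = Y0_pre.shape[1]
    def mspe(w):
        return np.mean((y1_pre - Y0_pre @ w) ** 2)
    res = minimize(
        mspe,
        x0=np.full(J, 1.0 / J),             # start from equal weights
        bounds=[(0.0, 1.0)] * J,            # non-negativity
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```

The post-treatment gap $Y_{1t} - \sum_j w_j Y_{jt}$ is then the estimated effect.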

Common Mistakes

  1. Poor pre-treatment fit. If the synthetic control does not track the treated unit before treatment, the post-treatment gap is not a credible causal estimate. Report the pre-treatment MSPE and the MSPE ratio.
  2. Cherry-picking the donor pool. The donor pool should be determined by substantive criteria (similar economies, similar demographics), not by which units produce the largest estimated effect. Pre-register the donor pool.
  3. Ignoring inference. SCM produces a point estimate. Use placebo inference (Fisher-style permutation tests) to assess statistical significance: compute the MSPE ratio for the treated unit relative to placebo units.

J.8 — Causal ML: Heterogeneous Treatment Effects

When the goal is not a single ATE but a treatment effect function $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$, use the causal ML methods from Chapter 19. These methods combine the identification strategies above with flexible ML models for the effect estimation.

Method Selection

| Method | Best When | Assumptions | Implementation |
|---|---|---|---|
| S-learner | Simple baseline; moderate heterogeneity expected | Same as underlying identification strategy | One model $\hat{\mu}(x, w)$; CATE $= \hat{\mu}(x,1) - \hat{\mu}(x,0)$ |
| T-learner | Strong heterogeneity; large sample in both arms | Separate models must each be well-specified | Two models: $\hat{\mu}_1(x), \hat{\mu}_0(x)$ |
| X-learner | Imbalanced treatment/control; one group much larger | Good outcome models in the larger group | Cross-impute missing potential outcomes |
| R-learner | High-dimensional covariates; DML-based | Neyman orthogonality; cross-fitting | Residual-on-residual regression |
| Causal Forest | Nonparametric HTE; honest inference desired | Unconfoundedness + forest regularity conditions | grf package (R) or EconML (Python) |
| DML | High-dimensional confounders; parametric effect | Neyman orthogonality | EconML DML class |
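
As one concrete example from the table, a T-learner sketch with gradient boosting (the function name is illustrative; any regressor works, and the identification assumptions of Section J.3 still apply):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(y, w, X, X_new):
    """T-learner sketch: fit one outcome model per arm, predict
    CATE as the difference of the two predictions."""
    m1 = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])
    m0 = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])
    return m1.predict(X_new) - m0.predict(X_new)
```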

Evaluation (from Appendix G)

  • Qini curve / AUUC: Requires experimental or credibly quasi-experimental data. Rank individuals by predicted CATE, compute cumulative uplift.
  • Calibration of CATE: Bin predictions by predicted CATE; within each bin, compute the actual treatment effect. Well-calibrated CATE predictions should show actual effects matching predicted effects.
  • GATES (Group Average Treatment Effects): Partition the sample into quantiles of predicted CATE. Estimate the ATE within each quantile. Monotonically increasing effects confirm meaningful heterogeneity.

J.9 — Connecting the Graphical and Potential Outcomes Frameworks

Chapters 16 and 17 present two frameworks for causal inference. They are different languages for the same set of ideas:

| Concept | Potential Outcomes (Rubin) | Graphical Models (Pearl) |
|---|---|---|
| Causal effect | $Y(1) - Y(0)$ | $P(Y \mid do(X=1)) - P(Y \mid do(X=0))$ |
| Confounding | $(Y(0), Y(1)) \not\perp\!\!\!\perp W$ | Open backdoor path between $W$ and $Y$ |
| Identification | Conditional ignorability: $(Y(0), Y(1)) \perp\!\!\!\perp W \mid X$ | Backdoor criterion: conditioning on $X$ blocks all backdoor paths |
| Instrument | $Z$ affects $W$; $Z \perp\!\!\!\perp (Y(0), Y(1))$; exclusion restriction | $Z \to W \to Y$ with no $Z \to Y$ edge; no common causes of $Z$ and $Y$ |
| Mediator | $W \to M \to Y$; do not condition on $M$ for the total effect | Chain structure; conditioning blocks the mediated effect |
| Collider | Not explicit in the standard framework | $W \to C \leftarrow Y$; conditioning on $C$ opens a spurious path |
| Counterfactual | $Y_i(w)$: what would happen if unit $i$ received treatment $w$ | Solution to the structural equation model with $W$ set to $w$ |

When to use which framework:

  • Use the potential outcomes framework when the treatment is well-defined, the estimand (ATE, ATT, LATE) is clear, and you are primarily concerned with estimation and inference.
  • Use the graphical framework when the causal structure is complex (many variables, potential mediators, colliders), when you need to determine which variables to condition on, or when the identification strategy is non-obvious.
  • In practice, use both. Draw the DAG to identify the estimand and valid adjustment sets (graphical). Then estimate the effect using the potential outcomes estimators (matching, IPW, IV, etc.).


J.10 — Quick Reference: Method Comparison

| Method | Identifies | Key Assumption | Testable? | Internal Validity | External Validity | Data Requirement |
|---|---|---|---|---|---|---|
| RCT | ATE | Randomization | Yes (SRM, balance) | Highest | Limited (sample-specific) | Experimental |
| Matching/IPW | ATE, ATT | Unconfoundedness | No (sensitivity analysis) | Moderate | Moderate | Observational + rich covariates |
| IV | LATE | Exclusion restriction | Partially (relevance testable) | High (if valid) | Limited (compliers only) | Instrument + outcome |
| DiD | ATT | Parallel trends | Partially (pre-trend test) | High (if valid) | Moderate | Panel data |
| RD | Local ATE | Continuity at cutoff | Partially (manipulation test) | Very high (near cutoff) | Very limited (local) | Running variable + outcome |
| SCM | Effect on treated unit | Good pre-fit, no spillover | Partially (placebo tests) | High (if fit is good) | None (single unit) | Panel data, donor pool |

J.11 — Sensitivity Analysis: How Robust Is Your Estimate?

Every observational causal estimate should be accompanied by a sensitivity analysis. The question is: "How much unobserved confounding would be needed to explain away the estimated effect?"

The E-value (VanderWeele and Ding, 2017)

For an observed risk ratio $\text{RR}_{\text{obs}}$:

$$E\text{-value} = \text{RR}_{\text{obs}} + \sqrt{\text{RR}_{\text{obs}} \times (\text{RR}_{\text{obs}} - 1)}$$

The E-value is the minimum strength of association (on the risk ratio scale) that an unmeasured confounder would need to have with both the treatment and the outcome, conditional on measured covariates, to explain away the observed effect.

Interpretation: If the E-value is large (e.g., 3.5), an unmeasured confounder would need to nearly quadruple the risk of both treatment and outcome to explain the result. If the E-value is small (e.g., 1.3), a modest unmeasured confounder could account for the finding.
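
The computation is a one-liner. A sketch that also handles protective effects (RR < 1) by taking the reciprocal first, as VanderWeele and Ding recommend:

```python
import numpy as np

def e_value(rr_obs):
    """E-value for an observed risk ratio (VanderWeele & Ding, 2017)."""
    rr = max(rr_obs, 1.0 / rr_obs)   # protective effects: use 1/RR
    return rr + np.sqrt(rr * (rr - 1.0))
```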

Rosenbaum Bounds (for matched studies)

After matching on observables, Rosenbaum bounds ask: if hidden bias caused the odds of treatment to differ by a factor of $\Gamma$ between matched pairs, would the conclusion change?

  • Report the $\Gamma$ at which the $p$-value crosses 0.05. This is the "tipping point."
  • $\Gamma = 1$: no hidden bias (the matched design is valid)
  • $\Gamma = 2$: an unmeasured confounder that doubles the odds of treatment would explain the result
  • Higher $\Gamma$ values indicate more robust findings

Reporting Template

Always report:

  1. The primary causal estimate with confidence interval
  2. The identification strategy and its key assumptions
  3. Which assumptions are testable and the test results
  4. The sensitivity analysis: E-value or Rosenbaum bounds
  5. An honest statement of limitations: "This estimate is valid under the assumption that [X]. If [Y] is unmeasured, the estimate could be biased by [Z]."


The identification strategy is the most important decision in any causal analysis. Get it right, and the estimation is straightforward. Get it wrong, and no amount of statistical sophistication will save you. When in doubt, draw the DAG, identify your assumptions, and ask: what would a skeptic say?