
Chapter 18: Causal Estimation Methods — Matching, Propensity Scores, Instrumental Variables, Difference-in-Differences, and Regression Discontinuity

"The question is not whether your assumptions are true — all assumptions are false. The question is whether they are false enough to matter." — George Box, paraphrased for causal inference


Learning Objectives

By the end of this chapter, you will be able to:

  1. Implement propensity score matching (PSM) and inverse probability weighting (IPW), and explain their identifying assumptions
  2. Apply instrumental variable (IV) estimation and identify valid instruments using the exclusion restriction
  3. Implement difference-in-differences (DiD) and evaluate the parallel trends assumption
  4. Apply regression discontinuity (RD) designs and distinguish sharp from fuzzy RD
  5. Select the appropriate causal estimation method based on data structure and available assumptions

18.1 The Causal Estimation Toolbox

Chapter 16 established the potential outcomes framework: causal effects are defined as differences between potential outcomes $Y(1) - Y(0)$, and the fundamental problem is that we observe only one potential outcome per unit. Chapter 17 introduced graphical causal models — directed acyclic graphs (DAGs) that encode causal structure and provide algorithms (the backdoor criterion, the front-door criterion, do-calculus) for identifying which variables to adjust for.

This chapter is the toolbox. Given a causal question and a DAG, how do you actually estimate the causal effect from data? The answer depends on what your data looks like and which assumptions you are willing to defend.

We cover five families of methods, each suited to a different identification strategy:

| Method | When to Use | Key Assumption | Estimates |
|---|---|---|---|
| Matching / Propensity Scores | Observed confounders suffice for identification | Conditional ignorability (unconfoundedness) | ATE, ATT |
| Doubly Robust (AIPW) | Same as above, but want robustness to model misspecification | Conditional ignorability + one of two models correct | ATE, ATT |
| Instrumental Variables | Unobserved confounders exist, but a valid instrument is available | Exclusion restriction, relevance | LATE (local ATE) |
| Difference-in-Differences | Treatment varies over time across groups | Parallel trends | ATT |
| Regression Discontinuity | Treatment assigned by a cutoff on a running variable | Continuity of potential outcomes at cutoff | LATE at cutoff |

Each method trades one set of assumptions for another. There is no universal best method — there is only the method whose assumptions are most credible given your data and domain knowledge.

Prediction $\neq$ Causation: A predictive model can achieve 0.95 AUC on patient readmission and still be useless for causal inference. Every method in this chapter exists because prediction cannot answer "what if?" questions. The propensity score is itself a prediction (of treatment probability), but it is used as a tool for causal estimation, not as a causal estimate itself.


18.2 Matching Methods

The Intuition

Matching is the most intuitive causal estimation strategy: find treated and control units that are identical on observed covariates, then compare their outcomes. If Alice (treated) and Bob (control) have the same age, disease severity, and comorbidities, and Alice's readmission outcome differs from Bob's, we attribute the difference to the treatment.

Formally, matching estimates the ATE or ATT by constructing matched pairs (or sets) of treated and control units with similar covariate values $X$, then averaging the within-pair outcome differences.

Exact Matching

The simplest form: for each treated unit $i$, find a control unit $j$ with $X_j = X_i$ exactly.

import numpy as np
import pandas as pd
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass, field


def exact_match(
    df: pd.DataFrame,
    treatment_col: str,
    outcome_col: str,
    match_cols: list[str],
) -> Dict[str, float]:
    """Estimate ATT using exact matching on discrete covariates.

    For each treated unit, finds all control units with identical
    values on match_cols and averages their outcomes. Unmatched
    treated units are dropped.

    Args:
        df: Input data.
        treatment_col: Name of binary treatment column.
        outcome_col: Name of outcome column.
        match_cols: Columns to match on exactly.

    Returns:
        Dictionary with ATT estimate, number matched, and fraction dropped.
    """
    treated = df[df[treatment_col] == 1].copy()
    control = df[df[treatment_col] == 0].copy()

    matched_effects = []
    n_unmatched = 0

    for _, row in treated.iterrows():
        mask = pd.Series(True, index=control.index)
        for col in match_cols:
            mask &= control[col] == row[col]
        matches = control[mask]

        if len(matches) > 0:
            matched_effects.append(row[outcome_col] - matches[outcome_col].mean())
        else:
            n_unmatched += 1

    if len(matched_effects) == 0:
        raise ValueError("No treated units could be matched.")

    att = float(np.mean(matched_effects))
    return {
        "att": att,
        "n_treated": len(treated),
        "n_matched": len(treated) - n_unmatched,
        "frac_dropped": n_unmatched / len(treated),
    }

The curse of dimensionality kills exact matching. With $p$ continuous covariates, the probability of finding an exact match approaches zero. Even with discrete covariates, the number of strata grows exponentially: 5 binary covariates create $2^5 = 32$ cells; 10 binary covariates create 1,024 cells. Many cells will be empty, leaving treated units unmatched.
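To see the collapse numerically, here is a small synthetic sketch (sample size, seed, and function name are arbitrary) that computes the fraction of treated units with at least one exact control match as binary covariates are added:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 500

def frac_matchable(p: int) -> float:
    """Fraction of treated units with at least one exact control match."""
    X = pd.DataFrame(rng.binomial(1, 0.5, size=(n, p)),
                     columns=[f"x{j}" for j in range(p)])
    d = rng.binomial(1, 0.5, n)
    strata = X.apply(tuple, axis=1)          # one stratum per covariate pattern
    control_strata = set(strata[d == 0])
    treated_strata = strata[d == 1]
    return float(np.mean([s in control_strata for s in treated_strata]))

for p in (2, 5, 10, 15):
    print(f"p={p}: {frac_matchable(p):.2f}")
```

With 500 units, nearly every treated unit finds a match on 2 binaries, while at 15 binaries ($2^{15} = 32{,}768$ strata) almost none do.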

Nearest-Neighbor Matching

Instead of requiring exact matches, find the closest control unit in covariate space:

$$j^*(i) = \arg\min_{j : D_j = 0} \| X_i - X_j \|$$

where $\| \cdot \|$ is typically the Mahalanobis distance:

$$d_M(X_i, X_j) = \sqrt{(X_i - X_j)^\top \hat{\Sigma}^{-1} (X_i - X_j)}$$

The Mahalanobis distance accounts for different scales and correlations among covariates. Without it, matching would be dominated by whichever covariate has the largest variance.
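A minimal sketch of nearest-neighbor selection under the Mahalanobis metric, using `scipy.spatial.distance.cdist`; the array names here are illustrative, not part of the chapter's API:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(1)
X_treated = rng.normal(size=(5, 3))
X_control = rng.normal(size=(50, 3))

# Pooled covariance estimated from all units, as in the d_M formula
X_all = np.vstack([X_treated, X_control])
VI = np.linalg.inv(np.cov(X_all, rowvar=False))

# Pairwise Mahalanobis distances: shape (n_treated, n_control)
D = cdist(X_treated, X_control, metric="mahalanobis", VI=VI)
match_idx = D.argmin(axis=1)    # nearest control for each treated unit
print(match_idx)
```

Passing the inverse covariance `VI` is what distinguishes this from plain Euclidean matching: covariates are implicitly rescaled and decorrelated before distances are compared.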

Caliper Matching

A caliper imposes a maximum distance threshold: a treated unit is matched only if a control unit exists within distance $\delta$:

$$j^*(i) = \arg\min_{j : D_j = 0, \, d(X_i, X_j) < \delta} d(X_i, X_j)$$

Treated units without a match within the caliper are dropped. This trades completeness (some treated units go unmatched) for quality (all matched pairs are sufficiently similar). A common choice is a caliper of 0.2 standard deviations of the logit of the propensity score (Austin, 2011).

Know How Your Model Is Wrong: Matching discards unmatched units. This changes the estimand: the ATT becomes the "ATT among matchable treated units," which may differ from the ATT for the full treated population if matchable and unmatchable treated units differ systematically. Always report the fraction of treated units dropped and investigate whether they differ from matched units.


18.3 The Propensity Score

Motivation: Dimensionality Reduction for Matching

Rosenbaum and Rubin (1983) proved a remarkable result: instead of matching on the full covariate vector $X$ (which may be high-dimensional), you can match on a single scalar — the propensity score — and achieve the same bias reduction.

Definition

The propensity score is the probability of receiving treatment given observed covariates:

$$e(X) = P(D = 1 \mid X)$$

The Balancing Property

Mathematical Foundation: Rosenbaum and Rubin (1983) proved that the propensity score is a balancing score: conditioning on $e(X)$ makes the treatment assignment independent of the observed covariates:

$$X \perp\!\!\!\perp D \mid e(X)$$

This is the balancing property. The proof proceeds as follows. We need to show that $P(D = 1 \mid X, e(X)) = P(D = 1 \mid e(X))$. Since $e(X)$ is a function of $X$, conditioning on both $X$ and $e(X)$ is the same as conditioning on $X$ alone:

$$P(D = 1 \mid X, e(X)) = P(D = 1 \mid X) = e(X)$$

And since this depends on $X$ only through $e(X)$:

$$P(D = 1 \mid e(X)) = \mathbb{E}[P(D = 1 \mid X) \mid e(X)] = \mathbb{E}[e(X) \mid e(X)] = e(X)$$

Therefore, $P(D = 1 \mid X, e(X)) = P(D = 1 \mid e(X))$, which implies $X \perp\!\!\!\perp D \mid e(X)$. $\square$

The practical consequence: if conditional ignorability holds given $X$ — that is, $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid X$ — then it also holds given the propensity score alone: $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid e(X)$. This reduces the matching problem from $p$ dimensions to one dimension.
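The balancing property can be checked numerically. In the sketch below (synthetic data, arbitrary coefficients), the arms differ sharply on the covariate marginally, but within narrow strata of the propensity score the gap largely disappears:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(3)
n = 100_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-1.5 * x))    # true propensity score, strongly driven by x
d = rng.binomial(1, e)
df = pd.DataFrame({"x": x, "e": e, "d": d})

# Marginally, the arms differ sharply on x (confounding)
uncond_gap = df.loc[df["d"] == 1, "x"].mean() - df.loc[df["d"] == 0, "x"].mean()
print(f"unconditional gap in x: {uncond_gap:.3f}")

# Within narrow strata of e(X), the gap approximately vanishes
df["stratum"] = pd.qcut(df["e"], 50, labels=False)
within = df.groupby(["stratum", "d"])["x"].mean().unstack()
gaps = (within[1] - within[0]).abs()
print(f"mean within-stratum gap: {gaps.mean():.3f}")
```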

Estimating the Propensity Score

The true propensity score is unknown and must be estimated from data. Logistic regression is the standard choice:

from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt


def estimate_propensity_score(
    X: np.ndarray,
    treatment: np.ndarray,
    feature_names: Optional[list[str]] = None,
    regularization: float = 1.0,
) -> Dict[str, object]:
    """Estimate propensity scores using logistic regression.

    Args:
        X: Covariate matrix of shape (n, p).
        treatment: Binary treatment vector of shape (n,).
        feature_names: Optional covariate names.
        regularization: Inverse regularization strength (C parameter).

    Returns:
        Dictionary with model, propensity scores, and diagnostics.
    """
    if feature_names is None:
        feature_names = [f"x{j}" for j in range(X.shape[1])]

    model = LogisticRegression(
        C=regularization, max_iter=1000, solver="lbfgs"
    )
    model.fit(X, treatment)

    ps = model.predict_proba(X)[:, 1]

    # Diagnostics
    coef_dict = dict(zip(feature_names, model.coef_[0]))

    return {
        "model": model,
        "propensity_scores": ps,
        "coefficients": coef_dict,
        "intercept": float(model.intercept_[0]),
        "mean_ps_treated": float(ps[treatment == 1].mean()),
        "mean_ps_control": float(ps[treatment == 0].mean()),
        "min_ps": float(ps.min()),
        "max_ps": float(ps.max()),
    }

Common Misconception: "A better propensity score model (higher AUC) gives a better causal estimate." This is false and potentially dangerous. The propensity score is a balancing tool, not a prediction target. A propensity score model with AUC = 0.99 may mean that treatment and control groups are nearly perfectly separated — which implies severe positivity violations. What matters is covariate balance after adjustment, not predictive accuracy of the propensity model.
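A quick synthetic illustration of this point: as the propensity model's discriminative power rises, the overlap region shrinks. The coefficients and the 0.05/0.95 overlap window below are arbitrary choices for the demo:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(2)
n = 20_000

def overlap_and_auc(beta: float):
    """AUC of the true propensity score vs. fraction of units in overlap."""
    x = rng.normal(size=n)
    ps = 1 / (1 + np.exp(-beta * x))         # true propensity score
    d = rng.binomial(1, ps)
    auc = roc_auc_score(d, ps)
    overlap = float(np.mean((ps > 0.05) & (ps < 0.95)))
    return auc, overlap

for beta in (0.5, 2.0, 5.0):
    auc, overlap = overlap_and_auc(beta)
    print(f"beta={beta}: AUC={auc:.2f}, overlap={overlap:.2f}")
```

The stronger the covariate drives treatment, the higher the AUC and the smaller the population for which both potential outcomes are effectively observable.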

Propensity Score Matching (PSM)

Match each treated unit to the nearest control unit on the estimated propensity score:

from scipy.spatial import KDTree


def propensity_score_matching(
    ps: np.ndarray,
    treatment: np.ndarray,
    outcome: np.ndarray,
    caliper: Optional[float] = None,
    n_neighbors: int = 1,
    replacement: bool = True,
) -> Dict[str, float]:
    """Estimate ATT using propensity score matching.

    Matches each treated unit to the nearest control unit(s) on
    the estimated propensity score.

    Args:
        ps: Propensity scores of shape (n,).
        treatment: Binary treatment vector of shape (n,).
        outcome: Outcome vector of shape (n,).
        caliper: Maximum allowed distance (in PS units). None for no caliper.
        n_neighbors: Number of control matches per treated unit.
        replacement: Whether control units can be reused. This
            implementation always matches with replacement; the flag is
            kept for API symmetry.

    Returns:
        Dictionary with ATT estimate, standard error, and diagnostics.
    """
    treated_idx = np.where(treatment == 1)[0]
    control_idx = np.where(treatment == 0)[0]

    control_ps = ps[control_idx].reshape(-1, 1)
    tree = KDTree(control_ps)

    matched_effects = []
    matched_treated = []
    matched_control = []
    n_unmatched = 0

    for i in treated_idx:
        distances, indices = tree.query(
            ps[i].reshape(1, -1), k=n_neighbors
        )
        distances = distances.flatten()
        indices = indices.flatten()

        if caliper is not None:
            valid = distances < caliper
            if not valid.any():
                n_unmatched += 1
                continue
            distances = distances[valid]
            indices = indices[valid]

        control_matches = control_idx[indices]
        matched_outcome = outcome[control_matches].mean()
        effect = outcome[i] - matched_outcome

        matched_effects.append(effect)
        matched_treated.append(i)
        matched_control.extend(control_matches.tolist())

    if len(matched_effects) == 0:
        raise ValueError("No matches found. Consider widening the caliper.")

    att = float(np.mean(matched_effects))

    # Naive standard error of the matched differences. This ignores the
    # uncertainty from the matching step itself; see Abadie and Imbens (2006)
    # for matching-adjusted standard errors.
    se = float(np.std(matched_effects, ddof=1) / np.sqrt(len(matched_effects)))

    return {
        "att": att,
        "se": se,
        "ci_lower": att - 1.96 * se,
        "ci_upper": att + 1.96 * se,
        "n_treated": len(treated_idx),
        "n_matched": len(matched_effects),
        "n_unmatched": n_unmatched,
        "frac_dropped": n_unmatched / len(treated_idx),
        "n_unique_controls_used": len(set(matched_control)),
    }

Covariate Balance After Matching

The purpose of propensity score matching is to create covariate balance. Always check whether it succeeded:

def covariate_balance_report(
    X: np.ndarray,
    treatment: np.ndarray,
    ps: np.ndarray,
    matched_treated_idx: np.ndarray,
    matched_control_idx: np.ndarray,
    feature_names: Optional[list[str]] = None,
) -> pd.DataFrame:
    """Compute standardized mean differences before and after matching.

    Args:
        X: Covariate matrix of shape (n, p).
        treatment: Binary treatment vector of shape (n,).
        ps: Propensity scores of shape (n,).
        matched_treated_idx: Indices of matched treated units.
        matched_control_idx: Indices of matched control units.
        feature_names: Optional covariate names.

    Returns:
        DataFrame with pre- and post-matching SMDs per covariate.
    """
    if feature_names is None:
        feature_names = [f"x{j}" for j in range(X.shape[1])]

    results = []
    for j, name in enumerate(feature_names):
        # Pre-matching
        xt_pre = X[treatment == 1, j]
        xc_pre = X[treatment == 0, j]
        pooled_sd = np.sqrt((xt_pre.var(ddof=1) + xc_pre.var(ddof=1)) / 2)
        smd_pre = (xt_pre.mean() - xc_pre.mean()) / pooled_sd if pooled_sd > 0 else 0.0

        # Post-matching
        xt_post = X[matched_treated_idx, j]
        xc_post = X[matched_control_idx, j]
        smd_post = (xt_post.mean() - xc_post.mean()) / pooled_sd if pooled_sd > 0 else 0.0

        results.append({
            "covariate": name,
            "smd_before": float(smd_pre),
            "smd_after": float(smd_post),
            "reduction_pct": float(
                100 * (1 - abs(smd_post) / abs(smd_pre))
            ) if abs(smd_pre) > 1e-10 else 0.0,
        })

    return pd.DataFrame(results)

A Love plot (Austin, 2009) visualizes covariate balance before and after matching. The threshold $|SMD| < 0.1$ is conventional: covariates exceeding this after matching indicate residual imbalance.
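A minimal Love-plot sketch with matplotlib; the SMD values here are invented for illustration and would normally come from a balance report like the one above:

```python
import matplotlib
matplotlib.use("Agg")              # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

# Illustrative SMD values, one pair per covariate
covs = ["age", "severity", "comorbidities", "creatinine"]
smd_before = np.array([0.45, 0.80, 0.30, 0.55])
smd_after = np.array([0.04, 0.08, 0.02, 0.06])

y = np.arange(len(covs))
fig, ax = plt.subplots(figsize=(5, 3))
ax.scatter(np.abs(smd_before), y, marker="o", label="before matching")
ax.scatter(np.abs(smd_after), y, marker="x", label="after matching")
ax.axvline(0.1, linestyle="--", color="gray")   # conventional |SMD| < 0.1 line
ax.set_yticks(y)
ax.set_yticklabels(covs)
ax.set_xlabel("|standardized mean difference|")
ax.legend()
fig.tight_layout()
fig.savefig("love_plot.png")
```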


18.4 Inverse Probability Weighting (IPW)

The Intuition

PSM discards unmatched units, which changes the estimand and wastes data. Inverse probability weighting (IPW) uses all units but reweights them so that the pseudo-population has no confounding.

The idea: in the observational data, treated units with low propensity scores are "surprising" — they are similar to control units but received treatment anyway. These units carry more information about the treatment effect and should be upweighted. Conversely, treated units with high propensity scores are expected to be treated and provide less causal information.

The Horvitz-Thompson Estimator

Mathematical Foundation: The IPW estimator for the ATE is:

$$\hat{\tau}_{\text{IPW}} = \frac{1}{N} \sum_{i=1}^N \left[ \frac{D_i Y_i}{\hat{e}(X_i)} - \frac{(1 - D_i) Y_i}{1 - \hat{e}(X_i)} \right]$$

To see why this works, consider the first term. Under conditional ignorability:

$$\mathbb{E}\left[\frac{D_i Y_i}{e(X_i)}\right] = \mathbb{E}\left[\frac{D_i Y_i(1)}{e(X_i)}\right] = \mathbb{E}\left[\mathbb{E}\left[\frac{D_i Y_i(1)}{e(X_i)} \Bigg| X_i\right]\right]$$

Since $D_i \perp\!\!\!\perp Y_i(1) \mid X_i$ by ignorability, and $\mathbb{E}[D_i \mid X_i] = e(X_i)$:

$$= \mathbb{E}\left[\frac{e(X_i) \cdot \mathbb{E}[Y_i(1) \mid X_i]}{e(X_i)}\right] = \mathbb{E}[\mathbb{E}[Y_i(1) \mid X_i]] = \mathbb{E}[Y_i(1)]$$

Similarly, the second term recovers $\mathbb{E}[Y_i(0)]$. Therefore, $\mathbb{E}[\hat{\tau}_{\text{IPW}}] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \text{ATE}$.

Normalized (Hajek) Weights

The Horvitz-Thompson estimator can have large variance when propensity scores are extreme. The Hajek estimator normalizes the weights:

$$\hat{\tau}_{\text{Hajek}} = \frac{\sum_{i=1}^N \frac{D_i Y_i}{\hat{e}(X_i)}}{\sum_{i=1}^N \frac{D_i}{\hat{e}(X_i)}} - \frac{\sum_{i=1}^N \frac{(1 - D_i) Y_i}{1 - \hat{e}(X_i)}}{\sum_{i=1}^N \frac{(1 - D_i)}{1 - \hat{e}(X_i)}}$$

The Hajek estimator is biased (unlike Horvitz-Thompson) but has substantially lower variance. In practice, it almost always outperforms the unnormalized estimator in terms of mean squared error.

Implementation

def ipw_estimator(
    outcome: np.ndarray,
    treatment: np.ndarray,
    ps: np.ndarray,
    normalized: bool = True,
    trimming_threshold: float = 0.01,
) -> Dict[str, float]:
    """Estimate ATE using inverse probability weighting.

    Args:
        outcome: Outcome vector of shape (n,).
        treatment: Binary treatment vector of shape (n,).
        ps: Estimated propensity scores of shape (n,).
        normalized: If True, use Hajek (normalized) estimator.
        trimming_threshold: Trim propensity scores below this or
            above (1 - this) to stabilize weights.

    Returns:
        Dictionary with ATE estimate, standard error, and weight diagnostics.
    """
    # Trim extreme propensity scores
    ps_trimmed = np.clip(ps, trimming_threshold, 1 - trimming_threshold)
    n_trimmed = int(np.sum((ps < trimming_threshold) | (ps > 1 - trimming_threshold)))

    # Compute weights
    w1 = treatment / ps_trimmed          # Weights for treated
    w0 = (1 - treatment) / (1 - ps_trimmed)  # Weights for control

    if normalized:
        # Hajek estimator
        mu1 = np.sum(w1 * outcome) / np.sum(w1)
        mu0 = np.sum(w0 * outcome) / np.sum(w0)
    else:
        # Horvitz-Thompson
        n = len(outcome)
        mu1 = np.sum(w1 * outcome) / n
        mu0 = np.sum(w0 * outcome) / n

    ate = float(mu1 - mu0)

    # Influence-function (linearization) standard error. Note that
    # mu1 - mu0 - ate is identically zero here, so the score reduces to
    # the weighted residual terms.
    scores = (
        treatment * (outcome - mu1) / ps_trimmed
        - (1 - treatment) * (outcome - mu0) / (1 - ps_trimmed)
    )
    se = float(np.sqrt(np.mean(scores ** 2) / len(outcome)))

    # Weight diagnostics
    eff_n_treated = float(np.sum(w1) ** 2 / np.sum(w1 ** 2))
    eff_n_control = float(np.sum(w0) ** 2 / np.sum(w0 ** 2))

    return {
        "ate": ate,
        "se": se,
        "ci_lower": ate - 1.96 * se,
        "ci_upper": ate + 1.96 * se,
        "n_trimmed": n_trimmed,
        "max_weight_treated": float(w1.max()),
        "max_weight_control": float(w0.max()),
        "effective_n_treated": eff_n_treated,
        "effective_n_control": eff_n_control,
    }

Weight Diagnostics

Extreme weights signal trouble. If a few units have propensity scores near 0 or 1, their weights dominate the estimator, inflating variance and producing unstable estimates.

def ipw_weight_diagnostics(
    ps: np.ndarray,
    treatment: np.ndarray,
) -> Dict[str, float]:
    """Compute diagnostic statistics for IPW weights.

    Args:
        ps: Propensity scores of shape (n,).
        treatment: Binary treatment of shape (n,).

    Returns:
        Dictionary with effective sample sizes, weight concentration metrics.
    """
    w1 = treatment / ps
    w0 = (1 - treatment) / (1 - ps)

    # Effective sample size: (sum w)^2 / (sum w^2)
    ess_treated = np.sum(w1) ** 2 / np.sum(w1 ** 2) if np.sum(w1 ** 2) > 0 else 0
    ess_control = np.sum(w0) ** 2 / np.sum(w0 ** 2) if np.sum(w0 ** 2) > 0 else 0

    # Fraction of total weight carried by top 5% of weights
    w1_nonzero = w1[w1 > 0]
    w0_nonzero = w0[w0 > 0]

    def top_weight_share(w: np.ndarray, pct: float = 0.05) -> float:
        if len(w) == 0:
            return 0.0
        k = max(1, int(np.ceil(pct * len(w))))
        return float(np.sort(w)[-k:].sum() / w.sum())

    return {
        "ess_treated": float(ess_treated),
        "ess_control": float(ess_control),
        "ess_ratio_treated": float(ess_treated / treatment.sum()) if treatment.sum() > 0 else 0.0,
        "ess_ratio_control": float(ess_control / (1 - treatment).sum()) if (1 - treatment).sum() > 0 else 0.0,
        "top_5pct_weight_share_treated": top_weight_share(w1_nonzero),
        "top_5pct_weight_share_control": top_weight_share(w0_nonzero),
        "max_ps": float(ps.max()),
        "min_ps": float(ps.min()),
        "pct_ps_below_0.05": float(np.mean(ps < 0.05) * 100),
        "pct_ps_above_0.95": float(np.mean(ps > 0.95) * 100),
    }

Rules of thumb for weight diagnostics:

  • Effective sample size ratio below 0.5 (ESS is less than half the actual sample size) suggests extreme weights.
  • If the top 5% of weights carry more than 50% of the total weight, the estimate is driven by a handful of observations.
  • Propensity scores below 0.05 or above 0.95 signal positivity violations. Standard practice: trim at 0.01 or 0.02 and conduct sensitivity analysis.

Production Reality: In industry applications, propensity score trimming is often the difference between a sensible estimate and nonsense. At MediCore, the most severely ill patients have propensity scores above 0.98 (almost all receive Drug X), and the healthiest have scores below 0.02. Without trimming, a single treated patient with $\hat{e}(X) = 0.005$ receives a weight of $1/0.005 = 200$, dominating the entire ATE estimate. The standard approach is to restrict analysis to the overlap population: $\{i : 0.02 < \hat{e}(X_i) < 0.98\}$.
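The mechanics can be sketched in a few lines (synthetic numbers): a single near-zero propensity score gives one treated unit a weight of $1/0.005 = 200$ and collapses the effective sample size of the treated arm, and clipping restores it:

```python
import numpy as np

# One treated unit with a near-violating propensity score among 999 ordinary ones
ps = np.array([0.005] + [0.5] * 999)
w = 1 / ps                                   # IPW weights for treated units

def ess(weights: np.ndarray) -> float:
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return float(weights.sum() ** 2 / (weights ** 2).sum())

print(round(ess(w)))                         # collapsed by the single 200x weight

w_trimmed = 1 / np.clip(ps, 0.02, 0.98)      # trim at 0.02, as in the text
print(round(ess(w_trimmed)))
```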


18.5 Doubly Robust Estimation (AIPW)

The Belt and Suspenders

IPW requires a correct propensity score model. Regression adjustment (Chapter 16) requires a correct outcome model. Doubly robust methods require only one of the two to be correct — hence "belt and suspenders."

The augmented inverse probability weighting (AIPW) estimator combines both models:

Mathematical Foundation: The AIPW estimator for the ATE is:

$$\hat{\tau}_{\text{AIPW}} = \frac{1}{N} \sum_{i=1}^N \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]$$

where $\hat{\mu}_d(X) = \hat{\mathbb{E}}[Y \mid D = d, X]$ is the estimated outcome model for treatment group $d$, and $\hat{e}(X)$ is the estimated propensity score.

The estimator has a remarkable double robustness property: it is consistent if either $\hat{\mu}_d(X)$ is correctly specified or $\hat{e}(X)$ is correctly specified (Robins, Rotnitzky, and Zhao, 1994; Scharfstein, Rotnitzky, and Robins, 1999). If the outcome model is correct, the IPW residual terms average to zero. If the propensity model is correct, the outcome model residuals are properly reweighted even if the outcome model is wrong.

Implementation

from sklearn.linear_model import LinearRegression


def aipw_estimator(
    outcome: np.ndarray,
    treatment: np.ndarray,
    X: np.ndarray,
    ps: np.ndarray,
    trimming_threshold: float = 0.01,
) -> Dict[str, float]:
    """Estimate ATE using augmented inverse probability weighting (AIPW).

    Doubly robust: consistent if either the outcome model or the
    propensity score model is correctly specified.

    Args:
        outcome: Outcome vector of shape (n,).
        treatment: Binary treatment vector of shape (n,).
        X: Covariate matrix of shape (n, p).
        ps: Estimated propensity scores of shape (n,).
        trimming_threshold: Minimum/maximum propensity score.

    Returns:
        Dictionary with ATE estimate, standard error, and diagnostics.
    """
    ps_trimmed = np.clip(ps, trimming_threshold, 1 - trimming_threshold)
    n = len(outcome)

    # Fit outcome models for treated and control
    treated_mask = treatment == 1
    control_mask = treatment == 0

    mu1_model = LinearRegression().fit(X[treated_mask], outcome[treated_mask])
    mu0_model = LinearRegression().fit(X[control_mask], outcome[control_mask])

    mu1_hat = mu1_model.predict(X)  # Predicted outcome under treatment for all units
    mu0_hat = mu0_model.predict(X)  # Predicted outcome under control for all units

    # AIPW scores
    phi1 = mu1_hat + treatment * (outcome - mu1_hat) / ps_trimmed
    phi0 = mu0_hat + (1 - treatment) * (outcome - mu0_hat) / (1 - ps_trimmed)

    scores = phi1 - phi0
    ate = float(np.mean(scores))
    se = float(np.std(scores, ddof=1) / np.sqrt(n))

    return {
        "ate": ate,
        "se": se,
        "ci_lower": ate - 1.96 * se,
        "ci_upper": ate + 1.96 * se,
        "outcome_model_r2_treated": float(mu1_model.score(X[treated_mask], outcome[treated_mask])),
        "outcome_model_r2_control": float(mu0_model.score(X[control_mask], outcome[control_mask])),
    }

When to Use AIPW Over IPW or Regression

In practice, AIPW should be the default choice for estimating the ATE when conditional ignorability is the identification strategy. It is:

  1. More efficient than IPW alone (the outcome model reduces residual variance).
  2. More robust than regression alone (the IPW component corrects for outcome model misspecification).
  3. Semiparametrically efficient when both models are correctly specified (it achieves the efficiency bound).

The cost is minimal: you must fit two models instead of one. Given that you should be checking both models for diagnostics anyway, the additional complexity is negligible.
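A numerical sketch of double robustness (synthetic data, illustrative coefficients): the outcome "model" below is deliberately misspecified (per-arm means that ignore the confounder), yet AIPW with the true propensity score stays close to the truth while the regression-only estimate does not:

```python
import numpy as np

rng = np.random.RandomState(5)
n = 100_000
tau = 2.0                                    # true ATE (illustrative)

x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                     # true propensity score
d = rng.binomial(1, e)
y = tau * d + 2.0 * x + rng.normal(size=n)   # outcome depends on confounder x

# Deliberately misspecified outcome "model": per-arm means, ignoring x
mu1, mu0 = y[d == 1].mean(), y[d == 0].mean()
reg_est = mu1 - mu0                          # regression-only: confounded

# AIPW with the true propensity score repairs the misspecification
phi1 = mu1 + d * (y - mu1) / e
phi0 = mu0 + (1 - d) * (y - mu0) / (1 - e)
aipw_est = float(np.mean(phi1 - phi0))
print(f"regression-only: {reg_est:.2f}, AIPW: {aipw_est:.2f}")
```

The symmetric case — correct outcome model, badly wrong propensity model — works the same way, which is exactly the double robustness guarantee.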


18.6 Applying Propensity Score Methods: MediCore Drug X

Let us apply these methods to the MediCore Pharmaceuticals setting from Chapters 15-17. MediCore wants to estimate the causal effect of Drug X on 30-day readmission using observational EHR data from 2.1 million patients; for this analysis we simulate a 50,000-patient cohort with the same confounding structure.

def simulate_medicore_observational(
    n_patients: int = 50000,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate MediCore observational data with realistic confounding.

    Treatment (Drug X) is prescribed more often to sicker patients.
    Confounders: age, disease severity, comorbidity count, creatinine.

    Args:
        n_patients: Number of patient-admission episodes.
        seed: Random seed.

    Returns:
        DataFrame with patient features, treatment, potential outcomes.
    """
    rng = np.random.RandomState(seed)

    # Patient characteristics
    age = rng.normal(65, 12, n_patients).clip(30, 95)
    severity = rng.exponential(1.0, n_patients)
    comorbidities = rng.poisson(2.5, n_patients).clip(0, 10)
    creatinine = rng.lognormal(0.3, 0.4, n_patients).clip(0.5, 6.0)

    # Treatment assignment (confounded by severity and comorbidities)
    logit_treat = (
        -1.5
        + 0.02 * (age - 65)
        + 0.8 * severity
        + 0.3 * comorbidities
        + 0.5 * creatinine
        + rng.normal(0, 0.5, n_patients)
    )
    ps_true = 1 / (1 + np.exp(-logit_treat))
    treatment = rng.binomial(1, ps_true)

    # Potential outcomes (readmission probability)
    # Y(0): readmission without Drug X
    logit_y0 = (
        -2.0
        + 0.01 * (age - 65)
        + 0.6 * severity
        + 0.2 * comorbidities
        + 0.4 * creatinine
        + rng.normal(0, 0.3, n_patients)
    )
    prob_y0 = 1 / (1 + np.exp(-logit_y0))
    y0 = rng.binomial(1, prob_y0)

    # Y(1): readmission with Drug X (treatment effect is -5 pp on average)
    logit_y1 = logit_y0 - 0.5  # Drug X reduces readmission log-odds by 0.5
    prob_y1 = 1 / (1 + np.exp(-logit_y1))
    y1 = rng.binomial(1, prob_y1)

    # Observed outcome
    y_obs = treatment * y1 + (1 - treatment) * y0

    return pd.DataFrame({
        "age": age,
        "severity": severity,
        "comorbidities": comorbidities,
        "creatinine": creatinine,
        "treatment": treatment,
        "ps_true": ps_true,
        "y0": y0,
        "y1": y1,
        "true_ite": y1 - y0,
        "y_obs": y_obs,
    })


# Generate and analyze
medicore = simulate_medicore_observational()

true_ate = medicore["true_ite"].mean()
true_att = medicore.loc[medicore["treatment"] == 1, "true_ite"].mean()

naive_diff = (
    medicore.loc[medicore["treatment"] == 1, "y_obs"].mean()
    - medicore.loc[medicore["treatment"] == 0, "y_obs"].mean()
)

print("MediCore Drug X — Observational Analysis")
print("=" * 50)
print(f"N patients:  {len(medicore)}")
print(f"Treatment %: {medicore['treatment'].mean():.1%}")
print(f"True ATE:    {true_ate:.4f}")
print(f"True ATT:    {true_att:.4f}")
print(f"Naive diff:  {naive_diff:.4f}")
print(f"Naive bias:  {naive_diff - true_ate:+.4f}")
MediCore Drug X — Observational Analysis
==================================================
N patients:  50000
Treatment %: 63.1%
True ATE:    -0.0476
True ATT:    -0.0452
Naive diff:  0.0341
Naive bias:  +0.0817

The naive difference in means is positive — suggesting Drug X increases readmission — even though the true ATE is negative (Drug X reduces readmission by about 4.8 percentage points). The confounding is so severe that it reverses the sign of the estimate. Sicker patients receive Drug X more often and have higher readmission rates regardless of treatment.

# Estimate propensity scores
covariates = medicore[["age", "severity", "comorbidities", "creatinine"]].values
cov_names = ["age", "severity", "comorbidities", "creatinine"]

ps_result = estimate_propensity_score(
    X=covariates,
    treatment=medicore["treatment"].values,
    feature_names=cov_names,
)
ps_hat = ps_result["propensity_scores"]

# IPW estimate
ipw_result = ipw_estimator(
    outcome=medicore["y_obs"].values,
    treatment=medicore["treatment"].values,
    ps=ps_hat,
    normalized=True,
    trimming_threshold=0.02,
)

# AIPW estimate
aipw_result = aipw_estimator(
    outcome=medicore["y_obs"].values,
    treatment=medicore["treatment"].values,
    X=covariates,
    ps=ps_hat,
    trimming_threshold=0.02,
)

# PSM estimate
psm_result = propensity_score_matching(
    ps=ps_hat,
    treatment=medicore["treatment"].values,
    outcome=medicore["y_obs"].values,
    caliper=0.05,
    n_neighbors=3,
)

print("\nMethod Comparison")
print("=" * 50)
print(f"True ATE:    {true_ate:.4f}")
print(f"Naive:       {naive_diff:.4f}  (bias: {naive_diff - true_ate:+.4f})")
print(f"IPW:         {ipw_result['ate']:.4f}  (bias: {ipw_result['ate'] - true_ate:+.4f})")
print(f"AIPW:        {aipw_result['ate']:.4f}  (bias: {aipw_result['ate'] - true_ate:+.4f})")
print(f"PSM (ATT):   {psm_result['att']:.4f}  (bias vs ATT: {psm_result['att'] - true_att:+.4f})")
Method Comparison
==================================================
True ATE:    -0.0476
Naive:       0.0341  (bias: +0.0817)
IPW:         -0.0451  (bias: +0.0025)
AIPW:        -0.0468  (bias: +0.0008)
PSM (ATT):   -0.0439  (bias vs ATT: +0.0013)

All three propensity-score-based methods recover the true causal effect. AIPW has the smallest bias here, consistent with its double robustness, and its semiparametric efficiency gives it the lowest variance when both working models are approximately correct. The naive estimate had the wrong sign.


18.7 Instrumental Variables (IV)

When Conditional Ignorability Fails

Propensity score methods assume that all confounders are observed: $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid X$. In many settings, this assumption is indefensible. A physician prescribes Drug X based on clinical judgment, lab results, and experience — factors that are only partially captured in EHR data. An unmeasured confounder (e.g., the physician's private assessment of disease trajectory) can bias every propensity-score-based estimate.

Instrumental variables (IV) provide an identification strategy that works even when there are unmeasured confounders between treatment and outcome.

The Intuition

An instrument $Z$ is a variable that:

  1. Affects treatment (relevance): $Z$ changes the probability of receiving treatment.
  2. Does not directly affect the outcome (exclusion restriction): $Z$ affects $Y$ only through its effect on $D$.
  3. Is not associated with unmeasured confounders (independence): $Z$ is as-good-as-random, or at least uncorrelated with the unmeasured confounders $U$.

The classic example: distance from a patient's home to the nearest hospital that preferentially prescribes Drug X. Patients who live closer are more likely to receive Drug X (relevance). But distance itself does not directly cause readmission (exclusion). And distance is plausibly unrelated to unmeasured health factors (independence — though this requires careful argument, since neighborhood characteristics may correlate with health).

Formal Setup

Consider the structural model:

$$Y_i = \alpha + \tau D_i + \beta X_i + \gamma U_i + \varepsilon_i$$

where $U_i$ is an unmeasured confounder. OLS on $Y_i = a + \tau D_i + b X_i + e_i$ gives a biased estimate of $\tau$ because $D$ is correlated with $U$ (omitted variable bias).

An instrument $Z$ satisfies:

  1. Relevance: $\text{Cov}(Z, D) \neq 0$
  2. Exclusion restriction: $Z$ does not appear in the outcome equation (no direct effect on $Y$)
  3. Independence: $\text{Cov}(Z, U) = 0$ and $\text{Cov}(Z, \varepsilon) = 0$

The Wald Estimator (Binary Instrument)

With a binary instrument $Z \in \{0, 1\}$, the IV estimator simplifies to:

$$\hat{\tau}_{\text{IV}} = \frac{\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]}{\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0]} = \frac{\text{Reduced form}}{\text{First stage}}$$

The numerator (reduced form) is the effect of the instrument on the outcome. The denominator (first stage) is the effect of the instrument on treatment. The IV estimator divides the total effect by the fraction attributable to the treatment channel.
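As a sketch with simulated data (all parameters here are illustrative, not MediCore values), the Wald estimator is literally this ratio of two differences in means:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100_000

# Hypothetical binary encouragement instrument, assigned as-good-as-randomly
z = rng.binomial(1, 0.5, n)

# Unmeasured confounder affecting both treatment and outcome
u = rng.normal(0, 1, n)

# Treatment uptake: shifted by the instrument, confounded by u
d = (0.8 * z + 0.5 * u + rng.normal(0, 1, n) > 0.5).astype(int)

# Outcome: true treatment effect is -0.4, confounded through u
y = -0.4 * d + 0.6 * u + rng.normal(0, 1, n)

reduced_form = y[z == 1].mean() - y[z == 0].mean()  # effect of Z on Y
first_stage = d[z == 1].mean() - d[z == 0].mean()   # effect of Z on D
wald = reduced_form / first_stage

naive = y[d == 1].mean() - y[d == 0].mean()         # confounded contrast
print(f"Naive difference in means: {naive:+.3f}")
print(f"Wald / IV estimate:        {wald:+.3f}")
```

The naive contrast is pulled away from the true effect of $-0.4$ by the confounder $u$; the Wald ratio recovers it despite $u$ never being observed.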

Two-Stage Least Squares (2SLS)

With continuous instruments or multiple instruments, 2SLS generalizes the Wald estimator.

Stage 1: Regress the treatment on the instrument(s) and controls:

$$D_i = \pi_0 + \pi_1 Z_i + \pi_2 X_i + \nu_i$$

Save the fitted values $\hat{D}_i$.

Stage 2: Regress the outcome on the fitted treatment and controls:

$$Y_i = \alpha + \tau \hat{D}_i + \beta X_i + \epsilon_i$$

The coefficient $\hat{\tau}$ from Stage 2 is the 2SLS estimate.

import statsmodels.api as sm
from linearmodels.iv import IV2SLS


def two_stage_least_squares(
    outcome: np.ndarray,
    treatment: np.ndarray,
    instrument: np.ndarray,
    controls: Optional[np.ndarray] = None,
    control_names: Optional[list[str]] = None,
) -> Dict[str, float]:
    """Estimate causal effect using 2SLS instrumental variables.

    Args:
        outcome: Outcome vector of shape (n,).
        treatment: Endogenous treatment of shape (n,).
        instrument: Instrument(s) of shape (n,) or (n, k).
        controls: Exogenous control variables of shape (n, p). Optional.
        control_names: Names for control variables.

    Returns:
        Dictionary with 2SLS estimate, first-stage F, and diagnostics.
    """
    n = len(outcome)
    instrument = instrument.reshape(n, -1)  # ensure 2-D shape (n, k)

    # Stage 1: treatment ~ instrument + controls
    if controls is not None:
        stage1_X = np.column_stack([instrument, controls])
    else:
        stage1_X = instrument
    stage1_X = sm.add_constant(stage1_X)

    stage1 = sm.OLS(treatment, stage1_X).fit()
    d_hat = stage1.fittedvalues

    # First-stage F-statistic for instrument(s)
    n_instruments = instrument.shape[1]
    # Test joint significance of instrument coefficients
    # Coefficients 1 through n_instruments (0 is constant)
    r_matrix = np.zeros((n_instruments, stage1_X.shape[1]))
    for k in range(n_instruments):
        r_matrix[k, 1 + k] = 1
    first_stage_f = float(stage1.f_test(r_matrix).fvalue)

    # Stage 2: outcome ~ d_hat + controls
    if controls is not None:
        stage2_X = np.column_stack([d_hat, controls])
    else:
        stage2_X = d_hat.reshape(-1, 1)
    stage2_X = sm.add_constant(stage2_X)

    stage2 = sm.OLS(outcome, stage2_X).fit()

    # The stage-2 standard errors are incorrect because d_hat is estimated.
    # Use a dedicated IV package (e.g., linearmodels) for proper inference;
    # the "se_naive" key below flags that the SE reported here is naive.
    tau_hat = float(stage2.params[1])
    se_naive = float(stage2.bse[1])

    return {
        "estimate": tau_hat,
        "se_naive": se_naive,
        "ci_lower": tau_hat - 1.96 * se_naive,
        "ci_upper": tau_hat + 1.96 * se_naive,
        "first_stage_f": first_stage_f,
        "first_stage_r2": float(stage1.rsquared),
        "weak_instrument": first_stage_f < 10,
        "n": n,
        "n_instruments": n_instruments,
    }

Implementation Note: The standard errors from the manual 2SLS procedure above are incorrect because they do not account for the fact that $\hat{D}$ is estimated. Always use a dedicated IV package for inference. In Python, linearmodels.iv.IV2SLS or statsmodels.sandbox.regression.gmm.IV2SLS provide correct standard errors. The manual procedure above is for pedagogical clarity only.
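For intuition about what the correction involves, here is a minimal numpy sketch for the just-identified case (one endogenous regressor, one instrument). The key point is that the residuals use the actual treatment $D$, not the fitted $\hat{D}$, before forming the variance. This is a hypothetical helper for pedagogy, not a replacement for a dedicated package:

```python
import numpy as np

def tsls_just_identified(y, d, z):
    """2SLS with the textbook homoskedastic variance formula (a sketch).

    Hypothetical helper: one endogenous regressor, one instrument,
    and an intercept. Residuals use the ACTUAL treatment D.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), d])   # structural regressors
    Z = np.column_stack([np.ones(n), z])   # instruments

    # Just-identified 2SLS: beta = (Z'X)^{-1} Z'y
    beta = np.linalg.solve(Z.T @ X, Z.T @ y)

    # Correct residuals use the actual treatment, not the fitted values
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])

    # Variance: sigma^2 (X' P_Z X)^{-1}, computed without ever forming
    # the n-by-n projection matrix P_Z
    ZtX = Z.T @ X
    XPzX = ZtX.T @ np.linalg.solve(Z.T @ Z, ZtX)
    V = sigma2 * np.linalg.inv(XPzX)
    return float(beta[1]), float(np.sqrt(V[1, 1]))

# Simulated data with an unmeasured confounder u (illustrative only)
rng = np.random.RandomState(1)
n = 20_000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.6 * z + 0.7 * u + rng.normal(size=n)
y = -0.4 * d + 0.8 * u + rng.normal(size=n)

tau_hat, se = tsls_just_identified(y, d, z)
print(f"2SLS estimate: {tau_hat:+.4f}  (SE: {se:.4f})")
```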

Weak Instruments

Common Misconception: "Any valid instrument works." An instrument can satisfy the exclusion restriction perfectly and still produce terrible estimates if it is weakly correlated with treatment. The first-stage F-statistic measures instrument strength:

  • $F > 10$: Generally adequate (Staiger and Stock, 1997).
  • $F \in [5, 10]$: Potentially problematic. Consider weak-instrument-robust inference (Anderson-Rubin test).
  • $F < 5$: The instrument is likely too weak. 2SLS estimates will be biased toward the OLS estimate, confidence intervals will be unreliable, and the asymptotic approximation breaks down.

The weak instrument problem is not merely about imprecision — it causes bias. With a weak instrument, the 2SLS estimator is biased toward the OLS estimator, which is biased due to confounding. This defeats the purpose of using IV.
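A quick simulation (illustrative parameters, not MediCore data) makes the bias visible. The true effect is zero, and OLS would be biased toward $\text{Cov}(D, U\text{-part})/\text{Var}(D) \approx 0.5$ here; as the first stage weakens, the median 2SLS estimate drifts toward that OLS value:

```python
import numpy as np

def median_2sls(pi, n=500, reps=1000, seed=7):
    """Median IV (Wald) estimate across simulations for first-stage
    strength pi. True effect is zero; confounder u biases OLS upward."""
    rng = np.random.RandomState(seed)
    estimates = []
    for _ in range(reps):
        z = rng.normal(size=n)
        u = rng.normal(size=n)
        d = pi * z + u + rng.normal(size=n)
        y = 0.0 * d + u + rng.normal(size=n)  # true effect: 0
        # IV estimate = Cov(Z, Y) / Cov(Z, D)
        estimates.append(np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1])
    return float(np.median(estimates))

for pi in [1.0, 0.2, 0.02]:
    print(f"first-stage coef {pi:>4}: median 2SLS estimate {median_2sls(pi):+.3f}")
```

With a strong first stage the median estimate sits near the true zero; with a near-zero first stage it approaches the confounded OLS value, exactly the failure mode described above.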

The LATE Interpretation

When the instrument does not affect all units equally, the 2SLS estimator identifies a local average treatment effect (LATE) — the average treatment effect among compliers (Imbens and Angrist, 1994).

Compliers are units whose treatment status is affected by the instrument:

| Type | $D(Z=0)$ | $D(Z=1)$ | Description |
|---|---|---|---|
| Complier | 0 | 1 | Takes treatment when encouraged, does not when not encouraged |
| Always-taker | 1 | 1 | Takes treatment regardless of instrument |
| Never-taker | 0 | 0 | Does not take treatment regardless |
| Defier | 1 | 0 | Does the opposite of the instrument |

Under the monotonicity assumption (no defiers), 2SLS estimates the LATE:

$$\hat{\tau}_{\text{2SLS}} \xrightarrow{p} \text{LATE} = \mathbb{E}[Y(1) - Y(0) \mid \text{Complier}]$$

This is not the ATE. It is the treatment effect for a specific subpopulation — those whose behavior is affected by the instrument. In the MediCore context, compliers are patients who would receive Drug X if they live near a Drug-X-prescribing hospital and would not receive it otherwise. Always-takers (patients who demand Drug X regardless) and never-takers (patients who refuse Drug X regardless) are excluded.
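Although individual compliers cannot be identified, their share of the population can: under monotonicity, the first stage equals the complier share. A small sketch with hypothetical strata proportions:

```python
import numpy as np

rng = np.random.RandomState(3)
n = 50_000

# Latent principal strata (hypothetical shares): 20% always-takers,
# 30% never-takers, 50% compliers; monotonicity rules out defiers.
strata = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
z = rng.binomial(1, 0.5, n)

# Compliers follow the instrument; the other strata ignore it
d = np.where(strata == "always", 1,
             np.where(strata == "never", 0, z))

# Under monotonicity, the first stage identifies the complier share
complier_share = d[z == 1].mean() - d[z == 0].mean()
print(f"Estimated complier share: {complier_share:.3f}  (true: 0.500)")
```

A small complier share is a warning sign: the LATE then describes a thin slice of the population, and extrapolation to the ATE becomes more speculative.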

MediCore IV Example: Hospital Distance

def simulate_medicore_iv(
    n_patients: int = 50000,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate MediCore data with an instrumental variable.

    Instrument: distance to nearest hospital that preferentially
    prescribes Drug X. Closer patients are more likely to receive
    Drug X, but distance does not directly affect readmission
    (exclusion restriction).

    Unmeasured confounder: physician's private assessment of patient
    trajectory (not in EHR data).

    Args:
        n_patients: Number of patients.
        seed: Random seed.

    Returns:
        DataFrame with patient data, instrument, and potential outcomes.
    """
    rng = np.random.RandomState(seed)

    # Observed covariates
    age = rng.normal(65, 12, n_patients).clip(30, 95)
    severity = rng.exponential(1.0, n_patients)
    comorbidities = rng.poisson(2.5, n_patients).clip(0, 10)

    # Unmeasured confounder (physician's private assessment)
    physician_assessment = 0.5 * severity + rng.normal(0, 1, n_patients)

    # Instrument: distance to Drug-X hospital (miles)
    distance = rng.exponential(15, n_patients).clip(1, 100)

    # Treatment assignment (confounded by measured + unmeasured)
    logit_treat = (
        0.5
        - 0.03 * distance         # Instrument: closer → more likely treated
        + 0.4 * severity           # Measured confounder
        + 0.3 * comorbidities      # Measured confounder
        + 0.6 * physician_assessment  # UNMEASURED confounder
        + rng.normal(0, 0.5, n_patients)
    )
    ps_true = 1 / (1 + np.exp(-logit_treat))
    treatment = rng.binomial(1, ps_true)

    # Potential outcomes
    logit_y0 = (
        -1.5
        + 0.01 * (age - 65)
        + 0.5 * severity
        + 0.15 * comorbidities
        + 0.4 * physician_assessment  # Unmeasured confounder affects outcome
        + rng.normal(0, 0.3, n_patients)
    )
    prob_y0 = 1 / (1 + np.exp(-logit_y0))
    y0 = rng.binomial(1, prob_y0)

    logit_y1 = logit_y0 - 0.4  # True treatment effect
    prob_y1 = 1 / (1 + np.exp(-logit_y1))
    y1 = rng.binomial(1, prob_y1)

    y_obs = treatment * y1 + (1 - treatment) * y0

    return pd.DataFrame({
        "age": age,
        "severity": severity,
        "comorbidities": comorbidities,
        "physician_assessment": physician_assessment,
        "distance": distance,
        "treatment": treatment,
        "y0": y0, "y1": y1,
        "true_ite": y1 - y0,
        "y_obs": y_obs,
    })


med_iv = simulate_medicore_iv()

true_ate = med_iv["true_ite"].mean()

# OLS (biased: omits physician_assessment)
ols_X = sm.add_constant(
    med_iv[["treatment", "age", "severity", "comorbidities"]].values
)
ols_result = sm.OLS(med_iv["y_obs"].values, ols_X).fit(cov_type="HC1")
ols_estimate = ols_result.params[1]

# 2SLS with distance as instrument
iv_result = two_stage_least_squares(
    outcome=med_iv["y_obs"].values,
    treatment=med_iv["treatment"].values,
    instrument=med_iv["distance"].values,
    controls=med_iv[["age", "severity", "comorbidities"]].values,
)

print("MediCore IV Analysis (Distance as Instrument)")
print("=" * 55)
print(f"True ATE:          {true_ate:.4f}")
print(f"OLS (biased):      {ols_estimate:.4f}  (bias: {ols_estimate - true_ate:+.4f})")
print(f"2SLS estimate:     {iv_result['estimate']:.4f}  (bias: {iv_result['estimate'] - true_ate:+.4f})")
print(f"First-stage F:     {iv_result['first_stage_f']:.1f}")
print(f"Weak instrument?   {iv_result['weak_instrument']}")
MediCore IV Analysis (Distance as Instrument)
=======================================================
True ATE:          -0.0391
OLS (biased):      0.0213  (bias: +0.0604)
2SLS estimate:     -0.0356  (bias: +0.0035)
First-stage F:     287.4
Weak instrument?   False

OLS produces a biased estimate (wrong sign) because it omits the physician's private assessment. The 2SLS estimate using distance as an instrument recovers close to the true effect. The first-stage F-statistic of 287 confirms a strong instrument.

Know How Your Model Is Wrong: The 2SLS estimate of $-0.036$ is the LATE, not the ATE. It applies to compliers: patients whose Drug X receipt depends on proximity to a prescribing hospital. Patients who would always receive Drug X (because their condition demands it) or who would never receive it (because of contraindications) are excluded from this estimate. Whether the LATE is similar to the ATE depends on whether compliers are representative of the full population — a question that cannot be answered from the data alone.


18.8 Difference-in-Differences (DiD)

The Intuition

Imagine a policy change: on January 1, 2024, a new hospital network policy requires that all patients admitted with heart failure receive Drug X upon discharge. Before this date, Drug X prescription was at physician discretion. Hospitals in the network adopt the policy; hospitals outside the network continue with standard practice.

Difference-in-differences compares the change in outcomes over time between the treated group (policy hospitals) and the control group (non-policy hospitals):

$$\hat{\tau}_{\text{DiD}} = (\bar{Y}_{\text{treated, post}} - \bar{Y}_{\text{treated, pre}}) - (\bar{Y}_{\text{control, post}} - \bar{Y}_{\text{control, pre}})$$

The first difference removes time-invariant group differences (policy hospitals may have sicker patients). The second difference removes common time trends (readmission rates may be changing for everyone). What remains — under the parallel trends assumption — is the causal effect of the policy.
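With hypothetical cell means (illustrative numbers only, not the MediCore results shown later), the two subtractions are:

```python
# Hypothetical readmission rates (illustrative numbers)
y_treated_pre, y_treated_post = 0.19, 0.14   # policy hospitals
y_control_pre, y_control_post = 0.17, 0.15   # non-policy hospitals

treated_change = y_treated_post - y_treated_pre  # includes the time trend
control_change = y_control_post - y_control_pre  # time trend only

did = treated_change - control_change            # treatment-specific change
print(f"DiD estimate: {did:+.3f}")
```

The treated group improved by 5 points, but 2 of those points were a trend shared with controls, leaving a 3-point effect attributable to the policy.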

Formal Setup

Let $G_i \in \{0, 1\}$ denote group membership (treatment vs. control group) and $T_t \in \{0, 1\}$ denote time period (pre vs. post). The 2x2 DiD estimator is:

$$\hat{\tau}_{\text{DiD}} = (\bar{Y}_{G=1,T=1} - \bar{Y}_{G=1,T=0}) - (\bar{Y}_{G=0,T=1} - \bar{Y}_{G=0,T=0})$$

Equivalently, the DiD estimator can be obtained from the regression:

$$Y_{it} = \alpha + \beta_G \cdot G_i + \beta_T \cdot T_t + \tau \cdot (G_i \times T_t) + \varepsilon_{it}$$

where $\hat{\tau}$ is the DiD estimate. With individual fixed effects and time fixed effects, this becomes the two-way fixed effects (TWFE) model:

$$Y_{it} = \alpha_i + \lambda_t + \tau \cdot D_{it} + \varepsilon_{it}$$

where $\alpha_i$ are individual (or group) fixed effects and $\lambda_t$ are time fixed effects.

Mathematical Foundation: The identifying assumption is that, in the absence of treatment, the treated and control groups would have followed parallel trends:

$$\mathbb{E}[Y_{it}(0) \mid G_i = 1] - \mathbb{E}[Y_{it-1}(0) \mid G_i = 1] = \mathbb{E}[Y_{it}(0) \mid G_i = 0] - \mathbb{E}[Y_{it-1}(0) \mid G_i = 0]$$

This does not require that the two groups have the same levels of the outcome — only that they follow the same trends in the absence of treatment. Policy hospitals can have higher readmission rates than non-policy hospitals, as long as the trajectory over time would have been the same without the policy change.

This assumption is untestable for the post-treatment period (we cannot observe $Y(0)$ for the treated group after treatment). Its plausibility can, however, be checked against pre-treatment data: if the two groups followed parallel trends before treatment, it is more credible (though not guaranteed) that they would have continued to do so.

Implementation

def difference_in_differences(
    outcome: np.ndarray,
    group: np.ndarray,
    time: np.ndarray,
    cluster_id: Optional[np.ndarray] = None,
) -> Dict[str, float]:
    """Estimate ATT using difference-in-differences.

    Args:
        outcome: Outcome vector of shape (n,).
        group: Binary group indicator (1=treated, 0=control) of shape (n,).
        time: Binary time indicator (1=post, 0=pre) of shape (n,).
        cluster_id: Optional cluster IDs for clustered standard errors.

    Returns:
        Dictionary with DiD estimate, standard error, and components.
    """
    # Four cell means
    y_11 = outcome[(group == 1) & (time == 1)].mean()
    y_10 = outcome[(group == 1) & (time == 0)].mean()
    y_01 = outcome[(group == 0) & (time == 1)].mean()
    y_00 = outcome[(group == 0) & (time == 0)].mean()

    did = float((y_11 - y_10) - (y_01 - y_00))

    # Regression-based with robust SEs
    interaction = group * time
    X = np.column_stack([
        np.ones(len(outcome)),
        group,
        time,
        interaction,
    ])

    if cluster_id is not None:
        model = sm.OLS(outcome, X).fit(
            cov_type="cluster",
            cov_kwds={"groups": cluster_id},
        )
    else:
        model = sm.OLS(outcome, X).fit(cov_type="HC1")

    se = float(model.bse[3])

    return {
        "did_estimate": did,
        "se": se,
        "ci_lower": did - 1.96 * se,
        "ci_upper": did + 1.96 * se,
        "y_treated_post": float(y_11),
        "y_treated_pre": float(y_10),
        "y_control_post": float(y_01),
        "y_control_pre": float(y_00),
        "treated_change": float(y_11 - y_10),
        "control_change": float(y_01 - y_00),
    }

Event Study Plots

The most important diagnostic for DiD is the event study (or dynamic DiD) specification. Estimate treatment effects for each time period relative to the policy change:

$$Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \tau_k \cdot \mathbf{1}[K_{it} = k] + \varepsilon_{it}$$

where $K_{it}$ is the number of periods relative to the treatment date. The period $k = -1$ (one period before treatment) is the reference category.

def event_study(
    df: pd.DataFrame,
    outcome_col: str,
    unit_col: str,
    time_col: str,
    treated_col: str,
    event_time_col: str,
    pre_periods: int = 4,
    post_periods: int = 4,
) -> pd.DataFrame:
    """Estimate dynamic treatment effects for event study plot.

    Args:
        df: Panel data with unit, time, treatment indicators.
        outcome_col: Name of outcome variable.
        unit_col: Name of unit identifier.
        time_col: Name of time variable.
        treated_col: Binary indicator for treatment group.
        event_time_col: Periods relative to treatment (negative = pre).
        pre_periods: Number of pre-treatment periods to include.
        post_periods: Number of post-treatment periods to include.

    Returns:
        DataFrame with event-time coefficients and confidence intervals.
    """
    df = df.copy()

    # Create event-time dummies (omitting k = -1 as reference)
    event_dummies = []
    for k in range(-pre_periods, post_periods + 1):
        if k == -1:
            continue  # Reference period
        col_name = f"k_{k}" if k < 0 else f"k_plus_{k}"
        df[col_name] = ((df[event_time_col] == k) & (df[treated_col] == 1)).astype(float)
        event_dummies.append(col_name)

    # Unit and time fixed effects via demeaning
    # (Simplified: using OLS with dummies for illustration)
    unit_dummies = pd.get_dummies(df[unit_col], prefix="unit", drop_first=True, dtype=float)
    time_dummies = pd.get_dummies(df[time_col], prefix="time", drop_first=True, dtype=float)

    X = pd.concat([df[event_dummies], unit_dummies, time_dummies], axis=1)
    model = sm.OLS(df[outcome_col], sm.add_constant(X)).fit(cov_type="HC1")

    results = []
    for k in range(-pre_periods, post_periods + 1):
        if k == -1:
            results.append({"event_time": k, "coef": 0.0, "se": 0.0,
                            "ci_lower": 0.0, "ci_upper": 0.0})
            continue
        col_name = f"k_{k}" if k < 0 else f"k_plus_{k}"
        if col_name in model.params:
            coef = float(model.params[col_name])
            se = float(model.bse[col_name])
            results.append({
                "event_time": k,
                "coef": coef,
                "se": se,
                "ci_lower": coef - 1.96 * se,
                "ci_upper": coef + 1.96 * se,
            })

    return pd.DataFrame(results)

Reading event study plots:

  1. Pre-treatment coefficients should be near zero. If the $\tau_k$ for $k < 0$ are statistically significant, the parallel trends assumption is questionable — the groups were already diverging before treatment.
  2. Post-treatment coefficients show the treatment effect over time. An immediate effect ($\tau_0 \neq 0$) followed by a sustained effect ($\tau_1, \tau_2, \ldots \neq 0$) suggests a real causal effect.
  3. Pre-treatment trends (a linear pattern in $\tau_{-4}, \tau_{-3}, \tau_{-2}$) suggest a violation even if individual coefficients are not significant.

MediCore DiD Example: Policy Change

def simulate_medicore_did(
    n_hospitals: int = 200,
    n_periods: int = 12,
    treatment_period: int = 6,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate panel data for MediCore policy DiD analysis.

    Half of hospitals adopt a Drug X mandate at treatment_period.
    The other half serve as controls.

    Args:
        n_hospitals: Number of hospitals.
        n_periods: Number of time periods (months).
        treatment_period: Period when treatment starts (0-indexed).
        seed: Random seed.

    Returns:
        Panel DataFrame with hospital-month observations.
    """
    rng = np.random.RandomState(seed)

    hospitals = []
    for h in range(n_hospitals):
        treated_group = h < n_hospitals // 2  # First half is treated

        # Hospital-level fixed effect
        hospital_fe = rng.normal(0, 0.05)

        for t in range(n_periods):
            # Common time trend (readmission rates declining)
            time_trend = -0.003 * t

            # Baseline readmission rate
            base_rate = 0.18 + hospital_fe + time_trend

            # Treatment effect (only for treated hospitals post-policy)
            post = t >= treatment_period
            treat_effect = -0.04 * post * treated_group  # 4 pp reduction

            # Observed readmission rate with noise
            n_patients = rng.poisson(200)
            rate = base_rate + treat_effect + rng.normal(0, 0.01)
            rate = np.clip(rate, 0, 1)

            hospitals.append({
                "hospital_id": h,
                "period": t,
                "treated_group": int(treated_group),
                "post": int(post),
                "event_time": t - treatment_period,
                "readmission_rate": float(rate),
                "n_patients": n_patients,
            })

    return pd.DataFrame(hospitals)


did_data = simulate_medicore_did()

# 2x2 DiD
did_result = difference_in_differences(
    outcome=did_data["readmission_rate"].values,
    group=did_data["treated_group"].values,
    time=did_data["post"].values,
    cluster_id=did_data["hospital_id"].values,
)

print("MediCore DiD — Drug X Policy Mandate")
print("=" * 50)
print(f"DiD estimate:      {did_result['did_estimate']:.4f}")
print(f"  95% CI:          [{did_result['ci_lower']:.4f}, {did_result['ci_upper']:.4f}]")
print(f"  Clustered SE:    {did_result['se']:.4f}")
print(f"Treated change:    {did_result['treated_change']:.4f}")
print(f"Control change:    {did_result['control_change']:.4f}")
MediCore DiD — Drug X Policy Mandate
==================================================
DiD estimate:      -0.0398
  95% CI:          [-0.0452, -0.0344]
  Clustered SE:    0.0028
Treated change:    -0.0562
Control change:    -0.0164

The DiD estimate is $-0.040$: the policy mandate reduced readmission rates by approximately 4 percentage points. Both groups show declining readmission rates over time (the common trend), but the treated group's decline is faster, and DiD isolates the treatment-specific component.

Staggered Adoption

Research Insight: The standard TWFE estimator can produce biased estimates when treatment is adopted at different times by different units (staggered adoption). Goodman-Bacon (2021) showed that the TWFE coefficient is a weighted average of all possible 2x2 DiD comparisons, and some of these comparisons use already-treated units as controls. When treatment effects change over time (dynamic effects), these "forbidden" comparisons can bias the TWFE estimator — even flipping the sign.

Modern solutions include: Callaway and Sant'Anna (2021), who construct group-time ATTs using only not-yet-treated units as controls; Sun and Abraham (2021), who use interaction-weighted estimators; and de Chaisemartin and D'Haultfoeuille (2020), who propose a robust estimator. For staggered adoption problems, use one of these methods rather than the standard TWFE regression.


18.9 Regression Discontinuity (RD)

The Intuition

A consumer applies for a credit card from Meridian Financial. If her FICO score is 660 or above, she is approved; below 660, she is denied. Applicants with FICO scores of 659 and 661 are essentially identical in every respect except that one received the credit card and the other did not. This creates a natural experiment at the cutoff.

Regression discontinuity (RD) exploits this discontinuous jump in treatment probability at a threshold on a continuous running variable to identify the causal effect at the cutoff.

Sharp RD

In a sharp RD, treatment is a deterministic function of the running variable:

$$D_i = \mathbf{1}[X_i \geq c]$$

where $X_i$ is the running variable (e.g., credit score) and $c$ is the cutoff (e.g., 660).

Mathematical Foundation: The sharp RD estimand is:

$$\tau_{\text{SRD}} = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]$$

The identifying assumption is that the conditional expectation functions $\mathbb{E}[Y(0) \mid X = x]$ and $\mathbb{E}[Y(1) \mid X = x]$ are continuous at $x = c$. This means that units just below and just above the cutoff have similar potential outcomes — the only thing that changes discontinuously at the cutoff is the treatment.

If the potential outcome functions are continuous at $c$:

$$\lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] = \lim_{x \downarrow c} \mathbb{E}[Y(1) \mid X = x]$$ $$\lim_{x \uparrow c} \mathbb{E}[Y \mid X = x] = \lim_{x \uparrow c} \mathbb{E}[Y(0) \mid X = x]$$

By continuity: $\lim_{x \downarrow c} \mathbb{E}[Y(1) \mid X = x] = \mathbb{E}[Y(1) \mid X = c]$ and $\lim_{x \uparrow c} \mathbb{E}[Y(0) \mid X = x] = \mathbb{E}[Y(0) \mid X = c]$.

Therefore, $\tau_{\text{SRD}} = \mathbb{E}[Y(1) - Y(0) \mid X = c]$, the average treatment effect at the cutoff.

Fuzzy RD

In a fuzzy RD, the cutoff does not perfectly determine treatment — it only changes the probability of treatment:

$$\lim_{x \downarrow c} P(D = 1 \mid X = x) \neq \lim_{x \uparrow c} P(D = 1 \mid X = x)$$

For example, crossing the 660 credit-score threshold may raise the approval probability from 30% to 90%, but some applicants below 660 are approved (manual override) and some above 660 are denied (other risk factors).

The fuzzy RD estimand is:

$$\tau_{\text{FRD}} = \frac{\lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]}{\lim_{x \downarrow c} \mathbb{E}[D \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[D \mid X = x]}$$

This is the ratio of the jump in the outcome to the jump in treatment probability — exactly the Wald (IV) estimator, with $\mathbf{1}[X_i \geq c]$ as the instrument. The fuzzy RD estimate is a LATE: the causal effect among compliers at the cutoff (units whose treatment changes because of the cutoff).
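Plugging in hypothetical numbers for the credit-card example (all values illustrative): suppose the outcome limit jumps from 0.11 just below the cutoff to 0.08 just above it, while approval probability jumps from 30% to 90%:

```python
# Illustrative local limits at the cutoff (hypothetical numbers)
y_above, y_below = 0.08, 0.11   # outcome just above / below 660
d_above, d_below = 0.90, 0.30   # approval probability at the cutoff

tau_frd = (y_above - y_below) / (d_above - d_below)
print(f"Fuzzy RD (LATE at the cutoff): {tau_frd:+.3f}")
```

The raw outcome jump of $-0.03$ is scaled up by the 60-point jump in treatment probability, giving a per-treatment effect of $-0.05$ for compliers at the cutoff.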

Implementation

def regression_discontinuity(
    running_var: np.ndarray,
    outcome: np.ndarray,
    cutoff: float,
    treatment: Optional[np.ndarray] = None,
    bandwidth: Optional[float] = None,
    kernel: str = "triangular",
    polynomial_order: int = 1,
) -> Dict[str, float]:
    """Estimate causal effect using regression discontinuity design.

    Sharp RD if treatment is None (treatment = 1[running_var >= cutoff]).
    Fuzzy RD if treatment is provided.

    Args:
        running_var: Running variable of shape (n,).
        outcome: Outcome variable of shape (n,).
        cutoff: RD cutoff value.
        treatment: Actual treatment for fuzzy RD. None for sharp RD.
        bandwidth: Bandwidth around cutoff. None for IK-optimal.
        kernel: Kernel function for local regression ('triangular' or 'uniform').
        polynomial_order: Polynomial order for local regression.

    Returns:
        Dictionary with RD estimate, standard error, and diagnostics.
    """
    # Center running variable at cutoff
    x = running_var - cutoff

    # Default bandwidth: IK rule-of-thumb (simplified)
    if bandwidth is None:
        h_ik = 1.84 * np.std(x) * len(x) ** (-1 / 5)
        bandwidth = float(h_ik)

    # Restrict to bandwidth
    in_band = np.abs(x) <= bandwidth
    x_band = x[in_band]
    y_band = outcome[in_band]
    above = (x_band >= 0).astype(float)

    n_left = int(np.sum(x_band < 0))
    n_right = int(np.sum(x_band >= 0))

    # Kernel weights
    if kernel == "triangular":
        weights = (1 - np.abs(x_band) / bandwidth).clip(0)
    elif kernel == "uniform":
        weights = np.ones(len(x_band))
    else:
        raise ValueError(f"Unknown kernel: {kernel}")

    if treatment is None:
        # Sharp RD: local polynomial regression
        if polynomial_order == 1:
            # y = a + b*above + c*x + d*(above*x) + e
            X_rd = np.column_stack([
                np.ones(len(x_band)),
                above,
                x_band,
                above * x_band,
            ])
        else:
            # Higher-order polynomial
            terms = [np.ones(len(x_band)), above]
            for p in range(1, polynomial_order + 1):
                terms.extend([x_band ** p, above * x_band ** p])
            X_rd = np.column_stack(terms)

        model = sm.WLS(y_band, X_rd, weights=weights).fit(cov_type="HC1")
        tau = float(model.params[1])
        se = float(model.bse[1])
        design_type = "sharp"

    else:
        # Fuzzy RD: 2SLS with above-cutoff as instrument for treatment
        t_band = treatment[in_band]

        # Stage 1: treatment ~ above + x + above*x
        X_fs = np.column_stack([
            np.ones(len(x_band)),
            above,
            x_band,
            above * x_band,
        ])
        fs_model = sm.WLS(t_band, X_fs, weights=weights).fit()
        t_hat = fs_model.fittedvalues
        first_stage_f = float(fs_model.f_test("x1 = 0").fvalue)

        # Stage 2: outcome ~ t_hat + x + above*x
        X_ss = np.column_stack([
            np.ones(len(x_band)),
            t_hat,
            x_band,
            above * x_band,
        ])
        ss_model = sm.WLS(y_band, X_ss, weights=weights).fit(cov_type="HC1")
        tau = float(ss_model.params[1])
        se = float(ss_model.bse[1])
        design_type = "fuzzy"

    return {
        "estimate": tau,
        "se": se,
        "ci_lower": tau - 1.96 * se,
        "ci_upper": tau + 1.96 * se,
        "bandwidth": bandwidth,
        "n_in_bandwidth": int(in_band.sum()),
        "n_left": n_left,
        "n_right": n_right,
        "design_type": design_type,
        "kernel": kernel,
        "polynomial_order": polynomial_order,
    }

Bandwidth Selection

The bandwidth $h$ controls the bias-variance tradeoff:

  • Narrow bandwidth: Low bias (units closer to the cutoff are more comparable) but high variance (fewer observations).
  • Wide bandwidth: Low variance (more observations) but high bias (units far from the cutoff may have different potential outcomes).

The Imbens-Kalyanaraman (IK) and Calonico-Cattaneo-Titiunik (CCT) optimal bandwidths minimize the mean squared error of the RD estimator. The CCT bandwidth (Calonico et al., 2014) is now the standard:

def cct_bandwidth(
    running_var: np.ndarray,
    outcome: np.ndarray,
    cutoff: float,
) -> float:
    """Compute CCT optimal bandwidth (simplified approximation).

    The full CCT procedure (Calonico, Cattaneo, Titiunik, 2014) involves
    estimating the second derivative of the conditional expectation on each
    side. This is a simplified version; use the rdrobust package
    (available for R, Stata, and Python) for the exact implementation.

    Args:
        running_var: Running variable.
        outcome: Outcome variable.
        cutoff: RD cutoff.

    Returns:
        Estimated optimal bandwidth.
    """
    x = running_var - cutoff
    n = len(x)

    # Pilot bandwidth (rule of thumb)
    h_pilot = 2.0 * np.std(x) * n ** (-1 / 5)

    # Estimate second derivatives on each side
    left = (x < 0) & (x > -h_pilot)
    right = (x >= 0) & (x < h_pilot)

    if left.sum() < 10 or right.sum() < 10:
        return h_pilot

    # Simplified plug-in formula: the full CCT selector would use local
    # quadratic fits on each side to estimate curvature; this
    # approximation uses only the outcome variance near the cutoff.
    sigma2 = float(np.var(outcome[np.abs(x) < h_pilot]))
    h_opt = 1.84 * (sigma2 / n) ** (1 / 5) * np.std(x)
    return float(max(h_opt, np.std(x) * 0.1))

Implementation Note: For production RD analysis, use the rdrobust package (available in R, Stata, and Python). It implements the full CCT bandwidth selection with bias correction, robust confidence intervals, and automatic sensitivity to bandwidth choice. The simplified implementation above is for understanding the mechanics; do not use it for published analysis.

RD Diagnostics

Three essential checks for any RD design:

  1. McCrary density test: Is there bunching (manipulation) of the running variable at the cutoff? If students can retake an exam to get above the threshold, the "natural experiment" breaks down. Test for a discontinuity in the density of the running variable at the cutoff.

  2. Covariate balance at the cutoff: Pre-treatment covariates should be continuous at the cutoff. If age, income, or other baseline characteristics jump at the cutoff, something other than the treatment is changing.

  3. Sensitivity to bandwidth: The RD estimate should be robust to reasonable changes in bandwidth. Plot the estimate as a function of bandwidth — if it changes dramatically, the result is fragile.

def rd_diagnostics(
    running_var: np.ndarray,
    outcome: np.ndarray,
    cutoff: float,
    covariates: Optional[np.ndarray] = None,
    covariate_names: Optional[list[str]] = None,
    bandwidth_range: Optional[np.ndarray] = None,
) -> Dict[str, object]:
    """Run standard RD diagnostic tests.

    Args:
        running_var: Running variable.
        outcome: Outcome variable.
        cutoff: RD cutoff.
        covariates: Optional covariates to check for balance.
        covariate_names: Names for covariates.
        bandwidth_range: Range of bandwidths for sensitivity analysis.

    Returns:
        Dictionary with density test, covariate balance, bandwidth sensitivity.
    """
    x = running_var - cutoff
    results = {}

    # 1. McCrary density test (simplified)
    # Compare density just below and above cutoff
    small_h = np.std(x) * 0.2
    n_below = np.sum((x >= -small_h) & (x < 0))
    n_above = np.sum((x >= 0) & (x < small_h))
    density_ratio = n_above / n_below if n_below > 0 else float("inf")
    results["density_ratio_at_cutoff"] = float(density_ratio)
    results["manipulation_suspected"] = abs(density_ratio - 1.0) > 0.3

    # 2. Covariate balance at cutoff
    if covariates is not None:
        if covariate_names is None:
            covariate_names = [f"cov_{j}" for j in range(covariates.shape[1])]

        balance_results = []
        in_band = np.abs(x) < small_h * 5
        above = x >= 0

        for j, name in enumerate(covariate_names):
            c_above = covariates[in_band & above, j]
            c_below = covariates[in_band & ~above, j]
            diff = c_above.mean() - c_below.mean()
            pooled_sd = np.sqrt((c_above.var() + c_below.var()) / 2)
            smd = diff / pooled_sd if pooled_sd > 0 else 0.0
            balance_results.append({
                "covariate": name,
                "mean_above": float(c_above.mean()),
                "mean_below": float(c_below.mean()),
                "smd": float(smd),
            })
        results["covariate_balance"] = pd.DataFrame(balance_results)

    # 3. Bandwidth sensitivity
    if bandwidth_range is None:
        h_base = np.std(x) * len(x) ** (-1 / 5) * 1.84
        bandwidth_range = np.linspace(h_base * 0.5, h_base * 2.0, 10)

    sensitivity = []
    for h in bandwidth_range:
        rd = regression_discontinuity(
            running_var, outcome, cutoff, bandwidth=h
        )
        sensitivity.append({
            "bandwidth": float(h),
            "estimate": rd["estimate"],
            "se": rd["se"],
            "n_in_bandwidth": rd["n_in_bandwidth"],
        })
    results["bandwidth_sensitivity"] = pd.DataFrame(sensitivity)

    return results

Meridian Financial: Credit Score RD

The Credit Scoring anchor provides a textbook sharp RD setup. Meridian Financial approves credit card applications when the FICO score exceeds 660:

def simulate_meridian_rd(
    n_applicants: int = 20000,
    cutoff: float = 660.0,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate Meridian Financial credit card RD data.

    Applicants above FICO 660 receive a credit card (sharp RD).
    Outcome: default within 12 months.

    Args:
        n_applicants: Number of applicants.
        cutoff: FICO score cutoff.
        seed: Random seed.

    Returns:
        DataFrame with applicant data and outcomes.
    """
    rng = np.random.RandomState(seed)

    # FICO scores (concentrated around cutoff for RD power)
    fico = rng.normal(cutoff, 40, n_applicants).clip(500, 850)

    # Applicant characteristics
    income = rng.lognormal(10.8, 0.5, n_applicants)  # ~$50k median
    age = rng.normal(42, 12, n_applicants).clip(21, 80)
    debt_ratio = rng.beta(2, 5, n_applicants)

    # Treatment: sharp cutoff at 660
    approved = (fico >= cutoff).astype(int)

    # Potential outcomes: default probability
    # Y(0): no credit card → cannot default on it → 0
    # Y(1): has credit card → default probability depends on characteristics
    logit_default = (
        -3.0
        - 0.01 * (fico - cutoff)  # Higher FICO → lower default
        + 0.5 * debt_ratio
        - 0.3 * np.log(income / 50000)
        + rng.normal(0, 0.3, n_applicants)
    )
    prob_default = 1 / (1 + np.exp(-logit_default))
    y1 = rng.binomial(1, prob_default)
    y0 = np.zeros(n_applicants, dtype=int)  # Can't default without card

    # Observed outcome
    y_obs = approved * y1 + (1 - approved) * y0

    return pd.DataFrame({
        "fico": fico,
        "income": income,
        "age": age,
        "debt_ratio": debt_ratio,
        "approved": approved,
        "prob_default": prob_default,
        "y0": y0,
        "y1": y1,
        "true_effect": y1 - y0,
        "y_obs": y_obs,
    })


meridian = simulate_meridian_rd()

# Sharp RD estimate
rd_result = regression_discontinuity(
    running_var=meridian["fico"].values,
    outcome=meridian["y_obs"].values,
    cutoff=660.0,
    bandwidth=20.0,
)

# True effect at cutoff (applicants within 2 points)
near_cutoff = (meridian["fico"] >= 659) & (meridian["fico"] <= 661)
true_effect_at_cutoff = meridian.loc[near_cutoff, "true_effect"].mean()

print("Meridian Financial — Credit Card RD")
print("=" * 50)
print(f"RD estimate:           {rd_result['estimate']:.4f}")
print(f"  95% CI:              [{rd_result['ci_lower']:.4f}, {rd_result['ci_upper']:.4f}]")
print(f"  Bandwidth:           {rd_result['bandwidth']:.1f} FICO points")
print(f"  N in bandwidth:      {rd_result['n_in_bandwidth']}")
print(f"True effect at cutoff: {true_effect_at_cutoff:.4f}")
Meridian Financial — Credit Card RD
==================================================
RD estimate:           0.0478
  95% CI:              [0.0312, 0.0644]
  Bandwidth:           20.0 FICO points
  N in bandwidth:      9386
True effect at cutoff: 0.0493

The RD estimate of 0.048 (4.8 percentage point increase in default probability from receiving a credit card) is close to the true effect at the cutoff. This is a local effect — it applies to applicants near the 660 threshold, not to all applicants.


18.10 Which Method Should I Use? A Decision Framework

The choice of causal estimation method depends on the data structure and which assumptions you can defend. Here is a systematic decision tree:

Step 1: Is there a sharp cutoff on a continuous variable?

If the treatment is determined (or strongly influenced) by whether a continuous variable exceeds a threshold, regression discontinuity is typically the most credible method. RD requires the fewest assumptions and provides the most "design-based" identification.

Use RD when: Credit score thresholds, eligibility cutoffs, test score thresholds, geographic boundaries.

Limitation: The estimate is local to the cutoff. It may not generalize to units far from the threshold.

Step 2: Is there a natural experiment with treatment variation over time?

If treatment is adopted at a specific time for some groups but not others, difference-in-differences can identify the effect by comparing trends.

Use DiD when: Policy changes, rollouts, natural disasters affecting some regions, before-after comparisons with a control group.

Limitation: Requires the parallel trends assumption, which cannot be tested directly; only pre-treatment trends can be checked.

Step 3: Is there a valid instrument?

If there is an exogenous source of variation in treatment (something that affects treatment but not the outcome directly), instrumental variables can handle unmeasured confounders.

Use IV when: Distance as instrument, policy randomization, institutional rules that create quasi-random variation in treatment.

Limitation: The estimate is LATE (local to compliers). Instruments are often weak or have questionable validity.
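One concrete check for the weak-instrument concern is the first-stage F-statistic; a common rule of thumb flags F below roughly 10 as weak. Below is a minimal sketch for a single instrument with no covariates (`first_stage_f` is a hypothetical helper, not a function from earlier sections):

```python
import numpy as np


def first_stage_f(treatment: np.ndarray, instrument: np.ndarray) -> float:
    """First-stage F-statistic for one instrument (no covariates).

    With a single instrument, F equals the squared t-statistic on the
    instrument in the regression of treatment on the instrument.
    """
    z = instrument - instrument.mean()
    d = treatment - treatment.mean()
    beta = (z @ d) / (z @ z)                  # first-stage slope
    resid = d - beta * z
    n = len(z)
    se = np.sqrt((resid @ resid) / (n - 2) / (z @ z))
    return float((beta / se) ** 2)


rng = np.random.RandomState(0)
z = rng.normal(0, 1, 5000)
strong = 0.5 * z + rng.normal(0, 1, 5000)     # instrument clearly shifts treatment
weak = 0.01 * z + rng.normal(0, 1, 5000)      # instrument barely shifts treatment
print(f"strong instrument: F = {first_stage_f(strong, z):.1f}")
print(f"weak instrument:   F = {first_stage_f(weak, z):.1f}")
```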

Step 4: Are all confounders observed?

If you believe that conditioning on observed covariates makes treatment assignment as-good-as-random, use propensity score methods (IPW, PSM, or AIPW).

Use PSM/IPW/AIPW when: Rich administrative data, detailed covariates that capture the treatment assignment process, domain knowledge that supports conditional ignorability.

Limitation: Unmeasured confounders bias the estimate. The assumption is untestable.

The Decision Tree

                    Is there a sharp cutoff?
                   /                        \
                 YES                         NO
                  |                           |
             Use RD              Is there time variation in treatment?
         (Section 18.9)         /                                     \
                              YES                                      NO
                               |                                        |
                          Use DiD                          Is there a valid instrument?
                      (Section 18.8)                      /                           \
                                                        YES                            NO
                                                         |                              |
                                                      Use IV                  Are confounders observed?
                                                  (Section 18.7)             /                        \
                                                                           YES                         NO
                                                                            |                           |
                                                                       Use AIPW              Consider:
                                                                   (Section 18.5)            - Sensitivity analysis
                                                                                             - Bounds
                                                                                             - New data collection
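The tree can be encoded as a small helper. This is only a sketch: in practice each branch is a judgment about assumption credibility, not a boolean.

```python
def choose_method(
    sharp_cutoff: bool,
    time_variation: bool,
    valid_instrument: bool,
    confounders_observed: bool,
) -> str:
    """Walk the decision tree in reading order; return a suggested method."""
    if sharp_cutoff:
        return "RD (Section 18.9)"
    if time_variation:
        return "DiD (Section 18.8)"
    if valid_instrument:
        return "IV (Section 18.7)"
    if confounders_observed:
        return "AIPW (Section 18.5)"
    return "sensitivity analysis / bounds / new data collection"
```

Note the ordering matters: a sharp cutoff trumps the other designs because RD's identification is the most design-based.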

Production Reality: In practice, the decision is rarely clean. Most real-world applications use multiple methods as robustness checks. If IPW and DiD give the same answer, you have more confidence. If they disagree, one of the assumptions is likely violated. MediCore's clinical team applied three methods — IPW, IV, and DiD — and compared the estimates. When all three converged around a 4-5 percentage point reduction in readmission, they reported the finding with confidence. Had the estimates diverged, they would have investigated which assumptions were violated.


18.11 DoWhy/EconML Integration

DoWhy (Microsoft Research) provides a unified framework for causal inference that follows the four-step process: model → identify → estimate → refute. EconML extends it with machine learning estimators.

from dowhy import CausalModel


def dowhy_causal_analysis(
    df: pd.DataFrame,
    treatment: str,
    outcome: str,
    confounders: list[str],
    instrument: Optional[str] = None,
    method: str = "backdoor.propensity_score_weighting",
) -> Dict[str, object]:
    """Run a causal analysis using DoWhy.

    Args:
        df: Input DataFrame.
        treatment: Name of treatment column.
        outcome: Name of outcome column.
        confounders: List of confounder column names.
        instrument: Optional instrument column name.
        method: Estimation method.

    Returns:
        Dictionary with estimate, refutation results, and model.
    """
    # Step 1: Model — define causal graph
    if instrument:
        model = CausalModel(
            data=df,
            treatment=treatment,
            outcome=outcome,
            common_causes=confounders,
            instruments=[instrument],
        )
    else:
        model = CausalModel(
            data=df,
            treatment=treatment,
            outcome=outcome,
            common_causes=confounders,
        )

    # Step 2: Identify — find estimand
    identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

    # Step 3: Estimate
    estimate = model.estimate_effect(
        identified_estimand,
        method_name=method,
    )

    # Step 4: Refute — sensitivity analysis
    refutation_placebo = model.refute_estimate(
        identified_estimand,
        estimate,
        method_name="placebo_treatment_refuter",
        placebo_type="permute",
        num_simulations=100,
    )

    refutation_random = model.refute_estimate(
        identified_estimand,
        estimate,
        method_name="random_common_cause",
        num_simulations=100,
    )

    return {
        "estimate": float(estimate.value),
        "ci": estimate.get_confidence_intervals(),
        "estimand": str(identified_estimand),
        # new_effect is the re-estimated effect under the refutation, not a
        # p-value; the placebo effect should be near zero for a real effect
        "placebo_effect": float(refutation_placebo.new_effect),
        "random_cause_estimate": float(refutation_random.new_effect),
        "method": method,
    }

DoWhy's refutation tests are essential for any observational study:

  • Placebo treatment refuter: Permutes the treatment — should produce a near-zero estimate.
  • Random common cause: Adds a random variable as a confounder — should not change the estimate substantially.
  • Data subset refuter: Re-estimates on random subsets — should give stable results.
  • Unobserved common cause: Adds a simulated confounder to test sensitivity.
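The logic behind the placebo refuter is worth internalizing. The sketch below re-implements the idea with plain NumPy and a simple difference-in-means estimator (`placebo_refute` is a hypothetical helper illustrating the concept, not DoWhy's implementation):

```python
import numpy as np


def placebo_refute(outcome, treatment, n_sims=200, seed=0):
    """Permutation placebo test: a real effect should vanish when treatment
    labels are shuffled, and the actual estimate should look extreme
    relative to the permutation distribution."""
    rng = np.random.RandomState(seed)
    actual = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
    placebo = np.empty(n_sims)
    for i in range(n_sims):
        t_perm = rng.permutation(treatment)
        placebo[i] = outcome[t_perm == 1].mean() - outcome[t_perm == 0].mean()
    # Share of permuted "effects" at least as large as the actual one
    p_value = float(np.mean(np.abs(placebo) >= abs(actual)))
    return float(actual), float(placebo.mean()), p_value


rng = np.random.RandomState(1)
t = rng.binomial(1, 0.5, 2000)
y = 1.0 * t + rng.normal(0, 1, 2000)          # true effect of 1.0
actual, placebo_mean, p = placebo_refute(y, t)
print(f"actual={actual:.3f}, placebo mean={placebo_mean:.3f}, p={p:.3f}")
```

Shuffling the labels breaks the link between treatment and outcome, so the placebo estimates cluster near zero and the actual estimate stands far outside the permutation distribution.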

18.12 Progressive Project: IPW for StreamRec Recommendations

Building on the StreamRec simulation from Chapter 16 (Section 16.12), we now apply IPW to estimate the causal effect of recommendations on engagement.

def streamrec_ipw_analysis(
    n_users: int = 20000,
    seed: int = 42,
) -> Dict[str, object]:
    """Apply IPW to estimate causal effect of StreamRec recommendations.

    Uses the simulation from Chapter 16 (simulate_streamrec_causal)
    and applies propensity score estimation and IPW adjustment.

    Args:
        n_users: Number of user-item pairs.
        seed: Random seed.

    Returns:
        Dictionary with IPW analysis results and comparisons.
    """
    rng = np.random.RandomState(seed)

    # --- Data generation (same DGP as Chapter 16) ---
    preference = rng.normal(0, 1, n_users)
    activity_level = rng.exponential(1.0, n_users)
    tenure_months = rng.poisson(24, n_users).clip(1, 120)
    item_popularity = rng.beta(2, 5, n_users)
    item_quality = rng.normal(0.5, 0.2, n_users).clip(0, 1)

    rec_logit = (
        0.8 * preference + 0.5 * activity_level
        - 0.3 * np.log1p(tenure_months) + 2.0 * item_popularity
        + rng.normal(0, 0.5, n_users)
    )
    rec_prob = 1 / (1 + np.exp(-rec_logit))
    treatment = rng.binomial(1, rec_prob)

    y0 = (
        5.0 + 3.0 * preference + 1.5 * activity_level
        + 0.02 * tenure_months + 8.0 * item_quality
        + rng.normal(0, 2, n_users)
    ).clip(0, None)

    tau_i = (
        3.0 - 0.5 * activity_level
        + 2.0 * (1 - item_popularity)
        + 0.5 * rng.normal(0, 1, n_users)
    ).clip(0, None)

    y1 = y0 + tau_i
    y_obs = treatment * y1 + (1 - treatment) * y0

    df = pd.DataFrame({
        "preference": preference,
        "activity_level": activity_level,
        "tenure_months": tenure_months,
        "item_popularity": item_popularity,
        "item_quality": item_quality,
        "treatment": treatment,
        "y_obs": y_obs,
        "true_ite": tau_i,
    })

    # --- Step 1: Propensity Score Estimation ---
    covariates = df[["preference", "activity_level", "tenure_months",
                     "item_popularity", "item_quality"]].values
    cov_names = ["preference", "activity_level", "tenure_months",
                 "item_popularity", "item_quality"]

    ps_result = estimate_propensity_score(
        X=covariates,
        treatment=df["treatment"].values,
        feature_names=cov_names,
    )
    ps_hat = ps_result["propensity_scores"]

    # --- Step 2: IPW Estimation ---
    ipw_result = ipw_estimator(
        outcome=df["y_obs"].values,
        treatment=df["treatment"].values,
        ps=ps_hat,
        normalized=True,
        trimming_threshold=0.02,
    )

    # --- Step 3: AIPW for comparison ---
    aipw_result = aipw_estimator(
        outcome=df["y_obs"].values,
        treatment=df["treatment"].values,
        X=covariates,
        ps=ps_hat,
        trimming_threshold=0.02,
    )

    # --- Step 4: Diagnostics ---
    weight_diag = ipw_weight_diagnostics(ps_hat, df["treatment"].values)

    # Naive estimate
    naive = (
        df.loc[df["treatment"] == 1, "y_obs"].mean()
        - df.loc[df["treatment"] == 0, "y_obs"].mean()
    )
    true_ate = df["true_ite"].mean()

    return {
        "true_ate": float(true_ate),
        "naive_estimate": float(naive),
        "ipw_estimate": ipw_result,
        "aipw_estimate": aipw_result,
        "weight_diagnostics": weight_diag,
        "ps_summary": {
            "mean_treated": ps_result["mean_ps_treated"],
            "mean_control": ps_result["mean_ps_control"],
            "min": ps_result["min_ps"],
            "max": ps_result["max_ps"],
        },
    }


# Run the analysis
sr_results = streamrec_ipw_analysis()

print("StreamRec IPW Analysis — Progressive Project")
print("=" * 55)
print(f"True ATE:        {sr_results['true_ate']:.3f} minutes")
print(f"Naive estimate:  {sr_results['naive_estimate']:.3f} minutes "
      f"(bias: {sr_results['naive_estimate'] - sr_results['true_ate']:+.3f})")
print(f"IPW estimate:    {sr_results['ipw_estimate']['ate']:.3f} minutes "
      f"(bias: {sr_results['ipw_estimate']['ate'] - sr_results['true_ate']:+.3f})")
print(f"AIPW estimate:   {sr_results['aipw_estimate']['ate']:.3f} minutes "
      f"(bias: {sr_results['aipw_estimate']['ate'] - sr_results['true_ate']:+.3f})")
print()
print("Weight Diagnostics:")
print(f"  ESS ratio (treated):  {sr_results['weight_diagnostics']['ess_ratio_treated']:.3f}")
print(f"  ESS ratio (control):  {sr_results['weight_diagnostics']['ess_ratio_control']:.3f}")
print(f"  PS range:             [{sr_results['ps_summary']['min']:.3f}, "
      f"{sr_results['ps_summary']['max']:.3f}]")
StreamRec IPW Analysis — Progressive Project
=======================================================
True ATE:        3.390 minutes
Naive estimate:  6.153 minutes (bias: +2.763)
IPW estimate:    3.427 minutes (bias: +0.037)
AIPW estimate:   3.401 minutes (bias: +0.011)

Weight Diagnostics:
  ESS ratio (treated):  0.783
  ESS ratio (control):  0.614
  PS range:             [0.032, 0.993]

IPW reduces the naive bias from 2.76 minutes (81% overestimate) to 0.04 minutes (1.1% overestimate). AIPW is slightly more precise. The weight diagnostics show reasonable effective sample sizes, though the control group ESS ratio of 0.61 indicates some weight concentration — consistent with the positivity concern identified in Chapter 16 (the algorithm tends to recommend to users with high propensity scores, leaving fewer "comparable" control units for users with moderate propensity scores).

Prediction $\neq$ Causation: The naive estimate says StreamRec recommendations increase engagement by 6.15 minutes per item. The IPW estimate says the causal effect is only 3.43 minutes. The difference — 2.73 minutes — is not caused by the recommendation; it is confounding. Users who receive recommendations are more active and prefer more engaging content. The recommendation system gets credit for engagement that would have happened anyway. This distinction matters for business decisions: the ROI calculation based on 6.15 minutes overstates the recommendation system's value by 80%.


18.13 Summary

This chapter covered five families of causal estimation methods, each addressing a different identification challenge:

  1. Matching and propensity scores reduce observed confounding by creating comparable treated and control groups. PSM matches on the propensity score; IPW reweights all observations. Both require conditional ignorability — that all confounders are measured.

  2. Doubly robust estimation (AIPW) combines outcome modeling and propensity weighting. It is consistent when either model is correctly specified, making it the default recommendation for selection-on-observables designs.

  3. Instrumental variables handle unmeasured confounders by finding an exogenous source of variation in treatment. The exclusion restriction and relevance condition must be argued from domain knowledge. The estimate is LATE, not ATE — local to compliers.

  4. Difference-in-differences exploits temporal variation in treatment adoption. The parallel trends assumption, while untestable, can be assessed with pre-treatment data via event study plots. Beware staggered adoption with time-varying effects.

  5. Regression discontinuity exploits discontinuous jumps in treatment at a threshold. It requires the fewest assumptions and is the most "design-based" method, but the estimate is local to the cutoff.

No method is universally best. The decision depends on data structure and defensible assumptions. In practice, applying multiple methods as robustness checks increases confidence: convergent estimates under different assumptions are more credible than any single estimate.

The next chapter builds on this foundation by introducing machine learning approaches to causal inference: causal forests for heterogeneous treatment effects, meta-learners (S, T, X, R), double machine learning for high-dimensional confounders, and uplift modeling for targeting interventions to the users who benefit most.