Case Study 1: MediCore Causal DAG — Identifying Confounders and Valid Adjustment Sets

Context

MediCore Pharmaceuticals has moved beyond the potential outcomes analysis of Chapter 16 to confront a more nuanced problem. The regression adjustment approach (controlling for observed covariates in a linear model) successfully recovered an unbiased estimate of Drug X's effect on hospitalization — but only because the simulation included all relevant confounders as observed variables. The clinical analytics team now faces the real-world challenge: their electronic health records contain dozens of candidate variables, and they must decide which to include in their adjustment set.

The team has access to the following variables for 2.1 million patient-admission episodes:

Variable	Description	Type
Disease Severity	Composite score from lab values (0-100)	Continuous
Age	Patient age at admission	Continuous
Comorbidity Count	Number of diagnosed comorbidities	Count
Insurance Status	Medicaid, Medicare, Private, Uninsured	Categorical
Biomarker Level	Post-treatment inflammatory biomarker (pg/mL)	Continuous
Prior Hospitalizations	Count of hospitalizations in the past year	Count
Prescribing Physician	Physician ID who prescribed Drug X	Categorical
Hospital	Hospital where admission occurred	Categorical
Drug X	Binary indicator of Drug X prescription at discharge	Binary
30-Day Readmission	Binary indicator of readmission within 30 days (the outcome; labeled Hospitalization in the DAG and code)	Binary

The question: Which of these variables should be included in the adjustment set?

Constructing the DAG

The clinical team convenes a structured elicitation session with domain experts: two cardiologists, an epidemiologist, a hospital administrator, and a health economist. They build the causal DAG incrementally, debating each edge.

                        Disease Severity
                       /    |    \       \
                      v     |     v       v
            Age ---> Drug X |  Biomarker  Prior Hospitalizations
             |        |     |     |                |
             |        |     v     v                v
             |        +---->  Hospitalization <----+
             |                     ^
             |                    /
             v                   /
  Comorbidity Count ----------/
       |
       v
  Insurance Status <--- Age

  Prescribing Physician ---> Drug X
  Hospital ---> Drug X
  Hospital ---> Hospitalization
  Age ---> Hospitalization
  Drug X ---> Biomarker

Key causal claims encoded in this DAG:

Disease Severity directly causes Drug X prescription (sicker patients receive Drug X), Biomarker levels (severity affects inflammation), Prior Hospitalizations (sicker patients are hospitalized more), and Hospitalization (severity is the strongest predictor of readmission).
Age causes Drug X prescription (prescribing patterns vary by age), Comorbidity Count (older patients have more comorbidities), Insurance Status (age determines Medicare eligibility), and Hospitalization.
Prescribing Physician causes Drug X prescription (physician practice variation) but does not directly cause Hospitalization (conditional on treatment and patient characteristics). This is a potential instrumental variable (Chapter 18).
Biomarker is caused by Drug X and Severity, and causes Hospitalization. It is a mediator on the causal pathway Drug X $\to$ Biomarker $\to$ Hospitalization.
Insurance Status is caused by Age and Comorbidity Count (a collider). Its effect on Hospitalization is debated — the team assumes no direct effect after controlling for severity, comorbidities, and hospital (insurance does not change biology, though it may affect care quality through hospital selection).
Hospital affects both Drug X (prescribing protocols vary by hospital) and Hospitalization (hospital quality varies).
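
The claims above can be written down as an edge list, which makes the backdoor paths easy to enumerate mechanically. The following is a minimal sketch, not the team's tooling: the short node names (`DrugX`, `Hosp`, `PriorHosp`, and so on) and the helper functions `descendants` and `backdoor_paths` are labels invented for this illustration, and the edges are transcribed from the list above under the assumption that Comorbidity Count and Prior Hospitalizations each point into Hospitalization, as drawn in the diagram.

```python
from collections import defaultdict

# Edges transcribed from the causal claims above (parent -> child).
# The short node names are this sketch's own labels, not the source's.
EDGES = [
    ("Severity", "DrugX"), ("Severity", "Biomarker"),
    ("Severity", "PriorHosp"), ("Severity", "Hosp"),
    ("Age", "DrugX"), ("Age", "Comorbidity"),
    ("Age", "Insurance"), ("Age", "Hosp"),
    ("Comorbidity", "Insurance"), ("Comorbidity", "Hosp"),
    ("PriorHosp", "Hosp"), ("Physician", "DrugX"),
    ("Hospital", "DrugX"), ("Hospital", "Hosp"),
    ("DrugX", "Biomarker"), ("Biomarker", "Hosp"), ("DrugX", "Hosp"),
]

children = defaultdict(set)
parents = defaultdict(set)
for parent, child in EDGES:
    children[parent].add(child)
    parents[child].add(parent)


def descendants(node):
    """Return every node reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def backdoor_paths(treatment, outcome):
    """List simple paths (direction ignored) that begin with an arrow into `treatment`."""
    paths = []

    def walk(path):
        last = path[-1]
        if last == outcome:
            paths.append(path)
            return
        for nxt in sorted(children[last] | parents[last]):
            if nxt not in path:
                walk(path + [nxt])

    # A backdoor path must leave the treatment through one of its parents.
    for parent in sorted(parents[treatment]):
        walk([treatment, parent])
    return paths


post_treatment = descendants("DrugX")
paths = backdoor_paths("DrugX", "Hosp")
print("Descendants of DrugX:", sorted(post_treatment))
for p in paths:
    print(" - ".join(p))
```

Under these edges the sketch finds seven backdoor paths, each leaving Drug X through Disease Severity, Age, or Hospital; Prescribing Physician contributes none because its only neighbor is Drug X. The descendants of Drug X (Biomarker and Hospitalization) are the post-treatment nodes that the backdoor criterion rules out as controls.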

Classifying Each Variable

Using the DAG, the team classifies each candidate control:

import numpy as np
import pandas as pd


def classify_controls() -> pd.DataFrame:
    """Classify each candidate variable as good, bad, or neutral control.

    Based on the MediCore causal DAG and the backdoor criterion
    for the effect of Drug X on Hospitalization.
    """
    variables = [
        {
            "variable": "Disease Severity",
            "role": "Confounder",
            "classification": "GOOD",
            "reason": (
                "Common cause of Drug X (severity drives prescribing) "
                "and Hospitalization (severity is primary risk factor). "
                "Blocks the main backdoor path."
            ),
        },
        {
            "variable": "Age",
            "role": "Confounder",
            "classification": "GOOD",
            "reason": (
                "Common cause of Drug X (age affects prescribing) "
                "and Hospitalization (age is a risk factor). "
                "Blocks a secondary backdoor path."
            ),
        },
        {
            "variable": "Comorbidity Count",
            "role": "Cause of outcome only",
            "classification": "GOOD (precision)",
            "reason": (
                "Causes Hospitalization but does not directly cause "
                "Drug X (conditional on severity). Including it does "
                "not affect bias but may improve precision."
            ),
        },
        {
            "variable": "Hospital",
            "role": "Confounder",
            "classification": "GOOD",
            "reason": (
                "Common cause of Drug X (hospital protocols) and "
                "Hospitalization (hospital quality). Blocks a "
                "backdoor path."
            ),
        },
        {
            "variable": "Biomarker Level",
            "role": "Mediator (post-treatment)",
            "classification": "BAD",
            "reason": (
                "Descendant of Drug X (Drug -> Biomarker). On the "
                "causal pathway Drug -> Biomarker -> Hospitalization. "
                "Conditioning blocks the causal effect, attenuating "
                "the estimate toward zero."
            ),
        },
        {
            "variable": "Insurance Status",
            "role": "Collider",
            "classification": "BAD",
            "reason": (
                "Collider on Age -> Insurance <- Comorbidities. "
                "Conditioning opens spurious path between Age and "
                "Comorbidities, potentially introducing bias."
            ),
        },
        {
            "variable": "Prior Hospitalizations",
            "role": "Pre-treatment variable",
            "classification": "GOOD (with caution)",
            "reason": (
                "Caused by Disease Severity (a confounder). Including "
                "it may partially block the severity backdoor path "
                "even if severity is imperfectly measured. It helps "
                "further if it also reflects unmeasured confounders, "
                "but as a proxy it blocks such paths only partially."
            ),
        },
        {
            "variable": "Prescribing Physician",
            "role": "Cause of treatment only",
            "classification": "NEUTRAL / IV candidate",
            "reason": (
                "Causes Drug X but does not directly cause "
                "Hospitalization (conditional on patient factors). "
                "Including as control is neutral (no bias effect). "
                "May be more useful as an instrument (Chapter 18)."
            ),
        },
    ]

    return pd.DataFrame(variables)


classification = classify_controls()
print("MediCore: Variable Classification for Backdoor Adjustment")
print("=" * 70)
for _, row in classification.iterrows():
    print(f"\n  {row['variable']}")
    print(f"    Role:           {row['role']}")
    print(f"    Classification: {row['classification']}")
    print(f"    Reason:         {row['reason']}")

MediCore: Variable Classification for Backdoor Adjustment
======================================================================

  Disease Severity
    Role:           Confounder
    Classification: GOOD
    Reason:         Common cause of Drug X (severity drives prescribing) and Hospitalization (severity is primary risk factor). Blocks the main backdoor path.

  Age
    Role:           Confounder
    Classification: GOOD
    Reason:         Common cause of Drug X (age affects prescribing) and Hospitalization (age is a risk factor). Blocks a secondary backdoor path.

  Comorbidity Count
    Role:           Cause of outcome only
    Classification: GOOD (precision)
    Reason:         Causes Hospitalization but does not directly cause Drug X (conditional on severity). Including it does not affect bias but may improve precision.

  Hospital
    Role:           Confounder
    Classification: GOOD
    Reason:         Common cause of Drug X (hospital protocols) and Hospitalization (hospital quality). Blocks a backdoor path.

  Biomarker Level
    Role:           Mediator (post-treatment)
    Classification: BAD
    Reason:         Descendant of Drug X (Drug -> Biomarker). On the causal pathway Drug -> Biomarker -> Hospitalization. Conditioning blocks the causal effect, attenuating the estimate toward zero.

  Insurance Status
    Role:           Collider
    Classification: BAD
    Reason:         Collider on Age -> Insurance <- Comorbidities. Conditioning opens spurious path between Age and Comorbidities, potentially introducing bias.

  Prior Hospitalizations
    Role:           Pre-treatment variable
    Classification: GOOD (with caution)
    Reason:         Caused by Disease Severity (a confounder). Including it may partially block the severity backdoor path even if severity is imperfectly measured. It helps further if it also reflects unmeasured confounders, but as a proxy it blocks such paths only partially.

  Prescribing Physician
    Role:           Cause of treatment only
    Classification: NEUTRAL / IV candidate
    Reason:         Causes Drug X but does not directly cause Hospitalization (conditional on patient factors). Including as control is neutral (no bias effect). May be more useful as an instrument (Chapter 18).

Empirical Validation

The team generates simulated data matching the DAG structure to validate their reasoning before applying it to the real EHR data:

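def simulate_medicore_full(n: int = 50000, seed: int = 42) -> pd.DataFrame:
    """Simulate the full MediCore DAG with all candidate variables.

    The direct effect of Drug X on Hospitalization is -0.4 on the
    log-odds scale (true_drug_effect * 5), plus -0.08 log-odds through
    the biomarker (-8 * 0.01). At the roughly 20% baseline readmission
    rate this is approximately an 8 percentage point reduction.
    """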
    rng = np.random.RandomState(seed)

    severity = rng.normal(50, 15, n)
    age = rng.normal(65, 10, n)
    comorbidities = rng.poisson(np.exp(0.01 * age - 0.3), n)
    hospital = rng.choice(["A", "B", "C", "D"], n, p=[0.3, 0.3, 0.2, 0.2])
    hospital_quality = {"A": 0.0, "B": -0.02, "C": 0.03, "D": 0.01}
    hospital_rx_rate = {"A": 0.0, "B": 0.5, "C": -0.3, "D": 0.2}

    physician = rng.choice(range(50), n)
    physician_preference = rng.normal(0, 0.5, 50)

    # Drug assignment
    drug_logit = (
        0.03 * (severity - 50)
        + 0.01 * (age - 65)
        + np.array([hospital_rx_rate[h] for h in hospital])
        + physician_preference[physician]
    )
    drug = rng.binomial(1, 1 / (1 + np.exp(-drug_logit)))

    # Biomarker (mediator)
    biomarker = 50 + 0.3 * severity - 8 * drug + rng.normal(0, 5, n)

    # Prior hospitalizations
    prior_hosp = rng.poisson(np.exp(-2 + 0.02 * severity), n)

    # Insurance status (collider: caused by age and comorbidities)
    ins_probs = np.column_stack([
        np.exp(0.02 * (age - 65)),        # Medicare (increases with age)
        np.exp(0.1 * comorbidities),       # Medicaid (increases with comorbidities)
        np.ones(n),                        # Private (baseline)
        np.exp(-0.01 * age),               # Uninsured (decreases with age)
    ])
    ins_probs = ins_probs / ins_probs.sum(axis=1, keepdims=True)
    insurance = np.array([
        rng.choice(["Medicare", "Medicaid", "Private", "Uninsured"], p=p)
        for p in ins_probs
    ])

    # Hospitalization (outcome)
    true_drug_effect = -0.08  # 8 percentage point reduction
    hosp_logit = (
        -2.0
        + 0.04 * (severity - 50)
        + 0.02 * (age - 65)
        + 0.1 * comorbidities
        + 0.15 * prior_hosp
        + 0.01 * biomarker
        + true_drug_effect * 5 * drug  # scaled for logit
        + np.array([hospital_quality[h] for h in hospital])
    )
    hospitalization = rng.binomial(1, 1 / (1 + np.exp(-hosp_logit)))

    return pd.DataFrame({
        "severity": severity, "age": age, "comorbidities": comorbidities,
        "hospital": hospital, "physician": physician,
        "drug": drug, "biomarker": biomarker,
        "prior_hosp": prior_hosp, "insurance": insurance,
        "hospitalization": hospitalization,
    })


sim = simulate_medicore_full()
print(f"Simulated data: {len(sim)} patients")
print(f"Drug X rate: {sim['drug'].mean():.3f}")
print(f"Readmission rate: {sim['hospitalization'].mean():.3f}")

Simulated data: 50000 patients
Drug X rate: 0.524
Readmission rate: 0.203
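
To see how the choice of controls moves the estimate, it helps to compare adjustment sets on data where the truth is known. The sketch below does not reuse `simulate_medicore_full`; it is a deliberately simplified stand-in with a continuous outcome and a linear structural model, so that ordinary least squares is correctly specified. All coefficients, the seed, and the helper `drug_coefficient` are illustrative assumptions, not values from the MediCore analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical linear structural model (coefficients are illustrative only).
severity = rng.normal(0.0, 1.0, n)                               # confounder
drug = rng.binomial(1, 1.0 / (1.0 + np.exp(-severity)))          # sicker patients are treated more often
biomarker = 0.5 * severity - 1.0 * drug + rng.normal(0, 1, n)    # mediator, lowered by the drug
outcome = 0.3 * severity + 0.2 * biomarker - 0.3 * drug + rng.normal(0, 1, n)

direct_effect = -0.3
total_effect = direct_effect + 0.2 * (-1.0)   # direct + via biomarker = -0.5


def drug_coefficient(*controls):
    """OLS coefficient on `drug` when regressing `outcome` on drug plus controls."""
    X = np.column_stack([np.ones(n), drug, *controls])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta[1]


estimates = {
    "no controls": drug_coefficient(),
    "severity": drug_coefficient(severity),
    "severity + biomarker": drug_coefficient(severity, biomarker),
}

print(f"true total effect: {total_effect:+.2f}")
for label, value in estimates.items():
    print(f"{label:>22}: {value:+.3f}")
```

With no controls, the estimate is pulled toward zero because severity raises both treatment and the outcome. Adjusting for severity should land near the total effect of -0.5, while adding the biomarker should shrink the estimate toward the direct effect of -0.3, which is the attenuation the classification table attributes to conditioning on a mediator.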

Results and Decision

The team selects the recommended adjustment set: {Disease Severity, Age, Comorbidity Count, Hospital, Prior Hospitalizations}. They explicitly exclude:

Biomarker Level — mediator, would block the causal pathway.
Insurance Status — collider, could open spurious paths.
Prescribing Physician — reserved as a potential instrument for sensitivity analysis (Chapter 18).
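
The chosen set can be checked against the backdoor criterion directly: it must contain no descendant of Drug X, and it must block every backdoor path. The sketch below is one way to do that in plain Python. The edge set restates the DAG from this case study using short node names invented for the illustration, and `descendants`, `backdoor_paths`, `path_blocked`, and `satisfies_backdoor` are hypothetical helpers, not functions from a causal-inference library.

```python
# Edges of the case-study DAG (parent, child); node names are this sketch's labels.
EDGES = {
    ("Severity", "DrugX"), ("Severity", "Biomarker"),
    ("Severity", "PriorHosp"), ("Severity", "Hosp"),
    ("Age", "DrugX"), ("Age", "Comorbidity"),
    ("Age", "Insurance"), ("Age", "Hosp"),
    ("Comorbidity", "Insurance"), ("Comorbidity", "Hosp"),
    ("PriorHosp", "Hosp"), ("Physician", "DrugX"),
    ("Hospital", "DrugX"), ("Hospital", "Hosp"),
    ("DrugX", "Biomarker"), ("Biomarker", "Hosp"), ("DrugX", "Hosp"),
}


def descendants(node):
    """All nodes reachable from `node` by following arrows forward."""
    found, frontier = set(), {node}
    while frontier:
        frontier = {c for (p, c) in EDGES if p in frontier} - found
        found |= frontier
    return found


def backdoor_paths(treatment, outcome):
    """Yield simple paths (direction ignored) that start with an arrow into `treatment`."""
    def extend(path):
        if path[-1] == outcome:
            yield path
            return
        for a, b in EDGES:
            if path[-1] in (a, b):
                nxt = b if a == path[-1] else a
                if nxt not in path:
                    yield from extend(path + [nxt])

    for parent, child in EDGES:
        if child == treatment:
            yield from extend([treatment, parent])


def path_blocked(path, z):
    """Apply the d-separation rules to each interior node of the path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in EDGES and (nxt, node) in EDGES
        if collider:
            if node not in z and not (descendants(node) & z):
                return True   # an unconditioned collider blocks the path
        elif node in z:
            return True       # a conditioned chain or fork node blocks the path
    return False


def satisfies_backdoor(z, treatment="DrugX", outcome="Hosp"):
    z = set(z)
    if z & descendants(treatment):
        return False          # post-treatment variables are not allowed
    return all(path_blocked(p, z) for p in backdoor_paths(treatment, outcome))


chosen = {"Severity", "Age", "Comorbidity", "Hospital", "PriorHosp"}
candidates = [
    ("chosen set", chosen),
    ("chosen + Biomarker", chosen | {"Biomarker"}),
    ("Severity only", {"Severity"}),
]
for label, z in candidates:
    print(f"{label:>20}: {satisfies_backdoor(z)}")
```

Under the DAG as encoded here, the chosen set passes, adding Biomarker fails because it is a descendant of Drug X, and Disease Severity alone fails because the paths through Age and Hospital stay open. The result is only as good as the edge list: any edge the experts left out is invisible to this check.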

This analysis demonstrates the central lesson of graphical causal models: the decision of which covariates to include in an adjustment set is a causal decision, not a statistical one. The DAG — constructed from domain knowledge, not from the data — determines which variables are good controls and which are bad controls. No amount of statistical testing (significance tests, stepwise selection, LASSO) can substitute for this causal reasoning.

Production Reality: In practice, the DAG is always an approximation. The team acknowledges several uncertainties: (1) the assumption that Insurance Status has no direct effect on Hospitalization is debatable (insurance may affect access to follow-up care); (2) the Disease Severity composite score may not capture all relevant severity dimensions (measurement error in the confounder weakens the adjustment); (3) there may be unmeasured confounders not represented in the DAG at all. The sensitivity analysis framework from Chapter 16 (OVB formula, Cinelli and Hazlett bounds) provides a way to quantify how robust the estimate is to these concerns. The DAG does not eliminate uncertainty — it makes the uncertainty explicit and structured.
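
The omitted-variable-bias idea mentioned above can be illustrated numerically. The sketch below uses the standard linear OVB decomposition (the coefficient on treatment with a confounder omitted equals the coefficient with it included, plus the confounder's outcome coefficient times its association with treatment); it does not implement the Cinelli and Hazlett bounds. The data-generating process, the variable `u` standing in for an unmeasured severity dimension, and the helper `ols` are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical data: u is a confounder the analyst cannot observe in practice.
u = rng.normal(size=n)
x = rng.normal(size=n)                                        # observed covariate
d = (0.8 * u + 0.5 * x + rng.normal(size=n) > 0).astype(float)  # treatment depends on u and x
y = -0.5 * d + 0.6 * u + 0.3 * x + rng.normal(size=n)


def ols(target, *regressors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(n), *regressors])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


tau_long, gamma = ols(y, d, u, x)[1:3]   # coefficients on d and u when u is included
tau_short = ols(y, d, x)[1]              # coefficient on d when u is omitted
delta = ols(u, d, x)[1]                  # how u differs by treatment, given x

bias = gamma * delta
print(f"long  (u included): {tau_long:+.3f}")
print(f"short (u omitted):  {tau_short:+.3f}")
print(f"gamma * delta:      {bias:+.3f}")
print(f"short - long:       {tau_short - tau_long:+.3f}")
```

In-sample, the gap between the short and long coefficients equals `gamma * delta` up to floating-point error. A sensitivity analysis turns this around: since `u` is unobserved in real data, the analyst posits plausible values for the two components and asks how large they would need to be to change the conclusion.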