Case Study 1: MediCore Causal DAG — Identifying Confounders and Valid Adjustment Sets
Context
MediCore Pharmaceuticals has moved beyond the potential outcomes analysis of Chapter 16 to confront a more nuanced problem. The regression adjustment approach (controlling for observed covariates in a linear model) successfully recovered an unbiased estimate of Drug X's effect on hospitalization — but only because the simulation included all relevant confounders as observed variables. The clinical analytics team now faces the real-world challenge: their electronic health records contain dozens of candidate variables, and they must decide which to include in their adjustment set.
The team has access to the following variables for 2.1 million patient-admission episodes:
| Variable | Description | Type |
|---|---|---|
| Disease Severity | Composite score from lab values (0-100) | Continuous |
| Age | Patient age at admission | Continuous |
| Comorbidity Count | Number of diagnosed comorbidities | Count |
| Insurance Status | Medicaid, Medicare, Private, Uninsured | Categorical |
| Biomarker Level | Post-treatment inflammatory biomarker (pg/mL) | Continuous |
| Prior Hospitalizations | Count of hospitalizations in the past year | Count |
| Prescribing Physician | Physician ID who prescribed Drug X | Categorical |
| Hospital | Hospital where admission occurred | Categorical |
| Drug X | Binary indicator of Drug X prescription at discharge | Binary |
| 30-Day Readmission | Binary indicator of readmission within 30 days | Binary |
The question: Which of these variables should be included in the adjustment set?
Constructing the DAG
The clinical team convenes a structured elicitation session with domain experts: two cardiologists, an epidemiologist, a hospital administrator, and a health economist. They build the causal DAG incrementally, debating each edge.
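Disease Severity ---> Drug X
Disease Severity ---> Biomarker
Disease Severity ---> Prior Hospitalizations
Disease Severity ---> Hospitalization
Age ---> Drug X
Age ---> Comorbidity Count
Age ---> Insurance Status
Age ---> Hospitalization
Comorbidity Count ---> Insurance Status
Comorbidity Count ---> Hospitalization
Drug X ---> Biomarker
Drug X ---> Hospitalization
Biomarker ---> Hospitalization
Prior Hospitalizations ---> Hospitalization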
Prescribing Physician ---> Drug X
Hospital ---> Drug X
Hospital ---> Hospitalization
Key causal claims encoded in this DAG (Hospitalization denotes the 30-Day Readmission outcome from the variable table):
- Disease Severity directly causes Drug X prescription (sicker patients receive Drug X), Biomarker levels (severity affects inflammation), Prior Hospitalizations (sicker patients are hospitalized more), and Hospitalization (severity is the strongest predictor of readmission).
- Age causes Drug X prescription (prescribing patterns vary by age), Comorbidity Count (older patients have more comorbidities), Insurance Status (age determines Medicare eligibility), and Hospitalization.
- Prescribing Physician causes Drug X prescription (physician practice variation) but does not directly cause Hospitalization (conditional on treatment and patient characteristics). This is a potential instrumental variable (Chapter 18).
- Biomarker is caused by Drug X and Severity, and causes Hospitalization. It is a mediator on the causal pathway Drug X $\to$ Biomarker $\to$ Hospitalization.
- Insurance Status is caused by Age and Comorbidity Count (a collider). Its effect on Hospitalization is debated — the team assumes no direct effect after controlling for severity, comorbidities, and hospital (insurance does not change biology, though it may affect care quality through hospital selection).
- Hospital affects both Drug X (prescribing protocols vary by hospital) and Hospitalization (hospital quality varies).
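The causal claims above can also be checked mechanically. The following is a minimal sketch, not the team's tooling: it encodes the DAG as an edge list, enumerates the backdoor paths from Drug X to Hospitalization, and tests whether a candidate adjustment set satisfies the backdoor criterion. The short node names (`DrugX`, `PriorHosp`, and so on) and all function names are illustrative assumptions.

```python
# Edge list for the MediCore DAG (node names shortened for illustration).
EDGES = [
    ("Severity", "DrugX"), ("Severity", "Biomarker"),
    ("Severity", "PriorHosp"), ("Severity", "Hospitalization"),
    ("Age", "DrugX"), ("Age", "Comorbidity"),
    ("Age", "Insurance"), ("Age", "Hospitalization"),
    ("Comorbidity", "Insurance"), ("Comorbidity", "Hospitalization"),
    ("Physician", "DrugX"),
    ("Hospital", "DrugX"), ("Hospital", "Hospitalization"),
    ("DrugX", "Biomarker"), ("DrugX", "Hospitalization"),
    ("Biomarker", "Hospitalization"),
    ("PriorHosp", "Hospitalization"),
]


def parents(node):
    return {a for a, b in EDGES if b == node}


def children(node):
    return {b for a, b in EDGES if a == node}


def descendants(node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in children(stack.pop()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def backdoor_paths(treatment, outcome):
    """Simple paths from treatment to outcome whose first edge points INTO treatment."""
    paths = []

    def walk(path):
        last = path[-1]
        if last == outcome:
            paths.append(path)
            return
        for nxt in parents(last) | children(last):
            if nxt not in path:
                walk(path + [nxt])

    for parent in parents(treatment):
        walk([treatment, parent])
    return paths


def is_blocked(path, adjustment_set):
    """Apply the d-separation rules to every interior node of the path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = prev in parents(node) and nxt in parents(node)
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on.
            if node not in adjustment_set and not (descendants(node) & adjustment_set):
                return True
        elif node in adjustment_set:
            # A chain or fork node blocks when it is conditioned on.
            return True
    return False


def satisfies_backdoor(adjustment_set, treatment="DrugX", outcome="Hospitalization"):
    if adjustment_set & descendants(treatment):
        return False  # the set must not contain descendants of the treatment
    return all(is_blocked(p, adjustment_set) for p in backdoor_paths(treatment, outcome))


print(satisfies_backdoor({"Severity", "Age", "Hospital"}))               # True
print(satisfies_backdoor({"Severity", "Age"}))                           # False
print(satisfies_backdoor({"Severity", "Age", "Hospital", "Biomarker"}))  # False
```

Under the DAG as drawn, {Severity, Age, Hospital} is already sufficient; dropping Hospital leaves the path through hospital protocols and quality open, and adding Biomarker violates the criterion because Biomarker is a descendant of Drug X.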
Classifying Each Variable
Using the DAG, the team classifies each candidate control:
import numpy as np
import pandas as pd


def classify_controls() -> pd.DataFrame:
    """Classify each candidate variable as a good, bad, or neutral control.

    Based on the MediCore causal DAG and the backdoor criterion
    for the effect of Drug X on Hospitalization.
    """
    variables = [
        {
            "variable": "Disease Severity",
            "role": "Confounder",
            "classification": "GOOD",
            "reason": (
                "Common cause of Drug X (severity drives prescribing) "
                "and Hospitalization (severity is primary risk factor). "
                "Blocks the main backdoor path."
            ),
        },
        {
            "variable": "Age",
            "role": "Confounder",
            "classification": "GOOD",
            "reason": (
                "Common cause of Drug X (age affects prescribing) "
                "and Hospitalization (age is a risk factor). "
                "Blocks a secondary backdoor path."
            ),
        },
        {
            "variable": "Comorbidity Count",
            "role": "Cause of outcome only",
            "classification": "GOOD (precision)",
            "reason": (
                "Causes Hospitalization but does not directly cause "
                "Drug X (conditional on severity). Including it does "
                "not affect bias but may improve precision."
            ),
        },
        {
            "variable": "Hospital",
            "role": "Confounder",
            "classification": "GOOD",
            "reason": (
                "Common cause of Drug X (hospital protocols) and "
                "Hospitalization (hospital quality). Blocks a "
                "backdoor path."
            ),
        },
        {
            "variable": "Biomarker Level",
            "role": "Mediator (post-treatment)",
            "classification": "BAD",
            "reason": (
                "Descendant of Drug X (Drug -> Biomarker). On the "
                "causal pathway Drug -> Biomarker -> Hospitalization. "
                "Conditioning blocks the mediated part of the causal "
                "effect, attenuating the estimate toward zero."
            ),
        },
        {
            "variable": "Insurance Status",
            "role": "Collider",
            "classification": "BAD",
            "reason": (
                "Collider on Age -> Insurance <- Comorbidities. "
                "Conditioning opens a spurious path between Age and "
                "Comorbidities, potentially introducing bias."
            ),
        },
        {
            "variable": "Prior Hospitalizations",
            "role": "Pre-treatment variable",
            "classification": "GOOD (with caution)",
            "reason": (
                "Caused by Disease Severity (a confounder). Including "
                "it may partially block the severity backdoor path "
                "when severity is imperfectly measured, and it may "
                "also proxy for unmeasured confounders. As a proxy, "
                "it cannot remove confounding completely."
            ),
        },
        {
            "variable": "Prescribing Physician",
            "role": "Cause of treatment only",
            "classification": "NEUTRAL / IV candidate",
            "reason": (
                "Causes Drug X but does not directly cause "
                "Hospitalization (conditional on patient factors). "
                "Including it as a control adds no bias under this "
                "DAG but can reduce precision. "
                "May be more useful as an instrument (Chapter 18)."
            ),
        },
    ]
    return pd.DataFrame(variables)


classification = classify_controls()
print("MediCore: Variable Classification for Backdoor Adjustment")
print("=" * 70)
for _, row in classification.iterrows():
    print(f"\n  {row['variable']}")
    print(f"    Role: {row['role']}")
    print(f"    Classification: {row['classification']}")
    print(f"    Reason: {row['reason']}")

MediCore: Variable Classification for Backdoor Adjustment
======================================================================

  Disease Severity
    Role: Confounder
    Classification: GOOD
    Reason: Common cause of Drug X (severity drives prescribing) and Hospitalization (severity is primary risk factor). Blocks the main backdoor path.

  Age
    Role: Confounder
    Classification: GOOD
    Reason: Common cause of Drug X (age affects prescribing) and Hospitalization (age is a risk factor). Blocks a secondary backdoor path.

  Comorbidity Count
    Role: Cause of outcome only
    Classification: GOOD (precision)
    Reason: Causes Hospitalization but does not directly cause Drug X (conditional on severity). Including it does not affect bias but may improve precision.

  Hospital
    Role: Confounder
    Classification: GOOD
    Reason: Common cause of Drug X (hospital protocols) and Hospitalization (hospital quality). Blocks a backdoor path.

  Biomarker Level
    Role: Mediator (post-treatment)
    Classification: BAD
    Reason: Descendant of Drug X (Drug -> Biomarker). On the causal pathway Drug -> Biomarker -> Hospitalization. Conditioning blocks the mediated part of the causal effect, attenuating the estimate toward zero.

  Insurance Status
    Role: Collider
    Classification: BAD
    Reason: Collider on Age -> Insurance <- Comorbidities. Conditioning opens a spurious path between Age and Comorbidities, potentially introducing bias.

  Prior Hospitalizations
    Role: Pre-treatment variable
    Classification: GOOD (with caution)
    Reason: Caused by Disease Severity (a confounder). Including it may partially block the severity backdoor path when severity is imperfectly measured, and it may also proxy for unmeasured confounders. As a proxy, it cannot remove confounding completely.

  Prescribing Physician
    Role: Cause of treatment only
    Classification: NEUTRAL / IV candidate
    Reason: Causes Drug X but does not directly cause Hospitalization (conditional on patient factors). Including it as a control adds no bias under this DAG but can reduce precision. May be more useful as an instrument (Chapter 18).
Empirical Validation
The team generates simulated data matching the DAG structure to validate their reasoning before applying it to the real EHR data:
def simulate_medicore_full(n: int = 50000, seed: int = 42) -> pd.DataFrame:
    """Simulate the full MediCore DAG with all candidate variables.

    The direct effect of Drug X on Hospitalization is set on the logit
    scale: true_drug_effect * 5 = -0.4. The nominal -0.08 (an 8 percentage
    point reduction in readmission probability) is only an approximate
    risk-difference reading of that coefficient. Drug X also lowers the
    biomarker, which adds an indirect effect of -8 * 0.01 = -0.08 on the
    logit scale.
    """
    rng = np.random.RandomState(seed)

    # Exogenous patient and site characteristics
    severity = rng.normal(50, 15, n)
    age = rng.normal(65, 10, n)
    comorbidities = rng.poisson(np.exp(0.01 * age - 0.3), n)
    hospital = rng.choice(["A", "B", "C", "D"], n, p=[0.3, 0.3, 0.2, 0.2])
    hospital_quality = {"A": 0.0, "B": -0.02, "C": 0.03, "D": 0.01}
    hospital_rx_rate = {"A": 0.0, "B": 0.5, "C": -0.3, "D": 0.2}
    physician = rng.choice(range(50), n)
    physician_preference = rng.normal(0, 0.5, 50)

    # Drug assignment (depends on severity, age, hospital, physician)
    drug_logit = (
        0.03 * (severity - 50)
        + 0.01 * (age - 65)
        + np.array([hospital_rx_rate[h] for h in hospital])
        + physician_preference[physician]
    )
    drug = rng.binomial(1, 1 / (1 + np.exp(-drug_logit)))

    # Biomarker (mediator: caused by severity and drug)
    biomarker = 50 + 0.3 * severity - 8 * drug + rng.normal(0, 5, n)

    # Prior hospitalizations (caused by severity)
    prior_hosp = rng.poisson(np.exp(-2 + 0.02 * severity), n)

    # Insurance status (collider: caused by age and comorbidities)
    ins_probs = np.column_stack([
        np.exp(0.02 * (age - 65)),      # Medicare (increases with age)
        np.exp(0.1 * comorbidities),    # Medicaid (increases with comorbidities)
        np.ones(n),                     # Private (baseline)
        np.exp(-0.01 * age),            # Uninsured (decreases with age)
    ])
    ins_probs = ins_probs / ins_probs.sum(axis=1, keepdims=True)
    insurance = np.array([
        rng.choice(["Medicare", "Medicaid", "Private", "Uninsured"], p=p)
        for p in ins_probs
    ])

    # Hospitalization (outcome)
    true_drug_effect = -0.08  # nominal 8 percentage point reduction
    hosp_logit = (
        -2.0
        + 0.04 * (severity - 50)
        + 0.02 * (age - 65)
        + 0.1 * comorbidities
        + 0.15 * prior_hosp
        + 0.01 * biomarker
        + true_drug_effect * 5 * drug  # scaled for logit: -0.4
        + np.array([hospital_quality[h] for h in hospital])
    )
    hospitalization = rng.binomial(1, 1 / (1 + np.exp(-hosp_logit)))

    return pd.DataFrame({
        "severity": severity, "age": age, "comorbidities": comorbidities,
        "hospital": hospital, "physician": physician,
        "drug": drug, "biomarker": biomarker,
        "prior_hosp": prior_hosp, "insurance": insurance,
        "hospitalization": hospitalization,
    })


sim = simulate_medicore_full()
print(f"Simulated data: {len(sim)} patients")
print(f"Drug X rate: {sim['drug'].mean():.3f}")
print(f"Readmission rate: {sim['hospitalization'].mean():.3f}")
Simulated data: 50000 patients
Drug X rate: 0.524
Readmission rate: 0.203
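Simulated data of this kind makes it possible to compare adjustment sets against a known truth. As a self-contained sketch of such a comparison, the block below uses a deliberately simplified, linear version of the DAG (one confounder, one mediator, a continuous outcome) rather than `simulate_medicore_full`. All coefficients and names are illustrative assumptions, chosen so that the true total effect is -0.8 and the direct effect is -0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simplified DAG: severity -> drug, severity -> outcome,
# drug -> biomarker -> outcome, drug -> outcome.
severity = rng.normal(0.0, 1.0, n)
drug = rng.binomial(1, 1.0 / (1.0 + np.exp(-severity)))
biomarker = 0.5 * severity - 1.0 * drug + rng.normal(0.0, 1.0, n)
outcome = 1.0 * severity - 0.5 * drug + 0.3 * biomarker + rng.normal(0.0, 1.0, n)

# Total effect = direct effect + (drug -> biomarker) * (biomarker -> outcome)
TRUE_TOTAL_EFFECT = -0.5 + (-1.0) * 0.3


def drug_coefficient(*covariates):
    """OLS coefficient on drug when regressing outcome on drug and the covariates."""
    design = np.column_stack([np.ones(n), drug, *covariates])
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1]


naive = drug_coefficient()                              # no adjustment
adjusted = drug_coefficient(severity)                   # confounder only
with_mediator = drug_coefficient(severity, biomarker)   # confounder + mediator

print(f"True total effect:        {TRUE_TOTAL_EFFECT:.3f}")
print(f"No adjustment:            {naive:.3f}")
print(f"Adjust for severity:      {adjusted:.3f}")
print(f"Adjust for sev + mediator: {with_mediator:.3f}")
```

In this sketch the unadjusted estimate is pulled upward by confounding, adjusting for severity recovers the total effect, and adding the mediator shrinks the estimate toward the direct effect alone, which is the pattern the classification table predicts.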
Results and Decision
The team selects the recommended adjustment set: {Disease Severity, Age, Comorbidity Count, Hospital, Prior Hospitalizations}. They explicitly exclude:
- Biomarker Level — mediator, would block the causal pathway.
- Insurance Status — collider, could open spurious paths.
- Prescribing Physician — reserved as a potential instrument for sensitivity analysis (Chapter 18).
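The collider exclusion can be illustrated with a generic simulation. In this sketch, two independent hypothetical causes `a` and `b` (stand-ins, not the MediCore variables) both raise a collider `c`. They are uncorrelated in the full sample but become negatively correlated once the analysis is restricted to a stratum of `c`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

a = rng.normal(size=n)           # first cause (independent of b)
b = rng.normal(size=n)           # second cause (independent of a)
c = a + b + rng.normal(size=n)   # collider: a -> c <- b

# Marginally, a and b are unrelated.
corr_all = np.corrcoef(a, b)[0, 1]

# Conditioning on the collider (here: keeping only high values of c)
# induces a spurious negative association between its causes.
stratum = c > 1.0
corr_in_stratum = np.corrcoef(a[stratum], b[stratum])[0, 1]

print(f"corr(a, b) in full sample: {corr_all:.3f}")
print(f"corr(a, b) given c > 1:    {corr_in_stratum:.3f}")
```

The same mechanism applies when a collider is included as a regression covariate: the association it induces between its causes can leak into the treatment coefficient if those causes are not themselves fully adjusted for.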
This analysis demonstrates the central lesson of graphical causal models: the decision of which covariates to include in an adjustment set is a causal decision, not a statistical one. The DAG — constructed from domain knowledge, not from the data — determines which variables are good controls and which are bad controls. No amount of statistical testing (significance tests, stepwise selection, LASSO) can substitute for this causal reasoning.
Production Reality: In practice, the DAG is always an approximation. The team acknowledges several uncertainties: (1) the assumption that Insurance Status has no direct effect on Hospitalization is debatable (insurance may affect access to follow-up care); (2) the Disease Severity composite score may not capture all relevant severity dimensions (measurement error in the confounder weakens the adjustment); (3) there may be unmeasured confounders not represented in the DAG at all. The sensitivity analysis framework from Chapter 16 (OVB formula, Cinelli and Hazlett bounds) provides a way to quantify how robust the estimate is to these concerns. The DAG does not eliminate uncertainty — it makes the uncertainty explicit and structured.
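Uncertainty (2) can be made concrete with a small simulation. The following is a sketch under stated assumptions, not a result from the MediCore data: a linear outcome with a true drug effect of -0.8, a severity score observed with noise, and a second noisy proxy that plays the role Prior Hospitalizations plays in the adjustment set. All names and coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
TRUE_EFFECT = -0.8

severity = rng.normal(0.0, 1.0, n)                     # true confounder
severity_score = severity + rng.normal(0.0, 1.0, n)    # noisy measurement
second_proxy = severity + rng.normal(0.0, 1.0, n)      # e.g. a prior-utilization proxy
drug = rng.binomial(1, 1.0 / (1.0 + np.exp(-severity)))
outcome = 1.0 * severity + TRUE_EFFECT * drug + rng.normal(0.0, 1.0, n)


def drug_coefficient(*covariates):
    """OLS coefficient on drug when regressing outcome on drug and the covariates."""
    design = np.column_stack([np.ones(n), drug, *covariates])
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1]


naive = drug_coefficient()
noisy_only = drug_coefficient(severity_score)
noisy_plus_proxy = drug_coefficient(severity_score, second_proxy)
oracle = drug_coefficient(severity)  # infeasible in practice: uses true severity

print(f"No adjustment:           {naive:.3f}")
print(f"Noisy score only:        {noisy_only:.3f}")
print(f"Noisy score + proxy:     {noisy_plus_proxy:.3f}")
print(f"True severity (oracle):  {oracle:.3f}")
```

In this setup, adjusting for the noisy score removes only part of the confounding bias, and adding the second proxy moves the estimate closer to the truth without reaching it. That residual gap is the kind of quantity a sensitivity analysis is meant to bound.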