Case Study 2: The Data Leakage Detective
When a Model Is Too Good to Be True, It Usually Is
The Setup
MedPredict Analytics builds predictive models for hospital systems. Their latest contract is with Regional West Medical Center, a 320-bed hospital that wants to predict 30-day readmissions. The goal: identify patients at discharge who are likely to be readmitted within 30 days, so that care coordinators can provide additional post-discharge support.
The data science team — Kenji (lead), Amara (mid-level), and Tyler (junior, three months on the job) — has access to two years of electronic health record data: demographics, diagnoses, procedures, lab results, medications, length of stay, and discharge records.
Tyler is assigned to build the first model. He is eager to prove himself.
The First Model
Tyler works quickly. He wants to demonstrate competence and deliver fast results. Within a week, he has:
- Pulled 48,000 patient discharge records from the data warehouse using a SQL query that joins the admissions, diagnoses, labs, and discharge tables
- Built 35 features including demographics (age, sex, insurance), diagnosis codes (count and categories), lab values (hemoglobin, creatinine, glucose at admission and discharge), length of stay, number of procedures, medications at discharge, and discharge information
- Handled missing values with median imputation
- Trained a gradient boosting model with GradientBoostingClassifier from scikit-learn using 200 estimators
- Evaluated on a random 80/20 train/test split
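For illustration, Tyler's pipeline might look roughly like this. The data below is synthetic and the shapes are stand-ins, not the actual Regional West schema; only the modeling choices (200-estimator gradient boosting, random 80/20 split) come from the case.

```python
# Hypothetical reconstruction of Tyler's pipeline on synthetic data;
# the real model used 35 engineered features from the EHR warehouse.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # stand-in feature matrix
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # stand-in outcome

# Random 80/20 split -- the choice the later sections revisit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC-ROC: {auc:.3f}")
```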
His results:
=== Tyler's Model Evaluation ===
AUC-ROC: 0.967
Accuracy: 0.942
Precision: 0.891
Recall: 0.874
F1: 0.882
Tyler is thrilled. He schedules a demo with Kenji and Amara.
The Red Flags
Kenji's first reaction to the AUC-ROC of 0.967: "That is way too high."
Published readmission models in the literature — including CMS's own Hospital Readmissions Reduction Program models — typically achieve AUC-ROC in the 0.65-0.75 range. The best research models with custom feature engineering rarely exceed 0.80. Tyler's model, built in a week by a junior data scientist, is not just good — it would be the best readmission model ever built.
Kenji does not congratulate Tyler. He asks three questions.
Investigation Phase 1: Feature Importance
"Show me the top 10 features by importance."
import pandas as pd
import numpy as np
# Tyler's feature importances (reconstructed for illustration)
feature_importance = pd.DataFrame({
"feature": [
"discharge_disposition",
"total_charges_billed",
"post_discharge_ed_visits",
"length_of_stay",
"num_diagnoses",
"age",
"num_medications",
"prior_admissions_12m",
"hemoglobin_at_discharge",
"insurance_type",
],
"importance": [0.341, 0.127, 0.098, 0.072, 0.061, 0.054, 0.048, 0.043, 0.039, 0.031],
})
print("Top 10 features by importance:")
print(feature_importance.to_string(index=False))
print(f"\nTop feature accounts for {feature_importance['importance'].iloc[0]:.1%} of total importance")
Top 10 features by importance:
feature importance
discharge_disposition 0.341
total_charges_billed 0.127
post_discharge_ed_visits 0.098
length_of_stay 0.072
num_diagnoses 0.061
age 0.054
num_medications 0.048
prior_admissions_12m 0.043
hemoglobin_at_discharge 0.039
insurance_type 0.031
Top feature accounts for 34.1% of total importance
Kenji circles three features. "These are your problems."
Leakage Source 1: discharge_disposition
Feature: discharge_disposition — where the patient went after discharge (home, skilled nursing facility, rehabilitation, hospice, expired).
Why it is leakage: discharge_disposition includes categories that are proxies for the outcome. Patients discharged to a skilled nursing facility have different readmission patterns than patients discharged home — but the discharge disposition is determined partly based on how sick the patient is, which is the same thing the model is trying to predict. More critically, discharge_disposition = "expired" means the patient died. Dead patients cannot be readmitted. The model learns: "if expired, predict no readmission." Technically correct. Clinically useless — you cannot intervene for a deceased patient.
Beyond the expired category, the discharge disposition encodes the clinical team's subjective assessment of the patient's risk at discharge. It is a human expert's prediction smuggled into the feature set.
The fix: Remove discharge_disposition entirely. If the hospital wants to use discharge destination as a factor, it should be modeled separately or treated as a post-hoc segmentation variable, not an input feature.
Tyler's reaction: "But removing it will tank our AUC." Kenji's response: "Good. Our current AUC is a lie. I would rather have an honest 0.72 than a fake 0.97."
Leakage Source 2: post_discharge_ed_visits
Feature: post_discharge_ed_visits — the number of emergency department visits the patient had after discharge but before readmission (or before the 30-day window closed).
Tyler had included this because "patients who visit the ED are sicker and more likely to be readmitted." His logic was correct. His timing was not.
Why it is leakage: This feature is computed using data from after the prediction point. At the moment of discharge — when the model needs to make its prediction — you do not know how many ED visits the patient will have in the next 30 days. Including this feature means the model is using future information to predict future outcomes. In production, this feature would always be zero at prediction time.
The fix: Remove post_discharge_ed_visits. If ED visits prior to the current admission are predictive, compute pre_admission_ed_visits_90d using data from before the admission date.
Leakage Source 3: total_charges_billed
Feature: total_charges_billed — the total charges for the patient's hospital stay.
This one is subtler.
Why it is leakage: At many hospitals, the final billing total is not computed until after discharge — sometimes days or weeks after. It incorporates post-discharge documentation, late charges, and retrospective coding adjustments. More importantly, readmitted patients often have higher total charges because (a) sicker patients cost more and (b) the billing system may retroactively adjust charges if a readmission triggers a penalty under CMS rules.
Tyler assumed total charges were known at discharge. At Regional West, they are not finalized until 3-5 business days later. The feature is contaminated with post-discharge information.
The fix: Replace with charges_through_day_before_discharge or estimated_charges_at_discharge — a value that is truly available at prediction time. Better yet, use length_of_stay and num_procedures as proxies for cost, since those are definitively known at discharge.
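The three fixes can be encoded directly in the feature pipeline. This is a sketch: the constant and function names (LEAKY_FEATURES, drop_leaky_features, pre_admission_ed_visits_90d) are illustrative, not part of the MedPredict codebase.

```python
import pandas as pd

# The three features Kenji flagged (illustrative constant name)
LEAKY_FEATURES = [
    "discharge_disposition",
    "post_discharge_ed_visits",
    "total_charges_billed",
]

def drop_leaky_features(X: pd.DataFrame) -> pd.DataFrame:
    """Remove known leakage sources, ignoring any that are already absent."""
    return X.drop(columns=[c for c in LEAKY_FEATURES if c in X.columns])

def pre_admission_ed_visits_90d(admit_date, ed_visit_dates) -> int:
    """Count ED visits strictly before admission and within the prior 90 days.

    Uses only information from before the admission date, so the value
    is genuinely available at prediction time.
    """
    admit = pd.Timestamp(admit_date)
    visits = pd.to_datetime(pd.Series(ed_visit_dates))
    in_window = (visits < admit) & (visits >= admit - pd.Timedelta(days=90))
    return int(in_window.sum())
```

The key property of the replacement feature is that its time window closes at admission, not at the end of the 30-day outcome window.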
Investigation Phase 2: The Corrected Model
Amara removes the three leaky features and retrains.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
# After removing leaky features:
# - discharge_disposition
# - post_discharge_ed_visits
# - total_charges_billed
# Remaining features: 32 (from 35)
# X_clean: feature matrix without leaky columns
# y: readmission within 30 days
# Use temporal split this time (Kenji insists)
# Train: discharges from months 1-18
# Test: discharges from months 19-24
# Results after retraining with clean features and temporal split
results = {
"AUC-ROC": 0.721,
"Accuracy": 0.834,
"Precision": 0.284,
"Recall": 0.391,
"F1": 0.329,
}
print("=== Corrected Model Evaluation ===")
for metric, value in results.items():
print(f"{metric:15s} {value:.3f}")
=== Corrected Model Evaluation ===
AUC-ROC 0.721
Accuracy 0.834
Precision 0.284
Recall 0.391
F1 0.329
AUC-ROC dropped from 0.967 to 0.721. The model went from "best in the world" to "competitive with published literature." The 0.721 is honest. It represents what the model can actually achieve in production with information available at the moment of discharge.
Tyler is initially demoralized. He spent a week building something he was proud of, and now the numbers are ordinary. Kenji reframes it: "A model with AUC-ROC of 0.72 that works in production is infinitely more valuable than a model with AUC of 0.97 that would have failed on its first day. You just saved this hospital from deploying a system that would have made decisions based on information it could not actually have. That is a win."
Amara adds a more pointed observation: "If we had deployed the leaky model, the care coordinators would have learned to trust the predictions. When the model inevitably failed in production — because the leaky features would not be available — they would have lost trust in the entire data science team. Recovering from a failed deployment is harder than getting the first one right."
Investigation Phase 3: A Deeper Audit
Kenji is not satisfied. The three obvious leaks are fixed, but he wants a systematic audit. He asks Amara to check every feature against a single criterion: "Is this feature's value definitively known at the moment the patient is discharged?"
Amara builds a feature audit table:
feature_audit = {
"age": {"available_at_discharge": True, "notes": "Demographics, known at admission"},
"sex": {"available_at_discharge": True, "notes": "Demographics"},
"insurance_type": {"available_at_discharge": True, "notes": "Verified at registration"},
"num_diagnoses": {"available_at_discharge": "Partial", "notes": "Primary diagnosis yes; secondary coding may be retroactive"},
"length_of_stay": {"available_at_discharge": True, "notes": "Computed from admit/discharge timestamps"},
"num_medications": {"available_at_discharge": True, "notes": "Medication list at discharge"},
"hemoglobin_at_discharge": {"available_at_discharge": True, "notes": "Lab result from last draw before discharge"},
"prior_admissions_12m": {"available_at_discharge": True, "notes": "Historical data"},
"num_procedures": {"available_at_discharge": "Partial", "notes": "Surgical procedures yes; some coded retroactively"},
"primary_diagnosis_category": {"available_at_discharge": "Partial", "notes": "Working diagnosis yes; final DRG may change"},
}
print(f"{'Feature':<30} {'Available?':<15} {'Notes'}")
print("-" * 80)
for feat, info in feature_audit.items():
print(f"{feat:<30} {str(info['available_at_discharge']):<15} {info['notes']}")
Feature Available? Notes
--------------------------------------------------------------------------------
age True Demographics, known at admission
sex True Demographics
insurance_type True Verified at registration
num_diagnoses Partial Primary diagnosis yes; secondary coding may be retroactive
length_of_stay True Computed from admit/discharge timestamps
num_medications True Medication list at discharge
hemoglobin_at_discharge True Lab result from last draw before discharge
prior_admissions_12m True Historical data
num_procedures Partial Surgical procedures yes; some coded retroactively
primary_diagnosis_category Partial Working diagnosis yes; final DRG may change
The "Partial" features are tricky. num_diagnoses might be 3 at discharge but 5 after retrospective coding. This is not catastrophic leakage — the magnitude is small — but it means the model's training data reflects a slightly different version of the feature than what will be available in production.
Kenji's decision: keep the "Partial" features but document the discrepancy. Monitor the feature distributions in production to verify they match training data.
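One common way to implement that monitoring is a population stability index (PSI) check on each feature, comparing training-time and production distributions. This is a generic sketch of the technique, not MedPredict's actual monitoring code; the thresholds in the docstring are conventional rules of thumb.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time (expected) and production (actual) sample.

    Rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants a look,
    > 0.25 suggests a meaningful shift (exact thresholds vary by team).
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges from the training distribution, with open-ended outer bins
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    edges = np.unique(edges)  # guard against duplicate edges from tied values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) for empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Run monthly per feature: a spike in PSI for num_diagnoses, for example, would signal that retrospective coding is making production values diverge from the training data.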
Investigation Phase 4: The Train/Test Split Problem
Tyler had used a random 80/20 split. Kenji asked Amara to also evaluate with a temporal split (train on months 1-18, test on months 19-24). Why?
Random split results: AUC-ROC = 0.748
Temporal split results: AUC-ROC = 0.721
The temporal split performance is lower. This is expected and informative. The gap (0.027) tells you that some patterns in the older data do not generalize to the newer data. Maybe hospital protocols changed. Maybe patient demographics shifted. Maybe seasonal patterns matter.
The temporal split is the honest estimate of production performance. The random split is optimistic because it allows the model to train on December data and test on July data — a sequence that will never happen in production.
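A temporal split is simple to implement once discharges carry a date. The sketch below uses synthetic data and illustrative column names, not the team's actual code; the essential property is that every training discharge precedes every test discharge.

```python
import numpy as np
import pandas as pd

def temporal_split(df, date_col, cutoff):
    """Train on rows before the cutoff date, test on rows at or after it."""
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Synthetic example: two years of discharges (dates are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "discharge_date": pd.to_datetime("2022-01-01")
                      + pd.to_timedelta(rng.integers(0, 730, size=1000), unit="D"),
    "feature": rng.normal(size=1000),
})

# Months 1-18 for training, months 19-24 for testing
train, test = temporal_split(df, "discharge_date", pd.Timestamp("2023-07-01"))
print(f"train: {len(train)} rows, test: {len(test)} rows")
```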
The Final Report
Kenji presents the findings to Regional West's Chief Medical Officer:
Original claim: AUC-ROC = 0.967 ("world-class model")
After leakage removal: AUC-ROC = 0.721 ("competitive with published models")
What we learned:
1. Three features were not available at prediction time
2. The model was learning shortcuts (dead patients do not get readmitted) rather than clinical risk factors
3. After correction, the model provides actionable predictions that the care coordination team can use at the point of discharge
Recommendation: Deploy the corrected model (AUC-ROC 0.721) with the following monitoring: track feature distributions monthly, compare predicted vs. actual readmission rates quarterly, retrain every 6 months.
The CMO appreciated the honesty. The hospital deployed the model. It identified a cohort of high-risk patients who received additional post-discharge phone calls and follow-up appointments. After six months, the 30-day readmission rate in the high-risk cohort decreased by 8% compared to the historical baseline — a modest but clinically meaningful improvement.
The Cost of Getting It Wrong
It is worth considering what would have happened if the leakage had not been caught.
Scenario: Leaky model deployed. The model goes live. Care coordinators receive a daily list of "high-risk" patients. The predictions are based partly on discharge_disposition and post_discharge_ed_visits. In production, post_discharge_ed_visits is always zero at the time of discharge (because no post-discharge ED visits have happened yet). The model's predictions immediately degrade. Instead of AUC-ROC of 0.967, the model performs at roughly 0.62 — worse than a simple heuristic based on age and diagnosis alone.
The care coordinators notice that the model's "high-risk" patients do not seem particularly at risk. They begin ignoring the model's recommendations. Within two months, the system is effectively abandoned, but the infrastructure continues running and consuming resources. The hospital has spent $150,000 on development, deployment, and care coordinator training for a system that nobody uses.
When the data science team investigates, they discover the leakage. But now they face a worse problem: the clinical staff has lost trust in ML-based predictions. Deploying the corrected model (AUC-ROC 0.72) is now an uphill battle — not because the model is bad, but because the first experience was bad. Trust is harder to rebuild than to build.
This scenario plays out more often than the industry admits. It is why leakage detection is not a nice-to-have — it is a deployment prerequisite.
A Leakage Detection Checklist
Based on this case, here is a systematic checklist for detecting leakage in any ML project:
1. Suspiciously high performance
   - AUC-ROC > 0.95 on a real-world problem demands investigation
   - Compare to published benchmarks for similar problems
2. Feature importance dominance
   - If one feature accounts for > 25% of total importance, audit it
   - Ask: "Is this feature truly independent of the target?"
3. Temporal audit
   - For every feature, ask: "Is this value known at the exact moment I need to make a prediction?"
   - Trace the feature back to its source system and verify when the value is finalized
4. Train/test gap
   - Compare random split performance to temporal split performance
   - A large gap (> 0.05 AUC) suggests either temporal leakage or concept drift
5. Production simulation
   - Before deployment, run the model on the most recent data as if it were production
   - Compare to test set performance
   - If production simulation performance is much worse, investigate
Key Takeaways from This Case
- Suspiciously high performance is a warning, not a cause for celebration. AUC-ROC of 0.967 on a readmission task should immediately trigger an investigation. Compare your results to published benchmarks for the same problem.
- Leakage can be subtle. post_discharge_ed_visits was temporal leakage (using future data). total_charges_billed was a timing issue (the value was not finalized at prediction time). discharge_disposition was a proxy for the outcome (encoding expert judgment about the patient's risk). Three different leakage mechanisms, all in the same model.
- The fix always reduces performance, and that is the point. Removing leaky features dropped AUC from 0.967 to 0.721. This feels like a loss. It is actually a gain: the 0.721 is real and the 0.967 was a mirage.
- Feature auditing should be systematic, not ad hoc. Every feature should be traced back to its source and verified for temporal availability. This audit should be documented and reviewed before any model is deployed.
- Junior team members need guardrails, not blame. Tyler's mistakes were predictable: he did not have the experience to know what to look for. The process should have included a leakage review step before results were presented.
Discussion Questions
- Tyler's leakage was caught before deployment. In many organizations, it would not have been. What process controls could an organization implement to catch leakage systematically? Consider both technical (automated checks) and organizational (review processes) approaches.
- The num_diagnoses feature was classified as "Partial": available at discharge but potentially modified afterward through retrospective coding. How would you handle this in practice? Would you use the discharge-time value, the final coded value, or exclude it entirely? What are the tradeoffs of each approach?
- The corrected model achieves AUC-ROC of 0.721. Published readmission models typically achieve 0.65-0.75. Is 0.721 good enough to deploy? What additional information would you need to make this decision? Consider both statistical and business factors.
- Regional West's 8% readmission reduction was achieved through post-discharge phone calls and follow-up appointments. How would you design an A/B test to rigorously measure this impact and separate the model's contribution from the intervention's effectiveness? What ethical concerns arise from randomizing patients to a control group that does not receive additional support?
- Tyler was three months into his first job and made a mistake that could have cost the hospital significant credibility. What could Kenji have done differently in terms of team process to reduce the risk of this mistake occurring? Design a code review checklist for ML models that a team lead could use.