Case Study 2: The Data Leakage Detective
When a Model Is Too Good to Be True, It Usually Is
The Setup
MedPredict Analytics builds predictive models for hospital systems. Their latest contract is with Regional West Medical Center, a 320-bed hospital that wants to predict 30-day readmissions. The goal: identify patients at discharge who are likely to be readmitted within 30 days, so that care coordinators can provide additional post-discharge support.
The data science team — Kenji (lead), Amara (mid-level), and Tyler (junior, three months on the job) — has access to two years of electronic health record data: demographics, diagnoses, procedures, lab results, medications, length of stay, and discharge records.
Tyler is assigned to build the first model. He is eager to prove himself.
The First Model
Tyler works quickly. He wants to demonstrate competence and deliver fast results. Within a week, he has:
- Pulled 48,000 patient discharge records from the data warehouse using a SQL query that joins the admissions, diagnoses, labs, and discharge tables
- Built 35 features including demographics (age, sex, insurance), diagnosis codes (count and categories), lab values (hemoglobin, creatinine, glucose at admission and discharge), length of stay, number of procedures, medications at discharge, and discharge information
- Handled missing values with median imputation
- Trained a gradient boosting model with GradientBoostingClassifier from scikit-learn using 200 estimators
- Evaluated on a random 80/20 train/test split
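For illustration, Tyler's pipeline might look roughly like this. The data below is synthetic and the shapes are stand-ins, not the actual Regional West schema; only the modeling choices (200-estimator gradient boosting, random 80/20 split) come from the case.

```python
# Hypothetical reconstruction of Tyler's pipeline on synthetic data;
# the real model used 35 engineered features from the EHR warehouse.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # stand-in feature matrix
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # stand-in outcome

# Random 80/20 split -- the choice the later sections revisit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC-ROC: {auc:.3f}")
```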
His results:
=== Tyler's Model Evaluation ===
AUC-ROC: 0.967
Accuracy: 0.942
Precision: 0.891
Recall: 0.874
F1: 0.882
Tyler is thrilled. He schedules a demo with Kenji and Amara.
The Red Flags
Kenji's first reaction to the AUC-ROC of 0.967: "That is way too high."
Published readmission models in the literature — including CMS's own Hospital Readmissions Reduction Program models — typically achieve AUC-ROC in the 0.65-0.75 range. The best research models with custom feature engineering rarely exceed 0.80. Tyler's model, built in a week by a junior data scientist, is not just good — it would be the best readmission model ever built.
Kenji does not congratulate Tyler. He asks three questions.
Investigation Phase 1: Feature Importance
"Show me the top 10 features by importance."
import pandas as pd
import numpy as np
# Tyler's feature importances (reconstructed for illustration)
feature_importance = pd.DataFrame({
"feature": [
"discharge_disposition",
"total_charges_billed",
"post_discharge_ed_visits",
"length_of_stay",
"num_diagnoses",
"age",
"num_medications",
"prior_admissions_12m",
"hemoglobin_at_discharge",
"insurance_type",
],
"importance": [0.341, 0.127, 0.098, 0.072, 0.061, 0.054, 0.048, 0.043, 0.039, 0.031],
})
print("Top 10 features by importance:")
print(feature_importance.to_string(index=False))
print(f"\nTop feature accounts for {feature_importance['importance'].iloc[0]:.1%} of total importance")
Top 10 features by importance:
feature importance
discharge_disposition 0.341
total_charges_billed 0.127
post_discharge_ed_visits 0.098
length_of_stay 0.072
num_diagnoses 0.061
age 0.054
num_medications 0.048
prior_admissions_12m 0.043
hemoglobin_at_discharge 0.039
insurance_type 0.031
Top feature accounts for 34.1% of total importance
Kenji circles three features. "These are your problems."
Leakage Source 1: discharge_disposition
Feature: discharge_disposition — where the patient went after discharge (home, skilled nursing facility, rehabilitation, hospice, expired).
Why it is leakage: discharge_disposition includes categories that are proxies for the outcome. Patients discharged to a skilled nursing facility have different readmission patterns than patients discharged home — but the discharge disposition is determined partly based on how sick the patient is, which is the same thing the model is trying to predict. More critically, discharge_disposition = "expired" means the patient died. Dead patients cannot be readmitted. The model learns: "if expired, predict no readmission." Technically correct. Clinically useless — you cannot intervene for a deceased patient.
Beyond the expired category, the discharge disposition encodes the clinical team's subjective assessment of the patient's risk at discharge. It is a human expert's prediction smuggled into the feature set.
The fix: Remove discharge_disposition entirely. If the hospital wants to use discharge destination as a factor, it should be modeled separately or treated as a post-hoc segmentation variable, not an input feature.
Tyler's reaction: "But removing it will tank our AUC." Kenji's response: "Good. Our current AUC is a lie. I would rather have an honest 0.72 than a fake 0.97."
Leakage Source 2: post_discharge_ed_visits
Feature: post_discharge_ed_visits — the number of emergency department visits the patient had after discharge but before readmission (or before the 30-day window closed).
Tyler had included this because "patients who visit the ED are sicker and more likely to be readmitted." His logic was correct. His timing was not.
Why it is leakage: This feature is computed using data from after the prediction point. At the moment of discharge — when the model needs to make its prediction — you do not know how many ED visits the patient will have in the next 30 days. Including this feature means the model is using future information to predict future outcomes. In production, this feature would always be zero at prediction time.
The fix: Remove post_discharge_ed_visits. If ED visits prior to the current admission are predictive, compute pre_admission_ed_visits_90d using data from before the admission date.
Leakage Source 3: total_charges_billed
Feature: total_charges_billed — the total charges for the patient's hospital stay.
This one is subtler.
Why it is leakage: At many hospitals, the final billing total is not computed until after discharge — sometimes days or weeks after. It incorporates post-discharge documentation, late charges, and retrospective coding adjustments. More importantly, readmitted patients often have higher total charges because (a) sicker patients cost more and (b) the billing system may retroactively adjust charges if a readmission triggers a penalty under CMS rules.
Tyler assumed total charges were known at discharge. At Regional West, they are not finalized until 3-5 business days later. The feature is contaminated with post-discharge information.
The fix: Replace with charges_through_day_before_discharge or estimated_charges_at_discharge — a value that is truly available at prediction time. Better yet, use length_of_stay and num_procedures as proxies for cost, since those are definitively known at discharge.
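The three fixes can be encoded directly in the feature pipeline. This is a sketch: the constant and function names (LEAKY_FEATURES, drop_leaky_features, pre_admission_ed_visits_90d) are illustrative, not part of the MedPredict codebase.

```python
import pandas as pd

# The three features Kenji flagged (illustrative constant name)
LEAKY_FEATURES = [
    "discharge_disposition",
    "post_discharge_ed_visits",
    "total_charges_billed",
]

def drop_leaky_features(X: pd.DataFrame) -> pd.DataFrame:
    """Remove known leakage sources, ignoring any that are already absent."""
    return X.drop(columns=[c for c in LEAKY_FEATURES if c in X.columns])

def pre_admission_ed_visits_90d(admit_date, ed_visit_dates) -> int:
    """Count ED visits strictly before admission and within the prior 90 days.

    Uses only information from before the admission date, so the value
    is genuinely available at prediction time.
    """
    admit = pd.Timestamp(admit_date)
    visits = pd.to_datetime(pd.Series(ed_visit_dates))
    in_window = (visits < admit) & (visits >= admit - pd.Timedelta(days=90))
    return int(in_window.sum())
```

The key property of the replacement feature is that its time window closes at admission, not at the end of the 30-day outcome window.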
Investigation Phase 2: The Corrected Model
Amara removes the three leaky features and retrains.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
# After removing leaky features:
# - discharge_disposition
# - post_discharge_ed_visits
# - total_charges_billed
# Remaining features: 32 (from 35)
# X_clean: feature matrix without leaky columns
# y: readmission within 30 days
# Use temporal split this time (Kenji insists)
# Train: discharges from months 1-18
# Test: discharges from months 19-24
# Results after retraining with clean features and temporal split
results = {
"AUC-ROC": 0.721,
"Accuracy": 0.834,
"Precision": 0.284,
"Recall": 0.391,
"F1": 0.329,
}
print("=== Corrected Model Evaluation ===")
for metric, value in results.items():
print(f"{metric:15s} {value:.3f}")
=== Corrected Model Evaluation ===
AUC-ROC 0.721
Accuracy 0.834
Precision 0.284
Recall 0.391
F1 0.329
AUC-ROC dropped from 0.967 to 0.721. The model went from "best in the world" to "competitive with published literature." The 0.721 is honest. It represents what the model can actually achieve in production with information available at the moment of discharge.
Tyler is initially demoralized. He spent a week building something he was proud of, and now the numbers are ordinary. Kenji reframes it: "A model with AUC-ROC of 0.72 that works in production is infinitely more valuable than a model with AUC of 0.97 that would have failed on its first day. You just saved this hospital from deploying a system that would have made decisions based on information it could not actually have. That is a win."
Amara adds a more pointed observation: "If we had deployed the leaky model, the care coordinators would have learned to trust the predictions. When the model inevitably failed in production — because the leaky features would not be available — they would have lost trust in the entire data science team. Recovering from a failed deployment is harder than getting the first one right."
Investigation Phase 3: A Deeper Audit
Kenji is not satisfied. The three obvious leaks are fixed, but he wants a systematic audit. He asks Amara to check every feature against a single criterion: "Is this feature's value definitively known at the moment the patient is discharged?"
Amara builds a feature audit table:
feature_audit = {
"age": {"available_at_discharge": True, "notes": "Demographics, known at admission"},
"sex": {"available_at_discharge": True, "notes": "Demographics"},
"insurance_type": {"available_at_discharge": True, "notes": "Verified at registration"},
"num_diagnoses": {"available_at_discharge": "Partial", "notes": "Primary diagnosis yes; secondary coding may be retroactive"},
"length_of_stay": {"available_at_discharge": True, "notes": "Computed from admit/discharge timestamps"},
"num_medications": {"available_at_discharge": True, "notes": "Medication list at discharge"},
"hemoglobin_at_discharge": {"available_at_discharge": True, "notes": "Lab result from last draw before discharge"},
"prior_admissions_12m": {"available_at_discharge": True, "notes": "Historical data"},
"num_procedures": {"available_at_discharge": "Partial", "notes": "Surgical procedures yes; some coded retroactively"},
"primary_diagnosis_category": {"available_at_discharge": "Partial", "notes": "Working diagnosis yes; final DRG may change"},
}
print(f"{'Feature':<30} {'Available?':<15} {'Notes'}")
print("-" * 80)
for feat, info in feature_audit.items():
print(f"{feat:<30} {str(info['available_at_discharge']):<15} {info['notes']}")
Feature Available? Notes
--------------------------------------------------------------------------------
age True Demographics, known at admission
sex True Demographics
insurance_type True Verified at registration
num_diagnoses Partial Primary diagnosis yes; secondary coding may be retroactive
length_of_stay True Computed from admit/discharge timestamps
num_medications True Medication list at discharge
hemoglobin_at_discharge True Lab result from last draw before discharge
prior_admissions_12m True Historical data
num_procedures Partial Surgical procedures yes; some coded retroactively
primary_diagnosis_category Partial Working diagnosis yes; final DRG may change
The "Partial" features are tricky. num_diagnoses might be 3 at discharge but 5 after retrospective coding. This is not catastrophic leakage — the magnitude is small — but it means the model's training data reflects a slightly different version of the feature than what will be available in production.
Kenji's decision: keep the "Partial" features but document the discrepancy. Monitor the feature distributions in production to verify they match training data.
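One common way to implement that monitoring is a population stability index (PSI) check on each feature, comparing training-time and production distributions. This is a generic sketch of the technique, not MedPredict's actual monitoring code; the thresholds in the docstring are conventional rules of thumb.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time (expected) and production (actual) sample.

    Rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants a look,
    > 0.25 suggests a meaningful shift (exact thresholds vary by team).
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges from the training distribution, with open-ended outer bins
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    edges = np.unique(edges)  # guard against duplicate edges from tied values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) for empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Run monthly per feature: a spike in PSI for num_diagnoses, for example, would signal that retrospective coding is making production values diverge from the training data.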
Investigation Phase 4: The Train/Test Split Problem
Tyler had used a random 80/20 split. Kenji asked Amara to also evaluate with a temporal split (train on months 1-18, test on months 19-24). Why?
Random split results: AUC-ROC = 0.748
Temporal split results: AUC-ROC = 0.721
The temporal split performance is lower. This is expected and informative. The gap (0.027) tells you that some patterns in the older data do not generalize to the newer data. Maybe hospital protocols changed. Maybe patient demographics shifted. Maybe seasonal patterns matter.
The temporal split is the honest estimate of production performance. The random split is optimistic because it allows the model to train on December data and test on July data — a sequence that will never happen in production.
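A temporal split is simple to implement once discharges carry a date. The sketch below uses synthetic data and illustrative column names, not the team's actual code; the essential property is that every training discharge precedes every test discharge.

```python
import numpy as np
import pandas as pd

def temporal_split(df, date_col, cutoff):
    """Train on rows before the cutoff date, test on rows at or after it."""
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Synthetic example: two years of discharges (dates are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "discharge_date": pd.to_datetime("2022-01-01")
                      + pd.to_timedelta(rng.integers(0, 730, size=1000), unit="D"),
    "feature": rng.normal(size=1000),
})

# Months 1-18 for training, months 19-24 for testing
train, test = temporal_split(df, "discharge_date", pd.Timestamp("2023-07-01"))
print(f"train: {len(train)} rows, test: {len(test)} rows")
```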
The Final Report
Kenji presents the findings to Regional West's Chief Medical Officer:
Original claim: AUC-ROC = 0.967 ("world-class model")
After leakage removal: AUC-ROC = 0.721 ("competitive with published models")
What we learned:
1. Three features were not available at prediction time
2. The model was learning shortcuts (dead patients do not get readmitted) rather than clinical risk factors
3. After correction, the model provides actionable predictions that the care coordination team can use at the point of discharge
Recommendation: Deploy the corrected model (AUC-ROC 0.721) with the following monitoring: track feature distributions monthly, compare predicted vs. actual readmission rates quarterly, retrain every 6 months.
The CMO appreciated the honesty. The hospital deployed the model. It identified a cohort of high-risk patients who received additional post-discharge phone calls and follow-up appointments. After six months, the 30-day readmission rate in the high-risk cohort decreased by 8% compared to the historical baseline — a modest but clinically meaningful improvement.
The Cost of Getting It Wrong
It is worth considering what would have happened if the leakage had not been caught.
Scenario: Leaky model deployed. The model goes live. Care coordinators receive a daily list of "high-risk" patients. The predictions are based partly on discharge_disposition and post_discharge_ed_visits. In production, post_discharge_ed_visits is always zero at the time of discharge (because no post-discharge ED visits have happened yet). The model's predictions immediately degrade. Instead of AUC-ROC of 0.967, the model performs at roughly 0.62 — worse than a simple heuristic based on age and diagnosis alone.
The care coordinators notice that the model's "high-risk" patients do not seem particularly at risk. They begin ignoring the model's recommendations. Within two months, the system is effectively abandoned, but the infrastructure continues running and consuming resources. The hospital has spent $150,000 on development, deployment, and care coordinator training for a system that nobody uses.
When the data science team investigates, they discover the leakage. But now they face a worse problem: the clinical staff has lost trust in ML-based predictions. Deploying the corrected model (AUC-ROC 0.72) is now an uphill battle — not because the model is bad, but because the first experience was bad. Trust is harder to rebuild than to build.
This scenario plays out more often than the industry admits. It is why leakage detection is not a nice-to-have — it is a deployment prerequisite.
A Leakage Detection Checklist
Based on this case, here is a systematic checklist for detecting leakage in any ML project:
1. Suspiciously high performance
   - AUC-ROC > 0.95 on a real-world problem demands investigation
   - Compare to published benchmarks for similar problems
2. Feature importance dominance
   - If one feature accounts for > 25% of total importance, audit it
   - Ask: "Is this feature truly independent of the target?"
3. Temporal audit
   - For every feature, ask: "Is this value known at the exact moment I need to make a prediction?"
   - Trace the feature back to its source system and verify when the value is finalized
4. Train/test gap
   - Compare random split performance to temporal split performance
   - A large gap (> 0.05 AUC) suggests either temporal leakage or concept drift
5. Production simulation
   - Before deployment, run the model on the most recent data as if it were production
   - Compare to test set performance
   - If production simulation performance is much worse, investigate
Key Takeaways from This Case
- Suspiciously high performance is a warning, not a cause for celebration. AUC-ROC of 0.967 on a readmission task should immediately trigger an investigation. Compare your results to published benchmarks for the same problem.
- Leakage can be subtle. post_discharge_ed_visits was temporal leakage (using future data). total_charges_billed was a timing issue (the value was not finalized at prediction time). discharge_disposition was a proxy for the outcome (encoding expert judgment about the patient's risk). Three different leakage mechanisms, all in the same model.
- The fix always reduces performance, and that is the point. Removing leaky features dropped AUC from 0.967 to 0.721. This feels like a loss. It is actually a gain: the 0.721 is real and the 0.967 was a mirage.
- Feature auditing should be systematic, not ad hoc. Every feature should be traced back to its source and verified for temporal availability. This audit should be documented and reviewed before any model is deployed.
- Junior team members need guardrails, not blame. Tyler's mistakes were predictable: he did not have the experience to know what to look for. The process should have included a leakage review step before results were presented.
Discussion Questions
- Tyler's leakage was caught before deployment. In many organizations, it would not have been. What process controls could an organization implement to catch leakage systematically? Consider both technical (automated checks) and organizational (review processes) approaches.
- The num_diagnoses feature was classified as "Partial": available at discharge but potentially modified afterward through retrospective coding. How would you handle this in practice? Would you use the discharge-time value, the final coded value, or exclude it entirely? What are the tradeoffs of each approach?
- The corrected model achieves AUC-ROC of 0.721. Published readmission models typically achieve 0.65-0.75. Is 0.721 good enough to deploy? What additional information would you need to make this decision? Consider both statistical and business factors.
- Regional West's 8% readmission reduction was achieved through post-discharge phone calls and follow-up appointments. How would you design an A/B test to rigorously measure this impact and separate the model's contribution from the intervention's effectiveness? What ethical concerns arise from randomizing patients to a control group that does not receive additional support?
- Tyler was three months into his first job and made a mistake that could have cost the hospital significant credibility. What could Kenji have done differently in terms of team process to reduce the risk of this mistake occurring? Design a code review checklist for ML models that a team lead could use.