Case Study 2: The Data Leakage Disaster — A Cautionary Tale


Tier 3 — Illustrative/Composite Example: This case study is fictional but based on widely reported patterns of data leakage in machine learning projects. The specific scenarios, data, and organizations are invented for pedagogical purposes. However, every type of leakage described here has been documented in real projects — some published in academic papers, others shared in industry post-mortems.


The Setup

This is a story about three teams at three different organizations. Each team built a machine learning model. Each model looked spectacular in development. And each model failed badly in production — all because of data leakage.

These aren't stories about bad data scientists. These are stories about talented, hardworking people who made mistakes that are easy to make and hard to catch. The mistakes are so common that an entire category of research papers exists to identify and prevent them.

The lesson isn't "don't be careless." The lesson is "leakage is subtle, pervasive, and can fool even experienced practitioners." The only reliable protection is a rigorous workflow — which is exactly what Chapter 30 teaches.


Story 1: The Scaler That Knew Too Much

The Project

Maplewood Hospital's analytics team builds a model to predict which patients admitted through the emergency department will need ICU transfer within 24 hours. Early identification allows the hospital to prepare resources and potentially improve outcomes.

The Model

The team's data scientist, Ravi, builds a random forest using 15 clinical features (vital signs, lab results, age, admission diagnosis). He follows what he thinks is best practice:

# Ravi's workflow (with hidden leakage)
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Scale all features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)    # <-- Leakage point #1

# Cross-validate
rf = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(rf, X_scaled, y, cv=10, scoring='roc_auc')
print(f"AUC: {scores.mean():.3f}")
# Output: AUC: 0.912

AUC of 0.912. Ravi is excited. The hospital's chief medical officer reviews the number and approves a pilot deployment.

The Failure

In the first month of deployment, the model's AUC drops to 0.873. Not catastrophic, but a real decline. Nobody raises an alarm, because 0.873 is still decent. But over the next six months, as the team adds new data and retrains, performance drifts further.

Eventually, someone does a careful analysis and discovers the original sin: the scaler was fit on all data, including the cross-validation test folds. The training-time AUC of 0.912 was inflated. The true performance was always closer to 0.87-0.88.

The Fix

# Correct workflow with Pipeline
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])

scores = cross_val_score(pipe, X, y, cv=10, scoring='roc_auc')
print(f"AUC: {scores.mean():.3f}")
# Output: AUC: 0.878

The corrected AUC (0.878) matches what they observed in deployment. The leakage had inflated the cross-validation estimate by about 0.034, or roughly 3 percentage points of AUC. On its own, this seems small. But it was enough to give the team false confidence in a threshold they set based on the inflated number. Patients who should have been flagged for ICU were missed because the threshold was calibrated to an unrealistically optimistic model.

The Lesson

Scaling leakage is the most common and least dramatic form of data leakage. It rarely breaks a model completely — it just makes everything look a little better than it really is. That's what makes it dangerous: the model "works," so nobody investigates, and the gap between reported and actual performance becomes a silent tax on every decision the model informs.
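The same mechanism shows up even more dramatically with other preprocessing steps. The sketch below uses feature selection rather than scaling (a related, better-known demonstration on synthetic data, not part of Ravi's project): when the "best" features are chosen using all rows, cross-validation reports strong performance on data that is pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.randn(200, 1000)       # pure noise: there is nothing to learn
y = rng.randint(0, 2, 200)

# Leaky: pick the 10 "best" features using ALL rows, then cross-validate
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_selected, y,
                        cv=5, scoring='roc_auc').mean()

# Honest: selection is a pipeline step, refit inside each training fold
pipe = Pipeline([('select', SelectKBest(f_classif, k=10)),
                 ('model', LogisticRegression())])
honest = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean()

print(f"leaky AUC:  {leaky:.2f}")   # typically far above 0.5 despite pure noise
print(f"honest AUC: {honest:.2f}")  # close to 0.5, as it should be
```

The only difference between the two runs is whether the preprocessing step sees the test folds, which is exactly the difference between Ravi's two workflows.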


Story 2: The Feature That Came from the Future

The Project

Riverdale Insurance wants to predict which policyholders will file a claim in the next 12 months. The goal is proactive outreach: offer safety programs to high-risk customers before they have an accident.

The Model

The data science team compiles 50 features for each policyholder, including demographics, policy details, driving history, vehicle information, and location. Their gradient boosting model achieves AUC of 0.95 — which would be extraordinarily good for insurance prediction, where AUCs above 0.75 are considered strong.

Nobody questions it at first. Then an actuary asks: "Which feature is driving the model?"

The team checks feature importance and discovers that policy_status_change_count — the number of times a customer's policy was modified in the past year — is by far the most important feature.

The Problem

Customers who are about to file a claim often change their policy settings in the weeks before the claim — upgrading coverage, adding riders, or adjusting deductibles. The feature wasn't predicting future claims; it was detecting claim-related activity that had already begun. Some "policy changes" were even initiated by the claims process itself — a data entry that was backdated into the feature window.

This is target leakage: a feature that contains information about the target variable that wouldn't be available at prediction time.

Why It's Insidious

Target leakage doesn't produce a technical error. The model trains and evaluates normally. The metrics look excellent — because the model really is detecting a strong signal. But the signal is contaminated: it comes from the future (or from the process that creates the target), not from genuinely predictive information available before the event.

When the model was deployed to predict which current customers would file claims in the future, the policy_status_change_count feature reflected the customer's normal, pre-event behavior — and the model's AUC dropped to 0.72.

The Fix

The team removed policy_status_change_count and any other feature that could be influenced by the claim process. They established a strict rule: every feature must reflect information available at least 30 days before the prediction target date. They re-evaluated each feature by asking: "Would we know this value at the moment we need to make the prediction?"

The resulting model had AUC of 0.76 — honest, realistic, and actually useful.

The Lesson

Always ask about every feature: "Would I know this value at the time I need to make the prediction?" If the answer is no — or even "maybe" — investigate further. Features that are causally downstream of the target (effects of the event you're trying to predict) are among the most dangerous forms of leakage.
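The mechanism can be reproduced on synthetic data. In this sketch (an illustration with invented variable names, not Riverdale's actual pipeline), a feature generated downstream of the target dominates a legitimate pre-event feature and inflates cross-validated AUC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 2000
risk = rng.randn(n)                          # latent pre-event risk
y = (risk + rng.randn(n) > 1.0).astype(int)  # claim filed in the next year

# Legitimate feature: a noisy view of pre-event risk
driving_record = risk + rng.randn(n)

# Leaky feature: policy changes triggered by the claim process itself
policy_changes = y * rng.poisson(3, n) + rng.poisson(0.5, n)

clf = LogisticRegression()
auc_leaky = cross_val_score(
    clf, np.column_stack([driving_record, policy_changes]),
    y, cv=5, scoring='roc_auc').mean()
auc_clean = cross_val_score(
    clf, driving_record.reshape(-1, 1),
    y, cv=5, scoring='roc_auc').mean()

print(f"with leaky feature:      {auc_leaky:.2f}")  # inflated
print(f"pre-event features only: {auc_clean:.2f}")  # honest, lower
```

Cross-validation cannot catch this: the leaky feature is genuinely informative on historical data, because it was generated after the event. Only a feature-by-feature temporal audit reveals it.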


Story 3: The Duplicate That Inflated Everything

The Project

A research team at a university publishes a paper claiming that their deep learning model can predict disease from medical images with 97% accuracy — a breakthrough that would transform diagnostic medicine. The paper goes viral.

The Model

The team used a large dataset of medical images, with each image labeled "disease" or "healthy." They split the data into 80% training and 20% testing, trained a sophisticated model, and reported stunning results.

The Problem

A year later, another research team attempts to replicate the results and fails. Their replication model achieves only 81% accuracy on the same dataset. After months of investigation, they discover the problem:

Many patients in the dataset had multiple images. The same patient's images appeared in both the training and test sets. The model didn't learn to detect the disease — it learned to recognize individual patients. A patient's images in the test set were nearly identical to their images in the training set, so the model just looked up the nearest match.

This is group leakage (also called data snooping through grouping): when related observations appear in both the training and test sets, the model can exploit the similarity between them rather than learning the underlying pattern.

The Fix

The replication team split the data by patient, not by image: all of a given patient's images go into either training or testing, never both. With this patient-aware split, the best model achieved 82% accuracy — a meaningful result, but far from the original claim.

from sklearn.model_selection import GroupKFold, cross_val_score

# patient_ids: array of patient IDs corresponding to each image
# X, y, model: the image features, labels, and estimator as before
group_cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=group_cv, groups=patient_ids)

The Lesson

When your data has natural groups (patients, students, companies, geographic regions), you must split by group, not by individual observation. GroupKFold and GroupShuffleSplit in scikit-learn handle this automatically. Before splitting, always ask: "Are any observations in my data related to each other in a way that would make them too similar?"
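The patient-memorization effect is easy to reproduce on synthetic data. In this sketch (invented numbers, not the paper's dataset), each "patient" has a distinctive feature fingerprint repeated across several near-identical "images," while the label carries no real signal at all:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.RandomState(0)
n_patients, images_each = 100, 5

# Each patient has a fixed "fingerprint"; labels are random (no real signal)
fingerprints = rng.randn(n_patients, 20)
labels = rng.randint(0, 2, n_patients)

# Near-duplicate images of each patient, plus a little noise
X = (np.repeat(fingerprints, images_each, axis=0)
     + 0.05 * rng.randn(n_patients * images_each, 20))
y = np.repeat(labels, images_each)
groups = np.repeat(np.arange(n_patients), images_each)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
naive = cross_val_score(clf, X, y,
                        cv=KFold(5, shuffle=True, random_state=0)).mean()
grouped = cross_val_score(clf, X, y, cv=GroupKFold(5), groups=groups).mean()

print(f"row-wise split accuracy:     {naive:.2f}")    # memorizes patients
print(f"patient-wise split accuracy: {grouped:.2f}")  # near chance
```

With a row-wise split, each test image has near-identical siblings in training, so the model can score well by matching fingerprints; with a patient-wise split, accuracy falls to chance, exposing the absence of a real disease signal.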


The Taxonomy of Data Leakage

These three stories illustrate the three major categories of data leakage:

| Type | Description | Example | Prevention |
| --- | --- | --- | --- |
| Preprocessing leakage | Statistics from the test set influence preprocessing (scaling, encoding, imputation) | Fitting StandardScaler on all data before splitting | Use Pipelines |
| Target leakage | A feature contains information about the target that wouldn't be available at prediction time | Policy changes caused by the claim you're trying to predict | Audit every feature's temporal availability |
| Group leakage | Related observations appear in both train and test, allowing the model to "cheat" | Same patient's images in both train and test | Split by group (GroupKFold) |

All three are preventable. All three are common. And all three produce models that look great in development and fail in production.

A Leakage Detection Checklist

Before trusting any model's evaluation, run through this checklist:

1. Is preprocessing inside the pipeline?
   - [ ] Scaling, encoding, and imputation are all pipeline steps
   - [ ] No fit_transform on the full dataset before splitting

2. Would every feature be available at prediction time?
   - [ ] For each feature, you can answer: "I would know this value before I need to make the prediction"
   - [ ] No features are caused by or correlated with the target in a way that wouldn't exist in new data

3. Are related observations separated?
   - [ ] If the data has natural groups (patients, users, companies), all observations from the same group are in the same split
   - [ ] No duplicate or near-duplicate rows exist across train and test

4. Is the performance realistic?
   - [ ] The model's AUC or accuracy is in a plausible range for the problem domain
   - [ ] Performance doesn't drop significantly when moving from cross-validation to production
   - [ ] If performance seems "too good to be true," investigate

5. Is there a gap between development and deployment performance?
   - [ ] If yes, leakage is a prime suspect
   - [ ] Check all preprocessing, feature engineering, and splitting logic
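The duplicate check in item 3 can be sketched with pandas. The helper name here is hypothetical; the idea is an anti-join between the splits using merge's indicator column:

```python
import pandas as pd

def count_cross_split_duplicates(train_df, test_df, cols=None):
    """Count test rows that exactly match some training row (hypothetical helper)."""
    cols = cols if cols is not None else list(train_df.columns)
    merged = test_df[cols].merge(train_df[cols].drop_duplicates(),
                                 on=cols, how='left', indicator=True)
    return int((merged['_merge'] == 'both').sum())

train = pd.DataFrame({'age': [34, 51, 29], 'zip': ['02139', '10001', '60614']})
test = pd.DataFrame({'age': [29, 44], 'zip': ['60614', '94110']})

print(count_cross_split_duplicates(train, test))  # 1: one test row also in train
```

A nonzero count on exact matches is an immediate red flag; near-duplicates (the Story 3 case) additionally require a group column or a similarity check.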

Discussion Questions

  1. In Story 1, the scaler leakage inflated AUC by about 0.034. In what situations would this small difference actually matter? When would it be negligible?

  2. In Story 2, how would you design a feature audit process? What questions would you ask about each feature, and who on the team should be responsible for the audit?

  3. In Story 3, the patient-unaware split produced 97% accuracy and the patient-aware split produced 82%. Both numbers are "correct" for their respective evaluation protocols. Which number should appear in a research paper? Why?

  4. Can you think of a scenario where leakage is not a problem? That is, where using future information during training is actually appropriate? (Hint: think about situations where the same future information would be available in deployment.)

  5. A colleague argues: "Leakage is only a problem if you deploy the model. For exploratory analysis, it doesn't matter." Do you agree? Why or why not?