Case Study 2: HealthBridge --- The Pipeline That Worked in the Notebook but Failed in Production
Background
HealthBridge is a health-tech company that operates a hospital readmission prediction system. Their model predicts the probability that a patient discharged from a hospital will be readmitted within 30 days. The model is used by case managers to prioritize follow-up calls: patients with the highest predicted readmission risk receive a call within 24 hours of discharge.
The data science team built the model over three months. On their holdout set, it achieved an AUC of 0.84 and a precision at the top 10% of 0.72 --- meaning that 72% of the patients the model flagged as highest-risk actually were readmitted. The case management team was enthusiastic. The model was deployed.
Three weeks later, the case management director called an emergency meeting. The model was flagging patients who were obviously low-risk (young, elective surgery, no comorbidities) while missing patients who were obviously high-risk (elderly, multiple chronic conditions, discharged to skilled nursing facilities). The model's prioritization had degraded from clearly useful to little better than random.
The model had not degraded. The model was receiving data it had never seen during training, because the preprocessing was applied in the wrong order.
The Investigation
The data science team's notebook processed features in this order:
- One-hot encode admission_type, discharge_disposition, insurance_type
- Impute missing values in lab_results_abnormal and num_medications using median imputation
- Create interaction features: age * num_diagnoses, length_of_stay * num_procedures
- Standard-scale all numeric features
- Train a gradient boosting model
The engineering team's production pipeline processed features in this order:
- Impute missing values in lab_results_abnormal and num_medications
- Standard-scale all numeric features
- One-hot encode categorical features
- Create interaction features
- Predict using the trained model
The orders were different. Here is why each difference mattered.
Difference 1: Encoding Before vs. After Imputation
In the notebook, one-hot encoding happened first. The discharge_disposition column had 4 categories: home, SNF (skilled nursing facility), rehab, and other. After encoding, these became 4 binary columns.
In production, imputation happened first. The imputer saw the discharge_disposition column as a non-numeric column and skipped it. Then standard scaling happened. The scaler saw the discharge_disposition column and raised an error --- but only sometimes, because the error depended on whether any missing values existed in the batch. Some batches worked. Some did not. The team added a try/except around the scaler to suppress the error. This "fix" meant that some features were scaled and some were not, depending on the batch.
# What the notebook did (correct for the notebook, but fragile):
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X[cat_cols]) # Step 1
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X[num_cols]) # Step 2 (on numeric only)
# What production did (incorrect):
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X) # Step 1 (on ALL columns)
# SimpleImputer with strategy='median' fails on string columns
# unless you pass only numeric columns --- but production passed everything
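The dtype failure is easy to reproduce. Below is a minimal sketch (a synthetic three-row batch, not HealthBridge data) showing that SimpleImputer's median strategy works on a numeric-only slice but rejects a frame that still contains string categoricals:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical mini-batch mixing a numeric and a string column
df = pd.DataFrame({
    'num_medications': [8.0, None, 11.0],
    'discharge_disposition': ['home', 'SNF', 'rehab'],
})

# Passing only numeric columns works: the NaN becomes the median of 8 and 11
imputed = SimpleImputer(strategy='median').fit_transform(df[['num_medications']])

# Passing everything fails, because the median of strings is undefined
try:
    SimpleImputer(strategy='median').fit_transform(df)
except (ValueError, TypeError) as exc:
    print(f"Imputer rejected mixed dtypes: {type(exc).__name__}")
```

This is the error that production's try/except silenced instead of fixing.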
Difference 2: Scaling Before vs. After Interaction Features
In the notebook, interaction features were created before scaling. The interaction age * num_diagnoses was computed on the raw values (e.g., 72 * 8 = 576). Then all features, including the interaction, were scaled.
In production, scaling happened before interaction features. The interaction was computed on the scaled values (e.g., 1.34 * 0.87 = 1.17). The resulting interaction feature had a completely different distribution than what the model was trained on. The model's learned split points for the interaction feature were meaningless.
# Notebook order (interaction on raw values, then scale):
X['age_x_diagnoses'] = X['age'] * X['num_diagnoses']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# age_x_diagnoses has range ~ [0, 1200], scaled to mean=0, std=1
# Production order (scale first, then interaction on scaled values):
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled['age_x_diagnoses'] = X_scaled['age'] * X_scaled['num_diagnoses']
# age_x_diagnoses has range ~ [-4, 4], NOT scaled, wrong distribution
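The mismatch can be made concrete. The following sketch uses assumed population statistics (age roughly N(65, 15), diagnoses roughly Poisson(5)) rather than HealthBridge's real data, and compares the interaction value the two orderings produce for the 72-year-old patient with 8 diagnoses from the prose:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
age = rng.normal(65, 15, 5000)
diagnoses = rng.poisson(5, 5000).astype(float)

# Notebook order: compute the interaction on raw values, then scale that column
interaction_train = (age * diagnoses).reshape(-1, 1)
scaler_int = StandardScaler().fit(interaction_train)
notebook_value = scaler_int.transform([[72.0 * 8.0]])[0, 0]

# Production order: scale each column first, then multiply the scaled values
scaler_cols = StandardScaler().fit(np.column_stack([age, diagnoses]))
age_s, diag_s = scaler_cols.transform(np.array([[72.0, 8.0]]))[0]
production_value = age_s * diag_s

print(f"feature the model was trained on: {notebook_value:+.2f}")
print(f"feature production computed:      {production_value:+.2f}")
```

The same patient gets two substantially different feature values, so every split point the model learned on the notebook-order feature is applied to the wrong number in production.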
Difference 3: Fit/Transform Leakage in Production
The production pipeline called fit_transform on every batch of incoming data. Each batch contained 50-200 patients discharged that day. The imputer learned medians from each batch. The scaler learned means and standard deviations from each batch.
With batches of 50 patients, the sample statistics were volatile. A batch with several elderly heart failure patients would produce different scaling parameters than a batch dominated by young elective-surgery patients. The model was effectively receiving differently-scaled data every day.
# Production code (simplified):
def predict_batch(df_batch, model):
    # WRONG: fit_transform on each batch
    imputer = SimpleImputer(strategy='median')
    X_imputed = imputer.fit_transform(df_batch[num_cols])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_imputed)
    # ... encode, create interactions ...
    return model.predict_proba(X_final)[:, 1]
Every single call to predict_batch created new fitted objects. The imputer medians were different every time. The scaler means were different every time. The model had been trained on data scaled with medians and means from 50,000 historical patients. It was receiving data scaled with medians and means from 50 patients.
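The batch-statistics problem is easy to see in isolation. This is a small sketch with invented batch compositions: the same 70-year-old patient receives a different scaled age depending on who else was discharged that day.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Hypothetical historical ages (training population) and two daily batches
training_ages = rng.normal(65, 15, 50_000).reshape(-1, 1)
elderly_batch = rng.normal(78, 8, 50).reshape(-1, 1)   # e.g., a cardiac-ward day
young_batch = rng.normal(42, 10, 50).reshape(-1, 1)    # e.g., an elective-surgery day

train_scaler = StandardScaler().fit(training_ages)

# The same 70-year-old patient gets three different feature values:
patient = [[70.0]]
correct = train_scaler.transform(patient)[0, 0]
wrong_a = StandardScaler().fit(elderly_batch).transform(patient)[0, 0]
wrong_b = StandardScaler().fit(young_batch).transform(patient)[0, 0]

print(f"scaled with training stats: {correct:+.2f}")
print(f"scaled with elderly batch:  {wrong_a:+.2f}")
print(f"scaled with young batch:    {wrong_b:+.2f}")
```

Scaled against the training population, a 70-year-old is slightly above average; scaled against an elderly batch the same patient looks young, and against a young batch looks extremely old. The model cannot rank patients consistently when its inputs depend on batch composition.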
The Fix
The fix was a Pipeline. One object. One fit. One file.
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
class InteractionFeatureCreator(BaseEstimator, TransformerMixin):
    """Create clinically meaningful interaction features."""

    def __init__(self, interactions=None):
        self.interactions = interactions or [
            ('age', 'num_diagnoses'),
            ('length_of_stay', 'num_procedures')
        ]

    def fit(self, X, y=None):
        self.feature_names_in_ = list(X.columns) if hasattr(X, 'columns') else None
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        if self.feature_names_in_:
            X.columns = self.feature_names_in_
        for col_a, col_b in self.interactions:
            X[f'{col_a}_x_{col_b}'] = X[col_a] * X[col_b]
        return X

    def get_feature_names_out(self, input_features=None):
        base = list(self.feature_names_in_) if self.feature_names_in_ else []
        interaction_names = [f'{a}_x_{b}' for a, b in self.interactions]
        return base + interaction_names
# Column definitions
num_cols = ['age', 'length_of_stay', 'num_procedures', 'num_diagnoses',
'num_medications', 'lab_results_abnormal']
cat_cols = ['admission_type', 'discharge_disposition', 'insurance_type']
interaction_pairs = [('age', 'num_diagnoses'), ('length_of_stay', 'num_procedures')]
# Define the pipeline: order is explicit, enforced, and documented
healthbridge_pipeline = Pipeline([
    # Step 1: Create interaction features FIRST (on raw values)
    ('interactions', InteractionFeatureCreator(interactions=interaction_pairs)),
    # Step 2: Route features to appropriate transformers
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), num_cols + ['age_x_num_diagnoses', 'length_of_stay_x_num_procedures']),
        ('cat', Pipeline([
            ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
        ]), cat_cols)
    ], remainder='drop')),
    # Step 3: Model
    ('model', GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        random_state=42
    ))
])
The pipeline encodes the correct order:
- Create interactions on raw values
- Impute missing values (fitted on training data only)
- Scale numeric features (fitted on training data only)
- One-hot encode categorical features (fitted on training data only)
- Predict
There is no ambiguity. There is no opportunity for a production engineer to reorder the steps. The fit_transform / transform contract is enforced automatically.
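The contract is easy to verify directly. Here is a minimal sketch using a toy one-column pipeline (illustrative, not the full HealthBridge preprocessor): once fitted, transform applies the parameters learned at fit time, so a patient's feature value does not depend on which batch they happen to arrive in.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy preprocessing-only pipeline fitted once on "historical" data
prep = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
rng = np.random.default_rng(0)
prep.fit(pd.DataFrame({'age': rng.normal(65, 15, 1000)}))

# The same patient, arriving in two very different daily batches
patient = pd.DataFrame({'age': [70.0]})
batch_a = pd.concat([patient, pd.DataFrame({'age': rng.normal(80, 5, 49)})])
batch_b = pd.concat([patient, pd.DataFrame({'age': rng.normal(40, 5, 49)})])

# transform() never refits, so the patient's value is identical in both batches
val_a = prep.transform(batch_a)[0, 0]
val_b = prep.transform(batch_b)[0, 0]
assert np.isclose(val_a, val_b)
print(f"patient feature in both batches: {val_a:.3f}")
```

Had production called fit_transform here instead, the two values would differ, reproducing the Difference 3 bug in miniature.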
Training and Serialization
from sklearn.model_selection import cross_val_score
import joblib
# Simulate hospital readmission data
np.random.seed(42)
n = 30000
df = pd.DataFrame({
    'age': np.random.normal(65, 15, n).clip(18, 100).astype(int),
    'length_of_stay': np.random.exponential(4.5, n).clip(1, 30).astype(int),
    'num_procedures': np.random.poisson(2.5, n),
    'num_diagnoses': np.random.poisson(5, n).clip(1, 20),
    'num_medications': np.where(
        np.random.random(n) < 0.08, np.nan,
        np.random.poisson(8, n).astype(float)
    ),
    'lab_results_abnormal': np.where(
        np.random.random(n) < 0.12, np.nan,
        np.random.beta(2, 5, n).round(2)
    ),
    'admission_type': np.random.choice(
        ['emergency', 'urgent', 'elective'], n, p=[0.5, 0.3, 0.2]
    ),
    'discharge_disposition': np.random.choice(
        ['home', 'SNF', 'rehab', 'other'], n, p=[0.6, 0.2, 0.1, 0.1]
    ),
    'insurance_type': np.random.choice(
        ['Medicare', 'Medicaid', 'private', 'self_pay'], n, p=[0.45, 0.2, 0.3, 0.05]
    ),
})
# Synthetic readmission target
df['readmitted_30d'] = (
    (df['age'] > 70).astype(int) * 0.15
    + (df['num_diagnoses'] > 6).astype(int) * 0.2
    + (df['length_of_stay'] > 7).astype(int) * 0.1
    + (df['discharge_disposition'] == 'SNF').astype(int) * 0.15
    + (df['num_medications'].fillna(12) > 10).astype(int) * 0.1
    + np.random.random(n) * 0.3
) > 0.5
df['readmitted_30d'] = df['readmitted_30d'].astype(int)
X = df.drop('readmitted_30d', axis=1)
y = df['readmitted_30d']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Cross-validate
cv_scores = cross_val_score(
    healthbridge_pipeline, X_train, y_train,
    cv=5, scoring='roc_auc'
)
print(f"CV AUC: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
# Fit and save
healthbridge_pipeline.fit(X_train, y_train)
from sklearn.metrics import roc_auc_score
y_proba = healthbridge_pipeline.predict_proba(X_test)[:, 1]
print(f"Holdout AUC: {roc_auc_score(y_test, y_proba):.4f}")
joblib.dump(healthbridge_pipeline, 'healthbridge_pipeline_v2.joblib')
print("Pipeline saved.")
CV AUC: 0.7826 +/- 0.0052
Holdout AUC: 0.7891
Pipeline saved.
Production Code After the Fix
import joblib
# Load once at application startup
pipeline = joblib.load('healthbridge_pipeline_v2.joblib')
def predict_batch(df_batch):
    """Predict 30-day readmission probability for a batch of patients.

    Args:
        df_batch: DataFrame with raw patient features from the EHR.

    Returns:
        numpy array of readmission probabilities.
    """
    return pipeline.predict_proba(df_batch)[:, 1]
One load at startup. One call per batch. No fit. No ambiguity about ordering. The try/except around the scaler is gone because the scaler never sees categorical data. The batch-level fit/transform leakage is gone because the pipeline calls transform, not fit_transform, during prediction.
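One gap the fixed pipeline does not close on its own is schema drift: if the EHR export renames or drops a column, the pipeline raises a low-level error deep inside the ColumnTransformer. A small up-front check fails earlier and more legibly. This is a sketch only --- the EXPECTED_COLUMNS set and validate_batch helper are illustrative additions, not part of the HealthBridge codebase:

```python
import pandas as pd

# Hypothetical expected input schema; in practice this would be versioned
# alongside the serialized pipeline artifact.
EXPECTED_COLUMNS = {
    'age', 'length_of_stay', 'num_procedures', 'num_diagnoses',
    'num_medications', 'lab_results_abnormal',
    'admission_type', 'discharge_disposition', 'insurance_type',
}

def validate_batch(df_batch: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly on schema drift instead of silently mis-predicting."""
    missing = EXPECTED_COLUMNS - set(df_batch.columns)
    extra = set(df_batch.columns) - EXPECTED_COLUMNS
    if missing:
        raise ValueError(f"Batch is missing expected columns: {sorted(missing)}")
    if extra:
        # Unknown columns are dropped rather than passed through
        df_batch = df_batch.drop(columns=sorted(extra))
    return df_batch
```

Calling validate_batch before predict_batch turns a confusing mid-request stack trace into an explicit schema error, which is the opposite of the try/except anti-pattern described above.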
The Cost of Getting It Wrong
The three-week period of incorrect predictions had measurable consequences:
| Metric | Before Deployment | During Bug | After Fix |
|---|---|---|---|
| Precision @ top 10% | 0.72 | 0.31 | 0.71 |
| Patients correctly prioritized | ~85/day | ~37/day | ~84/day |
| Readmissions with no follow-up call | ~12/day | ~31/day | ~13/day |
| Estimated excess readmissions | --- | ~380 total | --- |
Each excess readmission costs an average of $15,200. The estimated financial impact of the preprocessing ordering bug: $5.8 million in excess readmission costs over three weeks. The fix was a Pipeline that took one afternoon to build.
Lessons Learned
- Notebook order is not pipeline order. The sequence of cells in a notebook reflects the data scientist's thought process, not the correct computational order. Two competent engineers can read the same notebook and implement the preprocessing in different orders. A Pipeline eliminates this ambiguity.
- Fit/transform leakage is invisible in notebooks. In a notebook, you fit the imputer once and transform test data once. In production, a batch-level fit_transform call is the natural pattern if you do not have a pre-fitted pipeline. The leakage is not a bug in the engineer's code --- it is a gap in the handoff between data science and engineering. A serialized pipeline bridges that gap.
- Interaction features must be created before scaling. The mathematical relationship between raw-value interactions and scaled-value interactions is non-trivial: StandardScaler(a * b) is not the same as StandardScaler(a) * StandardScaler(b). The pipeline enforces the correct order by construction.
- The try/except anti-pattern. When the production scaler raised errors on categorical columns, the team suppressed the error instead of fixing the root cause. This is a general anti-pattern: silencing errors in data pipelines hides the symptoms of ordering bugs, type mismatches, and schema changes. Let the pipeline crash loudly. Fix the cause, not the symptom.
- Production bugs in ML systems rarely produce error messages. The model did not crash. It produced predictions. They were just wrong predictions. In traditional software, a bug that inverts outputs is obvious. In ML, a bug that distorts inputs produces outputs that look plausible but are meaningless. Monitoring (Chapter 32) catches these bugs eventually. Pipelines prevent them.
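As a sketch of the kind of monitoring that would have caught this, here is a population stability index (PSI) check on the prediction distribution. The thresholds quoted in the docstring are a common industry rule of thumb rather than something from this case study, and the beta-distributed scores are invented for illustration:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference score distribution and a live one.

    Common rule of thumb (an industry convention, not from this case study):
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live scores
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(1)
validation_scores = rng.beta(2, 5, 10_000)   # score distribution at sign-off
healthy_day = rng.beta(2, 5, 200)            # a day with a similar distribution
broken_day = rng.beta(5, 2, 200)             # a day after a preprocessing bug

print(f"healthy PSI: {population_stability_index(validation_scores, healthy_day):.3f}")
print(f"broken PSI:  {population_stability_index(validation_scores, broken_day):.3f}")
```

A daily PSI computed on the model's output scores requires no ground-truth labels, so it could have flagged the distorted inputs within a day rather than three weeks.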
Discussion Questions
- The HealthBridge team reported an AUC of 0.84 during development but 0.7891 after rebuilding with a proper pipeline. Where did the 0.05 AUC gap come from? Is the original 0.84 or the new 0.79 more trustworthy?
- The production code used fit_transform on each batch because the engineering team did not have a pre-fitted pipeline. Design a deployment architecture where the data science team delivers a fitted pipeline to the engineering team. What artifacts need to be delivered? What contract (input schema, output format) should be documented?
- The InteractionFeatureCreator hardcodes the interaction pairs. How would you make this configurable? How would you select which interactions to include? Is there a risk of creating too many interaction features?
- The case management team used the model for three weeks before the bug was discovered. What monitoring metrics (Chapter 32) would have caught this bug sooner? Design a monitoring dashboard that would flag this failure mode within 24 hours.
This case study supports Chapter 10: Building Reproducible Data Pipelines. Return to the chapter for the full discussion.