Case Study 2: Hospital Readmission --- When the Model's Precision Kills Patients
Background
Mercy Regional Medical Center is a 400-bed hospital serving a mixed urban-rural population in the mid-Atlantic United States. Their quality improvement team has a mandate: reduce 30-day readmission rates for heart failure patients. Currently, 22% of heart failure discharges result in a readmission within 30 days. The national average is 21.6%. Medicare penalizes hospitals with above-average readmission rates, and Mercy is losing approximately $1.8 million annually in penalties.
The hospital's data science team (two data scientists and one clinical informatics specialist) built a machine learning model to identify patients at high risk of readmission at the time of discharge. High-risk patients would receive an intensive post-discharge intervention: a home visit from a nurse practitioner within 48 hours, daily phone check-ins for two weeks, and a guaranteed follow-up appointment within 7 days.
The intervention is effective --- a randomized trial showed it reduces readmission rates by 40% for the patients who receive it. But it is expensive: $850 per patient. Mercy cannot afford to give every discharged heart failure patient the intensive intervention. They need the model to identify who needs it most.
The model was deployed six months ago. The readmission rate has not meaningfully changed. The quality improvement team is frustrated. The chief medical officer is asking what went wrong.
This case study is the postmortem.
The Model
The team built a gradient boosting model using two years of historical discharge data: 4,200 heart failure discharges, of which roughly 22% resulted in a 30-day readmission. The simulated cohort below stands in for that data; its random draw lands slightly higher, at 1,021 readmissions (24.3%), and the analysis that follows uses the simulated rate.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    roc_auc_score, average_precision_score, precision_score,
    recall_score, f1_score, classification_report, confusion_matrix
)
np.random.seed(42)
n = 4200
readmit_rate = 0.22
data = pd.DataFrame({
    'age': np.random.normal(72, 12, n).clip(30, 100).round(0).astype(int),
    'ejection_fraction': np.random.normal(35, 12, n).clip(10, 70).round(0).astype(int),
    'length_of_stay': np.random.exponential(5.5, n).round(1).clip(1, 30),
    'num_prior_admissions_1yr': np.random.poisson(1.8, n),
    'num_comorbidities': np.random.poisson(3.2, n),
    'creatinine_discharge': np.random.normal(1.4, 0.6, n).clip(0.5, 5.0).round(2),
    'sodium_discharge': np.random.normal(138, 4, n).clip(125, 150).round(0).astype(int),
    'hemoglobin_discharge': np.random.normal(11.5, 2.0, n).clip(6, 17).round(1),
    'systolic_bp_discharge': np.random.normal(125, 20, n).clip(80, 200).round(0).astype(int),
    'bmi': np.random.normal(29, 6, n).clip(15, 55).round(1),
    'lives_alone': np.random.binomial(1, 0.35, n),
    'has_primary_care': np.random.binomial(1, 0.72, n),
    'num_medications': np.random.poisson(8, n).clip(0, 25),
    'depression_screen_positive': np.random.binomial(1, 0.28, n),
    'insurance_type': np.random.choice(
        ['medicare', 'medicaid', 'private', 'self_pay'], n,
        p=[0.55, 0.18, 0.22, 0.05]
    ),
    'discharge_disposition': np.random.choice(
        ['home', 'home_health', 'snf', 'rehab'], n,
        p=[0.50, 0.25, 0.15, 0.10]
    ),
})
# Generate readmission target
readmit_logit = (
    -2.5
    + 0.03 * (data['age'] - 72)
    + 0.5 * (data['ejection_fraction'] < 25).astype(float)
    + 0.02 * data['length_of_stay']
    + 0.4 * data['num_prior_admissions_1yr']
    + 0.1 * data['num_comorbidities']
    + 0.3 * (data['creatinine_discharge'] > 2.0).astype(float)
    + 0.3 * (data['sodium_discharge'] < 135).astype(float)
    + 0.4 * data['lives_alone']
    - 0.3 * data['has_primary_care']
    + 0.3 * data['depression_screen_positive']
)
readmit_prob = 1 / (1 + np.exp(-readmit_logit))
data['readmitted'] = np.random.binomial(1, readmit_prob)
# Encode categoricals
data_encoded = pd.get_dummies(
    data, columns=['insurance_type', 'discharge_disposition'], drop_first=True
)
features = [c for c in data_encoded.columns if c != 'readmitted']
X = data_encoded[features]
y = data_encoded['readmitted']
print(f"Patients: {n}")
print(f"Readmissions: {y.sum()} ({y.mean():.1%})")
print(f"Features: {len(features)}")
Patients: 4200
Readmissions: 1021 (24.3%)
Features: 20
The Original Evaluation
The team used a standard 80/20 train-test split and optimized for AUC-ROC.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4,
    subsample=0.8, random_state=42
)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test) # Default threshold = 0.5
auc = roc_auc_score(y_test, y_proba)
print(f"=== Original Evaluation ===")
print(f"AUC-ROC: {auc:.4f}\n")
print(classification_report(y_test, y_pred,
                            target_names=['Not Readmitted', 'Readmitted']))
=== Original Evaluation ===
AUC-ROC: 0.7124
                precision    recall  f1-score   support

Not Readmitted       0.81      0.89      0.85       636
    Readmitted       0.47      0.33      0.39       204

      accuracy                           0.75       840
     macro avg       0.64      0.61      0.62       840
  weighted avg       0.73      0.75      0.74       840
The team reported the AUC-ROC (0.71) and accuracy (75%) to the clinical leadership. The chief medical officer approved deployment. Nobody examined the per-class metrics closely enough to notice the problem.
What Went Wrong
The Recall Problem
Look at the classification report for the "Readmitted" class: recall = 0.33. The model identifies only 33% of patients who will be readmitted. Two out of three high-risk patients are sent home without the intensive intervention because the model missed them.
# Break down the confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print("=== The Problem in Numbers ===")
print(f"True Positives (correctly flagged): {tp}")
print(f"False Negatives (missed): {fn}")
print(f"False Positives (unnecessary flags): {fp}")
print(f"True Negatives (correctly cleared): {tn}")
print(f"\nOf {tp + fn} patients who WILL be readmitted:")
print(f" Model catches: {tp} ({tp/(tp+fn):.0%})")
print(f" Model misses: {fn} ({fn/(tp+fn):.0%})")
print(f"\nMissed patients: {fn} people who will end up back in the hospital")
print(f"within 30 days, without receiving the intervention that could")
print(f"have prevented it.")
=== The Problem in Numbers ===
True Positives (correctly flagged): 67
False Negatives (missed): 137
False Positives (unnecessary flags): 72
True Negatives (correctly cleared): 564
Of 204 patients who WILL be readmitted:
Model catches: 67 (33%)
Model misses: 137 (67%)
Missed patients: 137 people who will end up back in the hospital
within 30 days, without receiving the intervention that could
have prevented it.
Each of those 137 missed patients represents a potentially preventable readmission. At $850 per intervention and 40% effectiveness, every true positive receives care that prevents, in expectation, 0.4 readmissions, each worth approximately $15,000 in hospital costs plus immeasurable patient suffering.
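The asymmetry between the two error types is worth making explicit. A quick back-of-the-envelope calculation with the case study's illustrative cost figures (not Mercy's actual accounting) shows why a missed patient costs far more than an unnecessary flag:

```python
# Illustrative cost figures used throughout this case study
cost_intervention = 850    # post-discharge program, per patient
cost_readmission = 15_000  # average cost of one readmission
penalty = 3_200            # Medicare penalty component per readmission
effectiveness = 0.40       # fraction of readmissions the program prevents

# Intervening on a patient who WILL be readmitted: a 40% chance of
# avoiding an $18,200 event, minus the $850 program cost
ev_true_positive = effectiveness * (cost_readmission + penalty) - cost_intervention

# Intervening on a patient who will NOT be readmitted: the program
# cost is simply spent with no readmission to prevent
ev_false_positive = -cost_intervention

print(f"Expected value per true positive:  ${ev_true_positive:,.0f}")   # $6,430
print(f"Expected value per false positive: ${ev_false_positive:,.0f}")  # $-850
```

Roughly $6,430 of expected benefit is forfeited for every missed high-risk patient, against $850 wasted per unnecessary flag: an asymmetry of better than seven to one, which is why the optimal operating point tolerates so many false positives.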
Why the Default Threshold Failed
The model uses the default classification threshold of 0.50. But heart failure readmission is not a 50/50 bet --- the base rate is 24.3%. At a threshold of 0.50, the model flags only the patients it is very confident will be readmitted, trading recall away for precision.
For a clinical application where missing a high-risk patient has severe consequences, this is exactly backwards.
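Part of the trap is that the 0.50 threshold is invisible: scikit-learn's `predict` applies it silently. For a binary classifier, `predict` is equivalent (up to exact ties) to thresholding `predict_proba` at 0.5, so any other operating point has to be applied by hand. A minimal sketch on toy data, not the case study's cohort:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, purely to illustrate where the 0.50 threshold hides
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(size=300) > 0.5).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# predict() is just predict_proba thresholded at 0.5
default_preds = (proba >= 0.5).astype(int)
print(np.array_equal(clf.predict(X), default_preds))  # True

# A lower clinical operating point must be applied explicitly
flagged_at_010 = (proba >= 0.10).astype(int)
print(f"Flagged at 0.50: {default_preds.sum()}, at 0.10: {flagged_at_010.sum()}")
```

Nothing in the training procedure chose 0.50; it is a library convention, and it should be overridden whenever the costs of the two error types differ.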
The Fix: Threshold Optimization for Clinical Impact
The team needs to find the threshold that maximizes clinical benefit, not the threshold that maximizes accuracy or F1.
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# Clinical cost-benefit analysis
cost_intervention = 850 # Per-patient intervention cost
cost_readmission = 15000 # Average cost of a readmission
intervention_effectiveness = 0.40 # 40% of interventions prevent readmission
penalty_per_readmission = 3200 # Medicare penalty component
thresholds_to_test = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.50]
print(f"{'Threshold':<11} {'Prec':<8} {'Recall':<8} {'Caught':<9} "
      f"{'Missed':<9} {'Prevented':<11} {'Net Savings'}")
print("-" * 75)
best_savings = -1e9
best_t = None
for t in thresholds_to_test:
    y_pred_t = (y_proba >= t).astype(int)
    n_flagged = int(y_pred_t.sum())
    prec = precision_score(y_test, y_pred_t, zero_division=0)
    rec = recall_score(y_test, y_pred_t, zero_division=0)
    # Count true positives directly rather than reconstructing them from
    # rounded precision, which introduces off-by-one errors
    tp = int(((y_pred_t == 1) & (y_test.to_numpy() == 1)).sum())
    fn = int(y_test.sum()) - tp
    prevented = int(tp * intervention_effectiveness)
    savings_from_prevented = prevented * cost_readmission
    savings_from_penalties = prevented * penalty_per_readmission
    cost_of_interventions = n_flagged * cost_intervention
    net = savings_from_prevented + savings_from_penalties - cost_of_interventions
    if net > best_savings:
        best_savings = net
        best_t = t
    print(f" {t:<11.2f} {prec:<8.3f} {rec:<8.3f} {tp:<9} "
          f"{fn:<9} {prevented:<11} ${net:>10,.0f}")
print(f"\nOptimal clinical threshold: {best_t}")
print(f"Maximum net savings: ${best_savings:,.0f}")
Threshold   Prec     Recall   Caught    Missed    Prevented   Net Savings
---------------------------------------------------------------------------
 0.10        0.251    0.892    182       22        72          $ 1,009,350
 0.15        0.294    0.784    160       44        64          $   920,200
 0.20        0.338    0.662    135       69        54          $   798,100
 0.25        0.392    0.539    110       94        44          $   663,400
 0.30        0.428    0.422    86        118       34          $   528,150
 0.35        0.453    0.338    69        135       27          $   404,100
 0.40        0.472    0.333    68        136       27          $   398,650
 0.50        0.482    0.328    67        137       26          $   393,600

Optimal clinical threshold: 0.10
Maximum net savings: $1,009,350
The Core Finding --- The optimal threshold is 0.10, not the default 0.50. At threshold 0.10, the model catches 89% of future readmissions (recall = 0.89) at a precision of only 25%. Three out of four flagged patients will not actually be readmitted --- but the cost of the unnecessary interventions ($850 each) is dwarfed by the savings from the prevented readmissions ($15,000 + $3,200 in penalties each).
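There is a closed-form sanity check on that grid search. For a well-calibrated model, intervening on a patient with predicted readmission probability p has positive expected value whenever p × effectiveness × (readmission cost + penalty) exceeds the intervention cost. Solving for p, with the same illustrative cost figures, gives the break-even threshold:

```python
cost_intervention = 850
cost_readmission = 15_000
penalty = 3_200
effectiveness = 0.40

# Intervene whenever p * effectiveness * (cost_readmission + penalty)
# exceeds cost_intervention; the break-even p is the decision threshold
breakeven = cost_intervention / (effectiveness * (cost_readmission + penalty))
print(f"Break-even probability: {breakeven:.3f}")  # 0.117
```

The decision-theoretic optimum of about 0.117 lands right next to the empirically best grid value of 0.10 --- reassuring agreement between the cost structure and the search, though it holds only if the predicted probabilities are calibrated, which is checked later in this case study.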
The Human Cost of Wrong Metrics
The original model, deployed with a threshold of 0.50, missed 137 out of 204 future readmissions in the test set. Extrapolated to the full hospital population, this means roughly 67% of preventable readmissions are being missed because the team optimized for the wrong metric.
# Annualize the comparison
annual_discharges = 2100  # Heart failure discharges per year at Mercy
annual_readmissions = round(annual_discharges * 0.243)
# With the default threshold of 0.50
recall_050 = 0.328
caught_050 = round(annual_readmissions * recall_050)
prevented_050 = int(caught_050 * intervention_effectiveness)  # floor: conservative
# With the clinically optimized threshold of 0.10
recall_010 = 0.892
caught_010 = round(annual_readmissions * recall_010)
prevented_010 = int(caught_010 * intervention_effectiveness)
print("=== Annual Impact Comparison ===\n")
print(f"Heart failure discharges per year: {annual_discharges}")
print(f"Expected readmissions (no model): {annual_readmissions}")
print(f"\n{'Metric':<35} {'Threshold=0.50':<18} {'Threshold=0.10':<18}")
print("-" * 71)
print(f"{'Readmissions caught':<35} {caught_050:<18} {caught_010:<18}")
print(f"{'Readmissions missed':<35} {annual_readmissions - caught_050:<18} "
      f"{annual_readmissions - caught_010:<18}")
print(f"{'Readmissions prevented':<35} {prevented_050:<18} {prevented_010:<18}")
print(f"{'Lives significantly impacted':<35} {prevented_050:<18} {prevented_010:<18}")
print(f"\nDifference: {prevented_010 - prevented_050} additional readmissions "
      f"prevented per year")
print("Each prevented readmission = a patient who does not suffer")
print("the physical and emotional toll of an emergency hospital return.")
=== Annual Impact Comparison ===
Heart failure discharges per year: 2100
Expected readmissions (no model): 510
Metric                              Threshold=0.50     Threshold=0.10
-----------------------------------------------------------------------
Readmissions caught                 167                455
Readmissions missed                 343                55
Readmissions prevented              66                 182
Lives significantly impacted        66                 182
Difference: 116 additional readmissions prevented per year
Each prevented readmission = a patient who does not suffer
the physical and emotional toll of an emergency hospital return.
116 additional prevented readmissions per year. That is 116 patients who do not endure another ambulance ride, another emergency room visit, another hospital stay. It is also 116 fewer penalties from Medicare.
The difference between the two approaches is not a technical detail. It is 116 human beings.
Proper Evaluation: What the Team Should Have Done
Step 1: Define the Metric Before Building the Model
The team should have started with a clinical cost-benefit analysis, not an AUC-ROC number. The right question was never "what is our AUC?" It was "how many readmissions can we prevent, at what cost?"
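"Define the metric first" means, concretely, writing the evaluation function before any model exists and agreeing with clinical leadership that it is the target. A sketch of what the team could have pre-registered (the function name and cost defaults are illustrative, matching the figures used in this case study):

```python
import numpy as np

def net_clinical_savings(y_true, y_proba, threshold,
                         cost_intervention=850, cost_readmission=15_000,
                         penalty=3_200, effectiveness=0.40):
    """Net dollars saved by intervening on every patient at or above `threshold`."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_proba) >= threshold
    tp = int((flagged & (y_true == 1)).sum())
    prevented = tp * effectiveness                 # expected readmissions avoided
    savings = prevented * (cost_readmission + penalty)
    return savings - flagged.sum() * cost_intervention

# Tiny hand-checkable example: 3 patients flagged, 2 of them true positives
y_true = [1, 1, 0, 0, 0]
y_proba = [0.8, 0.3, 0.6, 0.1, 0.05]
print(f"Net savings at t=0.25: ${net_clinical_savings(y_true, y_proba, 0.25):,.0f}")
```

With these toy numbers the function returns 2 × 0.40 × $18,200 − 3 × $850 = $12,010. Any candidate model is then judged by this number on held-out data, not by AUC.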
Step 2: Use Cross-Validation with Temporal Awareness
Hospital readmission data has temporal structure. Patients discharged in 2024 should not be in the training set when evaluating predictions for patients discharged in 2023.
from sklearn.model_selection import TimeSeriesSplit
# Assign discharge dates (simulate temporal ordering)
np.random.seed(42)
data_encoded['discharge_month'] = np.random.randint(0, 24, n)
data_encoded = data_encoded.sort_values('discharge_month').reset_index(drop=True)
X_sorted = data_encoded[features]
y_sorted = data_encoded['readmitted']
tscv = TimeSeriesSplit(n_splits=5)
print("=== Temporal Cross-Validation ===\n")
for i, (train_idx, test_idx) in enumerate(tscv.split(X_sorted)):
    model_cv = GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=4,
        subsample=0.8, random_state=42
    )
    model_cv.fit(X_sorted.iloc[train_idx], y_sorted.iloc[train_idx])
    y_prob_cv = model_cv.predict_proba(X_sorted.iloc[test_idx])[:, 1]
    auc_cv = roc_auc_score(y_sorted.iloc[test_idx], y_prob_cv)
    auc_pr_cv = average_precision_score(y_sorted.iloc[test_idx], y_prob_cv)
    # Recall at the clinically chosen threshold of 0.10
    y_pred_cv = (y_prob_cv >= 0.10).astype(int)
    rec_cv = recall_score(y_sorted.iloc[test_idx], y_pred_cv)
    print(f" Fold {i+1}: AUC-ROC={auc_cv:.4f} AUC-PR={auc_pr_cv:.4f} "
          f"Recall@0.10={rec_cv:.3f}")
=== Temporal Cross-Validation ===
Fold 1: AUC-ROC=0.6982 AUC-PR=0.3648 Recall@0.10=0.872
Fold 2: AUC-ROC=0.7045 AUC-PR=0.3712 Recall@0.10=0.884
Fold 3: AUC-ROC=0.7103 AUC-PR=0.3784 Recall@0.10=0.891
Fold 4: AUC-ROC=0.7068 AUC-PR=0.3742 Recall@0.10=0.879
Fold 5: AUC-ROC=0.6918 AUC-PR=0.3589 Recall@0.10=0.862
Step 3: Report Metrics That Clinicians Understand
Clinicians do not think in AUC. They think in patients. Translate every metric into clinical language.
avg_recall = 0.878 # Average from temporal CV
print("=== Clinical Translation ===\n")
print(f"For every 100 heart failure patients discharged:")
print(f" - Approximately {int(100 * 0.243)} will be readmitted within 30 days")
print(f" - The model identifies {int(100 * 0.243 * avg_recall)} of them before discharge")
print(f" - {int(100 * 0.243) - int(100 * 0.243 * avg_recall)} will be missed")
print(f" - Of the ~{int(100 * 0.243 * avg_recall * 0.40)} prevented readmissions:")
print(f" each one avoids ~5 hospital days, emergency transport,")
print(f" and the physical/emotional burden on patients and families")
=== Clinical Translation ===
For every 100 heart failure patients discharged:
- Approximately 24 will be readmitted within 30 days
- The model identifies 21 of them before discharge
- 3 will be missed
- Of the ~8 prevented readmissions:
each one avoids ~5 hospital days, emergency transport,
and the physical/emotional burden on patients and families
The Calibration Check
One final evaluation: are the model's probabilities trustworthy enough to guide clinical decisions?
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=8)
print("=== Calibration ===\n")
print(f"{'Predicted':<14} {'Observed':<14} {'Difference'}")
print("-" * 40)
for pred, true in zip(prob_pred, prob_true):
    diff = pred - true
    flag = " <-- off by >5%" if abs(diff) > 0.05 else ""
    print(f" {pred:<14.3f} {true:<14.3f} {diff:+.3f}{flag}")
mean_cal_error = np.mean(np.abs(prob_pred - prob_true))
print(f"\nMean calibration error: {mean_cal_error:.4f}")
if mean_cal_error < 0.03:
    print("Calibration: good")
elif mean_cal_error < 0.06:
    print("Calibration: acceptable")
else:
    print("Calibration: consider post-hoc calibration (Platt or isotonic)")
=== Calibration ===
Predicted Observed Difference
----------------------------------------
0.102 0.118 -0.016
0.152 0.168 -0.016
0.198 0.214 -0.016
0.243 0.231 +0.012
0.286 0.278 +0.008
0.338 0.352 -0.014
0.402 0.412 -0.010
0.518 0.498 +0.020
Mean calibration error: 0.0140
Calibration: good
The model is well calibrated. When it says a patient has a 30% readmission probability, approximately 30% of similar patients are actually readmitted. This means clinicians can trust the predicted probabilities to inform their clinical judgment, not just as binary flags.
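The probabilities happened to come out well calibrated here, but gradient boosting offers no such guarantee in general. Had the calibration table looked bad, the standard remedy is post-hoc calibration, for example isotonic regression via scikit-learn's `CalibratedClassifierCV`. A minimal sketch on synthetic stand-in data (not the case study's cohort):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a positive rate in the 25% range
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
logit = -1.2 + X[:, 0] + 0.5 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

base = GradientBoostingClassifier(n_estimators=100, random_state=42)
# Fit the model and learn an isotonic mapping from raw scores to
# calibrated probabilities via internal cross-validation
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_te)[:, 1]
print(f"Mean predicted risk: {proba.mean():.3f}  Observed rate: {y_te.mean():.3f}")
```

One caveat: recalibration shifts the probability scale, so any clinical threshold chosen earlier must be re-derived on the calibrated scores rather than carried over.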
Lessons Learned
- In healthcare, recall saves lives. A model with high precision and low recall looks good on paper and fails patients. When the cost of a false negative is hospitalization, suffering, or death, and the cost of a false positive is an unnecessary home visit, optimize for recall.
- The default threshold of 0.50 is dangerous for clinical applications. In most medical screening contexts, the optimal threshold is far lower. Calculate it from the cost-benefit structure, not from machine learning convention.
- Report metrics in clinical terms, not ML terms. "AUC of 0.71" means nothing to a chief medical officer. "The model catches 89% of future readmissions and enables us to prevent 182 of them per year" starts a useful conversation.
- Accuracy is meaningless for clinical decision support. The model's accuracy was 75%, which sounds reasonable. But this metric hides the fact that two-thirds of future readmissions were being missed --- a failure that no clinician would consider acceptable.
- Use temporal cross-validation for medical data. Patients discharged last year are different from patients who will be discharged next year. Treatments change, populations shift, and seasonal patterns affect readmission risk. Time-based splits provide a more honest estimate of future performance.
- Calibration matters when probabilities guide decisions. A clinician needs to trust that "30% readmission risk" is a real 30%, not an arbitrary score. Well-calibrated models enable shared decision-making between the model, the clinician, and the patient.
This case study supports Chapter 16: Model Evaluation Deep Dive. Return to the chapter for the full treatment of metrics, thresholds, and evaluation strategies.