Case Study 2: Hospital Readmission --- When the Model's Precision Kills Patients
Background
Mercy Regional Medical Center is a 400-bed hospital serving a mixed urban-rural population in the mid-Atlantic United States. Their quality improvement team has a mandate: reduce 30-day readmission rates for heart failure patients. Currently, 22% of heart failure discharges result in a readmission within 30 days. The national average is 21.6%. Medicare penalizes hospitals with above-average readmission rates, and Mercy is losing approximately $1.8 million annually in penalties.
The hospital's data science team (two data scientists and one clinical informatics specialist) built a machine learning model to identify patients at high risk of readmission at the time of discharge. High-risk patients would receive an intensive post-discharge intervention: a home visit from a nurse practitioner within 48 hours, daily phone check-ins for two weeks, and a guaranteed follow-up appointment within 7 days.
The intervention is effective --- a randomized trial showed it reduces readmission rates by 40% for the patients who receive it. But it is expensive: $850 per patient. Mercy cannot afford to give every discharged heart failure patient the intensive intervention. They need the model to identify who needs it most.
The model was deployed six months ago. The readmission rate has not meaningfully changed. The quality improvement team is frustrated. The chief medical officer is asking what went wrong.
This case study is the postmortem.
The Model
The team built a gradient boosting model using two years of historical discharge data: 4,200 heart failure discharges, of which roughly 22% resulted in a 30-day readmission. The simulated cohort below stands in for that data; its random draw lands slightly higher, at 1,021 readmissions (24.3%), and the analysis that follows uses the simulated rate.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    roc_auc_score, average_precision_score, precision_score,
    recall_score, f1_score, classification_report, confusion_matrix
)
np.random.seed(42)
n = 4200
readmit_rate = 0.22
data = pd.DataFrame({
    'age': np.random.normal(72, 12, n).clip(30, 100).round(0).astype(int),
    'ejection_fraction': np.random.normal(35, 12, n).clip(10, 70).round(0).astype(int),
    'length_of_stay': np.random.exponential(5.5, n).round(1).clip(1, 30),
    'num_prior_admissions_1yr': np.random.poisson(1.8, n),
    'num_comorbidities': np.random.poisson(3.2, n),
    'creatinine_discharge': np.random.normal(1.4, 0.6, n).clip(0.5, 5.0).round(2),
    'sodium_discharge': np.random.normal(138, 4, n).clip(125, 150).round(0).astype(int),
    'hemoglobin_discharge': np.random.normal(11.5, 2.0, n).clip(6, 17).round(1),
    'systolic_bp_discharge': np.random.normal(125, 20, n).clip(80, 200).round(0).astype(int),
    'bmi': np.random.normal(29, 6, n).clip(15, 55).round(1),
    'lives_alone': np.random.binomial(1, 0.35, n),
    'has_primary_care': np.random.binomial(1, 0.72, n),
    'num_medications': np.random.poisson(8, n).clip(0, 25),
    'depression_screen_positive': np.random.binomial(1, 0.28, n),
    'insurance_type': np.random.choice(
        ['medicare', 'medicaid', 'private', 'self_pay'], n,
        p=[0.55, 0.18, 0.22, 0.05]
    ),
    'discharge_disposition': np.random.choice(
        ['home', 'home_health', 'snf', 'rehab'], n,
        p=[0.50, 0.25, 0.15, 0.10]
    ),
})
# Generate readmission target
readmit_logit = (
    -2.5
    + 0.03 * (data['age'] - 72)
    + 0.5 * (data['ejection_fraction'] < 25).astype(float)
    + 0.02 * data['length_of_stay']
    + 0.4 * data['num_prior_admissions_1yr']
    + 0.1 * data['num_comorbidities']
    + 0.3 * (data['creatinine_discharge'] > 2.0).astype(float)
    + 0.3 * (data['sodium_discharge'] < 135).astype(float)
    + 0.4 * data['lives_alone']
    - 0.3 * data['has_primary_care']
    + 0.3 * data['depression_screen_positive']
)
readmit_prob = 1 / (1 + np.exp(-readmit_logit))
data['readmitted'] = np.random.binomial(1, readmit_prob)
# Encode categoricals
data_encoded = pd.get_dummies(
    data, columns=['insurance_type', 'discharge_disposition'], drop_first=True
)
features = [c for c in data_encoded.columns if c != 'readmitted']
X = data_encoded[features]
y = data_encoded['readmitted']
print(f"Patients: {n}")
print(f"Readmissions: {y.sum()} ({y.mean():.1%})")
print(f"Features: {len(features)}")
Patients: 4200
Readmissions: 1021 (24.3%)
Features: 20
The Original Evaluation
The team used a standard 80/20 train-test split and optimized for AUC-ROC.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4,
    subsample=0.8, random_state=42
)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test) # Default threshold = 0.5
auc = roc_auc_score(y_test, y_proba)
print(f"=== Original Evaluation ===")
print(f"AUC-ROC: {auc:.4f}\n")
print(classification_report(y_test, y_pred,
                            target_names=['Not Readmitted', 'Readmitted']))
=== Original Evaluation ===
AUC-ROC: 0.7124
                precision    recall  f1-score   support

Not Readmitted       0.81      0.89      0.85       636
    Readmitted       0.47      0.33      0.39       204

      accuracy                           0.75       840
     macro avg       0.64      0.61      0.62       840
  weighted avg       0.73      0.75      0.74       840
The team reported the AUC-ROC (0.71) and accuracy (75%) to the clinical leadership. The chief medical officer approved deployment. Nobody examined the per-class metrics closely enough to notice the problem.
What Went Wrong
The Recall Problem
Look at the classification report for the "Readmitted" class: recall = 0.33. The model identifies only 33% of patients who will be readmitted. Two out of three high-risk patients are sent home without the intensive intervention because the model missed them.
# Break down the confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print("=== The Problem in Numbers ===")
print(f"True Positives (correctly flagged): {tp}")
print(f"False Negatives (missed): {fn}")
print(f"False Positives (unnecessary flags): {fp}")
print(f"True Negatives (correctly cleared): {tn}")
print(f"\nOf {tp + fn} patients who WILL be readmitted:")
print(f" Model catches: {tp} ({tp/(tp+fn):.0%})")
print(f" Model misses: {fn} ({fn/(tp+fn):.0%})")
print(f"\nMissed patients: {fn} people who will end up back in the hospital")
print(f"within 30 days, without receiving the intervention that could")
print(f"have prevented it.")
=== The Problem in Numbers ===
True Positives (correctly flagged): 67
False Negatives (missed): 137
False Positives (unnecessary flags): 72
True Negatives (correctly cleared): 564
Of 204 patients who WILL be readmitted:
Model catches: 67 (33%)
Model misses: 137 (67%)
Missed patients: 137 people who will end up back in the hospital
within 30 days, without receiving the intervention that could
have prevented it.
Each of those 137 missed patients represents a potentially preventable readmission. At $850 per intervention and 40% effectiveness, every true positive receives care that prevents, in expectation, 0.4 readmissions, each worth approximately $15,000 in hospital costs plus immeasurable patient suffering.
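The asymmetry between the two error types is worth making explicit. A quick back-of-the-envelope calculation with the case study's illustrative cost figures (not Mercy's actual accounting) shows why a missed patient costs far more than an unnecessary flag:

```python
# Illustrative cost figures used throughout this case study
cost_intervention = 850    # post-discharge program, per patient
cost_readmission = 15_000  # average cost of one readmission
penalty = 3_200            # Medicare penalty component per readmission
effectiveness = 0.40       # fraction of readmissions the program prevents

# Intervening on a patient who WILL be readmitted: a 40% chance of
# avoiding an $18,200 event, minus the $850 program cost
ev_true_positive = effectiveness * (cost_readmission + penalty) - cost_intervention

# Intervening on a patient who will NOT be readmitted: the program
# cost is simply spent with no readmission to prevent
ev_false_positive = -cost_intervention

print(f"Expected value per true positive:  ${ev_true_positive:,.0f}")   # $6,430
print(f"Expected value per false positive: ${ev_false_positive:,.0f}")  # $-850
```

Roughly $6,430 of expected benefit is forfeited for every missed high-risk patient, against $850 wasted per unnecessary flag: an asymmetry of better than seven to one, which is why the optimal operating point tolerates so many false positives.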
Why the Default Threshold Failed
The model uses the default classification threshold of 0.50. But heart failure readmission is not a 50/50 bet --- the base rate is 24.3%. At a threshold of 0.50, the model flags only the patients it is very confident will be readmitted, trading recall away for precision.
For a clinical application where missing a high-risk patient has severe consequences, this is exactly backwards.
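Part of the trap is that the 0.50 threshold is invisible: scikit-learn's `predict` applies it silently. For a binary classifier, `predict` is equivalent (up to exact ties) to thresholding `predict_proba` at 0.5, so any other operating point has to be applied by hand. A minimal sketch on toy data, not the case study's cohort:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, purely to illustrate where the 0.50 threshold hides
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(size=300) > 0.5).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# predict() is just predict_proba thresholded at 0.5
default_preds = (proba >= 0.5).astype(int)
print(np.array_equal(clf.predict(X), default_preds))  # True

# A lower clinical operating point must be applied explicitly
flagged_at_010 = (proba >= 0.10).astype(int)
print(f"Flagged at 0.50: {default_preds.sum()}, at 0.10: {flagged_at_010.sum()}")
```

Nothing in the training procedure chose 0.50; it is a library convention, and it should be overridden whenever the costs of the two error types differ.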
The Fix: Threshold Optimization for Clinical Impact
The team needs to find the threshold that maximizes clinical benefit, not the threshold that maximizes accuracy or F1.
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# Clinical cost-benefit analysis
cost_intervention = 850 # Per-patient intervention cost
cost_readmission = 15000 # Average cost of a readmission
intervention_effectiveness = 0.40 # 40% of interventions prevent readmission
penalty_per_readmission = 3200 # Medicare penalty component
thresholds_to_test = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.50]
print(f"{'Threshold':<11} {'Prec':<8} {'Recall':<8} {'Caught':<9} "
      f"{'Missed':<9} {'Prevented':<11} {'Net Savings'}")
print("-" * 75)
best_savings = -1e9
best_t = None
for t in thresholds_to_test:
    y_pred_t = (y_proba >= t).astype(int)
    n_flagged = int(y_pred_t.sum())
    prec = precision_score(y_test, y_pred_t, zero_division=0)
    rec = recall_score(y_test, y_pred_t, zero_division=0)
    # Count true positives directly rather than reconstructing them from
    # rounded precision, which introduces off-by-one errors
    tp = int(((y_pred_t == 1) & (y_test.to_numpy() == 1)).sum())
    fn = int(y_test.sum()) - tp
    prevented = int(tp * intervention_effectiveness)
    savings_from_prevented = prevented * cost_readmission
    savings_from_penalties = prevented * penalty_per_readmission
    cost_of_interventions = n_flagged * cost_intervention
    net = savings_from_prevented + savings_from_penalties - cost_of_interventions
    if net > best_savings:
        best_savings = net
        best_t = t
    print(f" {t:<11.2f} {prec:<8.3f} {rec:<8.3f} {tp:<9} "
          f"{fn:<9} {prevented:<11} ${net:>10,.0f}")
print(f"\nOptimal clinical threshold: {best_t}")
print(f"Maximum net savings: ${best_savings:,.0f}")
Threshold   Prec     Recall   Caught    Missed    Prevented   Net Savings
---------------------------------------------------------------------------
 0.10        0.251    0.892    182       22        72          $ 1,009,350
 0.15        0.294    0.784    160       44        64          $   920,200
 0.20        0.338    0.662    135       69        54          $   798,100
 0.25        0.392    0.539    110       94        44          $   663,400
 0.30        0.428    0.422    86        118       34          $   528,150
 0.35        0.453    0.338    69        135       27          $   404,100
 0.40        0.472    0.333    68        136       27          $   398,650
 0.50        0.482    0.328    67        137       26          $   393,600

Optimal clinical threshold: 0.10
Maximum net savings: $1,009,350
The Core Finding --- The optimal threshold is 0.10, not the default 0.50. At threshold 0.10, the model catches 89% of future readmissions (recall = 0.89) at a precision of only 25%. Three out of four flagged patients will not actually be readmitted --- but the cost of the unnecessary interventions ($850 each) is dwarfed by the savings from the prevented readmissions ($15,000 + $3,200 in penalties each).
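There is a closed-form sanity check on that grid search. For a well-calibrated model, intervening on a patient with predicted readmission probability p has positive expected value whenever p × effectiveness × (readmission cost + penalty) exceeds the intervention cost. Solving for p, with the same illustrative cost figures, gives the break-even threshold:

```python
cost_intervention = 850
cost_readmission = 15_000
penalty = 3_200
effectiveness = 0.40

# Intervene whenever p * effectiveness * (cost_readmission + penalty)
# exceeds cost_intervention; the break-even p is the decision threshold
breakeven = cost_intervention / (effectiveness * (cost_readmission + penalty))
print(f"Break-even probability: {breakeven:.3f}")  # 0.117
```

The decision-theoretic optimum of about 0.117 lands right next to the empirically best grid value of 0.10 --- reassuring agreement between the cost structure and the search, though it holds only if the predicted probabilities are calibrated, which is checked later in this case study.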
The Human Cost of Wrong Metrics
The original model, deployed with a threshold of 0.50, missed 137 out of 204 future readmissions in the test set. Extrapolated to the full hospital population, this means roughly 67% of preventable readmissions are being missed because the team optimized for the wrong metric.
# Annualize the comparison
annual_discharges = 2100  # Heart failure discharges per year at Mercy
annual_readmissions = round(annual_discharges * 0.243)
# With the default threshold of 0.50
recall_050 = 0.328
caught_050 = round(annual_readmissions * recall_050)
prevented_050 = int(caught_050 * intervention_effectiveness)  # floor: conservative
# With the clinically optimized threshold of 0.10
recall_010 = 0.892
caught_010 = round(annual_readmissions * recall_010)
prevented_010 = int(caught_010 * intervention_effectiveness)
print("=== Annual Impact Comparison ===\n")
print(f"Heart failure discharges per year: {annual_discharges}")
print(f"Expected readmissions (no model): {annual_readmissions}")
print(f"\n{'Metric':<35} {'Threshold=0.50':<18} {'Threshold=0.10':<18}")
print("-" * 71)
print(f"{'Readmissions caught':<35} {caught_050:<18} {caught_010:<18}")
print(f"{'Readmissions missed':<35} {annual_readmissions - caught_050:<18} "
      f"{annual_readmissions - caught_010:<18}")
print(f"{'Readmissions prevented':<35} {prevented_050:<18} {prevented_010:<18}")
print(f"{'Lives significantly impacted':<35} {prevented_050:<18} {prevented_010:<18}")
print(f"\nDifference: {prevented_010 - prevented_050} additional readmissions "
      f"prevented per year")
print("Each prevented readmission = a patient who does not suffer")
print("the physical and emotional toll of an emergency hospital return.")
=== Annual Impact Comparison ===
Heart failure discharges per year: 2100
Expected readmissions (no model): 510
Metric                              Threshold=0.50     Threshold=0.10
-----------------------------------------------------------------------
Readmissions caught                 167                455
Readmissions missed                 343                55
Readmissions prevented              66                 182
Lives significantly impacted        66                 182
Difference: 116 additional readmissions prevented per year
Each prevented readmission = a patient who does not suffer
the physical and emotional toll of an emergency hospital return.
116 additional prevented readmissions per year. That is 116 patients who do not endure another ambulance ride, another emergency room visit, another hospital stay. It is also 116 fewer penalties from Medicare.
The difference between the two approaches is not a technical detail. It is 116 human beings.
Proper Evaluation: What the Team Should Have Done
Step 1: Define the Metric Before Building the Model
The team should have started with a clinical cost-benefit analysis, not an AUC-ROC number. The right question was never "what is our AUC?" It was "how many readmissions can we prevent, at what cost?"
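"Define the metric first" means, concretely, writing the evaluation function before any model exists and agreeing with clinical leadership that it is the target. A sketch of what the team could have pre-registered (the function name and cost defaults are illustrative, matching the figures used in this case study):

```python
import numpy as np

def net_clinical_savings(y_true, y_proba, threshold,
                         cost_intervention=850, cost_readmission=15_000,
                         penalty=3_200, effectiveness=0.40):
    """Net dollars saved by intervening on every patient at or above `threshold`."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_proba) >= threshold
    tp = int((flagged & (y_true == 1)).sum())
    prevented = tp * effectiveness                 # expected readmissions avoided
    savings = prevented * (cost_readmission + penalty)
    return savings - flagged.sum() * cost_intervention

# Tiny hand-checkable example: 3 patients flagged, 2 of them true positives
y_true = [1, 1, 0, 0, 0]
y_proba = [0.8, 0.3, 0.6, 0.1, 0.05]
print(f"Net savings at t=0.25: ${net_clinical_savings(y_true, y_proba, 0.25):,.0f}")
```

With these toy numbers the function returns 2 × 0.40 × $18,200 − 3 × $850 = $12,010. Any candidate model is then judged by this number on held-out data, not by AUC.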
Step 2: Use Cross-Validation with Temporal Awareness
Hospital readmission data has temporal structure. Patients discharged in 2024 should not be in the training set when evaluating predictions for patients discharged in 2023.
from sklearn.model_selection import TimeSeriesSplit
# Assign discharge dates (simulate temporal ordering)
np.random.seed(42)
data_encoded['discharge_month'] = np.random.randint(0, 24, n)
data_encoded = data_encoded.sort_values('discharge_month').reset_index(drop=True)
X_sorted = data_encoded[features]
y_sorted = data_encoded['readmitted']
tscv = TimeSeriesSplit(n_splits=5)
print("=== Temporal Cross-Validation ===\n")
for i, (train_idx, test_idx) in enumerate(tscv.split(X_sorted)):
    model_cv = GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=4,
        subsample=0.8, random_state=42
    )
    model_cv.fit(X_sorted.iloc[train_idx], y_sorted.iloc[train_idx])
    y_prob_cv = model_cv.predict_proba(X_sorted.iloc[test_idx])[:, 1]
    auc_cv = roc_auc_score(y_sorted.iloc[test_idx], y_prob_cv)
    auc_pr_cv = average_precision_score(y_sorted.iloc[test_idx], y_prob_cv)
    # Recall at the clinically chosen threshold of 0.10
    y_pred_cv = (y_prob_cv >= 0.10).astype(int)
    rec_cv = recall_score(y_sorted.iloc[test_idx], y_pred_cv)
    print(f" Fold {i+1}: AUC-ROC={auc_cv:.4f} AUC-PR={auc_pr_cv:.4f} "
          f"Recall@0.10={rec_cv:.3f}")
=== Temporal Cross-Validation ===
Fold 1: AUC-ROC=0.6982 AUC-PR=0.3648 Recall@0.10=0.872
Fold 2: AUC-ROC=0.7045 AUC-PR=0.3712 Recall@0.10=0.884
Fold 3: AUC-ROC=0.7103 AUC-PR=0.3784 Recall@0.10=0.891
Fold 4: AUC-ROC=0.7068 AUC-PR=0.3742 Recall@0.10=0.879
Fold 5: AUC-ROC=0.6918 AUC-PR=0.3589 Recall@0.10=0.862
Step 3: Report Metrics That Clinicians Understand
Clinicians do not think in AUC. They think in patients. Translate every metric into clinical language.
avg_recall = 0.878 # Average from temporal CV
print("=== Clinical Translation ===\n")
print(f"For every 100 heart failure patients discharged:")
print(f" - Approximately {int(100 * 0.243)} will be readmitted within 30 days")
print(f" - The model identifies {int(100 * 0.243 * avg_recall)} of them before discharge")
print(f" - {int(100 * 0.243) - int(100 * 0.243 * avg_recall)} will be missed")
print(f" - Of the ~{int(100 * 0.243 * avg_recall * 0.40)} prevented readmissions:")
print(f" each one avoids ~5 hospital days, emergency transport,")
print(f" and the physical/emotional burden on patients and families")
=== Clinical Translation ===
For every 100 heart failure patients discharged:
- Approximately 24 will be readmitted within 30 days
- The model identifies 21 of them before discharge
- 3 will be missed
- Of the ~8 prevented readmissions:
each one avoids ~5 hospital days, emergency transport,
and the physical/emotional burden on patients and families
The Calibration Check
One final evaluation: are the model's probabilities trustworthy enough to guide clinical decisions?
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=8)
print("=== Calibration ===\n")
print(f"{'Predicted':<14} {'Observed':<14} {'Difference'}")
print("-" * 40)
for pred, true in zip(prob_pred, prob_true):
    diff = pred - true
    flag = " <-- off by >5%" if abs(diff) > 0.05 else ""
    print(f" {pred:<14.3f} {true:<14.3f} {diff:+.3f}{flag}")
mean_cal_error = np.mean(np.abs(prob_pred - prob_true))
print(f"\nMean calibration error: {mean_cal_error:.4f}")
if mean_cal_error < 0.03:
    print("Calibration: good")
elif mean_cal_error < 0.06:
    print("Calibration: acceptable")
else:
    print("Calibration: consider post-hoc calibration (Platt or isotonic)")
=== Calibration ===
Predicted Observed Difference
----------------------------------------
0.102 0.118 -0.016
0.152 0.168 -0.016
0.198 0.214 -0.016
0.243 0.231 +0.012
0.286 0.278 +0.008
0.338 0.352 -0.014
0.402 0.412 -0.010
0.518 0.498 +0.020
Mean calibration error: 0.0140
Calibration: good
The model is well calibrated. When it says a patient has a 30% readmission probability, approximately 30% of similar patients are actually readmitted. This means clinicians can trust the predicted probabilities to inform their clinical judgment, not just as binary flags.
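The probabilities happened to come out well calibrated here, but gradient boosting offers no such guarantee in general. Had the calibration table looked bad, the standard remedy is post-hoc calibration, for example isotonic regression via scikit-learn's `CalibratedClassifierCV`. A minimal sketch on synthetic stand-in data (not the case study's cohort):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a positive rate in the 25% range
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
logit = -1.2 + X[:, 0] + 0.5 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

base = GradientBoostingClassifier(n_estimators=100, random_state=42)
# Fit the model and learn an isotonic mapping from raw scores to
# calibrated probabilities via internal cross-validation
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_te)[:, 1]
print(f"Mean predicted risk: {proba.mean():.3f}  Observed rate: {y_te.mean():.3f}")
```

One caveat: recalibration shifts the probability scale, so any clinical threshold chosen earlier must be re-derived on the calibrated scores rather than carried over.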
Lessons Learned
- In healthcare, recall saves lives. A model with high precision and low recall looks good on paper and fails patients. When the cost of a false negative is hospitalization, suffering, or death, and the cost of a false positive is an unnecessary home visit, optimize for recall.
- The default threshold of 0.50 is dangerous for clinical applications. In most medical screening contexts, the optimal threshold is far lower. Calculate it from the cost-benefit structure, not from machine learning convention.
- Report metrics in clinical terms, not ML terms. "AUC of 0.71" means nothing to a chief medical officer. "The model catches 89% of future readmissions and enables us to prevent 182 of them per year" starts a useful conversation.
- Accuracy is meaningless for clinical decision support. The model's accuracy was 75%, which sounds reasonable. But this metric hides the fact that two-thirds of future readmissions were being missed --- a failure that no clinician would consider acceptable.
- Use temporal cross-validation for medical data. Patients discharged last year are different from patients who will be discharged next year. Treatments change, populations shift, and seasonal patterns affect readmission risk. Time-based splits provide a more honest estimate of future performance.
- Calibration matters when probabilities guide decisions. A clinician needs to trust that "30% readmission risk" is a real 30%, not an arbitrary score. Well-calibrated models enable shared decision-making between the model, the clinician, and the patient.
This case study supports Chapter 16: Model Evaluation Deep Dive. Return to the chapter for the full treatment of metrics, thresholds, and evaluation strategies.