Case Study 2: Metro General Hospital --- Why the Hospital Chose Logistic Regression Over XGBoost
Background
Dr. Sarah Nwosu, Chief Medical Officer of Metro General Hospital, has two models on her desk. Both predict 30-day readmission risk for patients at the point of discharge. Both were built by the hospital's analytics team using the same data and the same train-test split.
| Model | AUC-ROC | Precision@10% | Recall@10% | Interpretable? |
|---|---|---|---|---|
| Logistic Regression (L2) | 0.741 | 0.312 | 0.287 | Yes |
| XGBoost (tuned) | 0.773 | 0.378 | 0.341 | Not directly |
The XGBoost model is better by every quantitative metric. Dr. Nwosu chose the logistic regression.
This case study explains why --- and what it teaches you about the role of model selection in regulated, high-stakes environments.
The Clinical Context
Metro General is a 450-bed urban teaching hospital. The numbers:
| Metric | Value |
|---|---|
| Annual admissions | ~28,000 |
| 30-day readmission rate | 17.3% |
| National average | 15.6% |
| Excess readmissions (annual) | ~475 |
| CMS penalty (annual) | $2.1M |
| Average cost per readmission | $14,400 |
| Care coordination team size | 12 nurses, 4 social workers |
| Available interventions | Post-discharge phone calls, home visits, medication reconciliation, follow-up scheduling |
The Hospital Readmissions Reduction Program (HRRP) penalizes hospitals with excess readmission rates for six conditions: acute myocardial infarction, heart failure, pneumonia, COPD, elective hip/knee replacement, and coronary artery bypass graft surgery. Metro General exceeds the national benchmark on three of the six.
The care coordination team has capacity for approximately 200 "high-touch" interventions per month (phone calls within 48 hours, medication reconciliation, scheduled follow-up). The model's job is to identify the 200 patients most likely to be readmitted, so the team can prioritize.
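The capacity constraint translates directly into a score cutoff: with roughly 2,333 discharges per month and 200 intervention slots, the team can act on about the top 8.6% of risk scores. A minimal sketch of that arithmetic (the score array here is a random stand-in; in production it would be the model's `predict_proba` output):

```python
import numpy as np

# Figures from the case: ~28,000 admissions/year, 200 high-touch slots/month
monthly_admissions = 28_000 / 12
capacity = 200
top_fraction = capacity / monthly_admissions  # ~0.086

# Stand-in risk scores for one month of discharges
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, int(monthly_admissions))

# The operating cutoff is the (1 - top_fraction) quantile of the month's scores
cutoff = np.quantile(scores, 1 - top_fraction)
flagged = scores >= cutoff
print(f"Intervene on the top {top_fraction:.1%} of discharges "
      f"(score >= {cutoff:.3f}, {flagged.sum()} patients)")
```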
The Data
The analytics team assembled features available at or before the time of discharge:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.metrics import (
    roc_auc_score, classification_report,
    precision_recall_curve, roc_curve
)

np.random.seed(42)
n = 28000  # One year of admissions

df = pd.DataFrame({
    'age': np.random.normal(65, 15, n).astype(int).clip(18, 100),
    'length_of_stay': np.random.exponential(5, n).astype(int).clip(1, 45),
    'num_prior_admissions_12mo': np.random.poisson(0.8, n),
    'num_chronic_conditions': np.random.poisson(2.5, n).clip(0, 10),
    'num_medications_at_discharge': np.random.poisson(6, n).clip(0, 25),
    'lives_alone': np.random.choice([0, 1], n, p=[0.65, 0.35]),
    'has_primary_care_provider': np.random.choice([0, 1], n, p=[0.15, 0.85]),
    'insurance_type': np.random.choice(
        ['medicare', 'medicaid', 'commercial', 'uninsured'], n,
        p=[0.38, 0.22, 0.31, 0.09]
    ),
    'discharge_disposition': np.random.choice(
        ['home', 'home_health', 'snf', 'rehab'], n,
        p=[0.55, 0.20, 0.15, 0.10]
    ),
    'admission_source': np.random.choice(
        ['emergency', 'transfer', 'scheduled'], n,
        p=[0.52, 0.13, 0.35]
    ),
    'primary_diagnosis_category': np.random.choice(
        ['heart_failure', 'pneumonia', 'copd', 'ami', 'hip_knee', 'cabg', 'other'], n,
        p=[0.12, 0.08, 0.09, 0.06, 0.05, 0.03, 0.57]
    ),
    'hemoglobin_at_discharge': np.random.normal(12.5, 2.0, n).round(1).clip(5, 18),
    'creatinine_at_discharge': np.random.exponential(1.2, n).round(2).clip(0.3, 10),
    'sodium_at_discharge': np.random.normal(139, 4, n).round(0).clip(120, 155).astype(int),
    'had_icu_stay': np.random.choice([0, 1], n, p=[0.80, 0.20]),
    'discharge_on_weekend': np.random.choice([0, 1], n, p=[0.72, 0.28]),
    'medically_underserved_zip': np.random.choice([0, 1], n, p=[0.56, 0.44]),
})

# Generate readmission with realistic relationships
readmit_logit = (
    -2.8
    + 0.015 * df['age']
    + 0.04 * df['length_of_stay']
    + 0.55 * df['num_prior_admissions_12mo']
    + 0.18 * df['num_chronic_conditions']
    + 0.06 * df['num_medications_at_discharge']
    + 0.35 * df['lives_alone']
    - 0.45 * df['has_primary_care_provider']
    + 0.25 * (df['insurance_type'] == 'medicaid').astype(int)
    + 0.15 * (df['insurance_type'] == 'uninsured').astype(int)
    + 0.20 * (df['discharge_disposition'] == 'home').astype(int)
    + 0.30 * (df['admission_source'] == 'emergency').astype(int)
    + 0.40 * (df['primary_diagnosis_category'] == 'heart_failure').astype(int)
    + 0.25 * (df['primary_diagnosis_category'] == 'copd').astype(int)
    - 0.08 * df['hemoglobin_at_discharge']
    + 0.15 * df['creatinine_at_discharge']
    + 0.20 * df['had_icu_stay']
    + 0.10 * df['discharge_on_weekend']
    + 0.22 * df['medically_underserved_zip']
    + np.random.normal(0, 1.0, n)
)
df['readmitted_30d'] = (1 / (1 + np.exp(-readmit_logit)) > 0.5).astype(int)

print(f"Patients: {n:,}")
print(f"Readmission rate: {df['readmitted_30d'].mean():.1%}")
```

```
Patients: 28,000
Readmission rate: 17.2%
```
The Two Models
Model A: Logistic Regression with L2 Regularization
```python
X = df.drop('readmitted_30d', axis=1)
y = df['readmitted_30d']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False,
                              handle_unknown='ignore'), categorical_features),
    ]
)

# Logistic Regression pipeline
lr_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 30),
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        penalty='l2',
        scoring='roc_auc',
        max_iter=5000,
        random_state=42,
    ))
])

lr_pipe.fit(X_train, y_train)
lr_prob = lr_pipe.predict_proba(X_test)[:, 1]
lr_auc = roc_auc_score(y_test, lr_prob)
print(f"Logistic Regression AUC: {lr_auc:.4f}")
```

```
Logistic Regression AUC: 0.7412
```
Model B: XGBoost (For Comparison)
```python
# NOTE: This requires xgboost to be installed:
#   pip install xgboost
try:
    from xgboost import XGBClassifier

    xgb_pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', XGBClassifier(
            n_estimators=300,
            max_depth=6,
            learning_rate=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            random_state=42,
            eval_metric='auc',
        ))
    ])
    xgb_pipe.fit(X_train, y_train)
    xgb_prob = xgb_pipe.predict_proba(X_test)[:, 1]
    xgb_auc = roc_auc_score(y_test, xgb_prob)
    print(f"XGBoost AUC: {xgb_auc:.4f}")
    print(f"Improvement: +{(xgb_auc - lr_auc):.4f} ({(xgb_auc - lr_auc) / lr_auc * 100:.1f}%)")
except ImportError:
    print("XGBoost not installed. Using reported values.")
    xgb_auc = 0.773
    print(f"XGBoost AUC (reported): {xgb_auc:.4f}")
```

```
XGBoost AUC: 0.7732
Improvement: +0.0320 (4.3%)
```
The XGBoost model wins by 3.2 AUC points. In a Kaggle competition, this would be decisive. In a hospital, it is the beginning of a much harder conversation.
The Interpretability Argument
What the Discharge Nurse Needs
When a patient is being discharged, the nurse consults the risk model. If the patient is flagged as high-risk, the nurse initiates the care coordination protocol: a 48-hour follow-up call, medication reconciliation, and a scheduled follow-up appointment within 7 days.
The nurse needs to know why the patient is flagged. Not because the nurse is curious, but because:
- Clinical override. If the model flags a patient for readmission risk but the reason is "prior admissions" and the patient's prior admission was for an unrelated elective procedure, the nurse may reasonably override the flag. This requires knowing the reason.
- Targeted intervention. A patient flagged because they live alone needs a different intervention (home health referral) than a patient flagged because they are on 12 medications (pharmacist review). The reason drives the action.
- Trust. Clinicians will not follow a model they do not understand. This is not irrational. It is appropriate professional skepticism. A nurse with 20 years of experience has earned the right to ask "why?"
Logistic Regression Explains Itself
```python
# Extract and display logistic regression coefficients
feature_names = (
    numeric_features +
    list(lr_pipe.named_steps['preprocessor']
         .named_transformers_['cat']
         .get_feature_names_out(categorical_features))
)
coefs = lr_pipe.named_steps['classifier'].coef_[0]

coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefs,
    'odds_ratio': np.exp(coefs),
}).sort_values('coefficient', ascending=False)

# Note: numeric columns are standardized, so their odds ratios are per
# 1-SD increase; one-hot columns are not scaled, so their odds ratios
# are relative to the dropped reference category.
print("Logistic Regression — Readmission Risk Factors")
print("=" * 70)
print(f"{'Feature':>40} | {'Coef':>7} | {'OR':>6} | Interpretation")
print("-" * 70)
for _, row in coef_df.head(10).iterrows():
    pct = (row['odds_ratio'] - 1) * 100
    sign = "+" if pct > 0 else ""
    print(f"{row['feature']:>40} | {row['coefficient']:>7.3f} | {row['odds_ratio']:>6.2f} | "
          f"{sign}{pct:.0f}% odds per 1-SD increase")

print("\n... (protective factors)")
for _, row in coef_df.tail(5).iterrows():
    pct = (1 - row['odds_ratio']) * 100
    print(f"{row['feature']:>40} | {row['coefficient']:>7.3f} | {row['odds_ratio']:>6.2f} | "
          f"-{pct:.0f}% odds per 1-SD increase")
```

```
Logistic Regression — Readmission Risk Factors
======================================================================
                                 Feature |    Coef |     OR | Interpretation
----------------------------------------------------------------------
               num_prior_admissions_12mo |   0.534 |   1.71 | +71% odds per 1-SD increase
primary_diagnosis_category_heart_failure |   0.412 |   1.51 | +51% odds per 1-SD increase
                             lives_alone |   0.358 |   1.43 | +43% odds per 1-SD increase
              admission_source_emergency |   0.312 |   1.37 | +37% odds per 1-SD increase
                  num_chronic_conditions |   0.287 |   1.33 | +33% odds per 1-SD increase
               medically_underserved_zip |   0.245 |   1.28 | +28% odds per 1-SD increase
                            had_icu_stay |   0.223 |   1.25 | +25% odds per 1-SD increase
            num_medications_at_discharge |   0.198 |   1.22 | +22% odds per 1-SD increase
                 insurance_type_medicaid |   0.178 |   1.19 | +19% odds per 1-SD increase
                    discharge_on_weekend |   0.112 |   1.12 | +12% odds per 1-SD increase

... (protective factors)
               discharge_disposition_snf |  -0.134 |   0.87 | -13% odds per 1-SD increase
                 hemoglobin_at_discharge |  -0.187 |   0.83 | -17% odds per 1-SD increase
             discharge_disposition_rehab |  -0.223 |   0.80 | -20% odds per 1-SD increase
       discharge_disposition_home_health |  -0.289 |   0.75 | -25% odds per 1-SD increase
               has_primary_care_provider |  -0.445 |   0.64 | -36% odds per 1-SD increase
```
A nurse can read this table and immediately understand:
- "This patient is flagged primarily because they have 3 prior admissions in the last year (+71% odds), live alone (+43% odds), and were admitted through the emergency department (+37% odds)."
- The intervention is clear: arrange home health services (lives alone), ensure PCP follow-up (admitted through ED suggests no regular care), and monitor closely (prior admissions).
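The "reason drives the action" logic can be encoded as a simple lookup that the care coordination team owns. A sketch with hypothetical action names (the mapping itself would be authored by clinicians, not the analytics team):

```python
# Hypothetical mapping from a model-flagged risk factor to a care-coordination
# action. The keys match the model's feature names; the actions are examples.
INTERVENTION_MAP = {
    'lives_alone': 'home health referral',
    'num_medications_at_discharge': 'pharmacist medication review',
    'has_primary_care_provider': 'PCP enrollment assistance',
    'num_prior_admissions_12mo': '48-hour follow-up call + close monitoring',
    'admission_source_emergency': 'ensure PCP follow-up within 7 days',
    'primary_diagnosis_category_heart_failure': 'heart failure clinic visit',
}

def recommend_actions(top_risk_factors):
    """Map a patient's top flagged risk factors to suggested actions."""
    return [INTERVENTION_MAP[f] for f in top_risk_factors if f in INTERVENTION_MAP]

print(recommend_actions(['num_prior_admissions_12mo', 'lives_alone',
                         'admission_source_emergency']))
```

Because logistic regression names its reasons directly, this table stays small and auditable; a black-box model would first need an explanation layer to produce the keys.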
XGBoost Requires Extra Work
XGBoost does not produce interpretable coefficients. To explain individual predictions, you need SHAP values (Chapter 19):
```python
# This is what it takes to explain an XGBoost prediction
# (Chapter 19 covers this in detail):
# import shap
# explainer = shap.TreeExplainer(xgb_pipe.named_steps['classifier'])
# shap_values = explainer.shap_values(X_test_processed)

# For now, we can show feature importance —
# but feature importance != individual explanation
```
Feature importance tells you which features matter globally. It does not tell you why this specific patient is flagged. SHAP bridges that gap, but it adds computational overhead, visualization complexity, and a learning curve for clinical staff who are not data scientists.
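The distinction holds even in a linear world. A toy example with hypothetical coefficients: globally the first feature dominates, but for a particular patient the second feature drives the score, so a global importance ranking would explain this patient wrongly:

```python
import numpy as np

# Hypothetical coefficients for two standardized features
coefs = np.array([0.9, 0.3])      # globally, feature 0 matters most
patient = np.array([0.1, 3.0])    # but this patient is extreme on feature 1

global_importance = np.abs(coefs)
patient_contrib = coefs * patient  # per-patient contribution to the logit

print("Global ranking: ", np.argsort(global_importance)[::-1])        # feature 0 first
print("Patient ranking:", np.argsort(np.abs(patient_contrib))[::-1])  # feature 1 first
```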
War Story --- A large health system deployed an XGBoost readmission model with SHAP explanations rendered as waterfall charts in the EHR. The charts were technically correct and visually appealing. Six months later, an internal survey found that 84% of nurses did not look at the explanations. They either trusted the risk score blindly or ignored it entirely. The waterfall charts were designed for data scientists, not for clinicians making discharge decisions in 3 minutes.
The Cost-Benefit Analysis
Dr. Nwosu's decision is not "which model is better at prediction?" It is "which model creates more value when embedded in our clinical workflow?"
Quantifying the AUC Gap
```python
# Capacity is 200 high-touch interventions per month, i.e. about
# 200 x 12 / 28,000 ≈ 8.6% of annual admissions. Here we flag the
# top 7% of the test set as a slightly conservative stand-in.
top_pct = 0.07
n_top = int(len(y_test) * top_pct)

# Logistic regression: rank by probability, take top n
lr_top_idx = np.argsort(lr_prob)[::-1][:n_top]
lr_precision_at_top = y_test.iloc[lr_top_idx].mean()
lr_caught = y_test.iloc[lr_top_idx].sum()

# XGBoost: same exercise
# Using simulated XGB probabilities (correlated with LR but slightly better)
np.random.seed(42)
xgb_prob_sim = lr_prob + np.random.uniform(-0.02, 0.05, len(lr_prob))
xgb_prob_sim = np.clip(xgb_prob_sim, 0, 1)
xgb_top_idx = np.argsort(xgb_prob_sim)[::-1][:n_top]
xgb_precision_at_top = y_test.iloc[xgb_top_idx].mean()
xgb_caught = y_test.iloc[xgb_top_idx].sum()

total_readmissions = y_test.sum()
print(f"Capacity: top {n_top} patients (top {top_pct:.0%})")
print(f"Total readmissions in test set: {total_readmissions}")
print()
print(f"{'Model':>20} | {'Precision@top':>14} | {'Readmissions Caught':>20} | {'Recall@top':>11}")
print("-" * 75)
print(f"{'Logistic Regression':>20} | {lr_precision_at_top:>14.1%} | {lr_caught:>20} | {lr_caught/total_readmissions:>11.1%}")
print(f"{'XGBoost':>20} | {xgb_precision_at_top:>14.1%} | {xgb_caught:>20} | {xgb_caught/total_readmissions:>11.1%}")
```

```
Capacity: top 392 patients (top 7%)
Total readmissions in test set: 963

               Model |  Precision@top |  Readmissions Caught |  Recall@top
---------------------------------------------------------------------------
 Logistic Regression |          31.4% |                  123 |       12.8%
             XGBoost |          37.5% |                  147 |       15.3%
```
The XGBoost model catches 24 more readmissions than logistic regression in the held-out test set. At $14,400 per readmission, that is ~$345,600 over the test period --- and since the test set covers only 20% of a year of discharges, on the order of $1.7M/year in avoided costs.
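As a sanity check on the dollar arithmetic (the test set is one-fifth of a year of discharges, so raw test-set deltas scale by 5x annually):

```python
# Scale the test-set delta (24 extra readmissions caught) to a full year.
extra_caught_test = 147 - 123          # XGBoost vs. LR, from the table above
cost_per_readmission = 14_400
annual_factor = 1 / 0.20               # test set = 20% of a year

test_period_value = extra_caught_test * cost_per_readmission
annual_value = test_period_value * annual_factor
print(f"Test-period value: ${test_period_value:,.0f}")
print(f"Annualized value:  ${annual_value:,.0f}")
```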
But consider the costs of the XGBoost model:
| Cost Factor | Logistic Regression | XGBoost |
|---|---|---|
| Model retraining time | 15 seconds | 12 minutes |
| Explanation generation | Built-in (coefficients) | SHAP computation (~45 min/batch) |
| Clinical integration | Simple table | Custom visualization |
| Regulatory documentation | 2-page coefficient table | 15-page SHAP analysis |
| IT maintenance | Minimal | Moderate |
| Staff training | 1 hour | 4+ hours |
| Risk of clinical non-adoption | Low | Moderate-High |
Production Tip --- The 3.2 AUC point gap translates to 24 additional caught readmissions in the test set (roughly 120 annualized). That is real value. But the risk of clinician non-adoption is also real. If nurses ignore the XGBoost model 30% of the time because they do not trust the explanations, its effective performance drops below that of the logistic regression they actually use. A model that clinicians trust and follow beats a model they do not.
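The adoption argument can be made concrete. Using the test-set catch counts from above and the hypothetical 90% vs. 60% adoption rates discussed in this case, the adoption-weighted value flips the ranking:

```python
# A model only creates value on the discharges where clinicians follow it.
# Catch counts are from the test-set comparison; adoption rates are the
# hypothetical figures used in the case discussion.
readmissions_caught = {'logistic_regression': 123, 'xgboost': 147}
adoption_rate = {'logistic_regression': 0.90, 'xgboost': 0.60}

# Effective catches = catches on patients where the flag is acted upon
effective = {m: readmissions_caught[m] * adoption_rate[m]
             for m in readmissions_caught}
for model, value in effective.items():
    print(f"{model:>20}: {value:.0f} effective readmissions caught")
```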
The Deployment Decision
Dr. Nwosu's reasoning, as communicated to the analytics team:
1. Interpretability is not optional in clinical settings. Every prediction that influences patient care must be explainable to the treating clinician. This is not a preference --- it is a patient safety requirement.
2. The AUC gap is narrow. A 3.2-point improvement is meaningful in research. In a clinical workflow with capacity constraints, it translates to ~24 additional caught readmissions in the test set (roughly 120 annualized). This is valuable but not transformative.
3. Adoption risk dominates. If the logistic regression model achieves 90% clinical adoption and XGBoost achieves 60%, the logistic regression delivers more value. Clinical adoption is not a line item in the AUC calculation, but it is the line item that determines whether the model creates real-world impact.
4. Start simple, upgrade with evidence. Deploy the logistic regression now. Track its real-world performance. If the care coordination team reaches capacity and wants better targeting, revisit the XGBoost model with SHAP explanations --- after investing in the clinical training needed to support it.
```python
# The deployed model: Logistic Regression with L2
import joblib

# Save the production model
joblib.dump(lr_pipe, 'metro_general_readmission_lr.joblib')

# In production: score a new patient at discharge
sample_patient = pd.DataFrame({
    'age': [72],
    'length_of_stay': [6],
    'num_prior_admissions_12mo': [2],
    'num_chronic_conditions': [4],
    'num_medications_at_discharge': [9],
    'lives_alone': [1],
    'has_primary_care_provider': [0],
    'insurance_type': ['medicare'],
    'discharge_disposition': ['home'],
    'admission_source': ['emergency'],
    'primary_diagnosis_category': ['heart_failure'],
    'hemoglobin_at_discharge': [10.8],
    'creatinine_at_discharge': [1.9],
    'sodium_at_discharge': [136],
    'had_icu_stay': [1],
    'discharge_on_weekend': [0],
    'medically_underserved_zip': [1],
})

loaded_model = joblib.load('metro_general_readmission_lr.joblib')
risk_score = loaded_model.predict_proba(sample_patient)[:, 1][0]
print(f"\nPatient Readmission Risk: {risk_score:.1%}")
print(f"Risk Category: {'HIGH' if risk_score > 0.30 else 'MODERATE' if risk_score > 0.15 else 'LOW'}")

# Explain the prediction using coefficients.
# Note: numeric features are standardized, so a below-average value of a
# protective feature (here, has_primary_care_provider = 0) shows up as a
# positive, risk-increasing contribution.
patient_processed = loaded_model.named_steps['preprocessor'].transform(sample_patient)
contributions = patient_processed[0] * coefs
top_5_idx = np.argsort(np.abs(contributions))[::-1][:5]
print(f"\nTop Risk Factors:")
for idx in top_5_idx:
    direction = "increases" if contributions[idx] > 0 else "decreases"
    print(f"  - {feature_names[idx]}: {direction} risk (contribution: {contributions[idx]:+.3f})")
```

```
Patient Readmission Risk: 48.3%
Risk Category: HIGH

Top Risk Factors:
  - num_prior_admissions_12mo: increases risk (contribution: +0.892)
  - has_primary_care_provider: increases risk (contribution: +0.445)
  - primary_diagnosis_category_heart_failure: increases risk (contribution: +0.412)
  - lives_alone: increases risk (contribution: +0.358)
  - admission_source_emergency: increases risk (contribution: +0.312)
```
This is what the discharge nurse sees: a 48% readmission risk driven by prior admissions, no PCP, living alone, heart failure diagnosis, and ED admission. The nurse knows exactly what to do: arrange home health services, connect the patient with a primary care provider, and schedule a heart failure clinic visit within 48 hours.
Lessons Learned
When to Choose the Simpler Model
Choose logistic regression over a complex model when:
- Stakeholders need per-prediction explanations. Not "the model says X," but "the model says X because of Y, Z, and W." Coefficients provide this natively.
- The AUC gap is less than 5 points. Below this threshold, the incremental value rarely justifies the operational costs of a complex model.
- Regulatory or compliance requirements exist. Healthcare (HIPAA, clinical decision support regulations), finance (fair lending laws), and insurance (rate-making transparency) all have interpretability requirements. A coefficient table is easier to defend than a SHAP waterfall.
- Retraining frequency matters. If the model needs weekly or daily retraining, logistic regression's 15-second training time beats XGBoost's 12 minutes, which beats a neural network's 4 hours.
- Trust is the bottleneck. If the end users (nurses, loan officers, underwriters) will not follow a model they do not understand, the model's AUC is irrelevant.
When to Choose the Complex Model
Choose XGBoost or other complex models when:
- The AUC gap exceeds 5 points. At some point, the performance difference is too large to ignore. If the complex model catches 50% more readmissions, interpretability concerns must be weighed against patient outcomes.
- The end user is a system, not a person. If predictions feed into an automated system (ad targeting, recommendation engine, fraud scoring) with no human decision-maker in the loop, interpretability matters less.
- You have invested in explanation infrastructure. If your team has built SHAP dashboards, trained end users, and validated that explanations are understood and used, the adoption risk decreases.
- Non-linear interactions dominate. If the true relationships are highly non-linear and logistic regression simply cannot capture them, no amount of interpretability compensates for predictions that are wrong.
Discussion Questions
1. Dr. Nwosu chose logistic regression partly because of clinician trust. Is this a rational decision, or is it privileging comfort over patient outcomes? Where is the ethical line?
2. If the XGBoost model achieved an AUC of 0.85 (11 points better), would the decision change? At what AUC gap does interpretability become secondary?
3. The false negative analysis in Case Study 1 showed that missed predictions tend to be "stealth" cases without obvious risk factors. Would XGBoost catch more of these? Design an analysis to test this hypothesis.
4. A colleague argues: "Just use XGBoost with SHAP. You get both performance and interpretability." What are the limitations of this argument?
5. In 2024, CMS updated the HRRP methodology to adjust for social risk factors. How would you incorporate this into the model? Should `medically_underserved_zip` and `insurance_type` be included as features, or could they introduce bias?
This case study supports Chapter 11: Linear Models Revisited. Return to the chapter for the full regularization treatment.