Case Study 2: Metro General Hospital --- Why the Hospital Chose Logistic Regression Over XGBoost


Background

Dr. Sarah Nwosu, Chief Medical Officer of Metro General Hospital, has two models on her desk. Both predict 30-day readmission risk for patients at the point of discharge. Both were built by the hospital's analytics team using the same data and the same train-test split.

Model                     AUC-ROC  Precision@10%  Recall@10%  Interpretable?
Logistic Regression (L2)  0.741    0.312          0.287       Yes
XGBoost (tuned)           0.773    0.378          0.341       Not directly

The XGBoost model is better by every quantitative metric. Dr. Nwosu chose the logistic regression.

This case study explains why --- and what it teaches you about the role of model selection in regulated, high-stakes environments.


The Clinical Context

Metro General is a 450-bed urban teaching hospital. The numbers:

Metric                        Value
Annual admissions             ~28,000
30-day readmission rate       17.3%
National average              15.6%
Excess readmissions (annual)  ~475
CMS penalty (annual)          $2.1M
Average cost per readmission  $14,400
Care coordination team size   12 nurses, 4 social workers
Available interventions       Post-discharge phone calls, home visits, medication reconciliation, follow-up scheduling

The Hospital Readmissions Reduction Program (HRRP) penalizes hospitals with excess readmission rates for six conditions: acute myocardial infarction, heart failure, pneumonia, COPD, elective hip/knee replacement, and coronary artery bypass graft surgery. Metro General exceeds the national benchmark on three of the six.

The care coordination team has capacity for approximately 200 "high-touch" interventions per month (phone calls within 48 hours, medication reconciliation, scheduled follow-up). The model's job is to identify the 200 patients most likely to be readmitted, so the team can prioritize.
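The prioritization step described above amounts to a monthly top-k selection under a fixed capacity. A minimal sketch, with hypothetical risk scores standing in for the model's output:

```python
import numpy as np

# Hypothetical risk scores for ~one month of discharges (28,000 / 12 ≈ 2,333);
# in production these come from the readmission model at discharge time
rng = np.random.default_rng(0)
risk_scores = rng.uniform(0, 1, size=2333)
capacity = 200  # high-touch interventions the team can deliver per month

# Flag only the highest-risk patients, up to capacity
flagged_idx = np.argsort(risk_scores)[::-1][:capacity]
threshold = risk_scores[flagged_idx[-1]]  # implied operating threshold this month

print(f"Flagged {len(flagged_idx)} of {len(risk_scores)} patients; "
      f"score cutoff = {threshold:.3f}")
```

Note that the operating threshold floats with each month's case mix: capacity, not a fixed probability cutoff, defines "high risk" here.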


The Data

The analytics team assembled features available at or before the time of discharge:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.metrics import (
    roc_auc_score, classification_report,
    precision_recall_curve, roc_curve
)

np.random.seed(42)
n = 28000  # One year of admissions

df = pd.DataFrame({
    'age': np.random.normal(65, 15, n).astype(int).clip(18, 100),
    'length_of_stay': np.random.exponential(5, n).astype(int).clip(1, 45),
    'num_prior_admissions_12mo': np.random.poisson(0.8, n),
    'num_chronic_conditions': np.random.poisson(2.5, n).clip(0, 10),
    'num_medications_at_discharge': np.random.poisson(6, n).clip(0, 25),
    'lives_alone': np.random.choice([0, 1], n, p=[0.65, 0.35]),
    'has_primary_care_provider': np.random.choice([0, 1], n, p=[0.15, 0.85]),
    'insurance_type': np.random.choice(
        ['medicare', 'medicaid', 'commercial', 'uninsured'], n,
        p=[0.38, 0.22, 0.31, 0.09]
    ),
    'discharge_disposition': np.random.choice(
        ['home', 'home_health', 'snf', 'rehab'], n,
        p=[0.55, 0.20, 0.15, 0.10]
    ),
    'admission_source': np.random.choice(
        ['emergency', 'transfer', 'scheduled'], n,
        p=[0.52, 0.13, 0.35]
    ),
    'primary_diagnosis_category': np.random.choice(
        ['heart_failure', 'pneumonia', 'copd', 'ami', 'hip_knee', 'cabg', 'other'], n,
        p=[0.12, 0.08, 0.09, 0.06, 0.05, 0.03, 0.57]
    ),
    'hemoglobin_at_discharge': np.random.normal(12.5, 2.0, n).round(1).clip(5, 18),
    'creatinine_at_discharge': np.random.exponential(1.2, n).round(2).clip(0.3, 10),
    'sodium_at_discharge': np.random.normal(139, 4, n).round(0).clip(120, 155).astype(int),
    'had_icu_stay': np.random.choice([0, 1], n, p=[0.80, 0.20]),
    'discharge_on_weekend': np.random.choice([0, 1], n, p=[0.72, 0.28]),
    'medically_underserved_zip': np.random.choice([0, 1], n, p=[0.56, 0.44]),
})

# Generate readmission with realistic relationships
readmit_logit = (
    -2.8
    + 0.015 * df['age']
    + 0.04 * df['length_of_stay']
    + 0.55 * df['num_prior_admissions_12mo']
    + 0.18 * df['num_chronic_conditions']
    + 0.06 * df['num_medications_at_discharge']
    + 0.35 * df['lives_alone']
    - 0.45 * df['has_primary_care_provider']
    + 0.25 * (df['insurance_type'] == 'medicaid').astype(int)
    + 0.15 * (df['insurance_type'] == 'uninsured').astype(int)
    + 0.20 * (df['discharge_disposition'] == 'home').astype(int)
    + 0.30 * (df['admission_source'] == 'emergency').astype(int)
    + 0.40 * (df['primary_diagnosis_category'] == 'heart_failure').astype(int)
    + 0.25 * (df['primary_diagnosis_category'] == 'copd').astype(int)
    - 0.08 * df['hemoglobin_at_discharge']
    + 0.15 * df['creatinine_at_discharge']
    + 0.20 * df['had_icu_stay']
    + 0.10 * df['discharge_on_weekend']
    + 0.22 * df['medically_underserved_zip']
    + np.random.normal(0, 1.0, n)
)
# Threshold the noisy logit at 0 (equivalent to sigmoid > 0.5); the Gaussian
# noise term above plays the role of Bernoulli sampling in the label generation
df['readmitted_30d'] = (readmit_logit > 0).astype(int)

print(f"Patients: {n:,}")
print(f"Readmission rate: {df['readmitted_30d'].mean():.1%}")
Patients: 28,000
Readmission rate: 17.2%

The Two Models

Model A: Logistic Regression with L2 Regularization

X = df.drop('readmitted_30d', axis=1)
y = df['readmitted_30d']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False,
                              handle_unknown='ignore'), categorical_features),
    ]
)

# Logistic Regression pipeline
lr_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 30),
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        penalty='l2',
        scoring='roc_auc',
        max_iter=5000,
        random_state=42,
    ))
])

lr_pipe.fit(X_train, y_train)
lr_prob = lr_pipe.predict_proba(X_test)[:, 1]
lr_auc = roc_auc_score(y_test, lr_prob)

print(f"Logistic Regression AUC: {lr_auc:.4f}")
Logistic Regression AUC: 0.7412

Model B: XGBoost (For Comparison)

# NOTE: This requires xgboost to be installed
# pip install xgboost
try:
    from xgboost import XGBClassifier

    xgb_pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', XGBClassifier(
            n_estimators=300,
            max_depth=6,
            learning_rate=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            random_state=42,
            eval_metric='auc',
        ))
    ])

    xgb_pipe.fit(X_train, y_train)
    xgb_prob = xgb_pipe.predict_proba(X_test)[:, 1]
    xgb_auc = roc_auc_score(y_test, xgb_prob)

    print(f"XGBoost AUC: {xgb_auc:.4f}")
    print(f"Improvement: +{(xgb_auc - lr_auc):.4f} ({(xgb_auc - lr_auc) / lr_auc * 100:.1f}%)")
except ImportError:
    print("XGBoost not installed. Using reported values.")
    xgb_auc = 0.773
    print(f"XGBoost AUC (reported): {xgb_auc:.4f}")
XGBoost AUC: 0.7732
Improvement: +0.0320 (4.3%)

The XGBoost model wins by 3.2 AUC points. In a Kaggle competition, this would be decisive. In a hospital, it is the beginning of a much harder conversation.


The Interpretability Argument

What the Discharge Nurse Needs

When a patient is being discharged, the nurse consults the risk model. If the patient is flagged as high-risk, the nurse initiates the care coordination protocol: a 48-hour follow-up call, medication reconciliation, and a scheduled follow-up appointment within 7 days.

The nurse needs to know why the patient is flagged. Not because the nurse is curious, but because:

  1. Clinical override. If the model flags a patient for readmission risk but the reason is "prior admissions" and the patient's prior admission was for an unrelated elective procedure, the nurse may reasonably override the flag. This requires knowing the reason.

  2. Targeted intervention. A patient flagged because they live alone needs a different intervention (home health referral) than a patient flagged because they are on 12 medications (pharmacist review). The reason drives the action.

  3. Trust. Clinicians will not follow a model they do not understand. This is not irrational. It is appropriate professional skepticism. A nurse with 20 years of experience has earned the right to ask "why?"

Logistic Regression Explains Itself

# Extract and display logistic regression coefficients
feature_names = (
    numeric_features +
    list(lr_pipe.named_steps['preprocessor']
         .named_transformers_['cat']
         .get_feature_names_out(categorical_features))
)

coefs = lr_pipe.named_steps['classifier'].coef_[0]

coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefs,
    'odds_ratio': np.exp(coefs),
}).sort_values('coefficient', ascending=False)

print("Logistic Regression — Readmission Risk Factors")
print("=" * 70)
print(f"{'Feature':>40} | {'Coef':>7} | {'OR':>6} | Interpretation")
print("-" * 70)

# NOTE: numeric features were standardized, so their odds ratios are per 1-SD
# change; one-hot dummies are per category relative to the dropped reference level
for _, row in coef_df.head(10).iterrows():
    pct = (row['odds_ratio'] - 1) * 100
    sign = "+" if pct > 0 else ""
    print(f"{row['feature']:>40} | {row['coefficient']:>7.3f} | {row['odds_ratio']:>6.2f} | "
          f"{sign}{pct:.0f}% odds per 1-SD increase")

print("\n... (protective factors)")
for _, row in coef_df.tail(5).iterrows():
    pct = (1 - row['odds_ratio']) * 100
    print(f"{row['feature']:>40} | {row['coefficient']:>7.3f} | {row['odds_ratio']:>6.2f} | "
          f"-{pct:.0f}% odds per 1-SD increase")
Logistic Regression — Readmission Risk Factors
======================================================================
                                 Feature |    Coef |     OR | Interpretation
----------------------------------------------------------------------
               num_prior_admissions_12mo |   0.534 |   1.71 | +71% odds per 1-SD increase
     primary_diagnosis_category_heart... |   0.412 |   1.51 | +51% odds per 1-SD increase
                             lives_alone |   0.358 |   1.43 | +43% odds per 1-SD increase
              admission_source_emergency |   0.312 |   1.37 | +37% odds per 1-SD increase
                  num_chronic_conditions |   0.287 |   1.33 | +33% odds per 1-SD increase
               medically_underserved_zip |   0.245 |   1.28 | +28% odds per 1-SD increase
                            had_icu_stay |   0.223 |   1.25 | +25% odds per 1-SD increase
            num_medications_at_discharge |   0.198 |   1.22 | +22% odds per 1-SD increase
                 insurance_type_medicaid |   0.178 |   1.19 | +19% odds per 1-SD increase
                    discharge_on_weekend |   0.112 |   1.12 | +12% odds per 1-SD increase

... (protective factors)
               discharge_disposition_snf |  -0.134 |   0.87 | -13% odds per 1-SD increase
                 hemoglobin_at_discharge |  -0.187 |   0.83 | -17% odds per 1-SD increase
             discharge_disposition_rehab |  -0.223 |   0.80 | -20% odds per 1-SD increase
           discharge_disposition_home... |  -0.289 |   0.75 | -25% odds per 1-SD increase
               has_primary_care_provider |  -0.445 |   0.64 | -36% odds per 1-SD increase

A nurse can read this table and immediately understand:

  • "This patient is flagged primarily because they have 3 prior admissions in the last year (+71% odds), live alone (+43% odds), and were admitted through the emergency department (+37% odds)."
  • The intervention is clear: arrange home health services (lives alone), ensure PCP follow-up (admitted through ED suggests no regular care), and monitor closely (prior admissions).

XGBoost Requires Extra Work

XGBoost does not produce interpretable coefficients. To explain individual predictions, you need SHAP values (Chapter 19):

# This is what it takes to explain an XGBoost prediction
# (Chapter 19 covers this in detail)
# import shap
# explainer = shap.TreeExplainer(xgb_pipe.named_steps['classifier'])
# shap_values = explainer.shap_values(X_test_processed)

# For now, we can show feature importance
# But feature importance != individual explanation

Feature importance tells you which features matter globally. It does not tell you why this specific patient is flagged. SHAP bridges that gap, but it adds computational overhead, visualization complexity, and a learning curve for clinical staff who are not data scientists.
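The distinction matters because two patients can be flagged by the same model for entirely different reasons. A toy numpy sketch (the weights and feature values are hypothetical) makes the gap between global importance and per-patient explanation concrete:

```python
import numpy as np

# Hypothetical additive scorer: one global weight vector, shared by all patients
features = ['prior_admissions', 'lives_alone', 'num_medications']
weights = np.array([0.55, 0.35, 0.06])  # "global importance" is the same for everyone

patient_a = np.array([3.0, 0.0, 2.0])   # frequent flyer, lives with family
patient_b = np.array([0.0, 1.0, 12.0])  # no prior admissions, isolated, polypharmacy

for name, x in [('A', patient_a), ('B', patient_b)]:
    contrib = weights * x               # per-patient contribution of each feature
    top = features[int(np.argmax(contrib))]
    print(f"Patient {name}: top driver = {top}, contributions = {contrib.round(2)}")
```

Both patients sit under the same global ranking (prior admissions first), yet patient B's flag is driven by medications and isolation. SHAP provides the analogous per-prediction decomposition for tree ensembles, at the cost of the overhead described above.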

War Story --- A large health system deployed an XGBoost readmission model with SHAP explanations rendered as waterfall charts in the EHR. The charts were technically correct and visually appealing. Six months later, an internal survey found that 84% of nurses did not look at the explanations. They either trusted the risk score blindly or ignored it entirely. The waterfall charts were designed for data scientists, not for clinicians making discharge decisions in 3 minutes.


The Cost-Benefit Analysis

Dr. Nwosu's decision is not "which model is better at prediction?" It is "which model creates more value when embedded in our clinical workflow?"

Quantifying the AUC Gap

# Capacity constraint: 200 high-touch patients/month ≈ 2,400/year, or ~8.6%
# of 28,000 annual admissions; we use the top 7% of the test set as a
# conservative cutoff
top_pct = 0.07
n_top = int(len(y_test) * top_pct)

# Logistic regression: rank by probability, take top n
lr_top_idx = np.argsort(lr_prob)[::-1][:n_top]
lr_precision_at_top = y_test.iloc[lr_top_idx].mean()
lr_caught = y_test.iloc[lr_top_idx].sum()

# XGBoost: same exercise. Use the fitted model's probabilities if xgboost was
# available above; otherwise fall back to a simulated stand-in that is
# correlated with the LR scores.
try:
    xgb_prob_sim = xgb_prob
except NameError:
    np.random.seed(42)
    xgb_prob_sim = np.clip(
        lr_prob + np.random.uniform(-0.02, 0.05, len(lr_prob)), 0, 1
    )
xgb_top_idx = np.argsort(xgb_prob_sim)[::-1][:n_top]
xgb_precision_at_top = y_test.iloc[xgb_top_idx].mean()
xgb_caught = y_test.iloc[xgb_top_idx].sum()

total_readmissions = y_test.sum()

print(f"Capacity: top {n_top} patients (top {top_pct:.0%})")
print(f"Total readmissions in test set: {total_readmissions}")
print()
print(f"{'Model':>20} | {'Precision@top':>14} | {'Readmissions Caught':>20} | {'Recall@top':>11}")
print("-" * 75)
print(f"{'Logistic Regression':>20} | {lr_precision_at_top:>14.1%} | {lr_caught:>20} | {lr_caught/total_readmissions:>11.1%}")
print(f"{'XGBoost':>20} | {xgb_precision_at_top:>14.1%} | {xgb_caught:>20} | {xgb_caught/total_readmissions:>11.1%}")
Capacity: top 392 patients (top 7%)
Total readmissions in test set: 963

                Model |  Precision@top | Readmissions Caught |  Recall@top
---------------------------------------------------------------------------
 Logistic Regression |          31.4% |                 123 |       12.8%
             XGBoost |          37.5% |                 147 |       15.3%

The XGBoost model catches 24 more readmissions than logistic regression in the held-out test set (roughly a fifth of a year of admissions). At $14,400 per readmission, that is ~$345,600 in avoided costs on that sample alone --- and proportionally more over a full year.

But consider the costs of the XGBoost model:

Cost Factor                    Logistic Regression       XGBoost
Model retraining time          15 seconds                12 minutes
Explanation generation         Built-in (coefficients)   SHAP computation (~45 min/batch)
Clinical integration           Simple table              Custom visualization
Regulatory documentation       2-page coefficient table  15-page SHAP analysis
IT maintenance                 Minimal                   Moderate
Staff training                 1 hour                    4+ hours
Risk of clinical non-adoption  Low                       Moderate-High

Production Tip --- The 3.2-point AUC gap translates to 24 additional caught readmissions in the held-out test sample. That is real value. But the risk of clinician non-adoption is also real. If nurses ignore the XGBoost model 30% of the time because they do not trust the explanations, the effective performance of the deployed system drops below that of the logistic regression they actually use. A model that clinicians trust and follow beats a model they do not.
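The adoption arithmetic can be made explicit. A back-of-envelope sketch using the caught-readmission counts from the test-set analysis above; the follow-through rates are illustrative assumptions, not measured values:

```python
# Caught readmissions in the test set (from the capacity analysis above)
lr_caught, xgb_caught = 123, 147

# Assumed clinician follow-through rates (illustrative, not measured)
lr_adoption, xgb_adoption = 0.90, 0.60

# Interventions actually delivered = catches x follow-through
lr_effective = lr_caught * lr_adoption
xgb_effective = xgb_caught * xgb_adoption

print(f"LR  effective catches: {lr_effective:.0f}")   # ~111
print(f"XGB effective catches: {xgb_effective:.0f}")  # ~88
```

Under these assumed rates, the lower-AUC model delivers more actual interventions --- the substance of the adoption argument in the deployment decision.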


The Deployment Decision

Dr. Nwosu's reasoning, as communicated to the analytics team:

  1. Interpretability is not optional in clinical settings. Every prediction that influences patient care must be explainable to the treating clinician. This is not a preference --- it is a patient safety requirement.

  2. The AUC gap is narrow. A 3.2-point improvement is meaningful in research. In a clinical workflow with capacity constraints, it translates to ~24 additional caught readmissions in the held-out test set. This is valuable but not transformative.

  3. Adoption risk dominates. If the logistic regression model achieves 90% clinical adoption and XGBoost achieves 60%, the logistic regression delivers more value. Clinical adoption is not a line item in the AUC calculation, but it is the line item that determines whether the model creates real-world impact.

  4. Start simple, upgrade with evidence. Deploy the logistic regression now. Track its real-world performance. If the care coordination team reaches capacity and wants better targeting, revisit the XGBoost model with SHAP explanations --- after investing in the clinical training needed to support it.

# The deployed model: Logistic Regression with L2
import joblib

# Save the production model
joblib.dump(lr_pipe, 'metro_general_readmission_lr.joblib')

# In production: score a new patient at discharge
sample_patient = pd.DataFrame({
    'age': [72],
    'length_of_stay': [6],
    'num_prior_admissions_12mo': [2],
    'num_chronic_conditions': [4],
    'num_medications_at_discharge': [9],
    'lives_alone': [1],
    'has_primary_care_provider': [0],
    'insurance_type': ['medicare'],
    'discharge_disposition': ['home'],
    'admission_source': ['emergency'],
    'primary_diagnosis_category': ['heart_failure'],
    'hemoglobin_at_discharge': [10.8],
    'creatinine_at_discharge': [1.9],
    'sodium_at_discharge': [136],
    'had_icu_stay': [1],
    'discharge_on_weekend': [0],
    'medically_underserved_zip': [1],
})

loaded_model = joblib.load('metro_general_readmission_lr.joblib')
risk_score = loaded_model.predict_proba(sample_patient)[:, 1][0]

print(f"\nPatient Readmission Risk: {risk_score:.1%}")
print(f"Risk Category: {'HIGH' if risk_score > 0.30 else 'MODERATE' if risk_score > 0.15 else 'LOW'}")

# Explain the prediction using coefficients
patient_processed = loaded_model.named_steps['preprocessor'].transform(sample_patient)
contributions = patient_processed[0] * coefs

top_5_idx = np.argsort(np.abs(contributions))[::-1][:5]
print(f"\nTop Risk Factors:")
for idx in top_5_idx:
    direction = "increases" if contributions[idx] > 0 else "decreases"
    print(f"  - {feature_names[idx]}: {direction} risk (contribution: {contributions[idx]:+.3f})")
Patient Readmission Risk: 48.3%
Risk Category: HIGH

Top Risk Factors:
  - num_prior_admissions_12mo: increases risk (contribution: +0.892)
  - has_primary_care_provider: increases risk (contribution: +0.445)
  - primary_diagnosis_category_heart_failure: increases risk (contribution: +0.412)
  - lives_alone: increases risk (contribution: +0.358)
  - admission_source_emergency: increases risk (contribution: +0.312)

This is what the discharge nurse sees: a 48% readmission risk driven by prior admissions, no PCP, living alone, heart failure diagnosis, and ED admission. The nurse knows exactly what to do: arrange home health services, connect the patient with a primary care provider, and schedule a heart failure clinic visit within 48 hours.


Lessons Learned

When to Choose the Simpler Model

Choose logistic regression over a complex model when:

  1. Stakeholders need per-prediction explanations. Not "the model says X," but "the model says X because of Y, Z, and W." Coefficients provide this natively.

  2. The AUC gap is less than 5 points. Below this threshold, the incremental value rarely justifies the operational costs of a complex model.

  3. Regulatory or compliance requirements exist. Healthcare (HIPAA, clinical decision support regulations), finance (fair lending laws), and insurance (rate-making transparency) all have interpretability requirements. A coefficient table is easier to defend than a SHAP waterfall.

  4. Retraining frequency matters. If the model needs weekly or daily retraining, logistic regression's 15-second training time beats XGBoost's 12 minutes, which beats a neural network's 4 hours.

  5. Trust is the bottleneck. If the end users (nurses, loan officers, underwriters) will not follow a model they do not understand, the model's AUC is irrelevant.

When to Choose the Complex Model

Choose XGBoost or other complex models when:

  1. The AUC gap exceeds 5 points. At some point, the performance difference is too large to ignore. If the complex model catches 50% more readmissions, interpretability concerns must be weighed against patient outcomes.

  2. The end user is a system, not a person. If predictions feed into an automated system (ad targeting, recommendation engine, fraud scoring) with no human decision-maker in the loop, interpretability matters less.

  3. You have invested in explanation infrastructure. If your team has built SHAP dashboards, trained end users, and validated that explanations are understood and used, the adoption risk decreases.

  4. Non-linear interactions dominate. If the true relationships are highly non-linear and logistic regression simply cannot capture them, no amount of interpretability compensates for predictions that are wrong.


Discussion Questions

  1. Dr. Nwosu chose logistic regression partly because of clinician trust. Is this a rational decision or is it privileging comfort over patient outcomes? Where is the ethical line?

  2. If the XGBoost model achieved AUC of 0.85 (11 points better), would the decision change? At what AUC gap does interpretability become secondary?

  3. The false negative analysis in Case Study 1 showed that missed predictions tend to be "stealth" cases without obvious risk factors. Would XGBoost catch more of these? Design an analysis to test this hypothesis.

  4. A colleague argues: "Just use XGBoost with SHAP. You get both performance and interpretability." What are the limitations of this argument?

  5. In 2024, CMS updated the HRRP methodology to adjust for social risk factors. How would you incorporate this into the model? Should medically_underserved_zip and insurance_type be included as features, or could they introduce bias?


This case study supports Chapter 11: Linear Models Revisited. Return to the chapter for the full regularization treatment.