Case Study 2: Metro General Hospital --- When Prediction and Explanation Collide


Background

Dr. Sarah Nwosu, Chief Medical Officer of Metro General Hospital, has a problem that costs the hospital $2.1 million per year in federal penalties. The Centers for Medicare & Medicaid Services (CMS) penalizes hospitals with excess 30-day readmission rates under the Hospital Readmissions Reduction Program (HRRP). Metro General's rate is 17.3%, well above the national average of 15.6%.

Dr. Nwosu approaches the hospital's data analytics team with a seemingly simple request: "Build me a model that predicts which patients will be readmitted, so we can intervene before they come back."

This case study explores why that request is harder than it sounds --- and why the tension between prediction and explanation is not an academic curiosity but a clinical reality that affects patient outcomes.


The Hospital Context

Metro General is a 450-bed urban teaching hospital affiliated with a major university medical school. Key characteristics:

Metric                          | Value
Annual admissions               | ~28,000
30-day readmission rate         | 17.3%
National average                | 15.6%
Excess readmissions (annual)    | ~475
CMS penalty                     | $2.1M/year
Average cost per readmission    | $14,400
Total readmission cost burden   | ~$70M/year
Care coordination team size     | 12 nurses, 4 social workers

The patient population is diverse: 38% Medicare, 22% Medicaid, 31% commercial insurance, 9% uninsured or self-pay. Approximately 44% of patients come from zip codes classified as medically underserved. The hospital serves as a safety-net provider for the surrounding community.
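The figures above are internally consistent, as a quick back-of-the-envelope check shows (a sketch using the case study's own numbers; the dollar amounts are as stated in the table, not independently verified):

```python
# Back-of-the-envelope check of the readmission figures above.
admissions = 28_000
readmission_rate = 0.173
national_rate = 0.156
cost_per_readmission = 14_400

readmissions = admissions * readmission_rate                # ~4,844 per year
excess = admissions * (readmission_rate - national_rate)    # ~476 above the national average
total_cost = readmissions * cost_per_readmission            # ~$69.8M per year

print(f"Readmissions/year:   {readmissions:,.0f}")
print(f"Excess readmissions: {excess:,.0f}")
print(f"Total cost burden:   ${total_cost / 1e6:.1f}M")
```

This reproduces the table's ~475 excess readmissions and ~$70M annual cost burden.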


The Data

The analytics team has access to the following data sources, all available at or before the time of discharge:

Electronic Health Records (EHR):
  • Primary and secondary diagnoses (ICD-10 codes)
  • Procedures performed during the admission
  • Lab results (complete blood count, metabolic panel, hemoglobin A1c, etc.)
  • Vital signs at admission and discharge
  • Length of stay
  • Medications prescribed at discharge

Administrative Data:
  • Patient demographics (age, sex, insurance type)
  • Admission source (emergency department, transfer, scheduled)
  • Discharge disposition (home, home health, skilled nursing facility, etc.)
  • Prior admissions in the last 12 months

Limited Social Determinants:
  • Zip code (proxy for socioeconomic status)
  • Lives alone (self-reported, available for ~60% of patients)
  • Primary language
  • Has a primary care physician (yes/no)

What is NOT available:
  • Whether the patient fills their prescriptions after discharge (pharmacy data is siloed)
  • Whether the patient attends their follow-up appointment (data arrives 2-4 weeks later)
  • Home environment assessment (only available for patients receiving home health)
  • Caregiver support quality
  • Food security, transportation access (occasionally documented in social work notes, but not structured)


Two Models, Two Purposes

The analytics team builds two models. Both use the same data. Both predict the same target (readmission within 30 days). But they are designed for fundamentally different purposes, and this difference drives every design decision.

Model 1: The Prediction Machine

Objective: Maximize predictive accuracy. Identify which patients are most likely to be readmitted, regardless of why.

Approach: Gradient boosted trees with all available features. Hyperparameter tuning via 5-fold cross-validation. Feature engineering includes interaction terms and aggregated lab value trends.
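The case study does not show the hospital's code; the following is a minimal sketch of the Model 1 approach using scikit-learn on synthetic data. The features, data, and hyperparameter grid are illustrative stand-ins, not the hospital's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the discharge dataset (illustrative only).
n = 500
X = np.column_stack([
    rng.poisson(1.5, n),   # prior admissions in last 12 months
    rng.poisson(5, n),     # length of stay
    rng.poisson(7, n),     # number of discharge medications
])
# Readmission label loosely tied to prior admissions (synthetic).
y = (rng.random(n) < 0.1 + 0.08 * X[:, 0]).astype(int)

# Gradient boosted trees, tuned via 5-fold cross-validation as described above.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
probs = search.predict_proba(X)[:, 1]  # per-patient readmission risk scores
print(f"Best params: {search.best_params_}, CV AUC: {search.best_score_:.2f}")
```

SHAP importances like those in the table below would come from running the `shap` library's tree explainer over the fitted model.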

Top predictive features (by SHAP importance):

Rank | Feature                                            | SHAP Impact
1    | Number of admissions in last 12 months             | 0.142
2    | Length of stay (current admission)                 | 0.098
3    | Number of discharge medications                    | 0.087
4    | Charlson Comorbidity Index                         | 0.079
5    | Insurance type (Medicare/Medicaid vs. commercial)  | 0.065
6    | Admission source (ED vs. scheduled)                | 0.058
7    | Age                                                | 0.052
8    | Hemoglobin at discharge                            | 0.048
9    | Lives alone (where available)                      | 0.041
10   | Primary diagnosis group                            | 0.039

Performance: AUC = 0.81. At a threshold that flags the top 15% of patients, precision is 34% and recall is 41%.
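The threshold metrics quoted above come from flagging a fixed fraction of the highest-scoring patients. The mechanics can be sketched in a few lines (the scores and labels here are synthetic, chosen only to illustrate the computation):

```python
def precision_recall_at_top_fraction(scores, labels, fraction=0.15):
    """Flag the highest-scoring `fraction` of patients; score the flag."""
    n_flagged = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:n_flagged]
    true_pos = sum(label for _, label in flagged)  # flagged AND readmitted
    total_pos = sum(labels)                        # all readmitted
    precision = true_pos / n_flagged
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall

# Tiny synthetic example: 10 patients, 3 actual readmissions.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   1,   0,   1,   0,   0,   0,   0,   0,   0]
p, r = precision_recall_at_top_fraction(scores, labels, fraction=0.2)
print(f"precision={p:.2f}, recall={r:.2f}")  # → precision=1.00, recall=0.67
```

The same computation over Model 1's scores at the top-15% threshold yields the 34% precision and 41% recall reported above.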

The problem: The top predictor --- number of prior admissions --- is powerful but not actionable. You cannot change a patient's admission history. Similarly, Charlson Comorbidity Index (a severity score based on chronic conditions), insurance type, and age are strong predictors precisely because they are proxies for overall health status. The model identifies the sickest patients and says, effectively, "these people are sick and will probably come back." The clinical team already knows this.

Model 2: The Explanation Engine

Objective: Identify actionable factors associated with readmission --- things the care coordination team can actually change before or after discharge.

Approach: Regularized logistic regression with a deliberately restricted feature set. Only features representing modifiable risk factors or factors that inform specific interventions.

Feature set (restricted to actionable variables):

actionable_features = [
    'num_discharge_medications',       # Medication complexity
    'has_follow_up_scheduled',         # Post-discharge care plan
    'has_primary_care_physician',      # Continuity of care
    'lives_alone',                     # Social support
    'discharge_education_completed',   # Patient understanding
    'high_risk_medication_flag',       # Anticoagulants, insulin, opioids
    'language_barrier_flag',           # Communication barriers
    'transportation_barrier_flag',     # Access to follow-up
    'prior_no_show_rate',              # Engagement history
    'medication_change_count',         # Discharge medication changes
]

Performance: AUC = 0.68. Substantially lower than Model 1's 0.81.

What it reveals:

Feature                           | Coefficient | Interpretation
has_follow_up_scheduled = No      | +0.43       | Patients without a scheduled follow-up are at higher risk
num_discharge_medications > 8     | +0.38       | Medication complexity increases risk
lives_alone = Yes                 | +0.35       | Lack of home support increases risk
high_risk_medication_flag         | +0.31       | Certain medication classes need closer monitoring
language_barrier_flag             | +0.28       | Communication barriers impede self-management
has_primary_care_physician = No   | +0.26       | No continuity of care after discharge

The value: Every coefficient maps to an intervention. No follow-up scheduled? Schedule one before discharge. Too many medications? Trigger a pharmacist review. Lives alone? Refer to home health. Language barrier? Provide translated discharge instructions and a follow-up call in the patient's language.
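Because Model 2 is logistic regression, each coefficient converts to an odds ratio via exponentiation, which is how clinicians typically read such tables. A sketch using the coefficients above:

```python
import math

# Model 2 coefficients from the table above.
coefficients = {
    "has_follow_up_scheduled = No": 0.43,
    "num_discharge_medications > 8": 0.38,
    "lives_alone = Yes": 0.35,
    "high_risk_medication_flag": 0.31,
    "language_barrier_flag": 0.28,
    "has_primary_care_physician = No": 0.26,
}

# exp(coefficient) is the multiplicative change in readmission odds
# associated with the factor, holding the other features fixed.
for feature, coef in coefficients.items():
    print(f"{feature:32s} odds ratio = {math.exp(coef):.2f}")
```

For example, exp(0.43) ≈ 1.54: a missing follow-up appointment is associated with roughly 54% higher odds of readmission, all else equal.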


The Tension

Here is the core tension, stated plainly:

Model 1 predicts better. Model 2 helps more.

Model 1 has an AUC of 0.81. It ranks patients by risk with reasonable accuracy. But its top features --- prior admissions, comorbidity burden, age --- describe who the patient is, not what the care team can do. Flagging a 78-year-old with heart failure and three prior admissions does not tell the discharge nurse anything actionable.

Model 2 has an AUC of 0.68. By standard ML evaluation, it is worse. But it surfaces information the clinical team can act on today, before the patient leaves the hospital. The gap in AUC is the cost of restricting the feature set to actionable variables --- and it might be a cost worth paying.

The Hybrid Approach

In practice, Metro General deploys both models in a layered system:

Layer 1 (Triage): Model 1 scores every patient at discharge. The top 20% are flagged as "high readmission risk."

Layer 2 (Action): For flagged patients, Model 2's feature contributions are displayed on the discharge dashboard. The care coordination nurse sees not just "this patient is high risk" but "this patient is high risk, and here are the modifiable factors: no follow-up scheduled, 12 discharge medications, lives alone."

# Conceptual implementation of the layered approach
def generate_discharge_report(patient_data):
    """Generate a discharge risk report combining both models."""

    # Layer 1: Risk score (Model 1 - prediction machine)
    risk_score = model_prediction.predict_proba(
        patient_data[prediction_features]
    )[0, 1]

    report = {
        'patient_id': patient_data['patient_id'],
        'risk_score': risk_score,
        'risk_tier': 'HIGH' if risk_score > 0.25 else 'STANDARD',
    }

    # Layer 2: Actionable factors (Model 2 - explanation engine)
    if report['risk_tier'] == 'HIGH':
        report['actionable_risk_score'] = model_explanation.predict_proba(
            patient_data[actionable_features]
        )[0, 1]

        # Identify which actionable factors contribute to risk
        interventions = []
        if not patient_data['has_follow_up_scheduled']:
            interventions.append('Schedule follow-up within 7 days')
        if patient_data['num_discharge_medications'] > 8:
            interventions.append('Pharmacist medication review')
        if patient_data['lives_alone']:
            interventions.append('Refer to home health services')
        if patient_data['high_risk_medication_flag']:
            interventions.append('High-risk medication counseling')
        if patient_data['language_barrier_flag']:
            interventions.append('Translated materials + native-language follow-up call')

        report['interventions'] = interventions

    return report

This is not a compromise. It is a design pattern: use the best predictive model for triage, and the most interpretable model for action. The two models answer different questions, and both questions need answers.


Fairness Considerations

There is a third concern beyond prediction and explanation: fairness.

Model 1's reliance on insurance type as a top feature raises immediate questions. Medicaid patients are flagged as higher risk --- which is statistically true (Medicaid patients have higher readmission rates nationally). But is it fair to allocate more resources to patients based on insurance status? And is it equitable if the model systematically under-flags commercially insured patients who are at risk for different reasons?

Consider two patients:

  • Patient A: 65-year-old Medicare patient, heart failure, 2 prior admissions, 6 discharge medications, follow-up scheduled, lives with spouse.
  • Patient B: 42-year-old commercially insured patient, first admission for diabetic ketoacidosis, no prior admissions, 4 discharge medications, no follow-up scheduled, lives alone, no primary care physician.

Model 1 assigns Patient A a higher risk score (Medicare, prior admissions, heart failure). Model 2 flags more actionable risk factors for Patient B (no follow-up, lives alone, no PCP). Who should get the care coordination visit if only one can?

This is not a technical question. It is an ethical one. And it is one that data scientists must grapple with explicitly, not leave to the defaults of their algorithm. We will return to fairness in depth in Chapter 33.


Lessons for Problem Framing

This case study illustrates three principles that apply far beyond healthcare:

1. The best predictive model is not always the most useful model. Usefulness depends on what action the prediction supports. A perfectly accurate but unactionable prediction has no business value.

2. Feature selection is a modeling choice AND an ethical choice. Restricting features to actionable variables reduces predictive accuracy but increases the model's utility for intervention. Including demographic features (age, race, insurance type) may improve prediction but raises fairness concerns.

3. Prediction and explanation serve different masters. When a stakeholder says "I need a model," your first question should be: "Do you need to predict an outcome or understand what drives it?" The answer determines everything --- the model architecture, the feature set, the evaluation metric, and the deployment strategy.


Discussion Questions

  1. The AUC gap. Model 1 achieves AUC = 0.81 and Model 2 achieves AUC = 0.68. A data scientist might argue that 0.81 is clearly better. A clinician might argue that 0.68 with actionable features is more valuable. Who is right? Under what circumstances would you recommend deploying Model 2 alone?

  2. Missing data as signal. The "lives alone" feature is available for only ~60% of patients. The analytics team could (a) drop the feature, (b) impute the missing values, or (c) create a "lives_alone_unknown" indicator and include it as a feature. Which approach would you recommend, and why? Could the missingness itself be informative?

  3. Temporal dynamics. The current model predicts readmission at the time of discharge. But readmission risk is not static --- it changes over the 30-day window. A patient who fills their prescriptions and attends their follow-up is lower risk at day 14 than at day 1. How would you redesign the system to update risk scores over time? What new data sources would you need?

  4. The CMS incentive problem. CMS penalizes hospitals for excess readmissions in specific diagnostic categories (heart failure, pneumonia, hip/knee replacement, etc.). Should Metro General build separate models for each category, or one model for all readmissions? What are the tradeoffs?

  5. Gaming the model. Suppose the hospital deploys the readmission model and care coordination teams receive bonuses based on reducing readmissions among flagged patients. What perverse incentives might this create? How would you design the incentive system to avoid gaming?

  6. Transfer learning. Metro General develops its readmission model using its own patient data. A nearby rural hospital (120 beds, different patient demographics, different disease burden) wants to use the same model. What are the risks? What validation steps would you require before deployment at the new site?


This case study supports Chapter 1: From Analysis to Prediction. Return to the chapter or continue to Exercises.