Case Study 2: Metro General --- SHAP for Clinicians


Background

Metro General Hospital is a 520-bed academic medical center in the mid-Atlantic United States. Its quality improvement team built a 30-day readmission prediction model for heart failure patients in Chapter 16. The model was evaluated, the class imbalance was addressed in Chapter 17, and the hyperparameters were tuned in Chapter 18. The final model is a gradient boosting classifier with an AUC of 0.81 and a recall of 0.72 at a threshold optimized for the hospital's intervention capacity.

The model is clinically promising. The intervention --- a home visit within 48 hours, daily phone check-ins for two weeks, and a guaranteed follow-up within 7 days --- reduces readmission rates by 40% for patients who receive it. The cost is $850 per patient. The hospital can afford to intervene on the top 20% of discharged heart failure patients (about 17 patients per month out of 85 monthly heart failure discharges).
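
The capacity math above is worth making explicit. A quick sketch using only the figures quoted in this section:

```python
# Figures quoted above: 85 monthly HF discharges, top 20% flagged, $850 each
monthly_discharges = 85
intervention_fraction = 0.20
cost_per_patient = 850

n_flagged = round(monthly_discharges * intervention_fraction)
monthly_budget = n_flagged * cost_per_patient

print(f"Patients flagged per month: {n_flagged}")           # 17
print(f"Monthly intervention budget: ${monthly_budget:,}")  # $14,450
```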

There is one problem. The cardiologists will not use the model.

Dr. Sarah Okonkwo, the chief of cardiology, put it this way in a department meeting: "You are asking me to trust a black box that assigns risk scores to my patients. I have been treating heart failure for 22 years. I can look at a patient and tell you their risk. If you cannot tell me why your model flags a patient, I am not changing my workflow."

She is right. The model needs to explain itself.


The Data and Model

import numpy as np
import pandas as pd
import shap
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, recall_score

np.random.seed(42)
n = 4200
readmit_rate = 0.22

data = pd.DataFrame({
    'age': np.random.normal(72, 12, n).clip(30, 100).round(0).astype(int),
    'ejection_fraction': np.random.normal(35, 12, n).clip(10, 70).round(0).astype(int),
    'length_of_stay': np.random.exponential(5.5, n).round(1).clip(1, 30),
    'num_prior_admissions_1yr': np.random.poisson(1.8, n),
    'num_comorbidities': np.random.poisson(3.2, n),
    'creatinine_discharge': np.random.normal(1.4, 0.6, n).clip(0.5, 5.0).round(2),
    'sodium_discharge': np.random.normal(138, 4, n).clip(125, 150).round(0).astype(int),
    'hemoglobin_discharge': np.random.normal(11.5, 2.0, n).clip(6, 17).round(1),
    'systolic_bp_discharge': np.random.normal(125, 20, n).clip(80, 200).round(0).astype(int),
    'bmi': np.random.normal(29, 6, n).clip(15, 55).round(1),
    'lives_alone': np.random.binomial(1, 0.35, n),
    'has_primary_care': np.random.binomial(1, 0.72, n),
    'num_medications': np.random.poisson(8, n).clip(0, 25),
    'depression_screen_positive': np.random.binomial(1, 0.28, n),
    'insurance_type': pd.Categorical(
        np.random.choice(
            ['medicare', 'medicaid', 'private', 'self_pay'], n,
            p=[0.55, 0.18, 0.22, 0.05]
        )
    ).codes,
    'discharge_disposition': pd.Categorical(
        np.random.choice(
            ['home', 'home_health', 'snf', 'rehab'], n,
            p=[0.50, 0.25, 0.15, 0.10]
        )
    ).codes,
})

# Readmission probability
readmit_logit = (
    -2.5
    + 0.02 * data['age']
    - 0.03 * data['ejection_fraction']
    + 0.06 * data['length_of_stay']
    + 0.35 * data['num_prior_admissions_1yr']
    + 0.12 * data['num_comorbidities']
    + 0.30 * data['creatinine_discharge']
    - 0.02 * data['sodium_discharge']
    - 0.08 * data['hemoglobin_discharge']
    + 0.25 * data['lives_alone']
    - 0.20 * data['has_primary_care']
    + 0.03 * data['num_medications']
    + 0.15 * data['depression_screen_positive']
    + np.random.normal(0, 0.4, n)
)
readmit_prob = 1 / (1 + np.exp(-readmit_logit))
data['readmitted'] = np.random.binomial(1, readmit_prob)

feature_names = [
    'age', 'ejection_fraction', 'length_of_stay',
    'num_prior_admissions_1yr', 'num_comorbidities',
    'creatinine_discharge', 'sodium_discharge',
    'hemoglobin_discharge', 'systolic_bp_discharge', 'bmi',
    'lives_alone', 'has_primary_care', 'num_medications',
    'depression_screen_positive', 'insurance_type',
    'discharge_disposition'
]

X = data[feature_names]
y = data['readmitted']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=4,
    min_samples_leaf=20, subsample=0.8,
    random_state=42
)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, probs):.3f}")
print(f"Readmission rate: {y.mean():.1%}")
print(f"Test set size: {len(X_test)}")
Test AUC: 0.812
Readmission rate: 22.4%
Test set size: 840
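
A note on the phrase "a threshold optimized for the hospital's intervention capacity": with a fixed budget covering the top 20% of patients, the cutoff is simply the 80th percentile of the predicted probabilities. A minimal sketch with synthetic scores standing in for the model's output (the beta distribution is illustrative only, not fit to the hospital's data):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.beta(2, 6, size=85)  # synthetic risk scores for one month

# Capacity-based threshold: flag the top 20% of patients
threshold = np.quantile(scores, 0.80)
flagged = scores >= threshold
print(f"Patients flagged: {flagged.sum()} of {len(scores)}")  # 17 of 85
```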

What Dr. Okonkwo Needs to See

Dr. Okonkwo's objection is not technical. It is about trust. She needs three things:

  1. Global understanding: "What does the model look at?" She needs to see that the model's priorities align with her clinical knowledge.
  2. Local explanations: "Why did the model flag this patient?" She needs to verify, patient by patient, that the model's reasoning makes sense.
  3. Override mechanism: "What if I disagree?" She needs to know that she can override the model and that her overrides will be recorded and reviewed.

Step 1: Global Interpretation --- Does the Model Know Medicine?

# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: what does the model prioritize?
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test, plot_type="dot", show=False)
plt.title("Heart Failure Readmission Model: Feature Importance")
plt.tight_layout()
plt.savefig("cs2_global_summary.png", dpi=150, bbox_inches='tight')
plt.show()

The summary plot is where Dr. Okonkwo's skepticism either grows or shrinks. If the model's top features are clinically meaningful, trust increases. If the model relies on insurance type or some administrative artifact, trust evaporates.

Expected findings based on the data-generating process:

  • num_prior_admissions_1yr (prior admissions): red dots on the right. Patients with many prior admissions are at higher risk. This aligns perfectly with clinical knowledge --- prior utilization is the single strongest predictor of future utilization.
  • creatinine_discharge (kidney function): red dots on the right. Elevated creatinine indicates impaired renal function, a well-established readmission risk factor in heart failure.
  • ejection_fraction (cardiac function): blue dots on the right. Low ejection fraction (blue = low value) pushes toward readmission. This is exactly what clinicians expect --- reduced ejection fraction means the heart is not pumping effectively.
  • lives_alone (social factor): a binary feature where red (=1, lives alone) pushes toward readmission. Social isolation is a known risk factor for readmission.
  • has_primary_care (care continuity): blue dots on the right --- lacking a PCP (blue = 0) pushes toward readmission. Access to follow-up care is protective.
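
These expectations can be checked numerically rather than read off the plot: mean absolute SHAP value per feature is the standard global importance measure. A library-free sketch (the tiny `sv` matrix below is synthetic, standing in for the real `shap_values` array):

```python
import numpy as np

def rank_by_mean_abs_shap(shap_values, feature_names, top_n=5):
    """Rank features by mean |SHAP| across patients (global importance)."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1][:top_n]
    return [(feature_names[i], float(importance[i])) for i in order]

# Synthetic 3-patient x 3-feature SHAP matrix, for illustration only
sv = np.array([[ 0.40, -0.10,  0.05],
               [ 0.30,  0.20, -0.05],
               [-0.50,  0.10,  0.00]])
names = ['num_prior_admissions_1yr', 'creatinine_discharge', 'bmi']
print(rank_by_mean_abs_shap(sv, names, top_n=2))
```

Running the same function on the real `shap_values` and `feature_names` yields the top-5 list presented to Dr. Okonkwo below.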

Presenting to Dr. Okonkwo

You do not show Dr. Okonkwo the SHAP summary plot. You translate it.

"Dr. Okonkwo, the model prioritizes five factors when assessing readmission risk:

  1. Number of prior admissions in the past year --- the single strongest predictor. Patients who have been admitted multiple times are at the highest risk.
  2. Creatinine at discharge --- elevated creatinine signals kidney impairment, which the model treats as a strong risk factor.
  3. Ejection fraction --- lower EF means higher readmission risk.
  4. Whether the patient lives alone --- social isolation elevates risk.
  5. Whether the patient has a primary care provider --- patients without a PCP are at higher risk because follow-up is harder.

Does this match your clinical experience?"

If the answer is yes --- and it should be, because these are well-established risk factors --- you have earned the right to show local explanations.

Step 2: Local Explanations --- "The Model Flagged This Patient Because..."

# Select three patients to explain
# Patient A: High risk, clinically obvious
high_risk_mask = probs > 0.5
high_risk_idx = np.where(high_risk_mask)[0][0]

# Patient B: Moderate risk, borderline
moderate_mask = (probs > 0.3) & (probs < 0.4)
moderate_idx = np.where(moderate_mask)[0][0]

# Patient C: High risk but surprising (social factors dominate)
social_risk_mask = (
    (probs > 0.4) &
    (X_test['lives_alone'] == 1) &
    (X_test['has_primary_care'] == 0)
)
if social_risk_mask.any():
    social_idx = np.where(social_risk_mask)[0][0]
else:
    social_idx = np.where(probs > 0.4)[0][1]

patients = {
    'Patient A (High Risk)': high_risk_idx,
    'Patient B (Borderline)': moderate_idx,
    'Patient C (Social Risk)': social_idx,
}

for name, idx in patients.items():
    patient = X_test.iloc[idx]
    prob = probs[idx]
    sv = shap_values[idx]
    top3 = np.argsort(np.abs(sv))[::-1][:3]

    print(f"\n{'='*60}")
    print(f"{name}: Readmission Risk = {prob:.1%}")
    print(f"{'='*60}")
    print(f"  Age: {patient['age']:.0f}")
    print(f"  Ejection fraction: {patient['ejection_fraction']:.0f}%")
    print(f"  Creatinine (discharge): {patient['creatinine_discharge']:.2f}")
    print(f"  Prior admissions (1yr): {patient['num_prior_admissions_1yr']:.0f}")
    print(f"  Comorbidities: {patient['num_comorbidities']:.0f}")
    print(f"  Lives alone: {'Yes' if patient['lives_alone'] == 1 else 'No'}")
    print(f"  Has PCP: {'Yes' if patient['has_primary_care'] == 1 else 'No'}")
    print(f"  Depression screen: "
          f"{'Positive' if patient['depression_screen_positive'] == 1 else 'Negative'}")
    print(f"\n  Top 3 reasons:")
    for rank, feat_idx in enumerate(top3, 1):
        feat_name = feature_names[feat_idx]
        feat_val = patient.iloc[feat_idx]
        direction = "increases" if sv[feat_idx] > 0 else "decreases"
        print(f"    {rank}. {feat_name} = {feat_val} ({direction} risk)")

Patient A: The Clinically Obvious Case

pa_idx = high_risk_idx
pa = X_test.iloc[pa_idx]
pa_prob = probs[pa_idx]

shap.plots.waterfall(
    shap.Explanation(
        values=shap_values[pa_idx],
        base_values=explainer.expected_value,
        data=pa.values,
        feature_names=feature_names
    ),
    max_display=10, show=False
)
plt.title(f"Patient A --- Readmission Risk: {pa_prob:.1%}")
plt.tight_layout()
plt.savefig("cs2_patient_a_waterfall.png", dpi=150, bbox_inches='tight')
plt.show()

Patient A is likely flagged for elevated creatinine, multiple prior admissions, and low ejection fraction. These are the cases that build trust --- the model flags patients that clinicians would also flag. Dr. Okonkwo looks at this and says: "Yes, I would have identified this patient too. The model agrees with my assessment."

This is the point. The first patients you show should be the easy ones --- cases where the model and the clinician align. This builds confidence before you get to the interesting cases.

Patient C: The Social Risk Case

pc_idx = social_idx
pc = X_test.iloc[pc_idx]
pc_prob = probs[pc_idx]

shap.plots.waterfall(
    shap.Explanation(
        values=shap_values[pc_idx],
        base_values=explainer.expected_value,
        data=pc.values,
        feature_names=feature_names
    ),
    max_display=10, show=False
)
plt.title(f"Patient C --- Readmission Risk: {pc_prob:.1%}")
plt.tight_layout()
plt.savefig("cs2_patient_c_waterfall.png", dpi=150, bbox_inches='tight')
plt.show()

Patient C is the case that makes the model valuable. The clinical values may be moderate --- ejection fraction is not terrible, creatinine is mildly elevated. But the patient lives alone and does not have a primary care provider. The model flags them because of the social risk factors, not the clinical ones.

This is where you say to Dr. Okonkwo:

"This patient's clinical numbers are borderline. But the model is picking up on the fact that they live alone and do not have a PCP. In the training data, patients with this social profile had significantly higher readmission rates, even when the clinical values were acceptable. The model is flagging a risk you might not see in the chart."

If Dr. Okonkwo agrees that social isolation is a real risk factor --- and the literature strongly supports this --- the model has just demonstrated value. It caught something a busy clinician focused on lab values might miss.


The Clinical Dashboard Design

In production, clinicians do not look at SHAP waterfall plots. They look at a dashboard. Here is the design:

def create_clinical_explanation(patient_data, shap_vals, prob,
                                feature_names, top_n=5):
    """
    Create a clinician-friendly explanation table.
    Translates SHAP values into clinical language.
    """
    # Map feature names to clinical labels
    clinical_labels = {
        'age': 'Age',
        'ejection_fraction': 'Ejection Fraction (%)',
        'length_of_stay': 'Length of Stay (days)',
        'num_prior_admissions_1yr': 'Prior Admissions (past year)',
        'num_comorbidities': 'Number of Comorbidities',
        'creatinine_discharge': 'Creatinine at Discharge',
        'sodium_discharge': 'Sodium at Discharge',
        'hemoglobin_discharge': 'Hemoglobin at Discharge',
        'systolic_bp_discharge': 'Systolic BP at Discharge',
        'bmi': 'BMI',
        'lives_alone': 'Lives Alone',
        'has_primary_care': 'Has Primary Care Provider',
        'num_medications': 'Number of Medications',
        'depression_screen_positive': 'Depression Screen Positive',
        'insurance_type': 'Insurance Type',
        'discharge_disposition': 'Discharge Disposition',
    }

    # Map feature names to reference ranges
    reference_ranges = {
        'ejection_fraction': '55-70% (normal)',
        'creatinine_discharge': '0.7-1.3 mg/dL (normal)',
        'sodium_discharge': '136-145 mEq/L (normal)',
        'hemoglobin_discharge': '12-17 g/dL (normal)',
        'systolic_bp_discharge': '90-120 mmHg (normal)',
    }

    top_indices = np.argsort(np.abs(shap_vals))[::-1][:top_n]

    rows = []
    for idx in top_indices:
        feat_name = feature_names[idx]
        feat_val = patient_data[idx]
        sv = shap_vals[idx]

        # Format binary features
        if feat_name in ('lives_alone', 'has_primary_care',
                         'depression_screen_positive'):
            display_val = 'Yes' if feat_val == 1 else 'No'
        else:
            display_val = f"{feat_val}"

        ref = reference_ranges.get(feat_name, '')

        rows.append({
            'Factor': clinical_labels.get(feat_name, feat_name),
            'Value': display_val,
            'Reference': ref,
            'Impact': 'Increases risk' if sv > 0 else 'Decreases risk',
            'Magnitude': 'Strong' if abs(sv) > 0.3 else (
                'Moderate' if abs(sv) > 0.1 else 'Mild'
            ),
        })

    explanation_df = pd.DataFrame(rows)
    return explanation_df


# Generate explanation for Patient A
explanation = create_clinical_explanation(
    X_test.iloc[pa_idx].values,
    shap_values[pa_idx],
    probs[pa_idx],
    feature_names
)

print(f"Patient A --- 30-Day Readmission Risk: {probs[pa_idx]:.0%}")
print(f"\nTop factors contributing to this risk assessment:")
print(explanation.to_string(index=False))
Patient A --- 30-Day Readmission Risk: 54%

Top factors contributing to this risk assessment:
                      Factor Value              Reference         Impact Magnitude
Prior Admissions (past year)     4                        Increases risk    Strong
     Creatinine at Discharge  2.31 0.7-1.3 mg/dL (normal) Increases risk    Strong
       Ejection Fraction (%)    24        55-70% (normal) Increases risk  Moderate
                 Lives Alone   Yes                        Increases risk  Moderate
   Has Primary Care Provider    No                        Increases risk      Mild

This table is what goes on the discharge planning screen. No SHAP values. No log-odds. Just: "Here are the top reasons the model flagged this patient, with the patient's values, normal reference ranges where applicable, and the direction and strength of the effect."
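
For a web-based discharge planning screen, the same DataFrame can be serialized directly. A sketch, with a hand-built stand-in for the explanation table and an assumed CSS class name (`risk-factors`) for whatever the hospital's front end expects:

```python
import pandas as pd

# Stand-in for the table produced by create_clinical_explanation
explanation = pd.DataFrame([
    {'Factor': 'Prior Admissions (past year)', 'Value': '4',
     'Reference': '', 'Impact': 'Increases risk', 'Magnitude': 'Strong'},
    {'Factor': 'Creatinine at Discharge', 'Value': '2.31',
     'Reference': '0.7-1.3 mg/dL (normal)', 'Impact': 'Increases risk',
     'Magnitude': 'Strong'},
])

# Render to HTML for embedding in the dashboard
html = explanation.to_html(index=False, classes='risk-factors', border=0)
```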


The Override Workflow

Dr. Okonkwo's third requirement was an override mechanism. The clinician must be able to disagree with the model --- and the disagreement must be recorded.

def record_clinical_override(patient_id, model_risk, clinician_risk,
                              override_reason, clinician_name):
    """
    Record a clinician override of the model's risk assessment.
    In production, this writes to a database. Here, we return a dict.
    """
    return {
        'patient_id': patient_id,
        'model_risk_score': round(model_risk, 3),
        'clinician_risk_assessment': clinician_risk,
        'override': model_risk > 0.35 and clinician_risk == 'low',  # flagged above 35% per protocol
        'override_reason': override_reason,
        'clinician_name': clinician_name,
        'timestamp': pd.Timestamp.now().isoformat(),
    }


# Example: Dr. Okonkwo reviews Patient B and disagrees
override = record_clinical_override(
    patient_id='MRN-47823',
    model_risk=probs[moderate_idx],
    clinician_risk='low',
    override_reason=(
        'Patient has strong family support system not captured in model. '
        'Daughter is a registered nurse and will monitor daily. '
        'Patient is adherent to medication regimen.'
    ),
    clinician_name='Dr. Sarah Okonkwo'
)

print("Clinical Override Record:")
for k, v in override.items():
    print(f"  {k}: {v}")
Clinical Override Record:
  patient_id: MRN-47823
  model_risk_score: 0.352
  clinician_risk_assessment: low
  override: True
  override_reason: Patient has strong family support system not captured in model. Daughter is a registered nurse and will monitor daily. Patient is adherent to medication regimen.
  clinician_name: Dr. Sarah Okonkwo
  timestamp: 2026-03-25T14:30:00

The override record serves two purposes:

  1. Clinical autonomy: Dr. Okonkwo can override the model. She is not bound by an algorithm. This is critical for clinician buy-in.
  2. Model improvement: The override reasons are a gold mine of information about what the model is missing. If many overrides cite "strong family support," that is a feature you need to add. If many overrides cite "medication adherence," that is another gap. The override log is a feedback loop.

Analyzing Override Patterns

# Simulated override log (in production, this comes from the database)
override_reasons = pd.DataFrame({
    'reason_category': [
        'Family support', 'Family support', 'Medication adherence',
        'Recent clinical improvement', 'Family support',
        'Medication adherence', 'Planned procedure',
        'Family support', 'Transportation arranged',
        'Recent clinical improvement', 'Family support',
        'Medication adherence', 'Medication adherence',
        'Family support', 'Transportation arranged',
    ]
})

print("Override reason distribution:")
print(override_reasons['reason_category'].value_counts().to_string())
print(f"\nInsight: 'Family support' is the most common override reason.")
print("This suggests the model should incorporate a social support")
print("assessment as a feature in the next iteration.")
Override reason distribution:
Family support               6
Medication adherence         4
Recent clinical improvement  2
Transportation arranged      2
Planned procedure            1

Insight: 'Family support' is the most common override reason.
This suggests the model should incorporate a social support
assessment as a feature in the next iteration.

The One-Page Clinical Guide

This is the document that goes in the discharge planning workflow:

HEART FAILURE READMISSION RISK MODEL
Clinical Decision Support Guide --- Metro General Hospital

WHAT THIS TOOL DOES:
The readmission risk model estimates the probability that a heart
failure patient will be readmitted within 30 days of discharge.
Patients above 35% risk are flagged for the intensive post-discharge
intervention (home visit + phone check-ins + 7-day follow-up).

HOW TO READ THE RISK SCORE:
  0-20%    Low risk        Standard discharge plan
  20-35%   Moderate risk   Enhanced discharge education
  35-50%   High risk       Intensive post-discharge intervention
  50%+     Very high risk  Intensive intervention + case management

WHAT THE MODEL CONSIDERS (top factors):
  1. Number of prior admissions in the past year
  2. Creatinine at discharge (kidney function)
  3. Ejection fraction (cardiac function)
  4. Whether the patient lives alone
  5. Whether the patient has a primary care provider

Each flagged patient has a table showing the specific factors
that contributed most to their risk score, with their values
compared to normal reference ranges.

IF YOU DISAGREE WITH THE MODEL:
  1. You may override the model's recommendation.
  2. Document your override reason in the system.
  3. Your overrides are reviewed monthly by the QI team.
  4. Common override reasons help us improve the model.

WHAT THE MODEL CANNOT SEE:
  - Medication adherence history (not yet in our data)
  - Family and social support strength
  - Patient motivation and self-management ability
  - Recent conversations with the care team
  - Planned procedures or specialist follow-ups

The model is a screening tool, not a replacement for clinical
judgment. It is most valuable when it flags patients whose risk
factors you might not have seen in a busy discharge workflow.

QUESTIONS? Contact the Clinical Informatics team:
[contact information]
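
The tier boundaries in the guide map directly to a lookup the dashboard can apply to each score; a minimal sketch:

```python
def risk_tier(prob):
    """Map a predicted readmission probability to the guide's action tiers."""
    if prob >= 0.50:
        return "Very high risk: intensive intervention + case management"
    if prob >= 0.35:
        return "High risk: intensive post-discharge intervention"
    if prob >= 0.20:
        return "Moderate risk: enhanced discharge education"
    return "Low risk: standard discharge plan"

print(risk_tier(0.352))  # High risk: intensive post-discharge intervention
```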

Outcome: Six Months Later

After implementing the SHAP-based explanation system:

  • Clinician adoption: 78% of cardiologists review the model's recommendations during discharge planning (up from 12% before SHAP explanations).
  • Override rate: 15% of model flags are overridden, with documented reasons.
  • Model improvement: The social support assessment was added as a feature based on override patterns, improving the model's AUC from 0.81 to 0.84.
  • Readmission rate: Down from 22% to 18.5% --- a 15.9% relative reduction.
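
The relative-reduction figure in the last bullet checks out:

```python
before, after = 0.22, 0.185
relative_reduction = (before - after) / before
print(f"Relative reduction: {relative_reduction:.1%}")  # Relative reduction: 15.9%
```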

The model did not change. The explanation system changed. The explanations built the trust that enabled adoption. The adoption enabled the intervention. The intervention reduced readmissions.

Core Lesson --- The model was always capable of reducing readmissions. It was clinician trust --- not model accuracy --- that was the bottleneck. SHAP explanations did not make the model better. They made it usable.


This case study supports Chapter 19: Model Interpretation. Return to the chapter for the complete treatment of SHAP, PDP, permutation importance, and LIME.