Case Study 1: Metro General Readmission Fairness Audit


Background

Metro General Hospital's 30-day readmission prediction model has been through the full pipeline. Chapter 1 introduced the problem ($2.1 million in annual CMS penalties, a 17.3% readmission rate). Chapters 11--14 built and compared models. Chapter 16 evaluated performance. Chapter 17 addressed class imbalance. Chapter 19 added SHAP-based explanations so clinicians could understand individual predictions. Chapter 31 deployed the model as a FastAPI endpoint. Chapter 32 set up monitoring.

The model works. It has an AUC of 0.83 and a recall of 0.67. The care coordination team uses it daily: patients above the 0.20 risk threshold receive a follow-up call within 48 hours of discharge. Over six months, readmission rates have dropped by 2.1 percentage points for patients who received calls --- saving an estimated $1.4 million annually.

Then, in a routine quality review, Dr. Sarah Nwosu (Chief Medical Officer) asks a question that changes everything: "Does the model work equally well for all of our patients?"

The data science team runs the numbers. The answer is no.


Phase 1: Discovering the Disparity

The team computes AUC and recall by racial group on the last six months of production data.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score, accuracy_score, confusion_matrix, roc_curve
)

np.random.seed(42)

# Metro General patient data: 14,200 discharges
n = 14200

race = np.random.choice(
    ['White', 'Black', 'Hispanic', 'Asian'],
    size=n,
    p=[0.45, 0.25, 0.20, 0.10]
)

age = np.random.normal(62, 15, n).clip(18, 99).astype(int)
length_of_stay = np.random.exponential(4.5, n).round(1).clip(1, 45)
num_prior_admissions = np.random.poisson(1.2, n)
num_medications = np.random.poisson(8, n).clip(1, 30)
comorbidity_index = np.random.poisson(2, n)
ed_visits_6m = np.random.poisson(1.5, n)
has_pcp = np.random.binomial(1, 0.65, n)
insurance_type = np.random.choice(
    ['Medicare', 'Medicaid', 'Commercial', 'Uninsured'],
    size=n, p=[0.38, 0.22, 0.31, 0.09]
)
discharge_to_snf = np.random.binomial(1, 0.15, n)

insurance_map = {'Medicare': 0, 'Medicaid': 1, 'Commercial': 2, 'Uninsured': 3}
insurance_encoded = np.array([insurance_map[i] for i in insurance_type])

# Differential base rates: reflect systemic disparities in post-discharge care
race_risk = {'White': 0.0, 'Black': 0.06, 'Hispanic': 0.04, 'Asian': -0.02}
race_offset = np.array([race_risk[r] for r in race])

logit = (
    -2.0
    + 0.015 * age
    + 0.08 * length_of_stay
    + 0.25 * num_prior_admissions
    + 0.03 * num_medications
    + 0.15 * comorbidity_index
    + 0.10 * ed_visits_6m
    - 0.30 * has_pcp
    - 0.20 * discharge_to_snf
    + 0.10 * (insurance_encoded == 3).astype(int)
    + race_offset
)
prob_true = 1 / (1 + np.exp(-logit))
readmitted = np.random.binomial(1, prob_true)

df = pd.DataFrame({
    'age': age, 'length_of_stay': length_of_stay,
    'num_prior_admissions': num_prior_admissions,
    'num_medications': num_medications,
    'comorbidity_index': comorbidity_index,
    'ed_visits_6m': ed_visits_6m,
    'has_pcp': has_pcp,
    'insurance_encoded': insurance_encoded,
    'discharge_to_snf': discharge_to_snf,
    'race': race, 'readmitted': readmitted,
})

print("Readmission base rates by race:")
print(df.groupby('race')['readmitted'].mean().round(3))
print(f"\nOverall rate: {df['readmitted'].mean():.3f}")

Training the Model (Race Excluded)

feature_cols = [
    'age', 'length_of_stay', 'num_prior_admissions',
    'num_medications', 'comorbidity_index', 'ed_visits_6m',
    'has_pcp', 'insurance_encoded', 'discharge_to_snf'
]

X = df[feature_cols]
y = df['readmitted']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
race_test = df.loc[X_test.index, 'race'].values

model = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Overall AUC:      {roc_auc_score(y_test, y_prob):.3f}")
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred):.3f}")

Group-Level Performance

print("AUC by racial group:")
for group in ['White', 'Black', 'Hispanic', 'Asian']:
    mask = race_test == group
    if len(np.unique(y_test.values[mask])) == 2:
        auc = roc_auc_score(y_test.values[mask], y_prob[mask])
        n_group = mask.sum()
        print(f"  {group:12s}  AUC={auc:.3f}  (n={n_group})")

print("\nRecall (TPR) by racial group at threshold=0.20:")
for group in ['White', 'Black', 'Hispanic', 'Asian']:
    mask = race_test == group
    y_true_g = y_test.values[mask]
    y_pred_g = (y_prob[mask] >= 0.20).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true_g, y_pred_g, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    print(f"  {group:12s}  TPR={tpr:.3f}  FPR={fpr:.3f}")

The numbers reveal a clear pattern: the model catches a higher proportion of readmissions for white patients than for Black or Hispanic patients. The AUC gap between the best-performing group and the worst-performing group is substantial. The model is not deliberately biased --- it simply learned better in data-rich regions (more white patients) and struggled in data-sparse regions (fewer Asian patients).

But the effect on patient care is the same regardless of intent. Black patients who will be readmitted are less likely to receive the follow-up call that could prevent that readmission. The care coordination team, guided by the model, is inadvertently allocating fewer resources to the patients who need them most.

Clinical Implication --- Dr. Nwosu summarizes the problem in one sentence: "We built a system to reduce readmissions, and it reduces readmissions less for Black patients than white patients. That is not acceptable."


Phase 2: The Full Fairness Audit

The team runs a comprehensive fairness audit.

def full_audit(y_true, y_pred, y_prob, groups):
    """Complete fairness audit with all five metrics."""
    results = []
    for group in np.unique(groups):
        mask = groups == group
        y_t = y_true[mask]
        y_p = y_pred[mask]
        y_pr = y_prob[mask]

        # labels= keeps the matrix 2x2 even if a group lacks one class
        tn, fp, fn, tp = confusion_matrix(y_t, y_p, labels=[0, 1]).ravel()
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
        pos_rate = y_p.mean()
        base_rate = y_t.mean()
        auc = roc_auc_score(y_t, y_pr) if len(np.unique(y_t)) == 2 else np.nan

        results.append({
            'Group': group,
            'N': mask.sum(),
            'Base Rate': round(base_rate, 3),
            'Pos Pred Rate': round(pos_rate, 3),
            'TPR (Recall)': round(tpr, 3),
            'FPR': round(fpr, 3),
            'Precision': round(ppv, 3),
            'AUC': round(auc, 3),
        })

    audit_df = pd.DataFrame(results).set_index('Group')

    print("=" * 70)
    print("METRO GENERAL READMISSION MODEL — FAIRNESS AUDIT")
    print("=" * 70)
    print(audit_df.to_string())
    print()

    # Disparities
    print("DISPARITIES:")
    print(f"  TPR range:       {audit_df['TPR (Recall)'].max() - audit_df['TPR (Recall)'].min():.3f}")
    print(f"  FPR range:       {audit_df['FPR'].max() - audit_df['FPR'].min():.3f}")
    print(f"  Precision range: {audit_df['Precision'].max() - audit_df['Precision'].min():.3f}")
    print(f"  AUC range:       {audit_df['AUC'].max() - audit_df['AUC'].min():.3f}")

    # Disparate impact
    ref_rate = audit_df.loc['White', 'Pos Pred Rate']
    print("\nDISPARATE IMPACT (reference: White):")
    for group in audit_df.index:
        ratio = audit_df.loc[group, 'Pos Pred Rate'] / ref_rate
        flag = " *** BELOW 0.80 ***" if ratio < 0.80 else ""
        print(f"  {group:12s}  DI={ratio:.3f}{flag}")

    print("=" * 70)
    return audit_df

# Run the audit at the default threshold
y_pred_20 = (y_prob >= 0.20).astype(int)
audit = full_audit(y_test.values, y_pred_20, y_prob, race_test)

Phase 3: Understanding Why

The team investigates three hypotheses for the disparity.

Hypothesis 1: Proxy Variables

Even though race is not a model feature, insurance type is --- and insurance type is correlated with race.

print("Insurance distribution by race (training data):")
cross = pd.crosstab(
    df['race'], df['insurance_encoded'].map(
        {0: 'Medicare', 1: 'Medicaid', 2: 'Commercial', 3: 'Uninsured'}
    ),
    normalize='index'
).round(3)
print(cross)

The pattern is clear: uninsured and Medicaid patients are disproportionately Black and Hispanic. The model uses insurance type as a legitimate predictor of readmission (uninsured patients have fewer post-discharge resources), but this feature carries racial information. Removing insurance type would reduce the proxy effect but also reduce model accuracy, because insurance genuinely affects readmission through access to post-discharge care.
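The strength of this proxy relationship can be quantified. The sketch below computes Cramér's V, a 0-to-1 measure of association between two categorical variables, on a toy race-by-insurance contingency table (illustrative counts, not Metro General data); values near 1 flag a strong proxy.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for a contingency table: 0 = independent, 1 = perfect association."""
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Toy counts: rows = racial groups, columns = insurance types (illustrative only)
toy = np.array([
    [520, 110, 340, 30],
    [150, 260,  60, 30],
    [130, 210,  50, 30],
    [ 90,  40,  60, 10],
])
print(f"Cramér's V (race vs. insurance): {cramers_v(toy):.3f}")
```

A value well above zero would confirm that insurance type leaks group membership even with race excluded from the feature set.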

Hypothesis 2: Representation Imbalance

print("Training set size by race:")
race_train = df.loc[X_train.index, 'race']
print(race_train.value_counts())

The training set has roughly 4,500 white patients but only about 1,000 Asian patients. The model has learned the patterns of readmission for white patients with higher fidelity. This is representation bias: the model is not intentionally worse for minority groups, but it has less information to work with.
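Small groups also make group-level AUC estimates noisy, so part of any observed gap may be sampling error rather than true model behavior. A minimal bootstrap sketch (synthetic scores of identical quality, two illustrative group sizes) shows how the confidence interval widens for the smaller group:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc_ci(y_true, y_score, n_boot=300, alpha=0.05):
    """Percentile-bootstrap confidence interval for AUC."""
    aucs = []
    n = len(y_true)
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) == 2:  # resample must contain both classes
            aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.quantile(aucs, [alpha / 2, 1 - alpha / 2])

# Same underlying score quality, very different group sizes (illustrative numbers)
widths = {}
for n_group in (4500, 700):
    y = rng.binomial(1, 0.2, n_group)
    scores = y * 0.4 + rng.normal(0.3, 0.2, n_group)
    lo, hi = bootstrap_auc_ci(y, scores)
    widths[n_group] = hi - lo
    print(f"n={n_group:5d}  AUC 95% CI width = {widths[n_group]:.3f}")
```

This is one reason per-group metrics for small groups should be reported with uncertainty intervals, not point estimates alone.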

Hypothesis 3: Different Base Rates and the Impossibility Theorem

print("Base rates by race (test set):")
for group in ['White', 'Black', 'Hispanic', 'Asian']:
    mask = race_test == group
    print(f"  {group:12s}  {y_test.values[mask].mean():.3f}")

Different base rates trigger the impossibility theorem. With a single threshold applied to all groups, the model cannot simultaneously equalize TPR, FPR, and calibration across groups. The team must choose.
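The tension can be shown with a few lines of arithmetic. If two groups share the same TPR and FPR but have different base rates p, precision must differ, because PPV = TPR * p / (TPR * p + FPR * (1 - p)). The base rates below are illustrative, not Metro General's:

```python
# Equal error rates + unequal base rates => unequal precision.
tpr, fpr = 0.70, 0.30

ppv = {}
for group, p in [('lower base rate', 0.14), ('higher base rate', 0.21)]:
    ppv[group] = tpr * p / (tpr * p + fpr * (1 - p))
    print(f"{group} (p={p}): PPV = {ppv[group]:.3f}")

# This gap cannot be closed without giving up TPR or FPR equality.
print(f"PPV gap: {ppv['higher base rate'] - ppv['lower base rate']:.3f}")
```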

Diagnosis --- The disparity is not caused by a single factor. It is the combined effect of (1) proxy variables carrying racial information, (2) representation imbalance giving the model more information about majority groups, and (3) different base rates making equal error rates mathematically impossible with a single threshold. All three must be addressed.


Phase 4: Applying Mitigation

Strategy A: Threshold Adjustment (Post-Processing)

The team starts with the simplest approach: group-specific thresholds that equalize TPR.

def find_equalized_thresholds(y_true, y_prob, groups, target_tpr=0.70):
    """Find per-group thresholds for approximately equal TPR."""
    thresholds = {}
    for group in np.unique(groups):
        mask = groups == group
        fpr_arr, tpr_arr, thresh_arr = roc_curve(
            y_true[mask], y_prob[mask]
        )
        idx = np.argmin(np.abs(tpr_arr - target_tpr))
        thresholds[group] = thresh_arr[idx]
    return thresholds

# Target: 70% TPR for all groups
eq_thresh = find_equalized_thresholds(
    y_test.values, y_prob, race_test, target_tpr=0.70
)

print("Group-specific thresholds for ~70% TPR:")
for group, thresh in eq_thresh.items():
    print(f"  {group:12s}  threshold={thresh:.3f}")

# Apply equalized thresholds
y_pred_eq = np.zeros(len(y_test), dtype=int)
for group, thresh in eq_thresh.items():
    mask = race_test == group
    y_pred_eq[mask] = (y_prob[mask] >= thresh).astype(int)

print("\nAfter threshold adjustment:")
audit_eq = full_audit(y_test.values, y_pred_eq, y_prob, race_test)

Measuring the Accuracy Cost

acc_before = accuracy_score(y_test, y_pred_20)
acc_after = accuracy_score(y_test, y_pred_eq)

print(f"\nAccuracy with uniform threshold (0.20):   {acc_before:.3f}")
print(f"Accuracy with equalized thresholds:        {acc_after:.3f}")
print(f"Accuracy cost:                             {acc_before - acc_after:.3f}")

Strategy B: Reweighting (Pre-Processing)

def compute_fairness_weights(df, protected_col, target_col):
    """Compute sample weights to equalize group-label distributions."""
    overall_pos_rate = df[target_col].mean()
    overall_neg_rate = 1 - overall_pos_rate
    weights = np.ones(len(df))

    for group in df[protected_col].unique():
        group_mask = df[protected_col] == group
        group_size = group_mask.sum()
        group_pos = df.loc[group_mask, target_col].sum()
        group_neg = group_size - group_pos

        pos_mask = group_mask & (df[target_col] == 1)
        if group_pos > 0:
            weights[pos_mask] = (overall_pos_rate * len(df)) / (2 * group_pos)

        neg_mask = group_mask & (df[target_col] == 0)
        if group_neg > 0:
            weights[neg_mask] = (overall_neg_rate * len(df)) / (2 * group_neg)

    return weights

# Note: in production, weights should be computed from the training split only;
# the full df is used here for simplicity of illustration.
sample_weights = compute_fairness_weights(df, 'race', 'readmitted')

model_rw = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
model_rw.fit(X_train, y_train, sample_weight=sample_weights[X_train.index])

y_prob_rw = model_rw.predict_proba(X_test)[:, 1]
y_pred_rw = (y_prob_rw >= 0.20).astype(int)

print("Reweighted model audit:")
audit_rw = full_audit(y_test.values, y_pred_rw, y_prob_rw, race_test)

Strategy C: Combined (Reweighting + Threshold Adjustment)

# Reweighted model + equalized thresholds
eq_thresh_rw = find_equalized_thresholds(
    y_test.values, y_prob_rw, race_test, target_tpr=0.70
)

y_pred_combined = np.zeros(len(y_test), dtype=int)
for group, thresh in eq_thresh_rw.items():
    mask = race_test == group
    y_pred_combined[mask] = (y_prob_rw[mask] >= thresh).astype(int)

print("Combined (reweighting + threshold adjustment) audit:")
audit_combined = full_audit(
    y_test.values, y_pred_combined, y_prob_rw, race_test
)

Phase 5: The Decision

The team presents three options to Dr. Nwosu and the hospital's ethics committee:

# Summary comparison
summary = pd.DataFrame({
    'Approach': [
        'Original (uniform threshold)',
        'Threshold adjustment only',
        'Reweighting only',
        'Combined (reweighting + thresholds)',
    ],
    'Overall Accuracy': [
        accuracy_score(y_test, y_pred_20),
        accuracy_score(y_test, y_pred_eq),
        accuracy_score(y_test, y_pred_rw),
        accuracy_score(y_test, y_pred_combined),
    ],
    'Overall AUC': [
        roc_auc_score(y_test, y_prob),
        roc_auc_score(y_test, y_prob),  # AUC unchanged by threshold
        roc_auc_score(y_test, y_prob_rw),
        roc_auc_score(y_test, y_prob_rw),
    ],
})

for col in ['Overall Accuracy', 'Overall AUC']:
    summary[col] = summary[col].round(3)

print("Decision Summary for Dr. Nwosu:")
print(summary.to_string(index=False))

The Team's Recommendation --- The data science team recommends the combined approach (reweighting + group-specific thresholds). The reasoning: reweighting improves the model's ability to learn patterns for underrepresented groups, and threshold adjustment equalizes the TPR so that every racial group has the same chance of receiving a follow-up call when they are at risk of readmission. The accuracy cost is modest --- typically 2--3 percentage points. The ethical benefit is substantial: Black patients who will be readmitted are no longer systematically less likely to receive the care coordination call.

Dr. Nwosu asks one more question: "If we equalize the follow-up call rates, will that actually reduce the readmission disparity?"

The honest answer: the model can equalize who gets flagged, but whether that translates to equitable outcomes depends on whether the follow-up care itself is equally effective for all groups. If Black patients face systemic barriers (transportation, pharmacy access, follow-up appointment availability) that the follow-up call cannot address, equalizing model predictions is necessary but not sufficient. The model is one piece of a larger system.
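That caveat can be made concrete with a toy simulation (all numbers assumed for illustration): two groups are flagged at the same rate, but the follow-up call prevents fewer readmissions in group B because of barriers the call cannot address.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

base_risk   = {'A': 0.16, 'B': 0.22}  # underlying readmission risk (assumed)
call_effect = {'A': 0.30, 'B': 0.15}  # chance a call prevents a readmission (assumed)
flag_rate = 0.35                      # equalized across groups by the model

rates = {}
for g in ('A', 'B'):
    at_risk = rng.random(n) < base_risk[g]
    called = rng.random(n) < flag_rate
    prevented = called & (rng.random(n) < call_effect[g])
    rates[g] = (at_risk & ~prevented).mean()
    print(f"Group {g}: readmission rate = {rates[g]:.3f}")

print(f"Residual disparity (B - A): {rates['B'] - rates['A']:.3f}")
```

Even with identical flag rates, a readmission gap persists whenever the intervention itself works unevenly, which is exactly Dr. Nwosu's concern.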


Phase 6: The Model Card

import json

model_card = {
    "model_details": {
        "name": "Metro General 30-Day Readmission Predictor",
        "version": "3.0 (fairness-adjusted)",
        "type": "GradientBoostingClassifier",
        "date": "2025-07-15",
        "trained_by": "Metro General Analytics Team",
    },
    "intended_use": {
        "primary": (
            "Risk-stratify discharged patients for care coordination. "
            "Patients above group-specific risk thresholds receive a "
            "follow-up call within 48 hours of discharge."
        ),
        "out_of_scope": [
            "Pediatric patients (model trained on adults 18+)",
            "Psychiatric admissions (excluded from training data)",
            "Patients transferred to LTAC facilities",
            "Sole basis for clinical decisions without clinician review",
        ],
    },
    "training_data": {
        "source": "Metro General EHR, Jan 2020 - Dec 2024",
        "size": "14,200 adult discharges",
        "demographics": "45% White, 25% Black, 20% Hispanic, 10% Asian",
        "features": "9 clinical and administrative features (no race)",
        "target": "30-day all-cause readmission (17.3% base rate)",
        "preprocessing": "Fairness reweighting applied to equalize group-label distributions",
    },
    "performance": {
        "overall_auc": 0.821,
        "overall_accuracy": 0.742,
        "group_tpr": {
            "White": 0.70, "Black": 0.71,
            "Hispanic": 0.69, "Asian": 0.68,
        },
        "group_fpr_note": (
            "FPR varies by group (range ~0.04) as a consequence of "
            "the impossibility theorem. TPR was prioritized over FPR "
            "equalization per ethics committee decision."
        ),
    },
    "fairness": {
        "protected_attribute": "Race (White, Black, Hispanic, Asian)",
        "criterion": "Equal opportunity (equalized TPR)",
        "mitigation": "Reweighting (pre-processing) + group-specific thresholds (post-processing)",
        "residual_disparities": [
            "AUC for Asian patients is lower than other groups due to small sample size",
            "FPR is not fully equalized (impossibility theorem constraint)",
        ],
        "decision_makers": "Dr. Sarah Nwosu (CMO), Hospital Ethics Committee",
    },
    "limitations": [
        "Does not capture post-discharge medication adherence or follow-up attendance",
        "AUC for Asian patients is lower due to representation bias (5% of training data)",
        "Fairness was evaluated on held-out data; production monitoring is required",
        "Model does not address systemic barriers to post-discharge care",
    ],
    "ethical_considerations": [
        "Readmission rates reflect systemic healthcare disparities, not patient characteristics",
        "Model predictions should identify patients who need MORE resources, never less",
        "Group-specific thresholds accepted by ethics committee on 2025-06-20",
        "Production fairness monitoring dashboard required (see Chapter 32 integration)",
    ],
}

print(json.dumps(model_card, indent=2))

Lessons from Metro General

  1. Performance metrics hide disparities. An overall AUC of 0.83 can mask the fact that the model works significantly worse for some groups. Always disaggregate.

  2. Removing the protected attribute is not enough. Insurance type --- and, in other settings, features like zip code --- carries racial information. "Fairness through unawareness" is a myth.

  3. Different base rates make perfect fairness impossible. The impossibility theorem is not abstract --- it is the reason Metro General had to choose between equalizing TPR and equalizing FPR. They chose TPR because missing a readmission has higher human cost than an unnecessary follow-up call.

  4. The accuracy cost of fairness is small. A 2--3 point accuracy drop for equitable error rates across racial groups is a tradeoff most healthcare institutions should be willing to make.

  5. Fairness is necessary but not sufficient. Equalizing model predictions does not equalize outcomes. The model is one component of a care system that must address the systemic barriers --- transportation, pharmacy access, housing stability --- that drive readmission disparities in the first place.

  6. Document everything. The model card records what was done, why, and who decided. When the next data scientist inherits this model, they need to understand the fairness decisions and the reasoning behind them.


This case study accompanies Chapter 33: Fairness, Bias, and Responsible ML. Return to the chapter for full context.