Case Study 2: Screening for Disease Risk — Classification in Public Health


Tier 1 — Verified Concepts: This case study explores classification in medical screening, a well-documented application of predictive modeling. The concepts of sensitivity/specificity tradeoffs, screening vs. diagnostic testing, and the base rate fallacy are standard topics in biostatistics and epidemiology. The specific data is simulated for pedagogical purposes, but the patterns and ethical considerations are grounded in published medical and public health literature.


The Screening Question

Dr. Amara Okafor runs a public health program in a resource-limited region. She has budget to conduct detailed diabetes screening for 500 people out of a community of 5,000. The question is: which 500?

She could screen randomly — pick 500 people at random and test them. But she suspects she can do better. She has basic health data on all 5,000 community members: age, BMI, blood pressure, family history of diabetes, and physical activity level. If she could build a model that predicts who is most likely to have diabetes, she could target the screening to the highest-risk individuals.

This is classification with real stakes. A false negative (missing someone with diabetes) means they don't get early treatment, and their condition may worsen. A false positive (flagging someone without diabetes for screening) wastes limited screening resources but does no direct harm — the screening test itself will reveal the truth.

The Data

Dr. Okafor builds a model using data from a previous screening campaign where everyone was tested:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix,
    classification_report, recall_score, precision_score)

np.random.seed(42)
n = 2000  # Previous campaign data

patients = pd.DataFrame({
    'age': np.random.normal(50, 15, n).clip(20, 85),
    'bmi': np.random.normal(27, 5, n).clip(16, 45),
    'blood_pressure': np.random.normal(125, 18, n).clip(80, 200),
    'family_history': np.random.binomial(1, 0.3, n),
    'activity_level': np.random.choice(
        [1, 2, 3], n, p=[0.3, 0.45, 0.25])
})

# Diabetes probability
logit = (
    -8 +  # intercept chosen so simulated prevalence lands near 15-20%
    0.04 * patients['age'] +
    0.08 * patients['bmi'] +
    0.02 * patients['blood_pressure'] +
    0.8 * patients['family_history'] +
    -0.4 * patients['activity_level'] +
    np.random.normal(0, 0.5, n)
)

patients['has_diabetes'] = (
    np.random.random(n) < 1 / (1 + np.exp(-logit))
).astype(int)

prevalence = patients['has_diabetes'].mean()
print(f"Diabetes prevalence: {prevalence:.1%}")
print(f"Total with diabetes: {patients['has_diabetes'].sum()}")
print(f"Total without: {(1-patients['has_diabetes']).sum()}")

The Prevalence Problem

Diabetes prevalence in the community is around 15-20%. This creates a class imbalance. A model that predicts "no diabetes" for everyone achieves 80-85% accuracy. Dr. Okafor knows this baseline and knows it's useless for her purpose.

She needs a model that specifically identifies people with diabetes — she needs high recall.
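That "predict no one has diabetes" baseline is easy to verify directly. A minimal standalone sketch using a synthetic label vector at roughly 17% prevalence (the data here is illustrative, not the campaign data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
y = rng.random(2000) < 0.17  # ~17% positive labels, illustrative only
X = np.zeros((2000, 1))      # features are irrelevant to this baseline

# DummyClassifier ignores X and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = baseline.score(X, y)              # accuracy = share of the majority class
recall = baseline.predict(X)[y].mean()  # fraction of positives it catches

print(f"Baseline accuracy: {acc:.1%}")   # high, roughly 1 - prevalence
print(f"Baseline recall:   {recall:.1%}")  # 0% — catches no diabetes cases
```

The high accuracy and zero recall in the same model is exactly why accuracy alone is the wrong yardstick for screening.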

Building the Model

features = ['age', 'bmi', 'blood_pressure',
            'family_history', 'activity_level']
X = patients[features]
y = patients['has_diabetes']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y  # preserve the class balance in both splits
)

# Standard model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("=== Standard Model (threshold 0.5) ===")
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
print()
print(classification_report(
    y_test, y_pred, target_names=['No Diabetes', 'Diabetes']))
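The confusion_matrix function imported earlier makes the error breakdown explicit. A standalone sketch with toy labels and predictions (illustrative arrays, not the model's actual output) shows how the four cell counts are read off:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions, illustrative only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat  = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])

# Rows are actual class, columns are predicted class;
# ravel() flattens the 2x2 matrix in tn, fp, fn, tp order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"True negatives:  {tn}")   # 5
print(f"False positives: {fp}")   # 1 (flagged but healthy)
print(f"False negatives: {fn}")   # 1 (missed diabetes case)
print(f"True positives:  {tp}")   # 3
```

For Dr. Okafor, the false-negative cell is the one that matters most: those are the missed diabetes cases.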

The Threshold Decision

With the default 0.5 threshold, the model has reasonable accuracy but its recall for diabetes might not be high enough. Dr. Okafor asks: "What percentage of actual diabetes cases does the model catch?"

If recall is 70%, that means 30% of people with diabetes are missed. For a screening program, that's not acceptable. She experiments with lower thresholds:

print(f"{'Threshold':>10}  {'Recall':>8}  {'Precision':>10}  "
      f"{'Flagged':>8}")

for threshold in [0.5, 0.4, 0.3, 0.2, 0.15, 0.1]:
    preds = (y_proba >= threshold).astype(int)
    rec = recall_score(y_test, preds, zero_division=0)
    prec = precision_score(y_test, preds, zero_division=0)
    n_flagged = preds.sum()
    print(f"{threshold:>10.2f}  {rec:>8.3f}  {prec:>10.3f}  "
          f"{n_flagged:>8}")

The table reveals the tradeoff. At threshold 0.5, recall might be 70% and the model flags relatively few people. At threshold 0.2, recall jumps to 90%+ but the model flags many more people — and precision drops (many of the flagged people don't have diabetes).
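Rather than checking a handful of thresholds by hand, scikit-learn's precision_recall_curve computes the tradeoff at every candidate threshold in one call. A self-contained sketch on synthetic scores (stand-ins for y_test and y_proba) finds the highest threshold that still achieves 90% recall:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)
# Synthetic labels (~20% positive) with scores correlated to the labels;
# illustrative stand-ins for the real test labels and probabilities
y = (rng.random(1000) < 0.20).astype(int)
scores = np.clip(0.2 + 0.4 * y + rng.normal(0, 0.15, 1000), 0, 1)

# One precision/recall pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y, scores)

# recall is non-increasing as the threshold rises, so the last index
# with recall >= 0.90 is the highest threshold still hitting 90% recall
idx = np.where(recall[:-1] >= 0.90)[0][-1]
print(f"Highest threshold with recall >= 90%: {thresholds[idx]:.2f} "
      f"(precision there: {precision[idx]:.2f})")
```

Note that precision and recall have one more element than thresholds; the final (1, 0) point corresponds to predicting no positives at all.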

Dr. Okafor chooses a threshold based on her constraints:

  • Constraint: She can screen 500 out of 5,000 people
  • Goal: Maximize the number of actual diabetes cases found within those 500 screening slots
  • Strategy: Rank all 5,000 by predicted probability, screen the top 500

The Ranking Approach

Rather than using a fixed threshold, Dr. Okafor uses the probability rankings directly:

# Simulate applying the model to the full community
np.random.seed(123)
n_community = 5000

community = pd.DataFrame({
    'age': np.random.normal(50, 15, n_community).clip(20, 85),
    'bmi': np.random.normal(27, 5, n_community).clip(16, 45),
    'blood_pressure': np.random.normal(125, 18, n_community).clip(80, 200),
    'family_history': np.random.binomial(1, 0.3, n_community),
    'activity_level': np.random.choice(
        [1, 2, 3], n_community, p=[0.3, 0.45, 0.25])
})

# True diabetes status (unknown to Dr. Okafor)
logit_c = (-8 + 0.04*community['age'] + 0.08*community['bmi'] +
    0.02*community['blood_pressure'] + 0.8*community['family_history'] +
    -0.4*community['activity_level'] + np.random.normal(0, 0.5, n_community))
community['has_diabetes'] = (np.random.random(n_community) <
    1/(1+np.exp(-logit_c))).astype(int)

# Model predictions
X_community = community[features]
community['risk_score'] = model.predict_proba(X_community)[:, 1]

# Screen top 500
top_500 = community.nlargest(500, 'risk_score')
random_500 = community.sample(500, random_state=42)

diabetes_found_targeted = top_500['has_diabetes'].sum()
diabetes_found_random = random_500['has_diabetes'].sum()
total_diabetes = community['has_diabetes'].sum()

print(f"Total diabetes cases in community: {total_diabetes}")
print(f"Found with targeted screening (top 500): "
      f"{diabetes_found_targeted}")
print(f"Found with random screening (500): "
      f"{diabetes_found_random}")
print(f"\nTargeted approach finds "
      f"{diabetes_found_targeted/diabetes_found_random:.1f}x "
      f"more cases")

The targeted screening finds substantially more diabetes cases than random screening — typically 2-3 times more. This is the practical value of the model: not perfect prediction, but better allocation of limited resources.
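The same ranking logic answers "what if the budget were different?" A standalone sketch sweeps several budgets over a simulated community (the beta-distributed risk scores below are illustrative, not Dr. Okafor's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 5000
# Synthetic community: beta-distributed true risk (~18% prevalence)
# plus a noisy model score correlated with that risk
true_risk = rng.beta(2, 9, n)
df = pd.DataFrame({
    "has_disease": rng.random(n) < true_risk,
    "score": np.clip(true_risk + rng.normal(0, 0.05, n), 0, 1),
})
total = int(df["has_disease"].sum())

# Cases found when screening the top-k scored individuals
results = {}
for budget in [200, 500, 1000]:
    found = int(df.nlargest(budget, "score")["has_disease"].sum())
    results[budget] = found
    print(f"Budget {budget:>4}: {found:>3} of {total} cases "
          f"({found / total:.0%})")
```

Detection rises with budget but with diminishing returns: the first slots go to the highest-risk individuals, so each additional slot adds a lower-risk person.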

The Errors That Matter

Dr. Okafor examines who the model misses — the false negatives in the targeted approach:

# Who gets missed? Everyone outside the top 500
not_screened = community.drop(top_500.index)
missed = not_screened[not_screened['has_diabetes'] == 1]

print(f"\nDiabetes cases not selected for screening: {len(missed)}")
print(f"\nProfile of missed cases:")
print(missed[features].describe().round(1))

print(f"\nProfile of caught cases:")
caught = top_500[top_500['has_diabetes'] == 1]
print(caught[features].describe().round(1))

The missed cases tend to have lower-risk profiles — younger, lower BMI, no family history. These are the "surprising" diabetes cases that don't fit the typical pattern. The model catches the typical cases but misses the atypical ones.

This is an inherent limitation of any risk-based screening model: it works best for typical cases and worst for atypical ones. The people most likely to be missed are precisely the people whose diabetes would be most surprising.

Ethical Considerations

Dr. Okafor identifies several ethical concerns:

1. Who Gets Left Out?

Risk-based screening concentrates resources on the highest-risk individuals. This is efficient but potentially inequitable: younger, healthier-looking individuals with early-stage diabetes may be systematically excluded.

One mitigation: reserve some screening slots (say, 100 out of 500) for random sampling. This ensures that every community member has some chance of being screened, regardless of their risk score.
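That hybrid allocation takes only a few lines. A sketch using illustrative random risk scores, assuming the 400/100 split described above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Illustrative risk scores for a community of 5,000
df = pd.DataFrame({"risk_score": rng.random(5000)})

# Fill 400 slots by risk ranking, then draw 100 at random
# from everyone who was not ranked into the top 400
top = df.nlargest(400, "risk_score")
lottery = df.drop(top.index).sample(100, random_state=0)
screened = pd.concat([top, lottery])

print(f"Slots used: {len(screened)}")
print(f"Distinct people screened: {screened.index.nunique()}")
```

Drawing the lottery from df.drop(top.index) guarantees the two groups never overlap, so all 500 slots go to distinct people.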

2. The Base Rate Illusion

Even with the model, most flagged individuals won't have diabetes (because diabetes is relatively rare). When Dr. Okafor tells someone they're "high risk" and should be screened, the person might assume they probably have diabetes. But if precision is 40%, most people flagged for screening don't have it.

Dr. Okafor addresses this by framing the screening invitation carefully: "Based on your health profile, you're in a group where screening is especially recommended" — not "We think you have diabetes."
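The arithmetic behind the base rate illusion is worth making concrete. Assuming, purely for illustration, 17% prevalence and a model with 85% recall and 40% precision:

```python
# Illustrative numbers only: 17% prevalence, and a screening model
# assumed (for this example) to have 85% recall and 40% precision
prevalence, recall, precision = 0.17, 0.85, 0.40
population = 5000

cases = prevalence * population   # expected true cases in the community
true_pos = recall * cases         # cases the model flags
flagged = true_pos / precision    # total people flagged (from precision)
false_pos = flagged - true_pos    # flagged but healthy

print(f"People flagged: {flagged:.0f}")
print(f"Flagged but healthy: {false_pos:.0f} "
      f"({false_pos / flagged:.0%} of everyone flagged)")
```

With 40% precision, 60% of flagged people are healthy by definition, which is why "you're in a group where screening is recommended" is the honest framing.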

3. Feature Sensitivity

The model uses BMI, which is an imperfect health indicator with known limitations across different ethnic groups. If the model were trained on data from one population, its predictions might be less accurate for a different population with a different diabetes risk profile.

4. The Right to Not Know

Some community members might not want to know their risk score. Medical ethics includes the principle of autonomy — people have the right to decline screening even if a model flags them as high risk.

The Results

After the screening campaign, Dr. Okafor reviews the outcomes:

# Summary
print("=== Screening Campaign Results ===")
print(f"Community size: {n_community}")
print(f"Screening capacity: 500")
print(f"Actual diabetes prevalence: "
      f"{community['has_diabetes'].mean():.1%}")
print(f"\nWith model-targeted screening:")
print(f"  Screened: 500")
print(f"  Diabetes found: {diabetes_found_targeted}")
print(f"  Detection rate: "
      f"{diabetes_found_targeted/total_diabetes:.1%} of all cases")
print(f"\nWithout model (random screening):")
print(f"  Screened: 500")
print(f"  Diabetes found: {diabetes_found_random}")
print(f"  Detection rate: "
      f"{diabetes_found_random/total_diabetes:.1%} of all cases")

The model doesn't catch everyone. It doesn't need to. By screening 10% of the population (500 out of 5,000), the targeted approach finds 30-40% of all diabetes cases, compared to roughly 10-15% with random screening. That's a 2-3x improvement in detection efficiency with the same resources.

Connecting to the Chapter

Each chapter concept maps to a concrete element of the case study:

  • Binary classification → diabetes (yes/no) prediction
  • Class imbalance → diabetes prevalence around 15-20%
  • Precision vs. recall → recall prioritized for screening
  • Threshold selection → lower threshold to maximize recall
  • predict_proba → risk ranking for resource allocation
  • Confusion matrix → understanding missed cases and false alarms
  • Baseline comparison → targeted vs. random screening
  • Ethical considerations → equity, framing, population differences

Discussion Questions

  1. Dr. Okafor chose to screen the top 500 by risk score. What would change if she had budget to screen 1,000? 200? How does the screening budget affect the optimal strategy?

  2. The model misses "atypical" diabetes cases — younger people with lower BMI who don't match the typical pattern. How could the screening program address this gap without abandoning the risk-based approach?

  3. If the model were trained on data from an urban population but applied in a rural area with different demographics and health profiles, what problems might arise?

  4. Is it ethical to use a model to decide who gets screened for a serious disease? What if the alternative is random screening that catches fewer cases? How do you weigh efficiency against equity?

  5. Dr. Okafor framed screening invitations carefully to avoid the base rate illusion. Why is this important, and what harm could result from poor framing?


Key Takeaways from This Case Study

  • Classification can allocate limited resources more effectively than random selection
  • Probability rankings are more useful than binary predictions when resources are constrained
  • The threshold should be set based on the specific decision context, not defaulted to 0.5
  • Risk-based screening catches typical cases but may miss atypical ones — this is an inherent limitation
  • Class imbalance means most "flagged" individuals will be false positives — communication must be careful
  • Ethical considerations (equity, autonomy, framing) are inseparable from technical modeling decisions
  • The baseline comparison (targeted vs. random) demonstrates the model's practical value