Case Study 1: Evaluating a Medical Diagnostic Model

Precision/Recall Tradeoffs in Breast Cancer Screening

Background

Dr. Sarah Chen leads the AI diagnostics team at Metro General Hospital. Her team has developed a machine learning model to classify mammogram images as either "suspicious" (requiring follow-up biopsy) or "normal." The model will serve as a second reader, assisting radiologists rather than replacing them.

The stakes are asymmetric and severe:

  • A false negative (missed cancer classified as normal) delays diagnosis, potentially allowing the disease to progress to a later stage. Estimated cost: catastrophic -- both in patient outcomes and liability.
  • A false positive (healthy tissue classified as suspicious) leads to an unnecessary biopsy. Estimated cost: patient anxiety, a minor invasive procedure ($2,000-5,000), and radiologist time.

The team's dataset contains 50,000 mammogram feature vectors extracted by a pre-trained convolutional neural network. Of these, 2,500 (5%) are confirmed cancer cases and 47,500 (95%) are confirmed benign.

The Challenge

Dr. Chen must decide:

  1. Which evaluation metrics to use.
  2. What classification threshold to set for the deployed model.
  3. How to communicate performance to clinicians, hospital administrators, and patients.

Step 1: Choosing Appropriate Metrics

The team initially reported accuracy: 97.2%. The hospital administrator was thrilled. But Dr. Chen knew better. She asked her team to compute the full set of metrics.

At the default threshold of 0.5:

Metric                 Value
Accuracy               97.2%
Precision              0.78
Recall (Sensitivity)   0.62
Specificity            0.99
F1 Score               0.69
F2 Score               0.65
AUC-ROC                0.94
AUC-PR                 0.72

Critical observation: A recall of 0.62 means the model misses 38% of cancers. For a screening tool, this is unacceptable. The impressive accuracy figure hides this failure because the negative class makes up 95% of the data and dominates the calculation.
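
To see why, consider a trivial baseline that labels every mammogram as normal. A minimal sketch using the dataset proportions described above (the arrays are synthetic stand-ins, not the team's data):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels matching the dataset: 2,500 cancers, 47,500 benign.
y_true = np.array([1] * 2_500 + [0] * 47_500)

# A "model" that calls every case normal.
y_pred_all_normal = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred_all_normal))                 # 0.95 -- looks impressive
print(recall_score(y_true, y_pred_all_normal, zero_division=0))  # 0.0  -- misses every cancer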

Dr. Chen rejected accuracy and AUC-ROC as primary metrics. Instead, she chose:

  • Primary metric: Recall (sensitivity) -- must exceed 0.95.
  • Secondary metric: Precision at the required recall level.
  • Monitoring metric: F2 score (weights recall 2x over precision).
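
A minimal sketch of how these chosen metrics might be computed with scikit-learn at a given threshold (y_true and y_prob are assumed to be held-out labels and predicted probabilities as NumPy arrays; the 0.95 floor mirrors the primary requirement):

from sklearn.metrics import precision_score, recall_score, fbeta_score

def screening_metrics(y_true, y_prob, threshold: float) -> dict:
    """Recall, precision, and F2 at one classification threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    recall = recall_score(y_true, y_pred, zero_division=0)
    return {
        "recall": recall,
        "precision": precision_score(y_true, y_pred, zero_division=0),
        # beta=2 weights recall twice as heavily as precision.
        "f2": fbeta_score(y_true, y_pred, beta=2, zero_division=0),
        "meets_recall_floor": recall >= 0.95,
    }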

Step 2: Threshold Analysis

The model outputs a continuous probability. By varying the classification threshold, the team generated the following operating points:

Threshold   Precision   Recall   F1     F2     FP per 1,000
0.70        0.89        0.45     0.60   0.51     3
0.50        0.78        0.62     0.69   0.65     9
0.30        0.55        0.82     0.66   0.75    35
0.20        0.38        0.91     0.53   0.72    78
0.15        0.29        0.95     0.44   0.67   122
0.10        0.19        0.98     0.32   0.56   213

The following helper generated these operating points from the validation labels and predicted probabilities:

import numpy as np

def threshold_analysis(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    thresholds: list[float]
) -> list[dict]:
    """Analyze model performance at different classification thresholds.

    Args:
        y_true: Ground truth binary labels.
        y_prob: Predicted probabilities for the positive class.
        thresholds: List of threshold values to evaluate.

    Returns:
        List of dictionaries with metrics at each threshold.
    """
    results = []
    for threshold in thresholds:
        y_pred = (y_prob >= threshold).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        f2 = 5 * precision * recall / (4 * precision + recall) if (4 * precision + recall) > 0 else 0

        results.append({
            "threshold": threshold,
            "precision": round(precision, 3),
            "recall": round(recall, 3),
            "f1": round(f1, 3),
            "f2": round(f2, 3),
            "fp_per_1000": round(fp / (fp + tn) * 1000, 0),
        })

    return results
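
For example, the operating-point table above could be regenerated along the following lines (the labels and probabilities here are synthetic stand-ins for the team's validation split, so the printed values will differ from the table):

# Synthetic stand-in data: 5% prevalence, probabilities loosely separated by class.
rng = np.random.default_rng(0)
y_val = (rng.random(50_000) < 0.05).astype(int)
p_val = np.clip(0.7 * y_val + 0.25 * rng.random(50_000), 0.0, 1.0)

for point in threshold_analysis(y_val, p_val, [0.70, 0.50, 0.30, 0.20, 0.15, 0.10]):
    print(point)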

Step 3: Decision-Making

Dr. Chen convened a meeting with the radiology department to discuss the tradeoff.

Option A (Threshold = 0.50): The "default" option. Precision of 0.78 means few unnecessary biopsies. But recall of 0.62 means 38% of cancers are missed. The radiologists unanimously rejected this.

Option B (Threshold = 0.15): Catches 95% of cancers (recall = 0.95). But precision drops to 0.29, meaning 71% of flagged cases are false positives. This translates to 122 false positives per 1,000 screened patients.

Option C (Threshold = 0.10): Catches 98% of cancers. But 213 false positives per 1,000 -- roughly 1 in 5 patients flagged unnecessarily.

The team chose Option B (threshold = 0.15) with the following reasoning:

  1. Clinical requirement: A sensitivity of at least 95% (missing no more than 5% of cancers) is the minimum acceptable level for a screening tool, consistent with established clinical guidelines.
  2. Workflow impact: 122 false positives per 1,000 sounds high, but the model is a second reader: a radiologist still reviews every flagged case, and the additional biopsy referrals are manageable within the department's capacity.
  3. Cost-benefit analysis: Each missed cancer (FN) costs an estimated $500,000 in delayed treatment and liability; each unnecessary biopsy (FP) costs roughly $3,500. With 50 cancers per 1,000 patients screened (5% prevalence), the expected cost per 1,000 patients at threshold 0.15 is $50 \times 0.05 \times \$500{,}000 + 122 \times \$3{,}500 = \$1{,}677{,}000$, versus $50 \times 0.38 \times \$500{,}000 + 9 \times \$3{,}500 = \$9{,}531{,}500$ at threshold 0.50 (see the sketch after this list).
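
A minimal sketch of this expected-cost comparison, using the cost assumptions above and the operating points from the threshold table (all quantities are per 1,000 patients screened; the dollar figures are the case study's estimates, not actuarial costs):

CANCERS_PER_1000 = 50          # 5% prevalence, as in the dataset
COST_MISSED_CANCER = 500_000   # estimated cost of a false negative
COST_UNNEEDED_BIOPSY = 3_500   # estimated cost of a false positive

def expected_cost_per_1000(recall: float, fp_per_1000: float) -> float:
    """Expected cost per 1,000 screened patients at one operating point."""
    missed_cancers = CANCERS_PER_1000 * (1 - recall)
    return missed_cancers * COST_MISSED_CANCER + fp_per_1000 * COST_UNNEEDED_BIOPSY

print(f"{expected_cost_per_1000(recall=0.95, fp_per_1000=122):,.0f}")  # threshold 0.15 -> 1,677,000
print(f"{expected_cost_per_1000(recall=0.62, fp_per_1000=9):,.0f}")    # threshold 0.50 -> 9,531,500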

Step 4: Validation Strategy

Given the critical nature of the application, the team used a rigorous validation approach:

  1. Stratified 10-fold cross-validation to estimate performance with uncertainty bounds.
  2. Temporal validation: The most recent 6 months of data were held out as a temporal test set, because the imaging equipment was upgraded mid-study and data distribution may have shifted.
  3. Subgroup analysis: Performance was evaluated separately by patient age group, breast density category, and imaging center to ensure the model did not systematically fail for any subpopulation.
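
A minimal sketch of the cross-validation step (item 1), assuming the features X and labels y are NumPy arrays and clf is any scikit-learn classifier exposing predict_proba; the 0.15 threshold and fold count follow the choices above:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score, precision_score

def cross_validated_operating_point(clf, X, y, threshold=0.15, n_splits=10, seed=0):
    """Estimate recall/precision at a fixed threshold with stratified K-fold CV."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    recalls, precisions = [], []
    for train_idx, test_idx in cv.split(X, y):
        fold_model = clone(clf).fit(X[train_idx], y[train_idx])
        fold_probs = fold_model.predict_proba(X[test_idx])[:, 1]
        fold_preds = (fold_probs >= threshold).astype(int)
        recalls.append(recall_score(y[test_idx], fold_preds, zero_division=0))
        precisions.append(precision_score(y[test_idx], fold_preds, zero_division=0))
    # Mean +/- standard deviation across folds gives rough uncertainty bounds.
    return {
        "recall": (float(np.mean(recalls)), float(np.std(recalls))),
        "precision": (float(np.mean(precisions)), float(np.std(precisions))),
    }

The subgroup analysis (item 3) used the following helper:
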
from sklearn.metrics import recall_score, precision_score

def subgroup_evaluation(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    group_labels: np.ndarray,
    threshold_recall: float = 0.95
) -> dict[str, dict]:
    """Evaluate model performance across subgroups.

    Args:
        y_true: Ground truth labels.
        y_pred: Predicted labels (after thresholding).
        group_labels: Array indicating subgroup membership.
        threshold_recall: Minimum acceptable recall.

    Returns:
        Dictionary mapping group names to their metrics.
    """
    results = {}
    for group in np.unique(group_labels):
        mask = group_labels == group
        if np.sum(y_true[mask]) == 0:
            continue

        recall = recall_score(y_true[mask], y_pred[mask])
        precision = precision_score(y_true[mask], y_pred[mask], zero_division=0)

        results[str(group)] = {
            "n_samples": int(np.sum(mask)),
            "n_positive": int(np.sum(y_true[mask])),
            "recall": round(recall, 3),
            "precision": round(precision, 3),
            "meets_threshold": recall >= threshold_recall,
        }

    return results
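
An illustrative call, with synthetic labels, thresholded predictions, and a made-up BI-RADS density category per patient (the real analysis used the hospital's recorded categories):

rng = np.random.default_rng(1)
y_val = (rng.random(10_000) < 0.05).astype(int)
y_hat = ((0.7 * y_val + 0.25 * rng.random(10_000)) >= 0.15).astype(int)
density = rng.choice(np.array(["A", "B", "C", "D"]), size=10_000)

for group, metrics in subgroup_evaluation(y_val, y_hat, density).items():
    print(f"BI-RADS {group}: {metrics}")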

The subgroup analysis revealed a critical finding: recall dropped to 0.87 for patients with extremely dense breast tissue (BI-RADS category D). The team decided to add an alert when the model processes dense breast tissue cases, recommending additional imaging modalities.

Step 5: Communicating Results

Dr. Chen prepared three different reports for three audiences:

For clinicians: "The model catches 95 out of 100 cancers. For every cancer it catches, it also flags about 2.4 non-cancerous cases. You should treat the model's output as a recommendation, not a diagnosis."

For administrators: "The model reduces missed cancers from our current rate of 12% to approximately 5%, potentially preventing 35 delayed diagnoses per year. The additional workload from false positives is approximately 40 extra biopsies per month, within our current capacity."

For regulatory submission: Full statistical report with cross-validated metrics, confidence intervals, subgroup analyses, calibration curves, and temporal stability analysis.

Lessons Learned

  1. Accuracy is insufficient for imbalanced, high-stakes classification. Always examine precision and recall at the operating point that matches your application requirements.

  2. The threshold is a business decision, not a technical one. The optimal threshold depends on the relative costs of false positives and false negatives, which are domain-specific.

  3. Subgroup analysis is essential for medical AI. A model that performs well on average may fail for specific patient populations, raising both clinical and ethical concerns (see Chapter 20 on AI ethics).

  4. Calibration matters when probabilities are used for clinical decision-making. A model that says "70% chance of cancer" should be right about 70% of the time, not 40%. A minimal calibration check is sketched after this list.

  5. Temporal validation catches distribution shifts that cross-validation misses. Medical data distributions change as equipment is upgraded, protocols evolve, and patient demographics shift.
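
A minimal calibration check along these lines can reveal whether predicted probabilities track observed cancer rates (the labels and probabilities here are synthetic stand-ins for a held-out set):

import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(2)
y_val = (rng.random(10_000) < 0.05).astype(int)
p_val = np.clip(0.6 * y_val + 0.3 * rng.random(10_000), 0.0, 1.0)

# Fraction of true cancers observed in each bin of predicted probability.
frac_observed, mean_predicted = calibration_curve(y_val, p_val, n_bins=10)
for predicted, observed in zip(mean_predicted, frac_observed):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")

A well-calibrated model keeps the predicted and observed columns close to each other across bins.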

Connection to Chapter Concepts

This case study directly applies the following concepts from Chapter 8:

  • Section 8.4.3: Precision and recall as complementary metrics for imbalanced classification.
  • Section 8.4.4: F-beta score with $\beta = 2$ for recall-weighted evaluation.
  • Section 8.4.6: AUC-PR as a more informative summary than AUC-ROC for imbalanced data.
  • Section 8.3.2: Stratified cross-validation to maintain class proportions.
  • Section 8.10.1: Data leakage prevention through proper preprocessing.
  • Section 8.11.1: Calibration assessment for probability-based decisions.
  • Section 8.11.2: Fairness evaluation through subgroup analysis.

Discussion Questions

  1. If the hospital could only afford to perform 50 additional biopsies per month due to false positives, what threshold would you recommend? How would you handle the recall tradeoff?

  2. A competing model achieves recall of 0.97 at precision of 0.25. Dr. Chen's model achieves recall of 0.95 at precision of 0.29. How would you statistically compare these two models? Which would you deploy?

  3. The model was trained on data from 3 hospitals. A 4th hospital wants to deploy it but has different imaging equipment. What evaluation strategy would you recommend before deployment?

  4. How would you monitor the model's performance after deployment to detect performance degradation over time?