Chapter 29: Evaluating Models — Accuracy, Precision, Recall, and Why "Good" Depends on the Question

"All models are wrong, but some are useful." — George Box, statistician


Chapter Overview

Let me tell you about a model that was 99.7% accurate and completely worthless.

A health insurance company wanted to predict which members would be hospitalized in the next year so they could offer proactive wellness programs. Their data scientist built a model, ran it on the test set, and proudly reported: "99.7% accuracy."

The director of analytics asked one question: "What does the model predict for most people?"

"That they won't be hospitalized."

"And what percentage of people actually aren't hospitalized?"

"About 99.7%."

Silence.

The model had learned a single, trivially correct rule: predict "not hospitalized" for everyone. It was right 99.7% of the time — and it identified exactly zero of the people who would be hospitalized. The 0.3% who needed help? The model missed every one of them.

This is the accuracy paradox, and it's the reason this chapter exists. Accuracy — the percentage of correct predictions — is the first metric everyone learns, and it's the first metric you should stop trusting as your default. Not because accuracy is wrong, but because accuracy is incomplete. It tells you how often the model is right but nothing about how it's wrong. And in most real-world problems, how you're wrong matters far more than how often you're right.

In this chapter, you'll learn to evaluate models the way professionals do — with a full toolkit of metrics that capture different dimensions of model quality. By the end, you'll never again look at a single accuracy number and say "good enough."

In this chapter, you will learn to:

  1. Explain why accuracy alone is insufficient, especially with imbalanced classes (all paths)
  2. Construct and interpret a confusion matrix (all paths)
  3. Define precision, recall, and F1 score and explain when to prioritize each (all paths)
  4. Generate and interpret a ROC curve and compute AUC (standard + deep dive paths)
  5. Implement k-fold cross-validation for reliable performance estimation (all paths)
  6. Compare multiple models using appropriate metrics (all paths)
  7. Apply regression evaluation metrics: RMSE, MAE, and R-squared (standard + deep dive paths)

29.1 The Accuracy Trap: When Being Right Isn't Enough

Let's build the intuition with an example that makes the problem visceral.

The Medical Screening Scenario

Imagine you're evaluating a screening test for a rare disease that affects 1 in 1,000 people. You test 10,000 people. Here are two possible models:

Model A: "Nobody has the disease"

  • Predicts "healthy" for all 10,000 people
  • Gets 9,990 right (the truly healthy ones) and 10 wrong (the sick ones it missed)
  • Accuracy: 9,990 / 10,000 = 99.9%
  • People it helped: zero

Model B: An actual screening test

  • Correctly identifies 8 of the 10 sick people (misses 2)
  • Correctly identifies 9,800 of the 9,990 healthy people (falsely alarms 190)
  • Accuracy: (8 + 9,800) / 10,000 = 98.08%
  • People it helped: 8 who got early treatment

Model A has higher accuracy. Model B is obviously better. The difference is that Model B finds the people who need help, even though it makes more total mistakes.

This isn't a contrived example. This exact scenario plays out in:

  • Fraud detection: 99.8% of transactions are legitimate. A model that says "not fraud" every time is 99.8% accurate.
  • Spam filtering: 95% of emails are not spam. A model that delivers everything to your inbox is 95% accurate.
  • Disease diagnosis: Most people don't have any particular disease. A model that says "healthy" is almost always right.
  • Manufacturing quality control: 99.9% of products pass inspection. A model that says "pass" for everything is 99.9% accurate.

The pattern is clear: when one class is much more common than the other, accuracy is dominated by the majority class and tells you almost nothing about the minority class — which is usually the one you care about.

This is called class imbalance, and it's not a rare edge case. It's the norm. Most interesting classification problems involve rare events: fraud, disease, equipment failure, customer churn, loan default. If you evaluate these models with accuracy alone, you'll consistently choose useless models over useful ones.
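The accuracy paradox is easy to reproduce. Here's a minimal sketch of the screening scenario from above, with the labels and predictions constructed by hand from the stated counts (not from real data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 people screened; 10 truly have the disease (label 1)
y_true = np.array([1] * 10 + [0] * 9990)

# Model A: predict "healthy" (0) for everyone
y_pred_a = np.zeros(10000, dtype=int)

# Model B: catches 8 of the 10 sick people, falsely flags 190 healthy ones
y_pred_b = np.concatenate([
    [1] * 8 + [0] * 2,          # the 10 sick people
    [1] * 190 + [0] * 9800,     # the 9,990 healthy people
])

print(f"Model A: accuracy={accuracy_score(y_true, y_pred_a):.4f}, "
      f"recall={recall_score(y_true, y_pred_a, zero_division=0):.2f}")
print(f"Model B: accuracy={accuracy_score(y_true, y_pred_b):.4f}, "
      f"recall={recall_score(y_true, y_pred_b):.2f}")
# Model A: accuracy=0.9990, recall=0.00
# Model B: accuracy=0.9808, recall=0.80
```

Model A "wins" on accuracy while finding nobody; recall (covered in Section 29.3) exposes the difference immediately.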

Check Your Understanding

  1. A credit card fraud model achieves 99.5% accuracy on a dataset where 0.5% of transactions are fraudulent. Is this model necessarily good? What's the simplest model that achieves the same accuracy?
  2. Would the accuracy problem be as severe if the classes were 60/40 instead of 99.5/0.5? Why or why not?

29.2 The Confusion Matrix: A Complete Picture of Model Behavior

To move beyond accuracy, we need a way to see how a model is right and how it's wrong. That's the confusion matrix.

A confusion matrix for a binary classifier is a 2x2 table:

                    Predicted: Positive    Predicted: Negative
                    ─────────────────     ─────────────────
Actual: Positive   True Positive (TP)    False Negative (FN)
Actual: Negative   False Positive (FP)   True Negative (TN)

Let's define each cell with the medical screening example:

  • True Positive (TP): The model predicted "sick" and the person IS sick. The test correctly found someone who needs help. This is good.

  • False Positive (FP): The model predicted "sick" but the person is actually healthy. A false alarm. This causes unnecessary worry and follow-up tests.

  • True Negative (TN): The model predicted "healthy" and the person IS healthy. The test correctly cleared someone. This is good.

  • False Negative (FN): The model predicted "healthy" but the person is actually sick. A missed case. This is dangerous — the person doesn't get treatment.

Building a Confusion Matrix in scikit-learn

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Assume y_test and y_pred are already defined from your model
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Visual version
disp = ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

A typical output might look like:

[[850  50]
 [ 30  70]]

Reading this: 850 true negatives, 50 false positives, 30 false negatives, 70 true positives. Total of 1,000 samples.

Accuracy from the confusion matrix: (850 + 70) / 1000 = 92%. But notice: of the 100 actual positives (30 + 70), the model only caught 70. That's a 70% detection rate for the class we care about — very different from the 92% that accuracy suggests.
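As a quick check, the same numbers fall straight out of the matrix with a few lines of arithmetic, using the example matrix above:

```python
import numpy as np

# scikit-learn's layout for binary labels 0/1: rows are actual, columns are
# predicted, with the negative class first — [[TN, FP], [FN, TP]]
cm = np.array([[850, 50],
               [ 30, 70]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()    # (70 + 850) / 1000 = 0.92
detection = tp / (tp + fn)         # 70 / 100 = 0.70, recall for the positive class
print(accuracy, detection)
```

Note the row order: scikit-learn puts the negative class in the first row, so the printed matrix reads [[TN, FP], [FN, TP]], whereas the conceptual table earlier in this section lists the positive class first.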

Why It's Called a "Confusion" Matrix: The name comes from the fact that it shows how the model "confuses" one class for another. Each off-diagonal cell represents a type of confusion.

The Four Outcomes in Different Contexts

  • Medical screening: TP = correctly identified sick person; FP = healthy person told they're sick (false alarm); FN = sick person told they're healthy (missed diagnosis); TN = correctly cleared healthy person
  • Spam filter: TP = spam correctly caught; FP = legitimate email in spam folder; FN = spam in inbox; TN = legitimate email delivered
  • Fraud detection: TP = fraud correctly flagged; FP = legitimate transaction blocked; FN = fraud that slipped through; TN = legitimate transaction approved
  • Manufacturing QC: TP = defective product caught; FP = good product rejected (wasted); FN = defective product shipped; TN = good product shipped

Notice that in every context, false positives and false negatives have different costs. A false positive in spam filtering is annoying (you miss an email). A false negative in medical screening is potentially fatal (you miss a disease). This asymmetry is why a single number like accuracy can't capture model quality — you need to understand the types of errors separately.


29.3 Precision and Recall: Two Sides of the Same Coin

Now we can define the two most important metrics for classification:

Precision: "Of the items I flagged, how many were actually positive?"

Precision = TP / (TP + FP)

Precision answers: when the model says "positive," how often is it right? High precision means few false alarms.

Using our confusion matrix (TP=70, FP=50):

Precision = 70 / (70 + 50) = 70 / 120 = 0.583

Only 58.3% of the model's "positive" predictions were actually positive. For every 10 people flagged as sick, about 4 are healthy and will undergo unnecessary procedures.

Recall: "Of all actual positives, how many did I find?"

Recall = TP / (TP + FN)

Recall answers: of all the truly positive cases, how many did the model catch? High recall means few missed cases. Recall is also called sensitivity (especially in medicine) or the true positive rate.

Using our confusion matrix (TP=70, FN=30):

Recall = 70 / (70 + 30) = 70 / 100 = 0.700

The model catches 70% of sick people but misses 30%. In a medical context, those 30 missed cases could be catastrophic.

The Precision-Recall Trade-off

Here's the fundamental tension: precision and recall are inversely related. You can almost always improve one at the expense of the other.

Think about it intuitively with a spam filter:

  • Maximum recall: Flag every email as spam. You'll catch 100% of spam (perfect recall) — but you'll also flag every legitimate email (terrible precision).
  • Maximum precision: Only flag an email as spam if you're absolutely certain — say, it contains the exact string "Nigerian prince." You'll have perfect precision (every flagged email really is spam) — but you'll miss most spam (terrible recall).

The practical question is: which type of error is more costly?

  • In medical screening, missing a sick person (false negative) could be fatal. Prioritize recall. You'd rather have false alarms than missed diagnoses.

  • In email spam filtering, putting a legitimate email in spam (false positive) could mean missing an important message. Prioritize precision. You'd rather let some spam through than lose real emails.

  • In criminal justice, convicting an innocent person (false positive) is considered worse than letting a guilty person go free. Prioritize precision. "Better that ten guilty persons escape than that one innocent suffer."

  • In fraud detection, the costs are more balanced. Blocking a legitimate transaction is bad (customer frustration), but missing fraud is also bad (financial loss). You need both precision and recall.

Computing in scikit-learn

from sklearn.metrics import precision_score, recall_score, classification_report

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")

# Or get everything at once:
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

The classification_report function is your best friend for model evaluation. It shows precision, recall, F1, and support (sample count) for each class, plus the overall accuracy.

              precision    recall  f1-score   support

    Negative       0.97      0.94      0.96       900
    Positive       0.58      0.70      0.64       100

    accuracy                           0.92      1000
   macro avg       0.77      0.82      0.80      1000
weighted avg       0.93      0.92      0.92      1000

Read this carefully. The overall accuracy (0.92) looks great. But precision for the positive class (0.58) and recall (0.70) tell a more nuanced story. This model catches 70% of positives but has a lot of false alarms. Whether that's acceptable depends entirely on the problem context.

Check Your Understanding

  1. In a spam filter, would you prioritize precision or recall? What happens if you get it backwards?
  2. A model has precision = 0.90 and recall = 0.30. Describe this model's behavior in plain English. Is it useful?
  3. Can a model have high precision AND high recall? What would that look like in the confusion matrix?

29.4 The F1 Score: Balancing Precision and Recall

Sometimes you need a single number that balances precision and recall. That's the F1 score — the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why the harmonic mean instead of the regular (arithmetic) mean? Because the harmonic mean punishes extreme imbalances. If precision is 0.99 and recall is 0.01, the arithmetic mean would be 0.50 — suggesting a mediocre but functional model. The harmonic mean gives 0.02 — revealing that the model is practically useless for finding positives.
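The arithmetic in that comparison is worth seeing directly:

```python
p, r = 0.99, 0.01

arithmetic = (p + r) / 2          # 0.50: suggests a mediocre but functional model
harmonic = 2 * p * r / (p + r)    # ~0.02: reveals the model barely finds positives
print(arithmetic, harmonic)
```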

from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)
print(f"F1 score: {f1:.3f}")

Using our example (precision = 0.583, recall = 0.700):

F1 = 2 × (0.583 × 0.700) / (0.583 + 0.700) = 2 × 0.408 / 1.283 = 0.636

When to Use F1

The F1 score is useful when:

  • You care about both precision and recall roughly equally
  • You're comparing models and need a single number
  • You're dealing with imbalanced classes (accuracy is misleading)

The F1 score is NOT the right choice when:

  • False positives and false negatives have very different costs (use precision or recall directly)
  • You need a threshold-independent measure (use AUC, covered next)
  • Your classes are balanced AND you just want overall correctness (accuracy is fine)


29.5 ROC Curves and AUC: The Threshold-Independent View

Every classifier that outputs probabilities (logistic regression, random forests, etc.) has a hidden knob: the classification threshold. By default, scikit-learn predicts class 1 when the predicted probability exceeds 0.5. But you can change this:

# Default: threshold = 0.5
y_pred_default = (model.predict_proba(X_test)[:, 1] >= 0.5).astype(int)

# Lower threshold: catches more positives but more false alarms
y_pred_aggressive = (model.predict_proba(X_test)[:, 1] >= 0.3).astype(int)

# Higher threshold: fewer false alarms but misses more positives
y_pred_conservative = (model.predict_proba(X_test)[:, 1] >= 0.7).astype(int)

Lowering the threshold increases recall (you flag more cases as positive, catching more true positives) but decreases precision (you also flag more false positives). Raising the threshold has the opposite effect.
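You can watch this happen by sweeping the threshold yourself. The sketch below uses a synthetic imbalanced dataset from make_classification and a logistic regression; the dataset and parameters are illustrative, not the chapter's vaccination data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Imbalanced synthetic data: roughly 10% positives
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred, zero_division=0)
    r = recall_score(y_test, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}  recall={r:.2f}")
```

Recall can only fall (or stay flat) as the threshold rises, because the set of flagged examples shrinks; precision usually rises, though not strictly monotonically.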

The ROC curve (Receiver Operating Characteristic) shows you how the trade-off changes across all possible thresholds. It plots:

  • X-axis: False Positive Rate (FPR) = FP / (FP + TN) — the fraction of negatives incorrectly flagged
  • Y-axis: True Positive Rate (TPR) = TP / (TP + FN) — the fraction of positives correctly caught (this is recall!)

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Get predicted probabilities (not class labels)
y_proba = model.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'Model (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Reading a ROC Curve

  • The diagonal line (from (0,0) to (1,1)) represents a random classifier — one that's no better than flipping a coin. Any useful model should be above this line.

  • The upper-left corner (0, 1) represents a perfect classifier — 100% true positive rate, 0% false positive rate. The closer your curve hugs this corner, the better.

  • The curve itself traces out every possible TPR/FPR trade-off as you sweep the threshold from high to low. At threshold = 1.0, you predict everything as negative (bottom-left corner). At threshold = 0.0, you predict everything as positive (top-right corner).

AUC: Area Under the Curve

AUC (Area Under the ROC Curve) collapses the entire ROC curve into a single number between 0 and 1. It represents the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example.

AUC Range      Interpretation
──────────     ─────────────────────────────────────
0.90-1.00      Excellent
0.80-0.90      Good
0.70-0.80      Fair
0.60-0.70      Poor
0.50-0.60      Barely better than random
Below 0.50     Worse than random (something's wrong)

AUC is threshold-independent — it evaluates the model's ability to rank examples by their probability of being positive, regardless of where you set the cutoff. This makes it especially useful for comparing models.
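The ranking interpretation can be verified directly: compute AUC with scikit-learn, then estimate it by checking every (positive, negative) pair. The synthetic scores below are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 20)
# Positives tend to score higher than negatives, but the distributions overlap
scores = np.where(y == 1, rng.normal(0.7, 0.2, 100), rng.normal(0.4, 0.2, 100))

auc = roc_auc_score(y, scores)

# Fraction of (positive, negative) pairs where the positive is ranked higher
pos, neg = scores[y == 1], scores[y == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(f"{auc:.4f} {pairwise:.4f}")  # the two numbers agree
```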

Comparing Models with ROC Curves

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=200)
}

plt.figure(figsize=(8, 6))

for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC = {auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves: Model Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

This gives you a visual and numerical comparison of all your models at every possible threshold. The model with the highest AUC is generally the best ranker — but the best model for your specific problem also depends on where you set the threshold, which depends on the costs of false positives vs. false negatives.


29.6 Cross-Validation: Don't Trust a Single Split

So far, we've been evaluating models on a single train/test split. But here's an uncomfortable truth: the specific test accuracy you get depends on which samples happened to land in the test set.

Try this experiment:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

scores = []
for seed in range(20):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    scores.append(rf.score(X_test, y_test))

print(f"Accuracy range: {min(scores):.3f} to {max(scores):.3f}")
print(f"Mean: {sum(scores)/len(scores):.3f}")

You'll see that the accuracy varies by several percentage points depending on the random split. A model that looks 3% better than another on one split might look 3% worse on a different split. Making model selection decisions based on a single split is like choosing a restaurant based on one meal — you might have caught them on a great day or a terrible one.

K-Fold Cross-Validation

K-fold cross-validation solves this by using every sample for both training and testing:

  1. Divide the data into k equal-sized folds (typically k=5 or k=10).
  2. Train the model on k-1 folds and test on the remaining fold.
  3. Repeat k times, using a different fold as the test set each time.
  4. Average the k test scores to get the final estimate.

Fold 1: [TEST] [train] [train] [train] [train]  → score₁
Fold 2: [train] [TEST] [train] [train] [train]  → score₂
Fold 3: [train] [train] [TEST] [train] [train]  → score₃
Fold 4: [train] [train] [train] [TEST] [train]  → score₄
Fold 5: [train] [train] [train] [train] [TEST]  → score₅

Final score = mean(score₁, score₂, score₃, score₄, score₅)

Every sample appears in exactly one test fold, so every sample contributes to the evaluation. The average across folds is a much more stable estimate than any single split.

from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')

print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

The +/- standard deviation tells you how stable the model is. A model with "0.85 +/- 0.02" is very consistent. A model with "0.85 +/- 0.08" is unreliable — its performance fluctuates wildly depending on the data split.

Stratified K-Fold

For classification problems, especially with imbalanced classes, use stratified k-fold, which ensures each fold has approximately the same class distribution as the full dataset:

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=skf, scoring='f1')

print(f"F1 scores: {scores}")
print(f"Mean F1: {scores.mean():.3f} (+/- {scores.std():.3f})")

Notice that you can change the scoring parameter to evaluate any metric: 'accuracy', 'precision', 'recall', 'f1', 'roc_auc', etc.

Cross-Validation for Model Comparison

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree (depth=5)': DecisionTreeClassifier(max_depth=5),
    'Random Forest (200 trees)': RandomForestClassifier(n_estimators=200)
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")

Now you're comparing models fairly — using averaged performance across multiple data splits, with the metric that actually matters for your problem.


29.7 Learning Curves: Diagnosing Overfitting and Underfitting

How do you know if your model is overfitting (too complex) or underfitting (too simple)? Learning curves show you by plotting model performance as a function of training set size.

from sklearn.model_selection import learning_curve
import numpy as np

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy'
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), 's-', label='Cross-validation score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve: Random Forest')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Reading Learning Curves

Overfitting pattern: Training score is high (near 1.0), cross-validation score is much lower, and the gap doesn't close as you add more data. The model is memorizing training data. Fix: simplify the model, add regularization, or get more data.

Underfitting pattern: Both training and cross-validation scores are low, even with lots of data. The model isn't complex enough to capture the patterns. Fix: use a more complex model, add features, or reduce regularization.

Good fit pattern: Training and cross-validation scores converge to a reasonably high value as training size increases. The gap is small. This is what you want.

OVERFITTING                    UNDERFITTING                  GOOD FIT
Score                          Score                         Score
  1.0 ── train                   1.0                          1.0
  0.9                            0.9                          0.9 ── train
  0.8                            0.8                          0.8 ── test
  0.7                            0.7                          0.7
  0.6 ── test                    0.6 ── train                 0.6
  0.5                            0.5 ── test                  0.5
       training size                  training size                 training size

29.8 Regression Metrics: RMSE, MAE, and R-Squared

Classification has precision, recall, and F1. Regression has its own set of metrics. Since you built a linear regression in Chapter 26, let's formalize how to evaluate it.

Mean Absolute Error (MAE)

MAE is the average of the absolute differences between predictions and actual values. It tells you, "on average, how far off are my predictions?"

MAE = (1/n) × Σ|yᵢ - ŷᵢ|

from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.2f}")

If MAE = 5.3 and you're predicting house prices in thousands of dollars, your model is off by $5,300 on average. MAE is easy to interpret because it's in the same units as the target variable.

Root Mean Squared Error (RMSE)

RMSE is the square root of the average of squared differences. It penalizes large errors more heavily than MAE because squaring amplifies big deviations.

RMSE = sqrt((1/n) × Σ(yᵢ - ŷᵢ)²)

from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.2f}")

RMSE is always greater than or equal to MAE. If RMSE is much larger than MAE, it means there are some large errors pulling up the average. If they're close, errors are fairly uniform.
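A tiny numeric example (made-up values) makes the relationship concrete:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true    = np.array([100, 102, 98, 101, 99])
y_uniform = np.array([103, 105, 101, 104, 102])   # every prediction off by exactly 3
y_outlier = np.array([100, 102, 98, 101, 114])    # four perfect, one off by 15

for name, y_pred in [("uniform errors", y_uniform), ("one big error", y_outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{name}: MAE={mae:.2f}  RMSE={rmse:.2f}")
# uniform errors: MAE=3.00  RMSE=3.00
# one big error: MAE=3.00  RMSE=6.71
```

Both prediction sets have the same MAE, but the single 15-point miss more than doubles RMSE: exactly the gap between the two metrics that the paragraph above describes.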

R-Squared (R²)

R-squared measures the proportion of variance in the target variable that the model explains. It ranges from negative infinity to 1:

  • R² = 1.0: Perfect predictions (the model explains all variance)
  • R² = 0.0: The model is no better than predicting the mean for everything
  • R² < 0: The model is worse than predicting the mean (something is very wrong)

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.3f}")

An R² of 0.75 means the model explains 75% of the variance in the target. Whether that's "good" depends on the domain — in physics, you'd expect R² > 0.99; in social science, R² = 0.30 might be excellent.
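The R² = 0 and R² < 0 cases can be checked directly (made-up values):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 12.0, 9.0, 14.0, 11.0])

# Predicting the mean for every sample: R² is exactly 0
y_mean = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, y_mean))   # 0.0

# Predictions that move the wrong way relative to the mean: R² goes negative
y_bad = y_true.mean() + 3 * (y_true.mean() - y_true)
print(r2_score(y_true, y_bad))    # well below 0
```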

Which Regression Metric to Use?

Metric    Best For                                Units            Interpretation
MAE       When all errors are equally important   Same as target   "Average error magnitude"
RMSE      When large errors are especially bad    Same as target   "Penalizes outlier errors"
R²        Comparing models on different scales    Unitless (≤ 1)   "Fraction of variance explained"

Use MAE when you want an intuitive sense of typical error. Use RMSE when big mistakes are particularly costly. Use R² when comparing models or communicating to stakeholders who want a percentage.


29.9 Putting It All Together: A Complete Model Evaluation

Let's evaluate all three models from Chapters 26-28 on the vaccination coverage problem.

import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, roc_auc_score,
                             roc_curve, confusion_matrix)
import matplotlib.pyplot as plt

# Prepare data (same as previous chapters)
health = pd.read_csv('global_health_indicators.csv')
median_rate = health['vaccination_rate'].median()
health['high_coverage'] = (health['vaccination_rate'] >= median_rate).astype(int)

features = ['gdp_per_capita', 'health_spending_pct', 'physicians_per_1000',
            'literacy_rate', 'urban_population_pct']
X = health[features].dropna()
y = health.loc[X.index, 'high_coverage']
# Cross-validated comparison
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree (depth=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest (200 trees)': RandomForestClassifier(n_estimators=200, random_state=42)
}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Cross-Validated Scores (5-fold)")
print("-" * 60)
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
    f1 = cross_val_score(model, X, y, cv=skf, scoring='f1')
    auc = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
    print(f"\n{name}:")
    print(f"  Accuracy: {acc.mean():.3f} (+/- {acc.std():.3f})")
    print(f"  F1:       {f1.mean():.3f} (+/- {f1.std():.3f})")
    print(f"  AUC:      {auc.mean():.3f} (+/- {auc.std():.3f})")
# ROC curves on a single held-out test set for visualization
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

plt.figure(figsize=(8, 6))
for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Model Comparison: ROC Curves for Vaccination Coverage')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('project_roc_comparison.png', dpi=150)
plt.show()

This gives you a complete, professional model comparison: cross-validated scores for reliability, multiple metrics for different perspectives, and ROC curves for visual assessment.


29.10 Choosing the Right Metric: A Decision Framework

With all these metrics available, how do you choose? Here's a decision framework:

Is your problem classification or regression?
  |
  ├── REGRESSION → Use RMSE or MAE (plus R² for context)
  |     |
  |     ├── Large errors especially costly? → RMSE
  |     └── All errors equally important? → MAE
  |
  └── CLASSIFICATION
        |
        ├── Are classes balanced (roughly 50/50)?
        |     └── Yes → Accuracy is fine, plus F1 for safety
        |
        └── Are classes imbalanced?
              |
              ├── What matters more — catching positives or avoiding false alarms?
              |     |
              |     ├── Catching positives (medical screening, fraud detection)
              |     |     → Prioritize RECALL
              |     |
              |     ├── Avoiding false alarms (spam filter, criminal justice)
              |     |     → Prioritize PRECISION
              |     |
              |     └── Both matter roughly equally
              |           → Use F1 score
              |
              └── Need a threshold-independent comparison?
                    → Use AUC

And always, always, always: use cross-validation to get reliable estimates, not a single train/test split.


29.11 Progressive Project: Evaluating All Your Models

Time to bring everything together. In previous chapters, you built three models for the vaccination coverage project. Now let's evaluate them properly.

Project Task: Comprehensive Model Evaluation

# 1. Cross-validated comparison (see Section 29.9 code)
# Run the cross-validation code from above and record results

# 2. Confusion matrices for each model
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X_train, y_train)
    cm = confusion_matrix(y_test, model.predict(X_test))
    disp = ConfusionMatrixDisplay(cm, display_labels=['Low', 'High'])
    disp.plot(ax=ax, cmap='Blues')
    ax.set_title(name)
plt.tight_layout()
plt.savefig('project_confusion_matrices.png', dpi=150)
plt.show()

# 3. Detailed classification reports
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"\n{'='*50}")
    print(f"{name}")
    print(f"{'='*50}")
    print(classification_report(y_test, model.predict(X_test),
                                target_names=['Low Coverage', 'High Coverage']))

What to write in your project notebook:

  1. Create a summary table comparing all three models across accuracy, F1, AUC, precision, and recall.
  2. For each model, describe its strengths and weaknesses using the evaluation metrics.
  3. Choose the "best" model for the vaccination coverage problem and justify your choice. Consider: who is the audience? What are the costs of false positives vs. false negatives?
  4. Reflect: did the "best" model change depending on which metric you used? What does this tell you about the importance of choosing the right evaluation metric?

Milestone Check: Your project notebook should now contain a comprehensive model comparison with confusion matrices, classification reports, ROC curves, and cross-validated scores. In Chapter 30, you'll package everything into a clean scikit-learn pipeline.


29.12 Common Mistakes in Model Evaluation

Mistake 1: Evaluating on Training Data

This is the most fundamental error. Your training accuracy is meaningless for predicting real-world performance. Always evaluate on held-out data (test set or cross-validation).
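To see the gap concretely, here's a minimal sketch using synthetic noisy data and an unconstrained decision tree (both chosen for illustration — any model flexible enough to memorize its training set will show the same pattern):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise: an unconstrained tree will memorize it.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)   # typically 1.0 -- pure memorization
test_acc = tree.score(X_test, y_test)      # noticeably lower on held-out data

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

The training score says nothing about generalization; only the held-out score does.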

Mistake 2: Using Accuracy with Imbalanced Classes

We've hammered this point, but it bears repeating. If your classes are 95/5, a model that always predicts the majority class is 95% accurate and 100% useless. Use F1, precision, recall, or AUC instead.
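You can watch this failure happen with scikit-learn's `DummyClassifier`, which implements exactly that majority-class rule (the 95/5 synthetic data below is invented for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 95/5 imbalance: 950 negatives, 50 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")   # 0.95 -- looks great
print(f"Recall:   {recall_score(y, y_pred):.2f}")     # 0.00 -- catches nothing
print(f"F1:       {f1_score(y, y_pred):.2f}")         # 0.00
```

Recall and F1 immediately expose what accuracy hides: the model never finds a single positive case.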

Mistake 3: Comparing Models on a Single Split

A single train/test split is a single roll of the dice. Use cross-validation to get stable estimates. If model A beats model B by 1% on one split, that difference might be noise.

Mistake 4: Ignoring the Variance of Cross-Validation Scores

"0.85 accuracy" sounds solid. But "0.85 +/- 0.10" means the model's performance fluctuates wildly — from 0.75 to 0.95 depending on the fold. Pay attention to the standard deviation, not just the mean.
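A minimal sketch of reporting the spread alongside the mean with `cross_val_score` (the dataset and pipeline here are stand-ins, not the chapter's vaccination models):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
# Report the mean AND the spread, not a single number.
print(f"Fold scores: {np.round(scores, 3)}")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Two models with the same mean but very different standard deviations are not equally trustworthy.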

Mistake 5: Optimizing the Wrong Metric

If your problem is catching fraud, optimizing for accuracy will fail. If your problem is predicting house prices, optimizing for R² when you really care about absolute dollar error will mislead you. Always choose the metric that reflects the business cost of being wrong.

Mistake 6: Peeking at the Test Set

If you look at the test set while developing your model — adjusting hyperparameters until the test score improves — you've contaminated your evaluation. The test set should be touched exactly once, at the very end. Use cross-validation on the training set for model selection.
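One way to enforce this discipline is to let `GridSearchCV` do all tuning via cross-validation on the training set, then score the test set exactly once at the end. A sketch (the dataset, pipeline, and `C` grid are illustrative choices, not prescriptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Hold the test set out from the very start.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# All model selection happens via cross-validation on the TRAINING set only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# The test set is touched exactly once, after all tuning is done.
final_score = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Final test accuracy: {final_score:.3f}")
```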


Summary

Accuracy alone is dangerously misleading, especially with imbalanced classes. The confusion matrix reveals four types of outcomes — true positives, false positives, true negatives, and false negatives — each with different real-world costs. Precision measures how many flagged items are truly positive; recall measures how many true positives are found. The F1 score balances both. ROC curves and AUC provide threshold-independent evaluation. Cross-validation gives stable performance estimates by averaging across multiple data splits. The right metric depends on the problem context — specifically, the relative costs of different types of errors.

For regression, MAE measures average error magnitude, RMSE penalizes large errors, and R-squared measures explained variance.

The most important lesson: always ask "what are the costs of being wrong?" before choosing how to evaluate your model.

Coming up in Chapter 30: You know how to build models and evaluate them. Now let's learn to do it properly — with scikit-learn pipelines that handle preprocessing, model selection, and cross-validation in a single, reproducible workflow. This is where we tie together everything from Part V.