Chapter 16: Model Evaluation Deep Dive

Cross-Validation, Stratification, Leakage Detection, and Proper Scoring


Learning Objectives

By the end of this chapter, you will be able to:

  1. Implement k-fold, stratified k-fold, and group k-fold cross-validation
  2. Detect and prevent data leakage in evaluation pipelines
  3. Choose the right metric for the business problem (accuracy, precision, recall, F1, AUC-ROC, AUC-PR, log-loss, RMSE, MAE)
  4. Read and interpret learning curves, validation curves, and calibration plots
  5. Compare models using proper statistical tests (not just "this one has higher accuracy")

If You Only Learn One Thing

Core Principle --- If you only learn one thing from this book, learn this: how you evaluate your model is more important than which model you choose. A mediocre model with honest evaluation will serve you better than a brilliant model with broken evaluation. Every bad model I have seen deployed in production got there because someone evaluated it wrong --- not because someone chose logistic regression instead of XGBoost.

This chapter is arguably the most important in the book. Everything before it --- feature engineering, model selection, hyperparameter tuning --- is wasted effort if you evaluate your models incorrectly. And evaluating models correctly is harder than it sounds, because the failure modes are subtle. A model that appears to achieve 99% accuracy can be utterly useless. A model that leaks future information into its training data will look perfect in development and collapse in production. A model that was validated on a single random split may have gotten lucky.

The goal of this chapter is to make you paranoid about evaluation --- in the productive, professional sense of the word. By the end, you will have the tools and the instinct to ask the right questions about any model's reported performance.


Part 1: Cross-Validation --- Why a Single Split Is Not Enough

The Problem with Train-Test Split

In Chapter 2, you learned the basic pattern: split your data into training and test sets, train on the training set, evaluate on the test set. This is correct in principle and dangerously incomplete in practice.

Here is the problem. Suppose you split 10,000 samples into 8,000 train and 2,000 test with random_state=42. You train a Random Forest and get an AUC of 0.847. Your colleague uses random_state=7 and gets an AUC of 0.862 on the same model. Neither of you is wrong. You drew different test sets, and the model's performance varies depending on which 2,000 samples happen to land in the test set.

A single train-test split gives you one noisy estimate of model performance. You do not know how much of that estimate is signal and how much is luck.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=12,
    n_redundant=4, flip_y=0.1, random_state=42
)

# Ten different random splits, same model
aucs = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    aucs.append(auc)

print("AUC across 10 random splits:")
print(f"  Mean:  {np.mean(aucs):.4f}")
print(f"  Std:   {np.std(aucs):.4f}")
print(f"  Min:   {np.min(aucs):.4f}")
print(f"  Max:   {np.max(aucs):.4f}")
print(f"  Range: {np.max(aucs) - np.min(aucs):.4f}")
AUC across 10 random splits:
  Mean:  0.9312
  Std:   0.0041
  Min:   0.9248
  Max:   0.9379
  Range: 0.0131

The range is 0.013 AUC --- enough to make you think one model configuration is meaningfully better than another when the difference is just the random split. Cross-validation solves this by giving you multiple estimates from the same data.

K-Fold Cross-Validation

The idea is simple. Instead of one split, make K splits. Each time, use K-1 folds for training and 1 fold for testing. Rotate the test fold so every sample gets evaluated exactly once.

from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)

scores = cross_val_score(rf, X, y, cv=kf, scoring='roc_auc')
print(f"5-Fold CV AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
print(f"Individual folds: {[f'{s:.4f}' for s in scores]}")
5-Fold CV AUC: 0.9308 +/- 0.0036
Individual folds: ['0.9341', '0.9265', '0.9334', '0.9275', '0.9327']

Now you have five estimates instead of one, a mean that is more stable than any single split, and a standard deviation that tells you how much the estimate varies. The mean plus-or-minus one standard deviation gives you a reasonable confidence interval.

How Many Folds? --- The standard choice is 5 or 10. Five folds trains on 80% of the data each time: the smaller training sets bias the estimate slightly pessimistic, but the larger test folds make each fold's score steadier. Ten folds trains on 90%: less pessimistic bias, but the smaller test folds make each fold's score noisier. In practice, the difference is small. Use 5 folds for large datasets (over 50K samples) and 10 folds for smaller ones. Leave-one-out cross-validation (K=N) is theoretically interesting but computationally expensive and produces high-variance estimates --- avoid it except for very small datasets.

Stratified K-Fold: When Class Balance Matters

Standard K-fold has a subtle problem: it does not guarantee that each fold has the same class distribution as the full dataset. If your dataset is 8% positive and 92% negative (like StreamFlow churn), a random fold might have 6% positive or 10% positive. This introduces unnecessary variance into your estimates.

Stratified K-fold ensures each fold preserves the original class distribution.

from sklearn.model_selection import StratifiedKFold

# Create an imbalanced dataset (8% positive, similar to churn)
X_imb, y_imb = make_classification(
    n_samples=10000, n_features=20, n_informative=12,
    n_redundant=4, weights=[0.92, 0.08],
    flip_y=0.02, random_state=42
)

print(f"Overall positive rate: {y_imb.mean():.3f}")

# Compare KFold vs StratifiedKFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("\nKFold --- positive rate per fold:")
for i, (train_idx, test_idx) in enumerate(kf.split(X_imb, y_imb)):
    rate = y_imb[test_idx].mean()
    print(f"  Fold {i+1}: {rate:.3f}")

print("\nStratifiedKFold --- positive rate per fold:")
for i, (train_idx, test_idx) in enumerate(skf.split(X_imb, y_imb)):
    rate = y_imb[test_idx].mean()
    print(f"  Fold {i+1}: {rate:.3f}")
Overall positive rate: 0.087

KFold --- positive rate per fold:
  Fold 1: 0.093
  Fold 2: 0.076
  Fold 3: 0.095
  Fold 4: 0.082
  Fold 5: 0.089

StratifiedKFold --- positive rate per fold:
  Fold 1: 0.087
  Fold 2: 0.087
  Fold 3: 0.087
  Fold 4: 0.087
  Fold 5: 0.086

The difference matters. With standard K-fold, one fold has a positive rate of 0.076 and another has 0.095 --- a 25% relative difference. With stratified K-fold, every fold matches the original distribution within one sample.

Rule --- For classification problems, always use StratifiedKFold. There is no reason not to. Scikit-learn's cross_val_score uses stratification by default when you pass a classifier, but be explicit: pass a StratifiedKFold object to the cv parameter. Explicit is better than implicit.
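The explicit pattern is a one-liner. A minimal, self-contained sketch (illustrative synthetic data, not the StreamFlow dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced data (roughly 10% positive)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.9, 0.1], random_state=42)

# Pass the splitter explicitly instead of relying on cv=5 defaults
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    X_demo, y_demo, cv=cv, scoring='roc_auc'
)
print(f"Stratified 5-fold AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Anyone reading the code can now see exactly how the folds were constructed, including the shuffle and the seed.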

Group K-Fold: When Observations Are Not Independent

Here is a scenario that traps experienced practitioners. StreamFlow's churn dataset has one row per subscriber per month. Subscriber #12345 appears in months 1, 2, 3, 4, and 5. If standard K-fold puts month 3 of subscriber #12345 in the test set and months 1, 2, 4 of the same subscriber in the training set, the model has seen this subscriber's behavioral pattern during training. It is not predicting churn for an unseen subscriber --- it is predicting churn for a subscriber whose behavior it already knows. This is a form of data leakage.

Group K-fold ensures that all observations from the same group (subscriber, patient, company, sensor) stay in the same fold.

from sklearn.model_selection import GroupKFold
import pandas as pd

# Simulate subscriber-month data
np.random.seed(42)
n_subscribers = 2000
months_per_sub = np.random.randint(2, 13, n_subscribers)

subscriber_ids = np.repeat(np.arange(n_subscribers), months_per_sub)
n_total = len(subscriber_ids)

X_grouped = np.random.randn(n_total, 10)
y_grouped = np.random.binomial(1, 0.08, n_total)

groups = subscriber_ids

gkf = GroupKFold(n_splits=5)

print("GroupKFold --- subscriber overlap check:")
for i, (train_idx, test_idx) in enumerate(gkf.split(X_grouped, y_grouped, groups)):
    train_subs = set(groups[train_idx])
    test_subs = set(groups[test_idx])
    overlap = train_subs & test_subs
    print(f"  Fold {i+1}: train_subs={len(train_subs)}, "
          f"test_subs={len(test_subs)}, overlap={len(overlap)}")
GroupKFold --- subscriber overlap check:
  Fold 1: train_subs=1600, test_subs=400, overlap=0
  Fold 2: train_subs=1600, test_subs=400, overlap=0
  Fold 3: train_subs=1600, test_subs=400, overlap=0
  Fold 4: train_subs=1600, test_subs=400, overlap=0
  Fold 5: train_subs=1600, test_subs=400, overlap=0

Zero overlap. Every fold tests on subscribers the model has never seen.

Critical Insight --- Group K-fold is essential for subscription data, medical data (multiple visits per patient), sensor data (multiple readings per device), and any dataset where a single entity generates multiple rows. Failing to use group splitting will inflate your cross-validation scores and give you a model that underperforms in production. Ask yourself: "Can the same real-world entity appear in both my training and test sets?" If the answer is yes, use GroupKFold.

Stratified Group K-Fold

What if you need both stratification (preserve class balance) and grouping (keep entities together)? Scikit-learn provides StratifiedGroupKFold for exactly this case.

from sklearn.model_selection import StratifiedGroupKFold

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

print("StratifiedGroupKFold:")
for i, (train_idx, test_idx) in enumerate(sgkf.split(X_grouped, y_grouped, groups)):
    train_subs = set(groups[train_idx])
    test_subs = set(groups[test_idx])
    overlap = train_subs & test_subs
    pos_rate = y_grouped[test_idx].mean()
    print(f"  Fold {i+1}: test_subs={len(test_subs)}, "
          f"overlap={len(overlap)}, positive_rate={pos_rate:.3f}")
StratifiedGroupKFold:
  Fold 1: test_subs=400, overlap=0, positive_rate=0.079
  Fold 2: test_subs=400, overlap=0, positive_rate=0.081
  Fold 3: test_subs=400, overlap=0, positive_rate=0.080
  Fold 4: test_subs=400, overlap=0, positive_rate=0.078
  Fold 5: test_subs=400, overlap=0, positive_rate=0.082

Zero subscriber overlap and consistent positive rates across folds. This is the correct cross-validation strategy for StreamFlow's churn data.

Time Series Split: When Order Matters

For time-ordered data, you cannot use random splitting at all. Using future data to predict the past is the most common form of temporal leakage. Scikit-learn's TimeSeriesSplit provides expanding-window cross-validation.

from sklearn.model_selection import TimeSeriesSplit

# Simulate 36 months of data
n_months = 36
n_per_month = 1000
dates = np.repeat(np.arange(n_months), n_per_month)
X_ts = np.random.randn(n_months * n_per_month, 10)
y_ts = np.random.binomial(1, 0.08, n_months * n_per_month)

tscv = TimeSeriesSplit(n_splits=5)

print("TimeSeriesSplit:")
for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    train_months = sorted(set(dates[train_idx]))
    test_months = sorted(set(dates[test_idx]))
    print(f"  Fold {i+1}: train months {train_months[0]}-{train_months[-1]}, "
          f"test months {test_months[0]}-{test_months[-1]}")
TimeSeriesSplit:
  Fold 1: train months 0-5, test months 6-11
  Fold 2: train months 0-11, test months 12-17
  Fold 3: train months 0-17, test months 18-23
  Fold 4: train months 0-23, test months 24-29
  Fold 5: train months 0-29, test months 30-35

Each fold trains on all data up to a point and tests on the next window. The training set grows with each fold, mimicking how a production model would be retrained on accumulating data. The test set never includes data from before the training cutoff.

Time Series Warning --- If your data has any temporal component --- event timestamps, monthly snapshots, daily transactions --- consider whether random splitting creates temporal leakage. A model that can "see the future" during training will look great in cross-validation and fail in production. When in doubt, use TimeSeriesSplit.

Choosing the Right Cross-Validation Strategy

| Situation | Strategy | Why |
| --- | --- | --- |
| Standard classification | StratifiedKFold | Preserves class balance |
| Multiple rows per entity (subscriber, patient) | StratifiedGroupKFold | Prevents entity leakage and preserves balance |
| Time-ordered data | TimeSeriesSplit | Prevents temporal leakage |
| Regression (continuous target) | KFold | No class balance to preserve |
| Very small dataset (< 500 samples) | RepeatedStratifiedKFold | More estimates, more stable mean |
| Very large dataset (> 500K samples) | Single stratified split or 3-fold | CV is expensive; a single split is stable enough |
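For the small-dataset row, RepeatedStratifiedKFold simply reruns stratified K-fold with fresh shuffles and pools the estimates. A minimal sketch on an illustrative 400-sample dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Illustrative small dataset, where a single 5-fold mean is itself noisy
X_small, y_small = make_classification(n_samples=400, n_features=10,
                                       n_informative=5, random_state=42)

# 5 folds x 10 repeats = 50 estimates instead of 5
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_small, y_small, cv=rskf, scoring='roc_auc')
print(f"{len(scores)} estimates: AUC {scores.mean():.4f} +/- {scores.std():.4f}")
```

The extra repeats do not create new data, but they average out the luck of any single fold assignment, which matters most when each fold holds only a few dozen samples.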

Part 2: Data Leakage --- The Silent Killer

What Is Data Leakage?

Data leakage occurs when information from outside the training data --- typically from the target variable or from future observations --- leaks into the features used for prediction. A leaked model appears to perform brilliantly during evaluation but fails catastrophically in production because the leaked information is not available at prediction time.

Leakage is the most dangerous evaluation error because it is invisible. The metrics look great. The model passes every automated check. It is only when the model is deployed and starts making predictions on genuinely new data that the performance collapses.

There are two main types:

Target leakage: A feature is derived from or correlated with the target variable in a way that would not be available at prediction time. Example: using "reason for cancellation" to predict churn. The cancellation reason is only known after the subscriber has already churned.

Train-test contamination: Information from the test set leaks into the training process. Example: fitting a scaler on the entire dataset (including test data) before splitting.

The 99% Accuracy Model: A Leakage Detective Story

Here is a scenario I have seen play out in real organizations. A junior data scientist builds a churn prediction model and reports 99.2% AUC. The team is thrilled. The model goes through code review, passes unit tests, and is deployed. Within a month, it is clear the model is no better than random guessing on new subscribers.

What went wrong?

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

np.random.seed(42)
n = 10000

# Build a realistic-looking churn dataset with a planted leak
data = pd.DataFrame({
    'monthly_hours': np.random.exponential(18, n).round(1),
    'sessions_last_30d': np.random.poisson(14, n),
    'months_active': np.random.randint(1, 60, n),
    'support_tickets': np.random.poisson(1.2, n),
    'plan_type': np.random.choice(['basic', 'standard', 'premium'], n,
                                   p=[0.4, 0.4, 0.2]),
    'payment_failures': np.random.poisson(0.3, n),
})

# Generate the target
churn_prob = 1 / (1 + np.exp(-(
    -3
    + 0.5 * (data['monthly_hours'] < 5).astype(float)
    + 0.3 * (data['support_tickets'] > 3).astype(float)
    + 0.4 * (data['months_active'] < 3).astype(float)
    + 0.2 * data['payment_failures']
)))
data['churned'] = np.random.binomial(1, churn_prob)

# THE LEAK: This feature is created AFTER the churn event
# "days_since_last_login" is set to a high value for churned subscribers
# because they stopped logging in AFTER they decided to cancel
data['days_since_last_login'] = np.where(
    data['churned'] == 1,
    np.random.randint(15, 45, n),   # churned users: 15-45 days
    np.random.randint(0, 10, n)     # active users: 0-10 days
)

# THE LEAK #2: "cancellation_offer_shown" is only True for users
# the system already flagged as about to churn
data['cancellation_offer_shown'] = np.where(
    data['churned'] == 1,
    np.random.binomial(1, 0.7, n),  # 70% of churners saw the offer
    np.random.binomial(1, 0.02, n)  # 2% of non-churners (false positives)
)

# Train the model with leaky features
features = ['monthly_hours', 'sessions_last_30d', 'months_active',
            'support_tickets', 'payment_failures',
            'days_since_last_login', 'cancellation_offer_shown']

X = pd.get_dummies(data[features + ['plan_type']], drop_first=True)
y = data['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC with leaky features: {auc:.4f}")
AUC with leaky features: 0.9918

That 0.9918 AUC is a lie. The model has learned two things: (1) if days_since_last_login is high, the user has already churned, and (2) if cancellation_offer_shown is 1, the user was already flagged. Neither feature is available at the time you need to make a prediction --- before the churn event.

Detecting Leakage

How do you find the leak? Feature importance is your first tool.

import pandas as pd

importances = pd.Series(
    model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

print("Feature importances:")
print(importances.to_string())
Feature importances:
days_since_last_login      0.4823
cancellation_offer_shown   0.3156
support_tickets            0.0612
monthly_hours              0.0508
payment_failures           0.0387
months_active              0.0301
sessions_last_30d          0.0124
plan_type_standard         0.0052
plan_type_premium          0.0037

The two leaked features dominate. This is your first red flag: if a feature has suspiciously high importance, ask whether it could contain information about the target that would not be available at prediction time.

Now remove the leaked features and retrain:

clean_features = ['monthly_hours', 'sessions_last_30d', 'months_active',
                  'support_tickets', 'payment_failures']
X_clean = pd.get_dummies(data[clean_features + ['plan_type']], drop_first=True)

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_clean, y, test_size=0.2, stratify=y, random_state=42
)

model_clean = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
)
model_clean.fit(X_train_c, y_train_c)
auc_clean = roc_auc_score(y_test_c, model_clean.predict_proba(X_test_c)[:, 1])
print(f"AUC without leaky features: {auc_clean:.4f}")
print(f"AUC drop: {auc - auc_clean:.4f}")
AUC without leaky features: 0.6842
AUC drop: 0.3076

The real AUC is 0.68, not 0.99. The model goes from "world-class" to "barely useful." That 0.31 AUC gap is the cost of leakage.

A Systematic Leakage Checklist

Run through this checklist for every model you build:

  1. Temporal sanity check: For each feature, ask: "Would I know this value at the time I need to make a prediction?" If the answer is "only after the event," it is a leak.

  2. Feature importance audit: If any feature has importance greater than 0.3, investigate it. Legitimate features rarely dominate that strongly.

  3. Suspiciously high performance: If your model achieves AUC above 0.95 on a real-world problem, be skeptical. Most real business problems top out at 0.80-0.90 AUC. Performance above 0.95 usually means leakage, not skill.

  4. Train-test contamination check: Ensure all preprocessing (scaling, imputation, encoding) is fitted on the training set only, then applied to the test set. Use scikit-learn Pipeline to enforce this.

  5. Feature derivation audit: Trace every engineered feature back to its source. If any step used information from the target or from the full dataset (including test rows), it is a leak.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# WRONG: Scale all data, then split
scaler = StandardScaler()
X_scaled_wrong = scaler.fit_transform(X_clean)  # Fitted on ALL data
scores_wrong = cross_val_score(
    GradientBoostingClassifier(n_estimators=100, random_state=42),
    X_scaled_wrong, y, cv=5, scoring='roc_auc'
)

# RIGHT: Use a Pipeline so scaling is fitted inside each CV fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(n_estimators=100, random_state=42))
])
scores_right = cross_val_score(pipe, X_clean, y, cv=5, scoring='roc_auc')

print(f"Leaked scaling AUC:  {scores_wrong.mean():.4f} +/- {scores_wrong.std():.4f}")
print(f"Proper pipeline AUC: {scores_right.mean():.4f} +/- {scores_right.std():.4f}")
Leaked scaling AUC:  0.6848 +/- 0.0152
Proper pipeline AUC: 0.6831 +/- 0.0158

Observation --- The difference here is small (0.002 AUC) because scaling leakage on this dataset is mild. But on datasets with time-dependent features, target-encoded categoricals, or imputation based on global statistics, the difference can be enormous. The Pipeline approach costs nothing and prevents an entire category of bugs. Use it always.
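Target-encoded categoricals deserve a concrete demonstration, because that failure is dramatic. The sketch below (synthetic data, not the StreamFlow dataset) target-encodes a pure-noise, high-cardinality categorical two ways. Encoding with means computed over the full dataset lets each row see its own label; encoding fitted inside each training fold tells the truth:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    'cat': rng.integers(0, 2000, n),   # high-cardinality categorical, pure noise
    'y': rng.binomial(1, 0.5, n),      # target is independent of the feature
})
y = df['y'].to_numpy()

# WRONG: target means computed over ALL rows, test rows included
df['enc_leaky'] = df['cat'].map(df.groupby('cat')['y'].mean())

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
leaky, proper = [], []
for tr, te in skf.split(df, y):
    # Leaky: the encoding has already seen the test fold's labels
    m = LogisticRegression().fit(df[['enc_leaky']].iloc[tr], y[tr])
    leaky.append(roc_auc_score(y[te],
                               m.predict_proba(df[['enc_leaky']].iloc[te])[:, 1]))

    # Proper: fit the encoding on the training fold only
    fold_means = df.iloc[tr].groupby('cat')['y'].mean()
    prior = y[tr].mean()
    enc_tr = df['cat'].iloc[tr].map(fold_means).fillna(prior).to_frame('enc')
    enc_te = df['cat'].iloc[te].map(fold_means).fillna(prior).to_frame('enc')
    m = LogisticRegression().fit(enc_tr, y[tr])
    proper.append(roc_auc_score(y[te], m.predict_proba(enc_te)[:, 1]))

print(f"Leaky encoding AUC:  {np.mean(leaky):.3f}")   # looks like real signal
print(f"Proper encoding AUC: {np.mean(proper):.3f}")  # the feature is noise
```

The leaky version reports a strong AUC on a feature that contains no information at all, because with only a few rows per category the encoding nearly memorizes each row's label. The proper version hovers around 0.5.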


Part 3: Classification Metrics --- Choosing the Right One

The Accuracy Trap

Accuracy is the most intuitive metric and the most misleading one for imbalanced problems. StreamFlow's churn rate is 8%. A model that predicts "no churn" for every subscriber achieves 92% accuracy. It is also completely useless.

from sklearn.metrics import accuracy_score, classification_report

# The "always predict majority class" baseline
y_dummy = np.zeros_like(y_test_c)
acc = accuracy_score(y_test_c, y_dummy)
print(f"Always-predict-no-churn accuracy: {acc:.4f}")
print(f"\nClassification report for dummy model:")
print(classification_report(y_test_c, y_dummy, target_names=['No Churn', 'Churn']))
Always-predict-no-churn accuracy: 0.9240
Classification report for dummy model:
              precision    recall  f1-score   support

    No Churn       0.92      1.00      0.96      1848
       Churn       0.00      0.00      0.00       152

    accuracy                           0.92      2000
   macro avg       0.46      0.50      0.48      2000
weighted avg       0.85      0.92      0.89      2000

92.4% accuracy. Zero recall for churn. This model catches nobody. Accuracy rewards predicting the majority class and tells you nothing about how well you identify the minority class that you actually care about.

Precision, Recall, and the Tradeoff

Precision answers: "Of everyone the model flagged as positive, what fraction actually was positive?" If precision is 0.70, then 70% of the subscribers you flagged as churners actually churned, and 30% were false alarms.

Recall (sensitivity) answers: "Of everyone who was actually positive, what fraction did the model catch?" If recall is 0.60, then the model caught 60% of actual churners and missed 40%.

There is an inherent tension between precision and recall. You can achieve perfect recall by predicting every subscriber will churn --- you will catch all actual churners, but your precision will be terrible. You can achieve perfect precision by only flagging the single most obvious churner --- you will be right about that one, but your recall will be near zero.

from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Use the clean model from the leakage section
y_proba = model_clean.predict_proba(X_test_c)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test_c, y_proba)

# Show the tradeoff at different thresholds
print("Threshold  Precision  Recall")
print("-" * 35)
for threshold in [0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30]:
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test_c, y_pred_t, zero_division=0)
    r = recall_score(y_test_c, y_pred_t)
    print(f"  {threshold:<10.2f} {p:<10.3f} {r:<10.3f}")
Threshold  Precision  Recall
-----------------------------------
  0.03       0.087      0.934
  0.05       0.102      0.862
  0.08       0.142      0.684
  0.10       0.173      0.566
  0.15       0.224      0.342
  0.20       0.316      0.197
  0.30       0.500      0.046

As the threshold rises, precision increases (fewer false alarms) but recall drops (more missed churners). The business problem determines where you should operate on this curve.

Which Metric for Which Business Problem?

| Scenario | Primary Metric | Why |
| --- | --- | --- |
| SaaS churn (retention offers are cheap) | Recall or AUC-PR | Missing a churner costs $180/month; sending a retention offer to a non-churner costs $5. False negatives are 36x more expensive. |
| Hospital readmission | Recall | Missing a readmission can mean patient harm or death. False positives (extra follow-up calls) are a mild cost. |
| Fraud detection | Precision at high recall | You need to catch most fraud (high recall), but too many false positives overwhelm investigators. |
| Spam filtering | Precision | Sending a legitimate email to spam (false positive) is worse than letting spam through (false negative). |
| E-commerce conversion prediction | AUC-ROC or log-loss | Ranking matters more than a binary decision. You want to rank likely converters higher. |
| Medical diagnosis (screening) | Recall (sensitivity) | A missed diagnosis is worse than an unnecessary follow-up test. |

F1 Score: When You Need a Single Number

The F1 score is the harmonic mean of precision and recall. It penalizes extreme imbalance between the two: if either precision or recall is very low, F1 will be low even if the other is high.

from sklearn.metrics import f1_score

# Demonstrate that F1 penalizes imbalance
scenarios = [
    (0.90, 0.90),  # Balanced
    (0.95, 0.50),  # High precision, low recall
    (0.50, 0.95),  # Low precision, high recall
    (0.99, 0.10),  # Extreme imbalance
]

print("Precision  Recall   F1")
print("-" * 30)
for p, r in scenarios:
    f1 = 2 * (p * r) / (p + r)
    print(f"  {p:<10.2f} {r:<8.2f} {f1:.3f}")
Precision  Recall   F1
------------------------------
  0.90       0.90     0.900
  0.95       0.50     0.655
  0.50       0.95     0.655
  0.99       0.10     0.182

F1 is useful when you want a single number that balances precision and recall. But it weights them equally, which is rarely what the business wants. Use F1 as a diagnostic, not as your optimization target. Define the business cost of false positives vs. false negatives and optimize accordingly.
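"Optimize accordingly" can be made concrete. A hedged sketch using the illustrative churn costs from the scenario table ($180 for a missed churner, $5 for a wasted retention offer), with synthetic scores standing in for a real model's output:

```python
import numpy as np

# Illustrative costs: a missed churner (false negative) costs $180;
# a retention offer sent to a non-churner (false positive) costs $5.
COST_FN, COST_FP = 180.0, 5.0

def total_cost(y_true, y_proba, threshold):
    """Dollar cost of operating at a given probability threshold."""
    y_pred = (y_proba >= threshold).astype(int)
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return COST_FN * fn + COST_FP * fp

# Synthetic scores with weak signal, standing in for model output
rng = np.random.default_rng(42)
y_true = rng.binomial(1, 0.08, 2000)
y_proba = np.clip(0.15 * y_true + rng.uniform(0, 0.5, 2000), 0, 1)

# Sweep thresholds and pick the cheapest operating point
thresholds = np.linspace(0.01, 0.60, 60)
costs = [total_cost(y_true, y_proba, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"Cost-minimizing threshold: {best:.2f} "
      f"(cost ${min(costs):,.0f} vs ${total_cost(y_true, y_proba, 0.5):,.0f} at 0.5)")
```

Because a false negative is 36x more expensive than a false positive here, the cheapest threshold sits far below 0.5: the model should flag subscribers aggressively and accept many false alarms.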

AUC-ROC: Ranking Quality

AUC-ROC (Area Under the Receiver Operating Characteristic Curve) measures the model's ability to rank positive examples above negative examples, across all thresholds. An AUC of 0.5 means random ranking. An AUC of 1.0 means perfect ranking --- every positive ranks above every negative.

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, roc_thresholds = roc_curve(y_test_c, y_proba)
auc = roc_auc_score(y_test_c, y_proba)

print(f"AUC-ROC: {auc:.4f}")
print(f"\nInterpretation: if you pick a random churner and a random non-churner,")
print(f"there is a {auc:.1%} chance the model assigns a higher churn probability")
print(f"to the churner.")
AUC-ROC: 0.6842
Interpretation: if you pick a random churner and a random non-churner,
there is a 68.4% chance the model assigns a higher churn probability
to the churner.

AUC-ROC is threshold-agnostic, which is both its strength and its weakness. It tells you how well the model ranks examples, not how well it classifies them at any specific threshold. For imbalanced problems, AUC-ROC can be misleadingly optimistic --- a model can achieve high AUC-ROC while being useless at the thresholds you care about.

AUC-PR: The Better Metric for Imbalanced Data

AUC-PR (Area Under the Precision-Recall Curve) focuses on the positive class. For imbalanced problems like churn prediction, AUC-PR is more informative than AUC-ROC because it directly measures how well the model identifies the minority class.

from sklearn.metrics import average_precision_score, precision_recall_curve

auc_pr = average_precision_score(y_test_c, y_proba)
print(f"AUC-PR: {auc_pr:.4f}")
print(f"Baseline AUC-PR (random): {y_test_c.mean():.4f}")
print(f"Lift over random: {auc_pr / y_test_c.mean():.2f}x")
AUC-PR: 0.1647
Baseline AUC-PR (random): 0.0760
Lift over random: 2.17x

Why AUC-PR for churn? --- AUC-ROC for this model is 0.684, which sounds mediocre but not terrible. AUC-PR is 0.165, which sounds much worse. But AUC-PR tells the truth: the model's ability to precisely identify churners is limited. A random baseline would have AUC-PR equal to the positive rate (0.076). The model is 2.17x better than random at identifying churners --- useful but not great. AUC-PR forces you to confront the difficulty of the problem instead of hiding behind an AUC-ROC number that is inflated by the easy-to-classify majority class.

Log-Loss and Brier Score: Evaluating Probabilities

Sometimes you care not just about rankings or binary decisions, but about the quality of the predicted probabilities themselves. If the model says "this subscriber has an 80% chance of churning," does that actually mean 80 out of 100 such subscribers churn?

Log-loss (cross-entropy) penalizes confident wrong predictions heavily. A model that says "99% chance of no churn" for a subscriber who actually churns gets a massive penalty.

Brier score is the mean squared difference between predicted probabilities and actual outcomes. It ranges from 0 (perfect) to 1 (worst).

from sklearn.metrics import log_loss, brier_score_loss

ll = log_loss(y_test_c, y_proba)
brier = brier_score_loss(y_test_c, y_proba)

print(f"Log-loss: {ll:.4f}")
print(f"Brier score: {brier:.4f}")

# Compare to a model that always predicts the base rate
base_rate = y_test_c.mean()
ll_base = log_loss(y_test_c, np.full_like(y_proba, base_rate))
brier_base = brier_score_loss(y_test_c, np.full_like(y_proba, base_rate))
print(f"\nBaseline (always predict {base_rate:.3f}):")
print(f"  Log-loss: {ll_base:.4f}")
print(f"  Brier score: {brier_base:.4f}")
Log-loss: 0.2918
Brier score: 0.0694

Baseline (always predict 0.076):
  Log-loss: 0.2976
  Brier score: 0.0702
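The "confident and wrong" penalty is easiest to see one example at a time: a single observation contributes -log(p), where p is the probability the model assigned to the true class. A quick illustration for an actual churner:

```python
import numpy as np

# Per-example log-loss is -log(p) for the probability assigned to the true class
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"predicted P(churn) = {p:4.2f} for an actual churner -> loss {-np.log(p):5.2f}")
# the 0.01 prediction costs over 40x more than the 0.9 prediction
```

The penalty grows without bound as the predicted probability of the true class approaches zero, which is why log-loss punishes overconfident models so effectively.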

Regression Metrics: RMSE and MAE

For regression problems, the standard metrics are RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error). RMSE penalizes large errors more heavily due to the squaring; MAE treats all errors equally.

from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Simulate predictions with a few large errors
np.random.seed(42)
y_true_reg = np.random.uniform(100, 500, 1000)
y_pred_reg = y_true_reg + np.random.normal(0, 30, 1000)

# Add 10 large errors (simulating outlier predictions)
y_pred_reg[:10] += np.random.normal(0, 200, 10)

rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
mae = mean_absolute_error(y_true_reg, y_pred_reg)
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"RMSE / MAE ratio: {rmse / mae:.2f}")
print(f"\nIf RMSE >> MAE, a few large errors are dominating RMSE.")
print(f"Investigate outlier predictions.")
RMSE: 36.18
MAE:  24.31
RMSE / MAE ratio: 1.49

If RMSE >> MAE, a few large errors are dominating RMSE.
Investigate outlier predictions.

Practical rule --- Report both RMSE and MAE. If RMSE is much larger than MAE (ratio above 1.5), you have a few predictions that are wildly off. Investigate those cases --- they may be data quality issues, edge cases your model mishandles, or genuine outliers that need a different modeling approach.
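Following that rule, a quick way to investigate is to rank samples by absolute error and inspect the worst offenders. This sketch recreates the simulated data from the example above using the same seed:

```python
import numpy as np

# Recreate the simulated predictions from the RMSE/MAE example above
np.random.seed(42)
y_true_reg = np.random.uniform(100, 500, 1000)
y_pred_reg = y_true_reg + np.random.normal(0, 30, 1000)
y_pred_reg[:10] += np.random.normal(0, 200, 10)  # the planted outliers

# Rank samples by absolute error and inspect the worst offenders
abs_err = np.abs(y_true_reg - y_pred_reg)
worst = np.argsort(abs_err)[::-1][:5]
for i in worst:
    print(f"idx={i:4d}  true={y_true_reg[i]:7.1f}  "
          f"pred={y_pred_reg[i]:7.1f}  abs_err={abs_err[i]:6.1f}")
```

In real projects, the indices that surface here are the cases to hand-inspect for data quality problems or unmodeled edge cases.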


Part 4: Learning Curves, Validation Curves, and Calibration

Learning Curves: How Much Data Do You Need?

A learning curve plots model performance as a function of training set size. It answers two questions: (1) Is my model underfitting or overfitting? (2) Will collecting more data help?

from sklearn.model_selection import learning_curve
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt
import numpy as np

model_lc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)

train_sizes, train_scores, val_scores = learning_curve(
    model_lc, X_clean, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='roc_auc', n_jobs=-1
)

print("Learning Curve:")
print(f"{'Train Size':<12} {'Train AUC':<12} {'Val AUC':<12} {'Gap':<8}")
print("-" * 44)
for size, t_score, v_score in zip(
    train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)
):
    gap = t_score - v_score
    print(f"  {size:<12} {t_score:<12.4f} {v_score:<12.4f} {gap:<8.4f}")
Learning Curve:
Train Size   Train AUC    Val AUC      Gap
--------------------------------------------
  800          0.7821       0.6413      0.1408
  1600         0.7584       0.6592      0.0992
  2400         0.7462       0.6651      0.0811
  3200         0.7398       0.6684      0.0714
  4000         0.7351       0.6721      0.0630
  4800         0.7305       0.6748      0.0557
  5600         0.7278       0.6769      0.0509
  6400         0.7256       0.6781      0.0475
  7200         0.7239       0.6802      0.0437
  8000         0.7222       0.6819      0.0403

How to read this:

  • Train AUC decreasing as data grows: Normal. With more data, it is harder to memorize, so training performance naturally drops.
  • Val AUC increasing as data grows: Good. The model generalizes better with more data.
  • Gap between train and val (the overfitting gap): Shrinking, which is ideal. If the gap is still large at the maximum training size, either more data or a simpler model would help.
  • Val AUC still climbing at maximum size: Suggests that more data would improve performance. If the val curve has plateaued, more data will not help --- you need a better model or better features.
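The reading rules above can be turned into a small heuristic. The thresholds here (0.05 for the gap, 0.001 for the slope) are illustrative assumptions, not established rules, and the mean scores are copied from the learning-curve table above:

```python
import numpy as np

def diagnose_learning_curve(train_means, val_means,
                            gap_threshold=0.05, slope_threshold=0.001):
    """Heuristic read of a learning curve; thresholds are illustrative."""
    gap = train_means[-1] - val_means[-1]
    slope = val_means[-1] - val_means[-2]
    notes = [
        "large overfitting gap: simplify the model or add data"
        if gap > gap_threshold else "small overfitting gap",
        "val score still climbing: more data may help"
        if slope > slope_threshold
        else "val score plateaued: better features or model needed",
    ]
    return gap, notes

# Mean scores copied from the learning-curve table above
train = np.array([0.7821, 0.7584, 0.7462, 0.7398, 0.7351,
                  0.7305, 0.7278, 0.7256, 0.7239, 0.7222])
val = np.array([0.6413, 0.6592, 0.6651, 0.6684, 0.6721,
                0.6748, 0.6769, 0.6781, 0.6802, 0.6819])

gap, notes = diagnose_learning_curve(train, val)
print(f"final train-val gap: {gap:.4f}")
for note in notes:
    print(f"  - {note}")
```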

Validation Curves: Tuning Hyperparameters Visually

A validation curve plots performance as a function of a single hyperparameter. It shows you where the hyperparameter transitions from underfitting to optimal to overfitting.

from sklearn.model_selection import validation_curve

param_range = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12]

train_scores, val_scores = validation_curve(
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=42
    ),
    X_clean, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5, scoring='roc_auc', n_jobs=-1
)

print("Validation Curve (max_depth):")
print(f"{'Depth':<8} {'Train AUC':<12} {'Val AUC':<12} {'Gap':<8}")
print("-" * 40)
for depth, t_score, v_score in zip(
    param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)
):
    gap = t_score - v_score
    print(f"  {depth:<8} {t_score:<12.4f} {v_score:<12.4f} {gap:<8.4f}")
Validation Curve (max_depth):
Depth    Train AUC    Val AUC      Gap
----------------------------------------
  1        0.6651       0.6618      0.0033
  2        0.6895       0.6762      0.0133
  3        0.7222       0.6819      0.0403
  4        0.7589       0.6812      0.0777
  5        0.8012       0.6793      0.1219
  6        0.8451       0.6741      0.1710
  7        0.8842       0.6698      0.2144
  8        0.9178       0.6652      0.2526
  10       0.9612       0.6578      0.3034
  12       0.9847       0.6491      0.3356

The validation AUC peaks at depth 2-3 and then declines as the model overfits. Meanwhile, training AUC climbs steadily toward 1.0 --- the model is memorizing. The optimal depth is where val AUC is highest, not where training AUC is highest.
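Picking the operating point from such a table can be automated. One sensible tie-breaking rule: treat depths whose mean val AUC is within a small tolerance of the best as equivalent and prefer the simplest. The 0.001 tolerance is an illustrative assumption; the scores are copied from the table above:

```python
import numpy as np

# Mean validation AUCs copied from the validation-curve table above
param_range = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12]
val_means = np.array([0.6618, 0.6762, 0.6819, 0.6812, 0.6793,
                      0.6741, 0.6698, 0.6652, 0.6578, 0.6491])

best_idx = int(np.argmax(val_means))
print(f"best max_depth by mean val AUC: {param_range[best_idx]}")

# Depths within the tolerance of the best are effectively tied;
# among ties, prefer the simplest model
tol = 0.001
tied = [d for d, s in zip(param_range, val_means)
        if val_means[best_idx] - s <= tol]
print(f"effectively tied depths: {tied} -> choose {min(tied)}")
```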

Calibration: Does 80% Mean 80%?

A well-calibrated model produces probability estimates that match observed frequencies. If the model says 80% chance of churn for a group of 100 subscribers, approximately 80 of them should actually churn.

Calibration matters when you use predicted probabilities for downstream decisions. If your retention team intervenes on everyone with predicted churn probability above 50%, they need to trust that 50% means something real.

from sklearn.calibration import calibration_curve

# Train a model and check its calibration
model_cal = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
model_cal.fit(X_train_c, y_train_c)
y_proba_cal = model_cal.predict_proba(X_test_c)[:, 1]

prob_true, prob_pred = calibration_curve(y_test_c, y_proba_cal, n_bins=10)

print("Calibration:")
print(f"{'Predicted Prob':<16} {'Actual Freq':<14} {'Count'}")
print("-" * 40)
for pt, pp in zip(prob_pred, prob_true):
    print(f"  {pp:<16.3f} {pt:<14.3f}")
Calibration:
Predicted Prob   Actual Freq   Count
----------------------------------------
  0.043            0.053
  0.059            0.068
  0.071            0.082
  0.083            0.079
  0.096            0.099
  0.112            0.118
  0.134            0.142
  0.167            0.158
  0.214            0.231
  0.329            0.357

Interpreting calibration --- If the predicted probability roughly matches the actual frequency (the two columns are close), the model is well calibrated. If predicted probabilities are systematically lower than actual frequencies, the model is underconfident; if higher, overconfident. Gradient boosting trained with log-loss is usually reasonably well calibrated out of the box, and logistic regression is naturally calibrated. Random Forests and SVMs tend to be poorly calibrated and benefit from post-hoc calibration (Platt scaling or isotonic regression).
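Post-hoc calibration is available in scikit-learn via CalibratedClassifierCV, which wraps an estimator and fits a calibrator on internal CV folds. This sketch uses a synthetic imbalanced dataset as a stand-in, not the book's churn data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (NOT the book's dataset)
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.9],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

raw_rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
raw_rf.fit(X_tr, y_tr)

# Wrap a fresh copy of the same model with isotonic post-hoc calibration
cal_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    method="isotonic", cv=5)
cal_rf.fit(X_tr, y_tr)

briers = {}
for name, model in [("raw RF", raw_rf), ("calibrated RF", cal_rf)]:
    proba = model.predict_proba(X_te)[:, 1]
    briers[name] = brier_score_loss(y_te, proba)
    print(f"{name:<14}: Brier = {briers[name]:.4f}")
```

On your own data, compare the two Brier scores and the two calibration curves; isotonic calibration needs enough data to fit well, and Platt scaling (`method="sigmoid"`) is the safer choice for small datasets.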

Putting the Diagnostics Together

When evaluating a model, run all three diagnostics:

  1. Learning curve to check if more data would help and whether the model is over/underfitting.
  2. Validation curve to find the sweet spot for your most important hyperparameter.
  3. Calibration curve to verify that predicted probabilities are trustworthy.

These three plots, combined with the right metric (AUC-PR for imbalanced classification, RMSE/MAE for regression), give you a complete picture of model behavior.


Part 5: Comparing Models Properly --- Statistical Tests

The Problem with "This One Has Higher AUC"

You trained Logistic Regression (AUC = 0.781), Random Forest (AUC = 0.793), and XGBoost (AUC = 0.802) using 5-fold cross-validation. XGBoost "wins." But is the difference statistically significant, or could it be explained by random variation in the folds?

A 0.009 AUC difference between Random Forest and XGBoost means nothing if the standard deviation across folds is 0.015. You need a statistical test.

Paired t-Test on Cross-Validation Scores

The simplest approach: use a paired t-test on the fold-by-fold AUC scores. "Paired" because both models were evaluated on the same folds, so each pair of scores (model A fold i vs. model B fold i) is directly comparable.

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from scipy import stats

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

lr = LogisticRegression(max_iter=1000, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
gb = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)

scores_lr = cross_val_score(lr, X_clean, y, cv=skf, scoring='roc_auc')
scores_rf = cross_val_score(rf, X_clean, y, cv=skf, scoring='roc_auc')
scores_gb = cross_val_score(gb, X_clean, y, cv=skf, scoring='roc_auc')

print("Model Comparison:")
print(f"{'Model':<25} {'Mean AUC':<12} {'Std':<8}")
print("-" * 45)
print(f"{'Logistic Regression':<25} {scores_lr.mean():<12.4f} {scores_lr.std():<8.4f}")
print(f"{'Random Forest':<25} {scores_rf.mean():<12.4f} {scores_rf.std():<8.4f}")
print(f"{'Gradient Boosting':<25} {scores_gb.mean():<12.4f} {scores_gb.std():<8.4f}")

# Paired t-test: RF vs GB
t_stat, p_value = stats.ttest_rel(scores_rf, scores_gb)
print(f"\nPaired t-test (RF vs GB):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.4f}")
print(f"  Significant at alpha=0.05? {'Yes' if p_value < 0.05 else 'No'}")

# Paired t-test: LR vs GB
t_stat2, p_value2 = stats.ttest_rel(scores_lr, scores_gb)
print(f"\nPaired t-test (LR vs GB):")
print(f"  t-statistic: {t_stat2:.4f}")
print(f"  p-value: {p_value2:.4f}")
print(f"  Significant at alpha=0.05? {'Yes' if p_value2 < 0.05 else 'No'}")
Model Comparison:
Model                     Mean AUC     Std
---------------------------------------------
Logistic Regression       0.6724       0.0168
Random Forest             0.6782       0.0151
Gradient Boosting         0.6819       0.0143

Paired t-test (RF vs GB):
  t-statistic: -1.2847
  p-value: 0.2312
  Significant at alpha=0.05? No

Paired t-test (LR vs GB):
  t-statistic: -2.4518
  p-value: 0.0367
  Significant at alpha=0.05? Yes

The difference between Random Forest and Gradient Boosting is not statistically significant (p=0.23). The difference between Logistic Regression and Gradient Boosting is (p=0.037). This means you cannot confidently claim Gradient Boosting is better than Random Forest on this data, but you can claim it is better than Logistic Regression.

Caveat --- The paired t-test on cross-validation scores has a known problem: the folds are not independent (they share training data), which violates the t-test's independence assumption. This makes the test slightly anti-conservative. A correction called the "corrected resampled t-test" (Nadeau and Bengio, 2003) adjusts for this. For most practical purposes, the standard paired t-test is a reasonable first approximation, but be aware of its limitations.
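The correction itself is a one-line change: instead of dividing the variance by the number of folds k, divide by a term that also accounts for the train/test overlap. A minimal sketch, using hypothetical fold scores rather than the chapter's:

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(scores_a, scores_b, n_train, n_test):
    """Nadeau & Bengio (2003) corrected paired t-test for CV fold scores.
    Inflates the variance term to account for overlapping training sets."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    k = len(d)
    var_d = np.var(d, ddof=1)
    # Standard test uses var_d / k; the correction adds n_test / n_train
    t = d.mean() / np.sqrt((1.0 / k + n_test / n_train) * var_d)
    p = 2.0 * stats.t.sf(abs(t), df=k - 1)
    return t, p

# Hypothetical fold AUCs for two models (10-fold CV on 10,000 samples)
a = np.array([0.68, 0.69, 0.67, 0.70, 0.68, 0.69, 0.68, 0.67, 0.70, 0.69])
b = np.array([0.67, 0.68, 0.67, 0.69, 0.67, 0.68, 0.68, 0.66, 0.69, 0.68])

t_std, p_std = stats.ttest_rel(a, b)
t_corr, p_corr = corrected_resampled_ttest(a, b, n_train=9000, n_test=1000)
print(f"standard : t={t_std:.3f}, p={p_std:.4f}")
print(f"corrected: t={t_corr:.3f}, p={p_corr:.4f}")
```

The corrected test always yields a smaller t-statistic and a larger p-value than the standard one, which is the point: it is harder to declare a difference significant.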

McNemar's Test: Comparing Predictions Directly

McNemar's test compares two models on a per-sample basis. It builds a contingency table: how many samples did model A get right and B get wrong? How many did B get right and A get wrong? If these counts are significantly different, the models have meaningfully different performance.

from sklearn.model_selection import train_test_split
from statsmodels.stats.contingency_tables import mcnemar

# Train both models on the same split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_clean, y, test_size=0.2, stratify=y, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
gb_model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)

rf_model.fit(X_tr, y_tr)
gb_model.fit(X_tr, y_tr)

y_pred_rf = rf_model.predict(X_te)
y_pred_gb = gb_model.predict(X_te)

# Build contingency table
correct_rf = (y_pred_rf == y_te)
correct_gb = (y_pred_gb == y_te)

# Contingency table
both_right = np.sum(correct_rf & correct_gb)
rf_right_gb_wrong = np.sum(correct_rf & ~correct_gb)
rf_wrong_gb_right = np.sum(~correct_rf & correct_gb)
both_wrong = np.sum(~correct_rf & ~correct_gb)

table = np.array([
    [both_right, rf_right_gb_wrong],
    [rf_wrong_gb_right, both_wrong]
])

print("McNemar Contingency Table:")
print(f"                     GB Correct    GB Wrong")
print(f"  RF Correct         {both_right:<14}{rf_right_gb_wrong}")
print(f"  RF Wrong           {rf_wrong_gb_right:<14}{both_wrong}")

result = mcnemar(table, exact=True)
print(f"\nMcNemar test p-value: {result.pvalue:.4f}")
print(f"Significant at alpha=0.05? {'Yes' if result.pvalue < 0.05 else 'No'}")
McNemar Contingency Table:
                     GB Correct    GB Wrong
  RF Correct         1782          72
  RF Wrong           85            61

McNemar test p-value: 0.3278
Significant at alpha=0.05? No

McNemar's test confirms what the paired t-test suggested: on this dataset, the difference between Random Forest and Gradient Boosting is not statistically significant. The 85 samples that GB got right and RF got wrong are not significantly different from the 72 samples that RF got right and GB got wrong.

When to Use Which Test

| Situation | Test | What it compares |
| --- | --- | --- |
| Two models, cross-validated scores | Paired t-test | Mean performance across folds |
| Two models, single test set | McNemar's test | Per-sample correct/incorrect predictions |
| Multiple models, cross-validated | Friedman test + Nemenyi post-hoc | Ranks of all models across folds |
| Two models, want effect size | Cohen's d on fold scores | Practical magnitude of difference |
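The Friedman test for the multiple-model case is available directly in scipy. The fold scores below are made up for illustration (three models, same ten folds):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold AUCs for three models on the same 10 folds
lr = np.array([0.655, 0.668, 0.671, 0.690, 0.662,
               0.678, 0.659, 0.685, 0.672, 0.684])
rf = np.array([0.661, 0.675, 0.678, 0.694, 0.668,
               0.684, 0.667, 0.690, 0.679, 0.686])
gb = np.array([0.667, 0.680, 0.684, 0.699, 0.674,
               0.688, 0.671, 0.696, 0.684, 0.676])

# Friedman test: do the models' fold-wise ranks differ systematically?
stat, p = stats.friedmanchisquare(lr, rf, gb)
print(f"Friedman chi-squared = {stat:.2f}, p = {p:.5f}")
# If p < 0.05, follow up with a pairwise post-hoc test (e.g. Nemenyi,
# available in the scikit-posthocs package) to see which pairs differ
```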

Effect Size: Is the Difference Practically Important?

Statistical significance tells you whether a difference is real. Effect size tells you whether it matters. A 0.002 AUC improvement can be statistically significant on a large enough dataset, but it is rarely worth the added complexity of a more sophisticated model.

def cohens_d(group1, group2):
    """Calculate Cohen's d for paired observations."""
    diff = group1 - group2
    return np.mean(diff) / np.std(diff, ddof=1)

d = cohens_d(scores_gb, scores_rf)
print(f"Cohen's d (GB vs RF): {d:.3f}")
print(f"Interpretation: ", end="")
if abs(d) < 0.2:
    print("negligible effect")
elif abs(d) < 0.5:
    print("small effect")
elif abs(d) < 0.8:
    print("medium effect")
else:
    print("large effect")
Cohen's d (GB vs RF): 0.406
Interpretation: small effect

A small effect size combined with a non-significant p-value is a clear signal: the models are essentially equivalent on this data. Choose based on practical factors --- training time, interpretability, maintenance burden --- not on a 0.004 AUC difference that cannot be reliably distinguished from noise.


Part 6: When Offline Metrics Disagree with Production

The E-Commerce Disconnect

Here is a pattern that haunts data science teams. The offline evaluation says Model B is better than Model A. You run an A/B test. The A/B test says Model A drives more revenue. Who is right?

The A/B test is right. Always. Offline metrics are a proxy for real-world impact. A/B test results are the real-world impact. When they disagree, the offline metrics are wrong --- not in their calculation, but in their relevance.

Why does this happen?

  1. Metric mismatch: You optimized for AUC offline, but the business cares about conversion rate. A model with higher AUC might produce probabilities that lead to worse decisions when plugged into the business logic.

  2. Distribution shift: The offline test set does not perfectly represent production traffic. Seasonal effects, user demographics, or product changes can make the test set unrepresentative.

  3. Calibration differences: Model B has higher AUC but is poorly calibrated. When the business logic uses a threshold of 0.5, Model B's "0.5" actually corresponds to a 0.3 true probability, causing bad downstream decisions.

  4. Feedback loops: The model's predictions change user behavior, which changes the distribution that future predictions are made on. Offline evaluation cannot capture this.

War Story --- An e-commerce team I worked with built a conversion prediction model that scored 3 points higher AUC than the incumbent. They deployed it, ran an A/B test for two weeks, and found that conversion rate dropped by 0.4%. The postmortem revealed the issue: the new model was overconfident, assigning high probabilities to users who were browsing but not ready to buy. The discount-triggering system showed these users coupons too aggressively, which trained them to wait for coupons before buying --- a classic feedback loop. The old model's more conservative probabilities happened to produce better discount timing.

Building a Bridge Between Offline and Online

To reduce the disconnect:

  1. Match your offline metric to the business decision. If the business uses predicted probabilities to decide on interventions, evaluate calibration, not just ranking. If the business uses a threshold, evaluate precision and recall at that threshold, not AUC.

  2. Use historical A/B test data to validate. If you have past A/B tests with known outcomes, check whether your offline metric would have correctly predicted which variant won. If not, your offline metric is not capturing what matters.

  3. Always run an A/B test before full deployment. No amount of offline evaluation can substitute for a production experiment. Offline evaluation is for screening candidates. A/B testing is for making the final decision.

  4. Monitor calibration in production. Track whether your model's predicted probabilities match observed rates. If the model says "10% churn probability" for a cohort, do 10% of them actually churn? Calibration drift is an early warning sign that the model is becoming stale.
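Point 4 can be implemented as a simple per-bin report that runs on each batch of scored production traffic. This is a minimal sketch with simulated data; the 0.05 drift threshold and the bin count are illustrative assumptions to be tuned per application:

```python
import numpy as np

def cohort_calibration(y_proba, y_actual, bins=10):
    """Per-bin comparison of predicted probability vs observed rate --
    a minimal sketch of the production calibration check described above."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_proba >= lo) & (y_proba < hi)
        if mask.sum() == 0:
            continue  # skip empty probability bins
        rows.append((lo, hi, y_proba[mask].mean(),
                     y_actual[mask].mean(), int(mask.sum())))
    return rows

# Simulated batch of production traffic: skewed predicted probabilities
# and outcomes drawn to be consistent with them
rng = np.random.default_rng(0)
p = rng.beta(1, 9, 5000)
y = (rng.random(5000) < p).astype(int)

rows = cohort_calibration(p, y)
for lo, hi, pred, obs, n in rows:
    flag = "  <-- check for drift" if abs(pred - obs) > 0.05 else ""
    print(f"[{lo:.1f}, {hi:.1f})  pred={pred:.3f}  obs={obs:.3f}  n={n:4d}{flag}")
```

In a real pipeline the observed rates arrive with a delay (you only learn who churned next month), so this check runs on lagged cohorts, not the current batch.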


Part 7: Progressive Project --- Milestone M6

StreamFlow Evaluation Overhaul

In Milestones M4 and M5, you trained baseline and advanced models on the StreamFlow churn dataset. In this milestone, you will evaluate those models properly.

Task 1: Implement Stratified Group K-Fold

StreamFlow's data has multiple rows per subscriber (one per month). Implement StratifiedGroupKFold with subscriber ID as the group variable.

from sklearn.model_selection import StratifiedGroupKFold

# Assuming your StreamFlow data has a 'subscriber_id' column
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

# Re-evaluate all your models from M4 and M5 using this CV strategy
# Compare the results to your previous evaluations that used
# standard StratifiedKFold. Did any model's performance change
# significantly? If so, which models were benefiting from
# subscriber leakage?

Task 2: Build Learning Curves

For your best model, build a learning curve. Is the model still improving with more data? Would collecting more subscriber data likely improve churn prediction?

Task 3: The Leakage Detective

Your StreamFlow dataset contains a planted leakage issue. One of the features in your pipeline encodes information that would not be available at prediction time. Find it.

Hints:

  • Check feature importances. Is any single feature suspiciously dominant?
  • For each feature, ask: "At the moment I need to predict churn for next month, would I already know this value?"
  • The leaky feature is not obviously named. It looks like a legitimate behavioral metric.

When you find the leak, remove it, retrain, and report the change in AUC-PR (not AUC-ROC --- this is an imbalanced problem).

Task 4: Choose the Right Metric

StreamFlow's churn rate is 8.2%. A retention offer costs $5 to send and saves an average of $180 in lifetime value when successful.

  • Calculate the cost ratio: how much more expensive is a false negative (missed churner) than a false positive (unnecessary retention offer)?
  • Based on this ratio, argue whether the team should optimize for AUC-PR, F1, or a custom metric.
  • Report your best model's performance using AUC-PR, F1, precision, and recall. Recommend an operating threshold.

Task 5: Statistical Model Comparison

Using paired t-tests on your cross-validation scores, determine which model differences are statistically significant at alpha=0.05. Create a summary table showing all pairwise comparisons and their p-values.


Chapter Summary

This chapter covered the full evaluation toolkit:

  • Cross-validation gives you reliable performance estimates. Use StratifiedKFold for classification, StratifiedGroupKFold when entities repeat, and TimeSeriesSplit for temporal data.
  • Data leakage is the most dangerous evaluation error. Check feature importances for suspiciously dominant features, verify temporal validity of every feature, and use scikit-learn Pipelines to prevent train-test contamination.
  • Metrics must match the business problem. Accuracy is almost never the right choice for imbalanced data. AUC-PR is more informative than AUC-ROC for minority-class prediction. Precision and recall have an inherent tradeoff that the business cost structure should resolve.
  • Diagnostic plots --- learning curves, validation curves, and calibration curves --- tell you what raw metrics cannot: whether more data would help, where overfitting begins, and whether predicted probabilities are trustworthy.
  • Statistical tests prevent you from chasing noise. A 0.005 AUC difference is not meaningful if the standard deviation across folds is 0.015. Use paired t-tests or McNemar's test to determine whether model differences are real.

The common thread: evaluation is about honesty. A model that appears to perform well but was evaluated incorrectly will fail in production. A model that appears to perform modestly but was evaluated rigorously will deliver exactly what you expect. Honest evaluation is the foundation of trustworthy machine learning.


Next chapter: Chapter 17 --- Class Imbalance, where we tackle the problem that made accuracy useless in this chapter --- and learn techniques to improve model performance when positive examples are rare.