> "All models are wrong, but some are useful." -- George E. P. Box
In This Chapter
- 8.1 The Evaluation Problem
- 8.2 Data Splitting Strategies
- 8.3 Cross-Validation
- 8.4 Classification Metrics
- 8.5 Regression Metrics
- 8.6 The Bias-Variance Tradeoff
- 8.7 Hyperparameter Tuning
- 8.8 Statistical Tests for Model Comparison
- 8.9 Putting It All Together: An Evaluation Workflow
- 8.10 Common Pitfalls and How to Avoid Them
- 8.11 Advanced Topics
- 8.12 Summary
- References
Chapter 8: Model Evaluation, Selection, and Validation
"All models are wrong, but some are useful." -- George E. P. Box
In the preceding chapters, we explored how to build machine learning models -- from linear regression (Chapter 4) and logistic regression (Chapter 5) to decision trees (Chapter 6) and ensemble methods (Chapter 7). But building a model is only half the story. The other half, and arguably the more critical half, is knowing whether your model is any good. A model that performs brilliantly on training data but fails catastrophically in production is worse than useless -- it is dangerous.
This chapter equips you with the tools, techniques, and statistical rigor needed to answer the fundamental question every AI engineer must face: How well will this model perform on data it has never seen?
We will cover the complete evaluation pipeline: how to partition your data responsibly, how to measure performance with the right metrics, how to diagnose whether your model suffers from too much bias or too much variance, how to tune hyperparameters systematically, and how to compare competing models with statistical confidence. By the end of this chapter, you will be able to make principled, defensible decisions about which model to deploy.
8.1 The Evaluation Problem
8.1.1 Why Evaluation Matters
Consider a spam classifier that achieves 99% accuracy. Impressive? Not if only 1% of emails are spam -- a model that labels everything as "not spam" achieves the same accuracy while catching zero spam. This example, which we will revisit in detail, illustrates a recurring theme: naive evaluation leads to naive conclusions.
The core challenge is generalization. We train models on a finite sample of data, but we care about performance on the infinite stream of future, unseen data. The gap between training performance and true performance is where evaluation science lives.
Recall from Chapter 2 that we introduced the concepts of underfitting and overfitting. This chapter formalizes those intuitions and provides the machinery to detect and correct both problems.
8.1.2 The Generalization Gap
Let $f_{\hat{\theta}}$ be our learned model with parameters $\hat{\theta}$, and let $L$ be our loss function. We define:
- Training error: $E_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_{\hat{\theta}}(x_i))$, computed over the training set.
- Generalization error: $E_{\text{gen}} = \mathbb{E}_{(x,y) \sim P}[L(y, f_{\hat{\theta}}(x))]$, the expected loss over the true data-generating distribution $P$.
The generalization gap is $E_{\text{gen}} - E_{\text{train}}$. A positive gap means our model performs worse on new data than on training data -- the hallmark of overfitting. Our entire evaluation framework aims to estimate $E_{\text{gen}}$ as accurately as possible.
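As a minimal illustration (the dataset and model here are arbitrary choices, not a prescription), we can approximate the gap by comparing training error with error on a held-out split:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import zero_one_loss

# Synthetic data and a deliberately flexible model to make the gap visible.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
e_train = zero_one_loss(y_tr, model.predict(X_tr))      # E_train
e_holdout = zero_one_loss(y_ho, model.predict(X_ho))    # estimate of E_gen
print(f"train error = {e_train:.3f}, held-out error = {e_holdout:.3f}, "
      f"estimated gap = {e_holdout - e_train:.3f}")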
8.2 Data Splitting Strategies
8.2.1 The Train/Validation/Test Split
The most fundamental evaluation technique is to hold out data that the model never sees during training. We partition our dataset $D$ into three disjoint subsets:
| Split | Typical Size | Purpose |
|---|---|---|
| Training set | 60-80% | Fit model parameters |
| Validation set | 10-20% | Tune hyperparameters, select models |
| Test set | 10-20% | Final, unbiased performance estimate |
The test set is sacred. You touch it once, at the very end, to report your final performance estimate. If you peek at test performance during development and make decisions based on it, your test set becomes a second validation set and your reported performance becomes optimistically biased.
import numpy as np
from sklearn.model_selection import train_test_split
def create_splits(
X: np.ndarray,
y: np.ndarray,
val_size: float = 0.15,
test_size: float = 0.15,
random_state: int = 42
) -> tuple[np.ndarray, ...]:
"""Split data into train, validation, and test sets.
Args:
X: Feature matrix of shape (n_samples, n_features).
y: Target vector of shape (n_samples,).
val_size: Fraction of data for validation.
test_size: Fraction of data for test.
random_state: Random seed for reproducibility.
Returns:
Tuple of (X_train, X_val, X_test, y_train, y_val, y_test).
"""
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=test_size, random_state=random_state, stratify=y
)
# Second split: separate validation from training
relative_val_size = val_size / (1 - test_size)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=relative_val_size,
random_state=random_state, stratify=y_temp
)
return X_train, X_val, X_test, y_train, y_val, y_test
Key principle: The stratify parameter ensures that each split preserves the class distribution of the original dataset. This is critical for imbalanced datasets, a topic we discussed in Chapter 5.
8.2.2 When Simple Splits Fail
A single train/test split has a significant weakness: the performance estimate depends heavily on which particular examples end up in which split. With small datasets (say, fewer than 10,000 examples), this randomness can cause performance estimates to vary by several percentage points. Cross-validation addresses this problem.
8.3 Cross-Validation
8.3.1 K-Fold Cross-Validation
K-fold cross-validation partitions the dataset into $k$ equally sized folds. For each fold $i \in \{1, \ldots, k\}$, we train on the remaining $k-1$ folds and evaluate on fold $i$. The overall performance is the average across all $k$ evaluations:
$$\hat{E}_{\text{CV}} = \frac{1}{k} \sum_{i=1}^{k} L_i$$
where $L_i$ is the loss computed on fold $i$.
The standard choice is $k = 5$ or $k = 10$. The tradeoff is:
- Larger $k$: Less bias (training sets are larger, closer to the full dataset), but higher variance (the training sets overlap more, so the per-fold estimates are correlated) and higher computational cost.
- Smaller $k$: More bias but lower variance and faster computation.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
def evaluate_with_cv(
X: np.ndarray,
y: np.ndarray,
k: int = 5,
random_state: int = 42
) -> dict[str, float | list[float]]:
"""Evaluate a model using k-fold cross-validation.
Args:
X: Feature matrix of shape (n_samples, n_features).
y: Target vector of shape (n_samples,).
k: Number of folds.
random_state: Random seed for reproducibility.
Returns:
Dictionary with mean and std of cross-validation scores.
"""
model = RandomForestClassifier(n_estimators=100, random_state=random_state)
scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
return {
"mean_accuracy": float(np.mean(scores)),
"std_accuracy": float(np.std(scores)),
"fold_scores": scores.tolist(),
}
8.3.2 Stratified K-Fold
When dealing with classification problems, especially those with imbalanced classes (see Chapter 5, Section 5.6), standard k-fold can produce folds where some classes are underrepresented or entirely absent. Stratified k-fold ensures that each fold has approximately the same class distribution as the full dataset.
from sklearn.model_selection import StratifiedKFold
def stratified_cv(
X: np.ndarray,
y: np.ndarray,
k: int = 5,
random_state: int = 42
) -> list[tuple[np.ndarray, np.ndarray]]:
"""Generate stratified k-fold splits.
Args:
X: Feature matrix of shape (n_samples, n_features).
y: Target vector of shape (n_samples,).
k: Number of folds.
random_state: Random seed for reproducibility.
Returns:
List of (train_indices, test_indices) tuples.
"""
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=random_state)
return list(skf.split(X, y))
Rule of thumb: Always use stratified k-fold for classification tasks. Use standard k-fold for regression tasks.
8.3.3 Leave-One-Out Cross-Validation (LOOCV)
LOOCV is the extreme case where $k = n$ (the number of samples). Each sample serves as its own test set exactly once. The advantage is minimal bias -- each training set is nearly the full dataset. The disadvantage is extreme computational cost ($n$ model fits) and, perhaps counterintuitively, high variance because the $n$ training sets are nearly identical.
LOOCV is most useful for very small datasets (fewer than 100 samples) where every data point counts.
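A minimal sketch with scikit-learn's LeaveOneOut splitter (synthetic data standing in for a small dataset):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# 80 samples: small enough that LOOCV's n model fits are affordable.
X_small, y_small = make_classification(n_samples=80, n_features=10, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_small, y_small,
    cv=LeaveOneOut(), scoring="accuracy",
)
print(f"LOOCV accuracy over {len(scores)} fits: {scores.mean():.3f}")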
8.3.4 Time Series Cross-Validation
Standard k-fold is invalid for time series data because it allows the model to train on future data and predict the past, violating the temporal ordering. Time series cross-validation respects chronological order:
Fold 1: Train [---] Test [-]
Fold 2: Train [----] Test [-]
Fold 3: Train [-----] Test [-]
Fold 4: Train [------] Test [-]
Each fold uses all data up to a certain point for training and the subsequent window for testing.
from sklearn.model_selection import TimeSeriesSplit
def time_series_cv(
X: np.ndarray,
y: np.ndarray,
n_splits: int = 5
) -> list[tuple[np.ndarray, np.ndarray]]:
"""Generate time series cross-validation splits.
Args:
X: Feature matrix of shape (n_samples, n_features),
assumed to be in chronological order.
y: Target vector of shape (n_samples,).
n_splits: Number of splits.
Returns:
List of (train_indices, test_indices) tuples.
"""
tscv = TimeSeriesSplit(n_splits=n_splits)
splits = []
for train_idx, test_idx in tscv.split(X):
splits.append((train_idx, test_idx))
print(f"Train: {train_idx[0]}-{train_idx[-1]}, "
f"Test: {test_idx[0]}-{test_idx[-1]}")
return splits
This technique is essential for any temporal prediction task, including the time series forecasting problems discussed later in Part III.
8.3.5 Nested Cross-Validation
When you use cross-validation both to select hyperparameters and to estimate model performance, you introduce a subtle optimistic bias. Nested cross-validation resolves this by using two loops:
- Outer loop: Estimates generalization performance (e.g., 5-fold).
- Inner loop: Selects the best hyperparameters for each outer fold (e.g., 5-fold within each outer training set).
$$\hat{E}_{\text{nested}} = \frac{1}{k_{\text{outer}}} \sum_{i=1}^{k_{\text{outer}}} L_i^*$$
where $L_i^*$ is the test loss on outer fold $i$ using the model whose hyperparameters were selected by the inner cross-validation on the corresponding outer training set.
from sklearn.model_selection import cross_val_score, GridSearchCV
def nested_cv(
X: np.ndarray,
y: np.ndarray,
model,
param_grid: dict,
outer_k: int = 5,
inner_k: int = 5
) -> np.ndarray:
"""Perform nested cross-validation.
Args:
X: Feature matrix of shape (n_samples, n_features).
y: Target vector of shape (n_samples,).
model: Scikit-learn estimator.
param_grid: Dictionary of hyperparameters to search.
outer_k: Number of outer folds.
inner_k: Number of inner folds.
Returns:
Array of outer fold scores.
"""
inner_cv = GridSearchCV(
model, param_grid, cv=inner_k, scoring="accuracy", n_jobs=-1
)
outer_scores = cross_val_score(
inner_cv, X, y, cv=outer_k, scoring="accuracy"
)
return outer_scores
Nested cross-validation is the gold standard for comparing model families (e.g., "Is a random forest better than a gradient-boosted tree for this problem?"). It is computationally expensive but provides the most honest performance estimates.
Why nested CV matters: Suppose you use 5-fold CV to select the best hyperparameters for a random forest and find that the best configuration achieves 93.2% accuracy. Can you report 93.2% as your expected performance? No -- that estimate is optimistically biased because you selected the configuration that happened to perform best on those particular folds. The outer loop of nested CV provides an unbiased estimate by evaluating each "inner winner" on truly held-out data.
Worked Example: Consider comparing three model families on a medical diagnosis task with 2,000 samples:
- Outer loop: 5 folds, each holding out 400 samples for testing.
- Inner loop: For each outer training set (1,600 samples), run 5-fold CV over a hyperparameter grid.
- For each outer fold, the inner loop selects the best hyperparameters, trains on all 1,600 inner samples, and evaluates on the 400 outer test samples.
If the outer fold scores are [0.91, 0.89, 0.93, 0.90, 0.92], the nested CV estimate is $0.910 \pm 0.015$. This is honest because the hyperparameter selection was performed entirely within each outer training set. Compare this to a non-nested estimate of 0.932 -- the 2.2 percentage point gap is the optimistic bias that nested CV eliminates.
The computational cost is substantial: with $k_{\text{outer}} = 5$, $k_{\text{inner}} = 5$, and a grid of 27 configurations, you fit $5 \times 5 \times 27 = 675$ models. For expensive models like gradient boosting, this can take hours. However, the investment pays for itself by providing estimates you can trust.
8.4 Classification Metrics
Choosing the right metric is one of the most consequential decisions in any machine learning project. Different metrics encode different values about what constitutes "good" performance. We begin with classification metrics.
8.4.1 The Confusion Matrix
For a binary classifier, predictions and ground truth combine into four categories:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Every classification metric is derived from these four numbers. The art of evaluation lies in choosing which combinations matter most for your application.
from sklearn.metrics import confusion_matrix
import numpy as np
def detailed_confusion_matrix(
y_true: np.ndarray,
y_pred: np.ndarray
) -> dict[str, int]:
"""Compute detailed confusion matrix components.
Args:
y_true: Ground truth labels.
y_pred: Predicted labels.
Returns:
Dictionary with TP, TN, FP, FN counts.
"""
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
return {"TP": int(tp), "TN": int(tn), "FP": int(fp), "FN": int(fn)}
8.4.2 Accuracy
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is the fraction of predictions that are correct. It is intuitive but deeply misleading for imbalanced datasets. If 99% of transactions are legitimate, a "predict everything as legitimate" classifier achieves 99% accuracy while catching zero fraud.
Use accuracy when: Classes are roughly balanced AND all errors are equally costly.
8.4.3 Precision and Recall
Precision (also called Positive Predictive Value) answers: "Of all the items I flagged as positive, how many actually are positive?"
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (also called Sensitivity or True Positive Rate) answers: "Of all the actually positive items, how many did I catch?"
$$\text{Recall} = \frac{TP}{TP + FN}$$
There is an inherent tension between precision and recall. Increasing the classification threshold (requiring more confidence to predict positive) increases precision but decreases recall. Lowering the threshold does the opposite.
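The sketch below makes this tension concrete by sweeping the threshold on a synthetic, imbalanced problem (the data and thresholds are illustrative):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Raising the threshold trades recall for precision.
for threshold in (0.2, 0.5, 0.8):
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_te, preds, zero_division=0):.2f}  "
          f"recall={recall_score(y_te, preds, zero_division=0):.2f}")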
Application guidelines:
- High precision critical: Spam filtering (users hate losing legitimate email), content recommendation (irrelevant suggestions erode trust).
- High recall critical: Medical diagnosis (missing a disease is dangerous), fraud detection (letting fraud through is costly), security screening (missing a threat is catastrophic).
8.4.4 F1 Score and F-beta Score
The F1 score is the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
The harmonic mean penalizes extreme imbalance: if either precision or recall is near zero, $F_1$ is near zero, even if the other is high.
The generalized F-beta score allows you to weight recall $\beta$ times as important as precision:
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
- $\beta = 0.5$: Weights precision twice as much as recall.
- $\beta = 1$: Equal weight (standard F1).
- $\beta = 2$: Weights recall twice as much as precision (common in medical applications).
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, fbeta_score, classification_report
)
def compute_classification_metrics(
y_true: np.ndarray,
y_pred: np.ndarray,
beta: float = 1.0
) -> dict[str, float]:
"""Compute comprehensive classification metrics.
Args:
y_true: Ground truth labels.
y_pred: Predicted labels.
beta: Beta parameter for F-beta score.
Returns:
Dictionary of metric names to values.
"""
return {
"accuracy": accuracy_score(y_true, y_pred),
"precision": precision_score(y_true, y_pred, zero_division=0),
"recall": recall_score(y_true, y_pred, zero_division=0),
"f1": f1_score(y_true, y_pred, zero_division=0),
"f_beta": fbeta_score(y_true, y_pred, beta=beta, zero_division=0),
}
8.4.5 ROC Curves and AUC
Most classifiers output a continuous score (probability or confidence) rather than a hard class label. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (recall) against the False Positive Rate ($FPR = FP / (FP + TN)$) as the classification threshold varies from 0 to 1.
The Area Under the ROC Curve (AUC-ROC) summarizes the entire curve into a single number:
- AUC = 1.0: Perfect classifier.
- AUC = 0.5: Random guessing (the diagonal line).
- AUC < 0.5: Worse than random (the labels are likely inverted).
AUC-ROC has an elegant probabilistic interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example. Formally:
$$\text{AUC} = P(s(\mathbf{x}^+) > s(\mathbf{x}^-))$$
where $s(\mathbf{x}^+)$ is the score assigned to a randomly chosen positive example and $s(\mathbf{x}^-)$ is the score assigned to a randomly chosen negative example. This interpretation connects AUC to the Wilcoxon rank-sum statistic and provides an intuition independent of any threshold choice.
Worked Example: Suppose a model produces the following scores for 5 positive and 5 negative examples:
| Example | True Label | Score |
|---|---|---|
| 1 | + | 0.92 |
| 2 | + | 0.85 |
| 3 | - | 0.78 |
| 4 | + | 0.71 |
| 5 | - | 0.65 |
| 6 | + | 0.55 |
| 7 | - | 0.42 |
| 8 | - | 0.30 |
| 9 | + | 0.22 |
| 10 | - | 0.10 |
- At threshold 0.80: TP=2, FP=0, FN=3, TN=5, so TPR=0.4 and FPR=0.0.
- At threshold 0.50: TP=4, FP=2, FN=1, TN=3, so TPR=0.8 and FPR=0.4.
- At threshold 0.20: TP=5, FP=4, FN=0, TN=1, so TPR=1.0 and FPR=0.8.
Plotting these (FPR, TPR) points, together with those from the remaining thresholds, traces the ROC curve. The AUC can be computed by counting concordant pairs: out of $5 \times 5 = 25$ positive-negative pairs, the positive example has the higher score in 18 cases, giving AUC $= 18/25 = 0.72$.
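The pairwise count is easy to verify in code; scikit-learn's roc_auc_score computes exactly this quantity (counting ties as one half):
import numpy as np
from sklearn.metrics import roc_auc_score

# Labels and scores from the table above (1 = positive, 0 = negative).
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.92, 0.85, 0.78, 0.71, 0.65, 0.55, 0.42, 0.30, 0.22, 0.10])

pos, neg = y_score[y_true == 1], y_score[y_true == 0]
concordant = sum(p > n for p in pos for n in neg)
print(concordant, concordant / (len(pos) * len(neg)))   # 18 pairs, 0.72
print(roc_auc_score(y_true, y_score))                   # 0.72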
Limitations of AUC-ROC: For highly imbalanced datasets, AUC-ROC can be misleadingly optimistic because the False Positive Rate denominator ($FP + TN$) is dominated by the large negative class. A model might have a low FPR (say 0.01) while still producing thousands of false positives when the negative class contains millions of examples. In such cases, Precision-Recall curves are more informative.
8.4.6 Precision-Recall Curves and AUC-PR
The Precision-Recall (PR) curve plots precision against recall as the threshold varies. It focuses entirely on the positive class, making it more informative than ROC for imbalanced problems.
The AUC-PR (also called Average Precision) summarizes the PR curve. A random classifier has an AUC-PR equal to the prevalence (fraction of positives), so baseline performance depends on the class balance -- making it harder to game.
from sklearn.metrics import (
roc_curve, auc, precision_recall_curve, average_precision_score
)
def compute_curve_metrics(
y_true: np.ndarray,
y_scores: np.ndarray
) -> dict[str, float]:
"""Compute ROC-AUC and PR-AUC from predicted scores.
Args:
y_true: Ground truth binary labels.
y_scores: Predicted probabilities or scores.
Returns:
Dictionary with AUC-ROC and AUC-PR values.
"""
fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
precision, recall, _ = precision_recall_curve(y_true, y_scores)
pr_auc = average_precision_score(y_true, y_scores)
return {"auc_roc": roc_auc, "auc_pr": pr_auc}
8.4.7 Multi-Class Metrics
For problems with more than two classes, metrics must be extended. Three common averaging strategies exist:
- Macro average: Compute the metric independently for each class, then average. Treats all classes equally regardless of size.
- Micro average: Aggregate TP, FP, FN across all classes, then compute the metric. Equivalent to accuracy for precision/recall/F1.
- Weighted average: Like macro, but weight each class by its support (number of true instances).
$$\text{Macro-}F_1 = \frac{1}{C} \sum_{c=1}^{C} F_1^{(c)}, \qquad \text{Weighted-}F_1 = \sum_{c=1}^{C} \frac{n_c}{n} F_1^{(c)}$$
from sklearn.metrics import f1_score
def multiclass_f1(
y_true: np.ndarray,
y_pred: np.ndarray
) -> dict[str, float]:
"""Compute F1 scores with different averaging strategies.
Args:
y_true: Ground truth labels.
y_pred: Predicted labels.
Returns:
Dictionary of averaging strategy to F1 score.
"""
return {
"macro_f1": f1_score(y_true, y_pred, average="macro"),
"micro_f1": f1_score(y_true, y_pred, average="micro"),
"weighted_f1": f1_score(y_true, y_pred, average="weighted"),
}
8.5 Regression Metrics
8.5.1 Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}$$
MSE penalizes large errors quadratically, making it sensitive to outliers. RMSE has the same units as the target variable, making it more interpretable.
Recall from Chapter 4 that MSE is the loss function we minimized in ordinary least squares regression. Here we use it as an evaluation metric -- the same quantity, but serving a different purpose.
8.5.2 Mean Absolute Error (MAE)
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
MAE penalizes all errors linearly and is therefore more robust to outliers than MSE. The choice between MSE and MAE depends on whether large errors should be penalized disproportionately.
Guideline: If a prediction error of 10 units is truly "twice as bad" as an error of 5 units, use MAE. If it is "four times as bad," use MSE.
8.5.3 R-Squared (Coefficient of Determination)
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$
$R^2$ measures the fraction of variance in the target that the model explains. An $R^2$ of 1 means perfect prediction; an $R^2$ of 0 means the model is no better than predicting the mean; an $R^2 < 0$ means the model is worse than the mean predictor.
Caution: $R^2$ on training data never decreases as you add more features, even useless ones. Always evaluate $R^2$ on held-out data. The adjusted $R^2$ penalizes for the number of features:
$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where $p$ is the number of features and $n$ is the number of samples.
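Scikit-learn's r2_score does not apply this adjustment, so a small helper is needed; a minimal sketch:
import numpy as np
from sklearn.metrics import r2_score

def adjusted_r2(y_true: np.ndarray, y_pred: np.ndarray, n_features: int) -> float:
    """Adjusted R-squared, penalizing the number of features p = n_features."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)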
from sklearn.metrics import (
mean_squared_error, mean_absolute_error, r2_score
)
def compute_regression_metrics(
y_true: np.ndarray,
y_pred: np.ndarray
) -> dict[str, float]:
"""Compute comprehensive regression metrics.
Args:
y_true: Ground truth values.
y_pred: Predicted values.
Returns:
Dictionary of metric names to values.
"""
mse = mean_squared_error(y_true, y_pred)
return {
"mse": mse,
"rmse": np.sqrt(mse),
"mae": mean_absolute_error(y_true, y_pred),
"r2": r2_score(y_true, y_pred),
}
8.5.4 Choosing Regression Metrics
| Metric | Sensitive to Outliers | Interpretable Units | Use When |
|---|---|---|---|
| MSE | Very | No (squared units) | Large errors are especially costly |
| RMSE | Very | Yes | Same as MSE, but need interpretable scale |
| MAE | Moderately | Yes | Outliers are expected/acceptable |
| R-squared | Moderately | No (dimensionless) | Comparing models, communicating to stakeholders |
8.5.5 Confidence Intervals for Metrics
Reporting a single number for any metric is incomplete without a measure of uncertainty. Two common methods for computing confidence intervals are:
Bootstrap confidence intervals: Resample the test set with replacement $B$ times (typically $B = 1000$ or more), compute the metric on each resample, and take the $\alpha/2$ and $1 - \alpha/2$ quantiles of the resulting distribution.
def bootstrap_confidence_interval(
y_true: np.ndarray,
y_pred: np.ndarray,
metric_fn,
n_bootstrap: int = 1000,
confidence: float = 0.95,
random_state: int = 42,
) -> tuple[float, float, float]:
"""Compute a bootstrap confidence interval for a metric.
Args:
y_true: Ground truth values.
y_pred: Predicted values.
metric_fn: Metric function taking (y_true, y_pred).
n_bootstrap: Number of bootstrap samples.
confidence: Confidence level (e.g., 0.95 for 95% CI).
random_state: Random seed.
Returns:
Tuple of (point_estimate, ci_lower, ci_upper).
"""
rng = np.random.default_rng(random_state)
n = len(y_true)
scores = np.zeros(n_bootstrap)
for i in range(n_bootstrap):
idx = rng.choice(n, size=n, replace=True)
scores[i] = metric_fn(y_true[idx], y_pred[idx])
alpha = (1 - confidence) / 2
ci_lower = np.percentile(scores, 100 * alpha)
ci_upper = np.percentile(scores, 100 * (1 - alpha))
point_estimate = metric_fn(y_true, y_pred)
return point_estimate, ci_lower, ci_upper
Normal approximation: For metrics like accuracy, the standard error can be approximated as $\text{SE} = \sqrt{p(1-p)/n}$ where $p$ is the accuracy and $n$ is the test set size. The 95% CI is approximately $p \pm 1.96 \cdot \text{SE}$. This approximation is valid for large $n$ and moderate $p$, but bootstrapping is more general and reliable.
Practical guideline: For a test set of 1,000 examples and an accuracy of 90%, the 95% CI is approximately $[88.1\%, 91.9\%]$. For a test set of 100 examples, it widens to approximately $[84.1\%, 95.9\%]$. This illustrates why small test sets produce unreliable estimates and why you should always report confidence intervals.
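These figures follow directly from the normal approximation; a quick check:
import numpy as np

def accuracy_ci_normal(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for an accuracy p measured on n test examples."""
    se = np.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

print(accuracy_ci_normal(0.90, 1000))   # roughly (0.881, 0.919)
print(accuracy_ci_normal(0.90, 100))    # roughly (0.841, 0.959)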
8.5.6 Metric Selection by Application Domain
The choice of metric is ultimately a domain decision that encodes your values about what constitutes acceptable model behavior. Here is a guide organized by application area:
| Domain | Primary Metric | Secondary Metrics | Rationale |
|---|---|---|---|
| Medical screening | Recall at fixed precision | AUC-PR, F2 | Missing a disease (FN) is worse than a false alarm (FP) |
| Spam filtering | Precision at fixed recall | F1, AUC-ROC | Blocking legitimate email (FP) destroys user trust |
| Fraud detection | AUC-PR | Recall, Precision@k | Extreme imbalance; need to rank suspicious cases |
| Search ranking | NDCG, MAP | MRR | Order matters, not just correctness |
| Demand forecasting | MAPE, RMSE | MAE, R-squared | Business decisions depend on error magnitude |
| Autonomous driving | Safety-critical recall | Latency, FPR | Missing an obstacle is catastrophic |
As we discussed in Chapter 6 (Section 6.4.6), the confusion matrix is the foundation for all classification metrics. Always examine it before choosing a summary metric, especially for imbalanced problems.
8.6 The Bias-Variance Tradeoff
8.6.1 Decomposition of Expected Error
The expected prediction error for a model $f$ at a point $x_0$ can be decomposed as:
$$\mathbb{E}[(y - f(x_0))^2] = \text{Bias}^2(f(x_0)) + \text{Var}(f(x_0)) + \sigma^2$$
where:
- Bias $= \mathbb{E}[f(x_0)] - f^*(x_0)$, with $f^*$ the true function: the systematic error due to wrong assumptions in the model. A linear model fit to a quadratic relationship has high bias.
- Variance $= \mathbb{E}[(f(x_0) - \mathbb{E}[f(x_0)])^2]$: how much the model changes when trained on different datasets. A deep decision tree has high variance.
- Irreducible error ($\sigma^2$): noise inherent in the data that no model can eliminate.
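The decomposition can be made tangible with a small simulation: draw many training sets from a known data-generating process, refit the model on each, and measure bias and variance at fixed test points. The setup below (a sine-wave target with polynomial models) is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_test)                 # the true function
n_datasets, n_train, sigma = 200, 30, 0.3

for degree in (1, 12):                              # rigid model vs. flexible model
    preds = np.zeros((n_datasets, x_test.size))
    for d in range(n_datasets):
        x = rng.uniform(0, 1, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[d] = model.fit(x[:, None], y).predict(x_test[:, None])
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}")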
8.6.2 Diagnosing Bias and Variance
The key diagnostic tool is the learning curve, which plots training and validation performance as a function of training set size.
| Symptom | Diagnosis | Remedy |
|---|---|---|
| High training error, high validation error, small gap | High bias (underfitting) | More complex model, more features, less regularization |
| Low training error, high validation error, large gap | High variance (overfitting) | More data, simpler model, more regularization, ensemble methods |
| Both errors converge to acceptable level | Good fit | Deploy with confidence |
from sklearn.model_selection import learning_curve
def plot_learning_curve_data(
estimator,
X: np.ndarray,
y: np.ndarray,
cv: int = 5,
n_points: int = 10
) -> dict[str, np.ndarray]:
"""Compute learning curve data for bias-variance diagnosis.
Args:
estimator: Scikit-learn estimator.
X: Feature matrix.
y: Target vector.
cv: Number of cross-validation folds.
n_points: Number of training set sizes to evaluate.
Returns:
Dictionary with train_sizes, train_scores, and val_scores.
"""
train_sizes, train_scores, val_scores = learning_curve(
estimator, X, y, cv=cv,
train_sizes=np.linspace(0.1, 1.0, n_points),
scoring="neg_mean_squared_error",
n_jobs=-1,
)
return {
"train_sizes": train_sizes,
"train_scores_mean": -train_scores.mean(axis=1),
"train_scores_std": train_scores.std(axis=1),
"val_scores_mean": -val_scores.mean(axis=1),
"val_scores_std": val_scores.std(axis=1),
}
8.6.3 Regularization Revisited
As we discussed in Chapter 4 (Section 4.5) on Ridge and Lasso regression, regularization is the primary lever for controlling the bias-variance tradeoff. Increasing regularization strength increases bias but decreases variance. The validation curve plots performance as a function of a regularization parameter to find the sweet spot.
from sklearn.model_selection import validation_curve
def compute_validation_curve(
estimator,
X: np.ndarray,
y: np.ndarray,
param_name: str,
param_range: np.ndarray,
cv: int = 5
) -> dict[str, np.ndarray]:
"""Compute validation curve for a hyperparameter.
Args:
estimator: Scikit-learn estimator.
X: Feature matrix.
y: Target vector.
param_name: Name of the hyperparameter to vary.
param_range: Array of parameter values to evaluate.
cv: Number of cross-validation folds.
Returns:
Dictionary with parameter values, train and validation scores.
"""
train_scores, val_scores = validation_curve(
estimator, X, y,
param_name=param_name,
param_range=param_range,
cv=cv, scoring="accuracy", n_jobs=-1,
)
return {
"param_range": param_range,
"train_mean": train_scores.mean(axis=1),
"train_std": train_scores.std(axis=1),
"val_mean": val_scores.mean(axis=1),
"val_std": val_scores.std(axis=1),
}
8.6.4 The Double Descent Phenomenon
Recent research has revealed that the classical bias-variance tradeoff story is incomplete. In heavily overparameterized models (models with more parameters than training samples), test error can decrease again after the "interpolation threshold" -- the point where the model perfectly fits the training data. This double descent phenomenon, observed in neural networks and even simple models, challenges the classical U-shaped test error curve.
While a full treatment is beyond our scope, the practical takeaway is: the classical bias-variance tradeoff remains an excellent guide for the models covered in Chapters 4-7, but be prepared for more nuanced behavior when we reach deep learning in Part IV.
8.7 Hyperparameter Tuning
8.7.1 Hyperparameters vs. Parameters
Recall the distinction introduced in Chapter 2: - Parameters are learned from data during training (e.g., weights in linear regression, split thresholds in decision trees). - Hyperparameters are set before training and control the learning process (e.g., learning rate, regularization strength, number of trees, maximum depth).
Hyperparameter tuning (also called hyperparameter optimization) is the process of finding the hyperparameter configuration that yields the best validation performance.
8.7.2 Grid Search
Grid search exhaustively evaluates every combination of hyperparameter values in a predefined grid.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
def grid_search_example(
X: np.ndarray,
y: np.ndarray,
cv: int = 5
) -> dict:
"""Demonstrate grid search hyperparameter tuning.
Args:
X: Feature matrix.
y: Target vector.
cv: Number of cross-validation folds.
Returns:
Dictionary with best parameters and best score.
"""
param_grid = {
"n_estimators": [50, 100, 200],
"max_depth": [3, 5, 7],
"learning_rate": [0.01, 0.1, 0.2],
}
grid = GridSearchCV(
GradientBoostingClassifier(random_state=42),
param_grid,
cv=cv,
scoring="f1",
n_jobs=-1,
verbose=1,
)
grid.fit(X, y)
return {
"best_params": grid.best_params_,
"best_score": grid.best_score_,
"total_fits": len(grid.cv_results_["mean_test_score"]),
}
Complexity: Grid search evaluates $\prod_{i=1}^{d} |V_i|$ combinations, where $|V_i|$ is the number of values for hyperparameter $i$. With 5 hyperparameters, each with 5 values, that is $5^5 = 3125$ model fits -- multiplied by $k$ for cross-validation. This exponential growth is called the curse of dimensionality of hyperparameter search and is the main limitation of grid search.
8.7.3 Random Search
Random search samples hyperparameter configurations randomly from specified distributions. Bergstra and Bengio (2012) showed that random search is more efficient than grid search for most problems, because:
- Not all hyperparameters are equally important. Grid search wastes evaluations on unimportant dimensions.
- Random search explores a more diverse set of values for each hyperparameter.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import uniform, randint
def random_search_example(
X: np.ndarray,
y: np.ndarray,
n_iter: int = 50,
cv: int = 5
) -> dict:
"""Demonstrate random search hyperparameter tuning.
Args:
X: Feature matrix.
y: Target vector.
n_iter: Number of random configurations to evaluate.
cv: Number of cross-validation folds.
Returns:
Dictionary with best parameters and best score.
"""
param_distributions = {
"n_estimators": randint(50, 300),
"max_depth": randint(2, 10),
"learning_rate": uniform(0.001, 0.3),
"subsample": uniform(0.6, 0.4),
"min_samples_split": randint(2, 20),
}
search = RandomizedSearchCV(
GradientBoostingClassifier(random_state=42),
param_distributions,
n_iter=n_iter,
cv=cv,
scoring="f1",
n_jobs=-1,
random_state=42,
verbose=1,
)
search.fit(X, y)
return {
"best_params": search.best_params_,
"best_score": search.best_score_,
"total_fits": n_iter,
}
8.7.4 Bayesian Optimization
Bayesian optimization treats hyperparameter tuning as a sequential optimization problem. It builds a surrogate model (typically a Gaussian Process) of the objective function and uses an acquisition function to decide which configuration to evaluate next, balancing exploration (trying new regions) and exploitation (refining promising regions).
The algorithm:
1. Evaluate a few random configurations as initial data points.
2. Fit a Gaussian Process to the observed (configuration, performance) pairs.
3. Select the next configuration by maximizing the acquisition function (e.g., Expected Improvement).
4. Evaluate the new configuration.
5. Repeat steps 2-4 until the budget is exhausted.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
def bayesian_optimization_manual(
X: np.ndarray,
y: np.ndarray,
n_calls: int = 30,
cv: int = 5
) -> dict:
"""Demonstrate the concept of Bayesian optimization.
This is a simplified illustration. In practice, use libraries
like scikit-optimize, Optuna, or Hyperopt.
Args:
X: Feature matrix.
y: Target vector.
n_calls: Number of evaluations.
cv: Number of cross-validation folds.
Returns:
Dictionary with best parameters and best score.
"""
# In practice, use skopt.BayesSearchCV or optuna.create_study()
# Here we show the conceptual flow
try:
from skopt import BayesSearchCV
from skopt.space import Integer, Real
search_spaces = {
"n_estimators": Integer(50, 300),
"max_depth": Integer(2, 10),
"learning_rate": Real(0.001, 0.3, prior="log-uniform"),
}
opt = BayesSearchCV(
GradientBoostingClassifier(random_state=42),
search_spaces,
n_iter=n_calls,
cv=cv,
scoring="f1",
n_jobs=-1,
random_state=42,
)
opt.fit(X, y)
return {
"best_params": dict(opt.best_params_),
"best_score": opt.best_score_,
}
except ImportError:
print("Install scikit-optimize: pip install scikit-optimize")
return {}
8.7.5 Comparison of Tuning Strategies
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Grid search | Exhaustive, reproducible | Exponential cost, wasteful | Few hyperparameters (2-3), small grids |
| Random search | Efficient, easy to parallelize | No learning between evaluations | Moderate-dimensional spaces, limited budget |
| Bayesian optimization | Sample-efficient, learns from history | Sequential (hard to parallelize), overhead of surrogate model | Expensive-to-evaluate models, large search spaces |
8.7.6 Practical Tips for Hyperparameter Tuning
- Start with random search to identify promising regions, then use grid search to refine.
- Use log-uniform distributions for learning rates and regularization parameters (e.g., search $10^{-4}$ to $10^{-1}$ rather than $0.0001$ to $0.1$ linearly); a short sketch follows this list.
- Fix unimportant hyperparameters first. Use domain knowledge to identify which hyperparameters matter most.
- Set a computational budget before starting. Decide how many configurations you can afford to evaluate.
- Warm-start from previous experiments. If you have run similar experiments before, use those results to initialize your search.
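For the log-uniform tip above, SciPy's loguniform distribution plugs directly into RandomizedSearchCV (the range shown is just the example from the tip):
from scipy.stats import loguniform

# Log-uniform sampling spreads trials evenly across orders of magnitude.
lr_distribution = loguniform(1e-4, 1e-1)
print(lr_distribution.rvs(size=5, random_state=0))
# e.g., param_distributions = {"learning_rate": loguniform(1e-4, 1e-1)}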
8.8 Statistical Tests for Model Comparison
8.8.1 The Need for Statistical Rigor
Suppose Model A achieves 92.3% accuracy and Model B achieves 91.8% accuracy on a single test set. Is Model A truly better, or did it just get lucky with this particular data split? Without statistical testing, we cannot distinguish genuine differences from random fluctuation.
8.8.2 Why Do We Need Statistical Tests?
To appreciate the need for statistical rigor, consider a concrete scenario. You train five models on a credit scoring dataset and obtain the following 5-fold cross-validation accuracies:
| Model | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.841 | 0.823 | 0.856 | 0.832 | 0.848 | 0.840 |
| Random Forest | 0.853 | 0.831 | 0.862 | 0.839 | 0.855 | 0.848 |
| Gradient Boosting | 0.861 | 0.837 | 0.859 | 0.842 | 0.851 | 0.850 |
Gradient Boosting has the highest mean accuracy (0.850), but is it genuinely better than Random Forest (0.848)? The 0.2 percentage point difference could easily arise from random variation in the fold splits. Without a statistical test, we cannot distinguish a real improvement from noise. This is exactly the situation where a paired t-test or Wilcoxon test provides clarity.
The general principle is: always test whether an observed difference is statistically significant before drawing conclusions. A difference that is not significant should not influence your model selection decision.
8.8.3 McNemar's Test
For comparing two classifiers on the same test set, McNemar's test examines the contingency table of their predictions:
| | Model B Correct | Model B Incorrect |
|---|---|---|
| Model A Correct | $n_{00}$ | $n_{01}$ |
| Model A Incorrect | $n_{10}$ | $n_{11}$ |
Only the discordant pairs ($n_{01}$ and $n_{10}$) matter -- cases where the models disagree. Under the null hypothesis that both models have equal error rates:
$$\chi^2 = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}}$$
This follows a chi-squared distribution with 1 degree of freedom. The continuity correction ($-1$) improves the approximation for small sample sizes.
from scipy.stats import chi2
def mcnemar_test(
y_true: np.ndarray,
y_pred_a: np.ndarray,
y_pred_b: np.ndarray
) -> dict[str, float]:
"""Perform McNemar's test comparing two classifiers.
Args:
y_true: Ground truth labels.
y_pred_a: Predictions from model A.
y_pred_b: Predictions from model B.
Returns:
Dictionary with chi-squared statistic and p-value.
"""
correct_a = (y_pred_a == y_true)
correct_b = (y_pred_b == y_true)
# Discordant pairs
n_01 = np.sum(correct_a & ~correct_b) # A correct, B wrong
n_10 = np.sum(~correct_a & correct_b) # A wrong, B correct
# McNemar's test with continuity correction
chi2_stat = (abs(n_01 - n_10) - 1) ** 2 / (n_01 + n_10)
p_value = 1 - chi2.cdf(chi2_stat, df=1)
return {"chi2_statistic": chi2_stat, "p_value": p_value}
Worked Example: Suppose we test 500 samples. Model A gets 420 correct and Model B gets 415 correct. The contingency table is:
| | B Correct (415) | B Incorrect (85) |
|---|---|---|
| A Correct (420) | $n_{00} = 395$ | $n_{01} = 25$ |
| A Incorrect (80) | $n_{10} = 20$ | $n_{11} = 60$ |
The discordant pairs are $n_{01} = 25$ (A right, B wrong) and $n_{10} = 20$ (A wrong, B right):
$$\chi^2 = \frac{(|25 - 20| - 1)^2}{25 + 20} = \frac{16}{45} = 0.356$$
The critical value for $\chi^2_1$ at $\alpha = 0.05$ is 3.841. Since $0.356 < 3.841$, we fail to reject the null hypothesis: the difference between the models is not statistically significant, despite Model A having 5 more correct predictions.
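Reading the corresponding p-value off the chi-squared distribution confirms the conclusion:
from scipy.stats import chi2

n_01, n_10 = 25, 20
chi2_stat = (abs(n_01 - n_10) - 1) ** 2 / (n_01 + n_10)
p_value = 1 - chi2.cdf(chi2_stat, df=1)
print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.3f}")   # chi2 = 0.356, p ~ 0.55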
8.8.4 Paired t-Test on Cross-Validation Scores
When using k-fold cross-validation, we can compute a paired t-test on the $k$ paired performance differences:
$$t = \frac{\bar{d}}{s_d / \sqrt{k}}$$
where $\bar{d}$ is the mean difference in performance across folds and $s_d$ is the standard deviation of those differences.
Caveat: The independence assumption of the t-test is violated because cross-validation folds share training data. Dietterich (1998) showed this can lead to inflated Type I error rates (false positives). The corrected resampled t-test adjusts the variance estimate:
$$t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) s_d^2}}$$
from scipy.stats import t as t_dist
def corrected_paired_t_test(
scores_a: np.ndarray,
scores_b: np.ndarray,
n_train: int,
n_test: int
) -> dict[str, float]:
"""Corrected resampled paired t-test for CV scores.
Args:
scores_a: Cross-validation scores for model A.
scores_b: Cross-validation scores for model B.
n_train: Number of training samples per fold.
n_test: Number of test samples per fold.
Returns:
Dictionary with t-statistic and p-value.
"""
k = len(scores_a)
differences = scores_a - scores_b
d_bar = np.mean(differences)
s_d_sq = np.var(differences, ddof=1)
# Corrected variance
corrected_var = (1 / k + n_test / n_train) * s_d_sq
t_stat = d_bar / np.sqrt(corrected_var)
p_value = 2 * (1 - t_dist.cdf(abs(t_stat), df=k - 1))
return {"t_statistic": t_stat, "p_value": p_value}
8.8.5 The Wilcoxon Signed-Rank Test
When cross-validation scores may not be normally distributed, the Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test. It tests whether the median difference between paired observations is zero.
from scipy.stats import wilcoxon
def wilcoxon_comparison(
scores_a: np.ndarray,
scores_b: np.ndarray
) -> dict[str, float]:
"""Wilcoxon signed-rank test for comparing CV scores.
Args:
scores_a: Cross-validation scores for model A.
scores_b: Cross-validation scores for model B.
Returns:
Dictionary with test statistic and p-value.
"""
stat, p_value = wilcoxon(scores_a, scores_b)
return {"statistic": stat, "p_value": p_value}
8.8.6 Multiple Comparisons
When comparing more than two models, performing pairwise tests inflates the probability of finding a spurious significant difference (the multiple comparisons problem). Use the Bonferroni correction: multiply each p-value by the number of comparisons, or equivalently, divide the significance threshold $\alpha$ by the number of comparisons.
For comparing $m$ models, there are $\binom{m}{2}$ pairs. With $m = 5$ models, that is 10 comparisons. At $\alpha = 0.05$, the corrected threshold is $0.05 / 10 = 0.005$.
Alternatively, the Friedman test followed by a post-hoc Nemenyi test provides a more powerful framework for comparing multiple classifiers across multiple datasets. This is the recommended approach when you are comparing more than three models.
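A minimal sketch of the Friedman test with SciPy, assuming hypothetical accuracy scores for three models on the same eight datasets:
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical per-dataset accuracies for three models (rows align by dataset).
model_a = np.array([0.84, 0.79, 0.91, 0.88, 0.76, 0.82, 0.90, 0.85])
model_b = np.array([0.86, 0.80, 0.92, 0.87, 0.78, 0.84, 0.91, 0.86])
model_c = np.array([0.83, 0.77, 0.89, 0.85, 0.75, 0.81, 0.88, 0.83])

stat, p_value = friedmanchisquare(model_a, model_b, model_c)
print(f"Friedman chi2 = {stat:.3f}, p = {p_value:.4f}")
# If p is small, follow up with a post-hoc Nemenyi test (e.g., the scikit-posthocs package).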
8.9 Putting It All Together: An Evaluation Workflow
Here is a principled evaluation workflow that synthesizes everything in this chapter:
1. Split the data into train, validation, and test sets (or set up cross-validation). For time series, use temporal splits.
2. Choose the right metrics based on your application:
   - Balanced classification? Accuracy or macro-F1.
   - Imbalanced classification? F1, AUC-PR, recall at fixed precision.
   - Regression? RMSE for interpretability, MAE for robustness.
3. Train candidate models on the training set (or using cross-validation).
4. Diagnose bias and variance using learning curves and validation curves.
5. Tune hyperparameters using random search (for exploration) and then grid search or Bayesian optimization (for refinement), always evaluated with cross-validation.
6. Compare models statistically using McNemar's test (single split) or corrected paired t-test (cross-validation).
7. Evaluate the final model on the test set exactly once. Report confidence intervals.
8. Document everything: metrics, hyperparameters, data splits, random seeds. Reproducibility is not optional -- it is professional responsibility.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.metrics import classification_report
import numpy as np
def full_evaluation_pipeline(random_state: int = 42) -> None:
"""Demonstrate a complete model evaluation pipeline.
Args:
random_state: Random seed for reproducibility.
"""
# Step 1: Generate data and split
X, y = make_classification(
n_samples=2000, n_features=20, n_informative=10,
n_redundant=5, weights=[0.7, 0.3], random_state=random_state,
)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.15, stratify=y, random_state=random_state
)
# Step 2: Define candidates
candidates = {
"Logistic Regression": LogisticRegression(max_iter=1000),
"Random Forest": RandomForestClassifier(
n_estimators=100, random_state=random_state
),
"Gradient Boosting": GradientBoostingClassifier(
n_estimators=100, random_state=random_state
),
}
# Step 3: Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
results = {}
for name, model in candidates.items():
scores = cross_val_score(model, X_temp, y_temp, cv=cv, scoring="f1")
results[name] = scores
print(f"{name}: F1 = {scores.mean():.4f} (+/- {scores.std():.4f})")
    # Step 4: Statistical comparison (confirm with the corrected paired
    # t-test from Section 8.8.4 before trusting small differences)
best_name = max(results, key=lambda k: results[k].mean())
print(f"\nBest model: {best_name}")
# Step 5: Final evaluation on test set
best_model = candidates[best_name]
best_model.fit(X_temp, y_temp)
y_pred = best_model.predict(X_test)
print(f"\nTest set performance:\n{classification_report(y_test, y_pred)}")
8.10 Common Pitfalls and How to Avoid Them
8.10.1 Data Leakage
Data leakage occurs when information from the test set (or the future) contaminates the training process. Common sources:
- Feature leakage: Using features that encode the target (e.g., "claim_approved" as a feature to predict insurance fraud).
- Preprocessing leakage: Fitting a scaler or imputer on the full dataset (including test data) before splitting.
- Temporal leakage: Using future information in time series features.
Prevention: Always fit preprocessing steps on the training set only and transform the validation and test sets using those fitted parameters. Use scikit-learn Pipeline objects (introduced in Chapter 3) to ensure this is done correctly.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def leak_proof_pipeline(
model,
X_train: np.ndarray,
y_train: np.ndarray,
X_test: np.ndarray,
y_test: np.ndarray
) -> float:
"""Demonstrate a leak-proof evaluation pipeline.
Args:
model: Scikit-learn estimator.
X_train: Training features.
y_train: Training targets.
X_test: Test features.
y_test: Test targets.
Returns:
Test accuracy.
"""
pipeline = Pipeline([
("scaler", StandardScaler()),
("model", model),
])
# Scaler is fit ONLY on training data
pipeline.fit(X_train, y_train)
return pipeline.score(X_test, y_test)
8.10.2 Overfitting to the Validation Set
If you evaluate hundreds of model configurations on the same validation set, you will eventually find one that performs well on that validation set by chance. This is why the test set must remain untouched until the very end.
8.10.3 Ignoring Class Imbalance
As we discussed in Chapter 5, accuracy is misleading when classes are imbalanced. Always inspect the class distribution first and choose appropriate metrics (F1, AUC-PR) and sampling strategies (SMOTE, class weights).
8.10.4 Reporting Single Numbers Without Uncertainty
A model accuracy of "93.2%" is meaningless without a confidence interval or standard deviation. Always report uncertainty estimates from cross-validation or bootstrapping.
8.10.5 Cherry-Picking Results
It is tempting to report only the best fold from cross-validation or the best metric for your model. This is scientific fraud. Report all relevant metrics, including where your model struggles.
8.11 Advanced Topics
8.11.1 Calibration
A classifier is calibrated if its predicted probabilities reflect true probabilities. If a model says "this email has a 70% chance of being spam," then among all emails given a 70% spam probability, approximately 70% should actually be spam.
Calibration can be assessed with reliability diagrams (also called calibration curves) and quantified with metrics like Brier score and Expected Calibration Error (ECE).
from sklearn.calibration import calibration_curve
def compute_calibration(
y_true: np.ndarray,
y_prob: np.ndarray,
n_bins: int = 10
) -> dict[str, np.ndarray]:
"""Compute calibration curve data.
Args:
y_true: Ground truth binary labels.
y_prob: Predicted probabilities.
n_bins: Number of bins for the calibration curve.
Returns:
Dictionary with fraction of positives and mean predicted values.
"""
fraction_of_positives, mean_predicted_value = calibration_curve(
y_true, y_prob, n_bins=n_bins
)
return {
"fraction_of_positives": fraction_of_positives,
"mean_predicted_value": mean_predicted_value,
}
Platt scaling (fitting a logistic regression on the model's outputs) and isotonic regression are common post-hoc calibration methods.
8.11.2 Calibration Methods in Detail
Platt Scaling fits a logistic regression model to the raw outputs of the classifier. If the original classifier produces scores $s(\mathbf{x})$, Platt scaling learns parameters $a$ and $b$ such that:
$$P(y = 1 \mid s) = \frac{1}{1 + \exp(a \cdot s + b)}$$
This is particularly effective for SVMs, which produce uncalibrated decision values rather than probabilities.
Isotonic Regression fits a non-decreasing step function to the (score, outcome) pairs. It makes fewer assumptions than Platt scaling (no parametric form) but requires more data to avoid overfitting. As a rule of thumb, use Platt scaling with fewer than 1,000 calibration samples and isotonic regression with more.
Temperature Scaling is a single-parameter variant of Platt scaling commonly used with neural networks. It divides the logits by a temperature parameter $T$ before the softmax:
$$P(y = k \mid \mathbf{x}) = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}$$
When $T > 1$, the distribution becomes softer (more uniform); when $T < 1$, it becomes sharper. The optimal $T$ is found by minimizing the negative log-likelihood on a validation set.
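A minimal NumPy/SciPy sketch of fitting $T$ by minimizing validation NLL; val_logits (shape [n, K]) and integer val_labels are assumed to come from your own network and are hypothetical names here:
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(T: float, logits: np.ndarray, labels: np.ndarray) -> float:
    """Negative log-likelihood of the labels under temperature-scaled softmax."""
    scaled = logits / T
    scaled -= scaled.max(axis=1, keepdims=True)            # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits: np.ndarray, val_labels: np.ndarray) -> float:
    """Return the temperature T that minimizes NLL on the validation set."""
    result = minimize_scalar(
        nll_at_temperature, bounds=(0.05, 10.0), method="bounded",
        args=(val_logits, val_labels),
    )
    return result.x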
from sklearn.calibration import CalibratedClassifierCV
# `model`, `X_train`, and `y_train` are assumed to be defined as in earlier examples.
# Platt scaling (sigmoid calibration)
calibrated_platt = CalibratedClassifierCV(
model, method="sigmoid", cv=5
)
calibrated_platt.fit(X_train, y_train)
# Isotonic regression calibration
calibrated_isotonic = CalibratedClassifierCV(
model, method="isotonic", cv=5
)
calibrated_isotonic.fit(X_train, y_train)
Brier Score is a calibration-aware metric that measures the mean squared difference between predicted probabilities and actual outcomes:
$$\text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2$$
where $p_i$ is the predicted probability and $y_i \in \{0, 1\}$ is the true label. A perfectly calibrated and discriminating model achieves a Brier score of 0; random guessing on balanced data gives 0.25.
Expected Calibration Error (ECE) partitions predictions into $B$ bins by predicted probability and computes the weighted average gap between predicted confidence and actual accuracy:
$$\text{ECE} = \sum_{b=1}^{B} \frac{|B_b|}{n} \left| \text{acc}(B_b) - \text{conf}(B_b) \right|$$
where $\text{acc}(B_b)$ is the fraction of positive outcomes in bin $b$ and $\text{conf}(B_b)$ is the average predicted probability in that bin.
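The Brier score is available as sklearn.metrics.brier_score_loss; scikit-learn has no built-in ECE, but a minimal implementation follows directly from the formula above:
import numpy as np

def expected_calibration_error(
    y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10
) -> float:
    """ECE: support-weighted average |accuracy - confidence| over probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob > lo) & (y_prob <= hi)
        if in_bin.any():
            acc = y_true[in_bin].mean()        # fraction of positives in the bin
            conf = y_prob[in_bin].mean()       # mean predicted probability
            ece += in_bin.mean() * abs(acc - conf)
    return ece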
When does calibration matter? Calibration is critical whenever predicted probabilities are used directly for decision-making: setting insurance premiums, computing expected costs, combining predictions from multiple models, or communicating risk to humans. If you only use predictions for ranking (e.g., which email is most likely spam), calibration is less important, and AUC-ROC is a better metric.
8.11.3 Learning Curves: Diagnosing Model Behavior
While we introduced learning curves briefly in Section 8.6.2, they deserve deeper treatment as a diagnostic tool. A learning curve plots model performance as a function of the number of training examples, and its shape reveals specific information about what is limiting your model.
The anatomy of a learning curve: As the training set grows:
- Training error typically increases (the model finds it harder to perfectly fit more data).
- Validation error typically decreases (the model generalizes better with more data).
- The gap between them narrows as training set size grows.
Interpreting common patterns:
Pattern 1: Both curves plateau at a high error. This signals high bias (underfitting). The model lacks the capacity to capture the underlying pattern. Adding more data will not help -- you need a more complex model, additional features, or less regularization.
Pattern 2: Large gap between training and validation curves. This signals high variance (overfitting). The model fits the training data well but fails to generalize. More data is the most reliable cure. Alternatively, increase regularization, reduce model complexity, or use ensemble methods.
Pattern 3: Both curves converge to a low error. This is the ideal scenario. The model has sufficient capacity and enough data to generalize well.
Pattern 4: Validation error decreases but has not yet plateaued. Collecting more data is likely to improve performance. This is a strong signal to invest in data acquisition rather than algorithm tuning.
from sklearn.model_selection import learning_curve
import numpy as np
def comprehensive_learning_curve(
estimator,
X: np.ndarray,
y: np.ndarray,
cv: int = 5,
n_points: int = 10,
scoring: str = "accuracy",
) -> dict:
"""Compute and analyze a learning curve.
Args:
estimator: Scikit-learn estimator.
X: Feature matrix.
y: Target vector.
cv: Number of cross-validation folds.
n_points: Number of training set sizes to evaluate.
scoring: Scoring metric to use.
Returns:
Dictionary with learning curve data and diagnostic message.
"""
train_sizes, train_scores, val_scores = learning_curve(
estimator, X, y, cv=cv,
train_sizes=np.linspace(0.1, 1.0, n_points),
scoring=scoring, n_jobs=-1,
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean[-1] - val_mean[-1]
# Simple diagnostic
if val_mean[-1] < 0.7 and gap < 0.05:
diagnosis = "High bias: consider a more complex model."
elif gap > 0.15:
diagnosis = "High variance: consider more data or regularization."
elif val_mean[-1] - val_mean[-3] > 0.02:
diagnosis = "Still improving: more data may help."
else:
diagnosis = "Good fit: model appears well-tuned."
return {
"train_sizes": train_sizes,
"train_mean": train_mean,
"val_mean": val_mean,
"gap": gap,
"diagnosis": diagnosis,
}
8.11.4 Fairness Metrics
As AI systems increasingly affect people's lives, evaluating models for fairness is becoming essential. Key fairness metrics include:
- Demographic parity: $P(\hat{y} = 1 | A = a) = P(\hat{y} = 1 | A = b)$ for protected groups $a$ and $b$.
- Equal opportunity: $P(\hat{y} = 1 | y = 1, A = a) = P(\hat{y} = 1 | y = 1, A = b)$ (equal recall across groups).
- Equalized odds: Both TPR and FPR are equal across groups.
These metrics often conflict with each other and with overall accuracy -- a reflection of the philosophical and mathematical impossibility results proven by Chouldechova (2017) and Kleinberg et al. (2017). We will explore fairness in depth in Chapter 20.
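As a minimal sketch, the group-level rates behind these definitions can be computed directly from binary predictions and a protected-attribute array (the function and variable names are illustrative):
import numpy as np

def group_rates(y_true: np.ndarray, y_pred: np.ndarray, group: np.ndarray) -> dict:
    """Selection rate, TPR, and FPR per protected group (y_pred is 0/1)."""
    rates = {}
    for g in np.unique(group):
        members = group == g
        positives = members & (y_true == 1)
        negatives = members & (y_true == 0)
        rates[str(g)] = {
            "selection_rate": y_pred[members].mean(),                        # demographic parity
            "tpr": y_pred[positives].mean() if positives.any() else np.nan,  # equal opportunity
            "fpr": y_pred[negatives].mean() if negatives.any() else np.nan,  # equalized odds (with TPR)
        }
    return rates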
8.11.5 Evaluation in Production
Model evaluation does not stop at deployment. Offline metrics computed during development provide an estimate of performance, but the true test is how the model behaves on live data, with real users, under production conditions. Online evaluation bridges this gap.
A/B Testing for Machine Learning
A/B testing is the gold standard for measuring the real-world impact of a new model. The protocol is:
- Define the hypothesis: "The new model will increase click-through rate by at least 2%."
- Determine sample size: Use a power analysis to compute how many users you need to detect the expected effect with statistical significance (typically $\alpha = 0.05$, power $= 0.80$); a sketch of this calculation follows the list.
- Randomize: Randomly assign users (or requests) to the control group (old model) or treatment group (new model). Ensure randomization is at the right granularity -- per-user for personalization models, per-session for session-based recommendations.
- Run the experiment: Collect data for the predetermined duration. Do not peek at results and stop early unless you are using a sequential testing framework.
- Analyze: Compute the difference in the primary metric and its confidence interval. Use a two-sample t-test or Mann-Whitney U test. Account for multiple comparisons if testing several metrics.
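For the sample-size step, the required number of users per group can be estimated before launch. Here is a minimal sketch using statsmodels, assuming an illustrative baseline click-through rate of 10% and a hoped-for rate of 10.2% (both numbers are placeholders).

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Convert the two proportions into a standardized effect size (Cohen's h).
effect = proportion_effectsize(0.102, 0.100)
# Users needed per group to detect that effect at alpha = 0.05 with power = 0.80.
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.80)
print(round(n_per_group))

Once the experiment has run, the analysis step can be carried out with a function like the following: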
from scipy import stats
import numpy as np
def ab_test_analysis(
control_metric: np.ndarray,
treatment_metric: np.ndarray,
alpha: float = 0.05,
) -> dict:
"""Analyze an A/B test using a two-sample t-test.
Args:
control_metric: Metric values for the control group.
treatment_metric: Metric values for the treatment group.
alpha: Significance level.
Returns:
Dictionary with test results and effect size.
"""
    # Welch's two-sample t-test (does not assume equal variances between groups).
    t_stat, p_value = stats.ttest_ind(
        treatment_metric, control_metric, equal_var=False
    )
    control_mean = np.mean(control_metric)
    treatment_mean = np.mean(treatment_metric)
    # Relative lift of the treatment over the control baseline.
    lift = (treatment_mean - control_mean) / control_mean
    # Cohen's d with a pooled standard deviation (sample variances, ddof=1).
    pooled_std = np.sqrt(
        (np.var(control_metric, ddof=1) + np.var(treatment_metric, ddof=1)) / 2
    )
    cohens_d = (treatment_mean - control_mean) / pooled_std
return {
"control_mean": control_mean,
"treatment_mean": treatment_mean,
"lift_percent": lift * 100,
"t_statistic": t_stat,
"p_value": p_value,
"significant": p_value < alpha,
"cohens_d": cohens_d,
}
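A quick usage illustration with simulated per-user conversion indicators (the conversion rates and sample sizes are arbitrary):

rng = np.random.default_rng(42)
control = rng.binomial(1, 0.10, size=20_000)    # 10% baseline conversion
treatment = rng.binomial(1, 0.11, size=20_000)  # 11% conversion under the new model
print(ab_test_analysis(control, treatment))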
Common pitfalls in ML A/B testing:
- Novelty effects: Users may interact differently with a new system simply because it is new. Run experiments long enough for the novelty to wear off.
- Interference between groups: In social networks or marketplaces, treating one user can affect outcomes for others (spillover effects). Use cluster-randomized designs to mitigate.
- Metric selection: Offline metrics (accuracy, AUC) may not correlate with business metrics (revenue, retention). Choose primary metrics that reflect actual business value.
Production Monitoring Metrics
Once deployed, models can degrade due to data drift (the distribution of inputs changes), concept drift (the relationship between inputs and outputs changes), or infrastructure issues. Monitoring should track:
- Prediction distributions: Plot histograms of predicted probabilities or values over time. A shift in the distribution may indicate drift.
- Feature distributions: Monitor the mean, variance, and quantiles of key input features. Statistical tests like the Kolmogorov-Smirnov test can detect distributional changes (a sketch follows this list).
- Latency and throughput: Ensure the model meets service-level agreements (SLAs) for response time.
- Label feedback loops: When ground truth becomes available (e.g., whether a flagged transaction was actually fraud), compute live metrics and compare to offline estimates.
- Slice-based metrics: Overall accuracy may be stable while performance degrades for a specific subpopulation. Monitor metrics broken down by key segments (geography, device type, user cohort).
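As an example of the feature-distribution check above, here is a minimal drift-detection sketch using scipy's two-sample Kolmogorov-Smirnov test; the function name and the alpha threshold are illustrative choices, and `reference` and `current` are assumed to hold one feature's values from the training period and from recent production traffic.

import numpy as np
from scipy import stats

def feature_drift_check(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> dict:
    """Flag a feature as drifted if a two-sample KS test rejects equality of distributions."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return {
        "ks_statistic": statistic,
        "p_value": p_value,
        "drift_detected": p_value < alpha,
    }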
Shadow Mode and Canary Deployments
- Shadow mode: Run the new model in parallel without serving its predictions; compare with the production model. This is risk-free but requires infrastructure to run two models simultaneously.
- Canary deployment: Gradually roll out the new model to a small fraction of traffic (e.g., 1%, then 5%, then 25%) and monitor for regressions. This provides real-world feedback while limiting blast radius.
- Multi-armed bandits: For scenarios where the cost of exploration is acceptable, bandit algorithms adaptively allocate traffic to the better-performing model, converging faster than fixed-allocation A/B tests (a minimal sketch follows).
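The bandit idea can be sketched with Thompson sampling for a binary reward such as a click. This is a simplified two-model example; the Beta priors, class name, and reward definition are assumptions for illustration, not a prescribed implementation.

import numpy as np

class TwoModelThompsonSampler:
    """Route each request to one of two models, favoring the one with higher observed reward."""

    def __init__(self) -> None:
        # Beta(1, 1) prior on each model's success rate.
        self.successes = np.ones(2)
        self.failures = np.ones(2)

    def choose_model(self, rng: np.random.Generator) -> int:
        # Sample a plausible success rate for each model and serve the larger one.
        samples = rng.beta(self.successes, self.failures)
        return int(np.argmax(samples))

    def record(self, model: int, reward: int) -> None:
        # reward is 1 for a success (e.g., a click), 0 otherwise.
        self.successes[model] += reward
        self.failures[model] += 1 - reward

Each incoming request calls choose_model, and the observed outcome is fed back via record; over time traffic concentrates on the stronger model while still exploring occasionally.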
These production evaluation strategies connect to the MLOps practices discussed in Part V.
8.11.6 Common Evaluation Pitfalls
Beyond the pitfalls discussed in Section 8.10, several subtle issues can undermine your evaluation:
Pitfall: Training on the full pipeline but evaluating only the model. If your pipeline includes feature selection, imputation, or encoding, these steps must be included in the cross-validation loop. Fitting a feature selector on all data and then cross-validating the model alone leaks information and inflates performance estimates.
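A minimal sketch of the leakage-free version uses a scikit-learn Pipeline so that scaling and feature selection are refit inside every fold; the particular preprocessing steps are illustrative, and `X` and `y` are assumed to exist.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Each fold fits the scaler and selector on its own training portion only.
scores = cross_val_score(pipe, X, y, cv=5)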
Pitfall: Averaging metrics that should not be averaged. Macro-averaged precision across 10 classes can hide the fact that precision is 0% for the rarest class. Always inspect per-class metrics alongside aggregates.
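For instance, scikit-learn reports per-class precision directly (assuming `y_true` and `y_pred` come from a multi-class problem):

from sklearn.metrics import precision_score, classification_report

per_class_precision = precision_score(y_true, y_pred, average=None)  # one value per class
print(classification_report(y_true, y_pred))  # per-class precision, recall, and F1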
Pitfall: Using accuracy on a regression problem. This sounds absurd, but it happens when practitioners discretize predictions and labels into bins. Use proper regression metrics (RMSE, MAE, $R^2$) for regression tasks.
Pitfall: Ignoring confidence intervals. A model with 95.2% accuracy and a 95% CI of [92.1%, 98.3%] is very different from one with the same point estimate and a CI of [94.8%, 95.6%]. The first has high uncertainty; the second is precisely estimated. Always report intervals, especially for small test sets.
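One simple way to obtain such an interval is a percentile bootstrap over the test set, sketched below; `y_true` and `y_pred` are assumed to be the test-set labels and predictions, and the function name is ours.

import numpy as np

def bootstrap_accuracy_ci(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for test-set accuracy."""
    rng = np.random.default_rng(seed)
    correct = (y_true == y_pred).astype(float)
    n = len(correct)
    # Resample the per-example correctness indicators with replacement.
    accs = np.array([correct[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)])
    return (
        float(np.quantile(accs, alpha / 2)),
        float(np.quantile(accs, 1 - alpha / 2)),
    )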
Pitfall: Evaluating on a non-representative test set. If your test set does not reflect the distribution of data the model will encounter in production, offline evaluation is meaningless. This is particularly common with time-dependent data, where training on old data and testing on recent data is essential.
8.12 Summary
This chapter covered the essential toolkit for model evaluation, selection, and validation:
- Data splitting (train/validation/test) provides the foundation for honest evaluation. The test set must remain untouched until the final evaluation.
- Cross-validation (k-fold, stratified, time series, nested) provides more robust performance estimates than single splits, especially for small datasets.
- Classification metrics (accuracy, precision, recall, F1, ROC-AUC, PR-AUC) each encode different values about what constitutes "good" performance. The choice of metric is a domain decision, not a technical one.
- Regression metrics (MSE, RMSE, MAE, $R^2$) quantify different aspects of prediction quality.
- The bias-variance tradeoff provides a theoretical framework for understanding underfitting and overfitting, diagnosable through learning curves and validation curves.
- Hyperparameter tuning (grid search, random search, Bayesian optimization) systematically optimizes model configuration. Random search is usually the best starting point.
- Statistical tests (McNemar's, paired t-test, Wilcoxon) provide the rigor needed to distinguish genuine model differences from random fluctuation.
In the next chapter, we will explore feature engineering and selection (Chapter 9), where we will see how the evaluation techniques from this chapter guide the process of crafting informative features.
References
- Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281-305.
- Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153-163.
- Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1923.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
- Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. Proceedings of Innovations in Theoretical Computer Science (ITCS).
- Nakkiran, P., Kaplun, G., Bansal, Y., et al. (2021). Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment.
- Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808.