Appendix F: Evaluation Metrics Reference

Every evaluation metric discussed in this book, in one place. For each metric: the formula, what it actually measures, when to use it, when not to use it, and the scikit-learn function that computes it.

This appendix is organized by task type: classification, regression, clustering, ranking, time series, and fairness. Cross-reference with Chapter 16 (Model Evaluation Deep Dive) and Chapter 33 (Fairness and Responsible ML) for worked examples.


Classification Metrics

Accuracy

  • Formula: (TP + TN) / (TP + TN + FP + FN)
  • Intuition: The proportion of predictions that are correct. The metric everyone learns first and should stop using as a default.
  • When to use: Balanced classes where all errors cost the same.
  • When NOT to use: Imbalanced classes. A model that predicts "not fraud" for every transaction achieves 99.8% accuracy on credit card data and catches zero fraud.
  • scikit-learn: sklearn.metrics.accuracy_score(y_true, y_pred)
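
The always-predict-negative failure mode described above is easy to reproduce; a minimal sketch with synthetic labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy fraud data: 2 fraud cases in 1000 transactions.
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1
y_pred = np.zeros(1000, dtype=int)  # a "model" that always predicts not-fraud

acc = accuracy_score(y_true, y_pred)  # 0.998 — looks great, catches zero fraud
```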

Precision

  • Formula: TP / (TP + FP)
  • Intuition: Of all the items the model flagged as positive, what fraction actually were positive? Measures the cost of false alarms.
  • When to use: When false positives are expensive. Spam filters (marking a real email as spam loses trust). Churn retention offers (each offer costs money — you want to target actual churners).
  • When NOT to use: When you care more about catching every positive case than about avoiding false alarms.
  • scikit-learn: sklearn.metrics.precision_score(y_true, y_pred)

Recall (Sensitivity, True Positive Rate)

  • Formula: TP / (TP + FN)
  • Intuition: Of all the items that actually were positive, what fraction did the model catch? Measures the cost of misses.
  • When to use: When false negatives are dangerous. Cancer screening (missing a tumor is worse than ordering an extra biopsy). Fraud detection (missing fraud is worse than flagging a legitimate transaction).
  • When NOT to use: When false positives are expensive and you cannot afford to flag too many items for review.
  • scikit-learn: sklearn.metrics.recall_score(y_true, y_pred)

F1 Score

  • Formula: 2 * (precision * recall) / (precision + recall)
  • Intuition: The harmonic mean of precision and recall. Balances false positives and false negatives equally.
  • When to use: When you need a single number that balances precision and recall, and you value them equally. Common default for imbalanced classification.
  • When NOT to use: When precision and recall have different business costs. In that case, use F-beta or optimize the threshold directly based on the cost matrix.
  • scikit-learn: sklearn.metrics.f1_score(y_true, y_pred)
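
A quick sketch of how the three scores relate on one toy confusion matrix (the label arrays are invented for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy confusion: 10 actual positives; the model flags 8 items, 6 correctly.
y_true = [1]*10 + [0]*10
y_pred = [1]*6 + [0]*4 + [1]*2 + [0]*8

p = precision_score(y_true, y_pred)   # 6/8  = 0.75
r = recall_score(y_true, y_pred)      # 6/10 = 0.60
f1 = f1_score(y_true, y_pred)         # harmonic mean = 2/3
```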

F-beta Score

  • Formula: (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
  • Intuition: Generalization of F1. Beta > 1 weights recall higher; beta < 1 weights precision higher. F2 cares twice as much about recall as precision.
  • When to use: When you can quantify the relative cost of false negatives vs. false positives. F2 for medical screening (catching cases matters more). F0.5 for spam filtering (false positives matter more).
  • When NOT to use: When you cannot articulate why one type of error is worse than the other.
  • scikit-learn: sklearn.metrics.fbeta_score(y_true, y_pred, beta=2.0)
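
A small sketch of how beta shifts the score. The toy labels below are constructed so precision (0.75) exceeds recall (0.60), which makes the ordering F0.5 > F1 > F2 visible:

```python
from sklearn.metrics import fbeta_score

# Toy confusion: 6 TP, 2 FP, 4 FN -> precision 0.75, recall 0.60.
y_true = [1]*10 + [0]*10
y_pred = [1]*6 + [0]*4 + [1]*2 + [0]*8

f2 = fbeta_score(y_true, y_pred, beta=2.0)    # 0.625  — pulled toward recall
f05 = fbeta_score(y_true, y_pred, beta=0.5)   # ~0.714 — pulled toward precision
```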

ROC AUC (Area Under the Receiver Operating Characteristic Curve)

  • Formula: Area under the curve plotting TPR (recall) vs. FPR (1 - specificity) at all classification thresholds.
  • Intuition: The probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. Threshold-independent.
  • When to use: Comparing models when you have not yet chosen an operating threshold. Good for model selection during development.
  • When NOT to use: Severely imbalanced datasets. ROC AUC can look misleadingly high because the false positive rate is diluted by the huge number of true negatives that dominate imbalanced data. Use PR AUC instead.
  • scikit-learn: sklearn.metrics.roc_auc_score(y_true, y_score)
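
Note that ROC AUC takes scores (probabilities or decision values), not hard labels. A minimal sketch with toy scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # model scores, not predicted labels

auc = roc_auc_score(y_true, y_score)
# 0.75: of the 4 positive/negative pairs, 3 are ranked correctly
```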

PR AUC (Area Under the Precision-Recall Curve)

  • Formula: Area under the curve plotting precision vs. recall at all classification thresholds.
  • Intuition: Summarizes the tradeoff between precision and recall without being inflated by a large number of true negatives. More informative than ROC AUC for imbalanced problems.
  • When to use: Imbalanced classification. Fraud detection, rare disease diagnosis, churn prediction with low churn rates.
  • When NOT to use: Balanced classes where ROC AUC is fine.
  • scikit-learn: sklearn.metrics.average_precision_score(y_true, y_score)
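
The two summaries can disagree. In the toy scores below, one positive is ranked first and the other last: average precision still rewards the top hit, while ROC AUC sees coin-flip pairwise ranking:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 1, 0, 1]
y_score = [0.6, 0.9, 0.5, 0.4]

ap = average_precision_score(y_true, y_score)   # 0.75
auc = roc_auc_score(y_true, y_score)            # 0.50
```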

Log Loss (Binary Cross-Entropy)

  • Formula: -1/N * sum(y * log(p) + (1-y) * log(1-p))
  • Intuition: Measures how well the predicted probabilities match the actual outcomes. Penalizes confident wrong predictions severely. A model that says "99% positive" for a negative example gets crushed.
  • When to use: When you care about calibrated probabilities, not just rankings. Risk scoring, insurance pricing, clinical decision support where the probability itself is the output.
  • When NOT to use: When you only care about the final class label, not the probability. Also difficult to interpret for non-technical stakeholders.
  • scikit-learn: sklearn.metrics.log_loss(y_true, y_pred_proba)
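
The "confidently wrong gets crushed" behavior in two lines (toy probabilities; log_loss accepts the predicted probability of the positive class):

```python
from sklearn.metrics import log_loss

y_true = [1, 0]
ll_ok = log_loss(y_true, [0.8, 0.3])    # reasonable probabilities: ~0.29
ll_bad = log_loss(y_true, [0.8, 0.99])  # "99% positive" on a negative: ~2.41
```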

Cohen's Kappa

  • Formula: (observed_accuracy - expected_accuracy) / (1 - expected_accuracy)
  • Intuition: Accuracy adjusted for chance agreement. Kappa = 1 means perfect agreement, kappa = 0 means no better than random, kappa < 0 means worse than random.
  • When to use: When evaluating against a majority-class baseline. Multiclass problems where class distribution is skewed. Inter-rater agreement comparisons.
  • When NOT to use: Not widely used in production ML. More common in medical and social science research. F1 or balanced accuracy are more interpretable for engineering teams.
  • scikit-learn: sklearn.metrics.cohen_kappa_score(y_true, y_pred)

Matthews Correlation Coefficient (MCC)

  • Formula: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
  • Intuition: A correlation coefficient between observed and predicted classes. Ranges from -1 (perfect inverse) to +1 (perfect). Regarded as a balanced metric even for imbalanced classes because it uses all four quadrants of the confusion matrix.
  • When to use: When you want a single metric that handles class imbalance without requiring you to choose between precision and recall.
  • When NOT to use: Less commonly used in industry. Stakeholders may not understand it. Often better to show precision, recall, and F1 alongside it.
  • scikit-learn: sklearn.metrics.matthews_corrcoef(y_true, y_pred)

Specificity (True Negative Rate)

  • Formula: TN / (TN + FP)
  • Intuition: Of all the actual negatives, what fraction did the model correctly identify as negative? The recall of the negative class.
  • When to use: Medical testing (specificity of a diagnostic test). Paired with sensitivity (recall) to describe test performance.
  • When NOT to use: Imbalanced data where negatives overwhelm positives — specificity will be trivially high.
  • scikit-learn: No dedicated function. Compute from the confusion matrix: tn / (tn + fp) using sklearn.metrics.confusion_matrix.
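
Since there is no dedicated function, the computation from the confusion matrix looks like this (toy labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# With labels=[0, 1], ravel() yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
specificity = tn / (tn + fp)  # 3 / (3 + 1) = 0.75
```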

Balanced Accuracy

  • Formula: (sensitivity + specificity) / 2 in the binary case; in general, macro-averaged recall.
  • Intuition: Average of recall across all classes. Each class contributes equally regardless of size.
  • When to use: Multiclass problems with imbalanced classes where you want each class weighted equally.
  • When NOT to use: When class costs are not equal — a rare class may actually be less important, not more.
  • scikit-learn: sklearn.metrics.balanced_accuracy_score(y_true, y_pred)

Regression Metrics

Mean Absolute Error (MAE)

  • Formula: (1/N) * sum(|y_true - y_pred|)
  • Intuition: Average absolute distance between predictions and actual values, in the same units as the target. Robust to outliers because it does not square the errors.
  • When to use: When you want an interpretable error in the original units and outliers should not dominate the metric.
  • When NOT to use: When large errors should be penalized disproportionately (use MSE/RMSE instead).
  • scikit-learn: sklearn.metrics.mean_absolute_error(y_true, y_pred)

Mean Squared Error (MSE)

  • Formula: (1/N) * sum((y_true - y_pred)^2)
  • Intuition: Average squared distance. Penalizes large errors more than small ones because of the squaring.
  • When to use: When large errors are disproportionately bad. Predicting house prices — under squared error, being off by $100K counts four times as much as being off by $50K. Also the default loss for most regression algorithms.
  • When NOT to use: When you need an interpretable number in the original units (use RMSE). When outliers in the target should not dominate the metric (use MAE).
  • scikit-learn: sklearn.metrics.mean_squared_error(y_true, y_pred)

Root Mean Squared Error (RMSE)

  • Formula: sqrt((1/N) * sum((y_true - y_pred)^2))
  • Intuition: Square root of MSE, returning the error to the original units. "On average, the prediction is off by X."
  • When to use: Same situations as MSE, but when you want to communicate the result in interpretable units.
  • When NOT to use: When outlier robustness matters.
  • scikit-learn: sklearn.metrics.root_mean_squared_error(y_true, y_pred) (scikit-learn >= 1.4), or np.sqrt(mean_squared_error(y_true, y_pred))
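
Because root_mean_squared_error only exists in scikit-learn 1.4+, a version-tolerant helper is worth having; a small sketch (the rmse function name is our own, not a library API):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """RMSE that works on any scikit-learn version."""
    try:
        from sklearn.metrics import root_mean_squared_error
        return float(root_mean_squared_error(y_true, y_pred))
    except ImportError:  # scikit-learn < 1.4
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

error = rmse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0])  # sqrt(mean of 0.25, 0, 2.25)
```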

R-squared (Coefficient of Determination)

  • Formula: 1 - (SS_res / SS_tot) where SS_res = sum((y_true - y_pred)^2) and SS_tot = sum((y_true - y_mean)^2)
  • Intuition: Proportion of variance in the target explained by the model. R^2 = 0.85 means the model explains 85% of the variance. R^2 = 0 means no better than predicting the mean.
  • When to use: Comparing models on the same dataset. Quick sanity check — if R^2 is negative, the model is worse than predicting the mean.
  • When NOT to use: Comparing across different datasets. R^2 depends on the inherent variance of the target — a model predicting stock prices might have R^2 = 0.02 and still be useful, while R^2 = 0.7 on housing data might be mediocre.
  • scikit-learn: sklearn.metrics.r2_score(y_true, y_pred)

Mean Absolute Percentage Error (MAPE)

  • Formula: (1/N) * sum(|y_true - y_pred| / |y_true|) * 100
  • Intuition: Average error as a percentage of the actual value. "The prediction is off by X% on average."
  • When to use: When stakeholders think in percentages. Sales forecasting, demand prediction.
  • When NOT to use: When actual values can be zero or near-zero (causes division by zero or explosion). When the scale of the target varies widely.
  • scikit-learn: sklearn.metrics.mean_absolute_percentage_error(y_true, y_pred)

Median Absolute Error

  • Formula: median(|y_true - y_pred|)
  • Intuition: The median (not mean) of absolute errors. Completely robust to outliers.
  • When to use: When your target has extreme outliers and you want a metric that reflects typical prediction quality.
  • When NOT to use: When you need differentiable loss for optimization. When extreme errors are actually important.
  • scikit-learn: sklearn.metrics.median_absolute_error(y_true, y_pred)

Clustering Metrics

Silhouette Score

  • Formula: For each point: (b - a) / max(a, b) where a = mean intra-cluster distance, b = mean nearest-cluster distance. Average over all points.
  • Intuition: Ranges from -1 to +1. High values mean points are close to their own cluster and far from others. Negative values mean the point is probably in the wrong cluster.
  • When to use: When you have no ground truth labels and need to evaluate cluster quality. Comparing different values of k.
  • When NOT to use: Non-convex or irregularly shaped clusters. Silhouette favors convex, similarly sized clusters and will undervalue DBSCAN-style results.
  • scikit-learn: sklearn.metrics.silhouette_score(X, labels)
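
A typical model-selection loop on synthetic data (the blob centers are arbitrary choices for illustration); the score should peak at the true number of clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs.
X, _ = make_blobs(n_samples=400,
                  centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
# scores is maximized at k=4, the true number of clusters
```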

Adjusted Rand Index (ARI)

  • Formula: Adjusted-for-chance version of the Rand index, measuring agreement between predicted clusters and ground truth labels.
  • Intuition: Ranges from roughly -0.5 to 1.0. ARI = 1 means perfect agreement with ground truth; ARI near 0 means no better than random assignment.
  • When to use: When you have ground truth cluster labels and want to evaluate how well the algorithm recovered them.
  • When NOT to use: When there is no ground truth (the usual case in production clustering). Use silhouette score instead.
  • scikit-learn: sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)

Inertia (Within-Cluster Sum of Squares)

  • Formula: sum over all clusters: sum of squared distances from each point to its centroid
  • Intuition: Total compactness of clusters. Lower is better, but always decreases as k increases. The "elbow" in the inertia-vs-k plot suggests a good k.
  • When to use: K-means model selection via elbow method. Quick sanity check during iteration.
  • When NOT to use: As a standalone metric — it always improves with more clusters. Must be paired with silhouette or domain judgment.
  • scikit-learn: kmeans_model.inertia_ (attribute, not a standalone function)

Davies-Bouldin Index

  • Formula: For each cluster, take the worst-case (maximum) ratio of combined within-cluster scatter to between-centroid distance over all other clusters; average these maxima across clusters.
  • Intuition: Lower is better. Measures how compact and well-separated clusters are. Zero is the theoretical best.
  • When to use: Alternative to silhouette score for model selection. Computationally cheaper than silhouette for large datasets.
  • When NOT to use: Same limitations as silhouette — favors convex, similarly-sized clusters.
  • scikit-learn: sklearn.metrics.davies_bouldin_score(X, labels)

Ranking Metrics

NDCG (Normalized Discounted Cumulative Gain)

  • Formula: DCG@k / IDCG@k where DCG@k = sum(rel_i / log2(i+1)) for positions 1 to k, and IDCG is the DCG of the ideal ranking.
  • Intuition: Measures ranking quality with diminishing returns for items lower in the list. Getting the top result right matters more than getting result #10 right.
  • When to use: Recommender systems (Chapter 24), search engines. When position in the ranked list matters.
  • When NOT to use: Binary relevance tasks where you just need hit/miss. Use precision@k instead.
  • scikit-learn: sklearn.metrics.ndcg_score(y_true, y_score, k=10)
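
Note the 2D input shape — one row per query. A minimal sketch with invented graded relevance:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query: graded relevance for five documents, plus the model's scores.
true_relevance = np.asarray([[3, 2, 3, 0, 1]])
model_scores = np.asarray([[0.9, 0.8, 0.1, 0.2, 0.7]])

ndcg = ndcg_score(true_relevance, model_scores, k=3)
perfect = ndcg_score(true_relevance, true_relevance, k=3)  # ideal ranking: 1.0
```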

Precision@k and Recall@k

  • Formula: Precision@k = relevant items in top k / k. Recall@k = relevant items in top k / total relevant items.
  • Intuition: Of the top k recommendations, how many were relevant? (Precision@k.) Of all relevant items, how many appeared in the top k? (Recall@k.)
  • When to use: Recommender systems, information retrieval. When you present a fixed number of results to the user.
  • When NOT to use: When the length of the recommendation list varies.
  • scikit-learn: No built-in function. Implement directly or use libraries like recmetrics or surprise.
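
A minimal hand-rolled version — precision_recall_at_k and its arguments are hypothetical names, not a library API:

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """ranked_items: the model's ranking, best first.
    relevant_items: set of ground-truth relevant item ids."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# 2 of the top-3 recommendations are relevant; 2 of 4 relevant items surfaced.
p_at_3, r_at_3 = precision_recall_at_k(["a", "b", "c", "d"],
                                       {"a", "c", "e", "f"}, k=3)
# p_at_3 = 2/3, r_at_3 = 0.5
```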

Mean Average Precision (MAP)

  • Formula: Mean of average precision across all queries, where average precision is the mean of precision@k at each relevant position.
  • Intuition: Summarizes precision across all recall levels. Rewards models that place relevant items at the top.
  • When to use: Information retrieval systems evaluated across multiple queries.
  • When NOT to use: Single-query evaluation (use AP). Non-binary relevance (use NDCG).
  • scikit-learn: sklearn.metrics.average_precision_score for binary classification. For ranking, use custom implementation or trec_eval.

Time Series Metrics

RMSE, MAE, MAPE

The same regression metrics apply to time series forecasting. See the Regression Metrics section above. The key difference: compute these only on the test set (future data), never on the training period.

Mean Absolute Scaled Error (MASE)

  • Formula: MAE / MAE_of_naive_forecast where the naive forecast is the previous period's value.
  • Intuition: Error relative to a naive (persistence) baseline. MASE < 1 means the model beats the naive forecast. MASE > 1 means you would have been better off predicting "same as last time."
  • When to use: Time series comparison across different scales. A MASE of 0.7 means the same thing whether you are predicting daily temperatures or monthly revenue.
  • When NOT to use: When the naive forecast is not appropriate (e.g., data with no temporal autocorrelation).
  • scikit-learn: No built-in function. Implement as: mae / mean(|y[t] - y[t-1]|).
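
A sketch of that one-liner as a function, following the common convention of scaling by the in-sample naive MAE of the training series (the mase name and signature here are hypothetical, not a library API):

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """Test-set MAE divided by the MAE of the naive one-step
    (previous-value) forecast on the training series."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    mae = np.mean(np.abs(y_true - y_pred))
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return mae / naive_mae

score = mase(y_true=[14.0, 15.0], y_pred=[13.5, 15.5],
             y_train=[10.0, 12.0, 11.0, 13.0])
# 0.5 / (5/3) = 0.3 — comfortably beats the naive forecast
```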

Weighted Absolute Percentage Error (WAPE)

  • Formula: sum(|y_true - y_pred|) / sum(|y_true|)
  • Intuition: Like MAPE but weighted by actual values, avoiding the explosion when actuals are near zero.
  • When to use: Demand forecasting, sales forecasting. Standard metric in supply chain and retail.
  • When NOT to use: When you need per-observation error analysis rather than aggregate.
  • scikit-learn: No built-in function. One line: np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)).

Fairness Metrics

Cross-reference: Chapter 33. These metrics compare model performance across protected groups (e.g., race, gender, age).

Demographic Parity (Statistical Parity)

  • Formula: P(Y_pred = 1 | group = A) = P(Y_pred = 1 | group = B)
  • Intuition: The model gives positive predictions at the same rate across groups. Does not consider whether the predictions are correct.
  • When to use: Hiring, lending, advertising — contexts where equal treatment is required by policy or law.
  • When NOT to use: When base rates genuinely differ between groups and you want equal accuracy, not equal rates.
  • scikit-learn: No built-in. Use aif360.metrics.BinaryLabelDatasetMetric or compute directly from grouped predictions.
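
Computing it directly from grouped predictions is a few lines (the predictions and group labels below are made up for illustration):

```python
import numpy as np

# Hypothetical predictions and protected-group membership.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 0, 0])
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

rate_a = y_pred[group == "A"].mean()  # 3/5 = 0.6
rate_b = y_pred[group == "B"].mean()  # 1/5 = 0.2
parity_gap = rate_a - rate_b          # 0.4 — far from demographic parity
impact_ratio = rate_b / rate_a        # ~0.33 — fails the four-fifths rule
```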

Equalized Odds

  • Formula: P(Y_pred = 1 | Y_true = y, group = A) = P(Y_pred = 1 | Y_true = y, group = B) for y in {0, 1}
  • Intuition: Equal true positive rates AND equal false positive rates across groups. The model makes the same types of mistakes for everyone.
  • When to use: Criminal justice risk assessment, medical diagnosis — contexts where error rates should not differ by group.
  • When NOT to use: When equal outcome rates matter more than equal error rates.
  • scikit-learn: No built-in. Use fairlearn.metrics.equalized_odds_difference or compute TPR and FPR per group from the confusion matrix.
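
The per-group computation is straightforward (toy data; equalized odds holds when both gaps are near zero — here they are not):

```python
from sklearn.metrics import confusion_matrix

def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate from the confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels and predictions for two groups.
tpr_a, fpr_a = tpr_fpr([1, 1, 0, 0], [1, 0, 0, 0])  # TPR 0.5, FPR 0.0
tpr_b, fpr_b = tpr_fpr([1, 1, 0, 0], [1, 1, 1, 0])  # TPR 1.0, FPR 0.5

tpr_gap = abs(tpr_a - tpr_b)  # 0.5
fpr_gap = abs(fpr_a - fpr_b)  # 0.5
```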

Equal Opportunity

  • Formula: P(Y_pred = 1 | Y_true = 1, group = A) = P(Y_pred = 1 | Y_true = 1, group = B)
  • Intuition: Equal true positive rates across groups. A relaxation of equalized odds that only requires equal catch rates for the positive class.
  • When to use: When missing a positive case is the critical error and that miss rate should not differ by group.
  • When NOT to use: When false positive rates also matter (use full equalized odds).
  • scikit-learn: No built-in. Compute recall per group.

Predictive Parity

  • Formula: P(Y_true = 1 | Y_pred = 1, group = A) = P(Y_true = 1 | Y_pred = 1, group = B)
  • Intuition: Equal precision across groups. When the model says "positive," the probability of actually being positive is the same for everyone.
  • When to use: When the model's predictions are used to allocate resources (e.g., follow-up calls, interventions) and you want equal resource efficiency across groups.
  • When NOT to use: The impossibility theorem (Chouldechova, 2017) shows you generally cannot have predictive parity AND equalized odds when base rates differ. Choose one.
  • scikit-learn: No built-in. Compute precision per group.

Disparate Impact Ratio

  • Formula: P(Y_pred = 1 | group = A) / P(Y_pred = 1 | group = B) where group B is the advantaged group.
  • Intuition: Ratio of positive prediction rates. The "four-fifths rule" in US employment law states this ratio should be at least 0.8.
  • When to use: Legal compliance in hiring and lending contexts.
  • When NOT to use: As a standalone fairness metric — it does not account for whether predictions are correct.
  • scikit-learn: No built-in. Use aif360.metrics.BinaryLabelDatasetMetric.disparate_impact() or compute from grouped prediction rates.

Quick Reference Table

Metric       Task            Handles Imbalance   Threshold-Free   scikit-learn
Accuracy     Classification  No                  No               accuracy_score
Precision    Classification  Yes                 No               precision_score
Recall       Classification  Yes                 No               recall_score
F1           Classification  Yes                 No               f1_score
ROC AUC      Classification  Somewhat            Yes              roc_auc_score
PR AUC       Classification  Yes                 Yes              average_precision_score
Log Loss     Classification  No                  Yes              log_loss
MCC          Classification  Yes                 No               matthews_corrcoef
MAE          Regression      N/A                 N/A              mean_absolute_error
RMSE         Regression      N/A                 N/A              root_mean_squared_error
R-squared    Regression      N/A                 N/A              r2_score
MAPE         Regression      N/A                 N/A              mean_absolute_percentage_error
Silhouette   Clustering      N/A                 N/A              silhouette_score
ARI          Clustering      N/A                 N/A              adjusted_rand_score
NDCG@k       Ranking         N/A                 N/A              ndcg_score

When in doubt: use PR AUC for imbalanced classification, RMSE for regression, and silhouette for clustering. Then show precision and recall (or the full confusion matrix) to your stakeholders, because a single number never tells the whole story.