Appendix F: Evaluation Metrics Reference
Every evaluation metric discussed in this book, in one place. For each metric: the formula, what it actually measures, when to use it, when not to use it, and the scikit-learn function that computes it.
This appendix is organized by task type: classification, regression, clustering, ranking, time series, and fairness. Cross-reference with Chapter 16 (Model Evaluation Deep Dive) and Chapter 33 (Fairness and Responsible ML) for worked examples.
Classification Metrics
Accuracy
- Formula: `(TP + TN) / (TP + TN + FP + FN)`
- Intuition: The proportion of predictions that are correct. The metric everyone learns first and should stop using as a default.
- When to use: Balanced classes where all errors cost the same.
- When NOT to use: Imbalanced classes. A model that predicts "not fraud" for every transaction achieves 99.8% accuracy on credit card data and catches zero fraud.
- scikit-learn: `sklearn.metrics.accuracy_score(y_true, y_pred)`
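The imbalance trap is easy to demonstrate. A minimal pure-Python sketch with illustrative counts (998 legitimate transactions, 2 fraudulent):

```python
# A "model" that always predicts the majority class scores high accuracy
# while catching zero fraud.
y_true = [1, 1] + [0] * 998   # 1 = fraud, 0 = legitimate
y_pred = [0] * 1000           # always predict "not fraud"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.998
print(recall)    # 0.0
```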
Precision
- Formula: `TP / (TP + FP)`
- Intuition: Of all the items the model flagged as positive, what fraction actually were positive? Measures the cost of false alarms.
- When to use: When false positives are expensive. Spam filters (marking a real email as spam loses trust). Churn retention offers (each offer costs money — you want to target actual churners).
- When NOT to use: When you care more about catching every positive case than about avoiding false alarms.
- scikit-learn: `sklearn.metrics.precision_score(y_true, y_pred)`
Recall (Sensitivity, True Positive Rate)
- Formula: `TP / (TP + FN)`
- Intuition: Of all the items that actually were positive, what fraction did the model catch? Measures the cost of misses.
- When to use: When false negatives are dangerous. Cancer screening (missing a tumor is worse than ordering an extra biopsy). Fraud detection (missing fraud is worse than flagging a legitimate transaction).
- When NOT to use: When false positives are expensive and you cannot afford to flag too many items for review.
- scikit-learn: `sklearn.metrics.recall_score(y_true, y_pred)`
F1 Score
- Formula: `2 * (precision * recall) / (precision + recall)`
- Intuition: The harmonic mean of precision and recall. Balances false positives and false negatives equally.
- When to use: When you need a single number that balances precision and recall, and you value them equally. Common default for imbalanced classification.
- When NOT to use: When precision and recall have different business costs. In that case, use F-beta or optimize the threshold directly based on the cost matrix.
- scikit-learn: `sklearn.metrics.f1_score(y_true, y_pred)`
F-beta Score
- Formula: `(1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)`
- Intuition: Generalization of F1. Beta > 1 weights recall higher; beta < 1 weights precision higher. F2 cares twice as much about recall as precision.
- When to use: When you can quantify the relative cost of false negatives vs. false positives. F2 for medical screening (catching cases matters more). F0.5 for spam filtering (false positives matter more).
- When NOT to use: When you cannot articulate why one type of error is worse than the other.
- scikit-learn: `sklearn.metrics.fbeta_score(y_true, y_pred, beta=2.0)`
ROC AUC (Area Under the Receiver Operating Characteristic Curve)
- Formula: Area under the curve plotting TPR (recall) vs. FPR (1 - specificity) at all classification thresholds.
- Intuition: The probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. Threshold-independent.
- When to use: Comparing models when you have not yet chosen an operating threshold. Good for model selection during development.
- When NOT to use: Severely imbalanced datasets. ROC AUC can be misleadingly high because it accounts for true negatives, which dominate in imbalanced data. Use PR AUC instead.
- scikit-learn: `sklearn.metrics.roc_auc_score(y_true, y_score)`
PR AUC (Area Under the Precision-Recall Curve)
- Formula: Area under the curve plotting precision vs. recall at all classification thresholds.
- Intuition: Summarizes the tradeoff between precision and recall without being inflated by a large number of true negatives. More informative than ROC AUC for imbalanced problems.
- When to use: Imbalanced classification. Fraud detection, rare disease diagnosis, churn prediction with low churn rates.
- When NOT to use: Balanced classes where ROC AUC is fine.
- scikit-learn: `sklearn.metrics.average_precision_score(y_true, y_score)`
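The inflation effect is visible even from first principles. A hand-rolled sketch (illustrative scores, not the scikit-learn implementations): two positives among 100 negatives, one ranked first and one ranked tenth.

```python
# Hand-rolled ROC AUC and average precision (PR AUC) from first principles.
y_score = [1.0 - 0.005 * i for i in range(102)]  # strictly decreasing scores
y_true = [0] * 102
y_true[0] = 1   # top-ranked positive
y_true[9] = 1   # second positive, ranked 10th

def roc_auc(y_true, y_score):
    # P(random positive scores higher than random negative); ties count half
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, y_score):
    # mean of precision at each relevant position, scanning best-to-worst
    ranked = sorted(zip(y_score, y_true), reverse=True)
    hits, ap = 0, 0.0
    for i, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / i
    return ap / sum(y_true)

print(roc_auc(y_true, y_score))            # 0.96 -- looks great
print(average_precision(y_true, y_score))  # 0.6  -- tells a different story
```

The 100 true negatives lift ROC AUC to 0.96 even though the second positive sits behind eight false alarms; PR AUC ignores the true negatives and reports a much less flattering 0.6.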
Log Loss (Binary Cross-Entropy)
- Formula: `-1/N * sum(y * log(p) + (1-y) * log(1-p))`
- Intuition: Measures how well the predicted probabilities match the actual outcomes. Penalizes confident wrong predictions severely. A model that says "99% positive" for a negative example gets crushed.
- When to use: When you care about calibrated probabilities, not just rankings. Risk scoring, insurance pricing, clinical decision support where the probability itself is the output.
- When NOT to use: When you only care about the final class label, not the probability. Also difficult to interpret for non-technical stakeholders.
- scikit-learn: `sklearn.metrics.log_loss(y_true, y_pred_proba)`
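The "gets crushed" behavior shows up in a single example's contribution (a sketch of the per-example term, not `sklearn.metrics.log_loss` itself):

```python
import math

def log_loss_term(y, p):
    # one example's contribution to binary cross-entropy
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Both predictions are wrong about a negative example (y = 0),
# but the confident one pays roughly five times the penalty.
print(log_loss_term(0, 0.6))   # mildly wrong: about 0.92
print(log_loss_term(0, 0.99))  # confidently wrong: about 4.6
```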
Cohen's Kappa
- Formula: `(observed_accuracy - expected_accuracy) / (1 - expected_accuracy)`
- Intuition: Accuracy adjusted for chance agreement. Kappa = 1 means perfect agreement, kappa = 0 means no better than random, kappa < 0 means worse than random.
- When to use: When evaluating against a majority-class baseline. Multiclass problems where class distribution is skewed. Inter-rater agreement comparisons.
- When NOT to use: Not widely used in production ML. More common in medical and social science research. F1 or balanced accuracy are more interpretable for engineering teams.
- scikit-learn: `sklearn.metrics.cohen_kappa_score(y_true, y_pred)`
Matthews Correlation Coefficient (MCC)
- Formula: `(TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))`
- Intuition: A correlation coefficient between observed and predicted classes. Ranges from -1 (perfect inverse) to +1 (perfect). Regarded as a balanced metric even for imbalanced classes because it uses all four quadrants of the confusion matrix.
- When to use: When you want a single metric that handles class imbalance without requiring you to choose between precision and recall.
- When NOT to use: Less commonly used in industry. Stakeholders may not understand it. Often better to show precision, recall, and F1 alongside it.
- scikit-learn: `sklearn.metrics.matthews_corrcoef(y_true, y_pred)`
Specificity (True Negative Rate)
- Formula: `TN / (TN + FP)`
- Intuition: Of all the actual negatives, what fraction did the model correctly identify as negative? The recall of the negative class.
- When to use: Medical testing (specificity of a diagnostic test). Paired with sensitivity (recall) to describe test performance.
- When NOT to use: Imbalanced data where negatives overwhelm positives — specificity will be trivially high.
- scikit-learn: No dedicated function. Compute `tn / (tn + fp)` from the output of `sklearn.metrics.confusion_matrix`.
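A direct implementation of that computation, as a pure-Python sketch (the scikit-learn idiom is shown in a comment):

```python
def specificity(y_true, y_pred):
    # TN / (TN + FP): the recall of the negative class
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tn / (tn + fp)

# scikit-learn equivalent:
#   tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
#   tn / (tn + fp)
print(specificity([0, 0, 0, 1, 1], [0, 1, 0, 1, 0]))  # 2/3 of negatives caught
```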
Balanced Accuracy
- Formula: `(sensitivity + specificity) / 2`, or equivalently, macro-averaged recall.
- Intuition: Average of recall across all classes. Each class contributes equally regardless of size.
- When to use: Multiclass problems with imbalanced classes where you want each class weighted equally.
- When NOT to use: When class costs are not equal — a rare class may actually be less important, not more.
- scikit-learn: `sklearn.metrics.balanced_accuracy_score(y_true, y_pred)`
Regression Metrics
Mean Absolute Error (MAE)
- Formula: `(1/N) * sum(|y_true - y_pred|)`
- Intuition: Average absolute distance between predictions and actual values, in the same units as the target. Robust to outliers because it does not square the errors.
- When to use: When you want an interpretable error in the original units and outliers should not dominate the metric.
- When NOT to use: When large errors should be penalized disproportionately (use MSE/RMSE instead).
- scikit-learn: `sklearn.metrics.mean_absolute_error(y_true, y_pred)`
Mean Squared Error (MSE)
- Formula: `(1/N) * sum((y_true - y_pred)^2)`
- Intuition: Average squared distance. Penalizes large errors more than small ones because of the squaring.
- When to use: When large errors are disproportionately bad. Predicting house prices — being off by $100K is more than twice as bad as being off by $50K. Also the default loss for most regression algorithms.
- When NOT to use: When you need an interpretable number in the original units (use RMSE). When outliers in the target should not dominate the metric (use MAE).
- scikit-learn: `sklearn.metrics.mean_squared_error(y_true, y_pred)`
Root Mean Squared Error (RMSE)
- Formula: `sqrt((1/N) * sum((y_true - y_pred)^2))`
- Intuition: Square root of MSE, returning the error to the original units. "On average, the prediction is off by X."
- When to use: Same situations as MSE, but when you want to communicate the result in interpretable units.
- When NOT to use: When outlier robustness matters.
- scikit-learn: `sklearn.metrics.root_mean_squared_error(y_true, y_pred)` (scikit-learn >= 1.4), or `np.sqrt(mean_squared_error(y_true, y_pred))`
R-squared (Coefficient of Determination)
- Formula: `1 - (SS_res / SS_tot)`, where `SS_res = sum((y_true - y_pred)^2)` and `SS_tot = sum((y_true - y_mean)^2)`
- Intuition: Proportion of variance in the target explained by the model. R^2 = 0.85 means the model explains 85% of the variance. R^2 = 0 means no better than predicting the mean.
- When to use: Comparing models on the same dataset. Quick sanity check — if R^2 is negative, the model is worse than predicting the mean.
- When NOT to use: Comparing across different datasets. R^2 depends on the inherent variance of the target — a model predicting stock prices might have R^2 = 0.02 and still be useful, while R^2 = 0.7 on housing data might be mediocre.
- scikit-learn: `sklearn.metrics.r2_score(y_true, y_pred)`
Mean Absolute Percentage Error (MAPE)
- Formula: `(1/N) * sum(|y_true - y_pred| / |y_true|) * 100`
- Intuition: Average error as a percentage of the actual value. "The prediction is off by X% on average."
- When to use: When stakeholders think in percentages. Sales forecasting, demand prediction.
- When NOT to use: When actual values can be zero or near-zero (causes division by zero or explosion). When the scale of the target varies widely.
- scikit-learn: `sklearn.metrics.mean_absolute_percentage_error(y_true, y_pred)`
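The near-zero failure mode is worth seeing once. A minimal sketch comparing two datasets with identical absolute errors:

```python
def mape(y_true, y_pred):
    # mean absolute percentage error, in percent
    return 100 * sum(
        abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)
    ) / len(y_true)

# Both datasets are off by exactly 10 on every prediction...
print(mape([100.0, 100.0], [90.0, 110.0]))  # 10.0 -- reasonable
# ...but one near-zero actual dominates the entire metric.
print(mape([0.1, 100.0], [10.1, 110.0]))    # over 5000
```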
Median Absolute Error
- Formula: `median(|y_true - y_pred|)`
- Intuition: The median (not mean) of absolute errors. Completely robust to outliers.
- When to use: When your target has extreme outliers and you want a metric that reflects typical prediction quality.
- When NOT to use: When you need differentiable loss for optimization. When extreme errors are actually important.
- scikit-learn: `sklearn.metrics.median_absolute_error(y_true, y_pred)`
Clustering Metrics
Silhouette Score
- Formula: For each point, `(b - a) / max(a, b)`, where `a` is the mean intra-cluster distance and `b` is the mean nearest-cluster distance; average over all points.
- Intuition: Ranges from -1 to +1. High values mean points are close to their own cluster and far from others. Negative values mean the point is probably in the wrong cluster.
- When to use: When you have no ground truth labels and need to evaluate cluster quality. Comparing different values of k.
- When NOT to use: Favors convex, equally-sized clusters. Will undervalue DBSCAN-style clusters with irregular shapes.
- scikit-learn: `sklearn.metrics.silhouette_score(X, labels)`
Adjusted Rand Index (ARI)
- Formula: Adjusted-for-chance version of the Rand index, measuring agreement between predicted clusters and ground truth labels.
- Intuition: Ranges from -0.5 to 1.0. ARI = 1 means perfect agreement with ground truth. ARI = 0 means random assignment.
- When to use: When you have ground truth cluster labels and want to evaluate how well the algorithm recovered them.
- When NOT to use: When there is no ground truth (the usual case in production clustering). Use silhouette score instead.
- scikit-learn: `sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)`
Inertia (Within-Cluster Sum of Squares)
- Formula: Sum over all clusters of the squared distances from each point to its cluster centroid.
- Intuition: Total compactness of clusters. Lower is better, but always decreases as k increases. The "elbow" in the inertia-vs-k plot suggests a good k.
- When to use: K-means model selection via elbow method. Quick sanity check during iteration.
- When NOT to use: As a standalone metric — it always improves with more clusters. Must be paired with silhouette or domain judgment.
- scikit-learn: `kmeans_model.inertia_` (attribute, not a standalone function)
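What the `inertia_` attribute computes can be written out directly. A hand-rolled sketch with illustrative 2-D points, centroids, and assignments:

```python
def inertia(points, centroids, labels):
    # within-cluster sum of squared distances to each point's assigned centroid
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centroids[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centroids = [(0.5, 0.0), (10.5, 0.0)]  # two tight, well-separated clusters
labels = [0, 0, 1, 1]
print(inertia(points, centroids, labels))  # 1.0: each point is 0.5 from its centroid
```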
Davies-Bouldin Index
- Formula: Average over all clusters of the maximum ratio of within-cluster scatter to between-cluster distance for each pair.
- Intuition: Lower is better. Measures how compact and well-separated clusters are. Zero is the theoretical best.
- When to use: Alternative to silhouette score for model selection. Computationally cheaper than silhouette for large datasets.
- When NOT to use: Same limitations as silhouette — favors convex, similarly-sized clusters.
- scikit-learn: `sklearn.metrics.davies_bouldin_score(X, labels)`
Ranking Metrics
NDCG (Normalized Discounted Cumulative Gain)
- Formula: `DCG@k / IDCG@k`, where `DCG@k = sum(rel_i / log2(i+1))` for positions 1 to k, and IDCG@k is the DCG of the ideal ranking.
- Intuition: Measures ranking quality with diminishing returns for items lower in the list. Getting the top result right matters more than getting result #10 right.
- When to use: Recommender systems (Chapter 24), search engines. When position in the ranked list matters.
- When NOT to use: Binary relevance tasks where you just need hit/miss. Use precision@k instead.
- scikit-learn: `sklearn.metrics.ndcg_score(y_true, y_score, k=10)`
Precision@k and Recall@k
- Formula: Precision@k = relevant items in top k / k. Recall@k = relevant items in top k / total relevant items.
- Intuition: Of the top k recommendations, how many were relevant? (Precision@k.) Of all relevant items, how many appeared in the top k? (Recall@k.)
- When to use: Recommender systems, information retrieval. When you present a fixed number of results to the user.
- When NOT to use: When the length of the recommendation list varies.
- scikit-learn: No built-in function. Implement directly or use libraries like `recmetrics` or `surprise`.
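Both metrics are a few lines to implement directly. A sketch, assuming `recommended` is an ordered list (best first) and `relevant` is a set of ground-truth items:

```python
def precision_at_k(recommended, relevant, k):
    # fraction of the top-k recommendations that are relevant
    return sum(item in relevant for item in recommended[:k]) / k

def recall_at_k(recommended, relevant, k):
    # fraction of all relevant items that appear in the top k
    return sum(item in relevant for item in recommended[:k]) / len(relevant)

recommended = ["a", "b", "c", "d", "e"]  # model output, best first
relevant = {"a", "c", "f", "g"}          # ground truth
print(precision_at_k(recommended, relevant, 3))  # 2/3: "a" and "c" in top 3
print(recall_at_k(recommended, relevant, 3))     # 0.5: 2 of 4 relevant found
```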
Mean Average Precision (MAP)
- Formula: Mean of average precision across all queries, where average precision is the mean of precision@k at each relevant position.
- Intuition: Summarizes precision across all recall levels. Rewards models that place relevant items at the top.
- When to use: Information retrieval systems evaluated across multiple queries.
- When NOT to use: Single-query evaluation (use AP). Non-binary relevance (use NDCG).
- scikit-learn:
sklearn.metrics.average_precision_scorefor binary classification. For ranking, use custom implementation ortrec_eval.
Time Series Metrics
RMSE, MAE, MAPE
The same regression metrics apply to time series forecasting. See the Regression Metrics section above. The key difference: compute these only on the test set (future data), never on the training period.
Mean Absolute Scaled Error (MASE)
- Formula: `MAE / MAE_of_naive_forecast`, where the naive forecast is the previous period's value.
- Intuition: Error relative to a naive (persistence) baseline. MASE < 1 means the model beats the naive forecast. MASE > 1 means you would have been better off predicting "same as last time."
- When to use: Time series comparison across different scales. A MASE of 0.7 means the same thing whether you are predicting daily temperatures or monthly revenue.
- When NOT to use: When the naive forecast is not appropriate (e.g., data with no temporal autocorrelation).
- scikit-learn: No built-in function. Implement as `mae / mean(|y[t] - y[t-1]|)`.
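A slightly fuller sketch, scaling by the naive forecast's MAE on the training period (a common convention); the series below are made up for illustration:

```python
def mase(y_true, y_pred, y_train):
    # scale test-set MAE by the MAE of a one-step naive forecast on the history
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive_mae = sum(
        abs(y_train[i] - y_train[i - 1]) for i in range(1, len(y_train))
    ) / (len(y_train) - 1)
    return mae / naive_mae

y_train = [10.0, 12.0, 11.0, 13.0]  # training history
y_true = [14.0, 15.0]               # test-period actuals
y_pred = [13.0, 16.0]               # model forecasts
print(mase(y_true, y_pred, y_train))  # ~0.6: beats the naive baseline
```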
Weighted Absolute Percentage Error (WAPE)
- Formula: `sum(|y_true - y_pred|) / sum(|y_true|)`
- Intuition: Like MAPE but weighted by actual values, avoiding the explosion when actuals are near zero.
- When to use: Demand forecasting, sales forecasting. Standard metric in supply chain and retail.
- When NOT to use: When you need per-observation error analysis rather than aggregate.
- scikit-learn: No built-in function. One line: `np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))`.
Fairness Metrics
Cross-reference: Chapter 33. These metrics compare model performance across protected groups (e.g., race, gender, age).
Demographic Parity (Statistical Parity)
- Formula: `P(Y_pred = 1 | group = A) = P(Y_pred = 1 | group = B)`
- Intuition: The model gives positive predictions at the same rate across groups. Does not consider whether the predictions are correct.
- When to use: Hiring, lending, advertising — contexts where equal treatment is required by policy or law.
- When NOT to use: When base rates genuinely differ between groups and you want equal accuracy, not equal rates.
- scikit-learn: No built-in. Use `aif360.metrics.BinaryLabelDatasetMetric` or compute directly from grouped predictions.
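Computing it directly from grouped predictions is a one-liner per group. A pure-Python sketch with illustrative data:

```python
def positive_rate(y_pred, groups, group):
    # P(Y_pred = 1 | group): positive prediction rate within one group
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)

y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rate_a = positive_rate(y_pred, groups, "A")
rate_b = positive_rate(y_pred, groups, "B")
print(rate_a, rate_b, abs(rate_a - rate_b))  # 0.75 0.25 0.5 -- parity violated
```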
Equalized Odds
- Formula: `P(Y_pred = 1 | Y_true = y, group = A) = P(Y_pred = 1 | Y_true = y, group = B)` for y in {0, 1}
- Intuition: Equal true positive rates AND equal false positive rates across groups. The model makes the same types of mistakes for everyone.
- When to use: Criminal justice risk assessment, medical diagnosis — contexts where error rates should not differ by group.
- When NOT to use: When equal outcome rates matter more than equal error rates.
- scikit-learn: No built-in. Use `fairlearn.metrics.equalized_odds_difference` or compute TPR and FPR per group from the confusion matrix.
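Computing TPR and FPR per group directly is straightforward. A pure-Python sketch with illustrative data where the model treats the two groups differently:

```python
def rates_by_group(y_true, y_pred, groups):
    # per-group (TPR, FPR); equalized odds requires both to match across groups
    out = {}
    for g in sorted(set(groups)):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        tp = sum(t == 1 and p == 1 for t, p in pairs)
        fn = sum(t == 1 and p == 0 for t, p in pairs)
        fp = sum(t == 0 and p == 1 for t, p in pairs)
        tn = sum(t == 0 and p == 0 for t, p in pairs)
        out[g] = (tp / (tp + fn), fp / (fp + tn))
    return out

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(rates_by_group(y_true, y_pred, groups))
# {'A': (0.5, 0.5), 'B': (1.0, 0.0)} -- equalized odds is violated
```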
Equal Opportunity
- Formula: `P(Y_pred = 1 | Y_true = 1, group = A) = P(Y_pred = 1 | Y_true = 1, group = B)`
- Intuition: Equal true positive rates across groups. A relaxation of equalized odds that only requires equal catch rates for the positive class.
- When to use: When missing a positive case is the critical error and that miss rate should not differ by group.
- When NOT to use: When false positive rates also matter (use full equalized odds).
- scikit-learn: No built-in. Compute recall per group.
Predictive Parity
- Formula: `P(Y_true = 1 | Y_pred = 1, group = A) = P(Y_true = 1 | Y_pred = 1, group = B)`
- Intuition: Equal precision across groups. When the model says "positive," the probability of actually being positive is the same for everyone.
- When to use: When the model's predictions are used to allocate resources (e.g., follow-up calls, interventions) and you want equal resource efficiency across groups.
- When NOT to use: The impossibility theorem (Chouldechova, 2017) shows you generally cannot have predictive parity AND equalized odds when base rates differ. Choose one.
- scikit-learn: No built-in. Compute precision per group.
Disparate Impact Ratio
- Formula: `P(Y_pred = 1 | group = A) / P(Y_pred = 1 | group = B)`, where group B is the advantaged group.
- Intuition: Ratio of positive prediction rates. The "four-fifths rule" in US employment law states this ratio should be at least 0.8.
- When to use: Legal compliance in hiring and lending contexts.
- When NOT to use: As a standalone fairness metric — it does not account for whether predictions are correct.
- scikit-learn: No built-in. Use `aif360.metrics.BinaryLabelDatasetMetric.disparate_impact()` or compute from grouped prediction rates.
Quick Reference Table
| Metric | Task | Handles Imbalance | Threshold-Free | scikit-learn |
|---|---|---|---|---|
| Accuracy | Classification | No | No | accuracy_score |
| Precision | Classification | Yes | No | precision_score |
| Recall | Classification | Yes | No | recall_score |
| F1 | Classification | Yes | No | f1_score |
| ROC AUC | Classification | Somewhat | Yes | roc_auc_score |
| PR AUC | Classification | Yes | Yes | average_precision_score |
| Log Loss | Classification | No | Yes | log_loss |
| MCC | Classification | Yes | No | matthews_corrcoef |
| MAE | Regression | N/A | N/A | mean_absolute_error |
| RMSE | Regression | N/A | N/A | root_mean_squared_error |
| R-squared | Regression | N/A | N/A | r2_score |
| MAPE | Regression | N/A | N/A | mean_absolute_percentage_error |
| Silhouette | Clustering | N/A | N/A | silhouette_score |
| ARI | Clustering | N/A | N/A | adjusted_rand_score |
| NDCG@k | Ranking | N/A | N/A | ndcg_score |
When in doubt: use PR AUC for imbalanced classification, RMSE for regression, and silhouette for clustering. Then show precision and recall (or the full confusion matrix) to your stakeholders, because a single number never tells the whole story.