Appendix F: Evaluation Metrics Reference

Every evaluation metric discussed in this book, in one place. For each metric: the formula, what it actually measures, when to use it, when not to use it, and the scikit-learn function that computes it.

This appendix is organized by task type: classification, regression, clustering, ranking, time series, and fairness. Cross-reference with Chapter 16 (Model Evaluation Deep Dive) and Chapter 33 (Fairness and Responsible ML) for worked examples.


Classification Metrics

Accuracy

  • Formula: (TP + TN) / (TP + TN + FP + FN)
  • Intuition: The proportion of predictions that are correct. The metric everyone learns first and should stop using as a default.
  • When to use: Balanced classes where all errors cost the same.
  • When NOT to use: Imbalanced classes. A model that predicts "not fraud" for every transaction achieves 99.8% accuracy on credit card data and catches zero fraud.
  • scikit-learn: sklearn.metrics.accuracy_score(y_true, y_pred)
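
The always-predict-negative failure mode described above is easy to reproduce; a minimal sketch with synthetic labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy fraud data: 2 fraud cases in 1000 transactions.
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1
y_pred = np.zeros(1000, dtype=int)  # a "model" that always predicts not-fraud

acc = accuracy_score(y_true, y_pred)  # 0.998 — looks great, catches zero fraud
```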

Precision

  • Formula: TP / (TP + FP)
  • Intuition: Of all the items the model flagged as positive, what fraction actually were positive? Measures the cost of false alarms.
  • When to use: When false positives are expensive. Spam filters (marking a real email as spam loses trust). Churn retention offers (each offer costs money — you want to target actual churners).
  • When NOT to use: When you care more about catching every positive case than about avoiding false alarms.
  • scikit-learn: sklearn.metrics.precision_score(y_true, y_pred)

Recall (Sensitivity, True Positive Rate)

  • Formula: TP / (TP + FN)
  • Intuition: Of all the items that actually were positive, what fraction did the model catch? Measures the cost of misses.
  • When to use: When false negatives are dangerous. Cancer screening (missing a tumor is worse than ordering an extra biopsy). Fraud detection (missing fraud is worse than flagging a legitimate transaction).
  • When NOT to use: When false positives are expensive and you cannot afford to flag too many items for review.
  • scikit-learn: sklearn.metrics.recall_score(y_true, y_pred)

F1 Score

  • Formula: 2 * (precision * recall) / (precision + recall)
  • Intuition: The harmonic mean of precision and recall. Balances false positives and false negatives equally.
  • When to use: When you need a single number that balances precision and recall, and you value them equally. Common default for imbalanced classification.
  • When NOT to use: When precision and recall have different business costs. In that case, use F-beta or optimize the threshold directly based on the cost matrix.
  • scikit-learn: sklearn.metrics.f1_score(y_true, y_pred)
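
A quick sketch of how the three scores relate on one toy confusion matrix (the label arrays are invented for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy confusion: 10 actual positives; the model flags 8 items, 6 correctly.
y_true = [1]*10 + [0]*10
y_pred = [1]*6 + [0]*4 + [1]*2 + [0]*8

p = precision_score(y_true, y_pred)   # 6/8  = 0.75
r = recall_score(y_true, y_pred)      # 6/10 = 0.60
f1 = f1_score(y_true, y_pred)         # harmonic mean = 2/3
```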

F-beta Score

  • Formula: (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
  • Intuition: Generalization of F1. Beta > 1 weights recall higher; beta < 1 weights precision higher. F2 cares twice as much about recall as precision.
  • When to use: When you can quantify the relative cost of false negatives vs. false positives. F2 for medical screening (catching cases matters more). F0.5 for spam filtering (false positives matter more).
  • When NOT to use: When you cannot articulate why one type of error is worse than the other.
  • scikit-learn: sklearn.metrics.fbeta_score(y_true, y_pred, beta=2.0)
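
A small sketch of how beta shifts the score. The toy labels below are constructed so precision (0.75) exceeds recall (0.60), which makes the ordering F0.5 > F1 > F2 visible:

```python
from sklearn.metrics import fbeta_score

# Toy confusion: 6 TP, 2 FP, 4 FN -> precision 0.75, recall 0.60.
y_true = [1]*10 + [0]*10
y_pred = [1]*6 + [0]*4 + [1]*2 + [0]*8

f2 = fbeta_score(y_true, y_pred, beta=2.0)    # 0.625  — pulled toward recall
f05 = fbeta_score(y_true, y_pred, beta=0.5)   # ~0.714 — pulled toward precision
```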

ROC AUC (Area Under the Receiver Operating Characteristic Curve)

  • Formula: Area under the curve plotting TPR (recall) vs. FPR (1 - specificity) at all classification thresholds.
  • Intuition: The probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. Threshold-independent.
  • When to use: Comparing models when you have not yet chosen an operating threshold. Good for model selection during development.
  • When NOT to use: Severely imbalanced datasets. ROC AUC can look misleadingly high because the false positive rate is diluted by the huge number of true negatives that dominate imbalanced data. Use PR AUC instead.
  • scikit-learn: sklearn.metrics.roc_auc_score(y_true, y_score)
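
Note that ROC AUC takes scores (probabilities or decision values), not hard labels. A minimal sketch with toy scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # model scores, not predicted labels

auc = roc_auc_score(y_true, y_score)
# 0.75: of the 4 positive/negative pairs, 3 are ranked correctly
```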

PR AUC (Area Under the Precision-Recall Curve)

  • Formula: Area under the curve plotting precision vs. recall at all classification thresholds.
  • Intuition: Summarizes the tradeoff between precision and recall without being inflated by a large number of true negatives. More informative than ROC AUC for imbalanced problems.
  • When to use: Imbalanced classification. Fraud detection, rare disease diagnosis, churn prediction with low churn rates.
  • When NOT to use: Balanced classes where ROC AUC is fine.
  • scikit-learn: sklearn.metrics.average_precision_score(y_true, y_score)
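
The two summaries can disagree. In the toy scores below, one positive is ranked first and the other last: average precision still rewards the top hit, while ROC AUC sees coin-flip pairwise ranking:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 1, 0, 1]
y_score = [0.6, 0.9, 0.5, 0.4]

ap = average_precision_score(y_true, y_score)   # 0.75
auc = roc_auc_score(y_true, y_score)            # 0.50
```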

Log Loss (Binary Cross-Entropy)

  • Formula: -1/N * sum(y * log(p) + (1-y) * log(1-p))
  • Intuition: Measures how well the predicted probabilities match the actual outcomes. Penalizes confident wrong predictions severely. A model that says "99% positive" for a negative example gets crushed.
  • When to use: When you care about calibrated probabilities, not just rankings. Risk scoring, insurance pricing, clinical decision support where the probability itself is the output.
  • When NOT to use: When you only care about the final class label, not the probability. Also difficult to interpret for non-technical stakeholders.
  • scikit-learn: sklearn.metrics.log_loss(y_true, y_pred_proba)
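
The "confidently wrong gets crushed" behavior in two lines (toy probabilities; log_loss accepts the predicted probability of the positive class):

```python
from sklearn.metrics import log_loss

y_true = [1, 0]
ll_ok = log_loss(y_true, [0.8, 0.3])    # reasonable probabilities: ~0.29
ll_bad = log_loss(y_true, [0.8, 0.99])  # "99% positive" on a negative: ~2.41
```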

Cohen's Kappa

  • Formula: (observed_accuracy - expected_accuracy) / (1 - expected_accuracy)
  • Intuition: Accuracy adjusted for chance agreement. Kappa = 1 means perfect agreement, kappa = 0 means no better than random, kappa < 0 means worse than random.
  • When to use: When evaluating against a majority-class baseline. Multiclass problems where class distribution is skewed. Inter-rater agreement comparisons.
  • When NOT to use: Not widely used in production ML. More common in medical and social science research. F1 or balanced accuracy are more interpretable for engineering teams.
  • scikit-learn: sklearn.metrics.cohen_kappa_score(y_true, y_pred)

Matthews Correlation Coefficient (MCC)

  • Formula: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
  • Intuition: A correlation coefficient between observed and predicted classes. Ranges from -1 (perfect inverse) to +1 (perfect). Regarded as a balanced metric even for imbalanced classes because it uses all four quadrants of the confusion matrix.
  • When to use: When you want a single metric that handles class imbalance without requiring you to choose between precision and recall.
  • When NOT to use: Less commonly used in industry. Stakeholders may not understand it. Often better to show precision, recall, and F1 alongside it.
  • scikit-learn: sklearn.metrics.matthews_corrcoef(y_true, y_pred)

Specificity (True Negative Rate)

  • Formula: TN / (TN + FP)
  • Intuition: Of all the actual negatives, what fraction did the model correctly identify as negative? The recall of the negative class.
  • When to use: Medical testing (specificity of a diagnostic test). Paired with sensitivity (recall) to describe test performance.
  • When NOT to use: Imbalanced data where negatives overwhelm positives — specificity will be trivially high.
  • scikit-learn: No dedicated function. Compute from the confusion matrix: tn / (tn + fp) using sklearn.metrics.confusion_matrix.
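
Since there is no dedicated function, the computation from the confusion matrix looks like this (toy labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# With labels=[0, 1], ravel() yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
specificity = tn / (tn + fp)  # 3 / (3 + 1) = 0.75
```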

Balanced Accuracy

  • Formula: (sensitivity + specificity) / 2 in the binary case; in general, macro-averaged recall.
  • Intuition: Average of recall across all classes. Each class contributes equally regardless of size.
  • When to use: Multiclass problems with imbalanced classes where you want each class weighted equally.
  • When NOT to use: When class costs are not equal — a rare class may actually be less important, not more.
  • scikit-learn: sklearn.metrics.balanced_accuracy_score(y_true, y_pred)

Regression Metrics

Mean Absolute Error (MAE)

  • Formula: (1/N) * sum(|y_true - y_pred|)
  • Intuition: Average absolute distance between predictions and actual values, in the same units as the target. Robust to outliers because it does not square the errors.
  • When to use: When you want an interpretable error in the original units and outliers should not dominate the metric.
  • When NOT to use: When large errors should be penalized disproportionately (use MSE/RMSE instead).
  • scikit-learn: sklearn.metrics.mean_absolute_error(y_true, y_pred)

Mean Squared Error (MSE)

  • Formula: (1/N) * sum((y_true - y_pred)^2)
  • Intuition: Average squared distance. Penalizes large errors more than small ones because of the squaring.
  • When to use: When large errors are disproportionately bad. Predicting house prices — under squared error, being off by $100K counts four times as much as being off by $50K. Also the default loss for most regression algorithms.
  • When NOT to use: When you need an interpretable number in the original units (use RMSE). When outliers in the target should not dominate the metric (use MAE).
  • scikit-learn: sklearn.metrics.mean_squared_error(y_true, y_pred)

Root Mean Squared Error (RMSE)

  • Formula: sqrt((1/N) * sum((y_true - y_pred)^2))
  • Intuition: Square root of MSE, returning the error to the original units. "On average, the prediction is off by X."
  • When to use: Same situations as MSE, but when you want to communicate the result in interpretable units.
  • When NOT to use: When outlier robustness matters.
  • scikit-learn: sklearn.metrics.root_mean_squared_error(y_true, y_pred) (scikit-learn >= 1.4), or np.sqrt(mean_squared_error(y_true, y_pred))
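
Because root_mean_squared_error only exists in scikit-learn 1.4+, a version-tolerant helper is worth having; a small sketch (the rmse function name is our own, not a library API):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """RMSE that works on any scikit-learn version."""
    try:
        from sklearn.metrics import root_mean_squared_error
        return float(root_mean_squared_error(y_true, y_pred))
    except ImportError:  # scikit-learn < 1.4
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

error = rmse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0])  # sqrt(mean of 0.25, 0, 2.25)
```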

R-squared (Coefficient of Determination)

  • Formula: 1 - (SS_res / SS_tot) where SS_res = sum((y_true - y_pred)^2) and SS_tot = sum((y_true - y_mean)^2)
  • Intuition: Proportion of variance in the target explained by the model. R^2 = 0.85 means the model explains 85% of the variance. R^2 = 0 means no better than predicting the mean.
  • When to use: Comparing models on the same dataset. Quick sanity check — if R^2 is negative, the model is worse than predicting the mean.
  • When NOT to use: Comparing across different datasets. R^2 depends on the inherent variance of the target — a model predicting stock prices might have R^2 = 0.02 and still be useful, while R^2 = 0.7 on housing data might be mediocre.
  • scikit-learn: sklearn.metrics.r2_score(y_true, y_pred)

Mean Absolute Percentage Error (MAPE)

  • Formula: (1/N) * sum(|y_true - y_pred| / |y_true|) * 100
  • Intuition: Average error as a percentage of the actual value. "The prediction is off by X% on average."
  • When to use: When stakeholders think in percentages. Sales forecasting, demand prediction.
  • When NOT to use: When actual values can be zero or near-zero (causes division by zero or explosion). When the scale of the target varies widely.
  • scikit-learn: sklearn.metrics.mean_absolute_percentage_error(y_true, y_pred)

Median Absolute Error

  • Formula: median(|y_true - y_pred|)
  • Intuition: The median (not mean) of absolute errors. Completely robust to outliers.
  • When to use: When your target has extreme outliers and you want a metric that reflects typical prediction quality.
  • When NOT to use: When you need differentiable loss for optimization. When extreme errors are actually important.
  • scikit-learn: sklearn.metrics.median_absolute_error(y_true, y_pred)

Clustering Metrics

Silhouette Score

  • Formula: For each point: (b - a) / max(a, b) where a = mean intra-cluster distance, b = mean nearest-cluster distance. Average over all points.
  • Intuition: Ranges from -1 to +1. High values mean points are close to their own cluster and far from others. Negative values mean the point is probably in the wrong cluster.
  • When to use: When you have no ground truth labels and need to evaluate cluster quality. Comparing different values of k.
  • When NOT to use: Non-convex or irregularly shaped clusters. Silhouette favors convex, similarly sized clusters and will undervalue DBSCAN-style results.
  • scikit-learn: sklearn.metrics.silhouette_score(X, labels)
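
A typical model-selection loop on synthetic data (the blob centers are arbitrary choices for illustration); the score should peak at the true number of clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs.
X, _ = make_blobs(n_samples=400,
                  centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
# scores is maximized at k=4, the true number of clusters
```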

Adjusted Rand Index (ARI)

  • Formula: Adjusted-for-chance version of the Rand index, measuring agreement between predicted clusters and ground truth labels.
  • Intuition: Ranges from roughly -0.5 to 1.0. ARI = 1 means perfect agreement with ground truth; ARI near 0 means no better than random assignment.
  • When to use: When you have ground truth cluster labels and want to evaluate how well the algorithm recovered them.
  • When NOT to use: When there is no ground truth (the usual case in production clustering). Use silhouette score instead.
  • scikit-learn: sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)

Inertia (Within-Cluster Sum of Squares)

  • Formula: sum over all clusters: sum of squared distances from each point to its centroid
  • Intuition: Total compactness of clusters. Lower is better, but always decreases as k increases. The "elbow" in the inertia-vs-k plot suggests a good k.
  • When to use: K-means model selection via elbow method. Quick sanity check during iteration.
  • When NOT to use: As a standalone metric — it always improves with more clusters. Must be paired with silhouette or domain judgment.
  • scikit-learn: kmeans_model.inertia_ (attribute, not a standalone function)

Davies-Bouldin Index

  • Formula: For each cluster, take the worst-case (maximum) ratio of combined within-cluster scatter to between-centroid distance over all other clusters; average these maxima across clusters.
  • Intuition: Lower is better. Measures how compact and well-separated clusters are. Zero is the theoretical best.
  • When to use: Alternative to silhouette score for model selection. Computationally cheaper than silhouette for large datasets.
  • When NOT to use: Same limitations as silhouette — favors convex, similarly-sized clusters.
  • scikit-learn: sklearn.metrics.davies_bouldin_score(X, labels)

Ranking Metrics

NDCG (Normalized Discounted Cumulative Gain)

  • Formula: DCG@k / IDCG@k where DCG@k = sum(rel_i / log2(i+1)) for positions 1 to k, and IDCG is the DCG of the ideal ranking.
  • Intuition: Measures ranking quality with diminishing returns for items lower in the list. Getting the top result right matters more than getting result #10 right.
  • When to use: Recommender systems (Chapter 24), search engines. When position in the ranked list matters.
  • When NOT to use: Binary relevance tasks where you just need hit/miss. Use precision@k instead.
  • scikit-learn: sklearn.metrics.ndcg_score(y_true, y_score, k=10)
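
Note the 2D input shape — one row per query. A minimal sketch with invented graded relevance:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query: graded relevance for five documents, plus the model's scores.
true_relevance = np.asarray([[3, 2, 3, 0, 1]])
model_scores = np.asarray([[0.9, 0.8, 0.1, 0.2, 0.7]])

ndcg = ndcg_score(true_relevance, model_scores, k=3)
perfect = ndcg_score(true_relevance, true_relevance, k=3)  # ideal ranking: 1.0
```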

Precision@k and Recall@k

  • Formula: Precision@k = relevant items in top k / k. Recall@k = relevant items in top k / total relevant items.
  • Intuition: Of the top k recommendations, how many were relevant? (Precision@k.) Of all relevant items, how many appeared in the top k? (Recall@k.)
  • When to use: Recommender systems, information retrieval. When you present a fixed number of results to the user.
  • When NOT to use: When the length of the recommendation list varies.
  • scikit-learn: No built-in function. Implement directly or use libraries like recmetrics or surprise.
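
A minimal hand-rolled version — precision_recall_at_k and its arguments are hypothetical names, not a library API:

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """ranked_items: the model's ranking, best first.
    relevant_items: set of ground-truth relevant item ids."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# 2 of the top-3 recommendations are relevant; 2 of 4 relevant items surfaced.
p_at_3, r_at_3 = precision_recall_at_k(["a", "b", "c", "d"],
                                       {"a", "c", "e", "f"}, k=3)
# p_at_3 = 2/3, r_at_3 = 0.5
```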

Mean Average Precision (MAP)

  • Formula: Mean of average precision across all queries, where average precision is the mean of precision@k at each relevant position.
  • Intuition: Summarizes precision across all recall levels. Rewards models that place relevant items at the top.
  • When to use: Information retrieval systems evaluated across multiple queries.
  • When NOT to use: Single-query evaluation (use AP). Non-binary relevance (use NDCG).
  • scikit-learn: sklearn.metrics.average_precision_score for binary classification. For ranking, use custom implementation or trec_eval.

Time Series Metrics

RMSE, MAE, MAPE

The same regression metrics apply to time series forecasting. See the Regression Metrics section above. The key difference: compute these only on the test set (future data), never on the training period.

Mean Absolute Scaled Error (MASE)

  • Formula: MAE / MAE_of_naive_forecast where the naive forecast is the previous period's value.
  • Intuition: Error relative to a naive (persistence) baseline. MASE < 1 means the model beats the naive forecast. MASE > 1 means you would have been better off predicting "same as last time."
  • When to use: Time series comparison across different scales. A MASE of 0.7 means the same thing whether you are predicting daily temperatures or monthly revenue.
  • When NOT to use: When the naive forecast is not appropriate (e.g., data with no temporal autocorrelation).
  • scikit-learn: No built-in function. Implement as: mae / mean(|y[t] - y[t-1]|).
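
A sketch of that one-liner as a function, following the common convention of scaling by the in-sample naive MAE of the training series (the mase name and signature here are hypothetical, not a library API):

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """Test-set MAE divided by the MAE of the naive one-step
    (previous-value) forecast on the training series."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    mae = np.mean(np.abs(y_true - y_pred))
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return mae / naive_mae

score = mase(y_true=[14.0, 15.0], y_pred=[13.5, 15.5],
             y_train=[10.0, 12.0, 11.0, 13.0])
# 0.5 / (5/3) = 0.3 — comfortably beats the naive forecast
```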

Weighted Absolute Percentage Error (WAPE)

  • Formula: sum(|y_true - y_pred|) / sum(|y_true|)
  • Intuition: Like MAPE but weighted by actual values, avoiding the explosion when actuals are near zero.
  • When to use: Demand forecasting, sales forecasting. Standard metric in supply chain and retail.
  • When NOT to use: When you need per-observation error analysis rather than aggregate.
  • scikit-learn: No built-in function. One line: np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)).

Fairness Metrics

Cross-reference: Chapter 33. These metrics compare model performance across protected groups (e.g., race, gender, age).

Demographic Parity (Statistical Parity)

  • Formula: P(Y_pred = 1 | group = A) = P(Y_pred = 1 | group = B)
  • Intuition: The model gives positive predictions at the same rate across groups. Does not consider whether the predictions are correct.
  • When to use: Hiring, lending, advertising — contexts where equal treatment is required by policy or law.
  • When NOT to use: When base rates genuinely differ between groups and you want equal accuracy, not equal rates.
  • scikit-learn: No built-in. Use aif360.metrics.BinaryLabelDatasetMetric or compute directly from grouped predictions.
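
Computing it directly from grouped predictions is a few lines (the predictions and group labels below are made up for illustration):

```python
import numpy as np

# Hypothetical predictions and protected-group membership.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 0, 0])
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

rate_a = y_pred[group == "A"].mean()  # 3/5 = 0.6
rate_b = y_pred[group == "B"].mean()  # 1/5 = 0.2
parity_gap = rate_a - rate_b          # 0.4 — far from demographic parity
impact_ratio = rate_b / rate_a        # ~0.33 — fails the four-fifths rule
```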

Equalized Odds

  • Formula: P(Y_pred = 1 | Y_true = y, group = A) = P(Y_pred = 1 | Y_true = y, group = B) for y in {0, 1}
  • Intuition: Equal true positive rates AND equal false positive rates across groups. The model makes the same types of mistakes for everyone.
  • When to use: Criminal justice risk assessment, medical diagnosis — contexts where error rates should not differ by group.
  • When NOT to use: When equal outcome rates matter more than equal error rates.
  • scikit-learn: No built-in. Use fairlearn.metrics.equalized_odds_difference or compute TPR and FPR per group from the confusion matrix.
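
The per-group computation is straightforward (toy data; equalized odds holds when both gaps are near zero — here they are not):

```python
from sklearn.metrics import confusion_matrix

def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate from the confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels and predictions for two groups.
tpr_a, fpr_a = tpr_fpr([1, 1, 0, 0], [1, 0, 0, 0])  # TPR 0.5, FPR 0.0
tpr_b, fpr_b = tpr_fpr([1, 1, 0, 0], [1, 1, 1, 0])  # TPR 1.0, FPR 0.5

tpr_gap = abs(tpr_a - tpr_b)  # 0.5
fpr_gap = abs(fpr_a - fpr_b)  # 0.5
```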

Equal Opportunity

  • Formula: P(Y_pred = 1 | Y_true = 1, group = A) = P(Y_pred = 1 | Y_true = 1, group = B)
  • Intuition: Equal true positive rates across groups. A relaxation of equalized odds that only requires equal catch rates for the positive class.
  • When to use: When missing a positive case is the critical error and that miss rate should not differ by group.
  • When NOT to use: When false positive rates also matter (use full equalized odds).
  • scikit-learn: No built-in. Compute recall per group.

Predictive Parity

  • Formula: P(Y_true = 1 | Y_pred = 1, group = A) = P(Y_true = 1 | Y_pred = 1, group = B)
  • Intuition: Equal precision across groups. When the model says "positive," the probability of actually being positive is the same for everyone.
  • When to use: When the model's predictions are used to allocate resources (e.g., follow-up calls, interventions) and you want equal resource efficiency across groups.
  • When NOT to use: The impossibility theorem (Chouldechova, 2017) shows you generally cannot have predictive parity AND equalized odds when base rates differ. Choose one.
  • scikit-learn: No built-in. Compute precision per group.

Disparate Impact Ratio

  • Formula: P(Y_pred = 1 | group = A) / P(Y_pred = 1 | group = B) where group B is the advantaged group.
  • Intuition: Ratio of positive prediction rates. The "four-fifths rule" in US employment law states this ratio should be at least 0.8.
  • When to use: Legal compliance in hiring and lending contexts.
  • When NOT to use: As a standalone fairness metric — it does not account for whether predictions are correct.
  • scikit-learn: No built-in. Use aif360.metrics.BinaryLabelDatasetMetric.disparate_impact() or compute from grouped prediction rates.

Quick Reference Table

Metric       Task            Handles Imbalance   Threshold-Free   scikit-learn
Accuracy     Classification  No                  No               accuracy_score
Precision    Classification  Yes                 No               precision_score
Recall       Classification  Yes                 No               recall_score
F1           Classification  Yes                 No               f1_score
ROC AUC      Classification  Somewhat            Yes              roc_auc_score
PR AUC       Classification  Yes                 Yes              average_precision_score
Log Loss     Classification  No                  Yes              log_loss
MCC          Classification  Yes                 No               matthews_corrcoef
MAE          Regression      N/A                 N/A              mean_absolute_error
RMSE         Regression      N/A                 N/A              root_mean_squared_error
R-squared    Regression      N/A                 N/A              r2_score
MAPE         Regression      N/A                 N/A              mean_absolute_percentage_error
Silhouette   Clustering      N/A                 N/A              silhouette_score
ARI          Clustering      N/A                 N/A              adjusted_rand_score
NDCG@k       Ranking         N/A                 N/A              ndcg_score

When in doubt: use PR AUC for imbalanced classification, RMSE for regression, and silhouette for clustering. Then show precision and recall (or the full confusion matrix) to your stakeholders, because a single number never tells the whole story.