Appendix G: Evaluation Metrics Reference

A Practitioner's Guide to Choosing, Computing, and Interpreting Model Metrics


This appendix is a comprehensive reference for the evaluation metrics used throughout this book. For each metric, we provide the mathematical definition, interpretation, guidance on when to use it, common pitfalls, and — where relevant — a Python implementation. The goal is not to catalog every metric ever invented, but to give you the decision framework to select the right metric for every problem you encounter.

The most common evaluation mistake in applied data science is not computing a metric incorrectly — it is computing the wrong metric correctly.


G.1 — Classification Metrics

Classification metrics evaluate models that predict discrete categories. The choice of metric depends on the class distribution, the cost structure of errors, and whether the model outputs probabilities or hard labels.

G.1.1 — The Confusion Matrix Foundation

Every binary classification metric derives from four counts:

                      Predicted Positive      Predicted Negative
Actually Positive     True Positive (TP)      False Negative (FN)
Actually Negative     False Positive (FP)     True Negative (TN)

The total sample size is $n = \text{TP} + \text{FP} + \text{FN} + \text{TN}$. The prevalence (base rate) is $\pi = (\text{TP} + \text{FN}) / n$. Every metric below is a function of these four quantities, weighted differently depending on what matters for the application.
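The four counts are straightforward to compute directly from label arrays; here is a minimal sketch (the helper name `confusion_counts` is ours, not a library function):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts([1, 1, 0, 0], [1, 0, 1, 0])
n = tp + fp + fn + tn
prevalence = (tp + fn) / n  # the base rate pi
```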

G.1.2 — Accuracy

Formula:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$$

Interpretation: The fraction of all predictions that are correct.

When to use: Only when classes are approximately balanced AND the costs of false positives and false negatives are roughly equal. In practice, this is rare.

Common pitfalls:

- The accuracy paradox: in a dataset with 99% negatives, a model that always predicts "negative" achieves 99% accuracy while being completely useless. This is the single most common metric mistake in applied ML.
- Accuracy treats all errors as equally costly. In credit scoring (Chapter 31), denying a qualified applicant and approving a defaulting applicant have very different costs.

When to avoid: Any dataset with class imbalance greater than approximately 80/20. Any application where false positives and false negatives have different costs — which is nearly every real-world application.

G.1.3 — Precision

Formula:

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

Interpretation: Of all examples the model labeled as positive, what fraction are actually positive? Precision answers: "When the model says yes, how often is it right?"

When to use: When the cost of false positives is high. In the StreamRec recommendation system (Chapters 13, 24), showing irrelevant recommendations wastes screen real estate and erodes user trust. In spam filtering, false positives mean legitimate emails end up in the spam folder.

Common pitfalls:

- Precision can be trivially maximized by being very conservative — predict positive only for the single most confident example, and precision approaches 1.0. This is useless unless paired with recall.
- Precision is undefined when $\text{TP} + \text{FP} = 0$ (the model never predicts positive). Many implementations return 0.0 in this case, which can silently corrupt aggregated metrics.

G.1.4 — Recall (Sensitivity, True Positive Rate)

Formula:

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

Interpretation: Of all actually positive examples, what fraction did the model identify? Recall answers: "Of all the positives that exist, how many did we catch?"

When to use: When the cost of false negatives is high. In the MediCore Pharma case (Chapters 15-19), missing a patient who would benefit from treatment is potentially life-threatening. In fraud detection, missing a fraudulent transaction is a direct financial loss.

Common pitfalls:

- Recall can be trivially maximized by predicting everything as positive. This is useless unless paired with precision.
- In highly imbalanced datasets, recall alone can be misleading. A model with 95% recall but 1% precision floods the system with false alarms.

G.1.5 — F1 Score

Formula:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \text{TP}}{2 \text{TP} + \text{FP} + \text{FN}}$$

Interpretation: The harmonic mean of precision and recall. It is high only when both precision and recall are high.

When to use: When you need a single number that balances precision and recall, and the costs of false positives and false negatives are approximately equal.

The $F_\beta$ generalization: when recall is considered $\beta$ times as important as precision:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

$F_2$ weights recall twice as heavily as precision (use when missing positives is costly). $F_{0.5}$ weights precision twice as heavily (use when false alarms are costly).

Common pitfalls:

- The harmonic mean is dominated by the smaller value. If precision is 0.95 and recall is 0.10, $F_1 \approx 0.18$ — the model is essentially penalized entirely for poor recall. This is a feature, not a bug, but be aware of it.
- $F_1$ implicitly assumes equal cost for false positives and false negatives. If the costs are not equal, use $F_\beta$ or a cost-sensitive metric.
- $F_1$ is computed per class. For multiclass problems, the macro-average (unweighted mean across classes), micro-average (computed from global TP/FP/FN), and weighted average (weighted by class support) can give different rankings of models. Always report which averaging method you use.
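A minimal $F_\beta$ implementation from the confusion-matrix counts, guarding against the degenerate cases noted above (the function name `f_beta` is ours):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F_beta from confusion-matrix counts; returns 0.0 in degenerate cases."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# beta=1 recovers F1; beta=2 favors recall; beta=0.5 favors precision.
# With precision 1.0 and recall 0.5 (tp=2, fp=0, fn=2), F0.5 > F1 > F2.
f1 = f_beta(tp=2, fp=0, fn=2)
```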

G.1.6 — AUC-ROC (Area Under the Receiver Operating Characteristic Curve)

Formula:

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate ($\text{FPR} = \text{FP} / (\text{FP} + \text{TN})$) at all classification thresholds. The AUC is the area under this curve:

$$\text{AUC-ROC} = \int_0^1 \text{TPR}(\text{FPR}^{-1}(t)) \, dt$$

Equivalently, AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example:

$$\text{AUC-ROC} = P(\hat{y}_{\text{pos}} > \hat{y}_{\text{neg}})$$

Interpretation: A model with AUC = 0.5 is no better than random; AUC = 1.0 is perfect ranking. AUC measures discrimination — the model's ability to separate positives from negatives — independent of the classification threshold.

When to use: When you care about ranking quality across all possible thresholds. When class balance may change between training and deployment (AUC-ROC is invariant to class prior shifts). When comparing models that will be deployed at different operating points.

Common pitfalls:

- AUC-ROC is misleading under severe class imbalance. When 99.9% of examples are negative, even a modest false positive rate (say 5%) produces a number of false positives that dwarfs the positive class in absolute terms. But because TN is so large, FPR = FP/(FP + TN) remains small and the ROC curve looks excellent anyway. In these settings, use AUC-PR instead.
- AUC is a threshold-free metric, but deployment requires choosing a threshold. A model with high AUC may have no single threshold where both precision and recall are acceptable.
- AUC measures ranking, not calibration. A model with perfect AUC can have wildly miscalibrated probabilities. See Section G.6.

from sklearn.metrics import roc_auc_score, roc_curve
import numpy as np

def compute_auc_roc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Compute AUC-ROC with input validation."""
    if len(np.unique(y_true)) < 2:
        raise ValueError("AUC-ROC requires both positive and negative examples.")
    return roc_auc_score(y_true, y_score)

# Manual computation via the probabilistic interpretation
def auc_roc_manual(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """AUC-ROC via the Mann-Whitney U statistic."""
    pos_scores = y_score[y_true == 1]
    neg_scores = y_score[y_true == 0]
    # Count concordant pairs
    n_concordant = sum(
        (p > n) + 0.5 * (p == n)
        for p in pos_scores
        for n in neg_scores
    )
    return n_concordant / (len(pos_scores) * len(neg_scores))

G.1.7 — AUC-PR (Area Under the Precision-Recall Curve)

Formula:

The PR curve plots precision against recall at all classification thresholds. The AUC is the area under this curve:

$$\text{AUC-PR} = \int_0^1 \text{Precision}(\text{Recall}^{-1}(r)) \, dr$$

Interpretation: A model with AUC-PR equal to the prevalence $\pi$ is no better than random; AUC-PR = 1.0 is perfect. Unlike AUC-ROC, the random baseline depends on class balance.

When to use: Always when dealing with imbalanced classification. In the Meridian Financial credit scoring system (Chapters 31, 35), default rates are typically 2-5%. In fraud detection, fraud rates may be below 0.1%. AUC-PR directly reflects performance on the minority class.

Common pitfalls:

- AUC-PR values are not comparable across datasets with different prevalence rates. An AUC-PR of 0.3 on a dataset with a 0.1% positive rate may be excellent, while 0.3 on a balanced dataset is terrible.
- The interpolation method for computing AUC-PR matters. Use the step-function sum (scikit-learn's average_precision_score), not linear interpolation between PR points, which over-estimates the area.
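The step-function sum can be written directly; the sketch below (function name ours) computes $\sum_k (R_k - R_{k-1}) P_k$, the quantity scikit-learn's average_precision_score reports (tie handling aside):

```python
import numpy as np

def average_precision(y_true, y_score) -> float:
    """AUC-PR via the step-function sum: each positive contributes its
    precision-at-rank, weighted by the recall step 1/n_pos."""
    order = np.argsort(-np.asarray(y_score, dtype=float))
    y = np.asarray(y_true, dtype=float)[order]
    n_pos = y.sum()
    if n_pos == 0:
        return 0.0
    precision_at_k = np.cumsum(y) / (np.arange(len(y)) + 1)
    return float(np.sum(precision_at_k * y) / n_pos)
```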

G.1.8 — Log Loss (Binary Cross-Entropy)

Formula:

$$\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

Interpretation: The average negative log-likelihood of the true labels under the model's predicted probabilities. It measures both discrimination (ranking) and calibration (probability accuracy). A model that assigns probability 0.7 to an event that occurs 70% of the time achieves lower log loss than a model that assigns 0.99.

When to use: When well-calibrated probabilities matter, not just rankings. When training a model (it is the standard loss function for binary classification, as derived in Chapter 3 from maximum likelihood). When the downstream application uses the predicted probability directly (e.g., expected value calculations, risk pricing in the Meridian Financial system).

Common pitfalls:

- Log loss is extremely sensitive to confident wrong predictions. A single prediction of $\hat{p} = 0.999$ for a negative example contributes $-\log(0.001) \approx 6.9$ nats to the loss. In practice, clip predictions to $[\epsilon, 1 - \epsilon]$ for numerical stability.
- Log loss has no intuitive scale. A log loss of 0.35 is hard to interpret without context. Always compare to baselines: the prevalence-only model (predict $\hat{p} = \pi$ for all examples) gives log loss $= -\pi \log \pi - (1-\pi)\log(1-\pi)$, which is the entropy of the label distribution.
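Both pitfalls — clipping for stability and the prevalence baseline — fit in a few lines; a minimal sketch (function names ours):

```python
import numpy as np

def log_loss_clipped(y_true, y_prob, eps: float = 1e-15) -> float:
    """Binary cross-entropy with predictions clipped to [eps, 1 - eps]."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def baseline_log_loss(y_true) -> float:
    """Log loss of the prevalence-only model: the entropy of the labels."""
    pi = float(np.mean(y_true))
    return float(-(pi * np.log(pi) + (1 - pi) * np.log(1 - pi)))
```

A model is only informative to the extent its log loss falls below `baseline_log_loss` on the same data.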

G.1.9 — Brier Score

Formula:

$$\text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2$$

Interpretation: The mean squared error of probabilistic predictions. It ranges from 0 (perfect) to 1 (worst). Unlike log loss, Brier score does not penalize confident wrong predictions as catastrophically — the maximum penalty for a single example is 1.0, not infinity.

When to use: When you want a calibration-sensitive metric that is less extreme than log loss. When comparing probabilistic forecasts (weather forecasting, the climate projections in the Pacific Climate case). When the application can tolerate occasional confident errors better than consistently uncertain predictions.

Brier decomposition (see also Section G.6):

$$\text{Brier} = \underbrace{\text{Reliability}}_{\text{calibration error}} - \underbrace{\text{Resolution}}_{\text{discrimination}} + \underbrace{\text{Uncertainty}}_{\text{inherent}}$$

This decomposition separates calibration quality from discrimination quality — a powerful diagnostic.
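The Murphy decomposition is exact when predictions take finitely many distinct values, which makes a direct implementation easy to sanity-check; a sketch grouping by distinct predicted probability (function name ours):

```python
import numpy as np

def brier_decomposition(y_true, y_prob):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.

    Exact when samples are grouped by each distinct predicted value.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)
    n = len(y)
    base_rate = y.mean()
    reliability = 0.0
    resolution = 0.0
    for value in np.unique(p):
        mask = p == value
        obs = y[mask].mean()          # observed frequency at this forecast value
        weight = mask.sum() / n
        reliability += weight * (value - obs) ** 2
        resolution += weight * (obs - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```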

G.1.10 — Expected Calibration Error (ECE)

Formula:

Partition predictions into $M$ equal-width bins $B_1, \ldots, B_M$ by predicted probability. For each bin:

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|$$

where $\text{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} y_i$ is the observed accuracy in bin $m$ and $\text{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$ is the average predicted probability.

Interpretation: The weighted average gap between predicted confidence and observed accuracy. A perfectly calibrated model has ECE = 0.

When to use: When evaluating calibration quality specifically (Chapter 34). When deploying models where the predicted probability drives downstream decisions.

Common pitfalls:

- ECE depends on the number of bins $M$ and the binning strategy (equal-width vs. equal-count). Always report $M$ and the strategy.
- ECE can be gamed: a model that outputs only two values (e.g., 0.3 and 0.8) can achieve low ECE if those values happen to match the observed frequencies, even if it loses all discrimination.
- For a more robust alternative, use the adaptive ECE with equal-count bins, or the kernel-based calibration error.

See Section G.6 for full calibration metric treatment.
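An ECE implementation that makes both the bin count and the binning strategy explicit, per the pitfalls above (function name and the `strategy` argument are ours):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10,
                               strategy: str = "width") -> float:
    """ECE with equal-width or equal-count bins. Report both choices."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)
    if strategy == "width":
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    else:  # "count": adaptive, equal-count bins
        edges = np.quantile(p, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        acc = y[mask].mean()    # observed frequency in the bin
        conf = p[mask].mean()   # average predicted probability in the bin
        ece += mask.sum() / len(y) * abs(acc - conf)
    return float(ece)
```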


G.2 — Regression Metrics

Regression metrics evaluate models that predict continuous values. The choice depends on the error distribution, the presence of outliers, and whether errors should be penalized symmetrically.

G.2.1 — Mean Absolute Error (MAE)

Formula:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Interpretation: The average magnitude of prediction errors, in the same units as the target. MAE = 3.2 means "on average, predictions are off by 3.2 units."

When to use: When all errors of the same magnitude should be penalized equally, regardless of direction. When outliers should not dominate the metric. When the error distribution is heavy-tailed (e.g., financial data, user engagement counts in the StreamRec system).

Connection to loss optimization: MAE is the $L_1$ loss. Minimizing MAE is equivalent to predicting the conditional median of $Y | X$, not the conditional mean. This is a feature when the error distribution is skewed.

Common pitfalls:

- MAE is not differentiable at zero, which can cause issues for gradient-based optimization at the point where the residual passes through zero. In practice, this rarely matters for evaluation, but it matters if you use MAE as a training loss.
- MAE does not penalize large errors more than small ones. If rare catastrophic errors are costly (e.g., demand forecasting where a 100-unit underestimate shuts down a production line), consider RMSE or an asymmetric loss.

G.2.2 — Root Mean Squared Error (RMSE)

Formula:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Interpretation: Like MAE, RMSE is in the same units as the target. But RMSE penalizes large errors more heavily due to the squaring: a single error of 10 contributes 100 to MSE, while ten errors of 1 contribute only 10.

$$\text{RMSE} \geq \text{MAE}, \quad \text{with equality iff all absolute errors are equal}$$

When to use: When large errors are disproportionately costly. When the error distribution is approximately Gaussian (minimizing MSE is equivalent to MLE under Gaussian noise, as shown in Chapter 3). When you want the metric to be sensitive to variance in errors, not just their average magnitude.

Common pitfalls:

- RMSE is sensitive to outliers. A single prediction error of 1000 in a dataset of 1000 examples with otherwise-perfect predictions gives RMSE $\approx 31.6$, while MAE $= 1.0$.
- MSE (the squared version, without the root) is in squared units, which makes interpretation difficult. Always report RMSE, not MSE, for interpretability.
- RMSE is scale-dependent. An RMSE of 5 is excellent for house prices (a \$5 error) but terrible for binary predictions (probability scale 0-1). Use MAPE or normalized metrics for cross-task comparison.
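The outlier sensitivity is easy to verify numerically; this snippet reproduces the single-catastrophic-error example from the text:

```python
import numpy as np

# 1000 predictions: 999 perfect, one off by 1000
errors = np.zeros(1000)
errors[0] = 1000

mae = float(np.mean(np.abs(errors)))         # 1.0
rmse = float(np.sqrt(np.mean(errors ** 2)))  # sqrt(1000) ~ 31.6
```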

G.2.3 — Mean Absolute Percentage Error (MAPE)

Formula:

$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

Interpretation: The average absolute error expressed as a percentage of the true value. MAPE = 8% means "on average, predictions are off by 8% of the true value."

When to use: When errors should be measured relative to the magnitude of the target. In demand forecasting (Chapter 23), underestimating demand for a high-volume product by 100 units is less serious than underestimating demand for a low-volume product by the same amount.

Common pitfalls:

- MAPE is undefined when $y_i = 0$ and approaches infinity when $y_i$ is close to zero. This makes it unsuitable for datasets with zero-valued targets (zero-inflated counts, sparse engagement metrics).
- MAPE is asymmetric: because the denominator is the true value, the same absolute error is penalized more when the true value is smaller. For positive targets, under-prediction error is capped at 100% while over-prediction error is unbounded. Consider $y = 100, \hat{y} = 150$: MAPE = 50%. Now consider $y = 50, \hat{y} = 100$, the same absolute error: MAPE = 100%.
- For a symmetric alternative, use the Symmetric MAPE (sMAPE): $\text{sMAPE} = \frac{100\%}{n} \sum \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|) / 2}$.
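The sMAPE formula above translates directly to code (function name ours; note the denominator is still zero when both the truth and the prediction are zero, so the zero-target caveat is only softened, not eliminated):

```python
import numpy as np

def smape(y_true, y_pred) -> float:
    """Symmetric MAPE in percent; bounded in [0, 200] for nonzero pairs."""
    y = np.asarray(y_true, dtype=float)
    yhat = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y) + np.abs(yhat)) / 2.0
    return float(100.0 * np.mean(np.abs(y - yhat) / denom))
```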

G.2.4 — $R^2$ (Coefficient of Determination)

Formula:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$$

Interpretation: The fraction of variance in the target that the model explains. $R^2 = 0$ means the model is no better than predicting the mean; $R^2 = 1$ is perfect. $R^2$ can be negative for models worse than the mean baseline.

When to use: When you want a scale-independent measure of how much of the target's variance the model captures. When comparing models across different prediction tasks with different scales.

Common pitfalls:

- $R^2$ always increases (or stays the same) when adding features to a linear model, even if the features are noise. In linear regression, use adjusted $R^2$: $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$, where $p$ is the number of predictors.
- $R^2$ is sensitive to the variance of $y$. If the target has very little variance (e.g., nearly constant), even a good model can have low $R^2$ because the denominator $\text{SS}_{\text{tot}}$ is small.
- $R^2$ does not indicate whether the model's predictions are biased. A model that systematically over-predicts by 10 can have high $R^2$ if it otherwise tracks the target perfectly.
- $R^2 < 0$ on test data means your model is worse than predicting the training mean. This is a diagnostic, not an error.

G.2.5 — Quantile Loss (Pinball Loss)

Formula:

For quantile $\tau \in (0, 1)$:

$$L_\tau(y, \hat{y}) = \begin{cases} \tau \cdot (y - \hat{y}) & \text{if } y \geq \hat{y} \\ (1 - \tau) \cdot (\hat{y} - y) & \text{if } y < \hat{y} \end{cases}$$

$$\text{Quantile Loss} = \frac{1}{n} \sum_{i=1}^{n} L_\tau(y_i, \hat{y}_i)$$

Interpretation: Quantile loss is asymmetric: for $\tau = 0.9$, under-predictions are penalized 9 times more than over-predictions. Minimizing quantile loss at level $\tau$ yields the $\tau$-th conditional quantile of $Y | X$.

When to use: When building probabilistic forecasts (Chapter 23). When the cost of under-prediction differs from the cost of over-prediction. When constructing prediction intervals: train separate models at $\tau = 0.05$ and $\tau = 0.95$ to get a 90% prediction interval.

Common pitfalls:

- Quantile regression does not guarantee that quantile predictions are monotonically ordered. A model predicting the 90th percentile can output a value below the same model's 10th-percentile prediction. Use quantile crossing corrections or a single model with monotonicity constraints (e.g., the SQF approach).
- When averaging quantile losses across multiple quantiles, weight them equally for an overall score, or use the Continuous Ranked Probability Score (CRPS) for a proper scoring rule that integrates over all quantiles.
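The piecewise definition collapses to a one-line vectorized form; a minimal sketch (function name ours):

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau: float) -> float:
    """Mean pinball loss at quantile level tau in (0, 1)."""
    y = np.asarray(y_true, dtype=float)
    yhat = np.asarray(y_pred, dtype=float)
    diff = y - yhat
    # tau * (y - yhat) when y >= yhat, (1 - tau) * (yhat - y) otherwise
    return float(np.mean(np.where(diff >= 0, tau * diff, (tau - 1) * diff)))
```

At $\tau = 0.9$, an under-prediction by 2 costs 1.8 while an over-prediction by 2 costs only 0.2 — the 9:1 asymmetry described above.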


G.3 — Ranking Metrics

Ranking metrics evaluate models that produce ordered lists — recommendations, search results, information retrieval. These metrics are central to the StreamRec recommendation system (Chapters 13-14, 24, 36).

G.3.1 — Normalized Discounted Cumulative Gain (NDCG)

Formula:

Discounted Cumulative Gain at rank $K$:

$$\text{DCG@}K = \sum_{i=1}^{K} \frac{\text{rel}_i}{\log_2(i + 1)}$$

where $\text{rel}_i$ is the relevance score of the item at position $i$. NDCG normalizes by the ideal ordering:

$$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$

where $\text{IDCG@}K$ is the DCG of the ideal ranking (items sorted by decreasing relevance).

Interpretation: NDCG ranges from 0 to 1. It accounts for both the relevance of items and their position: a relevant item at position 1 is worth more than the same item at position 10. The logarithmic discount captures the intuition that users are less likely to examine items further down a list.

When to use: When items have graded relevance (not just relevant/irrelevant). When position matters — which it almost always does in recommendation and search.

Common pitfalls:

- NDCG is sensitive to the choice of $K$. NDCG@5 and NDCG@100 can rank models differently. Report at multiple cutoffs, or report the full-list NDCG.
- NDCG is undefined when there are no relevant items in the ground truth for a query (IDCG = 0). Convention: define NDCG = 0 or exclude such queries. Document your choice.
- The relevance scale matters. Binary relevance (0/1) vs. graded relevance (0-4) changes the metric's behavior significantly.

import numpy as np

def ndcg_at_k(relevances: np.ndarray, k: int) -> float:
    """Compute NDCG@k for a single ranked list.

    Args:
        relevances: Array of relevance scores in ranked order.
        k: Cutoff rank.

    Returns:
        NDCG@k score in [0, 1].
    """
    relevances = np.asarray(relevances, dtype=float)
    top_k = relevances[:k]
    if len(top_k) == 0:
        return 0.0

    # DCG over the predicted ordering
    discounts = np.log2(np.arange(len(top_k)) + 2)  # log2(2), log2(3), ...
    dcg = np.sum(top_k / discounts)

    # IDCG: sort the FULL list, then truncate to k. Sorting only the
    # top-k of the predicted ordering would inflate NDCG.
    ideal_relevances = np.sort(relevances)[::-1][:k]
    idcg = np.sum(ideal_relevances / discounts)

    if idcg == 0:
        return 0.0
    return float(dcg / idcg)

G.3.2 — Mean Average Precision (MAP)

Formula:

Average Precision for a single query:

$$\text{AP} = \frac{1}{|\text{relevant}|} \sum_{k=1}^{n} \text{Precision@}k \cdot \mathbb{1}[\text{item}_k \text{ is relevant}]$$

MAP is the mean of AP across all queries:

$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$

Interpretation: AP rewards rankings that place relevant items early. It is equivalent to the area under the precision-recall curve for a single query, with interpolation at the recall levels where relevant items appear.

When to use: When relevance is binary (relevant/irrelevant). When you care about the set of relevant items retrieved, not just the top-1.

Common pitfalls:

- MAP treats all relevant items as equally important. If some relevant items matter more than others, use NDCG with graded relevance.
- MAP is dominated by queries with many relevant items. A query with 100 relevant items contributes more variance to MAP than a query with 2 relevant items.
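MAP operating on per-query binary relevance lists (in ranked order) can be sketched as follows; the function names are ours:

```python
import numpy as np

def average_precision_for_query(relevances) -> float:
    """AP for one query; relevances are binary, in ranked order."""
    rels = np.asarray(relevances, dtype=float)
    n_rel = rels.sum()
    if n_rel == 0:
        return 0.0
    precision_at_k = np.cumsum(rels) / (np.arange(len(rels)) + 1)
    # sum Precision@k over the ranks holding relevant items
    return float(np.sum(precision_at_k * rels) / n_rel)

def mean_average_precision(queries) -> float:
    """MAP: mean of per-query AP."""
    return float(np.mean([average_precision_for_query(q) for q in queries]))
```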

G.3.3 — Mean Reciprocal Rank (MRR)

Formula:

$$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}(q)}$$

where $\text{rank}(q)$ is the rank position of the first relevant item for query $q$.

Interpretation: MRR measures how quickly the model surfaces the first relevant result. MRR = 1.0 means the first item is always relevant; MRR = 0.5 corresponds, for example, to the first relevant item consistently appearing at position 2.

When to use: When only the top result matters (search, question answering). When the user's goal is to find one relevant item, not to browse a set.

Common pitfalls:

- MRR considers only the first relevant item. A ranking with one relevant item at position 1 and ten more at positions 2-11 has the same MRR as a ranking with a single relevant item at position 1 and all others at position 1000.
- MRR gives zero credit to queries with no relevant items in the result list. Either exclude these queries (and report the exclusion count) or define a maximum rank beyond which the MRR contribution is zero.
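MRR over per-query binary relevance lists, using the zero-credit convention for queries with no relevant item (function name ours):

```python
import numpy as np

def mean_reciprocal_rank(ranked_relevances) -> float:
    """MRR; each element is a binary relevance list in ranked order."""
    reciprocal_ranks = []
    for rels in ranked_relevances:
        hits = np.flatnonzero(np.asarray(rels))
        # zero credit when no relevant item appears in the list
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(reciprocal_ranks))
```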

G.3.4 — Recall@k and Hit@k

Formula:

$$\text{Recall@}k = \frac{|\{\text{relevant items in top-}k\}|}{|\{\text{all relevant items}\}|}$$

$$\text{Hit@}k = \mathbb{1}[\text{at least one relevant item in top-}k]$$

Interpretation: Recall@k measures the coverage of relevant items in the top-$k$ results. Hit@k measures whether any relevant item appears. Hit@k is the degenerate case of Recall@k for users with one relevant item.

When to use: Recall@k is the standard metric for the candidate retrieval stage of recommendation systems (Chapter 24). At this stage, the goal is to include the relevant item(s) in a candidate set of $k$ items that will be re-ranked by a more expensive model. For StreamRec's two-tower retrieval model, Recall@100 or Recall@500 measures whether the retrieval stage successfully passes the relevant items to the ranking stage.

Common pitfalls:

- Recall@k is trivially 1.0 for $k$ equal to the catalog size. The value of $k$ must be set to a meaningful operating point.
- Recall@k and Hit@k ignore ranking within the top-$k$. A system that puts the relevant item at position 1 and one that puts it at position $k$ achieve the same Recall@k. Use NDCG or MAP when position within the result set matters.
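Both metrics reduce to set intersections over the top-$k$ of a ranked item list; a minimal sketch (function names ours):

```python
def recall_at_k(ranked_items, relevant_items, k: int) -> float:
    """Fraction of relevant items appearing in the top-k of the ranked list."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = len(set(ranked_items[:k]) & relevant)
    return hits / len(relevant)

def hit_at_k(ranked_items, relevant_items, k: int) -> float:
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return float(bool(set(ranked_items[:k]) & set(relevant_items)))
```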


G.4 — Causal Inference Metrics

Causal metrics evaluate models that estimate treatment effects — the causal impact of an intervention. These metrics are fundamentally different from predictive metrics because the ground truth (the individual treatment effect) is never directly observed. This is the core challenge discussed in Chapters 15-19.

G.4.1 — Average Treatment Effect (ATE)

Formula:

$$\text{ATE} = \mathbb{E}[Y(1) - Y(0)] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]$$

Estimated from data via the difference-in-means (under randomization), IPW, AIPW, or other estimators (Chapter 18):

$$\widehat{\text{ATE}}_{\text{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{W_i Y_i}{e(X_i)} - \frac{(1-W_i) Y_i}{1 - e(X_i)} \right]$$

where $W_i$ is the treatment indicator and $e(X_i)$ is the propensity score.

Interpretation: The average causal effect of the treatment across the entire population. An ATE of 0.05 in the StreamRec context means "on average, recommending an item increases the probability of engagement by 5 percentage points."

When to use: When the policy question is: "What is the average impact of this intervention?" When the treatment would be applied uniformly to the entire population.

Common pitfalls:

- The ATE can be zero even when the treatment helps some people and hurts others. If subgroup effects cancel out, the ATE misses meaningful heterogeneity.
- ATE estimation under observational data requires untestable assumptions (ignorability). Always conduct sensitivity analyses (Chapter 18).
- Confidence intervals for ATE estimates are often wider than practitioners expect. Report them honestly.
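The IPW estimator above is a few lines of NumPy; this sketch (function name ours) adds propensity clipping, a common and consequential modeling choice that should always be reported:

```python
import numpy as np

def ipw_ate(y, w, propensity, clip: float = 0.01) -> float:
    """Inverse-propensity-weighted ATE estimate.

    Propensities are clipped away from 0 and 1 to control variance.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    e = np.clip(np.asarray(propensity, dtype=float), clip, 1 - clip)
    return float(np.mean(w * y / e - (1 - w) * y / (1 - e)))
```

Under randomization with $e(X) = 0.5$, this reduces to twice the difference between treated-weighted and control-weighted outcomes, and matches the difference in means.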

G.4.2 — Conditional Average Treatment Effect (CATE)

Formula:

$$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$$

Estimated using causal forests, meta-learners (S/T/X/R), or DML (Chapter 19).

Interpretation: The average treatment effect for individuals with covariates $X = x$. CATE enables personalized treatment decisions: recommend an item to a user only when the predicted causal effect exceeds a threshold.

When to use: When treatment effects vary across subgroups and you want to identify who benefits most. This is the foundation of personalized medicine (MediCore Pharma, Chapter 19) and uplift-based recommendation targeting (StreamRec, Chapter 19).

Common pitfalls: - CATE estimation is harder than ATE estimation because you are estimating a function, not a single number. Sample sizes for specific subgroups may be insufficient. - There is no direct ground truth for individual CATEs. Evaluation relies on surrogate metrics (see Qini curve below) or randomized experiments. - Overfitting to heterogeneity: causal forests and meta-learners can find "significant" heterogeneity that is actually noise. Use honest estimation (sample-splitting) and calibration checks.

G.4.3 — Qini Coefficient and Qini Curve

Formula:

The Qini curve plots the cumulative uplift (incremental conversions from treatment) against the fraction of the population targeted, where individuals are sorted by predicted CATE in descending order:

$$Q(t) = n_t(1) \cdot \bar{Y}_t(1) - n_t(0) \cdot \bar{Y}_t(0) \cdot \frac{n_t(1)}{n_t(0)}$$

for the top $t$ fraction of the population. The Qini coefficient is the area between the Qini curve and the random targeting baseline.

Interpretation: The Qini curve shows how much additional conversion you gain by targeting the highest-CATE individuals first, compared to random targeting. A model with a higher Qini coefficient is better at identifying who to treat.

When to use: When evaluating uplift models (Chapter 19). When the business question is: "Given a limited treatment budget, who should we target?" This directly applies to the StreamRec recommendation targeting policy.

G.4.4 — Area Under the Uplift Curve (AUUC)

Formula:

$$\text{AUUC} = \int_0^1 U(t) \, dt$$

where $U(t)$ is the uplift at the top-$t$ fraction of the population, normalized by sample sizes.

Interpretation: Similar to the Qini coefficient, AUUC measures the overall quality of an uplift model's ability to rank individuals by treatment effect. Higher AUUC indicates better targeting.

When to use: As an alternative to the Qini coefficient. The two metrics are closely related; AUUC is sometimes preferred because its computation is slightly more straightforward.

Common pitfalls (shared with Qini):

- Both Qini and AUUC require randomized data (or credible causal estimation) to compute. You cannot evaluate an uplift model using observational data without addressing confounding.
- Confidence intervals for Qini/AUUC require bootstrap or permutation methods. They are wider than for predictive metrics like AUC-ROC because the signal (differential treatment effect) is typically smaller than the predictive signal.
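A minimal Qini-curve computation, assuming randomized treatment assignment (function name ours; the curve is zero until both arms appear in the top-$t$ prefix):

```python
import numpy as np

def qini_curve(y, w, uplift_score):
    """Cumulative Qini values Q(t), sorted by predicted uplift (descending)."""
    order = np.argsort(-np.asarray(uplift_score, dtype=float))
    y = np.asarray(y, dtype=float)[order]
    w = np.asarray(w, dtype=float)[order]
    n_t = np.cumsum(w)            # treated count in the top-t prefix
    n_c = np.cumsum(1 - w)        # control count in the top-t prefix
    y_t = np.cumsum(y * w)        # treated conversions
    y_c = np.cumsum(y * (1 - w))  # control conversions
    # Q(t) = Y_t(1) - Y_t(0) * n_t(1) / n_t(0); zero until control appears
    return np.where(n_c > 0, y_t - y_c * n_t / np.maximum(n_c, 1), 0.0)
```

The Qini coefficient is then the area between this curve and the straight line from the origin to its endpoint (random targeting), e.g. via `np.trapz`.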


G.5 — Fairness Metrics

Fairness metrics evaluate whether a model's errors are distributed equitably across demographic groups defined by a protected attribute $A$ (e.g., race, gender, age). These metrics are covered in depth in Chapter 31 and applied throughout the Meridian Financial credit scoring case.

Research Insight: The impossibility theorem (Chouldechova, 2017; Kleinberg, Mullainathan, and Raghavan, 2016) proves that demographic parity, equalized odds, and calibration across groups cannot simultaneously hold except in trivial cases. Selecting a fairness metric is an ethical decision, not a technical one.

G.5.1 — Demographic Parity (Statistical Parity)

Formula:

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b) \quad \forall a, b$$

Or equivalently, the selection rate should be equal across groups.

Interpretation: The model's positive prediction rate is the same regardless of group membership. "The same fraction of applicants from each group are approved."

When to use: When the policy goal is equal representation in outcomes, regardless of qualifications. When there is reason to believe that the base rates in the training data reflect historical discrimination rather than genuine differences.

Common pitfalls:

- Demographic parity requires equal prediction rates, which may conflict with equal accuracy if base rates differ across groups. This is the impossibility theorem in action.
- Enforcing demographic parity can reduce overall accuracy and may disadvantage the very group it is meant to help (if qualified members of the disadvantaged group receive worse service because the threshold is lowered for unqualified members).

G.5.2 — Equalized Odds

Formula:

$$P(\hat{Y} = 1 \mid Y = y, A = a) = P(\hat{Y} = 1 \mid Y = y, A = b) \quad \forall y \in \{0, 1\}, \forall a, b$$

This requires equal True Positive Rate AND equal False Positive Rate across groups.

Equal opportunity is the relaxed version requiring only equal True Positive Rate:

$$P(\hat{Y} = 1 \mid Y = 1, A = a) = P(\hat{Y} = 1 \mid Y = 1, A = b) \quad \forall a, b$$

Interpretation: Among people who actually default (or actually click, or are actually qualified), the model identifies them at the same rate regardless of group. Among people who do not, the model falsely flags them at the same rate regardless of group.

When to use: When errors should be equally distributed across groups, conditional on the true outcome. In the Meridian Financial system, equalized odds means: among borrowers who will default, the false negative rate is the same across racial groups; among borrowers who will repay, the false positive rate is the same.

G.5.3 — Disparate Impact Ratio

Formula:

$$\text{DI} = \frac{P(\hat{Y} = 1 \mid A = \text{disadvantaged})}{P(\hat{Y} = 1 \mid A = \text{advantaged})}$$

Interpretation: The ratio of positive prediction rates between the disadvantaged and advantaged groups. A DI of 1.0 indicates perfect parity. The EEOC's "four-fifths rule" (see below) uses this ratio as a legal threshold.

G.5.4 — The Four-Fifths Rule

Rule: A selection procedure has adverse impact if the selection rate for any protected group is less than four-fifths (80%) of the selection rate for the group with the highest rate:

$$\text{DI} \geq 0.8$$

Interpretation: This is a legal standard, not a statistical one. Failing the four-fifths rule does not prove discrimination, but it triggers closer scrutiny and may require the employer/lender to demonstrate business necessity.

When to use: When regulatory compliance is required (ECOA, Title VII). In the Meridian Financial credit scoring system, every model update must pass the four-fifths rule across all protected categories.

Common pitfalls: - The four-fifths rule is a rough threshold, not a statistical test. Small samples can cause false violations or mask real disparities. - The rule applies to selection rates, not error rates. A model can pass the four-fifths rule while having dramatically different false positive rates across groups.

import numpy as np
from dataclasses import dataclass

@dataclass
class FairnessReport:
    """Compute fairness metrics for binary classification."""

    demographic_parity_diff: float
    equalized_odds_diff: float
    disparate_impact_ratio: float
    four_fifths_pass: bool

    @classmethod
    def compute(
        cls,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        sensitive_attr: np.ndarray,
    ) -> "FairnessReport":
        """Compute fairness metrics across two groups.

        Args:
            y_true: Ground truth labels (0/1).
            y_pred: Predicted labels (0/1).
            sensitive_attr: Group membership (0/1).
        """
        groups = np.unique(sensitive_attr)
        if len(groups) != 2:
            raise ValueError("Binary group attribute required.")

        mask_a = sensitive_attr == groups[0]
        mask_b = sensitive_attr == groups[1]

        # Selection rates
        rate_a = y_pred[mask_a].mean()
        rate_b = y_pred[mask_b].mean()
        dp_diff = abs(rate_a - rate_b)

        # Disparate impact as the symmetric min/max ratio, so DI <= 1 regardless
        # of which group is coded as disadvantaged (cf. the directional formula in G.5.3).
        di = min(rate_a, rate_b) / max(rate_a, rate_b) if max(rate_a, rate_b) > 0 else 0.0

        # Equalized odds: max of TPR diff and FPR diff
        tpr_a = y_pred[(y_true == 1) & mask_a].mean() if ((y_true == 1) & mask_a).sum() > 0 else 0
        tpr_b = y_pred[(y_true == 1) & mask_b].mean() if ((y_true == 1) & mask_b).sum() > 0 else 0
        fpr_a = y_pred[(y_true == 0) & mask_a].mean() if ((y_true == 0) & mask_a).sum() > 0 else 0
        fpr_b = y_pred[(y_true == 0) & mask_b].mean() if ((y_true == 0) & mask_b).sum() > 0 else 0
        eo_diff = max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))

        return cls(
            demographic_parity_diff=dp_diff,
            equalized_odds_diff=eo_diff,
            disparate_impact_ratio=di,
            four_fifths_pass=di >= 0.8,
        )

G.6 — Calibration Metrics and Diagnostics

Calibration measures whether predicted probabilities match observed frequencies. A model is well-calibrated if, among all instances where it predicts $\hat{p} = 0.7$, approximately 70% are actually positive. Calibration is the central topic of Chapter 34 (Uncertainty Quantification).

G.6.1 — Expected Calibration Error (ECE)

See Section G.1.10 for the formula and basic treatment. Here we cover advanced diagnostics.

Choosing the number of bins: Too few bins (e.g., $M = 5$) averages over broad probability ranges and misses local miscalibration. Too many bins (e.g., $M = 100$) produces noisy estimates because many bins have few samples. A practical default is $M = 15$ with equal-width bins, or $M = 10$ with equal-count (quantile) bins.

Classwise ECE: For multiclass problems, compute ECE separately for each class and average. This avoids masking miscalibration in minority classes.
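Classwise ECE can be sketched as a one-vs-rest loop over the probability columns. A minimal illustration, assuming `probs` is an $(n, K)$ array of class probabilities (the function name and 15-bin default are ours, not a standard API):

```python
import numpy as np

def classwise_ece(y_true: np.ndarray, probs: np.ndarray, n_bins: int = 15) -> float:
    """Average the one-vs-rest ECE over all K classes.

    y_true: integer labels in {0, ..., K-1}.
    probs:  (n, K) array of predicted class probabilities.
    """
    n, k = probs.shape
    edges = np.linspace(0, 1, n_bins + 1)
    total = 0.0
    for c in range(k):
        p_c = probs[:, c]
        y_c = (y_true == c).astype(float)  # one-vs-rest binary labels
        ece_c = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            # Half-open bins, except the last bin which includes 1.0
            mask = (p_c >= lo) & (p_c < hi) if hi < 1 else (p_c >= lo) & (p_c <= hi)
            if mask.sum() == 0:
                continue
            ece_c += (mask.sum() / n) * abs(y_c[mask].mean() - p_c[mask].mean())
        total += ece_c
    return total / k
```

Averaging per-class ECE this way prevents a well-calibrated majority class from masking a badly calibrated minority class.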

G.6.2 — Reliability Diagrams

A reliability diagram plots the observed accuracy (y-axis) against the predicted confidence (x-axis) for each bin. A perfectly calibrated model falls on the diagonal $y = x$.

Reading the diagram: - Points above the diagonal: the model is under-confident (it predicts 0.6 but the actual rate is 0.8) - Points below the diagonal: the model is over-confident (it predicts 0.8 but the actual rate is 0.6) - The vertical gap between a point and the diagonal is the calibration error for that bin

Common diagnostic patterns: - Sigmoid-shaped: over-confident at extremes, reasonable in the middle — common for uncalibrated neural networks - Flat: model outputs cluster in a narrow probability range regardless of true label frequency — the model has not learned to distinguish confidence levels - Shifted: systematically above or below the diagonal — the model is systematically biased

import numpy as np

def reliability_diagram(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    n_bins: int = 15,
    strategy: str = "uniform",
) -> tuple[np.ndarray, np.ndarray, np.ndarray, float]:
    """Compute reliability diagram data and ECE.

    Args:
        y_true: Binary labels.
        y_prob: Predicted probabilities.
        n_bins: Number of bins.
        strategy: 'uniform' (equal-width) or 'quantile' (equal-count).

    Returns:
        Tuple of (bin_centers, bin_accuracies, bin_counts, ece).
    """
    if strategy == "quantile":
        bin_edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
        bin_edges = np.unique(bin_edges)
    else:
        bin_edges = np.linspace(0, 1, n_bins + 1)

    bin_centers = []
    bin_accs = []
    bin_counts = []
    ece = 0.0
    n = len(y_true)

    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Include the left edge only in the first bin so every point lands in exactly
        # one bin. Compare against bin_edges[0], not 0: with quantile bins the first
        # edge equals min(y_prob), which may be > 0.
        mask = (y_prob >= lo) & (y_prob <= hi) if lo == bin_edges[0] else (y_prob > lo) & (y_prob <= hi)
        count = mask.sum()
        if count == 0:
            continue
        acc = y_true[mask].mean()
        conf = y_prob[mask].mean()
        bin_centers.append(conf)
        bin_accs.append(acc)
        bin_counts.append(count)
        ece += (count / n) * abs(acc - conf)

    return np.array(bin_centers), np.array(bin_accs), np.array(bin_counts), ece

G.6.3 — Brier Score Decomposition

The Brier score admits a three-term decomposition that separates calibration from discrimination:

$$\text{Brier} = \underbrace{\frac{1}{n} \sum_{m=1}^{M} n_m (\bar{p}_m - \bar{y}_m)^2}_{\text{Reliability (calibration error)}} - \underbrace{\frac{1}{n} \sum_{m=1}^{M} n_m (\bar{y}_m - \bar{y})^2}_{\text{Resolution (discrimination)}} + \underbrace{\bar{y}(1 - \bar{y})}_{\text{Uncertainty (irreducible)}}$$

where $n_m$ is the number of predictions in bin $m$, $\bar{p}_m$ is the average predicted probability in bin $m$, $\bar{y}_m$ is the average observed outcome in bin $m$, and $\bar{y}$ is the overall base rate.

Interpretation: - Reliability measures calibration error. Lower is better. A perfectly calibrated model has reliability = 0. - Resolution measures discrimination — how much the model's predictions vary with the true outcome. Higher is better. - Uncertainty is a property of the data, not the model. It is maximized at $\bar{y} = 0.5$.

When to use: When you need to diagnose why a model has a high Brier score: is it poorly calibrated, or does it lack discrimination? This decomposition directly maps to the recalibration vs. retraining decision.
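The decomposition is straightforward to verify numerically. The sketch below (a minimal implementation with our own binning choices) computes the three terms; note that the identity Brier = Reliability - Resolution + Uncertainty holds exactly only when predictions are constant within each bin, and approximately for continuous scores:

```python
import numpy as np

def brier_decomposition(
    y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10
) -> tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) for binary outcomes."""
    n = len(y_true)
    y_bar = y_true.mean()
    edges = np.linspace(0, 1, n_bins + 1)
    # Assign each prediction to an equal-width bin
    bin_idx = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    reliability = 0.0
    resolution = 0.0
    for m in range(n_bins):
        mask = bin_idx == m
        n_m = mask.sum()
        if n_m == 0:
            continue
        p_m = y_prob[mask].mean()  # mean forecast in bin m
        o_m = y_true[mask].mean()  # observed frequency in bin m
        reliability += n_m * (p_m - o_m) ** 2
        resolution += n_m * (o_m - y_bar) ** 2
    uncertainty = y_bar * (1 - y_bar)
    return reliability / n, resolution / n, uncertainty
```

A high reliability term points to recalibration (e.g., Platt scaling or isotonic regression); a low resolution term points to retraining with better features.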


G.7 — Metric Selection Decision Flowcharts

G.7.1 — Classification Metric Selection

START: What type of output does your model produce?
  |
  ├── Hard labels only → Is class balance within 60/40?
  |     ├── Yes → Are FP and FN costs equal?
  |     |     ├── Yes → Accuracy or F1
  |     |     └── No → F_β (set β based on cost ratio)
  |     └── No → F1 (macro-averaged) or class-specific precision/recall
  |
  └── Probabilities → Do you need calibrated probabilities?
        ├── Yes → Log loss or Brier score (primary) + ECE diagnostic
        |         + AUC-ROC or AUC-PR (secondary, for ranking)
        └── No  → Is the dataset imbalanced (>90/10)?
              ├── Yes → AUC-PR (primary) + Recall@k if retrieval stage
              └── No  → AUC-ROC (primary) + AUC-PR (secondary)
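The flowchart above can be transcribed into a small helper for quick lookups. The thresholds (60/40, 90/10) and return strings below simply encode the chart; the function name and signature are illustrative, not a standard API:

```python
def suggest_classification_metric(
    outputs: str,             # "labels" or "probabilities"
    minority_share: float,    # fraction of samples in the rarer class
    costs_equal: bool = True,
    need_calibration: bool = False,
) -> str:
    """Encode the G.7.1 decision flowchart as a lookup (illustrative only)."""
    if outputs == "labels":
        if minority_share >= 0.4:  # within 60/40 balance
            if costs_equal:
                return "Accuracy or F1"
            return "F-beta (set beta from the FP/FN cost ratio)"
        return "Macro-F1 or class-specific precision/recall"
    # Probability outputs
    if need_calibration:
        return "Log loss or Brier (primary) + ECE; AUC-ROC/PR for ranking"
    if minority_share <= 0.1:  # more imbalanced than 90/10
        return "AUC-PR (primary), Recall@k for retrieval stages"
    return "AUC-ROC (primary) + AUC-PR (secondary)"
```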

G.7.2 — Regression Metric Selection

START: What is the error distribution of your problem?
  |
  ├── Approximately Gaussian → RMSE + R²
  ├── Heavy-tailed / outlier-prone → MAE
  ├── Need relative (%) errors → MAPE (if no zeros in target)
  |                                sMAPE (if zeros possible)
  └── Asymmetric costs → Quantile loss at appropriate τ

ADDITIONALLY: Is the target scale consistent across segments?
  ├── Yes → Use absolute metrics (RMSE, MAE)
  └── No  → Use relative metrics (MAPE, sMAPE) or per-segment evaluation

G.7.3 — Ranking Metric Selection

START: What is the user's task?
  |
  ├── Find one good result → MRR
  ├── Browse a set of results → Is relevance graded or binary?
  |     ├── Graded → NDCG@k
  |     └── Binary → MAP
  └── Candidate retrieval (pre-ranking) → Recall@k or Hit@k

G.7.4 — Causal Metric Selection

START: What is the causal question?
  |
  ├── Average population effect → ATE (with CI)
  ├── Effect for subgroups → CATE
  └── Who to target? → Qini / AUUC (requires RCT or credible
                        quasi-experimental estimation)

G.8 — Common Cross-Cutting Pitfalls

G.8.1 — Metric Gaming

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Any single metric can be gamed. The StreamRec recommendation system learned this lesson: optimizing for click-through rate alone led to clickbait recommendations. The solution: a balanced scorecard of metrics (engagement, completion rate, diversity, fairness) with explicit constraints.

G.8.2 — Statistical Significance of Metric Differences

A model with AUC = 0.831 is not necessarily better than a model with AUC = 0.829 — a difference that small is often within sampling noise. Always compute confidence intervals for metrics using bootstrap or analytic methods, and report the $p$-value of the metric difference using a paired test (e.g., the DeLong test for AUC-ROC, paired bootstrap for other metrics).

import numpy as np
from typing import Callable

def bootstrap_metric_ci(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    metric_fn: Callable[[np.ndarray, np.ndarray], float],
    n_bootstrap: int = 10000,
    alpha: float = 0.05,
    seed: int = 42,
) -> tuple[float, float, float]:
    """Compute bootstrap confidence interval for any metric.

    Returns:
        Tuple of (point_estimate, lower_ci, upper_ci).
    """
    rng = np.random.RandomState(seed)
    n = len(y_true)
    point_est = metric_fn(y_true, y_pred)

    boot_estimates = []
    for _ in range(n_bootstrap):
        idx = rng.randint(0, n, size=n)
        boot_estimates.append(metric_fn(y_true[idx], y_pred[idx]))

    lower = np.percentile(boot_estimates, 100 * alpha / 2)
    upper = np.percentile(boot_estimates, 100 * (1 - alpha / 2))
    return point_est, lower, upper
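The function above gives a CI for a single model; comparing two models on the same test set calls for a paired bootstrap that resamples identical indices for both. A minimal sketch (the function name is ours, and the two-sided $p$-value uses the usual sign-crossing approximation):

```python
import numpy as np
from typing import Callable

def paired_bootstrap_diff(
    y_true: np.ndarray,
    pred_a: np.ndarray,
    pred_b: np.ndarray,
    metric_fn: Callable[[np.ndarray, np.ndarray], float],
    n_bootstrap: int = 5000,
    seed: int = 42,
) -> tuple[float, float]:
    """Paired bootstrap for the difference metric(A) - metric(B).

    Returns (observed_diff, approximate two-sided p-value).
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    observed = metric_fn(y_true, pred_a) - metric_fn(y_true, pred_b)
    diffs = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)  # same resample for both models
        diffs[b] = metric_fn(y_true[idx], pred_a[idx]) - metric_fn(y_true[idx], pred_b[idx])
    # Fraction of bootstrap differences on the far side of zero, doubled
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)
```

Resampling the same indices for both models removes the between-sample variance that an unpaired comparison would add, making the test more sensitive to real differences.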

G.8.3 — Train-Test Leakage

Any metric computed on data that overlaps with training data is invalid. This is obvious in principle and surprisingly common in practice. Temporal leakage (using future data to predict the past) is the most insidious form, especially in time series and causal inference settings (Chapters 18, 23, 25).
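A minimal guard against temporal leakage is to split on time rather than at random. The helper below is an illustrative sketch (the function name is ours) that guarantees every test sample postdates every training sample:

```python
import numpy as np

def temporal_train_test_split(
    X: np.ndarray,
    y: np.ndarray,
    timestamps: np.ndarray,
    test_frac: float = 0.2,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Split so that all test samples come strictly after all training samples."""
    order = np.argsort(timestamps)           # oldest first
    cut = int(len(order) * (1 - test_frac))  # boundary between past and future
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```

A random split would scatter future rows into the training set, letting the model "see" information that would not exist at prediction time.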

G.8.4 — Metric Aggregation Across Subgroups

Overall metrics can mask subgroup failures. A recommendation model with NDCG@10 = 0.45 overall may have NDCG@10 = 0.52 for active users and NDCG@10 = 0.12 for new users. Always stratify metrics by key segments: user activity level, item popularity, demographic group, time period.
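Stratified evaluation needs no special tooling; a dictionary comprehension over segment values suffices. A minimal sketch, assuming `segments` is an array of group labels aligned with the predictions:

```python
import numpy as np
from typing import Callable

def stratified_metric(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    segments: np.ndarray,
    metric_fn: Callable[[np.ndarray, np.ndarray], float],
) -> dict:
    """Compute a metric separately for each segment value."""
    return {
        seg: metric_fn(y_true[segments == seg], y_pred[segments == seg])
        for seg in np.unique(segments)
    }
```

Reporting the full dictionary (with per-segment sample counts) alongside the overall metric makes subgroup failures impossible to miss.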


G.9 — Quick Reference Table

| Metric | Task | Range | Better | Threshold-Free? | Handles Imbalance? |
|---|---|---|---|---|---|
| Accuracy | Classification | [0, 1] | Higher | No | Poorly |
| Precision | Classification | [0, 1] | Higher | No | Moderately |
| Recall | Classification | [0, 1] | Higher | No | Well (for + class) |
| F1 | Classification | [0, 1] | Higher | No | Moderately |
| AUC-ROC | Classification | [0, 1] | Higher | Yes | Poorly |
| AUC-PR | Classification | [$\pi$, 1] | Higher | Yes | Well |
| Log Loss | Classification | [0, $\infty$) | Lower | Yes | Yes |
| Brier Score | Classification | [0, 1] | Lower | Yes | Moderately |
| ECE | Calibration | [0, 1] | Lower | Yes | Moderately |
| MAE | Regression | [0, $\infty$) | Lower | N/A | N/A |
| RMSE | Regression | [0, $\infty$) | Lower | N/A | N/A |
| MAPE | Regression | [0, $\infty$) | Lower | N/A | N/A |
| $R^2$ | Regression | ($-\infty$, 1] | Higher | N/A | N/A |
| Quantile Loss | Regression | [0, $\infty$) | Lower | N/A | N/A |
| NDCG@k | Ranking | [0, 1] | Higher | N/A | N/A |
| MAP | Ranking | [0, 1] | Higher | N/A | N/A |
| MRR | Ranking | [0, 1] | Higher | N/A | N/A |
| Recall@k | Ranking | [0, 1] | Higher | N/A | N/A |
| ATE | Causal | ($-\infty$, $\infty$) | Depends | N/A | N/A |
| CATE | Causal | ($-\infty$, $\infty$) | Depends | N/A | N/A |
| Qini Coeff. | Causal | ($-\infty$, $\infty$) | Higher | N/A | N/A |
| Dem. Parity | Fairness | [0, 1] | Lower diff | No | N/A |
| Eq. Odds | Fairness | [0, 1] | Lower diff | No | N/A |
| Disp. Impact | Fairness | [0, $\infty$) | Closer to 1 | No | N/A |

This appendix is a living reference. Return to it whenever you evaluate a model, and remember: the metric you choose defines what "good" means. Choose deliberately.