
Chapter 34: Uncertainty Quantification — Calibration, Conformal Prediction, and Knowing What Your Model Doesn't Know

"A model that says '80% sure' but is right 50% of the time is worse than one that says 'I don't know.'" — Tilmann Gneiting and Adrian E. Raftery, "Strictly Proper Scoring Rules, Prediction, and Estimation" (JASA, 2007)


Learning Objectives

By the end of this chapter, you will be able to:

  1. Distinguish aleatoric from epistemic uncertainty and explain why both matter for trustworthy prediction
  2. Evaluate and improve model calibration using reliability diagrams, expected calibration error (ECE), and recalibration methods (Platt scaling, isotonic regression, temperature scaling)
  3. Implement conformal prediction for distribution-free prediction sets with finite-sample coverage guarantees
  4. Apply Monte Carlo dropout and deep ensembles to estimate epistemic uncertainty in neural networks
  5. Use uncertainty estimates for downstream decision-making: abstention policies, active learning selection, and risk-aware deployment

34.1 The Problem: Models That Are Confidently Wrong

On September 3, 2025, a production credit scoring model at Meridian Financial approved a loan application with predicted probability of default $\hat{p} = 0.04$ — a 96% confidence that the borrower would repay. The borrower defaulted within six months. This is not, by itself, a failure: even a perfectly calibrated model will have 4% of its "$\hat{p} = 0.04$" predictions default. The failure was systemic. When the risk team audited the model's predictions over the previous quarter, they found that among all applications where the model predicted $\hat{p} \leq 0.05$, the actual default rate was 11.2%. The model was not 96% confident and right — it was 96% confident and wrong more than twice as often as it claimed.

This is a calibration failure, and it is the norm rather than the exception for modern neural networks. Guo et al. (2017) showed that deep networks trained with modern techniques (batch normalization, dropout, weight decay, cross-entropy loss) produce probability estimates that are systematically overconfident. A ResNet-110 on CIFAR-100 predicted class probabilities above 0.9 for 85% of test examples, but the actual accuracy among those predictions was only 72%. The model was not uncertain enough.

The consequences of overconfidence are not merely statistical. In the Meridian Financial case, the calibration failure caused the risk model to underestimate expected loss by 2.3x, leading to $14M in excess charge-offs in Q3 alone. In clinical settings, overconfident diagnostic models lead physicians to skip confirmatory tests. In autonomous systems, overconfident perception models fail to trigger fallback behaviors. In climate modeling, underestimating model uncertainty leads policymakers to ignore scenario ranges that include catastrophic outcomes.

Uncertainty quantification (UQ) is the field that addresses this problem. It encompasses a family of techniques for (1) measuring how well a model's confidence scores match reality (calibration), (2) producing prediction sets with formal coverage guarantees (conformal prediction), (3) decomposing uncertainty into reducible and irreducible components (epistemic vs. aleatoric), and (4) using uncertainty estimates to make better decisions (abstention, active learning, risk assessment).

Know How Your Model Is Wrong (Theme 4): This chapter is the most direct expression of Part VI's central theme. Chapters 31-32 asked how your model is wrong about specific groups (fairness) and whether it leaks private information (privacy). This chapter asks how your model is wrong about its own confidence. A model that produces well-calibrated, honest uncertainty estimates is not necessarily more accurate — but it is far more useful for human decision-making, because it allows downstream systems and human operators to know when to trust and when to intervene.

Two Types of Uncertainty

The literature distinguishes two fundamentally different sources of uncertainty:

Aleatoric uncertainty (from Latin aleator, "dice player") is irreducible randomness in the data-generating process. Even with perfect knowledge of all relevant features, there is inherent noise in outcomes. A patient's response to a drug depends on genetic variants, microbiome composition, stress levels, and countless factors that no model can capture. A user's click on a recommendation depends on their mood, the time of day, and whether their phone buzzed with a notification at the wrong moment. Aleatoric uncertainty is a property of the world, not of the model. More data and better models cannot reduce it.

Epistemic uncertainty (from Greek episteme, "knowledge") is uncertainty due to limited knowledge — insufficient data, model misspecification, or ambiguity about the right hypothesis. A credit scoring model is epistemically uncertain about applicants from demographic segments underrepresented in training data. A climate model is epistemically uncertain about cloud feedback parameters because observational constraints are insufficient to pin down the physics. Epistemic uncertainty is a property of the model, and it can be reduced by collecting more data, using a more flexible model, or incorporating domain knowledge.

| Property | Aleatoric | Epistemic |
|---|---|---|
| Source | Data-generating process | Limited knowledge |
| Reducible? | No | Yes (with more data/better model) |
| Depends on input $x$? | Yes (heteroscedastic) | Yes (high in sparse regions) |
| Example (climate) | Chaotic weather variability | Uncertain cloud feedback parameters |
| Example (credit) | Inherent unpredictability of individual behavior | Insufficient data for rare applicant profiles |
| Example (pharma) | Biological variation in drug response | Limited trial size for rare subgroups |
| Estimation technique | Heteroscedastic regression, quantile regression | Ensembles, MC dropout, Bayesian methods |

The distinction is essential because the appropriate response differs. Aleatoric uncertainty should be communicated to decision-makers ("this patient's outcome is inherently unpredictable; plan for a wide range of scenarios"). Epistemic uncertainty should trigger data collection ("the model is uncertain about this region of input space; we need more training examples here").
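The distinction also becomes operational with an ensemble (previewed here, developed in Sections 34.6-34.7): by the law of total variance, the predictive variance splits into the average of the members' predicted variances (aleatoric) plus the variance of the members' predicted means (epistemic). A minimal numeric sketch, with made-up member predictions for a single input:

```python
import numpy as np

# Hypothetical ensemble output for one input x: each of five members
# predicts a mean and a noise variance. Law-of-total-variance split:
#   aleatoric = mean of predicted variances (irreducible noise)
#   epistemic = variance of predicted means (model disagreement)
member_means = np.array([2.1, 1.9, 2.0, 2.2, 1.8])
member_vars = np.array([0.50, 0.45, 0.55, 0.48, 0.52])

aleatoric = member_vars.mean()
epistemic = member_means.var()
total = aleatoric + epistemic

print(f"aleatoric={aleatoric:.3f}  epistemic={epistemic:.3f}  total={total:.3f}")
# aleatoric=0.500  epistemic=0.020  total=0.520
```

Here the members agree closely (low epistemic uncertainty), so most of the predictive variance is irreducible noise; in a sparse region of input space the ratio would flip.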


34.2 Calibration: Does 80% Confident Mean 80% Correct?

34.2.1 The Formal Definition

A probabilistic classifier $f: \mathcal{X} \to [0, 1]$ is perfectly calibrated if, for all $p \in [0, 1]$:

$$P(Y = 1 \mid f(X) = p) = p$$

In words: among all instances where the model predicts probability $p$, the fraction that are actually positive is $p$. If the model says "70% chance of default," then exactly 70% of such cases should default.

Perfect calibration is a necessary condition for a model's probability outputs to be interpretable as true probabilities. A model that is well-calibrated but has low discrimination (AUC close to 0.5) is still useful for risk assessment — it gives honest estimates of risk, even if it cannot rank-order individuals well. A model with high discrimination but poor calibration is useful for ranking but misleading for absolute risk estimation.

34.2.2 Reliability Diagrams

The standard visualization tool for calibration is the reliability diagram (also called a calibration plot). The procedure:

  1. Collect predicted probabilities $\hat{p}_i$ and true labels $y_i$ for a test set of $n$ examples.
  2. Partition the predictions into $M$ bins $B_1, \ldots, B_M$ based on the predicted probability (e.g., $B_1 = [0, 0.1), B_2 = [0.1, 0.2), \ldots$).
  3. For each bin $B_m$, compute:
     - Mean predicted probability: $\bar{p}_m = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$
     - Observed frequency: $\bar{o}_m = \frac{1}{|B_m|} \sum_{i \in B_m} y_i$
  4. Plot $\bar{p}_m$ (x-axis) against $\bar{o}_m$ (y-axis). A perfectly calibrated model lies on the diagonal $y = x$.
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CalibrationDiagnostics:
    """Comprehensive calibration analysis for a probabilistic classifier.

    Computes reliability diagrams, ECE, MCE, and per-bin statistics.
    Used for all three anchors: credit scoring, climate prediction,
    and pharma treatment effect estimation.
    """
    y_true: np.ndarray
    y_prob: np.ndarray
    n_bins: int = 15
    strategy: str = "uniform"  # "uniform" or "quantile"

    # Computed attributes
    bin_edges: np.ndarray = field(init=False, repr=False)
    bin_counts: np.ndarray = field(init=False, repr=False)
    bin_mean_predicted: np.ndarray = field(init=False, repr=False)
    bin_mean_observed: np.ndarray = field(init=False, repr=False)
    ece: float = field(init=False)
    mce: float = field(init=False)

    def __post_init__(self):
        self.y_true = np.asarray(self.y_true, dtype=float)
        self.y_prob = np.asarray(self.y_prob, dtype=float)
        self._compute_bins()
        self._compute_ece_mce()

    def _compute_bins(self):
        """Bin predictions and compute per-bin statistics."""
        if self.strategy == "uniform":
            self.bin_edges = np.linspace(0, 1, self.n_bins + 1)
        elif self.strategy == "quantile":
            quantiles = np.linspace(0, 100, self.n_bins + 1)
            self.bin_edges = np.percentile(self.y_prob, quantiles)
            self.bin_edges[0] = 0.0
            self.bin_edges[-1] = 1.0
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

        self.bin_counts = np.zeros(self.n_bins, dtype=int)
        self.bin_mean_predicted = np.full(self.n_bins, np.nan)
        self.bin_mean_observed = np.full(self.n_bins, np.nan)

        for m in range(self.n_bins):
            if m < self.n_bins - 1:
                mask = (self.y_prob >= self.bin_edges[m]) & (
                    self.y_prob < self.bin_edges[m + 1]
                )
            else:
                mask = (self.y_prob >= self.bin_edges[m]) & (
                    self.y_prob <= self.bin_edges[m + 1]
                )

            count = mask.sum()
            self.bin_counts[m] = count
            if count > 0:
                self.bin_mean_predicted[m] = self.y_prob[mask].mean()
                self.bin_mean_observed[m] = self.y_true[mask].mean()

    def _compute_ece_mce(self):
        """Expected and maximum calibration error."""
        n = len(self.y_true)
        nonempty = self.bin_counts > 0
        gaps = np.abs(
            self.bin_mean_predicted[nonempty] - self.bin_mean_observed[nonempty]
        )
        weights = self.bin_counts[nonempty] / n
        self.ece = float(np.sum(weights * gaps))
        self.mce = float(np.max(gaps)) if len(gaps) > 0 else 0.0

    def plot(self, ax: Optional[plt.Axes] = None, label: str = "Model") -> plt.Axes:
        """Plot the reliability diagram."""
        if ax is None:
            fig, ax = plt.subplots(1, 1, figsize=(6, 6))

        nonempty = self.bin_counts > 0
        ax.plot([0, 1], [0, 1], "k--", linewidth=1, label="Perfect calibration")
        ax.bar(
            self.bin_mean_predicted[nonempty],
            self.bin_mean_observed[nonempty],
            width=1.0 / self.n_bins * 0.8,
            alpha=0.6,
            edgecolor="black",
            label=label,
        )
        ax.set_xlabel("Mean predicted probability")
        ax.set_ylabel("Observed frequency")
        ax.set_title(f"Reliability Diagram (ECE = {self.ece:.4f})")
        ax.legend(loc="upper left")
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.set_aspect("equal")
        return ax

    def summary(self) -> Dict[str, float]:
        """Return summary calibration metrics."""
        nonempty = self.bin_counts > 0
        return {
            "ece": self.ece,
            "mce": self.mce,
            "n_samples": len(self.y_true),
            "n_bins_nonempty": int(nonempty.sum()),
            "mean_predicted": float(self.y_prob.mean()),
            "mean_observed": float(self.y_true.mean()),
            "prevalence": float(self.y_true.mean()),
        }

34.2.3 Expected Calibration Error (ECE)

The expected calibration error (Naeini et al., 2015) is the standard scalar summary of miscalibration:

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \bar{o}_m - \bar{p}_m \right|$$

ECE is the weighted average of per-bin calibration gaps, where the weight is the fraction of samples in each bin. A perfectly calibrated model has ECE = 0. In practice:

| ECE Range | Interpretation |
|---|---|
| < 0.02 | Excellent calibration |
| 0.02 - 0.05 | Good calibration (acceptable for most applications) |
| 0.05 - 0.10 | Moderate miscalibration (recalibration recommended) |
| 0.10 - 0.20 | Poor calibration (recalibration required) |
| > 0.20 | Severe miscalibration (predictions not trustworthy as probabilities) |

The maximum calibration error (MCE) is $\max_m |\bar{o}_m - \bar{p}_m|$ — the worst-case bin. MCE is important in safety-critical applications where any bin's miscalibration could have severe consequences.

Caveats about ECE:

  1. Bin sensitivity. ECE depends on the number of bins $M$ and the binning strategy. Fewer bins smooth out calibration gaps; more bins increase variance. Quantile binning (equal samples per bin) is generally more stable than uniform binning (equal-width bins).
  2. ECE can be gamed. A model that predicts the overall prevalence $P(Y=1)$ for every input has ECE = 0 but zero discrimination. ECE measures only calibration, not sharpness or discrimination.
  3. Classwise ECE. In multi-class settings, ECE should be computed per class and averaged, not computed on the top-1 predicted probability alone — the latter can hide systematic miscalibration of lower-confidence classes.
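The bin-sensitivity caveat is easy to demonstrate: the same synthetic overconfident scores yield different ECE estimates as $M$ varies. A self-contained sketch (the `ece_uniform` helper is a minimal re-implementation, independent of the `CalibrationDiagnostics` class above; the data is simulated):

```python
import numpy as np

def ece_uniform(y_true, y_prob, n_bins):
    """ECE with uniform-width bins (minimal re-implementation)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, total = len(y_true), 0.0
    for m in range(n_bins):
        upper = y_prob <= edges[m + 1] if m == n_bins - 1 else y_prob < edges[m + 1]
        mask = (y_prob >= edges[m]) & upper
        if mask.any():
            total += mask.sum() / n * abs(y_prob[mask].mean() - y_true[mask].mean())
    return total

# Labels drawn from true probabilities; scores mildly overconfident
# (pushed away from 0.5)
rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=5000)
y = (rng.uniform(size=5000) < p_true).astype(float)
p_hat = np.clip(p_true + 0.3 * (p_true - 0.5), 0.0, 1.0)

for m in (5, 10, 20, 50):
    print(f"M={m:2d}  ECE={ece_uniform(y, p_hat, m):.4f}")
```

The printed values drift with $M$ even though the underlying miscalibration is fixed, which is why ECE comparisons are only meaningful at a fixed binning scheme.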

34.2.4 Modern Neural Networks Are Overconfident

Guo et al. (2017) demonstrated a striking historical reversal. Before the modern era of deep learning — before batch normalization, aggressive data augmentation, and long training schedules — neural networks were reasonably well-calibrated. A 5-layer LeNet on CIFAR-10 in 2000 had ECE around 0.02. A ResNet-110 on CIFAR-10 in 2017 had ECE around 0.05, despite being far more accurate. On CIFAR-100 (a harder task), the ResNet-110 had ECE around 0.17 — severely miscalibrated.

The cause is increased model capacity combined with NLL (cross-entropy) loss optimization. Cross-entropy loss is a strictly proper scoring rule: it is minimized by the true conditional probabilities. But modern networks have enough capacity to fit training data perfectly, and NLL continues to decrease as the model pushes logits further apart — making the softmax output more peaked — even after the predictions are already correct. The result is systematically overconfident predictions: the model learns to predict probabilities close to 0 and 1 even when the true conditional probability is moderate.
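The mechanism can be reproduced in a few lines: scaling up the logits of an already-correct prediction leaves the argmax unchanged but keeps reducing the cross-entropy loss while pushing the predicted confidence toward 1 (the logit values here are arbitrary):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

# Correct prediction (true class = index 0). Larger logit scales mean
# lower NLL but ever-higher confidence -- the overconfidence mechanism.
logits = np.array([2.0, 1.0, 0.5])
for scale in (1, 2, 5, 10):
    p = softmax(scale * logits)
    print(f"scale={scale:2d}  confidence={p[0]:.4f}  NLL={-np.log(p[0]):.4f}")
```

Nothing in the training objective penalizes this: once the example is classified correctly, NLL rewards sharpening indefinitely, and only regularization or early stopping limits it.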


34.3 Recalibration: Post-Hoc Correction of Confidence Scores

Recalibration methods learn a mapping from a model's raw scores to calibrated probabilities using a held-out calibration set (distinct from both the training and test sets). The key insight: recalibration is a post-processing step that does not change the model's ranking of examples (discrimination is preserved), only the absolute probability values.

34.3.1 Platt Scaling

Platt scaling (Platt, 1999) fits a logistic regression on the model's raw logits $z = f(x)$ using the calibration set:

$$q = \sigma(a \cdot z + b)$$

where $\sigma$ is the sigmoid function and $a, b$ are learned parameters. The logistic regression is fit by minimizing NLL on the calibration set.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


class PlattScaler:
    """Platt scaling: logistic recalibration of model logits.

    Learns parameters a, b such that sigmoid(a * logit + b) is calibrated.
    Preserves the model's ranking (monotonic transformation of logits).
    """

    def __init__(self):
        self.a = None
        self.b = None

    def fit(
        self,
        logits: np.ndarray,
        labels: np.ndarray,
        lr: float = 0.01,
        max_iter: int = 1000,
    ) -> "PlattScaler":
        """Fit Platt scaling parameters on calibration set."""
        logits_t = torch.tensor(logits, dtype=torch.float32).unsqueeze(1)
        labels_t = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)

        # Initialize a=1, b=0 (identity before calibration)
        a_param = nn.Parameter(torch.ones(1))
        b_param = nn.Parameter(torch.zeros(1))

        optimizer = optim.LBFGS([a_param, b_param], lr=lr, max_iter=max_iter)
        criterion = nn.BCEWithLogitsLoss()

        def closure():
            optimizer.zero_grad()
            scaled_logits = a_param * logits_t + b_param
            loss = criterion(scaled_logits, labels_t)
            loss.backward()
            return loss

        optimizer.step(closure)

        self.a = a_param.item()
        self.b = b_param.item()
        return self

    def calibrate(self, logits: np.ndarray) -> np.ndarray:
        """Apply learned Platt scaling to new logits."""
        if self.a is None:
            raise RuntimeError("Call fit() before calibrate()")
        scaled = self.a * logits + self.b
        return 1.0 / (1.0 + np.exp(-scaled))


class TemperatureScaler:
    """Temperature scaling: single-parameter recalibration.

    Learns a temperature T > 0 such that softmax(logits / T) is calibrated.
    Preserves the model's ranking (dividing logits by a constant
    does not change argmax). Simpler than Platt scaling (1 parameter
    vs. 2) but often equally effective for multi-class calibration.

    Reference: Guo et al., "On Calibration of Modern Neural Networks" (2017).
    """

    def __init__(self):
        self.temperature = None

    def fit(
        self,
        logits: np.ndarray,
        labels: np.ndarray,
        lr: float = 0.01,
        max_iter: int = 500,
    ) -> "TemperatureScaler":
        """Fit temperature parameter on calibration set.

        For binary classification, logits is shape (n,) and labels is shape (n,).
        For multi-class, logits is shape (n, K) and labels is shape (n,) with
        integer class indices.
        """
        logits_t = torch.tensor(logits, dtype=torch.float32)
        labels_t = torch.tensor(labels, dtype=torch.long)

        # Initialize temperature at 1.0 (identity)
        # Use log(T) to ensure T > 0 after exp
        log_temperature = nn.Parameter(torch.zeros(1))

        if logits_t.ndim == 1:
            # Binary classification
            labels_float = labels_t.float().unsqueeze(1)
            logits_t = logits_t.unsqueeze(1)
            criterion = nn.BCEWithLogitsLoss()

            def compute_loss():
                T = torch.exp(log_temperature)
                scaled = logits_t / T
                return criterion(scaled, labels_float)
        else:
            # Multi-class
            criterion = nn.CrossEntropyLoss()

            def compute_loss():
                T = torch.exp(log_temperature)
                scaled = logits_t / T
                return criterion(scaled, labels_t)

        optimizer = optim.LBFGS([log_temperature], lr=lr, max_iter=max_iter)

        def closure():
            optimizer.zero_grad()
            loss = compute_loss()
            loss.backward()
            return loss

        optimizer.step(closure)

        self.temperature = torch.exp(log_temperature).item()
        return self

    def calibrate(self, logits: np.ndarray) -> np.ndarray:
        """Apply temperature scaling to new logits."""
        if self.temperature is None:
            raise RuntimeError("Call fit() before calibrate()")
        scaled = logits / self.temperature
        if scaled.ndim == 1:
            return 1.0 / (1.0 + np.exp(-scaled))
        else:
            exp_scaled = np.exp(scaled - scaled.max(axis=1, keepdims=True))
            return exp_scaled / exp_scaled.sum(axis=1, keepdims=True)

    def __repr__(self) -> str:
        T_str = f"{self.temperature:.4f}" if self.temperature else "not fit"
        return f"TemperatureScaler(T={T_str})"

Platt scaling strengths and limitations:

  • Strengths: Simple, fast, works well when the uncalibrated scores are already monotonically related to true probabilities. Two parameters make overfitting unlikely even on small calibration sets.
  • Limitations: Assumes the logistic model is the right recalibration function. Fails when the miscalibration is non-linear (e.g., the model is overconfident at the extremes but well-calibrated in the middle).

34.3.2 Isotonic Regression

Isotonic regression fits a non-decreasing step function $g: \mathbb{R} \to [0, 1]$ to the calibration set, solving:

$$\min_g \sum_{i=1}^{n} (y_i - g(\hat{p}_i))^2 \quad \text{subject to } g \text{ is non-decreasing}$$

This is solved exactly by the pool adjacent violators (PAV) algorithm in $O(n)$ time.
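For intuition, the unweighted PAV algorithm can be sketched in a few lines. This is a simplified illustration that assumes the targets are already sorted by the model's score and ignores sample weights; production code should use sklearn's `IsotonicRegression`, as below:

```python
import numpy as np

def pav(y):
    """Pool adjacent violators: non-decreasing least-squares fit to y,
    assuming y is already sorted by the underlying score (unweighted
    sketch). Each block stores [sum, count]; its value is sum / count."""
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds this one's
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

print(pav([0, 1, 0, 1, 1]))  # fitted values: 0, 0.5, 0.5, 1, 1
```

The violating pair (1, 0) in the middle is pooled into a single block with mean 0.5, which is exactly the calibrated probability assigned to scores in that range.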

from sklearn.isotonic import IsotonicRegression


class IsotonicCalibrator:
    """Isotonic regression calibration.

    Non-parametric: learns a non-decreasing step function from
    predicted scores to calibrated probabilities. More flexible
    than Platt/temperature scaling but requires a larger calibration
    set (typically >= 1000 samples) to avoid overfitting.
    """

    def __init__(self):
        self.iso = IsotonicRegression(
            y_min=0.0, y_max=1.0, out_of_bounds="clip"
        )

    def fit(self, scores: np.ndarray, labels: np.ndarray) -> "IsotonicCalibrator":
        """Fit isotonic regression on calibration set."""
        self.iso.fit(scores, labels)
        return self

    def calibrate(self, scores: np.ndarray) -> np.ndarray:
        """Apply isotonic calibration to new scores."""
        return self.iso.predict(scores)

Isotonic regression strengths and limitations:

  • Strengths: Non-parametric — can correct arbitrary monotonic miscalibration patterns. No functional form assumption.
  • Limitations: Requires more calibration data than Platt scaling (1,000+ samples recommended). Produces a step function, which can be coarse with small calibration sets. May overfit with fewer than 500 calibration samples.

34.3.3 Temperature Scaling

Temperature scaling (Guo et al., 2017) is the simplest recalibration method and often the most effective for multi-class neural networks. It learns a single scalar parameter $T > 0$ (the "temperature") and divides all logits by $T$ before applying softmax:

$$\hat{p}_k = \frac{\exp(z_k / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$$

When $T > 1$ (the typical case for overconfident networks), the softmax output is "softened" — probabilities are pushed away from 0 and 1 toward the uniform distribution, reducing overconfidence. When $T < 1$, the output is "sharpened." When $T = 1$, the output is unchanged.
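A quick numeric illustration of softening and sharpening (the logits are arbitrary):

```python
import numpy as np

def softmax_T(z, T):
    # Temperature-scaled, numerically stable softmax
    zT = z / T
    e = np.exp(zT - zT.max())
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.0])
for T in (0.5, 1.0, 2.0, 5.0):
    p = softmax_T(logits, T)
    print(f"T={T:3.1f}  probs={np.round(p, 3)}  argmax={p.argmax()}")
```

As $T$ grows the distribution flattens toward uniform; as $T \to 0$ it collapses onto the argmax. The argmax itself never changes, which is why accuracy is untouched.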

The TemperatureScaler class above implements this. The key properties:

  1. One parameter. Temperature scaling has a single scalar parameter, making it nearly impossible to overfit even on small calibration sets (a few hundred examples suffice).
  2. Preserves ranking. Dividing all logits by the same constant does not change their relative order, so accuracy, AUC, and ranking metrics are unchanged.
  3. Sufficient for modern networks. Guo et al. showed that temperature scaling is as effective as Platt scaling and isotonic regression for deep networks on standard benchmarks, despite having only one parameter. This suggests that the miscalibration of modern networks is largely uniform (same degree of overconfidence across the probability range).

34.3.4 Comparing Recalibration Methods

@dataclass
class RecalibrationComparison:
    """Compare Platt, isotonic, and temperature scaling.

    Splits held-out data into calibration (fit recalibrators)
    and test (evaluate calibration). Reports ECE before and after
    each method.
    """
    logits: np.ndarray
    labels: np.ndarray
    cal_fraction: float = 0.5  # Fraction used for calibration

    def run(self, seed: int = 42) -> Dict[str, Dict[str, float]]:
        rng = np.random.RandomState(seed)
        n = len(self.labels)
        idx = rng.permutation(n)
        n_cal = int(n * self.cal_fraction)
        cal_idx, test_idx = idx[:n_cal], idx[n_cal:]

        cal_logits, cal_labels = self.logits[cal_idx], self.labels[cal_idx]
        test_logits, test_labels = self.logits[test_idx], self.labels[test_idx]

        # Uncalibrated baseline
        uncal_probs = 1.0 / (1.0 + np.exp(-test_logits))
        uncal_diag = CalibrationDiagnostics(test_labels, uncal_probs)

        # Platt scaling
        platt = PlattScaler().fit(cal_logits, cal_labels)
        platt_probs = platt.calibrate(test_logits)
        platt_diag = CalibrationDiagnostics(test_labels, platt_probs)

        # Temperature scaling
        temp = TemperatureScaler().fit(cal_logits, cal_labels)
        temp_probs = temp.calibrate(test_logits)
        temp_diag = CalibrationDiagnostics(test_labels, temp_probs)

        # Isotonic regression
        cal_probs = 1.0 / (1.0 + np.exp(-cal_logits))
        test_probs = 1.0 / (1.0 + np.exp(-test_logits))
        iso = IsotonicCalibrator().fit(cal_probs, cal_labels)
        iso_probs = iso.calibrate(test_probs)
        iso_diag = CalibrationDiagnostics(test_labels, iso_probs)

        return {
            "uncalibrated": uncal_diag.summary(),
            "platt": platt_diag.summary(),
            "temperature": temp_diag.summary(),
            "isotonic": iso_diag.summary(),
            "temperature_T": temp.temperature,
            "platt_a": platt.a,
            "platt_b": platt.b,
        }

34.3.5 Calibration Across Subgroups: The Meridian Financial Problem

Global calibration — ECE computed over the entire test set — can mask severe subgroup miscalibration. The Meridian Financial credit scoring model (anchor example, Chapters 2, 5, 6, 24, 31) illustrates this. After Platt scaling, the model achieves ECE = 0.018 on the full test set — excellent calibration by any standard. But disaggregated analysis reveals:

| Subgroup | ECE (Post-Platt) | Direction of Bias |
|---|---|---|
| Overall | 0.018 | |
| Age 25-35 | 0.034 | Overconfident (predicts lower default risk than actual) |
| Age 35-50 | 0.015 | Well-calibrated |
| Age 50-65 | 0.012 | Well-calibrated |
| Income < $40K | 0.071 | Overconfident |
| Income $40K-100K | 0.019 | Well-calibrated |
| Income > $100K | 0.042 | Underconfident (predicts higher default risk than actual) |
| Rural applicants | 0.089 | Overconfident |
| Urban applicants | 0.014 | Well-calibrated |

The model is systematically overconfident for young, low-income, and rural applicants — precisely the groups where lending decisions are most consequential and regulatory scrutiny is highest. This connects directly to Chapter 31's fairness analysis: calibration by group is one of the fairness criteria defined there, and the impossibility theorem proves it cannot hold simultaneously with equalized odds when base rates differ across groups.

Understanding Why (Theme 1): The subgroup calibration failure is not a random numerical artifact. It has a causal explanation: the model has fewer training examples for young, low-income, and rural applicants, so its epistemic uncertainty is higher for these groups. But the model's confidence scores do not reflect this epistemic uncertainty — they are driven by the softmax over logits, which produces high-confidence predictions regardless of how much training data supports them. This is the fundamental link between calibration (Section 34.2-34.3) and epistemic uncertainty estimation (Section 34.6-34.7).

The remedy is group-conditional calibration: fit separate recalibration models for each subgroup. In the CalibrationDiagnostics class above, this amounts to filtering the calibration set by subgroup and fitting separate Platt or isotonic regressors per group. The tradeoff: subgroup calibration sets are smaller, increasing the risk of recalibration overfitting — which is why temperature scaling (1 parameter) is often preferred over isotonic regression (many parameters) for subgroup recalibration.
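Group-conditional temperature scaling can be sketched as follows. This is an illustrative sketch, not the chapter's `TemperatureScaler`: it substitutes a grid search for the gradient-based fit, and the group labels ("rural"/"urban"), the minimum-group-size cutoff of 200, and the synthetic calibration data are all assumptions made up for the example:

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.25, 5.0, 96)):
    """Scalar temperature via grid search over binary NLL
    (a simple stand-in for gradient-based fitting)."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        p = np.clip(1.0 / (1.0 + np.exp(-logits / T)), 1e-12, 1 - 1e-12)
        nll = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

def fit_group_temperatures(logits, labels, groups, min_size=200):
    """One temperature per subgroup; small groups fall back to the
    global temperature (min_size is an illustrative heuristic)."""
    global_T = fit_temperature(logits, labels)
    return {
        g: fit_temperature(logits[groups == g], labels[groups == g])
        if (groups == g).sum() >= min_size else global_T
        for g in np.unique(groups)
    }

# Synthetic calibration set: the 'rural' group's logits are inflated 3x,
# mimicking group-specific overconfidence
rng = np.random.default_rng(0)
n = 4000
groups = np.where(rng.uniform(size=n) < 0.3, "rural", "urban")
true_logit = rng.normal(0, 1, size=n)
labels = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(float)
logits = np.where(groups == "rural", 3.0 * true_logit, true_logit)

temps = fit_group_temperatures(logits, labels, groups)
print({g: round(float(t), 2) for g, t in temps.items()})
```

On this synthetic data the fitted rural temperature comes out near 3 (undoing the inflation) while the urban temperature stays near 1 — a single global temperature would split the difference and leave both groups miscalibrated.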


34.4 Conformal Prediction: Distribution-Free Prediction Sets

Recalibration methods improve the average quality of probability estimates, but they do not provide formal guarantees for individual predictions. A Platt-scaled model with ECE = 0.02 is well-calibrated in aggregate, but nothing prevents it from being catastrophically overconfident on a specific instance. Conformal prediction provides something stronger: for each test example, it constructs a prediction set $C(x)$ with a finite-sample guarantee that the true label is contained in $C(x)$ with probability at least $1 - \alpha$, for any user-specified $\alpha$.

The guarantee holds regardless of the model, the data distribution, and the true function — it is distribution-free. The only assumption is exchangeability: the calibration data and the test point are exchangeably distributed (roughly, their joint distribution is invariant to permutations). This is weaker than the IID assumption and is violated only by certain forms of distribution shift.

34.4.1 Split Conformal Prediction (Classification)

The simplest conformal method is split conformal prediction (Vovk et al., 2005; Lei et al., 2018). The procedure for classification:

  1. Split the data into a training set and a calibration set. Train the model on the training set.
  2. Compute nonconformity scores on the calibration set. For classification, a natural score is $s_i = 1 - \hat{p}_{y_i}$, where $\hat{p}_{y_i}$ is the model's predicted probability for the true class of calibration example $i$. High scores indicate the model is "surprised" by the true label.
  3. Compute the threshold $\hat{q}$ as the $\lceil (1 - \alpha)(n_{\text{cal}} + 1) \rceil / n_{\text{cal}}$ empirical quantile of the calibration scores. (The $+1$ and the ceiling ensure the coverage guarantee holds in finite samples — with probability at least $1 - \alpha$ — not merely asymptotically.)
  4. Form prediction sets for new test points: $C(x_{\text{test}}) = \{y : 1 - \hat{p}_y(x_{\text{test}}) \leq \hat{q}\} = \{y : \hat{p}_y(x_{\text{test}}) \geq 1 - \hat{q}\}$. Include every class whose predicted probability exceeds the threshold.
@dataclass
class SplitConformalClassifier:
    """Split conformal prediction for classification.

    Provides distribution-free prediction sets with finite-sample
    coverage guarantee: P(Y in C(X)) >= 1 - alpha.

    Assumption: calibration data and test data are exchangeable.
    """
    alpha: float = 0.10  # Target miscoverage rate (90% coverage)

    # Computed
    threshold: Optional[float] = field(init=False, default=None)
    cal_scores: Optional[np.ndarray] = field(init=False, repr=False, default=None)

    def calibrate(
        self,
        cal_probs: np.ndarray,
        cal_labels: np.ndarray,
    ) -> "SplitConformalClassifier":
        """Compute the conformal threshold from calibration data.

        Args:
            cal_probs: shape (n_cal, K) - predicted probabilities per class
            cal_labels: shape (n_cal,) - true class labels (integer indices)

        Returns:
            self (for chaining)
        """
        n_cal = len(cal_labels)
        # Nonconformity score: 1 - probability assigned to true class
        self.cal_scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]

        # Quantile level with finite-sample correction
        quantile_level = np.ceil((1 - self.alpha) * (n_cal + 1)) / n_cal
        quantile_level = min(quantile_level, 1.0)

        # Use an actual order statistic ("higher") so the finite-sample
        # guarantee holds; linear interpolation can slightly undercover.
        self.threshold = float(
            np.quantile(self.cal_scores, quantile_level, method="higher")
        )
        return self

    def predict_sets(self, test_probs: np.ndarray) -> List[List[int]]:
        """Return prediction sets for each test example.

        Args:
            test_probs: shape (n_test, K) - predicted probabilities per class

        Returns:
            List of lists, each containing the class indices in the prediction set.
        """
        if self.threshold is None:
            raise RuntimeError("Call calibrate() before predict_sets()")

        sets = []
        for i in range(test_probs.shape[0]):
            # Include class k if its score <= threshold
            # Equivalently: include class k if prob_k >= 1 - threshold
            included = np.where(test_probs[i] >= 1.0 - self.threshold)[0]
            sets.append(included.tolist())
        return sets

    def evaluate_coverage(
        self,
        test_probs: np.ndarray,
        test_labels: np.ndarray,
    ) -> Dict[str, float]:
        """Evaluate empirical coverage and set size on test data."""
        pred_sets = self.predict_sets(test_probs)

        covered = sum(
            1 for i, pset in enumerate(pred_sets) if test_labels[i] in pset
        )
        sizes = [len(pset) for pset in pred_sets]
        empty_count = sum(1 for s in sizes if s == 0)

        return {
            "empirical_coverage": covered / len(test_labels),
            "target_coverage": 1 - self.alpha,
            "mean_set_size": float(np.mean(sizes)),
            "median_set_size": float(np.median(sizes)),
            "max_set_size": int(np.max(sizes)),
            "empty_sets": empty_count,
            "singleton_sets": sum(1 for s in sizes if s == 1),
            "n_test": len(test_labels),
        }

34.4.2 The Coverage Guarantee

The theoretical guarantee is:

$$P(Y_{n+1} \in C(X_{n+1})) \geq 1 - \alpha$$

This holds for any model, any data distribution, and any sample size $n_{\text{cal}}$. It is a marginal guarantee: averaged over the randomness in both the calibration data and the test point. It does not guarantee conditional coverage (i.e., $P(Y \in C(X) \mid X = x) \geq 1 - \alpha$ for all $x$) — that is a stronger property that requires additional assumptions.

The proof is elegant and rests on a single insight. If the calibration scores $s_1, \ldots, s_n$ and the test score $s_{n+1}$ are exchangeable, then the rank of $s_{n+1}$ among $\{s_1, \ldots, s_n, s_{n+1}\}$ is uniformly distributed over $\{1, 2, \ldots, n+1\}$. Setting the threshold at the $\lceil(1-\alpha)(n+1)\rceil$-th smallest calibration score therefore ensures that the test score falls at or below the threshold with probability at least $1 - \alpha$.

Exchangeability vs. IID. Exchangeability is weaker than IID. A sequence $(Z_1, \ldots, Z_{n+1})$ is exchangeable if its joint distribution is invariant under any permutation of indices. IID implies exchangeability, but exchangeability allows dependencies — for example, samples drawn without replacement from a finite population are exchangeable but not independent. The coverage guarantee holds under exchangeability, making conformal prediction more broadly applicable than methods that require IID data.
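The guarantee can be checked empirically. The following standalone sketch (synthetic scores only, not tied to any model; the helper name `empirical_coverage` is illustrative) draws exchangeable calibration and test scores, applies the quantile rule from Section 34.4.1, and measures how often the test score lands inside the threshold:

```python
import numpy as np

def empirical_coverage(n_cal: int = 200, alpha: float = 0.10,
                       n_trials: int = 5000, seed: int = 0) -> float:
    """Fraction of trials where the test score falls at or below
    the conformal threshold computed from fresh calibration scores."""
    rng = np.random.default_rng(seed)
    covered = 0
    for _ in range(n_trials):
        # Any exchangeable draw works; iid exponentials are one example
        scores = rng.exponential(size=n_cal + 1)
        cal, test = scores[:-1], scores[-1]
        level = min(np.ceil((1 - alpha) * (n_cal + 1)) / n_cal, 1.0)
        q_hat = np.quantile(cal, level, method="higher")
        covered += test <= q_hat
    return covered / n_trials

cov = empirical_coverage()
print(f"empirical coverage: {cov:.3f} (target >= 0.900)")
```

Because the guarantee is marginal, the measured coverage sits just above $1 - \alpha$ (the rank argument gives roughly $\lceil(1-\alpha)(n+1)\rceil/(n+1)$), regardless of the score distribution used.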

34.4.3 Split Conformal Prediction (Regression)

For regression, we want a prediction interval $C(x) = [\hat{y}(x) - q, \hat{y}(x) + q]$ with coverage guarantee. The nonconformity score is the absolute residual $s_i = |y_i - \hat{y}(x_i)|$:

@dataclass
class SplitConformalRegressor:
    """Split conformal prediction for regression.

    Produces symmetric prediction intervals with guaranteed coverage.
    For adaptive-width intervals, use conformalized quantile regression
    (Section 34.4.5).
    """
    alpha: float = 0.10

    threshold: float = field(init=False, default=None)
    cal_residuals: np.ndarray = field(init=False, repr=False, default=None)

    def calibrate(
        self,
        cal_predictions: np.ndarray,
        cal_targets: np.ndarray,
    ) -> "SplitConformalRegressor":
        """Compute the conformal threshold from calibration residuals."""
        n_cal = len(cal_targets)
        self.cal_residuals = np.abs(cal_targets - cal_predictions)

        quantile_level = np.ceil((1 - self.alpha) * (n_cal + 1)) / n_cal
        quantile_level = min(quantile_level, 1.0)

        # method="higher" returns an actual residual, preserving the guarantee
        self.threshold = float(
            np.quantile(self.cal_residuals, quantile_level, method="higher")
        )
        return self

    def predict_intervals(
        self,
        test_predictions: np.ndarray,
    ) -> np.ndarray:
        """Return prediction intervals for each test example.

        Returns: shape (n_test, 2) array of [lower, upper] bounds.
        """
        if self.threshold is None:
            raise RuntimeError("Call calibrate() before predict_intervals()")

        lower = test_predictions - self.threshold
        upper = test_predictions + self.threshold
        return np.column_stack([lower, upper])

    def evaluate_coverage(
        self,
        test_predictions: np.ndarray,
        test_targets: np.ndarray,
    ) -> Dict[str, float]:
        """Evaluate empirical coverage and interval width."""
        intervals = self.predict_intervals(test_predictions)
        covered = ((test_targets >= intervals[:, 0]) &
                   (test_targets <= intervals[:, 1]))
        widths = intervals[:, 1] - intervals[:, 0]

        return {
            "empirical_coverage": float(covered.mean()),
            "target_coverage": 1 - self.alpha,
            "mean_width": float(widths.mean()),
            "median_width": float(np.median(widths)),
            "threshold": self.threshold,
        }
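As a usage sketch, the recipe behind the class above can be replayed on synthetic data. Here an oracle mean model $\hat{y}(x) = 2x$ stands in for a trained regressor (an assumption for illustration); the intervals land near the 90% target with width roughly twice the 90th percentile of the absolute noise:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.10

def model(x):
    # Stand-in for a trained regressor (assumption): the true mean function
    return 2.0 * x

# Calibration split: absolute residuals as nonconformity scores
x_cal = rng.uniform(0, 10, 1000)
y_cal = 2.0 * x_cal + rng.normal(0.0, 1.0, 1000)
residuals = np.abs(y_cal - model(x_cal))

n = len(residuals)
level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
q_hat = np.quantile(residuals, level, method="higher")

# Test split: symmetric intervals yhat +/- q_hat
x_test = rng.uniform(0, 10, 5000)
y_test = 2.0 * x_test + rng.normal(0.0, 1.0, 5000)
covered = np.abs(y_test - model(x_test)) <= q_hat
print(f"coverage={covered.mean():.3f}, width={2 * q_hat:.2f}")
```

With standard normal noise, $\hat{q}$ comes out near 1.65 (the 90th percentile of $|N(0,1)|$), so the constant interval width is about 3.3 everywhere, which motivates the adaptive-width methods of Section 34.4.5.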

34.4.4 Adaptive Conformal Inference (ACI) for Distribution Shift

The standard conformal guarantee assumes exchangeability between calibration and test data. Under distribution shift — which is the norm in production ML systems — this assumption fails, and the coverage guarantee degrades.

Adaptive conformal inference (Gibbs and Candès, 2021) addresses this by adjusting the conformal threshold online. The idea: after each prediction, observe the true label and update the threshold according to whether the prediction set covered it. If coverage has been running below target, widen the sets; if above target, narrow them.

@dataclass
class AdaptiveConformalClassifier:
    """Adaptive Conformal Inference (ACI) for streaming predictions.

    Adjusts the conformal threshold dynamically to maintain target
    coverage under distribution shift. Uses online gradient descent
    on the threshold, treating the coverage constraint as a loss.

    Reference: Gibbs and Candes, "Adaptive Conformal Inference Under
    Distribution Shift" (NeurIPS, 2021).
    """
    alpha: float = 0.10
    gamma: float = 0.01  # Step size for threshold update
    initial_threshold: float = 0.5

    # State
    threshold: float = field(init=False)
    coverage_history: List[bool] = field(init=False, default_factory=list)
    threshold_history: List[float] = field(init=False, default_factory=list)

    def __post_init__(self):
        self.threshold = self.initial_threshold

    def predict_set(self, probs: np.ndarray) -> List[int]:
        """Predict set for a single instance using current threshold."""
        included = np.where(probs >= 1.0 - self.threshold)[0]
        return included.tolist()

    def update(self, probs: np.ndarray, true_label: int):
        """Update threshold after observing true label.

        If the true label was covered: decrease threshold (narrow sets).
        If the true label was not covered: increase threshold (widen sets).
        """
        self.threshold_history.append(self.threshold)
        pred_set = self.predict_set(probs)
        covered = true_label in pred_set
        self.coverage_history.append(covered)

        # Online gradient descent on the quantile loss
        # err_t = alpha if covered, err_t = alpha - 1 if not covered
        if covered:
            self.threshold = self.threshold - self.gamma * self.alpha
        else:
            self.threshold = self.threshold + self.gamma * (1 - self.alpha)

        # Clip to valid range
        self.threshold = np.clip(self.threshold, 0.0, 1.0)

    def get_running_coverage(self, window: int = 100) -> List[float]:
        """Compute running coverage over a sliding window."""
        coverages = []
        for i in range(len(self.coverage_history)):
            start = max(0, i - window + 1)
            window_coverage = np.mean(self.coverage_history[start:i + 1])
            coverages.append(float(window_coverage))
        return coverages

ACI provides a long-run coverage guarantee: the average coverage over a sequence of predictions converges to $1 - \alpha$, even under adversarial distribution shift. This makes it suitable for production monitoring systems where the data distribution evolves over time — precisely the setting of Chapter 30's monitoring framework.
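A minimal streaming simulation (synthetic drifting nonconformity scores, no classifier) illustrates the long-run property. The update rule below mirrors the one in AdaptiveConformalClassifier.update; the score distribution shifts abruptly halfway through, yet average coverage stays pinned near target:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, gamma = 0.10, 0.02
threshold = 0.5
covered_flags = []

for t in range(20_000):
    # Nonconformity score whose distribution shifts at t = 10,000
    if t < 10_000:
        s_t = rng.beta(2, 5) * 0.3
    else:
        s_t = rng.beta(2, 5) * 0.8 + 0.2
    covered = s_t <= threshold
    covered_flags.append(covered)

    # Online gradient step on the pinball loss (same rule as ACI)
    if covered:
        threshold -= gamma * alpha
    else:
        threshold += gamma * (1 - alpha)

long_run = np.mean(covered_flags)
print(f"long-run coverage: {long_run:.3f} (target 0.900)")
```

The telescoping argument makes this exact up to boundary terms: summing the updates gives (threshold_T - threshold_0) = gamma * (misses - alpha * T), so as long as the threshold stays bounded, the miss rate converges to alpha.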

34.4.5 Conformalized Quantile Regression

A limitation of basic split conformal regression is that the prediction intervals have constant width (the threshold $\hat{q}$ is the same for all inputs). In many applications, uncertainty varies across the input space — a climate model is more uncertain about tropical cyclone regions than about mid-latitude weather. Conformalized quantile regression (Romano et al., 2019) produces adaptive-width intervals by combining quantile regression with conformal calibration.

The procedure:

  1. Train a quantile regression model (or a neural network with quantile loss) to predict the $\alpha/2$ and $1 - \alpha/2$ quantiles: $\hat{q}_{\text{lo}}(x)$ and $\hat{q}_{\text{hi}}(x)$.
  2. On the calibration set, compute nonconformity scores $s_i = \max(\hat{q}_{\text{lo}}(x_i) - y_i, \; y_i - \hat{q}_{\text{hi}}(x_i))$.
  3. Compute the conformal correction $\hat{c}$ as the appropriate quantile of the calibration scores.
  4. Form the prediction interval: $C(x) = [\hat{q}_{\text{lo}}(x) - \hat{c}, \; \hat{q}_{\text{hi}}(x) + \hat{c}]$.

The result: adaptive-width intervals that are wider where the model's quantile regression predicts more spread, with a conformal correction that provides the finite-sample coverage guarantee. The intervals are both valid (provably $\geq 1 - \alpha$ coverage) and sharper (narrower where the model is confident) than constant-width conformal intervals.

@dataclass
class ConformalizedQuantileRegressor:
    """Conformalized quantile regression for adaptive prediction intervals.

    Combines a quantile regression model (which produces input-dependent
    interval width) with conformal calibration (which provides the
    coverage guarantee). Results in intervals that are both valid
    and sharp.

    Reference: Romano, Patterson, Candes, "Conformalized Quantile
    Regression" (NeurIPS, 2019).
    """
    alpha: float = 0.10
    correction: float = field(init=False, default=None)

    def calibrate(
        self,
        cal_lower: np.ndarray,
        cal_upper: np.ndarray,
        cal_targets: np.ndarray,
    ) -> "ConformalizedQuantileRegressor":
        """Compute conformal correction from calibration data.

        Args:
            cal_lower: predicted alpha/2 quantile on calibration set
            cal_upper: predicted (1-alpha/2) quantile on calibration set
            cal_targets: true targets on calibration set
        """
        n_cal = len(cal_targets)
        scores = np.maximum(cal_lower - cal_targets, cal_targets - cal_upper)

        quantile_level = np.ceil((1 - self.alpha) * (n_cal + 1)) / n_cal
        quantile_level = min(quantile_level, 1.0)

        # method="higher" returns an actual score, preserving the guarantee
        self.correction = float(
            np.quantile(scores, quantile_level, method="higher")
        )
        return self

    def predict_intervals(
        self,
        test_lower: np.ndarray,
        test_upper: np.ndarray,
    ) -> np.ndarray:
        """Return conformalized prediction intervals."""
        if self.correction is None:
            raise RuntimeError("Call calibrate() first")

        lower = test_lower - self.correction
        upper = test_upper + self.correction
        return np.column_stack([lower, upper])
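A standalone replay of the CQR recipe on synthetic heteroscedastic data, using oracle quantile predictions in place of a trained quantile regressor (the `oracle_quantiles` helper is an assumption for illustration: with $y = x\varepsilon$, $\varepsilon \sim N(0,1)$, the true 5%/95% quantiles are $\mp 1.6449\,x$):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha = 0.10
z = 1.6449  # standard normal 95th percentile

def oracle_quantiles(x):
    """Stand-in for a trained quantile regressor (assumption)."""
    return -z * x, z * x

# Calibration: CQR nonconformity scores
x_cal = rng.uniform(1, 5, 2000)
y_cal = x_cal * rng.normal(size=2000)
lo, hi = oracle_quantiles(x_cal)
scores = np.maximum(lo - y_cal, y_cal - hi)
n = len(scores)
level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
c_hat = np.quantile(scores, level, method="higher")

# Test: adaptive-width intervals [lo - c_hat, hi + c_hat]
x_test = rng.uniform(1, 5, 5000)
y_test = x_test * rng.normal(size=5000)
lo_t, hi_t = oracle_quantiles(x_test)
covered = (y_test >= lo_t - c_hat) & (y_test <= hi_t + c_hat)
widths = (hi_t + c_hat) - (lo_t - c_hat)
print(f"coverage={covered.mean():.3f}, "
      f"width range [{widths.min():.2f}, {widths.max():.2f}]")
```

Because the oracle quantiles are already well calibrated, the conformal correction $\hat{c}$ comes out near zero here, and the interval width scales with $x$ exactly as the noise does: valid and sharp.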

34.5 Coverage and Sharpness: Evaluating Prediction Intervals

A prediction interval (or prediction set) must satisfy two desiderata:

  1. Coverage: $P(Y \in C(X)) \geq 1 - \alpha$. The interval must contain the true value at least $(1-\alpha) \times 100\%$ of the time.
  2. Sharpness: The interval should be as narrow as possible. A trivial interval $(-\infty, +\infty)$ achieves 100% coverage but is useless.

Sharpness is the complementary criterion to coverage: among all intervals with the same coverage, we prefer the narrowest. The sharpness-coverage tradeoff is analogous to the precision-recall tradeoff in classification: increasing one typically decreases the other.

@dataclass
class PredictionIntervalEvaluator:
    """Evaluate prediction intervals on coverage, sharpness,
    and conditional coverage.
    """
    intervals: np.ndarray  # shape (n, 2): [lower, upper]
    targets: np.ndarray    # shape (n,)

    def overall_metrics(self) -> Dict[str, float]:
        widths = self.intervals[:, 1] - self.intervals[:, 0]
        covered = (
            (self.targets >= self.intervals[:, 0]) &
            (self.targets <= self.intervals[:, 1])
        )
        return {
            "coverage": float(covered.mean()),
            "mean_width": float(widths.mean()),
            "median_width": float(np.median(widths)),
            "width_std": float(widths.std()),
            "coverage_width_product": float(covered.mean() * widths.mean()),
        }

    def conditional_coverage(
        self,
        groups: np.ndarray,
    ) -> Dict[str, Dict[str, float]]:
        """Coverage and width broken down by group.

        Critical for fairness: conformal guarantees are marginal,
        but group-conditional coverage may differ substantially.
        """
        widths = self.intervals[:, 1] - self.intervals[:, 0]
        covered = (
            (self.targets >= self.intervals[:, 0]) &
            (self.targets <= self.intervals[:, 1])
        )

        results = {}
        for g in np.unique(groups):
            mask = groups == g
            results[str(g)] = {
                "coverage": float(covered[mask].mean()),
                "mean_width": float(widths[mask].mean()),
                "n_samples": int(mask.sum()),
            }
        return results

The conditional_coverage method above connects conformal prediction to the fairness concerns of Chapter 31. Conformal prediction's marginal coverage guarantee ($P(Y \in C(X)) \geq 1 - \alpha$) does not imply conditional coverage for every subgroup. In practice, conformal intervals may have 95% coverage overall but only 85% coverage for underrepresented subgroups, exactly the pattern observed in Meridian Financial's subgroup calibration analysis. Romano, Sesia, and Candès (2020) address this with equalized coverage conformal prediction, which calibrates a separate threshold per group.
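A hypothetical sketch of the group-wise idea (the `groupwise_thresholds` helper is illustrative, not from the paper): compute a separate conformal threshold from each group's calibration scores, so each group gets its own $\geq 1 - \alpha$ guarantee, at the cost of smaller effective calibration sets per group:

```python
import numpy as np
from typing import Dict

def groupwise_thresholds(scores: np.ndarray, groups: np.ndarray,
                         alpha: float = 0.10) -> Dict[str, float]:
    """Per-group conformal thresholds (Mondrian / equalized-coverage idea)."""
    thresholds = {}
    for g in np.unique(groups):
        s_g = scores[groups == g]
        n_g = len(s_g)
        level = min(np.ceil((1 - alpha) * (n_g + 1)) / n_g, 1.0)
        thresholds[str(g)] = float(np.quantile(s_g, level, method="higher"))
    return thresholds

rng = np.random.default_rng(3)
groups = rng.choice(["A", "B"], size=2000, p=[0.8, 0.2])
# Group B is "harder": systematically larger nonconformity scores
scores = np.where(groups == "A",
                  rng.exponential(0.5, 2000),
                  rng.exponential(1.5, 2000))
thr = groupwise_thresholds(scores, groups)
print(thr)
```

The harder, smaller group receives a larger (more permissive) threshold, which is precisely how equalized coverage trades wider sets for per-group validity.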


34.6 Monte Carlo Dropout: Approximate Bayesian Inference for Free

In a Bayesian framework, epistemic uncertainty is captured by the posterior distribution over model parameters $p(\theta \mid D)$. At test time, the predictive distribution integrates over this posterior:

$$p(y \mid x, D) = \int p(y \mid x, \theta) \, p(\theta \mid D) \, d\theta$$

This integral is intractable for neural networks. Monte Carlo (MC) dropout (Gal and Ghahramani, 2016) provides an approximation. The insight: a neural network with dropout applied at test time (not just during training) is equivalent to an approximate variational inference procedure. Each forward pass with different dropout masks samples a different "sub-network" — an implicit sample from an approximate posterior over network parameters.

34.6.1 The Procedure

  1. Train a neural network with dropout (standard training; no changes needed).
  2. At test time, keep dropout enabled (do not switch to eval() mode for dropout layers).
  3. Run $T$ forward passes for each test input $x$, producing $T$ predictions $\hat{y}_1, \ldots, \hat{y}_T$.
  4. Use the distribution of predictions to estimate uncertainty:
     • Mean prediction: $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} \hat{y}_t$ (more accurate than a single pass)
     • Predictive variance: $\text{Var}[\hat{y}] = \frac{1}{T}\sum_{t=1}^{T}(\hat{y}_t - \bar{y})^2$ (epistemic uncertainty)
     • For classification: predictive entropy $H = -\sum_k \bar{p}_k \log \bar{p}_k$, where $\bar{p}_k = \frac{1}{T}\sum_t p_{k,t}$
class MCDropoutPredictor:
    """Monte Carlo dropout for epistemic uncertainty estimation.

    Runs T stochastic forward passes with dropout enabled at test time.
    The variance of predictions captures epistemic uncertainty:
    high variance means the model is uncertain because different
    sub-networks disagree.

    Reference: Gal and Ghahramani, "Dropout as a Bayesian
    Approximation: Representing Model Uncertainty in Deep
    Learning" (ICML, 2016).
    """

    def __init__(self, model: nn.Module, n_samples: int = 50):
        self.model = model
        self.n_samples = n_samples

    def _enable_dropout(self):
        """Enable dropout layers during inference."""
        for module in self.model.modules():
            if isinstance(module, nn.Dropout):
                module.train()

    @torch.no_grad()
    def predict(
        self, x: torch.Tensor
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Run MC dropout and return mean, std, and all samples.

        Args:
            x: input tensor, shape (batch_size, ...)

        Returns:
            mean_pred: shape (batch_size, output_dim) - averaged prediction
            std_pred: shape (batch_size, output_dim) - epistemic uncertainty
            all_preds: shape (n_samples, batch_size, output_dim) - all samples
        """
        self.model.eval()   # Set batch norm etc. to eval mode
        self._enable_dropout()  # But keep dropout active

        predictions = []
        for _ in range(self.n_samples):
            pred = self.model(x).cpu().numpy()
            predictions.append(pred)

        all_preds = np.stack(predictions, axis=0)  # (T, batch, output_dim)
        mean_pred = all_preds.mean(axis=0)
        std_pred = all_preds.std(axis=0)

        return mean_pred, std_pred, all_preds

    def predictive_entropy(self, x: torch.Tensor) -> np.ndarray:
        """Compute predictive entropy for classification.

        Assumes the model outputs class probabilities (softmax applied).
        High entropy = high total uncertainty (aleatoric + epistemic).
        """
        mean_probs, _, _ = self.predict(x)  # mean over MC samples
        # Clip for numerical stability before taking logs
        mean_probs = np.clip(mean_probs, 1e-10, 1.0)
        return -np.sum(mean_probs * np.log(mean_probs), axis=-1)

    def mutual_information(self, x: torch.Tensor) -> np.ndarray:
        """Compute mutual information (epistemic uncertainty only).

        MI = H[y|x,D] - E_{theta}[H[y|x,theta]]

        High MI = high disagreement between MC samples = high
        epistemic uncertainty (model doesn't know).
        Low MI, high entropy = high aleatoric uncertainty
        (inherently unpredictable).
        """
        mean_pred, _, all_preds = self.predict(x)

        # Total entropy: H[y|x,D]
        mean_probs = np.clip(mean_pred, 1e-10, 1.0)
        total_entropy = -np.sum(mean_probs * np.log(mean_probs), axis=-1)

        # Expected entropy: E_theta[H[y|x,theta]]
        all_probs = np.clip(all_preds, 1e-10, 1.0)
        per_sample_entropy = -np.sum(
            all_probs * np.log(all_probs), axis=-1
        )  # (T, batch)
        expected_entropy = per_sample_entropy.mean(axis=0)

        return total_entropy - expected_entropy

34.6.2 Decomposing Uncertainty with MC Dropout

For classification, MC dropout enables a clean decomposition:

  • Predictive entropy $H[\bar{p}]$ captures total uncertainty (aleatoric + epistemic).
  • Expected entropy $\mathbb{E}_t[H[p_t]]$ captures aleatoric uncertainty (the average uncertainty of each sub-network).
  • Mutual information $H[\bar{p}] - \mathbb{E}_t[H[p_t]]$ captures epistemic uncertainty (the disagreement between sub-networks).

This decomposition is critical for decision-making:

| Scenario | Predictive Entropy | Mutual Information | Interpretation | Action |
|---|---|---|---|---|
| Low entropy, low MI | Low | Low | Confident; all sub-networks agree | Trust prediction |
| High entropy, low MI | High | Low | Uncertain, but all sub-networks agree (aleatoric) | Communicate uncertainty; do not collect more data |
| High entropy, high MI | High | High | Sub-networks disagree (epistemic) | Collect more data, refine model |
| Moderate entropy, high MI | Moderate | High | Rare: each sub-network is confident but they disagree on the class | Investigate; possible model pathology |
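The aleatoric and epistemic cases can be reproduced numerically with a short NumPy sketch (synthetic MC samples, no network). Below, every stochastic pass predicts 50/50 in the first case, while in the second each pass is confident but the passes disagree; the two cases have identical predictive entropy but very different mutual information:

```python
import numpy as np

def entropy(p: np.ndarray) -> np.ndarray:
    p = np.clip(p, 1e-10, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def decompose(samples: np.ndarray):
    """samples: (T, K) class probabilities from T stochastic passes."""
    total = entropy(samples.mean(axis=0))      # H[mean p]
    expected = entropy(samples).mean(axis=0)   # E_t[H[p_t]]
    return total, total - expected             # (entropy, mutual information)

aleatoric = np.tile([0.5, 0.5], (50, 1))                 # all passes say 50/50
epistemic = np.array([[0.95, 0.05]] * 25 + [[0.05, 0.95]] * 25)

h_a, mi_a = decompose(aleatoric)
h_e, mi_e = decompose(epistemic)
print(f"aleatoric case: H={h_a:.3f}, MI={mi_a:.3f}")
print(f"epistemic case: H={h_e:.3f}, MI={mi_e:.3f}")
```

Both cases have $H = \ln 2 \approx 0.693$ nats, but the mutual information is zero in the aleatoric case and about 0.49 in the epistemic one: the entropy alone cannot tell the two situations apart.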

34.6.3 Practical Considerations

Number of samples $T$. More samples reduce the Monte Carlo variance but increase inference cost linearly. Empirically, $T = 30{-}50$ is sufficient for reliable uncertainty estimates. $T = 10$ provides a noisy but useful signal for ranking by uncertainty (e.g., for active learning).

Dropout rate. The standard training dropout rate (typically 0.1-0.5) is also used at test time. Higher dropout rates produce more diverse sub-networks and higher epistemic uncertainty estimates. Some practitioners tune the test-time dropout rate as a hyperparameter on a validation set, optimizing for calibration.

Computational cost. MC dropout requires $T$ forward passes per test input, making it $T\times$ slower than standard inference. For the StreamRec recommendation system serving 12 million users, this is prohibitive for real-time serving but feasible for batch scoring, offline analysis, and active learning candidate selection.

Quality of uncertainty estimates. MC dropout uncertainty is an approximation. The underlying variational family (Bernoulli masks applied to units, which induce a factored distribution over the rows of each weight matrix) is not a good approximation to the true posterior for all architectures. In practice, MC dropout tends to underestimate epistemic uncertainty compared to deep ensembles (Section 34.7). It is best regarded as a computationally cheap lower bound on epistemic uncertainty.


34.7 Deep Ensembles: Simple and Strong

Deep ensembles (Lakshminarayanan et al., 2017) are a simpler and empirically stronger approach to uncertainty estimation than MC dropout. The method: train $M$ independent copies of the same architecture, each with a different random initialization and a different random data shuffle. At test time, average their predictions and use the variance as an uncertainty estimate.

@dataclass
class DeepEnsemblePredictor:
    """Deep ensemble uncertainty estimation.

    Trains M independent models (same architecture, different
    initialization and data order). Disagreement between models
    captures epistemic uncertainty.

    Empirically stronger than MC dropout, but M times the
    training cost. Standard: M = 5.

    Reference: Lakshminarayanan, Pritzel, Blundell, "Simple and
    Scalable Predictive Uncertainty Estimation Using Deep Ensembles"
    (NeurIPS, 2017).
    """
    models: List[nn.Module] = field(default_factory=list)

    @property
    def n_models(self) -> int:
        return len(self.models)

    @torch.no_grad()
    def predict(
        self, x: torch.Tensor
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Predict using all ensemble members.

        Returns:
            mean_pred: shape (batch, output_dim)
            std_pred: shape (batch, output_dim) - epistemic uncertainty
            all_preds: shape (M, batch, output_dim)
        """
        predictions = []
        for model in self.models:
            model.eval()
            pred = model(x).cpu().numpy()
            predictions.append(pred)

        all_preds = np.stack(predictions, axis=0)
        mean_pred = all_preds.mean(axis=0)
        std_pred = all_preds.std(axis=0)

        return mean_pred, std_pred, all_preds

    def predictive_entropy(self, x: torch.Tensor) -> np.ndarray:
        """Predictive entropy from ensemble mean probabilities."""
        mean_pred, _, _ = self.predict(x)
        mean_probs = np.clip(mean_pred, 1e-10, 1.0)
        return -np.sum(mean_probs * np.log(mean_probs), axis=-1)

    def mutual_information(self, x: torch.Tensor) -> np.ndarray:
        """Mutual information: epistemic uncertainty from ensemble disagreement."""
        mean_pred, _, all_preds = self.predict(x)

        mean_probs = np.clip(mean_pred, 1e-10, 1.0)
        total_entropy = -np.sum(mean_probs * np.log(mean_probs), axis=-1)

        all_probs = np.clip(all_preds, 1e-10, 1.0)
        per_model_entropy = -np.sum(all_probs * np.log(all_probs), axis=-1)
        expected_entropy = per_model_entropy.mean(axis=0)

        return total_entropy - expected_entropy


@dataclass
class HeteroscedasticEnsemble:
    """Ensemble where each member predicts both mean and variance.

    Captures aleatoric uncertainty (predicted variance) and epistemic
    uncertainty (disagreement between members) simultaneously.

    Each model outputs (mu, log_sigma^2) and is trained with
    negative log-likelihood loss:
      L = 0.5 * [log(sigma^2) + (y - mu)^2 / sigma^2]
    """
    models: List[nn.Module] = field(default_factory=list)

    @torch.no_grad()
    def predict(
        self, x: torch.Tensor
    ) -> Dict[str, np.ndarray]:
        """Predict with decomposed uncertainty.

        Returns dict with:
            mean: ensemble mean of predicted means
            aleatoric: mean of predicted variances (data noise)
            epistemic: variance of predicted means (model disagreement)
            total: aleatoric + epistemic
        """
        means = []
        variances = []

        for model in self.models:
            model.eval()
            output = model(x).cpu().numpy()
            # Assume output has 2 columns: [mu, log_sigma_sq]
            mu = output[:, 0]
            log_sigma_sq = output[:, 1]
            sigma_sq = np.exp(log_sigma_sq)
            means.append(mu)
            variances.append(sigma_sq)

        means = np.stack(means, axis=0)       # (M, batch)
        variances = np.stack(variances, axis=0)  # (M, batch)

        ensemble_mean = means.mean(axis=0)
        aleatoric = variances.mean(axis=0)  # E[sigma^2]
        epistemic = means.var(axis=0)       # Var[mu]

        return {
            "mean": ensemble_mean,
            "aleatoric": aleatoric,
            "epistemic": epistemic,
            "total": aleatoric + epistemic,
        }

34.7.1 Why Ensembles Work

The effectiveness of deep ensembles is surprising because each member sees the same data (no bagging). The diversity arises from two sources:

  1. Random initialization. Different initializations lead to different local minima of the loss landscape. Modern neural networks have many approximately equivalent solutions (the "loss surface" has many flat valleys), and different initializations find different valleys. These solutions agree on most inputs but disagree in regions of the input space where the data is sparse or the function is poorly constrained — exactly the regions where epistemic uncertainty is high.

  2. Random data ordering. SGD with different data shuffles follows different optimization trajectories, further diversifying the solutions.

Empirically, deep ensembles with $M = 5$ members outperformed MC dropout ($T = 50$ samples) across the calibration and uncertainty benchmarks reported by Lakshminarayanan et al. (2017). The cost is $M\times$ training time and $M\times$ inference time. For the climate deep learning model (anchor, Chapters 1, 4, 8, 9, 10, 23, 26), an ensemble of 5 models is standard practice: the cost is amortized over long inference pipelines, and the consequences of underestimating uncertainty are severe.

34.7.2 Heteroscedastic Deep Ensembles

The HeteroscedasticEnsemble class above extends deep ensembles to simultaneously capture both uncertainty types. Each ensemble member predicts not just a point estimate $\hat{\mu}(x)$ but also an input-dependent variance $\hat{\sigma}^2(x)$, trained with a Gaussian negative log-likelihood loss:

$$\mathcal{L} = \frac{1}{2} \left[ \log \hat{\sigma}^2(x) + \frac{(y - \hat{\mu}(x))^2}{\hat{\sigma}^2(x)} \right]$$

The total predictive variance decomposes as:

$$\text{Var}[y \mid x] \approx \underbrace{\frac{1}{M}\sum_{m=1}^{M} \hat{\sigma}_m^2(x)}_{\text{aleatoric}} + \underbrace{\frac{1}{M}\sum_{m=1}^{M}(\hat{\mu}_m(x) - \bar{\mu}(x))^2}_{\text{epistemic}}$$

This is the uncertainty decomposition discussed in Section 34.1, now made computationally concrete.
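The decomposition is an instance of the law of total variance, which can be checked by sampling from the implied mixture: draw an ensemble member $m$ uniformly, then $y \sim N(\mu_m, \sigma_m^2)$; the empirical variance of $y$ should match aleatoric plus epistemic. A small sketch with made-up per-member outputs:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 5
mu = np.array([1.0, 1.2, 0.8, 1.1, 0.9])        # per-member predicted means
sigma_sq = np.array([0.5, 0.4, 0.6, 0.5, 0.5])  # per-member predicted variances

aleatoric = sigma_sq.mean()   # E[sigma^2]
epistemic = mu.var()          # Var[mu]

# Sample from the uniform mixture over ensemble members
members = rng.integers(0, M, size=200_000)
y = rng.normal(mu[members], np.sqrt(sigma_sq[members]))

print(f"predicted total variance: {aleatoric + epistemic:.4f}")
print(f"empirical variance:       {y.var():.4f}")
```

Here the predicted total is $0.50 + 0.02 = 0.52$, and the Monte Carlo estimate agrees to a few decimal places, confirming that ensemble disagreement and mean predicted variance add up to the mixture's total variance.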


34.8 Decision-Making with Uncertainty

Uncertainty estimates are not valuable in themselves — they are valuable because they enable better decisions. This section covers three downstream applications: abstention, active learning, and risk assessment.

34.8.1 Abstention (Selective Prediction)

An abstention policy (also called selective prediction or reject option) allows the model to decline to make a prediction when its uncertainty exceeds a threshold. Abstained examples are routed to a human expert, a more expensive model, or a fallback system.

@dataclass
class AbstentionPolicy:
    """Selective prediction with uncertainty-based abstention.

    The model predicts only when uncertainty is below a threshold.
    Abstained examples are routed to a human or fallback system.

    Key metric: the accuracy-coverage tradeoff. As the uncertainty
    threshold decreases (more abstentions), accuracy on non-abstained
    examples should increase.
    """
    uncertainty_threshold: float
    uncertainty_type: str = "entropy"  # "entropy", "std", "mi"

    def decide(
        self,
        predictions: np.ndarray,
        uncertainties: np.ndarray,
    ) -> Dict[str, np.ndarray]:
        """Split predictions into accepted and abstained.

        Returns dict with:
            accepted_mask: bool array, True for accepted predictions
            coverage: fraction of predictions accepted
        """
        accepted = uncertainties <= self.uncertainty_threshold

        return {
            "accepted_mask": accepted,
            "coverage": float(accepted.mean()),
            "n_accepted": int(accepted.sum()),
            "n_abstained": int((~accepted).sum()),
        }

    @staticmethod
    def accuracy_coverage_curve(
        predictions: np.ndarray,
        true_labels: np.ndarray,
        uncertainties: np.ndarray,
        n_thresholds: int = 100,
    ) -> Dict[str, np.ndarray]:
        """Compute the accuracy-coverage tradeoff curve.

        For classification: accuracy of non-abstained predictions
        as a function of coverage (fraction non-abstained).
        """
        thresholds = np.linspace(
            uncertainties.min(), uncertainties.max(), n_thresholds
        )

        coverages = []
        accuracies = []

        if predictions.ndim > 1:
            pred_classes = predictions.argmax(axis=1)
        else:
            pred_classes = (predictions > 0.5).astype(int)

        for t in thresholds:
            mask = uncertainties <= t
            coverage = mask.mean()
            if mask.sum() > 0:
                acc = (pred_classes[mask] == true_labels[mask]).mean()
            else:
                acc = np.nan
            coverages.append(float(coverage))
            accuracies.append(float(acc))

        return {
            "coverages": np.array(coverages),
            "accuracies": np.array(accuracies),
            "thresholds": thresholds,
        }

The accuracy-coverage tradeoff: As we abstain on more examples (lower coverage), the accuracy on accepted examples should increase — we are keeping only the predictions the model is most confident about. The shape of this curve reveals how well the uncertainty estimate separates correct from incorrect predictions. A perfect uncertainty estimate would produce a curve that reaches 100% accuracy at some non-trivial coverage level. The area under the accuracy-coverage curve (AUACC) is a scalar summary of uncertainty quality.
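The AUACC summary can be computed with the trapezoid rule over the curve. A synthetic sketch (names and the error model are illustrative: uncertainty is deliberately correlated with error probability, so abstaining on uncertain examples raises accuracy):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
# Synthetic setup: higher uncertainty -> higher chance of being wrong
uncertainty = rng.uniform(0, 1, n)
correct = rng.uniform(0, 1, n) > 0.4 * uncertainty

# Sweep abstention thresholds, recording (coverage, accuracy) pairs
coverages, accuracies = [], []
for t in np.linspace(0, 1, 101):
    mask = uncertainty <= t
    if mask.any():
        coverages.append(mask.mean())
        accuracies.append(correct[mask].mean())

coverages = np.array(coverages)
accuracies = np.array(accuracies)

# Area under the accuracy-coverage curve (trapezoid rule)
auacc = np.sum(np.diff(coverages) * (accuracies[1:] + accuracies[:-1]) / 2)
full_acc = correct.mean()
print(f"accuracy at full coverage: {full_acc:.3f}, AUACC: {auacc:.3f}")
```

Because accuracy on the accepted subset rises as coverage shrinks, the AUACC exceeds the full-coverage accuracy; a useless uncertainty signal would make the curve flat and the two numbers equal.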

For Meridian Financial's credit scoring model, the abstention policy has a natural business interpretation: applications where the model is most uncertain are routed to human underwriters. At an uncertainty threshold that abstains on 15% of applications (those with the highest epistemic uncertainty from a 5-model ensemble), the automated approval accuracy improves from 88.3% to 94.1%, while the human underwriters receive a manageable caseload of high-uncertainty applications that genuinely require judgment.

34.8.2 Active Learning with Uncertainty

Active learning uses uncertainty to select the most informative examples for labeling, reducing annotation costs. The classic approach: select examples where the model is most epistemically uncertain (highest mutual information from MC dropout or deep ensembles), because these are the examples where additional labels would most reduce the model's ignorance.

@dataclass
class UncertaintyActiveLearner:
    """Pool-based active learning using uncertainty sampling.

    Selects examples from an unlabeled pool where the model is
    most epistemically uncertain (highest mutual information).
    These are examples where collecting a label would most reduce
    the model's ignorance.
    """
    batch_size: int = 100  # Number of examples to select per round
    strategy: str = "mutual_information"  # or "entropy", "margin"

    def select(
        self,
        pool_uncertainties: np.ndarray,
        pool_indices: np.ndarray,
    ) -> np.ndarray:
        """Select the most uncertain examples from the pool.

        Returns:
            Selected indices into the original pool.
        """
        if self.strategy in ("mutual_information", "entropy"):
            # Select highest uncertainty
            selected_positions = np.argsort(pool_uncertainties)[-self.batch_size:]
        elif self.strategy == "margin":
            # Select smallest margin (most confused between top-2 classes)
            # Here pool_uncertainties should be negative margins
            selected_positions = np.argsort(pool_uncertainties)[:self.batch_size]
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

        return pool_indices[selected_positions]

    @staticmethod
    def evaluate_acquisition(
        selected_uncertainties: np.ndarray,
        pool_uncertainties: np.ndarray,
    ) -> Dict[str, float]:
        """Evaluate whether selection targets high-uncertainty examples."""
        return {
            "mean_selected_uncertainty": float(selected_uncertainties.mean()),
            "mean_pool_uncertainty": float(pool_uncertainties.mean()),
            "uncertainty_ratio": float(
                selected_uncertainties.mean() / (pool_uncertainties.mean() + 1e-10)
            ),
        }

The connection to the StreamRec progressive project: in Chapter 22, Thompson sampling (a Bayesian exploration strategy) was used to balance exploration and exploitation in recommendation. Active learning with epistemic uncertainty is the supervised learning analog — it identifies which users and items the model understands least, enabling targeted data collection for those regions of the feature space.

34.8.3 Risk Assessment with Calibrated Uncertainty

In regulated domains — credit (Meridian Financial), pharma (MediCore), climate policy — the goal is not just to predict outcomes but to honestly quantify the risk of adverse outcomes.

For the MediCore pharmaceutical anchor (Chapters 3, 15, 16, 17, 18, 19, 21), uncertainty quantification directly addresses a regulatory requirement: the FDA expects submissions to characterize not just the average treatment effect but the uncertainty around the treatment effect, including both statistical uncertainty (epistemic — limited trial size) and biological variability (aleatoric — patient-to-patient variation). A heteroscedastic ensemble trained on clinical trial data produces exactly this decomposition: the epistemic uncertainty reflects how much the estimated treatment effect would change with more data, and the aleatoric uncertainty reflects the irreducible patient-to-patient variation.
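The decomposition described here follows the law of total variance: aleatoric uncertainty is the average of the members' predicted variances, and epistemic uncertainty is the variance of their predicted means. A minimal sketch with hypothetical array shapes, not MediCore's actual pipeline:

```python
import numpy as np

def decompose_uncertainty(means: np.ndarray, variances: np.ndarray):
    """Law-of-total-variance decomposition for a heteroscedastic ensemble.

    means, variances: shape (M, n) — each of M members predicts a Gaussian
    mean and variance for each of n examples.
    """
    aleatoric = variances.mean(axis=0)  # irreducible noise, averaged over members
    epistemic = means.var(axis=0)       # disagreement between members
    total = aleatoric + epistemic       # predictive variance
    return aleatoric, epistemic, total

# Example: 5 ensemble members, 3 examples
rng = np.random.default_rng(1)
al, ep, tot = decompose_uncertainty(rng.normal(size=(5, 3)), rng.random((5, 3)))
```

Collecting more trial data shrinks `epistemic` toward zero; `aleatoric` converges to the true patient-to-patient variance and cannot be reduced by more data.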

@dataclass
class RiskAssessmentReport:
    """Generate a risk assessment report with calibrated uncertainty.

    Combines point predictions, calibrated probabilities, conformal
    intervals, and uncertainty decomposition into a single report
    suitable for regulatory review.
    """
    predictions: np.ndarray
    calibrated_probs: np.ndarray
    conformal_intervals: np.ndarray  # shape (n, 2)
    epistemic_uncertainty: np.ndarray
    aleatoric_uncertainty: np.ndarray

    def risk_tier(self, thresholds: Tuple[float, float] = (0.1, 0.3)) -> np.ndarray:
        """Assign risk tiers based on calibrated probability.

        Low risk: calibrated_prob < thresholds[0]
        Medium risk: thresholds[0] <= calibrated_prob < thresholds[1]
        High risk: calibrated_prob >= thresholds[1]
        """
        tiers = np.full(len(self.calibrated_probs), "medium", dtype=object)
        tiers[self.calibrated_probs < thresholds[0]] = "low"
        tiers[self.calibrated_probs >= thresholds[1]] = "high"
        return tiers

    def flag_high_epistemic(self, threshold: float = None) -> np.ndarray:
        """Flag examples with high epistemic uncertainty.

        These are examples where the model's prediction is unreliable
        due to insufficient training data in this region of input space.
        """
        if threshold is None:
            threshold = np.percentile(self.epistemic_uncertainty, 90)
        return self.epistemic_uncertainty > threshold

    def summary(self) -> Dict[str, float]:
        return {
            "n_predictions": len(self.predictions),
            "mean_calibrated_prob": float(self.calibrated_probs.mean()),
            "mean_interval_width": float(
                (self.conformal_intervals[:, 1] - self.conformal_intervals[:, 0]).mean()
            ),
            "mean_epistemic": float(self.epistemic_uncertainty.mean()),
            "mean_aleatoric": float(self.aleatoric_uncertainty.mean()),
            "frac_high_epistemic": float(self.flag_high_epistemic().mean()),
        }
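The tiering rule in `risk_tier` is a two-threshold bucketing; with illustrative probabilities (values chosen for the example, not from any real portfolio):

```python
import numpy as np

# Illustrative calibrated default probabilities and the default thresholds
probs = np.array([0.02, 0.15, 0.45, 0.08, 0.31])
low_t, high_t = 0.1, 0.3

tiers = np.full(len(probs), "medium", dtype=object)
tiers[probs < low_t] = "low"      # below the first threshold
tiers[probs >= high_t] = "high"   # at or above the second threshold
print(list(tiers))  # ['low', 'medium', 'high', 'low', 'high']
```

The order of the two assignments does not matter here because the threshold intervals are disjoint; any probability matches at most one of the overriding masks.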

34.9 The StreamRec Progressive Project: Calibrating and Quantifying Uncertainty

This section applies the chapter's techniques to the StreamRec recommendation system (progressive project, Chapters 6-32).

Milestone M16: Uncertainty Quantification for StreamRec

34.9.1 Step 1: Calibration Diagnosis

The StreamRec click-prediction model (MLP with 40 features, trained in Chapter 7, deployed in Chapter 29) produces predicted click probabilities. We first diagnose calibration on the held-out test set.

# Assume: test_logits (n_test,), test_labels (n_test,) from StreamRec MLP
# test_logits are raw logits before sigmoid

# Uncalibrated probabilities
uncal_probs = 1.0 / (1.0 + np.exp(-test_logits))

# Reliability diagram
diag_uncal = CalibrationDiagnostics(test_labels, uncal_probs, n_bins=15)
print(f"Uncalibrated ECE: {diag_uncal.ece:.4f}")
print(f"Uncalibrated MCE: {diag_uncal.mce:.4f}")
# Expected output (typical for modern networks):
# Uncalibrated ECE: 0.0823
# Uncalibrated MCE: 0.1547

34.9.2 Step 2: Temperature Scaling

# Split held-out data: 50% calibration, 50% test
n = len(test_logits)
rng = np.random.RandomState(42)
perm = rng.permutation(n)
n_cal = n // 2
cal_logits = test_logits[perm[:n_cal]]
cal_labels = test_labels[perm[:n_cal]]
eval_logits = test_logits[perm[n_cal:]]
eval_labels = test_labels[perm[n_cal:]]

# Fit temperature scaling
temp_scaler = TemperatureScaler().fit(cal_logits, cal_labels)
print(f"Learned temperature: {temp_scaler.temperature:.4f}")
# Expected: T ≈ 1.8 - 2.5 (network is overconfident, so T > 1)

# Apply to evaluation set
cal_probs = temp_scaler.calibrate(eval_logits)
diag_cal = CalibrationDiagnostics(eval_labels, cal_probs, n_bins=15)
print(f"Post-temperature ECE: {diag_cal.ece:.4f}")
print(f"Post-temperature MCE: {diag_cal.mce:.4f}")
# Expected: ECE ≈ 0.012 - 0.025 (substantial improvement)

34.9.3 Step 3: Conformal Prediction Sets

For the StreamRec model, conformal prediction answers: "Given the model's uncertainty, which items might this user click?" We construct binary prediction sets (click / no-click) with 90% coverage.

# Binary classification conformal prediction
# cal_probs (from Step 2) are the calibrated probabilities on the
# evaluation split. Split them again: the first half calibrates the
# conformal threshold, the second half tests coverage.
n_eval = len(cal_probs)
n_conf = n_eval // 2
probs_2class = np.column_stack([1 - cal_probs, cal_probs])

conformal = SplitConformalClassifier(alpha=0.10)
conformal.calibrate(probs_2class[:n_conf], eval_labels[:n_conf])

coverage_results = conformal.evaluate_coverage(
    probs_2class[n_conf:], eval_labels[n_conf:]
)
print(f"Empirical coverage: {coverage_results['empirical_coverage']:.4f}")
print(f"Mean set size: {coverage_results['mean_set_size']:.4f}")
# Expected: coverage >= 0.90, mean set size ≈ 1.1 - 1.3
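Under the hood, split conformal classification reduces to one quantile computation: score each calibration example by $1 - \hat{p}(y_{\text{true}})$, take a finite-sample-corrected $(1-\alpha)$ quantile, and include any class whose probability clears the resulting threshold. A self-contained sketch of this logic (not a re-statement of the `SplitConformalClassifier` internals):

```python
import numpy as np

def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray,
                        alpha: float = 0.10) -> float:
    """Quantile of nonconformity scores s_i = 1 - p(true class).

    cal_probs: shape (n, K) class probabilities; cal_labels: shape (n,).
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample correction: ceil((n+1)(1-alpha))/n instead of (1-alpha)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(scores, level, method="higher"))

def prediction_sets(test_probs: np.ndarray, qhat: float) -> np.ndarray:
    """Boolean (n, K) membership matrix: include class c iff 1 - p_c <= qhat."""
    return test_probs >= 1.0 - qhat
```

When the model is confident and accurate on the calibration split, `qhat` is small and most prediction sets are singletons; a poorly fit model inflates `qhat` and produces more two-element sets, which is how the coverage guarantee is preserved.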

34.9.4 Step 4: MC Dropout Uncertainty

# Wrap StreamRec MLP in MC dropout predictor
# Assume streamrec_model is a PyTorch nn.Module with Dropout layers
mc_predictor = MCDropoutPredictor(streamrec_model, n_samples=50)

# Predict on evaluation set
x_eval = torch.tensor(eval_features, dtype=torch.float32)
mean_pred, std_pred, all_preds = mc_predictor.predict(x_eval)

# Compute mutual information (epistemic uncertainty)
mi = mc_predictor.mutual_information(x_eval)

# Identify least-confident users (top 10% by epistemic uncertainty)
least_confident_mask = mi >= np.percentile(mi, 90)
print(f"Least-confident users: {least_confident_mask.sum()}")
print(f"Mean MI (all): {mi.mean():.6f}")
print(f"Mean MI (least confident): {mi[least_confident_mask].mean():.6f}")
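For the binary click-prediction case, the mutual information computed above is the BALD score: the entropy of the mean prediction minus the mean entropy of the individual stochastic passes. A stand-alone sketch, assuming `mc_probs` holds the T sigmoid outputs per example:

```python
import numpy as np

def binary_entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy of a Bernoulli(p), clipped for numerical stability."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald_mutual_information(mc_probs: np.ndarray) -> np.ndarray:
    """Epistemic uncertainty from MC-dropout samples.

    mc_probs: shape (T, n) — predicted click probability per stochastic pass.
    MI = H(mean prediction) - mean(H(individual predictions)).
    """
    return binary_entropy(mc_probs.mean(axis=0)) - binary_entropy(mc_probs).mean(axis=0)

# Passes that agree → MI ≈ 0; passes that disagree → MI > 0
agree = np.full((50, 1), 0.9)
split = np.vstack([np.full((25, 1), 0.1), np.full((25, 1), 0.9)])
```

Note that `agree` has nonzero predictive entropy (aleatoric uncertainty) but zero mutual information: confident disagreement between passes, not low confidence itself, is what BALD measures.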

34.9.5 Step 5: Uncertainty-Based Abstention

# Build accuracy-coverage curve
acc_cov = AbstentionPolicy.accuracy_coverage_curve(
    predictions=mean_pred,
    true_labels=eval_labels,
    uncertainties=mi,
)

# Find the largest coverage at which accuracy still meets the 95% target
# (iterate from high coverage down, since accuracy falls as coverage grows)
target_acc = 0.95
for cov, acc in zip(
    acc_cov["coverages"][::-1], acc_cov["accuracies"][::-1]
):
    if not np.isnan(acc) and acc >= target_acc:
        print(f"95% accuracy achievable at {cov:.1%} coverage")
        print(f"  (abstain on {1 - cov:.1%} of predictions)")
        break
# Expected: ~85-90% coverage (abstain on 10-15% of predictions)

This milestone integrates the chapter's full UQ pipeline. The least-confident users identified in Step 4 are candidates for active learning (Chapter 22's Thompson sampling can prioritize exploration for these users) or for fallback to a rule-based system that does not require high-confidence predictions.


34.10 Uncertainty in Practice: A Comparative Summary

| Method | Type | Assumptions | Guarantees | Cost | Strengths | Weaknesses |
|---|---|---|---|---|---|---|
| Reliability diagram / ECE | Calibration evaluation | None (diagnostic) | None (descriptive) | Negligible | Universal, interpretable | Bin-dependent; does not fix the problem |
| Temperature scaling | Post-hoc recalibration | Uniform miscalibration | Improved ECE (empirical) | 1-parameter fit | Simple, preserves ranking, 1 parameter | Cannot fix non-uniform miscalibration |
| Platt scaling | Post-hoc recalibration | Logistic recalibration model | Improved ECE (empirical) | 2-parameter fit | Slightly more flexible than temperature | Still parametric |
| Isotonic regression | Post-hoc recalibration | Monotonic relationship | Improved ECE (empirical) | Non-parametric fit | Most flexible recalibration | Needs larger calibration set |
| Split conformal | Prediction sets | Exchangeability | Coverage $\geq 1-\alpha$ | Score computation | Distribution-free, finite-sample guarantee | Constant-width intervals; marginal guarantee |
| CQR | Adaptive intervals | Exchangeability + quantile model | Coverage $\geq 1-\alpha$ | Quantile model + calibration | Adaptive width, formal guarantee | Requires quantile regression model |
| ACI | Online conformal | None (adversarial shift OK) | Long-run average coverage | Online update | Works under distribution shift | Coverage may deviate in short term |
| MC dropout | Epistemic uncertainty | Dropout as variational inference | Approximate | $T \times$ inference | Free (no retraining), decomposition | Underestimates uncertainty; approximation quality varies |
| Deep ensemble | Both types | Diverse initialization | Empirical | $M \times$ training + inference | Best empirical performance, simple | Expensive to train; $M$ models to store |
| Heteroscedastic ensemble | Both (decomposed) | Gaussian output model | Empirical | $M \times$ training + inference | Clean aleatoric/epistemic decomposition | Gaussian assumption may not hold |

34.11 Common Pitfalls

  1. Confusing calibration with discrimination. A model that predicts the base rate for every input is perfectly calibrated (ECE = 0) but has no discriminative power (AUC = 0.5). Always report calibration alongside discrimination metrics. Calibration without discrimination is useless; discrimination without calibration is dangerous.

  2. Evaluating calibration on the training set. Models are often well-calibrated on training data (they have memorized the correct probabilities) but miscalibrated on test data. Always evaluate calibration on held-out data, and always use a separate calibration set for fitting recalibration parameters (three-way split: train/calibrate/test).

  3. Ignoring subgroup calibration. Global ECE can hide severe subgroup miscalibration. Always disaggregate calibration analysis by relevant subgroups — the same subgroups used in the fairness audit (Chapter 31).

  4. Treating conformal coverage as conditional. Conformal prediction guarantees marginal coverage ($P(Y \in C(X)) \geq 1 - \alpha$ averaged over $X$), not conditional coverage ($P(Y \in C(X) | X = x) \geq 1 - \alpha$ for all $x$). In regions where the model is poor, conformal intervals may be too narrow; in regions where the model is good, they may be too wide.

  5. Using MC dropout uncertainty without validation. MC dropout uncertainty quality depends on the architecture, the dropout rate, and the data. Always validate uncertainty estimates against ground truth: if high-uncertainty examples are not more likely to be wrong, the uncertainty signal is not useful.

  6. Treating ensemble disagreement as the only form of uncertainty. If all ensemble members make the same mistake (e.g., all are trained on the same biased data), they will agree confidently on the wrong answer. Ensemble disagreement captures epistemic uncertainty within the hypothesis class but not uncertainty about the hypothesis class itself (model misspecification).

  7. Ignoring calibration in production monitoring. Calibration degrades over time as the data distribution shifts. ECE should be monitored continuously (Chapter 30's monitoring framework), with recalibration triggered when ECE exceeds a threshold or when conformal coverage drops below target.
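Pitfall 1 is easy to demonstrate: a constant base-rate predictor is (up to sampling noise) perfectly calibrated yet completely non-discriminative. A synthetic check:

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.3).astype(int)   # 30% positive base rate
p = np.full(len(y), 0.3)                   # constant base-rate predictor

# Calibration: all predictions fall in one confidence bin, so ECE is just
# the gap between mean confidence and the empirical positive rate
ece = abs(p.mean() - y.mean())             # ≈ 0 up to sampling noise

# Discrimination: constant scores cannot rank positives above negatives
pos, neg = p[y == 1], p[y == 0]
auc = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
print(f"ECE ≈ {ece:.3f}, AUC = {auc:.2f}")  # ECE near 0, AUC exactly 0.5
```

The AUC here is computed as the Mann-Whitney statistic with ties counted as one half; with a constant score, every positive-negative pair is a tie, so the AUC is exactly 0.5.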


34.12 Connecting to the Broader Curriculum

From Chapter 3 (Probability and Inference): Calibration is the frequentist requirement that probability estimates match long-run frequencies. The Bayesian approach (Chapters 20-22) provides a natural framework for uncertainty through posterior distributions. This chapter bridges the gap: MC dropout and deep ensembles are approximate Bayesian methods, while conformal prediction is a frequentist approach that provides finite-sample guarantees without distributional assumptions.

From Chapter 20 (Bayesian Thinking): Bayesian posterior predictive distributions naturally decompose into aleatoric and epistemic uncertainty. The methods in this chapter approximate this decomposition for neural networks that are too large for exact Bayesian inference.

From Chapter 23 (Advanced Time Series): ACI for conformal prediction was previewed in Chapter 23's temporal forecasting context. The AdaptiveConformalClassifier here is the general-purpose version of the temporal conformal wrapper used for the StreamRec engagement forecaster.

From Chapter 30 (Monitoring): Calibration is a key monitoring signal. ECE should be tracked alongside the data drift (PSI/KS) and performance metrics (AUC, Recall@20) in the four-layer Grafana dashboard. When ECE exceeds its SLO threshold, the incident response runbook should trigger recalibration.

From Chapter 31 (Fairness): Group-conditional calibration is a fairness criterion. The impossibility theorem implies that perfect calibration across all groups is incompatible with equalized odds when base rates differ. The practitioner must choose which constraint to prioritize — and uncertainty quantification provides the tools to make that choice transparently.

To Chapter 35 (Interpretability): Uncertainty estimates enhance interpretability. An explanation of "the model predicts this applicant will default with 73% probability" is more useful when accompanied by "and the model is highly uncertain about this prediction (epistemic uncertainty in the 95th percentile) because this applicant profile is underrepresented in training data."


Chapter Summary

Uncertainty quantification transforms a model from a black-box oracle that issues confident pronouncements into a calibrated, self-aware system that communicates what it knows and what it does not know. Calibration (Sections 34.2-34.3) ensures that probability estimates match observed frequencies — a necessary condition for predictions to be usable as inputs to risk calculations, decision rules, and regulatory reports. Conformal prediction (Sections 34.4-34.5) provides something stronger: distribution-free prediction sets with finite-sample coverage guarantees, requiring only the assumption of exchangeability. MC dropout and deep ensembles (Sections 34.6-34.7) decompose uncertainty into aleatoric (irreducible, data-inherent) and epistemic (reducible, knowledge-limited) components, enabling different responses: communicate aleatoric uncertainty to decision-makers, reduce epistemic uncertainty through data collection. Abstention, active learning, and risk assessment (Section 34.8) are the downstream applications that make uncertainty quantification practically valuable.

The central lesson: a model that honestly communicates its uncertainty is more valuable than a model that is slightly more accurate but confidently wrong. In the Meridian Financial credit example, the ensemble-based abstention policy — routing 15% of applications to human underwriters — improved automated approval accuracy from 88.3% to 94.1% while reducing charge-offs by $8M per quarter. In the climate modeling example (Case Study 1), decomposing model, data, and scenario uncertainty enabled policymakers to distinguish "we need more observations" from "this is inherently unpredictable" — a distinction that changes policy responses. The cost of uncertainty quantification — temperature scaling is nearly free; a 5-model ensemble costs 5x training — is almost always justified by the improvement in downstream decision-making.

Know How Your Model Is Wrong: This chapter provides the tools to answer a specific version of the question: "How wrong is your model about its own confidence?" The reliability diagram shows you. The ECE quantifies it. Temperature scaling corrects the easy part. Conformal prediction provides formal guarantees regardless of the model. Deep ensembles reveal where the model is uncertain and why (aleatoric vs. epistemic). Together, these tools ensure that when your model says "80% sure," it means it.