Chapter 34: Exercises

Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field


Calibration Fundamentals

Exercise 34.1 (*)

A binary classifier for loan default prediction produces the following test-set predictions and outcomes. The predictions are grouped into 5 bins:

Bin         Mean Predicted Prob   Observed Freq   Count
[0.0, 0.2)  0.10                  0.08            4,200
[0.2, 0.4)  0.31                  0.28            2,800
[0.4, 0.6)  0.52                  0.41            1,500
[0.6, 0.8)  0.71                  0.59              800
[0.8, 1.0]  0.91                  0.78              700

(a) Compute the ECE.

(b) Compute the MCE. Which bin is worst-calibrated?

(c) Is the model overconfident or underconfident? How can you tell from the table alone?

(d) Sketch the reliability diagram. What does the gap between the diagonal and the observed frequencies tell a risk manager at Meridian Financial?
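The ECE and MCE arithmetic in parts (a) and (b) can be checked directly from the table. A minimal sketch, with the bin statistics transcribed from above:

```python
import numpy as np

# Bin statistics transcribed from the table above
conf = np.array([0.10, 0.31, 0.52, 0.71, 0.91])   # mean predicted prob
obs = np.array([0.08, 0.28, 0.41, 0.59, 0.78])    # observed frequency
count = np.array([4200, 2800, 1500, 800, 700])

weights = count / count.sum()   # bin weights (N = 10,000)
gaps = np.abs(conf - obs)       # per-bin calibration gap
ece = np.sum(weights * gaps)    # expected calibration error: weighted mean gap
mce = gaps.max()                # maximum calibration error: worst single bin
print(f"ECE = {ece:.4f}, MCE = {mce:.2f}")
```

Note for part (c) that every bin's mean predicted probability exceeds its observed frequency, which is visible from the table without any computation.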


Exercise 34.2 (*)

A data scientist reports that their model has ECE = 0.003 and concludes "our model is perfectly calibrated." Name two reasons why this conclusion may be wrong, citing specific pitfalls from Section 34.11.


Exercise 34.3 (*)

Using the CalibrationDiagnostics class from Section 34.2, write code to:

(a) Generate 10,000 synthetic predictions from a model that is well-calibrated (true probabilities sampled from Uniform(0, 1), labels sampled from Bernoulli(p)). Verify that ECE is close to 0.

(b) Generate 10,000 synthetic predictions from an overconfident model: true probabilities are Uniform(0, 1), but the model reports $\hat{p} = 0.5 + 1.5 \cdot (p - 0.5)$ (clipped to [0, 1]). Compute ECE and plot the reliability diagram.

(c) Apply temperature scaling to the overconfident model's logits and show that ECE decreases.

import numpy as np

# CalibrationDiagnostics (Section 34.2) and TemperatureScaler are assumed
# to be in scope from the chapter's code.

# (a) Well-calibrated model
rng = np.random.RandomState(42)
true_probs = rng.uniform(0, 1, 10000)
labels = rng.binomial(1, true_probs)

diag_a = CalibrationDiagnostics(labels, true_probs, n_bins=15)
print(f"Well-calibrated ECE: {diag_a.ece:.4f}")
# Expected: ECE ≈ 0.005 - 0.015 (small, due to finite-sample variance)

# (b) Overconfident model
overconf_probs = np.clip(0.5 + 1.5 * (true_probs - 0.5), 0, 1)
diag_b = CalibrationDiagnostics(labels, overconf_probs, n_bins=15)
print(f"Overconfident ECE: {diag_b.ece:.4f}")
# Expected: ECE ≈ 0.08 - 0.12

# (c) Temperature scaling
# Convert overconfident probabilities to logits
logits_overconf = np.log(overconf_probs / (1 - overconf_probs + 1e-10) + 1e-10)
# Use a calibration set (first 5000) and test set (last 5000)
temp = TemperatureScaler().fit(logits_overconf[:5000], labels[:5000])
cal_probs_c = temp.calibrate(logits_overconf[5000:])
diag_c = CalibrationDiagnostics(labels[5000:], cal_probs_c, n_bins=15)
print(f"Post-temperature ECE: {diag_c.ece:.4f}")
print(f"Temperature: {temp.temperature:.4f}")

Exercise 34.4 (**)

Explain why temperature scaling preserves the model's ranking of examples (AUC, Recall@K, NDCG) while changing calibration. Prove that for any two inputs $x_1, x_2$, if $\hat{p}(x_1) > \hat{p}(x_2)$ before temperature scaling, then $\hat{p}_T(x_1) > \hat{p}_T(x_2)$ after temperature scaling, regardless of $T > 0$.


Exercise 34.5 (**)

The Meridian Financial credit scoring model has the following subgroup calibration results after Platt scaling:

Subgroup         ECE     Bias Direction   n
Young (18-30)    0.068   Overconfident    3,200
Middle (30-50)   0.015   Neutral          8,400
Senior (50+)     0.021   Underconfident   4,100

(a) Why might the model be overconfident for young borrowers? Propose both a data-driven and a modeling explanation.

(b) Design a group-conditional recalibration strategy. Should you use temperature scaling or isotonic regression for the young subgroup? Justify your choice given the subgroup size.

(c) The compliance team asks: "Is it legal to fit different calibration models for different age groups? Doesn't that use a protected attribute?" Write a 3-4 sentence response explaining why post-hoc calibration per subgroup is different from using age as a prediction feature.


Conformal Prediction

Exercise 34.6 (*)

A 5-class image classifier produces the following calibration scores (nonconformity scores $s_i = 1 - \hat{p}_{y_i}$) on a calibration set of 200 examples:

The sorted scores range from 0.02 to 0.91, with the 180th smallest being 0.38 and the 190th smallest being 0.52.

(a) For $\alpha = 0.10$ (90% coverage), what is the conformal threshold $\hat{q}$?

(b) A new test image has predicted probabilities [0.05, 0.12, 0.55, 0.20, 0.08]. What is the prediction set?

(c) Another test image has probabilities [0.22, 0.21, 0.20, 0.19, 0.18]. What is the prediction set? What does this say about conformal prediction for uncertain inputs?
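The mechanics of parts (a)-(c) can be sketched as follows. The threshold values passed to `prediction_set` below are illustrative, not the exercise's answer, and the calibration scores are synthetic:

```python
import numpy as np

def conformal_threshold(scores, alpha):
    """q-hat = the ceil((1 - alpha)(n + 1))-th smallest calibration score."""
    n = len(scores)
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    return np.sort(scores)[k - 1]

def prediction_set(probs, qhat):
    """Include every class whose nonconformity score 1 - p is <= q-hat."""
    return [c for c, p in enumerate(probs) if 1 - p <= qhat]

# With n = 200 and alpha = 0.10, ceil(0.9 * 201) = 181, so q-hat is the
# 181st smallest calibration score
scores = np.arange(1, 201) / 200.0          # synthetic, evenly spaced scores
print(conformal_threshold(scores, 0.10))    # the 181st smallest value

# Confident prediction: only the dominant class survives a tight threshold
print(prediction_set([0.05, 0.12, 0.55, 0.20, 0.08], qhat=0.50))
# Near-uniform prediction: almost every class enters the set
print(prediction_set([0.22, 0.21, 0.20, 0.19, 0.18], qhat=0.85))
```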


Exercise 34.7 (*)

Using the SplitConformalRegressor class from Section 34.4.3, implement a conformal prediction experiment:

(a) Train a linear regression model on a synthetic regression dataset (scikit-learn has removed its Boston Housing loader, so the code below uses make_regression). Split the data 60/20/20 into train/calibrate/test.

(b) Compute conformal prediction intervals at $\alpha = 0.10$ (90% coverage). Report empirical coverage and mean interval width.

(c) Repeat at $\alpha = 0.05$ and $\alpha = 0.20$. Plot the coverage-width tradeoff.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# SplitConformalRegressor (Section 34.4.3) is assumed to be in scope.

# Generate data
X, y = make_regression(n_samples=2000, n_features=10, noise=20.0, random_state=42)

# 60/20/20 split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train model
model = LinearRegression().fit(X_train, y_train)

# Conformal prediction at alpha = 0.10
cal_preds = model.predict(X_cal)
test_preds = model.predict(X_test)

conformal = SplitConformalRegressor(alpha=0.10)
conformal.calibrate(cal_preds, y_cal)
results = conformal.evaluate_coverage(test_preds, y_test)
print(f"Coverage: {results['empirical_coverage']:.4f} (target: 0.90)")
print(f"Mean width: {results['mean_width']:.2f}")

Exercise 34.8 (**)

The coverage guarantee of conformal prediction requires exchangeability. For each of the following scenarios, state whether exchangeability holds and whether conformal prediction will maintain its coverage guarantee:

(a) Calibration data is from January 2025; test data is from January 2025 (same time period, IID sample).

(b) Calibration data is from January 2025; test data is from July 2025 (gradual distribution shift due to seasonality).

(c) Calibration data is sampled uniformly from the population; test data is the same population but stratified by age group.

(d) Calibration data is from the US market; test data is from the EU market (different distribution).

(e) For scenario (b), how would Adaptive Conformal Inference (ACI) help?


Exercise 34.9 (**)

Implement conformalized quantile regression (CQR) for a synthetic heteroscedastic regression problem where the noise variance depends on the input:

$$y = 2x + \sin(3x) + \epsilon(x), \quad \epsilon(x) \sim \mathcal{N}(0, (0.5 + |x|)^2)$$

(a) Train a neural network with quantile loss at quantiles 0.05 and 0.95.

(b) Apply conformal calibration using the ConformalizedQuantileRegressor class.

(c) Compare the interval widths of CQR to standard split conformal. Show that CQR produces narrower intervals where the noise is low and wider intervals where the noise is high, while maintaining the same coverage guarantee.

import numpy as np
import torch
import torch.nn as nn

# ConformalizedQuantileRegressor and SplitConformalRegressor (Section 34.4.3)
# are assumed to be in scope from the chapter's code.

# Generate heteroscedastic data
rng = np.random.RandomState(42)
n = 3000
x = rng.uniform(-3, 3, n)
noise_std = 0.5 + np.abs(x)
y = 2 * x + np.sin(3 * x) + rng.normal(0, noise_std)

# Split: 60% train, 20% calibrate, 20% test
idx = rng.permutation(n)
n_train, n_cal = int(0.6 * n), int(0.2 * n)
x_train, y_train = x[idx[:n_train]], y[idx[:n_train]]
x_cal, y_cal = x[idx[n_train:n_train + n_cal]], y[idx[n_train:n_train + n_cal]]
x_test, y_test = x[idx[n_train + n_cal:]], y[idx[n_train + n_cal:]]


class QuantileNet(nn.Module):
    """Neural network for quantile regression (two quantiles)."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # Output: [lower_quantile, upper_quantile]
        )

    def forward(self, x):
        return self.net(x)


def quantile_loss(pred, target, quantiles):
    """Pinball loss for multiple quantiles."""
    target = target.unsqueeze(1).expand_as(pred)
    errors = target - pred
    loss = torch.zeros_like(errors)
    for i, q in enumerate(quantiles):
        loss[:, i] = torch.where(
            errors[:, i] >= 0,
            q * errors[:, i],
            (q - 1) * errors[:, i],
        )
    return loss.mean()


# Training loop
quantiles_target = [0.05, 0.95]
model_qr = QuantileNet(hidden_dim=64)
optimizer = torch.optim.Adam(model_qr.parameters(), lr=0.001)

x_train_t = torch.tensor(x_train, dtype=torch.float32).unsqueeze(1)
y_train_t = torch.tensor(y_train, dtype=torch.float32)

for epoch in range(500):
    optimizer.zero_grad()
    pred = model_qr(x_train_t)
    loss = quantile_loss(pred, y_train_t, quantiles_target)
    loss.backward()
    optimizer.step()

# Predict quantiles on calibration and test sets
model_qr.eval()
with torch.no_grad():
    cal_quantiles = model_qr(
        torch.tensor(x_cal, dtype=torch.float32).unsqueeze(1)
    ).numpy()
    test_quantiles = model_qr(
        torch.tensor(x_test, dtype=torch.float32).unsqueeze(1)
    ).numpy()

# Conformalize
cqr = ConformalizedQuantileRegressor(alpha=0.10)
cqr.calibrate(cal_quantiles[:, 0], cal_quantiles[:, 1], y_cal)
cqr_intervals = cqr.predict_intervals(test_quantiles[:, 0], test_quantiles[:, 1])

# Compare to standard conformal
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(x_train.reshape(-1, 1), y_train)
cal_preds_lr = lr.predict(x_cal.reshape(-1, 1))
test_preds_lr = lr.predict(x_test.reshape(-1, 1))

std_conformal = SplitConformalRegressor(alpha=0.10)
std_conformal.calibrate(cal_preds_lr, y_cal)
std_intervals = std_conformal.predict_intervals(test_preds_lr)

# Evaluate
cqr_widths = cqr_intervals[:, 1] - cqr_intervals[:, 0]
std_widths = std_intervals[:, 1] - std_intervals[:, 0]

covered_cqr = ((y_test >= cqr_intervals[:, 0]) & (y_test <= cqr_intervals[:, 1]))
covered_std = ((y_test >= std_intervals[:, 0]) & (y_test <= std_intervals[:, 1]))

print(f"CQR coverage: {covered_cqr.mean():.4f}, mean width: {cqr_widths.mean():.2f}")
print(f"Std conformal coverage: {covered_std.mean():.4f}, mean width: {std_widths.mean():.2f}")
print(f"CQR width std: {cqr_widths.std():.2f} (adaptive)")
print(f"Std width std: {std_widths.std():.2f} (constant)")

Epistemic Uncertainty

Exercise 34.10 (*)

Explain the difference between predictive entropy and mutual information in the MC dropout framework. A test example has:

- Predictive entropy = 0.95 nats
- Mutual information = 0.05 nats

(a) What type of uncertainty dominates for this example?

(b) Would collecting more training data similar to this example likely reduce the model's uncertainty? Why or why not?

(c) Give a concrete example from the StreamRec recommendation system where this pattern (high entropy, low MI) would occur.
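The decomposition behind these two quantities can be checked on toy MC dropout samples. A sketch (the probability values are illustrative, not the exercise's):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy in nats along the last axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def decompose(samples):
    """samples: (T, n_classes) class probabilities from T stochastic passes."""
    total = entropy(samples.mean(axis=0))   # predictive entropy
    aleatoric = entropy(samples).mean()     # expected per-pass entropy
    return total, total - aleatoric         # (entropy, mutual information)

# Every pass agrees on a genuinely ambiguous prediction:
# high entropy, near-zero mutual information (aleatoric dominates)
h1, mi1 = decompose(np.tile([0.6, 0.4], (50, 1)))

# Passes disagree confidently with each other:
# high entropy AND high mutual information (epistemic dominates)
disagree = np.array([[0.95, 0.05]] * 25 + [[0.05, 0.95]] * 25)
h2, mi2 = decompose(disagree)
print(f"agree:    H = {h1:.3f}, MI = {mi1:.3f}")
print(f"disagree: H = {h2:.3f}, MI = {mi2:.3f}")
```

The first case mirrors this exercise's profile (high entropy, low MI); the second mirrors Exercise 34.11's.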


Exercise 34.11 (**)

Another test example has:

- Predictive entropy = 0.92 nats
- Mutual information = 0.80 nats

(a) What type of uncertainty dominates?

(b) What action should a data scientist take for examples with this uncertainty profile?

(c) In the Meridian Financial credit scoring context, describe a borrower profile that would likely produce this pattern. Why?


Exercise 34.12 (**)

Implement MC dropout for a 3-layer MLP on a synthetic 2D classification dataset. Visualize the uncertainty landscape:

(a) Train the MLP with dropout (rate 0.3) on the two-moons dataset (sklearn's make_moons), which provides a nonlinear decision boundary.

(b) Create a 2D grid of points covering the input space (including regions far from training data).

(c) Compute mutual information at each grid point using the MCDropoutPredictor class with $T = 50$.

(d) Plot a heatmap of mutual information overlaid with the training points. Verify that epistemic uncertainty is highest far from training data and near the decision boundary.

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate training data
X_train, y_train = make_moons(n_samples=500, noise=0.15, random_state=42)

# Define simple MLP with dropout. The network outputs logits:
# nn.CrossEntropyLoss applies log-softmax internally, so a Softmax layer
# inside the model would be a bug.
class MLP(nn.Module):
    def __init__(self, dropout_rate=0.3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2, 64),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(64, 2),  # logits, not probabilities
        )

    def forward(self, x):
        return self.layers(x)

# Train
model = MLP(dropout_rate=0.3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

X_t = torch.tensor(X_train, dtype=torch.float32)
y_t = torch.tensor(y_train, dtype=torch.long)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_t), y_t)
    loss.backward()
    optimizer.step()

# Create grid
xx, yy = np.meshgrid(np.linspace(-2.5, 3.5, 100), np.linspace(-2.0, 2.5, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
grid_t = torch.tensor(grid, dtype=torch.float32)

# MC dropout uncertainty. MCDropoutPredictor (from the chapter) is assumed
# to keep dropout active at inference and apply softmax to the logits.
mc = MCDropoutPredictor(model, n_samples=50)
mi = mc.mutual_information(grid_t)

# Plot
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
mi_grid = mi.reshape(xx.shape)
contour = ax.contourf(xx, yy, mi_grid, levels=20, cmap="YlOrRd")
plt.colorbar(contour, label="Mutual Information (epistemic uncertainty)")
ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
           c="blue", s=10, alpha=0.5, label="Class 0")
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
           c="green", s=10, alpha=0.5, label="Class 1")
ax.legend()
ax.set_title("Epistemic Uncertainty Landscape (MC Dropout)")
plt.tight_layout()
plt.savefig("epistemic_uncertainty_landscape.png", dpi=150)

Exercise 34.13 (**)

Compare MC dropout ($T = 50$, dropout rate 0.2) and a deep ensemble ($M = 5$) on the same 2D classification task from Exercise 34.12:

(a) Train both models. Plot the uncertainty landscape (mutual information) for each.

(b) Compare the magnitude of uncertainty estimates. Which method produces higher epistemic uncertainty in out-of-distribution regions?

(c) Compare wall-clock training time and inference time. Report the ratio.


Exercise 34.14 (***)

Implement a heteroscedastic neural network that predicts both mean and variance for a regression task:

(a) Define a network with two output heads: $\hat{\mu}(x)$ and $\log \hat{\sigma}^2(x)$.

(b) Implement the Gaussian NLL loss: $\mathcal{L} = \frac{1}{2}[\log \hat{\sigma}^2 + (y - \hat{\mu})^2 / \hat{\sigma}^2]$.

(c) Train on the heteroscedastic dataset from Exercise 34.9. Show that the predicted variance $\hat{\sigma}^2(x)$ increases with $|x|$, matching the true noise structure.

(d) Combine 5 such networks into a HeteroscedasticEnsemble. Decompose total variance into aleatoric and epistemic components. Verify that aleatoric uncertainty tracks the true noise variance while epistemic uncertainty is highest at the boundaries of the training data.

class HeteroscedasticNet(nn.Module):
    """Network predicting mean and log-variance."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden_dim, 1)
        self.logvar_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        h = self.shared(x)
        mu = self.mean_head(h)
        log_sigma_sq = self.logvar_head(h)
        return torch.cat([mu, log_sigma_sq], dim=-1)


def gaussian_nll_loss(output, target):
    """Gaussian negative log-likelihood loss."""
    mu = output[:, 0]
    log_sigma_sq = output[:, 1]
    sigma_sq = torch.exp(log_sigma_sq)
    loss = 0.5 * (log_sigma_sq + (target - mu) ** 2 / sigma_sq)
    return loss.mean()


# Train 5 models with different random seeds
ensemble_models = []
for seed in range(5):
    torch.manual_seed(seed)
    model_h = HeteroscedasticNet(hidden_dim=64)
    optimizer = torch.optim.Adam(model_h.parameters(), lr=0.001)

    x_t = torch.tensor(x_train, dtype=torch.float32).unsqueeze(1)
    y_t = torch.tensor(y_train, dtype=torch.float32)

    for epoch in range(500):
        optimizer.zero_grad()
        pred = model_h(x_t)
        loss = gaussian_nll_loss(pred, y_t)
        loss.backward()
        optimizer.step()

    ensemble_models.append(model_h)

ensemble = HeteroscedasticEnsemble(models=ensemble_models)

Decision-Making with Uncertainty

Exercise 34.15 (*)

A model produces predictions for 1,000 test examples. The accuracy on all 1,000 is 82%. If the model abstains on the 100 most uncertain examples (by mutual information), the accuracy on the remaining 900 is 89%.

(a) Is the uncertainty estimate useful for selective prediction? How can you tell?

(b) What is the accuracy on the 100 abstained examples? Show your calculation.

(c) If each correct prediction earns \$10 and each incorrect prediction costs \$50, what is the expected profit per prediction with and without abstention (assuming abstained examples are handled by a human at \$20 each)?
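The bookkeeping for parts (b) and (c) takes only a few lines. A sketch of the arithmetic, using the dollar amounts given above:

```python
n, acc_all = 1000, 0.82
n_keep, acc_keep = 900, 0.89
n_abstain = n - n_keep

# (b) Correct predictions overall, minus correct among the kept 900
correct_abstained = round(n * acc_all - n_keep * acc_keep)
acc_abstained = correct_abstained / n_abstain

# (c) Expected profit per example, with and without abstention
profit_without = acc_all * 10 - (1 - acc_all) * 50
profit_with = (n_keep * (acc_keep * 10 - (1 - acc_keep) * 50)
               - n_abstain * 20) / n
print(f"abstained accuracy: {acc_abstained:.0%}")
print(f"profit/example: without = ${profit_without:.2f}, with = ${profit_with:.2f}")
```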


Exercise 34.16 (**)

Implement the full accuracy-coverage curve for a model with uncertainty estimates:

(a) Using the AbstentionPolicy.accuracy_coverage_curve method, generate the curve for a model of your choice (you may use the MC dropout model from Exercise 34.12).

(b) Compute the AUACC (area under the accuracy-coverage curve). A higher AUACC indicates better uncertainty estimates.

(c) Compare AUACC for three uncertainty measures: (i) predictive entropy, (ii) mutual information, (iii) the maximum softmax probability. Which correlates best with prediction correctness?


Exercise 34.17 (**)

Design an active learning experiment for the StreamRec model:

(a) Starting with 10% of the labeled training data, use the UncertaintyActiveLearner class (mutual information strategy) to select the next 5% for labeling.

(b) Retrain the model on the enlarged training set. Compare the test accuracy to a model trained on a random 15% sample.

(c) Repeat for 5 rounds (10% → 15% → 20% → 25% → 30%). Plot test accuracy vs. labeled fraction for uncertainty sampling and random sampling. At what labeled fraction does uncertainty sampling match random sampling's performance at 30%?


Exercise 34.18 (***)

The Meridian Financial risk team wants to set the abstention threshold so that the model's automated decisions satisfy two constraints simultaneously:

  1. Accuracy on accepted predictions $\geq 95\%$
  2. Coverage (fraction accepted) $\geq 80\%$

(a) Is it always possible to satisfy both constraints? Under what conditions is it impossible?

(b) Write code that finds the maximum coverage subject to the accuracy constraint using the accuracy-coverage curve.

(c) The risk team adds a third constraint: subgroup coverage must be at least 70% for every age group (18-30, 30-50, 50+). Modify your code to check this constraint. Why might this constraint be difficult to satisfy simultaneously with the global accuracy constraint?

from typing import Dict, Optional

def find_optimal_threshold(
    predictions: np.ndarray,
    true_labels: np.ndarray,
    uncertainties: np.ndarray,
    min_accuracy: float = 0.95,
    min_coverage: float = 0.80,
    min_subgroup_coverage: float = 0.70,
    subgroup_labels: Optional[np.ndarray] = None,
) -> Dict:
    """Find the threshold that maximizes coverage subject to constraints."""
    thresholds = np.linspace(
        uncertainties.min(), uncertainties.max(), 1000
    )

    pred_classes = (predictions > 0.5).astype(int) if predictions.ndim == 1 else (
        predictions.argmax(axis=1)
    )

    best_coverage = 0.0
    best_threshold = None

    for t in thresholds:
        mask = uncertainties <= t
        coverage = mask.mean()

        if coverage < min_coverage:
            continue

        if mask.sum() == 0:
            continue

        accuracy = (pred_classes[mask] == true_labels[mask]).mean()
        if accuracy < min_accuracy:
            continue

        # Check subgroup coverage if provided
        if subgroup_labels is not None:
            subgroup_ok = True
            for g in np.unique(subgroup_labels):
                g_mask = subgroup_labels == g
                g_coverage = mask[g_mask].mean()
                if g_coverage < min_subgroup_coverage:
                    subgroup_ok = False
                    break
            if not subgroup_ok:
                continue

        if coverage > best_coverage:
            best_coverage = coverage
            best_threshold = t

    return {
        "threshold": best_threshold,
        "coverage": best_coverage,
        "feasible": best_threshold is not None,
    }

Conformal Prediction (Advanced)

Exercise 34.19 (***)

Prove the marginal coverage guarantee of split conformal prediction. Specifically, show that if $(X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, Y_{n+1})$ are exchangeable and the nonconformity scores $s_i = s(X_i, Y_i)$ are computed using a score function $s$ that does not depend on the ordering:

$$P(Y_{n+1} \in C(X_{n+1})) \geq 1 - \alpha$$

where $C(x) = \{y : s(x, y) \leq \hat{q}\}$ and $\hat{q}$ is the $\lceil(1-\alpha)(n+1)\rceil / n$ quantile of $\{s_1, \ldots, s_n\}$.

Hint: Use the fact that the rank of $s_{n+1}$ among $\{s_1, \ldots, s_{n+1}\}$ is uniformly distributed over $\{1, \ldots, n+1\}$ by exchangeability.
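A Monte Carlo check of the statement being proved can build intuition before the proof. A sketch, where exchangeability holds because all $n + 1$ scores are drawn IID:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, trials = 0.10, 200, 5000

covered = 0
for _ in range(trials):
    s = rng.normal(size=n + 1)        # n calibration scores + 1 test score
    k = int(np.ceil((1 - alpha) * (n + 1)))
    qhat = np.sort(s[:n])[k - 1]      # ceil((1-alpha)(n+1))-th smallest score
    covered += s[n] <= qhat

coverage = covered / trials
print(f"empirical coverage: {coverage:.4f} (guarantee: >= {1 - alpha})")
```

The empirical coverage should concentrate near $\lceil(1-\alpha)(n+1)\rceil / (n+1) \approx 0.9005$, slightly above the guaranteed $1 - \alpha$, which the hint's rank-uniformity argument explains.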


Exercise 34.20 (***)

Implement Adaptive Conformal Inference (ACI) for a scenario with gradual distribution shift:

(a) Generate a streaming classification dataset where the class boundary shifts linearly over 2,000 time steps.

(b) Apply standard split conformal prediction (calibrated on the first 500 points) to all 2,000 points. Plot the running coverage over time. Show that coverage degrades as the distribution shifts.

(c) Apply ACI (using the AdaptiveConformalClassifier class with $\gamma = 0.005$) to the same stream. Plot the running coverage. Show that ACI maintains approximately $1 - \alpha$ coverage despite the shift.

(d) How does the step size $\gamma$ affect the tradeoff between coverage stability and set size variability? Experiment with $\gamma \in \{0.001, 0.005, 0.02, 0.05\}$.

# SplitConformalClassifier and AdaptiveConformalClassifier are assumed to be
# in scope from the chapter's code.

# Generate shifting classification data
rng = np.random.RandomState(42)
n_total = 2000
n_cal = 500

# The decision boundary drifts over the stream: the effective threshold on
# x[:, 0] moves from 0.5 to 0.8 as the shift term ramps from 0 to 1
x_stream = rng.uniform(0, 1, (n_total, 5))
shift = np.linspace(0, 1.0, n_total)
true_prob = 1.0 / (1.0 + np.exp(-(x_stream[:, 0] - 0.5 - shift * 0.3) * 5))
y_stream = rng.binomial(1, true_prob)

# Simulate model predictions (assume a static model trained on early data)
# The model does not adapt, so its predictions become stale
model_prob = 1.0 / (1.0 + np.exp(-(x_stream[:, 0] - 0.5) * 5))
model_probs_2class = np.column_stack([1 - model_prob, model_prob])

# Standard conformal (calibrated on first 500)
static_conformal = SplitConformalClassifier(alpha=0.10)
static_conformal.calibrate(model_probs_2class[:n_cal], y_stream[:n_cal])
static_results = static_conformal.evaluate_coverage(
    model_probs_2class[n_cal:], y_stream[n_cal:]
)

# ACI
aci = AdaptiveConformalClassifier(alpha=0.10, gamma=0.005)
for t in range(n_cal, n_total):
    aci.update(model_probs_2class[t], y_stream[t])

running_cov = aci.get_running_coverage(window=100)
print(f"Static conformal coverage: {static_results['empirical_coverage']:.4f}")
print(f"ACI mean coverage (last 500): {np.mean(aci.coverage_history[-500:]):.4f}")

Climate and Pharma Applications

Exercise 34.21 (**)

A climate ensemble (5 GCMs) produces the following global mean temperature projections for 2100 under the RCP 8.5 scenario:

GCM Projection (°C above pre-industrial)
GFDL-ESM4 4.2
CESM2 4.8
MPI-ESM1-2 3.9
UKESM1 5.3
MIROC6 4.0

(a) Compute the ensemble mean and standard deviation.

(b) The standard deviation of the ensemble captures which type of uncertainty: aleatoric, epistemic, or scenario uncertainty? Explain.

(c) Each GCM also reports internal variability of $\pm 0.3$°C (from running the same model with perturbed initial conditions). What type of uncertainty does this capture?

(d) The scenario choice (RCP 8.5 vs. RCP 4.5 vs. RCP 2.6) contributes a range of 1.0-5.3°C. This is which type of uncertainty? Is it reducible by better models or more data?
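Part (a) is a direct computation. A sketch; `ddof=1` gives the sample standard deviation across the five GCMs:

```python
import numpy as np

proj = np.array([4.2, 4.8, 3.9, 5.3, 4.0])  # °C above pre-industrial
print(f"ensemble mean: {proj.mean():.2f} °C")
print(f"ensemble std (ddof=1): {proj.std(ddof=1):.2f} °C")
```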


Exercise 34.22 (***)

In the MediCore pharmaceutical setting, a clinical trial with 800 patients (400 treatment, 400 control) produces a point estimate of the average treatment effect of $\hat{\tau} = -5.2$ mmHg (systolic blood pressure reduction).

(a) Using a deep ensemble of 5 treatment effect estimators (see Chapter 19's CATE estimation), the ensemble mean is $-5.2$ mmHg and the ensemble standard deviation is $0.8$ mmHg. What type of uncertainty does the ensemble standard deviation primarily capture?

(b) Construct a conformal prediction interval for the individual treatment effect (ITE) at $\alpha = 0.10$. If the residuals on the calibration set have 90th percentile of $4.5$ mmHg, what is the prediction interval for a new patient with predicted ITE $-6.0$ mmHg?

(c) The FDA reviewer asks: "What fraction of patients experience a clinically meaningful benefit (> 3 mmHg reduction)?" Using the calibrated ensemble, estimate this with a confidence interval. Explain why the answer depends on both aleatoric and epistemic uncertainty.
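For part (b), with the absolute-residual nonconformity score the conformal interval is symmetric around the point prediction. A quick check of the arithmetic, using the numbers given in the exercise:

```python
ite_hat = -6.0   # predicted ITE for the new patient (mmHg)
qhat = 4.5       # 90th percentile of calibration residuals (mmHg)
interval = (ite_hat - qhat, ite_hat + qhat)
print(f"90% conformal interval: [{interval[0]:.1f}, {interval[1]:.1f}] mmHg")
```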


Integration and System Design

Exercise 34.23 (***)

Design a complete uncertainty quantification pipeline for the StreamRec recommendation system that integrates with the monitoring framework from Chapter 30:

(a) Specify which UQ methods to use at each stage of the StreamRec pipeline (retrieval, ranking, re-ranking, serving). Justify each choice based on latency constraints.

(b) Define 3 uncertainty-related SLOs (Service Level Objectives) with appropriate thresholds and monitoring frequencies.

(c) Write a monitoring runbook for the scenario: "ECE on the production traffic has increased from 0.02 to 0.08 over the past week."


Exercise 34.24 (****)

The conformal prediction guarantee is marginal: $P(Y \in C(X)) \geq 1 - \alpha$. Conditional coverage — $P(Y \in C(X) \mid X = x) \geq 1 - \alpha$ for all $x$ — is generally impossible without distributional assumptions (Vovk, 2012; Barber et al., 2021).

(a) Construct a concrete example where split conformal prediction achieves 90% marginal coverage but has only 50% conditional coverage for a specific subregion of the input space.

(b) Explain why perfect conditional coverage is impossible in general (a one-paragraph argument).

(c) Describe the "Mondrian conformal prediction" approach (Vovk et al., 2005) for obtaining group-conditional coverage. How does it differ from split conformal?

(d) For the Meridian Financial credit scoring model, propose a practical approach to achieve approximate conditional coverage across the subgroups identified in Exercise 34.5.


Exercise 34.25 (****)

Deep ensembles are empirically the strongest method for uncertainty estimation, but they are expensive ($M$ times the training and inference cost). Several recent methods attempt to approximate ensemble diversity with a single model:

(a) Read and summarize the "BatchEnsemble" approach (Wen et al., 2020), which uses rank-1 perturbations to simulate $M$ ensemble members within a single network.

(b) Read and summarize the "MIMO" (Multi-Input Multi-Output) approach (Havasi et al., 2021), which trains a single network with $M$ input-output pairs.

(c) For a classification task of your choice, implement one of these methods and compare its uncertainty estimates (calibration ECE, AUACC, mutual information quality) to a true 5-member deep ensemble. Report the compute savings and the quality gap.


Exercise 34.26 (***)

A production model serves predictions with MC dropout ($T = 30$) for uncertainty estimation. The mean inference latency is 45ms (single pass) and 850ms (30 MC samples). The SLO requires p99 latency under 200ms.

(a) Propose three strategies to reduce the latency of MC dropout while preserving uncertainty quality. For each, estimate the latency reduction and the impact on uncertainty estimate quality.

(b) One approach is to use a small "uncertainty head" trained to predict the MC dropout uncertainty from a single forward pass (a distillation approach). Outline this approach and discuss its strengths and limitations.

(c) Another approach is to use conformal prediction (which requires only a single forward pass) instead of MC dropout. Under what conditions is conformal prediction a sufficient replacement, and when is MC dropout's decomposition into aleatoric/epistemic genuinely needed?


Exercise 34.27 (****)

Design an end-to-end experiment comparing calibration, conformal prediction, MC dropout, and deep ensembles on a real dataset (e.g., CIFAR-10, a tabular classification dataset from OpenML, or a clinical prediction task):

(a) Train a base model (e.g., ResNet-18 for CIFAR-10, or an MLP for tabular data).

(b) Evaluate calibration: reliability diagram, ECE (before and after temperature scaling).

(c) Construct conformal prediction sets at $\alpha = 0.10$. Report coverage and mean set size.

(d) Implement MC dropout ($T = 50$) and a 5-model deep ensemble. Compare predictive entropy and mutual information.

(e) Build the accuracy-coverage curve for each uncertainty method. Report AUACC.

(f) Summarize: which method(s) would you recommend for production deployment? Justify in terms of calibration quality, uncertainty quality, and computational cost.