Chapter 34: Quiz

Test your understanding of uncertainty quantification. Answers follow each question.


Question 1

What is the difference between aleatoric and epistemic uncertainty? Give one example of each in the context of credit scoring.

Answer

**Aleatoric uncertainty** is irreducible randomness in the data-generating process — noise that cannot be reduced by collecting more data or building better models. In credit scoring: even with perfect knowledge of a borrower's financial profile, there is inherent unpredictability in whether they will default, because default depends on future events (job loss, medical emergency, divorce) that are not captured by any model features.

**Epistemic uncertainty** is uncertainty due to limited knowledge — insufficient data, model misspecification, or ambiguity about the right hypothesis. In credit scoring: the model has high epistemic uncertainty for borrower profiles that are rare in training data (e.g., self-employed applicants with non-standard income documentation), because it has not seen enough similar examples to learn the relationship between features and outcomes.

The key distinction: aleatoric uncertainty is a property of the world (irreducible); epistemic uncertainty is a property of the model (reducible with more data). The appropriate response differs: communicate aleatoric uncertainty to decision-makers; reduce epistemic uncertainty through targeted data collection.

Question 2

Define calibration for a probabilistic classifier. What does it mean for a model to be "perfectly calibrated"?

Answer

A probabilistic classifier $f: \mathcal{X} \to [0, 1]$ is **perfectly calibrated** if for all $p \in [0, 1]$: $P(Y = 1 \mid f(X) = p) = p$. Among all instances where the model predicts probability $p$, the fraction that are truly positive is exactly $p$. If the model says "80% chance of default," then exactly 80% of such cases should default. Calibration is about the *absolute accuracy of probability estimates*, not the model's ability to rank-order examples (discrimination). A model can be well-calibrated with poor discrimination (predicting the base rate for every input) or have excellent discrimination with poor calibration (producing accurate rankings but misleading probabilities).

Question 3

What is the Expected Calibration Error (ECE)? How is it computed?

Answer

ECE is the weighted average of per-bin calibration gaps: $\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} |\bar{o}_m - \bar{p}_m|$, where $B_m$ is the $m$-th probability bin, $|B_m|$ is the number of examples in that bin, $\bar{p}_m$ is the mean predicted probability in the bin, and $\bar{o}_m$ is the observed positive fraction in the bin. The weight $|B_m|/n$ ensures that bins with more examples contribute more to the metric. A perfectly calibrated model has ECE = 0. Values below 0.02 indicate excellent calibration; values above 0.10 indicate poor calibration that should be corrected.
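
The binned computation maps directly onto a few lines of NumPy. A minimal sketch (the 10 equal-width bins are one common choice; the function name is illustrative):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |observed positive fraction - mean predicted prob| per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # right-closed bins; the first bin also includes 0.0
        mask = (probs > lo) & (probs <= hi) if lo > 0 else (probs >= lo) & (probs <= hi)
        if mask.any():
            gap = abs(labels[mask].mean() - probs[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin weight |B_m| / n
    return ece
```

A sanity check: predicting 0.5 on a 50/50 sample gives ECE = 0, while predicting 0.9 on all-negative examples gives ECE = 0.9.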

Question 4

Guo et al. (2017) showed that modern deep neural networks are systematically miscalibrated. In which direction — overconfident or underconfident? What causes this?

Answer

Modern deep networks are systematically **overconfident**: they predict probabilities closer to 0 or 1 than the true conditional probabilities warrant. A ResNet-110 on CIFAR-100 predicts class probabilities above 0.9 for 85% of test examples, but the actual accuracy among those predictions is only 72%. The cause is increased model capacity combined with cross-entropy (NLL) loss optimization. Modern networks have enough capacity to fit training data perfectly, and NLL continues to decrease as the model pushes logits further apart — making softmax outputs more peaked — even after predictions are already correct. Batch normalization and weight decay (which improve accuracy) exacerbate overconfidence by allowing more extreme logits. The historical irony: older, smaller networks (e.g., a 5-layer LeNet) were better calibrated than modern, more accurate networks.

Question 5

Describe temperature scaling. Why does it work for overconfident networks, and why does it preserve model ranking?

Answer

Temperature scaling divides all logits by a learned scalar $T > 0$ before applying softmax: $\hat{p}_k = \exp(z_k / T) / \sum_j \exp(z_j / T)$. For overconfident networks, $T > 1$ is learned, which "softens" the softmax output by pushing probabilities away from 0 and 1 toward more moderate values. It preserves ranking because dividing all logits by the same positive constant does not change their relative order: if $z_1 > z_2$, then $z_1 / T > z_2 / T$ for any $T > 0$. Therefore $\text{argmax}_k(z_k) = \text{argmax}_k(z_k / T)$, so accuracy, AUC, and all ranking-based metrics are unchanged. Temperature scaling is surprisingly effective because the miscalibration of modern networks is largely uniform — the same degree of overconfidence across the probability range — so a single parameter suffices to correct it.
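
Both properties (softening and rank preservation) are easy to verify numerically. A minimal NumPy sketch; the grid search over $T$ is illustrative, and production code typically minimizes the same one-parameter NLL with L-BFGS on a held-out calibration set:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax: divide logits by T, then normalize."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the T that minimizes NLL on a calibration set (grid-search sketch)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    def nll(T):
        p = softmax(logits, T)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return min(grid, key=nll)
```

Dividing by $T > 1$ moves probabilities toward the uniform distribution without changing the argmax, which is why accuracy and AUC are untouched.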

Question 6

What is the difference between Platt scaling, temperature scaling, and isotonic regression for post-hoc recalibration? When would you prefer each?

Answer

**Platt scaling** fits a 2-parameter logistic regression $q = \sigma(az + b)$ on the model's logits. It corrects both the scale and shift of the logit distribution. **Temperature scaling** is a special case with 1 parameter ($T$, equivalent to $a = 1/T$, $b = 0$), correcting only the scale. **Isotonic regression** fits a non-parametric non-decreasing step function.

Prefer **temperature scaling** when: (1) you have a small calibration set (a few hundred examples suffice for 1 parameter), (2) the miscalibration is approximately uniform, or (3) you need subgroup-conditional calibration (1 parameter per subgroup is manageable).

Prefer **Platt scaling** when: the model's logits need both scaling and shifting (e.g., the model is overconfident at high probabilities but well-calibrated at low probabilities).

Prefer **isotonic regression** when: (1) you have a large calibration set (1,000+ examples), (2) the miscalibration is non-uniform and non-linear, and (3) you do not need subgroup-conditional calibration (isotonic regression has many effective parameters and risks overfitting on small subgroups).

Question 7

Why is calibration across subgroups important? How does it relate to Chapter 31's fairness analysis?

Answer

Global calibration (ECE computed over the entire population) can hide severe subgroup miscalibration. A model with ECE = 0.02 overall may have ECE = 0.09 for young, low-income applicants — meaning the model is systematically overconfident for a vulnerable group, leading to underestimated risk and excess charge-offs. This connects to Chapter 31's fairness analysis because **calibration by group** is one of the fairness criteria: $P(Y = 1 \mid \hat{p} = p, A = a) = p$ for all groups $a$. The impossibility theorem (Chouldechova, 2017) proves that perfect calibration across all groups is incompatible with equalized odds when base rates differ across groups. Therefore, subgroup calibration analysis must be done alongside the fairness audit, and the practitioner must explicitly choose which calibration or fairness constraint to prioritize.

Question 8

State the marginal coverage guarantee of split conformal prediction. What is the key assumption, and why is it weaker than IID?

Answer

The guarantee: $P(Y_{n+1} \in C(X_{n+1})) \geq 1 - \alpha$, where $C(X_{n+1})$ is the prediction set constructed from the calibration scores and the conformal threshold $\hat{q}$. The key assumption is **exchangeability**: the calibration data $(X_1, Y_1), \ldots, (X_n, Y_n)$ and the test point $(X_{n+1}, Y_{n+1})$ are exchangeably distributed, meaning their joint distribution is invariant under any permutation of indices. This is weaker than IID because exchangeability allows dependencies — for example, samples drawn without replacement from a finite population are exchangeable but not independent. IID implies exchangeability, but not vice versa. The coverage guarantee holds under exchangeability, making conformal prediction applicable to a broader class of problems than methods that require IID data.

Question 9

Explain the conformal prediction procedure for regression. How is the conformal threshold $\hat{q}$ computed, and how are prediction intervals formed?

Answer

For regression, the nonconformity score is the absolute residual: $s_i = |y_i - \hat{y}(x_i)|$ computed on the calibration set. The threshold $\hat{q}$ is the $\lceil(1 - \alpha)(n_{\text{cal}} + 1)\rceil / n_{\text{cal}}$ quantile of these residuals. For a new test point $x_{\text{test}}$, the prediction interval is $C(x_{\text{test}}) = [\hat{y}(x_{\text{test}}) - \hat{q}, \; \hat{y}(x_{\text{test}}) + \hat{q}]$. This produces constant-width intervals (the width is $2\hat{q}$ for all inputs). For adaptive-width intervals, conformalized quantile regression (CQR) trains a quantile regression model to predict input-dependent quantiles, then applies conformal calibration to the quantile predictions. CQR intervals are both valid (same coverage guarantee) and sharper (narrower where the model is confident).
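
The split-conformal recipe is a few lines of NumPy. A minimal sketch (`method="higher"` requires NumPy 1.22+ and makes the empirical quantile conservative rather than interpolated; the function name is illustrative):

```python
import numpy as np

def conformal_interval(y_cal, yhat_cal, yhat_test, alpha=0.1):
    """Split conformal regression: absolute-residual scores on the calibration
    set yield one half-width qhat applied to every test prediction."""
    scores = np.abs(np.asarray(y_cal, dtype=float) - np.asarray(yhat_cal, dtype=float))
    n = len(scores)
    level = np.ceil((1 - alpha) * (n + 1)) / n  # finite-sample correction
    qhat = np.quantile(scores, min(level, 1.0), method="higher")
    yhat_test = np.asarray(yhat_test, dtype=float)
    return yhat_test - qhat, yhat_test + qhat
```

The interval width $2\hat{q}$ is the same for every input, which is exactly the limitation that CQR's input-dependent quantiles address.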

Question 10

What is Adaptive Conformal Inference (ACI), and why is it needed?

Answer

Standard conformal prediction assumes exchangeability between calibration and test data. Under distribution shift, this assumption fails and coverage degrades. **Adaptive Conformal Inference** (Gibbs and Candès, 2021) dynamically adjusts the conformal threshold in an online fashion: after each prediction, the true label is observed, and the threshold is updated — increased if coverage was missed (widening future sets) and decreased if coverage was achieved (narrowing future sets). The update rule is: $\hat{q}_{t+1} = \hat{q}_t + \gamma(1 - \alpha)$ if the true label was not covered, and $\hat{q}_{t+1} = \hat{q}_t - \gamma \alpha$ if it was covered. This provides a *long-run* coverage guarantee: the average coverage converges to $1 - \alpha$ even under adversarial distribution shift. ACI is essential for production ML systems where data distributions evolve over time.
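
A minimal sketch of the update plus a toy online loop on synthetic residuals; the step size `gamma = 0.05` and the Gaussian residual stream are illustrative choices:

```python
import numpy as np

def aci_update(qhat, covered, alpha=0.1, gamma=0.05):
    """One online step: widen the threshold after a miss, narrow it after a cover."""
    if covered:
        return qhat - gamma * alpha
    return qhat + gamma * (1 - alpha)

# Toy simulation: track whether |residual| <= qhat, updating qhat each step.
rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(size=5000))
qhat, hits = 1.0, 0
for r in residuals:
    covered = r <= qhat
    hits += covered
    qhat = aci_update(qhat, covered)
coverage = hits / len(residuals)  # long-run coverage approaches 1 - alpha = 0.9
```

The asymmetric steps balance exactly when the miss rate equals $\alpha$, which is why the running coverage is pulled toward $1 - \alpha$ regardless of the residual distribution.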

Question 11

Describe the Monte Carlo (MC) dropout procedure for estimating epistemic uncertainty. What assumption makes it theoretically motivated?

Answer

The MC dropout procedure: (1) train a network with dropout (standard training), (2) at test time, keep dropout enabled (do not switch dropout layers to eval mode), (3) run $T$ forward passes for each input, each with different random dropout masks, (4) use the mean of predictions as the final prediction and the variance (or entropy, mutual information) as the uncertainty estimate. The theoretical motivation (Gal and Ghahramani, 2016) is that a network with dropout applied at test time is equivalent to an approximate variational inference procedure. Each forward pass with a different dropout mask samples a different "sub-network" — effectively sampling from an approximate posterior over network parameters. The variance of predictions captures epistemic uncertainty: high variance means different sub-networks disagree, indicating the model is uncertain due to limited knowledge.

Question 12

How does mutual information decompose total uncertainty into aleatoric and epistemic components in the MC dropout framework?

Answer

For classification with MC dropout:

- **Predictive entropy** $H[\bar{p}] = -\sum_k \bar{p}_k \log \bar{p}_k$ captures **total** uncertainty (aleatoric + epistemic), where $\bar{p}_k$ is the mean probability for class $k$ across $T$ MC samples.
- **Expected entropy** $\mathbb{E}_t[H[p_t]] = \frac{1}{T}\sum_t(-\sum_k p_{k,t} \log p_{k,t})$ captures **aleatoric** uncertainty (the average uncertainty of each individual sub-network).
- **Mutual information** $= H[\bar{p}] - \mathbb{E}_t[H[p_t]]$ captures **epistemic** uncertainty (disagreement between sub-networks).

High MI means the sub-networks disagree (epistemic — model needs more data). High expected entropy with low MI means each sub-network is itself uncertain but they agree on their uncertainty (aleatoric — inherently unpredictable).
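
The three quantities map directly onto array operations over the $T \times K$ matrix of MC samples. A minimal NumPy sketch (the small `eps` guards the logarithm; the function name is illustrative):

```python
import numpy as np

def uncertainty_decomposition(mc_probs):
    """mc_probs: shape (T, K) class probabilities from T MC dropout passes.
    Returns (total, aleatoric, epistemic) uncertainty in nats."""
    mc_probs = np.asarray(mc_probs, dtype=float)
    eps = 1e-12
    p_bar = mc_probs.mean(axis=0)                                 # mean prediction
    total = -np.sum(p_bar * np.log(p_bar + eps))                  # predictive entropy
    aleatoric = -np.mean(np.sum(mc_probs * np.log(mc_probs + eps), axis=1))
    epistemic = total - aleatoric                                 # mutual information
    return total, aleatoric, epistemic
```

Two extreme cases illustrate the decomposition: if every pass outputs $[0.5, 0.5]$, all uncertainty is aleatoric (MI $\approx 0$); if half the passes output $[1, 0]$ and half $[0, 1]$, all uncertainty is epistemic (MI $\approx \ln 2$).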

Question 13

Why are deep ensembles considered the gold standard for uncertainty estimation in neural networks? What are their main advantages and disadvantages compared to MC dropout?

Answer

**Advantages:** (1) Empirically strongest calibration and uncertainty estimates on standard benchmarks — ensembles of 5 models consistently outperform MC dropout with 50 samples. (2) Simple to implement: train $M$ copies of the same architecture with different random seeds, no architectural changes needed. (3) The diversity arises naturally from random initialization finding different local minima, providing genuine diversity in model hypotheses. (4) Can capture both aleatoric (with heteroscedastic heads) and epistemic uncertainty (from disagreement).

**Disadvantages:** (1) $M$ times the training cost (typically $M = 5$). (2) $M$ times the inference cost (need $M$ forward passes per prediction). (3) $M$ times the storage (must store $M$ separate model checkpoints). For the StreamRec system serving 12M users, this means 5x the serving infrastructure cost.

Compared to MC dropout, ensembles produce higher-quality uncertainty estimates (less underestimation of epistemic uncertainty) but at higher cost. MC dropout is essentially "free" at training time (dropout is already used) but only costs $T$ inference passes and produces a coarser approximation.

Question 14

What is a heteroscedastic deep ensemble, and how does it decompose total predictive variance?

Answer

A heteroscedastic deep ensemble is an ensemble where each member predicts both a mean $\hat{\mu}_m(x)$ and a variance $\hat{\sigma}_m^2(x)$, trained with Gaussian negative log-likelihood loss: $\mathcal{L} = \frac{1}{2}[\log \hat{\sigma}^2 + (y - \hat{\mu})^2 / \hat{\sigma}^2]$. The total predictive variance decomposes as: $$\text{Var}[y \mid x] \approx \underbrace{\frac{1}{M}\sum_m \hat{\sigma}_m^2(x)}_{\text{aleatoric (mean of predicted variances)}} + \underbrace{\frac{1}{M}\sum_m (\hat{\mu}_m(x) - \bar{\mu}(x))^2}_{\text{epistemic (variance of predicted means)}}$$ The aleatoric component is the average variance predicted by individual models (data noise). The epistemic component is the variance of the predicted means across models (model disagreement). This is a computationally concrete realization of the aleatoric/epistemic decomposition.
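
Given the per-member outputs for one input, the decomposition is a two-line computation. A minimal sketch (names are illustrative):

```python
import numpy as np

def ensemble_variance_decomposition(mus, sigma2s):
    """mus, sigma2s: shape (M,) predicted means and variances of M
    heteroscedastic ensemble members for a single input.
    Returns (aleatoric, epistemic, total) variance."""
    mus = np.asarray(mus, dtype=float)
    sigma2s = np.asarray(sigma2s, dtype=float)
    aleatoric = sigma2s.mean()                      # mean of predicted variances
    epistemic = np.mean((mus - mus.mean()) ** 2)    # variance of predicted means
    return aleatoric, epistemic, aleatoric + epistemic
```

Members that agree on the mean but each predict large variance signal aleatoric noise; members that disagree on the mean signal epistemic uncertainty.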

Question 15

Explain the abstention (selective prediction) strategy. How does the accuracy-coverage tradeoff work?

Answer

An abstention policy allows the model to decline predictions when uncertainty exceeds a threshold, routing uncertain examples to a human expert or fallback system. As the uncertainty threshold decreases (stricter abstention criterion), coverage decreases (fewer predictions are made automatically) but accuracy on accepted predictions increases (only confident, likely-correct predictions are kept). The accuracy-coverage curve plots accuracy on non-abstained predictions (y-axis) vs. coverage fraction (x-axis) as the threshold varies. A good uncertainty estimate produces a curve that rises steeply: accuracy increases rapidly as the most uncertain (and likely incorrect) predictions are removed. The area under this curve (AUACC) is a scalar summary of uncertainty quality. A perfect uncertainty estimator would have AUACC close to 1.0 (accuracy reaches 100% at high coverage).
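
The curve itself is a cumulative computation over predictions sorted by ascending uncertainty. A minimal NumPy sketch:

```python
import numpy as np

def accuracy_coverage_curve(uncertainty, correct):
    """Point k gives accuracy on the k most-confident predictions (y)
    at coverage k/n (x), sweeping the abstention threshold."""
    order = np.argsort(uncertainty)              # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    coverage = np.arange(1, n + 1) / n
    accuracy = np.cumsum(correct) / np.arange(1, n + 1)
    return coverage, accuracy
```

With a well-ordered uncertainty signal (errors concentrated among the most uncertain examples), accuracy stays near 1.0 until coverage approaches the model's error rate, producing the steep curve described above.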

Question 16

In the context of active learning, why is epistemic uncertainty (mutual information) a better selection criterion than total uncertainty (predictive entropy)?

Answer

Predictive entropy is high both when the model is epistemically uncertain (does not have enough data) and when the outcome is inherently noisy (aleatoric uncertainty). Selecting examples with high predictive entropy would select both types — but labeling a new example with high aleatoric uncertainty does not help the model learn, because the outcome is inherently noisy regardless of how much data we collect. Mutual information isolates epistemic uncertainty: it is high only when different model sub-networks (or ensemble members) disagree, indicating that the model needs more data in this region of input space. Labeling examples with high MI directly reduces the model's ignorance. In the StreamRec context, high-MI users are those whose preferences the model has not learned — they are underrepresented in training data. Labeling their interactions (or exploring more aggressively for these users via Thompson sampling) is the most efficient way to improve the model.

Question 17

A conformal prediction set for a classification task contains 3 out of 5 possible classes. What does this mean, and how should the user interpret it?

Answer

It means the model cannot confidently distinguish between these 3 classes for this particular input. The conformal guarantee states that the true class is in this set with probability at least $1 - \alpha$ (e.g., 90%). A set of size 3 (out of 5) indicates substantial model uncertainty for this input — the model is unsure and honestly communicates this. For a decision-maker, this signals: (1) the automated system should not make a definitive decision for this input, (2) a human expert should review this case, and (3) the model may need more training data for inputs similar to this one. Conformal set sizes are a natural measure of prediction difficulty: singleton sets (size 1) are easy cases; large sets are hard cases. The mean set size across the test set is a measure of overall model quality — a better model produces smaller prediction sets while maintaining the same coverage guarantee.
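
A sketch of how such sets are produced, using the simplest nonconformity score $s = 1 - \hat{p}_{y}$ (one of several possible scores; adaptive scores such as APS yield different set shapes, and `method="higher"` assumes NumPy 1.22+):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal classification: score each calibration example by
    1 - (probability assigned to its true class), then include in the test
    set every class whose probability clears the resulting threshold."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
    qhat = np.quantile(scores, level, method="higher")
    return np.asarray(test_probs, dtype=float) >= 1.0 - qhat  # boolean membership
```

Row sums of the returned matrix are exactly the set sizes discussed above: confident inputs yield singletons, ambiguous inputs yield larger sets.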

Question 18

Why is conformal prediction's marginal coverage guarantee insufficient for high-stakes applications? What is the difference between marginal and conditional coverage?

Answer

**Marginal coverage** guarantees $P(Y \in C(X)) \geq 1 - \alpha$ averaged over the distribution of $X$. This means that across all test inputs, the true label is in the prediction set at least $(1-\alpha) \times 100\%$ of the time. However, for a *specific* input $x$ or subgroup of inputs, the coverage may be much lower or higher than $1 - \alpha$. **Conditional coverage** requires $P(Y \in C(X) \mid X = x) \geq 1 - \alpha$ for all $x$ — a much stronger guarantee that coverage holds for every individual input. This is generally impossible without distributional assumptions. In high-stakes applications (credit scoring, clinical diagnosis), marginal coverage is insufficient because failures are concentrated on specific subpopulations. If the model achieves 90% coverage overall but only 70% coverage for underrepresented groups, those groups bear a disproportionate share of prediction failures. This is why conditional coverage analysis (broken down by subgroup) is essential, even though the formal guarantee is only marginal.
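
Auditing conditional coverage empirically reduces to slicing the coverage indicator by subgroup. A minimal sketch (names are illustrative):

```python
import numpy as np

def coverage_by_group(covered, groups):
    """covered: boolean array (true label was inside the prediction set);
    groups: subgroup label per example. Returns overall and per-group coverage."""
    covered = np.asarray(covered, dtype=bool)
    groups = np.asarray(groups)
    per_group = {g: covered[groups == g].mean() for g in np.unique(groups)}
    return covered.mean(), per_group
```

A model can pass the marginal check (overall coverage near $1 - \alpha$) while one group's entry in `per_group` sits far below it, which is exactly the failure mode described above.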

Question 19

Describe two practical strategies for monitoring calibration in a production ML system and explain how they connect to Chapter 30's monitoring framework.

Answer

**Strategy 1: Periodic ECE monitoring.** Compute ECE on a rolling window of recent production predictions with observed outcomes (e.g., weekly for credit scoring models where outcomes take months to resolve, or daily for click prediction where outcomes are immediate). Track ECE as a time series in the Chapter 30 Grafana dashboard. Set an alert threshold (e.g., ECE > 0.05 triggers a warning; ECE > 0.10 triggers a recalibration runbook).

**Strategy 2: Conformal coverage monitoring.** Use ACI (Adaptive Conformal Inference) on the production prediction stream. Track the running coverage over a sliding window. If coverage drops below $(1 - \alpha) - \delta$ for a sustained period (e.g., 3 consecutive days below 85% when the target is 90%), trigger an alert. This is more robust than ECE monitoring because conformal coverage has a formal guarantee under exchangeability, and ACI adjusts automatically for gradual drift.

Both integrate into the four-layer Grafana dashboard from Chapter 30: ECE and conformal coverage are model-layer metrics, alongside AUC and Recall@20. They should be monitored with the same tiered AlertManager escalation (Slack for warnings, PagerDuty for critical).

Question 20

The StreamRec team deploys temperature scaling (T = 2.1) and conformal prediction (alpha = 0.10) on their click-prediction model. After two weeks, they observe that ECE has increased from 0.015 to 0.067 and conformal coverage has dropped from 91% to 84%. What is happening, and what should they do?

Answer

Both symptoms indicate **distribution shift**: the test data distribution has drifted from the calibration data distribution. Temperature scaling was fit on data that no longer represents the current user population, so the calibrated probabilities are no longer well-calibrated. Conformal prediction's exchangeability assumption is violated, so the coverage guarantee no longer holds.

**Immediate actions:** (1) Trigger the recalibration runbook from the monitoring framework. (2) Re-fit temperature scaling on recent production data (last 7 days). (3) Re-calibrate the conformal threshold on recent data. (4) Switch to Adaptive Conformal Inference (ACI) for the conformal layer, which automatically adjusts the threshold under drift.

**Root cause investigation:** (5) Check the data drift metrics (PSI/KS from Chapter 30) to identify which features have shifted. (6) Determine if this is a gradual drift (seasonal, trend) or a sudden shift (product change, external event). (7) If the drift is substantial, trigger a model retraining (Chapter 29's continuous training pipeline) — recalibration alone cannot fix a model whose learned patterns no longer apply.