Chapter 34: Further Reading
Essential Sources
1. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger, "On Calibration of Modern Neural Networks" (ICML, 2017)
The paper that established the modern understanding of neural network miscalibration. Guo et al. demonstrate that deep networks trained with contemporary techniques (batch normalization, weight decay, long training schedules) are systematically overconfident — a reversal from older, smaller networks that were reasonably well-calibrated. The paper introduces temperature scaling as a one-parameter post-hoc recalibration method and shows that, despite its simplicity, it matches or outperforms more complex methods (Platt scaling, isotonic regression, histogram binning) on standard benchmarks.
Reading guidance: Section 3 contains the core finding: reliability diagrams and ECE measurements for LeNet, ResNet, Wide ResNet, and DenseNet on CIFAR-10, CIFAR-100, SVHN, and ImageNet. The historical comparison (Figure 1) — showing that calibration error has increased even as accuracy has improved over two decades of deep learning research — is the paper's most cited result. Section 4 introduces temperature scaling and demonstrates its effectiveness. The key insight is that modern miscalibration is largely uniform (the same degree of overconfidence across the probability range), which is why a single scalar parameter suffices. Section 5's ablation study identifies the culprits: increased model capacity and NLL optimization on high-capacity networks push logits apart, making softmax outputs more peaked than warranted. For extensions to multi-class calibration, see Nixon et al., "Measuring Calibration in Deep Learning" (CVPR Workshops, 2019), which proposes classwise and top-label ECE variants. For a Bayesian perspective on why modern networks are overconfident, see Wilson and Izmailov, "Bayesian Deep Learning and a Probabilistic Perspective of Generalization" (NeurIPS, 2020).
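The mechanics of temperature scaling are simple enough to sketch in a few lines. The following is a minimal numpy illustration, not the paper's implementation: it divides validation logits by a scalar $T$ and picks the $T$ that minimizes NLL. The function names and the grid-search strategy (rather than the L-BFGS optimization typically used in practice) are assumptions of this sketch.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    # Mean negative log-likelihood of the true labels at temperature T.
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 200)):
    # One-parameter post-hoc recalibration: choose the T that minimizes
    # NLL on held-out validation data. Because the problem is 1-D and
    # the loss is smooth in T, a coarse grid search suffices here.
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return grid[int(np.argmin(losses))]
```

On synthetic logits that have been uniformly scaled up (the "uniform overconfidence" pattern the paper identifies), the fitted $T$ recovers the scaling factor, which is exactly why one scalar suffices.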
2. Vladimir Vovk, Alexander Gammerman, and Glenn Shafer, Algorithmic Learning in a Random World (Springer, 2005; 2nd edition 2022)
The foundational text on conformal prediction. Vovk, Gammerman, and Shafer develop the theory from first principles: the exchangeability assumption, nonconformity measures, the coverage guarantee, and applications to classification, regression, and anomaly detection. The 2nd edition (2022) incorporates the advances of the intervening years, including split conformal prediction, conformalized quantile regression, and connections to online learning.
Reading guidance: Part I (Chapters 1-4) establishes the theoretical framework. Chapter 2's treatment of exchangeability and the proof of the coverage guarantee — based on the key insight that the rank of the test nonconformity score is uniformly distributed under exchangeability — is the intellectual core. This proof is short (one page) and elegant; it should be read carefully. Chapter 8 covers split (inductive) conformal prediction, which is the variant used in practice and implemented in Section 34.4. For the regression extension, Lei et al., "Distribution-Free Predictive Inference for Regression" (JASA, 2018) provides a rigorous treatment of split conformal regression. For conformalized quantile regression, see Romano, Patterson, and Candes, "Conformalized Quantile Regression" (NeurIPS, 2019), which combines the flexibility of quantile regression (adaptive interval widths) with the formal guarantee of conformal prediction. For adaptive conformal inference under distribution shift, see Gibbs and Candes, "Adaptive Conformal Inference Under Distribution Shift" (NeurIPS, 2021), which provides the online threshold adjustment algorithm implemented in Section 34.4.4. For a practical tutorial, Angelopoulos and Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification" (arXiv, 2021) provides an accessible 50-page introduction with code examples.
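The split conformal procedure that this guarantee underwrites can be sketched directly. The following is a hedged numpy illustration, not the book's code: it uses the common "one minus true-class probability" nonconformity score and the finite-sample-corrected quantile $\lceil (n+1)(1-\alpha)\rceil / n$; the function names and the synthetic calibration setup are assumptions of this sketch.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: one minus the softmax probability assigned
    # to the true label -- higher means the model found it stranger.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level ceil((n+1)(1-alpha))/n.
    # Under exchangeability, the rank of the test score among the
    # calibration scores is uniform, which yields marginal coverage
    # of at least 1 - alpha.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_probs, qhat):
    # A label enters the prediction set iff its score is <= qhat.
    return [np.flatnonzero(1.0 - p <= qhat) for p in test_probs]
```

Note that the guarantee is marginal (averaged over calibration and test draws), not conditional on a particular input — the impossibility of conditional coverage is exactly the caveat the tutorial literature emphasizes.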
3. Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell, "Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles" (NeurIPS, 2017)
The paper that established deep ensembles as the gold standard for neural network uncertainty estimation. Lakshminarayanan et al. show that training $M$ independent networks (same architecture, different random initialization and data order) and averaging their predictions produces better-calibrated uncertainty estimates than MC dropout, variational inference, and other Bayesian approximations — despite the simplicity of the approach.
Reading guidance: Section 2 defines the ensemble procedure and the heteroscedastic extension (each member predicts mean and variance, trained with Gaussian NLL). Section 3's experiments are comprehensive: CIFAR-10, CIFAR-100, ImageNet, UCI regression datasets, and out-of-distribution detection. The key result: a 5-member ensemble with heteroscedastic heads outperforms MC dropout ($T = 50$) on ECE, NLL, Brier score, and out-of-distribution detection AUC on every benchmark. Table 1 and Table 2 provide the definitive comparison. The theoretical explanation for why ensembles work — different random initializations find different local minima that agree on in-distribution data but disagree on out-of-distribution data — is discussed in Section 4 and developed more formally in Fort, Hu, and Lakshminarayanan, "Deep Ensembles: A Loss Landscape Perspective" (arXiv, 2019). For practical deployment of ensembles, see Wen et al., "BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning" (ICLR, 2020), which reduces the cost of ensembles from $M \times$ to approximately $1.2 \times$ using rank-1 perturbations. For the connection between ensembles and Bayesian model averaging, see Wilson and Izmailov (NeurIPS, 2020), cited above.
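The combination rule for the heteroscedastic ensemble is worth seeing concretely. The sketch below, a numpy illustration under the uniform-mixture treatment described in Section 2 of the paper (function name and toy shapes are assumptions of this sketch), shows how the combined variance separates into average predicted noise plus member disagreement.

```python
import numpy as np

def combine_gaussian_ensemble(mus, sigma2s):
    # mus, sigma2s: shape (M, N) -- each of the M members' predicted
    # means and variances for N inputs. Treating the ensemble as a
    # uniform mixture of Gaussians, the combined prediction has:
    #   mean     = average of member means
    #   variance = mean(sigma^2 + mu^2) - (mean mu)^2
    # i.e. average aleatoric noise plus the spread of member means
    # (the epistemic disagreement term that grows off-distribution).
    mu = mus.mean(axis=0)
    var = (sigma2s + mus ** 2).mean(axis=0) - mu ** 2
    return mu, var
```

For two members predicting $\mathcal{N}(0, 1)$ and $\mathcal{N}(2, 1)$ at the same input, the combined variance is $2$: one unit of average noise plus one unit of disagreement between the means.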
4. Yarin Gal and Zoubin Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" (ICML, 2016)
The paper that provided the theoretical foundation for MC dropout as an uncertainty estimation method. Gal and Ghahramani prove that a neural network with dropout applied at test time is mathematically equivalent to an approximate variational inference procedure over a specific posterior distribution (for each weight, a mixture of two delta functions: one at zero and one at the learned value).
Reading guidance: Theorem 1 (Section 3) is the core result: the dropout objective (cross-entropy loss with L2 regularization and dropout) is a variational lower bound on the log-evidence of a specific Bayesian model. This means that standard dropout training already performs approximate Bayesian inference — the only additional step at test time is to keep dropout active and average multiple forward passes. Section 4's experiments on MNIST and CIFAR-10 demonstrate that MC dropout uncertainty estimates are well-correlated with prediction errors. The uncertainty decomposition into predictive entropy and mutual information (Section 3.2) provides the aleatoric/epistemic separation implemented in Section 34.6. The main limitation — acknowledged in Section 5 — is that the variational family (factored Bernoulli over individual weights) is a coarse approximation, leading to underestimation of true posterior uncertainty. For a critical evaluation, see Osband et al., "Randomized Prior Functions for Deep Reinforcement Learning" (NeurIPS, 2018), which shows that MC dropout can fail to detect certain types of distributional shift. For practical guidelines on MC dropout (number of samples, dropout rate tuning, computational cost), see Gal's PhD thesis, Uncertainty in Deep Learning (University of Cambridge, 2016), available open-access.
5. Anastasios N. Angelopoulos and Stephen Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification" (arXiv:2107.07511, 2021)
A 50-page tutorial that has become the standard practical introduction to conformal prediction. Angelopoulos and Bates present the theory with minimal formalism, emphasize implementation, and provide Python code examples for classification, regression, and multi-label prediction.
Reading guidance: This is the recommended starting point for readers who want to implement conformal prediction rather than prove theorems about it. Section 2 introduces split conformal prediction with a 10-line Python implementation. Section 3 covers classification-specific scores (softmax, adaptive prediction sets). Section 4 covers regression (residual-based and CQR). Section 5 discusses conditional coverage and its impossibility (connecting to the caveats in Section 34.4.2). The code examples are directly usable and map closely to the implementations in this chapter. For readers who want to go deeper: Section 6 covers online conformal prediction and connections to the ACI framework. The bibliography is comprehensive and annotated, making it an excellent navigation guide to the conformal prediction literature. For recent advances in conformal prediction for causal inference — relevant to the MediCore pharma anchor — see Lei and Candes, "Conformal Inference of Counterfactuals and Individual Treatment Effects" (JRSS-B, 2021).
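The online/ACI machinery referenced above is itself a one-line update. The sketch below is a hedged numpy illustration of the Gibbs and Candes style update (the function name, step size, and stand-in coverage oracle are assumptions of this sketch, not code from the tutorial): the working miscoverage level drops after a miss, widening future sets, and rises after a covered step, so long-run miscoverage tracks the target even under drift.

```python
import numpy as np

def aci_update(alpha_t, err_t, target_alpha=0.1, gamma=0.01):
    # err_t = 1 if the conformal set missed the true label at step t,
    # else 0. A miss pulls alpha down (more conservative sets); a hit
    # pushes it up (tighter sets).
    return alpha_t + gamma * (target_alpha - err_t)

# Toy stream with a stand-in oracle that miscovers with probability
# equal to the current working level -- purely to exercise the update.
rng = np.random.default_rng(2)
alpha, errs = 0.1, []
for _ in range(5000):
    err = float(rng.random() < max(alpha, 0.0))
    errs.append(err)
    alpha = aci_update(alpha, err)
```

Averaged over the stream, the empirical miscoverage settles near the 10% target — the property the online guarantee formalizes.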