Chapter 12 Further Reading
Foundational Papers
Calibration Theory and Measurement
- Murphy, A. H., & Winkler, R. L. (1977). "Reliability of Subjective Probability Forecasts of Precipitation and Temperature." Journal of the Royal Statistical Society: Series C, 26(1), 41-47. The foundational paper on reliability diagrams and calibration assessment. Introduces the visual framework used throughout this chapter.
- Murphy, A. H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology, 12(4), 595-600. Introduces the Murphy decomposition of the Brier score into reliability, resolution, and uncertainty. The mathematical backbone of Section 12.4; the decomposition is written out after this list.
- DeGroot, M. H., & Fienberg, S. E. (1983). "The Comparison and Evaluation of Forecasters." Journal of the Royal Statistical Society: Series D, 32(1-2), 12-22. Formalizes the concept of calibration and its relationship to other forecast quality measures. A rigorous treatment of the theoretical foundations.
- Dawid, A. P. (1982). "The Well-Calibrated Bayesian." Journal of the American Statistical Association, 77(379), 605-610. Explores calibration from a Bayesian perspective. Shows that a coherent Bayesian expects to be well calibrated under their own subjective distribution.
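
For reference, the Murphy (1973) partition can be written as follows (a standard textbook form of the identity, not quoted from the paper), using $K$ groups of identical forecast values:

$$
\mathrm{BS} \;=\; \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2
\;=\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{p}_k - \bar{o}_k)^2}_{\text{reliability}}
\;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{o}_k - \bar{o})^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}\,(1-\bar{o})}_{\text{uncertainty}}
$$

Here $p_i$ is the forecast probability, $o_i \in \{0, 1\}$ the outcome, $n_k$ the number of forecasts in group $k$, $\bar{p}_k$ and $\bar{o}_k$ the mean forecast and observed frequency within group $k$, and $\bar{o}$ the overall base rate. The identity is exact when all forecasts in a group share the same value; with continuous forecasts binned into intervals it holds only approximately.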
Scoring Rules
- Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review, 78(1), 1-3. The original paper introducing the Brier score. Brief but enormously influential — one of the most cited papers in forecast verification. A numerical check of the score and its decomposition follows this list.
- Gneiting, T., & Raftery, A. E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association, 102(477), 359-378. The definitive modern treatment of proper scoring rules. Covers the theory behind why calibration is a "free lunch" for proper scores.
- Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). "Probabilistic Forecasts, Calibration and Sharpness." Journal of the Royal Statistical Society: Series B, 69(2), 243-268. Introduces the "maximize sharpness, subject to calibration" paradigm. Essential reading for understanding the relationship between calibration, sharpness, and resolution.
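
The decomposition above is easy to verify numerically. The following minimal Python sketch (illustrative only; the simulated data and variable names are not taken from any of the cited papers) draws outcomes from the stated forecast probabilities and checks that the Brier score equals reliability minus resolution plus uncertainty:

```python
import numpy as np

# Simulated forecasts that take a few discrete values, with outcomes drawn
# from those probabilities (so the forecasts are calibrated by construction).
rng = np.random.default_rng(0)
forecasts = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=10_000)
outcomes = (rng.random(forecasts.shape) < forecasts).astype(float)

brier = np.mean((forecasts - outcomes) ** 2)

base_rate = outcomes.mean()
reliability = 0.0
resolution = 0.0
for p in np.unique(forecasts):          # each group holds identical forecast values
    in_group = forecasts == p
    weight = in_group.mean()            # n_k / N
    obs_freq = outcomes[in_group].mean()
    reliability += weight * (p - obs_freq) ** 2
    resolution += weight * (obs_freq - base_rate) ** 2
uncertainty = base_rate * (1.0 - base_rate)

# Exact identity when groups contain identical forecast values.
assert np.isclose(brier, reliability - resolution + uncertainty)
print(f"BS={brier:.4f}  REL={reliability:.4f}  RES={resolution:.4f}  UNC={uncertainty:.4f}")
```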
Calibration in Machine Learning
- Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." Proceedings of the 22nd International Conference on Machine Learning, 625-632. Empirical study of calibration across many machine learning algorithms. Demonstrates that most classifiers produce poorly calibrated probabilities and benefit from post-hoc recalibration.
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." Proceedings of the 34th International Conference on Machine Learning, 1321-1330. Shows that modern deep neural networks are significantly miscalibrated despite high accuracy. Popularized temperature scaling as a simple recalibration method; a minimal temperature-scaling sketch follows this list.
- Platt, J. C. (1999). "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods." Advances in Large Margin Classifiers, 61-74. Introduces Platt scaling — fitting a logistic regression on classifier outputs to produce calibrated probabilities. The most widely used parametric recalibration method.
- Zadrozny, B., & Elkan, C. (2002). "Transforming Classifier Scores into Accurate Multiclass Probability Estimates." Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694-699. Introduces isotonic regression for probability calibration. Compares it to Platt scaling and histogram binning across multiple datasets.
- Kull, M., Silva Filho, T. M., & Flach, P. (2017). "Beta Calibration: A Well-Founded and Easily Implemented Improvement on Logistic Calibration for Binary Classifiers." Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 623-631. Introduces beta calibration, a three-parameter alternative to Platt scaling that can model a wider variety of miscalibration patterns.
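
For readers who want to try the simplest of these methods, here is a minimal temperature-scaling sketch in the spirit of Guo et al. (2017), not their reference implementation: it fits a single scalar T on held-out logits by minimizing negative log-likelihood (the array names are placeholders).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(temperature, logits, labels):
    # Negative log-likelihood of the true labels under temperature-scaled probabilities.
    probs = softmax(logits / temperature)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels):
    # A single scalar T > 0; T > 1 softens the probabilities of an overconfident network.
    result = minimize_scalar(nll, bounds=(0.05, 20.0), args=(val_logits, val_labels),
                             method="bounded")
    return result.x

# Usage (placeholder arrays):
#   T = fit_temperature(val_logits, val_labels)
#   calibrated_test_probs = softmax(test_logits / T)
```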
Prediction Markets and Calibration
- Wolfers, J., & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107-126. Excellent survey of prediction markets with discussion of their calibration and informational efficiency. Accessible introduction for economists.
- Snowberg, E., Wolfers, J., & Zitzewitz, E. (2007). "Partisan Impacts on the Economy: Evidence from Prediction Markets and Close Elections." Quarterly Journal of Economics, 122(2), 807-829. Uses prediction market calibration to draw economic conclusions. Demonstrates how well-calibrated markets can be used for causal inference.
- Page, L. (2012). "Are Markets Efficient? Experimental Evidence from a Prediction Market for Soccer Results." Journal of Forecasting, 31(7), 529-547. Studies calibration of sports prediction markets. Finds generally good calibration with evidence of the favorite-longshot bias.
- Atanasov, P., Rescober, P., Stone, E., et al. (2017). "Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls." Management Science, 63(3), 691-706. Compares calibration of prediction markets vs. prediction polls. Finds that both can be well-calibrated but markets tend to be sharper.
Superforecasting and Human Calibration
- Tetlock, P. E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press. The seminal study of expert calibration. Documents widespread overconfidence among political experts and proposes methods for improvement.
- Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown. Accessible book on the Good Judgment Project. Describes the characteristics of superforecasters, including their excellent calibration. Essential reading for anyone interested in personal calibration improvement.
- Mellers, B., Stone, E., Murray, T., et al. (2015). "Identifying and Cultivating Superforecasters as a Method of Improving Probabilistic Predictions." Perspectives on Psychological Science, 10(3), 267-281. Research paper underlying the Superforecasting book. Provides detailed data on what makes superforecasters special, including calibration analysis.
- Moore, D. A., & Healy, P. J. (2008). "The Trouble with Overconfidence." Psychological Review, 115(2), 502-517. Comprehensive review of overconfidence research. Distinguishes different types of overconfidence (overestimation, overplacement, overprecision) and their effects on calibration.
Favorite-Longshot Bias
- Griffith, R. M. (1949). "Odds Adjustments by American Horse-Race Bettors." American Journal of Psychology, 62(2), 290-294. The original documentation of the favorite-longshot bias in horse racing. Shows that longshots are systematically overbet relative to their true win probability.
- Snowberg, E., & Wolfers, J. (2010). "Explaining the Favorite-Long Shot Bias: Is It Risk-Love or Misperceptions?" Journal of Political Economy, 118(4), 723-746. Modern analysis of the favorite-longshot bias with competing explanations (risk preferences vs. probability distortion).
- Ottaviani, M., & Sorensen, P. N. (2010). "Noise, Information, and the Favorite-Longshot Bias in Parimutuel Predictions." American Economic Journal: Microeconomics, 2(1), 58-85. Theoretical model explaining the favorite-longshot bias as a consequence of noise in private information.
Advanced Topics
Distributional Calibration
- Gneiting, T., & Katzfuss, M. (2014). "Probabilistic Forecasting." Annual Review of Statistics and Its Application, 1, 125-151. Survey covering distributional forecasting, CRPS, and the PIT for calibration assessment. The go-to reference for calibration beyond binary outcomes.
- Hamill, T. M. (2001). "Interpretation of Rank Histograms for Verifying Ensemble Forecasts." Monthly Weather Review, 129(3), 550-560. Explains how to use rank histograms (the ensemble analogue of PIT histograms) to diagnose distributional miscalibration. Essential for ensemble forecasting contexts; a small PIT example follows this list.
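
As a quick illustration of the PIT idea discussed in these two references, the sketch below (illustrative only; the Gaussian predictive distributions and variable names are assumptions, not Hamill's setup) computes PIT values and bins them. A flat histogram indicates distributional calibration; a U shape signals underdispersion and a hump signals overdispersion.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu = rng.normal(size=5_000)                 # predictive means
sigma = np.full(5_000, 1.0)                 # predictive standard deviations
y = rng.normal(loc=mu, scale=1.0)           # observations consistent with the forecasts

pit = norm.cdf(y, loc=mu, scale=sigma)      # PIT values; uniform iff forecasts are calibrated
counts, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
print(counts / counts.sum())                # each bin should be close to 0.10
```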
Information-Theoretic Perspectives
- Bröcker, J. (2009). "Reliability, Sufficiency, and the Decomposition of Proper Scores." Quarterly Journal of the Royal Meteorological Society, 135(643), 1512-1519. Provides an information-theoretic decomposition of proper scoring rules, connecting calibration to mutual information and entropy.
Online Calibration
- Foster, D. P., & Vohra, R. V. (1998). "Asymptotic Calibration." Biometrika, 85(2), 379-390. Proves that it is possible to achieve calibration in an adversarial online setting. Foundational result for the theory of sequential calibration.
Software and Tools
- scikit-learn calibration module: sklearn.calibration provides CalibratedClassifierCV, calibration_curve, and CalibrationDisplay. The standard Python toolkit for machine learning calibration. Documentation: https://scikit-learn.org/stable/modules/calibration.html (a brief usage sketch follows this list).
- Metaculus: https://www.metaculus.com — Community prediction platform with excellent calibration tracking and public calibration data.
- Good Judgment Open: https://www.gjopen.com — Open forecasting platform from the team behind superforecasting research. Good source of calibration data and practice.
- uncertainty-toolbox: Python package for uncertainty quantification and calibration. GitHub: https://github.com/uncertainty-toolbox/uncertainty-toolbox
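
A minimal usage sketch of the scikit-learn utilities listed above (the synthetic dataset and model choice are arbitrary, and parameter names can differ slightly across library versions):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap an uncalibrated classifier; method="sigmoid" is Platt scaling,
# method="isotonic" is isotonic regression (Zadrozny & Elkan, 2002).
model = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                               method="isotonic", cv=5)
model.fit(X_train, y_train)

prob_pos = model.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))  # points for a reliability diagram
```

In recent scikit-learn versions, CalibrationDisplay offers a plotting shortcut for the same reliability curve.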
Textbooks
- Wilks, D. S. (2011). Statistical Methods in the Atmospheric Sciences (3rd ed.). Academic Press. Chapters 8-9 provide an excellent treatment of forecast verification including calibration, reliability diagrams, and skill scores. The standard reference in meteorological forecasting.
- Jolliffe, I. T., & Stephenson, D. B. (2012). Forecast Verification: A Practitioner's Guide in Atmospheric Science (2nd ed.). Wiley. Comprehensive guide to forecast verification methods. Covers all the calibration metrics discussed in this chapter with detailed examples.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. General reference for the machine learning methods used in recalibration, including isotonic regression and regularized logistic regression.