Chapter 9 Further Reading: Scoring Rules and Proper Incentives
This annotated bibliography covers the foundational papers, key surveys, and practical references for scoring rules and their connection to prediction markets. Entries are organized from foundational to advanced.
Foundational Papers
1. Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review, 78(1), 1-3.
The paper that introduced the Brier score. Remarkably short (just three pages), it proposed the quadratic scoring rule for evaluating weather forecasts. Brier's original formulation was for multi-category forecasts, though the binary version is now the most commonly referenced. Essential reading for understanding the historical origins of scoring rules. The paper is accessible and requires no advanced mathematics.
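The difference between the two formulations mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not Brier's notation: the function names and the sample data are hypothetical. The multi-category form sums squared errors over every category, so for two categories it is exactly twice the familiar binary form.

```python
def brier_multicategory(forecasts, outcomes):
    """Multi-category form: mean over occasions of the squared error
    summed across all categories. Ranges from 0 (perfect) to 2."""
    total = 0.0
    for probs, observed in zip(forecasts, outcomes):
        for category, p in enumerate(probs):
            indicator = 1.0 if category == observed else 0.0
            total += (p - indicator) ** 2
    return total / len(forecasts)


def brier_binary(probs, outcomes):
    """Common binary form: mean squared error of the event probability.
    Ranges from 0 (perfect) to 1."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical data: two forecasts over the categories (no rain, rain).
full = brier_multicategory([[0.7, 0.3], [0.2, 0.8]], [0, 1])
half = brier_binary([0.3, 0.8], [0, 1])  # probability of rain only
print(full, half)
```

When comparing published Brier scores, it is worth checking which of the two conventions a source uses, since they differ by a factor of two in the binary case.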
2. Good, I. J. (1952). "Rational Decisions." Journal of the Royal Statistical Society, Series B, 14(1), 107-114.
I. J. Good independently proposed the logarithmic scoring rule, connecting it to information theory and subjective probability. Good argued that the log score is the natural measure of the "information" in a probability forecast. This paper laid the groundwork for the information-theoretic interpretation of scoring rules that became central to the field. Moderately technical but conceptually clear.
3. Savage, L. J. (1971). "Elicitation of Personal Probabilities and Expectations." Journal of the American Statistical Association, 66(336), 783-801.
A landmark paper that formalized the concept of proper scoring rules, under which a forecaster maximizes expected score by reporting truthfully, and proved the fundamental characterization theorem: every proper scoring rule corresponds to a convex function. This paper is the theoretical bedrock of the field. Mathematically rigorous, but the key ideas are accessible with undergraduate probability.
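The convex-function characterization can be made concrete for binary events. In the standard presentation, a convex function G with derivative G' yields the (positively oriented) score S(q, y) = G(q) + G'(q)(y - q) for report q and outcome y in {0, 1}. The sketch below assumes that form; the helper names and the grid search are illustrative choices, not anything taken from Savage's paper. It recovers the negative Brier score and the log score from two choices of G and checks numerically that the truthful report maximizes expected score.

```python
import math


def savage_score(G, dG, q, y):
    """Score for report q and outcome y, built from convex G and its derivative."""
    return G(q) + dG(q) * (y - q)


# G(q) = q^2 - q gives the negative binary Brier score.
def brier_G(q):
    return q * q - q


def brier_dG(q):
    return 2 * q - 1


# G(q) = negative Shannon entropy gives the logarithmic score.
def log_G(q):
    return q * math.log(q) + (1 - q) * math.log(1 - q)


def log_dG(q):
    return math.log(q) - math.log(1 - q)


def expected_score(G, dG, q, p):
    """Expected score of reporting q when the true probability is p."""
    return p * savage_score(G, dG, q, 1) + (1 - p) * savage_score(G, dG, q, 0)


def best_report(G, dG, p, grid_size=100):
    """Grid search for the report that maximizes expected score."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return max(grid, key=lambda q: expected_score(G, dG, q, p))


print(best_report(brier_G, brier_dG, 0.3), best_report(log_G, log_dG, 0.3))
```

A grid search is only a numerical sanity check; the paper itself gives the general proof.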
4. De Finetti, B. (1962). "Does It Make Sense to Speak of 'Good Probability Appraisers'?" In The Scientist Speculates (I. J. Good, ed.), 357-364.
De Finetti's early discussion of how to evaluate probability forecasters, including the idea that a good scoring rule should incentivize honest reporting. This philosophical paper motivated much of the later mathematical development. Short and conceptually stimulating.
Key Surveys and Reviews
5. Gneiting, T. and Raftery, A. E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association, 102(477), 359-378.
The definitive modern survey of proper scoring rules. Gneiting and Raftery provide a comprehensive treatment of properness, characterization theorems, and extensions to continuous distributions (including the CRPS). They unify the binary and continuous cases and discuss weighted scoring rules. This is the single most important reference for anyone wanting a deep understanding of scoring rules. Mathematically thorough but well-written and accessible to anyone with graduate-level statistics.
6. Winkler, R. L. (1996). "Scoring Rules and the Evaluation of Probabilities." Test, 5(1), 1-60.
An excellent survey covering the evaluation of probability forecasts, with extensive discussion of calibration, refinement (resolution), and the relationship between different scoring rules. Winkler provides practical guidance on choosing and applying scoring rules. More applied than Gneiting and Raftery, making it a good companion piece.
7. Merkle, E. C. and Steyvers, M. (2013). "Choosing a Strictly Proper Scoring Rule." Decision Analysis, 10(4), 292-304.
A practical guide to choosing among proper scoring rules. The authors compare the Brier, log, and spherical scores in terms of their sensitivity properties, robustness, and suitability for different applications. Particularly useful for practitioners who need to make a concrete choice. Accessible and applied in orientation.
Scoring Rules and Prediction Markets
8. Hanson, R. (2003). "Combinatorial Information Market Design." Information Systems Frontiers, 5(1), 107-119.
The paper that introduced the Logarithmic Market Scoring Rule (LMSR) and established the deep connection between proper scoring rules and automated market makers. Hanson showed that every proper scoring rule generates a market maker, and that the LMSR has particularly desirable properties (bounded loss, always-available liquidity). Essential reading for understanding how scoring rules give rise to market makers. The paper is technical, but the key ideas are clearly explained.
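The two properties named above are easy to see in the cost-function form in which the LMSR is commonly presented: C(q) = b ln(sum_i exp(q_i / b)), with prices given by the gradient of C. The sketch below assumes that form; the parameter name b (liquidity), the function names, and the trade sizes are illustrative assumptions rather than Hanson's notation.

```python
import math


def lmsr_cost(quantities, b):
    """C(q) = b * ln(sum_i exp(q_i / b)), computed with a max shift for stability."""
    m = max(quantities)
    return m + b * math.log(sum(math.exp((q - m) / b) for q in quantities))


def lmsr_prices(quantities, b):
    """Instantaneous prices (gradient of C); they are positive and sum to 1."""
    m = max(quantities)
    weights = [math.exp((q - m) / b) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]


def trade_cost(quantities, delta, b):
    """Amount a trader pays to move outstanding shares from q to q + delta."""
    new_quantities = [q + d for q, d in zip(quantities, delta)]
    return lmsr_cost(new_quantities, b) - lmsr_cost(quantities, b)


# Hypothetical two-outcome market with liquidity parameter b = 100.
b = 100.0
start = [0.0, 0.0]
paid = trade_cost(start, [10000.0, 0.0], b)  # a very large bet on outcome 0
loss_if_outcome_0 = 10000.0 - paid           # payout minus money collected
print(lmsr_prices(start, b), loss_if_outcome_0, b * math.log(2))
```

A price is always quoted, whatever the outstanding quantities, and even an extreme one-sided trade costs the market maker no more than b ln(n) for n outcomes. Larger b means deeper liquidity but a larger worst-case subsidy.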
9. Chen, Y. and Pennock, D. M. (2007). "A Utility Framework for Bounded-Loss Market Makers." In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 49-56.
Extends Hanson's work by providing a general framework for designing bounded-loss market makers from scoring rules. The authors show how the liquidity parameter and loss bounds relate to the properties of the underlying scoring rule. Important for understanding the practical design of prediction market systems.
10. Abernethy, J., Chen, Y., and Vaughan, J. W. (2013). "Efficient Market Making via Convex Optimization, and a Connection to Online Learning." ACM Transactions on Economics and Computation, 1(2), Article 12.
A deeper exploration of the mathematical structure connecting scoring rules, market makers, and online learning algorithms. Shows that market making can be viewed as a convex optimization problem and establishes connections to regret minimization in online learning theory. More advanced mathematically, but provides powerful unifying insights.
Calibration and Decomposition
11. Murphy, A. H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology, 12(4), 595-600.
The paper that introduced the decomposition of the Brier score into calibration (reliability), resolution, and uncertainty components. Murphy showed that this decomposition provides diagnostic insight into forecaster performance that the overall Brier score alone cannot. Fundamental for anyone using the Brier score in practice. Short and accessible.
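As a rough illustration of the decomposition described above, the sketch below groups binary forecasts by their distinct stated probability and computes the three terms, so that Brier score = reliability - resolution + uncertainty. The function name and the sample data are hypothetical; the identity is exact only when forecasts are grouped by identical values, as here, rather than by coarser bins.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty) for binary forecasts,
    grouping forecasts by their distinct stated probability."""
    n = len(probs)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)

    reliability = 0.0
    resolution = 0.0
    for p, ys in groups.items():
        observed_freq = sum(ys) / len(ys)
        # Gap between the stated probability and what actually happened.
        reliability += len(ys) * (p - observed_freq) ** 2
        # How far this group's frequency sits from the overall base rate.
        resolution += len(ys) * (observed_freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical forecaster who only ever says 20% or 80%.
probs = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
rel, res, unc = murphy_decomposition(probs, outcomes)
brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
print(rel, res, unc, brier)
```

Lower reliability and higher resolution both improve the score, while uncertainty depends only on the events, not on the forecaster.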
12. Bröcker, J. (2009). "Reliability, Sufficiency, and the Decomposition of Proper Scores." Quarterly Journal of the Royal Meteorological Society, 135(643), 1512-1519.
Generalizes Murphy's Brier decomposition to arbitrary proper scoring rules. Shows that any proper scoring rule can be decomposed into reliability and resolution components, though the decomposition is not as clean as for the Brier score. Important for understanding the theoretical limits of score decomposition.
Practical Applications and Platforms
13. Tetlock, P. E. and Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
While not a technical paper, this book describes the Good Judgment Project, which used scoring rules (particularly the Brier score) to evaluate thousands of forecasters on geopolitical events. The book provides extensive practical insight into forecasting tournaments, scoring system design, and what makes some forecasters consistently better than others. Essential reading for anyone designing a forecasting platform or tournament.
14. Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., and Mellers, B. (2017). "Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls." Management Science, 63(3), 691-706.
Compares prediction markets with prediction polls (scored using proper scoring rules) for aggregating forecasts. Finds that properly scored prediction polls can match or exceed the accuracy of prediction markets in some settings. Relevant for understanding when to use markets vs. scoring rules.
Advanced Topics
15. Ehm, W., Gneiting, T., Jordan, A., and Krüger, F. (2016). "Of Quantiles and Expectiles: Consistent Scoring Functions, Choquet Representations and Forecast Rankings." Journal of the Royal Statistical Society, Series B, 78(3), 505-562.
An advanced treatment of weighted scoring rules and the characterization of consistent scoring functions for quantiles and expectiles. The paper introduces the mixture representation of scoring rules that enables principled weighting of different probability regions. Mathematically advanced but important for designing custom scoring rules with asymmetric properties.
16. Schervish, M. J. (1989). "A General Method for Comparing Probability Assessors." Annals of Statistics, 17(4), 1856-1879.
Provides the complete mathematical characterization of proper scoring rules, extending Savage's results. Shows the connection to convex functions and to Bregman divergences. A mathematically rigorous treatment that serves as the theoretical foundation for much subsequent work.
17. Johnstone, D. J. (2007). "The Value of a Probability Forecast from Portfolio Theory." Theory and Decision, 63(2), 153-203.
Explores the economic interpretation of scoring rules through the lens of portfolio theory. Shows how proper scoring rules relate to expected utility maximization and provides economic justification for the use of specific scoring rules in different contexts. Connects scoring rules to financial theory in a natural way.
Software and Implementations
18. Jordan, A., Krüger, F., and Lerch, S. (2019). "Evaluating Probabilistic Forecasts with scoringRules." Journal of Statistical Software, 90(12), 1-37.
Describes the R package scoringRules, which implements a comprehensive set of proper scoring rules for both discrete and continuous forecasts. While the package is in R rather than Python, the paper serves as an excellent practical guide to implementing scoring rules, with discussion of numerical issues and edge cases. The package documentation includes worked examples for all major scoring rules.
19. Metaculus (2020-present). Metaculus Scoring Documentation.
The forecasting platform Metaculus publishes detailed documentation of its scoring system, which is based on the logarithmic score with relative scoring against the community. This documentation provides a practical case study of how a major platform implements scoring rules. Available at metaculus.com.
How to Read These References
If you read only three things:
1. Brier (1950) -- the origin of the most important scoring rule
2. Gneiting and Raftery (2007) -- the comprehensive modern treatment
3. Hanson (2003) -- the connection to prediction markets
For practitioners: Start with Winkler (1996) and Merkle and Steyvers (2013), then read Tetlock and Gardner (2015) for real-world context.
For theorists: Start with Savage (1971) and Schervish (1989), then proceed to Gneiting and Raftery (2007) and Ehm et al. (2016).
For market designers: Start with Hanson (2003), then read Chen and Pennock (2007) and Abernethy et al. (2013).