Chapter 9 Key Takeaways: Scoring Rules and Proper Incentives

Core Concepts

1. Scoring Rules Evaluate Probabilistic Forecasts

A scoring rule is a function $S(p, y)$ that takes a probability forecast $p$ and an outcome $y$ and returns a numerical score. It is the rigorous way to answer "how good was this forecast?" and to compare forecasters fairly.

2. The Three Major Scoring Rules

| Rule | Formula (binary) | Range | Best For |
|------|------------------|-------|----------|
| Brier | $(p-y)^2$ | $[0,1]$, lower = better | General use, intuitive, decomposable |
| Log | $y\ln p + (1-y)\ln(1-p)$ | $(-\infty,0]$, higher = better | Precise probabilities, rare events, markets |
| Spherical | $p_y / \lVert\mathbf{p}\rVert_2$ | $(0,1]$, higher = better | Bounded alternative with stronger sensitivity |
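
A minimal sketch of the three rules for a single binary forecast with outcome $y \in \{0, 1\}$; the function names are illustrative, not from the chapter:

```python
import numpy as np

def brier_score(p, y):
    """Brier score for a binary forecast: (p - y)^2, lower is better."""
    return (p - y) ** 2

def log_score(p, y, eps=1e-12):
    """Log score: y*ln(p) + (1-y)*ln(1-p), higher is better (at most 0)."""
    p = np.clip(p, eps, 1 - eps)          # guard against ln(0)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def spherical_score(p, y):
    """Spherical score: probability placed on the realized outcome, divided
    by the L2 norm of the whole forecast vector."""
    p_vec = np.array([1 - p, p])          # forecast over the outcomes {0, 1}
    return p_vec[int(y)] / np.linalg.norm(p_vec)

# Example: a 0.8 forecast on an event that occurred (y = 1)
print(brier_score(0.8, 1), log_score(0.8, 1), spherical_score(0.8, 1))
```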

3. Properness Is the Critical Property

A scoring rule is proper if honesty is optimal: your expected score is maximized by reporting your true belief. A scoring rule is strictly proper if honest reporting is the only optimal strategy. All three major scoring rules (Brier, log, spherical) are strictly proper. Without properness, forecasters have incentives to lie, and the forecasting system fails.
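
As a quick numerical check (a sketch, not from the chapter), sweep over possible reports and confirm that the expected Brier score is best exactly when the report equals the true belief:

```python
import numpy as np

q = 0.7                                   # true belief that the event occurs
reports = np.linspace(0.01, 0.99, 99)

# Expected Brier score (lower is better) when the true probability is q:
#   E[(r - Y)^2] = q*(r - 1)^2 + (1 - q)*r^2
expected_brier = q * (reports - 1) ** 2 + (1 - q) * reports ** 2

best = reports[np.argmin(expected_brier)]
print(f"Expected Brier is minimized at report {best:.2f} (true belief {q})")
```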

4. The Brier Score Has a Powerful Decomposition

$$\text{BS} = \underbrace{\text{Calibration}}_{\text{lower=better}} - \underbrace{\text{Resolution}}_{\text{higher=better}} + \underbrace{\text{Uncertainty}}_{\text{fixed}}$$

  • Calibration: Do your 70% forecasts come true 70% of the time?
  • Resolution: Do your forecasts vary meaningfully, or do you always say 50%?
  • Uncertainty: How inherently hard is the problem? (Not under your control.)
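
A sketch of the binned (Murphy-style) computation of this decomposition; the equal-width binning scheme is an illustrative assumption, and the identity is exact only when all forecasts within a bin are identical:

```python
import numpy as np

def brier_decomposition(forecasts, outcomes, n_bins=10):
    """Decompose the mean Brier score into calibration - resolution + uncertainty.
    Forecasts are grouped into equal-width bins and compared with the observed
    frequency in each bin."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1 - base_rate)              # fixed by the problem

    bins = np.clip((forecasts * n_bins).astype(int), 0, n_bins - 1)
    calibration = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        weight = mask.mean()                               # fraction of forecasts in this bin
        obs_freq = outcomes[mask].mean()                    # how often the event actually occurred
        mean_fcst = forecasts[mask].mean()                   # what was forecast, on average
        calibration += weight * (mean_fcst - obs_freq) ** 2  # lower is better
        resolution += weight * (obs_freq - base_rate) ** 2   # higher is better
    return calibration, resolution, uncertainty
```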

5. The Log Score Connects to Information Theory

The log score measures the information content (surprisal) of a forecast. The difference in expected log scores between honest and dishonest reporting equals the KL divergence between the true belief and the report -- always non-negative, confirming properness through an information-theoretic lens.
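
Concretely, if your true belief is the distribution $q$ and you report $p$, the expected loss in log score from the misreport is

$$\mathbb{E}_{y \sim q}[\ln q_y] - \mathbb{E}_{y \sim q}[\ln p_y] = \sum_j q_j \ln \frac{q_j}{p_j} = D_{\mathrm{KL}}(q \,\|\, p) \ge 0,$$

with equality only when $p = q$.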

6. Scoring Rules Generate Market Makers

Every proper scoring rule can be transformed into a cost-function-based automated market maker using Hanson's construction. The LMSR comes from the log score. The quadratic market maker comes from the Brier score. This is not just an analogy -- it is a precise mathematical correspondence.
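
A minimal sketch of the LMSR side of this correspondence, assuming the standard cost function $C(\mathbf{q}) = b \ln \sum_i e^{q_i/b}$ with liquidity parameter $b$ (variable and function names here are illustrative):

```python
import numpy as np

def lmsr_cost(q, b=100.0):
    """LMSR cost function C(q) = b * ln(sum_i exp(q_i / b)), q = outstanding shares."""
    return b * np.log(np.sum(np.exp(np.asarray(q) / b)))

def lmsr_prices(q, b=100.0):
    """Instantaneous prices are the softmax of q / b; they sum to 1."""
    z = np.asarray(q) / b
    e = np.exp(z - z.max())                # subtract the max for numerical stability
    return e / e.sum()

def trade_cost(q, delta, b=100.0):
    """A trader buying `delta` shares pays C(q + delta) - C(q)."""
    return lmsr_cost(np.asarray(q) + np.asarray(delta), b) - lmsr_cost(q, b)

q = [0.0, 0.0]                             # two-outcome market, no trades yet
print(lmsr_prices(q))                      # [0.5, 0.5]
print(trade_cost(q, [10.0, 0.0]))          # cost of buying 10 "yes" shares
```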

7. Different Rules Have Different Sensitivities

  • Brier: Constant sensitivity across all probability levels
  • Log: Increasing sensitivity near 0 and 1 (rewards getting extreme probabilities right)
  • Spherical: Between Brier and log

Choose based on your application: log for high-stakes forecasting where extreme probabilities matter; Brier for general use and ease of communication.
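
One way to see the difference (a sketch, using the expected-score loss from a misreport as the notion of sensitivity): under Brier, misreporting by a fixed amount costs the same everywhere, while under the log score the cost is a binary KL divergence that grows sharply near 0 and 1.

```python
import numpy as np

def brier_penalty(p, r):
    """Expected Brier loss from reporting r when the true probability is p."""
    return (p - r) ** 2

def log_penalty(p, r):
    """Expected log-score loss (binary KL divergence) from reporting r when truth is p."""
    return p * np.log(p / r) + (1 - p) * np.log((1 - p) / (1 - r))

for p in (0.05, 0.50, 0.95):
    r = p - 0.04                           # misreport by the same 4 points everywhere
    print(f"p={p:.2f}: Brier penalty {brier_penalty(p, r):.4f}, "
          f"log penalty {log_penalty(p, r):.4f}")
```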

8. Multi-Outcome Extensions Exist

  • Multiclass Brier: Sum of squared differences with one-hot outcome vector
  • Multiclass Log: $\ln(p_j)$ for the realized outcome $j$ (simplest -- only depends on probability of what happened)
  • RPS: For ordered outcomes, compares cumulative distributions
  • CRPS: For continuous outcomes, integrates over all thresholds
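
A sketch of the discrete multi-outcome versions; note that RPS normalization conventions vary across texts (some divide by the number of categories minus one), and the unnormalized form is used here:

```python
import numpy as np

def multiclass_brier(p, j):
    """Sum of squared differences between the forecast vector and the one-hot outcome."""
    e = np.zeros(len(p)); e[j] = 1.0
    return np.sum((np.asarray(p) - e) ** 2)

def multiclass_log(p, j, eps=1e-12):
    """Log score: ln of the probability assigned to the realized outcome j."""
    return np.log(max(p[j], eps))

def rps(p, j):
    """Ranked probability score for ordered outcomes: squared differences of
    the forecast and outcome cumulative distributions."""
    p = np.asarray(p, dtype=float)
    e = np.zeros(len(p)); e[j] = 1.0
    return np.sum((np.cumsum(p) - np.cumsum(e)) ** 2)

p = [0.2, 0.5, 0.3]                        # forecast over three ordered categories
print(multiclass_brier(p, 1), multiclass_log(p, 1), rps(p, 1))
```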

9. Practical Design Matters

When designing a scoring system:

  • Use a strictly proper rule (Brier or log)
  • Keep rewards linear in the score (this preserves properness)
  • Clamp probabilities (e.g., to [0.01, 0.99]) for numerical safety
  • Require participation on most questions to prevent cherry-picking
  • Consider time-weighted scoring to reward early accuracy
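
A sketch combining two of these ideas, clamping and time weighting, for a forecaster who submits one probability per day until a question resolves. The linear weighting scheme is an illustrative assumption; because the weights do not depend on the report, each day's score remains proper.

```python
import numpy as np

def time_weighted_brier(daily_forecasts, outcome, p_min=0.01, p_max=0.99):
    """Clamp each day's forecast, Brier-score it, and average with larger
    weights on earlier days, so accuracy (or error) early on counts for more."""
    p = np.clip(np.asarray(daily_forecasts, dtype=float), p_min, p_max)
    daily_scores = (p - outcome) ** 2                 # Brier score per day, lower is better
    n = len(p)
    weights = np.arange(n, 0, -1, dtype=float)        # day 1 gets weight n, last day gets 1
    return np.average(daily_scores, weights=weights)

# A forecaster who converged on the right answer early gets a better (lower) score:
print(time_weighted_brier([0.8, 0.85, 0.9, 0.95], outcome=1))
print(time_weighted_brier([0.5, 0.5, 0.6, 0.95], outcome=1))
```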

10. Gaming Is Defeated by Properness

Under a proper scoring rule, common gaming strategies (extremizing, hedging, contrarian betting, always-base-rate) all have lower expected scores than honest reporting. This is a mathematical guarantee, not just an empirical observation.
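
A quick numerical illustration (a sketch, with an assumed true belief of 0.7 and assumed reports for each strategy): every gaming strategy named above has a worse expected Brier score than honesty.

```python
import numpy as np

def expected_brier(report, belief):
    """Expected Brier score of `report` when the event truly occurs with probability `belief`."""
    return belief * (report - 1) ** 2 + (1 - belief) * report ** 2

belief = 0.7
strategies = {
    "honest":     belief,
    "extremize":  0.95,                    # push toward certainty
    "hedge":      0.55,                    # shade toward 50%
    "contrarian": 0.30,                    # bet against your own information
    "base rate":  0.50,                    # ignore your information entirely
}
for name, report in strategies.items():
    print(f"{name:>10}: expected Brier {expected_brier(report, belief):.4f}")
```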

Decision Guide: Which Scoring Rule Should I Use?

  • Building a prediction market? Use the log score (via LMSR).
  • Running a forecasting tournament for experts? Use the log score for strongest incentives.
  • Running a corporate forecasting program? Use the Brier score for simplicity and bounded risk.
  • Evaluating weather forecasts? Use the Brier score (traditional, decomposable).
  • Forecasting ordered outcomes (e.g., election margins)? Use the RPS.
  • Forecasting continuous variables? Use the CRPS.
  • Need bounded scores with strong incentives? Consider the spherical score.
  • Unsure? Report both Brier and log scores -- they usually agree on rankings.

Common Mistakes to Avoid

  1. Using an improper scoring rule (e.g., linear score, absolute score) -- this incentivizes lying; a sketch of why appears after this list.
  2. Nonlinear reward functions (e.g., "top 10 win prizes") -- this breaks properness even with proper scoring rules.
  3. Ignoring numerical issues at extreme probabilities -- always clamp or use epsilon.
  4. Confusing scoring conventions -- always check whether lower or higher is better.
  5. Over-interpreting small samples -- with few questions, luck dominates; use bootstrap confidence intervals.
  6. Scoring only the final forecast -- this incentivizes free-riding and late updating.
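
For mistake 1, a sketch of why an improper rule fails, using one common form of the linear score, $S(p, y) = py + (1-p)(1-y)$, as the example: its expectation is linear in the report, so the optimal report is always 0 or 1, never an honest interior belief.

```python
import numpy as np

# "Linear" score: the probability assigned to the realized outcome (improper).
belief = 0.7                               # assumed true probability of the event
reports = np.linspace(0.0, 1.0, 101)
expected_linear = belief * reports + (1 - belief) * (1 - reports)
print("best report under the linear score:", reports[np.argmax(expected_linear)])  # 1.0, not 0.7
```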