Chapter 12 Key Takeaways

The Big Picture

Calibration is the cornerstone of forecast quality measurement. A well-calibrated forecaster's stated probabilities match observed frequencies: when they say 70%, events happen 70% of the time. This chapter provided the complete toolkit for measuring, visualizing, and improving calibration in prediction markets.


Essential Concepts

1. Calibration Is Necessary but Not Sufficient

A forecaster who always predicts the base rate is perfectly calibrated but completely useless. Good forecasting requires calibration (probabilities match reality) AND resolution (forecasts discriminate between events that happen and events that do not). The maxim: "Maximize sharpness, subject to calibration."

2. Expected Calibration Error (ECE)

ECE is the workhorse metric for calibration quality: the average absolute gap between mean predicted probability and observed frequency in each bin, weighted by the number of forecasts per bin. Benchmarks: ECE below 0.02 is excellent, 0.02-0.05 is good, above 0.10 is poor.
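
As a concrete reference, here is a minimal NumPy sketch of the binned ECE computation described above; the function name and the 10-bin default are illustrative choices, not the chapter's specification.

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average |mean forecast - observed frequency| over equal-width bins.

    probs    : array of predicted probabilities in [0, 1]
    outcomes : array of 0/1 resolutions
    """
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each forecast to a bin; clip so that p == 1.0 lands in the last bin
    bins = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            # mask.mean() is n_k / N, the weight of bin k
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return ece
```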

3. Reliability Diagrams Tell the Full Story

A single number (ECE) can obscure important patterns. Reliability diagrams reveal whether you are overconfident, underconfident, or miscalibrated in specific probability ranges. Always generate a reliability diagram alongside scalar metrics.
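
For reference, a minimal matplotlib sketch of a reliability diagram, assuming arrays of forecasts and 0/1 outcomes. The rough binomial standard-error bars are a stand-in for the more careful confidence bands discussed in the rules of thumb below, and the function name is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(probs, outcomes, n_bins=10):
    """Plot observed frequency against mean forecast per bin, with the diagonal as reference."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    xs, ys, errs = [], [], []
    for k in range(n_bins):
        mask = bins == k
        if not mask.any():
            continue
        n_k = mask.sum()
        p_bar, o_bar = probs[mask].mean(), outcomes[mask].mean()
        xs.append(p_bar)
        ys.append(o_bar)
        errs.append(np.sqrt(o_bar * (1 - o_bar) / n_k))  # rough binomial standard error
    plt.errorbar(xs, ys, yerr=errs, fmt="o-", capsize=3, label="observed")
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.xlabel("Mean forecast probability")
    plt.ylabel("Observed frequency")
    plt.legend()
    plt.show()
```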

4. The Murphy Decomposition

The Brier score decomposes into three components: BS = REL - RES + UNC.

  • Reliability (REL): Your calibration error. Lower is better.
  • Resolution (RES): Your discriminative power. Higher is better.
  • Uncertainty (UNC): The inherent difficulty. Fixed for a given dataset.

This decomposition reveals whether poor scores come from bad calibration or poor discrimination, which dictates different improvement strategies.
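
A minimal NumPy sketch of the binned decomposition (the function name and 10-bin default are illustrative):

```python
import numpy as np

def murphy_decomposition(probs, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty) so that the binned
    Brier score is approximately REL - RES + UNC."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    N = len(probs)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    o_bar = outcomes.mean()                      # overall base rate
    rel = res = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            n_k = mask.sum()
            p_k, o_k = probs[mask].mean(), outcomes[mask].mean()
            rel += n_k / N * (p_k - o_k) ** 2    # calibration error (lower is better)
            res += n_k / N * (o_k - o_bar) ** 2  # discrimination (higher is better)
    unc = o_bar * (1 - o_bar)                    # inherent difficulty of the dataset
    return rel, res, unc
```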

5. Brier Skill Score (BSS)

BSS measures how much better you are than a naive baseline (usually the base rate). BSS = (RES - REL) / UNC. Positive BSS means you add value; negative BSS means you are worse than guessing the base rate. Typical values for skilled forecasters: 0.10-0.30.
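
Because BS = REL - RES + UNC and the base-rate forecast's Brier score equals UNC, BSS can also be computed directly as 1 - BS / BS_baseline, without any binning. A minimal sketch (function name illustrative):

```python
import numpy as np

def brier_skill_score(probs, outcomes):
    """BSS = 1 - BS / BS_baseline, with the base-rate forecast as the baseline."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    bs = np.mean((probs - outcomes) ** 2)
    base_rate = outcomes.mean()
    bs_baseline = np.mean((base_rate - outcomes) ** 2)  # equals UNC = o_bar * (1 - o_bar)
    return 1.0 - bs / bs_baseline
```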

6. Prediction Markets Are Generally Well-Calibrated

Empirical evidence shows prediction markets achieve ECE of 0.02-0.06. However, they exhibit systematic biases:

  • Favorite-longshot bias: High-probability events are slightly underpriced; low-probability events are slightly overpriced.
  • Liquidity effects: Higher-liquidity markets are better calibrated.
  • Category effects: Well-traded categories (sports, politics) outperform thinly traded categories (technology, science).

7. Recalibration Is a Free Lunch

If your forecasts have good resolution but poor calibration, recalibration (Platt scaling, isotonic regression) can improve your Brier score without losing information. Use cross-validation to avoid overfitting.
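
As a concrete starting point, here is a minimal scikit-learn sketch of both methods. The function names are illustrative, and the simple fit/apply split below stands in for the cross-validation the chapter recommends to avoid overfitting.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def _logit(p, eps=1e-6):
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    return np.log(p / (1 - p))

def recalibrate_isotonic(train_probs, train_outcomes, new_probs):
    """Isotonic regression: a monotone, non-parametric map from raw to calibrated probabilities."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(train_probs, train_outcomes)
    return iso.predict(new_probs)

def recalibrate_platt(train_probs, train_outcomes, new_probs):
    """Platt scaling: one-feature logistic regression on the log-odds of the raw forecasts."""
    lr = LogisticRegression()
    lr.fit(_logit(train_probs).reshape(-1, 1), train_outcomes)
    return lr.predict_proba(_logit(new_probs).reshape(-1, 1))[:, 1]
```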

8. Track Your Personal Calibration

Building a personal calibration tracker is the single most valuable investment for improving as a prediction market trader. Log predictions, record outcomes, review calibration reports regularly. Deliberate practice with fast feedback loops dramatically improves calibration.
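
A tracker does not need to be elaborate. Below is a minimal CSV-based sketch; the file name, fields, and function name are illustrative assumptions rather than the chapter's specification.

```python
import csv
import datetime
import pathlib

LOG_PATH = pathlib.Path("forecast_log.csv")  # illustrative file name

def log_forecast(question, probability, category, outcome=None):
    """Append one forecast to the CSV log; fill in outcome (0 or 1) once the question resolves."""
    is_new = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "question", "probability", "category", "outcome"])
        writer.writerow([datetime.date.today().isoformat(), question,
                         probability, category, "" if outcome is None else outcome])

# Example usage:
# log_forecast("Candidate X wins the primary", 0.65, "politics")
# Resolved rows can later be loaded and passed to the ECE and reliability-diagram
# sketches above to produce a periodic calibration report.
```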


Key Formulas

Metric Formula Goal
ECE $\sum_k \frac{n_k}{N} |\bar{p}_k - \bar{o}_k|$ Minimize
MCE $\max_k |\bar{p}_k - \bar{o}_k|$ Minimize
Reliability $\frac{1}{N} \sum_k n_k(\bar{p}_k - \bar{o}_k)^2$ Minimize
Resolution $\frac{1}{N} \sum_k n_k(\bar{o}_k - \bar{o})^2$ Maximize
BSS $\frac{\text{RES} - \text{REL}}{\text{UNC}}$ Maximize
Sharpness $\frac{1}{N} \sum_i |p_i - 0.5|$ Maximize (subject to calibration)

Practical Rules of Thumb

  1. Use 10 bins for calibration analysis with 1,000+ forecasts. Use 5 bins for smaller datasets.
  2. Always include confidence bands on reliability diagrams (see the Wilson-interval sketch after this list). Small samples produce noisy calibration estimates.
  3. Overconfidence is the most common bias. If you have not measured your calibration, you are probably overconfident.
  4. Resolution matters more than calibration in most practical settings. The binding constraint on forecast quality is usually discrimination, not calibration.
  5. Isotonic regression is the best general-purpose recalibration method for large datasets. Use Platt scaling for small datasets.
  6. 50-100 resolved forecasts is the minimum for meaningful calibration assessment. Aim for 200+ for reliable conclusions.
  7. Monitor calibration over time. A well-calibrated model today may drift as the underlying environment changes.
  8. Compare yourself to the market. If your BSS relative to market prices is not positive, you are not adding value by disagreeing with the market.
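
As a concrete reference for rule 2, here is a minimal sketch of the Wilson score interval, one standard way to put confidence bands on per-bin observed frequencies (the function name is illustrative):

```python
import numpy as np

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion, e.g. the observed
    frequency in one reliability-diagram bin."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Example: a bin with 14 hits out of 20 forecasts
# wilson_interval(14, 20) -> roughly (0.48, 0.85), a wide band that warns
# against over-interpreting calibration in small bins
```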

Common Mistakes to Avoid

  1. Evaluating calibration on too few forecasts. With 20 forecasts, almost any calibration pattern could be due to chance.
  2. Ignoring the forecast distribution histogram. A wildly miscalibrated bin with 3 forecasts is not meaningful.
  3. Confusing calibration with accuracy. A perfectly calibrated forecaster can still have a high Brier score (if they lack resolution).
  4. Overfitting recalibration. Always cross-validate recalibration models.
  5. Ignoring category differences. Your calibration in one domain may not transfer to another.
  6. Not logging market prices. Without market comparison, you cannot assess whether your forecasts add value.

Connection to Part III

This chapter completes Part II (Market Microstructure and Pricing). The calibration tools developed here will be used throughout Part III (Trading Strategies):

  • Chapter 13 (Portfolio Construction): Position sizing depends on the calibration of your probability estimates.
  • Chapter 15 (Favorite-Longshot Bias): The calibration analysis framework is the primary tool for detecting and exploiting this bias.
  • Chapter 17 (Model-Based Trading): Recalibration is a standard post-processing step for predictive models used in trading.

Your calibration report is your most honest performance evaluation. Make it a habit.