> "The whole problem with the world is that fools and fanatics are always so certain of themselves, and wiser people so full of doubts."
In This Chapter
- 9.1 What Are Scoring Rules?
- 9.2 The Brier Score
- 9.3 The Logarithmic Score
- 9.4 The Spherical Score
- 9.5 What Makes a Scoring Rule "Proper"?
- 9.6 The Deep Connection: Scoring Rules and Market Makers
- 9.7 Comparing Scoring Rules
- 9.8 Scoring Rules for Multiple Outcomes
- 9.9 Practical Scoring and Evaluation
- 9.10 Scoring Rule Manipulation and Edge Cases
- 9.11 Advanced: Weighted Scoring Rules and Asymmetric Loss
- 9.12 Chapter Summary
- What's Next
- Key Equations Reference
Chapter 9: Scoring Rules and Proper Incentives
"The whole problem with the world is that fools and fanatics are always so certain of themselves, and wiser people so full of doubts." — Bertrand Russell
Imagine you run a weather forecasting service. Two of your meteorologists give you their forecasts for rain tomorrow. Alice says there is a 70% chance of rain, and Bob says there is a 90% chance. Tomorrow comes and it rains. Who was the better forecaster? And more importantly, how do you set up a system so that both Alice and Bob are motivated to tell you their honest beliefs about the probability of rain, rather than gaming whatever scoring system you put in place?
This is the fundamental problem that scoring rules solve. In this chapter, we will explore the mathematics and practice of scoring rules -- the formal tools that assign numerical rewards (or penalties) to probabilistic forecasts based on what actually happens. Scoring rules are far more than academic curiosities: they are the hidden engine that drives prediction markets. Every well-designed prediction market is, at its core, built on a proper scoring rule. Understanding scoring rules gives you deep insight into why prediction markets work and how they incentivize truth-telling.
We will cover the three most important scoring rules -- the Brier score, the logarithmic score, and the spherical score -- along with the crucial concept of properness, which guarantees that honesty is the best policy. We will then reveal the deep mathematical connection between scoring rules and the automated market makers from Chapter 8, showing that Robin Hanson's Logarithmic Market Scoring Rule (LMSR) is literally the logarithmic scoring rule turned into a market maker.
By the end of this chapter, you will be able to:
- Define and compute the Brier, logarithmic, and spherical scores
- Explain what makes a scoring rule "proper" and why properness matters
- Decompose the Brier score into calibration, resolution, and uncertainty
- Connect scoring rules to cost-function-based market makers
- Implement and compare scoring rules in Python
- Design scoring systems for forecasting tournaments and platforms
9.1 What Are Scoring Rules?
The Core Problem: Rewarding Honest Forecasts
Suppose you ask someone to forecast the probability of an event -- say, whether a particular startup will be profitable within three years. The forecaster tells you a number: 60%. Three years pass, and the startup is indeed profitable. Was 60% a "good" forecast?
This question is trickier than it seems. A single forecast on a single event cannot, by itself, tell you whether the forecaster is skilled. After all, a forecaster who said 99% would also have been "right," and a forecaster who said 10% would also not have been "wrong" -- the event was always possible at any probability. The power of scoring rules emerges when we evaluate forecasters across many predictions.
A scoring rule is a function that takes two inputs:
- A probability forecast $p$ (the forecaster's stated probability for an event)
- The outcome $y$ (did the event actually happen? $y = 1$ for yes, $y = 0$ for no)
And produces a single number: the score $S(p, y)$.
$$S: [0, 1] \times \{0, 1\} \rightarrow \mathbb{R}$$
The score tells us how "good" the forecast was, given what actually happened. For some scoring rules higher scores are better; for others lower scores are better. The convention varies by context, and we will be explicit about which direction is used for each rule.
Why Scoring Rules Matter for Prediction Markets
You might wonder: if we have prediction markets, why do we need scoring rules? The answer is that scoring rules and prediction markets are deeply intertwined:
- Scoring rules motivate honest reporting. In a prediction market, traders reveal their beliefs through their trades. But what ensures they trade honestly? The market's payoff structure, which is built on a scoring rule.
- Every proper scoring rule generates a market maker. As we will see in Section 9.6, there is a mathematical correspondence between proper scoring rules and cost-function-based automated market makers (like the LMSR from Chapter 8).
- Scoring rules evaluate market performance. After a market closes, we need to assess how well the market's prices predicted outcomes. Scoring rules give us the tools to do this rigorously.
- Forecasting platforms use scoring rules directly. Platforms like Metaculus, Good Judgment Open, and many corporate forecasting programs score their participants using explicit scoring rules rather than market mechanisms.
A Simple Example
Let us start with the simplest possible scoring rule: the absolute score.
$$S_{\text{abs}}(p, y) = -|p - y|$$
If the event happens ($y = 1$) and you said $p = 0.9$, your score is $-|0.9 - 1| = -0.1$. If you said $p = 0.3$, your score is $-|0.3 - 1| = -0.7$. Higher (less negative) scores are better.
This seems reasonable, but as we will discover in Section 9.5, the absolute score has a critical flaw: it is not proper. A forecaster can do better by lying about their true beliefs. This is why we need to study properness carefully.
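To preview the problem: suppose your true belief is $q = 0.7$. Under the absolute score, the expected score from reporting $p$ is
$$\mathbb{E}_q[S_{\text{abs}}(p)] = -\left[q(1-p) + (1-q)p\right] = -(0.7 - 0.4p),$$
which increases in $p$ and is maximized at $p = 1$, not at the honest report $p = 0.7$. Exaggeration pays, which is exactly the flaw that properness rules out.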
9.2 The Brier Score
Definition
The Brier score, introduced by Glenn Brier in 1950 for evaluating weather forecasts, is the most widely used scoring rule. It is defined as:
$$\text{BS}(p, y) = (p - y)^2$$
where:
- $p$ is the forecaster's stated probability for the event occurring (between 0 and 1)
- $y$ is the outcome (1 if the event occurred, 0 if it did not)
The Brier score ranges from 0 to 1, and lower is better. A score of 0 means a perfect forecast (you said 100% and the event happened, or you said 0% and it did not happen). A score of 1 means the worst possible forecast (you said 100% and the event did not happen, or vice versa).
Convention note: Some sources define the Brier score with a negative sign $-(p - y)^2$ so that higher is better. We use the original convention where lower is better, which is more common in weather forecasting and statistics. When reading other sources, always check which convention they use.
Worked Examples
Example 1: Rain Forecast
Alice forecasts a 70% chance of rain ($p = 0.7$). It rains ($y = 1$).
$$\text{BS} = (0.7 - 1)^2 = (-0.3)^2 = 0.09$$
Example 2: Election Forecast
Bob forecasts a 55% chance that Candidate X wins ($p = 0.55$). Candidate X loses ($y = 0$).
$$\text{BS} = (0.55 - 0)^2 = 0.3025$$
Example 3: Comparing Two Forecasters
For a series of 5 events, Alice and Bob provide these forecasts:
| Event | Alice's $p$ | Bob's $p$ | Outcome $y$ | Alice BS | Bob BS |
|---|---|---|---|---|---|
| 1 | 0.70 | 0.90 | 1 | 0.0900 | 0.0100 |
| 2 | 0.30 | 0.10 | 0 | 0.0900 | 0.0100 |
| 3 | 0.60 | 0.80 | 1 | 0.1600 | 0.0400 |
| 4 | 0.40 | 0.20 | 1 | 0.3600 | 0.6400 |
| 5 | 0.50 | 0.50 | 0 | 0.2500 | 0.2500 |
Average Brier scores: Alice = 0.1900, Bob = 0.1900. Interesting -- they perform equally on average despite having very different strategies! Bob is more "extreme" (further from 0.5) and gets rewarded when right but punished heavily when wrong (Event 4).
The Brier Skill Score
To contextualize a Brier score, we often compare it to a reference forecast -- typically the base rate (climatological frequency) or a naive forecast of 0.5. The Brier Skill Score (BSS) is:
$$\text{BSS} = 1 - \frac{\text{BS}_{\text{forecast}}}{\text{BS}_{\text{reference}}}$$
- BSS = 1 means perfect forecasting
- BSS = 0 means no improvement over the reference
- BSS < 0 means worse than the reference
If the reference is always predicting $p = 0.5$, then $\text{BS}_{\text{reference}} = 0.25$ for any event, and:
$$\text{BSS} = 1 - \frac{\text{BS}_{\text{forecast}}}{0.25}$$
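A minimal helper for computing the BSS against a constant reference forecast (the function name and defaults are our own, not from any standard library):
import numpy as np

def brier_skill_score(forecasts: np.ndarray, outcomes: np.ndarray,
                      reference: float = 0.5) -> float:
    """Brier Skill Score relative to a constant reference forecast.

    BSS = 1 - BS_forecast / BS_reference. Higher is better: 0 means no
    improvement over the reference, 1 means perfect forecasting.
    """
    bs_forecast = np.mean((forecasts - outcomes) ** 2)
    bs_reference = np.mean((reference - outcomes) ** 2)
    return 1 - bs_forecast / bs_reference

forecasts = np.array([0.7, 0.3, 0.6, 0.4, 0.5])
outcomes = np.array([1, 0, 1, 1, 0])
print(f"BSS vs p=0.5 reference: {brier_skill_score(forecasts, outcomes):.4f}")
# With these numbers: 1 - 0.19/0.25 = 0.24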
Brier Score Decomposition
One of the most powerful features of the Brier score is that it can be decomposed into three meaningful components. When we have $N$ forecasts that we group into $K$ bins based on the forecasted probability:
$$\text{BS} = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k (\bar{p}_k - \bar{y}_k)^2}_{\text{Calibration (REL)}} - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k (\bar{y}_k - \bar{y})^2}_{\text{Resolution (RES)}} + \underbrace{\bar{y}(1-\bar{y})}_{\text{Uncertainty (UNC)}}$$
where:
- $n_k$ = number of forecasts in bin $k$
- $\bar{p}_k$ = mean forecast probability in bin $k$
- $\bar{y}_k$ = observed frequency of outcome in bin $k$
- $\bar{y}$ = overall base rate of the outcome
Calibration (Reliability): How well do the forecasted probabilities match the observed frequencies? When a forecaster says "70%," does the event happen about 70% of the time? Lower calibration values are better. A perfectly calibrated forecaster has calibration = 0.
Resolution: How much do the forecaster's probabilities vary across different situations? A forecaster who always says 50% has zero resolution. Higher resolution is better (note the minus sign in the decomposition).
Uncertainty: This depends only on the base rate of the outcomes, not on the forecaster. It measures how inherently hard the forecasting problem is. Events with base rates near 50% are hardest (uncertainty = 0.25); events with extreme base rates are easier.
So: $\text{BS} = \text{Calibration} - \text{Resolution} + \text{Uncertainty}$.
A good forecaster has low calibration (well-calibrated) and high resolution (can distinguish likely from unlikely events).
Python Implementation
import numpy as np
from typing import Union, List
def brier_score(forecast: float, outcome: int) -> float:
"""
Compute the Brier score for a single forecast.
Parameters
----------
forecast : float
Predicted probability of the event occurring, in [0, 1].
outcome : int
Actual outcome: 1 if event occurred, 0 otherwise.
Returns
-------
float
Brier score in [0, 1]. Lower is better.
"""
return (forecast - outcome) ** 2
def mean_brier_score(forecasts: np.ndarray, outcomes: np.ndarray) -> float:
"""Compute the mean Brier score across multiple forecasts."""
return np.mean((forecasts - outcomes) ** 2)
# Example usage
forecasts = np.array([0.7, 0.3, 0.6, 0.4, 0.5])
outcomes = np.array([1, 0, 1, 1, 0])
print(f"Mean Brier Score: {mean_brier_score(forecasts, outcomes):.4f}")
# Output: Mean Brier Score: 0.1900
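The decomposition from earlier in this section can also be computed directly by binning forecasts, reusing the forecasts and outcomes arrays from the example above. The sketch below uses ten equal-width bins; the bin count and function name are our choices rather than a standard:
def brier_decomposition(forecasts: np.ndarray, outcomes: np.ndarray,
                        n_bins: int = 10):
    """Decompose the mean Brier score into calibration, resolution, uncertainty."""
    n = len(forecasts)
    base_rate = np.mean(outcomes)
    bins = np.minimum((forecasts * n_bins).astype(int), n_bins - 1)
    calibration, resolution = 0.0, 0.0
    for k in range(n_bins):
        mask = bins == k
        n_k = np.sum(mask)
        if n_k == 0:
            continue
        p_bar = np.mean(forecasts[mask])   # mean forecast in bin k
        y_bar = np.mean(outcomes[mask])    # observed frequency in bin k
        calibration += n_k * (p_bar - y_bar) ** 2
        resolution += n_k * (y_bar - base_rate) ** 2
    return calibration / n, resolution / n, base_rate * (1 - base_rate)

cal, res, unc = brier_decomposition(forecasts, outcomes)
# CAL - RES + UNC approximates the mean Brier score; the identity is exact
# when every forecast within a bin takes the same value.
print(f"CAL={cal:.4f}, RES={res:.4f}, UNC={unc:.4f}, CAL-RES+UNC={cal - res + unc:.4f}")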
9.3 The Logarithmic Score
Definition
The logarithmic scoring rule (or log score) assigns a score based on the natural logarithm of the probability assigned to the outcome that actually occurred:
$$\text{LS}(p, y) = \begin{cases} \ln(p) & \text{if } y = 1 \\ \ln(1 - p) & \text{if } y = 0 \end{cases}$$
This can be written more compactly as:
$$\text{LS}(p, y) = y \ln(p) + (1 - y) \ln(1 - p)$$
The log score is always negative (or zero), and higher (less negative) is better. A perfect forecast on an event that occurs ($p = 1, y = 1$) gives $\ln(1) = 0$. A forecast of $p = 0$ on an event that occurs gives $\ln(0) = -\infty$ -- an infinite penalty.
Worked Examples
Example 1: Confident and Correct
You forecast $p = 0.9$ for an event that happens ($y = 1$):
$$\text{LS} = \ln(0.9) \approx -0.105$$
Example 2: Confident and Wrong
You forecast $p = 0.9$ for an event that does not happen ($y = 0$):
$$\text{LS} = \ln(1 - 0.9) = \ln(0.1) \approx -2.303$$
The penalty for being confidently wrong ($-2.303$) is much more severe than the reward for being confidently right ($-0.105$). This asymmetry is a key feature of the log score.
Example 3: The Danger of Extreme Forecasts
| Forecast $p$ | Score if $y=1$ | Score if $y=0$ |
|---|---|---|
| 0.50 | -0.693 | -0.693 |
| 0.70 | -0.357 | -1.204 |
| 0.90 | -0.105 | -2.303 |
| 0.99 | -0.010 | -4.605 |
| 0.999 | -0.001 | -6.908 |
| 0.9999 | -0.0001 | -9.210 |
Notice that moving from 0.99 to 0.999 barely improves your score if you are right ($-0.010$ to $-0.001$) but dramatically worsens your score if you are wrong ($-4.605$ to $-6.908$). The log score strongly discourages extreme probabilities unless you are very confident.
Connection to Information Theory
The log score has a deep connection to information theory. The expected log score under the forecaster's true belief $q$ is:
$$\mathbb{E}_q[\text{LS}(p)] = q \ln(p) + (1 - q) \ln(1 - p)$$
This is maximized when $p = q$ (reporting honestly). The difference between the expected score at the true belief and the expected score at the reported probability is the Kullback-Leibler (KL) divergence:
$$D_{\text{KL}}(q \| p) = q \ln\frac{q}{p} + (1-q) \ln\frac{1-q}{1-p}$$
The KL divergence is always non-negative and equals zero only when $p = q$. This means:
$$\mathbb{E}_q[\text{LS}(q)] - \mathbb{E}_q[\text{LS}(p)] = D_{\text{KL}}(q \| p) \geq 0$$
In words: your expected log score is always highest when you report your true belief. This is exactly what makes the log score proper (more on this in Section 9.5).
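A quick numerical check of this identity (the particular values of $q$ and $p$ are arbitrary):
import numpy as np

q, p = 0.8, 0.6   # true belief q, reported probability p

expected_honest = q * np.log(q) + (1 - q) * np.log(1 - q)
expected_report = q * np.log(p) + (1 - q) * np.log(1 - p)
kl = q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

# The gap between honest and dishonest expected log scores equals D_KL(q || p)
print(f"{expected_honest - expected_report:.4f}")  # 0.0915
print(f"{kl:.4f}")                                 # 0.0915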
The connection to information theory also means the log score measures the information content (or surprisal) of the forecast. A forecast of $p = 0.5$ for an event that happens contains $-\ln(0.5) \approx 0.693$ nats of information. A forecast of $p = 0.99$ for an event that happens contains only $-\ln(0.99) \approx 0.01$ nats -- it is barely surprising, so the information content is low.
Why Logarithmic Scoring Is Used in Practice
The log score is the most widely used scoring rule in practice for several reasons:
- Locality: The log score depends only on the probability assigned to the outcome that occurred, not on probabilities assigned to other outcomes. This is called being a "local" scoring rule. It simplifies computation and interpretation for multi-outcome events.
- Information-theoretic foundation: The connection to KL divergence and entropy gives it a principled interpretation.
- Strong incentive for calibration: The unbounded penalty near 0 and 1 forces forecasters to take extreme probabilities seriously.
- Connection to LMSR: As we will see in Section 9.6, the log score directly generates the Logarithmic Market Scoring Rule, the most popular automated market maker.
Python Implementation
import numpy as np
def log_score(forecast: float, outcome: int, eps: float = 1e-15) -> float:
"""
Compute the logarithmic score for a single forecast.
Parameters
----------
forecast : float
Predicted probability of the event occurring, in [0, 1].
outcome : int
Actual outcome: 1 if event occurred, 0 otherwise.
eps : float
Small constant to avoid log(0). Default: 1e-15.
Returns
-------
float
Log score (negative or zero). Higher (less negative) is better.
"""
# Clip forecast to avoid log(0)
forecast = np.clip(forecast, eps, 1 - eps)
if outcome == 1:
return np.log(forecast)
else:
return np.log(1 - forecast)
def mean_log_score(forecasts: np.ndarray, outcomes: np.ndarray,
eps: float = 1e-15) -> float:
"""Compute the mean log score across multiple forecasts."""
forecasts = np.clip(forecasts, eps, 1 - eps)
scores = outcomes * np.log(forecasts) + (1 - outcomes) * np.log(1 - forecasts)
return np.mean(scores)
# Example usage
forecasts = np.array([0.7, 0.3, 0.6, 0.4, 0.5])
outcomes = np.array([1, 0, 1, 1, 0])
print(f"Mean Log Score: {mean_log_score(forecasts, outcomes):.4f}")
# Output: Mean Log Score: -0.6199
9.4 The Spherical Score
Definition
The spherical scoring rule (also called the spherical payoff rule) is less commonly used than the Brier or log score but has interesting geometric properties. For a binary event, it is defined as:
$$\text{SS}(p, y) = \begin{cases} \displaystyle\frac{p}{\sqrt{p^2 + (1-p)^2}} & \text{if } y = 1 \\[8pt] \displaystyle\frac{1-p}{\sqrt{p^2 + (1-p)^2}} & \text{if } y = 0 \end{cases}$$
The spherical score ranges from 0 (worst, approached when the forecast places all probability on the wrong outcome) to 1 (perfect forecast), and it equals $\frac{1}{\sqrt{2}} \approx 0.707$ for the uninformative forecast $p = 0.5$. Higher is better.
Let us be more precise. For the general $n$-outcome case, if the forecaster provides a probability vector $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ and outcome $j$ occurs, the spherical score is:
$$\text{SS}(\mathbf{p}, j) = \frac{p_j}{\|\mathbf{p}\|_2} = \frac{p_j}{\sqrt{\sum_{i=1}^{n} p_i^2}}$$
Geometric Interpretation
The name "spherical" comes from a geometric interpretation. Think of the probability vector $\mathbf{p}$ as a point in $n$-dimensional space. The spherical score normalizes this vector to lie on the unit sphere (by dividing by its Euclidean norm) and then takes the component corresponding to the realized outcome.
Geometrically, the spherical score measures the cosine of the angle between the probability vector and the unit vector pointing in the direction of the realized outcome. When the probability vector points directly toward the outcome (maximum probability on the correct outcome), the cosine is 1 and the score is maximized.
Worked Examples
Example 1: Binary Event, Correct Direction
Forecast $p = 0.8$, event happens ($y = 1$):
$$\text{SS} = \frac{0.8}{\sqrt{0.8^2 + 0.2^2}} = \frac{0.8}{\sqrt{0.64 + 0.04}} = \frac{0.8}{\sqrt{0.68}} = \frac{0.8}{0.8246} \approx 0.9701$$
Example 2: Binary Event, Wrong Direction
Forecast $p = 0.8$, event does not happen ($y = 0$):
$$\text{SS} = \frac{0.2}{\sqrt{0.8^2 + 0.2^2}} = \frac{0.2}{0.8246} \approx 0.2425$$
Example 3: Comparison at Different Forecast Levels
| Forecast $p$ | SS if $y=1$ | SS if $y=0$ |
|---|---|---|
| 0.50 | 0.7071 | 0.7071 |
| 0.60 | 0.8321 | 0.5547 |
| 0.70 | 0.9285 | 0.3714 |
| 0.80 | 0.9701 | 0.2425 |
| 0.90 | 0.9945 | 0.1106 |
| 0.99 | 0.9999 | 0.0101 |
Comparison to Brier and Log Scores
The spherical score sits between the Brier and log scores in terms of how harshly it penalizes incorrect extreme forecasts:
- Log score: Infinite penalty for assigning probability 0 to an event that occurs
- Spherical score: Finite but severe penalty (score approaches 0)
- Brier score: Finite penalty, maximum of 1
The spherical score is less commonly used in practice, primarily because:
1. It lacks the intuitive decomposition of the Brier score
2. It does not have the information-theoretic interpretation of the log score
3. It is less widely supported in forecasting software
However, it is useful when you want a bounded score (unlike the log score) that still provides stronger incentives than the Brier score for distinguishing between high probabilities (e.g., 0.9 vs 0.99).
Python Implementation
import numpy as np
def spherical_score(forecast: float, outcome: int) -> float:
"""
Compute the spherical score for a single binary forecast.
Parameters
----------
forecast : float
Predicted probability of the event occurring, in [0, 1].
outcome : int
Actual outcome: 1 if event occurred, 0 otherwise.
Returns
-------
float
Spherical score in (0, 1]. Higher is better.
"""
norm = np.sqrt(forecast ** 2 + (1 - forecast) ** 2)
if outcome == 1:
return forecast / norm
else:
return (1 - forecast) / norm
def spherical_score_multiclass(probs: np.ndarray, outcome_index: int) -> float:
"""
Compute the spherical score for a multi-outcome forecast.
Parameters
----------
probs : np.ndarray
Probability vector (must sum to 1).
outcome_index : int
Index of the outcome that occurred.
Returns
-------
float
Spherical score. Higher is better.
"""
norm = np.linalg.norm(probs)
return probs[outcome_index] / norm
# Example usage
print(f"Spherical score (p=0.8, y=1): {spherical_score(0.8, 1):.4f}")
print(f"Spherical score (p=0.8, y=0): {spherical_score(0.8, 0):.4f}")
# Output:
# Spherical score (p=0.8, y=1): 0.9701
# Spherical score (p=0.8, y=0): 0.2425
9.5 What Makes a Scoring Rule "Proper"?
The Central Question
We have now seen three scoring rules. But which ones actually incentivize honest reporting? And what does "incentivize honest reporting" even mean mathematically?
The key insight is this: a forecaster has a true belief $q$ about the probability of an event. They must report a probability $p$ to the scoring system. The question is: under what conditions is the forecaster's best strategy to report $p = q$?
Formal Definition
A scoring rule $S(p, y)$ is proper if, for every true belief $q \in [0, 1]$:
$$\mathbb{E}_{y \sim q}[S(q, y)] \geq \mathbb{E}_{y \sim q}[S(p, y)] \quad \text{for all } p \in [0, 1]$$
In words: the expected score is maximized (or the expected penalty is minimized) when the forecaster reports their true belief.
A scoring rule is strictly proper if the inequality is strict for $p \neq q$:
$$\mathbb{E}_{y \sim q}[S(q, y)] > \mathbb{E}_{y \sim q}[S(p, y)] \quad \text{for all } p \neq q$$
This means the only optimal strategy is to report truthfully. There is no other probability that achieves the same expected score.
Checking Properness: The Brier Score
Let us verify that the Brier score is strictly proper. The forecaster's true belief is $q$. If they report $p$, their expected Brier score (using the convention where lower is better, so we want to minimize) is:
$$\mathbb{E}_q[\text{BS}(p)] = q(p - 1)^2 + (1-q)(p - 0)^2$$
$$= q(1 - p)^2 + (1-q)p^2$$
$$= q - 2qp + qp^2 + p^2 - qp^2$$
$$= p^2 - 2qp + q$$
To find the minimum, take the derivative with respect to $p$ and set it to zero:
$$\frac{d}{dp}\mathbb{E}_q[\text{BS}(p)] = 2p - 2q = 0$$
$$p^* = q$$
The second derivative is $2 > 0$, confirming this is a minimum. So the expected Brier score is minimized (best performance) when $p = q$ -- the forecaster reports their true belief. The Brier score is strictly proper.
Checking Properness: The Log Score
For the log score (where higher is better), the expected score when the true belief is $q$ and the report is $p$ is:
$$\mathbb{E}_q[\text{LS}(p)] = q \ln(p) + (1-q)\ln(1-p)$$
Taking the derivative:
$$\frac{d}{dp}\mathbb{E}_q[\text{LS}(p)] = \frac{q}{p} - \frac{1-q}{1-p} = 0$$
$$q(1-p) = (1-q)p$$
$$q - qp = p - qp$$
$$p^* = q$$
The second derivative is $-\frac{q}{p^2} - \frac{1-q}{(1-p)^2} < 0$, confirming this is a maximum. The log score is strictly proper.
An Example of an Improper Rule
To understand why properness matters, let us examine the linear score:
$$S_{\text{linear}}(p, y) = \begin{cases} p & \text{if } y = 1 \\ 1-p & \text{if } y = 0 \end{cases}$$
This seems reasonable: you get rewarded proportionally to the probability you assigned to the correct outcome. But is it proper?
The expected score when truth is $q$ and report is $p$:
$$\mathbb{E}_q[S_{\text{linear}}(p)] = qp + (1-q)(1-p) = qp + 1 - q - p + qp = 2qp - p - q + 1$$
Taking the derivative:
$$\frac{d}{dp} = 2q - 1$$
This is positive when $q > 0.5$ and negative when $q < 0.5$. So:
- If $q > 0.5$, the forecaster should report $p = 1$ (not $p = q$)
- If $q < 0.5$, the forecaster should report $p = 0$ (not $p = q$)
- If $q = 0.5$, any $p$ is equally good
The linear score incentivizes extreme forecasts -- it pushes forecasters to report 0 or 1 regardless of their true beliefs! This is terrible for a forecasting system. The linear score is improper.
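A short grid search over possible reports makes the contrast with a proper rule concrete (a minimal sketch; the true belief of 0.7 is arbitrary):
import numpy as np

q = 0.7                               # forecaster's true belief
reports = np.linspace(0.0, 1.0, 101)  # candidate reports p

# Expected scores under the true belief q, as a function of the report p
expected_linear = q * reports + (1 - q) * (1 - reports)           # higher is better
expected_brier = q * (reports - 1) ** 2 + (1 - q) * reports ** 2  # lower is better

print(f"Linear score: best report = {reports[np.argmax(expected_linear)]:.2f}")  # 1.00
print(f"Brier score:  best report = {reports[np.argmin(expected_brier)]:.2f}")   # 0.70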
Why Properness Matters for Prediction Markets
Properness is not just an academic nicety. It has profound practical implications:
- Truth-telling is optimal. In a properly scored system, forecasters do not need to strategize or second-guess other participants. Their best move is simply to report what they believe. This dramatically simplifies the forecasting task.
- Information revelation. When every participant honestly reports their beliefs, the aggregate forecast reflects the group's genuine information. Improper rules cause information to be distorted or lost.
- No gaming opportunities. With a proper rule, there is no way to "hack" the system. You cannot improve your expected score by misreporting, hedging, or engaging in strategic behavior (at least on individual questions).
- Market maker design. As we will see, the properness of the underlying scoring rule guarantees that the market maker correctly incentivizes information revelation through trading.
The Space of Proper Scoring Rules
It turns out there are infinitely many proper scoring rules. In fact, there is a beautiful mathematical characterization: every strictly proper scoring rule can be written in terms of a convex function $G$:
$$S(p, y) = G(p) + G'(p)(y - p)$$
where $G$ is a strictly convex function on $[0, 1]$ and $G'$ is its derivative; as written, the construction yields rules in the higher-is-better orientation. Different choices of $G$ give different proper scoring rules:
- $G(p) = -p(1-p)$ gives the negative Brier score $-(p - y)^2$, i.e., the Brier score up to orientation
- $G(p) = p \ln p + (1-p)\ln(1-p)$ -- the negative binary entropy -- gives the logarithmic score
- Other convex functions give other proper scoring rules
This characterization, due to Savage (1971) and developed further by Schervish (1989), gives us a complete understanding of the space of proper scoring rules.
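As a check, substituting the negative binary entropy into this formula recovers the log score. With $G(p) = p\ln p + (1-p)\ln(1-p)$ we have $G'(p) = \ln\frac{p}{1-p}$, so for $y = 1$:
$$S(p, 1) = p\ln p + (1-p)\ln(1-p) + (1-p)\ln\frac{p}{1-p} = p\ln p + (1-p)\ln p = \ln p$$
and similarly $S(p, 0) = \ln(1-p)$ -- exactly the logarithmic score.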
9.6 The Deep Connection: Scoring Rules and Market Makers
Hanson's Key Insight
In Chapter 8, we introduced the Logarithmic Market Scoring Rule (LMSR) as a market maker. We described it in terms of cost functions and prices. Now we can reveal where it really comes from: the LMSR is the logarithmic scoring rule transformed into a market maker.
Robin Hanson's insight was elegant: instead of having a single forecaster who provides a probability and gets scored, have a sequence of forecasters where each one updates the previous forecast, and each is scored on the change in score.
Here is how it works:
- The market starts with an initial probability $p_0$ (typically 0.5 or the base rate).
- Trader 1 arrives and moves the probability to $p_1$. Trader 1's payoff (when the event resolves) is:
$$\text{Payoff}_1 = S(p_1, y) - S(p_0, y)$$
They get the score at their new probability minus the score at the old probability.
- Trader 2 moves the probability to $p_2$. Their payoff is:
$$\text{Payoff}_2 = S(p_2, y) - S(p_1, y)$$
- And so on. Trader $k$'s payoff is $S(p_k, y) - S(p_{k-1}, y)$.
Because the scoring rule is proper, each trader maximizes their expected payoff by moving the probability to their true belief. And notice that the payoffs telescope -- the total paid out by the market is:
$$\sum_{k=1}^{n} \text{Payoff}_k = S(p_n, y) - S(p_0, y)$$
The market maker's total loss is bounded by the range of the scoring rule.
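A small simulation of this construction using the log score (the starting probability, trader beliefs, and outcome below are made up for illustration):
import numpy as np

def log_score(p: float, y: int) -> float:
    return np.log(p) if y == 1 else np.log(1 - p)

p0 = 0.5                         # market's initial probability
beliefs = [0.6, 0.75, 0.7, 0.9]  # each trader moves the price to their belief
y = 1                            # eventual outcome

prices = [p0] + beliefs
payoffs = [log_score(prices[k + 1], y) - log_score(prices[k], y)
           for k in range(len(beliefs))]

print("Individual payoffs:", np.round(payoffs, 4))
print("Sum of payoffs:    ", round(sum(payoffs), 4))
print("Telescoped total:  ", round(log_score(prices[-1], y) - log_score(p0, y), 4))
# The last two lines agree: the market's total payout is S(p_n, y) - S(p_0, y)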
From Log Score to LMSR
When we use the logarithmic scoring rule, the payoff for a trader who moves the probability from $p_{\text{old}}$ to $p_{\text{new}}$ is:
If $y = 1$: $\ln(p_{\text{new}}) - \ln(p_{\text{old}})$
If $y = 0$: $\ln(1-p_{\text{new}}) - \ln(1-p_{\text{old}})$
Now, let us think of this in terms of shares. Define the number of "yes" shares outstanding as $q_1$ and "no" shares as $q_2$. The LMSR cost function from Chapter 8 was:
$$C(q_1, q_2) = b \ln(e^{q_1/b} + e^{q_2/b})$$
The price (probability) for the "yes" outcome is:
$$p = \frac{e^{q_1/b}}{e^{q_1/b} + e^{q_2/b}}$$
When a trader buys shares, they pay the change in the cost function. When the market resolves, each "yes" share pays \$1 if $y = 1$ and \$0 if $y = 0$. The net effect is exactly the same as the log scoring rule difference -- the trader's profit/loss is $S(p_{\text{new}}, y) - S(p_{\text{old}}, y)$ under the logarithmic score (up to scaling by $b$).
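This equivalence can be verified numerically. The sketch below moves an LMSR market from one probability to another by buying "yes" shares and compares the trader's net profit to $b$ times the log-score difference (the parameter values are arbitrary):
import numpy as np

b = 100.0  # LMSR liquidity parameter

def cost(q1: float, q2: float) -> float:
    return b * np.log(np.exp(q1 / b) + np.exp(q2 / b))

def price(q1: float, q2: float) -> float:
    return np.exp(q1 / b) / (np.exp(q1 / b) + np.exp(q2 / b))

q1, q2 = 0.0, 0.0
p_old, p_new = price(q1, q2), 0.8   # move the market from 0.5 to 0.8
# A price of p_new corresponds to q1 - q2 = b * ln(p_new / (1 - p_new))
delta = b * np.log(p_new / (1 - p_new)) - (q1 - q2)

trade_cost = cost(q1 + delta, q2) - cost(q1, q2)
profit_if_yes = delta * 1.0 - trade_cost   # each "yes" share pays $1 when y = 1
log_score_diff = b * (np.log(p_new) - np.log(p_old))

print(round(profit_if_yes, 4), round(log_score_diff, 4))  # the two values agree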
Quadratic Score to Quadratic Market Maker
The same construction works with the Brier (quadratic) score. Using $S(p, y) = -(p-y)^2$, the payoff for moving from $p_{\text{old}}$ to $p_{\text{new}}$ when $y = 1$ is:
$$-(p_{\text{new}} - 1)^2 + (p_{\text{old}} - 1)^2$$
This generates a different automated market maker -- one with a quadratic cost function. This market maker has different properties than LMSR:
- The cost function is a quadratic polynomial rather than an exponential/log function
- Prices are linear functions of shares outstanding
- The maximum loss is bounded by a simpler expression
Summary Table: Scoring Rules to Market Makers
| Scoring Rule | Market Maker | Cost Function | Price Function |
|---|---|---|---|
| Logarithmic | LMSR | $b \ln(\sum e^{q_i/b})$ | Softmax / logistic |
| Quadratic (Brier) | Quadratic AMM | Quadratic polynomial | Linear |
| Spherical | Spherical AMM | $b\|\mathbf{q}/b\|_2$ | Normalized linear |
Why This Matters
This connection between scoring rules and market makers is profound for several reasons:
- Design principle. To create a new market maker, start with a proper scoring rule and apply Hanson's construction. The properness of the scoring rule guarantees that the market maker correctly incentivizes information revelation.
- Loss bounds. The range of the scoring rule determines the maximum loss of the market maker (the subsidy required to run the market). Log score has unbounded range, so LMSR's maximum loss depends on the liquidity parameter $b$.
- Behavioral properties. The sensitivity properties of the scoring rule translate directly into the market maker's behavior. The log score's sensitivity to extreme probabilities makes LMSR more responsive near 0 and 1.
- Unifying framework. This connection shows that prediction markets and scoring-rule-based forecasting platforms are not fundamentally different -- they are two expressions of the same mathematical structure.
9.7 Comparing Scoring Rules
Now that we understand the three main scoring rules, let us compare them systematically.
Comparison Table
| Property | Brier Score | Log Score | Spherical Score |
|---|---|---|---|
| Formula | $(p-y)^2$ | $y\ln p + (1-y)\ln(1-p)$ | $\frac{p_y}{\|\mathbf{p}\|_2}$ |
| Range | [0, 1] | $(-\infty, 0]$ | $(0, 1]$ |
| Best score | 0 | 0 | 1 |
| Worst score | 1 | $-\infty$ | ~0 |
| Direction | Lower is better | Higher is better | Higher is better |
| Proper? | Strictly proper | Strictly proper | Strictly proper |
| Bounded? | Yes | No | Yes |
| Decomposable? | Yes (cal+res+unc) | Partial | No |
| Local? | No | Yes | No |
| Penalty at extremes | Moderate | Severe (infinite) | Strong but finite |
| Sensitivity near 0.5 | Low | Low | Low |
| Sensitivity near 0/1 | Low | Very high | High |
| Information-theoretic? | No | Yes (KL divergence) | No |
| Generates which AMM? | Quadratic AMM | LMSR | Spherical AMM |
| Common uses | Weather, sports | Markets, ML | Rare / theoretical |
Sensitivity Analysis
A crucial difference between scoring rules is how they respond to changes in probability at different levels. The sensitivity of a scoring rule tells us how much the expected score changes when the forecast moves slightly.
For a forecaster with true belief $q$, the sensitivity is proportional to the second derivative of the expected score at $p = q$. For the three rules:
- Brier: Sensitivity $= 2$ (constant everywhere)
- Log: Sensitivity $= \frac{1}{q(1-q)}$ (high near 0 and 1, low near 0.5)
- Spherical: Sensitivity varies, between Brier and log
The log score's increasing sensitivity near extreme probabilities means it provides the strongest incentive for forecasters to get the "last few percentage points" right. If you think an event has a 97% chance of occurring, the log score rewards you significantly for reporting 97% rather than 95%, whereas the Brier score barely notices the difference.
This has practical implications:
- For events where distinguishing 90% from 95% from 99% matters, use the log score
- For events where overall calibration is more important, the Brier score works well
- When you need bounded scores with moderate sensitivity, consider the spherical score
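To quantify the claim above about 95% versus 97%, we can compare the expected cost of the misreport under each rule (a small sketch; the true probability of 0.97 is illustrative):
import numpy as np

q = 0.97   # true probability of the event

def expected_log(p):    # higher is better
    return q * np.log(p) + (1 - q) * np.log(1 - p)

def expected_brier(p):  # lower is better
    return q * (p - 1) ** 2 + (1 - q) * p ** 2

# Expected score sacrificed by reporting 0.95 instead of the true 0.97
print(f"Log score gap:   {expected_log(q) - expected_log(0.95):.5f}")      # ~0.00488
print(f"Brier score gap: {expected_brier(0.95) - expected_brier(q):.5f}")  # ~0.00040
With these numbers, the log score penalizes the two-point misreport more than ten times as heavily as the Brier score does.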
Python Comparison Visualization
import numpy as np
import matplotlib.pyplot as plt
def compare_scoring_rules():
"""Visualize and compare the three main scoring rules."""
p = np.linspace(0.01, 0.99, 500)
# Scores when outcome = 1
brier_y1 = (p - 1) ** 2
log_y1 = np.log(p)
sph_y1 = p / np.sqrt(p**2 + (1-p)**2)
# Scores when outcome = 0
brier_y0 = p ** 2
log_y0 = np.log(1 - p)
sph_y0 = (1-p) / np.sqrt(p**2 + (1-p)**2)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Plot for y = 1
axes[0].plot(p, -brier_y1, label='Brier (negated)', linewidth=2)
axes[0].plot(p, log_y1, label='Log', linewidth=2)
axes[0].plot(p, sph_y1, label='Spherical', linewidth=2)
axes[0].set_xlabel('Forecast probability p')
axes[0].set_ylabel('Score')
axes[0].set_title('Scores when event occurs (y=1)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Plot for y = 0
axes[1].plot(p, -brier_y0, label='Brier (negated)', linewidth=2)
axes[1].plot(p, log_y0, label='Log', linewidth=2)
axes[1].plot(p, sph_y0, label='Spherical', linewidth=2)
axes[1].set_xlabel('Forecast probability p')
axes[1].set_ylabel('Score')
axes[1].set_title('Scores when event does NOT occur (y=0)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('scoring_rules_comparison.png', dpi=150)
plt.show()
compare_scoring_rules()
9.8 Scoring Rules for Multiple Outcomes
So far we have focused on binary events (yes/no). But many forecasting problems involve multiple outcomes. What is the probability of each candidate winning an election? What is the probability that a stock price falls in each of several ranges?
Extending to $n$ Outcomes
For $n$ mutually exclusive and exhaustive outcomes, the forecaster provides a probability vector $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ where $\sum_{i=1}^n p_i = 1$. If outcome $j$ occurs, the scoring rules generalize as follows:
Multiclass Brier Score:
$$\text{BS}(\mathbf{p}, j) = \sum_{i=1}^{n} (p_i - \mathbf{1}_{i=j})^2 = -2p_j + \sum_{i=1}^{n} p_i^2 + 1$$
where $\mathbf{1}_{i=j}$ is 1 when $i = j$ and 0 otherwise. This is just the sum of squared differences between the forecast vector and the one-hot outcome vector. Lower is better, with range $[0, 2]$.
Multiclass Log Score:
$$\text{LS}(\mathbf{p}, j) = \ln(p_j)$$
Simply the log of the probability assigned to the actual outcome. This is the same formula as for binary events but applied to one of $n$ outcomes. This is why the log score is called "local" -- it only depends on the probability assigned to the outcome that occurred.
Multiclass Spherical Score:
$$\text{SS}(\mathbf{p}, j) = \frac{p_j}{\sqrt{\sum_{i=1}^n p_i^2}}$$
The Ranked Probability Score (RPS)
When outcomes have a natural ordering (e.g., temperature ranges, number of goals in a soccer match), we want a scoring rule that penalizes forecasts more for being "further away" from the correct outcome. The Ranked Probability Score (RPS) does this:
$$\text{RPS}(\mathbf{p}, j) = \frac{1}{n-1}\sum_{k=1}^{n-1}\left(\sum_{i=1}^{k} p_i - \sum_{i=1}^{k}\mathbf{1}_{i = j}\right)^2$$
Here $\sum_{i=1}^{k}\mathbf{1}_{i=j}$ equals 1 once $k \geq j$ and 0 before -- it is the cumulative distribution of the realized outcome.
The RPS compares cumulative distributions rather than individual probabilities. It penalizes a forecast of "definitely outcome 1" when outcome 3 occurs more than it penalizes a forecast of "definitely outcome 2," because outcome 2 is closer to outcome 3 in the ordering.
CRPS for Continuous Outcomes
When the outcome is a continuous variable (e.g., temperature, stock price), we need the Continuous Ranked Probability Score (CRPS):
$$\text{CRPS}(F, x) = \int_{-\infty}^{\infty} [F(z) - \mathbf{1}(z \geq x)]^2 \, dz$$
where $F$ is the forecaster's cumulative distribution function and $x$ is the realized outcome.
The CRPS generalizes the Brier score to continuous variables and the absolute error to probabilistic forecasts. It is measured in the same units as the outcome variable.
Key properties of CRPS:
- It is a strictly proper scoring rule for continuous distributions
- It reduces to the absolute error when the forecast is a point forecast (delta function)
- It reduces to the Brier score when applied to a binary event
- It penalizes both miscalibration and lack of sharpness (overly wide distributions)
Python Implementations
import numpy as np
def multiclass_brier_score(probs: np.ndarray, outcome_index: int) -> float:
"""
Brier score for multi-outcome forecasts.
Parameters
----------
probs : np.ndarray
Probability vector of length n, summing to 1.
outcome_index : int
Index of the outcome that occurred.
Returns
-------
float
Multi-class Brier score. Lower is better.
"""
n = len(probs)
one_hot = np.zeros(n)
one_hot[outcome_index] = 1
return np.sum((probs - one_hot) ** 2)
def ranked_probability_score(probs: np.ndarray, outcome_index: int) -> float:
"""
Ranked Probability Score for ordered multi-outcome forecasts.
Parameters
----------
probs : np.ndarray
Probability vector for ordered outcomes.
outcome_index : int
Index of the outcome that occurred.
Returns
-------
float
RPS. Lower is better.
"""
n = len(probs)
cum_forecast = np.cumsum(probs)
cum_outcome = np.zeros(n)
cum_outcome[outcome_index:] = 1.0
return np.sum((cum_forecast - cum_outcome) ** 2) / (n - 1)
def crps_empirical(forecast_samples: np.ndarray, observation: float) -> float:
"""
CRPS estimated from forecast samples (ensemble CRPS).
Parameters
----------
forecast_samples : np.ndarray
Array of samples from the forecast distribution.
observation : float
The observed value.
Returns
-------
float
Estimated CRPS. Lower is better.
"""
    n = len(forecast_samples)
    # CRPS = E|X - y| - 0.5 * E|X - X'|
    term1 = np.mean(np.abs(forecast_samples - observation))
    # O(n log n) computation of E|X - X'| via sorted samples:
    # sum_{i,j} |x_i - x_j| = 2 * sum_i (2i - n + 1) * x_(i)  (0-indexed)
    sorted_samples = np.sort(forecast_samples)
    coefficients = 2.0 * np.arange(n) - n + 1
    term2 = 2.0 * np.sum(coefficients * sorted_samples) / (n * n)
    return term1 - 0.5 * term2
# Example: 3-outcome election
probs = np.array([0.5, 0.3, 0.2])
outcome = 0 # Candidate A wins
print(f"Multiclass Brier: {multiclass_brier_score(probs, outcome):.4f}")
print(f"RPS: {ranked_probability_score(probs, outcome):.4f}")
print(f"Log score: {np.log(probs[outcome]):.4f}")
9.9 Practical Scoring and Evaluation
How Platforms Score Forecasters
Several major forecasting platforms use scoring rules to evaluate and rank their users. Understanding their approaches illuminates the practical considerations of scoring rule design.
Metaculus uses a system based on the log score. Each question contributes to a forecaster's track record, with scores computed as the log score relative to the community median. This means forecasters are rewarded for doing better than the crowd, not just for making accurate forecasts. Metaculus also provides calibration plots showing whether forecasters' stated probabilities match observed frequencies.
Good Judgment Open uses a variant of the Brier score. Forecasters receive daily Brier scores on open questions, and these are averaged over the question's lifetime. The platform emphasizes both accuracy and timeliness -- updating your forecast early when new information arrives is rewarded.
PredictIt and Polymarket use market-based scoring through actual monetary payoffs. While they do not directly use scoring rules, the market structure (which is built on proper scoring principles, as we saw in Section 9.6) achieves the same effect.
Designing a Tournament Scoring System
When designing a forecasting tournament, several decisions affect the scoring system:
1. Choice of scoring rule:
- Brier score: Easy to explain, bounded, decomposable -- good for educational settings
- Log score: Stronger incentives for precise probabilities -- good for expert tournaments
- Relative scoring (vs. crowd): Rewards beating the consensus -- good for competitive settings
2. Aggregation across questions:
- Simple average: Each question has equal weight
- Weighted average: More important or more difficult questions count more
- Geometric mean (for log scores): Multiplicative aggregation is natural for log scores
3. Timing and updates:
- Score at each update: Rewards early movers but penalizes experimentation
- Score at close only: Simpler but rewards waiting and free-riding on others' information
- Time-weighted scoring: Discounts early forecasts less, rewarding consistent accuracy
4. Handling missing forecasts:
- If a forecaster skips a question, assign the prior (e.g., community median or base rate)
- Or exclude missing questions from their average (but this allows cherry-picking easy questions)
Python Tournament Scorer
import numpy as np
from dataclasses import dataclass
@dataclass
class Forecast:
forecaster_id: str
question_id: int
probability: float
@dataclass
class Question:
question_id: int
outcome: int # 0 or 1
weight: float = 1.0
def score_tournament(
forecasts: list,
questions: list,
scoring_rule: str = "brier",
default_prob: float = 0.5
) -> dict:
"""
Score a forecasting tournament.
Parameters
----------
forecasts : list of Forecast
All forecasts submitted.
questions : list of Question
All questions with outcomes.
scoring_rule : str
"brier" or "log".
default_prob : float
Default probability for missing forecasts.
Returns
-------
dict
Mapping from forecaster_id to their average score.
"""
# Organize forecasts by forecaster and question
forecast_map = {}
for f in forecasts:
forecast_map[(f.forecaster_id, f.question_id)] = f.probability
# Get all forecaster IDs
forecaster_ids = list(set(f.forecaster_id for f in forecasts))
# Score each forecaster
results = {}
for fid in forecaster_ids:
total_score = 0.0
total_weight = 0.0
for q in questions:
p = forecast_map.get((fid, q.question_id), default_prob)
if scoring_rule == "brier":
score = (p - q.outcome) ** 2
elif scoring_rule == "log":
eps = 1e-15
p_clipped = np.clip(p, eps, 1 - eps)
score = -(q.outcome * np.log(p_clipped) +
(1 - q.outcome) * np.log(1 - p_clipped))
else:
raise ValueError(f"Unknown scoring rule: {scoring_rule}")
total_score += score * q.weight
total_weight += q.weight
results[fid] = total_score / total_weight
return results
# Example tournament
questions = [Question(i, np.random.randint(0, 2)) for i in range(20)]
forecasts = []
for fid in ["Alice", "Bob", "Charlie"]:
for q in questions:
# Simulate different forecaster behaviors
if fid == "Alice":
p = np.clip(q.outcome + np.random.normal(0, 0.2), 0.01, 0.99)
elif fid == "Bob":
p = np.clip(q.outcome + np.random.normal(0, 0.35), 0.01, 0.99)
else:
p = 0.5 # Charlie always says 50/50
forecasts.append(Forecast(fid, q.question_id, p))
scores = score_tournament(forecasts, questions, scoring_rule="brier")
print("Brier Score Leaderboard (lower is better):")
for fid, score in sorted(scores.items(), key=lambda x: x[1]):
print(f" {fid}: {score:.4f}")
9.10 Scoring Rule Manipulation and Edge Cases
Hedging Across Questions
In a multi-question tournament, a forecaster might try to hedge -- intentionally misreporting on one question to reduce risk across a portfolio of questions. For example, if Alice's bonus depends on her total score across 100 questions, she might report less extreme probabilities to reduce variance in her overall score.
With a strictly proper scoring rule, this strategy is suboptimal on each individual question. However, it can reduce the variance of the total score at the cost of a small reduction in expected score. A risk-averse forecaster might accept this tradeoff.
The solution is to make the reward function linear in the score:
$$\text{Reward} = a + b \times \text{Score}$$
If $a$ and $b$ are fixed constants (with $b > 0$ for positively oriented rules), then maximizing expected reward is equivalent to maximizing expected score, and properness is preserved regardless of risk aversion.
However, if the reward function is nonlinear (e.g., "top scorer wins a prize"), hedging and strategic behavior become rational even with proper scoring rules. Tournament designers should be aware of this.
Probability Near 0 and 1
Extreme probability reports create special challenges:
Log score at the boundary: The log score goes to $-\infty$ as $p \to 0$ or $p \to 1$ for the wrong outcome. In practice, we need to handle this:
import numpy as np

def safe_log_score(p: float, y: int, min_prob: float = 0.001) -> float:
"""Log score with probability clamping for numerical safety."""
p = max(min_prob, min(1 - min_prob, p))
return np.log(p) if y == 1 else np.log(1 - p)
Many platforms clamp probabilities to $[0.01, 0.99]$ or $[0.001, 0.999]$ to prevent infinite scores. This slightly breaks strict properness but is a necessary practical compromise.
Brier score at the boundary: The Brier score is bounded $[0, 1]$ and does not have a singularity, so it is naturally more robust at the boundaries. However, this also means it provides weaker incentives for getting extreme probabilities right.
Numerical Stability
When computing scores across many forecasts, floating-point arithmetic can cause issues:
# Bad: multiplying many probabilities and taking one log -- the product underflows to 0
likelihood = np.prod([p if y == 1 else 1 - p for p, y in zip(forecasts, outcomes)])
total_log_score = np.log(likelihood)

# Better: stay in log space throughout, summing the individual log scores
log_probs = [np.log(p) if y == 1 else np.log(1 - p)
             for p, y in zip(forecasts, outcomes)]
mean_score = np.mean(log_probs)
For very small probabilities, working in log space throughout the computation avoids underflow.
Score Manipulation in Practice
Beyond hedging, forecasters might attempt other manipulations:
- Cherry-picking questions: Only forecasting on "easy" questions where the base rate is extreme. Solution: require forecasts on all questions, or assign the prior for missing forecasts.
- Extremization: Reporting 0 or 1 on questions where you are fairly confident, hoping to get lucky. With the log score, this is extremely risky (infinite penalty). With the Brier score, the maximum loss is only 1, making this a more viable (if still suboptimal) strategy.
- Timing manipulation: Waiting until other forecasters have revealed information before forecasting. Solution: time-weighted scoring that rewards earlier forecasts.
- Collusion: Multiple forecasters coordinating to distort the aggregate forecast. This is harder to address with scoring rules alone and requires institutional safeguards.
9.11 Advanced: Weighted Scoring Rules and Asymmetric Loss
Motivation
Standard scoring rules treat all errors symmetrically: overestimating the probability of rain by 20% is penalized the same as underestimating by 20%. But in many applications, the costs of different types of errors are not symmetric:
- A hospital predicting disease outbreaks cares more about false negatives (missing an outbreak) than false positives
- A trader cares more about predicting market crashes than about predicting calm markets
- An intelligence analyst may face higher costs for missing a rare threat than for a false alarm
Weighted scoring rules allow us to assign different importance to different outcomes or regions of probability.
Weighted Brier Score
A simple approach is to weight the two outcomes differently:
$$\text{WBS}(p, y) = w_1 y(p-1)^2 + w_0 (1-y)p^2$$
where $w_1$ is the weight for the event occurring and $w_0$ is the weight for the event not occurring.
Unfortunately, naive weighting breaks properness. The forecaster's optimal report is no longer their true belief -- it is skewed toward the more heavily weighted outcome. To maintain properness, we need a more careful construction.
Proper Weighted Scoring Rules
One way to create a proper weighted scoring rule is through the reweighted probability approach. Instead of weighting the score directly, we transform the probability space:
Let $\phi$ be a weight function. The weighted log score is:
$$S_\phi(p, y) = \phi(y) \cdot \ln\left(\frac{p_y}{\sum_i p_i}\right)$$
where $\phi(y)$ depends on the outcome. However, this is proper only under specific conditions on $\phi$.
A more general approach, due to Ehm et al. (2016), defines proper weighted scoring rules through the mixture representation:
$$S(p, y) = \int_0^1 S_\theta(p, y) \, w(\theta) \, d\theta$$
where $S_\theta$ is an "elementary" proper scoring rule at threshold $\theta$ and $w(\theta)$ is a non-negative weight function. By choosing $w$ to emphasize certain probability thresholds, you can create a proper scoring rule that is more sensitive in the regions you care about.
Asymmetric Scoring in Practice
For practical applications, here are several approaches:
1. Threshold-based weighting:
def asymmetric_brier(p: float, y: int,
threshold: float = 0.5,
weight_above: float = 2.0,
weight_below: float = 1.0) -> float:
"""
Asymmetric Brier-like score that penalizes errors
differently above and below a threshold.
Note: This is NOT proper in general. Use with caution.
"""
error = (p - y) ** 2
if (y == 1 and p < threshold) or (y == 0 and p > threshold):
return error * weight_above # Heavier penalty for "bad" errors
else:
return error * weight_below
2. Multiple scoring rules combined:
A pragmatic approach is to evaluate forecasters on multiple proper scoring rules and combine them:
$$\text{Combined} = \alpha \cdot \text{BS}_{\text{normalized}} + (1-\alpha) \cdot \text{LS}_{\text{normalized}}$$
This preserves properness (a convex combination of proper scoring rules is proper).
3. Class-weighted proper scoring:
For multi-class problems, we can maintain properness by weighting questions rather than outcomes:
$$\text{Score} = \frac{1}{\sum_i w_i} \sum_{i} w_i \cdot S(p_i, y_i)$$
If questions with outcome $y = 1$ receive higher weight $w_i$, the scoring system effectively values correct forecasts on positive events more highly, while each individual question is still scored with a proper rule.
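A minimal sketch of this question-weighted scheme, with fixed per-question weights chosen in advance (all numbers below are illustrative):
import numpy as np

forecasts = np.array([0.9, 0.2, 0.7, 0.4])
outcomes = np.array([1, 0, 1, 0])
weights = np.array([2.0, 1.0, 2.0, 1.0])    # per-question importance, set before resolution

per_question = (forecasts - outcomes) ** 2  # each question scored with the proper Brier rule
weighted_score = np.sum(weights * per_question) / np.sum(weights)
print(f"Weighted mean Brier score: {weighted_score:.4f}")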
Custom Scoring Rules for Specific Applications
Prediction markets for rare events: When forecasting events with base rates below 1%, standard scoring rules provide little information because most of the score comes from the dominant "non-event" predictions. Solutions include:
- Scoring only relative to a baseline (e.g., the historical base rate)
- Using the log score, which provides more discrimination at extreme probabilities
- Conditioning on event subsets (only comparing forecast quality on questions where events actually occurred)
Time-series forecasting: When forecasts evolve over time (e.g., updating a probability daily until resolution), we might want to weight recent forecasts more heavily:
$$\text{Score} = \frac{\sum_{t} \delta^{T-t} S(p_t, y)}{\sum_t \delta^{T-t}}$$
where $\delta \in (0, 1]$ is a discount factor and $T$ is the resolution time. This rewards forecasters who have accurate beliefs closer to resolution, when information is more available and the forecast matters more for decision-making.
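A sketch of this discounted aggregation using the Brier score as the per-day rule (the daily forecasts and discount factor are made up):
import numpy as np

daily_forecasts = np.array([0.55, 0.6, 0.7, 0.85, 0.95])  # p_t for t = 1, ..., T
outcome = 1
delta = 0.9            # discount factor; smaller values emphasize late forecasts more
T = len(daily_forecasts)

daily_scores = (daily_forecasts - outcome) ** 2   # Brier score on each day
discounts = delta ** (T - np.arange(1, T + 1))    # delta^(T - t)
time_weighted = np.sum(discounts * daily_scores) / np.sum(discounts)
print(f"Time-weighted Brier score: {time_weighted:.4f}")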
9.12 Chapter Summary
In this chapter, we have covered the theory and practice of scoring rules -- the mathematical tools that make honest forecasting incentive-compatible.
Key concepts:
- Scoring rules assign numerical scores to probabilistic forecasts based on outcomes, enabling systematic evaluation of forecasters.
- The Brier score $\text{BS} = (p - y)^2$ is the most intuitive scoring rule: bounded, decomposable into calibration, resolution, and uncertainty, and widely used in practice.
- The logarithmic score $\text{LS} = y\ln p + (1-y)\ln(1-p)$ has the deepest theoretical foundations: connected to information theory, KL divergence, and maximum likelihood estimation. It is the most commonly used scoring rule in high-stakes forecasting.
- The spherical score provides a bounded alternative with stronger sensitivity than the Brier score, though it is used less often in practice.
- Properness is the crucial property: a proper scoring rule incentivizes honest reporting, meaning the forecaster's expected score is maximized when they report their true belief. Strict properness means this optimum is unique.
- The connection to market makers is deep and fundamental: every proper scoring rule generates a cost-function-based automated market maker through Hanson's sequential scoring construction. LMSR comes from the log score; the quadratic market maker comes from the Brier score.
- For multiple outcomes, scoring rules generalize naturally. The RPS is appropriate for ordered outcomes, and the CRPS extends to continuous variables.
- Practical considerations include numerical stability near extreme probabilities, hedging in multi-question settings, and the tradeoffs in tournament design.
- Weighted and asymmetric scoring rules address situations where different types of errors have different costs, though maintaining properness under weighting requires care.
What's Next
In Chapter 10, we will build on the foundations from this chapter and Chapter 8 to explore information aggregation -- how prediction markets combine the private information of many traders into accurate aggregate forecasts. We will see how the proper incentive structure guaranteed by scoring rules ensures that market prices converge to reflect the collective wisdom of participants, and examine conditions under which markets succeed or fail at information aggregation.
The scoring rules from this chapter will also reappear in Part III when we evaluate the performance of trading strategies against benchmark forecasters, and in Part IV when we design and backtest prediction market systems.
Key Equations Reference
| Scoring Rule | Formula | Range | Direction |
|---|---|---|---|
| Brier | $(p - y)^2$ | $[0, 1]$ | Lower = better |
| Log | $y\ln p + (1-y)\ln(1-p)$ | $(-\infty, 0]$ | Higher = better |
| Spherical | $\frac{p_y}{\|\mathbf{p}\|_2}$ | $(0, 1]$ | Higher = better |
| RPS | $\frac{1}{n-1}\sum_k (F_k - O_k)^2$ | $[0, 1]$ | Lower = better |
| CRPS | $\int [F(z) - \mathbf{1}(z \geq x)]^2 dz$ | $[0, \infty)$ | Lower = better |
| Properness Condition | Meaning |
|---|---|
| Proper | $\mathbb{E}_q[S(q,y)] \geq \mathbb{E}_q[S(p,y)]$ for all $p$ |
| Strictly proper | $\mathbb{E}_q[S(q,y)] > \mathbb{E}_q[S(p,y)]$ for all $p \neq q$ |