Chapter 9 Exercises: Scoring Rules and Proper Incentives
These exercises range from basic score calculations to advanced topics like proving properness, implementing scoring systems, and analyzing forecaster performance. They are organized into four sections of increasing difficulty.
Section A: Computation and Basics (Exercises 1-10)
Exercise 1: Computing Brier Scores
A meteorologist provides the following rain forecasts over five days:
| Day | Forecast (p) | Rained? (y) |
|---|---|---|
| Mon | 0.80 | 1 |
| Tue | 0.60 | 0 |
| Wed | 0.30 | 0 |
| Thu | 0.90 | 1 |
| Fri | 0.50 | 1 |
(a) Calculate the Brier score for each day. (b) Calculate the mean Brier score across all five days. (c) How does this compare to a naive forecaster who always predicts p = 0.50?
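A quick way to check your hand calculations for parts (a) and (b), using the convention $\text{BS}(p, y) = (p - y)^2$:

```python
# Check for Exercise 1, using the convention BS(p, y) = (p - y)^2.
forecasts = [0.80, 0.60, 0.30, 0.90, 0.50]
outcomes = [1, 0, 0, 1, 1]

daily = [(p - y) ** 2 for p, y in zip(forecasts, outcomes)]
print("Daily Brier scores:", [round(s, 4) for s in daily])
print("Mean Brier score:", sum(daily) / len(daily))
# Part (c): a constant p = 0.50 forecast scores (0.5 - y)^2 = 0.25 every day.
```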
Exercise 2: Computing Log Scores
Using the same forecasts from Exercise 1:
(a) Calculate the log score for each day. (b) Calculate the mean log score across all five days. (c) Which day had the worst log score? Why? (d) Compare the ranking of days by Brier score vs. log score. Do they agree?
Exercise 3: Computing Spherical Scores
Using the same forecasts from Exercise 1:
(a) Calculate the spherical score for each day. (b) Calculate the mean spherical score. (c) Compare the day-by-day rankings across all three scoring rules (Brier, log, spherical). Where do they differ?
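A sketch for checking Exercises 2 and 3, assuming the natural-logarithm convention for the log score (if your course uses $\log_2$, rescale accordingly):

```python
import math

forecasts = [0.80, 0.60, 0.30, 0.90, 0.50]
outcomes = [1, 0, 0, 1, 1]

def log_score(p, y):
    # Natural log of the probability assigned to the realized outcome.
    return math.log(p if y == 1 else 1 - p)

def spherical_score(p, y):
    # Probability of the realized outcome divided by the L2 norm
    # of the forecast vector (p, 1 - p).
    norm = math.sqrt(p ** 2 + (1 - p) ** 2)
    return (p if y == 1 else 1 - p) / norm

for name, rule in [("log", log_score), ("spherical", spherical_score)]:
    scores = [rule(p, y) for p, y in zip(forecasts, outcomes)]
    print(name, [round(s, 4) for s in scores], "mean:", sum(scores) / len(scores))
```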
Exercise 4: Three-Outcome Brier Score
A political forecaster assigns the following probabilities to an election outcome:
- Candidate A wins: 0.55
- Candidate B wins: 0.35
- Candidate C wins: 0.10
Candidate B wins.
(a) Calculate the multiclass Brier score. (b) Calculate the log score. (c) Calculate the spherical score. (d) What would the scores be for a "uniform" forecaster who assigns 1/3 to each candidate?
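A minimal check, assuming the sum-over-all-outcomes form of the multiclass Brier score, $\sum_k (p_k - y_k)^2$:

```python
import math

probs = [0.55, 0.35, 0.10]  # Candidates A, B, C
winner = 1                  # Candidate B wins

# Multiclass Brier: sum over outcomes of (p_k - y_k)^2.
brier = sum((p - (1 if k == winner else 0)) ** 2 for k, p in enumerate(probs))
log_s = math.log(probs[winner])
spherical = probs[winner] / math.sqrt(sum(p ** 2 for p in probs))
print(brier, log_s, spherical)
# Part (d): repeat with probs = [1/3, 1/3, 1/3].
```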
Exercise 5: Brier Skill Score
A forecaster makes predictions on 100 binary events. Their mean Brier score is 0.18. The base rate of the events is 30% (30 events occurred, 70 did not).
(a) What is the Brier score of a "climatological" forecaster who always predicts p = 0.30? (b) Calculate the Brier Skill Score relative to the climatological forecaster. (c) Calculate the Brier Skill Score relative to a coin-flip forecaster (p = 0.50). (d) Interpret each skill score in words.
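The Brier Skill Score is $\text{BSS} = 1 - \text{BS}/\text{BS}_{\text{ref}}$; a sketch for checking parts (a)-(c):

```python
bs_forecaster = 0.18
base_rate = 0.30

# A constant forecast p scores (p - 1)^2 on the events that occur
# and p^2 on those that do not; weight the two cases by the base rate.
def constant_forecast_bs(p, base_rate):
    return base_rate * (p - 1) ** 2 + (1 - base_rate) * p ** 2

for p in (0.30, 0.50):
    ref = constant_forecast_bs(p, base_rate)
    print(f"reference p={p}: BS={ref:.4f}, BSS={1 - bs_forecaster / ref:.4f}")
```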
Exercise 6: Score Sensitivity
Consider a binary event that you believe has a 90% chance of occurring.
(a) Fill in this table showing each score if y = 1 and if y = 0:
| Scoring Rule | Score if y=1 | Score if y=0 |
|---|---|---|
| Brier | | |
| Log | | |
| Spherical | | |
(b) Now do the same for a forecast of p = 0.99. (c) Compute the "improvement" in expected score (using your true belief of 0.90) from reporting 0.90 vs 0.99 for each scoring rule. Which rule rewards the distinction the most?
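A sketch for part (c). The Brier score is negated below so that higher is better for all three rules; note that raw expected-score differences are not directly comparable across rules without normalization, since the rules live on different scales:

```python
import math

q = 0.90  # true belief

def expected(rule, p, q):
    # Expected score of reporting p when the event occurs with probability q.
    return q * rule(p, 1) + (1 - q) * rule(p, 0)

rules = {
    "brier (negated)": lambda p, y: -(p - y) ** 2,
    "log": lambda p, y: math.log(p if y == 1 else 1 - p),
    "spherical": lambda p, y: (p if y == 1 else 1 - p) / math.sqrt(p**2 + (1 - p)**2),
}

for name, rule in rules.items():
    gain = expected(rule, 0.90, q) - expected(rule, 0.99, q)
    print(f"{name}: expected gain from honest 0.90 over 0.99 = {gain:.4f}")
```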
Exercise 7: Comparing Two Forecasters
Two weather forecasters predict the probability of rain for 10 days:
| Day | Alice | Bob | Rain? |
|---|---|---|---|
| 1 | 0.80 | 0.60 | 1 |
| 2 | 0.20 | 0.40 | 0 |
| 3 | 0.70 | 0.50 | 1 |
| 4 | 0.30 | 0.50 | 0 |
| 5 | 0.90 | 0.75 | 1 |
| 6 | 0.10 | 0.25 | 0 |
| 7 | 0.60 | 0.55 | 1 |
| 8 | 0.50 | 0.45 | 0 |
| 9 | 0.75 | 0.65 | 1 |
| 10 | 0.40 | 0.50 | 1 |
(a) Compute the mean Brier score for each forecaster. (b) Compute the mean log score for each forecaster. (c) Who is the better forecaster according to each rule? (d) Alice uses more extreme probabilities. How does this affect her scores relative to Bob?
Exercise 8: The Linear Score Is Improper
Consider the linear scoring rule: $S(p, y) = py + (1-p)(1-y)$.
(a) Compute the expected score $\mathbb{E}_q[S(p, y)]$ as a function of $p$ and the true belief $q$. (b) Find the report $p^*$ that maximizes the expected score for $q = 0.7$. (c) Verify that $p^* \neq 0.7$, confirming the rule is improper. (d) What does a forecaster using this rule always report?
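A grid check makes the impropriety visible: the expected score is linear in $p$, so the maximum sits at an endpoint rather than at the true belief.

```python
# Expected linear score: E_q[S(p)] = q*p + (1-q)*(1-p), linear in p.
q = 0.7
for p in [0.0, 0.3, 0.5, 0.7, 1.0]:
    print(p, q * p + (1 - q) * (1 - p))
```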
Exercise 9: Ranked Probability Score
A soccer match has predicted goal distributions:
| Goals | Forecast A | Forecast B |
|---|---|---|
| 0 | 0.15 | 0.30 |
| 1 | 0.35 | 0.35 |
| 2 | 0.30 | 0.20 |
| 3+ | 0.20 | 0.15 |
The actual result is 2 goals.
(a) Calculate the RPS for Forecast A. (b) Calculate the RPS for Forecast B. (c) Calculate the multiclass Brier score for each forecast. (d) Why might the RPS and Brier score rank the forecasts differently? (Hint: think about the ordering of outcomes.)
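A sketch of the RPS, assuming the convention that divides by $K - 1$ (some texts omit this normalization; it does not affect the ranking):

```python
# RPS: mean squared difference between the forecast CDF and the outcome CDF.
from itertools import accumulate

def rps(probs, outcome_index):
    cdf = list(accumulate(probs))
    obs_cdf = [1.0 if k >= outcome_index else 0.0 for k in range(len(probs))]
    return sum((f - o) ** 2 for f, o in zip(cdf, obs_cdf)) / (len(probs) - 1)

print(rps([0.15, 0.35, 0.30, 0.20], 2))  # Forecast A, result = 2 goals
print(rps([0.30, 0.35, 0.20, 0.15], 2))  # Forecast B
```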
Exercise 10: Score Ranges and Baselines
(a) What is the worst possible Brier score? When does it occur? (b) What is the Brier score of a forecaster who always says p = 0.5? How does this depend on the base rate? (c) What is the worst possible log score? Is it finite? (d) What is the log score of a forecaster who always says p = 0.5? (e) For a base rate of 40%, what is the "break-even" Brier score -- the score that matches the base-rate forecaster?
Section B: Properness and Theory (Exercises 11-18)
Exercise 11: Verifying Properness of the Brier Score
(a) Write the expected Brier score $\mathbb{E}_q[\text{BS}(p)]$ as a function of $p$ and $q$. (b) Take the derivative with respect to $p$ and find the optimal report $p^*$. (c) Verify that $p^* = q$ and that the second-order condition confirms a minimum. (d) Calculate $\mathbb{E}_q[\text{BS}(p)] - \mathbb{E}_q[\text{BS}(q)]$ and show it equals $(p - q)^2$.
Exercise 12: Verifying Properness of the Log Score
(a) Write the expected log score $\mathbb{E}_q[\text{LS}(p)]$. (b) Take the derivative with respect to $p$ and find the optimal report. (c) Verify strict properness. (d) Show that the "regret" $\mathbb{E}_q[\text{LS}(q)] - \mathbb{E}_q[\text{LS}(p)]$ equals the KL divergence $D_{\text{KL}}(q \| p)$.
Exercise 13: Verifying Properness of the Spherical Score
(a) Write the expected spherical score for a binary event: $\mathbb{E}_q[\text{SS}(p)]$. (b) Take the derivative with respect to $p$ and show that the optimum is at $p = q$. (c) Argue that the spherical score is strictly proper.
Exercise 14: Constructing Proper Scoring Rules
The characterization theorem (the Savage representation) states that every strictly proper, positively oriented scoring rule for binary events can be written as:
$$S(p, y) = G(p) + G'(p)(y - p)$$
for some strictly convex, differentiable function $G$, where $G(q)$ is the expected score under truthful reporting.
(a) Let $G(p) = -p(1-p) = p^2 - p$. Derive the corresponding scoring rule. What well-known rule do you get? (b) Let $G(p) = p \ln p + (1-p)\ln(1-p)$ (the negative entropy, which is convex). Derive the corresponding scoring rule. What rule is this? (c) Let $G(p) = p^3(1-p) + p(1-p)^3$. Is this convex? If so, what scoring rule does it generate?
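A symbolic sketch for part (a), assuming sympy is available; swapping in the $G$ from part (b) recovers the corresponding rule the same way.

```python
# Symbolic check of the construction S(p, y) = G(p) + G'(p)(y - p),
# using G(p) = p^2 - p from part (a) (strictly convex).
import sympy as sp

p, y = sp.symbols("p y")
G = p**2 - p
S = sp.expand(G + sp.diff(G, p) * (y - p))
print(S)                        # -p**2 + 2*p*y - y
print(sp.factor(S.subs(y, 1)))  # -(p - 1)**2
print(sp.factor(S.subs(y, 0)))  # -p**2
# For y in {0, 1} this is exactly -(p - y)^2: the Brier score as a reward.
```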
Exercise 15: Why Convex Combinations Preserve Properness
Let $S_1$ and $S_2$ be two proper scoring rules, and let $S_{\alpha} = \alpha S_1 + (1-\alpha)S_2$ for $\alpha \in [0, 1]$.
(a) Prove that $S_{\alpha}$ is also proper. (b) Is $S_{\alpha}$ strictly proper if both $S_1$ and $S_2$ are strictly proper? (c) Give a practical example of when you might want to combine scoring rules this way.
Exercise 16: The Absolute Score Is Improper
The absolute scoring rule is $S(p, y) = -|p - y|$.
(a) Show that the expected score is $\mathbb{E}_q[S(p)] = -q|p - 1| - (1-q)|p|$. (b) For $q = 0.7$, find the report $p^*$ that maximizes the expected score. (Hint: consider the derivative for $p \in (0,1)$.) (c) Show that the optimal report is always 0 or 1 (never the true belief for $q \notin \{0, 0.5, 1\}$). (d) Compare this to the linear score from Exercise 8. What do improper rules have in common?
Exercise 17: Sensitivity Functions
The sensitivity of a scoring rule at belief $q$ measures how much the expected score changes for small deviations from truthful reporting. Formally, for a positively oriented (higher-is-better) score it is $-\frac{d^2}{dp^2}\mathbb{E}_q[S(p)]\big|_{p=q}$, which is nonnegative because the expected score is maximized at $p = q$.
(a) Calculate the sensitivity function for the Brier score. Is it constant? (b) Calculate the sensitivity function for the log score. Where is it highest? (c) Calculate the sensitivity function for the spherical score. (d) Plot all three sensitivity functions on the same axes for $q \in [0.01, 0.99]$ using Python. Interpret the differences.
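A starter for part (d), estimating each sensitivity with a second-order finite difference rather than deriving it symbolically; the Brier score is negated so all three rules are positively oriented:

```python
import math
import numpy as np
import matplotlib.pyplot as plt

def expected(rule, p, q):
    return q * rule(p, 1) + (1 - q) * rule(p, 0)

rules = {
    "Brier (negated)": lambda p, y: -(p - y) ** 2,
    "Log": lambda p, y: math.log(p if y == 1 else 1 - p),
    "Spherical": lambda p, y: (p if y == 1 else 1 - p) / math.sqrt(p**2 + (1 - p)**2),
}

qs = np.linspace(0.01, 0.99, 99)
h = 1e-4
for name, rule in rules.items():
    # Second-order central difference of the expected score at p = q.
    sens = [-(expected(rule, q + h, q) - 2 * expected(rule, q, q)
              + expected(rule, q - h, q)) / h**2 for q in qs]
    plt.plot(qs, sens, label=name)
plt.yscale("log")
plt.xlabel("belief q"); plt.ylabel("sensitivity"); plt.legend(); plt.show()
```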
Exercise 18: Proper Scoring Rules and Market Makers
(a) Explain in your own words how Hanson's market scoring rule construction turns a proper scoring rule into a market maker. (b) If you use the Brier score $S(p, y) = -(p-y)^2$ in this construction, what is the payoff for a trader who moves the market from $p = 0.3$ to $p = 0.5$ when $y = 1$? (c) What is the maximum loss for the market maker (the "subsidy" needed) when using the Brier score? (d) Why is the LMSR (log score) more popular than the quadratic (Brier score) market maker in practice?
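In Hanson's construction, a trader who moves the report from $p_{\text{old}}$ to $p_{\text{new}}$ is paid $S(p_{\text{new}}, y) - S(p_{\text{old}}, y)$ at resolution. A sketch for part (b):

```python
# Market scoring rule payoff with the quadratic (Brier) reward.
def brier_reward(p, y):
    return -(p - y) ** 2

for y in (0, 1):
    payoff = brier_reward(0.5, y) - brier_reward(0.3, y)
    print(f"y={y}: trader payoff = {payoff:.2f}")
```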
Section C: Implementation (Exercises 19-25)
Exercise 19: Implement All Three Scoring Rules
Write a Python class ScoringRules that implements:

```python
class ScoringRules:
    @staticmethod
    def brier(forecast: float, outcome: int) -> float: ...

    @staticmethod
    def log(forecast: float, outcome: int, eps: float = 1e-15) -> float: ...

    @staticmethod
    def spherical(forecast: float, outcome: int) -> float: ...

    @staticmethod
    def score(forecast: float, outcome: int, rule: str = "brier") -> float:
        """Dispatch to the appropriate scoring rule by name."""
        ...
```
Test your implementation with the data from Exercise 1.
Exercise 20: Brier Score Decomposition
Write a function brier_decomposition(forecasts, outcomes, n_bins=10) that:
(a) Bins forecasts into n_bins equally-spaced bins
(b) Computes the calibration (reliability), resolution, and uncertainty components
(c) Returns a dictionary with all three components and the overall Brier score
(d) Verifies that BS = Calibration - Resolution + Uncertainty (the identity is exact only when all forecasts within a bin are identical or are replaced by their bin mean, so allow a small binning residual rather than demanding floating-point equality)
Test with simulated data: generate 1000 forecasts from a well-calibrated forecaster and 1000 from a poorly-calibrated one.
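One possible shape for the function, following the Murphy (1973) decomposition; note the binning caveat from part (d):

```python
import numpy as np

def brier_decomposition(forecasts, outcomes, n_bins=10):
    f = np.asarray(forecasts, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    bins = np.clip((f * n_bins).astype(int), 0, n_bins - 1)
    base_rate = y.mean()
    calibration = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()  # fraction of forecasts falling in this bin
        f_bar, y_bar = f[mask].mean(), y[mask].mean()
        calibration += w * (f_bar - y_bar) ** 2
        resolution += w * (y_bar - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return {"brier": ((f - y) ** 2).mean(), "calibration": calibration,
            "resolution": resolution, "uncertainty": uncertainty}
```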
Exercise 21: Calibration Plot
Write a function calibration_plot(forecasts, outcomes, n_bins=10) that:
(a) Bins forecasts into groups (b) For each bin, computes the average forecast and the observed frequency (c) Plots the calibration diagram (forecast vs. observed frequency) with a diagonal reference line (d) Adds confidence intervals using the binomial proportion confidence interval (e) Displays the calibration component of the Brier decomposition
Exercise 22: Scoring Rule Comparison Tool
Write a Python function that takes a list of (forecast, outcome) pairs and produces:
(a) Mean score under Brier, log, and spherical rules (b) A visualization showing the score for each prediction under all three rules (c) A scatter plot of Brier score vs. log score for each prediction, colored by whether the forecast was "correct" (p > 0.5 and y=1, or p < 0.5 and y=0)
Exercise 23: Tournament Scoring System
Build a complete tournament scoring system in Python that:
(a) Accepts forecasts from multiple forecasters on multiple questions (b) Scores each forecaster using a configurable scoring rule (c) Generates a leaderboard sorted by average score (d) Computes confidence intervals on each forecaster's ranking using bootstrap resampling (e) Identifies "significantly better than average" forecasters
Exercise 24: CRPS Implementation
Implement the Continuous Ranked Probability Score:
(a) Write a function crps_normal(mu, sigma, observation) that computes the analytical CRPS for a normal distribution forecast.
(b) Write a function crps_ensemble(samples, observation) that estimates CRPS from forecast samples.
(c) Test both implementations: generate 1000 samples from $N(0, 1)$, compute the ensemble CRPS for observation $x = 0.5$, and compare to the analytical CRPS.
(d) Plot CRPS as a function of the observation for a fixed $N(0, 1)$ forecast.
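A sketch of parts (a)-(c), using the closed form for a normal forecast (see, e.g., Gneiting and Raftery, 2007) and the standard sample estimator $\mathbb{E}|X - x| - \tfrac{1}{2}\mathbb{E}|X - X'|$; the sorted-sample identity below avoids the $O(n^2)$ pairwise loop:

```python
import numpy as np
from scipy.stats import norm

def crps_normal(mu, sigma, x):
    # CRPS = sigma * ( z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) ), z = (x - mu)/sigma.
    z = (x - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def crps_ensemble(samples, x):
    # Sample estimator: E|X - x| - 0.5 * E|X - X'|.
    s = np.sort(np.asarray(samples))
    n = len(s)
    term1 = np.mean(np.abs(s - x))
    # Mean pairwise |X - X'| via the sorted-sample identity.
    term2 = 2 * np.sum(s * (2 * np.arange(1, n + 1) - n - 1)) / n**2
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
print(crps_normal(0, 1, 0.5), crps_ensemble(samples, 0.5))
```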
Exercise 25: Multi-Outcome Scoring
Write Python functions for the multiclass versions of all three scoring rules:
(a) multiclass_brier(probs, outcome_index)
(b) multiclass_log(probs, outcome_index)
(c) multiclass_spherical(probs, outcome_index)
(d) ranked_probability_score(probs, outcome_index)
Test with a 5-outcome example and verify that all rules are proper by computing expected scores at the truth vs. at deviations from the truth.
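A sketch of the properness check in the final part, shown for the multiclass log score; the same loop works for the other rules:

```python
# Numeric properness check: expected score should be optimized at the truth.
import numpy as np

def multiclass_log(probs, k):
    return np.log(probs[k])

truth = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

def expected_score(report, truth, rule):
    return sum(truth[k] * rule(report, k) for k in range(len(truth)))

rng = np.random.default_rng(1)
best = expected_score(truth, truth, multiclass_log)
for _ in range(1000):
    deviation = rng.dirichlet(np.ones(5))  # random alternative report
    assert expected_score(deviation, truth, multiclass_log) <= best + 1e-12
print("truthful reporting maximizes expected log score on all sampled deviations")
```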
Section D: Analysis and Applications (Exercises 26-30)
Exercise 26: Forecaster Performance Analysis
Download (or simulate) a dataset of 500 binary predictions from 10 forecasters. For each forecaster:
(a) Compute their mean Brier score, mean log score, and mean spherical score (b) Decompose their Brier score into calibration, resolution, and uncertainty (c) Create a calibration plot (d) Rank them by each scoring rule. Do the rankings agree? If not, explain why. (e) Identify which forecasters are well-calibrated but lack resolution, and vice versa.
Exercise 27: Scoring Rule Selection
A company is designing a forecasting tournament for its employees. They need to choose a scoring rule. Write a memo (in code comments or markdown) that:
(a) Compares the Brier, log, and spherical scores for this use case (b) Considers that employees are risk-averse and may not understand complex mathematics (c) Addresses the concern that some events have base rates below 5% (d) Recommends a scoring rule with justification (e) Proposes a reward structure that maintains proper incentives
Exercise 28: Simulating Gaming Attempts
Write a simulation that:
(a) Creates 100 binary questions with various base rates (b) Simulates an "honest" forecaster who reports true beliefs (with noise) (c) Simulates a "gamer" who uses the following strategies:
- Always reports 0 or 1 (extremization)
- Reports the base rate for every question
- Reports the complement of their true belief for 20% of questions (to hedge)
(d) Scores all strategies with Brier and log scores (e) Shows that the honest forecaster outperforms on average (f) Identifies cases where gaming strategies perform well in a single trial (due to luck) and explains why this does not violate properness
Exercise 29: Time-Weighted Scoring
In many forecasting contexts, forecasts are updated over time as new information arrives.
(a) Implement a time-weighted scoring function that weights forecasts closer to resolution more heavily. (b) Simulate a scenario with 50 questions, each open for 30 days, with daily forecast updates. (c) Compare three weighting schemes: uniform weights, exponential decay, and "late-only" (only scoring the final forecast). (d) Analyze which weighting scheme best rewards early information incorporation vs. final accuracy.
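A sketch of one weighting scheme for part (a); the half-life parameter is illustrative, not prescribed by the exercise:

```python
import numpy as np

def time_weighted_brier(daily_forecasts, outcome, half_life_days=7.0):
    # Exponential-decay weights: forecasts made closer to resolution count more.
    f = np.asarray(daily_forecasts, dtype=float)
    days_to_resolution = np.arange(len(f) - 1, -1, -1)  # final forecast -> 0 days out
    weights = 0.5 ** (days_to_resolution / half_life_days)
    weights /= weights.sum()
    return float(np.sum(weights * (f - outcome) ** 2))
```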
Exercise 30: Designing a Prediction Platform Scoring System
You are designing the scoring system for a new prediction platform. The platform will host:
- Binary questions (yes/no)
- Multi-outcome questions (e.g., elections with 3+ candidates)
- Continuous questions (e.g., "What will GDP growth be?")
- Questions with different resolution dates (some resolve in days, some in years)
(a) Propose a unified scoring approach that handles all question types. (b) Address how to aggregate scores across question types fairly. (c) Design a leaderboard system that is robust to cherry-picking. (d) Implement your scoring system in Python and demonstrate it on simulated data. (e) Write a brief user-facing explanation of your scoring system that a non-technical user could understand.