Chapter 12 Exercises: Calibration — Measuring Forecast Quality
These exercises are organized into five parts of increasing difficulty. Part A tests conceptual understanding, Part B focuses on calculations, Part C involves programming, Part D presents scenario-based problems, and Part E offers advanced challenges.
Part A: Conceptual Understanding (Exercises 1-6)
Exercise 1: Calibration Definition
A weather forecaster has the following track record over 1,000 forecasts:
| Stated probability of rain | Number of forecasts | Days it actually rained |
|---|---|---|
| 10% | 200 | 15 |
| 30% | 250 | 80 |
| 50% | 150 | 75 |
| 70% | 200 | 155 |
| 90% | 200 | 185 |
(a) Is this forecaster well-calibrated? Compute the observed frequency for each bin and compare to the stated probability.
(b) In which probability range is the forecaster most miscalibrated?
(c) Is the forecaster generally overconfident or underconfident? Explain.
(d) What specific advice would you give this forecaster to improve their calibration?
Exercise 2: Calibration vs. Accuracy
Consider three forecasters who each predict 100 binary events (50 of which actually occur):
- Forecaster A: Always predicts 50%.
- Forecaster B: Predicts 90% for the 50 events that occur and 10% for the 50 that do not.
- Forecaster C: Predicts 80% for all events.
(a) Compute the Brier score for each forecaster.
(b) Which forecasters are well-calibrated? Which are not?
(c) Rank the forecasters by (i) calibration quality and (ii) overall forecast quality. Explain any differences in the rankings.
(d) What does this exercise illustrate about the relationship between calibration and accuracy?
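For reference in part (a): for $N$ forecasts $p_i$ with binary outcomes $o_i \in \{0, 1\}$, the Brier score is the mean squared error of the stated probabilities,
$$\text{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2.$$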
Exercise 3: Reading Reliability Diagrams
Sketch (or describe in words) the reliability diagram for each of the following scenarios:
(a) A perfectly calibrated forecaster.
(b) An overconfident forecaster who assigns extreme probabilities too frequently.
(c) An underconfident forecaster who hedges toward 50%.
(d) A forecaster who only uses three probability values: 20%, 50%, and 80%.
(e) A forecaster whose calibration curve forms an S-shape (overconfident for low probabilities, underconfident for high probabilities).
Exercise 4: Components of the Murphy Decomposition
In your own words, explain what each of the following components of the Murphy decomposition measures:
(a) Reliability (REL)
(b) Resolution (RES)
(c) Uncertainty (UNC)
(d) A forecaster has REL = 0.01, RES = 0.05, UNC = 0.25. Compute their Brier score and Brier Skill Score. Interpret the results.
(e) Another forecaster has REL = 0.00, RES = 0.00, UNC = 0.25. Compute their Brier score and BSS. What kind of forecaster produces these values?
Exercise 5: Sharpness vs. Calibration
(a) Define sharpness in the context of probabilistic forecasting.
(b) Explain the maxim "maximize sharpness, subject to calibration." Why is the order important?
(c) Give an example of a forecaster who is sharp but poorly calibrated. What is the practical consequence?
(d) Give an example of a forecaster who is well-calibrated but not sharp. What is the practical consequence?
(e) In the context of prediction market trading, why does sharpness matter for profitability?
Exercise 6: Favorite-Longshot Bias
(a) Define the favorite-longshot bias in your own words.
(b) How does this bias manifest in a reliability diagram? Sketch or describe.
(c) If a prediction market exhibits the favorite-longshot bias, what trading strategy could exploit it?
(d) Why might this bias persist even in efficient markets? Propose at least two explanations.
Part B: Calculation Exercises (Exercises 7-12)
Exercise 7: Computing ECE
A forecaster makes 200 predictions, grouped into 5 bins:
| Bin range | n_k | Mean predicted (p_k) | Actual frequency (o_k) |
|---|---|---|---|
| [0, 0.2) | 30 | 0.12 | 0.10 |
| [0.2, 0.4) | 40 | 0.31 | 0.35 |
| [0.4, 0.6) | 60 | 0.48 | 0.50 |
| [0.6, 0.8) | 45 | 0.72 | 0.60 |
| [0.8, 1.0] | 25 | 0.88 | 0.92 |
(a) Compute the ECE.
(b) Compute the MCE.
(c) Which bin contributes most to the ECE?
(d) Is the forecaster generally overconfident or underconfident?
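A reminder of the binned definitions needed for (a) and (b): with $n_k$ forecasts in bin $k$ (out of $N = 200$ total), mean predicted probability $p_k$, and observed frequency $o_k$,
$$\text{ECE} = \sum_k \frac{n_k}{N} \lvert p_k - o_k \rvert, \qquad \text{MCE} = \max_k \lvert p_k - o_k \rvert.$$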
Exercise 8: Murphy Decomposition by Hand
Using the data from Exercise 7:
(a) Compute the base rate $\bar{o}$.
(b) Compute the Reliability component.
(c) Compute the Resolution component.
(d) Compute the Uncertainty component.
(e) Verify that REL - RES + UNC approximately equals the Brier score (you may approximate the Brier score from the binned data as the count-weighted sum of bin-level Brier scores, treating every forecast in a bin as equal to that bin's mean prediction).
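Using the same binned notation and base rate $\bar{o}$, the decomposition is
$$\text{BS} = \underbrace{\sum_k \frac{n_k}{N}(p_k - o_k)^2}_{\text{REL}} - \underbrace{\sum_k \frac{n_k}{N}(o_k - \bar{o})^2}_{\text{RES}} + \underbrace{\bar{o}(1 - \bar{o})}_{\text{UNC}}.$$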
Exercise 9: Brier Skill Score Comparison
Two prediction platforms forecast the same 500 events (base rate = 0.40):
| Platform | Brier Score |
|---|---|
| Platform A | 0.200 |
| Platform B | 0.225 |
(a) Compute the BSS for each platform against the climatological baseline.
(b) Compute the BSS for Platform A using Platform B as the reference.
(c) Interpret the results: which platform is better? By how much?
(d) If Platform A has REL = 0.005 and Platform B has REL = 0.020, compute the resolution of each platform.
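For parts (a) and (b), recall that the Brier Skill Score relative to a reference forecast with Brier score $\text{BS}_{\text{ref}}$ is
$$\text{BSS} = 1 - \frac{\text{BS}}{\text{BS}_{\text{ref}}},$$
and that the climatological reference always predicts the base rate $\bar{o}$, so $\text{BS}_{\text{ref}} = \bar{o}(1 - \bar{o})$.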
Exercise 10: Confidence Intervals for Calibration
A bin in a reliability diagram contains 80 forecasts with a mean predicted probability of 0.70 and an observed frequency of 0.65.
(a) Compute the standard error of the observed frequency.
(b) Construct a 95% confidence interval for the true observed frequency.
(c) Is the forecaster statistically significantly miscalibrated in this bin at the 5% level?
(d) How many forecasts would be needed in this bin to make a 5-percentage-point miscalibration statistically significant at the 5% level?
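For parts (a) and (b), the binomial standard error of an observed frequency $o$ estimated from $n$ forecasts, and the normal-approximation 95% confidence interval, are
$$\text{SE} = \sqrt{\frac{o(1 - o)}{n}}, \qquad \text{CI}_{95\%} = o \pm 1.96\,\text{SE}.$$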
Exercise 11: Recalibration with Platt Scaling
A model produces the following calibration data:
| Original prediction | Observed frequency (from large sample) |
|---|---|
| 0.10 | 0.15 |
| 0.30 | 0.38 |
| 0.50 | 0.55 |
| 0.70 | 0.68 |
| 0.90 | 0.82 |
(a) Convert the original predictions to log-odds (logits).
(b) Convert the observed frequencies to logits.
(c) Plot the observed logits vs. predicted logits. Is the relationship approximately linear?
(d) If you fit a line observed_logit = a * predicted_logit + b, estimate the parameters a and b by visual inspection or simple calculation.
(e) Using your estimated parameters, what would the recalibrated probability be for an original prediction of 0.60?
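For reference, the logit transform, its inverse, and the recalibration map implied by the fitted line are
$$\operatorname{logit}(p) = \ln\frac{p}{1 - p}, \qquad p_{\text{recal}} = \sigma\big(a \cdot \operatorname{logit}(p) + b\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$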
Exercise 12: Multi-Class Calibration
A model predicts probabilities for 3 candidates (A, B, C) in 100 elections. Here is a summary for Candidate A:
| Predicted P(A) range | Count | Actual frequency of A winning |
|---|---|---|
| [0.0, 0.2) | 25 | 0.08 |
| [0.2, 0.4) | 30 | 0.27 |
| [0.4, 0.6) | 20 | 0.55 |
| [0.6, 0.8) | 15 | 0.73 |
| [0.8, 1.0] | 10 | 0.90 |
(a) Compute the ECE for Candidate A's predictions (you may take each bin's midpoint as its mean predicted probability).
(b) Is the model well-calibrated for Candidate A?
(c) If similar data shows Candidate B has ECE = 0.08 and Candidate C has ECE = 0.12, what is the multi-class ECE?
(d) Which candidate's predictions need the most improvement?
Part C: Programming Exercises (Exercises 13-20)
Exercise 13: Implement ECE from Scratch
Write a Python function compute_ece(predictions, outcomes, n_bins=10) that computes the Expected Calibration Error without using any calibration-specific libraries. Test it on the following synthetic data:
import numpy as np

np.random.seed(42)
n = 1000
true_probs = np.random.beta(2, 2, n)
outcomes = np.random.binomial(1, true_probs)
# Add overconfidence bias: push predictions away from 0.5, then clip away from 0 and 1
biased_predictions = 0.5 + 1.3 * (true_probs - 0.5)
biased_predictions = np.clip(biased_predictions, 0.01, 0.99)
Verify that the biased predictions have higher ECE than the true probabilities.
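A minimal sketch of the verification step, assuming your `compute_ece` follows the signature stated above (the function body itself is the exercise):

```python
# Continues from the synthetic-data snippet above.
ece_true = compute_ece(true_probs, outcomes, n_bins=10)
ece_biased = compute_ece(biased_predictions, outcomes, n_bins=10)
print(f"ECE (true probabilities): {ece_true:.4f}")
print(f"ECE (biased predictions): {ece_biased:.4f}")
assert ece_biased > ece_true, "overconfident predictions should have higher ECE"
```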
Exercise 14: Reliability Diagram
Write a Python function that generates a reliability diagram with:
- The calibration curve
- The perfect calibration diagonal
- A histogram of forecast counts per bin
- 95% confidence bands based on binomial standard error
Test it on the data from Exercise 13.
Exercise 15: Murphy Decomposition Visualization
Write a Python function that:
(a) Computes the Murphy decomposition of the Brier score.
(b) Creates a stacked bar chart showing REL, RES, and UNC.
(c) Annotates the chart with the Brier score and BSS.
Compare three forecasters:
1. A well-calibrated, sharp forecaster (use the true probabilities from Exercise 13).
2. An overconfident forecaster (use the biased predictions from Exercise 13).
3. A "base rate" forecaster who always predicts the mean outcome rate.
Exercise 16: Rolling Calibration Monitor
Write a Python class RollingCalibrationMonitor that:
(a) Accepts new (prediction, outcome) pairs one at a time.
(b) Maintains a rolling window of the last 200 observations.
(c) Computes ECE, MCE, and BSS at any time.
(d) Issues a warning when ECE exceeds a configurable threshold.
(e) Includes a method to plot ECE over time as new data arrives.
Test it with a simulation where calibration degrades over time (e.g., a forecaster becomes progressively more overconfident).
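A possible starting skeleton (one design among many; the metric computations are deliberately left as stubs):

```python
from collections import deque

class RollingCalibrationMonitor:
    """Tracks calibration metrics over a rolling window of resolved forecasts."""

    def __init__(self, window_size=200, ece_threshold=0.05, n_bins=10):
        self.window = deque(maxlen=window_size)  # stores (prediction, outcome) pairs
        self.ece_threshold = ece_threshold
        self.n_bins = n_bins
        self.ece_history = []                    # rolling ECE after each update, for plotting

    def update(self, prediction, outcome):
        """Add one resolved forecast to the rolling window (parts a-b)."""
        self.window.append((prediction, outcome))
        # Part (d): after implementing ece(), append its value to ece_history here
        # and issue a warning whenever it exceeds self.ece_threshold.

    def ece(self):
        raise NotImplementedError  # bin self.window and compute ECE (see Exercise 13)

    def mce(self):
        raise NotImplementedError

    def bss(self):
        raise NotImplementedError

    def plot_ece_history(self):
        raise NotImplementedError  # e.g., a matplotlib line plot of self.ece_history
```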
Exercise 17: Platform Comparison
Write a Python script that:
(a) Generates synthetic calibration data for three hypothetical platforms:
   - Platform A: Well-calibrated, moderate sharpness (ECE ~ 0.02)
   - Platform B: Overconfident, high sharpness (ECE ~ 0.06)
   - Platform C: Underconfident, low sharpness (ECE ~ 0.04)
(b) Plots all three reliability diagrams on the same figure.
(c) Computes a comparison table of all metrics (ECE, MCE, BSS, REL, RES, sharpness).
(d) Determines which platform is "best" overall and explains why.
Exercise 18: Recalibration Experiment
Using the biased predictions from Exercise 13:
(a) Apply Platt scaling and compute the ECE of the recalibrated predictions.
(b) Apply isotonic regression and compute the ECE of the recalibrated predictions.
(c) Apply histogram binning and compute the ECE of the recalibrated predictions.
(d) Use 5-fold cross-validation for each method to get honest ECE estimates.
(e) Compare all methods in a summary table and reliability diagrams.
(f) Which method works best for this type of miscalibration? Why?
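A minimal sketch of the three fitting routines for parts (a)-(c), assuming scikit-learn is available (Platt scaling is implemented here as logistic regression on the logit of the original prediction); cross-validation, comparison, and plotting are left to parts (d)-(f):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def fit_platt(preds, outcomes):
    """Platt scaling: logistic regression on the logit of the original prediction."""
    logits = np.log(preds / (1 - preds)).reshape(-1, 1)
    model = LogisticRegression().fit(logits, outcomes)
    return lambda p: model.predict_proba(np.log(p / (1 - p)).reshape(-1, 1))[:, 1]

def fit_isotonic(preds, outcomes):
    """Isotonic regression: monotone, piecewise-constant recalibration map."""
    iso = IsotonicRegression(out_of_bounds="clip").fit(preds, outcomes)
    return iso.predict

def fit_histogram_binning(preds, outcomes, n_bins=10):
    """Histogram binning: map each prediction to its bin's observed frequency."""
    edges = np.linspace(0, 1, n_bins + 1)
    bins = np.clip(np.digitize(preds, edges) - 1, 0, n_bins - 1)
    freq = np.array([outcomes[bins == k].mean() if np.any(bins == k)
                     else (edges[k] + edges[k + 1]) / 2 for k in range(n_bins)])
    return lambda p: freq[np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)]
```

Each function returns a callable that maps raw predictions to recalibrated probabilities, so all three can be evaluated with the same ECE code from Exercise 13.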
Exercise 19: Personal Calibration Tracker
Extend the CalibrationTracker class from the chapter to include:
(a) A method rolling_calibration(window_size=50) that computes ECE over a rolling window of resolved forecasts.
(b) A method compare_to_market() that compares your calibration to the market prices recorded at forecast time.
(c) A method category_breakdown() that shows ECE and BSS for each category separately.
(d) A method improvement_over_time() that plots how your ECE has changed over time.
Exercise 20: Simulating Calibration Learning
Write a simulation that models a trader learning to become better calibrated over time:
(a) The trader starts with a systematic overconfidence bias (predictions are 15% more extreme than true probabilities).
(b) After each batch of 50 resolved forecasts, the trader adjusts their bias by 20% of the observed calibration error.
(c) Simulate 2,000 forecasts and plot the ECE over time.
(d) How many forecasts does it take for the trader to achieve ECE < 0.03?
(e) What happens if you change the learning rate (adjustment factor)?
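One possible parameterization of parts (a) and (b) (an assumption for concreteness; other forms are acceptable): with true probability $q_t$ and an overconfidence factor $\beta$ starting at $0.15$, the trader states $p_t = 0.5 + (1 + \beta)(q_t - 0.5)$, and after each batch updates $\beta \leftarrow \beta - 0.2\,\hat{e}$, where $\hat{e}$ is a signed estimate of the batch's calibration error (positive when the batch was overconfident).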
Part D: Scenario-Based Problems (Exercises 21-25)
Exercise 21: The Overconfident Analyst
You are reviewing the track record of a political analyst who posts predictions on social media. They have made 500 binary predictions over two years. Their reliability diagram shows a clear overconfidence pattern: predictions above 60% have observed frequencies about 10 percentage points lower than predicted, while predictions below 40% have observed frequencies about 10 percentage points higher.
(a) Estimate their ECE given this pattern.
(b) Their Brier score is 0.22 and the base rate is 0.50. Compute their BSS.
(c) Decompose their Brier score: estimate REL and RES.
(d) If you could perfectly recalibrate this analyst (setting REL = 0), what would their new Brier score and BSS be?
(e) Should this analyst be using Platt scaling or isotonic regression for recalibration? Justify your answer.
Exercise 22: The Cautious Trader
A prediction market trader has been tracking their forecasts for six months. They have resolved 150 trades. Their calibration report shows:
- ECE: 0.015 (excellent)
- BSS: 0.02 (poor)
- Sharpness (MAD): 0.08 (very low)
- All predictions fall between 0.40 and 0.60.
(a) Explain the apparent contradiction between excellent calibration and poor BSS.
(b) What is this trader's fundamental problem?
(c) What specific steps should this trader take to improve their performance?
(d) If this trader were to make their predictions 50% more extreme (e.g., 55% becomes 57.5%), what would happen to their ECE and BSS, assuming their rank-ordering of events is good?
Exercise 23: Market Category Analysis
A prediction market platform publishes the following calibration data by category:
| Category | n | ECE | BSS | Avg. liquidity |
|---|---|---|---|---|
| US Politics | 200 | 0.03 | 0.25 | High |
| World Politics | 150 | 0.06 | 0.15 | Medium |
| Technology | 100 | 0.08 | 0.10 | Low |
| Science | 80 | 0.05 | 0.20 | Low |
| Entertainment | 120 | 0.04 | 0.18 | Medium |
(a) Which category is best-calibrated? Which has the highest skill?
(b) Is there a relationship between liquidity and calibration? Explain.
(c) A trader specializes in Technology markets. What does the data suggest about their potential edge?
(d) If you were designing a recalibration system for this platform, which categories would you prioritize? What recalibration method would you use for each?
Exercise 24: Calibration Drift Detection
You run a forecasting model for a prediction market aggregator. Your model was well-calibrated (ECE = 0.02) when deployed six months ago, but your rolling calibration monitor shows ECE has increased to 0.07 over the past month. The Brier score has also increased from 0.18 to 0.22.
(a) List three possible causes of calibration drift.
(b) How would you diagnose which cause is most likely? What data would you examine?
(c) Should you immediately recalibrate the model? What are the risks?
(d) Design an automated system that detects calibration drift and responds appropriately. What thresholds and actions would you use?
Exercise 25: Multi-Platform Arbitrage Through Calibration
You discover that Platform A and Platform B both offer markets on the same set of 100 events. Your calibration analysis reveals:
- Platform A: Generally well-calibrated, but overprices events in the 80-100% range by about 5 percentage points.
- Platform B: Generally well-calibrated, but underprices events in the 80-100% range by about 3 percentage points.
(a) Describe a trading strategy that exploits this calibration difference.
(b) Estimate the expected profit per trade, assuming you can buy/sell at the stated prices.
(c) What risks does this strategy face?
(d) How many events would you need to trade for the strategy to be profitable with 95% confidence, assuming each trade has a standard deviation of 0.30?
Part E: Advanced Challenges (Exercises 26-30)
Exercise 26: Kernel Density Calibration
The standard binning approach to calibration can be sensitive to the choice of bins. Implement a kernel-density-based calibration measure that:
(a) Estimates the calibration function $g(p) = E[\text{outcome} \mid \text{prediction} = p]$ using local weighted regression (LOESS/LOWESS or Nadaraya-Watson kernel regression).
(b) Computes a continuous ECE analog: $\text{ECE}_{\text{cont}} = \int_0^1 |g(p) - p| \, f(p) \, dp$, where $f(p)$ is the density of predictions.
(c) Compares this to the standard binned ECE for various bin counts (5, 10, 20, 50) on a dataset where the true calibration function is known (e.g., $g(p) = 0.5 + 0.6(p - 0.5)$, representing overconfidence).
(d) Discuss the advantages and disadvantages of the kernel approach versus binning.
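A minimal Nadaraya-Watson sketch for part (a), assuming a Gaussian kernel and a hand-picked bandwidth (LOESS or a data-driven bandwidth would serve equally well):

```python
import numpy as np

def nadaraya_watson(preds, outcomes, grid, bandwidth=0.05):
    """Kernel-regression estimate of g(p) = E[outcome | prediction = p] at each grid point."""
    # Gaussian kernel weights, shape (len(grid), len(preds))
    weights = np.exp(-0.5 * ((grid[:, None] - preds[None, :]) / bandwidth) ** 2)
    return (weights * outcomes[None, :]).sum(axis=1) / weights.sum(axis=1)

# Example usage (with the arrays from Exercise 13):
# grid = np.linspace(0.01, 0.99, 99)
# g_hat = nadaraya_watson(biased_predictions, outcomes, grid)
```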
Exercise 27: Bayesian Calibration Assessment
Develop a Bayesian approach to calibration assessment:
(a) Place a Beta prior on the "true" observed frequency for each calibration bin.
(b) Update with the observed data to get a posterior distribution.
(c) Compute the posterior probability that each bin is "well-calibrated" (|true frequency - predicted probability| < 0.05).
(d) Compute a Bayesian ECE by taking the posterior expectation of the calibration error.
(e) Compare this to the frequentist ECE. When do the approaches differ most? (Hint: consider small sample sizes.)
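A minimal sketch of parts (a)-(c) for a single bin, assuming a uniform Beta(1, 1) prior:

```python
from scipy import stats

def bin_posterior(successes, n, prior_a=1.0, prior_b=1.0):
    """Posterior over a bin's true frequency after observing `successes` YES outcomes in `n` forecasts."""
    return stats.beta(prior_a + successes, prior_b + n - successes)

def prob_well_calibrated(successes, n, predicted, tol=0.05):
    """Posterior probability that |true frequency - predicted probability| < tol."""
    post = bin_posterior(successes, n)
    return post.cdf(predicted + tol) - post.cdf(predicted - tol)

# Hypothetical example: a bin of 80 forecasts with mean prediction 0.70, 52 of which resolved YES.
# print(prob_well_calibrated(successes=52, n=80, predicted=0.70))
```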
Exercise 28: Calibration Under Distribution Shift
Design a simulation study that examines calibration under distribution shift:
(a) Generate training data from one probability distribution (e.g., base rate = 0.40, well-calibrated forecasts).
(b) Generate test data from a shifted distribution (e.g., base rate = 0.60, same forecasting model).
(c) Measure calibration on the test data. How badly does calibration degrade?
(d) Apply recalibration using (i) the training distribution, (ii) a small sample from the test distribution, and (iii) an expanding window of test data.
(e) Plot ECE as a function of the amount of test data used for recalibration. How quickly does calibration recover?
Exercise 29: Information-Theoretic Calibration
Explore the connection between calibration and information theory:
(a) Define the calibration loss as the KL divergence between the true conditional distribution $P(\text{outcome} | \text{forecast} = p)$ and the stated forecast $p$:
$$\text{Cal}_{\text{KL}} = E_p \left[ D_{\text{KL}}(P(\text{outcome} | p) \| \text{Bernoulli}(p)) \right]$$
(b) Show that this is related to the reliability component of the Brier decomposition (they are not identical, but they are both zero if and only if the forecaster is calibrated).
(c) Implement both the Brier-based reliability and the KL-based calibration loss. Compare them on synthetic data with varying levels of miscalibration.
(d) Discuss when the two measures might rank forecasters differently.
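For part (c), note that when $P(\text{outcome} \mid p)$ is summarized by the conditional frequency $o_p$, the inner term has the closed form
$$D_{\text{KL}}\big(\text{Bernoulli}(o_p) \,\|\, \text{Bernoulli}(p)\big) = o_p \ln\frac{o_p}{p} + (1 - o_p) \ln\frac{1 - o_p}{1 - p}.$$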
Exercise 30: End-to-End Calibration Pipeline
Build a complete, production-quality calibration analysis pipeline:
(a) Data ingestion: Read prediction-outcome pairs from a CSV file with columns: id, prediction, outcome, timestamp, category.
(b) Metric computation: Compute all metrics from this chapter (ECE, MCE, BSS, Murphy decomposition, sharpness).
(c) Visualization: Generate a multi-panel figure with:
   - Reliability diagram with confidence bands
   - Murphy decomposition bar chart
   - Forecast distribution histogram
   - ECE over time (rolling window)
(d) Recalibration: Apply cross-validated isotonic regression and show before/after reliability diagrams.
(e) Report generation: Output a text report summarizing all findings with actionable recommendations.
(f) Alerting: Implement a function that checks if calibration has degraded beyond a threshold and returns a warning message.
Test the pipeline on synthetic data with 5,000 forecasts across 5 categories, with varying levels of calibration quality.
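A minimal ingestion-and-validation sketch for part (a), assuming the column names listed above and pandas for I/O:

```python
import pandas as pd

REQUIRED_COLUMNS = {"id", "prediction", "outcome", "timestamp", "category"}

def load_forecasts(path):
    """Read prediction-outcome pairs and run basic sanity checks before computing metrics."""
    df = pd.read_csv(path, parse_dates=["timestamp"])
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    if not df["prediction"].between(0, 1).all():
        raise ValueError("Predictions must lie in [0, 1]")
    if not df["outcome"].isin([0, 1]).all():
        raise ValueError("Outcomes must be binary (0 or 1)")
    return df.sort_values("timestamp")
```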
Submission Guidelines
- For calculation exercises, show all intermediate steps.
- For programming exercises, include complete, runnable code with comments.
- For scenario exercises, justify your reasoning with specific numbers from the chapter.
- All code should be tested with at least one example dataset.
- Reliability diagrams should include confidence bands and forecast histograms.