Chapter 12 Quiz: Calibration — Measuring Forecast Quality

Instructions: Choose the single best answer for each question unless otherwise indicated. Answers are provided at the end.


Question 1

A forecaster is perfectly calibrated if:

(a) Their predictions are always correct. (b) Among events they assign probability p, the fraction that occur equals p. (c) Their Brier score is zero. (d) They always predict the base rate.


Question 2

A forecaster assigns 80% probability to events that actually occur 65% of the time. This forecaster is:

(a) Underconfident (b) Overconfident (c) Well-calibrated (d) Poorly resolved


Question 3

Expected Calibration Error (ECE) is computed as:

(a) The maximum absolute difference between predicted and observed frequencies across bins (b) The weighted average of squared differences between predicted and observed frequencies (c) The weighted average of absolute differences between predicted and observed frequencies (d) The mean of all Brier scores across bins


Question 4

In a reliability diagram, a point that lies above the perfect calibration diagonal indicates:

(a) The forecaster is overconfident in that range (b) The forecaster is underconfident in that range (c) The forecaster has zero resolution (d) The bin has too few samples


Question 5

Which forecaster is better, all else being equal?

  • Forecaster A: ECE = 0.03, Sharpness (MAD) = 0.25
  • Forecaster B: ECE = 0.03, Sharpness (MAD) = 0.08

(a) Forecaster A, because higher sharpness with equal calibration is better (b) Forecaster B, because lower sharpness means less risk (c) They are equally good since their ECE is the same (d) Cannot determine without knowing the base rate


Question 6

In the Murphy decomposition BS = REL - RES + UNC, which component measures calibration?

(a) UNC (Uncertainty) (b) RES (Resolution) (c) REL (Reliability) (d) BS (Brier Score)


Question 7

A forecaster has REL = 0.01, RES = 0.08, UNC = 0.24. Their Brier Skill Score is:

(a) 0.292 (b) 0.708 (c) 0.170 (d) 0.830


Question 8

The Brier Skill Score (BSS) equals zero when:

(a) The Brier score equals zero (b) The forecast is perfect (c) The forecast is no better than predicting the base rate (d) The forecast is perfectly calibrated


Question 9

A forecaster always predicts the base rate (0.40) for a dataset with base rate 0.40. Which statement is true?

(a) They have perfect calibration and perfect resolution (b) They have perfect calibration and zero resolution (c) They have zero calibration error and maximum Brier score (d) They have zero ECE but negative BSS


Question 10

What does the Uncertainty component of the Murphy decomposition depend on?

(a) The quality of the forecaster (b) The number of predictions (c) The base rate of the outcomes only (d) The sharpness of the predictions


Question 11

Which recalibration technique requires the most data to fit reliably?

(a) Platt scaling (b) Histogram binning (c) Isotonic regression (d) Beta calibration


Question 12

Platt scaling recalibrates forecasts by:

(a) Fitting a monotone step function to the data (b) Replacing each forecast with the observed bin frequency (c) Fitting a logistic regression on the log-odds of the predictions (d) Computing a weighted average of the prediction and the base rate


Question 13

The favorite-longshot bias in prediction markets means:

(a) Markets are always perfectly calibrated (b) Events with high probability tend to be underpriced and events with low probability tend to be overpriced (c) Markets overreact to new information (d) All events converge to 50% probability over time


Question 14

Which of the following ECE values indicates excellent calibration?

(a) ECE = 0.15 (b) ECE = 0.08 (c) ECE = 0.01 (d) ECE = 0.50


Question 15

Choosing too many bins (e.g., K = 100) for calibration analysis is problematic because:

(a) The computation becomes too slow (b) Each bin has too few samples, making observed frequencies unreliable (c) It forces the ECE to zero (d) It makes the forecaster appear perfectly calibrated


Question 16

Resolution in the Murphy decomposition measures:

(a) How well-calibrated the forecasts are (b) How much the observed frequency varies across forecast bins (c) How extreme the forecasts are (d) The inherent difficulty of the prediction problem


Question 17

A trader has excellent resolution but poor calibration. The best strategy for improvement is:

(a) Make less extreme predictions (b) Always predict 50% (c) Apply a recalibration function to the predictions (d) Increase the number of bins used for evaluation


Question 18

When using cross-validated recalibration, why is cross-validation important?

(a) It makes the computation faster (b) It prevents the recalibration model from overfitting to the training data (c) It increases the number of data points (d) It ensures the bins have equal numbers of samples


Question 19

Which statement about sharpness is TRUE?

(a) Sharpness measures how accurate the forecasts are (b) A forecaster who always predicts 50% has maximum sharpness (c) Sharpness measures how extreme (confident) the forecasts are (d) Sharpness and calibration are the same thing


Question 20

A prediction market has ECE = 0.04 for US political events and ECE = 0.09 for technology events. The most likely explanation is:

(a) The market maker is biased against technology (b) Technology events have lower liquidity and fewer experienced participants (c) ECE is not a valid metric for technology events (d) Political events are inherently easier to predict


Question 21

The Probability Integral Transform (PIT) is used to assess calibration for:

(a) Binary outcomes only (b) Multi-class categorical outcomes (c) Continuous distributional forecasts (d) Ordinal outcomes only


Question 22

If a PIT histogram is U-shaped, the forecaster is:

(a) Well-calibrated (b) Underconfident (distributions too wide) (c) Overconfident (distributions too narrow) (d) Biased in location


Question 23

Maximum Calibration Error (MCE) is useful when:

(a) You want to know the average calibration quality (b) You need to ensure no probability range is badly miscalibrated (c) You have very few data points (d) You are comparing more than two forecasters


Question 24

A BSS of -0.10 means:

(a) The forecaster has 10% skill (b) The forecaster is 10% worse than a random forecaster (c) The forecaster is 10% worse than the base rate forecaster (d) The data has 10% uncertainty


Question 25

Which combination of properties defines an ideal forecaster?

(a) Low sharpness, low calibration error, low resolution (b) High sharpness, low calibration error, high resolution (c) High sharpness, high calibration error, high resolution (d) Low sharpness, low calibration error, high resolution


Answer Key

Question 1: (b) Perfect calibration means the stated probability matches the observed frequency. This does not require being "always correct" (which would mean Brier score = 0 and perfect accuracy), nor does it require predicting the base rate.

Question 2: (b) The forecaster says 80% but reality is 65% — they are more confident than warranted, which is overconfidence. Overconfidence means predicted probabilities are more extreme than observed frequencies.

Question 3: (c) ECE = sum over bins of (n_k / N) * |p_k - o_k|, where n_k is the number of forecasts in bin k, p_k is the mean predicted probability in that bin, and o_k is the observed frequency. It uses absolute differences (not squared), weighted by bin size.
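
A minimal NumPy sketch of this computation, assuming equal-width bins (function and variable names are illustrative). It also returns the worst single-bin gap, which is the MCE discussed in Question 23.

    import numpy as np

    def calibration_errors(p, y, n_bins=10):
        """ECE and MCE from equal-width probability bins (illustrative sketch)."""
        p, y = np.asarray(p, float), np.asarray(y, float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        gaps, weights = [], []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (p >= lo) & (p < hi) if hi < 1.0 else (p >= lo) & (p <= hi)
            if mask.any():
                gaps.append(abs(p[mask].mean() - y[mask].mean()))   # |p_k - o_k|
                weights.append(mask.sum() / len(p))                 # n_k / N
        return float(np.dot(weights, gaps)), float(max(gaps))       # (ECE, MCE)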

Question 4: (b) With predicted probability on the x-axis and observed frequency on the y-axis, a point above the diagonal means the observed frequency exceeds the predicted probability: events in that range happened more often than the forecaster said, so they should have predicted higher. This is underconfidence.

Question 5: (a) Both forecasters have equal calibration (ECE = 0.03). Among equally calibrated forecasters, the sharper one is more informative and more useful. Forecaster A, with MAD = 0.25, makes more extreme predictions that are still just as well calibrated.
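
If sharpness is measured as the mean absolute deviation of the forecasts from 0.5 (the assumption behind MAD here), it is a one-line computation; a sketch:

    import numpy as np

    def sharpness_mad(p):
        """Mean absolute deviation of forecasts from 0.5 (larger = sharper)."""
        return float(np.mean(np.abs(np.asarray(p, float) - 0.5)))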

Question 6: (c) Reliability (REL) directly measures the squared calibration error — the weighted squared difference between predicted and observed frequencies. Lower REL means better calibration.

Question 7: (a) BS = REL - RES + UNC = 0.01 - 0.08 + 0.24 = 0.17. BSS = 1 - BS/UNC = 1 - 0.17/0.24 = 1 - 0.708 = 0.292. Alternatively, BSS = (RES - REL)/UNC = (0.08 - 0.01)/0.24 = 0.07/0.24 = 0.292.
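
A quick check of this arithmetic:

    rel, res, unc = 0.01, 0.08, 0.24
    bs = rel - res + unc                  # 0.17
    bss = 1 - bs / unc                    # equivalently (res - rel) / unc
    print(round(bs, 3), round(bss, 3))    # 0.17 0.292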

Question 8: (c) BSS = 1 - BS/BS_ref. When BSS = 0, BS = BS_ref, meaning the forecast performs exactly as well as the reference (base rate) forecast. No skill relative to the baseline.

Question 9: (b) A constant forecaster at the base rate is always perfectly calibrated (the one bin matches perfectly) but has zero resolution (no variation in forecasts, no discrimination between events). REL = 0, RES = 0, BS = UNC, BSS = 0.

Question 10: (c) UNC = o_bar * (1 - o_bar), where o_bar is the base rate. It depends solely on the marginal outcome frequency and reflects the inherent difficulty of the prediction problem. It is independent of the forecaster.

Question 11: (c) Isotonic regression is a non-parametric method that fits a monotone step function whose number of levels can grow with the data. It has more effective degrees of freedom than Platt scaling (2 parameters), beta calibration (3 parameters), or histogram binning with a fixed small number of bins, so it requires the most data to avoid overfitting.
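
For reference, a minimal isotonic recalibration sketch using scikit-learn's IsotonicRegression on synthetic, deliberately miscalibrated forecasts (the library choice, data, and names are illustrative):

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)
    p_raw = rng.uniform(0.0, 1.0, 2000)                            # raw forecasts
    y = (rng.uniform(0.0, 1.0, 2000) < p_raw ** 1.5).astype(int)   # true prob is p_raw**1.5

    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(p_raw, y)               # learns a monotone step function p -> q
    p_cal = iso.predict(p_raw)      # recalibrated probabilities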

Question 12: (c) Platt scaling converts predictions to log-odds (logits), fits a logistic regression on these logits, and converts back. The formula is q = sigma(a * logit(p) + b), with two learned parameters a and b.
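
A minimal sketch of this, fitting the two parameters with scikit-learn's logistic regression on the logits (helper names are illustrative; the large C effectively turns off regularization):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def _logit(p, eps=1e-6):
        p = np.clip(np.asarray(p, float), eps, 1 - eps)
        return np.log(p / (1 - p))

    def fit_platt(p_fit, y_fit):
        """Learn q = sigma(a * logit(p) + b); returns a recalibration function."""
        lr = LogisticRegression(C=1e6)
        lr.fit(_logit(p_fit).reshape(-1, 1), y_fit)
        a, b = lr.coef_[0, 0], lr.intercept_[0]
        return lambda p: 1.0 / (1.0 + np.exp(-(a * _logit(p) + b)))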

Question 13: (b) The favorite-longshot bias is the empirical finding that high-probability events (favorites) tend to be slightly underpriced and low-probability events (longshots) tend to be slightly overpriced in prediction and betting markets.

Question 14: (c) ECE values below 0.02 are generally considered excellent. ECE = 0.01 indicates that on average, the predicted probability and observed frequency differ by only 1 percentage point.

Question 15: (b) With too many bins, each bin contains very few forecasts. The observed frequency in a bin with 3-5 samples has enormous sampling variance, making calibration assessment unreliable for individual bins.

Question 16: (b) Resolution = (1/N) * sum(n_k * (o_k - o_bar)^2). It measures how much the observed frequency in each bin deviates from the overall base rate. High resolution means the forecaster successfully separates events into groups with genuinely different outcome rates.
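
Combining the formulas from Questions 6, 10, and 16, the whole decomposition can be computed from binned forecasts; a sketch under the same equal-width-binning assumption as above:

    import numpy as np

    def murphy_decomposition(p, y, n_bins=10):
        """Return (REL, RES, UNC) so that the binned Brier score is REL - RES + UNC."""
        p, y = np.asarray(p, float), np.asarray(y, float)
        o_bar = y.mean()
        bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)   # equal-width bins
        rel = res = 0.0
        for k in range(n_bins):
            mask = bins == k
            if mask.any():
                n_k, p_k, o_k = mask.sum(), p[mask].mean(), y[mask].mean()
                rel += n_k * (p_k - o_k) ** 2 / len(p)
                res += n_k * (o_k - o_bar) ** 2 / len(p)
        return rel, res, o_bar * (1 - o_bar)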

Question 17: (c) If the forecaster has good resolution (correct rank-ordering of events) but poor calibration (probability scale is off), recalibration will fix the probability scale while preserving the valuable rank-ordering. This is the "free lunch" of calibration.

Question 18: (b) Without cross-validation, you fit and evaluate recalibration on the same data, overstating the improvement. Cross-validation ensures you evaluate on data not used for fitting, giving an honest estimate of out-of-sample performance.
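
One way to implement this, sketched here with isotonic regression as the recalibrator and scikit-learn's KFold (both choices are illustrative), is to produce out-of-fold recalibrated forecasts so that no forecast is recalibrated by a model that saw it during fitting:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.isotonic import IsotonicRegression

    def cv_recalibrate(p, y, n_splits=5):
        """Out-of-fold recalibrated forecasts."""
        p, y = np.asarray(p, float), np.asarray(y, float)
        q = np.empty_like(p)
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train, test in kf.split(p):
            iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
            iso.fit(p[train], y[train])
            q[test] = iso.predict(p[test])
        return q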

Question 19: (c) Sharpness measures how extreme or confident the forecasts are — how much they deviate from 50%. A forecaster who uses probabilities near 0% and 100% is sharp. Sharpness says nothing about whether those extreme forecasts are correct.

Question 20: (b) Lower liquidity typically leads to worse calibration because there are fewer participants to correct individual errors. Technology events on prediction markets tend to have lower trading volumes than major political events.

Question 21: (c) The PIT applies to continuous distributional forecasts. If a forecaster provides a CDF F, then u = F(x_observed) should be uniform if the forecaster is well-calibrated. This extends calibration beyond binary/categorical settings.
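
A small synthetic sketch using SciPy, with Gaussian forecast distributions that are deliberately too narrow (all names and numbers are illustrative); the resulting U-shaped histogram is the overconfidence pattern described in Question 22:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    mu = rng.normal(0.0, 1.0, 5000)          # forecast means
    sigma = 0.5                              # forecast std dev (too narrow: truth uses 1.0)
    x = rng.normal(mu, 1.0)                  # observed outcomes
    u = norm.cdf(x, loc=mu, scale=sigma)     # PIT values u = F(x_observed)
    hist, _ = np.histogram(u, bins=10, range=(0.0, 1.0))   # U-shaped -> overconfident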

Question 22: (c) A U-shaped PIT histogram means too many observations fall in the extreme tails of the forecast distribution (PIT values near 0 or 1). This indicates the forecast distributions are too narrow — the forecaster is overconfident about the precision of their distributional forecast.

Question 23: (b) MCE captures the worst-case calibration error across all bins. It is most useful when you need to guarantee that no single probability range is badly miscalibrated — important for risk management and regulatory contexts.

Question 24: (c) BSS = 1 - BS/BS_ref. BSS = -0.10 means BS = 1.10 * BS_ref. The forecast is 10% worse than the reference forecast (typically the base rate). This is a negative skill situation — the forecaster would be better off always predicting the base rate.

Question 25: (b) The ideal forecaster has high sharpness (extreme, confident predictions), low calibration error (those predictions match reality), and high resolution (predictions discriminate between events that happen and those that do not). This is the "maximize sharpness, subject to calibration" principle, combined with high discrimination.