Chapter 25 Exercises

Section 25.2: Simple Averaging

Exercise 1: Basic Simple Average (Beginner) Three models predict the probability of a political event: Model A says 0.65, Model B says 0.72, Model C says 0.58. Compute the simple average ensemble forecast. If the event occurs (outcome = 1), compute the Brier score of (a) each individual model and (b) the ensemble. Which is best?
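
As a starting point, a minimal Python sketch of the Brier-score bookkeeping (the `brier` helper below is illustrative, not a function defined in the chapter):

```python
import numpy as np

def brier(p, y):
    """Brier score of a probability forecast p against a binary outcome y."""
    return (p - y) ** 2

forecasts = np.array([0.65, 0.72, 0.58])  # Models A, B, C
ensemble = forecasts.mean()               # simple average
outcome = 1

for name, p in zip("ABC", forecasts):
    print(f"Model {name}: Brier = {brier(p, outcome):.4f}")
print(f"Ensemble ({ensemble:.2f}): Brier = {brier(ensemble, outcome):.4f}")
```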

Exercise 2: Variance Reduction (Intermediate) You have 5 models, each with error variance $\sigma^2 = 0.04$ and pairwise error correlation $\rho = 0.3$. (a) Compute the ensemble variance. (b) What would the ensemble variance be if $\rho = 0$ (independent errors)? (c) What would it be if $\rho = 0.8$? (d) Plot ensemble variance as a function of $K$ (number of models, from 1 to 20) for $\rho \in \{0, 0.3, 0.5, 0.8\}$.
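
For part (d), a sketch assuming the standard equicorrelated-error expression $\mathrm{Var}(\bar{e}) = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\rho\,\sigma^2$; if Section 25.2 derives the formula in a different form, use that instead:

```python
import numpy as np
import matplotlib.pyplot as plt

def ensemble_variance(sigma2, rho, K):
    """Variance of the mean of K equally correlated errors."""
    return sigma2 / K + (K - 1) / K * rho * sigma2

K = np.arange(1, 21)
for rho in [0.0, 0.3, 0.5, 0.8]:
    plt.plot(K, ensemble_variance(0.04, rho, K), label=f"rho = {rho}")
plt.xlabel("Number of models K")
plt.ylabel("Ensemble error variance")
plt.legend()
plt.show()
```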

Exercise 3: Averaging vs. Best Model (Intermediate) Write a Python simulation. Generate 100 binary events with true probabilities drawn from Beta(2, 2). Create 5 models where each model's forecast equals the true probability plus Gaussian noise with standard deviations [0.08, 0.10, 0.12, 0.15, 0.20]. Compute the Brier score of each model and of the simple average. Repeat 1000 times and report how often the simple average beats the best individual model.

Exercise 4: When Does Averaging Fail? (Intermediate) Construct a specific example where simple averaging is worse than the best individual model. Hint: consider the case where one model is perfectly calibrated and the others are highly biased in the same direction. Explain why averaging fails in this case.

Section 25.3: Weighted Averaging

Exercise 5: Inverse-Error Weights (Beginner) Three models have historical Brier scores of 0.18, 0.22, and 0.30. Compute the inverse-error weights for each model. If their current forecasts are 0.70, 0.65, and 0.55, compute the weighted ensemble forecast.
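
A minimal sketch of inverse-error weighting, assuming weights proportional to $1/\text{Brier}_i$ and normalized to sum to one:

```python
import numpy as np

briers = np.array([0.18, 0.22, 0.30])
forecasts = np.array([0.70, 0.65, 0.55])

weights = (1 / briers) / (1 / briers).sum()   # w_i proportional to 1 / Brier_i
print("Weights:", np.round(weights, 3))
print("Weighted forecast:", round(float(weights @ forecasts), 4))
```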

Exercise 6: Weight Optimization (Intermediate) Using the optimize_weights function from Section 25.3, generate synthetic data with 4 models of varying quality, then: (a) Find optimal weights with no shrinkage. (b) Find optimal weights with shrinkage = 0.5. (c) Evaluate both on a held-out test set. Which performs better? Why?

Exercise 7: Recency Weighting (Intermediate) Implement a recency-weighted Brier score calculator that uses exponential decay with parameter $\lambda$. Given the following model performance over 10 recent predictions (from oldest to newest):

| Period | Model A Error | Model B Error |
|--------|---------------|---------------|
| 1      | 0.25          | 0.10          |
| 2      | 0.20          | 0.15          |
| 3      | 0.22          | 0.12          |
| 4      | 0.18          | 0.20          |
| 5      | 0.15          | 0.25          |
| 6      | 0.12          | 0.28          |
| 7      | 0.10          | 0.30          |
| 8      | 0.08          | 0.32          |
| 9      | 0.07          | 0.35          |
| 10     | 0.05          | 0.38          |

Compute the recency-weighted MSE for both models with $\lambda = 0.9$ and $\lambda = 0.5$. How do the weights change? Which model does each $\lambda$ favor?
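
A sketch of one common convention, assuming the weight on an observation that is $k$ periods old is proportional to $\lambda^k$ (so the newest observation gets the largest weight); adapt it if the chapter defines the decay differently:

```python
import numpy as np

def recency_weighted_mse(errors, lam):
    """Exponentially decayed mean of squared errors; errors ordered oldest to newest."""
    errors = np.asarray(errors, dtype=float)
    ages = np.arange(len(errors) - 1, -1, -1)   # newest observation has age 0
    w = lam ** ages
    w /= w.sum()
    return float(w @ errors**2)

model_a = [0.25, 0.20, 0.22, 0.18, 0.15, 0.12, 0.10, 0.08, 0.07, 0.05]
model_b = [0.10, 0.15, 0.12, 0.20, 0.25, 0.28, 0.30, 0.32, 0.35, 0.38]

for lam in (0.9, 0.5):
    print(lam, recency_weighted_mse(model_a, lam), recency_weighted_mse(model_b, lam))
```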

Exercise 8: Shrinkage toward Equal Weights (Intermediate) Implement a weight-shrinkage function that blends optimal weights with equal weights: $$w_i^{\text{shrunk}} = \alpha \cdot w_i^{\text{optimal}} + (1 - \alpha) \cdot \frac{1}{K}$$ Using synthetic data with $K = 6$ models and $T \in \{20, 50, 100, 500\}$ training observations, find the optimal shrinkage parameter $\alpha$ for each $T$ using cross-validation. Show that the optimal $\alpha$ increases with $T$.

Section 25.4: Linear and Logarithmic Pooling

Exercise 9: Linear vs. Logarithmic Pool (Beginner) Three forecasters predict the probability of rain: 0.80, 0.85, 0.90. Compute: (a) The linear pool (equal weights). (b) The logarithmic pool (equal weights). (c) Which is more extreme? Explain why.
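
A sketch of both pools for a binary event, assuming the logarithmic pool is the renormalized weighted geometric mean:

```python
import numpy as np

def linear_pool(ps, ws=None):
    ps = np.asarray(ps, dtype=float)
    ws = np.full(len(ps), 1 / len(ps)) if ws is None else np.asarray(ws)
    return float(ws @ ps)

def log_pool(ps, ws=None):
    ps = np.asarray(ps, dtype=float)
    ws = np.full(len(ps), 1 / len(ps)) if ws is None else np.asarray(ws)
    num = np.prod(ps ** ws)              # weighted geometric mean of p_i
    den = num + np.prod((1 - ps) ** ws)  # renormalize against the complement event
    return float(num / den)

ps = [0.80, 0.85, 0.90]
print("Linear pool:", linear_pool(ps))
print("Log pool:   ", log_pool(ps))
```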

Exercise 10: Shared Information Problem (Intermediate) Two forecasters both read the same weather report and independently predict rain probability at 0.80. A third forecaster uses a completely different data source and predicts 0.60. (a) Compare the linear pool (equal weights) with the logarithmic pool. (b) Argue which pool is more appropriate and why. (c) What weights in the linear pool would correct for the shared information between the first two forecasters?

Exercise 11: Pool Properties (Advanced) Prove that the linear pool cannot produce a probability outside the range of individual forecasts, while the logarithmic pool can. Give a numerical example where the logarithmic pool produces a more extreme probability than any individual forecast.

Section 25.5: Extremizing

Exercise 12: Basic Extremizing (Beginner) An average forecast is 0.65. Compute the extremized forecast for $d \in \{1.0, 1.5, 2.0, 2.5, 3.0\}$. Plot the results.
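
A sketch assuming the log-odds form of extremizing, $p^d / \bigl(p^d + (1-p)^d\bigr)$; substitute the exact transformation given in Section 25.5 if it differs:

```python
import numpy as np
import matplotlib.pyplot as plt

def extremize(p, d):
    """Scale the log-odds of p by d: equivalent to p^d / (p^d + (1-p)^d)."""
    return p**d / (p**d + (1 - p)**d)

ds = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
p_bar = 0.65
plt.plot(ds, extremize(p_bar, ds), marker="o")
plt.xlabel("Extremizing factor d")
plt.ylabel("Extremized forecast")
plt.show()
```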

Exercise 13: Extremizing and Calibration (Intermediate) Generate 500 synthetic forecasting questions. For each, draw a true probability from Beta(2, 2), simulate 10 forecasters who observe the true probability with noise, compute the simple average, and then extremize with various values of $d$. Create calibration plots (reliability diagrams) for the raw average and the best extremized version. Which is better calibrated?

Exercise 14: Optimal Extremizing Factor (Intermediate) Using the shared-information model from Section 25.5, compute the theoretical optimal extremizing factor when: (a) $K = 5$ forecasters each observe $n = 3$ unique signals and $m = 2$ shared signals. (b) $K = 10$ forecasters with $n = 1$ unique signal and $m = 5$ shared signals. (c) $K = 3$ forecasters with $n = 10$ unique signals and $m = 0$ shared signals. Interpret each result.

Exercise 15: Logistic Recalibration (Intermediate) Implement the logistic recalibration approach to find the optimal extremizing factor. Using historical data from a prediction platform (real or simulated), fit the model: $$\text{logit}(P(Y=1)) = \beta_0 + \beta_1 \cdot \text{logit}(\bar{p})$$ Report $\beta_0$ and $\beta_1$ and interpret them.
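
A sketch of the fit using statsmodels on purely simulated data (the shrunk-toward-0.5 crowd average below is an illustrative assumption). A fitted slope $\beta_1 > 1$ says the average is under-confident and should be extremized:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
true_p = rng.beta(2, 2, size=2000)
y = rng.binomial(1, true_p)
# Simulated "crowd average": shrunk toward 0.5, the typical pattern extremizing corrects.
p_bar = np.clip(0.5 + 0.6 * (true_p - 0.5) + rng.normal(0, 0.03, size=2000), 0.01, 0.99)

logit = lambda p: np.log(p / (1 - p))
X = sm.add_constant(logit(p_bar))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)   # beta_0 (intercept) and beta_1 (slope on logit of the average)
```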

Section 25.6: Stacking

Exercise 16: Basic Stacking (Intermediate) Build a stacking ensemble for a binary classification problem. Use at least 4 base models (e.g., logistic regression, random forest, gradient boosting, SVM). Compare the stacking ensemble's Brier score against each individual model and the simple average. Use 5-fold cross-validated stacking to prevent information leakage.
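
A sketch of the leakage-free step, generating out-of-fold meta-features with scikit-learn's cross_val_predict; the dataset and base models here are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
    SVC(probability=True, random_state=0),
]

# Out-of-fold predictions become the meta-learner's features (prevents leakage).
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
meta_learner = LogisticRegression().fit(meta_X, y)
print("Meta-learner coefficients (log-odds scale):", meta_learner.coef_)
```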

Exercise 17: Meta-Learner Choice (Advanced) Extend Exercise 16 by trying three different meta-learners: (a) Logistic regression (no regularization). (b) Ridge logistic regression (L2 regularization). (c) Gradient boosting. Compare their performance. Which meta-learner works best? When might a more complex meta-learner be justified?

Exercise 18: Feature-Dependent Stacking (Advanced) Build a stacking ensemble where the meta-learner receives both the base model predictions and a context feature (e.g., time until event resolution). Generate synthetic data where one model is better early and another is better late. Show that the feature-dependent stacker outperforms the feature-independent stacker.

Section 25.7: Bayesian Model Averaging

Exercise 19: BIC-Based BMA (Intermediate) Fit three logistic regression models with different feature sets to a prediction problem. Compute the BIC for each, the posterior model probabilities, and the BMA prediction. Compare BMA to the single best model (by BIC).
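
A sketch of converting BIC values into posterior model probabilities, assuming uniform model priors so that $P(M_k \mid \text{data}) \propto \exp(-\text{BIC}_k / 2)$; the BIC numbers below are hypothetical:

```python
import numpy as np

def bma_weights(bics):
    """Posterior model probabilities from BIC values (uniform priors assumed)."""
    bics = np.asarray(bics, dtype=float)
    rel = np.exp(-0.5 * (bics - bics.min()))   # subtract the min for numerical stability
    return rel / rel.sum()

# Hypothetical BIC values for three fitted logistic regressions.
weights = bma_weights([412.3, 415.1, 430.8])
print(np.round(weights, 4))
# The BMA prediction is the weight-averaged predicted probability across models.
```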

Exercise 20: BMA Sensitivity (Advanced) Using the same setup as Exercise 19, investigate how BMA weights change as a function of: (a) Sample size $n$ (from 50 to 5000). (b) The number of candidate models $K$ (from 2 to 10). (c) The prior model probabilities (uniform vs. favoring simpler models). Discuss the practical implications.

Section 25.8: Market-Model Combination

Exercise 21: Linear Market-Model Combination (Intermediate) Generate synthetic data where a market price has Brier score 0.20 and a model has Brier score 0.18, with a correlation of 0.6 between their errors. Find the optimal combination weight using: (a) The analytical formula from Section 25.8. (b) Numerical optimization. (c) Do they agree?
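
For part (a), the formula in Section 25.8 is the one to use; as an illustration, here is a sketch of the standard minimum-variance weight for two correlated forecasts, treating the Brier scores as rough stand-ins for error variances (an assumption):

```python
import numpy as np

def min_variance_weight(var_model, var_market, rho):
    """Weight on the model in w*model + (1-w)*market that minimizes combined error variance."""
    cov = rho * np.sqrt(var_model * var_market)
    return (var_market - cov) / (var_model + var_market - 2 * cov)

# Illustration only: Brier scores used as approximate error variances.
w = min_variance_weight(0.18, 0.20, 0.6)
print(f"Weight on model: {w:.3f}, weight on market: {1 - w:.3f}")
```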

Exercise 22: When to Override the Market (Advanced) Design a simulation where a model has genuine information not reflected in the market price. The model detects a signal that will shift the true probability by 0.15, but the market has not yet reacted. Find the optimal combination weight in this scenario and compare it to the general-case weight. How much weight should the model receive when it has a genuine informational edge?

Section 25.9: Forecast Aggregation

Exercise 23: Robust Aggregation (Intermediate) Generate 30 forecasts for a single question from forecasters of varying quality. Include 3 "trolls" who give extreme (near 0 or near 1) forecasts. Compare: (a) Simple mean. (b) Trimmed mean (10%). (c) Winsorized mean (10%). (d) Median. (e) Performance-weighted mean (using historical accuracy). Which method is most robust to the trolls?
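
A sketch of the first four aggregators using scipy; the simulated forecaster and troll behavior below is a placeholder:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
honest = np.clip(rng.normal(0.6, 0.1, size=27), 0.01, 0.99)  # forecasters of varying quality
trolls = np.array([0.01, 0.99, 0.99])                        # extreme "troll" forecasts
forecasts = np.concatenate([honest, trolls])

print("Mean:            ", forecasts.mean())
print("Trimmed (10%):   ", stats.trim_mean(forecasts, 0.10))
print("Winsorized (10%):", stats.mstats.winsorize(forecasts, limits=(0.10, 0.10)).mean())
print("Median:          ", np.median(forecasts))
```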

Exercise 24: Optimal Trim Fraction (Intermediate) Using a simulation with varying fractions of "bad" forecasters (0%, 5%, 10%, 20%, 30%), find the optimal trim fraction for the trimmed mean. Plot the optimal trim fraction as a function of the fraction of bad forecasters.

Section 25.10: Diversity

Exercise 25: Ambiguity Decomposition (Intermediate) Create an ensemble of 4 models and empirically verify the ambiguity decomposition: $$\text{MSE}_{\text{ensemble}} = \overline{\text{MSE}} - \overline{\text{Diversity}}$$ Compute each term separately and verify the equation holds (up to numerical precision).
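
A sketch of the empirical check for a simple-average ensemble, where each model's diversity term is its squared deviation from the ensemble mean:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.5, size=500).astype(float)                # binary targets
preds = np.clip(y + rng.normal(0, 0.3, size=(4, 500)), 0, 1)    # 4 noisy model predictions

f_bar = preds.mean(axis=0)                     # ensemble (simple average)
mse_ensemble = np.mean((f_bar - y) ** 2)
avg_mse = np.mean((preds - y) ** 2)            # average member MSE
avg_diversity = np.mean((preds - f_bar) ** 2)  # average squared spread around the ensemble

print(mse_ensemble, avg_mse - avg_diversity)   # should agree to numerical precision
```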

Exercise 26: Creating Diversity (Advanced) Starting with a random forest classifier, create four diverse variants by: (a) Using different random seeds. (b) Using different max_depth values. (c) Using different feature subsets (50% of features each). (d) Training on different bootstrap samples of the data. Measure the diversity of each ensemble using the Q-statistic and error correlation. Which strategy creates the most diversity?
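
For the diversity measurement, a sketch of the pairwise Q-statistic on correct/incorrect indicator vectors, assuming the usual contingency-table definition (Yule's Q):

```python
import numpy as np

def q_statistic(correct_i, correct_j):
    """Yule's Q for two classifiers' correct (1) / incorrect (0) indicator vectors."""
    a = np.sum((correct_i == 1) & (correct_j == 1))   # both right
    b = np.sum((correct_i == 1) & (correct_j == 0))   # only i right
    c = np.sum((correct_i == 0) & (correct_j == 1))   # only j right
    d = np.sum((correct_i == 0) & (correct_j == 0))   # both wrong
    return (a * d - b * c) / (a * d + b * c)

# Example: two hypothetical correctness vectors over 8 test cases.
ci = np.array([1, 1, 0, 1, 0, 1, 1, 0])
cj = np.array([1, 0, 0, 1, 1, 1, 0, 0])
print(q_statistic(ci, cj))
```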

Exercise 27: Marginal Value Analysis (Advanced) Build an ensemble incrementally, adding one model at a time. After each addition, compute the ensemble's Brier score and the marginal value of the last model added. Plot both curves. At what point do diminishing returns set in?

Section 25.11: Practical Ensemble Building

Exercise 28: End-to-End Ensemble (Advanced) Build a complete ensemble forecasting system for a synthetic prediction market:

1. Generate 500 binary events with features.
2. Build 5 diverse base models.
3. Combine using (a) simple average, (b) weighted average, (c) extremized average, (d) stacking.
4. Evaluate all methods using cross-validation.
5. Select the best method and justify your choice.
6. Create a monitoring dashboard that tracks performance over time.

Exercise 29: Ensemble Tournament (Advanced) Run a tournament: generate 1000 binary events and have 10 ensemble methods compete:

1. Simple average
2. Inverse-Brier-score weighted average
3. Optimized weights (no shrinkage)
4. Optimized weights (with shrinkage)
5. Linear pool
6. Logarithmic pool
7. Extremized average (d = 1.5)
8. Extremized average (calibrated d)
9. Stacking with logistic regression
10. BMA with BIC weights

Rank them by Brier score, log loss, and calibration. Which method wins most consistently?

Exercise 30: Research Extension (Advanced) Read the paper "Combining probability forecasts" by Ranjan and Gneiting (2010). Implement their beta-transformed linear pool: $$p_{\text{BLP}} = F_{\text{Beta}}\!\left(\sum_{i=1}^{K} w_i \cdot p_i;\ \alpha, \beta\right)$$ where $F_{\text{Beta}}(\cdot\,; \alpha, \beta)$ is the CDF of a Beta$(\alpha, \beta)$ distribution. Compare it to the standard linear pool and the extremized linear pool on a simulated dataset. Under what conditions does the beta-transformed pool outperform the others?
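
A sketch of evaluating the BLP with scipy's Beta CDF; fitting $\alpha$ and $\beta$ (typically by maximum likelihood on training data) is left to the exercise, and the parameter values below are placeholders:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def blp(forecasts, weights, a, b):
    """Beta-transformed linear pool: Beta(a, b) CDF applied to the linear pool."""
    linear = float(np.dot(weights, forecasts))
    return beta_dist.cdf(linear, a, b)

ps = np.array([0.80, 0.85, 0.90])
w = np.full(3, 1 / 3)
print("Linear pool:", float(w @ ps))
print("BLP (a = b = 2, an extremizing choice):", blp(ps, w, 2, 2))
```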