Chapter 25 Key Takeaways
Core Principle
Combining multiple imperfect forecasts almost always produces a forecast better than the typical individual component, and frequently better than even the best one. This is one of the most robust findings in forecasting research.
The Big Ideas
1. Diversity Is Everything
The quality of an ensemble depends far more on the diversity of its components than on the quality of any individual component. The ambiguity decomposition makes this precise:
$$\text{MSE}_{\text{ensemble}} = \overline{\text{MSE}}_{\text{individual}} - \overline{\text{Diversity}}$$
The ensemble error is always less than or equal to the average individual error, and the gap is exactly the diversity. Invest your effort in creating diverse models, not in perfecting any single model.
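To see the identity at work, here is a minimal NumPy sketch on synthetic data (the noise scales and sample size are purely illustrative, not from the chapter) that verifies the decomposition numerically for an equal-weight ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: true outcomes and three imperfect forecasters.
y = rng.normal(size=500)                                    # true values
forecasts = np.stack([
    y + rng.normal(scale=0.8, size=500),                    # model A
    y + rng.normal(scale=1.0, size=500),                    # model B
    y + rng.normal(scale=1.2, size=500),                    # model C
])

ensemble = forecasts.mean(axis=0)                           # equal-weight combination

mse_individual = ((forecasts - y) ** 2).mean(axis=1)        # per-model MSE
diversity = ((forecasts - ensemble) ** 2).mean(axis=1)      # spread around the ensemble
mse_ensemble = ((ensemble - y) ** 2).mean()

# Ambiguity decomposition: ensemble MSE = mean individual MSE - mean diversity.
# The two printed numbers agree up to floating-point error.
print(mse_ensemble, mse_individual.mean() - diversity.mean())
```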
2. Simple Averaging Is Surprisingly Hard to Beat
The "forecast combination puzzle" tells us that equal-weight averages often match or beat sophisticated optimization. Use simple averaging as your baseline and demand clear evidence before moving to more complex methods. When in doubt, average.
3. Extremizing Corrects a Systematic Problem
When forecasters share information, their simple average is systematically too moderate. Extremizing -- pushing the aggregate away from 50% -- corrects this. The optimal extremizing factor can be estimated via logistic recalibration:
$$\text{logit}(p_{\text{ext}}) = d \cdot \text{logit}(\bar{p})$$
Typical values: $d \in [1.5, 3.0]$.
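A small sketch of extremizing in logit space; the default d = 1.5 and the clipping constant are illustrative assumptions, and d itself should be fit on resolved questions:

```python
import numpy as np

def extremize(probs, d=1.5, eps=1e-6):
    """Push an averaged probability forecast away from 0.5 in logit space.

    probs : aggregated probabilities (e.g. the crowd mean)
    d     : extremizing factor; d > 1 sharpens, d = 1 leaves forecasts unchanged
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)   # avoid infinite logits
    logit = np.log(p / (1 - p))
    # d can be estimated by logistic recalibration on resolved questions.
    return 1 / (1 + np.exp(-d * logit))

# A crowd average of 0.70 becomes a sharper forecast:
print(extremize(0.70, d=1.5))   # ~0.78
```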
4. Stacking Learns What Averaging Cannot
Stacking uses a meta-learner to discover non-linear, context-dependent combination strategies. Use cross-validated stacking to prevent information leakage. This is the go-to approach when you have sufficient data (100+ resolved predictions).
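One possible scikit-learn sketch of cross-validated stacking; the base models, the 5-fold split, and the synthetic data are illustrative choices, not prescriptions from the chapter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic resolved questions: features X, binary outcomes y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

base_models = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(n_estimators=100, random_state=0),
               GaussianNB()]

# Out-of-fold probabilities: each base forecast comes from a fold that never
# saw that question, which prevents leakage into the meta-learner.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

meta = LogisticRegression().fit(oof, y)   # meta-learner combines the base forecasts
print(meta.coef_)                         # learned combination weights (log-odds scale)

# For live use, refit each base model on all data and feed its forecasts to `meta`.
```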
5. Market Prices Are a Model, Not Ground Truth
A prediction market price should be treated as one input to your ensemble, not as the definitive answer. Weight it according to market liquidity, maturity, and how it compares to your own models' track records.
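One way to operationalize this, sketched under assumptions (the function name and default weight are illustrative): blend the two probabilities in logit space, with the market weight as a free parameter to calibrate on resolved questions.

```python
import numpy as np

def combine_with_market(p_model, p_market, w_market=0.5, eps=1e-6):
    """Blend a personal model forecast with a prediction-market price in logit space.

    w_market is the weight on the market; tune it on resolved questions, and
    lean toward higher values for liquid, mature markets.
    """
    def logit(p):
        p = np.clip(p, eps, 1 - eps)
        return np.log(p / (1 - p))

    z = w_market * logit(p_market) + (1 - w_market) * logit(p_model)
    return 1 / (1 + np.exp(-z))

# Your model says 80%, a thin market says 60%; with moderate trust in the market:
print(combine_with_market(0.80, 0.60, w_market=0.4))   # ~0.73
```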
Method Selection Guide
| Situation | Recommended Method |
|---|---|
| Few historical observations (< 30) | Simple average |
| Moderate history (30-100) | Weighted average with shrinkage |
| Abundant history (100+) | Stacking or calibrated extremizing |
| Models share lots of information | Extremized average |
| Need principled uncertainty | Bayesian Model Averaging |
| Many forecasters, some unreliable | Trimmed mean or median |
| Market price + personal model | Linear combination calibrated on data |
Common Pitfalls
- Overfitting weights: Optimized combination weights can overfit to historical data. Always use shrinkage, regularization, or cross-validation (see the shrinkage sketch after this list).
- Insufficient diversity: Adding a 5th polling model to an ensemble of 4 polling models provides minimal improvement. Invest in model variety.
- Ignoring non-stationarity: Model performance rankings change over time. Recalibrate weights periodically.
- Over-extremizing: Pushing forecasts too far from 50% can produce overconfident predictions. Calibrate the extremizing factor carefully.
- Information leakage in stacking: Always use out-of-fold predictions for meta-learner training.
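As noted in the first pitfall, one simple guard against overfitted weights is to pull them back toward equal weights. A minimal sketch; the shrinkage value is an assumed default within the 0.3-0.5 range cited below:

```python
import numpy as np

def shrink_to_equal(fitted_w, shrinkage=0.4):
    """Pull estimated combination weights toward equal weights.

    shrinkage=0 keeps the fitted weights; shrinkage=1 returns a plain average.
    """
    fitted_w = np.asarray(fitted_w, dtype=float)
    equal_w = np.full_like(fitted_w, 1.0 / fitted_w.size)
    w = (1 - shrinkage) * fitted_w + shrinkage * equal_w
    return w / w.sum()          # renormalize so the weights still sum to one

print(shrink_to_equal([0.7, 0.25, 0.05], shrinkage=0.4))   # roughly [0.55, 0.28, 0.16]
```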
Numbers to Remember
- 3-5 diverse models capture most of the ensemble benefit
- Extremizing factor d = 1.5 is a reasonable default for crowd forecasts
- Shrinkage of 0.3-0.5 toward equal weights balances optimization and robustness
- Pairwise error correlation < 0.5 is desirable for new ensemble members (a quick check is sketched below)
- 95%+ of Kaggle competition winners use ensembles
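A quick way to apply the correlation rule of thumb, shown here on synthetic error series; the helper name and the 0.5 threshold are assumptions for illustration:

```python
import numpy as np

def worth_adding(candidate_errors, existing_errors, threshold=0.5):
    """Check whether a candidate's errors (forecast - outcome, per question) are
    weakly correlated with every existing member's errors."""
    corrs = [np.corrcoef(candidate_errors, e)[0, 1] for e in existing_errors]
    return max(corrs) < threshold, corrs

# Synthetic error series for two existing members and one candidate:
rng = np.random.default_rng(0)
existing = [rng.normal(size=200), rng.normal(size=200)]
candidate = 0.3 * existing[0] + rng.normal(size=200)   # mildly correlated with member 0
print(worth_adding(candidate, existing))
```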
The One-Sentence Summary
Build diverse models, combine them simply, extremize if they share information, and always validate that your ensemble actually outperforms its components on out-of-sample data.