Chapter 30 Key Takeaways: Model Evaluation and Selection
Key Concepts
1. Proper scoring rules are the foundation of probabilistic model evaluation. Accuracy measures only the binary decision; proper scoring rules (Brier score, log loss) evaluate the full probability distribution. A scoring rule is proper if the expected score is optimized when the forecaster reports their true belief, ensuring honest probability assessment. For sports betting, where bet sizing depends on probability magnitude rather than just direction, proper scoring rules are essential.
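A minimal sketch of both scores, assuming NumPy arrays of predicted probabilities and binary outcomes (the example numbers are illustrative):

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return np.mean((p - y) ** 2)

def log_loss(p, y, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Three hypothetical home-win predictions and their outcomes.
p = [0.70, 0.55, 0.20]
y = [1, 0, 0]
print(brier_score(p, y), log_loss(p, y))
```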
2. The Brier score decomposition provides diagnostic insight. Decomposing BS = Reliability - Resolution + Uncertainty separates calibration quality (reliability) from discriminative ability (resolution). Despite the name, the reliability term measures miscalibration, so lower is better. The decomposition identifies the specific source of model weakness: a high reliability term means the model is poorly calibrated (fixable through recalibration), while low resolution means the model lacks discriminative power (requires better features or architecture). Good betting models need both a low reliability term and high resolution.
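A minimal sketch of the binned (Murphy-style) decomposition; it is exact only when forecasts within each bin are constant, so treat the output as an approximation:

```python
import numpy as np

def brier_decomposition(p, y, n_bins=10):
    """Approximate Murphy decomposition BS = REL - RES + UNC using equal-width bins."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    n, y_bar = len(y), y.mean()
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    rel = res = 0.0
    for k in range(n_bins):
        mask = bins == k
        if not mask.any():
            continue
        w = mask.sum() / n
        rel += w * (p[mask].mean() - y[mask].mean()) ** 2   # miscalibration (lower is better)
        res += w * (y[mask].mean() - y_bar) ** 2            # discrimination (higher is better)
    unc = y_bar * (1 - y_bar)                               # outcome variance (fixed by the data)
    return rel, res, unc
```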
3. Brier Skill Score benchmarks against the market. BSS = 1 - BS_model / BS_reference, where the reference should be market-implied probabilities for betting applications. A positive BSS relative to the market is a necessary (but not sufficient) condition for profitable betting. A negative BSS means your model is worse than the market and should not be used for bet selection.
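A minimal sketch, assuming the reference probabilities are market-implied win probabilities with the vig already removed upstream:

```python
import numpy as np

def brier_skill_score(p_model, p_market, y):
    """BSS = 1 - BS_model / BS_reference; positive means the model beats the market."""
    p_model, p_market, y = (np.asarray(a, float) for a in (p_model, p_market, y))
    bs_model = np.mean((p_model - y) ** 2)
    bs_market = np.mean((p_market - y) ** 2)
    return 1.0 - bs_model / bs_market
```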
4. Backtesting must avoid look-ahead bias. The four most common sources of look-ahead bias are feature leakage (features built from data that arrives after prediction time), training leakage (the training set includes games from the period being predicted), odds leakage (using closing odds when the bet would actually be placed at opening odds), and model selection bias (choosing configurations after seeing results). Proper backtests retrain at each time step using only historically available data.
5. Walk-forward validation respects temporal ordering. Standard k-fold cross-validation produces optimistically biased estimates for sports data because it allows future information to leak into training. Walk-forward validation (expanding or sliding window) always trains on past data and evaluates on future data. Purge gaps prevent additional leakage through overlapping feature windows.
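A minimal sketch of an index generator for expanding or sliding windows with a purge gap (the function name and parameters are illustrative):

```python
def walk_forward_splits(n, initial_train, test_size, purge_gap=0, expanding=True):
    """Yield (train_indices, test_indices) for walk-forward validation.

    A purge_gap of g drops the g observations immediately before each test
    window so rolling features in the training set never overlap the test set.
    """
    start, train_end = 0, initial_train
    while train_end + purge_gap + test_size <= n:
        yield (list(range(start, train_end)),
               list(range(train_end + purge_gap, train_end + purge_gap + test_size)))
        train_end += test_size
        if not expanding:          # sliding window: advance the left edge too
            start += test_size
```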
6. Calibration determines probability quality. A well-calibrated model produces honest probabilities: when it predicts 70%, the event occurs 70% of the time. Expected Calibration Error (ECE) quantifies the average deviation from perfect calibration. Recalibration techniques (Platt scaling, isotonic regression) can correct miscalibrated models without retraining, but they cannot improve discrimination.
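A minimal sketch of ECE (mirroring the formula in the Key Formulas table) plus isotonic recalibration via scikit-learn; p_cal, y_cal, and p_new are placeholders for a held-out calibration set and fresh predictions:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def expected_calibration_error(p, y, n_bins=10):
    """Bin-weighted average of |observed frequency - mean predicted probability|."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return ece

# Isotonic recalibration: fit a monotone map from raw to calibrated probabilities
# on a held-out set, then apply it to new predictions (placeholder variable names).
# iso = IsotonicRegression(out_of_bounds="clip").fit(p_cal, y_cal)
# p_recalibrated = iso.predict(p_new)
```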
7. Statistical tests prevent false model selection. The Diebold-Mariano (DM) test determines whether the difference in forecast loss between two models is statistically significant or could be due to random variation. Without this test, you risk selecting a model that happened to perform well by chance. When differences are not significant, prefer the simpler model (parsimony principle).
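A minimal sketch of a one-step-ahead DM test, assuming per-game losses (e.g., squared probability errors) from two models aligned on the same games; longer forecast horizons would need a HAC variance estimate:

```python
import numpy as np
from scipy import stats

def diebold_mariano(loss_a, loss_b):
    """Two-sided DM test for equal predictive accuracy with horizon h = 1."""
    d = np.asarray(loss_a, float) - np.asarray(loss_b, float)   # loss differential
    dm = d.mean() / np.sqrt(d.var(ddof=1) / len(d))
    p_value = 2 * stats.norm.sf(abs(dm))                        # two-sided normal p-value
    return dm, p_value
```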
8. Information criteria balance fit and complexity. AIC and BIC penalize the log-likelihood by the number of parameters, discouraging overly complex models. BIC penalizes complexity more heavily than AIC whenever ln n > 2 (i.e., n >= 8) and is generally preferred for sports betting because overfitting is the dominant risk.
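A minimal sketch of both criteria, given a fitted model's total log-likelihood and parameter count (lower values are better):

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L_hat."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L_hat."""
    return k * math.log(n) - 2 * log_likelihood

# For a binary model, the total log-likelihood is -n times the mean log loss.
```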
9. Ensembling reduces prediction variance. Averaging predictions from multiple models lets errors partially cancel when the models make different, weakly correlated mistakes. The benefit of ensembling is proportional to the diversity of errors across models. Weighting by inverse Brier score concentrates ensemble weight on the best-performing models.
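A minimal sketch of inverse-Brier-score weighting, assuming a (models x games) matrix of predicted probabilities and each model's validation Brier score (names are illustrative):

```python
import numpy as np

def inverse_brier_ensemble(pred_matrix, brier_scores):
    """Weighted average of model predictions with weights proportional to 1 / Brier score."""
    preds = np.asarray(pred_matrix, float)        # shape: (n_models, n_games)
    w = 1.0 / np.asarray(brier_scores, float)     # lower Brier score -> larger weight
    return (w / w.sum()) @ preds                  # shape: (n_games,)
```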
10. Profitability requires more than good predictions. A model with excellent Brier score may still lose money if the market is more accurate or the vigorish exceeds the model's edge. The full evaluation pipeline must include scoring rules, calibration analysis, walk-forward validation, statistical model comparison, and realistic backtesting with transaction costs.
Key Formulas
| Formula | Description |
|---|---|
| $\text{BS} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)^2$ | Brier score (mean squared probability error) |
| $\text{BS} = \text{REL} - \text{RES} + \text{UNC}$ | Brier score decomposition |
| $\text{BSS} = 1 - \frac{\text{BS}_{\text{model}}}{\text{BS}_{\text{reference}}}$ | Brier Skill Score (positive = better than reference) |
| $\text{LL} = -\frac{1}{n}\sum_{i=1}^{n}[y_i\log p_i + (1-y_i)\log(1-p_i)]$ | Log loss (binary cross-entropy) |
| $\text{ECE} = \sum_{k=1}^{K} \frac{n_k}{n}\lvert\bar{y}_k - \bar{p}_k\rvert$ | Expected Calibration Error |
| $\text{MCE} = \max_k \lvert\bar{y}_k - \bar{p}_k\rvert$ | Maximum Calibration Error |
| $\text{AIC} = 2k - 2\ln\hat{L}$ | Akaike Information Criterion |
| $\text{BIC} = k\ln n - 2\ln\hat{L}$ | Bayesian Information Criterion |
| $\text{DM} = \frac{\bar{d}}{\sqrt{\hat{V}(\bar{d})}}$ | Diebold-Mariano test statistic |
| $f^* = \frac{bp - q}{b}$ | Kelly criterion optimal fraction ($b$ = net odds, $q = 1 - p$) |
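As a quick check on the last row, a minimal worked sketch with illustrative numbers (at even money b = 1, so a 55% model probability implies a 10% full-Kelly stake):

```python
def kelly_fraction(p, b):
    """Full-Kelly fraction f* = (b*p - q) / b, with q = 1 - p and b the net odds."""
    q = 1.0 - p
    return (b * p - q) / b

full = kelly_fraction(0.55, 1.0)    # 0.10 -> 10% of bankroll at full Kelly
quarter = 0.25 * full               # 0.025 -> 2.5% at quarter Kelly
```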
Decision Framework: Model Evaluation and Selection
START: You have trained one or more models and generated predictions.
Step 1: Compute Proper Scoring Rules
- Brier score and log loss for each model.
- Brier score decomposition (reliability, resolution, uncertainty).
- Brier Skill Score relative to market-implied probabilities.
-> If BSS vs. market < 0 for all models: Stop. Your models do not
add value beyond the market.
Step 2: Assess Calibration
- Generate reliability diagrams.
- Compute ECE and MCE.
-> If ECE > 0.05: Apply Platt scaling or isotonic recalibration.
-> If ECE < 0.02 after recalibration: Proceed.
Step 3: Validate with Walk-Forward Scheme
- Use expanding or sliding window (NOT standard k-fold).
- Set purge gap >= longest feature lookback window.
- Report mean and std of fold-level Brier scores.
-> If fold-level std is high: Model may be unstable; investigate.
Step 4: Compare Models Statistically
- Compute AIC/BIC for complexity-penalized comparison.
- Run pairwise Diebold-Mariano tests.
-> If DM p-value < 0.05: Select the significantly better model.
-> If DM p-value >= 0.05: Prefer the simpler model (parsimony).
Step 5: Backtest with Realistic Betting Conditions
- Apply vigorish (typically -110/-110, an overround of about 4.76%).
- Use fractional Kelly staking (25% of full Kelly).
- Report ROI, Sharpe ratio, max drawdown, total bets.
- Bootstrap 95% CI for ROI (see the sketch after this framework).
Step 6: Select Final Model (or Ensemble)
- If multiple models are comparable: Ensemble them.
- Validate ensemble on held-out test period.
- Document model choice and supporting evidence.
END: Deploy with monitoring and pre-defined stopping criteria.
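Step 5 calls for a bootstrap CI on ROI; a minimal sketch, assuming per-bet profits expressed in stake units (e.g., +0.909 for a winning -110 bet, -1.0 for a loss); the function name and defaults are illustrative:

```python
import numpy as np

def bootstrap_roi_ci(unit_profits, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ROI (mean profit per unit staked)."""
    rng = np.random.default_rng(seed)
    profits = np.asarray(unit_profits, float)
    rois = np.array([rng.choice(profits, size=len(profits), replace=True).mean()
                     for _ in range(n_boot)])
    return np.quantile(rois, [alpha / 2, 1 - alpha / 2])
```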
Self-Assessment Checklist
After completing Chapter 30, you should be able to answer "yes" to each of the following:
- [ ] I can compute the Brier score and log loss by hand for a small set of predictions.
- [ ] I can decompose the Brier score into reliability, resolution, and uncertainty and interpret each component.
- [ ] I can compute the Brier Skill Score and explain what a positive or negative value means relative to the market.
- [ ] I can explain why accuracy is insufficient for evaluating probabilistic sports predictions.
- [ ] I can identify at least four sources of look-ahead bias in betting backtests.
- [ ] I can implement a walk-forward validation scheme with expanding or sliding windows.
- [ ] I can explain why standard k-fold cross-validation is invalid for time series sports data.
- [ ] I can set an appropriate purge gap based on feature lookback windows.
- [ ] I can construct and interpret a reliability diagram.
- [ ] I can compute ECE and MCE and explain what they measure.
- [ ] I can apply Platt scaling or isotonic regression for recalibration.
- [ ] I can implement the Diebold-Mariano test and interpret its p-value.
- [ ] I can compute AIC and BIC and explain when each is preferred.
- [ ] I can design a model ensemble with inverse Brier score weighting.
- [ ] I can design a complete model evaluation pipeline that includes scoring rules, validation, statistical comparison, and backtesting.
- [ ] I can explain the relationship between calibration quality and betting profitability.
- [ ] I understand when to select a simpler model over a slightly more accurate complex model.
Common Mistakes to Avoid
- Using accuracy instead of proper scoring rules. Accuracy discards all information about prediction confidence, which is critical for bet sizing.
- Using standard k-fold cross-validation on sports data. Random shuffling violates temporal ordering and produces optimistically biased performance estimates.
- Omitting the purge gap in walk-forward validation. If features include rolling statistics, observations near the train/test boundary share overlapping data.
- Selecting models based on backtest performance without statistical testing. The best of N models will appear to outperform by chance alone; the DM test determines whether the difference is real.
- Deploying models without recalibration. Most models are somewhat overconfident after training. Recalibration on a held-out set can significantly improve betting decisions.
- Confusing a good Brier score with profitability. A model must beat the market (not just the base rate) and overcome the vigorish to be profitable.
- Backtesting with model selection bias. Trying many configurations and reporting the best result inflates apparent performance. Use a three-way split (train / model selection / final test).
- Running multiple DM tests without correction. When comparing many models pairwise, apply Bonferroni correction or use the Model Confidence Set procedure to control false discoveries.
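For the last item, a minimal sketch of the Bonferroni adjustment, assuming a list of pairwise DM p-values (the function name is illustrative):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which pairwise p-values survive a Bonferroni correction.

    With m comparisons, each p-value is tested against alpha / m, which
    controls the family-wise error rate at alpha.
    """
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```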