Chapter 18: Key Takeaways - Game Outcome Prediction
Quick Reference Summary
This chapter covered building complete game prediction systems using rating systems, feature engineering, and machine learning.
Core Concepts
Prediction Targets
| Target | Output | Use Case | Key Metric |
|---|---|---|---|
| Win Probability | 0-1 probability | Broadcasting, analytics | Brier Score |
| Point Spread | Expected margin | Betting, power rankings | MAE |
| Straight-Up | Binary winner | Season records | Accuracy |
| Total Points | Combined score | Over/under betting | RMSE |
Theoretical Limits
Prediction Ceiling: ~75-78%
- Inherent game-to-game variance is large (margin σ ≈ 13.5 points)
- Even perfect knowledge of the "true" spread caps straight-up accuracy near this ceiling
- The remaining random component is irreducible
Good Model Targets:
- Accuracy: 65-70%
- AUC-ROC: 0.72-0.78
- Brier Score: < 0.22
Essential Formulas
Spread-Probability Conversion
P(win) = 1 - Φ(0 | μ = -spread, σ = 13.5)
P(win) ≈ 1 / (1 + e^(spread / 8))  # Logistic approximation (spread < 0 for favorites)
spread = -σ × Φ⁻¹(P(win))
Example conversions:
-7 points → 70% win probability
-14 points → 85% win probability
-3 points → 59% win probability
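The conversions above can be sketched with the standard library's `NormalDist`; the 13.5-point sigma is the value quoted in this chapter, and the function names are illustrative.

```python
from statistics import NormalDist

SIGMA = 13.5  # std. dev. of the scoring margin used in this chapter

def spread_to_prob(spread, sigma=SIGMA):
    """Win probability for a team laying `spread` points (negative = favorite)."""
    # Margin ~ Normal(-spread, sigma); the team wins when margin > 0
    return 1 - NormalDist(mu=-spread, sigma=sigma).cdf(0)

def prob_to_spread(p, sigma=SIGMA):
    """Implied spread (negative = favorite) for win probability `p`."""
    return -sigma * NormalDist().inv_cdf(p)
```

Round-tripping `spread_to_prob(-7)` through `prob_to_spread` recovers -7, and the outputs match the example conversions above to within rounding.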
Elo Rating Update
Expected Score:
E = 1 / (1 + 10^((R_opp - R_self) / 400))
Rating Update:
R_new = R_old + K × (Actual - Expected)
Home Field Adjustment:
Add ~65 Elo points to the home team's rating when computing its expected score
Season Reversion:
R_season_start = R_prior + 0.33 × (1500 - R_prior)
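The formulas above can be walked through numerically; the teams and ratings here are illustrative, not from the chapter.

```python
# Expected score from the Elo logistic formula
def elo_expected(r_self, r_opp):
    return 1 / (1 + 10 ** ((r_opp - r_self) / 400))

# Home team rated 1600 gets the ~65-point home-field bump; away team rated 1550
e_home = elo_expected(1600 + 65, 1550)  # ≈ 0.66

# Home team wins, K = 20 update
r_home_new = 1600 + 20 * (1 - e_home)   # ≈ 1606.8

# Offseason: revert one third of the distance back to 1500
r_next_season = r_home_new + 0.33 * (1500 - r_home_new)  # ≈ 1571.6
```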
Brier Skill Score
BSS = 1 - (BS_model / BS_baseline)
Where:
BS = mean((P_predicted - Y_actual)²)
BS_baseline = Brier score of always predicting the base rate (e.g., the home win rate), which equals the outcome variance
BSS > 0: Model beats the baseline
BSS = 1: Perfect prediction
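A minimal sketch of the skill-score computation, with the base-rate forecast as the baseline (function name is illustrative):

```python
import numpy as np

def brier_skill_score(y_true, p_pred):
    """BSS against a constant forecast of the base rate (the outcome mean)."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    bs_model = np.mean((p - y) ** 2)
    base = y.mean()                          # e.g., the home win rate
    bs_baseline = np.mean((base - y) ** 2)   # equals the outcome variance
    return 1 - bs_model / bs_baseline
```

Perfect predictions give BSS = 1, and predicting the base rate itself gives BSS = 0.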
Code Patterns
Elo Rating System
```python
class EloSystem:
    def __init__(self, k=20, hfa=65, reversion=0.33):
        self.k = k                  # update step size
        self.hfa = hfa              # home-field advantage in Elo points
        self.reversion = reversion  # fraction reverted to 1500 each offseason
        self.ratings = {}

    def expected_prob(self, home, away):
        """Expected home win probability, including home-field advantage."""
        h_rating = self.ratings.get(home, 1500) + self.hfa
        a_rating = self.ratings.get(away, 1500)
        return 1 / (1 + 10 ** ((a_rating - h_rating) / 400))

    def update(self, home, away, home_won):
        """Zero-sum rating update after a game."""
        expected = self.expected_prob(home, away)
        actual = 1.0 if home_won else 0.0
        change = self.k * (actual - expected)
        self.ratings[home] = self.ratings.get(home, 1500) + change
        self.ratings[away] = self.ratings.get(away, 1500) - change
```
Feature Engineering
```python
def create_game_features(home_stats, away_stats):
    """Create differential features from pre-game team stats."""
    features = {}
    # Differentials: home minus away
    for stat in ['off_epa', 'def_epa', 'win_pct', 'ppg']:
        features[f'{stat}_diff'] = home_stats[stat] - away_stats[stat]
    # Aggregates: overall game quality
    features['total_quality'] = home_stats['total_epa'] + away_stats['total_epa']
    # Indicators
    features['home_field'] = 1
    return features
```
Ensemble Prediction
```python
import numpy as np
from sklearn.metrics import brier_score_loss

def ensemble_predict(elo_prob, ml_prob, weights=None):
    """Combine Elo and ML predictions as a weighted average."""
    weights = weights if weights is not None else {'elo': 0.35, 'ml': 0.65}
    return weights['elo'] * elo_prob + weights['ml'] * ml_prob

# Optimize the Elo weight on a validation set
# (y_val, elo, ml: validation labels and per-model probabilities)
best_weight = min(
    np.arange(0.1, 0.6, 0.05),
    key=lambda w: brier_score_loss(y_val, w * elo + (1 - w) * ml)
)
```
Temporal Cross-Validation
```python
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def temporal_cv(X, y, dates, model, n_splits=5):
    """Cross-validate with temporal ordering preserved."""
    sorted_idx = dates.argsort()
    X_sorted = X.iloc[sorted_idx]
    y_sorted = y.iloc[sorted_idx]
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = cross_val_score(model, X_sorted, y_sorted, cv=tscv)
    return scores.mean(), scores.std()
```
Model Selection Guide
| Scenario | Recommended Approach |
|---|---|
| Limited data (< 500 games) | Elo ratings alone |
| Standard prediction | Elo + Logistic Regression ensemble |
| Complex feature interactions | Gradient Boosting |
| Production system | Calibrated ensemble with multiple models |
| Real-time updates | Elo (low computational cost) |
| Maximum accuracy | Neural network ensemble |
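The "calibrated ensemble" row above can be sketched with scikit-learn's `CalibratedClassifierCV` wrapping a boosted base model; the data here is synthetic and the feature meanings are assumed for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))  # e.g., rating, EPA, and rest differentials
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Sigmoid (Platt) calibration on top of the base model's cross-validated scores
model = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=3)
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]
```

The calibrated wrapper trades a little raw accuracy for probability estimates that better match observed frequencies, which matters for the Brier-score and ECE targets above.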
Common Pitfalls
1. Data Leakage
Wrong: Using game-outcome features to predict the game outcome

```python
# BAD - includes future information
features = ['home_score', 'away_score', 'final_margin']
```

Right: Use only pre-game information

```python
# GOOD - only prior information
features = ['home_rating', 'away_rating', 'rest_days']
```
2. Random Split for Time Series
Wrong: Random train/test split

```python
# BAD - shuffling puts future games in training, past games in test
X_train, X_test = train_test_split(X, test_size=0.2)
```

Right: Temporal split

```python
# GOOD - always train on the past, test on the future
train = data[data['season'] < 2023]
test = data[data['season'] >= 2023]
```
3. Ignoring Calibration
Wrong: Only checking accuracy

```python
# BAD - high accuracy can hide poor probabilities
print(f"Accuracy: {accuracy:.1%}")
```

Right: Assess calibration as well

```python
# GOOD - check probability quality
from sklearn.metrics import brier_score_loss
print(f"Accuracy: {accuracy:.1%}")
print(f"Brier Score: {brier:.4f}")
print(f"ECE: {ece:.4f}")
```
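The ECE reported above can be computed by binning predictions and comparing each bin's mean prediction to its observed win rate; this is a minimal equal-width-bin sketch (the function name is illustrative).

```python
import numpy as np

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted mean gap between predicted probability and observed frequency."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Bin weight times |mean predicted - mean observed|
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece
```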
4. Overfitting to Small Samples
Wrong: Complex model with few games

```python
# BAD - too many parameters for the data size
model = RandomForestClassifier(n_estimators=500, max_depth=20)
model.fit(X_train[:100], y_train[:100])  # Only 100 games!
```

Right: Match model complexity to data size

```python
# GOOD - simpler, regularized model for small data
model = LogisticRegression(C=0.1)
model.fit(X_train, y_train)
```
Evaluation Checklist
Before Training
- [ ] No data leakage (future info not in features)
- [ ] Temporal split implemented
- [ ] Features use only pre-game information
- [ ] Baseline accuracy calculated
Model Comparison
- [ ] Multiple models evaluated
- [ ] Cross-validation used (TimeSeriesSplit)
- [ ] Feature importance analyzed
- [ ] Ensemble combinations tested
Evaluation
- [ ] Accuracy vs. baseline compared
- [ ] Brier Score calculated
- [ ] Calibration curve plotted
- [ ] ATS performance measured (if applicable)
- [ ] Confidence stratification analyzed
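The ATS (against-the-spread) item in the checklist can be measured as the fraction of games where the model's side of the Vegas line covered; this sketch assumes home-perspective spreads (negative = home favorite) and excludes pushes.

```python
import numpy as np

def ats_win_rate(pred_margin, actual_margin, vegas_spread):
    """Fraction of non-push games where the model's side of the line covered."""
    pred = np.asarray(pred_margin, dtype=float)
    actual = np.asarray(actual_margin, dtype=float)
    line = -np.asarray(vegas_spread, dtype=float)  # line as expected home margin
    picks_home = pred > line        # model likes the home side of the line
    home_covered = actual > line
    decided = actual != line        # drop pushes
    return np.mean(picks_home[decided] == home_covered[decided])
```

A value persistently below 50% here is the "not beating random" red flag listed later in this chapter.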
Quick Reference Tables
Typical Metrics by Model Quality
| Quality Level | Accuracy | AUC-ROC | Brier Score |
|---|---|---|---|
| Poor | < 55% | < 0.55 | > 0.28 |
| Baseline | 55-60% | 0.55-0.65 | 0.25-0.28 |
| Good | 60-65% | 0.65-0.72 | 0.22-0.25 |
| Very Good | 65-70% | 0.72-0.78 | 0.18-0.22 |
| Excellent | > 70% | > 0.78 | < 0.18 |
Feature Importance Rankings (Typical)
| Rank | Feature Type | Description |
|---|---|---|
| 1 | Elo/Rating Difference | Overall team quality |
| 2 | Efficiency Differential | EPA, success rate |
| 3 | Recent Form | Last 5 games performance |
| 4 | Win Percentage Diff | Season record comparison |
| 5 | Home Field | ~3 point advantage |
| 6 | Rest Differential | Bye week, short week |
| 7 | Turnover Margin | Ball security |
| 8 | Head-to-Head | Historical matchup |
Spread to Probability Quick Reference
| Spread | Win Prob | Upset Rate |
|---|---|---|
| -21 | 94% | 6% |
| -14 | 85% | 15% |
| -10 | 77% | 23% |
| -7 | 70% | 30% |
| -3 | 59% | 41% |
| Pick | 50% | 50% |
| +3 | 41% | N/A |
| +7 | 30% | N/A |
Red Flags
Warning signs your model may have issues:
- Accuracy > 75% → Likely data leakage
- Train >> Test accuracy → Overfitting
- Probabilities clustered near 0.5 → Model not discriminating
- ECE > 0.05 → Poor calibration
- ATS < 50% → Not beating random
- Negative Brier Skill Score → Worse than baseline
Next Steps
After mastering game prediction, proceed to:
- Chapter 19: Player Performance Forecasting
- Chapter 20: Recruiting Analytics
- Chapter 21: Win Probability Models
- Chapter 22: Machine Learning Applications