Chapter 18: Key Takeaways - Game Outcome Prediction

Quick Reference Summary

This chapter covered building complete game prediction systems using rating systems, feature engineering, and machine learning.


Core Concepts

Prediction Targets

Target           Output            Use Case                 Key Metric
---------------  ----------------  -----------------------  -----------
Win Probability  0-1 probability   Broadcasting, analytics  Brier Score
Point Spread     Expected margin   Betting, power rankings  MAE
Straight-Up      Binary winner     Season records           Accuracy
Total Points     Combined score    Over/under betting       RMSE

Theoretical Limits

Prediction Ceiling: ~75-78%
- Due to inherent game variance (σ ≈ 13.5 points)
- Even perfect knowledge of the "true" spread cannot push accuracy past this ceiling
- Random component is irreducible

Good Model Targets:
- Accuracy: 65-70%
- AUC-ROC: 0.72-0.78
- Brier Score: < 0.22

Essential Formulas

Spread-Probability Conversion

P(win) = 1 - Φ(spread / σ),  σ ≈ 13.5  (spread < 0 for the favorite)
P(win) ≈ 1 / (1 + e^(spread / 7.4))  # Logistic approximation (7.4 ≈ σ√3/π)

spread = -Φ⁻¹(P) × σ

Example conversions:
  -7 points → 70% win probability
  -14 points → 85% win probability
  -3 points → 59% win probability
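These conversions follow directly from the normal model above; a minimal sketch using only the standard library's `NormalDist`, with σ = 13.5 as stated (function names are illustrative):

```python
from statistics import NormalDist

SIGMA = 13.5               # std. dev. of scoring margins, per the model above
STD_NORMAL = NormalDist()  # standard normal, provides Φ (cdf) and Φ⁻¹ (inv_cdf)

def spread_to_prob(spread, sigma=SIGMA):
    """Favorite's win probability (spread is negative for the favorite)."""
    return 1 - STD_NORMAL.cdf(spread / sigma)

def prob_to_spread(p, sigma=SIGMA):
    """Implied point spread from a win probability (inverse of the above)."""
    return -STD_NORMAL.inv_cdf(p) * sigma
```

For example, `spread_to_prob(-7)` returns roughly 0.70 and `spread_to_prob(-14)` roughly 0.85, matching the conversions listed above.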

Elo Rating Update

Expected Score:
  E = 1 / (1 + 10^((R_opp - R_self) / 400))

Rating Update:
  R_new = R_old + K × (Actual - Expected)

Home Field Adjustment:
  Add ~65 Elo points to home team for expected score

Season Reversion:
  R_season_start = R_prior + 0.33 × (1500 - R_prior)

Brier Skill Score

BSS = 1 - (BS_model / BS_baseline)

Where:
  BS = mean((P_predicted - Y_actual)²)
  BS_baseline = Brier score of a constant forecast at the base rate
                (e.g., always predicting the home win rate; equals the outcome variance)

BSS > 0: Model beats baseline
BSS < 0: Model is worse than baseline
BSS = 1: Perfect prediction
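Both quantities can be sketched in a few lines of plain Python, using the observed base rate as the constant baseline forecast (function names are illustrative):

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    """Skill relative to always forecasting the observed base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    bs_baseline = brier_score([base_rate] * len(outcomes), outcomes)
    return 1 - brier_score(probs, outcomes) / bs_baseline
```

A perfect forecaster scores BSS = 1; forecasting the base rate itself scores exactly 0.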

Code Patterns

Elo Rating System

class EloSystem:
    def __init__(self, k=20, hfa=65, reversion=0.33):
        self.k = k                  # update step size
        self.hfa = hfa              # home-field advantage in Elo points
        self.reversion = reversion  # fraction reverted to the mean each offseason
        self.ratings = {}

    def expected_prob(self, home, away):
        h_rating = self.ratings.get(home, 1500) + self.hfa
        a_rating = self.ratings.get(away, 1500)
        return 1 / (1 + 10 ** ((a_rating - h_rating) / 400))

    def update(self, home, away, home_won):
        expected = self.expected_prob(home, away)
        actual = 1.0 if home_won else 0.0
        change = self.k * (actual - expected)

        self.ratings[home] = self.ratings.get(home, 1500) + change
        self.ratings[away] = self.ratings.get(away, 1500) - change

    def new_season(self):
        """Apply the season reversion formula: revert toward the 1500 mean."""
        for team in self.ratings:
            self.ratings[team] += self.reversion * (1500 - self.ratings[team])

Feature Engineering

def create_game_features(home_stats, away_stats):
    """Create differential features."""
    features = {}

    # Differentials
    for stat in ['off_epa', 'def_epa', 'win_pct', 'ppg']:
        features[f'{stat}_diff'] = home_stats[stat] - away_stats[stat]

    # Aggregates
    features['total_quality'] = home_stats['total_epa'] + away_stats['total_epa']

    # Indicators
    features['home_field'] = 1

    return features

Ensemble Prediction

def ensemble_predict(elo_prob, ml_prob, elo_weight=0.35):
    """Combine Elo and ML predictions as a weighted average."""
    return elo_weight * elo_prob + (1 - elo_weight) * ml_prob

# Optimize the weight on a validation set
import numpy as np
from sklearn.metrics import brier_score_loss

best_weight = min(
    np.arange(0.1, 0.6, 0.05),
    key=lambda w: brier_score_loss(y_val, w * elo + (1 - w) * ml)
)
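The same weight search can also be written without sklearn or NumPy; a stdlib sketch (the toy probability lists in the usage below stand in for real validation data):

```python
def val_brier(probs, outcomes):
    """Brier score over a validation set."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def best_elo_weight(elo_probs, ml_probs, y_val, step=0.05):
    """Grid-search the Elo weight that minimizes validation Brier score."""
    candidates = [i * step for i in range(int(round(1 / step)) + 1)]
    return min(
        candidates,
        key=lambda w: val_brier(
            [w * e + (1 - w) * m for e, m in zip(elo_probs, ml_probs)],
            y_val,
        ),
    )
```

Sanity check: if the ML probabilities are perfect and the Elo probabilities are uninformative (all 0.5), the search puts all weight on ML, and vice versa.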

Temporal Cross-Validation

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def temporal_cv(X, y, dates, model, n_splits=5):
    """Cross-validate with temporal ordering."""
    sorted_idx = dates.argsort()
    X_sorted = X.iloc[sorted_idx]
    y_sorted = y.iloc[sorted_idx]

    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = cross_val_score(model, X_sorted, y_sorted, cv=tscv)

    return scores.mean(), scores.std()

Model Selection Guide

Scenario                      Recommended Approach
----------------------------  ----------------------------------------
Limited data (< 500 games)    Elo ratings alone
Standard prediction           Elo + Logistic Regression ensemble
Complex feature interactions  Gradient Boosting
Production system             Calibrated ensemble with multiple models
Real-time updates             Elo (low computational cost)
Maximum accuracy              Neural network ensemble

Common Pitfalls

1. Data Leakage

Wrong: Using game outcome features to predict game outcome

# BAD - includes future information
features = ['home_score', 'away_score', 'final_margin']

Right: Use only pre-game information

# GOOD - only prior information
features = ['home_rating', 'away_rating', 'rest_days']

2. Random Split for Time Series

Wrong: Random train/test split

# BAD - future games in training, past in test
X_train, X_test = train_test_split(X, test_size=0.2)

Right: Temporal split

# GOOD - always train on past, test on future
train = data[data['season'] < 2023]
test = data[data['season'] >= 2023]

3. Ignoring Calibration

Wrong: Only checking accuracy

# BAD - high accuracy but poor probabilities
print(f"Accuracy: {accuracy:.1%}")

Right: Assess calibration

# GOOD - check probability quality
from sklearn.metrics import brier_score_loss
print(f"Accuracy: {accuracy:.1%}")
print(f"Brier Score: {brier:.4f}")
print(f"ECE: {ece:.4f}")
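Expected calibration error (ECE) is not provided directly by sklearn's metrics module; a minimal sketch with equal-width probability bins (the bin count and binning scheme are choices, not part of the original text):

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Average |predicted prob - observed win rate| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp p == 1.0 into the last bin
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(p for p, _ in members) / len(members)
            hit_rate = sum(y for _, y in members) / len(members)
            ece += (len(members) / n) * abs(avg_conf - hit_rate)
    return ece
```

Well-calibrated forecasts score near 0; per the red flags below, values above roughly 0.05 suggest poor calibration.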

4. Overfitting to Small Samples

Wrong: Complex model with few games

# BAD - too many parameters for data size
model = RandomForestClassifier(n_estimators=500, max_depth=20)
model.fit(X_train[:100], y_train[:100])  # Only 100 games!

Right: Match complexity to data

# GOOD - simpler model for small data
model = LogisticRegression(C=0.1)  # Regularized
model.fit(X_train, y_train)

Evaluation Checklist

Before Training

  • [ ] No data leakage (future info not in features)
  • [ ] Temporal split implemented
  • [ ] Features use only pre-game information
  • [ ] Baseline accuracy calculated

Model Comparison

  • [ ] Multiple models evaluated
  • [ ] Cross-validation used (TimeSeriesSplit)
  • [ ] Feature importance analyzed
  • [ ] Ensemble combinations tested

Evaluation

  • [ ] Accuracy vs. baseline compared
  • [ ] Brier Score calculated
  • [ ] Calibration curve plotted
  • [ ] ATS performance measured (if applicable)
  • [ ] Confidence stratification analyzed

Quick Reference Tables

Typical Metrics by Model Quality

Quality Level  Accuracy  AUC-ROC    Brier Score
-------------  --------  ---------  -----------
Poor           < 55%     < 0.55     > 0.28
Baseline       55-60%    0.55-0.65  0.25-0.28
Good           60-65%    0.65-0.72  0.22-0.25
Very Good      65-70%    0.72-0.78  0.18-0.22
Excellent      > 70%     > 0.78     < 0.18

Feature Importance Rankings (Typical)

Rank  Feature Type             Description
----  -----------------------  -------------------------
1     Elo/Rating Difference    Overall team quality
2     Efficiency Differential  EPA, success rate
3     Recent Form              Last 5 games' performance
4     Win Percentage Diff      Season record comparison
5     Home Field               ~3-point advantage
6     Rest Differential        Bye week, short week
7     Turnover Margin          Ball security
8     Head-to-Head             Historical matchup

Spread to Probability Quick Reference

Spread  Win Prob  Upset Rate
------  --------  ----------
-21     94%       6%
-14     85%       15%
-10     77%       23%
-7      70%       30%
-3      59%       41%
Pick    50%       50%
+3      41%       N/A
+7      30%       N/A

Red Flags

Warning signs your model may have issues:

  1. Accuracy > 75% → Likely data leakage
  2. Train >> Test accuracy → Overfitting
  3. Probabilities clustered near 0.5 → Model not discriminating
  4. ECE > 0.05 → Poor calibration
  5. ATS < 50% → Not beating a coin flip against the spread
  6. Negative Brier Skill Score → Worse than baseline

Next Steps

After mastering game prediction, proceed to:
  - Chapter 19: Player Performance Forecasting
  - Chapter 20: Recruiting Analytics
  - Chapter 21: Win Probability Models
  - Chapter 22: Machine Learning Applications