Case Study: Building an ML Model That Beats Elo
From simple features to a production prediction system
Introduction
This case study documents the development of an ML-based NFL prediction system, from initial concept through production deployment. We'll see the iterative process of feature engineering, model selection, and validation that separates hobby projects from reliable prediction systems.
Our goal: Build an ML model that consistently outperforms a well-tuned Elo baseline.
Phase 1: Establishing the Baseline
The Elo Benchmark
Before adding ML complexity, we established a strong baseline using margin-adjusted Elo (Chapter 19):
# Baseline Elo Performance (2019-2022)
baseline_results = {
    'model': 'Margin-Adjusted Elo',
    'k_factor': 28,
    'home_advantage': 48,
    'margin_cap': 24,
    'straight_up_accuracy': 0.612,
    'brier_score': 0.218,
    'spread_mae': 10.8
}
Key Insight: This baseline is deceptively strong. Many ML approaches fail to beat it because they overfit to training data or include leaky features.
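For reference, here is a minimal sketch of a margin-adjusted Elo update using the parameters above. The logarithmic margin multiplier is one common variant and an assumption for illustration; the exact formula is the one developed in Chapter 19.

import math

K_FACTOR = 28
HOME_ADVANTAGE = 48
MARGIN_CAP = 24

def expected_home_win_prob(home_elo, away_elo):
    """Standard logistic Elo expectation with a home-field bonus."""
    diff = (home_elo + HOME_ADVANTAGE) - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400))

def update_ratings(home_elo, away_elo, home_margin):
    """Update both ratings after a game, scaling K by the capped margin."""
    expected = expected_home_win_prob(home_elo, away_elo)
    actual = 1.0 if home_margin > 0 else (0.5 if home_margin == 0 else 0.0)
    margin = min(abs(home_margin), MARGIN_CAP)
    multiplier = math.log(margin + 1)  # assumed margin-of-victory scaling
    delta = K_FACTOR * multiplier * (actual - expected)
    return home_elo + delta, away_elo - delta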
Phase 2: Initial ML Attempt
First Try: Everything at Once
Our first attempt threw many features at XGBoost:
initial_features = [
    # Basic stats
    'home_wins', 'home_losses', 'away_wins', 'away_losses',
    'home_ppg', 'away_ppg', 'home_ppg_allowed', 'away_ppg_allowed',
    'home_ypg', 'away_ypg', 'home_ypg_allowed', 'away_ypg_allowed',
    # Advanced stats
    'home_turnover_margin', 'away_turnover_margin',
    'home_third_down_pct', 'away_third_down_pct',
    'home_red_zone_pct', 'away_red_zone_pct',
    # Situational
    'home_rest_days', 'away_rest_days',
    'is_divisional', 'week', 'home_timezone_advantage',
    # QB stats
    'home_qb_rating', 'away_qb_rating',
    'home_qb_td_int_ratio', 'away_qb_td_int_ratio',
    # Recent form
    'home_last_5_wins', 'away_last_5_wins',
    'home_streak', 'away_streak',
    # 35 total features
]
Initial Results
| Metric | Training | Test |
|---|---|---|
| Accuracy | 72.3% | 58.4% |
| Brier Score | 0.168 | 0.232 |
Problem Identified: Massive overfitting. Training performance was unrealistically high while test performance was below the Elo baseline.
Phase 3: Debugging the Model
Finding the Leaks
Reviewing features revealed several problems:
- home_wins, home_losses: Included current game's result
- home_ypg: Season averages included current game
- home_qb_rating: Some implementations used end-of-season stats
Fixing Data Leakage
def create_safe_features(games_df, current_week, current_season):
    """
    Create features using only data available BEFORE the game.
    """
    # Filter to games BEFORE this week
    prior_games = games_df[
        (games_df['season'] < current_season) |
        ((games_df['season'] == current_season) &
         (games_df['week'] < current_week))
    ]
    # Now calculate features from prior_games only
    # ...
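To make the elided step concrete, one way the per-team aggregation might look is sketched below. The column names (home_team, home_score, and so on) and the simple season-to-date averages are illustrative assumptions, not the exact production code.

def team_scoring_averages(prior_games, team):
    """Season-to-date scoring averages for one team, from prior games only.
    Column names here are illustrative assumptions."""
    home = prior_games[prior_games['home_team'] == team]
    away = prior_games[prior_games['away_team'] == team]
    n_games = len(home) + len(away)
    if n_games == 0:
        return None  # no history yet, e.g. at the start of the training window
    points_for = home['home_score'].sum() + away['away_score'].sum()
    points_against = home['away_score'].sum() + away['home_score'].sum()
    return {'ppg': points_for / n_games, 'ppg_allowed': points_against / n_games}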
Post-Fix Results
| Metric | Training | Test |
|---|---|---|
| Accuracy | 65.1% | 60.8% |
| Brier Score | 0.202 | 0.221 |
Progress: Closer to baseline, but still showing overfitting and not beating Elo.
Phase 4: Feature Engineering
Hypothesis: Differential Features
Instead of raw team stats, use differences:
improved_features = {
    # Differentials (home - away perspective)
    'ppg_diff': home_ppg - away_ppg,
    'defensive_diff': away_ppg_allowed - home_ppg_allowed,
    'yardage_diff': home_ypg - away_ypg,
    'turnover_diff': home_to_margin - away_to_margin,
    # Matchup features
    'home_off_vs_away_def': home_ppg - away_ppg_allowed,
    'away_off_vs_home_def': away_ppg - home_ppg_allowed,
    # Form indicators
    'recent_form_diff': home_last_4_pct - away_last_4_pct,
    'momentum': home_streak - away_streak,
    # Situational
    'rest_advantage': home_rest - away_rest,
    'travel_factor': away_miles_traveled / 1000,
    # Historical
    'h2h_home_advantage': home_h2h_win_pct if h2h_games > 3 else 0.5,
    # 12 total features
}
Why Differentials Work Better
- Directly encode prediction target: We're predicting who wins, not absolute performance
- Reduce multicollinearity: Raw features are highly correlated (a quick check is sketched below)
- Simpler model needed: Fewer, more informative features
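The multicollinearity claim is easy to verify on your own feature matrix. A quick check along these lines (not part of the original pipeline) is all it takes:

import pandas as pd

def max_pairwise_correlation(features: pd.DataFrame) -> float:
    """Largest absolute off-diagonal correlation among the feature columns."""
    corr = features.corr().abs()
    # Ignore the diagonal (self-correlation is always 1.0)
    for col in corr.columns:
        corr.loc[col, col] = 0.0
    return float(corr.values.max())

# Compare the raw feature set against the differential set:
# max_pairwise_correlation(raw_features_df) vs. max_pairwise_correlation(diff_features_df)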
Results with Differential Features
| Metric | Training | Test | Elo Baseline |
|---|---|---|---|
| Accuracy | 62.5% | 61.4% | 61.2% |
| Brier Score | 0.214 | 0.216 | 0.218 |
Breakthrough: Finally beating the baseline, with minimal overfitting.
Phase 5: Model Selection
Comparing Algorithms
We tested multiple algorithms with our refined feature set:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

models_tested = {
    'XGBoost': XGBClassifier(max_depth=4, n_estimators=100),
    'Random Forest': RandomForestClassifier(max_depth=6, n_estimators=200),
    'Logistic Regression': LogisticRegression(C=0.5),
    'LightGBM': LGBMClassifier(max_depth=4, n_estimators=100),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(32, 16), alpha=0.1)
}
Results Across Models
| Model | Test Accuracy | Test Brier | Train-Test Gap |
|---|---|---|---|
| XGBoost | 61.4% | 0.216 | 1.1% |
| LightGBM | 61.6% | 0.215 | 0.9% |
| Random Forest | 60.8% | 0.219 | 1.5% |
| Logistic Regression | 60.2% | 0.220 | 0.3% |
| Neural Network | 59.5% | 0.224 | 3.2% |
Winner: LightGBM slightly edged XGBoost with the most stable performance.
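The comparison can be reproduced with a short evaluation loop. The sketch below uses a single season-based holdout rather than the full temporal CV from Phase 6, and assumes X_train, y_train, X_test, y_test hold the differential features and home-win labels split by season:

from sklearn.metrics import accuracy_score, brier_score_loss

def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit on earlier seasons, then score on the held-out later seasons."""
    model.fit(X_train, y_train)
    train_preds = (model.predict_proba(X_train)[:, 1] >= 0.5).astype(int)
    test_probs = model.predict_proba(X_test)[:, 1]
    test_preds = (test_probs >= 0.5).astype(int)
    return {
        'test_accuracy': accuracy_score(y_test, test_preds),
        'test_brier': brier_score_loss(y_test, test_probs),
        'train_test_gap': (accuracy_score(y_train, train_preds)
                           - accuracy_score(y_test, test_preds)),
    }

results = {name: evaluate_model(model, X_train, y_train, X_test, y_test)
           for name, model in models_tested.items()}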
Phase 6: Hyperparameter Tuning
Temporal Grid Search
param_grid = {
    'max_depth': [3, 4, 5],
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.05, 0.1, 0.15],
    'min_child_samples': [10, 20, 30],
    'reg_alpha': [0.5, 1.0, 2.0],
    'reg_lambda': [0.5, 1.0, 2.0]
}

# Use temporal CV, not random CV
cv_splitter = NFLTemporalCV(min_train_seasons=2)
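NFLTemporalCV is our own splitter from earlier validation work. If you are building your own, a simplified season-indexed, expanding-window version along these lines captures the idea (the real class also handles week boundaries and minimum training sizes):

import numpy as np

class SimpleSeasonalCV:
    """Expanding-window CV: train on all seasons before each test season.
    A simplified stand-in for NFLTemporalCV, for illustration only."""

    def __init__(self, min_train_seasons=2):
        self.min_train_seasons = min_train_seasons

    def split(self, X, y=None, groups=None):
        """`groups` holds the season label for each row of X."""
        groups = np.asarray(groups)
        seasons = np.sort(np.unique(groups))
        for i in range(self.min_train_seasons, len(seasons)):
            train_idx = np.where(np.isin(groups, seasons[:i]))[0]
            test_idx = np.where(groups == seasons[i])[0]
            yield train_idx, test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(np.unique(groups)) - self.min_train_seasons

# Usable with scikit-learn by materializing the folds, e.g.:
# folds = list(SimpleSeasonalCV(min_train_seasons=2).split(X, y, groups=season_labels))
# GridSearchCV(LGBMClassifier(), param_grid, cv=folds, scoring='neg_brier_score')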
Optimal Parameters Found
optimal_params = {
    'max_depth': 4,
    'n_estimators': 100,
    'learning_rate': 0.1,
    'min_child_samples': 20,
    'reg_alpha': 1.0,
    'reg_lambda': 1.0
}
Key Finding: Moderate regularization was crucial. Both under- and over-regularized models performed worse.
Phase 7: Ensemble Construction
Building the Ensemble
Given that different models capture different patterns, we built an ensemble:
ensemble_components = [
    ('lgbm', lgbm_model, 0.40),      # Primary model
    ('xgb', xgb_model, 0.35),        # Strong alternative
    ('logreg', logreg_model, 0.25),  # Stabilizing influence
]

def ensemble_predict(X):
    """Weighted average of each component's home-win probability."""
    predictions = []
    for name, model, weight in ensemble_components:
        pred = model.predict_proba(X)[:, 1]
        predictions.append(pred * weight)
    return sum(predictions)
Why Include Logistic Regression?
Despite lower individual accuracy, logistic regression:
- Makes different errors than tree-based models
- Provides probability calibration
- Reduces overall variance
Ensemble Results
| Model | Test Accuracy | Test Brier |
|---|---|---|
| LightGBM alone | 61.6% | 0.215 |
| XGBoost alone | 61.4% | 0.216 |
| Ensemble | 62.1% | 0.213 |
| Elo Baseline | 61.2% | 0.218 |
Final Edge: +0.9% accuracy, -0.005 Brier score vs baseline.
Phase 8: Calibration Check
Probability Calibration
Our model's predicted probabilities should match reality:
| Predicted Range | Games | Actual Win % | Calibration Error |
|---|---|---|---|
| 45-50% | 180 | 47.2% | -0.3% |
| 50-55% | 250 | 52.4% | 0.1% |
| 55-60% | 280 | 56.8% | -0.7% |
| 60-65% | 210 | 62.4% | -0.1% |
| 65-70% | 150 | 67.3% | 0.2% |
| 70-75% | 90 | 73.3% | 1.1% |
| 75-80% | 40 | 77.5% | 0.8% |
Verdict: Well-calibrated. Average absolute error: 0.5%.
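A table like this comes from simple binning of held-out predictions. A sketch, with bin edges matching the table and variable names assumed (probs = predicted home-win probabilities, outcomes = 0/1 results):

import numpy as np
import pandas as pd

def calibration_table(probs, outcomes, edges=np.arange(0.45, 0.801, 0.05)):
    """Compare predicted win probability to the actual win rate in each bin."""
    df = pd.DataFrame({'prob': probs, 'won': outcomes})
    df['bin'] = pd.cut(df['prob'], edges)
    table = df.groupby('bin', observed=True).agg(
        games=('won', 'size'),
        mean_predicted=('prob', 'mean'),
        actual_win_pct=('won', 'mean'),
    )
    table['calibration_error'] = table['actual_win_pct'] - table['mean_predicted']
    return table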
Phase 9: Production Deployment
Weekly Pipeline
class WeeklyPredictionPipeline:
    """
    Production pipeline for weekly predictions.
    """

    def run_weekly(self, week: int, season: int):
        """Generate predictions for upcoming week."""
        # 1. Update statistics through last week
        self.update_team_stats(week - 1, season)

        # 2. Generate features for upcoming games
        upcoming_games = self.get_week_schedule(week, season)
        features = self.generate_features(upcoming_games)

        # 3. Generate predictions
        predictions = self.ensemble_predict(features)

        # 4. Add context
        predictions = self.add_context(predictions, upcoming_games)

        # 5. Store for tracking
        self.store_predictions(predictions)

        return predictions

    def add_context(self, preds, games):
        """Add feature contributions and comparisons."""
        for i, game in enumerate(games):
            preds[i]['key_factors'] = self.get_shap_summary(game)
            preds[i]['elo_comparison'] = self.get_elo_prediction(game)
            preds[i]['market_line'] = self.get_market_line(game)
        return preds
Monitoring
import numpy as np

class ModelMonitor:
    """Track model performance over time."""

    def __init__(self):
        self.weekly_results = []

    def record_week(self, predictions, actuals):
        """Record weekly performance."""
        accuracy = (predictions.round() == actuals).mean()
        brier = ((predictions - actuals) ** 2).mean()
        self.weekly_results.append({
            'accuracy': accuracy,
            'brier': brier,
            'n_games': len(predictions)
        })
        # Alert if performance degrading
        if len(self.weekly_results) >= 4:
            recent_brier = np.mean([w['brier'] for w in self.weekly_results[-4:]])
            if recent_brier > 0.230:
                self.send_alert("Model performance degrading")

    def should_retrain(self):
        """Determine if model needs retraining."""
        if len(self.weekly_results) < 4:
            return False
        recent_brier = np.mean([w['brier'] for w in self.weekly_results[-4:]])
        return recent_brier > 0.225
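A typical week of monitoring might look like the following; the probability and outcome arrays are placeholder values for illustration:

import numpy as np

monitor = ModelMonitor()

# After the week's games settle: published probabilities and actual home-win outcomes
week_probs = np.array([0.62, 0.55, 0.71, 0.48])  # placeholder values
week_outcomes = np.array([1, 0, 1, 1])           # placeholder values

monitor.record_week(week_probs, week_outcomes)
if monitor.should_retrain():
    print("Rolling Brier above threshold - schedule a retrain")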
Final Results: Full 2023 Season
Head-to-Head vs Elo
| Week | ML Model | Elo | Winner |
|---|---|---|---|
| 1 | 9/16 | 8/16 | ML |
| 2 | 10/16 | 9/16 | ML |
| 3 | 9/16 | 10/16 | Elo |
| 4 | 11/16 | 10/16 | ML |
| ... | ... | ... | ... |
| 17 | 10/16 | 10/16 | Tie |
Season Totals:
- ML Model: 167/267 = 62.5%
- Elo Baseline: 163/267 = 61.0%
- Market Lines: 168/267 = 62.9%
Performance by Game Type
| Game Type | ML Accuracy | Elo Accuracy |
|---|---|---|
| Divisional | 58.3% | 56.7% |
| Primetime | 65.4% | 63.1% |
| Blowouts (10+ pts) | 72.1% | 70.8% |
| Close games (<7 pts) | 54.2% | 53.1% |
Lessons Learned
What Worked
- Differential features over raw statistics: Directly encoding what we're predicting
- Aggressive regularization: Prevented overfitting to limited NFL data
- Ensemble approach: Combining diverse models reduced variance
- Strict temporal validation: Caught leakage and overfitting early
- Simple but thoughtful features: 12 features beat 35 poorly-chosen ones
What Didn't Work
- Complex neural networks: Overfitted despite regularization
- Too many features: More noise than signal
- Chasing accuracy: Brier score was a better north star
- Weekly retraining: Added noise without improving predictions
Key Insights
On beating baselines:
- Elo is harder to beat than it looks
- A 1-2% improvement is meaningful
- Consistent improvement matters more than occasional big wins

On feature engineering:
- Domain knowledge beats feature count
- Differential features capture the prediction target directly
- Simpler is often better

On validation:
- Temporal CV is non-negotiable
- Watch for train-test gaps
- Monitor production performance continuously
Future Improvements
Near-Term
- Add weather features for outdoor games
- Incorporate injury reports
- Include referee tendencies
Medium-Term
- Play-by-play derived features (EPA, success rate)
- Bayesian uncertainty quantification
- Game script features (expected closeness)
Long-Term
- Real-time in-game updating
- Player-level modeling
- Video-based features (formation, alignment)
Conclusion
Building an ML system that beats Elo requires discipline more than sophistication. The keys are:
- Establish a strong baseline
- Engineer features carefully, avoiding leakage
- Use temporal validation religiously
- Prefer simpler models with regularization
- Ensemble for robustness
- Monitor continuously
The marginal improvement (1-2%) may seem small, but over a season it represents several additional correct predictions. Combined with proper uncertainty quantification, this edge enables better decision-making whether for analysis, fantasy sports, or understanding the game.