Case Study: Building an ML Model That Beats Elo

From simple features to a production prediction system


Introduction

This case study documents the development of an ML-based NFL prediction system, from initial concept through production deployment. We'll see the iterative process of feature engineering, model selection, and validation that separates hobby projects from reliable prediction systems.

Our goal: Build an ML model that consistently outperforms a well-tuned Elo baseline.


Phase 1: Establishing the Baseline

The Elo Benchmark

Before adding ML complexity, we established a strong baseline using margin-adjusted Elo (Chapter 19):

# Baseline Elo Performance (2019-2022)
baseline_results = {
    'model': 'Margin-Adjusted Elo',
    'k_factor': 28,
    'home_advantage': 48,
    'margin_cap': 24,
    'straight_up_accuracy': 0.612,
    'brier_score': 0.218,
    'spread_mae': 10.8
}
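
As a reference point, here is a minimal sketch of a margin-adjusted Elo update in this spirit (the exact formulation lives in Chapter 19; the capped log-margin multiplier below is an illustrative assumption):

import math

def elo_win_prob(home_elo, away_elo, home_advantage=48):
    """Expected home win probability from the rating difference."""
    diff = home_elo + home_advantage - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400))

def update_elo(home_elo, away_elo, home_margin,
               k_factor=28, home_advantage=48, margin_cap=24):
    """One game's rating update with a capped margin-of-victory multiplier.

    The cap keeps blowouts from swinging ratings too far; the specific
    multiplier form here is an assumption, not the Chapter 19 formula.
    """
    expected = elo_win_prob(home_elo, away_elo, home_advantage)
    actual = 1.0 if home_margin > 0 else (0.0 if home_margin < 0 else 0.5)
    multiplier = math.log(min(abs(home_margin), margin_cap) + 1)
    shift = k_factor * multiplier * (actual - expected)
    return home_elo + shift, away_elo - shift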

Key Insight: This baseline is deceptively strong. Many ML approaches fail to beat it because they overfit to training data or include leaky features.


Phase 2: Initial ML Attempt

First Try: Everything at Once

Our first attempt threw many features at XGBoost:

initial_features = [
    # Basic stats
    'home_wins', 'home_losses', 'away_wins', 'away_losses',
    'home_ppg', 'away_ppg', 'home_ppg_allowed', 'away_ppg_allowed',
    'home_ypg', 'away_ypg', 'home_ypg_allowed', 'away_ypg_allowed',

    # Advanced stats
    'home_turnover_margin', 'away_turnover_margin',
    'home_third_down_pct', 'away_third_down_pct',
    'home_red_zone_pct', 'away_red_zone_pct',

    # Situational
    'home_rest_days', 'away_rest_days',
    'is_divisional', 'week', 'home_timezone_advantage',

    # QB stats
    'home_qb_rating', 'away_qb_rating',
    'home_qb_td_int_ratio', 'away_qb_td_int_ratio',

    # Recent form
    'home_last_5_wins', 'away_last_5_wins',
    'home_streak', 'away_streak',

    # 35 total features
]
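
A rough sketch of how this first-pass model was fit, assuming a hypothetical games DataFrame containing these columns, a binary home_win label, and a simple season-based holdout (the split years are illustrative):

from xgboost import XGBClassifier

# `games`: one row per game with the feature columns above,
# plus 'season' and a 0/1 'home_win' label (hypothetical layout)
train = games[games['season'] <= 2021]
test = games[games['season'] == 2022]

model = XGBClassifier(max_depth=6, n_estimators=300)
model.fit(train[initial_features], train['home_win'])

test_probs = model.predict_proba(test[initial_features])[:, 1]
test_accuracy = ((test_probs > 0.5) == test['home_win']).mean()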

Initial Results

Metric        Training   Test
Accuracy      72.3%      58.4%
Brier Score   0.168      0.232

Problem Identified: Massive overfitting. Training performance was unrealistically high while test performance was below the Elo baseline.


Phase 3: Debugging the Model

Finding the Leaks

Reviewing features revealed several problems:

  1. home_wins, home_losses: Included current game's result
  2. home_ypg: Season averages included current game
  3. home_qb_rating: Some implementations used end-of-season stats

Fixing Data Leakage

def create_safe_features(games_df, current_week, current_season):
    """
    Create features using only data available BEFORE the game.
    """
    # Filter to games BEFORE this week
    prior_games = games_df[
        (games_df['season'] < current_season) |
        ((games_df['season'] == current_season) &
         (games_df['week'] < current_week))
    ]

    # Now calculate features from prior_games only
    # ...
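
As a concrete illustration of the fix, a season-to-date average becomes leak-free when you take an expanding mean and shift it by one game, so the value entering week N only reflects weeks 1 through N-1 (the team_games layout and column names are assumptions):

import pandas as pd

def season_to_date_ppg(team_games: pd.DataFrame) -> pd.Series:
    """Points per game entering each week, never including the current game.

    Assumes one row per (team, game) with 'team', 'season', 'week',
    and 'points_scored' columns.
    """
    team_games = team_games.sort_values(['season', 'week'])
    return (
        team_games
        .groupby(['team', 'season'])['points_scored']
        # First game of a season is NaN by design: no prior data, no feature.
        .transform(lambda s: s.expanding().mean().shift(1))
    )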

Post-Fix Results

Metric        Training   Test
Accuracy      65.1%      60.8%
Brier Score   0.202      0.221

Progress: Closer to baseline, but still showing overfitting and not beating Elo.


Phase 4: Feature Engineering

Hypothesis: Differential Features

Instead of raw team stats, use differences:

improved_features = {
    # Differentials (home - away perspective)
    'ppg_diff': home_ppg - away_ppg,
    'defensive_diff': away_ppg_allowed - home_ppg_allowed,
    'yardage_diff': home_ypg - away_ypg,
    'turnover_diff': home_to_margin - away_to_margin,

    # Matchup features
    'home_off_vs_away_def': home_ppg - away_ppg_allowed,
    'away_off_vs_home_def': away_ppg - home_ppg_allowed,

    # Form indicators
    'recent_form_diff': home_last_4_pct - away_last_4_pct,
    'momentum': home_streak - away_streak,

    # Situational
    'rest_advantage': home_rest - away_rest,
    'travel_factor': away_miles_traveled / 1000,

    # Historical
    'h2h_home_advantage': home_h2h_win_pct if h2h_games > 3 else 0.5,

    # 12 total features
}
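
In practice these columns come from joining leak-free, season-to-date team stats onto each game row and differencing them; a minimal sketch, assuming a games DataFrame that already carries home_* / away_* columns computed as in Phase 3:

import pandas as pd

def add_differential_features(games: pd.DataFrame) -> pd.DataFrame:
    """Derive home-minus-away differentials from per-team columns.

    Column names follow the feature lists above but are assumptions
    about the underlying data layout.
    """
    out = games.copy()
    out['ppg_diff'] = out['home_ppg'] - out['away_ppg']
    out['defensive_diff'] = out['away_ppg_allowed'] - out['home_ppg_allowed']
    out['home_off_vs_away_def'] = out['home_ppg'] - out['away_ppg_allowed']
    out['away_off_vs_home_def'] = out['away_ppg'] - out['home_ppg_allowed']
    out['rest_advantage'] = out['home_rest_days'] - out['away_rest_days']
    return out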

Why Differentials Work Better

  1. Directly encode prediction target: We're predicting who wins, not absolute performance
  2. Reduce multicollinearity: Raw features are highly correlated
  3. Simpler model needed: Fewer, more informative features

Results with Differential Features

Metric        Training   Test    Elo Baseline
Accuracy      62.5%      61.4%   61.2%
Brier Score   0.214      0.216   0.218

Breakthrough: Finally beating the baseline, with minimal overfitting.


Phase 5: Model Selection

Comparing Algorithms

We tested multiple algorithms with our refined feature set:

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

models_tested = {
    'XGBoost': XGBClassifier(max_depth=4, n_estimators=100),
    'Random Forest': RandomForestClassifier(max_depth=6, n_estimators=200),
    'Logistic Regression': LogisticRegression(C=0.5),
    'LightGBM': LGBMClassifier(max_depth=4, n_estimators=100),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(32, 16), alpha=0.1)
}

Results Across Models

Model                 Test Accuracy   Test Brier   Train-Test Gap
XGBoost               61.4%           0.216        1.1%
LightGBM              61.6%           0.215        0.9%
Random Forest         60.8%           0.219        1.5%
Logistic Regression   60.2%           0.220        0.3%
Neural Network        59.5%           0.224        3.2%

Winner: LightGBM, which edged out XGBoost on both test accuracy and Brier score while keeping a small train-test gap.


Phase 6: Hyperparameter Tuning

param_grid = {
    'max_depth': [3, 4, 5],
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.05, 0.1, 0.15],
    'min_child_samples': [10, 20, 30],
    'reg_alpha': [0.5, 1.0, 2.0],
    'reg_lambda': [0.5, 1.0, 2.0]
}

# Use temporal CV, not random CV
cv_splitter = NFLTemporalCV(min_train_seasons=2)
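
NFLTemporalCV comes from the earlier validation chapters and is not reproduced here; a minimal season-ordered splitter in the same spirit (its internals are an assumption) might look like:

import numpy as np

class NFLTemporalCV:
    """Walk-forward splits: train on past seasons, validate on the next one."""

    def __init__(self, min_train_seasons: int = 2):
        self.min_train_seasons = min_train_seasons

    def split(self, X, y=None, seasons=None):
        """Yield (train_idx, val_idx) index pairs in season order."""
        seasons = np.asarray(seasons)
        ordered = np.sort(np.unique(seasons))
        for i in range(self.min_train_seasons, len(ordered)):
            train_idx = np.where(np.isin(seasons, ordered[:i]))[0]
            val_idx = np.where(seasons == ordered[i])[0]
            yield train_idx, val_idx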

Optimal Parameters Found

optimal_params = {
    'max_depth': 4,
    'n_estimators': 100,
    'learning_rate': 0.1,
    'min_child_samples': 20,
    'reg_alpha': 1.0,
    'reg_lambda': 1.0
}

Key Finding: Moderate regularization was crucial. Both under- and over-regularized models performed worse.


Phase 7: Ensemble Construction

Building the Ensemble

Given that different models capture different patterns, we built an ensemble:

ensemble_components = [
    ('lgbm', lgbm_model, 0.40),       # Primary model
    ('xgb', xgb_model, 0.35),          # Strong alternative
    ('logreg', logreg_model, 0.25),    # Stabilizing influence
]

def ensemble_predict(X):
    predictions = []
    for name, model, weight in ensemble_components:
        pred = model.predict_proba(X)[:, 1]
        predictions.append(pred * weight)
    return sum(predictions)

Why Include Logistic Regression?

Despite lower individual accuracy, logistic regression:

  • Makes different errors than tree-based models
  • Provides probability calibration
  • Reduces overall variance

Ensemble Results

Model            Test Accuracy   Test Brier
LightGBM alone   61.6%           0.215
XGBoost alone    61.4%           0.216
Ensemble         62.1%           0.213
Elo Baseline     61.2%           0.218

Final Edge: +0.9% accuracy, -0.005 Brier score vs baseline.


Phase 8: Calibration Check

Probability Calibration

Our model's predicted probabilities should match reality:

Predicted Range   Games   Actual Win %   Calibration Error
45-50%            180     47.2%          -0.3%
50-55%            250     52.4%          0.1%
55-60%            280     56.8%          -0.7%
60-65%            210     62.4%          -0.1%
65-70%            150     67.3%          0.2%
70-75%            90      73.3%          1.1%
75-80%            40      77.5%          0.8%

Verdict: Well-calibrated. Average absolute error: 0.5%.
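
A table like this can be produced by bucketing predicted probabilities and comparing each bucket's empirical win rate to its mean prediction; a small sketch, assuming arrays of predicted probabilities and 0/1 outcomes:

import numpy as np
import pandas as pd

def calibration_table(probs, outcomes):
    """Group predictions into 5-point buckets and measure calibration."""
    bin_edges = np.linspace(0.45, 0.80, 8)   # 45-50%, 50-55%, ..., 75-80%
    df = pd.DataFrame({
        'bucket': pd.cut(probs, bin_edges),
        'prob': probs,
        'won': outcomes,
    })
    table = df.groupby('bucket', observed=True).agg(
        games=('won', 'size'),
        predicted=('prob', 'mean'),
        actual=('won', 'mean'),
    )
    table['calibration_error'] = table['actual'] - table['predicted']
    return table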


Phase 9: Production Deployment

Weekly Pipeline

class WeeklyPredictionPipeline:
    """
    Production pipeline for weekly predictions.
    """

    def run_weekly(self, week: int, season: int):
        """Generate predictions for upcoming week."""

        # 1. Update statistics through last week
        self.update_team_stats(week - 1, season)

        # 2. Generate features for upcoming games
        upcoming_games = self.get_week_schedule(week, season)
        features = self.generate_features(upcoming_games)

        # 3. Generate predictions
        predictions = self.ensemble_predict(features)

        # 4. Add context
        predictions = self.add_context(predictions, upcoming_games)

        # 5. Store for tracking
        self.store_predictions(predictions)

        return predictions

    def add_context(self, preds, games):
        """Add feature contributions and comparisons."""
        for i, game in enumerate(games):
            preds[i]['key_factors'] = self.get_shap_summary(game)
            preds[i]['elo_comparison'] = self.get_elo_prediction(game)
            preds[i]['market_line'] = self.get_market_line(game)
        return preds
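
get_shap_summary is referenced but not shown above; one plausible implementation uses the shap library's TreeExplainer on the primary LightGBM component (the helper name and signature are assumptions about the pipeline):

import shap

def get_shap_summary(model, game_row, feature_names, top_n=3):
    """Return the largest feature contributions for a single game.

    `game_row` is a 1-row 2D array of the game's features.
    """
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(game_row)
    # Some shap versions return [class0, class1] for binary classifiers
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    contributions = dict(zip(feature_names, shap_values[0]))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_n]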

Monitoring

import numpy as np

class ModelMonitor:
    """Track model performance over time."""

    def __init__(self):
        self.weekly_results = []

    def record_week(self, predictions, actuals):
        """Record weekly performance."""
        accuracy = (predictions.round() == actuals).mean()
        brier = ((predictions - actuals) ** 2).mean()

        self.weekly_results.append({
            'accuracy': accuracy,
            'brier': brier,
            'n_games': len(predictions)
        })

        # Alert if performance degrading
        if len(self.weekly_results) >= 4:
            recent_brier = np.mean([w['brier'] for w in self.weekly_results[-4:]])
            if recent_brier > 0.230:
                self.send_alert("Model performance degrading")

    def should_retrain(self):
        """Determine if model needs retraining."""
        if len(self.weekly_results) < 4:
            return False

        recent_brier = np.mean([w['brier'] for w in self.weekly_results[-4:]])
        return recent_brier > 0.225
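
A quick usage sketch with dummy arrays (values are illustrative):

import numpy as np

monitor = ModelMonitor()
week_probs = np.array([0.62, 0.55, 0.71, 0.48])   # predicted home win probabilities
week_outcomes = np.array([1, 0, 1, 1])            # actual results
monitor.record_week(week_probs, week_outcomes)
print(monitor.should_retrain())   # False until at least four weeks are recorded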

Final Results: Full 2023 Season

Head-to-Head vs Elo

Week   ML Model   Elo     Winner
1      9/16       8/16    ML
2      10/16      9/16    ML
3      9/16       10/16   Elo
4      11/16      10/16   ML
...    ...        ...     ...
17     10/16      10/16   Tie

Season Totals:

  • ML Model: 167/267 = 62.5%
  • Elo Baseline: 163/267 = 61.0%
  • Market Lines: 168/267 = 62.9%

Performance by Game Type

Game Type              ML Accuracy   Elo Accuracy
Divisional             58.3%         56.7%
Primetime              65.4%         63.1%
Blowouts (10+ pts)     72.1%         70.8%
Close games (<7 pts)   54.2%         53.1%

Lessons Learned

What Worked

  1. Differential features over raw statistics: Directly encoding what we're predicting
  2. Aggressive regularization: Prevented overfitting to limited NFL data
  3. Ensemble approach: Combining diverse models reduced variance
  4. Strict temporal validation: Caught leakage and overfitting early
  5. Simple but thoughtful features: 12 features beat 35 poorly-chosen ones

What Didn't Work

  1. Complex neural networks: Overfitted despite regularization
  2. Too many features: More noise than signal
  3. Chasing accuracy: Brier score was a better north star
  4. Weekly retraining: Added noise without improving predictions

Key Insights

On beating baselines:

  • Elo is harder to beat than it looks
  • A 1-2% improvement is meaningful
  • Consistent improvement matters more than occasional big wins

On feature engineering:

  • Domain knowledge beats feature count
  • Differential features capture the prediction target directly
  • Simpler is often better

On validation:

  • Temporal CV is non-negotiable
  • Watch for train-test gaps
  • Monitor production performance continuously


Future Improvements

Near-Term

  • Add weather features for outdoor games
  • Incorporate injury reports
  • Include referee tendencies

Medium-Term

  • Play-by-play derived features (EPA, success rate)
  • Bayesian uncertainty quantification
  • Game script features (expected closeness)

Long-Term

  • Real-time in-game updating
  • Player-level modeling
  • Video-based features (formation, alignment)

Conclusion

Building an ML system that beats Elo requires discipline more than sophistication. The keys are:

  1. Establish a strong baseline
  2. Engineer features carefully, avoiding leakage
  3. Use temporal validation religiously
  4. Prefer simpler models with regularization
  5. Ensemble for robustness
  6. Monitor continuously

The marginal improvement (1-2%) may seem small, but over the 2023 season it translated into four more correct calls than the Elo baseline (167 vs. 163 of 267 games). Combined with proper uncertainty quantification, this edge enables better decision-making, whether for analysis, fantasy sports, or simply understanding the game.