Case Study: Building an ML Model That Beats Elo

From simple features to a production prediction system


Introduction

This case study documents the development of an ML-based NFL prediction system, from initial concept through production deployment. We'll see the iterative process of feature engineering, model selection, and validation that separates hobby projects from reliable prediction systems.

Our goal: Build an ML model that consistently outperforms a well-tuned Elo baseline.


Phase 1: Establishing the Baseline

The Elo Benchmark

Before adding ML complexity, we established a strong baseline using margin-adjusted Elo (Chapter 19):

# Baseline Elo Performance (2019-2022)
baseline_results = {
    'model': 'Margin-Adjusted Elo',
    'k_factor': 28,
    'home_advantage': 48,
    'margin_cap': 24,
    'straight_up_accuracy': 0.612,
    'brier_score': 0.218,
    'spread_mae': 10.8
}
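
As a reference point, here is a minimal sketch of a margin-adjusted Elo update in this spirit (the exact formulation lives in Chapter 19; the capped log-margin multiplier below is an illustrative assumption):

import math

def elo_win_prob(home_elo, away_elo, home_advantage=48):
    """Expected home win probability from the rating difference."""
    diff = home_elo + home_advantage - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400))

def update_elo(home_elo, away_elo, home_margin,
               k_factor=28, home_advantage=48, margin_cap=24):
    """One game's rating update with a capped margin-of-victory multiplier.

    The cap keeps blowouts from swinging ratings too far; the specific
    multiplier form here is an assumption, not the Chapter 19 formula.
    """
    expected = elo_win_prob(home_elo, away_elo, home_advantage)
    actual = 1.0 if home_margin > 0 else (0.0 if home_margin < 0 else 0.5)
    multiplier = math.log(min(abs(home_margin), margin_cap) + 1)
    shift = k_factor * multiplier * (actual - expected)
    return home_elo + shift, away_elo - shift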

Key Insight: This baseline is deceptively strong. Many ML approaches fail to beat it because they overfit to training data or include leaky features.


Phase 2: Initial ML Attempt

First Try: Everything at Once

Our first attempt threw many features at XGBoost:

initial_features = [
    # Basic stats
    'home_wins', 'home_losses', 'away_wins', 'away_losses',
    'home_ppg', 'away_ppg', 'home_ppg_allowed', 'away_ppg_allowed',
    'home_ypg', 'away_ypg', 'home_ypg_allowed', 'away_ypg_allowed',

    # Advanced stats
    'home_turnover_margin', 'away_turnover_margin',
    'home_third_down_pct', 'away_third_down_pct',
    'home_red_zone_pct', 'away_red_zone_pct',

    # Situational
    'home_rest_days', 'away_rest_days',
    'is_divisional', 'week', 'home_timezone_advantage',

    # QB stats
    'home_qb_rating', 'away_qb_rating',
    'home_qb_td_int_ratio', 'away_qb_td_int_ratio',

    # Recent form
    'home_last_5_wins', 'away_last_5_wins',
    'home_streak', 'away_streak',

    # 35 total features
]
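
A rough sketch of how this first-pass model was fit, assuming a hypothetical games DataFrame containing these columns, a binary home_win label, and a simple season-based holdout (the split years are illustrative):

from xgboost import XGBClassifier

# `games`: one row per game with the feature columns above,
# plus 'season' and a 0/1 'home_win' label (hypothetical layout)
train = games[games['season'] <= 2021]
test = games[games['season'] == 2022]

model = XGBClassifier(max_depth=6, n_estimators=300)
model.fit(train[initial_features], train['home_win'])

test_probs = model.predict_proba(test[initial_features])[:, 1]
test_accuracy = ((test_probs > 0.5) == test['home_win']).mean()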

Initial Results

Metric        Training   Test
Accuracy      72.3%      58.4%
Brier Score   0.168      0.232

Problem Identified: Massive overfitting. Training performance was unrealistically high while test performance was below the Elo baseline.


Phase 3: Debugging the Model

Finding the Leaks

Reviewing features revealed several problems:

  1. home_wins, home_losses: Included current game's result
  2. home_ypg: Season averages included current game
  3. home_qb_rating: Some implementations used end-of-season stats

Fixing Data Leakage

def create_safe_features(games_df, current_week, current_season):
    """
    Create features using only data available BEFORE the game.
    """
    # Filter to games BEFORE this week
    prior_games = games_df[
        (games_df['season'] < current_season) |
        ((games_df['season'] == current_season) &
         (games_df['week'] < current_week))
    ]

    # Now calculate features from prior_games only
    # ...
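
As a concrete illustration of the fix, a season-to-date average becomes leak-free when you take an expanding mean and shift it by one game, so the value entering week N only reflects weeks 1 through N-1 (the team_games layout and column names are assumptions):

import pandas as pd

def season_to_date_ppg(team_games: pd.DataFrame) -> pd.Series:
    """Points per game entering each week, never including the current game.

    Assumes one row per (team, game) with 'team', 'season', 'week',
    and 'points_scored' columns.
    """
    team_games = team_games.sort_values(['season', 'week'])
    return (
        team_games
        .groupby(['team', 'season'])['points_scored']
        # First game of a season is NaN by design: no prior data, no feature.
        .transform(lambda s: s.expanding().mean().shift(1))
    )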

Post-Fix Results

Metric        Training   Test
Accuracy      65.1%      60.8%
Brier Score   0.202      0.221

Progress: Closer to baseline, but still showing overfitting and not beating Elo.


Phase 4: Feature Engineering

Hypothesis: Differential Features

Instead of raw team stats, use differences:

improved_features = {
    # Differentials (home - away perspective)
    'ppg_diff': home_ppg - away_ppg,
    'defensive_diff': away_ppg_allowed - home_ppg_allowed,
    'yardage_diff': home_ypg - away_ypg,
    'turnover_diff': home_to_margin - away_to_margin,

    # Matchup features
    'home_off_vs_away_def': home_ppg - away_ppg_allowed,
    'away_off_vs_home_def': away_ppg - home_ppg_allowed,

    # Form indicators
    'recent_form_diff': home_last_4_pct - away_last_4_pct,
    'momentum': home_streak - away_streak,

    # Situational
    'rest_advantage': home_rest - away_rest,
    'travel_factor': away_miles_traveled / 1000,

    # Historical
    'h2h_home_advantage': home_h2h_win_pct if h2h_games > 3 else 0.5,

    # 12 total features
}
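
In practice these columns come from joining leak-free, season-to-date team stats onto each game row and differencing them; a minimal sketch, assuming a games DataFrame that already carries home_* / away_* columns computed as in Phase 3:

import pandas as pd

def add_differential_features(games: pd.DataFrame) -> pd.DataFrame:
    """Derive home-minus-away differentials from per-team columns.

    Column names follow the feature lists above but are assumptions
    about the underlying data layout.
    """
    out = games.copy()
    out['ppg_diff'] = out['home_ppg'] - out['away_ppg']
    out['defensive_diff'] = out['away_ppg_allowed'] - out['home_ppg_allowed']
    out['home_off_vs_away_def'] = out['home_ppg'] - out['away_ppg_allowed']
    out['away_off_vs_home_def'] = out['away_ppg'] - out['home_ppg_allowed']
    out['rest_advantage'] = out['home_rest_days'] - out['away_rest_days']
    return out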

Why Differentials Work Better

  1. Directly encode prediction target: We're predicting who wins, not absolute performance
  2. Reduce multicollinearity: Raw features are highly correlated
  3. Simpler model needed: Fewer, more informative features

Results with Differential Features

Metric        Training   Test    Elo Baseline
Accuracy      62.5%      61.4%   61.2%
Brier Score   0.214      0.216   0.218

Breakthrough: Finally beating the baseline, with minimal overfitting.


Phase 5: Model Selection

Comparing Algorithms

We tested multiple algorithms with our refined feature set:

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

models_tested = {
    'XGBoost': XGBClassifier(max_depth=4, n_estimators=100),
    'Random Forest': RandomForestClassifier(max_depth=6, n_estimators=200),
    'Logistic Regression': LogisticRegression(C=0.5),
    'LightGBM': LGBMClassifier(max_depth=4, n_estimators=100),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(32, 16), alpha=0.1)
}

Results Across Models

Model                 Test Accuracy   Test Brier   Train-Test Gap
XGBoost               61.4%           0.216        1.1%
LightGBM              61.6%           0.215        0.9%
Random Forest         60.8%           0.219        1.5%
Logistic Regression   60.2%           0.220        0.3%
Neural Network        59.5%           0.224        3.2%

Winner: LightGBM, which edged out XGBoost on both test accuracy and Brier score while keeping a small train-test gap.


Phase 6: Hyperparameter Tuning

param_grid = {
    'max_depth': [3, 4, 5],
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.05, 0.1, 0.15],
    'min_child_samples': [10, 20, 30],
    'reg_alpha': [0.5, 1.0, 2.0],
    'reg_lambda': [0.5, 1.0, 2.0]
}

# Use temporal CV, not random CV
cv_splitter = NFLTemporalCV(min_train_seasons=2)
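
NFLTemporalCV comes from the earlier validation chapters and is not reproduced here; a minimal season-ordered splitter in the same spirit (its internals are an assumption) might look like:

import numpy as np

class NFLTemporalCV:
    """Walk-forward splits: train on past seasons, validate on the next one."""

    def __init__(self, min_train_seasons: int = 2):
        self.min_train_seasons = min_train_seasons

    def split(self, X, y=None, seasons=None):
        """Yield (train_idx, val_idx) index pairs in season order."""
        seasons = np.asarray(seasons)
        ordered = np.sort(np.unique(seasons))
        for i in range(self.min_train_seasons, len(ordered)):
            train_idx = np.where(np.isin(seasons, ordered[:i]))[0]
            val_idx = np.where(seasons == ordered[i])[0]
            yield train_idx, val_idx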

Optimal Parameters Found

optimal_params = {
    'max_depth': 4,
    'n_estimators': 100,
    'learning_rate': 0.1,
    'min_child_samples': 20,
    'reg_alpha': 1.0,
    'reg_lambda': 1.0
}

Key Finding: Moderate regularization was crucial. Both under- and over-regularized models performed worse.


Phase 7: Ensemble Construction

Building the Ensemble

Given that different models capture different patterns, we built an ensemble:

ensemble_components = [
    ('lgbm', lgbm_model, 0.40),       # Primary model
    ('xgb', xgb_model, 0.35),          # Strong alternative
    ('logreg', logreg_model, 0.25),    # Stabilizing influence
]

def ensemble_predict(X):
    predictions = []
    for name, model, weight in ensemble_components:
        pred = model.predict_proba(X)[:, 1]
        predictions.append(pred * weight)
    return sum(predictions)

Why Include Logistic Regression?

Despite lower individual accuracy, logistic regression:

  • Makes different errors than tree-based models
  • Provides probability calibration
  • Reduces overall variance

Ensemble Results

Model            Test Accuracy   Test Brier
LightGBM alone   61.6%           0.215
XGBoost alone    61.4%           0.216
Ensemble         62.1%           0.213
Elo Baseline     61.2%           0.218

Final Edge: +0.9% accuracy, -0.005 Brier score vs baseline.


Phase 8: Calibration Check

Probability Calibration

Our model's predicted probabilities should match reality:

Predicted Range   Games   Actual Win %   Calibration Error
45-50%            180     47.2%          -0.3%
50-55%            250     52.4%          0.1%
55-60%            280     56.8%          -0.7%
60-65%            210     62.4%          -0.1%
65-70%            150     67.3%          0.2%
70-75%            90      73.3%          1.1%
75-80%            40      77.5%          0.8%

Verdict: Well-calibrated. Average absolute error: 0.5%.
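
A table like this can be produced by bucketing predicted probabilities and comparing each bucket's empirical win rate to its mean prediction; a small sketch, assuming arrays of predicted probabilities and 0/1 outcomes:

import numpy as np
import pandas as pd

def calibration_table(probs, outcomes):
    """Group predictions into 5-point buckets and measure calibration."""
    bin_edges = np.linspace(0.45, 0.80, 8)   # 45-50%, 50-55%, ..., 75-80%
    df = pd.DataFrame({
        'bucket': pd.cut(probs, bin_edges),
        'prob': probs,
        'won': outcomes,
    })
    table = df.groupby('bucket', observed=True).agg(
        games=('won', 'size'),
        predicted=('prob', 'mean'),
        actual=('won', 'mean'),
    )
    table['calibration_error'] = table['actual'] - table['predicted']
    return table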


Phase 9: Production Deployment

Weekly Pipeline

class WeeklyPredictionPipeline:
    """
    Production pipeline for weekly predictions.
    """

    def run_weekly(self, week: int, season: int):
        """Generate predictions for upcoming week."""

        # 1. Update statistics through last week
        self.update_team_stats(week - 1, season)

        # 2. Generate features for upcoming games
        upcoming_games = self.get_week_schedule(week, season)
        features = self.generate_features(upcoming_games)

        # 3. Generate predictions
        predictions = self.ensemble_predict(features)

        # 4. Add context
        predictions = self.add_context(predictions, upcoming_games)

        # 5. Store for tracking
        self.store_predictions(predictions)

        return predictions

    def add_context(self, preds, games):
        """Add feature contributions and comparisons."""
        for i, game in enumerate(games):
            preds[i]['key_factors'] = self.get_shap_summary(game)
            preds[i]['elo_comparison'] = self.get_elo_prediction(game)
            preds[i]['market_line'] = self.get_market_line(game)
        return preds
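
get_shap_summary is referenced but not shown above; one plausible implementation uses the shap library's TreeExplainer on the primary LightGBM component (the helper name and signature are assumptions about the pipeline):

import shap

def get_shap_summary(model, game_row, feature_names, top_n=3):
    """Return the largest feature contributions for a single game.

    `game_row` is a 1-row 2D array of the game's features.
    """
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(game_row)
    # Some shap versions return [class0, class1] for binary classifiers
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    contributions = dict(zip(feature_names, shap_values[0]))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_n]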

Monitoring

import numpy as np

class ModelMonitor:
    """Track model performance over time."""

    def __init__(self):
        self.weekly_results = []

    def record_week(self, predictions, actuals):
        """Record weekly performance."""
        accuracy = (predictions.round() == actuals).mean()
        brier = ((predictions - actuals) ** 2).mean()

        self.weekly_results.append({
            'accuracy': accuracy,
            'brier': brier,
            'n_games': len(predictions)
        })

        # Alert if performance degrading
        if len(self.weekly_results) >= 4:
            recent_brier = np.mean([w['brier'] for w in self.weekly_results[-4:]])
            if recent_brier > 0.230:
                self.send_alert("Model performance degrading")

    def should_retrain(self):
        """Determine if model needs retraining."""
        if len(self.weekly_results) < 4:
            return False

        recent_brier = np.mean([w['brier'] for w in self.weekly_results[-4:]])
        return recent_brier > 0.225
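
A quick usage sketch with dummy arrays (values are illustrative):

import numpy as np

monitor = ModelMonitor()
week_probs = np.array([0.62, 0.55, 0.71, 0.48])   # predicted home win probabilities
week_outcomes = np.array([1, 0, 1, 1])            # actual results
monitor.record_week(week_probs, week_outcomes)
print(monitor.should_retrain())   # False until at least four weeks are recorded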

Final Results: Full 2023 Season

Head-to-Head vs Elo

Week   ML Model   Elo     Winner
1      9/16       8/16    ML
2      10/16      9/16    ML
3      9/16       10/16   Elo
4      11/16      10/16   ML
...    ...        ...     ...
17     10/16      10/16   Tie

Season Totals:

  • ML Model: 167/267 = 62.5%
  • Elo Baseline: 163/267 = 61.0%
  • Market Lines: 168/267 = 62.9%

Performance by Game Type

Game Type              ML Accuracy   Elo Accuracy
Divisional             58.3%         56.7%
Primetime              65.4%         63.1%
Blowouts (10+ pts)     72.1%         70.8%
Close games (<7 pts)   54.2%         53.1%

Lessons Learned

What Worked

  1. Differential features over raw statistics: Directly encoding what we're predicting
  2. Aggressive regularization: Prevented overfitting to limited NFL data
  3. Ensemble approach: Combining diverse models reduced variance
  4. Strict temporal validation: Caught leakage and overfitting early
  5. Simple but thoughtful features: 12 features beat 35 poorly-chosen ones

What Didn't Work

  1. Complex neural networks: Overfitted despite regularization
  2. Too many features: More noise than signal
  3. Chasing accuracy: Brier score was a better north star
  4. Weekly retraining: Added noise without improving predictions

Key Insights

On beating baselines:

  • Elo is harder to beat than it looks
  • A 1-2% improvement is meaningful
  • Consistent improvement matters more than occasional big wins

On feature engineering:

  • Domain knowledge beats feature count
  • Differential features capture the prediction target directly
  • Simpler is often better

On validation:

  • Temporal CV is non-negotiable
  • Watch for train-test gaps
  • Monitor production performance continuously


Future Improvements

Near-Term

  • Add weather features for outdoor games
  • Incorporate injury reports
  • Include referee tendencies

Medium-Term

  • Play-by-play derived features (EPA, success rate)
  • Bayesian uncertainty quantification
  • Game script features (expected closeness)

Long-Term

  • Real-time in-game updating
  • Player-level modeling
  • Video-based features (formation, alignment)

Conclusion

Building an ML system that beats Elo requires discipline more than sophistication. The keys are:

  1. Establish a strong baseline
  2. Engineer features carefully, avoiding leakage
  3. Use temporal validation religiously
  4. Prefer simpler models with regularization
  5. Ensemble for robustness
  6. Monitor continuously

The marginal improvement (1-2%) may seem small, but over the 2023 season it translated into four more correct calls than the Elo baseline (167 vs. 163 of 267 games). Combined with proper uncertainty quantification, this edge enables better decision-making, whether for analysis, fantasy sports, or simply understanding the game.