
Chapter 20: Machine Learning for NFL Prediction

When more data meets better algorithms


Introduction

Machine learning has transformed how we approach NFL prediction. While simple rating systems like Elo perform surprisingly well, ML methods can capture complex patterns that linear models miss: interactions between team attributes, non-linear effects, and subtle signals buried in play-by-play data.

This chapter covers the practical application of machine learning to NFL prediction. We'll explore feature engineering, model selection, evaluation strategies, and the pitfalls that trap unwary practitioners. By the end, you'll understand when ML adds value, when simpler methods suffice, and how to build robust prediction systems that generalize beyond training data.


Learning Objectives

By the end of this chapter, you will be able to:

  1. Engineer effective features from NFL data for ML models
  2. Select appropriate algorithms for different prediction tasks
  3. Implement proper temporal cross-validation for NFL data
  4. Build and evaluate gradient boosting models for game prediction
  5. Understand neural network approaches for NFL analytics
  6. Avoid common ML pitfalls in sports prediction
  7. Combine ML outputs with traditional methods

Part 1: The Case for Machine Learning

When ML Helps

Machine learning excels when:

  1. High-dimensional data: Many features with unknown interactions
  2. Non-linear relationships: Complex patterns beyond linear combinations
  3. Large datasets: Enough data to learn nuanced patterns
  4. Feature engineering potential: Raw data can be transformed meaningfully

When ML Struggles

Machine learning underperforms when:

  1. Limited data: NFL provides ~270 games per season—small by ML standards
  2. High noise: NFL outcomes have high variance (~13.5 point standard deviation)
  3. Concept drift: Team compositions change, requiring constant adaptation
  4. Overfitting risk: Easy to memorize training data without learning generalizable patterns

The NFL ML Challenge

NFL prediction occupies an awkward middle ground: complex enough to benefit from sophisticated methods, but data-limited enough to punish overconfidence. Success requires:

  • Thoughtful feature engineering
  • Aggressive regularization
  • Rigorous temporal validation
  • Skepticism about complex models

Part 2: Feature Engineering

The Foundation of ML Success

Feature engineering—transforming raw data into predictive signals—often matters more than algorithm choice. Good features make simple models powerful; bad features make complex models useless.

Categories of NFL Features

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import pandas as pd
import numpy as np

@dataclass
class NFLFeatureEngineer:
    """
    Comprehensive feature engineering for NFL prediction.

    Generates features across multiple categories:
    - Team strength metrics
    - Recent form indicators
    - Situational factors
    - Matchup features
    """

    lookback_weeks: int = 8
    min_games: int = 4

    def create_features(self, games: pd.DataFrame,
                        team_stats: pd.DataFrame,
                        current_week: int) -> pd.DataFrame:
        """
        Generate all features for games in a specific week.

        Args:
            games: Games to generate features for
            team_stats: Historical statistics
            current_week: Week number for temporal filtering

        Returns:
            DataFrame with all features
        """
        features = []

        for _, game in games.iterrows():
            home = game['home_team']
            away = game['away_team']

            feature_dict = {
                'game_id': game.get('game_id', ''),
                'week': game.get('week', current_week),
                'home_team': home,
                'away_team': away
            }

            # Add all feature categories
            feature_dict.update(
                self._team_strength_features(home, away, team_stats, current_week)
            )
            feature_dict.update(
                self._recent_form_features(home, away, team_stats, current_week)
            )
            feature_dict.update(
                self._situational_features(game, team_stats, current_week)
            )
            feature_dict.update(
                self._matchup_features(home, away, team_stats, current_week)
            )

            features.append(feature_dict)

        return pd.DataFrame(features)

    def _team_strength_features(self, home: str, away: str,
                                 stats: pd.DataFrame, week: int) -> Dict:
        """
        Team strength metrics: points, yards, efficiency.
        """
        prior_stats = stats[stats['week'] < week]

        home_stats = prior_stats[prior_stats['team'] == home]
        away_stats = prior_stats[prior_stats['team'] == away]

        def safe_mean(df, col, default=0):
            if len(df) < self.min_games:
                return default
            return df[col].mean() if col in df.columns else default

        return {
            # Offensive metrics
            'home_ppg': safe_mean(home_stats, 'points_for', 21),
            'away_ppg': safe_mean(away_stats, 'points_for', 21),
            'home_ypg': safe_mean(home_stats, 'yards', 330),
            'away_ypg': safe_mean(away_stats, 'yards', 330),
            'home_pass_ypg': safe_mean(home_stats, 'pass_yards', 220),
            'away_pass_ypg': safe_mean(away_stats, 'pass_yards', 220),
            'home_rush_ypg': safe_mean(home_stats, 'rush_yards', 110),
            'away_rush_ypg': safe_mean(away_stats, 'rush_yards', 110),

            # Defensive metrics
            'home_ppg_allowed': safe_mean(home_stats, 'points_against', 21),
            'away_ppg_allowed': safe_mean(away_stats, 'points_against', 21),
            'home_ypg_allowed': safe_mean(home_stats, 'yards_allowed', 330),
            'away_ypg_allowed': safe_mean(away_stats, 'yards_allowed', 330),

            # Turnover metrics
            'home_turnover_margin': safe_mean(home_stats, 'turnover_margin', 0),
            'away_turnover_margin': safe_mean(away_stats, 'turnover_margin', 0),

            # Differential features
            'ppg_diff': safe_mean(home_stats, 'points_for', 21) - safe_mean(away_stats, 'points_for', 21),
            'defensive_diff': safe_mean(away_stats, 'points_against', 21) - safe_mean(home_stats, 'points_against', 21)
        }

    def _recent_form_features(self, home: str, away: str,
                               stats: pd.DataFrame, week: int) -> Dict:
        """
        Recent performance indicators (last N weeks).
        """
        recent_cutoff = max(1, week - self.lookback_weeks)
        recent_stats = stats[(stats['week'] >= recent_cutoff) & (stats['week'] < week)]

        home_recent = recent_stats[recent_stats['team'] == home]
        away_recent = recent_stats[recent_stats['team'] == away]

        def get_form(df):
            if len(df) < 2:
                return 0.5
            wins = (df['result'] == 'W').sum()
            return wins / len(df)

        def get_streak(df):
            if len(df) == 0:
                return 0
            streak = 0
            last_result = None
            for result in df.sort_values('week', ascending=False)['result']:
                if last_result is None:
                    last_result = result
                    streak = 1 if result == 'W' else -1
                elif result == last_result:
                    streak += 1 if result == 'W' else -1
                else:
                    break
            return streak

        return {
            'home_recent_win_pct': get_form(home_recent),
            'away_recent_win_pct': get_form(away_recent),
            'home_streak': get_streak(home_recent),
            'away_streak': get_streak(away_recent),
            'form_diff': get_form(home_recent) - get_form(away_recent)
        }

    def _situational_features(self, game: pd.Series,
                               stats: pd.DataFrame, week: int) -> Dict:
        """
        Situational factors: rest, travel, divisional.
        """
        home = game['home_team']
        away = game['away_team']

        prior_stats = stats[stats['week'] < week]

        def get_rest_days(team, current_week):
            team_games = prior_stats[prior_stats['team'] == team]
            if len(team_games) == 0:
                return 7  # Default rest
            last_game = team_games[team_games['week'] == team_games['week'].max()]
            if len(last_game) == 0:
                return 7
            weeks_since = current_week - last_game['week'].iloc[0]
            return weeks_since * 7  # Approximate

        return {
            'home_rest_days': get_rest_days(home, week),
            'away_rest_days': get_rest_days(away, week),
            'rest_advantage': get_rest_days(home, week) - get_rest_days(away, week),
            'is_divisional': 1 if game.get('is_divisional', False) else 0,
            'week': week
        }

    def _matchup_features(self, home: str, away: str,
                           stats: pd.DataFrame, week: int) -> Dict:
        """
        Matchup-specific features.
        """
        prior_stats = stats[stats['week'] < week]

        home_stats = prior_stats[prior_stats['team'] == home]
        away_stats = prior_stats[prior_stats['team'] == away]

        def safe_mean(df, col, default=0):
            return df[col].mean() if len(df) > 0 and col in df.columns else default

        # Style matchup features
        home_pass_pct = safe_mean(home_stats, 'pass_pct', 0.6)
        away_pass_pct = safe_mean(away_stats, 'pass_pct', 0.6)

        home_pass_def = safe_mean(home_stats, 'pass_yards_allowed', 220)
        away_pass_def = safe_mean(away_stats, 'pass_yards_allowed', 220)

        return {
            # Offensive vs defensive matchups
            'home_off_vs_away_def': safe_mean(home_stats, 'points_for', 21) - safe_mean(away_stats, 'points_against', 21),
            'away_off_vs_home_def': safe_mean(away_stats, 'points_for', 21) - safe_mean(home_stats, 'points_against', 21),

            # Style matchups
            'pass_style_diff': home_pass_pct - away_pass_pct,
            'home_pass_off_vs_away_pass_def': (home_pass_pct * 100) - (away_pass_def / 10),
            'away_pass_off_vs_home_pass_def': (away_pass_pct * 100) - (home_pass_def / 10)
        }

Feature Importance Principles

1. Predictive Power: Some features predict outcomes; others are noise. Validate each feature's contribution.

2. Independence: Highly correlated features add redundancy without new information (see the pruning sketch after this list). Consider:
  • Points per game and yards per game (correlated)
  • Offensive efficiency and turnover margin (somewhat independent)

3. Temporal Validity: Features must only use information available before the game. Never include:
  • Future game results
  • The opponent's performance in the current game
  • Post-game statistics

4. Stability: Stable features (team strength) predict better than volatile ones (single-game stats).
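
A minimal pruning sketch for the independence principle (the threshold and helper name are illustrative, not from any library):

def drop_redundant_features(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute correlation exceeds threshold."""
    corr = X.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            # Only drop cols[j] if its partner cols[i] is still being kept
            if corr.iloc[i, j] > threshold and cols[i] not in to_drop:
                to_drop.add(cols[j])
    return X.drop(columns=list(to_drop))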


Part 3: Model Selection

Gradient Boosting: The NFL ML Workhorse

Gradient boosting dominates NFL prediction for good reasons:

  • Handles mixed feature types (numeric, categorical)
  • Captures non-linear relationships
  • Built-in feature importance
  • Robust to outliers with proper regularization

import xgboost as xgb

class NFLGradientBoostingPredictor:
    """
    Gradient boosting model for NFL game prediction.

    Uses XGBoost for efficient training; LightGBM is a drop-in alternative.
    Supports both classification (win/loss) and regression (spread).
    """

    def __init__(self, task: str = 'classification',
                 n_estimators: int = 100,
                 max_depth: int = 4,
                 learning_rate: float = 0.1,
                 min_child_weight: int = 10,
                 subsample: float = 0.8,
                 colsample_bytree: float = 0.8,
                 reg_alpha: float = 1.0,
                 reg_lambda: float = 1.0):
        """
        Initialize model with hyperparameters.

        Key regularization parameters:
        - max_depth: Limits tree complexity (lower = simpler)
        - min_child_weight: Minimum samples per leaf
        - reg_alpha/lambda: L1/L2 regularization
        """
        self.task = task
        self.params = {
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'learning_rate': learning_rate,
            'min_child_weight': min_child_weight,
            'subsample': subsample,
            'colsample_bytree': colsample_bytree,
            'reg_alpha': reg_alpha,
            'reg_lambda': reg_lambda,
            'random_state': 42,
            'n_jobs': -1
        }

        if task == 'classification':
            self.model = xgb.XGBClassifier(**self.params)
        else:
            self.model = xgb.XGBRegressor(**self.params)

        self.feature_names = None
        self.feature_importance = None

    def fit(self, X: pd.DataFrame, y: pd.Series,
            eval_set: Tuple = None) -> 'NFLGradientBoostingPredictor':
        """
        Train the model.

        Args:
            X: Feature matrix
            y: Target (1/0 for classification, spread for regression)
            eval_set: Optional (X_val, y_val) for early stopping
        """
        self.feature_names = X.columns.tolist()

        fit_params = {}
        if eval_set is not None:
            fit_params['eval_set'] = [eval_set]
            # XGBoost < 2.0 accepts early stopping at fit time; in 2.0+,
            # pass early_stopping_rounds to the constructor instead.
            fit_params['early_stopping_rounds'] = 20
            fit_params['verbose'] = False
        self.model.fit(X, y, **fit_params)

        # Extract feature importance
        self.feature_importance = pd.DataFrame({
            'feature': self.feature_names,
            'importance': self.model.feature_importances_
        }).sort_values('importance', ascending=False)

        return self

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict win probability (classification only)."""
        if self.task != 'classification':
            raise ValueError("predict_proba only for classification task")
        return self.model.predict_proba(X)[:, 1]

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict outcome (class or spread)."""
        return self.model.predict(X)

    def get_feature_importance(self, top_n: int = 20) -> pd.DataFrame:
        """Get top N most important features."""
        return self.feature_importance.head(top_n)

Random Forests: An Alternative

Random forests offer similar benefits with different tradeoffs:

  • More stable (bagging vs. boosting)
  • Less prone to overfitting
  • Generally slightly lower accuracy
  • Easier to parallelize

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

class NFLRandomForestPredictor:
    """
    Random forest model for NFL prediction.

    More stable than gradient boosting, potentially less accurate.
    """

    def __init__(self, task: str = 'classification',
                 n_estimators: int = 200,
                 max_depth: int = 6,
                 min_samples_leaf: int = 20,
                 max_features: str = 'sqrt'):
        self.task = task

        if task == 'classification':
            self.model = RandomForestClassifier(
                n_estimators=n_estimators,
                max_depth=max_depth,
                min_samples_leaf=min_samples_leaf,
                max_features=max_features,
                random_state=42,
                n_jobs=-1
            )
        else:
            self.model = RandomForestRegressor(
                n_estimators=n_estimators,
                max_depth=max_depth,
                min_samples_leaf=min_samples_leaf,
                max_features=max_features,
                random_state=42,
                n_jobs=-1
            )

    def fit(self, X: pd.DataFrame, y: pd.Series) -> 'NFLRandomForestPredictor':
        """Train model."""
        self.feature_names = X.columns.tolist()
        self.model.fit(X, y)
        return self

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict probabilities (classification)."""
        return self.model.predict_proba(X)[:, 1]

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict outcomes."""
        return self.model.predict(X)

Linear Models: The Baseline

Despite their simplicity, regularized linear models often perform well:

from sklearn.linear_model import LogisticRegression, Ridge

class NFLLinearPredictor:
    """
    Regularized linear model for NFL prediction.

    Serves as a strong baseline and is interpretable.
    """

    def __init__(self, task: str = 'classification',
                 regularization: float = 1.0):
        self.task = task

        if task == 'classification':
            self.model = LogisticRegression(C=regularization, max_iter=1000)
        else:
            self.model = Ridge(alpha=regularization)

        self.feature_names = None
        self.coefficients = None

    def fit(self, X: pd.DataFrame, y: pd.Series) -> 'NFLLinearPredictor':
        """Train model."""
        self.feature_names = X.columns.tolist()
        self.model.fit(X, y)

        # Store coefficients
        coefs = self.model.coef_.flatten()
        self.coefficients = pd.DataFrame({
            'feature': self.feature_names,
            'coefficient': coefs
        }).sort_values('coefficient', ascending=False, key=abs)

        return self

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict probabilities (classification only)."""
        if self.task != 'classification':
            raise ValueError("predict_proba only for classification task")
        return self.model.predict_proba(X)[:, 1]

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict outcomes."""
        return self.model.predict(X)

Part 4: Temporal Cross-Validation

The Critical Difference

Standard cross-validation randomly splits data. For NFL prediction, this creates data leakage—future games inform past predictions. Temporal cross-validation maintains chronological order.

class NFLTemporalCV:
    """
    Temporal cross-validation for NFL prediction.

    Ensures train data always precedes test data chronologically.
    """

    def __init__(self, min_train_seasons: int = 2,
                 test_weeks: int = 17,
                 gap_weeks: int = 0):
        """
        Initialize temporal CV.

        Args:
            min_train_seasons: Minimum seasons in training set
            test_weeks: Weeks to use as test set
            gap_weeks: Gap between train and test (prevents leakage)
        """
        self.min_train_seasons = min_train_seasons
        self.test_weeks = test_weeks
        self.gap_weeks = gap_weeks

    def split(self, games: pd.DataFrame) -> List[Tuple[pd.Index, pd.Index]]:
        """
        Generate train/test splits.

        Each split uses all prior data for training and
        one season (or portion) for testing.

        Returns:
            List of (train_indices, test_indices) tuples
        """
        games = games.sort_values(['season', 'week']).copy()
        seasons = sorted(games['season'].unique())

        splits = []

        for i, test_season in enumerate(seasons[self.min_train_seasons:], self.min_train_seasons):
            # Training: all seasons before test
            train_mask = games['season'] < test_season

            # Optional gap: drop the last gap_weeks of the prior season.
            # The hard-coded 17 assumes a 17-week schedule; use 18 for 2021+.
            if self.gap_weeks > 0:
                train_mask &= ~(
                    (games['season'] == test_season - 1) &
                    (games['week'] > 17 - self.gap_weeks)
                )

            # Test: the test season
            test_mask = games['season'] == test_season

            if train_mask.sum() > 0 and test_mask.sum() > 0:
                splits.append((
                    games[train_mask].index,
                    games[test_mask].index
                ))

        return splits

    def walk_forward(self, games: pd.DataFrame,
                     train_weeks: int = 52) -> List[Tuple[pd.Index, pd.Index]]:
        """
        Walk-forward validation: predict one week at a time.

        Uses rolling training window.

        Returns:
            List of (train_indices, test_indices) for each week
        """
        games = games.sort_values(['season', 'week']).copy()
        games['global_week'] = games['season'] * 100 + games['week']

        all_weeks = sorted(games['global_week'].unique())
        splits = []

        for i, test_week in enumerate(all_weeks):
            if i < train_weeks:
                continue  # Need minimum training data

            # Rolling window for training
            train_weeks_list = all_weeks[max(0, i - train_weeks):i]
            train_mask = games['global_week'].isin(train_weeks_list)
            test_mask = games['global_week'] == test_week

            if train_mask.sum() > 0 and test_mask.sum() > 0:
                splits.append((
                    games[train_mask].index,
                    games[test_mask].index
                ))

        return splits
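
A quick sketch of using the season-level splitter, assuming games carries season and week columns as elsewhere in this chapter:

cv = NFLTemporalCV(min_train_seasons=2)
for train_idx, test_idx in cv.split(games):
    train, test = games.loc[train_idx], games.loc[test_idx]
    print(f"train seasons through {train['season'].max()}, "
          f"test season {test['season'].iloc[0]}: {len(test)} games")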

Evaluating with Temporal CV

class NFLModelEvaluator:
    """
    Comprehensive model evaluation using temporal CV.
    """

    def evaluate_with_cv(self, model_class, model_params: Dict,
                         X: pd.DataFrame, y: pd.Series,
                         games: pd.DataFrame,
                         cv_splitter) -> Dict:
        """
        Evaluate model using temporal cross-validation.

        Args:
            model_class: Model class to instantiate
            model_params: Parameters for model
            X: Feature matrix
            y: Target
            games: DataFrame with season/week for splitting
            cv_splitter: Temporal CV object

        Returns:
            Aggregated evaluation metrics
        """
        splits = cv_splitter.split(games)

        all_predictions = []
        all_actuals = []
        fold_results = []

        for fold, (train_idx, test_idx) in enumerate(splits):
            # Split data
            X_train, X_test = X.loc[train_idx], X.loc[test_idx]
            y_train, y_test = y.loc[train_idx], y.loc[test_idx]

            # Train model
            model = model_class(**model_params)
            model.fit(X_train, y_train)

            # Predict
            if hasattr(model, 'predict_proba'):
                preds = model.predict_proba(X_test)
            else:
                preds = model.predict(X_test)

            # Collect predictions
            all_predictions.extend(preds)
            all_actuals.extend(y_test)

            # Calculate fold metrics
            fold_results.append({
                'fold': fold,
                'train_size': len(train_idx),
                'test_size': len(test_idx),
                'accuracy': (preds.round() == y_test).mean() if hasattr(model, 'predict_proba') else None,
                'brier': ((preds - y_test) ** 2).mean() if hasattr(model, 'predict_proba') else None
            })

        # Aggregate results
        predictions = np.array(all_predictions)
        actuals = np.array(all_actuals)

        return {
            'n_folds': len(splits),
            'total_predictions': len(predictions),
            'accuracy': ((predictions.round() == actuals).mean()
                        if model_params.get('task', 'classification') == 'classification' else None),
            'brier_score': float(np.mean((predictions - actuals) ** 2)),
            'mae': float(np.mean(np.abs(predictions - actuals))),
            'fold_details': pd.DataFrame(fold_results)
        }

Part 5: Neural Networks for NFL

When Neural Networks Help

Neural networks can capture complex patterns but require:

  • Substantial training data (more than NFL typically provides)
  • Careful architecture design
  • Extensive regularization

A Simple Architecture

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class NFLNeuralNetwork(nn.Module):
    """
    Simple feedforward neural network for NFL prediction.

    Architecture: Input -> [Linear, BatchNorm, ReLU, Dropout] -> [Linear, ReLU, Dropout] -> Output
    """

    def __init__(self, input_dim: int, hidden_dim: int = 64,
                 dropout: float = 0.3, task: str = 'classification'):
        super().__init__()

        self.task = task

        self.network = nn.Sequential(
            # Conventional ordering: Linear -> BatchNorm -> activation -> Dropout
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),

            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),

            nn.Linear(hidden_dim // 2, 1)
        )

        if task == 'classification':
            self.output_activation = nn.Sigmoid()
        else:
            self.output_activation = nn.Identity()

    def forward(self, x):
        logits = self.network(x)
        return self.output_activation(logits)


class NFLNeuralPredictor:
    """
    Wrapper for training and using neural network.
    """

    def __init__(self, input_dim: int, hidden_dim: int = 64,
                 dropout: float = 0.3, task: str = 'classification',
                 learning_rate: float = 0.001, epochs: int = 100,
                 batch_size: int = 32, weight_decay: float = 0.01):

        self.task = task
        self.epochs = epochs
        self.batch_size = batch_size

        self.model = NFLNeuralNetwork(input_dim, hidden_dim, dropout, task)
        self.optimizer = torch.optim.Adam(
            self.model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay
        )

        if task == 'classification':
            self.criterion = nn.BCELoss()
        else:
            self.criterion = nn.MSELoss()

        self.scaler = None
        self.history = []

    def fit(self, X: pd.DataFrame, y: pd.Series,
            X_val: pd.DataFrame = None, y_val: pd.Series = None) -> 'NFLNeuralPredictor':
        """
        Train neural network.

        Args:
            X: Training features
            y: Training targets
            X_val: Validation features (for early stopping)
            y_val: Validation targets
        """
        from sklearn.preprocessing import StandardScaler

        # Standardize features
        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X)

        # Convert to tensors
        X_tensor = torch.FloatTensor(X_scaled)
        y_tensor = torch.FloatTensor(y.values).unsqueeze(1)

        dataset = TensorDataset(X_tensor, y_tensor)
        loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True)

        # Validation data
        if X_val is not None:
            X_val_scaled = self.scaler.transform(X_val)
            X_val_tensor = torch.FloatTensor(X_val_scaled)
            y_val_tensor = torch.FloatTensor(y_val.values).unsqueeze(1)

        best_val_loss = float('inf')
        patience_counter = 0
        patience = 10

        for epoch in range(self.epochs):
            self.model.train()
            epoch_loss = 0

            for batch_X, batch_y in loader:
                self.optimizer.zero_grad()
                outputs = self.model(batch_X)
                loss = self.criterion(outputs, batch_y)
                loss.backward()
                self.optimizer.step()
                epoch_loss += loss.item()

            # Validation
            if X_val is not None:
                self.model.eval()
                with torch.no_grad():
                    val_outputs = self.model(X_val_tensor)
                    val_loss = self.criterion(val_outputs, y_val_tensor).item()

                self.history.append({
                    'epoch': epoch,
                    'train_loss': epoch_loss / len(loader),
                    'val_loss': val_loss
                })

                # Early stopping
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    patience_counter = 0
                else:
                    patience_counter += 1
                    if patience_counter >= patience:
                        break
            else:
                self.history.append({
                    'epoch': epoch,
                    'train_loss': epoch_loss / len(loader)
                })

        return self

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict probabilities."""
        self.model.eval()
        X_scaled = self.scaler.transform(X)
        X_tensor = torch.FloatTensor(X_scaled)

        with torch.no_grad():
            outputs = self.model(X_tensor)

        return outputs.numpy().flatten()

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict outcomes."""
        probs = self.predict_proba(X)
        if self.task == 'classification':
            return (probs > 0.5).astype(int)
        return probs

Advanced Architectures

For richer data (play-by-play), more complex architectures may help:

Recurrent Networks (LSTM/GRU): Process sequential play data to capture game flow.
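
As a rough illustration (not a tuned architecture), a minimal PyTorch sketch of this idea might look like the following; the dimensions and names are assumptions:

import torch
import torch.nn as nn

class PlaySequenceLSTM(nn.Module):
    """Sketch: encode a (batch, n_plays, n_play_features) tensor into a win probability."""

    def __init__(self, n_play_features: int, hidden_dim: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_play_features, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, play_seq: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(play_seq)          # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h_n[-1]))  # (batch, 1) win probabilities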

Attention Mechanisms: Focus on key plays or moments that predict outcomes.

Graph Neural Networks: Model player interactions and team structure.

These require substantial data and expertise. For most practitioners, gradient boosting remains more practical.


Part 6: Common Pitfalls

1. Data Leakage

The most dangerous error: using information not available at prediction time.

Examples:

  • Including final game statistics in features
  • Using season-end metrics for mid-season predictions
  • Random train/test splits crossing weeks

Prevention:

  • Audit every feature for temporal validity (a minimal audit sketch follows)
  • Use strict temporal validation
  • Review feature engineering with fresh eyes
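
A minimal audit sketch for the first point, assuming the stats frame carries a week column as in the feature code above:

def assert_no_future_stats(stats: pd.DataFrame, current_week: int) -> None:
    """Fail fast if any stat row comes from the current week or later."""
    leaked = stats[stats['week'] >= current_week]
    if len(leaked) > 0:
        raise ValueError(
            f"Temporal leakage: {len(leaked)} stat rows from week {current_week} or later"
        )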

2. Overfitting

Models memorize training data instead of learning patterns.

Signs:

  • Training accuracy much higher than test accuracy
  • Performance degrades on new seasons
  • Complex models underperform simple ones

Solutions:

  • Strong regularization (low max_depth, high min_samples)
  • Early stopping with a validation set
  • Cross-validation for hyperparameter selection
  • Feature selection to reduce noise

def check_for_overfitting(train_results: Dict, test_results: Dict) -> Dict:
    """
    Compare train and test performance to detect overfitting.
    """
    metrics = ['accuracy', 'brier_score', 'mae']
    overfitting_analysis = {}

    for metric in metrics:
        if metric in train_results and metric in test_results:
            train_val = train_results[metric]
            test_val = test_results[metric]

            # For accuracy, higher is better
            if metric == 'accuracy':
                gap = train_val - test_val
                concern_level = 'High' if gap > 0.10 else 'Medium' if gap > 0.05 else 'Low'
            # For loss metrics, lower is better
            else:
                gap = test_val - train_val
                concern_level = 'High' if gap > 0.02 else 'Medium' if gap > 0.01 else 'Low'

            overfitting_analysis[metric] = {
                'train': train_val,
                'test': test_val,
                'gap': gap,
                'concern': concern_level
            }

    return overfitting_analysis

3. Sample Size Issues

NFL provides limited data: roughly 270 games per season. This constrains:

  • How much model complexity the data can support
  • How many features can be used without overfitting
  • How wide confidence intervals will be

Guidelines:

  • Keep the feature count below 20-30
  • Use simpler models than typical ML applications
  • Report uncertainty estimates (see the bootstrap sketch below)
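
For the last guideline, a percentile-bootstrap sketch for putting an interval around accuracy (the resample count is illustrative):

def bootstrap_accuracy_ci(preds: np.ndarray, actuals: np.ndarray,
                          n_boot: int = 2000, alpha: float = 0.05) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for classification accuracy."""
    rng = np.random.default_rng(42)
    n = len(preds)
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample games with replacement
        accs.append((preds[idx].round() == actuals[idx]).mean())
    return (float(np.quantile(accs, alpha / 2)),
            float(np.quantile(accs, 1 - alpha / 2)))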

4. Ignoring Class Balance

If predicting rare events (e.g., upsets), class imbalance matters:

from sklearn.utils.class_weight import compute_class_weight

def get_balanced_weights(y: pd.Series) -> Dict:
    """Calculate class weights for imbalanced data."""
    classes = np.unique(y)
    weights = compute_class_weight('balanced', classes=classes, y=y)
    return dict(zip(classes, weights))
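
The resulting dictionary can be passed straight to estimators that accept per-class weights; a brief sketch with scikit-learn's random forest (class_weight is a real parameter; the data names are hypothetical):

# y_upset: hypothetical target, 1 for upset wins, 0 otherwise
weights = get_balanced_weights(y_upset)
clf = RandomForestClassifier(n_estimators=200, class_weight=weights, random_state=42)
clf.fit(X_train, y_upset)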

5. Feature Importance Misinterpretation

High importance doesn't mean causal effect:

  • Correlated features share importance
  • Random noise can appear important by chance
  • Importance varies between algorithms

Always validate important features with domain knowledge.
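
A useful statistical cross-check is permutation importance on held-out data, which measures how much shuffling each feature degrades predictions; a minimal sketch with scikit-learn (model here stands for any fitted sklearn-compatible classifier):

from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=20,
                                random_state=42, scoring='neg_brier_score')
for i in result.importances_mean.argsort()[::-1][:10]:
    print(f"{X_test.columns[i]}: {result.importances_mean[i]:.4f}")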


Part 7: Ensemble Methods

Combining Multiple Models

Different models capture different patterns. Combining them often improves predictions:

class NFLEnsemble:
    """
    Ensemble multiple ML models for robust predictions.
    """

    def __init__(self, models: List, weights: List[float] = None):
        """
        Initialize ensemble.

        Args:
            models: List of fitted models
            weights: Relative weights (default: equal)
        """
        self.models = models

        if weights is None:
            weights = [1.0] * len(models)
        self.weights = np.array(weights) / sum(weights)

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Ensemble probability prediction."""
        predictions = []

        for model in self.models:
            if hasattr(model, 'predict_proba'):
                pred = model.predict_proba(X)
            else:
                pred = model.predict(X)
            predictions.append(pred)

        # Weighted average
        ensemble_pred = np.average(predictions, axis=0, weights=self.weights)
        return ensemble_pred

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Ensemble prediction."""
        probs = self.predict_proba(X)
        return (probs > 0.5).astype(int)

    @staticmethod
    def optimize_weights(models: List, X_val: pd.DataFrame,
                         y_val: pd.Series) -> List[float]:
        """
        Find optimal ensemble weights using validation data.

        Uses grid search over weight combinations.
        """
        from itertools import product

        # Get individual predictions
        predictions = []
        for model in models:
            if hasattr(model, 'predict_proba'):
                pred = model.predict_proba(X_val)
            else:
                pred = model.predict(X_val)
            predictions.append(pred)

        predictions = np.array(predictions)

        # Grid search
        best_brier = float('inf')
        best_weights = None

        weight_options = [0.0, 0.25, 0.5, 0.75, 1.0]
        for weights in product(weight_options, repeat=len(models)):
            if sum(weights) == 0:
                continue

            weights = np.array(weights) / sum(weights)
            ensemble = np.average(predictions, axis=0, weights=weights)
            brier = np.mean((ensemble - y_val) ** 2)

            if brier < best_brier:
                best_brier = brier
                best_weights = weights.tolist()

        return best_weights
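
Tying the model families together is then a few lines, assuming the gradient boosting, random forest, and linear predictors above have been fit on the same features (variable names illustrative):

models = [gbm_predictor, rf_predictor, linear_predictor]
weights = NFLEnsemble.optimize_weights(models, X_val, y_val)

ensemble = NFLEnsemble(models, weights=weights)
home_win_probs = ensemble.predict_proba(X_test)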

Stacking: Meta-Learning

Use one model's predictions as features for another:

class NFLStackingEnsemble:
    """
    Stacking ensemble using meta-learner.

    Level 0: Base models predict on validation fold
    Level 1: Meta-model learns from Level 0 predictions
    """

    def __init__(self, base_models: List, meta_model):
        self.base_models = base_models
        self.meta_model = meta_model
        self.fitted_base_models = []

    def fit(self, X: pd.DataFrame, y: pd.Series,
            cv_splitter) -> 'NFLStackingEnsemble':
        """
        Fit stacking ensemble.

        Uses out-of-fold predictions to train meta-model.
        """
        n_models = len(self.base_models)
        meta_features = np.zeros((len(X), n_models))

        # Expects a splitter yielding positional indices (e.g. sklearn's
        # TimeSeriesSplit); label-based indices would break the iloc lookups below
        splits = cv_splitter.split(X)

        # Generate out-of-fold predictions
        for train_idx, val_idx in splits:
            X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_train = y.iloc[train_idx]

            for i, model_class in enumerate(self.base_models):
                model = model_class()
                model.fit(X_train, y_train)

                if hasattr(model, 'predict_proba'):
                    preds = model.predict_proba(X_val)
                else:
                    preds = model.predict(X_val)

                meta_features[val_idx, i] = preds

        # Train meta-model on meta-features
        self.meta_model.fit(meta_features, y)

        # Re-train base models on full data
        for model_class in self.base_models:
            model = model_class()
            model.fit(X, y)
            self.fitted_base_models.append(model)

        return self

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Generate predictions using stacked ensemble."""
        # Get base model predictions
        meta_features = np.zeros((len(X), len(self.fitted_base_models)))

        for i, model in enumerate(self.fitted_base_models):
            if hasattr(model, 'predict_proba'):
                preds = model.predict_proba(X)
            else:
                preds = model.predict(X)
            meta_features[:, i] = preds

        # Meta-model prediction
        if hasattr(self.meta_model, 'predict_proba'):
            return self.meta_model.predict_proba(meta_features)[:, 1]
        return self.meta_model.predict(meta_features)

Part 8: Practical Implementation

Complete Pipeline

class NFLMLPipeline:
    """
    End-to-end ML pipeline for NFL prediction.

    Handles feature engineering, model training, and evaluation.
    """

    def __init__(self, feature_engineer: NFLFeatureEngineer,
                 model_class, model_params: Dict):
        self.feature_engineer = feature_engineer
        self.model_class = model_class
        self.model_params = model_params
        self.model = None
        self.evaluation_results = None

    def train(self, games: pd.DataFrame, team_stats: pd.DataFrame,
              train_seasons: List[int]) -> 'NFLMLPipeline':
        """
        Train model on specified seasons.

        Args:
            games: All game data
            team_stats: Team statistics
            train_seasons: Seasons to use for training
        """
        # Filter to training seasons; sort so the target built below
        # lines up with the groupby-ordered feature rows
        train_games = games[games['season'].isin(train_seasons)].copy()
        train_games = train_games.sort_values(['season', 'week']).reset_index(drop=True)

        # Generate features
        features_list = []
        for (season, week), week_games in train_games.groupby(['season', 'week']):
            week_features = self.feature_engineer.create_features(
                week_games, team_stats, week
            )
            features_list.append(week_features)

        features_df = pd.concat(features_list, ignore_index=True)

        # Prepare X and y
        feature_cols = [c for c in features_df.columns
                       if c not in ['game_id', 'week', 'home_team', 'away_team',
                                   'home_score', 'away_score', 'season']]

        X = features_df[feature_cols].fillna(0)

        # Create target (home team win)
        y = (train_games['home_score'] > train_games['away_score']).astype(int)
        y = y.reset_index(drop=True)

        # Train model
        self.model = self.model_class(**self.model_params)
        self.model.fit(X, y)

        return self

    def predict(self, games: pd.DataFrame, team_stats: pd.DataFrame,
                current_week: int) -> pd.DataFrame:
        """
        Generate predictions for games.

        Args:
            games: Games to predict
            team_stats: Current statistics
            current_week: Week number for feature generation

        Returns:
            DataFrame with predictions
        """
        features = self.feature_engineer.create_features(
            games, team_stats, current_week
        )

        feature_cols = [c for c in features.columns
                       if c not in ['game_id', 'week', 'home_team', 'away_team',
                                   'home_score', 'away_score', 'season']]

        X = features[feature_cols].fillna(0)

        # Predictions
        if hasattr(self.model, 'predict_proba'):
            probs = self.model.predict_proba(X)
        else:
            probs = self.model.predict(X)

        # Build result
        predictions = features[['game_id', 'home_team', 'away_team']].copy()
        predictions['home_win_prob'] = probs
        predictions['predicted_winner'] = np.where(
            probs > 0.5,
            predictions['home_team'],
            predictions['away_team']
        )
        predictions['predicted_spread'] = (0.5 - probs) * 16  # Rough conversion

        return predictions

    def evaluate(self, games: pd.DataFrame, team_stats: pd.DataFrame,
                 test_season: int) -> Dict:
        """Evaluate on a test season."""
        test_games = games[games['season'] == test_season].copy()

        all_predictions = []
        all_actuals = []

        for week in sorted(test_games['week'].unique()):
            week_games = test_games[test_games['week'] == week]
            preds = self.predict(week_games, team_stats, week)

            all_predictions.extend(preds['home_win_prob'].tolist())

            actual_wins = (week_games['home_score'] > week_games['away_score']).astype(int)
            all_actuals.extend(actual_wins.tolist())

        # Calculate metrics
        preds = np.array(all_predictions)
        actuals = np.array(all_actuals)

        accuracy = ((preds.round() == actuals).sum()) / len(preds)
        brier = np.mean((preds - actuals) ** 2)

        return {
            'season': test_season,
            'n_games': len(preds),
            'accuracy': round(accuracy, 3),
            'brier_score': round(brier, 4)
        }
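
Putting the pipeline together end to end might look like the following, assuming games and team_stats follow the column conventions used throughout this chapter (the season numbers are illustrative):

pipeline = NFLMLPipeline(
    feature_engineer=NFLFeatureEngineer(lookback_weeks=8),
    model_class=NFLGradientBoostingPredictor,
    model_params={'task': 'classification', 'max_depth': 3}
)

pipeline.train(games, team_stats, train_seasons=[2019, 2020, 2021])
print(pipeline.evaluate(games, team_stats, test_season=2022))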

Summary

Machine learning offers powerful tools for NFL prediction, but success requires careful application. Key principles:

  1. Feature engineering matters most: Transform raw data into predictive signals
  2. Temporal validation is essential: Never leak future information
  3. Regularization prevents overfitting: Simple models often win
  4. Ensemble methods add robustness: Combine diverse approaches
  5. Benchmark against baselines: Elo-based methods are surprisingly strong

The goal isn't the most sophisticated model—it's the most reliable predictions. Often, gradient boosting with thoughtful features outperforms complex deep learning approaches on NFL data.


Key Equations Reference

Gradient Boosting Objective: $$L = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k)$$

Where $l$ is the per-game loss function and $\Omega$ penalizes the complexity of each tree $f_k$.

Brier Score: $$BS = \frac{1}{N}\sum_i (p_i - o_i)^2$$

Log Loss: $$LL = -\frac{1}{N}\sum_i [y_i \log(p_i) + (1-y_i)\log(1-p_i)]$$
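
Both scoring rules ship with scikit-learn, so checking a probability vector is one call each (a sketch, assuming probs holds predicted home-win probabilities and outcomes the 0/1 results):

from sklearn.metrics import brier_score_loss, log_loss

brier = brier_score_loss(outcomes, probs)
ll = log_loss(outcomes, probs)
print(f"Brier: {brier:.4f}, Log loss: {ll:.4f}")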


Looking Ahead

Chapter 21 Preview: Next, we'll explore Game Simulation—using Monte Carlo methods to simulate thousands of game outcomes and generate probability distributions for any metric, from win probability to specific score predictions.


Chapter Summary

Machine learning enhances NFL prediction by capturing complex patterns in high-dimensional data. However, the limited sample size and high variance of NFL outcomes require aggressive regularization and careful validation.

Gradient boosting dominates practical applications due to its balance of power and robustness. Neural networks offer potential for richer data types (play-by-play, video) but require more expertise and data.

The most successful practitioners combine ML with domain knowledge: engineered features capture football understanding, while algorithms find patterns humans might miss. This hybrid approach—human insight plus machine learning—represents the state of the art in NFL prediction.