Chapter 22: Machine Learning Applications in College Football

Introduction

Machine learning has transformed sports analytics by enabling pattern recognition and prediction at scales impossible for traditional statistical methods. This chapter explores how modern ML techniques apply to college football, from player classification to game outcome prediction, providing practical implementations that bridge theoretical foundations with real-world applications.

Learning Objectives

By the end of this chapter, you will be able to:

  1. Apply supervised learning algorithms to football prediction tasks
  2. Implement ensemble methods for improved accuracy
  3. Use clustering to discover player archetypes
  4. Build neural networks for complex pattern recognition
  5. Evaluate model performance with appropriate metrics
  6. Avoid common pitfalls in sports ML applications

22.1 The ML Landscape in Football Analytics

22.1.1 Why Machine Learning?

Traditional statistical approaches have served football analytics well, but they face limitations:

TRADITIONAL STATISTICS VS. MACHINE LEARNING
===========================================

Traditional Approach:
- Regression models with specified features
- Assumptions about data distributions
- Limited interaction terms
- Manual feature engineering

Machine Learning Approach:
- Automatic feature learning
- Handles complex non-linear relationships
- Discovers unexpected patterns
- Scales to high-dimensional data

Key Trade-off:
Interpretability ←──────────────────→ Predictive Power
    (Linear)                               (Deep Learning)
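To make the trade-off concrete, here is a minimal sketch (synthetic data, illustrative feature names) in which a plain linear model misses an interaction effect that a gradient-boosted ensemble recovers automatically:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
n = 2000

# Two synthetic game features: current score margin and minutes remaining.
margin = rng.normal(0, 10, n)
time_left = rng.uniform(0, 60, n)

# The target depends on their *interaction* (a lead matters more late in the
# game) -- a pattern a linear model with main effects cannot represent.
y = margin * (60 - time_left) / 60 + rng.normal(0, 0.5, n)

X = np.column_stack([margin, time_left])
linear = LinearRegression().fit(X[:1500], y[:1500])
gbm = GradientBoostingRegressor(random_state=42).fit(X[:1500], y[:1500])

linear_r2 = linear.score(X[1500:], y[1500:])
gbm_r2 = gbm.score(X[1500:], y[1500:])
print(f"linear R^2: {linear_r2:.3f}, GBM R^2: {gbm_r2:.3f}")
```

The linear model still captures the average effect of margin, so its R² is respectable, but the tree ensemble learns the margin-by-time interaction without it ever being specified as a feature.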

22.1.2 Common ML Tasks in Football

import pandas as pd
import numpy as np
from enum import Enum

class MLTask(Enum):
    """Common ML task types in football analytics."""
    CLASSIFICATION = "classification"    # Win/loss, draft/undrafted
    REGRESSION = "regression"            # Points scored, yards gained
    CLUSTERING = "clustering"            # Player archetypes, play types
    RANKING = "ranking"                  # Team power rankings
    SEQUENCE = "sequence"                # Play calling patterns
    ANOMALY = "anomaly"                  # Injury risk, outlier performance


# Example task mapping
football_ml_tasks = {
    'game_outcome': MLTask.CLASSIFICATION,
    'point_spread': MLTask.REGRESSION,
    'player_archetypes': MLTask.CLUSTERING,
    'power_rankings': MLTask.RANKING,
    'play_prediction': MLTask.SEQUENCE,
    'injury_risk': MLTask.ANOMALY
}

22.1.3 The Football ML Pipeline

class FootballMLPipeline:
    """Standard ML pipeline for football applications."""

    def __init__(self, task_type: MLTask):
        self.task_type = task_type
        self.preprocessor = None
        self.model = None
        self.evaluator = None

    def build_pipeline(self):
        """Construct the complete pipeline."""
        return {
            'data_collection': self._data_collection_stage(),
            'preprocessing': self._preprocessing_stage(),
            'feature_engineering': self._feature_engineering_stage(),
            'model_selection': self._model_selection_stage(),
            'training': self._training_stage(),
            'evaluation': self._evaluation_stage(),
            'deployment': self._deployment_stage()
        }

    def _data_collection_stage(self) -> dict:
        """Data collection considerations."""
        return {
            'sources': ['play-by-play', 'box_scores', 'tracking_data'],
            'historical_depth': '5+ seasons recommended',
            'quality_checks': ['missing_values', 'outliers', 'consistency'],
            'labeling': 'Ensure clear outcome definitions'
        }

    def _preprocessing_stage(self) -> dict:
        """Data preprocessing steps."""
        return {
            'cleaning': ['remove_garbage_time', 'handle_missing', 'outlier_treatment'],
            'encoding': ['one_hot_categorical', 'ordinal_encoding', 'target_encoding'],
            'scaling': ['standard_scaler', 'min_max', 'robust_scaler'],
            'splitting': ['temporal_split', 'stratified_split', 'group_split_by_team']
        }

    def _feature_engineering_stage(self) -> dict:
        """Feature engineering approaches."""
        return {
            'domain_features': ['efficiency_metrics', 'situational_indicators'],
            'aggregations': ['rolling_averages', 'exponential_weights'],
            'interactions': ['score_time', 'down_distance'],
            'embeddings': ['team_embeddings', 'player_embeddings']
        }

    def _model_selection_stage(self) -> list:
        """Model selection guidance."""
        models_by_task = {
            MLTask.CLASSIFICATION: ['logistic', 'random_forest', 'xgboost', 'neural_net'],
            MLTask.REGRESSION: ['linear', 'ridge', 'xgboost', 'neural_net'],
            MLTask.CLUSTERING: ['kmeans', 'hierarchical', 'dbscan', 'gmm'],
            MLTask.RANKING: ['bradley_terry', 'elo', 'learning_to_rank'],
            MLTask.SEQUENCE: ['lstm', 'transformer', 'hidden_markov'],
            MLTask.ANOMALY: ['isolation_forest', 'autoencoder', 'one_class_svm']
        }
        return models_by_task.get(self.task_type, [])

    def _training_stage(self) -> dict:
        """Training considerations."""
        return {
            'cross_validation': 'temporal or group-based',
            'hyperparameter_tuning': ['grid_search', 'random_search', 'bayesian'],
            'early_stopping': 'prevent overfitting',
            'regularization': 'L1/L2 or dropout'
        }

    def _evaluation_stage(self) -> dict:
        """Evaluation metrics by task."""
        return {
            MLTask.CLASSIFICATION: ['accuracy', 'auc', 'brier_score', 'log_loss'],
            MLTask.REGRESSION: ['rmse', 'mae', 'r_squared', 'mape'],
            MLTask.CLUSTERING: ['silhouette', 'calinski_harabasz', 'domain_validity']
        }

    def _deployment_stage(self) -> dict:
        """Deployment considerations."""
        return {
            'model_serialization': 'joblib or pickle',
            'api_integration': 'REST endpoints',
            'monitoring': 'track prediction drift',
            'retraining': 'schedule periodic updates'
        }
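The temporal split listed under preprocessing is the pipeline step most often done wrong in sports ML. A minimal pandas sketch (toy schedule, hypothetical columns) of splitting by season rather than shuffling, so no future information leaks into training:

```python
import pandas as pd

# Toy schedule; the 'season' column drives the split (illustrative data).
games = pd.DataFrame({
    'season':   [2019, 2019, 2020, 2020, 2021, 2021, 2022, 2022],
    'home_win': [1, 0, 1, 1, 0, 1, 0, 1],
})

# Train on past seasons, test on the most recent one. A random shuffle here
# would let games from 2022 inform predictions about 2019-2021.
train = games[games['season'] < 2022]
test = games[games['season'] == 2022]
print(len(train), len(test))  # 6 2
```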

22.2 Supervised Learning for Game Prediction

22.2.1 Classification: Game Outcome Prediction

from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, brier_score_loss
import xgboost as xgb

class GameOutcomePredictor:
    """Predict game outcomes using various ML models."""

    def __init__(self):
        self.models = {}
        self.scaler = StandardScaler()
        self.feature_cols = None

    def prepare_features(self, games: pd.DataFrame) -> pd.DataFrame:
        """Engineer features for game prediction."""
        df = games.copy()

        # Team strength metrics
        df['elo_diff'] = df['home_elo'] - df['away_elo']
        df['sp_plus_diff'] = df['home_sp_plus'] - df['away_sp_plus']

        # Efficiency differentials
        df['off_eff_diff'] = df['home_off_efficiency'] - df['away_off_efficiency']
        df['def_eff_diff'] = df['home_def_efficiency'] - df['away_def_efficiency']

        # Recent form (last 3 games)
        df['home_recent_win_pct'] = df['home_last_3_wins'] / 3
        df['away_recent_win_pct'] = df['away_last_3_wins'] / 3
        df['form_diff'] = df['home_recent_win_pct'] - df['away_recent_win_pct']

        # Situational factors (a constant column like this carries no signal
        # for a learned model; it is kept as an explicit baseline placeholder)
        df['home_field_advantage'] = 1
        df['rivalry_game'] = df['rivalry'].astype(int)
        df['conference_game'] = df['same_conference'].astype(int)

        # Rest advantage
        df['rest_diff'] = df['home_days_rest'] - df['away_days_rest']

        self.feature_cols = [
            'elo_diff', 'sp_plus_diff', 'off_eff_diff', 'def_eff_diff',
            'form_diff', 'home_field_advantage', 'rivalry_game',
            'conference_game', 'rest_diff'
        ]

        return df

    def temporal_split(self,
                       df: pd.DataFrame,
                       train_seasons: list,
                       test_seasons: list) -> tuple:
        """Split data temporally by season."""
        train_mask = df['season'].isin(train_seasons)
        test_mask = df['season'].isin(test_seasons)

        X_train = df.loc[train_mask, self.feature_cols].values
        X_test = df.loc[test_mask, self.feature_cols].values
        y_train = df.loc[train_mask, 'home_win'].values
        y_test = df.loc[test_mask, 'home_win'].values

        # Scale features
        X_train = self.scaler.fit_transform(X_train)
        X_test = self.scaler.transform(X_test)

        return X_train, X_test, y_train, y_test

    def train_models(self, X_train: np.ndarray, y_train: np.ndarray):
        """Train multiple classification models."""
        # Logistic Regression baseline
        self.models['logistic'] = LogisticRegression(
            C=1.0, max_iter=1000, random_state=42
        )
        self.models['logistic'].fit(X_train, y_train)

        # Random Forest
        self.models['random_forest'] = RandomForestClassifier(
            n_estimators=100,
            max_depth=6,
            min_samples_leaf=10,
            random_state=42
        )
        self.models['random_forest'].fit(X_train, y_train)

        # Gradient Boosting
        self.models['gbm'] = GradientBoostingClassifier(
            n_estimators=100,
            max_depth=4,
            learning_rate=0.1,
            random_state=42
        )
        self.models['gbm'].fit(X_train, y_train)

        # XGBoost
        self.models['xgboost'] = xgb.XGBClassifier(
            n_estimators=100,
            max_depth=4,
            learning_rate=0.1,
            objective='binary:logistic',
            random_state=42
        )
        self.models['xgboost'].fit(X_train, y_train)

    def evaluate_models(self,
                        X_test: np.ndarray,
                        y_test: np.ndarray) -> pd.DataFrame:
        """Evaluate all models on test set."""
        results = []

        for name, model in self.models.items():
            y_pred = model.predict(X_test)
            y_prob = model.predict_proba(X_test)[:, 1]

            results.append({
                'model': name,
                'accuracy': accuracy_score(y_test, y_pred),
                'auc': roc_auc_score(y_test, y_prob),
                'brier_score': brier_score_loss(y_test, y_prob),
                'log_loss': -np.mean(
                    y_test * np.log(y_prob + 1e-10) +
                    (1 - y_test) * np.log(1 - y_prob + 1e-10)
                )
            })

        return pd.DataFrame(results).sort_values('brier_score')

    def get_feature_importance(self, model_name: str = 'xgboost') -> pd.DataFrame:
        """Extract feature importance from tree-based model."""
        model = self.models[model_name]

        if hasattr(model, 'feature_importances_'):
            importance = model.feature_importances_
        else:
            # For logistic regression, use absolute coefficients
            importance = np.abs(model.coef_[0])

        return pd.DataFrame({
            'feature': self.feature_cols,
            'importance': importance
        }).sort_values('importance', ascending=False)
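The `evaluate_models` method above computes log loss by hand. A quick sanity check (synthetic labels and probabilities) that the manual formula agrees with sklearn's own `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.8, 0.4])

# Manual negative log-likelihood, with a small epsilon for numerical safety
# (the same expression used in evaluate_models).
manual = -np.mean(
    y_true * np.log(y_prob + 1e-10) +
    (1 - y_true) * np.log(1 - y_prob + 1e-10)
)
library = log_loss(y_true, y_prob)
print(abs(manual - library) < 1e-6)  # True
```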

22.2.2 Regression: Point Spread Prediction

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

class SpreadPredictor:
    """Predict game point spreads using regression models."""

    def __init__(self):
        self.models = {}
        self.scaler = StandardScaler()
        self.feature_cols = None

    def prepare_features(self, games: pd.DataFrame) -> pd.DataFrame:
        """Engineer features for spread prediction."""
        df = games.copy()

        # Power rating differences
        df['power_diff'] = df['home_power_rating'] - df['away_power_rating']

        # Component differences
        df['rush_off_diff'] = df['home_rush_off'] - df['away_rush_def']
        df['pass_off_diff'] = df['home_pass_off'] - df['away_pass_def']
        df['rush_def_diff'] = df['away_rush_off'] - df['home_rush_def']
        df['pass_def_diff'] = df['away_pass_off'] - df['home_pass_def']

        # Tempo and style
        df['tempo_diff'] = df['home_tempo'] - df['away_tempo']
        df['avg_tempo'] = (df['home_tempo'] + df['away_tempo']) / 2

        # Historical performance
        df['home_avg_margin'] = df['home_season_margin']
        df['away_avg_margin'] = df['away_season_margin']
        df['margin_diff'] = df['home_avg_margin'] - df['away_avg_margin']

        # Opponent adjustments
        df['home_sos'] = df['home_strength_of_schedule']
        df['away_sos'] = df['away_strength_of_schedule']
        df['sos_diff'] = df['home_sos'] - df['away_sos']

        # Home field advantage in points (typically ~2.5-3; as a constant
        # column it adds no information for the model -- it documents the
        # assumption rather than contributing signal)
        df['home_advantage_pts'] = 2.7

        self.feature_cols = [
            'power_diff', 'rush_off_diff', 'pass_off_diff',
            'rush_def_diff', 'pass_def_diff', 'tempo_diff',
            'margin_diff', 'sos_diff', 'home_advantage_pts'
        ]

        return df

    def train_models(self, X_train: np.ndarray, y_train: np.ndarray):
        """Train regression models."""
        # Ridge Regression
        self.models['ridge'] = Ridge(alpha=1.0)
        self.models['ridge'].fit(X_train, y_train)

        # Lasso (sparse features)
        self.models['lasso'] = Lasso(alpha=0.1)
        self.models['lasso'].fit(X_train, y_train)

        # ElasticNet
        self.models['elastic_net'] = ElasticNet(alpha=0.1, l1_ratio=0.5)
        self.models['elastic_net'].fit(X_train, y_train)

        # Random Forest
        self.models['random_forest'] = RandomForestRegressor(
            n_estimators=100,
            max_depth=6,
            random_state=42
        )
        self.models['random_forest'].fit(X_train, y_train)

        # Gradient Boosting
        self.models['gbm'] = GradientBoostingRegressor(
            n_estimators=100,
            max_depth=4,
            learning_rate=0.1,
            random_state=42
        )
        self.models['gbm'].fit(X_train, y_train)

    def evaluate_models(self,
                        X_test: np.ndarray,
                        y_test: np.ndarray) -> pd.DataFrame:
        """Evaluate regression models."""
        results = []

        for name, model in self.models.items():
            y_pred = model.predict(X_test)

            results.append({
                'model': name,
                'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
                'mae': mean_absolute_error(y_test, y_pred),
                'r2': r2_score(y_test, y_pred),
                'within_3': np.mean(np.abs(y_test - y_pred) <= 3),
                'within_7': np.mean(np.abs(y_test - y_pred) <= 7)
            })

        return pd.DataFrame(results).sort_values('rmse')

    def predict_spread(self, game_features: dict) -> dict:
        """Predict spread for a single game (assumes the scaler is already fit)."""
        features = np.array([[game_features[col] for col in self.feature_cols]])
        features_scaled = self.scaler.transform(features)

        predictions = {}
        for name, model in self.models.items():
            predictions[name] = model.predict(features_scaled)[0]

        # Ensemble prediction
        predictions['ensemble'] = np.mean(list(predictions.values()))

        return predictions

22.3 Ensemble Methods

22.3.1 Voting and Stacking Ensembles

from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

class EnsembleGamePredictor:
    """Advanced ensemble methods for game prediction."""

    def __init__(self):
        self.voting_ensemble = None
        self.stacking_ensemble = None

    def build_voting_ensemble(self) -> VotingClassifier:
        """Build voting ensemble classifier."""
        base_models = [
            ('logistic', LogisticRegression(C=1.0, max_iter=1000)),
            ('rf', RandomForestClassifier(n_estimators=100, max_depth=6)),
            ('gbm', GradientBoostingClassifier(n_estimators=100, max_depth=4)),
            ('xgb', xgb.XGBClassifier(n_estimators=100, max_depth=4))
        ]

        # Soft voting uses predicted probabilities
        self.voting_ensemble = VotingClassifier(
            estimators=base_models,
            voting='soft'
        )

        return self.voting_ensemble

    def build_stacking_ensemble(self) -> StackingClassifier:
        """Build stacking ensemble with meta-learner."""
        base_models = [
            ('rf', RandomForestClassifier(n_estimators=100, max_depth=6)),
            ('gbm', GradientBoostingClassifier(n_estimators=100, max_depth=4)),
            ('xgb', xgb.XGBClassifier(n_estimators=100, max_depth=4)),
            ('mlp', MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500))
        ]

        # Logistic regression as meta-learner
        self.stacking_ensemble = StackingClassifier(
            estimators=base_models,
            final_estimator=LogisticRegression(C=0.1),
            cv=5,
            passthrough=False  # Only use base model predictions
        )

        return self.stacking_ensemble

    def compare_ensembles(self,
                          X_train: np.ndarray,
                          X_test: np.ndarray,
                          y_train: np.ndarray,
                          y_test: np.ndarray) -> pd.DataFrame:
        """Compare ensemble performance."""
        results = []

        # Train and evaluate voting ensemble
        voting = self.build_voting_ensemble()
        voting.fit(X_train, y_train)
        voting_pred = voting.predict_proba(X_test)[:, 1]

        results.append({
            'model': 'voting_ensemble',
            'accuracy': accuracy_score(y_test, voting.predict(X_test)),
            'auc': roc_auc_score(y_test, voting_pred),
            'brier_score': brier_score_loss(y_test, voting_pred)
        })

        # Train and evaluate stacking ensemble
        stacking = self.build_stacking_ensemble()
        stacking.fit(X_train, y_train)
        stacking_pred = stacking.predict_proba(X_test)[:, 1]

        results.append({
            'model': 'stacking_ensemble',
            'accuracy': accuracy_score(y_test, stacking.predict(X_test)),
            'auc': roc_auc_score(y_test, stacking_pred),
            'brier_score': brier_score_loss(y_test, stacking_pred)
        })

        return pd.DataFrame(results)


class CustomWeightedEnsemble:
    """Custom weighted ensemble based on validation performance."""

    def __init__(self, models: dict):
        self.models = models
        self.weights = None

    def fit_weights(self,
                    X_val: np.ndarray,
                    y_val: np.ndarray,
                    metric: str = 'brier_score'):
        """Fit ensemble weights based on validation performance."""
        performances = []

        for name, model in self.models.items():
            y_prob = model.predict_proba(X_val)[:, 1]

            if metric == 'brier_score':
                # Lower is better, invert
                perf = 1 / (brier_score_loss(y_val, y_prob) + 0.01)
            elif metric == 'auc':
                perf = roc_auc_score(y_val, y_prob)
            else:
                perf = accuracy_score(y_val, model.predict(X_val))

            performances.append((name, perf))

        # Normalize to sum to 1
        total_perf = sum(p[1] for p in performances)
        self.weights = {name: perf / total_perf for name, perf in performances}

        return self.weights

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Get weighted probability predictions."""
        weighted_sum = np.zeros(len(X))

        for name, model in self.models.items():
            proba = model.predict_proba(X)[:, 1]
            weighted_sum += self.weights[name] * proba

        return weighted_sum
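A minimal, self-contained sketch of the inverse-Brier weighting used by `CustomWeightedEnsemble`, with two stand-in models on synthetic data (all names and data illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_val = X[:300], X[300:]
y_tr, y_val = y[:300], y[300:]

models = {
    'logistic': LogisticRegression().fit(X_tr, y_tr),
    'tree': DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr),
}

# Inverse-Brier weighting, mirroring fit_weights: a lower Brier score yields
# a larger raw weight, then the weights are normalized to sum to 1.
perf = {
    name: 1 / (brier_score_loss(y_val, m.predict_proba(X_val)[:, 1]) + 0.01)
    for name, m in models.items()
}
total = sum(perf.values())
weights = {name: p / total for name, p in perf.items()}
print(round(sum(weights.values()), 6))  # 1.0
```

The better-calibrated model receives the larger share of the ensemble weight, which is the whole point of fitting weights on a held-out validation set rather than fixing them a priori.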

22.3.2 XGBoost Deep Dive

import xgboost as xgb
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

class XGBoostGamePredictor:
    """Optimized XGBoost model for game prediction."""

    def __init__(self):
        self.model = None
        self.best_params = None

    def tune_hyperparameters(self,
                             X_train: np.ndarray,
                             y_train: np.ndarray,
                             method: str = 'random') -> dict:
        """Tune XGBoost hyperparameters."""
        param_grid = {
            'n_estimators': [50, 100, 200, 300],
            'max_depth': [3, 4, 5, 6],
            'learning_rate': [0.01, 0.05, 0.1, 0.2],
            'min_child_weight': [1, 3, 5],
            'subsample': [0.7, 0.8, 0.9, 1.0],
            'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
            'reg_alpha': [0, 0.1, 1],
            'reg_lambda': [0.1, 1, 10]
        }

        base_model = xgb.XGBClassifier(
            objective='binary:logistic',
            random_state=42,
            eval_metric='logloss'  # use_label_encoder was removed in XGBoost 2.0
        )

        if method == 'grid':
            # Smaller grid for demonstration
            small_grid = {
                'n_estimators': [100, 200],
                'max_depth': [3, 4, 5],
                'learning_rate': [0.05, 0.1]
            }
            searcher = GridSearchCV(
                base_model, small_grid, cv=5,
                scoring='neg_brier_score', n_jobs=-1
            )
        else:
            searcher = RandomizedSearchCV(
                base_model, param_grid, n_iter=50, cv=5,
                scoring='neg_brier_score', n_jobs=-1, random_state=42
            )

        searcher.fit(X_train, y_train)
        self.best_params = searcher.best_params_
        self.model = searcher.best_estimator_

        return self.best_params

    def train_with_early_stopping(self,
                                   X_train: np.ndarray,
                                   y_train: np.ndarray,
                                   X_val: np.ndarray,
                                   y_val: np.ndarray):
        """Train with early stopping to prevent overfitting."""
        # Copy tuned params and drop n_estimators so it does not clash with
        # the deliberately high ceiling below (a duplicate keyword would
        # raise a TypeError).
        params = dict(self.best_params) if self.best_params else {}
        params.pop('n_estimators', None)

        self.model = xgb.XGBClassifier(
            **params,
            n_estimators=1000,  # High ceiling; early stopping halts sooner
            objective='binary:logistic',
            random_state=42,
            early_stopping_rounds=50  # constructor argument in XGBoost >= 2.0
        )

        self.model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=False
        )

        print(f"Best iteration: {self.model.best_iteration}")

    def plot_learning_curve(self, X_train, y_train, X_val, y_val):
        """Plot training and validation learning curves."""
        import matplotlib.pyplot as plt

        eval_set = [(X_train, y_train), (X_val, y_val)]

        params = dict(self.best_params) if self.best_params else {}
        params.pop('n_estimators', None)  # avoid a duplicate keyword below

        model = xgb.XGBClassifier(
            **params,
            n_estimators=300,
            objective='binary:logistic',
            random_state=42
        )

        model.fit(
            X_train, y_train,
            eval_set=eval_set,
            verbose=False
        )

        results = model.evals_result()

        plt.figure(figsize=(10, 6))
        plt.plot(results['validation_0']['logloss'], label='Training')
        plt.plot(results['validation_1']['logloss'], label='Validation')
        plt.xlabel('Boosting Round')
        plt.ylabel('Log Loss')
        plt.title('XGBoost Learning Curves')
        plt.legend()
        plt.grid(True, alpha=0.3)

        return plt

    def get_feature_importance_analysis(self,
                                         feature_names: list) -> pd.DataFrame:
        """Comprehensive feature importance analysis."""
        # Default importance (gain)
        gain_importance = self.model.feature_importances_

        # Get other importance types
        booster = self.model.get_booster()
        weight_importance = booster.get_score(importance_type='weight')
        cover_importance = booster.get_score(importance_type='cover')

        # Convert to arrays
        n_features = len(feature_names)
        weight_arr = np.array([
            weight_importance.get(f'f{i}', 0) for i in range(n_features)
        ])
        cover_arr = np.array([
            cover_importance.get(f'f{i}', 0) for i in range(n_features)
        ])

        # Normalize
        weight_arr = weight_arr / weight_arr.sum() if weight_arr.sum() > 0 else weight_arr
        cover_arr = cover_arr / cover_arr.sum() if cover_arr.sum() > 0 else cover_arr

        return pd.DataFrame({
            'feature': feature_names,
            'gain': gain_importance,
            'weight': weight_arr,
            'cover': cover_arr
        }).sort_values('gain', ascending=False)

22.4 Unsupervised Learning: Player Clustering

22.4.1 Finding Player Archetypes

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score

class PlayerArchetypeDiscovery:
    """Discover player archetypes using clustering."""

    def __init__(self, position: str):
        self.position = position
        self.scaler = StandardScaler()
        self.cluster_model = None
        self.pca = None

    def prepare_features(self, players: pd.DataFrame) -> tuple:
        """Prepare features for clustering based on position."""
        position_features = {
            'QB': [
                'completion_pct', 'yards_per_attempt', 'td_rate', 'int_rate',
                'sack_rate', 'rush_attempts_per_game', 'rush_yards_per_attempt',
                'deep_pass_pct', 'play_action_rate', 'time_to_throw'
            ],
            'RB': [
                'yards_per_carry', 'yards_after_contact', 'broken_tackles_per_att',
                'receiving_targets_per_game', 'catch_rate', 'rush_epa',
                'elusive_rating', 'pass_block_grade'
            ],
            'WR': [
                'yards_per_route', 'separation_avg', 'catch_rate',
                'contested_catch_rate', 'yards_after_catch', 'deep_target_rate',
                'slot_rate', 'target_share'
            ],
            'LB': [
                'tackles_per_game', 'tackles_for_loss', 'pass_rush_snaps_pct',
                'coverage_snaps_pct', 'qb_pressures', 'forced_incompletions',
                'missed_tackle_rate', 'run_stop_rate'
            ]
        }

        features = position_features.get(self.position, position_features['QB'])
        available_features = [f for f in features if f in players.columns]

        X = players[available_features].dropna().values
        X_scaled = self.scaler.fit_transform(X)

        return X_scaled, available_features

    def find_optimal_clusters(self, X: np.ndarray, max_k: int = 10) -> dict:
        """Find optimal number of clusters using multiple methods."""
        results = {
            'silhouette': [],
            'calinski_harabasz': [],
            'inertia': []
        }

        for k in range(2, max_k + 1):
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            labels = kmeans.fit_predict(X)

            results['silhouette'].append(silhouette_score(X, labels))
            results['calinski_harabasz'].append(calinski_harabasz_score(X, labels))
            results['inertia'].append(kmeans.inertia_)

        # k values start at 2, so shift the argmax index back to a k value
        optimal_k_silhouette = np.argmax(results['silhouette']) + 2

        # Elbow heuristic: largest change in slope of the inertia curve
        # (second difference); the middle point of each triple is k = i + 3
        inertia = np.array(results['inertia'])
        diffs = np.diff(inertia)
        diffs2 = np.diff(diffs)
        optimal_k_elbow = np.argmax(diffs2) + 3 if len(diffs2) > 0 else 3

        return {
            'metrics': results,
            'optimal_k_silhouette': optimal_k_silhouette,
            'optimal_k_elbow': optimal_k_elbow,
            'recommended_k': round((optimal_k_silhouette + optimal_k_elbow) / 2)
        }

    def fit_clustering(self,
                       X: np.ndarray,
                       n_clusters: int,
                       method: str = 'kmeans'):
        """Fit clustering model."""
        if method == 'kmeans':
            self.cluster_model = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        elif method == 'gmm':
            self.cluster_model = GaussianMixture(n_components=n_clusters, random_state=42)
        elif method == 'hierarchical':
            self.cluster_model = AgglomerativeClustering(n_clusters=n_clusters)

        return self.cluster_model.fit_predict(X)

    def analyze_clusters(self,
                         players: pd.DataFrame,
                         labels: np.ndarray,
                         feature_cols: list) -> pd.DataFrame:
        """Analyze cluster characteristics."""
        players_clustered = players.copy()
        players_clustered['cluster'] = labels

        # Calculate cluster means
        cluster_profiles = players_clustered.groupby('cluster')[feature_cols].mean()

        # Calculate cluster sizes
        cluster_sizes = players_clustered['cluster'].value_counts().sort_index()

        # Name clusters based on characteristics
        cluster_names = self._generate_cluster_names(cluster_profiles)

        cluster_profiles['cluster_name'] = cluster_names
        cluster_profiles['size'] = cluster_sizes.values

        return cluster_profiles

    def _generate_cluster_names(self, profiles: pd.DataFrame) -> list:
        """Generate descriptive names for clusters based on profile."""
        names = []

        for idx, row in profiles.iterrows():
            # Standardize this cluster's profile against all clusters
            z_scores = (row - profiles.mean()) / profiles.std()

            # Generate name based on position
            if self.position == 'QB':
                name = self._name_qb_cluster(row, z_scores)
            elif self.position == 'RB':
                name = self._name_rb_cluster(row, z_scores)
            elif self.position == 'WR':
                name = self._name_wr_cluster(row, z_scores)
            else:
                name = f"Cluster {idx}"

            names.append(name)

        return names

    def _name_qb_cluster(self, row: pd.Series, z_scores: pd.Series) -> str:
        """Generate QB archetype name."""
        if z_scores.get('rush_yards_per_attempt', 0) > 1:
            return "Dual-Threat"
        elif z_scores.get('deep_pass_pct', 0) > 1:
            return "Deep Ball Specialist"
        elif z_scores.get('completion_pct', 0) > 1 and z_scores.get('yards_per_attempt', 0) < 0:
            return "Game Manager"
        elif z_scores.get('yards_per_attempt', 0) > 1:
            return "Aggressive Gunslinger"
        elif z_scores.get('sack_rate', 0) < -1:
            return "Pocket Presence"
        else:
            return "Balanced Passer"

    def _name_rb_cluster(self, row: pd.Series, z_scores: pd.Series) -> str:
        """Generate RB archetype name."""
        if z_scores.get('receiving_targets_per_game', 0) > 1:
            return "Receiving Back"
        elif z_scores.get('yards_after_contact', 0) > 1:
            return "Power Back"
        elif z_scores.get('elusive_rating', 0) > 1:
            return "Elusive Scat Back"
        elif z_scores.get('pass_block_grade', 0) > 1:
            return "Pass-Blocking Specialist"
        else:
            return "All-Purpose Back"

    def _name_wr_cluster(self, row: pd.Series, z_scores: pd.Series) -> str:
        """Generate WR archetype name."""
        if z_scores.get('slot_rate', 0) > 1:
            return "Slot Receiver"
        elif z_scores.get('deep_target_rate', 0) > 1:
            return "Deep Threat"
        elif z_scores.get('contested_catch_rate', 0) > 1:
            return "Possession Receiver"
        elif z_scores.get('yards_after_catch', 0) > 1:
            return "YAC Monster"
        else:
            return "Route Technician"

    def visualize_clusters(self, X: np.ndarray, labels: np.ndarray):
        """Visualize clusters using PCA."""
        import matplotlib.pyplot as plt

        # Reduce to 2D for visualization
        self.pca = PCA(n_components=2)
        X_2d = self.pca.fit_transform(X)

        plt.figure(figsize=(12, 8))
        scatter = plt.scatter(
            X_2d[:, 0], X_2d[:, 1],
            c=labels, cmap='viridis', alpha=0.6
        )
        plt.colorbar(scatter, label='Cluster')
        plt.xlabel(f'PC1 ({self.pca.explained_variance_ratio_[0]*100:.1f}% var)')
        plt.ylabel(f'PC2 ({self.pca.explained_variance_ratio_[1]*100:.1f}% var)')
        plt.title(f'{self.position} Player Archetypes')

        return plt
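The z-score thresholding behind the archetype names can be seen on a toy profile table. This is a minimal sketch with invented numbers and a trimmed version of the QB naming rules; the one-standard-deviation cutoff matches `_name_qb_cluster` above:

```python
import pandas as pd

# Hypothetical cluster profiles: one row per cluster, columns are cluster means
profiles = pd.DataFrame({
    'rush_yards_per_attempt': [7.5, 3.0, 2.8],
    'completion_pct':         [58.0, 67.0, 61.0],
    'yards_per_attempt':      [7.0, 6.2, 9.1],
}, index=['c0', 'c1', 'c2'])

# Standardize each cluster's profile against the other clusters,
# as in _generate_cluster_names
z = (profiles - profiles.mean()) / profiles.std()

def name_qb(z_row):
    # A feature more than one standard deviation above the
    # cross-cluster mean defines the archetype
    if z_row['rush_yards_per_attempt'] > 1:
        return 'Dual-Threat'
    if z_row['yards_per_attempt'] > 1:
        return 'Aggressive Gunslinger'
    return 'Balanced Passer'

names = [name_qb(z.loc[c]) for c in profiles.index]
print(names)  # ['Dual-Threat', 'Balanced Passer', 'Aggressive Gunslinger']
```

The cutoff of 1.0 is a convention, not a statistical necessity; with many clusters a higher threshold produces fewer, more distinctive labels.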

22.5 Neural Networks for Football

22.5.1 Feed-Forward Networks

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class FootballMLP(nn.Module):
    """Multi-layer perceptron for football prediction."""

    def __init__(self, input_size: int, hidden_sizes: list, output_size: int = 1):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, hidden_size),
                nn.BatchNorm1d(hidden_size),
                nn.ReLU(),
                nn.Dropout(0.3)
            ])
            prev_size = hidden_size

        layers.append(nn.Linear(prev_size, output_size))
        layers.append(nn.Sigmoid())

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
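As a quick sanity check, the layer pattern in FootballMLP can be exercised on a random batch; the feature count of 10, batch of 16, and single hidden layer here are arbitrary choices for the shape check, not values from the chapter's datasets:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same Linear -> BatchNorm -> ReLU -> Dropout -> Sigmoid pattern as FootballMLP
model = nn.Sequential(
    nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(32, 1), nn.Sigmoid()
)
model.eval()  # freeze dropout and batch-norm statistics for the check

x = torch.randn(16, 10)  # 16 games, 10 features each
with torch.no_grad():
    probs = model(x)

print(probs.shape)  # torch.Size([16, 1])
```

The final Sigmoid guarantees every output lies in (0, 1), which is what lets the trainer below pair the network with binary cross-entropy loss.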


class NeuralNetworkTrainer:
    """Training wrapper for neural network models."""

    def __init__(self, model: nn.Module, learning_rate: float = 0.001):
        self.model = model
        self.optimizer = optim.Adam(model.parameters(), lr=learning_rate)
        self.criterion = nn.BCELoss()
        self.history = {'train_loss': [], 'val_loss': []}

    def train_epoch(self, train_loader: DataLoader) -> float:
        """Train for one epoch."""
        self.model.train()
        total_loss = 0

        for X_batch, y_batch in train_loader:
            self.optimizer.zero_grad()
            predictions = self.model(X_batch).squeeze(-1)  # (batch, 1) -> (batch,), safe for batch size 1
            loss = self.criterion(predictions, y_batch)
            loss.backward()
            self.optimizer.step()
            total_loss += loss.item()

        return total_loss / len(train_loader)

    def evaluate(self, val_loader: DataLoader) -> float:
        """Evaluate on validation set."""
        self.model.eval()
        total_loss = 0

        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                predictions = self.model(X_batch).squeeze(-1)
                loss = self.criterion(predictions, y_batch)
                total_loss += loss.item()

        return total_loss / len(val_loader)

    def train(self,
              X_train: np.ndarray,
              y_train: np.ndarray,
              X_val: np.ndarray,
              y_val: np.ndarray,
              epochs: int = 100,
              batch_size: int = 64,
              early_stopping_patience: int = 10):
        """Full training loop with early stopping."""
        # Convert to tensors
        train_dataset = TensorDataset(
            torch.FloatTensor(X_train),
            torch.FloatTensor(y_train)
        )
        val_dataset = TensorDataset(
            torch.FloatTensor(X_val),
            torch.FloatTensor(y_val)
        )

        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size)

        best_val_loss = float('inf')
        patience_counter = 0

        for epoch in range(epochs):
            train_loss = self.train_epoch(train_loader)
            val_loss = self.evaluate(val_loader)

            self.history['train_loss'].append(train_loss)
            self.history['val_loss'].append(val_loss)

            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
                # Snapshot the best weights; clone each tensor because
                # state_dict() returns references that keep training
                self.best_state = {k: v.clone() for k, v in self.model.state_dict().items()}
            else:
                patience_counter += 1

            if patience_counter >= early_stopping_patience:
                print(f"Early stopping at epoch {epoch}")
                break

            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Train Loss = {train_loss:.4f}, Val Loss = {val_loss:.4f}")

        # Restore best model
        self.model.load_state_dict(self.best_state)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Get predictions."""
        self.model.eval()
        with torch.no_grad():
            X_tensor = torch.FloatTensor(X)
            predictions = self.model(X_tensor).squeeze(-1).numpy()
        return predictions
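The early-stopping loop is worth seeing end to end. This condensed sketch trains a tiny network on synthetic, full-batch data (label is the sign of the first feature; all sizes and the patience of 5 are illustrative), using the same snapshot-and-restore pattern as NeuralNetworkTrainer:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic binary task: label is 1 when the first feature is positive
X = torch.randn(400, 4)
y = (X[:, 0] > 0).float()
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.BCELoss()

best_val, best_state, patience, wait = float('inf'), None, 5, 0
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train).squeeze(1), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val).squeeze(1), y_val).item()

    if val_loss < best_val:
        best_val, wait = val_loss, 0
        # Clone tensors so later optimizer steps cannot overwrite the snapshot
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        wait += 1
        if wait >= patience:
            break

model.load_state_dict(best_state)  # restore the best checkpoint
with torch.no_grad():
    acc = ((model(X_val).squeeze(1) > 0.5).float() == y_val).float().mean().item()
print(f"validation accuracy: {acc:.2f}")
```

The clone step matters: `state_dict()` returns live references to the model's tensors, so a shallow copy would silently track the final (possibly overfit) weights rather than the best ones.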

22.5.2 Recurrent Networks for Play Sequences

class PlaySequenceLSTM(nn.Module):
    """LSTM for modeling play sequences."""

    def __init__(self,
                 input_size: int,
                 hidden_size: int = 64,
                 num_layers: int = 2,
                 num_classes: int = 10):
        super().__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3
        )

        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(32, num_classes)
        )

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        lstm_out, (hidden, cell) = self.lstm(x)

        # Use last hidden state
        last_hidden = hidden[-1]

        # Classify
        output = self.fc(last_hidden)
        return output
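The tensor shapes flowing through PlaySequenceLSTM are easy to get wrong, so here is the shape flow on a dummy batch (4 drives of 5 plays with 8 features each; dropout is omitted for brevity):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=8, hidden_size=64, num_layers=2, batch_first=True)
head = nn.Linear(64, 10)  # 10 play-type classes

x = torch.randn(4, 5, 8)          # (batch, seq_len, input_size)
out, (hidden, cell) = lstm(x)

print(out.shape)      # (4, 5, 64): one output per time step
print(hidden.shape)   # (2, 4, 64): final hidden state per layer

logits = head(hidden[-1])  # classify from the top layer's final state
print(logits.shape)   # (4, 10)
```

Using `hidden[-1]` rather than `out[:, -1]` is a stylistic choice; for a single-direction LSTM the two are the same tensor, but `hidden[-1]` keeps working if the sequence is packed.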


class PlayPredictionSystem:
    """System for predicting next play type."""

    def __init__(self, sequence_length: int = 5):
        self.sequence_length = sequence_length
        self.play_encoder = None
        self.model = None

    def encode_play(self, play: dict) -> np.ndarray:
        """Encode a single play as feature vector."""
        features = [
            play['down'] / 4,  # Normalize down
            play['distance'] / 20,  # Normalize distance
            play['yard_line'] / 100,  # Normalize field position
            play['seconds_remaining'] / 3600,  # Normalize time
            play['score_diff'] / 50,  # Normalize score
            1 if play['play_type'] == 'pass' else 0,
            1 if play['play_type'] == 'rush' else 0,
            play.get('yards_gained', 0) / 20
        ]
        return np.array(features)

    def create_sequences(self, plays: pd.DataFrame) -> tuple:
        """Create sequences for training."""
        sequences = []
        labels = []

        # Group by game
        for game_id, game_plays in plays.groupby('game_id'):
            game_plays = game_plays.sort_values('play_id')

            for i in range(len(game_plays) - self.sequence_length):
                seq = game_plays.iloc[i:i+self.sequence_length]
                next_play = game_plays.iloc[i+self.sequence_length]

                # Encode sequence
                seq_encoded = np.array([
                    self.encode_play(row.to_dict())
                    for _, row in seq.iterrows()
                ])
                sequences.append(seq_encoded)

                # Label is next play type
                labels.append(self._encode_play_type(next_play['play_type']))

        return np.array(sequences), np.array(labels)

    def _encode_play_type(self, play_type: str) -> int:
        """Encode play type as integer."""
        play_types = {
            'rush_inside': 0, 'rush_outside': 1,
            'pass_short': 2, 'pass_deep': 3,
            'screen': 4, 'play_action': 5,
            'punt': 6, 'field_goal': 7,
            'qb_scramble': 8, 'other': 9
        }
        return play_types.get(play_type, 9)

    def train(self, plays: pd.DataFrame):
        """Train the play prediction model."""
        X, y = self.create_sequences(plays)

        # Split data
        split_idx = int(0.8 * len(X))
        X_train, X_val = X[:split_idx], X[split_idx:]
        y_train, y_val = y[:split_idx], y[split_idx:]

        # Build model
        input_size = X.shape[2]
        self.model = PlaySequenceLSTM(
            input_size=input_size,
            hidden_size=64,
            num_layers=2,
            num_classes=10
        )

        # Train
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.model.parameters(), lr=0.001)

        train_dataset = TensorDataset(
            torch.FloatTensor(X_train),
            torch.LongTensor(y_train)
        )
        train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

        for epoch in range(50):
            self.model.train()
            total_loss = 0

            for X_batch, y_batch in train_loader:
                optimizer.zero_grad()
                output = self.model(X_batch)
                loss = criterion(output, y_batch)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()

            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Loss = {total_loss/len(train_loader):.4f}")

    def predict_next_play(self, recent_plays: list) -> dict:
        """Predict probability distribution for next play."""
        self.model.eval()

        # Encode recent plays
        seq = np.array([self.encode_play(p) for p in recent_plays[-self.sequence_length:]])
        seq = torch.FloatTensor(seq).unsqueeze(0)

        with torch.no_grad():
            output = self.model(seq)
            probs = torch.softmax(output, dim=1).squeeze().numpy()

        play_types = [
            'rush_inside', 'rush_outside', 'pass_short', 'pass_deep',
            'screen', 'play_action', 'punt', 'field_goal', 'qb_scramble', 'other'
        ]

        return {play_type: prob for play_type, prob in zip(play_types, probs)}
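To see what the model actually consumes, here is a standalone version of the `encode_play` normalization applied to one invented play and stacked into the `(seq_len, features)` window the LSTM expects (the repeated play is just for the shape demonstration):

```python
import numpy as np

# Mirrors the normalization in PlayPredictionSystem.encode_play above
def encode_play(play):
    return np.array([
        play['down'] / 4,
        play['distance'] / 20,
        play['yard_line'] / 100,
        play['seconds_remaining'] / 3600,
        play['score_diff'] / 50,
        1 if play['play_type'] == 'pass' else 0,
        1 if play['play_type'] == 'rush' else 0,
        play.get('yards_gained', 0) / 20,
    ])

play = {'down': 3, 'distance': 7, 'yard_line': 45, 'seconds_remaining': 900,
        'score_diff': -3, 'play_type': 'pass', 'yards_gained': 12}

vec = encode_play(play)
print(vec[:3])  # [0.75 0.35 0.45] - down, distance, field position on a 0-1 scale

window = np.stack([encode_play(play)] * 5)  # a 5-play sequence
print(window.shape)  # (5, 8)
```

Keeping every feature roughly on a 0-1 scale is what lets the LSTM train without per-feature standardization; the divisors (4 downs, 20 yards, 3600 seconds) are domain constants rather than fitted statistics.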

22.6 Model Evaluation and Validation

22.6.1 Cross-Validation Strategies

from sklearn.model_selection import (
    KFold, StratifiedKFold, TimeSeriesSplit, GroupKFold
)

class FootballCrossValidator:
    """Cross-validation strategies for football data."""

    @staticmethod
    def temporal_cv(df: pd.DataFrame,
                    n_splits: int = 5) -> list:
        """Time-based cross-validation."""
        seasons = sorted(df['season'].unique())
        n_train_seasons = len(seasons) - n_splits

        folds = []
        for i in range(n_splits):
            train_seasons = seasons[:n_train_seasons + i]
            test_season = seasons[n_train_seasons + i]

            train_idx = df[df['season'].isin(train_seasons)].index
            test_idx = df[df['season'] == test_season].index

            folds.append((train_idx, test_idx))

        return folds

    @staticmethod
    def group_cv(df: pd.DataFrame,
                 group_col: str = 'team',
                 n_splits: int = 5) -> list:
        """Group-based cross-validation (leave teams out)."""
        groups = df[group_col].values

        gkf = GroupKFold(n_splits=n_splits)
        return list(gkf.split(df, groups=groups))

    @staticmethod
    def stratified_temporal_cv(df: pd.DataFrame,
                               target_col: str = 'home_win',
                               n_splits: int = 5) -> list:
        """Stratified within temporal splits."""
        seasons = sorted(df['season'].unique())
        n_train_seasons = len(seasons) - n_splits

        folds = []
        for i in range(n_splits):
            train_seasons = seasons[:n_train_seasons + i]
            test_season = seasons[n_train_seasons + i]

            train_data = df[df['season'].isin(train_seasons)]
            test_data = df[df['season'] == test_season]

            # Full seasons are kept intact rather than resampled; report the
            # target balance so a badly skewed test season is visible
            test_rate = test_data[target_col].mean()
            print(f"Season {test_season}: {target_col} rate = {test_rate:.3f}")

            folds.append((train_data.index, test_data.index))

        return folds
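The expanding-window behavior of `temporal_cv` is clearest on a toy schedule. This sketch uses an invented six-season frame with two games per season and three held-out seasons:

```python
import pandas as pd

# Toy schedule: two games per season, 2018-2023 (invented data)
df = pd.DataFrame({'season': [s for s in range(2018, 2024) for _ in range(2)]})

# Same expanding-window logic as temporal_cv above, with 3 splits
seasons = sorted(df['season'].unique().tolist())
n_splits = 3
n_train_seasons = len(seasons) - n_splits  # 3 seasons in the first training window

folds = []
for i in range(n_splits):
    train_seasons = seasons[:n_train_seasons + i]
    test_season = seasons[n_train_seasons + i]
    folds.append((train_seasons, test_season))
    print(f"train on {train_seasons} -> test on {test_season}")
```

Each fold trains on strictly earlier seasons and tests on the next one, so no fold ever sees future games, which is the property that random KFold splits would violate.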


class ModelEvaluator:
    """Comprehensive model evaluation."""

    def __init__(self):
        self.results = {}

    def evaluate_classification(self,
                                y_true: np.ndarray,
                                y_pred: np.ndarray,
                                y_prob: np.ndarray) -> dict:
        """Evaluate classification model."""
        from sklearn.metrics import (
            accuracy_score, precision_score, recall_score,
            f1_score, roc_auc_score, brier_score_loss,
            confusion_matrix
        )

        return {
            'accuracy': accuracy_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred),
            'recall': recall_score(y_true, y_pred),
            'f1': f1_score(y_true, y_pred),
            'auc': roc_auc_score(y_true, y_prob),
            'brier_score': brier_score_loss(y_true, y_prob),
            'log_loss': -np.mean(
                y_true * np.log(y_prob + 1e-10) +
                (1 - y_true) * np.log(1 - y_prob + 1e-10)
            ),
            'confusion_matrix': confusion_matrix(y_true, y_pred)
        }

    def evaluate_regression(self,
                            y_true: np.ndarray,
                            y_pred: np.ndarray) -> dict:
        """Evaluate regression model."""
        return {
            'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
            'mae': mean_absolute_error(y_true, y_pred),
            'r2': r2_score(y_true, y_pred),
            'mape': np.mean(np.abs((y_true - y_pred) / (y_true + 1e-10))) * 100,
            'within_3': np.mean(np.abs(y_true - y_pred) <= 3),
            'within_7': np.mean(np.abs(y_true - y_pred) <= 7)
        }

    def evaluate_calibration(self,
                             y_true: np.ndarray,
                             y_prob: np.ndarray,
                             n_bins: int = 10) -> dict:
        """Evaluate probability calibration."""
        bins = np.linspace(0, 1, n_bins + 1)
        calibration_data = []

        for i in range(n_bins):
            # Make the final bin right-closed so p = 1.0 is counted
            if i == n_bins - 1:
                mask = (y_prob >= bins[i]) & (y_prob <= bins[i+1])
            else:
                mask = (y_prob >= bins[i]) & (y_prob < bins[i+1])
            if mask.sum() > 0:
                calibration_data.append({
                    'bin': f'{bins[i]:.1f}-{bins[i+1]:.1f}',
                    'predicted': y_prob[mask].mean(),
                    'actual': y_true[mask].mean(),
                    'count': mask.sum()
                })

        cal_df = pd.DataFrame(calibration_data)
        ece = np.average(
            np.abs(cal_df['predicted'] - cal_df['actual']),
            weights=cal_df['count']
        )
        mce = np.abs(cal_df['predicted'] - cal_df['actual']).max()

        return {
            'calibration_curve': cal_df,
            'ece': ece,
            'mce': mce
        }

    def compare_to_baseline(self,
                            model_metrics: dict,
                            baseline_type: str = 'always_home') -> dict:
        """Compare model to simple baselines."""
        baselines = {
            'always_home': 0.55,   # Home teams win ~55% of games straight-up
            'coin_flip': 0.50,
            'vegas_implied': 0.52  # Approximate break-even accuracy against the spread
        }

        baseline_acc = baselines.get(baseline_type, 0.50)

        return {
            'model_accuracy': model_metrics['accuracy'],
            'baseline_accuracy': baseline_acc,
            'improvement': model_metrics['accuracy'] - baseline_acc,
            'relative_improvement': (model_metrics['accuracy'] - baseline_acc) / baseline_acc * 100
        }

22.7 Common Pitfalls and Best Practices

22.7.1 Data Leakage Prevention

class DataLeakageChecker:
    """Check for and prevent data leakage."""

    @staticmethod
    def check_temporal_leakage(df: pd.DataFrame,
                               feature_cols: list,
                               date_col: str = 'game_date') -> list:
        """Check for features that use future information."""
        leaky_features = []

        # Check for season-level aggregates
        season_features = [col for col in feature_cols if 'season' in col.lower()]
        for feat in season_features:
            print(f"Warning: {feat} may contain full-season data (leakage risk)")
            leaky_features.append(feat)

        # Check for outcome-derived features
        outcome_keywords = ['win', 'loss', 'margin', 'score', 'result']
        for feat in feature_cols:
            if any(kw in feat.lower() for kw in outcome_keywords):
                print(f"Warning: {feat} may be derived from outcome (check carefully)")
                leaky_features.append(feat)

        return leaky_features

    @staticmethod
    def ensure_point_in_time(df: pd.DataFrame,
                             feature_cols: list,
                             game_date_col: str = 'game_date') -> pd.DataFrame:
        """Ensure features are point-in-time valid."""
        df = df.copy()

        # For rolling features, ensure they don't include current game
        for col in feature_cols:
            if 'rolling' in col.lower() or 'avg' in col.lower():
                # Shift to exclude current game
                df[col] = df.groupby('team')[col].shift(1)

        return df
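The shift trick at the heart of `ensure_point_in_time` deserves a concrete look. In this minimal sketch (invented two-team schedule), shifting within each team turns a rolling average that includes the current game into a point-in-time feature:

```python
import pandas as pd

# A per-team game log where 'rolling_ppg' (wrongly) includes the current game
df = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'week': [1, 2, 3, 1, 2, 3],
    'rolling_ppg': [28.0, 31.0, 30.0, 17.0, 21.0, 24.0],
})

# Shift within each team so every row only sees data from earlier games
df['rolling_ppg_pit'] = df.groupby('team')['rolling_ppg'].shift(1)
print(df)
```

Each team's first game becomes NaN after the shift, since there is no prior data; those rows must be imputed (for example with a league-average prior) or dropped before training.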


class BestPractices:
    """Best practices for football ML."""

    @staticmethod
    def get_recommendations() -> dict:
        """Get best practice recommendations."""
        return {
            'data_handling': [
                'Use temporal train/test splits',
                'Exclude garbage time plays',
                'Handle missing data consistently',
                'Account for rule changes across seasons'
            ],
            'feature_engineering': [
                'Include strength of schedule adjustments',
                'Use opponent-adjusted metrics',
                'Create score-time interaction terms',
                'Consider home/away splits'
            ],
            'model_training': [
                'Use cross-validation appropriate for sports data',
                'Tune regularization to prevent overfitting',
                'Monitor for calibration, not just accuracy',
                'Include baseline comparisons'
            ],
            'evaluation': [
                'Report Brier score for probabilistic predictions',
                'Use ATS accuracy for betting applications',
                'Check calibration across probability ranges',
                'Validate on recent, unseen season'
            ],
            'deployment': [
                'Update models with new season data',
                'Monitor prediction drift',
                'Document feature definitions',
                'Log model predictions for post-hoc analysis'
            ]
        }

Summary

This chapter has explored the application of machine learning to college football analytics:

  1. Supervised Learning: Classification for game outcomes and regression for spread prediction offer modest but meaningful improvements over baselines when properly implemented.

  2. Ensemble Methods: Combining multiple models through voting or stacking consistently outperforms individual models, with XGBoost often serving as a strong base learner.

  3. Unsupervised Learning: Clustering reveals meaningful player archetypes that go beyond traditional position labels, providing insights for recruiting and team building.

  4. Neural Networks: Deep learning architectures can capture complex patterns, particularly in sequence modeling for play prediction, though they require more data and careful tuning.

  5. Evaluation: Proper validation using temporal splits and calibration metrics is critical for football applications where the margin between success and failure is small.

  6. Pitfalls: Data leakage, overfitting to small samples, and ignoring domain knowledge are common failure modes that can be avoided with disciplined practices.

The key to successful football ML is balancing model complexity with interpretability, always grounding predictions in football logic, and maintaining rigorous evaluation standards.

Key Takeaways

  • Machine learning provides modest improvements over traditional methods when properly implemented
  • Temporal validation is essential for sports prediction tasks
  • Ensemble methods consistently outperform individual models
  • Feature engineering remains crucial even with complex models
  • Calibration matters as much as accuracy for probabilistic predictions
  • Domain knowledge should guide model development and interpretation