In This Chapter
- Introduction
- Learning Objectives
- 22.1 The ML Landscape in Football Analytics
- 22.2 Supervised Learning for Game Prediction
- 22.3 Ensemble Methods
- 22.4 Unsupervised Learning: Player Clustering
- 22.5 Neural Networks for Football
- 22.6 Model Evaluation and Validation
- 22.7 Common Pitfalls and Best Practices
- Summary
- Key Takeaways
Chapter 22: Machine Learning Applications in College Football
Introduction
Machine learning has transformed sports analytics by enabling pattern recognition and prediction at scales impossible for traditional statistical methods. This chapter explores how modern ML techniques apply to college football, from player classification to game outcome prediction, providing practical implementations that bridge theoretical foundations with real-world applications.
Learning Objectives
By the end of this chapter, you will be able to:
- Apply supervised learning algorithms to football prediction tasks
- Implement ensemble methods for improved accuracy
- Use clustering to discover player archetypes
- Build neural networks for complex pattern recognition
- Evaluate model performance with appropriate metrics
- Avoid common pitfalls in sports ML applications
22.1 The ML Landscape in Football Analytics
22.1.1 Why Machine Learning?
Traditional statistical approaches have served football analytics well, but they face limitations:
TRADITIONAL STATISTICS VS. MACHINE LEARNING
===========================================
Traditional Approach:
- Regression models with specified features
- Assumptions about data distributions
- Limited interaction terms
- Manual feature engineering
Machine Learning Approach:
- Automatic feature learning
- Handles complex non-linear relationships
- Discovers unexpected patterns
- Scales to high-dimensional data
Key Trade-off:
Interpretability ←──────────────────→ Predictive Power
(linear models)                        (deep learning)
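This trade-off can be made concrete with a small sketch. The data and feature names below are synthetic stand-ins for real game features, not measurements: a logistic regression exposes one readable coefficient per feature, while a boosted ensemble typically fits non-linear structure better but offers only aggregate importance scores to inspect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))  # stand-ins for e.g. elo_diff, eff_diff, rest_diff
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Interpretable end: each coefficient is a log-odds effect per unit of the feature.
linear = LogisticRegression().fit(X, y)
print(dict(zip(['elo_diff', 'eff_diff', 'rest_diff'], linear.coef_[0].round(2))))

# Predictive end: flexible non-linear fit, but only opaque importances to inspect.
booster = GradientBoostingClassifier(random_state=42).fit(X, y)
print(booster.feature_importances_.round(2))
```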
22.1.2 Common ML Tasks in Football
import pandas as pd
import numpy as np
from enum import Enum
class MLTask(Enum):
"""Common ML task types in football analytics."""
CLASSIFICATION = "classification" # Win/loss, draft/undrafted
REGRESSION = "regression" # Points scored, yards gained
CLUSTERING = "clustering" # Player archetypes, play types
RANKING = "ranking" # Team power rankings
SEQUENCE = "sequence" # Play calling patterns
ANOMALY = "anomaly" # Injury risk, outlier performance
# Example task mapping
football_ml_tasks = {
'game_outcome': MLTask.CLASSIFICATION,
'point_spread': MLTask.REGRESSION,
'player_archetypes': MLTask.CLUSTERING,
'power_rankings': MLTask.RANKING,
'play_prediction': MLTask.SEQUENCE,
'injury_risk': MLTask.ANOMALY
}
22.1.3 The Football ML Pipeline
class FootballMLPipeline:
"""Standard ML pipeline for football applications."""
def __init__(self, task_type: MLTask):
self.task_type = task_type
self.preprocessor = None
self.model = None
self.evaluator = None
def build_pipeline(self):
"""Construct the complete pipeline."""
return {
'data_collection': self._data_collection_stage(),
'preprocessing': self._preprocessing_stage(),
'feature_engineering': self._feature_engineering_stage(),
'model_selection': self._model_selection_stage(),
'training': self._training_stage(),
'evaluation': self._evaluation_stage(),
'deployment': self._deployment_stage()
}
def _data_collection_stage(self) -> dict:
"""Data collection considerations."""
return {
'sources': ['play-by-play', 'box_scores', 'tracking_data'],
'historical_depth': '5+ seasons recommended',
'quality_checks': ['missing_values', 'outliers', 'consistency'],
'labeling': 'Ensure clear outcome definitions'
}
def _preprocessing_stage(self) -> dict:
"""Data preprocessing steps."""
return {
'cleaning': ['remove_garbage_time', 'handle_missing', 'outlier_treatment'],
'encoding': ['one_hot_categorical', 'ordinal_encoding', 'target_encoding'],
'scaling': ['standard_scaler', 'min_max', 'robust_scaler'],
'splitting': ['temporal_split', 'stratified_split', 'group_split_by_team']
}
def _feature_engineering_stage(self) -> dict:
"""Feature engineering approaches."""
return {
'domain_features': ['efficiency_metrics', 'situational_indicators'],
'aggregations': ['rolling_averages', 'exponential_weights'],
'interactions': ['score_time', 'down_distance'],
'embeddings': ['team_embeddings', 'player_embeddings']
}
def _model_selection_stage(self) -> dict:
"""Model selection guidance."""
models_by_task = {
MLTask.CLASSIFICATION: ['logistic', 'random_forest', 'xgboost', 'neural_net'],
MLTask.REGRESSION: ['linear', 'ridge', 'xgboost', 'neural_net'],
MLTask.CLUSTERING: ['kmeans', 'hierarchical', 'dbscan', 'gmm'],
MLTask.RANKING: ['bradley_terry', 'elo', 'learning_to_rank'],
MLTask.SEQUENCE: ['lstm', 'transformer', 'hidden_markov'],
MLTask.ANOMALY: ['isolation_forest', 'autoencoder', 'one_class_svm']
}
return models_by_task.get(self.task_type, [])
def _training_stage(self) -> dict:
"""Training considerations."""
return {
'cross_validation': 'temporal or group-based',
'hyperparameter_tuning': ['grid_search', 'random_search', 'bayesian'],
'early_stopping': 'prevent overfitting',
'regularization': 'L1/L2 or dropout'
}
def _evaluation_stage(self) -> dict:
"""Evaluation metrics by task."""
return {
MLTask.CLASSIFICATION: ['accuracy', 'auc', 'brier_score', 'log_loss'],
MLTask.REGRESSION: ['rmse', 'mae', 'r_squared', 'mape'],
MLTask.CLUSTERING: ['silhouette', 'calinski_harabasz', 'domain_validity']
}
def _deployment_stage(self) -> dict:
"""Deployment considerations."""
return {
'model_serialization': 'joblib or pickle',
'api_integration': 'REST endpoints',
'monitoring': 'track prediction drift',
'retraining': 'schedule periodic updates'
}
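The "temporal or group-based" cross-validation named in the training stage matters because shuffled folds would let a model train on games played after the ones it is tested on. A minimal sketch with synthetic season labels:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

seasons = np.repeat([2019, 2020, 2021, 2022, 2023], 100)  # 100 games per season, in order
X = np.arange(len(seasons)).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=4).split(X))
for train_idx, test_idx in splits:
    # Every training game precedes every test game: no look-ahead leakage.
    print(f"train through {seasons[train_idx].max()}, "
          f"test {seasons[test_idx].min()}-{seasons[test_idx].max()}")
```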
22.2 Supervised Learning for Game Prediction
22.2.1 Classification: Game Outcome Prediction
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, brier_score_loss
import xgboost as xgb
class GameOutcomePredictor:
"""Predict game outcomes using various ML models."""
def __init__(self):
self.models = {}
self.scaler = StandardScaler()
self.feature_cols = None
def prepare_features(self, games: pd.DataFrame) -> pd.DataFrame:
"""Engineer features for game prediction."""
df = games.copy()
# Team strength metrics
df['elo_diff'] = df['home_elo'] - df['away_elo']
df['sp_plus_diff'] = df['home_sp_plus'] - df['away_sp_plus']
# Efficiency differentials
df['off_eff_diff'] = df['home_off_efficiency'] - df['away_off_efficiency']
df['def_eff_diff'] = df['home_def_efficiency'] - df['away_def_efficiency']
# Recent form (last 3 games)
df['home_recent_win_pct'] = df['home_last_3_wins'] / 3
df['away_recent_win_pct'] = df['away_last_3_wins'] / 3
df['form_diff'] = df['home_recent_win_pct'] - df['away_recent_win_pct']
# Situational factors
df['home_field_advantage'] = 1  # constant baseline; carries no signal after scaling and is absorbed by the intercept
df['rivalry_game'] = df['rivalry'].astype(int)
df['conference_game'] = df['same_conference'].astype(int)
# Rest advantage
df['rest_diff'] = df['home_days_rest'] - df['away_days_rest']
self.feature_cols = [
'elo_diff', 'sp_plus_diff', 'off_eff_diff', 'def_eff_diff',
'form_diff', 'home_field_advantage', 'rivalry_game',
'conference_game', 'rest_diff'
]
return df
def temporal_split(self,
df: pd.DataFrame,
train_seasons: list,
test_seasons: list) -> tuple:
"""Split data temporally by season."""
train_mask = df['season'].isin(train_seasons)
test_mask = df['season'].isin(test_seasons)
X_train = df.loc[train_mask, self.feature_cols].values
X_test = df.loc[test_mask, self.feature_cols].values
y_train = df.loc[train_mask, 'home_win'].values
y_test = df.loc[test_mask, 'home_win'].values
# Scale features
X_train = self.scaler.fit_transform(X_train)
X_test = self.scaler.transform(X_test)
return X_train, X_test, y_train, y_test
def train_models(self, X_train: np.ndarray, y_train: np.ndarray):
"""Train multiple classification models."""
# Logistic Regression baseline
self.models['logistic'] = LogisticRegression(
C=1.0, max_iter=1000, random_state=42
)
self.models['logistic'].fit(X_train, y_train)
# Random Forest
self.models['random_forest'] = RandomForestClassifier(
n_estimators=100,
max_depth=6,
min_samples_leaf=10,
random_state=42
)
self.models['random_forest'].fit(X_train, y_train)
# Gradient Boosting
self.models['gbm'] = GradientBoostingClassifier(
n_estimators=100,
max_depth=4,
learning_rate=0.1,
random_state=42
)
self.models['gbm'].fit(X_train, y_train)
# XGBoost
self.models['xgboost'] = xgb.XGBClassifier(
n_estimators=100,
max_depth=4,
learning_rate=0.1,
objective='binary:logistic',
random_state=42
)
self.models['xgboost'].fit(X_train, y_train)
def evaluate_models(self,
X_test: np.ndarray,
y_test: np.ndarray) -> pd.DataFrame:
"""Evaluate all models on test set."""
results = []
for name, model in self.models.items():
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
results.append({
'model': name,
'accuracy': accuracy_score(y_test, y_pred),
'auc': roc_auc_score(y_test, y_prob),
'brier_score': brier_score_loss(y_test, y_prob),
'log_loss': -np.mean(
y_test * np.log(y_prob + 1e-10) +
(1 - y_test) * np.log(1 - y_prob + 1e-10)
)
})
return pd.DataFrame(results).sort_values('brier_score')
def get_feature_importance(self, model_name: str = 'xgboost') -> pd.DataFrame:
"""Extract feature importance from tree-based model."""
model = self.models[model_name]
if hasattr(model, 'feature_importances_'):
importance = model.feature_importances_
else:
# For logistic regression, use absolute coefficients
importance = np.abs(model.coef_[0])
return pd.DataFrame({
'feature': self.feature_cols,
'importance': importance
}).sort_values('importance', ascending=False)
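evaluate_models sorts by Brier score rather than accuracy because the Brier score penalizes overconfident probabilities, not just wrong picks. A hand-built illustration with made-up outcomes and forecasts:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
honest = np.array([0.70, 0.30, 0.60, 0.80, 0.20, 0.70, 0.55, 0.30])
overconfident = np.array([0.99, 0.01, 0.99, 0.99, 0.01, 0.99, 0.99, 0.01])

# Both forecasters pick the same winner in every game (7 of 8 correct),
# but the overconfident one is punished for the 0.99 it put on a loss.
honest_score = brier_score_loss(y_true, honest)
over_score = brier_score_loss(y_true, overconfident)
print(honest_score, over_score)
```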
22.2.2 Regression: Point Spread Prediction
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
class SpreadPredictor:
"""Predict game point spreads using regression models."""
def __init__(self):
self.models = {}
self.scaler = StandardScaler()
self.feature_cols = None
def prepare_features(self, games: pd.DataFrame) -> pd.DataFrame:
"""Engineer features for spread prediction."""
df = games.copy()
# Power rating differences
df['power_diff'] = df['home_power_rating'] - df['away_power_rating']
# Component differences
df['rush_off_diff'] = df['home_rush_off'] - df['away_rush_def']
df['pass_off_diff'] = df['home_pass_off'] - df['away_pass_def']
df['rush_def_diff'] = df['away_rush_off'] - df['home_rush_def']
df['pass_def_diff'] = df['away_pass_off'] - df['home_pass_def']
# Tempo and style
df['tempo_diff'] = df['home_tempo'] - df['away_tempo']
df['avg_tempo'] = (df['home_tempo'] + df['away_tempo']) / 2
# Historical performance
df['home_avg_margin'] = df['home_season_margin']
df['away_avg_margin'] = df['away_season_margin']
df['margin_diff'] = df['home_avg_margin'] - df['away_avg_margin']
# Opponent adjustments
df['home_sos'] = df['home_strength_of_schedule']
df['away_sos'] = df['away_strength_of_schedule']
df['sos_diff'] = df['home_sos'] - df['away_sos']
# Home field advantage (typically ~2.5-3 points); as a constant column it is
# absorbed by the model intercept rather than learned from the data
df['home_advantage_pts'] = 2.7
self.feature_cols = [
'power_diff', 'rush_off_diff', 'pass_off_diff',
'rush_def_diff', 'pass_def_diff', 'tempo_diff',
'margin_diff', 'sos_diff', 'home_advantage_pts'
]
return df
def train_models(self, X_train: np.ndarray, y_train: np.ndarray):
"""Train regression models."""
# Ridge Regression
self.models['ridge'] = Ridge(alpha=1.0)
self.models['ridge'].fit(X_train, y_train)
# Lasso (sparse features)
self.models['lasso'] = Lasso(alpha=0.1)
self.models['lasso'].fit(X_train, y_train)
# ElasticNet
self.models['elastic_net'] = ElasticNet(alpha=0.1, l1_ratio=0.5)
self.models['elastic_net'].fit(X_train, y_train)
# Random Forest
self.models['random_forest'] = RandomForestRegressor(
n_estimators=100,
max_depth=6,
random_state=42
)
self.models['random_forest'].fit(X_train, y_train)
# Gradient Boosting
self.models['gbm'] = GradientBoostingRegressor(
n_estimators=100,
max_depth=4,
learning_rate=0.1,
random_state=42
)
self.models['gbm'].fit(X_train, y_train)
def evaluate_models(self,
X_test: np.ndarray,
y_test: np.ndarray) -> pd.DataFrame:
"""Evaluate regression models."""
results = []
for name, model in self.models.items():
y_pred = model.predict(X_test)
results.append({
'model': name,
'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
'mae': mean_absolute_error(y_test, y_pred),
'r2': r2_score(y_test, y_pred),
'within_3': np.mean(np.abs(y_test - y_pred) <= 3),
'within_7': np.mean(np.abs(y_test - y_pred) <= 7)
})
return pd.DataFrame(results).sort_values('rmse')
def predict_spread(self, game_features: dict) -> dict:
"""Predict spread for a single game."""
features = np.array([[game_features[col] for col in self.feature_cols]])
features_scaled = self.scaler.transform(features)  # scaler must be fit on the training features first
predictions = {}
for name, model in self.models.items():
predictions[name] = model.predict(features_scaled)[0]
# Ensemble prediction
predictions['ensemble'] = np.mean(list(predictions.values()))
return predictions
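The within_3 and within_7 columns used when evaluating spread models are simple interval accuracies: the share of games where the predicted spread lands within a field goal, or a touchdown, of the actual margin. On toy numbers:

```python
import numpy as np

actual = np.array([-3, 7, 14, -10, 3, 21])    # actual home margins (made up)
predicted = np.array([-1, 4, 10, -3, 6, 17])  # model spreads (made up)

err = np.abs(actual - predicted)
within_3 = np.mean(err <= 3)
within_7 = np.mean(err <= 7)
print(within_3, within_7)
```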
22.3 Ensemble Methods
22.3.1 Voting and Stacking Ensembles
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
class EnsembleGamePredictor:
"""Advanced ensemble methods for game prediction."""
def __init__(self):
self.voting_ensemble = None
self.stacking_ensemble = None
def build_voting_ensemble(self) -> VotingClassifier:
"""Build voting ensemble classifier."""
base_models = [
('logistic', LogisticRegression(C=1.0, max_iter=1000)),
('rf', RandomForestClassifier(n_estimators=100, max_depth=6)),
('gbm', GradientBoostingClassifier(n_estimators=100, max_depth=4)),
('xgb', xgb.XGBClassifier(n_estimators=100, max_depth=4))
]
# Soft voting uses predicted probabilities
self.voting_ensemble = VotingClassifier(
estimators=base_models,
voting='soft'
)
return self.voting_ensemble
def build_stacking_ensemble(self) -> StackingClassifier:
"""Build stacking ensemble with meta-learner."""
base_models = [
('rf', RandomForestClassifier(n_estimators=100, max_depth=6)),
('gbm', GradientBoostingClassifier(n_estimators=100, max_depth=4)),
('xgb', xgb.XGBClassifier(n_estimators=100, max_depth=4)),
('mlp', MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500))
]
# Logistic regression as meta-learner
self.stacking_ensemble = StackingClassifier(
estimators=base_models,
final_estimator=LogisticRegression(C=0.1),
cv=5,
passthrough=False # Only use base model predictions
)
return self.stacking_ensemble
def compare_ensembles(self,
X_train: np.ndarray,
X_test: np.ndarray,
y_train: np.ndarray,
y_test: np.ndarray) -> pd.DataFrame:
"""Compare ensemble performance."""
results = []
# Train and evaluate voting ensemble
voting = self.build_voting_ensemble()
voting.fit(X_train, y_train)
voting_pred = voting.predict_proba(X_test)[:, 1]
results.append({
'model': 'voting_ensemble',
'accuracy': accuracy_score(y_test, voting.predict(X_test)),
'auc': roc_auc_score(y_test, voting_pred),
'brier_score': brier_score_loss(y_test, voting_pred)
})
# Train and evaluate stacking ensemble
stacking = self.build_stacking_ensemble()
stacking.fit(X_train, y_train)
stacking_pred = stacking.predict_proba(X_test)[:, 1]
results.append({
'model': 'stacking_ensemble',
'accuracy': accuracy_score(y_test, stacking.predict(X_test)),
'auc': roc_auc_score(y_test, stacking_pred),
'brier_score': brier_score_loss(y_test, stacking_pred)
})
return pd.DataFrame(results)
class CustomWeightedEnsemble:
"""Custom weighted ensemble based on validation performance."""
def __init__(self, models: dict):
self.models = models
self.weights = None
def fit_weights(self,
X_val: np.ndarray,
y_val: np.ndarray,
metric: str = 'brier_score'):
"""Fit ensemble weights based on validation performance."""
performances = []
for name, model in self.models.items():
y_prob = model.predict_proba(X_val)[:, 1]
if metric == 'brier_score':
# Lower is better, invert
perf = 1 / (brier_score_loss(y_val, y_prob) + 0.01)
elif metric == 'auc':
perf = roc_auc_score(y_val, y_prob)
else:
perf = accuracy_score(y_val, model.predict(X_val))
performances.append((name, perf))
# Normalize to sum to 1
total_perf = sum(p[1] for p in performances)
self.weights = {name: perf / total_perf for name, perf in performances}
return self.weights
def predict_proba(self, X: np.ndarray) -> np.ndarray:
"""Get weighted probability predictions."""
weighted_sum = np.zeros(len(X))
for name, model in self.models.items():
proba = model.predict_proba(X)[:, 1]
weighted_sum += self.weights[name] * proba
return weighted_sum
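The inverse-Brier weighting inside fit_weights, isolated with made-up validation probabilities (no fitted models needed): the better-calibrated model earns the larger weight, and the weights sum to 1.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_val = np.array([1, 0, 1, 0, 1])
probs = {
    'model_a': np.array([0.8, 0.3, 0.7, 0.2, 0.6]),  # sharper, better calibrated
    'model_b': np.array([0.6, 0.5, 0.5, 0.4, 0.5]),  # hedges toward 0.5
}

# Lower Brier is better, so invert it (0.01 stabilizes the division).
perf = {m: 1 / (brier_score_loss(y_val, p) + 0.01) for m, p in probs.items()}
total = sum(perf.values())
weights = {m: v / total for m, v in perf.items()}
print(weights)

blended = sum(weights[m] * probs[m] for m in probs)  # weighted probability forecast
```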
22.3.2 XGBoost Deep Dive
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
class XGBoostGamePredictor:
"""Optimized XGBoost model for game prediction."""
def __init__(self):
self.model = None
self.best_params = None
def tune_hyperparameters(self,
X_train: np.ndarray,
y_train: np.ndarray,
method: str = 'random') -> dict:
"""Tune XGBoost hyperparameters."""
param_grid = {
'n_estimators': [50, 100, 200, 300],
'max_depth': [3, 4, 5, 6],
'learning_rate': [0.01, 0.05, 0.1, 0.2],
'min_child_weight': [1, 3, 5],
'subsample': [0.7, 0.8, 0.9, 1.0],
'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
'reg_alpha': [0, 0.1, 1],
'reg_lambda': [0.1, 1, 10]
}
base_model = xgb.XGBClassifier(
objective='binary:logistic',
random_state=42,
eval_metric='logloss' # use_label_encoder is deprecated and no longer needed
)
if method == 'grid':
# Smaller grid for demonstration
small_grid = {
'n_estimators': [100, 200],
'max_depth': [3, 4, 5],
'learning_rate': [0.05, 0.1]
}
searcher = GridSearchCV(
base_model, small_grid, cv=5,
scoring='neg_brier_score', n_jobs=-1
)
else:
searcher = RandomizedSearchCV(
base_model, param_grid, n_iter=50, cv=5,
scoring='neg_brier_score', n_jobs=-1, random_state=42
)
searcher.fit(X_train, y_train)
self.best_params = searcher.best_params_
self.model = searcher.best_estimator_
return self.best_params
def train_with_early_stopping(self,
X_train: np.ndarray,
y_train: np.ndarray,
X_val: np.ndarray,
y_val: np.ndarray):
"""Train with early stopping to prevent overfitting."""
params = dict(self.best_params or {})
params['n_estimators'] = 1000 # high ceiling; early stopping picks the actual count (and avoids passing n_estimators twice)
self.model = xgb.XGBClassifier(
**params,
early_stopping_rounds=50, # passed to the constructor (the fit() argument was removed in XGBoost 2.0)
objective='binary:logistic',
random_state=42
)
self.model.fit(
X_train, y_train,
eval_set=[(X_val, y_val)],
verbose=False
)
print(f"Best iteration: {self.model.best_iteration}")
def plot_learning_curve(self, X_train, y_train, X_val, y_val):
"""Plot training and validation learning curves."""
import matplotlib.pyplot as plt
eval_set = [(X_train, y_train), (X_val, y_val)]
params = dict(self.best_params or {})
params['n_estimators'] = 300 # override so n_estimators is not passed twice
model = xgb.XGBClassifier(
**params,
objective='binary:logistic',
random_state=42
)
model.fit(
X_train, y_train,
eval_set=eval_set,
verbose=False
)
results = model.evals_result()
plt.figure(figsize=(10, 6))
plt.plot(results['validation_0']['logloss'], label='Training')
plt.plot(results['validation_1']['logloss'], label='Validation')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.title('XGBoost Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)
return plt
def get_feature_importance_analysis(self,
feature_names: list) -> pd.DataFrame:
"""Comprehensive feature importance analysis."""
# Default importance (gain)
gain_importance = self.model.feature_importances_
# Get other importance types
booster = self.model.get_booster()
weight_importance = booster.get_score(importance_type='weight')
cover_importance = booster.get_score(importance_type='cover')
# Convert to arrays
n_features = len(feature_names)
weight_arr = np.array([
weight_importance.get(f'f{i}', 0) for i in range(n_features)
])
cover_arr = np.array([
cover_importance.get(f'f{i}', 0) for i in range(n_features)
])
# Normalize
weight_arr = weight_arr / weight_arr.sum() if weight_arr.sum() > 0 else weight_arr
cover_arr = cover_arr / cover_arr.sum() if cover_arr.sum() > 0 else cover_arr
return pd.DataFrame({
'feature': feature_names,
'gain': gain_importance,
'weight': weight_arr,
'cover': cover_arr
}).sort_values('gain', ascending=False)
22.4 Unsupervised Learning: Player Clustering
22.4.1 Finding Player Archetypes
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score
class PlayerArchetypeDiscovery:
"""Discover player archetypes using clustering."""
def __init__(self, position: str):
self.position = position
self.scaler = StandardScaler()
self.cluster_model = None
self.pca = None
def prepare_features(self, players: pd.DataFrame) -> tuple:
"""Prepare features for clustering based on position."""
position_features = {
'QB': [
'completion_pct', 'yards_per_attempt', 'td_rate', 'int_rate',
'sack_rate', 'rush_attempts_per_game', 'rush_yards_per_attempt',
'deep_pass_pct', 'play_action_rate', 'time_to_throw'
],
'RB': [
'yards_per_carry', 'yards_after_contact', 'broken_tackles_per_att',
'receiving_targets_per_game', 'catch_rate', 'rush_epa',
'elusive_rating', 'pass_block_grade'
],
'WR': [
'yards_per_route', 'separation_avg', 'catch_rate',
'contested_catch_rate', 'yards_after_catch', 'deep_target_rate',
'slot_rate', 'target_share'
],
'LB': [
'tackles_per_game', 'tackles_for_loss', 'pass_rush_snaps_pct',
'coverage_snaps_pct', 'qb_pressures', 'forced_incompletions',
'missed_tackle_rate', 'run_stop_rate'
]
}
features = position_features.get(self.position, position_features['QB'])
available_features = [f for f in features if f in players.columns]
X = players[available_features].dropna().values
X_scaled = self.scaler.fit_transform(X)
return X_scaled, available_features
def find_optimal_clusters(self, X: np.ndarray, max_k: int = 10) -> dict:
"""Find optimal number of clusters using multiple methods."""
results = {
'silhouette': [],
'calinski_harabasz': [],
'inertia': []
}
for k in range(2, max_k + 1):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
results['silhouette'].append(silhouette_score(X, labels))
results['calinski_harabasz'].append(calinski_harabasz_score(X, labels))
results['inertia'].append(kmeans.inertia_)
# Find optimal k
optimal_k_silhouette = np.argmax(results['silhouette']) + 2
# Elbow method for inertia
inertia = np.array(results['inertia'])
diffs = np.diff(inertia)
diffs2 = np.diff(diffs)
optimal_k_elbow = np.argmax(diffs2) + 3 if len(diffs2) > 0 else 3
return {
'metrics': results,
'optimal_k_silhouette': optimal_k_silhouette,
'optimal_k_elbow': optimal_k_elbow,
'recommended_k': round((optimal_k_silhouette + optimal_k_elbow) / 2)
}
def fit_clustering(self,
X: np.ndarray,
n_clusters: int,
method: str = 'kmeans'):
"""Fit clustering model."""
if method == 'kmeans':
self.cluster_model = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
elif method == 'gmm':
self.cluster_model = GaussianMixture(n_components=n_clusters, random_state=42)
elif method == 'hierarchical':
self.cluster_model = AgglomerativeClustering(n_clusters=n_clusters)
return self.cluster_model.fit_predict(X)
def analyze_clusters(self,
players: pd.DataFrame,
labels: np.ndarray,
feature_cols: list) -> pd.DataFrame:
"""Analyze cluster characteristics."""
players_clustered = players.copy()
players_clustered['cluster'] = labels
# Calculate cluster means
cluster_profiles = players_clustered.groupby('cluster')[feature_cols].mean()
# Calculate cluster sizes
cluster_sizes = players_clustered['cluster'].value_counts().sort_index()
# Name clusters based on characteristics
cluster_names = self._generate_cluster_names(cluster_profiles)
cluster_profiles['cluster_name'] = cluster_names
cluster_profiles['size'] = cluster_sizes.values
return cluster_profiles
def _generate_cluster_names(self, profiles: pd.DataFrame) -> list:
"""Generate descriptive names for clusters based on profile."""
names = []
for idx, row in profiles.iterrows():
# Standardize the profile to find its distinguishing features
z_scores = (row - profiles.mean()) / profiles.std()
# Generate name based on position
if self.position == 'QB':
name = self._name_qb_cluster(row, z_scores)
elif self.position == 'RB':
name = self._name_rb_cluster(row, z_scores)
elif self.position == 'WR':
name = self._name_wr_cluster(row, z_scores)
else:
name = f"Cluster {idx}"
names.append(name)
return names
def _name_qb_cluster(self, row: pd.Series, z_scores: pd.Series) -> str:
"""Generate QB archetype name."""
if z_scores.get('rush_yards_per_attempt', 0) > 1:
return "Dual-Threat"
elif z_scores.get('deep_pass_pct', 0) > 1:
return "Deep Ball Specialist"
elif z_scores.get('completion_pct', 0) > 1 and z_scores.get('yards_per_attempt', 0) < 0:
return "Game Manager"
elif z_scores.get('yards_per_attempt', 0) > 1:
return "Aggressive Gunslinger"
elif z_scores.get('sack_rate', 0) < -1:
return "Pocket Presence"
else:
return "Balanced Passer"
def _name_rb_cluster(self, row: pd.Series, z_scores: pd.Series) -> str:
"""Generate RB archetype name."""
if z_scores.get('receiving_targets_per_game', 0) > 1:
return "Receiving Back"
elif z_scores.get('yards_after_contact', 0) > 1:
return "Power Back"
elif z_scores.get('elusive_rating', 0) > 1:
return "Elusive Scat Back"
elif z_scores.get('pass_block_grade', 0) > 1:
return "Pass-Blocking Specialist"
else:
return "All-Purpose Back"
def _name_wr_cluster(self, row: pd.Series, z_scores: pd.Series) -> str:
"""Generate WR archetype name."""
if z_scores.get('slot_rate', 0) > 1:
return "Slot Receiver"
elif z_scores.get('deep_target_rate', 0) > 1:
return "Deep Threat"
elif z_scores.get('contested_catch_rate', 0) > 1:
return "Possession Receiver"
elif z_scores.get('yards_after_catch', 0) > 1:
return "YAC Monster"
else:
return "Route Technician"
def visualize_clusters(self, X: np.ndarray, labels: np.ndarray):
"""Visualize clusters using PCA."""
import matplotlib.pyplot as plt
# Reduce to 2D for visualization
self.pca = PCA(n_components=2)
X_2d = self.pca.fit_transform(X)
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
X_2d[:, 0], X_2d[:, 1],
c=labels, cmap='viridis', alpha=0.6
)
plt.colorbar(scatter, label='Cluster')
plt.xlabel(f'PC1 ({self.pca.explained_variance_ratio_[0]*100:.1f}% var)')
plt.ylabel(f'PC2 ({self.pca.explained_variance_ratio_[1]*100:.1f}% var)')
plt.title(f'{self.position} Player Archetypes')
return plt
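A quick sanity check of the silhouette-based k selection in find_optimal_clusters, run on synthetic data with a known answer (sklearn's make_blobs standing in for real player stats): three well-separated blobs should score best at k = 3.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```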
22.5 Neural Networks for Football
22.5.1 Feed-Forward Networks
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
class FootballMLP(nn.Module):
"""Multi-layer perceptron for football prediction."""
def __init__(self, input_size: int, hidden_sizes: list, output_size: int = 1):
super().__init__()
layers = []
prev_size = input_size
for hidden_size in hidden_sizes:
layers.extend([
nn.Linear(prev_size, hidden_size),
nn.BatchNorm1d(hidden_size),
nn.ReLU(),
nn.Dropout(0.3)
])
prev_size = hidden_size
layers.append(nn.Linear(prev_size, output_size))
layers.append(nn.Sigmoid())
self.network = nn.Sequential(*layers)
def forward(self, x):
return self.network(x)
class NeuralNetworkTrainer:
"""Training wrapper for neural network models."""
def __init__(self, model: nn.Module, learning_rate: float = 0.001):
self.model = model
self.optimizer = optim.Adam(model.parameters(), lr=learning_rate)
self.criterion = nn.BCELoss()
self.history = {'train_loss': [], 'val_loss': []}
def train_epoch(self, train_loader: DataLoader) -> float:
"""Train for one epoch."""
self.model.train()
total_loss = 0
for X_batch, y_batch in train_loader:
self.optimizer.zero_grad()
predictions = self.model(X_batch).squeeze()
loss = self.criterion(predictions, y_batch)
loss.backward()
self.optimizer.step()
total_loss += loss.item()
return total_loss / len(train_loader)
def evaluate(self, val_loader: DataLoader) -> float:
"""Evaluate on validation set."""
self.model.eval()
total_loss = 0
with torch.no_grad():
for X_batch, y_batch in val_loader:
predictions = self.model(X_batch).squeeze()
loss = self.criterion(predictions, y_batch)
total_loss += loss.item()
return total_loss / len(val_loader)
def train(self,
X_train: np.ndarray,
y_train: np.ndarray,
X_val: np.ndarray,
y_val: np.ndarray,
epochs: int = 100,
batch_size: int = 64,
early_stopping_patience: int = 10):
"""Full training loop with early stopping."""
# Convert to tensors
train_dataset = TensorDataset(
torch.FloatTensor(X_train),
torch.FloatTensor(y_train)
)
val_dataset = TensorDataset(
torch.FloatTensor(X_val),
torch.FloatTensor(y_val)
)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
best_val_loss = float('inf')
patience_counter = 0
for epoch in range(epochs):
train_loss = self.train_epoch(train_loader)
val_loss = self.evaluate(val_loader)
self.history['train_loss'].append(train_loss)
self.history['val_loss'].append(val_loss)
if val_loss < best_val_loss:
best_val_loss = val_loss
patience_counter = 0
# Save best model
self.best_state = self.model.state_dict().copy()
else:
patience_counter += 1
if patience_counter >= early_stopping_patience:
print(f"Early stopping at epoch {epoch}")
break
if epoch % 10 == 0:
print(f"Epoch {epoch}: Train Loss = {train_loss:.4f}, Val Loss = {val_loss:.4f}")
# Restore best model
self.model.load_state_dict(self.best_state)
def predict(self, X: np.ndarray) -> np.ndarray:
"""Get predictions."""
self.model.eval()
with torch.no_grad():
X_tensor = torch.FloatTensor(X)
predictions = self.model(X_tensor).squeeze().numpy()
return predictions
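The patience bookkeeping inside train(), pulled out as a standalone function so its behavior is easy to check (an illustrative helper, not part of the trainer class): stop once the validation loss has failed to improve for `patience` consecutive epochs.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None."""
    best, counter = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, counter = loss, 0  # new best: reset the patience counter
        else:
            counter += 1
            if counter >= patience:
                return epoch
    return None  # patience never exhausted

# Validation loss improves for three epochs, then plateaus.
print(early_stop_epoch([1.00, 0.80, 0.70, 0.71, 0.72, 0.73]))
```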
22.5.2 Recurrent Networks for Play Sequences
class PlaySequenceLSTM(nn.Module):
"""LSTM for modeling play sequences."""
def __init__(self,
input_size: int,
hidden_size: int = 64,
num_layers: int = 2,
num_classes: int = 10):
super().__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
self.lstm = nn.LSTM(
input_size=input_size,
hidden_size=hidden_size,
num_layers=num_layers,
batch_first=True,
dropout=0.3 # applied between stacked LSTM layers; has no effect (PyTorch warns) when num_layers == 1
)
self.fc = nn.Sequential(
nn.Linear(hidden_size, 32),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(32, num_classes)
)
def forward(self, x):
# x shape: (batch, seq_len, input_size)
lstm_out, (hidden, cell) = self.lstm(x)
# Use last hidden state
last_hidden = hidden[-1]
# Classify
output = self.fc(last_hidden)
return output
class PlayPredictionSystem:
"""System for predicting next play type."""
def __init__(self, sequence_length: int = 5):
self.sequence_length = sequence_length
self.play_encoder = None
self.model = None
def encode_play(self, play: dict) -> np.ndarray:
"""Encode a single play as feature vector."""
features = [
play['down'] / 4, # Normalize down
play['distance'] / 20, # Normalize distance
play['yard_line'] / 100, # Normalize field position
play['seconds_remaining'] / 3600, # Normalize time
play['score_diff'] / 50, # Normalize score
1 if play['play_type'] == 'pass' else 0,
1 if play['play_type'] == 'rush' else 0,
play.get('yards_gained', 0) / 20
]
return np.array(features)
def create_sequences(self, plays: pd.DataFrame) -> tuple:
"""Create sequences for training."""
sequences = []
labels = []
# Group by game
for game_id, game_plays in plays.groupby('game_id'):
game_plays = game_plays.sort_values('play_id')
for i in range(len(game_plays) - self.sequence_length):
seq = game_plays.iloc[i:i+self.sequence_length]
next_play = game_plays.iloc[i+self.sequence_length]
# Encode sequence
seq_encoded = np.array([
self.encode_play(row.to_dict())
for _, row in seq.iterrows()
])
sequences.append(seq_encoded)
# Label is next play type
labels.append(self._encode_play_type(next_play['play_type']))
return np.array(sequences), np.array(labels)
def _encode_play_type(self, play_type: str) -> int:
"""Encode play type as integer."""
play_types = {
'rush_inside': 0, 'rush_outside': 1,
'pass_short': 2, 'pass_deep': 3,
'screen': 4, 'play_action': 5,
'punt': 6, 'field_goal': 7,
'qb_scramble': 8, 'other': 9
}
return play_types.get(play_type, 9)
def train(self, plays: pd.DataFrame):
"""Train the play prediction model."""
X, y = self.create_sequences(plays)
# Split data
split_idx = int(0.8 * len(X))
X_train, X_val = X[:split_idx], X[split_idx:]
y_train, y_val = y[:split_idx], y[split_idx:]
# Build model
input_size = X.shape[2]
self.model = PlaySequenceLSTM(
input_size=input_size,
hidden_size=64,
num_layers=2,
num_classes=10
)
# Train
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(self.model.parameters(), lr=0.001)
train_dataset = TensorDataset(
torch.FloatTensor(X_train),
torch.LongTensor(y_train)
)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
for epoch in range(50):
self.model.train()
total_loss = 0
for X_batch, y_batch in train_loader:
optimizer.zero_grad()
output = self.model(X_batch)
loss = criterion(output, y_batch)
loss.backward()
optimizer.step()
total_loss += loss.item()
            if epoch % 10 == 0:
                # Evaluate on the held-out 20% so the validation split is actually used
                self.model.eval()
                with torch.no_grad():
                    val_preds = self.model(torch.FloatTensor(X_val)).argmax(dim=1)
                    val_acc = (val_preds == torch.LongTensor(y_val)).float().mean().item()
                print(f"Epoch {epoch}: Loss = {total_loss/len(train_loader):.4f}, "
                      f"Val Acc = {val_acc:.3f}")
def predict_next_play(self, recent_plays: list) -> dict:
"""Predict probability distribution for next play."""
self.model.eval()
# Encode recent plays
seq = np.array([self.encode_play(p) for p in recent_plays[-self.sequence_length:]])
seq = torch.FloatTensor(seq).unsqueeze(0)
with torch.no_grad():
output = self.model(seq)
probs = torch.softmax(output, dim=1).squeeze().numpy()
play_types = [
'rush_inside', 'rush_outside', 'pass_short', 'pass_deep',
'screen', 'play_action', 'punt', 'field_goal', 'qb_scramble', 'other'
]
return {play_type: prob for play_type, prob in zip(play_types, probs)}
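The windowing step inside `create_sequences` is easy to lose in the pandas plumbing, so here is the same sliding-window logic in isolation: slide a fixed-length window over a game's ordered plays and pair each window with the play that follows it. The drive below is hypothetical, with play-type labels standing in for the encoded feature vectors.

```python
def make_windows(plays, seq_len=5):
    """Pair each seq_len-play window with the play that follows it."""
    windows = []
    for i in range(len(plays) - seq_len):
        windows.append((plays[i:i + seq_len], plays[i + seq_len]))
    return windows

# Hypothetical drive: 7 plays yield 7 - 5 = 2 training pairs
drive = ["rush_inside", "pass_short", "rush_outside", "pass_deep",
         "screen", "rush_inside", "punt"]
pairs = make_windows(drive, seq_len=5)
# pairs[0]: first five plays -> "rush_inside"
# pairs[1]: plays 2-6        -> "punt"
```

Each game contributes `len(game_plays) - sequence_length` pairs, which is why short games add little signal and why windows never cross game boundaries.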
22.6 Model Evaluation and Validation
22.6.1 Cross-Validation Strategies
from sklearn.model_selection import (
KFold, StratifiedKFold, TimeSeriesSplit, GroupKFold
)
class FootballCrossValidator:
"""Cross-validation strategies for football data."""
@staticmethod
def temporal_cv(df: pd.DataFrame,
n_splits: int = 5) -> list:
"""Time-based cross-validation."""
seasons = sorted(df['season'].unique())
n_train_seasons = len(seasons) - n_splits
folds = []
for i in range(n_splits):
train_seasons = seasons[:n_train_seasons + i]
test_season = seasons[n_train_seasons + i]
train_idx = df[df['season'].isin(train_seasons)].index
test_idx = df[df['season'] == test_season].index
folds.append((train_idx, test_idx))
return folds
@staticmethod
def group_cv(df: pd.DataFrame,
group_col: str = 'team',
n_splits: int = 5) -> list:
"""Group-based cross-validation (leave teams out)."""
groups = df[group_col].values
gkf = GroupKFold(n_splits=n_splits)
return list(gkf.split(df, groups=groups))
@staticmethod
def stratified_temporal_cv(df: pd.DataFrame,
target_col: str = 'home_win',
n_splits: int = 5) -> list:
"""Stratified within temporal splits."""
seasons = sorted(df['season'].unique())
n_train_seasons = len(seasons) - n_splits
folds = []
for i in range(n_splits):
train_seasons = seasons[:n_train_seasons + i]
test_season = seasons[n_train_seasons + i]
train_data = df[df['season'].isin(train_seasons)]
test_data = df[df['season'] == test_season]
            # Skip degenerate folds where the test season has only one class
            if test_data[target_col].nunique() < 2:
                continue
            folds.append((train_data.index, test_data.index))
return folds
class ModelEvaluator:
"""Comprehensive model evaluation."""
def __init__(self):
self.results = {}
def evaluate_classification(self,
y_true: np.ndarray,
y_pred: np.ndarray,
y_prob: np.ndarray) -> dict:
"""Evaluate classification model."""
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, brier_score_loss,
confusion_matrix
)
return {
'accuracy': accuracy_score(y_true, y_pred),
'precision': precision_score(y_true, y_pred),
'recall': recall_score(y_true, y_pred),
'f1': f1_score(y_true, y_pred),
'auc': roc_auc_score(y_true, y_prob),
'brier_score': brier_score_loss(y_true, y_prob),
'log_loss': -np.mean(
y_true * np.log(y_prob + 1e-10) +
(1 - y_true) * np.log(1 - y_prob + 1e-10)
),
'confusion_matrix': confusion_matrix(y_true, y_pred)
}
def evaluate_regression(self,
y_true: np.ndarray,
y_pred: np.ndarray) -> dict:
"""Evaluate regression model."""
return {
'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
'mae': mean_absolute_error(y_true, y_pred),
'r2': r2_score(y_true, y_pred),
            'mape': np.mean(np.abs((y_true - y_pred) / (y_true + 1e-10))) * 100,  # unstable when y_true is near 0 (e.g., pick'em spreads)
'within_3': np.mean(np.abs(y_true - y_pred) <= 3),
'within_7': np.mean(np.abs(y_true - y_pred) <= 7)
}
def evaluate_calibration(self,
y_true: np.ndarray,
y_prob: np.ndarray,
n_bins: int = 10) -> dict:
"""Evaluate probability calibration."""
bins = np.linspace(0, 1, n_bins + 1)
calibration_data = []
        for i in range(n_bins):
            # Make the final bin inclusive so predictions of exactly 1.0 are counted
            if i == n_bins - 1:
                mask = (y_prob >= bins[i]) & (y_prob <= bins[i + 1])
            else:
                mask = (y_prob >= bins[i]) & (y_prob < bins[i + 1])
if mask.sum() > 0:
calibration_data.append({
'bin': f'{bins[i]:.1f}-{bins[i+1]:.1f}',
'predicted': y_prob[mask].mean(),
'actual': y_true[mask].mean(),
'count': mask.sum()
})
cal_df = pd.DataFrame(calibration_data)
ece = np.average(
np.abs(cal_df['predicted'] - cal_df['actual']),
weights=cal_df['count']
)
mce = np.abs(cal_df['predicted'] - cal_df['actual']).max()
return {
'calibration_curve': cal_df,
'ece': ece,
'mce': mce
}
def compare_to_baseline(self,
model_metrics: dict,
baseline_type: str = 'always_home') -> dict:
"""Compare model to simple baselines."""
baselines = {
'always_home': 0.55, # Home team wins ~55%
'coin_flip': 0.50,
            'vegas_implied': 0.524  # Break-even win rate against the spread at standard -110 odds
}
baseline_acc = baselines.get(baseline_type, 0.50)
return {
'model_accuracy': model_metrics['accuracy'],
'baseline_accuracy': baseline_acc,
'improvement': model_metrics['accuracy'] - baseline_acc,
'relative_improvement': (model_metrics['accuracy'] - baseline_acc) / baseline_acc * 100
}
22.7 Common Pitfalls and Best Practices
22.7.1 Data Leakage Prevention
class DataLeakageChecker:
"""Check for and prevent data leakage."""
@staticmethod
def check_temporal_leakage(df: pd.DataFrame,
feature_cols: list,
date_col: str = 'game_date') -> list:
"""Check for features that use future information."""
leaky_features = []
# Check for season-level aggregates
season_features = [col for col in feature_cols if 'season' in col.lower()]
for feat in season_features:
print(f"Warning: {feat} may contain full-season data (leakage risk)")
leaky_features.append(feat)
# Check for outcome-derived features
outcome_keywords = ['win', 'loss', 'margin', 'score', 'result']
for feat in feature_cols:
if any(kw in feat.lower() for kw in outcome_keywords):
print(f"Warning: {feat} may be derived from outcome (check carefully)")
leaky_features.append(feat)
return leaky_features
@staticmethod
def ensure_point_in_time(df: pd.DataFrame,
feature_cols: list,
game_date_col: str = 'game_date') -> pd.DataFrame:
"""Ensure features are point-in-time valid."""
df = df.copy()
# For rolling features, ensure they don't include current game
for col in feature_cols:
if 'rolling' in col.lower() or 'avg' in col.lower():
# Shift to exclude current game
df[col] = df.groupby('team')[col].shift(1)
return df
class BestPractices:
"""Best practices for football ML."""
@staticmethod
def get_recommendations() -> dict:
"""Get best practice recommendations."""
return {
'data_handling': [
'Use temporal train/test splits',
'Exclude garbage time plays',
'Handle missing data consistently',
'Account for rule changes across seasons'
],
'feature_engineering': [
'Include strength of schedule adjustments',
'Use opponent-adjusted metrics',
'Create score-time interaction terms',
'Consider home/away splits'
],
'model_training': [
'Use cross-validation appropriate for sports data',
'Tune regularization to prevent overfitting',
'Monitor for calibration, not just accuracy',
'Include baseline comparisons'
],
'evaluation': [
'Report Brier score for probabilistic predictions',
'Use ATS accuracy for betting applications',
'Check calibration across probability ranges',
'Validate on recent, unseen season'
],
'deployment': [
'Update models with new season data',
'Monitor prediction drift',
'Document feature definitions',
'Log model predictions for post-hoc analysis'
]
}
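The recommendation to monitor calibration, not just accuracy, becomes concrete with the Brier score: two models can make identical picks at a 0.5 threshold yet differ sharply in probability quality. A minimal sketch with made-up forecasts:

```python
def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

outcomes  = [1, 1, 0, 1]                 # hypothetical home-win results
confident = [0.95, 0.95, 0.05, 0.95]     # sharp, well-placed probabilities
hedged    = [0.55, 0.55, 0.45, 0.55]     # same picks at 0.5, but vague
# Both forecasters go 4-for-4 on accuracy, yet:
#   brier(confident, outcomes) ≈ 0.0025
#   brier(hedged, outcomes)    ≈ 0.2025
```

Accuracy cannot separate these two forecasters, but any probability-weighted use of the model, such as staking decisions or win-probability charts, will.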
Summary
This chapter has explored the application of machine learning to college football analytics:
- Supervised Learning: Classification for game outcomes and regression for spread prediction offer modest but meaningful improvements over baselines when properly implemented.
- Ensemble Methods: Combining multiple models through voting or stacking consistently outperforms individual models, with XGBoost often serving as a strong base learner.
- Unsupervised Learning: Clustering reveals meaningful player archetypes that go beyond traditional position labels, providing insights for recruiting and team building.
- Neural Networks: Deep learning architectures can capture complex patterns, particularly in sequence modeling for play prediction, though they require more data and careful tuning.
- Evaluation: Proper validation using temporal splits and calibration metrics is critical for football applications where the margin between success and failure is small.
- Pitfalls: Data leakage, overfitting to small samples, and ignoring domain knowledge are common failure modes that can be avoided with disciplined practices.
The key to successful football ML is balancing model complexity with interpretability, always grounding predictions in football logic, and maintaining rigorous evaluation standards.
Key Takeaways
- Machine learning provides modest improvements over traditional methods when properly implemented
- Temporal validation is essential for sports prediction tasks
- Ensemble methods consistently outperform individual models
- Feature engineering remains crucial even with complex models
- Calibration matters as much as accuracy for probabilistic predictions
- Domain knowledge should guide model development and interpretation