Chapter 17: Introduction to Predictive Analytics in Football
Learning Objectives
By the end of this chapter, you will be able to:
- Understand the role of predictive analytics in modern football operations
- Distinguish between descriptive, predictive, and prescriptive analytics
- Apply the machine learning workflow to football problems
- Evaluate model performance using appropriate metrics
- Recognize common pitfalls in sports prediction modeling
- Build foundational prediction models using Python
Introduction
Every football decision involves prediction. When a coach calls a fourth-down play, they implicitly predict the probability of conversion. When a general manager drafts a player, they predict future performance. When a defensive coordinator adjusts coverage, they predict offensive tendencies. Predictive analytics transforms these implicit predictions into explicit, data-driven models that can be tested, improved, and deployed at scale.
This chapter introduces the fundamental concepts of predictive analytics as applied to college football. We'll establish the theoretical foundation for the machine learning techniques covered in subsequent chapters while building practical skills through hands-on implementation.
The Evolution of Football Prediction
Football prediction has evolved dramatically over the past century:
Early Era (1900s-1960s): Predictions were based purely on intuition, recent performance, and media narratives. The concept of "momentum" and "big game experience" dominated discourse.
Statistical Era (1970s-1990s): Simple statistics like rushing yards, passing efficiency, and turnover margin became the basis for predictions, and analysts such as Jeff Sagarin pioneered computer rankings.
Analytics Era (2000s-2010s): Advanced metrics like EPA, Success Rate, and SP+ emerged. Pro Football Focus began grading every player on every play.
Machine Learning Era (2020s-Present): Neural networks, ensemble methods, and real-time tracking data enable unprecedented prediction accuracy. Teams employ dedicated data science staffs.
Why Predictive Analytics Matters
Predictive analytics provides competitive advantages across football operations:
| Application | Traditional Approach | Predictive Analytics Approach |
|---|---|---|
| Recruiting | Subjective ratings, eye test | Combine metrics + film grades + projection models |
| Game Planning | Film study, tendencies | Automated tendency analysis + optimal strategy models |
| In-Game Decisions | Experience, gut feel | Win probability models, expected value calculations |
| Player Development | Coach intuition | Performance trajectory prediction |
| Roster Management | Positional need | Contract value models, replacement player analysis |
Fundamentals of Prediction
The Prediction Problem
At its core, prediction involves estimating an unknown quantity based on known information:
$$y = f(X) + \epsilon, \qquad \hat{y} = \hat{f}(X)$$
Where:
- $y$ is the outcome we want to predict and $\hat{y}$ is our prediction of it
- $X$ represents input features (predictors)
- $f$ is the true relationship we're trying to learn, and $\hat{f}$ is our estimate of it
- $\epsilon$ is irreducible error (randomness in the outcome)
In football contexts, this might be:
- Predicting game outcomes ($\hat{y}$ = win probability)
- Projecting player performance ($\hat{y}$ = future EPA)
- Forecasting draft position ($\hat{y}$ = expected pick number)
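The $\epsilon$ term matters in practice: irreducible error puts a ceiling on how accurate any model can be. The short simulation below uses purely synthetic numbers (a made-up EPA-to-margin relationship, not real data) to show that even a model that knows the true $f$ still misses by roughly the noise level.
import numpy as np
rng = np.random.default_rng(42)
# Synthetic setup: margin of victory = 30 * EPA differential + noise
epa_diff = rng.normal(0, 0.2, size=1000)        # X: feature
noise = rng.normal(0, 10, size=1000)            # epsilon: irreducible error
observed_margin = 30 * epa_diff + noise         # y = f(X) + epsilon
# A "perfect" model that knows f exactly still cannot remove the noise
perfect_prediction = 30 * epa_diff
residual_std = np.std(observed_margin - perfect_prediction)
print(f"Residual spread of a perfect model: {residual_std:.1f} points")  # roughly 10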
Types of Prediction Problems
Classification
Predicting categorical outcomes:
- Binary: Will the team win? (Yes/No)
- Multi-class: What defensive coverage? (Cover 2, Cover 3, Cover 4, Man)
- Multi-label: Which receivers are primary targets? (Multiple possible)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
def classify_game_outcome(team_data: pd.DataFrame) -> dict:
"""
Binary classification example: Predict game wins.
Parameters:
-----------
team_data : pd.DataFrame
Team statistics with 'won' target column
Returns:
--------
dict : Model performance metrics
"""
# Features: offensive and defensive efficiency
feature_cols = ['offensive_epa', 'defensive_epa', 'turnover_margin',
'third_down_pct', 'red_zone_pct']
X = team_data[feature_cols]
y = team_data['won']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
return {
'model': model,
'accuracy': accuracy,
'feature_importance': dict(zip(feature_cols, model.coef_[0]))
}
Regression
Predicting continuous outcomes:
- Points scored in a game
- Player performance metrics (yards, touchdowns)
- Future salary/contract value
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def predict_points_scored(game_data: pd.DataFrame) -> dict:
"""
Regression example: Predict points scored.
Parameters:
-----------
game_data : pd.DataFrame
Game-level statistics
Returns:
--------
dict : Model performance metrics
"""
feature_cols = ['total_yards', 'turnovers', 'time_of_possession',
'third_down_conversions', 'penalties']
X = game_data[feature_cols]
y = game_data['points_scored']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train regressor
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
return {
'model': model,
'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
'mae': mean_absolute_error(y_test, y_pred),
'r2': r2_score(y_test, y_pred),
'coefficients': dict(zip(feature_cols, model.coef_))
}
Probability Estimation
Estimating likelihoods of outcomes:
- Win probability at any point in a game
- Conversion probability on fourth down
- Draft pick probability distributions
from sklearn.calibration import CalibratedClassifierCV
def estimate_win_probability(situation_data: pd.DataFrame) -> dict:
"""
Probability estimation: Win probability model.
Parameters:
-----------
situation_data : pd.DataFrame
In-game situations with outcomes
Returns:
--------
dict : Calibrated probability model
"""
feature_cols = ['score_differential', 'time_remaining',
'possession', 'field_position', 'timeouts_remaining']
X = situation_data[feature_cols]
y = situation_data['won']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Base classifier
base_model = LogisticRegression(max_iter=1000)
# Calibrate probabilities
calibrated_model = CalibratedClassifierCV(base_model, cv=5)
calibrated_model.fit(X_train, y_train)
# Get probability predictions
y_prob = calibrated_model.predict_proba(X_test)[:, 1]
return {
'model': calibrated_model,
'probabilities': y_prob,
'actual': y_test.values
}
The Machine Learning Workflow
Step 1: Problem Definition
Before writing any code, clearly define the prediction problem:
Questions to Answer:
1. What exactly are we predicting?
2. When do we need the prediction? (pre-game, in-game, off-season)
3. What decisions will the prediction inform?
4. How accurate does it need to be to be useful?
5. What is the cost of errors (false positives vs. false negatives)?
Example Problem Definition:
Problem: Fourth Down Decision Making
- Prediction: Probability of conversion for various play types
- Timing: Real-time during games
- Decision: Go for it, punt, or kick field goal
- Accuracy Target: Calibrated within 5% of true probabilities
- Error Costs: Going for it and failing is costly but recoverable;
punting when you should go costs expected points
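A minimal sketch of how such a model would feed the decision, using hypothetical values for the conversion probability and the expected points of each outcome (in practice these would come from the calibrated models discussed later in the chapter):
def fourth_down_decision(p_convert: float,
                         ep_if_convert: float,
                         ep_if_fail: float,
                         ep_if_punt: float,
                         ep_if_fg: float) -> dict:
    """Compare the expected points of going for it, punting, and kicking.

    All inputs are analyst-supplied estimates (hypothetical here).
    """
    ep_go = p_convert * ep_if_convert + (1 - p_convert) * ep_if_fail
    options = {'go': ep_go, 'punt': ep_if_punt, 'field_goal': ep_if_fg}
    return {'expected_points': options,
            'recommendation': max(options, key=options.get)}

# Hypothetical 4th-and-2 near the opponent 38-yard line
print(fourth_down_decision(p_convert=0.55, ep_if_convert=3.2,
                           ep_if_fail=-1.5, ep_if_punt=0.2, ep_if_fg=0.8))
# 'go' has the highest expected points with these made-up numbers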
Step 2: Data Collection and Understanding
Gather relevant data and understand its structure:
import pandas as pd
from typing import Dict, List, Tuple
class FootballDataPipeline:
"""
Data collection and understanding for football prediction.
"""
def __init__(self, data_source: str):
self.data_source = data_source
self.raw_data = None
self.processed_data = None
def load_data(self) -> pd.DataFrame:
"""Load raw data from source."""
# In practice, this would connect to a database or API
self.raw_data = pd.read_csv(self.data_source)
return self.raw_data
def explore_data(self) -> Dict:
"""Generate data quality report."""
if self.raw_data is None:
raise ValueError("Load data first")
report = {
'shape': self.raw_data.shape,
'columns': list(self.raw_data.columns),
'dtypes': self.raw_data.dtypes.to_dict(),
'missing': self.raw_data.isnull().sum().to_dict(),
'missing_pct': (self.raw_data.isnull().sum() / len(self.raw_data) * 100).to_dict(),
'numeric_stats': self.raw_data.describe().to_dict(),
}
return report
def check_target_distribution(self, target_col: str) -> Dict:
"""Analyze target variable distribution."""
target = self.raw_data[target_col]
if target.dtype in ['int64', 'float64']:
# Continuous target
return {
'type': 'continuous',
'mean': target.mean(),
'std': target.std(),
'min': target.min(),
'max': target.max(),
'median': target.median()
}
else:
# Categorical target
return {
'type': 'categorical',
'classes': target.unique().tolist(),
'class_counts': target.value_counts().to_dict(),
'class_balance': (target.value_counts() / len(target)).to_dict()
}
Step 3: Feature Engineering
Transform raw data into predictive features:
from typing import List
import numpy as np
class FootballFeatureEngineer:
"""
Feature engineering for football prediction models.
"""
def __init__(self, play_data: pd.DataFrame):
self.plays = play_data
def create_game_features(self) -> pd.DataFrame:
"""Create game-level features."""
game_features = self.plays.groupby('game_id').agg({
'epa': ['sum', 'mean', 'std'],
'success': 'mean',
'yards_gained': ['sum', 'mean'],
'turnover': 'sum',
'penalty': 'sum'
})
# Flatten column names
game_features.columns = ['_'.join(col) for col in game_features.columns]
return game_features
def create_rolling_features(self, window: int = 5) -> pd.DataFrame:
"""Create rolling average features for team performance."""
# Sort by team and game date
sorted_data = self.plays.sort_values(['team', 'game_date'])
        # transform keeps one row per play and returns each team's rolling
        # mean up to that point; agg would try to collapse each team to a single value
        rolling_features = sorted_data.groupby('team')[['epa', 'success']].transform(
            lambda x: x.rolling(window, min_periods=1).mean()
        )
return rolling_features
def create_situational_features(self) -> pd.DataFrame:
"""Create situational context features."""
features = self.plays.copy()
# Down and distance encoding
features['short_yardage'] = (features['ydstogo'] <= 2).astype(int)
features['medium_yardage'] = ((features['ydstogo'] > 2) &
(features['ydstogo'] <= 6)).astype(int)
features['long_yardage'] = (features['ydstogo'] > 6).astype(int)
# Field position zones
features['own_territory'] = (features['yardline_100'] > 50).astype(int)
features['red_zone'] = (features['yardline_100'] <= 20).astype(int)
features['goal_to_go'] = (features['yardline_100'] <= 10).astype(int)
# Game state
features['close_game'] = (abs(features['score_differential']) <= 7).astype(int)
features['trailing'] = (features['score_differential'] < 0).astype(int)
features['late_game'] = (features['game_seconds_remaining'] <= 300).astype(int)
return features
def create_opponent_adjusted_features(self) -> pd.DataFrame:
"""Create opponent-adjusted features."""
# Calculate opponent averages
opponent_stats = self.plays.groupby('opponent').agg({
'epa': 'mean',
'success': 'mean'
}).rename(columns={'epa': 'opp_epa_allowed', 'success': 'opp_success_allowed'})
# Merge back
features = self.plays.merge(opponent_stats, left_on='opponent', right_index=True)
# Calculate adjusted metrics
features['epa_vs_expected'] = features['epa'] - features['opp_epa_allowed']
return features
Step 4: Model Selection
Choose appropriate algorithms based on the problem:
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPClassifier, MLPRegressor
class ModelSelector:
"""
Guide model selection based on problem characteristics.
"""
CLASSIFICATION_MODELS = {
'logistic': {
'model': LogisticRegression,
'params': {'max_iter': 1000, 'random_state': 42},
'interpretable': True,
'handles_multiclass': True,
'requires_scaling': True
},
'random_forest': {
'model': RandomForestClassifier,
'params': {'n_estimators': 100, 'random_state': 42},
'interpretable': False,
'handles_multiclass': True,
'requires_scaling': False
},
'gradient_boost': {
'model': GradientBoostingClassifier,
'params': {'n_estimators': 100, 'random_state': 42},
'interpretable': False,
'handles_multiclass': True,
'requires_scaling': False
},
'neural_net': {
'model': MLPClassifier,
'params': {'hidden_layer_sizes': (100, 50), 'max_iter': 500, 'random_state': 42},
'interpretable': False,
'handles_multiclass': True,
'requires_scaling': True
}
}
REGRESSION_MODELS = {
'ridge': {
'model': Ridge,
'params': {'alpha': 1.0, 'random_state': 42},
'interpretable': True,
'requires_scaling': True
},
'lasso': {
'model': Lasso,
'params': {'alpha': 0.1, 'random_state': 42},
'interpretable': True,
'requires_scaling': True
},
'random_forest': {
'model': RandomForestRegressor,
'params': {'n_estimators': 100, 'random_state': 42},
'interpretable': False,
'requires_scaling': False
},
'gradient_boost': {
'model': GradientBoostingRegressor,
'params': {'n_estimators': 100, 'random_state': 42},
'interpretable': False,
'requires_scaling': False
}
}
@classmethod
def recommend(cls, problem_type: str, requirements: Dict) -> List[str]:
"""
Recommend models based on problem requirements.
Parameters:
-----------
problem_type : str
'classification' or 'regression'
requirements : dict
Keys: 'interpretability', 'multiclass', 'large_data', 'realtime'
Returns:
--------
List[str] : Recommended model names
"""
models = cls.CLASSIFICATION_MODELS if problem_type == 'classification' else cls.REGRESSION_MODELS
recommendations = []
for name, config in models.items():
score = 0
if requirements.get('interpretability') and config.get('interpretable'):
score += 2
if requirements.get('multiclass') and config.get('handles_multiclass', True):
score += 1
if requirements.get('large_data') and name in ['random_forest', 'gradient_boost']:
score += 1
if requirements.get('realtime') and config.get('interpretable'):
score += 1 # Simpler models are faster
recommendations.append((name, score))
# Sort by score
recommendations.sort(key=lambda x: x[1], reverse=True)
return [name for name, score in recommendations]
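For instance, an in-game model that coaches need to trust and that must score quickly might be queried like this (a usage sketch against the class above; the ranking reflects the simple scoring heuristic, not a definitive verdict):
# Rank candidate classifiers for an interpretable, real-time use case
ranked = ModelSelector.recommend(
    problem_type='classification',
    requirements={'interpretability': True, 'realtime': True}
)
print(ranked)  # 'logistic' ranks first under this heuristic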
Step 5: Model Training and Validation
Train models with proper validation:
from sklearn.model_selection import cross_val_score, TimeSeriesSplit, StratifiedKFold
from sklearn.preprocessing import StandardScaler
import numpy as np
class FootballModelTrainer:
"""
Train and validate football prediction models.
"""
def __init__(self, model, scaling: bool = True):
self.model = model
self.scaling = scaling
self.scaler = StandardScaler() if scaling else None
self.is_fitted = False
def train_with_cv(self, X: pd.DataFrame, y: pd.Series,
cv_strategy: str = 'kfold',
n_splits: int = 5) -> Dict:
"""
Train model with cross-validation.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
y : pd.Series
Target variable
cv_strategy : str
'kfold', 'stratified', or 'timeseries'
n_splits : int
Number of CV folds
Returns:
--------
dict : Cross-validation results
"""
# Select CV strategy
if cv_strategy == 'kfold':
cv = n_splits
elif cv_strategy == 'stratified':
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
elif cv_strategy == 'timeseries':
cv = TimeSeriesSplit(n_splits=n_splits)
else:
raise ValueError(f"Unknown CV strategy: {cv_strategy}")
# Scale features if needed
if self.scaling:
X_scaled = self.scaler.fit_transform(X)
else:
X_scaled = X.values
# Cross-validate
scores = cross_val_score(self.model, X_scaled, y, cv=cv, scoring='accuracy')
# Fit final model on all data
self.model.fit(X_scaled, y)
self.is_fitted = True
return {
'cv_scores': scores,
'mean_cv_score': scores.mean(),
'std_cv_score': scores.std(),
'model': self.model
}
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Make predictions on new data."""
if not self.is_fitted:
raise ValueError("Model not fitted. Call train_with_cv first.")
if self.scaling:
X_scaled = self.scaler.transform(X)
else:
X_scaled = X.values
return self.model.predict(X_scaled)
def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
"""Get probability predictions."""
if not self.is_fitted:
raise ValueError("Model not fitted. Call train_with_cv first.")
if not hasattr(self.model, 'predict_proba'):
raise ValueError("Model doesn't support probability predictions")
if self.scaling:
X_scaled = self.scaler.transform(X)
else:
X_scaled = X.values
return self.model.predict_proba(X_scaled)
Step 6: Evaluation
Assess model performance with appropriate metrics:
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
roc_auc_score, confusion_matrix, classification_report,
mean_squared_error, mean_absolute_error, r2_score
)
import matplotlib.pyplot as plt
from typing import Dict
class ModelEvaluator:
"""
Comprehensive model evaluation for football predictions.
"""
@staticmethod
def evaluate_classifier(y_true: np.ndarray, y_pred: np.ndarray,
y_prob: np.ndarray = None) -> Dict:
"""
Evaluate classification model.
Parameters:
-----------
y_true : np.ndarray
True labels
y_pred : np.ndarray
Predicted labels
y_prob : np.ndarray, optional
Predicted probabilities
Returns:
--------
dict : Evaluation metrics
"""
metrics = {
'accuracy': accuracy_score(y_true, y_pred),
'precision': precision_score(y_true, y_pred, average='weighted'),
'recall': recall_score(y_true, y_pred, average='weighted'),
'f1': f1_score(y_true, y_pred, average='weighted'),
'confusion_matrix': confusion_matrix(y_true, y_pred)
}
if y_prob is not None:
try:
metrics['auc_roc'] = roc_auc_score(y_true, y_prob)
except ValueError:
metrics['auc_roc'] = None
return metrics
@staticmethod
def evaluate_regressor(y_true: np.ndarray, y_pred: np.ndarray) -> Dict:
"""
Evaluate regression model.
Parameters:
-----------
y_true : np.ndarray
True values
y_pred : np.ndarray
Predicted values
Returns:
--------
dict : Evaluation metrics
"""
return {
'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
'mae': mean_absolute_error(y_true, y_pred),
'r2': r2_score(y_true, y_pred),
'mean_error': np.mean(y_pred - y_true),
'std_error': np.std(y_pred - y_true)
}
@staticmethod
def evaluate_calibration(y_true: np.ndarray, y_prob: np.ndarray,
n_bins: int = 10) -> Dict:
"""
Evaluate probability calibration.
Critical for win probability and conversion models.
"""
from sklearn.calibration import calibration_curve
# Calculate calibration curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
# Brier score (lower is better)
brier = np.mean((y_prob - y_true) ** 2)
# Expected Calibration Error
bin_indices = np.digitize(y_prob, np.linspace(0, 1, n_bins + 1)[1:-1])
ece = 0
for i in range(n_bins):
mask = bin_indices == i
if mask.sum() > 0:
bin_accuracy = y_true[mask].mean()
bin_confidence = y_prob[mask].mean()
bin_weight = mask.sum() / len(y_prob)
ece += bin_weight * abs(bin_accuracy - bin_confidence)
return {
'brier_score': brier,
'ece': ece,
'calibration_curve': (prob_true, prob_pred)
}
@staticmethod
def plot_calibration(y_true: np.ndarray, y_prob: np.ndarray,
title: str = 'Calibration Plot') -> plt.Figure:
"""Create calibration plot."""
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
fig, ax = plt.subplots(figsize=(8, 8))
# Perfect calibration line
ax.plot([0, 1], [0, 1], 'k--', label='Perfectly Calibrated')
# Model calibration
ax.plot(prob_pred, prob_true, 's-', label='Model')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.set_title(title)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
return fig
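As a sanity check, the calibration metrics can be exercised on synthetic probabilities where the outcomes are drawn from the stated probabilities themselves, so both the Brier score and ECE should look healthy (this assumes the ModelEvaluator class above and random data, not real games):
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.05, 0.95, size=5000)
y_true = (rng.uniform(size=5000) < y_prob).astype(int)  # outcomes generated from the probabilities

calibration = ModelEvaluator.evaluate_calibration(y_true, y_prob)
print(f"Brier: {calibration['brier_score']:.3f}, ECE: {calibration['ece']:.3f}")  # ECE should be near zero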
Common Pitfalls in Sports Prediction
1. Data Leakage
Using information that wouldn't be available at prediction time:
# WRONG: Using final game stats to predict game outcome
def predict_winner_wrong(game_data):
# This includes final scores - data leakage!
features = ['home_score', 'away_score', 'home_yards', 'away_yards']
return model.predict(game_data[features])
# CORRECT: Using only pre-game information
def predict_winner_correct(game_data):
# Only use information available before the game
features = ['home_rating', 'away_rating', 'home_rest_days',
'away_rest_days', 'is_rivalry', 'home_injuries']
return model.predict(game_data[features])
2. Temporal Leakage
Training on future data to predict past events:
# WRONG: Random train/test split with time series data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 2023 games might be in training, 2021 games in test!
# CORRECT: Time-based split
def temporal_split(data, test_year=2023):
train = data[data['season'] < test_year]
test = data[data['season'] >= test_year]
return train, test
3. Overfitting to Small Samples
Football has limited games per team per season:
def assess_sample_adequacy(n_games: int, n_features: int) -> Dict:
"""
Assess if sample size is adequate for modeling.
Rule of thumb: Need 10-20 observations per feature.
"""
min_ratio = n_games / n_features
return {
'n_games': n_games,
'n_features': n_features,
'ratio': min_ratio,
'adequate': min_ratio >= 10,
'recommendation': 'Reduce features' if min_ratio < 10 else 'Sample adequate'
}
# Example: FBS team with 12 games
assessment = assess_sample_adequacy(n_games=12, n_features=20)
# ratio = 0.6 - NOT adequate! Need to reduce features.
4. Ignoring Base Rates
Forgetting how often events naturally occur:
def baseline_comparison(y_true: np.ndarray, y_pred: np.ndarray) -> Dict:
"""
Compare model to baseline predictions.
Always compare to:
- Random guessing
- Predicting the majority class
- Predicting the mean value
"""
n_samples = len(y_true)
# For classification
majority_class = pd.Series(y_true).mode()[0]
majority_baseline = (y_true == majority_class).mean()
# Model accuracy
model_accuracy = (y_true == y_pred).mean()
# Improvement over baseline
improvement = model_accuracy - majority_baseline
return {
'majority_baseline': majority_baseline,
'model_accuracy': model_accuracy,
'improvement': improvement,
'relative_improvement': improvement / (1 - majority_baseline) if majority_baseline < 1 else 0
}
5. Selection Bias
Models trained only on games that happened:
def identify_selection_bias(data: pd.DataFrame) -> Dict:
"""
Identify potential selection bias in the data.
Examples in football:
- Only looking at games with available tracking data
- Only considering players who stayed healthy
- Only analyzing teams that made the playoffs
"""
bias_indicators = {
'missing_data_games': data['tracking_available'].sum() / len(data),
'injury_filtered': data['injury_game'].sum() / len(data),
'playoff_only': data['playoff_game'].sum() / len(data) if 'playoff_game' in data else None
}
return bias_indicators
Building Your First Prediction Model
Let's build a complete prediction model for game outcomes:
"""
Complete Game Outcome Prediction Pipeline
"""
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
class GameOutcomePredictor:
"""
Predict game outcomes using team statistics.
"""
def __init__(self):
self.model = None
self.scaler = StandardScaler()
self.feature_columns = None
self.training_history = []
def prepare_features(self, team_stats: pd.DataFrame,
schedule: pd.DataFrame) -> pd.DataFrame:
"""
Prepare features for game prediction.
Parameters:
-----------
team_stats : pd.DataFrame
Team-level statistics (season aggregates)
schedule : pd.DataFrame
Game schedule with home/away teams
Returns:
--------
pd.DataFrame : Feature matrix for each game
"""
# Merge home team stats
games = schedule.merge(
team_stats.add_prefix('home_'),
left_on='home_team',
right_on='home_team_name'
)
# Merge away team stats
games = games.merge(
team_stats.add_prefix('away_'),
left_on='away_team',
right_on='away_team_name'
)
# Create differential features
stat_cols = ['offensive_epa', 'defensive_epa', 'turnover_margin',
'third_down_pct', 'red_zone_pct']
for col in stat_cols:
games[f'{col}_diff'] = games[f'home_{col}'] - games[f'away_{col}']
# Add home field advantage indicator
games['home_advantage'] = 1
self.feature_columns = [f'{col}_diff' for col in stat_cols] + ['home_advantage']
return games
def train(self, X: pd.DataFrame, y: pd.Series,
model_type: str = 'logistic') -> Dict:
"""
Train the prediction model.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
y : pd.Series
Target (home team win = 1)
model_type : str
'logistic' or 'random_forest'
Returns:
--------
dict : Training results
"""
# Scale features
X_scaled = self.scaler.fit_transform(X[self.feature_columns])
# Select model
if model_type == 'logistic':
self.model = LogisticRegression(max_iter=1000, random_state=42)
else:
self.model = RandomForestClassifier(n_estimators=100, random_state=42)
# Cross-validate
cv_scores = cross_val_score(self.model, X_scaled, y, cv=5, scoring='accuracy')
# Fit on all data
self.model.fit(X_scaled, y)
results = {
'cv_accuracy': cv_scores.mean(),
'cv_std': cv_scores.std(),
'model_type': model_type
}
self.training_history.append(results)
return results
def predict_game(self, home_stats: Dict, away_stats: Dict) -> Dict:
"""
Predict a single game outcome.
Parameters:
-----------
home_stats : dict
Home team statistics
away_stats : dict
Away team statistics
Returns:
--------
dict : Prediction with probability
"""
# Calculate differentials
features = {}
for col in ['offensive_epa', 'defensive_epa', 'turnover_margin',
'third_down_pct', 'red_zone_pct']:
features[f'{col}_diff'] = home_stats.get(col, 0) - away_stats.get(col, 0)
features['home_advantage'] = 1
# Create feature vector
X = pd.DataFrame([features])[self.feature_columns]
X_scaled = self.scaler.transform(X)
# Predict
pred = self.model.predict(X_scaled)[0]
prob = self.model.predict_proba(X_scaled)[0]
return {
'home_win': bool(pred),
'home_win_probability': prob[1],
'away_win_probability': prob[0]
}
def evaluate(self, X: pd.DataFrame, y: pd.Series) -> Dict:
"""Evaluate model on test data."""
X_scaled = self.scaler.transform(X[self.feature_columns])
y_pred = self.model.predict(X_scaled)
y_prob = self.model.predict_proba(X_scaled)[:, 1]
return {
'accuracy': accuracy_score(y, y_pred),
'auc_roc': roc_auc_score(y, y_prob),
'n_correct': (y == y_pred).sum(),
'n_total': len(y)
}
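To see the pieces fit together end to end, here is a usage sketch on synthetic data (random feature differentials and a noisy target, not real teams; prepare_features is bypassed by setting feature_columns directly):
# Usage sketch on synthetic data (random numbers, not real teams or games)
rng = np.random.default_rng(0)
n_games = 200
stat_cols = ['offensive_epa', 'defensive_epa', 'turnover_margin',
             'third_down_pct', 'red_zone_pct']

X = pd.DataFrame({f'{col}_diff': rng.normal(0, 1, n_games) for col in stat_cols})
X['home_advantage'] = 1
y = pd.Series(((X['offensive_epa_diff'] - X['defensive_epa_diff']
                + rng.normal(0, 1, n_games)) > 0).astype(int))

predictor = GameOutcomePredictor()
predictor.feature_columns = list(X.columns)  # normally set by prepare_features
print(predictor.train(X, y, model_type='logistic'))

# Hypothetical season-average statistics for a single matchup
home = {'offensive_epa': 0.15, 'defensive_epa': -0.05, 'turnover_margin': 0.4,
        'third_down_pct': 0.46, 'red_zone_pct': 0.88}
away = {'offensive_epa': 0.02, 'defensive_epa': 0.03, 'turnover_margin': -0.1,
        'third_down_pct': 0.38, 'red_zone_pct': 0.80}
print(predictor.predict_game(home_stats=home, away_stats=away))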
Model Deployment Considerations
Real-Time vs. Batch Predictions
from typing import List
import time
class PredictionDeployment:
"""
Considerations for deploying football predictions.
"""
@staticmethod
def batch_prediction(model, games: pd.DataFrame) -> pd.DataFrame:
"""
Batch prediction for pre-game analysis.
Use case: Generate predictions for all upcoming games
Timing: Daily or weekly
"""
predictions = []
for _, game in games.iterrows():
pred = model.predict_game(
home_stats=game['home_stats'],
away_stats=game['away_stats']
)
predictions.append({
'game_id': game['game_id'],
**pred
})
return pd.DataFrame(predictions)
@staticmethod
def real_time_prediction(model, situation: Dict,
latency_threshold_ms: float = 100) -> Dict:
"""
Real-time prediction for in-game decisions.
Use case: Win probability updates during games
Timing: Every play
"""
start_time = time.time()
prediction = model.predict(situation)
latency_ms = (time.time() - start_time) * 1000
if latency_ms > latency_threshold_ms:
print(f"Warning: Prediction latency {latency_ms:.1f}ms exceeds threshold")
return {
**prediction,
'latency_ms': latency_ms
}
Model Monitoring
class ModelMonitor:
"""
Monitor model performance over time.
"""
def __init__(self, model_name: str):
self.model_name = model_name
self.predictions = []
self.outcomes = []
def log_prediction(self, prediction: Dict, game_id: str):
"""Log a prediction for later evaluation."""
self.predictions.append({
'game_id': game_id,
'timestamp': time.time(),
**prediction
})
def log_outcome(self, game_id: str, actual_outcome: int):
"""Log actual outcome once known."""
self.outcomes.append({
'game_id': game_id,
'actual': actual_outcome
})
def calculate_drift(self, window_size: int = 50) -> Dict:
"""
Calculate model performance drift.
If recent performance differs significantly from training,
model may need retraining.
"""
if len(self.predictions) < window_size:
return {'status': 'insufficient_data'}
# Get recent predictions with outcomes
recent = pd.DataFrame(self.predictions[-window_size:])
outcomes = pd.DataFrame(self.outcomes)
merged = recent.merge(outcomes, on='game_id')
recent_accuracy = (merged['prediction'] == merged['actual']).mean()
return {
'recent_accuracy': recent_accuracy,
'window_size': window_size,
'needs_retraining': recent_accuracy < 0.55 # Below baseline
}
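A brief usage sketch of the monitor (the game ID and outcome are made up; note that the logged prediction dict must include a 'prediction' key to match calculate_drift):
monitor = ModelMonitor('game_outcome_v1')
monitor.log_prediction({'prediction': 1, 'home_win_probability': 0.71},
                       game_id='2024_wk01_home_vs_away')
monitor.log_outcome(game_id='2024_wk01_home_vs_away', actual_outcome=1)

# Drift check reports insufficient data until 50 predictions are logged
print(monitor.calculate_drift(window_size=50))  # {'status': 'insufficient_data'}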
Chapter Summary
This chapter introduced the fundamental concepts of predictive analytics for football:
Key Concepts:
1. Prediction problems can be classification (categorical outcomes), regression (continuous outcomes), or probability estimation
2. The ML workflow includes: problem definition, data collection, feature engineering, model selection, training, and evaluation
3. Proper validation requires temporal splits and cross-validation
4. Common pitfalls include data leakage, overfitting, and ignoring base rates
Technical Skills:
- Implemented classification and regression models using scikit-learn
- Created football-specific feature engineering pipelines
- Evaluated models with appropriate metrics
- Built a complete game outcome prediction system
Looking Ahead: The next chapters will dive deeper into specific prediction problems:
- Chapter 18: Game outcome prediction with advanced models
- Chapter 19: Player performance forecasting
- Chapter 20: Recruiting analytics and prospect evaluation
- Chapter 21: Win probability models for in-game decisions
- Chapter 22: Deep learning and advanced ML applications
Key Terminology
| Term | Definition |
|---|---|
| Classification | Predicting categorical outcomes |
| Regression | Predicting continuous outcomes |
| Feature Engineering | Creating predictive variables from raw data |
| Cross-Validation | Evaluating models on held-out data |
| Data Leakage | Using information not available at prediction time |
| Calibration | How well predicted probabilities match actual frequencies |
| Baseline | Simple prediction method for comparison |
| Overfitting | Model learns noise instead of signal |
Related Reading
Explore this topic in other books:
- Sports Betting: Regression Analysis
- Sports Betting: Advanced Regression & Classification
- AI Engineering: Supervised Learning
- Soccer Analytics: ML & Regression for Soccer