Chapter 17: Introduction to Predictive Analytics in Football

Learning Objectives

By the end of this chapter, you will be able to:

  1. Understand the role of predictive analytics in modern football operations
  2. Distinguish between descriptive, predictive, and prescriptive analytics
  3. Apply the machine learning workflow to football problems
  4. Evaluate model performance using appropriate metrics
  5. Recognize common pitfalls in sports prediction modeling
  6. Build foundational prediction models using Python

Introduction

Every football decision involves prediction. When a coach calls a fourth-down play, they implicitly predict the probability of conversion. When a general manager drafts a player, they predict future performance. When a defensive coordinator adjusts coverage, they predict offensive tendencies. Predictive analytics transforms these implicit predictions into explicit, data-driven models that can be tested, improved, and deployed at scale.

This chapter introduces the fundamental concepts of predictive analytics as applied to college football. We'll establish the theoretical foundation for the machine learning techniques covered in subsequent chapters while building practical skills through hands-on implementation.

The Evolution of Football Prediction

Football prediction has evolved dramatically over the past century:

Early Era (1900s-1960s): Predictions were based purely on intuition, recent performance, and media narratives. The concept of "momentum" and "big game experience" dominated discourse.

Statistical Era (1970s-1990s): Simple statistics like rushing yards, passing efficiency, and turnover margin became the basis for predictions. Jeff Sagarin's computer ratings pioneered algorithmic rankings during this era.

Analytics Era (2000s-2010s): Advanced metrics like EPA, Success Rate, and SP+ emerged. Pro Football Focus began grading every player on every play.

Machine Learning Era (2020s-Present): Neural networks, ensemble methods, and real-time tracking data enable unprecedented prediction accuracy. Teams employ dedicated data science staffs.

Why Predictive Analytics Matters

Predictive analytics provides competitive advantages across football operations:

| Application | Traditional Approach | Predictive Analytics Approach |
|---|---|---|
| Recruiting | Subjective ratings, eye test | Combine metrics + film grades + projection models |
| Game Planning | Film study, tendencies | Automated tendency analysis + optimal strategy models |
| In-Game Decisions | Experience, gut feel | Win probability models, expected value calculations |
| Player Development | Coach intuition | Performance trajectory prediction |
| Roster Management | Positional need | Contract value models, replacement player analysis |

Fundamentals of Prediction

The Prediction Problem

At its core, prediction involves estimating an unknown quantity based on known information:

$$y = f(X) + \epsilon$$

Where:

  • $y$ is the outcome we want to predict
  • $X$ represents input features (predictors)
  • $f$ is the true (unknown) relationship we're trying to learn
  • $\epsilon$ is irreducible error (randomness in the outcome)

A fitted model approximates $f$ with $\hat{f}$ and produces predictions $\hat{y} = \hat{f}(X)$; the $\epsilon$ term is the part of the outcome no model can capture, as the small sketch below illustrates.

In football contexts, this might be:

  • Predicting game outcomes ($\hat{y}$ = win probability)
  • Projecting player performance ($\hat{y}$ = future EPA)
  • Forecasting draft position ($\hat{y}$ = expected pick number)
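
To make the decomposition concrete, here is a minimal sketch that simulates a toy outcome (points as a linear function of total yards plus noise), fits a model, and shows that even a well-specified model's error bottoms out near the irreducible noise. The slope, intercept, and noise level are made-up illustrative values, not estimates from real games.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Toy data-generating process: points = f(yards) + epsilon
# (coefficients and noise scale are illustrative, not fit to real data)
n_games = 500
total_yards = rng.normal(380, 60, n_games)        # feature X
systematic = 0.08 * total_yards - 5.0             # f(X), the learnable part
noise = rng.normal(0, 6.0, n_games)               # epsilon, irreducible error
points = systematic + noise                       # observed outcome y

# Fit a model to approximate f
model = LinearRegression().fit(total_yards.reshape(-1, 1), points)
y_hat = model.predict(total_yards.reshape(-1, 1))

# RMSE approaches the noise standard deviation (~6 points) but cannot beat it
rmse = np.sqrt(np.mean((points - y_hat) ** 2))
print(f"RMSE: {rmse:.2f} vs. noise std of 6.0")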

Types of Prediction Problems

Classification

Predicting categorical outcomes:

  • Binary: Will the team win? (Yes/No)
  • Multi-class: What defensive coverage? (Cover 2, Cover 3, Cover 4, Man)
  • Multi-label: Which receivers are primary targets? (Multiple possible)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


def classify_game_outcome(team_data: pd.DataFrame) -> dict:
    """
    Binary classification example: Predict game wins.

    Parameters:
    -----------
    team_data : pd.DataFrame
        Team statistics with 'won' target column

    Returns:
    --------
    dict : Model performance metrics
    """
    # Features: offensive and defensive efficiency
    feature_cols = ['offensive_epa', 'defensive_epa', 'turnover_margin',
                   'third_down_pct', 'red_zone_pct']
    X = team_data[feature_cols]
    y = team_data['won']

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train classifier
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return {
        'model': model,
        'accuracy': accuracy,
        'feature_importance': dict(zip(feature_cols, model.coef_[0]))
    }

Regression

Predicting continuous outcomes:

  • Points scored in a game
  • Player performance metrics (yards, touchdowns)
  • Future salary/contract value
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


def predict_points_scored(game_data: pd.DataFrame) -> dict:
    """
    Regression example: Predict points scored.

    Parameters:
    -----------
    game_data : pd.DataFrame
        Game-level statistics

    Returns:
    --------
    dict : Model performance metrics
    """
    feature_cols = ['total_yards', 'turnovers', 'time_of_possession',
                   'third_down_conversions', 'penalties']

    X = game_data[feature_cols]
    y = game_data['points_scored']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train regressor
    model = Ridge(alpha=1.0)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)

    return {
        'model': model,
        'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
        'mae': mean_absolute_error(y_test, y_pred),
        'r2': r2_score(y_test, y_pred),
        'coefficients': dict(zip(feature_cols, model.coef_))
    }

Probability Estimation

Estimating likelihoods of outcomes:

  • Win probability at any point in a game
  • Conversion probability on fourth down
  • Draft pick probability distributions
from sklearn.calibration import CalibratedClassifierCV


def estimate_win_probability(situation_data: pd.DataFrame) -> dict:
    """
    Probability estimation: Win probability model.

    Parameters:
    -----------
    situation_data : pd.DataFrame
        In-game situations with outcomes

    Returns:
    --------
    dict : Calibrated probability model
    """
    feature_cols = ['score_differential', 'time_remaining',
                   'possession', 'field_position', 'timeouts_remaining']

    X = situation_data[feature_cols]
    y = situation_data['won']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Base classifier
    base_model = LogisticRegression(max_iter=1000)

    # Calibrate probabilities
    calibrated_model = CalibratedClassifierCV(base_model, cv=5)
    calibrated_model.fit(X_train, y_train)

    # Get probability predictions
    y_prob = calibrated_model.predict_proba(X_test)[:, 1]

    return {
        'model': calibrated_model,
        'probabilities': y_prob,
        'actual': y_test.values
    }

The Machine Learning Workflow

Step 1: Problem Definition

Before writing any code, clearly define the prediction problem:

Questions to Answer:

  1. What exactly are we predicting?
  2. When do we need the prediction? (pre-game, in-game, off-season)
  3. What decisions will the prediction inform?
  4. How accurate does it need to be to be useful?
  5. What is the cost of errors (false positives vs. false negatives)?

Example Problem Definition:

Problem: Fourth Down Decision Making
- Prediction: Probability of conversion for various play types
- Timing: Real-time during games
- Decision: Go for it, punt, or kick field goal
- Accuracy Target: Calibrated within 5% of true probabilities
- Error Costs: Going for it and failing is costly but recoverable;
              punting when you should go costs expected points
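
To connect the problem definition to the decision it informs, here is a minimal sketch of the expected-value comparison a fourth-down model would feed. The conversion probability and expected-points inputs are hypothetical placeholders, not outputs of a real model.

def fourth_down_expected_points(p_convert: float,
                                ep_if_convert: float,
                                ep_if_fail: float,
                                ep_if_punt: float) -> dict:
    """
    Compare the expected points of going for it vs. punting.

    All inputs are assumed to come from upstream models: a conversion
    probability model and an expected points model.
    """
    ep_go = p_convert * ep_if_convert + (1 - p_convert) * ep_if_fail
    return {
        'ep_go': ep_go,
        'ep_punt': ep_if_punt,
        'recommendation': 'go' if ep_go > ep_if_punt else 'punt'
    }


# Hypothetical 4th-and-2 near midfield (illustrative numbers only)
decision = fourth_down_expected_points(
    p_convert=0.55, ep_if_convert=2.4, ep_if_fail=-1.8, ep_if_punt=0.3
)
# ep_go = 0.55 * 2.4 + 0.45 * (-1.8) = 0.51, so the sketch recommends going for it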

Step 2: Data Collection and Understanding

Gather relevant data and understand its structure:

import pandas as pd
from typing import Dict, List, Tuple


class FootballDataPipeline:
    """
    Data collection and understanding for football prediction.
    """

    def __init__(self, data_source: str):
        self.data_source = data_source
        self.raw_data = None
        self.processed_data = None

    def load_data(self) -> pd.DataFrame:
        """Load raw data from source."""
        # In practice, this would connect to a database or API
        self.raw_data = pd.read_csv(self.data_source)
        return self.raw_data

    def explore_data(self) -> Dict:
        """Generate data quality report."""
        if self.raw_data is None:
            raise ValueError("Load data first")

        report = {
            'shape': self.raw_data.shape,
            'columns': list(self.raw_data.columns),
            'dtypes': self.raw_data.dtypes.to_dict(),
            'missing': self.raw_data.isnull().sum().to_dict(),
            'missing_pct': (self.raw_data.isnull().sum() / len(self.raw_data) * 100).to_dict(),
            'numeric_stats': self.raw_data.describe().to_dict(),
        }

        return report

    def check_target_distribution(self, target_col: str) -> Dict:
        """Analyze target variable distribution."""
        target = self.raw_data[target_col]

        if target.dtype in ['int64', 'float64']:
            # Continuous target
            return {
                'type': 'continuous',
                'mean': target.mean(),
                'std': target.std(),
                'min': target.min(),
                'max': target.max(),
                'median': target.median()
            }
        else:
            # Categorical target
            return {
                'type': 'categorical',
                'classes': target.unique().tolist(),
                'class_counts': target.value_counts().to_dict(),
                'class_balance': (target.value_counts() / len(target)).to_dict()
            }

Step 3: Feature Engineering

Transform raw data into predictive features:

from typing import List
import numpy as np


class FootballFeatureEngineer:
    """
    Feature engineering for football prediction models.
    """

    def __init__(self, play_data: pd.DataFrame):
        self.plays = play_data

    def create_game_features(self) -> pd.DataFrame:
        """Create game-level features."""
        game_features = self.plays.groupby('game_id').agg({
            'epa': ['sum', 'mean', 'std'],
            'success': 'mean',
            'yards_gained': ['sum', 'mean'],
            'turnover': 'sum',
            'penalty': 'sum'
        })

        # Flatten column names
        game_features.columns = ['_'.join(col) for col in game_features.columns]

        return game_features

    def create_rolling_features(self, window: int = 5) -> pd.DataFrame:
        """Create rolling average features for team performance."""
        # Sort by team and game date so the window looks backward in time
        sorted_data = self.plays.sort_values(['team', 'game_date']).copy()

        # Use transform so each row keeps its own rolling value
        # (groupby().agg() expects scalar aggregates and would fail here)
        for col in ['epa', 'success']:
            sorted_data[f'{col}_rolling_{window}'] = (
                sorted_data.groupby('team')[col]
                .transform(lambda x: x.rolling(window, min_periods=1).mean())
            )

        return sorted_data

    def create_situational_features(self) -> pd.DataFrame:
        """Create situational context features."""
        features = self.plays.copy()

        # Down and distance encoding
        features['short_yardage'] = (features['ydstogo'] <= 2).astype(int)
        features['medium_yardage'] = ((features['ydstogo'] > 2) &
                                       (features['ydstogo'] <= 6)).astype(int)
        features['long_yardage'] = (features['ydstogo'] > 6).astype(int)

        # Field position zones
        features['own_territory'] = (features['yardline_100'] > 50).astype(int)
        features['red_zone'] = (features['yardline_100'] <= 20).astype(int)
        features['goal_to_go'] = (features['yardline_100'] <= 10).astype(int)

        # Game state
        features['close_game'] = (abs(features['score_differential']) <= 7).astype(int)
        features['trailing'] = (features['score_differential'] < 0).astype(int)
        features['late_game'] = (features['game_seconds_remaining'] <= 300).astype(int)

        return features

    def create_opponent_adjusted_features(self) -> pd.DataFrame:
        """Create opponent-adjusted features."""
        # Calculate opponent averages
        opponent_stats = self.plays.groupby('opponent').agg({
            'epa': 'mean',
            'success': 'mean'
        }).rename(columns={'epa': 'opp_epa_allowed', 'success': 'opp_success_allowed'})

        # Merge back
        features = self.plays.merge(opponent_stats, left_on='opponent', right_index=True)

        # Calculate adjusted metrics
        features['epa_vs_expected'] = features['epa'] - features['opp_epa_allowed']

        return features

Step 4: Model Selection

Choose appropriate algorithms based on the problem:

from typing import Dict, List

from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPClassifier, MLPRegressor


class ModelSelector:
    """
    Guide model selection based on problem characteristics.
    """

    CLASSIFICATION_MODELS = {
        'logistic': {
            'model': LogisticRegression,
            'params': {'max_iter': 1000, 'random_state': 42},
            'interpretable': True,
            'handles_multiclass': True,
            'requires_scaling': True
        },
        'random_forest': {
            'model': RandomForestClassifier,
            'params': {'n_estimators': 100, 'random_state': 42},
            'interpretable': False,
            'handles_multiclass': True,
            'requires_scaling': False
        },
        'gradient_boost': {
            'model': GradientBoostingClassifier,
            'params': {'n_estimators': 100, 'random_state': 42},
            'interpretable': False,
            'handles_multiclass': True,
            'requires_scaling': False
        },
        'neural_net': {
            'model': MLPClassifier,
            'params': {'hidden_layer_sizes': (100, 50), 'max_iter': 500, 'random_state': 42},
            'interpretable': False,
            'handles_multiclass': True,
            'requires_scaling': True
        }
    }

    REGRESSION_MODELS = {
        'ridge': {
            'model': Ridge,
            'params': {'alpha': 1.0, 'random_state': 42},
            'interpretable': True,
            'requires_scaling': True
        },
        'lasso': {
            'model': Lasso,
            'params': {'alpha': 0.1, 'random_state': 42},
            'interpretable': True,
            'requires_scaling': True
        },
        'random_forest': {
            'model': RandomForestRegressor,
            'params': {'n_estimators': 100, 'random_state': 42},
            'interpretable': False,
            'requires_scaling': False
        },
        'gradient_boost': {
            'model': GradientBoostingRegressor,
            'params': {'n_estimators': 100, 'random_state': 42},
            'interpretable': False,
            'requires_scaling': False
        }
    }

    @classmethod
    def recommend(cls, problem_type: str, requirements: Dict) -> List[str]:
        """
        Recommend models based on problem requirements.

        Parameters:
        -----------
        problem_type : str
            'classification' or 'regression'
        requirements : dict
            Keys: 'interpretability', 'multiclass', 'large_data', 'realtime'

        Returns:
        --------
        List[str] : Recommended model names
        """
        models = cls.CLASSIFICATION_MODELS if problem_type == 'classification' else cls.REGRESSION_MODELS

        recommendations = []

        for name, config in models.items():
            score = 0

            if requirements.get('interpretability') and config.get('interpretable'):
                score += 2
            if requirements.get('multiclass') and config.get('handles_multiclass', True):
                score += 1
            if requirements.get('large_data') and name in ['random_forest', 'gradient_boost']:
                score += 1
            if requirements.get('realtime') and config.get('interpretable'):
                score += 1  # Simpler models are faster

            recommendations.append((name, score))

        # Sort by score
        recommendations.sort(key=lambda x: x[1], reverse=True)

        return [name for name, score in recommendations]
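
As a quick usage sketch of the selector above: a staff that wants an interpretable model fast enough for per-play use could query it like this.

# Rank candidate models for an interpretable, real-time classification problem
ranked = ModelSelector.recommend(
    problem_type='classification',
    requirements={'interpretability': True, 'realtime': True}
)
# Logistic regression ranks first because it scores on both requirements
print(ranked)  # e.g. ['logistic', 'random_forest', 'gradient_boost', 'neural_net']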

Step 5: Model Training and Validation

Train models with proper validation:

from typing import Dict

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, TimeSeriesSplit, StratifiedKFold
from sklearn.preprocessing import StandardScaler


class FootballModelTrainer:
    """
    Train and validate football prediction models.
    """

    def __init__(self, model, scaling: bool = True):
        self.model = model
        self.scaling = scaling
        self.scaler = StandardScaler() if scaling else None
        self.is_fitted = False

    def train_with_cv(self, X: pd.DataFrame, y: pd.Series,
                     cv_strategy: str = 'kfold',
                     n_splits: int = 5) -> Dict:
        """
        Train model with cross-validation.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        y : pd.Series
            Target variable
        cv_strategy : str
            'kfold', 'stratified', or 'timeseries'
        n_splits : int
            Number of CV folds

        Returns:
        --------
        dict : Cross-validation results
        """
        # Select CV strategy
        if cv_strategy == 'kfold':
            cv = n_splits
        elif cv_strategy == 'stratified':
            cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
        elif cv_strategy == 'timeseries':
            cv = TimeSeriesSplit(n_splits=n_splits)
        else:
            raise ValueError(f"Unknown CV strategy: {cv_strategy}")

        # Scale features if needed.
        # Note: fitting the scaler on the full dataset before CV leaks fold
        # information; for a stricter setup, wrap the scaler and model in a
        # sklearn Pipeline so scaling is refit within each fold.
        if self.scaling:
            X_scaled = self.scaler.fit_transform(X)
        else:
            X_scaled = X.values

        # Cross-validate (accuracy scoring assumes a classification model)
        scores = cross_val_score(self.model, X_scaled, y, cv=cv, scoring='accuracy')

        # Fit final model on all data
        self.model.fit(X_scaled, y)
        self.is_fitted = True

        return {
            'cv_scores': scores,
            'mean_cv_score': scores.mean(),
            'std_cv_score': scores.std(),
            'model': self.model
        }

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Make predictions on new data."""
        if not self.is_fitted:
            raise ValueError("Model not fitted. Call train_with_cv first.")

        if self.scaling:
            X_scaled = self.scaler.transform(X)
        else:
            X_scaled = X.values

        return self.model.predict(X_scaled)

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Get probability predictions."""
        if not self.is_fitted:
            raise ValueError("Model not fitted. Call train_with_cv first.")

        if not hasattr(self.model, 'predict_proba'):
            raise ValueError("Model doesn't support probability predictions")

        if self.scaling:
            X_scaled = self.scaler.transform(X)
        else:
            X_scaled = X.values

        return self.model.predict_proba(X_scaled)

Step 6: Evaluation

Assess model performance with appropriate metrics:

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)
import matplotlib.pyplot as plt
from typing import Dict


class ModelEvaluator:
    """
    Comprehensive model evaluation for football predictions.
    """

    @staticmethod
    def evaluate_classifier(y_true: np.ndarray, y_pred: np.ndarray,
                           y_prob: np.ndarray = None) -> Dict:
        """
        Evaluate classification model.

        Parameters:
        -----------
        y_true : np.ndarray
            True labels
        y_pred : np.ndarray
            Predicted labels
        y_prob : np.ndarray, optional
            Predicted probabilities

        Returns:
        --------
        dict : Evaluation metrics
        """
        metrics = {
            'accuracy': accuracy_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred, average='weighted'),
            'recall': recall_score(y_true, y_pred, average='weighted'),
            'f1': f1_score(y_true, y_pred, average='weighted'),
            'confusion_matrix': confusion_matrix(y_true, y_pred)
        }

        if y_prob is not None:
            try:
                metrics['auc_roc'] = roc_auc_score(y_true, y_prob)
            except ValueError:
                metrics['auc_roc'] = None

        return metrics

    @staticmethod
    def evaluate_regressor(y_true: np.ndarray, y_pred: np.ndarray) -> Dict:
        """
        Evaluate regression model.

        Parameters:
        -----------
        y_true : np.ndarray
            True values
        y_pred : np.ndarray
            Predicted values

        Returns:
        --------
        dict : Evaluation metrics
        """
        return {
            'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
            'mae': mean_absolute_error(y_true, y_pred),
            'r2': r2_score(y_true, y_pred),
            'mean_error': np.mean(y_pred - y_true),
            'std_error': np.std(y_pred - y_true)
        }

    @staticmethod
    def evaluate_calibration(y_true: np.ndarray, y_prob: np.ndarray,
                            n_bins: int = 10) -> Dict:
        """
        Evaluate probability calibration.

        Critical for win probability and conversion models.
        """
        from sklearn.calibration import calibration_curve

        # Calculate calibration curve
        prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)

        # Brier score (lower is better)
        brier = np.mean((y_prob - y_true) ** 2)

        # Expected Calibration Error
        bin_indices = np.digitize(y_prob, np.linspace(0, 1, n_bins + 1)[1:-1])
        ece = 0
        for i in range(n_bins):
            mask = bin_indices == i
            if mask.sum() > 0:
                bin_accuracy = y_true[mask].mean()
                bin_confidence = y_prob[mask].mean()
                bin_weight = mask.sum() / len(y_prob)
                ece += bin_weight * abs(bin_accuracy - bin_confidence)

        return {
            'brier_score': brier,
            'ece': ece,
            'calibration_curve': (prob_true, prob_pred)
        }

    @staticmethod
    def plot_calibration(y_true: np.ndarray, y_prob: np.ndarray,
                        title: str = 'Calibration Plot') -> plt.Figure:
        """Create calibration plot."""
        from sklearn.calibration import calibration_curve

        prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

        fig, ax = plt.subplots(figsize=(8, 8))

        # Perfect calibration line
        ax.plot([0, 1], [0, 1], 'k--', label='Perfectly Calibrated')

        # Model calibration
        ax.plot(prob_pred, prob_true, 's-', label='Model')

        ax.set_xlabel('Mean Predicted Probability')
        ax.set_ylabel('Fraction of Positives')
        ax.set_title(title)
        ax.legend(loc='lower right')
        ax.grid(True, alpha=0.3)

        return fig

Common Pitfalls in Sports Prediction

1. Data Leakage

Using information that wouldn't be available at prediction time:

# WRONG: Using final game stats to predict game outcome
def predict_winner_wrong(game_data):
    # This includes final scores - data leakage!
    features = ['home_score', 'away_score', 'home_yards', 'away_yards']
    return model.predict(game_data[features])


# CORRECT: Using only pre-game information
def predict_winner_correct(game_data):
    # Only use information available before the game
    features = ['home_rating', 'away_rating', 'home_rest_days',
               'away_rest_days', 'is_rivalry', 'home_injuries']
    return model.predict(game_data[features])

2. Temporal Leakage

Training on future data to predict past events:

# WRONG: Random train/test split with time series data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 2023 games might be in training, 2021 games in test!


# CORRECT: Time-based split
def temporal_split(data, test_year=2023):
    train = data[data['season'] < test_year]
    test = data[data['season'] >= test_year]
    return train, test

3. Overfitting to Small Samples

Football has limited games per team per season:

def assess_sample_adequacy(n_games: int, n_features: int) -> Dict:
    """
    Assess if sample size is adequate for modeling.

    Rule of thumb: Need 10-20 observations per feature.
    """
    min_ratio = n_games / n_features

    return {
        'n_games': n_games,
        'n_features': n_features,
        'ratio': min_ratio,
        'adequate': min_ratio >= 10,
        'recommendation': 'Reduce features' if min_ratio < 10 else 'Sample adequate'
    }


# Example: FBS team with 12 games
assessment = assess_sample_adequacy(n_games=12, n_features=20)
# ratio = 0.6 - NOT adequate! Need to reduce features.

4. Ignoring Base Rates

Forgetting how often events naturally occur:

def baseline_comparison(y_true: np.ndarray, y_pred: np.ndarray) -> Dict:
    """
    Compare model to baseline predictions.

    Always compare to:
    - Random guessing
    - Predicting the majority class
    - Predicting the mean value
    """
    n_samples = len(y_true)

    # For classification
    majority_class = pd.Series(y_true).mode()[0]
    majority_baseline = (y_true == majority_class).mean()

    # Model accuracy
    model_accuracy = (y_true == y_pred).mean()

    # Improvement over baseline
    improvement = model_accuracy - majority_baseline

    return {
        'majority_baseline': majority_baseline,
        'model_accuracy': model_accuracy,
        'improvement': improvement,
        'relative_improvement': improvement / (1 - majority_baseline) if majority_baseline < 1 else 0
    }
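
The function above handles the classification case; for regression targets, the analogous baseline is predicting the mean. A minimal sketch of that comparison using MAE:

def regression_baseline_comparison(y_true: np.ndarray, y_pred: np.ndarray) -> Dict:
    """Compare a regression model's MAE to a predict-the-mean baseline."""
    # Baseline: always predict the average outcome
    # (in practice, compute the mean on training data to avoid peeking at the test set)
    mean_prediction = np.full_like(y_true, y_true.mean(), dtype=float)
    baseline_mae = np.mean(np.abs(y_true - mean_prediction))

    # Model error
    model_mae = np.mean(np.abs(y_true - y_pred))

    return {
        'baseline_mae': baseline_mae,
        'model_mae': model_mae,
        # Positive skill means the model beats the naive baseline
        'skill': 1 - (model_mae / baseline_mae) if baseline_mae > 0 else 0
    }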

5. Selection Bias

Models trained only on games that happened:

def identify_selection_bias(data: pd.DataFrame) -> Dict:
    """
    Identify potential selection bias in the data.

    Examples in football:
    - Only looking at games with available tracking data
    - Only considering players who stayed healthy
    - Only analyzing teams that made the playoffs
    """
    bias_indicators = {
        'missing_data_games': data['tracking_available'].sum() / len(data),
        'injury_filtered': data['injury_game'].sum() / len(data),
        'playoff_only': data['playoff_game'].sum() / len(data) if 'playoff_game' in data else None
    }

    return bias_indicators

Building Your First Prediction Model

Let's build a complete prediction model for game outcomes:

"""
Complete Game Outcome Prediction Pipeline
"""

import pandas as pd
import numpy as np
from typing import Dict
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score


class GameOutcomePredictor:
    """
    Predict game outcomes using team statistics.
    """

    def __init__(self):
        self.model = None
        self.scaler = StandardScaler()
        self.feature_columns = None
        self.training_history = []

    def prepare_features(self, team_stats: pd.DataFrame,
                        schedule: pd.DataFrame) -> pd.DataFrame:
        """
        Prepare features for game prediction.

        Parameters:
        -----------
        team_stats : pd.DataFrame
            Team-level statistics (season aggregates)
        schedule : pd.DataFrame
            Game schedule with home/away teams

        Returns:
        --------
        pd.DataFrame : Feature matrix for each game
        """
        # Merge home team stats
        games = schedule.merge(
            team_stats.add_prefix('home_'),
            left_on='home_team',
            right_on='home_team_name'
        )

        # Merge away team stats
        games = games.merge(
            team_stats.add_prefix('away_'),
            left_on='away_team',
            right_on='away_team_name'
        )

        # Create differential features
        stat_cols = ['offensive_epa', 'defensive_epa', 'turnover_margin',
                    'third_down_pct', 'red_zone_pct']

        for col in stat_cols:
            games[f'{col}_diff'] = games[f'home_{col}'] - games[f'away_{col}']

        # Add home field advantage indicator
        games['home_advantage'] = 1

        self.feature_columns = [f'{col}_diff' for col in stat_cols] + ['home_advantage']

        return games

    def train(self, X: pd.DataFrame, y: pd.Series,
             model_type: str = 'logistic') -> Dict:
        """
        Train the prediction model.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        y : pd.Series
            Target (home team win = 1)
        model_type : str
            'logistic' or 'random_forest'

        Returns:
        --------
        dict : Training results
        """
        # Scale features
        X_scaled = self.scaler.fit_transform(X[self.feature_columns])

        # Select model
        if model_type == 'logistic':
            self.model = LogisticRegression(max_iter=1000, random_state=42)
        else:
            self.model = RandomForestClassifier(n_estimators=100, random_state=42)

        # Cross-validate
        cv_scores = cross_val_score(self.model, X_scaled, y, cv=5, scoring='accuracy')

        # Fit on all data
        self.model.fit(X_scaled, y)

        results = {
            'cv_accuracy': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'model_type': model_type
        }

        self.training_history.append(results)

        return results

    def predict_game(self, home_stats: Dict, away_stats: Dict) -> Dict:
        """
        Predict a single game outcome.

        Parameters:
        -----------
        home_stats : dict
            Home team statistics
        away_stats : dict
            Away team statistics

        Returns:
        --------
        dict : Prediction with probability
        """
        # Calculate differentials
        features = {}
        for col in ['offensive_epa', 'defensive_epa', 'turnover_margin',
                   'third_down_pct', 'red_zone_pct']:
            features[f'{col}_diff'] = home_stats.get(col, 0) - away_stats.get(col, 0)
        features['home_advantage'] = 1

        # Create feature vector
        X = pd.DataFrame([features])[self.feature_columns]
        X_scaled = self.scaler.transform(X)

        # Predict
        pred = self.model.predict(X_scaled)[0]
        prob = self.model.predict_proba(X_scaled)[0]

        return {
            'home_win': bool(pred),
            'home_win_probability': prob[1],
            'away_win_probability': prob[0]
        }

    def evaluate(self, X: pd.DataFrame, y: pd.Series) -> Dict:
        """Evaluate model on test data."""
        X_scaled = self.scaler.transform(X[self.feature_columns])

        y_pred = self.model.predict(X_scaled)
        y_prob = self.model.predict_proba(X_scaled)[:, 1]

        return {
            'accuracy': accuracy_score(y, y_pred),
            'auc_roc': roc_auc_score(y, y_prob),
            'n_correct': (y == y_pred).sum(),
            'n_total': len(y)
        }
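
A minimal end-to-end usage sketch of the class above, assuming hypothetical team_stats and schedule DataFrames with the columns referenced in prepare_features, plus season and home_win columns for splitting and evaluation:

# Hypothetical workflow -- DataFrame and column names are placeholders
predictor = GameOutcomePredictor()

games = predictor.prepare_features(team_stats, schedule)   # build differential features

# Temporal split (see the temporal leakage pitfall above)
train_games = games[games['season'] < 2023]
test_games = games[games['season'] >= 2023]

results = predictor.train(train_games, train_games['home_win'], model_type='logistic')
print(f"CV accuracy: {results['cv_accuracy']:.3f}")

holdout = predictor.evaluate(test_games, test_games['home_win'])
print(f"Holdout accuracy: {holdout['accuracy']:.3f}, AUC: {holdout['auc_roc']:.3f}")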

Model Deployment Considerations

Real-Time vs. Batch Predictions

import time
from typing import Dict

import pandas as pd


class PredictionDeployment:
    """
    Considerations for deploying football predictions.
    """

    @staticmethod
    def batch_prediction(model, games: pd.DataFrame) -> pd.DataFrame:
        """
        Batch prediction for pre-game analysis.

        Use case: Generate predictions for all upcoming games
        Timing: Daily or weekly
        """
        predictions = []

        for _, game in games.iterrows():
            pred = model.predict_game(
                home_stats=game['home_stats'],
                away_stats=game['away_stats']
            )
            predictions.append({
                'game_id': game['game_id'],
                **pred
            })

        return pd.DataFrame(predictions)

    @staticmethod
    def real_time_prediction(model, situation: Dict,
                            latency_threshold_ms: float = 100) -> Dict:
        """
        Real-time prediction for in-game decisions.

        Use case: Win probability updates during games
        Timing: Every play
        """
        start_time = time.time()

        prediction = model.predict(situation)

        latency_ms = (time.time() - start_time) * 1000

        if latency_ms > latency_threshold_ms:
            print(f"Warning: Prediction latency {latency_ms:.1f}ms exceeds threshold")

        return {
            **prediction,
            'latency_ms': latency_ms
        }

Model Monitoring

class ModelMonitor:
    """
    Monitor model performance over time.
    """

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.predictions = []
        self.outcomes = []

    def log_prediction(self, prediction: Dict, game_id: str):
        """Log a prediction for later evaluation."""
        self.predictions.append({
            'game_id': game_id,
            'timestamp': time.time(),
            **prediction
        })

    def log_outcome(self, game_id: str, actual_outcome: int):
        """Log actual outcome once known."""
        self.outcomes.append({
            'game_id': game_id,
            'actual': actual_outcome
        })

    def calculate_drift(self, window_size: int = 50) -> Dict:
        """
        Calculate model performance drift.

        If recent performance differs significantly from training,
        model may need retraining.
        """
        if len(self.predictions) < window_size:
            return {'status': 'insufficient_data'}

        # Get recent predictions with outcomes
        recent = pd.DataFrame(self.predictions[-window_size:])
        outcomes = pd.DataFrame(self.outcomes)

        merged = recent.merge(outcomes, on='game_id')

        # Assumes each logged prediction dict includes a 'prediction' key
        recent_accuracy = (merged['prediction'] == merged['actual']).mean()

        return {
            'recent_accuracy': recent_accuracy,
            'window_size': window_size,
            'needs_retraining': recent_accuracy < 0.55  # Below baseline
        }

Chapter Summary

This chapter introduced the fundamental concepts of predictive analytics for football:

Key Concepts:

  1. Prediction problems can be classification (categorical outcomes), regression (continuous outcomes), or probability estimation
  2. The ML workflow includes: problem definition, data collection, feature engineering, model selection, training, and evaluation
  3. Proper validation requires temporal splits and cross-validation
  4. Common pitfalls include data leakage, overfitting, and ignoring base rates

Technical Skills:

  • Implemented classification and regression models using scikit-learn
  • Created football-specific feature engineering pipelines
  • Evaluated models with appropriate metrics
  • Built a complete game outcome prediction system

Looking Ahead: The next chapters will dive deeper into specific prediction problems:

  • Chapter 18: Game outcome prediction with advanced models
  • Chapter 19: Player performance forecasting
  • Chapter 20: Recruiting analytics and prospect evaluation
  • Chapter 21: Win probability models for in-game decisions
  • Chapter 22: Deep learning and advanced ML applications

Key Terminology

| Term | Definition |
|------|------------|
| Classification | Predicting categorical outcomes |
| Regression | Predicting continuous outcomes |
| Feature Engineering | Creating predictive variables from raw data |
| Cross-Validation | Evaluating models on held-out data |
| Data Leakage | Using information not available at prediction time |
| Calibration | How well predicted probabilities match actual frequencies |
| Baseline | Simple prediction method for comparison |
| Overfitting | Model learns noise instead of signal |