Chapter 17: Key Takeaways - Introduction to Predictive Analytics

Quick Reference Summary

This chapter established the foundation for predictive modeling in football analytics.


Core Concepts

Types of Prediction Problems

Type           | Output            | Football Examples
Classification | Categories        | Win/Loss, Coverage type, Play call
Regression     | Continuous values | Points scored, Yards gained, Draft position
Probability    | Likelihoods       | Win probability, Conversion chance, FG success

The ML Workflow

  1. Problem Definition → Define what you're predicting and why
  2. Data Collection → Gather relevant historical data
  3. Feature Engineering → Transform data into predictive features
  4. Model Selection → Choose appropriate algorithms
  5. Training & Validation → Fit and evaluate with proper splits
  6. Deployment → Put model into production use
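
As a compressed sketch of how steps 2-5 translate to code (the games.csv file and the season, rating, and home_win columns are placeholders, not data from the chapter):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Step 2: Data collection - one row per historical game
games = pd.read_csv('games.csv')

# Step 3: Feature engineering - pre-game information only
features = ['home_rating', 'away_rating', 'rest_days']

# Step 4: Model selection - start with an interpretable baseline
model = LogisticRegression(max_iter=1000)

# Step 5: Training & validation - train on past seasons, test on the most recent
train = games[games['season'] < 2023]
test = games[games['season'] >= 2023]
model.fit(train[features], train['home_win'])
accuracy = (model.predict(test[features]) == test['home_win']).mean()
print(f"Held-out accuracy: {accuracy:.3f}")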

Essential Formulas

Prediction Equation

ŷ = f(X) + ε

Where:
- ŷ = predicted outcome
- X = input features
- f = learned function
- ε = irreducible error

Key Metrics

Classification:

Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
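
These can all be computed with scikit-learn's metric functions; a minimal sketch, assuming y_test holds the true labels and y_pred the predicted classes (as in the classifier pattern below):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1:        {f1_score(y_test, y_pred):.3f}")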

Regression:

RMSE = √[Σ(y - ŷ)² / n]
MAE = Σ|y - ŷ| / n
R² = 1 - (SS_res / SS_tot)
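
The regression metrics follow the same pattern; a sketch assuming y_test and y_pred are the observed and predicted continuous values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE = square root of MSE
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  R²: {r2:.3f}")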

Calibration:

Brier Score = Σ(p - y)² / n
ECE = Σ (n_bin/n) × |accuracy_bin - confidence_bin|
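
The Brier score is available as scikit-learn's brier_score_loss; ECE is not built in, but it is a short loop over probability bins. A sketch, assuming y_true holds binary outcomes and y_prob the predicted probabilities:

import numpy as np
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average |observed rate - mean predicted probability| across bins."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)  # assign each prediction to a bin
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            accuracy_bin = y_true[mask].mean()    # observed outcome rate in the bin
            confidence_bin = y_prob[mask].mean()  # average predicted probability in the bin
            ece += mask.mean() * abs(accuracy_bin - confidence_bin)  # mask.mean() = n_bin / n
    return ece

print(f"Brier Score: {brier_score_loss(y_true, y_prob):.4f}")
print(f"ECE:         {expected_calibration_error(y_true, y_prob):.4f}")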

Code Patterns

Train-Test Split (Temporal)

def temporal_split(data, test_year):
    """Train on seasons before test_year; test on test_year and later."""
    train = data[data['season'] < test_year]
    test = data[data['season'] >= test_year]
    return train, test

Basic Classifier Training

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit only on train
X_test_scaled = scaler.transform(X_test)        # Transform test

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

Cross-Validation

from sklearn.model_selection import cross_val_score, TimeSeriesSplit

# For time series data
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)

print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
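
One caveat: if features are scaled before cross-validation, the scaler sees the validation folds, which is itself a mild form of leakage. A sketch of avoiding that with a pipeline (an alternative to the bare-model call above, using the same X and y):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is refit inside each training fold, so the validation fold
# never influences the preprocessing step
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(pipeline, X, y, cv=tscv)
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")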

Probability Calibration

from sklearn.calibration import CalibratedClassifierCV

base_model = LogisticRegression()
calibrated = CalibratedClassifierCV(base_model, cv=5)
calibrated.fit(X_train, y_train)

probabilities = calibrated.predict_proba(X_test)[:, 1]
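
To confirm that calibration actually helped, compare Brier scores for the raw and calibrated probabilities on the test set (a sketch; base_model is fit separately here just for the comparison):

from sklearn.metrics import brier_score_loss

base_model.fit(X_train, y_train)
raw_probs = base_model.predict_proba(X_test)[:, 1]

# Lower is better; the calibrated model should be at least as good
print(f"Uncalibrated Brier: {brier_score_loss(y_test, raw_probs):.4f}")
print(f"Calibrated Brier:   {brier_score_loss(y_test, probabilities):.4f}")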

Common Pitfalls to Avoid

1. Data Leakage

Wrong: Using game outcome (score) to predict game outcome

# BAD - includes information from the future
features = ['home_score', 'away_score', 'winner']

Right: Use only pre-game information

# GOOD - only information available before game
features = ['home_rating', 'away_rating', 'rest_days']

2. Temporal Leakage

Wrong: Random train/test split for time series

# BAD - 2023 games can land in training while 2021 games end up in the test set
X_train, X_test = train_test_split(X, test_size=0.2)

Right: Split by time

# GOOD - always train on past, test on future
train = data[data['season'] < 2023]
test = data[data['season'] >= 2023]

3. Ignoring Base Rates

Wrong: Claiming 60% accuracy without context

# BAD - no baseline comparison
print(f"Model accuracy: {accuracy:.1%}")

Right: Compare to meaningful baseline

# GOOD - compare to baseline
baseline = y_test.mean()  # Home-team win rate = accuracy of always picking the home team
print(f"Baseline: {baseline:.1%}, Model: {accuracy:.1%}")
print(f"Improvement: {accuracy - baseline:.1%}")

4. Sample Size Issues

# Rule of thumb: 10-20 observations per feature
n_games = 12  # Single team season
n_features = 20

if n_games / n_features < 10:
    print("Warning: Risk of overfitting!")
    print("Reduce features or get more data")

Model Selection Guide

Requirement                          | Recommended Model
Interpretability needed              | Logistic Regression
Many features, unknown relationships | Random Forest
Probability calibration critical     | Calibrated Classifier
Large dataset                        | Gradient Boosting
Real-time predictions                | Simple models (LR, trees)

Evaluation Checklist

Before Training

  • [ ] Defined clear prediction target
  • [ ] Verified no data leakage
  • [ ] Temporal split implemented
  • [ ] Features scaled if needed
  • [ ] Baseline established

During Training

  • [ ] Cross-validation used
  • [ ] Learning curves checked
  • [ ] Train/test gap monitored
  • [ ] Multiple metrics tracked

After Training

  • [ ] Calibration assessed
  • [ ] Feature importance analyzed
  • [ ] Error patterns examined
  • [ ] Baseline comparison made

Quick Reference Tables

Scikit-Learn Models for Football

Task          | Model                  | Key Parameters
Game outcome  | LogisticRegression     | C=1.0, max_iter=1000
Points scored | Ridge                  | alpha=1.0
Multi-class   | RandomForestClassifier | n_estimators=100
Probability   | CalibratedClassifierCV | cv=5

Evaluation Metrics by Problem

Problem Type     | Primary Metric | Secondary Metrics
Win prediction   | Accuracy       | AUC-ROC, Brier Score
Score prediction | RMSE           | MAE, R²
Probability      | Brier Score    | ECE, Calibration curve

Common Feature Types

Category          | Examples
Efficiency        | EPA/play, Success rate, Yards/play
Situational       | Down, Distance, Field position
Game state        | Score differential, Time remaining
Historical        | Win %, Rolling averages, Rankings
Opponent-adjusted | EPA vs expected, SOS-adjusted

Red Flags

Warning signs your model may have issues:

  1. Train accuracy >> Test accuracy → Overfitting
  2. 100% training accuracy → Severe overfitting
  3. Model worse than baseline → Feature or data issues
  4. Calibration curve far from diagonal → Probability issues
  5. Accuracy drops over time → Model drift
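
A quick check for the first two red flags is to compare train and test accuracy directly (a sketch reusing the names from the classifier pattern above):

from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
test_acc = accuracy_score(y_test, model.predict(X_test_scaled))

# A large gap between the two is the classic overfitting signature
print(f"Train: {train_acc:.3f}  Test: {test_acc:.3f}  Gap: {train_acc - test_acc:.3f}")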

Next Steps

After mastering these fundamentals, you're ready for:

  • Chapter 18: Advanced game outcome prediction
  • Chapter 19: Player performance forecasting
  • Chapter 20: Recruiting analytics
  • Chapter 21: Win probability models
  • Chapter 22: Deep learning applications