Chapter 22: Key Takeaways - Machine Learning Applications

Quick Reference Summary

This chapter covered applying machine learning techniques to college football analytics, spanning classification, regression, clustering, and sequence modeling.


Core Concepts

ML Task Types in Football

Task           | Description               | Example
---------------|---------------------------|----------------------------
Classification | Predict discrete outcomes | Win/loss, drafted/undrafted
Regression     | Predict continuous values | Point spread, draft pick
Clustering     | Group similar items       | Player archetypes
Sequence       | Model sequential data     | Play prediction

Key Algorithms

Algorithm           | Strengths             | Use Case
--------------------|-----------------------|---------------------
Logistic Regression | Interpretable, fast   | Baseline predictions
Random Forest       | Handles non-linearity | Feature importance
XGBoost             | High accuracy         | Game prediction
K-Means             | Unsupervised grouping | Player archetypes
Neural Networks     | Complex patterns      | Sequence modeling

Essential Formulas

Classification Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

AUC = P(score(positive) > score(negative))
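These four metrics map directly onto scikit-learn calls. A minimal sketch on toy data (the six hand-picked games and probabilities are illustrative only):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Toy predictions: 6 games, probabilities that the home team wins
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.7])
y_pred = (y_prob > 0.5).astype(int)

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"AUC:       {roc_auc_score(y_true, y_prob):.3f}")  # → 0.667
```

Note that AUC uses the raw probabilities, while the other three use the thresholded predictions.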

Brier Score

Brier Score = (1/n) * Σ(predicted_prob - actual_outcome)²

Range: 0 (perfect) to 1 (worst); a constant 0.5 forecast scores 0.25
Good value: < 0.20 for game prediction

Expected Calibration Error

ECE = Σ (n_bin / n_total) × |predicted_avg - actual_avg|

Good value: < 0.05
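The chapter gives no code for ECE, so here is a minimal sketch of the formula above (the function name `expected_calibration_error` and the 10-bin default are our own choices):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted average of |mean predicted prob - observed rate|,
    computed over equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bins - 1:
            mask = (y_prob >= lo) & (y_prob <= hi)  # include prob == 1.0
        else:
            mask = (y_prob >= lo) & (y_prob < hi)
        if not mask.any():
            continue
        # mask.mean() is n_bin / n_total from the formula above
        ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece
```

A perfectly calibrated forecaster (e.g. predicting 0.5 on a 50/50 sample) scores 0; predicting 0.9 when the event never happens scores 0.9.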

Silhouette Score (Clustering)

s(i) = (b(i) - a(i)) / max(a(i), b(i))

a(i) = avg distance to same cluster
b(i) = avg distance to nearest other cluster
Range: -1 to 1 (higher is better)
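To see the formula in action, a tiny sketch on two well-separated 1-D clusters, checking one hand-computed s(i) against scikit-learn's mean score (the example points are made up):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated 1-D clusters
X = np.array([[0.0], [0.2], [5.0], [5.2]])
labels = np.array([0, 0, 1, 1])

# s(i) for the first point, straight from the formula:
a = 0.2                   # avg distance to its own cluster (one other point)
b = (5.0 + 5.2) / 2       # avg distance to the nearest other cluster
s0 = (b - a) / max(a, b)

print(f"s(0) by hand:   {s0:.3f}")
print(f"sklearn (mean): {silhouette_score(X, labels):.3f}")
```

Tight, well-separated clusters like these score close to 1; overlapping clusters drift toward 0 or below.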

Code Patterns

Basic Classification Pipeline

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, brier_score_loss

def train_game_predictor(games: pd.DataFrame, features: list):
    """Train basic game outcome predictor."""
    X = games[features].values
    y = games['home_win'].values

    # Temporal split (assumes games are sorted chronologically)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False
    )

    # Scale
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Train
    model = GradientBoostingClassifier(n_estimators=100, max_depth=4)
    model.fit(X_train, y_train)

    # Evaluate
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"Accuracy: {accuracy_score(y_test, (y_prob > 0.5)):.3f}")
    print(f"Brier: {brier_score_loss(y_test, y_prob):.3f}")

    return model, scaler

Ensemble Construction

import numpy as np
from sklearn.metrics import brier_score_loss

def build_weighted_ensemble(models: dict, X_val, y_val):
    """Build weighted ensemble from multiple models."""
    weights = {}

    for name, model in models.items():
        y_prob = model.predict_proba(X_val)[:, 1]
        brier = brier_score_loss(y_val, y_prob)
        weights[name] = 1 / (brier + 0.01)

    total = sum(weights.values())
    weights = {k: v/total for k, v in weights.items()}

    return weights


def ensemble_predict(models: dict, weights: dict, X):
    """Get weighted ensemble predictions."""
    pred = np.zeros(len(X))
    for name, model in models.items():
        pred += weights[name] * model.predict_proba(X)[:, 1]
    return pred
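For context, an end-to-end sketch of the same inverse-Brier weighting, inlined and run on synthetic data (the two model choices and the `make_classification` setup are illustrative, not from the chapter):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)

models = {
    'logit': LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    'rf': RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr),
}

# Inverse-Brier weights, normalized to sum to 1 (as in build_weighted_ensemble)
weights = {}
for name, model in models.items():
    brier = brier_score_loss(y_val, model.predict_proba(X_val)[:, 1])
    weights[name] = 1 / (brier + 0.01)
total = sum(weights.values())
weights = {k: v / total for k, v in weights.items()}

# Blended probability (as in ensemble_predict)
blend = sum(w * models[k].predict_proba(X_val)[:, 1] for k, w in weights.items())
print(f"Ensemble Brier: {brier_score_loss(y_val, blend):.3f}")
```

Better-calibrated base models automatically receive larger weights; the 0.01 offset keeps a near-perfect model from dominating the blend entirely.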

Clustering for Archetypes

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def discover_archetypes(players: pd.DataFrame,
                        features: list,
                        max_k: int = 8):
    """Discover player archetypes using clustering."""
    X = players[features].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Find optimal k
    best_k, best_score = 2, -1
    for k in range(2, max_k + 1):
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        if score > best_score:
            best_k, best_score = k, score

    # Final clustering
    final_kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
    players['archetype'] = final_kmeans.fit_predict(X_scaled)

    return players, final_kmeans

XGBoost with Early Stopping

import xgboost as xgb

def train_xgboost_optimal(X_train, y_train, X_val, y_val):
    """Train XGBoost with early stopping."""
    model = xgb.XGBClassifier(
        n_estimators=1000,  # High; early stopping will limit
        max_depth=4,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=50,  # constructor argument in xgboost >= 2.0
        random_state=42
    )

    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=False
    )

    print(f"Best iteration: {model.best_iteration}")
    return model

Model Comparison

Classification Algorithms

Model             | Typical Accuracy | AUC  | Training Speed
------------------|------------------|------|---------------
Logistic          | 68-70%           | 0.75 | Very fast
Random Forest     | 70-72%           | 0.78 | Fast
Gradient Boosting | 72-74%           | 0.81 | Medium
XGBoost           | 73-75%           | 0.82 | Medium
Neural Network    | 72-75%           | 0.81 | Slow
Ensemble          | 74-76%           | 0.83 | Varies

Clustering Algorithms

Algorithm    | Strengths                 | Weaknesses
-------------|---------------------------|---------------------------------------
K-Means      | Fast, scalable            | Requires k; assumes spherical clusters
Hierarchical | Dendrogram visualization  | Slow for large data
DBSCAN       | Finds outliers, any shape | Sensitive to parameters
GMM          | Soft assignments          | Assumes Gaussian components

Validation Strategies

Football-Specific Splits

# CORRECT: Temporal split
train = games[games['season'] < 2023]
test = games[games['season'] == 2023]

# WRONG: Random split (data leakage)
train, test = train_test_split(games, random_state=42)

Cross-Validation Options

Method       | Use Case         | Code
-------------|------------------|-----------------
Temporal     | Time series      | Manual by season
Group K-Fold | Leave teams out  | GroupKFold
Stratified   | Balanced classes | StratifiedKFold
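To make "leave teams out" concrete, a small sketch with GroupKFold on a hypothetical games frame (the team names and columns are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical games frame: each row is a game, grouped by home team
games = pd.DataFrame({
    'home_team': ['UGA', 'UGA', 'BAMA', 'BAMA', 'OSU', 'OSU'],
    'spread': [-7.5, -3.0, -10.0, -6.5, -14.0, -2.5],
    'home_win': [1, 1, 1, 0, 1, 0],
})

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(
        gkf.split(games, groups=games['home_team'])):
    held_out = games.loc[test_idx, 'home_team'].unique()
    print(f"Fold {fold}: held-out team(s): {list(held_out)}")
```

Every game from a given team lands entirely in train or entirely in test, so the model is always evaluated on teams it never saw.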

Common Pitfalls

1. Data Leakage

Wrong:

# Using full-season stats to predict early-season games
features = ['season_total_yards', 'final_win_pct']

Right:

# Using only pre-game information
features = ['yards_entering_game', 'win_pct_entering_game']

2. Ignoring Temporal Structure

Wrong:

train_test_split(games, test_size=0.2, shuffle=True)

Right:

train = games[games['season'] < test_season]
test = games[games['season'] == test_season]

3. Overfitting to Small Samples

Wrong:

# Complex model on small dataset
model = xgb.XGBClassifier(n_estimators=500, max_depth=10)

Right:

# Regularized model appropriate for sample size
model = xgb.XGBClassifier(
    n_estimators=100, max_depth=4,
    reg_alpha=0.1, reg_lambda=1.0
)

4. Ignoring Calibration

Wrong:

# Only checking accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Right:

# Check calibration too
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Brier: {brier_score_loss(y_test, y_prob)}")
print(f"Calibration: {check_calibration(y_test, y_prob)}")

Hyperparameter Guidelines

XGBoost/GBM

Parameter        | Start Value | Tuning Range
-----------------|-------------|-------------
n_estimators     | 100         | 50-500
max_depth        | 4           | 3-7
learning_rate    | 0.1         | 0.01-0.3
subsample        | 0.8         | 0.6-1.0
colsample_bytree | 0.8         | 0.6-1.0

Random Forest

Parameter        | Start Value | Tuning Range
-----------------|-------------|---------------------
n_estimators     | 100         | 50-500
max_depth        | 6           | 4-12
min_samples_leaf | 10          | 5-50
max_features     | sqrt        | sqrt, log2, 0.3-0.8

Neural Networks

Parameter     | Start Value | Notes
--------------|-------------|------------
Hidden layers | [64, 32]    | 1-3 layers
Dropout       | 0.3         | 0.2-0.5
Learning rate | 0.001       | 0.0001-0.01
Batch size    | 32          | 16-128
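Most of these starting values can be tried directly with scikit-learn's MLPClassifier; a hedged sketch on synthetic data (note sklearn's MLP has no dropout parameter, so that row needs Keras or PyTorch):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, shuffle=False)

# [64, 32] hidden layers, learning rate 0.001, batch size 32 per the table.
# (Dropout is not supported in sklearn's MLP; use a DL framework for that.)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), learning_rate_init=0.001,
                    batch_size=32, max_iter=500, random_state=42)
mlp.fit(X_tr, y_tr)
print(f"Test accuracy: {mlp.score(X_te, y_te):.3f}")
```

For the small tabular datasets typical of football modeling, this sklearn baseline is usually enough to judge whether a neural network beats the tree-based models before investing in a deep learning stack.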

Evaluation Checklist

Before Training

  • [ ] Data cleaned and preprocessed
  • [ ] Features engineered appropriately
  • [ ] No data leakage in features
  • [ ] Temporal train/test split implemented
  • [ ] Baseline model established

Model Training

  • [ ] Multiple algorithms compared
  • [ ] Hyperparameters tuned on validation
  • [ ] Cross-validation performed
  • [ ] Overfitting checked (train vs. val gap)

Evaluation

  • [ ] Accuracy reported
  • [ ] AUC calculated
  • [ ] Brier score measured
  • [ ] Calibration analyzed
  • [ ] Compared to baseline

Deployment

  • [ ] Model serialized
  • [ ] Pipeline documented
  • [ ] Monitoring plan in place
  • [ ] Retraining schedule defined

Quick Reference Tables

Typical Football ML Performance

Task              | Good Accuracy | Good AUC | Good Brier
------------------|---------------|----------|-----------
Game outcome      | >72%          | >0.80    | <0.18
Spread prediction | >55% ATS      | -        | -
Draft prediction  | >75%          | >0.85    | -
Play prediction   | >40%          | -        | -

Feature Importance Ranking (Game Prediction)

Rank | Feature Type             | Importance
-----|--------------------------|-----------
1    | Vegas spread/implied     | 30-35%
2    | Power ratings (Elo, SP+) | 20-25%
3    | Efficiency metrics       | 15-20%
4    | Home advantage           | 5-10%
5    | Recent form              | 5-8%
6    | Situational factors      | <5%
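Rankings like the one above come from a fitted model's feature_importances_. A sketch on synthetic data (the feature names and the data-generating process are invented so the spread dominates by construction):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    'vegas_spread': rng.normal(0, 7, n),
    'elo_diff': rng.normal(0, 100, n),
    'home_advantage': 2.5 + rng.normal(0, 0.5, n),
})
# Outcome driven mostly by the spread (synthetic, for illustration only)
y = (X['vegas_spread'] + 0.01 * X['elo_diff'] + rng.normal(0, 3, n) > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                   random_state=42).fit(X, y)
ranking = pd.Series(model.feature_importances_,
                    index=X.columns).sort_values(ascending=False)
print(ranking)  # importances are normalized to sum to 1
```

Tree-based importances measure how much each feature reduces impurity, so correlated features (spread and power ratings, in practice) will split credit between them.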

Next Steps

After mastering ML applications, proceed to:

  • Chapter 23: Network Analysis in Football
  • Chapter 24: Computer Vision and Tracking Data
  • Chapter 27: Building a Complete Analytics System