Chapter 22: Key Takeaways - Machine Learning Applications
Quick Reference Summary
This chapter covered applying machine learning techniques to college football analytics, from classification to clustering and neural networks.
Core Concepts
ML Task Types in Football
| Task | Description | Example |
|---|---|---|
| Classification | Predict discrete outcomes | Win/loss, drafted/undrafted |
| Regression | Predict continuous values | Point spread, draft pick |
| Clustering | Group similar items | Player archetypes |
| Sequence | Model sequential data | Play prediction |
Key Algorithms
| Algorithm | Strengths | Use Case |
|---|---|---|
| Logistic Regression | Interpretable, fast | Baseline predictions |
| Random Forest | Handles non-linearity | Feature importance |
| XGBoost | High accuracy | Game prediction |
| K-Means | Unsupervised grouping | Player archetypes |
| Neural Networks | Complex patterns | Sequence modeling |
Essential Formulas
Classification Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
AUC = P(score(positive) > score(negative))
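These formulas can be checked with scikit-learn on a small hypothetical set of labels and scores:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Toy labels and predicted probabilities (hypothetical values)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1])
y_pred = (y_prob > 0.5).astype(int)

# With this toy data: TP=3, FP=1, FN=1, TN=3
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")   # (TP+TN)/n
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # TP/(TP+FP)
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # TP/(TP+FN)
print(f"AUC:       {roc_auc_score(y_true, y_prob):.3f}")    # rank-based, uses raw scores
```

Note that AUC is computed from the raw probabilities, not the thresholded predictions.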
Brier Score
Brier Score = (1/n) * Σ(predicted_prob - actual_outcome)²
Range: 0 (perfect) to 1 (worst)
Good value: < 0.20 for game prediction
Expected Calibration Error
ECE = Σ (n_bin / n_total) × |predicted_avg - actual_avg|
Good value: < 0.05
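scikit-learn does not ship an ECE function, so here is a minimal sketch of the formula above using equal-width bins (the bin count of 10 is a common but arbitrary choice):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width probability bins (n_bins is a modeling choice)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins; the last bin includes its right edge
        mask = (y_prob >= lo) & ((y_prob < hi) | (hi == 1.0))
        if mask.any():
            weight = mask.mean()  # n_bin / n_total
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += weight * gap
    return ece
```

A perfectly calibrated model (e.g. 50% predictions that win exactly half the time) scores 0.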
Silhouette Score (Clustering)
s(i) = (b(i) - a(i)) / max(a(i), b(i))
a(i) = avg distance to same cluster
b(i) = avg distance to nearest other cluster
Range: -1 to 1 (higher is better)
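A quick sanity check of the score on synthetic data: two tight, well-separated blobs should score close to 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two compact, well-separated synthetic clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(5, 0.2, (50, 2))])
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
print(f"Silhouette: {silhouette_score(X, labels):.3f}")  # near 1
```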
Code Patterns
Basic Classification Pipeline
```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, brier_score_loss

def train_game_predictor(games: pd.DataFrame, features: list):
    """Train basic game outcome predictor."""
    X = games[features].values
    y = games['home_win'].values
    # Temporal split: shuffle=False keeps chronological order
    # (assumes games is already sorted by date)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False
    )
    # Scale features, fitting on the training set only to avoid leakage
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Train
    model = GradientBoostingClassifier(n_estimators=100, max_depth=4)
    model.fit(X_train, y_train)
    # Evaluate
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"Accuracy: {accuracy_score(y_test, (y_prob > 0.5).astype(int)):.3f}")
    print(f"Brier: {brier_score_loss(y_test, y_prob):.3f}")
    return model, scaler
```
Ensemble Construction
```python
import numpy as np
from sklearn.metrics import brier_score_loss

def build_weighted_ensemble(models: dict, X_val, y_val):
    """Weight each model by its inverse validation Brier score."""
    weights = {}
    for name, model in models.items():
        y_prob = model.predict_proba(X_val)[:, 1]
        brier = brier_score_loss(y_val, y_prob)
        weights[name] = 1 / (brier + 0.01)  # small constant avoids divide-by-zero
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}  # normalize to sum to 1

def ensemble_predict(models: dict, weights: dict, X):
    """Get weighted ensemble probability predictions."""
    pred = np.zeros(len(X))
    for name, model in models.items():
        pred += weights[name] * model.predict_proba(X)[:, 1]
    return pred
```
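A self-contained usage sketch of this inverse-Brier weighting; the synthetic features and the two model choices here are stand-ins for real pre-game data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for pre-game features and outcomes
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, shuffle=False)

models = {
    "logit": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train),
}

# Inverse-Brier weighting, as in build_weighted_ensemble
weights = {name: 1 / (brier_score_loss(y_val, m.predict_proba(X_val)[:, 1]) + 0.01)
           for name, m in models.items()}
total = sum(weights.values())
weights = {k: v / total for k, v in weights.items()}

# Weighted ensemble probabilities on the validation set
pred = sum(w * models[n].predict_proba(X_val)[:, 1] for n, w in weights.items())
print({k: round(v, 3) for k, v in weights.items()})
print(f"Ensemble Brier: {brier_score_loss(y_val, pred):.3f}")
```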
Clustering for Archetypes
```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def discover_archetypes(players: pd.DataFrame,
                        features: list,
                        max_k: int = 8):
    """Discover player archetypes using K-Means clustering."""
    X = players[features].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Choose k by silhouette score
    best_k, best_score = 2, -1
    for k in range(2, max_k + 1):
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        if score > best_score:
            best_k, best_score = k, score
    # Final clustering with the best k
    final_kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
    players = players.copy()  # avoid mutating the caller's DataFrame
    players['archetype'] = final_kmeans.fit_predict(X_scaled)
    return players, final_kmeans
```
XGBoost with Early Stopping
```python
import xgboost as xgb

def train_xgboost_optimal(X_train, y_train, X_val, y_val):
    """Train XGBoost with early stopping on a validation set."""
    model = xgb.XGBClassifier(
        n_estimators=1000,         # high ceiling; early stopping limits it
        max_depth=4,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=50,  # constructor argument in xgboost >= 2.0
                                   # (a fit() kwarg in older versions)
        random_state=42,
    )
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=False,
    )
    print(f"Best iteration: {model.best_iteration}")
    return model
```
Model Comparison
Classification Algorithms
| Model | Typical Accuracy | AUC | Training Speed |
|---|---|---|---|
| Logistic | 68-70% | 0.75 | Very Fast |
| Random Forest | 70-72% | 0.78 | Fast |
| Gradient Boosting | 72-74% | 0.81 | Medium |
| XGBoost | 73-75% | 0.82 | Medium |
| Neural Network | 72-75% | 0.81 | Slow |
| Ensemble | 74-76% | 0.83 | Varies |
Clustering Algorithms
| Algorithm | Strengths | Weaknesses |
|---|---|---|
| K-Means | Fast, scalable | Requires k, spherical clusters |
| Hierarchical | Dendrogram visualization | Slow for large data |
| DBSCAN | Finds outliers, any shape | Sensitive to parameters |
| GMM | Soft assignments | Assumes Gaussian |
Validation Strategies
Football-Specific Splits
```python
# CORRECT: temporal split -- earlier seasons train, latest season tests
train = games[games['season'] < 2023]
test = games[games['season'] == 2023]

# WRONG: random split leaks future information into training
train, test = train_test_split(games, random_state=42)
```
Cross-Validation Options
| Method | Use Case | Code |
|---|---|---|
| Temporal | Time series | Manual by season |
| Group K-Fold | Leave teams out | GroupKFold |
| Stratified | Balanced classes | StratifiedKFold |
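A minimal sketch of the leave-teams-out row using `GroupKFold`; the team labels and data here are hypothetical:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: one row per game, grouped by team so no team
# appears in both the training and test folds
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = rng.integers(0, 2, size=20)
teams = np.repeat(["UGA", "BAMA", "OSU", "MICH", "LSU"], 4)

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=teams)):
    held_out = set(teams[test_idx])
    print(f"Fold {fold}: held-out team(s) = {held_out}")
```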
Common Pitfalls
1. Data Leakage

Wrong:

```python
# Using full-season stats to predict early-season games
features = ['season_total_yards', 'final_win_pct']
```

Right:

```python
# Using only pre-game information
features = ['yards_entering_game', 'win_pct_entering_game']
```

2. Ignoring Temporal Structure

Wrong:

```python
train_test_split(games, test_size=0.2, shuffle=True)
```

Right:

```python
train = games[games['season'] < test_season]
test = games[games['season'] == test_season]
```

3. Overfitting to Small Samples

Wrong:

```python
# Complex model on a small dataset
model = xgb.XGBClassifier(n_estimators=500, max_depth=10)
```

Right:

```python
# Regularized model appropriate for the sample size
model = xgb.XGBClassifier(
    n_estimators=100, max_depth=4,
    reg_alpha=0.1, reg_lambda=1.0
)
```

4. Ignoring Calibration

Wrong:

```python
# Only checking accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
```

Right:

```python
# Check calibration too
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Brier: {brier_score_loss(y_test, y_prob)}")
print(f"Calibration: {check_calibration(y_test, y_prob)}")
```
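`check_calibration` above is a user-defined helper, not a scikit-learn function; one possible implementation builds on `sklearn.calibration.calibration_curve`:

```python
import numpy as np
from sklearn.calibration import calibration_curve

def check_calibration(y_true, y_prob, n_bins=10):
    """Mean |predicted - observed| gap across non-empty probability bins.

    Unlike ECE, bins are not weighted by count -- a simplification.
    """
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    return float(np.mean(np.abs(mean_pred - frac_pos)))
```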
Hyperparameter Guidelines
XGBoost/GBM
| Parameter | Start Value | Tuning Range |
|---|---|---|
| n_estimators | 100 | 50-500 |
| max_depth | 4 | 3-7 |
| learning_rate | 0.1 | 0.01-0.3 |
| subsample | 0.8 | 0.6-1.0 |
| colsample_bytree | 0.8 | 0.6-1.0 |
Random Forest
| Parameter | Start Value | Tuning Range |
|---|---|---|
| n_estimators | 100 | 50-500 |
| max_depth | 6 | 4-12 |
| min_samples_leaf | 10 | 5-50 |
| max_features | sqrt | sqrt, log2, 0.3-0.8 |
Neural Networks
| Parameter | Start Value | Notes |
|---|---|---|
| Hidden layers | [64, 32] | 1-3 layers |
| Dropout | 0.3 | 0.2-0.5 |
| Learning rate | 0.001 | 0.0001-0.01 |
| Batch size | 32 | 16-128 |
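The table's starting values map onto scikit-learn's `MLPClassifier` as a minimal sketch. Note that `MLPClassifier` has no dropout layer, so the L2 penalty `alpha` stands in for regularization here (a substitution, not an equivalent); dropout would require a framework such as PyTorch or Keras.

```python
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # two hidden layers, per the table
    learning_rate_init=0.001,
    batch_size=32,
    alpha=1e-4,                   # L2 penalty in place of dropout
    early_stopping=True,          # holds out 10% as a validation set
    max_iter=500,
    random_state=42,
)
```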
Evaluation Checklist
Before Training
- [ ] Data cleaned and preprocessed
- [ ] Features engineered appropriately
- [ ] No data leakage in features
- [ ] Temporal train/test split implemented
- [ ] Baseline model established
Model Training
- [ ] Multiple algorithms compared
- [ ] Hyperparameters tuned on validation
- [ ] Cross-validation performed
- [ ] Overfitting checked (train vs. val gap)
Evaluation
- [ ] Accuracy reported
- [ ] AUC calculated
- [ ] Brier score measured
- [ ] Calibration analyzed
- [ ] Compared to baseline
Deployment
- [ ] Model serialized
- [ ] Pipeline documented
- [ ] Monitoring plan in place
- [ ] Retraining schedule defined
Quick Reference Tables
Typical Football ML Performance
| Task | Good Accuracy | Good AUC | Good Brier |
|---|---|---|---|
| Game outcome | >72% | >0.80 | <0.18 |
| Spread prediction | >55% ATS | - | - |
| Draft prediction | >75% | >0.85 | - |
| Play prediction | >40% | - | - |
Feature Importance Ranking (Game Prediction)
| Rank | Feature Type | Importance |
|---|---|---|
| 1 | Vegas spread/implied | 30-35% |
| 2 | Power ratings (Elo, SP+) | 20-25% |
| 3 | Efficiency metrics | 15-20% |
| 4 | Home advantage | 5-10% |
| 5 | Recent form | 5-8% |
| 6 | Situational factors | <5% |
Next Steps
After mastering ML applications, proceed to:

- Chapter 23: Network Analysis in Football
- Chapter 24: Computer Vision and Tracking Data
- Chapter 27: Building a Complete Analytics System