Chapter 22: Key Takeaways - Machine Learning Applications
Quick Reference Summary
This chapter covered applying machine learning techniques to college football analytics, from classification to clustering and neural networks.
Core Concepts
ML Task Types in Football
| Task | Description | Example |
|---|---|---|
| Classification | Predict discrete outcomes | Win/loss, drafted/undrafted |
| Regression | Predict continuous values | Point spread, draft pick |
| Clustering | Group similar items | Player archetypes |
| Sequence | Model sequential data | Play prediction |
Key Algorithms
| Algorithm | Strengths | Use Case |
|---|---|---|
| Logistic Regression | Interpretable, fast | Baseline predictions |
| Random Forest | Handles non-linearity | Feature importance |
| XGBoost | High accuracy | Game prediction |
| K-Means | Unsupervised grouping | Player archetypes |
| Neural Networks | Complex patterns | Sequence modeling |
Essential Formulas
Classification Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
AUC = P(score(positive) > score(negative))
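These formulas can be checked with scikit-learn on a small hypothetical set of labels and scores:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Toy labels and predicted probabilities (hypothetical values)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1])
y_pred = (y_prob > 0.5).astype(int)

# With this toy data: TP=3, FP=1, FN=1, TN=3
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")   # (TP+TN)/n
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # TP/(TP+FP)
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # TP/(TP+FN)
print(f"AUC:       {roc_auc_score(y_true, y_prob):.3f}")    # rank-based, uses raw scores
```

Note that AUC is computed from the raw probabilities, not the thresholded predictions.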
Brier Score
Brier Score = (1/n) * Σ(predicted_prob - actual_outcome)²
Range: 0 (perfect) to 1 (worst)
Good value: < 0.20 for game prediction
Expected Calibration Error
ECE = Σ (n_bin / n_total) × |predicted_avg - actual_avg|
Good value: < 0.05
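scikit-learn does not ship an ECE function, so here is a minimal sketch of the formula above using equal-width bins (the bin count of 10 is a common but arbitrary choice):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width probability bins (n_bins is a modeling choice)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins; the last bin includes its right edge
        mask = (y_prob >= lo) & ((y_prob < hi) | (hi == 1.0))
        if mask.any():
            weight = mask.mean()  # n_bin / n_total
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += weight * gap
    return ece
```

A perfectly calibrated model (e.g. 50% predictions that win exactly half the time) scores 0.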
Silhouette Score (Clustering)
s(i) = (b(i) - a(i)) / max(a(i), b(i))
a(i) = avg distance to same cluster
b(i) = avg distance to nearest other cluster
Range: -1 to 1 (higher is better)
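A quick sanity check of the score on synthetic data: two tight, well-separated blobs should score close to 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two compact, well-separated synthetic clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(5, 0.2, (50, 2))])
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
print(f"Silhouette: {silhouette_score(X, labels):.3f}")  # near 1
```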
Code Patterns
Basic Classification Pipeline
```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, brier_score_loss

def train_game_predictor(games: pd.DataFrame, features: list):
    """Train basic game outcome predictor."""
    X = games[features].values
    y = games['home_win'].values
    # Temporal split: shuffle=False keeps chronological order
    # (assumes games is already sorted by date)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False
    )
    # Scale features, fitting on the training set only to avoid leakage
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Train
    model = GradientBoostingClassifier(n_estimators=100, max_depth=4)
    model.fit(X_train, y_train)
    # Evaluate
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"Accuracy: {accuracy_score(y_test, (y_prob > 0.5).astype(int)):.3f}")
    print(f"Brier: {brier_score_loss(y_test, y_prob):.3f}")
    return model, scaler
```
Ensemble Construction
```python
import numpy as np
from sklearn.metrics import brier_score_loss

def build_weighted_ensemble(models: dict, X_val, y_val):
    """Weight each model by its inverse validation Brier score."""
    weights = {}
    for name, model in models.items():
        y_prob = model.predict_proba(X_val)[:, 1]
        brier = brier_score_loss(y_val, y_prob)
        weights[name] = 1 / (brier + 0.01)  # small constant avoids divide-by-zero
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}  # normalize to sum to 1

def ensemble_predict(models: dict, weights: dict, X):
    """Get weighted ensemble probability predictions."""
    pred = np.zeros(len(X))
    for name, model in models.items():
        pred += weights[name] * model.predict_proba(X)[:, 1]
    return pred
```
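A self-contained usage sketch of this inverse-Brier weighting; the synthetic features and the two model choices here are stand-ins for real pre-game data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for pre-game features and outcomes
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, shuffle=False)

models = {
    "logit": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train),
}

# Inverse-Brier weighting, as in build_weighted_ensemble
weights = {name: 1 / (brier_score_loss(y_val, m.predict_proba(X_val)[:, 1]) + 0.01)
           for name, m in models.items()}
total = sum(weights.values())
weights = {k: v / total for k, v in weights.items()}

# Weighted ensemble probabilities on the validation set
pred = sum(w * models[n].predict_proba(X_val)[:, 1] for n, w in weights.items())
print({k: round(v, 3) for k, v in weights.items()})
print(f"Ensemble Brier: {brier_score_loss(y_val, pred):.3f}")
```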
Clustering for Archetypes
```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def discover_archetypes(players: pd.DataFrame,
                        features: list,
                        max_k: int = 8):
    """Discover player archetypes using K-Means clustering."""
    X = players[features].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Choose k by silhouette score
    best_k, best_score = 2, -1
    for k in range(2, max_k + 1):
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        if score > best_score:
            best_k, best_score = k, score
    # Final clustering with the best k
    final_kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
    players = players.copy()  # avoid mutating the caller's DataFrame
    players['archetype'] = final_kmeans.fit_predict(X_scaled)
    return players, final_kmeans
```
XGBoost with Early Stopping
```python
import xgboost as xgb

def train_xgboost_optimal(X_train, y_train, X_val, y_val):
    """Train XGBoost with early stopping on a validation set."""
    model = xgb.XGBClassifier(
        n_estimators=1000,         # high ceiling; early stopping limits it
        max_depth=4,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=50,  # constructor argument in xgboost >= 2.0
                                   # (a fit() kwarg in older versions)
        random_state=42,
    )
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=False,
    )
    print(f"Best iteration: {model.best_iteration}")
    return model
```
Model Comparison
Classification Algorithms
| Model | Typical Accuracy | AUC | Training Speed |
|---|---|---|---|
| Logistic | 68-70% | 0.75 | Very Fast |
| Random Forest | 70-72% | 0.78 | Fast |
| Gradient Boosting | 72-74% | 0.81 | Medium |
| XGBoost | 73-75% | 0.82 | Medium |
| Neural Network | 72-75% | 0.81 | Slow |
| Ensemble | 74-76% | 0.83 | Varies |
Clustering Algorithms
| Algorithm | Strengths | Weaknesses |
|---|---|---|
| K-Means | Fast, scalable | Requires k, spherical clusters |
| Hierarchical | Dendrogram visualization | Slow for large data |
| DBSCAN | Finds outliers, any shape | Sensitive to parameters |
| GMM | Soft assignments | Assumes Gaussian |
Validation Strategies
Football-Specific Splits
```python
# CORRECT: temporal split -- earlier seasons train, latest season tests
train = games[games['season'] < 2023]
test = games[games['season'] == 2023]

# WRONG: random split leaks future information into training
train, test = train_test_split(games, random_state=42)
```
Cross-Validation Options
| Method | Use Case | Code |
|---|---|---|
| Temporal | Time series | Manual by season |
| Group K-Fold | Leave teams out | GroupKFold |
| Stratified | Balanced classes | StratifiedKFold |
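A minimal sketch of the leave-teams-out row using `GroupKFold`; the team labels and data here are hypothetical:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: one row per game, grouped by team so no team
# appears in both the training and test folds
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = rng.integers(0, 2, size=20)
teams = np.repeat(["UGA", "BAMA", "OSU", "MICH", "LSU"], 4)

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=teams)):
    held_out = set(teams[test_idx])
    print(f"Fold {fold}: held-out team(s) = {held_out}")
```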
Common Pitfalls
1. Data Leakage

Wrong:

```python
# Using full-season stats to predict early-season games
features = ['season_total_yards', 'final_win_pct']
```

Right:

```python
# Using only pre-game information
features = ['yards_entering_game', 'win_pct_entering_game']
```

2. Ignoring Temporal Structure

Wrong:

```python
train_test_split(games, test_size=0.2, shuffle=True)
```

Right:

```python
train = games[games['season'] < test_season]
test = games[games['season'] == test_season]
```

3. Overfitting to Small Samples

Wrong:

```python
# Complex model on a small dataset
model = xgb.XGBClassifier(n_estimators=500, max_depth=10)
```

Right:

```python
# Regularized model appropriate for the sample size
model = xgb.XGBClassifier(
    n_estimators=100, max_depth=4,
    reg_alpha=0.1, reg_lambda=1.0
)
```

4. Ignoring Calibration

Wrong:

```python
# Only checking accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
```

Right:

```python
# Check calibration too
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Brier: {brier_score_loss(y_test, y_prob)}")
print(f"Calibration: {check_calibration(y_test, y_prob)}")
```
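`check_calibration` above is a user-defined helper, not a scikit-learn function; one possible implementation builds on `sklearn.calibration.calibration_curve`:

```python
import numpy as np
from sklearn.calibration import calibration_curve

def check_calibration(y_true, y_prob, n_bins=10):
    """Mean |predicted - observed| gap across non-empty probability bins.

    Unlike ECE, bins are not weighted by count -- a simplification.
    """
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    return float(np.mean(np.abs(mean_pred - frac_pos)))
```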
Hyperparameter Guidelines
XGBoost/GBM
| Parameter | Start Value | Tuning Range |
|---|---|---|
| n_estimators | 100 | 50-500 |
| max_depth | 4 | 3-7 |
| learning_rate | 0.1 | 0.01-0.3 |
| subsample | 0.8 | 0.6-1.0 |
| colsample_bytree | 0.8 | 0.6-1.0 |
Random Forest
| Parameter | Start Value | Tuning Range |
|---|---|---|
| n_estimators | 100 | 50-500 |
| max_depth | 6 | 4-12 |
| min_samples_leaf | 10 | 5-50 |
| max_features | sqrt | sqrt, log2, 0.3-0.8 |
Neural Networks
| Parameter | Start Value | Notes |
|---|---|---|
| Hidden layers | [64, 32] | 1-3 layers |
| Dropout | 0.3 | 0.2-0.5 |
| Learning rate | 0.001 | 0.0001-0.01 |
| Batch size | 32 | 16-128 |
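The table's starting values map onto scikit-learn's `MLPClassifier` as a minimal sketch. Note that `MLPClassifier` has no dropout layer, so the L2 penalty `alpha` stands in for regularization here (a substitution, not an equivalent); dropout would require a framework such as PyTorch or Keras.

```python
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # two hidden layers, per the table
    learning_rate_init=0.001,
    batch_size=32,
    alpha=1e-4,                   # L2 penalty in place of dropout
    early_stopping=True,          # holds out 10% as a validation set
    max_iter=500,
    random_state=42,
)
```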
Evaluation Checklist
Before Training
- [ ] Data cleaned and preprocessed
- [ ] Features engineered appropriately
- [ ] No data leakage in features
- [ ] Temporal train/test split implemented
- [ ] Baseline model established
Model Training
- [ ] Multiple algorithms compared
- [ ] Hyperparameters tuned on validation
- [ ] Cross-validation performed
- [ ] Overfitting checked (train vs. val gap)
Evaluation
- [ ] Accuracy reported
- [ ] AUC calculated
- [ ] Brier score measured
- [ ] Calibration analyzed
- [ ] Compared to baseline
Deployment
- [ ] Model serialized
- [ ] Pipeline documented
- [ ] Monitoring plan in place
- [ ] Retraining schedule defined
Quick Reference Tables
Typical Football ML Performance
| Task | Good Accuracy | Good AUC | Good Brier |
|---|---|---|---|
| Game outcome | >72% | >0.80 | <0.18 |
| Spread prediction | >55% ATS | - | - |
| Draft prediction | >75% | >0.85 | - |
| Play prediction | >40% | - | - |
Feature Importance Ranking (Game Prediction)
| Rank | Feature Type | Importance |
|---|---|---|
| 1 | Vegas spread/implied | 30-35% |
| 2 | Power ratings (Elo, SP+) | 20-25% |
| 3 | Efficiency metrics | 15-20% |
| 4 | Home advantage | 5-10% |
| 5 | Recent form | 5-8% |
| 6 | Situational factors | <5% |
Next Steps
After mastering ML applications, proceed to:

- Chapter 23: Network Analysis in Football
- Chapter 24: Computer Vision and Tracking Data
- Chapter 27: Building a Complete Analytics System