Chapter 17: Key Takeaways - Introduction to Predictive Analytics
Quick Reference Summary
This chapter established the foundation for predictive modeling in football analytics.
Core Concepts
Types of Prediction Problems
| Type | Output | Football Examples |
|---|---|---|
| Classification | Categories | Win/Loss, Coverage type, Play call |
| Regression | Continuous values | Points scored, Yards gained, Draft position |
| Probability | Likelihoods | Win probability, Conversion chance, FG success |
The ML Workflow
- Problem Definition → Define what you're predicting and why
- Data Collection → Gather relevant historical data
- Feature Engineering → Transform data into predictive features
- Model Selection → Choose appropriate algorithms
- Training & Validation → Fit and evaluate with proper splits
- Deployment → Put model into production use
Essential Formulas
Prediction Equation
y = f(X) + ε,  ŷ = f̂(X)
Where:
- y = actual outcome
- ŷ = predicted outcome
- X = input features
- f = true underlying function (the model learns an approximation f̂)
- ε = irreducible error (noise no model can explain)
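A toy numeric illustration of this decomposition, assuming a hypothetical linear f (3 points per unit of rating edge) and Gaussian noise for ε:

```python
import random

random.seed(0)

def f(x):
    # Hypothetical "true" relationship: 3 points per unit of rating edge
    return 3.0 * x

X = [1.0, 2.0, 3.0]
# Observed outcomes = true signal + irreducible noise ε
y = [f(x) + random.gauss(0, 0.5) for x in X]
```

Even a perfect model of f cannot predict the ε part; that floor on error is why 100% accuracy is a red flag, not a goal.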
Key Metrics
Classification:
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
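The four classification formulas can be computed directly from confusion-matrix counts; a minimal sketch (the counts below are hypothetical):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from raw confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts from a win/loss classifier
acc, prec, rec, f1 = classification_metrics(tp=60, tn=50, fp=20, fn=10)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f}")
```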
Regression:
RMSE = √[Σ(y - ŷ)² / n]
MAE = Σ|y - ŷ| / n
R² = 1 - (SS_res / SS_tot)
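The regression formulas translate directly to code; a minimal sketch with hypothetical points-scored values:

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and R² from paired actual/predicted values."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Hypothetical actual vs. predicted points scored
rmse, mae, r2 = regression_metrics([21, 28, 14, 35], [24, 27, 17, 31])
```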
Calibration:
Brier Score = Σ(p - y)² / n
ECE = Σ (n_bin/n) × |accuracy_bin - confidence_bin|
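Both calibration metrics can be sketched from scratch; in the ECE below, each bin's "accuracy" is the observed positive rate and "confidence" the mean predicted probability (equal-width bins assumed):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    n = len(probs)
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n

def expected_calibration_error(probs, outcomes, n_bins=5):
    """Weighted average gap between per-bin confidence and observed accuracy."""
    n = len(probs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # The last bin is closed on the right so p = 1.0 is included
        in_bin = [(p, y) for p, y in zip(probs, outcomes)
                  if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if not in_bin:
            continue
        confidence = sum(p for p, _ in in_bin) / len(in_bin)
        accuracy = sum(y for _, y in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(accuracy - confidence)
    return ece
```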
Code Patterns
Train-Test Split (Temporal)
```python
def temporal_split(data, test_year):
    """Train on seasons before test_year; test on test_year and later."""
    train = data[data['season'] < test_year]
    test = data[data['season'] >= test_year]
    return train, test
```
Basic Classifier Training
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit only on train
X_test_scaled = scaler.transform(X_test)        # Transform test

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
```
Cross-Validation
```python
from sklearn.model_selection import cross_val_score, TimeSeriesSplit

# For time series data: each fold trains on the past, tests on the future
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```
Probability Calibration
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

base_model = LogisticRegression()
calibrated = CalibratedClassifierCV(base_model, cv=5)
calibrated.fit(X_train, y_train)
probabilities = calibrated.predict_proba(X_test)[:, 1]
```
Common Pitfalls to Avoid
1. Data Leakage
Wrong: Using the game outcome (score) to predict the game outcome

```python
# BAD - includes information from the future
features = ['home_score', 'away_score', 'winner']
```

Right: Use only pre-game information

```python
# GOOD - only information available before the game
features = ['home_rating', 'away_rating', 'rest_days']
```
2. Temporal Leakage
Wrong: Random train/test split for time series

```python
# BAD - 2023 games might land in train while 2021 games land in test
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.2)
```

Right: Split by time

```python
# GOOD - always train on the past, test on the future
train = data[data['season'] < 2023]
test = data[data['season'] >= 2023]
```
3. Ignoring Base Rates
Wrong: Claiming 60% accuracy without context

```python
# BAD - no baseline comparison
print(f"Model accuracy: {accuracy:.1%}")
```

Right: Compare to a meaningful baseline

```python
# GOOD - compare to a naive baseline
baseline = y_test.mean()  # Positive-class rate, e.g. home-team win rate
print(f"Baseline: {baseline:.1%}, Model: {accuracy:.1%}")
print(f"Improvement: {accuracy - baseline:.1%}")
```
4. Sample Size Issues
```python
# Rule of thumb: at least 10-20 observations per feature
n_games = 12  # Single team season
n_features = 20

if n_games / n_features < 10:
    print("Warning: Risk of overfitting!")
    print("Reduce features or get more data")
```
Model Selection Guide
| Requirement | Recommended Model |
|---|---|
| Interpretability needed | Logistic Regression |
| Many features, unknown relationships | Random Forest |
| Probability calibration critical | Calibrated Classifier |
| Large dataset | Gradient Boosting |
| Real-time predictions | Simple models (LR, trees) |
Evaluation Checklist
Before Training
- [ ] Defined clear prediction target
- [ ] Verified no data leakage
- [ ] Temporal split implemented
- [ ] Features scaled if needed
- [ ] Baseline established
During Training
- [ ] Cross-validation used
- [ ] Learning curves checked
- [ ] Train/test gap monitored
- [ ] Multiple metrics tracked
After Training
- [ ] Calibration assessed
- [ ] Feature importance analyzed
- [ ] Error patterns examined
- [ ] Baseline comparison made
Quick Reference Tables
Scikit-Learn Models for Football
| Task | Model | Key Parameters |
|---|---|---|
| Game outcome | LogisticRegression | C=1.0, max_iter=1000 |
| Points scored | Ridge | alpha=1.0 |
| Multi-class | RandomForestClassifier | n_estimators=100 |
| Probability | CalibratedClassifierCV | cv=5 |
Evaluation Metrics by Problem
| Problem Type | Primary Metric | Secondary Metrics |
|---|---|---|
| Win prediction | Accuracy | AUC-ROC, Brier Score |
| Score prediction | RMSE | MAE, R² |
| Probability | Brier Score | ECE, Calibration curve |
Common Feature Types
| Category | Examples |
|---|---|
| Efficiency | EPA/play, Success rate, Yards/play |
| Situational | Down, Distance, Field position |
| Game state | Score differential, Time remaining |
| Historical | Win %, Rolling averages, Rankings |
| Opponent-adjusted | EPA vs expected, SOS-adjusted |
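Historical features such as rolling averages must be built from prior games only, or they leak the current game's result into its own features. A minimal sketch (the EPA values below are hypothetical):

```python
def rolling_avg_prior(values, window=3):
    """Rolling mean of the previous `window` games, excluding the current game.

    Returns None when no prior history exists (e.g. week 1).
    """
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        out.append(sum(prior) / len(prior) if prior else None)
    return out

# Hypothetical EPA/play by week for one team
epa = [0.10, -0.05, 0.20, 0.15]
print(rolling_avg_prior(epa, window=3))
```

Slicing up to (but not including) index i is what keeps the feature leak-free: week 4's feature uses only weeks 1-3.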
Red Flags
Warning signs your model may have issues:
- Train accuracy >> Test accuracy → Overfitting
- 100% training accuracy → Severe overfitting
- Model worse than baseline → Feature or data issues
- Calibration curve far from diagonal → Probability issues
- Accuracy drops over time → Model drift
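The first three red flags can be checked programmatically; a minimal sketch, where the 0.10 train/test gap cutoff is an illustrative threshold rather than a universal rule:

```python
def check_red_flags(train_acc, test_acc, baseline_acc, gap_threshold=0.10):
    """Return warning strings for the red flags above (illustrative thresholds)."""
    flags = []
    if train_acc - test_acc > gap_threshold:
        flags.append("Overfitting: large train/test gap")
    if train_acc >= 0.999:
        flags.append("Severe overfitting: ~100% training accuracy")
    if test_acc < baseline_acc:
        flags.append("Worse than baseline: check features and data")
    return flags

print(check_red_flags(train_acc=0.95, test_acc=0.62, baseline_acc=0.57))
```

Model drift (the last flag) needs accuracy tracked over time, so it can't be caught by a one-shot check like this.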
Next Steps
After mastering these fundamentals, you're ready for:
- Chapter 18: Advanced game outcome prediction
- Chapter 19: Player performance forecasting
- Chapter 20: Recruiting analytics
- Chapter 21: Win probability models
- Chapter 22: Deep learning applications