Key Takeaways: Machine Learning for NFL Prediction
One-Page Reference
When ML Helps vs Struggles
ML Excels When:
- High-dimensional data with unknown interactions
- Non-linear relationships exist
- Large datasets available
- Rich feature engineering potential

ML Struggles When:
- Limited data (~270 games/season)
- High noise (σ ≈ 13.5 points)
- Concept drift (team changes)
- High risk of overfitting
Feature Categories
| Category | Examples | Key Insight |
|---|---|---|
| Team Strength | PPG, YPG, turnover margin | Foundation of prediction |
| Recent Form | Last N weeks performance | Captures momentum |
| Situational | Rest days, travel, divisional | Context matters |
| Matchup | Off vs Def strengths | Specific advantages |
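The matchup row can be made concrete: a minimal sketch of turning two teams' pre-game season averages into differential features. The stat keys (`ppg`, `papg`, `to_margin`) and the example values are illustrative, not from any real dataset.

```python
def matchup_features(home, away):
    """Build differential features from two teams' pre-game season averages.

    `home`/`away` are dicts of illustrative aggregates:
    ppg = points per game, papg = points allowed per game,
    to_margin = turnover margin per game.
    """
    return {
        # Team strength: raw scoring and turnover edges (home minus away)
        "ppg_diff": home["ppg"] - away["ppg"],
        "to_margin_diff": home["to_margin"] - away["to_margin"],
        # Matchup: each offense against the opposing defense
        "home_off_vs_away_def": home["ppg"] - away["papg"],
        "away_off_vs_home_def": away["ppg"] - home["papg"],
    }

home = {"ppg": 27.1, "papg": 20.3, "to_margin": 0.4}
away = {"ppg": 21.8, "papg": 24.6, "to_margin": -0.2}
feats = matchup_features(home, away)
```

Differentials like these feed directly into the "Home − Away often better than raw" tip under Feature Engineering Tips.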
Model Selection Guide
| Model | Strengths | Weaknesses | Use When |
|---|---|---|---|
| Gradient Boosting | Captures non-linearity, handles mixed features | Can overfit | Default choice |
| Random Forest | More stable, parallel training | Slightly lower accuracy | Need stability |
| Logistic Regression | Interpretable, fast, baseline | Misses non-linearity | Baseline comparison |
| Neural Networks | Captures complex patterns | Needs more data | Rich data (play-by-play) |
Gradient Boosting Hyperparameters
| Parameter | Typical Range | Effect |
|---|---|---|
| n_estimators | 50-200 | More = more complex |
| max_depth | 3-6 | Lower = simpler trees |
| learning_rate | 0.05-0.2 | Lower = more regularization |
| min_child_weight | 10-50 | Higher = less overfitting |
| subsample | 0.7-0.9 | Lower = more regularization |
| reg_alpha/lambda | 0.5-2.0 | L1/L2 regularization |
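The table's typical ranges translate into a starting configuration. The dict below uses mid-range values as XGBoost keyword arguments; these are starting points to tune via temporal cross-validation, not tuned results.

```python
# Mid-range starting values from the hyperparameter table above.
xgb_params = {
    "n_estimators": 100,     # 50-200: more trees = more capacity
    "max_depth": 4,          # 3-6: shallower trees resist overfitting
    "learning_rate": 0.1,    # 0.05-0.2: lower = stronger regularization
    "min_child_weight": 20,  # 10-50: higher = less overfitting
    "subsample": 0.8,        # 0.7-0.9: row subsampling regularizes
    "reg_alpha": 1.0,        # L1 penalty
    "reg_lambda": 1.0,       # L2 penalty
}

# Usage (requires xgboost installed):
# from xgboost import XGBClassifier
# model = XGBClassifier(**xgb_params)
```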
Temporal Cross-Validation
NEVER randomly split NFL data!
Season 1 ──────────┐
Season 2 ──────────┼── Train
Season 3 ──────────┘
Season 4 ──────────── Test
Walk-Forward:
- Predict one week at a time
- Use only prior data for training
- Realistic evaluation
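The walk-forward idea can be sketched as a split generator over plain (season, week) records, so the temporal ordering is explicit. The data layout (a list of dicts with `season` and `week` keys) is a stand-in for whatever structure you actually use.

```python
def walk_forward_splits(rows, test_season):
    """Yield (train, test) lists, one test slice per week of test_season.

    Training data is everything strictly before the week being predicted:
    all earlier seasons plus earlier weeks of the test season.
    """
    weeks = sorted({r["week"] for r in rows if r["season"] == test_season})
    for wk in weeks:
        train = [r for r in rows
                 if r["season"] < test_season
                 or (r["season"] == test_season and r["week"] < wk)]
        test = [r for r in rows
                if r["season"] == test_season and r["week"] == wk]
        yield train, test

# Toy data: two seasons, three weeks each
rows = [{"season": s, "week": w} for s in (2021, 2022) for w in (1, 2, 3)]
splits = list(walk_forward_splits(rows, test_season=2022))
```

Each successive split's training set absorbs the previously predicted weeks, which is exactly why a random split would leak future information.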
Common Pitfalls
- Data Leakage: Using future info in features
- Overfitting: Train accuracy >> Test accuracy
- Small Sample: Too many features for data size
- Class Imbalance: Rare events need handling
- Feature Importance: Correlation ≠ causation
Overfitting Detection
If Train Accuracy - Test Accuracy > 10%: High concern
If Train Accuracy - Test Accuracy > 5%: Medium concern
Else: Acceptable
Solutions:
- Reduce max_depth
- Increase regularization
- Use fewer features
- Add more training data
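The detection thresholds above can be encoded directly, with accuracies expressed as fractions (0.10 = 10 percentage points):

```python
def overfitting_concern(train_acc, test_acc):
    """Classify the train/test accuracy gap per the thresholds above."""
    gap = train_acc - test_acc
    if gap > 0.10:
        return "high"
    if gap > 0.05:
        return "medium"
    return "acceptable"

level = overfitting_concern(0.78, 0.62)  # 16-point gap
```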
Ensemble Methods
Simple Average:
prediction = (model1 + model2 + model3) / 3
Weighted Average:
prediction = w1*model1 + w2*model2 + w3*model3
weights optimized on validation set
Stacking:
meta_features = [model1_pred, model2_pred, model3_pred]
final_pred = meta_model(meta_features)
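The three schemes above, sketched with NumPy on hypothetical per-model win probabilities. The validation weights and the linear meta-model coefficients are illustrative placeholders; in practice the meta-model would be a fitted estimator (e.g. a logistic regression trained on out-of-fold base predictions).

```python
import numpy as np

# Per-model home-win probabilities for four hypothetical games
p1 = np.array([0.62, 0.48, 0.71, 0.55])
p2 = np.array([0.58, 0.52, 0.66, 0.50])
p3 = np.array([0.65, 0.45, 0.74, 0.57])

# Simple average: equal trust in every model
simple = (p1 + p2 + p3) / 3

# Weighted average: weights chosen on a validation set (illustrative values)
w = np.array([0.5, 0.2, 0.3])
weighted = w[0] * p1 + w[1] * p2 + w[2] * p3

# Stacking: base predictions become features for a meta-model
# (a fixed linear combination stands in for a fitted meta-model here)
meta_features = np.column_stack([p1, p2, p3])
meta_coefs = np.array([0.4, 0.2, 0.4])
final_pred = meta_features @ meta_coefs
```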
Feature Engineering Tips
- Temporal Validity: Only use pre-game info
- Stability: Season aggregates > single games
- Differentials: Home - Away often better than raw
- Domain Knowledge: Football insight guides features
- Feature Selection: <30 features typically optimal
Evaluation Metrics
| Metric | Formula | Target |
|---|---|---|
| Accuracy | Correct / Total | >60% |
| Brier Score | mean((pred - actual)²) | <0.220 |
| Log Loss | -mean(y*log(p) + (1-y)*log(1-p)) | <0.65 |
| AUC | Area under ROC | >0.65 |
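The accuracy, Brier score, and log loss rows compute directly from predicted probabilities and outcomes. The arrays below are made-up values for illustration, not real results.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])             # actual outcomes (home win = 1)
p = np.array([0.7, 0.3, 0.6, 0.8, 0.4])   # predicted home-win probabilities

# Accuracy: fraction of games where the 0.5-thresholded pick was right
accuracy = np.mean((p > 0.5) == y)

# Brier score: mean squared error of the probabilities
brier = np.mean((p - y) ** 2)

# Log loss: clip probabilities to avoid log(0)
eps = 1e-15
pc = np.clip(p, eps, 1 - eps)
log_loss = -np.mean(y * np.log(pc) + (1 - y) * np.log(1 - pc))
```

Note that a coin-flip baseline (p = 0.5 everywhere) scores a Brier of 0.25 and a log loss of ln 2 ≈ 0.693, which is why the targets sit just below those values.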
Quick Implementation
```python
import numpy as np
from xgboost import XGBClassifier

# 1. Feature engineering (engineer is the project-specific feature builder)
features = engineer.create_features(games, stats, week)
# 2. Train/test split (temporal!)
train = data[data['season'] < test_season]
test = data[data['season'] == test_season]
# 3. Model training
model = XGBClassifier(max_depth=4, n_estimators=100)
model.fit(X_train, y_train)
# 4. Evaluate: positive-class probabilities, then Brier score
preds = model.predict_proba(X_test)[:, 1]
brier = np.mean((preds - y_test) ** 2)
```
NFL-Specific Constraints
- ~270 games/season limits complexity
- High variance (σ ≈ 13.5) limits precision
- Team changes require retraining
- The betting market is an efficient competitor
Key Insight
Feature engineering matters more than algorithm choice. A simple model with thoughtful features often beats a complex model with raw data. Combine domain knowledge (football) with ML power (pattern finding) for best results.