Key Takeaways: Machine Learning for NFL Prediction

One-Page Reference


When ML Helps vs Struggles

ML Excels When:
  • High-dimensional data with unknown interactions
  • Non-linear relationships exist
  • Large datasets available
  • Rich feature-engineering potential

ML Struggles When:
  • Limited data (~270 games/season)
  • High noise (σ ≈ 13.5 points)
  • Concept drift (team changes)
  • High overfitting risk


Feature Categories

| Category      | Examples                      | Key Insight              |
|---------------|-------------------------------|--------------------------|
| Team Strength | PPG, YPG, turnover margin     | Foundation of prediction |
| Recent Form   | Last N weeks' performance     | Captures momentum        |
| Situational   | Rest days, travel, divisional | Context matters          |
| Matchup       | Off vs Def strengths          | Specific advantages      |
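As a sketch of the "Recent Form" and "Matchup" ideas, a rolling pre-game average per team and a home-minus-away differential can be built like this (all column names and the toy data are assumptions, not a fixed schema):

```python
import pandas as pd

# Toy team-week scoring data (illustrative only).
stats = pd.DataFrame({
    "team": ["A"] * 5 + ["B"] * 5,
    "week": list(range(1, 6)) * 2,
    "points": [24, 17, 30, 21, 27, 10, 20, 13, 24, 17],
})

# Recent form: rolling mean over the last 3 games, shifted by one so
# only pre-game information is used (temporal validity).
stats["form_ppg"] = (
    stats.groupby("team")["points"]
         .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)

# Matchup differential for a hypothetical Week 5 game, A (home) vs B (away).
home = stats[(stats.team == "A") & (stats.week == 5)]["form_ppg"].iloc[0]
away = stats[(stats.team == "B") & (stats.week == 5)]["form_ppg"].iloc[0]
ppg_diff = home - away  # home-minus-away, per the "Differentials" tip below
```

The `shift(1)` is the important part: without it, the week being predicted leaks into its own feature.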

Model Selection Guide

| Model               | Strengths                                      | Weaknesses              | Use When                 |
|---------------------|------------------------------------------------|-------------------------|--------------------------|
| Gradient Boosting   | Captures non-linearity, handles mixed features | Can overfit             | Default choice           |
| Random Forest       | More stable, parallel training                 | Slightly lower accuracy | Need stability           |
| Logistic Regression | Interpretable, fast, baseline                  | Misses non-linearity    | Baseline comparison      |
| Neural Networks     | Captures complex patterns                      | Needs more data         | Rich data (play-by-play) |

Gradient Boosting Hyperparameters

| Parameter              | Typical Range | Effect                        |
|------------------------|---------------|-------------------------------|
| n_estimators           | 50-200        | More = more complex           |
| max_depth              | 3-6           | Lower = simpler trees         |
| learning_rate          | 0.05-0.2      | Lower = more regularization   |
| min_child_weight       | 10-50         | Higher = less overfitting     |
| subsample              | 0.7-0.9       | Lower = more regularization   |
| reg_alpha / reg_lambda | 0.5-2.0       | L1/L2 regularization strength |
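Mid-range starting values from the table, collected as an XGBoost-style parameter dict (a sketch, not tuned values; actual training assumes the `xgboost` package is installed):

```python
# Conservative starting point drawn from the typical ranges above;
# tune each value via temporal cross-validation, never a random split.
xgb_params = {
    "n_estimators": 100,     # number of boosting rounds
    "max_depth": 4,          # shallow trees resist overfitting
    "learning_rate": 0.1,    # lower values need more estimators
    "min_child_weight": 20,  # minimum child weight per leaf
    "subsample": 0.8,        # fraction of rows sampled per tree
    "reg_alpha": 1.0,        # L1 regularization
    "reg_lambda": 1.0,       # L2 regularization
}

# Usage (assumes xgboost is installed):
# from xgboost import XGBClassifier
# model = XGBClassifier(**xgb_params)
```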

Temporal Cross-Validation

NEVER randomly split NFL data!

Season 1 ──────────┐
Season 2 ──────────┼── Train
Season 3 ──────────┘
Season 4 ──────────── Test

Walk-Forward:
  • Predict one week at a time
  • Use only prior data for training
  • Realistic evaluation
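The walk-forward loop can be sketched as follows, assuming `season`/`week` columns (the toy three-season schedule is illustrative):

```python
import pandas as pd

# Toy schedule: three seasons of four weeks each (illustrative only).
data = pd.DataFrame({
    "season": [2021] * 4 + [2022] * 4 + [2023] * 4,
    "week": [1, 2, 3, 4] * 3,
})

def walk_forward(df, test_season):
    """Yield one (train, test) pair per week, training only on prior data."""
    for week in sorted(df.loc[df.season == test_season, "week"].unique()):
        train = df[(df.season < test_season)
                   | ((df.season == test_season) & (df.week < week))]
        test = df[(df.season == test_season) & (df.week == week)]
        yield train, test

splits = list(walk_forward(data, 2023))
```

Each test week sees every game that was actually available before kickoff, and nothing after, which is exactly what a live prediction system would have.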


Common Pitfalls

  1. Data Leakage: Using future info in features
  2. Overfitting: Train accuracy >> Test accuracy
  3. Small Sample: Too many features for data size
  4. Class Imbalance: Rare events need handling
  5. Feature Importance: Correlation ≠ causation

Overfitting Detection

If Train Accuracy - Test Accuracy > 10%: High concern
If Train Accuracy - Test Accuracy > 5%: Medium concern
Else: Acceptable
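The rule above is easy to encode as a helper (thresholds taken directly from this section):

```python
def overfit_level(train_acc, test_acc):
    """Classify the train-test accuracy gap per the thresholds above."""
    gap = train_acc - test_acc
    if gap > 0.10:
        return "high concern"
    if gap > 0.05:
        return "medium concern"
    return "acceptable"
```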

Solutions:
  • Reduce max_depth
  • Increase regularization
  • Fewer features
  • More training data


Ensemble Methods

Simple Average:

prediction = (model1 + model2 + model3) / 3

Weighted Average:

prediction = w1*model1 + w2*model2 + w3*model3
weights optimized on validation set

Stacking:

meta_features = [model1_pred, model2_pred, model3_pred]
final_pred = meta_model(meta_features)
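A dependency-free sketch of all three combiners, assuming out-of-fold base-model probabilities `p1`-`p3` (the synthetic data and the fixed weights are illustrative; the meta-model here is a logistic regression fit by a few gradient-descent steps rather than a library call):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed out-of-fold probabilities from three base models. In practice
# these come from temporal CV folds, never from models refit on test data.
p1, p2, p3 = rng.uniform(size=(3, 100))
y = (0.5 * p1 + 0.3 * p2 + 0.2 * p3
     + 0.1 * rng.normal(size=100) > 0.5).astype(float)

# Simple average
avg_pred = (p1 + p2 + p3) / 3

# Weighted average: weights fixed here for illustration; in practice,
# optimize them on a held-out validation set (e.g. minimize Brier score).
w = np.array([0.5, 0.3, 0.2])
weighted_pred = w @ np.vstack([p1, p2, p3])

# Stacking: logistic-regression meta-model on the base predictions.
X_meta = np.column_stack([p1, p2, p3])
beta = np.zeros(3)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X_meta @ beta)))          # meta-model output
    beta -= 0.1 * X_meta.T @ (p - y) / len(y)       # gradient step
stacked_pred = 1 / (1 + np.exp(-(X_meta @ beta)))
```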

Feature Engineering Tips

  1. Temporal Validity: Only use pre-game info
  2. Stability: Season aggregates > single games
  3. Differentials: Home - Away often better than raw
  4. Domain Knowledge: Football insight guides features
  5. Feature Selection: <30 features typically optimal

Evaluation Metrics

| Metric      | Formula                            | Target |
|-------------|------------------------------------|--------|
| Accuracy    | correct / total                    | >60%   |
| Brier Score | mean((pred - actual)²)             | <0.220 |
| Log Loss    | -mean(y·log(p) + (1-y)·log(1-p))   | <0.65  |
| AUC         | Area under ROC curve               | >0.65  |
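Under the definitions in the table, all four metrics can be computed with NumPy alone (the toy `y`/`p` arrays are illustrative; the rank-based AUC assumes no tied scores):

```python
import numpy as np

def brier(y, p):
    # Mean squared error of the predicted probabilities.
    return float(np.mean((p - y) ** 2))

def log_loss(y, p, eps=1e-15):
    # Clip to avoid log(0) on extreme probabilities.
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def auc(y, p):
    # Rank-based AUC (Mann-Whitney); assumes no tied probability scores.
    order = np.argsort(p)
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return float((ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2)
                 / (n_pos * n_neg))

# Toy outcomes and predicted home-win probabilities.
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.7, 0.3, 0.6, 0.4, 0.5])
acc = float(np.mean((p > 0.5) == y))
```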

Quick Implementation

import numpy as np
from xgboost import XGBClassifier

# 1. Feature Engineering (engineer, games, stats are your own objects;
#    use pre-game information only)
features = engineer.create_features(games, stats, week)

# 2. Train/Test Split (temporal!)
train = data[data['season'] < test_season]
test = data[data['season'] == test_season]
X_train, y_train = train[feature_cols], train['home_win']  # illustrative names
X_test, y_test = test[feature_cols], test['home_win']

# 3. Model Training
model = XGBClassifier(max_depth=4, n_estimators=100)
model.fit(X_train, y_train)

# 4. Evaluate (Brier score; lower is better)
preds = model.predict_proba(X_test)[:, 1]
brier = np.mean((preds - y_test) ** 2)

NFL-Specific Constraints

  • ~270 games/season limits model complexity
  • High variance (σ ≈ 13.5 points) limits precision
  • Roster and coaching changes require periodic retraining
  • Efficient betting markets are the competition to beat

Key Insight

Feature engineering matters more than algorithm choice. A simple model with thoughtful features often beats a complex model with raw data. Combine domain knowledge (football) with ML power (pattern finding) for best results.