Key Takeaways: Machine Learning for NFL Prediction
One-Page Reference
When ML Helps vs Struggles
ML Excels When:
- High-dimensional data with unknown interactions
- Non-linear relationships exist
- Large datasets available
- Rich feature engineering potential

ML Struggles When:
- Limited data (~270 games/season)
- High noise (σ ≈ 13.5 points)
- Concept drift (team changes)
- High risk of overfitting
Feature Categories
| Category | Examples | Key Insight |
|---|---|---|
| Team Strength | PPG, YPG, turnover margin | Foundation of prediction |
| Recent Form | Last N weeks performance | Captures momentum |
| Situational | Rest days, travel, divisional | Context matters |
| Matchup | Off vs Def strengths | Specific advantages |
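The matchup row can be made concrete: a minimal sketch of turning two teams' pre-game season averages into differential features. The stat keys (`ppg`, `papg`, `to_margin`) and the example values are illustrative, not from any real dataset.

```python
def matchup_features(home, away):
    """Build differential features from two teams' pre-game season averages.

    `home`/`away` are dicts of illustrative aggregates:
    ppg = points per game, papg = points allowed per game,
    to_margin = turnover margin per game.
    """
    return {
        # Team strength: raw scoring and turnover edges (home minus away)
        "ppg_diff": home["ppg"] - away["ppg"],
        "to_margin_diff": home["to_margin"] - away["to_margin"],
        # Matchup: each offense against the opposing defense
        "home_off_vs_away_def": home["ppg"] - away["papg"],
        "away_off_vs_home_def": away["ppg"] - home["papg"],
    }

home = {"ppg": 27.1, "papg": 20.3, "to_margin": 0.4}
away = {"ppg": 21.8, "papg": 24.6, "to_margin": -0.2}
feats = matchup_features(home, away)
```

Differentials like these feed directly into the "Home − Away often better than raw" tip under Feature Engineering Tips.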
Model Selection Guide
| Model | Strengths | Weaknesses | Use When |
|---|---|---|---|
| Gradient Boosting | Captures non-linearity, handles mixed features | Can overfit | Default choice |
| Random Forest | More stable, parallel training | Slightly lower accuracy | Need stability |
| Logistic Regression | Interpretable, fast, baseline | Misses non-linearity | Baseline comparison |
| Neural Networks | Captures complex patterns | Needs more data | Rich data (play-by-play) |
Gradient Boosting Hyperparameters
| Parameter | Typical Range | Effect |
|---|---|---|
| n_estimators | 50-200 | More = more complex |
| max_depth | 3-6 | Lower = simpler trees |
| learning_rate | 0.05-0.2 | Lower = more regularization |
| min_child_weight | 10-50 | Higher = less overfitting |
| subsample | 0.7-0.9 | Lower = more regularization |
| reg_alpha/lambda | 0.5-2.0 | L1/L2 regularization |
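The table's typical ranges translate into a starting configuration. The dict below uses mid-range values as XGBoost keyword arguments; these are starting points to tune via temporal cross-validation, not tuned results.

```python
# Mid-range starting values from the hyperparameter table above.
xgb_params = {
    "n_estimators": 100,     # 50-200: more trees = more capacity
    "max_depth": 4,          # 3-6: shallower trees resist overfitting
    "learning_rate": 0.1,    # 0.05-0.2: lower = stronger regularization
    "min_child_weight": 20,  # 10-50: higher = less overfitting
    "subsample": 0.8,        # 0.7-0.9: row subsampling regularizes
    "reg_alpha": 1.0,        # L1 penalty
    "reg_lambda": 1.0,       # L2 penalty
}

# Usage (requires xgboost installed):
# from xgboost import XGBClassifier
# model = XGBClassifier(**xgb_params)
```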
Temporal Cross-Validation
NEVER randomly split NFL data!
Season 1 ──────────┐
Season 2 ──────────┼── Train
Season 3 ──────────┘
Season 4 ──────────── Test
Walk-Forward:
- Predict one week at a time
- Use only prior data for training
- Realistic evaluation
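The walk-forward idea can be sketched as a split generator over plain (season, week) records, so the temporal ordering is explicit. The data layout (a list of dicts with `season` and `week` keys) is a stand-in for whatever structure you actually use.

```python
def walk_forward_splits(rows, test_season):
    """Yield (train, test) lists, one test slice per week of test_season.

    Training data is everything strictly before the week being predicted:
    all earlier seasons plus earlier weeks of the test season.
    """
    weeks = sorted({r["week"] for r in rows if r["season"] == test_season})
    for wk in weeks:
        train = [r for r in rows
                 if r["season"] < test_season
                 or (r["season"] == test_season and r["week"] < wk)]
        test = [r for r in rows
                if r["season"] == test_season and r["week"] == wk]
        yield train, test

# Toy data: two seasons, three weeks each
rows = [{"season": s, "week": w} for s in (2021, 2022) for w in (1, 2, 3)]
splits = list(walk_forward_splits(rows, test_season=2022))
```

Each successive split's training set absorbs the previously predicted weeks, which is exactly why a random split would leak future information.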
Common Pitfalls
- Data Leakage: Using future info in features
- Overfitting: Train accuracy >> Test accuracy
- Small Sample: Too many features for data size
- Class Imbalance: Rare events need handling
- Feature Importance: Correlation ≠ causation
Overfitting Detection
If Train Accuracy - Test Accuracy > 10%: High concern
If Train Accuracy - Test Accuracy > 5%: Medium concern
Else: Acceptable
Solutions:
- Reduce max_depth
- Increase regularization
- Use fewer features
- Add more training data
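The detection thresholds above can be encoded directly, with accuracies expressed as fractions (0.10 = 10 percentage points):

```python
def overfitting_concern(train_acc, test_acc):
    """Classify the train/test accuracy gap per the thresholds above."""
    gap = train_acc - test_acc
    if gap > 0.10:
        return "high"
    if gap > 0.05:
        return "medium"
    return "acceptable"

level = overfitting_concern(0.78, 0.62)  # 16-point gap
```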
Ensemble Methods
Simple Average:
prediction = (model1 + model2 + model3) / 3
Weighted Average:
prediction = w1*model1 + w2*model2 + w3*model3
weights optimized on validation set
Stacking:
meta_features = [model1_pred, model2_pred, model3_pred]
final_pred = meta_model(meta_features)
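The three schemes above, sketched with NumPy on hypothetical per-model win probabilities. The validation weights and the linear meta-model coefficients are illustrative placeholders; in practice the meta-model would be a fitted estimator (e.g. a logistic regression trained on out-of-fold base predictions).

```python
import numpy as np

# Per-model home-win probabilities for four hypothetical games
p1 = np.array([0.62, 0.48, 0.71, 0.55])
p2 = np.array([0.58, 0.52, 0.66, 0.50])
p3 = np.array([0.65, 0.45, 0.74, 0.57])

# Simple average: equal trust in every model
simple = (p1 + p2 + p3) / 3

# Weighted average: weights chosen on a validation set (illustrative values)
w = np.array([0.5, 0.2, 0.3])
weighted = w[0] * p1 + w[1] * p2 + w[2] * p3

# Stacking: base predictions become features for a meta-model
# (a fixed linear combination stands in for a fitted meta-model here)
meta_features = np.column_stack([p1, p2, p3])
meta_coefs = np.array([0.4, 0.2, 0.4])
final_pred = meta_features @ meta_coefs
```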
Feature Engineering Tips
- Temporal Validity: Only use pre-game info
- Stability: Season aggregates > single games
- Differentials: Home - Away often better than raw
- Domain Knowledge: Football insight guides features
- Feature Selection: <30 features typically optimal
Evaluation Metrics
| Metric | Formula | Target |
|---|---|---|
| Accuracy | Correct / Total | >60% |
| Brier Score | mean((pred - actual)²) | <0.220 |
| Log Loss | -mean(y*log(p) + (1-y)*log(1-p)) | <0.65 |
| AUC | Area under ROC | >0.65 |
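The accuracy, Brier score, and log loss rows compute directly from predicted probabilities and outcomes. The arrays below are made-up values for illustration, not real results.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])             # actual outcomes (home win = 1)
p = np.array([0.7, 0.3, 0.6, 0.8, 0.4])   # predicted home-win probabilities

# Accuracy: fraction of games where the 0.5-thresholded pick was right
accuracy = np.mean((p > 0.5) == y)

# Brier score: mean squared error of the probabilities
brier = np.mean((p - y) ** 2)

# Log loss: clip probabilities to avoid log(0)
eps = 1e-15
pc = np.clip(p, eps, 1 - eps)
log_loss = -np.mean(y * np.log(pc) + (1 - y) * np.log(1 - pc))
```

Note that a coin-flip baseline (p = 0.5 everywhere) scores a Brier of 0.25 and a log loss of ln 2 ≈ 0.693, which is why the targets sit just below those values.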
Quick Implementation
```python
import numpy as np
from xgboost import XGBClassifier

# 1. Feature engineering (engineer is the project-specific feature builder)
features = engineer.create_features(games, stats, week)
# 2. Train/test split (temporal!)
train = data[data['season'] < test_season]
test = data[data['season'] == test_season]
# 3. Model training
model = XGBClassifier(max_depth=4, n_estimators=100)
model.fit(X_train, y_train)
# 4. Evaluate: positive-class probabilities, then Brier score
preds = model.predict_proba(X_test)[:, 1]
brier = np.mean((preds - y_test) ** 2)
```
NFL-Specific Constraints
- ~270 games/season limits complexity
- High variance (σ ≈ 13.5) limits precision
- Team changes require retraining
- The betting market is an efficient competitor
Key Insight
Feature engineering matters more than algorithm choice. A simple model with thoughtful features often beats a complex model with raw data. Combine domain knowledge (football) with ML power (pattern finding) for best results.