Exercises: Machine Learning for NFL Prediction
Exercise 1: Feature Engineering Audit
Review the following features and identify potential issues:
```python
features = [
    'home_points_scored',     # This game's final score
    'home_season_ppg',        # Season average PPG
    'away_qb_passer_rating',  # Season QB passer rating
    'home_win_probability',   # Vegas pre-game win probability
    'home_total_wins',        # Wins this season, including this game
    'weather_temperature',    # Game-day temperature
]
```
Tasks:
a) Identify the features with data leakage
b) Explain why each leaky feature is problematic
c) Suggest valid alternatives for the leaky features (sketch below)
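For task (c), one leakage-free replacement for a season average is a season-to-date average that excludes the current game. A minimal pandas sketch with made-up teams and scores:

```python
import pandas as pd

# Hypothetical per-team game log in chronological order (values are made up).
games = pd.DataFrame({
    "team": ["KC", "KC", "KC", "BUF", "BUF", "BUF"],
    "week": [1, 2, 3, 1, 2, 3],
    "points_scored": [27, 31, 17, 24, 38, 20],
})

# Leakage-free season-to-date PPG: the expanding mean of *prior* games only.
# shift(1) drops the current game, so the week-N feature uses weeks 1..N-1.
games = games.sort_values(["team", "week"])
games["ppg_to_date"] = (
    games.groupby("team")["points_scored"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
print(games)
```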
Exercise 2: Temporal Cross-Validation
You have 5 seasons of NFL data (2019-2023) and want to evaluate your model.
Tasks:
a) Design a temporal CV scheme with at least 3 folds (sketch below)
b) What is the minimum training data in each fold?
c) How does this differ from 5-fold random CV?
d) Why would random CV give overly optimistic results?
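One possible scheme for task (a): season-based folds that always train on earlier seasons and validate on the next one. A sketch assuming your feature matrix has a `season` column:

```python
# Season-based temporal folds: train on all earlier seasons, validate on the next one.
def season_folds(df, seasons=(2019, 2020, 2021, 2022, 2023), min_train_seasons=2):
    for i in range(min_train_seasons, len(seasons)):
        train_idx = df.index[df["season"] < seasons[i]]
        val_idx = df.index[df["season"] == seasons[i]]
        yield train_idx, val_idx

# Yields 3 folds: 2019-20 -> 2021, 2019-21 -> 2022, 2019-22 -> 2023.
# Unlike 5-fold random CV, no fold ever trains on games played after the games it predicts.
```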
Exercise 3: Regularization Effects
Train a gradient boosting model with these parameter sets:
| Set | max_depth | n_estimators | min_child_weight |
|---|---|---|---|
| A | 8 | 500 | 1 |
| B | 4 | 100 | 10 |
| C | 2 | 50 | 50 |
Tasks:
a) Which set is most prone to overfitting? Why?
b) Which set is most likely to underfit? Why?
c) Train all three on sample data and compare train vs. test accuracy (sketch below)
d) Plot learning curves for each
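For task (c), a sketch that fits the three parameter sets on synthetic data (a stand-in for your real feature matrix) and prints the train/test gap; with real games, use a temporal split instead of a random one:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Synthetic stand-in for NFL features; swap in your real matrix and a temporal split.
X, y = make_classification(n_samples=3000, n_features=25, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

param_sets = {
    "A": dict(max_depth=8, n_estimators=500, min_child_weight=1),
    "B": dict(max_depth=4, n_estimators=100, min_child_weight=10),
    "C": dict(max_depth=2, n_estimators=50, min_child_weight=50),
}
for name, params in param_sets.items():
    model = XGBClassifier(**params, learning_rate=0.1, random_state=0)
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    print(f"Set {name}: train {train_acc:.3f}, test {test_acc:.3f}")  # watch the gap, not just test accuracy
```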
Exercise 4: Feature Importance Analysis
Your model produces these feature importances:
| Feature | Importance |
|---|---|
| home_ppg | 0.25 |
| away_ppg | 0.22 |
| home_ypg | 0.12 |
| away_ypg | 0.11 |
| week | 0.08 |
| rest_days | 0.06 |
Tasks:
a) Why might home_ppg and away_ppg both be high?
b) Should you drop away_ypg since it's similar to home_ypg?
c) What does the week importance suggest?
d) Create a "differential" feature that might be more informative (sketch below)
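For task (d), a differential feature collapses two correlated home/away columns into one signed signal. A toy sketch using the column names from the table (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "home_ppg": [27.1, 20.4], "away_ppg": [21.3, 24.8],
    "home_ypg": [370.0, 311.0], "away_ypg": [345.0, 362.0],
})
# Positive values favor the home team; the model sees the matchup directly.
df["ppg_diff"] = df["home_ppg"] - df["away_ppg"]
df["ypg_diff"] = df["home_ypg"] - df["away_ypg"]
print(df[["ppg_diff", "ypg_diff"]])
```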
Exercise 5: Class Imbalance
You're predicting home upsets (the home underdog wins). In your training data:
- 1,500 games total
- 350 home upsets (23%)
- 1,150 non-upsets (77%)
Tasks:
a) Calculate appropriate class weights
b) How would you modify XGBoost's scale_pos_weight?
c) Compare model performance with and without class weighting
d) When might you NOT want to use class weights?
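For tasks (a) and (b), the weights follow directly from the stated counts; a small sketch (the commented XGBoost call is illustrative):

```python
# Class weights from the stated counts.
n_pos, n_neg = 350, 1150           # upsets (positive class) vs. non-upsets
scale_pos_weight = n_neg / n_pos   # ~3.29, XGBoost's weight on the positive class

# Equivalent "balanced" per-class weights (total / (n_classes * class count)):
n = n_pos + n_neg
w_pos = n / (2 * n_pos)            # ~2.14
w_neg = n / (2 * n_neg)            # ~0.65
print(scale_pos_weight, w_pos, w_neg)

# from xgboost import XGBClassifier
# model = XGBClassifier(scale_pos_weight=scale_pos_weight)
```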
Exercise 6: Ensemble Construction
You have three trained models with these validation metrics:
| Model | Accuracy | Brier Score | Correlation with Others |
|---|---|---|---|
| XGBoost | 61.2% | 0.218 | - |
| Random Forest | 60.5% | 0.221 | 0.82 with XGB |
| Logistic Reg | 59.8% | 0.224 | 0.65 with XGB |
Tasks:
a) Calculate optimal weights based on Brier scores
b) Why might Logistic Regression be valuable despite having the lowest accuracy?
c) Implement a simple averaging ensemble (sketch below)
d) How would you test if the ensemble beats any single model?
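For tasks (a) and (c), one simple heuristic (not the only defensible one) is to weight each model by the inverse of its Brier score, normalize, and average the predicted probabilities:

```python
import numpy as np

# Inverse-Brier weights, normalized to sum to 1.
brier = {"xgb": 0.218, "rf": 0.221, "logreg": 0.224}
inv = {k: 1.0 / v for k, v in brier.items()}
weights = {k: v / sum(inv.values()) for k, v in inv.items()}
print(weights)  # nearly equal: Brier gaps this small barely move the weights

# Weighted average of each model's predicted home-win probabilities.
def ensemble_proba(probas: dict, weights: dict) -> np.ndarray:
    return sum(weights[name] * np.asarray(p) for name, p in probas.items())
```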
Exercise 7: Neural Network Architecture
Design a neural network for NFL game prediction with these constraints:
- Input: 25 features
- Training data: 3,000 games
- Goal: classify home win/loss
Tasks:
a) Design an appropriate architecture: layers, neurons, activations (one candidate is sketched below)
b) What regularization would you apply?
c) How many epochs and what batch size?
d) Calculate approximately how many trainable parameters your network has
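One candidate architecture for tasks (a), (b), and (d): with only about 3,000 rows, keep the network small and lean on dropout. A Keras sketch (other frameworks work just as well):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(25,)),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),   # P(home win)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Parameter count by hand (task d):
#   25*32 + 32 = 832,  32*16 + 16 = 528,  16*1 + 1 = 17  ->  1,377 trainable parameters
```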
Exercise 8: Walk-Forward Validation
Implement walk-forward validation for the 2023 season:
- Train on all prior data through Week N-1
- Predict Week N
- Repeat for all 17 weeks
Tasks:
a) Write pseudocode for this process (a Python sketch follows)
b) How does model performance change over the season?
c) Should you retrain weekly or keep a fixed model?
d) Compare Week 1 predictions to Week 17 predictions in terms of accuracy
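For task (a), a sketch assuming a frame with `season`, `week`, and `home_win` columns, a feature list, and any scikit-learn-style model factory:

```python
def walk_forward_2023(df, features, model_factory, n_weeks=17):
    results = []
    for week in range(1, n_weeks + 1):
        # Everything strictly before the target week is available for training.
        train = df[(df["season"] < 2023) |
                   ((df["season"] == 2023) & (df["week"] < week))]
        test = df[(df["season"] == 2023) & (df["week"] == week)]
        if test.empty:
            continue
        model = model_factory()                       # retrain from scratch each week
        model.fit(train[features], train["home_win"])
        proba = model.predict_proba(test[features])[:, 1]
        results.append((week, test["home_win"].to_numpy(), proba))
    return results
```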
Exercise 9: Feature Selection
You have 50 candidate features but only 800 training games.
Tasks:
a) What is the risk of using all 50 features?
b) Implement forward selection to choose the best subset
c) Implement LASSO regularization to automatically select features (sketch below)
d) Compare the features selected by each method
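For task (c), an L1-penalized logistic regression zeroes out weak features automatically. A sketch on synthetic data standing in for the 800-game, 50-feature matrix; the penalty strength `C=0.1` is an arbitrary starting point to tune:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=50, n_informative=8, random_state=0)

# Standardize first so the L1 penalty treats all features on the same scale.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
lasso.fit(X, y)

coefs = lasso.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coefs != 0)
print(f"{len(selected)} of 50 features kept:", selected)
```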
Exercise 10: Model Comparison
Compare three approaches on the same test set:
1. Simple Elo rating (from Chapter 19)
2. Logistic regression with 10 features
3. XGBoost with 25 features
Tasks:
a) Set up a fair comparison framework
b) Calculate accuracy, Brier score, and log loss for each
c) Perform statistical significance testing (McNemar's test; sketch below)
d) Which model would you recommend and why?
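For task (c), McNemar's test compares two classifiers on the same games by looking only at the games where they disagree. A sketch assuming boolean arrays marking whether each model got each game right:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    # 2x2 table: rows = model A correct/wrong, columns = model B correct/wrong.
    table = [
        [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ]
    # The exact (binomial) version is appropriate when the disagreement counts are small.
    return mcnemar(table, exact=True).pvalue
```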
Exercise 11: Handling Missing Data
Your feature matrix has missing values:
- home_qb_rating: 5% missing (injured starters)
- weather_temp: 10% missing (dome games)
- attendance: 15% missing (not recorded)
Tasks:
a) Develop an imputation strategy for each feature (sketch below)
b) Should you drop rows with missing values?
c) Can missingness itself be a feature?
d) How does XGBoost handle missing values natively?
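For tasks (a) and (c), a sketch that imputes each column with a value matching *why* it is missing and keeps indicator columns so missingness itself remains visible to the model. The fill values are illustrative choices, not the answer:

```python
import pandas as pd

def impute_with_flags(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Missing QB rating often means the usual starter is out: keep that signal.
    out["qb_rating_missing"] = out["home_qb_rating"].isna().astype(int)
    out["home_qb_rating"] = out["home_qb_rating"].fillna(out["home_qb_rating"].median())
    # Missing temperature mostly means a dome game: a flag plus a neutral indoor value.
    out["is_dome"] = out["weather_temp"].isna().astype(int)
    out["weather_temp"] = out["weather_temp"].fillna(70.0)
    # Attendance looks closer to missing-at-random: median imputation is a reasonable default.
    out["attendance"] = out["attendance"].fillna(out["attendance"].median())
    return out
```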
Exercise 12: Calibration Analysis
Your model produces these probability buckets:
| Predicted | Games | Actual Home Wins |
|---|---|---|
| 40-50% | 80 | 38 |
| 50-60% | 120 | 62 |
| 60-70% | 100 | 58 |
| 70-80% | 60 | 48 |
| 80-90% | 30 | 27 |
Tasks:
a) Calculate actual win rate per bucket
b) Is the model under-confident or over-confident?
c) Apply Platt scaling to recalibrate (sketch below)
d) Calculate calibration error before and after
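For tasks (c) and (d): Platt scaling classically fits a sigmoid to raw model scores; a common simple variant fits a one-feature logistic regression on the predicted probabilities, using a held-out calibration set. The second half of this sketch computes expected calibration error directly from the bucket table above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(raw_proba_cal, y_cal, raw_proba_new):
    platt = LogisticRegression()
    platt.fit(np.asarray(raw_proba_cal).reshape(-1, 1), y_cal)   # fit on calibration data only
    return platt.predict_proba(np.asarray(raw_proba_new).reshape(-1, 1))[:, 1]

# Expected calibration error from the table: (bucket midpoint, games, actual home wins).
buckets = [(0.45, 80, 38), (0.55, 120, 62), (0.65, 100, 58), (0.75, 60, 48), (0.85, 30, 27)]
n_total = sum(n for _, n, _ in buckets)
ece = sum(n * abs(wins / n - p) for p, n, wins in buckets) / n_total
print(f"ECE before recalibration: {ece:.3f}")
```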
Exercise 13: Spread Prediction
Convert your win probability model to spread prediction:
- P(home win) = 65% should imply a spread of roughly home -5
Tasks:
a) Derive the formula relating probability to spread (sketch below)
b) What assumptions does this formula make?
c) Train a regression model directly predicting the spread
d) Compare probability-derived spreads to direct regression
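For task (a), one common derivation treats the final margin (home minus away) as roughly normal with mean mu and a historical standard deviation around 13-14 points, so P(home win) = Phi(mu / sigma) and the implied margin is mu = sigma * Phi^{-1}(p). A sketch; the sigma value is an assumption to check against your own data:

```python
from scipy.stats import norm

def prob_to_spread(p_home_win: float, sigma: float = 13.5) -> float:
    mu = sigma * norm.ppf(p_home_win)   # expected home margin of victory
    return -mu                          # convention: home favorites carry a negative spread

print(prob_to_spread(0.65))  # ~ -5.2, consistent with the rule of thumb above
```

This assumes symmetric, roughly normal margins and ignores the key-number structure of NFL scores (3 and 7), which is part of what task (b) asks you to examine.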
Exercise 14: Cross-Model Stacking
Implement stacking with:
- Level 0: XGBoost, Random Forest, Logistic Regression
- Level 1: Logistic Regression meta-learner
Tasks:
a) Generate out-of-fold predictions for Level 0 (sketch below)
b) Train Level 1 on Level 0 predictions
c) Evaluate the stacked model
d) Does stacking beat simple averaging?
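For tasks (a) and (b), a sketch that builds out-of-fold Level-0 predictions and fits the meta-learner on them. It assumes numpy arrays and scikit-learn-style factories; with real games the folds should be temporal rather than the plain KFold used here:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def fit_stack(level0_factories, X, y, n_splits=5):
    oof = np.zeros((len(y), len(level0_factories)))
    kf = KFold(n_splits=n_splits, shuffle=False)
    for j, factory in enumerate(level0_factories):
        for tr_idx, va_idx in kf.split(X):
            m = factory()
            m.fit(X[tr_idx], y[tr_idx])
            oof[va_idx, j] = m.predict_proba(X[va_idx])[:, 1]     # out-of-fold only
    meta = LogisticRegression().fit(oof, y)                       # Level 1 never sees in-fold fits
    base = [factory().fit(X, y) for factory in level0_factories]  # refit Level 0 on all data
    return base, meta
```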
Exercise 15: Hyperparameter Tuning
Use temporal cross-validation to tune XGBoost hyperparameters:
Search space:
```python
{
    'max_depth': [3, 4, 5, 6],
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.05, 0.1, 0.15],
    'min_child_weight': [5, 10, 20]
}
```
Tasks:
a) How many combinations are there?
b) Implement grid search with temporal CV (sketch below)
c) Find the optimal parameters
d) Is the improvement significant vs. the default parameters?
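For task (b), a sketch that scores every combination with the season-based `season_folds` helper sketched in Exercise 2, using the Brier score as the selection metric (any proper scoring rule works). The frame columns and label name are assumptions:

```python
from itertools import product
from sklearn.metrics import brier_score_loss
from xgboost import XGBClassifier

search_space = {
    "max_depth": [3, 4, 5, 6],
    "n_estimators": [50, 100, 150, 200],
    "learning_rate": [0.05, 0.1, 0.15],
    "min_child_weight": [5, 10, 20],
}  # 4 * 4 * 3 * 3 = 144 combinations (task a)

def temporal_grid_search(df, features, label="home_win"):
    best_params, best_score = None, float("inf")
    for values in product(*search_space.values()):
        params = dict(zip(search_space.keys(), values))
        scores = []
        for tr_idx, va_idx in season_folds(df):        # temporal folds from Exercise 2
            model = XGBClassifier(**params)
            model.fit(df.loc[tr_idx, features], df.loc[tr_idx, label])
            proba = model.predict_proba(df.loc[va_idx, features])[:, 1]
            scores.append(brier_score_loss(df.loc[va_idx, label], proba))
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```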
Exercise 16: Real-Time Prediction Pipeline
Design a pipeline that can make predictions for upcoming games.
Tasks:
a) What data needs to be updated weekly?
b) How would you handle Thursday Night Football (less rest)?
c) Design error handling for missing player data
d) How would you version and track model updates?
Exercise 17: Feature Drift Detection
Monitor whether features are drifting over time.
Tasks:
a) Calculate feature distributions for 2019 vs. 2023
b) Which features have drifted significantly?
c) How might feature drift affect model performance?
d) Design a system to alert when drift exceeds a threshold (sketch below)
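For tasks (b) and (d), one simple screen is a two-sample Kolmogorov-Smirnov test per feature between the 2019 and 2023 rows, alerting on anything below a significance threshold. The `season` column is an assumption:

```python
from scipy.stats import ks_2samp

def detect_drift(df, features, alpha=0.01):
    drifted = []
    old = df[df["season"] == 2019]
    new = df[df["season"] == 2023]
    for feat in features:
        stat, pvalue = ks_2samp(old[feat].dropna(), new[feat].dropna())
        if pvalue < alpha:
            drifted.append((feat, round(stat, 3), pvalue))
    return drifted  # alert (email, dashboard flag, etc.) if this list is non-empty
```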
Exercise 18: Explainability
Make your model's predictions explainable.
Tasks:
a) Calculate SHAP values for a single prediction (sketch below)
b) Identify the top 3 features driving a prediction
c) Create a summary plot of feature contributions
d) How would you explain a surprising prediction?
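For tasks (a)-(c), a sketch using the `shap` package's TreeExplainer, assuming a fitted XGBoost classifier `model` and a pandas feature frame `X`:

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)           # one row of contributions per game

# Top 3 features driving the prediction for a single game (row 0).
row = 0
top3 = np.argsort(np.abs(shap_values[row]))[::-1][:3]
for i in top3:
    print(X.columns[i], shap_values[row][i])

# Global view of feature contributions (task c).
shap.summary_plot(shap_values, X)
```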
Exercise 19: Baseline Comparison
Establish strong baselines before ML:
Tasks: a) Calculate "always pick home team" accuracy b) Calculate "always pick favorite" accuracy (using Elo) c) Calculate "pick based on win percentage" accuracy d) How much does your ML model improve over each baseline?
Exercise 20: Full Pipeline Implementation
Build a complete ML prediction pipeline:
- Data loading and preprocessing
- Feature engineering
- Temporal train/test split
- Model training with hyperparameter tuning
- Evaluation with multiple metrics
- Prediction for upcoming week
Deliverables:
a) Complete Python implementation
b) Documentation of feature engineering choices
c) Evaluation report with confidence intervals
d) Predictions for the upcoming week with explanations
Programming Challenges
Challenge A: AutoML for NFL
Implement an automated ML pipeline that:
- Tries multiple algorithms
- Automatically selects features
- Tunes hyperparameters
- Returns the best model
Challenge B: Online Learning
Build a model that updates continuously:
- Processes each game as it finishes
- Updates predictions for remaining games
- Tracks prediction accuracy in real time
Challenge C: Play-by-Play Features
Create features from play-by-play data:
- Calculate EPA-based metrics
- Create success rate features
- Build game script features
Integrate these with your game-level model.
Challenge D: Uncertainty Quantification
Extend your model to output prediction intervals:
- Bootstrap to estimate uncertainty
- Quantile regression for bounds
- Bayesian neural network approach
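As a starting point for the bootstrap route, a sketch that refits the model on resampled training sets and reports a per-game interval over the resulting win probabilities. It assumes numpy arrays and a scikit-learn-style model factory:

```python
import numpy as np

def bootstrap_intervals(model_factory, X_train, y_train, X_new, n_boot=200, alpha=0.10):
    rng = np.random.default_rng(0)
    preds = np.empty((n_boot, len(X_new)))
    for b in range(n_boot):
        idx = rng.integers(0, len(X_train), size=len(X_train))   # resample with replacement
        m = model_factory()
        m.fit(X_train[idx], y_train[idx])
        preds[b] = m.predict_proba(X_new)[:, 1]
    lower = np.quantile(preds, alpha / 2, axis=0)
    upper = np.quantile(preds, 1 - alpha / 2, axis=0)
    return lower, upper   # e.g., a 90% interval per upcoming game
```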