Exercises: Machine Learning for NFL Prediction


Exercise 1: Feature Engineering Audit

Review the following features and identify potential issues:

features = [
    'home_points_scored',      # This game's final score
    'home_season_ppg',         # Season average PPG
    'away_qb_passer_rating',   # Season QB rating
    'home_win_probability',    # Vegas pre-game probability
    'home_total_wins',         # Wins this season including this game
    'weather_temperature',     # Game day temperature
]

Tasks:
  a) Identify the features with data leakage
  b) Explain why each leaky feature is problematic
  c) Suggest valid alternatives for the leaky features (see the sketch below)
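For part (c), one leak-free replacement for home_season_ppg is an expanding average computed only over games already played. A minimal sketch, assuming a long-format DataFrame team_games with one row per team per game and hypothetical columns team, season, week, and points_scored:

import pandas as pd

def add_pregame_ppg(team_games: pd.DataFrame) -> pd.DataFrame:
    """Points-per-game feature built only from games played
    BEFORE the current one, so it cannot leak the final score."""
    df = team_games.sort_values(["season", "team", "week"]).copy()
    # shift(1) drops the current game; expanding().mean() averages
    # everything earlier in the same team-season.
    df["pregame_ppg"] = (
        df.groupby(["season", "team"])["points_scored"]
          .transform(lambda s: s.shift(1).expanding().mean())
    )
    # Week 1 has no prior games; a league-average prior is one
    # simple (if crude) fallback.
    df["pregame_ppg"] = df["pregame_ppg"].fillna(df["points_scored"].mean())
    return df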


Exercise 2: Temporal Cross-Validation

You have 5 seasons of NFL data (2019-2023) and want to evaluate your model.

Tasks:
  a) Design a temporal CV scheme with at least 3 folds (see the sketch below)
  b) What is the minimum training data in each fold?
  c) How does this differ from 5-fold random CV?
  d) Why would random CV give overly optimistic results?
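One way to structure part (a) is an expanding window in which each validation season is predicted using only earlier seasons. A sketch, assuming a games DataFrame with a season column:

# Expanding-window temporal CV: train strictly on seasons that
# precede the validation season.
folds = [
    {"train": [2019, 2020], "validate": 2021},
    {"train": [2019, 2020, 2021], "validate": 2022},
    {"train": [2019, 2020, 2021, 2022], "validate": 2023},
]

for fold in folds:
    train = games[games["season"].isin(fold["train"])]
    val = games[games["season"] == fold["validate"]]
    # fit on train, evaluate on val ...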


Exercise 3: Regularization Effects

Train a gradient boosting model with these parameter sets:

Set   max_depth   n_estimators   min_child_weight
A     8           500            1
B     4           100            10
C     2           50             50

Tasks:
  a) Which set is most prone to overfitting? Why?
  b) Which set is most likely to underfit? Why?
  c) Train all three on sample data and compare train vs. test accuracy (see the sketch below)
  d) Plot learning curves for each
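For part (c), a sketch using xgboost, assuming X_train/X_test and y_train/y_test already exist from a temporal split:

from xgboost import XGBClassifier

param_sets = {
    "A": dict(max_depth=8, n_estimators=500, min_child_weight=1),
    "B": dict(max_depth=4, n_estimators=100, min_child_weight=10),
    "C": dict(max_depth=2, n_estimators=50, min_child_weight=50),
}

for name, params in param_sets.items():
    model = XGBClassifier(**params, eval_metric="logloss")
    model.fit(X_train, y_train)
    # A large train-test gap is the signature of overfitting.
    print(f"Set {name}: train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")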


Exercise 4: Feature Importance Analysis

Your model produces these feature importances:

Feature     Importance
home_ppg    0.25
away_ppg    0.22
home_ypg    0.12
away_ypg    0.11
week        0.08
rest_days   0.06

Tasks:
  a) Why might home_ppg and away_ppg both be high?
  b) Should you drop away_ypg since it's similar to home_ypg?
  c) What does the week importance suggest?
  d) Create a "differential" feature that might be more informative (see the sketch below)
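For part (d), a differential feature is a one-liner; the column names below are hypothetical but match the table above:

# Differential features encode the matchup directly, which is often
# more informative than the two raw team-level columns.
games["ppg_diff"] = games["home_ppg"] - games["away_ppg"]
games["ypg_diff"] = games["home_ypg"] - games["away_ypg"]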


Exercise 5: Class Imbalance

You're predicting home upsets (an underdog winning at home). In your training data:
  - 1,500 games total
  - 350 home upsets (23%)
  - 1,150 non-upsets (77%)

Tasks:
  a) Calculate appropriate class weights (see the sketch below)
  b) How would you modify XGBoost's scale_pos_weight?
  c) Compare model performance with and without class weighting
  d) When might you NOT want to use class weights?
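For parts (a) and (b): with upsets as the positive class, scale_pos_weight is conventionally set to the ratio of negative to positive examples. A sketch:

from xgboost import XGBClassifier

n_pos, n_neg = 350, 1150            # upsets, non-upsets
spw = n_neg / n_pos                 # 1150 / 350 ≈ 3.29

model = XGBClassifier(scale_pos_weight=spw, eval_metric="logloss")

# Equivalent sklearn-style class weights:
class_weight = {1: n_neg / n_pos, 0: 1.0}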


Exercise 6: Ensemble Construction

You have three trained models with these validation metrics:

Model           Accuracy   Brier Score   Correlation with Others
XGBoost         61.2%      0.218         -
Random Forest   60.5%      0.221         0.82 with XGB
Logistic Reg    59.8%      0.224         0.65 with XGB

Tasks:
  a) Calculate weights based on the Brier scores (see the sketch below)
  b) Why might Logistic Regression be valuable despite having the lowest accuracy?
  c) Implement a simple averaging ensemble
  d) How would you test whether the ensemble beats any single model?
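For parts (a) and (c), one common heuristic is inverse-Brier weighting (lower Brier is better), normalized to sum to 1. A sketch, assuming each fitted model exposes predict_proba:

brier = {"xgb": 0.218, "rf": 0.221, "logreg": 0.224}
inverse = {name: 1.0 / score for name, score in brier.items()}
total = sum(inverse.values())
weights = {name: w / total for name, w in inverse.items()}
# ≈ {"xgb": 0.338, "rf": 0.333, "logreg": 0.329} — nearly uniform,
# because the three Brier scores are so close.

def ensemble_proba(models, weights, X):
    """Weighted average of P(home win) across the models."""
    return sum(weights[name] * model.predict_proba(X)[:, 1]
               for name, model in models.items())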


Exercise 7: Neural Network Architecture

Design a neural network for NFL game prediction with these constraints:
  - Input: 25 features
  - Training data: 3,000 games
  - Goal: classify home win/loss

Tasks:
  a) Design an appropriate architecture (layers, neurons, activations)
  b) What regularization would you apply?
  c) How many epochs and what batch size?
  d) Calculate approximately how many trainable parameters your network has (see the sketch below)
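For part (d): a dense layer has inputs × units + units parameters. One plausible small architecture, sized conservatively for only 3,000 games, shown here in Keras purely as an illustration (the layer sizes are assumptions, not the "right" answer):

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(25,)),
    keras.layers.Dense(32, activation="relu"),    # 25*32 + 32 = 832
    keras.layers.Dropout(0.3),                    # regularization, no params
    keras.layers.Dense(16, activation="relu"),    # 32*16 + 16 = 528
    keras.layers.Dense(1, activation="sigmoid"),  # 16*1  + 1  =  17
])
# Total trainable parameters: 832 + 528 + 17 = 1,377
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])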


Exercise 8: Walk-Forward Validation

Implement walk-forward validation for the 2023 season:
  - Train on all prior data through Week N-1
  - Predict Week N
  - Repeat for all 18 weeks of the regular season

Tasks:
  a) Write pseudocode for this process (see the sketch below)
  b) How does model performance change over the season?
  c) Should you retrain weekly or keep a fixed model?
  d) Compare Week 1 predictions to Week 18 predictions in terms of accuracy
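For part (a), a runnable sketch rather than pseudocode; games, FEATURES, and train_model are hypothetical names:

# Walk-forward over 2023: each week is predicted by a model that has
# seen only games played strictly before it.
results = []
for week in range(1, 19):                       # Weeks 1-18
    train = games[(games["season"] < 2023) |
                  ((games["season"] == 2023) & (games["week"] < week))]
    test = games[(games["season"] == 2023) & (games["week"] == week)]
    model = train_model(train)                  # hypothetical helper
    preds = model.predict(test[FEATURES])
    accuracy = (preds == test["home_win"].to_numpy()).mean()
    results.append((week, accuracy))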


Exercise 9: Feature Selection

You have 50 candidate features but only 800 training games.

Tasks:
  a) What is the risk of using all 50 features?
  b) Implement forward selection to choose the best subset
  c) Implement LASSO regularization to automatically select features (see the sketch below)
  d) Compare the features selected by each method
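For part (c), L1-regularized logistic regression drives weak coefficients exactly to zero; the survivors are the selected features. A sketch, where feature_names is an assumed list of the 50 column names:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# L1 penalties are scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X_train)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X_scaled, y_train)

selected = np.array(feature_names)[lasso.coef_[0] != 0]
print(f"kept {len(selected)} of {len(feature_names)} features:", selected)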


Exercise 10: Model Comparison

Compare three approaches on the same test set:
  1. Simple Elo rating (from Chapter 19)
  2. Logistic regression with 10 features
  3. XGBoost with 25 features

Tasks:
  a) Set up a fair comparison framework
  b) Calculate accuracy, Brier score, and log loss for each
  c) Perform statistical significance testing with McNemar's test (see the sketch below)
  d) Which model would you recommend, and why?
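For part (c), McNemar's test focuses on the games where the two models disagree. A sketch using statsmodels; correct_a and correct_b are assumed boolean arrays marking which test games each model got right:

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

table = [[np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
         [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)]]

# exact=True uses the binomial test, appropriate for the modest
# disagreement counts typical of a single-season test set.
result = mcnemar(table, exact=True)
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")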


Exercise 11: Handling Missing Data

Your feature matrix has missing values:
  - home_qb_rating: 5% missing (injured starters)
  - weather_temp: 10% missing (dome games)
  - attendance: 15% missing (not recorded)

Tasks:
  a) Develop an imputation strategy for each feature (see the sketch below)
  b) Should you drop rows with missing values?
  c) Can missingness itself be a feature?
  d) How does XGBoost handle missing values natively?
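For parts (a) and (c), a sketch: impute each column with a domain-appropriate value and keep an indicator column, since here the missingness is informative (a missing temperature usually means a dome game). For part (d), note that XGBoost can instead be fed the NaNs directly; it learns a default split direction per node.

import pandas as pd

def impute_with_flags(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # The missingness IS the signal: no recorded temperature ≈ dome.
    df["is_dome"] = df["weather_temp"].isna().astype(int)
    df["weather_temp"] = df["weather_temp"].fillna(70.0)  # indoor baseline
    df["qb_rating_missing"] = df["home_qb_rating"].isna().astype(int)
    df["home_qb_rating"] = df["home_qb_rating"].fillna(
        df["home_qb_rating"].median())
    return df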


Exercise 12: Calibration Analysis

Your model produces these probability buckets:

Predicted Probability   Games   Actual Home Wins
40-50%                  80      38
50-60%                  120     62
60-70%                  100     58
70-80%                  60      48
80-90%                  30      27

Tasks:
  a) Calculate the actual win rate per bucket (see the sketch below)
  b) Is the model under-confident or over-confident?
  c) Apply Platt scaling to recalibrate
  d) Calculate calibration error before and after
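For parts (a) and (d), a sketch that computes per-bucket win rates and a count-weighted calibration error (an expected-calibration-error style metric), using bucket midpoints from the table above:

# (bucket midpoint, games, actual home wins) from the table above
buckets = [(0.45, 80, 38), (0.55, 120, 62), (0.65, 100, 58),
           (0.75, 60, 48), (0.85, 30, 27)]

total_games = sum(n for _, n, _ in buckets)
ece = 0.0
for midpoint, n, wins in buckets:
    actual = wins / n
    print(f"predicted ~{midpoint:.0%}: actual {actual:.1%} over {n} games")
    ece += (n / total_games) * abs(actual - midpoint)
print(f"weighted calibration error: {ece:.3f}")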


Exercise 13: Spread Prediction

Convert your win probability model to spread prediction:
  - P(home win) = 65% should imply a spread of roughly -5 (home favored by about 5 points)

Tasks:
  a) Derive the formula relating probability to spread (see the sketch below)
  b) What assumptions does this formula make?
  c) Train a regression model that directly predicts the spread
  d) Compare probability-derived spreads to the direct regression
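For part (a), a common derivation treats the home margin of victory as roughly normal around the negated spread, with a standard deviation near 13.5 points (a frequently cited approximation, not a fixed constant); inverting the normal CDF then maps probability to spread:

from scipy.stats import norm

SIGMA = 13.5   # approximate historical std. dev. of NFL margins

def prob_to_spread(p_home_win: float, sigma: float = SIGMA) -> float:
    """Implied home spread (negative = home favored), assuming the
    home margin of victory ~ Normal(-spread, sigma)."""
    return -sigma * norm.ppf(p_home_win)

print(prob_to_spread(0.65))   # ≈ -5.2, matching the hint above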


Exercise 14: Cross-Model Stacking

Implement stacking with:
  - Level 0: XGBoost, Random Forest, Logistic Regression
  - Level 1: Logistic Regression meta-learner

Tasks:
  a) Generate out-of-fold predictions for Level 0 (see the sketch below)
  b) Train Level 1 on the Level 0 predictions
  c) Evaluate the stacked model
  d) Does stacking beat simple averaging?
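For parts (a) and (b), a sketch using scikit-learn's cross_val_predict. One caveat: cross_val_predict requires every row to be predicted exactly once, so a strictly temporal version would replace cv=5 with a manual walk-forward loop.

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

level0 = {
    "xgb": XGBClassifier(eval_metric="logloss"),
    "rf": RandomForestClassifier(n_estimators=200),
    "logreg": LogisticRegression(max_iter=1000),
}

# Each base model contributes one column of out-of-fold P(home win);
# the meta-learner never sees a prediction made on a model's own
# training fold, which is what prevents leakage into Level 1.
meta_X = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in level0.values()
])
meta_model = LogisticRegression().fit(meta_X, y_train)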


Exercise 15: Hyperparameter Tuning

Use temporal cross-validation to tune XGBoost hyperparameters:

Search space:

{
    'max_depth': [3, 4, 5, 6],
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.05, 0.1, 0.15],
    'min_child_weight': [5, 10, 20]
}

Tasks:
  a) How many combinations are there?
  b) Implement grid search with temporal CV (see the sketch below)
  c) Find the optimal parameters
  d) Is the improvement significant vs. the default parameters?
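For parts (a) and (b): the grid has 4 × 4 × 3 × 3 = 144 combinations. A sketch that scores each combination with the Brier score across expanding-window folds; temporal_folds is an assumed list of (train_idx, val_idx) index pairs:

from sklearn.metrics import brier_score_loss
from sklearn.model_selection import ParameterGrid
from xgboost import XGBClassifier

grid = ParameterGrid({
    "max_depth": [3, 4, 5, 6],
    "n_estimators": [50, 100, 150, 200],
    "learning_rate": [0.05, 0.1, 0.15],
    "min_child_weight": [5, 10, 20],
})
print(len(grid))   # 4 * 4 * 3 * 3 = 144

best_params, best_score = None, float("inf")
for params in grid:
    fold_scores = []
    for train_idx, val_idx in temporal_folds:
        m = XGBClassifier(**params, eval_metric="logloss")
        m.fit(X.iloc[train_idx], y.iloc[train_idx])
        proba = m.predict_proba(X.iloc[val_idx])[:, 1]
        fold_scores.append(brier_score_loss(y.iloc[val_idx], proba))
    mean_score = sum(fold_scores) / len(fold_scores)
    if mean_score < best_score:
        best_params, best_score = params, mean_score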


Exercise 16: Real-Time Prediction Pipeline

Design a pipeline that can make predictions for upcoming games.

Tasks:
  a) What data needs to be updated weekly?
  b) How would you handle Thursday Night Football (less rest)?
  c) Design error handling for missing player data
  d) How would you version and track model updates?


Exercise 17: Feature Drift Detection

Monitor whether features are drifting over time.

Tasks:
  a) Calculate feature distributions for 2019 vs. 2023
  b) Which features have drifted significantly? (see the sketch below)
  c) How might feature drift affect model performance?
  d) Design a system to alert when drift exceeds a threshold
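For parts (b) and (d), a two-sample Kolmogorov-Smirnov test is a simple per-feature drift check. A sketch; feature_cols and the two season DataFrames are assumed names:

from scipy.stats import ks_2samp

DRIFT_P = 0.01   # alert threshold; an assumption to tune

for col in feature_cols:
    stat, p_value = ks_2samp(df_2019[col].dropna(), df_2023[col].dropna())
    if p_value < DRIFT_P:
        print(f"DRIFT ALERT: {col} (KS={stat:.3f}, p={p_value:.4f})")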


Exercise 18: Explainability

Make your model's predictions explainable.

Tasks:
  a) Calculate SHAP values for a single prediction (see the sketch below)
  b) Identify the top 3 features driving a prediction
  c) Create a summary plot of feature contributions
  d) How would you explain a surprising prediction?
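For parts (a) through (c), a sketch using the shap library, assuming a fitted tree model named model and an assumed feature_names list:

import shap

explainer = shap.TreeExplainer(model)       # works for XGBoost, RF, etc.
shap_values = explainer.shap_values(X_test)

# Top 3 features driving the first test game's prediction.
contributions = zip(feature_names, shap_values[0])
top3 = sorted(contributions, key=lambda t: abs(t[1]), reverse=True)[:3]
print(top3)

# Global view of feature contributions across the whole test set.
shap.summary_plot(shap_values, X_test, feature_names=feature_names)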


Exercise 19: Baseline Comparison

Establish strong baselines before ML:

Tasks:
  a) Calculate "always pick the home team" accuracy (see the sketch below)
  b) Calculate "always pick the favorite" accuracy (using Elo)
  c) Calculate "pick based on win percentage" accuracy
  d) How much does your ML model improve over each baseline?
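A sketch of the three baselines; home_win (boolean), home_elo_favored (boolean), and the pre-game win-percentage columns are all assumed names:

acc_home = games["home_win"].mean()
acc_favorite = (games["home_elo_favored"] == games["home_win"]).mean()
acc_win_pct = ((games["home_win_pct"] > games["away_win_pct"])
               == games["home_win"]).mean()
print(f"home: {acc_home:.3f}  favorite: {acc_favorite:.3f}  "
      f"win%: {acc_win_pct:.3f}")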


Exercise 20: Full Pipeline Implementation

Build a complete ML prediction pipeline:

  1. Data loading and preprocessing
  2. Feature engineering
  3. Temporal train/test split
  4. Model training with hyperparameter tuning
  5. Evaluation with multiple metrics
  6. Prediction for upcoming week

Deliverables:
  a) Complete Python implementation
  b) Documentation of feature engineering choices
  c) Evaluation report with confidence intervals
  d) Predictions for the upcoming week, with explanations


Programming Challenges

Challenge A: AutoML for NFL

Implement an automated ML pipeline that:
  - Tries multiple algorithms
  - Automatically selects features
  - Tunes hyperparameters
  - Returns the best model

Challenge B: Online Learning

Build a model that updates continuously:
  - Processes each game as it finishes
  - Updates predictions for the remaining games
  - Tracks prediction accuracy in real time

Challenge C: Play-by-Play Features

Create features from play-by-play data:
  - Calculate EPA-based metrics
  - Create success-rate features
  - Build game-script features

Integrate these with your game-level model.

Challenge D: Uncertainty Quantification

Extend your model to output prediction intervals:
  - Bootstrap to estimate uncertainty
  - Quantile regression for bounds
  - A Bayesian neural network approach