Chapter 28 Quiz: Feature Engineering for Sports Betting

Instructions: Answer all 22 questions. This quiz is worth 100 points. You have 60 minutes. A calculator is permitted; no notes or internet access. For multiple choice, select the single best answer. For short answer, be precise and concise.


Section 1: Multiple Choice (10 questions, 3 points each = 30 points)

Question 1. Which of the following best describes "temporal leakage" in a sports prediction model?

(A) Using features that are highly correlated with each other

(B) Including information in training features that would not have been available at prediction time

(C) Training on too few games, causing the model to overfit to recent results

(D) Allowing the model to see the target variable during feature selection

Answer **(B) Including information in training features that would not have been available at prediction time.** Temporal leakage occurs when a model uses data from the future relative to the prediction point. For example, using a team's full-season average EPA/play to predict a Week 6 game includes information from Weeks 7-17 that was not yet available. This artificially inflates model performance during development but fails in real-time deployment.
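
A minimal sketch of a leakage-safe rolling feature, assuming a long-format DataFrame with hypothetical columns team, week, and epa_per_play; the shift(1) is what keeps a Week 6 feature limited to Weeks 1-5:

import pandas as pd

def leak_free_rolling_epa(games: pd.DataFrame, window: int = 5) -> pd.Series:
    """Rolling EPA/play mean that excludes the current game for each row."""
    return (
        games.sort_values("week")
             .groupby("team")["epa_per_play"]
             .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )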

Question 2. A feature has a variance inflation factor (VIF) of 12.4. What does this indicate?

(A) The feature has high predictive power for the target variable

(B) The feature is nearly constant across all observations

(C) The feature is highly correlated with other features in the model, causing multicollinearity

(D) The feature has a non-normal distribution requiring transformation

Answer **(C) The feature is highly correlated with other features in the model, causing multicollinearity.** A VIF above 10 is generally considered indicative of problematic multicollinearity. A VIF of 12.4 means that 1 - 1/12.4 = 91.9% of the variance in that feature is explained by the other features in the model. This inflates coefficient standard errors in linear models and can destabilize feature importance rankings.
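
A short sketch of the VIF diagnostic using statsmodels (the input feature matrix and its column names are whatever your model uses):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per feature; values above ~10 flag problematic multicollinearity."""
    Xc = sm.add_constant(X)  # include an intercept so VIFs are computed correctly
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)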

Question 3. When encoding the categorical variable "stadium" (32 NFL venues) for a gradient-boosted tree model, which approach is LEAST recommended?

(A) One-hot encoding all 32 stadiums

(B) Target encoding with leave-one-out smoothing

(C) Ordinal encoding (assigning integers 1-32 arbitrarily)

(D) Entity embeddings learned from a neural network

Answer **(C) Ordinal encoding (assigning integers 1-32 arbitrarily).** While tree-based models can technically handle ordinal encoding by splitting on thresholds, arbitrary integer assignment imposes a false ordering (stadium 15 is not "between" stadiums 14 and 16 in any meaningful sense). The model would need many splits to isolate individual stadium effects. One-hot encoding, target encoding, and entity embeddings all preserve the categorical nature of the variable without imposing spurious ordinal relationships.

Question 4. An exponentially weighted moving average (EWMA) of EPA/play with alpha = 0.4 places what fraction of total weight on the most recent game?

(A) 0.40

(B) 0.24

(C) 0.60

(D) 0.16

Answer **(A) 0.40.** In EWMA, the weight on the most recent observation is exactly alpha. The weight on the observation two periods ago is alpha * (1 - alpha) = 0.4 * 0.6 = 0.24. The weight on three periods ago is alpha * (1 - alpha)^2 = 0.4 * 0.36 = 0.144. The parameter alpha directly controls how much emphasis is placed on the most recent data point.
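
The weighting can be verified directly with pandas (the EPA values here are illustrative):

import pandas as pd

epa = pd.Series([0.12, -0.05, 0.20, 0.08, 0.15])  # oldest game first

# adjust=False gives the recursive form ewma_t = 0.4*x_t + 0.6*ewma_{t-1},
# so the most recent game always carries weight alpha = 0.4.
ewma = epa.ewm(alpha=0.4, adjust=False).mean()
print(ewma.iloc[-1])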

Question 5. Which of the following is the primary advantage of using mutual information over Pearson correlation for feature selection in sports betting models?

(A) Mutual information is faster to compute

(B) Mutual information captures nonlinear and non-monotonic relationships

(C) Mutual information produces values between -1 and +1, making interpretation easier

(D) Mutual information is not affected by sample size

Answer **(B) Mutual information captures nonlinear and non-monotonic relationships.** Pearson correlation only measures linear association, so a feature with a U-shaped relationship to the target (e.g., extreme pace in either direction correlating with upset probability) would show near-zero correlation despite having real predictive power. Mutual information measures any form of statistical dependence between variables, making it more general for feature selection.
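
A small simulation (synthetic data) of the U-shaped case described above, where Pearson correlation sees nothing but mutual information does:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
pace = rng.normal(size=2000)                     # standardized pace
target = pace ** 2 + rng.normal(0, 0.5, 2000)    # U-shaped dependence

print(np.corrcoef(pace, target)[0, 1])           # near zero: no linear signal
print(mutual_info_regression(pace.reshape(-1, 1), target))  # clearly positive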

Question 6. A modeler creates a "rest advantage" feature defined as (home team days of rest) minus (away team days of rest). For a game where the home team had 8 days rest and the away team had 5 days rest, the feature value is +3. Why might this linear difference be suboptimal?

(A) The feature should use multiplication instead of subtraction

(B) The marginal benefit of additional rest is likely nonlinear (diminishing returns beyond 7 days)

(C) Days of rest should always be encoded as a categorical variable

(D) The feature should only be computed for outdoor games

Answer **(B) The marginal benefit of additional rest is likely nonlinear (diminishing returns beyond 7 days).** Research suggests that the benefit of extra rest days follows a diminishing-returns curve: going from 3 days to 5 days of rest is far more impactful than going from 8 days to 10 days. A simple linear difference treats each additional day equally. A better approach might use log-transformed rest days, or create bins (short rest / normal / extended rest), or model the nonlinearity explicitly.
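
Two of the proposed alternatives, sketched with illustrative bin edges (the cutoffs are assumptions, not calibrated values):

import numpy as np
import pandas as pd

def rest_features(rest_days: pd.Series) -> pd.DataFrame:
    """Nonlinear encodings of rest days; bin edges are illustrative."""
    return pd.DataFrame({
        "log_rest": np.log1p(rest_days),  # diminishing returns per extra day
        "rest_bucket": pd.cut(rest_days, bins=[0, 4, 7, 30],
                              labels=["short", "normal", "extended"]),
    })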

Question 7. In a walk-forward validation setup for an NFL season model, which of the following correctly describes the expanding-window approach for Week 10 predictions?

(A) Train on Weeks 1-9 of the current season only, predict Week 10

(B) Train on all available historical data plus Weeks 1-9 of the current season, predict Week 10

(C) Train on Weeks 6-9 (a 4-week sliding window), predict Week 10

(D) Train on Weeks 1-17 of the previous season only, predict Week 10

Answer **(B) Train on all available historical data plus Weeks 1-9 of the current season, predict Week 10.** An expanding-window approach uses all data available up to the prediction point. For Week 10, this includes all prior seasons and Weeks 1-9 of the current season. This maximizes sample size while maintaining temporal integrity. Option (A) discards prior seasons, option (C) is a sliding window (not expanding), and option (D) ignores current-season data.
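
A skeleton of the expanding-window loop, assuming hypothetical FEATURES and TARGET column names and a single season's games for simplicity:

import pandas as pd

FEATURES = ["epa_off", "epa_def", "rest_diff"]  # hypothetical feature columns
TARGET = "home_cover"                           # hypothetical target column

def walk_forward(games: pd.DataFrame, model_factory, first_week: int = 5) -> pd.DataFrame:
    """For each week, train on every earlier game, then predict that week."""
    preds = []
    for week in range(first_week, int(games["week"].max()) + 1):
        train = games[games["week"] < week]   # expanding window: all prior data
        test = games[games["week"] == week]
        model = model_factory()
        model.fit(train[FEATURES], train[TARGET])
        preds.append(test.assign(pred=model.predict(test[FEATURES])))
    return pd.concat(preds)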

Question 8. Which dimensionality reduction technique preserves the most interpretability when applied to a set of 50 team-level basketball statistics?

(A) Principal Component Analysis (PCA)

(B) t-SNE (t-distributed Stochastic Neighbor Embedding)

(C) Factor Analysis with varimax rotation

(D) Autoencoders

Answer **(C) Factor Analysis with varimax rotation.** Factor analysis with varimax rotation produces factors that load heavily on a small number of original variables, making them interpretable (e.g., a factor that loads on offensive rating, effective FG%, and points per possession can be labeled "offensive efficiency"). PCA components are linear combinations of all variables and harder to interpret. t-SNE is a visualization tool, not a feature engineering method. Autoencoders produce opaque latent representations.
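
A sketch with scikit-learn's FactorAnalysis (the rotation argument requires scikit-learn >= 0.24; the random data here is a stand-in for 50 standardized team statistics):

import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
stats_df = pd.DataFrame(rng.normal(size=(300, 50)),
                        columns=[f"stat_{i}" for i in range(50)])  # stand-in data

fa = FactorAnalysis(n_components=6, rotation="varimax", random_state=0)
factors = fa.fit_transform(stats_df)

# The loadings show which original statistics drive each factor, which is
# what makes the varimax solution labelable ("offensive efficiency", etc.).
loadings = pd.DataFrame(fa.components_.T, index=stats_df.columns)
print(loadings[0].abs().sort_values(ascending=False).head())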

Question 9. A feature "turnover margin per game" has a Pearson correlation of 0.38 with the binary target "win." However, when the feature is added to an existing 12-feature model, the model's out-of-sample performance decreases. Which of the following is the most likely explanation?

(A) The correlation of 0.38 was a computational error

(B) Turnover margin is redundant with other features already in the model, adding noise without new signal

(C) Binary targets cannot be predicted using continuous features

(D) The sample size is too large for the correlation to be meaningful

Answer **(B) Turnover margin is redundant with other features already in the model, adding noise without new signal.** Even though turnover margin correlates with winning, if the existing 12 features already capture the same information (e.g., through EPA/play, which accounts for turnovers), adding the redundant feature increases model complexity without adding new predictive signal. The additional complexity can cause overfitting, especially with limited sample sizes typical in sports data.

Question 10. A modeler uses the closing line as a feature in their spread prediction model and reports 58% accuracy against the spread in backtesting. Why is this result misleading?

(A) 58% accuracy is below the breakeven threshold

(B) The closing line incorporates the very information the model is trying to predict, making the task trivially easy

(C) The closing line is only available for nationally televised games

(D) Spread prediction should use log-loss, not accuracy, as the evaluation metric

Answer **(B) The closing line incorporates the very information the model is trying to predict, making the task trivially easy.** The closing line is the market's final assessment of the game and is the single strongest predictor of the final margin. Using it as a feature essentially asks the model to predict the outcome using a near-perfect prediction as input. In practice, bets must be placed before the closing line is set, so this feature would not be available at decision time. This is a classic form of temporal leakage.

Section 2: Short Answer (8 questions, 5 points each = 40 points)

Question 11. Explain the difference between a "lagged feature" and a "rolling aggregate feature." Provide one example of each for an NBA betting model and state which is more resistant to single-game outliers.

Answer A **lagged feature** uses the raw value from a specific past observation --- for example, "points scored in the previous game" (lag-1). A **rolling aggregate feature** computes a summary statistic (mean, median, standard deviation) over a window of past observations --- for example, "average points scored over the last 5 games." NBA example of a lagged feature: the team's assist-to-turnover ratio in their most recent game. NBA example of a rolling aggregate: the team's 10-game rolling average defensive rating. The rolling aggregate is more resistant to single-game outliers because it averages over multiple observations. A lagged feature from a single game can be heavily influenced by blowout games, overtime periods, or unusual circumstances (e.g., a key player ejected early).
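
Both feature types in one sketch, assuming hypothetical columns team, game_date, and def_rating; the shift(1) keeps each one leak-free:

import pandas as pd

def add_lag_and_rolling(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["team", "game_date"]).copy()
    grp = df.groupby("team")["def_rating"]
    df["def_rating_lag1"] = grp.shift(1)          # lagged feature (last game only)
    df["def_rating_roll10"] = grp.transform(      # rolling aggregate (10 games)
        lambda s: s.shift(1).rolling(10, min_periods=3).mean()
    )
    return df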

Question 12. Define "target encoding" for a categorical feature. Write the formula for a smoothed target encoding that blends the category-specific mean with the global mean, and explain why smoothing is necessary.

Answer **Target encoding** replaces each category with the mean of the target variable for that category. For a category c with n_c observations and category mean y_c, and a global mean y_global, the smoothed encoding is:

encoded(c) = (n_c * y_c + m * y_global) / (n_c + m)

where m is a smoothing parameter (often set to the overall sample size divided by the number of categories, or tuned via cross-validation). Smoothing is necessary because categories with few observations have unreliable mean estimates. Without smoothing, a referee who officiated only 2 games where the home team covered both times would receive an encoding of 1.0, which is an extreme and unreliable value. Smoothing pulls low-count categories toward the global mean, reducing overfitting. The larger the smoothing parameter m, the more the encoding resembles the global mean.
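
A minimal implementation of the smoothed encoding (in production this should be computed out-of-fold so that each row's encoding never sees its own target value):

import pandas as pd

def smoothed_target_encode(df: pd.DataFrame, cat_col: str,
                           target_col: str, m: float = 20.0) -> pd.Series:
    """encoded(c) = (n_c * y_c + m * y_global) / (n_c + m)."""
    y_global = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + m * y_global) / (stats["count"] + m)
    return df[cat_col].map(encoding)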

Question 13. A modeler creates 200 candidate features and uses Lasso (L1 regularization) to select the most important ones. After tuning the regularization parameter lambda, the Lasso retains 18 features. Describe one advantage and one disadvantage of using Lasso for feature selection compared to using permutation importance with a Random Forest.

Answer **Advantage of Lasso:** Lasso performs feature selection as part of model fitting, producing a sparse model in a single step. It tends to select one feature from a group of highly correlated features and set the others to zero, which naturally handles multicollinearity. This is efficient and produces an interpretable, parsimonious model.

**Disadvantage of Lasso:** Lasso assumes a linear (or generalized linear) relationship between features and the target. If important features have nonlinear or interaction-based relationships with the target, Lasso may discard them because they have weak linear effects. Permutation importance with a Random Forest captures nonlinear relationships and interaction effects, making it more suitable when the true data-generating process is complex. However, permutation importance does not produce a sparse model directly and can be biased when features are correlated.
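
A synthetic sketch of the Lasso workflow described in the question (make_regression stands in for the 200 candidate features; scaling matters because the L1 penalty treats all coefficients alike):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=200, n_informative=15,
                       noise=10.0, random_state=0)  # synthetic stand-in data

pipe = make_pipeline(StandardScaler(), LassoCV(cv=5))
pipe.fit(X, y)

kept = np.flatnonzero(pipe[-1].coef_)  # indices of the retained features
print(f"Lasso retained {kept.size} of 200 features")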

Question 14. Explain why a team's "season-to-date win percentage" is a poor feature for predicting outcomes in the first three weeks of an NFL season. Propose two alternative features that are more stable in small samples and explain why.

Answer Season-to-date win percentage is poor in early weeks because with 1-3 games, the feature is extremely noisy. A team that is 0-2 has a win percentage of 0%, and a team that is 2-0 has 100%, but a two-game sample tells us very little about true team quality. The variance of the sample proportion p is p(1-p)/n, which is enormous when n is 2 or 3.

**Alternative 1:** Preseason power ratings blended with early-season results. Use preseason projections (e.g., from market win totals) as a prior and Bayesian-update with each week's result. This anchors early-season ratings to reasonable prior beliefs while still incorporating new information (see the sketch after this answer).

**Alternative 2:** EPA/play or point differential per game, which are continuous metrics with much lower per-game variance than binary win/loss. A team's EPA/play through 2 games provides far more information about ability than their 1-1 record, because it captures the margin and quality of performance rather than just the binary outcome.
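
A sketch of Alternative 1's shrinkage blend, where the prior weight k (measured in pseudo-games) is an illustrative choice:

def blended_rating(preseason_rating: float, observed_epa: float,
                   games_played: int, k: float = 8.0) -> float:
    """Weight on observed data grows with sample size; at k games played,
    the preseason prior and the observed data count equally."""
    w = games_played / (games_played + k)
    return w * observed_epa + (1 - w) * preseason_rating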

Question 15. Describe the "garbage time" filtering problem in NFL feature engineering. Define specific criteria you would use to identify garbage time plays, and explain the tradeoff between filtering too aggressively and not filtering at all.

Answer **Garbage time** refers to late-game situations where the outcome is effectively decided, causing teams to change their behavior (e.g., the trailing team passes aggressively against a prevent defense). If unfiltered, garbage time inflates certain statistics (passing yards for trailing teams, defensive yards allowed for leading teams) in ways that misrepresent true team quality.

**Criteria for identifying garbage time:** Score differential exceeding 28 points at any time, or exceeding 21 points in the 4th quarter, or win probability below 5% or above 95% based on score, time remaining, and possession. More granular approaches use continuous win-probability-based weighting rather than a binary cutoff.

**Tradeoff:** Filtering too aggressively removes legitimate plays from competitive games (e.g., a team trailing by 21 in the third quarter can still win) and reduces sample size. Not filtering at all allows garbage-time statistics to distort team quality assessments. The recommended approach is either to use a continuous weight based on win probability (so garbage-time plays are downweighted but not discarded, as sketched below) or to use a liberal threshold (e.g., win probability < 5%) that only removes the most extreme situations.
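
A sketch of the continuous-weight option; the 5%/95% band and the linear ramp are illustrative choices, not calibrated values:

def garbage_time_weight(win_prob: float) -> float:
    """Full weight for competitive situations, fading linearly to zero
    at the extremes instead of using a hard binary cutoff."""
    if 0.05 <= win_prob <= 0.95:
        return 1.0
    edge = min(win_prob, 1.0 - win_prob)  # distance from 0 or 1
    return edge / 0.05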

Question 16. A feature "travel distance to game" is measured in miles and ranges from 0 (home game) to 2,800 (cross-country away game). Describe three different ways to transform this feature for use in a model, and explain the hypothesis behind each transformation.

Answer **Transformation 1: Log transformation.** Use log(1 + distance) to compress the upper range. Hypothesis: the fatigue impact of travel increases with distance but at a diminishing rate --- flying from New York to Chicago (700 miles) is noticeably tiring, but the difference between flying to Denver (1,600 miles) and Seattle (2,400 miles) is less impactful because both involve a full cross-country travel day.

**Transformation 2: Time-zone count.** Create a feature that counts the number of time zones crossed (0, 1, 2, or 3). Hypothesis: the primary mechanism of travel disadvantage is jet lag from time-zone changes, not physical distance. A 500-mile trip within the same time zone is less disruptive than a 500-mile trip that crosses a time-zone boundary.

**Transformation 3: Bucketed categories.** Create buckets: "home" (0 miles), "regional" (1-500 miles), "medium" (501-1500 miles), and "cross-country" (1500+ miles). Hypothesis: teams experience travel in discrete categories rather than continuous gradations. The operational difference between a 200-mile and 400-mile trip is negligible (both are short flights), while the difference between a 400-mile and 1,600-mile trip is significant (short flight versus a full travel day).
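
All three transformations in one sketch (the bin edges and column names are illustrative):

import numpy as np
import pandas as pd

def travel_features(miles: pd.Series, tz_crossed: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({
        "log_miles": np.log1p(miles),        # diminishing-fatigue hypothesis
        "time_zones": tz_crossed,            # jet-lag hypothesis
        "travel_bucket": pd.cut(             # discrete-categories hypothesis
            miles, bins=[-1, 0, 500, 1500, 3000],
            labels=["home", "regional", "medium", "cross_country"]),
    })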

Question 17. Explain the concept of "feature drift" in the context of a sports betting model that has been running for three seasons. Provide a concrete example of a feature whose distribution might change over time in the NFL, and describe how you would detect and respond to this drift.

Answer **Feature drift** (also called covariate shift) occurs when the statistical distribution of a feature changes over time, even though the feature's definition remains the same. This can degrade model performance because the model was trained on data with a different feature distribution than it now encounters.

**Concrete example:** NFL pass rate has increased steadily over the past decade, from approximately 56% in 2015 to over 60% in recent seasons. A model trained on 2015-2018 data would learn the relationship between pass rate and outcomes in a lower-pass-rate environment. By 2023, the league-average pass rate has shifted, meaning a "high" pass rate in 2018 might be "average" in 2023.

**Detection:** Monitor the distribution of each feature over time using statistical tests such as the Kolmogorov-Smirnov test, or by tracking summary statistics (mean, standard deviation, quantiles) on a rolling basis. Set alert thresholds for when the current feature distribution deviates significantly from the training distribution (see the sketch below).

**Response:** Options include (a) retraining the model on more recent data, (b) normalizing the feature relative to the current season's distribution rather than using raw values, (c) using within-season z-scores (how many standard deviations from this season's mean) instead of raw values, or (d) implementing an expanding-window retraining schedule that automatically incorporates new data.
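
A monitoring sketch using the two-sample Kolmogorov-Smirnov test from SciPy; the alpha threshold is a monitoring choice, not a rule:

from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, alpha: float = 0.01) -> dict:
    """Compare the training-era and current distributions of one feature."""
    stat, p_value = ks_2samp(train_values, live_values)
    return {"ks_stat": stat, "p_value": p_value, "drifted": p_value < alpha}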

Question 18. A dataset for NBA game prediction contains both team-level aggregate features (e.g., season offensive rating) and game-level contextual features (e.g., back-to-back indicator, travel distance). Explain why these two types of features require different update frequencies and describe a pipeline architecture that handles both.

Answer **Team-level aggregate features** (offensive rating, defensive rating, pace) change gradually as each game's data is incorporated. They should be updated after every game but change slowly because they average over many observations. They represent a team's underlying ability.

**Game-level contextual features** (back-to-back indicator, rest days, travel distance, injuries) are specific to each game and change abruptly. A team that played last night is on a back-to-back today but will not be tomorrow. These features must be computed fresh for each game and cannot be cached or smoothed.

**Pipeline architecture:**

1. A **nightly batch job** runs after all games are completed, updates team-level rolling statistics (efficiency metrics, win rates, rating systems), and stores them in a feature store keyed by (team, date).
2. A **game-day job** runs hours before tip-off, queries the feature store for current team-level features, and computes game-level contextual features from the schedule, injury reports, and rest-day calendars.
3. A **merge step** joins team-level features for both the home and away team with the game-level contextual features to produce the final feature vector for prediction (sketched below).

This architecture ensures team-level features are always current (updated nightly) while game-level features are computed fresh for each prediction.
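
A sketch of the merge step (step 3) with hypothetical column names; team_feats is the nightly job's output keyed by (team, date):

import pandas as pd

def build_game_features(schedule: pd.DataFrame,
                        team_feats: pd.DataFrame) -> pd.DataFrame:
    """Join team-level features onto each game for both sides; game-level
    context (rest, travel, back-to-back) is assumed already on `schedule`."""
    out = schedule.copy()
    for side in ("home", "away"):
        side_feats = team_feats.rename(
            columns=lambda c: c if c in ("team", "date") else f"{side}_{c}")
        out = out.merge(side_feats,
                        left_on=[f"{side}_team", "feature_date"],
                        right_on=["team", "date"],
                        how="left").drop(columns=["team", "date"])
    return out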

Section 3: Applied Problems (4 questions, 7.5 points each = 30 points)

Question 19. You are building an NFL spread prediction model. You have the following candidate features for the home team:

  • EPA/play (offense), EPA/play (defense)
  • Success rate (offense), Success rate (defense)
  • Turnover rate, Red zone efficiency
  • Third-down conversion rate
  • Yards per play (offense), Yards per play (defense)

Compute the correlation matrix conceptually (describe which pairs you expect to be highly correlated) and recommend a reduced set of 4-5 features that minimizes redundancy while maintaining coverage of distinct predictive dimensions. Justify each selection.

Answer **Expected high correlations:**

  • EPA/play (offense) and Yards per play (offense): r likely > 0.75, as both measure offensive efficiency per play.
  • EPA/play (offense) and Success rate (offense): r likely > 0.65, as successful plays drive positive EPA.
  • EPA/play (defense) and Yards per play (defense): r likely > 0.70, same logic as the offensive pair.

**Recommended reduced set (5 features):**

1. **EPA/play (offense):** The best single measure of offensive efficiency; captures both explosiveness and consistency. Preferred over yards per play because it is adjusted for game situation (down, distance, field position).
2. **EPA/play (defense):** Same reasoning on the defensive side. Subsumes yards per play allowed.
3. **Turnover rate:** Captures a dimension (ball security) not fully reflected in EPA/play, and turnovers have outsized effects on game outcomes beyond their EPA impact.
4. **Red zone efficiency:** Captures scoring efficiency in the critical area inside the 20-yard line. Teams can differ significantly in red zone performance even with similar EPA/play.
5. **Third-down conversion rate:** Captures the ability to sustain drives. While correlated with success rate, third-down conversions have a unique "clutch" dimension and directly affect time of possession and opponent opportunities.

Yards per play (offense and defense) are dropped because they are highly redundant with EPA/play. Success rate is dropped because it is substantially captured by EPA/play. This five-feature set covers four distinct dimensions: offensive efficiency, defensive efficiency, ball security, and scoring conversion.

Question 20. A colleague presents an NBA model with the following feature engineering pipeline and asks for your review. Identify all issues:

  1. Load all games from the 2019-2023 seasons.
  2. Compute each team's season-average offensive and defensive rating.
  3. Compute the difference (home team rating - away team rating) as the primary feature.
  4. Standardize all features using the mean and standard deviation of the full dataset.
  5. Train a logistic regression model using 80/20 random train/test split.
  6. Report accuracy on the test set.
Answer **Issue 1 -- Temporal leakage in season averages (Step 2):** Computing the full-season average for each team means that for a game played in January, the feature includes data from games played in February, March, and April. The feature must be computed using only games played before each prediction date (a rolling or expanding window).

**Issue 2 -- Standardization data leakage (Step 4):** Computing the mean and standard deviation on the full dataset (including the test set) leaks information from the test set into the training process. The standardization parameters should be computed on the training set only, then applied to the test set.

**Issue 3 -- Random train/test split violates temporal structure (Step 5):** An 80/20 random split mixes games from all five seasons into both train and test sets, meaning the model can train on April 2023 games to predict January 2023 games. The correct approach is a temporal split: train on 2019-2022, test on 2023 (or use walk-forward validation within each season).

**Issue 4 -- Single-season averages ignore within-season dynamics:** A team's ability changes throughout the season due to injuries, trades, and development. A single average for the whole season misses these dynamics. Exponentially weighted or rolling averages would better capture current team quality.

**Issue 5 -- Accuracy as the evaluation metric (Step 6):** For a betting model, accuracy is a poor metric because it treats all games equally. A model that is 55% accurate but only confident enough to bet on 20% of games is more valuable than one that is 53% accurate on all games. Brier score, log-loss, or profit-based metrics (ROI with a betting threshold) are more appropriate. A corrected skeleton for the split and scaling steps follows below.
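
A corrected skeleton for Issues 2 and 3, on a synthetic stand-in for the colleague's dataset (column names are hypothetical):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
rng = np.random.default_rng(0)
games = pd.DataFrame({
    "season": rng.choice([2019, 2020, 2021, 2022, 2023], size=5000),
    "rating_diff": rng.normal(0, 5, size=5000),
})
games["home_win"] = (games["rating_diff"] + rng.normal(0, 10, 5000) > 0).astype(int)

# Issue 3 fix: temporal split, never a random one.
train = games[games["season"] <= 2022]
test = games[games["season"] == 2023]

# Issue 2 fix: standardization parameters come from the training rows only.
scaler = StandardScaler().fit(train[["rating_diff"]])
model = LogisticRegression().fit(scaler.transform(train[["rating_diff"]]),
                                 train["home_win"])
print(model.score(scaler.transform(test[["rating_diff"]]), test["home_win"]))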

Question 21. Write Python code (or detailed pseudocode) for a function that creates an "opponent-adjusted EPA/play" feature for NFL teams. The function should:

  • Take a DataFrame of game results with columns [team, opponent, week, epa_per_play]
  • Iteratively compute opponent adjustments (at least 10 iterations)
  • Return each team's adjusted EPA/play for each week using only data from prior weeks

Describe the convergence behavior you expect.

Answer
import pandas as pd

def compute_opponent_adjusted_epa(
    games: pd.DataFrame,
    n_iterations: int = 10
) -> pd.DataFrame:
    """Compute opponent-adjusted offensive EPA/play via iterative adjustment.

    Each row of `games` is one team-game: [team, opponent, week,
    epa_per_play], where epa_per_play is that team's offensive EPA/play.
    For each week w, uses only games from weeks 1 through w-1, and
    iterates to resolve the circularity: an offense's rating depends on
    the defenses it faced, whose ratings depend on the offenses they faced.
    """
    results = []
    all_weeks = sorted(games["week"].unique())

    for target_week in all_weeks[1:]:  # Can't compute for Week 1
        prior_games = games[games["week"] < target_week].copy()
        league_avg = prior_games["epa_per_play"].mean()

        # Initialize offensive ratings from raw EPA/play and defensive
        # ratings from raw EPA/play allowed (rows grouped by opponent).
        off_rating = prior_games.groupby("team")["epa_per_play"].mean()
        def_rating = prior_games.groupby("opponent")["epa_per_play"].mean()

        for _ in range(n_iterations):
            # Adjust each offensive performance for the defense it faced,
            # centered on the league average: facing a defense that allows
            # less EPA than average raises the adjusted value.
            prior_games["opp_def"] = prior_games["opponent"].map(def_rating)
            prior_games["adj_off"] = (
                prior_games["epa_per_play"]
                - (prior_games["opp_def"] - league_avg)
            )

            # Symmetrically adjust each defensive performance (EPA allowed)
            # for the strength of the offense it faced.
            prior_games["opp_off"] = prior_games["team"].map(off_rating)
            prior_games["adj_def"] = (
                prior_games["epa_per_play"]
                - (prior_games["opp_off"] - league_avg)
            )

            # Update both rating sets from the adjusted values.
            off_rating = prior_games.groupby("team")["adj_off"].mean()
            def_rating = prior_games.groupby("opponent")["adj_def"].mean()

        for team, rating in off_rating.items():
            results.append({
                "team": team, "week": target_week,
                "adj_epa_per_play": rating
            })

    return pd.DataFrame(results)
**Convergence behavior:** The ratings typically converge within 5-10 iterations. The first iteration makes the largest adjustments, as teams that played weak opponents see their ratings drop and teams that played strong opponents see their ratings rise. Each subsequent iteration produces smaller changes as the adjustments propagate through the network of games. Convergence can be monitored by tracking the maximum absolute change in any team's rating between iterations; when this falls below a threshold (e.g., 0.001), the algorithm has effectively converged.

Question 22. Design a complete feature engineering experiment to determine whether adding player-level features improves an NBA spread prediction model beyond team-level features alone. Specify:

  • The control model (team-level features only) and treatment model (team + player features)
  • At least three specific player-level features you would construct
  • The evaluation methodology (validation scheme, metrics, statistical test)
  • The minimum sample size needed and how you would ensure the comparison is fair

Answer **Control model:** Uses team-level features only: offensive rating, defensive rating, pace, effective FG%, turnover rate, offensive rebound rate, free throw rate (the four factors), plus situational features (rest days, home/away, travel distance). Total: approximately 12 features.

**Treatment model:** Adds player-level features:

1. **Star player availability index:** Sum of win shares per 48 minutes for all players in the projected starting lineup, divided by the team's full-strength total. Captures the impact of injuries to key players.
2. **Lineup continuity score:** Fraction of games in the last 10 where the projected starting five played together. Captures chemistry and the disruption caused by lineup changes.
3. **Matchup-specific player advantage:** The difference in Player Efficiency Rating (PER) between each team's best player at each position, aggregated into a single "matchup edge" score. Captures individual matchup dynamics.

**Evaluation methodology:**

  • **Validation scheme:** Walk-forward validation across three full NBA seasons (train on all data before each game, predict that game, advance). No random splitting.
  • **Primary metric:** Brier score (average squared error of predicted win probabilities).
  • **Secondary metric:** Simulated ROI assuming bets are placed when the model's edge exceeds 3% over the implied probability.
  • **Statistical test:** Paired Diebold-Mariano test on the game-level Brier score differences, using a one-sided alternative (treatment model has lower Brier score); see the sketch below.

**Sample size:** Three NBA seasons provide approximately 3,690 games. With a typical Brier score improvement of 0.002-0.005 and a game-level standard deviation of Brier score differences around 0.05, roughly 1,700 games suffice for 80% power to detect a 0.003 improvement at one-sided alpha = 0.05 (via the standard paired t-test power formula), so 3,690 games provide power well above 80%.

**Fairness:** Both models are trained and evaluated on identical train/test splits. Hyperparameters for both models are tuned via nested cross-validation within the training set only. The same random seeds are used for both models if the algorithm involves randomness.
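
A sketch of the game-level comparison; the paired t-test here is a simpler stand-in for the Diebold-Mariano test named above (with one-step-ahead forecasts the two are closely related), and ttest_rel's alternative argument requires SciPy >= 1.6:

import numpy as np
from scipy.stats import ttest_rel

def compare_brier(p_control: np.ndarray, p_treatment: np.ndarray,
                  outcomes: np.ndarray) -> dict:
    """One-sided paired test: does the treatment model have lower Brier score?"""
    brier_c = (p_control - outcomes) ** 2
    brier_t = (p_treatment - outcomes) ** 2
    stat, p_value = ttest_rel(brier_t, brier_c, alternative="less")
    return {"mean_improvement": float(np.mean(brier_c - brier_t)),
            "p_value": float(p_value)}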