Chapter 9 Quiz: Regression Analysis for Sports Modeling

Instructions: Answer all 25 questions. This quiz is worth 100 points. You have 60 minutes. A calculator is permitted; no notes or internet access. For multiple choice, select the single best answer.


Section 1: Multiple Choice (10 questions, 3 points each = 30 points)

Question 1. In a linear regression model predicting NFL point differentials, the coefficient on "home team offensive EPA per play" is 0.15. This means:

(A) The home team scores 0.15 more points per game on average

(B) For every one-unit increase in offensive EPA per play, the predicted point differential increases by 0.15, holding other variables constant

(C) Offensive EPA explains 15% of the variance in point differentials

(D) The correlation between offensive EPA and point differentials is 0.15

Answer **(B) For every one-unit increase in offensive EPA per play, the predicted point differential increases by 0.15, holding other variables constant.** In multiple regression, each coefficient represents the expected change in the response variable for a one-unit change in the corresponding predictor, while holding all other predictors constant. Option (A) confuses the coefficient with the prediction itself. Option (C) confuses the coefficient with R-squared. Option (D) confuses the coefficient with the correlation coefficient.

Question 2. Which of the following is the primary reason to use logistic regression instead of linear regression when modeling the probability of a team winning?

(A) Logistic regression is always more accurate than linear regression

(B) Linear regression can produce predicted probabilities outside the [0, 1] range

(C) Logistic regression requires fewer data points to fit

(D) Linear regression cannot handle categorical predictor variables

Answer **(B) Linear regression can produce predicted probabilities outside the [0, 1] range.** When the outcome is binary (win/loss), linear regression (the linear probability model) can produce predicted values below 0 or above 1, which are meaningless as probabilities. Logistic regression uses the sigmoid function to constrain outputs to the [0, 1] interval. Option (A) is false; linear regression can be competitive for some tasks. Option (C) is false; logistic regression typically needs more data. Option (D) is false; both can handle categorical predictors via dummy encoding.

Question 3. A regression model has an $R^2$ of 0.35 on the training set and 0.12 on the test set. This pattern most likely indicates:

(A) The model is underfitting

(B) The model is overfitting

(C) The data has high measurement error

(D) The training and test sets come from different populations

Answer **(B) The model is overfitting.** A large gap between training and test performance is the hallmark of overfitting: the model has learned patterns specific to the training data (including noise) that do not generalize to new data. Underfitting would show poor performance on both sets. While (C) and (D) are possible contributing factors, the pattern described is most directly indicative of overfitting.

Question 4. The Variance Inflation Factor (VIF) for a feature is 10.5. This indicates:

(A) The feature explains 10.5% of the variance in the outcome

(B) The feature's coefficient variance is inflated by a factor of 10.5 due to multicollinearity

(C) Removing the feature would improve model accuracy by 10.5%

(D) The feature has 10.5 times the predictive power of the average feature

Answer **(B) The feature's coefficient variance is inflated by a factor of 10.5 due to multicollinearity.** VIF measures how much the variance of a regression coefficient is inflated due to linear dependence with other predictors. A VIF of 10.5 means the standard error of that coefficient is approximately $\sqrt{10.5} \approx 3.24$ times larger than it would be if the feature were uncorrelated with all other predictors. This makes the coefficient estimate less precise and its p-value less reliable. A VIF above 5-10 is commonly flagged as problematic.
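
For illustration, a minimal sketch of how VIF is typically computed with statsmodels; the feature names and values are hypothetical, not drawn from the chapter's data.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature frame: off_epa and pass_epa are deliberately near-duplicates
X = pd.DataFrame({
    "off_epa":   [0.12, 0.05, -0.03, 0.08, 0.15, 0.01, -0.07, 0.10],
    "pass_epa":  [0.14, 0.04, -0.05, 0.09, 0.18, 0.02, -0.06, 0.11],
    "rest_days": [3, 6, 7, 4, 3, 10, 6, 7],
})
X_const = sm.add_constant(X)  # VIFs should be computed with an intercept column present
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const"))  # values above roughly 5-10 flag problematic multicollinearity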

Question 5. In the context of sports regression models, "feature engineering" refers to:

(A) Removing outliers from the training data

(B) Creating new predictor variables by transforming or combining raw data

(C) Selecting the best algorithm for the prediction task

(D) Tuning the hyperparameters of the regression model

Answer **(B) Creating new predictor variables by transforming or combining raw data.** Feature engineering is the process of using domain knowledge to create new variables from raw data that better capture the underlying relationships. Examples include computing rolling averages, creating interaction terms, calculating strength-of-schedule adjustments, or encoding categorical variables. It is widely considered the most impactful step in the modeling pipeline for sports analytics.

Question 6. A logistic regression model outputs a predicted probability of 0.65 for the home team winning. The sportsbook's implied probability (after removing vig) is 0.58. The bettor should:

(A) Bet on the home team because the model probability exceeds the market probability

(B) Bet on the away team because the market is always more accurate

(C) Not bet because the difference is within the margin of error

(D) Bet on the home team only if the model's Brier score is below 0.20

Answer **(A) Bet on the home team because the model probability exceeds the market probability.** When the model assigns a higher probability to an outcome than the market implies (after vig removal), the bet has positive expected value according to the model. The edge is 0.65 - 0.58 = 0.07, or 7 percentage points. While options (C) and (D) raise valid practical considerations (model uncertainty and calibration quality), the fundamental principle is that a positive edge, as estimated by a validated model, is a betting opportunity. The bettor should also consider bet sizing, model confidence, and sample-size caveats.
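
To make the comparison concrete, here is a minimal sketch (my own helper functions, not from the chapter) that converts American odds to implied probabilities, removes the vig, and checks the model's edge; the quoted odds of -150/+130 are hypothetical and happen to imply roughly 0.58 on the home side after vig removal.

def implied_prob(american_odds):
    """Implied win probability (vig included) from American odds."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def no_vig_probs(odds_a, odds_b):
    """Normalize the two sides so the implied probabilities sum to 1 (vig removed)."""
    p_a, p_b = implied_prob(odds_a), implied_prob(odds_b)
    total = p_a + p_b
    return p_a / total, p_b / total

model_prob = 0.65
home_fair, _ = no_vig_probs(-150, +130)  # hypothetical moneyline quotes on home/away
print(f"fair home probability: {home_fair:.3f}")   # ~0.580
print(f"model edge: {model_prob - home_fair:.3f}") # ~0.070, i.e. about 7 percentage points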

Question 7. Which regularization method would you choose if you suspect that many of your features are irrelevant and you want the model to automatically set their coefficients to exactly zero?

(A) Ridge regression (L2)

(B) Lasso regression (L1)

(C) Elastic Net

(D) Ordinary least squares with backward elimination

Answer **(B) Lasso regression (L1).** Lasso (Least Absolute Shrinkage and Selection Operator) applies an L1 penalty that can shrink coefficients to exactly zero, effectively performing automatic feature selection. Ridge regression shrinks coefficients toward zero but never sets them exactly to zero. Elastic Net combines L1 and L2 and can also produce zero coefficients, but Lasso is the standard choice when the primary goal is sparsity. Backward elimination is a valid alternative but is computationally more expensive and less stable.
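
A minimal sketch on synthetic data showing the contrast; the data, penalty strengths, and feature count are illustrative assumptions, not values from the chapter.

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                             # 10 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=500)   # only the first two actually matter

X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
ridge = Ridge(alpha=10.0).fit(X_scaled, y)

print("Lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))  # most of the 8 irrelevant ones
print("Ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))  # typically 0: shrunk, never exactly zero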

Question 8. The Durbin-Watson statistic for a regression model is 0.85. This suggests:

(A) Strong positive autocorrelation in the residuals

(B) No autocorrelation in the residuals

(C) Strong negative autocorrelation in the residuals

(D) Heteroscedasticity in the residuals

Answer **(A) Strong positive autocorrelation in the residuals.** The Durbin-Watson statistic ranges from 0 to 4. A value of 2.0 indicates no autocorrelation. Values significantly below 2.0 indicate positive autocorrelation (successive residuals tend to have the same sign), and values significantly above 2.0 indicate negative autocorrelation. A value of 0.85 is well below 2.0, indicating strong positive autocorrelation. This could occur if the model is missing a time-dependent feature or if games are ordered chronologically and there is a seasonal trend.
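
A minimal sketch of computing the statistic with statsmodels; the AR(1) errors are synthetic, planted to produce the positive autocorrelation described above.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()   # AR(1) errors: successive residuals correlated
y = 2.0 + 1.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"Durbin-Watson: {durbin_watson(fit.resid):.2f}")  # well below 2.0 -> positive autocorrelation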

Question 9. Cross-validation is used in sports modeling primarily to:

(A) Increase the training set size

(B) Estimate out-of-sample performance and guard against overfitting

(C) Automatically select the best features

(D) Ensure the model meets the normality assumption

Answer **(B) Estimate out-of-sample performance and guard against overfitting.** Cross-validation partitions the data into multiple train/test folds, training the model on each training portion and evaluating on the held-out portion. The average performance across folds provides a more reliable estimate of how the model will perform on truly new data. This is crucial in sports betting, where overfitting to historical data is the primary failure mode. Note that for time-series sports data, time-series cross-validation (walk-forward validation) should be used instead of random k-fold to avoid future data leakage.
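
A minimal sketch of walk-forward evaluation with scikit-learn's TimeSeriesSplit; the file and column names follow the quiz's earlier example and are assumptions about the data layout, and the games must be sorted chronologically before splitting.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import TimeSeriesSplit

df = pd.read_csv("nba_games.csv").sort_values("game_date")   # chronological order is essential
X = df[["net_rating", "pace", "rest_days"]].to_numpy()
y = df["home_win"].to_numpy()

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])   # train only on earlier games
    probs = model.predict_proba(X[test_idx])[:, 1]
    scores.append(log_loss(y[test_idx], probs))                    # evaluate only on later games
print(f"mean walk-forward log-loss: {sum(scores) / len(scores):.3f}")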

Question 10. When building a regression model for betting, a "look-ahead bias" occurs when:

(A) You use too many predictor variables

(B) You include information in training features that would not have been available at prediction time

(C) You test on the same data used for training

(D) You fail to account for home-field advantage

Answer **(B) You include information in training features that would not have been available at prediction time.** Look-ahead bias (also called future data leakage) is one of the most insidious errors in sports model building. It occurs when features computed using future information are included in the model. For example, using a team's end-of-season offensive rating to predict games from Week 5, or including closing line data in a model that needs to make predictions before lines are set. This artificially inflates backtesting results and produces models that cannot perform in real-time deployment.

Section 2: True/False (5 questions, 3 points each = 15 points)

Write "True" or "False." Full credit requires correct identification only.


Question 11. True or False: Adding more features to a linear regression model will always increase (or leave unchanged) the $R^2$ on the training data.

Answer **True.** By construction, adding a feature to an OLS regression model can only increase or maintain the training $R^2$, never decrease it. This is because the optimization can always set the new coefficient to zero, recovering the original model. However, the adjusted $R^2$ can decrease if the new feature does not improve the fit enough to offset the penalty for the additional parameter. This is exactly why adjusted $R^2$ and cross-validated metrics are necessary for model selection.

Question 12. True or False: If a regression coefficient has a p-value of 0.03, it means there is a 3% probability that the coefficient is actually zero.

Answer **False.** The p-value of 0.03 means: assuming the null hypothesis (coefficient = 0) is true, there is a 3% probability of observing a coefficient as extreme or more extreme than the one estimated. It is a statement about the data given the hypothesis, not about the hypothesis given the data. The probability that the coefficient is actually zero is a Bayesian question that requires a prior distribution, which is not part of the frequentist framework. This is a commonly misunderstood distinction.

Question 13. True or False: In sports modeling, time-series cross-validation (walk-forward validation) is preferred over standard k-fold cross-validation because it prevents future information from leaking into the training set.

Answer **True.** Sports data is inherently temporal: games occur in sequence, and using future games to train a model that predicts past games is unrealistic. Standard k-fold cross-validation randomly assigns observations to folds, which can result in training on 2024 data and testing on 2023 data. Time-series (walk-forward) cross-validation maintains chronological order, always training on earlier data and testing on later data, which accurately simulates real-world deployment conditions.

Question 14. True or False: A model with an RMSE of 10 points for predicting NBA game totals is too inaccurate to be profitable for betting.

Answer **False.** Profitability in betting does not require perfect predictions; it requires predictions that are more accurate than the market's implied probabilities frequently enough to overcome the vig. An RMSE of 10 points means the model's typical prediction error is about 10 points, but if that model is systematically better than the market on a subset of games (e.g., identifying when the total is significantly mispriced), it can be profitable even with large average errors. The key metrics for betting profitability are edge identification and calibration, not RMSE alone.

Question 15. True or False: Ridge regression (L2 regularization) is preferred over Lasso (L1 regularization) when you believe all features contribute some predictive value and none should be entirely excluded.

Answer **True.** Ridge regression shrinks all coefficients toward zero but never sets any exactly to zero, which preserves all features in the model. This is appropriate when domain knowledge suggests that many correlated features each contribute incrementally to the prediction. Lasso, by contrast, tends to select one feature from a group of correlated features and zero out the rest. When predictors are correlated but all theoretically relevant (common in sports data), Ridge retains their combined information while controlling overfitting.

Section 3: Fill in the Blank (3 questions, 4 points each = 12 points)


Question 16. In logistic regression, the function that maps the linear predictor to a probability between 0 and 1 is called the __________ function, defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$.

Answer **Sigmoid** (also accepted: "logistic" or "inverse logit"). The sigmoid function transforms any real-valued input into the (0, 1) interval, making it suitable for modeling probabilities. It is the inverse of the logit function: while the logit maps probabilities to the real line via $\log(p / (1-p))$, the sigmoid maps real values back to probabilities. In logistic regression, the linear combination of features $z = \beta_0 + \beta_1 x_1 + \ldots$ is passed through the sigmoid to produce a predicted probability.
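
A minimal numerical illustration with hypothetical linear-predictor values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])   # hypothetical values of beta_0 + beta_1*x_1 + ...
print(sigmoid(z))   # ~[0.047, 0.269, 0.500, 0.731, 0.953], all strictly inside (0, 1)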

Question 17. The diagnostic statistic that measures the influence of a single observation on the fitted regression by comparing the full model to the model with that observation removed is called __________ distance.

Answer **Cook's** distance. Cook's distance measures the aggregate change in all fitted values when a single observation is deleted. Observations with high Cook's distance are considered influential because their removal substantially changes the regression coefficients. A common rule of thumb flags observations with Cook's distance greater than $4/n$ or greater than 1.0. In sports modeling, games with unusual circumstances (blowouts, weather events, key injuries mid-game) often have high Cook's distance.
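
A minimal sketch of the statsmodels API for Cook's distance, using synthetic data with one planted influential observation.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)
y[0] += 25.0                                       # plant a single influential outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = fit.get_influence().cooks_distance[0]    # one Cook's distance per observation
flagged = np.where(cooks_d > 4 / len(y))[0]        # the common 4/n rule of thumb
print("influential observations:", flagged)        # observation 0 should be flagged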

Question 18. When the variance of regression residuals is not constant across fitted values (e.g., prediction errors are larger for high-scoring games than low-scoring games), the condition is called __________.

Answer **Heteroscedasticity**. Heteroscedasticity violates the OLS assumption of constant error variance (homoscedasticity). When present, OLS coefficient estimates remain unbiased but are no longer the most efficient estimators, and standard errors (and therefore p-values and confidence intervals) become unreliable. In sports modeling, heteroscedasticity commonly arises because game outcomes become more variable in certain conditions (e.g., high-pace games have more variable totals). Remedies include weighted least squares, robust standard errors (White or HC3), or transforming the response variable.
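
A minimal sketch of one remedy mentioned above, heteroscedasticity-robust (HC3) standard errors in statsmodels, using synthetic data whose error variance grows with the predictor.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=300)
y = 5.0 + 2.0 * x + rng.normal(scale=x, size=300)   # error spread increases with x

X = sm.add_constant(x)
classical = sm.OLS(y, X).fit()                      # classical standard errors (unreliable here)
robust = sm.OLS(y, X).fit(cov_type="HC3")           # same coefficients, robust standard errors
print("classical SE:", classical.bse.round(3))
print("robust SE:   ", robust.bse.round(3))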

Section 4: Short Answer (3 questions, 5 points each = 15 points)

Answer each question in 3-5 sentences.


Question 19. Explain the difference between "training error" and "generalization error" in the context of a sports betting regression model. Why is the distinction critical for bettors?

Answer **Training error** is the model's prediction error measured on the same data used to fit the model. **Generalization error** (also called test error or out-of-sample error) is the error measured on new data the model has never seen. Training error tends to underestimate the true prediction error because the model has been optimized to fit the training data, including its noise. For bettors, this distinction is critical because betting decisions are always about future games --- data the model has never seen. A model that performs brilliantly on historical data but poorly on new games is worse than useless: it gives false confidence. The only honest assessment of a betting model's quality is its out-of-sample performance, measured through proper cross-validation or a genuine hold-out test period.

Question 20. Describe what a "calibration plot" shows for a logistic regression model predicting game outcomes, and explain why good calibration is essential for betting applications.

Answer A **calibration plot** (also called a reliability diagram) plots the predicted probability on the x-axis against the observed frequency of the outcome on the y-axis, typically after binning predictions into groups (e.g., deciles). A perfectly calibrated model lies on the 45-degree diagonal: when the model predicts 70% probability, the outcome occurs approximately 70% of the time. Good calibration is essential for betting because the bettor directly compares the model's predicted probability to the market's implied probability to compute expected value. If the model systematically overestimates probabilities (e.g., predicting 70% when the true frequency is 60%), the bettor will see phantom edges that do not exist, leading to systematic losses. A well-calibrated model produces probability estimates that can be trusted at face value for EV calculations and Kelly Criterion bet sizing.
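
A minimal, self-contained sketch of a reliability diagram using scikit-learn's calibration_curve; the synthetic classification data stands in for held-out game outcomes and predicted probabilities.

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)   # observed frequency vs. mean prediction, per bin
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()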

Question 21. Explain why using a team's current season statistics (e.g., season-to-date points per game) as features early in the season is problematic, and describe one approach to address this issue.

Answer Early in the season, current-season statistics are based on very few games and are therefore extremely noisy and unreliable. A team that has played only two games might show 35 points per game due to one blowout, but this is far from a stable estimate of their true scoring ability. Using these unstable estimates as regression features introduces high variance into predictions and can lead the model to overreact to small-sample fluctuations. One effective approach is **Bayesian shrinkage** (or empirical Bayes estimation), where early-season statistics are blended with a prior --- typically the team's previous season performance or the league average. As more games are played, the weight shifts from the prior to the current data. For example, after Week 2, you might use 70% prior (last season) and 30% current season; by Week 8, the blend might be 20% prior and 80% current. This produces more stable features that improve early-season predictions.
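
A minimal sketch of this blending idea; the helper and its weighting scheme are hypothetical illustrations, not a prescribed formula from the chapter.

def shrunk_average(current_avg, games_played, prior_avg, prior_weight_games=8):
    """Blend a season-to-date average with a prior (last season or league average).

    prior_weight_games is how many games the prior is "worth": when games_played
    equals it, current data and the prior carry equal weight.
    """
    w = games_played / (games_played + prior_weight_games)
    return w * current_avg + (1 - w) * prior_avg

# A team averaging 35 points after 2 games is pulled most of the way back toward its prior of 24.5
print(shrunk_average(current_avg=35.0, games_played=2, prior_avg=24.5))   # -> 26.6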

Section 5: Code Analysis (2 questions, 6 points each = 12 points)


Question 22. Examine the following code that builds a logistic regression model:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("nba_games.csv")
X = df[["net_rating", "pace", "rest_days", "travel_miles"]]
y = df["home_win"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")

(a) This code contains a data leakage error. Identify it and explain why it matters.

(b) Write the corrected version of the problematic lines.

Answer **(a)** The data leakage error occurs because `StandardScaler` is fit on the **entire dataset** (`X_scaled = scaler.fit_transform(X)`) before the train/test split. This means the scaler's mean and standard deviation include information from the test set, which leaks test-set statistics into the training process. The model therefore has indirect access to information about the test data during training, which inflates performance estimates. Additionally, using `train_test_split` with random splitting on time-series sports data is problematic (look-ahead bias), though the question focuses on the scaling leakage.

**(b)** Corrected version:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit only on training data
X_test_scaled = scaler.transform(X_test)         # Transform test using training stats

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print(f"Test accuracy: {model.score(X_test_scaled, y_test):.3f}")
The key fix is fitting the scaler on the training data only, then using the same transformation on the test data. This ensures no test-set information leaks into the preprocessing step.

Question 23. Examine the following feature engineering code:

def compute_features(games_df, team, game_date):
    """Compute features for a team as of a given game date."""
    team_games = games_df[games_df["team"] == team]
    season_games = team_games[team_games["season"] == game_date.year]

    features = {
        "avg_points": season_games["points_scored"].mean(),
        "avg_allowed": season_games["points_allowed"].mean(),
        "win_pct": season_games["win"].mean(),
        "last_5_avg_points": season_games.tail(5)["points_scored"].mean(),
    }
    return features

(a) Identify at least two bugs or methodological problems in this function.

(b) For each problem, explain the potential impact on model performance.

(c) Write corrections for the issues you identified.

Answer **(a) and (b):**

**Problem 1: Look-ahead bias.** The function computes statistics over `season_games` without filtering to only games *before* `game_date`. If `game_date` is February 15, the averages include games from March onward (if they exist in the dataset). This inflates backtesting performance because the model uses future information.

**Problem 2: Season boundary assumption.** The code assumes `season == game_date.year`, but many sports seasons span two calendar years (NBA: October to June, NFL: September to February). A game on January 15, 2025, might be part of the 2024-25 season, but this code would look for 2025-season games.

**Problem 3: Missing data handling.** If a team has played fewer than 5 games, `tail(5)` returns fewer rows, and `.mean()` still works but computes over a misleadingly small sample. If no games exist, `mean()` returns NaN with no warning.

**(c)** Corrected version:
def compute_features(games_df, team, game_date, season_id):
    """Compute features for a team using only games before game_date."""
    team_games = games_df[
        (games_df["team"] == team) &
        (games_df["season"] == season_id) &
        (games_df["game_date"] < game_date)  # Only past games
    ].sort_values("game_date")

    if len(team_games) < 3:
        return None  # Not enough data for stable features

    features = {
        "avg_points": team_games["points_scored"].mean(),
        "avg_allowed": team_games["points_allowed"].mean(),
        "win_pct": team_games["win"].mean(),
        "last_5_avg_points": team_games.tail(5)["points_scored"].mean(),
        "games_played": len(team_games),
    }
    return features

Section 6: Applied Problems (2 questions, 8 points each = 16 points)


Question 24. You build a linear regression model predicting NFL game totals (combined score). The model produces the following predictions and the corresponding sportsbook lines for five upcoming games:

| Game | Model Prediction | Sportsbook Total | Over/Under Vig |
|---|---|---|---|
| Game A | 48.2 | 44.5 | -110/-110 |
| Game B | 41.7 | 43.0 | -108/-112 |
| Game C | 52.1 | 51.5 | -105/-115 |
| Game D | 38.5 | 41.0 | -110/-110 |
| Game E | 46.0 | 45.5 | -115/-105 |

(a) (2 points) For each game, state whether the model favors the Over or the Under, and calculate the difference between the model's prediction and the sportsbook total.

(b) (2 points) Using a normal distribution assumption with the model's RMSE of 12.0 points, calculate the model's implied probability of the Over hitting for Game A (model prediction 48.2, line 44.5).

(c) (2 points) Convert the Game A sportsbook vig-adjusted line (-110/-110) into implied probabilities for Over and Under. Does the model see value on the Over?

(d) (2 points) If you were deploying this model with a minimum edge threshold of 3 percentage points, which games would you bet, and on which side?

Answer **(a)**

| Game | Model - Line | Direction |
|---|---|---|
| A | 48.2 - 44.5 = +3.7 | Over |
| B | 41.7 - 43.0 = -1.3 | Under |
| C | 52.1 - 51.5 = +0.6 | Over |
| D | 38.5 - 41.0 = -2.5 | Under |
| E | 46.0 - 45.5 = +0.5 | Over |

**(b)** Under a normal distribution with mean = 48.2 and SD = 12.0:

$P(\text{Over 44.5}) = P(X > 44.5) = P\left(Z > \frac{44.5 - 48.2}{12.0}\right) = P(Z > -0.308) = \Phi(0.308) \approx 0.621$

The model implies approximately a 62.1% probability that the Over hits.

**(c)** At -110/-110, each side has an implied probability of 110/210 = 52.38%. After vig removal (normalizing to 100%), the fair implied probability is 50% for each side. The model's 62.1% exceeds the market's 50% (and even the vig-included 52.38%) by a substantial margin. **Yes, the model sees significant value on the Over in Game A.**

**(d)** To determine which games clear the 3-percentage-point threshold, compare each game's model-implied probability with the break-even (vig-included) probability on the side the model favors:

- **Game A**: Model Over probability $\approx 62.1\%$; break-even $\approx 52.4\%$ (-110). Edge $\approx 9.7$ pp. **Bet Over.**
- **Game B**: Model Under probability: $P(X < 43.0) = P(Z < (43.0 - 41.7)/12.0) = P(Z < 0.108) \approx 0.543$ (54.3%). Break-even $\approx 52.8\%$ (-112). Edge $\approx 1.5$ pp. **No bet** (below threshold).
- **Game C**: $P(X > 51.5) = P(Z > (51.5 - 52.1)/12.0) = P(Z > -0.05) \approx 0.520$ (52.0%). Break-even $\approx 51.2\%$ (-105). Edge $\approx 0.8$ pp. **No bet.**
- **Game D**: $P(X < 41.0) = P(Z < (41.0 - 38.5)/12.0) = P(Z < 0.208) \approx 0.582$ (58.2%). Break-even $\approx 52.4\%$ (-110). Edge $\approx 5.8$ pp. **Bet Under.**
- **Game E**: $P(X > 45.5) = P(Z > (45.5 - 46.0)/12.0) = P(Z > -0.042) \approx 0.517$ (51.7%). Break-even $\approx 53.5\%$ (-115). Edge is negative. **No bet.**

**Recommended bets: Game A Over, Game D Under.**
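
For readers who want to check part (d), a minimal sketch (assuming the question's normal-error model with RMSE 12.0) that reproduces the model probabilities, break-even probabilities, and edges with scipy.

from scipy.stats import norm

RMSE = 12.0
games = {  # game: (model prediction, sportsbook total, American odds on the side the model favors)
    "A": (48.2, 44.5, -110),
    "B": (41.7, 43.0, -112),
    "C": (52.1, 51.5, -105),
    "D": (38.5, 41.0, -110),
    "E": (46.0, 45.5, -115),
}

for game, (pred, line, odds) in games.items():
    p_over = 1 - norm.cdf(line, loc=pred, scale=RMSE)   # model P(total > line)
    p_model = p_over if pred > line else 1 - p_over     # probability on the side the model favors
    break_even = -odds / (-odds + 100)                  # implied break-even probability (vig included)
    print(f"Game {game}: model {p_model:.3f}, break-even {break_even:.3f}, edge {p_model - break_even:+.3f}")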

Question 25. A colleague builds a logistic regression model to predict NBA moneyline outcomes and reports the following evaluation metrics:

  • Training accuracy: 68.5%
  • Test accuracy: 62.1%
  • Training log-loss: 0.58
  • Test log-loss: 0.67
  • Training Brier score: 0.195
  • Test Brier score: 0.232
  • AUC-ROC: 0.67

They also report that when they bet on all games where the model's predicted probability exceeds the market's implied probability by at least 5 percentage points, they achieved a 54.2% win rate over 240 bets at average odds of -108.

(a) (2 points) Assess whether the model is overfitting based on the training vs. test metrics.

(b) (2 points) Calculate the approximate profit or loss from 240 bets at -108 average odds with a 54.2% win rate, assuming a flat $100 stake per bet.

(c) (2 points) Is the 54.2% win rate over 240 bets statistically significant at the 5% level against a null of 51.9% (the break-even rate at -108)? Use the normal approximation.

(d) (2 points) What three improvements would you recommend for this model before deploying it with real money?

Answer **(a)** There is moderate overfitting. Training accuracy (68.5%) exceeds test accuracy (62.1%) by 6.4 percentage points. Similarly, training log-loss (0.58) is better than test log-loss (0.67), and training Brier score (0.195) is better than test (0.232). These gaps indicate the model has learned some training-set-specific patterns that do not generalize. However, the overfitting is not extreme --- the test metrics still show meaningful predictive power above a naive baseline.

**(b)** At -108 odds, the profit per winning $100 bet is $100 × (100/108) ≈ $92.59.

  • Wins: 240 × 0.542 = 130.1, rounded to 130 wins
  • Losses: 240 - 130 = 110 losses
  • Profit from wins: 130 × $92.59 = $12,036.70
  • Loss from losses: 110 × $100 = $11,000.00
  • **Net profit: $12,036.70 - $11,000.00 = $1,036.70**
  • ROI: $1,036.70 / (240 × $100) ≈ **4.3%**

**(c)** Null hypothesis: $p_0 = 0.519$ (the break-even rate at -108 is $108/208 \approx 51.9\%$); observed: $\hat{p} = 0.542$, $n = 240$.

$z = \frac{0.542 - 0.519}{\sqrt{0.519 \times 0.481 / 240}} = \frac{0.023}{0.0322} \approx 0.71$

A z-score of 0.71 corresponds to a one-tailed p-value of approximately 0.24. This is **not statistically significant** at the 5% level. The observed edge could easily be due to chance over 240 bets; at this win rate, on the order of 1,300 bets would be needed to reach statistical significance.

**(d)** Three recommended improvements:

1. **Regularization**: Add L1 or L2 regularization to reduce the training-test gap and combat overfitting. The 6.4 percentage point accuracy gap suggests the model is too complex for the available data.

2. **Calibration tuning**: Apply Platt scaling or isotonic regression to the predicted probabilities. Since betting decisions depend on comparing model probabilities to market probabilities, calibration is more important than raw accuracy.

3. **Larger evaluation sample**: 240 bets is insufficient to confirm a real edge. Use walk-forward validation over a longer time horizon (multiple seasons) to get a more reliable performance estimate before risking real capital.
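
A minimal sketch reproducing parts (b) and (c) numerically; rounding the win count to a whole number follows the answer above, while the payout is left unrounded, so the net profit differs from the hand calculation by a few cents.

from math import sqrt
from scipy.stats import norm

n_bets, win_rate, odds, stake = 240, 0.542, -108, 100.0

win_payout = stake * 100 / -odds             # ~$92.59 profit per winning $100 bet at -108
wins = round(n_bets * win_rate)              # 130 wins
losses = n_bets - wins                       # 110 losses
profit = wins * win_payout - losses * stake
print(f"net profit: ${profit:,.2f}, ROI: {profit / (n_bets * stake):.1%}")

p0 = -odds / (-odds + 100)                   # break-even probability at -108, ~0.519
z = (win_rate - p0) / sqrt(p0 * (1 - p0) / n_bets)
print(f"z = {z:.2f}, one-tailed p-value = {1 - norm.cdf(z):.3f}")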

Scoring Summary

| Section | Questions | Points Each | Total |
|---|---|---|---|
| 1. Multiple Choice | 10 | 3 | 30 |
| 2. True/False | 5 | 3 | 15 |
| 3. Fill in the Blank | 3 | 4 | 12 |
| 4. Short Answer | 3 | 5 | 15 |
| 5. Code Analysis | 2 | 6 | 12 |
| 6. Applied Problems | 2 | 8 | 16 |
| **Total** | **25** | --- | **100** |

Grade Thresholds

| Grade | Score Range | Percentage |
|---|---|---|
| A | 90-100 | 90-100% |
| B | 80-89 | 80-89% |
| C | 70-79 | 70-79% |
| D | 60-69 | 60-69% |
| F | 0-59 | 0-59% |