Chapter 9 Key Takeaways: Regression Analysis for Sports Modeling

Key Concepts

  1. Linear Regression for Continuous Outcomes: Linear regression models a continuous response variable (point differentials, game totals, player statistics) as a linear combination of predictor features. Each coefficient represents the expected change in the outcome for a one-unit increase in the feature, holding all other features constant. OLS minimizes the sum of squared residuals.
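A minimal sketch of OLS on synthetic data (the "features" here are purely illustrative, not from the chapter's datasets). It minimizes the sum of squared residuals directly via least squares and recovers the generating coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: predict point differential from an intercept plus
# two made-up features (e.g. net-rating gap, rest-day advantage).
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_beta = np.array([1.0, 3.0, 0.5])
y = X @ true_beta + rng.normal(scale=2.0, size=n)

# OLS: the least-squares solution minimizes sum((y - X @ beta)**2).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Each slope is the expected change in point differential for a
# one-unit increase in that feature, holding the other fixed.
print(beta_hat)
```

With 500 games of noise-scale 2, the estimates land close to the true coefficients, which is the "one-unit increase" interpretation in action.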

  2. Logistic Regression for Binary Outcomes: Logistic regression models the probability of a binary outcome (win/loss, over/under) by passing a linear combination of features through the sigmoid function. Coefficients represent changes in log-odds, and the model naturally constrains predictions to the [0, 1] probability interval.
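A small sketch of the sigmoid link. The coefficients below are assumed for illustration (not fitted to real data); the point is how a linear combination in log-odds space becomes a probability in (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (assumed) coefficients: intercept, home-court indicator,
# and a net-rating-gap feature.
beta = np.array([0.1, 0.25, 0.12])
x = np.array([1.0, 1.0, 4.0])   # intercept, home game, +4 net-rating gap

log_odds = x @ beta             # linear combination = log-odds of a win
p_win = sigmoid(log_odds)

# A one-unit increase in net-rating gap adds beta[2] = 0.12 to the
# log-odds, i.e. multiplies the odds by exp(0.12), about 1.13x.
print(round(p_win, 3))
```

Note that `sigmoid(0)` is exactly 0.5, so a zero log-odds score corresponds to a coin flip.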

  3. Feature Engineering: The process of creating predictive variables from raw data. In sports modeling, this includes rolling averages, efficiency metrics (EPA, net rating), pace adjustments, rest/travel factors, and Bayesian-shrunk early-season estimates. Feature engineering is typically more impactful than algorithm selection.
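One concrete instance of temporal discipline in feature engineering, sketched with pandas on toy numbers: shifting before rolling so that the feature for game *t* uses only games before *t*:

```python
import pandas as pd

# Toy per-game log for one team (points are illustrative numbers).
games = pd.DataFrame({
    "game": [1, 2, 3, 4, 5],
    "points": [110, 98, 120, 105, 115],
})

# Rolling mean of the PREVIOUS three games: shift(1) ensures the value
# for game t uses only games 1..t-1, avoiding look-ahead leakage.
games["pts_rolling3"] = (
    games["points"].shift(1).rolling(window=3, min_periods=1).mean()
)
print(games)
```

The first game has no prior data (NaN), and game 5's value averages games 2-4 only; dropping the `shift(1)` would quietly leak the current game's result into its own feature.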

  4. Multicollinearity: When predictor variables are highly correlated, coefficient estimates become unstable (large standard errors) even though predictions may remain accurate. The Variance Inflation Factor (VIF) quantifies the severity; values above 5--10 warrant attention.
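The VIF definition can be computed by hand in a few lines (a sketch; in practice `statsmodels` provides `variance_inflation_factor`). Regress each column on the others and apply $1/(1-R_j^2)$:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns plus an intercept."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)   # nearly collinear with a
c = rng.normal(size=300)                  # independent
X = np.column_stack([a, b, c])
print(vif(X))   # a and b get large VIFs; c stays near 1
```

The nearly collinear pair blows past the 5-10 warning range while the independent feature sits near 1, which is exactly the pattern VIF is meant to flag.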

  5. Regularization: Ridge (L2), Lasso (L1), and Elastic Net add penalty terms to the loss function to prevent overfitting. Ridge shrinks coefficients toward zero; Lasso can set coefficients to exactly zero (feature selection). Regularization is essential when the number of features is large relative to the sample size.
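The ridge case has a closed form, sketched below on centered synthetic data (by convention the intercept is not penalized, so centering lets us drop it). The penalty visibly shrinks the coefficient vector:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: beta = (X'X + lam*I)^{-1} X'y.
    Assumes X and y are centered so no intercept term is needed."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 5))
X -= X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)
y -= y.mean()

beta_ols = ridge_fit(X, y, 0.0)     # lam=0 recovers plain OLS
beta_ridge = ridge_fit(X, y, 50.0)  # L2 penalty shrinks toward zero

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

Lasso has no closed form (the L1 penalty is non-differentiable at zero), which is why libraries solve it iteratively; it trades this inconvenience for exact zeros, i.e. feature selection.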

  6. Model Diagnostics: Residual analysis, Q-Q plots, the Breusch-Pagan test (heteroscedasticity), Durbin-Watson test (autocorrelation), and Cook's distance (influential observations) verify that OLS assumptions are met and the model is reliable.
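One of these diagnostics, the Durbin-Watson statistic, is simple enough to sketch by hand (in practice `statsmodels` provides it and the other tests directly). Values near 2 suggest no first-order autocorrelation; values well below 2 suggest positive autocorrelation:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences of the residuals,
    divided by the sum of squared residuals."""
    d = np.diff(resid)
    return float((d @ d) / (resid @ resid))

rng = np.random.default_rng(3)
white = rng.normal(size=1000)                # independent residuals
trending = np.cumsum(rng.normal(size=1000))  # strongly autocorrelated

print(durbin_watson(white), durbin_watson(trending))
```

The independent residuals score near 2, while the trending series scores near 0, the signature of residuals that track each other game to game (common when a model misses momentum or schedule structure).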

  7. Cross-Validation: Repeated train/test splitting estimates out-of-sample performance. For sports data, time-series (walk-forward) cross-validation is required to prevent look-ahead bias.
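A walk-forward splitter can be sketched in a few lines (scikit-learn's `TimeSeriesSplit` implements the same idea). Every test game comes strictly after every training game, so no fold peeks into the future:

```python
import numpy as np

def walk_forward_splits(n_games, n_folds, min_train):
    """Yield (train_idx, test_idx) pairs where training always ends
    before testing begins -- no look-ahead."""
    fold_size = (n_games - min_train) // n_folds
    for k in range(n_folds):
        end_train = min_train + k * fold_size
        end_test = end_train + fold_size
        yield np.arange(end_train), np.arange(end_train, end_test)

for train, test in walk_forward_splits(n_games=100, n_folds=3, min_train=40):
    print(len(train), test[0], test[-1])
```

Each successive fold trains on a longer prefix of the season and tests on the next block of games, mimicking how the model would actually be used in deployment.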

  8. Calibration: The alignment between predicted probabilities and observed frequencies. A well-calibrated model is essential for betting because expected value calculations depend on the accuracy of probability estimates, not just directional accuracy.
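A sketch of the two calibration tools named above, on a tiny hand-made example: the Brier score, and a binned table comparing mean prediction to observed frequency per bucket (the numeric form of a calibration plot):

```python
import numpy as np

def brier_score(p, outcomes):
    """Mean squared difference between predicted probability and the
    0/1 outcome -- lower is better, 0 is perfect."""
    p, outcomes = np.asarray(p, float), np.asarray(outcomes, float)
    return float(np.mean((p - outcomes) ** 2))

def calibration_table(p, outcomes, n_bins=5):
    """Bucket predictions and compare mean prediction to observed win
    rate in each bucket; a well-calibrated model matches closely."""
    p, outcomes = np.asarray(p, float), np.asarray(outcomes, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    return [(p[bins == b].mean(), outcomes[bins == b].mean())
            for b in range(n_bins) if np.any(bins == b)]

preds = [0.9, 0.8, 0.3, 0.6, 0.2]
wins  = [1,   1,   0,   1,   0]
print(brier_score(preds, wins))
print(calibration_table(preds, wins))
```

In practice you would compute this over hundreds of games; with five observations the table is only a shape illustration, but it shows why calibration, not just accuracy, is what expected-value calculations depend on.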


Key Formulas

| Formula | Expression | Application |
|---|---|---|
| OLS Objective | $\min_\beta \sum_{i=1}^n (y_i - X_i \beta)^2$ | Fit linear regression |
| Logistic Function | $p = \frac{1}{1 + e^{-(X\beta)}}$ | Convert log-odds to probability |
| R-Squared | $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ | Proportion of variance explained |
| Adjusted R-Squared | $\bar{R}^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$ | Penalizes for extra features |
| VIF | $\text{VIF}_j = \frac{1}{1 - R_j^2}$ | Diagnose multicollinearity |
| Ridge Penalty | $\min_\beta \sum (y_i - X_i\beta)^2 + \lambda \sum \beta_j^2$ | L2 regularization |
| Lasso Penalty | $\min_\beta \sum (y_i - X_i\beta)^2 + \lambda \sum \lvert\beta_j\rvert$ | L1 regularization (sparse) |
| Brier Score | $BS = \frac{1}{n}\sum_{i=1}^n (p_i - o_i)^2$ | Evaluate probability calibration |
| Edge (Betting) | $\text{Edge} = p_{\text{model}} - p_{\text{market}}$ | Identify value bets |
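The edge formula needs the market probability, which must be derived from posted odds. A sketch for American odds (note this implied probability still includes the book's vig; a fuller treatment would remove the margin across both sides first):

```python
def implied_probability(american_odds):
    """Convert American odds to the sportsbook's implied win
    probability (vig included -- the book's margin is still in here)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

# Hypothetical numbers: the model says 56% while the book posts -110.
p_model = 0.56
p_market = implied_probability(-110)   # 110 / 210
edge = p_model - p_market
print(round(p_market, 4), round(edge, 4))
```

Here the edge is about 3.6 points of probability; whether that clears a betting threshold is the Step 6 decision in the workflow below.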

Quick-Reference Modeling Workflow

When building a regression model for sports betting, follow this six-step workflow:

Step 1 --- Define the prediction target. Choose the outcome variable: a continuous quantity (for linear regression) or a binary indicator (for logistic regression). The target must be something you can compare against a sportsbook line.

Step 2 --- Engineer features with temporal discipline. Create predictor variables using only information available before the game. Apply rolling windows, Bayesian shrinkage for early-season data, and domain-specific transformations. Never use information from the future.

Step 3 --- Split data temporally. Always use time-series splitting: train on earlier games, test on later games. Random train/test splits leak future information and produce misleadingly optimistic results.

Step 4 --- Fit, regularize, and select features. Fit the model with regularization (Ridge, Lasso, or Elastic Net) using cross-validated hyperparameter tuning. Use VIF to check for multicollinearity. Remove features that are redundant or statistically insignificant.

Step 5 --- Diagnose and validate. Check residual plots, run diagnostic tests (normality, heteroscedasticity, autocorrelation), and examine influential observations. Evaluate out-of-sample performance with RMSE, MAE, $R^2$, and (for logistic) log-loss, Brier score, and calibration plots.

Step 6 --- Deploy with edge thresholds. Compare model predictions to sportsbook lines. Bet only when the model's estimated probability exceeds the market probability by a meaningful threshold. Track live performance and compare to backtested expectations.

The core principle: A regression model for betting must not only be accurate but also well-calibrated and honestly evaluated on out-of-sample data. The model's probability estimates must be reliable enough to support direct comparison against market-implied probabilities for expected value calculation.


Ready for Chapter 10? Self-Assessment Checklist

Before moving on to Chapter 10 ("Bayesian Thinking for Bettors"), confirm that you can do the following:

  • [ ] Build a linear regression model for a continuous sports outcome and interpret all coefficients
  • [ ] Build a logistic regression model for a binary outcome and convert log-odds to probabilities
  • [ ] Engineer features from raw sports data with strict temporal discipline (no look-ahead bias)
  • [ ] Diagnose model assumptions using residual plots, VIF, Durbin-Watson, and diagnostic tests
  • [ ] Apply Ridge, Lasso, or Elastic Net regularization with cross-validated parameter selection
  • [ ] Evaluate out-of-sample performance using RMSE, MAE, $R^2$, log-loss, and Brier score
  • [ ] Construct and interpret a calibration plot for a probability model
  • [ ] Use time-series cross-validation instead of random k-fold for sports data
  • [ ] Compare model predictions to sportsbook lines and compute betting edges
  • [ ] Assess statistical significance of backtested betting results

If you can check every box with confidence, you are well prepared for Chapter 10, where Bayesian methods will provide a complementary framework for reasoning about uncertainty in sports betting.