Learning Objectives
- Build and interpret simple and multiple linear regression models for predicting sports outcomes such as game totals, point margins, and player performance
- Construct logistic regression models for win probability estimation and convert model outputs into actionable betting probabilities
- Apply feature selection techniques including LASSO, Ridge, and VIF analysis to avoid overfitting and multicollinearity in sports datasets
- Perform comprehensive model diagnostics using residual analysis, influence measures, and heteroscedasticity tests to validate regression assumptions
- Transform regression model predictions into betting signals by comparing estimated spreads, totals, and win probabilities against market lines
In This Chapter
- Chapter Overview
- 9.1 Simple and Multiple Linear Regression
- 9.2 Logistic Regression for Win Probability
- 9.3 Feature Selection and Multicollinearity
- 9.4 Model Diagnostics and Residual Analysis
- 9.5 From Regression Output to Betting Lines
- 9.6 Chapter Summary
- What's Next: Chapter 10 --- Bayesian Thinking for Sports Bettors
Chapter 9: Regression Analysis for Sports Modeling
Chapter Overview
In Chapter 6, we learned how to measure the strength and direction of relationships between variables using correlation. In Chapters 7 and 8, we built the probability and hypothesis-testing machinery that allows us to say whether an observed effect is statistically meaningful or merely noise. Now we take the decisive step from describing relationships to predicting outcomes.
Regression analysis is the workhorse of quantitative sports modeling. When an analyst says that "each additional rushing yard is worth 0.08 expected points" or that "a team's three-point shooting percentage has a coefficient of 0.34 in our win probability model," they are communicating results from regression. When a sportsbook sets a total at 47.5 for an NFL game, a regression model almost certainly played a role in arriving at that number.
This chapter covers two pillars of regression: linear regression for continuous outcomes (point totals, scoring margins, yards gained) and logistic regression for binary outcomes (win/loss, cover/miss, over/under). We will build complete models using real-world sports statistics, diagnose their weaknesses, select the most informative features, and ultimately convert raw model output into betting signals that can be compared against the market.
By the end of this chapter, you will possess a complete pipeline: raw data in, betting recommendation out. That pipeline will not be perfect---no model is---but it will be principled, transparent, and improvable.
9.1 Simple and Multiple Linear Regression
The Core Idea
Linear regression models the relationship between one or more independent variables (also called predictors or features) and a dependent variable (the outcome we wish to predict). In its simplest form, we are fitting a straight line through a scatter plot:
$$\hat{y} = \beta_0 + \beta_1 x$$
where $\hat{y}$ is the predicted value of the dependent variable, $\beta_0$ is the intercept, $\beta_1$ is the slope coefficient, and $x$ is the predictor.
In sports modeling, the dependent variable might be total points scored in a game, a team's scoring margin, or a player's fantasy point output. The predictor might be yards per play, turnover differential, or pace of play.
Ordinary Least Squares (OLS) Estimation
The most common method for fitting a linear regression is Ordinary Least Squares. OLS finds the values of $\beta_0$ and $\beta_1$ that minimize the sum of squared residuals:
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
The closed-form solutions for simple linear regression are:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
For multiple linear regression with $p$ predictors, the model becomes:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$
In matrix notation, this is expressed compactly as:
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}$$
where $\mathbf{X}$ is the $n \times (p+1)$ design matrix (with a column of ones for the intercept), $\boldsymbol{\beta}$ is the $(p+1) \times 1$ coefficient vector, and $\hat{\mathbf{y}}$ is the $n \times 1$ vector of predicted values. The OLS solution in matrix form is:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
This elegant formula is the foundation of virtually every linear model you will encounter in sports analytics.
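As a quick, hands-on check of the matrix formula, the sketch below solves the normal equations directly with NumPy on a tiny made-up dataset (five observations, two predictors); the numbers are purely illustrative, and in practice you would let statsmodels or scikit-learn handle the numerical details.
import numpy as np

# Tiny made-up dataset: intercept column, total yards, turnover margin
X = np.array([
    [1.0, 350.0,  1.0],
    [1.0, 290.0, -1.0],
    [1.0, 410.0,  2.0],
    [1.0, 330.0,  0.0],
    [1.0, 370.0, -2.0],
])
y = np.array([24.0, 13.0, 31.0, 20.0, 17.0])

# OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("Normal-equation coefficients:", beta_hat.round(4))

# Equivalent but more numerically stable least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq coefficients:          ", beta_lstsq.round(4))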
Interpreting Coefficients in Sports Contexts
Each coefficient $\beta_j$ in a multiple regression tells you: holding all other predictors constant, a one-unit increase in $x_j$ is associated with a $\beta_j$-unit change in the predicted outcome.
This interpretation is powerful but demands care. Consider an NFL model where:
$$\widehat{\text{Points}} = -12.4 + 0.081 \times \text{TotalYards} + 3.7 \times \text{TurnoverMargin}$$
This tells us:
- Each additional yard of total offense is worth approximately 0.081 points, controlling for turnover differential.
- Each additional turnover gained (relative to turnovers lost) is worth approximately 3.7 points, controlling for total yardage.
- A team with zero yards and zero turnover margin would be predicted to score $-12.4$ points (which illustrates that the intercept is an extrapolation and not always meaningful on its own).
Callout: Correlation Is Not Causation---Still
Regression coefficients describe associations, not causal effects. If your model includes "penalty yards" as a predictor and it has a positive coefficient on scoring, that does not mean penalties cause more scoring. It may reflect that teams with more plays commit more penalties, and more plays produce more points. Always think carefully about the causal story behind your variables before making strategy recommendations.
R-Squared and Adjusted R-Squared
The coefficient of determination, $R^2$, measures the proportion of variance in the dependent variable that is explained by the model:
$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
An $R^2$ of 0.65 means that 65% of the variance in the outcome is explained by the predictors in the model. In sports modeling, $R^2$ values tend to be lower than in many other domains because athletic outcomes contain substantial randomness. An NFL game-total model with $R^2 = 0.30$ at the game level is actually quite useful, even though 70% of the variance remains unexplained.
Adjusted R-squared penalizes the addition of predictors that do not meaningfully improve the model:
$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where $n$ is the number of observations and $p$ is the number of predictors. Unlike $R^2$, which can never decrease when you add a variable, adjusted $R^2$ can decline if a new predictor adds more noise than signal. Always report adjusted $R^2$ when comparing models with different numbers of predictors.
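Both quantities are one-liners once you have observed and fitted values. A minimal sketch with made-up numbers (six games, two predictors assumed):
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = np.array([24, 13, 31, 20, 17, 27], dtype=float)       # observed points
y_hat = np.array([22, 16, 29, 21, 15, 25], dtype=float)   # model predictions
print(f"R-squared:          {r_squared(y, y_hat):.3f}")
print(f"Adjusted R-squared: {adjusted_r_squared(y, y_hat, p=2):.3f}")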
Standard Errors and Confidence Intervals for Coefficients
The standard error of a coefficient quantifies the uncertainty in our estimate. For the $j$th coefficient in a multiple regression:
$$\text{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 (\mathbf{X}^T \mathbf{X})^{-1}_{jj}}$$
where $\hat{\sigma}^2 = \frac{SS_{\text{res}}}{n - p - 1}$ is the estimated variance of the residuals. A 95% confidence interval for $\beta_j$ is:
$$\hat{\beta}_j \pm t_{n-p-1, 0.025} \times \text{SE}(\hat{\beta}_j)$$
In practice, if a coefficient's confidence interval includes zero, we cannot be confident the predictor has a real effect on the outcome.
Prediction Intervals vs. Confidence Intervals
This distinction is crucial for bettors and frequently misunderstood:
- A confidence interval for the mean response tells you the range where the average outcome is expected to fall for a given set of predictor values. It is narrow.
- A prediction interval for an individual observation tells you the range where a single new observation is expected to fall. It is wider because it accounts for both the uncertainty in the model and the inherent variability of individual outcomes.
$$\text{Prediction Interval} = \hat{y}_0 \pm t_{n-p-1,\alpha/2} \times \hat{\sigma}\sqrt{1 + \mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0}$$
$$\text{Confidence Interval for Mean} = \hat{y}_0 \pm t_{n-p-1,\alpha/2} \times \hat{\sigma}\sqrt{\mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0}$$
When betting on a single game, the prediction interval is the appropriate measure of uncertainty. A model might predict an NFL total of 45.2 with a 95% prediction interval of [28.4, 62.0]. That enormous range is a humbling reminder of how much game-level variance exists---and why bankroll management (covered in Part 4) is non-negotiable.
Python Code: Building an NFL Points Model
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Simulated NFL team-game data (one row per team per game)
np.random.seed(42)
n_games = 256  # sample of team-game rows (a full season has 32 teams x 16 games = 512)
# Generate realistic team-game statistics
total_yards = np.random.normal(340, 60, n_games)
turnovers_committed = np.random.poisson(1.2, n_games)
turnovers_forced = np.random.poisson(1.2, n_games)
turnover_margin = turnovers_forced - turnovers_committed
time_of_possession = np.random.normal(30, 3, n_games) # minutes
penalty_yards = np.random.normal(50, 20, n_games)
# Generate points with realistic relationships + noise
points = (
-8.0
+ 0.075 * total_yards
+ 3.5 * turnover_margin
+ 0.15 * time_of_possession
- 0.02 * penalty_yards
+ np.random.normal(0, 6, n_games) # game-level noise
)
points = np.clip(points, 0, None).round(0) # floor at zero
# Create DataFrame
df = pd.DataFrame({
'points': points,
'total_yards': total_yards,
'turnover_margin': turnover_margin,
'time_of_possession': time_of_possession,
'penalty_yards': penalty_yards
})
# Fit multiple linear regression using statsmodels
X = df[['total_yards', 'turnover_margin', 'time_of_possession', 'penalty_yards']]
X = sm.add_constant(X) # adds intercept column
y = df['points']
model = sm.OLS(y, X).fit()
print(model.summary())
# Prediction for a specific game scenario
new_game = pd.DataFrame({
'const': [1.0],
'total_yards': [365],
'turnover_margin': [1],
'time_of_possession': [31.5],
'penalty_yards': [45]
})
predicted_points = model.predict(new_game)
print(f"\nPredicted points: {predicted_points.values[0]:.1f}")
# Get prediction interval
predictions = model.get_prediction(new_game)
pred_summary = predictions.summary_frame(alpha=0.05)
print(f"95% Confidence Interval (mean): [{pred_summary['mean_ci_lower'].values[0]:.1f}, "
f"{pred_summary['mean_ci_upper'].values[0]:.1f}]")
print(f"95% Prediction Interval: [{pred_summary['obs_ci_lower'].values[0]:.1f}, "
f"{pred_summary['obs_ci_upper'].values[0]:.1f}]")
Worked Example: Predicting NFL Game Totals from Team Stats
Suppose we have assembled season-level data for each NFL team across three seasons. For each team-game observation, we know: total yards, turnover margin, time of possession, and penalty yards. Our goal is to predict the total points a team will score.
Step 1: Fit the model. Running the code above produces output like:
| Variable | Coefficient | Std. Error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | -7.83 | 4.21 | -1.86 | 0.064 |
| total_yards | 0.074 | 0.006 | 12.33 | <0.001 |
| turnover_margin | 3.42 | 0.52 | 6.58 | <0.001 |
| time_of_possession | 0.16 | 0.12 | 1.33 | 0.185 |
| penalty_yards | -0.019 | 0.018 | -1.06 | 0.292 |
Step 2: Interpret. Total yards and turnover margin are statistically significant ($p < 0.001$). Each yard is worth about 0.074 points; each turnover gained is worth about 3.4 points. Time of possession and penalty yards are not significant at $\alpha = 0.05$ in this dataset.
Step 3: Evaluate. The model's adjusted $R^2$ is approximately 0.46, meaning it explains about 46% of the game-level variance in scoring. This is solid for a game-level model.
Step 4: Predict. For a team projected to gain 365 total yards with a +1 turnover margin, 31.5 minutes of possession, and 45 penalty yards, the model predicts approximately 24.3 points. The 95% prediction interval spans from roughly 12 to 36 points---a 24-point range that reflects genuine game-level uncertainty.
Callout: What $R^2$ Values Are Typical in Sports?
Game-level models in the NFL typically produce $R^2$ values between 0.25 and 0.50. Season-level models (predicting a team's total wins) can reach 0.60--0.80. Player-level models for statistics like batting average or free-throw percentage tend to have lower $R^2$ because individual performance is noisier. Do not discard a model because its $R^2$ seems "low"---the relevant question is whether it generates profitable betting signals.
9.2 Logistic Regression for Win Probability
Why Logistic Regression for Binary Outcomes
Many of the most important questions in sports betting are binary: Will this team win? Will the game go over the total? Will a team cover the spread? Linear regression is inappropriate for these outcomes because it can produce predicted probabilities below 0 or above 1, and the error structure violates OLS assumptions.
Logistic regression solves these problems by modeling the log-odds of the outcome rather than the outcome itself.
The Logit Function and Sigmoid Curve
The logistic model takes the form:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$
The left side is the logit (log of the odds). Solving for $p$ gives us the sigmoid function:
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}} = \frac{1}{1 + e^{-\mathbf{x}^T\boldsymbol{\beta}}}$$
This function maps any real-valued linear combination into the $(0, 1)$ interval, producing a valid probability. The sigmoid has an S-shaped curve: probabilities near 0.5 change rapidly with small changes in the predictors, while probabilities near 0 or 1 are more resistant to change.
Maximum Likelihood Estimation
Unlike linear regression, logistic regression cannot be solved in closed form. Instead, we use Maximum Likelihood Estimation (MLE). For $n$ observations where $y_i \in \{0, 1\}$:
$$\mathcal{L}(\boldsymbol{\beta}) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$$
Taking the log gives the log-likelihood:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i)\right]$$
MLE finds the $\boldsymbol{\beta}$ that maximizes this expression, typically via iterative algorithms like Newton-Raphson or gradient descent.
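To make the iterative idea concrete, here is a bare-bones gradient-ascent sketch on simulated data with one predictor. It is illustrative only (the learning rate and iteration count are arbitrary choices); real work should use statsmodels or scikit-learn, whose optimizers are faster and more robust.
import numpy as np

np.random.seed(0)
n = 500
X = np.column_stack([np.ones(n), np.random.normal(0, 1, n)])  # intercept + one feature
true_beta = np.array([-0.2, 0.8])
y = np.random.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

beta = np.zeros(2)
learning_rate = 0.01
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))       # sigmoid of the linear predictor
    gradient = X.T @ (y - p)              # gradient of the log-likelihood
    beta += learning_rate * gradient / n  # step uphill toward the maximum

p = 1 / (1 + np.exp(-X @ beta))
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print("Estimated coefficients:", beta.round(3))   # should land near (-0.2, 0.8)
print("Log-likelihood at the estimate:", round(log_lik, 2))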
Interpreting Odds Ratios
In logistic regression, the coefficient $\beta_j$ represents the change in the log-odds of the outcome for a one-unit increase in $x_j$. The odds ratio is:
$$\text{OR}_j = e^{\beta_j}$$
An odds ratio of 1.5 means that a one-unit increase in the predictor multiplies the odds of the outcome by 1.5 (a 50% increase in odds). An odds ratio of 0.8 means the odds decrease by 20%.
For example, if the coefficient on "offensive rebounding percentage" in an NBA win-probability model is 0.06, then the odds ratio is $e^{0.06} \approx 1.062$, meaning each percentage point increase in offensive rebounding percentage multiplies the win odds by about 1.062.
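Because odds and probabilities live on different scales, it helps to translate an odds ratio into a probability change at a concrete baseline. A quick hypothetical check using the 0.06 coefficient from the example and an assumed 50% starting win probability:
import numpy as np

coef = 0.06                    # coefficient on ORB% from the example above
odds_ratio = np.exp(coef)      # multiplicative change in odds per one-point increase
baseline_prob = 0.50           # assumed starting win probability

baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(f"Odds ratio: {odds_ratio:.3f}")
print(f"Win probability moves from {baseline_prob:.1%} to {new_prob:.1%}")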
Building a Win Probability Model: The NBA Four Factors
Dean Oliver's "Four Factors" provide a concise framework for understanding basketball team quality:
- Effective Field Goal Percentage (eFG%): Shooting efficiency, adjusted for three-pointers
- Turnover Percentage (TOV%): Turnovers per 100 possessions
- Offensive Rebounding Percentage (ORB%): Percentage of available offensive rebounds grabbed
- Free Throw Rate (FTR): Free throws attempted per field goal attempted
We can build a win probability model using the differential of each factor (team value minus opponent value).
Python Code: NBA Win Probability Model
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, brier_score_loss
import warnings
warnings.filterwarnings('ignore')
# Simulated NBA game data based on Four Factors differentials
np.random.seed(123)
n_games = 1230 # Approximate full NBA season (each game counted once)
# Generate Four Factor differentials (team - opponent)
efg_diff = np.random.normal(0, 4.5, n_games) # eFG% differential
tov_diff = np.random.normal(0, 3.0, n_games) # TOV% differential (negative is better)
orb_diff = np.random.normal(0, 5.0, n_games) # ORB% differential
ftr_diff = np.random.normal(0, 8.0, n_games) # FT Rate differential
home_court = np.random.binomial(1, 0.5, n_games) # 1 if home
# Generate win probability from true coefficients
log_odds = (
-0.20
+ 0.15 * efg_diff
- 0.08 * tov_diff
+ 0.04 * orb_diff
+ 0.02 * ftr_diff
+ 0.35 * home_court
)
true_prob = 1 / (1 + np.exp(-log_odds))
wins = np.random.binomial(1, true_prob)
# Create DataFrame
df = pd.DataFrame({
'win': wins,
'efg_diff': efg_diff,
'tov_diff': tov_diff,
'orb_diff': orb_diff,
'ftr_diff': ftr_diff,
'home_court': home_court
})
# Fit logistic regression
feature_cols = ['efg_diff', 'tov_diff', 'orb_diff', 'ftr_diff', 'home_court']
X = df[feature_cols]
y = df['win']
log_model = LogisticRegression(max_iter=1000, solver='lbfgs')
log_model.fit(X, y)
# Display coefficients and odds ratios
print("Logistic Regression Coefficients and Odds Ratios")
print("=" * 60)
print(f"{'Feature':<20} {'Coefficient':>12} {'Odds Ratio':>12}")
print("-" * 60)
print(f"{'Intercept':<20} {log_model.intercept_[0]:>12.4f} {np.exp(log_model.intercept_[0]):>12.4f}")
for feat, coef in zip(feature_cols, log_model.coef_[0]):
    print(f"{feat:<20} {coef:>12.4f} {np.exp(coef):>12.4f}")
# Cross-validated accuracy
cv_scores = cross_val_score(log_model, X, y, cv=10, scoring='accuracy')
print(f"\n10-Fold CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Predicted probabilities for a specific matchup
example_game = pd.DataFrame({
'efg_diff': [3.5], # team shoots 3.5% better eFG
'tov_diff': [-1.0], # team has 1% fewer turnovers
'orb_diff': [2.0], # team has 2% better ORB%
'ftr_diff': [5.0], # team gets to the line more
'home_court': [1] # playing at home
})
win_prob = log_model.predict_proba(example_game)[0][1]
print(f"\nPredicted Win Probability: {win_prob:.4f}")
print(f"Implied Fair Odds: {1/win_prob:.2f} to 1")
print(f"Implied Moneyline: {'+' if win_prob < 0.5 else '-'}"
f"{abs(int(100/win_prob - 100)) if win_prob > 0.5 else int(100*(1/win_prob - 1))}")
# Brier Score (calibration measure, lower is better)
y_pred_prob = log_model.predict_proba(X)[:, 1]
brier = brier_score_loss(y, y_pred_prob)
print(f"\nBrier Score: {brier:.4f}")
print(f"Baseline Brier (predict 0.5 always): {brier_score_loss(y, [0.5]*len(y)):.4f}")
Worked Example: NBA Win Probability from Four Factors
Consider the following matchup scenario for an upcoming NBA game:
| Factor | Home Team | Away Team | Differential |
|---|---|---|---|
| eFG% | 53.2% | 49.7% | +3.5 |
| TOV% | 12.8% | 13.8% | -1.0 |
| ORB% | 27.0% | 25.0% | +2.0 |
| FT Rate | 0.30 | 0.25 | +5.0 |
| Home Court | Yes | --- | 1 |
Plugging into our fitted model, suppose the coefficients are:
$$\ln\left(\frac{p}{1-p}\right) = -0.19 + 0.148(3.5) - 0.077(-1.0) + 0.039(2.0) + 0.021(5.0) + 0.34(1)$$
$$= -0.19 + 0.518 + 0.077 + 0.078 + 0.105 + 0.34 = 0.928$$
$$p = \frac{1}{1 + e^{-0.928}} = \frac{1}{1 + 0.395} = 0.717$$
Our model gives the home team a 71.7% win probability. This translates to fair odds of approximately $-253$ on the American moneyline. If the actual moneyline is $-200$ (implied probability of 66.7%), our model identifies a potential edge of about 5 percentage points.
From Logistic Regression Output to Betting Probabilities
The conversion from logistic regression output to betting-relevant probabilities is direct---the model literally outputs probabilities. However, two critical adjustments must be considered:
- Calibration: A model that says "70% win probability" should win roughly 70% of the time across many such predictions. We will examine calibration in Section 9.5.
- Vig removal: Market odds include the bookmaker's margin (vig). To compare your model's probabilities fairly, you must convert the book's line to a no-vig probability. If a book offers $-200/+170$, the implied probabilities are 66.7% and 37.0%, summing to 103.7%. Removing the vig proportionally yields approximately 64.3% and 35.7%, as sketched below.
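Here is a minimal sketch of proportional (multiplicative) vig removal for a two-way market, checked against the $-200/+170$ example; other de-vigging methods exist, but this is the simplest starting point.
def american_to_prob(ml):
    """Convert American odds to the raw implied probability (vig included)."""
    return abs(ml) / (abs(ml) + 100) if ml < 0 else 100 / (ml + 100)

def no_vig_probs(ml_a, ml_b):
    """Proportionally strip the bookmaker's margin from a two-way market."""
    p_a, p_b = american_to_prob(ml_a), american_to_prob(ml_b)
    overround = p_a + p_b          # sums to more than 1 because of the vig
    return p_a / overround, p_b / overround

fav, dog = no_vig_probs(-200, +170)
print(f"No-vig probabilities: favorite {fav:.3f}, underdog {dog:.3f}")
# Roughly 0.643 and 0.357, matching the worked numbers above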
Callout: When to Use Linear vs. Logistic Regression
| Outcome Type | Example | Model |
|---|---|---|
| Continuous | Points scored, yardage, time of possession | Linear regression |
| Binary | Win/loss, cover/miss, over/under | Logistic regression |
| Ordinal | Rankings (1st, 2nd, 3rd) | Ordinal logistic regression |
| Count data | Goals scored, touchdowns | Poisson regression |
Always match the model to the outcome variable. Using linear regression for a binary outcome will produce predictions outside [0, 1] and violate model assumptions.
9.3 Feature Selection and Multicollinearity
The Curse of Dimensionality in Sports Modeling
Modern sports databases contain hundreds of potential predictors. An NFL game log might include 50+ offensive and defensive statistics per team. An NBA dataset might track dozens of advanced metrics. The temptation is to include everything and "let the model sort it out." This is a mistake.
Adding too many features relative to the number of observations leads to overfitting: the model memorizes noise in the training data and performs poorly on new data. With $n$ observations and $p$ predictors, problems become severe when $p$ approaches $n$. A common rule of thumb is that you need at least 10--20 observations per predictor to estimate coefficients reliably.
If you have three NFL seasons of 256 games each ($n = 768$), using 50 predictors is aggressive. Using 100 is reckless.
Multicollinearity
Multicollinearity occurs when two or more predictors are highly correlated with each other. In sports, this is ubiquitous:
- Total yards and first downs are strongly correlated
- Passing yards and passing attempts are highly correlated
- Offensive rating and points per game move together
Multicollinearity does not bias the overall predictions, but it inflates the standard errors of individual coefficients, making it impossible to isolate the effect of any one variable. You might see a situation where total yards and first downs are both insignificant individually, but removing one makes the other highly significant.
Variance Inflation Factor (VIF)
The Variance Inflation Factor quantifies multicollinearity for each predictor:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing predictor $x_j$ on all other predictors. A VIF of 1 means no collinearity; a VIF of 5 means the variance of that coefficient is inflated fivefold; a VIF above 10 is a clear red flag.
| VIF Range | Interpretation | Action |
|---|---|---|
| 1.0--2.0 | No concern | No action needed |
| 2.0--5.0 | Moderate | Monitor, consider combining |
| 5.0--10.0 | High | Seriously consider removing or combining |
| >10.0 | Severe | Remove or combine predictors |
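The definition translates directly into code: regress each predictor on the remaining predictors and convert the resulting $R_j^2$ into a VIF. A small sketch on made-up, deliberately redundant data (statsmodels also provides variance_inflation_factor, used later in this chapter):
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(7)
n = 300
rush_yards = np.random.normal(115, 35, n)
pass_yards = np.random.normal(230, 50, n)
total_yards = rush_yards + pass_yards + np.random.normal(0, 5, n)  # nearly redundant
X = pd.DataFrame({'rush_yards': rush_yards,
                  'pass_yards': pass_yards,
                  'total_yards': total_yards})

for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    r2_j = sm.OLS(X[col], others).fit().rsquared   # regress x_j on the rest
    vif_j = 1 / (1 - r2_j)
    print(f"{col:<12} VIF = {vif_j:.1f}")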
Feature Selection Methods
Forward Selection
Start with no predictors. Add the one that most improves the model (by AIC, BIC, or adjusted $R^2$). Repeat until no remaining predictor meets the inclusion criterion.
Backward Elimination
Start with all predictors. Remove the least significant one. Refit and repeat until all remaining predictors are significant.
Stepwise Selection
Combine forward and backward: add the best variable, then check whether any previously included variable should be removed. Continue until no changes improve the model.
Information Criteria
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) balance model fit against complexity:
$$\text{AIC} = -2\ell + 2p$$
$$\text{BIC} = -2\ell + p \ln(n)$$
where $\ell$ is the log-likelihood, $p$ is the number of parameters, and $n$ is the sample size. Lower values indicate a better balance of fit and parsimony. BIC penalizes complexity more heavily than AIC, leading to sparser models.
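In statsmodels, both criteria are attributes of any fitted result object, so comparing candidate specifications takes only a few lines. A sketch on simulated data, where turnover margin genuinely matters and penalty yards are pure noise (the data-generating numbers are made up for illustration):
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(1)
n = 400
df = pd.DataFrame({
    'total_yards': np.random.normal(340, 60, n),
    'turnover_margin': np.random.poisson(1.2, n) - np.random.poisson(1.2, n),
    'penalty_yards': np.random.normal(50, 20, n),
})
df['points'] = (-8 + 0.075 * df['total_yards'] + 3.5 * df['turnover_margin']
                + np.random.normal(0, 6, n))

X_small = sm.add_constant(df[['total_yards']])
X_full = sm.add_constant(df[['total_yards', 'turnover_margin', 'penalty_yards']])
small = sm.OLS(df['points'], X_small).fit()
full = sm.OLS(df['points'], X_full).fit()

# Lower AIC/BIC indicates a better balance of fit and parsimony
for name, res in [('yards only', small), ('yards + TO margin + penalties', full)]:
    print(f"{name:<32} AIC = {res.aic:8.1f}   BIC = {res.bic:8.1f}")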
LASSO and Ridge Regression
Regularization methods add a penalty term to the OLS objective function, shrinking coefficients toward zero and reducing overfitting.
Ridge Regression (L2 penalty):
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$
Ridge shrinks coefficients but never sets them exactly to zero. It is useful when many predictors contribute small amounts of information.
LASSO Regression (L1 penalty):
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p}|\beta_j|$$
LASSO can shrink coefficients exactly to zero, effectively performing automatic feature selection. This makes it especially useful in sports modeling, where you want a parsimonious model from dozens of candidates.
Elastic Net combines both penalties:
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda_1 \sum_{j=1}^{p}|\beta_j| + \lambda_2 \sum_{j=1}^{p}\beta_j^2$$
The hyperparameter $\lambda$ (and the mixing parameter $\alpha$ in Elastic Net) is chosen via cross-validation: fit the model on $k-1$ folds and evaluate on the held-out fold, iterating across all folds and averaging performance.
Comparison of Feature Selection Methods
| Method | Handles Multicollinearity | Automatic | Produces Sparse Model | Computational Cost |
|---|---|---|---|---|
| Forward Selection | Partially | Semi | Yes | Low |
| Backward Elimination | Partially | Semi | Yes | Moderate |
| Stepwise | Partially | Semi | Yes | Moderate |
| LASSO (L1) | Yes | Yes | Yes | Moderate |
| Ridge (L2) | Yes | Yes | No | Moderate |
| Elastic Net | Yes | Yes | Yes | Moderate |
| VIF Filtering | Yes | Manual | Yes | Low |
Python Code: Feature Selection Pipeline
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
# Simulated NFL dataset with many correlated features
np.random.seed(99)
n = 512
# Generate correlated features (mimicking real football data)
pass_yards = np.random.normal(230, 50, n)
rush_yards = np.random.normal(115, 35, n)
total_yards = pass_yards + rush_yards + np.random.normal(0, 10, n)
first_downs = 0.06 * total_yards + np.random.normal(0, 2, n)
pass_attempts = pass_yards / 7.5 + np.random.normal(0, 3, n)
completions = 0.63 * pass_attempts + np.random.normal(0, 2, n)
rush_attempts = rush_yards / 4.2 + np.random.normal(0, 3, n)
time_of_poss = 0.5 * rush_attempts + 15 + np.random.normal(0, 2, n)
turnovers = np.random.poisson(1.1, n)
penalties = np.random.normal(6, 2, n)
third_down_pct = 35 + 0.02 * total_yards + np.random.normal(0, 8, n)
redzone_pct = 50 + 0.01 * total_yards + np.random.normal(0, 12, n)
# Outcome: points scored
points = (
-5.0 + 0.065 * total_yards - 3.0 * turnovers + 0.08 * redzone_pct
+ np.random.normal(0, 5.5, n)
)
points = np.clip(points, 0, None)
features = pd.DataFrame({
'pass_yards': pass_yards, 'rush_yards': rush_yards,
'total_yards': total_yards, 'first_downs': first_downs,
'pass_attempts': pass_attempts, 'completions': completions,
'rush_attempts': rush_attempts, 'time_of_poss': time_of_poss,
'turnovers': turnovers, 'penalties': penalties,
'third_down_pct': third_down_pct, 'redzone_pct': redzone_pct
})
# Step 1: Check VIF
print("VARIANCE INFLATION FACTORS")
print("=" * 40)
X_vif = sm.add_constant(features)
for i, col in enumerate(X_vif.columns):
    if col == 'const':
        continue
    vif = variance_inflation_factor(X_vif.values, i)
    flag = " *** HIGH" if vif > 5 else ""
    print(f"{col:<20} VIF = {vif:.2f}{flag}")
# Step 2: Standardize features for regularization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)
y = points
# Step 3: LASSO with cross-validation
lasso = LassoCV(cv=10, random_state=42, max_iter=10000)
lasso.fit(X_scaled, y)
print(f"\nLASSO REGRESSION (lambda = {lasso.alpha_:.4f})")
print("=" * 50)
print(f"{'Feature':<20} {'Coefficient':>12} {'Selected':>10}")
print("-" * 50)
for feat, coef in zip(features.columns, lasso.coef_):
    selected = "YES" if abs(coef) > 0.001 else "no"
    print(f"{feat:<20} {coef:>12.4f} {selected:>10}")
lasso_r2 = lasso.score(X_scaled, y)
print(f"\nLASSO R-squared: {lasso_r2:.4f}")
# Step 4: Ridge for comparison
ridge = RidgeCV(cv=10, alphas=np.logspace(-3, 3, 100))
ridge.fit(X_scaled, y)
print(f"\nRIDGE REGRESSION (lambda = {ridge.alpha_:.4f})")
print("=" * 50)
for feat, coef in zip(features.columns, ridge.coef_):
    print(f"{feat:<20} {coef:>12.4f}")
ridge_r2 = ridge.score(X_scaled, y)
print(f"\nRidge R-squared: {ridge_r2:.4f}")
# Step 5: Compare OLS, LASSO, and Ridge via cross-validation
from sklearn.linear_model import LinearRegression
models = {
'OLS (all features)': LinearRegression(),
'LASSO': LassoCV(cv=5, random_state=42, max_iter=10000),
'Ridge': RidgeCV(cv=5, alphas=np.logspace(-3, 3, 100)),
'Elastic Net': ElasticNetCV(cv=5, random_state=42, max_iter=10000)
}
print("\nCROSS-VALIDATED MODEL COMPARISON")
print("=" * 50)
for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=10, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    print(f"{name:<25} RMSE = {rmse:.3f}")
This pipeline reveals the practical consequences of multicollinearity. You will typically see that total_yards, first_downs, pass_yards, and rush_yards have extremely high VIFs because they are arithmetic combinations of each other. LASSO will zero out the redundant predictors and retain only the most informative subset, typically selecting total_yards (or its components) alongside turnovers and redzone_pct while discarding the correlated alternatives.
Callout: The Art of Feature Engineering
Raw box-score stats are often less powerful than derived features. Consider creating:
- Differentials (team stat minus opponent stat) rather than raw values
- Per-play metrics (yards per play, points per drive) rather than totals
- Rolling averages (last 5 games) rather than season-long averages
- Opponent-adjusted stats (performance relative to what opponents typically allow)
Good feature engineering is often worth more than algorithmic sophistication.
9.4 Model Diagnostics and Residual Analysis
Why Diagnostics Matter
A regression model is built on assumptions. When those assumptions are violated, the model's coefficient estimates, standard errors, confidence intervals, and predictions can all be unreliable. Model diagnostics are the detective work that reveals whether your model is trustworthy or misleading.
The key assumptions of OLS regression are:
- Linearity: The relationship between predictors and outcome is linear
- Independence: Residuals are independent of each other
- Homoscedasticity: Residuals have constant variance
- Normality: Residuals are approximately normally distributed
- No perfect multicollinearity: No predictor is a perfect linear combination of others
Residual Plots and What They Tell You
A residual is the difference between the observed value and the predicted value:
$$e_i = y_i - \hat{y}_i$$
The most informative diagnostic is the residuals vs. fitted values plot. In a well-specified model, this plot should show random scatter around zero with constant spread. Common pathological patterns include:
- Funnel shape (spread increases with fitted values): heteroscedasticity
- Curved pattern: nonlinearity; a squared term or transformation may be needed
- Clusters: possible omitted categorical variable
- Trend: possible omitted variable or misspecification
Normality of Residuals
OLS coefficient estimates remain unbiased even when residuals are not normal (normality is not among the Gauss-Markov conditions), but inference (t-tests, confidence intervals) relies on approximately normal residuals, especially in small samples. Diagnostics include:
- Q-Q plot: Residuals plotted against theoretical normal quantiles. Deviations at the tails indicate heavy-tailed or skewed residuals.
- Shapiro-Wilk test: A formal test of normality. In practice, with large samples, this test is overly sensitive and often rejects normality even when deviations are trivial.
- Histogram of residuals: A visual check for obvious skewness or multimodality.
In sports data, residuals often have slightly heavier tails than a normal distribution because of blowout games, overtime periods, and other extreme events. This is generally acceptable for practical purposes.
Heteroscedasticity Detection and Correction
Heteroscedasticity means that the variance of residuals changes across the range of predicted values. In sports data, this is common: high-scoring games may have more variable outcomes than low-scoring games.
Breusch-Pagan test: Regress squared residuals on the predictors. If the $R^2$ is significant, heteroscedasticity is present.
$$\text{Test statistic} = n \times R^2_{\text{aux}} \sim \chi^2_p$$
White's test: A more general version that does not assume a specific form of heteroscedasticity.
Correction: Use heteroscedasticity-consistent (HC) standard errors (also called "robust" or "White" standard errors). In statsmodels, this is as simple as specifying cov_type='HC3' when fitting the model. The coefficient estimates remain the same, but the standard errors are corrected to account for non-constant variance.
Influential Observations
Some observations exert disproportionate influence on the regression results. Identifying these is essential in sports data, where outlier games (a team scores 60 points, a key player is ejected in the first quarter) can distort your model.
Leverage ($h_{ii}$): Measures how far an observation's predictor values are from the mean. The diagonal elements of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ give the leverage values. High leverage means the observation has unusual predictor values.
Cook's Distance: Combines leverage and residual size to measure overall influence:
$$D_i = \frac{e_i^2}{p \times \hat{\sigma}^2} \times \frac{h_{ii}}{(1 - h_{ii})^2}$$
A common rule of thumb is that $D_i > 4/n$ warrants investigation, though $D_i > 1$ is a stronger threshold.
DFFITS: Another influence measure; observations with $|\text{DFFITS}_i| > 2\sqrt{p/n}$ are potentially influential.
In practice, you should not automatically remove influential observations. Instead, investigate them: Was there a data entry error? Did something unusual happen in the game (weather, injuries, garbage time)? If the observation is legitimate, consider whether your model needs to accommodate the phenomenon it represents.
Autocorrelation in Time Series Sports Data
When modeling sports data across time (e.g., a team's game-by-game performance over a season), residuals may be autocorrelated: the residual from game $t$ is correlated with the residual from game $t-1$. This violates the independence assumption and leads to overly optimistic standard errors.
The Durbin-Watson statistic tests for first-order autocorrelation:
$$DW = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n}e_i^2}$$
Values near 2 indicate no autocorrelation; values significantly below 2 indicate positive autocorrelation; values above 2 suggest negative autocorrelation. In game-level data, autocorrelation is usually mild, but in player-level time series (e.g., predicting a pitcher's next-start ERA), it can be significant.
Python Code: Comprehensive Diagnostic Suite
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
# Use the NFL model from Section 9.1
np.random.seed(42)
n = 256
total_yards = np.random.normal(340, 60, n)
turnover_margin = np.random.poisson(1.2, n) - np.random.poisson(1.2, n)
time_of_poss = np.random.normal(30, 3, n)
points = -8.0 + 0.075 * total_yards + 3.5 * turnover_margin + 0.15 * time_of_poss
points += np.random.normal(0, 6, n)
points = np.clip(points, 0, None)
df = pd.DataFrame({
'points': points,
'total_yards': total_yards,
'turnover_margin': turnover_margin,
'time_of_poss': time_of_poss
})
X = sm.add_constant(df[['total_yards', 'turnover_margin', 'time_of_poss']])
y = df['points']
model = sm.OLS(y, X).fit()
# --- RESIDUAL ANALYSIS ---
residuals = model.resid
fitted = model.fittedvalues
standardized_resid = model.get_influence().resid_studentized_internal
print("MODEL DIAGNOSTICS REPORT")
print("=" * 60)
# 1. Normality tests
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"\n1. NORMALITY OF RESIDUALS")
print(f" Shapiro-Wilk statistic: {shapiro_stat:.4f}")
print(f" Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f" Skewness: {stats.skew(residuals):.4f}")
print(f" Kurtosis: {stats.kurtosis(residuals):.4f}")
if shapiro_p > 0.05:
    print(" -> Residuals appear normally distributed")
else:
    print(" -> Evidence of non-normality (examine Q-Q plot)")
# 2. Heteroscedasticity tests
bp_stat, bp_p, _, _ = het_breuschpagan(residuals, X)
print(f"\n2. HETEROSCEDASTICITY")
print(f" Breusch-Pagan statistic: {bp_stat:.4f}")
print(f" Breusch-Pagan p-value: {bp_p:.4f}")
if bp_p > 0.05:
    print(" -> No significant heteroscedasticity detected")
else:
    print(" -> Heteroscedasticity detected; use robust standard errors")
    robust_model = sm.OLS(y, X).fit(cov_type='HC3')
    print(" -> Robust standard errors computed (HC3)")
# 3. Autocorrelation
dw = durbin_watson(residuals)
print(f"\n3. AUTOCORRELATION")
print(f" Durbin-Watson statistic: {dw:.4f}")
if 1.5 < dw < 2.5:
    print(" -> No significant autocorrelation detected")
else:
    print(" -> Possible autocorrelation; consider time-series methods")
# 4. Influential observations
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]
leverage = influence.hat_matrix_diag
n_obs = len(y)
n_params = len(model.params)
high_cooks = np.sum(cooks_d > 4 / n_obs)
high_leverage = np.sum(leverage > 2 * n_params / n_obs)
print(f"\n4. INFLUENTIAL OBSERVATIONS")
print(f" Observations with Cook's D > {4/n_obs:.4f}: {high_cooks}")
print(f" Observations with high leverage (>{2*n_params/n_obs:.4f}): {high_leverage}")
# Top 5 most influential observations
top_influential = np.argsort(cooks_d)[-5:][::-1]
print(f"\n Top 5 Influential Observations:")
print(f" {'Index':<8} {'Cook D':>10} {'Leverage':>10} {'Resid':>10}")
for idx in top_influential:
    print(f" {idx:<8} {cooks_d[idx]:>10.4f} {leverage[idx]:>10.4f} "
          f"{residuals.iloc[idx]:>10.2f}")
# 5. Overall model assessment
print(f"\n5. MODEL FIT SUMMARY")
print(f" R-squared: {model.rsquared:.4f}")
print(f" Adjusted R-squared: {model.rsquared_adj:.4f}")
print(f" AIC: {model.aic:.1f}")
print(f" BIC: {model.bic:.1f}")
print(f" RMSE: {np.sqrt(model.mse_resid):.3f}")
print(f" Mean Absolute Error: {np.mean(np.abs(residuals)):.3f}")
When Your Model Is Wrong: What Residuals Reveal
Residual analysis is not just a box-checking exercise. It is a diagnostic conversation between you and your model. Here is a decision framework:
| Pattern in Residuals | Likely Cause | Suggested Fix |
|---|---|---|
| Funnel shape (variance increases with fitted values) | Heteroscedasticity | Use robust SE, or transform the outcome (e.g., log) |
| U-shape or inverted U | Missing nonlinear term | Add polynomial or interaction terms |
| Systematic trend over time | Autocorrelation or concept drift | Add time component, use rolling windows |
| Clusters of large residuals at specific values | Omitted categorical variable | Add relevant categorical predictor (e.g., dome vs. outdoor) |
| A few extreme outliers | Unusual games or data errors | Investigate individually; consider robust regression |
| Residuals correlated with an omitted variable | Omitted variable bias | Add the missing predictor to the model |
Callout: The Residual as a Betting Tool
Large systematic residuals point to where your model is blind. If your model consistently underestimates the scoring of teams playing in dome stadiums, that is not just a diagnostic finding---it is a potential betting opportunity. Teams playing in domes may be systematically undervalued by the market if other bettors use similarly incomplete models. Mining your own model's failures is one of the most productive activities in sports analytics.
9.5 From Regression Output to Betting Lines
The Final Mile: From Model to Market
A regression model that predicts accurately but sits unused in a Jupyter notebook generates zero profit. The critical final step is converting model output into actionable betting signals by comparing your predictions against the market's prices. This section builds that bridge.
Converting Predicted Margins to Spreads
A linear regression model that predicts a scoring margin produces a natural spread estimate. If your model predicts:
$$\widehat{\text{HomePoints}} - \widehat{\text{AwayPoints}} = 4.7$$
Then your model's implied spread is Home -4.7. If the market line is Home -3, your model sees value on the home team (you believe they will win by more than the market expects).
However, raw predictions need adjustment:
- Home-field advantage: Ensure your model accounts for home-field advantage. In the NFL, this has historically been worth approximately 2.5--3 points, though it has declined in recent years. In the NBA, it is roughly 3--4 points. If your model does not include a home indicator variable, your spread estimates will be systematically biased.
- Key numbers in football: NFL scores cluster around certain margins (3, 7, 10, 14). A model that predicts a margin of 3.2 versus a market line of 3 is qualitatively different from a prediction of 6.2 versus a line of 6, because 3 and 7 are key numbers where pushes and backdoor covers frequently occur.
- Regression to the mean: Early-season predictions based on small samples should be shrunk toward the league mean. A team that averages 35 points through 3 games will probably not sustain that pace over 17 games.
Converting Win Probabilities to Moneylines
Logistic regression outputs probabilities directly, which convert to moneylines using these formulas:
For a favorite (probability $p > 0.5$):
$$\text{American Odds} = -\frac{p}{1-p} \times 100$$
For an underdog (probability $p < 0.5$):
$$\text{American Odds} = +\frac{1-p}{p} \times 100$$
Example: A model gives the home team a 63% win probability.
$$\text{Fair Moneyline} = -\frac{0.63}{0.37} \times 100 = -170$$
If the market offers the home team at $-150$ (implied probability = 60%), there is a potential edge. The model believes the team should be priced at $-170$ but the market offers $-150$, meaning the team is underpriced.
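The conversion is easy to wrap in a small helper and sanity-check against the example; a minimal sketch (rounding to the nearest whole number, as lines are usually quoted):
def prob_to_moneyline(p):
    """Convert a win probability into fair American odds."""
    if p >= 0.5:
        return -round(100 * p / (1 - p))   # favorite
    return round(100 * (1 - p) / p)        # underdog

print(prob_to_moneyline(0.63))   # -170, matching the worked example
print(prob_to_moneyline(0.37))   # +170, the complementary underdog price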
Converting Predicted Totals
If your model predicts the total points for each team separately, the game total is simply:
$$\widehat{\text{Total}} = \widehat{\text{HomePoints}} + \widehat{\text{AwayPoints}}$$
Compare this to the market total. If your model predicts 48.3 and the market total is 45.5, you have a signal toward the over.
Calibrating Model Outputs
A model is calibrated if its predictions match observed frequencies. For a win probability model, this means:
- Of all games where the model predicts a 60% win probability, approximately 60% should be wins
- Of all games where the model predicts an 80% win probability, approximately 80% should be wins
Calibration is assessed by:
- Binning predictions: Group predictions into buckets (e.g., 50-55%, 55-60%, ..., 95-100%)
- Computing observed frequency: Within each bucket, calculate the actual win rate
- Plotting: A calibration plot shows predicted probability (x-axis) versus observed frequency (y-axis). A perfectly calibrated model lies on the 45-degree line.
The Brier Score provides a single number measuring calibration and discrimination:
$$\text{Brier Score} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)^2$$
where $p_i$ is the predicted probability and $y_i$ is the actual outcome (0 or 1). The Brier score ranges from 0 (perfect) to 1 (worst possible). A baseline model that always predicts 50% has a Brier score of 0.25. Any useful model should beat this baseline.
The Gap Between Model and Market = Potential Edge
The fundamental equation of quantitative betting is:
$$\text{Edge} = \text{Model Probability} - \text{Implied Market Probability}$$
A positive edge means the model sees more value than the market is pricing. But edge alone is not sufficient---you also need:
- Statistical significance: Is the edge larger than the uncertainty in your model's estimate?
- Consistency: Does the model's edge persist across different time periods and subsets of data?
- Vig coverage: The edge must exceed the bookmaker's margin (typically 4--5% in terms of total implied probability) to be profitable.
A useful rule of thumb: require a minimum edge of 3--5% before placing a bet. An edge of 1% is likely to be consumed by vig and modeling error.
Building a Basic Betting Signal
A complete pipeline from regression to bet recommendation involves:
- Train: Fit the regression model on historical data
- Predict: Generate predictions for upcoming games
- Compare: Compare model predictions to market lines
- Filter: Only bet when the edge exceeds a minimum threshold
- Size: Determine bet size based on edge magnitude (Kelly Criterion, covered in Part 4)
- Track: Log all bets and outcomes for ongoing model evaluation
Python Code: Complete Pipeline from Regression Model to Bet Recommendation
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
# =====================================================
# PART 1: Build the model (training phase)
# =====================================================
np.random.seed(42)
n_train = 1500 # ~5 NBA seasons
# Simulated historical data: Four Factors differentials
train_data = pd.DataFrame({
'efg_diff': np.random.normal(0, 4.5, n_train),
'tov_diff': np.random.normal(0, 3.0, n_train),
'orb_diff': np.random.normal(0, 5.0, n_train),
'ftr_diff': np.random.normal(0, 8.0, n_train),
'home': np.random.binomial(1, 0.5, n_train)
})
log_odds = (
-0.18 + 0.145 * train_data['efg_diff']
- 0.075 * train_data['tov_diff']
+ 0.038 * train_data['orb_diff']
+ 0.019 * train_data['ftr_diff']
+ 0.32 * train_data['home']
)
train_data['win'] = np.random.binomial(1, 1 / (1 + np.exp(-log_odds)))
# Fit logistic regression
features = ['efg_diff', 'tov_diff', 'orb_diff', 'ftr_diff', 'home']
X_train = train_data[features]
y_train = train_data['win']
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# =====================================================
# PART 2: Generate predictions for upcoming games
# =====================================================
upcoming_games = pd.DataFrame({
'game_id': ['LAL@BOS', 'MIA@MIL', 'GSW@DEN', 'PHX@DAL', 'NYK@PHI'],
'home_team': ['BOS', 'MIL', 'DEN', 'DAL', 'PHI'],
'away_team': ['LAL', 'MIA', 'GSW', 'PHX', 'NYK'],
'efg_diff': [2.1, -1.5, 3.8, 0.5, -2.3],
'tov_diff': [-0.8, 1.2, -1.5, 0.3, 0.9],
'orb_diff': [1.5, -2.0, 3.0, 1.0, -1.5],
'ftr_diff': [3.0, -4.0, 6.0, 2.0, -3.0],
'home': [1, 1, 1, 1, 1],
'market_home_ml': [-180, +130, -220, -140, +110],
})
# Model predictions
X_upcoming = upcoming_games[features]
home_win_probs = model.predict_proba(X_upcoming)[:, 1]
upcoming_games['model_prob'] = home_win_probs
# =====================================================
# PART 3: Compare model to market
# =====================================================
def american_to_prob(ml):
    """Convert American odds to implied probability."""
    if ml < 0:
        return abs(ml) / (abs(ml) + 100)
    else:
        return 100 / (ml + 100)

def prob_to_american(p):
    """Convert probability to American odds."""
    if p >= 0.5:
        return int(-p / (1 - p) * 100)
    else:
        return int((1 - p) / p * 100)

def remove_vig(home_ml, away_ml=None):
    """Remove vig from moneyline to get true implied probability.
    If away_ml is not provided, estimate it."""
    home_imp = american_to_prob(home_ml)
    if away_ml is None:
        # Estimate away line from home line
        away_imp = 1 - home_imp + 0.02  # approximate vig
    else:
        away_imp = american_to_prob(away_ml)
    total_imp = home_imp + away_imp
    return home_imp / total_imp  # no-vig probability
upcoming_games['market_prob'] = upcoming_games['market_home_ml'].apply(
lambda ml: remove_vig(ml)
)
upcoming_games['edge'] = upcoming_games['model_prob'] - upcoming_games['market_prob']
upcoming_games['model_fair_ml'] = upcoming_games['model_prob'].apply(prob_to_american)
# =====================================================
# PART 4: Filter and generate recommendations
# =====================================================
MIN_EDGE = 0.03 # minimum 3% edge required
print("BET RECOMMENDATION REPORT")
print("=" * 80)
print(f"{'Game':<12} {'Model%':>8} {'Market%':>9} {'Edge':>8} {'Fair ML':>9} "
f"{'Mkt ML':>8} {'Signal':>10}")
print("-" * 80)
for _, game in upcoming_games.iterrows():
    edge = game['edge']
    if abs(edge) >= MIN_EDGE:
        if edge > 0:
            signal = f"HOME ({game['home_team']})"
        else:
            signal = f"AWAY ({game['away_team']})"
    else:
        signal = "NO BET"
    print(f"{game['game_id']:<12} {game['model_prob']:>7.1%} {game['market_prob']:>8.1%} "
          f"{edge:>+7.1%} {game['model_fair_ml']:>+9d} {game['market_home_ml']:>+8d} "
          f"{signal:>10}")
# =====================================================
# PART 5: Calibration check on training data
# =====================================================
train_probs = model.predict_proba(X_train)[:, 1]
fraction_pos, mean_pred = calibration_curve(y_train, train_probs, n_bins=10)
print("\nCALIBRATION CHECK (Training Data)")
print("=" * 45)
print(f"{'Predicted':>12} {'Observed':>12} {'Count':>8}")
print("-" * 45)
# Count observations per bin; calibration_curve drops empty bins,
# so filter the counts the same way to keep the rows aligned
bin_edges = np.linspace(0, 1, 11)
counts = np.histogram(train_probs, bins=bin_edges)[0]
counts = counts[counts > 0]
for pred, obs, count in zip(mean_pred, fraction_pos, counts):
    print(f"{pred:>11.3f} {obs:>11.3f} {int(count):>8d}")
from sklearn.metrics import brier_score_loss
brier = brier_score_loss(y_train, train_probs)
print(f"\nBrier Score: {brier:.4f} (baseline 0.2500)")
print(f"Brier Skill Score: {1 - brier/0.25:.4f}")
This code produces a complete betting recommendation report. For each upcoming game, it shows:
- The model's estimated win probability for the home team
- The market's implied probability (after vig removal)
- The edge (model minus market)
- A bet recommendation (home, away, or no bet)
A sample output might look like:
| Game | Model% | Market% | Edge | Fair ML | Mkt ML | Signal |
|---|---|---|---|---|---|---|
| LAL@BOS | 66.3% | 63.0% | +3.3% | -197 | -180 | HOME (BOS) |
| MIA@MIL | 47.2% | 42.6% | +4.6% | +112 | +130 | HOME (MIL) |
| GSW@DEN | 72.1% | 67.4% | +4.7% | -258 | -220 | HOME (DEN) |
| PHX@DAL | 55.8% | 57.2% | -1.4% | -126 | -140 | NO BET |
| NYK@PHI | 43.5% | 46.7% | -3.2% | +130 | +110 | AWAY (NYK) |
The NYK@PHI game shows a negative edge for the home team, which translates to a positive edge for the away side, while MIA@MIL shows the model rating the home underdog more highly than the market does. The PHX@DAL game is priced close to the model's estimate, so no bet is recommended.
Callout: The Humility Check
Before risking real money, ask yourself these questions:
- Has the model been tested on out-of-sample data (data it was not trained on)?
- Would the historical edges have been profitable after vig?
- Is the edge large enough to survive estimation error in the model?
- Are you confident the market has not already incorporated the information your model uses?
If the answer to any of these is "no" or "I am not sure," you are not yet ready to bet with real money. Paper trading (tracking hypothetical bets without wagering) is a prudent intermediate step.
The Edge Estimation Framework
To formalize the relationship between model uncertainty and required edge, consider that your model's predicted probability $\hat{p}$ has its own standard error, $\text{SE}(\hat{p})$. A rough estimate of the standard error for a logistic regression prediction is:
$$\text{SE}(\hat{p}) \approx \hat{p}(1 - \hat{p}) \times \sqrt{\mathbf{x}_0^T (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{x}_0}$$
where $\mathbf{W}$ is the diagonal matrix of weights in the iteratively reweighted least squares algorithm. In practice, a model with $\hat{p} = 0.65$ and $\text{SE}(\hat{p}) = 0.04$ means the model's "true" edge could plausibly be anywhere from $0.65 - 2(0.04) = 0.57$ to $0.65 + 2(0.04) = 0.73$. If the market's implied probability is 0.62, the nominal edge of 0.03 could easily be zero. Only bet when the edge clearly exceeds the model's own uncertainty.
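The quantity under the square root is available from any fitted logistic model's coefficient covariance matrix, so the delta-method standard error takes only a few lines. A minimal sketch on simulated data (the feature name efg_diff and all numbers are illustrative assumptions):
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(3)
n = 800
x = np.random.normal(0, 4.5, n)                      # e.g., an eFG% differential
y = np.random.binomial(1, 1 / (1 + np.exp(-(0.1 + 0.15 * x))))

X = sm.add_constant(pd.DataFrame({'efg_diff': x}))
logit_res = sm.Logit(y, X).fit(disp=0)

# Delta-method SE of the predicted probability at a new feature vector
x0 = np.array([1.0, 3.5])                            # intercept, efg_diff = +3.5
beta = np.asarray(logit_res.params)
cov = np.asarray(logit_res.cov_params())             # equals (X'WX)^{-1} at the MLE
eta = x0 @ beta                                      # linear predictor (log-odds)
p_hat = 1 / (1 + np.exp(-eta))
se_p = p_hat * (1 - p_hat) * np.sqrt(x0 @ cov @ x0)
print(f"p_hat = {p_hat:.3f}, SE(p_hat) = {se_p:.3f}")
print(f"Rough 95% band: [{p_hat - 2 * se_p:.3f}, {p_hat + 2 * se_p:.3f}]")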
9.6 Chapter Summary
Key Formulas
Ordinary Least Squares (Simple):
$$\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
OLS (Matrix Form):
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
R-squared:
$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}, \quad R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n-1)}{n - p - 1}$$
Logistic Function:
$$p = \frac{1}{1 + e^{-\mathbf{x}^T\boldsymbol{\beta}}}$$
Log-Likelihood (Logistic Regression):
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i)\right]$$
Variance Inflation Factor:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
LASSO Objective:
$$\min_{\boldsymbol{\beta}} \sum(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda\sum|\beta_j|$$
Cook's Distance:
$$D_i = \frac{e_i^2}{p \cdot \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$
Brier Score:
$$\text{BS} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)^2$$
Code Patterns
The Python code in this chapter follows a consistent workflow:
- Data preparation: Create or load a DataFrame with features and outcome
- Model fitting: Use statsmodels for OLS (when you need detailed diagnostics) or sklearn for logistic regression and regularized models (when you need cross-validation and prediction pipelines)
- Diagnostics: Check VIF, residual plots, influence measures, and calibration
- Prediction: Generate predictions for new observations with uncertainty estimates
- Signal generation: Compare model predictions to market lines and filter for minimum edge
Decision Framework: Choosing the Right Regression
| Question | Linear Regression | Logistic Regression | Regularized (LASSO/Ridge) |
|---|---|---|---|
| Outcome type? | Continuous (points, yards) | Binary (win/loss) | Either |
| Number of features? | Few, well-chosen | Few, well-chosen | Many candidates |
| Multicollinearity? | Must address manually | Must address manually | Handled automatically |
| Interpretability? | High | Moderate (odds ratios) | Lower (coefficients shrunk) |
| Feature selection? | Manual | Manual | Automatic (LASSO) |
| When to use? | Predicting totals, margins | Win probability, cover prob | Many correlated predictors |
Key Takeaways
- Linear regression is the foundation for predicting continuous sports outcomes. The OLS estimator is simple, interpretable, and powerful when used correctly. Always report adjusted $R^2$ and prediction intervals, not just point estimates.
- Logistic regression is the natural choice for binary outcomes like win/loss. Its output is a probability that can be directly compared to market odds. The Brier score is the key metric for evaluating probability predictions.
- Feature selection is essential, not optional. Sports data is riddled with multicollinearity (total yards and first downs, passing yards and pass attempts). Use VIF analysis to detect it, and LASSO or Ridge to handle it. Prefer parsimonious models with 5--8 well-chosen features over kitchen-sink models with 30 correlated predictors.
- Model diagnostics are non-negotiable. Residual analysis reveals whether your model's assumptions are met and, more importantly, where your model is systematically wrong. Those systematic errors are both a diagnostic concern and a guide to improvement.
- The final step is the most important. A model that predicts well but is never compared to the market generates zero profit. Build the complete pipeline: train, predict, compare to market, filter for minimum edge, size appropriately, and track results. The edge is in the gap between your model and the market's price.
Practice Problems
- NFL Totals Model: Collect team-level statistics for two NFL seasons. Build a multiple regression model predicting total points scored per game. Which three predictors have the highest explanatory power? What is the model's adjusted $R^2$?
- Logistic Regression for Spread Covering: Using the same data, create a binary outcome variable indicating whether the home team covered the spread. Fit a logistic regression model. Does your model beat the 50% baseline?
- Multicollinearity Drill: Include at least 10 predictors in an NFL regression model and compute VIF for each. Identify which predictors are redundant. Use LASSO to automatically select a parsimonious subset. Compare the RMSE of the full model and the LASSO model using 10-fold cross-validation.
- Calibration Exercise: Fit a logistic regression model predicting NBA wins. Split the data 80/20 into training and test sets. Generate a calibration plot on the test set. Compute the Brier score and the Brier skill score relative to a baseline of always predicting the home-team historical win rate.
- Full Pipeline: For one week of upcoming NBA games, use publicly available team statistics to generate model predictions. Compare your predicted win probabilities to the market moneylines. Identify which games (if any) show an edge exceeding 5%. Track the results and compute your hypothetical P&L.
What's Next: Chapter 10 --- Bayesian Thinking for Sports Bettors
In this chapter, we built regression models from a frequentist perspective: we estimated fixed parameters from data and used confidence intervals and p-values to quantify uncertainty. But there is another way to think about modeling under uncertainty.
Bayesian statistics treats parameters as random variables with probability distributions. Instead of asking "What is the best single estimate of this coefficient?", Bayesian analysis asks "Given the data, what is the full distribution of plausible values for this coefficient?" This framework is especially powerful in sports modeling, where:
- Small samples are the norm (an NFL season is only 17 games per team)
- Prior information is abundant (we know roughly how good a team is before the season starts)
- Updating in real time is essential (incorporating new game results as the season unfolds)
Chapter 10 will introduce Bayesian thinking from the ground up, covering prior distributions, Bayes' theorem, posterior updating, and Bayesian regression. You will learn how to combine your regression models from this chapter with principled uncertainty quantification through the Bayesian lens---a combination that many professional sports bettors consider indispensable.