Learning Objectives
- Build and interpret simple and multiple linear regression models for predicting sports outcomes such as game totals, point margins, and player performance
- Construct logistic regression models for win probability estimation and convert model outputs into actionable betting probabilities
- Apply feature selection techniques including LASSO, Ridge, and VIF analysis to avoid overfitting and multicollinearity in sports datasets
- Perform comprehensive model diagnostics using residual analysis, influence measures, and heteroscedasticity tests to validate regression assumptions
- Transform regression model predictions into betting signals by comparing estimated spreads, totals, and win probabilities against market lines
In This Chapter
- Chapter Overview
- 9.1 Simple and Multiple Linear Regression
- 9.2 Logistic Regression for Win Probability
- 9.3 Feature Selection and Multicollinearity
- 9.4 Model Diagnostics and Residual Analysis
- 9.5 From Regression Output to Betting Lines
- 9.6 Chapter Summary
- What's Next: Chapter 10 --- Bayesian Thinking for Sports Bettors
Chapter 9: Regression Analysis for Sports Modeling
Chapter Overview
In Chapter 6, we learned how to measure the strength and direction of relationships between variables using correlation. In Chapters 7 and 8, we built the probability and hypothesis-testing machinery that allows us to say whether an observed effect is statistically meaningful or merely noise. Now we take the decisive step from describing relationships to predicting outcomes.
Regression analysis is the workhorse of quantitative sports modeling. When an analyst says that "each additional rushing yard is worth 0.08 expected points" or that "a team's three-point shooting percentage has a coefficient of 0.34 in our win probability model," they are communicating results from regression. When a sportsbook sets a total at 47.5 for an NFL game, a regression model almost certainly played a role in arriving at that number.
This chapter covers two pillars of regression: linear regression for continuous outcomes (point totals, scoring margins, yards gained) and logistic regression for binary outcomes (win/loss, cover/miss, over/under). We will build complete models using real-world sports statistics, diagnose their weaknesses, select the most informative features, and ultimately convert raw model output into betting signals that can be compared against the market.
By the end of this chapter, you will possess a complete pipeline: raw data in, betting recommendation out. That pipeline will not be perfect---no model is---but it will be principled, transparent, and improvable.
9.1 Simple and Multiple Linear Regression
The Core Idea
Linear regression models the relationship between one or more independent variables (also called predictors or features) and a dependent variable (the outcome we wish to predict). In its simplest form, we are fitting a straight line through a scatter plot:
$$\hat{y} = \beta_0 + \beta_1 x$$
where $\hat{y}$ is the predicted value of the dependent variable, $\beta_0$ is the intercept, $\beta_1$ is the slope coefficient, and $x$ is the predictor.
In sports modeling, the dependent variable might be total points scored in a game, a team's scoring margin, or a player's fantasy point output. The predictor might be yards per play, turnover differential, or pace of play.
Ordinary Least Squares (OLS) Estimation
The most common method for fitting a linear regression is Ordinary Least Squares. OLS finds the values of $\beta_0$ and $\beta_1$ that minimize the sum of squared residuals:
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
The closed-form solutions for simple linear regression are:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
For multiple linear regression with $p$ predictors, the model becomes:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$
In matrix notation, this is expressed compactly as:
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}$$
where $\mathbf{X}$ is the $n \times (p+1)$ design matrix (with a column of ones for the intercept), $\boldsymbol{\beta}$ is the $(p+1) \times 1$ coefficient vector, and $\hat{\mathbf{y}}$ is the $n \times 1$ vector of predicted values. The OLS solution in matrix form is:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
This elegant formula is the foundation of virtually every linear model you will encounter in sports analytics.
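As a quick, hands-on check of the matrix formula, the sketch below solves the normal equations directly with NumPy on a tiny made-up dataset (five observations, two predictors); the numbers are purely illustrative, and in practice you would let statsmodels or scikit-learn handle the numerical details.
import numpy as np

# Tiny made-up dataset: intercept column, total yards, turnover margin
X = np.array([
    [1.0, 350.0,  1.0],
    [1.0, 290.0, -1.0],
    [1.0, 410.0,  2.0],
    [1.0, 330.0,  0.0],
    [1.0, 370.0, -2.0],
])
y = np.array([24.0, 13.0, 31.0, 20.0, 17.0])

# OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("Normal-equation coefficients:", beta_hat.round(4))

# Equivalent but more numerically stable least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq coefficients:          ", beta_lstsq.round(4))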
Interpreting Coefficients in Sports Contexts
Each coefficient $\beta_j$ in a multiple regression tells you: holding all other predictors constant, a one-unit increase in $x_j$ is associated with a $\beta_j$-unit change in the predicted outcome.
This interpretation is powerful but demands care. Consider an NFL model where:
$$\widehat{\text{Points}} = -12.4 + 0.081 \times \text{TotalYards} + 3.7 \times \text{TurnoverMargin}$$
This tells us:
- Each additional yard of total offense is worth approximately 0.081 points, controlling for turnover differential.
- Each additional turnover gained (relative to turnovers lost) is worth approximately 3.7 points, controlling for total yardage.
- A team with zero yards and zero turnover margin would be predicted to score $-12.4$ points (which illustrates that the intercept is an extrapolation and not always meaningful on its own).
Callout: Correlation Is Not Causation---Still
Regression coefficients describe associations, not causal effects. If your model includes "penalty yards" as a predictor and it has a positive coefficient on scoring, that does not mean penalties cause more scoring. It may reflect that teams with more plays commit more penalties, and more plays produce more points. Always think carefully about the causal story behind your variables before making strategy recommendations.
R-Squared and Adjusted R-Squared
The coefficient of determination, $R^2$, measures the proportion of variance in the dependent variable that is explained by the model:
$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
An $R^2$ of 0.65 means that 65% of the variance in the outcome is explained by the predictors in the model. In sports modeling, $R^2$ values tend to be lower than in many other domains because athletic outcomes contain substantial randomness. An NFL game-total model with $R^2 = 0.30$ at the game level is actually quite useful, even though 70% of the variance remains unexplained.
Adjusted R-squared penalizes the addition of predictors that do not meaningfully improve the model:
$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where $n$ is the number of observations and $p$ is the number of predictors. Unlike $R^2$, which can never decrease when you add a variable, adjusted $R^2$ can decline if a new predictor adds more noise than signal. Always report adjusted $R^2$ when comparing models with different numbers of predictors.
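Both quantities are one-liners once you have observed and fitted values. A minimal sketch with made-up numbers (six games, two predictors assumed):
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = np.array([24, 13, 31, 20, 17, 27], dtype=float)       # observed points
y_hat = np.array([22, 16, 29, 21, 15, 25], dtype=float)   # model predictions
print(f"R-squared:          {r_squared(y, y_hat):.3f}")
print(f"Adjusted R-squared: {adjusted_r_squared(y, y_hat, p=2):.3f}")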
Standard Errors and Confidence Intervals for Coefficients
The standard error of a coefficient quantifies the uncertainty in our estimate. For the $j$th coefficient in a multiple regression:
$$\text{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 (\mathbf{X}^T \mathbf{X})^{-1}_{jj}}$$
where $\hat{\sigma}^2 = \frac{SS_{\text{res}}}{n - p - 1}$ is the estimated variance of the residuals. A 95% confidence interval for $\beta_j$ is:
$$\hat{\beta}_j \pm t_{n-p-1, 0.025} \times \text{SE}(\hat{\beta}_j)$$
In practice, if a coefficient's confidence interval includes zero, we cannot be confident the predictor has a real effect on the outcome.
Prediction Intervals vs. Confidence Intervals
This distinction is crucial for bettors and frequently misunderstood:
- A confidence interval for the mean response tells you the range where the average outcome is expected to fall for a given set of predictor values. It is narrow.
- A prediction interval for an individual observation tells you the range where a single new observation is expected to fall. It is wider because it accounts for both the uncertainty in the model and the inherent variability of individual outcomes.
$$\text{Prediction Interval} = \hat{y}_0 \pm t_{n-p-1,\alpha/2} \times \hat{\sigma}\sqrt{1 + \mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0}$$
$$\text{Confidence Interval for Mean} = \hat{y}_0 \pm t_{n-p-1,\alpha/2} \times \hat{\sigma}\sqrt{\mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0}$$
When betting on a single game, the prediction interval is the appropriate measure of uncertainty. A model might predict an NFL total of 45.2 with a 95% prediction interval of [28.4, 62.0]. That enormous range is a humbling reminder of how much game-level variance exists---and why bankroll management (covered in Part 4) is non-negotiable.
Python Code: Building an NFL Points Model
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Simulated NFL team-game data (one row per team per game)
np.random.seed(42)
n_games = 256  # sample of team-game rows (a full season has 32 teams x 16 games = 512)
# Generate realistic team-game statistics
total_yards = np.random.normal(340, 60, n_games)
turnovers_committed = np.random.poisson(1.2, n_games)
turnovers_forced = np.random.poisson(1.2, n_games)
turnover_margin = turnovers_forced - turnovers_committed
time_of_possession = np.random.normal(30, 3, n_games) # minutes
penalty_yards = np.random.normal(50, 20, n_games)
# Generate points with realistic relationships + noise
points = (
-8.0
+ 0.075 * total_yards
+ 3.5 * turnover_margin
+ 0.15 * time_of_possession
- 0.02 * penalty_yards
+ np.random.normal(0, 6, n_games) # game-level noise
)
points = np.clip(points, 0, None).round(0) # floor at zero
# Create DataFrame
df = pd.DataFrame({
'points': points,
'total_yards': total_yards,
'turnover_margin': turnover_margin,
'time_of_possession': time_of_possession,
'penalty_yards': penalty_yards
})
# Fit multiple linear regression using statsmodels
X = df[['total_yards', 'turnover_margin', 'time_of_possession', 'penalty_yards']]
X = sm.add_constant(X) # adds intercept column
y = df['points']
model = sm.OLS(y, X).fit()
print(model.summary())
# Prediction for a specific game scenario
new_game = pd.DataFrame({
'const': [1.0],
'total_yards': [365],
'turnover_margin': [1],
'time_of_possession': [31.5],
'penalty_yards': [45]
})
predicted_points = model.predict(new_game)
print(f"\nPredicted points: {predicted_points.values[0]:.1f}")
# Get prediction interval
predictions = model.get_prediction(new_game)
pred_summary = predictions.summary_frame(alpha=0.05)
print(f"95% Confidence Interval (mean): [{pred_summary['mean_ci_lower'].values[0]:.1f}, "
f"{pred_summary['mean_ci_upper'].values[0]:.1f}]")
print(f"95% Prediction Interval: [{pred_summary['obs_ci_lower'].values[0]:.1f}, "
f"{pred_summary['obs_ci_upper'].values[0]:.1f}]")
Worked Example: Predicting NFL Game Totals from Team Stats
Suppose we have assembled season-level data for each NFL team across three seasons. For each team-game observation, we know: total yards, turnover margin, time of possession, and penalty yards. Our goal is to predict the total points a team will score.
Step 1: Fit the model. Running the code above produces output like:
| Variable | Coefficient | Std. Error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | -7.83 | 4.21 | -1.86 | 0.064 |
| total_yards | 0.074 | 0.006 | 12.33 | <0.001 |
| turnover_margin | 3.42 | 0.52 | 6.58 | <0.001 |
| time_of_possession | 0.16 | 0.12 | 1.33 | 0.185 |
| penalty_yards | -0.019 | 0.018 | -1.06 | 0.292 |
Step 2: Interpret. Total yards and turnover margin are statistically significant ($p < 0.001$). Each yard is worth about 0.074 points; each turnover gained is worth about 3.4 points. Time of possession and penalty yards are not significant at $\alpha = 0.05$ in this dataset.
Step 3: Evaluate. The model's adjusted $R^2$ is approximately 0.46, meaning it explains about 46% of the game-level variance in scoring. This is solid for a game-level model.
Step 4: Predict. For a team projected to gain 365 total yards with a +1 turnover margin, 31.5 minutes of possession, and 45 penalty yards, the model predicts approximately 24.3 points. The 95% prediction interval spans from roughly 12 to 36 points---a 24-point range that reflects genuine game-level uncertainty.
Callout: What $R^2$ Values Are Typical in Sports?
Game-level models in the NFL typically produce $R^2$ values between 0.25 and 0.50. Season-level models (predicting a team's total wins) can reach 0.60--0.80. Player-level models for statistics like batting average or free-throw percentage tend to have lower $R^2$ because individual performance is noisier. Do not discard a model because its $R^2$ seems "low"---the relevant question is whether it generates profitable betting signals.
9.2 Logistic Regression for Win Probability
Why Logistic Regression for Binary Outcomes
Many of the most important questions in sports betting are binary: Will this team win? Will the game go over the total? Will a team cover the spread? Linear regression is inappropriate for these outcomes because it can produce predicted probabilities below 0 or above 1, and the error structure violates OLS assumptions.
Logistic regression solves these problems by modeling the log-odds of the outcome rather than the outcome itself.
The Logit Function and Sigmoid Curve
The logistic model takes the form:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$
The left side is the logit (log of the odds). Solving for $p$ gives us the sigmoid function:
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}} = \frac{1}{1 + e^{-\mathbf{x}^T\boldsymbol{\beta}}}$$
This function maps any real-valued linear combination into the $(0, 1)$ interval, producing a valid probability. The sigmoid has an S-shaped curve: probabilities near 0.5 change rapidly with small changes in the predictors, while probabilities near 0 or 1 are more resistant to change.
Maximum Likelihood Estimation
Unlike linear regression, logistic regression cannot be solved in closed form. Instead, we use Maximum Likelihood Estimation (MLE). For $n$ observations where $y_i \in \{0, 1\}$:
$$\mathcal{L}(\boldsymbol{\beta}) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$$
Taking the log gives the log-likelihood:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i)\right]$$
MLE finds the $\boldsymbol{\beta}$ that maximizes this expression, typically via iterative algorithms like Newton-Raphson or gradient descent.
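To make the iterative idea concrete, here is a bare-bones gradient-ascent sketch on simulated data with one predictor. It is illustrative only (the learning rate and iteration count are arbitrary choices); real work should use statsmodels or scikit-learn, whose optimizers are faster and more robust.
import numpy as np

np.random.seed(0)
n = 500
X = np.column_stack([np.ones(n), np.random.normal(0, 1, n)])  # intercept + one feature
true_beta = np.array([-0.2, 0.8])
y = np.random.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

beta = np.zeros(2)
learning_rate = 0.01
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))       # sigmoid of the linear predictor
    gradient = X.T @ (y - p)              # gradient of the log-likelihood
    beta += learning_rate * gradient / n  # step uphill toward the maximum

p = 1 / (1 + np.exp(-X @ beta))
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print("Estimated coefficients:", beta.round(3))   # should land near (-0.2, 0.8)
print("Log-likelihood at the estimate:", round(log_lik, 2))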
Interpreting Odds Ratios
In logistic regression, the coefficient $\beta_j$ represents the change in the log-odds of the outcome for a one-unit increase in $x_j$. The odds ratio is:
$$\text{OR}_j = e^{\beta_j}$$
An odds ratio of 1.5 means that a one-unit increase in the predictor multiplies the odds of the outcome by 1.5 (a 50% increase in odds). An odds ratio of 0.8 means the odds decrease by 20%.
For example, if the coefficient on "offensive rebounding percentage" in an NBA win-probability model is 0.06, then the odds ratio is $e^{0.06} \approx 1.062$, meaning each percentage point increase in offensive rebounding percentage multiplies the win odds by about 1.062.
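Because odds and probabilities live on different scales, it helps to translate an odds ratio into a probability change at a concrete baseline. A quick hypothetical check using the 0.06 coefficient from the example and an assumed 50% starting win probability:
import numpy as np

coef = 0.06                    # coefficient on ORB% from the example above
odds_ratio = np.exp(coef)      # multiplicative change in odds per one-point increase
baseline_prob = 0.50           # assumed starting win probability

baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(f"Odds ratio: {odds_ratio:.3f}")
print(f"Win probability moves from {baseline_prob:.1%} to {new_prob:.1%}")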
Building a Win Probability Model: The NBA Four Factors
Dean Oliver's "Four Factors" provide a concise framework for understanding basketball team quality:
- Effective Field Goal Percentage (eFG%): Shooting efficiency, adjusted for three-pointers
- Turnover Percentage (TOV%): Turnovers per 100 possessions
- Offensive Rebounding Percentage (ORB%): Percentage of available offensive rebounds grabbed
- Free Throw Rate (FTR): Free throws attempted per field goal attempted
We can build a win probability model using the differential of each factor (team value minus opponent value).
Python Code: NBA Win Probability Model
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, brier_score_loss
import warnings
warnings.filterwarnings('ignore')
# Simulated NBA game data based on Four Factors differentials
np.random.seed(123)
n_games = 1230 # Approximate full NBA season (each game counted once)
# Generate Four Factor differentials (team - opponent)
efg_diff = np.random.normal(0, 4.5, n_games) # eFG% differential
tov_diff = np.random.normal(0, 3.0, n_games) # TOV% differential (negative is better)
orb_diff = np.random.normal(0, 5.0, n_games) # ORB% differential
ftr_diff = np.random.normal(0, 8.0, n_games) # FT Rate differential
home_court = np.random.binomial(1, 0.5, n_games) # 1 if home
# Generate win probability from true coefficients
log_odds = (
-0.20
+ 0.15 * efg_diff
- 0.08 * tov_diff
+ 0.04 * orb_diff
+ 0.02 * ftr_diff
+ 0.35 * home_court
)
true_prob = 1 / (1 + np.exp(-log_odds))
wins = np.random.binomial(1, true_prob)
# Create DataFrame
df = pd.DataFrame({
'win': wins,
'efg_diff': efg_diff,
'tov_diff': tov_diff,
'orb_diff': orb_diff,
'ftr_diff': ftr_diff,
'home_court': home_court
})
# Fit logistic regression
feature_cols = ['efg_diff', 'tov_diff', 'orb_diff', 'ftr_diff', 'home_court']
X = df[feature_cols]
y = df['win']
log_model = LogisticRegression(max_iter=1000, solver='lbfgs')
log_model.fit(X, y)
# Display coefficients and odds ratios
print("Logistic Regression Coefficients and Odds Ratios")
print("=" * 60)
print(f"{'Feature':<20} {'Coefficient':>12} {'Odds Ratio':>12}")
print("-" * 60)
print(f"{'Intercept':<20} {log_model.intercept_[0]:>12.4f} {np.exp(log_model.intercept_[0]):>12.4f}")
for feat, coef in zip(feature_cols, log_model.coef_[0]):
    print(f"{feat:<20} {coef:>12.4f} {np.exp(coef):>12.4f}")
# Cross-validated accuracy
cv_scores = cross_val_score(log_model, X, y, cv=10, scoring='accuracy')
print(f"\n10-Fold CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Predicted probabilities for a specific matchup
example_game = pd.DataFrame({
'efg_diff': [3.5], # team shoots 3.5% better eFG
'tov_diff': [-1.0], # team has 1% fewer turnovers
'orb_diff': [2.0], # team has 2% better ORB%
'ftr_diff': [5.0], # team gets to the line more
'home_court': [1] # playing at home
})
win_prob = log_model.predict_proba(example_game)[0][1]
print(f"\nPredicted Win Probability: {win_prob:.4f}")
print(f"Implied Fair Odds: {1/win_prob:.2f} to 1")
print(f"Implied Moneyline: {'+' if win_prob < 0.5 else '-'}"
f"{abs(int(100/win_prob - 100)) if win_prob > 0.5 else int(100*(1/win_prob - 1))}")
# Brier Score (calibration measure, lower is better)
y_pred_prob = log_model.predict_proba(X)[:, 1]
brier = brier_score_loss(y, y_pred_prob)
print(f"\nBrier Score: {brier:.4f}")
print(f"Baseline Brier (predict 0.5 always): {brier_score_loss(y, [0.5]*len(y)):.4f}")
Worked Example: NBA Win Probability from Four Factors
Consider the following matchup scenario for an upcoming NBA game:
| Factor | Home Team | Away Team | Differential |
|---|---|---|---|
| eFG% | 53.2% | 49.7% | +3.5 |
| TOV% | 12.8% | 13.8% | -1.0 |
| ORB% | 27.0% | 25.0% | +2.0 |
| FT Rate | 0.30 | 0.25 | +5.0 |
| Home Court | Yes | --- | 1 |
Plugging into our fitted model, suppose the coefficients are:
$$\ln\left(\frac{p}{1-p}\right) = -0.19 + 0.148(3.5) - 0.077(-1.0) + 0.039(2.0) + 0.021(5.0) + 0.34(1)$$
$$= -0.19 + 0.518 + 0.077 + 0.078 + 0.105 + 0.34 = 0.928$$
$$p = \frac{1}{1 + e^{-0.928}} = \frac{1}{1 + 0.395} = 0.717$$
Our model gives the home team a 71.7% win probability. This translates to fair odds of approximately $-253$ on the American moneyline. If the actual moneyline is $-200$ (implied probability of 66.7%), our model identifies a potential edge of about 5 percentage points.
From Logistic Regression Output to Betting Probabilities
The conversion from logistic regression output to betting-relevant probabilities is direct---the model literally outputs probabilities. However, two critical adjustments must be considered:
- Calibration: A model that says "70% win probability" should win roughly 70% of the time across many such predictions. We will examine calibration in Section 9.5.
- Vig removal: Market odds include the bookmaker's margin (vig). To compare your model's probabilities fairly, you must convert the book's line to a no-vig probability. If a book offers $-200/+170$, the implied probabilities are 66.7% and 37.0%, summing to 103.7%. Removing the vig proportionally yields approximately 64.3% and 35.7%, as sketched below.
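Here is a minimal sketch of proportional (multiplicative) vig removal for a two-way market, checked against the $-200/+170$ example; other de-vigging methods exist, but this is the simplest starting point.
def american_to_prob(ml):
    """Convert American odds to the raw implied probability (vig included)."""
    return abs(ml) / (abs(ml) + 100) if ml < 0 else 100 / (ml + 100)

def no_vig_probs(ml_a, ml_b):
    """Proportionally strip the bookmaker's margin from a two-way market."""
    p_a, p_b = american_to_prob(ml_a), american_to_prob(ml_b)
    overround = p_a + p_b          # sums to more than 1 because of the vig
    return p_a / overround, p_b / overround

fav, dog = no_vig_probs(-200, +170)
print(f"No-vig probabilities: favorite {fav:.3f}, underdog {dog:.3f}")
# Roughly 0.643 and 0.357, matching the worked numbers above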
Callout: When to Use Linear vs. Logistic Regression
| Outcome Type | Example | Model |
|---|---|---|
| Continuous | Points scored, yardage, time of possession | Linear regression |
| Binary | Win/loss, cover/miss, over/under | Logistic regression |
| Ordinal | Rankings (1st, 2nd, 3rd) | Ordinal logistic regression |
| Count data | Goals scored, touchdowns | Poisson regression |
Always match the model to the outcome variable. Using linear regression for a binary outcome will produce predictions outside [0, 1] and violate model assumptions.
9.3 Feature Selection and Multicollinearity
The Curse of Dimensionality in Sports Modeling
Modern sports databases contain hundreds of potential predictors. An NFL game log might include 50+ offensive and defensive statistics per team. An NBA dataset might track dozens of advanced metrics. The temptation is to include everything and "let the model sort it out." This is a mistake.
Adding too many features relative to the number of observations leads to overfitting: the model memorizes noise in the training data and performs poorly on new data. With $n$ observations and $p$ predictors, problems become severe when $p$ approaches $n$. A common rule of thumb is that you need at least 10--20 observations per predictor to estimate coefficients reliably.
If you have three NFL seasons of 256 games each ($n = 768$), using 50 predictors is aggressive. Using 100 is reckless.
Multicollinearity
Multicollinearity occurs when two or more predictors are highly correlated with each other. In sports, this is ubiquitous:
- Total yards and first downs are strongly correlated
- Passing yards and passing attempts are highly correlated
- Offensive rating and points per game move together
Multicollinearity does not bias the overall predictions, but it inflates the standard errors of individual coefficients, making it impossible to isolate the effect of any one variable. You might see a situation where total yards and first downs are both insignificant individually, but removing one makes the other highly significant.
Variance Inflation Factor (VIF)
The Variance Inflation Factor quantifies multicollinearity for each predictor:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing predictor $x_j$ on all other predictors. A VIF of 1 means no collinearity; a VIF of 5 means the variance of that coefficient is inflated fivefold; a VIF above 10 is a clear red flag.
| VIF Range | Interpretation | Action |
|---|---|---|
| 1.0--2.0 | No concern | No action needed |
| 2.0--5.0 | Moderate | Monitor, consider combining |
| 5.0--10.0 | High | Seriously consider removing or combining |
| >10.0 | Severe | Remove or combine predictors |
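The definition translates directly into code: regress each predictor on the remaining predictors and convert the resulting $R_j^2$ into a VIF. A small sketch on made-up, deliberately redundant data (statsmodels also provides variance_inflation_factor, used later in this chapter):
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(7)
n = 300
rush_yards = np.random.normal(115, 35, n)
pass_yards = np.random.normal(230, 50, n)
total_yards = rush_yards + pass_yards + np.random.normal(0, 5, n)  # nearly redundant
X = pd.DataFrame({'rush_yards': rush_yards,
                  'pass_yards': pass_yards,
                  'total_yards': total_yards})

for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    r2_j = sm.OLS(X[col], others).fit().rsquared   # regress x_j on the rest
    vif_j = 1 / (1 - r2_j)
    print(f"{col:<12} VIF = {vif_j:.1f}")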
Feature Selection Methods
Forward Selection
Start with no predictors. Add the one that most improves the model (by AIC, BIC, or adjusted $R^2$). Repeat until no remaining predictor meets the inclusion criterion.
Backward Elimination
Start with all predictors. Remove the least significant one. Refit and repeat until all remaining predictors are significant.
Stepwise Selection
Combine forward and backward: add the best variable, then check whether any previously included variable should be removed. Continue until no changes improve the model.
Information Criteria
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) balance model fit against complexity:
$$\text{AIC} = -2\ell + 2p$$
$$\text{BIC} = -2\ell + p \ln(n)$$
where $\ell$ is the log-likelihood, $p$ is the number of parameters, and $n$ is the sample size. Lower values indicate a better balance of fit and parsimony. BIC penalizes complexity more heavily than AIC, leading to sparser models.
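In statsmodels, both criteria are attributes of any fitted result object, so comparing candidate specifications takes only a few lines. A sketch on simulated data, where turnover margin genuinely matters and penalty yards are pure noise (the data-generating numbers are made up for illustration):
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(1)
n = 400
df = pd.DataFrame({
    'total_yards': np.random.normal(340, 60, n),
    'turnover_margin': np.random.poisson(1.2, n) - np.random.poisson(1.2, n),
    'penalty_yards': np.random.normal(50, 20, n),
})
df['points'] = (-8 + 0.075 * df['total_yards'] + 3.5 * df['turnover_margin']
                + np.random.normal(0, 6, n))

X_small = sm.add_constant(df[['total_yards']])
X_full = sm.add_constant(df[['total_yards', 'turnover_margin', 'penalty_yards']])
small = sm.OLS(df['points'], X_small).fit()
full = sm.OLS(df['points'], X_full).fit()

# Lower AIC/BIC indicates a better balance of fit and parsimony
for name, res in [('yards only', small), ('yards + TO margin + penalties', full)]:
    print(f"{name:<32} AIC = {res.aic:8.1f}   BIC = {res.bic:8.1f}")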
LASSO and Ridge Regression
Regularization methods add a penalty term to the OLS objective function, shrinking coefficients toward zero and reducing overfitting.
Ridge Regression (L2 penalty):
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$
Ridge shrinks coefficients but never sets them exactly to zero. It is useful when many predictors contribute small amounts of information.
LASSO Regression (L1 penalty):
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p}|\beta_j|$$
LASSO can shrink coefficients exactly to zero, effectively performing automatic feature selection. This makes it especially useful in sports modeling, where you want a parsimonious model from dozens of candidates.
Elastic Net combines both penalties:
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda_1 \sum_{j=1}^{p}|\beta_j| + \lambda_2 \sum_{j=1}^{p}\beta_j^2$$
The hyperparameter $\lambda$ (and the mixing parameter $\alpha$ in Elastic Net) is chosen via cross-validation: fit the model on $k-1$ folds and evaluate on the held-out fold, iterating across all folds and averaging performance.
Comparison of Feature Selection Methods
| Method | Handles Multicollinearity | Automatic | Produces Sparse Model | Computational Cost |
|---|---|---|---|---|
| Forward Selection | Partially | Semi | Yes | Low |
| Backward Elimination | Partially | Semi | Yes | Moderate |
| Stepwise | Partially | Semi | Yes | Moderate |
| LASSO (L1) | Yes | Yes | Yes | Moderate |
| Ridge (L2) | Yes | Yes | No | Moderate |
| Elastic Net | Yes | Yes | Yes | Moderate |
| VIF Filtering | Yes | Manual | Yes | Low |
Python Code: Feature Selection Pipeline
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
# Simulated NFL dataset with many correlated features
np.random.seed(99)
n = 512
# Generate correlated features (mimicking real football data)
pass_yards = np.random.normal(230, 50, n)
rush_yards = np.random.normal(115, 35, n)
total_yards = pass_yards + rush_yards + np.random.normal(0, 10, n)
first_downs = 0.06 * total_yards + np.random.normal(0, 2, n)
pass_attempts = pass_yards / 7.5 + np.random.normal(0, 3, n)
completions = 0.63 * pass_attempts + np.random.normal(0, 2, n)
rush_attempts = rush_yards / 4.2 + np.random.normal(0, 3, n)
time_of_poss = 0.5 * rush_attempts + 15 + np.random.normal(0, 2, n)
turnovers = np.random.poisson(1.1, n)
penalties = np.random.normal(6, 2, n)
third_down_pct = 35 + 0.02 * total_yards + np.random.normal(0, 8, n)
redzone_pct = 50 + 0.01 * total_yards + np.random.normal(0, 12, n)
# Outcome: points scored
points = (
-5.0 + 0.065 * total_yards - 3.0 * turnovers + 0.08 * redzone_pct
+ np.random.normal(0, 5.5, n)
)
points = np.clip(points, 0, None)
features = pd.DataFrame({
'pass_yards': pass_yards, 'rush_yards': rush_yards,
'total_yards': total_yards, 'first_downs': first_downs,
'pass_attempts': pass_attempts, 'completions': completions,
'rush_attempts': rush_attempts, 'time_of_poss': time_of_poss,
'turnovers': turnovers, 'penalties': penalties,
'third_down_pct': third_down_pct, 'redzone_pct': redzone_pct
})
# Step 1: Check VIF
print("VARIANCE INFLATION FACTORS")
print("=" * 40)
X_vif = sm.add_constant(features)
for i, col in enumerate(X_vif.columns):
    if col == 'const':
        continue
    vif = variance_inflation_factor(X_vif.values, i)
    flag = " *** HIGH" if vif > 5 else ""
    print(f"{col:<20} VIF = {vif:.2f}{flag}")
# Step 2: Standardize features for regularization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)
y = points
# Step 3: LASSO with cross-validation
lasso = LassoCV(cv=10, random_state=42, max_iter=10000)
lasso.fit(X_scaled, y)
print(f"\nLASSO REGRESSION (lambda = {lasso.alpha_:.4f})")
print("=" * 50)
print(f"{'Feature':<20} {'Coefficient':>12} {'Selected':>10}")
print("-" * 50)
for feat, coef in zip(features.columns, lasso.coef_):
    selected = "YES" if abs(coef) > 0.001 else "no"
    print(f"{feat:<20} {coef:>12.4f} {selected:>10}")
lasso_r2 = lasso.score(X_scaled, y)
print(f"\nLASSO R-squared: {lasso_r2:.4f}")
# Step 4: Ridge for comparison
ridge = RidgeCV(cv=10, alphas=np.logspace(-3, 3, 100))
ridge.fit(X_scaled, y)
print(f"\nRIDGE REGRESSION (lambda = {ridge.alpha_:.4f})")
print("=" * 50)
for feat, coef in zip(features.columns, ridge.coef_):
    print(f"{feat:<20} {coef:>12.4f}")
ridge_r2 = ridge.score(X_scaled, y)
print(f"\nRidge R-squared: {ridge_r2:.4f}")
# Step 5: Compare OLS, LASSO, and Ridge via cross-validation
from sklearn.linear_model import LinearRegression
models = {
'OLS (all features)': LinearRegression(),
'LASSO': LassoCV(cv=5, random_state=42, max_iter=10000),
'Ridge': RidgeCV(cv=5, alphas=np.logspace(-3, 3, 100)),
'Elastic Net': ElasticNetCV(cv=5, random_state=42, max_iter=10000)
}
print("\nCROSS-VALIDATED MODEL COMPARISON")
print("=" * 50)
for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=10, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    print(f"{name:<25} RMSE = {rmse:.3f}")
This pipeline reveals the practical consequences of multicollinearity. You will typically see that total_yards, first_downs, pass_yards, and rush_yards have extremely high VIFs because they are arithmetic combinations of each other. LASSO will zero out the redundant predictors and retain only the most informative subset, typically selecting total_yards (or its components) alongside turnovers and redzone_pct while discarding the correlated alternatives.
Callout: The Art of Feature Engineering
Raw box-score stats are often less powerful than derived features. Consider creating:
- Differentials (team stat minus opponent stat) rather than raw values
- Per-play metrics (yards per play, points per drive) rather than totals
- Rolling averages (last 5 games) rather than season-long averages
- Opponent-adjusted stats (performance relative to what opponents typically allow)
Good feature engineering is often worth more than algorithmic sophistication.
9.4 Model Diagnostics and Residual Analysis
Why Diagnostics Matter
A regression model is built on assumptions. When those assumptions are violated, the model's coefficient estimates, standard errors, confidence intervals, and predictions can all be unreliable. Model diagnostics are the detective work that reveals whether your model is trustworthy or misleading.
The key assumptions of OLS regression are:
- Linearity: The relationship between predictors and outcome is linear
- Independence: Residuals are independent of each other
- Homoscedasticity: Residuals have constant variance
- Normality: Residuals are approximately normally distributed
- No perfect multicollinearity: No predictor is a perfect linear combination of others
Residual Plots and What They Tell You
A residual is the difference between the observed value and the predicted value:
$$e_i = y_i - \hat{y}_i$$
The most informative diagnostic is the residuals vs. fitted values plot. In a well-specified model, this plot should show random scatter around zero with constant spread. Common pathological patterns include:
- Funnel shape (spread increases with fitted values): heteroscedasticity
- Curved pattern: nonlinearity; a squared term or transformation may be needed
- Clusters: possible omitted categorical variable
- Trend: possible omitted variable or misspecification
Normality of Residuals
OLS coefficient estimates remain unbiased even when residuals are not normal (normality is not among the Gauss-Markov conditions), but inference (t-tests, confidence intervals) relies on approximately normal residuals, especially in small samples. Diagnostics include:
- Q-Q plot: Residuals plotted against theoretical normal quantiles. Deviations at the tails indicate heavy-tailed or skewed residuals.
- Shapiro-Wilk test: A formal test of normality. In practice, with large samples, this test is overly sensitive and often rejects normality even when deviations are trivial.
- Histogram of residuals: A visual check for obvious skewness or multimodality.
In sports data, residuals often have slightly heavier tails than a normal distribution because of blowout games, overtime periods, and other extreme events. This is generally acceptable for practical purposes.
Heteroscedasticity Detection and Correction
Heteroscedasticity means that the variance of residuals changes across the range of predicted values. In sports data, this is common: high-scoring games may have more variable outcomes than low-scoring games.
Breusch-Pagan test: Regress squared residuals on the predictors. If the $R^2$ is significant, heteroscedasticity is present.
$$\text{Test statistic} = n \times R^2_{\text{aux}} \sim \chi^2_p$$
White's test: A more general version that does not assume a specific form of heteroscedasticity.
Correction: Use heteroscedasticity-consistent (HC) standard errors (also called "robust" or "White" standard errors). In statsmodels, this is as simple as specifying cov_type='HC3' when fitting the model. The coefficient estimates remain the same, but the standard errors are corrected to account for non-constant variance.
Influential Observations
Some observations exert disproportionate influence on the regression results. Identifying these is essential in sports data, where outlier games (a team scores 60 points, a key player is ejected in the first quarter) can distort your model.
Leverage ($h_{ii}$): Measures how far an observation's predictor values are from the mean. The diagonal elements of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ give the leverage values. High leverage means the observation has unusual predictor values.
Cook's Distance: Combines leverage and residual size to measure overall influence:
$$D_i = \frac{e_i^2}{p \times \hat{\sigma}^2} \times \frac{h_{ii}}{(1 - h_{ii})^2}$$
A common rule of thumb is that $D_i > 4/n$ warrants investigation, though $D_i > 1$ is a stronger threshold.
DFFITS: Another influence measure; observations with $|\text{DFFITS}_i| > 2\sqrt{p/n}$ are potentially influential.
In practice, you should not automatically remove influential observations. Instead, investigate them: Was there a data entry error? Did something unusual happen in the game (weather, injuries, garbage time)? If the observation is legitimate, consider whether your model needs to accommodate the phenomenon it represents.
Autocorrelation in Time Series Sports Data
When modeling sports data across time (e.g., a team's game-by-game performance over a season), residuals may be autocorrelated: the residual from game $t$ is correlated with the residual from game $t-1$. This violates the independence assumption and leads to overly optimistic standard errors.
The Durbin-Watson statistic tests for first-order autocorrelation:
$$DW = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n}e_i^2}$$
Values near 2 indicate no autocorrelation; values significantly below 2 indicate positive autocorrelation; values above 2 suggest negative autocorrelation. In game-level data, autocorrelation is usually mild, but in player-level time series (e.g., predicting a pitcher's next-start ERA), it can be significant.
Python Code: Comprehensive Diagnostic Suite
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
# Use the NFL model from Section 9.1
np.random.seed(42)
n = 256
total_yards = np.random.normal(340, 60, n)
turnover_margin = np.random.poisson(1.2, n) - np.random.poisson(1.2, n)
time_of_poss = np.random.normal(30, 3, n)
points = -8.0 + 0.075 * total_yards + 3.5 * turnover_margin + 0.15 * time_of_poss
points += np.random.normal(0, 6, n)
points = np.clip(points, 0, None)
df = pd.DataFrame({
'points': points,
'total_yards': total_yards,
'turnover_margin': turnover_margin,
'time_of_poss': time_of_poss
})
X = sm.add_constant(df[['total_yards', 'turnover_margin', 'time_of_poss']])
y = df['points']
model = sm.OLS(y, X).fit()
# --- RESIDUAL ANALYSIS ---
residuals = model.resid
fitted = model.fittedvalues
standardized_resid = model.get_influence().resid_studentized_internal
print("MODEL DIAGNOSTICS REPORT")
print("=" * 60)
# 1. Normality tests
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"\n1. NORMALITY OF RESIDUALS")
print(f" Shapiro-Wilk statistic: {shapiro_stat:.4f}")
print(f" Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f" Skewness: {stats.skew(residuals):.4f}")
print(f" Kurtosis: {stats.kurtosis(residuals):.4f}")
if shapiro_p > 0.05:
    print(" -> Residuals appear normally distributed")
else:
    print(" -> Evidence of non-normality (examine Q-Q plot)")
# 2. Heteroscedasticity tests
bp_stat, bp_p, _, _ = het_breuschpagan(residuals, X)
print(f"\n2. HETEROSCEDASTICITY")
print(f" Breusch-Pagan statistic: {bp_stat:.4f}")
print(f" Breusch-Pagan p-value: {bp_p:.4f}")
if bp_p > 0.05:
    print(" -> No significant heteroscedasticity detected")
else:
    print(" -> Heteroscedasticity detected; use robust standard errors")
    robust_model = sm.OLS(y, X).fit(cov_type='HC3')
    print(" -> Robust standard errors computed (HC3)")
# 3. Autocorrelation
dw = durbin_watson(residuals)
print(f"\n3. AUTOCORRELATION")
print(f" Durbin-Watson statistic: {dw:.4f}")
if 1.5 < dw < 2.5:
    print(" -> No significant autocorrelation detected")
else:
    print(" -> Possible autocorrelation; consider time-series methods")
# 4. Influential observations
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]
leverage = influence.hat_matrix_diag
n_obs = len(y)
n_params = len(model.params)
high_cooks = np.sum(cooks_d > 4 / n_obs)
high_leverage = np.sum(leverage > 2 * n_params / n_obs)
print(f"\n4. INFLUENTIAL OBSERVATIONS")
print(f" Observations with Cook's D > {4/n_obs:.4f}: {high_cooks}")
print(f" Observations with high leverage (>{2*n_params/n_obs:.4f}): {high_leverage}")
# Top 5 most influential observations
top_influential = np.argsort(cooks_d)[-5:][::-1]
print(f"\n Top 5 Influential Observations:")
print(f" {'Index':<8} {'Cook D':>10} {'Leverage':>10} {'Resid':>10}")
for idx in top_influential:
    print(f" {idx:<8} {cooks_d[idx]:>10.4f} {leverage[idx]:>10.4f} "
          f"{residuals.iloc[idx]:>10.2f}")
# 5. Overall model assessment
print(f"\n5. MODEL FIT SUMMARY")
print(f" R-squared: {model.rsquared:.4f}")
print(f" Adjusted R-squared: {model.rsquared_adj:.4f}")
print(f" AIC: {model.aic:.1f}")
print(f" BIC: {model.bic:.1f}")
print(f" RMSE: {np.sqrt(model.mse_resid):.3f}")
print(f" Mean Absolute Error: {np.mean(np.abs(residuals)):.3f}")
When Your Model Is Wrong: What Residuals Reveal
Residual analysis is not just a box-checking exercise. It is a diagnostic conversation between you and your model. Here is a decision framework:
| Pattern in Residuals | Likely Cause | Suggested Fix |
|---|---|---|
| Funnel shape (variance increases with fitted values) | Heteroscedasticity | Use robust SE, or transform the outcome (e.g., log) |
| U-shape or inverted U | Missing nonlinear term | Add polynomial or interaction terms |
| Systematic trend over time | Autocorrelation or concept drift | Add time component, use rolling windows |
| Clusters of large residuals at specific values | Omitted categorical variable | Add relevant categorical predictor (e.g., dome vs. outdoor) |
| A few extreme outliers | Unusual games or data errors | Investigate individually; consider robust regression |
| Residuals correlated with an omitted variable | Omitted variable bias | Add the missing predictor to the model |
Callout: The Residual as a Betting Tool
Large systematic residuals point to where your model is blind. If your model consistently underestimates the scoring of teams playing in dome stadiums, that is not just a diagnostic finding---it is a potential betting opportunity. Teams playing in domes may be systematically undervalued by the market if other bettors use similarly incomplete models. Mining your own model's failures is one of the most productive activities in sports analytics.
9.5 From Regression Output to Betting Lines
The Final Mile: From Model to Market
A regression model that predicts accurately but sits unused in a Jupyter notebook generates zero profit. The critical final step is converting model output into actionable betting signals by comparing your predictions against the market's prices. This section builds that bridge.
Converting Predicted Margins to Spreads
A linear regression model that predicts a scoring margin produces a natural spread estimate. If your model predicts:
$$\widehat{\text{HomePoints}} - \widehat{\text{AwayPoints}} = 4.7$$
Then your model's implied spread is Home -4.7. If the market line is Home -3, your model sees value on the home team (you believe they will win by more than the market expects).
However, raw predictions need adjustment:
- Home-field advantage: Ensure your model accounts for home-field advantage. In the NFL, this has historically been worth approximately 2.5--3 points, though it has declined in recent years. In the NBA, it is roughly 3--4 points. If your model does not include a home indicator variable, your spread estimates will be systematically biased.
- Key numbers in football: NFL scores cluster around certain margins (3, 7, 10, 14). A model that predicts a margin of 3.2 versus a market line of 3 is qualitatively different from a prediction of 6.2 versus a line of 6, because 3 and 7 are key numbers where pushes and backdoor covers frequently occur.
- Regression to the mean: Early-season predictions based on small samples should be shrunk toward the league mean. A team that averages 35 points through 3 games will probably not sustain that pace over 17 games.
Converting Win Probabilities to Moneylines
Logistic regression outputs probabilities directly, which convert to moneylines using these formulas:
For a favorite (probability $p > 0.5$):
$$\text{American Odds} = -\frac{p}{1-p} \times 100$$
For an underdog (probability $p < 0.5$):
$$\text{American Odds} = +\frac{1-p}{p} \times 100$$
Example: A model gives the home team a 63% win probability.
$$\text{Fair Moneyline} = -\frac{0.63}{0.37} \times 100 = -170$$
If the market offers the home team at $-150$ (implied probability = 60%), there is a potential edge. The model believes the team should be priced at $-170$ but the market offers $-150$, meaning the team is underpriced.
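The conversion is easy to wrap in a small helper and sanity-check against the example; a minimal sketch (rounding to the nearest whole number, as lines are usually quoted):
def prob_to_moneyline(p):
    """Convert a win probability into fair American odds."""
    if p >= 0.5:
        return -round(100 * p / (1 - p))   # favorite
    return round(100 * (1 - p) / p)        # underdog

print(prob_to_moneyline(0.63))   # -170, matching the worked example
print(prob_to_moneyline(0.37))   # +170, the complementary underdog price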
Converting Predicted Totals
If your model predicts the total points for each team separately, the game total is simply:
$$\widehat{\text{Total}} = \widehat{\text{HomePoints}} + \widehat{\text{AwayPoints}}$$
Compare this to the market total. If your model predicts 48.3 and the market total is 45.5, you have a signal toward the over.
Calibrating Model Outputs
A model is calibrated if its predictions match observed frequencies. For a win probability model, this means:
- Of all games where the model predicts a 60% win probability, approximately 60% should be wins
- Of all games where the model predicts an 80% win probability, approximately 80% should be wins
Calibration is assessed by:
- Binning predictions: Group predictions into buckets (e.g., 50-55%, 55-60%, ..., 95-100%)
- Computing observed frequency: Within each bucket, calculate the actual win rate
- Plotting: A calibration plot shows predicted probability (x-axis) versus observed frequency (y-axis). A perfectly calibrated model lies on the 45-degree line.
The Brier Score provides a single number measuring calibration and discrimination:
$$\text{Brier Score} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)^2$$
where $p_i$ is the predicted probability and $y_i$ is the actual outcome (0 or 1). The Brier score ranges from 0 (perfect) to 1 (worst possible). A baseline model that always predicts 50% has a Brier score of 0.25. Any useful model should beat this baseline.
The Gap Between Model and Market = Potential Edge
The fundamental equation of quantitative betting is:
$$\text{Edge} = \text{Model Probability} - \text{Implied Market Probability}$$
A positive edge means the model sees more value than the market is pricing. But edge alone is not sufficient---you also need:
- Statistical significance: Is the edge larger than the uncertainty in your model's estimate?
- Consistency: Does the model's edge persist across different time periods and subsets of data?
- Vig coverage: The edge must exceed the bookmaker's margin (typically 4--5% in terms of total implied probability) to be profitable.
A useful rule of thumb: require a minimum edge of 3--5% before placing a bet. An edge of 1% is likely to be consumed by vig and modeling error.
Building a Basic Betting Signal
A complete pipeline from regression to bet recommendation involves:
- Train: Fit the regression model on historical data
- Predict: Generate predictions for upcoming games
- Compare: Compare model predictions to market lines
- Filter: Only bet when the edge exceeds a minimum threshold
- Size: Determine bet size based on edge magnitude (Kelly Criterion, covered in Part 4)
- Track: Log all bets and outcomes for ongoing model evaluation
Python Code: Complete Pipeline from Regression Model to Bet Recommendation
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
# =====================================================
# PART 1: Build the model (training phase)
# =====================================================
np.random.seed(42)
n_train = 1500 # ~5 NBA seasons
# Simulated historical data: Four Factors differentials
train_data = pd.DataFrame({
'efg_diff': np.random.normal(0, 4.5, n_train),
'tov_diff': np.random.normal(0, 3.0, n_train),
'orb_diff': np.random.normal(0, 5.0, n_train),
'ftr_diff': np.random.normal(0, 8.0, n_train),
'home': np.random.binomial(1, 0.5, n_train)
})
log_odds = (
-0.18 + 0.145 * train_data['efg_diff']
- 0.075 * train_data['tov_diff']
+ 0.038 * train_data['orb_diff']
+ 0.019 * train_data['ftr_diff']
+ 0.32 * train_data['home']
)
train_data['win'] = np.random.binomial(1, 1 / (1 + np.exp(-log_odds)))
# Fit logistic regression
features = ['efg_diff', 'tov_diff', 'orb_diff', 'ftr_diff', 'home']
X_train = train_data[features]
y_train = train_data['win']
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# =====================================================
# PART 2: Generate predictions for upcoming games
# =====================================================
upcoming_games = pd.DataFrame({
'game_id': ['LAL@BOS', 'MIA@MIL', 'GSW@DEN', 'PHX@DAL', 'NYK@PHI'],
'home_team': ['BOS', 'MIL', 'DEN', 'DAL', 'PHI'],
'away_team': ['LAL', 'MIA', 'GSW', 'PHX', 'NYK'],
'efg_diff': [2.1, -1.5, 3.8, 0.5, -2.3],
'tov_diff': [-0.8, 1.2, -1.5, 0.3, 0.9],
'orb_diff': [1.5, -2.0, 3.0, 1.0, -1.5],
'ftr_diff': [3.0, -4.0, 6.0, 2.0, -3.0],
'home': [1, 1, 1, 1, 1],
'market_home_ml': [-180, +130, -220, -140, +110],
})
# Model predictions
X_upcoming = upcoming_games[features]
home_win_probs = model.predict_proba(X_upcoming)[:, 1]
upcoming_games['model_prob'] = home_win_probs
# =====================================================
# PART 3: Compare model to market
# =====================================================
def american_to_prob(ml):
    """Convert American odds to implied probability."""
    if ml < 0:
        return abs(ml) / (abs(ml) + 100)
    else:
        return 100 / (ml + 100)

def prob_to_american(p):
    """Convert probability to American odds."""
    if p >= 0.5:
        return int(-p / (1 - p) * 100)
    else:
        return int((1 - p) / p * 100)

def remove_vig(home_ml, away_ml=None):
    """Remove vig from moneyline to get true implied probability.
    If away_ml is not provided, estimate it."""
    home_imp = american_to_prob(home_ml)
    if away_ml is None:
        # Estimate away line from home line
        away_imp = 1 - home_imp + 0.02  # approximate vig
    else:
        away_imp = american_to_prob(away_ml)
    total_imp = home_imp + away_imp
    return home_imp / total_imp  # no-vig probability
upcoming_games['market_prob'] = upcoming_games['market_home_ml'].apply(
lambda ml: remove_vig(ml)
)
upcoming_games['edge'] = upcoming_games['model_prob'] - upcoming_games['market_prob']
upcoming_games['model_fair_ml'] = upcoming_games['model_prob'].apply(prob_to_american)
# =====================================================
# PART 4: Filter and generate recommendations
# =====================================================
MIN_EDGE = 0.03 # minimum 3% edge required
print("BET RECOMMENDATION REPORT")
print("=" * 80)
print(f"{'Game':<12} {'Model%':>8} {'Market%':>9} {'Edge':>8} {'Fair ML':>9} "
f"{'Mkt ML':>8} {'Signal':>10}")
print("-" * 80)
for _, game in upcoming_games.iterrows():
    edge = game['edge']
    if abs(edge) >= MIN_EDGE:
        if edge > 0:
            signal = f"HOME ({game['home_team']})"
        else:
            signal = f"AWAY ({game['away_team']})"
    else:
        signal = "NO BET"
    print(f"{game['game_id']:<12} {game['model_prob']:>7.1%} {game['market_prob']:>8.1%} "
          f"{edge:>+7.1%} {game['model_fair_ml']:>+9d} {game['market_home_ml']:>+8d} "
          f"{signal:>10}")
# =====================================================
# PART 5: Calibration check on training data
# =====================================================
train_probs = model.predict_proba(X_train)[:, 1]
fraction_pos, mean_pred = calibration_curve(y_train, train_probs, n_bins=10)
print("\nCALIBRATION CHECK (Training Data)")
print("=" * 45)
print(f"{'Predicted':>12} {'Observed':>12} {'Count':>8}")
print("-" * 45)
# Count observations per bin; calibration_curve drops empty bins,
# so filter the counts the same way to keep the rows aligned
bin_edges = np.linspace(0, 1, 11)
counts = np.histogram(train_probs, bins=bin_edges)[0]
counts = counts[counts > 0]
for pred, obs, count in zip(mean_pred, fraction_pos, counts):
    print(f"{pred:>11.3f} {obs:>11.3f} {int(count):>8d}")
from sklearn.metrics import brier_score_loss
brier = brier_score_loss(y_train, train_probs)
print(f"\nBrier Score: {brier:.4f} (baseline 0.2500)")
print(f"Brier Skill Score: {1 - brier/0.25:.4f}")
This code produces a complete betting recommendation report. For each upcoming game, it shows:
- The model's estimated win probability for the home team
- The market's implied probability (after vig removal)
- The edge (model minus market)
- A bet recommendation (home, away, or no bet)
A sample output might look like:
| Game | Model% | Market% | Edge | Fair ML | Mkt ML | Signal |
|---|---|---|---|---|---|---|
| LAL@BOS | 66.3% | 63.0% | +3.3% | -197 | -180 | HOME (BOS) |
| MIA@MIL | 47.2% | 42.6% | +4.6% | +112 | +130 | HOME (MIL) |
| GSW@DEN | 72.1% | 67.4% | +4.7% | -258 | -220 | HOME (DEN) |
| PHX@DAL | 55.8% | 57.2% | -1.4% | -126 | -140 | NO BET |
| NYK@PHI | 43.5% | 46.7% | -3.2% | +130 | +110 | AWAY (NYK) |
The NYK@PHI game shows a negative edge for the home team, which translates to a positive edge for the away side, while MIA@MIL shows the model rating the home underdog more highly than the market does. The PHX@DAL game is priced close to the model's estimate, so no bet is recommended.
Callout: The Humility Check
Before risking real money, ask yourself these questions:
- Has the model been tested on out-of-sample data (data it was not trained on)?
- Would the historical edges have been profitable after vig?
- Is the edge large enough to survive estimation error in the model?
- Are you confident the market has not already incorporated the information your model uses?
If the answer to any of these is "no" or "I am not sure," you are not yet ready to bet with real money. Paper trading (tracking hypothetical bets without wagering) is a prudent intermediate step.
The Edge Estimation Framework
To formalize the relationship between model uncertainty and required edge, consider that your model's predicted probability $\hat{p}$ has its own standard error, $\text{SE}(\hat{p})$. A rough estimate of the standard error for a logistic regression prediction is:
$$\text{SE}(\hat{p}) \approx \hat{p}(1 - \hat{p}) \times \sqrt{\mathbf{x}_0^T (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{x}_0}$$
where $\mathbf{W}$ is the diagonal matrix of weights in the iteratively reweighted least squares algorithm. In practice, a model with $\hat{p} = 0.65$ and $\text{SE}(\hat{p}) = 0.04$ means the model's "true" edge could plausibly be anywhere from $0.65 - 2(0.04) = 0.57$ to $0.65 + 2(0.04) = 0.73$. If the market's implied probability is 0.62, the nominal edge of 0.03 could easily be zero. Only bet when the edge clearly exceeds the model's own uncertainty.
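The quantity under the square root is available from any fitted logistic model's coefficient covariance matrix, so the delta-method standard error takes only a few lines. A minimal sketch on simulated data (the feature name efg_diff and all numbers are illustrative assumptions):
import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(3)
n = 800
x = np.random.normal(0, 4.5, n)                      # e.g., an eFG% differential
y = np.random.binomial(1, 1 / (1 + np.exp(-(0.1 + 0.15 * x))))

X = sm.add_constant(pd.DataFrame({'efg_diff': x}))
logit_res = sm.Logit(y, X).fit(disp=0)

# Delta-method SE of the predicted probability at a new feature vector
x0 = np.array([1.0, 3.5])                            # intercept, efg_diff = +3.5
beta = np.asarray(logit_res.params)
cov = np.asarray(logit_res.cov_params())             # equals (X'WX)^{-1} at the MLE
eta = x0 @ beta                                      # linear predictor (log-odds)
p_hat = 1 / (1 + np.exp(-eta))
se_p = p_hat * (1 - p_hat) * np.sqrt(x0 @ cov @ x0)
print(f"p_hat = {p_hat:.3f}, SE(p_hat) = {se_p:.3f}")
print(f"Rough 95% band: [{p_hat - 2 * se_p:.3f}, {p_hat + 2 * se_p:.3f}]")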
9.6 Chapter Summary
Key Formulas
Ordinary Least Squares (Simple):
$$\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
OLS (Matrix Form):
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
R-squared:
$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}, \quad R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n-1)}{n - p - 1}$$
Logistic Function:
$$p = \frac{1}{1 + e^{-\mathbf{x}^T\boldsymbol{\beta}}}$$
Log-Likelihood (Logistic Regression):
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i)\right]$$
Variance Inflation Factor:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
LASSO Objective:
$$\min_{\boldsymbol{\beta}} \sum(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda\sum|\beta_j|$$
Cook's Distance:
$$D_i = \frac{e_i^2}{p \cdot \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$
Brier Score:
$$\text{BS} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)^2$$
Code Patterns
The Python code in this chapter follows a consistent workflow:
- Data preparation: Create or load a DataFrame with features and outcome
- Model fitting: Use statsmodels for OLS (when you need detailed diagnostics) or sklearn for logistic regression and regularized models (when you need cross-validation and prediction pipelines)
- Diagnostics: Check VIF, residual plots, influence measures, and calibration
- Prediction: Generate predictions for new observations with uncertainty estimates
- Signal generation: Compare model predictions to market lines and filter for minimum edge
Decision Framework: Choosing the Right Regression
| Question | Linear Regression | Logistic Regression | Regularized (LASSO/Ridge) |
|---|---|---|---|
| Outcome type? | Continuous (points, yards) | Binary (win/loss) | Either |
| Number of features? | Few, well-chosen | Few, well-chosen | Many candidates |
| Multicollinearity? | Must address manually | Must address manually | Handled automatically |
| Interpretability? | High | Moderate (odds ratios) | Lower (coefficients shrunk) |
| Feature selection? | Manual | Manual | Automatic (LASSO) |
| When to use? | Predicting totals, margins | Win probability, cover prob | Many correlated predictors |
Key Takeaways
- Linear regression is the foundation for predicting continuous sports outcomes. The OLS estimator is simple, interpretable, and powerful when used correctly. Always report adjusted $R^2$ and prediction intervals, not just point estimates.
- Logistic regression is the natural choice for binary outcomes like win/loss. Its output is a probability that can be directly compared to market odds. The Brier score is the key metric for evaluating probability predictions.
- Feature selection is essential, not optional. Sports data is riddled with multicollinearity (total yards and first downs, passing yards and pass attempts). Use VIF analysis to detect it, and LASSO or Ridge to handle it. Prefer parsimonious models with 5--8 well-chosen features over kitchen-sink models with 30 correlated predictors.
- Model diagnostics are non-negotiable. Residual analysis reveals whether your model's assumptions are met and, more importantly, where your model is systematically wrong. Those systematic errors are both a diagnostic concern and a guide to improvement.
- The final step is the most important. A model that predicts well but is never compared to the market generates zero profit. Build the complete pipeline: train, predict, compare to market, filter for minimum edge, size appropriately, and track results. The edge is in the gap between your model and the market's price.
Practice Problems
- NFL Totals Model: Collect team-level statistics for two NFL seasons. Build a multiple regression model predicting total points scored per game. Which three predictors have the highest explanatory power? What is the model's adjusted $R^2$?
- Logistic Regression for Spread Covering: Using the same data, create a binary outcome variable indicating whether the home team covered the spread. Fit a logistic regression model. Does your model beat the 50% baseline?
- Multicollinearity Drill: Include at least 10 predictors in an NFL regression model and compute VIF for each. Identify which predictors are redundant. Use LASSO to automatically select a parsimonious subset. Compare the RMSE of the full model and the LASSO model using 10-fold cross-validation.
- Calibration Exercise: Fit a logistic regression model predicting NBA wins. Split the data 80/20 into training and test sets. Generate a calibration plot on the test set. Compute the Brier score and the Brier skill score relative to a baseline of always predicting the home-team historical win rate.
- Full Pipeline: For one week of upcoming NBA games, use publicly available team statistics to generate model predictions. Compare your predicted win probabilities to the market moneylines. Identify which games (if any) show an edge exceeding 5%. Track the results and compute your hypothetical P&L.
What's Next: Chapter 10 --- Bayesian Thinking for Sports Bettors
In this chapter, we built regression models from a frequentist perspective: we estimated fixed parameters from data and used confidence intervals and p-values to quantify uncertainty. But there is another way to think about modeling under uncertainty.
Bayesian statistics treats parameters as random variables with probability distributions. Instead of asking "What is the best single estimate of this coefficient?", Bayesian analysis asks "Given the data, what is the full distribution of plausible values for this coefficient?" This framework is especially powerful in sports modeling, where:
- Small samples are the norm (an NFL season is only 17 games per team)
- Prior information is abundant (we know roughly how good a team is before the season starts)
- Updating in real time is essential (incorporating new game results as the season unfolds)
Chapter 10 will introduce Bayesian thinking from the ground up, covering prior distributions, Bayes' theorem, posterior updating, and Bayesian regression. You will learn how to combine your regression models from this chapter with principled uncertainty quantification through the Bayesian lens---a combination that many professional sports bettors consider indispensable.