Case Study: Building an NFL Game Totals Model with Regression


Executive Summary

Predicting the total combined score of an NFL game is one of the most common applications of regression analysis in sports betting. Sportsbooks post an Over/Under line for every game, and bettors wager on whether the actual combined score will exceed or fall short of that number. This case study walks through the complete process of building a linear regression model for NFL game totals: from raw data acquisition and feature engineering through model fitting, diagnostic evaluation, and practical betting application. Using synthetic data modeled on realistic NFL patterns, we demonstrate that a principled regression approach can identify games where the market's posted total diverges meaningfully from the model's estimate, creating opportunities for positive expected value wagers.


Background

Why Game Totals?

The NFL totals market is attractive for quantitative modelers for several reasons. First, the outcome variable --- combined points scored --- is continuous and approximately normally distributed, making it a natural fit for linear regression. Second, the factors that drive scoring (offensive efficiency, defensive quality, pace, weather, venue) are measurable and publicly available. Third, the totals market tends to receive less sharp attention than the point spread market, creating pockets of inefficiency that a disciplined model can exploit.

A typical NFL game total ranges from the mid-30s to the high 50s, with the average hovering around 45-47 points in recent seasons. The standard deviation of actual game totals around the posted line is approximately 13-14 points, reflecting the inherent unpredictability of football scoring.

The Modeling Goal

Our objective is straightforward: build a regression model that predicts the combined score of an NFL game using pre-game information, then compare the model's prediction to the sportsbook's posted total. When the model's prediction differs meaningfully from the line, a betting opportunity exists.

Formally, we seek to estimate:

$$E[\text{Total Points}] = f(\text{Home Offense}, \text{Away Offense}, \text{Home Defense}, \text{Away Defense}, \text{Situational Factors})$$


Data and Feature Engineering

Raw Data Sources

For a production model, you would source data from NFL play-by-play databases (e.g., nflfastR), team statistics aggregators, and weather APIs. For this case study, we generate synthetic data that mirrors the statistical properties of real NFL seasons.

Our dataset covers six NFL seasons (2019--2024), providing approximately 1,632 regular-season games (272 per season). For each game, we have pre-game team statistics computed from all prior games in the current season, along with situational factors.
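
Below is a minimal sketch of how such a synthetic dataset might be generated. It treats the effect sizes reported later in the model-fitting section as the true scoring signal and adds irreducible game-to-game noise; the input distributions, the noise scale, and the column names are illustrative assumptions, so this sketch will not reproduce the exact fit statistics quoted below.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_games = 272 * 6  # six synthetic seasons at 272 games each

# Pre-game features drawn from roughly NFL-like ranges (illustrative values).
games = pd.DataFrame({
    "season": np.repeat(np.arange(2019, 2025), 272),
    "home_off_epa": rng.normal(0.0, 0.07, n_games),
    "away_off_epa": rng.normal(0.0, 0.07, n_games),
    "home_def_epa": rng.normal(0.0, 0.06, n_games),
    "away_def_epa": rng.normal(0.0, 0.06, n_games),
    "combined_pace": rng.normal(126, 8, n_games),
    "dome": rng.binomial(1, 0.25, n_games),
    "wind_speed": rng.gamma(2.0, 4.0, n_games),
    "divisional_game": rng.binomial(1, 0.37, n_games),
})

# Latent scoring signal plus irreducible noise on top of it.
signal = (
    36.8
    + 12.4 * games["home_off_epa"] + 11.8 * games["away_off_epa"]
    + 8.9 * games["home_def_epa"] + 7.6 * games["away_def_epa"]
    + 0.08 * games["combined_pace"]
    + 2.1 * games["dome"]
    - 0.12 * games["wind_speed"]
    - 1.4 * games["divisional_game"]
)
games["total_points"] = signal + rng.normal(0.0, 12.5, n_games)
```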

Feature Engineering

The quality of a regression model depends more on the features than the algorithm. We engineer the following predictors from raw data:

Offensive Efficiency Features:

  • home_off_epa: Home team's offensive Expected Points Added per play (season-to-date rolling average, excluding the current game). EPA measures the value of each play relative to league-average expectations, making it a pace-neutral efficiency metric.
  • away_off_epa: Away team's offensive EPA per play.
  • home_pass_rate_oe: Home team's pass rate over expected --- how often the team passes relative to what game context (score, down, distance, time) would predict. Teams that pass more than expected tend to be more aggressive and score more.
  • away_pass_rate_oe: Away team's pass rate over expected.

Defensive Efficiency Features:

  • home_def_epa: Home team's defensive EPA per play allowed. Lower (more negative) values indicate a better defense.
  • away_def_epa: Away team's defensive EPA per play allowed.

Pace and Style Features:

  • home_plays_per_game: Home team's average offensive plays per game, a proxy for pace.
  • away_plays_per_game: Away team's average offensive plays per game.
  • combined_pace: Sum of both teams' plays per game, capturing the expected overall tempo.

Situational Features:

  • dome: Binary indicator for games played in a dome or retractable-roof stadium. Dome games tend to score higher due to controlled conditions.
  • wind_speed: Wind speed in mph at kickoff. High wind suppresses passing and reduces scoring.
  • temperature: Temperature in Fahrenheit. Extreme cold can reduce scoring.
  • home_rest_days: Days since the home team's last game. More rest correlates with better performance.
  • away_rest_days: Days since the away team's last game.
  • divisional_game: Binary indicator. Divisional games tend to be lower scoring due to familiarity.

Temporal Adjustments:

  • Early-season statistics (Weeks 1--4) are noisy. We apply Bayesian shrinkage, blending current-season averages with a prior based on the previous season's performance. The shrinkage weight decays exponentially as more games are played, as sketched below.
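
A minimal sketch of this shrinkage, using a hypothetical helper and an illustrative (untuned) decay rate:

```python
import numpy as np

def shrunk_epa(current_avg, games_played, prior_avg, decay=0.25):
    """Blend season-to-date EPA with last season's EPA.

    The weight on the prior decays exponentially in games played, so a
    Week 2 estimate leans heavily on last season while a late-season
    estimate is dominated by current-season data. The decay rate is an
    illustrative assumption, not a tuned value.
    """
    prior_weight = np.exp(-decay * games_played)
    return prior_weight * prior_avg + (1.0 - prior_weight) * current_avg

# Example: a team averaging +0.10 EPA/play through 3 games, coming off a
# -0.02 EPA/play season, is pulled back toward the prior (about +0.04).
print(shrunk_epa(current_avg=0.10, games_played=3, prior_avg=-0.02))
```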

Critical Design Decision: No Look-Ahead Bias

Every feature is computed using only information available before the game is played. Season-to-date averages for a Week 8 game use only Weeks 1--7 data. This temporal discipline is non-negotiable. A model that uses end-of-season statistics to predict mid-season games will appear far more accurate in backtesting than it is in practice, because it is using information from the future.
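
In pandas, one way to enforce this is an expanding mean shifted by one game within each team-season, so a game's own result can never leak into its features. The frame layout and column names below are hypothetical.

```python
import pandas as pd

def season_to_date_mean(team_games: pd.DataFrame, value_col: str, out_col: str) -> pd.DataFrame:
    """Expanding mean of value_col within each team-season, shifted by one
    game so only prior weeks contribute to the current game's feature."""
    team_games = team_games.sort_values("week")
    team_games[out_col] = (
        team_games.groupby(["season", "team"])[value_col]
                  .transform(lambda s: s.shift(1).expanding().mean())
    )
    return team_games

# Hypothetical usage on a frame with one row per team per game:
# team_games = season_to_date_mean(team_games, "off_epa_per_play", "off_epa_std")
# Week 1 rows come out as NaN by design: no prior-week data exists yet.
```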


Model Building

Model Specification

We fit an ordinary least squares (OLS) linear regression:

$$\text{Total Points}_i = \beta_0 + \beta_1 \cdot \text{home\_off\_epa}_i + \beta_2 \cdot \text{away\_off\_epa}_i + \ldots + \beta_k \cdot x_{ik} + \epsilon_i$$

where $\epsilon_i \sim N(0, \sigma^2)$ under the standard OLS assumptions.

Train-Test Split

We use a strict temporal split: seasons 2019--2022 (1,088 games) for training and seasons 2023--2024 (544 games) for testing. This simulates real-world deployment, where the model is trained on historical data and evaluated on future games.
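
A sketch of this split and of the fit described in the next subsection, assuming a games DataFrame with the engineered columns. For brevity, only the features that appear in the coefficient table below are included; a production fit would add the pass-rate, rest, and temperature features as well.

```python
import statsmodels.formula.api as smf

train = games[games["season"] <= 2022]   # 2019-2022 for training
test = games[games["season"] >= 2023]    # 2023-2024 held out for evaluation

formula = (
    "total_points ~ home_off_epa + away_off_epa + home_def_epa + away_def_epa"
    " + combined_pace + dome + wind_speed + divisional_game"
)

model = smf.ols(formula, data=train).fit()
print(model.summary())           # coefficient table, R-squared, residual diagnostics
test_pred = model.predict(test)  # out-of-sample predictions for 2023-2024
```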

Fitting the Model

Using statsmodels OLS, we fit the model on the training set. The summary output reveals the following key results (from synthetic data):

Feature            Coefficient   Std Error   t-stat   p-value
Intercept                36.8        3.2       11.50    <0.001
home_off_epa             12.4        2.1        5.90    <0.001
away_off_epa             11.8        2.2        5.36    <0.001
home_def_epa              8.9        2.4        3.71    <0.001
away_def_epa              7.6        2.5        3.04     0.002
combined_pace             0.08       0.02       4.00    <0.001
dome                      2.1        0.9        2.33     0.020
wind_speed               -0.12       0.04      -3.00     0.003
divisional_game          -1.4        0.7       -2.00     0.046

Training $R^2 = 0.28$, Adjusted $R^2 = 0.27$, RMSE = 12.6 points.

Interpreting the Coefficients

The intercept of 36.8 represents the predicted total when all features equal zero --- a theoretical baseline rather than a realistic game, since a combined pace of zero plays never occurs. The coefficient of 12.4 on home_off_epa means that a one-unit increase in the home team's offensive EPA per play is associated with a 12.4-point increase in the predicted game total, holding all else constant. Since EPA per play typically ranges from -0.15 to +0.15, the full range of this feature contributes approximately $12.4 \times 0.30 = 3.7$ points of variation in predictions.

The dome effect (+2.1 points) aligns with industry estimates that dome games score roughly 2--3 points higher. Wind speed reduces totals by 0.12 points per mph; a 20 mph wind day would reduce the prediction by 2.4 points. Divisional games are predicted to score 1.4 fewer points, consistent with the "familiarity breeds low scores" pattern documented in football analytics.
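
To make the interpretation concrete, here is a worked prediction for a hypothetical game using the coefficients above; the input values are invented for illustration.

```python
# Coefficients from the fitted model (see the table above).
coef = {
    "Intercept": 36.8, "home_off_epa": 12.4, "away_off_epa": 11.8,
    "home_def_epa": 8.9, "away_def_epa": 7.6, "combined_pace": 0.08,
    "dome": 2.1, "wind_speed": -0.12, "divisional_game": -1.4,
}

# Hypothetical dome game: two above-average offenses, two slightly
# below-average defenses, fast combined pace, no wind.
game = {
    "home_off_epa": 0.10, "away_off_epa": 0.05,
    "home_def_epa": 0.02, "away_def_epa": 0.04,
    "combined_pace": 130, "dome": 1, "wind_speed": 0, "divisional_game": 0,
}

prediction = coef["Intercept"] + sum(coef[k] * v for k, v in game.items())
print(round(prediction, 1))  # about 51.6 points, well above a typical posted total
```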


Diagnostic Evaluation

Residual Analysis

We examine the model's residuals (actual total minus predicted total) for violations of OLS assumptions.

Residuals vs. Fitted Values: The scatter plot shows no obvious pattern --- residuals are centered around zero across the range of fitted values, with roughly constant spread. There is no evidence of systematic bias or heteroscedasticity.

Q-Q Plot: Residuals follow the normal distribution closely in the center but exhibit slightly heavier tails than a normal distribution, with a few extreme positive residuals (unusually high-scoring games). The Jarque-Bera test for normality yields a p-value of 0.08, failing to reject normality at the 5% level.

Durbin-Watson Statistic: 1.92, close to the ideal value of 2.0, indicating no meaningful autocorrelation in residuals when games are ordered chronologically.

Breusch-Pagan Test: p-value of 0.15, failing to reject homoscedasticity. The assumption of constant error variance is supported.

Variance Inflation Factors: All VIFs are below 4.0, with the highest being combined_pace at 3.7 (due to its correlation with both teams' offensive metrics). Multicollinearity is not a concern.

Cook's Distance: Three observations have Cook's distance exceeding $4/n$, corresponding to games with extreme scores (a 62-point combined total and two games below 20 points). These are influential but not erroneous; football produces occasional extreme outcomes.
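
These checks can be reproduced directly from the fitted statsmodels results object; the sketch below assumes the model object from the fitting step.

```python
import numpy as np
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson, jarque_bera

resid = model.resid
exog = model.model.exog  # design matrix, including the intercept column

# Normality (Jarque-Bera) and residual autocorrelation (Durbin-Watson).
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid)
dw = durbin_watson(resid)

# Constant error variance (Breusch-Pagan).
bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, exog)

# Multicollinearity: VIF for each non-intercept column of the design matrix.
vifs = {name: variance_inflation_factor(exog, i)
        for i, name in enumerate(model.model.exog_names) if name != "Intercept"}

# Influence: Cook's distance, flagged against the 4/n rule of thumb.
cooks_d = model.get_influence().cooks_distance[0]
n_influential = int(np.sum(cooks_d > 4 / len(resid)))
```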

Overall Assessment

The model passes all standard diagnostic checks. The residuals are approximately normal, uncorrelated, and homoscedastic. The $R^2$ of 0.28 may seem modest, but in the context of predicting NFL game totals --- where the inherent randomness of football creates a large irreducible error component --- this level of explained variance is meaningful and consistent with published models.


Out-of-Sample Evaluation

Test Set Performance

Applied to the 2023--2024 test set (544 games), the model produces:

  • Test RMSE: 13.1 points (vs. 12.6 training --- modest degradation)
  • Test MAE: 10.2 points
  • Test $R^2$: 0.24

The small gap between training and test performance (RMSE difference of 0.5 points) indicates minimal overfitting. The model generalizes well to future seasons.
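
A sketch of computing these metrics, assuming the test frame and test_pred predictions from the fitting step:

```python
import numpy as np

errors = test["total_points"] - test_pred

rmse = np.sqrt(np.mean(errors ** 2))   # ~13.1 points in this case study
mae = np.mean(np.abs(errors))          # ~10.2 points
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((test["total_points"] - test["total_points"].mean()) ** 2)
r_squared = 1 - ss_res / ss_tot        # ~0.24
```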

Comparison to the Market

The critical question for betting is not whether the model is accurate in absolute terms, but whether it adds value relative to the market's posted totals. We compare the model's predictions to the sportsbook's closing totals:

  • Mean absolute difference (model vs. closing line): 2.8 points
  • Correlation between model predictions and closing lines: 0.72
  • Games where model differs from line by 3+ points: 28% of all games

The model agrees with the market most of the time, which is expected --- the market is efficient. The edge lies in the 28% of games where the model and market diverge by at least 3 points.


Betting Application

Identifying Value Bets

We define a value bet as a game where:

$$|\text{Model Prediction} - \text{Sportsbook Total}| \geq \text{Edge Threshold}$$

and we bet the Over when the model prediction exceeds the line and the Under when it falls below.

Using an edge threshold of 3 points on the 2023--2024 test set:

  • Qualifying bets: 152 out of 544 games (28%)
  • Win rate: 55.3% (84 wins, 68 losses)
  • Average odds: -110
  • ROI: +5.1%
  • Units profit: +7.7 units (at 1 unit per bet)
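
The backtest logic behind these numbers can be sketched as follows, assuming the test frame also carries a hypothetical closing_total column with the sportsbook's posted lines:

```python
import numpy as np

EDGE = 3.0          # minimum model-vs-line disagreement required to bet
PAYOUT = 100 / 110  # profit per unit staked on a win at -110 odds

edge = test_pred - test["closing_total"]
bets = test[np.abs(edge) >= EDGE].copy()
bets["pick_over"] = edge[np.abs(edge) >= EDGE] > 0

# Drop pushes (actual total exactly on the line), then grade each bet.
bets = bets[bets["total_points"] != bets["closing_total"]]
bets["win"] = np.where(bets["pick_over"],
                       bets["total_points"] > bets["closing_total"],
                       bets["total_points"] < bets["closing_total"])

win_rate = bets["win"].mean()
units_profit = np.where(bets["win"], PAYOUT, -1.0).sum()
roi = units_profit / len(bets)
```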

Statistical Significance

At 55.3% over 152 bets against a 52.4% break-even rate:

$$z = \frac{0.553 - 0.524}{\sqrt{0.524 \times 0.476 / 152}} = \frac{0.029}{0.0405} = 0.716$$

This z-score yields a p-value of approximately 0.237 (one-tailed), which is not statistically significant at the 5% level. The edge is suggestive but the sample is insufficient to confirm profitability with confidence. This underscores a fundamental challenge in sports betting: even a genuine edge requires hundreds or thousands of bets to verify statistically.
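
The same calculation in code, along with an exact binomial check (this sketch assumes scipy is available):

```python
import numpy as np
from scipy import stats

wins, n_bets, breakeven = 84, 152, 0.524

# Normal-approximation z-test, as in the text.
p_hat = wins / n_bets
z = (p_hat - breakeven) / np.sqrt(breakeven * (1 - breakeven) / n_bets)
p_value = 1 - stats.norm.cdf(z)  # one-tailed, roughly 0.24

# Exact one-sided binomial test gives a similar answer.
exact_p = stats.binomtest(wins, n_bets, breakeven, alternative="greater").pvalue
```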

Improving Confidence with Larger Thresholds

If we increase the edge threshold to 5 points:

  • Qualifying bets: 62
  • Win rate: 58.1%
  • ROI: +9.8%

The win rate improves substantially when the model is most confident, a hallmark of a model with genuine predictive power, but the sample size shrinks further, reducing statistical power.


Lessons Learned

1. Feature engineering is the largest lever. The choice to use EPA (a play-level efficiency metric) rather than raw points per game improved model performance by approximately 15% in cross-validation. Domain-specific features like dome, wind, and divisional indicators added another 5%.

2. Diagnostics are not optional. Checking residuals, VIF, and influential observations is not academic box-checking --- it prevents deploying a model with undetected flaws that would cost real money.

3. The market is a strong baseline. The model's predictions correlate 0.72 with the closing line. Most of the time, the market has already priced the relevant information. The edge lies in the tail of disagreements.

4. Sample size is the bottleneck. Even a profitable model needs many bets to prove itself. With roughly 272 games per NFL season and only 28% qualifying as value bets, you get approximately 76 bets per year. Confirming a genuine edge requires multiple seasons of live results.

5. The model must be updated and monitored. NFL rules, playing styles, and offensive philosophies evolve. A model trained on 2019 data may not capture the 2024 landscape. Re-fitting annually with fresh data and checking for distributional shift is essential maintenance.


Your Turn: Extension Projects

  1. Add player-level features. Incorporate quarterback-specific metrics (passer rating, air yards per attempt) and test whether they improve predictions beyond team-level aggregates.

  2. Try regularization. Re-fit the model using Ridge and Lasso regression with cross-validated regularization parameters. Does regularization improve out-of-sample RMSE?

  3. Build a second-half model. Restrict training to only games from Week 8 onward, when season statistics stabilize. Does this subset-focused model outperform the full-season model?

  4. Compare to a Poisson model. Instead of predicting the total directly, model each team's scoring as a Poisson process and compute the distribution of combined scores. Does the distributional approach produce better-calibrated betting signals?

  5. Simulate bankroll growth. Using the model's backtested signals, simulate bankroll growth under Kelly Criterion sizing and compare it to flat betting. What maximum drawdown would you experience?


Discussion Questions

  1. The model's $R^2$ is 0.28. Is this a good or bad model? How does the answer depend on whether you are using it for scientific understanding vs. betting profitability?

  2. Why might a model that is excellent at predicting game totals still lose money if applied naively to all games?

  3. If the sportsbook adjusts its lines faster than you can place bets, what happens to the model's practical edge?

  4. How would you modify this approach for NBA game totals, where scoring distributions and pace dynamics differ from football?

  5. Should the model's edge threshold be fixed (e.g., always 3 points) or dynamic (e.g., scaled to model uncertainty for each game)?