Game Outcome Prediction
Game Prediction Models
NBA game prediction models use statistical analysis, machine learning, and historical data to forecast game outcomes. These models can predict point spreads, total scores, and win probabilities, providing valuable insights for analysis and decision-making.
Types of Prediction Models
1. Spread Prediction Models
Predict the margin of victory (point spread) for games:
- Point Differential Models: Estimate the expected scoring margin between teams
- Adjusted Ratings: Account for strength of schedule and home court advantage
- Regression Models: Use team statistics to predict point differentials
- Market-Adjusted Models: Incorporate betting line movements
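A back-of-the-envelope point differential model can be written in a few lines: scale the net-rating gap by expected possessions, then add a home-court constant. The sketch below is illustrative only; `estimate_spread` and its 3-point home-court value are assumptions, not outputs of a fitted model:

```python
def estimate_spread(home_net_rtg, away_net_rtg, avg_pace, home_court=3.0):
    """Rough expected home margin from per-100-possession net ratings.

    Net ratings are per 100 possessions, so divide by 100 and scale by
    the expected pace, then add a (hypothetical) home-court constant.
    """
    per_possession_edge = (home_net_rtg - away_net_rtg) / 100
    return per_possession_edge * avg_pace + home_court

# A +5.0 net-rating team hosting a -2.0 team at pace 100:
# (5 - (-2)) / 100 * 100 + 3 = 10 points
```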
2. Total Score (Over/Under) Models
Forecast the combined score of both teams:
- Pace-Adjusted Models: Account for team tempo and possessions per game
- Offensive/Defensive Efficiency: Model scoring rates per 100 possessions
- Historical Averages: Weighted recent performance metrics
- Environmental Factors: Consider rest days, travel, and schedule
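The pace-adjusted idea translates directly to code: apply each team's scoring rate (points per 100 possessions) to the game's expected number of possessions. `estimate_total` is a hypothetical helper for illustration, not a calibrated model:

```python
def estimate_total(home_ortg, away_ortg, home_pace, away_pace):
    """Expected combined score from offensive ratings and pace.

    Averages the two teams' paces to estimate possessions, then applies
    each team's points-per-100-possessions rate to that pace.
    """
    expected_pace = (home_pace + away_pace) / 2
    return (home_ortg + away_ortg) / 100 * expected_pace

# Two teams rating 115 and 110 at paces 100 and 98:
# pace 99, total (1.15 + 1.10) * 99 = 222.75
```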
3. Moneyline (Win Probability) Models
Calculate the probability of each team winning:
- Logistic Regression: Binary outcome prediction based on team stats
- Rating Systems: Elo, Glicko, or custom power ratings
- Ensemble Methods: Combine multiple prediction approaches
- Classification Models: Machine learning classifiers for win/loss
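When comparing model output against moneyline prices, it helps to convert American odds into implied win probabilities. A minimal sketch (`implied_probability` is a name chosen here; it ignores the bookmaker's vig, so the two sides of a market will sum to slightly more than 1):

```python
def implied_probability(american_odds):
    """Convert an American moneyline price to its implied win probability."""
    if american_odds < 0:
        # Favorite: e.g. -150 implies 150 / (150 + 100) = 0.60
        return -american_odds / (-american_odds + 100)
    # Underdog: e.g. +130 implies 100 / (130 + 100) ≈ 0.435
    return 100 / (american_odds + 100)
```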
Key Features and Inputs
Team Performance Metrics
- Offensive Rating (ORtg): Points scored per 100 possessions
- Defensive Rating (DRtg): Points allowed per 100 possessions
- Net Rating: Point differential per 100 possessions (ORtg - DRtg)
- Pace: Number of possessions per 48 minutes
- Four Factors: Shooting efficiency, turnovers, rebounding, free throws
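All of the per-100-possession metrics above rest on a possession estimate. A common box-score approximation can be sketched as follows (the 0.44 coefficient is the conventional estimate of the share of free throws that end a possession; function names here are illustrative):

```python
def possessions(fga, fta, orb, tov):
    """Standard box-score possession estimate for one team."""
    return fga - orb + tov + 0.44 * fta

def offensive_rating(points, poss):
    """Points scored per 100 possessions."""
    return 100 * points / poss
```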
Advanced Statistics
- True Shooting Percentage (TS%): Shooting efficiency that accounts for 2-pointers, 3-pointers, and free throws
- Effective Field Goal Percentage (eFG%): Field goal percentage adjusted for 3-point value
- Turnover Rate: Turnovers per 100 possessions
- Offensive/Defensive Rebound Rate: Percentage of available rebounds captured
- Free Throw Rate: Free throw attempts relative to field goal attempts
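The shooting-efficiency definitions above translate directly to code. A small sketch of the standard formulas (function names chosen here for illustration):

```python
def true_shooting_pct(pts, fga, fta):
    """TS%: points per estimated shooting possession (0.44 * FTA ~ FT trips)."""
    return pts / (2 * (fga + 0.44 * fta))

def effective_fg_pct(fgm, fg3m, fga):
    """eFG%: field goal percentage crediting 3-pointers 1.5x."""
    return (fgm + 0.5 * fg3m) / fga

def turnover_rate(tov, poss):
    """Turnovers per 100 possessions."""
    return 100 * tov / poss

def free_throw_rate(fta, fga):
    """Free throw attempts relative to field goal attempts."""
    return fta / fga
```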
Contextual Factors
- Home Court Advantage: Typically worth 2-4 points
- Rest Days: Days since last game for each team
- Back-to-Back Games: Performance impact on second night
- Travel Distance: Miles traveled for away games
- Injuries: Key player availability and impact on team strength
- Schedule Strength: Quality of recent and upcoming opponents
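Rest days and back-to-backs can be derived from the schedule itself rather than stored as separate inputs. A minimal pandas sketch on a toy schedule (the team codes and dates are made up for illustration):

```python
import pandas as pd

# Toy schedule; real data would come from the games file
schedule = pd.DataFrame({
    'team': ['BOS', 'BOS', 'BOS', 'LAL', 'LAL'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-03', '2024-01-04',
                            '2024-01-01', '2024-01-04']),
})

schedule = schedule.sort_values(['team', 'date'])

# Rest days = days between consecutive games for a team, minus one
schedule['rest_days'] = schedule.groupby('team')['date'].diff().dt.days - 1

# A back-to-back is zero days of rest
schedule['back_to_back'] = schedule['rest_days'] == 0
```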
Recent Form
- Rolling Averages: Performance over last 5, 10, or 15 games
- Win Streaks: Current momentum indicators
- Head-to-Head Records: Historical matchup performance
- Situational Splits: Home/away, vs. division, vs. conference
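Rolling averages must only use games played before the one being predicted; shifting the series by one game before taking the rolling mean keeps the current game out of its own feature. A toy sketch:

```python
import pandas as pd

# Toy game log for one team (values are illustrative)
logs = pd.DataFrame({
    'team': ['BOS'] * 6,
    'points': [110, 120, 100, 115, 105, 130],
})

# shift(1) drops the current game from the window, so each row's
# rolling mean reflects only games already played (no leakage)
logs['pts_last3'] = (
    logs.groupby('team')['points']
        .transform(lambda s: s.shift(1).rolling(3).mean())
)
```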
Python Implementation: Building Predictive Models
Data Preparation and Feature Engineering
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_absolute_error, accuracy_score, log_loss
import xgboost as xgb
# Load game data
games = pd.read_csv('nba_games.csv')
team_stats = pd.read_csv('team_stats.csv')
# Feature engineering function
def create_features(games, team_stats):
"""Create features for game prediction"""
features = []
for idx, game in games.iterrows():
home_team = game['home_team']
away_team = game['away_team']
game_date = game['date']
# Get recent team stats (rolling averages)
        home_stats = team_stats[
            (team_stats['team'] == home_team) &
            (team_stats['date'] < game_date)
        ].tail(10).mean(numeric_only=True)  # numeric_only skips team/date columns
        away_stats = team_stats[
            (team_stats['team'] == away_team) &
            (team_stats['date'] < game_date)
        ].tail(10).mean(numeric_only=True)
# Create feature vector
feature_dict = {
# Offensive ratings
'home_ortg': home_stats['offensive_rating'],
'away_ortg': away_stats['offensive_rating'],
'ortg_diff': home_stats['offensive_rating'] - away_stats['offensive_rating'],
# Defensive ratings
'home_drtg': home_stats['defensive_rating'],
'away_drtg': away_stats['defensive_rating'],
'drtg_diff': away_stats['defensive_rating'] - home_stats['defensive_rating'],
# Net ratings
'home_net_rtg': home_stats['net_rating'],
'away_net_rtg': away_stats['net_rating'],
'net_rtg_diff': home_stats['net_rating'] - away_stats['net_rating'],
# Pace
'home_pace': home_stats['pace'],
'away_pace': away_stats['pace'],
'avg_pace': (home_stats['pace'] + away_stats['pace']) / 2,
# Four factors
'home_efg': home_stats['efg_pct'],
'away_efg': away_stats['efg_pct'],
'home_tov_rate': home_stats['tov_rate'],
'away_tov_rate': away_stats['tov_rate'],
'home_orb_rate': home_stats['orb_rate'],
'away_orb_rate': away_stats['orb_rate'],
'home_ft_rate': home_stats['ft_rate'],
'away_ft_rate': away_stats['ft_rate'],
# Contextual
'home_advantage': 1, # 1 for home team
'home_rest_days': game['home_rest_days'],
'away_rest_days': game['away_rest_days'],
'home_back_to_back': int(game['home_rest_days'] == 0),
'away_back_to_back': int(game['away_rest_days'] == 0),
# Recent form
'home_win_pct_l10': home_stats['win_pct_last_10'],
'away_win_pct_l10': away_stats['win_pct_last_10'],
}
features.append(feature_dict)
return pd.DataFrame(features)
# Create features
X = create_features(games, team_stats)
# Target variables
y_spread = games['home_score'] - games['away_score'] # Point spread
y_total = games['home_score'] + games['away_score'] # Total score
y_winner = (games['home_score'] > games['away_score']).astype(int) # Home win
print(f"Features shape: {X.shape}")
print(f"Feature columns: {X.columns.tolist()}")
Spread Prediction Model
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y_spread, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Ridge regression for spread prediction
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)
# Predictions
y_pred = ridge_model.predict(X_test_scaled)
# Evaluate
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (Spread): {mae:.2f} points")
# Feature importance (rank by absolute coefficient size, since large
# negative coefficients matter as much as large positive ones)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': ridge_model.coef_
}).sort_values('coefficient', key=np.abs, ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))
# Random Forest for comparison
rf_model = RandomForestRegressor(
n_estimators=100,
max_depth=10,
min_samples_split=20,
random_state=42
)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
print(f"\nRandom Forest MAE: {mae_rf:.2f} points")
Total Score Prediction Model
# Prepare data for total score prediction
X_train, X_test, y_train_total, y_test_total = train_test_split(
X, y_total, test_size=0.2, random_state=42
)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# XGBoost for total prediction
xgb_total = xgb.XGBRegressor(
n_estimators=200,
learning_rate=0.05,
max_depth=6,
min_child_weight=3,
subsample=0.8,
colsample_bytree=0.8,
random_state=42
)
xgb_total.fit(X_train, y_train_total)
y_pred_total = xgb_total.predict(X_test)
# Evaluate
mae_total = mean_absolute_error(y_test_total, y_pred_total)
print(f"Total Score MAE: {mae_total:.2f} points")
# Analyze predictions vs actual
results = pd.DataFrame({
'actual_total': y_test_total,
'predicted_total': y_pred_total,
'error': np.abs(y_test_total - y_pred_total)
})
print(f"\nMedian Error: {results['error'].median():.2f}")
print(f"90th Percentile Error: {results['error'].quantile(0.9):.2f}")
# Feature importance
xgb_importance = pd.DataFrame({
'feature': X.columns,
'importance': xgb_total.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Features for Total Prediction:")
print(xgb_importance.head(10))
Win Probability Model
# Prepare data for win probability
X_train, X_test, y_train_win, y_test_win = train_test_split(
X, y_winner, test_size=0.2, random_state=42
)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic regression
logit_model = LogisticRegression(
C=1.0,
random_state=42,
max_iter=1000
)
logit_model.fit(X_train_scaled, y_train_win)
# Predict probabilities
y_pred_proba = logit_model.predict_proba(X_test_scaled)[:, 1]
y_pred_win = (y_pred_proba > 0.5).astype(int)
# Evaluate
accuracy = accuracy_score(y_test_win, y_pred_win)
logloss = log_loss(y_test_win, y_pred_proba)
print(f"Win Prediction Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss:.4f}")
# Gradient boosting classifier
gbc_model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
gbc_model.fit(X_train, y_train_win)
y_pred_proba_gbc = gbc_model.predict_proba(X_test)[:, 1]
y_pred_win_gbc = (y_pred_proba_gbc > 0.5).astype(int)
accuracy_gbc = accuracy_score(y_test_win, y_pred_win_gbc)
print(f"Gradient Boosting Accuracy: {accuracy_gbc:.4f}")
# Calibration analysis
prob_bins = np.linspace(0, 1, 11)
calibration_data = []
for i in range(len(prob_bins) - 1):
mask = (y_pred_proba >= prob_bins[i]) & (y_pred_proba < prob_bins[i+1])
if mask.sum() > 0:
actual_win_rate = y_test_win[mask].mean()
predicted_prob = y_pred_proba[mask].mean()
count = mask.sum()
calibration_data.append({
'bin': f"{prob_bins[i]:.1f}-{prob_bins[i+1]:.1f}",
'predicted': predicted_prob,
'actual': actual_win_rate,
'count': count
})
calibration_df = pd.DataFrame(calibration_data)
print("\nModel Calibration:")
print(calibration_df)
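Alongside the calibration table, the Brier score summarizes probability quality in a single number; it is simply the mean squared error of the predicted probabilities. A small sketch on hand-made values (scikit-learn's `brier_score_loss` computes the same quantity):

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# Lower is better: 0.25 is the score of always predicting 0.5
```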
Ensemble Model
from sklearn.ensemble import VotingRegressor, VotingClassifier, RandomForestClassifier
# Ensemble for spread prediction
spread_ensemble = VotingRegressor([
('ridge', Ridge(alpha=1.0)),
('rf', RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)),
('xgb', xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6))
])
spread_ensemble.fit(X_train_scaled, y_train)  # rows align with the spread split (same random_state=42)
y_pred_ensemble = spread_ensemble.predict(X_test_scaled)
mae_ensemble = mean_absolute_error(y_test, y_pred_ensemble)
print(f"Ensemble Spread MAE: {mae_ensemble:.2f} points")
# Ensemble for win prediction
win_ensemble = VotingClassifier([
('logit', LogisticRegression(C=1.0, max_iter=1000)),
('gbc', GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)),
('rf', RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42))
], voting='soft')
win_ensemble.fit(X_train_scaled, y_train_win)
y_pred_proba_ensemble = win_ensemble.predict_proba(X_test_scaled)[:, 1]
accuracy_ensemble = accuracy_score(y_test_win, (y_pred_proba_ensemble > 0.5).astype(int))
print(f"Ensemble Win Accuracy: {accuracy_ensemble:.4f}")
R Implementation: Statistical Modeling
Linear Model for Spread Prediction
library(tidyverse)
library(caret)
library(glmnet)
library(randomForest)
# Load data
games <- read_csv("nba_games.csv")
team_stats <- read_csv("team_stats.csv")
# Create features
create_features <- function(games, team_stats) {
features <- games %>%
left_join(
team_stats %>%
group_by(team) %>%
arrange(date) %>%
      mutate(
        # lag() shifts by one game so the rolling mean excludes the
        # current game and avoids leaking the outcome being predicted
        ortg_rolling = lag(zoo::rollmean(offensive_rating, k = 10, fill = NA, align = "right")),
        drtg_rolling = lag(zoo::rollmean(defensive_rating, k = 10, fill = NA, align = "right")),
        net_rtg_rolling = lag(zoo::rollmean(net_rating, k = 10, fill = NA, align = "right")),
        pace_rolling = lag(zoo::rollmean(pace, k = 10, fill = NA, align = "right"))
      ) %>%
ungroup(),
by = c("home_team" = "team", "date"),
suffix = c("", "_home")
) %>%
left_join(
team_stats %>%
group_by(team) %>%
arrange(date) %>%
      mutate(
        # lag() again keeps the current game out of its own features
        ortg_rolling = lag(zoo::rollmean(offensive_rating, k = 10, fill = NA, align = "right")),
        drtg_rolling = lag(zoo::rollmean(defensive_rating, k = 10, fill = NA, align = "right")),
        net_rtg_rolling = lag(zoo::rollmean(net_rating, k = 10, fill = NA, align = "right")),
        pace_rolling = lag(zoo::rollmean(pace, k = 10, fill = NA, align = "right"))
      ) %>%
ungroup(),
by = c("away_team" = "team", "date"),
suffix = c("_home", "_away")
) %>%
mutate(
# Calculate differences
ortg_diff = ortg_rolling_home - ortg_rolling_away,
drtg_diff = drtg_rolling_away - drtg_rolling_home,
net_rtg_diff = net_rtg_rolling_home - net_rtg_rolling_away,
avg_pace = (pace_rolling_home + pace_rolling_away) / 2,
# Target variables
point_spread = home_score - away_score,
total_score = home_score + away_score,
home_win = as.factor(ifelse(home_score > away_score, 1, 0))
) %>%
filter(!is.na(ortg_diff))
return(features)
}
# Create feature set
data <- create_features(games, team_stats)
# Split into train/test
set.seed(42)
train_index <- createDataPartition(data$point_spread, p=0.8, list=FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
# Linear regression for spread
spread_formula <- point_spread ~ ortg_diff + drtg_diff + net_rtg_diff +
avg_pace + home_rest_days + away_rest_days
spread_model <- lm(spread_formula, data=train_data)
# Summary
summary(spread_model)
# Predictions
spread_pred <- predict(spread_model, newdata=test_data)
spread_mae <- mean(abs(test_data$point_spread - spread_pred))
cat(sprintf("Spread MAE: %.2f points\n", spread_mae))
# Residual diagnostics
par(mfrow=c(2,2))
plot(spread_model)
# Cross-validation
train_control <- trainControl(method="cv", number=10)
cv_model <- train(
spread_formula,
data=train_data,
method="lm",
trControl=train_control
)
print(cv_model)
Ridge Regression with Cross-Validation
# Prepare matrix for glmnet
x_vars <- c("ortg_diff", "drtg_diff", "net_rtg_diff", "avg_pace",
"home_rest_days", "away_rest_days", "ortg_rolling_home",
"ortg_rolling_away", "drtg_rolling_home", "drtg_rolling_away")
X_train <- as.matrix(train_data[, x_vars])
y_train <- train_data$point_spread
X_test <- as.matrix(test_data[, x_vars])
y_test <- test_data$point_spread
# Ridge regression with cross-validation
ridge_cv <- cv.glmnet(
X_train,
y_train,
alpha=0, # Ridge penalty
nfolds=10
)
# Plot cross-validation results
plot(ridge_cv)
# Best lambda
cat(sprintf("Best lambda: %.4f\n", ridge_cv$lambda.min))
# Predictions with best lambda
ridge_pred <- predict(ridge_cv, s="lambda.min", newx=X_test)
ridge_mae <- mean(abs(y_test - ridge_pred))
cat(sprintf("Ridge MAE: %.2f points\n", ridge_mae))
# Coefficients
ridge_coef <- coef(ridge_cv, s="lambda.min")
print(ridge_coef)
Poisson Regression for Total Score
# Poisson regression for total score prediction
total_formula <- total_score ~ ortg_rolling_home + ortg_rolling_away +
drtg_rolling_home + drtg_rolling_away +
avg_pace + home_rest_days + away_rest_days
poisson_model <- glm(
total_formula,
data=train_data,
family=poisson(link="log")
)
summary(poisson_model)
# Predictions
total_pred <- predict(poisson_model, newdata=test_data, type="response")
total_mae <- mean(abs(test_data$total_score - total_pred))
cat(sprintf("Total Score MAE: %.2f points\n", total_mae))
# Goodness of fit
# Deviance residuals
cat(sprintf("Residual deviance: %.2f\n", poisson_model$deviance))
cat(sprintf("Degrees of freedom: %d\n", poisson_model$df.residual))
# Overdispersion test
dispersion <- poisson_model$deviance / poisson_model$df.residual
cat(sprintf("Dispersion parameter: %.4f\n", dispersion))
# If overdispersed, use quasi-Poisson
if (dispersion > 1.5) {
quasi_poisson_model <- glm(
total_formula,
data=train_data,
family=quasipoisson(link="log")
)
summary(quasi_poisson_model)
}
Logistic Regression for Win Probability
# Logistic regression for win probability
win_formula <- home_win ~ net_rtg_diff + avg_pace +
home_rest_days + away_rest_days
logit_model <- glm(
win_formula,
data=train_data,
family=binomial(link="logit")
)
summary(logit_model)
# Predictions
win_prob <- predict(logit_model, newdata=test_data, type="response")
win_pred <- ifelse(win_prob > 0.5, 1, 0)
# Accuracy
accuracy <- mean(win_pred == as.numeric(test_data$home_win) - 1)
cat(sprintf("Win Prediction Accuracy: %.4f\n", accuracy))
# ROC curve and AUC
library(pROC)
roc_obj <- roc(as.numeric(test_data$home_win) - 1, win_prob)
auc_value <- auc(roc_obj)
cat(sprintf("AUC: %.4f\n", auc_value))
plot(roc_obj, main="ROC Curve for Win Prediction")
# Calibration plot
calibration_data <- data.frame(
predicted = win_prob,
actual = as.numeric(test_data$home_win) - 1
) %>%
mutate(prob_bin = cut(predicted, breaks=seq(0, 1, 0.1))) %>%
group_by(prob_bin) %>%
summarise(
mean_predicted = mean(predicted),
mean_actual = mean(actual),
count = n()
)
ggplot(calibration_data, aes(x=mean_predicted, y=mean_actual)) +
geom_point(aes(size=count)) +
geom_abline(slope=1, intercept=0, linetype="dashed", color="red") +
labs(
title="Model Calibration",
x="Predicted Win Probability",
y="Actual Win Rate"
) +
theme_minimal()
Random Forest in R
# Random forest for spread prediction
rf_model <- randomForest(
spread_formula,
data=train_data,
ntree=500,
mtry=3,
importance=TRUE
)
# Variable importance
importance(rf_model)
varImpPlot(rf_model)
# Predictions
rf_pred <- predict(rf_model, newdata=test_data)
rf_mae <- mean(abs(test_data$point_spread - rf_pred))
cat(sprintf("Random Forest MAE: %.2f points\n", rf_mae))
# Partial dependence plots
library(pdp)
partial_plot <- partial(
rf_model,
pred.var="net_rtg_diff",
train=train_data
)
autoplot(partial_plot) +
labs(
title="Partial Dependence: Net Rating Difference",
x="Net Rating Difference",
y="Predicted Point Spread"
) +
theme_minimal()
Elo and Rating Systems
Elo Rating System
The Elo rating system, originally developed for chess, has been successfully adapted for NBA predictions. It provides a simple yet effective way to rate teams based on game outcomes.
Basic Elo Principles:
- Expected Score: E = 1 / (1 + 10^((R_opponent - R_team) / 400))
- Rating Update: R_new = R_old + K * (Actual - Expected)
- K-Factor: Determines how much ratings change per game (typically 20-30)
- Home Court Advantage: Add 100-120 points to home team rating
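The two formulas above can be checked with a quick worked example: a 1600-rated home team (plus a 100-point home-court bonus) hosting a 1500-rated opponent, updated with K = 20 after a home win:

```python
def elo_expected(r_team, r_opp):
    """Expected score for the first team: E = 1 / (1 + 10^((R_opp - R_team) / 400))."""
    return 1 / (1 + 10 ** ((r_opp - r_team) / 400))

# Effective home rating 1600 + 100 = 1700 vs. 1500 road team
expected = elo_expected(1600 + 100, 1500)

# Home team wins (actual = 1): rating rises by K * (1 - expected)
new_home_rating = 1600 + 20 * (1 - expected)
```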
Python Implementation of Elo
import pandas as pd
import numpy as np
class EloRatingSystem:
"""NBA Elo Rating System"""
def __init__(self, k_factor=20, home_advantage=100, initial_rating=1500):
"""
Initialize Elo rating system
Parameters:
- k_factor: Rating adjustment speed (20-30 typical)
- home_advantage: Points added to home team (100-120 typical)
- initial_rating: Starting rating for all teams
"""
self.k_factor = k_factor
self.home_advantage = home_advantage
self.initial_rating = initial_rating
self.ratings = {}
self.rating_history = []
def expected_score(self, rating_a, rating_b):
"""Calculate expected score for team A"""
return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
def update_ratings(self, team_a, team_b, score_a, score_b, is_home_a=True):
"""
Update ratings after a game
Parameters:
- team_a, team_b: Team identifiers
- score_a, score_b: Final scores
- is_home_a: Whether team_a is home team
"""
# Initialize ratings if needed
if team_a not in self.ratings:
self.ratings[team_a] = self.initial_rating
if team_b not in self.ratings:
self.ratings[team_b] = self.initial_rating
# Get current ratings
rating_a = self.ratings[team_a]
rating_b = self.ratings[team_b]
# Apply home advantage
if is_home_a:
rating_a += self.home_advantage
else:
rating_b += self.home_advantage
# Calculate expected scores
expected_a = self.expected_score(rating_a, rating_b)
expected_b = 1 - expected_a
# Actual outcome (1 for win, 0 for loss)
actual_a = 1 if score_a > score_b else 0
actual_b = 1 - actual_a
        # Margin of victory multiplier (optional enhancement);
        # the helper expects the winner's rating first, so order by outcome
        if actual_a == 1:
            mov = self.mov_multiplier(abs(score_a - score_b), rating_a, rating_b)
        else:
            mov = self.mov_multiplier(abs(score_a - score_b), rating_b, rating_a)
        # Update ratings
        self.ratings[team_a] += self.k_factor * mov * (actual_a - expected_a)
        self.ratings[team_b] += self.k_factor * mov * (actual_b - expected_b)
return expected_a, actual_a
def mov_multiplier(self, margin, rating_winner, rating_loser):
"""
Margin of victory multiplier
Larger margins increase the rating change
"""
return np.log(margin + 1) * (2.2 / ((rating_winner - rating_loser) * 0.001 + 2.2))
def predict_game(self, team_a, team_b, is_home_a=True):
"""
Predict game outcome
Returns:
- win_prob_a: Probability team A wins
- expected_spread: Expected point spread (positive = A favored)
"""
rating_a = self.ratings.get(team_a, self.initial_rating)
rating_b = self.ratings.get(team_b, self.initial_rating)
if is_home_a:
rating_a += self.home_advantage
else:
rating_b += self.home_advantage
# Win probability
win_prob_a = self.expected_score(rating_a, rating_b)
# Convert Elo difference to point spread
# Approximate: 25 Elo points ≈ 1 point spread
elo_diff = rating_a - rating_b
expected_spread = elo_diff / 25
return win_prob_a, expected_spread
def get_rankings(self):
"""Get current team rankings"""
return pd.DataFrame([
{'team': team, 'rating': rating}
for team, rating in self.ratings.items()
]).sort_values('rating', ascending=False).reset_index(drop=True)
# Example usage
games = pd.read_csv('nba_games.csv')
# Initialize Elo system
elo = EloRatingSystem(k_factor=20, home_advantage=100)
# Process games chronologically
predictions = []
for idx, game in games.iterrows():
home_team = game['home_team']
away_team = game['away_team']
home_score = game['home_score']
away_score = game['away_score']
# Make prediction before updating ratings
win_prob, spread = elo.predict_game(home_team, away_team, is_home_a=True)
# Update ratings
expected, actual = elo.update_ratings(
home_team, away_team,
home_score, away_score,
is_home_a=True
)
predictions.append({
'date': game['date'],
'home_team': home_team,
'away_team': away_team,
'predicted_win_prob': win_prob,
'predicted_spread': spread,
'actual_spread': home_score - away_score,
'correct': (win_prob > 0.5 and home_score > away_score) or
(win_prob <= 0.5 and home_score <= away_score)
})
# Results
predictions_df = pd.DataFrame(predictions)
accuracy = predictions_df['correct'].mean()
print(f"Elo Prediction Accuracy: {accuracy:.4f}")
# Mean absolute error for spread
mae = np.mean(np.abs(predictions_df['predicted_spread'] - predictions_df['actual_spread']))
print(f"Elo Spread MAE: {mae:.2f} points")
# Current rankings
rankings = elo.get_rankings()
print("\nTop 10 Teams by Elo Rating:")
print(rankings.head(10))
Enhanced Elo Variations
class EnhancedElo(EloRatingSystem):
"""Enhanced Elo with additional factors"""
def __init__(self, k_factor=20, home_advantage=100, initial_rating=1500,
rest_adjustment=True, season_regression=True):
super().__init__(k_factor, home_advantage, initial_rating)
self.rest_adjustment = rest_adjustment
self.season_regression = season_regression
self.season_games = {}
def rest_factor(self, rest_days):
"""Adjust rating based on rest days"""
if not self.rest_adjustment:
return 0
if rest_days == 0: # Back-to-back
return -50
elif rest_days == 1:
return -20
elif rest_days >= 5: # Too much rest
return -10
else:
return 0
def predict_game_enhanced(self, team_a, team_b, is_home_a=True,
rest_a=2, rest_b=2):
"""Enhanced prediction with rest factors"""
rating_a = self.ratings.get(team_a, self.initial_rating)
rating_b = self.ratings.get(team_b, self.initial_rating)
# Home advantage
if is_home_a:
rating_a += self.home_advantage
else:
rating_b += self.home_advantage
# Rest adjustments
rating_a += self.rest_factor(rest_a)
rating_b += self.rest_factor(rest_b)
# Win probability
win_prob_a = self.expected_score(rating_a, rating_b)
# Point spread
elo_diff = rating_a - rating_b
expected_spread = elo_diff / 25
return win_prob_a, expected_spread
def regress_to_mean(self, regression_factor=0.3):
"""
Regress ratings toward mean at season start
Accounts for offseason changes
"""
if not self.season_regression:
return
mean_rating = np.mean(list(self.ratings.values()))
for team in self.ratings:
current_rating = self.ratings[team]
self.ratings[team] = (
regression_factor * mean_rating +
(1 - regression_factor) * current_rating
)
# FiveThirtyEight-style Elo with CARMELO adjustments
class FiveThirtyEightElo(EnhancedElo):
"""FiveThirtyEight-inspired Elo system"""
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.player_ratings = {} # Player-level ratings
def update_with_roster_changes(self, team, added_players, removed_players):
"""Adjust team rating based on roster changes"""
rating_change = 0
# Add value from new players
for player in added_players:
if player in self.player_ratings:
rating_change += self.player_ratings[player] * 0.1
# Subtract value from lost players
for player in removed_players:
if player in self.player_ratings:
rating_change -= self.player_ratings[player] * 0.1
self.ratings[team] = self.ratings.get(team, self.initial_rating) + rating_change
Glicko Rating System
The Glicko system extends Elo by adding a rating deviation (uncertainty) measure:
import math
class GlickoRatingSystem:
"""Glicko rating system with rating deviation"""
def __init__(self, initial_rating=1500, initial_rd=350, c=30):
"""
Initialize Glicko system
Parameters:
- initial_rating: Starting rating
- initial_rd: Initial rating deviation (uncertainty)
- c: Rating deviation increase per time period
"""
self.initial_rating = initial_rating
self.initial_rd = initial_rd
self.c = c
self.ratings = {} # {team: (rating, rd, last_update)}
    def g(self, rd):
        """Glicko g function (discounts games against uncertain opponents)"""
        q = math.log(10) / 400  # Glicko scale constant
        return 1 / math.sqrt(1 + 3 * (q ** 2) * (rd ** 2) / (math.pi ** 2))
def expected_score(self, rating_a, rating_b, rd_b):
"""Expected score with rating deviation"""
return 1 / (1 + 10 ** (self.g(rd_b) * (rating_b - rating_a) / 400))
def update_rd(self, rd, time_periods=1):
"""Update rating deviation over time"""
new_rd = math.sqrt(rd ** 2 + (self.c ** 2) * time_periods)
return min(new_rd, self.initial_rd)
def update_ratings(self, team_a, team_b, outcome_a, date):
"""
Update ratings after a game
Parameters:
- outcome_a: 1 if team_a won, 0 if lost
"""
# Initialize if needed
if team_a not in self.ratings:
self.ratings[team_a] = (self.initial_rating, self.initial_rd, date)
if team_b not in self.ratings:
self.ratings[team_b] = (self.initial_rating, self.initial_rd, date)
rating_a, rd_a, last_date_a = self.ratings[team_a]
rating_b, rd_b, last_date_b = self.ratings[team_b]
# Update RDs based on time since last game
time_periods_a = (date - last_date_a).days / 30
time_periods_b = (date - last_date_b).days / 30
rd_a = self.update_rd(rd_a, time_periods_a)
rd_b = self.update_rd(rd_b, time_periods_b)
        q = math.log(10) / 400  # Glicko scale constant
        # Calculate d^2, the variance of the rating estimate from this game
        expected = self.expected_score(rating_a, rating_b, rd_b)
        g_rd_b = self.g(rd_b)
        d_squared_a = 1 / ((q ** 2) * (g_rd_b ** 2) * expected * (1 - expected))
        # Update rating and RD for team A
        denom_a = 1 / (rd_a ** 2) + 1 / d_squared_a
        new_rating_a = rating_a + (q / denom_a) * g_rd_b * (outcome_a - expected)
        new_rd_a = math.sqrt(1 / denom_a)
        # Similarly for team B
        expected_b = self.expected_score(rating_b, rating_a, rd_a)
        g_rd_a = self.g(rd_a)
        d_squared_b = 1 / ((q ** 2) * (g_rd_a ** 2) * expected_b * (1 - expected_b))
        outcome_b = 1 - outcome_a
        denom_b = 1 / (rd_b ** 2) + 1 / d_squared_b
        new_rating_b = rating_b + (q / denom_b) * g_rd_a * (outcome_b - expected_b)
        new_rd_b = math.sqrt(1 / denom_b)
# Store updated ratings
self.ratings[team_a] = (new_rating_a, new_rd_a, date)
self.ratings[team_b] = (new_rating_b, new_rd_b, date)
def get_rankings(self):
"""Get rankings with confidence intervals"""
rankings = []
for team, (rating, rd, date) in self.ratings.items():
rankings.append({
'team': team,
'rating': rating,
'rd': rd,
'confidence_interval': (rating - 2*rd, rating + 2*rd)
})
return pd.DataFrame(rankings).sort_values('rating', ascending=False)
Machine Learning Approaches
Neural Networks for Game Prediction
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import StandardScaler
# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
X, y_spread, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Build neural network
def create_spread_model(input_dim):
model = keras.Sequential([
layers.Dense(128, activation='relu', input_dim=input_dim),
layers.Dropout(0.3),
layers.Dense(64, activation='relu'),
layers.Dropout(0.2),
layers.Dense(32, activation='relu'),
layers.Dense(1) # Output: point spread
])
model.compile(
optimizer='adam',
loss='mse',
metrics=['mae']
)
return model
model = create_spread_model(X_train_scaled.shape[1])
# Train with early stopping
early_stopping = keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
restore_best_weights=True
)
history = model.fit(
X_train_scaled, y_train,
validation_split=0.2,
epochs=100,
batch_size=32,
callbacks=[early_stopping],
verbose=1
)
# Evaluate
y_pred_nn = model.predict(X_test_scaled).flatten()
mae_nn = mean_absolute_error(y_test, y_pred_nn)
print(f"Neural Network MAE: {mae_nn:.2f} points")
# Plot training history
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Model Loss')
plt.subplot(1, 2, 2)
plt.plot(history.history['mae'], label='Training MAE')
plt.plot(history.history['val_mae'], label='Validation MAE')
plt.xlabel('Epoch')
plt.ylabel('MAE')
plt.legend()
plt.title('Model MAE')
plt.tight_layout()
plt.savefig('training_history.png')
plt.close()
Multi-Output Neural Network
# Multi-output model predicting spread, total, and win probability
def create_multi_output_model(input_dim):
# Input layer
inputs = layers.Input(shape=(input_dim,))
# Shared layers
x = layers.Dense(128, activation='relu')(inputs)
x = layers.Dropout(0.3)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(32, activation='relu')(x)
# Output branches
spread_output = layers.Dense(1, name='spread')(x)
total_output = layers.Dense(1, name='total')(x)
win_output = layers.Dense(1, activation='sigmoid', name='win_prob')(x)
model = keras.Model(
inputs=inputs,
outputs=[spread_output, total_output, win_output]
)
model.compile(
optimizer='adam',
loss={
'spread': 'mse',
'total': 'mse',
'win_prob': 'binary_crossentropy'
},
loss_weights={
'spread': 1.0,
'total': 0.5,
'win_prob': 1.0
},
metrics={
'spread': 'mae',
'total': 'mae',
'win_prob': 'accuracy'
}
)
return model
# Prepare targets; y_total and y_winner are split with the same
# random_state as the features, so the rows stay aligned
_, _, train_totals, test_totals = train_test_split(
    X, y_total, test_size=0.2, random_state=42
)
_, _, train_wins, test_wins = train_test_split(
    X, y_winner, test_size=0.2, random_state=42
)
y_train_dict = {
    'spread': y_train,
    'total': train_totals,
    'win_prob': train_wins
}
y_test_dict = {
    'spread': y_test,
    'total': test_totals,
    'win_prob': test_wins
}
# Create and train model
multi_model = create_multi_output_model(X_train_scaled.shape[1])
history = multi_model.fit(
X_train_scaled,
y_train_dict,
validation_split=0.2,
epochs=100,
batch_size=32,
callbacks=[early_stopping],
verbose=1
)
# Predictions
predictions = multi_model.predict(X_test_scaled)
spread_pred, total_pred, win_prob_pred = predictions
print(f"Spread MAE: {mean_absolute_error(y_test, spread_pred):.2f}")
print(f"Total MAE: {mean_absolute_error(test_totals, total_pred):.2f}")
print(f"Win Accuracy: {accuracy_score(test_wins, (win_prob_pred > 0.5).astype(int)):.4f}")
LightGBM for Fast Training
import lightgbm as lgb
# Prepare data for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
# Parameters
params = {
'objective': 'regression',
'metric': 'mae',
'boosting_type': 'gbdt',
'num_leaves': 31,
'learning_rate': 0.05,
'feature_fraction': 0.8,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'verbose': -1
}
# Train model
lgbm_model = lgb.train(
params,
train_data,
num_boost_round=1000,
valid_sets=[test_data],
callbacks=[lgb.early_stopping(stopping_rounds=50)]
)
# Predictions
y_pred_lgbm = lgbm_model.predict(X_test, num_iteration=lgbm_model.best_iteration)
mae_lgbm = mean_absolute_error(y_test, y_pred_lgbm)
print(f"LightGBM MAE: {mae_lgbm:.2f} points")
# Feature importance
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': lgbm_model.feature_importance()
}).sort_values('importance', ascending=False)
print("\nTop 10 Features:")
print(importance_df.head(10))
# Plot feature importance
lgb.plot_importance(lgbm_model, max_num_features=15, figsize=(10, 6))
plt.savefig('lgbm_feature_importance.png')
plt.close()
Deep Learning with Embeddings
# Neural network with team embeddings
def create_embedding_model(n_teams, embedding_dim=8, n_features=20):
    # Team inputs
    home_team_input = layers.Input(shape=(1,), name='home_team')
    away_team_input = layers.Input(shape=(1,), name='away_team')

    # Statistical features input
    stats_input = layers.Input(shape=(n_features,), name='stats')

    # Shared team embedding (the same table is used for home and away)
    team_embedding = layers.Embedding(
        input_dim=n_teams,
        output_dim=embedding_dim,
        name='team_embedding'
    )
    home_embedded = layers.Flatten()(team_embedding(home_team_input))
    away_embedded = layers.Flatten()(team_embedding(away_team_input))

    # Concatenate all inputs
    concat = layers.Concatenate()([home_embedded, away_embedded, stats_input])

    # Dense layers
    x = layers.Dense(128, activation='relu')(concat)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(64, activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(32, activation='relu')(x)

    # Output
    output = layers.Dense(1, name='spread')(x)

    model = keras.Model(
        inputs=[home_team_input, away_team_input, stats_input],
        outputs=output
    )
    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )
    return model

# Prepare data with team IDs. Build the mapping from both home and away
# columns so every team gets an ID (home teams alone can miss teams early
# in a season).
all_teams = pd.concat([games['home_team'], games['away_team']]).unique()
team_to_id = {team: idx for idx, team in enumerate(all_teams)}

train_data = {
    'home_team': train_games['home_team'].map(team_to_id).values,
    'away_team': train_games['away_team'].map(team_to_id).values,
    'stats': X_train_scaled
}
test_data = {
    'home_team': test_games['home_team'].map(team_to_id).values,
    'away_team': test_games['away_team'].map(team_to_id).values,
    'stats': X_test_scaled
}

# Create and train model
embedding_model = create_embedding_model(
    n_teams=len(team_to_id),
    embedding_dim=8,
    n_features=X_train_scaled.shape[1]
)
history = embedding_model.fit(
    train_data,
    y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    callbacks=[early_stopping],
    verbose=1
)

# Predictions
y_pred_embed = embedding_model.predict(test_data).flatten()
mae_embed = mean_absolute_error(y_test, y_pred_embed)
print(f"Embedding Model MAE: {mae_embed:.2f} points")
Model Evaluation and Backtesting
Performance Metrics
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

def evaluate_regression_model(y_true, y_pred, model_name="Model"):
    """Comprehensive evaluation for regression models"""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Calculate metrics
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)

    # Median absolute error
    median_ae = np.median(np.abs(y_true - y_pred))

    # Mean absolute percentage error
    # (skip zero targets -- a pick'em spread of 0 would divide by zero)
    nonzero = y_true != 0
    mape = np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100

    # Directional accuracy (for spread)
    correct_direction = np.mean((y_true > 0) == (y_pred > 0))

    print(f"\n{model_name} Performance:")
    print(f"  MAE: {mae:.2f}")
    print(f"  RMSE: {rmse:.2f}")
    print(f"  R²: {r2:.4f}")
    print(f"  Median AE: {median_ae:.2f}")
    print(f"  MAPE: {mape:.2f}%")
    print(f"  Directional Accuracy: {correct_direction:.4f}")

    return {
        'mae': mae,
        'rmse': rmse,
        'r2': r2,
        'median_ae': median_ae,
        'mape': mape,
        'directional_accuracy': correct_direction
    }

def evaluate_classification_model(y_true, y_pred_proba, threshold=0.5, model_name="Model"):
    """Comprehensive evaluation for classification models"""
    from sklearn.metrics import (
        accuracy_score, precision_score, recall_score, f1_score,
        roc_auc_score, log_loss, brier_score_loss
    )
    y_pred = (y_pred_proba > threshold).astype(int)

    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_pred_proba)
    logloss = log_loss(y_true, y_pred_proba)
    brier = brier_score_loss(y_true, y_pred_proba)

    print(f"\n{model_name} Performance:")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1 Score: {f1:.4f}")
    print(f"  AUC: {auc:.4f}")
    print(f"  Log Loss: {logloss:.4f}")
    print(f"  Brier Score: {brier:.4f}")

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc,
        'log_loss': logloss,
        'brier_score': brier
    }

# Example evaluation
metrics = evaluate_regression_model(y_test, y_pred, "Ridge Regression")
Backtesting Framework
class GamePredictionBacktest:
    """Backtesting framework for game prediction models"""

    def __init__(self, model, features, targets, dates, initial_train_size=0.6):
        """
        Initialize backtest

        Parameters:
        - model: sklearn-compatible model
        - features: Feature matrix
        - targets: Target variable
        - dates: Game dates for chronological ordering
        - initial_train_size: Initial training set proportion
        """
        self.model = model
        self.features = features
        self.targets = targets
        self.dates = dates
        self.initial_train_size = initial_train_size
        self.predictions = []
        self.actuals = []
        self.prediction_dates = []

    def run_backtest(self, retrain_frequency=10):
        """
        Run rolling window backtest

        Parameters:
        - retrain_frequency: Retrain model every N games
        """
        # Sort by date
        sort_idx = np.argsort(self.dates)
        X = self.features[sort_idx]
        y = self.targets[sort_idx]
        dates = self.dates[sort_idx]

        # Initial training set
        initial_train_idx = int(len(X) * self.initial_train_size)
        X_train = X[:initial_train_idx]
        y_train = y[:initial_train_idx]

        # Train initial model
        self.model.fit(X_train, y_train)
        games_since_retrain = 0

        # Rolling predictions
        for i in range(initial_train_idx, len(X)):
            # Predict next game
            X_test = X[i:i+1]
            y_pred = self.model.predict(X_test)[0]
            y_actual = y[i]

            self.predictions.append(y_pred)
            self.actuals.append(y_actual)
            self.prediction_dates.append(dates[i])
            games_since_retrain += 1

            # Retrain if needed
            if games_since_retrain >= retrain_frequency:
                X_train = X[:i+1]
                y_train = y[:i+1]
                self.model.fit(X_train, y_train)
                games_since_retrain = 0

        return np.array(self.predictions), np.array(self.actuals)

    def evaluate(self):
        """Evaluate backtest performance"""
        predictions = np.array(self.predictions)
        actuals = np.array(self.actuals)

        mae = mean_absolute_error(actuals, predictions)
        rmse = np.sqrt(mean_squared_error(actuals, predictions))

        print("Backtest Results:")
        print(f"  Number of predictions: {len(predictions)}")
        print(f"  MAE: {mae:.2f}")
        print(f"  RMSE: {rmse:.2f}")

        # Time-based analysis
        results_df = pd.DataFrame({
            'date': self.prediction_dates,
            'predicted': predictions,
            'actual': actuals,
            'error': np.abs(predictions - actuals)
        })
        return results_df

    def plot_results(self):
        """Visualize backtest results"""
        results_df = pd.DataFrame({
            'date': self.prediction_dates,
            'predicted': self.predictions,
            'actual': self.actuals
        })

        plt.figure(figsize=(14, 6))

        plt.subplot(1, 2, 1)
        plt.scatter(results_df['actual'], results_df['predicted'], alpha=0.5)
        plt.plot([results_df['actual'].min(), results_df['actual'].max()],
                 [results_df['actual'].min(), results_df['actual'].max()],
                 'r--', lw=2)
        plt.xlabel('Actual')
        plt.ylabel('Predicted')
        plt.title('Predicted vs Actual')

        plt.subplot(1, 2, 2)
        results_df['rolling_mae'] = results_df['actual'].sub(results_df['predicted']).abs().rolling(50).mean()
        plt.plot(results_df['date'], results_df['rolling_mae'])
        plt.xlabel('Date')
        plt.ylabel('Rolling MAE (50 games)')
        plt.title('Model Performance Over Time')
        plt.xticks(rotation=45)

        plt.tight_layout()
        plt.savefig('backtest_results.png')
        plt.close()

# Example usage
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
backtest = GamePredictionBacktest(
    model=model,
    features=X.values,
    targets=y_spread.values,
    dates=games['date'].values,
    initial_train_size=0.6
)
predictions, actuals = backtest.run_backtest(retrain_frequency=10)
results = backtest.evaluate()
backtest.plot_results()
Cross-Validation for Time Series
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge

def time_series_cv(model, X, y, n_splits=5):
    """Time series cross-validation"""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    fold_scores = []

    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
        fold_scores.append(mae)
        print(f"Fold {fold + 1}: MAE = {mae:.2f}")

    print(f"\nAverage MAE: {np.mean(fold_scores):.2f} (+/- {np.std(fold_scores):.2f})")
    return fold_scores

# Run time series CV
scores = time_series_cv(
    Ridge(alpha=1.0),
    X.values,
    y_spread.values,
    n_splits=5
)
Betting Simulation
class BettingSimulator:
    """Simulate betting strategy based on model predictions"""

    def __init__(self, initial_bankroll=1000, bet_size=10):
        self.initial_bankroll = initial_bankroll
        self.bet_size = bet_size
        self.bankroll = initial_bankroll
        self.bet_history = []

    def kelly_criterion(self, win_prob, odds):
        """Calculate optimal bet fraction using the Kelly Criterion"""
        # Convert American odds to decimal odds
        if odds > 0:
            decimal_odds = 1 + (odds / 100)
        else:
            decimal_odds = 1 + (100 / abs(odds))
        edge = win_prob * decimal_odds - 1
        kelly_fraction = edge / (decimal_odds - 1)
        return max(0, kelly_fraction)

    def simulate_bet(self, predicted_spread, actual_spread, betting_line, odds=-110):
        """
        Simulate a bet on the spread

        Parameters:
        - predicted_spread: Model's predicted spread
        - actual_spread: Actual game spread
        - betting_line: Market spread
        - odds: Betting odds (American format)
        """
        # Determine if we should bet
        edge = abs(predicted_spread - betting_line)
        if edge < 2:  # Minimum edge threshold
            return 0, "No bet"

        # Bet on home team if model predicts better than line
        if predicted_spread > betting_line + 2:
            bet_team = "home"
            # Home team covers if actual > betting_line
            won = actual_spread > betting_line
        elif predicted_spread < betting_line - 2:
            bet_team = "away"
            # Away team covers if actual < betting_line
            won = actual_spread < betting_line
        else:
            return 0, "No bet"

        # Calculate profit/loss
        if odds > 0:
            profit_multiplier = odds / 100
        else:
            profit_multiplier = 100 / abs(odds)

        if won:
            profit = self.bet_size * profit_multiplier
        else:
            profit = -self.bet_size

        self.bankroll += profit
        self.bet_history.append({
            'bet_team': bet_team,
            'predicted_spread': predicted_spread,
            'betting_line': betting_line,
            'actual_spread': actual_spread,
            'won': won,
            'profit': profit,
            'bankroll': self.bankroll
        })
        return profit, bet_team

    def run_simulation(self, predictions_df):
        """Run full betting simulation; predictions_df needs
        'predicted_spread', 'actual_spread', and 'betting_line' columns"""
        for idx, row in predictions_df.iterrows():
            self.simulate_bet(
                row['predicted_spread'],
                row['actual_spread'],
                row['betting_line']
            )
        return pd.DataFrame(self.bet_history)

    def calculate_roi(self):
        """Calculate return on investment"""
        total_bet = len(self.bet_history) * self.bet_size
        if total_bet == 0:
            return 0.0
        profit = self.bankroll - self.initial_bankroll
        return (profit / total_bet) * 100

    def print_summary(self):
        """Print betting simulation summary"""
        df = pd.DataFrame(self.bet_history)
        total_bets = len(df)
        wins = df['won'].sum() if total_bets > 0 else 0
        win_rate = wins / total_bets if total_bets > 0 else 0
        profit = self.bankroll - self.initial_bankroll
        roi = self.calculate_roi()

        print("\nBetting Simulation Results:")
        print(f"  Initial Bankroll: ${self.initial_bankroll:.2f}")
        print(f"  Final Bankroll: ${self.bankroll:.2f}")
        print(f"  Total Profit: ${profit:.2f}")
        print(f"  Total Bets: {total_bets}")
        print(f"  Wins: {wins}")
        print(f"  Win Rate: {win_rate:.2%}")
        print(f"  ROI: {roi:.2f}%")

        # Plot bankroll over time
        plt.figure(figsize=(12, 6))
        plt.plot(df['bankroll'])
        plt.axhline(y=self.initial_bankroll, color='r', linestyle='--', label='Initial Bankroll')
        plt.xlabel('Bet Number')
        plt.ylabel('Bankroll ($)')
        plt.title('Bankroll Over Time')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.savefig('betting_simulation.png')
        plt.close()

# Example usage
simulator = BettingSimulator(initial_bankroll=1000, bet_size=10)
results = simulator.run_simulation(predictions_df)
simulator.print_summary()
Best Practices and Tips
Model Development
- Use Chronological Splits: Always validate with time-based splits, never random splits
- Feature Engineering: Rolling averages and recent form are more predictive than season averages
- Avoid Leakage: Only use data available before game time in predictions
- Weight Recent Games: Recent performance is more indicative than early-season games
- Account for Context: Rest, travel, injuries, and motivation matter significantly
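The chronological-split rule above can be sketched in a few lines. This is a minimal illustration with a toy `games` table; the column names are placeholders for whatever features and targets you actually build:

```python
import pandas as pd

# Toy games table; in practice this holds real features and targets
games = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'net_rating_diff': range(10),
    'spread': [3, -2, 5, 1, -4, 7, 2, -1, 6, 0],
})

# Sort by date, then cut at a point in time: everything before the cutoff
# trains the model, everything after is held out. Never shuffle randomly,
# or future games leak into the training set.
games = games.sort_values('date').reset_index(drop=True)
cutoff = int(len(games) * 0.8)
train, test = games.iloc[:cutoff], games.iloc[cutoff:]

print(len(train), len(test))                      # 8 2
print(train['date'].max() < test['date'].min())   # True
```

The same principle is what `TimeSeriesSplit` and the backtesting framework earlier in this section enforce fold by fold.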
Evaluation Strategy
- Multiple Metrics: Use MAE for interpretability, RMSE for penalizing large errors
- Directional Accuracy: Getting the winner right is often more important than exact spread
- Calibration: Ensure predicted probabilities match actual frequencies
- Segment Analysis: Evaluate performance by team strength, game importance, season phase
- Betting Simulation: Test profitability against market lines, not just prediction accuracy
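The calibration check above can be done with scikit-learn's `calibration_curve`, which buckets predicted probabilities and compares each bucket's mean prediction to the observed win rate. The probabilities below are synthetic stand-ins for a model's output:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)

# Synthetic win probabilities, with outcomes drawn to match them,
# so this example should come out well calibrated
win_prob = rng.uniform(0.05, 0.95, size=2000)
outcomes = (rng.uniform(size=2000) < win_prob).astype(int)

# Bucket predictions into 10 bins and compare predicted vs observed rates
frac_won, mean_pred = calibration_curve(outcomes, win_prob, n_bins=10)
for pred, obs in zip(mean_pred, frac_won):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```

A well-calibrated model keeps the two columns close across all bins; systematic gaps mean the probabilities cannot be trusted for bet sizing, even if classification accuracy looks fine.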
Common Pitfalls
- Overfitting: Complex models may fit noise in training data
- Ignoring Market: Beating the closing line is harder than predicting outcomes
- Recency Bias: Overweighting very recent games can miss longer-term trends
- Lineup Changes: Not accounting for injuries and rest significantly degrades accuracy
- Static Models: Teams change over the season; models should adapt
Advanced Considerations
- Ensemble Approaches: Combine Elo, statistical models, and ML for robust predictions
- Player Impact Models: Adjust team ratings for lineup composition
- Market Integration: Use betting lines as additional features
- Bayesian Updates: Continuously update beliefs as new information arrives
- Uncertainty Quantification: Provide prediction intervals, not just point estimates
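As a concrete sketch of the ensemble idea above, one simple scheme weights each model's spread prediction by its inverse validation error. The predictions and MAE figures here are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical spread predictions for the same three games from three models
preds = {
    'elo':        np.array([4.0, -2.0, 6.5]),
    'regression': np.array([3.0, -3.5, 5.0]),
    'gbm':        np.array([5.0, -1.0, 7.0]),
}

# Assumed validation MAE of each model; lower error earns a larger weight
val_mae = {'elo': 11.0, 'regression': 10.5, 'gbm': 10.0}

# Inverse-MAE weights, normalized to sum to 1
weights = {m: 1.0 / val_mae[m] for m in preds}
total = sum(weights.values())
weights = {m: w / total for m, w in weights.items()}

# Weighted average of the individual predictions
ensemble = sum(weights[m] * preds[m] for m in preds)
print(np.round(ensemble, 2))
```

In practice the weights can also be fit by stacking (a meta-model trained on out-of-sample predictions), which subsumes this hand-weighting as a special case.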
Resources and Further Reading
Academic Papers
- "Machine Learning for Sports Betting: Should Model Sophistication Matter?" - Hubáček et al.
- "Predicting the NBA Championship via NBA2K and fivethirtyeight's RAPTOR" - FiveThirtyEight
- "Beating the Bookies with Their Own Numbers" - Lisandro Kaunitz et al.
- "The Prediction Tracker: A Framework for Assessing NBA Predictions" - Ryan Davis
Websites and Tools
- FiveThirtyEight: NBA predictions using RAPTOR and Elo ratings
- Basketball-Reference: Comprehensive NBA statistics and advanced metrics
- NBA.com/stats: Official NBA advanced statistics
- Cleaning the Glass: Advanced analytics and team ratings
- InPredictable: Public prediction models and backtesting
Python Libraries
- nba_api: Python API client for NBA.com statistics
- basketball-reference-scraper: Scrape Basketball-Reference data
- scikit-learn: Machine learning framework
- XGBoost/LightGBM: Gradient boosting implementations
- TensorFlow/PyTorch: Deep learning frameworks
Summary
Game prediction models combine statistical analysis, machine learning, and domain expertise to forecast NBA outcomes. Successful models typically:
- Use rolling averages of team performance metrics (offensive/defensive ratings, pace, four factors)
- Account for contextual factors (home court, rest, injuries, strength of schedule)
- Employ ensemble methods combining multiple approaches (Elo, regression, ML)
- Validate with chronological splits and backtesting
- Focus on beating market lines rather than just prediction accuracy
- Continuously adapt to team changes throughout the season
While perfect prediction is impossible due to the inherent randomness in sports, well-designed models can identify value and outperform market expectations over large samples. The key is combining rigorous statistical methodology with basketball domain knowledge and disciplined evaluation practices.