In This Chapter
- Introduction
- 25.1 Foundations of Game Prediction
- 25.2 Point Spread Modeling
- 25.3 Over/Under (Total Points) Prediction
- 25.4 Elo Rating Systems for Basketball
- 25.5 Feature Engineering for Game Prediction
- 25.6 Betting Market Efficiency
- 25.7 Model Evaluation Metrics
- 25.8 Home Court Advantage Quantification
- 25.9 Rest and Travel Effects
- 25.10 Ensemble Methods for Prediction
- 25.11 Complete Prediction Pipeline
- 25.12 Practical Considerations
- Summary
- References
Chapter 25: Game Outcome Prediction
Introduction
The ability to predict basketball game outcomes represents the culmination of sports analytics knowledge. Game prediction synthesizes everything we have learned about player evaluation, team performance metrics, and situational factors into actionable forecasts. Whether you are an analyst supporting a front office, a researcher studying competitive dynamics, or someone interested in testing predictions against betting markets, understanding the fundamentals of game outcome prediction is essential.
This chapter develops a comprehensive framework for predicting NBA game outcomes. We begin with the fundamental approaches to modeling point spreads and totals, then build increasingly sophisticated prediction systems. Along the way, we examine the efficiency of betting markets, quantify situational factors like home court advantage and rest, and develop rigorous evaluation frameworks for our models.
Game prediction in basketball presents unique challenges compared to other sports. The high-scoring nature of the game means that individual possessions matter less than in sports like soccer or hockey, making outcomes somewhat more predictable. However, the importance of matchups, the impact of injuries, and the strategic adjustments coaches make create substantial uncertainty that even the best models cannot fully capture.
25.1 Foundations of Game Prediction
The Prediction Problem
At its core, predicting a basketball game outcome involves estimating probabilities for various results. The most common prediction targets include:
- Win probability: The probability that a specific team wins the game
- Point spread: The expected margin of victory
- Total points: The expected combined score of both teams
- Exact score probability: The probability distribution over final scores
These targets are related but distinct. A team might be expected to win by 5 points (spread), but the win probability is not simply "greater than 50%." The distribution of outcomes matters. If a team is favored by 5 points but outcomes are highly variable, their win probability might be only 60%. If outcomes are more predictable, the same 5-point favorite might have a 70% win probability.
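The interaction between expected margin and outcome variability can be made concrete. Assuming game margins are roughly normal (the working assumption formalized in Section 25.2), this sketch with illustrative numbers shows the same 5-point favorite under two different margin standard deviations:

```python
import math

def win_prob(expected_margin, margin_sd):
    """P(margin > 0) when the margin is normal with the given mean and sd."""
    z = expected_margin / margin_sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# The same 5-point favorite under different outcome variability
high_var = win_prob(5, 14)   # volatile matchup: closer to a coin flip
low_var = win_prob(5, 8)     # predictable matchup: clearer favorite
print(f"sd=14: {high_var:.1%}, sd=8: {low_var:.1%}")
```

The larger the variance around the expected margin, the closer the favorite's win probability sits to 50%.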
The Baseline: Home Team Win Percentage
Before building complex models, we must establish baselines. The simplest baseline for NBA game prediction is the historical home team win percentage. Over the past several decades, NBA home teams have won approximately 58-60% of their games, though this figure has declined in recent years.
import numpy as np
import pandas as pd
def calculate_home_win_rate(games_df):
"""
Calculate the historical home team win rate.
Parameters:
-----------
games_df : DataFrame
Game data with columns: home_score, away_score
Returns:
--------
float : Home team win rate
"""
home_wins = (games_df['home_score'] > games_df['away_score']).sum()
total_games = len(games_df)
return home_wins / total_games
# Example usage
# home_win_rate = calculate_home_win_rate(nba_games)
# print(f"Home win rate: {home_win_rate:.1%}")
This baseline tells us that any model must do better than simply predicting the home team to win every game; in terms of accuracy, that naive strategy is correct about 58% of the time. Its probabilistic counterpart, which assigns every home team a win probability equal to the historical home win rate, also performs reasonably well under calibration and the proper scoring rules discussed later, because the probabilities it assigns are roughly correct on average.
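To make the baseline concrete, here is a small simulation (hypothetical data, not actual NBA results) measuring both the accuracy and the Brier score of the constant home-team prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: simulate 10,000 games where the home team wins 58%
outcomes = (rng.random(10_000) < 0.58).astype(float)

# Accuracy of always picking the home team
accuracy = outcomes.mean()

# Brier score of always assigning the home team a 58% win probability
brier = np.mean((0.58 - outcomes) ** 2)
print(f"accuracy: {accuracy:.3f}, Brier score: {brier:.3f}")
```

Any candidate model should beat both numbers before it is worth deploying.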
Simple Rating Systems
The next level of sophistication involves rating each team. The simplest approach is a power rating that captures each team's overall strength relative to an average team. If Team A has a power rating of +5 and Team B has a rating of -3, we might predict Team A to win by 8 points on a neutral court.
The most straightforward way to estimate power ratings is through least squares regression. Consider a model where:
$$\text{Margin}_i = \text{Rating}_{\text{home}} - \text{Rating}_{\text{away}} + \text{HCA} + \epsilon_i$$
Where HCA is home court advantage. This can be estimated using ridge regression to handle the collinearity inherent in team ratings:
from sklearn.linear_model import Ridge
import numpy as np
def estimate_power_ratings(games_df, teams, alpha=1.0):
"""
Estimate team power ratings using ridge regression.
Parameters:
-----------
games_df : DataFrame
Game data with home_team, away_team, home_score, away_score
teams : list
List of all team identifiers
alpha : float
Ridge regularization parameter
Returns:
--------
dict : Team power ratings
float : Estimated home court advantage
"""
n_games = len(games_df)
n_teams = len(teams)
team_to_idx = {team: i for i, team in enumerate(teams)}
# Build design matrix: one column per team + intercept for HCA
X = np.zeros((n_games, n_teams + 1))
y = np.zeros(n_games)
for i, (_, game) in enumerate(games_df.iterrows()):
home_idx = team_to_idx[game['home_team']]
away_idx = team_to_idx[game['away_team']]
X[i, home_idx] = 1 # Home team
X[i, away_idx] = -1 # Away team
X[i, -1] = 1 # Home court advantage
y[i] = game['home_score'] - game['away_score']
    # Fit ridge regression (note: plain Ridge also shrinks the HCA
    # coefficient; with a small alpha the effect on it is negligible)
model = Ridge(alpha=alpha, fit_intercept=False)
model.fit(X, y)
ratings = {team: model.coef_[i] for team, i in team_to_idx.items()}
hca = model.coef_[-1]
# Center ratings around zero
mean_rating = np.mean(list(ratings.values()))
ratings = {team: r - mean_rating for team, r in ratings.items()}
return ratings, hca
This simple approach captures a substantial portion of the predictable variance in game outcomes. Power ratings explain roughly 15-20% of the variance in game margins, which translates to meaningful improvements over the baseline.
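To see why the regression recovers sensible ratings, here is a minimal numpy-only version of the same design-matrix idea on a hypothetical three-team schedule. The margins are constructed to be exactly consistent with true ratings of +6, 0, and -6 and a +3 home court advantage, so the solver should recover them:

```python
import numpy as np

# Hypothetical games as (home_idx, away_idx, margin), generated from
# true ratings A=+6, B=0, C=-6 with a +3 home court advantage
games = [(0, 1, 9), (1, 2, 9), (2, 0, -9), (1, 0, -3), (2, 1, -3), (0, 2, 15)]
n_teams = 3

X = np.zeros((len(games), n_teams + 1))
y = np.zeros(len(games))
for i, (h, a, margin) in enumerate(games):
    X[i, h], X[i, a], X[i, -1] = 1, -1, 1   # home, away, HCA columns
    y[i] = margin

# Ridge solution: (X'X + alpha*I)^-1 X'y; a tiny alpha resolves the
# ambiguity of adding the same constant to every team's rating
alpha = 1e-6
beta = np.linalg.solve(X.T @ X + alpha * np.eye(n_teams + 1), X.T @ y)
ratings = beta[:n_teams] - beta[:n_teams].mean()   # center around zero
hca = beta[-1]
print(np.round(ratings, 2), round(hca, 2))
```

The recovered ratings match the generating values, which is the behavior `estimate_power_ratings` relies on at NBA scale.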
25.2 Point Spread Modeling
Understanding the Spread
The point spread, also known as the line, represents the expected margin of victory for the favored team. When oddsmakers set a spread of -7 for Team A against Team B, they expect Team A to win by approximately 7 points. Bettors can wager on either team to "cover" the spread: Team A must win by more than 7 points, or Team B must lose by fewer than 7 points (or win outright).
Point spreads serve as the market's consensus prediction. Professional oddsmakers and sophisticated bettors contribute to setting these lines, making them difficult to beat consistently. The spread incorporates vast amounts of information, including injury reports, recent performance, matchup history, and situational factors.
Modeling Point Differential
Building a point spread model requires predicting the expected point differential between two teams. The general framework combines:
- Team strength measures: Offensive and defensive ratings
- Situational factors: Home court, rest, travel
- Matchup considerations: Style compatibility, pace effects
- Recent form: Hot and cold streaks
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
class PointSpreadModel:
"""
Model for predicting NBA game point spreads.
"""
def __init__(self):
self.model = Ridge(alpha=10.0)
self.scaler = StandardScaler()
self.feature_names = None
def create_features(self, games_df, team_stats_df):
"""
Create feature matrix for point spread prediction.
Parameters:
-----------
games_df : DataFrame
Game data with home_team, away_team, date
team_stats_df : DataFrame
Team statistics indexed by team and date
Returns:
--------
DataFrame : Feature matrix
"""
features = []
for _, game in games_df.iterrows():
home_stats = team_stats_df.loc[game['home_team'], game['date']]
away_stats = team_stats_df.loc[game['away_team'], game['date']]
row = {
# Efficiency differentials
'off_rtg_diff': home_stats['off_rtg'] - away_stats['off_rtg'],
'def_rtg_diff': home_stats['def_rtg'] - away_stats['def_rtg'],
'net_rtg_diff': home_stats['net_rtg'] - away_stats['net_rtg'],
# Pace factors
'home_pace': home_stats['pace'],
'away_pace': away_stats['pace'],
'pace_diff': home_stats['pace'] - away_stats['pace'],
# Four factors (home perspective)
'efg_diff': home_stats['efg'] - away_stats['efg'],
'tov_diff': away_stats['tov_rate'] - home_stats['tov_rate'],
'orb_diff': home_stats['orb_rate'] - away_stats['orb_rate'],
'ft_rate_diff': home_stats['ft_rate'] - away_stats['ft_rate'],
# Rest and travel
'home_rest': game.get('home_rest_days', 1),
'away_rest': game.get('away_rest_days', 1),
'rest_diff': game.get('home_rest_days', 1) - game.get('away_rest_days', 1),
'away_travel_dist': game.get('away_travel_distance', 0),
# Recent form (last 10 games)
'home_recent_margin': home_stats.get('last_10_margin', 0),
'away_recent_margin': away_stats.get('last_10_margin', 0),
# Home court indicator
'home_court': 1
}
features.append(row)
return pd.DataFrame(features)
def fit(self, X, y):
"""
Fit the point spread model.
Parameters:
-----------
X : DataFrame
Feature matrix
y : array-like
Actual point differentials (home - away)
"""
self.feature_names = X.columns.tolist()
X_scaled = self.scaler.fit_transform(X)
self.model.fit(X_scaled, y)
def predict(self, X):
"""
Predict point spread.
Parameters:
-----------
X : DataFrame
Feature matrix
Returns:
--------
array : Predicted point spreads
"""
X_scaled = self.scaler.transform(X)
return self.model.predict(X_scaled)
def get_feature_importance(self):
"""
Get feature importance from model coefficients.
Returns:
--------
DataFrame : Feature importances
"""
importance = pd.DataFrame({
'feature': self.feature_names,
'coefficient': self.model.coef_,
'abs_coefficient': np.abs(self.model.coef_)
}).sort_values('abs_coefficient', ascending=False)
return importance
Spread Distribution and Win Probability
Point spread predictions typically assume a normal distribution of outcomes around the predicted spread. Historically, NBA game margins have a standard deviation of approximately 11-12 points. This allows us to convert spread predictions to win probabilities:
from scipy import stats
def spread_to_win_probability(spread, std_dev=11.5):
"""
Convert a point spread to win probability for the favored team.
Parameters:
-----------
spread : float
Predicted point spread (positive = home favored)
std_dev : float
Standard deviation of game margins
Returns:
--------
float : Win probability for home team
"""
# Probability that home team wins = P(margin > 0)
# If spread is the expected margin, we need P(X > 0) where X ~ N(spread, std_dev^2)
win_prob = 1 - stats.norm.cdf(0, loc=spread, scale=std_dev)
return win_prob
# Example: Home team favored by 5 points
spread = 5.0
win_prob = spread_to_win_probability(spread)
print(f"Spread: {spread}, Win probability: {win_prob:.1%}")
# Output: Spread: 5.0, Win probability: 66.8%
The relationship between spread and win probability is nonlinear, changing fastest near zero. A spread of zero implies a 50% win probability, while very large spreads approach certainty without ever reaching it.
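Tabulating the conversion makes this shape concrete. This standalone sketch uses the standard library's `math.erf` in place of scipy, which gives the same normal CDF:

```python
import math

def spread_to_win_probability(spread, std_dev=11.5):
    # P(margin > 0) when margin ~ N(spread, std_dev^2)
    return 0.5 * (1 + math.erf(spread / (std_dev * math.sqrt(2))))

for spread in [0, 2, 5, 10, 15]:
    print(f"spread {spread:>2} -> win prob {spread_to_win_probability(spread):.1%}")
```

Each additional point of spread adds less win probability than the previous one, which is why comparing models on spread error and on probability calibration can give different rankings.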
25.3 Over/Under (Total Points) Prediction
The Totals Market
The over/under, or totals market, involves predicting whether the combined score of both teams will exceed a specified threshold. For example, if the total is set at 215.5 points, bettors wager on whether the actual combined score will be over or under that number.
Totals prediction requires understanding both teams' pace and efficiency. A game between two fast-paced, efficient offenses will likely produce more points than a game between two slow, defensive teams.
Modeling Total Points
class TotalPointsModel:
"""
Model for predicting total points in NBA games.
"""
def __init__(self):
self.model = Ridge(alpha=5.0)
self.scaler = StandardScaler()
def create_features(self, games_df, team_stats_df):
"""
Create features for total points prediction.
"""
features = []
for _, game in games_df.iterrows():
home_stats = team_stats_df.loc[game['home_team'], game['date']]
away_stats = team_stats_df.loc[game['away_team'], game['date']]
# Expected pace: geometric mean of team paces
expected_pace = np.sqrt(home_stats['pace'] * away_stats['pace'])
# Expected efficiency
home_off_vs_away_def = (home_stats['off_rtg'] + away_stats['def_rtg']) / 2
away_off_vs_home_def = (away_stats['off_rtg'] + home_stats['def_rtg']) / 2
row = {
'expected_pace': expected_pace,
'home_off_rtg': home_stats['off_rtg'],
'away_off_rtg': away_stats['off_rtg'],
'home_def_rtg': home_stats['def_rtg'],
'away_def_rtg': away_stats['def_rtg'],
'combined_off_rtg': home_stats['off_rtg'] + away_stats['off_rtg'],
'combined_def_rtg': home_stats['def_rtg'] + away_stats['def_rtg'],
'pace_sum': home_stats['pace'] + away_stats['pace'],
'home_three_rate': home_stats.get('three_rate', 0.35),
'away_three_rate': away_stats.get('three_rate', 0.35),
'home_ft_rate': home_stats.get('ft_rate', 0.2),
'away_ft_rate': away_stats.get('ft_rate', 0.2),
'rest_total': game.get('home_rest_days', 1) + game.get('away_rest_days', 1),
'is_back_to_back': int(
game.get('home_rest_days', 1) == 0 or
game.get('away_rest_days', 1) == 0
)
}
features.append(row)
return pd.DataFrame(features)
def predict_total(self, home_off_rtg, away_off_rtg, home_def_rtg, away_def_rtg,
home_pace, away_pace, league_avg_rtg=110):
"""
Simple analytical prediction of total points.
        Uses the additive matchup adjustment
            pts_per_100 = off_rtg + opp_def_rtg - league_avg
        so each offense is adjusted for the defense it faces, and two
        league-average teams produce a league-average total.
        """
        # Expected pace
        expected_pace = (home_pace + away_pace) / 2
        # Home offense vs away defense
        home_pts_per_100 = home_off_rtg + away_def_rtg - league_avg_rtg
        # Away offense vs home defense
        away_pts_per_100 = away_off_rtg + home_def_rtg - league_avg_rtg
        # Total points
        total = expected_pace * (home_pts_per_100 + away_pts_per_100) / 100
return total
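A quick sanity check of the additive matchup adjustment: for two exactly league-average teams, the prediction should reduce to pace times twice the league-average efficiency. This standalone sketch mirrors that calculation:

```python
def additive_total(home_off, away_off, home_def, away_def,
                   home_pace, away_pace, league_avg=110.0):
    pace = (home_pace + away_pace) / 2
    home_pts_per_100 = home_off + away_def - league_avg   # home offense vs away defense
    away_pts_per_100 = away_off + home_def - league_avg   # away offense vs home defense
    return pace * (home_pts_per_100 + away_pts_per_100) / 100

# Two exactly league-average teams at pace 100 should total 220 points
print(additive_total(110, 110, 110, 110, 100, 100))   # 220.0
```

Checks like this catch sign and centering errors that are easy to introduce when combining offensive and defensive ratings.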
Factors Affecting Totals
Several factors systematically affect game totals:
- Pace: Faster-paced teams create more possessions and scoring opportunities
- Defensive quality: Elite defenses suppress scoring
- Three-point shooting: High-volume three-point teams have more volatile scoring
- Free throw rate: Games with more fouls produce extra free-throw points, tending to push totals higher
- Rest: Back-to-back games often feature lower totals
- Altitude: Games in Denver historically feature slightly different totals
25.4 Elo Rating Systems for Basketball
Origins and Principles
The Elo rating system, originally developed by Arpad Elo for chess, provides an elegant framework for rating competitors based on head-to-head results. Elo systems have been successfully adapted to many sports, including basketball.
The core principles of Elo are:
- Zero-sum updates: Points gained by the winner equal points lost by the loser
- Expectation-based: Updates depend on whether the result was surprising
- Self-correcting: Ratings converge to true strength over time
- Interpretable: Rating differences map to win probabilities
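These principles can be verified in a few lines before the full implementation. With the standard logistic expectation, a 400-point gap gives a 10-in-11 expectation, and a K-weighted update is exactly zero-sum:

```python
def expected(r_a, r_b):
    """Elo expected score for team A against team B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# A 400-point gap corresponds to a 10-in-11 (~90.9%) expectation
p_fav = expected(1900, 1500)

# Zero-sum update with K = 20: after an upset, the underdog's gain
# exactly equals the favorite's loss
k = 20
underdog_gain = k * (1 - expected(1500, 1900))
favorite_loss = k * (0 - p_fav)
print(round(p_fav, 3), round(underdog_gain, 2), round(favorite_loss, 2))
# 0.909 18.18 -18.18
```

Note how the surprising result moves ratings by nearly the full K, while an expected result would move them only slightly.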
Basic Elo Implementation
class EloRatingSystem:
"""
Elo rating system adapted for NBA basketball.
"""
def __init__(self, k_factor=20, home_advantage=100, initial_rating=1500):
"""
Initialize the Elo system.
Parameters:
-----------
k_factor : float
Maximum rating change per game
home_advantage : float
Rating points added to home team's effective rating
initial_rating : float
Starting rating for new teams
"""
self.k_factor = k_factor
self.home_advantage = home_advantage
self.initial_rating = initial_rating
self.ratings = {}
self.history = []
def get_rating(self, team):
"""Get current rating for a team."""
return self.ratings.get(team, self.initial_rating)
def expected_score(self, rating_a, rating_b):
"""
Calculate expected score (win probability) for team A.
Uses the logistic function with 400 as the scale factor.
A 400-point rating difference corresponds to ~90% win probability.
"""
return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
def update_ratings(self, home_team, away_team, home_score, away_score):
"""
Update ratings after a game.
Parameters:
-----------
home_team : str
Home team identifier
away_team : str
Away team identifier
home_score : int
Home team's final score
away_score : int
Away team's final score
Returns:
--------
tuple : (new_home_rating, new_away_rating, home_rating_change)
"""
# Get current ratings
home_rating = self.get_rating(home_team)
away_rating = self.get_rating(away_team)
# Calculate expected scores with home advantage
home_expected = self.expected_score(
home_rating + self.home_advantage,
away_rating
)
away_expected = 1 - home_expected
# Actual result (1 for win, 0 for loss, 0.5 for tie)
if home_score > away_score:
home_actual = 1
away_actual = 0
elif away_score > home_score:
home_actual = 0
away_actual = 1
else:
home_actual = 0.5
away_actual = 0.5
# Update ratings
home_change = self.k_factor * (home_actual - home_expected)
away_change = self.k_factor * (away_actual - away_expected)
new_home_rating = home_rating + home_change
new_away_rating = away_rating + away_change
self.ratings[home_team] = new_home_rating
self.ratings[away_team] = new_away_rating
# Store history
self.history.append({
'home_team': home_team,
'away_team': away_team,
'home_rating_before': home_rating,
'away_rating_before': away_rating,
'home_rating_after': new_home_rating,
'away_rating_after': new_away_rating,
'home_expected': home_expected,
'home_actual': home_actual
})
return new_home_rating, new_away_rating, home_change
def predict_game(self, home_team, away_team):
"""
Predict win probability and spread for a game.
Returns:
--------
dict : Prediction with win_prob and expected_spread
"""
home_rating = self.get_rating(home_team)
away_rating = self.get_rating(away_team)
# Win probability with home advantage
home_win_prob = self.expected_score(
home_rating + self.home_advantage,
away_rating
)
# Convert rating difference to spread
# Empirically, ~25-30 Elo points ≈ 1 point of spread
rating_diff = (home_rating + self.home_advantage) - away_rating
expected_spread = rating_diff / 28
return {
'home_win_prob': home_win_prob,
'away_win_prob': 1 - home_win_prob,
'expected_spread': expected_spread,
'home_rating': home_rating,
'away_rating': away_rating
}
def process_season(self, games_df):
"""
Process all games in a season and update ratings.
Parameters:
-----------
games_df : DataFrame
Games sorted by date with columns:
home_team, away_team, home_score, away_score
"""
for _, game in games_df.iterrows():
self.update_ratings(
game['home_team'],
game['away_team'],
game['home_score'],
game['away_score']
)
def get_rankings(self):
"""Get current team rankings."""
rankings = pd.DataFrame([
{'team': team, 'rating': rating}
for team, rating in self.ratings.items()
]).sort_values('rating', ascending=False)
rankings['rank'] = range(1, len(rankings) + 1)
return rankings
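The rating-to-spread conversion used in `predict_game` can also be checked standalone, under the same rule-of-thumb assumption of roughly 28 Elo points per point of spread:

```python
def elo_to_spread(home_rating, away_rating, home_advantage=100, elo_per_point=28):
    """Convert an Elo rating gap (plus home advantage) to an expected spread."""
    return ((home_rating + home_advantage) - away_rating) / elo_per_point

# A 56-point rating edge plus home court is roughly a 5.6-point spread
print(round(elo_to_spread(1556, 1500), 1))   # 5.6
```

The divisor is an empirical calibration, so it is worth re-fitting it against actual margins for whatever era of data the system is trained on.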
Margin-Aware Elo
Standard Elo only considers wins and losses. For basketball, incorporating margin of victory provides additional information:
class MarginAwareElo(EloRatingSystem):
"""
Elo system that incorporates margin of victory.
"""
    def __init__(self, k_factor=20, home_advantage=100, initial_rating=1500,
                 margin_cap=20):
        """
        Parameters:
        -----------
        margin_cap : float
            Maximum margin to consider (caps the influence of blowouts)
        """
        super().__init__(k_factor, home_advantage, initial_rating)
        self.margin_cap = margin_cap
    def margin_multiplier(self, margin, elo_diff):
        """
        Calculate a K-factor multiplier based on margin of victory.

        Uses a diminishing-returns (logarithmic) curve, damped as the
        rating gap grows so that blowouts by heavy favorites do not
        inflate ratings (a simplified autocorrelation adjustment).
        """
        # Cap the margin so blowouts have bounded influence
        margin = min(abs(margin), self.margin_cap)
        # Log curve, damped by the size of the rating gap
        multiplier = np.log(margin + 1) * (2.2 / (1 + 0.001 * abs(elo_diff)))
        return max(1.0, multiplier)
def update_ratings(self, home_team, away_team, home_score, away_score):
"""Update ratings with margin adjustment."""
home_rating = self.get_rating(home_team)
away_rating = self.get_rating(away_team)
# Calculate expected scores
elo_diff = (home_rating + self.home_advantage) - away_rating
home_expected = self.expected_score(
home_rating + self.home_advantage,
away_rating
)
# Actual result
margin = home_score - away_score
if margin > 0:
home_actual = 1
elif margin < 0:
home_actual = 0
else:
home_actual = 0.5
# Margin multiplier
mult = self.margin_multiplier(margin, elo_diff)
# Update with adjusted K-factor
effective_k = self.k_factor * mult
home_change = effective_k * (home_actual - home_expected)
self.ratings[home_team] = home_rating + home_change
self.ratings[away_team] = away_rating - home_change
return self.ratings[home_team], self.ratings[away_team], home_change
Season Carryover
Between seasons, teams change through trades, free agency, and the draft. Elo systems typically regress ratings toward the mean between seasons:
def regress_ratings_to_mean(elo_system, regression_factor=0.25):
"""
Regress all ratings toward the mean between seasons.
Parameters:
-----------
elo_system : EloRatingSystem
The Elo system to modify
regression_factor : float
Fraction to regress toward mean (0.25 = 25% regression)
"""
if not elo_system.ratings:
return
mean_rating = np.mean(list(elo_system.ratings.values()))
for team in elo_system.ratings:
current = elo_system.ratings[team]
elo_system.ratings[team] = current + regression_factor * (mean_rating - current)
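For instance, with 25% regression toward a league mean of 1500, a 1600-rated team starts the next season at 1575 and a 1400-rated team at 1425:

```python
def regress(rating, mean_rating=1500, factor=0.25):
    """Move a rating a fraction of the way back toward the league mean."""
    return rating + factor * (mean_rating - rating)

print(regress(1600))   # 1575.0
print(regress(1400))   # 1425.0
```

The regression fraction is a tunable assumption: more roster turnover in the league argues for a larger value.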
25.5 Feature Engineering for Game Prediction
Temporal Features
The timing of games matters significantly in basketball. Key temporal features include:
def create_temporal_features(games_df, schedule_df):
"""
Create temporal features for game prediction.
Parameters:
-----------
games_df : DataFrame
Games to create features for
schedule_df : DataFrame
Full schedule for calculating rest, etc.
Returns:
--------
DataFrame : Temporal features
"""
features = []
for _, game in games_df.iterrows():
# Days since last game (rest)
home_last = schedule_df[
(schedule_df['team'] == game['home_team']) &
(schedule_df['date'] < game['date'])
]['date'].max()
away_last = schedule_df[
(schedule_df['team'] == game['away_team']) &
(schedule_df['date'] < game['date'])
]['date'].max()
        # Convention: rest = full off days between games (0 = back-to-back),
        # matching the rest_days features used by the earlier models
        home_rest = (game['date'] - home_last).days - 1 if pd.notna(home_last) else 3
        away_rest = (game['date'] - away_last).days - 1 if pd.notna(away_last) else 3
        # Game number in current road/home stretch
        # (Not fully implemented - would require additional schedule analysis)
        # Month of season (early, mid, late)
        month = game['date'].month
        is_early_season = month in [10, 11]
        is_late_season = month in [3, 4]
        # Day of week
        day_of_week = game['date'].dayofweek
        is_weekend = day_of_week >= 5
        row = {
            'home_rest_days': home_rest,
            'away_rest_days': away_rest,
            'rest_advantage': home_rest - away_rest,
            'home_back_to_back': int(home_rest == 0),
            'away_back_to_back': int(away_rest == 0),
            'both_rested': int(home_rest >= 1 and away_rest >= 1),
'is_early_season': int(is_early_season),
'is_late_season': int(is_late_season),
'is_weekend': int(is_weekend),
'games_into_season': game.get('game_number', 41)
}
features.append(row)
return pd.DataFrame(features)
Team Quality Features
Team quality extends beyond simple power ratings:
def create_team_quality_features(games_df, team_stats_df):
"""
Create features capturing team quality and matchups.
"""
features = []
for _, game in games_df.iterrows():
home = team_stats_df.loc[game['home_team']]
away = team_stats_df.loc[game['away_team']]
row = {
# Efficiency metrics
'home_net_rtg': home['off_rtg'] - home['def_rtg'],
'away_net_rtg': away['off_rtg'] - away['def_rtg'],
'net_rtg_diff': (home['off_rtg'] - home['def_rtg']) - (away['off_rtg'] - away['def_rtg']),
# Strength of schedule
'home_sos': home.get('strength_of_schedule', 0),
'away_sos': away.get('strength_of_schedule', 0),
# Consistency (low std dev = consistent)
'home_consistency': -home.get('margin_std', 10),
'away_consistency': -away.get('margin_std', 10),
# Clutch performance (4th quarter, close games)
'home_clutch_net_rtg': home.get('clutch_net_rtg', home['off_rtg'] - home['def_rtg']),
'away_clutch_net_rtg': away.get('clutch_net_rtg', away['off_rtg'] - away['def_rtg']),
# Record in close games
'home_close_game_pct': home.get('close_game_win_pct', 0.5),
'away_close_game_pct': away.get('close_game_win_pct', 0.5),
# Win streaks / momentum
'home_streak': home.get('current_streak', 0),
'away_streak': away.get('current_streak', 0),
# Record vs playoff teams
'home_vs_playoff_pct': home.get('vs_playoff_win_pct', 0.5),
'away_vs_playoff_pct': away.get('vs_playoff_win_pct', 0.5)
}
features.append(row)
return pd.DataFrame(features)
Style Matchup Features
Basketball is a matchup-driven game. Certain styles perform better against others:
def create_style_features(games_df, team_stats_df):
"""
Create features based on team playing styles.
"""
features = []
for _, game in games_df.iterrows():
home = team_stats_df.loc[game['home_team']]
away = team_stats_df.loc[game['away_team']]
row = {
# Pace differential
'pace_diff': home['pace'] - away['pace'],
'avg_pace': (home['pace'] + away['pace']) / 2,
# Three-point reliance
'home_three_rate': home.get('three_att_rate', 0.35),
'away_three_rate': away.get('three_att_rate', 0.35),
'three_rate_diff': home.get('three_att_rate', 0.35) - away.get('three_att_rate', 0.35),
# Paint scoring
'home_paint_pts_rate': home.get('paint_pts_rate', 0.4),
'away_paint_pts_rate': away.get('paint_pts_rate', 0.4),
# Rebounding
'home_reb_rate': home.get('reb_rate', 0.5),
'away_reb_rate': away.get('reb_rate', 0.5),
# Turnover tendencies
'home_tov_rate': home.get('tov_rate', 0.12),
'away_tov_rate': away.get('tov_rate', 0.12),
# Style compatibility score
# High pace vs low pace, perimeter vs interior, etc.
'style_mismatch': abs(home['pace'] - away['pace']) +
abs(home.get('three_att_rate', 0.35) - away.get('three_att_rate', 0.35)) * 100
}
features.append(row)
return pd.DataFrame(features)
25.6 Betting Market Efficiency
The Efficient Market Hypothesis in Sports
The efficient market hypothesis (EMH) suggests that betting lines incorporate all available information, making it impossible to consistently beat the market. In sports betting, "soft" efficiency means that while markets may have biases, these biases are smaller than the transaction costs (the vig or juice).
Evidence suggests that NBA betting markets are highly efficient:
- Closing lines are excellent predictors of game outcomes
- Line movements from open to close generally improve accuracy
- Profitable betting strategies are rare and often disappear when publicized
Testing Market Efficiency
def analyze_market_efficiency(games_df):
"""
Analyze the efficiency of betting markets.
Parameters:
-----------
games_df : DataFrame
Games with columns: spread, actual_margin, over_under, actual_total
Returns:
--------
dict : Market efficiency metrics
"""
results = {}
# Point spread accuracy
games_df['spread_error'] = games_df['actual_margin'] - games_df['spread']
results['spread_mae'] = games_df['spread_error'].abs().mean()
results['spread_rmse'] = np.sqrt((games_df['spread_error'] ** 2).mean())
results['spread_bias'] = games_df['spread_error'].mean()
# Cover rate (should be ~50% if efficient)
games_df['home_covers'] = games_df['actual_margin'] > games_df['spread']
results['home_cover_rate'] = games_df['home_covers'].mean()
# Over/under accuracy
games_df['total_error'] = games_df['actual_total'] - games_df['over_under']
results['total_mae'] = games_df['total_error'].abs().mean()
results['total_rmse'] = np.sqrt((games_df['total_error'] ** 2).mean())
results['total_bias'] = games_df['total_error'].mean()
# Over rate (should be ~50% if efficient)
games_df['over_hit'] = games_df['actual_total'] > games_df['over_under']
results['over_rate'] = games_df['over_hit'].mean()
return results
def test_betting_strategy(games_df, predictions_df, strategy_fn, vig=0.10):
    """
    Backtest a betting strategy.

    Parameters:
    -----------
    games_df : DataFrame
        Actual game results
    predictions_df : DataFrame
        Model predictions
    strategy_fn : callable
        Function that returns bet decision given prediction and line
    vig : float
        Bookmaker's commission: at standard -110 odds you risk
        (1 + vig) = 1.10 units to win 1.00, so vig = 0.10
Returns:
--------
dict : Strategy performance metrics
"""
results = []
for i in range(len(games_df)):
game = games_df.iloc[i]
pred = predictions_df.iloc[i]
# Get bet decision: 1 (bet home), -1 (bet away), 0 (no bet)
bet = strategy_fn(pred, game)
if bet == 0:
continue
# Determine if bet won
actual_margin = game['actual_margin']
spread = game['spread']
if bet == 1: # Bet on home team to cover
won = actual_margin > spread
else: # Bet on away team to cover
won = actual_margin < spread
# Calculate profit (standard -110 odds means risk 1.10 to win 1.00)
if won:
profit = 1.0
else:
profit = -(1 + vig)
results.append({
'bet': bet,
'won': won,
'profit': profit
})
results_df = pd.DataFrame(results)
return {
'total_bets': len(results_df),
'wins': results_df['won'].sum(),
'losses': len(results_df) - results_df['won'].sum(),
'win_rate': results_df['won'].mean(),
'total_profit': results_df['profit'].sum(),
'roi': results_df['profit'].sum() / len(results_df) if len(results_df) > 0 else 0,
'required_win_rate': (1 + vig) / (2 + vig) # ~52.4% for standard vig
}
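The `strategy_fn` passed to the backtest takes a prediction row and a game row. A minimal edge-threshold example (the `predicted_spread` and `spread` keys are assumed to follow the home-margin convention used throughout this chapter) might look like:

```python
def edge_strategy(pred, game, threshold=2.0):
    """Bet only when the model's spread differs from the market line
    by at least `threshold` points. Returns 1 (home), -1 (away), 0 (pass)."""
    edge = pred['predicted_spread'] - game['spread']
    if edge >= threshold:
        return 1     # model is more bullish on the home side than the market
    if edge <= -threshold:
        return -1    # model is more bullish on the away side
    return 0

# Market has home -5.5; model says home -2, a 3.5-point edge toward home
print(edge_strategy({'predicted_spread': -2.0}, {'spread': -5.5}))   # 1
```

Thresholding on disagreement keeps the strategy out of games where the model and market essentially agree, where the vig dominates any edge.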
Market Anomalies
Despite overall efficiency, research has identified several historical anomalies in NBA betting markets:
- Home underdog bias: Home underdogs have historically covered at slightly higher rates
- Rest advantage undervaluation: Teams with significant rest advantages may be undervalued
- Late-season motivation: Playoff-bound teams resting starters create value
- Public bias: Heavy public betting on favorites can move lines past true value
def identify_potential_value(games_df, predictions_df, threshold=2.0):
"""
Identify games where model disagrees significantly with market.
Parameters:
-----------
games_df : DataFrame
Game data with market spreads
predictions_df : DataFrame
Model spread predictions
threshold : float
Minimum disagreement to flag (in points)
Returns:
--------
DataFrame : Games with potential value
"""
games_df = games_df.copy()
games_df['model_spread'] = predictions_df['predicted_spread']
games_df['disagreement'] = games_df['model_spread'] - games_df['market_spread']
# Flag games with significant disagreement
value_games = games_df[games_df['disagreement'].abs() >= threshold].copy()
# Determine bet direction
value_games['suggested_bet'] = np.where(
value_games['disagreement'] > 0,
'HOME', # Model more bullish on home team
'AWAY' # Model more bullish on away team
)
return value_games[['home_team', 'away_team', 'market_spread',
'model_spread', 'disagreement', 'suggested_bet']]
25.7 Model Evaluation Metrics
Accuracy-Based Metrics
Simple accuracy (percentage of correct predictions) is intuitive but limited. It does not reward confidence or calibration:
def calculate_prediction_accuracy(predictions_df, results_df):
"""
Calculate basic prediction accuracy metrics.
"""
# Merge predictions with results
df = predictions_df.merge(results_df, on='game_id')
# Win/loss accuracy
df['predicted_winner'] = np.where(df['home_win_prob'] > 0.5, 'home', 'away')
df['actual_winner'] = np.where(df['home_score'] > df['away_score'], 'home', 'away')
accuracy = (df['predicted_winner'] == df['actual_winner']).mean()
# ATS accuracy (against the spread)
df['predicted_cover'] = df['predicted_spread'] > df['market_spread']
df['actual_cover'] = df['actual_margin'] > df['market_spread']
ats_accuracy = (df['predicted_cover'] == df['actual_cover']).mean()
return {
'win_accuracy': accuracy,
'ats_accuracy': ats_accuracy
}
Brier Score
The Brier score is a proper scoring rule that rewards calibrated probability predictions:
$$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2$$
Where $p_i$ is the predicted probability and $o_i$ is the actual outcome (1 or 0).
def brier_score(predictions, outcomes):
"""
Calculate Brier score for probabilistic predictions.
Parameters:
-----------
predictions : array-like
Predicted probabilities (0 to 1)
outcomes : array-like
Actual outcomes (0 or 1)
Returns:
--------
float : Brier score (lower is better, 0 is perfect)
"""
predictions = np.array(predictions)
outcomes = np.array(outcomes)
return np.mean((predictions - outcomes) ** 2)
def brier_skill_score(predictions, outcomes, baseline_prob=None):
"""
Calculate Brier Skill Score relative to a baseline.
Parameters:
-----------
predictions : array-like
Model's predicted probabilities
outcomes : array-like
Actual outcomes
baseline_prob : float, optional
Baseline probability (default: mean outcome)
Returns:
--------
float : Brier Skill Score (higher is better, 0 = baseline, 1 = perfect)
"""
if baseline_prob is None:
baseline_prob = np.mean(outcomes)
model_brier = brier_score(predictions, outcomes)
baseline_brier = brier_score(np.full_like(predictions, baseline_prob), outcomes)
return 1 - (model_brier / baseline_brier)
Log Loss
Log loss (cross-entropy loss) more heavily penalizes confident wrong predictions:
$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [o_i \log(p_i) + (1-o_i) \log(1-p_i)]$$
def log_loss(predictions, outcomes, eps=1e-15):
"""
Calculate log loss for probabilistic predictions.
Parameters:
-----------
predictions : array-like
Predicted probabilities (0 to 1)
outcomes : array-like
Actual outcomes (0 or 1)
eps : float
Small value to avoid log(0)
Returns:
--------
float : Log loss (lower is better)
"""
predictions = np.array(predictions)
outcomes = np.array(outcomes)
# Clip predictions to avoid log(0)
predictions = np.clip(predictions, eps, 1 - eps)
return -np.mean(
outcomes * np.log(predictions) +
(1 - outcomes) * np.log(1 - predictions)
)
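A small worked comparison makes the difference between the two metrics concrete. Both penalize a confident miss more than a cautious one, but log loss does so far more aggressively; the `brier` and `logloss` helpers below are compact stand-ins for the chapter's functions.

```python
import numpy as np

def brier(p, o):
    p, o = np.asarray(p, float), np.asarray(o, float)
    return np.mean((p - o) ** 2)

def logloss(p, o, eps=1e-15):
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    o = np.asarray(o, float)
    return -np.mean(o * np.log(p) + (1 - o) * np.log(1 - p))

outcomes  = [1, 1, 1]
cautious  = [0.60, 0.60, 0.40]  # modest miss on the last game
confident = [0.99, 0.99, 0.01]  # extreme miss on the last game

print(f"Brier:    {brier(cautious, outcomes):.3f} vs {brier(confident, outcomes):.3f}")
print(f"Log loss: {logloss(cautious, outcomes):.3f} vs {logloss(confident, outcomes):.3f}")
```

The confident model's Brier score is about 1.4x worse than the cautious model's, while its log loss is about 2.4x worse: pushing a wrong prediction toward 0 or 1 drives log loss toward infinity but caps the Brier penalty at 1.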
Calibration Analysis
A well-calibrated model's predicted probabilities should match observed frequencies:
def calibration_analysis(predictions, outcomes, n_bins=10):
"""
Analyze calibration of probability predictions.
Parameters:
-----------
predictions : array-like
Predicted probabilities
outcomes : array-like
Actual outcomes
n_bins : int
Number of bins for calibration
Returns:
--------
DataFrame : Calibration data by bin
"""
predictions = np.array(predictions)
outcomes = np.array(outcomes)
# Create bins
bin_edges = np.linspace(0, 1, n_bins + 1)
bin_indices = np.digitize(predictions, bin_edges[1:-1])
calibration_data = []
for i in range(n_bins):
mask = bin_indices == i
if mask.sum() > 0:
calibration_data.append({
'bin_lower': bin_edges[i],
'bin_upper': bin_edges[i + 1],
'bin_center': (bin_edges[i] + bin_edges[i + 1]) / 2,
'mean_predicted': predictions[mask].mean(),
'mean_observed': outcomes[mask].mean(),
'count': mask.sum(),
'calibration_error': abs(predictions[mask].mean() - outcomes[mask].mean())
})
return pd.DataFrame(calibration_data)
def expected_calibration_error(predictions, outcomes, n_bins=10):
"""
Calculate Expected Calibration Error (ECE).
"""
cal_data = calibration_analysis(predictions, outcomes, n_bins)
total = cal_data['count'].sum()
ece = sum(
row['count'] / total * row['calibration_error']
for _, row in cal_data.iterrows()
)
return ece
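To see ECE separate a calibrated model from an overconfident one, the sketch below builds synthetic predictions whose outcomes genuinely follow the stated probabilities, then distorts the same predictions toward the extremes. The inline `ece` mirrors the binned definition above so the example is self-contained.

```python
import numpy as np

def ece(p, o, n_bins=10):
    # Binned Expected Calibration Error, mirroring the chapter's definition
    p, o = np.asarray(p, float), np.asarray(o, float)
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.digitize(p, edges[1:-1])
    err = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            err += mask.mean() * abs(p[mask].mean() - o[mask].mean())
    return err

rng = np.random.default_rng(42)
p = rng.uniform(0.05, 0.95, 20000)                  # calibrated probabilities
outcomes = (rng.uniform(size=p.size) < p).astype(int)
over = np.clip(0.5 + 1.6 * (p - 0.5), 0.01, 0.99)   # same ordering, overconfident
print(f"calibrated ECE:    {ece(p, outcomes):.3f}")
print(f"overconfident ECE: {ece(over, outcomes):.3f}")
```

Note that the overconfident predictions rank games identically to the calibrated ones (same accuracy, same ordering); only calibration metrics expose the problem.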
Comprehensive Evaluation Framework
The individual metrics above can be combined into a single evaluator that produces a complete report:
class PredictionEvaluator:
"""
Comprehensive evaluation framework for game predictions.
"""
def __init__(self, predictions, outcomes, spreads_pred=None,
spreads_actual=None, market_spreads=None):
"""
Initialize evaluator with predictions and outcomes.
Parameters:
-----------
predictions : array-like
Win probability predictions for home team
outcomes : array-like
Actual outcomes (1 if home won, 0 if away won)
spreads_pred : array-like, optional
Predicted point spreads
spreads_actual : array-like, optional
Actual point margins
market_spreads : array-like, optional
Market/betting spreads
"""
self.predictions = np.array(predictions)
self.outcomes = np.array(outcomes)
self.spreads_pred = spreads_pred
self.spreads_actual = spreads_actual
self.market_spreads = market_spreads
def evaluate_all(self):
"""
Run all evaluation metrics.
Returns:
--------
dict : Comprehensive evaluation results
"""
results = {}
# Accuracy metrics
predicted_winners = (self.predictions > 0.5).astype(int)
results['accuracy'] = (predicted_winners == self.outcomes).mean()
# Probabilistic metrics
results['brier_score'] = brier_score(self.predictions, self.outcomes)
results['brier_skill_score'] = brier_skill_score(self.predictions, self.outcomes)
results['log_loss'] = log_loss(self.predictions, self.outcomes)
results['expected_calibration_error'] = expected_calibration_error(
self.predictions, self.outcomes
)
# Spread-based metrics (if available)
if self.spreads_pred is not None and self.spreads_actual is not None:
results['spread_mae'] = np.mean(np.abs(self.spreads_pred - self.spreads_actual))
results['spread_rmse'] = np.sqrt(np.mean((self.spreads_pred - self.spreads_actual) ** 2))
results['spread_correlation'] = np.corrcoef(self.spreads_pred, self.spreads_actual)[0, 1]
# ATS performance (if market spreads available)
if self.market_spreads is not None and self.spreads_pred is not None:
model_pick = self.spreads_pred > self.market_spreads
actual_cover = self.spreads_actual > self.market_spreads
results['ats_accuracy'] = (model_pick == actual_cover).mean()
return results
def generate_report(self):
"""
Generate a formatted evaluation report.
"""
results = self.evaluate_all()
report = """
========================================
GAME PREDICTION MODEL EVALUATION REPORT
========================================
ACCURACY METRICS
----------------
Win/Loss Accuracy: {accuracy:.1%}
PROBABILISTIC METRICS
---------------------
Brier Score: {brier_score:.4f}
Brier Skill Score: {brier_skill_score:.4f}
Log Loss: {log_loss:.4f}
Expected Cal. Error: {expected_calibration_error:.4f}
""".format(**results)
if 'spread_mae' in results:
report += """
SPREAD PREDICTION METRICS
-------------------------
Spread MAE: {spread_mae:.2f} pts
Spread RMSE: {spread_rmse:.2f} pts
Spread Correlation: {spread_correlation:.3f}
""".format(**results)
if 'ats_accuracy' in results:
report += """
AGAINST THE SPREAD
------------------
ATS Accuracy: {ats_accuracy:.1%}
""".format(**results)
return report
25.8 Home Court Advantage Quantification
Measuring Home Court Advantage
Home court advantage (HCA) is one of the most consistent effects in basketball. Over most of NBA history, home teams have won roughly 58-60% of games, equivalent to about 3 points of margin advantage, though the edge has narrowed somewhat in recent seasons.
def quantify_home_court_advantage(games_df, min_games=30):
"""
Quantify home court advantage overall and by team.
Parameters:
-----------
games_df : DataFrame
Game results with home_team, away_team, home_score, away_score
min_games : int
Minimum games to report team-specific HCA
Returns:
--------
dict : Home court advantage metrics
"""
# Overall HCA
games_df['home_margin'] = games_df['home_score'] - games_df['away_score']
games_df['home_win'] = games_df['home_margin'] > 0
overall_hca = {
'win_pct': games_df['home_win'].mean(),
'avg_margin': games_df['home_margin'].mean(),
'median_margin': games_df['home_margin'].median(),
'std_margin': games_df['home_margin'].std()
}
# Team-specific HCA
home_records = games_df.groupby('home_team').agg({
'home_win': ['sum', 'count'],
'home_margin': 'mean'
})
home_records.columns = ['home_wins', 'home_games', 'home_margin']
away_records = games_df.groupby('away_team').agg({
'home_win': lambda x: (1 - x).sum(),
'home_margin': lambda x: -x.mean()
})
away_records.columns = ['away_wins', 'away_margin']
away_records['away_games'] = games_df.groupby('away_team').size()
# Calculate team-specific HCA
team_hca = []
for team in home_records.index:
if team in away_records.index:
home_stats = home_records.loc[team]
away_stats = away_records.loc[team]
if home_stats['home_games'] >= min_games and away_stats['away_games'] >= min_games:
hca_margin = home_stats['home_margin'] - away_stats['away_margin']
hca_win_pct = (home_stats['home_wins'] / home_stats['home_games']) - \
(away_stats['away_wins'] / away_stats['away_games'])
team_hca.append({
'team': team,
'home_win_pct': home_stats['home_wins'] / home_stats['home_games'],
'away_win_pct': away_stats['away_wins'] / away_stats['away_games'],
'home_margin': home_stats['home_margin'],
'away_margin': away_stats['away_margin'],
'hca_margin': hca_margin,
'hca_win_pct': hca_win_pct,
'total_games': home_stats['home_games'] + away_stats['away_games']
})
team_hca_df = pd.DataFrame(team_hca).sort_values('hca_margin', ascending=False)
return {
'overall': overall_hca,
'by_team': team_hca_df
}
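A quick simulation is a useful sanity check for this kind of analysis: inject a known home edge into synthetic games and confirm the overall metrics recover it. The snippet below generates margins with a built-in +3 home advantage (team names and scores are arbitrary) and computes the same overall win percentage and average margin the function above reports.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
teams = [f"T{i}" for i in range(10)]
rows = []
for _ in range(3000):
    home, away = rng.choice(teams, size=2, replace=False)
    margin = 3.0 + rng.normal(0, 12)   # built-in +3 point home edge
    rows.append({'home_team': home, 'away_team': away,
                 'home_score': 110 + margin / 2,
                 'away_score': 110 - margin / 2})
games = pd.DataFrame(rows)

avg_margin = (games['home_score'] - games['away_score']).mean()
home_win_pct = (games['home_score'] > games['away_score']).mean()
print(f"avg home margin: {avg_margin:+.2f}  home win pct: {home_win_pct:.1%}")
```

With a 12-point margin standard deviation, a +3 edge implies a home win probability of about Φ(3/12) ≈ 60%, which the simulation reproduces.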
Components of Home Court Advantage
Research has identified several factors contributing to HCA:
- Crowd effects: Psychological impact on players and referees
- Travel fatigue: Visiting teams travel more
- Familiarity: Home teams know their arena, sight lines, and shooting backgrounds
- Schedule: Home stands allow routine; road trips disrupt it
- Referee bias: Some studies suggest subtle officiating tendencies that favor home teams
def decompose_home_court_advantage(games_df, ref_data=None):
"""
Attempt to decompose HCA into component factors.
This is an approximation based on available data.
"""
results = {}
# Base HCA
base_hca = games_df['home_score'].mean() - games_df['away_score'].mean()
results['total_hca'] = base_hca
# Rest-adjusted HCA
if 'home_rest' in games_df.columns and 'away_rest' in games_df.columns:
# Games where rest is equal
equal_rest = games_df[games_df['home_rest'] == games_df['away_rest']]
rest_neutral_hca = equal_rest['home_score'].mean() - equal_rest['away_score'].mean()
results['rest_neutral_hca'] = rest_neutral_hca
results['rest_component'] = base_hca - rest_neutral_hca
# Travel component (if travel data available)
if 'away_travel_distance' in games_df.columns:
short_travel = games_df[games_df['away_travel_distance'] < 500]
long_travel = games_df[games_df['away_travel_distance'] >= 1500]
results['short_travel_hca'] = short_travel['home_score'].mean() - short_travel['away_score'].mean()
results['long_travel_hca'] = long_travel['home_score'].mean() - long_travel['away_score'].mean()
results['travel_effect'] = results['long_travel_hca'] - results['short_travel_hca']
# Referee component (if ref data available)
if ref_data is not None:
# This would require detailed play-by-play analysis
pass
return results
Altitude Effects
Denver's high altitude creates a unique home court advantage:
def analyze_altitude_effect(games_df, high_altitude_teams=['DEN']):
"""
Analyze the effect of altitude on game outcomes.
Parameters:
-----------
games_df : DataFrame
Game data
high_altitude_teams : list
Teams playing at high altitude
Returns:
--------
dict : Altitude effect analysis
"""
# Denver home games
altitude_home = games_df[games_df['home_team'].isin(high_altitude_teams)]
altitude_margin = altitude_home['home_score'].mean() - altitude_home['away_score'].mean()
# Other teams' home games
other_home = games_df[~games_df['home_team'].isin(high_altitude_teams)]
other_margin = other_home['home_score'].mean() - other_home['away_score'].mean()
# Effect of altitude beyond normal HCA
altitude_effect = altitude_margin - other_margin
# Quarter-by-quarter analysis (fatigue should increase over time)
# This would require quarter-by-quarter scoring data
return {
'altitude_teams_hca': altitude_margin,
'other_teams_hca': other_margin,
'altitude_premium': altitude_effect
}
25.9 Rest and Travel Effects
Quantifying Rest Impact
Rest significantly affects NBA performance. Teams on back-to-backs or short rest face documented disadvantages:
def analyze_rest_effects(games_df):
"""
Analyze the effect of rest on game outcomes.
Parameters:
-----------
games_df : DataFrame
Games with rest_days columns for home and away teams
Returns:
--------
dict : Rest effect analysis
"""
df = games_df.copy()
df['margin'] = df['home_score'] - df['away_score']
df['home_win'] = df['margin'] > 0
# Rest categories
df['home_rest_cat'] = pd.cut(df['home_rest'],
bins=[-1, 0, 1, 2, 100],
labels=['B2B', '1_day', '2_days', '3+_days'])
df['away_rest_cat'] = pd.cut(df['away_rest'],
bins=[-1, 0, 1, 2, 100],
labels=['B2B', '1_day', '2_days', '3+_days'])
# Rest differential analysis
rest_analysis = df.groupby(['home_rest_cat', 'away_rest_cat']).agg({
'margin': ['mean', 'std', 'count'],
'home_win': 'mean'
})
# Back-to-back specific analysis
home_b2b = df[df['home_rest'] == 0]
away_b2b = df[df['away_rest'] == 0]
neither_b2b = df[(df['home_rest'] > 0) & (df['away_rest'] > 0)]
return {
'rest_matrix': rest_analysis,
'home_b2b_margin': home_b2b['margin'].mean() if len(home_b2b) > 0 else None,
'away_b2b_margin': away_b2b['margin'].mean() if len(away_b2b) > 0 else None,
'neither_b2b_margin': neither_b2b['margin'].mean() if len(neither_b2b) > 0 else None,
'home_b2b_penalty': (neither_b2b['margin'].mean() - home_b2b['margin'].mean())
if len(home_b2b) > 0 and len(neither_b2b) > 0 else None,
'away_b2b_bonus': (away_b2b['margin'].mean() - neither_b2b['margin'].mean())
if len(away_b2b) > 0 and len(neither_b2b) > 0 else None
}
Travel Fatigue
Travel distance and direction affect performance:
import math
def calculate_travel_distance(lat1, lon1, lat2, lon2):
"""
Calculate great circle distance between two points.
"""
R = 3959 # Earth's radius in miles
lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
dlat = lat2 - lat1
dlon = lon2 - lon1
a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
c = 2 * math.asin(math.sqrt(a))
return R * c
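A quick check of the haversine helper: using approximate city-center coordinates for New York and Los Angeles (illustrative values, not exact arena locations), the great-circle distance comes out near the familiar cross-country figure. The function is repeated here so the example is self-contained.

```python
import math

def calculate_travel_distance(lat1, lon1, lat2, lon2):
    # Same haversine formula as above (result in miles)
    R = 3959
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    a = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return R * 2 * math.asin(math.sqrt(a))

# Approximate city-center coordinates (illustrative, not exact arena locations)
nyc = (40.75, -73.99)
la = (34.04, -118.27)
d = calculate_travel_distance(nyc[0], nyc[1], la[0], la[1])
print(f"{d:.0f} miles")  # roughly 2,450 miles
```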
def analyze_travel_effects(games_df, arena_locations):
"""
Analyze the effect of travel on game outcomes.
Parameters:
-----------
games_df : DataFrame
Game data
arena_locations : dict
Dictionary mapping teams to (latitude, longitude)
Returns:
--------
dict : Travel effect analysis
"""
df = games_df.copy()
# Calculate travel distance for away team
def get_travel(row):
away_loc = arena_locations.get(row['away_team'])
home_loc = arena_locations.get(row['home_team'])
if away_loc and home_loc:
return calculate_travel_distance(
away_loc[0], away_loc[1],
home_loc[0], home_loc[1]
)
return None
df['travel_distance'] = df.apply(get_travel, axis=1)
df['margin'] = df['home_score'] - df['away_score']
# Categorize travel
df['travel_cat'] = pd.cut(df['travel_distance'],
bins=[0, 500, 1000, 1500, 2500, 5000],
labels=['Local', 'Short', 'Medium', 'Long', 'Cross-country'])
travel_effects = df.groupby('travel_cat').agg({
'margin': ['mean', 'std', 'count']
})
# Time zone analysis
def get_timezone_diff(row):
# Simplified: would need actual timezone data
away_lon = arena_locations.get(row['away_team'], (0, 0))[1]
home_lon = arena_locations.get(row['home_team'], (0, 0))[1]
return int(round((home_lon - away_lon) / 15)) # Approximate time zones
df['timezone_diff'] = df.apply(get_timezone_diff, axis=1)
# Direction of the away team's travel: a more westerly home arena
# (negative timezone_diff) means a westward trip
df['direction'] = np.where(df['timezone_diff'] < 0, 'Westward',
np.where(df['timezone_diff'] > 0, 'Eastward', 'Same'))
direction_effects = df.groupby('direction').agg({
'margin': ['mean', 'count']
})
return {
'travel_effects': travel_effects,
'direction_effects': direction_effects
}
Scheduling Patterns
Beyond single-game rest, multi-game scheduling patterns such as long road trips and season segments also correlate with outcomes:
def analyze_schedule_patterns(games_df):
"""
Analyze schedule-related patterns in game outcomes.
"""
df = games_df.copy()
df['margin'] = df['home_score'] - df['away_score']
df['home_win'] = df['margin'] > 0
results = {}
# Games in a row on the road
if 'away_road_game_streak' in df.columns:
road_fatigue = df.groupby('away_road_game_streak')['margin'].mean()
results['road_fatigue'] = road_fatigue
# Second game of road back-to-back
if 'away_second_of_b2b' in df.columns:
second_b2b = df[df['away_second_of_b2b'] == True]
results['second_b2b_margin'] = second_b2b['margin'].mean()
results['second_b2b_home_win_pct'] = second_b2b['home_win'].mean()
# Segment of season
if 'game_number' in df.columns:
df['season_segment'] = pd.cut(df['game_number'],
bins=[0, 20, 41, 62, 82],
labels=['Early', 'Pre-ASB', 'Post-ASB', 'Late'])
segment_effects = df.groupby('season_segment').agg({
'margin': 'mean',
'home_win': 'mean'
})
results['segment_effects'] = segment_effects
return results
25.10 Ensemble Methods for Prediction
Why Ensembles Work
Ensemble methods combine multiple models to produce better predictions than any single model. The key insight is that different models make different errors, and averaging reduces error when those errors are uncorrelated.
For game prediction, ensembles can combine:
- Elo ratings
- Regression-based models
- Machine learning models
- Market-based predictions
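The uncorrelated-errors claim can be demonstrated directly: averaging k equally skilled models with independent error standard deviation σ yields an ensemble error of roughly σ/√k. This toy simulation (synthetic margins, not real game data) shows the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
truth = rng.normal(0, 12, n)   # "true" point margins

# Three equally skilled models whose errors are independent
preds = [truth + rng.normal(0, 8, n) for _ in range(3)]

def rmse(p):
    return np.sqrt(np.mean((p - truth) ** 2))

singles = [rmse(p) for p in preds]
ensemble = rmse(np.mean(preds, axis=0))
print("single-model RMSEs:", [round(r, 2) for r in singles])
print("ensemble RMSE:", round(ensemble, 2))  # about 8 / sqrt(3), i.e. ~4.6
```

In practice model errors are partially correlated, so real ensembles gain less than the √k ideal, but the direction of the effect holds as long as the correlation is below 1.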
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
import numpy as np
class GamePredictionEnsemble:
"""
Ensemble model for NBA game prediction.
"""
def __init__(self):
self.models = {
'ridge': Ridge(alpha=10),
'rf': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
'gbm': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
}
self.weights = None
self.elo_system = MarginAwareElo()
def fit(self, X, y, games_for_elo=None):
"""
Fit all models in the ensemble.
Parameters:
-----------
X : DataFrame
Feature matrix
y : array-like
Target (point differential)
games_for_elo : DataFrame
Historical games for Elo initialization
"""
# Fit Elo system on historical data
if games_for_elo is not None:
self.elo_system.process_season(games_for_elo)
# Fit other models
for name, model in self.models.items():
model.fit(X, y)
# Calculate optimal weights using cross-validation
self._optimize_weights(X, y)
def _optimize_weights(self, X, y, n_folds=5):
"""
Optimize ensemble weights using cross-validation.
"""
from sklearn.model_selection import KFold
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
# Collect out-of-fold predictions
oof_preds = {name: np.zeros(len(y)) for name in self.models}
for train_idx, val_idx in kf.split(X):
X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
y_train = y[train_idx]
for name, model_template in self.models.items():
# Clone and fit
model = type(model_template)(**model_template.get_params())
model.fit(X_train, y_train)
oof_preds[name][val_idx] = model.predict(X_val)
# Optimize weights to minimize RMSE
from scipy.optimize import minimize
def objective(weights):
weights = weights / weights.sum() # Normalize
ensemble_pred = sum(w * oof_preds[name] for name, w in zip(self.models.keys(), weights))
return np.sqrt(np.mean((ensemble_pred - y) ** 2))
n_models = len(self.models)
result = minimize(
objective,
x0=np.ones(n_models) / n_models,
method='SLSQP',
bounds=[(0, 1)] * n_models,
constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1}
)
self.weights = dict(zip(self.models.keys(), result.x / result.x.sum()))
def predict(self, X, home_teams=None, away_teams=None):
"""
Generate ensemble predictions.
Parameters:
-----------
X : DataFrame
Feature matrix
home_teams : list, optional
Home teams for Elo predictions
away_teams : list, optional
Away teams for Elo predictions
Returns:
--------
array : Predicted point differentials
"""
predictions = {}
# Model predictions
for name, model in self.models.items():
predictions[name] = model.predict(X)
# Elo predictions
if home_teams is not None and away_teams is not None:
elo_preds = []
for home, away in zip(home_teams, away_teams):
pred = self.elo_system.predict_game(home, away)
elo_preds.append(pred['expected_spread'])
predictions['elo'] = np.array(elo_preds)
# Add Elo to weights if not present
if 'elo' not in self.weights:
# Simple reweighting
elo_weight = 0.2
for key in self.weights:
self.weights[key] *= (1 - elo_weight)
self.weights['elo'] = elo_weight
# Weighted average
ensemble_pred = sum(
self.weights.get(name, 0) * preds
for name, preds in predictions.items()
)
return ensemble_pred
def predict_with_uncertainty(self, X, home_teams=None, away_teams=None):
"""
Generate predictions with uncertainty estimates.
"""
predictions = {}
for name, model in self.models.items():
predictions[name] = model.predict(X)
# Ensemble prediction
ensemble_pred = self.predict(X, home_teams, away_teams)
# Uncertainty from model disagreement
pred_matrix = np.column_stack(list(predictions.values()))
uncertainty = pred_matrix.std(axis=1)
return {
'prediction': ensemble_pred,
'uncertainty': uncertainty,
'model_predictions': predictions
}
Stacking Ensemble
Stacking uses a meta-model to learn optimal combinations:
from sklearn.model_selection import cross_val_predict
class StackingEnsemble:
"""
Stacking ensemble for game prediction.
"""
def __init__(self):
self.base_models = {
'ridge': Ridge(alpha=10),
'rf': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
'gbm': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
}
self.meta_model = Ridge(alpha=1)
def fit(self, X, y):
"""
Fit the stacking ensemble.
"""
# Generate base model predictions using cross-validation
base_predictions = np.column_stack([
cross_val_predict(model, X, y, cv=5)
for model in self.base_models.values()
])
# Fit meta-model on base predictions
self.meta_model.fit(base_predictions, y)
# Refit base models on full data
for model in self.base_models.values():
model.fit(X, y)
def predict(self, X):
"""
Generate stacked predictions.
"""
# Get base model predictions
base_predictions = np.column_stack([
model.predict(X)
for model in self.base_models.values()
])
# Meta-model combines them
return self.meta_model.predict(base_predictions)
25.11 Complete Prediction Pipeline
End-to-End Implementation
from sklearn.preprocessing import StandardScaler

class NBAGamePredictor:
"""
Complete NBA game prediction pipeline.
"""
def __init__(self, config=None):
"""
Initialize the prediction system.
Parameters:
-----------
config : dict, optional
Configuration parameters
"""
self.config = config or {}
# Initialize components
self.elo_system = MarginAwareElo(
k_factor=self.config.get('k_factor', 20),
home_advantage=self.config.get('elo_hca', 100)
)
self.ensemble = GamePredictionEnsemble()
self.scaler = StandardScaler()
# State
self.team_stats = None
self.is_fitted = False
def prepare_data(self, games_df, team_stats_df):
"""
Prepare data for training or prediction.
"""
features = []
targets = []
for _, game in games_df.iterrows():
try:
home_stats = team_stats_df.loc[game['home_team']]
away_stats = team_stats_df.loc[game['away_team']]
except KeyError:
continue
# Feature engineering
feature_row = {
# Efficiency
'net_rtg_diff': (home_stats['off_rtg'] - home_stats['def_rtg']) -
(away_stats['off_rtg'] - away_stats['def_rtg']),
'off_rtg_diff': home_stats['off_rtg'] - away_stats['off_rtg'],
'def_rtg_diff': away_stats['def_rtg'] - home_stats['def_rtg'],
# Pace
'pace_diff': home_stats['pace'] - away_stats['pace'],
'avg_pace': (home_stats['pace'] + away_stats['pace']) / 2,
# Four Factors
'efg_diff': home_stats.get('efg', 0.5) - away_stats.get('efg', 0.5),
'tov_diff': away_stats.get('tov_rate', 0.12) - home_stats.get('tov_rate', 0.12),
'orb_diff': home_stats.get('orb_rate', 0.25) - away_stats.get('orb_rate', 0.25),
'ft_rate_diff': home_stats.get('ft_rate', 0.2) - away_stats.get('ft_rate', 0.2),
# Rest and travel
'rest_diff': game.get('home_rest', 2) - game.get('away_rest', 2),
'away_travel': game.get('away_travel_distance', 0) / 1000,
# Recent form
'home_last10_margin': home_stats.get('last_10_margin', 0),
'away_last10_margin': away_stats.get('last_10_margin', 0),
# Elo ratings
'home_elo': self.elo_system.get_rating(game['home_team']),
'away_elo': self.elo_system.get_rating(game['away_team']),
'elo_diff': self.elo_system.get_rating(game['home_team']) -
self.elo_system.get_rating(game['away_team'])
}
features.append(feature_row)
# Assumes games either all have scores (training) or none do (prediction);
# a mix of the two would misalign features and targets
if 'home_score' in game and 'away_score' in game:
targets.append(game['home_score'] - game['away_score'])
X = pd.DataFrame(features)
y = np.array(targets) if targets else None
return X, y
def fit(self, historical_games_df, team_stats_df):
"""
Fit the prediction model on historical data.
Parameters:
-----------
historical_games_df : DataFrame
Historical games sorted by date
team_stats_df : DataFrame
Team statistics
"""
# Initialize Elo ratings from early games
early_games = historical_games_df.iloc[:len(historical_games_df)//4]
self.elo_system.process_season(early_games)
# Prepare training data from later games
training_games = historical_games_df.iloc[len(historical_games_df)//4:]
X, y = self.prepare_data(training_games, team_stats_df)
if len(X) == 0:
raise ValueError("No valid training data")
# Scale features
X_scaled = pd.DataFrame(
self.scaler.fit_transform(X),
columns=X.columns
)
# Fit ensemble
self.ensemble.fit(X_scaled, y, games_for_elo=early_games)
# Update Elo with all games
self.elo_system.process_season(training_games)
self.team_stats = team_stats_df
self.is_fitted = True
def predict(self, upcoming_games_df, team_stats_df=None):
"""
Predict outcomes for upcoming games.
Parameters:
-----------
upcoming_games_df : DataFrame
Games to predict
team_stats_df : DataFrame, optional
Updated team statistics
Returns:
--------
DataFrame : Predictions
"""
if not self.is_fitted:
raise ValueError("Model must be fitted first")
if team_stats_df is None:
team_stats_df = self.team_stats
# Prepare features
X, _ = self.prepare_data(upcoming_games_df, team_stats_df)
X_scaled = pd.DataFrame(
self.scaler.transform(X),
columns=X.columns
)
# Generate predictions
home_teams = upcoming_games_df['home_team'].tolist()
away_teams = upcoming_games_df['away_team'].tolist()
spread_predictions = self.ensemble.predict(
X_scaled,
home_teams=home_teams,
away_teams=away_teams
)
# Convert to win probabilities
win_probs = [spread_to_win_probability(s) for s in spread_predictions]
# Compile predictions
predictions = pd.DataFrame({
'home_team': home_teams,
'away_team': away_teams,
'predicted_spread': spread_predictions,
'home_win_prob': win_probs,
'away_win_prob': [1 - p for p in win_probs]
})
return predictions
def update(self, game_result):
"""
Update model with a new game result.
Parameters:
-----------
game_result : dict or Series
Game result with home_team, away_team, home_score, away_score
"""
self.elo_system.update_ratings(
game_result['home_team'],
game_result['away_team'],
game_result['home_score'],
game_result['away_score']
)
def evaluate(self, test_games_df, team_stats_df=None):
"""
Evaluate model performance on test games.
"""
predictions = self.predict(test_games_df, team_stats_df)
# Merge with actuals
test_games_df = test_games_df.copy()
test_games_df['actual_margin'] = test_games_df['home_score'] - test_games_df['away_score']
test_games_df['home_win'] = test_games_df['actual_margin'] > 0
evaluator = PredictionEvaluator(
predictions=predictions['home_win_prob'].values,
outcomes=test_games_df['home_win'].astype(int).values,
spreads_pred=predictions['predicted_spread'].values,
spreads_actual=test_games_df['actual_margin'].values
)
return evaluator.evaluate_all()
25.12 Practical Considerations
Handling Injuries
Injuries dramatically affect game predictions. A comprehensive system must account for player absences:
def adjust_prediction_for_injuries(base_prediction, injuries, player_impact_df):
"""
Adjust game prediction for known injuries.
Parameters:
-----------
base_prediction : dict
Base prediction with spread and win_prob
injuries : dict
Dictionary mapping teams to list of injured players
player_impact_df : DataFrame
Player impact data (e.g., VORP, BPM, etc.)
Returns:
--------
dict : Adjusted prediction
"""
adjustment = 0
for team, injured_players in injuries.items():
for player in injured_players:
if player in player_impact_df.index:
# Estimate impact in points per game
player_impact = player_impact_df.loc[player, 'pts_added_per_game']
# Adjust based on which team the player is on
if team == 'home':
adjustment -= player_impact
else:
adjustment += player_impact
adjusted_spread = base_prediction['predicted_spread'] + adjustment
adjusted_win_prob = spread_to_win_probability(adjusted_spread)
return {
'predicted_spread': adjusted_spread,
'home_win_prob': adjusted_win_prob,
'injury_adjustment': adjustment
}
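A minimal numeric sketch of the adjustment logic, under stated assumptions: `spread_to_win_prob` below is a stand-in for the chapter's `spread_to_win_probability`, using a normal approximation with an assumed margin standard deviation of about 12 points, and the 3-point player value is purely hypothetical.

```python
import math

def spread_to_win_prob(spread, sd=12.0):
    # Stand-in for the chapter's spread_to_win_probability: normal
    # approximation with an ASSUMED ~12-point margin standard deviation
    return 0.5 * (1 + math.erf(spread / (sd * math.sqrt(2))))

# Hypothetical scenario: home team favored by 4 points loses a starter
# judged to be worth about 3 points per game
base_spread = 4.0
injury_adjustment = -3.0
adjusted_spread = base_spread + injury_adjustment

print(f"win prob before: {spread_to_win_prob(base_spread):.3f}")
print(f"win prob after:  {spread_to_win_prob(adjusted_spread):.3f}")
```

Under these assumptions a single injured starter shifts the home win probability by nearly ten percentage points, which is why injury news moves betting lines so quickly.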
Model Maintenance
Prediction models require ongoing maintenance:
def model_maintenance_schedule(schedule_type='daily'):
"""
Guidelines for model maintenance.
"""
maintenance_tasks = {
'daily': [
'Update Elo ratings with previous day\'s results',
'Refresh team statistics rolling averages',
'Check for injury updates',
'Log model predictions vs actual outcomes'
],
'weekly': [
'Recalculate team quality features',
'Update rest/travel features',
'Review prediction accuracy by team',
'Identify systematic biases'
],
'monthly': [
'Retrain regression models on recent data',
'Reoptimize ensemble weights',
'Full calibration analysis',
'Compare to market efficiency'
],
'end_of_season': [
'Comprehensive performance review',
'Regress Elo ratings toward mean',
'Archive model for historical comparison',
'Update for roster changes'
]
}
return maintenance_tasks.get(schedule_type, [])
Summary
Game outcome prediction synthesizes team evaluation, situational analysis, and probabilistic modeling into actionable forecasts. Key takeaways from this chapter include:
- Baselines matter: Simple baselines like home win percentage and team power ratings capture significant predictive power
- Elo systems provide elegant solutions: Self-correcting ratings that update with each game offer interpretable predictions
- Feature engineering is crucial: Rest, travel, and matchup features add predictive value beyond raw team quality
- Market efficiency is real: Betting markets are highly efficient, making consistent profits extremely difficult
- Proper evaluation is essential: Use proper scoring rules (Brier score, log loss) rather than simple accuracy
- Ensembles improve performance: Combining multiple models reduces error when individual models make different mistakes
- Context matters: Injuries, motivation, and scheduling create prediction opportunities
The complete pipeline presented here provides a foundation for building production-quality prediction systems. However, remember that even the best models face substantial uncertainty in basketball outcomes, and maintaining realistic expectations about predictive accuracy is essential for practical applications.
References
- Elo, A. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing.
- Silver, N. (2015). The Signal and the Noise: Why So Many Predictions Fail - But Some Don't. Penguin Books.
- Winston, W. L. (2012). Mathletics: How Gamblers, Managers, and Sports Enthusiasts Use Mathematics in Baseball, Basketball, and Football. Princeton University Press.
- Lopez, M. J., & Matthews, G. J. (2015). Building an NCAA men's basketball predictive model and quantifying its success. Journal of Quantitative Analysis in Sports, 11(1), 5-12.
- Manner, H. (2016). Modeling and forecasting the outcomes of NBA basketball games. Journal of Quantitative Analysis in Sports, 12(1), 31-41.
- Zimmermann, J. (2016). Prediction Markets in Sports Betting. Journal of Prediction Markets, 10(2), 1-22.