
Chapter 25: Game Outcome Prediction

Introduction

The ability to predict basketball game outcomes represents the culmination of sports analytics knowledge. Game prediction synthesizes everything we have learned about player evaluation, team performance metrics, and situational factors into actionable forecasts. Whether you are an analyst supporting a front office, a researcher studying competitive dynamics, or someone interested in testing predictions against betting markets, understanding the fundamentals of game outcome prediction is essential.

This chapter develops a comprehensive framework for predicting NBA game outcomes. We begin with the fundamental approaches to modeling point spreads and totals, then build increasingly sophisticated prediction systems. Along the way, we examine the efficiency of betting markets, quantify situational factors like home court advantage and rest, and develop rigorous evaluation frameworks for our models.

Game prediction in basketball presents unique challenges compared to other sports. The high-scoring nature of the game means that individual possessions matter less than in sports like soccer or hockey, making outcomes somewhat more predictable. However, the importance of matchups, the impact of injuries, and the strategic adjustments coaches make create substantial uncertainty that even the best models cannot fully capture.

25.1 Foundations of Game Prediction

The Prediction Problem

At its core, predicting a basketball game outcome involves estimating probabilities for various results. The most common prediction targets include:

  1. Win probability: The probability that a specific team wins the game
  2. Point spread: The expected margin of victory
  3. Total points: The expected combined score of both teams
  4. Exact score probability: The probability distribution over final scores

These targets are related but distinct. A team might be expected to win by 5 points (spread), but the win probability is not simply "greater than 50%." The distribution of outcomes matters. If a team is favored by 5 points but outcomes are highly variable, their win probability might be only 60%. If outcomes are more predictable, the same 5-point favorite might have a 70% win probability.
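
To see how the outcome distribution drives win probability, consider a quick sketch. The 5-point spread matches the example above; the 9- and 15-point standard deviations are illustrative, not fitted values:

```python
from scipy import stats

spread = 5.0  # expected home margin

# P(win) = P(margin > 0) when margin ~ Normal(spread, sd^2)
p_volatile = 1 - stats.norm.cdf(0, loc=spread, scale=15.0)  # noisy matchup
p_stable = 1 - stats.norm.cdf(0, loc=spread, scale=9.0)     # steadier matchup

print(f"sd=15: {p_volatile:.1%}  sd=9: {p_stable:.1%}")
```

The same 5-point favorite is roughly a 63% winner under the noisier assumption and about 71% under the steadier one.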

The Baseline: Home Team Win Percentage

Before building complex models, we must establish baselines. The simplest baseline for NBA game prediction is the historical home team win percentage. Over the past several decades, NBA home teams have won approximately 58-60% of their games, though this figure has declined in recent years.

import numpy as np
import pandas as pd

def calculate_home_win_rate(games_df):
    """
    Calculate the historical home team win rate.

    Parameters:
    -----------
    games_df : DataFrame
        Game data with columns: home_score, away_score

    Returns:
    --------
    float : Home team win rate
    """
    home_wins = (games_df['home_score'] > games_df['away_score']).sum()
    total_games = len(games_df)
    return home_wins / total_games

# Example usage
# home_win_rate = calculate_home_win_rate(nba_games)
# print(f"Home win rate: {home_win_rate:.1%}")

This baseline tells us that any model must do better than simply predicting the home team to win every game; that naive strategy already achieves about 58% accuracy. Under calibration and proper scoring rules (which we discuss later), the corresponding probabilistic baseline, assigning every home team a win probability of roughly 58%, also performs reasonably well because that probability is correct on average.
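
A small simulation sketch (assuming a constant 58% true home win rate, a simplification) shows how this baseline fares under both accuracy and a proper scoring rule like the Brier score:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 100,000 games where the home team truly wins 58% of the time.
home_wins = rng.random(100_000) < 0.58

# Classification baseline: always pick the home team.
accuracy = home_wins.mean()

# Probabilistic baseline: always assign the home team p = 0.58.
p = 0.58
brier = np.mean((p - home_wins) ** 2)  # lower is better
```

Any candidate model should beat both numbers: accuracy near 0.58 and a Brier score near 0.24.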

Simple Rating Systems

The next level of sophistication involves rating each team. The simplest approach is a power rating that captures each team's overall strength relative to an average team. If Team A has a power rating of +5 and Team B has a rating of -3, we might predict Team A to win by 8 points on a neutral court.

The most straightforward way to estimate power ratings is through least squares regression. Consider a model where:

$$\text{Margin}_i = \text{Rating}_{\text{home}} - \text{Rating}_{\text{away}} + \text{HCA} + \epsilon_i$$

Where HCA is home court advantage. This can be estimated using ridge regression to handle the collinearity inherent in team ratings:

from sklearn.linear_model import Ridge
import numpy as np

def estimate_power_ratings(games_df, teams, alpha=1.0):
    """
    Estimate team power ratings using ridge regression.

    Parameters:
    -----------
    games_df : DataFrame
        Game data with home_team, away_team, home_score, away_score
    teams : list
        List of all team identifiers
    alpha : float
        Ridge regularization parameter

    Returns:
    --------
    dict : Team power ratings
    float : Estimated home court advantage
    """
    n_games = len(games_df)
    n_teams = len(teams)
    team_to_idx = {team: i for i, team in enumerate(teams)}

    # Build design matrix: one column per team + intercept for HCA
    X = np.zeros((n_games, n_teams + 1))
    y = np.zeros(n_games)

    for i, (_, game) in enumerate(games_df.iterrows()):
        home_idx = team_to_idx[game['home_team']]
        away_idx = team_to_idx[game['away_team']]

        X[i, home_idx] = 1   # Home team
        X[i, away_idx] = -1  # Away team
        X[i, -1] = 1         # Home court advantage

        y[i] = game['home_score'] - game['away_score']

    # Fit ridge regression (note: alpha also shrinks the HCA column;
    # with a full season of games the effect on HCA is negligible)
    model = Ridge(alpha=alpha, fit_intercept=False)
    model.fit(X, y)

    ratings = {team: model.coef_[i] for team, i in team_to_idx.items()}
    hca = model.coef_[-1]

    # Center ratings around zero
    mean_rating = np.mean(list(ratings.values()))
    ratings = {team: r - mean_rating for team, r in ratings.items()}

    return ratings, hca

This simple approach captures a substantial portion of the predictable variance in game outcomes. Power ratings explain roughly 15-20% of the variance in game margins, which translates to meaningful improvements over the baseline.
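
As a sanity check, the ridge approach can be validated on synthetic data where the true ratings are known. The four-team ratings, 3-point home advantage, and 12-point noise below are all made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

# Hypothetical "true" ratings (points above/below average) and HCA.
true_ratings = {'A': 6.0, 'B': 2.0, 'C': -2.0, 'D': -6.0}
teams = list(true_ratings)
idx = {t: i for i, t in enumerate(teams)}
true_hca = 3.0

# Simulate 600 games with noisy margins (sd ~12 points, as in the NBA).
rows, margins = [], []
for _ in range(600):
    home, away = rng.choice(teams, size=2, replace=False)
    margin = (true_ratings[home] - true_ratings[away]
              + true_hca + rng.normal(0, 12))
    x = np.zeros(len(teams) + 1)
    x[idx[home]], x[idx[away]], x[-1] = 1, -1, 1
    rows.append(x)
    margins.append(margin)

model = Ridge(alpha=1.0, fit_intercept=False)
model.fit(np.array(rows), np.array(margins))

est = dict(zip(teams, model.coef_[:-1]))
est_hca = model.coef_[-1]
```

With enough games, the estimated ratings and home court advantage land close to the values used to generate the data.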

25.2 Point Spread Modeling

Understanding the Spread

The point spread, also known as the line, represents the expected margin of victory for the favored team. When oddsmakers set a spread of -7 for Team A against Team B, they expect Team A to win by approximately 7 points. Bettors can wager on either team to "cover" the spread: Team A must win by more than 7 points, or Team B must lose by fewer than 7 points (or win outright).

Point spreads serve as the market's consensus prediction. Professional oddsmakers and sophisticated bettors contribute to setting these lines, making them difficult to beat consistently. The spread incorporates vast amounts of information, including injury reports, recent performance, matchup history, and situational factors.

Modeling Point Differential

Building a point spread model requires predicting the expected point differential between two teams. The general framework combines:

  1. Team strength measures: Offensive and defensive ratings
  2. Situational factors: Home court, rest, travel
  3. Matchup considerations: Style compatibility, pace effects
  4. Recent form: Hot and cold streaks

import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

class PointSpreadModel:
    """
    Model for predicting NBA game point spreads.
    """

    def __init__(self):
        self.model = Ridge(alpha=10.0)
        self.scaler = StandardScaler()
        self.feature_names = None

    def create_features(self, games_df, team_stats_df):
        """
        Create feature matrix for point spread prediction.

        Parameters:
        -----------
        games_df : DataFrame
            Game data with home_team, away_team, date
        team_stats_df : DataFrame
            Team statistics indexed by team and date

        Returns:
        --------
        DataFrame : Feature matrix
        """
        features = []

        for _, game in games_df.iterrows():
            home_stats = team_stats_df.loc[game['home_team'], game['date']]
            away_stats = team_stats_df.loc[game['away_team'], game['date']]

            row = {
                # Efficiency differentials
                'off_rtg_diff': home_stats['off_rtg'] - away_stats['off_rtg'],
                'def_rtg_diff': home_stats['def_rtg'] - away_stats['def_rtg'],
                'net_rtg_diff': home_stats['net_rtg'] - away_stats['net_rtg'],

                # Pace factors
                'home_pace': home_stats['pace'],
                'away_pace': away_stats['pace'],
                'pace_diff': home_stats['pace'] - away_stats['pace'],

                # Four factors (home perspective)
                'efg_diff': home_stats['efg'] - away_stats['efg'],
                'tov_diff': away_stats['tov_rate'] - home_stats['tov_rate'],
                'orb_diff': home_stats['orb_rate'] - away_stats['orb_rate'],
                'ft_rate_diff': home_stats['ft_rate'] - away_stats['ft_rate'],

                # Rest and travel
                'home_rest': game.get('home_rest_days', 1),
                'away_rest': game.get('away_rest_days', 1),
                'rest_diff': game.get('home_rest_days', 1) - game.get('away_rest_days', 1),
                'away_travel_dist': game.get('away_travel_distance', 0),

                # Recent form (last 10 games)
                'home_recent_margin': home_stats.get('last_10_margin', 0),
                'away_recent_margin': away_stats.get('last_10_margin', 0),

                # Home court indicator
                'home_court': 1
            }
            features.append(row)

        return pd.DataFrame(features)

    def fit(self, X, y):
        """
        Fit the point spread model.

        Parameters:
        -----------
        X : DataFrame
            Feature matrix
        y : array-like
            Actual point differentials (home - away)
        """
        self.feature_names = X.columns.tolist()
        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)

    def predict(self, X):
        """
        Predict point spread.

        Parameters:
        -----------
        X : DataFrame
            Feature matrix

        Returns:
        --------
        array : Predicted point spreads
        """
        X_scaled = self.scaler.transform(X)
        return self.model.predict(X_scaled)

    def get_feature_importance(self):
        """
        Get feature importance from model coefficients.

        Returns:
        --------
        DataFrame : Feature importances
        """
        importance = pd.DataFrame({
            'feature': self.feature_names,
            'coefficient': self.model.coef_,
            'abs_coefficient': np.abs(self.model.coef_)
        }).sort_values('abs_coefficient', ascending=False)

        return importance

Spread Distribution and Win Probability

Point spread predictions typically assume a normal distribution of outcomes around the predicted spread. Historically, NBA game margins have a standard deviation of approximately 11-12 points. This allows us to convert spread predictions to win probabilities:

from scipy import stats

def spread_to_win_probability(spread, std_dev=11.5):
    """
    Convert a point spread to win probability for the favored team.

    Parameters:
    -----------
    spread : float
        Predicted point spread (positive = home favored)
    std_dev : float
        Standard deviation of game margins

    Returns:
    --------
    float : Win probability for home team
    """
    # Probability that home team wins = P(margin > 0)
    # If spread is the expected margin, we need P(X > 0) where X ~ N(spread, std_dev^2)
    win_prob = 1 - stats.norm.cdf(0, loc=spread, scale=std_dev)
    return win_prob

# Example: Home team favored by 5 points
spread = 5.0
win_prob = spread_to_win_probability(spread)
print(f"Spread: {spread}, Win probability: {win_prob:.1%}")
# Output: Spread: 5.0, Win probability: 66.8%

The relationship between spread and win probability is roughly linear for small spreads and flattens in the tails. A spread of zero implies a 50% win probability, while very large spreads approach certainty but never reach it.
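
That flattening can be quantified by comparing how much win probability one extra point of spread buys near zero versus deep in favorite territory, using the same 11.5-point standard deviation:

```python
from scipy import stats

def spread_to_win_prob(spread, sd=11.5):
    # P(margin > 0) with margin ~ Normal(spread, sd^2)
    return 1 - stats.norm.cdf(0, loc=spread, scale=sd)

# Probability gained per point of spread, near zero vs. far from it
step_near_zero = spread_to_win_prob(1) - spread_to_win_prob(0)
step_far_out = spread_to_win_prob(15) - spread_to_win_prob(14)
```

The first point of spread is worth about 3.5 percentage points of win probability; the fifteenth is worth less than half as much.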

25.3 Over/Under (Total Points) Prediction

The Totals Market

The over/under, or totals market, involves predicting whether the combined score of both teams will exceed a specified threshold. For example, if the total is set at 215.5 points, bettors wager on whether the actual combined score will be over or under that number.

Totals prediction requires understanding both teams' pace and efficiency. A game between two fast-paced, efficient offenses will likely produce more points than a game between two slow, defensive teams.

Modeling Total Points

class TotalPointsModel:
    """
    Model for predicting total points in NBA games.
    """

    def __init__(self):
        self.model = Ridge(alpha=5.0)
        self.scaler = StandardScaler()

    def create_features(self, games_df, team_stats_df):
        """
        Create features for total points prediction.
        """
        features = []

        for _, game in games_df.iterrows():
            home_stats = team_stats_df.loc[game['home_team'], game['date']]
            away_stats = team_stats_df.loc[game['away_team'], game['date']]

            # Expected pace: geometric mean of team paces
            expected_pace = np.sqrt(home_stats['pace'] * away_stats['pace'])

            # Expected efficiency
            home_off_vs_away_def = (home_stats['off_rtg'] + away_stats['def_rtg']) / 2
            away_off_vs_home_def = (away_stats['off_rtg'] + home_stats['def_rtg']) / 2

            row = {
                'expected_pace': expected_pace,
                'home_off_rtg': home_stats['off_rtg'],
                'away_off_rtg': away_stats['off_rtg'],
                'home_def_rtg': home_stats['def_rtg'],
                'away_def_rtg': away_stats['def_rtg'],
                'combined_off_rtg': home_stats['off_rtg'] + away_stats['off_rtg'],
                'combined_def_rtg': home_stats['def_rtg'] + away_stats['def_rtg'],
                'pace_sum': home_stats['pace'] + away_stats['pace'],
                'home_three_rate': home_stats.get('three_rate', 0.35),
                'away_three_rate': away_stats.get('three_rate', 0.35),
                'home_ft_rate': home_stats.get('ft_rate', 0.2),
                'away_ft_rate': away_stats.get('ft_rate', 0.2),
                'rest_total': game.get('home_rest_days', 1) + game.get('away_rest_days', 1),
                'is_back_to_back': int(
                    game.get('home_rest_days', 1) == 0 or
                    game.get('away_rest_days', 1) == 0
                )
            }
            features.append(row)

        return pd.DataFrame(features)

    def predict_total(self, home_off_rtg, away_off_rtg, home_def_rtg, away_def_rtg,
                      home_pace, away_pace, league_avg_rtg=110):
        """
        Simple analytical prediction of total points.

        Each offense's rating is adjusted by how the opposing defense
        compares to the league average:

            Pts_per_100 = Off_Rtg + (Opp_Def_Rtg - League_Avg)

        The total is the expected pace times combined points per 100
        possessions.
        """
        # Expected pace (possessions per 48 minutes)
        expected_pace = (home_pace + away_pace) / 2

        # Home offense vs away defense
        home_pts_per_100 = home_off_rtg + (away_def_rtg - league_avg_rtg)
        # Away offense vs home defense
        away_pts_per_100 = away_off_rtg + (home_def_rtg - league_avg_rtg)

        # Total points
        total = expected_pace * (home_pts_per_100 + away_pts_per_100) / 100

        return total
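
The pace-times-efficiency logic can also be checked with a self-contained back-of-envelope version, shifting each offense by how the opposing defense compares to league average (the 110 league-average rating is an illustrative assumption):

```python
def expected_total(home_off, home_def, away_off, away_def,
                   home_pace, away_pace, league_avg=110.0):
    """Back-of-envelope total: each offense's rating is shifted by how
    the opposing defense compares to the league average."""
    pace = (home_pace + away_pace) / 2              # possessions per 48 min
    home_pts = home_off + (away_def - league_avg)   # home offense vs away D
    away_pts = away_off + (home_def - league_avg)   # away offense vs home D
    return pace * (home_pts + away_pts) / 100

# Two league-average teams at a pace of 100 combine for ~220 points.
baseline = expected_total(110, 110, 110, 110, 100, 100)
```

Swapping in an elite defense (say, a 104 defensive rating) pulls the predicted total below that baseline, as it should.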

Factors Affecting Totals

Several factors systematically affect game totals:

  1. Pace: Faster-paced teams create more possessions and scoring opportunities
  2. Defensive quality: Elite defenses suppress scoring
  3. Three-point shooting: High-volume three-point teams have more volatile scoring
  4. Free throw rate: Games with more fouls tend to score differently
  5. Rest: Back-to-back games often feature lower totals
  6. Altitude: Games in Denver historically feature slightly different totals

25.4 Elo Rating Systems for Basketball

Origins and Principles

The Elo rating system, originally developed by Arpad Elo for chess, provides an elegant framework for rating competitors based on head-to-head results. Elo systems have been successfully adapted to many sports, including basketball.

The core principles of Elo are:

  1. Zero-sum updates: Points gained by the winner equal points lost by the loser
  2. Expectation-based: Updates depend on whether the result was surprising
  3. Self-correcting: Ratings converge to true strength over time
  4. Interpretable: Rating differences map to win probabilities

Basic Elo Implementation

class EloRatingSystem:
    """
    Elo rating system adapted for NBA basketball.
    """

    def __init__(self, k_factor=20, home_advantage=100, initial_rating=1500):
        """
        Initialize the Elo system.

        Parameters:
        -----------
        k_factor : float
            Maximum rating change per game
        home_advantage : float
            Rating points added to home team's effective rating
        initial_rating : float
            Starting rating for new teams
        """
        self.k_factor = k_factor
        self.home_advantage = home_advantage
        self.initial_rating = initial_rating
        self.ratings = {}
        self.history = []

    def get_rating(self, team):
        """Get current rating for a team."""
        return self.ratings.get(team, self.initial_rating)

    def expected_score(self, rating_a, rating_b):
        """
        Calculate expected score (win probability) for team A.

        Uses the logistic function with 400 as the scale factor.
        A 400-point rating difference corresponds to ~90% win probability.
        """
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    def update_ratings(self, home_team, away_team, home_score, away_score):
        """
        Update ratings after a game.

        Parameters:
        -----------
        home_team : str
            Home team identifier
        away_team : str
            Away team identifier
        home_score : int
            Home team's final score
        away_score : int
            Away team's final score

        Returns:
        --------
        tuple : (new_home_rating, new_away_rating, home_rating_change)
        """
        # Get current ratings
        home_rating = self.get_rating(home_team)
        away_rating = self.get_rating(away_team)

        # Calculate expected scores with home advantage
        home_expected = self.expected_score(
            home_rating + self.home_advantage,
            away_rating
        )
        away_expected = 1 - home_expected

        # Actual result (1 for win, 0 for loss, 0.5 for tie)
        if home_score > away_score:
            home_actual = 1
            away_actual = 0
        elif away_score > home_score:
            home_actual = 0
            away_actual = 1
        else:
            home_actual = 0.5
            away_actual = 0.5

        # Update ratings
        home_change = self.k_factor * (home_actual - home_expected)
        away_change = self.k_factor * (away_actual - away_expected)

        new_home_rating = home_rating + home_change
        new_away_rating = away_rating + away_change

        self.ratings[home_team] = new_home_rating
        self.ratings[away_team] = new_away_rating

        # Store history
        self.history.append({
            'home_team': home_team,
            'away_team': away_team,
            'home_rating_before': home_rating,
            'away_rating_before': away_rating,
            'home_rating_after': new_home_rating,
            'away_rating_after': new_away_rating,
            'home_expected': home_expected,
            'home_actual': home_actual
        })

        return new_home_rating, new_away_rating, home_change

    def predict_game(self, home_team, away_team):
        """
        Predict win probability and spread for a game.

        Returns:
        --------
        dict : Prediction with win_prob and expected_spread
        """
        home_rating = self.get_rating(home_team)
        away_rating = self.get_rating(away_team)

        # Win probability with home advantage
        home_win_prob = self.expected_score(
            home_rating + self.home_advantage,
            away_rating
        )

        # Convert rating difference to spread
        # Empirically, ~25-30 Elo points ≈ 1 point of spread
        rating_diff = (home_rating + self.home_advantage) - away_rating
        expected_spread = rating_diff / 28

        return {
            'home_win_prob': home_win_prob,
            'away_win_prob': 1 - home_win_prob,
            'expected_spread': expected_spread,
            'home_rating': home_rating,
            'away_rating': away_rating
        }

    def process_season(self, games_df):
        """
        Process all games in a season and update ratings.

        Parameters:
        -----------
        games_df : DataFrame
            Games sorted by date with columns:
            home_team, away_team, home_score, away_score
        """
        for _, game in games_df.iterrows():
            self.update_ratings(
                game['home_team'],
                game['away_team'],
                game['home_score'],
                game['away_score']
            )

    def get_rankings(self):
        """Get current team rankings."""
        rankings = pd.DataFrame([
            {'team': team, 'rating': rating}
            for team, rating in self.ratings.items()
        ]).sort_values('rating', ascending=False)
        rankings['rank'] = range(1, len(rankings) + 1)
        return rankings
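
The update arithmetic is easy to sanity-check outside the class. A minimal sketch with k = 20 and no home advantage:

```python
def elo_expected(ra, rb):
    # Logistic expectation on the conventional 400-point scale
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def elo_update(ra, rb, a_won, k=20):
    ea = elo_expected(ra, rb)
    change = k * ((1 if a_won else 0) - ea)
    return ra + change, rb - change  # zero-sum update

# Evenly matched teams: the winner gains exactly k/2 points.
r1, r2 = elo_update(1500, 1500, a_won=True)

# Upsets move ratings more than expected results do.
upset_gain = elo_update(1400, 1600, a_won=True)[0] - 1400
expected_gain = elo_update(1600, 1400, a_won=True)[0] - 1600
```

Note that the two gains sum to exactly k: a surprising win transfers the rating points that an unsurprising win would have left in place.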

Margin-Aware Elo

Standard Elo only considers wins and losses. For basketball, incorporating margin of victory provides additional information:

class MarginAwareElo(EloRatingSystem):
    """
    Elo system that incorporates margin of victory.
    """

    def __init__(self, k_factor=20, home_advantage=100, initial_rating=1500,
                 margin_cap=20):
        """
        Parameters:
        -----------
        margin_cap : float
            Maximum margin of victory to credit (caps blowouts)
        """
        super().__init__(k_factor, home_advantage, initial_rating)
        self.margin_cap = margin_cap

    def margin_multiplier(self, margin, elo_diff):
        """
        Calculate a K-factor multiplier based on margin of victory.

        The log term gives diminishing returns for blowouts, and the
        denominator damps the multiplier in lopsided matchups (large
        |elo_diff|), limiting rating inflation when strong teams rack
        up expected blowouts.
        """
        # Cap the margin so extreme blowouts don't dominate
        margin = min(abs(margin), self.margin_cap)

        # Grows with margin, shrinks as the rating gap widens
        multiplier = np.log(margin + 1) * (2.2 / (1 + 0.001 * abs(elo_diff)))

        # Never shrink below the base K-factor
        return max(1.0, multiplier)

    def update_ratings(self, home_team, away_team, home_score, away_score):
        """Update ratings with margin adjustment."""
        home_rating = self.get_rating(home_team)
        away_rating = self.get_rating(away_team)

        # Calculate expected scores
        elo_diff = (home_rating + self.home_advantage) - away_rating
        home_expected = self.expected_score(
            home_rating + self.home_advantage,
            away_rating
        )

        # Actual result
        margin = home_score - away_score
        if margin > 0:
            home_actual = 1
        elif margin < 0:
            home_actual = 0
        else:
            home_actual = 0.5

        # Margin multiplier
        mult = self.margin_multiplier(margin, elo_diff)

        # Update with adjusted K-factor
        effective_k = self.k_factor * mult
        home_change = effective_k * (home_actual - home_expected)

        self.ratings[home_team] = home_rating + home_change
        self.ratings[away_team] = away_rating - home_change

        return self.ratings[home_team], self.ratings[away_team], home_change

Season Carryover

Between seasons, teams change through trades, free agency, and the draft. Elo systems typically regress ratings toward the mean between seasons:

def regress_ratings_to_mean(elo_system, regression_factor=0.25):
    """
    Regress all ratings toward the mean between seasons.

    Parameters:
    -----------
    elo_system : EloRatingSystem
        The Elo system to modify
    regression_factor : float
        Fraction to regress toward mean (0.25 = 25% regression)
    """
    if not elo_system.ratings:
        return

    mean_rating = np.mean(list(elo_system.ratings.values()))

    for team in elo_system.ratings:
        current = elo_system.ratings[team]
        elo_system.ratings[team] = current + regression_factor * (mean_rating - current)
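
A quick worked example with hypothetical ratings makes the arithmetic concrete:

```python
def regress_to_mean(ratings, factor=0.25):
    # Pull each rating `factor` of the way back toward the league mean
    mean = sum(ratings.values()) / len(ratings)
    return {team: r + factor * (mean - r) for team, r in ratings.items()}

# 25% regression: a team 100 points above the mean gives back 25 points
new = regress_to_mean({'A': 1600, 'B': 1500, 'C': 1400}, factor=0.25)
```

The spread of the ratings compresses while their mean stays fixed, reflecting offseason roster churn.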

25.5 Feature Engineering for Game Prediction

Temporal Features

The timing of games matters significantly in basketball. Key temporal features include:

def create_temporal_features(games_df, schedule_df):
    """
    Create temporal features for game prediction.

    Parameters:
    -----------
    games_df : DataFrame
        Games to create features for
    schedule_df : DataFrame
        Full schedule for calculating rest, etc.

    Returns:
    --------
    DataFrame : Temporal features
    """
    features = []

    for _, game in games_df.iterrows():
        # Days since last game (rest)
        home_last = schedule_df[
            (schedule_df['team'] == game['home_team']) &
            (schedule_df['date'] < game['date'])
        ]['date'].max()

        away_last = schedule_df[
            (schedule_df['team'] == game['away_team']) &
            (schedule_df['date'] < game['date'])
        ]['date'].max()

        home_rest = (game['date'] - home_last).days if pd.notna(home_last) else 3
        away_rest = (game['date'] - away_last).days if pd.notna(away_last) else 3

        # Game number in current road/home stretch
        # (Not fully implemented - would require additional schedule analysis)

        # Month of season (early, mid, late)
        month = game['date'].month
        is_early_season = month in [10, 11]
        is_late_season = month in [3, 4]

        # Day of week
        day_of_week = game['date'].dayofweek
        is_weekend = day_of_week >= 5

        # Rest here counts calendar days since the last game, so a value
        # of 1 means the second night of a back-to-back.
        row = {
            'home_rest_days': home_rest,
            'away_rest_days': away_rest,
            'rest_advantage': home_rest - away_rest,
            'home_back_to_back': int(home_rest == 1),
            'away_back_to_back': int(away_rest == 1),
            'both_rested': int(home_rest >= 2 and away_rest >= 2),
            'is_early_season': int(is_early_season),
            'is_late_season': int(is_late_season),
            'is_weekend': int(is_weekend),
            'games_into_season': game.get('game_number', 41)
        }
        features.append(row)

    return pd.DataFrame(features)

Team Quality Features

Team quality extends beyond simple power ratings:

def create_team_quality_features(games_df, team_stats_df):
    """
    Create features capturing team quality and matchups.
    """
    features = []

    for _, game in games_df.iterrows():
        home = team_stats_df.loc[game['home_team']]
        away = team_stats_df.loc[game['away_team']]

        row = {
            # Efficiency metrics
            'home_net_rtg': home['off_rtg'] - home['def_rtg'],
            'away_net_rtg': away['off_rtg'] - away['def_rtg'],
            'net_rtg_diff': (home['off_rtg'] - home['def_rtg']) - (away['off_rtg'] - away['def_rtg']),

            # Strength of schedule
            'home_sos': home.get('strength_of_schedule', 0),
            'away_sos': away.get('strength_of_schedule', 0),

            # Consistency (low std dev = consistent)
            'home_consistency': -home.get('margin_std', 10),
            'away_consistency': -away.get('margin_std', 10),

            # Clutch performance (4th quarter, close games)
            'home_clutch_net_rtg': home.get('clutch_net_rtg', home['off_rtg'] - home['def_rtg']),
            'away_clutch_net_rtg': away.get('clutch_net_rtg', away['off_rtg'] - away['def_rtg']),

            # Record in close games
            'home_close_game_pct': home.get('close_game_win_pct', 0.5),
            'away_close_game_pct': away.get('close_game_win_pct', 0.5),

            # Win streaks / momentum
            'home_streak': home.get('current_streak', 0),
            'away_streak': away.get('current_streak', 0),

            # Record vs playoff teams
            'home_vs_playoff_pct': home.get('vs_playoff_win_pct', 0.5),
            'away_vs_playoff_pct': away.get('vs_playoff_win_pct', 0.5)
        }
        features.append(row)

    return pd.DataFrame(features)

Style Matchup Features

Basketball is a matchup-driven game. Certain styles perform better against others:

def create_style_features(games_df, team_stats_df):
    """
    Create features based on team playing styles.
    """
    features = []

    for _, game in games_df.iterrows():
        home = team_stats_df.loc[game['home_team']]
        away = team_stats_df.loc[game['away_team']]

        row = {
            # Pace differential
            'pace_diff': home['pace'] - away['pace'],
            'avg_pace': (home['pace'] + away['pace']) / 2,

            # Three-point reliance
            'home_three_rate': home.get('three_att_rate', 0.35),
            'away_three_rate': away.get('three_att_rate', 0.35),
            'three_rate_diff': home.get('three_att_rate', 0.35) - away.get('three_att_rate', 0.35),

            # Paint scoring
            'home_paint_pts_rate': home.get('paint_pts_rate', 0.4),
            'away_paint_pts_rate': away.get('paint_pts_rate', 0.4),

            # Rebounding
            'home_reb_rate': home.get('reb_rate', 0.5),
            'away_reb_rate': away.get('reb_rate', 0.5),

            # Turnover tendencies
            'home_tov_rate': home.get('tov_rate', 0.12),
            'away_tov_rate': away.get('tov_rate', 0.12),

            # Style compatibility score
            # High pace vs low pace, perimeter vs interior, etc.
            'style_mismatch': abs(home['pace'] - away['pace']) +
                            abs(home.get('three_att_rate', 0.35) - away.get('three_att_rate', 0.35)) * 100
        }
        features.append(row)

    return pd.DataFrame(features)

25.6 Betting Market Efficiency

The Efficient Market Hypothesis in Sports

The efficient market hypothesis (EMH) suggests that betting lines incorporate all available information, making it impossible to consistently beat the market. In sports betting, "soft" efficiency means that while markets may have biases, these biases are smaller than the transaction costs (the vig or juice).

Evidence suggests that NBA betting markets are highly efficient:

  1. Closing lines are excellent predictors of game outcomes
  2. Line movements from open to close generally improve accuracy
  3. Profitable betting strategies are rare and often disappear when publicized
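
The vig itself sets the bar a model must clear. A short sketch converting American odds to implied probability, assuming standard -110 pricing on both sides:

```python
def implied_probability(american_odds):
    # Implied win probability from American odds (vig not removed)
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

# At -110, a spread bettor risks 110 to win 100, so the breakeven
# cover rate is 110/210, about 52.4%, not 50%.
breakeven = implied_probability(-110)

# Pricing both sides at -110 sums to more than 1: the bookmaker's overround.
overround = 2 * implied_probability(-110) - 1
```

Beating the market therefore means predicting covers at better than roughly 52.4%, not merely better than a coin flip.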

Testing Market Efficiency

def analyze_market_efficiency(games_df):
    """
    Analyze the efficiency of betting markets.

    Parameters:
    -----------
    games_df : DataFrame
        Games with columns: spread, actual_margin, over_under, actual_total

    Returns:
    --------
    dict : Market efficiency metrics
    """
    results = {}

    # Point spread accuracy
    games_df['spread_error'] = games_df['actual_margin'] - games_df['spread']

    results['spread_mae'] = games_df['spread_error'].abs().mean()
    results['spread_rmse'] = np.sqrt((games_df['spread_error'] ** 2).mean())
    results['spread_bias'] = games_df['spread_error'].mean()

    # Cover rate (should be ~50% if efficient)
    games_df['home_covers'] = games_df['actual_margin'] > games_df['spread']
    results['home_cover_rate'] = games_df['home_covers'].mean()

    # Over/under accuracy
    games_df['total_error'] = games_df['actual_total'] - games_df['over_under']

    results['total_mae'] = games_df['total_error'].abs().mean()
    results['total_rmse'] = np.sqrt((games_df['total_error'] ** 2).mean())
    results['total_bias'] = games_df['total_error'].mean()

    # Over rate (should be ~50% if efficient)
    games_df['over_hit'] = games_df['actual_total'] > games_df['over_under']
    results['over_rate'] = games_df['over_hit'].mean()

    return results

def test_betting_strategy(games_df, predictions_df, strategy_fn, vig=0.10):
    """
    Backtest a betting strategy.

    Parameters:
    -----------
    games_df : DataFrame
        Actual game results
    predictions_df : DataFrame
        Model predictions
    strategy_fn : callable
        Function that returns bet decision given prediction and line
    vig : float
        Bookmaker's commission as a fraction of the amount won
        (standard -110 odds: risk 1.10 to win 1.00, so vig = 0.10)

    Returns:
    --------
    dict : Strategy performance metrics
    """
    results = []

    for i in range(len(games_df)):
        game = games_df.iloc[i]
        pred = predictions_df.iloc[i]

        # Get bet decision: 1 (bet home), -1 (bet away), 0 (no bet)
        bet = strategy_fn(pred, game)

        if bet == 0:
            continue

        # Determine if the bet won; a push (margin equals spread) refunds the stake
        actual_margin = game['actual_margin']
        spread = game['spread']

        if actual_margin == spread:
            continue  # Push: no win, no loss

        if bet == 1:  # Bet on home team to cover
            won = actual_margin > spread
        else:  # Bet on away team to cover
            won = actual_margin < spread

        # Calculate profit (standard -110 odds: risk 1.10 to win 1.00)
        if won:
            profit = 1.0
        else:
            profit = -(1 + vig)

        results.append({
            'bet': bet,
            'won': won,
            'profit': profit
        })

    results_df = pd.DataFrame(results)

    return {
        'total_bets': len(results_df),
        'wins': results_df['won'].sum(),
        'losses': len(results_df) - results_df['won'].sum(),
        'win_rate': results_df['won'].mean(),
        'total_profit': results_df['profit'].sum(),
        'roi': results_df['profit'].sum() / len(results_df) if len(results_df) > 0 else 0,
        'required_win_rate': (1 + vig) / (2 + vig)  # ~52.4% for standard vig
    }
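
A minimal strategy function compatible with the backtester above; the `predicted_spread` column name is an assumption about how the predictions frame is organized:

```python
def edge_threshold_strategy(pred, game, threshold=2.0):
    """Bet only when the model and the market differ by >= threshold points.

    Assumes both pred['predicted_spread'] and game['spread'] are expressed
    as the expected home margin.
    """
    edge = pred['predicted_spread'] - game['spread']
    if edge >= threshold:
        return 1   # Model favors the home side relative to the line
    if edge <= -threshold:
        return -1  # Model favors the away side
    return 0       # Disagreement too small: no bet

# Usage with the frames defined earlier:
# performance = test_betting_strategy(games_df, predictions_df, edge_threshold_strategy)
```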

Market Anomalies

Despite overall efficiency, research has identified several historical anomalies in NBA betting markets:

  1. Home underdog bias: Home underdogs have historically covered at slightly higher rates
  2. Rest advantage undervaluation: Teams with significant rest advantages may be undervalued
  3. Late-season motivation: Playoff-bound teams resting starters create value
  4. Public bias: Heavy public betting on favorites can move lines past true value

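
The first anomaly is easy to test against historical data. A sketch, assuming `spread` is quoted as the expected home margin (negative means the home team is the underdog) and `actual_margin` is home score minus away score:

```python
import pandas as pd

def home_underdog_cover_rate(games_df):
    """Cover rate for home underdogs; should be ~50% if the market is efficient."""
    dogs = games_df[games_df['spread'] < 0]
    covers = dogs['actual_margin'] > dogs['spread']
    return covers.mean(), len(dogs)

# Toy data: three home-underdog games, one of which covers
games = pd.DataFrame({
    'spread': [-4.0, -2.0, 3.0, -6.0, 5.0],
    'actual_margin': [-1.0, -5.0, 7.0, -6.5, 2.0],
})
rate, n = home_underdog_cover_rate(games)
print(rate, n)
```

On real data, a cover rate persistently above 52.4% would be needed before the bias is worth anything after the vig.
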
def identify_potential_value(games_df, predictions_df, threshold=2.0):
    """
    Identify games where model disagrees significantly with market.

    Parameters:
    -----------
    games_df : DataFrame
        Game data with market spreads
    predictions_df : DataFrame
        Model spread predictions
    threshold : float
        Minimum disagreement to flag (in points)

    Returns:
    --------
    DataFrame : Games with potential value
    """
    games_df = games_df.copy()
    games_df['model_spread'] = predictions_df['predicted_spread']
    games_df['disagreement'] = games_df['model_spread'] - games_df['market_spread']

    # Flag games with significant disagreement
    value_games = games_df[games_df['disagreement'].abs() >= threshold].copy()

    # Determine bet direction
    value_games['suggested_bet'] = np.where(
        value_games['disagreement'] > 0,
        'HOME',  # Model more bullish on home team
        'AWAY'   # Model more bullish on away team
    )

    return value_games[['home_team', 'away_team', 'market_spread',
                        'model_spread', 'disagreement', 'suggested_bet']]

25.7 Model Evaluation Metrics

Accuracy-Based Metrics

Simple accuracy (percentage of correct predictions) is intuitive but limited. It does not reward confidence or calibration:

def calculate_prediction_accuracy(predictions_df, results_df):
    """
    Calculate basic prediction accuracy metrics.
    """
    # Merge predictions with results
    df = predictions_df.merge(results_df, on='game_id')

    # Win/loss accuracy
    df['predicted_winner'] = np.where(df['home_win_prob'] > 0.5, 'home', 'away')
    df['actual_winner'] = np.where(df['home_score'] > df['away_score'], 'home', 'away')
    accuracy = (df['predicted_winner'] == df['actual_winner']).mean()

    # ATS accuracy (against the spread)
    df['actual_margin'] = df['home_score'] - df['away_score']
    df['predicted_cover'] = df['predicted_spread'] > df['market_spread']
    df['actual_cover'] = df['actual_margin'] > df['market_spread']
    ats_accuracy = (df['predicted_cover'] == df['actual_cover']).mean()

    return {
        'win_accuracy': accuracy,
        'ats_accuracy': ats_accuracy
    }

Brier Score

The Brier score is a proper scoring rule that rewards calibrated probability predictions:

$$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2$$

Where $p_i$ is the predicted probability and $o_i$ is the actual outcome (1 or 0).

def brier_score(predictions, outcomes):
    """
    Calculate Brier score for probabilistic predictions.

    Parameters:
    -----------
    predictions : array-like
        Predicted probabilities (0 to 1)
    outcomes : array-like
        Actual outcomes (0 or 1)

    Returns:
    --------
    float : Brier score (lower is better, 0 is perfect)
    """
    predictions = np.array(predictions)
    outcomes = np.array(outcomes)
    return np.mean((predictions - outcomes) ** 2)

def brier_skill_score(predictions, outcomes, baseline_prob=None):
    """
    Calculate Brier Skill Score relative to a baseline.

    Parameters:
    -----------
    predictions : array-like
        Model's predicted probabilities
    outcomes : array-like
        Actual outcomes
    baseline_prob : float, optional
        Baseline probability (default: mean outcome)

    Returns:
    --------
    float : Brier Skill Score (higher is better, 0 = baseline, 1 = perfect)
    """
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes)

    if baseline_prob is None:
        baseline_prob = np.mean(outcomes)

    model_brier = brier_score(predictions, outcomes)
    baseline_brier = brier_score(np.full_like(predictions, baseline_prob), outcomes)

    return 1 - (model_brier / baseline_brier)
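
Because the Brier score is a proper scoring rule, a forecaster minimizes it by reporting its true beliefs; pushing calibrated probabilities toward the extremes can only hurt. A self-contained simulation illustrates this (the probability range and distortion factor are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
true_p = rng.uniform(0.4, 0.9, size=n)                  # True home win probabilities
outcomes = (rng.uniform(size=n) < true_p).astype(float)

calibrated = true_p                                      # Reports the true probability
overconfident = np.clip(1.5 * (true_p - 0.5) + 0.5, 0.01, 0.99)  # Pushed to extremes

def brier(p, o):
    return np.mean((p - o) ** 2)

# The calibrated forecaster scores strictly better on average
print(brier(calibrated, outcomes) < brier(overconfident, outcomes))  # True
```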

Log Loss

Log loss (cross-entropy loss) more heavily penalizes confident wrong predictions:

$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [o_i \log(p_i) + (1-o_i) \log(1-p_i)]$$

def log_loss(predictions, outcomes, eps=1e-15):
    """
    Calculate log loss for probabilistic predictions.

    Parameters:
    -----------
    predictions : array-like
        Predicted probabilities (0 to 1)
    outcomes : array-like
        Actual outcomes (0 or 1)
    eps : float
        Small value to avoid log(0)

    Returns:
    --------
    float : Log loss (lower is better)
    """
    predictions = np.array(predictions)
    outcomes = np.array(outcomes)

    # Clip predictions to avoid log(0)
    predictions = np.clip(predictions, eps, 1 - eps)

    return -np.mean(
        outcomes * np.log(predictions) +
        (1 - outcomes) * np.log(1 - predictions)
    )

Calibration Analysis

A well-calibrated model's predicted probabilities should match observed frequencies:

def calibration_analysis(predictions, outcomes, n_bins=10):
    """
    Analyze calibration of probability predictions.

    Parameters:
    -----------
    predictions : array-like
        Predicted probabilities
    outcomes : array-like
        Actual outcomes
    n_bins : int
        Number of bins for calibration

    Returns:
    --------
    DataFrame : Calibration data by bin
    """
    predictions = np.array(predictions)
    outcomes = np.array(outcomes)

    # Create bins
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(predictions, bin_edges[1:-1])

    calibration_data = []
    for i in range(n_bins):
        mask = bin_indices == i
        if mask.sum() > 0:
            calibration_data.append({
                'bin_lower': bin_edges[i],
                'bin_upper': bin_edges[i + 1],
                'bin_center': (bin_edges[i] + bin_edges[i + 1]) / 2,
                'mean_predicted': predictions[mask].mean(),
                'mean_observed': outcomes[mask].mean(),
                'count': mask.sum(),
                'calibration_error': abs(predictions[mask].mean() - outcomes[mask].mean())
            })

    return pd.DataFrame(calibration_data)

def expected_calibration_error(predictions, outcomes, n_bins=10):
    """
    Calculate Expected Calibration Error (ECE).
    """
    cal_data = calibration_analysis(predictions, outcomes, n_bins)
    total = cal_data['count'].sum()

    ece = sum(
        row['count'] / total * row['calibration_error']
        for _, row in cal_data.iterrows()
    )

    return ece

Comprehensive Evaluation Framework

class PredictionEvaluator:
    """
    Comprehensive evaluation framework for game predictions.
    """

    def __init__(self, predictions, outcomes, spreads_pred=None,
                 spreads_actual=None, market_spreads=None):
        """
        Initialize evaluator with predictions and outcomes.

        Parameters:
        -----------
        predictions : array-like
            Win probability predictions for home team
        outcomes : array-like
            Actual outcomes (1 if home won, 0 if away won)
        spreads_pred : array-like, optional
            Predicted point spreads
        spreads_actual : array-like, optional
            Actual point margins
        market_spreads : array-like, optional
            Market/betting spreads
        """
        self.predictions = np.array(predictions)
        self.outcomes = np.array(outcomes)
        self.spreads_pred = np.array(spreads_pred) if spreads_pred is not None else None
        self.spreads_actual = np.array(spreads_actual) if spreads_actual is not None else None
        self.market_spreads = np.array(market_spreads) if market_spreads is not None else None

    def evaluate_all(self):
        """
        Run all evaluation metrics.

        Returns:
        --------
        dict : Comprehensive evaluation results
        """
        results = {}

        # Accuracy metrics
        predicted_winners = (self.predictions > 0.5).astype(int)
        results['accuracy'] = (predicted_winners == self.outcomes).mean()

        # Probabilistic metrics
        results['brier_score'] = brier_score(self.predictions, self.outcomes)
        results['brier_skill_score'] = brier_skill_score(self.predictions, self.outcomes)
        results['log_loss'] = log_loss(self.predictions, self.outcomes)
        results['expected_calibration_error'] = expected_calibration_error(
            self.predictions, self.outcomes
        )

        # Spread-based metrics (if available)
        if self.spreads_pred is not None and self.spreads_actual is not None:
            results['spread_mae'] = np.mean(np.abs(self.spreads_pred - self.spreads_actual))
            results['spread_rmse'] = np.sqrt(np.mean((self.spreads_pred - self.spreads_actual) ** 2))
            results['spread_correlation'] = np.corrcoef(self.spreads_pred, self.spreads_actual)[0, 1]

        # ATS performance (if market spreads and actual margins available)
        if (self.market_spreads is not None and self.spreads_pred is not None
                and self.spreads_actual is not None):
            model_pick = self.spreads_pred > self.market_spreads
            actual_cover = self.spreads_actual > self.market_spreads
            results['ats_accuracy'] = (model_pick == actual_cover).mean()

        return results

    def generate_report(self):
        """
        Generate a formatted evaluation report.
        """
        results = self.evaluate_all()

        report = """
        ========================================
        GAME PREDICTION MODEL EVALUATION REPORT
        ========================================

        ACCURACY METRICS
        ----------------
        Win/Loss Accuracy:        {accuracy:.1%}

        PROBABILISTIC METRICS
        ---------------------
        Brier Score:              {brier_score:.4f}
        Brier Skill Score:        {brier_skill_score:.4f}
        Log Loss:                 {log_loss:.4f}
        Expected Cal. Error:      {expected_calibration_error:.4f}
        """.format(**results)

        if 'spread_mae' in results:
            report += """
        SPREAD PREDICTION METRICS
        -------------------------
        Spread MAE:               {spread_mae:.2f} pts
        Spread RMSE:              {spread_rmse:.2f} pts
        Spread Correlation:       {spread_correlation:.3f}
        """.format(**results)

        if 'ats_accuracy' in results:
            report += """
        AGAINST THE SPREAD
        ------------------
        ATS Accuracy:             {ats_accuracy:.1%}
        """.format(**results)

        return report

25.8 Home Court Advantage Quantification

Measuring Home Court Advantage

Home court advantage (HCA) is one of the most persistent effects in basketball. Historically, NBA home teams have won roughly 58-60% of games, which translates to about 3 points of expected margin, though the edge has narrowed in recent seasons.
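
The win-rate and point-margin framings are two views of the same effect. Under a normal approximation to game margins (a standard deviation near 13 points is a common rule of thumb, not a figure derived in this chapter), they can be converted back and forth:

```python
from scipy.stats import norm

MARGIN_SD = 13.0  # Rough SD of NBA game margins (rule-of-thumb assumption)

def win_pct_to_margin(win_pct, sd=MARGIN_SD):
    """Expected margin implied by a win percentage under normal margins."""
    return norm.ppf(win_pct) * sd

def margin_to_win_pct(margin, sd=MARGIN_SD):
    """Win probability implied by an expected margin."""
    return norm.cdf(margin / sd)

# A 59% home win rate corresponds to roughly a 3-point edge
print(round(win_pct_to_margin(0.59), 2))  # ~2.96
```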

def quantify_home_court_advantage(games_df, min_games=30):
    """
    Quantify home court advantage overall and by team.

    Parameters:
    -----------
    games_df : DataFrame
        Game results with home_team, away_team, home_score, away_score
    min_games : int
        Minimum games to report team-specific HCA

    Returns:
    --------
    dict : Home court advantage metrics
    """
    # Work on a copy so the caller's frame is not mutated
    games_df = games_df.copy()

    # Overall HCA
    games_df['home_margin'] = games_df['home_score'] - games_df['away_score']
    games_df['home_win'] = games_df['home_margin'] > 0

    overall_hca = {
        'win_pct': games_df['home_win'].mean(),
        'avg_margin': games_df['home_margin'].mean(),
        'median_margin': games_df['home_margin'].median(),
        'std_margin': games_df['home_margin'].std()
    }

    # Team-specific HCA
    home_records = games_df.groupby('home_team').agg({
        'home_win': ['sum', 'count'],
        'home_margin': 'mean'
    })
    home_records.columns = ['home_wins', 'home_games', 'home_margin']

    away_records = games_df.groupby('away_team').agg({
        'home_win': lambda x: (1 - x).sum(),
        'home_margin': lambda x: -x.mean()
    })
    away_records.columns = ['away_wins', 'away_margin']
    away_records['away_games'] = games_df.groupby('away_team').size()

    # Calculate team-specific HCA
    team_hca = []
    for team in home_records.index:
        if team in away_records.index:
            home_stats = home_records.loc[team]
            away_stats = away_records.loc[team]

            if home_stats['home_games'] >= min_games and away_stats['away_games'] >= min_games:
                hca_margin = home_stats['home_margin'] - away_stats['away_margin']
                hca_win_pct = (home_stats['home_wins'] / home_stats['home_games']) - \
                            (away_stats['away_wins'] / away_stats['away_games'])

                team_hca.append({
                    'team': team,
                    'home_win_pct': home_stats['home_wins'] / home_stats['home_games'],
                    'away_win_pct': away_stats['away_wins'] / away_stats['away_games'],
                    'home_margin': home_stats['home_margin'],
                    'away_margin': away_stats['away_margin'],
                    'hca_margin': hca_margin,
                    'hca_win_pct': hca_win_pct,
                    'total_games': home_stats['home_games'] + away_stats['away_games']
                })

    team_hca_df = pd.DataFrame(team_hca).sort_values('hca_margin', ascending=False)

    return {
        'overall': overall_hca,
        'by_team': team_hca_df
    }

Components of Home Court Advantage

Research has identified several factors contributing to HCA:

  1. Crowd effects: Psychological impact on players and referees
  2. Travel fatigue: Visiting teams travel more
  3. Familiarity: Home teams know their arena, sight lines, and shooting backgrounds
  4. Schedule: Home stands allow routine; road trips disrupt it
  5. Referee bias: Studies show subtle officiating favoritism

def decompose_home_court_advantage(games_df, ref_data=None):
    """
    Attempt to decompose HCA into component factors.

    This is an approximation based on available data.
    """
    results = {}

    # Base HCA
    base_hca = games_df['home_score'].mean() - games_df['away_score'].mean()
    results['total_hca'] = base_hca

    # Rest-adjusted HCA
    if 'home_rest' in games_df.columns and 'away_rest' in games_df.columns:
        # Games where rest is equal
        equal_rest = games_df[games_df['home_rest'] == games_df['away_rest']]
        rest_neutral_hca = equal_rest['home_score'].mean() - equal_rest['away_score'].mean()
        results['rest_neutral_hca'] = rest_neutral_hca
        results['rest_component'] = base_hca - rest_neutral_hca

    # Travel component (if travel data available)
    if 'away_travel_distance' in games_df.columns:
        short_travel = games_df[games_df['away_travel_distance'] < 500]
        long_travel = games_df[games_df['away_travel_distance'] >= 1500]

        results['short_travel_hca'] = short_travel['home_score'].mean() - short_travel['away_score'].mean()
        results['long_travel_hca'] = long_travel['home_score'].mean() - long_travel['away_score'].mean()
        results['travel_effect'] = results['long_travel_hca'] - results['short_travel_hca']

    # Referee component (if ref data available)
    if ref_data is not None:
        # This would require detailed play-by-play analysis
        pass

    return results

Altitude Effects

Denver's mile-high altitude creates a unique home court advantage, and Utah's arena sits at a meaningful elevation as well:

def analyze_altitude_effect(games_df, high_altitude_teams=['DEN']):
    """
    Analyze the effect of altitude on game outcomes.

    Parameters:
    -----------
    games_df : DataFrame
        Game data
    high_altitude_teams : list
        Teams playing at high altitude

    Returns:
    --------
    dict : Altitude effect analysis
    """
    # Denver home games
    altitude_home = games_df[games_df['home_team'].isin(high_altitude_teams)]
    altitude_margin = altitude_home['home_score'].mean() - altitude_home['away_score'].mean()

    # Other teams' home games
    other_home = games_df[~games_df['home_team'].isin(high_altitude_teams)]
    other_margin = other_home['home_score'].mean() - other_home['away_score'].mean()

    # Effect of altitude beyond normal HCA
    altitude_effect = altitude_margin - other_margin

    # Quarter-by-quarter analysis (fatigue should increase over time)
    # This would require quarter-by-quarter scoring data

    return {
        'altitude_teams_hca': altitude_margin,
        'other_teams_hca': other_margin,
        'altitude_premium': altitude_effect
    }

25.9 Rest and Travel Effects

Quantifying Rest Impact

Rest significantly affects NBA performance. Teams on back-to-backs or short rest face documented disadvantages:

def analyze_rest_effects(games_df):
    """
    Analyze the effect of rest on game outcomes.

    Parameters:
    -----------
    games_df : DataFrame
        Games with rest_days columns for home and away teams

    Returns:
    --------
    dict : Rest effect analysis
    """
    df = games_df.copy()
    df['margin'] = df['home_score'] - df['away_score']
    df['home_win'] = df['margin'] > 0

    # Rest categories
    df['home_rest_cat'] = pd.cut(df['home_rest'],
                                  bins=[-1, 0, 1, 2, 100],
                                  labels=['B2B', '1_day', '2_days', '3+_days'])
    df['away_rest_cat'] = pd.cut(df['away_rest'],
                                  bins=[-1, 0, 1, 2, 100],
                                  labels=['B2B', '1_day', '2_days', '3+_days'])

    # Rest differential analysis
    rest_analysis = df.groupby(['home_rest_cat', 'away_rest_cat']).agg({
        'margin': ['mean', 'std', 'count'],
        'home_win': 'mean'
    })

    # Back-to-back specific analysis
    home_b2b = df[df['home_rest'] == 0]
    away_b2b = df[df['away_rest'] == 0]
    neither_b2b = df[(df['home_rest'] > 0) & (df['away_rest'] > 0)]

    return {
        'rest_matrix': rest_analysis,
        'home_b2b_margin': home_b2b['margin'].mean() if len(home_b2b) > 0 else None,
        'away_b2b_margin': away_b2b['margin'].mean() if len(away_b2b) > 0 else None,
        'neither_b2b_margin': neither_b2b['margin'].mean() if len(neither_b2b) > 0 else None,
        'home_b2b_penalty': (neither_b2b['margin'].mean() - home_b2b['margin'].mean())
                           if len(home_b2b) > 0 and len(neither_b2b) > 0 else None,
        'away_b2b_bonus': (away_b2b['margin'].mean() - neither_b2b['margin'].mean())
                         if len(away_b2b) > 0 and len(neither_b2b) > 0 else None
    }

Travel Fatigue

Travel distance and direction affect performance:

import math

def calculate_travel_distance(lat1, lon1, lat2, lon2):
    """
    Calculate great circle distance between two points.
    """
    R = 3959  # Earth's radius in miles

    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])

    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))

    return R * c

def analyze_travel_effects(games_df, arena_locations):
    """
    Analyze the effect of travel on game outcomes.

    Parameters:
    -----------
    games_df : DataFrame
        Game data
    arena_locations : dict
        Dictionary mapping teams to (latitude, longitude)

    Returns:
    --------
    dict : Travel effect analysis
    """
    df = games_df.copy()

    # Calculate travel distance for away team
    def get_travel(row):
        away_loc = arena_locations.get(row['away_team'])
        home_loc = arena_locations.get(row['home_team'])
        if away_loc and home_loc:
            return calculate_travel_distance(
                away_loc[0], away_loc[1],
                home_loc[0], home_loc[1]
            )
        return None

    df['travel_distance'] = df.apply(get_travel, axis=1)
    df['margin'] = df['home_score'] - df['away_score']

    # Categorize travel
    df['travel_cat'] = pd.cut(df['travel_distance'],
                              bins=[0, 500, 1000, 1500, 2500, 5000],
                              labels=['Local', 'Short', 'Medium', 'Long', 'Cross-country'])

    travel_effects = df.groupby('travel_cat').agg({
        'margin': ['mean', 'std', 'count']
    })

    # Time zone analysis
    def get_timezone_diff(row):
        # Simplified: would need actual timezone data
        away_lon = arena_locations.get(row['away_team'], (0, 0))[1]
        home_lon = arena_locations.get(row['home_team'], (0, 0))[1]
        return int(round((home_lon - away_lon) / 15))  # Approximate time zones

    df['timezone_diff'] = df.apply(get_timezone_diff, axis=1)

    # Direction of travel for the away team (positive diff = home arena is east)
    df['direction'] = np.where(df['timezone_diff'] > 0, 'Eastward',
                               np.where(df['timezone_diff'] < 0, 'Westward', 'Same'))

    direction_effects = df.groupby('direction').agg({
        'margin': ['mean', 'count']
    })

    return {
        'travel_effects': travel_effects,
        'direction_effects': direction_effects
    }

Scheduling Patterns

def analyze_schedule_patterns(games_df):
    """
    Analyze schedule-related patterns in game outcomes.
    """
    df = games_df.copy()
    df['margin'] = df['home_score'] - df['away_score']
    df['home_win'] = df['margin'] > 0

    results = {}

    # Consecutive games on the road for the visiting team
    if 'away_road_game_streak' in df.columns:
        road_fatigue = df.groupby('away_road_game_streak')['margin'].mean()
        results['road_fatigue'] = road_fatigue

    # Second game of road back-to-back
    if 'away_second_of_b2b' in df.columns:
        second_b2b = df[df['away_second_of_b2b'] == True]
        results['second_b2b_margin'] = second_b2b['margin'].mean()
        results['second_b2b_home_win_pct'] = second_b2b['home_win'].mean()

    # Segment of season
    if 'game_number' in df.columns:
        df['season_segment'] = pd.cut(df['game_number'],
                                       bins=[0, 20, 41, 62, 82],
                                       labels=['Early', 'Pre-ASB', 'Post-ASB', 'Late'])
        segment_effects = df.groupby('season_segment').agg({
            'margin': 'mean',
            'home_win': 'mean'
        })
        results['segment_effects'] = segment_effects

    return results

25.10 Ensemble Methods for Prediction

Why Ensembles Work

Ensemble methods combine multiple models to produce better predictions than any single model. The key insight is that different models make different errors, and averaging reduces error when those errors are uncorrelated.

For game prediction, ensembles can combine:

  - Elo ratings
  - Regression-based models
  - Machine learning models
  - Market-based predictions
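
The error-averaging intuition is easy to verify with a toy simulation: two unbiased predictors with independent errors, combined by a simple average (all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50000
true_margin = rng.normal(0, 12, size=n)      # Simulated true game margins

# Two unbiased models whose 6-point errors are independent
model_a = true_margin + rng.normal(0, 6, size=n)
model_b = true_margin + rng.normal(0, 6, size=n)
ensemble = (model_a + model_b) / 2

def rmse(pred):
    return np.sqrt(np.mean((pred - true_margin) ** 2))

# Averaging two independent error streams cuts RMSE by ~1/sqrt(2)
print(round(rmse(model_a), 2), round(rmse(ensemble), 2))
```

In practice model errors are partially correlated, so the realized gain is smaller than the 1/sqrt(2) ideal.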

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
import numpy as np

class GamePredictionEnsemble:
    """
    Ensemble model for NBA game prediction.
    """

    def __init__(self):
        self.models = {
            'ridge': Ridge(alpha=10),
            'rf': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
            'gbm': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
        }
        self.weights = None
        self.elo_system = MarginAwareElo()

    def fit(self, X, y, games_for_elo=None):
        """
        Fit all models in the ensemble.

        Parameters:
        -----------
        X : DataFrame
            Feature matrix
        y : array-like
            Target (point differential)
        games_for_elo : DataFrame
            Historical games for Elo initialization
        """
        # Fit Elo system on historical data
        if games_for_elo is not None:
            self.elo_system.process_season(games_for_elo)

        # Fit other models
        for name, model in self.models.items():
            model.fit(X, y)

        # Calculate optimal weights using cross-validation
        self._optimize_weights(X, y)

    def _optimize_weights(self, X, y, n_folds=5):
        """
        Optimize ensemble weights using cross-validation.
        """
        from sklearn.model_selection import KFold

        kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
        y = np.asarray(y)  # Positional indexing below requires an array, not a Series

        # Collect out-of-fold predictions
        oof_preds = {name: np.zeros(len(y)) for name in self.models}

        for train_idx, val_idx in kf.split(X):
            X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_train = y[train_idx]

            for name, model_template in self.models.items():
                # Clone and fit
                model = type(model_template)(**model_template.get_params())
                model.fit(X_train, y_train)
                oof_preds[name][val_idx] = model.predict(X_val)

        # Optimize weights to minimize RMSE
        from scipy.optimize import minimize

        def objective(weights):
            weights = weights / weights.sum()  # Normalize
            ensemble_pred = sum(w * oof_preds[name] for name, w in zip(self.models.keys(), weights))
            return np.sqrt(np.mean((ensemble_pred - y) ** 2))

        n_models = len(self.models)
        result = minimize(
            objective,
            x0=np.ones(n_models) / n_models,
            method='SLSQP',
            bounds=[(0, 1)] * n_models,
            constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1}
        )

        self.weights = dict(zip(self.models.keys(), result.x / result.x.sum()))

    def predict(self, X, home_teams=None, away_teams=None):
        """
        Generate ensemble predictions.

        Parameters:
        -----------
        X : DataFrame
            Feature matrix
        home_teams : list, optional
            Home teams for Elo predictions
        away_teams : list, optional
            Away teams for Elo predictions

        Returns:
        --------
        array : Predicted point differentials
        """
        predictions = {}

        # Model predictions
        for name, model in self.models.items():
            predictions[name] = model.predict(X)

        # Elo predictions
        if home_teams is not None and away_teams is not None:
            elo_preds = []
            for home, away in zip(home_teams, away_teams):
                pred = self.elo_system.predict_game(home, away)
                elo_preds.append(pred['expected_spread'])
            predictions['elo'] = np.array(elo_preds)

            # Add Elo to weights if not present
            if 'elo' not in self.weights:
                # Simple reweighting
                elo_weight = 0.2
                for key in self.weights:
                    self.weights[key] *= (1 - elo_weight)
                self.weights['elo'] = elo_weight

        # Weighted average
        ensemble_pred = sum(
            self.weights.get(name, 0) * preds
            for name, preds in predictions.items()
        )

        return ensemble_pred

    def predict_with_uncertainty(self, X, home_teams=None, away_teams=None):
        """
        Generate predictions with uncertainty estimates.
        """
        predictions = {}

        for name, model in self.models.items():
            predictions[name] = model.predict(X)

        # Ensemble prediction
        ensemble_pred = self.predict(X, home_teams, away_teams)

        # Uncertainty from model disagreement
        pred_matrix = np.column_stack(list(predictions.values()))
        uncertainty = pred_matrix.std(axis=1)

        return {
            'prediction': ensemble_pred,
            'uncertainty': uncertainty,
            'model_predictions': predictions
        }

Stacking Ensemble

Stacking uses a meta-model to learn optimal combinations:

from sklearn.model_selection import cross_val_predict

class StackingEnsemble:
    """
    Stacking ensemble for game prediction.
    """

    def __init__(self):
        self.base_models = {
            'ridge': Ridge(alpha=10),
            'rf': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
            'gbm': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
        }
        self.meta_model = Ridge(alpha=1)

    def fit(self, X, y):
        """
        Fit the stacking ensemble.
        """
        # Generate base model predictions using cross-validation
        base_predictions = np.column_stack([
            cross_val_predict(model, X, y, cv=5)
            for model in self.base_models.values()
        ])

        # Fit meta-model on base predictions
        self.meta_model.fit(base_predictions, y)

        # Refit base models on full data
        for model in self.base_models.values():
            model.fit(X, y)

    def predict(self, X):
        """
        Generate stacked predictions.
        """
        # Get base model predictions
        base_predictions = np.column_stack([
            model.predict(X)
            for model in self.base_models.values()
        ])

        # Meta-model combines them
        return self.meta_model.predict(base_predictions)

25.11 Complete Prediction Pipeline

End-to-End Implementation

from sklearn.preprocessing import StandardScaler

class NBAGamePredictor:
    """
    Complete NBA game prediction pipeline.
    """

    def __init__(self, config=None):
        """
        Initialize the prediction system.

        Parameters:
        -----------
        config : dict, optional
            Configuration parameters
        """
        self.config = config or {}

        # Initialize components
        self.elo_system = MarginAwareElo(
            k_factor=self.config.get('k_factor', 20),
            home_advantage=self.config.get('elo_hca', 100)
        )
        self.ensemble = GamePredictionEnsemble()
        self.scaler = StandardScaler()

        # State
        self.team_stats = None
        self.is_fitted = False

    def prepare_data(self, games_df, team_stats_df):
        """
        Prepare data for training or prediction.
        """
        features = []
        targets = []

        for _, game in games_df.iterrows():
            try:
                home_stats = team_stats_df.loc[game['home_team']]
                away_stats = team_stats_df.loc[game['away_team']]
            except KeyError:
                continue

            # Feature engineering
            feature_row = {
                # Efficiency
                'net_rtg_diff': (home_stats['off_rtg'] - home_stats['def_rtg']) -
                               (away_stats['off_rtg'] - away_stats['def_rtg']),
                'off_rtg_diff': home_stats['off_rtg'] - away_stats['off_rtg'],
                'def_rtg_diff': away_stats['def_rtg'] - home_stats['def_rtg'],

                # Pace
                'pace_diff': home_stats['pace'] - away_stats['pace'],
                'avg_pace': (home_stats['pace'] + away_stats['pace']) / 2,

                # Four Factors
                'efg_diff': home_stats.get('efg', 0.5) - away_stats.get('efg', 0.5),
                'tov_diff': away_stats.get('tov_rate', 0.12) - home_stats.get('tov_rate', 0.12),
                'orb_diff': home_stats.get('orb_rate', 0.25) - away_stats.get('orb_rate', 0.25),
                'ft_rate_diff': home_stats.get('ft_rate', 0.2) - away_stats.get('ft_rate', 0.2),

                # Rest and travel
                'rest_diff': game.get('home_rest', 2) - game.get('away_rest', 2),
                'away_travel': game.get('away_travel_distance', 0) / 1000,

                # Recent form
                'home_last10_margin': home_stats.get('last_10_margin', 0),
                'away_last10_margin': away_stats.get('last_10_margin', 0),

                # Elo ratings
                'home_elo': self.elo_system.get_rating(game['home_team']),
                'away_elo': self.elo_system.get_rating(game['away_team']),
                'elo_diff': self.elo_system.get_rating(game['home_team']) -
                           self.elo_system.get_rating(game['away_team'])
            }
            features.append(feature_row)

            if 'home_score' in game and 'away_score' in game:
                targets.append(game['home_score'] - game['away_score'])

        X = pd.DataFrame(features)
        y = np.array(targets) if targets else None

        return X, y

    def fit(self, historical_games_df, team_stats_df):
        """
        Fit the prediction model on historical data.

        Parameters:
        -----------
        historical_games_df : DataFrame
            Historical games sorted by date
        team_stats_df : DataFrame
            Team statistics
        """
        # Initialize Elo ratings from early games
        early_games = historical_games_df.iloc[:len(historical_games_df)//4]
        self.elo_system.process_season(early_games)

        # Prepare training data from later games
        training_games = historical_games_df.iloc[len(historical_games_df)//4:]
        X, y = self.prepare_data(training_games, team_stats_df)

        if len(X) == 0:
            raise ValueError("No valid training data")

        # Scale features
        X_scaled = pd.DataFrame(
            self.scaler.fit_transform(X),
            columns=X.columns
        )

        # Fit ensemble
        self.ensemble.fit(X_scaled, y, games_for_elo=early_games)

        # Update Elo with all games
        self.elo_system.process_season(training_games)

        self.team_stats = team_stats_df
        self.is_fitted = True

    def predict(self, upcoming_games_df, team_stats_df=None):
        """
        Predict outcomes for upcoming games.

        Parameters:
        -----------
        upcoming_games_df : DataFrame
            Games to predict
        team_stats_df : DataFrame, optional
            Updated team statistics

        Returns:
        --------
        DataFrame : Predictions
        """
        if not self.is_fitted:
            raise ValueError("Model must be fitted first")

        if team_stats_df is None:
            team_stats_df = self.team_stats

        # Prepare features
        X, _ = self.prepare_data(upcoming_games_df, team_stats_df)
        X_scaled = pd.DataFrame(
            self.scaler.transform(X),
            columns=X.columns
        )

        # Generate predictions
        home_teams = upcoming_games_df['home_team'].tolist()
        away_teams = upcoming_games_df['away_team'].tolist()

        spread_predictions = self.ensemble.predict(
            X_scaled,
            home_teams=home_teams,
            away_teams=away_teams
        )

        # Convert to win probabilities
        win_probs = [spread_to_win_probability(s) for s in spread_predictions]

        # Compile predictions
        predictions = pd.DataFrame({
            'home_team': home_teams,
            'away_team': away_teams,
            'predicted_spread': spread_predictions,
            'home_win_prob': win_probs,
            'away_win_prob': [1 - p for p in win_probs]
        })

        return predictions

    def update(self, game_result):
        """
        Update model with a new game result.

        Parameters:
        -----------
        game_result : dict or Series
            Game result with home_team, away_team, home_score, away_score
        """
        self.elo_system.update_ratings(
            game_result['home_team'],
            game_result['away_team'],
            game_result['home_score'],
            game_result['away_score']
        )

    def evaluate(self, test_games_df, team_stats_df=None):
        """
        Evaluate model performance on test games.
        """
        if team_stats_df is None:
            team_stats_df = self.team_stats

        # Drop games with unknown teams up front so the prediction rows
        # (prepare_data silently skips such games) stay aligned with actuals
        known = (test_games_df['home_team'].isin(team_stats_df.index) &
                 test_games_df['away_team'].isin(team_stats_df.index))
        test_games_df = test_games_df[known].copy()

        predictions = self.predict(test_games_df, team_stats_df)

        test_games_df['actual_margin'] = test_games_df['home_score'] - test_games_df['away_score']
        test_games_df['home_win'] = test_games_df['actual_margin'] > 0

        evaluator = PredictionEvaluator(
            predictions=predictions['home_win_prob'].values,
            outcomes=test_games_df['home_win'].astype(int).values,
            spreads_pred=predictions['predicted_spread'].values,
            spreads_actual=test_games_df['actual_margin'].values
        )

        return evaluator.evaluate_all()
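The pipeline assumes specific input schemas: a team-stats table indexed by team code and a games table with one row per game. A minimal illustration of the two DataFrames that prepare_data expects (column names come from the code above; the team codes and values are made up):

```python
import pandas as pd

# Team statistics, indexed by team code so .loc[game['home_team']] works
team_stats_df = pd.DataFrame({
    'off_rtg':        [118.2, 112.5],
    'def_rtg':        [110.1, 114.3],
    'pace':           [99.8, 101.2],
    'efg':            [0.56, 0.53],
    'tov_rate':       [0.125, 0.138],
    'orb_rate':       [0.27, 0.24],
    'ft_rate':        [0.21, 0.19],
    'last_10_margin': [4.5, -2.0],
}, index=['BOS', 'CHA'])

# One row per game; score columns are present only for historical (training) rows
games_df = pd.DataFrame([{
    'home_team': 'BOS', 'away_team': 'CHA',
    'home_rest': 2, 'away_rest': 1,
    'away_travel_distance': 1300,
    'home_score': 115, 'away_score': 104,
}])

# The first engineered feature, computed by hand for this game:
net_rtg_diff = (118.2 - 110.1) - (112.5 - 114.3)  # ~ 9.9 in favor of the home team
```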

25.12 Practical Considerations

Handling Injuries

Injuries dramatically affect game predictions. A comprehensive system must account for player absences:

def adjust_prediction_for_injuries(base_prediction, injuries, player_impact_df):
    """
    Adjust game prediction for known injuries.

    Parameters:
    -----------
    base_prediction : dict
        Base prediction with spread and win_prob
    injuries : dict
        Dictionary mapping 'home' and 'away' to lists of injured player names
    player_impact_df : DataFrame
        Player impact data (e.g., VORP, BPM, etc.)

    Returns:
    --------
    dict : Adjusted prediction
    """
    adjustment = 0

    for team, injured_players in injuries.items():
        for player in injured_players:
            if player in player_impact_df.index:
                # Estimate impact in points per game
                player_impact = player_impact_df.loc[player, 'pts_added_per_game']

                # Adjust based on which team the player is on
                if team == 'home':
                    adjustment -= player_impact
                else:
                    adjustment += player_impact

    adjusted_spread = base_prediction['predicted_spread'] + adjustment
    adjusted_win_prob = spread_to_win_probability(adjusted_spread)

    return {
        'predicted_spread': adjusted_spread,
        'home_win_prob': adjusted_win_prob,
        'injury_adjustment': adjustment
    }
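A worked example of the adjustment arithmetic, with hypothetical impact numbers and a normal-approximation spread-to-probability conversion (NBA final margins have a standard deviation of roughly 13-14 points; the chapter's own spread_to_win_probability may differ in its details):

```python
from math import erf, sqrt

def spread_to_prob(spread, sd=13.5):
    # Normal approximation: P(home margin > 0) given a predicted spread
    return 0.5 * (1 + erf(spread / (sd * sqrt(2))))

base_spread = 4.0      # home favored by 4 before injury news
home_star_out = 5.2    # hypothetical: home star worth 5.2 pts/game
away_role_out = 2.1    # hypothetical: away rotation player worth 2.1 pts/game

adjustment = -home_star_out + away_role_out  # home absence hurts, away absence helps
adjusted_spread = base_spread + adjustment   # ~ 0.9
adjusted_prob = spread_to_prob(adjusted_spread)
```

With the star out, a four-point home favorite becomes a near coin flip.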

Model Maintenance

Prediction models require ongoing maintenance:

def model_maintenance_schedule(schedule_type='daily'):
    """
    Return the maintenance tasks for a given cadence.
    """
    maintenance_tasks = {
        'daily': [
            'Update Elo ratings with previous day\'s results',
            'Refresh team statistics rolling averages',
            'Check for injury updates',
            'Log model predictions vs actual outcomes'
        ],
        'weekly': [
            'Recalculate team quality features',
            'Update rest/travel features',
            'Review prediction accuracy by team',
            'Identify systematic biases'
        ],
        'monthly': [
            'Retrain regression models on recent data',
            'Reoptimize ensemble weights',
            'Full calibration analysis',
            'Compare to market efficiency'
        ],
        'end_of_season': [
            'Comprehensive performance review',
            'Regress Elo ratings toward mean',
            'Archive model for historical comparison',
            'Update for roster changes'
        ]
    }

    return maintenance_tasks.get(schedule_type, [])

Summary

Game outcome prediction synthesizes team evaluation, situational analysis, and probabilistic modeling into actionable forecasts. Key takeaways from this chapter include:

  1. Baselines matter: Simple baselines like home win percentage and team power ratings capture significant predictive power
  2. Elo systems provide elegant solutions: Self-correcting ratings that update with each game offer interpretable predictions
  3. Feature engineering is crucial: Rest, travel, and matchup features add predictive value beyond raw team quality
  4. Market efficiency is real: Betting markets are highly efficient, making consistent profits extremely difficult
  5. Proper evaluation is essential: Use proper scoring rules (Brier score, log loss) rather than simple accuracy
  6. Ensembles improve performance: Combining multiple models reduces error when individual models make different mistakes
  7. Context matters: Injuries, motivation, and scheduling create prediction opportunities

The complete pipeline presented here provides a foundation for building production-quality prediction systems. However, remember that even the best models face substantial uncertainty in basketball outcomes, and maintaining realistic expectations about predictive accuracy is essential for practical applications.

References

  1. Elo, A. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing.
  2. Silver, N. (2015). The Signal and the Noise: Why So Many Predictions Fail - But Some Don't. Penguin Books.
  3. Winston, W. L. (2012). Mathletics: How Gamblers, Managers, and Sports Enthusiasts Use Mathematics in Baseball, Basketball, and Football. Princeton University Press.
  4. Lopez, M. J., & Matthews, G. J. (2015). Building an NCAA men's basketball predictive model and quantifying its success. Journal of Quantitative Analysis in Sports, 11(1), 5-12.
  5. Manner, H. (2016). Modeling and forecasting the outcomes of NBA basketball games. Journal of Quantitative Analysis in Sports, 12(1), 31-41.
  6. Zimmermann, J. (2016). Prediction Markets in Sports Betting. Journal of Prediction Markets, 10(2), 1-22.