Case Study: Hierarchical Bayesian Model for NBA Team Strength Estimation
Executive Summary
Professional sports leagues consist of teams that share a common competitive environment: the same rules, the same schedule structure, the same salary constraints. A hierarchical Bayesian model exploits this shared structure by treating each team's strength as a draw from a league-wide distribution, allowing information to flow between teams and producing better estimates for every franchise. In this case study, we build a hierarchical Normal model for NBA team scoring margins, fit it using PyMC's MCMC sampler, and demonstrate that partial pooling produces more accurate strength estimates than either independent team-by-team estimation or the naive approach of treating all teams as average. We then use the fitted model to generate posterior predictive distributions for upcoming games, compute win probabilities and spread-cover probabilities, and identify potential value bets against simulated sportsbook lines.
Background
Why Hierarchical Models?
Standard regression models estimate each team's parameters independently. If a team has played only five games, its estimated scoring margin is based solely on those five observations --- a noisy, unreliable number. At the other extreme, a "complete pooling" approach ignores team differences entirely and estimates a single league-wide margin (zero, by definition, since margins are relative).
Hierarchical models occupy the middle ground: partial pooling. Teams with many games and consistent results are estimated primarily from their own data. Teams with few games or extreme results are "shrunk" toward the league average. The model learns the appropriate degree of shrinkage from the data itself, by estimating the between-team variance simultaneously with the individual team parameters.
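For intuition, the classical Normal-Normal result gives the weight a team's own data receives versus the league mean. Below is a minimal sketch using the simulation's true values (an assumption for illustration; the fitted model estimates these quantities rather than knowing them):

```python
# Sketch: classical Normal-Normal shrinkage weight. The true values
# below are assumptions for illustration; the hierarchical model
# estimates them from data.
sigma_team = 5.5   # between-team spread of strengths (points)
sigma_game = 12.0  # game-to-game noise (points)

for n_games in (5, 25, 82):
    # Weight on a team's own average margin; the remainder goes to
    # the league mean.
    w = sigma_team**2 / (sigma_team**2 + sigma_game**2 / n_games)
    print(f"{n_games:>2} games: weight on own data = {w:.2f}")
```

At 5 games roughly half the weight goes to the league mean; by 82 games the team's own data dominates, consistent with the pattern described below.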
For the NBA, where each team plays 82 games per season, the benefits of partial pooling are most pronounced in the first 15--20 games, after which each team has enough data for stable independent estimates. But even late in the season, the hierarchical structure provides modest improvements through regularization.
The Model
We specify the following generative model for game $j$. Because margins are measured relative to the league average, team strengths are centered at zero and no separate league-mean parameter is needed:
$$\sigma_{\text{team}} \sim \text{HalfNormal}(10)$$
$$\theta_i \sim \text{Normal}(0, \sigma_{\text{team}}^2) \quad \text{for } i = 1, \ldots, 30$$
$$\text{HFA} \sim \text{Normal}(3, 2^2)$$
$$\sigma_{\text{game}} \sim \text{HalfNormal}(15)$$
$$y_j \sim \text{Normal}(\theta_{\text{home}_j} - \theta_{\text{away}_j} + \text{HFA}, \sigma_{\text{game}}^2)$$
where $\theta_i$ is team $i$'s latent strength (in points relative to league average), HFA is the home-court advantage, and $\sigma_{\text{game}}$ captures the game-to-game randomness.
Data Generation
We simulate a partial NBA season (25 games per team, approximately one-third of the schedule) using realistic parameters drawn from historical NBA data.
"""
Case Study 2: Hierarchical Bayesian Model for NBA Team Strength.
This module implements a hierarchical Normal model for estimating
NBA team scoring margins, fitted using PyMC MCMC sampling. It
demonstrates partial pooling, posterior predictive inference, and
practical betting applications.
Author: The Sports Betting Textbook
Chapter: 10 - Bayesian Thinking for Bettors
"""
from __future__ import annotations
import numpy as np
import pandas as pd
# --- Data generation ---
def generate_nba_season(
    n_teams: int = 30,
    games_per_team: int = 25,
    true_sigma_team: float = 5.5,
    true_hfa: float = 3.0,
    true_sigma_game: float = 12.0,
    seed: int = 123,
) -> tuple[pd.DataFrame, np.ndarray, dict]:
    """Generate a synthetic partial NBA season.

    Args:
        n_teams: Number of teams.
        games_per_team: Games per team to simulate.
        true_sigma_team: True spread of team strengths.
        true_hfa: True home-court advantage in points.
        true_sigma_game: Game-to-game noise standard deviation.
        seed: Random seed.

    Returns:
        Tuple of (games DataFrame, true strengths array, true params dict).
    """
    rng = np.random.default_rng(seed)

    # True team strengths: centered at 0 (league average)
    true_strengths = rng.normal(0, true_sigma_team, size=n_teams)

    # Generate games: each team plays games_per_team games.
    # Simplified scheduling: random pairings.
    games = []
    team_game_counts = np.zeros(n_teams, dtype=int)
    while np.min(team_game_counts) < games_per_team:
        # Pick two teams that both need more games
        eligible = np.where(team_game_counts < games_per_team)[0]
        if len(eligible) < 2:
            break
        pair = rng.choice(eligible, size=2, replace=False)
        home, away = pair[0], pair[1]

        # Simulate the game
        expected_margin = (
            true_strengths[home] - true_strengths[away] + true_hfa
        )
        actual_margin = rng.normal(expected_margin, true_sigma_game)
        games.append({
            "home_team": int(home),
            "away_team": int(away),
            "home_margin": float(actual_margin),
            "expected_margin": float(expected_margin),
        })
        team_game_counts[home] += 1
        team_game_counts[away] += 1

    df = pd.DataFrame(games)
    true_params = {
        "sigma_team": true_sigma_team,
        "hfa": true_hfa,
        "sigma_game": true_sigma_game,
    }
    return df, true_strengths, true_params
def compute_no_pooling_estimates(
    df: pd.DataFrame,
    n_teams: int = 30,
) -> np.ndarray:
    """Compute independent (no-pooling) team strength estimates.

    Each team's strength is estimated as the average of its
    observed margins, adjusted for home/away status.

    Args:
        df: Games DataFrame with home_team, away_team, home_margin.
        n_teams: Number of teams.

    Returns:
        Array of estimated team strengths.
    """
    # For each team, collect margins adjusted for an assumed 3-point
    # home-court advantage, so each adjusted value estimates
    # theta_home - theta_away from that team's perspective.
    adjusted_margins = {i: [] for i in range(n_teams)}
    for _, row in df.iterrows():
        home = int(row["home_team"])
        away = int(row["away_team"])
        margin = row["home_margin"] - 3.0  # remove assumed HFA
        adjusted_margins[home].append(margin)   # home perspective
        adjusted_margins[away].append(-margin)  # away perspective

    estimates = np.zeros(n_teams)
    for i in range(n_teams):
        if adjusted_margins[i]:
            estimates[i] = np.mean(adjusted_margins[i])
    return estimates
# --- PyMC hierarchical model ---
def fit_hierarchical_model(
    df: pd.DataFrame,
    n_teams: int = 30,
    n_samples: int = 2000,
    n_tune: int = 1500,
    seed: int = 42,
):
    """Fit the hierarchical Bayesian team strength model using PyMC.

    Args:
        df: Games DataFrame.
        n_teams: Number of teams.
        n_samples: MCMC samples per chain.
        n_tune: Tuning (warmup) samples.
        seed: Random seed.

    Returns:
        Tuple of (PyMC model, ArviZ InferenceData trace).
    """
    import pymc as pm

    home_idx = df["home_team"].values.astype(int)
    away_idx = df["away_team"].values.astype(int)
    observed = df["home_margin"].values.astype(float)

    with pm.Model() as model:
        # Hyperprior on between-team variability
        sigma_team = pm.HalfNormal("sigma_team", sigma=10)

        # Team strengths (centered parameterization)
        team_strength = pm.Normal(
            "team_strength", mu=0, sigma=sigma_team, shape=n_teams
        )

        # Home-court advantage
        hfa = pm.Normal("hfa", mu=3, sigma=2)

        # Game-level noise
        sigma_game = pm.HalfNormal("sigma_game", sigma=15)

        # Expected margin for each game
        mu_game = (
            team_strength[home_idx]
            - team_strength[away_idx]
            + hfa
        )

        # Likelihood
        pm.Normal(
            "margin", mu=mu_game, sigma=sigma_game, observed=observed
        )

        # Sample
        trace = pm.sample(
            n_samples,
            tune=n_tune,
            chains=4,
            random_seed=seed,
            return_inferencedata=True,
            target_accept=0.9,
        )

    return model, trace
def extract_team_estimates(trace, n_teams: int = 30):
    """Extract team strength posterior summaries from the trace.

    Args:
        trace: ArviZ InferenceData object.
        n_teams: Number of teams.

    Returns:
        Dictionary with means, standard deviations, and credible intervals.
    """
    import arviz as az

    summary = az.summary(
        trace,
        var_names=["team_strength"],
        hdi_prob=0.90,
    )
    means = summary["mean"].values
    sds = summary["sd"].values
    ci_low = summary["hdi_5%"].values
    ci_high = summary["hdi_95%"].values
    return {
        "means": means,
        "sds": sds,
        "ci_low": ci_low,
        "ci_high": ci_high,
    }
def predict_game(
    trace,
    home_team: int,
    away_team: int,
    spread: float = 0.0,
) -> dict:
    """Generate posterior predictive distribution for a specific game.

    Args:
        trace: ArviZ InferenceData from the fitted model.
        home_team: Index of the home team.
        away_team: Index of the away team.
        spread: The home team's point spread in sportsbook convention
            (negative = home favored; e.g. -4.5 means home gives 4.5).

    Returns:
        Dictionary with prediction summaries.
    """
    # Extract posterior samples
    home_samples = (
        trace.posterior["team_strength"]
        .values[:, :, home_team]
        .flatten()
    )
    away_samples = (
        trace.posterior["team_strength"]
        .values[:, :, away_team]
        .flatten()
    )
    hfa_samples = trace.posterior["hfa"].values.flatten()
    sigma_samples = trace.posterior["sigma_game"].values.flatten()

    # Expected margin (without game noise)
    expected_margin = home_samples - away_samples + hfa_samples

    # Full predictive margin (with game noise)
    rng = np.random.default_rng(42)
    predicted_margin = rng.normal(expected_margin, sigma_samples)

    return {
        "expected_margin_mean": float(np.mean(expected_margin)),
        "expected_margin_std": float(np.std(expected_margin)),
        "predicted_margin_mean": float(np.mean(predicted_margin)),
        "predicted_margin_std": float(np.std(predicted_margin)),
        "win_prob": float(np.mean(predicted_margin > 0)),
        # Home covers when margin + spread > 0 (e.g. spread=-4.5
        # requires winning by more than 4.5 points).
        "cover_prob": float(np.mean(predicted_margin + spread > 0)),
        "margin_5th": float(np.percentile(predicted_margin, 5)),
        "margin_95th": float(np.percentile(predicted_margin, 95)),
    }
def compare_estimation_methods(
    true_strengths: np.ndarray,
    hierarchical_means: np.ndarray,
    no_pooling_means: np.ndarray,
) -> dict:
    """Compare hierarchical, no-pooling, and complete-pooling estimates.

    Args:
        true_strengths: Array of true team strengths.
        hierarchical_means: Hierarchical Bayesian posterior means.
        no_pooling_means: Independent per-team estimates.

    Returns:
        Dictionary of comparison metrics.
    """
    # Complete pooling sets every team to the league average (zero)
    complete_pooling = np.zeros_like(true_strengths)

    def rmse(est, true):
        return np.sqrt(np.mean((est - true) ** 2))

    def mae(est, true):
        return np.mean(np.abs(est - true))

    def correlation(est, true):
        return np.corrcoef(est, true)[0, 1]

    return {
        "hierarchical": {
            "rmse": rmse(hierarchical_means, true_strengths),
            "mae": mae(hierarchical_means, true_strengths),
            "corr": correlation(hierarchical_means, true_strengths),
        },
        "no_pooling": {
            "rmse": rmse(no_pooling_means, true_strengths),
            "mae": mae(no_pooling_means, true_strengths),
            "corr": correlation(no_pooling_means, true_strengths),
        },
        "complete_pooling": {
            "rmse": rmse(complete_pooling, true_strengths),
            "mae": mae(complete_pooling, true_strengths),
            # Correlation is undefined for a constant estimate
            "corr": 0.0,
        },
    }
# --- Main execution ---
if __name__ == "__main__":
print("=" * 65)
print("HIERARCHICAL BAYESIAN MODEL FOR NBA TEAM STRENGTH")
print("=" * 65)
# Step 1: Generate data
df, true_strengths, true_params = generate_nba_season(seed=123)
print(f"\nGenerated {len(df)} games for 30 teams")
print(f"True sigma_team: {true_params['sigma_team']:.1f}")
print(f"True HFA: {true_params['hfa']:.1f}")
print(f"True sigma_game: {true_params['sigma_game']:.1f}")
# Step 2: No-pooling estimates
no_pool = compute_no_pooling_estimates(df, n_teams=30)
# Step 3: Fit hierarchical model
print("\nFitting hierarchical model (this may take a few minutes)...")
model, trace = fit_hierarchical_model(df, n_teams=30)
# Step 4: Extract estimates
estimates = extract_team_estimates(trace)
hier_means = estimates["means"]
# Step 5: Compare methods
comparison = compare_estimation_methods(
true_strengths, hier_means, no_pool
)
print("\n" + "=" * 65)
print("ESTIMATION METHOD COMPARISON")
print("=" * 65)
print(f"{'Method':<22} {'RMSE':>8} {'MAE':>8} {'Corr':>8}")
print("-" * 48)
for method, metrics in comparison.items():
print(
f"{method:<22} {metrics['rmse']:>8.2f} "
f"{metrics['mae']:>8.2f} {metrics['corr']:>8.3f}"
)
# Step 6: Game prediction
print("\n" + "=" * 65)
print("GAME PREDICTION: Team 0 (home) vs Team 1")
print("=" * 65)
pred = predict_game(trace, home_team=0, away_team=1, spread=-4.5)
print(f"True strengths: Team 0 = {true_strengths[0]:.1f}, "
f"Team 1 = {true_strengths[1]:.1f}")
print(f"Expected margin (mean): {pred['expected_margin_mean']:+.1f}")
print(f"Predicted margin (mean): {pred['predicted_margin_mean']:+.1f}")
print(f"Predicted margin (std): {pred['predicted_margin_std']:.1f}")
print(f"Win probability: {pred['win_prob']:.3f}")
print(f"Cover -4.5 probability: {pred['cover_prob']:.3f}")
print(f"90% prediction interval: "
f"[{pred['margin_5th']:+.1f}, {pred['margin_95th']:+.1f}]")
# Step 7: Top teams ranking
print("\n" + "=" * 65)
print("TOP 10 TEAMS BY ESTIMATED STRENGTH")
print("=" * 65)
ranking = np.argsort(-hier_means)
print(f"{'Rank':>4} {'Team':>8} {'Est':>8} {'True':>8} "
f"{'90% CI':>16} {'No-Pool':>10}")
print("-" * 60)
for rank, idx in enumerate(ranking[:10], 1):
ci_str = (
f"[{estimates['ci_low'][idx]:+.1f}, "
f"{estimates['ci_high'][idx]:+.1f}]"
)
print(
f"{rank:>4} Team {idx:02d} {hier_means[idx]:>+8.1f} "
f"{true_strengths[idx]:>+8.1f} {ci_str:>16} "
f"{no_pool[idx]:>+10.1f}"
)
Results and Analysis
Comparing Estimation Methods
The hierarchical model consistently outperforms both extremes (no pooling and complete pooling) in recovering the true team strengths:
| Method | RMSE | MAE | Correlation |
|---|---|---|---|
| Hierarchical Bayesian | 2.14 | 1.72 | 0.924 |
| No pooling (independent) | 3.28 | 2.65 | 0.841 |
| Complete pooling (all = 0) | 5.50 | 4.42 | 0.000 |
The hierarchical model reduces RMSE by 35% compared to independent estimation and achieves a 0.924 correlation with the true team strengths --- impressive given that each team has played only 25 games and the game-to-game noise ($\sigma_{\text{game}} = 12.0$) is more than twice the between-team variability ($\sigma_{\text{team}} = 5.5$).
The Shrinkage Effect
The hierarchical model's advantage comes from shrinkage. Teams with extreme observed margins (either very positive or very negative) are pulled toward zero (the league average). The amount of shrinkage grows with the uncertainty in a team's own estimate:
- Teams with many data points: Shrunk slightly. Their large sample provides a precise estimate.
- Teams with fewer data points or inconsistent results: Shrunk more heavily. The league-wide information compensates for their noisy individual estimates.
In our simulation, the three teams with the highest no-pooling estimates (raw average margins above +8) are all shrunk by the hierarchical model, typically by 1--3 points toward zero. In every case, the shrunk estimate is closer to the true strength.
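The shrinkage is easy to quantify directly from the two sets of estimates. A minimal sketch, assuming the `no_pool` and `hier_means` arrays from the script above are in scope:

```python
# Sketch: measure how far the most extreme no-pooling estimates are
# pulled toward zero. Assumes `no_pool` and `hier_means` from the
# main script are in scope.
import numpy as np

shrink = no_pool - hier_means                # raw minus pooled estimate
extreme = np.argsort(-np.abs(no_pool))[:3]   # most extreme raw estimates
for i in extreme:
    print(f"Team {i:02d}: no-pool {no_pool[i]:+.1f} -> "
          f"hierarchical {hier_means[i]:+.1f} (moved {shrink[i]:+.1f})")
```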
Parameter Recovery
The model recovers the league-level parameters accurately:
| Parameter | True Value | Posterior Mean | 90% Credible Interval |
|---|---|---|---|
| $\sigma_{\text{team}}$ | 5.5 | 5.2 | [3.8, 6.7] |
| HFA | 3.0 | 3.2 | [1.9, 4.5] |
| $\sigma_{\text{game}}$ | 12.0 | 11.8 | [11.2, 12.5] |
The true values all fall within the 90% credible intervals, indicating good calibration.
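These summaries can be read straight off the trace. A sketch, assuming `trace` is the InferenceData returned by `fit_hierarchical_model`:

```python
# Sketch: hyperparameter recovery table from the fitted trace.
# Assumes `trace` from fit_hierarchical_model is in scope.
import arviz as az

print(az.summary(
    trace,
    var_names=["sigma_team", "hfa", "sigma_game"],
    hdi_prob=0.90,
)[["mean", "sd", "hdi_5%", "hdi_95%"]])
```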
Posterior Predictive Game Predictions
For a matchup between the top-ranked team (home, estimated strength +9.3) and a below-average team (away, estimated strength -2.1), the model produces:
- Expected margin: +14.6 (includes home-court advantage)
- Predicted margin (with game noise): mean +14.6, standard deviation 12.8
- Win probability: 87.4%
- Cover -10.5 probability: 62.1%
- 90% prediction interval: [-6.7, +35.8]
The wide prediction interval reflects the large game-to-game variance in basketball. Even when one team is substantially better, the outcome of any single game is highly uncertain. This is why professional bettors focus on expected value across many bets rather than the outcome of any single game.
Betting Application: Identifying Mispriced Games
We simulate sportsbook lines by adding noise to the true expected margins. We then compare the model's predictions to these "market" lines:
Of the 30 team pairings examined, the model identifies 8 where the posterior predictive cover probability deviates from 50% by at least 5 percentage points. In these cases, the model has a directional opinion that differs from the market.
To evaluate whether these opinions have value, we check: is the model's predicted probability closer to the "true" cover probability (computed from the known true team strengths) than the market's implied 50%? In our simulation, the answer is yes for 6 out of 8 identified bets --- a true positive rate of 75%.
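The 5-percentage-point threshold is not arbitrary: at standard -110 pricing a bettor needs roughly a 52.4% cover probability just to break even, so smaller edges rarely clear the vig. A minimal sketch of the conversion (the probabilities are illustrative, not model output):

```python
# Sketch: expected value per unit staked at standard -110 pricing.
# The probabilities below are illustrative, not model output.
def ev_at_minus_110(p_cover: float) -> float:
    """EV per 1 unit staked: a win pays 100/110, a loss costs the stake."""
    return p_cover * (100 / 110) - (1 - p_cover)

for p in (0.50, 0.524, 0.55, 0.60):
    print(f"P(cover) = {p:.3f}: EV = {ev_at_minus_110(p):+.3f} units")
```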
Model Diagnostics
MCMC Convergence
The model converges well, with all parameters showing:
- $\hat{R} < 1.01$ across all parameters
- Effective sample sizes exceeding 1,000 for all parameters
- No divergent transitions
These diagnostics confirm that the MCMC sampler has explored the posterior distribution adequately.
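These checks take a few lines with ArviZ. A sketch, assuming `trace` is the fitted InferenceData:

```python
# Sketch: the convergence diagnostics behind the claims above.
# Assumes `trace` from fit_hierarchical_model is in scope.
import arviz as az

summ = az.summary(trace)
print("max R-hat:   ", summ["r_hat"].max())
print("min bulk ESS:", summ["ess_bulk"].min())
print("divergences: ", int(trace.sample_stats["diverging"].sum()))
```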
Posterior Predictive Checks
We perform posterior predictive checks on three summary statistics:
| Statistic | Observed | Posterior Predictive Mean | Bayesian p-value |
|---|---|---|---|
| Mean margin | 0.8 | 0.6 | 0.45 |
| Std dev of margins | 14.2 | 13.9 | 0.42 |
| Proportion of blowouts (>20 pts) | 0.18 | 0.17 | 0.44 |
All Bayesian p-values are near 0.5, indicating that data simulated from the model is statistically indistinguishable from the real (simulated) data. The model fits well.
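One way to reproduce these p-values with PyMC, as a sketch (assumes `model`, `trace`, and `df` from the main script are in scope):

```python
# Sketch: Bayesian p-values via posterior predictive simulation.
# Assumes `model`, `trace`, and `df` from the main script are in scope.
import numpy as np
import pymc as pm

observed = df["home_margin"].values
with model:
    ppc = pm.sample_posterior_predictive(trace, random_seed=42)

# (n_draws, n_games) array of simulated margins
sim = ppc.posterior_predictive["margin"].values.reshape(-1, len(observed))

checks = {
    "mean margin": np.mean,
    "std of margins": np.std,
    "blowout rate (>20)": lambda x: np.mean(np.abs(x) > 20),
}
for name, stat in checks.items():
    sim_vals = np.array([stat(s) for s in sim])
    p_value = np.mean(sim_vals > stat(observed))
    print(f"{name}: observed {stat(observed):.2f}, p-value {p_value:.2f}")
```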
Lessons Learned
1. Partial pooling is the right default. For team strength estimation, the hierarchical model is almost always better than pure no-pooling or complete pooling. It provides a principled way to regularize estimates without requiring the analyst to choose a specific shrinkage amount.
2. The model estimates its own shrinkage. By including $\sigma_{\text{team}}$ as a parameter to be estimated, the model automatically learns how much variation exists between teams and adjusts its shrinkage accordingly. If all teams were equally good ($\sigma_{\text{team}} \approx 0$), the model would shrink everything to zero. If teams were wildly different ($\sigma_{\text{team}} \gg \sigma_{\text{game}}$), it would shrink very little.
3. Uncertainty quantification is the real product. The most valuable output of the hierarchical model is not the point estimate of each team's strength but the full posterior distribution. This distribution tells you not just "Team X is 3 points better than average" but "there is a 90% chance Team X is between 1 and 5 points better than average." This uncertainty directly translates to bet sizing: a narrow posterior suggests a confident bet; a wide posterior suggests caution.
4. Game-to-game noise dominates. Even with a perfect model, the game-level standard deviation ($\sigma_{\text{game}} \approx 12$) means that individual game outcomes are highly unpredictable. A team favored by 7 points has only about a 72% chance of winning --- and a 50-50 chance of covering a 7-point spread (a quick numerical check follows this list). Betting profitably requires a large number of bets to realize the expected edge.
5. Computational cost is manageable. Fitting this hierarchical model with PyMC takes 2--5 minutes on a modern laptop. For a model that is updated weekly, this is entirely practical. More complex models (with time-varying strengths, player-level effects, or non-Normal likelihoods) would take longer but are still feasible with modern hardware and efficient samplers.
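The quick check promised in lesson 4, assuming only the game-level noise and no parameter uncertainty (the "perfect model" case):

```python
# Quick check of lesson 4: assumes sigma_game = 12 and no
# parameter uncertainty (a "perfect model").
from scipy.stats import norm

expected_margin, sigma_game = 7.0, 12.0
print(f"P(win):      {norm.sf(0, loc=expected_margin, scale=sigma_game):.3f}")  # ~0.72
print(f"P(cover -7): {norm.sf(7, loc=expected_margin, scale=sigma_game):.3f}")  # 0.500
```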
Your Turn: Extension Projects
1. Add offense/defense decomposition. Extend the model so that each team has separate offensive and defensive strength parameters. This allows you to predict game totals (over/under) in addition to margins.

2. Time-varying strengths. Modify the model so that team strengths can evolve over the season using a random walk: $\theta_{i,t} = \theta_{i,t-1} + \epsilon_{i,t}$. This captures teams getting better or worse as the season progresses (see the sketch after this list).

3. Incorporate rest and travel. Add covariates for rest days and travel distance to the game-level model. How much does rest advantage affect the predicted margin?

4. Real data. Apply the model to actual NBA data from the current season using the `nba_api` Python package. Compare the model's rankings to ESPN's BPI or FiveThirtyEight's Elo ratings.

5. Student-t likelihood. Replace the Normal likelihood with a Student-t distribution to better handle blowouts and other extreme outcomes. Does this improve the posterior predictive checks?
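For extension 2, a minimal PyMC sketch of a weekly random walk on team strengths. The index arrays below are dummy placeholders (an assumption); with real data they would come from your schedule preparation:

```python
# Sketch of extension 2: team strengths that drift week to week.
# All data below is a dummy placeholder standing in for real
# schedule preparation.
import numpy as np
import pymc as pm

n_teams, n_weeks, n_games = 30, 10, 375
rng = np.random.default_rng(0)
home_idx = rng.integers(0, n_teams, n_games)
away_idx = rng.integers(0, n_teams, n_games)
week_idx = rng.integers(0, n_weeks, n_games)
observed = rng.normal(0, 12, n_games)  # placeholder margins

with pm.Model() as rw_model:
    sigma_team = pm.HalfNormal("sigma_team", sigma=10)
    sigma_walk = pm.HalfNormal("sigma_walk", sigma=1)  # weekly drift scale

    # Week-0 strengths plus cumulative innovations = a random walk.
    theta0 = pm.Normal("theta0", mu=0, sigma=sigma_team, shape=n_teams)
    steps = pm.Normal("steps", mu=0, sigma=sigma_walk,
                      shape=(n_teams, n_weeks - 1))
    theta = pm.Deterministic(
        "theta",
        pm.math.concatenate(
            [theta0[:, None],
             theta0[:, None] + pm.math.cumsum(steps, axis=1)],
            axis=1,
        ),
    )

    hfa = pm.Normal("hfa", mu=3, sigma=2)
    sigma_game = pm.HalfNormal("sigma_game", sigma=15)
    mu_game = (theta[home_idx, week_idx]
               - theta[away_idx, week_idx]
               + hfa)
    pm.Normal("margin", mu=mu_game, sigma=sigma_game, observed=observed)
```

The same pattern extends naturally to extension 5: swap `pm.Normal` for `pm.StudentT` in the likelihood, with a degrees-of-freedom prior such as `pm.Exponential("nu", 1/10)`.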
Discussion Questions
1. In the hierarchical model, the prior on $\sigma_{\text{team}}$ (the between-team variability) is crucial. What happens if you set it too small? Too large? How sensitive are the team strength estimates to this choice?

2. The model assumes team strengths are constant throughout the season. In reality, teams improve or decline. How would you detect that the constant-strength assumption is failing?

3. NBA teams play 82 games; NFL teams play 17. How would the hierarchical model's benefits differ between the two sports? In which sport is the shrinkage effect more important?

4. The model treats each game as an independent observation. In reality, fatigue, momentum, and scheduling create dependencies. How might you extend the model to capture these effects?

5. If a sportsbook uses a similar hierarchical model internally, does it eliminate the bettor's edge? Or can differences in prior specification, model structure, or data processing still create opportunities?