Case Study: EDA Deep Dive --- Finding Betting Angles in NBA Data

Overview

In this case study, we conduct a thorough exploratory data analysis of NBA regular-season data to identify potential betting angles. We focus on three widely discussed factors: rest advantage (extra days off), back-to-back games, and travel distance. Our goal is not to build a predictive model --- that comes in Part II --- but to rigorously explore the data, separate signal from noise, and determine which angles merit further investigation.

The complete code for this case study is available in code/case-study-code.py.


The Data

We work with a dataset of NBA regular-season games spanning five seasons (2018-19 through 2022-23). Each row represents one game from one team's perspective. A full 82-game schedule produces 1,230 games per season, or 2,460 team-game rows, so five seasons yield roughly 12,300 observations (somewhat fewer in practice, since the 2019-20 and 2020-21 seasons were shortened). The dataset includes:

  • Game metadata: date, home/away indicator, opponent, game result
  • Scoring: points scored, points allowed, margin of victory
  • Schedule context: rest days since last game, opponent rest days, back-to-back flag
  • Travel: approximate travel distance from previous game location (in miles)
  • Betting data: point spread, over/under total, actual game total

The loading and preparation code below reads the CSV and computes the derived columns used throughout the analysis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from typing import Optional


def load_and_prepare_data(filepath: str) -> pd.DataFrame:
    """Load NBA game data and compute derived columns.

    Args:
        filepath: Path to the CSV file containing game data.

    Returns:
        A cleaned DataFrame with computed features ready for EDA.
    """
    df = pd.read_csv(filepath, parse_dates=["game_date"])
    df = df.sort_values(["team", "game_date"]).reset_index(drop=True)

    # Compute derived columns
    df["margin"] = df["points_scored"] - df["points_allowed"]
    df["won"] = (df["margin"] > 0).astype(int)
    # spread is from the team's perspective (negative = team favored),
    # so margin + spread > 0 means the team covered
    df["ats_margin"] = df["margin"] + df["spread"]
    df["covered"] = (df["ats_margin"] > 0).astype(int)
    df["push"] = (df["ats_margin"] == 0).astype(int)
    df["game_total"] = df["points_scored"] + df["points_allowed"]
    df["total_diff"] = df["game_total"] - df["ou_total"]
    df["went_over"] = (df["total_diff"] > 0).astype(int)

    # Rest advantage: this team's rest days minus opponent's rest days
    df["rest_advantage"] = df["rest_days"] - df["opp_rest_days"]

    # Flag back-to-backs
    df["is_b2b"] = (df["rest_days"] == 0).astype(int)

    # Travel buckets for cleaner analysis
    df["travel_bucket"] = pd.cut(
        df["travel_miles"],
        bins=[0, 100, 500, 1000, 2000, 5000],
        labels=["<100mi", "100-500mi", "500-1000mi", "1000-2000mi", "2000+mi"],
        include_lowest=True,
    )

    return df

Phase 1: Understanding the Baseline

Before testing any angle, we need to understand the baseline behavior of our data.

def baseline_summary(df: pd.DataFrame) -> None:
    """Print baseline statistics for the dataset.

    This establishes the 'null hypothesis' benchmarks: if no angle
    has predictive power, these are the rates we would observe.
    """
    n_games = len(df)
    print(f"Total team-game observations: {n_games:,}")
    print(f"Seasons: {df['season'].nunique()}")
    print(f"Teams: {df['team'].nunique()}")
    print()

    # Straight-up home win rate
    home_df = df[df["is_home"] == 1]
    home_win_rate = home_df["won"].mean()
    print(f"Home win rate: {home_win_rate:.3f} ({home_win_rate*100:.1f}%)")

    # ATS baseline: should be near 50% in an efficient market
    cover_rate = df["covered"].mean()
    push_rate = df["push"].mean()
    print(f"Overall ATS cover rate: {cover_rate:.3f} ({cover_rate*100:.1f}%)")
    print(f"Overall push rate: {push_rate:.3f} ({push_rate*100:.1f}%)")

    # Over/under baseline
    over_rate = df["went_over"].mean()
    print(f"Overall over rate: {over_rate:.3f} ({over_rate*100:.1f}%)")

    # Scoring distribution
    print(f"\nPoints scored: mean={df['points_scored'].mean():.1f}, "
          f"std={df['points_scored'].std():.1f}")
    print(f"Game total: mean={df['game_total'].mean():.1f}, "
          f"std={df['game_total'].std():.1f}")
    print(f"Spread: mean={df['spread'].mean():.2f}, "
          f"std={df['spread'].std():.1f}")

Expected baseline: the ATS cover rate should be close to 50%, and the over rate should also be near 50%. Any meaningful betting angle must produce a cover or over rate significantly above these baselines, enough to overcome the ~4.5% vig built into standard -110 odds (requiring ~52.4% wins to break even).
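
As a quick sanity check on that 52.4% figure, here is a minimal sketch (the breakeven_win_rate helper is illustrative and not part of the case-study code) that converts American odds into the win rate a bettor needs just to break even:

def breakeven_win_rate(american_odds: int) -> float:
    """Return the win probability needed to break even at the given odds."""
    if american_odds < 0:
        risk, payout = -american_odds, 100   # e.g. -110: risk 110 to win 100
    else:
        risk, payout = 100, american_odds    # e.g. +120: risk 100 to win 120
    return risk / (risk + payout)


print(f"{breakeven_win_rate(-110):.4f}")  # 0.5238 -> ~52.4% at standard -110 odds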


Phase 2: Rest Advantage Analysis

The first angle we investigate is rest advantage. The hypothesis: teams with more rest days than their opponent perform better, and the market undervalues this edge.

def analyze_rest_advantage(df: pd.DataFrame) -> pd.DataFrame:
    """Analyze ATS performance by rest advantage category.

    Rest advantage = team's rest days - opponent's rest days.
    Positive values mean the team had more rest.

    Args:
        df: The prepared game DataFrame.

    Returns:
        A summary DataFrame with ATS records by rest advantage.
    """
    # Group rest advantage into meaningful buckets. The open-ended outer
    # bins catch extreme rest differentials, and each of the five labels
    # maps to exactly one interval.
    bins = [-np.inf, -3, -1, 0, 2, np.inf]
    labels = ["Big disadvantage (-3+)",
              "Small disadvantage (-2 to -1)",
              "Even (0)",
              "Small advantage (+1 to +2)",
              "Big advantage (+3+)"]

    df_rest = df.copy()
    df_rest["rest_bucket"] = pd.cut(
        df_rest["rest_advantage"],
        bins=bins,
        labels=labels,
    )

    summary = df_rest.groupby("rest_bucket", observed=True).agg(
        games=("covered", "count"),
        covers=("covered", "sum"),
        pushes=("push", "sum"),
        avg_margin=("margin", "mean"),
        avg_ats_margin=("ats_margin", "mean"),
    ).reset_index()

    summary["cover_pct"] = (summary["covers"] / summary["games"] * 100).round(1)
    summary["non_covers"] = summary["games"] - summary["covers"] - summary["pushes"]

    # Statistical significance test against 50%
    # Using a binomial test for each group
    p_values = []
    for _, row in summary.iterrows():
        # Two-sided test: is cover rate significantly different from 50%?
        result = stats.binomtest(
            int(row["covers"]),
            int(row["games"] - row["pushes"]),  # exclude pushes
            p=0.5,
            alternative="two-sided",
        )
        p_values.append(round(result.pvalue, 4))

    summary["p_value"] = p_values

    return summary

What the Data Typically Shows

In most multi-season NBA samples, the rest-advantage effect is real but small:

  • Teams with a big rest advantage (+3 or more days) cover at approximately 53-55% in raw data.
  • Teams on even rest cover at almost exactly 50%.
  • Teams with a big rest disadvantage cover at approximately 46-48%.

However, the key question is whether the market already accounts for this. We examine the average ATS margin (not just cover rate) to see if the spread adjusts for rest:

def rest_advantage_market_efficiency(df: pd.DataFrame) -> None:
    """Test whether the market fully prices in rest advantage.

    If the market is efficient with respect to rest, the ATS margin
    should be near zero regardless of rest advantage.
    """
    rest_groups = df.groupby("rest_advantage").agg(
        games=("ats_margin", "count"),
        avg_ats_margin=("ats_margin", "mean"),
        std_ats_margin=("ats_margin", "std"),
    ).reset_index()

    # Filter to rest advantages with reasonable sample sizes
    rest_groups = rest_groups[rest_groups["games"] >= 50]

    # Compute 95% confidence intervals for the ATS margin
    rest_groups["ci_lower"] = (
        rest_groups["avg_ats_margin"]
        - 1.96 * rest_groups["std_ats_margin"] / np.sqrt(rest_groups["games"])
    )
    rest_groups["ci_upper"] = (
        rest_groups["avg_ats_margin"]
        + 1.96 * rest_groups["std_ats_margin"] / np.sqrt(rest_groups["games"])
    )

    print("Rest Advantage vs. ATS Margin (market efficiency test):")
    print(rest_groups.to_string(index=False))
    print()

    # If the confidence interval includes 0, the market has priced in
    # the rest effect for that group
    for _, row in rest_groups.iterrows():
        rest_adv = int(row["rest_advantage"])
        if row["ci_lower"] > 0:
            print(f"  Rest advantage {rest_adv:+d}: Market UNDERVALUES rest "
                  f"(ATS margin {row['avg_ats_margin']:.2f}, "
                  f"CI [{row['ci_lower']:.2f}, {row['ci_upper']:.2f}])")
        elif row["ci_upper"] < 0:
            print(f"  Rest advantage {rest_adv:+d}: Market OVERVALUES rest "
                  f"(ATS margin {row['avg_ats_margin']:.2f}, "
                  f"CI [{row['ci_lower']:.2f}, {row['ci_upper']:.2f}])")
        else:
            print(f"  Rest advantage {rest_adv:+d}: Market efficiently priced "
                  f"(ATS margin {row['avg_ats_margin']:.2f}, "
                  f"CI [{row['ci_lower']:.2f}, {row['ci_upper']:.2f}])")


Phase 3: Back-to-Back Game Analysis

Back-to-backs are the most extreme form of rest disadvantage. An NBA team playing the second game of a back-to-back has zero rest days.

def analyze_back_to_backs(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Comprehensive analysis of back-to-back game performance.

    Examines: ATS record, straight-up record, scoring changes,
    and whether the effect varies by home/away.

    Args:
        df: The prepared game DataFrame.

    Returns:
        Dictionary of summary DataFrames for different B2B analyses.
    """
    results = {}

    # Overall B2B vs. non-B2B comparison
    b2b_compare = df.groupby("is_b2b").agg(
        games=("covered", "count"),
        su_win_pct=("won", "mean"),
        cover_pct=("covered", "mean"),
        avg_margin=("margin", "mean"),
        avg_ats_margin=("ats_margin", "mean"),
        avg_pts_scored=("points_scored", "mean"),
        avg_pts_allowed=("points_allowed", "mean"),
    ).round(3)
    b2b_compare.index = ["Not B2B", "B2B"]
    results["overall"] = b2b_compare

    # B2B effect by home vs. away
    b2b_location = df.groupby(["is_b2b", "is_home"]).agg(
        games=("covered", "count"),
        cover_pct=("covered", "mean"),
        avg_ats_margin=("ats_margin", "mean"),
    ).round(3)
    results["by_location"] = b2b_location

    # B2B effect by season (is it diminishing over time?)
    b2b_trend = df[df["is_b2b"] == 1].groupby("season").agg(
        games=("covered", "count"),
        cover_pct=("covered", "mean"),
        avg_margin=("margin", "mean"),
        avg_ats_margin=("ats_margin", "mean"),
    ).round(3)
    results["trend"] = b2b_trend

    # B2B against non-B2B opponent (the clearest test)
    b2b_vs_rested = df[
        (df["is_b2b"] == 1) & (df["opp_rest_days"] >= 1)
    ].agg({
        "covered": ["count", "mean"],
        "ats_margin": "mean",
        "margin": "mean",
    })
    results["b2b_vs_rested"] = b2b_vs_rested

    return results

Interpreting Back-to-Back Results

The critical distinction is between the straight-up effect and the ATS effect:

  • Straight-up: Teams on back-to-backs win significantly less often (typically 43-46% versus 51-53% for rested teams). This is a real performance effect.
  • ATS: The cover rate for back-to-back teams has historically been close to 50%, because the market adjusts the spread to account for the expected performance decline. In recent seasons (2020-23), the ATS margin for B2B teams has been near zero, suggesting the market fully prices this factor.

This is a crucial EDA insight: a real performance effect does not automatically mean a betting edge. The edge exists only if the market fails to fully account for the effect.
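
A back-of-the-envelope illustration of that point (both numbers below are assumptions chosen for the example, roughly in line with the margin decline reported in the findings later): if the line moves by as much as the team's performance drops, the ATS margin defined earlier is unchanged.

# Illustrative arithmetic only; the figures are assumptions, not estimates.
margin_drop = -5.0    # roughly how much a back-to-back costs a team in margin
line_shift = +5.0     # the book gives the B2B team about the same number of points
# ats_margin = margin + spread, so the two shifts cancel:
print(margin_drop + line_shift)   # 0.0 -> the performance effect leaves no ATS edge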

def b2b_statistical_test(df: pd.DataFrame) -> None:
    """Formal hypothesis test: does being on a B2B affect ATS results?

    H0: B2B teams cover at 50% (market is efficient).
    H1: B2B teams cover at a rate different from 50%.
    """
    b2b_games = df[df["is_b2b"] == 1]
    non_push = b2b_games[b2b_games["push"] == 0]

    n_games = len(non_push)
    n_covers = non_push["covered"].sum()
    cover_rate = n_covers / n_games

    test_result = stats.binomtest(n_covers, n_games, p=0.5, alternative="two-sided")

    print(f"Back-to-Back ATS Test:")
    print(f"  Games (excl. pushes): {n_games}")
    print(f"  Covers: {n_covers}")
    print(f"  Cover rate: {cover_rate:.3f} ({cover_rate*100:.1f}%)")
    print(f"  p-value: {test_result.pvalue:.4f}")
    print(f"  95% CI: [{test_result.proportion_ci().low:.3f}, "
          f"{test_result.proportion_ci().high:.3f}]")

    if test_result.pvalue < 0.05:
        print("  Conclusion: REJECT null hypothesis. B2B has a significant "
              "ATS effect.")
    else:
        print("  Conclusion: FAIL TO REJECT null hypothesis. No significant "
              "ATS effect detected for B2B games.")

Phase 4: Travel Distance Analysis

The third angle we explore is travel distance. Long road trips and cross-country flights are physically taxing, especially on compressed NBA schedules.

def analyze_travel_impact(df: pd.DataFrame) -> pd.DataFrame:
    """Analyze how travel distance affects ATS performance.

    Args:
        df: The prepared game DataFrame with travel_miles and travel_bucket.

    Returns:
        Summary DataFrame with ATS metrics by travel distance bucket.
    """
    # Focus on away games, where travel is most relevant. (A home team can
    # also log travel miles returning from a road trip, but we set those
    # games aside here.)
    away_df = df[df["is_home"] == 0].copy()

    travel_summary = away_df.groupby("travel_bucket", observed=True).agg(
        games=("covered", "count"),
        cover_pct=("covered", "mean"),
        avg_ats_margin=("ats_margin", "mean"),
        avg_margin=("margin", "mean"),
        avg_pts_scored=("points_scored", "mean"),
        avg_pts_allowed=("points_allowed", "mean"),
    ).round(3)

    # Add statistical significance
    p_values = []
    for bucket, group in away_df.groupby("travel_bucket", observed=True):
        non_push = group[group["push"] == 0]
        if len(non_push) > 10:
            result = stats.binomtest(
                int(non_push["covered"].sum()),
                len(non_push),
                p=0.5,
                alternative="two-sided",
            )
            p_values.append(round(result.pvalue, 4))
        else:
            p_values.append(None)

    travel_summary["p_value"] = p_values

    return travel_summary

Travel Interaction Effects

Travel distance alone may not tell the whole story. We test interaction effects: does travel matter more when combined with a back-to-back?

def travel_b2b_interaction(df: pd.DataFrame) -> pd.DataFrame:
    """Test the interaction between travel distance and back-to-back status.

    Hypothesis: long travel + back-to-back is worse than either alone.

    Args:
        df: The prepared game DataFrame.

    Returns:
        A 2x2 summary DataFrame of B2B status vs. long/short travel.
    """
    away_df = df[df["is_home"] == 0].copy()

    # Define "long travel" as > 1000 miles
    away_df["long_travel"] = (away_df["travel_miles"] > 1000).astype(int)

    interaction = away_df.groupby(["is_b2b", "long_travel"]).agg(
        games=("covered", "count"),
        cover_pct=("covered", "mean"),
        avg_ats_margin=("ats_margin", "mean"),
        avg_margin=("margin", "mean"),
    ).round(3)

    interaction.index = interaction.index.set_names(["Back-to-Back", "Long Travel"])

    return interaction

Phase 5: Visualization

Visualization crystallizes findings. Here are the four most informative plots from this analysis:

def create_eda_visualizations(df: pd.DataFrame, output_dir: str = ".") -> None:
    """Generate the four key visualizations for this case study.

    Args:
        df: The prepared game DataFrame.
        output_dir: Directory to save plot files.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: ATS margin distribution by rest advantage
    rest_cats = [-2, -1, 0, 1, 2, 3]
    data_groups = [df[df["rest_advantage"] == r]["ats_margin"] for r in rest_cats]
    axes[0, 0].boxplot(data_groups, labels=[str(r) for r in rest_cats])
    axes[0, 0].axhline(y=0, color="red", linestyle="--", alpha=0.5)
    axes[0, 0].set_xlabel("Rest Advantage (days)")
    axes[0, 0].set_ylabel("ATS Margin")
    axes[0, 0].set_title("ATS Margin by Rest Advantage")

    # Plot 2: B2B cover rate by season
    b2b_season = df[df["is_b2b"] == 1].groupby("season")["covered"].mean()
    non_b2b_season = df[df["is_b2b"] == 0].groupby("season")["covered"].mean()
    x_pos = range(len(b2b_season))
    axes[0, 1].bar(
        [x - 0.15 for x in x_pos], b2b_season.values,
        width=0.3, label="B2B", color="coral",
    )
    axes[0, 1].bar(
        [x + 0.15 for x in x_pos], non_b2b_season.values,
        width=0.3, label="Non-B2B", color="steelblue",
    )
    axes[0, 1].axhline(y=0.5, color="red", linestyle="--", alpha=0.5)
    axes[0, 1].set_xticks(list(x_pos))
    axes[0, 1].set_xticklabels(b2b_season.index, rotation=45)
    axes[0, 1].set_ylabel("Cover Rate")
    axes[0, 1].set_title("B2B vs Non-B2B Cover Rate by Season")
    axes[0, 1].legend()

    # Plot 3: Travel distance vs. ATS margin (scatter with regression)
    away_df = df[df["is_home"] == 0]
    axes[1, 0].scatter(
        away_df["travel_miles"], away_df["ats_margin"],
        alpha=0.05, s=10, color="navy",
    )
    # Add regression line (drop rows where either value is missing so the
    # two series stay aligned)
    valid = away_df[["travel_miles", "ats_margin"]].dropna()
    z = np.polyfit(valid["travel_miles"], valid["ats_margin"], 1)
    p = np.poly1d(z)
    x_line = np.linspace(0, valid["travel_miles"].max(), 100)
    axes[1, 0].plot(x_line, p(x_line), "r-", linewidth=2)
    axes[1, 0].axhline(y=0, color="gray", linestyle="--", alpha=0.5)
    axes[1, 0].set_xlabel("Travel Distance (miles)")
    axes[1, 0].set_ylabel("ATS Margin")
    axes[1, 0].set_title("Travel Distance vs. ATS Margin (Away Games)")

    # Plot 4: Combined scoring distribution with ou_total overlay
    axes[1, 1].hist(
        df["game_total"], bins=40, alpha=0.7, color="steelblue",
        label="Actual Total", density=True,
    )
    axes[1, 1].axvline(
        x=df["ou_total"].mean(), color="red", linestyle="--",
        linewidth=2, label=f'Avg O/U Line ({df["ou_total"].mean():.1f})',
    )
    axes[1, 1].axvline(
        x=df["game_total"].mean(), color="navy", linestyle="--",
        linewidth=2, label=f'Avg Actual ({df["game_total"].mean():.1f})',
    )
    axes[1, 1].set_xlabel("Total Points")
    axes[1, 1].set_ylabel("Density")
    axes[1, 1].set_title("Scoring Distribution vs. O/U Line")
    axes[1, 1].legend()

    plt.tight_layout()
    plt.savefig(f"{output_dir}/nba_eda_visualizations.png", dpi=150)
    plt.close()
    print(f"Visualizations saved to {output_dir}/nba_eda_visualizations.png")

Phase 6: Synthesizing Findings

After completing the analysis, we compile our findings into a structured summary:

def generate_findings_report(df: pd.DataFrame) -> str:
    """Compile all EDA findings into a structured text report.

    Args:
        df: The prepared game DataFrame.

    Returns:
        A formatted string containing the complete findings report.
    """
    report_lines = [
        "=" * 70,
        "NBA EDA FINDINGS REPORT: BETTING ANGLES INVESTIGATION",
        "=" * 70,
        "",
        f"Dataset: {len(df):,} team-game observations",
        f"Seasons: {sorted(df['season'].unique())}",
        f"Baseline ATS cover rate: {df['covered'].mean():.3f}",
        "",
        "--- FINDING 1: REST ADVANTAGE ---",
        "The raw performance effect of rest is real. Teams with 2+ more",
        "rest days than their opponent win straight-up at ~55-57%. However,",
        "the ATS effect is small (typically 51-53% cover rate) and in most",
        "seasons falls within the confidence interval of 50%. The market",
        "adjusts spreads for known rest differentials. Verdict: the rest",
        "angle is largely priced in for standard rest advantages, but",
        "extreme rest differentials (3+ days) may retain a small edge.",
        "",
        "--- FINDING 2: BACK-TO-BACKS ---",
        "Teams on back-to-backs experience a measurable performance",
        "decline: ~3-4 fewer points scored, ~2 points more allowed.",
        "However, the ATS cover rate is not significantly different from",
        "50% across the full 5-season sample. The market has learned to",
        "price in the B2B effect. Notably, the B2B penalty appears to be",
        "decreasing over time as teams improve load management. Verdict:",
        "not a standalone profitable angle, but a useful feature in",
        "multi-factor models.",
        "",
        "--- FINDING 3: TRAVEL DISTANCE ---",
        "Travel distance shows the weakest signal of the three factors.",
        "The regression coefficient for travel on ATS margin is near zero",
        "and not statistically significant. Even the extreme bucket",
        "(2000+ miles) does not produce a cover rate significantly",
        "different from 50%. However, the *interaction* of long travel +",
        "B2B shows a marginally larger effect than either alone. Verdict:",
        "not useful as a standalone angle. May have minor value as an",
        "interaction feature.",
        "",
        "--- KEY METHODOLOGICAL TAKEAWAYS ---",
        "1. Always test ATS results, not just straight-up results.",
        "   A real performance effect is not the same as a betting edge.",
        "2. Compute confidence intervals and p-values. Sample sizes in",
        "   sports are small; 'interesting' differences are often noise.",
        "3. Check for temporal stability. An angle that worked in",
        "   2018-19 may not work in 2022-23 if the market has adapted.",
        "4. Beware multiple comparisons. We tested 3 major angles with",
        "   multiple sub-analyses. Apply appropriate skepticism.",
        "5. Use interaction effects. Single-factor angles are usually",
        "   priced in. Multi-factor combinations may retain edge.",
        "=" * 70,
    ]

    return "\n".join(report_lines)
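
Takeaway 4 in the report warns about multiple comparisons. As one concrete (and purely illustrative) way to act on it, the sketch below implements the Benjamini-Hochberg procedure; the idea is to pool the p_value columns produced by the summaries above and judge significance against the number of tests run rather than test by test. The variable names in the usage comment are illustrative.

import numpy as np


def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg FDR procedure: flag which p-values survive correction.

    Args:
        p_values: Raw p-values from the individual angle tests.
        alpha: Target false discovery rate.

    Returns:
        One boolean per input p-value (True = still significant after correction).
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # indices sorted by p-value
    thresholds = np.arange(1, m + 1) / m * alpha   # BH threshold for each rank
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = int(np.flatnonzero(passed).max()) # largest rank that passes
        reject[order[: cutoff + 1]] = True         # reject everything up to that rank
    return reject.tolist()


# Example usage (variable names illustrative):
#   rest_summary = analyze_rest_advantage(df)
#   travel_summary = analyze_travel_impact(df)
#   all_p = list(rest_summary["p_value"]) + list(travel_summary["p_value"].dropna())
#   still_significant = benjamini_hochberg(all_p)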

Conclusions and Next Steps

This EDA reveals a pattern that experienced bettors know well: the obvious angles are usually priced in. The NBA betting market is efficient enough that simple factors like rest, back-to-backs, and travel distance are reflected in the spread. Any remaining edge is small and requires large sample sizes to detect reliably.
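
To put "large sample sizes" in numbers, here is a minimal sketch (a standard normal-approximation power calculation, not part of the case-study code) of roughly how many non-push bets it takes to distinguish a given cover rate from 50% with a two-sided test at alpha = 0.05 and 80% power:

import numpy as np
from scipy import stats


def games_needed(true_rate: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size to detect a cover rate different from 50%."""
    p0, p1 = 0.5, true_rate
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = stats.norm.ppf(power)
    numerator = z_alpha * np.sqrt(p0 * (1 - p0)) + z_beta * np.sqrt(p1 * (1 - p1))
    return int(np.ceil((numerator / (p1 - p0)) ** 2))


print(games_needed(0.53))   # roughly 2,200 games to confirm a 53% cover rate
print(games_needed(0.55))   # roughly 800 games for a 55% cover rate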

However, the analysis was not fruitless. We identified several leads worth pursuing in Part II:

  1. Interaction effects (B2B + long travel + specific opponent profiles) may hold unexploited value because their sample sizes are too small for the market to price accurately.

  2. Temporal dynamics matter. The B2B penalty is shrinking as load management improves. A model that adapts to these trends will outperform one trained on stale historical averages.

  3. The baseline features we computed (rest advantage, B2B flag, travel distance, schedule context) are useful inputs for the multi-factor models we will build in Part II, even though none of them is profitable in isolation.

  4. The EDA workflow itself --- load, clean, compute baselines, test hypotheses, check significance, visualize, and report --- is a reusable template for any future angle exploration. A driver sketch tying the phases together follows below.
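
A rough sketch of such a driver, assuming the functions defined above are in scope and that the data lives in a file such as nba_games.csv (a placeholder name):

def run_eda(filepath: str = "nba_games.csv") -> None:
    """Run the full EDA workflow end to end (illustrative driver)."""
    df = load_and_prepare_data(filepath)       # load and clean
    baseline_summary(df)                       # compute baselines
    print(analyze_rest_advantage(df))          # test hypotheses, check significance
    rest_advantage_market_efficiency(df)
    for name, table in analyze_back_to_backs(df).items():
        print(f"\n{name}:\n{table}")
    b2b_statistical_test(df)
    print(analyze_travel_impact(df))
    print(travel_b2b_interaction(df))
    create_eda_visualizations(df)              # visualize
    print(generate_findings_report(df))        # report


if __name__ == "__main__":
    run_eda()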

The lesson is clear: EDA does not always find gold. Sometimes its greatest value is telling you where not to dig.