Chapter 2: Data Sources and Collection

Introduction

The foundation of any basketball analytics endeavor rests upon the quality, comprehensiveness, and accessibility of its underlying data. This chapter provides a thorough examination of the primary data sources available to basketball analysts, from official NBA APIs to proprietary tracking systems, public datasets, and web-scraped resources. We explore the technical mechanisms for accessing each source, discuss their relative strengths and limitations, and provide practical Python implementations for data collection and preprocessing.

Understanding data provenance—where data comes from, how it was collected, and what transformations it has undergone—is essential for producing reliable analytical results. A model trained on incomplete play-by-play data will yield systematically biased conclusions. An analysis that conflates tracking data from different providers may introduce subtle inconsistencies that undermine comparative studies. By the end of this chapter, you will possess the knowledge necessary to make informed decisions about data source selection and the technical skills to implement robust data collection pipelines.


2.1 The NBA Data Ecosystem: An Overview

The modern NBA data ecosystem comprises several interconnected layers, each serving different analytical purposes and audiences:

  1. Official League Data: Statistics and information published directly by the NBA through its websites and applications
  2. Broadcast Data: Information derived from game broadcasts, including traditional box scores and enhanced graphics
  3. Tracking Data: Granular positional data captured by optical or sensor-based systems installed in arenas
  4. Derived Analytics: Computed metrics and models built upon the foundational data layers
  5. Third-Party Aggregations: Websites and services that compile, standardize, and redistribute NBA data

Each layer presents distinct access mechanisms, licensing considerations, and technical challenges. We begin with the most accessible programmatic interface: the NBA's own API infrastructure.


2.2 The NBA API: Structure and Access Methods

2.2.1 Understanding the NBA Stats API

The NBA maintains a comprehensive statistics API that powers its official website (stats.nba.com) and mobile applications. While not officially documented for public use, the API has been extensively reverse-engineered by the analytics community, resulting in robust Python libraries that provide structured access to its endpoints.

The API follows a RESTful architecture, returning JSON-formatted responses. Each endpoint corresponds to a specific statistical view—player game logs, team shooting splits, play-by-play sequences, and hundreds of other data products. The base URL structure follows this pattern:

https://stats.nba.com/stats/{endpoint}?{parameters}
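
For illustration, this pattern can be exercised directly with the requests library. The sketch below queries the commonplayerinfo endpoint; the exact set of required query parameters and headers varies by endpoint (the header set shown in Section 2.2.3 is often needed in practice), and the nba_api library discussed next handles these details automatically.

import requests

# Minimal sketch: call the commonplayerinfo endpoint directly
url = "https://stats.nba.com/stats/commonplayerinfo"
params = {"PlayerID": 2544, "LeagueID": "00"}  # LeBron James
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://stats.nba.com/",
}

response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()
payload = response.json()

# Responses typically contain a 'resultSets' list, each entry holding
# parallel 'headers' and 'rowSet' arrays
first = payload["resultSets"][0]
print(first["headers"][:5])
print(first["rowSet"][0][:5])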

Key characteristics of the NBA Stats API include:

  • Temporal Coverage: Data extends back to the 1996-97 season for most endpoints, with box score data available from earlier eras
  • Update Frequency: Live games update in near real-time; historical data is typically finalized within 24-48 hours
  • Rate Limiting: Aggressive request throttling requires careful pacing of programmatic access
  • Header Requirements: Requests must include specific HTTP headers to receive valid responses

2.2.2 The nba_api Python Library

The nba_api library, maintained by Swar Patel and contributors, provides a Pythonic interface to the NBA Stats API. Installation is straightforward:

pip install nba_api

The library organizes endpoints into logical modules. The most commonly used are:

from nba_api.stats.endpoints import (
    playergamelog,      # Individual player game-by-game statistics
    teamgamelog,        # Team-level game logs
    playbyplayv2,       # Detailed play-by-play data
    shotchartdetail,    # Shot location and outcome data
    leaguedashplayerstats,  # League-wide player statistics
    commonplayerinfo,   # Player biographical information
    boxscoretraditionalv2,  # Traditional box score statistics
    boxscoreadvancedv2,     # Advanced box score metrics
)

A fundamental example retrieves a player's game log for a specific season:

from nba_api.stats.endpoints import playergamelog
from nba_api.stats.static import players
import pandas as pd
import time

def get_player_game_log(player_name: str, season: str = "2024-25") -> pd.DataFrame:
    """
    Retrieve game-by-game statistics for a specified player and season.

    Args:
        player_name: Full name of the player (e.g., "LeBron James")
        season: Season identifier in YYYY-YY format (e.g., "2024-25")

    Returns:
        DataFrame containing the player's game log with statistics

    Raises:
        ValueError: If player name is not found in the database
    """
    # Look up player ID from static player list
    player_dict = players.find_players_by_full_name(player_name)

    if not player_dict:
        raise ValueError(f"Player '{player_name}' not found in database")

    player_id = player_dict[0]['id']

    # Fetch game log from API
    game_log = playergamelog.PlayerGameLog(
        player_id=player_id,
        season=season,
        season_type_all_star='Regular Season'
    )

    # Convert to DataFrame
    df = game_log.get_data_frames()[0]

    return df


# Example usage
if __name__ == "__main__":
    lebron_logs = get_player_game_log("LeBron James", "2024-25")
    print(f"Games played: {len(lebron_logs)}")
    print(f"Average points: {lebron_logs['PTS'].mean():.1f}")

2.2.3 Handling Rate Limits and Request Headers

The NBA API implements rate limiting to prevent abuse. Exceeding these limits results in HTTP 429 responses or IP-based blocking. Best practices for responsible API access include:

import time
from functools import wraps

def rate_limited(max_per_second: float = 0.5):
    """
    Decorator that enforces a maximum request rate.

    Args:
        max_per_second: Maximum number of requests per second

    Returns:
        Decorated function with rate limiting applied
    """
    min_interval = 1.0 / max_per_second

    def decorator(func):
        last_called = [0.0]

        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed

            if wait_time > 0:
                time.sleep(wait_time)

            result = func(*args, **kwargs)
            last_called[0] = time.time()

            return result

        return wrapper

    return decorator


@rate_limited(max_per_second=0.5)
def fetch_player_stats(player_id: int, season: str) -> pd.DataFrame:
    """Fetch player stats with rate limiting applied."""
    from nba_api.stats.endpoints import playergamelog

    game_log = playergamelog.PlayerGameLog(
        player_id=player_id,
        season=season
    )

    return game_log.get_data_frames()[0]
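
Even with pacing, individual requests can still fail with timeouts or HTTP 429 responses. A simple retry wrapper with exponential backoff complements the rate limiter; this is a sketch, and the specific exceptions raised depend on the underlying requests call:

import random
import time


def fetch_with_retries(fetch_fn, max_retries: int = 3, base_delay: float = 2.0):
    """
    Call fetch_fn(), retrying with exponential backoff on failure.

    Args:
        fetch_fn: Zero-argument callable that performs the API request
        max_retries: Maximum number of attempts before giving up
        base_delay: Initial backoff delay in seconds

    Returns:
        Whatever fetch_fn() returns
    """
    for attempt in range(max_retries):
        try:
            return fetch_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))


# Example usage with the rate-limited fetcher defined above:
# stats = fetch_with_retries(lambda: fetch_player_stats(2544, "2024-25"))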

Custom headers may be required to mimic browser requests:

CUSTOM_HEADERS = {
    'Host': 'stats.nba.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://stats.nba.com/',
    'x-nba-stats-origin': 'stats',
    'x-nba-stats-token': 'true',
    'Connection': 'keep-alive',
}

2.2.4 Key API Endpoints for Analytics

The following table summarizes the most analytically valuable endpoints:

| Endpoint | Description | Primary Use Case |
| --- | --- | --- |
| playbyplayv2 | Chronological game events | Possession analysis, lineup evaluation |
| shotchartdetail | Shot locations and outcomes | Shooting analysis, shot quality models |
| leaguedashplayerstats | Aggregated player statistics | Player comparison, league-wide analysis |
| boxscoreadvancedv2 | Advanced box score metrics | Game-level performance evaluation |
| teamdashlineups | Lineup combinations and stats | Lineup optimization studies |
| leagueseasonmatchups | Matchup-specific statistics | Defensive assignment analysis |
| playerdashptshots | Tracking-enhanced shooting data | Shot type classification |

2.3 Basketball Reference and Web Scraping

2.3.1 Basketball Reference as a Data Source

Basketball-Reference.com, maintained by Sports Reference LLC, represents the most comprehensive publicly accessible repository of professional basketball statistics. Its data extends from the BAA's inaugural 1946-47 season through the present day, encompassing:

  • Traditional and advanced box scores
  • Season and career statistical summaries
  • Adjusted metrics (e.g., per-100-possessions, league-relative)
  • Historical awards, draft information, and transactions
  • International and collegiate basketball statistics

The site's tabular format and consistent HTML structure make it amenable to programmatic extraction, though users must carefully consider the ethical and legal dimensions of web scraping.

2.3.2 Legal and Ethical Considerations

Before implementing any web scraping solution, analysts must understand the governing legal and ethical frameworks:

Terms of Service Compliance: Basketball-Reference's terms of service explicitly address automated access. While the site permits limited personal scraping, commercial use or high-volume extraction typically requires a licensing agreement with Sports Reference.

robots.txt Compliance: The site's robots.txt file specifies which paths may be accessed by automated crawlers. Ethical scrapers respect these directives:

import urllib.robotparser

def check_robots_txt(url: str, user_agent: str = '*') -> bool:
    """
    Check if a URL is allowed for scraping according to robots.txt.

    Args:
        url: The URL to check
        user_agent: The user agent string to check permissions for

    Returns:
        True if scraping is permitted, False otherwise
    """
    from urllib.parse import urlparse

    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    return rp.can_fetch(user_agent, url)

Rate Limiting: Regardless of technical permissions, responsible scraping requires modest request rates to avoid imposing undue server load. A common guideline suggests no more than one request per 3-5 seconds.

2.3.3 Implementing a Basketball-Reference Scraper

The pandas.read_html() function provides a convenient mechanism for extracting tabular data from HTML pages:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
from io import StringIO
from typing import List, Optional


class BasketballReferenceScraper:
    """
    A responsible scraper for Basketball-Reference.com.

    This class implements rate limiting and error handling for
    extracting statistical data from Basketball-Reference.

    Attributes:
        base_url: The base URL for Basketball-Reference
        delay: Minimum seconds between requests
        last_request_time: Timestamp of the most recent request
    """

    BASE_URL = "https://www.basketball-reference.com"

    def __init__(self, delay: float = 3.0):
        """
        Initialize the scraper with a specified request delay.

        Args:
            delay: Minimum seconds to wait between requests
        """
        self.delay = delay
        self.last_request_time = 0.0
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (educational research)'
        })

    def _wait_for_rate_limit(self) -> None:
        """Enforce rate limiting between requests."""
        elapsed = time.time() - self.last_request_time
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request_time = time.time()

    def get_player_season_stats(
        self,
        player_url_suffix: str
    ) -> pd.DataFrame:
        """
        Retrieve a player's career statistics table.

        Args:
            player_url_suffix: The URL suffix for the player
                              (e.g., "/players/j/jamesle01.html")

        Returns:
            DataFrame containing the player's season-by-season statistics
        """
        self._wait_for_rate_limit()

        url = f"{self.BASE_URL}{player_url_suffix}"
        response = self.session.get(url)
        response.raise_for_status()

        # The "Per Game" table is rendered directly in the page HTML;
        # some other tables are comment-wrapped and require the special
        # handling described in Section 2.3.4
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the per-game statistics table
        tables = pd.read_html(StringIO(str(soup)), match='Per Game')

        if tables:
            return tables[0]

        return pd.DataFrame()

    def get_team_roster(
        self,
        team_abbr: str,
        season: int
    ) -> pd.DataFrame:
        """
        Retrieve a team's roster for a specific season.

        Args:
            team_abbr: Three-letter team abbreviation (e.g., "LAL")
            season: Ending year of the season (e.g., 2024 for 2023-24)

        Returns:
            DataFrame containing roster information
        """
        self._wait_for_rate_limit()

        url = f"{self.BASE_URL}/teams/{team_abbr}/{season}.html"
        response = self.session.get(url)
        response.raise_for_status()

        tables = pd.read_html(StringIO(response.text), match='Roster')

        if tables:
            return tables[0]

        return pd.DataFrame()

2.3.4 Handling Special HTML Structures

Basketball-Reference employs several HTML patterns that complicate automated extraction:

  1. Comment-wrapped tables: Some tables are initially rendered as HTML comments (for performance optimization) and revealed via JavaScript
  2. Multi-level column headers: Tables often use hierarchical headers requiring flattening
  3. Footnote annotations: Statistical values may include superscript markers

The first two patterns can be handled with the helpers below:

def extract_commented_table(html_content: str, table_id: str) -> Optional[str]:
    """
    Extract a table that is wrapped in HTML comments.

    Basketball-Reference wraps some tables in comments like:
    <!-- <div id="all_advanced">... -->

    Args:
        html_content: The full HTML page content
        table_id: The ID of the table to extract

    Returns:
        The extracted table HTML, or None if not found
    """
    import re

    # Pattern to match commented-out div containing our table
    pattern = rf'<!--\s*(<div[^>]*id="{table_id}".*?</div>)\s*-->'

    match = re.search(pattern, html_content, re.DOTALL)

    if match:
        return match.group(1)

    return None


def flatten_multi_level_columns(df: pd.DataFrame) -> pd.DataFrame:
    """
    Flatten a DataFrame with multi-level column headers.

    Args:
        df: DataFrame with hierarchical column index

    Returns:
        DataFrame with single-level column names
    """
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = ['_'.join(col).strip('_') for col in df.columns.values]

    return df
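
These helpers can be combined. The sketch below assumes the page HTML has already been downloaded (for example with the scraper's session) and targets the comment-wrapped div id="all_advanced" shown in the docstring above:

from io import StringIO


def get_advanced_table(html_content: str) -> pd.DataFrame:
    """
    Parse a comment-wrapped table from a downloaded player page.

    Returns an empty DataFrame if the table cannot be located.
    """
    table_html = extract_commented_table(html_content, 'all_advanced')
    if table_html is None:
        return pd.DataFrame()

    tables = pd.read_html(StringIO(table_html))
    if not tables:
        return pd.DataFrame()

    # Flatten any hierarchical headers before returning
    return flatten_multi_level_columns(tables[0])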

2.4 Play-by-Play Data: Structure and Analysis

2.4.1 Understanding Play-by-Play Format

Play-by-play (PBP) data provides a chronological record of game events, forming the foundation for possession-level analysis. Each row typically represents a discrete event: a field goal attempt, turnover, substitution, timeout, or foul.

The NBA API's playbyplayv2 endpoint returns data in the following structure:

| Field | Description | Example Value |
| --- | --- | --- |
| EVENTNUM | Sequential event identifier | 42 |
| EVENTMSGTYPE | Event category code | 2 (shot attempt) |
| EVENTMSGACTIONTYPE | Specific action subtype | 79 (pullup jump shot) |
| PERIOD | Game period (1-4, 5+ for OT) | 2 |
| PCTIMESTRING | Time remaining in period | "8:34" |
| HOMEDESCRIPTION | Event description (home team) | "James 25' 3PT" |
| VISITORDESCRIPTION | Event description (away team) | NULL |
| PLAYER1_ID | Primary player involved | 2544 |
| PLAYER1_TEAM_ID | Primary player's team | 1610612747 |
| SCORE | Game score after event | "45 - 42" |

2.4.2 Event Type Codes

The EVENTMSGTYPE field encodes the fundamental event category:

| Code | Event Type | Description |
| --- | --- | --- |
| 1 | Made Shot | Successful field goal |
| 2 | Missed Shot | Unsuccessful field goal attempt |
| 3 | Free Throw | Free throw attempt (made or missed) |
| 4 | Rebound | Offensive or defensive rebound |
| 5 | Turnover | Loss of possession |
| 6 | Foul | Personal, technical, or flagrant foul |
| 7 | Violation | Lane violation, backcourt, etc. |
| 8 | Substitution | Player enters or exits game |
| 9 | Timeout | Team or official timeout |
| 10 | Jump Ball | Possession determined by jump |
| 12 | Period Start | Beginning of a period |
| 13 | Period End | End of a period |
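
For use in downstream code, the mapping above can be captured as a small lookup dictionary:

# Lookup table mirroring the EVENTMSGTYPE codes above
EVENT_TYPES = {
    1: 'Made Shot',
    2: 'Missed Shot',
    3: 'Free Throw',
    4: 'Rebound',
    5: 'Turnover',
    6: 'Foul',
    7: 'Violation',
    8: 'Substitution',
    9: 'Timeout',
    10: 'Jump Ball',
    12: 'Period Start',
    13: 'Period End',
}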

2.4.3 Parsing and Transforming PBP Data

Converting raw play-by-play data into analytically useful structures requires careful transformation:

import pandas as pd
import numpy as np
from typing import Tuple, List
from nba_api.stats.endpoints import playbyplayv2


def fetch_game_pbp(game_id: str) -> pd.DataFrame:
    """
    Fetch and parse play-by-play data for a specific game.

    Args:
        game_id: NBA game ID (e.g., "0022400123")

    Returns:
        DataFrame containing parsed play-by-play events
    """
    pbp = playbyplayv2.PlayByPlayV2(game_id=game_id)
    df = pbp.get_data_frames()[0]

    return df


def calculate_game_time(period: int, time_string: str) -> float:
    """
    Convert period and time string to elapsed game seconds.

    Args:
        period: Game period (1-4 for regulation, 5+ for OT)
        time_string: Time remaining in period (e.g., "8:34")

    Returns:
        Total elapsed seconds from game start
    """
    minutes, seconds = map(int, time_string.split(':'))
    time_remaining = minutes * 60 + seconds

    if period <= 4:
        # Regulation periods are 12 minutes
        period_length = 12 * 60
        elapsed_in_period = period_length - time_remaining
        prior_periods = (period - 1) * period_length
    else:
        # Overtime periods are 5 minutes
        ot_period_length = 5 * 60
        elapsed_in_period = ot_period_length - time_remaining
        prior_periods = 4 * 12 * 60 + (period - 5) * ot_period_length

    return prior_periods + elapsed_in_period


def identify_possessions(pbp_df: pd.DataFrame) -> pd.DataFrame:
    """
    Add possession identifiers to play-by-play data.

    A new possession begins after:
    - Made field goals (except and-ones)
    - Defensive rebounds
    - Turnovers
    - End of period

    Args:
        pbp_df: Raw play-by-play DataFrame

    Returns:
        DataFrame with added 'POSSESSION_ID' and 'POSSESSION_TEAM' columns
    """
    # Reset the index so positional lookups with iloc below stay aligned
    df = pbp_df.copy().reset_index(drop=True)

    # Event codes that end possessions
    MADE_SHOT = 1
    TURNOVER = 5

    possession_endings = []

    for idx, row in df.iterrows():
        ends_possession = False

        if row['EVENTMSGTYPE'] == MADE_SHOT:
            # Check if next event is not a free throw (and-one)
            if idx + 1 < len(df):
                next_event = df.iloc[idx + 1]
                if next_event['EVENTMSGTYPE'] != 3:  # Not free throw
                    ends_possession = True
            else:
                ends_possession = True

        elif row['EVENTMSGTYPE'] == TURNOVER:
            ends_possession = True

        elif row['EVENTMSGTYPE'] == 4:  # Rebound
            # Defensive rebounds end possessions
            description = str(row.get('HOMEDESCRIPTION', '')) + \
                         str(row.get('VISITORDESCRIPTION', ''))
            if 'DEF' in description.upper():
                ends_possession = True

        possession_endings.append(ends_possession)

    df['ENDS_POSSESSION'] = possession_endings
    df['POSSESSION_ID'] = df['ENDS_POSSESSION'].shift(1, fill_value=False).cumsum()

    return df


def calculate_possession_stats(pbp_df: pd.DataFrame) -> pd.DataFrame:
    """
    Aggregate play-by-play data to possession-level statistics.

    Args:
        pbp_df: Play-by-play DataFrame with possession IDs

    Returns:
        DataFrame with one row per possession and computed metrics
    """
    possession_stats = []

    for poss_id, poss_df in pbp_df.groupby('POSSESSION_ID'):
        shots = poss_df[poss_df['EVENTMSGTYPE'].isin([1, 2])]

        stats = {
            'possession_id': poss_id,
            'period': poss_df['PERIOD'].iloc[0],
            'start_time': poss_df['PCTIMESTRING'].iloc[0],
            'num_events': len(poss_df),
            'shot_attempts': len(shots),
            'made_shots': len(poss_df[poss_df['EVENTMSGTYPE'] == 1]),
            'turnovers': len(poss_df[poss_df['EVENTMSGTYPE'] == 5]),
            'free_throws_attempted': len(
                poss_df[poss_df['EVENTMSGTYPE'] == 3]
            ),
        }

        possession_stats.append(stats)

    return pd.DataFrame(possession_stats)
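
Putting these pieces together for a single game (the game ID is the illustrative one from the docstring above):

if __name__ == "__main__":
    pbp = fetch_game_pbp("0022400123")
    pbp = identify_possessions(pbp)
    possessions = calculate_possession_stats(pbp)

    print(f"Events parsed: {len(pbp)}")
    print(f"Possessions identified: {len(possessions)}")
    print(f"Shot attempts per possession: "
          f"{possessions['shot_attempts'].mean():.2f}")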

2.4.4 Shot Location Extraction

Shot chart data provides spatial information for field goal attempts:

import numpy as np
import pandas as pd
from nba_api.stats.endpoints import shotchartdetail


def get_shot_chart(
    player_id: int,
    season: str = "2024-25",
    season_type: str = "Regular Season"
) -> pd.DataFrame:
    """
    Retrieve shot chart data for a player and season.

    The returned DataFrame includes x,y coordinates where:
    - x: Horizontal position (-250 to 250, in tenths of feet from the basket center)
    - y: Vertical position (-50 to 890, in tenths of feet from the basket toward half court)

    Args:
        player_id: NBA player ID
        season: Season in YYYY-YY format
        season_type: "Regular Season" or "Playoffs"

    Returns:
        DataFrame containing shot locations and outcomes
    """
    shot_chart = shotchartdetail.ShotChartDetail(
        player_id=player_id,
        team_id=0,  # All teams
        season_nullable=season,
        season_type_all_star=season_type,
        context_measure_simple='FGA'
    )

    df = shot_chart.get_data_frames()[0]

    # Convert coordinates to feet
    df['LOC_X_FEET'] = df['LOC_X'] / 10.0
    df['LOC_Y_FEET'] = df['LOC_Y'] / 10.0

    # Calculate distance from basket
    df['SHOT_DISTANCE_CALC'] = np.sqrt(
        df['LOC_X_FEET']**2 + df['LOC_Y_FEET']**2
    )

    return df
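
A brief usage example, assuming the standard SHOT_ZONE_BASIC and SHOT_MADE_FLAG columns returned by the endpoint, aggregates accuracy by court zone:

if __name__ == "__main__":
    shots = get_shot_chart(player_id=2544, season="2024-25")

    # Field goal percentage by the API's basic court-zone classification
    zone_summary = (
        shots.groupby('SHOT_ZONE_BASIC')['SHOT_MADE_FLAG']
             .agg(['count', 'mean'])
             .rename(columns={'count': 'attempts', 'mean': 'fg_pct'})
             .sort_values('attempts', ascending=False)
    )
    print(zone_summary)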

2.5 Player Tracking Data: Second Spectrum and Beyond

2.5.1 The Evolution of Tracking Technology

Player tracking represents the most significant advancement in basketball data collection since the advent of play-by-play recording. The NBA's longtime tracking partner, Second Spectrum (succeeded by Sony's Hawk-Eye system beginning with the 2023-24 season), employs optical systems that capture player and ball positions at 25 frames per second, enabling analyses previously impossible with traditional statistics.

The tracking infrastructure includes:

  • Multiple high-definition cameras mounted in arena rafters
  • Computer vision algorithms that identify and track all ten players plus the ball
  • Real-time processing that converts video into structured positional data
  • Machine learning models that classify actions (screens, drives, cuts, etc.)

2.5.2 Tracking Data Schema

Raw tracking data follows a hierarchical structure:

Game
├── Metadata (teams, players, date, venue)
├── Moments
│   ├── Quarter
│   ├── Game Clock
│   ├── Shot Clock
│   ├── Ball Position (x, y, z)
│   └── Player Positions
│       ├── Player 1 (team_id, player_id, x, y)
│       ├── Player 2 (team_id, player_id, x, y)
│       └── ... (10 total players)

The coordinate system places the origin at center court:

  • X-axis: Baseline to baseline (-47 to +47 feet)
  • Y-axis: Sideline to sideline (-25 to +25 feet)
  • Z-axis: Ball height above floor (0 to ~20 feet)

2.5.3 Derived Tracking Metrics

Second Spectrum processes raw positional data into higher-level metrics that the NBA makes available through its API. Key derived metrics include:

Speed and Distance:

$$\text{Speed} = \frac{\sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2}}{\Delta t}$$

$$\text{Distance} = \sum_{t=1}^{T} \sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2}$$
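
A sketch of how these quantities follow from raw coordinate samples, assuming positions are given in feet and sampled at the 25 Hz frame rate described above:

import numpy as np


def speed_and_distance(x: np.ndarray, y: np.ndarray, fps: float = 25.0):
    """
    Compute per-frame speed (feet per second) and total distance (feet)
    from sampled court coordinates.

    Args:
        x, y: Arrays of positions in feet, one sample per frame
        fps: Sampling rate of the tracking system

    Returns:
        Tuple of (per-frame speed array, total distance traveled)
    """
    step = np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2)  # distance per frame
    return step * fps, step.sum()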

Defensive Metrics:

$$\text{Contested Shot} = \begin{cases} 1 & \text{if defender within } d \text{ feet at release} \\ 0 & \text{otherwise} \end{cases}$$

where $d$ is typically set to 4-6 feet depending on the defensive pressure classification.

Touch Metrics:

  • Time of possession per touch
  • Dribbles per touch
  • Average touch duration
  • Passes received by zone

2.5.4 Accessing Tracking Data via the NBA API

While raw frame-by-frame tracking data remains proprietary, aggregated tracking metrics are available through the API:

from nba_api.stats.endpoints import (
    playerdashptshots,
    leaguedashptstats,
    playerdashptshotdefend
)


def get_player_tracking_shooting(
    player_id: int,
    season: str = "2024-25"
) -> dict:
    """
    Retrieve tracking-enhanced shooting statistics for a player.

    Includes shot type breakdowns (catch-and-shoot, pull-up, etc.)
    and defender distance classifications.

    Args:
        player_id: NBA player ID
        season: Season in YYYY-YY format

    Returns:
        Dictionary containing multiple tracking shooting DataFrames
    """
    tracking = playerdashptshots.PlayerDashPtShots(
        player_id=player_id,
        season=season
    )

    data_frames = tracking.get_data_frames()

    return {
        'overall': data_frames[0],
        'shot_type': data_frames[1],     # Catch-shoot, pull-up, etc.
        'shot_clock': data_frames[2],    # By shot clock range
        'dribbles': data_frames[3],      # By number of dribbles
        'touch_time': data_frames[4],    # By seconds of touch
        'closest_defender': data_frames[5],  # By defender distance
    }


def get_player_tracking_defense(
    player_id: int,
    season: str = "2024-25"
) -> pd.DataFrame:
    """
    Retrieve tracking-based defensive statistics.

    Includes shots defended, opponent field goal percentage,
    and defensive impact metrics.

    Args:
        player_id: NBA player ID
        season: Season in YYYY-YY format

    Returns:
        DataFrame with defensive tracking metrics
    """
    defense = playerdashptshotdefend.PlayerDashPtShotDefend(
        player_id=player_id,
        season=season
    )

    return defense.get_data_frames()[0]

2.5.5 Working with Historical Tracking Data

The NBA first deployed tracking technology (initially SportVU) in 2013-14. When working with tracking data, be aware of these considerations:

  1. Provider changes: SportVU (2013-2017), Second Spectrum (2017-2023), and Hawk-Eye (2023-present) may use slightly different algorithms
  2. Coverage gaps: Not all arenas had tracking in early seasons
  3. Metric evolution: Some derived metrics have changed definitions over time
  4. Availability limitations: The most granular data requires direct partnerships with teams or the league

2.6 Public Datasets and Kaggle Resources

2.6.1 Overview of Public Data Sources

Beyond APIs and web scraping, several curated datasets provide accessible entry points for basketball analytics:

Kaggle Datasets:

  • NBA Shot Logs (2014-2020): Shot-level data with player tracking context
  • NBA Game Data: Historical game results and box scores
  • NBA Player Stats: Comprehensive player statistics across eras

Academic Repositories:

  • BigDataBall: Game-by-game data with advanced metrics
  • Sports-Reference Data Dumps: Historical statistical archives

GitHub Repositories:

  • nba-movement-data: Sample tracking data from 2015-16
  • NBA-Player-Movements: Parsed SportVU data

2.6.2 Working with Kaggle Data

import pandas as pd
from pathlib import Path


def load_kaggle_shot_data(
    data_path: Path,
    seasons: list = None
) -> pd.DataFrame:
    """
    Load and preprocess NBA shot data from Kaggle dataset.

    Expected file format: CSV with columns matching NBA shot chart schema.

    Args:
        data_path: Path to the Kaggle data directory
        seasons: Optional list of seasons to filter (e.g., ["2019-20", "2020-21"])

    Returns:
        Preprocessed DataFrame ready for analysis
    """
    # Load the main dataset
    df = pd.read_csv(data_path / "NBA_Shot_Logs.csv")

    # Standardize column names
    df.columns = df.columns.str.upper().str.replace(' ', '_')

    # Filter seasons if specified
    if seasons:
        df = df[df['SEASON'].isin(seasons)]

    # Convert shot result to binary
    if 'SHOT_RESULT' in df.columns:
        df['MADE_SHOT'] = (df['SHOT_RESULT'] == 'made').astype(int)

    # Parse datetime if available
    if 'GAME_DATE' in df.columns:
        df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'])

    return df


def load_movement_data(game_id: str, data_path: Path) -> dict:
    """
    Load NBA movement data from the public movement dataset.

    The movement data format is JSON with nested structure for
    each moment containing ball and player positions.

    Args:
        game_id: NBA game identifier
        data_path: Path to the movement data directory

    Returns:
        Dictionary containing parsed movement data
    """
    import json

    file_path = data_path / f"{game_id}.json"

    with open(file_path, 'r') as f:
        data = json.load(f)

    # Extract metadata
    game_info = {
        'game_id': data.get('gameid'),
        'game_date': data.get('gamedate'),
        'home_team': data.get('events', [{}])[0].get('home', {}),
        'away_team': data.get('events', [{}])[0].get('visitor', {}),
    }

    # Parse events
    events = []
    for event in data.get('events', []):
        event_data = {
            'event_id': event.get('eventId'),
            'moments': parse_moments(event.get('moments', []))
        }
        events.append(event_data)

    return {'game_info': game_info, 'events': events}


def parse_moments(moments: list) -> pd.DataFrame:
    """
    Parse moment data into a structured DataFrame.

    Each moment contains:
    [quarter, unix_timestamp_ms, game_clock, shot_clock, None,
     [[team_id, player_id, x, y, z], ...]]  # Ball + 10 players

    Args:
        moments: List of raw moment data

    Returns:
        DataFrame with parsed positional data
    """
    records = []

    for moment in moments:
        quarter = moment[0]
        # moment[1] is a unix timestamp in milliseconds; it is not needed here
        game_clock = moment[2]
        shot_clock = moment[3]
        positions = moment[5]

        # First position is always the ball
        ball = positions[0] if positions else [None] * 5

        record = {
            'quarter': quarter,
            'game_clock': game_clock,
            'shot_clock': shot_clock,
            'ball_x': ball[2] if len(ball) > 2 else None,
            'ball_y': ball[3] if len(ball) > 3 else None,
            'ball_z': ball[4] if len(ball) > 4 else None,
        }

        # Add player positions
        for i, player in enumerate(positions[1:11], start=1):
            if player:
                record[f'player_{i}_team'] = player[0]
                record[f'player_{i}_id'] = player[1]
                record[f'player_{i}_x'] = player[2]
                record[f'player_{i}_y'] = player[3]

        records.append(record)

    return pd.DataFrame(records)

2.6.3 Validating Public Data Quality

Public datasets often contain inconsistencies, missing values, or errors introduced during collection or transformation. Systematic validation is essential:

import numpy as np
import pandas as pd


def validate_shot_data(df: pd.DataFrame) -> dict:
    """
    Perform quality validation checks on shot data.

    Args:
        df: DataFrame containing shot records

    Returns:
        Dictionary containing validation results and issues found
    """
    issues = []

    # Check for missing values in critical columns
    critical_cols = ['PLAYER_ID', 'GAME_ID', 'LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG']
    for col in critical_cols:
        if col in df.columns:
            missing = df[col].isna().sum()
            if missing > 0:
                issues.append({
                    'type': 'missing_values',
                    'column': col,
                    'count': missing,
                    'percentage': missing / len(df) * 100
                })

    # Validate coordinate ranges
    if 'LOC_X' in df.columns:
        out_of_range = ((df['LOC_X'] < -250) | (df['LOC_X'] > 250)).sum()
        if out_of_range > 0:
            issues.append({
                'type': 'invalid_coordinates',
                'column': 'LOC_X',
                'count': out_of_range,
                'description': 'X coordinates outside valid court range'
            })

    # Check for duplicate records
    if 'GAME_ID' in df.columns and 'EVENTNUM' in df.columns:
        duplicates = df.duplicated(subset=['GAME_ID', 'EVENTNUM']).sum()
        if duplicates > 0:
            issues.append({
                'type': 'duplicates',
                'count': duplicates,
                'description': 'Duplicate game/event combinations'
            })

    # Validate shot distances
    if {'SHOT_DISTANCE', 'LOC_X', 'LOC_Y'}.issubset(df.columns):
        calculated = np.sqrt(df['LOC_X']**2 + df['LOC_Y']**2) / 10
        discrepancy = np.abs(df['SHOT_DISTANCE'] - calculated)
        significant_errors = (discrepancy > 1).sum()
        if significant_errors > 0:
            issues.append({
                'type': 'distance_mismatch',
                'count': significant_errors,
                'description': 'Shot distance inconsistent with coordinates'
            })

    return {
        'total_records': len(df),
        'issues_found': len(issues),
        'issues': issues,
        'is_valid': len(issues) == 0
    }

2.7 Data Quality, Limitations, and Cleaning

2.7.1 Common Data Quality Issues

Basketball data exhibits several recurring quality challenges:

Temporal Inconsistencies:

  • Game clock discrepancies between sources
  • Missing or incorrect period information
  • Timezone inconsistencies in game timestamps

Entity Resolution:

  • Players traded mid-season appearing with multiple team IDs
  • Historical player ID changes
  • Inconsistent name formatting across sources

Statistical Anomalies:

  • Missing play-by-play events
  • Box score and play-by-play totals that don't reconcile
  • Incorrect player attribution for assists or rebounds

Tracking Data Artifacts:

  • Player ID swaps when players cross paths
  • Ball position dropouts during fast passes
  • Coordinate jitter in low-light arena conditions

2.7.2 Data Cleaning Pipeline

A robust cleaning pipeline addresses these issues systematically:

import pandas as pd
import numpy as np
from typing import Tuple


class NBADataCleaner:
    """
    Comprehensive data cleaning pipeline for NBA statistics.

    Handles common data quality issues including missing values,
    duplicate records, coordinate validation, and statistical
    reconciliation.
    """

    def __init__(self, verbose: bool = True):
        """
        Initialize the data cleaner.

        Args:
            verbose: Whether to print cleaning progress messages
        """
        self.verbose = verbose
        self.cleaning_log = []

    def _log(self, message: str) -> None:
        """Log a cleaning operation message."""
        self.cleaning_log.append(message)
        if self.verbose:
            print(message)

    def clean_player_names(self, df: pd.DataFrame,
                          name_col: str = 'PLAYER_NAME') -> pd.DataFrame:
        """
        Standardize player name formatting.

        Args:
            df: Input DataFrame
            name_col: Name of the player name column

        Returns:
            DataFrame with cleaned player names
        """
        df = df.copy()

        if name_col not in df.columns:
            return df

        # Strip whitespace
        df[name_col] = df[name_col].str.strip()

        # Standardize case
        df[name_col] = df[name_col].str.title()

        # Handle common name variations
        name_mappings = {
            'Lebron James': 'LeBron James',
            'Demar Derozan': 'DeMar DeRozan',
            'Dangelo Russell': "D'Angelo Russell",
            'Tj Mcconnell': 'T.J. McConnell',
        }

        df[name_col] = df[name_col].replace(name_mappings)

        self._log(f"Standardized player names ({len(name_mappings)} known variants checked)")

        return df

    def remove_duplicates(self, df: pd.DataFrame,
                         subset: list = None,
                         keep: str = 'first') -> pd.DataFrame:
        """
        Remove duplicate records.

        Args:
            df: Input DataFrame
            subset: Columns to consider for identifying duplicates
            keep: Which duplicate to keep ('first', 'last', or False)

        Returns:
            DataFrame with duplicates removed
        """
        original_len = len(df)
        df = df.drop_duplicates(subset=subset, keep=keep)
        removed = original_len - len(df)

        self._log(f"Removed {removed} duplicate records")

        return df

    def validate_coordinates(self, df: pd.DataFrame,
                            x_col: str = 'LOC_X',
                            y_col: str = 'LOC_Y') -> pd.DataFrame:
        """
        Validate and clean shot coordinates.

        Args:
            df: Input DataFrame
            x_col: Name of X coordinate column
            y_col: Name of Y coordinate column

        Returns:
            DataFrame with validated coordinates
        """
        df = df.copy()

        # Define valid ranges (in tenths of feet)
        X_MIN, X_MAX = -250, 250
        Y_MIN, Y_MAX = -50, 900

        # Identify invalid coordinates
        invalid_mask = (
            (df[x_col] < X_MIN) | (df[x_col] > X_MAX) |
            (df[y_col] < Y_MIN) | (df[y_col] > Y_MAX)
        )

        invalid_count = invalid_mask.sum()

        if invalid_count > 0:
            # Option 1: Flag invalid records
            df['VALID_COORDINATES'] = ~invalid_mask

            # Option 2: Clip to valid range
            df[x_col] = df[x_col].clip(X_MIN, X_MAX)
            df[y_col] = df[y_col].clip(Y_MIN, Y_MAX)

            self._log(f"Found {invalid_count} records with invalid coordinates")

        return df

    def fill_missing_values(self, df: pd.DataFrame,
                           strategy: dict = None) -> pd.DataFrame:
        """
        Fill missing values using specified strategies.

        Args:
            df: Input DataFrame
            strategy: Dict mapping column names to fill strategies
                     ('mean', 'median', 'mode', 'zero', or a specific value)

        Returns:
            DataFrame with missing values filled
        """
        df = df.copy()

        if strategy is None:
            strategy = {}

        for col, method in strategy.items():
            if col not in df.columns:
                continue

            missing = df[col].isna().sum()

            if missing == 0:
                continue

            if method == 'mean':
                fill_value = df[col].mean()
            elif method == 'median':
                fill_value = df[col].median()
            elif method == 'mode':
                fill_value = df[col].mode().iloc[0]
            elif method == 'zero':
                fill_value = 0
            else:
                fill_value = method

            df[col] = df[col].fillna(fill_value)
            self._log(f"Filled {missing} missing values in '{col}' with {method}")

        return df

    def reconcile_box_scores(self, pbp_df: pd.DataFrame,
                            box_df: pd.DataFrame) -> Tuple[bool, dict]:
        """
        Check if play-by-play totals match box score.

        Args:
            pbp_df: Play-by-play DataFrame
            box_df: Box score DataFrame

        Returns:
            Tuple of (is_reconciled, discrepancies_dict)
        """
        discrepancies = {}

        # Calculate PBP totals
        pbp_made_shots = len(pbp_df[pbp_df['EVENTMSGTYPE'] == 1])
        pbp_missed_shots = len(pbp_df[pbp_df['EVENTMSGTYPE'] == 2])
        pbp_fga = pbp_made_shots + pbp_missed_shots

        # Get box score totals
        box_fgm = box_df['FGM'].sum()
        box_fga = box_df['FGA'].sum()

        # Compare
        if pbp_made_shots != box_fgm:
            discrepancies['FGM'] = {
                'pbp': pbp_made_shots,
                'box': box_fgm,
                'diff': pbp_made_shots - box_fgm
            }

        if pbp_fga != box_fga:
            discrepancies['FGA'] = {
                'pbp': pbp_fga,
                'box': box_fga,
                'diff': pbp_fga - box_fga
            }

        is_reconciled = len(discrepancies) == 0

        if not is_reconciled:
            self._log(f"Box score reconciliation failed: {discrepancies}")

        return is_reconciled, discrepancies


def clean_nba_data_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """
    Apply standard cleaning pipeline to NBA data.

    Args:
        df: Raw NBA data DataFrame

    Returns:
        Cleaned DataFrame
    """
    cleaner = NBADataCleaner(verbose=True)

    # Apply cleaning steps
    df = cleaner.clean_player_names(df)
    df = cleaner.remove_duplicates(df)

    if 'LOC_X' in df.columns:
        df = cleaner.validate_coordinates(df)

    # Fill missing values with appropriate strategies
    fill_strategy = {
        'PTS': 'zero',
        'MIN': 'median',
        'SHOT_DISTANCE': 'mean',
    }
    df = cleaner.fill_missing_values(df, strategy=fill_strategy)

    return df

2.7.3 Handling Era-Specific Variations

Basketball statistics have different meanings and availability across eras:

ERA_CONFIGURATIONS = {
    'pre_three_point': {
        'years': range(1946, 1980),
        'three_pointers': False,
        'shot_clock': lambda y: y >= 1954,
        'tracking': False,
        'play_by_play': False,
    },
    'early_three_point': {
        'years': range(1980, 1997),
        'three_pointers': True,
        'tracking': False,
        'play_by_play': False,
    },
    'modern_stats': {
        'years': range(1997, 2014),
        'three_pointers': True,
        'tracking': False,
        'play_by_play': True,
    },
    'tracking_era': {
        'years': range(2014, 2100),
        'three_pointers': True,
        'tracking': True,
        'play_by_play': True,
    },
}


def get_era_config(season_year: int) -> dict:
    """
    Retrieve configuration for a specific season's data era.

    Args:
        season_year: The ending year of the season (e.g., 2024 for 2023-24)

    Returns:
        Dictionary of data availability and characteristics for that era
    """
    for era_name, config in ERA_CONFIGURATIONS.items():
        if season_year in config['years']:
            return {'era': era_name, **config}

    return {'era': 'unknown'}

2.8 Building a Complete Data Collection System

2.8.1 Architecture Considerations

A production-grade data collection system requires careful architectural planning:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Data Sources  │────▶│   Ingestion     │────▶│   Storage       │
│  - NBA API      │     │   Pipeline      │     │  - Raw Data     │
│  - BBRef        │     │  - Rate Limit   │     │  - Processed    │
│  - Tracking     │     │  - Validation   │     │  - Analytics    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                        ┌─────────────────┐              │
                        │   Scheduling    │◀─────────────┘
                        │  - Daily Sync   │
                        │  - Season Init  │
                        └─────────────────┘

2.8.2 Implementing a Data Collection Framework

import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta
from typing import Optional, List
import json
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class NBADataCollector:
    """
    Comprehensive data collection system for NBA analytics.

    Manages data retrieval from multiple sources with caching,
    rate limiting, and validation.

    Attributes:
        data_dir: Base directory for data storage
        cache_expiry: Hours before cached data is considered stale
    """

    def __init__(self, data_dir: Path, cache_expiry: int = 24):
        """
        Initialize the data collector.

        Args:
            data_dir: Base directory for data storage
            cache_expiry: Hours before cached data is considered stale
        """
        self.data_dir = Path(data_dir)
        self.cache_expiry = cache_expiry

        # Create directory structure
        self._setup_directories()

        # Initialize metadata tracking
        self.metadata_file = self.data_dir / 'metadata.json'
        self.metadata = self._load_metadata()

    def _setup_directories(self) -> None:
        """Create required directory structure."""
        directories = [
            'raw/games',
            'raw/players',
            'raw/pbp',
            'raw/shots',
            'processed',
            'exports',
        ]

        for dir_path in directories:
            (self.data_dir / dir_path).mkdir(parents=True, exist_ok=True)

    def _load_metadata(self) -> dict:
        """Load or initialize metadata tracking file."""
        if self.metadata_file.exists():
            with open(self.metadata_file, 'r') as f:
                return json.load(f)
        return {'last_sync': {}, 'versions': {}}

    def _save_metadata(self) -> None:
        """Persist metadata to disk."""
        with open(self.metadata_file, 'w') as f:
            json.dump(self.metadata, f, indent=2)

    def _is_cache_valid(self, cache_key: str) -> bool:
        """Check if cached data is still valid."""
        if cache_key not in self.metadata['last_sync']:
            return False

        last_sync = datetime.fromisoformat(self.metadata['last_sync'][cache_key])
        expiry_time = last_sync + timedelta(hours=self.cache_expiry)

        return datetime.now() < expiry_time

    def collect_season_games(self, season: str,
                            force_refresh: bool = False) -> pd.DataFrame:
        """
        Collect all games for a season.

        Args:
            season: Season identifier (e.g., "2024-25")
            force_refresh: Whether to ignore cache

        Returns:
            DataFrame containing all games for the season
        """
        from nba_api.stats.endpoints import leaguegamefinder

        cache_key = f"games_{season}"
        cache_path = self.data_dir / 'raw' / 'games' / f"{season}.parquet"

        # Check cache
        if not force_refresh and cache_path.exists() and self._is_cache_valid(cache_key):
            logger.info(f"Loading cached games for {season}")
            return pd.read_parquet(cache_path)

        # Fetch from API
        logger.info(f"Fetching games for {season} from API")

        game_finder = leaguegamefinder.LeagueGameFinder(
            season_nullable=season,
            league_id_nullable='00'
        )

        df = game_finder.get_data_frames()[0]

        # Save to cache
        df.to_parquet(cache_path)
        self.metadata['last_sync'][cache_key] = datetime.now().isoformat()
        self._save_metadata()

        return df

    def collect_game_pbp(self, game_id: str,
                        force_refresh: bool = False) -> pd.DataFrame:
        """
        Collect play-by-play data for a specific game.

        Args:
            game_id: NBA game identifier
            force_refresh: Whether to ignore cache

        Returns:
            DataFrame containing play-by-play events
        """
        from nba_api.stats.endpoints import playbyplayv2

        cache_key = f"pbp_{game_id}"
        cache_path = self.data_dir / 'raw' / 'pbp' / f"{game_id}.parquet"

        # Check cache
        if not force_refresh and cache_path.exists():
            return pd.read_parquet(cache_path)

        # Fetch from API
        logger.info(f"Fetching play-by-play for game {game_id}")

        pbp = playbyplayv2.PlayByPlayV2(game_id=game_id)
        df = pbp.get_data_frames()[0]

        # Save to cache
        df.to_parquet(cache_path)

        return df

    def collect_player_shots(self, player_id: int, season: str,
                            force_refresh: bool = False) -> pd.DataFrame:
        """
        Collect shot chart data for a player.

        Args:
            player_id: NBA player identifier
            season: Season identifier
            force_refresh: Whether to ignore cache

        Returns:
            DataFrame containing shot chart data
        """
        from nba_api.stats.endpoints import shotchartdetail

        cache_path = self.data_dir / 'raw' / 'shots' / f"{player_id}_{season}.parquet"

        # Check cache
        if not force_refresh and cache_path.exists():
            return pd.read_parquet(cache_path)

        # Fetch from API
        logger.info(f"Fetching shots for player {player_id}, season {season}")

        shots = shotchartdetail.ShotChartDetail(
            player_id=player_id,
            team_id=0,
            season_nullable=season,
            context_measure_simple='FGA'
        )

        df = shots.get_data_frames()[0]

        # Save to cache
        df.to_parquet(cache_path)

        return df

    def batch_collect_pbp(self, game_ids: List[str],
                         delay: float = 1.0) -> dict:
        """
        Collect play-by-play data for multiple games.

        Args:
            game_ids: List of game identifiers
            delay: Seconds to wait between requests

        Returns:
            Dictionary mapping game IDs to DataFrames
        """
        import time

        results = {}
        errors = []

        for i, game_id in enumerate(game_ids):
            try:
                df = self.collect_game_pbp(game_id)
                results[game_id] = df
                logger.info(f"Collected {i+1}/{len(game_ids)}: {game_id}")

                if i < len(game_ids) - 1:
                    time.sleep(delay)

            except Exception as e:
                logger.error(f"Error collecting {game_id}: {e}")
                errors.append({'game_id': game_id, 'error': str(e)})

        if errors:
            logger.warning(f"Failed to collect {len(errors)} games")

        return results

    def export_processed_data(self, df: pd.DataFrame,
                             name: str,
                             format: str = 'parquet') -> Path:
        """
        Export processed data to the exports directory.

        Args:
            df: DataFrame to export
            name: Base name for the export file
            format: Export format ('parquet', 'csv', 'json')

        Returns:
            Path to the exported file
        """
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"{name}_{timestamp}.{format}"
        export_path = self.data_dir / 'exports' / filename

        if format == 'parquet':
            df.to_parquet(export_path)
        elif format == 'csv':
            df.to_csv(export_path, index=False)
        elif format == 'json':
            df.to_json(export_path, orient='records', indent=2)
        else:
            raise ValueError(f"Unsupported format: {format}")

        logger.info(f"Exported data to {export_path}")

        return export_path
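
Continuing from the listing above, a typical workflow instantiates the collector, pulls the season's game list, and batches play-by-play collection (the directory path and sample size are illustrative):

if __name__ == "__main__":
    collector = NBADataCollector(data_dir=Path("./nba_data"))

    # LeagueGameFinder returns one row per team per game, so deduplicate IDs
    games = collector.collect_season_games("2024-25")
    game_ids = games['GAME_ID'].unique().tolist()

    # Collect play-by-play for a small sample of games
    pbp_by_game = collector.batch_collect_pbp(game_ids[:10], delay=1.5)

    # Export a combined table for downstream analysis
    combined = pd.concat(pbp_by_game.values(), ignore_index=True)
    collector.export_processed_data(combined, name="pbp_sample")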

2.9 Integrating Multiple Data Sources

2.9.1 Entity Resolution Across Sources

Different data sources use different identifiers for the same entities:

from typing import Optional


class EntityResolver:
    """
    Resolve entity identifiers across data sources.

    Maps between NBA API IDs, Basketball-Reference IDs,
    and other source-specific identifiers.
    """

    def __init__(self):
        """Initialize with base mapping data."""
        self._load_mappings()

    def _load_mappings(self) -> None:
        """Load entity mapping tables."""
        from nba_api.stats.static import players, teams

        # Build player mapping
        self.player_mapping = {}
        for player in players.get_players():
            key = (player['full_name'].lower(), player['is_active'])
            self.player_mapping[key] = {
                'nba_id': player['id'],
                'full_name': player['full_name'],
            }

        # Build team mapping
        self.team_mapping = {}
        for team in teams.get_teams():
            self.team_mapping[team['abbreviation']] = {
                'nba_id': team['id'],
                'full_name': team['full_name'],
                'city': team['city'],
                'nickname': team['nickname'],
            }

    def get_player_nba_id(self, bbref_id: str) -> Optional[int]:
        """
        Convert Basketball-Reference ID to NBA ID.

        Args:
            bbref_id: Basketball-Reference player ID (e.g., "jamesle01")

        Returns:
            NBA player ID or None if not found
        """
        # BBRef IDs follow pattern: first 5 chars of last name +
        # first 2 chars of first name + number
        # This requires a lookup table for accurate conversion

        # For demonstration, return placeholder
        bbref_to_nba = {
            'jamesle01': 2544,      # LeBron James
            'curryst01': 201939,    # Stephen Curry
            'duranke01': 201142,    # Kevin Durant
        }

        return bbref_to_nba.get(bbref_id)

    def normalize_team_name(self, team_str: str) -> Optional[dict]:
        """
        Normalize team identifier to standard format.

        Args:
            team_str: Team identifier in any format

        Returns:
            Dictionary with standardized team information
        """
        # Handle various input formats
        team_upper = team_str.upper().strip()

        # Direct abbreviation match
        if team_upper in self.team_mapping:
            return self.team_mapping[team_upper]

        # Search by full name or city
        for abbrev, info in self.team_mapping.items():
            if (info['full_name'].upper() == team_upper or
                info['city'].upper() == team_upper or
                info['nickname'].upper() == team_upper):
                return info

        return None
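
Usage is straightforward; the team lookup accepts abbreviations, full names, cities, or nicknames, and the player lookup relies on the (deliberately partial) Basketball-Reference mapping shown above:

resolver = EntityResolver()

print(resolver.normalize_team_name("LAL"))       # lookup by abbreviation
print(resolver.normalize_team_name("Lakers"))    # lookup by nickname
print(resolver.get_player_nba_id("jamesle01"))   # 2544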

2.9.2 Joining Data from Multiple Sources

def merge_box_score_with_tracking(
    box_df: pd.DataFrame,
    tracking_df: pd.DataFrame,
    on: str = 'PLAYER_ID'
) -> pd.DataFrame:
    """
    Merge traditional box score data with tracking metrics.

    Args:
        box_df: Traditional box score DataFrame
        tracking_df: Tracking statistics DataFrame
        on: Column to join on

    Returns:
        Combined DataFrame with both traditional and tracking stats
    """
    # Ensure common column types
    box_df[on] = box_df[on].astype(int)
    tracking_df[on] = tracking_df[on].astype(int)

    # Identify overlapping columns (excluding join key)
    box_cols = set(box_df.columns)
    tracking_cols = set(tracking_df.columns)
    overlap = (box_cols & tracking_cols) - {on}

    # Rename overlapping columns
    tracking_renamed = tracking_df.rename(
        columns={col: f"{col}_TRACKING" for col in overlap}
    )

    # Merge
    merged = box_df.merge(tracking_renamed, on=on, how='left')

    return merged


def create_unified_player_dataset(
    nba_api_df: pd.DataFrame,
    bbref_df: pd.DataFrame,
    resolver: EntityResolver
) -> pd.DataFrame:
    """
    Create a unified player dataset from multiple sources.

    Combines data from NBA API and Basketball-Reference,
    handling entity resolution and column deduplication.

    Args:
        nba_api_df: Player data from NBA API
        bbref_df: Player data from Basketball-Reference
        resolver: Entity resolver for ID mapping

    Returns:
        Unified player DataFrame
    """
    # Add NBA IDs to BBRef data
    bbref_df = bbref_df.copy()
    bbref_df['NBA_ID'] = bbref_df['BBREF_ID'].apply(
        resolver.get_player_nba_id
    )

    # Filter to players with resolved IDs
    bbref_resolved = bbref_df[bbref_df['NBA_ID'].notna()].copy()
    bbref_resolved['NBA_ID'] = bbref_resolved['NBA_ID'].astype(int)

    # Merge datasets
    unified = nba_api_df.merge(
        bbref_resolved,
        left_on='PLAYER_ID',
        right_on='NBA_ID',
        how='outer',
        suffixes=('_NBA', '_BBREF')
    )

    # Coalesce columns where both sources have data
    if 'PTS_NBA' in unified.columns and 'PTS_BBREF' in unified.columns:
        unified['PTS'] = unified['PTS_NBA'].fillna(unified['PTS_BBREF'])

    return unified

2.10 Best Practices and Recommendations

2.10.1 Data Source Selection Criteria

When selecting data sources for a project, consider:

| Criterion | NBA API | Basketball-Reference | Tracking Data |
| --- | --- | --- | --- |
| Accessibility | High (free, rate-limited) | Medium (requires scraping) | Low (restricted) |
| Coverage | 1996-present | 1946-present | 2013-present |
| Granularity | Game/player level | Game/player level | Frame level |
| Real-time | Yes | No | Yes (with access) |
| Reliability | High | High | Very High |

2.10.2 Data Pipeline Best Practices

  1. Implement idempotent operations: Data collection should produce identical results when run multiple times
  2. Version your data: Track schema changes and data source versions
  3. Validate early and often: Check data quality at each pipeline stage
  4. Design for failure: Implement retry logic and graceful degradation
  5. Document provenance: Record where each data point originated

2.10.3 Storage Recommendations

For analytical workloads, columnar formats like Parquet offer significant advantages:

# Efficient storage with appropriate data types
def optimize_dataframe_types(df: pd.DataFrame) -> pd.DataFrame:
    """
    Optimize DataFrame memory usage by downcasting numeric types.

    Args:
        df: Input DataFrame

    Returns:
        DataFrame with optimized data types
    """
    df = df.copy()

    for col in df.columns:
        col_type = df[col].dtype

        if col_type == 'int64':
            # Check if values fit in smaller type
            if df[col].min() >= 0:
                if df[col].max() < 255:
                    df[col] = df[col].astype('uint8')
                elif df[col].max() < 65535:
                    df[col] = df[col].astype('uint16')
                elif df[col].max() < 4294967295:
                    df[col] = df[col].astype('uint32')
            else:
                if df[col].min() > -128 and df[col].max() < 127:
                    df[col] = df[col].astype('int8')
                elif df[col].min() > -32768 and df[col].max() < 32767:
                    df[col] = df[col].astype('int16')
                elif df[col].min() > -2147483648 and df[col].max() < 2147483647:
                    df[col] = df[col].astype('int32')

        elif col_type == 'float64':
            df[col] = df[col].astype('float32')

        elif col_type == 'object':
            # Check if column should be categorical
            if df[col].nunique() / len(df) < 0.5:
                df[col] = df[col].astype('category')

    return df
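
A brief example pairs type optimization with a compressed Parquet write (the snappy compression shown assumes a pyarrow or fastparquet engine is installed):

import pandas as pd

sample = pd.DataFrame({
    'PLAYER_ID': [2544, 201939],
    'PTS': [25.0, 27.5],
    'TEAM_ABBREVIATION': ['LAL', 'GSW'],
})

optimized = optimize_dataframe_types(sample)
optimized.to_parquet("player_stats.parquet", compression="snappy")

# Parquet preserves the downcast dtypes, so reloading retains the savings
reloaded = pd.read_parquet("player_stats.parquet")
print(reloaded.dtypes)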

Summary

This chapter has provided a comprehensive examination of the NBA data ecosystem, from official APIs to proprietary tracking systems and public datasets. Key takeaways include:

  1. The NBA API provides structured access to official league statistics through the nba_api Python library, though responsible use requires rate limiting and proper request headers.

  2. Basketball-Reference offers unparalleled historical coverage but requires web scraping with attention to legal and ethical considerations.

  3. Play-by-play data enables possession-level analysis and requires careful parsing to extract meaningful events from the raw event stream.

  4. Tracking data from Second Spectrum provides unprecedented spatial granularity, though access to raw data remains restricted to league partners.

  5. Public datasets on Kaggle and GitHub provide accessible starting points, though quality validation is essential.

  6. Data cleaning must address temporal inconsistencies, entity resolution challenges, and source-specific artifacts.

  7. Production systems require robust architecture with caching, rate limiting, validation, and provenance tracking.

The code examples throughout this chapter provide a foundation for building your own data collection infrastructure. In Chapter 3, we will build upon these data foundations to explore fundamental statistical concepts and their application to basketball analysis.


References

  1. Patel, S. (2023). nba_api: An API Client for NBA.com. GitHub repository.
  2. Sports Reference LLC. (2024). Basketball-Reference.com Data Use Policy.
  3. Second Spectrum. (2024). NBA Tracking Data Technical Documentation.
  4. Kubatko, J., Oliver, D., Pelton, K., & Rosenbaum, D. T. (2007). A starting point for analyzing basketball statistics. Journal of Quantitative Analysis in Sports, 3(3).
  5. Cervone, D., D'Amour, A., Bornn, L., & Goldsberry, K. (2016). A Multiresolution Stochastic Process Model for Predicting Basketball Possession Outcomes. Journal of the American Statistical Association, 111(514), 585-599.