Appendix D: Data Sources and APIs

This appendix provides a comprehensive guide to basketball data sources, APIs, and data acquisition methods. Understanding where and how to obtain data is fundamental to basketball analytics work.


D.1 Official NBA Data Sources

NBA Stats API

The NBA Stats API is the official source for current NBA statistics, providing comprehensive play-by-play, box score, and tracking data.

Base URL: https://stats.nba.com/stats/

Important Notes:

- The API is undocumented and subject to change without notice
- Requires specific headers to avoid being blocked
- Rate limiting may apply; be respectful with request frequency
- For production use, consider caching responses

Required Headers:

headers = {
    'Host': 'stats.nba.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://stats.nba.com/',
    'Connection': 'keep-alive',
}

Common Endpoints:

| Endpoint | Description | Key Parameters |
|---|---|---|
| leaguegamefinder | Search games by various criteria | TeamID, Season, DateFrom, DateTo |
| playergamelog | Individual player game logs | PlayerID, Season, SeasonType |
| teamgamelog | Team game logs | TeamID, Season, SeasonType |
| boxscoretraditionalv2 | Traditional box score | GameID |
| boxscoreadvancedv2 | Advanced box score | GameID |
| playbyplayv2 | Play-by-play data | GameID |
| shotchartdetail | Shot chart data | PlayerID, TeamID, Season |
| leaguedashplayerstats | League-wide player stats | Season, SeasonType, PerMode |
| leaguedashteamstats | League-wide team stats | Season, SeasonType, PerMode |
| commonplayerinfo | Player biographical info | PlayerID |
| playerprofilev2 | Detailed player profile | PlayerID |

Example: Fetching Player Game Log

import requests
import pandas as pd

def get_player_gamelog(player_id, season='2023-24'):
    """Fetch player game log from NBA Stats API."""
    url = 'https://stats.nba.com/stats/playergamelog'

    headers = {
        'Host': 'stats.nba.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json',
        'Referer': 'https://stats.nba.com/',
        'Connection': 'keep-alive',
    }

    params = {
        'PlayerID': player_id,
        'Season': season,
        'SeasonType': 'Regular Season'
    }

    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()  # fail fast on blocked or failed requests
    data = response.json()

    headers_list = data['resultSets'][0]['headers']
    rows = data['resultSets'][0]['rowSet']

    return pd.DataFrame(rows, columns=headers_list)

# Example usage
lebron_id = 2544
gamelog = get_player_gamelog(lebron_id, '2023-24')
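
The notes at the start of this section recommend caching responses. One way to do that is a small on-disk JSON cache, sketched below. The `cached_fetch` helper, the `nba_cache` directory, and the injected `fetch` callable are hypothetical names chosen for illustration; none of them belong to the NBA Stats API or to any library.

```python
import hashlib
import json
from pathlib import Path


def cached_fetch(endpoint, params, fetch, cache_dir='nba_cache'):
    """Return the JSON payload for (endpoint, params), using a disk cache.

    `fetch` is any callable taking (endpoint, params) and returning a
    JSON-serializable object; it is only called on a cache miss.
    """
    # Build a stable key from the endpoint and the sorted parameters
    key_source = json.dumps([endpoint, sorted(params.items())])
    key = hashlib.sha256(key_source.encode('utf-8')).hexdigest()[:16]
    path = Path(cache_dir) / f'{endpoint}_{key}.json'

    # Cache hit: read the stored payload instead of calling the network
    if path.exists():
        return json.loads(path.read_text(encoding='utf-8'))

    # Cache miss: fetch, store, and return
    payload = fetch(endpoint, params)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload), encoding='utf-8')
    return payload
```

In practice `fetch` could wrap a `requests.get` call like the one in `get_player_gamelog`. This sketch never expires entries, so current-season data would need a manual refresh or a time-based check added on top.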

nba_api Python Package

The nba_api package provides a convenient Python wrapper for the NBA Stats API.

Installation:

pip install nba_api

Basic Usage:

from nba_api.stats.static import players, teams
from nba_api.stats.endpoints import playergamelog, shotchartdetail
from nba_api.stats.endpoints import leaguedashplayerstats

# Find player ID
player_dict = players.get_players()
lebron = [p for p in player_dict if p['full_name'] == 'LeBron James'][0]
print(f"LeBron James ID: {lebron['id']}")

# Get player game log
gamelog = playergamelog.PlayerGameLog(player_id=lebron['id'], season='2023-24')
df = gamelog.get_data_frames()[0]

# Get shot chart data
shots = shotchartdetail.ShotChartDetail(
    player_id=lebron['id'],
    team_id=0,
    season_nullable='2023-24',
    context_measure_simple='FGA'
)
shot_df = shots.get_data_frames()[0]

# Get league-wide stats
league_stats = leaguedashplayerstats.LeagueDashPlayerStats(
    season='2023-24',
    per_mode_detailed='PerGame'
)
all_players = league_stats.get_data_frames()[0]

Available Endpoints in nba_api:

from nba_api.stats.endpoints import (
    # Player endpoints
    playergamelog,
    playerprofilev2,
    playercareerstats,
    playercompare,
    playerdashboardbygeneralsplits,

    # Team endpoints
    teamgamelog,
    teamyearbyyearstats,
    teamdashboardbygeneralsplits,

    # Game endpoints
    boxscoretraditionalv2,
    boxscoreadvancedv2,
    playbyplayv2,

    # League endpoints
    leaguedashplayerstats,
    leaguedashteamstats,
    leaguestandings,

    # Shot chart
    shotchartdetail,
    shotchartlineupdetail
)

D.2 Basketball Reference

Basketball Reference (basketball-reference.com) is the most comprehensive historical basketball statistics resource.

Website Structure

Basketball Reference organizes data by:

- Players: /players/{first_letter}/{player_id}.html
- Teams: /teams/{team_abbrev}/{year}.html
- Seasons: /leagues/NBA_{year}.html
- Games: /boxscores/{date}0{home_team}.html
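
The URL patterns above can be turned into small helper functions. The sketch below is illustrative only: the function names are made up for this example, and it assumes that `{date}` in the box score pattern is formatted as YYYYMMDD.

```python
from datetime import date

BASE_URL = 'https://www.basketball-reference.com'


def player_url(player_id):
    """Player page; the first letter of the ID is the directory name."""
    return f'{BASE_URL}/players/{player_id[0]}/{player_id}.html'


def team_season_url(team_abbrev, year):
    """Team page for the season ending in `year`."""
    return f'{BASE_URL}/teams/{team_abbrev}/{year}.html'


def season_url(year):
    """League summary page for the season ending in `year`."""
    return f'{BASE_URL}/leagues/NBA_{year}.html'


def boxscore_url(game_date, home_team):
    """Box score page; assumes {date} is formatted as YYYYMMDD."""
    return f'{BASE_URL}/boxscores/{game_date:%Y%m%d}0{home_team}.html'


print(player_url('jamesle01'))
print(boxscore_url(date(2023, 10, 24), 'DEN'))
```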

Data Categories Available

| Category | Description | URL Pattern |
|---|---|---|
| Per Game Stats | Traditional per-game averages | /leagues/NBA_{year}_per_game.html |
| Per 36 Minutes | Rate stats normalized to 36 min | /leagues/NBA_{year}_per_minute.html |
| Per 100 Poss | Pace-adjusted statistics | /leagues/NBA_{year}_per_poss.html |
| Advanced | PER, TS%, BPM, VORP, etc. | /leagues/NBA_{year}_advanced.html |
| Shooting | Shot breakdown by distance/type | /leagues/NBA_{year}_shooting.html |
| Play-by-Play | On/off stats, usage | /leagues/NBA_{year}_play-by-play.html |
| Totals | Cumulative season totals | /leagues/NBA_{year}_totals.html |

Web Scraping Basketball Reference

Important Guidelines:

- Respect robots.txt and rate limits
- Add delays between requests (3+ seconds recommended)
- Cache responses to avoid repeated requests
- Include proper User-Agent header
- Consider using their data export feature for large datasets
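
The delay guideline above is easy to forget when requests are scattered across a script. A minimal sketch of a reusable throttle follows; the `Throttle` class is a hypothetical helper written for this appendix, and the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to honor min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `requests.get(...)` keeps requests at least `min_interval` seconds apart without sleeping longer than necessary.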

Example: Scraping Season Stats

import time
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_season_stats(year, stat_type='per_game'):
    """
    Scrape season statistics from Basketball Reference.

    Parameters:
    -----------
    year : int
        The ending year of the season (e.g., 2024 for 2023-24)
    stat_type : str
        Type of stats ('per_game', 'per_minute', 'per_poss', 'advanced', 'totals')
    """
    url = f'https://www.basketball-reference.com/leagues/NBA_{year}_{stat_type}.html'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # This code assumes table IDs follow the pattern '<stat_type>_stats'
    table_id = f'{stat_type}_stats'
    table = soup.find('table', {'id': table_id})

    if table is None:
        # Fail loudly instead of silently returning a different table
        raise ValueError(f"Table '{table_id}' not found at {url}")

    # Parse table to DataFrame (recent pandas versions expect a file-like object)
    df = pd.read_html(StringIO(str(table)))[0]

    # Clean up multi-level headers if present
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.droplevel(0)

    # Remove repeated header rows
    df = df[df['Player'] != 'Player']

    # Reset index
    df = df.reset_index(drop=True)

    return df

# Example usage (add delay between requests)
stats_2024 = get_season_stats(2024, 'per_game')
time.sleep(3)
advanced_2024 = get_season_stats(2024, 'advanced')

Example: Scraping Player Career Stats

from io import StringIO

def get_player_career(player_url):
    """
    Scrape career statistics for a player.

    Parameters:
    -----------
    player_url : str
        Player's Basketball Reference URL path (e.g., 'j/jamesle01')
    """
    url = f'https://www.basketball-reference.com/players/{player_url}.html'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # Collect each table found on the page, keyed by its HTML id
    tables = {}

    table_ids = ['per_game', 'advanced', 'totals']

    for table_id in table_ids:
        table = soup.find('table', {'id': table_id})
        if table is not None:
            df = pd.read_html(StringIO(str(table)))[0]
            tables[table_id] = df

    return tables

# Example usage
lebron_stats = get_player_career('j/jamesle01')

Basketball Reference Data Export

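For larger datasets, most Basketball Reference tables can be exported as CSV through the table's Share & Export menu, and the subscription Stathead service supports more flexible queries. Check the Sports Reference data use policy before assuming that any programmatic access is offered.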


D.3 Play-by-Play Data

NBA Play-by-Play Structure

Play-by-play data contains event-level information for every action in a game.

Key Fields:

| Field | Description | Example Values |
|---|---|---|
| EVENTNUM | Event sequence number | 1, 2, 3, ... |
| EVENTMSGTYPE | Type of event | 1=Made Shot, 2=Missed Shot, 3=FT, 4=Rebound |
| EVENTMSGACTIONTYPE | Sub-type of event | Layup, Dunk, Jump Shot |
| PERIOD | Game period | 1, 2, 3, 4, 5 (OT) |
| PCTIMESTRING | Time remaining in period | "11:45", "00:24" |
| HOMEDESCRIPTION | Description of home team action | "James 3PT Jump Shot" |
| VISITORDESCRIPTION | Description of away team action | "Curry Turnover" |
| PLAYER1_ID | Primary player involved | 2544 |
| PLAYER2_ID | Secondary player (assists, steals) | 201939 |
| PLAYER3_ID | Tertiary player (blocks) | 203507 |
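
Because PCTIMESTRING counts down within each period, it usually has to be combined with PERIOD before events can be ordered or timed across a whole game. The sketch below converts the pair to seconds elapsed since tip-off, using 12-minute quarters and 5-minute overtime periods. The function name is hypothetical.

```python
def seconds_elapsed(period, pctimestring, regulation_minutes=12, overtime_minutes=5):
    """Convert (PERIOD, PCTIMESTRING) to seconds elapsed since tip-off."""
    minutes, seconds = (int(part) for part in pctimestring.split(':'))
    remaining = minutes * 60 + seconds  # seconds left in the current period

    if period <= 4:
        period_length = regulation_minutes * 60
        prior = (period - 1) * period_length
    else:
        # Periods 5 and up are overtime periods
        period_length = overtime_minutes * 60
        prior = 4 * regulation_minutes * 60 + (period - 5) * period_length

    return prior + (period_length - remaining)


print(seconds_elapsed(1, '11:45'))  # 15 seconds into the game
```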

Event Message Types:

| Code | Event Type |
|---|---|
| 1 | Made Field Goal |
| 2 | Missed Field Goal |
| 3 | Free Throw |
| 4 | Rebound |
| 5 | Turnover |
| 6 | Foul |
| 7 | Violation |
| 8 | Substitution |
| 9 | Timeout |
| 10 | Jump Ball |
| 12 | Period Start |
| 13 | Period End |

Fetching Play-by-Play Data:

from nba_api.stats.endpoints import playbyplayv2

def get_game_pbp(game_id):
    """Fetch play-by-play data for a game."""
    pbp = playbyplayv2.PlayByPlayV2(game_id=game_id)
    return pbp.get_data_frames()[0]

# Example: Get play-by-play for a specific game
game_id = '0022300001'
pbp_data = get_game_pbp(game_id)

# Filter for shots
shots = pbp_data[pbp_data['EVENTMSGTYPE'].isin([1, 2])]

# Filter for made three-pointers
made_threes = pbp_data[
    (pbp_data['EVENTMSGTYPE'] == 1) &
    (pbp_data['HOMEDESCRIPTION'].str.contains('3PT', na=False) |
     pbp_data['VISITORDESCRIPTION'].str.contains('3PT', na=False))
]
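
Numeric event codes are hard to read in exploratory work. As a rough sketch, the codes listed under Event Message Types can be mapped to labels and counted; the helper names and the 'Other' fallback for unlisted codes are assumptions made for this example.

```python
import pandas as pd

# Codes copied from the Event Message Types list in this section
EVENT_TYPES = {
    1: 'Made Field Goal', 2: 'Missed Field Goal', 3: 'Free Throw',
    4: 'Rebound', 5: 'Turnover', 6: 'Foul', 7: 'Violation',
    8: 'Substitution', 9: 'Timeout', 10: 'Jump Ball',
    12: 'Period Start', 13: 'Period End',
}


def add_event_labels(pbp):
    """Return a copy of the play-by-play frame with an EVENT_LABEL column."""
    labeled = pbp.copy()
    # Codes missing from EVENT_TYPES are labeled 'Other'
    labeled['EVENT_LABEL'] = (
        labeled['EVENTMSGTYPE'].map(EVENT_TYPES).fillna('Other')
    )
    return labeled


def event_counts(pbp):
    """Count events per label, most frequent first."""
    return add_event_labels(pbp)['EVENT_LABEL'].value_counts()


# Tiny made-up frame standing in for real play-by-play data
demo = pd.DataFrame({'EVENTMSGTYPE': [12, 10, 1, 2, 4, 1, 18]})
print(event_counts(demo))
```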

D.4 Tracking Data Sources

Second Spectrum (Official NBA Tracking)

Second Spectrum became the NBA's official optical tracking provider in the 2017-18 season.

Data Available Through NBA Stats:

- Player speed and distance
- Touches and time of possession
- Contested/uncontested shot classifications
- Closest defender distance
- Catch-and-shoot vs. pull-up classifications

Accessing Tracking Data:

from nba_api.stats.endpoints import (
    leaguedashptstats,           # Player tracking stats
    leaguedashptteamdefend,      # Team defensive tracking
    leaguedashptshotdefend       # Shot defense tracking
)

# Player tracking stats
tracking = leaguedashptstats.LeagueDashPtStats(
    season='2023-24',
    per_mode_simple='PerGame',
    pt_measure_type='SpeedDistance'  # or 'Possessions', 'Passing', etc.
)
tracking_df = tracking.get_data_frames()[0]

Historical Tracking Data

For research purposes, historical tracking data may be available through:

- Academic partnerships with the NBA
- Kaggle datasets from past competitions
- Research data repositories


D.5 Public Datasets

Kaggle Basketball Datasets

Kaggle hosts numerous basketball datasets suitable for analysis and machine learning projects.

Notable Datasets:

| Dataset | Description | Size |
|---|---|---|
| NBA Shot Logs | Shot-level data with locations | ~128K shots |
| NBA Player Stats | Historical player statistics | Multiple seasons |
| March Madness | NCAA tournament data | 1985-present |
| NBA Game Data | Game-level statistics | Multiple seasons |
| NBA Play-by-Play | Event-level game data | Varies |

Accessing Kaggle Datasets:

# Install Kaggle CLI
pip install kaggle

# Download dataset (requires API key setup)
kaggle datasets download -d nathanlauga/nba-games

Working with Kaggle Data:

import pandas as pd
import zipfile

# Extract and load
with zipfile.ZipFile('nba-games.zip', 'r') as z:
    z.extractall('nba_data/')

games = pd.read_csv('nba_data/games.csv')
games_details = pd.read_csv('nba_data/games_details.csv')
players = pd.read_csv('nba_data/players.csv')
teams = pd.read_csv('nba_data/teams.csv')
rankings = pd.read_csv('nba_data/ranking.csv')

Sports Reference Data Exports

Sports Reference sites provide downloadable CSV files for many statistics tables.

Cleaning the Glass

Cleaning the Glass (cleaningtheglass.com) provides advanced team and player statistics with emphasis on four factors analysis.

Data Categories:

- Four Factors (eFG%, TOV%, ORB%, FT Rate)
- Lineup data
- On/Off statistics
- Zone shooting percentages
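
For readers who want to reproduce the four factors from box score totals, the sketch below uses the commonly cited definitions. The function name is hypothetical, the 0.44 free throw weight is an approximation, and Cleaning the Glass may compute its published figures differently, so results will not necessarily match the site.

```python
def four_factors(fgm, fg3m, fga, ftm, fta, tov, orb, opp_drb):
    """Compute the four factors from team box score totals.

    Uses commonly cited definitions; individual sites may use
    slightly different formulas or filters.
    """
    return {
        # Effective FG%: a made three counts as 1.5 made field goals
        'eFG%': (fgm + 0.5 * fg3m) / fga,
        # Turnovers per estimated possession-ending play
        'TOV%': tov / (fga + 0.44 * fta + tov),
        # Share of available offensive rebounds collected
        'ORB%': orb / (orb + opp_drb),
        # Free throws made per field goal attempt
        'FT Rate': ftm / fga,
    }


# Made-up team totals for illustration
factors = four_factors(fgm=40, fg3m=10, fga=80, ftm=20, fta=25,
                       tov=12, orb=10, opp_drb=30)
for name, value in factors.items():
    print(f'{name}: {value:.3f}')
```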


D.6 Draft and Combine Data

NBA Draft Combine

The NBA Draft Combine produces standardized physical and athletic measurements.

Available Measurements:

| Category | Metrics |
|---|---|
| Physical | Height (with/without shoes), Weight, Wingspan, Standing Reach, Body Fat % |
| Athletic | No-Step Vertical, Max Vertical, Lane Agility, 3/4 Court Sprint, Bench Press |

Accessing Combine Data:

from nba_api.stats.endpoints import draftcombinestats

combine = draftcombinestats.DraftCombineStats(season_all_time='2023-24')
measurements = combine.get_data_frames()[0]

College Statistics

College basketball statistics are available from:

- Sports Reference (sports-reference.com/cbb)
- ESPN
- KenPom (kenpom.com): advanced team metrics
- Barttorvik (barttorvik.com): player and team analytics


D.7 Salary and Contract Data

Spotrac

Spotrac (spotrac.com) provides comprehensive salary data including:

- Current salaries and cap hits
- Contract details and options
- Free agent projections
- Cap space calculations

Basketball Reference Contracts

Basketball Reference includes contract information on player pages:

- Salary by season
- Contract type (guaranteed, non-guaranteed)
- Player options and team options

HoopsHype

HoopsHype (hoopshype.com) provides:

- Player salaries
- Team payrolls
- Historical salary data


D.8 Real-Time and Live Data

NBA Live Feed

For applications requiring real-time data:

- ESPN API (unofficial)
- NBA Stats live endpoints
- Third-party providers (Sportradar, Stats Perform)

Considerations for Live Data:

- Rate limiting is critical
- WebSocket connections for streaming
- Data latency varies by source
- Commercial licenses typically required
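
When a live source rejects or drops a request, retrying immediately tends to make rate limiting worse. One common mitigation is exponential backoff with jitter, sketched below. The `call_with_backoff` helper and its default values are assumptions chosen for illustration, not settings recommended by any provider.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying on exceptions with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            # Double the delay on each attempt, capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out retries from concurrent clients
            sleep(delay * random.uniform(0.5, 1.0))
```

Here `func` would be a zero-argument callable that performs one request, for example a `lambda` wrapping a `requests.get` call followed by `raise_for_status()`. A production version would likely retry only on specific errors, such as timeouts or HTTP 429 responses.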


D.9 Data Quality and Cleaning

Common Data Issues

| Issue | Description | Solution |
|---|---|---|
| Missing Values | Games not tracked, injured players | Imputation or exclusion |
| Name Inconsistencies | "LeBron James" vs "L. James" | Standardize using player IDs |
| Team Abbreviations | Different sources use different codes | Create mapping dictionary |
| Date Formats | Varies by source | Parse to datetime objects |
| Duplicate Records | Multiple entries for same event | Deduplicate on unique keys |
| Trade/Waiver Players | Stats split across teams | Aggregate or analyze separately |
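
Three of the fixes listed above (date parsing, team-code mapping, and deduplication) can be combined into one cleaning step. The sketch below assumes a game log frame with hypothetical column names `PLAYER_ID`, `GAME_ID`, `GAME_DATE`, and `TEAM`; the mapping dictionary is a small illustrative sample, not a complete list.

```python
import pandas as pd

# Illustrative mapping only; extend it for the sources being combined
TEAM_CODES = {'BKN': 'BRK', 'PHX': 'PHO', 'CHA': 'CHO'}


def clean_game_logs(df):
    """Apply date parsing, team-code mapping, and deduplication."""
    cleaned = df.copy()
    # Parse date strings to datetime; unparseable values become NaT
    cleaned['GAME_DATE'] = pd.to_datetime(cleaned['GAME_DATE'], errors='coerce')
    # Map known variants to one code, leaving unknown codes unchanged
    cleaned['TEAM'] = cleaned['TEAM'].replace(TEAM_CODES)
    # Keep the first record for each (player, game) key
    cleaned = cleaned.drop_duplicates(subset=['PLAYER_ID', 'GAME_ID'])
    return cleaned.reset_index(drop=True)


# Made-up rows: one exact duplicate and one alternate team code
raw = pd.DataFrame({
    'PLAYER_ID': [2544, 2544, 1],
    'GAME_ID': ['A', 'A', 'B'],
    'GAME_DATE': ['2023-10-24', '2023-10-24', '2023-10-25'],
    'TEAM': ['LAL', 'LAL', 'BKN'],
})
print(clean_game_logs(raw))
```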

Data Validation Checks

def validate_player_stats(df):
    """Validate player statistics DataFrame."""
    checks = []

    # Check for negative values where inappropriate
    numeric_cols = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'MP']
    for col in numeric_cols:
        if col in df.columns:
            negative = (df[col] < 0).sum()
            checks.append(f'{col}: {negative} negative values')

    # Check shooting percentages fall in the 0-1 range (rescale 0-100 data first)
    pct_cols = ['FG_PCT', 'FG3_PCT', 'FT_PCT']
    for col in pct_cols:
        if col in df.columns:
            out_of_range = ((df[col] < 0) | (df[col] > 1)).sum()
            checks.append(f'{col}: {out_of_range} out of range')

    # Check points calculation (FGM already includes threes, so each made three adds one extra point)
    if all(c in df.columns for c in ['PTS', 'FGM', 'FG3M', 'FTM']):
        expected_pts = 2 * df['FGM'] + df['FG3M'] + df['FTM']
        mismatches = (df['PTS'] != expected_pts).sum()
        checks.append(f'Points calculation: {mismatches} mismatches')

    return checks

Standardization Functions

def standardize_team_abbrev(abbrev):
    """Standardize team abbreviations."""
    mapping = {
        'NJN': 'BRK', 'BKN': 'BRK',  # Brooklyn Nets
        'NOH': 'NOP', 'NOK': 'NOP',  # New Orleans
        'CHA': 'CHO', 'CHH': 'CHO',  # Charlotte
        'SEA': 'OKC',                 # Seattle to OKC
        'VAN': 'MEM',                 # Vancouver to Memphis
        'WSB': 'WAS',                 # Washington
    }
    return mapping.get(abbrev, abbrev)

def standardize_player_name(name):
    """Standardize player names for matching."""
    import re
    # Remove suffixes
    name = re.sub(r'\s+(Jr\.|Sr\.|III|IV|II|Jr|Sr)$', '', name, flags=re.IGNORECASE)
    # Remove periods and extra spaces
    name = name.replace('.', '').strip()
    # Standardize case
    name = name.title()
    return name

D.10 Building a Data Pipeline

Sample Data Pipeline Architecture

import logging
import sqlite3
import time

class NBADataPipeline:
    """Pipeline for fetching and storing NBA data."""

    def __init__(self, db_path='nba_data.db'):
        self.db_path = db_path
        self.setup_logging()
        self.create_tables()

    def setup_logging(self):
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)

    def create_tables(self):
        """Create database tables if they don't exist."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        cursor.execute('''
            CREATE TABLE IF NOT EXISTS player_games (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                player_id INTEGER,
                game_id TEXT,
                game_date DATE,
                pts INTEGER,
                reb INTEGER,
                ast INTEGER,
                min REAL,
                updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(player_id, game_id)
            )
        ''')

        conn.commit()
        conn.close()

    def fetch_player_games(self, player_id, season):
        """Fetch player game logs with rate limiting."""
        from nba_api.stats.endpoints import playergamelog

        self.logger.info(f'Fetching games for player {player_id}, season {season}')

        try:
            gamelog = playergamelog.PlayerGameLog(
                player_id=player_id,
                season=season
            )
            return gamelog.get_data_frames()[0]
        except Exception as e:
            self.logger.error(f'Error fetching data: {e}')
            return None
        finally:
            time.sleep(0.6)  # Rate limiting, applied after failures too

    def store_player_games(self, df):
        """Store player game data in database."""
        conn = sqlite3.connect(self.db_path)

        for _, row in df.iterrows():
            try:
                conn.execute('''
                    INSERT OR REPLACE INTO player_games
                    (player_id, game_id, game_date, pts, reb, ast, min)
                    VALUES (?, ?, ?, ?, ?, ?, ?)
                ''', (
                    row['Player_ID'],
                    row['Game_ID'],
                    row['GAME_DATE'],
                    row['PTS'],
                    row['REB'],
                    row['AST'],
                    row['MIN']
                ))
            except Exception as e:
                self.logger.error(f'Error storing row: {e}')

        conn.commit()
        conn.close()

    def run_daily_update(self, player_ids, season):
        """Run daily update for list of players."""
        for player_id in player_ids:
            df = self.fetch_player_games(player_id, season)
            if df is not None and len(df) > 0:
                self.store_player_games(df)
                self.logger.info(f'Stored {len(df)} games for player {player_id}')

# Usage
pipeline = NBADataPipeline()
star_players = [2544, 201566, 203507]  # LeBron, Harden, Giannis
pipeline.run_daily_update(star_players, '2023-24')
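
The pipeline can be rerun safely because the UNIQUE(player_id, game_id) constraint is paired with INSERT OR REPLACE. The sketch below demonstrates that behavior in isolation, using an in-memory database, a reduced version of the table, and made-up point totals.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database for illustration
conn.execute('''
    CREATE TABLE player_games (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        player_id INTEGER,
        game_id TEXT,
        pts INTEGER,
        UNIQUE(player_id, game_id)
    )
''')

# Made-up rows: (player_id, game_id, pts)
rows = [(2544, '0022300001', 21), (2544, '0022300015', 30)]
insert_sql = '''
    INSERT OR REPLACE INTO player_games (player_id, game_id, pts)
    VALUES (?, ?, ?)
'''

# Running the same load twice leaves one row per (player_id, game_id)
for _ in range(2):
    with conn:  # commits on success, rolls back on error
        conn.executemany(insert_sql, rows)

row_count = conn.execute('SELECT COUNT(*) FROM player_games').fetchone()[0]
total_pts = conn.execute(
    'SELECT SUM(pts) FROM player_games WHERE player_id = ?', (2544,)
).fetchone()[0]
print(row_count, total_pts)
```

Note that INSERT OR REPLACE deletes the conflicting row and inserts a new one, so the autoincrement `id` changes on every reload; downstream code should key on (player_id, game_id) rather than `id`.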

D.11 Legal and Ethical Considerations

Terms of Service

When accessing basketball data:

- Review and comply with website Terms of Service
- Respect robots.txt files
- Follow rate limiting guidelines
- Do not redistribute proprietary data

Attribution

When using data in research or publications:

- Credit data sources appropriately
- Follow academic citation standards
- Check licensing requirements for commercial use

Privacy

Be mindful of:

- Player personal information
- Medical/injury data sensitivity
- Social media data collection regulations


D.12 Quick Reference: Data Source URLs

| Source | URL | Data Types |
|---|---|---|
| NBA Stats | stats.nba.com | Current stats, tracking |
| Basketball Reference | basketball-reference.com | Historical stats |
| ESPN | espn.com/nba | News, basic stats |
| Cleaning the Glass | cleaningtheglass.com | Advanced analytics |
| PBP Stats | pbpstats.com | Possession data |
| Kaggle | kaggle.com/datasets | Various datasets |
| Sports Reference | sports-reference.com | Multi-sport |
| Spotrac | spotrac.com | Salary data |
| RealGM | basketball.realgm.com | Transactions, rosters |
| HoopsHype | hoopshype.com | Salary, rumors |
| KenPom | kenpom.com | College analytics |
| Synergy Sports | synergysports.com | Video, play types |

This appendix provides starting points for basketball data acquisition. Always verify current API endpoints and website structures, as these may change over time. For the most up-to-date information, consult official documentation and community resources.