Appendix D: Data Sources and APIs
This appendix provides a comprehensive guide to basketball data sources, APIs, and data acquisition methods. Understanding where and how to obtain data is fundamental to basketball analytics work.
D.1 Official NBA Data Sources
NBA Stats API
The NBA Stats API is the league-run backend behind the statistics pages on NBA.com. It serves play-by-play, box score, and tracking data for current and past seasons, but it is not a documented public API.
Base URL: https://stats.nba.com/stats/
Important Notes:
- The API is undocumented and subject to change without notice
- Requires specific headers to avoid being blocked
- Rate limiting may apply; be respectful with request frequency
- For production use, consider caching responses
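The caching and request-spacing advice above can be sketched as a small wrapper. This is a minimal illustration, not part of any NBA library: the class name ThrottledCache, the injected fetch callable, and the interval value are all assumptions, and the demo uses a stand-in function so no request is sent.

```python
import time


class ThrottledCache:
    """Cache responses by key and space out the calls that miss the cache."""

    def __init__(self, fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch                # callable that performs the real request
        self.min_interval = min_interval  # minimum seconds between real requests
        self.clock = clock
        self.sleep = sleep
        self._cache = {}
        self._last_call = None

    def get(self, key):
        # Serve repeated requests from memory without touching the network
        if key in self._cache:
            return self._cache[key]
        # Wait until min_interval has passed since the previous real request
        if self._last_call is not None:
            wait = self.min_interval - (self.clock() - self._last_call)
            if wait > 0:
                self.sleep(wait)
        value = self.fetch(key)
        self._last_call = self.clock()
        self._cache[key] = value
        return value


# Demo with a stand-in fetch function (hypothetical; no network access)
calls = []

def fake_fetch(key):
    calls.append(key)
    return {'endpoint': key}

client = ThrottledCache(fake_fetch, min_interval=0.05)
first = client.get('playergamelog')
second = client.get('playergamelog')  # served from the cache
third = client.get('teamgamelog')     # waits briefly, then calls fake_fetch
```

In real use, fetch would wrap a requests call, and the in-memory dictionary could be swapped for files or a database so the cache survives restarts.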
Required Headers:
headers = {
    'Host': 'stats.nba.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://stats.nba.com/',
    'Connection': 'keep-alive',
}
Common Endpoints:
| Endpoint | Description | Key Parameters |
|---|---|---|
| leaguegamefinder | Search games by various criteria | TeamID, Season, DateFrom, DateTo |
| playergamelog | Individual player game logs | PlayerID, Season, SeasonType |
| teamgamelog | Team game logs | TeamID, Season, SeasonType |
| boxscoretraditionalv2 | Traditional box score | GameID |
| boxscoreadvancedv2 | Advanced box score | GameID |
| playbyplayv2 | Play-by-play data | GameID |
| shotchartdetail | Shot chart data | PlayerID, TeamID, Season |
| leaguedashplayerstats | League-wide player stats | Season, SeasonType, PerMode |
| leaguedashteamstats | League-wide team stats | Season, SeasonType, PerMode |
| commonplayerinfo | Player biographical info | PlayerID |
| playerprofilev2 | Detailed player profile | PlayerID |
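Each endpoint name is appended to the base URL and its parameters are sent as a query string. A minimal sketch of that composition follows; the helper name build_stats_url is hypothetical, and the parameter names are taken from the table above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://stats.nba.com/stats/'


def build_stats_url(endpoint, **params):
    """Return the full request URL for an endpoint and its query parameters."""
    query = urlencode(params)  # spaces become '+', e.g. 'Regular+Season'
    return f'{BASE_URL}{endpoint}?{query}' if query else f'{BASE_URL}{endpoint}'


url = build_stats_url(
    'playergamelog',
    PlayerID=2544,
    Season='2023-24',
    SeasonType='Regular Season',
)
print(url)
```

Passing a params dictionary to requests.get produces the same query string, so this helper is mainly useful for logging or for building cache keys.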
Example: Fetching Player Game Log
import requests
import pandas as pd

def get_player_gamelog(player_id, season='2023-24'):
    """Fetch player game log from NBA Stats API."""
    url = 'https://stats.nba.com/stats/playergamelog'
    headers = {
        'Host': 'stats.nba.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json',
        'Referer': 'https://stats.nba.com/',
        'Connection': 'keep-alive',
    }
    params = {
        'PlayerID': player_id,
        'Season': season,
        'SeasonType': 'Regular Season'
    }
    # A timeout keeps the call from hanging if the server stops responding
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()  # Raise on HTTP errors instead of parsing an error page
    data = response.json()
    # The first result set holds the game log: column names plus row values
    headers_list = data['resultSets'][0]['headers']
    rows = data['resultSets'][0]['rowSet']
    return pd.DataFrame(rows, columns=headers_list)

# Example usage
lebron_id = 2544
gamelog = get_player_gamelog(lebron_id, '2023-24')
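The response layout used above (a resultSets list whose entries carry headers and rowSet) can also be unpacked without pandas. The sketch below works on a made-up payload; the game IDs and numbers are placeholders, not real data, and the helper name is hypothetical.

```python
def result_set_to_records(payload, index=0):
    """Turn one result set into a list of {column: value} dictionaries."""
    result = payload['resultSets'][index]
    return [dict(zip(result['headers'], row)) for row in result['rowSet']]


# Made-up payload with the same shape as a stats response
sample = {
    'resultSets': [{
        'name': 'PlayerGameLog',
        'headers': ['Game_ID', 'PTS', 'AST'],
        'rowSet': [['0000000001', 30, 8], ['0000000002', 25, 11]],
    }]
}

records = result_set_to_records(sample)
total_points = sum(record['PTS'] for record in records)
```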
nba_api Python Package
The nba_api package provides a convenient Python wrapper for the NBA Stats API.
Installation:
pip install nba_api
Basic Usage:
from nba_api.stats.static import players, teams
from nba_api.stats.endpoints import playergamelog, shotchartdetail
from nba_api.stats.endpoints import leaguedashplayerstats
# Find player ID
player_dict = players.get_players()
lebron = [p for p in player_dict if p['full_name'] == 'LeBron James'][0]
print(f"LeBron James ID: {lebron['id']}")
# Get player game log
gamelog = playergamelog.PlayerGameLog(player_id=lebron['id'], season='2023-24')
df = gamelog.get_data_frames()[0]
# Get shot chart data
shots = shotchartdetail.ShotChartDetail(
    player_id=lebron['id'],
    team_id=0,
    season_nullable='2023-24',
    context_measure_simple='FGA'
)
shot_df = shots.get_data_frames()[0]
# Get league-wide stats
league_stats = leaguedashplayerstats.LeagueDashPlayerStats(
    season='2023-24',
    per_mode_detailed='PerGame'
)
all_players = league_stats.get_data_frames()[0]
Available Endpoints in nba_api:
from nba_api.stats.endpoints import (
    # Player endpoints
    playergamelog,
    playerprofilev2,
    playercareerstats,
    playercompare,
    playerdashboardbygeneralsplits,
    # Team endpoints
    teamgamelog,
    teamyearbyyearstats,
    teamdashboardbygeneralsplits,
    # Game endpoints
    boxscoretraditionalv2,
    boxscoreadvancedv2,
    playbyplayv2,
    # League endpoints
    leaguedashplayerstats,
    leaguedashteamstats,
    leaguestandings,
    # Shot chart
    shotchartdetail,
    shotchartlineupdetail
)
D.2 Basketball Reference
Basketball Reference (basketball-reference.com) is the most comprehensive historical basketball statistics resource.
Website Structure
Basketball Reference organizes data by:
- Players: /players/{first_letter}/{player_id}.html
- Teams: /teams/{team_abbrev}/{year}.html
- Seasons: /leagues/NBA_{year}.html
- Games: /boxscores/{date}0{home_team}.html
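The URL patterns above can be filled in programmatically. The following is a minimal sketch with hypothetical helper names; it assumes the {date} portion of a box score URL is written as YYYYMMDD, which the patterns themselves do not state.

```python
from datetime import date

BASE = 'https://www.basketball-reference.com'


def player_page_url(player_id):
    """Player pages are grouped by the first letter of the player ID."""
    return f'{BASE}/players/{player_id[0]}/{player_id}.html'


def team_season_url(team_abbrev, year):
    """Team season pages use the team abbreviation and the season's ending year."""
    return f'{BASE}/teams/{team_abbrev}/{year}.html'


def boxscore_url(game_date, home_team):
    """Box score pages combine the date (assumed YYYYMMDD), a '0', and the home team."""
    return f'{BASE}/boxscores/{game_date:%Y%m%d}0{home_team}.html'


print(player_page_url('jamesle01'))
print(team_season_url('LAL', 2024))
print(boxscore_url(date(2024, 1, 15), 'LAL'))
```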
Data Categories Available
| Category | Description | URL Pattern |
|---|---|---|
| Per Game Stats | Traditional per-game averages | /leagues/NBA_{year}_per_game.html |
| Per 36 Minutes | Rate stats normalized to 36 min | /leagues/NBA_{year}_per_minute.html |
| Per 100 Poss | Pace-adjusted statistics | /leagues/NBA_{year}_per_poss.html |
| Advanced | PER, TS%, BPM, VORP, etc. | /leagues/NBA_{year}_advanced.html |
| Shooting | Shot breakdown by distance/type | /leagues/NBA_{year}_shooting.html |
| Play-by-Play | On/off stats, usage | /leagues/NBA_{year}_play-by-play.html |
| Totals | Cumulative season totals | /leagues/NBA_{year}_totals.html |
Web Scraping Basketball Reference
Important Guidelines:
- Respect robots.txt and rate limits
- Add delays between requests (3+ seconds recommended)
- Cache responses to avoid repeated requests
- Include proper User-Agent header
- Consider using their data export feature for large datasets
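One way to act on the robots.txt guideline is to check each URL against the site's rules before requesting it. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt; the rules, the bot name, and the example.com URLs are placeholders, not Basketball Reference's actual policy.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up rules for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""

rules = load_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch('my-research-bot', 'https://example.com/leagues/NBA_2024_per_game.html')
blocked = rules.can_fetch('my-research-bot', 'https://example.com/private/report.html')
delay = rules.crawl_delay('my-research-bot')  # seconds to wait between requests, if declared
```

Against a live site, RobotFileParser can load the file itself through set_url() and read(); the declared crawl delay, when present, is a reasonable value to pass to time.sleep between requests.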
Example: Scraping Season Stats
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from io import StringIO

def get_season_stats(year, stat_type='per_game'):
    """
    Scrape season statistics from Basketball Reference.

    Parameters:
    -----------
    year : int
        The ending year of the season (e.g., 2024 for 2023-24)
    stat_type : str
        Type of stats ('per_game', 'per_minute', 'per_poss', 'advanced', 'totals')
    """
    url = f'https://www.basketball-reference.com/leagues/NBA_{year}_{stat_type}.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Stop early on HTTP errors such as 429 (rate limited)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find the stats table
    table = soup.find('table', {'id': f'{stat_type}_stats'})
    if table is None:
        # Fall back to the bare stat type as the table ID (an assumption; IDs change)
        table = soup.find('table', {'id': stat_type})
    if table is None:
        raise ValueError(f'No {stat_type} table found at {url}')
    # Parse table to DataFrame (newer pandas expects a file-like object, not raw HTML)
    df = pd.read_html(StringIO(str(table)))[0]
    # Clean up multi-level headers if present
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.droplevel(0)
    # Remove repeated header rows
    df = df[df['Player'] != 'Player']
    # Reset index
    df = df.reset_index(drop=True)
    return df

# Example usage (add delay between requests)
stats_2024 = get_season_stats(2024, 'per_game')
time.sleep(3)
advanced_2024 = get_season_stats(2024, 'advanced')
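Basketball Reference identifies a season by its ending year (2024), while the NBA Stats API uses a label such as '2023-24'. A small pair of converters, sketched here with hypothetical function names, keeps the two conventions in sync when combining sources.

```python
def season_label(end_year):
    """Convert an ending year (2024) to an NBA Stats season label ('2023-24')."""
    return f'{end_year - 1}-{end_year % 100:02d}'


def season_end_year(label):
    """Convert an NBA Stats season label ('2023-24') to its ending year (2024)."""
    start_year = int(label.split('-')[0])
    return start_year + 1


print(season_label(2024))
print(season_end_year('1999-00'))
```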
Example: Scraping Player Career Stats
from io import StringIO

def get_player_career(player_url):
    """
    Scrape career statistics for a player.

    Parameters:
    -----------
    player_url : str
        Player's Basketball Reference URL path (e.g., 'j/jamesle01')
    """
    url = f'https://www.basketball-reference.com/players/{player_url}.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    # Collect each regular-season table that is present on the page
    tables = {}
    table_ids = ['per_game', 'advanced', 'totals']
    for table_id in table_ids:
        table = soup.find('table', {'id': table_id})
        if table:
            # Wrap the HTML in StringIO; newer pandas expects a file-like object
            df = pd.read_html(StringIO(str(table)))[0]
            tables[table_id] = df
    return tables

# Example usage
lebron_stats = get_player_career('j/jamesle01')
Basketball Reference Data Export
For larger datasets, Basketball Reference offers CSV downloads for premium subscribers. The Sports Reference API (sports-reference.com/api) provides programmatic access.
D.3 Play-by-Play Data
NBA Play-by-Play Structure
Play-by-play data contains event-level information for every action in a game.
Key Fields:
| Field | Description | Example Values |
|---|---|---|
| EVENTNUM | Event sequence number | 1, 2, 3, ... |
| EVENTMSGTYPE | Type of event | 1=Made Shot, 2=Missed Shot, 3=FT, 4=Rebound |
| EVENTMSGACTIONTYPE | Sub-type of event | Layup, Dunk, Jump Shot |
| PERIOD | Game period | 1-4 (regulation), 5+ (overtime) |
| PCTIMESTRING | Time remaining in period | "11:45", "00:24" |
| HOMEDESCRIPTION | Description of home team action | "James 3PT Jump Shot" |
| VISITORDESCRIPTION | Description of away team action | "Curry Turnover" |
| PLAYER1_ID | Primary player involved | 2544 |
| PLAYER2_ID | Secondary player (assists, steals) | 201939 |
| PLAYER3_ID | Tertiary player (blocks) | 203507 |
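Because PCTIMESTRING counts down within each period, ordering or differencing events across a game usually means converting PERIOD and PCTIMESTRING into elapsed game time. A minimal sketch, assuming 12-minute regulation periods and 5-minute overtime periods (the function name is hypothetical):

```python
REGULATION_PERIODS = 4
PERIOD_SECONDS = 12 * 60    # regulation period length
OVERTIME_SECONDS = 5 * 60   # overtime period length


def elapsed_seconds(period, pctimestring):
    """Seconds of game time elapsed at the moment of an event."""
    minutes, seconds = (int(part) for part in pctimestring.split(':'))
    remaining = minutes * 60 + seconds
    if period <= REGULATION_PERIODS:
        return period * PERIOD_SECONDS - remaining
    overtime_number = period - REGULATION_PERIODS
    regulation_total = REGULATION_PERIODS * PERIOD_SECONDS
    return regulation_total + overtime_number * OVERTIME_SECONDS - remaining


print(elapsed_seconds(1, '11:45'))  # 15 seconds into the game
print(elapsed_seconds(5, '0:24'))   # late in the first overtime
```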
Event Message Types:
| Code | Event Type |
|---|---|
| 1 | Made Field Goal |
| 2 | Missed Field Goal |
| 3 | Free Throw |
| 4 | Rebound |
| 5 | Turnover |
| 6 | Foul |
| 7 | Violation |
| 8 | Substitution |
| 9 | Timeout |
| 10 | Jump Ball |
| 12 | Period Start |
| 13 | Period End |
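The code table translates directly into a lookup dictionary, which makes event summaries readable. The sketch below tallies a made-up sequence of EVENTMSGTYPE codes; codes missing from the table are reported as unknown rather than guessed.

```python
from collections import Counter

# Codes and labels copied from the table above
EVENT_TYPES = {
    1: 'Made Field Goal',
    2: 'Missed Field Goal',
    3: 'Free Throw',
    4: 'Rebound',
    5: 'Turnover',
    6: 'Foul',
    7: 'Violation',
    8: 'Substitution',
    9: 'Timeout',
    10: 'Jump Ball',
    12: 'Period Start',
    13: 'Period End',
}


def count_event_types(codes):
    """Count events by label, keeping unlisted codes visible."""
    return Counter(EVENT_TYPES.get(code, f'Unknown ({code})') for code in codes)


# Made-up sequence of codes for illustration
sample_codes = [12, 10, 2, 4, 1, 6, 3, 3, 13]
counts = count_event_types(sample_codes)
```

With a play-by-play DataFrame, the same dictionary can be applied through the EVENTMSGTYPE column's map method.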
Fetching Play-by-Play Data:
from nba_api.stats.endpoints import playbyplayv2

def get_game_pbp(game_id):
    """Fetch play-by-play data for a game."""
    pbp = playbyplayv2.PlayByPlayV2(game_id=game_id)
    return pbp.get_data_frames()[0]

# Example: Get play-by-play for a specific game
game_id = '0022300001'
pbp_data = get_game_pbp(game_id)
# Filter for shots
shots = pbp_data[pbp_data['EVENTMSGTYPE'].isin([1, 2])]
# Filter for made three-pointers
made_threes = pbp_data[
    (pbp_data['EVENTMSGTYPE'] == 1) &
    (pbp_data['HOMEDESCRIPTION'].str.contains('3PT', na=False) |
     pbp_data['VISITORDESCRIPTION'].str.contains('3PT', na=False))
]
D.4 Tracking Data Sources
Second Spectrum (Official NBA Tracking)
Second Spectrum became the NBA's official optical tracking provider in the 2017-18 season.
Data Available Through NBA Stats:
- Player speed and distance
- Touches and time of possession
- Contested/uncontested shot classifications
- Closest defender distance
- Catch-and-shoot vs. pull-up classifications
Accessing Tracking Data:
from nba_api.stats.endpoints import (
    leaguedashptstats,       # Player tracking stats
    leaguedashptteamdefend,  # Team defensive tracking
    leaguedashptdefend       # Player defensive tracking
)
# Player tracking stats
tracking = leaguedashptstats.LeagueDashPtStats(
    season='2023-24',
    per_mode_simple='PerGame',
    pt_measure_type='SpeedDistance'  # or 'Possessions', 'Passing', etc.
)
tracking_df = tracking.get_data_frames()[0]
Historical Tracking Data
For research purposes, historical tracking data may be available through:
- Academic partnerships with the NBA
- Kaggle datasets from past competitions
- Research data repositories
D.5 Public Datasets
Kaggle Basketball Datasets
Kaggle hosts numerous basketball datasets suitable for analysis and machine learning projects.
Notable Datasets:
| Dataset | Description | Size |
|---|---|---|
| NBA Shot Logs | Shot-level data with locations | ~128K shots |
| NBA Player Stats | Historical player statistics | Multiple seasons |
| March Madness | NCAA tournament data | 1985-present |
| NBA Game Data | Game-level statistics | Multiple seasons |
| NBA Play-by-Play | Event-level game data | Varies |
Accessing Kaggle Datasets:
# Install Kaggle CLI
pip install kaggle
# Download dataset (requires API key setup)
kaggle datasets download -d nathanlauga/nba-games
Working with Kaggle Data:
import pandas as pd
import zipfile
# Extract and load
with zipfile.ZipFile('nba-games.zip', 'r') as z:
    z.extractall('nba_data/')
games = pd.read_csv('nba_data/games.csv')
games_details = pd.read_csv('nba_data/games_details.csv')
players = pd.read_csv('nba_data/players.csv')
teams = pd.read_csv('nba_data/teams.csv')
rankings = pd.read_csv('nba_data/ranking.csv')
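When only one table is needed, a CSV can be read straight from the archive without extracting everything to disk. The sketch below uses only the standard library and builds a tiny archive in memory so it runs without a download; the file name and columns in the demo are placeholders, not the real dataset's schema.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, member):
    """Return the rows of one CSV inside a zip archive as a list of dicts."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
            return list(csv.DictReader(text))


# Build a small in-memory archive for demonstration (placeholder data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('teams.csv', 'TEAM_ID,ABBREVIATION\n1,AAA\n2,BBB\n')
buffer.seek(0)

rows = read_csv_from_zip(buffer, 'teams.csv')
```

The same function accepts a path such as 'nba-games.zip' in place of the in-memory buffer.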
Sports Reference Data Exports
Sports Reference sites provide downloadable CSV files for many statistics tables.
Cleaning the Glass
Cleaning the Glass (cleaningtheglass.com) provides advanced team and player statistics with emphasis on four factors analysis.
Data Categories:
- Four Factors (eFG%, TOV%, ORB%, FT Rate)
- Lineup data
- On/Off statistics
- Zone shooting percentages
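For readers who want to reproduce the Four Factors from box score totals, the sketch below uses one commonly cited set of formulas. These are assumptions for illustration: Cleaning the Glass may define or filter its versions differently (for example, FT Rate is sometimes computed from attempts rather than makes), and the input numbers are made up.

```python
def four_factors(fgm, fga, fg3m, fta, ftm, tov, orb, opp_drb):
    """Compute the Four Factors from team box score totals."""
    return {
        # Effective FG%: three-pointers count as 1.5 made field goals
        'eFG%': (fgm + 0.5 * fg3m) / fga,
        # Turnovers per estimated play; 0.44 approximates plays ended by free throws
        'TOV%': tov / (fga + 0.44 * fta + tov),
        # Share of available offensive rebounds that the team collected
        'ORB%': orb / (orb + opp_drb),
        # Free throws made per field goal attempt
        'FT Rate': ftm / fga,
    }


# Made-up team totals for a single game
team = four_factors(fgm=40, fga=80, fg3m=10, fta=25, ftm=20, tov=10, orb=10, opp_drb=30)
for name, value in team.items():
    print(f'{name}: {value:.3f}')
```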
D.6 Draft and Combine Data
NBA Draft Combine
The NBA Draft Combine produces standardized physical and athletic measurements.
Available Measurements:
| Category | Metrics |
|---|---|
| Physical | Height (with/without shoes), Weight, Wingspan, Standing Reach, Body Fat % |
| Athletic | No-Step Vertical, Max Vertical, Lane Agility, 3/4 Court Sprint, Bench Press |
Accessing Combine Data:
from nba_api.stats.endpoints import draftcombinestats
combine = draftcombinestats.DraftCombineStats(season_all_time='2023-24')
measurements = combine.get_data_frames()[0]
College Statistics
College basketball statistics are available from:
- Sports Reference (sports-reference.com/cbb)
- ESPN
- KenPom (kenpom.com): advanced team metrics
- Barttorvik (barttorvik.com): player and team analytics
D.7 Salary and Contract Data
Spotrac
Spotrac (spotrac.com) provides comprehensive salary data including:
- Current salaries and cap hits
- Contract details and options
- Free agent projections
- Cap space calculations
Basketball Reference Contracts
Basketball Reference includes contract information on player pages:
- Salary by season
- Contract type (guaranteed, non-guaranteed)
- Player options and team options
HoopsHype
HoopsHype (hoopshype.com) provides:
- Player salaries
- Team payrolls
- Historical salary data
D.8 Real-Time and Live Data
NBA Live Feed
For applications requiring real-time data:
- ESPN API (unofficial)
- NBA Stats live endpoints
- Third-party providers (Sportradar, Stats Perform)
Considerations for Live Data:
- Rate limiting is critical
- WebSocket connections for streaming
- Data latency varies by source
- Commercial licenses typically required
D.9 Data Quality and Cleaning
Common Data Issues
| Issue | Description | Solution |
|---|---|---|
| Missing Values | Games not tracked, injured players | Imputation or exclusion |
| Name Inconsistencies | "LeBron James" vs "L. James" | Standardize using player IDs |
| Team Abbreviations | Different sources use different codes | Create mapping dictionary |
| Date Formats | Varies by source | Parse to datetime objects |
| Duplicate Records | Multiple entries for same event | Deduplicate on unique keys |
| Trade/Waiver Players | Stats split across teams | Aggregate or analyze separately |
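Two of the fixes in the table, date parsing and deduplication, can be sketched with the standard library alone. The list of date formats and the key fields below are assumptions to adapt to the sources actually being combined, and the records are made up.

```python
from datetime import datetime

# Formats tried in order; extend this tuple for other sources (assumed list)
DATE_FORMATS = ('%Y-%m-%d', '%b %d, %Y', '%m/%d/%Y')


def parse_game_date(text):
    """Parse a date string in any of the known formats into a date object."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f'Unrecognized date format: {text!r}')


def deduplicate(records, key_fields=('player_id', 'game_id')):
    """Keep the last record seen for each unique key."""
    unique = {}
    for record in records:
        unique[tuple(record[field] for field in key_fields)] = record
    return list(unique.values())


# Made-up records: the first two describe the same player-game
raw = [
    {'player_id': 1, 'game_id': 'A', 'pts': 10},
    {'player_id': 1, 'game_id': 'A', 'pts': 12},
    {'player_id': 2, 'game_id': 'A', 'pts': 7},
]
clean = deduplicate(raw)
```

Keeping the last record assumes later rows are corrections; when the opposite is true, skip keys that are already present instead of overwriting them.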
Data Validation Checks
def validate_player_stats(df):
    """Validate player statistics DataFrame."""
    checks = []
    # Check for negative values where inappropriate
    numeric_cols = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'MP']
    for col in numeric_cols:
        if col in df.columns:
            negative = (df[col] < 0).sum()
            checks.append(f'{col}: {negative} negative values')
    # Check shooting percentages fall in 0-1 (rescale sources that report 0-100 first)
    pct_cols = ['FG_PCT', 'FG3_PCT', 'FT_PCT']
    for col in pct_cols:
        if col in df.columns:
            out_of_range = ((df[col] < 0) | (df[col] > 1)).sum()
            checks.append(f'{col}: {out_of_range} out of range')
    # Check points calculation: FGM already includes threes, so each three adds one extra point
    if all(c in df.columns for c in ['PTS', 'FGM', 'FG3M', 'FTM']):
        expected_pts = 2 * df['FGM'] + df['FG3M'] + df['FTM']
        mismatches = (df['PTS'] != expected_pts).sum()
        checks.append(f'Points calculation: {mismatches} mismatches')
    return checks
Standardization Functions
import re

def standardize_team_abbrev(abbrev):
    """Standardize team abbreviations."""
    mapping = {
        'NJN': 'BRK', 'BKN': 'BRK',  # Brooklyn Nets
        'NOH': 'NOP', 'NOK': 'NOP',  # New Orleans
        'CHA': 'CHO', 'CHH': 'CHO',  # Charlotte
        'SEA': 'OKC',                # Seattle to OKC
        'VAN': 'MEM',                # Vancouver to Memphis
        'WSB': 'WAS',                # Washington
    }
    return mapping.get(abbrev, abbrev)

def standardize_player_name(name):
    """Standardize player names for matching."""
    # Remove suffixes
    name = re.sub(r'\s+(Jr\.|Sr\.|III|IV|II|Jr|Sr)$', '', name, flags=re.IGNORECASE)
    # Remove periods and extra spaces
    name = name.replace('.', '').strip()
    # Standardize case
    name = name.title()
    return name
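Names with accented characters or apostrophes often differ between sources. As a complement to the function above, the sketch below builds an accent-insensitive matching key with the standard unicodedata module. The function name name_key is hypothetical, and a key like this is meant for joining tables, not for display.

```python
import re
import unicodedata

SUFFIX_PATTERN = re.compile(r'\s+(jr|sr|ii|iii|iv)$')


def name_key(name):
    """Build a lowercase, accent-free key for matching player names."""
    # Split accented letters into base letter + combining mark, then drop the marks
    decomposed = unicodedata.normalize('NFKD', name)
    ascii_name = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, remove periods and apostrophes, collapse repeated whitespace
    cleaned = re.sub(r"[.']", '', ascii_name.lower())
    cleaned = ' '.join(cleaned.split())
    # Drop generational suffixes so 'Jr.' and no suffix compare equal
    return SUFFIX_PATTERN.sub('', cleaned)


print(name_key('Luka Dončić'))
print(name_key('Jaren Jackson Jr.'))
```

Player IDs remain the more reliable join key whenever both sources provide them.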
D.10 Building a Data Pipeline
Sample Data Pipeline Architecture
import requests
import pandas as pd
from datetime import datetime, timedelta
import time
import sqlite3
import logging

class NBADataPipeline:
    """Pipeline for fetching and storing NBA data."""

    def __init__(self, db_path='nba_data.db'):
        self.db_path = db_path
        self.setup_logging()
        self.create_tables()

    def setup_logging(self):
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)

    def create_tables(self):
        """Create database tables if they don't exist."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS player_games (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                player_id INTEGER,
                game_id TEXT,
                game_date DATE,
                pts INTEGER,
                reb INTEGER,
                ast INTEGER,
                min REAL,
                updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(player_id, game_id)
            )
        ''')
        conn.commit()
        conn.close()

    def fetch_player_games(self, player_id, season):
        """Fetch player game logs with rate limiting."""
        from nba_api.stats.endpoints import playergamelog
        self.logger.info(f'Fetching games for player {player_id}, season {season}')
        try:
            gamelog = playergamelog.PlayerGameLog(
                player_id=player_id,
                season=season
            )
            df = gamelog.get_data_frames()[0]
            time.sleep(0.6)  # Rate limiting
            return df
        except Exception as e:
            self.logger.error(f'Error fetching data: {e}')
            return None

    def store_player_games(self, df):
        """Store player game data in database."""
        conn = sqlite3.connect(self.db_path)
        for _, row in df.iterrows():
            try:
                conn.execute('''
                    INSERT OR REPLACE INTO player_games
                    (player_id, game_id, game_date, pts, reb, ast, min)
                    VALUES (?, ?, ?, ?, ?, ?, ?)
                ''', (
                    row['Player_ID'],
                    row['Game_ID'],
                    row['GAME_DATE'],
                    row['PTS'],
                    row['REB'],
                    row['AST'],
                    row['MIN']
                ))
            except Exception as e:
                self.logger.error(f'Error storing row: {e}')
        conn.commit()
        conn.close()

    def run_daily_update(self, player_ids, season):
        """Run daily update for list of players."""
        for player_id in player_ids:
            df = self.fetch_player_games(player_id, season)
            if df is not None and len(df) > 0:
                self.store_player_games(df)
                self.logger.info(f'Stored {len(df)} games for player {player_id}')
# Usage
pipeline = NBADataPipeline()
star_players = [2544, 201935, 203507]  # LeBron, Harden, Giannis
pipeline.run_daily_update(star_players, '2023-24')
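The pipeline above logs a failed fetch and moves on. For transient network errors, retrying with a growing delay is a common alternative. The following is a minimal sketch, not part of the pipeline class: the function name, attempt count, and delays are assumptions, and the demo uses a stand-in function that fails twice before succeeding.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            # Give up and re-raise once the final attempt has failed
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in function (hypothetical; no network access)
attempts = {'count': 0}

def flaky_fetch():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'

delays = []
result = fetch_with_retry(flaky_fetch, sleep=delays.append)  # record delays instead of sleeping
```

Inside the pipeline, the call to the game log endpoint could be wrapped in a small function and passed as fetch, leaving the rest of the class unchanged.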
D.11 Legal and Ethical Considerations
Terms of Service
When accessing basketball data:
- Review and comply with website Terms of Service
- Respect robots.txt files
- Follow rate limiting guidelines
- Do not redistribute proprietary data
Attribution
When using data in research or publications:
- Credit data sources appropriately
- Follow academic citation standards
- Check licensing requirements for commercial use
Privacy
Be mindful of:
- Player personal information
- Medical/injury data sensitivity
- Social media data collection regulations
D.12 Quick Reference: Data Source URLs
| Source | URL | Data Types |
|---|---|---|
| NBA Stats | stats.nba.com | Current stats, tracking |
| Basketball Reference | basketball-reference.com | Historical stats |
| ESPN | espn.com/nba | News, basic stats |
| Cleaning the Glass | cleaningtheglass.com | Advanced analytics |
| PBP Stats | pbpstats.com | Possession data |
| Kaggle | kaggle.com/datasets | Various datasets |
| Sports Reference | sports-reference.com | Multi-sport |
| Spotrac | spotrac.com | Salary data |
| RealGM | basketball.realgm.com | Transactions, rosters |
| HoopsHype | hoopshype.com | Salary, rumors |
| KenPom | kenpom.com | College analytics |
| Synergy Sports | synergysports.com | Video, play types |
This appendix provides starting points for basketball data acquisition. Always verify current API endpoints and website structures, as these may change over time. For the most up-to-date information, consult official documentation and community resources.