Understanding Basketball Data Types
Nov 27, 2025
# Understanding Basketball Data
## Types of NBA Data
### 1. Box Score Data
Box scores provide traditional and advanced statistics for players and teams in each game.
**What's Included:**
- Player statistics (points, rebounds, assists, steals, blocks)
- Shooting percentages (FG%, 3P%, FT%)
- Plus/minus ratings
- Team totals and opponent stats
- Game metadata (date, location, outcome)
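The shooting percentages listed above are simple ratios of makes to attempts. A minimal sketch, using a hypothetical stat line (the numbers and the `shooting_pct` helper are illustrative, not part of any library), shows how they and the point total follow from the raw counts:

```python
def shooting_pct(made, attempted):
    """Return made/attempted as a fraction, or None when there were no attempts."""
    if attempted == 0:
        return None
    return made / attempted

# Hypothetical single-game stat line (illustrative numbers only)
line = {'FGM': 9, 'FGA': 18, 'FG3M': 2, 'FG3A': 6, 'FTM': 5, 'FTA': 5}

fg_pct = shooting_pct(line['FGM'], line['FGA'])
fg3_pct = shooting_pct(line['FG3M'], line['FG3A'])
ft_pct = shooting_pct(line['FTM'], line['FTA'])
# Points follow from the makes: 2 per field goal, 1 extra per three, 1 per free throw
points = 2 * line['FGM'] + line['FG3M'] + line['FTM']
print(fg_pct, fg3_pct, ft_pct, points)
```

Returning `None` for zero attempts keeps a player who never shot from being reported as a 0% shooter.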
**Python Example - Fetching Box Scores:**
```python
from nba_api.stats.endpoints import boxscoretraditionalv2
import pandas as pd
# Get box score for a specific game
game_id = '0022100001'
boxscore = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id=game_id)
# Extract player stats
player_stats = boxscore.get_data_frames()[0]
print(player_stats[['PLAYER_NAME', 'PTS', 'REB', 'AST', 'FG_PCT']])
# Extract team stats
team_stats = boxscore.get_data_frames()[1]
print(team_stats[['TEAM_NAME', 'PTS', 'FG_PCT', 'FG3_PCT']])
```
**R Example - Box Score Analysis:**
```r
library(nbastatR)
library(dplyr)
# Get player game logs for one season
box_scores <- game_logs(
  seasons = 2023,
  result_types = "player",
  season_types = "Regular Season"
)
# Analyze player performance
top_scorers <- box_scores %>%
  group_by(namePlayer) %>%
  summarise(
    games = n(),
    avg_pts = mean(pts, na.rm = TRUE),
    avg_reb = mean(treb, na.rm = TRUE),
    avg_ast = mean(ast, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_pts)) %>%
  head(10)
print(top_scorers)
```
### 2. Play-by-Play Data
Play-by-play data captures every event that occurs during a game with timestamps and context.
**What's Included:**
- Event types (shot, rebound, turnover, foul, substitution)
- Event timestamps (game clock, period)
- Players involved in each event
- Shot locations and descriptions
- Score after each event
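Because each event carries a period number and a countdown clock, ordering or windowing events usually starts by converting the pair into elapsed game time. The sketch below assumes a `'MM:SS'` clock string showing time remaining, 12-minute quarters, and 5-minute overtimes; `elapsed_seconds` is a hypothetical helper name:

```python
def elapsed_seconds(period, clock):
    """Seconds elapsed since tip-off for an event at `clock` ('MM:SS' remaining) in `period`."""
    minutes, seconds = clock.split(':')
    remaining = int(minutes) * 60 + int(seconds)
    if period <= 4:
        # Regulation quarters are 12 minutes (720 seconds) each
        period_start = (period - 1) * 720
        period_length = 720
    else:
        # Overtime periods are 5 minutes (300 seconds) each
        period_start = 4 * 720 + (period - 5) * 300
        period_length = 300
    return period_start + (period_length - remaining)

print(elapsed_seconds(1, '12:00'))  # opening tip
print(elapsed_seconds(2, '6:30'))   # midway through the second quarter
print(elapsed_seconds(5, '5:00'))   # start of the first overtime
```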
**Python Example - Play-by-Play Analysis:**
```python
from nba_api.stats.endpoints import playbyplayv2
import pandas as pd
# Get play-by-play data
game_id = '0022100001'
pbp = playbyplayv2.PlayByPlayV2(game_id=game_id)
plays = pbp.get_data_frames()[0]
# Filter for made shots
shots_made = plays[plays['EVENTMSGTYPE'] == 1] # Made shots
# Analyze shot distribution
print(shots_made['HOMEDESCRIPTION'].value_counts())
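# Find scoring runs
# SCOREMARGIN is a string column: 'TIE' when level, empty on non-scoring events
margin = pd.to_numeric(plays['SCOREMARGIN'].replace('TIE', '0'), errors='coerce')
plays['SCORE_DIFF'] = margin.ffill().fillna(0)
# Change in margin over the previous 20 events (the window size is an arbitrary choice;
# a single event can never move the margin by 5 or more)
plays['SCORE_CHANGE'] = plays['SCORE_DIFF'].diff(periods=20)
# Identify momentum swings
momentum_swings = plays[plays['SCORE_CHANGE'].abs() >= 5]
print(momentum_swings[['PCTIMESTRING', 'HOMEDESCRIPTION', 'VISITORDESCRIPTION', 'SCORE_CHANGE']])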
```
**R Example - Event Sequence Analysis:**
```r
library(nbastatR)
library(dplyr)
library(stringr)  # provides str_detect()
# Get play-by-play data
pbp_data <- play_by_play_v2(game_ids = "0022100001")
# Analyze shot types
shot_analysis <- pbp_data %>%
  # Keep field goal attempts: made layups and dunks do not contain "shot",
  # and missed free throws contain "miss", so handle both explicitly
  filter(
    str_detect(tolower(descriptionPlayHome), "shot|layup|dunk|miss"),
    !str_detect(tolower(descriptionPlayHome), "free throw")
  ) %>%
  mutate(
    shot_type = case_when(
      str_detect(descriptionPlayHome, "3PT") ~ "Three Pointer",
      str_detect(descriptionPlayHome, "Layup|Dunk") ~ "At Rim",
      TRUE ~ "Mid Range"
    ),
    made = !str_detect(descriptionPlayHome, "MISS")
  ) %>%
  group_by(shot_type) %>%
  summarise(
    attempts = n(),
    makes = sum(made),
    fg_pct = mean(made) * 100
  )
print(shot_analysis)
```
### 3. Player Tracking Data
Player tracking data comes from optical camera systems installed in NBA arenas (originally SportVU, later Second Spectrum and Hawk-Eye) and captures player movements, speeds, and spatial data.
**What's Included:**
- Player speed and distance traveled
- Touch time and dribbles
- Catch-and-shoot vs pull-up shooting
- Defender distance on shots
- Rebounding tracking (positioning, hustle)
**Python Example - Tracking Data:**
```python
from nba_api.stats.endpoints import playerdashptshotlog
import pandas as pd
import matplotlib.pyplot as plt
# Get shot tracking data for a player
# Assumes the shot-log endpoint is available in your installed nba_api version
player_id = 2544 # LeBron James
shot_log = playerdashptshotlog.PlayerDashPtShotLog(
    player_id=player_id,
    season='2023-24',
    season_type_all_star='Regular Season'
)
shots = shot_log.get_data_frames()[0]
# Analyze shots by defender distance
shots['CLOSE_DEF_DIST_BUCKET'] = pd.cut(
    shots['CLOSE_DEF_DIST'],
    bins=[0, 2, 4, 6, float('inf')],
    labels=['0-2 ft', '2-4 ft', '4-6 ft', '6+ ft'],
    include_lowest=True  # keep shots with a recorded distance of exactly 0
)
# FGM is 1 for a make and 0 for a miss, so sum = makes and count = attempts
defense_impact = shots.groupby('CLOSE_DEF_DIST_BUCKET', observed=False).agg(
    FGM=('FGM', 'sum'),
    FGA=('FGM', 'count')
)
defense_impact['FG_PCT'] = defense_impact['FGM'] / defense_impact['FGA']
print(defense_impact)
# Visualize FG% by defender distance (FG_PCT lives on the aggregated frame, not on shots)
plt.figure(figsize=(10, 6))
defense_impact['FG_PCT'].plot(kind='bar')
plt.title('FG% by Defender Distance')
plt.xlabel('Defender Distance')
plt.ylabel('Field Goal Percentage')
plt.show()
```
**R Example - Speed and Distance Analysis:**
```r
library(nbastatR)
library(dplyr)    # provides %>%, arrange(), desc()
library(ggplot2)
# Get player speed and distance data
speed_data <- players_tracking(
  seasons = 2023,
  measures = "SpeedDistance"
)
# Plot the top players by distance
top_distance <- speed_data %>%
  arrange(desc(distMiles)) %>%
  head(20) %>%
  ggplot(aes(x = reorder(namePlayer, distMiles), y = distMiles)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Players by Distance Traveled",
    x = "Player",
    y = "Miles per Game"
  )
print(top_distance)
```
## Data Schemas
### Box Score Schema
```python
# Typical box score data structure
box_score_schema = {
'game_id': 'str', # Unique game identifier (e.g., '0022100001')
'game_date': 'datetime', # Date of the game
'team_id': 'int', # Team identifier
'team_abbreviation': 'str', # Team code (e.g., 'LAL', 'GSW')
'player_id': 'int', # Player identifier
'player_name': 'str', # Player full name
'start_position': 'str', # Starting position or bench
'minutes': 'float', # Minutes played
'fgm': 'int', # Field goals made
'fga': 'int', # Field goals attempted
'fg_pct': 'float', # Field goal percentage
'fg3m': 'int', # Three-pointers made
'fg3a': 'int', # Three-pointers attempted
'fg3_pct': 'float', # Three-point percentage
'ftm': 'int', # Free throws made
'fta': 'int', # Free throws attempted
'ft_pct': 'float', # Free throw percentage
'oreb': 'int', # Offensive rebounds
'dreb': 'int', # Defensive rebounds
'reb': 'int', # Total rebounds
'ast': 'int', # Assists
'stl': 'int', # Steals
'blk': 'int', # Blocks
'tov': 'int', # Turnovers
'pf': 'int', # Personal fouls
'pts': 'int', # Points
'plus_minus': 'int' # Plus/minus rating
}
```
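A schema like this is most useful when it drives type conversion of raw rows, for example rows read from a CSV export where every value is a string. The following is a minimal sketch under stated assumptions: `CONVERTERS`, `coerce_row`, the sample row, and the `YYYY-MM-DD` date format are all illustrative choices rather than part of any NBA data library.

```python
from datetime import datetime

# Converters keyed by the type names used in the schema
CONVERTERS = {
    'str': str,
    'int': int,
    'float': float,
    'datetime': lambda v: datetime.strptime(v, '%Y-%m-%d'),  # assumed date format
}

def coerce_row(raw, schema):
    """Convert string values in `raw` to the types named in `schema`.

    Empty strings become None; fields missing from the schema are left unchanged.
    """
    typed = {}
    for field, value in raw.items():
        if value == '':
            typed[field] = None
        elif field in schema:
            typed[field] = CONVERTERS[schema[field]](value)
        else:
            typed[field] = value
    return typed

# A small subset of the box score schema, with a hypothetical raw row
schema_subset = {'game_id': 'str', 'game_date': 'datetime', 'pts': 'int', 'fg_pct': 'float'}
raw_row = {'game_id': '0022100001', 'game_date': '2021-10-19', 'pts': '25', 'fg_pct': '0.5'}
row = coerce_row(raw_row, schema_subset)
print(row)
```

Keeping `game_id` as a string matters: converting it to an integer would drop the leading zeros that encode the season type.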
### Play-by-Play Schema
```python
# Play-by-play data structure
pbp_schema = {
'game_id': 'str', # Game identifier
'event_num': 'int', # Sequential event number
'event_msg_type': 'int', # Event type code (1=made shot, 2=miss, etc.)
'event_msg_action_type': 'int', # Detailed action type
'period': 'int', # Quarter/period number
'wc_time_string': 'str', # Wall clock time
'pc_time_string': 'str', # Period clock time (MM:SS)
'home_description': 'str', # Description for home team event
'neutral_description': 'str', # Neutral event description
'visitor_description': 'str', # Description for visitor team event
'score': 'str', # Current score (e.g., '45-42')
'score_margin': 'str', # Point differential
'person1_type': 'int', # Primary person type
'player1_id': 'int', # Primary player ID
'player1_name': 'str', # Primary player name
'player1_team_id': 'int', # Primary player's team
'person2_type': 'int', # Secondary person type
'player2_id': 'int', # Secondary player ID (assist, fouled, etc.)
'player2_name': 'str', # Secondary player name
'player2_team_id': 'int', # Secondary player's team
'person3_type': 'int', # Tertiary person type
'player3_id': 'int', # Tertiary player ID (blocker, etc.)
'player3_name': 'str', # Tertiary player name
'player3_team_id': 'int' # Tertiary player's team
}
# Event type codes
event_types = {
1: 'Made Shot',
2: 'Missed Shot',
3: 'Free Throw',
4: 'Rebound',
5: 'Turnover',
6: 'Foul',
7: 'Violation',
8: 'Substitution',
9: 'Timeout',
10: 'Jump Ball',
11: 'Ejection',
12: 'Start Period',
13: 'End Period'
}
```
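The event codes become readable once they are mapped to labels and tallied. This sketch uses only a subset of the codes and a hypothetical sequence of `event_msg_type` values; `summarize_events` is an illustrative helper name:

```python
from collections import Counter

# Subset of the event type codes, repeated so the sketch runs on its own
EVENT_TYPES = {1: 'Made Shot', 2: 'Missed Shot', 3: 'Free Throw', 4: 'Rebound', 5: 'Turnover'}

def summarize_events(codes):
    """Count events by readable label; unknown codes are grouped as 'Other (<code>)'."""
    labels = (EVENT_TYPES.get(code, f'Other ({code})') for code in codes)
    return Counter(labels)

# Hypothetical sequence of event_msg_type values from one stretch of play
sample_codes = [2, 4, 1, 2, 4, 5, 1, 3, 3, 99]
summary = summarize_events(sample_codes)
for label, count in summary.most_common():
    print(f'{label}: {count}')
```

Grouping unknown codes instead of raising keeps the tally usable when a feed contains codes that the lookup table does not cover.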
### Tracking Data Schema
```python
# Player tracking data structure
tracking_schema = {
'game_id': 'str', # Game identifier
'player_id': 'int', # Player identifier
'team_id': 'int', # Team identifier
'shot_number': 'int', # Shot sequence number
'period': 'int', # Quarter/period
'game_clock': 'str', # Time remaining in period
'shot_clock': 'float', # Shot clock time
'dribbles': 'int', # Number of dribbles before shot
'touch_time': 'float', # Time ball was in possession (seconds)
'shot_dist': 'float', # Distance from basket (feet)
'pts_type': 'int', # Point value (2 or 3)
'shot_result': 'str', # 'Made' or 'Missed'
'closest_defender': 'str', # Name of closest defender
'closest_defender_player_id': 'int', # Defender ID
'close_def_dist': 'float', # Defender distance (feet)
'fgm': 'int', # Field goal made (0 or 1)
'pts': 'int', # Points scored
'player_name': 'str', # Player name
'player_last_team_id': 'int' # Current team ID
}
```
## Player and Game IDs
### Player ID System
The NBA uses unique numeric IDs to identify players across all datasets.
**Finding Player IDs:**
```python
from nba_api.stats.static import players
import pandas as pd
# Get all active players
all_players = players.get_active_players()
player_df = pd.DataFrame(all_players)
# Search for specific player
lebron = players.find_players_by_full_name('LeBron James')
print(f"LeBron James ID: {lebron[0]['id']}") # 2544
# Search by partial name
curry_players = [p for p in all_players if 'Curry' in p['full_name']]
print(curry_players)
# Get all players, active and inactive
all_time_players = players.get_players()
historical_df = pd.DataFrame(all_time_players)
# Find player by ID
player_id = 2544
player_info = [p for p in all_time_players if p['id'] == player_id]
print(player_info)
```
**R Example - Player Lookup:**
```r
library(nbastatR)
library(dplyr)  # provides %>% and filter()
# Get all players
all_players <- nba_players()
# Search for specific player
lebron <- all_players %>%
  filter(namePlayer == "LeBron James")
print(paste("LeBron James ID:", lebron$idPlayer))
# Search by team
lakers_players <- all_players %>%
  filter(slugTeam == "LAL", isActive == TRUE)
print(lakers_players[c("namePlayer", "idPlayer")])
# Named vector mapping player names to IDs
player_dict <- setNames(all_players$idPlayer, all_players$namePlayer)
```
### Game ID System
Game IDs follow a specific format: `[season_type][season_year][game_number]`
**Game ID Format:**
- Position 1-3: Season type (001=preseason, 002=regular season, 003=all-star, 004=playoffs)
- Position 4-5: Season year (last 2 digits of the year in which the season starts)
- Position 6-10: Game number (00001-01230 for the regular season)
**Examples:**
- `0022100001` = First regular season game of the 2021-22 season
- `0042200401` = Playoff game from the 2022-23 season
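The format above can be decoded with plain string slicing. The sketch below is illustrative: `parse_game_id` is a hypothetical helper, and the rule for mapping two-digit years to centuries (46 and above treated as 19xx) is an assumption made for this example.

```python
SEASON_TYPES = {
    '001': 'Preseason',
    '002': 'Regular Season',
    '003': 'All-Star',
    '004': 'Playoffs',
}

def parse_game_id(game_id):
    """Split a 10-character game ID into season type, season label, and game number."""
    if len(game_id) != 10 or not game_id.isdigit():
        raise ValueError(f'Expected a 10-digit string, got {game_id!r}')
    prefix, yy, number = game_id[:3], int(game_id[3:5]), int(game_id[5:])
    # Assumption for this sketch: two-digit years of 46 or more belong to the 1900s
    start_year = 1900 + yy if yy >= 46 else 2000 + yy
    return {
        'season_type': SEASON_TYPES.get(prefix, 'Unknown'),
        'season': f'{start_year}-{(start_year + 1) % 100:02d}',
        'game_number': number,
    }

print(parse_game_id('0022100001'))
print(parse_game_id('0042200401'))
```

Game IDs should be handled as strings throughout; an integer would lose the leading zeros and shift every position.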
**Finding Game IDs:**
```python
from nba_api.stats.endpoints import leaguegamefinder
import pandas as pd
# Find games for a specific team
game_finder = leaguegamefinder.LeagueGameFinder(
    team_id_nullable='1610612747', # Lakers
    season_nullable='2023-24',
    season_type_nullable='Regular Season'
)
games = game_finder.get_data_frames()[0]
print(games[['GAME_ID', 'GAME_DATE', 'MATCHUP', 'WL']])
# Find games by date range
games_by_date = games[
    (games['GAME_DATE'] >= '2024-01-01') &
    (games['GAME_DATE'] <= '2024-01-31')
]
# Extract game numbers
games['GAME_NUMBER'] = games['GAME_ID'].str[-5:].astype(int)
games['SEASON_TYPE'] = games['GAME_ID'].str[:3]
```
**R Example - Game Finder:**
```r
library(nbastatR)
library(dplyr)
# Get games for a season
games_2023 <- game_logs(
  seasons = 2023,
  result_types = "team"
)
# Filter by team
lakers_games <- games_2023 %>%
  filter(slugTeam == "LAL") %>%
  select(idGame, dateGame, slugMatchup, outcomeGame)
print(head(lakers_games))
# Parse game ID components
lakers_games <- lakers_games %>%
  mutate(
    season_type = substr(idGame, 1, 3),
    season_year = substr(idGame, 4, 5),
    game_number = substr(idGame, 6, 10)
  )
```
### Team ID Reference
```python
# Common NBA team IDs
team_ids = {
1610612737: 'ATL', # Atlanta Hawks
1610612738: 'BOS', # Boston Celtics
1610612751: 'BKN', # Brooklyn Nets
1610612766: 'CHA', # Charlotte Hornets
1610612741: 'CHI', # Chicago Bulls
1610612739: 'CLE', # Cleveland Cavaliers
1610612742: 'DAL', # Dallas Mavericks
1610612743: 'DEN', # Denver Nuggets
1610612765: 'DET', # Detroit Pistons
1610612744: 'GSW', # Golden State Warriors
1610612745: 'HOU', # Houston Rockets
1610612754: 'IND', # Indiana Pacers
1610612746: 'LAC', # LA Clippers
1610612747: 'LAL', # Los Angeles Lakers
1610612763: 'MEM', # Memphis Grizzlies
1610612748: 'MIA', # Miami Heat
1610612749: 'MIL', # Milwaukee Bucks
1610612750: 'MIN', # Minnesota Timberwolves
1610612740: 'NOP', # New Orleans Pelicans
1610612752: 'NYK', # New York Knicks
1610612760: 'OKC', # Oklahoma City Thunder
1610612753: 'ORL', # Orlando Magic
1610612755: 'PHI', # Philadelphia 76ers
1610612756: 'PHX', # Phoenix Suns
1610612757: 'POR', # Portland Trail Blazers
1610612758: 'SAC', # Sacramento Kings
1610612759: 'SAS', # San Antonio Spurs
1610612761: 'TOR', # Toronto Raptors
1610612762: 'UTA', # Utah Jazz
1610612764: 'WAS' # Washington Wizards
}
```
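Lookups often run in the other direction, from an abbreviation to the numeric ID that API calls expect. A minimal sketch using a three-team subset of the table (`TEAM_ABBR_TO_ID` and `team_id_for` are illustrative names):

```python
# Subset of the ID table, repeated so the sketch runs on its own
TEAM_IDS = {1610612738: 'BOS', 1610612744: 'GSW', 1610612747: 'LAL'}

# Invert the mapping for abbreviation -> ID lookups
TEAM_ABBR_TO_ID = {abbr: team_id for team_id, abbr in TEAM_IDS.items()}

def team_id_for(abbr):
    """Return the numeric team ID for an abbreviation, ignoring case and whitespace."""
    key = abbr.strip().upper()
    if key not in TEAM_ABBR_TO_ID:
        raise KeyError(f'Unknown team abbreviation: {abbr!r}')
    return TEAM_ABBR_TO_ID[key]

print(team_id_for('lal'))
```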
## Joining Data Sources
### Combining Box Score and Play-by-Play Data
**Python Example:**
```python
from nba_api.stats.endpoints import boxscoretraditionalv2, playbyplayv2
import pandas as pd

def combine_box_pbp(game_id):
    # Get box score
    boxscore = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id=game_id)
    player_stats = boxscore.get_data_frames()[0]
    # Get play-by-play
    pbp = playbyplayv2.PlayByPlayV2(game_id=game_id)
    plays = pbp.get_data_frames()[0]
    # Count events per player from play-by-play
    player_events = plays.groupby('PLAYER1_ID').size().reset_index(name='total_events')
    # Merge datasets
    combined = player_stats.merge(
        player_events,
        left_on='PLAYER_ID',
        right_on='PLAYER1_ID',
        how='left'
    )
    # MIN is typically a 'MM:SS' string (missing for players who did not play),
    # so convert it to decimal minutes before dividing
    mins = combined['MIN'].astype(str).str.extract(r'^(\d+)(?:\.\d+)?:(\d+)$').astype(float)
    combined['MIN'] = mins[0] + mins[1] / 60
    # Calculate events per minute
    combined['events_per_minute'] = combined['total_events'] / combined['MIN']
    return combined[['PLAYER_NAME', 'MIN', 'PTS', 'total_events', 'events_per_minute']]

# Example usage
game_data = combine_box_pbp('0022100001')
print(game_data.sort_values('events_per_minute', ascending=False))
```
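Per-minute rates depend on minutes being numeric. Depending on the source, minutes may arrive as a number, as a `'MM:SS'` string, or be missing for players who did not play. The helpers below are a hedged sketch of one way to normalize them; `minutes_to_float` and `per_minute` are hypothetical names, and treating missing minutes as zero is an assumption.

```python
def minutes_to_float(value):
    """Convert a minutes value to decimal minutes.

    Accepts numbers, 'MM:SS' strings, and None or '' (treated as 0.0).
    """
    if value is None or value == '':
        return 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if ':' in value:
        minutes, seconds = value.split(':')
        return float(minutes) + int(seconds) / 60
    return float(value)

def per_minute(count, minutes):
    """Rate per minute, or None when no minutes were played."""
    played = minutes_to_float(minutes)
    return count / played if played > 0 else None

print(per_minute(42, '35:30'))
print(per_minute(3, None))
```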
**R Example:**
```r
library(nbastatR)
library(dplyr)
combine_game_data <- function(game_id) {
  # Get box score
  box <- box_scores(game_ids = game_id, result_types = "player")
  # Get play-by-play
  pbp <- play_by_play_v2(game_ids = game_id)
  # Count player actions in play-by-play
  player_actions <- pbp %>%
    filter(!is.na(idPlayer1)) %>%
    group_by(idPlayer1) %>%
    summarise(total_actions = n())
  # Join datasets
  combined <- box %>%
    left_join(player_actions, by = c("idPlayer" = "idPlayer1")) %>%
    mutate(actions_per_minute = total_actions / minutes)
  return(combined %>% select(namePlayer, minutes, pts, total_actions, actions_per_minute))
}
# Example usage
game_analysis <- combine_game_data("0022100001")
print(game_analysis %>% arrange(desc(actions_per_minute)))
```
### Multi-Game Player Analysis
**Python Example - Season-Long Analysis:**
```python
from nba_api.stats.endpoints import playergamelog, playerdashboardbyyearoveryear
import pandas as pd
import numpy as np

def analyze_player_season(player_id, season='2023-24'):
    # Get game log
    gamelog = playergamelog.PlayerGameLog(
        player_id=player_id,
        season=season
    )
    games = gamelog.get_data_frames()[0]
    # The log typically lists the newest game first; reverse it so the
    # rolling windows below run forward in time
    games = games.iloc[::-1].reset_index(drop=True)
    # Get year-over-year splits (kept for reference; not used in the summary below)
    dashboard = playerdashboardbyyearoveryear.PlayerDashboardByYearOverYear(
        player_id=player_id,
        season=season
    )
    advanced = dashboard.get_data_frames()[1]
    # Calculate rolling averages
    games['PTS_MA5'] = games['PTS'].rolling(window=5).mean()
    games['AST_MA5'] = games['AST'].rolling(window=5).mean()
    games['REB_MA5'] = games['REB'].rolling(window=5).mean()
    # Label each game relative to the 5-game average
    # (the first four games have no average yet and are labeled 'Cold')
    games['SCORING_TREND'] = np.where(
        games['PTS'] > games['PTS_MA5'], 'Hot', 'Cold'
    )
    # Build the season summary
    summary = {
        'player_id': player_id,
        'games_played': len(games),
        'ppg': games['PTS'].mean(),
        'rpg': games['REB'].mean(),
        'apg': games['AST'].mean(),
        # Season FG% from totals, not the mean of per-game percentages
        'fg_pct': games['FGM'].sum() / games['FGA'].sum(),
        'hot_games': (games['SCORING_TREND'] == 'Hot').sum(),
        'cold_games': (games['SCORING_TREND'] == 'Cold').sum()
    }
    return games, summary

# Analyze multiple players
player_ids = [2544, 201939, 201142] # LeBron, Curry, Durant
results = []
for pid in player_ids:
    games, summary = analyze_player_season(pid)
    results.append(summary)
comparison = pd.DataFrame(results)
print(comparison)
```
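Rolling averages only describe form over time when the games are in chronological order. The sketch below shows the idea with the standard library alone, assuming a hypothetical list of point totals that arrives newest first; `rolling_mean` is an illustrative helper, not a pandas function.

```python
from collections import deque

def rolling_mean(values, window=5):
    """Trailing mean over `window` values; None until the window is full."""
    recent = deque(maxlen=window)
    out = []
    for v in values:
        recent.append(v)
        out.append(sum(recent) / window if len(recent) == window else None)
    return out

# Hypothetical points per game, listed newest first
newest_first = [31, 18, 27, 22, 25, 30, 12]
chronological = newest_first[::-1]  # oldest game first
averages = rolling_mean(chronological, window=5)
print(list(zip(chronological, averages)))
```

Computing the window on the newest-first list would make each average look forward into future games, which is rarely what a form indicator is meant to show.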
**R Example - Multi-Player Comparison:**
```r
library(nbastatR)
library(dplyr)
library(purrr)  # provides map_dfr()
analyze_multiple_players <- function(player_ids, season = 2023) {
  # Get game logs for all players
  all_games <- map_dfr(player_ids, function(pid) {
    games <- game_logs(
      seasons = season,
      result_types = "player",
      player_ids = pid
    )
    return(games)
  })
  # Calculate per-game averages
  # (fg_pct and fg3_pct are means of per-game percentages, not season totals)
  player_summary <- all_games %>%
    group_by(idPlayer, namePlayer) %>%
    summarise(
      games = n(),
      ppg = mean(pts, na.rm = TRUE),
      rpg = mean(treb, na.rm = TRUE),
      apg = mean(ast, na.rm = TRUE),
      fg_pct = mean(pctFG, na.rm = TRUE),
      fg3_pct = mean(pctFG3, na.rm = TRUE),
      efficiency = mean((pts + treb + ast + stl + blk -
        (fga - fgm) - (fta - ftm) - tov), na.rm = TRUE),
      .groups = "drop"
    ) %>%
    arrange(desc(ppg))
  return(player_summary)
}
# Compare players
player_comparison <- analyze_multiple_players(c(2544, 201939, 201142))
print(player_comparison)
```
### Joining Shot Tracking with Box Scores
**Python Example:**
```python
from nba_api.stats.endpoints import playerdashptshotlog, playergamelog
import pandas as pd

def analyze_shooting_context(player_id, season='2023-24'):
    # Get shot-level data
    shot_log = playerdashptshotlog.PlayerDashPtShotLog(
        player_id=player_id,
        season=season,
        season_type_all_star='Regular Season'
    )
    shots = shot_log.get_data_frames()[0]
    # Aggregate by game; the TRACKED_ prefix avoids clashing with box score columns
    game_shooting = shots.groupby('GAME_ID').agg(
        TRACKED_FGM=('FGM', 'sum'),
        TRACKED_FGA=('FGM', 'count'),
        CLOSE_DEF_DIST=('CLOSE_DEF_DIST', 'mean'),
        SHOT_DIST=('SHOT_DIST', 'mean'),
        DRIBBLES=('DRIBBLES', 'mean'),
        TOUCH_TIME=('TOUCH_TIME', 'mean')
    ).reset_index()
    game_shooting['TRACKED_FG_PCT'] = game_shooting['TRACKED_FGM'] / game_shooting['TRACKED_FGA']
    # Get box scores for context
    gamelog = playergamelog.PlayerGameLog(player_id=player_id, season=season)
    box_scores = gamelog.get_data_frames()[0]
    # The game log mixes column-name case (e.g. 'Game_ID'), so normalize before merging
    box_scores.columns = box_scores.columns.str.upper()
    # Merge datasets
    combined = box_scores.merge(
        game_shooting,
        on='GAME_ID',
        how='inner'
    )
    # Analyze relationship between context and performance
    analysis = combined[[
        'GAME_DATE', 'MATCHUP', 'PTS', 'FG_PCT',
        'CLOSE_DEF_DIST', 'SHOT_DIST', 'DRIBBLES', 'TOUCH_TIME'
    ]]
    # Correlation analysis
    correlations = analysis[[
        'PTS', 'FG_PCT', 'CLOSE_DEF_DIST', 'SHOT_DIST', 'DRIBBLES', 'TOUCH_TIME'
    ]].corr()
    return analysis, correlations

# Example usage
shooting_analysis, correlations = analyze_shooting_context(2544)
print("Shooting Context Correlations:")
print(correlations['FG_PCT'].sort_values(ascending=False))
```
### Creating a Master Dataset
**Python Example - Comprehensive Data Pipeline:**
```python
import pandas as pd
from nba_api.stats.endpoints import (
    leaguegamefinder, boxscoretraditionalv2, boxscoreadvancedv2
)

class NBADataIntegrator:
    def __init__(self, season='2023-24'):
        self.season = season
        self.games_df = None
        self.players_df = None
        self.master_df = None

    def fetch_all_games(self, team_id=None):
        """Fetch all games for a season"""
        finder = leaguegamefinder.LeagueGameFinder(
            season_nullable=self.season,
            team_id_nullable=team_id
        )
        self.games_df = finder.get_data_frames()[0]
        return self.games_df

    def fetch_game_details(self, game_id):
        """Fetch detailed data for a single game"""
        # Traditional box score
        trad_box = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id=game_id)
        trad_stats = trad_box.get_data_frames()[0]
        # Advanced box score
        adv_box = boxscoreadvancedv2.BoxScoreAdvancedV2(game_id=game_id)
        adv_stats = adv_box.get_data_frames()[0]
        # Merge traditional and advanced
        combined = trad_stats.merge(
            adv_stats[['PLAYER_ID', 'OFF_RATING', 'DEF_RATING', 'NET_RATING',
                       'AST_PCT', 'REB_PCT', 'TS_PCT', 'USG_PCT']],
            on='PLAYER_ID',
            how='left'
        )
        # MIN is typically a 'MM:SS' string; convert to decimal minutes so it can be summed
        mins = combined['MIN'].astype(str).str.extract(r'^(\d+)(?:\.\d+)?:(\d+)$').astype(float)
        combined['MIN'] = mins[0] + mins[1] / 60
        return combined

    def build_master_dataset(self, game_ids):
        """Build comprehensive dataset from multiple games"""
        all_game_data = []
        for game_id in game_ids:
            try:
                game_data = self.fetch_game_details(game_id)
                all_game_data.append(game_data)
            except Exception as e:
                print(f"Error fetching {game_id}: {e}")
        if not all_game_data:
            raise ValueError("No game data could be fetched")
        self.master_df = pd.concat(all_game_data, ignore_index=True)
        return self.master_df

    def aggregate_player_stats(self):
        """Aggregate statistics by player"""
        if self.master_df is None:
            raise ValueError("Master dataset not built yet")
        player_agg = self.master_df.groupby('PLAYER_ID').agg({
            'PLAYER_NAME': 'first',
            'TEAM_ID': 'first',
            'TEAM_ABBREVIATION': 'first',
            'MIN': 'sum',
            'PTS': 'sum',
            'FGM': 'sum',
            'FGA': 'sum',
            'FG3M': 'sum',
            'FG3A': 'sum',
            'FTM': 'sum',
            'FTA': 'sum',
            'REB': 'sum',
            'AST': 'sum',
            'STL': 'sum',
            'BLK': 'sum',
            'TOV': 'sum',
            'OFF_RATING': 'mean',
            'DEF_RATING': 'mean',
            'NET_RATING': 'mean',
            'TS_PCT': 'mean',
            'USG_PCT': 'mean'
        }).reset_index()
        # Calculate per-game averages; count only games with recorded minutes
        games_played = self.master_df.groupby('PLAYER_ID')['MIN'].count()
        player_agg['GP'] = player_agg['PLAYER_ID'].map(games_played)
        player_agg['MPG'] = player_agg['MIN'] / player_agg['GP']
        player_agg['PPG'] = player_agg['PTS'] / player_agg['GP']
        player_agg['RPG'] = player_agg['REB'] / player_agg['GP']
        player_agg['APG'] = player_agg['AST'] / player_agg['GP']
        return player_agg

# Usage example
integrator = NBADataIntegrator(season='2023-24')
games = integrator.fetch_all_games(team_id='1610612747') # Lakers
game_ids = games['GAME_ID'].unique()[:10] # Limit to 10 games to keep the example quick
master_data = integrator.build_master_dataset(game_ids)
player_stats = integrator.aggregate_player_stats()
print(player_stats.sort_values('PPG', ascending=False))
```
## Best Practices
### 1. Data Quality Checks
```python
def validate_data(df):
    """Validate NBA data quality"""
    issues = []
    # Check for missing values
    missing = df.isnull().sum()
    if missing.any():
        issues.append(f"Missing values: {missing[missing > 0].to_dict()}")
    # Check for duplicate records
    if df.duplicated().any():
        issues.append(f"Duplicate records: {df.duplicated().sum()}")
    # Check for invalid ranges (thresholds assume one row per player per game;
    # season totals will legitimately exceed 100 points)
    if 'PTS' in df.columns:
        if (df['PTS'] < 0).any() or (df['PTS'] > 100).any():
            issues.append("Invalid point values detected")
    if 'FG_PCT' in df.columns:
        if (df['FG_PCT'] < 0).any() or (df['FG_PCT'] > 1).any():
            issues.append("Invalid FG% values detected")
    return issues

# Example usage
validation_results = validate_data(player_stats)
if validation_results:
    print("Data quality issues found:")
    for issue in validation_results:
        print(f" - {issue}")
else:
    print("Data validation passed!")
```
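Range checks catch impossible values, but a stat line can also contradict itself. The following sketch adds a few internal-consistency checks on a single player line; the field names follow the box score schema, while `check_stat_line` and the sample rows are hypothetical.

```python
def check_stat_line(row):
    """Return a list of internal-consistency problems found in one player stat line."""
    problems = []
    # Makes can never exceed attempts
    for made, attempted in (('fgm', 'fga'), ('fg3m', 'fg3a'), ('ftm', 'fta')):
        if row[made] > row[attempted]:
            problems.append(f'{made} exceeds {attempted}')
    # Three-pointers made are a subset of field goals made
    if row['fg3m'] > row['fgm']:
        problems.append('fg3m exceeds fgm')
    if row['oreb'] + row['dreb'] != row['reb']:
        problems.append('oreb + dreb does not equal reb')
    # 2 points per field goal, 1 extra per three, 1 per free throw
    expected_pts = 2 * row['fgm'] + row['fg3m'] + row['ftm']
    if row['pts'] != expected_pts:
        problems.append(f"pts is {row['pts']}, expected {expected_pts}")
    return problems

# Hypothetical stat lines: one consistent, one with two injected errors
good = {'fgm': 9, 'fga': 18, 'fg3m': 2, 'fg3a': 6, 'ftm': 5, 'fta': 5,
        'oreb': 1, 'dreb': 7, 'reb': 8, 'pts': 25}
bad = dict(good, reb=9, pts=27)

print(check_stat_line(good))
print(check_stat_line(bad))
```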
### 2. Handling API Rate Limits
```python
import time
from functools import wraps
from nba_api.stats.endpoints import boxscoretraditionalv2

def rate_limit(delay=0.6):
    """Decorator to add delay between API calls"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(delay)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(delay=0.6)
def fetch_game_data(game_id):
    """Fetch game data with rate limiting"""
    boxscore = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id=game_id)
    return boxscore.get_data_frames()[0]
```
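A fixed delay spaces requests out but does not help when a request fails anyway. One common complement is to retry with exponentially growing waits. This is a minimal sketch, not a recommendation of specific limits: `call_with_retries`, its default values, and the stand-in `flaky_fetch` function are all illustrative, and the `sleep` parameter exists only so the demonstration can record delays instead of actually waiting.

```python
import time

def call_with_retries(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demonstration with a stand-in function that fails twice before succeeding
attempts = {'count': 0}

def flaky_fetch(game_id):
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('simulated timeout')
    return f'data for {game_id}'

delays = []
result = call_with_retries(flaky_fetch, '0022100001', sleep=delays.append)
print(result, delays)
```

In real use the default `sleep=time.sleep` would apply, and catching a narrower exception type than `Exception` is usually preferable once the failure modes are known.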
### 3. Caching Results
```python
import pickle
import os
from datetime import datetime, timedelta

def cache_data(filename, data):
    """Cache data to disk with a timestamp"""
    cache_file = f"cache/{filename}.pkl"
    os.makedirs('cache', exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump({'data': data, 'timestamp': datetime.now()}, f)

def load_cached_data(filename, expire_days=1):
    """Load cached data if not expired"""
    cache_file = f"cache/{filename}.pkl"
    if not os.path.exists(cache_file):
        return None
    with open(cache_file, 'rb') as f:
        cached = pickle.load(f)
    age = datetime.now() - cached['timestamp']
    if age > timedelta(days=expire_days):
        return None
    return cached['data']

# Usage
cached = load_cached_data('player_stats_2023')
if cached is None:
    print("Fetching fresh data...")
    # fetch_player_stats() is a placeholder for your own fetch function
    data = fetch_player_stats()
    cache_data('player_stats_2023', data)
else:
    print("Using cached data")
    data = cached
```
## Summary
Understanding NBA data requires familiarity with:
- Different data types (box scores, play-by-play, tracking)
- Data schemas and structures
- ID systems for players, teams, and games
- Techniques for joining and integrating multiple data sources
With this knowledge, you can build comprehensive basketball analytics pipelines that combine traditional statistics with advanced tracking data.