In This Chapter
- Introduction
- 4.1 Loading and Inspecting NBA Data with Pandas
- 4.2 Data Cleaning and Preprocessing Techniques
- 4.3 Handling Missing Values in Basketball Datasets
- 4.4 Visualizing Distributions
- 4.5 Visualizing Relationships
- 4.6 Time Series Analysis of Player and Team Performance
- 4.7 Shot Chart Creation and Spatial Analysis
- 4.8 Putting It All Together: A Complete EDA Workflow
- Summary
Chapter 4: Exploratory Data Analysis for Basketball
Introduction
Exploratory Data Analysis (EDA) forms the foundation of any successful basketball analytics project. Before building predictive models or deriving actionable insights, analysts must thoroughly understand their data through systematic exploration and visualization. This chapter provides a comprehensive guide to conducting EDA on basketball datasets, with a focus on NBA data and the practical techniques that reveal meaningful patterns in player and team performance.
EDA serves multiple critical purposes in basketball analytics. First, it helps identify data quality issues such as missing values, outliers, and inconsistencies that could compromise downstream analyses. Second, it reveals the underlying structure and distributions of key metrics, informing appropriate statistical methods. Third, and perhaps most importantly, it generates hypotheses about relationships between variables that can guide more rigorous investigation.
The tools we employ throughout this chapter include Python's pandas library for data manipulation, matplotlib and seaborn for visualization, and specialized techniques for spatial analysis of shot data. By the end of this chapter, you will possess the skills to transform raw basketball data into compelling visual narratives that inform decision-making.
4.1 Loading and Inspecting NBA Data with Pandas
4.1.1 Common NBA Data Sources and Formats
NBA data comes in various formats depending on the source. The official NBA Stats API provides JSON responses that require parsing, while historical databases often deliver CSV or Excel files. Play-by-play data may arrive in nested JSON structures that demand careful extraction.
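If you pull from the NBA Stats API rather than a flat file, the JSON payload must be reshaped before analysis. The sketch below assumes the API's common resultSets layout (a headers list plus a rowSet of records) and a response already saved to disk; the file name is illustrative.
import json
import pandas as pd
# Parse a saved NBA Stats API response (resultSets -> headers/rowSet layout)
with open('leaguedashplayerstats.json') as f:
    payload = json.load(f)
result = payload['resultSets'][0]
api_df = pd.DataFrame(result['rowSet'], columns=result['headers'])
print(api_df.head())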
Let us begin by loading a typical NBA player statistics dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set display options for better readability
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
# Load NBA player statistics
# This example assumes a CSV file with per-game averages
player_stats = pd.read_csv('nba_player_stats_2023_24.csv')
# First look at the data
print(f"Dataset shape: {player_stats.shape}")
print(f"Number of players: {player_stats.shape[0]}")
print(f"Number of features: {player_stats.shape[1]}")
4.1.2 Initial Data Inspection
The first step in any EDA process involves understanding what data you have. Pandas provides several methods for this purpose:
# View the first few rows
print(player_stats.head(10))
# View the last few rows
print(player_stats.tail(10))
# Get column names and data types
print(player_stats.info())
# Statistical summary of numerical columns
print(player_stats.describe())
# Check for unique values in categorical columns
print(player_stats['team'].nunique())
print(player_stats['position'].value_counts())
The info() method reveals crucial information about data types and memory usage. Pay particular attention to columns that should be numeric but appear as object types, as this often indicates data quality issues requiring attention.
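For example, a handful of placeholder strings (a hypothetical 'DNP' entry, say) can force an otherwise numeric column to object dtype. A quick coercion check, as sketched below, surfaces how many values would be lost:
# Coerce a suspect column and count entries that fail conversion
# ('DNP' is a hypothetical placeholder that would become NaN)
coerced = pd.to_numeric(player_stats['minutes'], errors='coerce')
newly_missing = coerced.isnull().sum() - player_stats['minutes'].isnull().sum()
print(f"Non-numeric entries in 'minutes': {newly_missing}")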
4.1.3 Understanding Data Types in Basketball Datasets
Basketball datasets contain diverse data types that require appropriate handling:
# Identify column types
numerical_cols = player_stats.select_dtypes(include=[np.number]).columns
categorical_cols = player_stats.select_dtypes(include=['object']).columns
print(f"Numerical columns ({len(numerical_cols)}):")
print(numerical_cols.tolist())
print(f"\nCategorical columns ({len(categorical_cols)}):")
print(categorical_cols.tolist())
# Check for mixed types in supposedly numeric columns
def check_numeric_validity(df, column):
"""Check if a column contains non-numeric values."""
try:
pd.to_numeric(df[column], errors='raise')
return True, []
except ValueError:
non_numeric = df[~df[column].apply(
lambda x: isinstance(x, (int, float)) or
(isinstance(x, str) and x.replace('.', '').replace('-', '').isdigit())
)][column].unique()
return False, non_numeric
# Apply to potentially problematic columns
for col in ['points', 'rebounds', 'assists']:
valid, issues = check_numeric_validity(player_stats, col)
if not valid:
print(f"Column '{col}' has non-numeric values: {issues}")
4.1.4 Creating Derived Features During Inspection
During the inspection phase, you will often identify opportunities to create derived features that enhance analysis:
# Calculate per-minute statistics
player_stats['pts_per_min'] = player_stats['points'] / player_stats['minutes']
player_stats['reb_per_min'] = player_stats['rebounds'] / player_stats['minutes']
player_stats['ast_per_min'] = player_stats['assists'] / player_stats['minutes']
# Calculate efficiency metrics
player_stats['true_shooting_pct'] = (
player_stats['points'] /
(2 * (player_stats['fga'] + 0.44 * player_stats['fta']))
)
# Create position groups
position_mapping = {
'PG': 'Guard', 'SG': 'Guard',
'SF': 'Forward', 'PF': 'Forward',
'C': 'Center'
}
player_stats['position_group'] = player_stats['position'].map(position_mapping)
# Verify the new features
print(player_stats[['player_name', 'pts_per_min', 'true_shooting_pct',
'position_group']].head(10))
4.2 Data Cleaning and Preprocessing Techniques
4.2.1 Identifying Data Quality Issues
Data quality issues in basketball datasets manifest in various forms. Common problems include duplicate records, inconsistent naming conventions, and erroneous values that defy basketball logic:
# Check for duplicate rows
duplicate_count = player_stats.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")
# Check for duplicate player names (which might be legitimate or errors)
name_counts = player_stats['player_name'].value_counts()
potential_duplicates = name_counts[name_counts > 1]
print("Potential duplicate players:")
print(potential_duplicates)
# Identify records with the same player appearing multiple times
# This is valid if a player was traded mid-season
traded_players = player_stats[
player_stats['player_name'].isin(potential_duplicates.index)
].sort_values(['player_name', 'team'])
print(traded_players[['player_name', 'team', 'games_played', 'points']])
4.2.2 Handling Inconsistent Data
Inconsistencies often arise in team names, player names, and categorical variables:
# Standardize team abbreviations
team_standardization = {
'PHX': 'PHO', # Phoenix Suns
'BKN': 'BRK', # Brooklyn Nets
'CHA': 'CHO', # Charlotte Hornets
'NOP': 'NOR', # New Orleans Pelicans (if using older abbreviations)
}
# Apply standardization
player_stats['team'] = player_stats['team'].replace(team_standardization)
# Standardize position labels
position_standardization = {
'Point Guard': 'PG',
'Shooting Guard': 'SG',
'Small Forward': 'SF',
'Power Forward': 'PF',
'Center': 'C',
'G': 'SG', # Generic guard to shooting guard
'F': 'SF', # Generic forward to small forward
}
player_stats['position'] = player_stats['position'].replace(position_standardization)
# Clean player names (remove extra spaces, standardize capitalization)
player_stats['player_name'] = player_stats['player_name'].str.strip()
player_stats['player_name'] = player_stats['player_name'].str.title()
4.2.3 Validating Data Against Basketball Logic
Basketball imposes natural constraints on statistics. Field goal percentage cannot exceed 100%, and a regulation game lasts 48 minutes, so per-game averages above that threshold signal data errors (only overtime can push single-game totals higher):
def validate_basketball_stats(df):
"""Validate statistics against basketball logic constraints."""
issues = []
# Field goal percentage should be between 0 and 1
if 'fg_pct' in df.columns:
invalid_fg = df[(df['fg_pct'] < 0) | (df['fg_pct'] > 1)]
if len(invalid_fg) > 0:
issues.append(f"Invalid FG%: {len(invalid_fg)} records")
# Minutes per game should be between 0 and 48 (regular season)
if 'minutes' in df.columns:
invalid_mins = df[(df['minutes'] < 0) | (df['minutes'] > 48)]
if len(invalid_mins) > 0:
issues.append(f"Invalid minutes: {len(invalid_mins)} records")
# Points should be non-negative
if 'points' in df.columns:
invalid_pts = df[df['points'] < 0]
if len(invalid_pts) > 0:
issues.append(f"Negative points: {len(invalid_pts)} records")
    # Assists are credited on teammates' baskets, so a player's assists have
    # no hard ceiling from their own makes; flag only extreme ratios for review
if 'assists' in df.columns and 'fgm' in df.columns:
suspicious_ast = df[df['assists'] > df['fgm'] * 2]
if len(suspicious_ast) > 0:
issues.append(f"Suspicious assist totals: {len(suspicious_ast)} records")
return issues
# Run validation
validation_issues = validate_basketball_stats(player_stats)
for issue in validation_issues:
print(f"Warning: {issue}")
4.2.4 Data Type Conversions
Proper data type assignment improves both performance and analytical accuracy:
# Convert percentage columns stored as strings (e.g. '45.6%')
percentage_cols = ['fg_pct', 'fg3_pct', 'ft_pct']
for col in percentage_cols:
    if col in player_stats.columns and player_stats[col].dtype == 'object':
        # Strip percentage signs, then rescale to proportions; columns
        # already stored as decimal proportions are left untouched
        player_stats[col] = player_stats[col].str.replace('%', '', regex=False)
        player_stats[col] = pd.to_numeric(player_stats[col], errors='coerce') / 100
# Convert categorical columns to category dtype for efficiency
categorical_columns = ['team', 'position', 'position_group']
for col in categorical_columns:
if col in player_stats.columns:
player_stats[col] = player_stats[col].astype('category')
# Verify conversions
print(player_stats.dtypes)
4.3 Handling Missing Values in Basketball Datasets
4.3.1 Identifying Missing Data Patterns
Missing data in basketball datasets often follows specific patterns. A player with zero three-point attempts will have a missing three-point percentage, which represents a fundamentally different situation than truly missing data:
# Overall missing value summary
missing_summary = pd.DataFrame({
'missing_count': player_stats.isnull().sum(),
'missing_pct': (player_stats.isnull().sum() / len(player_stats) * 100).round(2)
})
missing_summary = missing_summary[missing_summary['missing_count'] > 0]
missing_summary = missing_summary.sort_values('missing_pct', ascending=False)
print(missing_summary)
# Visualize missing data patterns
plt.figure(figsize=(12, 6))
sns.heatmap(player_stats.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Data Pattern Visualization')
plt.xlabel('Features')
plt.tight_layout()
plt.savefig('missing_data_pattern.png', dpi=150)
plt.close()
4.3.2 Understanding Missing Data Mechanisms
Missing data in basketball falls into three categories:
- Missing Completely at Random (MCAR): Data collection errors unrelated to any variables
- Missing at Random (MAR): Missingness depends on observed variables
- Missing Not at Random (MNAR): Missingness depends on the missing value itself
# Analyze whether missing three-point percentage relates to position
def analyze_missing_pattern(df, missing_col, grouping_col):
"""Analyze missing data pattern by a grouping variable."""
missing_by_group = df.groupby(grouping_col)[missing_col].apply(
lambda x: x.isnull().mean() * 100
).round(2)
return missing_by_group
# Three-point percentage missing by position
if 'fg3_pct' in player_stats.columns:
missing_3pt_by_pos = analyze_missing_pattern(
player_stats, 'fg3_pct', 'position'
)
print("Missing 3PT% by Position:")
print(missing_3pt_by_pos)
# Investigate correlation between attempts and missing percentage
# Players with zero attempts have undefined percentages
if 'fg3a' in player_stats.columns and 'fg3_pct' in player_stats.columns:
zero_attempts = player_stats[player_stats['fg3a'] == 0]
print(f"\nPlayers with zero 3PA: {len(zero_attempts)}")
print(f"Of these, missing 3PT%: {zero_attempts['fg3_pct'].isnull().sum()}")
4.3.3 Imputation Strategies for Basketball Data
Different missing data scenarios require different imputation approaches:
def impute_basketball_stats(df):
"""Apply appropriate imputation strategies for basketball statistics."""
df_imputed = df.copy()
# Strategy 1: Zero imputation for counting stats of players who didn't play
counting_stats = ['points', 'rebounds', 'assists', 'steals', 'blocks']
for col in counting_stats:
if col in df_imputed.columns:
# Only impute with zero if games_played is 0 or very low
mask = (df_imputed[col].isnull()) & (df_imputed['games_played'] <= 1)
df_imputed.loc[mask, col] = 0
# Strategy 2: For percentages with zero attempts, set to league average or NaN
# This is a design decision - keeping as NaN is often better for analysis
if 'fg3_pct' in df_imputed.columns and 'fg3a' in df_imputed.columns:
# Set 3PT% to NaN for players with zero attempts (undefined)
df_imputed.loc[df_imputed['fg3a'] == 0, 'fg3_pct'] = np.nan
# Strategy 3: Forward/backward fill for time series data
# (if this were game-by-game data)
# Strategy 4: Group-based imputation for missing demographic data
if 'height' in df_imputed.columns:
# Impute missing height with position-average height
position_height = df_imputed.groupby('position')['height'].transform('mean')
df_imputed['height'] = df_imputed['height'].fillna(position_height)
return df_imputed
player_stats_clean = impute_basketball_stats(player_stats)
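Strategy 3 is only sketched as a comment above because per-game averages have no time dimension. On game-by-game data it would look like the toy example below, which forward-fills within each player; the plus_minus column is hypothetical.
# Toy game-by-game frame demonstrating per-player forward fill
demo = pd.DataFrame({
    'player_id': ['a', 'a', 'a', 'b', 'b'],
    'game_number': [1, 2, 3, 1, 2],
    'plus_minus': [5.0, np.nan, -3.0, np.nan, 7.0],
})
# Gaps inherit each player's most recent observed value, never another player's
demo['plus_minus_filled'] = demo.groupby('player_id')['plus_minus'].ffill()
print(demo)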
4.3.4 Documenting Missing Data Decisions
Creating a clear record of missing data handling is essential for reproducibility:
def create_missing_data_report(df_original, df_cleaned, output_path=None):
"""Generate a comprehensive missing data handling report."""
report = []
report.append("=" * 60)
report.append("MISSING DATA HANDLING REPORT")
report.append("=" * 60)
for col in df_original.columns:
orig_missing = df_original[col].isnull().sum()
clean_missing = df_cleaned[col].isnull().sum()
if orig_missing > 0 or clean_missing > 0:
report.append(f"\nColumn: {col}")
report.append(f" Original missing: {orig_missing} ({orig_missing/len(df_original)*100:.1f}%)")
report.append(f" After cleaning: {clean_missing} ({clean_missing/len(df_cleaned)*100:.1f}%)")
report.append(f" Values imputed: {orig_missing - clean_missing}")
report_text = "\n".join(report)
print(report_text)
if output_path:
with open(output_path, 'w') as f:
f.write(report_text)
return report_text
# Generate report
create_missing_data_report(player_stats, player_stats_clean)
4.4 Visualizing Distributions
4.4.1 Histograms for Basketball Statistics
Histograms reveal the distribution shape of individual statistics, helping identify skewness, multimodality, and outliers:
def plot_stat_distribution(df, column, title=None, bins=30, figsize=(10, 6)):
"""Create a comprehensive histogram with statistical annotations."""
fig, ax = plt.subplots(figsize=figsize)
data = df[column].dropna()
# Create histogram
n, bins_edges, patches = ax.hist(data, bins=bins, edgecolor='black',
alpha=0.7, color='steelblue')
# Add mean and median lines
mean_val = data.mean()
median_val = data.median()
ax.axvline(mean_val, color='red', linestyle='--', linewidth=2,
label=f'Mean: {mean_val:.2f}')
ax.axvline(median_val, color='green', linestyle='-', linewidth=2,
label=f'Median: {median_val:.2f}')
# Add statistical annotations
stats_text = (f'N = {len(data)}\n'
f'Std = {data.std():.2f}\n'
f'Skew = {data.skew():.2f}\n'
f'Kurt = {data.kurtosis():.2f}')
ax.text(0.95, 0.95, stats_text, transform=ax.transAxes,
verticalalignment='top', horizontalalignment='right',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5),
fontsize=10)
ax.set_xlabel(column.replace('_', ' ').title())
ax.set_ylabel('Frequency')
ax.set_title(title or f'Distribution of {column.replace("_", " ").title()}')
ax.legend()
plt.tight_layout()
return fig
# Plot distributions for key statistics
fig = plot_stat_distribution(player_stats_clean, 'points',
'Distribution of Points Per Game (2023-24 Season)')
plt.savefig('ppg_distribution.png', dpi=150)
plt.close()
# Compare distributions across positions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
positions = ['PG', 'SG', 'SF', 'PF', 'C']
for idx, pos in enumerate(positions):
ax = axes[idx // 3, idx % 3]
pos_data = player_stats_clean[player_stats_clean['position'] == pos]['points']
ax.hist(pos_data.dropna(), bins=20, edgecolor='black', alpha=0.7)
ax.set_title(f'{pos}: Points Per Game')
ax.set_xlabel('PPG')
ax.set_ylabel('Frequency')
# Remove empty subplot
axes[1, 2].axis('off')
plt.tight_layout()
plt.savefig('ppg_by_position.png', dpi=150)
plt.close()
4.4.2 Box Plots for Comparative Analysis
Box plots excel at comparing distributions across groups and identifying outliers:
def create_grouped_boxplot(df, value_col, group_col, title=None,
figsize=(12, 6), show_points=True):
"""Create a box plot comparing distributions across groups."""
fig, ax = plt.subplots(figsize=figsize)
# Order groups by median value
group_order = df.groupby(group_col)[value_col].median().sort_values(
ascending=False
).index.tolist()
# Create box plot
bp = ax.boxplot([df[df[group_col] == g][value_col].dropna()
for g in group_order],
labels=group_order, patch_artist=True)
# Color boxes
colors = plt.cm.Set3(np.linspace(0, 1, len(group_order)))
for patch, color in zip(bp['boxes'], colors):
patch.set_facecolor(color)
# Optionally overlay individual points
if show_points:
for idx, group in enumerate(group_order):
y = df[df[group_col] == group][value_col].dropna()
x = np.random.normal(idx + 1, 0.04, size=len(y))
ax.scatter(x, y, alpha=0.3, s=10, color='gray')
ax.set_xlabel(group_col.replace('_', ' ').title())
ax.set_ylabel(value_col.replace('_', ' ').title())
ax.set_title(title or f'{value_col} by {group_col}')
# Add grid for readability
ax.yaxis.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
return fig
# Compare points per game across teams
fig = create_grouped_boxplot(
player_stats_clean[player_stats_clean['minutes'] >= 15], # Filter for rotation players
'points', 'team',
title='Points Per Game by Team (Players with 15+ MPG)'
)
plt.savefig('ppg_by_team_boxplot.png', dpi=150)
plt.close()
# Compare efficiency by position
fig = create_grouped_boxplot(
player_stats_clean[player_stats_clean['minutes'] >= 20],
'true_shooting_pct', 'position',
title='True Shooting Percentage by Position'
)
plt.savefig('ts_pct_by_position.png', dpi=150)
plt.close()
4.4.3 Violin Plots for Distribution Shape
Violin plots combine box plot information with kernel density estimation, revealing distribution shapes that box plots obscure:
def create_violin_comparison(df, value_col, group_col, title=None,
figsize=(12, 6)):
"""Create violin plots for comparing distributions."""
fig, ax = plt.subplots(figsize=figsize)
# Filter out groups with too few observations
group_counts = df[group_col].value_counts()
valid_groups = group_counts[group_counts >= 10].index
df_filtered = df[df[group_col].isin(valid_groups)]
# Create violin plot using seaborn
sns.violinplot(data=df_filtered, x=group_col, y=value_col,
ax=ax, inner='box', palette='Set2')
ax.set_xlabel(group_col.replace('_', ' ').title())
ax.set_ylabel(value_col.replace('_', ' ').title())
ax.set_title(title or f'Distribution of {value_col} by {group_col}')
# Rotate x-axis labels if needed
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
return fig
# Minutes distribution by position
fig = create_violin_comparison(
player_stats_clean, 'minutes', 'position',
title='Minutes Per Game Distribution by Position'
)
plt.savefig('minutes_violin_by_position.png', dpi=150)
plt.close()
# Usage rate distribution
if 'usage_rate' in player_stats_clean.columns:
fig = create_violin_comparison(
player_stats_clean[player_stats_clean['minutes'] >= 15],
'usage_rate', 'position',
title='Usage Rate Distribution by Position (15+ MPG)'
)
plt.savefig('usage_rate_violin.png', dpi=150)
plt.close()
4.4.4 Kernel Density Estimation for Smooth Distributions
KDE plots provide smooth distribution estimates useful for overlaying multiple groups:
def plot_kde_comparison(df, value_col, group_col, groups=None,
title=None, figsize=(10, 6)):
"""Create overlaid KDE plots for group comparison."""
fig, ax = plt.subplots(figsize=figsize)
if groups is None:
groups = df[group_col].unique()
colors = plt.cm.tab10(np.linspace(0, 1, len(groups)))
for group, color in zip(groups, colors):
data = df[df[group_col] == group][value_col].dropna()
if len(data) >= 10: # Require minimum observations
sns.kdeplot(data, ax=ax, label=f'{group} (n={len(data)})',
color=color, linewidth=2)
ax.set_xlabel(value_col.replace('_', ' ').title())
ax.set_ylabel('Density')
ax.set_title(title or f'Distribution of {value_col} by {group_col}')
ax.legend(title=group_col.replace('_', ' ').title())
plt.tight_layout()
return fig
# Compare three-point attempt rates by position
fig = plot_kde_comparison(
player_stats_clean[player_stats_clean['minutes'] >= 20],
'fg3a', 'position',
groups=['PG', 'SG', 'SF', 'PF', 'C'],
title='Three-Point Attempts Per Game by Position'
)
plt.savefig('3pa_kde_by_position.png', dpi=150)
plt.close()
4.5 Visualizing Relationships
4.5.1 Scatter Plots for Bivariate Relationships
Scatter plots reveal relationships between two continuous variables and help identify patterns, clusters, and outliers:
def create_scatter_with_regression(df, x_col, y_col, hue_col=None,
title=None, figsize=(10, 8)):
"""Create scatter plot with optional regression line and grouping."""
fig, ax = plt.subplots(figsize=figsize)
if hue_col:
groups = df[hue_col].unique()
colors = plt.cm.tab10(np.linspace(0, 1, len(groups)))
for group, color in zip(groups, colors):
mask = df[hue_col] == group
ax.scatter(df.loc[mask, x_col], df.loc[mask, y_col],
alpha=0.6, label=group, color=color, s=50)
else:
ax.scatter(df[x_col], df[y_col], alpha=0.6, s=50, color='steelblue')
# Add regression line
from scipy import stats
mask = df[[x_col, y_col]].notna().all(axis=1)
slope, intercept, r_value, p_value, std_err = stats.linregress(
df.loc[mask, x_col], df.loc[mask, y_col]
)
x_line = np.linspace(df[x_col].min(), df[x_col].max(), 100)
y_line = slope * x_line + intercept
ax.plot(x_line, y_line, 'r--', linewidth=2,
label=f'R² = {r_value**2:.3f}')
ax.set_xlabel(x_col.replace('_', ' ').title())
ax.set_ylabel(y_col.replace('_', ' ').title())
ax.set_title(title or f'{y_col} vs {x_col}')
ax.legend()
plt.tight_layout()
return fig
# Points vs Minutes relationship
fig = create_scatter_with_regression(
player_stats_clean[player_stats_clean['minutes'] >= 10],
'minutes', 'points',
title='Points vs Minutes Per Game'
)
plt.savefig('points_vs_minutes.png', dpi=150)
plt.close()
# Points vs Usage Rate by Position
fig = create_scatter_with_regression(
player_stats_clean[player_stats_clean['minutes'] >= 20],
'usage_rate', 'points', hue_col='position',
title='Points vs Usage Rate by Position'
)
plt.savefig('points_vs_usage_by_position.png', dpi=150)
plt.close()
4.5.2 Correlation Matrices and Heatmaps
Correlation matrices provide a comprehensive view of relationships among multiple variables:
def create_correlation_heatmap(df, columns=None, title=None,
figsize=(12, 10), annot=True):
"""Create a correlation heatmap for selected columns."""
if columns is None:
columns = df.select_dtypes(include=[np.number]).columns
# Calculate correlation matrix
corr_matrix = df[columns].corr()
# Create mask for upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
fig, ax = plt.subplots(figsize=figsize)
# Create heatmap
sns.heatmap(corr_matrix, mask=mask, annot=annot, fmt='.2f',
cmap='RdBu_r', center=0, square=True, linewidths=0.5,
ax=ax, vmin=-1, vmax=1,
cbar_kws={'shrink': 0.8, 'label': 'Correlation'})
ax.set_title(title or 'Correlation Matrix', fontsize=14)
# Rotate labels
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
return fig, corr_matrix
# Select relevant columns for correlation analysis
stat_columns = ['points', 'rebounds', 'assists', 'steals', 'blocks',
'fg_pct', 'fg3_pct', 'ft_pct', 'minutes', 'turnovers']
fig, corr = create_correlation_heatmap(
player_stats_clean[player_stats_clean['minutes'] >= 15],
columns=stat_columns,
title='Correlation Matrix of Key Statistics (15+ MPG Players)'
)
plt.savefig('correlation_heatmap.png', dpi=150)
plt.close()
# Identify strongest correlations
def get_top_correlations(corr_matrix, n=10):
"""Extract the strongest correlations from a correlation matrix."""
# Get upper triangle indices
upper_tri = corr_matrix.where(
np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
# Stack and sort
correlations = upper_tri.stack().sort_values(ascending=False)
print(f"Top {n} Positive Correlations:")
print(correlations.head(n))
print(f"\nTop {n} Negative Correlations:")
print(correlations.tail(n))
return correlations
top_corrs = get_top_correlations(corr, n=5)
4.5.3 Pair Plots for Multivariate Exploration
Pair plots create scatter plot matrices that reveal relationships among multiple variables simultaneously:
def create_pair_plot(df, columns, hue_col=None, title=None):
"""Create a pair plot for selected columns."""
# Filter to relevant columns and remove missing values
plot_data = df[columns + ([hue_col] if hue_col else [])].dropna()
# Create pair plot
g = sns.pairplot(plot_data, hue=hue_col, diag_kind='kde',
plot_kws={'alpha': 0.6, 's': 30},
diag_kws={'alpha': 0.7})
if title:
g.fig.suptitle(title, y=1.02)
return g
# Create pair plot for scoring statistics
scoring_cols = ['points', 'fg_pct', 'fg3_pct', 'ft_pct', 'minutes']
g = create_pair_plot(
player_stats_clean[player_stats_clean['minutes'] >= 20],
scoring_cols, hue_col='position_group',
title='Scoring Statistics Relationships by Position Group'
)
plt.savefig('scoring_pairplot.png', dpi=150, bbox_inches='tight')
plt.close()
4.5.4 Advanced Scatter Plot Techniques
Encoding extra dimensions through point size, color, and annotations makes scatter plots considerably more informative:
def create_bubble_chart(df, x_col, y_col, size_col, color_col=None,
label_col=None, title=None, figsize=(12, 8)):
"""Create a bubble chart with optional labeling."""
fig, ax = plt.subplots(figsize=figsize)
# Prepare data
plot_df = df[[x_col, y_col, size_col] +
([color_col] if color_col else []) +
([label_col] if label_col else [])].dropna()
# Normalize size for plotting
size_normalized = (plot_df[size_col] - plot_df[size_col].min()) / \
(plot_df[size_col].max() - plot_df[size_col].min())
sizes = size_normalized * 500 + 50 # Scale to reasonable bubble sizes
# Create scatter plot
if color_col:
scatter = ax.scatter(plot_df[x_col], plot_df[y_col], s=sizes,
c=plot_df[color_col], cmap='viridis',
alpha=0.6, edgecolors='black', linewidth=0.5)
plt.colorbar(scatter, label=color_col.replace('_', ' ').title())
else:
scatter = ax.scatter(plot_df[x_col], plot_df[y_col], s=sizes,
alpha=0.6, edgecolors='black', linewidth=0.5)
# Add labels for top performers
if label_col:
# Label top 10 by y-value
top_players = plot_df.nlargest(10, y_col)
for _, row in top_players.iterrows():
ax.annotate(row[label_col], (row[x_col], row[y_col]),
xytext=(5, 5), textcoords='offset points',
fontsize=8, alpha=0.8)
ax.set_xlabel(x_col.replace('_', ' ').title())
ax.set_ylabel(y_col.replace('_', ' ').title())
ax.set_title(title or f'{y_col} vs {x_col} (size: {size_col})')
plt.tight_layout()
return fig
# Create bubble chart: Points vs Efficiency, sized by minutes
fig = create_bubble_chart(
player_stats_clean[player_stats_clean['minutes'] >= 20],
x_col='true_shooting_pct', y_col='points',
size_col='minutes', color_col='assists',
label_col='player_name',
title='Points vs Efficiency (Bubble Size = Minutes, Color = Assists)'
)
plt.savefig('scoring_bubble_chart.png', dpi=150)
plt.close()
4.6 Time Series Analysis of Player and Team Performance
4.6.1 Loading and Preparing Time Series Data
Game-by-game data enables analysis of performance trends over time:
# Load game log data
game_logs = pd.read_csv('nba_player_game_logs_2023_24.csv')
# Convert date column to datetime
game_logs['game_date'] = pd.to_datetime(game_logs['game_date'])
# Sort by player and date
game_logs = game_logs.sort_values(['player_id', 'game_date'])
# Create game number within season for each player
game_logs['game_number'] = game_logs.groupby('player_id').cumcount() + 1
# Verify the structure
print(game_logs[['player_name', 'game_date', 'game_number', 'points']].head(20))
4.6.2 Rolling Averages for Trend Analysis
Rolling averages smooth out game-to-game variance to reveal underlying trends:
def calculate_rolling_stats(df, player_id, stat_cols, windows=[5, 10, 20]):
"""Calculate rolling averages for a player's statistics."""
player_df = df[df['player_id'] == player_id].copy()
for window in windows:
for col in stat_cols:
player_df[f'{col}_rolling_{window}'] = (
player_df[col].rolling(window=window, min_periods=1).mean()
)
return player_df
def plot_player_trend(df, player_id, stat_col, windows=[5, 10],
title=None, figsize=(14, 6)):
"""Plot a player's performance trend with rolling averages."""
player_data = calculate_rolling_stats(df, player_id, [stat_col], windows)
player_name = player_data['player_name'].iloc[0]
fig, ax = plt.subplots(figsize=figsize)
# Plot raw values
ax.scatter(player_data['game_number'], player_data[stat_col],
alpha=0.4, s=30, color='gray', label='Game Values')
# Plot rolling averages
colors = ['blue', 'red', 'green']
for window, color in zip(windows, colors):
ax.plot(player_data['game_number'],
player_data[f'{stat_col}_rolling_{window}'],
linewidth=2, color=color, label=f'{window}-Game Average')
# Add season average line
season_avg = player_data[stat_col].mean()
ax.axhline(y=season_avg, color='black', linestyle='--',
label=f'Season Average: {season_avg:.1f}')
ax.set_xlabel('Game Number')
ax.set_ylabel(stat_col.replace('_', ' ').title())
ax.set_title(title or f'{player_name} - {stat_col.title()} Trend')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
return fig
# Plot scoring trend for a star player
fig = plot_player_trend(game_logs, player_id='jokic_nikola',
stat_col='points', windows=[5, 10],
title='Nikola Jokic - Points Per Game Trend (2023-24)')
plt.savefig('jokic_scoring_trend.png', dpi=150)
plt.close()
4.6.3 Cumulative Statistics and Pace Analysis
Tracking cumulative statistics reveals how players progress toward milestones:
def plot_cumulative_comparison(df, player_ids, stat_col, title=None,
figsize=(12, 6)):
"""Compare cumulative statistics across multiple players."""
fig, ax = plt.subplots(figsize=figsize)
colors = plt.cm.tab10(np.linspace(0, 1, len(player_ids)))
for player_id, color in zip(player_ids, colors):
player_data = df[df['player_id'] == player_id].copy()
player_data['cumulative'] = player_data[stat_col].cumsum()
player_name = player_data['player_name'].iloc[0]
ax.plot(player_data['game_number'], player_data['cumulative'],
linewidth=2, color=color, label=player_name)
ax.set_xlabel('Game Number')
ax.set_ylabel(f'Cumulative {stat_col.replace("_", " ").title()}')
ax.set_title(title or f'Cumulative {stat_col.title()} Comparison')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
return fig
# Compare cumulative scoring for MVP candidates
mvp_candidates = ['jokic_nikola', 'gilgeous_alexander_shai',
'doncic_luka', 'tatum_jayson']
fig = plot_cumulative_comparison(game_logs, mvp_candidates, 'points',
title='Cumulative Points - MVP Candidates (2023-24)')
plt.savefig('mvp_cumulative_points.png', dpi=150)
plt.close()
4.6.4 Performance Variation Analysis
Understanding performance consistency is as important as average performance:
def analyze_performance_consistency(df, player_ids, stat_col):
"""Analyze performance consistency across players."""
results = []
for player_id in player_ids:
player_data = df[df['player_id'] == player_id]
player_name = player_data['player_name'].iloc[0]
stats = player_data[stat_col]
results.append({
'player_name': player_name,
'games': len(stats),
'mean': stats.mean(),
'median': stats.median(),
'std': stats.std(),
'cv': stats.std() / stats.mean() * 100, # Coefficient of variation
'min': stats.min(),
'max': stats.max(),
'range': stats.max() - stats.min()
})
return pd.DataFrame(results)
consistency_df = analyze_performance_consistency(
game_logs, mvp_candidates, 'points'
)
print(consistency_df.to_string())
# Visualize consistency with box plots
def plot_consistency_comparison(df, player_ids, stat_col, title=None,
figsize=(12, 6)):
"""Create box plots comparing performance consistency."""
fig, ax = plt.subplots(figsize=figsize)
data = []
labels = []
for player_id in player_ids:
player_data = df[df['player_id'] == player_id]
data.append(player_data[stat_col].dropna())
labels.append(player_data['player_name'].iloc[0].split()[-1]) # Last name
bp = ax.boxplot(data, labels=labels, patch_artist=True)
colors = plt.cm.Set2(np.linspace(0, 1, len(player_ids)))
for patch, color in zip(bp['boxes'], colors):
patch.set_facecolor(color)
ax.set_ylabel(stat_col.replace('_', ' ').title())
ax.set_title(title or f'{stat_col.title()} Consistency Comparison')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
return fig
fig = plot_consistency_comparison(game_logs, mvp_candidates, 'points',
title='Scoring Consistency - MVP Candidates')
plt.savefig('mvp_scoring_consistency.png', dpi=150)
plt.close()
4.6.5 Monthly and Segment Analysis
Breaking the season into segments reveals performance patterns:
def analyze_monthly_performance(df, player_id, stat_cols):
"""Analyze player performance by month."""
player_data = df[df['player_id'] == player_id].copy()
player_data['month'] = player_data['game_date'].dt.month
player_data['month_name'] = player_data['game_date'].dt.strftime('%B')
monthly_stats = player_data.groupby(['month', 'month_name'])[stat_cols].agg(
['mean', 'std', 'count']
).round(2)
return monthly_stats
# Monthly breakdown for a star player
monthly_stats = analyze_monthly_performance(
game_logs, 'jokic_nikola', ['points', 'rebounds', 'assists']
)
print(monthly_stats)
def plot_monthly_performance(df, player_id, stat_col, title=None,
figsize=(12, 6)):
"""Visualize monthly performance trends."""
player_data = df[df['player_id'] == player_id].copy()
player_data['month'] = player_data['game_date'].dt.to_period('M')
player_name = player_data['player_name'].iloc[0]
    monthly = player_data.groupby('month')[stat_col].agg(['mean', 'std', 'count'])
    monthly['std'] = monthly['std'].fillna(0)  # single-game months have no std
    monthly.index = monthly.index.astype(str)
fig, ax = plt.subplots(figsize=figsize)
# Bar chart with error bars
x = range(len(monthly))
bars = ax.bar(x, monthly['mean'], yerr=monthly['std'],
capsize=5, alpha=0.7, color='steelblue')
# Add game counts on top of bars
for i, (mean, count) in enumerate(zip(monthly['mean'], monthly['count'])):
ax.text(i, mean + monthly['std'].iloc[i] + 1, f'n={count}',
ha='center', fontsize=9)
ax.set_xticks(x)
ax.set_xticklabels(monthly.index, rotation=45, ha='right')
ax.set_xlabel('Month')
ax.set_ylabel(stat_col.replace('_', ' ').title())
ax.set_title(title or f'{player_name} - Monthly {stat_col.title()}')
plt.tight_layout()
return fig
fig = plot_monthly_performance(game_logs, 'jokic_nikola', 'points',
title='Nikola Jokic - Monthly Scoring Average')
plt.savefig('jokic_monthly_scoring.png', dpi=150)
plt.close()
4.7 Shot Chart Creation and Spatial Analysis
4.7.1 Understanding Shot Location Data
Shot location data provides x-y coordinates for each shot attempt, enabling spatial analysis:
# Load shot data
shots = pd.read_csv('nba_shots_2023_24.csv')
# Inspect shot data structure
print(shots.info())
print(shots[['player_name', 'shot_x', 'shot_y', 'shot_made',
'shot_type', 'shot_distance']].head(10))
# Shot coordinates here are assumed to be in feet from the basket
# NBA court is 94 feet long and 50 feet wide
# Basket center is at (0, 0) with positive y toward half court
print(f"X range: {shots['shot_x'].min()} to {shots['shot_x'].max()}")
print(f"Y range: {shots['shot_y'].min()} to {shots['shot_y'].max()}")
4.7.2 Drawing the Basketball Court
Creating an accurate court representation is essential for shot charts:
def draw_basketball_court(ax=None, color='black', lw=2, outer_lines=False):
"""Draw a basketball half-court on a matplotlib axes."""
if ax is None:
fig, ax = plt.subplots(figsize=(12, 11))
# Create court elements
# Hoop
hoop = plt.Circle((0, 0), radius=0.75, linewidth=lw, color=color,
fill=False)
# Backboard
backboard = plt.Rectangle((-3, -0.75), 6, 0, linewidth=lw, color=color)
# Paint (outer box)
outer_box = plt.Rectangle((-8, -5.25), 16, 19, linewidth=lw,
color=color, fill=False)
# Paint (inner box)
inner_box = plt.Rectangle((-6, -5.25), 12, 19, linewidth=lw,
color=color, fill=False)
# Free throw top arc
top_free_throw = plt.Arc((0, 13.75), 12, 12, theta1=0, theta2=180,
linewidth=lw, color=color, fill=False)
# Free throw bottom arc
bottom_free_throw = plt.Arc((0, 13.75), 12, 12, theta1=180, theta2=0,
linewidth=lw, color=color, linestyle='dashed')
# Restricted zone arc
restricted = plt.Arc((0, 0), 8, 8, theta1=0, theta2=180,
linewidth=lw, color=color)
# Three point line
corner_three_left = plt.Rectangle((-22, -5.25), 0, 14, linewidth=lw,
color=color)
corner_three_right = plt.Rectangle((22, -5.25), 0, 14, linewidth=lw,
color=color)
three_arc = plt.Arc((0, 0), 47.5, 47.5, theta1=22, theta2=158,
linewidth=lw, color=color)
    # Center court (the half-court line is 47 ft from the baseline, which
    # sits at y = -5.25 in hoop-centered coordinates, so y = 41.75 here)
    center_outer_arc = plt.Arc((0, 41.75), 12, 12, theta1=180, theta2=0,
                               linewidth=lw, color=color)
    center_inner_arc = plt.Arc((0, 41.75), 4, 4, theta1=180, theta2=0,
                               linewidth=lw, color=color)
# Add elements to axes
court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
bottom_free_throw, restricted, corner_three_left,
corner_three_right, three_arc, center_outer_arc,
center_inner_arc]
    if outer_lines:
        outer_lines_rect = plt.Rectangle((-25, -5.25), 50, 47,
                                         linewidth=lw, color=color, fill=False)
court_elements.append(outer_lines_rect)
for element in court_elements:
ax.add_patch(element)
    # Set axis limits and properties (half court: 50 ft wide, 47 ft deep)
    ax.set_xlim(-25, 25)
    ax.set_ylim(-5.25, 41.75)
ax.set_aspect('equal')
ax.axis('off')
return ax
4.7.3 Basic Shot Charts
Creating simple shot charts that display made and missed shots:
def create_basic_shot_chart(shots_df, player_name, title=None, figsize=(12, 11)):
"""Create a basic shot chart showing makes and misses."""
player_shots = shots_df[shots_df['player_name'] == player_name].copy()
fig, ax = plt.subplots(figsize=figsize)
draw_basketball_court(ax)
# Separate makes and misses
makes = player_shots[player_shots['shot_made'] == 1]
misses = player_shots[player_shots['shot_made'] == 0]
# Plot misses first (so makes appear on top)
ax.scatter(misses['shot_x'], misses['shot_y'], c='red', marker='x',
s=30, alpha=0.6, label=f'Miss ({len(misses)})')
# Plot makes
ax.scatter(makes['shot_x'], makes['shot_y'], c='green', marker='o',
s=30, alpha=0.6, label=f'Make ({len(makes)})')
# Calculate shooting percentage
fg_pct = len(makes) / len(player_shots) * 100 if len(player_shots) > 0 else 0
ax.set_title(title or f'{player_name} Shot Chart\n'
f'{len(player_shots)} shots, {fg_pct:.1f}% FG')
ax.legend(loc='upper right')
plt.tight_layout()
return fig
# Create shot chart for a player
fig = create_basic_shot_chart(shots, 'Stephen Curry',
title='Stephen Curry Shot Chart (2023-24)')
plt.savefig('curry_shot_chart_basic.png', dpi=150)
plt.close()
4.7.4 Hexbin Shot Charts
Hexagonal binning aggregates shots into zones, revealing shooting patterns more clearly:
def create_hexbin_shot_chart(shots_df, player_name, stat='frequency',
title=None, figsize=(12, 11)):
"""Create a hexbin shot chart showing shooting patterns."""
player_shots = shots_df[shots_df['player_name'] == player_name].copy()
fig, ax = plt.subplots(figsize=figsize)
if stat == 'frequency':
# Show shot frequency
hexbin = ax.hexbin(player_shots['shot_x'], player_shots['shot_y'],
gridsize=25, cmap='YlOrRd', mincnt=1)
cb = plt.colorbar(hexbin, ax=ax, label='Shot Attempts')
elif stat == 'efficiency':
# Show shooting percentage
hexbin = ax.hexbin(player_shots['shot_x'], player_shots['shot_y'],
C=player_shots['shot_made'], gridsize=25,
cmap='RdYlGn', mincnt=3, reduce_C_function=np.mean)
cb = plt.colorbar(hexbin, ax=ax, label='FG%')
# Draw court on top
draw_basketball_court(ax, color='black', lw=1)
ax.set_title(title or f'{player_name} Shot Chart ({stat.title()})')
plt.tight_layout()
return fig
# Create frequency and efficiency shot charts
fig = create_hexbin_shot_chart(shots, 'Stephen Curry', stat='frequency',
title='Stephen Curry Shot Frequency')
plt.savefig('curry_shot_chart_frequency.png', dpi=150)
plt.close()
fig = create_hexbin_shot_chart(shots, 'Stephen Curry', stat='efficiency',
title='Stephen Curry Shot Efficiency')
plt.savefig('curry_shot_chart_efficiency.png', dpi=150)
plt.close()
4.7.5 Kernel Density Shot Charts
KDE-based shot charts provide smooth heat maps of shot locations:
def create_kde_shot_chart(shots_df, player_name, title=None,
figsize=(12, 11), bw_adjust=0.5):
"""Create a KDE-based shot density chart."""
player_shots = shots_df[shots_df['player_name'] == player_name].copy()
fig, ax = plt.subplots(figsize=figsize)
# Create KDE plot
sns.kdeplot(data=player_shots, x='shot_x', y='shot_y',
fill=True, cmap='YlOrRd', levels=50, thresh=0.05,
bw_adjust=bw_adjust, ax=ax)
# Draw court
draw_basketball_court(ax, color='black', lw=1)
# Add shot count
ax.set_title(title or f'{player_name} Shot Density\n'
f'({len(player_shots)} total shots)')
plt.tight_layout()
return fig
fig = create_kde_shot_chart(shots, 'Stephen Curry',
title='Stephen Curry Shot Density (2023-24)')
plt.savefig('curry_shot_chart_kde.png', dpi=150)
plt.close()
4.7.6 Zone-Based Analysis
Dividing the court into zones enables structured shooting analysis:
def classify_shot_zone(row):
    """Classify a shot into a court zone based on coordinates."""
    x, y = row['shot_x'], row['shot_y']
    distance = row['shot_distance']
    # Three-pointers first: the corner three is only 22 ft from the hoop
    # (|x| >= 22 below the break at y = 8.75), so a pure distance cutoff
    # would misclassify corner threes as mid-range
    if abs(x) >= 22 and y <= 8.75:
        return 'Corner 3 Left' if x < 0 else 'Corner 3 Right'
    if distance >= 23.75:
        return 'Above Break 3'
    # Restricted area
    if distance <= 4:
        return 'Restricted Area'
    # Paint (non-restricted), out to the free throw line
    if abs(x) <= 8 and y <= 14:
        return 'Paint (Non-RA)'
    # Everything else inside the arc is mid-range
    if x < -8:
        return 'Mid-Range Left'
    if x > 8:
        return 'Mid-Range Right'
    return 'Mid-Range Center'
# Apply zone classification
shots['shot_zone'] = shots.apply(classify_shot_zone, axis=1)
def analyze_zone_shooting(shots_df, player_name):
"""Analyze shooting by zone for a player."""
player_shots = shots_df[shots_df['player_name'] == player_name]
zone_stats = player_shots.groupby('shot_zone').agg(
attempts=('shot_made', 'count'),
makes=('shot_made', 'sum'),
fg_pct=('shot_made', 'mean')
).round(3)
zone_stats['fg_pct'] = (zone_stats['fg_pct'] * 100).round(1)
zone_stats['pct_of_shots'] = (zone_stats['attempts'] /
zone_stats['attempts'].sum() * 100).round(1)
return zone_stats.sort_values('attempts', ascending=False)
zone_analysis = analyze_zone_shooting(shots, 'Stephen Curry')
print("Stephen Curry Zone Shooting Analysis:")
print(zone_analysis)
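Zone percentages are most meaningful against a baseline. Since the shots frame contains every player, a league-wide zone average computed from the same data makes a natural reference point, as sketched here:
# Compare the player's zone FG% to the league-wide rate in each zone
league_zone_fg = (shots.groupby('shot_zone')['shot_made'].mean() * 100).round(1)
zone_analysis['league_fg_pct'] = league_zone_fg
zone_analysis['fg_pct_diff'] = (zone_analysis['fg_pct'] -
                                zone_analysis['league_fg_pct']).round(1)
print(zone_analysis)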
4.7.7 Comparative Shot Charts
Comparing shooting patterns between players or time periods:
def create_comparison_shot_chart(shots_df, player1, player2,
title=None, figsize=(20, 10)):
"""Create side-by-side shot charts for comparison."""
fig, axes = plt.subplots(1, 2, figsize=figsize)
for ax, player in zip(axes, [player1, player2]):
player_shots = shots_df[shots_df['player_name'] == player]
# Create hexbin
hexbin = ax.hexbin(player_shots['shot_x'], player_shots['shot_y'],
gridsize=20, cmap='YlOrRd', mincnt=1)
# Draw court
draw_basketball_court(ax, color='black', lw=1)
# Calculate stats
fg_pct = player_shots['shot_made'].mean() * 100
ax.set_title(f'{player}\n{len(player_shots)} shots, {fg_pct:.1f}% FG')
# Add colorbar
plt.colorbar(hexbin, ax=axes, label='Shot Attempts', shrink=0.7)
if title:
fig.suptitle(title, fontsize=14, y=1.02)
plt.tight_layout()
return fig
fig = create_comparison_shot_chart(shots, 'Stephen Curry', 'Luka Doncic',
title='Shot Chart Comparison: Curry vs Doncic')
plt.savefig('curry_vs_doncic_shot_charts.png', dpi=150, bbox_inches='tight')
plt.close()
4.7.8 Expected Points and Shot Quality
Incorporating expected value concepts into shot chart analysis:
def calculate_expected_points(shots_df):
"""Calculate expected points for each shot based on location."""
shots_df = shots_df.copy()
# Calculate zone-based expected FG%
zone_fg = shots_df.groupby('shot_zone')['shot_made'].mean()
shots_df['zone_expected_fg'] = shots_df['shot_zone'].map(zone_fg)
    # Determine point value from the zone classification (corner threes sit
    # inside 23.75 ft, so a raw distance cutoff would undercount them)
    shots_df['point_value'] = np.where(
        shots_df['shot_zone'].str.contains('3'), 3, 2
    )
# Calculate expected points
shots_df['expected_points'] = (shots_df['zone_expected_fg'] *
shots_df['point_value'])
# Actual points
shots_df['actual_points'] = shots_df['shot_made'] * shots_df['point_value']
return shots_df
shots_with_expected = calculate_expected_points(shots)
def analyze_shot_quality(shots_df, player_name):
"""Analyze shot quality for a player."""
player_shots = shots_df[shots_df['player_name'] == player_name]
analysis = {
'total_shots': len(player_shots),
'expected_points_per_shot': player_shots['expected_points'].mean(),
'actual_points_per_shot': player_shots['actual_points'].mean(),
'shot_quality_differential': (player_shots['actual_points'].mean() -
player_shots['expected_points'].mean()),
'total_expected_points': player_shots['expected_points'].sum(),
'total_actual_points': player_shots['actual_points'].sum()
}
return pd.Series(analysis)
curry_shot_quality = analyze_shot_quality(shots_with_expected, 'Stephen Curry')
print("Stephen Curry Shot Quality Analysis:")
print(curry_shot_quality)
4.8 Putting It All Together: A Complete EDA Workflow
4.8.1 Structured EDA Approach
A systematic approach ensures thorough exploration:
class BasketballEDA:
"""A structured class for conducting EDA on basketball data."""
def __init__(self, df, player_col='player_name'):
self.df = df.copy()
self.player_col = player_col
self.numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
self.categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
def summary_report(self):
"""Generate a comprehensive summary report."""
print("=" * 60)
print("BASKETBALL DATA EXPLORATORY ANALYSIS REPORT")
print("=" * 60)
print(f"\nDataset Shape: {self.df.shape[0]} rows, {self.df.shape[1]} columns")
print(f"Numeric Columns: {len(self.numeric_cols)}")
print(f"Categorical Columns: {len(self.categorical_cols)}")
print("\n--- Missing Values ---")
missing = self.df.isnull().sum()
if missing.sum() > 0:
print(missing[missing > 0])
else:
print("No missing values detected")
print("\n--- Numeric Summary ---")
print(self.df[self.numeric_cols].describe().T.round(2))
print("\n--- Categorical Summary ---")
for col in self.categorical_cols[:5]: # Limit output
print(f"\n{col}:")
print(self.df[col].value_counts().head())
def correlation_analysis(self, threshold=0.7):
"""Identify highly correlated features."""
corr = self.df[self.numeric_cols].corr()
# Find pairs above threshold
high_corr = []
for i in range(len(corr.columns)):
for j in range(i+1, len(corr.columns)):
if abs(corr.iloc[i, j]) >= threshold:
high_corr.append({
'var1': corr.columns[i],
'var2': corr.columns[j],
'correlation': corr.iloc[i, j]
})
return pd.DataFrame(high_corr).sort_values('correlation',
ascending=False, key=abs)
    def outlier_detection(self, columns=None, method='iqr', threshold=1.5):
        """Detect outliers via IQR (threshold is the IQR multiplier; 1.5 is
        conventional) or z-score (pass threshold=3 for the usual cutoff)."""
if columns is None:
columns = self.numeric_cols
outliers = {}
for col in columns:
if method == 'iqr':
Q1 = self.df[col].quantile(0.25)
Q3 = self.df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - threshold * IQR
upper = Q3 + threshold * IQR
outlier_mask = (self.df[col] < lower) | (self.df[col] > upper)
            elif method == 'zscore':
                from scipy import stats
                col_data = self.df[col].dropna()
                z_scores = pd.Series(np.abs(stats.zscore(col_data)),
                                     index=col_data.index)
                # Realign to the full index so rows with NaN are never flagged
                outlier_mask = z_scores.reindex(self.df.index,
                                                fill_value=0) > threshold
            outliers[col] = self.df.loc[outlier_mask, col]
return outliers
def generate_visualizations(self, output_dir='eda_output'):
"""Generate a standard set of visualizations."""
import os
os.makedirs(output_dir, exist_ok=True)
# Distribution plots for key numeric variables
key_stats = ['points', 'rebounds', 'assists', 'minutes']
available_stats = [s for s in key_stats if s in self.numeric_cols]
if available_stats:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, stat in enumerate(available_stats[:4]):
ax = axes[idx // 2, idx % 2]
self.df[stat].hist(ax=ax, bins=30, edgecolor='black')
ax.set_title(f'Distribution of {stat.title()}')
ax.set_xlabel(stat.title())
plt.tight_layout()
plt.savefig(f'{output_dir}/distributions.png', dpi=150)
plt.close()
# Correlation heatmap
fig, ax = plt.subplots(figsize=(12, 10))
corr = self.df[self.numeric_cols].corr()
sns.heatmap(corr, annot=False, cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Correlation Matrix')
plt.tight_layout()
plt.savefig(f'{output_dir}/correlation_heatmap.png', dpi=150)
plt.close()
print(f"Visualizations saved to {output_dir}/")
# Example usage
eda = BasketballEDA(player_stats_clean)
eda.summary_report()
high_corr = eda.correlation_analysis(threshold=0.8)
print("\nHighly Correlated Features:")
print(high_corr)
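The remaining methods round out the workflow; a brief usage sketch:
# Flag extreme scorers with the IQR rule and write the standard plot set
point_outliers = eda.outlier_detection(columns=['points'], method='iqr')
print(f"\nPoints outliers (IQR rule): {len(point_outliers['points'])}")
print(point_outliers['points'].sort_values(ascending=False).head())
eda.generate_visualizations(output_dir='eda_output')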
4.8.2 Documentation and Reproducibility
Creating reproducible analysis through proper documentation:
def create_eda_notebook_template(output_path):
"""Generate a template for EDA documentation."""
template = '''
# NBA Data Exploratory Data Analysis
## Analysis Date: {date}
### 1. Data Loading and Initial Inspection
- Dataset source: [Specify source]
- Number of records: [N]
- Number of features: [N]
- Time period covered: [Dates]
### 2. Data Quality Assessment
- Missing values identified: [Details]
- Duplicate records: [Count]
- Data type issues: [List]
### 3. Data Cleaning Steps
1. [Step 1]
2. [Step 2]
3. [Step 3]
### 4. Key Findings from Distribution Analysis
- [Finding 1]
- [Finding 2]
### 5. Relationship Analysis
- Strongest correlations: [List]
- Key relationships discovered: [Details]
### 6. Notable Outliers
- [Player/observation 1]
- [Player/observation 2]
### 7. Recommendations for Further Analysis
- [Recommendation 1]
- [Recommendation 2]
'''
from datetime import datetime
template = template.format(date=datetime.now().strftime('%Y-%m-%d'))
with open(output_path, 'w') as f:
f.write(template)
print(f"Template saved to {output_path}")
create_eda_notebook_template('eda_documentation_template.md')
Summary
Exploratory Data Analysis forms the critical foundation for all basketball analytics work. This chapter has equipped you with the tools and techniques to systematically examine NBA data, from initial loading and inspection through sophisticated spatial analysis of shot charts.
The key skills developed in this chapter include:
- Data Loading and Inspection: Using pandas to efficiently load, examine, and understand basketball datasets, including identifying data types and creating derived features.
- Data Cleaning: Implementing comprehensive data cleaning procedures that address common issues in basketball data while respecting the sport's inherent constraints.
- Missing Value Handling: Understanding the mechanisms behind missing data in basketball contexts and applying appropriate imputation strategies.
- Distribution Visualization: Creating informative histograms, box plots, and violin plots that reveal the shape and characteristics of basketball statistics.
- Relationship Analysis: Building scatter plots, correlation matrices, and pair plots that uncover meaningful relationships between variables.
- Time Series Analysis: Tracking player and team performance over time through rolling averages, cumulative statistics, and trend analysis.
- Spatial Analysis: Creating professional shot charts using various techniques including basic plots, hexbin aggregation, and KDE-based heat maps.
These EDA skills serve as prerequisites for the predictive modeling and advanced analytics techniques covered in subsequent chapters. By thoroughly understanding your data through systematic exploration, you establish a solid foundation for extracting actionable insights that can influence real basketball decisions.
The code examples provided throughout this chapter are designed to be directly applicable to real NBA datasets. As you work through the exercises and case studies that follow, you will gain hands-on experience applying these techniques to authentic basketball analytics scenarios.