In This Chapter
- Introduction
- 4.1 Loading and Inspecting NBA Data with Pandas
- 4.2 Data Cleaning and Preprocessing Techniques
- 4.3 Handling Missing Values in Basketball Datasets
- 4.4 Visualizing Distributions
- 4.5 Visualizing Relationships
- 4.6 Time Series Analysis of Player and Team Performance
- 4.7 Shot Chart Creation and Spatial Analysis
- 4.8 Putting It All Together: A Complete EDA Workflow
- Summary
Chapter 4: Exploratory Data Analysis for Basketball
Introduction
Exploratory Data Analysis (EDA) forms the foundation of any successful basketball analytics project. Before building predictive models or deriving actionable insights, analysts must thoroughly understand their data through systematic exploration and visualization. This chapter provides a comprehensive guide to conducting EDA on basketball datasets, with a focus on NBA data and the practical techniques that reveal meaningful patterns in player and team performance.
EDA serves multiple critical purposes in basketball analytics. First, it helps identify data quality issues such as missing values, outliers, and inconsistencies that could compromise downstream analyses. Second, it reveals the underlying structure and distributions of key metrics, informing appropriate statistical methods. Third, and perhaps most importantly, it generates hypotheses about relationships between variables that can guide more rigorous investigation.
The tools we employ throughout this chapter include Python's pandas library for data manipulation, matplotlib and seaborn for visualization, and specialized techniques for spatial analysis of shot data. By the end of this chapter, you will possess the skills to transform raw basketball data into compelling visual narratives that inform decision-making.
4.1 Loading and Inspecting NBA Data with Pandas
4.1.1 Common NBA Data Sources and Formats
NBA data comes in various formats depending on the source. The official NBA Stats API provides JSON responses that require parsing, while historical databases often deliver CSV or Excel files. Play-by-play data may arrive in nested JSON structures that demand careful extraction.
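If you pull from the NBA Stats API rather than a flat file, the JSON payload must be reshaped before analysis. The sketch below assumes the API's common resultSets layout (a headers list plus a rowSet of records) and a response already saved to disk; the file name is illustrative.
import json
import pandas as pd
# Parse a saved NBA Stats API response (resultSets -> headers/rowSet layout)
with open('leaguedashplayerstats.json') as f:
    payload = json.load(f)
result = payload['resultSets'][0]
api_df = pd.DataFrame(result['rowSet'], columns=result['headers'])
print(api_df.head())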
Let us begin by loading a typical NBA player statistics dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set display options for better readability
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
# Load NBA player statistics
# This example assumes a CSV file with per-game averages
player_stats = pd.read_csv('nba_player_stats_2023_24.csv')
# First look at the data
print(f"Dataset shape: {player_stats.shape}")
print(f"Number of players: {player_stats.shape[0]}")
print(f"Number of features: {player_stats.shape[1]}")
4.1.2 Initial Data Inspection
The first step in any EDA process involves understanding what data you have. Pandas provides several methods for this purpose:
# View the first few rows
print(player_stats.head(10))
# View the last few rows
print(player_stats.tail(10))
# Get column names and data types
print(player_stats.info())
# Statistical summary of numerical columns
print(player_stats.describe())
# Check for unique values in categorical columns
print(player_stats['team'].nunique())
print(player_stats['position'].value_counts())
The info() method reveals crucial information about data types and memory usage. Pay particular attention to columns that should be numeric but appear as object types, as this often indicates data quality issues requiring attention.
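For example, a handful of placeholder strings (a hypothetical 'DNP' entry, say) can force an otherwise numeric column to object dtype. A quick coercion check, as sketched below, surfaces how many values would be lost:
# Coerce a suspect column and count entries that fail conversion
# ('DNP' is a hypothetical placeholder that would become NaN)
coerced = pd.to_numeric(player_stats['minutes'], errors='coerce')
newly_missing = coerced.isnull().sum() - player_stats['minutes'].isnull().sum()
print(f"Non-numeric entries in 'minutes': {newly_missing}")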
4.1.3 Understanding Data Types in Basketball Datasets
Basketball datasets contain diverse data types that require appropriate handling:
# Identify column types
numerical_cols = player_stats.select_dtypes(include=[np.number]).columns
categorical_cols = player_stats.select_dtypes(include=['object']).columns
print(f"Numerical columns ({len(numerical_cols)}):")
print(numerical_cols.tolist())
print(f"\nCategorical columns ({len(categorical_cols)}):")
print(categorical_cols.tolist())
# Check for mixed types in supposedly numeric columns
def check_numeric_validity(df, column):
"""Check if a column contains non-numeric values."""
try:
pd.to_numeric(df[column], errors='raise')
return True, []
except ValueError:
non_numeric = df[~df[column].apply(
lambda x: isinstance(x, (int, float)) or
(isinstance(x, str) and x.replace('.', '').replace('-', '').isdigit())
)][column].unique()
return False, non_numeric
# Apply to potentially problematic columns
for col in ['points', 'rebounds', 'assists']:
valid, issues = check_numeric_validity(player_stats, col)
if not valid:
print(f"Column '{col}' has non-numeric values: {issues}")
4.1.4 Creating Derived Features During Inspection
During the inspection phase, you will often identify opportunities to create derived features that enhance analysis:
# Calculate per-minute statistics
player_stats['pts_per_min'] = player_stats['points'] / player_stats['minutes']
player_stats['reb_per_min'] = player_stats['rebounds'] / player_stats['minutes']
player_stats['ast_per_min'] = player_stats['assists'] / player_stats['minutes']
# Calculate efficiency metrics
player_stats['true_shooting_pct'] = (
player_stats['points'] /
(2 * (player_stats['fga'] + 0.44 * player_stats['fta']))
)
# Create position groups
position_mapping = {
'PG': 'Guard', 'SG': 'Guard',
'SF': 'Forward', 'PF': 'Forward',
'C': 'Center'
}
player_stats['position_group'] = player_stats['position'].map(position_mapping)
# Verify the new features
print(player_stats[['player_name', 'pts_per_min', 'true_shooting_pct',
'position_group']].head(10))
4.2 Data Cleaning and Preprocessing Techniques
4.2.1 Identifying Data Quality Issues
Data quality issues in basketball datasets manifest in various forms. Common problems include duplicate records, inconsistent naming conventions, and erroneous values that defy basketball logic:
# Check for duplicate rows
duplicate_count = player_stats.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")
# Check for duplicate player names (which might be legitimate or errors)
name_counts = player_stats['player_name'].value_counts()
potential_duplicates = name_counts[name_counts > 1]
print("Potential duplicate players:")
print(potential_duplicates)
# Identify records with the same player appearing multiple times
# This is valid if a player was traded mid-season
traded_players = player_stats[
player_stats['player_name'].isin(potential_duplicates.index)
].sort_values(['player_name', 'team'])
print(traded_players[['player_name', 'team', 'games_played', 'points']])
4.2.2 Handling Inconsistent Data
Inconsistencies often arise in team names, player names, and categorical variables:
# Standardize team abbreviations
team_standardization = {
'PHX': 'PHO', # Phoenix Suns
'BKN': 'BRK', # Brooklyn Nets
'CHA': 'CHO', # Charlotte Hornets
'NOP': 'NOR', # New Orleans Pelicans (if using older abbreviations)
}
# Apply standardization
player_stats['team'] = player_stats['team'].replace(team_standardization)
# Standardize position labels
position_standardization = {
'Point Guard': 'PG',
'Shooting Guard': 'SG',
'Small Forward': 'SF',
'Power Forward': 'PF',
'Center': 'C',
'G': 'SG', # Generic guard to shooting guard
'F': 'SF', # Generic forward to small forward
}
player_stats['position'] = player_stats['position'].replace(position_standardization)
# Clean player names (remove extra spaces, standardize capitalization)
player_stats['player_name'] = player_stats['player_name'].str.strip()
player_stats['player_name'] = player_stats['player_name'].str.title()
4.2.3 Validating Data Against Basketball Logic
Basketball imposes natural constraints on statistics. Field goal percentage cannot exceed 100%, and a regulation game lasts 48 minutes, so per-game averages above that threshold signal data errors (only overtime can push single-game totals higher):
def validate_basketball_stats(df):
"""Validate statistics against basketball logic constraints."""
issues = []
# Field goal percentage should be between 0 and 1
if 'fg_pct' in df.columns:
invalid_fg = df[(df['fg_pct'] < 0) | (df['fg_pct'] > 1)]
if len(invalid_fg) > 0:
issues.append(f"Invalid FG%: {len(invalid_fg)} records")
# Minutes per game should be between 0 and 48 (regular season)
if 'minutes' in df.columns:
invalid_mins = df[(df['minutes'] < 0) | (df['minutes'] > 48)]
if len(invalid_mins) > 0:
issues.append(f"Invalid minutes: {len(invalid_mins)} records")
# Points should be non-negative
if 'points' in df.columns:
invalid_pts = df[df['points'] < 0]
if len(invalid_pts) > 0:
issues.append(f"Negative points: {len(invalid_pts)} records")
    # Assists are credited on teammates' baskets, so a player's assists have
    # no hard ceiling from their own makes; flag only extreme ratios for review
if 'assists' in df.columns and 'fgm' in df.columns:
suspicious_ast = df[df['assists'] > df['fgm'] * 2]
if len(suspicious_ast) > 0:
issues.append(f"Suspicious assist totals: {len(suspicious_ast)} records")
return issues
# Run validation
validation_issues = validate_basketball_stats(player_stats)
for issue in validation_issues:
print(f"Warning: {issue}")
4.2.4 Data Type Conversions
Proper data type assignment improves both performance and analytical accuracy:
# Convert percentage columns stored as strings (e.g. '45.6%')
percentage_cols = ['fg_pct', 'fg3_pct', 'ft_pct']
for col in percentage_cols:
    if col in player_stats.columns and player_stats[col].dtype == 'object':
        # Strip percentage signs, then rescale to proportions; columns
        # already stored as decimal proportions are left untouched
        player_stats[col] = player_stats[col].str.replace('%', '', regex=False)
        player_stats[col] = pd.to_numeric(player_stats[col], errors='coerce') / 100
# Convert categorical columns to category dtype for efficiency
categorical_columns = ['team', 'position', 'position_group']
for col in categorical_columns:
if col in player_stats.columns:
player_stats[col] = player_stats[col].astype('category')
# Verify conversions
print(player_stats.dtypes)
4.3 Handling Missing Values in Basketball Datasets
4.3.1 Identifying Missing Data Patterns
Missing data in basketball datasets often follows specific patterns. A player with zero three-point attempts will have a missing three-point percentage, which represents a fundamentally different situation than truly missing data:
# Overall missing value summary
missing_summary = pd.DataFrame({
'missing_count': player_stats.isnull().sum(),
'missing_pct': (player_stats.isnull().sum() / len(player_stats) * 100).round(2)
})
missing_summary = missing_summary[missing_summary['missing_count'] > 0]
missing_summary = missing_summary.sort_values('missing_pct', ascending=False)
print(missing_summary)
# Visualize missing data patterns
plt.figure(figsize=(12, 6))
sns.heatmap(player_stats.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Data Pattern Visualization')
plt.xlabel('Features')
plt.tight_layout()
plt.savefig('missing_data_pattern.png', dpi=150)
plt.close()
4.3.2 Understanding Missing Data Mechanisms
Missing data in basketball falls into three categories:
- Missing Completely at Random (MCAR): Data collection errors unrelated to any variables
- Missing at Random (MAR): Missingness depends on observed variables
- Missing Not at Random (MNAR): Missingness depends on the missing value itself
# Analyze whether missing three-point percentage relates to position
def analyze_missing_pattern(df, missing_col, grouping_col):
"""Analyze missing data pattern by a grouping variable."""
missing_by_group = df.groupby(grouping_col)[missing_col].apply(
lambda x: x.isnull().mean() * 100
).round(2)
return missing_by_group
# Three-point percentage missing by position
if 'fg3_pct' in player_stats.columns:
missing_3pt_by_pos = analyze_missing_pattern(
player_stats, 'fg3_pct', 'position'
)
print("Missing 3PT% by Position:")
print(missing_3pt_by_pos)
# Investigate correlation between attempts and missing percentage
# Players with zero attempts have undefined percentages
if 'fg3a' in player_stats.columns and 'fg3_pct' in player_stats.columns:
zero_attempts = player_stats[player_stats['fg3a'] == 0]
print(f"\nPlayers with zero 3PA: {len(zero_attempts)}")
print(f"Of these, missing 3PT%: {zero_attempts['fg3_pct'].isnull().sum()}")
4.3.3 Imputation Strategies for Basketball Data
Different missing data scenarios require different imputation approaches:
def impute_basketball_stats(df):
"""Apply appropriate imputation strategies for basketball statistics."""
df_imputed = df.copy()
# Strategy 1: Zero imputation for counting stats of players who didn't play
counting_stats = ['points', 'rebounds', 'assists', 'steals', 'blocks']
for col in counting_stats:
if col in df_imputed.columns:
# Only impute with zero if games_played is 0 or very low
mask = (df_imputed[col].isnull()) & (df_imputed['games_played'] <= 1)
df_imputed.loc[mask, col] = 0
# Strategy 2: For percentages with zero attempts, set to league average or NaN
# This is a design decision - keeping as NaN is often better for analysis
if 'fg3_pct' in df_imputed.columns and 'fg3a' in df_imputed.columns:
# Set 3PT% to NaN for players with zero attempts (undefined)
df_imputed.loc[df_imputed['fg3a'] == 0, 'fg3_pct'] = np.nan
# Strategy 3: Forward/backward fill for time series data
# (if this were game-by-game data)
# Strategy 4: Group-based imputation for missing demographic data
if 'height' in df_imputed.columns:
# Impute missing height with position-average height
position_height = df_imputed.groupby('position')['height'].transform('mean')
df_imputed['height'] = df_imputed['height'].fillna(position_height)
return df_imputed
player_stats_clean = impute_basketball_stats(player_stats)
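Strategy 3 is only sketched as a comment above because per-game averages have no time dimension. On game-by-game data it would look like the toy example below, which forward-fills within each player; the plus_minus column is hypothetical.
# Toy game-by-game frame demonstrating per-player forward fill
demo = pd.DataFrame({
    'player_id': ['a', 'a', 'a', 'b', 'b'],
    'game_number': [1, 2, 3, 1, 2],
    'plus_minus': [5.0, np.nan, -3.0, np.nan, 7.0],
})
# Gaps inherit each player's most recent observed value, never another player's
demo['plus_minus_filled'] = demo.groupby('player_id')['plus_minus'].ffill()
print(demo)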
4.3.4 Documenting Missing Data Decisions
Creating a clear record of missing data handling is essential for reproducibility:
def create_missing_data_report(df_original, df_cleaned, output_path=None):
"""Generate a comprehensive missing data handling report."""
report = []
report.append("=" * 60)
report.append("MISSING DATA HANDLING REPORT")
report.append("=" * 60)
for col in df_original.columns:
orig_missing = df_original[col].isnull().sum()
clean_missing = df_cleaned[col].isnull().sum()
if orig_missing > 0 or clean_missing > 0:
report.append(f"\nColumn: {col}")
report.append(f" Original missing: {orig_missing} ({orig_missing/len(df_original)*100:.1f}%)")
report.append(f" After cleaning: {clean_missing} ({clean_missing/len(df_cleaned)*100:.1f}%)")
report.append(f" Values imputed: {orig_missing - clean_missing}")
report_text = "\n".join(report)
print(report_text)
if output_path:
with open(output_path, 'w') as f:
f.write(report_text)
return report_text
# Generate report
create_missing_data_report(player_stats, player_stats_clean)
4.4 Visualizing Distributions
4.4.1 Histograms for Basketball Statistics
Histograms reveal the distribution shape of individual statistics, helping identify skewness, multimodality, and outliers:
def plot_stat_distribution(df, column, title=None, bins=30, figsize=(10, 6)):
"""Create a comprehensive histogram with statistical annotations."""
fig, ax = plt.subplots(figsize=figsize)
data = df[column].dropna()
# Create histogram
n, bins_edges, patches = ax.hist(data, bins=bins, edgecolor='black',
alpha=0.7, color='steelblue')
# Add mean and median lines
mean_val = data.mean()
median_val = data.median()
ax.axvline(mean_val, color='red', linestyle='--', linewidth=2,
label=f'Mean: {mean_val:.2f}')
ax.axvline(median_val, color='green', linestyle='-', linewidth=2,
label=f'Median: {median_val:.2f}')
# Add statistical annotations
stats_text = (f'N = {len(data)}\n'
f'Std = {data.std():.2f}\n'
f'Skew = {data.skew():.2f}\n'
f'Kurt = {data.kurtosis():.2f}')
ax.text(0.95, 0.95, stats_text, transform=ax.transAxes,
verticalalignment='top', horizontalalignment='right',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5),
fontsize=10)
ax.set_xlabel(column.replace('_', ' ').title())
ax.set_ylabel('Frequency')
ax.set_title(title or f'Distribution of {column.replace("_", " ").title()}')
ax.legend()
plt.tight_layout()
return fig
# Plot distributions for key statistics
fig = plot_stat_distribution(player_stats_clean, 'points',
'Distribution of Points Per Game (2023-24 Season)')
plt.savefig('ppg_distribution.png', dpi=150)
plt.close()
# Compare distributions across positions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
positions = ['PG', 'SG', 'SF', 'PF', 'C']
for idx, pos in enumerate(positions):
ax = axes[idx // 3, idx % 3]
pos_data = player_stats_clean[player_stats_clean['position'] == pos]['points']
ax.hist(pos_data.dropna(), bins=20, edgecolor='black', alpha=0.7)
ax.set_title(f'{pos}: Points Per Game')
ax.set_xlabel('PPG')
ax.set_ylabel('Frequency')
# Remove empty subplot
axes[1, 2].axis('off')
plt.tight_layout()
plt.savefig('ppg_by_position.png', dpi=150)
plt.close()
4.4.2 Box Plots for Comparative Analysis
Box plots excel at comparing distributions across groups and identifying outliers:
def create_grouped_boxplot(df, value_col, group_col, title=None,
figsize=(12, 6), show_points=True):
"""Create a box plot comparing distributions across groups."""
fig, ax = plt.subplots(figsize=figsize)
# Order groups by median value
group_order = df.groupby(group_col)[value_col].median().sort_values(
ascending=False
).index.tolist()
# Create box plot
bp = ax.boxplot([df[df[group_col] == g][value_col].dropna()
for g in group_order],
labels=group_order, patch_artist=True)
# Color boxes
colors = plt.cm.Set3(np.linspace(0, 1, len(group_order)))
for patch, color in zip(bp['boxes'], colors):
patch.set_facecolor(color)
# Optionally overlay individual points
if show_points:
for idx, group in enumerate(group_order):
y = df[df[group_col] == group][value_col].dropna()
x = np.random.normal(idx + 1, 0.04, size=len(y))
ax.scatter(x, y, alpha=0.3, s=10, color='gray')
ax.set_xlabel(group_col.replace('_', ' ').title())
ax.set_ylabel(value_col.replace('_', ' ').title())
ax.set_title(title or f'{value_col} by {group_col}')
# Add grid for readability
ax.yaxis.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
return fig
# Compare points per game across teams
fig = create_grouped_boxplot(
player_stats_clean[player_stats_clean['minutes'] >= 15], # Filter for rotation players
'points', 'team',
title='Points Per Game by Team (Players with 15+ MPG)'
)
plt.savefig('ppg_by_team_boxplot.png', dpi=150)
plt.close()
# Compare efficiency by position
fig = create_grouped_boxplot(
player_stats_clean[player_stats_clean['minutes'] >= 20],
'true_shooting_pct', 'position',
title='True Shooting Percentage by Position'
)
plt.savefig('ts_pct_by_position.png', dpi=150)
plt.close()
4.4.3 Violin Plots for Distribution Shape
Violin plots combine box plot information with kernel density estimation, revealing distribution shapes that box plots obscure:
def create_violin_comparison(df, value_col, group_col, title=None,
figsize=(12, 6)):
"""Create violin plots for comparing distributions."""
fig, ax = plt.subplots(figsize=figsize)
# Filter out groups with too few observations
group_counts = df[group_col].value_counts()
valid_groups = group_counts[group_counts >= 10].index
df_filtered = df[df[group_col].isin(valid_groups)]
# Create violin plot using seaborn
sns.violinplot(data=df_filtered, x=group_col, y=value_col,
ax=ax, inner='box', palette='Set2')
ax.set_xlabel(group_col.replace('_', ' ').title())
ax.set_ylabel(value_col.replace('_', ' ').title())
ax.set_title(title or f'Distribution of {value_col} by {group_col}')
# Rotate x-axis labels if needed
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
return fig
# Minutes distribution by position
fig = create_violin_comparison(
player_stats_clean, 'minutes', 'position',
title='Minutes Per Game Distribution by Position'
)
plt.savefig('minutes_violin_by_position.png', dpi=150)
plt.close()
# Usage rate distribution
if 'usage_rate' in player_stats_clean.columns:
fig = create_violin_comparison(
player_stats_clean[player_stats_clean['minutes'] >= 15],
'usage_rate', 'position',
title='Usage Rate Distribution by Position (15+ MPG)'
)
plt.savefig('usage_rate_violin.png', dpi=150)
plt.close()
4.4.4 Kernel Density Estimation for Smooth Distributions
KDE plots provide smooth distribution estimates useful for overlaying multiple groups:
def plot_kde_comparison(df, value_col, group_col, groups=None,
title=None, figsize=(10, 6)):
"""Create overlaid KDE plots for group comparison."""
fig, ax = plt.subplots(figsize=figsize)
if groups is None:
groups = df[group_col].unique()
colors = plt.cm.tab10(np.linspace(0, 1, len(groups)))
for group, color in zip(groups, colors):
data = df[df[group_col] == group][value_col].dropna()
if len(data) >= 10: # Require minimum observations
sns.kdeplot(data, ax=ax, label=f'{group} (n={len(data)})',
color=color, linewidth=2)
ax.set_xlabel(value_col.replace('_', ' ').title())
ax.set_ylabel('Density')
ax.set_title(title or f'Distribution of {value_col} by {group_col}')
ax.legend(title=group_col.replace('_', ' ').title())
plt.tight_layout()
return fig
# Compare three-point attempt rates by position
fig = plot_kde_comparison(
player_stats_clean[player_stats_clean['minutes'] >= 20],
'fg3a', 'position',
groups=['PG', 'SG', 'SF', 'PF', 'C'],
title='Three-Point Attempts Per Game by Position'
)
plt.savefig('3pa_kde_by_position.png', dpi=150)
plt.close()
4.5 Visualizing Relationships
4.5.1 Scatter Plots for Bivariate Relationships
Scatter plots reveal relationships between two continuous variables and help identify patterns, clusters, and outliers:
def create_scatter_with_regression(df, x_col, y_col, hue_col=None,
title=None, figsize=(10, 8)):
"""Create scatter plot with optional regression line and grouping."""
fig, ax = plt.subplots(figsize=figsize)
if hue_col:
groups = df[hue_col].unique()
colors = plt.cm.tab10(np.linspace(0, 1, len(groups)))
for group, color in zip(groups, colors):
mask = df[hue_col] == group
ax.scatter(df.loc[mask, x_col], df.loc[mask, y_col],
alpha=0.6, label=group, color=color, s=50)
else:
ax.scatter(df[x_col], df[y_col], alpha=0.6, s=50, color='steelblue')
# Add regression line
from scipy import stats
mask = df[[x_col, y_col]].notna().all(axis=1)
slope, intercept, r_value, p_value, std_err = stats.linregress(
df.loc[mask, x_col], df.loc[mask, y_col]
)
x_line = np.linspace(df[x_col].min(), df[x_col].max(), 100)
y_line = slope * x_line + intercept
ax.plot(x_line, y_line, 'r--', linewidth=2,
label=f'R² = {r_value**2:.3f}')
ax.set_xlabel(x_col.replace('_', ' ').title())
ax.set_ylabel(y_col.replace('_', ' ').title())
ax.set_title(title or f'{y_col} vs {x_col}')
ax.legend()
plt.tight_layout()
return fig
# Points vs Minutes relationship
fig = create_scatter_with_regression(
player_stats_clean[player_stats_clean['minutes'] >= 10],
'minutes', 'points',
title='Points vs Minutes Per Game'
)
plt.savefig('points_vs_minutes.png', dpi=150)
plt.close()
# Points vs Usage Rate by Position
fig = create_scatter_with_regression(
player_stats_clean[player_stats_clean['minutes'] >= 20],
'usage_rate', 'points', hue_col='position',
title='Points vs Usage Rate by Position'
)
plt.savefig('points_vs_usage_by_position.png', dpi=150)
plt.close()
4.5.2 Correlation Matrices and Heatmaps
Correlation matrices provide a comprehensive view of relationships among multiple variables:
def create_correlation_heatmap(df, columns=None, title=None,
figsize=(12, 10), annot=True):
"""Create a correlation heatmap for selected columns."""
if columns is None:
columns = df.select_dtypes(include=[np.number]).columns
# Calculate correlation matrix
corr_matrix = df[columns].corr()
# Create mask for upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
fig, ax = plt.subplots(figsize=figsize)
# Create heatmap
sns.heatmap(corr_matrix, mask=mask, annot=annot, fmt='.2f',
cmap='RdBu_r', center=0, square=True, linewidths=0.5,
ax=ax, vmin=-1, vmax=1,
cbar_kws={'shrink': 0.8, 'label': 'Correlation'})
ax.set_title(title or 'Correlation Matrix', fontsize=14)
# Rotate labels
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
return fig, corr_matrix
# Select relevant columns for correlation analysis
stat_columns = ['points', 'rebounds', 'assists', 'steals', 'blocks',
'fg_pct', 'fg3_pct', 'ft_pct', 'minutes', 'turnovers']
fig, corr = create_correlation_heatmap(
player_stats_clean[player_stats_clean['minutes'] >= 15],
columns=stat_columns,
title='Correlation Matrix of Key Statistics (15+ MPG Players)'
)
plt.savefig('correlation_heatmap.png', dpi=150)
plt.close()
# Identify strongest correlations
def get_top_correlations(corr_matrix, n=10):
"""Extract the strongest correlations from a correlation matrix."""
# Get upper triangle indices
upper_tri = corr_matrix.where(
np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
# Stack and sort
correlations = upper_tri.stack().sort_values(ascending=False)
print(f"Top {n} Positive Correlations:")
print(correlations.head(n))
print(f"\nTop {n} Negative Correlations:")
print(correlations.tail(n))
return correlations
top_corrs = get_top_correlations(corr, n=5)
4.5.3 Pair Plots for Multivariate Exploration
Pair plots create scatter plot matrices that reveal relationships among multiple variables simultaneously:
def create_pair_plot(df, columns, hue_col=None, title=None):
"""Create a pair plot for selected columns."""
# Filter to relevant columns and remove missing values
plot_data = df[columns + ([hue_col] if hue_col else [])].dropna()
# Create pair plot
g = sns.pairplot(plot_data, hue=hue_col, diag_kind='kde',
plot_kws={'alpha': 0.6, 's': 30},
diag_kws={'alpha': 0.7})
if title:
g.fig.suptitle(title, y=1.02)
return g
# Create pair plot for scoring statistics
scoring_cols = ['points', 'fg_pct', 'fg3_pct', 'ft_pct', 'minutes']
g = create_pair_plot(
player_stats_clean[player_stats_clean['minutes'] >= 20],
scoring_cols, hue_col='position_group',
title='Scoring Statistics Relationships by Position Group'
)
plt.savefig('scoring_pairplot.png', dpi=150, bbox_inches='tight')
plt.close()
4.5.4 Advanced Scatter Plot Techniques
Encoding extra dimensions through point size, color, and annotations makes scatter plots considerably more informative:
def create_bubble_chart(df, x_col, y_col, size_col, color_col=None,
label_col=None, title=None, figsize=(12, 8)):
"""Create a bubble chart with optional labeling."""
fig, ax = plt.subplots(figsize=figsize)
# Prepare data
plot_df = df[[x_col, y_col, size_col] +
([color_col] if color_col else []) +
([label_col] if label_col else [])].dropna()
# Normalize size for plotting
size_normalized = (plot_df[size_col] - plot_df[size_col].min()) / \
(plot_df[size_col].max() - plot_df[size_col].min())
sizes = size_normalized * 500 + 50 # Scale to reasonable bubble sizes
# Create scatter plot
if color_col:
scatter = ax.scatter(plot_df[x_col], plot_df[y_col], s=sizes,
c=plot_df[color_col], cmap='viridis',
alpha=0.6, edgecolors='black', linewidth=0.5)
plt.colorbar(scatter, label=color_col.replace('_', ' ').title())
else:
scatter = ax.scatter(plot_df[x_col], plot_df[y_col], s=sizes,
alpha=0.6, edgecolors='black', linewidth=0.5)
# Add labels for top performers
if label_col:
# Label top 10 by y-value
top_players = plot_df.nlargest(10, y_col)
for _, row in top_players.iterrows():
ax.annotate(row[label_col], (row[x_col], row[y_col]),
xytext=(5, 5), textcoords='offset points',
fontsize=8, alpha=0.8)
ax.set_xlabel(x_col.replace('_', ' ').title())
ax.set_ylabel(y_col.replace('_', ' ').title())
ax.set_title(title or f'{y_col} vs {x_col} (size: {size_col})')
plt.tight_layout()
return fig
# Create bubble chart: Points vs Efficiency, sized by minutes
fig = create_bubble_chart(
player_stats_clean[player_stats_clean['minutes'] >= 20],
x_col='true_shooting_pct', y_col='points',
size_col='minutes', color_col='assists',
label_col='player_name',
title='Points vs Efficiency (Bubble Size = Minutes, Color = Assists)'
)
plt.savefig('scoring_bubble_chart.png', dpi=150)
plt.close()
4.6 Time Series Analysis of Player and Team Performance
4.6.1 Loading and Preparing Time Series Data
Game-by-game data enables analysis of performance trends over time:
# Load game log data
game_logs = pd.read_csv('nba_player_game_logs_2023_24.csv')
# Convert date column to datetime
game_logs['game_date'] = pd.to_datetime(game_logs['game_date'])
# Sort by player and date
game_logs = game_logs.sort_values(['player_id', 'game_date'])
# Create game number within season for each player
game_logs['game_number'] = game_logs.groupby('player_id').cumcount() + 1
# Verify the structure
print(game_logs[['player_name', 'game_date', 'game_number', 'points']].head(20))
4.6.2 Rolling Averages for Trend Analysis
Rolling averages smooth out game-to-game variance to reveal underlying trends:
def calculate_rolling_stats(df, player_id, stat_cols, windows=[5, 10, 20]):
"""Calculate rolling averages for a player's statistics."""
player_df = df[df['player_id'] == player_id].copy()
for window in windows:
for col in stat_cols:
player_df[f'{col}_rolling_{window}'] = (
player_df[col].rolling(window=window, min_periods=1).mean()
)
return player_df
def plot_player_trend(df, player_id, stat_col, windows=[5, 10],
title=None, figsize=(14, 6)):
"""Plot a player's performance trend with rolling averages."""
player_data = calculate_rolling_stats(df, player_id, [stat_col], windows)
player_name = player_data['player_name'].iloc[0]
fig, ax = plt.subplots(figsize=figsize)
# Plot raw values
ax.scatter(player_data['game_number'], player_data[stat_col],
alpha=0.4, s=30, color='gray', label='Game Values')
# Plot rolling averages
colors = ['blue', 'red', 'green']
for window, color in zip(windows, colors):
ax.plot(player_data['game_number'],
player_data[f'{stat_col}_rolling_{window}'],
linewidth=2, color=color, label=f'{window}-Game Average')
# Add season average line
season_avg = player_data[stat_col].mean()
ax.axhline(y=season_avg, color='black', linestyle='--',
label=f'Season Average: {season_avg:.1f}')
ax.set_xlabel('Game Number')
ax.set_ylabel(stat_col.replace('_', ' ').title())
ax.set_title(title or f'{player_name} - {stat_col.title()} Trend')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
return fig
# Plot scoring trend for a star player
fig = plot_player_trend(game_logs, player_id='jokic_nikola',
stat_col='points', windows=[5, 10],
title='Nikola Jokic - Points Per Game Trend (2023-24)')
plt.savefig('jokic_scoring_trend.png', dpi=150)
plt.close()
4.6.3 Cumulative Statistics and Pace Analysis
Tracking cumulative statistics reveals how players progress toward milestones:
def plot_cumulative_comparison(df, player_ids, stat_col, title=None,
figsize=(12, 6)):
"""Compare cumulative statistics across multiple players."""
fig, ax = plt.subplots(figsize=figsize)
colors = plt.cm.tab10(np.linspace(0, 1, len(player_ids)))
for player_id, color in zip(player_ids, colors):
player_data = df[df['player_id'] == player_id].copy()
player_data['cumulative'] = player_data[stat_col].cumsum()
player_name = player_data['player_name'].iloc[0]
ax.plot(player_data['game_number'], player_data['cumulative'],
linewidth=2, color=color, label=player_name)
ax.set_xlabel('Game Number')
ax.set_ylabel(f'Cumulative {stat_col.replace("_", " ").title()}')
ax.set_title(title or f'Cumulative {stat_col.title()} Comparison')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
return fig
# Compare cumulative scoring for MVP candidates
mvp_candidates = ['jokic_nikola', 'gilgeous_alexander_shai',
'doncic_luka', 'tatum_jayson']
fig = plot_cumulative_comparison(game_logs, mvp_candidates, 'points',
title='Cumulative Points - MVP Candidates (2023-24)')
plt.savefig('mvp_cumulative_points.png', dpi=150)
plt.close()
4.6.4 Performance Variation Analysis
Understanding performance consistency is as important as average performance:
def analyze_performance_consistency(df, player_ids, stat_col):
"""Analyze performance consistency across players."""
results = []
for player_id in player_ids:
player_data = df[df['player_id'] == player_id]
player_name = player_data['player_name'].iloc[0]
stats = player_data[stat_col]
results.append({
'player_name': player_name,
'games': len(stats),
'mean': stats.mean(),
'median': stats.median(),
'std': stats.std(),
'cv': stats.std() / stats.mean() * 100, # Coefficient of variation
'min': stats.min(),
'max': stats.max(),
'range': stats.max() - stats.min()
})
return pd.DataFrame(results)
consistency_df = analyze_performance_consistency(
game_logs, mvp_candidates, 'points'
)
print(consistency_df.to_string())
# Visualize consistency with box plots
def plot_consistency_comparison(df, player_ids, stat_col, title=None,
figsize=(12, 6)):
"""Create box plots comparing performance consistency."""
fig, ax = plt.subplots(figsize=figsize)
data = []
labels = []
for player_id in player_ids:
player_data = df[df['player_id'] == player_id]
data.append(player_data[stat_col].dropna())
labels.append(player_data['player_name'].iloc[0].split()[-1]) # Last name
bp = ax.boxplot(data, labels=labels, patch_artist=True)
colors = plt.cm.Set2(np.linspace(0, 1, len(player_ids)))
for patch, color in zip(bp['boxes'], colors):
patch.set_facecolor(color)
ax.set_ylabel(stat_col.replace('_', ' ').title())
ax.set_title(title or f'{stat_col.title()} Consistency Comparison')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
return fig
fig = plot_consistency_comparison(game_logs, mvp_candidates, 'points',
title='Scoring Consistency - MVP Candidates')
plt.savefig('mvp_scoring_consistency.png', dpi=150)
plt.close()
4.6.5 Monthly and Segment Analysis
Breaking the season into segments reveals performance patterns:
def analyze_monthly_performance(df, player_id, stat_cols):
"""Analyze player performance by month."""
player_data = df[df['player_id'] == player_id].copy()
player_data['month'] = player_data['game_date'].dt.month
player_data['month_name'] = player_data['game_date'].dt.strftime('%B')
monthly_stats = player_data.groupby(['month', 'month_name'])[stat_cols].agg(
['mean', 'std', 'count']
).round(2)
return monthly_stats
# Monthly breakdown for a star player
monthly_stats = analyze_monthly_performance(
game_logs, 'jokic_nikola', ['points', 'rebounds', 'assists']
)
print(monthly_stats)
def plot_monthly_performance(df, player_id, stat_col, title=None,
figsize=(12, 6)):
"""Visualize monthly performance trends."""
player_data = df[df['player_id'] == player_id].copy()
player_data['month'] = player_data['game_date'].dt.to_period('M')
player_name = player_data['player_name'].iloc[0]
    monthly = player_data.groupby('month')[stat_col].agg(['mean', 'std', 'count'])
    monthly['std'] = monthly['std'].fillna(0)  # single-game months have no std
    monthly.index = monthly.index.astype(str)
fig, ax = plt.subplots(figsize=figsize)
# Bar chart with error bars
x = range(len(monthly))
bars = ax.bar(x, monthly['mean'], yerr=monthly['std'],
capsize=5, alpha=0.7, color='steelblue')
# Add game counts on top of bars
for i, (mean, count) in enumerate(zip(monthly['mean'], monthly['count'])):
ax.text(i, mean + monthly['std'].iloc[i] + 1, f'n={count}',
ha='center', fontsize=9)
ax.set_xticks(x)
ax.set_xticklabels(monthly.index, rotation=45, ha='right')
ax.set_xlabel('Month')
ax.set_ylabel(stat_col.replace('_', ' ').title())
ax.set_title(title or f'{player_name} - Monthly {stat_col.title()}')
plt.tight_layout()
return fig
fig = plot_monthly_performance(game_logs, 'jokic_nikola', 'points',
title='Nikola Jokic - Monthly Scoring Average')
plt.savefig('jokic_monthly_scoring.png', dpi=150)
plt.close()
4.7 Shot Chart Creation and Spatial Analysis
4.7.1 Understanding Shot Location Data
Shot location data provides x-y coordinates for each shot attempt, enabling spatial analysis:
# Load shot data
shots = pd.read_csv('nba_shots_2023_24.csv')
# Inspect shot data structure
print(shots.info())
print(shots[['player_name', 'shot_x', 'shot_y', 'shot_made',
'shot_type', 'shot_distance']].head(10))
# Shot coordinates here are assumed to be in feet from the basket
# NBA court is 94 feet long and 50 feet wide
# Basket center is at (0, 0) with positive y toward half court
print(f"X range: {shots['shot_x'].min()} to {shots['shot_x'].max()}")
print(f"Y range: {shots['shot_y'].min()} to {shots['shot_y'].max()}")
4.7.2 Drawing the Basketball Court
Creating an accurate court representation is essential for shot charts:
def draw_basketball_court(ax=None, color='black', lw=2, outer_lines=False):
"""Draw a basketball half-court on a matplotlib axes."""
if ax is None:
fig, ax = plt.subplots(figsize=(12, 11))
# Create court elements
# Hoop
hoop = plt.Circle((0, 0), radius=0.75, linewidth=lw, color=color,
fill=False)
# Backboard
backboard = plt.Rectangle((-3, -0.75), 6, 0, linewidth=lw, color=color)
# Paint (outer box)
outer_box = plt.Rectangle((-8, -5.25), 16, 19, linewidth=lw,
color=color, fill=False)
# Paint (inner box)
inner_box = plt.Rectangle((-6, -5.25), 12, 19, linewidth=lw,
color=color, fill=False)
# Free throw top arc
top_free_throw = plt.Arc((0, 13.75), 12, 12, theta1=0, theta2=180,
linewidth=lw, color=color, fill=False)
# Free throw bottom arc
bottom_free_throw = plt.Arc((0, 13.75), 12, 12, theta1=180, theta2=0,
linewidth=lw, color=color, linestyle='dashed')
# Restricted zone arc
restricted = plt.Arc((0, 0), 8, 8, theta1=0, theta2=180,
linewidth=lw, color=color)
# Three point line
corner_three_left = plt.Rectangle((-22, -5.25), 0, 14, linewidth=lw,
color=color)
corner_three_right = plt.Rectangle((22, -5.25), 0, 14, linewidth=lw,
color=color)
three_arc = plt.Arc((0, 0), 47.5, 47.5, theta1=22, theta2=158,
linewidth=lw, color=color)
    # Center court (the half-court line is 47 ft from the baseline, which
    # sits at y = -5.25 in hoop-centered coordinates, so y = 41.75 here)
    center_outer_arc = plt.Arc((0, 41.75), 12, 12, theta1=180, theta2=0,
                               linewidth=lw, color=color)
    center_inner_arc = plt.Arc((0, 41.75), 4, 4, theta1=180, theta2=0,
                               linewidth=lw, color=color)
# Add elements to axes
court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
bottom_free_throw, restricted, corner_three_left,
corner_three_right, three_arc, center_outer_arc,
center_inner_arc]
    if outer_lines:
        outer_lines_rect = plt.Rectangle((-25, -5.25), 50, 47,
                                         linewidth=lw, color=color, fill=False)
court_elements.append(outer_lines_rect)
for element in court_elements:
ax.add_patch(element)
    # Set axis limits and properties (half court: 50 ft wide, 47 ft deep)
    ax.set_xlim(-25, 25)
    ax.set_ylim(-5.25, 41.75)
ax.set_aspect('equal')
ax.axis('off')
return ax
4.7.3 Basic Shot Charts
Creating simple shot charts that display made and missed shots:
def create_basic_shot_chart(shots_df, player_name, title=None, figsize=(12, 11)):
"""Create a basic shot chart showing makes and misses."""
player_shots = shots_df[shots_df['player_name'] == player_name].copy()
fig, ax = plt.subplots(figsize=figsize)
draw_basketball_court(ax)
# Separate makes and misses
makes = player_shots[player_shots['shot_made'] == 1]
misses = player_shots[player_shots['shot_made'] == 0]
# Plot misses first (so makes appear on top)
ax.scatter(misses['shot_x'], misses['shot_y'], c='red', marker='x',
s=30, alpha=0.6, label=f'Miss ({len(misses)})')
# Plot makes
ax.scatter(makes['shot_x'], makes['shot_y'], c='green', marker='o',
s=30, alpha=0.6, label=f'Make ({len(makes)})')
# Calculate shooting percentage
fg_pct = len(makes) / len(player_shots) * 100 if len(player_shots) > 0 else 0
ax.set_title(title or f'{player_name} Shot Chart\n'
f'{len(player_shots)} shots, {fg_pct:.1f}% FG')
ax.legend(loc='upper right')
plt.tight_layout()
return fig
# Create shot chart for a player
fig = create_basic_shot_chart(shots, 'Stephen Curry',
title='Stephen Curry Shot Chart (2023-24)')
plt.savefig('curry_shot_chart_basic.png', dpi=150)
plt.close()
4.7.4 Hexbin Shot Charts
Hexagonal binning aggregates shots into zones, revealing shooting patterns more clearly:
def create_hexbin_shot_chart(shots_df, player_name, stat='frequency',
title=None, figsize=(12, 11)):
"""Create a hexbin shot chart showing shooting patterns."""
player_shots = shots_df[shots_df['player_name'] == player_name].copy()
fig, ax = plt.subplots(figsize=figsize)
if stat == 'frequency':
# Show shot frequency
hexbin = ax.hexbin(player_shots['shot_x'], player_shots['shot_y'],
gridsize=25, cmap='YlOrRd', mincnt=1)
cb = plt.colorbar(hexbin, ax=ax, label='Shot Attempts')
elif stat == 'efficiency':
# Show shooting percentage
hexbin = ax.hexbin(player_shots['shot_x'], player_shots['shot_y'],
C=player_shots['shot_made'], gridsize=25,
cmap='RdYlGn', mincnt=3, reduce_C_function=np.mean)
cb = plt.colorbar(hexbin, ax=ax, label='FG%')
# Draw court on top
draw_basketball_court(ax, color='black', lw=1)
ax.set_title(title or f'{player_name} Shot Chart ({stat.title()})')
plt.tight_layout()
return fig
# Create frequency and efficiency shot charts
fig = create_hexbin_shot_chart(shots, 'Stephen Curry', stat='frequency',
title='Stephen Curry Shot Frequency')
plt.savefig('curry_shot_chart_frequency.png', dpi=150)
plt.close()
fig = create_hexbin_shot_chart(shots, 'Stephen Curry', stat='efficiency',
title='Stephen Curry Shot Efficiency')
plt.savefig('curry_shot_chart_efficiency.png', dpi=150)
plt.close()
4.7.5 Kernel Density Shot Charts
KDE-based shot charts provide smooth heat maps of shot locations:
def create_kde_shot_chart(shots_df, player_name, title=None,
figsize=(12, 11), bw_adjust=0.5):
"""Create a KDE-based shot density chart."""
player_shots = shots_df[shots_df['player_name'] == player_name].copy()
fig, ax = plt.subplots(figsize=figsize)
# Create KDE plot
sns.kdeplot(data=player_shots, x='shot_x', y='shot_y',
fill=True, cmap='YlOrRd', levels=50, thresh=0.05,
bw_adjust=bw_adjust, ax=ax)
# Draw court
draw_basketball_court(ax, color='black', lw=1)
# Add shot count
ax.set_title(title or f'{player_name} Shot Density\n'
f'({len(player_shots)} total shots)')
plt.tight_layout()
return fig
fig = create_kde_shot_chart(shots, 'Stephen Curry',
title='Stephen Curry Shot Density (2023-24)')
plt.savefig('curry_shot_chart_kde.png', dpi=150)
plt.close()
4.7.6 Zone-Based Analysis
Dividing the court into zones enables structured shooting analysis:
def classify_shot_zone(row):
    """Classify a shot into a court zone based on coordinates."""
    x, y = row['shot_x'], row['shot_y']
    distance = row['shot_distance']
    # Three-pointers first: the corner three is only 22 ft from the hoop
    # (|x| >= 22 below the break at y = 8.75), so a pure distance cutoff
    # would misclassify corner threes as mid-range
    if abs(x) >= 22 and y <= 8.75:
        return 'Corner 3 Left' if x < 0 else 'Corner 3 Right'
    if distance >= 23.75:
        return 'Above Break 3'
    # Restricted area
    if distance <= 4:
        return 'Restricted Area'
    # Paint (non-restricted), out to the free throw line
    if abs(x) <= 8 and y <= 14:
        return 'Paint (Non-RA)'
    # Everything else inside the arc is mid-range
    if x < -8:
        return 'Mid-Range Left'
    if x > 8:
        return 'Mid-Range Right'
    return 'Mid-Range Center'
# Apply zone classification
shots['shot_zone'] = shots.apply(classify_shot_zone, axis=1)
def analyze_zone_shooting(shots_df, player_name):
"""Analyze shooting by zone for a player."""
player_shots = shots_df[shots_df['player_name'] == player_name]
zone_stats = player_shots.groupby('shot_zone').agg(
attempts=('shot_made', 'count'),
makes=('shot_made', 'sum'),
fg_pct=('shot_made', 'mean')
).round(3)
zone_stats['fg_pct'] = (zone_stats['fg_pct'] * 100).round(1)
zone_stats['pct_of_shots'] = (zone_stats['attempts'] /
zone_stats['attempts'].sum() * 100).round(1)
return zone_stats.sort_values('attempts', ascending=False)
zone_analysis = analyze_zone_shooting(shots, 'Stephen Curry')
print("Stephen Curry Zone Shooting Analysis:")
print(zone_analysis)
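Zone percentages are most meaningful against a baseline. Since the shots frame contains every player, a league-wide zone average computed from the same data makes a natural reference point, as sketched here:
# Compare the player's zone FG% to the league-wide rate in each zone
league_zone_fg = (shots.groupby('shot_zone')['shot_made'].mean() * 100).round(1)
zone_analysis['league_fg_pct'] = league_zone_fg
zone_analysis['fg_pct_diff'] = (zone_analysis['fg_pct'] -
                                zone_analysis['league_fg_pct']).round(1)
print(zone_analysis)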
4.7.7 Comparative Shot Charts
Comparing shooting patterns between players or time periods:
def create_comparison_shot_chart(shots_df, player1, player2,
title=None, figsize=(20, 10)):
"""Create side-by-side shot charts for comparison."""
fig, axes = plt.subplots(1, 2, figsize=figsize)
for ax, player in zip(axes, [player1, player2]):
player_shots = shots_df[shots_df['player_name'] == player]
# Create hexbin
hexbin = ax.hexbin(player_shots['shot_x'], player_shots['shot_y'],
gridsize=20, cmap='YlOrRd', mincnt=1)
# Draw court
draw_basketball_court(ax, color='black', lw=1)
# Calculate stats
fg_pct = player_shots['shot_made'].mean() * 100
ax.set_title(f'{player}\n{len(player_shots)} shots, {fg_pct:.1f}% FG')
# Add colorbar
plt.colorbar(hexbin, ax=axes, label='Shot Attempts', shrink=0.7)
if title:
fig.suptitle(title, fontsize=14, y=1.02)
plt.tight_layout()
return fig
fig = create_comparison_shot_chart(shots, 'Stephen Curry', 'Luka Doncic',
title='Shot Chart Comparison: Curry vs Doncic')
plt.savefig('curry_vs_doncic_shot_charts.png', dpi=150, bbox_inches='tight')
plt.close()
4.7.8 Expected Points and Shot Quality
Incorporating expected value concepts into shot chart analysis:
def calculate_expected_points(shots_df):
"""Calculate expected points for each shot based on location."""
shots_df = shots_df.copy()
# Calculate zone-based expected FG%
zone_fg = shots_df.groupby('shot_zone')['shot_made'].mean()
shots_df['zone_expected_fg'] = shots_df['shot_zone'].map(zone_fg)
    # Determine point value from the zone classification (corner threes sit
    # inside 23.75 ft, so a raw distance cutoff would undercount them)
    shots_df['point_value'] = np.where(
        shots_df['shot_zone'].str.contains('3'), 3, 2
    )
# Calculate expected points
shots_df['expected_points'] = (shots_df['zone_expected_fg'] *
shots_df['point_value'])
# Actual points
shots_df['actual_points'] = shots_df['shot_made'] * shots_df['point_value']
return shots_df
shots_with_expected = calculate_expected_points(shots)
def analyze_shot_quality(shots_df, player_name):
"""Analyze shot quality for a player."""
player_shots = shots_df[shots_df['player_name'] == player_name]
analysis = {
'total_shots': len(player_shots),
'expected_points_per_shot': player_shots['expected_points'].mean(),
'actual_points_per_shot': player_shots['actual_points'].mean(),
'shot_quality_differential': (player_shots['actual_points'].mean() -
player_shots['expected_points'].mean()),
'total_expected_points': player_shots['expected_points'].sum(),
'total_actual_points': player_shots['actual_points'].sum()
}
return pd.Series(analysis)
curry_shot_quality = analyze_shot_quality(shots_with_expected, 'Stephen Curry')
print("Stephen Curry Shot Quality Analysis:")
print(curry_shot_quality)
4.8 Putting It All Together: A Complete EDA Workflow
4.8.1 Structured EDA Approach
A systematic approach ensures thorough exploration:
class BasketballEDA:
"""A structured class for conducting EDA on basketball data."""
def __init__(self, df, player_col='player_name'):
self.df = df.copy()
self.player_col = player_col
self.numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
self.categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
def summary_report(self):
"""Generate a comprehensive summary report."""
print("=" * 60)
print("BASKETBALL DATA EXPLORATORY ANALYSIS REPORT")
print("=" * 60)
print(f"\nDataset Shape: {self.df.shape[0]} rows, {self.df.shape[1]} columns")
print(f"Numeric Columns: {len(self.numeric_cols)}")
print(f"Categorical Columns: {len(self.categorical_cols)}")
print("\n--- Missing Values ---")
missing = self.df.isnull().sum()
if missing.sum() > 0:
print(missing[missing > 0])
else:
print("No missing values detected")
print("\n--- Numeric Summary ---")
print(self.df[self.numeric_cols].describe().T.round(2))
print("\n--- Categorical Summary ---")
for col in self.categorical_cols[:5]: # Limit output
print(f"\n{col}:")
print(self.df[col].value_counts().head())
def correlation_analysis(self, threshold=0.7):
"""Identify highly correlated features."""
corr = self.df[self.numeric_cols].corr()
# Find pairs above threshold
high_corr = []
for i in range(len(corr.columns)):
for j in range(i+1, len(corr.columns)):
if abs(corr.iloc[i, j]) >= threshold:
high_corr.append({
'var1': corr.columns[i],
'var2': corr.columns[j],
'correlation': corr.iloc[i, j]
})
return pd.DataFrame(high_corr).sort_values('correlation',
ascending=False, key=abs)
    def outlier_detection(self, columns=None, method='iqr', threshold=1.5):
        """Detect outliers via IQR (threshold is the IQR multiplier; 1.5 is
        conventional) or z-score (pass threshold=3 for the usual cutoff)."""
if columns is None:
columns = self.numeric_cols
outliers = {}
for col in columns:
if method == 'iqr':
Q1 = self.df[col].quantile(0.25)
Q3 = self.df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - threshold * IQR
upper = Q3 + threshold * IQR
outlier_mask = (self.df[col] < lower) | (self.df[col] > upper)
            elif method == 'zscore':
                from scipy import stats
                col_data = self.df[col].dropna()
                z_scores = pd.Series(np.abs(stats.zscore(col_data)),
                                     index=col_data.index)
                # Realign to the full index so rows with NaN are never flagged
                outlier_mask = z_scores.reindex(self.df.index,
                                                fill_value=0) > threshold
            outliers[col] = self.df.loc[outlier_mask, col]
return outliers
def generate_visualizations(self, output_dir='eda_output'):
"""Generate a standard set of visualizations."""
import os
os.makedirs(output_dir, exist_ok=True)
# Distribution plots for key numeric variables
key_stats = ['points', 'rebounds', 'assists', 'minutes']
available_stats = [s for s in key_stats if s in self.numeric_cols]
if available_stats:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, stat in enumerate(available_stats[:4]):
ax = axes[idx // 2, idx % 2]
self.df[stat].hist(ax=ax, bins=30, edgecolor='black')
ax.set_title(f'Distribution of {stat.title()}')
ax.set_xlabel(stat.title())
plt.tight_layout()
plt.savefig(f'{output_dir}/distributions.png', dpi=150)
plt.close()
# Correlation heatmap
fig, ax = plt.subplots(figsize=(12, 10))
corr = self.df[self.numeric_cols].corr()
sns.heatmap(corr, annot=False, cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Correlation Matrix')
plt.tight_layout()
plt.savefig(f'{output_dir}/correlation_heatmap.png', dpi=150)
plt.close()
print(f"Visualizations saved to {output_dir}/")
# Example usage
eda = BasketballEDA(player_stats_clean)
eda.summary_report()
high_corr = eda.correlation_analysis(threshold=0.8)
print("\nHighly Correlated Features:")
print(high_corr)
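The remaining methods round out the workflow; a brief usage sketch:
# Flag extreme scorers with the IQR rule and write the standard plot set
point_outliers = eda.outlier_detection(columns=['points'], method='iqr')
print(f"\nPoints outliers (IQR rule): {len(point_outliers['points'])}")
print(point_outliers['points'].sort_values(ascending=False).head())
eda.generate_visualizations(output_dir='eda_output')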
4.8.2 Documentation and Reproducibility
Creating reproducible analysis through proper documentation:
def create_eda_notebook_template(output_path):
"""Generate a template for EDA documentation."""
template = '''
# NBA Data Exploratory Data Analysis
## Analysis Date: {date}
### 1. Data Loading and Initial Inspection
- Dataset source: [Specify source]
- Number of records: [N]
- Number of features: [N]
- Time period covered: [Dates]
### 2. Data Quality Assessment
- Missing values identified: [Details]
- Duplicate records: [Count]
- Data type issues: [List]
### 3. Data Cleaning Steps
1. [Step 1]
2. [Step 2]
3. [Step 3]
### 4. Key Findings from Distribution Analysis
- [Finding 1]
- [Finding 2]
### 5. Relationship Analysis
- Strongest correlations: [List]
- Key relationships discovered: [Details]
### 6. Notable Outliers
- [Player/observation 1]
- [Player/observation 2]
### 7. Recommendations for Further Analysis
- [Recommendation 1]
- [Recommendation 2]
'''
from datetime import datetime
template = template.format(date=datetime.now().strftime('%Y-%m-%d'))
with open(output_path, 'w') as f:
f.write(template)
print(f"Template saved to {output_path}")
create_eda_notebook_template('eda_documentation_template.md')
Summary
Exploratory Data Analysis forms the critical foundation for all basketball analytics work. This chapter has equipped you with the tools and techniques to systematically examine NBA data, from initial loading and inspection through sophisticated spatial analysis of shot charts.
The key skills developed in this chapter include:
- Data Loading and Inspection: Using pandas to efficiently load, examine, and understand basketball datasets, including identifying data types and creating derived features.
- Data Cleaning: Implementing comprehensive data cleaning procedures that address common issues in basketball data while respecting the sport's inherent constraints.
- Missing Value Handling: Understanding the mechanisms behind missing data in basketball contexts and applying appropriate imputation strategies.
- Distribution Visualization: Creating informative histograms, box plots, and violin plots that reveal the shape and characteristics of basketball statistics.
- Relationship Analysis: Building scatter plots, correlation matrices, and pair plots that uncover meaningful relationships between variables.
- Time Series Analysis: Tracking player and team performance over time through rolling averages, cumulative statistics, and trend analysis.
- Spatial Analysis: Creating professional shot charts using various techniques including basic plots, hexbin aggregation, and KDE-based heat maps.
These EDA skills serve as prerequisites for the predictive modeling and advanced analytics techniques covered in subsequent chapters. By thoroughly understanding your data through systematic exploration, you establish a solid foundation for extracting actionable insights that can influence real basketball decisions.
The code examples provided throughout this chapter are designed to be directly applicable to real NBA datasets. As you work through the exercises and case studies that follow, you will gain hands-on experience applying these techniques to authentic basketball analytics scenarios.