Case Study 1: Exploring the Evolution of NBA Shooting

Overview

Scenario: ESPN is producing a special feature on how NBA shooting has changed over the past decade. They need a comprehensive exploratory analysis showing how shot selection has evolved, how three-point volume has grown, and how far the mid-range shot has declined. Your analysis will inform both the visual storytelling and the statistical claims made in the broadcast.

Duration: 3-4 hours
Difficulty: Intermediate
Prerequisites: Chapter 4 concepts, pandas proficiency, matplotlib/seaborn experience


Background

The NBA has undergone a dramatic transformation in shot selection philosophy. The "analytics revolution" has led teams to favor three-pointers and shots at the rim over mid-range jumpers. Your task is to quantify and visualize this evolution using exploratory data analysis.

Key questions to address:

  1. How has league-wide shot distribution changed from 2014 to 2024?
  2. Which teams have led or lagged in this transition?
  3. How have three-point volume and efficiency changed?
  4. What happened to mid-range shooting?
  5. Are there player archetypes that resist these trends?


Part 1: Data Loading and Initial Exploration

1.1 Loading Multi-Season Shot Data

"""
NBA Shooting Evolution Analysis
Case Study 1 - Chapter 4: Exploratory Data Analysis
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from typing import Dict, List, Tuple

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

def load_shot_data(seasons: List[str], data_dir: Path) -> pd.DataFrame:
    """
    Load and combine shot data from multiple seasons.

    Args:
        seasons: List of season identifiers (e.g., ['2014-15', '2015-16'])
        data_dir: Path to directory containing shot data files

    Returns:
        Combined DataFrame with all seasons
    """
    all_shots = []

    for season in seasons:
        filepath = data_dir / f"shots_{season.replace('-', '_')}.parquet"

        if filepath.exists():
            df = pd.read_parquet(filepath)
            df['SEASON'] = season
            all_shots.append(df)
            print(f"Loaded {len(df):,} shots from {season}")
        else:
            print(f"Warning: No data file for {season}")

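    # pd.concat raises an unhelpful ValueError on an empty list, so fail early
    if not all_shots:
        raise FileNotFoundError(f"No shot data files found in {data_dir}")

    combined = pd.concat(all_shots, ignore_index=True)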
    print(f"\nTotal shots loaded: {len(combined):,}")

    return combined


# Define seasons to analyze
SEASONS = [
    '2014-15', '2015-16', '2016-17', '2017-18', '2018-19',
    '2019-20', '2020-21', '2021-22', '2022-23', '2023-24'
]

# Load data (example path - adjust for your setup)
# shots_df = load_shot_data(SEASONS, Path('./data/shots/'))
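
Because the load call above is commented out, it can help to have a small stand-in table for smoke-testing the functions in this case study before real data is available. The sketch below is one way to build it. The helper name `make_toy_shots`, the team labels, and the value ranges are all hypothetical, and the output is random noise, not NBA data, so it should only be used to check that code runs.

```python
import numpy as np
import pandas as pd


def make_toy_shots(seasons, shots_per_season=500, seed=0):
    """Build a small synthetic shot table for smoke tests (NOT real NBA data)."""
    rng = np.random.default_rng(seed)
    frames = []
    for season in seasons:
        n = shots_per_season
        frames.append(pd.DataFrame({
            'SEASON': season,
            # Coordinates in tenths of a foot with the hoop at (0, 0) (assumed convention)
            'LOC_X': rng.integers(-250, 251, size=n),
            'LOC_Y': rng.integers(-50, 300, size=n),
            'SHOT_MADE_FLAG': rng.integers(0, 2, size=n),
            'PLAYER_ID': rng.integers(1, 21, size=n),
            'TEAM_NAME': rng.choice(['Team A', 'Team B'], size=n),
        }))
    return pd.concat(frames, ignore_index=True)


toy = make_toy_shots(['2014-15', '2023-24'])
print(toy.groupby('SEASON').size())
```

Any trend that appears in this table is an artifact of the random generator, so conclusions should come only from the loaded shot files.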

1.2 Initial Data Inspection

def inspect_shot_data(df: pd.DataFrame) -> Dict:
    """
    Perform comprehensive inspection of shot data.

    Args:
        df: Shot data DataFrame

    Returns:
        Dictionary containing inspection results
    """
    inspection = {
        'shape': df.shape,
        'columns': df.columns.tolist(),
        'dtypes': df.dtypes.to_dict(),
        'missing': df.isnull().sum().to_dict(),
        'seasons': df['SEASON'].unique().tolist() if 'SEASON' in df.columns else None
    }

    # Print summary
    print("=" * 60)
    print("SHOT DATA INSPECTION REPORT")
    print("=" * 60)
    print(f"\nDataset Shape: {inspection['shape'][0]:,} rows x {inspection['shape'][1]} columns")

    print(f"\nSeasons Covered: {len(inspection['seasons']) if inspection['seasons'] else 'N/A'}")
    if inspection['seasons']:
        print(f"  From {min(inspection['seasons'])} to {max(inspection['seasons'])}")

    # Check for key columns
    key_columns = ['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG', 'SHOT_TYPE',
                   'SHOT_ZONE_BASIC', 'SHOT_DISTANCE']

    print("\nKey Column Availability:")
    for col in key_columns:
        status = "Present" if col in df.columns else "MISSING"
        print(f"  {col}: {status}")

    # Sample data
    print("\nSample Records:")
    print(df.head())

    return inspection
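
The inspection report prints what is there; a complementary step is to collect concrete problems that would silently distort later percentages. The following is a minimal sketch under the assumption that the four columns in `REQUIRED` are the ones the rest of the analysis depends on. The function name `find_data_problems` and the specific checks are illustrative choices, not part of the original workflow.

```python
import pandas as pd

# Columns assumed to be essential for the later analysis steps
REQUIRED = ['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG', 'SEASON']


def find_data_problems(df: pd.DataFrame, expected_seasons: list) -> list:
    """Return a list of human-readable data problems (empty if none found)."""
    problems = []

    missing_cols = [c for c in REQUIRED if c not in df.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")
        return problems  # the remaining checks need these columns

    missing_seasons = sorted(set(expected_seasons) - set(df['SEASON'].unique()))
    if missing_seasons:
        problems.append(f"missing seasons: {missing_seasons}")

    if not df['SHOT_MADE_FLAG'].dropna().isin([0, 1]).all():
        problems.append("SHOT_MADE_FLAG has values other than 0/1")

    n_null = int(df[REQUIRED].isnull().any(axis=1).sum())
    if n_null:
        problems.append(f"{n_null} rows with nulls in required columns")

    return problems


demo = pd.DataFrame({
    'LOC_X': [0, 10],
    'LOC_Y': [5, None],
    'SHOT_MADE_FLAG': [1, 2],
    'SEASON': ['2014-15', '2014-15'],
})
problems = find_data_problems(demo, ['2014-15', '2015-16'])
for p in problems:
    print(p)
```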


def create_shot_zones(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create standardized shot zone classifications.

    Args:
        df: Shot data with LOC_X and LOC_Y columns (SHOT_DISTANCE is
            computed from them if absent)

    Returns:
        DataFrame with added zone columns
    """
    df = df.copy()

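    # Convert coordinates to feet. This assumes LOC_X/LOC_Y are stored in
    # tenths of a foot, measured from the hoop; skip the division otherwise.
    df['X_FEET'] = df['LOC_X'] / 10.0
    df['Y_FEET'] = df['LOC_Y'] / 10.0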

    # Calculate distance if not present
    if 'SHOT_DISTANCE' not in df.columns:
        df['SHOT_DISTANCE'] = np.sqrt(df['X_FEET']**2 + df['Y_FEET']**2)

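    # Define simplified zones
    def classify_shot(row):
        distance = row['SHOT_DISTANCE']
        x = abs(row['X_FEET'])
        y = row['Y_FEET']

        # Restricted area
        if distance <= 4:
            return 'Restricted Area'

        # Corner three: the line is only 22 ft from the hoop here, so this
        # must be checked before the 23.75 ft cutoff, otherwise corner
        # threes between 22 and 23.75 ft are mislabeled as mid-range
        if x >= 22 and y < 8.75:
            return 'Corner 3'

        # Above-the-break three
        if distance >= 23.75:
            return 'Above Break 3'

        # Paint (non-RA)
        if x <= 8 and y <= 14:
            return 'Paint (Non-RA)'

        # Mid-range
        if distance < 16:
            return 'Short Mid-Range'
        return 'Long Mid-Range'

    df['ZONE'] = df.apply(classify_shot, axis=1)

    # Create binary flags from ZONE so they always agree with the zone labels.
    # IS_MIDRANGE covers every two-point shot outside the restricted area
    # (including the non-RA paint), so the three flags sum to 1 for each shot.
    is_three = df['ZONE'].isin(['Corner 3', 'Above Break 3'])
    is_at_rim = df['ZONE'] == 'Restricted Area'
    df['IS_THREE'] = is_three.astype(int)
    df['IS_MIDRANGE'] = (~is_three & ~is_at_rim).astype(int)
    df['IS_AT_RIM'] = is_at_rim.astype(int)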

    return df
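
Row-wise `df.apply` can be slow on a decade of shots, which may run to millions of rows. The sketch below shows one possible vectorized alternative using `np.select`, which evaluates the conditions in order and keeps the first match. It assumes `X_FEET`, `Y_FEET`, and `SHOT_DISTANCE` already exist and are hoop-relative, and it tests the corner three before the 23.75 ft cutoff because the corner line sits 22 ft from the hoop. The function name `classify_zones_vectorized` is hypothetical.

```python
import numpy as np
import pandas as pd


def classify_zones_vectorized(df: pd.DataFrame) -> pd.Series:
    """Assign simplified shot zones without a row-wise apply."""
    x = df['X_FEET'].abs()
    y = df['Y_FEET']
    dist = df['SHOT_DISTANCE']

    # np.select keeps the first condition that is True for each row
    conditions = [
        dist <= 4,                  # restricted area
        (x >= 22) & (y < 8.75),     # corner three (22 ft line)
        dist >= 23.75,              # above-the-break three
        (x <= 8) & (y <= 14),       # paint outside the restricted area
        dist < 16,                  # short mid-range
    ]
    choices = ['Restricted Area', 'Corner 3', 'Above Break 3',
               'Paint (Non-RA)', 'Short Mid-Range']
    zones = np.select(conditions, choices, default='Long Mid-Range')
    return pd.Series(zones, index=df.index, name='ZONE')


# Six hand-picked locations, one per zone
demo = pd.DataFrame({
    'X_FEET': [0.0, 5.0, 12.0, 15.0, 22.5, 0.0],
    'Y_FEET': [2.0, 10.0, 5.0, 12.0, 3.0, 25.0],
})
demo['SHOT_DISTANCE'] = np.hypot(demo['X_FEET'], demo['Y_FEET'])
demo['ZONE'] = classify_zones_vectorized(demo)
print(demo)
```

The fifth row is the case worth checking by hand: it is about 22.7 ft from the hoop, inside the 23.75 ft arc, yet it is a three because it sits beyond the corner line.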

Part 2: Distribution Analysis

2.1 Shot Volume by Zone Over Time

def analyze_zone_trends(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze how shot distribution across zones has changed over time.

    Args:
        df: Shot data with ZONE and SEASON columns

    Returns:
        DataFrame with zone percentages by season
    """
    # Calculate shots per zone per season
    zone_counts = df.groupby(['SEASON', 'ZONE']).size().reset_index(name='SHOTS')

    # Calculate total shots per season
    season_totals = df.groupby('SEASON').size().reset_index(name='TOTAL')

    # Merge and calculate percentages
    zone_pcts = zone_counts.merge(season_totals, on='SEASON')
    zone_pcts['PCT'] = (zone_pcts['SHOTS'] / zone_pcts['TOTAL'] * 100).round(1)

    # Pivot for easier analysis
    pivot = zone_pcts.pivot(index='SEASON', columns='ZONE', values='PCT')

    return pivot
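
The same season-by-zone percentage table can be produced in one call with `pd.crosstab` and `normalize='index'`. The sketch below uses a tiny hand-made table (eight invented shots, not real data) so the arithmetic can be checked by eye.

```python
import pandas as pd

# Eight invented shots: four per season
toy = pd.DataFrame({
    'SEASON': ['2014-15'] * 4 + ['2023-24'] * 4,
    'ZONE': ['Long Mid-Range', 'Long Mid-Range', 'Restricted Area', 'Above Break 3',
             'Above Break 3', 'Above Break 3', 'Restricted Area', 'Long Mid-Range'],
})

# normalize='index' divides each row by its own total, giving shares per season
zone_share = pd.crosstab(toy['SEASON'], toy['ZONE'], normalize='index') * 100
print(zone_share.round(1))
```

One practical difference from the groupby/merge/pivot route: `crosstab` fills season/zone combinations with no shots with 0, whereas `pivot` leaves them as NaN, which matters when the table is passed to a stacked plot.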


def plot_zone_evolution(zone_pivot: pd.DataFrame, figsize: Tuple = (14, 8)):
    """
    Create stacked area chart showing zone evolution.

    Args:
        zone_pivot: Pivoted DataFrame with zones as columns
        figsize: Figure dimensions
    """
    fig, ax = plt.subplots(figsize=figsize)

    # Define zone order (rim to three)
    zone_order = ['Restricted Area', 'Paint (Non-RA)', 'Short Mid-Range',
                  'Long Mid-Range', 'Corner 3', 'Above Break 3']

    # Filter to existing zones
    available_zones = [z for z in zone_order if z in zone_pivot.columns]

    # Create stacked area chart
    zone_pivot[available_zones].plot(
        kind='area',
        stacked=True,
        ax=ax,
        alpha=0.8,
        colormap='RdYlGn_r'
    )

    ax.set_xlabel('Season')
    ax.set_ylabel('Percentage of Shots')
    ax.set_title('Evolution of NBA Shot Distribution by Zone (2014-2024)')
    ax.legend(title='Shot Zone', bbox_to_anchor=(1.02, 1), loc='upper left')
    ax.set_ylim(0, 100)

    plt.tight_layout()
    return fig


def plot_three_point_trend(df: pd.DataFrame, figsize: Tuple = (12, 6)):
    """
    Visualize the rise of three-point shooting.

    Args:
        df: Shot data with IS_THREE and SHOT_MADE_FLAG columns
        figsize: Figure dimensions
    """
    # Calculate three-point stats by season
    three_stats = df[df['IS_THREE'] == 1].groupby('SEASON').agg({
        'SHOT_MADE_FLAG': ['count', 'sum', 'mean']
    }).reset_index()

    three_stats.columns = ['SEASON', 'ATTEMPTS', 'MAKES', 'PCT']

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

    # Volume plot
    ax1.bar(three_stats['SEASON'], three_stats['ATTEMPTS'], color='steelblue', alpha=0.8)
    ax1.set_xlabel('Season')
    ax1.set_ylabel('Three-Point Attempts')
    ax1.set_title('League-Wide 3PA Volume by Season')
    ax1.tick_params(axis='x', rotation=45)

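    # Calculate change from the first to the last season. Raw totals depend on
    # season length (2019-20 and 2020-21 were shortened), so compare endpoints
    # with full schedules or use per-game figures.
    pct_growth = ((three_stats['ATTEMPTS'].iloc[-1] / three_stats['ATTEMPTS'].iloc[0]) - 1) * 100
    ax1.annotate(f'{pct_growth:+.0f}% change',
                 xy=(0.95, 0.95), xycoords='axes fraction',
                 ha='right', va='top', fontsize=12, fontweight='bold')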

    # Efficiency plot
    ax2.plot(three_stats['SEASON'], three_stats['PCT'] * 100,
             marker='o', linewidth=2, markersize=8, color='green')
    ax2.set_xlabel('Season')
    ax2.set_ylabel('Three-Point Percentage')
    ax2.set_title('League-Wide 3P% by Season')
    ax2.tick_params(axis='x', rotation=45)
    ax2.set_ylim(34, 38)
    ax2.axhline(y=three_stats['PCT'].mean() * 100, color='red',
                linestyle='--', alpha=0.5, label='Period Average')
    ax2.legend()

    plt.tight_layout()
    return fig
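
Season totals of three-point attempts are not directly comparable when seasons differ in length, as the shortened 2019-20 and 2020-21 seasons do. A per-game figure avoids that. The sketch below assumes the shot table has a `GAME_ID` column, which this case study has not confirmed, and the function name is hypothetical. The toy numbers are invented to show a season with fewer total attempts but a higher per-game rate.

```python
import pandas as pd


def three_point_attempts_per_team_game(df: pd.DataFrame) -> pd.Series:
    """Three-point attempts per team per game, by season."""
    games = df.groupby('SEASON')['GAME_ID'].nunique()
    attempts = df[df['IS_THREE'] == 1].groupby('SEASON').size()
    # Two teams play in every game, so divide by 2 * number of games
    per_team_game = attempts.reindex(games.index, fill_value=0) / (games * 2)
    return per_team_game.rename('3PA_PER_TEAM_GAME')


# Invented data: the second season has fewer games and fewer total threes
toy = pd.DataFrame({
    'SEASON': ['2018-19'] * 10 + ['2019-20'] * 6,
    'GAME_ID': [1] * 5 + [2] * 5 + [3] * 6,
    'IS_THREE': [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
})
per_game = three_point_attempts_per_team_game(toy)
print(per_game)
```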

2.2 Mid-Range Extinction Analysis

def analyze_midrange_decline(df: pd.DataFrame) -> pd.DataFrame:
    """
    Document the decline of mid-range shooting.

    Args:
        df: Shot data with zone classifications

    Returns:
        DataFrame showing mid-range statistics by season
    """
    # Filter to mid-range shots
    midrange = df[df['IS_MIDRANGE'] == 1].copy()

    # Stats by season
    stats = midrange.groupby('SEASON').agg({
        'SHOT_MADE_FLAG': ['count', 'mean'],
        'PLAYER_ID': 'nunique'
    }).reset_index()

    stats.columns = ['SEASON', 'ATTEMPTS', 'FG_PCT', 'PLAYERS']

    # Calculate per-player averages
    all_players = df.groupby('SEASON')['PLAYER_ID'].nunique().reset_index()
    all_players.columns = ['SEASON', 'TOTAL_PLAYERS']

    stats = stats.merge(all_players, on='SEASON')
    stats['ATTEMPTS_PER_PLAYER'] = (stats['ATTEMPTS'] / stats['TOTAL_PLAYERS']).round(1)

    return stats


def plot_midrange_extinction(stats: pd.DataFrame, figsize: Tuple = (14, 5)):
    """
    Visualize the decline of mid-range shooting.

    Args:
        stats: Mid-range statistics by season
        figsize: Figure dimensions
    """
    fig, axes = plt.subplots(1, 3, figsize=figsize)

    # Total attempts
    axes[0].bar(stats['SEASON'], stats['ATTEMPTS'], color='indianred', alpha=0.8)
    axes[0].set_xlabel('Season')
    axes[0].set_ylabel('Mid-Range Attempts')
    axes[0].set_title('League-Wide Mid-Range Volume')
    axes[0].tick_params(axis='x', rotation=45)

    # Per-player attempts
    axes[1].plot(stats['SEASON'], stats['ATTEMPTS_PER_PLAYER'],
                 marker='s', linewidth=2, color='darkred')
    axes[1].set_xlabel('Season')
    axes[1].set_ylabel('Mid-Range Attempts per Player')
    axes[1].set_title('Per-Player Mid-Range Volume')
    axes[1].tick_params(axis='x', rotation=45)

    # Efficiency
    axes[2].plot(stats['SEASON'], stats['FG_PCT'] * 100,
                 marker='o', linewidth=2, color='orange')
    axes[2].set_xlabel('Season')
    axes[2].set_ylabel('Field Goal Percentage')
    axes[2].set_title('Mid-Range Efficiency')
    axes[2].tick_params(axis='x', rotation=45)
    axes[2].set_ylim(38, 44)

    plt.tight_layout()
    return fig

Part 3: Team-Level Analysis

3.1 Team Shot Profile Comparison

def create_team_shot_profiles(df: pd.DataFrame, season: str) -> pd.DataFrame:
    """
    Create shot profile summary for each team in a season.

    Args:
        df: Shot data
        season: Season to analyze

    Returns:
        DataFrame with team shot profiles
    """
    season_df = df[df['SEASON'] == season].copy()

    profiles = season_df.groupby('TEAM_NAME').agg({
        'SHOT_MADE_FLAG': ['count', 'mean'],
        'IS_THREE': 'mean',
        'IS_MIDRANGE': 'mean',
        'IS_AT_RIM': 'mean'
    }).reset_index()

    profiles.columns = ['TEAM', 'TOTAL_SHOTS', 'FG_PCT',
                        'THREE_RATE', 'MIDRANGE_RATE', 'RIM_RATE']

    # Calculate expected points per shot
    # Simplified: rim shots ~60%, mid-range ~40%, threes ~36%
    profiles['EXPECTED_PTS'] = (
        profiles['RIM_RATE'] * 0.60 * 2 +
        profiles['MIDRANGE_RATE'] * 0.40 * 2 +
        profiles['THREE_RATE'] * 0.36 * 3
    )

    return profiles.sort_values('THREE_RATE', ascending=False)
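
The `EXPECTED_PTS` column relies on fixed league-wide make rates (60%, 40%, 36%). A complementary check is to measure points per shot directly from the data, which shows whether those constants are reasonable for the seasons loaded. This is a minimal sketch assuming the `IS_THREE`, `IS_AT_RIM`, and `SHOT_MADE_FLAG` columns described above. The function name and the ten-shot table are invented for illustration.

```python
import numpy as np
import pandas as pd


def points_per_shot(df: pd.DataFrame) -> pd.Series:
    """Observed points per attempt for threes, rim shots, and mid-range."""
    shot_type = np.select(
        [df['IS_THREE'] == 1, df['IS_AT_RIM'] == 1],
        ['Three', 'At Rim'],
        default='Mid-Range',
    )
    # A make is worth 3 points on a three and 2 points otherwise
    points = df['SHOT_MADE_FLAG'] * np.where(df['IS_THREE'] == 1, 3, 2)
    return points.groupby(shot_type).mean().rename('PTS_PER_SHOT')


# Ten invented shots: 4 threes (2 made), 2 rim shots (1 made), 4 mid-range (1 made)
toy = pd.DataFrame({
    'IS_THREE':       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'IS_AT_RIM':      [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    'SHOT_MADE_FLAG': [1, 1, 0, 0, 1, 0, 1, 0, 0, 0],
})
pps = points_per_shot(toy)
print(pps)
```

With the simplified constants, the implied values are 0.60 * 2 = 1.20 points per rim attempt, 0.40 * 2 = 0.80 per mid-range attempt, and 0.36 * 3 = 1.08 per three, which is the arithmetic behind the preference for rim shots and threes.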


def plot_team_evolution(df: pd.DataFrame, teams: List[str],
                        figsize: Tuple = (14, 6)):
    """
    Compare shot evolution for specific teams.

    Args:
        df: Shot data
        teams: List of team names to compare
        figsize: Figure dimensions
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

    colors = plt.cm.tab10(np.linspace(0, 1, len(teams)))

    for team, color in zip(teams, colors):
        team_df = df[df['TEAM_NAME'] == team]

        # Three-point rate by season
        three_rate = team_df.groupby('SEASON')['IS_THREE'].mean() * 100
        ax1.plot(three_rate.index, three_rate.values,
                 marker='o', label=team, color=color, linewidth=2)

        # Mid-range rate by season
        mid_rate = team_df.groupby('SEASON')['IS_MIDRANGE'].mean() * 100
        ax2.plot(mid_rate.index, mid_rate.values,
                 marker='s', label=team, color=color, linewidth=2)

    ax1.set_xlabel('Season')
    ax1.set_ylabel('Three-Point Rate (%)')
    ax1.set_title('Three-Point Rate Evolution by Team')
    ax1.legend()
    ax1.tick_params(axis='x', rotation=45)

    ax2.set_xlabel('Season')
    ax2.set_ylabel('Mid-Range Rate (%)')
    ax2.set_title('Mid-Range Rate Evolution by Team')
    ax2.legend()
    ax2.tick_params(axis='x', rotation=45)

    plt.tight_layout()
    return fig
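
To identify which teams led or lagged the transition, rather than only plotting a hand-picked list, one option is to rank every team by the change in its three-point rate between two seasons. The sketch below assumes `TEAM_NAME`, `SEASON`, and `IS_THREE` columns. The function name is hypothetical and the sixteen shots are invented.

```python
import pandas as pd


def three_rate_change_by_team(df: pd.DataFrame, first_season: str,
                              last_season: str) -> pd.DataFrame:
    """Three-point rate (%) per team in two seasons, plus the change in points."""
    subset = df[df['SEASON'].isin([first_season, last_season])]
    rates = (subset.groupby(['TEAM_NAME', 'SEASON'])['IS_THREE']
             .mean()
             .unstack('SEASON') * 100)
    rates['CHANGE_PTS'] = rates[last_season] - rates[first_season]
    return rates.sort_values('CHANGE_PTS', ascending=False)


# Invented shots: Team A triples its three-point rate, Team B stays flat
toy = pd.DataFrame({
    'TEAM_NAME': ['Team A'] * 8 + ['Team B'] * 8,
    'SEASON': (['2014-15'] * 4 + ['2023-24'] * 4) * 2,
    'IS_THREE': [1, 0, 0, 0, 1, 1, 1, 0,
                 1, 1, 0, 0, 1, 1, 0, 0],
})
change = three_rate_change_by_team(toy, '2014-15', '2023-24')
print(change)
```

Team names may not be stable across seasons in every data source, so grouping on a team identifier column, if one is available, is likely the safer key.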

Part 4: Shot Chart Visualizations

4.1 League-Wide Shot Distribution Heat Maps

def create_era_comparison_shot_charts(df: pd.DataFrame,
                                       figsize: Tuple = (18, 8)):
    """
    Create side-by-side shot chart comparisons across eras.

    Args:
        df: Shot data with coordinates
        figsize: Figure dimensions
    """
    # Define era groupings
    early_era = df[df['SEASON'].isin(['2014-15', '2015-16'])]
    recent_era = df[df['SEASON'].isin(['2022-23', '2023-24'])]

    fig, axes = plt.subplots(1, 2, figsize=figsize)

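    # Draw court helper function (simplified; hoop at the origin, units in feet,
    # matching the hoop-relative X_FEET/Y_FEET shot coordinates)
    def draw_court(ax):
        baseline_y = -5.25  # hoop center sits 5.25 ft in from the baseline

        # Three-point arc: only the part above the corner lines (|x| <= 22)
        theta_0 = np.arccos(22 / 23.75)
        theta = np.linspace(theta_0, np.pi - theta_0, 100)
        x = 23.75 * np.cos(theta)
        y = 23.75 * np.sin(theta)
        ax.plot(x, y, 'k-', linewidth=2)

        # Corner threes: straight lines from the baseline up to the arc
        ax.plot([-22, -22], [baseline_y, y[-1]], 'k-', linewidth=2)
        ax.plot([22, 22], [baseline_y, y[0]], 'k-', linewidth=2)

        # Paint (16 ft wide; free-throw line 19 ft from the baseline)
        ft_line_y = baseline_y + 19
        ax.plot([-8, -8, 8, 8], [baseline_y, ft_line_y, ft_line_y, baseline_y],
                'k-', linewidth=2)

        # Free throw circle
        circle = plt.Circle((0, ft_line_y), 6, fill=False, color='k', linewidth=2)
        ax.add_patch(circle)

        # Restricted area
        ra_arc = plt.Circle((0, 0), 4, fill=False, color='k', linewidth=2)
        ax.add_patch(ra_arc)

        ax.set_xlim(-25, 25)
        ax.set_ylim(-6, 42)
        ax.set_aspect('equal')
        ax.axis('off')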

    # Plot early era
    draw_court(axes[0])
    axes[0].hexbin(early_era['X_FEET'], early_era['Y_FEET'],
                   gridsize=30, cmap='YlOrRd', mincnt=100)
    axes[0].set_title(f'Shot Distribution: 2014-16\n({len(early_era):,} shots)')

    # Plot recent era
    draw_court(axes[1])
    hb = axes[1].hexbin(recent_era['X_FEET'], recent_era['Y_FEET'],
                        gridsize=30, cmap='YlOrRd', mincnt=100)
    axes[1].set_title(f'Shot Distribution: 2022-24\n({len(recent_era):,} shots)')

    # Add colorbar
    plt.colorbar(hb, ax=axes, label='Shot Frequency', shrink=0.7)

    plt.suptitle('The Evolution of NBA Shot Selection', fontsize=16, y=1.02)
    plt.tight_layout()

    return fig
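
The two hexbin panels are each colored on their own scale and the eras contain different numbers of shots, so matching colors do not imply matching frequencies. One way to make the eras directly comparable is to bin both onto the same grid, convert counts to shares, and subtract. The sketch below uses `np.histogram2d`. The function name, the 2 ft bin size, and the random coordinates are assumptions for illustration only.

```python
import numpy as np


def shot_share_grid(x_feet, y_feet, bins=(25, 26)):
    """Share of all shots falling in each cell of a fixed half-court grid."""
    counts, x_edges, y_edges = np.histogram2d(
        x_feet, y_feet, bins=bins, range=[[-25, 25], [-5, 47]]
    )
    total = counts.sum()
    share = counts / total if total > 0 else counts
    return share, x_edges, y_edges


# Random stand-in coordinates (not real shots), with different sample sizes
rng = np.random.default_rng(0)
early_share, x_edges, y_edges = shot_share_grid(
    rng.uniform(-20, 20, 1000), rng.uniform(0, 30, 1000))
recent_share, _, _ = shot_share_grid(
    rng.uniform(-24, 24, 1500), rng.uniform(0, 30, 1500))

# Positive cells gained share in the recent era; negative cells lost share
share_diff = recent_share - early_share
print(share_diff.shape, round(float(share_diff.sum()), 6))
```

To draw the result, `ax.pcolormesh(x_edges, y_edges, share_diff.T, cmap='RdBu_r')` is one option. The transpose is needed because `histogram2d` puts x along the first axis.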

Part 5: Key Findings Summary

5.1 Automated Findings Report

def generate_findings_report(df: pd.DataFrame) -> str:
    """
    Generate an automated summary of key findings.

    Args:
        df: Analyzed shot data

    Returns:
        Formatted string report
    """
    # Calculate key statistics
    first_season = df['SEASON'].min()
    last_season = df['SEASON'].max()

    first = df[df['SEASON'] == first_season]
    last = df[df['SEASON'] == last_season]

    early_3rate = first['IS_THREE'].mean() * 100
    late_3rate = last['IS_THREE'].mean() * 100

    early_mid = first['IS_MIDRANGE'].mean() * 100
    late_mid = last['IS_MIDRANGE'].mean() * 100

    early_rim = first['IS_AT_RIM'].mean() * 100
    late_rim = last['IS_AT_RIM'].mean() * 100

    # Effective FG% counts a made three as 1.5 makes
    def efg_pct(shots: pd.DataFrame) -> float:
        return (shots['SHOT_MADE_FLAG'] * (1 + 0.5 * shots['IS_THREE'])).mean() * 100

    early_efg = efg_pct(first)
    late_efg = efg_pct(last)

    report = f"""
================================================================================
NBA SHOOTING EVOLUTION: KEY FINDINGS
Analysis Period: {first_season} to {last_season}
Total Shots Analyzed: {len(df):,}
================================================================================

1. THREE-POINT REVOLUTION
   - Three-point rate increased from {early_3rate:.1f}% to {late_3rate:.1f}%
   - This represents a {((late_3rate/early_3rate) - 1) * 100:.0f}% relative increase
   - The league now takes approximately {late_3rate:.0f} three-pointers per 100 shots

2. MID-RANGE EXTINCTION
   - Mid-range rate declined from {early_mid:.1f}% to {late_mid:.1f}%
   - This represents a {((1 - late_mid/early_mid)) * 100:.0f}% relative decrease
   - The decline of the mid-range jumper is visible directly in the shot data

3. RIM PRESSURE
   - At-rim rate moved from {early_rim:.1f}% to {late_rim:.1f}%
   - If this rate is roughly flat, the shift has been primarily mid-range -> three-point

4. EFFICIENCY TRENDS
   - Effective field goal percentage moved from {early_efg:.1f}% to {late_efg:.1f}%
   - Shot data alone cannot separate better shooters from better shot selection

================================================================================
IMPLICATIONS FOR PRODUCTION:
- Visual storytelling should emphasize the dramatic shift in shot selection
- Side-by-side shot charts from 2014 vs 2024 illustrate the change vividly
- Key quote angle: "The math changed basketball"
================================================================================
"""

    return report
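
Since the findings feed statistical claims in a broadcast, it may be worth attaching an uncertainty range to each headline change. The sketch below computes a normal-approximation confidence interval for the difference between two proportions, for example the three-point rate in the first and last season. The function name is hypothetical and the counts in the example are invented, not measured.

```python
import math


def rate_change_ci(count_a, n_a, count_b, n_b, z=1.96):
    """Difference in proportions (b minus a) with a normal-approximation CI."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    diff = p_b - p_a
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


# Invented counts: 2,700 threes in 10,000 shots vs 3,900 threes in 10,000 shots
diff, low, high = rate_change_ci(2700, 10_000, 3900, 10_000)
print(f"change: {diff:.3f} (95% CI {low:.3f} to {high:.3f})")
```

This treats every shot as independent. Shots cluster by player, team, and game, so the true uncertainty is probably wider, and with season-sized samples even trivial changes will look statistically significant. The size of the change matters more for the story than the p-value.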

Discussion Questions

Question 1: Causation vs Correlation

The data shows teams shooting more threes are often more successful. Does shooting more threes cause success, or do more talented teams simply shoot more threes?

Question 2: Data Limitations

What limitations exist in using shot chart data to tell the story of basketball evolution? What additional data would strengthen the analysis?

Question 3: Outlier Analysis

Are there successful teams or players who don't follow the three-point revolution? What explains their success?
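
A possible starting point for this question is to list high-volume players whose shot mix is still dominated by the mid-range. The sketch below assumes `PLAYER_ID`, `SEASON`, `IS_MIDRANGE`, and `IS_THREE` columns. The function name and the `min_attempts` threshold of 500 are arbitrary illustrative choices, and the nine shots are invented.

```python
import pandas as pd


def midrange_heavy_players(df: pd.DataFrame, season: str,
                           min_attempts: int = 500, top_n: int = 10) -> pd.DataFrame:
    """Players with the highest mid-range share among those with enough attempts."""
    season_df = df[df['SEASON'] == season]
    profile = season_df.groupby('PLAYER_ID').agg(
        ATTEMPTS=('IS_MIDRANGE', 'size'),
        MIDRANGE_RATE=('IS_MIDRANGE', 'mean'),
        THREE_RATE=('IS_THREE', 'mean'),
    )
    # Drop low-volume players, whose rates are mostly noise
    qualified = profile[profile['ATTEMPTS'] >= min_attempts]
    return qualified.sort_values('MIDRANGE_RATE', ascending=False).head(top_n)


# Invented shots: player 3 has a single attempt and should be filtered out
toy = pd.DataFrame({
    'SEASON': ['2023-24'] * 9,
    'PLAYER_ID': [1, 1, 1, 1, 2, 2, 2, 2, 3],
    'IS_MIDRANGE': [1, 1, 1, 0, 1, 0, 0, 0, 1],
    'IS_THREE': [0, 0, 0, 1, 0, 1, 1, 1, 0],
})
outliers = midrange_heavy_players(toy, '2023-24', min_attempts=3)
print(outliers)
```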

Question 4: Future Projections

Based on the trends observed, what might NBA shooting look like in 5 years? What factors could accelerate or reverse the trends?


Deliverables

  1. Exploratory Analysis Notebook: Complete Jupyter notebook with all analysis
  2. Visualization Suite: Publication-ready figures for each key finding
  3. Summary Statistics: Tables showing decade-long trends
  4. Shot Chart Comparison: Visual comparing early-era (2014-16) and recent-era (2022-24) shot selection
  5. Findings Report: Written summary suitable for production team

Key Takeaways

  1. EDA reveals narratives: The data clearly shows the transformation in NBA play style
  2. Multiple visualization types: Different plots reveal different aspects of the evolution
  3. Context matters: Zone definitions and era comparisons require careful thought
  4. Aggregation choices matter: Player-level, team-level, and league-level views tell different stories
  5. Quantification strengthens storytelling: Precise numbers make the narrative compelling