Case Study 1: Exploring the Evolution of NBA Shooting
Overview
Scenario: ESPN is producing a special feature on how NBA shooting has changed over the past decade. They need a comprehensive exploratory analysis showing how shot selection has evolved, how three-point volume has grown, and how far mid-range shooting has declined. Your analysis will inform both the visual storytelling and the statistical claims made in the broadcast.
Duration: 3-4 hours
Difficulty: Intermediate
Prerequisites: Chapter 4 concepts, pandas proficiency, matplotlib/seaborn experience
Background
The NBA has undergone a dramatic transformation in shot selection philosophy. The "analytics revolution" has led teams to favor three-pointers and shots at the rim over mid-range jumpers. Your task is to quantify and visualize this evolution using exploratory data analysis.
Key questions to address:
1. How has league-wide shot distribution changed from 2014 to 2024?
2. Which teams have led or lagged in this transition?
3. How have three-point volume and efficiency changed?
4. What happened to mid-range shooting?
5. Are there player archetypes that resist these trends?
Part 1: Data Loading and Initial Exploration
1.1 Loading Multi-Season Shot Data
"""
NBA Shooting Evolution Analysis
Case Study 1 - Chapter 4: Exploratory Data Analysis
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from typing import Dict, List, Tuple
# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
def load_shot_data(seasons: List[str], data_dir: Path) -> pd.DataFrame:
    """
    Load and combine shot data from multiple seasons.

    Args:
        seasons: List of season identifiers (e.g., ['2014-15', '2015-16'])
        data_dir: Path to directory containing shot data files

    Returns:
        Combined DataFrame with all seasons
    """
    all_shots = []

    for season in seasons:
        filepath = data_dir / f"shots_{season.replace('-', '_')}.parquet"

        if filepath.exists():
            df = pd.read_parquet(filepath)
            df['SEASON'] = season
            all_shots.append(df)
            print(f"Loaded {len(df):,} shots from {season}")
        else:
            print(f"Warning: No data file for {season}")

    # pd.concat raises an opaque ValueError on an empty list, so fail early
    if not all_shots:
        raise FileNotFoundError(f"No shot data files found in {data_dir}")

    combined = pd.concat(all_shots, ignore_index=True)
    print(f"\nTotal shots loaded: {len(combined):,}")

    return combined
# Define seasons to analyze
SEASONS = [
'2014-15', '2015-16', '2016-17', '2017-18', '2018-19',
'2019-20', '2020-21', '2021-22', '2022-23', '2023-24'
]
# Load data (example path - adjust for your setup)
# shots_df = load_shot_data(SEASONS, Path('./data/shots/'))
1.2 Initial Data Inspection
def inspect_shot_data(df: pd.DataFrame) -> Dict:
    """
    Perform comprehensive inspection of shot data.

    Args:
        df: Shot data DataFrame

    Returns:
        Dictionary containing inspection results
    """
    inspection = {
        'shape': df.shape,
        'columns': df.columns.tolist(),
        'dtypes': df.dtypes.to_dict(),
        'missing': df.isnull().sum().to_dict(),
        'seasons': df['SEASON'].unique().tolist() if 'SEASON' in df.columns else None
    }

    # Print summary
    print("=" * 60)
    print("SHOT DATA INSPECTION REPORT")
    print("=" * 60)
    print(f"\nDataset Shape: {inspection['shape'][0]:,} rows x {inspection['shape'][1]} columns")
    print(f"\nSeasons Covered: {len(inspection['seasons']) if inspection['seasons'] else 'N/A'}")
    if inspection['seasons']:
        print(f" From {min(inspection['seasons'])} to {max(inspection['seasons'])}")

    # Check for key columns
    key_columns = ['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG', 'SHOT_TYPE',
                   'SHOT_ZONE_BASIC', 'SHOT_DISTANCE']
    print("\nKey Column Availability:")
    for col in key_columns:
        status = "Present" if col in df.columns else "MISSING"
        print(f" {col}: {status}")

    # Sample data
    print("\nSample Records:")
    print(df.head())

    return inspection
def create_shot_zones(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create standardized shot zone classifications.

    Args:
        df: Shot data with LOC_X and LOC_Y columns (tenths of a foot,
            hoop at the origin); SHOT_DISTANCE is added if missing

    Returns:
        DataFrame with added zone columns
    """
    df = df.copy()

    # Convert coordinates from tenths of a foot to feet
    df['X_FEET'] = df['LOC_X'] / 10.0
    df['Y_FEET'] = df['LOC_Y'] / 10.0

    # Distance from the hoop, recomputed from coordinates. A SHOT_DISTANCE
    # column rounded to whole feet is too coarse for the 23.75 ft cutoff.
    df['DIST_FEET'] = np.sqrt(df['X_FEET']**2 + df['Y_FEET']**2)
    if 'SHOT_DISTANCE' not in df.columns:
        df['SHOT_DISTANCE'] = df['DIST_FEET']

    # Define simplified zones
    def classify_shot(row):
        distance = row['DIST_FEET']
        x = abs(row['X_FEET'])
        y = row['Y_FEET']

        # Restricted area
        if distance <= 4:
            return 'Restricted Area'

        # Corner three: the line is only 22 ft from the hoop here, so this
        # check must run before the 23.75 ft comparison
        if x >= 22 and y < 8.75:
            return 'Corner 3'

        # Above-the-break three
        if distance >= 23.75:
            return 'Above Break 3'

        # Paint (non-RA)
        if x <= 8 and y <= 14:
            return 'Paint (Non-RA)'

        # Mid-range
        if distance < 16:
            return 'Short Mid-Range'
        return 'Long Mid-Range'

    df['ZONE'] = df.apply(classify_shot, axis=1)

    # Create binary flags from ZONE so the flags and zones always agree.
    # IS_MIDRANGE covers every two-point shot outside the restricted area,
    # including the non-RA paint, so the three flags sum to 1 for each shot.
    df['IS_THREE'] = df['ZONE'].isin(['Corner 3', 'Above Break 3']).astype(int)
    df['IS_AT_RIM'] = (df['ZONE'] == 'Restricted Area').astype(int)
    df['IS_MIDRANGE'] = (1 - df['IS_THREE'] - df['IS_AT_RIM']).astype(int)

    return df
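The corner is the one place where a pure distance test gets the shot value wrong, because the three-point line there is 22 ft from the hoop rather than 23.75 ft. The sketch below uses only those two distances to compute where the arc meets the corner line and then checks one corner attempt. The sample coordinates are made up for illustration.

```python
import math

ARC_RADIUS_FT = 23.75   # three-point distance above the break
CORNER_X_FT = 22.0      # three-point distance in the corners

# Height above the hoop line where the 23.75 ft arc meets x = 22 ft
break_y = math.sqrt(ARC_RADIUS_FT**2 - CORNER_X_FT**2)
print(f"Arc meets the corner line at y = {break_y:.2f} ft")

# Hypothetical corner attempt: 22.5 ft to the side, 2 ft above the hoop line
x_ft, y_ft = 22.5, 2.0
distance = math.hypot(x_ft, y_ft)

behind_corner_line = abs(x_ft) >= CORNER_X_FT and y_ft < break_y
passes_distance_test = distance >= ARC_RADIUS_FT

print(f"Distance from hoop: {distance:.2f} ft")
print(f"Three by the corner rule:   {behind_corner_line}")
print(f"Three by distance >= 23.75: {passes_distance_test}")
```

The computed break point is close to the 8.75 ft cutoff used in `classify_shot`. The sample attempt is a three by the corner rule but a two by the distance test, so any zone logic needs the corner check to take priority over the 23.75 ft comparison.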
Part 2: Distribution Analysis
2.1 Shot Volume by Zone Over Time
def analyze_zone_trends(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze how shot distribution across zones has changed over time.

    Args:
        df: Shot data with ZONE and SEASON columns

    Returns:
        DataFrame with zone percentages by season
    """
    # Calculate shots per zone per season
    zone_counts = df.groupby(['SEASON', 'ZONE']).size().reset_index(name='SHOTS')

    # Calculate total shots per season
    season_totals = df.groupby('SEASON').size().reset_index(name='TOTAL')

    # Merge and calculate percentages
    zone_pcts = zone_counts.merge(season_totals, on='SEASON')
    zone_pcts['PCT'] = (zone_pcts['SHOTS'] / zone_pcts['TOTAL'] * 100).round(1)

    # Pivot for easier analysis
    pivot = zone_pcts.pivot(index='SEASON', columns='ZONE', values='PCT')

    return pivot
def plot_zone_evolution(zone_pivot: pd.DataFrame, figsize: Tuple = (14, 8)):
    """
    Create stacked area chart showing zone evolution.

    Args:
        zone_pivot: Pivoted DataFrame with zones as columns
        figsize: Figure dimensions
    """
    fig, ax = plt.subplots(figsize=figsize)

    # Define zone order (rim to three)
    zone_order = ['Restricted Area', 'Paint (Non-RA)', 'Short Mid-Range',
                  'Long Mid-Range', 'Corner 3', 'Above Break 3']

    # Filter to existing zones
    available_zones = [z for z in zone_order if z in zone_pivot.columns]

    # Create stacked area chart
    zone_pivot[available_zones].plot(
        kind='area',
        stacked=True,
        ax=ax,
        alpha=0.8,
        colormap='RdYlGn_r'
    )

    ax.set_xlabel('Season')
    ax.set_ylabel('Percentage of Shots')
    ax.set_title('Evolution of NBA Shot Distribution by Zone (2014-2024)')
    ax.legend(title='Shot Zone', bbox_to_anchor=(1.02, 1), loc='upper left')
    ax.set_ylim(0, 100)

    plt.tight_layout()
    return fig
def plot_three_point_trend(df: pd.DataFrame, figsize: Tuple = (12, 6)):
    """
    Visualize the rise of three-point shooting.

    Args:
        df: Shot data with IS_THREE and SHOT_MADE_FLAG columns
        figsize: Figure dimensions
    """
    # Calculate three-point stats by season
    three_stats = df[df['IS_THREE'] == 1].groupby('SEASON').agg({
        'SHOT_MADE_FLAG': ['count', 'sum', 'mean']
    }).reset_index()
    three_stats.columns = ['SEASON', 'ATTEMPTS', 'MAKES', 'PCT']

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

    # Volume plot
    ax1.bar(three_stats['SEASON'], three_stats['ATTEMPTS'], color='steelblue', alpha=0.8)
    ax1.set_xlabel('Season')
    ax1.set_ylabel('Three-Point Attempts')
    ax1.set_title('League-Wide 3PA Volume by Season')
    ax1.tick_params(axis='x', rotation=45)

    # Calculate change from the first to the last season; the '+' format
    # flag prints the correct sign whether attempts rose or fell
    pct_growth = ((three_stats['ATTEMPTS'].iloc[-1] / three_stats['ATTEMPTS'].iloc[0]) - 1) * 100
    ax1.annotate(f'{pct_growth:+.0f}% change',
                 xy=(0.95, 0.95), xycoords='axes fraction',
                 ha='right', va='top', fontsize=12, fontweight='bold')

    # Efficiency plot
    ax2.plot(three_stats['SEASON'], three_stats['PCT'] * 100,
             marker='o', linewidth=2, markersize=8, color='green')
    ax2.set_xlabel('Season')
    ax2.set_ylabel('Three-Point Percentage')
    ax2.set_title('League-Wide 3P% by Season')
    ax2.tick_params(axis='x', rotation=45)
    # Fixed limits: widen them if any season falls outside 34-38%
    ax2.set_ylim(34, 38)
    ax2.axhline(y=three_stats['PCT'].mean() * 100, color='red',
                linestyle='--', alpha=0.5, label='Period Average')
    ax2.legend()

    plt.tight_layout()
    return fig
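Raw attempt totals depend on how many games a season contains, and the 2019-20 and 2020-21 seasons were shortened. One hedged alternative is to report attempts per game alongside the three-point share of all shots. The sketch below assumes a `GAME_ID` column identifies each game (an assumption; it is not among the key columns checked during inspection) and runs on a tiny made-up frame, not real data. The helper name `threes_per_game` is hypothetical.

```python
import pandas as pd

# Tiny made-up sample: two seasons with different numbers of games
shots = pd.DataFrame({
    'SEASON':   ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'GAME_ID':  [1, 1, 1, 2, 2, 3, 10, 10, 10, 11],
    'IS_THREE': [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})

def threes_per_game(df: pd.DataFrame) -> pd.DataFrame:
    """Three-point attempts per game and as a share of all shots, by season."""
    summary = df.groupby('SEASON').agg(
        THREE_ATTEMPTS=('IS_THREE', 'sum'),
        TOTAL_SHOTS=('IS_THREE', 'count'),
        GAMES=('GAME_ID', 'nunique'),
    )
    summary['THREES_PER_GAME'] = summary['THREE_ATTEMPTS'] / summary['GAMES']
    summary['THREE_RATE'] = summary['THREE_ATTEMPTS'] / summary['TOTAL_SHOTS']
    return summary

per_game = threes_per_game(shots)
print(per_game)
```

Both made-up seasons contain three attempts in total, yet season B attempts more threes per game and takes a larger share of its shots from three. The same distortion can affect a bar chart of raw attempts when seasons differ in length.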
2.2 Mid-Range Extinction Analysis
def analyze_midrange_decline(df: pd.DataFrame) -> pd.DataFrame:
    """
    Document the decline of mid-range shooting.

    Args:
        df: Shot data with zone classifications

    Returns:
        DataFrame showing mid-range statistics by season
    """
    # Filter to mid-range shots
    midrange = df[df['IS_MIDRANGE'] == 1].copy()

    # Stats by season
    stats = midrange.groupby('SEASON').agg({
        'SHOT_MADE_FLAG': ['count', 'mean'],
        'PLAYER_ID': 'nunique'
    }).reset_index()
    stats.columns = ['SEASON', 'ATTEMPTS', 'FG_PCT', 'PLAYERS']

    # Calculate per-player averages
    all_players = df.groupby('SEASON')['PLAYER_ID'].nunique().reset_index()
    all_players.columns = ['SEASON', 'TOTAL_PLAYERS']
    stats = stats.merge(all_players, on='SEASON')
    stats['ATTEMPTS_PER_PLAYER'] = (stats['ATTEMPTS'] / stats['TOTAL_PLAYERS']).round(1)

    return stats
def plot_midrange_extinction(stats: pd.DataFrame, figsize: Tuple = (14, 5)):
    """
    Visualize the decline of mid-range shooting.

    Args:
        stats: Mid-range statistics by season
        figsize: Figure dimensions
    """
    fig, axes = plt.subplots(1, 3, figsize=figsize)

    # Total attempts
    axes[0].bar(stats['SEASON'], stats['ATTEMPTS'], color='indianred', alpha=0.8)
    axes[0].set_xlabel('Season')
    axes[0].set_ylabel('Mid-Range Attempts')
    axes[0].set_title('League-Wide Mid-Range Volume')
    axes[0].tick_params(axis='x', rotation=45)

    # Per-player attempts
    axes[1].plot(stats['SEASON'], stats['ATTEMPTS_PER_PLAYER'],
                 marker='s', linewidth=2, color='darkred')
    axes[1].set_xlabel('Season')
    axes[1].set_ylabel('Mid-Range Attempts per Player')
    axes[1].set_title('Per-Player Mid-Range Volume')
    axes[1].tick_params(axis='x', rotation=45)

    # Efficiency
    axes[2].plot(stats['SEASON'], stats['FG_PCT'] * 100,
                 marker='o', linewidth=2, color='orange')
    axes[2].set_xlabel('Season')
    axes[2].set_ylabel('Field Goal Percentage')
    axes[2].set_title('Mid-Range Efficiency')
    axes[2].tick_params(axis='x', rotation=45)
    axes[2].set_ylim(38, 44)

    plt.tight_layout()
    return fig
Part 3: Team-Level Analysis
3.1 Team Shot Profile Comparison
def create_team_shot_profiles(df: pd.DataFrame, season: str) -> pd.DataFrame:
    """
    Create shot profile summary for each team in a season.

    Args:
        df: Shot data
        season: Season to analyze

    Returns:
        DataFrame with team shot profiles
    """
    season_df = df[df['SEASON'] == season].copy()

    profiles = season_df.groupby('TEAM_NAME').agg({
        'SHOT_MADE_FLAG': ['count', 'mean'],
        'IS_THREE': 'mean',
        'IS_MIDRANGE': 'mean',
        'IS_AT_RIM': 'mean'
    }).reset_index()
    profiles.columns = ['TEAM', 'TOTAL_SHOTS', 'FG_PCT',
                        'THREE_RATE', 'MIDRANGE_RATE', 'RIM_RATE']

    # Calculate expected points per shot
    # Simplified: rim shots ~60%, mid-range ~40%, threes ~36%
    profiles['EXPECTED_PTS'] = (
        profiles['RIM_RATE'] * 0.60 * 2 +
        profiles['MIDRANGE_RATE'] * 0.40 * 2 +
        profiles['THREE_RATE'] * 0.36 * 3
    )

    return profiles.sort_values('THREE_RATE', ascending=False)
def plot_team_evolution(df: pd.DataFrame, teams: List[str],
                        figsize: Tuple = (14, 6)):
    """
    Compare shot evolution for specific teams.

    Args:
        df: Shot data
        teams: List of team names to compare
        figsize: Figure dimensions
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

    colors = plt.cm.tab10(np.linspace(0, 1, len(teams)))

    for team, color in zip(teams, colors):
        team_df = df[df['TEAM_NAME'] == team]

        # Three-point rate by season
        three_rate = team_df.groupby('SEASON')['IS_THREE'].mean() * 100
        ax1.plot(three_rate.index, three_rate.values,
                 marker='o', label=team, color=color, linewidth=2)

        # Mid-range rate by season
        mid_rate = team_df.groupby('SEASON')['IS_MIDRANGE'].mean() * 100
        ax2.plot(mid_rate.index, mid_rate.values,
                 marker='s', label=team, color=color, linewidth=2)

    ax1.set_xlabel('Season')
    ax1.set_ylabel('Three-Point Rate (%)')
    ax1.set_title('Three-Point Rate Evolution by Team')
    ax1.legend()
    ax1.tick_params(axis='x', rotation=45)

    ax2.set_xlabel('Season')
    ax2.set_ylabel('Mid-Range Rate (%)')
    ax2.set_title('Mid-Range Rate Evolution by Team')
    ax2.legend()
    ax2.tick_params(axis='x', rotation=45)

    plt.tight_layout()
    return fig
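The expected-points column in `create_team_shot_profiles` is a weighted sum of point value times make probability. As a worked illustration with the same simplified league percentages (60% at the rim, 40% mid-range, 36% from three), the sketch below compares two shot mixes. Both mixes are invented for the example and do not describe real teams; the helper names are hypothetical.

```python
# Simplified make probabilities and point values, as in the profile function
ZONE_VALUES = {
    'rim':      {'fg_pct': 0.60, 'points': 2},
    'midrange': {'fg_pct': 0.40, 'points': 2},
    'three':    {'fg_pct': 0.36, 'points': 3},
}

def points_per_shot(zone: str) -> float:
    """Expected points for a single attempt from a zone."""
    v = ZONE_VALUES[zone]
    return v['fg_pct'] * v['points']

def expected_points(mix: dict) -> float:
    """Expected points per shot for a shot mix whose rates sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("shot-mix rates must sum to 1")
    return sum(rate * points_per_shot(zone) for zone, rate in mix.items())

# Hypothetical shot mixes that differ only in mid-range vs three-point share
midrange_heavy = {'rim': 0.30, 'midrange': 0.45, 'three': 0.25}
three_heavy = {'rim': 0.30, 'midrange': 0.25, 'three': 0.45}

for zone in ZONE_VALUES:
    print(f"{zone:>9}: {points_per_shot(zone):.2f} points per attempt")

print(f"Mid-range-heavy mix: {expected_points(midrange_heavy):.3f}")
print(f"Three-heavy mix:     {expected_points(three_heavy):.3f}")
```

Because the percentages are fixed league-wide assumptions, this figure measures shot selection only. It says nothing about how well a given team actually shoots from each zone.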
Part 4: Shot Chart Visualizations
4.1 League-Wide Shot Distribution Heat Maps
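def create_era_comparison_shot_charts(df: pd.DataFrame,
                                      figsize: Tuple = (18, 8)):
    """
    Create side-by-side shot chart comparisons across eras.

    Args:
        df: Shot data with coordinates
        figsize: Figure dimensions
    """
    # Define era groupings
    early_era = df[df['SEASON'].isin(['2014-15', '2015-16'])]
    recent_era = df[df['SEASON'].isin(['2022-23', '2023-24'])]

    # constrained_layout leaves room for the colorbar shared by both panels
    fig, axes = plt.subplots(1, 2, figsize=figsize, constrained_layout=True)

    # Court geometry in feet with the hoop at the origin (same frame as
    # X_FEET / Y_FEET)
    baseline_y = -5.25    # hoop centre is 5.25 ft in front of the baseline
    ft_line_y = 13.75     # free throw line is 19 ft from the baseline
    half_court_y = 41.75  # half court is 47 ft from the baseline
    corner_y = np.sqrt(23.75**2 - 22**2)  # where the arc meets the corner line

    # Draw court helper function (simplified)
    def draw_court(ax):
        # Three-point arc, drawn only between the two corner lines
        start = np.arctan2(corner_y, 22)
        theta = np.linspace(start, np.pi - start, 100)
        x = 23.75 * np.cos(theta)
        y = 23.75 * np.sin(theta)
        ax.plot(x, y, 'k-', linewidth=2)

        # Corner threes
        ax.plot([-22, -22], [baseline_y, corner_y], 'k-', linewidth=2)
        ax.plot([22, 22], [baseline_y, corner_y], 'k-', linewidth=2)

        # Paint
        ax.plot([-8, -8, 8, 8], [baseline_y, ft_line_y, ft_line_y, baseline_y],
                'k-', linewidth=2)

        # Free throw circle
        circle = plt.Circle((0, ft_line_y), 6, fill=False, color='k', linewidth=2)
        ax.add_patch(circle)

        # Restricted area
        ra_arc = plt.Circle((0, 0), 4, fill=False, color='k', linewidth=2)
        ax.add_patch(ra_arc)

        ax.set_xlim(-25, 25)
        ax.set_ylim(baseline_y, half_court_y)
        ax.set_aspect('equal')
        ax.axis('off')

    # Use the same bin extent in both panels so the hexagons line up
    extent = (-25, 25, baseline_y, half_court_y)

    # Plot early era
    draw_court(axes[0])
    hb_early = axes[0].hexbin(early_era['X_FEET'], early_era['Y_FEET'],
                              gridsize=30, cmap='YlOrRd', mincnt=100, extent=extent)
    axes[0].set_title(f'Shot Distribution: 2014-16\n({len(early_era):,} shots)')

    # Plot recent era
    draw_court(axes[1])
    hb_recent = axes[1].hexbin(recent_era['X_FEET'], recent_era['Y_FEET'],
                               gridsize=30, cmap='YlOrRd', mincnt=100, extent=extent)
    axes[1].set_title(f'Shot Distribution: 2022-24\n({len(recent_era):,} shots)')

    # Put both panels on one colour scale so the single colorbar applies to each
    vmax = max(hb_early.get_array().max(), hb_recent.get_array().max())
    for hb in (hb_early, hb_recent):
        hb.set_clim(100, vmax)

    # Add colorbar
    fig.colorbar(hb_recent, ax=axes, label='Shot Frequency', shrink=0.7)
    fig.suptitle('The Evolution of NBA Shot Selection', fontsize=16)

    return fig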
Part 5: Key Findings Summary
5.1 Automated Findings Report
def generate_findings_report(df: pd.DataFrame) -> str:
    """
    Generate an automated summary of key findings.

    Args:
        df: Analyzed shot data

    Returns:
        Formatted string report
    """
    # Calculate key statistics
    first_season = df['SEASON'].min()
    last_season = df['SEASON'].max()
    early = df[df['SEASON'] == first_season]
    late = df[df['SEASON'] == last_season]

    early_3rate = early['IS_THREE'].mean() * 100
    late_3rate = late['IS_THREE'].mean() * 100

    early_mid = early['IS_MIDRANGE'].mean() * 100
    late_mid = late['IS_MIDRANGE'].mean() * 100

    early_rim = early['IS_AT_RIM'].mean() * 100
    late_rim = late['IS_AT_RIM'].mean() * 100

    # Effective FG% = (FGM + 0.5 * 3PM) / FGA, so a made three counts 1.5
    def efg(season_df):
        weighted = season_df['SHOT_MADE_FLAG'] * (1 + 0.5 * season_df['IS_THREE'])
        return weighted.mean() * 100

    early_efg, late_efg = efg(early), efg(late)
    early_3pct = early.loc[early['IS_THREE'] == 1, 'SHOT_MADE_FLAG'].mean() * 100
    late_3pct = late.loc[late['IS_THREE'] == 1, 'SHOT_MADE_FLAG'].mean() * 100

    report = f"""
================================================================================
NBA SHOOTING EVOLUTION: KEY FINDINGS
Analysis Period: {first_season} to {last_season}
Total Shots Analyzed: {len(df):,}
================================================================================

1. THREE-POINT REVOLUTION
   - Three-point rate increased from {early_3rate:.1f}% to {late_3rate:.1f}%
   - This represents a {((late_3rate/early_3rate) - 1) * 100:.0f}% relative increase
   - The league now takes approximately {late_3rate:.0f} three-pointers per 100 shots

2. MID-RANGE EXTINCTION
   - Mid-range rate declined from {early_mid:.1f}% to {late_mid:.1f}%
   - This represents a {((1 - late_mid/early_mid)) * 100:.0f}% relative decrease
   - The "dying art" of the mid-range jumper shows up directly in the shot data

3. RIM PRESSURE
   - Restricted area rate: {early_rim:.1f}% -> {late_rim:.1f}%
   - A stable rim rate indicates the shift has been primarily mid-range -> three-point

4. EFFICIENCY TRENDS
   - Effective FG%: {early_efg:.1f}% -> {late_efg:.1f}%
   - Three-point FG%: {early_3pct:.1f}% -> {late_3pct:.1f}%

================================================================================
IMPLICATIONS FOR PRODUCTION:
- Visual storytelling should emphasize the dramatic shift in shot selection
- Side-by-side shot charts from 2014-16 vs 2022-24 illustrate the change vividly
- Key quote angle: "The math changed basketball"
================================================================================
"""
    return report
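The report quotes relative changes, which are easy to confuse with percentage-point changes when the underlying quantity is itself a rate. The short sketch below separates the two. The helper name `describe_change` is hypothetical, and the rates passed in are placeholders chosen to make the arithmetic visible, not results from the data.

```python
def describe_change(early_rate: float, late_rate: float) -> dict:
    """Return absolute (percentage-point) and relative (%) change."""
    if early_rate == 0:
        raise ValueError("relative change is undefined when the early rate is 0")
    return {
        'point_change': late_rate - early_rate,
        'relative_change_pct': (late_rate / early_rate - 1) * 100,
    }

# Placeholder rates in percent (not taken from the shot data)
change = describe_change(early_rate=25.0, late_rate=40.0)
print(f"{change['point_change']:+.1f} percentage points")
print(f"{change['relative_change_pct']:+.0f}% relative change")
```

For broadcast copy it helps to state which of the two is being quoted, since a move from 25% to 40% is both "15 points" and "60%" depending on the framing.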
Discussion Questions
Question 1: Causation vs Correlation
The data shows teams shooting more threes are often more successful. Does shooting more threes cause success, or do more talented teams simply shoot more threes?
Question 2: Data Limitations
What limitations exist in using shot chart data to tell the story of basketball evolution? What additional data would strengthen the analysis?
Question 3: Outlier Analysis
Are there successful teams or players who don't follow the three-point revolution? What explains their success?
Question 4: Future Projections
Based on the trends observed, what might NBA shooting look like in 5 years? What factors could accelerate or reverse the trends?
Deliverables
- Exploratory Analysis Notebook: Complete Jupyter notebook with all analysis
- Visualization Suite: Publication-ready figures for each key finding
- Summary Statistics: Tables showing decade-long trends
- Shot Chart Comparison: Visual showing 2014 vs 2024 shot selection
- Findings Report: Written summary suitable for production team
Key Takeaways
- EDA reveals narratives: The data clearly shows the transformation in NBA play style
- Multiple visualization types: Different plots reveal different aspects of the evolution
- Context matters: Zone definitions and era comparisons require careful thought
- Aggregation choices matter: Player-level vs league-level vs team-level tell different stories
- Quantification strengthens storytelling: Precise numbers make the narrative compelling
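
To make the aggregation takeaway concrete, the sketch below computes a three-point rate two ways on a made-up frame: pooled over all shots, and as the unweighted mean of team rates. Column names follow the ones used in this case study; the data are invented for illustration.

```python
import pandas as pd

# Made-up shots: team X takes many shots, team Y takes few
shots = pd.DataFrame({
    'TEAM_NAME': ['X'] * 8 + ['Y'] * 2,
    'IS_THREE':  [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
})

# League level: every shot counts equally
pooled_rate = shots['IS_THREE'].mean()

# Team level: every team counts equally, regardless of shot volume
team_rates = shots.groupby('TEAM_NAME')['IS_THREE'].mean()
mean_of_team_rates = team_rates.mean()

print(f"Pooled league rate: {pooled_rate:.3f}")
print(team_rates)
print(f"Mean of team rates: {mean_of_team_rates:.3f}")
```

The two figures answer different questions: the pooled rate describes the typical shot, while the mean of team rates describes the typical team. Stating which one a chart uses avoids mixing the two in the same narrative.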