Your First Basketball Analysis

Beginner | 10 min read | Nov 27, 2025

Your First Basketball Analysis: A Comprehensive Step-by-Step Guide

Welcome to the world of basketball analytics! Whether you're an NBA fan who wants to understand the game more deeply, a data scientist exploring sports analytics, or a student learning data analysis, this guide walks you through a first basketball analysis project from start to finish.

By the end of this tutorial, you'll have hands-on experience loading real NBA data, performing exploratory data analysis (EDA), calculating key basketball metrics, and creating insightful visualizations using both Python and R. We'll work through a complete analysis that you can adapt for your own basketball analytics projects.

Setting Up Your Basketball Analytics Environment

Python Setup

For Python-based basketball analytics, you'll need several essential libraries. Open your terminal or command prompt and install these packages:

pip install pandas numpy matplotlib seaborn nba_api

Here's what each library provides:

  • pandas: Data manipulation and analysis, DataFrame structures for working with tabular data
  • numpy: Numerical computing, array operations, and mathematical functions
  • matplotlib: Core plotting and visualization library
  • seaborn: Statistical data visualization built on matplotlib with better defaults
  • nba_api: Community-maintained (unofficial) Python client for the NBA.com stats endpoints, giving access to comprehensive NBA data
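
Before moving on, it can help to confirm that the packages actually installed. The sketch below uses only the standard library; `installed_version` and `report` are hypothetical helper names, and the package list simply mirrors the `pip install` command above.

```python
from importlib import metadata

# Distribution names as they appear in the pip install command above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "nba_api"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=REQUIRED):
    """Map each package name to its version (or None when not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name:<12} {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs another `pip install` run, ideally in the same environment your notebook or script uses.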

R Setup

For R users, install these packages in your R console or RStudio:

install.packages(c("tidyverse", "hoopR", "ggplot2", "jsonlite", "patchwork"))

# tidyverse includes dplyr, tidyr, ggplot2, and other essential data tools
# hoopR provides access to NBA statistics and play-by-play data
# ggplot2 is the premier visualization library (included in tidyverse)
# jsonlite helps work with JSON data from APIs
# patchwork combines several ggplot2 plots into one figure (used in the complete R analysis)

Understanding NBA Data Sources

Before diving into code, let's understand the primary data sources available for NBA analysis:

  • NBA Stats API: Official NBA statistics database with player stats, team stats, play-by-play data, shot chart data, and advanced metrics. Available from the 1996-97 season onwards.
  • Basketball-Reference: Comprehensive historical basketball statistics dating back to the BAA (Basketball Association of America) in 1946, including player career stats, team records, and advanced analytics.
  • NBA Tracking Data: Player tracking metrics including speed, distance traveled, touches, and defensive impact. Available from the 2013-14 season onwards.
  • Play-by-Play Data: Detailed event-level data for every possession, shot, and substitution in NBA games.

Loading and Inspecting NBA Data

Loading Player Stats Data (Python)

Let's start by loading 2023-24 season player statistics using the nba_api library:

import pandas as pd
import numpy as np
from nba_api.stats.endpoints import leaguedashplayerstats
from nba_api.stats.static import players
import warnings
warnings.filterwarnings('ignore')

# Load 2023-24 season player statistics
print("Loading NBA player statistics...")
player_stats = leaguedashplayerstats.LeagueDashPlayerStats(
    season='2023-24',
    per_mode_detailed='PerGame'
)

# Convert to DataFrame
df = player_stats.get_data_frames()[0]

# Display basic information
print(f"Loaded {len(df)} player records")
print(f"Columns: {df.shape[1]}")
print(f"\nDataset shape: {df.shape}")

# View first few rows
print("\nFirst 5 players:")
print(df.head())

# Check data types
print("\nData types:")
print(df.dtypes)

# Get column names
print("\nAvailable columns:")
print(df.columns.tolist())
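
Requests to a remote stats service can time out or fail intermittently, so it is often worth wrapping the load step in a retry. The following is a minimal sketch, not part of nba_api: `fetch_with_retry` is a hypothetical helper that accepts any zero-argument callable and retries it with exponentially growing pauses.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:  # deliberately broad for a sketch; narrow it in real code
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

With the loading code above, one way to use it would be `df = fetch_with_retry(lambda: leaguedashplayerstats.LeagueDashPlayerStats(season='2023-24', per_mode_detailed='PerGame').get_data_frames()[0])`. The `sleep` parameter exists only so the pause can be swapped out when testing.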

Loading Team Statistics (Python)

from nba_api.stats.endpoints import leaguedashteamstats

# Load team statistics
print("Loading NBA team statistics...")
team_stats = leaguedashteamstats.LeagueDashTeamStats(
    season='2023-24',
    per_mode_detailed='PerGame'
)

teams_df = team_stats.get_data_frames()[0]
print(f"Loaded {len(teams_df)} teams")
print(teams_df.head())

# Key team statistics columns
print("\nKey team metrics:")
print(teams_df[['TEAM_NAME', 'W', 'L', 'W_PCT', 'PTS', 'FG_PCT', 'FG3_PCT']].head(10))

Loading Data in R

library(tidyverse)
library(hoopR)

# Load NBA player box scores for the 2023-24 season (hoopR identifies a season by its ending year)
print("Loading NBA player statistics...")
nba_player_box <- load_nba_player_box(seasons = 2024)

glimpse(nba_player_box)

# Aggregate player statistics
player_stats <- nba_player_box %>%
  group_by(athlete_id, athlete_display_name, team_name) %>%
  summarise(
    games = n(),
    total_points = sum(points, na.rm = TRUE),
    total_rebounds = sum(rebounds, na.rm = TRUE),
    total_assists = sum(assists, na.rm = TRUE),
    total_fga = sum(field_goals_attempted, na.rm = TRUE),
    total_fgm = sum(field_goals_made, na.rm = TRUE),
    total_3pa = sum(three_point_field_goals_attempted, na.rm = TRUE),
    total_3pm = sum(three_point_field_goals_made, na.rm = TRUE),
    total_fta = sum(free_throws_attempted, na.rm = TRUE),
    total_ftm = sum(free_throws_made, na.rm = TRUE),
    total_minutes = sum(minutes, na.rm = TRUE),
    .groups = 'drop'
  )

print(paste("Loaded", nrow(player_stats), "players"))
glimpse(player_stats)
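
Note that the two loaders name the same season differently: the Python call passes `season='2023-24'`, while the R call passes `seasons = 2024`. A small conversion sketch can keep the two in sync when you change seasons; `season_label` and `season_end_year` are hypothetical helper names, not functions from either library.

```python
def season_label(end_year):
    """Convert a season's ending year (e.g. 2024) to a 'YYYY-YY' label."""
    if end_year < 1000 or end_year > 9999:
        raise ValueError("end_year must be a four-digit year")
    start_year = end_year - 1
    # Two-digit suffix, zero-padded so 2000 becomes '1999-00'
    return f"{start_year}-{end_year % 100:02d}"


def season_end_year(label):
    """Convert a 'YYYY-YY' label back to the season's ending year."""
    start_text, _, _suffix = label.partition("-")
    return int(start_text) + 1


print(season_label(2024))          # label for the Python season argument
print(season_end_year("2023-24"))  # year for the R seasons argument
```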

Exploratory Data Analysis (EDA) Workflow

A systematic EDA workflow is crucial for understanding your basketball data before performing complex analyses. Follow these essential steps:

Step 1: Understand the Data Structure

Always start by examining the data's basic properties:

# Python - Examine data structure
print("Dataset Information:")
print(df.info())

print("\nBasic Statistics:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])

# Unique values in key columns
print(f"\nUnique players: {df['PLAYER_NAME'].nunique()}")
print(f"Unique teams: {df['TEAM_ABBREVIATION'].nunique()}")
print(f"Total records: {len(df)}")

# R - Examine data structure
summary(player_stats)

# Missing values
player_stats %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  select(where(~ .x > 0))

# Unique values
player_stats %>%
  summarise(
    unique_players = n_distinct(athlete_id),
    unique_teams = n_distinct(team_name),
    total_records = n()
  )

Step 2: Data Quality Checks

Real-world basketball data can have issues. Check for common data quality problems:

# Python - Data quality validation
# Check for impossible values
print("Data Quality Checks:")
print(f"Players with negative points: {(df['PTS'] < 0).sum()}")
print(f"Players with FG% > 100%: {(df['FG_PCT'] > 1.0).sum()}")
print(f"Players with more FGM than FGA: {(df['FGM'] > df['FGA']).sum()}")

# Check for outliers in minutes played
print(f"\nMinutes played range: {df['MIN'].min():.1f} to {df['MIN'].max():.1f}")

# Filter for qualified players (minimum 15 minutes per game)
qualified = df[df['MIN'] >= 15].copy()
print(f"\nQualified players (min 15 MPG): {len(qualified)}")

# Display statistical outliers
print("\nTop 5 scorers:")
print(qualified.nlargest(5, 'PTS')[['PLAYER_NAME', 'TEAM_ABBREVIATION', 'PTS', 'MIN']])
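
The individual checks above can also be bundled into one reusable audit. This is a sketch under assumptions: it works on plain dictionaries (for example the output of `df.to_dict('records')`), the function names are hypothetical, and the sample records are made up for illustration.

```python
def quality_issues(row):
    """Return a list of human-readable problems found in one player record."""
    issues = []
    if row["PTS"] < 0:
        issues.append("negative points")
    if not 0.0 <= row["FG_PCT"] <= 1.0:
        issues.append("FG% outside 0-1")
    if row["FGM"] > row["FGA"]:
        issues.append("more makes than attempts")
    return issues


def audit(rows):
    """Map player name -> issues, keeping only players with at least one issue."""
    report = {}
    for row in rows:
        issues = quality_issues(row)
        if issues:
            report[row["PLAYER_NAME"]] = issues
    return report


# Made-up records for illustration only (not real NBA data).
sample = [
    {"PLAYER_NAME": "Player A", "PTS": 21.4, "FG_PCT": 0.47, "FGM": 7.9, "FGA": 16.8},
    {"PLAYER_NAME": "Player B", "PTS": -1.0, "FG_PCT": 1.20, "FGM": 6.0, "FGA": 5.0},
]
print(audit(sample))
```

Returning the failing rows by name, rather than only a count, makes it easier to decide whether a record should be corrected or dropped.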

Step 3: Calculate Derived Basketball Metrics

Basketball analytics requires calculating advanced statistics beyond basic box score stats:

# Python - Calculate advanced basketball metrics
def calculate_advanced_stats(df):
    """Calculate essential basketball analytics metrics"""
    df = df.copy()

    # Treat zero attempts as missing so the ratios below become NaN
    # instead of inf for players who never shot
    fga = df['FGA'].replace(0, np.nan)
    ts_attempts = (df['FGA'] + 0.44 * df['FTA']).replace(0, np.nan)

    # True Shooting Percentage (TS%)
    # Accounts for 2PT, 3PT, and FT efficiency
    df['TS%'] = df['PTS'] / (2 * ts_attempts)

    # Effective Field Goal Percentage (eFG%)
    # Adjusts FG% for the fact that 3PT shots are worth more
    df['eFG%'] = (df['FGM'] + 0.5 * df['FG3M']) / fga

    # Three-Point Attempt Rate (3PAr)
    # Share of field goal attempts that are 3-pointers
    df['3PAr'] = df['FG3A'] / fga

    # Free Throw Rate (FTr)
    # How often a player gets to the free throw line
    df['FTr'] = df['FTA'] / fga

    # Assist-to-Turnover Ratio (AST/TO)
    # Falls back to raw assists when a player has no turnovers
    df['AST_TO_Ratio'] = np.where(df['TOV'] > 0, df['AST'] / df['TOV'], df['AST'])

    # Points per shot attempt
    df['PTS_per_FGA'] = df['PTS'] / fga

    # Placeholder only: a true rebound rate needs team and opponent rebound
    # totals, so this column simply copies total rebounds per game
    df['REB_Rate'] = df['REB']

    return df

# Apply calculations
df_advanced = calculate_advanced_stats(df)
print("Advanced Statistics Sample:")
print(df_advanced[['PLAYER_NAME', 'PTS', 'TS%', 'eFG%', '3PAr', 'AST_TO_Ratio']].head(10))

# R - Calculate advanced basketball statistics
player_stats <- player_stats %>%
  mutate(
    # Per-game statistics
    ppg = total_points / games,
    rpg = total_rebounds / games,
    apg = total_assists / games,
    mpg = total_minutes / games,

    # Shooting percentages
    fg_pct = ifelse(total_fga > 0, total_fgm / total_fga, 0),
    fg3_pct = ifelse(total_3pa > 0, total_3pm / total_3pa, 0),
    ft_pct = ifelse(total_fta > 0, total_ftm / total_fta, 0),

    # True Shooting Percentage
    ts_pct = ifelse((total_fga + 0.44 * total_fta) > 0,
                    total_points / (2 * (total_fga + 0.44 * total_fta)), 0),

    # Effective Field Goal Percentage
    efg_pct = ifelse(total_fga > 0,
                     (total_fgm + 0.5 * total_3pm) / total_fga, 0),

    # Three-Point Attempt Rate
    three_par = ifelse(total_fga > 0, total_3pa / total_fga, 0),

    # Free Throw Rate
    ftr = ifelse(total_fga > 0, total_fta / total_fga, 0)
  )

# Display results
player_stats %>%
  select(athlete_display_name, ppg, rpg, apg, ts_pct, efg_pct) %>%
  arrange(desc(ppg)) %>%
  head(10)
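
To see what the two shooting formulas do, it helps to work one stat line by hand. The sketch below applies the same TS% and eFG% formulas used above to a hypothetical per-game line (7 two-pointers and 2 three-pointers on 18 attempts, plus 5 of 6 free throws, for 25 points); the function names are illustrative.

```python
def true_shooting(pts, fga, fta):
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)); None when there are no attempts."""
    attempts = fga + 0.44 * fta
    return pts / (2 * attempts) if attempts > 0 else None


def effective_fg(fgm, fg3m, fga):
    """eFG% = (FGM + 0.5 * 3PM) / FGA; None when there are no attempts."""
    return (fgm + 0.5 * fg3m) / fga if fga > 0 else None


# Hypothetical per-game line, not a real player.
pts, fgm, fg3m, fga, fta = 25.0, 9.0, 2.0, 18.0, 6.0

print(f"FG%:  {fgm / fga:.3f}")                          # 0.500
print(f"eFG%: {effective_fg(fgm, fg3m, fga):.3f}")       # 0.556
print(f"TS%:  {true_shooting(pts, fga, fta):.3f}")       # 0.606
```

Plain FG% treats every make the same, eFG% credits the two three-pointers with an extra half make each, and TS% additionally folds in the free throws, which is why the three numbers climb from 0.500 to 0.606 for the same line.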

Asking Good Analytical Questions

The foundation of good analysis is asking meaningful questions. Here are examples organized by complexity:

Beginner Questions

  • Who is the leading scorer in the NBA this season?
  • Which team has the best three-point shooting percentage?
  • What is the average points per game for centers vs. guards?
  • How many players are averaging a double-double (10+ points and rebounds)?
  • Which player has the highest field goal percentage (minimum 5 FGA per game)?

Intermediate Questions

  • Has the number of three-point attempts increased over the past 5 seasons?
  • What is the relationship between assists and turnovers for point guards?
  • Which players have the highest True Shooting Percentage among high-volume scorers?
  • How does home court advantage affect team shooting percentages?
  • Do taller players have better rebounding rates per minute played?

Advanced Questions

  • Can we predict a player's scoring output based on usage rate and shooting efficiency?
  • Which player lineups have the best net rating (offensive - defensive rating)?
  • How do different shot types (at rim, mid-range, three-point) affect overall efficiency?
  • What factors best predict team success: offensive rating, defensive rating, or pace?
  • Can we identify undervalued players based on advanced metrics vs. traditional stats?
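
A good question usually translates directly into a filter or an aggregation. As a sketch, here is the beginner double-double question expressed in code, using the definition given above (10+ points and rebounds). The player rows are made up for illustration and the helper name is hypothetical.

```python
# Made-up per-game averages for illustration only (not real NBA data).
players = [
    {"name": "Player A", "PTS": 24.1, "REB": 11.3, "AST": 4.2},
    {"name": "Player B", "PTS": 18.7, "REB": 4.9, "AST": 10.4},
    {"name": "Player C", "PTS": 9.8, "REB": 12.0, "AST": 1.1},
    {"name": "Player D", "PTS": 27.5, "REB": 6.1, "AST": 5.0},
]


def averages_double_double(player, threshold=10.0):
    """True when points AND rebounds per game both reach the threshold."""
    return player["PTS"] >= threshold and player["REB"] >= threshold


matches = [p["name"] for p in players if averages_double_double(p)]
print(f"{len(matches)} player(s): {matches}")
```

On the DataFrame loaded earlier, the same question would be `df[(df['PTS'] >= 10) & (df['REB'] >= 10)]`. Writing the condition out also exposes a choice the question left open: whether a points-and-assists double-double should count too.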

Generating Basketball Statistical Summaries

Summary Statistics by Position

# Python - Summarize by position
# Note: the player stats loaded above may not include a POSITION column;
# merge position data from another source into df before running this
if 'POSITION' in df.columns:
    position_summary = df.groupby('POSITION').agg({
        'PTS': ['mean', 'median', 'std'],
        'REB': ['mean', 'median'],
        'AST': ['mean', 'median'],
        'FG_PCT': 'mean',
        'FG3_PCT': 'mean',
        'MIN': 'mean'
    }).round(2)

    print("Statistics by Position:")
    print(position_summary)
else:
    print("No POSITION column found; add position data before summarizing.")

Team Performance Analysis

# Python - Team-level analysis
team_summary = teams_df[['TEAM_NAME', 'W', 'L', 'W_PCT', 'PTS',
                          'FG_PCT', 'FG3_PCT', 'REB', 'AST']].copy()

# Sort by winning percentage
team_summary = team_summary.sort_values('W_PCT', ascending=False)

print("Top 10 Teams by Win Percentage:")
print(team_summary.head(10))

# R - Team summary statistics
team_summary <- nba_player_box %>%
  group_by(team_name) %>%
  summarise(
    games = n_distinct(game_id),
    total_points = sum(points, na.rm = TRUE),
    # Team points per game; mean(points) would average individual player lines instead
    avg_points_per_game = total_points / games,
    total_3pm = sum(three_point_field_goals_made, na.rm = TRUE),
    total_3pa = sum(three_point_field_goals_attempted, na.rm = TRUE),
    three_pt_pct = total_3pm / total_3pa * 100,
    .groups = 'drop'
  ) %>%
  arrange(desc(avg_points_per_game))

print("Team Summary Statistics:")
print(team_summary)

Percentile Analysis

# Python - Percentile analysis for key metrics
# Use df_advanced so the derived TS% and eFG% columns are available
qualified = df_advanced[df_advanced['MIN'] >= 20].copy()

metrics = ['PTS', 'REB', 'AST', 'TS%', 'eFG%']
percentiles = [10, 25, 50, 75, 90, 95, 99]

print("Percentile Analysis (Players with 20+ MPG):")
print("=" * 70)

for metric in metrics:
    if metric in qualified.columns:
        percentile_values = np.percentile(qualified[metric].dropna(), percentiles)
        print(f"\n{metric}:")
        for p, value in zip(percentiles, percentile_values):
            print(f"  {p}th percentile: {value:.2f}")
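
To make the percentile numbers less of a black box, here is a small pure-Python sketch of the linear-interpolation rule that `np.percentile` applies by default. The function name and the sample values are illustrative.

```python
def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100  # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower                # how far we sit between the two neighbors
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


ppg = [10, 20, 30, 40, 50]  # made-up values chosen for easy arithmetic
for p in (25, 50, 90):
    print(f"{p}th percentile: {percentile(ppg, p):.2f}")
```

The 90th percentile lands at index 3.6 of the sorted list, so the result sits 60% of the way from 40 to 50. With only a handful of qualified players, the top percentiles are therefore interpolated between very few observations and should be read loosely.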

Creating Basketball Visualizations

Bar Charts - Top Scorers

# Python - Visualize top scorers
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# Get top 15 scorers (minimum 20 games played, so tiny samples don't lead the list)
top_scorers = df[df['GP'] >= 20].nlargest(15, 'PTS')[['PLAYER_NAME', 'PTS', 'TEAM_ABBREVIATION']].copy()

plt.figure(figsize=(12, 8))
bars = plt.barh(range(len(top_scorers)), top_scorers['PTS'], color='steelblue')

# Color the top 3 differently
bars[0].set_color('#FFD700')  # Gold
bars[1].set_color('#C0C0C0')  # Silver
bars[2].set_color('#CD7F32')  # Bronze

plt.yticks(range(len(top_scorers)),
           [f"{name} ({team})" for name, team in
            zip(top_scorers['PLAYER_NAME'], top_scorers['TEAM_ABBREVIATION'])])
plt.xlabel('Points Per Game', fontsize=12)
plt.title('Top 15 Scorers - NBA 2023-24 Season', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()

for i, (idx, row) in enumerate(top_scorers.iterrows()):
    plt.text(row['PTS'] + 0.3, i, f"{row['PTS']:.1f}",
             va='center', fontsize=10)

plt.tight_layout()
plt.savefig('top_scorers_2024.png', dpi=300, bbox_inches='tight')
plt.show()

print("Visualization saved: top_scorers_2024.png")

# R - Top scorers visualization
library(ggplot2)

top_scorers <- player_stats %>%
  filter(games >= 20) %>%
  arrange(desc(ppg)) %>%
  head(15) %>%
  mutate(
    rank = row_number(),
    medal = case_when(
      rank == 1 ~ "Gold",
      rank == 2 ~ "Silver",
      rank == 3 ~ "Bronze",
      TRUE ~ "Standard"
    )
  )

ggplot(top_scorers, aes(x = reorder(athlete_display_name, ppg), y = ppg, fill = medal)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Gold" = "#FFD700", "Silver" = "#C0C0C0",
                                "Bronze" = "#CD7F32", "Standard" = "steelblue")) +
  coord_flip() +
  geom_text(aes(label = sprintf("%.1f", ppg)), hjust = -0.2, size = 3.5) +
  labs(
    title = "Top 15 Scorers - NBA 2023-24 Season",
    x = "Player",
    y = "Points Per Game"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "none"
  ) +
  expand_limits(y = max(top_scorers$ppg) * 1.1)

ggsave("top_scorers_2024.png", width = 10, height = 8, dpi = 300)

Scatter Plots - Efficiency Analysis

# Python - Points vs. True Shooting Percentage
qualified = df[(df['MIN'] >= 20) & (df['FGA'] >= 5)].copy()
qualified = calculate_advanced_stats(qualified)

plt.figure(figsize=(12, 8))

# Create scatter plot
scatter = plt.scatter(qualified['PTS'], qualified['TS%'],
                     alpha=0.6, s=100, c=qualified['FG3A'],
                     cmap='viridis', edgecolors='black', linewidth=0.5)

# Add colorbar
cbar = plt.colorbar(scatter)
cbar.set_label('3-Point Attempts per Game', fontsize=11)

# Label top performers
top_efficient = qualified.nlargest(5, 'TS%')
for idx, row in top_efficient.iterrows():
    plt.annotate(row['PLAYER_NAME'],
                xy=(row['PTS'], row['TS%']),
                xytext=(5, 5), textcoords='offset points',
                fontsize=8, alpha=0.8)

plt.xlabel('Points Per Game', fontsize=12)
plt.ylabel('True Shooting Percentage', fontsize=12)
plt.title('Scoring Volume vs. Efficiency (2023-24 Season)',
          fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('scoring_efficiency_2024.png', dpi=300, bbox_inches='tight')
plt.show()

# Calculate correlation
correlation = qualified[['PTS', 'TS%']].corr()
print(f"\nCorrelation between Points and TS%: {correlation.loc['PTS', 'TS%']:.3f}")

# R - Scatter plot with regression
qualified <- player_stats %>%
  filter(games >= 20, total_fga / games >= 5) %>%
  mutate(three_pa_per_game = total_3pa / games)

ggplot(qualified, aes(x = ppg, y = ts_pct)) +
  geom_point(aes(color = three_pa_per_game, size = ppg), alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  scale_color_viridis_c(option = "viridis") +
  labs(
    title = "Scoring Volume vs. Efficiency",
    subtitle = "NBA 2023-24 Season (min 20 games, 5 FGA/game)",
    x = "Points Per Game",
    y = "True Shooting %",
    color = "3PA per Game",
    size = "Points"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11),
    legend.position = "right"
  )

ggsave("scoring_efficiency_2024.png", width = 12, height = 8, dpi = 300)

# Correlation
cor_value <- cor(qualified$ppg, qualified$ts_pct, use = "complete.obs")
cat(sprintf("\nCorrelation between PPG and TS%%: %.3f\n", cor_value))
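
Both `.corr()` in pandas and `cor()` in R compute the Pearson correlation coefficient by default. As a sketch of what that number measures, here is the same calculation written out in plain Python; `pearson_r` is an illustrative name and the sample values are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-movement of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up values: volume and efficiency that only partly move together.
ppg = [1, 2, 3, 4]
ts = [1, 3, 2, 4]
print(f"r = {pearson_r(ppg, ts):.3f}")
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one. Because the scatter plots filter to qualified players first, the coefficient describes only that subset, not the whole league.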

Distribution Plots - Shooting Percentages

# Python - Distribution of three-point shooting
qualified_shooters = df[(df['FG3A'] >= 3) & (df['MIN'] >= 15)].copy()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Histogram
ax1.hist(qualified_shooters['FG3_PCT'], bins=25, edgecolor='black',
         alpha=0.7, color='orangered')
mean_3p = qualified_shooters['FG3_PCT'].mean()
median_3p = qualified_shooters['FG3_PCT'].median()

ax1.axvline(mean_3p, color='blue', linestyle='--', linewidth=2,
            label=f'Mean: {mean_3p:.3f}')
ax1.axvline(median_3p, color='green', linestyle='--', linewidth=2,
            label=f'Median: {median_3p:.3f}')
ax1.set_xlabel('3-Point Percentage', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('Distribution of 3-Point Shooting', fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Box plot
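ax2.boxplot([qualified_shooters['FG_PCT'], qualified_shooters['FG3_PCT']])
# Label the boxes via the axis; the boxplot `labels` keyword is deprecated in newer Matplotlib
ax2.set_xticklabels(['FG%', '3PT%'])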
ax2.set_ylabel('Percentage', fontsize=12)
ax2.set_title('Shooting Percentage Comparison', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('shooting_distribution_2024.png', dpi=300, bbox_inches='tight')
plt.show()
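
A box plot compresses each distribution into a few numbers: the quartiles, plus fences beyond which points are drawn as outliers. The sketch below computes those values with the standard library, assuming the usual 1.5 × IQR whisker rule (Matplotlib's default); the function name and sample values are illustrative.

```python
from statistics import quantiles


def box_plot_summary(values, whisker=1.5):
    """Quartiles, whisker fences, and outliers roughly as a box plot shows them."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "low_fence": low_fence,
        "high_fence": high_fence,
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Made-up values with one obvious outlier.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for key, value in summary.items():
    print(f"{key}: {value}")
```

Reading the box this way makes the FG% versus 3PT% comparison concrete: the box height is the interquartile range, and any shooter outside the fences appears as an individual point.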

Complete End-to-End Analysis: Finding the Most Efficient Scorers

Analysis Question: Who Are the Most Efficient High-Volume Scorers in the NBA?

Let's perform a complete analysis from data loading to visualization and interpretation:

Python Complete Implementation

import pandas as pd
import numpy as np
from nba_api.stats.endpoints import leaguedashplayerstats
import matplotlib.pyplot as plt
import seaborn as sns

print("=" * 80)
print("NBA EFFICIENT SCORING ANALYSIS - 2023-24 SEASON")
print("=" * 80)

# Step 1: Load the data
print("\nStep 1: Loading NBA player statistics...")
player_stats = leaguedashplayerstats.LeagueDashPlayerStats(
    season='2023-24',
    per_mode_detailed='PerGame'
)
df = player_stats.get_data_frames()[0]
print(f"✓ Loaded {len(df)} player records")

# Step 2: Data quality checks
print("\nStep 2: Performing data quality checks...")
initial_count = len(df)
df = df.dropna(subset=['PTS', 'FGA', 'FTA'])
print(f"✓ Removed {initial_count - len(df)} records with missing data")
print(f"✓ Working with {len(df)} complete records")

# Step 3: Set qualification criteria
print("\nStep 3: Setting qualification criteria...")
min_ppg = 15.0  # Minimum points per game
min_mpg = 25.0  # Minimum minutes per game
min_games = 20  # Minimum games played

qualified = df[(df['PTS'] >= min_ppg) &
               (df['MIN'] >= min_mpg) &
               (df['GP'] >= min_games)].copy()

print(f"✓ Qualification criteria:")
print(f"  - Minimum {min_ppg} PPG")
print(f"  - Minimum {min_mpg} MPG")
print(f"  - Minimum {min_games} games played")
print(f"✓ {len(qualified)} players meet criteria")

# Step 4: Calculate advanced efficiency metrics
print("\nStep 4: Calculating advanced efficiency metrics...")

qualified['TS%'] = qualified['PTS'] / (2 * (qualified['FGA'] + 0.44 * qualified['FTA']))
qualified['eFG%'] = (qualified['FGM'] + 0.5 * qualified['FG3M']) / qualified['FGA']
qualified['3PAr'] = qualified['FG3A'] / qualified['FGA']
qualified['FTr'] = qualified['FTA'] / qualified['FGA']
qualified['PPG_per_FGA'] = qualified['PTS'] / qualified['FGA']

print("✓ Calculated: TS%, eFG%, 3PAr, FTr, PPG per FGA")

# Step 5: Identify most efficient scorers
print("\nStep 5: Identifying most efficient high-volume scorers...")

# Create efficiency score combining volume and efficiency
qualified['Efficiency_Score'] = qualified['TS%'] * qualified['PTS']
top_efficient = qualified.nlargest(15, 'Efficiency_Score')

print("\nTop 15 Most Efficient High-Volume Scorers:")
print("=" * 80)
print(f"{'Rank':<6}{'Player':<25}{'PPG':<8}{'TS%':<8}{'eFG%':<8}{'Score':<10}")
print("-" * 80)

for rank, (idx, row) in enumerate(top_efficient.iterrows(), 1):
    print(f"{rank:<6}{row['PLAYER_NAME'][:24]:<25}{row['PTS']:<8.1f}"
          f"{row['TS%']:<8.3f}{row['eFG%']:<8.3f}{row['Efficiency_Score']:<10.2f}")

# Step 6: Statistical summary
print("\n" + "=" * 80)
print("STATISTICAL SUMMARY - QUALIFIED SCORERS")
print("=" * 80)
print(f"Mean PPG: {qualified['PTS'].mean():.2f}")
print(f"Median PPG: {qualified['PTS'].median():.2f}")
print(f"Mean TS%: {qualified['TS%'].mean():.3f}")
print(f"Median TS%: {qualified['TS%'].median():.3f}")
print(f"Mean eFG%: {qualified['eFG%'].mean():.3f}")
print(f"Median eFG%: {qualified['eFG%'].median():.3f}")

# Identify elite efficiency (top quartile in both PPG and TS%)
ppg_75th = qualified['PTS'].quantile(0.75)
ts_75th = qualified['TS%'].quantile(0.75)

elite_scorers = qualified[(qualified['PTS'] >= ppg_75th) &
                          (qualified['TS%'] >= ts_75th)]

print(f"\n{len(elite_scorers)} players are in top quartile for BOTH volume and efficiency")
print("\nElite Scorers (75th+ percentile in PPG and TS%):")
for idx, row in elite_scorers.iterrows():
    print(f"  • {row['PLAYER_NAME']}: {row['PTS']:.1f} PPG, {row['TS%']:.3f} TS%")

# Step 7: Create comprehensive visualization
print("\nStep 7: Creating visualization...")

fig = plt.figure(figsize=(16, 10))
gs = fig.add_gridspec(2, 2, hspace=0.3, wspace=0.3)

# Plot 1: Scatter plot - PPG vs TS%
ax1 = fig.add_subplot(gs[0, :])
scatter = ax1.scatter(qualified['PTS'], qualified['TS%'],
                     s=qualified['MIN']*5, alpha=0.6,
                     c=qualified['FG3A'], cmap='plasma',
                     edgecolors='black', linewidth=0.5)

# Annotate top 5 efficient scorers
for idx, row in top_efficient.head(5).iterrows():
    ax1.annotate(row['PLAYER_NAME'],
                xy=(row['PTS'], row['TS%']),
                xytext=(8, 8), textcoords='offset points',
                fontsize=9, fontweight='bold',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.6))

cbar = plt.colorbar(scatter, ax=ax1)
cbar.set_label('3-Point Attempts per Game', fontsize=11)

ax1.axhline(qualified['TS%'].mean(), color='red', linestyle='--',
           alpha=0.5, label=f"Mean TS%: {qualified['TS%'].mean():.3f}")
ax1.axvline(qualified['PTS'].mean(), color='blue', linestyle='--',
           alpha=0.5, label=f"Mean PPG: {qualified['PTS'].mean():.1f}")

ax1.set_xlabel('Points Per Game', fontsize=12, fontweight='bold')
ax1.set_ylabel('True Shooting Percentage', fontsize=12, fontweight='bold')
ax1.set_title('NBA Scoring Efficiency Analysis - 2023-24 Season',
             fontsize=14, fontweight='bold')
ax1.legend(loc='lower right')
ax1.grid(True, alpha=0.3)

# Plot 2: Histogram of TS%
ax2 = fig.add_subplot(gs[1, 0])
ax2.hist(qualified['TS%'], bins=20, edgecolor='black', alpha=0.7, color='teal')
ax2.axvline(qualified['TS%'].mean(), color='red', linestyle='--', linewidth=2)
ax2.set_xlabel('True Shooting %', fontsize=11)
ax2.set_ylabel('Frequency', fontsize=11)
ax2.set_title('Distribution of TS%', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

# Plot 3: Top 10 efficiency scores
ax3 = fig.add_subplot(gs[1, 1])
top_10 = top_efficient.head(10)
bars = ax3.barh(range(len(top_10)), top_10['Efficiency_Score'], color='coral')
ax3.set_yticks(range(len(top_10)))
ax3.set_yticklabels(top_10['PLAYER_NAME'], fontsize=9)
ax3.set_xlabel('Efficiency Score (TS% × PPG)', fontsize=11)
ax3.set_title('Top 10 Efficiency Scores', fontsize=12, fontweight='bold')
ax3.invert_yaxis()
ax3.grid(True, alpha=0.3, axis='x')

plt.savefig('nba_efficiency_analysis_2024.png', dpi=300, bbox_inches='tight')
print("✓ Visualization saved: nba_efficiency_analysis_2024.png")

# Step 8: Key insights and conclusions
print("\n" + "=" * 80)
print("KEY INSIGHTS")
print("=" * 80)

winner = top_efficient.iloc[0]
print(f"\n1. MOST EFFICIENT HIGH-VOLUME SCORER:")
print(f"   {winner['PLAYER_NAME']}")
print(f"   • {winner['PTS']:.1f} points per game")
print(f"   • {winner['TS%']:.1%} True Shooting Percentage")
print(f"   • {winner['eFG%']:.1%} Effective Field Goal Percentage")
print(f"   • Efficiency Score: {winner['Efficiency_Score']:.2f}")

print(f"\n2. LEAGUE BENCHMARKS (Qualified Scorers):")
print(f"   • Average scoring: {qualified['PTS'].mean():.1f} PPG")
print(f"   • Average efficiency: {qualified['TS%'].mean():.1%} TS%")
print(f"   • Elite efficiency threshold (75th percentile): {ts_75th:.1%}")

print(f"\n3. THREE-POINT SHOOTING IMPACT:")
high_3pa = qualified[qualified['3PAr'] >= 0.4]
low_3pa = qualified[qualified['3PAr'] < 0.4]
print(f"   • High 3PA rate (≥40% of FGA): avg TS% = {high_3pa['TS%'].mean():.1%}")
print(f"   • Low 3PA rate (<40% of FGA): avg TS% = {low_3pa['TS%'].mean():.1%}")

print("\n" + "=" * 80)
print("ANALYSIS COMPLETE")
print("=" * 80)
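
One caveat about the Efficiency Score used above: because it multiplies TS% by PPG, and scoring averages vary far more in relative terms than shooting percentages do, the product tends to reward volume. The sketch below contrasts it with one possible alternative, a sum of z-scores that weights the two inputs equally. This alternative is an assumption for illustration, not part of the analysis above, and the players are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


# Made-up scorers for illustration only (not real NBA data).
names = ["Player A", "Player B", "Player C"]
ppg = [30.0, 20.0, 16.0]
ts = [0.58, 0.68, 0.60]

# The tutorial's score: TS% multiplied by points per game.
product = [t * p for t, p in zip(ts, ppg)]
# Assumed alternative: add standardized volume and standardized efficiency.
balanced = [a + b for a, b in zip(z_scores(ppg), z_scores(ts))]

for name, prod, bal in zip(names, product, balanced):
    print(f"{name}: TS% x PPG = {prod:.2f}, z-score sum = {bal:+.2f}")
```

In this toy data the product ranks the high-volume Player A first, while the z-score sum prefers the more efficient Player B. Neither is the correct answer; the point is that the ranking depends on how volume and efficiency are weighted, so state the choice when reporting results.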

R Complete Implementation

library(tidyverse)
library(hoopR)
library(ggplot2)
library(patchwork)

cat("================================================================================\n")
cat("NBA EFFICIENT SCORING ANALYSIS - 2023-24 SEASON\n")
cat("================================================================================\n")

# Step 1: Load the data
cat("\nStep 1: Loading NBA player statistics...\n")
nba_player_box <- load_nba_player_box(seasons = 2024)
cat(sprintf("✓ Loaded %d game records\n", nrow(nba_player_box)))

# Step 2: Aggregate player statistics
cat("\nStep 2: Aggregating player statistics...\n")
player_stats <- nba_player_box %>%
  group_by(athlete_id, athlete_display_name, team_name) %>%
  summarise(
    games = n(),
    total_points = sum(points, na.rm = TRUE),
    total_fga = sum(field_goals_attempted, na.rm = TRUE),
    total_fgm = sum(field_goals_made, na.rm = TRUE),
    total_3pa = sum(three_point_field_goals_attempted, na.rm = TRUE),
    total_3pm = sum(three_point_field_goals_made, na.rm = TRUE),
    total_fta = sum(free_throws_attempted, na.rm = TRUE),
    total_minutes = sum(minutes, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    ppg = total_points / games,
    mpg = total_minutes / games,
    fga_per_game = total_fga / games
  )

cat(sprintf("✓ Aggregated data for %d players\n", nrow(player_stats)))

# Step 3: Set qualification criteria
cat("\nStep 3: Setting qualification criteria...\n")
min_ppg <- 15.0
min_mpg <- 25.0
min_games <- 20

qualified <- player_stats %>%
  filter(ppg >= min_ppg, mpg >= min_mpg, games >= min_games)

cat(sprintf("✓ Qualification criteria:\n"))
cat(sprintf("  - Minimum %.1f PPG\n", min_ppg))
cat(sprintf("  - Minimum %.1f MPG\n", min_mpg))
cat(sprintf("  - Minimum %d games played\n", min_games))
cat(sprintf("✓ %d players meet criteria\n", nrow(qualified)))

# Step 4: Calculate advanced metrics
cat("\nStep 4: Calculating advanced efficiency metrics...\n")

qualified <- qualified %>%
  mutate(
    ts_pct = total_points / (2 * (total_fga + 0.44 * total_fta)),
    efg_pct = (total_fgm + 0.5 * total_3pm) / total_fga,
    three_par = total_3pa / total_fga,
    ftr = total_fta / total_fga,
    ppg_per_fga = ppg / fga_per_game,
    efficiency_score = ts_pct * ppg
  )

cat("✓ Calculated: TS%, eFG%, 3PAr, FTr, Efficiency Score\n")

# Step 5: Identify top efficient scorers
cat("\nStep 5: Identifying most efficient high-volume scorers...\n")

top_efficient <- qualified %>%
  arrange(desc(efficiency_score)) %>%
  head(15)

cat("\nTop 15 Most Efficient High-Volume Scorers:\n")
cat("================================================================================\n")
cat(sprintf("%-6s%-30s%-10s%-10s%-10s%-12s\n",
           "Rank", "Player", "PPG", "TS%", "eFG%", "Score"))
cat("--------------------------------------------------------------------------------\n")

for (i in seq_len(nrow(top_efficient))) {
  row <- top_efficient[i, ]
  cat(sprintf("%-6d%-30s%-10.1f%-10.3f%-10.3f%-12.2f\n",
             i, row$athlete_display_name, row$ppg,
             row$ts_pct, row$efg_pct, row$efficiency_score))
}

# Step 6: Statistical summary
cat("\n================================================================================\n")
cat("STATISTICAL SUMMARY - QUALIFIED SCORERS\n")
cat("================================================================================\n")
cat(sprintf("Mean PPG: %.2f\n", mean(qualified$ppg)))
cat(sprintf("Median PPG: %.2f\n", median(qualified$ppg)))
cat(sprintf("Mean TS%%: %.3f\n", mean(qualified$ts_pct, na.rm = TRUE)))
cat(sprintf("Median TS%%: %.3f\n", median(qualified$ts_pct, na.rm = TRUE)))
cat(sprintf("Mean eFG%%: %.3f\n", mean(qualified$efg_pct, na.rm = TRUE)))
cat(sprintf("Median eFG%%: %.3f\n", median(qualified$efg_pct, na.rm = TRUE)))

# Elite scorers
ppg_75th <- quantile(qualified$ppg, 0.75)
ts_75th <- quantile(qualified$ts_pct, 0.75, na.rm = TRUE)

elite_scorers <- qualified %>%
  filter(ppg >= ppg_75th, ts_pct >= ts_75th)

cat(sprintf("\n%d players are in top quartile for BOTH volume and efficiency\n",
           nrow(elite_scorers)))
cat("\nElite Scorers (75th+ percentile in PPG and TS%):\n")
for (i in seq_len(nrow(elite_scorers))) {
  row <- elite_scorers[i, ]
  cat(sprintf("  • %s: %.1f PPG, %.3f TS%%\n",
             row$athlete_display_name, row$ppg, row$ts_pct))
}

# Step 7: Create visualizations
cat("\nStep 7: Creating visualizations...\n")

# Plot 1: Main scatter plot
p1 <- ggplot(qualified, aes(x = ppg, y = ts_pct)) +
  geom_point(aes(size = mpg, color = total_3pa / games), alpha = 0.6) +
  geom_hline(yintercept = mean(qualified$ts_pct, na.rm = TRUE),
             linetype = "dashed", color = "red", alpha = 0.5) +
  geom_vline(xintercept = mean(qualified$ppg),
             linetype = "dashed", color = "blue", alpha = 0.5) +
  scale_color_viridis_c(option = "plasma") +
  labs(
    title = "NBA Scoring Efficiency Analysis - 2023-24 Season",
    x = "Points Per Game",
    y = "True Shooting Percentage",
    color = "3PA per Game",
    size = "Minutes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )

# Plot 2: Histogram
p2 <- ggplot(qualified, aes(x = ts_pct)) +
  geom_histogram(bins = 20, fill = "#008080", color = "black", alpha = 0.7) +  # teal; "teal" is not a named R color
  geom_vline(xintercept = mean(qualified$ts_pct, na.rm = TRUE),
             color = "red", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Distribution of True Shooting %",
    x = "True Shooting %",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12))

# Plot 3: Top 10 bar chart
p3 <- ggplot(top_efficient %>% head(10),
            aes(x = reorder(athlete_display_name, efficiency_score),
                y = efficiency_score)) +
  geom_bar(stat = "identity", fill = "coral", color = "black") +
  coord_flip() +
  labs(
    title = "Top 10 Efficiency Scores",
    x = "",
    y = "Efficiency Score (TS% × PPG)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 12),
    axis.text.y = element_text(size = 9)
  )

# Combine plots (the / and | layout operators come from the patchwork package)
combined_plot <- p1 / (p2 | p3)
ggsave("nba_efficiency_analysis_2024.png", combined_plot,
      width = 14, height = 10, dpi = 300)

cat("✓ Visualization saved: nba_efficiency_analysis_2024.png\n")

# Step 8: Key insights
cat("\n================================================================================\n")
cat("KEY INSIGHTS\n")
cat("================================================================================\n")

winner <- top_efficient[1, ]
cat(sprintf("\n1. MOST EFFICIENT HIGH-VOLUME SCORER:\n"))
cat(sprintf("   %s\n", winner$athlete_display_name))
cat(sprintf("   • %.1f points per game\n", winner$ppg))
cat(sprintf("   • %.1f%% True Shooting Percentage\n", winner$ts_pct * 100))
cat(sprintf("   • %.1f%% Effective Field Goal Percentage\n", winner$efg_pct * 100))
cat(sprintf("   • Efficiency Score: %.2f\n", winner$efficiency_score))

cat(sprintf("\n2. LEAGUE BENCHMARKS (Qualified Scorers):\n"))
cat(sprintf("   • Average scoring: %.1f PPG\n", mean(qualified$ppg)))
cat(sprintf("   • Average efficiency: %.1f%% TS%%\n",
           mean(qualified$ts_pct, na.rm = TRUE) * 100))
cat(sprintf("   • Elite efficiency threshold (75th percentile): %.1f%%\n",
           ts_75th * 100))

high_3pa <- qualified %>% filter(three_par >= 0.4)
low_3pa <- qualified %>% filter(three_par < 0.4)

cat(sprintf("\n3. THREE-POINT SHOOTING IMPACT:\n"))
cat(sprintf("   • High 3PA rate (≥40%% of FGA): avg TS%% = %.1f%%\n",
           mean(high_3pa$ts_pct, na.rm = TRUE) * 100))
cat(sprintf("   • Low 3PA rate (<40%% of FGA): avg TS%% = %.1f%%\n",
           mean(low_3pa$ts_pct, na.rm = TRUE) * 100))

cat("\n================================================================================\n")
cat("ANALYSIS COMPLETE\n")
cat("================================================================================\n")

Interpreting Basketball Analytics Results

Understanding Context

When interpreting basketball analytics results, always consider:

  • Sample size matters: A 60% TS% over 5 games tells you far less than a 60% TS% sustained over 82 games.
  • Position context: Centers typically post a higher FG% than guards because they shoot closer to the rim, while guards usually carry more of the shot creation.
  • Role on team: Usage rate and offensive system heavily influence individual statistics.
  • Pace of play: Teams playing faster generate more possessions, inflating counting stats.
  • Defensive attention: Stars facing double teams may have lower efficiency but create opportunities for teammates.

NBA Statistical Benchmarks

Metric               Poor     Average    Good       Excellent   Elite
True Shooting %      < 52%    54-57%     58-61%     62-65%      > 65%
Effective FG%        < 48%    50-53%     54-57%     58-62%      > 62%
Points Per Game      < 10     12-17      18-22      23-27       > 28
Assists Per Game     < 2      3-5        6-7        8-9         > 10
3-Point %            < 32%    33-36%     37-39%     40-43%      > 43%
AST/TO Ratio         < 1.0    1.5-2.0    2.1-2.5    2.6-3.0     > 3.0
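
To apply the True Shooting row of the table in code, a small lookup helper is one option. This is a minimal sketch: the function name `ts_tier` is hypothetical, and because the table's bands leave gaps (for example between 52% and 54%), it assumes that a value falling in a gap belongs to the lower tier.

```python
def ts_tier(ts_pct: float) -> str:
    """Map a True Shooting fraction (e.g. 0.60 for 60%) to a benchmark tier.

    Thresholds come from the True Shooting % row of the benchmark table.
    Assumption: values in the gaps between bands fall into the lower tier.
    """
    if ts_pct > 0.65:
        return "Elite"
    if ts_pct >= 0.62:
        return "Excellent"
    if ts_pct >= 0.58:
        return "Good"
    if ts_pct >= 0.54:
        return "Average"
    return "Poor"


for value in (0.50, 0.56, 0.60, 0.63, 0.67):
    print(f"TS% {value:.0%}: {ts_tier(value)}")
```

The same pattern works for the other rows by swapping in their thresholds.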

Common Pitfalls in Basketball Analytics

1. Overvaluing Counting Stats

The Problem: Focusing only on points, rebounds, and assists without considering efficiency or context.

The Solution: Always evaluate volume stats alongside efficiency metrics like TS%, eFG%, and usage rate.

# Python - Compare volume vs. efficiency
high_volume = df[df['PTS'] >= 20].copy()
high_volume = calculate_advanced_stats(high_volume)

# Find inefficient high scorers
inefficient = high_volume[high_volume['TS%'] < 0.55].sort_values('PTS', ascending=False)
print("High-volume, low-efficiency scorers:")
print(inefficient[['PLAYER_NAME', 'PTS', 'TS%', 'FGA']].head())

2. Ignoring Pace of Play

The Problem: Comparing raw stats between teams playing at different paces.

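The Solution: Use per-100-possession stats to adjust for pace. Per-36-minute stats are a related normalization that adjusts for playing time rather than pace, so they help when comparing players who log different minutes.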

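# Python - Calculate per-36-minute stats
# Treat 0 minutes as missing so the division yields NaN instead of inf
minutes = df['MIN'].replace(0, float('nan'))
df['PTS_per_36'] = (df['PTS'] / minutes) * 36
df['REB_per_36'] = (df['REB'] / minutes) * 36
df['AST_per_36'] = (df['AST'] / minutes) * 36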
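
For the pace adjustment itself, here is a rough sketch of a per-100-possession calculation. It assumes the widely used box-score estimate of possessions (FGA + 0.44 × FTA − OREB + TOV). The function names and the two teams are hypothetical, and the numbers are made up for illustration rather than taken from real NBA data.

```python
def estimate_possessions(fga, fta, oreb, tov):
    """Approximate possessions from box-score totals.

    Uses the common estimate FGA + 0.44 * FTA - OREB + TOV,
    where 0.44 is the usual weight for free-throw attempts.
    """
    return fga + 0.44 * fta - oreb + tov


def per_100_possessions(stat, possessions):
    """Scale a counting stat to a per-100-possession rate."""
    return stat / possessions * 100


# Illustrative per-game team totals (made-up numbers, not real NBA data)
teams = {
    "Fast Team": {"PTS": 120.0, "FGA": 95.0, "FTA": 25.0, "OREB": 11.0, "TOV": 15.0},
    "Slow Team": {"PTS": 108.0, "FGA": 84.0, "FTA": 20.0, "OREB": 9.0, "TOV": 12.2},
}

ratings = {}
for name, t in teams.items():
    poss = estimate_possessions(t["FGA"], t["FTA"], t["OREB"], t["TOV"])
    ratings[name] = per_100_possessions(t["PTS"], poss)
    print(f"{name}: {t['PTS']:.1f} PPG on {poss:.1f} possessions "
          f"-> {ratings[name]:.1f} points per 100 possessions")
```

In this made-up example the slower team scores fewer points per game but more per 100 possessions, which is the kind of reversal that raw counting stats hide.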

3. Small Sample Size Errors

The Problem: Drawing conclusions from limited data (e.g., the first 10 games of a season).

The Solution: Set minimum thresholds for games played and minutes.

# Python - Proper qualification
qualified = df[(df['GP'] >= 41) & (df['MIN'] >= 20)].copy()
print(f"Qualified players (41+ games, 20+ MPG): {len(qualified)}")
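
To see why thresholds like these matter, it can help to put a number on the uncertainty. The sketch below uses the standard error of a proportion, sqrt(p(1 − p) / n). It makes the simplifying assumption that every shot is an independent attempt with the same make probability, which real shots are not, so read it as a rough guide. The helper name `pct_standard_error` is hypothetical.

```python
import math


def pct_standard_error(pct, attempts):
    """Standard error of a shooting percentage under a simple binomial model.

    Assumption: each attempt is independent with the same make probability.
    """
    return math.sqrt(pct * (1 - pct) / attempts)


# The same 40% three-point shooter, observed over 50 and 500 attempts
small = pct_standard_error(0.40, 50)
large = pct_standard_error(0.40, 500)

print(f"40% on  50 attempts: +/- {small:.1%} (one standard error)")
print(f"40% on 500 attempts: +/- {large:.1%} (one standard error)")
```

Ten times the attempts shrinks the standard error by a factor of about 3.2 (the square root of 10), which is why early-season percentages swing so widely.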

4. Not Accounting for Position

The Problem: Comparing players at different positions without context.

The Solution: Compare players within position groups or use position-adjusted metrics.
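
One simple form of position adjustment is to measure each player against the average for their own position group. The sketch below does this with pandas. The `POSITION` and `FG_PCT` column names and the six sample players are assumptions for illustration, so adapt them to whatever your dataset provides.

```python
import pandas as pd

# Made-up sample data; real data would come from your loaded player table
players = pd.DataFrame({
    "PLAYER_NAME": ["Guard A", "Guard B", "Guard C",
                    "Center A", "Center B", "Center C"],
    "POSITION": ["G", "G", "G", "C", "C", "C"],
    "FG_PCT": [0.42, 0.45, 0.48, 0.55, 0.60, 0.65],
})

# Compare each player with the mean and spread of their own position group
grouped = players.groupby("POSITION")["FG_PCT"]
players["FG_PCT_vs_pos"] = players["FG_PCT"] - grouped.transform("mean")
players["FG_PCT_z"] = players["FG_PCT_vs_pos"] / grouped.transform("std")

print(players[["PLAYER_NAME", "POSITION", "FG_PCT", "FG_PCT_z"]])
```

Here Guard C (48% FG) and Center C (65% FG) get the same z-score, because each sits one standard deviation above the average for their position.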

5. Overlooking Defensive Impact

The Problem: Focusing exclusively on offensive statistics.

The Solution: Include defensive metrics like defensive rating, steals, blocks, and defensive win shares when available.

Next Steps in Your Basketball Analytics Journey

Congratulations on completing your first basketball analysis! Here's how to continue developing your skills:

Immediate Practice Projects

  1. Replicate with different seasons: Compare how league-wide statistics have changed from 2015 to 2024
  2. Team-level analysis: Analyze which team strategies (pace, 3PA rate) correlate with winning
  3. Player development: Track how young players improve their efficiency over their first 3 seasons
  4. Shot chart analysis: Examine shooting efficiency by court location
  5. Lineup analysis: Identify which player combinations have the best net rating

Advanced Topics to Explore

  • Player tracking data: Speed, distance, touches, and defensive metrics
  • Play-by-play analysis: Possession-level data and lineup interactions
  • Shot quality models: Expected field goal percentage based on shot location and defender distance
  • Plus/minus variants: Regularized adjusted plus/minus (RAPM) and real plus/minus (RPM)
  • Machine learning: Predictive models for player performance and game outcomes
  • Salary cap analytics: Contract value and market efficiency

Essential Learning Resources

  • Basketball-Reference.com: Historical statistics and advanced metrics for every player and team
  • NBA.com/stats: Official NBA statistics portal with tracking data and advanced metrics
  • Cleaning the Glass: Advanced analytics with role and pace adjustments
  • FiveThirtyEight: RAPTOR ratings and predictive models (discontinued in 2023, so treat these as historical reference material)
  • The Pudding: Creative basketball data visualizations and interactive analysis
  • Thinking Basketball YouTube: In-depth analytical breakdowns of players and strategies

Conclusion

You've now completed a comprehensive basketball analytics project, learning how to load NBA data, perform exploratory analysis, calculate advanced metrics, create visualizations, and interpret results. The analytical framework you've learned—from data quality checks to visualization best practices—applies to any basketball analytics question you want to explore.

Basketball analytics combines statistical rigor with deep basketball knowledge. The most valuable insights come from analysts who understand both the numbers and the game. As you continue your journey, watch games with an analytical eye, question conventional wisdom, and let data guide you to deeper understanding.

The NBA is increasingly driven by data, from shot selection to lineup optimization to player evaluation. The skills you've developed in this tutorial position you to contribute to this data revolution, whether as a fan seeking deeper understanding, a student pursuing a career in sports analytics, or an aspiring NBA analyst.

Keep practicing, stay curious, and remember: the best basketball analysis doesn't just crunch numbers—it tells compelling stories about the game we love.
