In This Chapter
- Introduction
- 16.1 Expected Points Models
- 16.2 Shot Difficulty Factors
- 16.3 Building a Shot Quality Model with Logistic Regression
- 16.4 Feature Engineering for Shot Prediction
- 16.5 Advanced Models: Gradient Boosting and Neural Networks
- 16.6 Shot Creation Value
- 16.7 Shooting Luck vs. Skill: Regression to the Mean
- 16.8 Points Above Expected
- 16.9 Applications: Player Evaluation
- 16.10 Applications: Coaching Decisions
- 16.11 Model Calibration and Validation
- 16.12 Practical Considerations and Limitations
- Summary
- Chapter References
Chapter 16: Shot Quality Models
Introduction
The revolution in basketball analytics began with a simple question: are all shots created equal? The answer, of course, is no. A wide-open three-pointer from the corner differs fundamentally from a contested fadeaway over a seven-footer. Shot quality models attempt to quantify these differences, providing a framework for understanding the true value of shooting opportunities and the players who create and convert them.
This chapter explores the construction and application of shot quality models, from basic expected points calculations to sophisticated machine learning approaches that account for dozens of contextual factors. We will build complete models using Python and scikit-learn, examining feature engineering, model evaluation, and practical applications in player evaluation and coaching decisions.
Shot quality modeling represents one of the most successful applications of machine learning in sports analytics. Unlike many predictive tasks where the outcome is distant and influenced by countless intervening factors, shot outcomes are immediate and binary: the ball either goes in or it does not. This clarity makes shot prediction an ideal domain for developing and refining analytical techniques.
16.1 Expected Points Models
The Foundation of Shot Quality
Expected points (xPoints or xPts) represents the average number of points a shot would yield if taken many times under identical circumstances. For a two-point shot with a 45% probability of going in, the expected points equals 0.90. For a three-pointer with a 35% make probability, the expected points equals 1.05.
The expected points framework transforms shooting from a binary outcome (made or missed) into a continuous measure of quality. This transformation is essential because:
- Sample size limitations: Even prolific scorers take only a few hundred shots per season from any specific location
- Variance in outcomes: A shooter might hit 5 consecutive difficult shots or miss 5 consecutive easy ones
- Decision evaluation: We want to assess whether taking a shot was a good decision, independent of whether it went in
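The variance point is easy to underestimate. As a quick sketch (assuming a shooter with a fixed, true 40% make rate and independent attempts), streaks of makes and misses are expected even when skill never changes:

```python
# Streakiness is expected even with a constant true skill level.
# Probability a 40% shooter hits 5 in a row, or misses 5 in a row,
# assuming independent attempts:
p_make = 0.40
p_five_makes = p_make ** 5          # 0.4^5
p_five_misses = (1 - p_make) ** 5   # 0.6^5

print(f"5 straight makes:  {p_five_makes:.4f}")   # ~0.0102
print(f"5 straight misses: {p_five_misses:.4f}")  # ~0.0778
```

Roughly one possession sequence in a hundred produces five straight makes from an average shooter, which is why raw make/miss outcomes are a noisy read on shot quality.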
Simple Expected Points Calculation
The most basic expected points model uses only shot location:
$$xPts = P(make | location) \times points\_value$$
Where $P(make | location)$ is the league-average field goal percentage from that location, and $points\_value$ is 2 or 3 depending on whether the shot is inside or beyond the arc.
import numpy as np
import pandas as pd

def simple_expected_points(shot_distance, is_three_pointer, league_fg_by_distance):
    """
    Calculate expected points using only shot distance.

    Parameters:
    -----------
    shot_distance : float
        Distance from basket in feet
    is_three_pointer : bool
        Whether the shot is a three-point attempt
    league_fg_by_distance : dict
        Dictionary mapping distance ranges to league-average FG%

    Returns:
    --------
    float : Expected points for the shot
    """
    # Determine the distance bucket
    if shot_distance < 4:
        fg_pct = league_fg_by_distance['rim']
    elif shot_distance < 10:
        fg_pct = league_fg_by_distance['short']
    elif shot_distance < 16:
        fg_pct = league_fg_by_distance['mid_short']
    elif shot_distance < 22:
        fg_pct = league_fg_by_distance['mid_long']
    else:
        fg_pct = league_fg_by_distance['three']
    points_value = 3 if is_three_pointer else 2
    return fg_pct * points_value

# Example league averages (approximate 2023-24 values)
league_fg = {
    'rim': 0.65,        # 0-4 feet
    'short': 0.42,      # 4-10 feet
    'mid_short': 0.40,  # 10-16 feet
    'mid_long': 0.40,   # 16-22 feet
    'three': 0.36       # 22+ feet
}

# Calculate expected points for different shots
print(f"Rim shot xPts: {simple_expected_points(2, False, league_fg):.3f}")
print(f"Mid-range xPts: {simple_expected_points(15, False, league_fg):.3f}")
print(f"Three-pointer xPts: {simple_expected_points(24, True, league_fg):.3f}")
Output:
Rim shot xPts: 1.300
Mid-range xPts: 0.800
Three-pointer xPts: 1.080
This simple model immediately reveals why modern offenses emphasize rim attempts and three-pointers: they yield significantly higher expected points than mid-range shots.
Zone-Based Expected Points
A more refined approach divides the court into zones, calculating expected points for each:
def create_shot_zones():
    """
    Define standard shot zones used in NBA analysis.

    Returns:
    --------
    dict : Zone definitions with boundaries and typical FG%
    """
    zones = {
        'restricted_area': {
            'description': 'Within 4 feet of rim',
            'distance_range': (0, 4),
            'angle_range': None,  # All angles
            'fg_pct': 0.63,
            'points': 2
        },
        'paint_non_ra': {
            'description': 'Paint outside restricted area',
            'distance_range': (4, 14),
            'angle_range': (-80, 80),  # Inside the paint lines
            'fg_pct': 0.40,
            'points': 2
        },
        'mid_range_left': {
            'description': 'Left side mid-range',
            'distance_range': (14, 22),
            'angle_range': (45, 135),
            'fg_pct': 0.41,
            'points': 2
        },
        'mid_range_right': {
            'description': 'Right side mid-range',
            'distance_range': (14, 22),
            'angle_range': (-135, -45),
            'fg_pct': 0.41,
            'points': 2
        },
        'mid_range_center': {
            'description': 'Top of key mid-range',
            'distance_range': (14, 22),
            'angle_range': (-45, 45),
            'fg_pct': 0.40,
            'points': 2
        },
        'corner_three_left': {
            'description': 'Left corner three',
            'distance_range': (22, 24),
            'angle_range': (70, 110),
            'fg_pct': 0.39,
            'points': 3
        },
        'corner_three_right': {
            'description': 'Right corner three',
            'distance_range': (22, 24),
            'angle_range': (-110, -70),
            'fg_pct': 0.39,
            'points': 3
        },
        'above_break_three': {
            'description': 'Above the break three',
            'distance_range': (22, 30),
            'angle_range': (-70, 70),
            'fg_pct': 0.36,
            'points': 3
        },
        'deep_three': {
            'description': 'Beyond 30 feet',
            'distance_range': (30, 50),
            'angle_range': None,
            'fg_pct': 0.30,
            'points': 3
        }
    }
    return zones

def zone_expected_points(x, y, zones):
    """
    Calculate expected points based on shot zone.

    Parameters:
    -----------
    x, y : float
        Shot coordinates (basket at origin)
    zones : dict
        Zone definitions from create_shot_zones()

    Returns:
    --------
    tuple : (zone_name, expected_points)
    """
    distance = np.sqrt(x**2 + y**2)
    angle = np.degrees(np.arctan2(x, y))  # Angle from basket
    for zone_name, zone in zones.items():
        dist_min, dist_max = zone['distance_range']
        if dist_min <= distance < dist_max:
            if zone['angle_range'] is None:
                return zone_name, zone['fg_pct'] * zone['points']
            angle_min, angle_max = zone['angle_range']
            if angle_min <= angle <= angle_max:
                return zone_name, zone['fg_pct'] * zone['points']
    # Default to deep three if no zone matched
    return 'deep_three', zones['deep_three']['fg_pct'] * 3
Limitations of Location-Only Models
While location-based models provide a useful baseline, they ignore crucial contextual factors:
- Defender position: An open shot differs vastly from a contested one
- Shot type: Catch-and-shoot versus off-the-dribble
- Game situation: Shot clock, score differential, quarter
- Player fatigue: Minutes played, pace of game
- Individual skill: Not all shooters are equal
These limitations motivate the development of more sophisticated shot quality models that incorporate additional features.
16.2 Shot Difficulty Factors
Distance from Basket
Shot distance is the most predictive single feature for make probability. League-wide field goal percentage declines steadily with distance:
def analyze_distance_effect(shots_df):
    """
    Analyze the relationship between shot distance and FG%.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'distance' and 'made' columns

    Returns:
    --------
    DataFrame : FG% by distance bucket
    """
    # Create distance buckets
    bins = [0, 4, 8, 12, 16, 20, 24, 28, 35]
    labels = ['0-4', '4-8', '8-12', '12-16', '16-20', '20-24', '24-28', '28+']
    shots_df['distance_bucket'] = pd.cut(
        shots_df['distance'],
        bins=bins,
        labels=labels
    )
    # Calculate FG% by bucket
    distance_analysis = shots_df.groupby('distance_bucket').agg(
        attempts=('made', 'count'),
        makes=('made', 'sum'),
        fg_pct=('made', 'mean')
    ).round(3)
    distance_analysis['expected_pts_2pt'] = distance_analysis['fg_pct'] * 2
    distance_analysis['expected_pts_3pt'] = distance_analysis['fg_pct'] * 3
    return distance_analysis
The relationship is not perfectly linear. Shots at the rim benefit from banking and tip-ins, creating a spike in efficiency. The three-point line creates a discontinuity where slightly longer shots become more valuable despite lower make rates.
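The size of that discontinuity is worth making explicit. Using the approximate league averages from the simple model above, a 21-foot two and a 24-foot three compare as follows:

```python
# The arc discontinuity, using the approximate league averages quoted
# earlier: a long two goes in more often, but the three is worth more.
fg_long_two = 0.40   # ~FG% on 16-22 foot twos
fg_three = 0.36      # ~FG% on above-the-break threes

xpts_long_two = fg_long_two * 2  # 0.80 expected points
xpts_three = fg_three * 3        # 1.08 expected points

print(f"Long two xPts:  {xpts_long_two:.2f}")
print(f"Three xPts:     {xpts_three:.2f}")
print(f"Gap at the arc: {xpts_three - xpts_long_two:+.2f}")  # +0.28
```

Stepping back a single foot across the line is worth roughly a quarter of a point per attempt, which is the arithmetic behind modern shot-selection diets.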
Defender Proximity
Defender distance is arguably the second most important factor in shot difficulty. NBA tracking data provides closest defender distance at the moment of release:
def categorize_defender_distance(defender_distance):
    """
    Categorize shots by defender proximity.

    Categories match NBA.com tracking data definitions:
    - Wide Open: 6+ feet
    - Open: 4-6 feet
    - Tight: 2-4 feet
    - Very Tight: 0-2 feet

    Parameters:
    -----------
    defender_distance : float
        Distance to closest defender in feet

    Returns:
    --------
    str : Defender distance category
    """
    if defender_distance >= 6:
        return 'wide_open'
    elif defender_distance >= 4:
        return 'open'
    elif defender_distance >= 2:
        return 'tight'
    else:
        return 'very_tight'

def analyze_defender_impact(shots_df):
    """
    Analyze how defender proximity affects shot success.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'defender_distance' and 'made' columns

    Returns:
    --------
    DataFrame : FG% by defender distance category
    """
    shots_df['contest_level'] = shots_df['defender_distance'].apply(
        categorize_defender_distance
    )
    # Order categories properly
    category_order = ['wide_open', 'open', 'tight', 'very_tight']
    contest_analysis = shots_df.groupby('contest_level').agg(
        attempts=('made', 'count'),
        makes=('made', 'sum'),
        fg_pct=('made', 'mean'),
        avg_distance=('distance', 'mean')
    ).reindex(category_order).round(3)
    return contest_analysis
League-wide data shows approximately:
- Wide Open (6+ feet): ~55% on two-pointers, ~40% on three-pointers
- Open (4-6 feet): ~48% on two-pointers, ~36% on three-pointers
- Tight (2-4 feet): ~42% on two-pointers, ~33% on three-pointers
- Very Tight (0-2 feet): ~38% on two-pointers, ~30% on three-pointers
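Converting those approximate rates to expected points makes the contest gradient easier to read. A small sketch (the percentages are the rough league-wide figures quoted above, not exact values):

```python
import pandas as pd

# Approximate league-wide make rates by contest level, per the text.
contest_rates = pd.DataFrame(
    {
        'fg2_pct': [0.55, 0.48, 0.42, 0.38],
        'fg3_pct': [0.40, 0.36, 0.33, 0.30],
    },
    index=['wide_open', 'open', 'tight', 'very_tight'],
)

# Expected points per attempt at each contest level
contest_rates['xpts_2pt'] = contest_rates['fg2_pct'] * 2
contest_rates['xpts_3pt'] = contest_rates['fg3_pct'] * 3
print(contest_rates)
```

Note that a wide-open three (~1.20 xPts) out-values even a wide-open two (~1.10 xPts), while a very tight two (~0.76 xPts) is among the worst shots on the floor.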
Shot Clock
Shot clock time remaining influences shot quality in complex ways:
def analyze_shot_clock_impact(shots_df):
    """
    Analyze how shot clock affects shot selection and success.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'shot_clock' and 'made' columns

    Returns:
    --------
    DataFrame : Analysis by shot clock ranges
    """
    # Define shot clock ranges
    def shot_clock_category(seconds):
        if pd.isna(seconds):
            return 'unknown'
        elif seconds <= 4:
            return 'very_late'
        elif seconds <= 8:
            return 'late'
        elif seconds <= 15:
            return 'mid'
        else:
            return 'early'

    shots_df['clock_category'] = shots_df['shot_clock'].apply(shot_clock_category)
    analysis = shots_df.groupby('clock_category').agg(
        attempts=('made', 'count'),
        fg_pct=('made', 'mean'),
        avg_distance=('distance', 'mean'),
        three_pt_rate=('is_three', 'mean'),
        avg_defender_dist=('defender_distance', 'mean')
    ).round(3)
    return analysis
Key findings from shot clock analysis:
- Early clock (15+ seconds): Often transition opportunities, higher FG%
- Mid clock (8-15 seconds): Typical half-court attempts
- Late clock (4-8 seconds): Slightly lower FG%, but still reasonable
- Very late clock (0-4 seconds): Significantly lower FG%, often difficult forced shots
Touch Time
How long a player holds the ball before shooting affects outcomes:
def analyze_touch_time(shots_df):
    """
    Analyze the impact of touch time on shooting.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'touch_time' column (seconds)

    Returns:
    --------
    DataFrame : Analysis by touch time category
    """
    def touch_time_category(seconds):
        if seconds < 2:
            return 'catch_and_shoot'
        elif seconds < 4:
            return 'quick'
        elif seconds < 6:
            return 'moderate'
        else:
            return 'extended'

    shots_df['touch_category'] = shots_df['touch_time'].apply(touch_time_category)
    analysis = shots_df.groupby('touch_category').agg(
        attempts=('made', 'count'),
        fg_pct=('made', 'mean'),
        avg_defender_dist=('defender_distance', 'mean')
    ).round(3)
    return analysis
Catch-and-shoot opportunities generally yield higher percentages than off-the-dribble shots, even after controlling for defender distance. The rhythm and balance advantages of catching and immediately shooting contribute to this difference.
Dribbles Before Shot
Related to touch time, the number of dribbles before a shot correlates with difficulty:
def analyze_dribbles_impact(shots_df):
    """
    Analyze how dribbles before shot affect success rate.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'dribbles' column

    Returns:
    --------
    DataFrame : Analysis by dribble count
    """
    # Cap at 7+ for grouping. Build string labels so the '7+' bucket shares
    # a dtype with the lower counts (assigning a string into an integer
    # column via .loc is an error in recent pandas versions).
    shots_df['dribble_group'] = (
        shots_df['dribbles'].clip(upper=7).astype(int).astype(str)
    )
    shots_df.loc[shots_df['dribble_group'] == '7', 'dribble_group'] = '7+'
    analysis = shots_df.groupby('dribble_group').agg(
        attempts=('made', 'count'),
        fg_pct=('made', 'mean'),
        avg_distance=('distance', 'mean'),
        avg_defender_dist=('defender_distance', 'mean')
    ).round(3)
    return analysis
Zero dribbles (catch-and-shoot) typically yields the highest percentages, with efficiency declining as dribbles increase. However, players who take many dribbles often face tighter defense, so this variable interacts with defender proximity.
Additional Difficulty Factors
Beyond the core factors above, comprehensive shot quality models may include:
- Shot type: Layup, dunk, hook, floating, pull-up, step-back
- Player position on court: Angle affects shot difficulty
- Game state: Score differential, quarter, playoffs vs regular season
- Defender height and wingspan: Longer defenders more disruptive
- Previous action: Off screen, isolation, post-up, transition
- Shooter's prior shots: Hot hand, fatigue effects
16.3 Building a Shot Quality Model with Logistic Regression
Logistic regression provides an interpretable framework for shot quality modeling. The model predicts the probability of a make given input features, naturally bounded between 0 and 1.
Data Preparation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, roc_auc_score, brier_score_loss,
    log_loss, classification_report
)

def prepare_shot_data(shots_df):
    """
    Prepare shot data for modeling.

    Parameters:
    -----------
    shots_df : DataFrame
        Raw shot data

    Returns:
    --------
    tuple : (X, y, numeric_features, categorical_features)
    """
    # Define features
    numeric_features = [
        'distance',
        'defender_distance',
        'shot_clock',
        'touch_time',
        'dribbles',
        'x_coord',
        'y_coord'
    ]
    categorical_features = [
        'shot_type',
        'action_type',
        'period',
        'is_home'
    ]
    # Handle missing values
    shots_df = shots_df.dropna(subset=['made'] + numeric_features)
    # Fill categorical NAs
    for col in categorical_features:
        if col in shots_df.columns:
            shots_df[col] = shots_df[col].fillna('unknown')
    # Create feature matrix
    X = shots_df[numeric_features + categorical_features].copy()
    y = shots_df['made'].astype(int)
    return X, y, numeric_features, categorical_features

def create_preprocessing_pipeline(numeric_features, categorical_features):
    """
    Create preprocessing pipeline for shot features.

    Parameters:
    -----------
    numeric_features : list
        Names of numeric columns
    categorical_features : list
        Names of categorical columns

    Returns:
    --------
    ColumnTransformer : Preprocessing pipeline
    """
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False),
             categorical_features)
        ]
    )
    return preprocessor
Model Training
def train_shot_quality_model(X, y, numeric_features, categorical_features):
    """
    Train a logistic regression shot quality model.

    Parameters:
    -----------
    X : DataFrame
        Feature matrix
    y : Series
        Target variable (1=made, 0=missed)
    numeric_features, categorical_features : lists
        Feature names by type

    Returns:
    --------
    tuple : (trained_pipeline, X_test, y_test, y_prob, metrics)
    """
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Create pipeline
    preprocessor = create_preprocessing_pipeline(
        numeric_features, categorical_features
    )
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(
            max_iter=1000,
            solver='lbfgs',
            C=1.0  # Regularization strength
            # Note: class_weight='balanced' is deliberately omitted.
            # Reweighting classes shifts the predicted probabilities,
            # and we need those probabilities well calibrated (see 16.11);
            # makes vs. misses are close to balanced anyway.
        ))
    ])
    # Train model
    pipeline.fit(X_train, y_train)
    # Generate predictions
    y_pred = pipeline.predict(X_test)
    y_prob = pipeline.predict_proba(X_test)[:, 1]
    # Calculate metrics
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_prob),
        'brier_score': brier_score_loss(y_test, y_prob),
        'log_loss': log_loss(y_test, y_prob)
    }
    return pipeline, X_test, y_test, y_prob, metrics
def print_model_evaluation(metrics, y_test, y_pred):
    """
    Print comprehensive model evaluation.

    Parameters:
    -----------
    metrics : dict
        Model performance metrics
    y_test : array
        True labels
    y_pred : array
        Predicted labels
    """
    print("=" * 50)
    print("SHOT QUALITY MODEL EVALUATION")
    print("=" * 50)
    print(f"\nAccuracy: {metrics['accuracy']:.4f}")
    print(f"ROC AUC: {metrics['roc_auc']:.4f}")
    print(f"Brier Score: {metrics['brier_score']:.4f}")
    print(f"Log Loss: {metrics['log_loss']:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Miss', 'Make']))
Interpreting Coefficients
One advantage of logistic regression is interpretable coefficients:
def interpret_logistic_coefficients(pipeline, numeric_features, categorical_features):
    """
    Extract and interpret logistic regression coefficients.

    Parameters:
    -----------
    pipeline : Pipeline
        Trained sklearn pipeline
    numeric_features, categorical_features : lists
        Feature names

    Returns:
    --------
    DataFrame : Coefficients with interpretation
    """
    # Get the classifier
    clf = pipeline.named_steps['classifier']
    # Get feature names after one-hot encoding
    cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
    cat_feature_names = cat_encoder.get_feature_names_out(categorical_features)
    all_features = list(numeric_features) + list(cat_feature_names)
    # Create coefficient DataFrame
    coef_df = pd.DataFrame({
        'feature': all_features,
        'coefficient': clf.coef_[0],
        'odds_ratio': np.exp(clf.coef_[0])
    })
    coef_df['impact'] = coef_df['coefficient'].apply(
        lambda x: 'positive' if x > 0 else 'negative'
    )
    # Sort by absolute coefficient
    coef_df['abs_coef'] = coef_df['coefficient'].abs()
    coef_df = coef_df.sort_values('abs_coef', ascending=False)
    return coef_df.drop('abs_coef', axis=1)
Example interpretation of coefficients:
- Distance coefficient -0.08: Each additional foot from the basket decreases log-odds by 0.08, corresponding to an odds ratio of 0.92 (8% decrease in odds per foot)
- Defender distance coefficient +0.15: Each additional foot from the nearest defender increases log-odds by 0.15, corresponding to an odds ratio of 1.16
- Dunk shot type +2.1: Dunks have dramatically higher make probability than the baseline shot type
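The odds ratios above are just the exponentials of the coefficients, which is worth verifying once by hand:

```python
import numpy as np

# Odds ratio = exp(coefficient). Checking the figures quoted above:
print(np.exp(-0.08))  # ~0.923 -> about an 8% drop in odds per foot of distance
print(np.exp(0.15))   # ~1.162 -> about a 16% rise in odds per foot of space
print(np.exp(2.1))    # ~8.17  -> dunks multiply the make odds roughly eightfold
```

Remember these are multiplicative effects on *odds*, not additive effects on probability; a 16% rise in odds is a smaller move in probability when the baseline is near 50%.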
16.4 Feature Engineering for Shot Prediction
Effective shot quality models require thoughtful feature engineering. Raw tracking data must be transformed into meaningful predictive features.
Spatial Features
def engineer_spatial_features(shots_df):
    """
    Create spatial features from shot coordinates.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with x, y coordinates (basket at origin)

    Returns:
    --------
    DataFrame : Enhanced with spatial features
    """
    df = shots_df.copy()
    # Distance from basket
    df['distance'] = np.sqrt(df['x_coord']**2 + df['y_coord']**2)
    # Angle from basket (0 = straight on, 90/-90 = sideline)
    df['angle'] = np.degrees(np.arctan2(df['x_coord'], df['y_coord']))
    df['abs_angle'] = df['angle'].abs()
    # Is corner three (sideline three-pointer)
    df['is_corner_three'] = (
        (df['distance'] >= 22) &
        (df['distance'] < 24) &
        (df['abs_angle'] > 70)
    ).astype(int)
    # Side of court
    df['court_side'] = np.where(df['x_coord'] > 0, 'right', 'left')
    # Distance from three-point line (negative = inside arc)
    # Approximate arc distance
    three_point_distance = 23.75  # Above the break
    corner_three_distance = 22.0
    df['distance_from_arc'] = np.where(
        df['abs_angle'] > 70,
        df['distance'] - corner_three_distance,
        df['distance'] - three_point_distance
    )
    # Rim area indicator
    df['at_rim'] = (df['distance'] < 4).astype(int)
    # Paint indicator (inside the key)
    df['in_paint'] = (
        (df['distance'] < 16) &
        (df['abs_angle'] < 45)
    ).astype(int)
    return df

def engineer_defender_features(shots_df):
    """
    Create features related to defensive pressure.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with defender tracking info

    Returns:
    --------
    DataFrame : Enhanced with defender features
    """
    df = shots_df.copy()
    # Categorical contest level
    df['contest_level'] = pd.cut(
        df['defender_distance'],
        bins=[-np.inf, 2, 4, 6, np.inf],
        labels=['very_tight', 'tight', 'open', 'wide_open']
    )
    # Is heavily contested (binary)
    df['is_contested'] = (df['defender_distance'] < 4).astype(int)
    # Defender closing speed (if available)
    if 'defender_closing_speed' in df.columns:
        df['defender_closing_fast'] = (
            df['defender_closing_speed'] > 5
        ).astype(int)
    # Number of defenders in proximity (if available)
    if 'defenders_within_5ft' in df.columns:
        df['multiple_defenders'] = (
            df['defenders_within_5ft'] > 1
        ).astype(int)
    return df
Temporal Features
def engineer_temporal_features(shots_df):
    """
    Create features related to game timing and situation.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with game timing information

    Returns:
    --------
    DataFrame : Enhanced with temporal features
    """
    df = shots_df.copy()
    # Shot clock pressure
    df['shot_clock_bucket'] = pd.cut(
        df['shot_clock'],
        bins=[0, 4, 8, 15, 24],
        labels=['very_late', 'late', 'mid', 'early']
    )
    df['shot_clock_pressure'] = (df['shot_clock'] < 7).astype(int)
    # Game clock features
    if 'game_clock' in df.columns and 'period' in df.columns:
        # End of quarter (final 2 minutes)
        df['end_of_quarter'] = (
            (df['game_clock'] < 120) & (df['period'] <= 4)
        ).astype(int)
        # Clutch time (final 5 minutes, score within 5)
        if 'score_margin' in df.columns:
            df['clutch'] = (
                (df['game_clock'] < 300) &
                (df['period'] == 4) &
                (df['score_margin'].abs() <= 5)
            ).astype(int)
    # Quarter effects
    df['period_numeric'] = df['period'].clip(upper=4)  # Cap at 4 for OT
    df['is_overtime'] = (df['period'] > 4).astype(int)
    return df

def engineer_player_context_features(shots_df):
    """
    Create features related to player actions and context.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with player tracking info

    Returns:
    --------
    DataFrame : Enhanced with player context features
    """
    df = shots_df.copy()
    # Shot creation type
    df['is_catch_and_shoot'] = (df['touch_time'] < 2).astype(int)
    df['is_pull_up'] = (
        (df['dribbles'] > 0) &
        (df['touch_time'] >= 2)
    ).astype(int)
    # Dribble features
    df['dribbles_capped'] = df['dribbles'].clip(upper=10)
    df['many_dribbles'] = (df['dribbles'] >= 5).astype(int)
    # Touch time buckets
    df['touch_time_bucket'] = pd.cut(
        df['touch_time'],
        bins=[0, 2, 4, 6, np.inf],
        labels=['instant', 'quick', 'moderate', 'long']
    )
    return df
Interaction Features
def engineer_interaction_features(shots_df):
    """
    Create interaction features between variables.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with base features

    Returns:
    --------
    DataFrame : Enhanced with interaction features
    """
    df = shots_df.copy()
    # Distance x Contest interaction
    df['distance_contest_interaction'] = (
        df['distance'] * (1 / (df['defender_distance'] + 1))
    )
    # Three-pointer x Contest
    if 'is_three' in df.columns:
        df['contested_three'] = (
            df['is_three'] * df['is_contested']
        )
        df['open_three'] = (
            df['is_three'] * (1 - df['is_contested'])
        )
    # Catch-and-shoot x Open
    df['open_catch_shoot'] = (
        df['is_catch_and_shoot'] *
        (df['defender_distance'] >= 4).astype(int)
    )
    # Late clock x Distance
    df['late_clock_distance'] = (
        df['shot_clock_pressure'] * df['distance']
    )
    # Rim shot x Contest (very important interaction)
    df['contested_rim'] = (
        df['at_rim'] * df['is_contested']
    )
    return df
16.5 Advanced Models: Gradient Boosting and Neural Networks
While logistic regression provides interpretability, gradient boosting and neural networks often achieve superior predictive performance.
Gradient Boosting Implementation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
import xgboost as xgb

def train_xgboost_shot_model(X, y, numeric_features, categorical_features):
    """
    Train an XGBoost shot quality model with hyperparameter tuning.

    Parameters:
    -----------
    X : DataFrame
        Feature matrix
    y : Series
        Target variable
    numeric_features, categorical_features : lists
        Feature names by type

    Returns:
    --------
    tuple : (best_model, preprocessor, best_params, metrics)
    """
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Create preprocessing pipeline
    preprocessor = create_preprocessing_pipeline(
        numeric_features, categorical_features
    )
    # Transform data
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    # Define parameter grid for tuning
    # (108 combinations x 5 folds: expect a long search on large data)
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.05, 0.1],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }
    # Initialize XGBoost
    # (the old use_label_encoder flag was removed in XGBoost 2.0)
    xgb_clf = xgb.XGBClassifier(
        objective='binary:logistic',
        eval_metric='auc',
        random_state=42
    )
    # Grid search with cross-validation
    grid_search = GridSearchCV(
        xgb_clf,
        param_grid,
        cv=5,
        scoring='roc_auc',
        n_jobs=-1,
        verbose=1
    )
    grid_search.fit(X_train_processed, y_train)
    # Best model
    best_model = grid_search.best_estimator_
    # Evaluate on test set
    y_prob = best_model.predict_proba(X_test_processed)[:, 1]
    metrics = {
        'roc_auc': roc_auc_score(y_test, y_prob),
        'brier_score': brier_score_loss(y_test, y_prob),
        'log_loss': log_loss(y_test, y_prob)
    }
    return best_model, preprocessor, grid_search.best_params_, metrics

def get_xgboost_feature_importance(model, feature_names):
    """
    Extract feature importance from XGBoost model.

    Parameters:
    -----------
    model : XGBClassifier
        Trained XGBoost model
    feature_names : list
        Names of features

    Returns:
    --------
    DataFrame : Feature importance rankings
    """
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    })
    importance_df = importance_df.sort_values(
        'importance', ascending=False
    ).reset_index(drop=True)
    return importance_df
Neural Network Implementation
from sklearn.neural_network import MLPClassifier

def train_neural_network_shot_model(X, y, numeric_features, categorical_features):
    """
    Train a neural network shot quality model.

    Parameters:
    -----------
    X : DataFrame
        Feature matrix
    y : Series
        Target variable
    numeric_features, categorical_features : lists
        Feature names by type

    Returns:
    --------
    tuple : (trained_model, preprocessor, metrics)
    """
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Create preprocessing pipeline
    preprocessor = create_preprocessing_pipeline(
        numeric_features, categorical_features
    )
    # Transform data
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    # Define neural network architecture
    mlp = MLPClassifier(
        hidden_layer_sizes=(128, 64, 32),  # Three hidden layers
        activation='relu',
        solver='adam',
        alpha=0.001,  # L2 regularization
        batch_size=256,
        learning_rate='adaptive',
        learning_rate_init=0.001,
        max_iter=500,
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        random_state=42,
        verbose=True
    )
    # Train model
    mlp.fit(X_train_processed, y_train)
    # Evaluate
    y_prob = mlp.predict_proba(X_test_processed)[:, 1]
    metrics = {
        'roc_auc': roc_auc_score(y_test, y_prob),
        'brier_score': brier_score_loss(y_test, y_prob),
        'log_loss': log_loss(y_test, y_prob),
        'n_iterations': mlp.n_iter_
    }
    return mlp, preprocessor, metrics
Model Comparison
def compare_shot_models(X, y, numeric_features, categorical_features):
    """
    Compare multiple shot quality models.

    Parameters:
    -----------
    X : DataFrame
        Feature matrix
    y : Series
        Target variable
    numeric_features, categorical_features : lists
        Feature names by type

    Returns:
    --------
    DataFrame : Model comparison results
    """
    # Split data once for fair comparison
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Preprocessing
    preprocessor = create_preprocessing_pipeline(
        numeric_features, categorical_features
    )
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    # Define models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, C=1.0),
        'Gradient Boosting': GradientBoostingClassifier(
            n_estimators=200, max_depth=5, learning_rate=0.05
        ),
        'XGBoost': xgb.XGBClassifier(
            n_estimators=200, max_depth=5, learning_rate=0.05,
            eval_metric='auc'  # use_label_encoder removed in XGBoost 2.0
        ),
        'Neural Network': MLPClassifier(
            hidden_layer_sizes=(128, 64), max_iter=500, early_stopping=True
        )
    }
    results = []
    for name, model in models.items():
        print(f"Training {name}...")
        model.fit(X_train_processed, y_train)
        y_prob = model.predict_proba(X_test_processed)[:, 1]
        results.append({
            'model': name,
            'roc_auc': roc_auc_score(y_test, y_prob),
            'brier_score': brier_score_loss(y_test, y_prob),
            'log_loss': log_loss(y_test, y_prob)
        })
    return pd.DataFrame(results)
16.6 Shot Creation Value
Beyond predicting whether shots go in, we must assess the value of creating shooting opportunities. Shot creation value measures how much a player contributes by generating shots for themselves or teammates.
Defining Shot Creation
def calculate_shot_creation_value(player_shots_df, model, preprocessor):
"""
Calculate shot creation value for a player.
Shot creation value = sum of (xPts for shots created - league average xPts)
Parameters:
-----------
player_shots_df : DataFrame
Shots taken by or assisted by the player
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : Shot creation metrics
"""
# Get expected make probability for each shot
X = player_shots_df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
xfg_probs = model.predict_proba(X_processed)[:, 1]
# Calculate expected points
player_shots_df = player_shots_df.copy()
player_shots_df['xfg'] = xfg_probs
player_shots_df['xpts'] = (
player_shots_df['xfg'] *
np.where(player_shots_df['is_three'], 3, 2)
)
# League average xPts per shot (baseline)
league_avg_xpts = 1.05 # Approximate value
# Shots created for self (unassisted)
self_created = player_shots_df[player_shots_df['assisted'] == False]
# Shots created for others (assists)
assists_df = player_shots_df[player_shots_df['is_assist'] == True]
metrics = {
'total_shots_created': len(player_shots_df),
'self_created_shots': len(self_created),
'assisted_shots': len(assists_df),
'avg_xpts_self_created': self_created['xpts'].mean(),
'avg_xpts_assists': assists_df['xpts'].mean() if len(assists_df) > 0 else 0,
'total_xpts_created': player_shots_df['xpts'].sum(),
'xpts_above_average': (
player_shots_df['xpts'].sum() -
len(player_shots_df) * league_avg_xpts
),
'creation_value_per_shot': (
player_shots_df['xpts'].mean() - league_avg_xpts
)
}
return metrics
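The 1.05 baseline above is hardcoded; in practice you would estimate it from the league-wide shot table. A minimal sketch, assuming a DataFrame with the same xfg and is_three columns the functions in this section use (the toy values here are illustrative, not real league data):

```python
import numpy as np
import pandas as pd

# Toy league-wide shot table; in practice this would be every shot
# in the dataset, scored by the shot quality model
league_df = pd.DataFrame({
    'xfg': [0.65, 0.40, 0.36, 0.55, 0.34],
    'is_three': [False, False, True, False, True],
})

# Expected points per shot = make probability times shot value
league_df['xpts'] = league_df['xfg'] * np.where(league_df['is_three'], 3, 2)

league_avg_xpts = league_df['xpts'].mean()
```

Recomputing this baseline per season also guards against league-wide efficiency drift silently biasing the creation-value numbers.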
def analyze_shot_creation_by_type(player_shots_df, model, preprocessor):
"""
Break down shot creation value by play type.
Parameters:
-----------
player_shots_df : DataFrame
Player's created shots with play type info
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Creation value by play type
"""
# Calculate xPts for all shots
X = player_shots_df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
player_shots_df = player_shots_df.copy()
player_shots_df['xfg'] = model.predict_proba(X_processed)[:, 1]
player_shots_df['xpts'] = (
player_shots_df['xfg'] *
np.where(player_shots_df['is_three'], 3, 2)
)
# Group by play type
creation_by_type = player_shots_df.groupby('play_type').agg(
shots=('xpts', 'count'),
total_xpts=('xpts', 'sum'),
avg_xpts=('xpts', 'mean'),
actual_pts=('points_scored', 'sum'),
fg_pct=('made', 'mean')
).round(3)
creation_by_type['pts_vs_expected'] = (
creation_by_type['actual_pts'] - creation_by_type['total_xpts']
)
return creation_by_type.sort_values('total_xpts', ascending=False)
Shot Creation Profiles
Different players create value through different means:
def create_shot_creation_profile(player_shots_df, model, preprocessor):
"""
Create a comprehensive shot creation profile for a player.
Parameters:
-----------
player_shots_df : DataFrame
All shots created by player
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : Comprehensive creation profile
"""
df = player_shots_df.copy()
# Calculate xPts
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
df['xpts'] = df['xfg'] * np.where(df['is_three'], 3, 2)
profile = {
# Volume
'total_shots': len(df),
'shots_per_game': len(df) / df['game_id'].nunique(),
# Location distribution
'rim_rate': (df['distance'] < 4).mean(),
'mid_range_rate': ((df['distance'] >= 4) & (df['distance'] < 22)).mean(),
'three_rate': df['is_three'].mean(),
# Quality
'avg_xpts': df['xpts'].mean(),
'avg_xfg': df['xfg'].mean(),
'avg_defender_dist': df['defender_distance'].mean(),
# Creation style
'catch_and_shoot_rate': (df['touch_time'] < 2).mean(),
'pull_up_rate': ((df['dribbles'] > 0) & (df['touch_time'] >= 2)).mean(),
'assisted_rate': df['assisted'].mean() if 'assisted' in df.columns else None,
# Difficulty profile
'contested_rate': (df['defender_distance'] < 4).mean(),
'avg_shot_clock': df['shot_clock'].mean(),
'late_clock_rate': (df['shot_clock'] < 7).mean(),
# Efficiency
'actual_fg_pct': df['made'].mean(),
'actual_pts': (df['made'] * np.where(df['is_three'], 3, 2)).sum(),
'expected_pts': df['xpts'].sum(),
'pts_vs_expected': None # Calculated below
}
profile['pts_vs_expected'] = profile['actual_pts'] - profile['expected_pts']
return profile
16.7 Shooting Luck vs. Skill: Regression to the Mean
One of the most important applications of shot quality models is distinguishing luck from skill in shooting performance. Players who significantly outperform their expected field goal percentage may be demonstrating exceptional skill, or they may be benefiting from positive variance that will regress.
Understanding Regression to the Mean
def analyze_shooting_luck(player_season_df, model, preprocessor):
"""
Analyze shooting luck vs skill for players.
Parameters:
-----------
player_season_df : DataFrame
Season shooting data for multiple players
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Luck analysis by player
"""
results = []
for player_id in player_season_df['player_id'].unique():
player_shots = player_season_df[
player_season_df['player_id'] == player_id
]
if len(player_shots) < 100: # Minimum sample
continue
# Calculate xFG%
X = player_shots[model.feature_names_in_]
X_processed = preprocessor.transform(X)
xfg_probs = model.predict_proba(X_processed)[:, 1]
actual_fg = player_shots['made'].mean()
expected_fg = xfg_probs.mean()
# Difference suggests luck or skill
fg_diff = actual_fg - expected_fg
results.append({
'player_id': player_id,
'player_name': player_shots['player_name'].iloc[0],
'attempts': len(player_shots),
'actual_fg_pct': actual_fg,
'expected_fg_pct': expected_fg,
'fg_diff': fg_diff,
'z_score': fg_diff / np.sqrt(
expected_fg * (1 - expected_fg) / len(player_shots)
)
})
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('fg_diff', ascending=False)
return results_df
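The z-score in analyze_shooting_luck is simply the binomial test statistic: observed minus expected, divided by the standard error implied by the model's expectation. A self-contained check of the arithmetic with round numbers rather than model output:

```python
import math

attempts = 400
actual_fg = 0.50      # observed FG%
expected_fg = 0.46    # model's xFG% over the same shots

# Standard error of FG% if the model's expectation were the truth
se = math.sqrt(expected_fg * (1 - expected_fg) / attempts)

# How many standard errors above expectation the player is shooting
z_score = (actual_fg - expected_fg) / se
```

Here z_score comes out around 1.6: notable, but well short of the 2.0 threshold used later to flag regression candidates.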
def calculate_regression_projection(current_fg, expected_fg, attempts,
regression_weight=500):
"""
Project true shooting ability with regression to expected.
Uses empirical Bayes shrinkage toward expected FG%.
Parameters:
-----------
current_fg : float
Current observed FG%
expected_fg : float
Expected FG% from shot quality model
attempts : int
Number of shot attempts
regression_weight : int
Weight given to expected (like adding 'regression_weight'
shots at expected rate)
Returns:
--------
float : Regressed estimate of true FG%
"""
# Weighted average of observed and expected
regressed_fg = (
(current_fg * attempts + expected_fg * regression_weight) /
(attempts + regression_weight)
)
return regressed_fg
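For intuition, consider a hypothetical player hitting 45% against an expected 36% through 100 attempts. Inlining the same weighted average as calculate_regression_projection (with its default weight of 500 phantom shots at the expected rate) shows how strongly small samples are pulled toward the model:

```python
def regress(current_fg, expected_fg, attempts, regression_weight=500):
    # Weighted average: real attempts at the observed rate plus
    # 'regression_weight' phantom attempts at the expected rate
    return (current_fg * attempts + expected_fg * regression_weight) / (
        attempts + regression_weight
    )

# (0.45 * 100 + 0.36 * 500) / 600 = 0.375
projection = regress(0.45, 0.36, 100)
```

A 9-point overperformance shrinks to less than 2 points; only sustained volume moves the projection meaningfully.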
def identify_regression_candidates(player_df, model, preprocessor,
threshold_z=2.0):
"""
Identify players likely to regress toward expected performance.
Parameters:
-----------
player_df : DataFrame
Player shooting data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
threshold_z : float
Z-score threshold for flagging regression candidates
Returns:
--------
tuple : (positive_regression_candidates, negative_regression_candidates)
"""
luck_analysis = analyze_shooting_luck(player_df, model, preprocessor)
# Players shooting above expected (likely to regress down)
positive_luck = luck_analysis[luck_analysis['z_score'] > threshold_z]
# Players shooting below expected (likely to regress up)
negative_luck = luck_analysis[luck_analysis['z_score'] < -threshold_z]
return positive_luck, negative_luck
Three-Point Shooting Regression
Three-point percentage is particularly noisy: make rates are lower, so a season's sample carries more variance relative to the mean, and each make or miss swings the scoreboard by three points:
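A quick way to see this is to compare the sampling error of an illustrative 36% three-point shooter with that of a 52% two-point shooter over the same 100 attempts. The absolute standard errors are similar, but relative to the make rate the three-point figure is much noisier:

```python
import math

n = 100
p_three, p_two = 0.36, 0.52   # illustrative league-ish make rates

se_three = math.sqrt(p_three * (1 - p_three) / n)   # absolute SE, ~0.048
se_two = math.sqrt(p_two * (1 - p_two) / n)         # absolute SE, ~0.050

# Relative to the make rate (and with three points at stake per shot),
# the three-point percentage carries the larger uncertainty
rel_three = se_three / p_three
rel_two = se_two / p_two
```

Over 100 attempts a true 36% shooter routinely posts anywhere from 31% to 41%, which is the difference between a poor and an elite shooting season.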
def analyze_three_point_regression(player_df, min_attempts=100):
"""
Analyze three-point shooting for regression candidates.
Parameters:
-----------
player_df : DataFrame
Player shooting data
min_attempts : int
Minimum three-point attempts for inclusion
Returns:
--------
DataFrame : Three-point regression analysis
"""
# Filter to three-pointers
threes_df = player_df[player_df['is_three'] == True]
results = []
for player_id in threes_df['player_id'].unique():
player_threes = threes_df[threes_df['player_id'] == player_id]
if len(player_threes) < min_attempts:
continue
# Calculate actual 3P%
actual_3p = player_threes['made'].mean()
# Career baseline (if available)
career_3p = player_threes['career_3p_pct'].iloc[0] if \
'career_3p_pct' in player_threes.columns else 0.36
# League average
league_avg_3p = 0.36
# Simple regression toward career/league average
attempts = len(player_threes)
regressed_3p = calculate_regression_projection(
actual_3p, career_3p, attempts, regression_weight=300
)
results.append({
'player_id': player_id,
'player_name': player_threes['player_name'].iloc[0],
'3pa': attempts,
'actual_3p_pct': actual_3p,
'career_3p_pct': career_3p,
'regressed_3p_pct': regressed_3p,
'current_vs_career': actual_3p - career_3p,
'expected_regression': regressed_3p - actual_3p
})
return pd.DataFrame(results).sort_values('current_vs_career', ascending=False)
Stabilization Points
The stabilization point is the sample size at which sampling noise and true talent differences contribute equally to observed performance — equivalently, the number of attempts at which you would regress an observed percentage halfway toward the prior:
def estimate_stabilization_point(shot_df, shot_type='all'):
"""
Estimate the stabilization point for shooting percentage.
Parameters:
-----------
shot_df : DataFrame
Historical shot data
shot_type : str
'all', 'two', or 'three'
Returns:
--------
int : Estimated stabilization point in attempts
"""
# Filter by shot type
if shot_type == 'two':
df = shot_df[shot_df['is_three'] == False]
elif shot_type == 'three':
df = shot_df[shot_df['is_three'] == True]
else:
df = shot_df
# Calculate league average
league_avg = df['made'].mean()
# Estimate variance of true shooting ability
# Using player season data
player_season = df.groupby(['player_id', 'season']).agg(
attempts=('made', 'count'),
fg_pct=('made', 'mean')
).reset_index()
# Filter for adequate sample
player_season = player_season[player_season['attempts'] >= 200]
# Observed variance = true variance + sampling variance
# Var(observed) = Var(true) + p(1-p)/n
observed_var = player_season['fg_pct'].var()
avg_attempts = player_season['attempts'].mean()
sampling_var = league_avg * (1 - league_avg) / avg_attempts
true_var = max(observed_var - sampling_var, 0.0001)
# Stabilization point formula
# At stabilization, Var(sampling) = Var(true)
stabilization = league_avg * (1 - league_avg) / true_var
return int(stabilization)
# Typical stabilization points:
# - 2-point FG%: ~400-500 attempts
# - 3-point FG%: ~700-800 attempts
# - Free throw %: ~200-300 attempts
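The stabilization point doubles as the shrinkage weight in calculate_regression_projection: with regression_weight set to the stabilization point K, an observation is regressed exactly halfway to the prior once attempts reach K. A sketch using an illustrative K of 750 for three-pointers (within the typical range quoted above):

```python
K = 750            # illustrative stabilization point for 3P%
prior = 0.36       # prior (league-average 3P%)
observed = 0.42    # observed 3P%

estimates = {}
for n in (150, 750, 3000):
    # Weight on the observation grows with sample size: n / (n + K)
    w = n / (n + K)
    estimates[n] = w * observed + (1 - w) * prior
```

At n = 750 the estimate lands midway (0.39); at 150 attempts it barely budges from the prior, and at 3000 it is dominated by the observation.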
16.8 Points Above Expected
Points Above Expected (PAE) measures how many more or fewer points a player scored compared to what the shot quality model predicted. This metric separates shooting skill from shot selection.
Calculating Points Above Expected
def calculate_points_above_expected(player_shots_df, model, preprocessor):
"""
Calculate Points Above Expected for a player's shot attempts.
Parameters:
-----------
player_shots_df : DataFrame
Player's shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : PAE metrics
"""
df = player_shots_df.copy()
# Get expected make probability
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
# Calculate expected and actual points
df['shot_value'] = np.where(df['is_three'], 3, 2)
df['xpts'] = df['xfg'] * df['shot_value']
df['actual_pts'] = df['made'] * df['shot_value']
metrics = {
'attempts': len(df),
'actual_points': df['actual_pts'].sum(),
'expected_points': df['xpts'].sum(),
'points_above_expected': df['actual_pts'].sum() - df['xpts'].sum(),
'pae_per_shot': (df['actual_pts'].sum() - df['xpts'].sum()) / len(df),
'pae_per_100_shots': 100 * (df['actual_pts'].sum() - df['xpts'].sum()) / len(df)
}
return metrics
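As a sanity check on the units, consider a hypothetical 500-attempt season with 525 expected points (1.05 xPts per shot) and 560 points actually scored:

```python
attempts = 500
expected_points = 525.0   # sum of xPts over all attempts
actual_points = 560       # points actually scored on those attempts

pae = actual_points - expected_points    # raw Points Above Expected
pae_per_100 = 100 * pae / attempts       # rate-stat version for comparison
```

That is +35 PAE, or +7 points per 100 shots — the rate version is what makes players with different shot volumes comparable on a leaderboard.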
def calculate_pae_by_zone(player_shots_df, model, preprocessor):
"""
Break down Points Above Expected by court zone.
Parameters:
-----------
player_shots_df : DataFrame
Player's shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : PAE by zone
"""
df = player_shots_df.copy()
# Get expected make probability
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
# Calculate points
df['shot_value'] = np.where(df['is_three'], 3, 2)
df['xpts'] = df['xfg'] * df['shot_value']
df['actual_pts'] = df['made'] * df['shot_value']
# Assign zones
df['zone'] = df.apply(lambda x: assign_shot_zone(x['distance'], x['angle']), axis=1)
# Aggregate by zone
zone_pae = df.groupby('zone').agg(
attempts=('xpts', 'count'),
actual_pts=('actual_pts', 'sum'),
expected_pts=('xpts', 'sum'),
actual_fg=('made', 'mean'),
expected_fg=('xfg', 'mean')
)
zone_pae['pae'] = zone_pae['actual_pts'] - zone_pae['expected_pts']
zone_pae['pae_per_shot'] = zone_pae['pae'] / zone_pae['attempts']
return zone_pae.round(3)
def assign_shot_zone(distance, angle):
"""Helper function to assign shot zone."""
abs_angle = abs(angle) if angle is not None else 0
if distance < 4:
return 'Restricted Area'
elif distance < 14:
return 'Paint'
elif distance < 22:
return 'Mid-Range'
elif distance < 24 and abs_angle > 70:
return 'Corner Three'
elif distance < 28:
return 'Above Break Three'
else:
return 'Deep Three'
League-Wide PAE Analysis
def league_pae_leaderboard(all_shots_df, model, preprocessor, min_attempts=300):
"""
Create league-wide PAE leaderboard.
Parameters:
-----------
all_shots_df : DataFrame
All shots in dataset
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
min_attempts : int
Minimum attempts for inclusion
Returns:
--------
DataFrame : PAE leaderboard
"""
results = []
for player_id in all_shots_df['player_id'].unique():
player_shots = all_shots_df[all_shots_df['player_id'] == player_id]
if len(player_shots) < min_attempts:
continue
pae_metrics = calculate_points_above_expected(
player_shots, model, preprocessor
)
pae_metrics['player_id'] = player_id
pae_metrics['player_name'] = player_shots['player_name'].iloc[0]
results.append(pae_metrics)
leaderboard = pd.DataFrame(results)
leaderboard = leaderboard.sort_values('pae_per_100_shots', ascending=False)
return leaderboard[['player_name', 'attempts', 'actual_points',
'expected_points', 'points_above_expected',
'pae_per_100_shots']]
16.9 Applications: Player Evaluation
Shot quality models provide powerful tools for player evaluation, separating skill from circumstance and shot selection from conversion ability.
Comprehensive Shooter Evaluation
def evaluate_shooter(player_shots_df, league_shots_df, model, preprocessor):
"""
Comprehensive evaluation of a player as a shooter.
Parameters:
-----------
player_shots_df : DataFrame
Player's shot data
league_shots_df : DataFrame
League-wide shot data for comparison
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : Comprehensive shooter evaluation
"""
# Get player's expected and actual performance
X_player = player_shots_df[model.feature_names_in_]
X_player_processed = preprocessor.transform(X_player)
player_xfg = model.predict_proba(X_player_processed)[:, 1]
# Calculate league averages for context
X_league = league_shots_df[model.feature_names_in_]
X_league_processed = preprocessor.transform(X_league)
league_xfg = model.predict_proba(X_league_processed)[:, 1]
player_df = player_shots_df.copy()
player_df['xfg'] = player_xfg
evaluation = {
# Volume metrics
'shot_attempts': len(player_df),
'three_point_rate': player_df['is_three'].mean(),
'rim_attempt_rate': (player_df['distance'] < 4).mean(),
# Shot quality metrics
'avg_shot_quality': player_df['xfg'].mean(),
'league_avg_shot_quality': league_xfg.mean(),
'shot_quality_vs_league': player_df['xfg'].mean() - league_xfg.mean(),
# Efficiency metrics
'actual_fg_pct': player_df['made'].mean(),
'expected_fg_pct': player_df['xfg'].mean(),
'fg_vs_expected': player_df['made'].mean() - player_df['xfg'].mean(),
# Shot difficulty
'avg_defender_distance': player_df['defender_distance'].mean(),
'contested_rate': (player_df['defender_distance'] < 4).mean(),
'self_created_rate': 1 - player_df['assisted'].mean() if 'assisted' in player_df.columns else None,
# Efficiency by zone
'rim_fg_vs_expected': None,
'three_fg_vs_expected': None,
}
# Zone-specific analysis
rim_shots = player_df[player_df['distance'] < 4]
if len(rim_shots) > 30:
evaluation['rim_fg_vs_expected'] = rim_shots['made'].mean() - rim_shots['xfg'].mean()
three_shots = player_df[player_df['is_three'] == True]
if len(three_shots) > 50:
evaluation['three_fg_vs_expected'] = three_shots['made'].mean() - three_shots['xfg'].mean()
return evaluation
def compare_shooters(players_dict, league_shots_df, model, preprocessor):
"""
Compare multiple players as shooters.
Parameters:
-----------
players_dict : dict
{player_name: player_shots_df}
league_shots_df : DataFrame
League-wide shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Comparative shooter analysis
"""
evaluations = []
for player_name, player_df in players_dict.items():
eval_result = evaluate_shooter(player_df, league_shots_df, model, preprocessor)
eval_result['player_name'] = player_name
evaluations.append(eval_result)
comparison_df = pd.DataFrame(evaluations)
comparison_df = comparison_df.set_index('player_name')
return comparison_df
Player Shooting Profile Visualization
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Rectangle, Arc
def create_shooting_profile_visualization(player_shots_df, model, preprocessor,
player_name):
"""
Create visual shooting profile for a player.
Parameters:
-----------
player_shots_df : DataFrame
Player's shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
player_name : str
Player's name for title
Returns:
--------
matplotlib figure
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
# Calculate expected values
X = player_shots_df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
player_shots_df = player_shots_df.copy()
player_shots_df['xfg'] = model.predict_proba(X_processed)[:, 1]
# Plot 1: Shot chart with actual vs expected
ax1 = axes[0, 0]
draw_court(ax1)
made = player_shots_df[player_shots_df['made'] == True]
missed = player_shots_df[player_shots_df['made'] == False]
ax1.scatter(missed['x_coord'], missed['y_coord'], c='red',
alpha=0.3, s=30, label='Missed')
ax1.scatter(made['x_coord'], made['y_coord'], c='green',
alpha=0.5, s=30, label='Made')
ax1.legend()
ax1.set_title(f'{player_name} - Shot Chart')
ax1.set_xlim(-25, 25)
ax1.set_ylim(-5, 35)
# Plot 2: FG% vs xFG% by distance
ax2 = axes[0, 1]
player_shots_df['dist_bin'] = pd.cut(player_shots_df['distance'],
bins=[0, 4, 10, 16, 22, 30])
by_distance = player_shots_df.groupby('dist_bin').agg(
actual=('made', 'mean'),
expected=('xfg', 'mean')
).reset_index()
x = range(len(by_distance))
width = 0.35
ax2.bar([i - width/2 for i in x], by_distance['actual'],
width, label='Actual FG%', color='green', alpha=0.7)
ax2.bar([i + width/2 for i in x], by_distance['expected'],
width, label='Expected FG%', color='blue', alpha=0.7)
ax2.set_xticks(x)
ax2.set_xticklabels(['Rim', 'Short', 'Mid-Short', 'Mid-Long', 'Three'])
ax2.legend()
ax2.set_title('Actual vs Expected FG% by Distance')
ax2.set_ylabel('FG%')
# Plot 3: Shot difficulty distribution
ax3 = axes[1, 0]
ax3.hist(player_shots_df['xfg'], bins=20, color='blue',
alpha=0.7, edgecolor='black')
ax3.axvline(player_shots_df['xfg'].mean(), color='red',
linestyle='--', label=f"Mean: {player_shots_df['xfg'].mean():.3f}")
ax3.legend()
ax3.set_xlabel('Expected FG%')
ax3.set_ylabel('Frequency')
ax3.set_title('Shot Difficulty Distribution')
# Plot 4: Performance by contest level
ax4 = axes[1, 1]
contest_bins = [0, 2, 4, 6, 10]
contest_labels = ['Very Tight', 'Tight', 'Open', 'Wide Open']
player_shots_df['contest'] = pd.cut(player_shots_df['defender_distance'],
bins=contest_bins, labels=contest_labels)
by_contest = player_shots_df.groupby('contest').agg(
actual=('made', 'mean'),
expected=('xfg', 'mean'),
count=('made', 'count')
).reset_index()
x = range(len(by_contest))
ax4.bar([i - width/2 for i in x], by_contest['actual'],
width, label='Actual', color='green', alpha=0.7)
ax4.bar([i + width/2 for i in x], by_contest['expected'],
width, label='Expected', color='blue', alpha=0.7)
ax4.set_xticks(x)
ax4.set_xticklabels(by_contest['contest'])
ax4.legend()
ax4.set_title('Performance by Defender Proximity')
ax4.set_ylabel('FG%')
plt.tight_layout()
return fig
def draw_court(ax, color='black', lw=2):
"""Draw basketball court on matplotlib axis."""
# Hoop
hoop = Circle((0, 0), radius=0.75, linewidth=lw, color=color, fill=False)
ax.add_patch(hoop)
# Backboard
ax.plot([-3, 3], [-0.75, -0.75], color=color, lw=lw)
# Paint
outer_box = Rectangle((-8, -0.75), 16, 19, linewidth=lw,
color=color, fill=False)
ax.add_patch(outer_box)
# Free throw circle
free_throw = Arc((0, 14.25), 12, 12, theta1=0, theta2=180,
linewidth=lw, color=color)
ax.add_patch(free_throw)
# Three-point line
ax.plot([-22, -22], [-0.75, 9], color=color, lw=lw)
ax.plot([22, 22], [-0.75, 9], color=color, lw=lw)
three_arc = Arc((0, 0), 47.5, 47.5, theta1=22, theta2=158,
linewidth=lw, color=color)
ax.add_patch(three_arc)
# Restricted area
restricted = Arc((0, 0), 8, 8, theta1=0, theta2=180,
linewidth=lw, color=color)
ax.add_patch(restricted)
ax.set_aspect('equal')
16.10 Applications: Coaching Decisions
Shot quality models inform coaching decisions on shot selection, lineup optimization, and game strategy.
Optimal Shot Selection Analysis
def analyze_shot_selection(team_shots_df, model, preprocessor):
"""
Analyze whether a team is taking optimal shots.
Parameters:
-----------
team_shots_df : DataFrame
Team's shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : Shot selection analysis
"""
df = team_shots_df.copy()
# Calculate expected points
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
df['xpts'] = df['xfg'] * np.where(df['is_three'], 3, 2)
# Analyze by shot type
shot_type_analysis = df.groupby('zone').agg(
attempts=('xpts', 'count'),
avg_xpts=('xpts', 'mean'),
total_xpts=('xpts', 'sum'),
actual_pts=('points_scored', 'sum')
).sort_values('avg_xpts', ascending=False)
# Calculate what optimal shot selection would look like
# (More shots from high xPts zones, fewer from low)
total_shots = len(df)
recommendations = {
'current_avg_xpts': df['xpts'].mean(),
'shot_type_breakdown': shot_type_analysis.to_dict(),
'rim_rate': (df['distance'] < 4).mean(),
'three_rate': df['is_three'].mean(),
'mid_range_rate': ((df['distance'] >= 4) & (~df['is_three'])).mean(),
}
# Simple optimization: what if 10% of mid-range became rim attempts?
mid_range_shots = df[(df['distance'] >= 10) & (df['distance'] < 22)]
rim_shots = df[df['distance'] < 4]
if len(mid_range_shots) > 0 and len(rim_shots) > 0:
mid_range_xpts = mid_range_shots['xpts'].mean()
rim_xpts = rim_shots['xpts'].mean()
# If we converted 10% of mid-range to rim attempts
shots_to_convert = len(mid_range_shots) * 0.1
xpts_lost = shots_to_convert * mid_range_xpts
xpts_gained = shots_to_convert * rim_xpts
recommendations['rim_vs_midrange_opportunity'] = xpts_gained - xpts_lost
return recommendations
def evaluate_play_type_efficiency(team_shots_df, model, preprocessor):
"""
Evaluate efficiency of different play types.
Parameters:
-----------
team_shots_df : DataFrame
Team's shot data with play type labels
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Play type efficiency analysis
"""
df = team_shots_df.copy()
# Calculate expected values
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
df['xpts'] = df['xfg'] * np.where(df['is_three'], 3, 2)
df['actual_pts'] = df['made'] * np.where(df['is_three'], 3, 2)
play_type_analysis = df.groupby('play_type').agg(
possessions=('xpts', 'count'),
avg_xpts=('xpts', 'mean'),
actual_ppp=('actual_pts', 'mean'),
fg_pct=('made', 'mean'),
xfg_pct=('xfg', 'mean'),
three_rate=('is_three', 'mean')
).round(3)
play_type_analysis['efficiency_vs_expected'] = (
play_type_analysis['actual_ppp'] - play_type_analysis['avg_xpts']
)
return play_type_analysis.sort_values('actual_ppp', ascending=False)
Lineup Shot Quality Analysis
def analyze_lineup_shot_quality(lineups_shots_df, model, preprocessor):
"""
Analyze shot quality generated by different lineups.
Parameters:
-----------
lineups_shots_df : DataFrame
Shot data with lineup identifiers
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Lineup shot quality analysis
"""
df = lineups_shots_df.copy()
# Calculate expected values
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
df['xpts'] = df['xfg'] * np.where(df['is_three'], 3, 2)
df['actual_pts'] = df['made'] * np.where(df['is_three'], 3, 2)
lineup_analysis = df.groupby('lineup_id').agg(
minutes=('possession_time', 'sum'),
shots=('xpts', 'count'),
avg_xpts_created=('xpts', 'mean'),
actual_ppp=('actual_pts', 'mean'),
rim_rate=('at_rim', 'mean'),
three_rate=('is_three', 'mean'),
avg_shot_quality=('xfg', 'mean')
)
# Filter for minimum minutes
lineup_analysis = lineup_analysis[lineup_analysis['minutes'] >= 50]
return lineup_analysis.sort_values('avg_xpts_created', ascending=False)
End-of-Game Decision Support
def late_game_shot_analysis(game_shots_df, model, preprocessor,
seconds_remaining=24, score_margin_range=(-3, 3)):
"""
Analyze shot quality in late-game situations.
Parameters:
-----------
game_shots_df : DataFrame
Shot data with game context
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
seconds_remaining : int
Define "late game" threshold
score_margin_range : tuple
Score differential range for close games
Returns:
--------
dict : Late game shot analysis
"""
df = game_shots_df.copy()
# Filter to late-game situations
late_game = df[
(df['game_clock'] <= seconds_remaining) &
(df['period'] >= 4) &
(df['score_margin'] >= score_margin_range[0]) &
(df['score_margin'] <= score_margin_range[1])
].copy()  # copy the slice so the column assignments below are safe
if len(late_game) < 50:
return {'error': 'Insufficient late-game shots for analysis'}
# Calculate expected values
X = late_game[model.feature_names_in_]
X_processed = preprocessor.transform(X)
late_game['xfg'] = model.predict_proba(X_processed)[:, 1]
late_game['xpts'] = late_game['xfg'] * np.where(late_game['is_three'], 3, 2)
analysis = {
'total_late_game_shots': len(late_game),
'avg_xpts': late_game['xpts'].mean(),
'actual_fg_pct': late_game['made'].mean(),
'expected_fg_pct': late_game['xfg'].mean(),
'three_rate': late_game['is_three'].mean(),
'rim_rate': (late_game['distance'] < 4).mean(),
'avg_defender_distance': late_game['defender_distance'].mean(),
}
# Who takes late-game shots
shooter_analysis = late_game.groupby('player_name').agg(
attempts=('xpts', 'count'),
avg_xpts=('xpts', 'mean'),
fg_pct=('made', 'mean'),
xfg_pct=('xfg', 'mean')
).sort_values('attempts', ascending=False)
analysis['primary_shooters'] = shooter_analysis.head(5).to_dict()
return analysis
16.11 Model Calibration and Validation
A well-calibrated shot quality model should produce probability estimates that match observed frequencies. If the model predicts 40% for a group of shots, approximately 40% should go in.
Calibration Analysis
from sklearn.calibration import calibration_curve
def analyze_model_calibration(y_true, y_prob, n_bins=10):
"""
Analyze calibration of shot quality model.
Parameters:
-----------
y_true : array
Actual outcomes (0/1)
y_prob : array
Predicted probabilities
n_bins : int
Number of bins for calibration curve
Returns:
--------
dict : Calibration metrics and curve data
"""
# Calculate calibration curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
# Calculate calibration metrics
calibration_error = np.abs(prob_true - prob_pred).mean()
max_calibration_error = np.abs(prob_true - prob_pred).max()
# Brier skill score (compared to predicting league average)
league_avg = y_true.mean()
brier_baseline = np.mean((y_true - league_avg)**2)
brier_model = np.mean((y_true - y_prob)**2)
brier_skill_score = 1 - (brier_model / brier_baseline)
results = {
'mean_calibration_error': calibration_error,
'max_calibration_error': max_calibration_error,
'brier_score': brier_model,
'brier_skill_score': brier_skill_score,
'calibration_curve': {
'predicted': prob_pred.tolist(),
'actual': prob_true.tolist()
}
}
return results
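The mean calibration error above can be reproduced without scikit-learn, which is a useful cross-check. A numpy-only sketch on synthetic, perfectly calibrated predictions (outcomes are drawn from the predicted probabilities themselves, so the measured error should be near zero):

```python
import numpy as np

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.2, 0.8, size=50_000)        # synthetic predictions
y_true = rng.uniform(size=y_prob.size) < y_prob    # calibrated outcomes

# Bin predictions, then compare mean prediction to observed rate per bin
bins = np.linspace(0.2, 0.8, 11)
idx = np.digitize(y_prob, bins) - 1

errors = []
for b in range(10):
    mask = idx == b
    if mask.any():
        errors.append(abs(y_true[mask].mean() - y_prob[mask].mean()))

mean_calibration_error = float(np.mean(errors))
```

Replacing y_true with a miscalibrated rule (e.g., outcomes drawn at y_prob + 0.05) makes the error jump accordingly, which is a handy smoke test for the calibration pipeline itself.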
def plot_calibration_curve(y_true, y_prob, model_name='Shot Quality Model'):
"""
Plot calibration curve for shot quality model.
Parameters:
-----------
y_true : array
Actual outcomes
y_prob : array
Predicted probabilities
model_name : str
Name for plot title
Returns:
--------
matplotlib figure
"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Calibration curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
ax1.plot([0, 1], [0, 1], 'k--', label='Perfectly Calibrated')
ax1.plot(prob_pred, prob_true, 's-', label=model_name)
ax1.set_xlabel('Mean Predicted Probability')
ax1.set_ylabel('Fraction of Positives')
ax1.set_title('Calibration Curve')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Histogram of predictions
ax2.hist(y_prob, bins=50, color='blue', alpha=0.7, edgecolor='black')
ax2.set_xlabel('Predicted Probability')
ax2.set_ylabel('Count')
ax2.set_title('Distribution of Predictions')
ax2.axvline(y_true.mean(), color='red', linestyle='--',
label=f'League Avg: {y_true.mean():.3f}')
ax2.legend()
plt.tight_layout()
return fig
Cross-Validation for Shot Models
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.pipeline import Pipeline
def cross_validate_shot_model(X, y, model, preprocessor, n_splits=5):
"""
Perform cross-validation for shot quality model.
Parameters:
-----------
X : DataFrame
Feature matrix
y : Series
Target variable
model : estimator
Model to validate
preprocessor : transformer
Preprocessing pipeline
n_splits : int
Number of CV folds
Returns:
--------
dict : Cross-validation results
"""
# Create full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', model)
])
# Stratified K-Fold
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
# Get cross-validated predictions
y_prob_cv = cross_val_predict(pipeline, X, y, cv=cv, method='predict_proba')[:, 1]
# Calculate metrics for each fold
fold_metrics = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
X_test_fold = X.iloc[test_idx]
y_test_fold = y.iloc[test_idx]
y_prob_fold = y_prob_cv[test_idx]
fold_metrics.append({
'fold': fold + 1,
'roc_auc': roc_auc_score(y_test_fold, y_prob_fold),
'brier_score': brier_score_loss(y_test_fold, y_prob_fold),
'log_loss': log_loss(y_test_fold, y_prob_fold)
})
fold_df = pd.DataFrame(fold_metrics)
results = {
'fold_results': fold_df,
'mean_roc_auc': fold_df['roc_auc'].mean(),
'std_roc_auc': fold_df['roc_auc'].std(),
'mean_brier': fold_df['brier_score'].mean(),
'mean_log_loss': fold_df['log_loss'].mean(),
'cv_predictions': y_prob_cv
}
return results
16.12 Practical Considerations and Limitations
Data Quality Issues
def validate_shot_data(shots_df):
"""
Validate shot data quality before modeling.
Parameters:
-----------
shots_df : DataFrame
Raw shot data
Returns:
--------
dict : Data quality report
"""
report = {
'total_records': len(shots_df),
'missing_values': {},
'outliers': {},
'consistency_checks': {}
}
# Check missing values
for col in shots_df.columns:
missing_pct = shots_df[col].isna().mean() * 100
if missing_pct > 0:
report['missing_values'][col] = f"{missing_pct:.2f}%"
# Check for outliers
if 'distance' in shots_df.columns:
extreme_distance = (shots_df['distance'] > 50).sum()
report['outliers']['extreme_distance'] = extreme_distance
if 'shot_clock' in shots_df.columns:
invalid_clock = (
(shots_df['shot_clock'] < 0) |
(shots_df['shot_clock'] > 24)
).sum()
report['outliers']['invalid_shot_clock'] = invalid_clock
# Consistency checks
if 'is_three' in shots_df.columns and 'distance' in shots_df.columns:
# Three-pointers should generally be >= 22 feet
inconsistent_threes = (
(shots_df['is_three'] == True) &
(shots_df['distance'] < 20)
).sum()
report['consistency_checks']['short_three_pointers'] = inconsistent_threes
if 'made' in shots_df.columns:
invalid_made = (~shots_df['made'].isin([0, 1, True, False])).sum()
report['consistency_checks']['invalid_made_values'] = invalid_made
return report
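A quick illustration of two of these checks on a toy table (the full validate_shot_data above applies the same logic across every column):

```python
import pandas as pd

# Toy shot table with deliberately bad rows
shots = pd.DataFrame({
    'shot_clock': [18.0, 3.5, -1.0, 30.0],   # two impossible clock values
    'distance': [2.0, 25.0, 15.0, 61.0],     # one implausible distance
})

# Shot clock must lie in [0, 24]
invalid_clock = ((shots['shot_clock'] < 0) | (shots['shot_clock'] > 24)).sum()

# Distances beyond 50 feet are almost certainly tracking errors
extreme_distance = (shots['distance'] > 50).sum()
```

Running these checks before training matters because tracking glitches are not random: they cluster by arena and camera setup, and can bias a model that treats them as real shots.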
Model Limitations
Shot quality models, despite their utility, have important limitations:
- Unobserved factors: No model captures everything that affects shot success:
  - Shooter's physical state (minor injuries, fatigue)
  - Defensive scheme and help-defense positioning
  - Environmental factors (arena, crowd noise)
  - Psychological factors (pressure, confidence)
- Selection bias: Players choose when to shoot:
  - Good shooters may attempt more difficult shots
  - Role players often get cleaner looks
- Tracking data limitations:
  - Defender distance is measured at release, not throughout the possession
  - Shot type classification may be imperfect
  - Player identification is occasionally incorrect
- Temporal changes:
  - Players improve or decline
  - League-wide shooting evolves over time
  - Rule changes affect shot selection
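The selection-bias point can be probed empirically. The sketch below correlates each player's actual field-goal percentage with the average predicted make probability of the shots they attempt; across many players, a negative correlation would suggest that better shooters take harder shots. The column names (`player_id`, `made`, `xfg` for a fitted model's predicted make probability) and the numbers are illustrative, not from the chapter's dataset.

```python
import pandas as pd

# Hypothetical per-shot data for two players
shots = pd.DataFrame({
    'player_id': ['A'] * 4 + ['B'] * 4,
    'made':      [1, 1, 0, 1,  0, 1, 0, 0],
    'xfg':       [0.38, 0.42, 0.35, 0.40,  0.52, 0.55, 0.50, 0.58],
})

per_player = shots.groupby('player_id').agg(
    actual_fg=('made', 'mean'),   # realized shooting percentage
    avg_xfg=('xfg', 'mean'),      # average shot difficulty faced
)
corr = per_player['actual_fg'].corr(per_player['avg_xfg'])

print(per_player)
# With only two players the correlation is degenerate; a real
# diagnostic needs hundreds of player-seasons.
print(corr)
```

If the diagnostic shows a strong relationship, player comparisons based on raw predicted probabilities should be interpreted with the selection effect in mind.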
def document_model_limitations(model_name, training_data_description):
    """
    Create documentation of model limitations.

    Parameters:
    -----------
    model_name : str
        Name of the model
    training_data_description : str
        Description of training data

    Returns:
    --------
    str : Formatted limitations documentation
    """
    limitations = f"""
MODEL LIMITATIONS: {model_name}
{'=' * 50}

Training Data: {training_data_description}

KNOWN LIMITATIONS:

1. UNOBSERVED FACTORS
   - Physical condition of shooter not captured
   - Defensive scheme and rotations not fully modeled
   - Game importance/pressure not quantified
   - Shooter confidence/momentum not measured

2. DATA QUALITY
   - Defender distance measured at release only
   - Some shot types may be misclassified
   - Tracking data has occasional errors (~1-2%)

3. SELECTION EFFECTS
   - Model assumes shots are representative
   - Better shooters may attempt harder shots
   - Does not capture counterfactual (shots not taken)

4. TEMPORAL VALIDITY
   - Model trained on specific time period
   - League-wide trends may shift
   - Individual player ability changes over time

5. CONTEXT LIMITATIONS
   - Does not model game state effects beyond basic features
   - Playoff vs regular season differences not captured
   - Back-to-back and travel effects not included

RECOMMENDED USE:
- Use for large sample analysis (100+ shots)
- Combine with other evaluation methods
- Regularly retrain on new data
- Interpret with appropriate uncertainty
"""
    return limitations
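The temporal-validity limitation suggests a concrete monitoring routine: track the gap between actual and predicted shooting percentage by season, and retrain when the gap grows. A minimal sketch with made-up numbers, assuming a per-shot prediction column `xfg` (a hypothetical name for a previously fitted model's predicted make probability):

```python
import pandas as pd

# Illustrative per-shot data spanning three seasons
shots = pd.DataFrame({
    'season': [2021] * 3 + [2022] * 3 + [2023] * 3,
    'made':   [1, 0, 1,  1, 0, 0,  1, 1, 1],
    'xfg':    [0.48, 0.45, 0.50,  0.47, 0.46, 0.49,  0.46, 0.48, 0.47],
})

drift = shots.groupby('season').agg(
    actual=('made', 'mean'),     # observed FG% that season
    predicted=('xfg', 'mean'),   # model's mean prediction
)
# A gap that grows season over season signals a stale model
drift['gap'] = drift['actual'] - drift['predicted']
print(drift)
```

In practice this check would run on full seasons of shots, and a persistent league-wide gap of even a percentage point or two is worth a retrain.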
Summary
Shot quality models represent a powerful tool for understanding basketball at a granular level. By predicting the probability of any shot going in based on contextual factors, we can:
- Evaluate shot selection: Distinguish good decisions from bad ones, independent of outcome
- Assess true shooting skill: Separate skill from luck using regression to expected performance
- Compare players fairly: Account for the difficulty of shots each player attempts
- Inform coaching decisions: Optimize shot selection and lineup construction
- Project future performance: Identify players likely to improve or decline
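The first two bullets reduce to a single comparison: points actually scored versus points expected given shot difficulty. A minimal sketch with illustrative numbers, where `xfg` again stands for a model's predicted make probability:

```python
import pandas as pd

# Four hypothetical shots: two threes, two twos
shots = pd.DataFrame({
    'made':       [1, 0, 1, 0],
    'shot_value': [3, 3, 2, 2],
    'xfg':        [0.36, 0.34, 0.55, 0.61],
})

# Expected points for a shot = make probability x shot value
shots['actual_pts']   = shots['made'] * shots['shot_value']
shots['expected_pts'] = shots['xfg'] * shots['shot_value']

# Points above expected: positive means outperforming shot quality
pae = (shots['actual_pts'] - shots['expected_pts']).sum()
print(round(pae, 2))  # 5 actual pts vs 4.42 expected -> 0.58
```

Over a handful of shots this number is mostly noise; it becomes meaningful only over the large samples recommended above.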
The key components of effective shot quality modeling include:
- Comprehensive features: Distance, defender proximity, shot clock, touch time, shot type
- Appropriate algorithms: Logistic regression for interpretability, gradient boosting for accuracy
- Careful calibration: Ensuring predicted probabilities match observed frequencies
- Proper validation: Cross-validation and out-of-sample testing
- Honest limitations: Acknowledging what the model cannot capture
As tracking data continues to improve in quality and coverage, shot quality models will become even more precise. The fundamental framework, however, remains the same: quantify the difficulty of each shot, predict the expected outcome, and measure performance against that expectation.
Chapter References
- Cervone, D., D'Amour, A., Bornn, L., & Goldsberry, K. (2016). A multiresolution stochastic process model for predicting basketball possession outcomes. Journal of the American Statistical Association, 111(514), 585-599.
- Goldsberry, K. (2019). Sprawlball: A visual tour of the new era of the NBA. Houghton Mifflin Harcourt.
- Franks, A., Miller, A., Bornn, L., & Goldsberry, K. (2015). Characterizing the spatial structure of defensive skill in professional basketball. The Annals of Applied Statistics, 9(1), 94-121.
- Skinner, B. (2010). The price of anarchy in basketball. Journal of Quantitative Analysis in Sports, 6(1).
- Chang, Y. H., Maheswaran, R., Su, J., Kwok, S., Levy, T., Wexler, A., & Squire, K. (2014). Quantifying shot quality in the NBA. In Proceedings of the MIT Sloan Sports Analytics Conference.