Shot Quality Models

Beginner 10 min read Nov 27, 2025

Overview

Shot quality models predict the probability of a shot being successful based on contextual factors at the time of the attempt. These models generate expected field goal percentage (xFG%) for each shot, allowing teams to evaluate shooting efficiency independent of outcome variance and assess decision-making quality.

Unlike traditional shooting percentages that only measure results, shot quality models account for shot difficulty, enabling more accurate player evaluation and strategic decision-making.

What Shot Quality Models Predict

Expected Field Goal Percentage (xFG%)

The primary output of shot quality models is the probability that a given shot will be successful based on its characteristics:

  • Individual Shot Probability: P(make) for each attempt given context
  • Aggregate xFG%: Average expected success rate across multiple shots
  • Shot Quality Delta: Difference between actual FG% and xFG%, indicating over/under-performance
  • Shot Selection Quality: Average xFG% of shots taken, independent of makes/misses
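As a minimal sketch of how these outputs relate, the snippet below aggregates per-shot outcomes and model probabilities into FG%, xFG%, and the delta between them. The function name `summarize_shot_quality` and the five example probabilities are hypothetical and chosen only for illustration.

```python
import numpy as np

def summarize_shot_quality(made, p_make):
    """Aggregate per-shot outcomes and model probabilities into summary metrics."""
    made = np.asarray(made, dtype=float)
    p_make = np.asarray(p_make, dtype=float)
    fg_pct = made.mean()     # actual FG%
    xfg_pct = p_make.mean()  # aggregate xFG% (shot selection quality)
    return {
        "attempts": int(made.size),
        "fg_pct": fg_pct,
        "xfg_pct": xfg_pct,
        "delta": fg_pct - xfg_pct,  # shot quality delta (actual minus expected)
    }

# Hypothetical example: five shots with made-up model probabilities
made = [1, 0, 1, 1, 0]
p_make = [0.62, 0.35, 0.40, 0.55, 0.38]
summary = summarize_shot_quality(made, p_make)
print(summary)
```

Here the shooter made 60% of shots that the model rated at 46% on average, so the delta is +0.14. With only five attempts, that gap says very little about skill.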

Applications

  • Shooting Efficiency: Identify players who convert at higher rates than expected (shooting talent)
  • Shot Selection: Evaluate decision-making by comparing shot quality taken vs. available
  • Defensive Impact: Measure how defenders affect opponent shot quality
  • Play Design: Assess which offensive actions generate highest-quality shots
  • Variance Reduction: Stabilize small-sample assessments by accounting for shot difficulty

Key Features for Shot Quality Models

Spatial Features

  • Shot Distance: Distance from basket (typically the most predictive single feature)
  • Shot Angle: Horizontal angle relative to basket (corner vs. wing vs. top)
  • X/Y Coordinates: Exact court location for spatial modeling
  • Zone Classification: Restricted area, paint, mid-range, three-point zones
  • Corner Three Indicator: Corner threes have higher success rates than above-the-break threes, in part because the corner line is closer to the basket (22 ft vs. 23.75 ft)

Defensive Pressure Features

  • Closest Defender Distance: Distance to nearest defender at shot release
  • Defender Contest Quality: Whether defender's hand was near ball or disrupted shot
  • Defender Height Differential: Size mismatch between shooter and defender
  • Help Defender Proximity: Distance to second-closest defender
  • Defensive Attention: Number of defenders within specified radius
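When raw tracking coordinates are available, the pressure features listed above can be derived from player positions at release. The sketch below assumes (x, y) positions in feet and a 6-foot radius for "defensive attention". The function name, the radius, and the example positions are illustrative assumptions, not part of any tracking vendor's API.

```python
import numpy as np

def defensive_pressure_features(shooter_xy, defenders_xy, radius=6.0):
    """Compute pressure features for one shot from (x, y) positions in feet."""
    shooter = np.asarray(shooter_xy, dtype=float)
    defenders = np.asarray(defenders_xy, dtype=float).reshape(-1, 2)
    # Euclidean distance from the shooter to each defender, nearest first
    dists = np.sort(np.linalg.norm(defenders - shooter, axis=1))
    if dists.size == 0:
        return {
            "closest_defender_dist": float("nan"),
            "help_defender_dist": float("nan"),
            "defenders_within_radius": 0,
        }
    return {
        "closest_defender_dist": float(dists[0]),
        "help_defender_dist": float(dists[1]) if dists.size > 1 else float("nan"),
        "defenders_within_radius": int((dists <= radius).sum()),
    }

# Hypothetical positions: one shooter and five defenders
shooter = (10.0, 5.0)
defenders = [(13.0, 9.0), (10.0, 15.0), (30.0, 5.0), (10.0, 3.0), (4.0, 5.0)]
features = defensive_pressure_features(shooter, defenders)
print(features)
```

In this example the nearest defender is 2 feet away, the help defender is 5 feet away, and three defenders are within the 6-foot radius.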

Shot Type and Context

  • Shot Type: Jump shot, layup, dunk, hook shot, tip-in
  • Dribbles Before Shot: Catch-and-shoot (0 dribbles) vs. off-the-dribble
  • Touch Time: Seconds ball was held before shot
  • Shot Clock: Time remaining when shot was taken
  • Game Clock: Time in quarter/game (clutch situations)
  • Score Differential: Point margin when shot was attempted

Player and Team Context

  • Shooter Identity: Player-specific effects (some models exclude, some include)
  • Shooter Height: Taller players convert interior shots more efficiently
  • Home/Away: Home court advantage effect
  • Lineup Configuration: Floor spacing based on teammates' locations

Building Shot Quality Models in Python

Data Preparation

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score, brier_score_loss
import matplotlib.pyplot as plt
import seaborn as sns

# Load shot data
shots = pd.read_csv('nba_shots_2023_24.csv')

# Feature engineering
def engineer_features(df):
    """Create features for shot quality model"""

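    # Distance from basket
    # (assumes loc_x/loc_y are in feet with the basket at the origin,
    #  loc_x running along the court's length and loc_y across its width)
    df['shot_distance'] = np.sqrt(df['loc_x']**2 + df['loc_y']**2)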

    # Shot angle (radians)
    df['shot_angle'] = np.arctan2(df['loc_y'], df['loc_x'])

    # Corner three indicator
    df['is_corner_three'] = ((df['shot_type'] == '3PT') &
                              (np.abs(df['loc_y']) > 22) &
                              (df['loc_x'] < 7.8))

    # Above the break three
    df['is_atb_three'] = ((df['shot_type'] == '3PT') &
                           (~df['is_corner_three']))

    # Restricted area (within 4 feet)
    df['is_restricted'] = df['shot_distance'] < 4

    # Paint (non-restricted area within 8 feet)
    df['is_paint'] = (df['shot_distance'] >= 4) & (df['shot_distance'] < 8)

    # Mid-range
    df['is_midrange'] = (df['shot_distance'] >= 8) & (df['shot_type'] == '2PT')

    # Defender pressure categories
    df['open'] = df['defender_distance'] > 6
    df['wide_open'] = df['defender_distance'] > 8
    df['tight'] = df['defender_distance'] < 2
    df['very_tight'] = df['defender_distance'] < 1

    # Catch and shoot indicator
    df['catch_and_shoot'] = df['dribbles'] == 0

    # Shot clock pressure
    df['shot_clock_late'] = df['shot_clock'] < 7
    df['shot_clock_very_late'] = df['shot_clock'] < 4

    return df

# Engineer features
shots = engineer_features(shots)

# Define feature columns
feature_cols = [
    'shot_distance', 'shot_angle', 'defender_distance',
    'is_corner_three', 'is_atb_three', 'is_restricted',
    'is_paint', 'is_midrange',
    'dribbles', 'touch_time', 'shot_clock',
    'open', 'wide_open', 'tight', 'very_tight',
    'catch_and_shoot', 'shooter_height'
]

# Prepare data
X = shots[feature_cols].copy()
y = shots['shot_made'].values  # 1 = made, 0 = missed

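# Train-test split (chronological; assumes rows are already sorted by game date)
train_size = int(0.8 * len(X))
X_train, X_test = X.iloc[:train_size].copy(), X.iloc[train_size:].copy()
y_train, y_test = y[:train_size], y[train_size:]

# Handle missing values with training-set medians only, so that
# no information from the test period leaks into training
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)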

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Overall FG%: {y.mean():.3f}")

Baseline: Logistic Regression Model

# Logistic regression baseline

# Scale features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
# (no class_weight='balanced': reweighting classes shifts the predicted
#  probabilities away from observed make rates, which defeats the purpose of xFG%)
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42
)
lr_model.fit(X_train_scaled, y_train)

# Predictions
y_pred_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("Logistic Regression Performance:")
print(f"  Log Loss: {log_loss(y_test, y_pred_lr):.4f}")
print(f"  ROC AUC: {roc_auc_score(y_test, y_pred_lr):.4f}")
print(f"  Brier Score: {brier_score_loss(y_test, y_pred_lr):.4f}")

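# Feature importance: rank by absolute standardized coefficient,
# since large negative coefficients matter as much as large positive ones
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'coefficient': lr_model.coef_[0]
})
feature_importance['abs_coefficient'] = feature_importance['coefficient'].abs()
feature_importance = feature_importance.sort_values('abs_coefficient', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))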
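Log loss and Brier score are easier to read next to a naive reference. The sketch below scores a "model" that predicts the training-set make rate for every shot. Any real shot quality model should beat these numbers. The helper name `constant_baseline_scores` and the tiny demo arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

def constant_baseline_scores(y_train, y_test):
    """Score a predictor that outputs the training make rate for every shot."""
    base_rate = float(np.mean(y_train))
    preds = np.full(len(y_test), base_rate)
    return {
        "base_rate": base_rate,
        "log_loss": log_loss(y_test, preds, labels=[0, 1]),
        "brier": brier_score_loss(y_test, preds),
    }

# Hypothetical outcomes: 40% make rate in training, 2 of 5 made in test
y_train_demo = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_test_demo = np.array([1, 0, 0, 1, 0])
baseline = constant_baseline_scores(y_train_demo, y_test_demo)
print(baseline)
```

Shot outcomes are noisy, so even a good model usually improves on this baseline by a modest margin. Comparing against it keeps the reported metrics in perspective.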

Gradient Boosting Model (Recommended)

# Gradient Boosting Classifier (captures non-linear effects and feature interactions)
gb_model = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=5,
    min_samples_split=200,
    min_samples_leaf=50,
    subsample=0.8,
    max_features='sqrt',
    random_state=42,
    verbose=1
)

# Train model
gb_model.fit(X_train, y_train)

# Predictions (xFG%)
y_pred_gb = gb_model.predict_proba(X_test)[:, 1]

# Evaluate
print("\nGradient Boosting Performance:")
print(f"  Log Loss: {log_loss(y_test, y_pred_gb):.4f}")
print(f"  ROC AUC: {roc_auc_score(y_test, y_pred_gb):.4f}")
print(f"  Brier Score: {brier_score_loss(y_test, y_pred_gb):.4f}")

# Feature importance
gb_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': gb_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importance (Gradient Boosting):")
print(gb_importance)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(gb_importance['feature'][:15], gb_importance['importance'][:15])
plt.xlabel('Importance')
plt.title('Top 15 Features for Shot Quality Prediction')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
plt.show()

Generate xFG% and Shot Quality Metrics

# Add xFG% to original dataframe
shots_test = shots[train_size:].copy()
shots_test['xFG'] = y_pred_gb
shots_test['actual_FG'] = y_test

# Calculate shot quality delta (actual - expected)
shots_test['FG_delta'] = shots_test['actual_FG'] - shots_test['xFG']

# Player-level aggregation
player_stats = shots_test.groupby('player_name').agg({
    'shot_made': ['sum', 'count', 'mean'],  # Makes, attempts, FG%
    'xFG': 'mean',  # Average shot quality (xFG%)
    'FG_delta': 'mean',  # Shooting talent (FG% - xFG%)
    'shot_distance': 'mean'
}).round(3)

player_stats.columns = ['makes', 'attempts', 'FG%', 'xFG%', 'FG%_over_xFG', 'avg_distance']

# Filter to players with 50+ attempts
player_stats = player_stats[player_stats['attempts'] >= 50].copy()

# Sort by shooting talent
player_stats_sorted = player_stats.sort_values('FG%_over_xFG', ascending=False)

print("\nTop 10 Players - Best Shooting Talent (FG% over xFG%):")
print(player_stats_sorted.head(10))

print("\nTop 10 Players - Best Shot Selection (highest xFG%):")
print(player_stats.sort_values('xFG%', ascending=False).head(10))

# Visualization: Actual vs Expected FG%
plt.figure(figsize=(10, 8))
plt.scatter(player_stats['xFG%'], player_stats['FG%'],
            alpha=0.6, s=player_stats['attempts'])
plt.plot([0.3, 0.7], [0.3, 0.7], 'r--', label='Expected (diagonal)')
plt.xlabel('Expected FG% (xFG%)')
plt.ylabel('Actual FG%')
plt.title('Player Shooting: Actual vs Expected FG%\n(size = attempts)')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('actual_vs_expected_fg.png', dpi=300)
plt.show()

Shot Quality Models in R

Logistic Regression Approach

library(tidyverse)
library(caret)
library(pROC)
library(ggplot2)

# Load data
shots <- read_csv("nba_shots_2023_24.csv")

# Feature engineering
shots <- shots %>%
  mutate(
    # Spatial features
    shot_distance = sqrt(loc_x^2 + loc_y^2),
    shot_angle = atan2(loc_y, loc_x),

    # Shot zones
    is_corner_three = (shot_type == "3PT" & abs(loc_y) > 22 & loc_x < 7.8),
    is_atb_three = (shot_type == "3PT" & !is_corner_three),
    is_restricted = shot_distance < 4,
    is_paint = shot_distance >= 4 & shot_distance < 8,
    is_midrange = shot_distance >= 8 & shot_type == "2PT",

    # Defender pressure
    open = defender_distance > 6,
    wide_open = defender_distance > 8,
    tight = defender_distance < 2,

    # Context
    catch_and_shoot = dribbles == 0,
    shot_clock_late = shot_clock < 7,

    # Outcome
    shot_made = as.factor(shot_made)
  )

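# Train-test split (80-20, random and stratified by outcome;
# for a chronological split, sort by game date and cut at the 80% mark instead)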
set.seed(42)
train_idx <- createDataPartition(shots$shot_made, p = 0.8, list = FALSE)
train_data <- shots[train_idx, ]
test_data <- shots[-train_idx, ]

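# Define formula
# (is_midrange is left out as the reference zone: the five zone indicators
#  sum to 1 for every 2PT/3PT shot, so including all of them with an
#  intercept makes the design matrix rank-deficient)
formula <- shot_made ~ shot_distance + shot_angle + defender_distance +
                       is_corner_three + is_atb_three + is_restricted +
                       is_paint +
                       dribbles + touch_time + shot_clock +
                       open + wide_open + tight + catch_and_shoot +
                       shooter_height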

# Train logistic regression model
logit_model <- glm(formula,
                   data = train_data,
                   family = binomial(link = "logit"))

# Model summary
summary(logit_model)

# Predictions (xFG%)
test_data$xFG <- predict(logit_model, newdata = test_data, type = "response")
test_data$actual_FG <- as.numeric(test_data$shot_made) - 1

# Evaluate model
predictions <- test_data$xFG
actual <- test_data$actual_FG

# Log loss
log_loss <- -mean(actual * log(predictions + 1e-15) +
                  (1 - actual) * log(1 - predictions + 1e-15))

# ROC AUC
roc_obj <- roc(actual, predictions)
auc_score <- auc(roc_obj)

# Brier score
brier_score <- mean((predictions - actual)^2)

cat(sprintf("Model Performance:\n"))
cat(sprintf("  Log Loss: %.4f\n", log_loss))
cat(sprintf("  ROC AUC: %.4f\n", auc_score))
cat(sprintf("  Brier Score: %.4f\n", brier_score))

Player-Level Shot Quality Analysis

# Calculate shot quality metrics by player
player_stats <- test_data %>%
  group_by(player_name) %>%
  summarise(
    attempts = n(),
    makes = sum(actual_FG),
    FG_pct = mean(actual_FG),
    xFG_pct = mean(xFG),
    FG_over_expected = mean(actual_FG - xFG),
    avg_shot_distance = mean(shot_distance),
    pct_catch_shoot = mean(catch_and_shoot),
    avg_defender_dist = mean(defender_distance, na.rm = TRUE)
  ) %>%
  filter(attempts >= 50) %>%
  arrange(desc(FG_over_expected))

# Top shooters (beating expectations)
cat("\nTop 10 Shooters (FG% over xFG%):\n")
print(head(player_stats, 10))

# Best shot selection (highest xFG%)
cat("\nTop 10 Shot Selection (highest xFG%):\n")
print(player_stats %>% arrange(desc(xFG_pct)) %>% head(10))

# Visualization: Actual vs Expected FG%
ggplot(player_stats, aes(x = xFG_pct, y = FG_pct, size = attempts)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Player Shooting Efficiency: Actual vs Expected FG%",
    subtitle = "Points above the line indicate better-than-expected shooting",
    x = "Expected FG% (xFG%)",
    y = "Actual FG%",
    size = "Attempts"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))

ggsave("actual_vs_expected_fg_r.png", width = 10, height = 8, dpi = 300)

Shot Quality by Zone

# Analyze shot quality by zone
zone_analysis <- test_data %>%
  mutate(
    zone = case_when(
      is_restricted ~ "Restricted Area",
      is_paint ~ "Paint (Non-RA)",
      is_midrange ~ "Mid-Range",
      is_corner_three ~ "Corner 3",
      is_atb_three ~ "Above Break 3",
      TRUE ~ "Other"
    )
  ) %>%
  group_by(zone) %>%
  summarise(
    attempts = n(),
    FG_pct = mean(actual_FG),
    xFG_pct = mean(xFG),
    avg_defender_dist = mean(defender_distance, na.rm = TRUE),
    pct_open = mean(open, na.rm = TRUE)
  ) %>%
  arrange(desc(FG_pct))

print(zone_analysis)

# Visualize zone efficiency
ggplot(zone_analysis, aes(x = reorder(zone, FG_pct))) +
  geom_col(aes(y = FG_pct, fill = "Actual FG%"), alpha = 0.7, width = 0.4, position = position_nudge(x = -0.2)) +
  geom_col(aes(y = xFG_pct, fill = "Expected FG%"), alpha = 0.7, width = 0.4, position = position_nudge(x = 0.2)) +
  coord_flip() +
  scale_fill_manual(values = c("Actual FG%" = "darkblue", "Expected FG%" = "orange")) +
  labs(
    title = "Shot Efficiency by Zone: Actual vs Expected",
    x = "Shot Zone",
    y = "Field Goal %",
    fill = ""
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

ggsave("zone_efficiency.png", width = 10, height = 6, dpi = 300)

Machine Learning Approaches

Random Forest

Random forests handle non-linear relationships and interactions well without requiring feature scaling:

from sklearn.ensemble import RandomForestClassifier

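# Train random forest
# (class_weight is left at its default so that predict_proba stays on the
#  scale of observed make rates; 'balanced' would distort the probabilities)
rf_model = RandomForestClassifier(
    n_estimators=300,
    max_depth=15,
    min_samples_split=100,
    min_samples_leaf=30,
    max_features='sqrt',
    n_jobs=-1,
    random_state=42,
    verbose=1
)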

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict_proba(X_test)[:, 1]

print("Random Forest Performance:")
print(f"  Log Loss: {log_loss(y_test, y_pred_rf):.4f}")
print(f"  ROC AUC: {roc_auc_score(y_test, y_pred_rf):.4f}")
print(f"  Brier Score: {brier_score_loss(y_test, y_pred_rf):.4f}")

XGBoost (Industry Standard)

import xgboost as xgb

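# Hold out the most recent 20% of the training rows for early stopping,
# so the test set is not used to pick the number of boosting rounds
val_start = int(0.8 * len(X_train))
dtrain = xgb.DMatrix(X_train.iloc[:val_start], label=y_train[:val_start])
dval = xgb.DMatrix(X_train.iloc[val_start:], label=y_train[val_start:])
dtest = xgb.DMatrix(X_test, label=y_test)

# Parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 50,
    'scale_pos_weight': 1,
    'seed': 42
}

# Train with early stopping (monitored on the last entry in evals)
evals = [(dtrain, 'train'), (dval, 'validation')]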
xgb_model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=evals,
    early_stopping_rounds=50,
    verbose_eval=50
)

# Predictions
y_pred_xgb = xgb_model.predict(dtest)

print("\nXGBoost Performance:")
print(f"  Log Loss: {log_loss(y_test, y_pred_xgb):.4f}")
print(f"  ROC AUC: {roc_auc_score(y_test, y_pred_xgb):.4f}")
print(f"  Brier Score: {brier_score_loss(y_test, y_pred_xgb):.4f}")

# Feature importance
importance = xgb_model.get_score(importance_type='gain')
importance_df = pd.DataFrame({
    'feature': list(importance.keys()),
    'importance': list(importance.values())
}).sort_values('importance', ascending=False)

print("\nTop Features (XGBoost):")
print(importance_df.head(10))

Neural Network (Deep Learning)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Scale features
scaler = StandardScaler()
X_train_nn = scaler.fit_transform(X_train)
X_test_nn = scaler.transform(X_test)

# Build neural network
model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(X_train_nn.shape[1],)),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.AUC(name='auc')]
)

# Train
history = model.fit(
    X_train_nn, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=256,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(patience=3, factor=0.5)
    ],
    verbose=1
)

# Predictions
y_pred_nn = model.predict(X_test_nn).flatten()

print("\nNeural Network Performance:")
print(f"  Log Loss: {log_loss(y_test, y_pred_nn):.4f}")
print(f"  ROC AUC: {roc_auc_score(y_test, y_pred_nn):.4f}")
print(f"  Brier Score: {brier_score_loss(y_test, y_pred_nn):.4f}")

Model Validation and Calibration

Calibration Assessment

Shot quality models must be well-calibrated: when the model predicts 45% probability, shots should go in 45% of the time:

from sklearn.calibration import calibration_curve

# Calculate calibration curve
prob_true, prob_pred = calibration_curve(
    y_test, y_pred_gb, n_bins=20, strategy='quantile'
)

# Plot calibration
plt.figure(figsize=(10, 6))
plt.plot(prob_pred, prob_true, marker='o', linewidth=2, label='Gradient Boosting')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives (Actual)')
plt.title('Calibration Curve - Shot Quality Model')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('calibration_curve.png', dpi=300)
plt.show()

# Expected Calibration Error (ECE)
ece = np.mean(np.abs(prob_true - prob_pred))
print(f"Expected Calibration Error: {ece:.4f}")
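The unweighted mean above is a reasonable summary when bins hold equal numbers of shots, as they do with `strategy='quantile'`. With fixed-width bins, ECE is usually computed by weighting each bin's gap by its share of the shots. The sketch below shows that weighted form. The function name and demo values are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins, each bin weighted by its share of the shots."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's share of all shots
    return ece

# Hypothetical predictions: two groups of four shots each
demo_prob = np.array([0.22] * 4 + [0.78] * 4)
demo_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
demo_ece = expected_calibration_error(demo_true, demo_prob)
print(f"Weighted ECE: {demo_ece:.4f}")
```

In the demo, each group's observed make rate (25% and 75%) sits 3 points from its prediction, so the weighted ECE is 0.03.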

Reliability by Probability Bins

# Bin predictions and check actual conversion rates
bins = np.linspace(0, 1, 21)  # 20 bins
bin_indices = np.digitize(y_pred_gb, bins)

reliability = []
for i in range(1, len(bins)):
    mask = bin_indices == i
    if mask.sum() > 0:
        actual_rate = y_test[mask].mean()
        predicted_rate = y_pred_gb[mask].mean()
        count = mask.sum()
        reliability.append({
            'bin_start': bins[i-1],
            'bin_end': bins[i],
            'predicted': predicted_rate,
            'actual': actual_rate,
            'count': count,
            'difference': actual_rate - predicted_rate
        })

reliability_df = pd.DataFrame(reliability)
print("\nReliability by Probability Bin:")
print(reliability_df)
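If the reliability table shows a systematic gap, one common remedy is to recalibrate the raw probabilities on held-out data. The sketch below uses scikit-learn's `IsotonicRegression` on synthetic predictions that deliberately sit 15 points too low. The data and variable names are invented for illustration, and isotonic regression is one option among several (Platt scaling is another).

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

# Synthetic stand-in data: true make probabilities and simulated outcomes
rng = np.random.default_rng(42)
n = 4000
true_p = rng.uniform(0.2, 0.8, n)
outcomes = rng.binomial(1, true_p)
raw_pred = true_p - 0.15  # a model that systematically under-predicts

# Fit the calibration map on one half, evaluate on the other half
calib_idx = np.arange(n) < n // 2
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_pred[calib_idx], outcomes[calib_idx])
recalibrated = iso.predict(raw_pred[~calib_idx])

brier_raw = brier_score_loss(outcomes[~calib_idx], raw_pred[~calib_idx])
brier_recal = brier_score_loss(outcomes[~calib_idx], recalibrated)
print(f"Brier before: {brier_raw:.4f}, after: {brier_recal:.4f}")
```

The calibration map should be fit on data the model was not trained on, otherwise it inherits the model's overfitting.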

Cross-Validation

from sklearn.model_selection import cross_val_score, TimeSeriesSplit

# Time-ordered cross-validation: each fold trains on earlier rows and
# validates on the rows that follow (assumes X_train is sorted by game date)
tscv = TimeSeriesSplit(n_splits=5)

cv_scores = cross_val_score(
    gb_model, X_train, y_train,
    cv=tscv,
    scoring='neg_log_loss',
    n_jobs=-1
)

print(f"\nCross-Validation Log Loss: {-cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

Temporal Validation

For basketball shot data, validate on future games to check that the model generalizes:

# Split by date (train on earlier games, test on later)
shots['game_date'] = pd.to_datetime(shots['game_date'])
shots_sorted = shots.sort_values('game_date')

# Use first 80% of season for training, last 20% for testing
split_date = shots_sorted['game_date'].quantile(0.8)

train_temporal = shots_sorted[shots_sorted['game_date'] < split_date]
test_temporal = shots_sorted[shots_sorted['game_date'] >= split_date]

print(f"Training period: {train_temporal['game_date'].min()} to {train_temporal['game_date'].max()}")
print(f"Testing period: {test_temporal['game_date'].min()} to {test_temporal['game_date'].max()}")

# Train and evaluate with temporal split
# (Use same process as before with new train/test sets)
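To make the "same process" concrete, here is a small self-contained sketch that splits by date, fits a model on the earlier games, and scores it on the later ones. It uses synthetic data and a plain logistic regression as a stand-in. The helper name `fit_and_score_temporal` and the simulated relationship between distance and make probability are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def fit_and_score_temporal(df, feature_cols, target_col="shot_made",
                           date_col="game_date", train_frac=0.8):
    """Train on the earliest games and score on the latest ones."""
    df = df.sort_values(date_col)
    split_date = df[date_col].quantile(train_frac)
    train = df[df[date_col] < split_date]
    test = df[df[date_col] >= split_date]
    model = LogisticRegression(max_iter=1000)
    model.fit(train[feature_cols], train[target_col])
    preds = model.predict_proba(test[feature_cols])[:, 1]
    loss = log_loss(test[target_col], preds, labels=[0, 1])
    return model, loss, len(train), len(test)

# Synthetic stand-in data (not real NBA shots): make probability
# falls with shot distance and rises with defender distance
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "game_date": pd.to_datetime("2023-10-24")
                 + pd.to_timedelta(rng.integers(0, 170, n), unit="D"),
    "shot_distance": rng.uniform(0, 28, n),
    "defender_distance": rng.uniform(0, 10, n),
})
logit = 0.8 - 0.06 * demo["shot_distance"] + 0.05 * demo["defender_distance"]
demo["shot_made"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model, loss, n_train, n_test = fit_and_score_temporal(
    demo, ["shot_distance", "defender_distance"]
)
print(f"Train shots: {n_train}, test shots: {n_test}, test log loss: {loss:.4f}")
```

Splitting on a date rather than a row index keeps all shots from the same game on the same side of the split.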

Applications for Player Evaluation

1. Shooting Talent vs. Shot Selection

Decompose scoring efficiency into two components:

  • Shooting Talent: FG% - xFG% (ability to convert difficult shots)
  • Shot Selection: Average xFG% of shots taken (quality of looks generated)

# Calculate for each player
player_evaluation = shots_test.groupby('player_name').agg({
    'shot_made': 'mean',  # Actual FG%
    'xFG': 'mean',  # Shot quality (selection)
    'shot_id': 'count'  # Attempts
}).round(3)

player_evaluation.columns = ['FG%', 'Shot_Quality', 'Attempts']
player_evaluation['Shooting_Talent'] = (player_evaluation['FG%'] -
                                         player_evaluation['Shot_Quality'])

# Filter minimum attempts
player_evaluation = player_evaluation[player_evaluation['Attempts'] >= 100]

# Classify players into quadrants
player_evaluation['Category'] = 'Average'
player_evaluation.loc[
    (player_evaluation['Shooting_Talent'] > 0.02) &
    (player_evaluation['Shot_Quality'] > 0.50), 'Category'
] = 'Elite (Good Talent + Good Selection)'

player_evaluation.loc[
    (player_evaluation['Shooting_Talent'] > 0.02) &
    (player_evaluation['Shot_Quality'] <= 0.50), 'Category'
] = 'Efficient (Good Talent + Poor Selection)'

player_evaluation.loc[
    (player_evaluation['Shooting_Talent'] <= 0.02) &
    (player_evaluation['Shot_Quality'] > 0.50), 'Category'
] = 'System Player (Poor Talent + Good Selection)'

print(player_evaluation.sort_values('FG%', ascending=False).head(20))

2. Offensive Role Identification

Use shot quality patterns to identify player archetypes:

# Profile each player's shot diet
player_profiles = shots_test.groupby('player_name').agg({
    'is_restricted': 'mean',
    'is_corner_three': 'mean',
    'is_atb_three': 'mean',
    'is_midrange': 'mean',
    'catch_and_shoot': 'mean',
    'dribbles': 'mean',
    'defender_distance': 'mean',
    'shot_id': 'count'
}).round(3)

player_profiles.columns = [
    'Pct_Rim', 'Pct_Corner3', 'Pct_ATB3', 'Pct_Midrange',
    'Pct_CatchShoot', 'Avg_Dribbles', 'Avg_DefenderDist', 'Attempts'
]

# Example: Identify spot-up shooters
spot_up_shooters = player_profiles[
    (player_profiles['Pct_CatchShoot'] > 0.7) &
    (player_profiles['Pct_Corner3'] > 0.3) &
    (player_profiles['Attempts'] >= 100)
].sort_values('Pct_Corner3', ascending=False)

print("Spot-Up Shooters (70%+ Catch & Shoot, 30%+ Corner 3s):")
print(spot_up_shooters)

3. Defensive Impact on Shot Quality

Measure how individual defenders affect opponent shot quality:

# Calculate xFG% allowed by each defender
defender_impact = shots_test.groupby('closest_defender').agg({
    'xFG': 'mean',  # Average xFG% allowed (shot quality)
    'shot_made': 'mean',  # Actual FG% allowed
    'shot_id': 'count',  # Shots defended
    'defender_distance': 'mean'  # Avg contest distance
}).round(3)

defender_impact.columns = ['xFG%_Allowed', 'FG%_Allowed', 'Shots_Defended', 'Avg_Distance']

# DXFG: Defensive impact = (Actual FG% - xFG% allowed)
# Negative = good defense (opponents shot worse than expected)
defender_impact['DXFG'] = (defender_impact['FG%_Allowed'] -
                            defender_impact['xFG%_Allowed'])

# Filter minimum sample
defender_impact = defender_impact[defender_impact['Shots_Defended'] >= 100]

# Best defenders (most negative DXFG)
print("Top 10 Defenders (Lowest DXFG - Force Misses):")
print(defender_impact.sort_values('DXFG').head(10))

4. Clutch Performance Analysis

Evaluate shot quality and conversion in high-pressure situations:

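# Define clutch situations (last 5 minutes of the 4th quarter or overtime, score within 5)
# Assumes game_clock is seconds remaining in the period and that a
# 'period' column exists (1-4 for quarters, 5+ for overtime)
clutch_shots = shots_test[
    (shots_test['period'] >= 4) &
    (shots_test['game_clock'] <= 300) &  # Last 5 minutes of the period
    (abs(shots_test['score_margin']) <= 5)  # Within 5 points
].copy()

# Aggregate clutch shots by player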
clutch_stats = clutch_shots.groupby('player_name').agg({
    'shot_made': 'mean',
    'xFG': 'mean',
    'shot_id': 'count'
}).round(3)

clutch_stats.columns = ['Clutch_FG%', 'Clutch_xFG%', 'Clutch_Attempts']
clutch_stats['Clutch_Talent'] = clutch_stats['Clutch_FG%'] - clutch_stats['Clutch_xFG%']

# Filter minimum clutch attempts
clutch_stats = clutch_stats[clutch_stats['Clutch_Attempts'] >= 20]

print("Top Clutch Performers (FG% over xFG% in clutch):")
print(clutch_stats.sort_values('Clutch_Talent', ascending=False).head(10))

5. Play Type Efficiency

Evaluate shot quality generated by different offensive actions:

# Assuming play_type column exists (e.g., pick-and-roll, isolation, transition)
play_efficiency = shots_test.groupby('play_type').agg({
    'xFG': 'mean',
    'shot_made': 'mean',
    'shot_id': 'count'
}).round(3)

play_efficiency.columns = ['Avg_xFG%', 'Avg_FG%', 'Frequency']
play_efficiency['Efficiency_Delta'] = (play_efficiency['Avg_FG%'] -
                                        play_efficiency['Avg_xFG%'])

print("Shot Quality by Play Type:")
print(play_efficiency.sort_values('Avg_xFG%', ascending=False))

6. Lineup Shot Quality Analysis

Evaluate which lineups generate and allow the best shot quality:

# Group by 5-man lineup
lineup_offense = shots_test.groupby('lineup_id').agg({
    'xFG': 'mean',  # Shot quality generated
    'shot_made': 'mean',
    'shot_id': 'count'
}).round(3)

lineup_offense.columns = ['Offensive_xFG%', 'Offensive_FG%', 'Shots']
lineup_offense = lineup_offense[lineup_offense['Shots'] >= 50]

print("Best Offensive Lineups (Highest xFG% Generated):")
print(lineup_offense.sort_values('Offensive_xFG%', ascending=False).head(10))

# Can do similar analysis for defensive lineups (xFG% allowed)

Best Practices and Considerations

Model Development

  • Sample Size: Train on full season(s) of data (100,000+ shots) for stability
  • Feature Selection: Include only objective, measurable features; avoid including shooter identity if evaluating shooting talent
  • Temporal Validation: Test on future games, not random splits, to ensure generalization
  • Calibration: Ensure predicted probabilities match observed frequencies
  • Interpretability: Simpler models (logistic regression, GAMs) are easier to explain to stakeholders

Data Quality

  • Tracking Data Accuracy: Defender distance and player locations are only as precise as the optical tracking system (SportVU or its successors) that recorded them
  • Shot Classification: Ensure consistent labeling of shot types across games
  • Missing Data: Handle missing defender distance or play type data appropriately
  • Outlier Detection: Flag and investigate impossible values (e.g., negative distances)
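The checks above can be automated before training. The sketch below flags rows with negative or missing defender distances, implausible shot distances, and shot clock values outside 0 to 24 seconds. The column names follow the ones used earlier in this article. The function name, the 94-foot cap (the length of an NBA court), and the demo rows are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def flag_suspect_shots(df, max_distance=94.0):
    """Return a boolean DataFrame marking rows that fail basic sanity checks."""
    flags = pd.DataFrame(index=df.index)
    flags["negative_defender_distance"] = df["defender_distance"] < 0
    flags["missing_defender_distance"] = df["defender_distance"].isna()
    flags["implausible_shot_distance"] = (
        (df["shot_distance"] < 0) | (df["shot_distance"] > max_distance)
    )
    # A missing shot clock is not flagged here: comparisons with NaN are False
    flags["shot_clock_out_of_range"] = (df["shot_clock"] < 0) | (df["shot_clock"] > 24)
    flags["any_flag"] = flags.any(axis=1)
    return flags

# Hypothetical rows: one clean shot and three with different problems
demo_shots = pd.DataFrame({
    "defender_distance": [4.2, -1.0, np.nan, 6.0],
    "shot_distance": [12.0, 25.0, 3.0, 120.0],
    "shot_clock": [14.0, 3.0, 25.5, np.nan],
})
quality_flags = flag_suspect_shots(demo_shots)
print(quality_flags)
```

Flagged rows are worth inspecting before deciding whether to drop, impute, or correct them, since a cluster of bad values often points to a specific game or arena.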

Interpretation Pitfalls

  • Small Samples: FG% minus xFG% stabilizes faster than raw FG%, but it still needs 50+ shots before it is worth interpreting
  • Context Dependency: Models trained on NBA data may not generalize to college or international basketball
  • Skill vs. Luck: Even with shot quality models, variance exists; don't overreact to short-term deviations
  • Selection Bias: Players who take difficult shots may be doing so because no better option exists
  • Causality: FG% minus xFG% is a residual, not a direct measure of shooting skill; it can also reflect factors the model does not capture (shot preparation, footwork, etc.)
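One way to guard against over-reading small samples is to attach a standard error to FG% minus xFG%. If each shot is treated as an independent Bernoulli trial with the model's probability p_i, the variance of the summed residuals is the sum of p_i(1 - p_i). The sketch below applies that formula. It assumes the model probabilities are correct and that shots are independent, both of which are simplifications, and the function name is hypothetical.

```python
import numpy as np

def delta_with_uncertainty(made, p_make):
    """Return (FG% - xFG%, its standard error, z-score) under the model."""
    made = np.asarray(made, dtype=float)
    p = np.asarray(p_make, dtype=float)
    n = made.size
    delta = made.mean() - p.mean()
    # Var(sum of residuals) = sum of p_i * (1 - p_i) for independent shots
    se = np.sqrt(np.sum(p * (1 - p))) / n
    return delta, se, delta / se

# Hypothetical shooters: both make 60% on shots the model rates at 50%
small = delta_with_uncertainty([1] * 30 + [0] * 20, [0.5] * 50)
large = delta_with_uncertainty([1] * 300 + [0] * 200, [0.5] * 500)
print(f"50 shots:  delta={small[0]:.3f}, se={small[1]:.3f}, z={small[2]:.2f}")
print(f"500 shots: delta={large[0]:.3f}, se={large[1]:.3f}, z={large[2]:.2f}")
```

The same +10 point gap is about 1.4 standard errors from zero over 50 shots but about 4.5 over 500, which is why minimum-attempt filters matter.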

Model Maintenance

  • Regular Retraining: Update models each season as shot trends evolve (e.g., increasing 3-point volume)
  • Rule Changes: Adjust for defensive rule changes that affect shot difficulty
  • Tracking Updates: Recalibrate if tracking system changes or improves
  • Performance Monitoring: Track model performance metrics over time to detect drift
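Performance monitoring can be as simple as tracking log loss by calendar month and watching for a sustained rise. The sketch below groups per-shot log loss by month. The function name and the four demo shots are hypothetical.

```python
import numpy as np
import pandas as pd

def monthly_log_loss(dates, y_true, y_prob, eps=1e-15):
    """Mean per-shot log loss and shot count for each calendar month."""
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    per_shot = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    months = pd.DatetimeIndex(pd.to_datetime(dates)).to_period("M")
    frame = pd.DataFrame({"month": months, "loss": per_shot})
    return frame.groupby("month")["loss"].agg(["mean", "count"])

# Hypothetical shots: confident and correct in January, coin flips in February
drift = monthly_log_loss(
    dates=["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"],
    y_true=[1, 0, 1, 0],
    y_prob=[0.8, 0.2, 0.5, 0.5],
)
print(drift)
```

A single noisy month is not evidence of drift. A trend across several months, or a jump that coincides with a rule or tracking change, is a stronger signal to retrain.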

Summary

Shot quality models are essential tools for modern basketball analytics, enabling teams to:

  • Separate shooting talent from shot selection in player evaluation
  • Identify players who consistently beat expectations or generate high-quality looks
  • Measure defensive impact by quantifying how defenders affect opponent shot difficulty
  • Evaluate offensive and defensive schemes based on shot quality generated/allowed
  • Make more informed decisions in small samples by accounting for shot difficulty

By combining spatial data, defensive pressure metrics, and contextual features with machine learning techniques such as gradient boosting (including XGBoost), analysts can build well-calibrated models that estimate shot success probability. These xFG% predictions form the foundation for advanced metrics used throughout the NBA for player evaluation, lineup optimization, and strategic planning.

The key to successful shot quality modeling is ensuring calibration, validating on future data, and interpreting results within the broader context of team strategy and individual player roles.
