In This Chapter
- Introduction
- 16.1 Expected Points Models
- 16.2 Shot Difficulty Factors
- 16.3 Building a Shot Quality Model with Logistic Regression
- 16.4 Feature Engineering for Shot Prediction
- 16.5 Advanced Models: Gradient Boosting and Neural Networks
- 16.6 Shot Creation Value
- 16.7 Shooting Luck vs. Skill: Regression to the Mean
- 16.8 Points Above Expected
- 16.9 Applications: Player Evaluation
- 16.10 Applications: Coaching Decisions
- 16.11 Model Calibration and Validation
- 16.12 Practical Considerations and Limitations
- Summary
- Chapter References
Chapter 16: Shot Quality Models
Introduction
The revolution in basketball analytics began with a simple question: are all shots created equal? The answer, of course, is no. A wide-open three-pointer from the corner differs fundamentally from a contested fadeaway over a seven-footer. Shot quality models attempt to quantify these differences, providing a framework for understanding the true value of shooting opportunities and the players who create and convert them.
This chapter explores the construction and application of shot quality models, from basic expected points calculations to sophisticated machine learning approaches that account for dozens of contextual factors. We will build complete models using Python and scikit-learn, examining feature engineering, model evaluation, and practical applications in player evaluation and coaching decisions.
Shot quality modeling represents one of the most successful applications of machine learning in sports analytics. Unlike many predictive tasks where the outcome is distant and influenced by countless intervening factors, shot outcomes are immediate and binary: the ball either goes in or it does not. This clarity makes shot prediction an ideal domain for developing and refining analytical techniques.
16.1 Expected Points Models
The Foundation of Shot Quality
Expected points (xPoints or xPts) represents the average number of points a shot would yield if taken many times under identical circumstances. For a two-point shot with a 45% probability of going in, the expected points equals 0.90. For a three-pointer with a 35% make probability, the expected points equals 1.05.
The expected points framework transforms shooting from a binary outcome (made or missed) into a continuous measure of quality. This transformation is essential because:
- Sample size limitations: Even prolific scorers take only a few hundred shots per season from any specific location
- Variance in outcomes: A shooter might hit 5 consecutive difficult shots or miss 5 consecutive easy ones
- Decision evaluation: We want to assess whether taking a shot was a good decision, independent of whether it went in
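The variance point is easy to underestimate. As a quick sketch (assuming a shooter with a fixed, true 40% make rate and independent attempts), streaks of makes and misses are expected even when skill never changes:

```python
# Streakiness is expected even with a constant true skill level.
# Probability a 40% shooter hits 5 in a row, or misses 5 in a row,
# assuming independent attempts:
p_make = 0.40
p_five_makes = p_make ** 5          # 0.4^5
p_five_misses = (1 - p_make) ** 5   # 0.6^5

print(f"5 straight makes:  {p_five_makes:.4f}")   # ~0.0102
print(f"5 straight misses: {p_five_misses:.4f}")  # ~0.0778
```

Roughly one possession sequence in a hundred produces five straight makes from an average shooter, which is why raw make/miss outcomes are a noisy read on shot quality.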
Simple Expected Points Calculation
The most basic expected points model uses only shot location:
$$xPts = P(make | location) \times points\_value$$
Where $P(make | location)$ is the league-average field goal percentage from that location, and $points\_value$ is 2 or 3 depending on whether the shot is inside or beyond the arc.
import numpy as np
import pandas as pd

def simple_expected_points(shot_distance, is_three_pointer, league_fg_by_distance):
    """
    Calculate expected points using only shot distance.

    Parameters:
    -----------
    shot_distance : float
        Distance from basket in feet
    is_three_pointer : bool
        Whether the shot is a three-point attempt
    league_fg_by_distance : dict
        Dictionary mapping distance ranges to league-average FG%

    Returns:
    --------
    float : Expected points for the shot
    """
    # Determine the distance bucket
    if shot_distance < 4:
        fg_pct = league_fg_by_distance['rim']
    elif shot_distance < 10:
        fg_pct = league_fg_by_distance['short']
    elif shot_distance < 16:
        fg_pct = league_fg_by_distance['mid_short']
    elif shot_distance < 22:
        fg_pct = league_fg_by_distance['mid_long']
    else:
        fg_pct = league_fg_by_distance['three']
    points_value = 3 if is_three_pointer else 2
    return fg_pct * points_value

# Example league averages (approximate 2023-24 values)
league_fg = {
    'rim': 0.65,        # 0-4 feet
    'short': 0.42,      # 4-10 feet
    'mid_short': 0.40,  # 10-16 feet
    'mid_long': 0.40,   # 16-22 feet
    'three': 0.36       # 22+ feet
}

# Calculate expected points for different shots
print(f"Rim shot xPts: {simple_expected_points(2, False, league_fg):.3f}")
print(f"Mid-range xPts: {simple_expected_points(15, False, league_fg):.3f}")
print(f"Three-pointer xPts: {simple_expected_points(24, True, league_fg):.3f}")
Output:
Rim shot xPts: 1.300
Mid-range xPts: 0.800
Three-pointer xPts: 1.080
This simple model immediately reveals why modern offenses emphasize rim attempts and three-pointers: they yield significantly higher expected points than mid-range shots.
Zone-Based Expected Points
A more refined approach divides the court into zones, calculating expected points for each:
def create_shot_zones():
    """
    Define standard shot zones used in NBA analysis.

    Returns:
    --------
    dict : Zone definitions with boundaries and typical FG%
    """
    zones = {
        'restricted_area': {
            'description': 'Within 4 feet of rim',
            'distance_range': (0, 4),
            'angle_range': None,  # All angles
            'fg_pct': 0.63,
            'points': 2
        },
        'paint_non_ra': {
            'description': 'Paint outside restricted area',
            'distance_range': (4, 14),
            'angle_range': (-80, 80),  # Inside the paint lines
            'fg_pct': 0.40,
            'points': 2
        },
        'mid_range_left': {
            'description': 'Left side mid-range',
            'distance_range': (14, 22),
            'angle_range': (45, 135),
            'fg_pct': 0.41,
            'points': 2
        },
        'mid_range_right': {
            'description': 'Right side mid-range',
            'distance_range': (14, 22),
            'angle_range': (-135, -45),
            'fg_pct': 0.41,
            'points': 2
        },
        'mid_range_center': {
            'description': 'Top of key mid-range',
            'distance_range': (14, 22),
            'angle_range': (-45, 45),
            'fg_pct': 0.40,
            'points': 2
        },
        'corner_three_left': {
            'description': 'Left corner three',
            'distance_range': (22, 24),
            'angle_range': (70, 110),
            'fg_pct': 0.39,
            'points': 3
        },
        'corner_three_right': {
            'description': 'Right corner three',
            'distance_range': (22, 24),
            'angle_range': (-110, -70),
            'fg_pct': 0.39,
            'points': 3
        },
        'above_break_three': {
            'description': 'Above the break three',
            'distance_range': (22, 30),
            'angle_range': (-70, 70),
            'fg_pct': 0.36,
            'points': 3
        },
        'deep_three': {
            'description': 'Beyond 30 feet',
            'distance_range': (30, 50),
            'angle_range': None,
            'fg_pct': 0.30,
            'points': 3
        }
    }
    return zones

def zone_expected_points(x, y, zones):
    """
    Calculate expected points based on shot zone.

    Parameters:
    -----------
    x, y : float
        Shot coordinates (basket at origin)
    zones : dict
        Zone definitions from create_shot_zones()

    Returns:
    --------
    tuple : (zone_name, expected_points)
    """
    distance = np.sqrt(x**2 + y**2)
    angle = np.degrees(np.arctan2(x, y))  # Angle from basket
    for zone_name, zone in zones.items():
        dist_min, dist_max = zone['distance_range']
        if dist_min <= distance < dist_max:
            if zone['angle_range'] is None:
                return zone_name, zone['fg_pct'] * zone['points']
            angle_min, angle_max = zone['angle_range']
            if angle_min <= angle <= angle_max:
                return zone_name, zone['fg_pct'] * zone['points']
    # Default to deep three if no zone matched
    return 'deep_three', zones['deep_three']['fg_pct'] * 3
Limitations of Location-Only Models
While location-based models provide a useful baseline, they ignore crucial contextual factors:
- Defender position: An open shot differs vastly from a contested one
- Shot type: Catch-and-shoot versus off-the-dribble
- Game situation: Shot clock, score differential, quarter
- Player fatigue: Minutes played, pace of game
- Individual skill: Not all shooters are equal
These limitations motivate the development of more sophisticated shot quality models that incorporate additional features.
16.2 Shot Difficulty Factors
Distance from Basket
Shot distance is the most predictive single feature for make probability. League-wide field goal percentage declines steadily with distance:
def analyze_distance_effect(shots_df):
    """
    Analyze the relationship between shot distance and FG%.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'distance' and 'made' columns

    Returns:
    --------
    DataFrame : FG% by distance bucket
    """
    # Create distance buckets
    bins = [0, 4, 8, 12, 16, 20, 24, 28, 35]
    labels = ['0-4', '4-8', '8-12', '12-16', '16-20', '20-24', '24-28', '28+']
    shots_df['distance_bucket'] = pd.cut(
        shots_df['distance'],
        bins=bins,
        labels=labels
    )
    # Calculate FG% by bucket
    distance_analysis = shots_df.groupby('distance_bucket').agg(
        attempts=('made', 'count'),
        makes=('made', 'sum'),
        fg_pct=('made', 'mean')
    ).round(3)
    distance_analysis['expected_pts_2pt'] = distance_analysis['fg_pct'] * 2
    distance_analysis['expected_pts_3pt'] = distance_analysis['fg_pct'] * 3
    return distance_analysis
The relationship is not perfectly linear. Shots at the rim benefit from banking and tip-ins, creating a spike in efficiency. The three-point line creates a discontinuity where slightly longer shots become more valuable despite lower make rates.
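The size of that discontinuity is worth making explicit. Using the approximate league averages from the simple model above, a 21-foot two and a 24-foot three compare as follows:

```python
# The arc discontinuity, using the approximate league averages quoted
# earlier: a long two goes in more often, but the three is worth more.
fg_long_two = 0.40   # ~FG% on 16-22 foot twos
fg_three = 0.36      # ~FG% on above-the-break threes

xpts_long_two = fg_long_two * 2  # 0.80 expected points
xpts_three = fg_three * 3        # 1.08 expected points

print(f"Long two xPts:  {xpts_long_two:.2f}")
print(f"Three xPts:     {xpts_three:.2f}")
print(f"Gap at the arc: {xpts_three - xpts_long_two:+.2f}")  # +0.28
```

Stepping back a single foot across the line is worth roughly a quarter of a point per attempt, which is the arithmetic behind modern shot-selection diets.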
Defender Proximity
Defender distance is arguably the second most important factor in shot difficulty. NBA tracking data provides closest defender distance at the moment of release:
def categorize_defender_distance(defender_distance):
    """
    Categorize shots by defender proximity.

    Categories match NBA.com tracking data definitions:
    - Wide Open: 6+ feet
    - Open: 4-6 feet
    - Tight: 2-4 feet
    - Very Tight: 0-2 feet

    Parameters:
    -----------
    defender_distance : float
        Distance to closest defender in feet

    Returns:
    --------
    str : Defender distance category
    """
    if defender_distance >= 6:
        return 'wide_open'
    elif defender_distance >= 4:
        return 'open'
    elif defender_distance >= 2:
        return 'tight'
    else:
        return 'very_tight'

def analyze_defender_impact(shots_df):
    """
    Analyze how defender proximity affects shot success.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'defender_distance' and 'made' columns

    Returns:
    --------
    DataFrame : FG% by defender distance category
    """
    shots_df['contest_level'] = shots_df['defender_distance'].apply(
        categorize_defender_distance
    )
    # Order categories properly
    category_order = ['wide_open', 'open', 'tight', 'very_tight']
    contest_analysis = shots_df.groupby('contest_level').agg(
        attempts=('made', 'count'),
        makes=('made', 'sum'),
        fg_pct=('made', 'mean'),
        avg_distance=('distance', 'mean')
    ).reindex(category_order).round(3)
    return contest_analysis
League-wide data shows approximately:
- Wide Open (6+ feet): ~55% on two-pointers, ~40% on three-pointers
- Open (4-6 feet): ~48% on two-pointers, ~36% on three-pointers
- Tight (2-4 feet): ~42% on two-pointers, ~33% on three-pointers
- Very Tight (0-2 feet): ~38% on two-pointers, ~30% on three-pointers
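Converting those approximate rates to expected points makes the contest gradient easier to read. A small sketch (the percentages are the rough league-wide figures quoted above, not exact values):

```python
import pandas as pd

# Approximate league-wide make rates by contest level, per the text.
contest_rates = pd.DataFrame(
    {
        'fg2_pct': [0.55, 0.48, 0.42, 0.38],
        'fg3_pct': [0.40, 0.36, 0.33, 0.30],
    },
    index=['wide_open', 'open', 'tight', 'very_tight'],
)

# Expected points per attempt at each contest level
contest_rates['xpts_2pt'] = contest_rates['fg2_pct'] * 2
contest_rates['xpts_3pt'] = contest_rates['fg3_pct'] * 3
print(contest_rates)
```

Note that a wide-open three (~1.20 xPts) out-values even a wide-open two (~1.10 xPts), while a very tight two (~0.76 xPts) is among the worst shots on the floor.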
Shot Clock
Shot clock time remaining influences shot quality in complex ways:
def analyze_shot_clock_impact(shots_df):
    """
    Analyze how shot clock affects shot selection and success.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'shot_clock' and 'made' columns

    Returns:
    --------
    DataFrame : Analysis by shot clock ranges
    """
    # Define shot clock ranges
    def shot_clock_category(seconds):
        if pd.isna(seconds):
            return 'unknown'
        elif seconds <= 4:
            return 'very_late'
        elif seconds <= 8:
            return 'late'
        elif seconds <= 15:
            return 'mid'
        else:
            return 'early'

    shots_df['clock_category'] = shots_df['shot_clock'].apply(shot_clock_category)
    analysis = shots_df.groupby('clock_category').agg(
        attempts=('made', 'count'),
        fg_pct=('made', 'mean'),
        avg_distance=('distance', 'mean'),
        three_pt_rate=('is_three', 'mean'),
        avg_defender_dist=('defender_distance', 'mean')
    ).round(3)
    return analysis
Key findings from shot clock analysis:
- Early clock (15+ seconds): Often transition opportunities, higher FG%
- Mid clock (8-15 seconds): Typical half-court attempts
- Late clock (4-8 seconds): Slightly lower FG%, but still reasonable
- Very late clock (0-4 seconds): Significantly lower FG%, often difficult forced shots
Touch Time
How long a player holds the ball before shooting affects outcomes:
def analyze_touch_time(shots_df):
    """
    Analyze the impact of touch time on shooting.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'touch_time' column (seconds)

    Returns:
    --------
    DataFrame : Analysis by touch time category
    """
    def touch_time_category(seconds):
        if seconds < 2:
            return 'catch_and_shoot'
        elif seconds < 4:
            return 'quick'
        elif seconds < 6:
            return 'moderate'
        else:
            return 'extended'

    shots_df['touch_category'] = shots_df['touch_time'].apply(touch_time_category)
    analysis = shots_df.groupby('touch_category').agg(
        attempts=('made', 'count'),
        fg_pct=('made', 'mean'),
        avg_defender_dist=('defender_distance', 'mean')
    ).round(3)
    return analysis
Catch-and-shoot opportunities generally yield higher percentages than off-the-dribble shots, even after controlling for defender distance. The rhythm and balance advantages of catching and immediately shooting contribute to this difference.
Dribbles Before Shot
Related to touch time, the number of dribbles before a shot correlates with difficulty:
def analyze_dribbles_impact(shots_df):
    """
    Analyze how dribbles before shot affect success rate.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with 'dribbles' column

    Returns:
    --------
    DataFrame : Analysis by dribble count
    """
    # Cap at 7+ for grouping. Build string labels so the '7+' bucket shares
    # a dtype with the lower counts (assigning a string into an integer
    # column via .loc is an error in recent pandas versions).
    shots_df['dribble_group'] = (
        shots_df['dribbles'].clip(upper=7).astype(int).astype(str)
    )
    shots_df.loc[shots_df['dribble_group'] == '7', 'dribble_group'] = '7+'
    analysis = shots_df.groupby('dribble_group').agg(
        attempts=('made', 'count'),
        fg_pct=('made', 'mean'),
        avg_distance=('distance', 'mean'),
        avg_defender_dist=('defender_distance', 'mean')
    ).round(3)
    return analysis
Zero dribbles (catch-and-shoot) typically yields the highest percentages, with efficiency declining as dribbles increase. However, players who take many dribbles often face tighter defense, so this variable interacts with defender proximity.
Additional Difficulty Factors
Beyond the core factors above, comprehensive shot quality models may include:
- Shot type: Layup, dunk, hook, floating, pull-up, step-back
- Player position on court: Angle affects shot difficulty
- Game state: Score differential, quarter, playoffs vs regular season
- Defender height and wingspan: Longer defenders more disruptive
- Previous action: Off screen, isolation, post-up, transition
- Shooter's prior shots: Hot hand, fatigue effects
16.3 Building a Shot Quality Model with Logistic Regression
Logistic regression provides an interpretable framework for shot quality modeling. The model predicts the probability of a make given input features, naturally bounded between 0 and 1.
Data Preparation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, roc_auc_score, brier_score_loss,
    log_loss, classification_report
)

def prepare_shot_data(shots_df):
    """
    Prepare shot data for modeling.

    Parameters:
    -----------
    shots_df : DataFrame
        Raw shot data

    Returns:
    --------
    tuple : (X, y, numeric_features, categorical_features)
    """
    # Define features
    numeric_features = [
        'distance',
        'defender_distance',
        'shot_clock',
        'touch_time',
        'dribbles',
        'x_coord',
        'y_coord'
    ]
    categorical_features = [
        'shot_type',
        'action_type',
        'period',
        'is_home'
    ]
    # Handle missing values
    shots_df = shots_df.dropna(subset=['made'] + numeric_features)
    # Fill categorical NAs
    for col in categorical_features:
        if col in shots_df.columns:
            shots_df[col] = shots_df[col].fillna('unknown')
    # Create feature matrix
    X = shots_df[numeric_features + categorical_features].copy()
    y = shots_df['made'].astype(int)
    return X, y, numeric_features, categorical_features

def create_preprocessing_pipeline(numeric_features, categorical_features):
    """
    Create preprocessing pipeline for shot features.

    Parameters:
    -----------
    numeric_features : list
        Names of numeric columns
    categorical_features : list
        Names of categorical columns

    Returns:
    --------
    ColumnTransformer : Preprocessing pipeline
    """
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False),
             categorical_features)
        ]
    )
    return preprocessor
Model Training
def train_shot_quality_model(X, y, numeric_features, categorical_features):
    """
    Train a logistic regression shot quality model.

    Parameters:
    -----------
    X : DataFrame
        Feature matrix
    y : Series
        Target variable (1=made, 0=missed)
    numeric_features, categorical_features : lists
        Feature names by type

    Returns:
    --------
    tuple : (trained_pipeline, X_test, y_test, y_prob, metrics)
    """
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Create pipeline
    preprocessor = create_preprocessing_pipeline(
        numeric_features, categorical_features
    )
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(
            max_iter=1000,
            solver='lbfgs',
            C=1.0  # Regularization strength
            # Note: class_weight='balanced' is deliberately omitted.
            # Reweighting classes shifts the predicted probabilities,
            # and we need those probabilities well calibrated (see 16.11);
            # makes vs. misses are close to balanced anyway.
        ))
    ])
    # Train model
    pipeline.fit(X_train, y_train)
    # Generate predictions
    y_pred = pipeline.predict(X_test)
    y_prob = pipeline.predict_proba(X_test)[:, 1]
    # Calculate metrics
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_prob),
        'brier_score': brier_score_loss(y_test, y_prob),
        'log_loss': log_loss(y_test, y_prob)
    }
    return pipeline, X_test, y_test, y_prob, metrics
def print_model_evaluation(metrics, y_test, y_pred):
    """
    Print comprehensive model evaluation.

    Parameters:
    -----------
    metrics : dict
        Model performance metrics
    y_test : array
        True labels
    y_pred : array
        Predicted labels
    """
    print("=" * 50)
    print("SHOT QUALITY MODEL EVALUATION")
    print("=" * 50)
    print(f"\nAccuracy: {metrics['accuracy']:.4f}")
    print(f"ROC AUC: {metrics['roc_auc']:.4f}")
    print(f"Brier Score: {metrics['brier_score']:.4f}")
    print(f"Log Loss: {metrics['log_loss']:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Miss', 'Make']))
Interpreting Coefficients
One advantage of logistic regression is interpretable coefficients:
def interpret_logistic_coefficients(pipeline, numeric_features, categorical_features):
    """
    Extract and interpret logistic regression coefficients.

    Parameters:
    -----------
    pipeline : Pipeline
        Trained sklearn pipeline
    numeric_features, categorical_features : lists
        Feature names

    Returns:
    --------
    DataFrame : Coefficients with interpretation
    """
    # Get the classifier
    clf = pipeline.named_steps['classifier']
    # Get feature names after one-hot encoding
    cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
    cat_feature_names = cat_encoder.get_feature_names_out(categorical_features)
    all_features = list(numeric_features) + list(cat_feature_names)
    # Create coefficient DataFrame
    coef_df = pd.DataFrame({
        'feature': all_features,
        'coefficient': clf.coef_[0],
        'odds_ratio': np.exp(clf.coef_[0])
    })
    coef_df['impact'] = coef_df['coefficient'].apply(
        lambda x: 'positive' if x > 0 else 'negative'
    )
    # Sort by absolute coefficient
    coef_df['abs_coef'] = coef_df['coefficient'].abs()
    coef_df = coef_df.sort_values('abs_coef', ascending=False)
    return coef_df.drop('abs_coef', axis=1)
Example interpretation of coefficients:
- Distance coefficient -0.08: Each additional foot from the basket decreases log-odds by 0.08, corresponding to an odds ratio of 0.92 (8% decrease in odds per foot)
- Defender distance coefficient +0.15: Each additional foot from the nearest defender increases log-odds by 0.15, corresponding to an odds ratio of 1.16
- Dunk shot type +2.1: Dunks have dramatically higher make probability than the baseline shot type
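The odds ratios above are just the exponentials of the coefficients, which is worth verifying once by hand:

```python
import numpy as np

# Odds ratio = exp(coefficient). Checking the figures quoted above:
print(np.exp(-0.08))  # ~0.923 -> about an 8% drop in odds per foot of distance
print(np.exp(0.15))   # ~1.162 -> about a 16% rise in odds per foot of space
print(np.exp(2.1))    # ~8.17  -> dunks multiply the make odds roughly eightfold
```

Remember these are multiplicative effects on *odds*, not additive effects on probability; a 16% rise in odds is a smaller move in probability when the baseline is near 50%.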
16.4 Feature Engineering for Shot Prediction
Effective shot quality models require thoughtful feature engineering. Raw tracking data must be transformed into meaningful predictive features.
Spatial Features
def engineer_spatial_features(shots_df):
    """
    Create spatial features from shot coordinates.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with x, y coordinates (basket at origin)

    Returns:
    --------
    DataFrame : Enhanced with spatial features
    """
    df = shots_df.copy()
    # Distance from basket
    df['distance'] = np.sqrt(df['x_coord']**2 + df['y_coord']**2)
    # Angle from basket (0 = straight on, 90/-90 = sideline)
    df['angle'] = np.degrees(np.arctan2(df['x_coord'], df['y_coord']))
    df['abs_angle'] = df['angle'].abs()
    # Is corner three (sideline three-pointer)
    df['is_corner_three'] = (
        (df['distance'] >= 22) &
        (df['distance'] < 24) &
        (df['abs_angle'] > 70)
    ).astype(int)
    # Side of court
    df['court_side'] = np.where(df['x_coord'] > 0, 'right', 'left')
    # Distance from three-point line (negative = inside arc)
    # Approximate arc distance
    three_point_distance = 23.75  # Above the break
    corner_three_distance = 22.0
    df['distance_from_arc'] = np.where(
        df['abs_angle'] > 70,
        df['distance'] - corner_three_distance,
        df['distance'] - three_point_distance
    )
    # Rim area indicator
    df['at_rim'] = (df['distance'] < 4).astype(int)
    # Paint indicator (inside the key)
    df['in_paint'] = (
        (df['distance'] < 16) &
        (df['abs_angle'] < 45)
    ).astype(int)
    return df

def engineer_defender_features(shots_df):
    """
    Create features related to defensive pressure.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with defender tracking info

    Returns:
    --------
    DataFrame : Enhanced with defender features
    """
    df = shots_df.copy()
    # Categorical contest level
    df['contest_level'] = pd.cut(
        df['defender_distance'],
        bins=[-np.inf, 2, 4, 6, np.inf],
        labels=['very_tight', 'tight', 'open', 'wide_open']
    )
    # Is heavily contested (binary)
    df['is_contested'] = (df['defender_distance'] < 4).astype(int)
    # Defender closing speed (if available)
    if 'defender_closing_speed' in df.columns:
        df['defender_closing_fast'] = (
            df['defender_closing_speed'] > 5
        ).astype(int)
    # Number of defenders in proximity (if available)
    if 'defenders_within_5ft' in df.columns:
        df['multiple_defenders'] = (
            df['defenders_within_5ft'] > 1
        ).astype(int)
    return df
Temporal Features
def engineer_temporal_features(shots_df):
    """
    Create features related to game timing and situation.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with game timing information

    Returns:
    --------
    DataFrame : Enhanced with temporal features
    """
    df = shots_df.copy()
    # Shot clock pressure
    df['shot_clock_bucket'] = pd.cut(
        df['shot_clock'],
        bins=[0, 4, 8, 15, 24],
        labels=['very_late', 'late', 'mid', 'early']
    )
    df['shot_clock_pressure'] = (df['shot_clock'] < 7).astype(int)
    # Game clock features
    if 'game_clock' in df.columns and 'period' in df.columns:
        # End of quarter (final 2 minutes)
        df['end_of_quarter'] = (
            (df['game_clock'] < 120) & (df['period'] <= 4)
        ).astype(int)
        # Clutch time (final 5 minutes, score within 5)
        if 'score_margin' in df.columns:
            df['clutch'] = (
                (df['game_clock'] < 300) &
                (df['period'] == 4) &
                (df['score_margin'].abs() <= 5)
            ).astype(int)
    # Quarter effects
    df['period_numeric'] = df['period'].clip(upper=4)  # Cap at 4 for OT
    df['is_overtime'] = (df['period'] > 4).astype(int)
    return df

def engineer_player_context_features(shots_df):
    """
    Create features related to player actions and context.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with player tracking info

    Returns:
    --------
    DataFrame : Enhanced with player context features
    """
    df = shots_df.copy()
    # Shot creation type
    df['is_catch_and_shoot'] = (df['touch_time'] < 2).astype(int)
    df['is_pull_up'] = (
        (df['dribbles'] > 0) &
        (df['touch_time'] >= 2)
    ).astype(int)
    # Dribble features
    df['dribbles_capped'] = df['dribbles'].clip(upper=10)
    df['many_dribbles'] = (df['dribbles'] >= 5).astype(int)
    # Touch time buckets
    df['touch_time_bucket'] = pd.cut(
        df['touch_time'],
        bins=[0, 2, 4, 6, np.inf],
        labels=['instant', 'quick', 'moderate', 'long']
    )
    return df
Interaction Features
def engineer_interaction_features(shots_df):
    """
    Create interaction features between variables.

    Parameters:
    -----------
    shots_df : DataFrame
        Shot data with base features

    Returns:
    --------
    DataFrame : Enhanced with interaction features
    """
    df = shots_df.copy()
    # Distance x Contest interaction
    df['distance_contest_interaction'] = (
        df['distance'] * (1 / (df['defender_distance'] + 1))
    )
    # Three-pointer x Contest
    if 'is_three' in df.columns:
        df['contested_three'] = (
            df['is_three'] * df['is_contested']
        )
        df['open_three'] = (
            df['is_three'] * (1 - df['is_contested'])
        )
    # Catch-and-shoot x Open
    df['open_catch_shoot'] = (
        df['is_catch_and_shoot'] *
        (df['defender_distance'] >= 4).astype(int)
    )
    # Late clock x Distance
    df['late_clock_distance'] = (
        df['shot_clock_pressure'] * df['distance']
    )
    # Rim shot x Contest (very important interaction)
    df['contested_rim'] = (
        df['at_rim'] * df['is_contested']
    )
    return df
16.5 Advanced Models: Gradient Boosting and Neural Networks
While logistic regression provides interpretability, gradient boosting and neural networks often achieve superior predictive performance.
Gradient Boosting Implementation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
import xgboost as xgb

def train_xgboost_shot_model(X, y, numeric_features, categorical_features):
    """
    Train an XGBoost shot quality model with hyperparameter tuning.

    Parameters:
    -----------
    X : DataFrame
        Feature matrix
    y : Series
        Target variable
    numeric_features, categorical_features : lists
        Feature names by type

    Returns:
    --------
    tuple : (best_model, preprocessor, best_params, metrics)
    """
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Create preprocessing pipeline
    preprocessor = create_preprocessing_pipeline(
        numeric_features, categorical_features
    )
    # Transform data
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    # Define parameter grid for tuning
    # (108 combinations x 5 folds: expect a long search on large data)
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.05, 0.1],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }
    # Initialize XGBoost
    # (the old use_label_encoder flag was removed in XGBoost 2.0)
    xgb_clf = xgb.XGBClassifier(
        objective='binary:logistic',
        eval_metric='auc',
        random_state=42
    )
    # Grid search with cross-validation
    grid_search = GridSearchCV(
        xgb_clf,
        param_grid,
        cv=5,
        scoring='roc_auc',
        n_jobs=-1,
        verbose=1
    )
    grid_search.fit(X_train_processed, y_train)
    # Best model
    best_model = grid_search.best_estimator_
    # Evaluate on test set
    y_prob = best_model.predict_proba(X_test_processed)[:, 1]
    metrics = {
        'roc_auc': roc_auc_score(y_test, y_prob),
        'brier_score': brier_score_loss(y_test, y_prob),
        'log_loss': log_loss(y_test, y_prob)
    }
    return best_model, preprocessor, grid_search.best_params_, metrics

def get_xgboost_feature_importance(model, feature_names):
    """
    Extract feature importance from XGBoost model.

    Parameters:
    -----------
    model : XGBClassifier
        Trained XGBoost model
    feature_names : list
        Names of features

    Returns:
    --------
    DataFrame : Feature importance rankings
    """
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    })
    importance_df = importance_df.sort_values(
        'importance', ascending=False
    ).reset_index(drop=True)
    return importance_df
Neural Network Implementation
from sklearn.neural_network import MLPClassifier

def train_neural_network_shot_model(X, y, numeric_features, categorical_features):
    """
    Train a neural network shot quality model.

    Parameters:
    -----------
    X : DataFrame
        Feature matrix
    y : Series
        Target variable
    numeric_features, categorical_features : lists
        Feature names by type

    Returns:
    --------
    tuple : (trained_model, preprocessor, metrics)
    """
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Create preprocessing pipeline
    preprocessor = create_preprocessing_pipeline(
        numeric_features, categorical_features
    )
    # Transform data
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    # Define neural network architecture
    mlp = MLPClassifier(
        hidden_layer_sizes=(128, 64, 32),  # Three hidden layers
        activation='relu',
        solver='adam',
        alpha=0.001,  # L2 regularization
        batch_size=256,
        learning_rate='adaptive',
        learning_rate_init=0.001,
        max_iter=500,
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        random_state=42,
        verbose=True
    )
    # Train model
    mlp.fit(X_train_processed, y_train)
    # Evaluate
    y_prob = mlp.predict_proba(X_test_processed)[:, 1]
    metrics = {
        'roc_auc': roc_auc_score(y_test, y_prob),
        'brier_score': brier_score_loss(y_test, y_prob),
        'log_loss': log_loss(y_test, y_prob),
        'n_iterations': mlp.n_iter_
    }
    return mlp, preprocessor, metrics
Model Comparison
def compare_shot_models(X, y, numeric_features, categorical_features):
    """
    Compare multiple shot quality models.

    Parameters:
    -----------
    X : DataFrame
        Feature matrix
    y : Series
        Target variable
    numeric_features, categorical_features : lists
        Feature names by type

    Returns:
    --------
    DataFrame : Model comparison results
    """
    # Split data once for fair comparison
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Preprocessing
    preprocessor = create_preprocessing_pipeline(
        numeric_features, categorical_features
    )
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    # Define models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, C=1.0),
        'Gradient Boosting': GradientBoostingClassifier(
            n_estimators=200, max_depth=5, learning_rate=0.05
        ),
        'XGBoost': xgb.XGBClassifier(
            n_estimators=200, max_depth=5, learning_rate=0.05,
            eval_metric='auc'  # use_label_encoder removed in XGBoost 2.0
        ),
        'Neural Network': MLPClassifier(
            hidden_layer_sizes=(128, 64), max_iter=500, early_stopping=True
        )
    }
    results = []
    for name, model in models.items():
        print(f"Training {name}...")
        model.fit(X_train_processed, y_train)
        y_prob = model.predict_proba(X_test_processed)[:, 1]
        results.append({
            'model': name,
            'roc_auc': roc_auc_score(y_test, y_prob),
            'brier_score': brier_score_loss(y_test, y_prob),
            'log_loss': log_loss(y_test, y_prob)
        })
    return pd.DataFrame(results)
16.6 Shot Creation Value
Beyond predicting whether shots go in, we must assess the value of creating shooting opportunities. Shot creation value measures how much a player contributes by generating shots for themselves or teammates.
Defining Shot Creation
def calculate_shot_creation_value(player_shots_df, model, preprocessor):
"""
Calculate shot creation value for a player.
Shot creation value = sum of (xPts for shots created - league average xPts)
Parameters:
-----------
player_shots_df : DataFrame
Shots taken by or assisted by the player
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : Shot creation metrics
"""
# Get expected make probability for each shot
X = player_shots_df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
xfg_probs = model.predict_proba(X_processed)[:, 1]
# Calculate expected points
player_shots_df = player_shots_df.copy()
player_shots_df['xfg'] = xfg_probs
player_shots_df['xpts'] = (
player_shots_df['xfg'] *
np.where(player_shots_df['is_three'], 3, 2)
)
# League average xPts per shot (baseline)
league_avg_xpts = 1.05 # Approximate value
# Shots created for self (unassisted)
self_created = player_shots_df[player_shots_df['assisted'] == False]
# Shots created for others (assists)
assists_df = player_shots_df[player_shots_df['is_assist'] == True]
metrics = {
'total_shots_created': len(player_shots_df),
'self_created_shots': len(self_created),
'assisted_shots': len(assists_df),
'avg_xpts_self_created': self_created['xpts'].mean(),
'avg_xpts_assists': assists_df['xpts'].mean() if len(assists_df) > 0 else 0,
'total_xpts_created': player_shots_df['xpts'].sum(),
'xpts_above_average': (
player_shots_df['xpts'].sum() -
len(player_shots_df) * league_avg_xpts
),
'creation_value_per_shot': (
player_shots_df['xpts'].mean() - league_avg_xpts
)
}
return metrics
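The 1.05 baseline above is hardcoded; in practice you would estimate it from the league-wide shot table. A minimal sketch, assuming a DataFrame with the same xfg and is_three columns the functions in this section use (the toy values here are illustrative, not real league data):

```python
import numpy as np
import pandas as pd

# Toy league-wide shot table; in practice this would be every shot
# in the dataset, scored by the shot quality model
league_df = pd.DataFrame({
    'xfg': [0.65, 0.40, 0.36, 0.55, 0.34],
    'is_three': [False, False, True, False, True],
})

# Expected points per shot = make probability times shot value
league_df['xpts'] = league_df['xfg'] * np.where(league_df['is_three'], 3, 2)

league_avg_xpts = league_df['xpts'].mean()
```

Recomputing this baseline per season also guards against league-wide efficiency drift silently biasing the creation-value numbers.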
def analyze_shot_creation_by_type(player_shots_df, model, preprocessor):
"""
Break down shot creation value by play type.
Parameters:
-----------
player_shots_df : DataFrame
Player's created shots with play type info
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Creation value by play type
"""
# Calculate xPts for all shots
X = player_shots_df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
player_shots_df = player_shots_df.copy()
player_shots_df['xfg'] = model.predict_proba(X_processed)[:, 1]
player_shots_df['xpts'] = (
player_shots_df['xfg'] *
np.where(player_shots_df['is_three'], 3, 2)
)
# Group by play type
creation_by_type = player_shots_df.groupby('play_type').agg(
shots=('xpts', 'count'),
total_xpts=('xpts', 'sum'),
avg_xpts=('xpts', 'mean'),
actual_pts=('points_scored', 'sum'),
fg_pct=('made', 'mean')
).round(3)
creation_by_type['pts_vs_expected'] = (
creation_by_type['actual_pts'] - creation_by_type['total_xpts']
)
return creation_by_type.sort_values('total_xpts', ascending=False)
Shot Creation Profiles
Different players create value through different means:
def create_shot_creation_profile(player_shots_df, model, preprocessor):
"""
Create a comprehensive shot creation profile for a player.
Parameters:
-----------
player_shots_df : DataFrame
All shots created by player
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : Comprehensive creation profile
"""
df = player_shots_df.copy()
# Calculate xPts
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
df['xpts'] = df['xfg'] * np.where(df['is_three'], 3, 2)
profile = {
# Volume
'total_shots': len(df),
'shots_per_game': len(df) / df['game_id'].nunique(),
# Location distribution
'rim_rate': (df['distance'] < 4).mean(),
'mid_range_rate': ((df['distance'] >= 4) & (df['distance'] < 22)).mean(),
'three_rate': df['is_three'].mean(),
# Quality
'avg_xpts': df['xpts'].mean(),
'avg_xfg': df['xfg'].mean(),
'avg_defender_dist': df['defender_distance'].mean(),
# Creation style
'catch_and_shoot_rate': (df['touch_time'] < 2).mean(),
'pull_up_rate': ((df['dribbles'] > 0) & (df['touch_time'] >= 2)).mean(),
'assisted_rate': df['assisted'].mean() if 'assisted' in df.columns else None,
# Difficulty profile
'contested_rate': (df['defender_distance'] < 4).mean(),
'avg_shot_clock': df['shot_clock'].mean(),
'late_clock_rate': (df['shot_clock'] < 7).mean(),
# Efficiency
'actual_fg_pct': df['made'].mean(),
'actual_pts': (df['made'] * np.where(df['is_three'], 3, 2)).sum(),
'expected_pts': df['xpts'].sum(),
'pts_vs_expected': None # Calculated below
}
profile['pts_vs_expected'] = profile['actual_pts'] - profile['expected_pts']
return profile
16.7 Shooting Luck vs. Skill: Regression to the Mean
One of the most important applications of shot quality models is distinguishing luck from skill in shooting performance. Players who significantly outperform their expected field goal percentage may be demonstrating exceptional skill, or they may be benefiting from positive variance that will regress.
Understanding Regression to the Mean
def analyze_shooting_luck(player_season_df, model, preprocessor):
"""
Analyze shooting luck vs skill for players.
Parameters:
-----------
player_season_df : DataFrame
Season shooting data for multiple players
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Luck analysis by player
"""
results = []
for player_id in player_season_df['player_id'].unique():
player_shots = player_season_df[
player_season_df['player_id'] == player_id
]
if len(player_shots) < 100: # Minimum sample
continue
# Calculate xFG%
X = player_shots[model.feature_names_in_]
X_processed = preprocessor.transform(X)
xfg_probs = model.predict_proba(X_processed)[:, 1]
actual_fg = player_shots['made'].mean()
expected_fg = xfg_probs.mean()
# Difference suggests luck or skill
fg_diff = actual_fg - expected_fg
results.append({
'player_id': player_id,
'player_name': player_shots['player_name'].iloc[0],
'attempts': len(player_shots),
'actual_fg_pct': actual_fg,
'expected_fg_pct': expected_fg,
'fg_diff': fg_diff,
'z_score': fg_diff / np.sqrt(
expected_fg * (1 - expected_fg) / len(player_shots)
)
})
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('fg_diff', ascending=False)
return results_df
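The z-score in analyze_shooting_luck is simply the binomial test statistic: observed minus expected, divided by the standard error implied by the model's expectation. A self-contained check of the arithmetic with round numbers rather than model output:

```python
import math

attempts = 400
actual_fg = 0.50      # observed FG%
expected_fg = 0.46    # model's xFG% over the same shots

# Standard error of FG% if the model's expectation were the truth
se = math.sqrt(expected_fg * (1 - expected_fg) / attempts)

# How many standard errors above expectation the player is shooting
z_score = (actual_fg - expected_fg) / se
```

Here z_score comes out around 1.6: notable, but well short of the 2.0 threshold used later to flag regression candidates.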
def calculate_regression_projection(current_fg, expected_fg, attempts,
regression_weight=500):
"""
Project true shooting ability with regression to expected.
Uses empirical Bayes shrinkage toward expected FG%.
Parameters:
-----------
current_fg : float
Current observed FG%
expected_fg : float
Expected FG% from shot quality model
attempts : int
Number of shot attempts
regression_weight : int
Weight given to expected (like adding 'regression_weight'
shots at expected rate)
Returns:
--------
float : Regressed estimate of true FG%
"""
# Weighted average of observed and expected
regressed_fg = (
(current_fg * attempts + expected_fg * regression_weight) /
(attempts + regression_weight)
)
return regressed_fg
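For intuition, consider a hypothetical player hitting 45% against an expected 36% through 100 attempts. Inlining the same weighted average as calculate_regression_projection (with its default weight of 500 phantom shots at the expected rate) shows how strongly small samples are pulled toward the model:

```python
def regress(current_fg, expected_fg, attempts, regression_weight=500):
    # Weighted average: real attempts at the observed rate plus
    # 'regression_weight' phantom attempts at the expected rate
    return (current_fg * attempts + expected_fg * regression_weight) / (
        attempts + regression_weight
    )

# (0.45 * 100 + 0.36 * 500) / 600 = 0.375
projection = regress(0.45, 0.36, 100)
```

A 9-point overperformance shrinks to less than 2 points; only sustained volume moves the projection meaningfully.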
def identify_regression_candidates(player_df, model, preprocessor,
threshold_z=2.0):
"""
Identify players likely to regress toward expected performance.
Parameters:
-----------
player_df : DataFrame
Player shooting data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
threshold_z : float
Z-score threshold for flagging regression candidates
Returns:
--------
tuple : (positive_regression_candidates, negative_regression_candidates)
"""
luck_analysis = analyze_shooting_luck(player_df, model, preprocessor)
# Players shooting above expected (likely to regress down)
positive_luck = luck_analysis[luck_analysis['z_score'] > threshold_z]
# Players shooting below expected (likely to regress up)
negative_luck = luck_analysis[luck_analysis['z_score'] < -threshold_z]
return positive_luck, negative_luck
Three-Point Shooting Regression
Three-point percentage is particularly noisy: make rates are lower, so a season's sample carries more variance relative to the mean, and each make or miss swings the scoreboard by three points:
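A quick way to see this is to compare the sampling error of an illustrative 36% three-point shooter with that of a 52% two-point shooter over the same 100 attempts. The absolute standard errors are similar, but relative to the make rate the three-point figure is much noisier:

```python
import math

n = 100
p_three, p_two = 0.36, 0.52   # illustrative league-ish make rates

se_three = math.sqrt(p_three * (1 - p_three) / n)   # absolute SE, ~0.048
se_two = math.sqrt(p_two * (1 - p_two) / n)         # absolute SE, ~0.050

# Relative to the make rate (and with three points at stake per shot),
# the three-point percentage carries the larger uncertainty
rel_three = se_three / p_three
rel_two = se_two / p_two
```

Over 100 attempts a true 36% shooter routinely posts anywhere from 31% to 41%, which is the difference between a poor and an elite shooting season.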
def analyze_three_point_regression(player_df, min_attempts=100):
"""
Analyze three-point shooting for regression candidates.
Parameters:
-----------
player_df : DataFrame
Player shooting data
min_attempts : int
Minimum three-point attempts for inclusion
Returns:
--------
DataFrame : Three-point regression analysis
"""
# Filter to three-pointers
threes_df = player_df[player_df['is_three'] == True]
results = []
for player_id in threes_df['player_id'].unique():
player_threes = threes_df[threes_df['player_id'] == player_id]
if len(player_threes) < min_attempts:
continue
# Calculate actual 3P%
actual_3p = player_threes['made'].mean()
# Career baseline (if available)
career_3p = player_threes['career_3p_pct'].iloc[0] if \
'career_3p_pct' in player_threes.columns else 0.36
# League average
league_avg_3p = 0.36
# Simple regression toward career/league average
attempts = len(player_threes)
regressed_3p = calculate_regression_projection(
actual_3p, career_3p, attempts, regression_weight=300
)
results.append({
'player_id': player_id,
'player_name': player_threes['player_name'].iloc[0],
'3pa': attempts,
'actual_3p_pct': actual_3p,
'career_3p_pct': career_3p,
'regressed_3p_pct': regressed_3p,
'current_vs_career': actual_3p - career_3p,
'expected_regression': regressed_3p - actual_3p
})
return pd.DataFrame(results).sort_values('current_vs_career', ascending=False)
Stabilization Points
The stabilization point is the sample size at which sampling noise and true talent differences contribute equally to observed performance — equivalently, the number of attempts at which you would regress an observed percentage halfway toward the prior:
def estimate_stabilization_point(shot_df, shot_type='all'):
"""
Estimate the stabilization point for shooting percentage.
Parameters:
-----------
shot_df : DataFrame
Historical shot data
shot_type : str
'all', 'two', or 'three'
Returns:
--------
int : Estimated stabilization point in attempts
"""
# Filter by shot type
if shot_type == 'two':
df = shot_df[shot_df['is_three'] == False]
elif shot_type == 'three':
df = shot_df[shot_df['is_three'] == True]
else:
df = shot_df
# Calculate league average
league_avg = df['made'].mean()
# Estimate variance of true shooting ability
# Using player season data
player_season = df.groupby(['player_id', 'season']).agg(
attempts=('made', 'count'),
fg_pct=('made', 'mean')
).reset_index()
# Filter for adequate sample
player_season = player_season[player_season['attempts'] >= 200]
# Observed variance = true variance + sampling variance
# Var(observed) = Var(true) + p(1-p)/n
observed_var = player_season['fg_pct'].var()
avg_attempts = player_season['attempts'].mean()
sampling_var = league_avg * (1 - league_avg) / avg_attempts
true_var = max(observed_var - sampling_var, 0.0001)
# Stabilization point formula
# At stabilization, Var(sampling) = Var(true)
stabilization = league_avg * (1 - league_avg) / true_var
return int(stabilization)
# Typical stabilization points:
# - 2-point FG%: ~400-500 attempts
# - 3-point FG%: ~700-800 attempts
# - Free throw %: ~200-300 attempts
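The stabilization point doubles as the shrinkage weight in calculate_regression_projection: with regression_weight set to the stabilization point K, an observation is regressed exactly halfway to the prior once attempts reach K. A sketch using an illustrative K of 750 for three-pointers (within the typical range quoted above):

```python
K = 750            # illustrative stabilization point for 3P%
prior = 0.36       # prior (league-average 3P%)
observed = 0.42    # observed 3P%

estimates = {}
for n in (150, 750, 3000):
    # Weight on the observation grows with sample size: n / (n + K)
    w = n / (n + K)
    estimates[n] = w * observed + (1 - w) * prior
```

At n = 750 the estimate lands midway (0.39); at 150 attempts it barely budges from the prior, and at 3000 it is dominated by the observation.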
16.8 Points Above Expected
Points Above Expected (PAE) measures how many more or fewer points a player scored compared to what the shot quality model predicted. This metric separates shooting skill from shot selection.
Calculating Points Above Expected
def calculate_points_above_expected(player_shots_df, model, preprocessor):
"""
Calculate Points Above Expected for a player's shot attempts.
Parameters:
-----------
player_shots_df : DataFrame
Player's shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : PAE metrics
"""
df = player_shots_df.copy()
# Get expected make probability
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
# Calculate expected and actual points
df['shot_value'] = np.where(df['is_three'], 3, 2)
df['xpts'] = df['xfg'] * df['shot_value']
df['actual_pts'] = df['made'] * df['shot_value']
metrics = {
'attempts': len(df),
'actual_points': df['actual_pts'].sum(),
'expected_points': df['xpts'].sum(),
'points_above_expected': df['actual_pts'].sum() - df['xpts'].sum(),
'pae_per_shot': (df['actual_pts'].sum() - df['xpts'].sum()) / len(df),
'pae_per_100_shots': 100 * (df['actual_pts'].sum() - df['xpts'].sum()) / len(df)
}
return metrics
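As a sanity check on the units, consider a hypothetical 500-attempt season with 525 expected points (1.05 xPts per shot) and 560 points actually scored:

```python
attempts = 500
expected_points = 525.0   # sum of xPts over all attempts
actual_points = 560       # points actually scored on those attempts

pae = actual_points - expected_points    # raw Points Above Expected
pae_per_100 = 100 * pae / attempts       # rate-stat version for comparison
```

That is +35 PAE, or +7 points per 100 shots — the rate version is what makes players with different shot volumes comparable on a leaderboard.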
def calculate_pae_by_zone(player_shots_df, model, preprocessor):
"""
Break down Points Above Expected by court zone.
Parameters:
-----------
player_shots_df : DataFrame
Player's shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : PAE by zone
"""
df = player_shots_df.copy()
# Get expected make probability
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
# Calculate points
df['shot_value'] = np.where(df['is_three'], 3, 2)
df['xpts'] = df['xfg'] * df['shot_value']
df['actual_pts'] = df['made'] * df['shot_value']
# Assign zones
df['zone'] = df.apply(lambda x: assign_shot_zone(x['distance'], x['angle']), axis=1)
# Aggregate by zone
zone_pae = df.groupby('zone').agg(
attempts=('xpts', 'count'),
actual_pts=('actual_pts', 'sum'),
expected_pts=('xpts', 'sum'),
actual_fg=('made', 'mean'),
expected_fg=('xfg', 'mean')
)
zone_pae['pae'] = zone_pae['actual_pts'] - zone_pae['expected_pts']
zone_pae['pae_per_shot'] = zone_pae['pae'] / zone_pae['attempts']
return zone_pae.round(3)
def assign_shot_zone(distance, angle):
"""Helper function to assign shot zone."""
abs_angle = abs(angle) if angle is not None else 0
if distance < 4:
return 'Restricted Area'
elif distance < 14:
return 'Paint'
elif distance < 22:
return 'Mid-Range'
elif distance < 24 and abs_angle > 70:
return 'Corner Three'
elif distance < 28:
return 'Above Break Three'
else:
return 'Deep Three'
League-Wide PAE Analysis
def league_pae_leaderboard(all_shots_df, model, preprocessor, min_attempts=300):
"""
Create league-wide PAE leaderboard.
Parameters:
-----------
all_shots_df : DataFrame
All shots in dataset
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
min_attempts : int
Minimum attempts for inclusion
Returns:
--------
DataFrame : PAE leaderboard
"""
results = []
for player_id in all_shots_df['player_id'].unique():
player_shots = all_shots_df[all_shots_df['player_id'] == player_id]
if len(player_shots) < min_attempts:
continue
pae_metrics = calculate_points_above_expected(
player_shots, model, preprocessor
)
pae_metrics['player_id'] = player_id
pae_metrics['player_name'] = player_shots['player_name'].iloc[0]
results.append(pae_metrics)
leaderboard = pd.DataFrame(results)
leaderboard = leaderboard.sort_values('pae_per_100_shots', ascending=False)
return leaderboard[['player_name', 'attempts', 'actual_points',
'expected_points', 'points_above_expected',
'pae_per_100_shots']]
16.9 Applications: Player Evaluation
Shot quality models provide powerful tools for player evaluation, separating skill from circumstance and shot selection from conversion ability.
Comprehensive Shooter Evaluation
def evaluate_shooter(player_shots_df, league_shots_df, model, preprocessor):
"""
Comprehensive evaluation of a player as a shooter.
Parameters:
-----------
player_shots_df : DataFrame
Player's shot data
league_shots_df : DataFrame
League-wide shot data for comparison
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : Comprehensive shooter evaluation
"""
# Get player's expected and actual performance
X_player = player_shots_df[model.feature_names_in_]
X_player_processed = preprocessor.transform(X_player)
player_xfg = model.predict_proba(X_player_processed)[:, 1]
# Calculate league averages for context
X_league = league_shots_df[model.feature_names_in_]
X_league_processed = preprocessor.transform(X_league)
league_xfg = model.predict_proba(X_league_processed)[:, 1]
player_df = player_shots_df.copy()
player_df['xfg'] = player_xfg
evaluation = {
# Volume metrics
'shot_attempts': len(player_df),
'three_point_rate': player_df['is_three'].mean(),
'rim_attempt_rate': (player_df['distance'] < 4).mean(),
# Shot quality metrics
'avg_shot_quality': player_df['xfg'].mean(),
'league_avg_shot_quality': league_xfg.mean(),
'shot_quality_vs_league': player_df['xfg'].mean() - league_xfg.mean(),
# Efficiency metrics
'actual_fg_pct': player_df['made'].mean(),
'expected_fg_pct': player_df['xfg'].mean(),
'fg_vs_expected': player_df['made'].mean() - player_df['xfg'].mean(),
# Shot difficulty
'avg_defender_distance': player_df['defender_distance'].mean(),
'contested_rate': (player_df['defender_distance'] < 4).mean(),
'self_created_rate': 1 - player_df['assisted'].mean() if 'assisted' in player_df.columns else None,
# Efficiency by zone
'rim_fg_vs_expected': None,
'three_fg_vs_expected': None,
}
# Zone-specific analysis
rim_shots = player_df[player_df['distance'] < 4]
if len(rim_shots) > 30:
evaluation['rim_fg_vs_expected'] = rim_shots['made'].mean() - rim_shots['xfg'].mean()
three_shots = player_df[player_df['is_three'] == True]
if len(three_shots) > 50:
evaluation['three_fg_vs_expected'] = three_shots['made'].mean() - three_shots['xfg'].mean()
return evaluation
def compare_shooters(players_dict, league_shots_df, model, preprocessor):
"""
Compare multiple players as shooters.
Parameters:
-----------
players_dict : dict
{player_name: player_shots_df}
league_shots_df : DataFrame
League-wide shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Comparative shooter analysis
"""
evaluations = []
for player_name, player_df in players_dict.items():
eval_result = evaluate_shooter(player_df, league_shots_df, model, preprocessor)
eval_result['player_name'] = player_name
evaluations.append(eval_result)
comparison_df = pd.DataFrame(evaluations)
comparison_df = comparison_df.set_index('player_name')
return comparison_df
Player Shooting Profile Visualization
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Rectangle, Arc
def create_shooting_profile_visualization(player_shots_df, model, preprocessor,
player_name):
"""
Create visual shooting profile for a player.
Parameters:
-----------
player_shots_df : DataFrame
Player's shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
player_name : str
Player's name for title
Returns:
--------
matplotlib figure
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
# Calculate expected values
X = player_shots_df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
player_shots_df = player_shots_df.copy()
player_shots_df['xfg'] = model.predict_proba(X_processed)[:, 1]
# Plot 1: Shot chart with actual vs expected
ax1 = axes[0, 0]
draw_court(ax1)
made = player_shots_df[player_shots_df['made'] == True]
missed = player_shots_df[player_shots_df['made'] == False]
ax1.scatter(missed['x_coord'], missed['y_coord'], c='red',
alpha=0.3, s=30, label='Missed')
ax1.scatter(made['x_coord'], made['y_coord'], c='green',
alpha=0.5, s=30, label='Made')
ax1.legend()
ax1.set_title(f'{player_name} - Shot Chart')
ax1.set_xlim(-25, 25)
ax1.set_ylim(-5, 35)
# Plot 2: FG% vs xFG% by distance
ax2 = axes[0, 1]
player_shots_df['dist_bin'] = pd.cut(player_shots_df['distance'],
bins=[0, 4, 10, 16, 22, 30])
by_distance = player_shots_df.groupby('dist_bin').agg(
actual=('made', 'mean'),
expected=('xfg', 'mean')
).reset_index()
x = range(len(by_distance))
width = 0.35
ax2.bar([i - width/2 for i in x], by_distance['actual'],
width, label='Actual FG%', color='green', alpha=0.7)
ax2.bar([i + width/2 for i in x], by_distance['expected'],
width, label='Expected FG%', color='blue', alpha=0.7)
ax2.set_xticks(x)
ax2.set_xticklabels(['Rim', 'Short', 'Mid-Short', 'Mid-Long', 'Three'])
ax2.legend()
ax2.set_title('Actual vs Expected FG% by Distance')
ax2.set_ylabel('FG%')
# Plot 3: Shot difficulty distribution
ax3 = axes[1, 0]
ax3.hist(player_shots_df['xfg'], bins=20, color='blue',
alpha=0.7, edgecolor='black')
ax3.axvline(player_shots_df['xfg'].mean(), color='red',
linestyle='--', label=f"Mean: {player_shots_df['xfg'].mean():.3f}")
ax3.legend()
ax3.set_xlabel('Expected FG%')
ax3.set_ylabel('Frequency')
ax3.set_title('Shot Difficulty Distribution')
# Plot 4: Performance by contest level
ax4 = axes[1, 1]
contest_bins = [0, 2, 4, 6, 10]
contest_labels = ['Very Tight', 'Tight', 'Open', 'Wide Open']
player_shots_df['contest'] = pd.cut(player_shots_df['defender_distance'],
bins=contest_bins, labels=contest_labels)
by_contest = player_shots_df.groupby('contest').agg(
actual=('made', 'mean'),
expected=('xfg', 'mean'),
count=('made', 'count')
).reset_index()
x = range(len(by_contest))
ax4.bar([i - width/2 for i in x], by_contest['actual'],
width, label='Actual', color='green', alpha=0.7)
ax4.bar([i + width/2 for i in x], by_contest['expected'],
width, label='Expected', color='blue', alpha=0.7)
ax4.set_xticks(x)
ax4.set_xticklabels(by_contest['contest'])
ax4.legend()
ax4.set_title('Performance by Defender Proximity')
ax4.set_ylabel('FG%')
plt.tight_layout()
return fig
def draw_court(ax, color='black', lw=2):
"""Draw basketball court on matplotlib axis."""
# Hoop
hoop = Circle((0, 0), radius=0.75, linewidth=lw, color=color, fill=False)
ax.add_patch(hoop)
# Backboard
ax.plot([-3, 3], [-0.75, -0.75], color=color, lw=lw)
# Paint
outer_box = Rectangle((-8, -0.75), 16, 19, linewidth=lw,
color=color, fill=False)
ax.add_patch(outer_box)
# Free throw circle
free_throw = Arc((0, 14.25), 12, 12, theta1=0, theta2=180,
linewidth=lw, color=color)
ax.add_patch(free_throw)
# Three-point line
ax.plot([-22, -22], [-0.75, 9], color=color, lw=lw)
ax.plot([22, 22], [-0.75, 9], color=color, lw=lw)
three_arc = Arc((0, 0), 47.5, 47.5, theta1=22, theta2=158,
linewidth=lw, color=color)
ax.add_patch(three_arc)
# Restricted area
restricted = Arc((0, 0), 8, 8, theta1=0, theta2=180,
linewidth=lw, color=color)
ax.add_patch(restricted)
ax.set_aspect('equal')
16.10 Applications: Coaching Decisions
Shot quality models inform coaching decisions on shot selection, lineup optimization, and game strategy.
Optimal Shot Selection Analysis
def analyze_shot_selection(team_shots_df, model, preprocessor):
"""
Analyze whether a team is taking optimal shots.
Parameters:
-----------
team_shots_df : DataFrame
Team's shot data
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
dict : Shot selection analysis
"""
df = team_shots_df.copy()
# Calculate expected points
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
df['xpts'] = df['xfg'] * np.where(df['is_three'], 3, 2)
# Analyze by shot type
shot_type_analysis = df.groupby('zone').agg(
attempts=('xpts', 'count'),
avg_xpts=('xpts', 'mean'),
total_xpts=('xpts', 'sum'),
actual_pts=('points_scored', 'sum')
).sort_values('avg_xpts', ascending=False)
# Calculate what optimal shot selection would look like
# (More shots from high xPts zones, fewer from low)
total_shots = len(df)
recommendations = {
'current_avg_xpts': df['xpts'].mean(),
'shot_type_breakdown': shot_type_analysis.to_dict(),
'rim_rate': (df['distance'] < 4).mean(),
'three_rate': df['is_three'].mean(),
'mid_range_rate': ((df['distance'] >= 4) & (~df['is_three'])).mean(),
}
# Simple optimization: what if 10% of mid-range became rim attempts?
mid_range_shots = df[(df['distance'] >= 10) & (df['distance'] < 22)]
rim_shots = df[df['distance'] < 4]
if len(mid_range_shots) > 0 and len(rim_shots) > 0:
mid_range_xpts = mid_range_shots['xpts'].mean()
rim_xpts = rim_shots['xpts'].mean()
# If we converted 10% of mid-range to rim attempts
shots_to_convert = len(mid_range_shots) * 0.1
xpts_lost = shots_to_convert * mid_range_xpts
xpts_gained = shots_to_convert * rim_xpts
recommendations['rim_vs_midrange_opportunity'] = xpts_gained - xpts_lost
return recommendations
def evaluate_play_type_efficiency(team_shots_df, model, preprocessor):
"""
Evaluate efficiency of different play types.
Parameters:
-----------
team_shots_df : DataFrame
Team's shot data with play type labels
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Play type efficiency analysis
"""
df = team_shots_df.copy()
# Calculate expected values
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
df['xpts'] = df['xfg'] * np.where(df['is_three'], 3, 2)
df['actual_pts'] = df['made'] * np.where(df['is_three'], 3, 2)
play_type_analysis = df.groupby('play_type').agg(
possessions=('xpts', 'count'),
avg_xpts=('xpts', 'mean'),
actual_ppp=('actual_pts', 'mean'),
fg_pct=('made', 'mean'),
xfg_pct=('xfg', 'mean'),
three_rate=('is_three', 'mean')
).round(3)
play_type_analysis['efficiency_vs_expected'] = (
play_type_analysis['actual_ppp'] - play_type_analysis['avg_xpts']
)
return play_type_analysis.sort_values('actual_ppp', ascending=False)
Lineup Shot Quality Analysis
def analyze_lineup_shot_quality(lineups_shots_df, model, preprocessor):
"""
Analyze shot quality generated by different lineups.
Parameters:
-----------
lineups_shots_df : DataFrame
Shot data with lineup identifiers
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
Returns:
--------
DataFrame : Lineup shot quality analysis
"""
df = lineups_shots_df.copy()
# Calculate expected values
X = df[model.feature_names_in_]
X_processed = preprocessor.transform(X)
df['xfg'] = model.predict_proba(X_processed)[:, 1]
df['xpts'] = df['xfg'] * np.where(df['is_three'], 3, 2)
df['actual_pts'] = df['made'] * np.where(df['is_three'], 3, 2)
lineup_analysis = df.groupby('lineup_id').agg(
minutes=('possession_time', 'sum'),
shots=('xpts', 'count'),
avg_xpts_created=('xpts', 'mean'),
actual_ppp=('actual_pts', 'mean'),
rim_rate=('at_rim', 'mean'),
three_rate=('is_three', 'mean'),
avg_shot_quality=('xfg', 'mean')
)
# Filter for minimum minutes
lineup_analysis = lineup_analysis[lineup_analysis['minutes'] >= 50]
return lineup_analysis.sort_values('avg_xpts_created', ascending=False)
End-of-Game Decision Support
def late_game_shot_analysis(game_shots_df, model, preprocessor,
seconds_remaining=24, score_margin_range=(-3, 3)):
"""
Analyze shot quality in late-game situations.
Parameters:
-----------
game_shots_df : DataFrame
Shot data with game context
model : trained model
Shot quality model
preprocessor : fitted preprocessor
Feature preprocessing pipeline
seconds_remaining : int
Define "late game" threshold
score_margin_range : tuple
Score differential range for close games
Returns:
--------
dict : Late game shot analysis
"""
df = game_shots_df.copy()
# Filter to late-game situations
late_game = df[
(df['game_clock'] <= seconds_remaining) &
(df['period'] >= 4) &
(df['score_margin'] >= score_margin_range[0]) &
(df['score_margin'] <= score_margin_range[1])
].copy()  # copy the slice so the column assignments below are safe
if len(late_game) < 50:
return {'error': 'Insufficient late-game shots for analysis'}
# Calculate expected values
X = late_game[model.feature_names_in_]
X_processed = preprocessor.transform(X)
late_game['xfg'] = model.predict_proba(X_processed)[:, 1]
late_game['xpts'] = late_game['xfg'] * np.where(late_game['is_three'], 3, 2)
analysis = {
'total_late_game_shots': len(late_game),
'avg_xpts': late_game['xpts'].mean(),
'actual_fg_pct': late_game['made'].mean(),
'expected_fg_pct': late_game['xfg'].mean(),
'three_rate': late_game['is_three'].mean(),
'rim_rate': (late_game['distance'] < 4).mean(),
'avg_defender_distance': late_game['defender_distance'].mean(),
}
# Who takes late-game shots
shooter_analysis = late_game.groupby('player_name').agg(
attempts=('xpts', 'count'),
avg_xpts=('xpts', 'mean'),
fg_pct=('made', 'mean'),
xfg_pct=('xfg', 'mean')
).sort_values('attempts', ascending=False)
analysis['primary_shooters'] = shooter_analysis.head(5).to_dict()
return analysis
16.11 Model Calibration and Validation
A well-calibrated shot quality model should produce probability estimates that match observed frequencies. If the model predicts 40% for a group of shots, approximately 40% should go in.
Calibration Analysis
from sklearn.calibration import calibration_curve
def analyze_model_calibration(y_true, y_prob, n_bins=10):
"""
Analyze calibration of shot quality model.
Parameters:
-----------
y_true : array
Actual outcomes (0/1)
y_prob : array
Predicted probabilities
n_bins : int
Number of bins for calibration curve
Returns:
--------
dict : Calibration metrics and curve data
"""
# Calculate calibration curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
# Calculate calibration metrics
calibration_error = np.abs(prob_true - prob_pred).mean()
max_calibration_error = np.abs(prob_true - prob_pred).max()
# Brier skill score (compared to predicting league average)
league_avg = y_true.mean()
brier_baseline = np.mean((y_true - league_avg)**2)
brier_model = np.mean((y_true - y_prob)**2)
brier_skill_score = 1 - (brier_model / brier_baseline)
results = {
'mean_calibration_error': calibration_error,
'max_calibration_error': max_calibration_error,
'brier_score': brier_model,
'brier_skill_score': brier_skill_score,
'calibration_curve': {
'predicted': prob_pred.tolist(),
'actual': prob_true.tolist()
}
}
return results
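The mean calibration error above can be reproduced without scikit-learn, which is a useful cross-check. A numpy-only sketch on synthetic, perfectly calibrated predictions (outcomes are drawn from the predicted probabilities themselves, so the measured error should be near zero):

```python
import numpy as np

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.2, 0.8, size=50_000)        # synthetic predictions
y_true = rng.uniform(size=y_prob.size) < y_prob    # calibrated outcomes

# Bin predictions, then compare mean prediction to observed rate per bin
bins = np.linspace(0.2, 0.8, 11)
idx = np.digitize(y_prob, bins) - 1

errors = []
for b in range(10):
    mask = idx == b
    if mask.any():
        errors.append(abs(y_true[mask].mean() - y_prob[mask].mean()))

mean_calibration_error = float(np.mean(errors))
```

Replacing y_true with a miscalibrated rule (e.g., outcomes drawn at y_prob + 0.05) makes the error jump accordingly, which is a handy smoke test for the calibration pipeline itself.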
def plot_calibration_curve(y_true, y_prob, model_name='Shot Quality Model'):
"""
Plot calibration curve for shot quality model.
Parameters:
-----------
y_true : array
Actual outcomes
y_prob : array
Predicted probabilities
model_name : str
Name for plot title
Returns:
--------
matplotlib figure
"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Calibration curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
ax1.plot([0, 1], [0, 1], 'k--', label='Perfectly Calibrated')
ax1.plot(prob_pred, prob_true, 's-', label=model_name)
ax1.set_xlabel('Mean Predicted Probability')
ax1.set_ylabel('Fraction of Positives')
ax1.set_title('Calibration Curve')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Histogram of predictions
ax2.hist(y_prob, bins=50, color='blue', alpha=0.7, edgecolor='black')
ax2.set_xlabel('Predicted Probability')
ax2.set_ylabel('Count')
ax2.set_title('Distribution of Predictions')
ax2.axvline(y_true.mean(), color='red', linestyle='--',
label=f'League Avg: {y_true.mean():.3f}')
ax2.legend()
plt.tight_layout()
return fig
Cross-Validation for Shot Models
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.pipeline import Pipeline
def cross_validate_shot_model(X, y, model, preprocessor, n_splits=5):
"""
Perform cross-validation for shot quality model.
Parameters:
-----------
X : DataFrame
Feature matrix
y : Series
Target variable
model : estimator
Model to validate
preprocessor : transformer
Preprocessing pipeline
n_splits : int
Number of CV folds
Returns:
--------
dict : Cross-validation results
"""
# Create full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', model)
])
# Stratified K-Fold
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
# Get cross-validated predictions
y_prob_cv = cross_val_predict(pipeline, X, y, cv=cv, method='predict_proba')[:, 1]
# Calculate metrics for each fold
fold_metrics = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
X_test_fold = X.iloc[test_idx]
y_test_fold = y.iloc[test_idx]
y_prob_fold = y_prob_cv[test_idx]
fold_metrics.append({
'fold': fold + 1,
'roc_auc': roc_auc_score(y_test_fold, y_prob_fold),
'brier_score': brier_score_loss(y_test_fold, y_prob_fold),
'log_loss': log_loss(y_test_fold, y_prob_fold)
})
fold_df = pd.DataFrame(fold_metrics)
results = {
'fold_results': fold_df,
'mean_roc_auc': fold_df['roc_auc'].mean(),
'std_roc_auc': fold_df['roc_auc'].std(),
'mean_brier': fold_df['brier_score'].mean(),
'mean_log_loss': fold_df['log_loss'].mean(),
'cv_predictions': y_prob_cv
}
return results
16.12 Practical Considerations and Limitations
Data Quality Issues
def validate_shot_data(shots_df):
"""
Validate shot data quality before modeling.
Parameters:
-----------
shots_df : DataFrame
Raw shot data
Returns:
--------
dict : Data quality report
"""
report = {
'total_records': len(shots_df),
'missing_values': {},
'outliers': {},
'consistency_checks': {}
}
# Check missing values
for col in shots_df.columns:
missing_pct = shots_df[col].isna().mean() * 100
if missing_pct > 0:
report['missing_values'][col] = f"{missing_pct:.2f}%"
# Check for outliers
if 'distance' in shots_df.columns:
extreme_distance = (shots_df['distance'] > 50).sum()
report['outliers']['extreme_distance'] = extreme_distance
if 'shot_clock' in shots_df.columns:
invalid_clock = (
(shots_df['shot_clock'] < 0) |
(shots_df['shot_clock'] > 24)
).sum()
report['outliers']['invalid_shot_clock'] = invalid_clock
# Consistency checks
if 'is_three' in shots_df.columns and 'distance' in shots_df.columns:
# Three-pointers should generally be >= 22 feet
inconsistent_threes = (
(shots_df['is_three'] == True) &
(shots_df['distance'] < 20)
).sum()
report['consistency_checks']['short_three_pointers'] = inconsistent_threes
if 'made' in shots_df.columns:
invalid_made = (~shots_df['made'].isin([0, 1, True, False])).sum()
report['consistency_checks']['invalid_made_values'] = invalid_made
return report
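A quick illustration of two of these checks on a toy table (the full validate_shot_data above applies the same logic across every column):

```python
import pandas as pd

# Toy shot table with deliberately bad rows
shots = pd.DataFrame({
    'shot_clock': [18.0, 3.5, -1.0, 30.0],   # two impossible clock values
    'distance': [2.0, 25.0, 15.0, 61.0],     # one implausible distance
})

# Shot clock must lie in [0, 24]
invalid_clock = ((shots['shot_clock'] < 0) | (shots['shot_clock'] > 24)).sum()

# Distances beyond 50 feet are almost certainly tracking errors
extreme_distance = (shots['distance'] > 50).sum()
```

Running these checks before training matters because tracking glitches are not random: they cluster by arena and camera setup, and can bias a model that treats them as real shots.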
Model Limitations
Shot quality models, despite their utility, have important limitations:
- Unobserved factors: No model captures everything that affects shot success:
  - Shooter's physical state (minor injuries, fatigue)
  - Defensive scheme and help-defense positioning
  - Environmental factors (arena, crowd noise)
  - Psychological factors (pressure, confidence)
- Selection bias: Players choose when to shoot:
  - Good shooters may attempt more difficult shots
  - Role players often get cleaner looks
- Tracking data limitations:
  - Defender distance is measured at release, not throughout the possession
  - Shot type classification may be imperfect
  - Player identification is occasionally incorrect
- Temporal changes:
  - Players improve or decline
  - League-wide shooting evolves over time
  - Rule changes affect shot selection
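The selection-bias point can be probed empirically. The sketch below correlates each player's actual field-goal percentage with the average predicted make probability of the shots they attempt; across many players, a negative correlation would suggest that better shooters take harder shots. The column names (`player_id`, `made`, `xfg` for a fitted model's predicted make probability) and the numbers are illustrative, not from the chapter's dataset.

```python
import pandas as pd

# Hypothetical per-shot data for two players
shots = pd.DataFrame({
    'player_id': ['A'] * 4 + ['B'] * 4,
    'made':      [1, 1, 0, 1,  0, 1, 0, 0],
    'xfg':       [0.38, 0.42, 0.35, 0.40,  0.52, 0.55, 0.50, 0.58],
})

per_player = shots.groupby('player_id').agg(
    actual_fg=('made', 'mean'),   # realized shooting percentage
    avg_xfg=('xfg', 'mean'),      # average shot difficulty faced
)
corr = per_player['actual_fg'].corr(per_player['avg_xfg'])

print(per_player)
# With only two players the correlation is degenerate; a real
# diagnostic needs hundreds of player-seasons.
print(corr)
```

If the diagnostic shows a strong relationship, player comparisons based on raw predicted probabilities should be interpreted with the selection effect in mind.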
def document_model_limitations(model_name, training_data_description):
    """
    Create documentation of model limitations.

    Parameters:
    -----------
    model_name : str
        Name of the model
    training_data_description : str
        Description of training data

    Returns:
    --------
    str : Formatted limitations documentation
    """
    limitations = f"""
MODEL LIMITATIONS: {model_name}
{'=' * 50}

Training Data: {training_data_description}

KNOWN LIMITATIONS:

1. UNOBSERVED FACTORS
   - Physical condition of shooter not captured
   - Defensive scheme and rotations not fully modeled
   - Game importance/pressure not quantified
   - Shooter confidence/momentum not measured

2. DATA QUALITY
   - Defender distance measured at release only
   - Some shot types may be misclassified
   - Tracking data has occasional errors (~1-2%)

3. SELECTION EFFECTS
   - Model assumes shots are representative
   - Better shooters may attempt harder shots
   - Does not capture counterfactual (shots not taken)

4. TEMPORAL VALIDITY
   - Model trained on specific time period
   - League-wide trends may shift
   - Individual player ability changes over time

5. CONTEXT LIMITATIONS
   - Does not model game state effects beyond basic features
   - Playoff vs regular season differences not captured
   - Back-to-back and travel effects not included

RECOMMENDED USE:
- Use for large sample analysis (100+ shots)
- Combine with other evaluation methods
- Regularly retrain on new data
- Interpret with appropriate uncertainty
"""
    return limitations
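The temporal-validity limitation suggests a concrete monitoring routine: track the gap between actual and predicted shooting percentage by season, and retrain when the gap grows. A minimal sketch with made-up numbers, assuming a per-shot prediction column `xfg` (a hypothetical name for a previously fitted model's predicted make probability):

```python
import pandas as pd

# Illustrative per-shot data spanning three seasons
shots = pd.DataFrame({
    'season': [2021] * 3 + [2022] * 3 + [2023] * 3,
    'made':   [1, 0, 1,  1, 0, 0,  1, 1, 1],
    'xfg':    [0.48, 0.45, 0.50,  0.47, 0.46, 0.49,  0.46, 0.48, 0.47],
})

drift = shots.groupby('season').agg(
    actual=('made', 'mean'),     # observed FG% that season
    predicted=('xfg', 'mean'),   # model's mean prediction
)
# A gap that grows season over season signals a stale model
drift['gap'] = drift['actual'] - drift['predicted']
print(drift)
```

In practice this check would run on full seasons of shots, and a persistent league-wide gap of even a percentage point or two is worth a retrain.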
Summary
Shot quality models represent a powerful tool for understanding basketball at a granular level. By predicting the probability of any shot going in based on contextual factors, we can:
- Evaluate shot selection: Distinguish good decisions from bad ones, independent of outcome
- Assess true shooting skill: Separate skill from luck using regression to expected performance
- Compare players fairly: Account for the difficulty of shots each player attempts
- Inform coaching decisions: Optimize shot selection and lineup construction
- Project future performance: Identify players likely to improve or decline
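The first two bullets reduce to a single comparison: points actually scored versus points expected given shot difficulty. A minimal sketch with illustrative numbers, where `xfg` again stands for a model's predicted make probability:

```python
import pandas as pd

# Four hypothetical shots: two threes, two twos
shots = pd.DataFrame({
    'made':       [1, 0, 1, 0],
    'shot_value': [3, 3, 2, 2],
    'xfg':        [0.36, 0.34, 0.55, 0.61],
})

# Expected points for a shot = make probability x shot value
shots['actual_pts']   = shots['made'] * shots['shot_value']
shots['expected_pts'] = shots['xfg'] * shots['shot_value']

# Points above expected: positive means outperforming shot quality
pae = (shots['actual_pts'] - shots['expected_pts']).sum()
print(round(pae, 2))  # 5 actual pts vs 4.42 expected -> 0.58
```

Over a handful of shots this number is mostly noise; it becomes meaningful only over the large samples recommended above.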
The key components of effective shot quality modeling include:
- Comprehensive features: Distance, defender proximity, shot clock, touch time, shot type
- Appropriate algorithms: Logistic regression for interpretability, gradient boosting for accuracy
- Careful calibration: Ensuring predicted probabilities match observed frequencies
- Proper validation: Cross-validation and out-of-sample testing
- Honest limitations: Acknowledging what the model cannot capture
As tracking data continues to improve in quality and coverage, shot quality models will become even more precise. The fundamental framework, however, remains the same: quantify the difficulty of each shot, predict the expected outcome, and measure performance against that expectation.
Chapter References
- Cervone, D., D'Amour, A., Bornn, L., & Goldsberry, K. (2016). A multiresolution stochastic process model for predicting basketball possession outcomes. Journal of the American Statistical Association, 111(514), 585-599.
- Goldsberry, K. (2019). Sprawlball: A visual tour of the new era of the NBA. Houghton Mifflin Harcourt.
- Franks, A., Miller, A., Bornn, L., & Goldsberry, K. (2015). Characterizing the spatial structure of defensive skill in professional basketball. The Annals of Applied Statistics, 9(1), 94-121.
- Skinner, B. (2010). The price of anarchy in basketball. Journal of Quantitative Analysis in Sports, 6(1).
- Chang, Y. H., Maheswaran, R., Su, J., Kwok, S., Levy, T., Wexler, A., & Squire, K. (2014). Quantifying shot quality in the NBA. In Proceedings of the MIT Sloan Sports Analytics Conference.