Chapter 26: Machine Learning in Basketball

Introduction

The intersection of machine learning and basketball analytics represents one of the most exciting frontiers in sports science. While traditional statistical methods have long served as the foundation of basketball analysis, machine learning offers capabilities that extend far beyond what conventional approaches can achieve. From identifying previously unknown player archetypes to predicting game outcomes with unprecedented accuracy, ML techniques are reshaping how teams, analysts, and researchers understand the game.

This chapter provides a comprehensive treatment of machine learning applications in basketball. We begin with fundamental concepts and progressively build toward sophisticated implementations. Throughout, we emphasize not just the "how" of these techniques but the "when" and "why"—understanding which methods are appropriate for specific problems and, crucially, when simpler approaches might be preferable.

Machine learning in basketball is not about replacing human judgment with algorithmic decision-making. Rather, it's about augmenting human expertise with computational power, discovering patterns too subtle for the naked eye, and providing decision-makers with probabilistic insights grounded in data. The most successful implementations of ML in basketball invariably combine algorithmic sophistication with deep domain knowledge.


26.1 Foundations of Machine Learning for Basketball

26.1.1 The Machine Learning Paradigm

Machine learning differs from traditional programming in a fundamental way: instead of explicitly coding rules, we provide algorithms with data and let them discover patterns. In basketball terms, rather than manually defining what makes a player a "stretch four" based on our preconceptions, we let clustering algorithms discover natural groupings in the data.

The three primary paradigms of machine learning are:

Supervised Learning: The algorithm learns from labeled examples. Given player statistics and their actual positions, it learns to predict positions for new players. Given historical game data with outcomes, it learns to predict winners.

Unsupervised Learning: The algorithm finds structure in unlabeled data. Given player statistics without position labels, it discovers natural groupings—perhaps revealing that the traditional five positions poorly represent modern playing styles.

Reinforcement Learning: The algorithm learns through trial and error, receiving rewards for good decisions. This paradigm is particularly relevant for in-game decision-making, though its basketball applications are still emerging.
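
To make the contrast between the first two paradigms concrete, the short sketch below fits a supervised classifier to predict a hypothetical guard/non-guard label and an unsupervised K-means model on the same synthetic stat lines. The data, column names, and label rule are invented purely for illustration.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic per-36 stat lines for 200 hypothetical players
players = pd.DataFrame({
    'PTS_per36': rng.normal(15, 5, 200),
    'AST_per36': rng.normal(4, 2, 200),
    'REB_per36': rng.normal(7, 3, 200),
})

# Hypothetical label: call high-assist players "guards" for illustration
players['is_guard'] = (players['AST_per36'] > 5).astype(int)

X = players[['PTS_per36', 'AST_per36', 'REB_per36']]

# Supervised: learn a mapping from stats to the provided labels
clf = LogisticRegression().fit(X, players['is_guard'])

# Unsupervised: discover groupings without using any labels
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)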

26.1.2 The Basketball ML Pipeline

Every machine learning project in basketball follows a common pipeline:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

class BasketballMLPipeline:
    """
    A standardized pipeline for basketball machine learning projects.

    This class encapsulates the common steps in basketball ML:
    data preparation, feature engineering, model training, and evaluation.
    """

    def __init__(self, model, feature_columns, target_column):
        self.model = model
        self.feature_columns = feature_columns
        self.target_column = target_column
        self.scaler = StandardScaler()
        self.is_fitted = False

    def prepare_features(self, df):
        """
        Extract and prepare features from a basketball dataframe.

        Parameters:
        -----------
        df : pd.DataFrame
            Raw basketball data with player/game statistics

        Returns:
        --------
        np.ndarray
            Prepared feature matrix
        """
        X = df[self.feature_columns].copy()

        # Handle missing values common in basketball data
        X = X.fillna(X.median())

        return X.values

    def fit(self, df):
        """
        Fit the pipeline on training data.

        Parameters:
        -----------
        df : pd.DataFrame
            Training data with features and target
        """
        X = self.prepare_features(df)
        y = df[self.target_column].values

        # Scale features
        X_scaled = self.scaler.fit_transform(X)

        # Fit model
        self.model.fit(X_scaled, y)
        self.is_fitted = True

        return self

    def predict(self, df):
        """
        Generate predictions for new data.

        Parameters:
        -----------
        df : pd.DataFrame
            New data to predict on

        Returns:
        --------
        np.ndarray
            Model predictions
        """
        if not self.is_fitted:
            raise ValueError("Pipeline must be fitted before prediction")

        X = self.prepare_features(df)
        X_scaled = self.scaler.transform(X)

        return self.model.predict(X_scaled)
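
As a usage sketch, the pipeline wraps any scikit-learn estimator; the dataframe, column names, and target below are synthetic stand-ins rather than real NBA data.

from sklearn.ensemble import RandomForestClassifier

# Synthetic player rows with illustrative column names
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    'PTS': rng.normal(12, 6, 300),
    'AST': rng.normal(3, 2, 300),
    'REB': rng.normal(5, 3, 300),
    'made_all_star': rng.integers(0, 2, 300)  # hypothetical binary target
})

train_df, test_df = train_test_split(demo_df, test_size=0.25, random_state=42)

pipeline = BasketballMLPipeline(
    model=RandomForestClassifier(n_estimators=100, random_state=42),
    feature_columns=['PTS', 'AST', 'REB'],
    target_column='made_all_star'
)
pipeline.fit(train_df)
predictions = pipeline.predict(test_df)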

26.1.3 Feature Engineering for Basketball

The quality of machine learning models depends critically on the features provided. Basketball presents unique opportunities for feature engineering:

def engineer_basketball_features(df):
    """
    Create advanced features from basic basketball statistics.

    This function demonstrates common feature engineering patterns
    for basketball data, including rate statistics, efficiency metrics,
    and composite measures.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with basic counting statistics

    Returns:
    --------
    pd.DataFrame
        DataFrame with engineered features added
    """
    df = df.copy()

    # Rate statistics (per 36 minutes normalization)
    counting_stats = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'TOV']
    for stat in counting_stats:
        if stat in df.columns and 'MP' in df.columns:
            df[f'{stat}_per36'] = (df[stat] / df['MP']) * 36

    # Efficiency metrics
    if all(col in df.columns for col in ['PTS', 'FGA', 'FTA']):
        # True Shooting Percentage
        df['TS_pct'] = df['PTS'] / (2 * (df['FGA'] + 0.44 * df['FTA']))

    if all(col in df.columns for col in ['AST', 'TOV']):
        # Assist to Turnover Ratio
        df['AST_TOV_ratio'] = df['AST'] / (df['TOV'] + 0.001)

    # Usage patterns
    if all(col in df.columns for col in ['FGA', 'FTA', 'TOV', 'MP']):
        # Approximate usage rate
        df['usage_approx'] = (df['FGA'] + 0.44 * df['FTA'] + df['TOV']) / df['MP']

    # Shooting profile
    if all(col in df.columns for col in ['FG3A', 'FGA']):
        df['three_point_rate'] = df['FG3A'] / (df['FGA'] + 0.001)

    if all(col in df.columns for col in ['FTA', 'FGA']):
        df['free_throw_rate'] = df['FTA'] / (df['FGA'] + 0.001)

    # Rebounding rates
    if all(col in df.columns for col in ['ORB', 'DRB', 'REB']):
        df['ORB_pct_of_reb'] = df['ORB'] / (df['REB'] + 0.001)

    # Versatility index (coefficient of variation of stats)
    versatility_stats = ['PTS_per36', 'REB_per36', 'AST_per36']
    available_stats = [s for s in versatility_stats if s in df.columns]
    if len(available_stats) >= 2:
        df['versatility'] = df[available_stats].std(axis=1) / (df[available_stats].mean(axis=1) + 0.001)

    return df
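
A quick check on a single toy stat line (values invented for illustration) shows the derived columns the function adds:

import pandas as pd

toy = pd.DataFrame({
    'PTS': [30], 'REB': [8], 'AST': [6], 'STL': [1], 'BLK': [1], 'TOV': [3],
    'MP': [34], 'FGA': [22], 'FTA': [8], 'FG3A': [9], 'ORB': [2], 'DRB': [6]
})

engineered = engineer_basketball_features(toy)
print(engineered[['PTS_per36', 'TS_pct', 'three_point_rate', 'usage_approx']])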

26.2 Clustering Player Types

Clustering represents one of the most natural applications of machine learning in basketball. Traditional positions—point guard, shooting guard, small forward, power forward, and center—were defined in an era of more rigid playing styles. Modern basketball's positionless revolution demands a data-driven approach to understanding player types.

26.2.1 K-Means Clustering

K-means is the workhorse of clustering algorithms, prized for its simplicity and interpretability. The algorithm partitions players into k clusters by minimizing within-cluster variance.
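
Concretely, given clusters C_1, ..., C_k with centroids mu_1, ..., mu_k, K-means seeks the assignment that minimizes the within-cluster sum of squares

\[
\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\]

the quantity scikit-learn exposes as inertia_ and which the elbow method below tracks as k increases.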

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

class PlayerClusteringAnalysis:
    """
    Comprehensive player clustering using K-means.

    This class provides methods for determining optimal cluster count,
    fitting clusters, and interpreting results in basketball terms.
    """

    def __init__(self, feature_columns):
        self.feature_columns = feature_columns
        self.scaler = StandardScaler()
        self.kmeans = None
        self.cluster_centers_original = None

    def find_optimal_k(self, df, k_range=range(2, 15)):
        """
        Use the elbow method and silhouette scores to find optimal k.

        Parameters:
        -----------
        df : pd.DataFrame
            Player statistics dataframe
        k_range : range
            Range of k values to test

        Returns:
        --------
        dict
            Dictionary with inertia and silhouette scores for each k
        """
        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        X_scaled = self.scaler.fit_transform(X)

        results = {'k': [], 'inertia': [], 'silhouette': []}

        for k in k_range:
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            labels = kmeans.fit_predict(X_scaled)

            results['k'].append(k)
            results['inertia'].append(kmeans.inertia_)
            results['silhouette'].append(silhouette_score(X_scaled, labels))

        return results

    def fit_clusters(self, df, n_clusters):
        """
        Fit K-means clustering with specified number of clusters.

        Parameters:
        -----------
        df : pd.DataFrame
            Player statistics dataframe
        n_clusters : int
            Number of clusters to create

        Returns:
        --------
        np.ndarray
            Cluster labels for each player
        """
        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        X_scaled = self.scaler.fit_transform(X)

        self.kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        labels = self.kmeans.fit_predict(X_scaled)

        # Store cluster centers in original scale for interpretation
        self.cluster_centers_original = self.scaler.inverse_transform(
            self.kmeans.cluster_centers_
        )

        return labels

    def interpret_clusters(self, df, labels):
        """
        Generate interpretable descriptions of each cluster.

        Parameters:
        -----------
        df : pd.DataFrame
            Original dataframe with player data
        labels : np.ndarray
            Cluster assignments

        Returns:
        --------
        pd.DataFrame
            Summary statistics for each cluster
        """
        df_clustered = df.copy()
        df_clustered['cluster'] = labels

        # Calculate mean statistics for each cluster
        cluster_summary = df_clustered.groupby('cluster')[self.feature_columns].mean()

        # Add cluster sizes
        cluster_summary['n_players'] = df_clustered.groupby('cluster').size()

        return cluster_summary

    def get_cluster_exemplars(self, df, labels, n_exemplars=3):
        """
        Find players closest to each cluster center.

        These exemplar players serve as prototypes for understanding
        what each cluster represents.

        Parameters:
        -----------
        df : pd.DataFrame
            Player dataframe with 'Player' column
        labels : np.ndarray
            Cluster assignments
        n_exemplars : int
            Number of exemplar players per cluster

        Returns:
        --------
        dict
            Dictionary mapping cluster ID to list of exemplar players
        """
        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        X_scaled = self.scaler.transform(X)

        exemplars = {}

        for cluster_id in range(self.kmeans.n_clusters):
            # Get indices of players in this cluster
            cluster_mask = labels == cluster_id
            cluster_indices = np.where(cluster_mask)[0]

            # Calculate distances to cluster center
            center = self.kmeans.cluster_centers_[cluster_id]
            distances = np.linalg.norm(
                X_scaled[cluster_mask] - center, axis=1
            )

            # Get closest players
            closest_idx = np.argsort(distances)[:n_exemplars]
            exemplar_indices = cluster_indices[closest_idx]

            if 'Player' in df.columns:
                exemplars[cluster_id] = df.iloc[exemplar_indices]['Player'].tolist()
            else:
                exemplars[cluster_id] = exemplar_indices.tolist()

        return exemplars


# Example usage with realistic basketball features
def cluster_nba_players(player_df):
    """
    Complete workflow for clustering NBA players.

    Parameters:
    -----------
    player_df : pd.DataFrame
        DataFrame with player statistics including:
        PTS_per36, REB_per36, AST_per36, STL_per36, BLK_per36,
        TS_pct, USG_pct, three_point_rate, AST_pct

    Returns:
    --------
    tuple
        (labels, cluster_summary, exemplars)
    """
    # Features that capture playing style
    style_features = [
        'PTS_per36', 'REB_per36', 'AST_per36', 'STL_per36', 'BLK_per36',
        'TS_pct', 'USG_pct', 'three_point_rate', 'AST_pct'
    ]

    # Filter to available features
    available_features = [f for f in style_features if f in player_df.columns]

    # Initialize analyzer
    analyzer = PlayerClusteringAnalysis(available_features)

    # Find optimal k
    k_results = analyzer.find_optimal_k(player_df)

    # Based on elbow and silhouette, typically 7-10 clusters work well
    # for capturing modern NBA playing styles
    optimal_k = k_results['k'][np.argmax(k_results['silhouette'])]

    # Fit final clusters
    labels = analyzer.fit_clusters(player_df, n_clusters=optimal_k)

    # Interpret results
    summary = analyzer.interpret_clusters(player_df, labels)
    exemplars = analyzer.get_cluster_exemplars(player_df, labels)

    return labels, summary, exemplars

26.2.2 Hierarchical Clustering

While K-means requires pre-specifying the number of clusters, hierarchical clustering builds a tree structure (dendrogram) that reveals clustering at multiple levels of granularity. This is particularly valuable in basketball, where we might want to examine both broad categories (perimeter vs. interior players) and fine-grained distinctions (scoring guards vs. playmaking guards).

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist

class HierarchicalPlayerClustering:
    """
    Hierarchical clustering for basketball player analysis.

    This approach reveals the nested structure of player types,
    allowing analysis at multiple levels of specificity.
    """

    def __init__(self, feature_columns, linkage_method='ward'):
        """
        Initialize hierarchical clustering.

        Parameters:
        -----------
        feature_columns : list
            Column names to use as features
        linkage_method : str
            Linkage criterion: 'ward', 'complete', 'average', 'single'
            Ward's method typically works well for player clustering
        """
        self.feature_columns = feature_columns
        self.linkage_method = linkage_method
        self.scaler = StandardScaler()
        self.linkage_matrix = None

    def fit(self, df):
        """
        Compute hierarchical clustering structure.

        Parameters:
        -----------
        df : pd.DataFrame
            Player statistics dataframe

        Returns:
        --------
        self
        """
        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        X_scaled = self.scaler.fit_transform(X)

        # Compute linkage matrix
        self.linkage_matrix = linkage(X_scaled, method=self.linkage_method)

        return self

    def plot_dendrogram(self, df, figsize=(15, 8), truncate_mode='level', p=5):
        """
        Visualize the hierarchical clustering structure.

        Parameters:
        -----------
        df : pd.DataFrame
            DataFrame with 'Player' column for labels
        figsize : tuple
            Figure size
        truncate_mode : str
            How to truncate dendrogram for display
        p : int
            Truncation parameter
        """
        plt.figure(figsize=figsize)

        labels = df['Player'].tolist() if 'Player' in df.columns else None

        dendrogram(
            self.linkage_matrix,
            labels=labels,
            truncate_mode=truncate_mode,
            p=p,
            leaf_rotation=90,
            leaf_font_size=8
        )

        plt.title('Hierarchical Clustering of NBA Players')
        plt.xlabel('Player')
        plt.ylabel('Distance')
        plt.tight_layout()

        return plt.gcf()

    def get_clusters_at_level(self, n_clusters):
        """
        Extract flat clusters at a specified granularity.

        Parameters:
        -----------
        n_clusters : int
            Desired number of clusters

        Returns:
        --------
        np.ndarray
            Cluster labels
        """
        return fcluster(self.linkage_matrix, n_clusters, criterion='maxclust')

    def analyze_hierarchy(self, df, levels=[3, 6, 10]):
        """
        Analyze clustering at multiple hierarchical levels.

        This reveals how player types split as we increase granularity.

        Parameters:
        -----------
        df : pd.DataFrame
            Player dataframe
        levels : list
            Number of clusters at each level to analyze

        Returns:
        --------
        dict
            Nested analysis at each level
        """
        results = {}

        for n in levels:
            labels = self.get_clusters_at_level(n)

            df_temp = df.copy()
            df_temp['cluster'] = labels

            # Summarize each cluster
            level_summary = {}
            for cluster_id in range(1, n + 1):
                cluster_players = df_temp[df_temp['cluster'] == cluster_id]

                level_summary[cluster_id] = {
                    'size': len(cluster_players),
                    'avg_stats': cluster_players[self.feature_columns].mean().to_dict(),
                    'exemplars': cluster_players.nlargest(3, 'MP')['Player'].tolist()
                        if 'Player' in cluster_players.columns and 'MP' in cluster_players.columns
                        else []
                }

            results[f'{n}_clusters'] = level_summary

        return results
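
A minimal usage sketch, assuming a player_df with the per-36 and efficiency columns engineered earlier in the chapter:

h_clusterer = HierarchicalPlayerClustering(
    feature_columns=['PTS_per36', 'REB_per36', 'AST_per36', 'BLK_per36', 'TS_pct'],
    linkage_method='ward'
)
h_clusterer.fit(player_df)

# Broad split (roughly perimeter vs. interior vs. hybrid groups)
broad_labels = h_clusterer.get_clusters_at_level(3)

# Finer-grained cut of the same hierarchy
fine_labels = h_clusterer.get_clusters_at_level(10)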

26.2.3 Interpreting Player Clusters

The challenge with clustering is moving from mathematical groupings to basketball-meaningful archetypes. Here's a framework for interpretation:

def interpret_player_clusters(cluster_summary, feature_columns):
    """
    Generate human-readable interpretations of player clusters.

    This function compares each cluster to the overall mean to identify
    distinguishing characteristics, then maps these to basketball concepts.

    Parameters:
    -----------
    cluster_summary : pd.DataFrame
        Summary statistics for each cluster
    feature_columns : list
        Features used in clustering

    Returns:
    --------
    dict
        Dictionary mapping cluster ID to descriptive archetype
    """
    # Calculate z-scores relative to overall mean
    overall_mean = cluster_summary[feature_columns].mean()
    overall_std = cluster_summary[feature_columns].std()

    archetypes = {}

    for cluster_id in cluster_summary.index:
        cluster_stats = cluster_summary.loc[cluster_id, feature_columns]
        z_scores = (cluster_stats - overall_mean) / (overall_std + 0.001)

        # Identify standout characteristics (|z| > 1)
        standout_high = z_scores[z_scores > 1].sort_values(ascending=False)
        standout_low = z_scores[z_scores < -1].sort_values()

        # Map to archetype descriptions
        archetype_traits = []

        # Check for common archetypes
        if 'AST_per36' in standout_high.index and z_scores['AST_per36'] > 1.5:
            archetype_traits.append('Playmaker')
        if 'PTS_per36' in standout_high.index and z_scores['PTS_per36'] > 1.5:
            archetype_traits.append('Scorer')
        if 'three_point_rate' in standout_high.index:
            archetype_traits.append('Perimeter-Oriented')
        if 'BLK_per36' in standout_high.index and z_scores['BLK_per36'] > 1:
            archetype_traits.append('Rim Protector')
        if 'REB_per36' in standout_high.index and z_scores['REB_per36'] > 1.5:
            archetype_traits.append('Rebounder')
        if 'STL_per36' in standout_high.index:
            archetype_traits.append('Ball Hawk')

        # Check for role indicators
        if 'USG_pct' in z_scores.index:
            if z_scores['USG_pct'] < -1:
                archetype_traits.append('Role Player')
            elif z_scores['USG_pct'] > 1:
                archetype_traits.append('High Usage')

        # Combine into archetype name
        if archetype_traits:
            archetype = ' / '.join(archetype_traits[:3])  # Top 3 traits
        else:
            archetype = f'Cluster {cluster_id}'

        archetypes[cluster_id] = {
            'name': archetype,
            'standout_high': standout_high.head(3).to_dict(),
            'standout_low': standout_low.head(3).to_dict(),
            'size': cluster_summary.loc[cluster_id, 'n_players']
        }

    return archetypes

26.3 Classification Problems in Basketball

Classification—predicting categorical outcomes—has numerous applications in basketball: predicting game winners, identifying future All-Stars, classifying shot types, and more.

26.3.1 Binary Classification: Predicting Game Outcomes

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix
)

class GameOutcomePredictor:
    """
    Predict game outcomes using team statistics.

    This class demonstrates binary classification for basketball,
    including proper handling of home/away asymmetry and
    appropriate evaluation metrics.
    """

    def __init__(self, model_type='logistic'):
        """
        Initialize predictor with chosen model type.

        Parameters:
        -----------
        model_type : str
            'logistic' or 'random_forest'
        """
        if model_type == 'logistic':
            self.model = LogisticRegression(random_state=42, max_iter=1000)
        else:
            self.model = RandomForestClassifier(
                n_estimators=100, random_state=42, n_jobs=-1
            )

        self.scaler = StandardScaler()
        self.feature_columns = None

    def prepare_game_features(self, games_df, team_stats_df):
        """
        Create features for game outcome prediction.

        Parameters:
        -----------
        games_df : pd.DataFrame
            Game data with home_team, away_team, home_win columns
        team_stats_df : pd.DataFrame
            Team statistics indexed by team name

        Returns:
        --------
        pd.DataFrame
            Prepared feature dataframe
        """
        features = []

        for _, game in games_df.iterrows():
            home_team = game['home_team']
            away_team = game['away_team']

            # Get team statistics
            home_stats = team_stats_df.loc[home_team]
            away_stats = team_stats_df.loc[away_team]

            # Create differential features
            game_features = {
                'home_win': game['home_win']
            }

            # Point differential expectation
            for stat in ['OFF_RTG', 'DEF_RTG', 'NET_RTG', 'PACE']:
                if stat in team_stats_df.columns:
                    game_features[f'{stat}_diff'] = home_stats[stat] - away_stats[stat]

            # Home court advantage is implicit in the differential
            # but we can add interaction terms
            game_features['is_back_to_back_home'] = game.get('home_b2b', 0)
            game_features['is_back_to_back_away'] = game.get('away_b2b', 0)

            features.append(game_features)

        return pd.DataFrame(features)

    def fit(self, X, y):
        """
        Fit the classifier.

        Parameters:
        -----------
        X : pd.DataFrame or np.ndarray
            Feature matrix
        y : np.ndarray
            Binary outcome labels
        """
        self.feature_columns = X.columns.tolist() if hasattr(X, 'columns') else None
        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)

        return self

    def predict(self, X):
        """Generate predictions."""
        X_scaled = self.scaler.transform(X)
        return self.model.predict(X_scaled)

    def predict_proba(self, X):
        """Generate probability predictions."""
        X_scaled = self.scaler.transform(X)
        return self.model.predict_proba(X_scaled)

    def evaluate(self, X, y, threshold=0.5):
        """
        Comprehensive evaluation of classification performance.

        Parameters:
        -----------
        X : pd.DataFrame or np.ndarray
            Feature matrix
        y : np.ndarray
            True labels
        threshold : float
            Classification threshold

        Returns:
        --------
        dict
            Dictionary of evaluation metrics
        """
        y_proba = self.predict_proba(X)[:, 1]
        y_pred = (y_proba >= threshold).astype(int)

        return {
            'accuracy': accuracy_score(y, y_pred),
            'precision': precision_score(y, y_pred),
            'recall': recall_score(y, y_pred),
            'f1': f1_score(y, y_pred),
            'roc_auc': roc_auc_score(y, y_proba),
            'confusion_matrix': confusion_matrix(y, y_pred)
        }
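
A sketch of the full train/evaluate loop, assuming a games_df with home_team, away_team, and home_win columns and a team_stats_df of per-team ratings (both names are placeholders for your own data sources):

from sklearn.model_selection import train_test_split

predictor = GameOutcomePredictor(model_type='logistic')

# Build differential features from the two input frames
feature_df = predictor.prepare_game_features(games_df, team_stats_df)

X = feature_df.drop(columns=['home_win'])
y = feature_df['home_win'].values

# In practice, prefer a chronological split so future games never leak
# into the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

predictor.fit(X_train, y_train)
metrics = predictor.evaluate(X_test, y_test)
print(metrics['accuracy'], metrics['roc_auc'])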

26.3.2 Multi-Class Classification: Player Roles

Predicting player positions or roles represents a multi-class classification problem:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder

class PlayerRoleClassifier:
    """
    Classify players into roles based on statistical profile.

    This can use traditional positions or data-driven archetypes.
    """

    def __init__(self, feature_columns, model=None):
        self.feature_columns = feature_columns
        self.model = model or RandomForestClassifier(
            n_estimators=100, random_state=42, n_jobs=-1
        )
        self.scaler = StandardScaler()
        self.label_encoder = LabelEncoder()

    def fit(self, df, role_column='position'):
        """
        Train the role classifier.

        Parameters:
        -----------
        df : pd.DataFrame
            Player data with statistics and role labels
        role_column : str
            Column containing role/position labels
        """
        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        y = self.label_encoder.fit_transform(df[role_column])

        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)

        return self

    def predict(self, df):
        """Predict roles for new players."""
        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        X_scaled = self.scaler.transform(X)

        predictions_encoded = self.model.predict(X_scaled)
        return self.label_encoder.inverse_transform(predictions_encoded)

    def predict_proba_all_roles(self, df):
        """
        Get probability distribution over all roles.

        Returns:
        --------
        pd.DataFrame
            Probabilities for each role
        """
        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        X_scaled = self.scaler.transform(X)

        probas = self.model.predict_proba(X_scaled)

        return pd.DataFrame(
            probas,
            columns=self.label_encoder.classes_,
            index=df.index
        )

    def analyze_misclassifications(self, df, role_column='position'):
        """
        Analyze where the classifier makes mistakes.

        This reveals which roles are most easily confused.

        Returns:
        --------
        pd.DataFrame
            Confusion matrix as DataFrame with role labels
        """
        from sklearn.metrics import confusion_matrix

        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        y_true = self.label_encoder.transform(df[role_column])

        X_scaled = self.scaler.transform(X)
        y_pred = self.model.predict(X_scaled)

        cm = confusion_matrix(y_true, y_pred)

        return pd.DataFrame(
            cm,
            index=self.label_encoder.classes_,
            columns=self.label_encoder.classes_
        )

26.4 Random Forests and Gradient Boosting

Tree-based ensemble methods have become workhorses of sports analytics. They handle non-linear relationships, require minimal preprocessing, and provide interpretable feature importance measures.

26.4.1 Random Forests for Basketball

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.inspection import permutation_importance

class BasketballRandomForest:
    """
    Random Forest implementation tailored for basketball analytics.

    Includes methods for feature importance analysis and
    partial dependence exploration.
    """

    def __init__(self, task='classification', n_estimators=100, max_depth=None):
        """
        Initialize Random Forest for basketball analysis.

        Parameters:
        -----------
        task : str
            'classification' or 'regression'
        n_estimators : int
            Number of trees in the forest
        max_depth : int or None
            Maximum depth of trees (None for unlimited)
        """
        if task == 'classification':
            self.model = RandomForestClassifier(
                n_estimators=n_estimators,
                max_depth=max_depth,
                random_state=42,
                n_jobs=-1,
                oob_score=True  # Out-of-bag evaluation
            )
        else:
            self.model = RandomForestRegressor(
                n_estimators=n_estimators,
                max_depth=max_depth,
                random_state=42,
                n_jobs=-1,
                oob_score=True
            )

        self.feature_columns = None
        self.task = task

    def fit(self, X, y):
        """Fit the random forest."""
        self.feature_columns = X.columns.tolist() if hasattr(X, 'columns') else None
        self.model.fit(X, y)
        return self

    def get_feature_importance(self):
        """
        Get impurity-based (Gini) feature importance scores.

        These scores are fast to compute but can be biased toward
        high-cardinality features; see get_permutation_importance
        for a slower, more reliable alternative.

        Returns:
        --------
        pd.Series
            Feature importance scores, sorted descending
        """
        importance = self.model.feature_importances_

        if self.feature_columns:
            importance_series = pd.Series(importance, index=self.feature_columns)
        else:
            importance_series = pd.Series(importance)

        return importance_series.sort_values(ascending=False)

    def get_permutation_importance(self, X, y, n_repeats=10):
        """
        Calculate permutation importance (more reliable but slower).

        Permutation importance measures how much model performance
        decreases when a feature's values are randomly shuffled.

        Parameters:
        -----------
        X : pd.DataFrame or np.ndarray
            Feature matrix
        y : np.ndarray
            Target values
        n_repeats : int
            Number of times to repeat the permutation

        Returns:
        --------
        pd.DataFrame
            Importance scores with mean and std
        """
        result = permutation_importance(
            self.model, X, y,
            n_repeats=n_repeats,
            random_state=42,
            n_jobs=-1
        )

        importance_df = pd.DataFrame({
            'mean_importance': result.importances_mean,
            'std_importance': result.importances_std
        }, index=self.feature_columns)

        return importance_df.sort_values('mean_importance', ascending=False)

    def analyze_tree_paths(self, X, sample_idx=0):
        """
        Analyze decision paths for a specific sample.

        This helps understand why the model made a particular prediction.

        Parameters:
        -----------
        X : pd.DataFrame or np.ndarray
            Feature matrix
        sample_idx : int
            Index of sample to analyze

        Returns:
        --------
        list
            List of (feature, threshold, direction) tuples for first tree
        """
        sample = X.iloc[[sample_idx]] if hasattr(X, 'iloc') else X[[sample_idx]]

        # Get decision path through first tree
        tree = self.model.estimators_[0]
        node_indicator = tree.decision_path(sample)

        # Extract path
        feature_names = self.feature_columns or [f'feature_{i}' for i in range(X.shape[1])]

        path = []
        nodes = node_indicator.indices

        for node_id in nodes[:-1]:  # Exclude leaf
            feature_id = tree.tree_.feature[node_id]
            threshold = tree.tree_.threshold[node_id]

            if hasattr(sample, 'iloc'):
                value = sample.iloc[0, feature_id]
            else:
                value = sample[0, feature_id]

            direction = 'left (<=)' if value <= threshold else 'right (>)'

            path.append({
                'feature': feature_names[feature_id],
                'threshold': threshold,
                'sample_value': value,
                'direction': direction
            })

        return path
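
Comparing the two importance measures is usually the first interpretability step after fitting; the sketch below assumes X_train/y_train and X_test/y_test are feature matrices and targets prepared earlier (placeholder names):

rf = BasketballRandomForest(task='classification', n_estimators=200)
rf.fit(X_train, y_train)

# Fast, impurity-based ranking (can favor high-cardinality features)
print(rf.get_feature_importance().head(10))

# Slower but more reliable permutation ranking, computed on held-out data
print(rf.get_permutation_importance(X_test, y_test).head(10))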

26.4.2 Gradient Boosting for Basketball

Gradient boosting often achieves higher accuracy than random forests, though it requires more careful tuning:

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
import xgboost as xgb

class BasketballGradientBoosting:
    """
    Gradient Boosting implementation for basketball analytics.

    Supports both scikit-learn's implementation and XGBoost.
    """

    def __init__(self, task='classification', use_xgboost=True, **kwargs):
        """
        Initialize gradient boosting model.

        Parameters:
        -----------
        task : str
            'classification' or 'regression'
        use_xgboost : bool
            Whether to use XGBoost (recommended for performance)
        **kwargs : dict
            Additional parameters for the model
        """
        self.task = task
        self.use_xgboost = use_xgboost

        default_params = {
            'n_estimators': 100,
            'learning_rate': 0.1,
            'max_depth': 5,
            'random_state': 42
        }
        default_params.update(kwargs)

        if use_xgboost:
            if task == 'classification':
                # use_label_encoder is deprecated/removed in recent XGBoost,
                # so labels are passed through unchanged
                self.model = xgb.XGBClassifier(**default_params)
            else:
                self.model = xgb.XGBRegressor(**default_params)
        else:
            if task == 'classification':
                self.model = GradientBoostingClassifier(**default_params)
            else:
                self.model = GradientBoostingRegressor(**default_params)

        self.feature_columns = None

    def fit(self, X, y, eval_set=None, early_stopping_rounds=None):
        """
        Fit the gradient boosting model.

        Parameters:
        -----------
        X : pd.DataFrame or np.ndarray
            Training features
        y : np.ndarray
            Training target
        eval_set : list of tuples, optional
            Validation set for early stopping (XGBoost only)
        early_stopping_rounds : int, optional
            Stop if no improvement for this many rounds
        """
        self.feature_columns = X.columns.tolist() if hasattr(X, 'columns') else None

        if self.use_xgboost and eval_set is not None:
            # Newer XGBoost releases expect early_stopping_rounds on the
            # estimator rather than as a fit() argument
            if early_stopping_rounds is not None:
                self.model.set_params(early_stopping_rounds=early_stopping_rounds)
            self.model.fit(X, y, eval_set=eval_set, verbose=False)
        else:
            self.model.fit(X, y)

        return self

    def tune_hyperparameters(self, X, y, cv=5, param_grid=None):
        """
        Perform hyperparameter tuning using cross-validation.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        y : np.ndarray
            Target values
        cv : int
            Number of cross-validation folds
        param_grid : dict, optional
            Parameter grid to search

        Returns:
        --------
        dict
            Best parameters and cross-validation scores
        """
        from sklearn.model_selection import GridSearchCV

        if param_grid is None:
            param_grid = {
                'n_estimators': [50, 100, 200],
                'learning_rate': [0.01, 0.1, 0.2],
                'max_depth': [3, 5, 7]
            }

        grid_search = GridSearchCV(
            self.model, param_grid,
            cv=cv,
            scoring='accuracy' if self.task == 'classification' else 'neg_mean_squared_error',
            n_jobs=-1
        )

        grid_search.fit(X, y)

        return {
            'best_params': grid_search.best_params_,
            'best_score': grid_search.best_score_,
            'cv_results': pd.DataFrame(grid_search.cv_results_)
        }

    def get_feature_importance(self):
        """Get feature importance from the model."""
        # Both XGBoost and scikit-learn expose feature_importances_
        importance = self.model.feature_importances_

        if self.feature_columns:
            return pd.Series(importance, index=self.feature_columns).sort_values(ascending=False)
        return pd.Series(importance).sort_values(ascending=False)
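
A usage sketch with early stopping against a held-out validation set; exact early-stopping behavior differs across XGBoost releases, so treat this wiring as an assumption to verify against your installed version (X and y are placeholder feature/target arrays):

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

gbm = BasketballGradientBoosting(
    task='classification',
    use_xgboost=True,
    n_estimators=500,
    learning_rate=0.05
)
gbm.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=25)

print(gbm.get_feature_importance().head(10))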

26.5 Neural Networks for Basketball Applications

Neural networks excel at capturing complex, non-linear relationships in data. While they require more data and computational resources than simpler methods, they're increasingly important for advanced basketball applications.

26.5.1 Basic Neural Network Implementation

from sklearn.neural_network import MLPClassifier, MLPRegressor
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class BasketballNeuralNetwork:
    """
    Neural network implementation for basketball analytics.

    Provides both scikit-learn (simple) and Keras (flexible) implementations.
    """

    def __init__(self, task='classification', use_keras=False,
                 hidden_layers=(64, 32), input_dim=None):
        """
        Initialize neural network.

        Parameters:
        -----------
        task : str
            'classification' or 'regression'
        use_keras : bool
            Whether to use Keras (vs scikit-learn)
        hidden_layers : tuple
            Number of neurons in each hidden layer
        input_dim : int
            Input dimension (required for Keras)
        """
        self.task = task
        self.use_keras = use_keras
        self.scaler = StandardScaler()

        if use_keras:
            self.model = self._build_keras_model(hidden_layers, input_dim, task)
        else:
            if task == 'classification':
                self.model = MLPClassifier(
                    hidden_layer_sizes=hidden_layers,
                    activation='relu',
                    solver='adam',
                    max_iter=500,
                    random_state=42,
                    early_stopping=True,
                    validation_fraction=0.1
                )
            else:
                self.model = MLPRegressor(
                    hidden_layer_sizes=hidden_layers,
                    activation='relu',
                    solver='adam',
                    max_iter=500,
                    random_state=42,
                    early_stopping=True
                )

    def _build_keras_model(self, hidden_layers, input_dim, task):
        """
        Build a Keras neural network.

        Parameters:
        -----------
        hidden_layers : tuple
            Neurons per hidden layer
        input_dim : int
            Number of input features
        task : str
            'classification' or 'regression'

        Returns:
        --------
        keras.Model
            Compiled Keras model
        """
        model = keras.Sequential()

        # Input layer
        model.add(layers.InputLayer(input_shape=(input_dim,)))

        # Hidden layers with dropout for regularization
        for i, units in enumerate(hidden_layers):
            model.add(layers.Dense(units, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(0.3))

        # Output layer
        if task == 'classification':
            model.add(layers.Dense(1, activation='sigmoid'))
            model.compile(
                optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy']
            )
        else:
            model.add(layers.Dense(1))
            model.compile(
                optimizer='adam',
                loss='mse',
                metrics=['mae']
            )

        return model

    def fit(self, X, y, epochs=100, batch_size=32, validation_split=0.2):
        """
        Train the neural network.

        Parameters:
        -----------
        X : pd.DataFrame or np.ndarray
            Training features
        y : np.ndarray
            Training target
        epochs : int
            Number of training epochs (Keras only)
        batch_size : int
            Batch size (Keras only)
        validation_split : float
            Fraction for validation (Keras only)
        """
        X_scaled = self.scaler.fit_transform(X)

        if self.use_keras:
            # Early stopping callback
            early_stop = keras.callbacks.EarlyStopping(
                monitor='val_loss',
                patience=10,
                restore_best_weights=True
            )

            self.history = self.model.fit(
                X_scaled, y,
                epochs=epochs,
                batch_size=batch_size,
                validation_split=validation_split,
                callbacks=[early_stop],
                verbose=0
            )
        else:
            self.model.fit(X_scaled, y)

        return self

    def predict(self, X):
        """Generate predictions."""
        X_scaled = self.scaler.transform(X)

        if self.use_keras:
            predictions = self.model.predict(X_scaled, verbose=0)
            if self.task == 'classification':
                return (predictions > 0.5).astype(int).flatten()
            return predictions.flatten()

        return self.model.predict(X_scaled)

    def predict_proba(self, X):
        """Get prediction probabilities (classification only)."""
        X_scaled = self.scaler.transform(X)

        if self.use_keras:
            return self.model.predict(X_scaled, verbose=0)

        return self.model.predict_proba(X_scaled)
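
A sketch of the Keras path, assuming X_train/y_train and X_test are placeholder feature matrices and a binary target; the scikit-learn path is identical except that the Keras-only training arguments are ignored:

nn = BasketballNeuralNetwork(
    task='classification',
    use_keras=True,
    hidden_layers=(64, 32),
    input_dim=X_train.shape[1]
)
nn.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2)

probabilities = nn.predict_proba(X_test)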


class PlayerTrajectoryNetwork:
    """
    Recurrent neural network for player trajectory/career prediction.

    This handles sequential data like season-by-season statistics
    to predict future performance.
    """

    def __init__(self, sequence_length, n_features, prediction_type='regression'):
        """
        Initialize trajectory prediction network.

        Parameters:
        -----------
        sequence_length : int
            Number of past seasons to consider
        n_features : int
            Number of statistical features per season
        prediction_type : str
            'regression' or 'classification'
        """
        self.sequence_length = sequence_length
        self.n_features = n_features
        self.prediction_type = prediction_type
        self.scaler = StandardScaler()

        self.model = self._build_model()

    def _build_model(self):
        """Build LSTM model for trajectory prediction."""
        model = keras.Sequential([
            layers.LSTM(64, return_sequences=True,
                       input_shape=(self.sequence_length, self.n_features)),
            layers.Dropout(0.3),
            layers.LSTM(32),
            layers.Dropout(0.3),
            layers.Dense(16, activation='relu'),
            layers.Dense(1, activation='sigmoid' if self.prediction_type == 'classification' else None)
        ])

        loss = 'binary_crossentropy' if self.prediction_type == 'classification' else 'mse'
        model.compile(optimizer='adam', loss=loss)

        return model

    def prepare_sequences(self, player_seasons_df, target_column):
        """
        Prepare sequential data from player seasons.

        Parameters:
        -----------
        player_seasons_df : pd.DataFrame
            DataFrame with player_id, season, and statistics
        target_column : str
            Column to predict

        Returns:
        --------
        tuple
            (X_sequences, y_targets)
        """
        sequences = []
        targets = []

        for player_id in player_seasons_df['player_id'].unique():
            player_data = player_seasons_df[
                player_seasons_df['player_id'] == player_id
            ].sort_values('season')

            feature_cols = [c for c in player_data.columns
                          if c not in ['player_id', 'season', target_column]]

            # Create sequences
            for i in range(len(player_data) - self.sequence_length):
                seq = player_data.iloc[i:i + self.sequence_length][feature_cols].values
                target = player_data.iloc[i + self.sequence_length][target_column]

                sequences.append(seq)
                targets.append(target)

        return np.array(sequences), np.array(targets)

    def fit(self, X_sequences, y_targets, epochs=50, batch_size=32):
        """Train the trajectory model."""
        # Scale features
        original_shape = X_sequences.shape
        X_flat = X_sequences.reshape(-1, X_sequences.shape[-1])
        X_scaled = self.scaler.fit_transform(X_flat)
        X_sequences_scaled = X_scaled.reshape(original_shape)

        self.model.fit(
            X_sequences_scaled, y_targets,
            epochs=epochs,
            batch_size=batch_size,
            validation_split=0.2,
            verbose=0
        )

        return self

26.6 Ensemble Methods

Ensemble methods combine multiple models to achieve better performance than any single model. In basketball analytics, ensembles are particularly valuable for reducing prediction variance.

26.6.1 Building Effective Ensembles

from sklearn.ensemble import VotingClassifier, VotingRegressor, StackingClassifier
from sklearn.linear_model import LogisticRegression, Ridge

class BasketballEnsemble:
    """
    Ensemble methods for basketball prediction.

    Combines multiple models to achieve robust predictions.
    """

    def __init__(self, task='classification', ensemble_type='voting'):
        """
        Initialize ensemble.

        Parameters:
        -----------
        task : str
            'classification' or 'regression'
        ensemble_type : str
            'voting' or 'stacking'
        """
        self.task = task
        self.ensemble_type = ensemble_type
        self.base_models = self._create_base_models()
        self.ensemble = self._create_ensemble()

    def _create_base_models(self):
        """Create diverse base models."""
        if self.task == 'classification':
            return [
                ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
                ('lr', LogisticRegression(random_state=42, max_iter=1000)),
                ('mlp', MLPClassifier(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500))
            ]
        else:
            return [
                ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
                ('gb', GradientBoostingRegressor(n_estimators=100, random_state=42)),
                ('ridge', Ridge(random_state=42)),
                ('mlp', MLPRegressor(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500))
            ]

    def _create_ensemble(self):
        """Create the ensemble model."""
        if self.ensemble_type == 'voting':
            if self.task == 'classification':
                return VotingClassifier(
                    estimators=self.base_models,
                    voting='soft'  # Use probability averaging
                )
            else:
                return VotingRegressor(estimators=self.base_models)

        else:  # stacking
            if self.task == 'classification':
                return StackingClassifier(
                    estimators=self.base_models,
                    final_estimator=LogisticRegression(random_state=42),
                    cv=5
                )
            else:
                from sklearn.ensemble import StackingRegressor
                return StackingRegressor(
                    estimators=self.base_models,
                    final_estimator=Ridge(random_state=42),
                    cv=5
                )

    def fit(self, X, y):
        """Fit the ensemble."""
        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X)
        self.ensemble.fit(X_scaled, y)
        return self

    def predict(self, X):
        """Generate predictions."""
        X_scaled = self.scaler.transform(X)
        return self.ensemble.predict(X_scaled)

    def get_model_contributions(self, X, y):
        """
        Analyze individual model contributions.

        Returns:
        --------
        pd.DataFrame
            Performance metrics for each base model and the ensemble
        """
        from sklearn.metrics import r2_score

        X_scaled = self.scaler.transform(X)

        results = []
        # Use the fitted copies stored on the ensemble, not the unfitted
        # base estimators passed in at construction time
        for name, model in self.ensemble.named_estimators_.items():
            y_pred = model.predict(X_scaled)

            if self.task == 'classification':
                score = accuracy_score(y, y_pred)
            else:
                score = r2_score(y, y_pred)

            results.append({'model': name, 'score': score})

        # Add ensemble performance
        y_ensemble = self.ensemble.predict(X_scaled)
        if self.task == 'classification':
            ensemble_score = accuracy_score(y, y_ensemble)
        else:
            ensemble_score = r2_score(y, y_ensemble)

        results.append({'model': 'ensemble', 'score': ensemble_score})

        return pd.DataFrame(results)
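
A quick comparison of the soft-voting ensemble against its fitted components, again assuming X_train/y_train and X_test/y_test placeholders:

ensemble = BasketballEnsemble(task='classification', ensemble_type='voting')
ensemble.fit(X_train, y_train)

# How much does the ensemble gain over its strongest base model?
print(ensemble.get_model_contributions(X_test, y_test))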

26.6.2 Model Blending for Basketball

class BasketballModelBlender:
    """
    Blend predictions from multiple models using learned weights.

    This approach learns optimal weights for combining model predictions
    based on validation performance.
    """

    def __init__(self, models, task='classification'):
        """
        Initialize blender.

        Parameters:
        -----------
        models : list
            List of (name, model) tuples
        task : str
            'classification' or 'regression'
        """
        self.models = models
        self.task = task
        self.weights = None
        self.scalers = {}

    def fit(self, X_train, y_train, X_val, y_val):
        """
        Fit base models and learn blending weights.

        Parameters:
        -----------
        X_train : pd.DataFrame
            Training features
        y_train : np.ndarray
            Training target
        X_val : pd.DataFrame
            Validation features
        y_val : np.ndarray
            Validation target
        """
        # Fit each model
        for name, model in self.models:
            self.scalers[name] = StandardScaler()
            X_train_scaled = self.scalers[name].fit_transform(X_train)
            model.fit(X_train_scaled, y_train)

        # Get validation predictions
        val_predictions = []
        for name, model in self.models:
            X_val_scaled = self.scalers[name].transform(X_val)
            if self.task == 'classification' and hasattr(model, 'predict_proba'):
                pred = model.predict_proba(X_val_scaled)[:, 1]
            else:
                pred = model.predict(X_val_scaled)
            val_predictions.append(pred)

        # Stack predictions
        val_predictions = np.column_stack(val_predictions)

        # Learn optimal weights (constrained optimization)
        from scipy.optimize import minimize

        def loss_function(weights):
            weights = weights / weights.sum()  # Normalize
            blended = np.dot(val_predictions, weights)

            if self.task == 'classification':
                # Log loss
                eps = 1e-7
                blended = np.clip(blended, eps, 1 - eps)
                return -np.mean(y_val * np.log(blended) + (1 - y_val) * np.log(1 - blended))
            else:
                # MSE
                return np.mean((y_val - blended) ** 2)

        # Initial weights (equal)
        initial_weights = np.ones(len(self.models)) / len(self.models)

        # Bounds (non-negative weights)
        bounds = [(0, 1) for _ in range(len(self.models))]

        result = minimize(loss_function, initial_weights, bounds=bounds, method='SLSQP')
        self.weights = result.x / result.x.sum()

        return self

    def predict(self, X):
        """Generate blended predictions."""
        predictions = []
        for name, model in self.models:
            X_scaled = self.scalers[name].transform(X)
            if self.task == 'classification' and hasattr(model, 'predict_proba'):
                pred = model.predict_proba(X_scaled)[:, 1]
            else:
                pred = model.predict(X_scaled)
            predictions.append(pred)

        predictions = np.column_stack(predictions)
        blended = np.dot(predictions, self.weights)

        if self.task == 'classification':
            return (blended > 0.5).astype(int)
        return blended

    def get_weights_summary(self):
        """Get model weights summary."""
        return pd.DataFrame({
            'model': [name for name, _ in self.models],
            'weight': self.weights
        }).sort_values('weight', ascending=False)
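
Blending needs a separate validation split on which the weights are learned; a sketch with two hypothetical base models and placeholder train/validation splits:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

blender = BasketballModelBlender(
    models=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000, random_state=42))
    ],
    task='classification'
)
blender.fit(X_train, y_train, X_val, y_val)
print(blender.get_weights_summary())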

26.7 Feature Importance and Model Interpretability

Understanding why models make predictions is crucial in basketball analytics, where stakeholders need to trust and act on model outputs.

26.7.1 Comprehensive Feature Importance Analysis

class BasketballModelInterpreter:
    """
    Tools for interpreting machine learning models in basketball context.

    Provides multiple methods for understanding feature importance
    and model behavior.
    """

    def __init__(self, model, feature_columns):
        self.model = model
        self.feature_columns = feature_columns

    def gini_importance(self):
        """
        Get Gini (impurity-based) feature importance.

        Note: This is fast but can be biased toward high-cardinality features.
        """
        if hasattr(self.model, 'feature_importances_'):
            return pd.Series(
                self.model.feature_importances_,
                index=self.feature_columns
            ).sort_values(ascending=False)
        else:
            raise ValueError("Model does not have feature_importances_ attribute")

    def permutation_importance(self, X, y, n_repeats=10):
        """
        Calculate permutation importance.

        More reliable than Gini importance but slower.
        """
        result = permutation_importance(
            self.model, X, y,
            n_repeats=n_repeats,
            random_state=42,
            n_jobs=-1
        )

        return pd.DataFrame({
            'importance_mean': result.importances_mean,
            'importance_std': result.importances_std
        }, index=self.feature_columns).sort_values('importance_mean', ascending=False)

    def shap_analysis(self, X, sample_size=100):
        """
        SHAP (SHapley Additive exPlanations) analysis.

        Provides both global feature importance and local explanations.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        sample_size : int
            Number of samples to use for SHAP calculation

        Returns:
        --------
        dict
            SHAP values and summary
        """
        import shap

        # Sample for efficiency
        if len(X) > sample_size:
            X_sample = X.sample(sample_size, random_state=42)
        else:
            X_sample = X

        # Create explainer based on model type
        if hasattr(self.model, 'estimators_'):
            # Tree-based model
            explainer = shap.TreeExplainer(self.model)
        else:
            # Use kernel explainer as fallback
            explainer = shap.KernelExplainer(
                self.model.predict,
                shap.sample(X, 50)
            )

        shap_values = explainer.shap_values(X_sample)

        # Handle multi-output (classification with multiple classes)
        if isinstance(shap_values, list):
            shap_values = shap_values[1]  # Use positive class for binary

        # Global importance (mean absolute SHAP values)
        global_importance = pd.Series(
            np.abs(shap_values).mean(axis=0),
            index=self.feature_columns
        ).sort_values(ascending=False)

        return {
            'shap_values': shap_values,
            'global_importance': global_importance,
            'X_sample': X_sample
        }

    def partial_dependence(self, X, feature, grid_resolution=50):
        """
        Calculate partial dependence for a feature.

        Shows the marginal effect of a feature on predictions.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        feature : str
            Feature name to analyze
        grid_resolution : int
            Number of points in the grid

        Returns:
        --------
        dict
            Feature values and corresponding average predictions
        """
        from sklearn.inspection import partial_dependence as pd_sklearn

        feature_idx = self.feature_columns.index(feature)

        result = pd_sklearn(
            self.model, X,
            features=[feature_idx],
            grid_resolution=grid_resolution
        )

        # Newer scikit-learn releases expose the grid under 'grid_values' instead of 'values'
        grid_key = 'grid_values' if 'grid_values' in result else 'values'

        return {
            'values': result[grid_key][0],
            'predictions': result['average'][0]
        }

    def explain_prediction(self, X_single, top_n=5):
        """
        Explain a single prediction.

        Parameters:
        -----------
        X_single : pd.DataFrame
            Single-row feature dataframe
        top_n : int
            Number of top contributing features to show

        Returns:
        --------
        dict
            Prediction explanation
        """
        import shap

        prediction = self.model.predict(X_single)[0]

        if hasattr(self.model, 'predict_proba'):
            proba = self.model.predict_proba(X_single)[0]
        else:
            proba = None

        # Get SHAP values for this prediction
        if hasattr(self.model, 'estimators_'):
            explainer = shap.TreeExplainer(self.model)
        else:
            # No TreeExplainer available for this model; return the prediction without attributions
            return {
                'prediction': prediction,
                'probability': proba,
                'note': 'SHAP analysis not available for this model type'
            }

        shap_values = explainer.shap_values(X_single)

        if isinstance(shap_values, list):
            shap_values = shap_values[1]  # Binary classification

        # Get top contributors
        shap_series = pd.Series(shap_values[0], index=self.feature_columns)
        top_positive = shap_series.nlargest(top_n)
        top_negative = shap_series.nsmallest(top_n)

        return {
            'prediction': prediction,
            'probability': proba,
            'top_positive_contributors': top_positive.to_dict(),
            'top_negative_contributors': top_negative.to_dict(),
            'feature_values': X_single.iloc[0].to_dict()
        }
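
As a quick illustration, the sketch below fits a random forest to a small synthetic win-prediction dataset and runs the interpreter's impurity-based and permutation importance methods. The feature names and data are assumptions made purely for demonstration; the shap_analysis and explain_prediction methods follow the same pattern but additionally require the optional shap package.

# Illustrative sketch: interpreting a random forest on synthetic win-prediction data.
# Feature names and data below are hypothetical, not drawn from a real dataset.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
features = ['net_rating_diff', 'rest_days_diff', 'home_court', 'pace_diff']
X_demo = pd.DataFrame(rng.normal(size=(500, len(features))), columns=features)
y_demo = (X_demo['net_rating_diff'] + 0.5 * X_demo['home_court']
          + rng.normal(scale=0.7, size=500) > 0).astype(int).values

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_demo, y_demo)
interpreter = BasketballModelInterpreter(rf, feature_columns=features)

print(interpreter.gini_importance())                       # fast, impurity-based ranking
print(interpreter.permutation_importance(X_demo, y_demo))  # slower, usually more reliable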

26.8 Cross-Validation Strategies for Basketball Data

Basketball data presents unique challenges for cross-validation due to its temporal nature and hierarchical structure.

26.8.1 Time-Aware Cross-Validation

from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

class BasketballCrossValidator:
    """
    Cross-validation strategies appropriate for basketball data.

    Handles temporal dependencies, seasonal structure, and
    the need to predict future based on past.
    """

    def __init__(self, cv_type='time_series'):
        """
        Initialize cross-validator.

        Parameters:
        -----------
        cv_type : str
            'time_series', 'season_aware', or 'stratified'
        """
        self.cv_type = cv_type

    def time_series_cv(self, X, y, model, n_splits=5, test_size=None):
        """
        Time series cross-validation.

        Ensures training data always precedes test data temporally.
        Essential for game prediction and season-level analysis.

        Parameters:
        -----------
        X : pd.DataFrame
            Features (should be sorted by time)
        y : np.ndarray
            Target
        model : estimator
            Scikit-learn compatible model
        n_splits : int
            Number of splits
        test_size : int, optional
            Fixed test size for each split

        Returns:
        --------
        dict
            Cross-validation results
        """
        tscv = TimeSeriesSplit(n_splits=n_splits, test_size=test_size)

        scores = []
        predictions_all = []

        for train_idx, test_idx in tscv.split(X):
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            # Scale features
            scaler = StandardScaler()
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)

            # Fit and predict
            model_clone = clone(model)
            model_clone.fit(X_train_scaled, y_train)
            y_pred = model_clone.predict(X_test_scaled)

            # Score
            if hasattr(model, 'predict_proba'):
                score = accuracy_score(y_test, y_pred)
            else:
                from sklearn.metrics import r2_score
                score = r2_score(y_test, y_pred)

            scores.append(score)
            predictions_all.extend(zip(test_idx, y_pred, y_test))

        return {
            'scores': scores,
            'mean_score': np.mean(scores),
            'std_score': np.std(scores),
            'predictions': predictions_all
        }

    def season_aware_cv(self, X, y, season_column, model, min_train_seasons=3):
        """
        Cross-validation respecting season boundaries.

        Trains on complete seasons and tests on subsequent seasons.

        Parameters:
        -----------
        X : pd.DataFrame
            Features with season column
        y : np.ndarray
            Target
        season_column : str
            Name of column containing season identifier
        model : estimator
            Model to evaluate
        min_train_seasons : int
            Minimum number of training seasons

        Returns:
        --------
        dict
            Season-by-season performance
        """
        seasons = sorted(X[season_column].unique())
        results = []

        for i in range(min_train_seasons, len(seasons)):
            # Train on all previous seasons
            train_seasons = seasons[:i]
            test_season = seasons[i]

            train_mask = X[season_column].isin(train_seasons)
            test_mask = X[season_column] == test_season

            X_train = X[train_mask].drop(columns=[season_column])
            X_test = X[test_mask].drop(columns=[season_column])
            y_train = y[train_mask]
            y_test = y[test_mask]

            # Scale and fit
            scaler = StandardScaler()
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)

            model_clone = clone(model)
            model_clone.fit(X_train_scaled, y_train)
            y_pred = model_clone.predict(X_test_scaled)

            # Score
            if hasattr(model, 'predict_proba'):
                score = accuracy_score(y_test, y_pred)
            else:
                score = np.mean((y_test - y_pred) ** 2)

            results.append({
                'test_season': test_season,
                'train_seasons': train_seasons,
                'score': score,
                'n_train': len(y_train),
                'n_test': len(y_test)
            })

        return {
            'results': results,
            'mean_score': np.mean([r['score'] for r in results])
        }

    def grouped_cv(self, X, y, group_column, model, n_splits=5):
        """
        Cross-validation where groups (e.g., teams, players) don't leak.

        Ensures that all data for a group is in either train or test,
        never split between them.

        Parameters:
        -----------
        X : pd.DataFrame
            Features with group column
        y : np.ndarray
            Target
        group_column : str
            Column defining groups
        model : estimator
            Model to evaluate
        n_splits : int
            Number of CV folds
        """
        from sklearn.model_selection import GroupKFold

        groups = X[group_column]
        X_features = X.drop(columns=[group_column])

        gkf = GroupKFold(n_splits=n_splits)

        scores = []

        for train_idx, test_idx in gkf.split(X_features, y, groups):
            X_train, X_test = X_features.iloc[train_idx], X_features.iloc[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            scaler = StandardScaler()
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)

            model_clone = clone(model)
            model_clone.fit(X_train_scaled, y_train)
            y_pred = model_clone.predict(X_test_scaled)

            if hasattr(model, 'predict_proba'):
                score = accuracy_score(y_test, y_pred)
            else:
                from sklearn.metrics import r2_score
                score = r2_score(y_test, y_pred)

            scores.append(score)

        return {
            'scores': scores,
            'mean_score': np.mean(scores),
            'std_score': np.std(scores)
        }
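
To make the interface concrete, here is a minimal sketch of time-aware evaluation on a chronologically ordered game table. The frame and its column names are assumptions for illustration; any feature matrix sorted by date will do.

# Minimal sketch: evaluating a classifier with time-ordered splits.
# game_df and its columns are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_games = 600
game_df = pd.DataFrame({
    'net_rating_diff': rng.normal(size=n_games),
    'rest_days_diff': rng.integers(-3, 4, size=n_games),
    'home_court': rng.integers(0, 2, size=n_games),
})  # rows are assumed to be in chronological order
home_win = (game_df['net_rating_diff'] + 0.2 * game_df['home_court']
            + rng.normal(scale=0.8, size=n_games) > 0).astype(int).values

cv = BasketballCrossValidator(cv_type='time_series')
results = cv.time_series_cv(
    X=game_df, y=home_win,
    model=RandomForestClassifier(n_estimators=100, random_state=42),
    n_splits=5
)
print(f"Mean accuracy: {results['mean_score']:.3f} (+/- {results['std_score']:.3f})")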



26.9 Handling Time-Series Aspects

Basketball data is inherently temporal. Proper handling of time dependencies is crucial for valid predictions.

26.9.1 Feature Engineering for Time Series

class TemporalBasketballFeatures:
    """
    Create time-aware features for basketball data.

    Handles rolling averages, trends, and recency weighting.
    """

    def __init__(self, date_column='date', entity_column='team'):
        self.date_column = date_column
        self.entity_column = entity_column

    def add_rolling_features(self, df, stat_columns, windows=[5, 10, 20]):
        """
        Add rolling average features.

        Parameters:
        -----------
        df : pd.DataFrame
            Data sorted by date
        stat_columns : list
            Columns to create rolling averages for
        windows : list
            Rolling window sizes

        Returns:
        --------
        pd.DataFrame
            DataFrame with rolling features added
        """
        df = df.sort_values([self.entity_column, self.date_column])

        for window in windows:
            for stat in stat_columns:
                col_name = f'{stat}_rolling_{window}'
                df[col_name] = df.groupby(self.entity_column)[stat].transform(
                    lambda x: x.shift(1).rolling(window, min_periods=1).mean()
                )

        return df

    def add_trend_features(self, df, stat_columns, window=10):
        """
        Add trend (slope) features.

        Captures whether a statistic is improving or declining.

        Parameters:
        -----------
        df : pd.DataFrame
            Sorted data
        stat_columns : list
            Columns to calculate trends for
        window : int
            Window for trend calculation

        Returns:
        --------
        pd.DataFrame
            DataFrame with trend features
        """
        from scipy.stats import linregress

        def calculate_trend(series):
            if len(series) < 3:
                return 0
            x = np.arange(len(series))
            try:
                slope, _, _, _, _ = linregress(x, series)
                return slope
            except Exception:
                return 0

        df = df.sort_values([self.entity_column, self.date_column])

        for stat in stat_columns:
            col_name = f'{stat}_trend_{window}'
            df[col_name] = df.groupby(self.entity_column)[stat].transform(
                lambda x: x.shift(1).rolling(window, min_periods=3).apply(calculate_trend)
            )

        return df

    def add_recency_weighted_features(self, df, stat_columns, half_life=5):
        """
        Add exponentially weighted features.

        More recent games count more heavily.

        Parameters:
        -----------
        df : pd.DataFrame
            Sorted data
        stat_columns : list
            Columns to weight
        half_life : int
            Number of games for weight to decay by half
        """
        df = df.sort_values([self.entity_column, self.date_column])

        for stat in stat_columns:
            col_name = f'{stat}_ewm_{half_life}'
            df[col_name] = df.groupby(self.entity_column)[stat].transform(
                lambda x: x.shift(1).ewm(halflife=half_life, min_periods=1).mean()
            )

        return df

    def add_rest_days(self, df):
        """Add feature for days of rest since last game."""
        df = df.sort_values([self.entity_column, self.date_column])

        df['rest_days'] = df.groupby(self.entity_column)[self.date_column].transform(
            lambda x: x.diff().dt.days
        )
        df['rest_days'] = df['rest_days'].fillna(3)  # Assume 3 days for first game

        # Cap at reasonable maximum
        df['rest_days'] = df['rest_days'].clip(upper=10)

        return df

    def add_schedule_features(self, df):
        """Add features related to schedule (back-to-backs, etc.)."""
        df = df.sort_values([self.entity_column, self.date_column])

        # Back-to-back indicator (requires add_rest_days to have been called first)
        df['is_back_to_back'] = (df['rest_days'] == 1).astype(int)

        # Games in the last 7 days (trailing window, including the current game)
        df['games_last_7_days'] = df.groupby(self.entity_column)[self.date_column].transform(
            lambda x: pd.Series(1, index=x).rolling('7D').count().values
        )

        return df
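
The sketch below applies the rolling, recency-weighted, rest-day, and schedule features to a small synthetic game log. The column names (team, date, points) are illustrative assumptions; adapt them to your own schema.

# Illustrative sketch: building temporal features from a synthetic game log.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
game_log = pd.DataFrame({
    'team': ['BOS'] * 20 + ['DEN'] * 20,
    'date': list(pd.date_range('2024-01-01', periods=20, freq='2D')) * 2,
    'points': rng.normal(112, 10, size=40).round(1),
})

temporal = TemporalBasketballFeatures(date_column='date', entity_column='team')
game_log = temporal.add_rolling_features(game_log, stat_columns=['points'], windows=[5, 10])
game_log = temporal.add_recency_weighted_features(game_log, stat_columns=['points'], half_life=5)
game_log = temporal.add_rest_days(game_log)
game_log = temporal.add_schedule_features(game_log)

print(game_log[['team', 'date', 'points', 'points_rolling_5',
                'rest_days', 'games_last_7_days']].tail())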

26.9.2 Time Series Models for Basketball

class BasketballTimeSeriesModel:
    """
    Time series modeling for basketball predictions.

    Combines traditional time series methods with ML approaches.
    """

    def __init__(self, model_type='arima'):
        """
        Initialize time series model.

        Parameters:
        -----------
        model_type : str
            'arima', 'prophet', or 'lstm'
        """
        self.model_type = model_type
        self.models = {}

    def fit_arima(self, series, order=(1, 1, 1)):
        """
        Fit ARIMA model to a series.

        Parameters:
        -----------
        series : pd.Series
            Time series data
        order : tuple
            (p, d, q) order for ARIMA
        """
        from statsmodels.tsa.arima.model import ARIMA

        model = ARIMA(series, order=order)
        fitted = model.fit()

        return fitted

    def fit_by_entity(self, df, entity_column, target_column, date_column):
        """
        Fit separate models for each entity (team/player).

        Parameters:
        -----------
        df : pd.DataFrame
            Data with entity, date, and target columns
        entity_column : str
            Column identifying entities
        target_column : str
            Target variable to predict
        date_column : str
            Date column
        """
        for entity in df[entity_column].unique():
            entity_data = df[df[entity_column] == entity].sort_values(date_column)
            series = entity_data.set_index(date_column)[target_column]

            try:
                self.models[entity] = self.fit_arima(series)
            except Exception as e:
                print(f"Could not fit model for {entity}: {e}")

        return self

    def forecast(self, entity, steps=5):
        """
        Generate forecast for an entity.

        Parameters:
        -----------
        entity : str
            Entity to forecast
        steps : int
            Number of steps ahead to forecast

        Returns:
        --------
        np.ndarray
            Forecasted values
        """
        if entity not in self.models:
            raise ValueError(f"No model fitted for {entity}")

        return self.models[entity].forecast(steps=steps)
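
As a sketch of the per-entity workflow, the example below fits an ARIMA(1, 1, 1) model to each team's scoring series and forecasts the next three games. The synthetic frame is an assumption for illustration, and statsmodels must be installed.

# Minimal sketch: per-team ARIMA forecasts on synthetic scoring data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
scores = pd.DataFrame({
    'team': ['LAL'] * 30 + ['MIL'] * 30,
    'date': list(pd.date_range('2024-01-01', periods=30, freq='2D')) * 2,
    'points': np.concatenate([rng.normal(115, 8, size=30),
                              rng.normal(118, 9, size=30)]).round(1),
})

ts_model = BasketballTimeSeriesModel(model_type='arima')
ts_model.fit_by_entity(scores, entity_column='team', target_column='points', date_column='date')

print(ts_model.forecast('LAL', steps=3))  # next three projected point totals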

26.10 When ML Beats Simple Models (and When It Doesn't)

Perhaps the most important skill in applied machine learning is knowing when to use it. Complex models are not always superior.

26.10.1 Comparing Simple and Complex Approaches

class ModelComplexityAnalyzer:
    """
    Compare simple and complex models for basketball problems.

    Helps determine when sophisticated ML is warranted.
    """

    def __init__(self):
        self.results = []

    def compare_models(self, X, y, cv_splits=5, task='classification'):
        """
        Compare models of varying complexity.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        y : np.ndarray
            Target values
        cv_splits : int
            Number of cross-validation splits
        task : str
            'classification' or 'regression'

        Returns:
        --------
        pd.DataFrame
            Comparison of model performance
        """
        from sklearn.model_selection import cross_val_score
        from sklearn.dummy import DummyClassifier, DummyRegressor
        from sklearn.linear_model import LogisticRegression, Ridge

        if task == 'classification':
            models = [
                ('Baseline (majority)', DummyClassifier(strategy='most_frequent')),
                ('Logistic Regression', LogisticRegression(random_state=42, max_iter=1000)),
                ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('Gradient Boosting', GradientBoostingClassifier(n_estimators=100, random_state=42)),
                ('Neural Network', MLPClassifier(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500))
            ]
            scoring = 'accuracy'
        else:
            models = [
                ('Baseline (mean)', DummyRegressor(strategy='mean')),
                ('Linear Regression', Ridge(random_state=42)),
                ('Random Forest', RandomForestRegressor(n_estimators=100, random_state=42)),
                ('Gradient Boosting', GradientBoostingRegressor(n_estimators=100, random_state=42)),
                ('Neural Network', MLPRegressor(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500))
            ]
            scoring = 'r2'

        # Scale features
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)

        results = []

        for name, model in models:
            try:
                scores = cross_val_score(model, X_scaled, y, cv=cv_splits, scoring=scoring)
                results.append({
                    'model': name,
                    'mean_score': scores.mean(),
                    'std_score': scores.std(),
                    'complexity': self._estimate_complexity(model)
                })
            except Exception as e:
                print(f"Error with {name}: {e}")

        return pd.DataFrame(results).sort_values('mean_score', ascending=False)

    def _estimate_complexity(self, model):
        """Estimate model complexity for comparison."""
        if 'Dummy' in str(type(model)):
            return 1
        elif 'Logistic' in str(type(model)) or 'Ridge' in str(type(model)):
            return 2
        elif 'RandomForest' in str(type(model)):
            return 3
        elif 'GradientBoosting' in str(type(model)):
            return 4
        elif 'MLP' in str(type(model)):
            return 5
        return 3

    def learning_curve_analysis(self, X, y, model, train_sizes=None):
        """
        Analyze how model performance changes with data size.

        Helps identify if more data would help or if we're overfitting.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        y : np.ndarray
            Target
        model : estimator
            Model to analyze
        train_sizes : list, optional
            Fractions of training data to use

        Returns:
        --------
        dict
            Learning curve results
        """
        from sklearn.model_selection import learning_curve

        if train_sizes is None:
            train_sizes = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]

        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)

        train_sizes_abs, train_scores, test_scores = learning_curve(
            model, X_scaled, y,
            train_sizes=train_sizes,
            cv=5,
            n_jobs=-1,
            random_state=42
        )

        return {
            'train_sizes': train_sizes_abs,
            'train_scores_mean': train_scores.mean(axis=1),
            'train_scores_std': train_scores.std(axis=1),
            'test_scores_mean': test_scores.mean(axis=1),
            'test_scores_std': test_scores.std(axis=1)
        }

    def feature_ablation(self, X, y, model, task='classification'):
        """
        Analyze impact of removing features.

        Helps identify if simpler feature sets suffice.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        y : np.ndarray
            Target
        model : estimator
            Model to use
        task : str
            'classification' or 'regression'

        Returns:
        --------
        pd.DataFrame
            Performance with each feature removed
        """
        from sklearn.model_selection import cross_val_score

        scaler = StandardScaler()
        scoring = 'accuracy' if task == 'classification' else 'r2'

        # Baseline with all features
        X_scaled = scaler.fit_transform(X)
        baseline_scores = cross_val_score(model, X_scaled, y, cv=5, scoring=scoring)
        baseline_mean = baseline_scores.mean()

        results = [{'feature': 'ALL FEATURES', 'score': baseline_mean, 'delta': 0}]

        # Remove each feature
        for feature in X.columns:
            X_reduced = X.drop(columns=[feature])
            X_scaled = scaler.fit_transform(X_reduced)

            scores = cross_val_score(clone(model), X_scaled, y, cv=5, scoring=scoring)
            score_mean = scores.mean()

            results.append({
                'feature': feature,
                'score': score_mean,
                'delta': baseline_mean - score_mean  # Positive = feature is important
            })

        return pd.DataFrame(results).sort_values('delta', ascending=False)
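
A quick sketch of the comparison in practice: on a few hundred synthetic game rows, the gap between the baselines and the ensembles is often modest, which is precisely the signal this analysis is meant to surface. The data below are illustrative assumptions.

# Illustrative sketch: comparing baseline, linear, and ensemble models on synthetic data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
X_demo = pd.DataFrame({
    'net_rating_diff': rng.normal(size=400),
    'home_court': rng.integers(0, 2, size=400),
    'rest_days_diff': rng.integers(-3, 4, size=400),
})
y_demo = (X_demo['net_rating_diff'] + 0.25 * X_demo['home_court']
          + rng.normal(scale=0.9, size=400) > 0).astype(int).values

analyzer = ModelComplexityAnalyzer()
comparison = analyzer.compare_models(X_demo, y_demo, cv_splits=5, task='classification')
print(comparison[['model', 'mean_score', 'std_score', 'complexity']])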

26.10.2 Guidelines for Model Selection

The decision between simple and complex models should consider:

  1. Sample size: Complex models need more data. With fewer than 500 samples, simpler models often win.

  2. Signal-to-noise ratio: Basketball has inherent randomness. Complex models can overfit to noise.

  3. Interpretability needs: If stakeholders need to understand predictions, simpler models may be preferable.

  4. Deployment constraints: Complex models may be slower in production.

  5. Feature quality: With well-engineered features, simpler models often perform comparably.

def model_selection_guide(n_samples, n_features, interpretability_need='medium'):
    """
    Provide model selection guidance based on problem characteristics.

    Parameters:
    -----------
    n_samples : int
        Number of training samples
    n_features : int
        Number of features
    interpretability_need : str
        'low', 'medium', or 'high'

    Returns:
    --------
    dict
        Recommended models and rationale
    """
    recommendations = {
        'primary': None,
        'secondary': None,
        'avoid': [],
        'rationale': []
    }

    # Sample size considerations
    if n_samples < 200:
        recommendations['primary'] = 'Logistic Regression / Ridge'
        recommendations['secondary'] = 'Simple Decision Tree'
        recommendations['avoid'] = ['Neural Network', 'Deep Ensembles']
        recommendations['rationale'].append(
            f"Small sample size ({n_samples}) favors simple models to avoid overfitting"
        )

    elif n_samples < 1000:
        recommendations['primary'] = 'Random Forest'
        recommendations['secondary'] = 'Gradient Boosting (careful tuning)'
        recommendations['avoid'] = ['Deep Neural Networks']
        recommendations['rationale'].append(
            f"Moderate sample size ({n_samples}) supports tree ensembles"
        )

    else:
        recommendations['primary'] = 'Gradient Boosting'
        recommendations['secondary'] = 'Neural Network'
        recommendations['avoid'] = []
        recommendations['rationale'].append(
            f"Large sample size ({n_samples}) supports complex models"
        )

    # Feature considerations
    if n_features > n_samples / 10:
        recommendations['rationale'].append(
            f"High feature-to-sample ratio ({n_features}/{n_samples}) - consider regularization"
        )
        if 'Logistic' not in recommendations['primary']:
            recommendations['secondary'] = 'Regularized Linear Model'

    # Interpretability
    if interpretability_need == 'high':
        recommendations['primary'] = 'Logistic Regression'
        recommendations['secondary'] = 'Single Decision Tree'
        recommendations['rationale'].append(
            "High interpretability need prioritizes transparent models"
        )

    return recommendations
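
For example, a problem with roughly 800 labeled seasons, 25 features, and a front office that insists on transparent reasoning might be screened as follows (the numbers are illustrative):

# Illustrative call: screening a problem before committing to a model family.
guide = model_selection_guide(n_samples=800, n_features=25, interpretability_need='high')
print(guide['primary'])    # 'Logistic Regression' -- interpretability overrides sample size
print(guide['rationale'])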

26.11 Complete Implementation Example

Let's bring everything together with a complete example of predicting All-Star selections:

class AllStarPredictionPipeline:
    """
    Complete ML pipeline for predicting NBA All-Star selections.

    Demonstrates end-to-end implementation including feature engineering,
    model selection, training, and evaluation.
    """

    def __init__(self):
        self.feature_columns = None
        self.models = {}
        self.best_model = None
        self.scaler = StandardScaler()

    def engineer_features(self, player_df):
        """
        Create features for All-Star prediction.

        Parameters:
        -----------
        player_df : pd.DataFrame
            Player season statistics

        Returns:
        --------
        pd.DataFrame
            DataFrame with engineered features
        """
        df = player_df.copy()

        # Per-game statistics
        if 'G' in df.columns:
            for stat in ['PTS', 'REB', 'AST', 'STL', 'BLK']:
                if stat in df.columns:
                    df[f'{stat}_per_game'] = df[stat] / df['G']

        # Efficiency metrics: true shooting percentage
        # (small constant in the denominator guards against division by zero)
        if all(col in df.columns for col in ['PTS', 'FGA', 'FTA']):
            df['TS_pct'] = df['PTS'] / (2 * (df['FGA'] + 0.44 * df['FTA'] + 0.001))

        # Market size proxy (using team)
        # In practice, you'd join with a team market size table

        # Win contribution (if team wins available)
        if 'team_wins' in df.columns:
            df['on_winning_team'] = (df['team_wins'] > 41).astype(int)

        # Previous All-Star selections (important predictor)
        if 'prev_all_star' in df.columns:
            df['is_previous_all_star'] = df['prev_all_star']

        return df

    def prepare_data(self, df, target_column='all_star'):
        """
        Prepare data for modeling.

        Parameters:
        -----------
        df : pd.DataFrame
            Engineered feature DataFrame
        target_column : str
            Name of target column

        Returns:
        --------
        tuple
            (X, y, feature_columns)
        """
        # Select numeric features
        numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

        # Remove target and identifiers
        exclude_cols = [target_column, 'player_id', 'season', 'team_id']
        self.feature_columns = [c for c in numeric_cols if c not in exclude_cols]

        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        y = df[target_column].values

        return X, y

    def train_and_evaluate(self, X, y, test_size=0.2):
        """
        Train multiple models and evaluate them.

        Parameters:
        -----------
        X : pd.DataFrame
            Feature matrix
        y : np.ndarray
            Target values
        test_size : float
            Fraction for test set

        Returns:
        --------
        pd.DataFrame
            Model comparison results
        """
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=42, stratify=y
        )

        # Scale features
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)

        # Define models to compare
        models_to_test = {
            'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
            'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
            'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
            'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
            'Neural Network': MLPClassifier(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500)
        }

        results = []

        for name, model in models_to_test.items():
            # Train
            model.fit(X_train_scaled, y_train)
            self.models[name] = model

            # Predict
            y_pred = model.predict(X_test_scaled)
            y_proba = model.predict_proba(X_test_scaled)[:, 1]

            # Evaluate
            results.append({
                'model': name,
                'accuracy': accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred),
                'recall': recall_score(y_test, y_pred),
                'f1': f1_score(y_test, y_pred),
                'roc_auc': roc_auc_score(y_test, y_proba)
            })

        results_df = pd.DataFrame(results).sort_values('roc_auc', ascending=False)

        # Select best model
        best_model_name = results_df.iloc[0]['model']
        self.best_model = self.models[best_model_name]

        return results_df

    def get_feature_importance(self):
        """Get feature importance from best model."""
        if self.best_model is None:
            raise ValueError("Must train models first")

        if hasattr(self.best_model, 'feature_importances_'):
            importance = self.best_model.feature_importances_
        elif hasattr(self.best_model, 'coef_'):
            importance = np.abs(self.best_model.coef_[0])
        else:
            return None

        return pd.Series(
            importance,
            index=self.feature_columns
        ).sort_values(ascending=False)

    def predict_all_stars(self, new_player_df, threshold=0.5):
        """
        Predict All-Star selections for new data.

        Parameters:
        -----------
        new_player_df : pd.DataFrame
            New player statistics
        threshold : float
            Classification threshold

        Returns:
        --------
        pd.DataFrame
            Predictions with probabilities
        """
        if self.best_model is None:
            raise ValueError("Must train models first")

        # Engineer features
        df = self.engineer_features(new_player_df)

        # Prepare features
        X = df[self.feature_columns].fillna(df[self.feature_columns].median())
        X_scaled = self.scaler.transform(X)

        # Predict
        probabilities = self.best_model.predict_proba(X_scaled)[:, 1]
        predictions = (probabilities >= threshold).astype(int)

        # Create results dataframe, keeping whichever identifier columns are available
        id_cols = [c for c in ['Player', 'Team'] if c in new_player_df.columns]
        results = new_player_df[id_cols].copy() if id_cols else pd.DataFrame(index=new_player_df.index)
        results['all_star_probability'] = probabilities
        results['predicted_all_star'] = predictions

        return results.sort_values('all_star_probability', ascending=False)


# Usage example
def run_all_star_prediction_pipeline(player_data_path):
    """
    Complete workflow for All-Star prediction.

    Parameters:
    -----------
    player_data_path : str
        Path to player statistics CSV
    """
    # Load data
    player_df = pd.read_csv(player_data_path)

    # Initialize pipeline
    pipeline = AllStarPredictionPipeline()

    # Engineer features
    df_featured = pipeline.engineer_features(player_df)

    # Prepare data
    X, y = pipeline.prepare_data(df_featured)

    # Train and evaluate
    results = pipeline.train_and_evaluate(X, y)
    print("Model Comparison:")
    print(results.to_string(index=False))

    # Feature importance
    importance = pipeline.get_feature_importance()
    print("\nTop 10 Features:")
    print(importance.head(10))

    return pipeline
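
Once trained, the same pipeline object can score a new season's statistics, provided the new data carries the same raw columns as the training file. The file paths below are placeholders:

# Hypothetical usage: scoring the current season with the trained pipeline.
pipeline = run_all_star_prediction_pipeline('historical_player_stats.csv')

current_season_df = pd.read_csv('current_season_stats.csv')
predictions = pipeline.predict_all_stars(current_season_df, threshold=0.5)
print(predictions.head(24))  # roughly two All-Star rosters' worth of candidates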

Summary

Machine learning offers powerful tools for basketball analytics, but successful application requires careful consideration of the problem structure, data characteristics, and organizational needs. Key takeaways from this chapter:

  1. Clustering reveals natural player groupings that may not align with traditional positions, offering insights into modern playing styles.

  2. Classification and regression problems are ubiquitous in basketball, from predicting game outcomes to forecasting player development.

  3. Tree-based ensembles (Random Forests, Gradient Boosting) are workhorses of basketball ML, offering strong performance with reasonable interpretability.

  4. Neural networks are increasingly important for complex pattern recognition but require more data and careful tuning.

  5. Feature engineering remains crucial—domain knowledge about basketball often matters more than algorithm sophistication.

  6. Cross-validation must respect the temporal and hierarchical structure of basketball data.

  7. Interpretability is essential in basketball contexts where decisions affect player careers and organizational strategy.

  8. Simpler models often suffice—always establish baselines and question whether complexity is warranted.

The most successful practitioners of machine learning in basketball combine computational expertise with deep understanding of the game. Algorithms reveal patterns; domain experts translate those patterns into actionable insights.


References

  1. Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
  2. Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
  4. Lundberg, S. M., & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems.
  5. Muthukrishan, S., & Edakunni, N. (2016). Machine Learning Applications in Sports Analytics. International Conference on Big Data Analytics.