In This Chapter
- Introduction
- 26.1 Foundations of Machine Learning for Basketball
- 26.2 Clustering Player Types
- 26.3 Classification Problems in Basketball
- 26.4 Random Forests and Gradient Boosting
- 26.5 Neural Networks for Basketball Applications
- 26.6 Ensemble Methods
- 26.7 Feature Importance and Model Interpretability
- 26.8 Cross-Validation Strategies for Basketball Data
- 26.9 Handling Time-Series Aspects
- 26.10 When ML Beats Simple Models (and When It Doesn't)
- 26.11 Complete Implementation Example
- Summary
- References
Chapter 26: Machine Learning in Basketball
Introduction
The intersection of machine learning and basketball analytics represents one of the most exciting frontiers in sports science. While traditional statistical methods have long served as the foundation of basketball analysis, machine learning offers capabilities that extend far beyond what conventional approaches can achieve. From identifying previously unknown player archetypes to predicting game outcomes with unprecedented accuracy, ML techniques are reshaping how teams, analysts, and researchers understand the game.
This chapter provides a comprehensive treatment of machine learning applications in basketball. We begin with fundamental concepts and progressively build toward sophisticated implementations. Throughout, we emphasize not just the "how" of these techniques but the "when" and "why"—understanding which methods are appropriate for specific problems and, crucially, when simpler approaches might be preferable.
Machine learning in basketball is not about replacing human judgment with algorithmic decision-making. Rather, it's about augmenting human expertise with computational power, discovering patterns too subtle for the naked eye, and providing decision-makers with probabilistic insights grounded in data. The most successful implementations of ML in basketball invariably combine algorithmic sophistication with deep domain knowledge.
26.1 Foundations of Machine Learning for Basketball
26.1.1 The Machine Learning Paradigm
Machine learning differs from traditional programming in a fundamental way: instead of explicitly coding rules, we provide algorithms with data and let them discover patterns. In basketball terms, rather than manually defining what makes a player a "stretch four" based on our preconceptions, we let clustering algorithms discover natural groupings in the data.
The three primary paradigms of machine learning are:
Supervised Learning: The algorithm learns from labeled examples. Given player statistics and their actual positions, it learns to predict positions for new players. Given historical game data with outcomes, it learns to predict winners.
Unsupervised Learning: The algorithm finds structure in unlabeled data. Given player statistics without position labels, it discovers natural groupings—perhaps revealing that the traditional five positions poorly represent modern playing styles.
Reinforcement Learning: The algorithm learns through trial and error, receiving rewards for good decisions. This paradigm is particularly relevant for in-game decision-making, though its basketball applications are still emerging.
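The contrast between the first two paradigms is easiest to see in code. The following minimal sketch builds a small synthetic table of per-36 statistics (the column names and position labels are illustrative, not drawn from a real dataset), fits a supervised classifier to predict the labels, and then runs unsupervised K-means on the same features with no labels at all.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic per-36 statistics for 200 players (illustrative values only)
players = pd.DataFrame({
    'PTS_per36': rng.normal(16, 5, 200),
    'REB_per36': rng.normal(7, 3, 200),
    'AST_per36': rng.normal(4, 2, 200),
})
players['position'] = rng.choice(['G', 'F', 'C'], size=200)  # hypothetical labels

features = players[['PTS_per36', 'REB_per36', 'AST_per36']]

# Supervised: learn a mapping from statistics to the given position labels
clf = RandomForestClassifier(random_state=42).fit(features, players['position'])
print(clf.predict(features.head(3)))

# Unsupervised: discover groupings without ever seeing the labels
cluster_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(features)
print(pd.Series(cluster_labels).value_counts())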
26.1.2 The Basketball ML Pipeline
Every machine learning project in basketball follows a common pipeline:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
class BasketballMLPipeline:
"""
A standardized pipeline for basketball machine learning projects.
This class encapsulates the common steps in basketball ML:
data preparation, feature engineering, model training, and evaluation.
"""
def __init__(self, model, feature_columns, target_column):
self.model = model
self.feature_columns = feature_columns
self.target_column = target_column
self.scaler = StandardScaler()
self.is_fitted = False
def prepare_features(self, df):
"""
Extract and prepare features from a basketball dataframe.
Parameters:
-----------
df : pd.DataFrame
Raw basketball data with player/game statistics
Returns:
--------
np.ndarray
Prepared feature matrix
"""
X = df[self.feature_columns].copy()
# Handle missing values common in basketball data
X = X.fillna(X.median())
return X.values
def fit(self, df):
"""
Fit the pipeline on training data.
Parameters:
-----------
df : pd.DataFrame
Training data with features and target
"""
X = self.prepare_features(df)
y = df[self.target_column].values
# Scale features
X_scaled = self.scaler.fit_transform(X)
# Fit model
self.model.fit(X_scaled, y)
self.is_fitted = True
return self
def predict(self, df):
"""
Generate predictions for new data.
Parameters:
-----------
df : pd.DataFrame
New data to predict on
Returns:
--------
np.ndarray
Model predictions
"""
if not self.is_fitted:
raise ValueError("Pipeline must be fitted before prediction")
X = self.prepare_features(df)
X_scaled = self.scaler.transform(X)
return self.model.predict(X_scaled)
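To make the pipeline concrete, here is a minimal usage sketch with a logistic regression. The dataframe is synthetic, and the column names (NET_RTG_diff, PACE_diff, home_win) are placeholders for whatever game-level features and outcome label your own data provides.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
toy_games = pd.DataFrame({
    'NET_RTG_diff': rng.normal(0, 5, 300),
    'PACE_diff': rng.normal(0, 2, 300),
})
# Synthetic outcome loosely tied to the net-rating differential
toy_games['home_win'] = (toy_games['NET_RTG_diff'] + rng.normal(0, 4, 300) > 0).astype(int)

pipeline = BasketballMLPipeline(
    model=LogisticRegression(max_iter=1000),
    feature_columns=['NET_RTG_diff', 'PACE_diff'],
    target_column='home_win'
)
pipeline.fit(toy_games)
print(pipeline.predict(toy_games.head()))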
26.1.3 Feature Engineering for Basketball
The quality of machine learning models depends critically on the features provided. Basketball presents unique opportunities for feature engineering:
def engineer_basketball_features(df):
"""
Create advanced features from basic basketball statistics.
This function demonstrates common feature engineering patterns
for basketball data, including rate statistics, efficiency metrics,
and composite measures.
Parameters:
-----------
df : pd.DataFrame
DataFrame with basic counting statistics
Returns:
--------
pd.DataFrame
DataFrame with engineered features added
"""
df = df.copy()
# Rate statistics (per 36 minutes normalization)
counting_stats = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'TOV']
for stat in counting_stats:
if stat in df.columns and 'MP' in df.columns:
df[f'{stat}_per36'] = (df[stat] / df['MP']) * 36
# Efficiency metrics
if all(col in df.columns for col in ['PTS', 'FGA', 'FTA']):
# True Shooting Percentage
df['TS_pct'] = df['PTS'] / (2 * (df['FGA'] + 0.44 * df['FTA']))
if all(col in df.columns for col in ['AST', 'TOV']):
# Assist to Turnover Ratio
df['AST_TOV_ratio'] = df['AST'] / (df['TOV'] + 0.001)
# Usage patterns
if all(col in df.columns for col in ['FGA', 'FTA', 'TOV', 'MP']):
# Approximate usage rate
df['usage_approx'] = (df['FGA'] + 0.44 * df['FTA'] + df['TOV']) / df['MP']
# Shooting profile
if all(col in df.columns for col in ['FG3A', 'FGA']):
df['three_point_rate'] = df['FG3A'] / (df['FGA'] + 0.001)
if all(col in df.columns for col in ['FTA', 'FGA']):
df['free_throw_rate'] = df['FTA'] / (df['FGA'] + 0.001)
# Rebounding rates
if all(col in df.columns for col in ['ORB', 'DRB', 'REB']):
df['ORB_pct_of_reb'] = df['ORB'] / (df['REB'] + 0.001)
# Versatility index (coefficient of variation of stats)
versatility_stats = ['PTS_per36', 'REB_per36', 'AST_per36']
available_stats = [s for s in versatility_stats if s in df.columns]
if len(available_stats) >= 2:
df['versatility'] = df[available_stats].std(axis=1) / (df[available_stats].mean(axis=1) + 0.001)
return df
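A short usage sketch follows. The box-score totals below are invented for illustration; the point is simply that the function only creates the features whose inputs are present.
import pandas as pd

# Toy season totals for two players (values are illustrative, not real)
raw = pd.DataFrame({
    'PTS': [1820, 950], 'REB': [410, 620], 'AST': [510, 120],
    'STL': [90, 40], 'BLK': [25, 110], 'TOV': [210, 95],
    'MP': [2400, 1800], 'FGA': [1350, 700], 'FTA': [420, 210],
    'FG3A': [520, 30], 'ORB': [60, 230], 'DRB': [350, 390],
})
engineered = engineer_basketball_features(raw)
print(engineered[['PTS_per36', 'TS_pct', 'three_point_rate', 'usage_approx']].round(3))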
26.2 Clustering Player Types
Clustering represents one of the most natural applications of machine learning in basketball. Traditional positions—point guard, shooting guard, small forward, power forward, and center—were defined in an era of more rigid playing styles. Modern basketball's positionless revolution demands a data-driven approach to understanding player types.
26.2.1 K-Means Clustering
K-means is the workhorse of clustering algorithms, prized for its simplicity and interpretability. The algorithm partitions players into k clusters by minimizing within-cluster variance.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
class PlayerClusteringAnalysis:
"""
Comprehensive player clustering using K-means.
This class provides methods for determining optimal cluster count,
fitting clusters, and interpreting results in basketball terms.
"""
def __init__(self, feature_columns):
self.feature_columns = feature_columns
self.scaler = StandardScaler()
self.kmeans = None
self.cluster_centers_original = None
def find_optimal_k(self, df, k_range=range(2, 15)):
"""
Use the elbow method and silhouette scores to find optimal k.
Parameters:
-----------
df : pd.DataFrame
Player statistics dataframe
k_range : range
Range of k values to test
Returns:
--------
dict
Dictionary with inertia and silhouette scores for each k
"""
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
X_scaled = self.scaler.fit_transform(X)
results = {'k': [], 'inertia': [], 'silhouette': []}
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
results['k'].append(k)
results['inertia'].append(kmeans.inertia_)
results['silhouette'].append(silhouette_score(X_scaled, labels))
return results
def fit_clusters(self, df, n_clusters):
"""
Fit K-means clustering with specified number of clusters.
Parameters:
-----------
df : pd.DataFrame
Player statistics dataframe
n_clusters : int
Number of clusters to create
Returns:
--------
np.ndarray
Cluster labels for each player
"""
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
X_scaled = self.scaler.fit_transform(X)
self.kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
labels = self.kmeans.fit_predict(X_scaled)
# Store cluster centers in original scale for interpretation
self.cluster_centers_original = self.scaler.inverse_transform(
self.kmeans.cluster_centers_
)
return labels
def interpret_clusters(self, df, labels):
"""
Generate interpretable descriptions of each cluster.
Parameters:
-----------
df : pd.DataFrame
Original dataframe with player data
labels : np.ndarray
Cluster assignments
Returns:
--------
pd.DataFrame
Summary statistics for each cluster
"""
df_clustered = df.copy()
df_clustered['cluster'] = labels
# Calculate mean statistics for each cluster
cluster_summary = df_clustered.groupby('cluster')[self.feature_columns].mean()
# Add cluster sizes
cluster_summary['n_players'] = df_clustered.groupby('cluster').size()
return cluster_summary
def get_cluster_exemplars(self, df, labels, n_exemplars=3):
"""
Find players closest to each cluster center.
These exemplar players serve as prototypes for understanding
what each cluster represents.
Parameters:
-----------
df : pd.DataFrame
Player dataframe with 'Player' column
labels : np.ndarray
Cluster assignments
n_exemplars : int
Number of exemplar players per cluster
Returns:
--------
dict
Dictionary mapping cluster ID to list of exemplar players
"""
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
X_scaled = self.scaler.transform(X)
exemplars = {}
for cluster_id in range(self.kmeans.n_clusters):
# Get indices of players in this cluster
cluster_mask = labels == cluster_id
cluster_indices = np.where(cluster_mask)[0]
# Calculate distances to cluster center
center = self.kmeans.cluster_centers_[cluster_id]
distances = np.linalg.norm(
X_scaled[cluster_mask] - center, axis=1
)
# Get closest players
closest_idx = np.argsort(distances)[:n_exemplars]
exemplar_indices = cluster_indices[closest_idx]
if 'Player' in df.columns:
exemplars[cluster_id] = df.iloc[exemplar_indices]['Player'].tolist()
else:
exemplars[cluster_id] = exemplar_indices.tolist()
return exemplars
# Example usage with realistic basketball features
def cluster_nba_players(player_df):
"""
Complete workflow for clustering NBA players.
Parameters:
-----------
player_df : pd.DataFrame
DataFrame with player statistics including:
PTS_per36, REB_per36, AST_per36, STL_per36, BLK_per36,
TS_pct, USG_pct, three_point_rate, AST_pct
Returns:
--------
tuple
(labels, cluster_summary, exemplars)
"""
# Features that capture playing style
style_features = [
'PTS_per36', 'REB_per36', 'AST_per36', 'STL_per36', 'BLK_per36',
'TS_pct', 'USG_pct', 'three_point_rate', 'AST_pct'
]
# Filter to available features
available_features = [f for f in style_features if f in player_df.columns]
# Initialize analyzer
analyzer = PlayerClusteringAnalysis(available_features)
# Find optimal k
k_results = analyzer.find_optimal_k(player_df)
# Based on elbow and silhouette, typically 7-10 clusters work well
# for capturing modern NBA playing styles
optimal_k = k_results['k'][np.argmax(k_results['silhouette'])]
# Fit final clusters
labels = analyzer.fit_clusters(player_df, n_clusters=optimal_k)
# Interpret results
summary = analyzer.interpret_clusters(player_df, labels)
exemplars = analyzer.get_cluster_exemplars(player_df, labels)
return labels, summary, exemplars
26.2.2 Hierarchical Clustering
While K-means requires pre-specifying the number of clusters, hierarchical clustering builds a tree structure (dendrogram) that reveals clustering at multiple levels of granularity. This is particularly valuable in basketball, where we might want to examine both broad categories (perimeter vs. interior players) and fine-grained distinctions (scoring guards vs. playmaking guards).
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist
class HierarchicalPlayerClustering:
"""
Hierarchical clustering for basketball player analysis.
This approach reveals the nested structure of player types,
allowing analysis at multiple levels of specificity.
"""
def __init__(self, feature_columns, linkage_method='ward'):
"""
Initialize hierarchical clustering.
Parameters:
-----------
feature_columns : list
Column names to use as features
linkage_method : str
Linkage criterion: 'ward', 'complete', 'average', 'single'
Ward's method typically works well for player clustering
"""
self.feature_columns = feature_columns
self.linkage_method = linkage_method
self.scaler = StandardScaler()
self.linkage_matrix = None
def fit(self, df):
"""
Compute hierarchical clustering structure.
Parameters:
-----------
df : pd.DataFrame
Player statistics dataframe
Returns:
--------
self
"""
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
X_scaled = self.scaler.fit_transform(X)
# Compute linkage matrix
self.linkage_matrix = linkage(X_scaled, method=self.linkage_method)
return self
def plot_dendrogram(self, df, figsize=(15, 8), truncate_mode='level', p=5):
"""
Visualize the hierarchical clustering structure.
Parameters:
-----------
df : pd.DataFrame
DataFrame with 'Player' column for labels
figsize : tuple
Figure size
truncate_mode : str
How to truncate dendrogram for display
p : int
Truncation parameter
"""
plt.figure(figsize=figsize)
labels = df['Player'].tolist() if 'Player' in df.columns else None
dendrogram(
self.linkage_matrix,
labels=labels,
truncate_mode=truncate_mode,
p=p,
leaf_rotation=90,
leaf_font_size=8
)
plt.title('Hierarchical Clustering of NBA Players')
plt.xlabel('Player')
plt.ylabel('Distance')
plt.tight_layout()
return plt.gcf()
def get_clusters_at_level(self, n_clusters):
"""
Extract flat clusters at a specified granularity.
Parameters:
-----------
n_clusters : int
Desired number of clusters
Returns:
--------
np.ndarray
Cluster labels
"""
return fcluster(self.linkage_matrix, n_clusters, criterion='maxclust')
def analyze_hierarchy(self, df, levels=[3, 6, 10]):
"""
Analyze clustering at multiple hierarchical levels.
This reveals how player types split as we increase granularity.
Parameters:
-----------
df : pd.DataFrame
Player dataframe
levels : list
Number of clusters at each level to analyze
Returns:
--------
dict
Nested analysis at each level
"""
results = {}
for n in levels:
labels = self.get_clusters_at_level(n)
df_temp = df.copy()
df_temp['cluster'] = labels
# Summarize each cluster
level_summary = {}
for cluster_id in range(1, n + 1):
cluster_players = df_temp[df_temp['cluster'] == cluster_id]
level_summary[cluster_id] = {
'size': len(cluster_players),
'avg_stats': cluster_players[self.feature_columns].mean().to_dict(),
'exemplars': cluster_players.nlargest(3, 'MP')['Player'].tolist()
if 'Player' in cluster_players.columns and 'MP' in cluster_players.columns else []
}
results[f'{n}_clusters'] = level_summary
return results
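As a quick illustration, the sketch below fits the hierarchy on a synthetic set of per-36 profiles (the Player and MP columns are placeholders) and extracts flat clusterings at two levels of granularity.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy_players = pd.DataFrame({
    'Player': [f'Player {i}' for i in range(60)],
    'PTS_per36': rng.normal(15, 5, 60),
    'REB_per36': rng.normal(7, 3, 60),
    'AST_per36': rng.normal(4, 2, 60),
    'MP': rng.integers(500, 3000, 60),
})

hier = HierarchicalPlayerClustering(['PTS_per36', 'REB_per36', 'AST_per36'])
hier.fit(toy_players)
coarse_labels = hier.get_clusters_at_level(3)   # broad groupings
fine_labels = hier.get_clusters_at_level(8)     # finer archetypes
print(pd.Series(coarse_labels).value_counts())
hierarchy_summary = hier.analyze_hierarchy(toy_players, levels=[3, 6])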
26.2.3 Interpreting Player Clusters
The challenge with clustering is moving from mathematical groupings to basketball-meaningful archetypes. Here's a framework for interpretation:
def interpret_player_clusters(cluster_summary, feature_columns):
"""
Generate human-readable interpretations of player clusters.
This function compares each cluster to the overall mean to identify
distinguishing characteristics, then maps these to basketball concepts.
Parameters:
-----------
cluster_summary : pd.DataFrame
Summary statistics for each cluster
feature_columns : list
Features used in clustering
Returns:
--------
dict
Dictionary mapping cluster ID to descriptive archetype
"""
# Calculate z-scores relative to overall mean
overall_mean = cluster_summary[feature_columns].mean()
overall_std = cluster_summary[feature_columns].std()
archetypes = {}
for cluster_id in cluster_summary.index:
cluster_stats = cluster_summary.loc[cluster_id, feature_columns]
z_scores = (cluster_stats - overall_mean) / (overall_std + 0.001)
# Identify standout characteristics (|z| > 1)
standout_high = z_scores[z_scores > 1].sort_values(ascending=False)
standout_low = z_scores[z_scores < -1].sort_values()
# Map to archetype descriptions
archetype_traits = []
# Check for common archetypes
if 'AST_per36' in standout_high.index and z_scores['AST_per36'] > 1.5:
archetype_traits.append('Playmaker')
if 'PTS_per36' in standout_high.index and z_scores['PTS_per36'] > 1.5:
archetype_traits.append('Scorer')
if 'three_point_rate' in standout_high.index:
archetype_traits.append('Perimeter-Oriented')
if 'BLK_per36' in standout_high.index and z_scores['BLK_per36'] > 1:
archetype_traits.append('Rim Protector')
if 'REB_per36' in standout_high.index and z_scores['REB_per36'] > 1.5:
archetype_traits.append('Rebounder')
if 'STL_per36' in standout_high.index:
archetype_traits.append('Ball Hawk')
# Check for role indicators
if 'USG_pct' in z_scores.index:
if z_scores['USG_pct'] < -1:
archetype_traits.append('Role Player')
elif z_scores['USG_pct'] > 1:
archetype_traits.append('High Usage')
# Combine into archetype name
if archetype_traits:
archetype = ' / '.join(archetype_traits[:3]) # Top 3 traits
else:
archetype = f'Cluster {cluster_id}'
archetypes[cluster_id] = {
'name': archetype,
'standout_high': standout_high.head(3).to_dict(),
'standout_low': standout_low.head(3).to_dict(),
'size': cluster_summary.loc[cluster_id, 'n_players']
}
return archetypes
26.3 Classification Problems in Basketball
Classification—predicting categorical outcomes—has numerous applications in basketball: predicting game winners, identifying future All-Stars, classifying shot types, and more.
26.3.1 Binary Classification: Predicting Game Outcomes
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix
)
class GameOutcomePredictor:
"""
Predict game outcomes using team statistics.
This class demonstrates binary classification for basketball,
including proper handling of home/away asymmetry and
appropriate evaluation metrics.
"""
def __init__(self, model_type='logistic'):
"""
Initialize predictor with chosen model type.
Parameters:
-----------
model_type : str
'logistic' or 'random_forest'
"""
if model_type == 'logistic':
self.model = LogisticRegression(random_state=42, max_iter=1000)
else:
self.model = RandomForestClassifier(
n_estimators=100, random_state=42, n_jobs=-1
)
self.scaler = StandardScaler()
self.feature_columns = None
def prepare_game_features(self, games_df, team_stats_df):
"""
Create features for game outcome prediction.
Parameters:
-----------
games_df : pd.DataFrame
Game data with home_team, away_team, home_win columns
team_stats_df : pd.DataFrame
Team statistics indexed by team name
Returns:
--------
pd.DataFrame
Prepared feature dataframe
"""
features = []
for _, game in games_df.iterrows():
home_team = game['home_team']
away_team = game['away_team']
# Get team statistics
home_stats = team_stats_df.loc[home_team]
away_stats = team_stats_df.loc[away_team]
# Create differential features
game_features = {
'home_win': game['home_win']
}
# Point differential expectation
for stat in ['OFF_RTG', 'DEF_RTG', 'NET_RTG', 'PACE']:
if stat in team_stats_df.columns:
game_features[f'{stat}_diff'] = home_stats[stat] - away_stats[stat]
# Home court advantage is implicit in the differential
# but we can add interaction terms
game_features['is_back_to_back_home'] = game.get('home_b2b', 0)
game_features['is_back_to_back_away'] = game.get('away_b2b', 0)
features.append(game_features)
return pd.DataFrame(features)
def fit(self, X, y):
"""
Fit the classifier.
Parameters:
-----------
X : pd.DataFrame or np.ndarray
Feature matrix
y : np.ndarray
Binary outcome labels
"""
self.feature_columns = X.columns.tolist() if hasattr(X, 'columns') else None
X_scaled = self.scaler.fit_transform(X)
self.model.fit(X_scaled, y)
return self
def predict(self, X):
"""Generate predictions."""
X_scaled = self.scaler.transform(X)
return self.model.predict(X_scaled)
def predict_proba(self, X):
"""Generate probability predictions."""
X_scaled = self.scaler.transform(X)
return self.model.predict_proba(X_scaled)
def evaluate(self, X, y, threshold=0.5):
"""
Comprehensive evaluation of classification performance.
Parameters:
-----------
X : pd.DataFrame or np.ndarray
Feature matrix
y : np.ndarray
True labels
threshold : float
Classification threshold
Returns:
--------
dict
Dictionary of evaluation metrics
"""
y_proba = self.predict_proba(X)[:, 1]
y_pred = (y_proba >= threshold).astype(int)
return {
'accuracy': accuracy_score(y, y_pred),
'precision': precision_score(y, y_pred),
'recall': recall_score(y, y_pred),
'f1': f1_score(y, y_pred),
'roc_auc': roc_auc_score(y, y_proba),
'confusion_matrix': confusion_matrix(y, y_pred)
}
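The sketch below exercises the predictor end to end on synthetic data: hypothetical team ratings (OFF_RTG, DEF_RTG, NET_RTG, and PACE are assumed column names), a random schedule, and random outcomes, with the evaluation run in-sample purely to show the API. With real data you would train on past games and evaluate on held-out future games.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
teams = [f'Team {i}' for i in range(10)]
team_stats = pd.DataFrame({
    'OFF_RTG': rng.normal(112, 3, 10),
    'DEF_RTG': rng.normal(112, 3, 10),
    'PACE': rng.normal(99, 2, 10),
}, index=teams)
team_stats['NET_RTG'] = team_stats['OFF_RTG'] - team_stats['DEF_RTG']

games = pd.DataFrame({
    'home_team': rng.choice(teams, 400),
    'away_team': rng.choice(teams, 400),
    'home_win': rng.integers(0, 2, 400),
})
games = games[games['home_team'] != games['away_team']].reset_index(drop=True)

predictor = GameOutcomePredictor(model_type='logistic')
game_features = predictor.prepare_game_features(games, team_stats)
X = game_features.drop(columns=['home_win'])
y = game_features['home_win'].values
predictor.fit(X, y)
print(predictor.evaluate(X, y))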
26.3.2 Multi-Class Classification: Player Roles
Predicting player positions or roles represents a multi-class classification problem:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder
class PlayerRoleClassifier:
"""
Classify players into roles based on statistical profile.
This can use traditional positions or data-driven archetypes.
"""
def __init__(self, feature_columns, model=None):
self.feature_columns = feature_columns
self.model = model or RandomForestClassifier(
n_estimators=100, random_state=42, n_jobs=-1
)
self.scaler = StandardScaler()
self.label_encoder = LabelEncoder()
def fit(self, df, role_column='position'):
"""
Train the role classifier.
Parameters:
-----------
df : pd.DataFrame
Player data with statistics and role labels
role_column : str
Column containing role/position labels
"""
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
y = self.label_encoder.fit_transform(df[role_column])
X_scaled = self.scaler.fit_transform(X)
self.model.fit(X_scaled, y)
return self
def predict(self, df):
"""Predict roles for new players."""
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
X_scaled = self.scaler.transform(X)
predictions_encoded = self.model.predict(X_scaled)
return self.label_encoder.inverse_transform(predictions_encoded)
def predict_proba_all_roles(self, df):
"""
Get probability distribution over all roles.
Returns:
--------
pd.DataFrame
Probabilities for each role
"""
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
X_scaled = self.scaler.transform(X)
probas = self.model.predict_proba(X_scaled)
return pd.DataFrame(
probas,
columns=self.label_encoder.classes_,
index=df.index
)
def analyze_misclassifications(self, df, role_column='position'):
"""
Analyze where the classifier makes mistakes.
This reveals which roles are most easily confused.
Returns:
--------
pd.DataFrame
Confusion matrix as DataFrame with role labels
"""
from sklearn.metrics import confusion_matrix
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
y_true = self.label_encoder.transform(df[role_column])
X_scaled = self.scaler.transform(X)
y_pred = self.model.predict(X_scaled)
cm = confusion_matrix(y_true, y_pred)
return pd.DataFrame(
cm,
index=self.label_encoder.classes_,
columns=self.label_encoder.classes_
)
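A minimal usage sketch follows, again on synthetic profiles with made-up position labels; the misclassification analysis here is in-sample, purely to show the API.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
toy = pd.DataFrame({
    'PTS_per36': rng.normal(15, 5, 150),
    'REB_per36': rng.normal(7, 3, 150),
    'AST_per36': rng.normal(4, 2, 150),
    'BLK_per36': rng.normal(1.0, 0.7, 150),
    'position': rng.choice(['G', 'F', 'C'], 150),
})

role_clf = PlayerRoleClassifier(['PTS_per36', 'REB_per36', 'AST_per36', 'BLK_per36'])
role_clf.fit(toy, role_column='position')
print(role_clf.predict(toy.head()))
print(role_clf.predict_proba_all_roles(toy.head()).round(2))
print(role_clf.analyze_misclassifications(toy, role_column='position'))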
26.4 Random Forests and Gradient Boosting
Tree-based ensemble methods have become workhorses of sports analytics. They handle non-linear relationships, require minimal preprocessing, and provide interpretable feature importance measures.
26.4.1 Random Forests for Basketball
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.inspection import permutation_importance
class BasketballRandomForest:
"""
Random Forest implementation tailored for basketball analytics.
Includes methods for feature importance analysis and
partial dependence exploration.
"""
def __init__(self, task='classification', n_estimators=100, max_depth=None):
"""
Initialize Random Forest for basketball analysis.
Parameters:
-----------
task : str
'classification' or 'regression'
n_estimators : int
Number of trees in the forest
max_depth : int or None
Maximum depth of trees (None for unlimited)
"""
if task == 'classification':
self.model = RandomForestClassifier(
n_estimators=n_estimators,
max_depth=max_depth,
random_state=42,
n_jobs=-1,
oob_score=True # Out-of-bag evaluation
)
else:
self.model = RandomForestRegressor(
n_estimators=n_estimators,
max_depth=max_depth,
random_state=42,
n_jobs=-1,
oob_score=True
)
self.feature_columns = None
self.task = task
def fit(self, X, y):
"""Fit the random forest."""
self.feature_columns = X.columns.tolist() if hasattr(X, 'columns') else None
self.model.fit(X, y)
return self
def get_feature_importance(self, method='gini'):
"""
Get feature importance scores.
Parameters:
-----------
method : str
'gini' for impurity-based (fast but can be biased)
Returns:
--------
pd.Series
Feature importance scores, sorted descending
"""
importance = self.model.feature_importances_
if self.feature_columns:
importance_series = pd.Series(importance, index=self.feature_columns)
else:
importance_series = pd.Series(importance)
return importance_series.sort_values(ascending=False)
def get_permutation_importance(self, X, y, n_repeats=10):
"""
Calculate permutation importance (more reliable but slower).
Permutation importance measures how much model performance
decreases when a feature's values are randomly shuffled.
Parameters:
-----------
X : pd.DataFrame or np.ndarray
Feature matrix
y : np.ndarray
Target values
n_repeats : int
Number of times to repeat the permutation
Returns:
--------
pd.DataFrame
Importance scores with mean and std
"""
result = permutation_importance(
self.model, X, y,
n_repeats=n_repeats,
random_state=42,
n_jobs=-1
)
importance_df = pd.DataFrame({
'mean_importance': result.importances_mean,
'std_importance': result.importances_std
}, index=self.feature_columns)
return importance_df.sort_values('mean_importance', ascending=False)
def analyze_tree_paths(self, X, sample_idx=0):
"""
Analyze decision paths for a specific sample.
This helps understand why the model made a particular prediction.
Parameters:
-----------
X : pd.DataFrame or np.ndarray
Feature matrix
sample_idx : int
Index of sample to analyze
Returns:
--------
list
List of (feature, threshold, direction) tuples for first tree
"""
sample = X.iloc[[sample_idx]] if hasattr(X, 'iloc') else X[[sample_idx]]
# Get decision path through first tree
tree = self.model.estimators_[0]
node_indicator = tree.decision_path(sample)
# Extract path
feature_names = self.feature_columns or [f'feature_{i}' for i in range(X.shape[1])]
path = []
nodes = node_indicator.indices
for node_id in nodes[:-1]: # Exclude leaf
feature_id = tree.tree_.feature[node_id]
threshold = tree.tree_.threshold[node_id]
if hasattr(sample, 'iloc'):
value = sample.iloc[0, feature_id]
else:
value = sample[0, feature_id]
direction = 'left (<=)' if value <= threshold else 'right (>)'
path.append({
'feature': feature_names[feature_id],
'threshold': threshold,
'sample_value': value,
'direction': direction
})
return path
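Here is a minimal sketch of the class in use. The features and target are synthetic placeholders; the out-of-bag score comes for free from oob_score=True.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
X = pd.DataFrame({
    'NET_RTG_diff': rng.normal(0, 5, 500),
    'PACE_diff': rng.normal(0, 2, 500),
    'rest_diff': rng.integers(-3, 4, 500).astype(float),
})
y = (X['NET_RTG_diff'] + rng.normal(0, 4, 500) > 0).astype(int).values

forest = BasketballRandomForest(task='classification', n_estimators=100)
forest.fit(X, y)
print('OOB score:', round(forest.model.oob_score_, 3))
print(forest.get_feature_importance())
print(forest.get_permutation_importance(X, y, n_repeats=5).head())
print(forest.analyze_tree_paths(X, sample_idx=0)[:3])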
26.4.2 Gradient Boosting for Basketball
Gradient boosting often achieves higher accuracy than random forests, though it requires more careful tuning:
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
import xgboost as xgb
class BasketballGradientBoosting:
"""
Gradient Boosting implementation for basketball analytics.
Supports both scikit-learn's implementation and XGBoost.
"""
def __init__(self, task='classification', use_xgboost=True, **kwargs):
"""
Initialize gradient boosting model.
Parameters:
-----------
task : str
'classification' or 'regression'
use_xgboost : bool
Whether to use XGBoost (recommended for performance)
**kwargs : dict
Additional parameters for the model
"""
self.task = task
self.use_xgboost = use_xgboost
default_params = {
'n_estimators': 100,
'learning_rate': 0.1,
'max_depth': 5,
'random_state': 42
}
default_params.update(kwargs)
if use_xgboost:
if task == 'classification':
# use_label_encoder is deprecated/removed in recent XGBoost releases
self.model = xgb.XGBClassifier(**default_params)
else:
self.model = xgb.XGBRegressor(**default_params)
else:
if task == 'classification':
self.model = GradientBoostingClassifier(**default_params)
else:
self.model = GradientBoostingRegressor(**default_params)
self.feature_columns = None
def fit(self, X, y, eval_set=None, early_stopping_rounds=None):
"""
Fit the gradient boosting model.
Parameters:
-----------
X : pd.DataFrame or np.ndarray
Training features
y : np.ndarray
Training target
eval_set : list of tuples, optional
Validation set for early stopping (XGBoost only)
early_stopping_rounds : int, optional
Stop if no improvement for this many rounds
"""
self.feature_columns = X.columns.tolist() if hasattr(X, 'columns') else None
if self.use_xgboost and eval_set is not None:
# Note: recent XGBoost versions expect early_stopping_rounds in the
# constructor rather than in fit(); adjust for your installed version
self.model.fit(
X, y,
eval_set=eval_set,
early_stopping_rounds=early_stopping_rounds,
verbose=False
)
else:
self.model.fit(X, y)
return self
def tune_hyperparameters(self, X, y, cv=5, param_grid=None):
"""
Perform hyperparameter tuning using cross-validation.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
y : np.ndarray
Target values
cv : int
Number of cross-validation folds
param_grid : dict, optional
Parameter grid to search
Returns:
--------
dict
Best parameters and cross-validation scores
"""
from sklearn.model_selection import GridSearchCV
if param_grid is None:
param_grid = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7]
}
grid_search = GridSearchCV(
self.model, param_grid,
cv=cv,
scoring='accuracy' if self.task == 'classification' else 'neg_mean_squared_error',
n_jobs=-1
)
grid_search.fit(X, y)
return {
'best_params': grid_search.best_params_,
'best_score': grid_search.best_score_,
'cv_results': pd.DataFrame(grid_search.cv_results_)
}
def get_feature_importance(self):
"""Get feature importance from the model."""
if self.use_xgboost:
importance = self.model.feature_importances_
else:
importance = self.model.feature_importances_
if self.feature_columns:
return pd.Series(importance, index=self.feature_columns).sort_values(ascending=False)
return pd.Series(importance).sort_values(ascending=False)
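The sketch below sticks to use_xgboost=False so it depends only on scikit-learn; swap the flag if XGBoost is installed. The feature names are placeholders and the target is synthetic.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = pd.DataFrame({
    'TS_pct_diff': rng.normal(0, 0.03, 600),
    'TOV_rate_diff': rng.normal(0, 1.5, 600),
    'REB_rate_diff': rng.normal(0, 2.0, 600),
})
y = (X['TS_pct_diff'] * 30 + rng.normal(0, 1, 600) > 0).astype(int).values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

gbm = BasketballGradientBoosting(task='classification', use_xgboost=False, n_estimators=200)
gbm.fit(X_train, y_train)
print(gbm.get_feature_importance())

tuning = gbm.tune_hyperparameters(
    X_train, y_train, cv=3,
    param_grid={'n_estimators': [100, 200], 'max_depth': [2, 3]}
)
print(tuning['best_params'], round(tuning['best_score'], 3))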
26.5 Neural Networks for Basketball Applications
Neural networks excel at capturing complex, non-linear relationships in data. While they require more data and computational resources than simpler methods, they're increasingly important for advanced basketball applications.
26.5.1 Basic Neural Network Implementation
from sklearn.neural_network import MLPClassifier, MLPRegressor
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
class BasketballNeuralNetwork:
"""
Neural network implementation for basketball analytics.
Provides both scikit-learn (simple) and Keras (flexible) implementations.
"""
def __init__(self, task='classification', use_keras=False,
hidden_layers=(64, 32), input_dim=None):
"""
Initialize neural network.
Parameters:
-----------
task : str
'classification' or 'regression'
use_keras : bool
Whether to use Keras (vs scikit-learn)
hidden_layers : tuple
Number of neurons in each hidden layer
input_dim : int
Input dimension (required for Keras)
"""
self.task = task
self.use_keras = use_keras
self.scaler = StandardScaler()
if use_keras:
self.model = self._build_keras_model(hidden_layers, input_dim, task)
else:
if task == 'classification':
self.model = MLPClassifier(
hidden_layer_sizes=hidden_layers,
activation='relu',
solver='adam',
max_iter=500,
random_state=42,
early_stopping=True,
validation_fraction=0.1
)
else:
self.model = MLPRegressor(
hidden_layer_sizes=hidden_layers,
activation='relu',
solver='adam',
max_iter=500,
random_state=42,
early_stopping=True
)
def _build_keras_model(self, hidden_layers, input_dim, task):
"""
Build a Keras neural network.
Parameters:
-----------
hidden_layers : tuple
Neurons per hidden layer
input_dim : int
Number of input features
task : str
'classification' or 'regression'
Returns:
--------
keras.Model
Compiled Keras model
"""
model = keras.Sequential()
# Input layer
model.add(layers.InputLayer(input_shape=(input_dim,)))
# Hidden layers with dropout for regularization
for i, units in enumerate(hidden_layers):
model.add(layers.Dense(units, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.3))
# Output layer
if task == 'classification':
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy']
)
else:
model.add(layers.Dense(1))
model.compile(
optimizer='adam',
loss='mse',
metrics=['mae']
)
return model
def fit(self, X, y, epochs=100, batch_size=32, validation_split=0.2):
"""
Train the neural network.
Parameters:
-----------
X : pd.DataFrame or np.ndarray
Training features
y : np.ndarray
Training target
epochs : int
Number of training epochs (Keras only)
batch_size : int
Batch size (Keras only)
validation_split : float
Fraction for validation (Keras only)
"""
X_scaled = self.scaler.fit_transform(X)
if self.use_keras:
# Early stopping callback
early_stop = keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
restore_best_weights=True
)
self.history = self.model.fit(
X_scaled, y,
epochs=epochs,
batch_size=batch_size,
validation_split=validation_split,
callbacks=[early_stop],
verbose=0
)
else:
self.model.fit(X_scaled, y)
return self
def predict(self, X):
"""Generate predictions."""
X_scaled = self.scaler.transform(X)
if self.use_keras:
predictions = self.model.predict(X_scaled, verbose=0)
if self.task == 'classification':
return (predictions > 0.5).astype(int).flatten()
return predictions.flatten()
return self.model.predict(X_scaled)
def predict_proba(self, X):
"""Get prediction probabilities (classification only)."""
X_scaled = self.scaler.transform(X)
if self.use_keras:
return self.model.predict(X_scaled, verbose=0)
return self.model.predict_proba(X_scaled)
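A minimal usage sketch with the scikit-learn backend (use_keras=False), which keeps the example light; the features and target are synthetic placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
X = pd.DataFrame({
    'NET_RTG_diff': rng.normal(0, 5, 800),
    'PACE_diff': rng.normal(0, 2, 800),
    'rest_diff': rng.integers(-3, 4, 800).astype(float),
})
y = (X['NET_RTG_diff'] + rng.normal(0, 4, 800) > 0).astype(int).values

net = BasketballNeuralNetwork(task='classification', use_keras=False, hidden_layers=(32, 16))
net.fit(X, y)
print(net.predict(X.head()))
print(net.predict_proba(X.head()).round(3))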
class PlayerTrajectoryNetwork:
"""
Recurrent neural network for player trajectory/career prediction.
This handles sequential data like season-by-season statistics
to predict future performance.
"""
def __init__(self, sequence_length, n_features, prediction_type='regression'):
"""
Initialize trajectory prediction network.
Parameters:
-----------
sequence_length : int
Number of past seasons to consider
n_features : int
Number of statistical features per season
prediction_type : str
'regression' or 'classification'
"""
self.sequence_length = sequence_length
self.n_features = n_features
self.prediction_type = prediction_type
self.scaler = StandardScaler()
self.model = self._build_model()
def _build_model(self):
"""Build LSTM model for trajectory prediction."""
model = keras.Sequential([
layers.LSTM(64, return_sequences=True,
input_shape=(self.sequence_length, self.n_features)),
layers.Dropout(0.3),
layers.LSTM(32),
layers.Dropout(0.3),
layers.Dense(16, activation='relu'),
layers.Dense(1, activation='sigmoid' if self.prediction_type == 'classification' else None)
])
loss = 'binary_crossentropy' if self.prediction_type == 'classification' else 'mse'
model.compile(optimizer='adam', loss=loss)
return model
def prepare_sequences(self, player_seasons_df, target_column):
"""
Prepare sequential data from player seasons.
Parameters:
-----------
player_seasons_df : pd.DataFrame
DataFrame with player_id, season, and statistics
target_column : str
Column to predict
Returns:
--------
tuple
(X_sequences, y_targets)
"""
sequences = []
targets = []
for player_id in player_seasons_df['player_id'].unique():
player_data = player_seasons_df[
player_seasons_df['player_id'] == player_id
].sort_values('season')
feature_cols = [c for c in player_data.columns
if c not in ['player_id', 'season', target_column]]
# Create sequences
for i in range(len(player_data) - self.sequence_length):
seq = player_data.iloc[i:i + self.sequence_length][feature_cols].values
target = player_data.iloc[i + self.sequence_length][target_column]
sequences.append(seq)
targets.append(target)
return np.array(sequences), np.array(targets)
def fit(self, X_sequences, y_targets, epochs=50, batch_size=32):
"""Train the trajectory model."""
# Scale features
original_shape = X_sequences.shape
X_flat = X_sequences.reshape(-1, X_sequences.shape[-1])
X_scaled = self.scaler.fit_transform(X_flat)
X_sequences_scaled = X_scaled.reshape(original_shape)
self.model.fit(
X_sequences_scaled, y_targets,
epochs=epochs,
batch_size=batch_size,
validation_split=0.2,
verbose=0
)
return self
26.6 Ensemble Methods
Ensemble methods combine multiple models to achieve better performance than any single model. In basketball analytics, ensembles are particularly valuable for reducing prediction variance.
26.6.1 Building Effective Ensembles
from sklearn.ensemble import VotingClassifier, VotingRegressor, StackingClassifier
from sklearn.linear_model import LogisticRegression, Ridge
class BasketballEnsemble:
"""
Ensemble methods for basketball prediction.
Combines multiple models to achieve robust predictions.
"""
def __init__(self, task='classification', ensemble_type='voting'):
"""
Initialize ensemble.
Parameters:
-----------
task : str
'classification' or 'regression'
ensemble_type : str
'voting' or 'stacking'
"""
self.task = task
self.ensemble_type = ensemble_type
self.base_models = self._create_base_models()
self.ensemble = self._create_ensemble()
def _create_base_models(self):
"""Create diverse base models."""
if self.task == 'classification':
return [
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
('lr', LogisticRegression(random_state=42, max_iter=1000)),
('mlp', MLPClassifier(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500))
]
else:
return [
('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
('gb', GradientBoostingRegressor(n_estimators=100, random_state=42)),
('ridge', Ridge(random_state=42)),
('mlp', MLPRegressor(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500))
]
def _create_ensemble(self):
"""Create the ensemble model."""
if self.ensemble_type == 'voting':
if self.task == 'classification':
return VotingClassifier(
estimators=self.base_models,
voting='soft' # Use probability averaging
)
else:
return VotingRegressor(estimators=self.base_models)
else: # stacking
if self.task == 'classification':
return StackingClassifier(
estimators=self.base_models,
final_estimator=LogisticRegression(random_state=42),
cv=5
)
else:
from sklearn.ensemble import StackingRegressor
return StackingRegressor(
estimators=self.base_models,
final_estimator=Ridge(random_state=42),
cv=5
)
def fit(self, X, y):
"""Fit the ensemble."""
self.scaler = StandardScaler()
X_scaled = self.scaler.fit_transform(X)
self.ensemble.fit(X_scaled, y)
return self
def predict(self, X):
"""Generate predictions."""
X_scaled = self.scaler.transform(X)
return self.ensemble.predict(X_scaled)
def get_model_contributions(self, X, y):
"""
Analyze individual model contributions.
Returns:
--------
pd.DataFrame
Performance metrics for each base model
"""
from sklearn.metrics import r2_score
X_scaled = self.scaler.transform(X)
results = []
# Evaluate the fitted copies stored on the ensemble; the entries in
# self.base_models are unfitted templates
for name, model in self.ensemble.named_estimators_.items():
y_pred = model.predict(X_scaled)
if self.task == 'classification':
score = accuracy_score(y, y_pred)
else:
score = r2_score(y, y_pred)
results.append({'model': name, 'score': score})
# Add ensemble performance
y_ensemble = self.ensemble.predict(X_scaled)
if self.task == 'classification':
ensemble_score = accuracy_score(y, y_ensemble)
else:
ensemble_score = r2_score(y, y_ensemble)
results.append({'model': 'ensemble', 'score': ensemble_score})
return pd.DataFrame(results)
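A short usage sketch on synthetic data; the feature names are placeholders, and the hold-out split is only there to show the API.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(13)
X = pd.DataFrame({
    'NET_RTG_diff': rng.normal(0, 5, 600),
    'TS_pct_diff': rng.normal(0, 0.03, 600),
    'rest_diff': rng.integers(-3, 4, 600).astype(float),
})
y = (X['NET_RTG_diff'] + rng.normal(0, 4, 600) > 0).astype(int).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

ensemble = BasketballEnsemble(task='classification', ensemble_type='voting')
ensemble.fit(X_train, y_train)
print(ensemble.predict(X_test[:5]))
print(ensemble.get_model_contributions(X_test, y_test))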
26.6.2 Model Blending for Basketball
class BasketballModelBlender:
"""
Blend predictions from multiple models using learned weights.
This approach learns optimal weights for combining model predictions
based on validation performance.
"""
def __init__(self, models, task='classification'):
"""
Initialize blender.
Parameters:
-----------
models : list
List of (name, model) tuples
task : str
'classification' or 'regression'
"""
self.models = models
self.task = task
self.weights = None
self.scalers = {}
def fit(self, X_train, y_train, X_val, y_val):
"""
Fit base models and learn blending weights.
Parameters:
-----------
X_train : pd.DataFrame
Training features
y_train : np.ndarray
Training target
X_val : pd.DataFrame
Validation features
y_val : np.ndarray
Validation target
"""
# Fit each model
for name, model in self.models:
self.scalers[name] = StandardScaler()
X_train_scaled = self.scalers[name].fit_transform(X_train)
model.fit(X_train_scaled, y_train)
# Get validation predictions
val_predictions = []
for name, model in self.models:
X_val_scaled = self.scalers[name].transform(X_val)
if self.task == 'classification' and hasattr(model, 'predict_proba'):
pred = model.predict_proba(X_val_scaled)[:, 1]
else:
pred = model.predict(X_val_scaled)
val_predictions.append(pred)
# Stack predictions
val_predictions = np.column_stack(val_predictions)
# Learn optimal weights (constrained optimization)
from scipy.optimize import minimize
def loss_function(weights):
weights = weights / weights.sum() # Normalize
blended = np.dot(val_predictions, weights)
if self.task == 'classification':
# Log loss
eps = 1e-7
blended = np.clip(blended, eps, 1 - eps)
return -np.mean(y_val * np.log(blended) + (1 - y_val) * np.log(1 - blended))
else:
# MSE
return np.mean((y_val - blended) ** 2)
# Initial weights (equal)
initial_weights = np.ones(len(self.models)) / len(self.models)
# Bounds (non-negative weights)
bounds = [(0, 1) for _ in range(len(self.models))]
result = minimize(loss_function, initial_weights, bounds=bounds, method='SLSQP')
self.weights = result.x / result.x.sum()
return self
def predict(self, X):
"""Generate blended predictions."""
predictions = []
for name, model in self.models:
X_scaled = self.scalers[name].transform(X)
if self.task == 'classification' and hasattr(model, 'predict_proba'):
pred = model.predict_proba(X_scaled)[:, 1]
else:
pred = model.predict(X_scaled)
predictions.append(pred)
predictions = np.column_stack(predictions)
blended = np.dot(predictions, self.weights)
if self.task == 'classification':
return (blended > 0.5).astype(int)
return blended
def get_weights_summary(self):
"""Get model weights summary."""
return pd.DataFrame({
'model': [name for name, _ in self.models],
'weight': self.weights
}).sort_values('weight', ascending=False)
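The blender needs a separate validation split to learn its weights, as the minimal sketch below shows on synthetic data with placeholder feature names.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(17)
X = pd.DataFrame({
    'NET_RTG_diff': rng.normal(0, 5, 800),
    'PACE_diff': rng.normal(0, 2, 800),
})
y = (X['NET_RTG_diff'] + rng.normal(0, 4, 800) > 0).astype(int).values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

blender = BasketballModelBlender(
    models=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(random_state=42)),
    ],
    task='classification'
)
blender.fit(X_train, y_train, X_val, y_val)
print(blender.get_weights_summary())
print(blender.predict(X_val[:5]))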
26.7 Feature Importance and Model Interpretability
Understanding why models make predictions is crucial in basketball analytics, where stakeholders need to trust and act on model outputs.
26.7.1 Comprehensive Feature Importance Analysis
class BasketballModelInterpreter:
"""
Tools for interpreting machine learning models in basketball context.
Provides multiple methods for understanding feature importance
and model behavior.
"""
def __init__(self, model, feature_columns):
self.model = model
self.feature_columns = feature_columns
def gini_importance(self):
"""
Get Gini (impurity-based) feature importance.
Note: This is fast but can be biased toward high-cardinality features.
"""
if hasattr(self.model, 'feature_importances_'):
return pd.Series(
self.model.feature_importances_,
index=self.feature_columns
).sort_values(ascending=False)
else:
raise ValueError("Model does not have feature_importances_ attribute")
def permutation_importance(self, X, y, n_repeats=10):
"""
Calculate permutation importance.
More reliable than Gini importance but slower.
"""
result = permutation_importance(
self.model, X, y,
n_repeats=n_repeats,
random_state=42,
n_jobs=-1
)
return pd.DataFrame({
'importance_mean': result.importances_mean,
'importance_std': result.importances_std
}, index=self.feature_columns).sort_values('importance_mean', ascending=False)
def shap_analysis(self, X, sample_size=100):
"""
SHAP (SHapley Additive exPlanations) analysis.
Provides both global feature importance and local explanations.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
sample_size : int
Number of samples to use for SHAP calculation
Returns:
--------
dict
SHAP values and summary
"""
import shap
# Sample for efficiency
if len(X) > sample_size:
X_sample = X.sample(sample_size, random_state=42)
else:
X_sample = X
# Create explainer based on model type
if hasattr(self.model, 'estimators_'):
# Tree-based model
explainer = shap.TreeExplainer(self.model)
else:
# Use kernel explainer as fallback
explainer = shap.KernelExplainer(
self.model.predict,
shap.sample(X, 50)
)
shap_values = explainer.shap_values(X_sample)
# Handle multi-output (classification with multiple classes)
if isinstance(shap_values, list):
shap_values = shap_values[1] # Use positive class for binary
# Global importance (mean absolute SHAP values)
global_importance = pd.Series(
np.abs(shap_values).mean(axis=0),
index=self.feature_columns
).sort_values(ascending=False)
return {
'shap_values': shap_values,
'global_importance': global_importance,
'X_sample': X_sample
}
def partial_dependence(self, X, feature, grid_resolution=50):
"""
Calculate partial dependence for a feature.
Shows the marginal effect of a feature on predictions.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
feature : str
Feature name to analyze
grid_resolution : int
Number of points in the grid
Returns:
--------
dict
Feature values and corresponding average predictions
"""
from sklearn.inspection import partial_dependence as pd_sklearn
feature_idx = self.feature_columns.index(feature)
result = pd_sklearn(
self.model, X,
features=[feature_idx],
grid_resolution=grid_resolution
)
# Newer scikit-learn versions return the grid under 'grid_values'
grid_key = 'grid_values' if 'grid_values' in result else 'values'
return {
'values': result[grid_key][0],
'predictions': result['average'][0]
}
def explain_prediction(self, X_single, top_n=5):
"""
Explain a single prediction.
Parameters:
-----------
X_single : pd.DataFrame
Single-row feature dataframe
top_n : int
Number of top contributing features to show
Returns:
--------
dict
Prediction explanation
"""
import shap
prediction = self.model.predict(X_single)[0]
if hasattr(self.model, 'predict_proba'):
proba = self.model.predict_proba(X_single)[0]
else:
proba = None
# Get SHAP values for this prediction
if hasattr(self.model, 'estimators_'):
explainer = shap.TreeExplainer(self.model)
else:
# Simple approximation using feature importance
return {
'prediction': prediction,
'probability': proba,
'note': 'SHAP analysis not available for this model type'
}
shap_values = explainer.shap_values(X_single)
if isinstance(shap_values, list):
shap_values = shap_values[1] # Binary classification
# Get top contributors
shap_series = pd.Series(shap_values[0], index=self.feature_columns)
top_positive = shap_series.nlargest(top_n)
top_negative = shap_series.nsmallest(top_n)
return {
'prediction': prediction,
'probability': proba,
'top_positive_contributors': top_positive.to_dict(),
'top_negative_contributors': top_negative.to_dict(),
'feature_values': X_single.iloc[0].to_dict()
}
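The SHAP-based methods require the optional shap package, so the sketch below sticks to Gini importance, permutation importance, and partial dependence on a synthetic random-forest example with placeholder feature names.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(19)
feature_cols = ['NET_RTG_diff', 'PACE_diff', 'rest_diff']
X = pd.DataFrame({
    'NET_RTG_diff': rng.normal(0, 5, 500),
    'PACE_diff': rng.normal(0, 2, 500),
    'rest_diff': rng.integers(-3, 4, 500).astype(float),
})
y = (X['NET_RTG_diff'] + rng.normal(0, 4, 500) > 0).astype(int).values

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
interpreter = BasketballModelInterpreter(model, feature_cols)

print(interpreter.gini_importance())
print(interpreter.permutation_importance(X, y, n_repeats=5).head())
pdp = interpreter.partial_dependence(X, 'NET_RTG_diff')
print(pdp['values'][:5], pdp['predictions'][:5])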
26.8 Cross-Validation Strategies for Basketball Data
Basketball data presents unique challenges for cross-validation due to its temporal nature and hierarchical structure.
26.8.1 Time-Aware Cross-Validation
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
class BasketballCrossValidator:
"""
Cross-validation strategies appropriate for basketball data.
Handles temporal dependencies, seasonal structure, and
the need to predict future based on past.
"""
def __init__(self, cv_type='time_series'):
"""
Initialize cross-validator.
Parameters:
-----------
cv_type : str
'time_series', 'season_aware', or 'stratified'
"""
self.cv_type = cv_type
def time_series_cv(self, X, y, model, n_splits=5, test_size=None):
"""
Time series cross-validation.
Ensures training data always precedes test data temporally.
Essential for game prediction and season-level analysis.
Parameters:
-----------
X : pd.DataFrame
Features (should be sorted by time)
y : np.ndarray
Target
model : estimator
Scikit-learn compatible model
n_splits : int
Number of splits
test_size : int, optional
Fixed test size for each split
Returns:
--------
dict
Cross-validation results
"""
tscv = TimeSeriesSplit(n_splits=n_splits, test_size=test_size)
scores = []
predictions_all = []
for train_idx, test_idx in tscv.split(X):
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit and predict
model_clone = clone(model)
model_clone.fit(X_train_scaled, y_train)
y_pred = model_clone.predict(X_test_scaled)
# Score
if hasattr(model, 'predict_proba'):
score = accuracy_score(y_test, y_pred)
else:
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
scores.append(score)
predictions_all.extend(zip(test_idx, y_pred, y_test))
return {
'scores': scores,
'mean_score': np.mean(scores),
'std_score': np.std(scores),
'predictions': predictions_all
}
def season_aware_cv(self, X, y, season_column, model, min_train_seasons=3):
"""
Cross-validation respecting season boundaries.
Trains on complete seasons and tests on subsequent seasons.
Parameters:
-----------
X : pd.DataFrame
Features with season column
y : np.ndarray
Target
season_column : str
Name of column containing season identifier
model : estimator
Model to evaluate
min_train_seasons : int
Minimum number of training seasons
Returns:
--------
dict
Season-by-season performance
"""
seasons = sorted(X[season_column].unique())
results = []
for i in range(min_train_seasons, len(seasons)):
# Train on all previous seasons
train_seasons = seasons[:i]
test_season = seasons[i]
train_mask = X[season_column].isin(train_seasons)
test_mask = X[season_column] == test_season
X_train = X[train_mask].drop(columns=[season_column])
X_test = X[test_mask].drop(columns=[season_column])
y_train = y[train_mask]
y_test = y[test_mask]
# Scale and fit
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_clone = clone(model)
model_clone.fit(X_train_scaled, y_train)
y_pred = model_clone.predict(X_test_scaled)
# Score
if hasattr(model, 'predict_proba'):
score = accuracy_score(y_test, y_pred)
else:
score = np.mean((y_test - y_pred) ** 2)
results.append({
'test_season': test_season,
'train_seasons': train_seasons,
'score': score,
'n_train': len(y_train),
'n_test': len(y_test)
})
return {
'results': results,
'mean_score': np.mean([r['score'] for r in results])
}
def grouped_cv(self, X, y, group_column, model, n_splits=5):
"""
Cross-validation where groups (e.g., teams, players) don't leak.
Ensures that all data for a group is in either train or test,
never split between them.
Parameters:
-----------
X : pd.DataFrame
Features with group column
y : np.ndarray
Target
group_column : str
Column defining groups
model : estimator
Model to evaluate
n_splits : int
Number of CV folds
"""
from sklearn.model_selection import GroupKFold
groups = X[group_column]
X_features = X.drop(columns=[group_column])
gkf = GroupKFold(n_splits=n_splits)
scores = []
for train_idx, test_idx in gkf.split(X_features, y, groups):
X_train, X_test = X_features.iloc[train_idx], X_features.iloc[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_clone = clone(model)
model_clone.fit(X_train_scaled, y_train)
y_pred = model_clone.predict(X_test_scaled)
if hasattr(model, 'predict_proba'):
score = accuracy_score(y_test, y_pred)
else:
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
scores.append(score)
return {
'scores': scores,
'mean_score': np.mean(scores),
'std_score': np.std(scores)
}
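Here is a minimal sketch of the two time-aware strategies on a synthetic game table; the season values and feature names are placeholders, and the rows are assumed to be in chronological order.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(23)
n_games = 1200
X = pd.DataFrame({
    'season': np.repeat([2020, 2021, 2022, 2023], n_games // 4),
    'NET_RTG_diff': rng.normal(0, 5, n_games),
    'PACE_diff': rng.normal(0, 2, n_games),
})
y = (X['NET_RTG_diff'] + rng.normal(0, 4, n_games) > 0).astype(int).values

validator = BasketballCrossValidator(cv_type='time_series')
ts_results = validator.time_series_cv(
    X.drop(columns=['season']), y, LogisticRegression(max_iter=1000), n_splits=5
)
print(round(ts_results['mean_score'], 3))

season_results = validator.season_aware_cv(
    X, y, season_column='season', model=LogisticRegression(max_iter=1000), min_train_seasons=2
)
print(round(season_results['mean_score'], 3))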
26.9 Handling Time-Series Aspects
Basketball data is inherently temporal. Proper handling of time dependencies is crucial for valid predictions.
26.9.1 Feature Engineering for Time Series
class TemporalBasketballFeatures:
"""
Create time-aware features for basketball data.
Handles rolling averages, trends, and recency weighting.
"""
def __init__(self, date_column='date', entity_column='team'):
self.date_column = date_column
self.entity_column = entity_column
def add_rolling_features(self, df, stat_columns, windows=[5, 10, 20]):
"""
Add rolling average features.
Parameters:
-----------
df : pd.DataFrame
Data sorted by date
stat_columns : list
Columns to create rolling averages for
windows : list
Rolling window sizes
Returns:
--------
pd.DataFrame
DataFrame with rolling features added
"""
df = df.sort_values([self.entity_column, self.date_column])
for window in windows:
for stat in stat_columns:
col_name = f'{stat}_rolling_{window}'
df[col_name] = df.groupby(self.entity_column)[stat].transform(
lambda x: x.shift(1).rolling(window, min_periods=1).mean()
)
return df
def add_trend_features(self, df, stat_columns, window=10):
"""
Add trend (slope) features.
Captures whether a statistic is improving or declining.
Parameters:
-----------
df : pd.DataFrame
Sorted data
stat_columns : list
Columns to calculate trends for
window : int
Window for trend calculation
Returns:
--------
pd.DataFrame
DataFrame with trend features
"""
from scipy.stats import linregress
def calculate_trend(series):
if len(series) < 3:
return 0
x = np.arange(len(series))
try:
slope, _, _, _, _ = linregress(x, series)
return slope
except Exception:
return 0
df = df.sort_values([self.entity_column, self.date_column])
for stat in stat_columns:
col_name = f'{stat}_trend_{window}'
df[col_name] = df.groupby(self.entity_column)[stat].transform(
lambda x: x.shift(1).rolling(window, min_periods=3).apply(calculate_trend)
)
return df
def add_recency_weighted_features(self, df, stat_columns, half_life=5):
"""
Add exponentially weighted features.
More recent games count more heavily.
Parameters:
-----------
df : pd.DataFrame
Sorted data
stat_columns : list
Columns to weight
half_life : int
Number of games for weight to decay by half
"""
df = df.sort_values([self.entity_column, self.date_column])
for stat in stat_columns:
col_name = f'{stat}_ewm_{half_life}'
df[col_name] = df.groupby(self.entity_column)[stat].transform(
lambda x: x.shift(1).ewm(halflife=half_life, min_periods=1).mean()
)
return df
def add_rest_days(self, df):
"""Add feature for days of rest since last game."""
df = df.sort_values([self.entity_column, self.date_column])
df['rest_days'] = df.groupby(self.entity_column)[self.date_column].transform(
lambda x: x.diff().dt.days
)
df['rest_days'] = df['rest_days'].fillna(3) # Assume 3 days for first game
# Cap at reasonable maximum
df['rest_days'] = df['rest_days'].clip(upper=10)
return df
def add_schedule_features(self, df):
"""Add features related to schedule (back-to-backs, etc.)."""
df = df.sort_values([self.entity_column, self.date_column])
# Back to back indicator
df['is_back_to_back'] = (df['rest_days'] == 1).astype(int)
# Games in last 7 days
df['games_last_7_days'] = df.groupby(self.entity_column)[self.date_column].transform(
lambda x: pd.Series(1, index=x).rolling('7D').count().values
)
return df
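A short usage sketch follows, built on a synthetic game log for two teams; the dates and ratings are illustrative, and the rolling and trend features are lagged by one game so they never peek at the current row.
import numpy as np
import pandas as pd

rng = np.random.default_rng(29)
dates = pd.date_range('2024-01-01', periods=40, freq='2D')
log = pd.DataFrame({
    'team': ['Team A'] * 40 + ['Team B'] * 40,
    'date': list(dates) + list(dates),
    'OFF_RTG': rng.normal(112, 6, 80),
    'DEF_RTG': rng.normal(112, 6, 80),
})

temporal = TemporalBasketballFeatures(date_column='date', entity_column='team')
log = temporal.add_rolling_features(log, ['OFF_RTG', 'DEF_RTG'], windows=[5, 10])
log = temporal.add_trend_features(log, ['OFF_RTG'], window=10)
log = temporal.add_recency_weighted_features(log, ['OFF_RTG'], half_life=5)
log = temporal.add_rest_days(log)
log = temporal.add_schedule_features(log)
print(log[['team', 'date', 'OFF_RTG_rolling_5', 'OFF_RTG_trend_10',
           'rest_days', 'games_last_7_days']].tail())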
26.9.2 Time Series Models for Basketball
class BasketballTimeSeriesModel:
"""
Time series modeling for basketball predictions.
Combines traditional time series methods with ML approaches.
"""
def __init__(self, model_type='arima'):
"""
Initialize time series model.
Parameters:
-----------
model_type : str
'arima', 'prophet', or 'lstm'
"""
self.model_type = model_type
self.models = {}
def fit_arima(self, series, order=(1, 1, 1)):
"""
Fit ARIMA model to a series.
Parameters:
-----------
series : pd.Series
Time series data
order : tuple
(p, d, q) order for ARIMA
"""
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(series, order=order)
fitted = model.fit()
return fitted
def fit_by_entity(self, df, entity_column, target_column, date_column):
"""
Fit separate models for each entity (team/player).
Parameters:
-----------
df : pd.DataFrame
Data with entity, date, and target columns
entity_column : str
Column identifying entities
target_column : str
Target variable to predict
date_column : str
Date column
"""
for entity in df[entity_column].unique():
entity_data = df[df[entity_column] == entity].sort_values(date_column)
series = entity_data.set_index(date_column)[target_column]
try:
self.models[entity] = self.fit_arima(series)
except Exception as e:
print(f"Could not fit model for {entity}: {e}")
return self
def forecast(self, entity, steps=5):
"""
Generate forecast for an entity.
Parameters:
-----------
entity : str
Entity to forecast
steps : int
Number of steps ahead to forecast
Returns:
--------
np.ndarray
Forecasted values
"""
if entity not in self.models:
raise ValueError(f"No model fitted for {entity}")
return self.models[entity].forecast(steps=steps)
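As a quick illustration, the sketch below fits one ARIMA(1, 1, 1) model per team on synthetic offensive-rating data and forecasts the next three games. The data is fabricated purely for demonstration, and statsmodels must be installed.
import numpy as np
import pandas as pd
# Synthetic offensive ratings for two teams, 30 games each
rng = np.random.default_rng(0)
dates = pd.date_range('2024-01-01', periods=30, freq='2D')
team_games = pd.concat([
    pd.DataFrame({'date': dates, 'team': team,
                  'ORtg': 112 + rng.normal(0, 3, size=30)})
    for team in ['BOS', 'DEN']
], ignore_index=True)
ts_model = BasketballTimeSeriesModel(model_type='arima')
ts_model.fit_by_entity(team_games, entity_column='team',
                       target_column='ORtg', date_column='date')
print(ts_model.forecast('BOS', steps=3))  # projected ORtg for the next three games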
26.10 When ML Beats Simple Models (and When It Doesn't)
Perhaps the most important skill in applied machine learning is knowing when to use it. Complex models are not always superior.
26.10.1 Comparing Simple and Complex Approaches
class ModelComplexityAnalyzer:
"""
Compare simple and complex models for basketball problems.
Helps determine when sophisticated ML is warranted.
"""
def __init__(self):
self.results = []
def compare_models(self, X, y, cv_splits=5, task='classification'):
"""
Compare models of varying complexity.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
y : np.ndarray
Target values
cv_splits : int
Number of cross-validation splits
task : str
'classification' or 'regression'
Returns:
--------
pd.DataFrame
Comparison of model performance
"""
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.linear_model import LogisticRegression, Ridge
if task == 'classification':
models = [
('Baseline (majority)', DummyClassifier(strategy='most_frequent')),
('Logistic Regression', LogisticRegression(random_state=42, max_iter=1000)),
('Random Forest', RandomForestClassifier(n_estimators=100, random_state=42)),
('Gradient Boosting', GradientBoostingClassifier(n_estimators=100, random_state=42)),
('Neural Network', MLPClassifier(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500))
]
scoring = 'accuracy'
else:
models = [
('Baseline (mean)', DummyRegressor(strategy='mean')),
('Linear Regression', Ridge(random_state=42)),
('Random Forest', RandomForestRegressor(n_estimators=100, random_state=42)),
('Gradient Boosting', GradientBoostingRegressor(n_estimators=100, random_state=42)),
('Neural Network', MLPRegressor(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500))
]
scoring = 'r2'
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
results = []
for name, model in models:
try:
scores = cross_val_score(model, X_scaled, y, cv=cv_splits, scoring=scoring)
results.append({
'model': name,
'mean_score': scores.mean(),
'std_score': scores.std(),
'complexity': self._estimate_complexity(model)
})
except Exception as e:
print(f"Error with {name}: {e}")
return pd.DataFrame(results).sort_values('mean_score', ascending=False)
def _estimate_complexity(self, model):
"""Estimate model complexity for comparison."""
if 'Dummy' in str(type(model)):
return 1
elif 'Logistic' in str(type(model)) or 'Ridge' in str(type(model)):
return 2
elif 'RandomForest' in str(type(model)):
return 3
elif 'GradientBoosting' in str(type(model)):
return 4
elif 'MLP' in str(type(model)):
return 5
return 3
def learning_curve_analysis(self, X, y, model, train_sizes=None):
"""
Analyze how model performance changes with data size.
Helps identify if more data would help or if we're overfitting.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
y : np.ndarray
Target
model : estimator
Model to analyze
train_sizes : list, optional
Fractions of training data to use
Returns:
--------
dict
Learning curve results
"""
from sklearn.model_selection import learning_curve
if train_sizes is None:
train_sizes = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
        train_sizes_abs, train_scores, test_scores = learning_curve(
            model, X_scaled, y,
            train_sizes=train_sizes,
            cv=5,
            n_jobs=-1,
            shuffle=True,  # random_state only takes effect when shuffling
            random_state=42
        )
return {
'train_sizes': train_sizes_abs,
'train_scores_mean': train_scores.mean(axis=1),
'train_scores_std': train_scores.std(axis=1),
'test_scores_mean': test_scores.mean(axis=1),
'test_scores_std': test_scores.std(axis=1)
}
def feature_ablation(self, X, y, model, task='classification'):
"""
Analyze impact of removing features.
Helps identify if simpler feature sets suffice.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
y : np.ndarray
Target
model : estimator
Model to use
task : str
'classification' or 'regression'
Returns:
--------
pd.DataFrame
Performance with each feature removed
"""
        from sklearn.model_selection import cross_val_score
        from sklearn.base import clone
scaler = StandardScaler()
scoring = 'accuracy' if task == 'classification' else 'r2'
# Baseline with all features
X_scaled = scaler.fit_transform(X)
baseline_scores = cross_val_score(model, X_scaled, y, cv=5, scoring=scoring)
baseline_mean = baseline_scores.mean()
results = [{'feature': 'ALL FEATURES', 'score': baseline_mean, 'delta': 0}]
# Remove each feature
for feature in X.columns:
X_reduced = X.drop(columns=[feature])
X_scaled = scaler.fit_transform(X_reduced)
scores = cross_val_score(clone(model), X_scaled, y, cv=5, scoring=scoring)
score_mean = scores.mean()
results.append({
'feature': feature,
'score': score_mean,
'delta': baseline_mean - score_mean # Positive = feature is important
})
return pd.DataFrame(results).sort_values('delta', ascending=False)
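A brief usage sketch follows, using a synthetic noisy classification dataset as a stand-in for a game-outcome problem; the ensemble classes and StandardScaler are assumed to be imported earlier in the chapter.
import pandas as pd
from sklearn.datasets import make_classification
# Synthetic stand-in for a game-outcome dataset: 600 games, 15 features, noisy labels
X_arr, y = make_classification(n_samples=600, n_features=15, n_informative=6,
                               flip_y=0.2, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(15)])
analyzer = ModelComplexityAnalyzer()
comparison = analyzer.compare_models(X, y, cv_splits=5, task='classification')
print(comparison[['model', 'mean_score', 'std_score', 'complexity']])
The key row is the baseline: if the tree ensembles and neural network clear it by roughly the same margin as logistic regression, the added complexity is hard to justify.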
26.10.2 Guidelines for Model Selection
The decision between simple and complex models should consider:
- Sample size: Complex models need more data. With fewer than 500 samples, simpler models often win.
- Signal-to-noise ratio: Basketball has inherent randomness. Complex models can overfit to noise.
- Interpretability needs: If stakeholders need to understand predictions, simpler models may be preferable.
- Deployment constraints: Complex models may be slower in production.
- Feature quality: With well-engineered features, simpler models often perform comparably.
def model_selection_guide(n_samples, n_features, interpretability_need='medium'):
"""
Provide model selection guidance based on problem characteristics.
Parameters:
-----------
n_samples : int
Number of training samples
n_features : int
Number of features
interpretability_need : str
'low', 'medium', or 'high'
Returns:
--------
dict
Recommended models and rationale
"""
recommendations = {
'primary': None,
'secondary': None,
'avoid': [],
'rationale': []
}
# Sample size considerations
if n_samples < 200:
recommendations['primary'] = 'Logistic Regression / Ridge'
recommendations['secondary'] = 'Simple Decision Tree'
recommendations['avoid'] = ['Neural Network', 'Deep Ensembles']
recommendations['rationale'].append(
f"Small sample size ({n_samples}) favors simple models to avoid overfitting"
)
elif n_samples < 1000:
recommendations['primary'] = 'Random Forest'
recommendations['secondary'] = 'Gradient Boosting (careful tuning)'
recommendations['avoid'] = ['Deep Neural Networks']
recommendations['rationale'].append(
f"Moderate sample size ({n_samples}) supports tree ensembles"
)
else:
recommendations['primary'] = 'Gradient Boosting'
recommendations['secondary'] = 'Neural Network'
recommendations['avoid'] = []
recommendations['rationale'].append(
f"Large sample size ({n_samples}) supports complex models"
)
# Feature considerations
if n_features > n_samples / 10:
recommendations['rationale'].append(
f"High feature-to-sample ratio ({n_features}/{n_samples}) - consider regularization"
)
if 'Logistic' not in recommendations['primary']:
recommendations['secondary'] = 'Regularized Linear Model'
# Interpretability
if interpretability_need == 'high':
recommendations['primary'] = 'Logistic Regression'
recommendations['secondary'] = 'Single Decision Tree'
recommendations['rationale'].append(
"High interpretability need prioritizes transparent models"
)
return recommendations
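For example, a mid-sized scouting dataset paired with a front office that needs transparent reasoning might produce a recommendation like this (the sample and feature counts are illustrative):
guide = model_selection_guide(n_samples=350, n_features=40,
                              interpretability_need='high')
print(guide['primary'])    # 'Logistic Regression': interpretability overrides the sample-size default
print(guide['rationale'])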
26.11 Complete Implementation Example
Let's bring everything together with a complete example of predicting All-Star selections:
class AllStarPredictionPipeline:
"""
Complete ML pipeline for predicting NBA All-Star selections.
Demonstrates end-to-end implementation including feature engineering,
model selection, training, and evaluation.
"""
def __init__(self):
self.feature_columns = None
self.models = {}
self.best_model = None
self.scaler = StandardScaler()
def engineer_features(self, player_df):
"""
Create features for All-Star prediction.
Parameters:
-----------
player_df : pd.DataFrame
Player season statistics
Returns:
--------
pd.DataFrame
DataFrame with engineered features
"""
df = player_df.copy()
# Per-game statistics
if 'G' in df.columns:
for stat in ['PTS', 'REB', 'AST', 'STL', 'BLK']:
if stat in df.columns:
df[f'{stat}_per_game'] = df[stat] / df['G']
# Efficiency metrics
if all(col in df.columns for col in ['PTS', 'FGA', 'FTA']):
df['TS_pct'] = df['PTS'] / (2 * (df['FGA'] + 0.44 * df['FTA'] + 0.001))
# Market size proxy (using team)
# In practice, you'd join with a team market size table
# Win contribution (if team wins available)
if 'team_wins' in df.columns:
df['on_winning_team'] = (df['team_wins'] > 41).astype(int)
# Previous All-Star selections (important predictor)
if 'prev_all_star' in df.columns:
df['is_previous_all_star'] = df['prev_all_star']
return df
def prepare_data(self, df, target_column='all_star'):
"""
Prepare data for modeling.
Parameters:
-----------
df : pd.DataFrame
Engineered feature DataFrame
target_column : str
Name of target column
Returns:
--------
tuple
(X, y, feature_columns)
"""
# Select numeric features
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Remove target and identifiers
exclude_cols = [target_column, 'player_id', 'season', 'team_id']
self.feature_columns = [c for c in numeric_cols if c not in exclude_cols]
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
y = df[target_column].values
return X, y
def train_and_evaluate(self, X, y, test_size=0.2):
"""
Train multiple models and evaluate them.
Parameters:
-----------
X : pd.DataFrame
Feature matrix
y : np.ndarray
Target values
test_size : float
Fraction for test set
Returns:
--------
pd.DataFrame
Model comparison results
"""
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=test_size, random_state=42, stratify=y
)
# Scale features
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
# Define models to compare
models_to_test = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
            'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
'Neural Network': MLPClassifier(hidden_layer_sizes=(64, 32), random_state=42, max_iter=500)
}
results = []
for name, model in models_to_test.items():
# Train
model.fit(X_train_scaled, y_train)
self.models[name] = model
# Predict
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
results.append({
'model': name,
'accuracy': accuracy_score(y_test, y_pred),
'precision': precision_score(y_test, y_pred),
'recall': recall_score(y_test, y_pred),
'f1': f1_score(y_test, y_pred),
'roc_auc': roc_auc_score(y_test, y_proba)
})
results_df = pd.DataFrame(results).sort_values('roc_auc', ascending=False)
# Select best model
best_model_name = results_df.iloc[0]['model']
self.best_model = self.models[best_model_name]
return results_df
def get_feature_importance(self):
"""Get feature importance from best model."""
if self.best_model is None:
raise ValueError("Must train models first")
if hasattr(self.best_model, 'feature_importances_'):
importance = self.best_model.feature_importances_
elif hasattr(self.best_model, 'coef_'):
importance = np.abs(self.best_model.coef_[0])
else:
return None
return pd.Series(
importance,
index=self.feature_columns
).sort_values(ascending=False)
def predict_all_stars(self, new_player_df, threshold=0.5):
"""
Predict All-Star selections for new data.
Parameters:
-----------
new_player_df : pd.DataFrame
New player statistics
threshold : float
Classification threshold
Returns:
--------
pd.DataFrame
Predictions with probabilities
"""
if self.best_model is None:
raise ValueError("Must train models first")
# Engineer features
df = self.engineer_features(new_player_df)
# Prepare features
X = df[self.feature_columns].fillna(df[self.feature_columns].median())
X_scaled = self.scaler.transform(X)
# Predict
probabilities = self.best_model.predict_proba(X_scaled)[:, 1]
predictions = (probabilities >= threshold).astype(int)
# Create results dataframe
        # Carry over identifying columns when available; otherwise keep the original index
        id_cols = [c for c in ['Player', 'Team'] if c in new_player_df.columns]
        results = new_player_df[id_cols].copy() if id_cols else pd.DataFrame(index=new_player_df.index)
results['all_star_probability'] = probabilities
results['predicted_all_star'] = predictions
return results.sort_values('all_star_probability', ascending=False)
# Usage example
def run_all_star_prediction_pipeline(player_data_path):
"""
Complete workflow for All-Star prediction.
Parameters:
-----------
player_data_path : str
Path to player statistics CSV
"""
# Load data
player_df = pd.read_csv(player_data_path)
# Initialize pipeline
pipeline = AllStarPredictionPipeline()
# Engineer features
df_featured = pipeline.engineer_features(player_df)
# Prepare data
X, y = pipeline.prepare_data(df_featured)
# Train and evaluate
results = pipeline.train_and_evaluate(X, y)
print("Model Comparison:")
print(results.to_string(index=False))
# Feature importance
importance = pipeline.get_feature_importance()
print("\nTop 10 Features:")
print(importance.head(10))
return pipeline
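With the pipeline trained on historical seasons, scoring a new season is one more call. The file names below are placeholders for whatever data source you use:
pipeline = run_all_star_prediction_pipeline('player_seasons_2010_2023.csv')  # placeholder path
new_season = pd.read_csv('player_seasons_2024.csv')                          # placeholder path
predictions = pipeline.predict_all_stars(new_season, threshold=0.5)
print(predictions.head(24))  # roughly two All-Star rosters' worth of candidates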
Summary
Machine learning offers powerful tools for basketball analytics, but successful application requires careful consideration of the problem structure, data characteristics, and organizational needs. Key takeaways from this chapter:
- Clustering reveals natural player groupings that may not align with traditional positions, offering insights into modern playing styles.
- Classification and regression problems are ubiquitous in basketball, from predicting game outcomes to forecasting player development.
- Tree-based ensembles (Random Forests, Gradient Boosting) are workhorses of basketball ML, offering strong performance with reasonable interpretability.
- Neural networks are increasingly important for complex pattern recognition but require more data and careful tuning.
- Feature engineering remains crucial—domain knowledge about basketball often matters more than algorithm sophistication.
- Cross-validation must respect the temporal and hierarchical structure of basketball data.
- Interpretability is essential in basketball contexts where decisions affect player careers and organizational strategy.
- Simpler models often suffice—always establish baselines and question whether complexity is warranted.
The most successful practitioners of machine learning in basketball combine computational expertise with deep understanding of the game. Algorithms reveal patterns; domain experts translate those patterns into actionable insights.
References
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
- Lundberg, S. M., & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems.
- Muthukrishan, S., & Edakunni, N. (2016). Machine Learning Applications in Sports Analytics. International Conference on Big Data Analytics.