Player Projection Systems
NBA Player Projection Systems
Player projection systems represent the cutting edge of NBA analytics, combining historical data, statistical modeling, and machine learning to forecast future player performance. These systems help teams make informed decisions about contracts, trades, and roster construction, while providing analysts and fans with data-driven expectations for player development.
Overview of Modern Projection Systems
FiveThirtyEight RAPTOR
RAPTOR (Robust Algorithm using Player Tracking and On/Off Ratings) is FiveThirtyEight's comprehensive player rating system that combines:
- Box score statistics: Traditional and advanced metrics from play-by-play data
- On-off ratings: Team performance with player on vs. off the court
- Player tracking data: Movement and positioning from SportVU cameras
- Aging curves: Expected improvement or decline based on player age
RAPTOR produces separate offensive and defensive ratings, as well as projections for future seasons based on historical aging patterns and regression to the mean.
FiveThirtyEight CARMELO
CARMELO (Career-Arc Regression Model Estimator with Local Optimization) was FiveThirtyEight's previous system that projected player career trajectories by:
- Comparing players to historical comparables with similar statistics at the same age
- Using thousands of player seasons to model typical career arcs
- Incorporating individual player characteristics and performance trends
- Producing probabilistic forecasts for multiple seasons ahead
Other Professional Systems
- ESPN RPM Projections: Based on Real Plus-Minus with ridge regression
- Basketball-Reference Similarity Scores: Historical player comparisons
- NBA Teams' Internal Models: Proprietary systems using private data
- Cleaning the Glass: Role-based projections accounting for context
Age Curves and Peak Performance
The Typical NBA Career Arc
Research consistently shows predictable patterns in NBA player performance across age:
- Ages 19-23: Rapid improvement as players develop skills and adjust to NBA pace
- Ages 24-28: Peak performance years with optimal blend of skills and athleticism
- Ages 27-29: Absolute peak for most players, particularly superstars
- Ages 30-33: Gradual decline, offset by experience and basketball IQ
- Ages 34+: Accelerated decline, role players often exit the league
Position-Specific Age Curves
Different positions show varying aging patterns:
- Point Guards: Longer peak due to playmaking skills and basketball IQ
- Wings: Earlier decline in athleticism-dependent areas (defense, transition)
- Centers: Sharp decline after age 30, particularly in mobility and rim protection
Skill-Specific Aging
Individual skills age at different rates:
- Shooting: Relatively stable, can improve into early 30s
- Athleticism: Declines steadily after age 27
- Defense: Sharp decline after age 29-30
- Playmaking: Stable or improving through experience
- Rebounding: Maintains well into early 30s
Building Projection Systems with Python
Data Collection and Preprocessing
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from nba_api.stats.endpoints import playercareerstats, leaguedashplayerstats
from nba_api.stats.static import players
# Fetch historical player data
def get_player_career_stats(player_id):
"""Retrieve complete career statistics for a player."""
career = playercareerstats.PlayerCareerStats(player_id=player_id)
df = career.get_data_frames()[0]
return df
# Collect comprehensive dataset
def build_projection_dataset(start_season='2010-11', end_season='2023-24'):
"""Build dataset with player stats, age, and next-season performance."""
all_players = players.get_players()
data = []
for player in all_players:
try:
career_df = get_player_career_stats(player['id'])
# Add age and calculate per-36 stats
for idx, row in career_df.iterrows():
age = row['PLAYER_AGE']
season = row['SEASON_ID']
# Skip if insufficient playing time
if row['MIN'] < 500:
continue
# Calculate per-36 minute stats
mins = row['MIN']
per_36_stats = {
'player_id': player['id'],
'player_name': player['full_name'],
'season': season,
'age': age,
'gp': row['GP'],
'pts_36': row['PTS'] / mins * 36,
'reb_36': row['REB'] / mins * 36,
'ast_36': row['AST'] / mins * 36,
'stl_36': row['STL'] / mins * 36,
'blk_36': row['BLK'] / mins * 36,
'tov_36': row['TOV'] / mins * 36,
'fg_pct': row['FG_PCT'],
'fg3_pct': row['FG3_PCT'],
'ft_pct': row['FT_PCT'],
'ts_pct': calculate_ts_pct(row),
'usage_pct': estimate_usage(row),
}
data.append(per_36_stats)
except:
continue
df = pd.DataFrame(data)
return df
def calculate_ts_pct(row):
"""Calculate True Shooting Percentage."""
if row['FGA'] + 0.44 * row['FTA'] == 0:
return 0
return row['PTS'] / (2 * (row['FGA'] + 0.44 * row['FTA']))
def estimate_usage(row):
"""Estimate usage rate from available box score stats."""
team_poss = row['MIN'] / 5 * 100 # Rough estimate
player_poss = row['FGA'] + 0.44 * row['FTA'] + row['TOV']
return player_poss / team_poss * 100 if team_poss > 0 else 0
Feature Engineering for Projections
def create_projection_features(df):
"""Engineer features for player projection model."""
df = df.sort_values(['player_id', 'season']).copy()
# Calculate career trajectory features
df['seasons_played'] = df.groupby('player_id').cumcount() + 1
df['career_pts_avg'] = df.groupby('player_id')['pts_36'].expanding().mean().values
df['career_pts_std'] = df.groupby('player_id')['pts_36'].expanding().std().values
# Age-based features
df['age_squared'] = df['age'] ** 2
df['age_cubed'] = df['age'] ** 3
df['years_from_peak'] = df['age'] - 27
# Recent performance trends (last 3 seasons)
for stat in ['pts_36', 'reb_36', 'ast_36', 'ts_pct']:
df[f'{stat}_ma3'] = df.groupby('player_id')[stat].rolling(3, min_periods=1).mean().values
df[f'{stat}_trend'] = df.groupby('player_id')[stat].diff()
# Calculate year-over-year changes
for stat in ['pts_36', 'reb_36', 'ast_36', 'ts_pct', 'usage_pct']:
df[f'{stat}_change'] = df.groupby('player_id')[stat].diff()
df[f'{stat}_pct_change'] = df.groupby('player_id')[stat].pct_change()
# Create target variable (next season performance)
df['next_pts_36'] = df.groupby('player_id')['pts_36'].shift(-1)
df['next_ts_pct'] = df.groupby('player_id')['ts_pct'].shift(-1)
df['next_usage'] = df.groupby('player_id')['usage_pct'].shift(-1)
# Remove rows without next season data
df = df.dropna(subset=['next_pts_36'])
return df
# Apply feature engineering
projection_df = create_projection_features(df)
# Define feature sets
basic_features = ['age', 'age_squared', 'pts_36', 'reb_36', 'ast_36',
'ts_pct', 'usage_pct']
advanced_features = basic_features + ['pts_36_ma3', 'pts_36_trend',
'career_pts_avg', 'seasons_played',
'years_from_peak']
# Prepare training data
X = projection_df[advanced_features]
y = projection_df['next_pts_36']
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Baseline Projection Model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=42
)
# Train multiple models
models = {
'Ridge': Ridge(alpha=1.0),
'Lasso': Lasso(alpha=0.1),
'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
}
results = {}
for name, model in models.items():
# Train model
model.fit(X_train, y_train)
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# Evaluate
results[name] = {
'train_rmse': np.sqrt(mean_squared_error(y_train, y_pred_train)),
'test_rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
'train_mae': mean_absolute_error(y_train, y_pred_train),
'test_mae': mean_absolute_error(y_test, y_pred_test),
'test_r2': r2_score(y_test, y_pred_test)
}
print(f"\n{name} Results:")
print(f"Test RMSE: {results[name]['test_rmse']:.3f}")
print(f"Test MAE: {results[name]['test_mae']:.3f}")
print(f"Test R²: {results[name]['test_r2']:.3f}")
# Feature importance (for tree-based models)
rf_model = models['Random Forest']
feature_importance = pd.DataFrame({
'feature': advanced_features,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))
Statistical Modeling with R
Age Curve Estimation
library(tidyverse)
library(mgcv)
library(lme4)
library(splines)
# Load player data
player_data <- read_csv("nba_player_stats.csv")
# Fit polynomial age curve model
age_curve_model <- lm(
pts_per_36 ~ poly(age, 3) + poly(age, 3):position,
data = player_data
)
summary(age_curve_model)
# Fit generalized additive model (GAM) for smooth age curves
gam_model <- gam(
pts_per_36 ~ s(age, by = position) + position +
s(seasons_played) + s(usage_rate),
data = player_data,
family = gaussian()
)
summary(gam_model)
# Visualize age curves by position
age_seq <- seq(19, 38, by = 0.1)
positions <- c("PG", "SG", "SF", "PF", "C")
predictions <- expand.grid(
age = age_seq,
position = positions,
seasons_played = 5,
usage_rate = 22
)
predictions$predicted_pts <- predict(gam_model, newdata = predictions)
ggplot(predictions, aes(x = age, y = predicted_pts, color = position)) +
geom_line(size = 1.2) +
labs(
title = "NBA Player Age Curves by Position",
x = "Age",
y = "Predicted Points per 36 Minutes",
color = "Position"
) +
theme_minimal() +
theme(legend.position = "bottom")
Mixed Effects Models for Player Projections
# Mixed effects model with random intercepts and slopes
mixed_model <- lmer(
pts_per_36_next ~ age + I(age^2) + pts_per_36 + ts_pct + usage_rate +
pts_per_36_trend + (1 + age | player_id),
data = player_data,
REML = TRUE
)
summary(mixed_model)
# Extract random effects
random_effects <- ranef(mixed_model)$player_id
fixed_effects <- fixef(mixed_model)
# Function to project player performance
project_player <- function(player_id, current_stats, future_age) {
player_random <- random_effects[as.character(player_id), ]
prediction <- fixed_effects["(Intercept)"] +
player_random$"(Intercept)" +
(fixed_effects["age"] + player_random$age) * future_age +
fixed_effects["I(age^2)"] * future_age^2 +
fixed_effects["pts_per_36"] * current_stats$pts_per_36 +
fixed_effects["ts_pct"] * current_stats$ts_pct +
fixed_effects["usage_rate"] * current_stats$usage_rate +
fixed_effects["pts_per_36_trend"] * current_stats$pts_per_36_trend
return(prediction)
}
# Example projection
player_stats <- list(
pts_per_36 = 22.5,
ts_pct = 0.58,
usage_rate = 25.3,
pts_per_36_trend = 1.2
)
projected_pts <- project_player(
player_id = "2544", # LeBron James
current_stats = player_stats,
future_age = 39
)
print(paste("Projected points per 36:", round(projected_pts, 2)))
Bayesian Hierarchical Models
library(rstan)
library(rstanarm)
# Bayesian hierarchical model with Stan
stan_model <- stan_glmer(
pts_per_36_next ~ age + I(age^2) + I(age^3) +
pts_per_36 + ts_pct + usage_rate +
(1 + age | player_id) + (1 | season),
data = player_data,
family = gaussian(),
prior = normal(0, 2.5),
prior_intercept = normal(15, 5),
prior_covariance = decov(regularization = 2),
chains = 4,
iter = 2000,
warmup = 1000,
seed = 42
)
# Model summary
summary(stan_model, digits = 3)
# Posterior predictive checks
pp_check(stan_model, nreps = 50)
# Generate probabilistic projections with uncertainty
new_data <- data.frame(
player_id = "203999", # Nikola Jokic
age = 30,
pts_per_36 = 24.8,
ts_pct = 0.65,
usage_rate = 28.5,
season = "2024-25"
)
# Posterior predictions with credible intervals
posterior_pred <- posterior_predict(stan_model, newdata = new_data)
# Calculate credible intervals
ci_80 <- quantile(posterior_pred, c(0.1, 0.9))
ci_95 <- quantile(posterior_pred, c(0.025, 0.975))
print(paste("Median projection:", round(median(posterior_pred), 2)))
print(paste("80% CI:", round(ci_80[1], 2), "-", round(ci_80[2], 2)))
print(paste("95% CI:", round(ci_95[1], 2), "-", round(ci_95[2], 2)))
# Visualize posterior distribution
posterior_df <- data.frame(prediction = as.vector(posterior_pred))
ggplot(posterior_df, aes(x = prediction)) +
geom_density(fill = "steelblue", alpha = 0.6) +
geom_vline(xintercept = median(posterior_pred),
linetype = "dashed", color = "red", size = 1) +
geom_vline(xintercept = ci_95, linetype = "dotted", color = "red") +
labs(
title = "Posterior Predictive Distribution",
x = "Projected Points per 36 Minutes",
y = "Density"
) +
theme_minimal()
Machine Learning Approaches
Ensemble Methods with XGBoost
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
import matplotlib.pyplot as plt
import seaborn as sns
# Prepare data with temporal ordering
df_sorted = projection_df.sort_values(['season', 'player_id'])
# Create time series splits (maintain temporal order)
tscv = TimeSeriesSplit(n_splits=5)
# XGBoost parameters to tune
param_grid = {
'max_depth': [4, 6, 8],
'learning_rate': [0.01, 0.05, 0.1],
'n_estimators': [100, 200, 300],
'min_child_weight': [1, 3, 5],
'subsample': [0.8, 0.9, 1.0],
'colsample_bytree': [0.8, 0.9, 1.0],
'gamma': [0, 0.1, 0.2]
}
# Create XGBoost model
xgb_model = xgb.XGBRegressor(
objective='reg:squarederror',
random_state=42,
tree_method='hist'
)
# Grid search with time series cross-validation
grid_search = GridSearchCV(
estimator=xgb_model,
param_grid=param_grid,
cv=tscv,
scoring='neg_mean_squared_error',
n_jobs=-1,
verbose=1
)
# Fit model
X = df_sorted[advanced_features]
y = df_sorted['next_pts_36']
grid_search.fit(X, y)
# Best model
best_xgb = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)
# Make predictions
y_pred = best_xgb.predict(X)
# Evaluate
rmse = np.sqrt(mean_squared_error(y, y_pred))
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"\nXGBoost Performance:")
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
# Feature importance
importance_df = pd.DataFrame({
'feature': advanced_features,
'importance': best_xgb.feature_importances_
}).sort_values('importance', ascending=False)
# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=importance_df.head(15), x='importance', y='feature')
plt.title('XGBoost Feature Importance for Player Projections')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.savefig('xgboost_feature_importance.png', dpi=300)
plt.show()
Neural Networks for Complex Patterns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks
from tensorflow.keras.models import Sequential
# Prepare sequences for recurrent neural network
def create_sequences(df, lookback=3):
"""Create sequences of player stats for LSTM model."""
sequences = []
targets = []
for player_id in df['player_id'].unique():
player_df = df[df['player_id'] == player_id].sort_values('season')
if len(player_df) < lookback + 1:
continue
for i in range(len(player_df) - lookback):
sequence = player_df[advanced_features].iloc[i:i+lookback].values
target = player_df['next_pts_36'].iloc[i+lookback-1]
sequences.append(sequence)
targets.append(target)
return np.array(sequences), np.array(targets)
# Create sequences
lookback = 3
X_seq, y_seq = create_sequences(projection_df, lookback=lookback)
# Split data
split_idx = int(0.8 * len(X_seq))
X_train_seq, X_test_seq = X_seq[:split_idx], X_seq[split_idx:]
y_train_seq, y_test_seq = y_seq[:split_idx], y_seq[split_idx:]
# Build LSTM model
model = Sequential([
layers.LSTM(64, activation='relu', input_shape=(lookback, len(advanced_features)),
return_sequences=True),
layers.Dropout(0.2),
layers.LSTM(32, activation='relu'),
layers.Dropout(0.2),
layers.Dense(16, activation='relu'),
layers.Dense(1)
])
# Compile model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='mse',
metrics=['mae']
)
# Callbacks
early_stop = callbacks.EarlyStopping(
monitor='val_loss',
patience=20,
restore_best_weights=True
)
reduce_lr = callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=10,
min_lr=0.00001
)
# Train model
history = model.fit(
X_train_seq, y_train_seq,
validation_split=0.2,
epochs=100,
batch_size=32,
callbacks=[early_stop, reduce_lr],
verbose=1
)
# Evaluate
test_loss, test_mae = model.evaluate(X_test_seq, y_test_seq)
y_pred_seq = model.predict(X_test_seq).flatten()
test_rmse = np.sqrt(mean_squared_error(y_test_seq, y_pred_seq))
test_r2 = r2_score(y_test_seq, y_pred_seq)
print(f"\nLSTM Performance:")
print(f"RMSE: {test_rmse:.3f}")
print(f"MAE: {test_mae:.3f}")
print(f"R²: {test_r2:.3f}")
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss (MSE)')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True)
axes[1].plot(history.history['mae'], label='Training MAE')
axes[1].plot(history.history['val_mae'], label='Validation MAE')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MAE')
axes[1].set_title('Training and Validation MAE')
axes[1].legend()
axes[1].grid(True)
plt.tight_layout()
plt.savefig('lstm_training_history.png', dpi=300)
plt.show()
Stacking and Model Ensembles
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
import lightgbm as lgb
# Define base models
base_models = [
('ridge', Ridge(alpha=10.0)),
('elastic', ElasticNet(alpha=0.5, l1_ratio=0.5)),
('knn', KNeighborsRegressor(n_neighbors=10)),
('xgb', xgb.XGBRegressor(
n_estimators=200,
max_depth=6,
learning_rate=0.05,
random_state=42
)),
('lgb', lgb.LGBMRegressor(
n_estimators=200,
max_depth=6,
learning_rate=0.05,
random_state=42
))
]
# Meta-model
meta_model = Ridge(alpha=1.0)
# Create stacking ensemble
stacking_model = StackingRegressor(
estimators=base_models,
final_estimator=meta_model,
cv=5,
n_jobs=-1
)
# Train stacking model
stacking_model.fit(X_train, y_train)
# Predictions
y_pred_stack = stacking_model.predict(X_test)
# Evaluate
stack_rmse = np.sqrt(mean_squared_error(y_test, y_pred_stack))
stack_mae = mean_absolute_error(y_test, y_pred_stack)
stack_r2 = r2_score(y_test, y_pred_stack)
print(f"\nStacking Ensemble Performance:")
print(f"RMSE: {stack_rmse:.3f}")
print(f"MAE: {stack_mae:.3f}")
print(f"R²: {stack_r2:.3f}")
# Compare all models
model_comparison = pd.DataFrame({
'Model': ['Ridge', 'Random Forest', 'Gradient Boosting', 'XGBoost', 'LSTM', 'Stacking'],
'RMSE': [
results['Ridge']['test_rmse'],
results['Random Forest']['test_rmse'],
results['Gradient Boosting']['test_rmse'],
rmse,
test_rmse,
stack_rmse
],
'MAE': [
results['Ridge']['test_mae'],
results['Random Forest']['test_mae'],
results['Gradient Boosting']['test_mae'],
mae,
test_mae,
stack_mae
],
'R²': [
results['Ridge']['test_r2'],
results['Random Forest']['test_r2'],
results['Gradient Boosting']['test_r2'],
r2,
test_r2,
stack_r2
]
}).sort_values('RMSE')
print("\nModel Comparison:")
print(model_comparison.to_string(index=False))
Validation and Backtesting
Time Series Cross-Validation
def backtest_projections(df, model, features, target, start_season='2015-16'):
"""
Backtest projection system using expanding window approach.
Train on all data up to season N, predict season N+1.
"""
seasons = sorted(df['season'].unique())
start_idx = seasons.index(start_season)
results = []
for i in range(start_idx, len(seasons) - 1):
train_seasons = seasons[:i+1]
test_season = seasons[i+1]
# Split data
train_df = df[df['season'].isin(train_seasons)]
test_df = df[df['season'] == test_season]
# Prepare features
X_train = train_df[features]
y_train = train_df[target]
X_test = test_df[features]
y_test = test_df[target]
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model.fit(X_train_scaled, y_train)
# Predict
y_pred = model.predict(X_test_scaled)
# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Store results
results.append({
'season': test_season,
'n_players': len(test_df),
'rmse': rmse,
'mae': mae,
'r2': r2,
'predictions': pd.DataFrame({
'player_id': test_df['player_id'].values,
'actual': y_test.values,
'predicted': y_pred
})
})
print(f"Season {test_season}: RMSE={rmse:.3f}, MAE={mae:.3f}, R²={r2:.3f}")
return results
# Run backtest
backtest_results = backtest_projections(
df=projection_df,
model=xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.05),
features=advanced_features,
target='next_pts_36',
start_season='2015-16'
)
# Aggregate results
avg_rmse = np.mean([r['rmse'] for r in backtest_results])
avg_mae = np.mean([r['mae'] for r in backtest_results])
avg_r2 = np.mean([r['r2'] for r in backtest_results])
print(f"\nBacktest Summary (2015-16 to 2023-24):")
print(f"Average RMSE: {avg_rmse:.3f}")
print(f"Average MAE: {avg_mae:.3f}")
print(f"Average R²: {avg_r2:.3f}")
Player-Level Accuracy Analysis
def analyze_projection_accuracy(backtest_results, df):
"""Analyze which types of players are easier/harder to project."""
# Combine all predictions
all_preds = pd.concat([r['predictions'] for r in backtest_results])
all_preds = all_preds.merge(
df[['player_id', 'age', 'seasons_played', 'pts_36', 'usage_pct']],
on='player_id',
how='left'
)
# Calculate absolute errors
all_preds['abs_error'] = abs(all_preds['actual'] - all_preds['predicted'])
all_preds['pct_error'] = abs(all_preds['abs_error'] / all_preds['actual']) * 100
# Analyze by age groups
all_preds['age_group'] = pd.cut(
all_preds['age'],
bins=[0, 23, 27, 30, 100],
labels=['Young (19-23)', 'Prime (24-27)', 'Veteran (28-30)', 'Old (31+)']
)
age_accuracy = all_preds.groupby('age_group').agg({
'abs_error': ['mean', 'median'],
'pct_error': ['mean', 'median'],
'player_id': 'count'
}).round(3)
print("Projection Accuracy by Age Group:")
print(age_accuracy)
# Analyze by scoring volume
all_preds['scoring_tier'] = pd.cut(
all_preds['pts_36'],
bins=[0, 10, 15, 20, 100],
labels=['Low (<10)', 'Medium (10-15)', 'High (15-20)', 'Elite (20+)']
)
scoring_accuracy = all_preds.groupby('scoring_tier').agg({
'abs_error': ['mean', 'median'],
'pct_error': ['mean', 'median'],
'player_id': 'count'
}).round(3)
print("\nProjection Accuracy by Scoring Volume:")
print(scoring_accuracy)
# Identify most/least predictable players
player_accuracy = all_preds.groupby('player_id').agg({
'abs_error': 'mean',
'pct_error': 'mean',
'actual': 'count'
}).reset_index()
player_accuracy = player_accuracy[player_accuracy['actual'] >= 3] # Min 3 projections
player_accuracy = player_accuracy.sort_values('abs_error')
print("\nMost Predictable Players (Lowest Error):")
print(player_accuracy.head(10))
print("\nLeast Predictable Players (Highest Error):")
print(player_accuracy.tail(10))
return all_preds, age_accuracy, scoring_accuracy, player_accuracy
# Run accuracy analysis
pred_analysis, age_acc, score_acc, player_acc = analyze_projection_accuracy(
backtest_results, projection_df
)
# Visualize error distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Error distribution
axes[0, 0].hist(pred_analysis['abs_error'], bins=50, edgecolor='black')
axes[0, 0].set_xlabel('Absolute Error (Points per 36)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Projection Errors')
axes[0, 0].axvline(pred_analysis['abs_error'].median(), color='red',
linetype='dashed', label='Median')
axes[0, 0].legend()
# Actual vs Predicted
axes[0, 1].scatter(pred_analysis['actual'], pred_analysis['predicted'],
alpha=0.5, s=10)
axes[0, 1].plot([0, 35], [0, 35], 'r--', label='Perfect Projection')
axes[0, 1].set_xlabel('Actual Points per 36')
axes[0, 1].set_ylabel('Projected Points per 36')
axes[0, 1].set_title('Actual vs Projected Performance')
axes[0, 1].legend()
# Error by age group
age_acc_plot = pred_analysis.groupby('age_group')['abs_error'].mean()
axes[1, 0].bar(range(len(age_acc_plot)), age_acc_plot.values)
axes[1, 0].set_xticks(range(len(age_acc_plot)))
axes[1, 0].set_xticklabels(age_acc_plot.index, rotation=45)
axes[1, 0].set_ylabel('Mean Absolute Error')
axes[1, 0].set_title('Projection Error by Age Group')
# Error by scoring tier
score_acc_plot = pred_analysis.groupby('scoring_tier')['abs_error'].mean()
axes[1, 1].bar(range(len(score_acc_plot)), score_acc_plot.values)
axes[1, 1].set_xticks(range(len(score_acc_plot)))
axes[1, 1].set_xticklabels(score_acc_plot.index, rotation=45)
axes[1, 1].set_ylabel('Mean Absolute Error')
axes[1, 1].set_title('Projection Error by Scoring Volume')
plt.tight_layout()
plt.savefig('projection_accuracy_analysis.png', dpi=300)
plt.show()
Calibration and Uncertainty Quantification
from sklearn.ensemble import GradientBoostingRegressor
import scipy.stats as stats
# Quantile regression for uncertainty estimates
def train_quantile_models(X_train, y_train, quantiles=[0.1, 0.5, 0.9]):
"""Train separate models for different quantiles."""
models = {}
for q in quantiles:
model = GradientBoostingRegressor(
n_estimators=200,
max_depth=5,
learning_rate=0.05,
loss='quantile',
alpha=q,
random_state=42
)
model.fit(X_train, y_train)
models[q] = model
return models
# Train quantile models
quantile_models = train_quantile_models(X_train, y_train)
# Make probabilistic predictions
def predict_with_intervals(X, quantile_models):
"""Generate point estimate and prediction intervals."""
predictions = {
'median': quantile_models[0.5].predict(X),
'lower_80': quantile_models[0.1].predict(X),
'upper_80': quantile_models[0.9].predict(X),
}
return pd.DataFrame(predictions)
# Generate predictions with intervals
pred_intervals = predict_with_intervals(X_test, quantile_models)
pred_intervals['actual'] = y_test.values
# Calculate coverage (percentage of actuals within intervals)
pred_intervals['in_interval'] = (
(pred_intervals['actual'] >= pred_intervals['lower_80']) &
(pred_intervals['actual'] <= pred_intervals['upper_80'])
)
coverage = pred_intervals['in_interval'].mean()
print(f"80% Prediction Interval Coverage: {coverage:.1%}")
# Calibration plot
def plot_calibration(pred_intervals):
"""Plot calibration curve for prediction intervals."""
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Prediction intervals
sample_idx = np.random.choice(len(pred_intervals), size=100, replace=False)
sample_df = pred_intervals.iloc[sample_idx].sort_values('median')
x = range(len(sample_df))
axes[0].plot(x, sample_df['median'], 'b-', label='Median Projection')
axes[0].fill_between(
x,
sample_df['lower_80'],
sample_df['upper_80'],
alpha=0.3,
label='80% Interval'
)
axes[0].scatter(x, sample_df['actual'], color='red', s=20,
alpha=0.6, label='Actual')
axes[0].set_xlabel('Player (sorted by projection)')
axes[0].set_ylabel('Points per 36')
axes[0].set_title('Projection Intervals vs Actual Performance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Residual distribution
residuals = pred_intervals['actual'] - pred_intervals['median']
axes[1].hist(residuals, bins=50, density=True, alpha=0.7,
edgecolor='black', label='Actual Residuals')
# Overlay normal distribution
mu, sigma = residuals.mean(), residuals.std()
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
axes[1].plot(x_norm, stats.norm.pdf(x_norm, mu, sigma),
'r-', linewidth=2, label='Normal Distribution')
axes[1].set_xlabel('Residual (Actual - Projected)')
axes[1].set_ylabel('Density')
axes[1].set_title('Residual Distribution')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('projection_calibration.png', dpi=300)
plt.show()
plot_calibration(pred_intervals)
Practical Applications
Contract Valuation
Player projection systems are essential for evaluating contract offers:
def calculate_contract_value(player_projections, salary_cap=140_000_000):
"""
Estimate fair contract value based on projected performance.
Uses wins above replacement (WAR) and market rates.
"""
# Convert projections to WAR estimate
# Simplified formula: WAR ≈ (RAPTOR × Minutes) / 100
player_projections['projected_war'] = (
player_projections['projected_raptor'] *
player_projections['projected_minutes']
) / 100
# Calculate marginal win value
# Rule of thumb: 1 WAR ≈ $2M in salary cap
dollars_per_war = salary_cap * 0.014 # ~$2M at $140M cap
player_projections['contract_value'] = (
player_projections['projected_war'] * dollars_per_war
)
# Age and injury adjustments
player_projections['risk_discount'] = np.where(
player_projections['age'] > 30,
0.85, # 15% discount for age risk
1.0
)
player_projections['adjusted_value'] = (
player_projections['contract_value'] *
player_projections['risk_discount']
)
return player_projections
# Example: Evaluate max contract decision
player_proj = pd.DataFrame({
'player_name': ['Player A'],
'age': [28],
'projected_raptor': [4.5],
'projected_minutes': [2400],
})
contract_eval = calculate_contract_value(player_proj)
print(f"Projected WAR: {contract_eval['projected_war'].values[0]:.2f}")
print(f"Fair Contract Value: ${contract_eval['adjusted_value'].values[0]/1e6:.1f}M per year")
Trade Analysis
Projection systems help evaluate trade proposals by comparing future value:
def evaluate_trade(players_out, players_in, projection_model, years=3):
"""
Compare projected value of players in vs out over contract period.
"""
def project_future_value(player_data, years):
total_value = 0
for year in range(1, years + 1):
# Age player
future_age = player_data['age'] + year
# Generate projection
features = create_features_for_projection(player_data, future_age)
projected_performance = projection_model.predict(features)[0]
# Convert to WAR and discount future years
war = projected_performance * 0.15 # Simplified conversion
discount_factor = 0.9 ** year # Time discount
total_value += war * discount_factor
return total_value
# Calculate total value for each side
value_out = sum(project_future_value(p, years) for p in players_out)
value_in = sum(project_future_value(p, years) for p in players_in)
net_value = value_in - value_out
print(f"Players Out - Projected Value: {value_out:.2f} WAR")
print(f"Players In - Projected Value: {value_in:.2f} WAR")
print(f"Net Value: {net_value:+.2f} WAR")
if net_value > 0:
print("Recommendation: ACCEPT trade")
else:
print("Recommendation: REJECT trade")
return net_value
Draft Strategy
Project rookie performance and development to optimize draft selections:
def project_rookie_career(college_stats, draft_position,
historical_comparisons):
"""
Project NBA career trajectory for draft prospects.
Uses historical player comparisons and translation factors.
"""
# College-to-NBA translation factors
translation_factors = {
'pts': 0.45, # NCAA scoring translates at ~45%
'reb': 0.65,
'ast': 0.55,
'ts_pct': 0.85 # Efficiency translates better
}
# Translate college stats
nba_rookie_proj = {
stat: college_stats[stat] * factor
for stat, factor in translation_factors.items()
}
# Draft position adjustment
pick_value_curve = {
range(1, 4): 1.3, # Top 3 picks: 30% boost
range(4, 11): 1.1, # Lottery: 10% boost
range(11, 31): 1.0, # First round: no adjustment
range(31, 61): 0.85 # Second round: 15% penalty
}
adjustment = next(
adj for picks, adj in pick_value_curve.items()
if draft_position in picks
)
nba_rookie_proj = {k: v * adjustment for k, v in nba_rookie_proj.items()}
# Project Year 2-5 using typical growth curves
career_projection = []
for year in range(1, 6):
year_growth = min(1.0 + (year - 1) * 0.15, 1.6) # 15% growth per year, cap at 60%
year_proj = {
'season': year,
**{k: v * year_growth for k, v in nba_rookie_proj.items()}
}
career_projection.append(year_proj)
return pd.DataFrame(career_projection)
# Example: Project lottery pick
college_stats = {
'pts': 18.5,
'reb': 8.2,
'ast': 3.1,
'ts_pct': 0.595
}
rookie_projection = project_rookie_career(
college_stats=college_stats,
draft_position=5,
historical_comparisons=None # Would use similarity scores
)
print("5-Year Rookie Projection:")
print(rookie_projection)
Injury Impact Assessment
def adjust_projection_for_injury(baseline_projection, injury_history):
"""
Adjust performance projections based on injury history and type.
"""
injury_impact_factors = {
'ACL tear': {'impact': 0.85, 'recovery_years': 2},
'Achilles tear': {'impact': 0.70, 'recovery_years': 2},
'Ankle sprain': {'impact': 0.95, 'recovery_years': 1},
'Back injury': {'impact': 0.90, 'recovery_years': 1},
'Knee surgery': {'impact': 0.88, 'recovery_years': 2},
}
adjusted_projection = baseline_projection.copy()
for injury in injury_history:
injury_type = injury['type']
years_since = injury['years_ago']
if injury_type in injury_impact_factors:
impact = injury_impact_factors[injury_type]['impact']
recovery = injury_impact_factors[injury_type]['recovery_years']
# Impact diminishes over time
if years_since < recovery:
recovery_progress = years_since / recovery
current_impact = impact + (1 - impact) * recovery_progress
else:
current_impact = 1.0
# Apply impact to projections
adjusted_projection['projected_pts'] *= current_impact
adjusted_projection['projected_minutes'] *= current_impact
# Add injury risk factor for future seasons
recent_injuries = sum(1 for inj in injury_history if inj['years_ago'] < 2)
injury_risk = min(0.2 * recent_injuries, 0.5) # Max 50% injury risk
adjusted_projection['injury_risk'] = injury_risk
adjusted_projection['games_played_proj'] *= (1 - injury_risk)
return adjusted_projection
Best Practices and Considerations
Key Principles
- Regression to the mean: Extreme performances tend to regress toward average
- Sample size matters: Weight multi-year averages more than single-season outliers
- Context is critical: Account for team quality, usage, and role changes
- Age curves are predictable: Use historical patterns but allow for individual variance
- Uncertainty quantification: Always provide ranges, not just point estimates
Common Pitfalls
- Overfitting to recent data: Short-term trends can mislead projections
- Ignoring playing time: Opportunity is as important as ability
- Position changes: Role transitions affect all statistical categories
- Injury concerns: Health history significantly impacts projections
- System changes: New coaches and teammates alter usage patterns
Validation Metrics
- RMSE and MAE: Measure average prediction error
- R² score: Proportion of variance explained by the model
- Calibration: Do 80% prediction intervals contain 80% of actuals?
- Player-level accuracy: Track errors for specific player types
- Time-based validation: Always test on future seasons, not random splits
Future Directions
Emerging Techniques
- Transfer learning: Apply models from other leagues (NCAA, international)
- Attention mechanisms: Neural networks that focus on key performance factors
- Graph neural networks: Model player relationships and team dynamics
- Causal inference: Distinguish correlation from true causal effects
- Multi-task learning: Jointly predict multiple stats and outcomes
Data Opportunities
- Second Spectrum tracking: More granular movement and decision data
- Biometric data: Sleep, training load, and health metrics
- Shot quality models: Better estimate of scoring ability vs. variance
- Lineup data: Predict performance in specific team contexts
- External factors: Travel, rest, opponent strength adjustments
Key Takeaways
- Player projection systems combine statistical modeling, machine learning, and domain expertise to forecast NBA performance
- Age curves are the foundation - most players peak around 27-29 and decline predictably after 30
- Multiple modeling approaches (linear, tree-based, neural networks) capture different patterns
- Rigorous backtesting with time series validation is essential for reliable projections
- Practical applications include contract valuation, trade analysis, draft strategy, and injury assessment
- Uncertainty quantification through prediction intervals provides more actionable information than point estimates