Player Projection Systems

Beginner 10 min read 18 views Nov 27, 2025

NBA Player Projection Systems

Player projection systems represent the cutting edge of NBA analytics, combining historical data, statistical modeling, and machine learning to forecast future player performance. These systems help teams make informed decisions about contracts, trades, and roster construction, while providing analysts and fans with data-driven expectations for player development.

Overview of Modern Projection Systems

FiveThirtyEight RAPTOR

RAPTOR (Robust Algorithm using Player Tracking and On/Off Ratings) is FiveThirtyEight's comprehensive player rating system that combines:

Box score statistics: Traditional and advanced metrics from play-by-play data
On-off ratings: Team performance with player on vs. off the court
Player tracking data: Movement and positioning from SportVU cameras
Aging curves: Expected improvement or decline based on player age

RAPTOR produces separate offensive and defensive ratings, as well as projections for future seasons based on historical aging patterns and regression to the mean.

FiveThirtyEight CARMELO

CARMELO (Career-Arc Regression Model Estimator with Local Optimization) was FiveThirtyEight's previous system that projected player career trajectories by:

Comparing players to historical comparables with similar statistics at the same age
Using thousands of player seasons to model typical career arcs
Incorporating individual player characteristics and performance trends
Producing probabilistic forecasts for multiple seasons ahead

Other Professional Systems

ESPN RPM Projections: Based on Real Plus-Minus with ridge regression
Basketball-Reference Similarity Scores: Historical player comparisons
NBA Teams' Internal Models: Proprietary systems using private data
Cleaning the Glass: Role-based projections accounting for context

Age Curves and Peak Performance

The Typical NBA Career Arc

Research consistently shows predictable patterns in NBA player performance across age:

Ages 19-23: Rapid improvement as players develop skills and adjust to NBA pace
Ages 24-28: Peak performance years with optimal blend of skills and athleticism
Ages 27-29: Absolute peak for most players, particularly superstars
Ages 30-33: Gradual decline, offset by experience and basketball IQ
Ages 34+: Accelerated decline, role players often exit the league

Position-Specific Age Curves

Different positions show varying aging patterns:

Point Guards: Longer peak due to playmaking skills and basketball IQ
Wings: Earlier decline in athleticism-dependent areas (defense, transition)
Centers: Sharp decline after age 30, particularly in mobility and rim protection

Skill-Specific Aging

Individual skills age at different rates:

Shooting: Relatively stable, can improve into early 30s
Athleticism: Declines steadily after age 27
Defense: Sharp decline after age 29-30
Playmaking: Stable or improving through experience
Rebounding: Maintains well into early 30s

Building Projection Systems with Python

Data Collection and Preprocessing

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from nba_api.stats.endpoints import playercareerstats, leaguedashplayerstats
from nba_api.stats.static import players

# Fetch historical player data
def get_player_career_stats(player_id):
    """Retrieve complete career statistics for a player."""
    career = playercareerstats.PlayerCareerStats(player_id=player_id)
    df = career.get_data_frames()[0]
    return df

# Collect comprehensive dataset
def build_projection_dataset(start_season='2010-11', end_season='2023-24'):
    """Build dataset with player stats, age, and next-season performance."""
    all_players = players.get_players()
    data = []

    for player in all_players:
        try:
            career_df = get_player_career_stats(player['id'])

            # Add age and calculate per-36 stats
            for idx, row in career_df.iterrows():
                age = row['PLAYER_AGE']
                season = row['SEASON_ID']

                # Skip if insufficient playing time
                if row['MIN'] < 500:
                    continue

                # Calculate per-36 minute stats
                mins = row['MIN']
                per_36_stats = {
                    'player_id': player['id'],
                    'player_name': player['full_name'],
                    'season': season,
                    'age': age,
                    'gp': row['GP'],
                    'pts_36': row['PTS'] / mins * 36,
                    'reb_36': row['REB'] / mins * 36,
                    'ast_36': row['AST'] / mins * 36,
                    'stl_36': row['STL'] / mins * 36,
                    'blk_36': row['BLK'] / mins * 36,
                    'tov_36': row['TOV'] / mins * 36,
                    'fg_pct': row['FG_PCT'],
                    'fg3_pct': row['FG3_PCT'],
                    'ft_pct': row['FT_PCT'],
                    'ts_pct': calculate_ts_pct(row),
                    'usage_pct': estimate_usage(row),
                }

                data.append(per_36_stats)
        except:
            continue

    df = pd.DataFrame(data)
    return df

def calculate_ts_pct(row):
    """Calculate True Shooting Percentage."""
    if row['FGA'] + 0.44 * row['FTA'] == 0:
        return 0
    return row['PTS'] / (2 * (row['FGA'] + 0.44 * row['FTA']))

def estimate_usage(row):
    """Estimate usage rate from available box score stats."""
    team_poss = row['MIN'] / 5 * 100  # Rough estimate
    player_poss = row['FGA'] + 0.44 * row['FTA'] + row['TOV']
    return player_poss / team_poss * 100 if team_poss > 0 else 0

Feature Engineering for Projections

def create_projection_features(df):
    """Engineer features for player projection model."""
    df = df.sort_values(['player_id', 'season']).copy()

    # Calculate career trajectory features
    df['seasons_played'] = df.groupby('player_id').cumcount() + 1
    df['career_pts_avg'] = df.groupby('player_id')['pts_36'].expanding().mean().values
    df['career_pts_std'] = df.groupby('player_id')['pts_36'].expanding().std().values

    # Age-based features
    df['age_squared'] = df['age'] ** 2
    df['age_cubed'] = df['age'] ** 3
    df['years_from_peak'] = df['age'] - 27

    # Recent performance trends (last 3 seasons)
    for stat in ['pts_36', 'reb_36', 'ast_36', 'ts_pct']:
        df[f'{stat}_ma3'] = df.groupby('player_id')[stat].rolling(3, min_periods=1).mean().values
        df[f'{stat}_trend'] = df.groupby('player_id')[stat].diff()

    # Calculate year-over-year changes
    for stat in ['pts_36', 'reb_36', 'ast_36', 'ts_pct', 'usage_pct']:
        df[f'{stat}_change'] = df.groupby('player_id')[stat].diff()
        df[f'{stat}_pct_change'] = df.groupby('player_id')[stat].pct_change()

    # Create target variable (next season performance)
    df['next_pts_36'] = df.groupby('player_id')['pts_36'].shift(-1)
    df['next_ts_pct'] = df.groupby('player_id')['ts_pct'].shift(-1)
    df['next_usage'] = df.groupby('player_id')['usage_pct'].shift(-1)

    # Remove rows without next season data
    df = df.dropna(subset=['next_pts_36'])

    return df

# Apply feature engineering
projection_df = create_projection_features(df)

# Define feature sets
basic_features = ['age', 'age_squared', 'pts_36', 'reb_36', 'ast_36',
                  'ts_pct', 'usage_pct']
advanced_features = basic_features + ['pts_36_ma3', 'pts_36_trend',
                                      'career_pts_avg', 'seasons_played',
                                      'years_from_peak']

# Prepare training data
X = projection_df[advanced_features]
y = projection_df['next_pts_36']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Baseline Projection Model

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Train multiple models
models = {
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
}

results = {}
for name, model in models.items():
    # Train model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Evaluate
    results[name] = {
        'train_rmse': np.sqrt(mean_squared_error(y_train, y_pred_train)),
        'test_rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
        'train_mae': mean_absolute_error(y_train, y_pred_train),
        'test_mae': mean_absolute_error(y_test, y_pred_test),
        'test_r2': r2_score(y_test, y_pred_test)
    }

    print(f"\n{name} Results:")
    print(f"Test RMSE: {results[name]['test_rmse']:.3f}")
    print(f"Test MAE: {results[name]['test_mae']:.3f}")
    print(f"Test R²: {results[name]['test_r2']:.3f}")

# Feature importance (for tree-based models)
rf_model = models['Random Forest']
feature_importance = pd.DataFrame({
    'feature': advanced_features,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

Statistical Modeling with R

Age Curve Estimation

library(tidyverse)
library(mgcv)
library(lme4)
library(splines)

# Load player data
player_data <- read_csv("nba_player_stats.csv")

# Fit polynomial age curve model
age_curve_model <- lm(
  pts_per_36 ~ poly(age, 3) + poly(age, 3):position,
  data = player_data
)

summary(age_curve_model)

# Fit generalized additive model (GAM) for smooth age curves
gam_model <- gam(
  pts_per_36 ~ s(age, by = position) + position +
               s(seasons_played) + s(usage_rate),
  data = player_data,
  family = gaussian()
)

summary(gam_model)

# Visualize age curves by position
age_seq <- seq(19, 38, by = 0.1)
positions <- c("PG", "SG", "SF", "PF", "C")

predictions <- expand.grid(
  age = age_seq,
  position = positions,
  seasons_played = 5,
  usage_rate = 22
)

predictions$predicted_pts <- predict(gam_model, newdata = predictions)

ggplot(predictions, aes(x = age, y = predicted_pts, color = position)) +
  geom_line(size = 1.2) +
  labs(
    title = "NBA Player Age Curves by Position",
    x = "Age",
    y = "Predicted Points per 36 Minutes",
    color = "Position"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Mixed Effects Models for Player Projections

# Mixed effects model with random intercepts and slopes
mixed_model <- lmer(
  pts_per_36_next ~ age + I(age^2) + pts_per_36 + ts_pct + usage_rate +
                     pts_per_36_trend + (1 + age | player_id),
  data = player_data,
  REML = TRUE
)

summary(mixed_model)

# Extract random effects
random_effects <- ranef(mixed_model)$player_id
fixed_effects <- fixef(mixed_model)

# Function to project player performance
project_player <- function(player_id, current_stats, future_age) {
  player_random <- random_effects[as.character(player_id), ]

  prediction <- fixed_effects["(Intercept)"] +
                player_random$"(Intercept)" +
                (fixed_effects["age"] + player_random$age) * future_age +
                fixed_effects["I(age^2)"] * future_age^2 +
                fixed_effects["pts_per_36"] * current_stats$pts_per_36 +
                fixed_effects["ts_pct"] * current_stats$ts_pct +
                fixed_effects["usage_rate"] * current_stats$usage_rate +
                fixed_effects["pts_per_36_trend"] * current_stats$pts_per_36_trend

  return(prediction)
}

# Example projection
player_stats <- list(
  pts_per_36 = 22.5,
  ts_pct = 0.58,
  usage_rate = 25.3,
  pts_per_36_trend = 1.2
)

projected_pts <- project_player(
  player_id = "2544",  # LeBron James
  current_stats = player_stats,
  future_age = 39
)

print(paste("Projected points per 36:", round(projected_pts, 2)))

Bayesian Hierarchical Models

library(rstan)
library(rstanarm)

# Bayesian hierarchical model with Stan
stan_model <- stan_glmer(
  pts_per_36_next ~ age + I(age^2) + I(age^3) +
                     pts_per_36 + ts_pct + usage_rate +
                     (1 + age | player_id) + (1 | season),
  data = player_data,
  family = gaussian(),
  prior = normal(0, 2.5),
  prior_intercept = normal(15, 5),
  prior_covariance = decov(regularization = 2),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 42
)

# Model summary
summary(stan_model, digits = 3)

# Posterior predictive checks
pp_check(stan_model, nreps = 50)

# Generate probabilistic projections with uncertainty
new_data <- data.frame(
  player_id = "203999",  # Nikola Jokic
  age = 30,
  pts_per_36 = 24.8,
  ts_pct = 0.65,
  usage_rate = 28.5,
  season = "2024-25"
)

# Posterior predictions with credible intervals
posterior_pred <- posterior_predict(stan_model, newdata = new_data)

# Calculate credible intervals
ci_80 <- quantile(posterior_pred, c(0.1, 0.9))
ci_95 <- quantile(posterior_pred, c(0.025, 0.975))

print(paste("Median projection:", round(median(posterior_pred), 2)))
print(paste("80% CI:", round(ci_80[1], 2), "-", round(ci_80[2], 2)))
print(paste("95% CI:", round(ci_95[1], 2), "-", round(ci_95[2], 2)))

# Visualize posterior distribution
posterior_df <- data.frame(prediction = as.vector(posterior_pred))

ggplot(posterior_df, aes(x = prediction)) +
  geom_density(fill = "steelblue", alpha = 0.6) +
  geom_vline(xintercept = median(posterior_pred),
             linetype = "dashed", color = "red", size = 1) +
  geom_vline(xintercept = ci_95, linetype = "dotted", color = "red") +
  labs(
    title = "Posterior Predictive Distribution",
    x = "Projected Points per 36 Minutes",
    y = "Density"
  ) +
  theme_minimal()

Machine Learning Approaches

Ensemble Methods with XGBoost

import xgboost as xgb
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare data with temporal ordering
df_sorted = projection_df.sort_values(['season', 'player_id'])

# Create time series splits (maintain temporal order)
tscv = TimeSeriesSplit(n_splits=5)

# XGBoost parameters to tune
param_grid = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2]
}

# Create XGBoost model
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    random_state=42,
    tree_method='hist'
)

# Grid search with time series cross-validation
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    cv=tscv,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

# Fit model
X = df_sorted[advanced_features]
y = df_sorted['next_pts_36']

grid_search.fit(X, y)

# Best model
best_xgb = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)

# Make predictions
y_pred = best_xgb.predict(X)

# Evaluate
rmse = np.sqrt(mean_squared_error(y, y_pred))
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"\nXGBoost Performance:")
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")

# Feature importance
importance_df = pd.DataFrame({
    'feature': advanced_features,
    'importance': best_xgb.feature_importances_
}).sort_values('importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=importance_df.head(15), x='importance', y='feature')
plt.title('XGBoost Feature Importance for Player Projections')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.savefig('xgboost_feature_importance.png', dpi=300)
plt.show()

Neural Networks for Complex Patterns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks
from tensorflow.keras.models import Sequential

# Prepare sequences for recurrent neural network
def create_sequences(df, lookback=3):
    """Create sequences of player stats for LSTM model."""
    sequences = []
    targets = []

    for player_id in df['player_id'].unique():
        player_df = df[df['player_id'] == player_id].sort_values('season')

        if len(player_df) < lookback + 1:
            continue

        for i in range(len(player_df) - lookback):
            sequence = player_df[advanced_features].iloc[i:i+lookback].values
            target = player_df['next_pts_36'].iloc[i+lookback-1]

            sequences.append(sequence)
            targets.append(target)

    return np.array(sequences), np.array(targets)

# Create sequences
lookback = 3
X_seq, y_seq = create_sequences(projection_df, lookback=lookback)

# Split data
split_idx = int(0.8 * len(X_seq))
X_train_seq, X_test_seq = X_seq[:split_idx], X_seq[split_idx:]
y_train_seq, y_test_seq = y_seq[:split_idx], y_seq[split_idx:]

# Build LSTM model
model = Sequential([
    layers.LSTM(64, activation='relu', input_shape=(lookback, len(advanced_features)),
                return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)
])

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='mse',
    metrics=['mae']
)

# Callbacks
early_stop = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=20,
    restore_best_weights=True
)

reduce_lr = callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=10,
    min_lr=0.00001
)

# Train model
history = model.fit(
    X_train_seq, y_train_seq,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    callbacks=[early_stop, reduce_lr],
    verbose=1
)

# Evaluate
test_loss, test_mae = model.evaluate(X_test_seq, y_test_seq)
y_pred_seq = model.predict(X_test_seq).flatten()
test_rmse = np.sqrt(mean_squared_error(y_test_seq, y_pred_seq))
test_r2 = r2_score(y_test_seq, y_pred_seq)

print(f"\nLSTM Performance:")
print(f"RMSE: {test_rmse:.3f}")
print(f"MAE: {test_mae:.3f}")
print(f"R²: {test_r2:.3f}")

# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss (MSE)')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True)

axes[1].plot(history.history['mae'], label='Training MAE')
axes[1].plot(history.history['val_mae'], label='Validation MAE')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MAE')
axes[1].set_title('Training and Validation MAE')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.savefig('lstm_training_history.png', dpi=300)
plt.show()

Stacking and Model Ensembles

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
import lightgbm as lgb

# Define base models
base_models = [
    ('ridge', Ridge(alpha=10.0)),
    ('elastic', ElasticNet(alpha=0.5, l1_ratio=0.5)),
    ('knn', KNeighborsRegressor(n_neighbors=10)),
    ('xgb', xgb.XGBRegressor(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.05,
        random_state=42
    )),
    ('lgb', lgb.LGBMRegressor(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.05,
        random_state=42
    ))
]

# Meta-model
meta_model = Ridge(alpha=1.0)

# Create stacking ensemble
stacking_model = StackingRegressor(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,
    n_jobs=-1
)

# Train stacking model
stacking_model.fit(X_train, y_train)

# Predictions
y_pred_stack = stacking_model.predict(X_test)

# Evaluate
stack_rmse = np.sqrt(mean_squared_error(y_test, y_pred_stack))
stack_mae = mean_absolute_error(y_test, y_pred_stack)
stack_r2 = r2_score(y_test, y_pred_stack)

print(f"\nStacking Ensemble Performance:")
print(f"RMSE: {stack_rmse:.3f}")
print(f"MAE: {stack_mae:.3f}")
print(f"R²: {stack_r2:.3f}")

# Compare all models
model_comparison = pd.DataFrame({
    'Model': ['Ridge', 'Random Forest', 'Gradient Boosting', 'XGBoost', 'LSTM', 'Stacking'],
    'RMSE': [
        results['Ridge']['test_rmse'],
        results['Random Forest']['test_rmse'],
        results['Gradient Boosting']['test_rmse'],
        rmse,
        test_rmse,
        stack_rmse
    ],
    'MAE': [
        results['Ridge']['test_mae'],
        results['Random Forest']['test_mae'],
        results['Gradient Boosting']['test_mae'],
        mae,
        test_mae,
        stack_mae
    ],
    'R²': [
        results['Ridge']['test_r2'],
        results['Random Forest']['test_r2'],
        results['Gradient Boosting']['test_r2'],
        r2,
        test_r2,
        stack_r2
    ]
}).sort_values('RMSE')

print("\nModel Comparison:")
print(model_comparison.to_string(index=False))

Validation and Backtesting

Time Series Cross-Validation

def backtest_projections(df, model, features, target, start_season='2015-16'):
    """
    Backtest projection system using expanding window approach.
    Train on all data up to season N, predict season N+1.
    """
    seasons = sorted(df['season'].unique())
    start_idx = seasons.index(start_season)

    results = []

    for i in range(start_idx, len(seasons) - 1):
        train_seasons = seasons[:i+1]
        test_season = seasons[i+1]

        # Split data
        train_df = df[df['season'].isin(train_seasons)]
        test_df = df[df['season'] == test_season]

        # Prepare features
        X_train = train_df[features]
        y_train = train_df[target]
        X_test = test_df[features]
        y_test = test_df[target]

        # Scale features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Train model
        model.fit(X_train_scaled, y_train)

        # Predict
        y_pred = model.predict(X_test_scaled)

        # Calculate metrics
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        # Store results
        results.append({
            'season': test_season,
            'n_players': len(test_df),
            'rmse': rmse,
            'mae': mae,
            'r2': r2,
            'predictions': pd.DataFrame({
                'player_id': test_df['player_id'].values,
                'actual': y_test.values,
                'predicted': y_pred
            })
        })

        print(f"Season {test_season}: RMSE={rmse:.3f}, MAE={mae:.3f}, R²={r2:.3f}")

    return results

# Run backtest
backtest_results = backtest_projections(
    df=projection_df,
    model=xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.05),
    features=advanced_features,
    target='next_pts_36',
    start_season='2015-16'
)

# Aggregate results
avg_rmse = np.mean([r['rmse'] for r in backtest_results])
avg_mae = np.mean([r['mae'] for r in backtest_results])
avg_r2 = np.mean([r['r2'] for r in backtest_results])

print(f"\nBacktest Summary (2015-16 to 2023-24):")
print(f"Average RMSE: {avg_rmse:.3f}")
print(f"Average MAE: {avg_mae:.3f}")
print(f"Average R²: {avg_r2:.3f}")

Player-Level Accuracy Analysis

def analyze_projection_accuracy(backtest_results, df):
    """Analyze which types of players are easier/harder to project."""

    # Combine all predictions
    all_preds = pd.concat([r['predictions'] for r in backtest_results])
    all_preds = all_preds.merge(
        df[['player_id', 'age', 'seasons_played', 'pts_36', 'usage_pct']],
        on='player_id',
        how='left'
    )

    # Calculate absolute errors
    all_preds['abs_error'] = abs(all_preds['actual'] - all_preds['predicted'])
    all_preds['pct_error'] = abs(all_preds['abs_error'] / all_preds['actual']) * 100

    # Analyze by age groups
    all_preds['age_group'] = pd.cut(
        all_preds['age'],
        bins=[0, 23, 27, 30, 100],
        labels=['Young (19-23)', 'Prime (24-27)', 'Veteran (28-30)', 'Old (31+)']
    )

    age_accuracy = all_preds.groupby('age_group').agg({
        'abs_error': ['mean', 'median'],
        'pct_error': ['mean', 'median'],
        'player_id': 'count'
    }).round(3)

    print("Projection Accuracy by Age Group:")
    print(age_accuracy)

    # Analyze by scoring volume
    all_preds['scoring_tier'] = pd.cut(
        all_preds['pts_36'],
        bins=[0, 10, 15, 20, 100],
        labels=['Low (<10)', 'Medium (10-15)', 'High (15-20)', 'Elite (20+)']
    )

    scoring_accuracy = all_preds.groupby('scoring_tier').agg({
        'abs_error': ['mean', 'median'],
        'pct_error': ['mean', 'median'],
        'player_id': 'count'
    }).round(3)

    print("\nProjection Accuracy by Scoring Volume:")
    print(scoring_accuracy)

    # Identify most/least predictable players
    player_accuracy = all_preds.groupby('player_id').agg({
        'abs_error': 'mean',
        'pct_error': 'mean',
        'actual': 'count'
    }).reset_index()

    player_accuracy = player_accuracy[player_accuracy['actual'] >= 3]  # Min 3 projections
    player_accuracy = player_accuracy.sort_values('abs_error')

    print("\nMost Predictable Players (Lowest Error):")
    print(player_accuracy.head(10))

    print("\nLeast Predictable Players (Highest Error):")
    print(player_accuracy.tail(10))

    return all_preds, age_accuracy, scoring_accuracy, player_accuracy

# Run accuracy analysis
pred_analysis, age_acc, score_acc, player_acc = analyze_projection_accuracy(
    backtest_results, projection_df
)

# Visualize error distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Error distribution
axes[0, 0].hist(pred_analysis['abs_error'], bins=50, edgecolor='black')
axes[0, 0].set_xlabel('Absolute Error (Points per 36)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Projection Errors')
axes[0, 0].axvline(pred_analysis['abs_error'].median(), color='red',
                    linetype='dashed', label='Median')
axes[0, 0].legend()

# Actual vs Predicted
axes[0, 1].scatter(pred_analysis['actual'], pred_analysis['predicted'],
                   alpha=0.5, s=10)
axes[0, 1].plot([0, 35], [0, 35], 'r--', label='Perfect Projection')
axes[0, 1].set_xlabel('Actual Points per 36')
axes[0, 1].set_ylabel('Projected Points per 36')
axes[0, 1].set_title('Actual vs Projected Performance')
axes[0, 1].legend()

# Error by age group
age_acc_plot = pred_analysis.groupby('age_group')['abs_error'].mean()
axes[1, 0].bar(range(len(age_acc_plot)), age_acc_plot.values)
axes[1, 0].set_xticks(range(len(age_acc_plot)))
axes[1, 0].set_xticklabels(age_acc_plot.index, rotation=45)
axes[1, 0].set_ylabel('Mean Absolute Error')
axes[1, 0].set_title('Projection Error by Age Group')

# Error by scoring tier
score_acc_plot = pred_analysis.groupby('scoring_tier')['abs_error'].mean()
axes[1, 1].bar(range(len(score_acc_plot)), score_acc_plot.values)
axes[1, 1].set_xticks(range(len(score_acc_plot)))
axes[1, 1].set_xticklabels(score_acc_plot.index, rotation=45)
axes[1, 1].set_ylabel('Mean Absolute Error')
axes[1, 1].set_title('Projection Error by Scoring Volume')

plt.tight_layout()
plt.savefig('projection_accuracy_analysis.png', dpi=300)
plt.show()

Calibration and Uncertainty Quantification

from sklearn.ensemble import GradientBoostingRegressor
import scipy.stats as stats

# Quantile regression for uncertainty estimates
def train_quantile_models(X_train, y_train, quantiles=[0.1, 0.5, 0.9]):
    """Train separate models for different quantiles."""
    models = {}

    for q in quantiles:
        model = GradientBoostingRegressor(
            n_estimators=200,
            max_depth=5,
            learning_rate=0.05,
            loss='quantile',
            alpha=q,
            random_state=42
        )
        model.fit(X_train, y_train)
        models[q] = model

    return models

# Train quantile models
quantile_models = train_quantile_models(X_train, y_train)

# Make probabilistic predictions
def predict_with_intervals(X, quantile_models):
    """Generate point estimate and prediction intervals."""
    predictions = {
        'median': quantile_models[0.5].predict(X),
        'lower_80': quantile_models[0.1].predict(X),
        'upper_80': quantile_models[0.9].predict(X),
    }
    return pd.DataFrame(predictions)

# Generate predictions with intervals
pred_intervals = predict_with_intervals(X_test, quantile_models)
pred_intervals['actual'] = y_test.values

# Calculate coverage (percentage of actuals within intervals)
pred_intervals['in_interval'] = (
    (pred_intervals['actual'] >= pred_intervals['lower_80']) &
    (pred_intervals['actual'] <= pred_intervals['upper_80'])
)

coverage = pred_intervals['in_interval'].mean()
print(f"80% Prediction Interval Coverage: {coverage:.1%}")

# Calibration plot
def plot_calibration(pred_intervals):
    """Plot calibration curve for prediction intervals."""
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))

    # Prediction intervals
    sample_idx = np.random.choice(len(pred_intervals), size=100, replace=False)
    sample_df = pred_intervals.iloc[sample_idx].sort_values('median')

    x = range(len(sample_df))
    axes[0].plot(x, sample_df['median'], 'b-', label='Median Projection')
    axes[0].fill_between(
        x,
        sample_df['lower_80'],
        sample_df['upper_80'],
        alpha=0.3,
        label='80% Interval'
    )
    axes[0].scatter(x, sample_df['actual'], color='red', s=20,
                    alpha=0.6, label='Actual')
    axes[0].set_xlabel('Player (sorted by projection)')
    axes[0].set_ylabel('Points per 36')
    axes[0].set_title('Projection Intervals vs Actual Performance')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Residual distribution
    residuals = pred_intervals['actual'] - pred_intervals['median']
    axes[1].hist(residuals, bins=50, density=True, alpha=0.7,
                 edgecolor='black', label='Actual Residuals')

    # Overlay normal distribution
    mu, sigma = residuals.mean(), residuals.std()
    x_norm = np.linspace(residuals.min(), residuals.max(), 100)
    axes[1].plot(x_norm, stats.norm.pdf(x_norm, mu, sigma),
                'r-', linewidth=2, label='Normal Distribution')

    axes[1].set_xlabel('Residual (Actual - Projected)')
    axes[1].set_ylabel('Density')
    axes[1].set_title('Residual Distribution')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('projection_calibration.png', dpi=300)
    plt.show()

plot_calibration(pred_intervals)

Practical Applications

Contract Valuation

Player projection systems are essential for evaluating contract offers:

def calculate_contract_value(player_projections, salary_cap=140_000_000):
    """
    Estimate fair contract value based on projected performance.
    Uses wins above replacement (WAR) and market rates.
    """
    # Convert projections to WAR estimate
    # Simplified formula: WAR ≈ (RAPTOR × Minutes) / 100
    player_projections['projected_war'] = (
        player_projections['projected_raptor'] *
        player_projections['projected_minutes']
    ) / 100

    # Calculate marginal win value
    # Rule of thumb: 1 WAR ≈ $2M in salary cap
    dollars_per_war = salary_cap * 0.014  # ~$2M at $140M cap

    player_projections['contract_value'] = (
        player_projections['projected_war'] * dollars_per_war
    )

    # Age and injury adjustments
    player_projections['risk_discount'] = np.where(
        player_projections['age'] > 30,
        0.85,  # 15% discount for age risk
        1.0
    )

    player_projections['adjusted_value'] = (
        player_projections['contract_value'] *
        player_projections['risk_discount']
    )

    return player_projections

# Example: Evaluate max contract decision
player_proj = pd.DataFrame({
    'player_name': ['Player A'],
    'age': [28],
    'projected_raptor': [4.5],
    'projected_minutes': [2400],
})

contract_eval = calculate_contract_value(player_proj)
print(f"Projected WAR: {contract_eval['projected_war'].values[0]:.2f}")
print(f"Fair Contract Value: ${contract_eval['adjusted_value'].values[0]/1e6:.1f}M per year")

Trade Analysis

Projection systems help evaluate trade proposals by comparing future value:

def evaluate_trade(players_out, players_in, projection_model, years=3):
    """
    Compare projected value of players in vs out over contract period.
    """
    def project_future_value(player_data, years):
        total_value = 0

        for year in range(1, years + 1):
            # Age player
            future_age = player_data['age'] + year

            # Generate projection
            features = create_features_for_projection(player_data, future_age)
            projected_performance = projection_model.predict(features)[0]

            # Convert to WAR and discount future years
            war = projected_performance * 0.15  # Simplified conversion
            discount_factor = 0.9 ** year  # Time discount

            total_value += war * discount_factor

        return total_value

    # Calculate total value for each side
    value_out = sum(project_future_value(p, years) for p in players_out)
    value_in = sum(project_future_value(p, years) for p in players_in)

    net_value = value_in - value_out

    print(f"Players Out - Projected Value: {value_out:.2f} WAR")
    print(f"Players In - Projected Value: {value_in:.2f} WAR")
    print(f"Net Value: {net_value:+.2f} WAR")

    if net_value > 0:
        print("Recommendation: ACCEPT trade")
    else:
        print("Recommendation: REJECT trade")

    return net_value

Draft Strategy

Project rookie performance and development to optimize draft selections:

def project_rookie_career(college_stats, draft_position,
                              historical_comparisons):
    """
    Project NBA career trajectory for draft prospects.
    Uses historical player comparisons and translation factors.
    """
    # College-to-NBA translation factors
    translation_factors = {
        'pts': 0.45,  # NCAA scoring translates at ~45%
        'reb': 0.65,
        'ast': 0.55,
        'ts_pct': 0.85  # Efficiency translates better
    }

    # Translate college stats
    nba_rookie_proj = {
        stat: college_stats[stat] * factor
        for stat, factor in translation_factors.items()
    }

    # Draft position adjustment
    pick_value_curve = {
        range(1, 4): 1.3,    # Top 3 picks: 30% boost
        range(4, 11): 1.1,   # Lottery: 10% boost
        range(11, 31): 1.0,  # First round: no adjustment
        range(31, 61): 0.85  # Second round: 15% penalty
    }

    adjustment = next(
        adj for picks, adj in pick_value_curve.items()
        if draft_position in picks
    )

    nba_rookie_proj = {k: v * adjustment for k, v in nba_rookie_proj.items()}

    # Project Year 2-5 using typical growth curves
    career_projection = []
    for year in range(1, 6):
        year_growth = min(1.0 + (year - 1) * 0.15, 1.6)  # 15% growth per year, cap at 60%

        year_proj = {
            'season': year,
            **{k: v * year_growth for k, v in nba_rookie_proj.items()}
        }
        career_projection.append(year_proj)

    return pd.DataFrame(career_projection)

# Example: Project lottery pick
college_stats = {
    'pts': 18.5,
    'reb': 8.2,
    'ast': 3.1,
    'ts_pct': 0.595
}

rookie_projection = project_rookie_career(
    college_stats=college_stats,
    draft_position=5,
    historical_comparisons=None  # Would use similarity scores
)

print("5-Year Rookie Projection:")
print(rookie_projection)

Injury Impact Assessment

def adjust_projection_for_injury(baseline_projection, injury_history):
    """
    Adjust performance projections based on injury history and type.
    """
    injury_impact_factors = {
        'ACL tear': {'impact': 0.85, 'recovery_years': 2},
        'Achilles tear': {'impact': 0.70, 'recovery_years': 2},
        'Ankle sprain': {'impact': 0.95, 'recovery_years': 1},
        'Back injury': {'impact': 0.90, 'recovery_years': 1},
        'Knee surgery': {'impact': 0.88, 'recovery_years': 2},
    }

    adjusted_projection = baseline_projection.copy()

    for injury in injury_history:
        injury_type = injury['type']
        years_since = injury['years_ago']

        if injury_type in injury_impact_factors:
            impact = injury_impact_factors[injury_type]['impact']
            recovery = injury_impact_factors[injury_type]['recovery_years']

            # Impact diminishes over time
            if years_since < recovery:
                recovery_progress = years_since / recovery
                current_impact = impact + (1 - impact) * recovery_progress
            else:
                current_impact = 1.0

            # Apply impact to projections
            adjusted_projection['projected_pts'] *= current_impact
            adjusted_projection['projected_minutes'] *= current_impact

    # Add injury risk factor for future seasons
    recent_injuries = sum(1 for inj in injury_history if inj['years_ago'] < 2)
    injury_risk = min(0.2 * recent_injuries, 0.5)  # Max 50% injury risk

    adjusted_projection['injury_risk'] = injury_risk
    adjusted_projection['games_played_proj'] *= (1 - injury_risk)

    return adjusted_projection

Best Practices and Considerations

Key Principles

Regression to the mean: Extreme performances tend to regress toward average
Sample size matters: Weight multi-year averages more than single-season outliers
Context is critical: Account for team quality, usage, and role changes
Age curves are predictable: Use historical patterns but allow for individual variance
Uncertainty quantification: Always provide ranges, not just point estimates

Common Pitfalls

Overfitting to recent data: Short-term trends can mislead projections
Ignoring playing time: Opportunity is as important as ability
Position changes: Role transitions affect all statistical categories
Injury concerns: Health history significantly impacts projections
System changes: New coaches and teammates alter usage patterns

Validation Metrics

RMSE and MAE: Measure average prediction error
R² score: Proportion of variance explained by the model
Calibration: Do 80% prediction intervals contain 80% of actuals?
Player-level accuracy: Track errors for specific player types
Time-based validation: Always test on future seasons, not random splits

Future Directions

Emerging Techniques

Transfer learning: Apply models from other leagues (NCAA, international)
Attention mechanisms: Neural networks that focus on key performance factors
Graph neural networks: Model player relationships and team dynamics
Causal inference: Distinguish correlation from true causal effects
Multi-task learning: Jointly predict multiple stats and outcomes

Data Opportunities

Second Spectrum tracking: More granular movement and decision data
Biometric data: Sleep, training load, and health metrics
Shot quality models: Better estimate of scoring ability vs. variance
Lineup data: Predict performance in specific team contexts
External factors: Travel, rest, opponent strength adjustments

Key Takeaways

Player projection systems combine statistical modeling, machine learning, and domain expertise to forecast NBA performance
Age curves are the foundation - most players peak around 27-29 and decline predictably after 30
Multiple modeling approaches (linear, tree-based, neural networks) capture different patterns
Rigorous backtesting with time series validation is essential for reliable projections
Practical applications include contract valuation, trade analysis, draft strategy, and injury assessment
Uncertainty quantification through prediction intervals provides more actionable information than point estimates

Win Probability Models Previous

NBA Draft Prediction Models Next

Discussion

Have questions or feedback? Join our community discussion on Discord or GitHub Discussions.

Table of Contents