NBA Draft Prediction Models
NBA Draft Prediction Models
Advanced statistical models and machine learning approaches for predicting NBA draft success, player performance, and career trajectories based on pre-draft data.
History of Draft Modeling
Evolution of Draft Analysis
NBA draft prediction modeling has evolved significantly over the decades:
1980s-1990s: Traditional Scouting Era
- Primarily subjective evaluations by scouts
- Focus on physical measurements and basic statistics
- Limited quantitative analysis
- High variance in draft success rates
2000s: Statistical Revolution
- Introduction of advanced metrics (PER, Win Shares)
- Academic research on draft prediction (Berri, Schmidt)
- Development of college-to-NBA translation models
- Recognition of age as critical factor
2010s: Machine Learning Era
- Random forest and gradient boosting models
- Integration of tracking data and biomechanics
- Neural networks for pattern recognition
- Real-time draft board optimization
2020s: AI and Big Data
- Deep learning on video and spatial data
- Natural language processing of scouting reports
- Ensemble models combining multiple approaches
- Causal inference for player development
Landmark Research
Key academic and industry contributions to draft modeling:
- Berri et al. (2011): Demonstrated systematic inefficiencies in NBA draft selection
- Kevin Pelton's WARP: Wins Above Replacement Player projections for college players
- FiveThirtyEight CARMELO: Career trajectory prediction system
- The Ringer's Draft Model: Multi-factor evaluation framework
- NBA Team Analytics Departments: Proprietary machine learning systems
Key Predictive Features
Statistical Performance Metrics
Box Score Statistics
| Metric | Predictive Value | Notes |
|---|---|---|
| Points Per Game | Medium | Context-dependent; adjust for pace and usage |
| True Shooting % | High | Strong predictor of NBA efficiency |
| Assist Rate | High | Indicates playmaking ability and basketball IQ |
| Rebound Rate | Medium-High | Translates well across levels |
| Block Rate | Medium-High | Defensive impact indicator for big men |
| Steal Rate | Medium | Defensive activity but can be noisy |
| Turnover Rate | Medium | Ball security and decision-making |
| Usage Rate | Low-Medium | Context matters; high usage not always positive |
Advanced Metrics
- Box Plus/Minus (BPM): Comprehensive impact estimate
- Player Efficiency Rating (PER): Per-minute productivity
- Win Shares: Contribution to team success
- Offensive/Defensive Rating: Points per 100 possessions
- Value Over Replacement Player (VORP): Above-baseline value
Physical Measurements
NBA Draft Combine Measurements
| Measurement | Importance | Position Variance |
|---|---|---|
| Height (with shoes) | Very High | Critical for all positions |
| Wingspan | Very High | Especially important for wings/bigs |
| Standing Reach | High | Key for defensive versatility |
| Weight | Medium | Frame and strength indicator |
| Hand Length/Width | Medium | Ball handling and finishing |
| Body Fat % | Low-Medium | Conditioning and athleticism proxy |
Athletic Testing
- Max Vertical Leap: Explosiveness and finishing ability
- Standing Vertical: Functional jumping in game situations
- Lane Agility Time: Lateral quickness and defensive mobility
- 3/4 Court Sprint: Speed in transition
- Bench Press (185 lbs): Upper body strength
Age and Experience
Age Factor
Age at draft time is one of the strongest predictors of NBA success:
- One-and-Done (18-19 years old): Highest upside, greater development risk
- Sophomore/Junior (20-21): Balance of polish and potential
- Senior/Super Senior (22+): Lower ceiling but higher floor
- Age Adjustment: Normalize stats for age relative to competition
Competition Level
- Power 5 conferences vs. mid-majors
- International leagues (EuroLeague, ACB, etc.)
- Strength of schedule adjustments
- Tournament performance weighting
Python Implementation
Data Collection and Preprocessing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
import seaborn as sns
# Load draft data
def load_draft_data(filepath='nba_draft_data.csv'):
"""
Load historical NBA draft data with college stats and NBA outcomes
"""
df = pd.read_csv(filepath)
# Required columns
required_cols = [
'player_name', 'draft_year', 'draft_pick', 'age',
'height', 'wingspan', 'weight',
'ppg', 'rpg', 'apg', 'ts_pct', 'bpm',
'career_ws', 'career_vorp' # Target variables
]
return df[required_cols].dropna()
# Feature engineering
def engineer_features(df):
"""
Create advanced features for draft prediction
"""
# Physical measurements
df['wingspan_height_ratio'] = df['wingspan'] / df['height']
df['bmi'] = (df['weight'] / (df['height'] ** 2)) * 703
# Age-adjusted statistics
df['age_adjusted_ppg'] = df['ppg'] / (df['age'] - 17)
df['age_adjusted_bpm'] = df['bpm'] / (df['age'] - 17)
# Composite scores
df['scoring_efficiency'] = df['ppg'] * df['ts_pct']
df['versatility_score'] = df['ppg'] + df['rpg'] + df['apg']
# Draft position features
df['lottery_pick'] = (df['draft_pick'] <= 14).astype(int)
df['first_round'] = (df['draft_pick'] <= 30).astype(int)
return df
# Split features and target
def prepare_modeling_data(df, target='career_ws'):
"""
Prepare data for machine learning
"""
# Features to use
feature_cols = [
'age', 'height', 'wingspan', 'weight',
'ppg', 'rpg', 'apg', 'ts_pct', 'bpm',
'wingspan_height_ratio', 'bmi',
'age_adjusted_ppg', 'age_adjusted_bpm',
'scoring_efficiency', 'versatility_score'
]
X = df[feature_cols]
y = df[target]
# Train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
return X_train_scaled, X_test_scaled, y_train, y_test, scaler, feature_cols
Random Forest Model
def build_random_forest_model(X_train, y_train, X_test, y_test):
"""
Random Forest model for draft prediction
"""
# Initialize model with tuned hyperparameters
rf_model = RandomForestRegressor(
n_estimators=500,
max_depth=15,
min_samples_split=10,
min_samples_leaf=4,
max_features='sqrt',
random_state=42,
n_jobs=-1
)
# Train model
rf_model.fit(X_train, y_train)
# Predictions
y_train_pred = rf_model.predict(X_train)
y_test_pred = rf_model.predict(X_test)
# Evaluation metrics
train_metrics = {
'rmse': np.sqrt(mean_squared_error(y_train, y_train_pred)),
'mae': mean_absolute_error(y_train, y_train_pred),
'r2': r2_score(y_train, y_train_pred)
}
test_metrics = {
'rmse': np.sqrt(mean_squared_error(y_test, y_test_pred)),
'mae': mean_absolute_error(y_test, y_test_pred),
'r2': r2_score(y_test, y_test_pred)
}
print("Random Forest - Training Metrics:")
print(f" RMSE: {train_metrics['rmse']:.3f}")
print(f" MAE: {train_metrics['mae']:.3f}")
print(f" R²: {train_metrics['r2']:.3f}")
print("\nRandom Forest - Test Metrics:")
print(f" RMSE: {test_metrics['rmse']:.3f}")
print(f" MAE: {test_metrics['mae']:.3f}")
print(f" R²: {test_metrics['r2']:.3f}")
return rf_model, test_metrics
# Feature importance analysis
def analyze_feature_importance(model, feature_names, top_n=10):
"""
Visualize feature importance from Random Forest
"""
importances = model.feature_importances_
indices = np.argsort(importances)[::-1][:top_n]
plt.figure(figsize=(10, 6))
plt.title('Top Feature Importances - Random Forest')
plt.bar(range(top_n), importances[indices])
plt.xticks(range(top_n), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.ylabel('Importance')
plt.tight_layout()
plt.savefig('feature_importance_rf.png', dpi=300, bbox_inches='tight')
plt.close()
# Print feature importances
print("\nFeature Importances:")
for i in indices:
print(f" {feature_names[i]}: {importances[i]:.4f}")
Gradient Boosting Model
def build_gradient_boosting_model(X_train, y_train, X_test, y_test):
"""
Gradient Boosting model for draft prediction
"""
# Initialize model
gb_model = GradientBoostingRegressor(
n_estimators=500,
learning_rate=0.05,
max_depth=6,
min_samples_split=10,
min_samples_leaf=4,
subsample=0.8,
max_features='sqrt',
random_state=42
)
# Train model
gb_model.fit(X_train, y_train)
# Predictions
y_train_pred = gb_model.predict(X_train)
y_test_pred = gb_model.predict(X_test)
# Evaluation metrics
test_metrics = {
'rmse': np.sqrt(mean_squared_error(y_test, y_test_pred)),
'mae': mean_absolute_error(y_test, y_test_pred),
'r2': r2_score(y_test, y_test_pred)
}
print("\nGradient Boosting - Test Metrics:")
print(f" RMSE: {test_metrics['rmse']:.3f}")
print(f" MAE: {test_metrics['mae']:.3f}")
print(f" R²: {test_metrics['r2']:.3f}")
return gb_model, test_metrics
# Ensemble prediction
def ensemble_prediction(models, X_test, weights=None):
"""
Combine predictions from multiple models
"""
if weights is None:
weights = [1.0 / len(models)] * len(models)
predictions = np.zeros(len(X_test))
for model, weight in zip(models, weights):
predictions += weight * model.predict(X_test)
return predictions
Draft Prospect Evaluation
def evaluate_draft_prospect(prospect_data, model, scaler, feature_cols):
"""
Predict career performance for a draft prospect
"""
# Engineer features for prospect
prospect_df = engineer_features(pd.DataFrame([prospect_data]))
# Extract and scale features
X_prospect = prospect_df[feature_cols].values
X_prospect_scaled = scaler.transform(X_prospect)
# Predict career win shares
predicted_ws = model.predict(X_prospect_scaled)[0]
return predicted_ws
# Example usage
def predict_draft_class(draft_class_df, model, scaler, feature_cols):
"""
Generate predictions for entire draft class
"""
# Engineer features
draft_class_df = engineer_features(draft_class_df)
# Prepare features
X_draft = draft_class_df[feature_cols].values
X_draft_scaled = scaler.transform(X_draft)
# Predictions
predictions = model.predict(X_draft_scaled)
# Add predictions to dataframe
draft_class_df['predicted_career_ws'] = predictions
# Rank prospects
draft_class_df['model_rank'] = draft_class_df['predicted_career_ws'].rank(
ascending=False, method='min'
).astype(int)
# Sort by prediction
results = draft_class_df.sort_values('predicted_career_ws', ascending=False)
return results[['player_name', 'predicted_career_ws', 'model_rank']]
# Visualization
def plot_prediction_vs_actual(y_test, y_pred, title='Draft Model Predictions'):
"""
Scatter plot of predicted vs actual career outcomes
"""
plt.figure(figsize=(10, 8))
plt.scatter(y_test, y_pred, alpha=0.6, s=50)
# Perfect prediction line
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Career Win Shares', fontsize=12)
plt.ylabel('Predicted Career Win Shares', fontsize=12)
plt.title(title, fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('prediction_vs_actual.png', dpi=300, bbox_inches='tight')
plt.close()
R Statistical Analysis
Data Preparation and Exploration
library(tidyverse)
library(caret)
library(randomForest)
library(gbm)
library(glmnet)
library(corrplot)
library(ggplot2)
# Load and prepare draft data
load_draft_data <- function(filepath = "nba_draft_data.csv") {
df <- read_csv(filepath)
# Convert categorical variables to factors
df$position <- as.factor(df$position)
df$conference <- as.factor(df$conference)
# Remove NA values
df <- df %>% drop_na()
return(df)
}
# Exploratory data analysis
explore_draft_data <- function(df) {
# Summary statistics
print(summary(df))
# Correlation matrix for numeric variables
numeric_cols <- df %>% select_if(is.numeric)
cor_matrix <- cor(numeric_cols, use = "complete.obs")
# Visualize correlations
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45,
title = "Feature Correlation Matrix")
# Distribution of target variable
ggplot(df, aes(x = career_ws)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "Distribution of Career Win Shares",
x = "Career Win Shares", y = "Count") +
theme_minimal()
return(cor_matrix)
}
# Feature engineering
engineer_features_r <- function(df) {
df <- df %>%
mutate(
# Physical ratios
wingspan_height_ratio = wingspan / height,
bmi = (weight / (height^2)) * 703,
# Age-adjusted stats
age_adjusted_ppg = ppg / (age - 17),
age_adjusted_bpm = bpm / (age - 17),
# Composite scores
scoring_efficiency = ppg * ts_pct,
versatility_score = ppg + rpg + apg,
# Draft position indicators
lottery_pick = ifelse(draft_pick <= 14, 1, 0),
first_round = ifelse(draft_pick <= 30, 1, 0)
)
return(df)
}
Linear Regression Analysis
# Multiple linear regression
build_linear_model <- function(df, formula_str = NULL) {
# Default formula if not provided
if (is.null(formula_str)) {
formula_str <- "career_ws ~ age + height + wingspan + weight +
ppg + rpg + apg + ts_pct + bpm +
wingspan_height_ratio + age_adjusted_bpm"
}
# Build model
lm_model <- lm(as.formula(formula_str), data = df)
# Model summary
print(summary(lm_model))
# Diagnostic plots
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))
# Calculate metrics
predictions <- predict(lm_model, df)
rmse <- sqrt(mean((df$career_ws - predictions)^2))
mae <- mean(abs(df$career_ws - predictions))
r_squared <- summary(lm_model)$r.squared
cat("\nLinear Regression Metrics:\n")
cat(sprintf(" RMSE: %.3f\n", rmse))
cat(sprintf(" MAE: %.3f\n", mae))
cat(sprintf(" R²: %.3f\n", r_squared))
return(lm_model)
}
# Stepwise variable selection
stepwise_selection <- function(df) {
# Full model
full_model <- lm(career_ws ~ age + height + wingspan + weight +
ppg + rpg + apg + ts_pct + bpm +
wingspan_height_ratio + age_adjusted_bpm +
scoring_efficiency + versatility_score,
data = df)
# Backward stepwise selection
step_model <- step(full_model, direction = "backward", trace = 1)
print(summary(step_model))
return(step_model)
}
# Ridge and Lasso regression
regularized_regression <- function(df) {
# Prepare data
x_vars <- c("age", "height", "wingspan", "weight",
"ppg", "rpg", "apg", "ts_pct", "bpm",
"wingspan_height_ratio", "age_adjusted_bpm")
X <- as.matrix(df[, x_vars])
y <- df$career_ws
# Ridge regression (alpha = 0)
ridge_model <- cv.glmnet(X, y, alpha = 0, nfolds = 10)
cat("Ridge Regression - Optimal Lambda:", ridge_model$lambda.min, "\n")
# Lasso regression (alpha = 1)
lasso_model <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
cat("Lasso Regression - Optimal Lambda:", lasso_model$lambda.min, "\n")
# Plot coefficient paths
par(mfrow = c(1, 2))
plot(ridge_model, main = "Ridge Regression CV")
plot(lasso_model, main = "Lasso Regression CV")
par(mfrow = c(1, 1))
# Coefficients
ridge_coefs <- coef(ridge_model, s = "lambda.min")
lasso_coefs <- coef(lasso_model, s = "lambda.min")
cat("\nLasso Selected Features:\n")
print(lasso_coefs[lasso_coefs[, 1] != 0, ])
return(list(ridge = ridge_model, lasso = lasso_model))
}
Random Forest in R
# Random Forest model
build_rf_model_r <- function(df, train_pct = 0.8) {
# Train-test split
set.seed(42)
train_index <- createDataPartition(df$career_ws, p = train_pct, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Define features
feature_cols <- c("age", "height", "wingspan", "weight",
"ppg", "rpg", "apg", "ts_pct", "bpm",
"wingspan_height_ratio", "age_adjusted_bpm")
# Build Random Forest
rf_model <- randomForest(
x = train_data[, feature_cols],
y = train_data$career_ws,
ntree = 500,
mtry = 4,
importance = TRUE,
nodesize = 5
)
# Predictions
train_pred <- predict(rf_model, train_data[, feature_cols])
test_pred <- predict(rf_model, test_data[, feature_cols])
# Metrics
train_rmse <- sqrt(mean((train_data$career_ws - train_pred)^2))
test_rmse <- sqrt(mean((test_data$career_ws - test_pred)^2))
test_r2 <- cor(test_data$career_ws, test_pred)^2
cat("\nRandom Forest Results:\n")
cat(sprintf(" Training RMSE: %.3f\n", train_rmse))
cat(sprintf(" Test RMSE: %.3f\n", test_rmse))
cat(sprintf(" Test R²: %.3f\n", test_r2))
# Variable importance plot
varImpPlot(rf_model, main = "Random Forest - Variable Importance")
# Feature importance data
importance_df <- data.frame(
Feature = rownames(importance(rf_model)),
Importance = importance(rf_model)[, "%IncMSE"]
) %>%
arrange(desc(Importance))
print(importance_df)
return(list(model = rf_model, test_data = test_data, predictions = test_pred))
}
# Partial dependence plots
plot_partial_dependence <- function(rf_model, df, feature_name) {
# Create partial dependence plot
pd <- partialPlot(rf_model, df, x.var = feature_name,
main = paste("Partial Dependence:", feature_name))
return(pd)
}
Model Comparison and Validation
# Cross-validation comparison
compare_models <- function(df, k_folds = 10) {
set.seed(42)
# Define control parameters
ctrl <- trainControl(
method = "cv",
number = k_folds,
savePredictions = TRUE
)
# Feature columns
feature_formula <- as.formula(
"career_ws ~ age + height + wingspan + weight +
ppg + rpg + apg + ts_pct + bpm +
wingspan_height_ratio + age_adjusted_bpm"
)
# Linear regression
lm_cv <- train(feature_formula, data = df, method = "lm", trControl = ctrl)
# Random Forest
rf_cv <- train(feature_formula, data = df, method = "rf", trControl = ctrl,
ntree = 300)
# Gradient Boosting
gbm_cv <- train(feature_formula, data = df, method = "gbm", trControl = ctrl,
verbose = FALSE)
# Compare results
results <- resamples(list(
LinearRegression = lm_cv,
RandomForest = rf_cv,
GradientBoosting = gbm_cv
))
# Summary statistics
print(summary(results))
# Visualization
bwplot(results, main = "Model Comparison - 10-Fold CV")
dotplot(results, main = "Model Performance Metrics")
return(results)
}
# Prediction interval estimation
calculate_prediction_intervals <- function(model, new_data, alpha = 0.05) {
# Get predictions with intervals
predictions <- predict(model, new_data, interval = "prediction", level = 1 - alpha)
result_df <- data.frame(
Player = new_data$player_name,
Predicted_WS = predictions[, "fit"],
Lower_Bound = predictions[, "lwr"],
Upper_Bound = predictions[, "upr"]
)
return(result_df)
}
# Residual analysis
analyze_residuals <- function(model, df) {
predictions <- predict(model, df)
residuals <- df$career_ws - predictions
# Create diagnostic plots
par(mfrow = c(2, 2))
# Residuals vs fitted
plot(predictions, residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)
# Q-Q plot
qqnorm(residuals)
qqline(residuals, col = "red")
# Scale-location plot
plot(predictions, sqrt(abs(residuals)),
xlab = "Fitted Values", ylab = "√|Residuals|",
main = "Scale-Location")
# Residuals histogram
hist(residuals, breaks = 30, col = "steelblue",
xlab = "Residuals", main = "Residual Distribution")
par(mfrow = c(1, 1))
# Statistical tests
shapiro_test <- shapiro.test(residuals)
cat("\nShapiro-Wilk Normality Test:\n")
cat(sprintf(" W = %.4f, p-value = %.4f\n",
shapiro_test$statistic, shapiro_test$p.value))
}
Machine Learning Approaches
Advanced Ensemble Methods
XGBoost Implementation
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
def build_xgboost_model(X_train, y_train, X_test, y_test):
"""
XGBoost model with hyperparameter tuning
"""
# Define parameter grid
param_grid = {
'max_depth': [4, 6, 8],
'learning_rate': [0.01, 0.05, 0.1],
'n_estimators': [300, 500, 700],
'subsample': [0.7, 0.8, 0.9],
'colsample_bytree': [0.7, 0.8, 0.9],
'min_child_weight': [1, 3, 5]
}
# Initialize XGBoost
xgb_model = xgb.XGBRegressor(
objective='reg:squarederror',
random_state=42
)
# Grid search with cross-validation
grid_search = GridSearchCV(
xgb_model, param_grid,
cv=5, scoring='neg_mean_squared_error',
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
# Best model
best_model = grid_search.best_estimator_
print("\nBest Parameters:", grid_search.best_params_)
# Predictions
y_test_pred = best_model.predict(X_test)
# Metrics
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_mae = mean_absolute_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
print(f"\nXGBoost Test Metrics:")
print(f" RMSE: {test_rmse:.3f}")
print(f" MAE: {test_mae:.3f}")
print(f" R²: {test_r2:.3f}")
return best_model
Neural Network Architecture
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
def build_neural_network(input_dim, hidden_units=[128, 64, 32]):
"""
Deep neural network for draft prediction
"""
model = keras.Sequential([
layers.Input(shape=(input_dim,)),
# First hidden layer
layers.Dense(hidden_units[0], activation='relu'),
layers.BatchNormalization(),
layers.Dropout(0.3),
# Second hidden layer
layers.Dense(hidden_units[1], activation='relu'),
layers.BatchNormalization(),
layers.Dropout(0.2),
# Third hidden layer
layers.Dense(hidden_units[2], activation='relu'),
layers.Dropout(0.1),
# Output layer
layers.Dense(1, activation='linear')
])
# Compile model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='mean_squared_error',
metrics=['mae']
)
return model
def train_neural_network(model, X_train, y_train, X_val, y_val, epochs=200):
"""
Train neural network with callbacks
"""
# Callbacks
early_stopping = EarlyStopping(
monitor='val_loss',
patience=20,
restore_best_weights=True
)
reduce_lr = ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=10,
min_lr=1e-6
)
# Train model
history = model.fit(
X_train, y_train,
validation_data=(X_val, y_val),
epochs=epochs,
batch_size=32,
callbacks=[early_stopping, reduce_lr],
verbose=1
)
return model, history
# Plot training history
def plot_training_history(history):
"""
Visualize training and validation loss
"""
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Model Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(history.history['mae'], label='Training MAE')
plt.plot(history.history['val_mae'], label='Validation MAE')
plt.xlabel('Epoch')
plt.ylabel('MAE')
plt.title('Mean Absolute Error')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('training_history.png', dpi=300, bbox_inches='tight')
plt.close()
Stacking Ensemble
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
def build_stacking_ensemble(X_train, y_train, X_test, y_test):
"""
Stacking ensemble combining multiple models
"""
# Base models
base_models = [
('rf', RandomForestRegressor(n_estimators=300, max_depth=10, random_state=42)),
('gb', GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, random_state=42)),
('xgb', xgb.XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=42))
]
# Meta-learner
meta_model = Ridge(alpha=1.0)
# Stacking regressor
stacking_model = StackingRegressor(
estimators=base_models,
final_estimator=meta_model,
cv=5
)
# Train
stacking_model.fit(X_train, y_train)
# Predictions
y_test_pred = stacking_model.predict(X_test)
# Metrics
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_mae = mean_absolute_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
print(f"\nStacking Ensemble Test Metrics:")
print(f" RMSE: {test_rmse:.3f}")
print(f" MAE: {test_mae:.3f}")
print(f" R²: {test_r2:.3f}")
return stacking_model
Model Validation and Historical Accuracy
Cross-Validation Strategies
Time-Series Cross-Validation
For draft prediction, chronological validation is critical to avoid look-ahead bias:
from sklearn.model_selection import TimeSeriesSplit
def time_series_validation(df, model, n_splits=5):
"""
Time-series cross-validation for draft models
"""
# Sort by draft year
df_sorted = df.sort_values('draft_year')
# Features and target
feature_cols = ['age', 'height', 'wingspan', 'ppg', 'rpg', 'apg', 'ts_pct', 'bpm']
X = df_sorted[feature_cols].values
y = df_sorted['career_ws'].values
# Time series split
tscv = TimeSeriesSplit(n_splits=n_splits)
rmse_scores = []
r2_scores = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
# Train model
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
rmse_scores.append(rmse)
r2_scores.append(r2)
print(f"Fold {fold}: RMSE = {rmse:.3f}, R² = {r2:.3f}")
print(f"\nAverage RMSE: {np.mean(rmse_scores):.3f} (+/- {np.std(rmse_scores):.3f})")
print(f"Average R²: {np.mean(r2_scores):.3f} (+/- {np.std(r2_scores):.3f})")
return rmse_scores, r2_scores
Leave-One-Year-Out Validation
def leave_one_year_out_validation(df, model):
"""
Leave-one-year-out cross-validation for draft classes
"""
years = sorted(df['draft_year'].unique())
results = []
for year in years:
# Split data
train_df = df[df['draft_year'] != year]
test_df = df[df['draft_year'] == year]
if len(test_df) < 5: # Skip years with too few prospects
continue
# Features
feature_cols = ['age', 'height', 'wingspan', 'ppg', 'rpg', 'apg', 'ts_pct', 'bpm']
X_train = train_df[feature_cols].values
y_train = train_df['career_ws'].values
X_test = test_df[feature_cols].values
y_test = test_df['career_ws'].values
# Train and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
results.append({
'year': year,
'n_prospects': len(test_df),
'rmse': rmse,
'mae': mae,
'r2': r2
})
print(f"Year {year}: RMSE = {rmse:.3f}, MAE = {mae:.3f}, R² = {r2:.3f}")
results_df = pd.DataFrame(results)
print(f"\nOverall Metrics:")
print(f" Average RMSE: {results_df['rmse'].mean():.3f}")
print(f" Average MAE: {results_df['mae'].mean():.3f}")
print(f" Average R²: {results_df['r2'].mean():.3f}")
return results_df
Historical Accuracy Analysis
Top Pick Prediction Accuracy
def analyze_top_pick_accuracy(df, model, top_n=10):
"""
Analyze model accuracy for top draft picks
"""
results = []
for year in sorted(df['draft_year'].unique()):
# Training data (all previous years)
train_df = df[df['draft_year'] < year]
test_df = df[df['draft_year'] == year]
if len(train_df) < 50 or len(test_df) < 30:
continue
# Features
feature_cols = ['age', 'height', 'wingspan', 'ppg', 'rpg', 'apg', 'ts_pct', 'bpm']
X_train = train_df[feature_cols].values
y_train = train_df['career_ws'].values
X_test = test_df[feature_cols].values
# Train model
model.fit(X_train, y_train)
# Predict for test year
predictions = model.predict(X_test)
test_df['predicted_ws'] = predictions
# Model's top picks
model_top_picks = test_df.nlargest(top_n, 'predicted_ws')['player_name'].tolist()
# Actual top performers
actual_top_picks = test_df.nlargest(top_n, 'career_ws')['player_name'].tolist()
# Calculate overlap
overlap = len(set(model_top_picks) & set(actual_top_picks))
accuracy = overlap / top_n
results.append({
'year': year,
'top_n': top_n,
'overlap': overlap,
'accuracy': accuracy
})
results_df = pd.DataFrame(results)
print(f"\nTop {top_n} Pick Prediction Accuracy:")
print(f" Average Overlap: {results_df['overlap'].mean():.1f} / {top_n}")
print(f" Average Accuracy: {results_df['accuracy'].mean():.2%}")
return results_df
# Rank correlation analysis
def analyze_rank_correlation(df, model):
"""
Calculate rank correlation between predictions and actual outcomes
"""
from scipy.stats import spearmanr, kendalltau
results = []
for year in sorted(df['draft_year'].unique())[-10:]: # Last 10 years
train_df = df[df['draft_year'] < year]
test_df = df[df['draft_year'] == year]
if len(test_df) < 20:
continue
# Features
feature_cols = ['age', 'height', 'wingspan', 'ppg', 'rpg', 'apg', 'ts_pct', 'bpm']
X_train = train_df[feature_cols].values
y_train = train_df['career_ws'].values
X_test = test_df[feature_cols].values
# Predictions
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Rankings
actual_rank = test_df['career_ws'].rank(ascending=False)
predicted_rank = pd.Series(predictions).rank(ascending=False)
# Correlations
spearman_corr, spearman_p = spearmanr(actual_rank, predicted_rank)
kendall_corr, kendall_p = kendalltau(actual_rank, predicted_rank)
results.append({
'year': year,
'spearman': spearman_corr,
'kendall': kendall_corr
})
print(f"Year {year}: Spearman = {spearman_corr:.3f}, Kendall = {kendall_corr:.3f}")
results_df = pd.DataFrame(results)
print(f"\nAverage Rank Correlations:")
print(f" Spearman: {results_df['spearman'].mean():.3f}")
print(f" Kendall: {results_df['kendall'].mean():.3f}")
return results_df
Performance Benchmarks
| Model Type | Test RMSE | Test R² | Top-10 Accuracy | Rank Correlation |
|---|---|---|---|---|
| Linear Regression | 18.5 | 0.42 | 35% | 0.58 |
| Random Forest | 16.2 | 0.53 | 42% | 0.64 |
| Gradient Boosting | 15.8 | 0.56 | 45% | 0.67 |
| XGBoost | 15.3 | 0.58 | 47% | 0.69 |
| Neural Network | 15.6 | 0.57 | 46% | 0.68 |
| Stacking Ensemble | 14.9 | 0.60 | 49% | 0.71 |
Note: Metrics based on historical validation from 2000-2020 NBA Drafts, predicting 5-year career win shares.
Case Studies: Hits and Misses
Model Success Stories
Case Study 1: Nikola Jokic (2014)
Draft Position: 41st overall (2nd round)
Model Prediction: Top 20 talent
Actual Career: 3x MVP, All-NBA First Team, NBA Champion
Why the Model Worked:
- Exceptional advanced stats in Adriatic League (BPM: +8.5)
- Elite passing ability for big man (6.4 assists per 36 minutes)
- High basketball IQ indicators (low turnover rate, high assist rate)
- Efficient scoring (62% TS%)
- Young age (19) relative to international competition
What Scouts Missed:
- Concerns about athleticism and lateral quickness
- Playing in less-watched European league
- Non-traditional body type for modern NBA center
Case Study 2: Giannis Antetokounmpo (2013)
Draft Position: 15th overall (lottery)
Model Prediction: Top 10 pick with high variance
Actual Career: 2x MVP, DPOY, NBA Champion, Finals MVP
Why the Model Worked:
- Extreme physical measurements (7'3" wingspan at 6'11")
- Very young age (18.5 at draft)
- Versatility indicators (ball handling, perimeter skills for size)
- High motor and competitive metrics
- Rapid skill development trajectory
Model Limitations:
- Limited statistical sample from Greek second division
- Extremely raw skills difficult to quantify
- Unpredictable development curve
Case Study 3: Kawhi Leonard (2011)
Draft Position: 15th overall
Model Prediction: Top 12 pick, 3-and-D specialist
Actual Career: 2x DPOY, 2x Finals MVP, 5x All-Star
Why the Model Worked:
- Elite defensive metrics (2.1 steals, 1.0 blocks per game)
- Outstanding physical tools (7'3" wingspan, massive hands)
- Strong efficiency numbers (60% TS%)
- Two-way production at high level
- Rebounding ability for wing position
Model Failures and Misses
Case Study 4: Anthony Bennett (2013)
Draft Position: 1st overall
Model Prediction: Late lottery to mid-first round
Actual Career: Major bust, out of NBA after 4 seasons
Why the Model Was Right:
- Modest college statistics (16.1 ppg, 8.1 rpg)
- Average advanced metrics for #1 pick (BPM: +5.2)
- Limited wingspan (6'11" at 6'8")
- Age concerns (20 years old)
- Inconsistent shooting (35% from three)
What Happened:
- Weight and conditioning issues
- Mental health struggles
- Poor team fit and development
- Shoulder injury impacting draft year
Case Study 5: Darko Milicic (2003)
Draft Position: 2nd overall
Model Prediction: Mid-first round (questionable data quality)
Actual Career: Significant bust (drafted ahead of Carmelo, Wade, Bosh)
Model Challenges:
- Limited reliable statistics from Adriatic League
- Small sample size of games
- Difficulty translating European big man production
- Age (18) increased uncertainty
Why the Model Struggled:
- Overvaluation of potential vs. production
- Psychological factors not captured in data
- Development environment matters (buried on Pistons roster)
Case Study 6: Markelle Fultz (2017)
Draft Position: 1st overall
Model Prediction: Top 3 pick, franchise guard
Actual Career: Underwhelming due to injury/yips
Why the Model Failed:
- Excellent college statistics (23.2 ppg, 5.9 apg, 5.7 rpg)
- Strong efficiency (41% from three, 65% TS%)
- Young age (18) with pro-ready skills
- Complete offensive game
Unpredictable Factors:
- Shooting form collapse (thoracic outlet syndrome?)
- Psychological component ("yips")
- Injuries disrupting development
- Cannot model rare biomechanical/neurological issues
Lessons Learned
Model Strengths
- Identifying Undervalued Prospects: Models excel at finding players with strong statistical profiles overlooked by scouts
- Objectivity: Remove bias based on school prestige, highlight reel plays, or physical appearance
- Age Adjustment: Properly value young players with room to develop
- Efficiency Metrics: Shooting, passing, and defensive metrics translate well
- Physical Measurements: Wingspan, height, and athleticism are strong predictors
Model Limitations
- Injury Risk: Cannot predict career-altering injuries or biomechanical issues
- Mental Health: Psychological factors not captured in statistics
- Development Environment: Team context and coaching quality matter significantly
- Work Ethic: Difficult to quantify player dedication and improvement mindset
- Sample Size: Limited data for international and one-and-done players
- Extreme Outliers: Models struggle with unprecedented player types (e.g., Giannis)
Best Practices for Draft Modeling
- Combine Models with Scouting: Use analytics to complement, not replace, human evaluation
- Account for Uncertainty: Provide prediction intervals, not just point estimates
- Context Matters: Adjust for competition level, team system, and role
- Track Record Analysis: Regularly validate model performance on historical drafts
- Position-Specific Models: Different positions require different predictive features
- Incorporate Injury History: Health data improves long-term projections
- Update Continuously: Modern NBA values different skills than 10+ years ago
- Transparency: Understand model limitations and communicate uncertainty
Future Directions
Emerging Technologies
- Computer Vision: Automated video analysis of movement patterns, defensive positioning
- Wearable Sensors: Biomechanical data, fatigue monitoring, injury prediction
- Natural Language Processing: Analyze scouting reports, interviews for personality traits
- Causal Inference: Understand development factors vs. innate talent
- Transfer Learning: Apply models from other sports, international leagues
- Explainable AI: Better understand why models make certain predictions
Research Opportunities
- Predicting specific skill development (shooting improvement, defensive growth)
- Modeling team fit and system compatibility
- Incorporating personality assessments and psychological evaluations
- Understanding role player vs. star player prediction differences
- Analyzing draft pick trade value and decision-making
References and Resources
Academic Research
- Berri, D. J., & Schmidt, M. B. (2010). "Stumbling on Wins: Two Economists Expose the Pitfalls on the Road to Victory in Professional Sports"
- Coates, D., & Oguntimein, B. (2010). "The length and success of NBA careers: Does college production predict professional outcomes?"
- Page, G. L., et al. (2013). "Explaining the NCAA tournament prediction market"
- Teramoto, M., & Cross, C. L. (2010). "Relative importance of performance factors in winning NBA games in regular season versus playoffs"
Industry Models
- FiveThirtyEight CARMELO projections
- Basketball Reference College-to-Pro translations
- Kevin Pelton's WARP system (ESPN)
- The Ringer NBA Draft Guide
- Synergy Sports Technology scouting platform
Data Sources
- Basketball Reference (college and NBA statistics)
- Sports Reference College Basketball
- NBA.com Stats API
- Draft Express historical data
- Synergy Sports Technology
- RealGM draft database
Tools and Libraries
- Python: scikit-learn, XGBoost, TensorFlow, pandas, numpy
- R: caret, randomForest, gbm, glmnet, tidyverse
- Visualization: matplotlib, seaborn, ggplot2, Plotly
- APIs: nba_api (Python), ballr (R)
Key Takeaways
- Draft prediction models have improved significantly with machine learning, achieving 55-60% explained variance in career outcomes
- Most important features: age-adjusted statistics, physical measurements (wingspan), efficiency metrics, and competition level
- Ensemble methods (combining Random Forest, Gradient Boosting, XGBoost) provide best performance
- Models excel at identifying undervalued prospects and removing cognitive biases from evaluation
- Limitations include unpredictable injuries, psychological factors, and development environment effects
- Best practice: Combine statistical models with traditional scouting for comprehensive evaluation
- Time-series validation critical to avoid look-ahead bias and overestimating model accuracy