In This Chapter
- Project Overview
- Part 1: Data Collection and Integration
- Part 2: Feature Engineering for Draft Prediction
- Part 3: Target Variable Definition
- Part 4: Model Selection and Training
- Part 5: Backtesting Methodology
- Part 6: Evaluation Metrics
- Part 7: Creating a Draft Board
- Part 8: Visualization of Results
- Part 9: Presentation to Stakeholders
- Part 10: Complete Python ML Pipeline
- Extension Exercises
- Summary
- References and Further Reading
Capstone Project 2: Create a Draft Model
Project Overview
The NBA Draft represents one of the highest-leverage decisions a basketball organization makes each year. A single draft pick can define a franchise for a decade or more, with the difference between selecting a perennial All-Star versus a career backup potentially worth hundreds of millions of dollars in both direct compensation and downstream revenue effects. This capstone project guides you through building a comprehensive draft prediction model that integrates college statistics, athletic measurements, contextual factors, and historical outcomes to generate actionable intelligence for draft decision-making.
Learning Objectives
By completing this project, you will be able to:
- Collect and integrate heterogeneous data sources including college box score statistics, advanced metrics, combine measurements, and historical draft outcomes
- Engineer predictive features that capture player potential beyond raw statistics
- Define and operationalize target variables that meaningfully represent career value
- Select and train appropriate machine learning models for player projection
- Implement rigorous backtesting methodology that simulates real-world prediction scenarios
- Evaluate model performance using domain-appropriate metrics
- Create actionable draft boards that communicate predictions to stakeholders
- Visualize uncertainty and comparisons in ways that support decision-making
- Present findings professionally to both technical and non-technical audiences
Professional Context
NBA front offices increasingly rely on quantitative draft analysis to inform their selections. Modern draft rooms typically include:
- Statistical models that project college performance to NBA outcomes
- Comparison systems that identify historical players with similar profiles
- Risk assessments that quantify the uncertainty around projections
- Value estimates that translate projections into draft position recommendations
This project replicates the analytical workflow used by professional basketball analytics departments. The techniques you develop here transfer directly to front office work, player agency analysis, and media draft coverage.
Part 1: Data Collection and Integration
1.1 Data Sources Overview
Building a draft model requires integrating multiple data sources, each providing distinct signal about player potential:
| Data Source | Key Variables | Coverage | Primary Signal |
|---|---|---|---|
| College Box Scores | Points, rebounds, assists, etc. | 2000-present | Production level |
| College Advanced Stats | PER, BPM, WS/40 | 2010-present | Efficiency and impact |
| NBA Combine | Height, wingspan, vertical, agility | 2000-present | Athletic tools |
| Biographical Data | Age, experience, school tier | Complete | Context and trajectory |
| Draft History | Pick number, team, trade details | Complete | Selection outcomes |
| NBA Career Stats | Career totals, per-game, advanced | Complete | Target variables |
1.2 Data Collection Implementation
"""
data_collection.py - Comprehensive draft data collection module
This module handles the collection and integration of data from multiple
sources required for draft modeling.
"""
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Tuple
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import time
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class DraftDataCollector:
"""
Collects and integrates data from multiple sources for draft modeling.
This class handles the complexity of combining college statistics,
combine measurements, and NBA career outcomes into a unified dataset.
"""
    def __init__(self, start_year: int = 2000, end_year: Optional[int] = None):
"""
Initialize the data collector.
Args:
start_year: First draft year to include
end_year: Last draft year to include (defaults to current year - 4
to ensure sufficient NBA career data for evaluation)
"""
self.start_year = start_year
self.end_year = end_year or (datetime.now().year - 4)
self.data_cache = {}
def collect_college_stats(self, player_ids: List[str]) -> pd.DataFrame:
"""
Collect college statistics for a list of players.
In production, this would connect to a college statistics database
or API. For this example, we demonstrate the expected data structure.
Args:
player_ids: List of unique player identifiers
Returns:
DataFrame with college statistics
"""
# Expected columns for college statistics
college_columns = [
'player_id', 'name', 'season', 'school', 'conference',
'games_played', 'games_started', 'minutes_per_game',
'points_per_game', 'rebounds_per_game', 'assists_per_game',
'steals_per_game', 'blocks_per_game', 'turnovers_per_game',
'fg_pct', 'three_pt_pct', 'ft_pct', 'usage_rate',
'true_shooting_pct', 'assist_rate', 'turnover_rate',
'rebound_rate', 'block_rate', 'steal_rate',
'box_plus_minus', 'win_shares_per_40'
]
logger.info(f"Collecting college stats for {len(player_ids)} players")
# In production: Query database or API
# df = self._query_college_database(player_ids)
# Placeholder for demonstration
df = pd.DataFrame(columns=college_columns)
return df
def collect_combine_data(self, draft_years: List[int]) -> pd.DataFrame:
"""
Collect NBA Combine measurements for specified draft years.
The combine provides standardized athletic measurements that
are particularly valuable for projecting players with limited
college statistical samples.
Args:
draft_years: List of draft years to collect
Returns:
DataFrame with combine measurements
"""
combine_columns = [
'player_id', 'name', 'draft_year', 'position',
'height_no_shoes', 'height_with_shoes', 'weight',
'wingspan', 'standing_reach', 'body_fat_pct',
'hand_length', 'hand_width',
'standing_vertical', 'max_vertical',
'lane_agility', 'three_quarter_sprint',
'bench_press_reps'
]
logger.info(f"Collecting combine data for years {draft_years}")
# In production: Query combine database
df = pd.DataFrame(columns=combine_columns)
return df
def collect_draft_history(self) -> pd.DataFrame:
"""
Collect historical draft results including pick numbers and teams.
Returns:
DataFrame with draft history
"""
draft_columns = [
'player_id', 'name', 'draft_year', 'draft_round',
'draft_pick', 'draft_team', 'college', 'position'
]
logger.info(f"Collecting draft history {self.start_year}-{self.end_year}")
df = pd.DataFrame(columns=draft_columns)
return df
def collect_nba_careers(self, player_ids: List[str]) -> pd.DataFrame:
"""
Collect NBA career statistics for drafted players.
This provides the target variables for our prediction model.
We collect comprehensive career statistics to enable multiple
definitions of "success."
Args:
player_ids: List of player identifiers
Returns:
DataFrame with NBA career statistics
"""
career_columns = [
'player_id', 'name', 'seasons_played', 'games_played',
'games_started', 'total_minutes', 'career_ppg', 'career_rpg',
'career_apg', 'career_ws', 'career_vorp', 'career_bpm',
'peak_ws', 'peak_vorp', 'peak_bpm', 'all_star_selections',
'all_nba_selections', 'championships', 'career_earnings'
]
logger.info(f"Collecting NBA careers for {len(player_ids)} players")
df = pd.DataFrame(columns=career_columns)
return df
def build_integrated_dataset(self) -> pd.DataFrame:
"""
Build the complete integrated dataset for modeling.
This method orchestrates the collection from all sources and
performs the necessary joins to create a unified dataset.
Returns:
Complete DataFrame ready for feature engineering
"""
logger.info("Building integrated draft dataset")
# Collect draft history as the backbone
draft_df = self.collect_draft_history()
if draft_df.empty:
logger.warning("No draft data collected - returning sample data")
return self._generate_sample_data()
# Get player IDs
player_ids = draft_df['player_id'].unique().tolist()
# Collect from other sources
college_df = self.collect_college_stats(player_ids)
combine_df = self.collect_combine_data(
list(range(self.start_year, self.end_year + 1))
)
career_df = self.collect_nba_careers(player_ids)
        # Merge datasets, dropping repeated identity columns to avoid
        # _x/_y suffix collisions on the joins
        merged = draft_df.merge(
            college_df.drop(columns=['name']), on='player_id', how='left'
        )
        merged = merged.merge(
            combine_df.drop(columns=['name', 'draft_year', 'position']),
            on='player_id', how='left'
        )
        merged = merged.merge(
            career_df.drop(columns=['name']), on='player_id', how='left'
        )
logger.info(f"Integrated dataset: {len(merged)} players")
return merged
def _generate_sample_data(self) -> pd.DataFrame:
"""
Generate realistic sample data for demonstration and testing.
This creates a synthetic dataset with realistic distributions
and correlations for model development.
"""
np.random.seed(42)
n_players = 500
# Generate synthetic draft data
data = {
'player_id': [f'player_{i}' for i in range(n_players)],
'name': [f'Player {i}' for i in range(n_players)],
'draft_year': np.random.randint(2010, 2021, n_players),
'draft_pick': np.random.randint(1, 61, n_players),
            # College statistics (synthetic; the NBA outcomes generated below
            # are constructed to correlate with draft pick and college BPM)
'college_ppg': np.clip(25 - np.random.randn(n_players) * 5, 5, 35),
'college_rpg': np.clip(8 - np.random.randn(n_players) * 2, 2, 15),
'college_apg': np.clip(4 - np.random.randn(n_players) * 1.5, 0.5, 10),
'college_fg_pct': np.clip(0.45 + np.random.randn(n_players) * 0.05, 0.35, 0.65),
'college_3pt_pct': np.clip(0.35 + np.random.randn(n_players) * 0.06, 0.20, 0.50),
'college_bpm': np.random.randn(n_players) * 3 + 5,
'college_ws_per_40': np.random.randn(n_players) * 0.05 + 0.15,
# Combine measurements
'height_inches': np.random.normal(79, 3, n_players),
'wingspan_inches': np.random.normal(83, 4, n_players),
'weight_lbs': np.random.normal(215, 25, n_players),
'standing_vertical': np.random.normal(28, 3, n_players),
'max_vertical': np.random.normal(34, 4, n_players),
'lane_agility': np.random.normal(11.2, 0.5, n_players),
'three_quarter_sprint': np.random.normal(3.2, 0.15, n_players),
# Contextual features
'age_at_draft': np.random.normal(21, 1.5, n_players),
'years_in_college': np.random.choice([1, 2, 3, 4], n_players, p=[0.2, 0.3, 0.25, 0.25]),
'conference_strength': np.random.uniform(0.5, 1.0, n_players),
}
df = pd.DataFrame(data)
# Generate correlated NBA outcomes
# Higher draft picks and better college stats lead to better outcomes
base_talent = (60 - df['draft_pick']) / 60 + df['college_bpm'] / 10
noise = np.random.randn(n_players) * 0.3
df['career_ws'] = np.clip((base_talent + noise) * 50, 0, 200)
df['career_vorp'] = np.clip((base_talent + noise) * 30, -5, 100)
df['seasons_played'] = np.clip(
((base_talent + noise + 0.5) * 8).astype(int), 1, 18
)
df['all_star_selections'] = np.clip(
((base_talent + noise) * 3).astype(int), 0, 15
)
return df
def load_draft_data(
start_year: int = 2010,
end_year: int = 2020,
use_sample: bool = True
) -> pd.DataFrame:
"""
Main function to load draft data for modeling.
Args:
start_year: First draft year to include
end_year: Last draft year to include
use_sample: Whether to use sample data (True for demonstration)
Returns:
Integrated DataFrame ready for feature engineering
"""
collector = DraftDataCollector(start_year, end_year)
if use_sample:
return collector._generate_sample_data()
else:
return collector.build_integrated_dataset()
1.3 Data Quality Considerations
Before proceeding to feature engineering, we must address common data quality issues:
Missing Data Patterns:
- Combine data is missing for approximately 30% of draft picks (players who opt out or are not invited)
- International players often lack complete college statistics
- Early career exits create censored outcome data
Handling Missing Values:
def handle_missing_data(df: pd.DataFrame) -> pd.DataFrame:
"""
Apply appropriate missing data strategies for each variable type.
"""
# Physical measurements: impute with position-specific medians
physical_cols = ['height_inches', 'wingspan_inches', 'weight_lbs']
for col in physical_cols:
if col in df.columns:
df[col] = df.groupby('position')[col].transform(
lambda x: x.fillna(x.median())
)
# Athletic testing: create missing indicator + impute median
athletic_cols = ['standing_vertical', 'max_vertical', 'lane_agility']
for col in athletic_cols:
if col in df.columns:
df[f'{col}_missing'] = df[col].isna().astype(int)
df[col] = df[col].fillna(df[col].median())
    # College stats: impute remaining gaps with overall medians
stat_cols = ['college_ppg', 'college_rpg', 'college_apg']
for col in stat_cols:
if col in df.columns:
df[col] = df[col].fillna(df[col].median())
return df
Part 2: Feature Engineering for Draft Prediction
Feature engineering is where domain expertise most directly impacts model performance. We transform raw measurements and statistics into features that capture the underlying factors that predict NBA success.
2.1 Feature Categories
We organize features into five categories:
- Production Features: What did the player accomplish in college?
- Efficiency Features: How efficiently did they produce?
- Physical Features: What are their measurable physical tools?
- Context Features: What context helps interpret their production?
- Trajectory Features: How are they improving over time?
2.2 Core Feature Engineering Implementation
# See code/feature_engineering.py for complete implementation
from feature_engineering import DraftFeatureEngineer
# Example usage (raw_data is the integrated dataset from Part 1)
engineer = DraftFeatureEngineer()
features_df = engineer.create_all_features(raw_data)
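Since the full DraftFeatureEngineer lives in the companion code directory, the following is a minimal, self-contained sketch of its interface so the example above is concrete. The three derived columns are illustrative choices built from the synthetic dataset's columns (wingspan_inches, height_inches, college_ppg, age_at_draft, college_bpm, years_in_college), not the complete 52-feature implementation:
from typing import List
import pandas as pd
class DraftFeatureEngineer:
    """Minimal sketch: derives model-ready features from the integrated dataset."""
    def __init__(self):
        self.feature_names: List[str] = []
    def create_all_features(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        # Physical: wingspan relative to height (see Section 2.3)
        df['wingspan_minus_height'] = df['wingspan_inches'] - df['height_inches']
        # Context: age-adjusted scoring, ~2 PPG per year below the reference age
        df['age_adj_ppg'] = df['college_ppg'] + (21 - df['age_at_draft']) * 2.0
        # Trajectory: impact per college season as a crude improvement proxy
        df['bpm_per_season'] = df['college_bpm'] / df['years_in_college']
        self.feature_names = [
            'wingspan_minus_height', 'age_adj_ppg', 'bpm_per_season'
        ]
        return df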
2.3 Key Feature Concepts
Physical Profile Index: Combining height, wingspan, and athleticism into a single index allows the model to capture the overall physical package:
def calculate_physical_index(row):
"""
Create composite physical profile score.
Weights reflect the relative importance of each attribute
for overall physical projection.
"""
height_z = (row['height_inches'] - 79) / 3
wingspan_z = (row['wingspan_inches'] - 83) / 4
vertical_z = (row['max_vertical'] - 34) / 4
# Wingspan relative to height is particularly predictive
wingspan_diff = row['wingspan_inches'] - row['height_inches']
wingspan_diff_z = (wingspan_diff - 4) / 2
return 0.25 * height_z + 0.30 * wingspan_diff_z + 0.25 * vertical_z + 0.20 * wingspan_z
Age-Adjusted Production: A 19-year-old averaging 15 PPG is more impressive than a 23-year-old with the same production. We adjust for this:
def age_adjusted_production(ppg, age, reference_age=21):
"""
Adjust scoring production for age.
Younger players get a boost, older players get penalized.
The adjustment factor is calibrated to historical data showing
approximately 2 PPG improvement per year of experience at the
college level.
"""
age_adjustment = (reference_age - age) * 2.0
return ppg + age_adjustment
Conference Strength Adjustment: Production against weaker competition is less predictive:
def conference_adjusted_stats(stats, conference_strength):
"""
Adjust statistics based on strength of competition.
Conference strength is measured on a 0-1 scale where 1.0
represents the strongest conferences.
"""
adjustment_factor = 0.5 + (conference_strength * 0.5)
return stats * adjustment_factor
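As a quick end-to-end check, the three helpers above can be applied directly to the synthetic dataset from Part 1. This is a sketch: it assumes the functions above are defined in scope and uses the sample data's column names:
from data_collection import load_draft_data
df = load_draft_data(use_sample=True)
# Composite physical score, computed row by row
df['physical_index'] = df.apply(calculate_physical_index, axis=1)
# Vectorized age and conference adjustments to scoring
df['age_adj_ppg'] = age_adjusted_production(df['college_ppg'], df['age_at_draft'])
df['conf_adj_ppg'] = conference_adjusted_stats(df['college_ppg'], df['conference_strength'])
print(df[['physical_index', 'age_adj_ppg', 'conf_adj_ppg']].describe())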
Part 3: Target Variable Definition
The choice of target variable fundamentally shapes what our model learns to predict. Different target definitions capture different aspects of player value.
3.1 Career Value Metrics
| Metric | Definition | Pros | Cons |
|---|---|---|---|
| Win Shares | Cumulative wins attributed | Captures total value | Favors longevity over peak |
| VORP | Value over replacement | Position-adjusted | Complex to interpret |
| All-Star Selections | Counting stat | Clear milestone | Subject to popularity bias |
| Career Earnings | Total compensation | Market valuation | Affected by cap era |
| Composite Score | Weighted combination | Multi-dimensional | Requires weight selection |
3.2 Implementing Target Variables
def create_target_variables(df: pd.DataFrame) -> pd.DataFrame:
"""
Create multiple target variable definitions for model training.
We create several targets to enable different analyses:
- Regression targets for predicting career value
- Classification targets for identifying tiers
- Ordinal targets for ranking predictions
"""
# Continuous regression targets
df['target_career_ws'] = df['career_ws']
df['target_career_vorp'] = df['career_vorp']
# Log-transformed targets (reduces impact of outliers)
df['target_log_ws'] = np.log1p(df['career_ws'])
df['target_log_vorp'] = np.log1p(np.clip(df['career_vorp'], 0, None))
# Composite target: weighted combination
df['target_composite'] = (
0.4 * df['career_ws'].rank(pct=True) +
0.3 * df['career_vorp'].rank(pct=True) +
0.2 * df['all_star_selections'].rank(pct=True) +
0.1 * df['seasons_played'].rank(pct=True)
)
# Classification targets
ws_thresholds = df['career_ws'].quantile([0.25, 0.50, 0.75, 0.90])
df['target_tier'] = pd.cut(
df['career_ws'],
bins=[-np.inf, ws_thresholds[0.25], ws_thresholds[0.50],
ws_thresholds[0.75], ws_thresholds[0.90], np.inf],
labels=['Bust', 'Bench', 'Rotation', 'Starter', 'Star']
)
# Binary classification: "hit" vs "miss"
df['target_hit'] = (df['career_ws'] > ws_thresholds[0.50]).astype(int)
# Star identification
df['target_star'] = (df['career_ws'] > ws_thresholds[0.90]).astype(int)
return df
3.3 Handling Career Outcome Uncertainty
Players from recent drafts have incomplete career data. We address this through:
- Minimum seasons threshold: Only include players with 4+ seasons for primary analysis
- Projection to career totals: Use per-season rates to project incomplete careers
- Separate models by experience: Train different models for different projection horizons
def project_career_outcomes(df: pd.DataFrame, min_seasons: int = 4) -> pd.DataFrame:
"""
Project career outcomes for players with incomplete data.
"""
# Mark players with sufficient data
df['sufficient_data'] = df['seasons_played'] >= min_seasons
# For insufficient data, project based on per-season rates
df['ws_per_season'] = df['career_ws'] / df['seasons_played']
df['projected_career_ws'] = np.where(
df['sufficient_data'],
df['career_ws'],
df['ws_per_season'] * 10 # Project to 10-year career
)
return df
Part 4: Model Selection and Training
4.1 Model Selection Rationale
For draft prediction, we evaluate several model families:
| Model | Strengths | Weaknesses | Use Case |
|---|---|---|---|
| Random Forest | Handles non-linearity, robust | Less interpretable | Primary prediction |
| Gradient Boosting | Highest accuracy potential | Prone to overfitting | Ensemble component |
| Linear Regression | Highly interpretable | Misses interactions | Baseline, insights |
| Neural Network | Captures complex patterns | Requires more data | Large datasets |
4.2 Model Training Pipeline
# Complete pipeline in code/draft_model.py
from typing import Dict, Tuple
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import xgboost as xgb
class DraftModelPipeline:
"""
Complete pipeline for training and evaluating draft prediction models.
"""
def __init__(self, target_col: str = 'target_career_ws'):
self.target_col = target_col
self.models = {}
self.scaler = StandardScaler()
self.feature_cols = None
def prepare_features(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
"""
Prepare feature matrix and target vector.
"""
# Exclude target and identifier columns
exclude_cols = ['player_id', 'name', 'draft_year'] + \
[c for c in df.columns if c.startswith('target_')]
self.feature_cols = [c for c in df.columns if c not in exclude_cols
and df[c].dtype in ['int64', 'float64']]
X = df[self.feature_cols].values
y = df[self.target_col].values
return X, y
def train_models(
self,
X: np.ndarray,
y: np.ndarray,
draft_years: np.ndarray
) -> Dict:
"""
Train multiple models with time-series cross-validation.
We use draft year as the time index to ensure we never
train on future data when backtesting.
"""
results = {}
# Define models
models = {
'random_forest': RandomForestRegressor(
n_estimators=200,
max_depth=10,
min_samples_leaf=5,
random_state=42
),
'gradient_boosting': GradientBoostingRegressor(
n_estimators=200,
max_depth=5,
learning_rate=0.05,
random_state=42
),
'xgboost': xgb.XGBRegressor(
n_estimators=200,
max_depth=6,
learning_rate=0.05,
random_state=42
)
}
        # Sort by draft year so TimeSeriesSplit folds respect chronological order
        order = np.argsort(draft_years, kind='stable')
        X, y = X[order], y[order]
        # Scale features once up front; for strictly leak-free CV, fit the
        # scaler inside each training fold (e.g., via a sklearn Pipeline)
        X_scaled = self.scaler.fit_transform(X)
        # Time-series cross-validation
        tscv = TimeSeriesSplit(n_splits=5)
        for name, model in models.items():
# Cross-validation scores
cv_scores = cross_val_score(
model, X_scaled, y,
cv=tscv,
scoring='neg_mean_squared_error'
)
# Train final model on all data
model.fit(X_scaled, y)
self.models[name] = model
results[name] = {
'cv_rmse': np.sqrt(-cv_scores.mean()),
'cv_std': np.sqrt(-cv_scores).std(),
'model': model
}
print(f"{name}: RMSE = {results[name]['cv_rmse']:.3f} "
f"(+/- {results[name]['cv_std']:.3f})")
return results
4.3 Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
def tune_hyperparameters(X, y, n_iter=50):
"""
Perform hyperparameter tuning for the best model.
"""
param_distributions = {
'n_estimators': [100, 200, 300, 500],
'max_depth': [5, 8, 10, 15, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 5, 10],
'max_features': ['sqrt', 'log2', None]
}
rf = RandomForestRegressor(random_state=42)
search = RandomizedSearchCV(
rf, param_distributions,
n_iter=n_iter,
cv=5,
scoring='neg_mean_squared_error',
random_state=42,
n_jobs=-1
)
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best RMSE: {np.sqrt(-search.best_score_):.3f}")
return search.best_estimator_
Part 5: Backtesting Methodology
Backtesting is critical for validating that our model would have performed well on past drafts. We must be careful to avoid data leakage that would artificially inflate performance.
5.1 Walk-Forward Validation
def walk_forward_backtest(
df: pd.DataFrame,
feature_cols: List[str],
target_col: str,
train_years: int = 5,
test_years: int = 1
) -> pd.DataFrame:
"""
Perform walk-forward backtesting on historical drafts.
For each test year, we train on only the previous train_years
of data, simulating what we would have known at the time.
Args:
df: Complete dataset with draft_year column
feature_cols: Features to use for prediction
target_col: Target variable
train_years: Number of years to include in training
test_years: Number of years to predict
Returns:
DataFrame with predictions for each test year
"""
results = []
years = sorted(df['draft_year'].unique())
    for test_year in years:
# Determine training years (only past data)
train_start = test_year - train_years
train_end = test_year - 1
# Skip if not enough training data
if train_start < years[0]:
continue
# Split data
train_mask = (df['draft_year'] >= train_start) & \
(df['draft_year'] <= train_end)
test_mask = df['draft_year'] == test_year
X_train = df.loc[train_mask, feature_cols].values
y_train = df.loc[train_mask, target_col].values
X_test = df.loc[test_mask, feature_cols].values
# Train model
model = RandomForestRegressor(
n_estimators=200,
max_depth=10,
random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model.fit(X_train_scaled, y_train)
# Make predictions
predictions = model.predict(X_test_scaled)
# Store results
test_df = df.loc[test_mask].copy()
test_df['predicted_value'] = predictions
test_df['prediction_rank'] = test_df['predicted_value'].rank(
ascending=False
).astype(int)
results.append(test_df)
return pd.concat(results, ignore_index=True)
5.2 Avoiding Data Leakage
Common sources of data leakage in draft modeling:
- Future career outcomes in features: Never use NBA stats as features
- Anachronistic knowledge: Ensure all training data was available at prediction time
- Selection bias: Include undrafted players in training when possible
- Target encoding leakage: Calculate means only on training data (see the sketch after the validation check below)
def validate_no_leakage(df: pd.DataFrame, test_year: int) -> bool:
"""
Verify that the dataset has no data leakage for the test year.
"""
    # Check that no feature columns encode NBA (i.e., future) outcomes
feature_cols = [c for c in df.columns if not c.startswith('target_')]
for col in feature_cols:
# All feature data should be from before the test year
if 'nba_' in col.lower():
raise ValueError(f"NBA stats found in features: {col}")
return True
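The target-encoding item deserves a concrete illustration. Below is a sketch of leak-free target encoding: the per-school mean is fit on training rows only and merely mapped onto the test rows. It assumes a 'school' column (part of the college schema in Part 1, though absent from the synthetic sample), and the helper name encode_school_value is ours, not part of the chapter's modules:
from typing import Tuple
import pandas as pd
def encode_school_value(
    train: pd.DataFrame,
    test: pd.DataFrame,
    target_col: str = 'career_ws'
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # Encoding statistics come exclusively from the training rows
    school_means = train.groupby('school')[target_col].mean()
    global_mean = train[target_col].mean()
    train = train.assign(school_value=train['school'].map(school_means))
    # Schools unseen in training fall back to the global training mean
    test = test.assign(
        school_value=test['school'].map(school_means).fillna(global_mean)
    )
    return train, test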
Part 6: Evaluation Metrics
6.1 Regression Metrics
Standard regression metrics evaluate prediction accuracy:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def evaluate_regression(y_true, y_pred):
"""
Calculate comprehensive regression metrics.
"""
return {
'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
'mae': mean_absolute_error(y_true, y_pred),
'r2': r2_score(y_true, y_pred),
        'mape': np.mean(np.abs((y_true - y_pred) / (y_true + 1))) * 100  # +1 avoids division by zero
}
6.2 Draft-Specific Metrics
Standard ML metrics do not fully capture draft model performance. We need domain-specific metrics:
def draft_specific_metrics(df: pd.DataFrame) -> Dict:
"""
Calculate draft-specific evaluation metrics.
These metrics capture whether the model would have improved
actual draft outcomes.
"""
results = {}
# 1. Star Detection Rate
# How often did we rank future stars in the top 10 of our board?
stars = df[df['target_star'] == 1]
star_detection = (stars['prediction_rank'] <= 10).mean()
results['star_detection_rate_top10'] = star_detection
# 2. Bust Avoidance Rate
# How often did we avoid ranking busts in our top 10?
busts = df[df['target_tier'] == 'Bust']
bust_avoidance = (busts['prediction_rank'] > 10).mean()
results['bust_avoidance_rate'] = bust_avoidance
# 3. Rank Correlation
# How well does our ranking correlate with actual outcomes?
rank_correlation = df['prediction_rank'].corr(
df['target_career_ws'].rank(ascending=False),
method='spearman'
)
results['rank_correlation'] = rank_correlation
# 4. Value Over Draft Position
# Did our top picks outperform players taken at the same positions?
df['model_value_added'] = df['target_career_ws'] - \
df.groupby('draft_pick')['target_career_ws'].transform('mean')
top_10_picks = df[df['prediction_rank'] <= 10]
results['avg_value_added_top10'] = top_10_picks['model_value_added'].mean()
# 5. Hit Rate by Tier
for tier in ['Star', 'Starter', 'Rotation']:
tier_players = df[df['target_tier'] == tier]
hit_rate = (tier_players['prediction_rank'] <= 15).mean()
results[f'hit_rate_{tier.lower()}'] = hit_rate
return results
6.3 Confidence Calibration
Models should know what they do not know:
def evaluate_calibration(y_true, y_pred, y_std):
"""
Evaluate whether prediction uncertainties are well-calibrated.
A well-calibrated model should have 68% of observations within
1 standard deviation and 95% within 2 standard deviations.
"""
within_1std = np.mean(np.abs(y_true - y_pred) <= y_std)
within_2std = np.mean(np.abs(y_true - y_pred) <= 2 * y_std)
return {
'within_1std': within_1std, # Should be ~0.68
'within_2std': within_2std, # Should be ~0.95
'calibration_error_1std': abs(within_1std - 0.68),
'calibration_error_2std': abs(within_2std - 0.95)
}
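To produce the y_std this function expects, one option with a random forest is the spread of per-tree predictions, the same trick the pipeline in Part 10 uses for the draft board. A sketch, assuming a fitted RandomForestRegressor named model and a held-out X_test, y_test:
import numpy as np
# Per-tree predictions give a cheap ensemble-spread uncertainty estimate
tree_preds = np.array([tree.predict(X_test) for tree in model.estimators_])
y_pred = tree_preds.mean(axis=0)
y_std = tree_preds.std(axis=0)
print(evaluate_calibration(y_test, y_pred, y_std))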
Part 7: Creating a Draft Board
The draft board translates model predictions into an actionable ranking with uncertainty quantification.
7.1 Board Generation
def create_draft_board(
predictions_df: pd.DataFrame,
n_players: int = 60
) -> pd.DataFrame:
"""
Create a complete draft board from model predictions.
Args:
predictions_df: DataFrame with predictions and player info
n_players: Number of players to include
Returns:
Formatted draft board DataFrame
"""
# Sort by predicted value
board = predictions_df.nlargest(n_players, 'predicted_value').copy()
# Add board rank
board['board_rank'] = range(1, len(board) + 1)
# Calculate percentile ranks
board['percentile'] = (
board['predicted_value'].rank(pct=True) * 100
).round(1)
# Add value tier labels
board['tier'] = pd.cut(
board['percentile'],
bins=[0, 25, 50, 75, 90, 100],
labels=['5th Tier', '4th Tier', '3rd Tier', '2nd Tier', '1st Tier']
)
# Calculate relative value to next pick
board['value_gap'] = board['predicted_value'].diff(-1)
    # Small gaps to the next player mark good trade-down spots
    # (moving back costs little expected value)
board['trade_down_opportunity'] = board['value_gap'] < \
board['value_gap'].quantile(0.25)
# Format for presentation
display_cols = [
'board_rank', 'name', 'position', 'college',
'predicted_value', 'prediction_std', 'percentile', 'tier',
'primary_comp', 'secondary_comp'
]
# Only include columns that exist
display_cols = [c for c in display_cols if c in board.columns]
return board[display_cols]
7.2 Player Comparisons
Historical comparisons help stakeholders contextualize predictions:
def find_player_comps(
prospect: pd.Series,
historical_df: pd.DataFrame,
feature_cols: List[str],
n_comps: int = 5
) -> List[Dict]:
"""
Find historical players with similar pre-draft profiles.
Uses Euclidean distance in feature space to identify
the most similar historical prospects.
"""
# Standardize features
scaler = StandardScaler()
historical_scaled = scaler.fit_transform(
historical_df[feature_cols].fillna(0)
)
prospect_scaled = scaler.transform(
prospect[feature_cols].fillna(0).values.reshape(1, -1)
)
# Calculate distances
distances = np.linalg.norm(
historical_scaled - prospect_scaled,
axis=1
)
# Get closest matches
closest_idx = np.argsort(distances)[:n_comps]
comps = []
for idx in closest_idx:
player = historical_df.iloc[idx]
comps.append({
'name': player['name'],
'draft_year': player['draft_year'],
'draft_pick': player['draft_pick'],
'career_ws': player['career_ws'],
'similarity': 1 / (1 + distances[idx])
})
return comps
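A usage sketch against the synthetic dataset from Part 1 (feature names follow that sample; df is the loaded DataFrame):
feature_cols = ['college_ppg', 'college_bpm', 'height_inches',
                'wingspan_inches', 'max_vertical', 'age_at_draft']
prospect = df.iloc[0]      # the player to compare
historical = df.iloc[1:]   # everyone else as the comp pool
for comp in find_player_comps(prospect, historical, feature_cols, n_comps=3):
    print(f"{comp['name']} ({comp['draft_year']}, pick {comp['draft_pick']}): "
          f"similarity {comp['similarity']:.2f}")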
Part 8: Visualization of Results
Effective visualization is critical for communicating findings to decision-makers.
8.1 Core Visualizations
# Complete visualization code in code/visualization.py
import matplotlib.pyplot as plt
import seaborn as sns
def plot_draft_board(board_df: pd.DataFrame, figsize=(14, 10)):
"""
Create a visual draft board showing rankings and uncertainty.
"""
fig, axes = plt.subplots(2, 2, figsize=figsize)
# 1. Predicted value with confidence intervals
ax1 = axes[0, 0]
top_20 = board_df.head(20)
ax1.barh(range(len(top_20)), top_20['predicted_value'],
color='steelblue', alpha=0.7)
if 'prediction_std' in top_20.columns:
ax1.errorbar(
top_20['predicted_value'],
range(len(top_20)),
xerr=top_20['prediction_std'] * 1.96,
fmt='none',
color='black',
capsize=3
)
ax1.set_yticks(range(len(top_20)))
ax1.set_yticklabels(top_20['name'])
ax1.invert_yaxis()
ax1.set_xlabel('Predicted Career Value')
ax1.set_title('Top 20 Draft Board')
# 2. Tier distribution
ax2 = axes[0, 1]
tier_counts = board_df['tier'].value_counts()
colors = sns.color_palette("viridis", len(tier_counts))
ax2.pie(tier_counts, labels=tier_counts.index, colors=colors,
autopct='%1.0f%%')
ax2.set_title('Prospect Tier Distribution')
# 3. Value gaps (trade-down opportunities)
ax3 = axes[1, 0]
ax3.bar(range(1, len(board_df) + 1), board_df['value_gap'].fillna(0),
color='coral')
ax3.axhline(y=board_df['value_gap'].median(), color='red',
linestyle='--', label='Median Gap')
ax3.set_xlabel('Board Position')
ax3.set_ylabel('Value Gap to Next Pick')
    ax3.set_title('Value Gaps (Small Gaps = Trade-Down Spots)')
ax3.legend()
# 4. Prediction vs draft position (for backtests)
ax4 = axes[1, 1]
if 'draft_pick' in board_df.columns:
ax4.scatter(board_df['draft_pick'], board_df['predicted_value'],
alpha=0.6, c='teal')
ax4.set_xlabel('Actual Draft Position')
ax4.set_ylabel('Model Predicted Value')
ax4.set_title('Model vs. Consensus')
plt.tight_layout()
return fig
def plot_feature_importance(model, feature_names: List[str], top_n: int = 20):
"""
Visualize feature importances from tree-based models.
"""
importances = model.feature_importances_
indices = np.argsort(importances)[-top_n:]
fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(range(len(indices)), importances[indices], color='steelblue')
ax.set_yticks(range(len(indices)))
ax.set_yticklabels([feature_names[i] for i in indices])
ax.set_xlabel('Feature Importance')
ax.set_title('Top Features for Draft Prediction')
plt.tight_layout()
return fig
def plot_backtest_results(backtest_df: pd.DataFrame):
"""
Visualize backtest performance across years.
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 1. Correlation by year
ax1 = axes[0, 0]
yearly_corr = backtest_df.groupby('draft_year').apply(
lambda x: x['predicted_value'].corr(x['target_career_ws'])
)
ax1.bar(yearly_corr.index, yearly_corr.values, color='steelblue')
ax1.set_xlabel('Draft Year')
ax1.set_ylabel('Correlation')
ax1.set_title('Prediction Accuracy by Draft Year')
ax1.axhline(y=yearly_corr.mean(), color='red', linestyle='--')
# 2. Star detection by year
ax2 = axes[0, 1]
yearly_star = backtest_df.groupby('draft_year').apply(
lambda x: (x[x['target_star'] == 1]['prediction_rank'] <= 10).mean()
)
ax2.bar(yearly_star.index, yearly_star.values, color='coral')
ax2.set_xlabel('Draft Year')
ax2.set_ylabel('Detection Rate')
ax2.set_title('Star Detection Rate (Top 10)')
# 3. Predicted vs actual scatter
ax3 = axes[1, 0]
ax3.scatter(backtest_df['predicted_value'],
backtest_df['target_career_ws'],
alpha=0.3, c='teal')
ax3.plot([0, backtest_df['predicted_value'].max()],
[0, backtest_df['predicted_value'].max()],
'r--', label='Perfect Prediction')
ax3.set_xlabel('Predicted Value')
ax3.set_ylabel('Actual Career Win Shares')
ax3.set_title('Predicted vs. Actual Outcomes')
ax3.legend()
# 4. Value added by draft position
ax4 = axes[1, 1]
position_bins = pd.cut(backtest_df['draft_pick'],
bins=[0, 10, 20, 30, 60])
value_by_position = backtest_df.groupby(position_bins)['model_value_added'].mean()
ax4.bar(range(len(value_by_position)), value_by_position.values,
color='purple', alpha=0.7)
ax4.set_xticks(range(len(value_by_position)))
ax4.set_xticklabels(['1-10', '11-20', '21-30', '31-60'])
ax4.set_xlabel('Draft Position Range')
ax4.set_ylabel('Average Value Added')
ax4.set_title('Model Value Added by Draft Position')
plt.tight_layout()
return fig
8.2 Interactive Dashboard Concept
For production use, consider building an interactive dashboard:
# Example using Streamlit (conceptual)
"""
import streamlit as st
def draft_dashboard():
st.title("Draft Model Dashboard")
# Sidebar controls
draft_year = st.sidebar.selectbox("Draft Year", range(2024, 2015, -1))
model_type = st.sidebar.selectbox("Model", ["Random Forest", "XGBoost"])
# Load predictions
predictions = load_predictions(draft_year, model_type)
# Main content
col1, col2 = st.columns(2)
with col1:
st.subheader("Draft Board")
st.dataframe(predictions.head(30))
with col2:
st.subheader("Prediction Distribution")
fig = plot_prediction_distribution(predictions)
st.pyplot(fig)
# Player deep dive
selected_player = st.selectbox("Select Player", predictions['name'])
show_player_profile(predictions, selected_player)
"""
Part 9: Presentation to Stakeholders
9.1 Executive Summary Template
When presenting to decision-makers, structure your findings clearly:
DRAFT MODEL EXECUTIVE SUMMARY
Model Performance:
- Backtest correlation: 0.65 (strong positive relationship)
- Star detection rate: 70% (7 of 10 stars ranked in top 10)
- Bust avoidance: 85% (avoided ranking busts highly)
Key Findings:
1. Age-adjusted production is the strongest predictor
2. Wingspan-to-height ratio adds significant signal
3. Model identifies 3 potential value picks outside top 10
Recommendations:
- Player A: Best value in 5-10 range (model rank: 3)
- Player B: Significant bust risk despite consensus top-5 ranking
- Player C: Undervalued by consensus, strong physical profile
Model Limitations:
- International players have higher uncertainty
- Injury risk not captured in current features
- Recent draft classes have incomplete outcome data
9.2 Technical Documentation
For technical audiences, provide comprehensive documentation:
## Model Technical Specification
### Features (52 total)
- College production: 15 features
- Athletic measurements: 12 features
- Physical profile: 8 features
- Context adjustments: 10 features
- Trajectory indicators: 7 features
### Model Architecture
- Ensemble of Random Forest and XGBoost
- 5-fold time-series cross-validation
- Hyperparameters tuned via Bayesian optimization
### Performance Metrics
| Metric | Training | Validation | Test |
|--------|----------|------------|------|
| RMSE | 12.3 | 15.8 | 16.2 |
| R^2 | 0.78 | 0.62 | 0.58 |
| Rank Corr | 0.85 | 0.68 | 0.65 |
### Known Limitations
1. Sample size for star outcomes is small (< 50)
2. International prospects underrepresented
3. Position-specific models not yet implemented
Part 10: Complete Python ML Pipeline
The following brings together all components into a cohesive pipeline:
"""
main_pipeline.py - Complete draft model pipeline
This script orchestrates the entire draft modeling workflow from
data loading through final board generation.
"""
import pandas as pd
import numpy as np
from pathlib import Path
import logging
import pickle
from datetime import datetime
# Local imports (from code/ directory)
from draft_model import DraftModelPipeline
from feature_engineering import DraftFeatureEngineer
from evaluation import DraftModelEvaluator
from visualization import DraftVisualizer
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def run_draft_pipeline(
data_path: str = None,
output_dir: str = "output",
target_year: int = None,
save_artifacts: bool = True
) -> dict:
"""
Execute the complete draft modeling pipeline.
Args:
data_path: Path to input data (uses sample if None)
output_dir: Directory for saving outputs
target_year: Year to generate predictions for
save_artifacts: Whether to save model and results
Returns:
Dictionary containing model, predictions, and metrics
"""
logger.info("=" * 60)
logger.info("DRAFT MODEL PIPELINE")
logger.info("=" * 60)
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
# Step 1: Data Collection
logger.info("\n[1/6] Loading data...")
if data_path:
data = pd.read_csv(data_path)
else:
from data_collection import load_draft_data
data = load_draft_data(use_sample=True)
logger.info(f"Loaded {len(data)} players from {data['draft_year'].min()}"
f" to {data['draft_year'].max()}")
# Step 2: Feature Engineering
logger.info("\n[2/6] Engineering features...")
engineer = DraftFeatureEngineer()
features_df = engineer.create_all_features(data)
logger.info(f"Created {len(engineer.feature_names)} features")
# Step 3: Target Variable Creation
logger.info("\n[3/6] Creating target variables...")
from data_collection import create_target_variables
features_df = create_target_variables(features_df)
# Step 4: Model Training
logger.info("\n[4/6] Training models...")
pipeline = DraftModelPipeline(target_col='target_career_ws')
# Split for training (historical) and prediction (target year)
if target_year:
train_df = features_df[features_df['draft_year'] < target_year]
pred_df = features_df[features_df['draft_year'] == target_year]
else:
# Use all but most recent year for training
max_year = features_df['draft_year'].max()
train_df = features_df[features_df['draft_year'] < max_year]
pred_df = features_df[features_df['draft_year'] == max_year]
target_year = max_year
X_train, y_train = pipeline.prepare_features(train_df)
model_results = pipeline.train_models(
X_train, y_train,
train_df['draft_year'].values
)
# Step 5: Backtesting
logger.info("\n[5/6] Running backtests...")
evaluator = DraftModelEvaluator()
backtest_results = evaluator.walk_forward_backtest(
features_df,
pipeline.feature_cols,
'target_career_ws'
)
metrics = evaluator.calculate_metrics(backtest_results)
logger.info("\nBacktest Results:")
for metric, value in metrics.items():
logger.info(f" {metric}: {value:.3f}")
# Step 6: Generate Draft Board
logger.info("\n[6/6] Generating draft board...")
# Predict for target year
X_pred = pred_df[pipeline.feature_cols].values
X_pred_scaled = pipeline.scaler.transform(X_pred)
# Use best model
best_model_name = min(model_results, key=lambda x: model_results[x]['cv_rmse'])
best_model = model_results[best_model_name]['model']
predictions = best_model.predict(X_pred_scaled)
# Create board
pred_df = pred_df.copy()
pred_df['predicted_value'] = predictions
# Estimate uncertainty (for random forest, use tree variance)
if hasattr(best_model, 'estimators_'):
tree_preds = np.array([
tree.predict(X_pred_scaled) for tree in best_model.estimators_
])
pred_df['prediction_std'] = tree_preds.std(axis=0)
from draft_model import create_draft_board
draft_board = create_draft_board(pred_df)
logger.info(f"\nDraft Board for {target_year}:")
logger.info(draft_board.head(15).to_string())
# Step 7: Visualization
logger.info("\n[7/7] Creating visualizations...")
visualizer = DraftVisualizer()
fig_board = visualizer.plot_draft_board(draft_board)
fig_backtest = visualizer.plot_backtest_results(backtest_results)
fig_importance = visualizer.plot_feature_importance(
best_model, pipeline.feature_cols
)
# Save artifacts
if save_artifacts:
logger.info("\nSaving artifacts...")
# Save model
with open(output_path / 'draft_model.pkl', 'wb') as f:
pickle.dump({
'model': best_model,
'scaler': pipeline.scaler,
'feature_cols': pipeline.feature_cols,
'train_date': datetime.now().isoformat()
}, f)
# Save draft board
draft_board.to_csv(output_path / 'draft_board.csv', index=False)
# Save backtest results
backtest_results.to_csv(output_path / 'backtest_results.csv', index=False)
# Save figures
fig_board.savefig(output_path / 'draft_board.png', dpi=150)
fig_backtest.savefig(output_path / 'backtest_results.png', dpi=150)
fig_importance.savefig(output_path / 'feature_importance.png', dpi=150)
logger.info(f"Artifacts saved to {output_path}")
logger.info("\n" + "=" * 60)
logger.info("PIPELINE COMPLETE")
logger.info("=" * 60)
return {
'model': best_model,
'draft_board': draft_board,
'backtest_results': backtest_results,
'metrics': metrics,
'feature_importance': dict(zip(
pipeline.feature_cols,
best_model.feature_importances_
))
}
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Run Draft Model Pipeline")
parser.add_argument('--data', type=str, help='Path to data file')
parser.add_argument('--year', type=int, help='Target draft year')
parser.add_argument('--output', type=str, default='output',
help='Output directory')
args = parser.parse_args()
results = run_draft_pipeline(
data_path=args.data,
target_year=args.year,
output_dir=args.output
)
print("\nTop 10 Draft Board:")
print(results['draft_board'].head(10))
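With the script saved as main_pipeline.py alongside the code/ modules, a typical invocation using the flags defined above is python main_pipeline.py --year 2020 --output results, which trains on drafts before 2020, runs the walk-forward backtest, and writes the board, model, and figures to the results/ directory.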
Extension Exercises
Exercise 1: Position-Specific Models
Build separate models for guards, wings, and bigs to capture position-specific success factors.
Exercise 2: International Player Integration
Develop methods to incorporate international league statistics with appropriate adjustments.
Exercise 3: Injury Risk Modeling
Add a parallel model that predicts injury risk based on combine data and college workload.
Exercise 4: Trade Value Calculator
Extend the model to output draft pick trade values based on expected value at each position.
Exercise 5: Monte Carlo Simulation
Implement simulation of draft scenarios to identify optimal team strategies.
Summary
In this capstone project, you have built a complete NBA draft prediction system that:
- Collects and integrates data from multiple sources
- Engineers meaningful features that capture player potential
- Defines appropriate target variables for career success
- Trains and validates models using proper backtesting
- Evaluates performance with domain-specific metrics
- Generates actionable draft boards with uncertainty quantification
- Visualizes results for stakeholder communication
This project demonstrates the full lifecycle of a machine learning application in sports analytics, from raw data to business decision support. The techniques and frameworks you have developed here transfer directly to professional front office work and establish a foundation for more advanced modeling approaches.
References and Further Reading
- Pelton, K. "Draft Analytics: Quantifying the NBA Draft" (MIT Sloan Sports Analytics Conference)
- Silver, N. "A Better Way to Evaluate NBA Draft Picks" (FiveThirtyEight)
- Kubatko, J. et al. "A Starting Point for Analyzing Basketball Statistics"
- Myers, D. "Position-less Basketball and Draft Strategy"
- NBA Combine Official Measurements Database
- Sports Reference College Basketball Statistics