Capstone Project 2: Create a Draft Model

Project Overview

The NBA Draft represents one of the highest-leverage decisions a basketball organization makes each year. A single draft pick can define a franchise for a decade or more, and the difference between selecting a perennial All-Star and a career backup can be worth hundreds of millions of dollars in both direct compensation and downstream revenue effects. This capstone project guides you through building a comprehensive draft prediction model that integrates college statistics, athletic measurements, contextual factors, and historical outcomes to generate actionable intelligence for draft decision-making.

Learning Objectives

By completing this project, you will be able to:

  1. Collect and integrate heterogeneous data sources including college box score statistics, advanced metrics, combine measurements, and historical draft outcomes
  2. Engineer predictive features that capture player potential beyond raw statistics
  3. Define and operationalize target variables that meaningfully represent career value
  4. Select and train appropriate machine learning models for player projection
  5. Implement rigorous backtesting methodology that simulates real-world prediction scenarios
  6. Evaluate model performance using domain-appropriate metrics
  7. Create actionable draft boards that communicate predictions to stakeholders
  8. Visualize uncertainty and comparisons in ways that support decision-making
  9. Present findings professionally to both technical and non-technical audiences

Professional Context

NBA front offices increasingly rely on quantitative draft analysis to inform their selections. Modern draft rooms typically include:

  • Statistical models that project college performance to NBA outcomes
  • Comparison systems that identify historical players with similar profiles
  • Risk assessments that quantify the uncertainty around projections
  • Value estimates that translate projections into draft position recommendations

This project replicates the analytical workflow used by professional basketball analytics departments. The techniques you develop here transfer directly to front office work, player agency analysis, and media draft coverage.


Part 1: Data Collection and Integration

1.1 Data Sources Overview

Building a draft model requires integrating multiple data sources, each providing distinct signal about player potential:

| Data Source | Key Variables | Coverage | Primary Signal |
|-------------|---------------|----------|----------------|
| College Box Scores | Points, rebounds, assists, etc. | 2000-present | Production level |
| College Advanced Stats | PER, BPM, WS/40 | 2010-present | Efficiency and impact |
| NBA Combine | Height, wingspan, vertical, agility | 2000-present | Athletic tools |
| Biographical Data | Age, experience, school tier | Complete | Context and trajectory |
| Draft History | Pick number, team, trade details | Complete | Selection outcomes |
| NBA Career Stats | Career totals, per-game, advanced | Complete | Target variables |

1.2 Data Collection Implementation

"""
data_collection.py - Comprehensive draft data collection module

This module handles the collection and integration of data from multiple
sources required for draft modeling.
"""

import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Tuple
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DraftDataCollector:
    """
    Collects and integrates data from multiple sources for draft modeling.

    This class handles the complexity of combining college statistics,
    combine measurements, and NBA career outcomes into a unified dataset.
    """

    def __init__(self, start_year: int = 2000, end_year: int = None):
        """
        Initialize the data collector.

        Args:
            start_year: First draft year to include
            end_year: Last draft year to include (defaults to current year - 4
                     to ensure sufficient NBA career data for evaluation)
        """
        self.start_year = start_year
        self.end_year = end_year or (datetime.now().year - 4)
        self.data_cache = {}

    def collect_college_stats(self, player_ids: List[str]) -> pd.DataFrame:
        """
        Collect college statistics for a list of players.

        In production, this would connect to a college statistics database
        or API. For this example, we demonstrate the expected data structure.

        Args:
            player_ids: List of unique player identifiers

        Returns:
            DataFrame with college statistics
        """
        # Expected columns for college statistics
        college_columns = [
            'player_id', 'name', 'season', 'school', 'conference',
            'games_played', 'games_started', 'minutes_per_game',
            'points_per_game', 'rebounds_per_game', 'assists_per_game',
            'steals_per_game', 'blocks_per_game', 'turnovers_per_game',
            'fg_pct', 'three_pt_pct', 'ft_pct', 'usage_rate',
            'true_shooting_pct', 'assist_rate', 'turnover_rate',
            'rebound_rate', 'block_rate', 'steal_rate',
            'box_plus_minus', 'win_shares_per_40'
        ]

        logger.info(f"Collecting college stats for {len(player_ids)} players")

        # In production: Query database or API
        # df = self._query_college_database(player_ids)

        # Placeholder for demonstration
        df = pd.DataFrame(columns=college_columns)

        return df

    def collect_combine_data(self, draft_years: List[int]) -> pd.DataFrame:
        """
        Collect NBA Combine measurements for specified draft years.

        The combine provides standardized athletic measurements that
        are particularly valuable for projecting players with limited
        college statistical samples.

        Args:
            draft_years: List of draft years to collect

        Returns:
            DataFrame with combine measurements
        """
        combine_columns = [
            'player_id', 'name', 'draft_year', 'position',
            'height_no_shoes', 'height_with_shoes', 'weight',
            'wingspan', 'standing_reach', 'body_fat_pct',
            'hand_length', 'hand_width',
            'standing_vertical', 'max_vertical',
            'lane_agility', 'three_quarter_sprint',
            'bench_press_reps'
        ]

        logger.info(f"Collecting combine data for years {draft_years}")

        # In production: Query combine database
        df = pd.DataFrame(columns=combine_columns)

        return df

    def collect_draft_history(self) -> pd.DataFrame:
        """
        Collect historical draft results including pick numbers and teams.

        Returns:
            DataFrame with draft history
        """
        draft_columns = [
            'player_id', 'name', 'draft_year', 'draft_round',
            'draft_pick', 'draft_team', 'college', 'position'
        ]

        logger.info(f"Collecting draft history {self.start_year}-{self.end_year}")

        df = pd.DataFrame(columns=draft_columns)

        return df

    def collect_nba_careers(self, player_ids: List[str]) -> pd.DataFrame:
        """
        Collect NBA career statistics for drafted players.

        This provides the target variables for our prediction model.
        We collect comprehensive career statistics to enable multiple
        definitions of "success."

        Args:
            player_ids: List of player identifiers

        Returns:
            DataFrame with NBA career statistics
        """
        career_columns = [
            'player_id', 'name', 'seasons_played', 'games_played',
            'games_started', 'total_minutes', 'career_ppg', 'career_rpg',
            'career_apg', 'career_ws', 'career_vorp', 'career_bpm',
            'peak_ws', 'peak_vorp', 'peak_bpm', 'all_star_selections',
            'all_nba_selections', 'championships', 'career_earnings'
        ]

        logger.info(f"Collecting NBA careers for {len(player_ids)} players")

        df = pd.DataFrame(columns=career_columns)

        return df

    def build_integrated_dataset(self) -> pd.DataFrame:
        """
        Build the complete integrated dataset for modeling.

        This method orchestrates the collection from all sources and
        performs the necessary joins to create a unified dataset.

        Returns:
            Complete DataFrame ready for feature engineering
        """
        logger.info("Building integrated draft dataset")

        # Collect draft history as the backbone
        draft_df = self.collect_draft_history()

        if draft_df.empty:
            logger.warning("No draft data collected - returning sample data")
            return self._generate_sample_data()

        # Get player IDs
        player_ids = draft_df['player_id'].unique().tolist()

        # Collect from other sources
        college_df = self.collect_college_stats(player_ids)
        combine_df = self.collect_combine_data(
            list(range(self.start_year, self.end_year + 1))
        )
        career_df = self.collect_nba_careers(player_ids)

        # Merge datasets
        merged = draft_df.merge(college_df, on='player_id', how='left')
        merged = merged.merge(combine_df, on='player_id', how='left')
        merged = merged.merge(career_df, on='player_id', how='left')

        logger.info(f"Integrated dataset: {len(merged)} players")

        return merged

    def _generate_sample_data(self) -> pd.DataFrame:
        """
        Generate realistic sample data for demonstration and testing.

        This creates a synthetic dataset with realistic distributions
        and correlations for model development.
        """
        np.random.seed(42)
        n_players = 500

        # Generate synthetic draft data
        data = {
            'player_id': [f'player_{i}' for i in range(n_players)],
            'name': [f'Player {i}' for i in range(n_players)],
            'draft_year': np.random.randint(2010, 2021, n_players),
            'draft_pick': np.random.randint(1, 61, n_players),

            # College statistics (drawn independently here; correlation with
            # outcomes is induced below via base_talent)
            'college_ppg': np.clip(25 - np.random.randn(n_players) * 5, 5, 35),
            'college_rpg': np.clip(8 - np.random.randn(n_players) * 2, 2, 15),
            'college_apg': np.clip(4 - np.random.randn(n_players) * 1.5, 0.5, 10),
            'college_fg_pct': np.clip(0.45 + np.random.randn(n_players) * 0.05, 0.35, 0.65),
            'college_3pt_pct': np.clip(0.35 + np.random.randn(n_players) * 0.06, 0.20, 0.50),
            'college_bpm': np.random.randn(n_players) * 3 + 5,
            'college_ws_per_40': np.random.randn(n_players) * 0.05 + 0.15,

            # Combine measurements
            'height_inches': np.random.normal(79, 3, n_players),
            'wingspan_inches': np.random.normal(83, 4, n_players),
            'weight_lbs': np.random.normal(215, 25, n_players),
            'standing_vertical': np.random.normal(28, 3, n_players),
            'max_vertical': np.random.normal(34, 4, n_players),
            'lane_agility': np.random.normal(11.2, 0.5, n_players),
            'three_quarter_sprint': np.random.normal(3.2, 0.15, n_players),

            # Contextual features
            'age_at_draft': np.random.normal(21, 1.5, n_players),
            'years_in_college': np.random.choice([1, 2, 3, 4], n_players, p=[0.2, 0.3, 0.25, 0.25]),
            'conference_strength': np.random.uniform(0.5, 1.0, n_players),
        }

        df = pd.DataFrame(data)

        # Generate correlated NBA outcomes
        # Higher draft picks and better college stats lead to better outcomes
        base_talent = (60 - df['draft_pick']) / 60 + df['college_bpm'] / 10
        noise = np.random.randn(n_players) * 0.3

        df['career_ws'] = np.clip((base_talent + noise) * 50, 0, 200)
        df['career_vorp'] = np.clip((base_talent + noise) * 30, -5, 100)
        df['seasons_played'] = np.clip(
            ((base_talent + noise + 0.5) * 8).astype(int), 1, 18
        )
        df['all_star_selections'] = np.clip(
            ((base_talent + noise) * 3).astype(int), 0, 15
        )

        return df


def load_draft_data(
    start_year: int = 2010,
    end_year: int = 2020,
    use_sample: bool = True
) -> pd.DataFrame:
    """
    Main function to load draft data for modeling.

    Args:
        start_year: First draft year to include
        end_year: Last draft year to include
        use_sample: Whether to use sample data (True for demonstration)

    Returns:
        Integrated DataFrame ready for feature engineering
    """
    collector = DraftDataCollector(start_year, end_year)

    if use_sample:
        return collector._generate_sample_data()
    else:
        return collector.build_integrated_dataset()

1.3 Data Quality Considerations

Before proceeding to feature engineering, we must address common data quality issues:

Missing Data Patterns:

  • Combine data is missing for approximately 30% of draft picks (players who opt out or are not invited)
  • International players often lack complete college statistics
  • Early career exits create censored outcome data

Handling Missing Values:

def handle_missing_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Apply appropriate missing data strategies for each variable type.
    """
    # Physical measurements: impute with position-specific medians
    physical_cols = ['height_inches', 'wingspan_inches', 'weight_lbs']
    for col in physical_cols:
        if col in df.columns:
            df[col] = df.groupby('position')[col].transform(
                lambda x: x.fillna(x.median())
            )

    # Athletic testing: create missing indicator + impute median
    athletic_cols = ['standing_vertical', 'max_vertical', 'lane_agility']
    for col in athletic_cols:
        if col in df.columns:
            df[f'{col}_missing'] = df[col].isna().astype(int)
            df[col] = df[col].fillna(df[col].median())

    # College stats: impute remaining gaps with the overall median
    stat_cols = ['college_ppg', 'college_rpg', 'college_apg']
    for col in stat_cols:
        if col in df.columns:
            df[col] = df[col].fillna(df[col].median())

    return df

Part 2: Feature Engineering for Draft Prediction

Feature engineering is where domain expertise most directly impacts model performance. We transform raw measurements and statistics into features that capture the underlying factors that predict NBA success.

2.1 Feature Categories

We organize features into five categories:

  1. Production Features: What did the player accomplish in college?
  2. Efficiency Features: How efficiently did they produce?
  3. Physical Features: What are their measurable physical tools?
  4. Context Features: What context helps interpret their production?
  5. Trajectory Features: How are they improving over time?

2.2 Core Feature Engineering Implementation

# See code/feature_engineering.py for complete implementation

from feature_engineering import DraftFeatureEngineer

# Example usage
engineer = DraftFeatureEngineer()
features_df = engineer.create_all_features(raw_data)
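
Since the full class lives outside this chapter, here is a minimal sketch of what DraftFeatureEngineer might look like, built on the sample-data columns from Part 1 with one illustrative feature for several of the categories above. The feature names and weights are assumptions for demonstration, not the canonical implementation:

import pandas as pd

class DraftFeatureEngineer:
    """Sketch: derives one representative feature per category."""

    def __init__(self):
        self.feature_names = []

    def create_all_features(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()

        # Production: combined box-score output
        df['production_index'] = (
            df['college_ppg'] + 1.2 * df['college_rpg'] + 1.5 * df['college_apg']
        )

        # Physical: length relative to height
        df['wingspan_height_ratio'] = df['wingspan_inches'] / df['height_inches']

        # Context: discount production by competition level
        df['adj_ppg'] = df['college_ppg'] * (0.5 + 0.5 * df['conference_strength'])

        # Trajectory: age-adjusted impact metric
        df['age_adj_bpm'] = df['college_bpm'] + 2.0 * (21 - df['age_at_draft'])

        self.feature_names = [
            'production_index', 'wingspan_height_ratio', 'adj_ppg', 'age_adj_bpm'
        ]
        return df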

2.3 Key Feature Concepts

Physical Profile Index: Combining height, wingspan, and athleticism into a single index allows the model to capture the overall physical package:

def calculate_physical_index(row):
    """
    Create composite physical profile score.

    Weights reflect the relative importance of each attribute
    for overall physical projection.
    """
    height_z = (row['height_inches'] - 79) / 3
    wingspan_z = (row['wingspan_inches'] - 83) / 4
    vertical_z = (row['max_vertical'] - 34) / 4

    # Wingspan relative to height is particularly predictive
    wingspan_diff = row['wingspan_inches'] - row['height_inches']
    wingspan_diff_z = (wingspan_diff - 4) / 2

    return 0.25 * height_z + 0.30 * wingspan_diff_z + 0.25 * vertical_z + 0.20 * wingspan_z

Age-Adjusted Production: A 19-year-old averaging 15 PPG is more impressive than a 23-year-old with the same production. We adjust for this:

def age_adjusted_production(ppg, age, reference_age=21):
    """
    Adjust scoring production for age.

    Younger players get a boost, older players get penalized.
    The adjustment factor is calibrated to historical data showing
    approximately 2 PPG improvement per year of experience at the
    college level.
    """
    age_adjustment = (reference_age - age) * 2.0
    return ppg + age_adjustment
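
For example, two prospects averaging 15 PPG land eight points apart once age is accounted for:

age_adjusted_production(15, age=19)  # 19.0: two years younger than reference, +4
age_adjusted_production(15, age=23)  # 11.0: two years older, -4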

Conference Strength Adjustment: Production against weaker competition is less predictive:

def conference_adjusted_stats(stats, conference_strength):
    """
    Adjust statistics based on strength of competition.

    Conference strength is measured on a 0-1 scale where 1.0
    represents the strongest conferences.
    """
    adjustment_factor = 0.5 + (conference_strength * 0.5)
    return stats * adjustment_factor
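
For instance, 20 PPG against top competition is left untouched, while the same scoring in a mid-tier league is discounted:

conference_adjusted_stats(20, conference_strength=1.0)  # 20.0
conference_adjusted_stats(20, conference_strength=0.6)  # 16.0 (factor 0.8)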

Part 3: Target Variable Definition

The choice of target variable fundamentally shapes what our model learns to predict. Different target definitions capture different aspects of player value.

3.1 Career Value Metrics

| Metric | Definition | Pros | Cons |
|--------|------------|------|------|
| Win Shares | Cumulative wins attributed | Captures total value | Favors longevity over peak |
| VORP | Value over replacement | Position-adjusted | Complex to interpret |
| All-Star Selections | Counting stat | Clear milestone | Subject to popularity bias |
| Career Earnings | Total compensation | Market valuation | Affected by cap era |
| Composite Score | Weighted combination | Multi-dimensional | Requires weight selection |

3.2 Implementing Target Variables

def create_target_variables(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create multiple target variable definitions for model training.

    We create several targets to enable different analyses:
    - Regression targets for predicting career value
    - Classification targets for identifying tiers
    - Ordinal targets for ranking predictions
    """
    # Continuous regression targets
    df['target_career_ws'] = df['career_ws']
    df['target_career_vorp'] = df['career_vorp']

    # Log-transformed targets (reduces impact of outliers)
    df['target_log_ws'] = np.log1p(df['career_ws'])
    df['target_log_vorp'] = np.log1p(np.clip(df['career_vorp'], 0, None))

    # Composite target: weighted combination
    df['target_composite'] = (
        0.4 * df['career_ws'].rank(pct=True) +
        0.3 * df['career_vorp'].rank(pct=True) +
        0.2 * df['all_star_selections'].rank(pct=True) +
        0.1 * df['seasons_played'].rank(pct=True)
    )

    # Classification targets
    ws_thresholds = df['career_ws'].quantile([0.25, 0.50, 0.75, 0.90])

    df['target_tier'] = pd.cut(
        df['career_ws'],
        bins=[-np.inf, ws_thresholds[0.25], ws_thresholds[0.50],
              ws_thresholds[0.75], ws_thresholds[0.90], np.inf],
        labels=['Bust', 'Bench', 'Rotation', 'Starter', 'Star']
    )

    # Binary classification: "hit" vs "miss"
    df['target_hit'] = (df['career_ws'] > ws_thresholds[0.50]).astype(int)

    # Star identification
    df['target_star'] = (df['career_ws'] > ws_thresholds[0.90]).astype(int)

    return df

3.3 Handling Career Outcome Uncertainty

Players from recent drafts have incomplete career data. We address this through:

  1. Minimum seasons threshold: Only include players with 4+ seasons for primary analysis
  2. Projection to career totals: Use per-season rates to project incomplete careers
  3. Separate models by experience: Train different models for different projection horizons

def project_career_outcomes(df: pd.DataFrame, min_seasons: int = 4) -> pd.DataFrame:
    """
    Project career outcomes for players with incomplete data.
    """
    # Mark players with sufficient data
    df['sufficient_data'] = df['seasons_played'] >= min_seasons

    # For insufficient data, project based on per-season rates
    df['ws_per_season'] = df['career_ws'] / df['seasons_played']
    df['projected_career_ws'] = np.where(
        df['sufficient_data'],
        df['career_ws'],
        df['ws_per_season'] * 10  # Project to 10-year career
    )

    return df

Part 4: Model Selection and Training

4.1 Model Selection Rationale

For draft prediction, we evaluate several model families:

| Model | Strengths | Weaknesses | Use Case |
|-------|-----------|------------|----------|
| Random Forest | Handles non-linearity, robust | Less interpretable | Primary prediction |
| Gradient Boosting | Highest accuracy potential | Prone to overfitting | Ensemble component |
| Linear Regression | Highly interpretable | Misses interactions | Baseline, insights |
| Neural Network | Captures complex patterns | Requires more data | Large datasets |

4.2 Model Training Pipeline

# Complete pipeline in code/draft_model.py

from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import xgboost as xgb


class DraftModelPipeline:
    """
    Complete pipeline for training and evaluating draft prediction models.
    """

    def __init__(self, target_col: str = 'target_career_ws'):
        self.target_col = target_col
        self.models = {}
        self.scaler = StandardScaler()
        self.feature_cols = None

    def prepare_features(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
        """
        Prepare feature matrix and target vector.
        """
        # Exclude identifiers, targets, and NBA outcome columns; career
        # outcomes must never appear in the feature matrix (see Part 5.2)
        exclude_cols = ['player_id', 'name', 'draft_year'] + \
                       [c for c in df.columns if c.startswith('target_')] + \
                       [c for c in df.columns if c.startswith('career_')] + \
                       ['seasons_played', 'all_star_selections']

        self.feature_cols = [c for c in df.columns if c not in exclude_cols
                             and df[c].dtype in ['int64', 'float64']]

        X = df[self.feature_cols].values
        y = df[self.target_col].values

        return X, y

    def train_models(
        self,
        X: np.ndarray,
        y: np.ndarray,
        draft_years: np.ndarray
    ) -> Dict:
        """
        Train multiple models with time-series cross-validation.

        We use draft year as the time index to ensure we never
        train on future data when backtesting.
        """
        results = {}

        # Define models
        models = {
            'random_forest': RandomForestRegressor(
                n_estimators=200,
                max_depth=10,
                min_samples_leaf=5,
                random_state=42
            ),
            'gradient_boosting': GradientBoostingRegressor(
                n_estimators=200,
                max_depth=5,
                learning_rate=0.05,
                random_state=42
            ),
            'xgboost': xgb.XGBRegressor(
                n_estimators=200,
                max_depth=6,
                learning_rate=0.05,
                random_state=42
            )
        }

        # Sort chronologically so TimeSeriesSplit folds respect draft order
        order = np.argsort(draft_years)
        X, y = X[order], y[order]

        # Time-series cross-validation
        tscv = TimeSeriesSplit(n_splits=5)

        for name, model in models.items():
            # Scale inside a Pipeline so each CV fold fits the scaler on its
            # own training data (avoids preprocessing leakage across folds)
            cv_pipeline = Pipeline([
                ('scaler', StandardScaler()),
                ('model', model)
            ])

            # Cross-validation scores
            cv_scores = cross_val_score(
                cv_pipeline, X, y,
                cv=tscv,
                scoring='neg_mean_squared_error'
            )

            # Train final model on all data
            X_scaled = self.scaler.fit_transform(X)
            model.fit(X_scaled, y)
            self.models[name] = model

            results[name] = {
                'cv_rmse': np.sqrt(-cv_scores.mean()),
                'cv_std': np.sqrt(-cv_scores).std(),
                'model': model
            }

            print(f"{name}: RMSE = {results[name]['cv_rmse']:.3f} "
                  f"(+/- {results[name]['cv_std']:.3f})")

        return results
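
The technical specification in Part 9 refers to an ensemble of Random Forest and XGBoost. A minimal unweighted blend over the trained models might look like the sketch below; production systems typically learn blend weights on held-out data instead:

def ensemble_predict(models: Dict[str, object], X_scaled: np.ndarray) -> np.ndarray:
    """
    Average predictions from several fitted regressors (equal weights).

    Assumes every model in `models` was fit on identically scaled features.
    """
    preds = np.column_stack([m.predict(X_scaled) for m in models.values()])
    return preds.mean(axis=1)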

4.3 Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV

def tune_hyperparameters(X, y, n_iter=50):
    """
    Perform hyperparameter tuning for the best model.
    """
    param_distributions = {
        'n_estimators': [100, 200, 300, 500],
        'max_depth': [5, 8, 10, 15, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 5, 10],
        'max_features': ['sqrt', 'log2', None]
    }

    rf = RandomForestRegressor(random_state=42)

    search = RandomizedSearchCV(
        rf, param_distributions,
        n_iter=n_iter,
        cv=5,
        scoring='neg_mean_squared_error',
        random_state=42,
        n_jobs=-1
    )

    search.fit(X, y)

    print(f"Best parameters: {search.best_params_}")
    print(f"Best RMSE: {np.sqrt(-search.best_score_):.3f}")

    return search.best_estimator_

Part 5: Backtesting Methodology

Backtesting is critical for validating that our model would have performed well on past drafts. We must be careful to avoid data leakage that would artificially inflate performance.

5.1 Walk-Forward Validation

def walk_forward_backtest(
    df: pd.DataFrame,
    feature_cols: List[str],
    target_col: str,
    train_years: int = 5,
    test_years: int = 1
) -> pd.DataFrame:
    """
    Perform walk-forward backtesting on historical drafts.

    For each test year, we train on only the previous train_years
    of data, simulating what we would have known at the time.

    Args:
        df: Complete dataset with draft_year column
        feature_cols: Features to use for prediction
        target_col: Target variable
        train_years: Number of years to include in training
        test_years: Number of years to predict

    Returns:
        DataFrame with predictions for each test year
    """
    results = []

    years = sorted(df['draft_year'].unique())

    for test_year in years:
        # Determine training years (only past data)
        train_start = test_year - train_years
        train_end = test_year - 1

        # Skip if not enough training data
        if train_start < years[0]:
            continue

        # Split data
        train_mask = (df['draft_year'] >= train_start) & \
                     (df['draft_year'] <= train_end)
        test_mask = df['draft_year'] == test_year

        X_train = df.loc[train_mask, feature_cols].values
        y_train = df.loc[train_mask, target_col].values
        X_test = df.loc[test_mask, feature_cols].values

        # Train model
        model = RandomForestRegressor(
            n_estimators=200,
            max_depth=10,
            random_state=42
        )

        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        model.fit(X_train_scaled, y_train)

        # Make predictions
        predictions = model.predict(X_test_scaled)

        # Store results
        test_df = df.loc[test_mask].copy()
        test_df['predicted_value'] = predictions
        test_df['prediction_rank'] = test_df['predicted_value'].rank(
            ascending=False, method='first'  # break ties so ranks stay integer
        ).astype(int)

        results.append(test_df)

    return pd.concat(results, ignore_index=True)
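
A quick way to exercise the backtest on the synthetic dataset from Part 1 (assuming load_draft_data is importable; the feature list here is illustrative, and in practice you would pass the engineered features from Part 2):

df = load_draft_data(use_sample=True)

backtest = walk_forward_backtest(
    df,
    feature_cols=['college_ppg', 'college_bpm', 'wingspan_inches', 'age_at_draft'],
    target_col='career_ws',
    train_years=5
)

# One row per player in each test year, with the model's board rank attached
print(backtest[['draft_year', 'name', 'predicted_value', 'prediction_rank']].head())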

5.2 Avoiding Data Leakage

Common sources of data leakage in draft modeling:

  1. Future career outcomes in features: Never use NBA stats as features
  2. Anachronistic knowledge: Ensure all training data was available at prediction time
  3. Selection bias: Include undrafted players in training when possible
  4. Target encoding leakage: Calculate means only on training data

def validate_no_leakage(df: pd.DataFrame) -> bool:
    """
    Verify that no NBA outcome columns appear among the model features.
    """
    feature_cols = [c for c in df.columns if not c.startswith('target_')]

    for col in feature_cols:
        # Feature names referencing NBA outcomes indicate leakage
        if 'nba_' in col.lower() or col.startswith('career_'):
            raise ValueError(f"NBA outcome column found in features: {col}")

    return True

Part 6: Evaluation Metrics

6.1 Regression Metrics

Standard regression metrics evaluate prediction accuracy:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_regression(y_true, y_pred):
    """
    Calculate comprehensive regression metrics.
    """
    return {
        'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
        'mae': mean_absolute_error(y_true, y_pred),
        'r2': r2_score(y_true, y_pred),
        # The +1 in the denominator guards against division by zero for
        # players whose career value is near zero
        'mape': np.mean(np.abs((y_true - y_pred) / (y_true + 1))) * 100
    }

6.2 Draft-Specific Metrics

Standard ML metrics do not fully capture draft model performance. We need domain-specific metrics:

def draft_specific_metrics(df: pd.DataFrame) -> Dict:
    """
    Calculate draft-specific evaluation metrics.

    These metrics capture whether the model would have improved
    actual draft outcomes.
    """
    results = {}

    # 1. Star Detection Rate
    # How often did we rank future stars in the top 10 of our board?
    stars = df[df['target_star'] == 1]
    star_detection = (stars['prediction_rank'] <= 10).mean()
    results['star_detection_rate_top10'] = star_detection

    # 2. Bust Avoidance Rate
    # How often did we avoid ranking busts in our top 10?
    busts = df[df['target_tier'] == 'Bust']
    bust_avoidance = (busts['prediction_rank'] > 10).mean()
    results['bust_avoidance_rate'] = bust_avoidance

    # 3. Rank Correlation
    # How well does our ranking correlate with actual outcomes?
    rank_correlation = df['prediction_rank'].corr(
        df['target_career_ws'].rank(ascending=False),
        method='spearman'
    )
    results['rank_correlation'] = rank_correlation

    # 4. Value Over Draft Position
    # Did our top picks outperform players taken at the same positions?
    df['model_value_added'] = df['target_career_ws'] - \
        df.groupby('draft_pick')['target_career_ws'].transform('mean')

    top_10_picks = df[df['prediction_rank'] <= 10]
    results['avg_value_added_top10'] = top_10_picks['model_value_added'].mean()

    # 5. Hit Rate by Tier
    for tier in ['Star', 'Starter', 'Rotation']:
        tier_players = df[df['target_tier'] == tier]
        hit_rate = (tier_players['prediction_rank'] <= 15).mean()
        results[f'hit_rate_{tier.lower()}'] = hit_rate

    return results

6.3 Confidence Calibration

Models should know what they do not know:

def evaluate_calibration(y_true, y_pred, y_std):
    """
    Evaluate whether prediction uncertainties are well-calibrated.

    A well-calibrated model should have 68% of observations within
    1 standard deviation and 95% within 2 standard deviations.
    """
    within_1std = np.mean(np.abs(y_true - y_pred) <= y_std)
    within_2std = np.mean(np.abs(y_true - y_pred) <= 2 * y_std)

    return {
        'within_1std': within_1std,  # Should be ~0.68
        'within_2std': within_2std,  # Should be ~0.95
        'calibration_error_1std': abs(within_1std - 0.68),
        'calibration_error_2std': abs(within_2std - 0.95)
    }
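
The y_std input has to come from somewhere. For a random forest, one common approach (the same one the Part 10 pipeline uses) is the spread of per-tree predictions. A sketch, assuming a fitted RandomForestRegressor:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_mean_and_std(model: RandomForestRegressor, X: np.ndarray):
    """Per-sample mean and standard deviation across the forest's trees."""
    tree_preds = np.array([tree.predict(X) for tree in model.estimators_])
    return tree_preds.mean(axis=0), tree_preds.std(axis=0)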

Part 7: Creating a Draft Board

The draft board translates model predictions into an actionable ranking with uncertainty quantification.

7.1 Board Generation

def create_draft_board(
    predictions_df: pd.DataFrame,
    n_players: int = 60
) -> pd.DataFrame:
    """
    Create a complete draft board from model predictions.

    Args:
        predictions_df: DataFrame with predictions and player info
        n_players: Number of players to include

    Returns:
        Formatted draft board DataFrame
    """
    # Sort by predicted value
    board = predictions_df.nlargest(n_players, 'predicted_value').copy()

    # Add board rank
    board['board_rank'] = range(1, len(board) + 1)

    # Calculate percentile ranks
    board['percentile'] = (
        board['predicted_value'].rank(pct=True) * 100
    ).round(1)

    # Add value tier labels
    board['tier'] = pd.cut(
        board['percentile'],
        bins=[0, 25, 50, 75, 90, 100],
        labels=['5th Tier', '4th Tier', '3rd Tier', '2nd Tier', '1st Tier']
    )

    # Calculate relative value to next pick
    board['value_gap'] = board['predicted_value'].diff(-1)

    # Small gaps to the next player mark value plateaus, i.e. spots where
    # trading down sacrifices little expected value
    board['trade_down_opportunity'] = board['value_gap'] < \
        board['value_gap'].quantile(0.25)

    # Format for presentation
    display_cols = [
        'board_rank', 'name', 'position', 'college',
        'predicted_value', 'prediction_std', 'percentile', 'tier',
        'primary_comp', 'secondary_comp'
    ]

    # Only include columns that exist
    display_cols = [c for c in display_cols if c in board.columns]

    return board[display_cols]

7.2 Player Comparisons

Historical comparisons help stakeholders contextualize predictions:

def find_player_comps(
    prospect: pd.Series,
    historical_df: pd.DataFrame,
    feature_cols: List[str],
    n_comps: int = 5
) -> List[Dict]:
    """
    Find historical players with similar pre-draft profiles.

    Uses Euclidean distance in feature space to identify
    the most similar historical prospects.
    """
    # Standardize features
    scaler = StandardScaler()
    historical_scaled = scaler.fit_transform(
        historical_df[feature_cols].fillna(0)
    )
    prospect_scaled = scaler.transform(
        prospect[feature_cols].fillna(0).values.reshape(1, -1)
    )

    # Calculate distances
    distances = np.linalg.norm(
        historical_scaled - prospect_scaled,
        axis=1
    )

    # Get closest matches
    closest_idx = np.argsort(distances)[:n_comps]

    comps = []
    for idx in closest_idx:
        player = historical_df.iloc[idx]
        comps.append({
            'name': player['name'],
            'draft_year': player['draft_year'],
            'draft_pick': player['draft_pick'],
            'career_ws': player['career_ws'],
            'similarity': 1 / (1 + distances[idx])
        })

    return comps
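
A usage sketch, assuming pred_df holds the current class and train_df the historical prospects, both with the listed feature columns populated (the feature list is illustrative):

prospect = pred_df.iloc[0]
comps = find_player_comps(
    prospect,
    historical_df=train_df,
    feature_cols=['college_ppg', 'college_bpm', 'wingspan_inches', 'max_vertical']
)
for comp in comps:
    print(f"{comp['name']} ({comp['draft_year']}, pick {comp['draft_pick']}): "
          f"similarity {comp['similarity']:.2f}")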

Part 8: Visualization of Results

Effective visualization is critical for communicating findings to decision-makers.

8.1 Core Visualizations

# Complete visualization code in code/visualization.py

import matplotlib.pyplot as plt
import seaborn as sns

def plot_draft_board(board_df: pd.DataFrame, figsize=(14, 10)):
    """
    Create a visual draft board showing rankings and uncertainty.
    """
    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # 1. Predicted value with confidence intervals
    ax1 = axes[0, 0]
    top_20 = board_df.head(20)

    ax1.barh(range(len(top_20)), top_20['predicted_value'],
             color='steelblue', alpha=0.7)

    if 'prediction_std' in top_20.columns:
        ax1.errorbar(
            top_20['predicted_value'],
            range(len(top_20)),
            xerr=top_20['prediction_std'] * 1.96,
            fmt='none',
            color='black',
            capsize=3
        )

    ax1.set_yticks(range(len(top_20)))
    ax1.set_yticklabels(top_20['name'])
    ax1.invert_yaxis()
    ax1.set_xlabel('Predicted Career Value')
    ax1.set_title('Top 20 Draft Board')

    # 2. Tier distribution
    ax2 = axes[0, 1]
    tier_counts = board_df['tier'].value_counts()
    colors = sns.color_palette("viridis", len(tier_counts))
    ax2.pie(tier_counts, labels=tier_counts.index, colors=colors,
            autopct='%1.0f%%')
    ax2.set_title('Prospect Tier Distribution')

    # 3. Value gaps (trade-down opportunities)
    ax3 = axes[1, 0]
    ax3.bar(range(1, len(board_df) + 1), board_df['value_gap'].fillna(0),
            color='coral')
    ax3.axhline(y=board_df['value_gap'].median(), color='red',
                linestyle='--', label='Median Gap')
    ax3.set_xlabel('Board Position')
    ax3.set_ylabel('Value Gap to Next Pick')
    ax3.set_title('Value Drops (Trade-Down Opportunities)')
    ax3.legend()

    # 4. Prediction vs draft position (for backtests)
    ax4 = axes[1, 1]
    if 'draft_pick' in board_df.columns:
        ax4.scatter(board_df['draft_pick'], board_df['predicted_value'],
                   alpha=0.6, c='teal')
        ax4.set_xlabel('Actual Draft Position')
        ax4.set_ylabel('Model Predicted Value')
        ax4.set_title('Model vs. Consensus')

    plt.tight_layout()
    return fig


def plot_feature_importance(model, feature_names: List[str], top_n: int = 20):
    """
    Visualize feature importances from tree-based models.
    """
    importances = model.feature_importances_
    indices = np.argsort(importances)[-top_n:]

    fig, ax = plt.subplots(figsize=(10, 8))

    ax.barh(range(len(indices)), importances[indices], color='steelblue')
    ax.set_yticks(range(len(indices)))
    ax.set_yticklabels([feature_names[i] for i in indices])
    ax.set_xlabel('Feature Importance')
    ax.set_title('Top Features for Draft Prediction')

    plt.tight_layout()
    return fig


def plot_backtest_results(backtest_df: pd.DataFrame):
    """
    Visualize backtest performance across years.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Correlation by year
    ax1 = axes[0, 0]
    yearly_corr = backtest_df.groupby('draft_year').apply(
        lambda x: x['predicted_value'].corr(x['target_career_ws'])
    )
    ax1.bar(yearly_corr.index, yearly_corr.values, color='steelblue')
    ax1.set_xlabel('Draft Year')
    ax1.set_ylabel('Correlation')
    ax1.set_title('Prediction Accuracy by Draft Year')
    ax1.axhline(y=yearly_corr.mean(), color='red', linestyle='--')

    # 2. Star detection by year
    ax2 = axes[0, 1]
    yearly_star = backtest_df.groupby('draft_year').apply(
        lambda x: (x[x['target_star'] == 1]['prediction_rank'] <= 10).mean()
    )
    ax2.bar(yearly_star.index, yearly_star.values, color='coral')
    ax2.set_xlabel('Draft Year')
    ax2.set_ylabel('Detection Rate')
    ax2.set_title('Star Detection Rate (Top 10)')

    # 3. Predicted vs actual scatter
    ax3 = axes[1, 0]
    ax3.scatter(backtest_df['predicted_value'],
               backtest_df['target_career_ws'],
               alpha=0.3, c='teal')
    ax3.plot([0, backtest_df['predicted_value'].max()],
            [0, backtest_df['predicted_value'].max()],
            'r--', label='Perfect Prediction')
    ax3.set_xlabel('Predicted Value')
    ax3.set_ylabel('Actual Career Win Shares')
    ax3.set_title('Predicted vs. Actual Outcomes')
    ax3.legend()

    # 4. Value added by draft position
    ax4 = axes[1, 1]
    position_bins = pd.cut(backtest_df['draft_pick'],
                          bins=[0, 10, 20, 30, 60])
    value_by_position = backtest_df.groupby(position_bins)['model_value_added'].mean()
    ax4.bar(range(len(value_by_position)), value_by_position.values,
           color='purple', alpha=0.7)
    ax4.set_xticks(range(len(value_by_position)))
    ax4.set_xticklabels(['1-10', '11-20', '21-30', '31-60'])
    ax4.set_xlabel('Draft Position Range')
    ax4.set_ylabel('Average Value Added')
    ax4.set_title('Model Value Added by Draft Position')

    plt.tight_layout()
    return fig

8.2 Interactive Dashboard Concept

For production use, consider building an interactive dashboard:

# Example using Streamlit (conceptual)
"""
import streamlit as st

def draft_dashboard():
    st.title("Draft Model Dashboard")

    # Sidebar controls
    draft_year = st.sidebar.selectbox("Draft Year", range(2024, 2015, -1))
    model_type = st.sidebar.selectbox("Model", ["Random Forest", "XGBoost"])

    # Load predictions
    predictions = load_predictions(draft_year, model_type)

    # Main content
    col1, col2 = st.columns(2)

    with col1:
        st.subheader("Draft Board")
        st.dataframe(predictions.head(30))

    with col2:
        st.subheader("Prediction Distribution")
        fig = plot_prediction_distribution(predictions)
        st.pyplot(fig)

    # Player deep dive
    selected_player = st.selectbox("Select Player", predictions['name'])
    show_player_profile(predictions, selected_player)
"""

Part 9: Presentation to Stakeholders

9.1 Executive Summary Template

When presenting to decision-makers, structure your findings clearly:

DRAFT MODEL EXECUTIVE SUMMARY

Model Performance:
- Backtest correlation: 0.65 (strong positive relationship)
- Star detection rate: 70% (7 of 10 stars ranked in top 10)
- Bust avoidance: 85% (avoided ranking busts highly)

Key Findings:
1. Age-adjusted production is the strongest predictor
2. Wingspan-to-height ratio adds significant signal
3. Model identifies 3 potential value picks outside top 10

Recommendations:
- Player A: Best value in 5-10 range (model rank: 3)
- Player B: Significant bust risk despite consensus top-5 ranking
- Player C: Undervalued by consensus, strong physical profile

Model Limitations:
- International players have higher uncertainty
- Injury risk not captured in current features
- Recent draft classes have incomplete outcome data

9.2 Technical Documentation

For technical audiences, provide comprehensive documentation:

## Model Technical Specification

### Features (52 total)
- College production: 15 features
- Athletic measurements: 12 features
- Physical profile: 8 features
- Context adjustments: 10 features
- Trajectory indicators: 7 features

### Model Architecture
- Ensemble of Random Forest and XGBoost
- 5-fold time-series cross-validation
- Hyperparameters tuned via randomized search (Part 4.3)

### Performance Metrics
| Metric | Training | Validation | Test |
|--------|----------|------------|------|
| RMSE   | 12.3     | 15.8       | 16.2 |
| R^2    | 0.78     | 0.62       | 0.58 |
| Rank Corr | 0.85  | 0.68       | 0.65 |

### Known Limitations
1. Sample size for star outcomes is small (< 50)
2. International prospects underrepresented
3. Position-specific models not yet implemented

Part 10: Complete Python ML Pipeline

The following brings together all components into a cohesive pipeline:

"""
main_pipeline.py - Complete draft model pipeline

This script orchestrates the entire draft modeling workflow from
data loading through final board generation.
"""

import pandas as pd
import numpy as np
from pathlib import Path
import logging
import pickle
from datetime import datetime

# Local imports (from code/ directory)
from draft_model import DraftModelPipeline
from feature_engineering import DraftFeatureEngineer
from evaluation import DraftModelEvaluator
from visualization import DraftVisualizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_draft_pipeline(
    data_path: str = None,
    output_dir: str = "output",
    target_year: int = None,
    save_artifacts: bool = True
) -> dict:
    """
    Execute the complete draft modeling pipeline.

    Args:
        data_path: Path to input data (uses sample if None)
        output_dir: Directory for saving outputs
        target_year: Year to generate predictions for
        save_artifacts: Whether to save model and results

    Returns:
        Dictionary containing model, predictions, and metrics
    """
    logger.info("=" * 60)
    logger.info("DRAFT MODEL PIPELINE")
    logger.info("=" * 60)

    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    # Step 1: Data Collection
    logger.info("\n[1/6] Loading data...")
    if data_path:
        data = pd.read_csv(data_path)
    else:
        from data_collection import load_draft_data
        data = load_draft_data(use_sample=True)

    logger.info(f"Loaded {len(data)} players from {data['draft_year'].min()}"
               f" to {data['draft_year'].max()}")

    # Step 2: Feature Engineering
    logger.info("\n[2/6] Engineering features...")
    engineer = DraftFeatureEngineer()
    features_df = engineer.create_all_features(data)

    logger.info(f"Created {len(engineer.feature_names)} features")

    # Step 3: Target Variable Creation
    logger.info("\n[3/6] Creating target variables...")
    from data_collection import create_target_variables
    features_df = create_target_variables(features_df)

    # Step 4: Model Training
    logger.info("\n[4/6] Training models...")
    pipeline = DraftModelPipeline(target_col='target_career_ws')

    # Split for training (historical) and prediction (target year)
    if target_year:
        train_df = features_df[features_df['draft_year'] < target_year]
        pred_df = features_df[features_df['draft_year'] == target_year]
    else:
        # Use all but most recent year for training
        max_year = features_df['draft_year'].max()
        train_df = features_df[features_df['draft_year'] < max_year]
        pred_df = features_df[features_df['draft_year'] == max_year]
        target_year = max_year

    X_train, y_train = pipeline.prepare_features(train_df)
    model_results = pipeline.train_models(
        X_train, y_train,
        train_df['draft_year'].values
    )

    # Step 5: Backtesting
    logger.info("\n[5/6] Running backtests...")
    evaluator = DraftModelEvaluator()

    backtest_results = evaluator.walk_forward_backtest(
        features_df,
        pipeline.feature_cols,
        'target_career_ws'
    )

    metrics = evaluator.calculate_metrics(backtest_results)

    logger.info("\nBacktest Results:")
    for metric, value in metrics.items():
        logger.info(f"  {metric}: {value:.3f}")

    # Step 6: Generate Draft Board
    logger.info("\n[6/6] Generating draft board...")

    # Predict for target year
    X_pred = pred_df[pipeline.feature_cols].values
    X_pred_scaled = pipeline.scaler.transform(X_pred)

    # Use best model
    best_model_name = min(model_results, key=lambda x: model_results[x]['cv_rmse'])
    best_model = model_results[best_model_name]['model']

    predictions = best_model.predict(X_pred_scaled)

    # Create board
    pred_df = pred_df.copy()
    pred_df['predicted_value'] = predictions

    # Estimate uncertainty from the spread of per-tree predictions. This
    # applies to random forests only: gradient-boosted trees predict residual
    # corrections, not the target itself.
    if best_model_name == 'random_forest':
        tree_preds = np.array([
            tree.predict(X_pred_scaled) for tree in best_model.estimators_
        ])
        pred_df['prediction_std'] = tree_preds.std(axis=0)

    from draft_model import create_draft_board
    draft_board = create_draft_board(pred_df)

    logger.info(f"\nDraft Board for {target_year}:")
    logger.info(draft_board.head(15).to_string())

    # Step 7: Visualization
    logger.info("\n[7/7] Creating visualizations...")
    visualizer = DraftVisualizer()

    fig_board = visualizer.plot_draft_board(draft_board)
    fig_backtest = visualizer.plot_backtest_results(backtest_results)
    fig_importance = visualizer.plot_feature_importance(
        best_model, pipeline.feature_cols
    )

    # Save artifacts
    if save_artifacts:
        logger.info("\nSaving artifacts...")

        # Save model
        with open(output_path / 'draft_model.pkl', 'wb') as f:
            pickle.dump({
                'model': best_model,
                'scaler': pipeline.scaler,
                'feature_cols': pipeline.feature_cols,
                'train_date': datetime.now().isoformat()
            }, f)

        # Save draft board
        draft_board.to_csv(output_path / 'draft_board.csv', index=False)

        # Save backtest results
        backtest_results.to_csv(output_path / 'backtest_results.csv', index=False)

        # Save figures
        fig_board.savefig(output_path / 'draft_board.png', dpi=150)
        fig_backtest.savefig(output_path / 'backtest_results.png', dpi=150)
        fig_importance.savefig(output_path / 'feature_importance.png', dpi=150)

        logger.info(f"Artifacts saved to {output_path}")

    logger.info("\n" + "=" * 60)
    logger.info("PIPELINE COMPLETE")
    logger.info("=" * 60)

    return {
        'model': best_model,
        'draft_board': draft_board,
        'backtest_results': backtest_results,
        'metrics': metrics,
        'feature_importance': dict(zip(
            pipeline.feature_cols,
            best_model.feature_importances_
        ))
    }


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Run Draft Model Pipeline")
    parser.add_argument('--data', type=str, help='Path to data file')
    parser.add_argument('--year', type=int, help='Target draft year')
    parser.add_argument('--output', type=str, default='output',
                       help='Output directory')

    args = parser.parse_args()

    results = run_draft_pipeline(
        data_path=args.data,
        target_year=args.year,
        output_dir=args.output
    )

    print("\nTop 10 Draft Board:")
    print(results['draft_board'].head(10))

Extension Exercises

Exercise 1: Position-Specific Models

Build separate models for guards, wings, and bigs to capture position-specific success factors.

Exercise 2: International Player Integration

Develop methods to incorporate international league statistics with appropriate adjustments.

Exercise 3: Injury Risk Modeling

Add a parallel model that predicts injury risk based on combine data and college workload.

Exercise 4: Trade Value Calculator

Extend the model to output draft pick trade values based on expected value at each position.
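
One possible starting point, assuming backtest results that pair draft_pick with career win shares (pick_trade_values is a hypothetical helper, and the pick-1 = 1000 scale simply mirrors conventional trade-value charts):

import pandas as pd

def pick_trade_values(backtest_df: pd.DataFrame) -> pd.Series:
    """Expected career win shares per draft slot, scaled so pick 1 = 1000."""
    expected = backtest_df.groupby('draft_pick')['career_ws'].mean()
    # Smooth pick-to-pick noise before scaling
    smoothed = expected.rolling(window=5, center=True, min_periods=1).mean()
    return (1000 * smoothed / smoothed.iloc[0]).round()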

Exercise 5: Monte Carlo Simulation

Implement simulation of draft scenarios to identify optimal team strategies.


Summary

In this capstone project, you have built a complete NBA draft prediction system that:

  1. Collects and integrates data from multiple sources
  2. Engineers meaningful features that capture player potential
  3. Defines appropriate target variables for career success
  4. Trains and validates models using proper backtesting
  5. Evaluates performance with domain-specific metrics
  6. Generates actionable draft boards with uncertainty quantification
  7. Visualizes results for stakeholder communication

This project demonstrates the full lifecycle of a machine learning application in sports analytics, from raw data to business decision support. The techniques and frameworks you have developed here transfer directly to professional front office work and establish a foundation for more advanced modeling approaches.


References and Further Reading

  1. Pelton, K. "Draft Analytics: Quantifying the NBA Draft" (MIT Sloan Sports Analytics Conference)
  2. Silver, N. "A Better Way to Evaluate NBA Draft Picks" (FiveThirtyEight)
  3. Kubatko, J. et al. "A Starting Point for Analyzing Basketball Statistics"
  4. Myers, D. "Position-less Basketball and Draft Strategy"
  5. NBA Combine Official Measurements Database
  6. Sports Reference College Basketball Statistics