
Capstone Project 1: Build a Player Comparison Tool

Introduction

One of the most common questions in basketball discussions is: "Who does this player remind you of?" Whether scouts are evaluating draft prospects, front offices are assessing free agent targets, or fans are debating player legacies, the ability to systematically compare players is invaluable. This capstone project guides you through building a comprehensive player comparison tool that leverages statistical similarity algorithms, percentile rankings, and interactive visualizations.

By the end of this project, you will have created a portfolio-worthy application that demonstrates proficiency in data engineering, statistical analysis, machine learning concepts, and data visualization. The tool will allow users to find statistically similar players across different eras, compare player strengths and weaknesses, and explore the landscape of playing styles in the NBA.


Project Overview

What You Will Build

The Player Comparison Tool is a Python-based application that:

  1. Loads and processes NBA player statistics from multiple data sources
  2. Calculates statistical similarity between players using multiple algorithms
  3. Generates percentile rankings for contextual comparisons
  4. Computes z-scores for standardized statistical analysis
  5. Creates interactive visualizations including radar charts, scatter plots, and similarity matrices
  6. Provides a web-based dashboard for exploring player comparisons

Learning Objectives

Upon completing this project, you will be able to:

  • Design and implement ETL pipelines for sports statistics
  • Apply similarity metrics (Euclidean distance, cosine similarity, Mahalanobis distance) to real-world data
  • Create meaningful statistical normalizations using percentile rankings and z-scores
  • Build interactive data visualizations with Plotly
  • Deploy a Streamlit dashboard for end-user interaction
  • Handle data quality issues common in sports analytics
  • Structure a data science project for maintainability and extensibility

Why This Project Matters

Player comparison tools are used throughout the basketball industry:

  • Scouting departments use similarity scores to identify draft prospects who project similarly to successful NBA players
  • Front offices evaluate free agents by comparing them to players with known contract values
  • Coaching staffs study tendencies of similar players to develop game plans
  • Media and fans contextualize player performance through historical comparisons

Building this tool demonstrates that you understand both the technical implementation and the domain-specific considerations that make such tools valuable.


Requirements and Specifications

Functional Requirements

| Requirement | Description | Priority |
|-------------|-------------|----------|
| FR-1 | Load player statistics from CSV files or API | Must Have |
| FR-2 | Calculate similarity scores between any two players | Must Have |
| FR-3 | Find the N most similar players to a given player | Must Have |
| FR-4 | Support multiple similarity algorithms | Must Have |
| FR-5 | Generate percentile rankings within customizable peer groups | Must Have |
| FR-6 | Compute z-scores for statistical comparisons | Must Have |
| FR-7 | Create radar chart visualizations for player profiles | Must Have |
| FR-8 | Display similarity matrices as heatmaps | Should Have |
| FR-9 | Filter comparisons by era, position, or other criteria | Should Have |
| FR-10 | Export comparison results to various formats | Could Have |
| FR-11 | Provide a web-based user interface | Should Have |

Non-Functional Requirements

| Requirement | Description | Target |
|-------------|-------------|--------|
| NFR-1 | Response time for similarity calculation | < 2 seconds for a single comparison |
| NFR-2 | Memory usage | < 500 MB for the full dataset |
| NFR-3 | Code test coverage | > 80% |
| NFR-4 | Documentation | All public functions documented |
| NFR-5 | Cross-platform compatibility | Windows, macOS, Linux |

Technical Stack

  • Python 3.9+: Core programming language
  • pandas: Data manipulation and analysis
  • NumPy: Numerical computations
  • scikit-learn: Machine learning utilities and preprocessing
  • SciPy: Statistical functions and distance metrics
  • Plotly: Interactive visualizations
  • Streamlit: Web dashboard framework
  • pytest: Testing framework

Data Sources

Primary Data Sources

Basketball Reference

The most comprehensive source for historical NBA statistics. You can obtain data through:

  1. Manual download: Export CSV files from player season pages
  2. Web scraping: Use libraries like basketball_reference_web_scraper
  3. Pre-compiled datasets: Kaggle hosts several cleaned datasets

NBA Stats API

The official NBA statistics API provides current season data:

# Example endpoint structure
base_url = "https://stats.nba.com/stats/"
endpoint = "leaguedashplayerstats"

Note: The NBA Stats API requires specific headers and has rate limiting.
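
Below is a minimal request sketch. The endpoint expects a fairly large, loosely documented set of query parameters and browser-like headers, so treat the values here as illustrative; in practice, a community wrapper such as the nba_api package handles these details for you.

import requests

# Hedged sketch: header and parameter values are illustrative, not a complete
# or guaranteed set; the API rejects requests it considers incomplete.
base_url = "https://stats.nba.com/stats/leaguedashplayerstats"
headers = {
    "User-Agent": "Mozilla/5.0",        # a browser-like user agent is typically required
    "Referer": "https://www.nba.com/",
}
params = {"Season": "2023-24", "SeasonType": "Regular Season", "PerMode": "PerGame"}

response = requests.get(base_url, headers=headers, params=params, timeout=10)
response.raise_for_status()
payload = response.json()  # tabular data typically sits under resultSets[0]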

Kaggle Datasets

Several well-maintained datasets are available:

  • NBA Players Stats (1950-2023)
  • NBA Advanced Stats
  • NBA Play-by-Play Data

Data Schema

For this project, we will work with the following statistical categories:

Counting Statistics (Per Game)

| Column | Description |
|--------|-------------|
| PTS | Points per game |
| TRB | Total rebounds per game |
| AST | Assists per game |
| STL | Steals per game |
| BLK | Blocks per game |
| TOV | Turnovers per game |
| MP | Minutes per game |

Shooting Statistics

| Column | Description |
|--------|-------------|
| FG_PCT | Field goal percentage |
| FG3_PCT | Three-point percentage |
| FT_PCT | Free throw percentage |
| TS_PCT | True shooting percentage |
| EFG_PCT | Effective field goal percentage |

Advanced Statistics

| Column | Description |
|--------|-------------|
| PER | Player efficiency rating |
| WS | Win shares |
| BPM | Box plus/minus |
| VORP | Value over replacement player |
| USG_PCT | Usage percentage |
| AST_PCT | Assist percentage |
| TRB_PCT | Total rebound percentage |

Data Quality Considerations

When working with basketball statistics, be aware of:

  1. Era differences: The three-point line was introduced in 1979-80
  2. Missing data: Some advanced stats are not available for older seasons
  3. Position changes: Position classifications have evolved over time
  4. Minutes thresholds: Low-minute players can have misleading per-game stats
  5. Lockout seasons: 1998-99 (50 games) and 2011-12 (66 games) were shortened, which deflates totals-based stats (see the sketch after this list)
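
A minimal sketch for point 5, assuming the integer Season column used throughout this guide (where 1999 denotes the 1998-99 season) and the WS column from the data schema:

import pandas as pd

# Schedule length for lockout-shortened seasons (season end-year -> games)
LOCKOUT_GAMES = {1999: 50, 2012: 66}

def prorate_win_shares(df: pd.DataFrame) -> pd.DataFrame:
    """Pro-rate win shares (a totals-based stat) to an 82-game schedule."""
    games = df['Season'].map(LOCKOUT_GAMES).fillna(82)
    out = df.copy()
    out['WS_82'] = out['WS'] * 82 / games
    return out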

Architecture and Design

System Architecture

+-------------------+     +-------------------+     +-------------------+
|                   |     |                   |     |                   |
|   Data Sources    |---->|   Data Loader     |---->|   Data Store      |
|   (CSV, API)      |     |   (ETL Pipeline)  |     |   (pandas DF)     |
|                   |     |                   |     |                   |
+-------------------+     +-------------------+     +-------------------+
                                                            |
                                                            v
+-------------------+     +-------------------+     +-------------------+
|                   |     |                   |     |                   |
|   Dashboard       |<----|   Visualization   |<----|   Similarity      |
|   (Streamlit)     |     |   (Plotly)        |     |   Engine          |
|                   |     |                   |     |                   |
+-------------------+     +-------------------+     +-------------------+

Module Design

The application is organized into four core modules:

  1. data_loader.py: Handles all data ingestion and preprocessing
  2. similarity.py: Implements similarity algorithms and comparison logic
  3. visualization.py: Creates all charts and visual outputs
  4. player_comparison.py: Main application orchestration and Streamlit UI

Design Decisions

Decision 1: Pandas over Database

Choice: Use pandas DataFrames as the primary data store rather than a database.

Rationale:

  • The dataset size (thousands of player-seasons) fits comfortably in memory
  • pandas provides excellent support for the statistical operations we need
  • Simplifies deployment by avoiding database dependencies
  • Allows for rapid prototyping and iteration

Trade-offs:

  • Not suitable if the dataset grows significantly
  • Concurrent write operations would require additional handling

Decision 2: Multiple Similarity Algorithms

Choice: Implement multiple similarity algorithms and let users choose.

Rationale:

  • Different algorithms capture different aspects of similarity
  • Euclidean distance is intuitive but sensitive to scale
  • Cosine similarity captures style regardless of volume
  • Users can validate findings across multiple methods

Decision 3: Configurable Feature Sets

Choice: Allow users to select which statistics to include in comparisons.

Rationale:

  • Different use cases require different features (e.g., scoring vs. all-around)
  • Reduces dimensionality for more interpretable results
  • Enables position-specific comparisons

Decision 4: Z-Score Normalization as Default

Choice: Use z-score normalization before calculating similarity.

Rationale:

  • Puts all statistics on the same scale
  • Accounts for league-wide changes over time when calculated per-season
  • Interpretable (standard deviations from mean)
  • Handles outliers better than min-max scaling


Step-by-Step Implementation Guide

Step 1: Project Setup

Create your project directory structure:

capstone1-player-comparison/
├── code/
│   ├── __init__.py
│   ├── player_comparison.py
│   ├── data_loader.py
│   ├── similarity.py
│   ├── visualization.py
│   └── requirements.txt
├── data/
│   └── (player statistics CSV files)
├── tests/
│   ├── test_data_loader.py
│   ├── test_similarity.py
│   └── test_visualization.py
└── index.md

Install dependencies:

pip install -r code/requirements.txt

Step 2: Data Loading and Preprocessing

The data loader module handles:

  1. Reading data from various sources
  2. Cleaning and validating data
  3. Filtering based on criteria (minutes, games played)
  4. Computing derived statistics

Key preprocessing steps:

import pandas as pd

def preprocess_data(df: pd.DataFrame, min_games: int = 20,
                    min_minutes: float = 10.0) -> pd.DataFrame:
    """
    Preprocess player statistics data.

    Steps:
    1. Filter by minimum games and minutes
    2. Handle missing values
    3. Compute per-possession stats if needed
    4. Add era classification
    """
    # Filter for qualified players (copy to avoid SettingWithCopyWarning)
    df = df[(df['G'] >= min_games) & (df['MP'] >= min_minutes)].copy()

    # Handle missing three-point data for pre-1980 players
    if 'FG3_PCT' in df.columns:
        df['FG3_PCT'] = df['FG3_PCT'].fillna(0)

    # Add era classification
    df['Era'] = pd.cut(df['Season'],
                       bins=[1946, 1980, 2000, 2015, 2030],
                       labels=['Pre-3PT', 'Classic', 'Modern', 'Analytics'])

    return df

Step 3: Implementing Similarity Algorithms

Euclidean Distance

The most intuitive distance metric, measuring the straight-line distance between two points in n-dimensional space:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

import numpy as np

def euclidean_similarity(player1_stats: np.ndarray,
                         player2_stats: np.ndarray) -> float:
    """
    Calculate Euclidean similarity (inverse of distance).
    Returns a value between 0 and 1, where 1 is identical.
    """
    distance = np.sqrt(np.sum((player1_stats - player2_stats) ** 2))
    # Convert distance to similarity (0 to 1 scale)
    similarity = 1 / (1 + distance)
    return similarity

Pros: Intuitive, preserves magnitude differences
Cons: Sensitive to scale, affected by high-volume players

Cosine Similarity

Measures the cosine of the angle between two vectors, capturing similarity in direction regardless of magnitude:

$$\cos(\theta) = \frac{A \cdot B}{||A|| \times ||B||}$$

def cosine_similarity(player1_stats: np.ndarray,
                      player2_stats: np.ndarray) -> float:
    """
    Calculate cosine similarity between two player stat vectors.
    Returns a value between -1 and 1, where 1 is identical direction.
    """
    dot_product = np.dot(player1_stats, player2_stats)
    norm1 = np.linalg.norm(player1_stats)
    norm2 = np.linalg.norm(player2_stats)

    if norm1 == 0 or norm2 == 0:
        return 0.0

    return dot_product / (norm1 * norm2)

Pros: Captures playing style regardless of usage; good for role players
Cons: Ignores volume; a 10 PPG scorer could match a 25 PPG scorer

Mahalanobis Distance

Accounts for correlations between variables and differences in variance:

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

def mahalanobis_similarity(player1_stats: np.ndarray,
                           player2_stats: np.ndarray,
                           cov_matrix: np.ndarray) -> float:
    """
    Calculate Mahalanobis-based similarity.
    Accounts for correlations between statistics.
    """
    diff = player1_stats - player2_stats
    try:
        cov_inv = np.linalg.inv(cov_matrix)
        distance = np.sqrt(np.dot(np.dot(diff, cov_inv), diff))
        similarity = 1 / (1 + distance)
        return similarity
    except np.linalg.LinAlgError:
        # Fall back to Euclidean if covariance matrix is singular
        return euclidean_similarity(player1_stats, player2_stats)

Pros: Statistically rigorous, handles correlated features
Cons: Requires an invertible covariance matrix; computationally expensive
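
SciPy also ships this metric, which avoids hand-rolling the matrix algebra. A small sketch with the same inputs as above; note that SciPy expects the inverse covariance matrix, and a pseudo-inverse is a common fallback when the true inverse does not exist:

import numpy as np
from scipy.spatial.distance import mahalanobis

# SciPy's mahalanobis takes the *inverse* covariance matrix (VI)
cov_inv = np.linalg.pinv(cov_matrix)  # pseudo-inverse sidesteps singular matrices
distance = mahalanobis(player1_stats, player2_stats, cov_inv)
similarity = 1 / (1 + distance)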

Step 4: Percentile Rankings

Percentile rankings provide context by showing where a player ranks relative to peers:

from typing import List, Optional

import pandas as pd

def calculate_percentile_rankings(df: pd.DataFrame,
                                  stats: List[str],
                                  group_by: Optional[str] = None) -> pd.DataFrame:
    """
    Calculate percentile rankings for specified statistics.

    Args:
        df: DataFrame with player statistics
        stats: List of statistic columns to rank
        group_by: Optional column to group rankings (e.g., 'Season', 'Position')

    Returns:
        DataFrame with percentile rankings (0-100) for each stat
    """
    result = df.copy()

    for stat in stats:
        col_name = f'{stat}_percentile'
        if group_by:
            result[col_name] = result.groupby(group_by)[stat].transform(
                lambda x: x.rank(pct=True) * 100
            )
        else:
            result[col_name] = result[stat].rank(pct=True) * 100

    return result

Use Cases:

  • Comparing players across different eras (normalize within season)
  • Evaluating positional performance (normalize within position)
  • Creating scouting profiles with percentile bars
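
For example, era-adjusted rankings fall directly out of the group_by parameter. A usage sketch, assuming df is the preprocessed DataFrame from Step 2:

# Percentiles computed within each season, so players from 1985 and 2023
# are each ranked against their own contemporaries
ranked = calculate_percentile_rankings(df, stats=['PTS', 'TRB', 'AST'],
                                       group_by='Season')
print(ranked[['Player', 'Season', 'PTS_percentile']].head())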

Step 5: Z-Score Comparisons

Z-scores standardize statistics to enable direct comparison:

$$z = \frac{x - \mu}{\sigma}$$

def calculate_zscores(df: pd.DataFrame,
                      stats: List[str],
                      group_by: Optional[str] = None) -> pd.DataFrame:
    """
    Calculate z-scores for specified statistics.

    A z-score of 0 means average, +1 means one standard deviation above.
    """
    result = df.copy()

    for stat in stats:
        col_name = f'{stat}_zscore'
        if group_by:
            result[col_name] = result.groupby(group_by)[stat].transform(
                lambda x: (x - x.mean()) / x.std()
            )
        else:
            mean = result[stat].mean()
            std = result[stat].std()
            result[col_name] = (result[stat] - mean) / std

    return result

Interpretation Guide:

| Z-Score | Interpretation |
|---------|----------------|
| +3.0 | Elite (99.9th percentile) |
| +2.0 | Excellent (97.7th percentile) |
| +1.0 | Above average (84th percentile) |
| 0.0 | League average |
| -1.0 | Below average (16th percentile) |
| -2.0 | Poor (2.3rd percentile) |

Step 6: Finding Similar Players

The core function that ties everything together:

from typing import Dict, List, Optional

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def find_similar_players(target_player: str,
                         target_season: int,
                         df: pd.DataFrame,
                         features: List[str],
                         algorithm: str = 'euclidean',
                         n: int = 10,
                         filters: Optional[Dict] = None) -> pd.DataFrame:
    """
    Find the N most similar players to a target player-season.

    Args:
        target_player: Name of the player to compare
        target_season: Season year to use for comparison
        df: DataFrame with all player statistics
        features: List of statistic columns to use for comparison
        algorithm: Similarity algorithm ('euclidean', 'cosine', 'mahalanobis')
        n: Number of similar players to return
        filters: Optional filters (e.g., {'Era': 'Modern', 'Position': 'PG'})

    Returns:
        DataFrame with the N most similar players and their similarity scores
    """
    # Get target player stats
    target_mask = (df['Player'] == target_player) & (df['Season'] == target_season)
    if not target_mask.any():
        raise ValueError(f"Player {target_player} not found for season {target_season}")

    target_stats = df.loc[target_mask, features].values[0]

    # Apply filters
    comparison_df = df.copy()
    if filters:
        for col, value in filters.items():
            comparison_df = comparison_df[comparison_df[col] == value]

    # Normalize features: fit on the comparison pool, then apply to the target
    scaler = StandardScaler()
    normalized_features = scaler.fit_transform(comparison_df[features])
    target_normalized = scaler.transform(target_stats.reshape(1, -1))[0]

    # Calculate similarity of the target to every player in the pool
    if algorithm == 'mahalanobis':
        cov = np.cov(normalized_features.T)  # compute once, not per row

    similarities = []
    for row_stats in normalized_features:
        if algorithm == 'euclidean':
            sim = euclidean_similarity(target_normalized, row_stats)
        elif algorithm == 'cosine':
            sim = cosine_similarity(target_normalized, row_stats)
        elif algorithm == 'mahalanobis':
            sim = mahalanobis_similarity(target_normalized, row_stats, cov)
        else:
            raise ValueError(f"Unknown algorithm: {algorithm}")
        similarities.append(sim)

    comparison_df['Similarity'] = similarities

    # Remove target player and sort
    result = comparison_df[~target_mask].nlargest(n, 'Similarity')

    return result[['Player', 'Season', 'Similarity'] + features]
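
A usage sketch, with an illustrative player-season and column names taken from the data schema above:

# Five stylistic matches for 2016 Stephen Curry among Analytics-era players
similar = find_similar_players(
    target_player='Stephen Curry', target_season=2016, df=df,
    features=['PTS', 'AST', 'FG3_PCT', 'TS_PCT', 'USG_PCT'],
    algorithm='cosine', n=5, filters={'Era': 'Analytics'}
)
print(similar)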

Visualization Dashboard Design

Radar Charts

Radar charts (also called spider charts) excel at displaying multivariate data for player profiles:

from typing import Dict, Optional

import plotly.graph_objects as go

def create_radar_chart(player_data: Dict[str, float],
                       comparison_data: Optional[Dict[str, float]] = None,
                       title: str = "Player Profile") -> go.Figure:
    """
    Create a radar chart comparing one or two players.

    Args:
        player_data: Dictionary of stat names to percentile values (0-100)
        comparison_data: Optional second player for comparison
        title: Chart title

    Returns:
        Plotly Figure object
    """
    categories = list(player_data.keys())

    fig = go.Figure()

    # Add primary player
    fig.add_trace(go.Scatterpolar(
        r=list(player_data.values()),
        theta=categories,
        fill='toself',
        name='Player 1',
        line=dict(color='#1f77b4')
    ))

    # Add comparison player if provided
    if comparison_data:
        fig.add_trace(go.Scatterpolar(
            r=list(comparison_data.values()),
            theta=categories,
            fill='toself',
            name='Player 2',
            line=dict(color='#ff7f0e')
        ))

    fig.update_layout(
        polar=dict(
            radialaxis=dict(
                visible=True,
                range=[0, 100]
            )
        ),
        showlegend=True,
        title=title
    )

    return fig
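
A usage sketch with hypothetical percentile values; each key maps to a 0-100 percentile, matching the chart's radial range:

# Hypothetical guard profile (values are percentiles, not raw stats)
profile = {'PTS': 95.0, 'TRB': 35.0, 'AST': 88.0, 'STL': 72.0, 'BLK': 20.0}
fig = create_radar_chart(profile, title="Hypothetical Guard Profile")
fig.show()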

Similarity Heatmaps

Heatmaps visualize pairwise similarities across multiple players:

def create_similarity_heatmap(similarity_matrix: np.ndarray,
                              player_names: List[str],
                              title: str = "Player Similarity Matrix") -> go.Figure:
    """
    Create a heatmap showing pairwise similarities.
    """
    fig = go.Figure(data=go.Heatmap(
        z=similarity_matrix,
        x=player_names,
        y=player_names,
        colorscale='RdYlGn',
        zmin=0,
        zmax=1,
        text=np.round(similarity_matrix, 2),
        texttemplate='%{text}',
        textfont={"size": 10},
        hoverongaps=False
    ))

    fig.update_layout(
        title=title,
        xaxis_title="Player",
        yaxis_title="Player",
        width=800,
        height=800
    )

    return fig

Statistical Comparison Bar Charts

Side-by-side bar charts for direct statistical comparisons:

def create_comparison_bars(player1_stats: Dict[str, float],
                           player2_stats: Dict[str, float],
                           player1_name: str,
                           player2_name: str) -> go.Figure:
    """
    Create grouped bar chart comparing two players' statistics.
    """
    categories = list(player1_stats.keys())

    fig = go.Figure(data=[
        go.Bar(name=player1_name, x=categories, y=list(player1_stats.values())),
        go.Bar(name=player2_name, x=categories, y=list(player2_stats.values()))
    ])

    fig.update_layout(
        barmode='group',
        title=f'{player1_name} vs {player2_name}',
        xaxis_title='Statistic',
        yaxis_title='Z-Score',
        legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
    )

    return fig

User Interface Considerations

Streamlit Dashboard Layout

The dashboard is organized into logical sections:

  1. Sidebar: Player selection, algorithm choice, feature selection
  2. Main Area: Visualizations and results tables
  3. Expandable Sections: Detailed statistics and methodology explanations
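
A minimal layout sketch, assuming the PlayerComparisonTool class from this guide is importable from the code/ package and that your statistics live in a CSV (the path below is hypothetical):

import pandas as pd
import streamlit as st

from code.player_comparison import PlayerComparisonTool  # module from this guide

@st.cache_data  # cache the load so interactions do not re-read the file
def load_data() -> pd.DataFrame:
    return pd.read_csv("data/player_stats.csv")  # hypothetical path

df = load_data()
tool = PlayerComparisonTool(data=df)

st.title("NBA Player Comparison Tool")

# Sidebar: player selection, algorithm choice, result count
with st.sidebar:
    player = st.selectbox("Player", sorted(df['Player'].unique()))
    method = st.radio("Algorithm", ["euclidean", "cosine", "correlation"])
    n = st.slider("Number of similar players", 1, 25, 10)

# Main area: radar chart and results table
st.plotly_chart(tool.create_radar_chart(player), use_container_width=True)
st.dataframe(tool.find_similar_players(player, n=n, method=method))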

Key UI/UX Principles

  1. Progressive Disclosure: Show summary first, details on demand
  2. Sensible Defaults: Pre-select commonly used features and algorithms
  3. Clear Feedback: Loading indicators and error messages
  4. Mobile Responsiveness: Streamlit handles this automatically

Accessibility Considerations

  • Use colorblind-friendly palettes (avoid red-green only distinctions)
  • Provide text alternatives for all visualizations
  • Ensure sufficient contrast ratios
  • Support keyboard navigation

Complete Core Implementation

Below is the complete implementation for the core player comparison functionality:

"""
Player Comparison Tool - Core Implementation
A comprehensive tool for finding and comparing statistically similar NBA players.
"""

import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cdist
from scipy.stats import percentileofscore
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots


class PlayerComparisonTool:
    """
    Main class for player comparison functionality.

    This class provides methods for:
    - Loading and preprocessing player statistics
    - Calculating similarity scores between players
    - Finding similar players
    - Generating visualizations
    """

    # Default features for comparison
    DEFAULT_FEATURES = [
        'PTS', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
        'FG_PCT', 'FG3_PCT', 'FT_PCT', 'TS_PCT',
        'USG_PCT', 'PER', 'BPM', 'WS'
    ]

    def __init__(self, data: Optional[pd.DataFrame] = None):
        """
        Initialize the PlayerComparisonTool.

        Args:
            data: Optional pre-loaded DataFrame with player statistics
        """
        self.data = data
        self.scaler = StandardScaler()
        self._normalized_data = None
        self._features = None

    def load_data(self, filepath: str, **kwargs) -> pd.DataFrame:
        """
        Load player statistics from a CSV file.

        Args:
            filepath: Path to the CSV file
            **kwargs: Additional arguments passed to pd.read_csv

        Returns:
            Loaded and preprocessed DataFrame
        """
        self.data = pd.read_csv(filepath, **kwargs)
        self._preprocess_data()
        return self.data

    def _preprocess_data(self, min_games: int = 20, min_minutes: float = 10.0):
        """
        Preprocess the loaded data.

        Args:
            min_games: Minimum games played threshold
            min_minutes: Minimum minutes per game threshold
        """
        if self.data is None:
            raise ValueError("No data loaded")

        # Filter for qualified players
        if 'G' in self.data.columns and 'MP' in self.data.columns:
            self.data = self.data[
                (self.data['G'] >= min_games) &
                (self.data['MP'] >= min_minutes)
            ]

        # Handle missing values in shooting percentages
        pct_cols = ['FG_PCT', 'FG3_PCT', 'FT_PCT', 'TS_PCT', 'EFG_PCT']
        for col in pct_cols:
            if col in self.data.columns:
                self.data[col] = self.data[col].fillna(0)

        # Reset index
        self.data = self.data.reset_index(drop=True)

    def prepare_features(self, features: Optional[List[str]] = None) -> np.ndarray:
        """
        Prepare and normalize features for similarity calculations.

        Args:
            features: List of feature columns to use

        Returns:
            Normalized feature array
        """
        if self.data is None:
            raise ValueError("No data loaded")

        self._features = features or self.DEFAULT_FEATURES

        # Filter to available features
        available_features = [f for f in self._features if f in self.data.columns]
        if len(available_features) < len(self._features):
            missing = set(self._features) - set(available_features)
            print(f"Warning: Features not found in data: {missing}")

        self._features = available_features

        # Fill any remaining missing values with column means
        feature_data = self.data[self._features].copy()
        feature_data = feature_data.fillna(feature_data.mean())

        # Normalize
        self._normalized_data = self.scaler.fit_transform(feature_data)

        return self._normalized_data

    def calculate_similarity(self,
                             player1_idx: int,
                             player2_idx: int,
                             method: str = 'euclidean') -> float:
        """
        Calculate similarity between two players.

        Args:
            player1_idx: Index of first player in DataFrame
            player2_idx: Index of second player in DataFrame
            method: Similarity method ('euclidean', 'cosine', 'correlation')

        Returns:
            Similarity score where 1 is most similar; euclidean maps to (0, 1],
            while cosine and correlation can range from -1 to 1
        """
        if self._normalized_data is None:
            self.prepare_features()

        stats1 = self._normalized_data[player1_idx].reshape(1, -1)
        stats2 = self._normalized_data[player2_idx].reshape(1, -1)

        if method == 'euclidean':
            distance = cdist(stats1, stats2, metric='euclidean')[0, 0]
            return 1 / (1 + distance)
        elif method == 'cosine':
            # Cosine distance is 1 - cosine similarity
            cosine_dist = cdist(stats1, stats2, metric='cosine')[0, 0]
            return 1 - cosine_dist
        elif method == 'correlation':
            corr_dist = cdist(stats1, stats2, metric='correlation')[0, 0]
            return 1 - corr_dist
        else:
            raise ValueError(f"Unknown method: {method}")

    def find_similar_players(self,
                             player_name: str,
                             season: Optional[int] = None,
                             n: int = 10,
                             method: str = 'euclidean',
                             features: Optional[List[str]] = None,
                             filters: Optional[Dict] = None) -> pd.DataFrame:
        """
        Find the N most similar players to a given player.

        Args:
            player_name: Name of the target player
            season: Optional season year to filter target player
            n: Number of similar players to return
            method: Similarity method
            features: Optional list of features to use
            filters: Optional filters to apply (e.g., {'Position': 'PG'})

        Returns:
            DataFrame with similar players and their similarity scores
        """
        if self.data is None:
            raise ValueError("No data loaded")

        # Find target player
        mask = self.data['Player'] == player_name
        if season is not None:
            mask &= self.data['Season'] == season

        if not mask.any():
            raise ValueError(f"Player '{player_name}' not found" +
                           (f" for season {season}" if season else ""))

        target_idx = self.data[mask].index[0]

        # Prepare features
        self.prepare_features(features)

        # Apply filters
        comparison_mask = pd.Series([True] * len(self.data))
        if filters:
            for col, value in filters.items():
                if col in self.data.columns:
                    comparison_mask &= self.data[col] == value

        # Calculate all similarities
        similarities = []
        for idx in range(len(self.data)):
            if comparison_mask.iloc[idx] and idx != target_idx:
                sim = self.calculate_similarity(target_idx, idx, method)
                similarities.append((idx, sim))

        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)
        top_n = similarities[:n]

        # Build result DataFrame
        result_indices = [idx for idx, _ in top_n]
        result = self.data.iloc[result_indices].copy()
        result['Similarity'] = [sim for _, sim in top_n]

        # Reorder columns
        cols = ['Player', 'Season', 'Similarity'] + \
               [c for c in self._features if c in result.columns]
        result = result[[c for c in cols if c in result.columns]]

        return result.reset_index(drop=True)

    def calculate_percentiles(self,
                              player_name: str,
                              season: Optional[int] = None,
                              features: Optional[List[str]] = None,
                              comparison_group: Optional[pd.DataFrame] = None) -> Dict[str, float]:
        """
        Calculate percentile rankings for a player.

        Args:
            player_name: Name of the player
            season: Optional season year
            features: Features to calculate percentiles for
            comparison_group: Optional subset of data for comparison

        Returns:
            Dictionary of feature names to percentile values (0-100)
        """
        if self.data is None:
            raise ValueError("No data loaded")

        features = features or self.DEFAULT_FEATURES
        features = [f for f in features if f in self.data.columns]

        # Find player
        mask = self.data['Player'] == player_name
        if season is not None:
            mask &= self.data['Season'] == season

        if not mask.any():
            raise ValueError(f"Player '{player_name}' not found")

        player_row = self.data[mask].iloc[0]
        comparison_df = comparison_group if comparison_group is not None else self.data

        percentiles = {}
        for feature in features:
            value = player_row[feature]
            pct = percentileofscore(comparison_df[feature].dropna(), value)
            percentiles[feature] = round(pct, 1)

        return percentiles

    def calculate_zscores(self,
                          player_name: str,
                          season: Optional[int] = None,
                          features: Optional[List[str]] = None,
                          comparison_group: Optional[pd.DataFrame] = None) -> Dict[str, float]:
        """
        Calculate z-scores for a player.

        Args:
            player_name: Name of the player
            season: Optional season year
            features: Features to calculate z-scores for
            comparison_group: Optional subset of data for comparison

        Returns:
            Dictionary of feature names to z-score values
        """
        if self.data is None:
            raise ValueError("No data loaded")

        features = features or self.DEFAULT_FEATURES
        features = [f for f in features if f in self.data.columns]

        # Find player
        mask = self.data['Player'] == player_name
        if season is not None:
            mask &= self.data['Season'] == season

        if not mask.any():
            raise ValueError(f"Player '{player_name}' not found")

        player_row = self.data[mask].iloc[0]
        comparison_df = comparison_group if comparison_group is not None else self.data

        zscores = {}
        for feature in features:
            value = player_row[feature]
            mean = comparison_df[feature].mean()
            std = comparison_df[feature].std()
            if std > 0:
                zscores[feature] = round((value - mean) / std, 2)
            else:
                zscores[feature] = 0.0

        return zscores

    def create_radar_chart(self,
                           player_name: str,
                           season: Optional[int] = None,
                           comparison_player: Optional[str] = None,
                           comparison_season: Optional[int] = None,
                           features: Optional[List[str]] = None) -> go.Figure:
        """
        Create a radar chart for player comparison.

        Args:
            player_name: Primary player name
            season: Primary player season
            comparison_player: Optional comparison player name
            comparison_season: Comparison player season
            features: Features to include in chart

        Returns:
            Plotly Figure object
        """
        features = features or ['PTS', 'TRB', 'AST', 'STL', 'BLK', 'TS_PCT']
        features = [f for f in features if f in self.data.columns]

        # Get percentiles for primary player
        player1_pct = self.calculate_percentiles(player_name, season, features)

        fig = go.Figure()

        # Add primary player
        fig.add_trace(go.Scatterpolar(
            r=list(player1_pct.values()),
            theta=list(player1_pct.keys()),
            fill='toself',
            name=f"{player_name} ({season})" if season else player_name,
            line=dict(color='#1f77b4', width=2)
        ))

        # Add comparison player if provided
        if comparison_player:
            player2_pct = self.calculate_percentiles(
                comparison_player, comparison_season, features
            )
            fig.add_trace(go.Scatterpolar(
                r=list(player2_pct.values()),
                theta=list(player2_pct.keys()),
                fill='toself',
                name=f"{comparison_player} ({comparison_season})" if comparison_season else comparison_player,
                line=dict(color='#ff7f0e', width=2)
            ))

        fig.update_layout(
            polar=dict(
                radialaxis=dict(
                    visible=True,
                    range=[0, 100],
                    tickfont=dict(size=10)
                )
            ),
            showlegend=True,
            title=dict(
                text="Player Comparison (Percentile Rankings)",
                x=0.5
            ),
            legend=dict(
                yanchor="top",
                y=1.1,
                xanchor="center",
                x=0.5,
                orientation="h"
            )
        )

        return fig

    def create_similarity_matrix(self,
                                 player_names: List[str],
                                 seasons: Optional[List[int]] = None,
                                 method: str = 'euclidean') -> Tuple[np.ndarray, go.Figure]:
        """
        Create a similarity matrix and heatmap for multiple players.

        Args:
            player_names: List of player names
            seasons: Optional list of seasons (same length as player_names)
            method: Similarity method

        Returns:
            Tuple of (similarity matrix, Plotly Figure)
        """
        n_players = len(player_names)
        if seasons is None:
            seasons = [None] * n_players

        # Get indices for all players
        indices = []
        labels = []
        for name, season in zip(player_names, seasons):
            mask = self.data['Player'] == name
            if season is not None:
                mask &= self.data['Season'] == season
            if mask.any():
                indices.append(self.data[mask].index[0])
                labels.append(f"{name} ({season})" if season else name)

        # Prepare features
        self.prepare_features()

        # Calculate similarity matrix
        n = len(indices)
        sim_matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i == j:
                    sim_matrix[i, j] = 1.0
                elif i < j:
                    sim = self.calculate_similarity(indices[i], indices[j], method)
                    sim_matrix[i, j] = sim
                    sim_matrix[j, i] = sim

        # Create heatmap
        fig = go.Figure(data=go.Heatmap(
            z=sim_matrix,
            x=labels,
            y=labels,
            colorscale='RdYlGn',
            zmin=0,
            zmax=1,
            text=np.round(sim_matrix, 2),
            texttemplate='%{text}',
            textfont={"size": 10},
            hovertemplate='%{y} vs %{x}<br>Similarity: %{z:.3f}<extra></extra>'
        ))

        fig.update_layout(
            title="Player Similarity Matrix",
            width=600,
            height=600,
            xaxis=dict(tickangle=45)
        )

        return sim_matrix, fig


# Example usage
if __name__ == "__main__":
    # Create sample data for demonstration
    sample_data = pd.DataFrame({
        'Player': ['LeBron James', 'Kevin Durant', 'Stephen Curry',
                   'Giannis Antetokounmpo', 'Luka Doncic'] * 3,
        'Season': [2022, 2022, 2022, 2022, 2022,
                   2023, 2023, 2023, 2023, 2023,
                   2024, 2024, 2024, 2024, 2024],
        'G': [55, 58, 56, 63, 66] * 3,
        'MP': [35.5, 36.0, 34.7, 32.1, 36.2] * 3,
        'PTS': [28.9, 29.1, 29.4, 31.1, 32.4] * 3,
        'TRB': [8.3, 6.7, 6.1, 11.6, 8.8] * 3,
        'AST': [6.8, 5.0, 6.3, 5.7, 8.0] * 3,
        'STL': [0.9, 0.7, 0.9, 0.8, 1.4] * 3,
        'BLK': [0.6, 1.4, 0.4, 0.8, 0.5] * 3,
        'TOV': [3.1, 3.3, 3.2, 3.3, 3.6] * 3,
        'FG_PCT': [0.500, 0.529, 0.493, 0.553, 0.496] * 3,
        'FG3_PCT': [0.321, 0.404, 0.427, 0.275, 0.342] * 3,
        'FT_PCT': [0.723, 0.910, 0.915, 0.645, 0.760] * 3,
        'TS_PCT': [0.580, 0.660, 0.670, 0.610, 0.600] * 3,
        'USG_PCT': [31.5, 30.2, 32.1, 37.4, 36.8] * 3,
        'PER': [26.2, 26.8, 24.2, 32.1, 28.4] * 3,
        'BPM': [7.2, 6.1, 5.8, 11.0, 7.8] * 3,
        'WS': [8.1, 9.2, 7.8, 12.1, 10.2] * 3
    })

    # Initialize tool
    tool = PlayerComparisonTool(data=sample_data)

    # Find similar players
    print("Finding players similar to LeBron James (2023)...")
    similar = tool.find_similar_players("LeBron James", season=2023, n=3)
    print(similar[['Player', 'Season', 'Similarity']])

    print("\nLeBron James Percentiles:")
    pct = tool.calculate_percentiles("LeBron James", season=2023)
    for stat, value in pct.items():
        print(f"  {stat}: {value}th percentile")

    print("\nLeBron James Z-Scores:")
    zscores = tool.calculate_zscores("LeBron James", season=2023)
    for stat, value in zscores.items():
        print(f"  {stat}: {value:+.2f}")

Deployment Options

Option 1: Local Streamlit Application

The simplest deployment method for personal use or demos:

streamlit run code/player_comparison.py

Option 2: Streamlit Cloud

Free hosting for public applications:

  1. Push your code to a GitHub repository
  2. Connect to Streamlit Cloud
  3. Deploy with one click

Limitations: 1GB memory, public repository required for free tier

Option 3: Docker Container

For production deployments or team sharing:

FROM python:3.9-slim

WORKDIR /app

COPY code/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8501

CMD ["streamlit", "run", "player_comparison.py", "--server.port=8501"]

Option 4: Cloud Platform Deployment

For scalable production applications:

  • AWS: EC2, ECS, or Lambda + API Gateway
  • Google Cloud: Cloud Run or App Engine
  • Azure: App Service or Container Instances

Option 5: Static Export

For documentation or reports, export visualizations as HTML or images:

fig.write_html("player_comparison.html")
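# Static image export requires the kaleido package (pip install kaleido)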
fig.write_image("player_comparison.png")

Extension Ideas

Beginner Extensions

  1. Add more similarity algorithms: Implement Manhattan distance or Minkowski distance
  2. Position filtering: Allow filtering comparisons by position
  3. Export functionality: Add CSV/PDF export of comparison results
  4. Historical comparison: Compare a current player to all-time greats

Intermediate Extensions

  1. Weighted similarity: Allow users to weight certain stats more heavily (see the sketch after this list)
  2. Career arc comparison: Compare career trajectories, not just single seasons
  3. Cluster analysis: Use K-means to identify player archetypes
  4. Player trajectory prediction: Predict future stats based on similar players
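
As a starting point for the weighted-similarity idea, here is a sketch extending the Euclidean similarity from Step 3 with per-feature weights (the weights argument is this sketch's own addition):

import numpy as np

def weighted_euclidean_similarity(p1: np.ndarray, p2: np.ndarray,
                                  weights: np.ndarray) -> float:
    """Euclidean similarity with user-supplied per-feature weights."""
    w = weights / weights.sum()               # normalize weights to sum to 1
    distance = np.sqrt(np.sum(w * (p1 - p2) ** 2))
    return 1 / (1 + distance)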

Advanced Extensions

  1. Play-by-play integration: Include player tracking data for more detailed comparisons
  2. Natural language search: "Find me a player like prime Kobe but with better three-point shooting"
  3. Real-time updates: Connect to live NBA Stats API for current season data
  4. Machine learning models: Train models to predict player compatibility or trade value
  5. Multi-sport expansion: Generalize the architecture for other sports

Academic Extensions

  1. Research paper replication: Implement published player similarity research
  2. Novel metrics development: Create new composite statistics for comparison
  3. Bias analysis: Study how the tool's recommendations might embed historical biases
  4. Uncertainty quantification: Add confidence intervals to similarity scores

Testing Your Implementation

Unit Tests

Create comprehensive tests for core functionality:

# tests/test_similarity.py
import pytest
import numpy as np
from code.similarity import euclidean_similarity, cosine_similarity

def test_euclidean_similarity_identical():
    """Identical vectors should have similarity of 1."""
    v1 = np.array([1.0, 2.0, 3.0])
    v2 = np.array([1.0, 2.0, 3.0])
    assert euclidean_similarity(v1, v2) == 1.0

def test_euclidean_similarity_different():
    """Different vectors should have similarity less than 1."""
    v1 = np.array([1.0, 2.0, 3.0])
    v2 = np.array([4.0, 5.0, 6.0])
    sim = euclidean_similarity(v1, v2)
    assert 0 < sim < 1

def test_cosine_similarity_orthogonal():
    """Orthogonal vectors should have cosine similarity of 0."""
    v1 = np.array([1.0, 0.0])
    v2 = np.array([0.0, 1.0])
    assert cosine_similarity(v1, v2) == pytest.approx(0.0)

def test_cosine_similarity_parallel():
    """Parallel vectors should have cosine similarity of 1."""
    v1 = np.array([1.0, 2.0, 3.0])
    v2 = np.array([2.0, 4.0, 6.0])
    assert cosine_similarity(v1, v2) == pytest.approx(1.0)

Integration Tests

Test the complete workflow:

# tests/test_integration.py
# Assumes a `sample_data` fixture (e.g., in conftest.py) that returns a DataFrame
from code.player_comparison import PlayerComparisonTool

def test_full_workflow(sample_data):
    """Test complete comparison workflow."""
    tool = PlayerComparisonTool(data=sample_data)

    # Load and prepare data
    tool.prepare_features()

    # Find similar players
    similar = tool.find_similar_players("LeBron James", n=3)

    assert len(similar) == 3
    assert 'Similarity' in similar.columns
    assert all(0 <= s <= 1 for s in similar['Similarity'])
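
To measure progress against NFR-3 (coverage above 80%), the pytest-cov plugin reports per-file coverage:

pip install pytest-cov
pytest --cov=code --cov-report=term-missing tests/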

Troubleshooting Common Issues

Issue: "Player not found" error

Cause: Player name doesn't match exactly (spacing, special characters, suffixes)

Solution:

# Normalize player names
df['Player'] = df['Player'].str.strip()
# Handle suffixes
df['Player'] = df['Player'].str.replace(r'\s*\*$', '', regex=True)

Issue: Similarity scores all very close

Cause: Feature scaling issues, or too many features diluting the distances (a symptom of high dimensionality)

Solution:

  • Reduce the number of features
  • Verify StandardScaler is being applied
  • Consider using feature selection or PCA

Issue: Memory error with large datasets

Cause: Calculating all pairwise similarities is O(n^2)

Solution:

  • Use approximate nearest neighbors (Annoy, FAISS), or an exact tree-based search (see the sketch below)
  • Calculate similarities in batches
  • Use sparse matrices where possible
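
Before reaching for approximate-search libraries, scikit-learn's exact NearestNeighbors (already in this project's stack) is often fast enough. A sketch, assuming the normalized arrays from Step 6:

from sklearn.neighbors import NearestNeighbors

# Fit once on the full normalized matrix, then query cheaply per player
nn = NearestNeighbors(n_neighbors=11, metric='euclidean')  # target + 10 matches
nn.fit(normalized_features)
distances, indices = nn.kneighbors(target_normalized.reshape(1, -1))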

Issue: Slow performance

Cause: Pure Python loops for similarity calculations

Solution:

  • Use NumPy/SciPy vectorized operations (see the sketch below)
  • Pre-compute the similarity matrix
  • Cache frequently accessed results
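
For example, a single vectorized cdist call (already imported in the core implementation) replaces the per-row Python loop from Step 6:

from scipy.spatial.distance import cdist

# Distances from the target to every row at once, then map to similarities
distances = cdist(target_normalized.reshape(1, -1), normalized_features,
                  metric='euclidean')[0]
similarities = 1 / (1 + distances)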


Project Rubric

Use this rubric to evaluate your implementation:

| Criterion | Excellent (4) | Good (3) | Satisfactory (2) | Needs Work (1) |
|-----------|---------------|----------|------------------|----------------|
| Code Quality | Clean, well-documented, follows PEP8 | Minor issues, mostly documented | Some documentation, inconsistent style | Poor documentation, messy code |
| Functionality | All features work, handles edge cases | Most features work | Basic features work | Significant bugs |
| Similarity Algorithms | 3+ algorithms, well-implemented | 2 algorithms working | 1 algorithm working | Algorithms incorrect |
| Visualizations | Interactive, polished, informative | Good visualizations | Basic charts | Visualizations broken |
| UI/UX | Intuitive, accessible, responsive | Good usability | Functional but basic | Difficult to use |
| Testing | >80% coverage, thorough tests | >60% coverage | Some tests | No tests |
| Documentation | Comprehensive README, docstrings | Good documentation | Basic README | No documentation |
| Extension | Implemented 2+ creative extensions | 1 extension | Attempted extension | No extensions |

Conclusion

Congratulations on completing this capstone project! You have built a comprehensive player comparison tool that demonstrates skills in:

  • Data Engineering: Loading, cleaning, and preprocessing sports statistics
  • Statistical Analysis: Implementing similarity metrics, percentile rankings, and z-scores
  • Machine Learning: Applying distance metrics and normalization techniques
  • Data Visualization: Creating informative, interactive charts
  • Software Engineering: Structuring a maintainable, testable codebase
  • Product Development: Designing a user-friendly interface

This project serves as an excellent portfolio piece that showcases your ability to apply data science techniques to real-world problems. The basketball domain knowledge combined with technical implementation makes this project relevant for both sports analytics roles and general data science positions.

Next Steps

  1. Deploy your application using one of the deployment options discussed
  2. Add your own extensions to make the tool unique
  3. Document your work in a blog post or portfolio site
  4. Contribute improvements back to the community
  5. Apply these techniques to other domains that interest you

Remember, the best portfolio projects are ones you continue to improve and that showcase your genuine interests. Use this foundation to build something that excites you!


References

  1. Basketball Reference. (2024). Basketball Statistics and History. https://www.basketball-reference.com/
  2. NBA Advanced Stats. (2024). https://www.nba.com/stats/
  3. Shea, S. M., & Baker, C. E. (2013). Basketball Analytics: Objective and Efficient Strategies for Understanding How Teams Win. CreateSpace.
  4. Oliver, D. (2004). Basketball on Paper: Rules and Tools for Performance Analysis. Potomac Books.
  5. scikit-learn documentation. (2024). Preprocessing data. https://scikit-learn.org/stable/modules/preprocessing.html
  6. Plotly Python Documentation. (2024). https://plotly.com/python/
  7. Streamlit Documentation. (2024). https://docs.streamlit.io/