
Capstone Project 1: Build a Player Comparison Tool

Introduction

One of the most common questions in basketball discussions is: "Who does this player remind you of?" Whether scouts are evaluating draft prospects, front offices are assessing free agent targets, or fans are debating player legacies, the ability to systematically compare players is invaluable. This capstone project guides you through building a comprehensive player comparison tool that leverages statistical similarity algorithms, percentile rankings, and interactive visualizations.

By the end of this project, you will have created a portfolio-worthy application that demonstrates proficiency in data engineering, statistical analysis, machine learning concepts, and data visualization. The tool will allow users to find statistically similar players across different eras, compare player strengths and weaknesses, and explore the landscape of playing styles in the NBA.


Project Overview

What You Will Build

The Player Comparison Tool is a Python-based application that:

  1. Loads and processes NBA player statistics from multiple data sources
  2. Calculates statistical similarity between players using multiple algorithms
  3. Generates percentile rankings for contextual comparisons
  4. Computes z-scores for standardized statistical analysis
  5. Creates interactive visualizations including radar charts, scatter plots, and similarity matrices
  6. Provides a web-based dashboard for exploring player comparisons

Learning Objectives

Upon completing this project, you will be able to:

  • Design and implement ETL pipelines for sports statistics
  • Apply similarity metrics (Euclidean distance, cosine similarity, Mahalanobis distance) to real-world data
  • Create meaningful statistical normalizations using percentile rankings and z-scores
  • Build interactive data visualizations with Plotly
  • Deploy a Streamlit dashboard for end-user interaction
  • Handle data quality issues common in sports analytics
  • Structure a data science project for maintainability and extensibility

Why This Project Matters

Player comparison tools are used throughout the basketball industry:

  • Scouting departments use similarity scores to identify draft prospects who project similarly to successful NBA players
  • Front offices evaluate free agents by comparing them to players with known contract values
  • Coaching staffs study tendencies of similar players to develop game plans
  • Media and fans contextualize player performance through historical comparisons

Building this tool demonstrates that you understand both the technical implementation and the domain-specific considerations that make such tools valuable.


Requirements and Specifications

Functional Requirements

| Requirement | Description | Priority |
|-------------|-------------|----------|
| FR-1 | Load player statistics from CSV files or API | Must Have |
| FR-2 | Calculate similarity scores between any two players | Must Have |
| FR-3 | Find the N most similar players to a given player | Must Have |
| FR-4 | Support multiple similarity algorithms | Must Have |
| FR-5 | Generate percentile rankings within customizable peer groups | Must Have |
| FR-6 | Compute z-scores for statistical comparisons | Must Have |
| FR-7 | Create radar chart visualizations for player profiles | Must Have |
| FR-8 | Display similarity matrices as heatmaps | Should Have |
| FR-9 | Filter comparisons by era, position, or other criteria | Should Have |
| FR-10 | Export comparison results to various formats | Could Have |
| FR-11 | Provide a web-based user interface | Should Have |

Non-Functional Requirements

| Requirement | Description | Target |
|-------------|-------------|--------|
| NFR-1 | Response time for similarity calculation | < 2 seconds for a single comparison |
| NFR-2 | Memory usage | < 500 MB for the full dataset |
| NFR-3 | Code test coverage | > 80% |
| NFR-4 | Documentation | All public functions documented |
| NFR-5 | Cross-platform compatibility | Windows, macOS, Linux |

Technical Stack

  • Python 3.9+: Core programming language
  • pandas: Data manipulation and analysis
  • NumPy: Numerical computations
  • scikit-learn: Machine learning utilities and preprocessing
  • SciPy: Statistical functions and distance metrics
  • Plotly: Interactive visualizations
  • Streamlit: Web dashboard framework
  • pytest: Testing framework

Data Sources

Primary Data Sources

Basketball Reference

The most comprehensive source for historical NBA statistics. You can obtain data through:

  1. Manual download: Export CSV files from player season pages
  2. Web scraping: Use libraries like basketball_reference_web_scraper
  3. Pre-compiled datasets: Kaggle hosts several cleaned datasets

NBA Stats API

The official NBA statistics API provides current season data:

# Example endpoint structure
base_url = "https://stats.nba.com/stats/"
endpoint = "leaguedashplayerstats"

Note: The NBA Stats API requires specific headers and has rate limiting.
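
Below is a minimal request sketch. The endpoint expects a fairly large, loosely documented set of query parameters and browser-like headers, so treat the values here as illustrative; in practice, a community wrapper such as the nba_api package handles these details for you.

import requests

# Hedged sketch: header and parameter values are illustrative, not a complete
# or guaranteed set; the API rejects requests it considers incomplete.
base_url = "https://stats.nba.com/stats/leaguedashplayerstats"
headers = {
    "User-Agent": "Mozilla/5.0",        # a browser-like user agent is typically required
    "Referer": "https://www.nba.com/",
}
params = {"Season": "2023-24", "SeasonType": "Regular Season", "PerMode": "PerGame"}

response = requests.get(base_url, headers=headers, params=params, timeout=10)
response.raise_for_status()
payload = response.json()  # tabular data typically sits under resultSets[0]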

Kaggle Datasets

Several well-maintained datasets are available:

  • NBA Players Stats (1950-2023)
  • NBA Advanced Stats
  • NBA Play-by-Play Data

Data Schema

For this project, we will work with the following statistical categories:

Counting Statistics (Per Game)

| Column | Description |
|--------|-------------|
| PTS | Points per game |
| TRB | Total rebounds per game |
| AST | Assists per game |
| STL | Steals per game |
| BLK | Blocks per game |
| TOV | Turnovers per game |
| MP | Minutes per game |

Shooting Statistics

| Column | Description |
|--------|-------------|
| FG_PCT | Field goal percentage |
| FG3_PCT | Three-point percentage |
| FT_PCT | Free throw percentage |
| TS_PCT | True shooting percentage |
| EFG_PCT | Effective field goal percentage |

Advanced Statistics

| Column | Description |
|--------|-------------|
| PER | Player efficiency rating |
| WS | Win shares |
| BPM | Box plus/minus |
| VORP | Value over replacement player |
| USG_PCT | Usage percentage |
| AST_PCT | Assist percentage |
| TRB_PCT | Total rebound percentage |

Data Quality Considerations

When working with basketball statistics, be aware of:

  1. Era differences: The three-point line was introduced in 1979-80
  2. Missing data: Some advanced stats are not available for older seasons
  3. Position changes: Position classifications have evolved over time
  4. Minutes thresholds: Low-minute players can have misleading per-game stats
  5. Lockout seasons: 1998-99 (50 games) and 2011-12 (66 games) were shortened, which deflates totals-based stats (see the sketch after this list)
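
A minimal sketch for point 5, assuming the integer Season column used throughout this guide (where 1999 denotes the 1998-99 season) and the WS column from the data schema:

import pandas as pd

# Schedule length for lockout-shortened seasons (season end-year -> games)
LOCKOUT_GAMES = {1999: 50, 2012: 66}

def prorate_win_shares(df: pd.DataFrame) -> pd.DataFrame:
    """Pro-rate win shares (a totals-based stat) to an 82-game schedule."""
    games = df['Season'].map(LOCKOUT_GAMES).fillna(82)
    out = df.copy()
    out['WS_82'] = out['WS'] * 82 / games
    return out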

Architecture and Design

System Architecture

+-------------------+     +-------------------+     +-------------------+
|                   |     |                   |     |                   |
|   Data Sources    |---->|   Data Loader     |---->|   Data Store      |
|   (CSV, API)      |     |   (ETL Pipeline)  |     |   (pandas DF)     |
|                   |     |                   |     |                   |
+-------------------+     +-------------------+     +-------------------+
                                                            |
                                                            v
+-------------------+     +-------------------+     +-------------------+
|                   |     |                   |     |                   |
|   Dashboard       |<----|   Visualization   |<----|   Similarity      |
|   (Streamlit)     |     |   (Plotly)        |     |   Engine          |
|                   |     |                   |     |                   |
+-------------------+     +-------------------+     +-------------------+

Module Design

The application is organized into four core modules:

  1. data_loader.py: Handles all data ingestion and preprocessing
  2. similarity.py: Implements similarity algorithms and comparison logic
  3. visualization.py: Creates all charts and visual outputs
  4. player_comparison.py: Main application orchestration and Streamlit UI

Design Decisions

Decision 1: Pandas over Database

Choice: Use pandas DataFrames as the primary data store rather than a database.

Rationale:

  • The dataset size (thousands of player-seasons) fits comfortably in memory
  • pandas provides excellent support for the statistical operations we need
  • Simplifies deployment by avoiding database dependencies
  • Allows for rapid prototyping and iteration

Trade-offs:

  • Not suitable if the dataset grows significantly
  • Concurrent write operations would require additional handling

Decision 2: Multiple Similarity Algorithms

Choice: Implement multiple similarity algorithms and let users choose.

Rationale:

  • Different algorithms capture different aspects of similarity
  • Euclidean distance is intuitive but sensitive to scale
  • Cosine similarity captures style regardless of volume
  • Users can validate findings across multiple methods

Decision 3: Configurable Feature Sets

Choice: Allow users to select which statistics to include in comparisons.

Rationale:

  • Different use cases require different features (e.g., scoring vs. all-around)
  • Reduces dimensionality for more interpretable results
  • Enables position-specific comparisons

Decision 4: Z-Score Normalization as Default

Choice: Use z-score normalization before calculating similarity.

Rationale:

  • Puts all statistics on the same scale
  • Accounts for league-wide changes over time when calculated per-season
  • Interpretable (standard deviations from mean)
  • Handles outliers better than min-max scaling


Step-by-Step Implementation Guide

Step 1: Project Setup

Create your project directory structure:

capstone1-player-comparison/
├── code/
│   ├── __init__.py
│   ├── player_comparison.py
│   ├── data_loader.py
│   ├── similarity.py
│   ├── visualization.py
│   └── requirements.txt
├── data/
│   └── (player statistics CSV files)
├── tests/
│   ├── test_data_loader.py
│   ├── test_similarity.py
│   └── test_visualization.py
└── index.md

Install dependencies:

pip install -r code/requirements.txt

Step 2: Data Loading and Preprocessing

The data loader module handles:

  1. Reading data from various sources
  2. Cleaning and validating data
  3. Filtering based on criteria (minutes, games played)
  4. Computing derived statistics

Key preprocessing steps:

import pandas as pd

def preprocess_data(df: pd.DataFrame, min_games: int = 20,
                    min_minutes: float = 10.0) -> pd.DataFrame:
    """
    Preprocess player statistics data.

    Steps:
    1. Filter by minimum games and minutes
    2. Handle missing values
    3. Compute per-possession stats if needed
    4. Add era classification
    """
    # Filter for qualified players (copy to avoid SettingWithCopyWarning)
    df = df[(df['G'] >= min_games) & (df['MP'] >= min_minutes)].copy()

    # Handle missing three-point data for pre-1980 players
    if 'FG3_PCT' in df.columns:
        df['FG3_PCT'] = df['FG3_PCT'].fillna(0)

    # Add era classification
    df['Era'] = pd.cut(df['Season'],
                       bins=[1946, 1980, 2000, 2015, 2030],
                       labels=['Pre-3PT', 'Classic', 'Modern', 'Analytics'])

    return df

Step 3: Implementing Similarity Algorithms

Euclidean Distance

The most intuitive distance metric, measuring the straight-line distance between two points in n-dimensional space:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

import numpy as np

def euclidean_similarity(player1_stats: np.ndarray,
                         player2_stats: np.ndarray) -> float:
    """
    Calculate Euclidean similarity (inverse of distance).
    Returns a value between 0 and 1, where 1 is identical.
    """
    distance = np.sqrt(np.sum((player1_stats - player2_stats) ** 2))
    # Convert distance to similarity (0 to 1 scale)
    similarity = 1 / (1 + distance)
    return similarity

Pros: Intuitive, preserves magnitude differences
Cons: Sensitive to scale, affected by high-volume players

Cosine Similarity

Measures the cosine of the angle between two vectors, capturing similarity in direction regardless of magnitude:

$$\cos(\theta) = \frac{A \cdot B}{||A|| \times ||B||}$$

def cosine_similarity(player1_stats: np.ndarray,
                      player2_stats: np.ndarray) -> float:
    """
    Calculate cosine similarity between two player stat vectors.
    Returns a value between -1 and 1, where 1 is identical direction.
    """
    dot_product = np.dot(player1_stats, player2_stats)
    norm1 = np.linalg.norm(player1_stats)
    norm2 = np.linalg.norm(player2_stats)

    if norm1 == 0 or norm2 == 0:
        return 0.0

    return dot_product / (norm1 * norm2)

Pros: Captures playing style regardless of usage; good for role players
Cons: Ignores volume; a 10 PPG scorer could match a 25 PPG scorer

Mahalanobis Distance

Accounts for correlations between variables and differences in variance:

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

def mahalanobis_similarity(player1_stats: np.ndarray,
                           player2_stats: np.ndarray,
                           cov_matrix: np.ndarray) -> float:
    """
    Calculate Mahalanobis-based similarity.
    Accounts for correlations between statistics.
    """
    diff = player1_stats - player2_stats
    try:
        cov_inv = np.linalg.inv(cov_matrix)
        distance = np.sqrt(np.dot(np.dot(diff, cov_inv), diff))
        similarity = 1 / (1 + distance)
        return similarity
    except np.linalg.LinAlgError:
        # Fall back to Euclidean if covariance matrix is singular
        return euclidean_similarity(player1_stats, player2_stats)

Pros: Statistically rigorous, handles correlated features
Cons: Requires an invertible covariance matrix; computationally expensive
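
SciPy also ships this metric, which avoids hand-rolling the matrix algebra. A small sketch with the same inputs as above; note that SciPy expects the inverse covariance matrix, and a pseudo-inverse is a common fallback when the true inverse does not exist:

import numpy as np
from scipy.spatial.distance import mahalanobis

# SciPy's mahalanobis takes the *inverse* covariance matrix (VI)
cov_inv = np.linalg.pinv(cov_matrix)  # pseudo-inverse sidesteps singular matrices
distance = mahalanobis(player1_stats, player2_stats, cov_inv)
similarity = 1 / (1 + distance)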

Step 4: Percentile Rankings

Percentile rankings provide context by showing where a player ranks relative to peers:

from typing import List, Optional

import pandas as pd

def calculate_percentile_rankings(df: pd.DataFrame,
                                  stats: List[str],
                                  group_by: Optional[str] = None) -> pd.DataFrame:
    """
    Calculate percentile rankings for specified statistics.

    Args:
        df: DataFrame with player statistics
        stats: List of statistic columns to rank
        group_by: Optional column to group rankings (e.g., 'Season', 'Position')

    Returns:
        DataFrame with percentile rankings (0-100) for each stat
    """
    result = df.copy()

    for stat in stats:
        col_name = f'{stat}_percentile'
        if group_by:
            result[col_name] = result.groupby(group_by)[stat].transform(
                lambda x: x.rank(pct=True) * 100
            )
        else:
            result[col_name] = result[stat].rank(pct=True) * 100

    return result

Use Cases:

  • Comparing players across different eras (normalize within season)
  • Evaluating positional performance (normalize within position)
  • Creating scouting profiles with percentile bars
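
For example, era-adjusted rankings fall directly out of the group_by parameter. A usage sketch, assuming df is the preprocessed DataFrame from Step 2:

# Percentiles computed within each season, so players from 1985 and 2023
# are each ranked against their own contemporaries
ranked = calculate_percentile_rankings(df, stats=['PTS', 'TRB', 'AST'],
                                       group_by='Season')
print(ranked[['Player', 'Season', 'PTS_percentile']].head())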

Step 5: Z-Score Comparisons

Z-scores standardize statistics to enable direct comparison:

$$z = \frac{x - \mu}{\sigma}$$

def calculate_zscores(df: pd.DataFrame,
                      stats: List[str],
                      group_by: Optional[str] = None) -> pd.DataFrame:
    """
    Calculate z-scores for specified statistics.

    A z-score of 0 means average, +1 means one standard deviation above.
    """
    result = df.copy()

    for stat in stats:
        col_name = f'{stat}_zscore'
        if group_by:
            result[col_name] = result.groupby(group_by)[stat].transform(
                lambda x: (x - x.mean()) / x.std()
            )
        else:
            mean = result[stat].mean()
            std = result[stat].std()
            result[col_name] = (result[stat] - mean) / std

    return result

Interpretation Guide:

| Z-Score | Interpretation |
|---------|----------------|
| +3.0 | Elite (99.9th percentile) |
| +2.0 | Excellent (97.7th percentile) |
| +1.0 | Above average (84th percentile) |
| 0.0 | League average |
| -1.0 | Below average (16th percentile) |
| -2.0 | Poor (2.3rd percentile) |

Step 6: Finding Similar Players

The core function that ties everything together:

from typing import Dict, List, Optional

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def find_similar_players(target_player: str,
                         target_season: int,
                         df: pd.DataFrame,
                         features: List[str],
                         algorithm: str = 'euclidean',
                         n: int = 10,
                         filters: Optional[Dict] = None) -> pd.DataFrame:
    """
    Find the N most similar players to a target player-season.

    Args:
        target_player: Name of the player to compare
        target_season: Season year to use for comparison
        df: DataFrame with all player statistics
        features: List of statistic columns to use for comparison
        algorithm: Similarity algorithm ('euclidean', 'cosine', 'mahalanobis')
        n: Number of similar players to return
        filters: Optional filters (e.g., {'Era': 'Modern', 'Position': 'PG'})

    Returns:
        DataFrame with the N most similar players and their similarity scores
    """
    # Get target player stats
    target_mask = (df['Player'] == target_player) & (df['Season'] == target_season)
    if not target_mask.any():
        raise ValueError(f"Player {target_player} not found for season {target_season}")

    target_stats = df.loc[target_mask, features].values[0]

    # Apply filters
    comparison_df = df.copy()
    if filters:
        for col, value in filters.items():
            comparison_df = comparison_df[comparison_df[col] == value]

    # Normalize features: fit on the comparison pool, then apply to the target
    scaler = StandardScaler()
    normalized_features = scaler.fit_transform(comparison_df[features])
    target_normalized = scaler.transform(target_stats.reshape(1, -1))[0]

    # Calculate similarity of the target to every player in the pool
    if algorithm == 'mahalanobis':
        cov = np.cov(normalized_features.T)  # compute once, not per row

    similarities = []
    for row_stats in normalized_features:
        if algorithm == 'euclidean':
            sim = euclidean_similarity(target_normalized, row_stats)
        elif algorithm == 'cosine':
            sim = cosine_similarity(target_normalized, row_stats)
        elif algorithm == 'mahalanobis':
            sim = mahalanobis_similarity(target_normalized, row_stats, cov)
        else:
            raise ValueError(f"Unknown algorithm: {algorithm}")
        similarities.append(sim)

    comparison_df['Similarity'] = similarities

    # Remove target player and sort
    result = comparison_df[~target_mask].nlargest(n, 'Similarity')

    return result[['Player', 'Season', 'Similarity'] + features]
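
A usage sketch, with an illustrative player-season and column names taken from the data schema above:

# Five stylistic matches for 2016 Stephen Curry among Analytics-era players
similar = find_similar_players(
    target_player='Stephen Curry', target_season=2016, df=df,
    features=['PTS', 'AST', 'FG3_PCT', 'TS_PCT', 'USG_PCT'],
    algorithm='cosine', n=5, filters={'Era': 'Analytics'}
)
print(similar)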

Visualization Dashboard Design

Radar Charts

Radar charts (also called spider charts) excel at displaying multivariate data for player profiles:

from typing import Dict, Optional

import plotly.graph_objects as go

def create_radar_chart(player_data: Dict[str, float],
                       comparison_data: Optional[Dict[str, float]] = None,
                       title: str = "Player Profile") -> go.Figure:
    """
    Create a radar chart comparing one or two players.

    Args:
        player_data: Dictionary of stat names to percentile values (0-100)
        comparison_data: Optional second player for comparison
        title: Chart title

    Returns:
        Plotly Figure object
    """
    categories = list(player_data.keys())

    fig = go.Figure()

    # Add primary player
    fig.add_trace(go.Scatterpolar(
        r=list(player_data.values()),
        theta=categories,
        fill='toself',
        name='Player 1',
        line=dict(color='#1f77b4')
    ))

    # Add comparison player if provided
    if comparison_data:
        fig.add_trace(go.Scatterpolar(
            r=list(comparison_data.values()),
            theta=categories,
            fill='toself',
            name='Player 2',
            line=dict(color='#ff7f0e')
        ))

    fig.update_layout(
        polar=dict(
            radialaxis=dict(
                visible=True,
                range=[0, 100]
            )
        ),
        showlegend=True,
        title=title
    )

    return fig
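
A usage sketch with hypothetical percentile values; each key maps to a 0-100 percentile, matching the chart's radial range:

# Hypothetical guard profile (values are percentiles, not raw stats)
profile = {'PTS': 95.0, 'TRB': 35.0, 'AST': 88.0, 'STL': 72.0, 'BLK': 20.0}
fig = create_radar_chart(profile, title="Hypothetical Guard Profile")
fig.show()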

Similarity Heatmaps

Heatmaps visualize pairwise similarities across multiple players:

def create_similarity_heatmap(similarity_matrix: np.ndarray,
                              player_names: List[str],
                              title: str = "Player Similarity Matrix") -> go.Figure:
    """
    Create a heatmap showing pairwise similarities.
    """
    fig = go.Figure(data=go.Heatmap(
        z=similarity_matrix,
        x=player_names,
        y=player_names,
        colorscale='RdYlGn',
        zmin=0,
        zmax=1,
        text=np.round(similarity_matrix, 2),
        texttemplate='%{text}',
        textfont={"size": 10},
        hoverongaps=False
    ))

    fig.update_layout(
        title=title,
        xaxis_title="Player",
        yaxis_title="Player",
        width=800,
        height=800
    )

    return fig

Statistical Comparison Bar Charts

Side-by-side bar charts for direct statistical comparisons:

def create_comparison_bars(player1_stats: Dict[str, float],
                           player2_stats: Dict[str, float],
                           player1_name: str,
                           player2_name: str) -> go.Figure:
    """
    Create grouped bar chart comparing two players' statistics.
    """
    categories = list(player1_stats.keys())

    fig = go.Figure(data=[
        go.Bar(name=player1_name, x=categories, y=list(player1_stats.values())),
        go.Bar(name=player2_name, x=categories, y=list(player2_stats.values()))
    ])

    fig.update_layout(
        barmode='group',
        title=f'{player1_name} vs {player2_name}',
        xaxis_title='Statistic',
        yaxis_title='Z-Score',
        legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
    )

    return fig

User Interface Considerations

Streamlit Dashboard Layout

The dashboard is organized into logical sections:

  1. Sidebar: Player selection, algorithm choice, feature selection
  2. Main Area: Visualizations and results tables
  3. Expandable Sections: Detailed statistics and methodology explanations
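
A minimal layout sketch, assuming the PlayerComparisonTool class from this guide is importable from the code/ package and that your statistics live in a CSV (the path below is hypothetical):

import pandas as pd
import streamlit as st

from code.player_comparison import PlayerComparisonTool  # module from this guide

@st.cache_data  # cache the load so interactions do not re-read the file
def load_data() -> pd.DataFrame:
    return pd.read_csv("data/player_stats.csv")  # hypothetical path

df = load_data()
tool = PlayerComparisonTool(data=df)

st.title("NBA Player Comparison Tool")

# Sidebar: player selection, algorithm choice, result count
with st.sidebar:
    player = st.selectbox("Player", sorted(df['Player'].unique()))
    method = st.radio("Algorithm", ["euclidean", "cosine", "correlation"])
    n = st.slider("Number of similar players", 1, 25, 10)

# Main area: radar chart and results table
st.plotly_chart(tool.create_radar_chart(player), use_container_width=True)
st.dataframe(tool.find_similar_players(player, n=n, method=method))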

Key UI/UX Principles

  1. Progressive Disclosure: Show summary first, details on demand
  2. Sensible Defaults: Pre-select commonly used features and algorithms
  3. Clear Feedback: Loading indicators and error messages
  4. Mobile Responsiveness: Streamlit handles this automatically

Accessibility Considerations

  • Use colorblind-friendly palettes (avoid red-green only distinctions)
  • Provide text alternatives for all visualizations
  • Ensure sufficient contrast ratios
  • Support keyboard navigation

Complete Core Implementation

Below is the complete implementation for the core player comparison functionality:

"""
Player Comparison Tool - Core Implementation
A comprehensive tool for finding and comparing statistically similar NBA players.
"""

import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cdist
from scipy.stats import percentileofscore
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots


class PlayerComparisonTool:
    """
    Main class for player comparison functionality.

    This class provides methods for:
    - Loading and preprocessing player statistics
    - Calculating similarity scores between players
    - Finding similar players
    - Generating visualizations
    """

    # Default features for comparison
    DEFAULT_FEATURES = [
        'PTS', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
        'FG_PCT', 'FG3_PCT', 'FT_PCT', 'TS_PCT',
        'USG_PCT', 'PER', 'BPM', 'WS'
    ]

    def __init__(self, data: Optional[pd.DataFrame] = None):
        """
        Initialize the PlayerComparisonTool.

        Args:
            data: Optional pre-loaded DataFrame with player statistics
        """
        self.data = data
        self.scaler = StandardScaler()
        self._normalized_data = None
        self._features = None

    def load_data(self, filepath: str, **kwargs) -> pd.DataFrame:
        """
        Load player statistics from a CSV file.

        Args:
            filepath: Path to the CSV file
            **kwargs: Additional arguments passed to pd.read_csv

        Returns:
            Loaded and preprocessed DataFrame
        """
        self.data = pd.read_csv(filepath, **kwargs)
        self._preprocess_data()
        return self.data

    def _preprocess_data(self, min_games: int = 20, min_minutes: float = 10.0):
        """
        Preprocess the loaded data.

        Args:
            min_games: Minimum games played threshold
            min_minutes: Minimum minutes per game threshold
        """
        if self.data is None:
            raise ValueError("No data loaded")

        # Filter for qualified players
        if 'G' in self.data.columns and 'MP' in self.data.columns:
            self.data = self.data[
                (self.data['G'] >= min_games) &
                (self.data['MP'] >= min_minutes)
            ]

        # Handle missing values in shooting percentages
        pct_cols = ['FG_PCT', 'FG3_PCT', 'FT_PCT', 'TS_PCT', 'EFG_PCT']
        for col in pct_cols:
            if col in self.data.columns:
                self.data[col] = self.data[col].fillna(0)

        # Reset index
        self.data = self.data.reset_index(drop=True)

    def prepare_features(self, features: Optional[List[str]] = None) -> np.ndarray:
        """
        Prepare and normalize features for similarity calculations.

        Args:
            features: List of feature columns to use

        Returns:
            Normalized feature array
        """
        if self.data is None:
            raise ValueError("No data loaded")

        self._features = features or self.DEFAULT_FEATURES

        # Filter to available features
        available_features = [f for f in self._features if f in self.data.columns]
        if len(available_features) < len(self._features):
            missing = set(self._features) - set(available_features)
            print(f"Warning: Features not found in data: {missing}")

        self._features = available_features

        # Fill any remaining missing values with column means
        feature_data = self.data[self._features].copy()
        feature_data = feature_data.fillna(feature_data.mean())

        # Normalize
        self._normalized_data = self.scaler.fit_transform(feature_data)

        return self._normalized_data

    def calculate_similarity(self,
                             player1_idx: int,
                             player2_idx: int,
                             method: str = 'euclidean') -> float:
        """
        Calculate similarity between two players.

        Args:
            player1_idx: Index of first player in DataFrame
            player2_idx: Index of second player in DataFrame
            method: Similarity method ('euclidean', 'cosine', 'correlation')

        Returns:
            Similarity score where 1 is most similar; euclidean maps to (0, 1],
            while cosine and correlation can range from -1 to 1
        """
        if self._normalized_data is None:
            self.prepare_features()

        stats1 = self._normalized_data[player1_idx].reshape(1, -1)
        stats2 = self._normalized_data[player2_idx].reshape(1, -1)

        if method == 'euclidean':
            distance = cdist(stats1, stats2, metric='euclidean')[0, 0]
            return 1 / (1 + distance)
        elif method == 'cosine':
            # Cosine distance is 1 - cosine similarity
            cosine_dist = cdist(stats1, stats2, metric='cosine')[0, 0]
            return 1 - cosine_dist
        elif method == 'correlation':
            corr_dist = cdist(stats1, stats2, metric='correlation')[0, 0]
            return 1 - corr_dist
        else:
            raise ValueError(f"Unknown method: {method}")

    def find_similar_players(self,
                             player_name: str,
                             season: Optional[int] = None,
                             n: int = 10,
                             method: str = 'euclidean',
                             features: Optional[List[str]] = None,
                             filters: Optional[Dict] = None) -> pd.DataFrame:
        """
        Find the N most similar players to a given player.

        Args:
            player_name: Name of the target player
            season: Optional season year to filter target player
            n: Number of similar players to return
            method: Similarity method
            features: Optional list of features to use
            filters: Optional filters to apply (e.g., {'Position': 'PG'})

        Returns:
            DataFrame with similar players and their similarity scores
        """
        if self.data is None:
            raise ValueError("No data loaded")

        # Find target player
        mask = self.data['Player'] == player_name
        if season is not None:
            mask &= self.data['Season'] == season

        if not mask.any():
            raise ValueError(f"Player '{player_name}' not found" +
                           (f" for season {season}" if season else ""))

        target_idx = self.data[mask].index[0]

        # Prepare features
        self.prepare_features(features)

        # Apply filters
        comparison_mask = pd.Series([True] * len(self.data))
        if filters:
            for col, value in filters.items():
                if col in self.data.columns:
                    comparison_mask &= self.data[col] == value

        # Calculate all similarities
        similarities = []
        for idx in range(len(self.data)):
            if comparison_mask.iloc[idx] and idx != target_idx:
                sim = self.calculate_similarity(target_idx, idx, method)
                similarities.append((idx, sim))

        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)
        top_n = similarities[:n]

        # Build result DataFrame
        result_indices = [idx for idx, _ in top_n]
        result = self.data.iloc[result_indices].copy()
        result['Similarity'] = [sim for _, sim in top_n]

        # Reorder columns
        cols = ['Player', 'Season', 'Similarity'] + \
               [c for c in self._features if c in result.columns]
        result = result[[c for c in cols if c in result.columns]]

        return result.reset_index(drop=True)

    def calculate_percentiles(self,
                              player_name: str,
                              season: Optional[int] = None,
                              features: Optional[List[str]] = None,
                              comparison_group: Optional[pd.DataFrame] = None) -> Dict[str, float]:
        """
        Calculate percentile rankings for a player.

        Args:
            player_name: Name of the player
            season: Optional season year
            features: Features to calculate percentiles for
            comparison_group: Optional subset of data for comparison

        Returns:
            Dictionary of feature names to percentile values (0-100)
        """
        if self.data is None:
            raise ValueError("No data loaded")

        features = features or self.DEFAULT_FEATURES
        features = [f for f in features if f in self.data.columns]

        # Find player
        mask = self.data['Player'] == player_name
        if season is not None:
            mask &= self.data['Season'] == season

        if not mask.any():
            raise ValueError(f"Player '{player_name}' not found")

        player_row = self.data[mask].iloc[0]
        comparison_df = comparison_group if comparison_group is not None else self.data

        percentiles = {}
        for feature in features:
            value = player_row[feature]
            pct = percentileofscore(comparison_df[feature].dropna(), value)
            percentiles[feature] = round(pct, 1)

        return percentiles

    def calculate_zscores(self,
                          player_name: str,
                          season: Optional[int] = None,
                          features: Optional[List[str]] = None,
                          comparison_group: Optional[pd.DataFrame] = None) -> Dict[str, float]:
        """
        Calculate z-scores for a player.

        Args:
            player_name: Name of the player
            season: Optional season year
            features: Features to calculate z-scores for
            comparison_group: Optional subset of data for comparison

        Returns:
            Dictionary of feature names to z-score values
        """
        if self.data is None:
            raise ValueError("No data loaded")

        features = features or self.DEFAULT_FEATURES
        features = [f for f in features if f in self.data.columns]

        # Find player
        mask = self.data['Player'] == player_name
        if season is not None:
            mask &= self.data['Season'] == season

        if not mask.any():
            raise ValueError(f"Player '{player_name}' not found")

        player_row = self.data[mask].iloc[0]
        comparison_df = comparison_group if comparison_group is not None else self.data

        zscores = {}
        for feature in features:
            value = player_row[feature]
            mean = comparison_df[feature].mean()
            std = comparison_df[feature].std()
            if std > 0:
                zscores[feature] = round((value - mean) / std, 2)
            else:
                zscores[feature] = 0.0

        return zscores

    def create_radar_chart(self,
                           player_name: str,
                           season: Optional[int] = None,
                           comparison_player: Optional[str] = None,
                           comparison_season: Optional[int] = None,
                           features: Optional[List[str]] = None) -> go.Figure:
        """
        Create a radar chart for player comparison.

        Args:
            player_name: Primary player name
            season: Primary player season
            comparison_player: Optional comparison player name
            comparison_season: Comparison player season
            features: Features to include in chart

        Returns:
            Plotly Figure object
        """
        features = features or ['PTS', 'TRB', 'AST', 'STL', 'BLK', 'TS_PCT']
        features = [f for f in features if f in self.data.columns]

        # Get percentiles for primary player
        player1_pct = self.calculate_percentiles(player_name, season, features)

        fig = go.Figure()

        # Add primary player
        fig.add_trace(go.Scatterpolar(
            r=list(player1_pct.values()),
            theta=list(player1_pct.keys()),
            fill='toself',
            name=f"{player_name} ({season})" if season else player_name,
            line=dict(color='#1f77b4', width=2)
        ))

        # Add comparison player if provided
        if comparison_player:
            player2_pct = self.calculate_percentiles(
                comparison_player, comparison_season, features
            )
            fig.add_trace(go.Scatterpolar(
                r=list(player2_pct.values()),
                theta=list(player2_pct.keys()),
                fill='toself',
                name=f"{comparison_player} ({comparison_season})" if comparison_season else comparison_player,
                line=dict(color='#ff7f0e', width=2)
            ))

        fig.update_layout(
            polar=dict(
                radialaxis=dict(
                    visible=True,
                    range=[0, 100],
                    tickfont=dict(size=10)
                )
            ),
            showlegend=True,
            title=dict(
                text="Player Comparison (Percentile Rankings)",
                x=0.5
            ),
            legend=dict(
                yanchor="top",
                y=1.1,
                xanchor="center",
                x=0.5,
                orientation="h"
            )
        )

        return fig

    def create_similarity_matrix(self,
                                 player_names: List[str],
                                 seasons: Optional[List[int]] = None,
                                 method: str = 'euclidean') -> Tuple[np.ndarray, go.Figure]:
        """
        Create a similarity matrix and heatmap for multiple players.

        Args:
            player_names: List of player names
            seasons: Optional list of seasons (same length as player_names)
            method: Similarity method

        Returns:
            Tuple of (similarity matrix, Plotly Figure)
        """
        n_players = len(player_names)
        if seasons is None:
            seasons = [None] * n_players

        # Get indices for all players
        indices = []
        labels = []
        for name, season in zip(player_names, seasons):
            mask = self.data['Player'] == name
            if season is not None:
                mask &= self.data['Season'] == season
            if mask.any():
                indices.append(self.data[mask].index[0])
                labels.append(f"{name} ({season})" if season else name)

        # Prepare features
        self.prepare_features()

        # Calculate similarity matrix
        n = len(indices)
        sim_matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i == j:
                    sim_matrix[i, j] = 1.0
                elif i < j:
                    sim = self.calculate_similarity(indices[i], indices[j], method)
                    sim_matrix[i, j] = sim
                    sim_matrix[j, i] = sim

        # Create heatmap
        fig = go.Figure(data=go.Heatmap(
            z=sim_matrix,
            x=labels,
            y=labels,
            colorscale='RdYlGn',
            zmin=0,
            zmax=1,
            text=np.round(sim_matrix, 2),
            texttemplate='%{text}',
            textfont={"size": 10},
            hovertemplate='%{y} vs %{x}<br>Similarity: %{z:.3f}<extra></extra>'
        ))

        fig.update_layout(
            title="Player Similarity Matrix",
            width=600,
            height=600,
            xaxis=dict(tickangle=45)
        )

        return sim_matrix, fig


# Example usage
if __name__ == "__main__":
    # Create sample data for demonstration
    sample_data = pd.DataFrame({
        'Player': ['LeBron James', 'Kevin Durant', 'Stephen Curry',
                   'Giannis Antetokounmpo', 'Luka Doncic'] * 3,
        'Season': [2022, 2022, 2022, 2022, 2022,
                   2023, 2023, 2023, 2023, 2023,
                   2024, 2024, 2024, 2024, 2024],
        'G': [55, 58, 56, 63, 66] * 3,
        'MP': [35.5, 36.0, 34.7, 32.1, 36.2] * 3,
        'PTS': [28.9, 29.1, 29.4, 31.1, 32.4] * 3,
        'TRB': [8.3, 6.7, 6.1, 11.6, 8.8] * 3,
        'AST': [6.8, 5.0, 6.3, 5.7, 8.0] * 3,
        'STL': [0.9, 0.7, 0.9, 0.8, 1.4] * 3,
        'BLK': [0.6, 1.4, 0.4, 0.8, 0.5] * 3,
        'TOV': [3.1, 3.3, 3.2, 3.3, 3.6] * 3,
        'FG_PCT': [0.500, 0.529, 0.493, 0.553, 0.496] * 3,
        'FG3_PCT': [0.321, 0.404, 0.427, 0.275, 0.342] * 3,
        'FT_PCT': [0.723, 0.910, 0.915, 0.645, 0.760] * 3,
        'TS_PCT': [0.580, 0.660, 0.670, 0.610, 0.600] * 3,
        'USG_PCT': [31.5, 30.2, 32.1, 37.4, 36.8] * 3,
        'PER': [26.2, 26.8, 24.2, 32.1, 28.4] * 3,
        'BPM': [7.2, 6.1, 5.8, 11.0, 7.8] * 3,
        'WS': [8.1, 9.2, 7.8, 12.1, 10.2] * 3
    })

    # Initialize tool
    tool = PlayerComparisonTool(data=sample_data)

    # Find similar players
    print("Finding players similar to LeBron James (2023)...")
    similar = tool.find_similar_players("LeBron James", season=2023, n=3)
    print(similar[['Player', 'Season', 'Similarity']])

    print("\nLeBron James Percentiles:")
    pct = tool.calculate_percentiles("LeBron James", season=2023)
    for stat, value in pct.items():
        print(f"  {stat}: {value}th percentile")

    print("\nLeBron James Z-Scores:")
    zscores = tool.calculate_zscores("LeBron James", season=2023)
    for stat, value in zscores.items():
        print(f"  {stat}: {value:+.2f}")

Deployment Options

Option 1: Local Streamlit Application

The simplest deployment method for personal use or demos:

streamlit run code/player_comparison.py

Option 2: Streamlit Cloud

Free hosting for public applications:

  1. Push your code to a GitHub repository
  2. Connect to Streamlit Cloud
  3. Deploy with one click

Limitations: 1GB memory, public repository required for free tier

Option 3: Docker Container

For production deployments or team sharing:

FROM python:3.9-slim

WORKDIR /app

COPY code/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8501

CMD ["streamlit", "run", "player_comparison.py", "--server.port=8501"]

Option 4: Cloud Platform Deployment

For scalable production applications:

  • AWS: EC2, ECS, or Lambda + API Gateway
  • Google Cloud: Cloud Run or App Engine
  • Azure: App Service or Container Instances

Option 5: Static Export

For documentation or reports, export visualizations as HTML or images:

fig.write_html("player_comparison.html")
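# Static image export requires the kaleido package (pip install kaleido)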
fig.write_image("player_comparison.png")

Extension Ideas

Beginner Extensions

  1. Add more similarity algorithms: Implement Manhattan distance or Minkowski distance
  2. Position filtering: Allow filtering comparisons by position
  3. Export functionality: Add CSV/PDF export of comparison results
  4. Historical comparison: Compare a current player to all-time greats

Intermediate Extensions

  1. Weighted similarity: Allow users to weight certain stats more heavily (see the sketch after this list)
  2. Career arc comparison: Compare career trajectories, not just single seasons
  3. Cluster analysis: Use K-means to identify player archetypes
  4. Player trajectory prediction: Predict future stats based on similar players
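
As a starting point for the weighted-similarity idea, here is a sketch extending the Euclidean similarity from Step 3 with per-feature weights (the weights argument is this sketch's own addition):

import numpy as np

def weighted_euclidean_similarity(p1: np.ndarray, p2: np.ndarray,
                                  weights: np.ndarray) -> float:
    """Euclidean similarity with user-supplied per-feature weights."""
    w = weights / weights.sum()               # normalize weights to sum to 1
    distance = np.sqrt(np.sum(w * (p1 - p2) ** 2))
    return 1 / (1 + distance)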

Advanced Extensions

  1. Play-by-play integration: Include player tracking data for more detailed comparisons
  2. Natural language search: "Find me a player like prime Kobe but with better three-point shooting"
  3. Real-time updates: Connect to live NBA Stats API for current season data
  4. Machine learning models: Train models to predict player compatibility or trade value
  5. Multi-sport expansion: Generalize the architecture for other sports

Academic Extensions

  1. Research paper replication: Implement published player similarity research
  2. Novel metrics development: Create new composite statistics for comparison
  3. Bias analysis: Study how the tool's recommendations might embed historical biases
  4. Uncertainty quantification: Add confidence intervals to similarity scores

Testing Your Implementation

Unit Tests

Create comprehensive tests for core functionality:

# tests/test_similarity.py
import pytest
import numpy as np
from code.similarity import euclidean_similarity, cosine_similarity

def test_euclidean_similarity_identical():
    """Identical vectors should have similarity of 1."""
    v1 = np.array([1.0, 2.0, 3.0])
    v2 = np.array([1.0, 2.0, 3.0])
    assert euclidean_similarity(v1, v2) == 1.0

def test_euclidean_similarity_different():
    """Different vectors should have similarity less than 1."""
    v1 = np.array([1.0, 2.0, 3.0])
    v2 = np.array([4.0, 5.0, 6.0])
    sim = euclidean_similarity(v1, v2)
    assert 0 < sim < 1

def test_cosine_similarity_orthogonal():
    """Orthogonal vectors should have cosine similarity of 0."""
    v1 = np.array([1.0, 0.0])
    v2 = np.array([0.0, 1.0])
    assert cosine_similarity(v1, v2) == pytest.approx(0.0)

def test_cosine_similarity_parallel():
    """Parallel vectors should have cosine similarity of 1."""
    v1 = np.array([1.0, 2.0, 3.0])
    v2 = np.array([2.0, 4.0, 6.0])
    assert cosine_similarity(v1, v2) == pytest.approx(1.0)

Integration Tests

Test the complete workflow:

# tests/test_integration.py
# Assumes a `sample_data` fixture (e.g., in conftest.py) that returns a DataFrame
from code.player_comparison import PlayerComparisonTool

def test_full_workflow(sample_data):
    """Test complete comparison workflow."""
    tool = PlayerComparisonTool(data=sample_data)

    # Load and prepare data
    tool.prepare_features()

    # Find similar players
    similar = tool.find_similar_players("LeBron James", n=3)

    assert len(similar) == 3
    assert 'Similarity' in similar.columns
    assert all(0 <= s <= 1 for s in similar['Similarity'])
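
To measure progress against NFR-3 (coverage above 80%), the pytest-cov plugin reports per-file coverage:

pip install pytest-cov
pytest --cov=code --cov-report=term-missing tests/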

Troubleshooting Common Issues

Issue: "Player not found" error

Cause: Player name doesn't match exactly (spacing, special characters, suffixes)

Solution:

# Normalize player names
df['Player'] = df['Player'].str.strip()
# Handle suffixes
df['Player'] = df['Player'].str.replace(r'\s*\*$', '', regex=True)

Issue: Similarity scores all very close

Cause: Feature scaling issues, or too many features diluting the distances (a symptom of high dimensionality)

Solution:

  • Reduce the number of features
  • Verify StandardScaler is being applied
  • Consider using feature selection or PCA

Issue: Memory error with large datasets

Cause: Calculating all pairwise similarities is O(n^2)

Solution:

  • Use approximate nearest neighbors (Annoy, FAISS), or an exact tree-based search (see the sketch below)
  • Calculate similarities in batches
  • Use sparse matrices where possible
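
Before reaching for approximate-search libraries, scikit-learn's exact NearestNeighbors (already in this project's stack) is often fast enough. A sketch, assuming the normalized arrays from Step 6:

from sklearn.neighbors import NearestNeighbors

# Fit once on the full normalized matrix, then query cheaply per player
nn = NearestNeighbors(n_neighbors=11, metric='euclidean')  # target + 10 matches
nn.fit(normalized_features)
distances, indices = nn.kneighbors(target_normalized.reshape(1, -1))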

Issue: Slow performance

Cause: Pure Python loops for similarity calculations

Solution:

  • Use NumPy/SciPy vectorized operations (see the sketch below)
  • Pre-compute the similarity matrix
  • Cache frequently accessed results
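
For example, a single vectorized cdist call (already imported in the core implementation) replaces the per-row Python loop from Step 6:

from scipy.spatial.distance import cdist

# Distances from the target to every row at once, then map to similarities
distances = cdist(target_normalized.reshape(1, -1), normalized_features,
                  metric='euclidean')[0]
similarities = 1 / (1 + distances)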


Project Rubric

Use this rubric to evaluate your implementation:

| Criterion | Excellent (4) | Good (3) | Satisfactory (2) | Needs Work (1) |
|-----------|---------------|----------|------------------|----------------|
| Code Quality | Clean, well-documented, follows PEP8 | Minor issues, mostly documented | Some documentation, inconsistent style | Poor documentation, messy code |
| Functionality | All features work, handles edge cases | Most features work | Basic features work | Significant bugs |
| Similarity Algorithms | 3+ algorithms, well-implemented | 2 algorithms working | 1 algorithm working | Algorithms incorrect |
| Visualizations | Interactive, polished, informative | Good visualizations | Basic charts | Visualizations broken |
| UI/UX | Intuitive, accessible, responsive | Good usability | Functional but basic | Difficult to use |
| Testing | >80% coverage, thorough tests | >60% coverage | Some tests | No tests |
| Documentation | Comprehensive README, docstrings | Good documentation | Basic README | No documentation |
| Extension | Implemented 2+ creative extensions | 1 extension | Attempted extension | No extensions |

Conclusion

Congratulations on completing this capstone project! You have built a comprehensive player comparison tool that demonstrates skills in:

  • Data Engineering: Loading, cleaning, and preprocessing sports statistics
  • Statistical Analysis: Implementing similarity metrics, percentile rankings, and z-scores
  • Machine Learning: Applying distance metrics and normalization techniques
  • Data Visualization: Creating informative, interactive charts
  • Software Engineering: Structuring a maintainable, testable codebase
  • Product Development: Designing a user-friendly interface

This project serves as an excellent portfolio piece that showcases your ability to apply data science techniques to real-world problems. The basketball domain knowledge combined with technical implementation makes this project relevant for both sports analytics roles and general data science positions.

Next Steps

  1. Deploy your application using one of the deployment options discussed
  2. Add your own extensions to make the tool unique
  3. Document your work in a blog post or portfolio site
  4. Contribute improvements back to the community
  5. Apply these techniques to other domains that interest you

Remember, the best portfolio projects are ones you continue to improve and that showcase your genuine interests. Use this foundation to build something that excites you!


References

  1. Basketball Reference. (2024). Basketball Statistics and History. https://www.basketball-reference.com/
  2. NBA Advanced Stats. (2024). https://www.nba.com/stats/
  3. Shea, S. M., & Baker, C. E. (2013). Basketball Analytics: Objective and Efficient Strategies for Understanding How Teams Win. CreateSpace.
  4. Oliver, D. (2004). Basketball on Paper: Rules and Tools for Performance Analysis. Potomac Books.
  5. scikit-learn documentation. (2024). Preprocessing data. https://scikit-learn.org/stable/modules/preprocessing.html
  6. Plotly Python Documentation. (2024). https://plotly.com/python/
  7. Streamlit Documentation. (2024). https://docs.streamlit.io/