In This Chapter
- Introduction
- Project Overview
- Requirements and Specifications
- Data Sources
- Architecture and Design
- Step-by-Step Implementation Guide
- Visualization Dashboard Design
- User Interface Considerations
- Complete Core Implementation
- Deployment Options
- Extension Ideas
- Testing Your Implementation
- Troubleshooting Common Issues
- Project Rubric
- Conclusion
- References
Capstone Project 1: Build a Player Comparison Tool
Introduction
One of the most common questions in basketball discussions is: "Who does this player remind you of?" Whether scouts are evaluating draft prospects, front offices are assessing free agent targets, or fans are debating player legacies, the ability to systematically compare players is invaluable. This capstone project guides you through building a comprehensive player comparison tool that leverages statistical similarity algorithms, percentile rankings, and interactive visualizations.
By the end of this project, you will have created a portfolio-worthy application that demonstrates proficiency in data engineering, statistical analysis, machine learning concepts, and data visualization. The tool will allow users to find statistically similar players across different eras, compare player strengths and weaknesses, and explore the landscape of playing styles in the NBA.
Project Overview
What You Will Build
The Player Comparison Tool is a Python-based application that:
- Loads and processes NBA player statistics from multiple data sources
- Calculates statistical similarity between players using multiple algorithms
- Generates percentile rankings for contextual comparisons
- Computes z-scores for standardized statistical analysis
- Creates interactive visualizations including radar charts, scatter plots, and similarity matrices
- Provides a web-based dashboard for exploring player comparisons
Learning Objectives
Upon completing this project, you will be able to:
- Design and implement ETL pipelines for sports statistics
- Apply similarity metrics (Euclidean distance, cosine similarity, Mahalanobis distance) to real-world data
- Create meaningful statistical normalizations using percentile rankings and z-scores
- Build interactive data visualizations with Plotly
- Deploy a Streamlit dashboard for end-user interaction
- Handle data quality issues common in sports analytics
- Structure a data science project for maintainability and extensibility
Why This Project Matters
Player comparison tools are used throughout the basketball industry:
- Scouting departments use similarity scores to identify draft prospects who project similarly to successful NBA players
- Front offices evaluate free agents by comparing them to players with known contract values
- Coaching staffs study tendencies of similar players to develop game plans
- Media and fans contextualize player performance through historical comparisons
Building this tool demonstrates that you understand both the technical implementation and the domain-specific considerations that make such tools valuable.
Requirements and Specifications
Functional Requirements
| Requirement | Description | Priority |
|---|---|---|
| FR-1 | Load player statistics from CSV files or API | Must Have |
| FR-2 | Calculate similarity scores between any two players | Must Have |
| FR-3 | Find the N most similar players to a given player | Must Have |
| FR-4 | Support multiple similarity algorithms | Must Have |
| FR-5 | Generate percentile rankings within customizable peer groups | Must Have |
| FR-6 | Compute z-scores for statistical comparisons | Must Have |
| FR-7 | Create radar chart visualizations for player profiles | Must Have |
| FR-8 | Display similarity matrices as heatmaps | Should Have |
| FR-9 | Filter comparisons by era, position, or other criteria | Should Have |
| FR-10 | Export comparison results to various formats | Could Have |
| FR-11 | Provide a web-based user interface | Should Have |
Non-Functional Requirements
| Requirement | Description | Target |
|---|---|---|
| NFR-1 | Response time for similarity calculation | < 2 seconds for single comparison |
| NFR-2 | Memory usage | < 500 MB for full dataset |
| NFR-3 | Code test coverage | > 80% |
| NFR-4 | Documentation | All public functions documented |
| NFR-5 | Cross-platform compatibility | Windows, macOS, Linux |
Technical Stack
- Python 3.9+: Core programming language
- pandas: Data manipulation and analysis
- NumPy: Numerical computations
- scikit-learn: Machine learning utilities and preprocessing
- SciPy: Statistical functions and distance metrics
- Plotly: Interactive visualizations
- Streamlit: Web dashboard framework
- pytest: Testing framework
Data Sources
Primary Data Sources
Basketball Reference
The most comprehensive source for historical NBA statistics. You can obtain data through:
- Manual download: Export CSV files from player season pages
- Web scraping: Use libraries like basketball_reference_web_scraper
- Pre-compiled datasets: Kaggle hosts several cleaned datasets
NBA Stats API
The official NBA statistics API provides current season data:
# Example endpoint structure
base_url = "https://stats.nba.com/stats/"
endpoint = "leaguedashplayerstats"
Note: The NBA Stats API requires specific headers and has rate limiting.
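A minimal request sketch follows. The exact headers and parameters the API expects are undocumented and change over time, so treat the values below as assumptions rather than a definitive recipe:

```python
import requests

base_url = "https://stats.nba.com/stats/"
endpoint = "leaguedashplayerstats"

# Header values are assumptions: the API tends to reject requests
# that do not look like they come from a browser
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.nba.com/",
}
# Common query parameters; the endpoint may require additional ones
params = {
    "Season": "2023-24",
    "SeasonType": "Regular Season",
    "PerMode": "PerGame",
}
response = requests.get(base_url + endpoint, headers=headers,
                        params=params, timeout=10)
data = response.json()
```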
Kaggle Datasets
Several well-maintained datasets are available:
- NBA Players Stats (1950-2023)
- NBA Advanced Stats
- NBA Play-by-Play Data
Data Schema
For this project, we will work with the following statistical categories:
Counting Statistics (Per Game)
| Column | Description |
|---|---|
| PTS | Points per game |
| TRB | Total rebounds per game |
| AST | Assists per game |
| STL | Steals per game |
| BLK | Blocks per game |
| TOV | Turnovers per game |
| MP | Minutes per game |
Shooting Statistics
| Column | Description |
|---|---|
| FG_PCT | Field goal percentage |
| FG3_PCT | Three-point percentage |
| FT_PCT | Free throw percentage |
| TS_PCT | True shooting percentage |
| EFG_PCT | Effective field goal percentage |
Advanced Statistics
| Column | Description |
|---|---|
| PER | Player efficiency rating |
| WS | Win shares |
| BPM | Box plus/minus |
| VORP | Value over replacement player |
| USG_PCT | Usage percentage |
| AST_PCT | Assist percentage |
| TRB_PCT | Total rebound percentage |
Data Quality Considerations
When working with basketball statistics, be aware of:
- Era differences: The three-point line was introduced in 1979-80
- Missing data: Some advanced stats are not available for older seasons
- Position changes: Position classifications have evolved over time
- Minutes thresholds: Low-minute players can have misleading per-game stats
- Lockout seasons: 1998-99 (50 games) and 2011-12 (66 games) were shortened
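The sketch below shows one way to flag a few of these caveats during preprocessing. It assumes seasons are labeled by their ending year (e.g., 1999 for 1998-99) and that the columns follow the schema above; the thresholds are assumptions to tune:

```python
import pandas as pd

def add_data_quality_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Annotate player-seasons with data-quality flags for downstream filtering."""
    df = df.copy()
    # Seasons before 1979-80 predate the three-point line
    df['pre_three_point_era'] = df['Season'] < 1980
    # Lockout-shortened seasons (1998-99 and 2011-12, labeled by ending year)
    df['lockout_season'] = df['Season'].isin([1999, 2012])
    # Low-minute players can have misleading per-game stats
    df['low_minutes'] = df['MP'] < 10.0
    return df
```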
Architecture and Design
System Architecture
+-------------------+ +-------------------+ +-------------------+
| | | | | |
| Data Sources |---->| Data Loader |---->| Data Store |
| (CSV, API) | | (ETL Pipeline) | | (pandas DF) |
| | | | | |
+-------------------+ +-------------------+ +-------------------+
|
v
+-------------------+ +-------------------+ +-------------------+
| | | | | |
| Dashboard |<----| Visualization |<----| Similarity |
| (Streamlit) | | (Plotly) | | Engine |
| | | | | |
+-------------------+ +-------------------+ +-------------------+
Module Design
The application is organized into four core modules:
- data_loader.py: Handles all data ingestion and preprocessing
- similarity.py: Implements similarity algorithms and comparison logic
- visualization.py: Creates all charts and visual outputs
- player_comparison.py: Main application orchestration and Streamlit UI
Design Decisions
Decision 1: Pandas over Database
Choice: Use pandas DataFrames as the primary data store rather than a database.
Rationale:
- The dataset size (thousands of player-seasons) fits comfortably in memory
- pandas provides excellent support for the statistical operations we need
- Simplifies deployment by avoiding database dependencies
- Allows for rapid prototyping and iteration

Trade-offs:
- Not suitable if the dataset grows significantly
- Concurrent write operations would require additional handling
Decision 2: Multiple Similarity Algorithms
Choice: Implement multiple similarity algorithms and let users choose.
Rationale:
- Different algorithms capture different aspects of similarity
- Euclidean distance is intuitive but sensitive to scale
- Cosine similarity captures style regardless of volume
- Users can validate findings across multiple methods
Decision 3: Configurable Feature Sets
Choice: Allow users to select which statistics to include in comparisons.
Rationale:
- Different use cases require different features (e.g., scoring vs. all-around)
- Reduces dimensionality for more interpretable results
- Enables position-specific comparisons
Decision 4: Z-Score Normalization as Default
Choice: Use z-score normalization before calculating similarity.
Rationale:
- Puts all statistics on the same scale
- Accounts for league-wide changes over time when calculated per-season
- Interpretable (standard deviations from the mean)
- Handles outliers better than min-max scaling
Step-by-Step Implementation Guide
Step 1: Project Setup
Create your project directory structure:
capstone1-player-comparison/
├── code/
│ ├── __init__.py
│ ├── player_comparison.py
│ ├── data_loader.py
│ ├── similarity.py
│ ├── visualization.py
│ └── requirements.txt
├── data/
│ └── (player statistics CSV files)
├── tests/
│ ├── test_data_loader.py
│ ├── test_similarity.py
│ └── test_visualization.py
└── index.md
Install dependencies:
pip install -r code/requirements.txt
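A plausible code/requirements.txt for this stack follows; the version pins are assumptions, so loosen or update them as needed:

```
pandas>=2.0
numpy>=1.24
scikit-learn>=1.3
scipy>=1.10
plotly>=5.15
streamlit>=1.25
pytest>=7.4
```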
Step 2: Data Loading and Preprocessing
The data loader module handles:
- Reading data from various sources
- Cleaning and validating data
- Filtering based on criteria (minutes, games played)
- Computing derived statistics
Key preprocessing steps:
def preprocess_data(df: pd.DataFrame, min_games: int = 20,
                    min_minutes: float = 10.0) -> pd.DataFrame:
    """
    Preprocess player statistics data.

    Steps:
    1. Filter by minimum games and minutes
    2. Handle missing values
    3. Compute per-possession stats if needed
    4. Add era classification
    """
    # Filter for qualified players (copy to avoid SettingWithCopyWarning)
    df = df[(df['G'] >= min_games) & (df['MP'] >= min_minutes)].copy()
    # Handle missing three-point data for pre-1980 players
    if 'FG3_PCT' in df.columns:
        df['FG3_PCT'] = df['FG3_PCT'].fillna(0)
    # Add era classification based on season end year
    df['Era'] = pd.cut(df['Season'],
                       bins=[1946, 1980, 2000, 2015, 2030],
                       labels=['Pre-3PT', 'Classic', 'Modern', 'Analytics'])
    return df
Step 3: Implementing Similarity Algorithms
Euclidean Distance
The most intuitive distance metric, measuring the straight-line distance between two points in n-dimensional space:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
def euclidean_similarity(player1_stats: np.ndarray,
player2_stats: np.ndarray) -> float:
"""
Calculate Euclidean similarity (inverse of distance).
Returns a value between 0 and 1, where 1 is identical.
"""
distance = np.sqrt(np.sum((player1_stats - player2_stats) ** 2))
# Convert distance to similarity (0 to 1 scale)
similarity = 1 / (1 + distance)
return similarity
Pros: Intuitive, preserves magnitude differences
Cons: Sensitive to scale, affected by high-volume players
Cosine Similarity
Measures the cosine of the angle between two vectors, capturing similarity in direction regardless of magnitude:
$$\cos(\theta) = \frac{A \cdot B}{||A|| \times ||B||}$$
def cosine_similarity(player1_stats: np.ndarray,
player2_stats: np.ndarray) -> float:
"""
Calculate cosine similarity between two player stat vectors.
Returns a value between -1 and 1, where 1 is identical direction.
"""
dot_product = np.dot(player1_stats, player2_stats)
norm1 = np.linalg.norm(player1_stats)
norm2 = np.linalg.norm(player2_stats)
if norm1 == 0 or norm2 == 0:
return 0.0
return dot_product / (norm1 * norm2)
Pros: Captures playing style regardless of usage, good for role players
Cons: Ignores volume; a 10 PPG scorer could match a 25 PPG scorer
Mahalanobis Distance
Accounts for correlations between variables and differences in variance:
$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
def mahalanobis_similarity(player1_stats: np.ndarray,
player2_stats: np.ndarray,
cov_matrix: np.ndarray) -> float:
"""
Calculate Mahalanobis-based similarity.
Accounts for correlations between statistics.
"""
diff = player1_stats - player2_stats
try:
cov_inv = np.linalg.inv(cov_matrix)
distance = np.sqrt(np.dot(np.dot(diff, cov_inv), diff))
similarity = 1 / (1 + distance)
return similarity
except np.linalg.LinAlgError:
# Fall back to Euclidean if covariance matrix is singular
return euclidean_similarity(player1_stats, player2_stats)
Pros: Statistically rigorous, handles correlated features
Cons: Requires an invertible covariance matrix, computationally expensive
Step 4: Percentile Rankings
Percentile rankings provide context by showing where a player ranks relative to peers:
def calculate_percentile_rankings(df: pd.DataFrame,
stats: List[str],
group_by: Optional[str] = None) -> pd.DataFrame:
"""
Calculate percentile rankings for specified statistics.
Args:
df: DataFrame with player statistics
stats: List of statistic columns to rank
group_by: Optional column to group rankings (e.g., 'Season', 'Position')
Returns:
DataFrame with percentile rankings (0-100) for each stat
"""
result = df.copy()
for stat in stats:
col_name = f'{stat}_percentile'
if group_by:
result[col_name] = result.groupby(group_by)[stat].transform(
lambda x: x.rank(pct=True) * 100
)
else:
result[col_name] = result[stat].rank(pct=True) * 100
return result
Use Cases:
- Comparing players across different eras (normalize within season)
- Evaluating positional performance (normalize within position)
- Creating scouting profiles with percentile bars
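For example, ranking within each season keeps eras comparable (a sketch assuming df uses the schema above):

```python
# Rank counting stats within each season so eras stay comparable
ranked = calculate_percentile_rankings(df, stats=['PTS', 'TRB', 'AST'],
                                       group_by='Season')
print(ranked[['Player', 'Season', 'PTS_percentile']].head())
```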
Step 5: Z-Score Comparisons
Z-scores standardize statistics to enable direct comparison:
$$z = \frac{x - \mu}{\sigma}$$
def calculate_zscores(df: pd.DataFrame,
stats: List[str],
group_by: Optional[str] = None) -> pd.DataFrame:
"""
Calculate z-scores for specified statistics.
A z-score of 0 means average, +1 means one standard deviation above.
"""
result = df.copy()
for stat in stats:
col_name = f'{stat}_zscore'
if group_by:
result[col_name] = result.groupby(group_by)[stat].transform(
lambda x: (x - x.mean()) / x.std()
)
else:
mean = result[stat].mean()
std = result[stat].std()
result[col_name] = (result[stat] - mean) / std
return result
Interpretation Guide:

| Z-Score | Interpretation |
|---------|----------------|
| +3.0 | Elite (99.9th percentile) |
| +2.0 | Excellent (97.7th percentile) |
| +1.0 | Above average (84th percentile) |
| 0.0 | League average |
| -1.0 | Below average (16th percentile) |
| -2.0 | Poor (2.3rd percentile) |
Step 6: Finding Similar Players
The core function that ties everything together:
def find_similar_players(target_player: str,
target_season: int,
df: pd.DataFrame,
features: List[str],
algorithm: str = 'euclidean',
n: int = 10,
filters: Optional[Dict] = None) -> pd.DataFrame:
"""
Find the N most similar players to a target player-season.
Args:
target_player: Name of the player to compare
target_season: Season year to use for comparison
df: DataFrame with all player statistics
features: List of statistic columns to use for comparison
algorithm: Similarity algorithm ('euclidean', 'cosine', 'mahalanobis')
n: Number of similar players to return
filters: Optional filters (e.g., {'Era': 'Modern', 'Position': 'PG'})
Returns:
DataFrame with the N most similar players and their similarity scores
"""
# Get target player stats
target_mask = (df['Player'] == target_player) & (df['Season'] == target_season)
if not target_mask.any():
raise ValueError(f"Player {target_player} not found for season {target_season}")
target_stats = df.loc[target_mask, features].values[0]
# Apply filters
comparison_df = df.copy()
if filters:
for col, value in filters.items():
comparison_df = comparison_df[comparison_df[col] == value]
# Normalize features
scaler = StandardScaler()
normalized_features = scaler.fit_transform(comparison_df[features])
target_normalized = scaler.transform([target_stats])[0]
    # Calculate similarities (compute the covariance matrix once, not per row)
    cov = np.cov(normalized_features.T) if algorithm == 'mahalanobis' else None
    similarities = []
    for row_stats in normalized_features:
        if algorithm == 'euclidean':
            sim = euclidean_similarity(target_normalized, row_stats)
        elif algorithm == 'cosine':
            sim = cosine_similarity(target_normalized, row_stats)
        elif algorithm == 'mahalanobis':
            sim = mahalanobis_similarity(target_normalized, row_stats, cov)
        else:
            raise ValueError(f"Unknown algorithm: {algorithm}")
        similarities.append(sim)
comparison_df['Similarity'] = similarities
# Remove target player and sort
result = comparison_df[~target_mask].nlargest(n, 'Similarity')
return result[['Player', 'Season', 'Similarity'] + features]
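A typical call might look like the following; the player, season, feature list, and filter values are illustrative:

```python
# Hypothetical query: the ten analytics-era players most similar to 2016 Stephen Curry
similar = find_similar_players(
    target_player='Stephen Curry',
    target_season=2016,
    df=df,
    features=['PTS', 'AST', 'TRB', 'TS_PCT', 'USG_PCT'],
    algorithm='cosine',
    n=10,
    filters={'Era': 'Analytics'}
)
print(similar)
```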
Visualization Dashboard Design
Radar Charts
Radar charts (also called spider charts) excel at displaying multivariate data for player profiles:
def create_radar_chart(player_data: Dict[str, float],
comparison_data: Optional[Dict[str, float]] = None,
title: str = "Player Profile") -> go.Figure:
"""
Create a radar chart comparing one or two players.
Args:
player_data: Dictionary of stat names to percentile values (0-100)
comparison_data: Optional second player for comparison
title: Chart title
Returns:
Plotly Figure object
"""
categories = list(player_data.keys())
fig = go.Figure()
# Add primary player
fig.add_trace(go.Scatterpolar(
r=list(player_data.values()),
theta=categories,
fill='toself',
name='Player 1',
line=dict(color='#1f77b4')
))
# Add comparison player if provided
if comparison_data:
fig.add_trace(go.Scatterpolar(
r=list(comparison_data.values()),
theta=categories,
fill='toself',
name='Player 2',
line=dict(color='#ff7f0e')
))
fig.update_layout(
polar=dict(
radialaxis=dict(
visible=True,
range=[0, 100]
)
),
showlegend=True,
title=title
)
return fig
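Usage is straightforward; the percentile values below are made up for illustration:

```python
fig = create_radar_chart(
    player_data={'PTS': 95, 'TRB': 60, 'AST': 80, 'STL': 70, 'BLK': 40, 'TS_PCT': 90},
    comparison_data={'PTS': 85, 'TRB': 75, 'AST': 55, 'STL': 65, 'BLK': 80, 'TS_PCT': 70},
    title="Guard vs. Forward Profile"
)
fig.show()
```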
Similarity Heatmaps
Heatmaps visualize pairwise similarities across multiple players:
def create_similarity_heatmap(similarity_matrix: np.ndarray,
player_names: List[str],
title: str = "Player Similarity Matrix") -> go.Figure:
"""
Create a heatmap showing pairwise similarities.
"""
fig = go.Figure(data=go.Heatmap(
z=similarity_matrix,
x=player_names,
y=player_names,
colorscale='RdYlGn',
zmin=0,
zmax=1,
text=np.round(similarity_matrix, 2),
texttemplate='%{text}',
textfont={"size": 10},
hoverongaps=False
))
fig.update_layout(
title=title,
xaxis_title="Player",
yaxis_title="Player",
width=800,
height=800
)
return fig
Statistical Comparison Bar Charts
Side-by-side bar charts for direct statistical comparisons:
def create_comparison_bars(player1_stats: Dict[str, float],
player2_stats: Dict[str, float],
player1_name: str,
player2_name: str) -> go.Figure:
"""
Create grouped bar chart comparing two players' statistics.
"""
categories = list(player1_stats.keys())
fig = go.Figure(data=[
go.Bar(name=player1_name, x=categories, y=list(player1_stats.values())),
go.Bar(name=player2_name, x=categories, y=list(player2_stats.values()))
])
fig.update_layout(
barmode='group',
title=f'{player1_name} vs {player2_name}',
xaxis_title='Statistic',
yaxis_title='Z-Score',
legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)
return fig
User Interface Considerations
Streamlit Dashboard Layout
The dashboard is organized into logical sections:
- Sidebar: Player selection, algorithm choice, feature selection
- Main Area: Visualizations and results tables
- Expandable Sections: Detailed statistics and methodology explanations
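A minimal Streamlit skeleton for this layout might look like the following sketch; the widget labels, defaults, and placeholder data are assumptions:

```python
import pandas as pd
import plotly.graph_objects as go
import streamlit as st

st.set_page_config(page_title="Player Comparison Tool", layout="wide")

# Sidebar: player selection, algorithm choice, feature selection
with st.sidebar:
    player = st.selectbox("Player", ["LeBron James", "Stephen Curry"])
    algorithm = st.radio("Similarity algorithm", ["euclidean", "cosine"])
    features = st.multiselect("Features", ["PTS", "TRB", "AST", "TS_PCT"],
                              default=["PTS", "TRB", "AST"])

# Main area: visualizations and results tables (placeholder data for the sketch)
st.title("Player Comparison Tool")
fig = go.Figure(go.Scatterpolar(r=[80, 60, 75], theta=["PTS", "TRB", "AST"],
                                fill="toself", name=player))
st.plotly_chart(fig, use_container_width=True)
st.dataframe(pd.DataFrame({"Player": ["Kevin Durant"], "Similarity": [0.91]}))

# Expandable section: methodology details on demand (progressive disclosure)
with st.expander("Methodology"):
    st.write("Similarity is computed on z-score-normalized statistics.")
```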
Key UI/UX Principles
- Progressive Disclosure: Show summary first, details on demand
- Sensible Defaults: Pre-select commonly used features and algorithms
- Clear Feedback: Loading indicators and error messages
- Mobile Responsiveness: Streamlit handles this automatically
Accessibility Considerations
- Use colorblind-friendly palettes (avoid red-green only distinctions)
- Provide text alternatives for all visualizations
- Ensure sufficient contrast ratios
- Support keyboard navigation
Complete Core Implementation
Below is the complete implementation for the core player comparison functionality:
"""
Player Comparison Tool - Core Implementation
A comprehensive tool for finding and comparing statistically similar NBA players.
"""
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cdist
from scipy.stats import percentileofscore
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
class PlayerComparisonTool:
"""
Main class for player comparison functionality.
This class provides methods for:
- Loading and preprocessing player statistics
- Calculating similarity scores between players
- Finding similar players
- Generating visualizations
"""
# Default features for comparison
DEFAULT_FEATURES = [
'PTS', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
'FG_PCT', 'FG3_PCT', 'FT_PCT', 'TS_PCT',
'USG_PCT', 'PER', 'BPM', 'WS'
]
def __init__(self, data: Optional[pd.DataFrame] = None):
"""
Initialize the PlayerComparisonTool.
Args:
data: Optional pre-loaded DataFrame with player statistics
"""
self.data = data
self.scaler = StandardScaler()
self._normalized_data = None
self._features = None
def load_data(self, filepath: str, **kwargs) -> pd.DataFrame:
"""
Load player statistics from a CSV file.
Args:
filepath: Path to the CSV file
**kwargs: Additional arguments passed to pd.read_csv
Returns:
Loaded and preprocessed DataFrame
"""
self.data = pd.read_csv(filepath, **kwargs)
self._preprocess_data()
return self.data
def _preprocess_data(self, min_games: int = 20, min_minutes: float = 10.0):
"""
Preprocess the loaded data.
Args:
min_games: Minimum games played threshold
min_minutes: Minimum minutes per game threshold
"""
if self.data is None:
raise ValueError("No data loaded")
# Filter for qualified players
if 'G' in self.data.columns and 'MP' in self.data.columns:
self.data = self.data[
(self.data['G'] >= min_games) &
(self.data['MP'] >= min_minutes)
]
# Handle missing values in shooting percentages
pct_cols = ['FG_PCT', 'FG3_PCT', 'FT_PCT', 'TS_PCT', 'EFG_PCT']
for col in pct_cols:
if col in self.data.columns:
self.data[col] = self.data[col].fillna(0)
# Reset index
self.data = self.data.reset_index(drop=True)
def prepare_features(self, features: Optional[List[str]] = None) -> np.ndarray:
"""
Prepare and normalize features for similarity calculations.
Args:
features: List of feature columns to use
Returns:
Normalized feature array
"""
if self.data is None:
raise ValueError("No data loaded")
self._features = features or self.DEFAULT_FEATURES
# Filter to available features
available_features = [f for f in self._features if f in self.data.columns]
if len(available_features) < len(self._features):
missing = set(self._features) - set(available_features)
print(f"Warning: Features not found in data: {missing}")
self._features = available_features
# Fill any remaining missing values with column means
feature_data = self.data[self._features].copy()
feature_data = feature_data.fillna(feature_data.mean())
# Normalize
self._normalized_data = self.scaler.fit_transform(feature_data)
return self._normalized_data
def calculate_similarity(self,
player1_idx: int,
player2_idx: int,
method: str = 'euclidean') -> float:
"""
Calculate similarity between two players.
Args:
player1_idx: Index of first player in DataFrame
player2_idx: Index of second player in DataFrame
method: Similarity method ('euclidean', 'cosine', 'correlation')
Returns:
Similarity score (0 to 1, where 1 is most similar)
"""
if self._normalized_data is None:
self.prepare_features()
stats1 = self._normalized_data[player1_idx].reshape(1, -1)
stats2 = self._normalized_data[player2_idx].reshape(1, -1)
if method == 'euclidean':
distance = cdist(stats1, stats2, metric='euclidean')[0, 0]
return 1 / (1 + distance)
elif method == 'cosine':
# Cosine distance is 1 - cosine similarity
cosine_dist = cdist(stats1, stats2, metric='cosine')[0, 0]
return 1 - cosine_dist
elif method == 'correlation':
corr_dist = cdist(stats1, stats2, metric='correlation')[0, 0]
return 1 - corr_dist
else:
raise ValueError(f"Unknown method: {method}")
def find_similar_players(self,
player_name: str,
season: Optional[int] = None,
n: int = 10,
method: str = 'euclidean',
features: Optional[List[str]] = None,
filters: Optional[Dict] = None) -> pd.DataFrame:
"""
Find the N most similar players to a given player.
Args:
player_name: Name of the target player
season: Optional season year to filter target player
n: Number of similar players to return
method: Similarity method
features: Optional list of features to use
filters: Optional filters to apply (e.g., {'Position': 'PG'})
Returns:
DataFrame with similar players and their similarity scores
"""
if self.data is None:
raise ValueError("No data loaded")
# Find target player
mask = self.data['Player'] == player_name
if season is not None:
mask &= self.data['Season'] == season
if not mask.any():
raise ValueError(f"Player '{player_name}' not found" +
(f" for season {season}" if season else ""))
target_idx = self.data[mask].index[0]
# Prepare features
self.prepare_features(features)
# Apply filters
comparison_mask = pd.Series([True] * len(self.data))
if filters:
for col, value in filters.items():
if col in self.data.columns:
comparison_mask &= self.data[col] == value
        # Calculate similarity between the target and every candidate player
similarities = []
for idx in range(len(self.data)):
if comparison_mask.iloc[idx] and idx != target_idx:
sim = self.calculate_similarity(target_idx, idx, method)
similarities.append((idx, sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
top_n = similarities[:n]
# Build result DataFrame
result_indices = [idx for idx, _ in top_n]
result = self.data.iloc[result_indices].copy()
result['Similarity'] = [sim for _, sim in top_n]
# Reorder columns
cols = ['Player', 'Season', 'Similarity'] + \
[c for c in self._features if c in result.columns]
result = result[[c for c in cols if c in result.columns]]
return result.reset_index(drop=True)
def calculate_percentiles(self,
player_name: str,
season: Optional[int] = None,
features: Optional[List[str]] = None,
comparison_group: Optional[pd.DataFrame] = None) -> Dict[str, float]:
"""
Calculate percentile rankings for a player.
Args:
player_name: Name of the player
season: Optional season year
features: Features to calculate percentiles for
comparison_group: Optional subset of data for comparison
Returns:
Dictionary of feature names to percentile values (0-100)
"""
if self.data is None:
raise ValueError("No data loaded")
features = features or self.DEFAULT_FEATURES
features = [f for f in features if f in self.data.columns]
# Find player
mask = self.data['Player'] == player_name
if season is not None:
mask &= self.data['Season'] == season
if not mask.any():
raise ValueError(f"Player '{player_name}' not found")
player_row = self.data[mask].iloc[0]
comparison_df = comparison_group if comparison_group is not None else self.data
percentiles = {}
for feature in features:
value = player_row[feature]
pct = percentileofscore(comparison_df[feature].dropna(), value)
percentiles[feature] = round(pct, 1)
return percentiles
def calculate_zscores(self,
player_name: str,
season: Optional[int] = None,
features: Optional[List[str]] = None,
comparison_group: Optional[pd.DataFrame] = None) -> Dict[str, float]:
"""
Calculate z-scores for a player.
Args:
player_name: Name of the player
season: Optional season year
features: Features to calculate z-scores for
comparison_group: Optional subset of data for comparison
Returns:
Dictionary of feature names to z-score values
"""
if self.data is None:
raise ValueError("No data loaded")
features = features or self.DEFAULT_FEATURES
features = [f for f in features if f in self.data.columns]
# Find player
mask = self.data['Player'] == player_name
if season is not None:
mask &= self.data['Season'] == season
if not mask.any():
raise ValueError(f"Player '{player_name}' not found")
player_row = self.data[mask].iloc[0]
comparison_df = comparison_group if comparison_group is not None else self.data
zscores = {}
for feature in features:
value = player_row[feature]
mean = comparison_df[feature].mean()
std = comparison_df[feature].std()
if std > 0:
zscores[feature] = round((value - mean) / std, 2)
else:
zscores[feature] = 0.0
return zscores
def create_radar_chart(self,
player_name: str,
season: Optional[int] = None,
comparison_player: Optional[str] = None,
comparison_season: Optional[int] = None,
features: Optional[List[str]] = None) -> go.Figure:
"""
Create a radar chart for player comparison.
Args:
player_name: Primary player name
season: Primary player season
comparison_player: Optional comparison player name
comparison_season: Comparison player season
features: Features to include in chart
Returns:
Plotly Figure object
"""
features = features or ['PTS', 'TRB', 'AST', 'STL', 'BLK', 'TS_PCT']
features = [f for f in features if f in self.data.columns]
# Get percentiles for primary player
player1_pct = self.calculate_percentiles(player_name, season, features)
fig = go.Figure()
# Add primary player
fig.add_trace(go.Scatterpolar(
r=list(player1_pct.values()),
theta=list(player1_pct.keys()),
fill='toself',
name=f"{player_name} ({season})" if season else player_name,
line=dict(color='#1f77b4', width=2)
))
# Add comparison player if provided
if comparison_player:
player2_pct = self.calculate_percentiles(
comparison_player, comparison_season, features
)
fig.add_trace(go.Scatterpolar(
r=list(player2_pct.values()),
theta=list(player2_pct.keys()),
fill='toself',
name=f"{comparison_player} ({comparison_season})" if comparison_season else comparison_player,
line=dict(color='#ff7f0e', width=2)
))
fig.update_layout(
polar=dict(
radialaxis=dict(
visible=True,
range=[0, 100],
tickfont=dict(size=10)
)
),
showlegend=True,
title=dict(
text="Player Comparison (Percentile Rankings)",
x=0.5
),
legend=dict(
yanchor="top",
y=1.1,
xanchor="center",
x=0.5,
orientation="h"
)
)
return fig
def create_similarity_matrix(self,
player_names: List[str],
seasons: Optional[List[int]] = None,
method: str = 'euclidean') -> Tuple[np.ndarray, go.Figure]:
"""
Create a similarity matrix and heatmap for multiple players.
Args:
player_names: List of player names
seasons: Optional list of seasons (same length as player_names)
method: Similarity method
Returns:
Tuple of (similarity matrix, Plotly Figure)
"""
n_players = len(player_names)
if seasons is None:
seasons = [None] * n_players
# Get indices for all players
indices = []
labels = []
for name, season in zip(player_names, seasons):
mask = self.data['Player'] == name
if season is not None:
mask &= self.data['Season'] == season
if mask.any():
indices.append(self.data[mask].index[0])
labels.append(f"{name} ({season})" if season else name)
# Prepare features
self.prepare_features()
# Calculate similarity matrix
n = len(indices)
sim_matrix = np.zeros((n, n))
for i in range(n):
for j in range(n):
if i == j:
sim_matrix[i, j] = 1.0
elif i < j:
sim = self.calculate_similarity(indices[i], indices[j], method)
sim_matrix[i, j] = sim
sim_matrix[j, i] = sim
# Create heatmap
fig = go.Figure(data=go.Heatmap(
z=sim_matrix,
x=labels,
y=labels,
colorscale='RdYlGn',
zmin=0,
zmax=1,
text=np.round(sim_matrix, 2),
texttemplate='%{text}',
textfont={"size": 10},
hovertemplate='%{y} vs %{x}<br>Similarity: %{z:.3f}<extra></extra>'
))
fig.update_layout(
title="Player Similarity Matrix",
width=600,
height=600,
xaxis=dict(tickangle=45)
)
return sim_matrix, fig
# Example usage
if __name__ == "__main__":
# Create sample data for demonstration
sample_data = pd.DataFrame({
'Player': ['LeBron James', 'Kevin Durant', 'Stephen Curry',
'Giannis Antetokounmpo', 'Luka Doncic'] * 3,
'Season': [2022, 2022, 2022, 2022, 2022,
2023, 2023, 2023, 2023, 2023,
2024, 2024, 2024, 2024, 2024],
'G': [55, 58, 56, 63, 66] * 3,
'MP': [35.5, 36.0, 34.7, 32.1, 36.2] * 3,
'PTS': [28.9, 29.1, 29.4, 31.1, 32.4] * 3,
'TRB': [8.3, 6.7, 6.1, 11.6, 8.8] * 3,
'AST': [6.8, 5.0, 6.3, 5.7, 8.0] * 3,
'STL': [0.9, 0.7, 0.9, 0.8, 1.4] * 3,
'BLK': [0.6, 1.4, 0.4, 0.8, 0.5] * 3,
'TOV': [3.1, 3.3, 3.2, 3.3, 3.6] * 3,
'FG_PCT': [0.500, 0.529, 0.493, 0.553, 0.496] * 3,
'FG3_PCT': [0.321, 0.404, 0.427, 0.275, 0.342] * 3,
'FT_PCT': [0.723, 0.910, 0.915, 0.645, 0.760] * 3,
'TS_PCT': [0.580, 0.660, 0.670, 0.610, 0.600] * 3,
'USG_PCT': [31.5, 30.2, 32.1, 37.4, 36.8] * 3,
'PER': [26.2, 26.8, 24.2, 32.1, 28.4] * 3,
'BPM': [7.2, 6.1, 5.8, 11.0, 7.8] * 3,
'WS': [8.1, 9.2, 7.8, 12.1, 10.2] * 3
})
# Initialize tool
tool = PlayerComparisonTool(data=sample_data)
# Find similar players
print("Finding players similar to LeBron James (2023)...")
similar = tool.find_similar_players("LeBron James", season=2023, n=3)
print(similar[['Player', 'Season', 'Similarity']])
print("\nLeBron James Percentiles:")
pct = tool.calculate_percentiles("LeBron James", season=2023)
for stat, value in pct.items():
print(f" {stat}: {value}th percentile")
print("\nLeBron James Z-Scores:")
zscores = tool.calculate_zscores("LeBron James", season=2023)
for stat, value in zscores.items():
print(f" {stat}: {value:+.2f}")
Deployment Options
Option 1: Local Streamlit Application
The simplest deployment method for personal use or demos:
streamlit run code/player_comparison.py
Option 2: Streamlit Cloud
Free hosting for public applications:
- Push your code to a GitHub repository
- Connect to Streamlit Cloud
- Deploy with one click
Limitations: 1GB memory, public repository required for free tier
Option 3: Docker Container
For production deployments or team sharing:
FROM python:3.9-slim
WORKDIR /app
COPY code/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "code/player_comparison.py", "--server.port=8501"]
Option 4: Cloud Platform Deployment
For scalable production applications:
- AWS: EC2, ECS, or Lambda + API Gateway
- Google Cloud: Cloud Run or App Engine
- Azure: App Service or Container Instances
Option 5: Static Export
For documentation or reports, export visualizations as HTML or images (note that write_image requires the kaleido package):
fig.write_html("player_comparison.html")
fig.write_image("player_comparison.png")
Extension Ideas
Beginner Extensions
- Add more similarity algorithms: Implement Manhattan distance or Minkowski distance
- Position filtering: Allow filtering comparisons by position
- Export functionality: Add CSV/PDF export of comparison results
- Historical comparison: Compare a current player to all-time greats
Intermediate Extensions
- Weighted similarity: Allow users to weight certain stats more heavily (see the sketch after this list)
- Career arc comparison: Compare career trajectories, not just single seasons
- Cluster analysis: Use K-means to identify player archetypes
- Player trajectory prediction: Predict future stats based on similar players
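As a starting point for the weighted-similarity extension, here is a minimal sketch of a weighted Euclidean similarity; the weighting scheme is an assumption, and the inputs are expected to be z-scored:

```python
import numpy as np

def weighted_euclidean_similarity(stats1: np.ndarray, stats2: np.ndarray,
                                  weights: np.ndarray) -> float:
    """Weighted Euclidean similarity; larger weights emphasize those stats."""
    weights = weights / weights.sum()  # normalize so weights sum to 1
    distance = np.sqrt(np.sum(weights * (stats1 - stats2) ** 2))
    return 1 / (1 + distance)  # same 0-1 transform as the unweighted version

# Example: emphasize scoring twice as much as the other features
sim = weighted_euclidean_similarity(
    np.array([1.2, 0.3, -0.5]),   # z-scores for [PTS, TRB, AST]
    np.array([0.9, 0.1, -0.2]),
    weights=np.array([2.0, 1.0, 1.0])
)
```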
Advanced Extensions
- Play-by-play integration: Include player tracking data for more detailed comparisons
- Natural language search: "Find me a player like prime Kobe but with better three-point shooting"
- Real-time updates: Connect to live NBA Stats API for current season data
- Machine learning models: Train models to predict player compatibility or trade value
- Multi-sport expansion: Generalize the architecture for other sports
Academic Extensions
- Research paper replication: Implement published player similarity research
- Novel metrics development: Create new composite statistics for comparison
- Bias analysis: Study how the tool's recommendations might embed historical biases
- Uncertainty quantification: Add confidence intervals to similarity scores
Testing Your Implementation
Unit Tests
Create comprehensive tests for core functionality:
# tests/test_similarity.py
import pytest
import numpy as np
from code.similarity import euclidean_similarity, cosine_similarity
def test_euclidean_similarity_identical():
"""Identical vectors should have similarity of 1."""
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([1.0, 2.0, 3.0])
assert euclidean_similarity(v1, v2) == 1.0
def test_euclidean_similarity_different():
"""Different vectors should have similarity less than 1."""
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])
sim = euclidean_similarity(v1, v2)
assert 0 < sim < 1
def test_cosine_similarity_orthogonal():
"""Orthogonal vectors should have cosine similarity of 0."""
v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])
assert cosine_similarity(v1, v2) == pytest.approx(0.0)
def test_cosine_similarity_parallel():
"""Parallel vectors should have cosine similarity of 1."""
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])
assert cosine_similarity(v1, v2) == pytest.approx(1.0)
Integration Tests
Test the complete workflow:
# tests/test_integration.py
def test_full_workflow(sample_data):
"""Test complete comparison workflow."""
tool = PlayerComparisonTool(data=sample_data)
# Load and prepare data
tool.prepare_features()
# Find similar players
similar = tool.find_similar_players("LeBron James", n=3)
assert len(similar) == 3
assert 'Similarity' in similar.columns
assert all(0 <= s <= 1 for s in similar['Similarity'])
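The sample_data argument above is a pytest fixture; a minimal tests/conftest.py might provide it like this (the rows and values are illustrative):

```python
# tests/conftest.py
import pandas as pd
import pytest

@pytest.fixture
def sample_data() -> pd.DataFrame:
    """Small illustrative dataset for integration tests."""
    return pd.DataFrame({
        'Player': ['LeBron James', 'Kevin Durant', 'Stephen Curry', 'Luka Doncic'],
        'Season': [2023, 2023, 2023, 2023],
        'G': [55, 58, 56, 66],
        'MP': [35.5, 36.0, 34.7, 36.2],
        'PTS': [28.9, 29.1, 29.4, 32.4],
        'TRB': [8.3, 6.7, 6.1, 8.8],
        'AST': [6.8, 5.0, 6.3, 8.0],
        'TS_PCT': [0.58, 0.66, 0.67, 0.60],
    })
```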
Troubleshooting Common Issues
Issue: "Player not found" error
Cause: Player name doesn't match exactly (spacing, special characters, suffixes)
Solution:
# Normalize player names
df['Player'] = df['Player'].str.strip()
# Remove trailing asterisks (Basketball Reference marks Hall of Famers with *)
df['Player'] = df['Player'].str.replace(r'\s*\*$', '', regex=True)
Issue: Similarity scores all very close
Cause: Feature scaling issues or too many features causing dimensionality problems
Solution:
- Reduce the number of features
- Verify StandardScaler is being applied
- Consider using feature selection or PCA
Issue: Memory error with large datasets
Cause: Calculating all pairwise similarities is O(n^2)
Solution:
- Use approximate nearest neighbors (Annoy, FAISS)
- Calculate similarities in batches
- Use sparse matrices where possible
Issue: Slow performance
Cause: Pure Python loops for similarity calculations
Solution:
- Use NumPy vectorized operations
- Pre-compute the similarity matrix
- Cache frequently accessed results
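For instance, all pairwise Euclidean similarities can be computed in one vectorized call instead of a Python loop (a sketch using scipy.spatial.distance.cdist; the array below is placeholder data):

```python
import numpy as np
from scipy.spatial.distance import cdist

# normalized: (n_players, n_features) array of z-scored statistics
normalized = np.random.default_rng(0).normal(size=(500, 10))  # placeholder data

distances = cdist(normalized, normalized, metric='euclidean')
similarity_matrix = 1 / (1 + distances)  # same 0-1 transform used earlier
```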
Project Rubric
Use this rubric to evaluate your implementation:
| Criterion | Excellent (4) | Good (3) | Satisfactory (2) | Needs Work (1) |
|---|---|---|---|---|
| Code Quality | Clean, well-documented, follows PEP8 | Minor issues, mostly documented | Some documentation, inconsistent style | Poor documentation, messy code |
| Functionality | All features work, handles edge cases | Most features work | Basic features work | Significant bugs |
| Similarity Algorithms | 3+ algorithms, well-implemented | 2 algorithms working | 1 algorithm working | Algorithms incorrect |
| Visualizations | Interactive, polished, informative | Good visualizations | Basic charts | Visualizations broken |
| UI/UX | Intuitive, accessible, responsive | Good usability | Functional but basic | Difficult to use |
| Testing | >80% coverage, thorough tests | >60% coverage | Some tests | No tests |
| Documentation | Comprehensive README, docstrings | Good documentation | Basic README | No documentation |
| Extension | Implemented 2+ creative extensions | 1 extension | Attempted extension | No extensions |
Conclusion
Congratulations on completing this capstone project! You have built a comprehensive player comparison tool that demonstrates skills in:
- Data Engineering: Loading, cleaning, and preprocessing sports statistics
- Statistical Analysis: Implementing similarity metrics, percentile rankings, and z-scores
- Machine Learning: Applying distance metrics and normalization techniques
- Data Visualization: Creating informative, interactive charts
- Software Engineering: Structuring a maintainable, testable codebase
- Product Development: Designing a user-friendly interface
This project serves as an excellent portfolio piece that showcases your ability to apply data science techniques to real-world problems. The basketball domain knowledge combined with technical implementation makes this project relevant for both sports analytics roles and general data science positions.
Next Steps
- Deploy your application using one of the deployment options discussed
- Add your own extensions to make the tool unique
- Document your work in a blog post or portfolio site
- Contribute improvements back to the community
- Apply these techniques to other domains that interest you
Remember, the best portfolio projects are ones you continue to improve and that showcase your genuine interests. Use this foundation to build something that excites you!
References
- Basketball Reference. (2024). Basketball Statistics and History. https://www.basketball-reference.com/
- NBA Advanced Stats. (2024). https://www.nba.com/stats/
- Shea, S. M., & Baker, C. E. (2013). Basketball Analytics: Objective and Efficient Strategies for Understanding How Teams Win. CreateSpace.
- Oliver, D. (2004). Basketball on Paper: Rules and Tools for Performance Analysis. Potomac Books.
- scikit-learn documentation. (2024). Preprocessing data. https://scikit-learn.org/stable/modules/preprocessing.html
- Plotly Python Documentation. (2024). https://plotly.com/python/
- Streamlit Documentation. (2024). https://docs.streamlit.io/