Case Study 1: Building a Comprehensive Player Comparison System
Overview
Scenario: The Cleveland Cavaliers are exploring free agency options and need a data-driven system to compare players across different statistical categories. They want to identify players who are elite in multiple areas and understand how candidates compare to each other using standardized metrics.
Duration: 2-3 hours
Difficulty: Intermediate
Prerequisites: Chapter 5 concepts, understanding of z-scores and percentiles
Background
The Cavaliers have three roster needs:
1. A secondary scorer who can create their own shot
2. A defensive anchor in the paint
3. A floor-spacing forward
For each need, they've identified 3-4 candidates. Your task is to build a statistical comparison system that:
- Standardizes statistics for fair comparison
- Creates composite scores for different skill sets
- Provides clear visualizations for front office presentations
- Accounts for context (minutes, position, etc.)
Part 1: Building the Statistical Framework
1.1 Data Preparation and Standardization
"""
Cavaliers Player Comparison System
Case Study 1 - Chapter 5: Descriptive Statistics
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
from scipy import stats

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)


@dataclass
class PlayerStats:
    """Container for player statistics."""
    name: str
    team: str
    position: str
    games_played: int
    minutes_per_game: float
    stats: Dict[str, float]


class StatisticalComparisonSystem:
    """
    System for standardizing and comparing player statistics.

    This class provides methods for calculating z-scores, percentiles,
    and composite scores to enable fair player comparisons.

    Attributes:
        raw_stats: DataFrame with league-wide statistics
        qualified: Subset of raw_stats meeting the minimum thresholds
        min_games: Minimum games played filter
        min_minutes: Minimum minutes per game filter
    """

    def __init__(self, league_stats: pd.DataFrame,
                 min_games: int = 30,
                 min_minutes: float = 15.0):
        """
        Initialize the comparison system.

        Args:
            league_stats: DataFrame with all player statistics
            min_games: Minimum games to be included in analysis
            min_minutes: Minimum MPG to be included
        """
        self.raw_stats = league_stats
        self.min_games = min_games
        self.min_minutes = min_minutes

        # Filter qualified players
        self.qualified = self._filter_qualified_players()

        # Calculate league distributions
        self.league_means = {}
        self.league_stds = {}
        self._calculate_league_distributions()

    def _filter_qualified_players(self) -> pd.DataFrame:
        """Filter to players meeting minimum thresholds."""
        mask = (
            (self.raw_stats['GP'] >= self.min_games) &
            (self.raw_stats['MIN'] >= self.min_minutes)
        )
        return self.raw_stats[mask].copy()

    def _calculate_league_distributions(self) -> None:
        """Calculate mean and std for each statistic."""
        numeric_cols = self.qualified.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            self.league_means[col] = self.qualified[col].mean()
            self.league_stds[col] = self.qualified[col].std()

    def calculate_z_score(self, value: float, stat_name: str) -> float:
        """
        Calculate z-score for a single value.

        Args:
            value: The player's statistic value
            stat_name: Name of the statistic

        Returns:
            Z-score relative to league distribution
        """
        mean = self.league_means.get(stat_name)
        std = self.league_stds.get(stat_name)
        # std is NaN when fewer than two players qualify, and 0 when all
        # qualified players share the same value; neither yields a usable z-score
        if mean is None or std is None or pd.isna(std) or std == 0:
            return 0.0
        return (value - mean) / std

    def calculate_percentile(self, value: float, stat_name: str) -> float:
        """
        Calculate percentile rank for a value.

        Args:
            value: The player's statistic value
            stat_name: Name of the statistic

        Returns:
            Percentile rank (0-100)
        """
        if stat_name not in self.qualified.columns:
            return 50.0
        # Share of qualified players strictly below this value
        below = (self.qualified[stat_name] < value).sum()
        return (below / len(self.qualified)) * 100

    def get_player_profile(self, player_name: str) -> Dict:
        """
        Generate comprehensive statistical profile for a player.

        Args:
            player_name: Name of the player

        Returns:
            Dictionary with z-scores, percentiles, and raw stats
        """
        player_row = self.qualified[
            self.qualified['PLAYER_NAME'] == player_name
        ]
        if len(player_row) == 0:
            raise ValueError(f"Player '{player_name}' not found in qualified players")
        player = player_row.iloc[0]

        profile = {
            'name': player_name,
            'team': player.get('TEAM', 'Unknown'),
            'position': player.get('POS', 'Unknown'),
            'raw_stats': {},
            'z_scores': {},
            'percentiles': {}
        }

        # Key statistics to analyze
        key_stats = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'FG_PCT',
                     'FG3_PCT', 'FT_PCT', 'MIN', 'TOV']
        for stat in key_stats:
            if stat in player.index:
                value = player[stat]
                profile['raw_stats'][stat] = value
                profile['z_scores'][stat] = self.calculate_z_score(value, stat)
                profile['percentiles'][stat] = self.calculate_percentile(value, stat)
        return profile

    def create_composite_score(self, player_name: str,
                               weights: Dict[str, float],
                               invert: Optional[List[str]] = None) -> float:
        """
        Create weighted composite score from multiple statistics.

        Args:
            player_name: Name of the player
            weights: Dictionary mapping stat names to weights
            invert: List of stats where lower is better (e.g., TOV)

        Returns:
            Weighted composite z-score

        Example:
            >>> scorer_weights = {'PTS': 0.4, 'FG_PCT': 0.3, 'AST': 0.2, 'TOV': 0.1}
            >>> score = system.create_composite_score('Player Name', scorer_weights, ['TOV'])
        """
        if invert is None:
            invert = []
        profile = self.get_player_profile(player_name)
        z_scores = profile['z_scores']

        weighted_sum = 0
        total_weight = 0
        for stat, weight in weights.items():
            if stat in z_scores:
                z = z_scores[stat]
                # Invert z-score for stats where lower is better
                if stat in invert:
                    z = -z
                weighted_sum += z * weight
                total_weight += weight
        # Dividing by the weight actually used renormalizes when a stat is missing
        return weighted_sum / total_weight if total_weight > 0 else 0

    def compare_players(self, player_names: List[str],
                        stats: Optional[List[str]] = None) -> pd.DataFrame:
        """
        Create comparison table for multiple players.

        Args:
            player_names: List of player names to compare
            stats: Statistics to include (None for all key stats)

        Returns:
            DataFrame with comparison data
        """
        if stats is None:
            stats = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'FG_PCT', 'FG3_PCT']

        comparison_data = []
        for name in player_names:
            try:
                profile = self.get_player_profile(name)
                row = {'Player': name, 'Team': profile['team']}
                for stat in stats:
                    if stat in profile['raw_stats']:
                        row[f'{stat}_raw'] = profile['raw_stats'][stat]
                        row[f'{stat}_z'] = profile['z_scores'][stat]
                        row[f'{stat}_pct'] = profile['percentiles'][stat]
                comparison_data.append(row)
            except ValueError as e:
                print(f"Warning: {e}")
        return pd.DataFrame(comparison_data)
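To make the two core calculations concrete, here is a minimal standalone sketch on a five-player toy sample. The data and the helper names `z_score` and `percentile_rank` are hypothetical; they mirror the logic of `calculate_z_score` and `calculate_percentile` rather than calling the class.

```python
import pandas as pd

# Hypothetical points-per-game values for a five-player "league"
pts = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])


def z_score(value: float, sample: pd.Series) -> float:
    """Z-score against the sample mean and sample std (ddof=1, the pandas default)."""
    std = sample.std()
    if pd.isna(std) or std == 0:
        return 0.0
    return float((value - sample.mean()) / std)


def percentile_rank(value: float, sample: pd.Series) -> float:
    """Share of the sample strictly below value, on a 0-100 scale."""
    return float((sample < value).sum() / len(sample) * 100)


# Mean is 14 and sample std is sqrt(10), so 18 PPG sits about 1.26 std above average
print(round(z_score(18.0, pts), 3))   # 1.265
print(percentile_rank(18.0, pts))     # 80.0
```

Note that the top scorer lands at the 80th percentile, not the 100th, because the rank counts only players strictly below the value. With small samples this ceiling is worth keeping in mind when reading radar charts.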
1.2 Position-Adjusted Comparisons
class PositionAdjustedComparison(StatisticalComparisonSystem):
    """
    Extends comparison system with position-based adjustments.

    Different positions have different statistical expectations.
    This class calculates z-scores relative to position peers.
    """

    def __init__(self, league_stats: pd.DataFrame, **kwargs):
        super().__init__(league_stats, **kwargs)
        self.position_distributions = {}
        self._calculate_position_distributions()

    def _calculate_position_distributions(self) -> None:
        """Calculate distributions by position."""
        positions = self.qualified['POS'].unique()
        for pos in positions:
            pos_data = self.qualified[self.qualified['POS'] == pos]
            self.position_distributions[pos] = {
                'means': pos_data.select_dtypes(include=[np.number]).mean().to_dict(),
                'stds': pos_data.select_dtypes(include=[np.number]).std().to_dict(),
                'count': len(pos_data)
            }

    def calculate_position_z_score(self, value: float, stat_name: str,
                                   position: str) -> float:
        """
        Calculate z-score relative to position peers.

        Args:
            value: The player's statistic value
            stat_name: Name of the statistic
            position: Player's position

        Returns:
            Position-adjusted z-score
        """
        if position not in self.position_distributions:
            return self.calculate_z_score(value, stat_name)
        pos_dist = self.position_distributions[position]
        mean = pos_dist['means'].get(stat_name, 0)
        std = pos_dist['stds'].get(stat_name, 1)
        # std is NaN when only one player qualifies at this position
        if pd.isna(std) or std == 0:
            return 0.0
        return (value - mean) / std

    def get_position_adjusted_profile(self, player_name: str) -> Dict:
        """Generate profile with both league and position z-scores."""
        profile = self.get_player_profile(player_name)
        position = profile['position']
        profile['position_z_scores'] = {}
        for stat, value in profile['raw_stats'].items():
            profile['position_z_scores'][stat] = self.calculate_position_z_score(
                value, stat, position
            )
        return profile
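The effect of the position adjustment is easiest to see on a small table. The sketch below is a standalone illustration, not part of the class: the six players and their rebound figures are hypothetical, and it uses a pandas `groupby(...).transform(...)` as one compact way to compute peer-relative z-scores.

```python
import pandas as pd

# Hypothetical rebounds-per-game data; column names follow the case study
df = pd.DataFrame({
    'PLAYER_NAME': ['G1', 'G2', 'G3', 'C1', 'C2', 'C3'],
    'POS': ['PG', 'PG', 'PG', 'C', 'C', 'C'],
    'REB': [3.0, 4.0, 5.0, 9.0, 10.0, 11.0],
})

# League-wide z-score: every player is compared against all six players
df['REB_z_league'] = (df['REB'] - df['REB'].mean()) / df['REB'].std()

# Position z-score: each player is compared only against same-position peers
by_pos = df.groupby('POS')['REB']
df['REB_z_pos'] = (df['REB'] - by_pos.transform('mean')) / by_pos.transform('std')

print(df.round(2).to_string(index=False))
```

In this toy data the best-rebounding guard (G3) is below the league average but one standard deviation above other guards, while the weakest center (C1) is above the league average but one standard deviation below other centers. Which view is more useful depends on the roster need being evaluated.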
Part 2: Defining Player Archetypes
2.1 Composite Scores for Different Roles
# Define weights for different player archetypes
ARCHETYPE_WEIGHTS = {
    'scorer': {
        'weights': {'PTS': 0.35, 'FG_PCT': 0.20, 'FG3_PCT': 0.15,
                    'FT_PCT': 0.10, 'AST': 0.10, 'TOV': 0.10},
        'invert': ['TOV'],
        'description': 'Shot creator and efficient scorer'
    },
    'playmaker': {
        'weights': {'AST': 0.40, 'TOV': 0.20, 'PTS': 0.20,
                    'FG_PCT': 0.10, 'STL': 0.10},
        'invert': ['TOV'],
        'description': 'Primary ball handler and distributor'
    },
    'rim_protector': {
        'weights': {'BLK': 0.40, 'REB': 0.30, 'STL': 0.10,
                    'PTS': 0.10, 'FG_PCT': 0.10},
        'invert': [],
        'description': 'Defensive anchor and interior presence'
    },
    'floor_spacer': {
        'weights': {'FG3_PCT': 0.35, 'PTS': 0.25, 'FG_PCT': 0.20,
                    'REB': 0.10, 'AST': 0.10},
        'invert': [],
        'description': 'Three-point shooting forward'
    },
    'two_way': {
        'weights': {'PTS': 0.20, 'REB': 0.15, 'AST': 0.15,
                    'STL': 0.20, 'BLK': 0.15, 'TOV': 0.15},
        'invert': ['TOV'],
        'description': 'Contributes on both ends'
    }
}


def evaluate_player_archetypes(system: StatisticalComparisonSystem,
                               player_name: str) -> Dict[str, float]:
    """
    Evaluate a player's fit for each archetype.

    Args:
        system: Initialized comparison system
        player_name: Name of player to evaluate

    Returns:
        Dictionary mapping archetypes to composite scores
    """
    archetype_scores = {}
    for archetype, config in ARCHETYPE_WEIGHTS.items():
        score = system.create_composite_score(
            player_name,
            config['weights'],
            config['invert']
        )
        archetype_scores[archetype] = score
    return archetype_scores


def find_best_fits(system: StatisticalComparisonSystem,
                   candidates: List[str],
                   archetype: str) -> pd.DataFrame:
    """
    Rank candidates by fit for a specific archetype.

    Args:
        system: Initialized comparison system
        candidates: List of player names to evaluate
        archetype: Archetype to evaluate against

    Returns:
        DataFrame with rankings
    """
    config = ARCHETYPE_WEIGHTS[archetype]
    columns = ['Player', 'Team', 'Position', 'Composite_Score', 'Percentile']
    results = []
    for player in candidates:
        try:
            score = system.create_composite_score(
                player, config['weights'], config['invert']
            )
            profile = system.get_player_profile(player)
            results.append({
                'Player': player,
                'Team': profile['team'],
                'Position': profile['position'],
                'Composite_Score': score,
                # Approximate percentile: treats the composite as a
                # standard normal z-score
                'Percentile': stats.norm.cdf(score) * 100
            })
        except ValueError:
            # Skip candidates who do not meet the qualification thresholds
            continue
    # Passing columns keeps sort_values working even if no candidate qualified
    return pd.DataFrame(results, columns=columns).sort_values(
        'Composite_Score', ascending=False)
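One caveat about the `Percentile` column: `stats.norm.cdf(score)` treats the composite as if it had a standard deviation of 1, but a weighted average of several z-scores is usually less spread out than any single z-score. The sketch below quantifies this under a deliberately simple assumption, that the component z-scores are uncorrelated, normally distributed, and have unit variance. Real basketball statistics are correlated, so the figures are illustrative only, and the composite value of 0.46 is hypothetical.

```python
import math
from statistics import NormalDist

# Weights copied from the 'scorer' archetype
scorer_weights = {'PTS': 0.35, 'FG_PCT': 0.20, 'FG3_PCT': 0.15,
                  'FT_PCT': 0.10, 'AST': 0.10, 'TOV': 0.10}

total = sum(scorer_weights.values())
normalized = [w / total for w in scorer_weights.values()]

# ASSUMPTION: uncorrelated, unit-variance components. The variance of a
# weighted sum is then the sum of squared weights.
composite_std = math.sqrt(sum(w ** 2 for w in normalized))

composite = 0.46  # hypothetical composite score for one player
naive_pct = NormalDist().cdf(composite) * 100
rescaled_pct = NormalDist().cdf(composite / composite_std) * 100

print(f"Composite std under independence: {composite_std:.3f}")  # 0.464
print(f"Percentile treating composite as N(0, 1): {naive_pct:.1f}")
print(f"Percentile after rescaling by composite std: {rescaled_pct:.1f}")
```

In practice, a more defensible route is to compute the composite for every qualified player and standardize against that empirical distribution, rather than relying on either normal approximation.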
Part 3: Visualization
3.1 Radar Charts for Player Profiles
def create_radar_chart(profile: Dict, title: Optional[str] = None) -> plt.Figure:
    """
    Create radar chart showing player's percentiles across categories.

    Args:
        profile: Player profile from comparison system
        title: Optional custom title

    Returns:
        Matplotlib Figure
    """
    categories = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'FG_PCT']
    available_cats = [c for c in categories if c in profile['percentiles']]
    values = [profile['percentiles'][c] for c in available_cats]
    values += values[:1]  # Close the polygon

    angles = np.linspace(0, 2 * np.pi, len(available_cats), endpoint=False).tolist()
    angles += angles[:1]

    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
    ax.plot(angles, values, 'o-', linewidth=2)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(available_cats)
    ax.set_ylim(0, 100)

    # Add reference rings at the quartile marks. They are drawn explicitly as
    # circles because axhline does not reliably render as a ring on polar
    # axes in older matplotlib releases.
    theta = np.linspace(0, 2 * np.pi, 100)
    for percentile in [25, 50, 75]:
        ax.plot(theta, np.full_like(theta, percentile),
                color='gray', linestyle='--', alpha=0.3)

    ax.set_title(title or f"{profile['name']} Statistical Profile", fontsize=14)
    return fig


def create_comparison_radar(system: StatisticalComparisonSystem,
                            players: List[str],
                            categories: Optional[List[str]] = None) -> plt.Figure:
    """
    Create overlaid radar charts comparing multiple players.

    Args:
        system: Initialized comparison system
        players: List of player names
        categories: Statistics to include

    Returns:
        Matplotlib Figure
    """
    if categories is None:
        categories = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'FG_PCT']

    fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))
    angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
    angles += angles[:1]

    colors = plt.cm.Set2(np.linspace(0, 1, len(players)))
    for player, color in zip(players, colors):
        try:
            profile = system.get_player_profile(player)
            values = [profile['percentiles'].get(c, 50) for c in categories]
            values += values[:1]
            ax.plot(angles, values, 'o-', linewidth=2, label=player, color=color)
            ax.fill(angles, values, alpha=0.1, color=color)
        except ValueError:
            continue

    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories)
    ax.set_ylim(0, 100)
    ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
    ax.set_title('Player Comparison', fontsize=14)
    return fig
3.2 Archetype Fit Visualization
def visualize_archetype_fits(archetype_scores: Dict[str, Dict],
                             player_names: List[str]) -> plt.Figure:
    """
    Create heatmap showing archetype fit scores for multiple players.

    Args:
        archetype_scores: Dict mapping player names to their archetype scores
        player_names: List of players to include

    Returns:
        Matplotlib Figure
    """
    archetypes = list(ARCHETYPE_WEIGHTS.keys())
    data = []
    included = []  # Players that actually have scores, so rows match the index
    for player in player_names:
        if player in archetype_scores:
            row = [archetype_scores[player].get(arch, 0) for arch in archetypes]
            data.append(row)
            included.append(player)
    df = pd.DataFrame(data, index=included, columns=archetypes)

    fig, ax = plt.subplots(figsize=(12, 6))
    sns.heatmap(df, annot=True, fmt='.2f', cmap='RdYlGn', center=0,
                ax=ax, cbar_kws={'label': 'Composite Z-Score'})
    ax.set_title('Player Archetype Fit Scores', fontsize=14)
    ax.set_xlabel('Archetype')
    ax.set_ylabel('Player')
    plt.tight_layout()
    return fig
Part 4: Case Application
4.1 Cavaliers Free Agency Analysis
def run_cavaliers_analysis():
    """Execute the full Cavaliers free agency analysis."""
    # Load league data (example structure)
    # In practice, load from actual data source
    league_data = pd.DataFrame({
        'PLAYER_NAME': ['Candidate A', 'Candidate B', 'Candidate C',
                        'Candidate D', 'Candidate E', 'Candidate F'],
        'TEAM': ['Team1', 'Team2', 'Team3', 'Team4', 'Team5', 'Team6'],
        'POS': ['SG', 'SF', 'PF', 'C', 'SF', 'PF'],
        'GP': [65, 70, 58, 72, 68, 55],
        'MIN': [32.5, 28.4, 30.1, 26.8, 25.5, 29.2],
        'PTS': [18.2, 8.5, 14.8, 12.1, 15.3, 16.1],
        'REB': [4.2, 5.8, 7.2, 10.5, 6.1, 8.3],
        'AST': [5.5, 2.1, 2.8, 1.5, 3.2, 2.4],
        'STL': [1.2, 2.1, 0.8, 0.6, 1.8, 0.9],
        'BLK': [0.3, 0.5, 1.2, 2.8, 0.4, 1.1],
        'FG_PCT': [0.448, 0.425, 0.465, 0.582, 0.412, 0.445],
        'FG3_PCT': [0.362, 0.348, 0.402, 0.000, 0.385, 0.388],
        'FT_PCT': [0.865, 0.782, 0.825, 0.685, 0.798, 0.812],
        'TOV': [2.8, 1.2, 1.5, 1.8, 1.4, 1.6]
    })

    # Initialize system
    system = StatisticalComparisonSystem(league_data, min_games=50, min_minutes=20)

    print("=" * 60)
    print("CAVALIERS FREE AGENCY ANALYSIS")
    print("=" * 60)

    # Need 1: Secondary Scorer
    print("\n--- NEED 1: SECONDARY SCORER ---")
    scorer_candidates = ['Candidate A', 'Candidate C', 'Candidate F']
    scorer_rankings = find_best_fits(system, scorer_candidates, 'scorer')
    print(scorer_rankings.to_string(index=False))

    # Need 2: Rim Protector
    print("\n--- NEED 2: RIM PROTECTOR ---")
    rim_candidates = ['Candidate D', 'Candidate C']
    rim_rankings = find_best_fits(system, rim_candidates, 'rim_protector')
    print(rim_rankings.to_string(index=False))

    # Need 3: Floor Spacer
    print("\n--- NEED 3: FLOOR SPACER ---")
    spacer_candidates = ['Candidate C', 'Candidate E', 'Candidate F']
    spacer_rankings = find_best_fits(system, spacer_candidates, 'floor_spacer')
    print(spacer_rankings.to_string(index=False))

    # Full archetype analysis
    print("\n--- ARCHETYPE FIT MATRIX ---")
    all_candidates = league_data['PLAYER_NAME'].tolist()
    archetype_matrix = {}
    for player in all_candidates:
        try:
            archetype_matrix[player] = evaluate_player_archetypes(system, player)
        except ValueError:
            continue

    # Print summary
    for player, scores in archetype_matrix.items():
        best_fit = max(scores, key=scores.get)
        print(f"{player}: Best fit = {best_fit} (z = {scores[best_fit]:.2f})")

    return system, archetype_matrix


if __name__ == "__main__":
    run_cavaliers_analysis()
Discussion Questions
Question 1: Weighting Decisions
How would you determine the weights for each archetype? Should they be based on team needs, analytical importance, or something else?
Question 2: Sample Size
Some candidates have played significantly fewer games. How do you account for reliability in your comparisons?
Question 3: Context Adjustments
Players on different teams face different situations. How might team context affect the validity of these comparisons?
Question 4: Combining Metrics
What are the limitations of combining multiple z-scores into a single composite score?
Deliverables
- Comparison System Code: Functional Python class for player comparison
- Visualization Suite: Radar charts and comparison visualizations
- Analysis Report: Written summary of findings for each Cavaliers need
- Recommendation Memo: Final recommendations with statistical justification
- Sensitivity Analysis: How conclusions change with different weighting schemes
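
As a possible starting point for the sensitivity analysis deliverable, the sketch below perturbs a set of base weights at random and counts how often each candidate ranks first. Everything specific here is an assumption for illustration: the z-score values, the three-stat weight vector, the ±50% jitter range, the trial count, and the random seed.

```python
import numpy as np
import pandas as pd

# Hypothetical z-scores for three candidates (illustrative values only)
z = pd.DataFrame(
    {'PTS': [1.2, 0.4, 0.8],
     'FG3_PCT': [-0.3, 1.1, 0.5],
     'TOV': [0.9, -0.2, 0.1]},
    index=['Candidate A', 'Candidate C', 'Candidate F'],
)
z['TOV'] = -z['TOV']  # Lower turnovers are better, so flip the sign

base_weights = np.array([0.5, 0.3, 0.2])  # Order matches the columns of z
rng = np.random.default_rng(42)
n_trials = 1000

top_counts = pd.Series(0, index=z.index)
for _ in range(n_trials):
    # Scale each weight by a random factor in [0.5, 1.5], then renormalize
    w = base_weights * rng.uniform(0.5, 1.5, size=base_weights.size)
    w = w / w.sum()
    scores = z.to_numpy() @ w
    top_counts.iloc[int(np.argmax(scores))] += 1

share_on_top = top_counts / n_trials
print(share_on_top)
```

If one candidate ranks first in nearly every trial, the recommendation is robust to the weighting choice. If the top spot changes hands often, the memo should say so and explain which weights drive the decision.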
Key Takeaways
- Z-scores enable fair comparison across different statistical categories
- Composite scores can capture multi-dimensional player value
- Position adjustments account for different role expectations
- Visualization communicates complex statistical comparisons effectively
- Weighting decisions require basketball judgment, not just statistics