Case Study 1: Building a Comprehensive Player Comparison System

Overview

Scenario: The Cleveland Cavaliers are exploring free agency options and need a data-driven system to compare players across different statistical categories. They want to identify players who are elite in multiple areas and understand how candidates compare to each other using standardized metrics.

Duration: 2-3 hours
Difficulty: Intermediate
Prerequisites: Chapter 5 concepts, understanding of z-scores and percentiles


Background

The Cavaliers have three roster needs:

  1. A secondary scorer who can create their own shot
  2. A defensive anchor in the paint
  3. A floor-spacing forward

For each need, they've identified 3-4 candidates. Your task is to build a statistical comparison system that:

  - Standardizes statistics for fair comparison
  - Creates composite scores for different skill sets
  - Provides clear visualizations for front office presentations
  - Accounts for context (minutes, position, etc.)
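
The standardization requirement rests on two quantities: the z-score, z = (x - mean) / std, and the percentile rank. As a minimal sketch, the example below applies both to a hypothetical eight-player sample. The values and the helper names `z_score` and `percentile_rank` are illustrative assumptions, not part of the case data.

```python
import pandas as pd

# Hypothetical per-game averages for eight players (illustrative values only)
sample = pd.DataFrame({
    "PTS": [8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0],
    "BLK": [0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.2, 2.4],
})


def z_score(series: pd.Series, value: float) -> float:
    """Standard deviations above or below the sample mean (sample std, ddof=1)."""
    return (value - series.mean()) / series.std()


def percentile_rank(series: pd.Series, value: float) -> float:
    """Share of the sample strictly below the value, on a 0-100 scale."""
    return (series < value).mean() * 100


pts_z = z_score(sample["PTS"], 20.0)
blk_z = z_score(sample["BLK"], 1.2)
pts_pct = percentile_rank(sample["PTS"], 20.0)
blk_pct = percentile_rank(sample["BLK"], 1.2)

print(f"20.0 PTS: z = {pts_z:.2f}, percentile = {pts_pct:.1f}")
print(f" 1.2 BLK: z = {blk_z:.2f}, percentile = {blk_pct:.1f}")
```

In this sample both values sit at the 75th percentile, yet the z-scores differ (roughly 1.02 versus 0.56) because the blocks column is right-skewed. That gap is one reason to report both measures rather than either alone.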


Part 1: Building the Statistical Framework

1.1 Data Preparation and Standardization

"""
Cavaliers Player Comparison System
Case Study 1 - Chapter 5: Descriptive Statistics
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
from scipy import stats

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)


@dataclass
class PlayerStats:
    """Container for player statistics."""
    name: str
    team: str
    position: str
    games_played: int
    minutes_per_game: float
    stats: Dict[str, float]


class StatisticalComparisonSystem:
    """
    System for standardizing and comparing player statistics.

    This class provides methods for calculating z-scores, percentiles,
    and composite scores to enable fair player comparisons.

    Attributes:
        league_stats: DataFrame with league-wide statistics
        min_games: Minimum games played filter
        min_minutes: Minimum minutes per game filter
    """

    def __init__(self, league_stats: pd.DataFrame,
                 min_games: int = 30,
                 min_minutes: float = 15.0):
        """
        Initialize the comparison system.

        Args:
            league_stats: DataFrame with all player statistics
            min_games: Minimum games to be included in analysis
            min_minutes: Minimum MPG to be included
        """
        self.raw_stats = league_stats
        self.min_games = min_games
        self.min_minutes = min_minutes

        # Filter qualified players
        self.qualified = self._filter_qualified_players()

        # Calculate league distributions
        self.league_means = {}
        self.league_stds = {}
        self._calculate_league_distributions()

    def _filter_qualified_players(self) -> pd.DataFrame:
        """Filter to players meeting minimum thresholds."""
        mask = (
            (self.raw_stats['GP'] >= self.min_games) &
            (self.raw_stats['MIN'] >= self.min_minutes)
        )
        return self.raw_stats[mask].copy()

    def _calculate_league_distributions(self) -> None:
        """Calculate mean and std for each statistic."""
        numeric_cols = self.qualified.select_dtypes(include=[np.number]).columns

        for col in numeric_cols:
            self.league_means[col] = self.qualified[col].mean()
            self.league_stds[col] = self.qualified[col].std()

    def calculate_z_score(self, value: float, stat_name: str) -> float:
        """
        Calculate z-score for a single value.

        Args:
            value: The player's statistic value
            stat_name: Name of the statistic

        Returns:
            Z-score relative to league distribution
        """
        mean = self.league_means.get(stat_name)
        std = self.league_stds.get(stat_name)

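        # Guard against unknown stats and zero or undefined (NaN) spread,
        # e.g. when only one player qualifies
        if mean is None or std is None or pd.isna(std) or std == 0: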
            return 0.0

        return (value - mean) / std

    def calculate_percentile(self, value: float, stat_name: str) -> float:
        """
        Calculate percentile rank for a value.

        Args:
            value: The player's statistic value
            stat_name: Name of the statistic

        Returns:
            Percentile rank (0-100)
        """
        if stat_name not in self.qualified.columns:
            return 50.0

        below = (self.qualified[stat_name] < value).sum()
        return (below / len(self.qualified)) * 100

    def get_player_profile(self, player_name: str) -> Dict:
        """
        Generate comprehensive statistical profile for a player.

        Args:
            player_name: Name of the player

        Returns:
            Dictionary with z-scores, percentiles, and raw stats
        """
        player_row = self.qualified[
            self.qualified['PLAYER_NAME'] == player_name
        ]

        if len(player_row) == 0:
            raise ValueError(f"Player '{player_name}' not found in qualified players")

        player = player_row.iloc[0]

        profile = {
            'name': player_name,
            'team': player.get('TEAM', 'Unknown'),
            'position': player.get('POS', 'Unknown'),
            'raw_stats': {},
            'z_scores': {},
            'percentiles': {}
        }

        # Key statistics to analyze
        key_stats = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'FG_PCT',
                     'FG3_PCT', 'FT_PCT', 'MIN', 'TOV']

        for stat in key_stats:
            if stat in player.index:
                value = player[stat]
                profile['raw_stats'][stat] = value
                profile['z_scores'][stat] = self.calculate_z_score(value, stat)
                profile['percentiles'][stat] = self.calculate_percentile(value, stat)

        return profile

    def create_composite_score(self, player_name: str,
                               weights: Dict[str, float],
                               invert: Optional[List[str]] = None) -> float:
        """
        Create weighted composite score from multiple statistics.

        Args:
            player_name: Name of the player
            weights: Dictionary mapping stat names to weights
            invert: List of stats where lower is better (e.g., TOV)

        Returns:
            Weighted composite z-score

        Example:
            >>> scorer_weights = {'PTS': 0.4, 'FG_PCT': 0.3, 'AST': 0.2, 'TOV': 0.1}
            >>> score = system.create_composite_score('Player Name', scorer_weights, ['TOV'])
        """
        if invert is None:
            invert = []

        profile = self.get_player_profile(player_name)
        z_scores = profile['z_scores']

        weighted_sum = 0
        total_weight = 0

        for stat, weight in weights.items():
            if stat in z_scores:
                z = z_scores[stat]
                # Invert z-score for stats where lower is better
                if stat in invert:
                    z = -z
                weighted_sum += z * weight
                total_weight += weight

        return weighted_sum / total_weight if total_weight > 0 else 0

    def compare_players(self, player_names: List[str],
                        stats: Optional[List[str]] = None) -> pd.DataFrame:
        """
        Create comparison table for multiple players.

        Args:
            player_names: List of player names to compare
            stats: Statistics to include (None for all key stats)

        Returns:
            DataFrame with comparison data
        """
        if stats is None:
            stats = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'FG_PCT', 'FG3_PCT']

        comparison_data = []

        for name in player_names:
            try:
                profile = self.get_player_profile(name)

                row = {'Player': name, 'Team': profile['team']}

                for stat in stats:
                    if stat in profile['raw_stats']:
                        row[f'{stat}_raw'] = profile['raw_stats'][stat]
                        row[f'{stat}_z'] = profile['z_scores'][stat]
                        row[f'{stat}_pct'] = profile['percentiles'][stat]

                comparison_data.append(row)

            except ValueError as e:
                print(f"Warning: {e}")

        return pd.DataFrame(comparison_data)
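
The qualification filter screens out low-minute players, but the z-scores are still computed on per-game totals, so a reserve's production can look small simply because of playing time. One common adjustment is to rescale counting stats to a per-36-minute rate before standardizing. The sketch below assumes a hypothetical helper `per_36` and made-up stat lines. It should only be applied to counting stats such as PTS or REB, not to percentages.

```python
import pandas as pd

# Hypothetical per-game lines (illustrative values only)
players = pd.DataFrame({
    "PLAYER_NAME": ["Starter", "Reserve"],
    "MIN": [36.0, 18.0],
    "PTS": [18.0, 10.0],
    "REB": [6.0, 5.0],
})


def per_36(df: pd.DataFrame, stat_cols: list) -> pd.DataFrame:
    """Return a copy with each counting stat rescaled to a 36-minute rate."""
    out = df.copy()
    for col in stat_cols:
        out[f"{col}_per36"] = out[col] / out["MIN"] * 36
    return out


rates = per_36(players, ["PTS", "REB"])
print(rates[["PLAYER_NAME", "PTS_per36", "REB_per36"]].to_string(index=False))
```

Here the reserve trails the starter in per-game scoring but leads on a per-36 basis. Per-minute rates carry their own assumption, namely that production would scale linearly with extra minutes.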

1.2 Position-Adjusted Comparisons

class PositionAdjustedComparison(StatisticalComparisonSystem):
    """
    Extends comparison system with position-based adjustments.

    Different positions have different statistical expectations.
    This class calculates z-scores relative to position peers.
    """

    def __init__(self, league_stats: pd.DataFrame, **kwargs):
        super().__init__(league_stats, **kwargs)
        self.position_distributions = {}
        self._calculate_position_distributions()

    def _calculate_position_distributions(self) -> None:
        """Calculate distributions by position."""
        positions = self.qualified['POS'].unique()

        for pos in positions:
            pos_data = self.qualified[self.qualified['POS'] == pos]
            self.position_distributions[pos] = {
                'means': pos_data.select_dtypes(include=[np.number]).mean().to_dict(),
                'stds': pos_data.select_dtypes(include=[np.number]).std().to_dict(),
                'count': len(pos_data)
            }

    def calculate_position_z_score(self, value: float, stat_name: str,
                                   position: str) -> float:
        """
        Calculate z-score relative to position peers.

        Args:
            value: The player's statistic value
            stat_name: Name of the statistic
            position: Player's position

        Returns:
            Position-adjusted z-score
        """
        if position not in self.position_distributions:
            return self.calculate_z_score(value, stat_name)

        pos_dist = self.position_distributions[position]
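        mean = pos_dist['means'].get(stat_name)
        std = pos_dist['stds'].get(stat_name)

        # Fall back to the league-wide z-score when the position group has no
        # usable spread (missing stat, a single player, or zero variance)
        if mean is None or std is None or pd.isna(std) or std == 0:
            return self.calculate_z_score(value, stat_name)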

        return (value - mean) / std

    def get_position_adjusted_profile(self, player_name: str) -> Dict:
        """Generate profile with both league and position z-scores."""
        profile = self.get_player_profile(player_name)
        position = profile['position']

        profile['position_z_scores'] = {}

        for stat, value in profile['raw_stats'].items():
            profile['position_z_scores'][stat] = self.calculate_position_z_score(
                value, stat, position
            )

        return profile
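
To see why position adjustment changes the picture, the following sketch computes league-wide and position-relative z-scores side by side using `groupby(...).transform(...)`. The six players and their rebounding numbers are hypothetical. This is a compact illustration of the same idea as the class above, not a replacement for it.

```python
import pandas as pd

# Hypothetical rebounding averages (illustrative values, not real players)
df = pd.DataFrame({
    "PLAYER_NAME": ["G1", "G2", "G3", "C1", "C2", "C3"],
    "POS": ["G", "G", "G", "C", "C", "C"],
    "REB": [3.0, 4.0, 5.0, 9.0, 10.0, 11.0],
})

# League-wide z-score: every player measured against the full pool
df["REB_z_league"] = (df["REB"] - df["REB"].mean()) / df["REB"].std()

# Position z-score: each player measured against same-position peers only
by_pos = df.groupby("POS")["REB"]
df["REB_z_pos"] = (df["REB"] - by_pos.transform("mean")) / by_pos.transform("std")

print(df.round(2).to_string(index=False))
```

In this toy data the best-rebounding guard is below the league average but a full standard deviation above other guards, while the middle center is well above the league average but exactly average for a center.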

Part 2: Defining Player Archetypes

2.1 Composite Scores for Different Roles

# Define weights for different player archetypes

ARCHETYPE_WEIGHTS = {
    'scorer': {
        'weights': {'PTS': 0.35, 'FG_PCT': 0.20, 'FG3_PCT': 0.15,
                    'FT_PCT': 0.10, 'AST': 0.10, 'TOV': 0.10},
        'invert': ['TOV'],
        'description': 'Shot creator and efficient scorer'
    },
    'playmaker': {
        'weights': {'AST': 0.40, 'TOV': 0.20, 'PTS': 0.20,
                    'FG_PCT': 0.10, 'STL': 0.10},
        'invert': ['TOV'],
        'description': 'Primary ball handler and distributor'
    },
    'rim_protector': {
        'weights': {'BLK': 0.40, 'REB': 0.30, 'STL': 0.10,
                    'PTS': 0.10, 'FG_PCT': 0.10},
        'invert': [],
        'description': 'Defensive anchor and interior presence'
    },
    'floor_spacer': {
        'weights': {'FG3_PCT': 0.35, 'PTS': 0.25, 'FG_PCT': 0.20,
                    'REB': 0.10, 'AST': 0.10},
        'invert': [],
        'description': 'Three-point shooting forward'
    },
    'two_way': {
        'weights': {'PTS': 0.20, 'REB': 0.15, 'AST': 0.15,
                    'STL': 0.20, 'BLK': 0.15, 'TOV': 0.15},
        'invert': ['TOV'],
        'description': 'Contributes on both ends'
    }
}


def evaluate_player_archetypes(system: StatisticalComparisonSystem,
                               player_name: str) -> Dict[str, float]:
    """
    Evaluate a player's fit for each archetype.

    Args:
        system: Initialized comparison system
        player_name: Name of player to evaluate

    Returns:
        Dictionary mapping archetypes to composite scores
    """
    archetype_scores = {}

    for archetype, config in ARCHETYPE_WEIGHTS.items():
        score = system.create_composite_score(
            player_name,
            config['weights'],
            config['invert']
        )
        archetype_scores[archetype] = score

    return archetype_scores


def find_best_fits(system: StatisticalComparisonSystem,
                   candidates: List[str],
                   archetype: str) -> pd.DataFrame:
    """
    Rank candidates by fit for a specific archetype.

    Args:
        system: Initialized comparison system
        candidates: List of player names to evaluate
        archetype: Archetype to evaluate against

    Returns:
        DataFrame with rankings
    """
    config = ARCHETYPE_WEIGHTS[archetype]
    results = []

    for player in candidates:
        try:
            score = system.create_composite_score(
                player, config['weights'], config['invert']
            )
            profile = system.get_player_profile(player)

            results.append({
                'Player': player,
                'Team': profile['team'],
                'Position': profile['position'],
                'Composite_Score': score,
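                # Approximation: treats the composite as a standard normal
                # variable, which a weighted average of z-scores generally is not
                'Percentile': stats.norm.cdf(score) * 100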
            })
        except ValueError:
            continue

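    # Return an empty, correctly shaped table if no candidate qualified
    if not results:
        return pd.DataFrame(columns=['Player', 'Team', 'Position',
                                     'Composite_Score', 'Percentile'])

    return pd.DataFrame(results).sort_values('Composite_Score', ascending=False)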
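
The Percentile column in `find_best_fits` passes the composite score through `stats.norm.cdf`, which treats the composite as a standard normal variable. When the weights sum to 1, a weighted average of z-scores has a standard deviation below 1 unless the components are perfectly correlated, so those percentiles are pulled toward 50. The sketch below checks this by simulation. It assumes independent, standard-normal z-scores, a simplification since real box-score stats are correlated, and uses the scorer weights from `ARCHETYPE_WEIGHTS`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Scorer weights from ARCHETYPE_WEIGHTS; they sum to 1.0
weights = np.array([0.35, 0.20, 0.15, 0.10, 0.10, 0.10])

# Assumption: 5,000 simulated players with independent standard-normal z-scores
z = rng.standard_normal((5000, weights.size))
composite = z @ weights

# For independent components, std = sqrt(sum of squared weights)
theoretical_std = np.sqrt(np.sum(weights ** 2))
simulated_std = composite.std(ddof=1)
print(f"Theoretical composite std: {theoretical_std:.3f}")
print(f"Simulated composite std:   {simulated_std:.3f}")

# Percentile of a composite score of 0.5, before and after rescaling
score = 0.5
naive_pct = stats.norm.cdf(score) * 100
rescaled_pct = stats.norm.cdf(score / theoretical_std) * 100
print(f"Naive percentile:    {naive_pct:.1f}")
print(f"Rescaled percentile: {rescaled_pct:.1f}")
```

Under these assumptions the composite's standard deviation is about 0.46, so a score of 0.5 sits near the 86th percentile rather than the 69th. With real data, a more direct option is to rank each candidate's composite against the composites of all qualified players.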

Part 3: Visualization

3.1 Radar Charts for Player Profiles

def create_radar_chart(profile: Dict, title: Optional[str] = None) -> plt.Figure:
    """
    Create radar chart showing player's percentiles across categories.

    Args:
        profile: Player profile from comparison system
        title: Optional custom title

    Returns:
        Matplotlib Figure
    """
    categories = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'FG_PCT']
    available_cats = [c for c in categories if c in profile['percentiles']]

    values = [profile['percentiles'][c] for c in available_cats]
    values += values[:1]  # Close the polygon

    angles = np.linspace(0, 2 * np.pi, len(available_cats), endpoint=False).tolist()
    angles += angles[:1]

    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

    ax.plot(angles, values, 'o-', linewidth=2)
    ax.fill(angles, values, alpha=0.25)

    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(available_cats)
    ax.set_ylim(0, 100)

    # Add reference rings at the quartiles, drawn as radial gridlines
    ax.set_rgrids([25, 50, 75])
    ax.yaxis.grid(True, color='gray', linestyle='--', alpha=0.3)

    ax.set_title(title or f"{profile['name']} Statistical Profile", fontsize=14)

    return fig


def create_comparison_radar(system: StatisticalComparisonSystem,
                            players: List[str],
                            categories: Optional[List[str]] = None) -> plt.Figure:
    """
    Create overlaid radar charts comparing multiple players.

    Args:
        system: Initialized comparison system
        players: List of player names
        categories: Statistics to include

    Returns:
        Matplotlib Figure
    """
    if categories is None:
        categories = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'FG_PCT']

    fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))

    angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
    angles += angles[:1]

    colors = plt.cm.Set2(np.linspace(0, 1, len(players)))

    for player, color in zip(players, colors):
        try:
            profile = system.get_player_profile(player)
            values = [profile['percentiles'].get(c, 50) for c in categories]
            values += values[:1]

            ax.plot(angles, values, 'o-', linewidth=2, label=player, color=color)
            ax.fill(angles, values, alpha=0.1, color=color)
        except ValueError:
            continue

    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories)
    ax.set_ylim(0, 100)
    ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))

    ax.set_title('Player Comparison', fontsize=14)

    return fig

3.2 Archetype Fit Visualization

def visualize_archetype_fits(archetype_scores: Dict[str, Dict],
                             player_names: List[str]) -> plt.Figure:
    """
    Create heatmap showing archetype fit scores for multiple players.

    Args:
        archetype_scores: Dict mapping player names to their archetype scores
        player_names: List of players to include

    Returns:
        Matplotlib Figure
    """
    archetypes = list(ARCHETYPE_WEIGHTS.keys())

    data = []
    included = []  # Players that have scores, so rows and index stay aligned
    for player in player_names:
        if player in archetype_scores:
            row = [archetype_scores[player].get(arch, 0) for arch in archetypes]
            data.append(row)
            included.append(player)

    df = pd.DataFrame(data, index=included, columns=archetypes)

    fig, ax = plt.subplots(figsize=(12, 6))

    sns.heatmap(df, annot=True, fmt='.2f', cmap='RdYlGn', center=0,
                ax=ax, cbar_kws={'label': 'Composite Z-Score'})

    ax.set_title('Player Archetype Fit Scores', fontsize=14)
    ax.set_xlabel('Archetype')
    ax.set_ylabel('Player')

    plt.tight_layout()
    return fig

Part 4: Case Application

4.1 Cavaliers Free Agency Analysis

def run_cavaliers_analysis():
    """Execute the full Cavaliers free agency analysis."""

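    # Load league data (example structure).
    # In practice, load the full league from an actual data source; with only
    # these six rows, every z-score is relative to the candidate pool itself.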
    league_data = pd.DataFrame({
        'PLAYER_NAME': ['Candidate A', 'Candidate B', 'Candidate C',
                        'Candidate D', 'Candidate E', 'Candidate F'],
        'TEAM': ['Team1', 'Team2', 'Team3', 'Team4', 'Team5', 'Team6'],
        'POS': ['SG', 'SF', 'PF', 'C', 'SF', 'PF'],
        'GP': [65, 70, 58, 72, 68, 55],
        'MIN': [32.5, 28.4, 30.1, 26.8, 25.5, 29.2],
        'PTS': [18.2, 8.5, 14.8, 12.1, 15.3, 16.1],
        'REB': [4.2, 5.8, 7.2, 10.5, 6.1, 8.3],
        'AST': [5.5, 2.1, 2.8, 1.5, 3.2, 2.4],
        'STL': [1.2, 2.1, 0.8, 0.6, 1.8, 0.9],
        'BLK': [0.3, 0.5, 1.2, 2.8, 0.4, 1.1],
        'FG_PCT': [0.448, 0.425, 0.465, 0.582, 0.412, 0.445],
        'FG3_PCT': [0.362, 0.348, 0.402, 0.000, 0.385, 0.388],
        'FT_PCT': [0.865, 0.782, 0.825, 0.685, 0.798, 0.812],
        'TOV': [2.8, 1.2, 1.5, 1.8, 1.4, 1.6]
    })

    # Initialize system
    system = StatisticalComparisonSystem(league_data, min_games=50, min_minutes=20)

    print("=" * 60)
    print("CAVALIERS FREE AGENCY ANALYSIS")
    print("=" * 60)

    # Need 1: Secondary Scorer
    print("\n--- NEED 1: SECONDARY SCORER ---")
    scorer_candidates = ['Candidate A', 'Candidate C', 'Candidate F']
    scorer_rankings = find_best_fits(system, scorer_candidates, 'scorer')
    print(scorer_rankings.to_string(index=False))

    # Need 2: Rim Protector
    print("\n--- NEED 2: RIM PROTECTOR ---")
    rim_candidates = ['Candidate D', 'Candidate C']
    rim_rankings = find_best_fits(system, rim_candidates, 'rim_protector')
    print(rim_rankings.to_string(index=False))

    # Need 3: Floor Spacer
    print("\n--- NEED 3: FLOOR SPACER ---")
    spacer_candidates = ['Candidate C', 'Candidate E', 'Candidate F']
    spacer_rankings = find_best_fits(system, spacer_candidates, 'floor_spacer')
    print(spacer_rankings.to_string(index=False))

    # Full archetype analysis
    print("\n--- ARCHETYPE FIT MATRIX ---")
    all_candidates = league_data['PLAYER_NAME'].tolist()
    archetype_matrix = {}

    for player in all_candidates:
        try:
            archetype_matrix[player] = evaluate_player_archetypes(system, player)
        except ValueError:
            continue

    # Print summary
    for player, scores in archetype_matrix.items():
        best_fit = max(scores, key=scores.get)
        print(f"{player}: Best fit = {best_fit} (z = {scores[best_fit]:.2f})")

    return system, archetype_matrix


if __name__ == "__main__":
    run_cavaliers_analysis()

Discussion Questions

Question 1: Weighting Decisions

How would you determine the weights for each archetype? Should they be based on team needs, analytical importance, or something else?

Question 2: Sample Size

Some candidates have played significantly fewer games. How do you account for reliability in your comparisons?
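
One possible starting point is to shrink each player's average toward the league average, with less shrinkage the more games the player has logged. The sketch below uses a simple weighted-average form. The function name `shrink_toward_mean`, the prior weight `k` of 20 games, and the league average of 12.0 are all hypothetical choices that would need to be justified for each statistic.

```python
def shrink_toward_mean(player_avg: float, games: int,
                       league_avg: float, k: float = 20.0) -> float:
    """Blend a player's average with the league average.

    games is the weight on the player's own data; k is the weight on the
    league average (a hypothetical tuning constant, not an estimated value).
    """
    return (games * player_avg + k * league_avg) / (games + k)


league_avg = 12.0  # hypothetical league scoring average
short_sample = shrink_toward_mean(18.0, games=20, league_avg=league_avg)
long_sample = shrink_toward_mean(18.0, games=80, league_avg=league_avg)

print(f"18.0 PPG over 20 games -> {short_sample:.1f}")
print(f"18.0 PPG over 80 games -> {long_sample:.1f}")
```

The same 18.0 average is pulled to 15.0 after 20 games but only to 16.8 after 80, which reflects greater confidence in the larger sample.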

Question 3: Context Adjustments

Players on different teams face different situations. How might team context affect the validity of these comparisons?

Question 4: Combining Metrics

What are the limitations of combining multiple z-scores into a single composite score?


Deliverables

  1. Comparison System Code: Functional Python class for player comparison
  2. Visualization Suite: Radar charts and comparison visualizations
  3. Analysis Report: Written summary of findings for each Cavaliers need
  4. Recommendation Memo: Final recommendations with statistical justification
  5. Sensitivity Analysis: How conclusions change with different weighting schemes
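
For the sensitivity analysis deliverable, one simple approach is to recompute the composite under several weighting schemes and see whether the ranking holds. The sketch below is a minimal illustration: the z-scores and the three scheme names are hypothetical, and a real analysis would draw the z-scores from the comparison system.

```python
import pandas as pd

# Hypothetical z-scores for three candidates (illustrative values only)
z_scores = pd.DataFrame(
    {"PTS": [1.2, 0.4, 0.8], "FG_PCT": [-0.3, 1.1, 0.5], "AST": [0.9, -0.2, 0.1]},
    index=["Candidate A", "Candidate B", "Candidate C"],
)

# Alternative weighting schemes; each sums to 1.0
schemes = {
    "volume_heavy": {"PTS": 0.6, "FG_PCT": 0.2, "AST": 0.2},
    "balanced": {"PTS": 0.34, "FG_PCT": 0.33, "AST": 0.33},
    "efficiency_heavy": {"PTS": 0.2, "FG_PCT": 0.6, "AST": 0.2},
}

# Composite score per candidate under each scheme (weights align on column names)
composites = pd.DataFrame({
    name: z_scores.mul(pd.Series(w)).sum(axis=1) for name, w in schemes.items()
})

# Rank 1 = best fit under that scheme
ranks = composites.rank(ascending=False).astype(int)

print(composites.round(2))
print(ranks)
```

In this toy example Candidate A ranks first under the volume-heavy and balanced schemes but last under the efficiency-heavy scheme. A recommendation that flips this way should be flagged as dependent on the weighting choice.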

Key Takeaways

  1. Z-scores enable fair comparison across different statistical categories
  2. Composite scores can capture multi-dimensional player value
  3. Position adjustments account for different role expectations
  4. Visualization communicates complex statistical comparisons effectively
  5. Weighting decisions require basketball judgment, not just statistics