Chapter 5: Key Takeaways - Descriptive Statistics in Basketball

Executive Summary

Descriptive statistics form the foundation of basketball analytics, providing the tools to summarize, compare, and communicate player and team performance. This chapter covered the essential statistical measures that every basketball analyst must master: measures of central tendency, variability, distribution shape, correlation, and standardization techniques like z-scores.


Core Concepts

Measures of Central Tendency

What They Measure: The "typical" or "center" value in a dataset.

| Measure | Best Used When | Basketball Example |
| --- | --- | --- |
| Mean | Data is symmetric, no outliers | Average points per game |
| Median | Data is skewed or has outliers | Typical NBA salary |
| Mode | Finding most frequent category | Most common shot location |
| Weighted Mean | Combining rates with different sample sizes | Career shooting percentage |

Key Insight: NBA salary distributions are right-skewed (Mode < Median < Mean), so the median better represents the "typical" player salary than the mean.
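
To make the mean-versus-median gap concrete, here is a minimal sketch using only the Python standard library. The salary figures are invented for illustration and are not real NBA data; the variable names are likewise hypothetical.

```python
from statistics import mean, median

# Hypothetical salaries in millions of dollars (illustrative, not real data).
salaries = [1.1, 1.8, 2.0, 2.5, 3.0, 4.2, 8.5, 12.0, 30.0, 45.0]

salary_mean = mean(salaries)
salary_median = median(salaries)

print(f"mean   = {salary_mean:.2f}M")
print(f"median = {salary_median:.2f}M")

# Share of players earning less than the mean: in a right-skewed
# distribution, a few large contracts pull the mean above most players.
below_mean = sum(s < salary_mean for s in salaries) / len(salaries)
print(f"{below_mean:.0%} of players earn less than the mean")
```

In this made-up sample, two large contracts pull the mean well above the median, which is why the median is the safer description of a "typical" salary.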


Measures of Variability

What They Measure: How spread out the data is around the center.

| Measure | Formula | Interpretation |
| --- | --- | --- |
| Range | Max - Min | Total spread (sensitive to outliers) |
| IQR | Q3 - Q1 | Spread of middle 50% (robust) |
| Variance | Average squared deviation | Squared units |
| Standard Deviation | Square root of variance | Same units as data |
| Coefficient of Variation | (SD / Mean) x 100% | Relative variability |

Key Insight: If two players both average 20 PPG, the one with SD = 4 points is more consistent than the one with SD = 10 points (a coefficient of variation of 20% versus 50%).
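
The consistency comparison can be sketched with two invented 10-game scoring logs that share the same average. The logs and the helper name `spread` are assumptions made for this example only.

```python
from statistics import mean, stdev

# Hypothetical 10-game scoring logs; both players average exactly 20 points.
steady = [18, 22, 19, 21, 20, 23, 17, 20, 21, 19]
streaky = [8, 34, 12, 30, 20, 35, 6, 22, 25, 8]

def spread(points):
    """Return (mean, sample SD, coefficient of variation in percent)."""
    avg = mean(points)
    sd = stdev(points)  # sample SD: divides by n - 1
    return avg, sd, sd / avg * 100

for name, log in [("steady", steady), ("streaky", streaky)]:
    avg, sd, cv = spread(log)
    print(f"{name:8s} mean={avg:.1f}  SD={sd:.1f}  CV={cv:.0f}%")
```

Because the means match, the standard deviations can be compared directly; the coefficient of variation becomes more useful when the averages differ.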


Distribution Shape

Skewness: Measures asymmetry
  - Positive skewness (> 0): Long right tail (e.g., PPG distribution)
  - Negative skewness (< 0): Long left tail (rare in basketball)
  - Near zero: Approximately symmetric

Kurtosis: Measures tail heaviness
  - Excess kurtosis > 0: Heavy tails, more extreme values
  - Excess kurtosis < 0: Light tails, fewer extreme values

Key Insight: Most counting statistics (points, rebounds, assists) are right-skewed because many players have low values and few have very high values.
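
As a rough illustration, skewness and excess kurtosis can be computed directly from central moments. This sketch uses the simple moment-based (population) definitions, so its values will differ slightly from bias-corrected library estimates such as those pandas reports. The PPG values are invented.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: average of (x - mean)**k."""
    avg = fmean(values)
    return sum((x - avg) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical PPG values: many role players, a few high scorers.
ppg = [2.1, 3.4, 4.0, 4.8, 5.5, 6.2, 7.0, 8.1, 9.5, 11.0, 14.2, 18.5, 24.0, 30.1]
symmetric = [1, 2, 3, 4, 5, 6, 7]

print(f"PPG skewness        = {skewness(ppg):.2f}")
print(f"PPG excess kurtosis = {excess_kurtosis(ppg):.2f}")
print(f"symmetric skewness  = {skewness(symmetric):.2f}")
```

The invented PPG list has a long right tail, so its skewness comes out positive, while the evenly spaced list has zero skewness.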


Percentiles and Rankings

Understanding Percentiles:
  - 25th percentile (Q1): Better than 25% of players
  - 50th percentile (Median): Better than 50% of players
  - 75th percentile (Q3): Better than 75% of players
  - 90th percentile: Elite level (better than 90%)

Five-Number Summary: Min, Q1, Median, Q3, Max

Box Plot Outliers: Values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR

Key Insight: Being at the 85th percentile for scoring means outscoring 85% of NBA players, not scoring 85% of anything.
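
The five-number summary, the box-plot outlier rule, and a percentile rank can be sketched together. The PPG values are invented, and `percentile_rank` is a hypothetical helper that counts values strictly below a score; other percentile-rank conventions handle ties differently.

```python
from statistics import quantiles

# Hypothetical PPG values for 12 players (illustrative only).
ppg = [3.2, 5.1, 6.8, 8.0, 9.4, 10.5, 11.9, 13.3, 15.0, 17.6, 21.4, 33.5]

# Quartiles with linear interpolation between data points.
q1, med, q3 = quantiles(ppg, n=4, method="inclusive")
five_number = (min(ppg), q1, med, q3, max(ppg))

# Box-plot fences: anything outside them is flagged as an outlier.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ppg if x < lower_fence or x > upper_fence]

def percentile_rank(values, score):
    """Percent of values strictly below `score`."""
    return 100 * sum(v < score for v in values) / len(values)

print("five-number summary:", five_number)
print("outliers:", outliers)
print("percentile rank of 17.6 PPG:", percentile_rank(ppg, 17.6))
```

Here the 33.5 PPG scorer falls above the upper fence, and a 17.6 PPG scorer outscores 75% of this sample.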


Correlation and Relationships

Pearson Correlation (r):
  - r = +1: Perfect positive linear relationship
  - r = 0: No linear relationship
  - r = -1: Perfect negative linear relationship
  - |r| > 0.7: Strong relationship
  - |r| 0.3-0.7: Moderate relationship
  - |r| < 0.3: Weak relationship

Coefficient of Determination (R-squared): Proportion of variance in one variable explained by its linear relationship with the other (the square of Pearson's r)

Spearman Correlation: Use when data has outliers or relationship is monotonic but non-linear

Key Insight: Correlation does not imply causation. Three-point attempt volume may correlate with losing because trailing teams shoot more threes while playing catch-up, not because threes cause losses.
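
The difference between Pearson and Spearman shows up clearly on data that rises steadily but not in a straight line. The sketch below implements both by hand; the `ranks` helper assumes no tied values to stay short, and the data are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson r: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank from 1 (smallest). Assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical: minutes per game vs. a made-up stat that grows
# monotonically but not linearly with minutes.
minutes = [8, 12, 16, 20, 24, 28, 32, 36]
stat = [m ** 3 / 1000 for m in minutes]

pearson_r = pearson(minutes, stat)
spearman_r = spearman(minutes, stat)
print(f"Pearson  r = {pearson_r:.3f}")
print(f"Spearman r = {spearman_r:.3f}")
```

Spearman reports a perfect monotonic relationship here, while Pearson falls short of 1 because the points do not lie on a line.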


Z-Scores and Standardization

Z-Score Formula: z = (x - mean) / standard deviation

Interpretation:
  - z = 0: Exactly at the mean
  - z = +1: One standard deviation above average
  - z = +2: Two standard deviations above average (roughly the top 2.5% if the data are approximately normal)
  - z = -1: One standard deviation below average

Applications:
  1. Comparing different statistics on same scale
  2. Era-adjusted player comparisons
  3. Creating composite metrics
  4. Identifying outliers (|z| > 2 or 3)

Key Insight: Z-scores enable fair comparison of players across eras by measuring how much each stood out from their contemporaries.
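
An era-adjusted comparison might be sketched as follows. The two league samples and both players' scoring averages are invented, and a real analysis would use the full league population for each season.

```python
from statistics import mean, stdev

def z_score(value, population):
    """Standard deviations between `value` and the population mean."""
    return (value - mean(population)) / stdev(population)

# Hypothetical PPG samples from a high-scoring era and a low-scoring era.
era_a = [12, 15, 18, 20, 22, 24, 26, 28, 30, 35]
era_b = [8, 10, 11, 12, 13, 14, 15, 16, 18, 27]

player_a_ppg = 30  # played in era A
player_b_ppg = 27  # played in era B

z_a = z_score(player_a_ppg, era_a)
z_b = z_score(player_b_ppg, era_b)

print(f"Player A: {player_a_ppg} PPG, z = {z_a:.2f}")
print(f"Player B: {player_b_ppg} PPG, z = {z_b:.2f}")
```

Player A has the higher raw average, but Player B stands further above their own era's scoring level, which is the comparison the z-score is built to make.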


Essential Formulas

Mean = Sum of all values / Number of values

Median = Middle value when sorted (or average of two middle values)

Weighted Mean = Sum(weight_i * value_i) / Sum(weight_i)

Variance = Sum((x_i - mean)^2) / (n - 1)    [sample variance]

Standard Deviation = sqrt(Variance)

Coefficient of Variation = (Standard Deviation / Mean) x 100%

IQR = Q3 - Q1

Z-Score = (x - mean) / standard_deviation

Pearson Correlation = Cov(X,Y) / (SD_X * SD_Y)

R-squared = r^2

Common Pitfalls to Avoid

1. Using Simple Mean for Rates

Wrong: Averaging shooting percentages across seasons
Right: Using weighted mean with attempts as weights
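
A short sketch shows how far the two approaches can diverge. The three seasons of (made, attempted) totals are invented for illustration.

```python
# Hypothetical three-season shooting record: (made, attempted).
seasons = [(40, 100), (450, 1000), (30, 50)]

# Wrong: treat each season's percentage as equally important.
pcts = [made / att for made, att in seasons]
simple_mean = sum(pcts) / len(pcts)

# Right: weight each season's percentage by its attempts,
# Sum(weight_i * value_i) / Sum(weight_i).
attempts = [att for _, att in seasons]
weighted_mean = sum(w * p for w, p in zip(attempts, pcts)) / sum(attempts)

# Equivalent shortcut: total makes divided by total attempts.
career_pct = sum(made for made, _ in seasons) / sum(attempts)

print(f"simple mean   = {simple_mean:.3f}")
print(f"weighted mean = {weighted_mean:.3f}")
print(f"career pct    = {career_pct:.3f}")
```

The 50-attempt season shot 60%, so the simple mean overstates the career figure; weighting by attempts recovers total makes over total attempts.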

2. Ignoring Distribution Shape

Wrong: Assuming mean represents "typical" when data is skewed
Right: Using median for skewed distributions

3. Confusing Correlation with Causation

Wrong: "High assist totals cause wins"
Right: "High assist totals are associated with wins"

4. Comparing Raw Statistics Across Eras

Wrong: "Player A averaged more PPG than Player B, so A was better"
Right: "Player A had a higher z-score, standing out more from their peers"

5. Ignoring Sample Size

Wrong: Treating statistics from a 10-game sample and an 82-game sample as equally reliable
Right: Acknowledging uncertainty in small samples

6. Double-Counting Correlated Statistics

Wrong: Adding z-scores for FGA and PPG (highly correlated)
Right: Selecting independent statistics or adjusting for correlation


Decision Framework

Choosing a Central Tendency Measure

Is the data nominal (categories)?
    YES --> Use Mode
    NO --> Continue

Is the data heavily skewed, or does it contain outliers?
    YES --> Use Median
    NO --> Continue

Are you combining rates with different sample sizes?
    YES --> Use Weighted Mean
    NO --> Use Mean
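
The decision steps above can be sketched as a small helper. The function name `central_value` and its flags are hypothetical; the analyst still has to judge whether the data are nominal or skewed before calling it.

```python
from statistics import mean, median, mode

def central_value(values, *, nominal=False, skewed=False, weights=None):
    """Walk the decision steps in order and return (measure name, value)."""
    if nominal:                      # categories: use the mode
        return "mode", mode(values)
    if skewed:                       # skew or outliers: use the median
        return "median", median(values)
    if weights is not None:          # rates with different sample sizes
        total = sum(weights)
        return "weighted mean", sum(w * v for w, v in zip(weights, values)) / total
    return "mean", mean(values)      # symmetric data, no outliers

# Hypothetical inputs for each branch.
print(central_value(["corner 3", "rim", "rim"], nominal=True))
print(central_value([1.5, 2.0, 40.0], skewed=True))
print(central_value([0.40, 0.50], weights=[100, 300]))
print(central_value([10, 20, 30]))
```

Keeping the checks in the same order as the framework means that a skewed set of rates falls to the median before the weighting question is ever asked, exactly as the steps are written.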

Choosing a Variability Measure

Need to compare variability across different scales?
    YES --> Use Coefficient of Variation
    NO --> Continue

Does the data contain outliers or heavy skew?
    YES --> Use IQR
    NO --> Use Standard Deviation

Choosing a Correlation Method

Is the relationship linear?
    YES --> Use Pearson
    NO --> Continue

Is the relationship monotonic (consistently increasing/decreasing)?
    YES --> Use Spearman
    NO --> Consider other methods or transformations

Practical Applications Checklist

For Player Evaluation

  • [ ] Calculate relevant descriptive statistics (mean, median, SD)
  • [ ] Assess consistency using coefficient of variation
  • [ ] Determine percentile rankings within position/league
  • [ ] Convert to z-scores for cross-statistic comparison
  • [ ] Visualize distributions with histograms and box plots

For Team Analysis

  • [ ] Summarize team statistics with five-number summaries
  • [ ] Compare variability between teams
  • [ ] Analyze correlations between statistics and wins
  • [ ] Identify outlier performances
  • [ ] Create standardized composite metrics

For Historical Comparison

  • [ ] Calculate era-adjusted z-scores
  • [ ] Use percentile ranks within era
  • [ ] Account for pace differences
  • [ ] Consider rule changes affecting statistics
  • [ ] Document assumptions and limitations

Python Implementation Summary

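import pandas as pd
import numpy as np
from scipy import stats

# Assumes df is a pandas DataFrame with columns PPG, FG_PCT, FGA, and WIN_PCT,
# and player_ppg is the scoring average of the player being ranked.

# Central Tendency
mean = df['PPG'].mean()
median = df['PPG'].median()
mode = df['PPG'].mode()[0]  # first mode if tied; most useful for categorical columns
weighted_mean = np.average(df['FG_PCT'], weights=df['FGA'])  # weight by attempts

# Variability
std = df['PPG'].std()        # sample SD (divides by n - 1)
variance = df['PPG'].var()   # sample variance (divides by n - 1)
iqr = df['PPG'].quantile(0.75) - df['PPG'].quantile(0.25)
cv = (std / mean) * 100      # percent

# Distribution Shape
skewness = df['PPG'].skew()
kurtosis = df['PPG'].kurtosis()  # excess kurtosis (normal distribution = 0)

# Percentiles
percentile_85 = df['PPG'].quantile(0.85)
player_percentile = stats.percentileofscore(df['PPG'], player_ppg)

# Z-Scores
z_scores = (df['PPG'] - df['PPG'].mean()) / df['PPG'].std()

# Correlation
pearson_r = df['PPG'].corr(df['WIN_PCT'])  # Pearson by default
spearman_r = df['PPG'].corr(df['WIN_PCT'], method='spearman')

# Five-Number Summary (min, Q1, median, Q3, max)
summary = df['PPG'].quantile([0, 0.25, 0.5, 0.75, 1])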

Connection to Other Chapters

| Chapter | Connection |
| --- | --- |
| Chapter 4 (EDA) | Descriptive statistics guide visual exploration |
| Chapter 6 (Box Scores) | Apply these statistics to game-level data |
| Chapter 7 (Rate Statistics) | Understand per-minute and per-possession rates |
| Chapter 8 (Shooting) | Analyze shooting efficiency distributions |
| Chapter 11 (RAPM) | Z-scores used in regularized metrics |
| Chapter 12 (BPM/VORP) | Box score statistics standardized for comparison |
| Chapter 26 (ML) | Feature scaling uses z-score normalization |

Self-Assessment Questions

  1. Can you calculate and interpret all measures of central tendency?
  2. Do you know when to use mean vs. median vs. mode?
  3. Can you calculate and explain standard deviation and IQR?
  4. Do you understand what skewness and kurtosis tell you about data?
  5. Can you interpret percentile rankings correctly?
  6. Do you understand what correlation measures and its limitations?
  7. Can you calculate and interpret z-scores?
  8. Do you know how to create era-adjusted comparisons?
  9. Can you identify when weighted means are necessary?
  10. Do you understand the difference between Pearson and Spearman correlation?

Key Terminology

| Term | Definition |
| --- | --- |
| Central Tendency | Measures of the "center" or "typical" value |
| Variability/Dispersion | Measures of spread or scatter in data |
| Skewness | Asymmetry of distribution shape |
| Kurtosis | Tailedness of distribution |
| Percentile | Value below which a percentage of data falls |
| Quartile | 25th, 50th, or 75th percentile |
| IQR | Interquartile range (Q3 - Q1) |
| Z-Score | Number of standard deviations from mean |
| Correlation | Strength and direction of linear relationship |
| R-squared | Proportion of variance explained |

Summary

Descriptive statistics are not just mathematical calculations - they are tools for understanding and communicating the story within basketball data. By mastering these concepts, you can:

  1. Summarize player and team performance accurately
  2. Compare statistics across different scales and eras
  3. Identify exceptional performances and outliers
  4. Communicate findings clearly to different audiences
  5. Build the foundation for advanced analytics

The key is choosing the right tool for the job: understanding when means vs. medians are appropriate, when to use z-scores for comparison, and always remembering that correlation does not imply causation.


Before proceeding to Chapter 6, ensure you can confidently apply all these concepts to real basketball data. These fundamentals will be used throughout the remainder of this textbook.