Chapter 5: Key Takeaways - Descriptive Statistics in Basketball
Executive Summary
Descriptive statistics form the foundation of basketball analytics, providing the tools to summarize, compare, and communicate player and team performance. This chapter covered the essential statistical measures that every basketball analyst must master: measures of central tendency, variability, distribution shape, correlation, and standardization techniques like z-scores.
Core Concepts
Measures of Central Tendency
What They Measure: The "typical" or "center" value in a dataset.
| Measure | Best Used When | Basketball Example |
|---|---|---|
| Mean | Data is symmetric, no outliers | Average points per game |
| Median | Data is skewed or has outliers | Typical NBA salary |
| Mode | Finding most frequent category | Most common shot location |
| Weighted Mean | Combining rates with different sample sizes | Career shooting percentage |
Key Insight: NBA salary distributions are right-skewed (Mode < Median < Mean), so the median better represents the "typical" player salary than the mean.
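To make the mean-versus-median point concrete, here is a minimal sketch using made-up salary figures (the numbers are illustrative, not real NBA data). One very large contract pulls the mean well above the median.

```python
from statistics import mean, median

# Hypothetical salaries in millions of dollars (illustrative only).
salaries = [2, 2, 3, 4, 5, 8, 12, 45]

avg = mean(salaries)    # pulled upward by the single 45M contract
mid = median(salaries)  # average of the two middle values, 4 and 5

print(f"mean = {avg:.3f}M, median = {mid:.1f}M")
```

In this sample only two of the eight players earn more than the mean, which is why the median is the better description of a "typical" salary.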
Measures of Variability
What They Measure: How spread out the data is around the center.
| Measure | Formula | Interpretation |
|---|---|---|
| Range | Max - Min | Total spread (sensitive to outliers) |
| IQR | Q3 - Q1 | Spread of middle 50% (robust) |
| Variance | Average squared deviation | Squared units |
| Standard Deviation | Square root of variance | Same units as data |
| Coefficient of Variation | (SD / Mean) x 100% | Relative variability |
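Key Insight: Of two players who both average 20 PPG, the one with SD = 4 points is more consistent from game to game than the one with SD = 10 points. When the averages differ, compare the coefficient of variation instead.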
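The consistency comparison can be sketched with two hypothetical five-game scoring logs that share the same mean. The game logs are invented for illustration; `stdev` and `variance` from the standard library use the sample (n - 1) formulas.

```python
from statistics import mean, stdev, variance

# Hypothetical points scored in five games (illustrative only).
player_a = [18, 22, 20, 16, 24]
player_b = [8, 32, 20, 10, 30]

for name, games in (("A", player_a), ("B", player_b)):
    avg = mean(games)
    sd = stdev(games)          # sample standard deviation
    cv = sd / avg * 100        # coefficient of variation, in percent
    print(f"Player {name}: mean={avg}, SD={sd:.2f}, CV={cv:.1f}%")
```

Both players average 20 points, but Player A's smaller standard deviation marks the steadier scorer.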
Distribution Shape
Skewness: Measures asymmetry
- Positive skewness (> 0): Long right tail (e.g., PPG distribution)
- Negative skewness (< 0): Long left tail (rare in basketball)
- Near zero: Approximately symmetric

Kurtosis: Measures tail heaviness
- Excess kurtosis > 0: Heavy tails, more extreme values
- Excess kurtosis < 0: Light tails, fewer extreme values
Key Insight: Most counting statistics (points, rebounds, assists) are right-skewed because many players have low values and few have very high values.
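As a rough illustration of right skew, the sketch below computes a simple moment-based skewness on an invented PPG sample. The helper `moment_skewness` is a hypothetical name, and it applies no small-sample correction, so its values will differ slightly from the bias-adjusted figure that pandas `.skew()` reports.

```python
from statistics import mean, median

def moment_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    center = mean(values)
    m2 = sum((x - center) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - center) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical PPG values: many low scorers, one star (illustrative only).
ppg = [2, 3, 3, 4, 5, 6, 8, 12, 27]

skew = moment_skewness(ppg)
print(f"skewness = {skew:.2f}, mean = {mean(ppg):.2f}, median = {median(ppg)}")
```

A positive result together with a mean above the median is the usual signature of a long right tail.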
Percentiles and Rankings
Understanding Percentiles:
- 25th percentile (Q1): Better than 25% of players
- 50th percentile (Median): Better than 50% of players
- 75th percentile (Q3): Better than 75% of players
- 90th percentile: Elite level (better than 90%)

Five-Number Summary: Min, Q1, Median, Q3, Max

Box Plot Outliers: Values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
Key Insight: Being at the 85th percentile for scoring means outscoring 85% of NBA players, not scoring 85% of anything.
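The quartiles, outlier fences, and percentile rank described above can be sketched with the standard library. The data are invented, `percent_below` is a hypothetical helper, and the "inclusive" quartile method is one of several conventions, so other tools may give slightly different Q1 and Q3 values.

```python
from statistics import quantiles

# Hypothetical PPG values for ten players (illustrative only).
ppg = [4, 6, 7, 8, 9, 10, 11, 12, 14, 31]

q1, q2, q3 = quantiles(ppg, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ppg if x < lower_fence or x > upper_fence]

def percent_below(values, score):
    """Percentage of values strictly below the given score."""
    return 100 * sum(1 for v in values if v < score) / len(values)

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
print(f"12 PPG outscores {percent_below(ppg, 12):.0f}% of this sample")
```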
Correlation and Relationships
Pearson Correlation (r):
- r = +1: Perfect positive linear relationship
- r = 0: No linear relationship
- r = -1: Perfect negative linear relationship
- |r| > 0.7: Strong relationship
- |r| 0.3-0.7: Moderate relationship
- |r| < 0.3: Weak relationship

Coefficient of Determination (R-squared): Proportion of variance explained

Spearman Correlation: Use when the data have outliers or the relationship is monotonic but non-linear

Key Insight: Correlation does not imply causation. Three-point attempt volume may correlate with losing because trailing teams take more threes while playing catch-up, not because threes cause losses.
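To show why Spearman suits monotonic but non-linear relationships, the sketch below computes both coefficients by hand on a synthetic curve. The helpers `pearson` and `ranks` are hypothetical names, and `ranks` assumes there are no tied values.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Pearson r = Cov(X, Y) / (SD_X * SD_Y), using sample (n - 1) formulas."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

# Synthetic data: y always rises with x, but along a curve.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_r = pearson(ranks(x), ranks(y))  # Spearman is Pearson on the ranks

print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```

Spearman comes out at 1 because the ordering is perfectly preserved, while Pearson falls short of 1 because the points do not lie on a straight line.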
Z-Scores and Standardization
Z-Score Formula: z = (x - mean) / standard deviation
Interpretation:
- z = 0: Exactly at the mean
- z = +1: One standard deviation above average
- z = +2: Exceptionally above average (roughly the top 2.3% if the data are approximately normal)
- z = -1: One standard deviation below average

Applications:
1. Comparing different statistics on the same scale
2. Era-adjusted player comparisons
3. Creating composite metrics
4. Identifying outliers (|z| > 2 or 3)
Key Insight: Z-scores enable fair comparison of players across eras by measuring how much each stood out from their contemporaries.
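An era-adjusted comparison might look like the sketch below. The league means and standard deviations are invented placeholders, not historical figures, and `z_score` is a hypothetical helper that applies the formula above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical players and league context (illustrative only).
player_a = {"ppg": 30, "league_mean": 12, "league_sd": 6}
player_b = {"ppg": 27, "league_mean": 9, "league_sd": 5}

z_a = z_score(player_a["ppg"], player_a["league_mean"], player_a["league_sd"])
z_b = z_score(player_b["ppg"], player_b["league_mean"], player_b["league_sd"])

print(f"Player A: z = {z_a:.1f}, Player B: z = {z_b:.1f}")
```

Player A has the higher raw average, but under these assumed league conditions Player B stood out more from contemporaries.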
Essential Formulas
Mean = Sum of all values / Number of values
Median = Middle value when sorted (or average of two middle values)
Weighted Mean = Sum(weight_i * value_i) / Sum(weight_i)
Variance = Sum((x_i - mean)^2) / (n - 1) [sample variance]
Standard Deviation = sqrt(Variance)
Coefficient of Variation = (Standard Deviation / Mean) x 100%
IQR = Q3 - Q1
Z-Score = (x - mean) / standard_deviation
Pearson Correlation = Cov(X,Y) / (SD_X * SD_Y)
R-squared = r^2
Common Pitfalls to Avoid
1. Using Simple Mean for Rates
- Wrong: Averaging shooting percentages across seasons
- Right: Using weighted mean with attempts as weights
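A short sketch shows how far the two approaches can diverge when sample sizes differ. The season totals are invented for illustration.

```python
# Hypothetical (makes, attempts) per season (illustrative only).
seasons = [(10, 20), (300, 1000)]

# Wrong: treat each season's percentage as equally informative.
simple_mean = sum(made / att for made, att in seasons) / len(seasons)

# Right: weight each percentage by its attempts, which reduces to
# total makes divided by total attempts.
total_made = sum(made for made, _ in seasons)
total_att = sum(att for _, att in seasons)
weighted_mean = total_made / total_att

print(f"simple = {simple_mean:.3f}, weighted = {weighted_mean:.3f}")
```

The 20-attempt season counts as much as the 1,000-attempt season in the simple mean, which inflates the result to .400 against a true career mark near .304.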
2. Ignoring Distribution Shape
- Wrong: Assuming the mean represents "typical" when data is skewed
- Right: Using the median for skewed distributions
3. Confusing Correlation with Causation
- Wrong: "High assist totals cause wins"
- Right: "High assist totals are associated with wins"
4. Comparing Raw Statistics Across Eras
- Wrong: "Player A averaged more PPG than Player B, so A was better"
- Right: "Player A had a higher z-score, standing out more from their peers"
5. Ignoring Sample Size
- Wrong: Treating statistics from a 10-game sample and an 82-game sample as equally reliable
- Right: Acknowledging the greater uncertainty in small samples
6. Double-Counting Correlated Statistics
- Wrong: Adding z-scores for FGA and PPG (highly correlated)
- Right: Selecting independent statistics or adjusting for correlation
Decision Framework
Choosing a Central Tendency Measure
- Is the data nominal (categories)?
  - YES --> Use Mode
  - NO --> Continue
- Is the data heavily skewed, or does it contain outliers?
  - YES --> Use Median
  - NO --> Continue
- Are you combining rates with different sample sizes?
  - YES --> Use Weighted Mean
  - NO --> Use Mean
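The decision steps above can be sketched as a small function. The name `choose_center` and its boolean parameters are hypothetical; the analyst still has to judge skew and outliers before calling it.

```python
def choose_center(is_nominal, skewed_or_outliers, combining_rates):
    """Walk the central-tendency decision steps in order."""
    if is_nominal:
        return "mode"
    if skewed_or_outliers:
        return "median"
    if combining_rates:
        return "weighted mean"
    return "mean"

# Example: shot location is categorical, salaries are skewed,
# career FG% combines seasons with different attempt counts.
print(choose_center(True, False, False))    # shot location
print(choose_center(False, True, False))    # salaries
print(choose_center(False, False, True))    # career FG%
print(choose_center(False, False, False))   # symmetric numeric data
```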
Choosing a Variability Measure
- Need to compare variability across different scales?
  - YES --> Use Coefficient of Variation
  - NO --> Continue
- Does the data contain outliers or heavy skew?
  - YES --> Use IQR
  - NO --> Use Standard Deviation
Choosing a Correlation Method
- Is the relationship linear?
  - YES --> Use Pearson
  - NO --> Continue
- Is the relationship monotonic (consistently increasing/decreasing)?
  - YES --> Use Spearman
  - NO --> Consider other methods or transformations
Practical Applications Checklist
For Player Evaluation
- [ ] Calculate relevant descriptive statistics (mean, median, SD)
- [ ] Assess consistency using coefficient of variation
- [ ] Determine percentile rankings within position/league
- [ ] Convert to z-scores for cross-statistic comparison
- [ ] Visualize distributions with histograms and box plots
For Team Analysis
- [ ] Summarize team statistics with five-number summaries
- [ ] Compare variability between teams
- [ ] Analyze correlations between statistics and wins
- [ ] Identify outlier performances
- [ ] Create standardized composite metrics
For Historical Comparison
- [ ] Calculate era-adjusted z-scores
- [ ] Use percentile ranks within era
- [ ] Account for pace differences
- [ ] Consider rule changes affecting statistics
- [ ] Document assumptions and limitations
Python Implementation Summary
```python
import pandas as pd
import numpy as np
from scipy import stats

# Assumes df is a DataFrame with PPG, FG_PCT, FGA, and WIN_PCT columns,
# and player_ppg is a single player's scoring average.

# Central Tendency
mean = df['PPG'].mean()
median = df['PPG'].median()
mode = df['PPG'].mode()[0]  # mode() returns a Series; take the first value
weighted_mean = np.average(df['FG_PCT'], weights=df['FGA'])

# Variability
std = df['PPG'].std()       # sample standard deviation (ddof=1)
variance = df['PPG'].var()  # sample variance (ddof=1)
iqr = df['PPG'].quantile(0.75) - df['PPG'].quantile(0.25)
cv = (std / mean) * 100

# Distribution Shape
skewness = df['PPG'].skew()
kurtosis = df['PPG'].kurtosis()  # excess kurtosis (normal distribution = 0)

# Percentiles
percentile_85 = df['PPG'].quantile(0.85)
player_percentile = stats.percentileofscore(df['PPG'], player_ppg)

# Z-Scores
z_scores = (df['PPG'] - df['PPG'].mean()) / df['PPG'].std()

# Correlation
pearson_r = df['PPG'].corr(df['WIN_PCT'])
spearman_r = df['PPG'].corr(df['WIN_PCT'], method='spearman')

# Five-Number Summary (describe() also reports count, mean, and std)
summary = df['PPG'].describe()
```
Connection to Other Chapters
| Chapter | Connection |
|---|---|
| Chapter 4 (EDA) | Descriptive statistics guide visual exploration |
| Chapter 6 (Box Scores) | Apply these statistics to game-level data |
| Chapter 7 (Rate Statistics) | Understand per-minute and per-possession rates |
| Chapter 8 (Shooting) | Analyze shooting efficiency distributions |
| Chapter 11 (RAPM) | Z-scores used in regularized metrics |
| Chapter 12 (BPM/VORP) | Box score statistics standardized for comparison |
| Chapter 26 (ML) | Feature scaling uses z-score normalization |
Self-Assessment Questions
- Can you calculate and interpret all measures of central tendency?
- Do you know when to use mean vs. median vs. mode?
- Can you calculate and explain standard deviation and IQR?
- Do you understand what skewness and kurtosis tell you about data?
- Can you interpret percentile rankings correctly?
- Do you understand what correlation measures and its limitations?
- Can you calculate and interpret z-scores?
- Do you know how to create era-adjusted comparisons?
- Can you identify when weighted means are necessary?
- Do you understand the difference between Pearson and Spearman correlation?
Key Terminology
| Term | Definition |
|---|---|
| Central Tendency | Measures of the "center" or "typical" value |
| Variability/Dispersion | Measures of spread or scatter in data |
| Skewness | Asymmetry of distribution shape |
| Kurtosis | Tailedness of distribution |
| Percentile | Value below which a percentage of data falls |
| Quartile | 25th, 50th, or 75th percentile |
| IQR | Interquartile range (Q3 - Q1) |
| Z-Score | Number of standard deviations from mean |
| Correlation | Strength and direction of linear relationship |
| R-squared | Proportion of variance explained |
Summary
Descriptive statistics are not just mathematical calculations - they are tools for understanding and communicating the story within basketball data. By mastering these concepts, you can:
- Summarize player and team performance accurately
- Compare statistics across different scales and eras
- Identify exceptional performances and outliers
- Communicate findings clearly to different audiences
- Build the foundation for advanced analytics
The key is choosing the right tool for the job: understanding when means vs. medians are appropriate, when to use z-scores for comparison, and always remembering that correlation does not imply causation.
Before proceeding to Chapter 6, ensure you can confidently apply all these concepts to real basketball data. These fundamentals will be used throughout the remainder of this textbook.