
Chapter 5: Descriptive Statistics in Basketball

Introduction

Descriptive statistics form the backbone of basketball analytics, providing the fundamental tools needed to summarize, understand, and communicate insights from player and team performance data. Before we can build sophisticated predictive models or develop advanced metrics, we must first master the art of describing what the data tells us. This chapter establishes the statistical foundation upon which all basketball analytics work is built.

In the modern NBA, front offices, coaching staffs, and media analysts are inundated with data. Every game produces thousands of data points, from basic box score statistics to granular tracking data capturing player movements at 25 frames per second. Descriptive statistics allow us to distill this overwhelming volume of information into meaningful summaries that inform decision-making.

Consider the challenge facing a general manager evaluating potential free agent signings. Raw statistics tell only part of the story. A player averaging 18 points per game on one team may have vastly different value than a player averaging the same on another team. Descriptive statistics help us contextualize these numbers, compare players across different eras and playing environments, and identify the statistical signatures that distinguish elite performers from average ones.

This chapter covers the essential descriptive statistical concepts every basketball analyst must master: measures of central tendency and variability, percentile rankings, distribution shape analysis, correlation, standardization, and summary statistics. Throughout, we emphasize practical application to real basketball scenarios, providing both mathematical foundations and Python implementations.


5.1 Measures of Central Tendency

Central tendency measures identify the "typical" or "central" value in a dataset. In basketball analytics, these measures help us understand what constitutes normal performance and identify outliers who exceed or fall below expectations.

5.1.1 The Arithmetic Mean

The arithmetic mean, commonly called the average, is the most widely used measure of central tendency. It is calculated by summing all values and dividing by the count of observations.

Mathematical Definition:

For a dataset with $n$ observations $x_1, x_2, \ldots, x_n$, the sample mean is:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Basketball Application: Calculating Season Scoring Averages

When we report that a player averaged 27.4 points per game during the 2023-24 season, we are reporting the arithmetic mean of their game-by-game scoring totals. If a player scored the following points in their first five games: 32, 24, 28, 31, and 22, their average would be:

$$\bar{x} = \frac{32 + 24 + 28 + 31 + 22}{5} = \frac{137}{5} = 27.4 \text{ PPG}$$

Weighted Mean for Rate Statistics

In basketball, we frequently encounter situations requiring weighted averages. Consider calculating a player's true shooting percentage across multiple seasons with varying attempts. The weighted mean is defined as:

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

where $w_i$ represents the weight (e.g., number of attempts) for each observation $x_i$.

For example, if a player had a 58% true shooting percentage on 800 true shooting attempts in Year 1 and 62% on 1,200 attempts in Year 2, the weighted career average would be:

$$\bar{x}_w = \frac{(800 \times 0.58) + (1200 \times 0.62)}{800 + 1200} = \frac{464 + 744}{2000} = \frac{1208}{2000} = 0.604 = 60.4\%$$
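
In NumPy this is a one-liner: np.average accepts a weights argument. A minimal sketch using the hypothetical two-season figures above:

import numpy as np

# True shooting percentages and attempt counts for the two seasons above
ts_pct = np.array([0.58, 0.62])
attempts = np.array([800, 1200])

# np.average computes the attempt-weighted mean directly
career_ts = np.average(ts_pct, weights=attempts)
print(f"Weighted career TS%: {career_ts:.1%}")  # 60.4%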

Limitations of the Mean in Basketball Contexts

The arithmetic mean is sensitive to extreme values (outliers). A player who scores 50 points in one game and 10 points in the next four games averages 18 PPG, but this "average" misrepresents their typical performance. The mean is most informative when the underlying distribution is approximately symmetric.

5.1.2 The Median

The median is the middle value when observations are arranged in order. For datasets with an odd number of observations, it is the central value; for even-numbered datasets, it is the average of the two central values.

Mathematical Definition:

For ordered data $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$:

$$\text{Median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$

Basketball Application: Salary Analysis

NBA salary distributions are heavily right-skewed, with a small number of max-contract players earning substantially more than the majority. In the 2023-24 season, while the mean NBA salary was approximately $9.7 million, the median salary was closer to $4.5 million. The median better represents what a "typical" NBA player earns because it is not inflated by superstar contracts.

Consider a hypothetical five-player team with the following salaries (in millions): $2.1, $4.8, $8.2, $15.0, $45.0.

  • Mean salary: $\frac{2.1 + 4.8 + 8.2 + 15.0 + 45.0}{5} = \frac{75.1}{5} = \$15.02$ million
  • Median salary: $8.2 million (the third value when ordered)

The median ($8.2M) better represents the "typical" player's salary on this team than the mean ($15.02M), which is inflated by the max-contract player.

5.1.3 The Mode

The mode is the most frequently occurring value in a dataset. While less commonly used in continuous basketball statistics, it has applications in categorical data analysis.

Basketball Applications:

  1. Most common jersey numbers: Analyzing which jersey numbers are most popular in the league
  2. Most frequent shot locations: Identifying the coordinates from which players most often attempt shots
  3. Most common game outcomes: For point differential analysis, the mode identifies the most frequent margin of victory

Multimodal Distributions in Shot Charts

Modern shot chart analysis often reveals multimodal distributions. A player's shot location data might show modes at:

  • The restricted area (layups and dunks)
  • The corners (corner three-pointers)
  • Above the break (pull-up threes)

Identifying these modes helps coaches understand player tendencies and opposing defenses scheme accordingly.

5.1.4 Comparing Central Tendency Measures

The relationship between mean, median, and mode reveals information about distribution shape:

  • Symmetric distribution: Mean ≈ Median ≈ Mode
  • Right-skewed distribution: Mode < Median < Mean
  • Left-skewed distribution: Mean < Median < Mode

NBA scoring distributions are typically right-skewed: most players score relatively few points, while a small number of high-volume scorers pull the mean above the median.

import numpy as np
from scipy import stats

def central_tendency_analysis(data, label=""):
    """
    Calculate and compare measures of central tendency.

    Parameters:
    -----------
    data : array-like
        Numerical data to analyze
    label : str
        Description of the data for output formatting

    Returns:
    --------
    dict : Dictionary containing mean, median, and mode
    """
    data = np.array(data)

    mean_val = np.mean(data)
    median_val = np.median(data)
    mode_result = stats.mode(data, keepdims=True)
    mode_val = mode_result.mode[0]
    mode_count = mode_result.count[0]

    # Determine skewness direction from the mean-median relationship.
    # Note: this 5% heuristic assumes positive-valued data (true of most
    # basketball counting stats); it breaks down when the median is near zero.
    if mean_val > median_val * 1.05:
        skew_direction = "right-skewed"
    elif mean_val < median_val * 0.95:
        skew_direction = "left-skewed"
    else:
        skew_direction = "approximately symmetric"

    results = {
        'mean': mean_val,
        'median': median_val,
        'mode': mode_val,
        'mode_count': mode_count,
        'skew_direction': skew_direction
    }

    if label:
        print(f"\n=== Central Tendency Analysis: {label} ===")
        print(f"Mean: {mean_val:.2f}")
        print(f"Median: {median_val:.2f}")
        print(f"Mode: {mode_val:.2f} (occurs {mode_count} times)")
        print(f"Distribution appears to be {skew_direction}")

    return results

5.2 Measures of Variability

While central tendency tells us about typical values, measures of variability (or dispersion) describe how spread out the data is. In basketball, understanding variability is crucial for assessing consistency, predicting performance ranges, and evaluating risk.

5.2.1 Range

The range is the simplest measure of spread, calculated as the difference between maximum and minimum values.

$$\text{Range} = x_{\max} - x_{\min}$$

Basketball Application: Scoring Volatility

A player whose game scores range from 8 to 42 points (range = 34) is more volatile than one scoring between 18 and 28 points (range = 10), even if both average 23 PPG. The high-variance player might win you games with 40-point explosions but also contribute to losses with single-digit performances.

Limitations:

The range uses only two data points and is highly sensitive to outliers. A single career-high or injury-affected game can dramatically affect the range without representing typical variability.

5.2.2 Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of data, making it resistant to outliers.

$$\text{IQR} = Q_3 - Q_1$$

where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile.

Basketball Application: Consistent vs. Streaky Scorers

Consider two players with identical 20 PPG averages:

  • Player A (Consistent): Q1 = 16 points, Q3 = 24 points, IQR = 8 points
  • Player B (Streaky): Q1 = 10 points, Q3 = 28 points, IQR = 18 points

Player A delivers predictable production, while Player B oscillates between poor and excellent performances. For playoff rotations where reliable output is valued, Player A's lower IQR might make them preferable despite identical averages.

5.2.3 Variance

Variance measures the average squared deviation from the mean, providing a mathematically convenient measure of spread.

Population Variance:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Sample Variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

The sample variance uses $n-1$ in the denominator (Bessel's correction) to provide an unbiased estimate of the population variance.

Why Squared Deviations?

Squaring accomplishes two goals:

  1. Makes all deviations positive (avoiding cancellation)
  2. Penalizes larger deviations more heavily

Basketball Application: Team Scoring Variance

A team with high game-to-game scoring variance may struggle with consistency. Analyzing whether this variance stems from opponent quality, home/away splits, or rotation changes provides actionable insights for coaching staffs.

5.2.4 Standard Deviation

The standard deviation is the square root of variance, returning the measure of spread to the original units of measurement.

$$\sigma = \sqrt{\sigma^2} \quad \text{(population)} \qquad s = \sqrt{s^2} \quad \text{(sample)}$$

Interpreting Standard Deviation in Basketball

For approximately normal distributions:

  • About 68% of observations fall within 1 standard deviation of the mean
  • About 95% fall within 2 standard deviations
  • About 99.7% fall within 3 standard deviations

If NBA players average 15.2 PPG with a standard deviation of 6.8 points:

  • 68% of players score between 8.4 and 22.0 PPG
  • 95% of players score between 1.6 and 28.8 PPG
  • Players above 35.6 PPG (mean + 3 SD) are statistical outliers
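
As a quick sanity check of these intervals, the sketch below draws a synthetic, roughly normal league scoring distribution (the 15.2/6.8 figures are the illustrative values above, not real league data) and counts the share of observations inside each band:

import numpy as np

rng = np.random.default_rng(42)
# Synthetic league scoring data matching the illustrative mean and SD above
ppg = rng.normal(loc=15.2, scale=6.8, size=450)

mean, sd = ppg.mean(), ppg.std(ddof=1)
for k in (1, 2, 3):
    share = np.mean(np.abs(ppg - mean) <= k * sd)
    print(f"Within {k} SD ({mean - k*sd:.1f} to {mean + k*sd:.1f} PPG): {share:.1%}")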

5.2.5 Coefficient of Variation

The coefficient of variation (CV) expresses standard deviation as a percentage of the mean, enabling comparison of variability across different scales.

$$CV = \frac{s}{\bar{x}} \times 100\%$$

Basketball Application: Comparing Variability Across Statistics

Raw standard deviations cannot be directly compared for statistics measured on different scales. A 3-point standard deviation in scoring may represent less relative variability than a 1-point standard deviation in assists, because scoring typically has a much higher mean.

| Statistic | Mean | Std Dev | CV    |
|-----------|------|---------|-------|
| Points    | 15.2 | 6.8     | 44.7% |
| Rebounds  | 5.1  | 2.9     | 56.9% |
| Assists   | 3.4  | 2.3     | 67.6% |

The CV reveals that assists have the highest relative variability among these statistics.

import numpy as np

def variability_analysis(data, label=""):
    """
    Calculate comprehensive variability statistics.

    Parameters:
    -----------
    data : array-like
        Numerical data to analyze
    label : str
        Description of the data for output formatting

    Returns:
    --------
    dict : Dictionary containing all variability measures
    """
    data = np.array(data)

    range_val = np.max(data) - np.min(data)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    variance = np.var(data, ddof=1)  # Sample variance
    std_dev = np.std(data, ddof=1)   # Sample standard deviation
    mean_val = np.mean(data)
    cv = (std_dev / mean_val) * 100 if mean_val != 0 else np.nan

    results = {
        'range': range_val,
        'q1': q1,
        'q3': q3,
        'iqr': iqr,
        'variance': variance,
        'std_dev': std_dev,
        'cv': cv,
        'min': np.min(data),
        'max': np.max(data)
    }

    if label:
        print(f"\n=== Variability Analysis: {label} ===")
        print(f"Range: {range_val:.2f} (Min: {results['min']:.2f}, Max: {results['max']:.2f})")
        print(f"IQR: {iqr:.2f} (Q1: {q1:.2f}, Q3: {q3:.2f})")
        print(f"Variance: {variance:.2f}")
        print(f"Standard Deviation: {std_dev:.2f}")
        print(f"Coefficient of Variation: {cv:.1f}%")

    return results

5.3 Percentiles and Rankings in Player Evaluation

Percentiles transform raw statistics into rankings, enabling direct comparison of players and identification of elite performers.

5.3.1 Percentile Calculation

The $p$-th percentile is the value below which $p$ percent of observations fall.

Linear Interpolation Method:

For data sorted in ascending order with $n$ observations, the $p$-th percentile is calculated as:

  1. Calculate the rank: $r = \frac{p}{100} \times (n + 1)$
  2. If $r$ is an integer, the percentile is $x_r$
  3. If $r$ is not an integer, interpolate between $x_{\lfloor r \rfloor}$ and $x_{\lceil r \rceil}$
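
A minimal sketch of this convention follows; note that np.percentile defaults to a slightly different interpolation rule, and in recent NumPy versions passing method='weibull' reproduces the (n + 1) convention shown here.

import numpy as np

def percentile_interpolated(data, p):
    """p-th percentile using the (n + 1) linear interpolation convention."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    r = (p / 100) * (n + 1)   # rank, possibly fractional
    if r <= 1:                # clamp ranks that fall outside the data
        return x[0]
    if r >= n:
        return x[-1]
    lo = int(np.floor(r))     # 1-based rank just below r
    frac = r - lo
    return x[lo - 1] + frac * (x[lo] - x[lo - 1])

scores = [10, 14, 18, 22, 25, 28, 31]
print(percentile_interpolated(scores, 30))  # 15.6: 40% of the way from 14 to 18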

Basketball Application: Identifying Elite Performers

Percentile rankings contextualize raw statistics:

  • 90th percentile: Elite (top 10%)
  • 75th percentile: Above average (top 25%)
  • 50th percentile: League average
  • 25th percentile: Below average
  • 10th percentile: Poor (bottom 10%)

A player scoring 22 PPG might be at the 85th percentile, meaning they outscore 85% of NBA players. This contextualization is more meaningful than the raw number alone.

5.3.2 Quartiles and Five-Number Summary

The five-number summary provides a concise description of a distribution:

  1. Minimum ($Q_0$)
  2. First Quartile ($Q_1$, 25th percentile)
  3. Median ($Q_2$, 50th percentile)
  4. Third Quartile ($Q_3$, 75th percentile)
  5. Maximum ($Q_4$)

Basketball Application: Box Plots for Team Comparison

Box plots based on five-number summaries enable visual comparison of team distributions. Comparing the scoring distributions of playoff teams versus lottery teams might reveal that playoff teams have higher medians and smaller IQRs, indicating both better and more consistent performance.

5.3.3 Percentile Ranks in Player Comparison

When evaluating players, percentile ranks allow apples-to-apples comparisons:

| Player   | PPG  | PPG Percentile | APG | APG Percentile |
|----------|------|----------------|-----|----------------|
| Player A | 25.3 | 92nd           | 6.2 | 85th           |
| Player B | 18.4 | 78th           | 9.1 | 96th           |

Player A is an elite scorer (92nd percentile) and excellent passer (85th percentile). Player B is a very good scorer (78th percentile) and elite passer (96th percentile). Percentile ranks make these relative strengths immediately clear.

5.3.4 Deciles and Custom Groupings

Deciles divide data into 10 equal groups, useful for tiered analysis:

  • Top Decile (90-100%): Superstar tier
  • 80-90%: All-Star tier
  • 70-80%: Quality starter tier
  • 50-70%: Rotation player tier
  • 30-50%: End of bench tier
  • Below 30%: Replacement level or below

import numpy as np

def percentile_analysis(player_value, league_data, stat_name="statistic"):
    """
    Calculate percentile rank and identify tier for a player's statistic.

    Parameters:
    -----------
    player_value : float
        The player's value for the statistic
    league_data : array-like
        All player values in the league for comparison
    stat_name : str
        Name of the statistic for output formatting

    Returns:
    --------
    dict : Dictionary containing percentile rank and tier information
    """
    league_data = np.array(league_data)

    # Calculate percentile rank
    percentile_rank = (np.sum(league_data < player_value) / len(league_data)) * 100

    # Determine tier
    if percentile_rank >= 90:
        tier = "Elite (Top 10%)"
    elif percentile_rank >= 75:
        tier = "Above Average (Top 25%)"
    elif percentile_rank >= 50:
        tier = "Average (Top 50%)"
    elif percentile_rank >= 25:
        tier = "Below Average (Bottom 50%)"
    else:
        tier = "Poor (Bottom 25%)"

    # Calculate key percentile values for context
    context = {
        '25th': np.percentile(league_data, 25),
        '50th': np.percentile(league_data, 50),
        '75th': np.percentile(league_data, 75),
        '90th': np.percentile(league_data, 90)
    }

    results = {
        'player_value': player_value,
        'percentile_rank': percentile_rank,
        'tier': tier,
        'context': context
    }

    print(f"\n=== Percentile Analysis: {stat_name} ===")
    print(f"Player Value: {player_value:.2f}")
    print(f"Percentile Rank: {percentile_rank:.1f}th percentile")
    print(f"Tier: {tier}")
    print(f"\nLeague Context:")
    print(f"  25th percentile: {context['25th']:.2f}")
    print(f"  50th percentile: {context['50th']:.2f}")
    print(f"  75th percentile: {context['75th']:.2f}")
    print(f"  90th percentile: {context['90th']:.2f}")

    return results

5.4 Skewness and Kurtosis in Basketball Distributions

Distribution shape affects the validity of statistical methods and interpretation of results. Skewness and kurtosis quantify departures from normality.

5.4.1 Skewness

Skewness measures the asymmetry of a distribution around its mean.

Mathematical Definition (Fisher's Skewness):

$$g_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}} = \frac{m_3}{m_2^{3/2}}$$

Interpretation:

  • Skewness = 0: Symmetric distribution
  • Skewness > 0: Right-skewed (positive skew), long tail to the right
  • Skewness < 0: Left-skewed (negative skew), long tail to the left

Rules of Thumb:

  • |Skewness| < 0.5: Approximately symmetric
  • 0.5 ≤ |Skewness| < 1.0: Moderately skewed
  • |Skewness| ≥ 1.0: Highly skewed

Basketball Distributions and Their Typical Skewness:

| Distribution    | Typical Skewness                   | Explanation                           |
|-----------------|------------------------------------|---------------------------------------|
| Points per game | Right-skewed (+0.8 to +1.5)        | Few high scorers, many low scorers    |
| Free throw %    | Left-skewed (-0.5 to -1.0)         | Most players shoot well, few are poor |
| Player salaries | Highly right-skewed (+2.0 to +3.0) | Max contracts create long right tail  |
| Plus/minus      | Near symmetric (-0.2 to +0.2)      | Centered around zero by construction  |
| Minutes played  | Right-skewed (+0.5 to +1.0)        | Starters play more, bench less        |

5.4.2 Kurtosis

Kurtosis measures the "tailedness" of a distribution, indicating the frequency of extreme values.

Mathematical Definition (Excess Kurtosis):

$$g_2 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{2}} - 3 = \frac{m_4}{m_2^{2}} - 3$$

The "-3" adjustment makes the normal distribution have excess kurtosis of 0.

Interpretation:

  • Kurtosis = 0: Mesokurtic (normal-like tails)
  • Kurtosis > 0: Leptokurtic (heavy tails, more extreme values)
  • Kurtosis < 0: Platykurtic (light tails, fewer extreme values)

Basketball Application: Identifying Outlier-Prone Statistics

High positive kurtosis indicates a statistic has more extreme values than expected under normality. This is important for:

  1. Risk assessment: High-kurtosis performance metrics suggest occasional extreme performances (both positive and negative)
  2. Model selection: Heavy-tailed distributions may require robust statistical methods
  3. Outlier detection: High kurtosis warrants careful outlier examination

5.4.3 Normality Testing

Many statistical methods assume normally distributed data. Testing for normality helps determine appropriate analytical approaches.

Common Normality Tests:

  1. Shapiro-Wilk Test: Best for small to medium samples (n < 5000)
     - H₀: Data is normally distributed
     - Low p-value → Reject normality

  2. D'Agostino-Pearson Test: Combines skewness and kurtosis tests
     - More powerful for detecting specific departures from normality

  3. Anderson-Darling Test: Emphasizes tails of the distribution
     - Useful when tail behavior is important

Practical Considerations in Basketball Analytics:

Most basketball statistics are not normally distributed. However, normality assumptions become less critical with large samples due to the Central Limit Theorem. For small samples or when distributional assumptions matter, consider:

  • Non-parametric methods
  • Data transformations (log, square root)
  • Robust statistics

import numpy as np
from scipy import stats

def distribution_shape_analysis(data, label=""):
    """
    Analyze the shape of a distribution including skewness, kurtosis, and normality.

    Parameters:
    -----------
    data : array-like
        Numerical data to analyze
    label : str
        Description of the data for output formatting

    Returns:
    --------
    dict : Dictionary containing shape statistics and normality test results
    """
    data = np.array(data)

    # Calculate skewness and kurtosis
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)  # Excess kurtosis

    # Interpret skewness
    if abs(skewness) < 0.5:
        skew_interp = "approximately symmetric"
    elif abs(skewness) < 1.0:
        skew_interp = "moderately " + ("right" if skewness > 0 else "left") + "-skewed"
    else:
        skew_interp = "highly " + ("right" if skewness > 0 else "left") + "-skewed"

    # Interpret kurtosis
    if abs(kurtosis) < 0.5:
        kurt_interp = "mesokurtic (normal-like tails)"
    elif kurtosis > 0:
        kurt_interp = "leptokurtic (heavy tails, more outliers)"
    else:
        kurt_interp = "platykurtic (light tails, fewer outliers)"

    # Normality tests
    if len(data) >= 8:  # skewtest requires n >= 8; normaltest warns when n < 20
        shapiro_stat, shapiro_p = stats.shapiro(data[:5000])  # Shapiro-Wilk p-values unreliable above n = 5000
        dagostino_stat, dagostino_p = stats.normaltest(data)

        is_normal = shapiro_p > 0.05 and dagostino_p > 0.05
    else:
        shapiro_p = dagostino_p = np.nan
        is_normal = None

    results = {
        'skewness': skewness,
        'skew_interpretation': skew_interp,
        'kurtosis': kurtosis,
        'kurtosis_interpretation': kurt_interp,
        'shapiro_p': shapiro_p,
        'dagostino_p': dagostino_p,
        'is_normal': is_normal
    }

    if label:
        print(f"\n=== Distribution Shape Analysis: {label} ===")
        print(f"Skewness: {skewness:.3f} ({skew_interp})")
        print(f"Excess Kurtosis: {kurtosis:.3f} ({kurt_interp})")
        print(f"\nNormality Tests:")
        print(f"  Shapiro-Wilk p-value: {shapiro_p:.4f}")
        print(f"  D'Agostino-Pearson p-value: {dagostino_p:.4f}")
        if is_normal is not None:
            print(f"  Conclusion: {'Approximately normal' if is_normal else 'Non-normal distribution'}")

    return results

5.5 Correlation Analysis and Interpretation

Correlation measures the strength and direction of relationships between two variables. In basketball analytics, understanding correlations helps identify which statistics predict success and which tend to co-occur.

5.5.1 Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear relationship between two continuous variables.

Mathematical Definition:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\text{Cov}(X, Y)}{s_X s_Y}$$

Properties:

  • Range: -1 to +1
  • r = +1: Perfect positive linear relationship
  • r = -1: Perfect negative linear relationship
  • r = 0: No linear relationship

Interpretation Guidelines:

| |r|         | Interpretation |
|-------------|----------------|
| 0.00 - 0.19 | Negligible     |
| 0.20 - 0.39 | Weak           |
| 0.40 - 0.59 | Moderate       |
| 0.60 - 0.79 | Strong         |
| 0.80 - 1.00 | Very strong    |

5.5.2 Common Basketball Correlations

High Positive Correlations:

  • Points and field goal attempts (r ≈ 0.85): Volume scoring relationship
  • Minutes and points (r ≈ 0.75): Playing time enables production
  • Offensive rebounds and total rebounds (r ≈ 0.70): Rebounding skill carries over
  • Steals and deflections (r ≈ 0.80): Active hands metrics

Moderate Positive Correlations:

  • Assists and turnovers (r ≈ 0.55): Ball-handling role creates both
  • Win shares and salary (r ≈ 0.45): Market partially values production
  • Height and blocks (r ≈ 0.50): Physical attributes enable rim protection

Near-Zero or Negative Correlations:

  • Three-point percentage and three-point attempts (r ≈ 0.05): Volume independent of efficiency
  • Offensive rating and pace (r ≈ -0.10): Style independent of efficiency
  • Age and vertical leap (r ≈ -0.35): Athletic decline with age

5.5.3 Spearman Rank Correlation

When relationships are non-linear or data contains outliers, Spearman's rank correlation provides a more robust alternative.

Mathematical Definition:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between ranks of corresponding values.
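
To make the formula concrete, the sketch below computes $\rho$ from the rank differences directly, using hypothetical draft-pick and win-share values. This closed form assumes no tied ranks; scipy.stats.spearmanr handles ties and is the safer general-purpose choice.

import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical draft picks and career win shares for six players
picks = np.array([1, 3, 7, 15, 30, 45])
win_shares = np.array([110.0, 85.0, 60.0, 40.0, 12.0, 15.0])

d = rankdata(picks) - rankdata(win_shares)  # rank differences d_i
n = len(picks)
rho_manual = 1 - (6 * np.sum(d ** 2)) / (n * (n ** 2 - 1))

rho_scipy, p_value = spearmanr(picks, win_shares)
print(f"Manual rho: {rho_manual:.3f}")  # -0.943
print(f"scipy rho:  {rho_scipy:.3f}")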

When to Use Spearman Over Pearson:

  1. Ordinal data (e.g., draft position)
  2. Non-linear monotonic relationships
  3. Presence of outliers
  4. Highly skewed distributions

Basketball Application: Draft Position and Career Success

The relationship between draft position (1-60) and career win shares is non-linear. The difference between picks 1 and 5 is much larger than between picks 45 and 50. Spearman correlation captures this monotonic but non-linear relationship better than Pearson.

5.5.4 Correlation Does Not Imply Causation

This fundamental principle is especially important in basketball analytics:

Example: Shot Attempts and Losses

A team might show a positive correlation between three-point attempts and losses. This does not mean taking more threes causes losses. Rather, losing teams tend to shoot more threes when trailing late in games (playing catch-up). The correlation is confounded by game situation.

Common Confounders in Basketball Data:

  • Playing time (affects all counting statistics)
  • Team pace (affects per-game statistics)
  • Era effects (rule changes, playing style evolution)
  • Role/position (different expectations by role)
  • Sample size (small samples inflate correlations)

5.5.5 Partial Correlation

Partial correlation measures the relationship between two variables while controlling for a third.

$$r_{xy \cdot z} = \frac{r_{xy} - r_{xz} \cdot r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$$

Basketball Application: Offensive Rating and Experience

The correlation between offensive rating and experience might be positive (r = 0.30). However, controlling for minutes played (which affects both experience and performance), this relationship might weaken (partial r = 0.12), suggesting that the original correlation was partially confounded by playing time allocation.

import numpy as np
from scipy import stats

def correlation_analysis(x, y, x_name="Variable X", y_name="Variable Y"):
    """
    Perform comprehensive correlation analysis between two variables.

    Parameters:
    -----------
    x, y : array-like
        Two numerical arrays of equal length
    x_name, y_name : str
        Names of variables for output formatting

    Returns:
    --------
    dict : Dictionary containing correlation statistics
    """
    x, y = np.array(x), np.array(y)

    # Pearson correlation
    pearson_r, pearson_p = stats.pearsonr(x, y)

    # Spearman correlation
    spearman_rho, spearman_p = stats.spearmanr(x, y)

    # Interpret strength
    abs_r = abs(pearson_r)
    if abs_r < 0.20:
        strength = "negligible"
    elif abs_r < 0.40:
        strength = "weak"
    elif abs_r < 0.60:
        strength = "moderate"
    elif abs_r < 0.80:
        strength = "strong"
    else:
        strength = "very strong"

    direction = "positive" if pearson_r > 0 else "negative"

    # Coefficient of determination
    r_squared = pearson_r ** 2

    results = {
        'pearson_r': pearson_r,
        'pearson_p': pearson_p,
        'spearman_rho': spearman_rho,
        'spearman_p': spearman_p,
        'r_squared': r_squared,
        'strength': strength,
        'direction': direction
    }

    print(f"\n=== Correlation Analysis: {x_name} vs {y_name} ===")
    print(f"Pearson r: {pearson_r:.3f} (p = {pearson_p:.4f})")
    print(f"Spearman rho: {spearman_rho:.3f} (p = {spearman_p:.4f})")
    print(f"R-squared: {r_squared:.3f} ({r_squared*100:.1f}% variance explained)")
    print(f"Interpretation: {strength} {direction} relationship")

    return results


def partial_correlation(x, y, z):
    """
    Calculate partial correlation between x and y, controlling for z.

    Parameters:
    -----------
    x, y, z : array-like
        Three numerical arrays of equal length

    Returns:
    --------
    float : Partial correlation coefficient
    """
    r_xy = stats.pearsonr(x, y)[0]
    r_xz = stats.pearsonr(x, z)[0]
    r_yz = stats.pearsonr(y, z)[0]

    partial_r = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

    return partial_r

5.6 Summary Statistics for Basketball Data

Comprehensive summary statistics provide a complete picture of a dataset's characteristics at a glance.

5.6.1 The Standard Summary

A complete statistical summary includes:

  1. Sample size (n): Number of observations
  2. Central tendency: Mean, median, mode
  3. Variability: Standard deviation, variance, range, IQR
  4. Distribution shape: Skewness, kurtosis
  5. Position: Minimum, maximum, quartiles

5.6.2 Summary Statistics by Group

Basketball analysis often requires comparing groups:

  • Starters vs. bench players
  • Guards vs. forwards vs. centers
  • Playoff teams vs. lottery teams
  • Pre- and post-trade deadline periods

Example: Comparing Position Groups

| Statistic  | Guards | Forwards | Centers |
|------------|--------|----------|---------|
| n          | 180    | 150      | 95      |
| Mean PPG   | 12.4   | 11.8     | 10.2    |
| Median PPG | 10.2   | 9.8      | 8.5     |
| Std Dev    | 7.2    | 6.8      | 5.4     |
| Min        | 0.5    | 0.8      | 1.2     |
| Max        | 34.7   | 33.2     | 25.8    |

This summary reveals that guards have higher scoring (mean and median), more variability (std dev), and higher maximums, reflecting the guard-centric nature of modern NBA offense.

5.6.3 Aggregation and Weighted Statistics

When combining data from different sources or time periods, appropriate aggregation methods are essential.

Per-Game vs. Per-36 vs. Per-100 Possessions:

Each rate statistic answers a different question:

  • Per-game: What does this player actually produce?
  • Per-36 minutes: What would they produce with starter minutes?
  • Per-100 possessions: What do they produce adjusted for pace?

Calculating League Averages:

For league-wide statistics, totals-based calculations are more accurate than averaging player averages:

$$\text{League TS\%} = \frac{\sum \text{Total Points}}{\sum (2 \times \text{TSA})}$$

rather than averaging individual player TS% values.
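
A minimal illustration with three hypothetical players shows why: the totals-based figure weights each player by volume, while the naive average of player averages does not.

import numpy as np

# Hypothetical season totals: points and true shooting attempts (TSA)
points = np.array([1800, 900, 300])
tsa = np.array([1500, 800, 320])

league_ts = points.sum() / (2 * tsa.sum())  # totals-based
avg_of_avgs = np.mean(points / (2 * tsa))   # naive average of player TS%
print(f"Totals-based league TS%: {league_ts:.1%}")
print(f"Average of player TS%:   {avg_of_avgs:.1%}")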

5.6.4 Handling Missing Data

Basketball data frequently contains missing values:

  • Injured players missing games
  • Tracking data unavailable for some seasons
  • Statistics not recorded in earlier eras

Common Approaches:

  1. Listwise deletion: Remove observations with any missing values
     - Pro: Simple, unbiased if data is missing completely at random
     - Con: Reduces sample size, potentially biased

  2. Mean imputation: Replace missing values with the mean
     - Pro: Preserves sample size, simple
     - Con: Reduces variance, distorts relationships

  3. Multiple imputation: Generate multiple plausible values
     - Pro: Accounts for uncertainty
     - Con: Complex, requires assumptions

Best Practice in Basketball Analytics:

For rate statistics, consider using weighted averages that only include games played rather than imputing values for missed games. A player's true shooting percentage should reflect their actual performance, not estimated values for games they missed.
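
Following that practice, the sketch below computes a season TS% from a hypothetical game log in which missed games are recorded as NaN, rather than imputing values for them:

import numpy as np

# Hypothetical game log: points and TSA, with NaN for games missed
points = np.array([24.0, np.nan, 31.0, 18.0, np.nan, 27.0])
tsa = np.array([19.0, np.nan, 24.0, 15.0, np.nan, 21.0])

played = ~np.isnan(points)  # mask out missed games instead of imputing
season_ts = points[played].sum() / (2 * tsa[played].sum())
print(f"Season TS% over {int(played.sum())} games played: {season_ts:.1%}")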

import numpy as np
import pandas as pd
from scipy import stats

def comprehensive_summary(data, label="", decimals=2):
    """
    Generate comprehensive summary statistics for basketball data.

    Parameters:
    -----------
    data : array-like
        Numerical data to summarize
    label : str
        Description of the data
    decimals : int
        Number of decimal places for rounding

    Returns:
    --------
    dict : Dictionary containing all summary statistics
    """
    data = np.array(data, dtype=float)
    n_missing = int(np.sum(np.isnan(data)))  # Count missing values before dropping them
    data = data[~np.isnan(data)]             # Remove NaN values

    summary = {
        # Sample information
        'n': len(data),
        'missing': n_missing,

        # Central tendency
        'mean': np.mean(data),
        'median': np.median(data),
        'mode': stats.mode(data, keepdims=True).mode[0],

        # Variability
        'std': np.std(data, ddof=1),
        'variance': np.var(data, ddof=1),
        'range': np.max(data) - np.min(data),
        'iqr': np.percentile(data, 75) - np.percentile(data, 25),
        'cv': (np.std(data, ddof=1) / np.mean(data)) * 100 if np.mean(data) != 0 else np.nan,

        # Position
        'min': np.min(data),
        'q1': np.percentile(data, 25),
        'q3': np.percentile(data, 75),
        'max': np.max(data),

        # Distribution shape
        'skewness': stats.skew(data),
        'kurtosis': stats.kurtosis(data),

        # Additional percentiles
        'p10': np.percentile(data, 10),
        'p90': np.percentile(data, 90)
    }

    if label:
        print(f"\n{'='*50}")
        print(f"Summary Statistics: {label}")
        print(f"{'='*50}")
        print(f"\nSample Size: {summary['n']}")
        print(f"\nCentral Tendency:")
        print(f"  Mean:   {summary['mean']:.{decimals}f}")
        print(f"  Median: {summary['median']:.{decimals}f}")
        print(f"  Mode:   {summary['mode']:.{decimals}f}")
        print(f"\nVariability:")
        print(f"  Std Dev:  {summary['std']:.{decimals}f}")
        print(f"  Variance: {summary['variance']:.{decimals}f}")
        print(f"  Range:    {summary['range']:.{decimals}f}")
        print(f"  IQR:      {summary['iqr']:.{decimals}f}")
        print(f"  CV:       {summary['cv']:.1f}%")
        print(f"\nPosition (Five-Number Summary + Extended):")
        print(f"  Min:  {summary['min']:.{decimals}f}")
        print(f"  P10:  {summary['p10']:.{decimals}f}")
        print(f"  Q1:   {summary['q1']:.{decimals}f}")
        print(f"  Med:  {summary['median']:.{decimals}f}")
        print(f"  Q3:   {summary['q3']:.{decimals}f}")
        print(f"  P90:  {summary['p90']:.{decimals}f}")
        print(f"  Max:  {summary['max']:.{decimals}f}")
        print(f"\nDistribution Shape:")
        print(f"  Skewness: {summary['skewness']:.3f}")
        print(f"  Kurtosis: {summary['kurtosis']:.3f}")

    return summary


def grouped_summary(df, value_column, group_column):
    """
    Generate summary statistics grouped by a categorical variable.

    Parameters:
    -----------
    df : pandas DataFrame
        DataFrame containing the data
    value_column : str
        Name of the column containing values to summarize
    group_column : str
        Name of the column containing group labels

    Returns:
    --------
    pandas DataFrame : Summary statistics for each group
    """
    summary = df.groupby(group_column)[value_column].agg([
        ('n', 'count'),
        ('mean', 'mean'),
        ('median', 'median'),
        ('std', 'std'),
        ('min', 'min'),
        ('q1', lambda x: x.quantile(0.25)),
        ('q3', lambda x: x.quantile(0.75)),
        ('max', 'max')
    ]).round(2)

    return summary

5.7 Z-Scores and Standardization for Player Comparison

Standardization transforms raw statistics into a common scale, enabling fair comparison across different metrics and contexts.

5.7.1 Z-Score Calculation

A z-score expresses a value in terms of standard deviations from the mean.

Mathematical Definition:

$$z = \frac{x - \mu}{\sigma} \quad \text{(population)} \qquad z = \frac{x - \bar{x}}{s} \quad \text{(sample)}$$

Interpretation:

  • z = 0: Value equals the mean
  • z = +1: Value is 1 standard deviation above the mean
  • z = -1: Value is 1 standard deviation below the mean
  • z = +2: Value is 2 standard deviations above (approximately the 97.7th percentile)
  • z = -2: Value is 2 standard deviations below (approximately the 2.3rd percentile)

5.7.2 Basketball Applications of Z-Scores

Cross-Statistic Comparison:

Raw statistics cannot be directly compared. A player with 25 PPG and 5 APG might have similar z-scores for both if 25 PPG is one standard deviation above league average scoring while 5 APG is one standard deviation above average assists.

Example Calculation:

If the league averages 15.0 PPG with standard deviation 6.0:

  • Player with 27.0 PPG: z = (27.0 - 15.0) / 6.0 = +2.0

If the league averages 3.5 APG with standard deviation 2.0:

  • Player with 5.5 APG: z = (5.5 - 3.5) / 2.0 = +1.0

Despite the smaller raw difference in assists, the z-score reveals that the player's scoring is more exceptional relative to the league.

5.7.3 Creating Composite Scores

Z-scores enable creation of composite metrics by averaging standardized values.

Simple Composite:

$$\text{Composite} = \frac{z_{\text{PPG}} + z_{\text{RPG}} + z_{\text{APG}}}{3}$$

Weighted Composite:

$$\text{Composite} = w_1 z_{\text{PPG}} + w_2 z_{\text{RPG}} + w_3 z_{\text{APG}}$$

where weights $w_i$ sum to 1.

Position-Adjusted Z-Scores:

For fair player comparison, calculate z-scores within position groups. A center averaging 2.5 APG might have a negative z-score league-wide but a positive z-score among centers.

5.7.4 Era-Adjusted Statistics

Z-scores facilitate cross-era comparisons by standardizing relative to era-specific distributions.

Problem:

Wilt Chamberlain's 50.4 PPG in 1961-62 cannot be directly compared to modern scoring averages because pace, rules, and competition were different.

Solution:

Calculate era-specific z-scores:

$$z_{\text{era}} = \frac{x_{\text{player}} - \bar{x}_{\text{era}}}{s_{\text{era}}}$$

Chamberlain's 50.4 PPG when league average was ~25 PPG with SD ~10: $$z = \frac{50.4 - 25}{10} = +2.54$$

A modern 35 PPG scorer when league average is ~15 PPG with SD ~7: $$z = \frac{35 - 15}{7} = +2.86$$

Both are exceptional, but the modern scorer's slightly higher z-score suggests they are at least as dominant relative to their peers as Chamberlain was relative to his.

5.7.5 Min-Max Normalization

An alternative to z-scores, min-max normalization scales values to a fixed range (typically 0 to 1).

$$x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Advantages:

  • Bounded range, easy to interpret
  • Preserves zero values

Disadvantages:

  • Sensitive to outliers
  • New maximum values require rescaling
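
A minimal sketch of the transformation; note how the single hypothetical outlier (45 PPG) compresses every other value toward the bottom of the range:

import numpy as np

def min_max_normalize(data):
    """Scale values to the [0, 1] range."""
    data = np.asarray(data, dtype=float)
    return (data - data.min()) / (data.max() - data.min())

ppg = np.array([8, 12, 15, 19, 45])     # one outlier scorer
print(min_max_normalize(ppg).round(3))  # [0.    0.108 0.189 0.297 1.   ]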

5.7.6 Robust Standardization

For skewed distributions or data with outliers, robust standardization using median and IQR provides more stable results.

$$z_{\text{robust}} = \frac{x - \text{Median}}{\text{IQR}}$$

This approach is less affected by extreme values and better suited for heavily skewed basketball statistics like salaries or games started.

import numpy as np

def calculate_z_scores(player_values, league_values, stat_name=""):
    """
    Calculate z-scores for player values relative to league distribution.

    Parameters:
    -----------
    player_values : float or array-like
        Player's value(s) for the statistic
    league_values : array-like
        All league values for comparison
    stat_name : str
        Name of statistic for output

    Returns:
    --------
    float or array : Z-score(s)
    """
    league_values = np.array(league_values)
    league_mean = np.mean(league_values)
    league_std = np.std(league_values, ddof=1)

    z_scores = (np.array(player_values) - league_mean) / league_std

    return z_scores


def create_composite_score(z_scores, weights=None):
    """
    Create a composite score from multiple z-scores.

    Parameters:
    -----------
    z_scores : dict
        Dictionary with stat names as keys and z-scores as values
    weights : dict, optional
        Dictionary with stat names as keys and weights as values
        If None, equal weights are used

    Returns:
    --------
    float : Composite score
    """
    # Use a name that does not shadow the scipy.stats module used elsewhere
    stat_names = list(z_scores.keys())
    values = np.array([z_scores[stat] for stat in stat_names])

    if weights is None:
        weights_arr = np.ones(len(stat_names)) / len(stat_names)
    else:
        weights_arr = np.array([weights.get(stat, 0) for stat in stat_names])
        weights_arr = weights_arr / np.sum(weights_arr)  # Normalize to sum to 1

    composite = np.sum(values * weights_arr)

    return composite


def position_adjusted_z_score(player_value, position_values, position_name=""):
    """
    Calculate z-score relative to player's position group.

    Parameters:
    -----------
    player_value : float
        Player's value for the statistic
    position_values : array-like
        All values for players at the same position
    position_name : str
        Name of position for output

    Returns:
    --------
    float : Position-adjusted z-score
    """
    position_values = np.array(position_values)
    pos_mean = np.mean(position_values)
    pos_std = np.std(position_values, ddof=1)

    z_score = (player_value - pos_mean) / pos_std

    return z_score


def era_adjusted_comparison(player_stat, player_era_mean, player_era_std,
                            compare_stat, compare_era_mean, compare_era_std):
    """
    Compare players across eras using era-adjusted z-scores.

    Parameters:
    -----------
    player_stat : float
        First player's statistic value
    player_era_mean, player_era_std : float
        Mean and std dev of that stat in first player's era
    compare_stat : float
        Second player's statistic value
    compare_era_mean, compare_era_std : float
        Mean and std dev of that stat in second player's era

    Returns:
    --------
    tuple : (z1, z2, comparison_result)
    """
    z1 = (player_stat - player_era_mean) / player_era_std
    z2 = (compare_stat - compare_era_mean) / compare_era_std

    if z1 > z2:
        result = "Player 1 was more exceptional relative to their era"
    elif z2 > z1:
        result = "Player 2 was more exceptional relative to their era"
    else:
        result = "Both players were equally exceptional relative to their eras"

    return z1, z2, result


def robust_standardization(data, center='median', scale='iqr'):
    """
    Perform robust standardization using median and IQR.

    Parameters:
    -----------
    data : array-like
        Data to standardize
    center : str
        'median' or 'mean'
    scale : str
        'iqr' or 'mad' (median absolute deviation) or 'std'

    Returns:
    --------
    array : Robustly standardized values
    """
    data = np.array(data)

    if center == 'median':
        center_val = np.median(data)
    else:
        center_val = np.mean(data)

    if scale == 'iqr':
        scale_val = np.percentile(data, 75) - np.percentile(data, 25)
    elif scale == 'mad':
        scale_val = np.median(np.abs(data - np.median(data)))
    else:
        scale_val = np.std(data, ddof=1)

    standardized = (data - center_val) / scale_val

    return standardized

5.8 Practical Considerations in Basketball Statistical Analysis

5.8.1 Sample Size and Reliability

Statistical stability requires adequate sample sizes. Key considerations:

Rule of Thumb for Basketball Statistics:

| Statistic Type   | Minimum Games | Minimum Attempts |
|------------------|---------------|------------------|
| Points per game  | 20-25 games   | N/A              |
| Field goal %     | 15-20 games   | 150-200 attempts |
| Three-point %    | 25-30 games   | 100-150 attempts |
| Free throw %     | 15-20 games   | 75-100 attempts  |
| Advanced metrics | 40-50 games   | Varies           |

Small Sample Warning Signs:

  • Large confidence intervals
  • Inconsistent with career norms
  • Extreme percentile rankings
  • High volatility in rolling averages (see the sketch below)
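
One way to screen for that last warning sign is a rolling average over the game log. A minimal sketch using pandas and a synthetic game log:

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic game-by-game scoring for a streaky player
game_points = pd.Series(rng.normal(20, 9, size=30).clip(0).round(0))

rolling = game_points.rolling(window=5).mean()
print(f"Season average: {game_points.mean():.1f} PPG")
print(f"5-game rolling average ranges from {rolling.min():.1f} to {rolling.max():.1f} PPG")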

5.8.2 Context and Adjustment

Raw statistics must be interpreted in context:

  1. Team context: Players on bad teams may accumulate counting stats in losses
  2. Role context: Sixth men have different expectations than starters
  3. Era context: Rules and pace change statistical norms
  4. Opponent context: Schedule strength affects statistics

5.8.3 Data Quality Issues

Common data quality concerns:

  1. Recording errors: Rare but possible in official box scores
  2. Tracking data limitations: Optical tracking has edge cases
  3. Definition changes: What counts as an "assist" has evolved
  4. Missing data: Not all statistics available for all eras

5.8.4 Communicating Statistical Findings

Effective communication principles:

  1. Use appropriate precision: Report 15.2 PPG, not 15.1873 PPG
  2. Provide context: Percentile ranks or comparisons add meaning
  3. Acknowledge uncertainty: Confidence intervals when possible
  4. Avoid spurious precision: Small samples warrant rounded estimates

5.9 Chapter Summary

Descriptive statistics provide the foundation for all basketball analytics work. This chapter covered:

  1. Measures of Central Tendency: Mean, median, and mode each have appropriate uses depending on distribution shape and analysis goals. The mean is most common but sensitive to outliers; the median is robust; the mode is useful for categorical analysis.

  2. Measures of Variability: Range, IQR, variance, and standard deviation quantify spread. Understanding variability is essential for assessing player consistency and predicting performance ranges.

  3. Percentiles and Rankings: Transforming raw statistics into percentile ranks enables player comparison and identification of elite performers. The five-number summary and box plots provide visual distribution comparison.

  4. Distribution Shape: Skewness and kurtosis quantify departures from normality. Most basketball statistics are non-normal, affecting method selection and interpretation.

  5. Correlation Analysis: Pearson and Spearman correlations measure relationships between variables. Correlation does not imply causation, and confounding variables must be considered.

  6. Summary Statistics: Comprehensive summaries combine central tendency, variability, position, and shape measures. Grouped summaries enable comparison across categories.

  7. Standardization and Z-Scores: Converting raw statistics to z-scores enables cross-statistic and cross-era comparison. Composite scores combine multiple standardized metrics.

  8. Practical Considerations: Sample size, context, data quality, and communication all affect analytical validity and usefulness.

Mastering these fundamental concepts prepares analysts to tackle advanced topics including inferential statistics, predictive modeling, and causal analysis in subsequent chapters.

