Chapter 6 Key Takeaways: Descriptive Statistics for Sports


Central Tendency

  1. The mean is not always the best summary. In sports data with outliers (blowout games, injury-shortened appearances), the median or trimmed mean often provides a more representative measure of typical performance. Always examine the distribution shape before selecting a central tendency measure.

  2. Weighted averages reflect reality better than simple averages. When combining statistics across different contexts (home vs. away, opponent quality, pitch types), weight by sample size or relevance. A player's batting average against fastballs matters more if 60% of pitches faced are fastballs.

  3. Moving averages capture form and momentum. A 5-game moving average responds to recent trends more quickly than a season-long average. When sportsbooks are slow to update lines, moving averages can reveal where a team's current performance diverges from the line.

  4. The mode identifies "anchor" performances. For player props, the mode tells you the most common outcome. A player whose mode is 22 points is most likely to score around 22, even if their mean is 25 due to occasional 40-point explosions.
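
The four measures above can be compared on one small sample. The following is a minimal sketch using only Python's standard library; the game log, the pitch-type batting averages, and the helper names (trimmed_mean, weighted_average, moving_average) are hypothetical and chosen for illustration, not taken from the chapter.

```python
from statistics import mean, median, mode

# Hypothetical points scored by one player over 10 games (illustrative only).
points = [22, 19, 22, 41, 22, 24, 18, 40, 22, 20]


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each tail."""
    k = int(len(values) * proportion)
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


def weighted_average(values, weights):
    """Average of `values`, each weighted by the matching entry in `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def moving_average(values, window=5):
    """Mean of each run of `window` consecutive values, oldest first."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


print("mean:", mean(points))
print("median:", median(points))
print("mode:", mode(points))
print("10% trimmed mean:", trimmed_mean(points))
print("5-game moving average:", moving_average(points))

# Hypothetical batting averages by pitch type (fastball, other),
# weighted by the share of pitches faced.
print("weighted batting average:", weighted_average([0.300, 0.220], [0.60, 0.40]))
```

In this sample the two 40-point games pull the mean up to 25 while the median and mode both stay at 22, which is the pattern described in item 4.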


Variability

  1. Standard deviation measures consistency. Two teams can have identical means but vastly different standard deviations. A team averaging 24 PPG with SD = 4 is far more predictable than one averaging 24 PPG with SD = 12. Consistency affects betting confidence.

  2. The coefficient of variation (CV) enables cross-sport comparisons. Because raw standard deviations depend on the scale of measurement, the CV (SD/Mean) allows you to compare variability across sports. NFL scoring (CV around 0.15-0.20) tends to be more variable relative to its mean than NBA scoring (CV around 0.08-0.12).

  3. The IQR is robust to outliers. When analyzing player or team performance, the IQR captures the "typical range" without being distorted by extreme games. Use it when the data contains blowouts, stat-padding in garbage time, or injury-shortened performances.

  4. Rolling volatility reveals regime changes. A team's consistency can change dramatically over a season due to injuries, lineup changes, or schedule difficulty. Tracking rolling standard deviation identifies when a team enters a high-variance or low-variance phase, creating betting opportunities.
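
As a rough illustration of these spread measures, the sketch below compares two hypothetical teams that share the same scoring average. The data and the helper names (coefficient_of_variation, iqr, rolling_stdev) are assumptions made for this example; quartiles come from statistics.quantiles with its default method, and other quartile conventions give slightly different IQR values.

```python
from statistics import mean, quantiles, stdev

# Hypothetical points per game for two teams that both average 24 (illustrative only).
team_a = [20, 24, 28, 22, 26, 24, 21, 27]
team_b = [10, 38, 24, 7, 41, 24, 13, 35]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def rolling_stdev(values, window=4):
    """Sample standard deviation of each run of `window` consecutive games."""
    return [stdev(values[i:i + window]) for i in range(len(values) - window + 1)]


for name, games in (("Team A", team_a), ("Team B", team_b)):
    print(name)
    print("  mean:", mean(games))
    print("  SD:", round(stdev(games), 2))
    print("  CV:", round(coefficient_of_variation(games), 3))
    print("  IQR:", iqr(games))
    print("  rolling SD:", [round(s, 2) for s in rolling_stdev(games)])
```

Both teams average 24, but Team B's standard deviation, CV, and IQR are all several times larger, so its results are much harder to anticipate from the mean alone.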


Distribution Shape

  1. Skewness affects over/under betting. In a right-skewed scoring distribution, the mean exceeds the median. An over/under line set at the mean will therefore go over less than 50% of the time, but by larger margins when it does. Understanding skewness helps identify which side offers better expected value.

  2. Kurtosis measures tail risk. Leptokurtic distributions (heavy tails) produce more extreme outcomes than normal models predict. If a sportsbook prices alternate lines using a normal distribution but the true distribution is leptokurtic, the extreme lines will be systematically mispriced.

  3. NFL game margins are multimodal, not normal. The peaks at margins of 3 and 7 (key numbers) make the normal distribution a poor fit for NFL game margins. This creates value around key number spreads, where small line differences carry outsized importance.

  4. Always check for normality before using normal-based methods. The empirical rule (68-95-99.7) applies only to approximately normal distributions. Use Q-Q plots, histograms, or the Shapiro-Wilk test before relying on normality assumptions.
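
The skewness and kurtosis points can be checked numerically. The sketch below is one possible implementation using population (divide-by-n) moments, matching the expectation-based formulas in the Quick Reference; the game totals and function names are hypothetical, and libraries that apply small-sample corrections will report slightly different values.

```python
from statistics import mean, median

# Hypothetical game totals with one very high-scoring game (illustrative only).
totals = [38, 40, 41, 41, 43, 44, 44, 45, 47, 77]


def central_moment(values, order):
    """Average of (x - mean) ** order across the sample."""
    m = mean(values)
    return sum((x - m) ** order for x in values) / len(values)


def skewness(values):
    """Third central moment divided by the cubed population SD."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3 (normal = 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def share_above(values, threshold):
    """Fraction of observations strictly greater than `threshold`."""
    return sum(1 for x in values if x > threshold) / len(values)


print("mean:", mean(totals))
print("median:", median(totals))
print("skewness:", round(skewness(totals), 3))
print("excess kurtosis:", round(excess_kurtosis(totals), 3))
print("share of games above the mean:", share_above(totals, mean(totals)))
```

With this sample the mean (46) sits above the median (43.5), and only 2 of the 10 games finish above the mean, which is the over/under asymmetry described in item 1.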


Correlation and Relationships

  1. Correlation does not imply causation, especially in sports. The classic example: NFL rushing yards correlate with wins because winning teams run to kill the clock, not because rushing causes winning. Always consider reverse causation and confounding variables.

  2. Point differential is the statistic most strongly correlated with winning. Across the major sports, net rating or point differential typically explains 85-95% of the variance in winning percentage. This is the foundation of Pythagorean win expectation models.

  3. Correlation strength determines predictive value. As a rule of thumb, variables with |r| > 0.7 are strong predictors, those with |r| between 0.4 and 0.7 are moderate, and those below 0.4 are weak. For betting models, prioritize strongly correlated variables and be skeptical of weak correlations, which may be noise.

  4. Autocorrelation tests whether streaks are real. If game-to-game performance shows meaningful positive autocorrelation, streaks carry information and should be factored into betting. If autocorrelation is near zero, recent results do not predict the next game beyond what the season average tells you.
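
A minimal sketch of the two calculations behind these points follows: Pearson's r (and r squared, the share of variance explained) and a simple lag-1 autocorrelation. The team data and the game-by-game series are hypothetical, and the autocorrelation uses one common estimator (lagged cross-products over the total sum of squares); other definitions exist.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cross / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def lag1_autocorrelation(values):
    """Correlation of each game with the next one, around the overall mean."""
    m = mean(values)
    lagged = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    return lagged / sum((x - m) ** 2 for x in values)


# Hypothetical season summaries for six teams (illustrative only).
point_diff = [-8.0, -3.5, -1.0, 1.5, 4.0, 7.5]
win_pct = [0.25, 0.38, 0.47, 0.55, 0.62, 0.74]

r = pearson_r(point_diff, win_pct)
print("r:", round(r, 3), "r squared:", round(r * r, 3))

# Hypothetical game-by-game scores: one steadily trending, one alternating.
trending = [1, 2, 3, 4, 5, 6]
alternating = [1, 0, 1, 0, 1, 0]
print("lag-1 autocorrelation (trending):", lag1_autocorrelation(trending))
print("lag-1 autocorrelation (alternating):", lag1_autocorrelation(alternating))
```

A positive lag-1 value, as in the trending series, means a good game tends to follow a good game; a value near zero would mean the previous game adds little beyond the season average.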


Visualization

  1. Choose the right chart for the right question. Use histograms for distribution shape, box plots for group comparisons, scatter plots for relationships, line charts for trends over time, and heatmaps for correlation matrices. The wrong visualization can hide important patterns.

  2. Box plots efficiently compare multiple groups. When comparing home vs. away, or multiple players for prop bets, side-by-side box plots immediately reveal differences in center, spread, and outliers that tables of numbers cannot convey as quickly.

  3. Overlay reference lines on all betting visualizations. Always mark the relevant betting line (spread, total, prop) on your visualizations. This immediately shows where the line falls relative to the distribution and how much of the data falls on each side.

  4. Density plots beat histograms for comparing distributions. When overlaying multiple distributions (e.g., different eras of scoring, home vs. away), kernel density estimates avoid the binning artifacts of histograms and produce cleaner comparisons.
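
The reference-line idea in item 3 comes down to a simple count of how much data falls on each side of the line. The sketch below computes that split and prints a rough text histogram with the line's bin flagged; it is a stand-in for a plotted chart, not a replacement for one. The game totals, the 215.5 line, and the function names are hypothetical, and the histogram helper assumes integer data.

```python
def split_around_line(values, line):
    """Share of observations over, under, and exactly on a betting line."""
    n = len(values)
    over = sum(1 for v in values if v > line)
    under = sum(1 for v in values if v < line)
    return {"over": over / n, "under": under / n, "push": (n - over - under) / n}


def text_histogram(values, bin_width, line):
    """One text row per bin, flagging the bin that contains `line`."""
    start = (min(values) // bin_width) * bin_width
    rows = []
    while start <= max(values):
        end = start + bin_width
        count = sum(1 for v in values if start <= v < end)
        marker = "  <- line" if start <= line < end else ""
        rows.append(f"{start:>4}-{end - 1:<4} {'#' * count}{marker}")
        start = end
    return rows


# Hypothetical combined scores for ten games and a hypothetical total (illustrative only).
totals = [210, 215, 221, 198, 230, 224, 219, 205, 227, 216]
line = 215.5

print(split_around_line(totals, line))
for row in text_histogram(totals, bin_width=10, line=line):
    print(row)
```

The same split can be annotated directly on a histogram or density plot so the over and under shares are visible at a glance.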


Betting Applications

  1. Descriptive statistics are backward-looking. They summarize what has happened, not what will happen. The gap between descriptive and predictive is where betting value lives. A team's 20-game average is descriptive; adjusting for schedule, injuries, and regression to the mean makes it predictive.

  2. Sample size determines reliability. Descriptive statistics from 5 games are nearly useless; from 20 games, they are suggestive; from 50+ games, they become reliable. Always consider whether you have enough data before acting on statistical summaries.

  3. Compare your statistics to the market's implied statistics. Calculate what descriptive statistics the sportsbook's line implies, then compare to your own calculations. If the line implies a mean of 108 and your analysis shows 112 with a tight confidence interval, you may have found value.

  4. Consistency metrics drive bet sizing. Even if you identify value, bet more on outcomes involving consistent performers (low CV) and less on volatile ones (high CV). The Kelly criterion naturally incorporates this through the probability estimate, but descriptive consistency metrics provide an additional check.

  5. Descriptive statistics are the foundation, not the ceiling. Every advanced model in sports betting starts with the concepts in this chapter. Mastering means, variances, correlations, and distributions is a prerequisite for the regression, machine learning, and Bayesian analysis covered in later chapters.
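
Items 3 and 4 can be sketched in a few lines: compare a sample mean and its approximate confidence interval against the mean implied by the line, then size the bet with the Kelly criterion. This is an illustration under stated assumptions, not a staking recommendation. The scores, the implied mean of 108, and the function names are hypothetical; the interval uses a normal approximation (z = 1.96), which is rough for a sample this small; and the Kelly formula is the standard single-bet form f = (b*p - q) / b with b equal to decimal odds minus 1.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(values, z=1.96):
    """Sample mean with an approximate (normal-based) confidence interval."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


def kelly_fraction(win_probability, decimal_odds):
    """Kelly stake as a fraction of bankroll, floored at zero when there is no edge."""
    b = decimal_odds - 1                      # net profit per unit staked
    edge = b * win_probability - (1 - win_probability)
    return max(0.0, edge / b)


# Hypothetical recent scores for one team and a hypothetical market-implied mean.
team_points = [112, 110, 114, 111, 113, 112, 109, 115, 112, 112]
market_implied_mean = 108

m, low, high = mean_confidence_interval(team_points)
print("sample mean:", m, "approx. 95% CI:", (round(low, 2), round(high, 2)))
print("implied mean below the interval:", market_implied_mean < low)

# Hypothetical win probability and even-money decimal odds.
print("Kelly fraction:", round(kelly_fraction(0.55, 2.0), 3))
```

A descriptive gap like this is only a starting point: per item 1, the sample mean still needs adjusting for schedule, injuries, and regression to the mean before it is treated as a prediction.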


Quick Reference Formulas

Measure      | Formula                          | When to Use
-------------|----------------------------------|------------------------------------
Mean         | sum(x) / n                       | Symmetric data, no outliers
Median       | Middle value when sorted         | Skewed data, outliers present
Trimmed Mean | Mean after removing k% extremes  | Moderate outliers, robust estimate
Variance     | sum((x - mean)^2) / (n-1)        | Quantifying spread (squared units)
Std Dev      | sqrt(variance)                   | Spread in original units
CV           | SD / Mean                        | Cross-group variability comparison
IQR          | Q3 - Q1                          | Robust spread measure
Pearson r    | cov(X,Y) / (SD_X * SD_Y)         | Linear relationship strength
Skewness     | E[(X - mean)^3] / SD^3           | Distribution asymmetry
Kurtosis     | E[(X - mean)^4] / SD^4 - 3       | Tail heaviness