Learning Objectives
- Calculate and interpret measures of central tendency (mean, median, mode, weighted averages, trimmed means) for sports data and understand when each is most appropriate
- Quantify team and player variability using variance, standard deviation, IQR, and coefficient of variation, and connect these measures to betting implications
- Compute and critically evaluate correlation and association metrics between sports variables while avoiding common pitfalls like spurious correlation
- Build publication-quality visualizations of sports data using matplotlib and seaborn, including histograms, box plots, scatter plots, heatmaps, and time series charts
- Standardize and compare distributions across different teams, seasons, and eras using z-scores, effect sizes, QQ plots, and kernel density estimation
In This Chapter
- Chapter Overview
- 6.1 Measures of Central Tendency in Sports Contexts
- 6.2 Measures of Variability and Spread
- 6.3 Correlation and Association
- 6.4 Data Visualization Techniques for Sports
- 6.5 Comparing Distributions Across Seasons and Teams
- 6.6 Chapter Summary
- What's Next: Chapter 7 - Probability Distributions in Sports
- Practice Exercises
Chapter 6: Descriptive Statistics for Sports
Chapter Overview
Welcome to Part II: Statistical Foundations. In Part I, you learned how to think about sports betting markets, manage your bankroll, and wrangle the data that fuels every serious analysis. Now it is time to make that data speak.
Descriptive statistics are the bedrock of sports analytics. Before you build a predictive model, before you estimate probabilities, before you calculate expected value on a wager, you need to understand what your data looks like right now. How many points does a team typically score? How consistent are they? Which statistics actually correlate with winning? These are not glamorous questions, but they are the questions that separate disciplined bettors from casual punters chasing narratives.
This chapter will feel different from what you encountered in a standard statistics textbook. Every formula, every technique, and every line of code is anchored in sports contexts. We will calculate means using NFL scoring data, measure variability to identify "upset-prone" teams, build correlation matrices to find which stats actually predict wins, and produce visualizations that reveal patterns invisible in raw numbers.
By the end of this chapter, you will have a personal toolkit of descriptive techniques that you will use in every subsequent chapter. Probability distributions in Chapter 7, regression in Chapter 8, and every model you build later all depend on the foundations laid here.
Let us begin.
6.1 Measures of Central Tendency in Sports Contexts
Every analysis starts with the same question: what is the typical value? In sports, "typical" can mean many things. The average points scored by a team, the most common margin of victory, the midpoint of a quarterback's passer rating over a season. Measures of central tendency give us a single number that summarizes the center of a distribution, but choosing the right measure matters enormously.
The Arithmetic Mean
The arithmetic mean is the most familiar measure of central tendency. For a set of n observations x_1, x_2, ..., x_n, the mean is:
x_bar = (1/n) * sum(x_i) for i = 1 to n
In sports, the mean is your default starting point. A team's average points per game, a player's average yards per carry, a pitcher's average strikeouts per nine innings. The mean incorporates every observation, which is both its strength and its weakness.
Strength: The mean uses all available information. Every game, every at-bat, every possession contributes to the calculation.
Weakness: The mean is sensitive to outliers. A single 60-point blowout can pull a team's average scoring significantly higher than what you would consider "typical" performance.
The Median
The median is the middle value when observations are sorted from smallest to largest. For an odd number of observations, it is the center value. For an even number, it is the average of the two center values.
The median is robust to outliers. If a basketball team scores 95, 98, 100, 102, and 155 points in five games, the median is 100, which better represents "normal" performance than the mean of 110, which is inflated by that 155-point explosion.
When to use the median in sports:
- Home run totals: The distribution of home runs per player is heavily right-skewed. A handful of sluggers hit 40-plus while most hitters are in single digits. The median gives a better sense of a "typical" player.
- Margin of victory: Blowouts skew the mean. The median margin tells you what a close-to-normal game looks like for a team.
- Career earnings, contract values: A few max contracts distort the mean. The median salary tells you what a "regular" player earns.
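A quick numeric check of the five-game example above, using the illustrative scores from the text (not real game data):
import numpy as np
scores = np.array([95, 98, 100, 102, 155])    # illustrative five-game sample from the text
print(f"Mean:   {scores.mean():.1f}")         # 110.0 -- dragged upward by the 155-point game
print(f"Median: {np.median(scores):.1f}")     # 100.0 -- unaffected by the outlier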
The Mode
The mode is the most frequently occurring value. In continuous sports data, the mode is rarely useful on its own because exact repetition is uncommon. But for discrete data, it can be revealing.
- The most common final score in NFL games has historically been 20-17 or 27-24.
- The most common margin of victory in the NFL is 3 points (a field goal), followed by 7 (a touchdown).
- In soccer, the most common match result is 1-0.
The mode is particularly useful when you are thinking about specific betting markets like exact score or winning margin props.
Weighted Averages
Not all games are created equal. A team's performance in September may be more indicative of their current ability than their performance in the previous March. Weighted averages let you assign more importance to recent observations.
For weights w_1, w_2, ..., w_n assigned to observations x_1, x_2, ..., x_n:
x_weighted = sum(w_i * x_i) / sum(w_i) for i = 1 to n
A common weighting scheme uses exponential decay, where the most recent game gets the highest weight and each preceding game gets progressively less:
w_i = alpha^(n - i) where 0 < alpha < 1
With alpha = 0.9 and 10 games, the most recent game has weight 1.0, the game before it has weight 0.9, two games back has weight 0.81, and so on. This captures the intuition that a team's current form matters more than what happened months ago.
Bettor's Note: Many sharp bettors use weighted averages with alpha values between 0.85 and 0.95. The optimal decay rate depends on the sport. NFL seasons are short (17 games), so each game carries significant weight, and a moderate alpha around 0.90 works well. NBA seasons are long (82 games), allowing for more aggressive recency weighting with alpha around 0.85.
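To see how the choice of decay rate changes the answer, here is a minimal sketch comparing alpha values of 0.85, 0.90, and 0.95 on a made-up ten-game scoring sequence that improves late in the season (the scores and alphas are illustrative only):
import numpy as np
# Hypothetical ten-game scoring sequence that trends upward late in the season
scores = np.array([17, 20, 16, 21, 24, 23, 27, 28, 31, 30])
def exp_weighted_mean(x, alpha):
    n = len(x)
    # Most recent game (index n-1) gets weight alpha^0 = 1.0
    weights = alpha ** (n - 1 - np.arange(n))
    return np.average(x, weights=weights)
print(f"Simple mean: {scores.mean():.2f}")
for alpha in (0.85, 0.90, 0.95):
    print(f"Weighted mean (alpha={alpha}): {exp_weighted_mean(scores, alpha):.2f}")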
Trimmed Means
A trimmed mean removes a fixed percentage of the highest and lowest values before computing the mean. A 10% trimmed mean drops the top 10% and bottom 10% of observations.
This is useful in sports when you want the stability of the mean but need protection against outliers. A team that had one game where their starting quarterback was injured and they scored 3 points, and another game where they scored 52 against a historically bad defense, might be better characterized by a trimmed mean.
Trimmed mean (10%) = mean of the middle 80% of sorted values
Python Code: Calculating and Comparing Central Tendency Measures
Let us put these concepts to work with NFL-style scoring data.
import numpy as np
import pandas as pd
from scipy import stats
# Simulated NFL-style points scored data for several teams across a 17-game season
np.random.seed(42)
teams_data = {
'Chiefs': [27, 31, 24, 17, 34, 28, 21, 30, 26, 33, 24, 38, 20, 27, 31, 29, 35],
'Ravens': [24, 30, 34, 20, 28, 31, 17, 33, 27, 22, 35, 29, 26, 31, 24, 19, 37],
'Bills': [30, 17, 28, 35, 24, 10, 31, 27, 33, 21, 29, 26, 38, 20, 27, 34, 23],
'Jaguars': [14, 20, 17, 10, 24, 13, 21, 16, 7, 23, 17, 20, 14, 19, 10, 24, 16],
'Lions': [34, 28, 31, 42, 27, 24, 30, 35, 21, 33, 26, 38, 29, 31, 55, 27, 30],
}
df = pd.DataFrame(teams_data)
# Calculate all central tendency measures for each team
summary = pd.DataFrame(index=df.columns)
summary['Mean'] = df.mean()
summary['Median'] = df.median()
summary['Mode'] = df.mode().iloc[0] # First mode if multiple
summary['Std Dev'] = df.std()
# Trimmed mean (10%)
summary['Trimmed Mean (10%)'] = df.apply(lambda x: stats.trim_mean(x, 0.1))
# Weighted average with exponential decay (alpha = 0.9)
def weighted_avg(series, alpha=0.9):
    n = len(series)
    weights = np.array([alpha ** (n - 1 - i) for i in range(n)])
    return np.average(series, weights=weights)
summary['Weighted Avg (a=0.9)'] = df.apply(weighted_avg)
print("=== Central Tendency Measures: NFL Points Scored ===\n")
print(summary.round(2).to_string())
print()
# Demonstrate the impact of outliers
print("\n=== Lions Detail: Impact of the 55-point Game ===")
lions = df['Lions']
print(f"All games: Mean = {lions.mean():.2f}, Median = {lions.median():.2f}")
lions_no_outlier = lions[lions < 50]
print(f"Without 55-point: Mean = {lions_no_outlier.mean():.2f}, Median = {lions_no_outlier.median():.2f}")
print(f"Trimmed mean: {stats.trim_mean(lions, 0.1):.2f}")
print(f"\nThe 55-point game inflates the mean by {lions.mean() - lions_no_outlier.mean():.2f} points.")
print(f"The median barely changes, demonstrating its robustness.")
Output Interpretation:
When you run this code, you will notice several patterns. The Chiefs and Ravens post similar means, so comparing their medians and spreads is what reveals whether their scoring distributions actually differ. The Lions' 55-point outlier game pulls their mean noticeably above their median, which is the classic signature of right-skewed data. The Jaguars' offense is consistently low-scoring: their mean and median sit close together, indicating a roughly symmetric (though low) distribution.
The weighted average for each team tells a different story than the simple mean. If a team improved late in the season (common for young rosters gaining chemistry), the weighted average will be higher than the raw mean. If a team faded down the stretch (injuries, fatigue), the weighted average will be lower.
Worked Example: Which Average Should You Trust?
Suppose you are evaluating the Bills for a Week 1 bet next season. Their raw data shows scores of 17 and 10 in Weeks 2 and 6, but they also posted 35 and 38 in other weeks. Their mean is about 27 points per game. But which measure best predicts their next game?
- The mean (27.0) incorporates everything, including the 10-point stinker. If you believe every game is equally informative, use this.
- The median (27.0) happens to be close to the mean here, suggesting a fairly symmetric distribution. When they differ substantially, prefer the median for "typical game" analysis.
- The weighted average gives more influence to late-season games. If the Bills made a midseason trade that improved their offense, the weighted average captures that better than the raw mean.
- The trimmed mean removes the extreme highs and lows, giving you the most robust estimate of "normal" performance.
For betting purposes, the weighted average is usually the most useful for predicting near-future performance, while the trimmed mean is most useful for identifying a team's baseline capability.
6.2 Measures of Variability and Spread
Knowing a team's average is only half the story. Two teams can both average 24 points per game, but if one team scores between 21 and 27 every week while the other oscillates between 10 and 40, they present very different betting propositions. Variability is what separates the predictable from the chaotic.
Variance and Standard Deviation
Variance measures the average squared deviation from the mean:
s^2 = (1 / (n - 1)) * sum((x_i - x_bar)^2) for i = 1 to n
We divide by (n - 1) rather than n for sample variance, applying Bessel's correction to produce an unbiased estimate. Standard deviation is simply the square root of variance:
s = sqrt(s^2)
Standard deviation has the same units as the original data, making it more interpretable. If a team has a mean of 24 points and a standard deviation of 4 points, roughly 68% of their games (assuming approximate normality) will fall between 20 and 28 points.
In sports betting, standard deviation is a direct measure of predictability. A team with a low standard deviation is consistent and easier to model. A team with a high standard deviation is volatile and creates both risk and opportunity for bettors.
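As a rough check on that 68% figure, you can count how many of a team's games actually land within one standard deviation of its mean. A minimal sketch with simulated scores (the mean of 24 and standard deviation of 4 are illustrative):
import numpy as np
np.random.seed(0)
points = np.random.normal(24, 4, 17)          # simulated 17-game season, mean ~24, sd ~4
mean, sd = points.mean(), points.std(ddof=1)
within_one_sd = np.mean((points > mean - sd) & (points < mean + sd))
print(f"Mean = {mean:.1f}, SD = {sd:.1f}")
print(f"Share of games within one SD of the mean: {within_one_sd:.0%}")
# With only 17 games, expect this share to bounce around the theoretical 68%.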
Range and Interquartile Range
The range is simply the maximum minus the minimum. It is easy to calculate but extremely sensitive to outliers. One freak performance distorts the entire measure.
The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1):
IQR = Q3 - Q1
The IQR captures the spread of the middle 50% of the data. It is robust to outliers and gives you a practical sense of a team's "normal operating range."
Key Insight: When a sportsbook sets an over/under total, they are implicitly estimating the mean of the combined scoring distribution. But the IQR tells you how confident you should be that the game lands near that total. A narrow IQR means games cluster tightly around the mean. A wide IQR means there is more variance, and the over/under is harder to predict.
Coefficient of Variation
The coefficient of variation (CV) expresses standard deviation as a percentage of the mean:
CV = (s / x_bar) * 100%
The CV is essential for comparing variability across different sports or different statistical categories. A standard deviation of 5 points means something very different in football (where teams average 23 points) versus basketball (where teams average 112 points). The CV normalizes this.
| Sport | Avg Score | Std Dev | CV |
|---|---|---|---|
| NFL | 23.0 | 10.2 | 44.3% |
| NBA | 112.5 | 12.1 | 10.8% |
| MLB (runs) | 4.5 | 3.2 | 71.1% |
| NHL | 3.1 | 1.8 | 58.1% |
| Soccer | 1.5 | 1.3 | 86.7% |
This table reveals something crucial: soccer and baseball have the highest coefficients of variation, meaning individual game scores are the most unpredictable relative to their averages. NBA games are the most predictable. This aligns with the well-known difficulty of betting on baseball and soccer money lines based on single-game performance, and the relative stability of NBA totals.
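The CV column follows directly from the approximate averages and standard deviations in the table; a quick sketch that reproduces it:
league_scoring = {
    # sport: (approximate average score, approximate standard deviation), from the table above
    'NFL': (23.0, 10.2),
    'NBA': (112.5, 12.1),
    'MLB (runs)': (4.5, 3.2),
    'NHL': (3.1, 1.8),
    'Soccer': (1.5, 1.3),
}
for sport, (avg, sd) in league_scoring.items():
    print(f"{sport:<12} CV = {sd / avg * 100:.1f}%")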
Standard Deviation of Point Spreads
Game outcomes relative to the point spread have a characteristic standard deviation. In the NFL, the actual margin minus the spread has a standard deviation of approximately 13-14 points. This means that even if the spread is perfectly set, the actual margin will land within one standard deviation (within about 13 points of the spread) roughly 68% of the time.
This has profound implications:
- A 3-point spread means the favorite is expected to win by a field goal, but the actual margin could easily range from the underdog winning by 10 to the favorite winning by 16.
- The inherent noise in football outcomes makes it difficult to consistently beat the spread, because the variance is so large relative to the typical edge a bettor might identify.
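If you are willing to model the margin relative to the spread as roughly normal with a standard deviation of 13.5 points, you can turn that noise into probabilities. The sketch below makes exactly that assumption (real NFL margins also cluster on key numbers like 3 and 7, which a normal model ignores):
from scipy.stats import norm
SPREAD_SD = 13.5   # assumed SD of the actual margin around the spread (NFL, approximate)
# Probability a 3-point favorite wins outright: margin ~ Normal(3, 13.5), P(margin > 0)
p_fav_wins = 1 - norm.cdf(0, loc=3, scale=SPREAD_SD)
# Probability the final margin lands within a field goal of the spread
p_within_3 = norm.cdf(3, 0, SPREAD_SD) - norm.cdf(-3, 0, SPREAD_SD)
print(f"3-point favorite wins outright: {p_fav_wins:.1%}")
print(f"Margin lands within 3 points of the spread: {p_within_3:.1%}")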
Why Variance Matters for Betting
Consider two teams, both 8-9 against the spread:
Team A (Low Variance): Their margins relative to the spread are: -1, +2, -3, +1, -2, +4, -1, +3, -2, +1, -3, +2, -1, +3, -2, +1, -3. Standard deviation: about 2.3 points.
Team B (High Variance): Their margins relative to the spread are: -14, +11, -8, +15, -10, +12, -7, +16, -13, +9, -11, +14, -6, +10, -15, +8, -12. Standard deviation: about 12 points.
Both teams have similar ATS records, but Team B's outcomes are far more extreme. When betting on Team B, you are essentially flipping a coin with much higher stakes. Team A's outcomes hover near the spread, suggesting the market has them priced accurately. Team B's wild swings suggest the market struggles to capture their true nature, which may present opportunities if you can identify why they are so volatile.
Python Code: Measuring Team Variability and the Upset Potential Metric
import numpy as np
import pandas as pd
# NFL-style data: points scored per game for a 17-game season
np.random.seed(42)
teams_data = {
'Chiefs': [27, 31, 24, 17, 34, 28, 21, 30, 26, 33, 24, 38, 20, 27, 31, 29, 35],
'Ravens': [24, 30, 34, 20, 28, 31, 17, 33, 27, 22, 35, 29, 26, 31, 24, 19, 37],
'Bills': [30, 17, 28, 35, 24, 10, 31, 27, 33, 21, 29, 26, 38, 20, 27, 34, 23],
'Jaguars': [14, 20, 17, 10, 24, 13, 21, 16, 7, 23, 17, 20, 14, 19, 10, 24, 16],
'Lions': [34, 28, 31, 42, 27, 24, 30, 35, 21, 33, 26, 38, 29, 31, 55, 27, 30],
'Cowboys': [10, 30, 14, 35, 17, 31, 13, 28, 20, 34, 7, 38, 16, 27, 21, 33, 24],
}
df = pd.DataFrame(teams_data)
# Variability measures
variability = pd.DataFrame(index=df.columns)
variability['Mean'] = df.mean().round(1)
variability['Std Dev'] = df.std().round(2)
variability['Variance'] = df.var().round(1)
variability['Range'] = df.max() - df.min()
variability['IQR'] = df.quantile(0.75) - df.quantile(0.25)
variability['CV (%)'] = ((df.std() / df.mean()) * 100).round(1)
print("=== Team Variability Measures ===\n")
print(variability.to_string())
print()
# --- Upset Potential Metric ---
# Concept: Teams with high variability relative to their mean are more likely
# to produce upsets. A bad team with high variance occasionally plays great,
# and a good team with high variance occasionally plays terribly.
def upset_potential(points_series, is_favorite=True):
    """
    Calculate upset potential score.
    For favorites: Higher variance = more likely to lay an egg (upset loss)
    For underdogs: Higher variance = more likely to have a big day (upset win)
    We use the CV and the proportion of games that deviate significantly
    from the mean (>1 std dev in the upset direction).
    """
    mean = points_series.mean()
    std = points_series.std()
    cv = std / mean
    if is_favorite:
        # How often does this team score well below average?
        bad_games = (points_series < mean - std).sum() / len(points_series)
        upset_score = cv * 50 + bad_games * 50
    else:
        # How often does this team score well above average?
        good_games = (points_series > mean + std).sum() / len(points_series)
        upset_score = cv * 50 + good_games * 50
    return round(upset_score, 1)
print("=== Upset Potential Scores ===\n")
print(f"{'Team':<12} {'As Favorite':<15} {'As Underdog':<15} {'Assessment'}")
print("-" * 60)
for team in df.columns:
    fav_score = upset_potential(df[team], is_favorite=True)
    dog_score = upset_potential(df[team], is_favorite=False)
    if fav_score > 18:
        assessment = "Risky favorite"
    elif dog_score > 18:
        assessment = "Dangerous underdog"
    else:
        assessment = "Predictable"
    print(f"{team:<12} {fav_score:<15} {dog_score:<15} {assessment}")
print()
print("Higher scores indicate greater upset potential.")
print("Risky favorites = beware laying big spreads with these teams.")
print("Dangerous underdogs = consider taking points with these teams.")
# Show the distribution shape with percentiles
print("\n=== Scoring Distribution Percentiles ===\n")
percentiles = [5, 10, 25, 50, 75, 90, 95]
pct_df = df.describe(percentiles=[p/100 for p in percentiles]).loc[
[f'{p}%' for p in percentiles]
]
print(pct_df.round(1).to_string())
Interpreting the Upset Potential Metric:
The Cowboys in our simulated data show classic volatile team behavior: their scores swing from 7 to 38 with no middle ground. As a favorite, they are risky because they have frequent low-scoring games that could lead to an upset loss. As an underdog, they are dangerous because they occasionally explode for 30-plus points. The Jaguars, despite being a bad team, are consistently bad, as their low variance means they rarely surprise anyone with a strong performance, making them less dangerous as underdogs than their record suggests.
6.3 Correlation and Association
With central tendency and variability in hand, the next step is understanding relationships between variables. Does a team's rushing offense correlate with winning? Is there a relationship between three-point shooting percentage and covering the spread? Correlation analysis answers these questions quantitatively.
Pearson Correlation
The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two continuous variables:
r = sum((x_i - x_bar)(y_i - y_bar)) / sqrt(sum((x_i - x_bar)^2) * sum((y_i - y_bar)^2))
The value of r ranges from -1 to +1:
- r = +1: Perfect positive linear relationship
- r = 0: No linear relationship
- r = -1: Perfect negative linear relationship
In practice, correlation values in sports data are rarely above 0.7 or below -0.7. Here is a rough guide for interpreting correlations in a sports context:
| Absolute Value of r | Interpretation |
|---|---|
| 0.00 - 0.10 | Negligible, essentially noise |
| 0.10 - 0.30 | Weak, but possibly worth noting |
| 0.30 - 0.50 | Moderate, meaningful relationship |
| 0.50 - 0.70 | Strong, important for modeling |
| 0.70 - 1.00 | Very strong, rare in sports data |
Spearman Rank Correlation
The Spearman rank correlation coefficient measures the monotonic (but not necessarily linear) relationship between two variables. Instead of using raw values, it ranks the observations and computes the Pearson correlation on the ranks.
Spearman correlation is appropriate when:
- The data contains outliers that would distort Pearson correlation.
- The relationship is monotonic but not linear (e.g., diminishing returns from additional passing yards).
- The data is ordinal (rankings, draft positions).
For example, the relationship between draft position and career performance is not linear. The difference between the 1st and 5th pick is much larger than the difference between the 50th and 54th pick. Spearman correlation handles this naturally.
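A minimal sketch of that point, using a made-up, noiseless diminishing-returns curve in place of real draft data: the ordering is perfectly monotonic, so Spearman reports -1.0, while Pearson is pulled toward zero by the curvature.
import numpy as np
from scipy.stats import pearsonr, spearmanr
picks = np.arange(1, 51)     # draft positions 1-50 (illustrative)
value = 100 / picks          # hypothetical career value with a steep early drop-off
r_pearson, _ = pearsonr(picks, value)
r_spearman, _ = spearmanr(picks, value)
print(f"Pearson:  {r_pearson:+.3f}")    # roughly -0.6: the linear fit misses the curvature
print(f"Spearman: {r_spearman:+.3f}")   # -1.000: the rank ordering is perfectly monotonic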
Correlation Between Stats and Wins
One of the most valuable exercises in sports analytics is computing the correlation between every available team statistic and wins. This tells you which stats actually matter.
Here are approximate correlations with wins in the NFL (based on historical data patterns):
| Statistic | Correlation with Wins |
|---|---|
| Point differential | +0.92 |
| Turnover differential | +0.50 |
| Yards per play (offense) | +0.55 |
| Yards per play (defense) | -0.52 |
| Third-down conversion % | +0.45 |
| Rushing yards per game | +0.20 |
| Total passing yards | +0.35 |
| Penalty yards | -0.10 |
| Time of possession | +0.08 |
Several insights emerge from this table:
- Point differential is king. It correlates more strongly with wins than any individual component stat. This is partly tautological (scoring more and allowing less leads to wins), but it tells you that models should focus on scoring efficiency rather than raw yardage.
- Rushing yards per game is overrated. The correlation of +0.20 is surprisingly weak. Teams that are winning run the ball to kill clock, inflating their rushing numbers. The causation runs from winning to rushing, not the other way around.
- Time of possession is nearly meaningless. A correlation of +0.08 is essentially noise. Despite the attention it gets from commentators, controlling the clock has minimal predictive value.
- Turnovers and efficiency metrics matter. Yards per play and turnover differential are both moderately to strongly correlated with winning.
Spurious Correlations in Sports
Spurious correlations are statistical relationships that arise by coincidence, confounding variables, or small sample sizes. Sports data is riddled with them.
Famous examples include:
- The Super Bowl Indicator: For decades, when an original NFL team (or a team with NFL heritage) won the Super Bowl, the stock market went up. When an original AFL team won, the market went down. This "worked" for over 80% of years through the 1990s but has no causal mechanism. It is a coincidence amplified by data mining.
- Jersey number and performance: If you correlate jersey numbers with touchdowns scored, you will find a "relationship" because wide receivers (who score more touchdowns) tend to wear numbers in the 80s, while offensive linemen (who score none) wear numbers in the 60s and 70s. The correlation is real, but the cause is the position assignment, not the number.
- Coin toss and game outcomes: In any given season, you can find that teams winning the coin toss won a disproportionate number of games. This is sampling variation, not a meaningful pattern.
Warning Box: The Data Dredging Trap
If you test enough relationships, some will appear significant by pure chance. Testing 20 uncorrelated variables at the 5% significance level will produce, on average, one "significant" finding that is actually noise. This is called the multiple comparisons problem, and it is the single biggest trap in sports analytics. Always ask: "Is there a plausible mechanism for this relationship?" If not, be deeply skeptical.
Correlation vs. Causation
The most important statistical principle in sports analytics is this: correlation does not imply causation. Yet violations of this principle drive billions of dollars in bad analysis every year.
Example 1: Ice cream sales and drowning deaths. Both increase in summer. The lurking variable is temperature. In sports, the equivalent is: teams that pass more tend to lose more. Does passing cause losing? No. Teams that are behind pass more to catch up. The game situation is the lurking variable.
Example 2: Coaching changes and improvement. Teams that fire their coach midseason often improve afterward. Is the new coach better? Maybe. But this is also explained by regression to the mean. Teams fire coaches when they are playing their worst, and performance naturally rebounds from extreme lows regardless of coaching changes.
Example 3: Pace of play and winning. In the NBA, teams that play faster score more points. Does playing fast cause winning? Not necessarily. Good teams in transition (after steals, rebounds) play fast because they have advantages. Bad teams that try to play fast without the talent just turn the ball over more.
To establish causation, you need one of: a controlled experiment (impossible in real sports), a natural experiment (rare), or a compelling theoretical framework combined with the data. In most sports betting analysis, you should be comfortable working with correlations while remaining honest that you are not proving causation.
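Regression to the mean, in particular, is easy to demonstrate without any causal story. The sketch below is entirely simulated with arbitrary parameters: each team's true ability is fixed, we "fire the coach" immediately after the team's worst four-game stretch, and performance still rebounds afterward.
import numpy as np
np.random.seed(7)
n_teams, n_games = 200, 16
true_ability = np.random.normal(0, 3, n_teams)   # fixed team quality that never changes
margins = true_ability[:, None] + np.random.normal(0, 10, (n_teams, n_games))
before, after = [], []
for team in margins:
    # "Fire the coach" right after the worst 4-game stretch, leaving 4 games to observe
    stretch_means = [team[i:i + 4].mean() for i in range(n_games - 7)]
    fire_week = int(np.argmin(stretch_means)) + 4
    before.append(team[fire_week - 4:fire_week].mean())
    after.append(team[fire_week:fire_week + 4].mean())
print(f"Avg margin in the 4 games before the 'firing': {np.mean(before):+.2f}")
print(f"Avg margin in the 4 games after the 'firing':  {np.mean(after):+.2f}")
# Performance rebounds even though true ability never changed: regression to the mean.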
Python Code: Correlation Matrix and Identifying Meaningful Relationships
import numpy as np
import pandas as pd
# Simulated team-level season statistics for 32 NFL-style teams
np.random.seed(42)
n_teams = 32
# Generate correlated data that mimics real NFL relationships
wins = np.random.randint(3, 15, n_teams).astype(float)
# Stats with varying correlations to wins
point_diff = wins * 18 - 130 + np.random.normal(0, 20, n_teams)
yards_per_play = 4.5 + (wins - 8.5) * 0.12 + np.random.normal(0, 0.3, n_teams)
turnover_diff = (wins - 8.5) * 1.2 + np.random.normal(0, 4, n_teams)
third_down_pct = 35 + (wins - 8.5) * 1.0 + np.random.normal(0, 3, n_teams)
rush_ypg = 110 + (wins - 8.5) * 2 + np.random.normal(0, 15, n_teams)
pass_ypg = 225 + (wins - 8.5) * 5 + np.random.normal(0, 25, n_teams)
penalty_ypg = 55 + np.random.normal(0, 10, n_teams) # Weak correlation
time_of_poss = 30 + np.random.normal(0, 2, n_teams) # Near zero correlation
team_stats = pd.DataFrame({
'Wins': wins,
'Point Diff': point_diff,
'Yds/Play': yards_per_play,
'TO Diff': turnover_diff,
'3rd Down %': third_down_pct,
'Rush YPG': rush_ypg,
'Pass YPG': pass_ypg,
'Penalty YPG': penalty_ypg,
'TOP (min)': time_of_poss
})
# Pearson correlation matrix
print("=== Pearson Correlation Matrix ===\n")
corr_matrix = team_stats.corr()
print(corr_matrix.round(3).to_string())
print()
# Focus on correlations with wins
print("\n=== Correlations with Wins (sorted by absolute value) ===\n")
win_corrs = corr_matrix['Wins'].drop('Wins').abs().sort_values(ascending=False)
for stat, corr_val in win_corrs.items():
    actual_corr = corr_matrix.loc[stat, 'Wins']
    direction = "+" if actual_corr > 0 else "-"
    strength = (
        "Very Strong" if corr_val > 0.7 else
        "Strong" if corr_val > 0.5 else
        "Moderate" if corr_val > 0.3 else
        "Weak" if corr_val > 0.1 else
        "Negligible"
    )
    print(f" {stat:<15} r = {direction}{corr_val:.3f} ({strength})")
# Spearman rank correlation for comparison
from scipy.stats import spearmanr
print("\n\n=== Pearson vs Spearman: Correlations with Wins ===\n")
print(f"{'Statistic':<15} {'Pearson':<10} {'Spearman':<10} {'Difference':<10}")
print("-" * 45)
for col in team_stats.columns:
    if col == 'Wins':
        continue
    pearson_r = team_stats['Wins'].corr(team_stats[col])
    spearman_r, _ = spearmanr(team_stats['Wins'], team_stats[col])
    diff = abs(pearson_r - spearman_r)
    print(f"{col:<15} {pearson_r:>+.3f} {spearman_r:>+.3f} {diff:.3f}")
print()
print("Large Pearson-Spearman differences suggest nonlinear relationships")
print("or the influence of outliers.")
# Demonstrate a spurious correlation
np.random.seed(99)
mascot_letter_count = np.random.randint(4, 12, n_teams)
spurious_r = np.corrcoef(wins, mascot_letter_count)[0, 1]
print(f"\n\n=== Spurious Correlation Example ===")
print(f"Correlation between wins and mascot name length: r = {spurious_r:.3f}")
print(f"This is meaningless noise, but in a small sample it can appear real.")
Key Takeaway: When building betting models, prioritize the statistics that have the strongest and most stable correlations with winning. Point differential, yards per play, and turnover differential consistently rise to the top. Resist the temptation to include everything: adding more variables does not make a better model, and weakly correlated or spurious predictors add noise rather than signal.
6.4 Data Visualization Techniques for Sports
Numbers in tables are precise but hard to digest. A well-crafted visualization can reveal patterns, outliers, and relationships that would take pages of numbers to communicate. In this section, we build a comprehensive visualization toolkit for sports data.
Histograms and Density Plots
Histograms show the distribution of a single variable by dividing the range into bins and counting how many observations fall into each bin. Density plots smooth the histogram into a continuous curve, making it easier to compare distributions.
When to use histograms in sports:
- Examining the distribution of final scores
- Checking whether margins of victory follow a normal distribution
- Comparing home vs. away scoring distributions
Box Plots for Comparing Groups
Box plots display the five-number summary (minimum, Q1, median, Q3, maximum) and highlight outliers. They are ideal for comparing distributions across groups: home vs. away, conference vs. conference, division by division.
Scatter Plots with Regression Lines
Scatter plots visualize the relationship between two continuous variables. Adding a regression line shows the trend. These are the workhorses of correlation analysis.
Heatmaps for Correlation Matrices
Heatmaps represent a matrix of values using color intensity. They are the standard way to display correlation matrices, making it easy to spot clusters of related variables.
Time Series Plots
Time series plots show how a variable changes over time (or over the course of a season). They are essential for identifying trends, hot streaks, and performance changes.
Python Code: Comprehensive Visualization Suite
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
# === Generate comprehensive NFL-style data ===
np.random.seed(42)
# Team game-by-game scoring data
n_games = 17
teams = {
'Chiefs': np.random.normal(27, 5, n_games).astype(int).clip(3, 55),
'Ravens': np.random.normal(26, 6, n_games).astype(int).clip(3, 55),
'Bills': np.random.normal(25, 8, n_games).astype(int).clip(3, 55),
'Jaguars': np.random.normal(17, 4, n_games).astype(int).clip(3, 55),
'Cowboys': np.random.normal(23, 10, n_games).astype(int).clip(3, 55),
}
scores_df = pd.DataFrame(teams)
# Home vs Away scoring
home_scores = np.random.normal(25, 9, 200).astype(int).clip(0, 60)
away_scores = np.random.normal(22, 9, 200).astype(int).clip(0, 60)
# Team-level season stats (32 teams)
n_teams = 32
wins = np.random.randint(3, 15, n_teams).astype(float)
point_diff = wins * 18 - 130 + np.random.normal(0, 20, n_teams)
off_ypp = 4.5 + (wins - 8.5) * 0.12 + np.random.normal(0, 0.3, n_teams)
def_ypp = 5.5 - (wins - 8.5) * 0.10 + np.random.normal(0, 0.3, n_teams)
to_diff = (wins - 8.5) * 1.2 + np.random.normal(0, 4, n_teams)
third_pct = 35 + (wins - 8.5) * 1.0 + np.random.normal(0, 3, n_teams)
season_stats = pd.DataFrame({
'Wins': wins, 'Point Diff': point_diff,
'Off Yds/Play': off_ypp, 'Def Yds/Play': def_ypp,
'TO Diff': to_diff, '3rd Down %': third_pct
})
# ============================
# FIGURE 1: Score Distributions
# ============================
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle('Figure 6.1: NFL Score Distributions', fontsize=14, fontweight='bold')
# 1a: Histogram of all scores
all_scores = np.concatenate(list(teams.values()))
axes[0].hist(all_scores, bins=15, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].axvline(np.mean(all_scores), color='red', linestyle='--', linewidth=2,
label=f'Mean: {np.mean(all_scores):.1f}')
axes[0].axvline(np.median(all_scores), color='orange', linestyle='--', linewidth=2,
label=f'Median: {np.median(all_scores):.1f}')
axes[0].set_xlabel('Points Scored')
axes[0].set_ylabel('Frequency')
axes[0].set_title('All Teams: Points Scored Distribution')
axes[0].legend()
# 1b: Density plot comparing teams
for team_name, team_scores in teams.items():
    sns.kdeplot(team_scores, ax=axes[1], label=team_name, linewidth=2)
axes[1].set_xlabel('Points Scored')
axes[1].set_ylabel('Density')
axes[1].set_title('Scoring Density by Team')
axes[1].legend()
# 1c: Home vs Away histogram
axes[2].hist(home_scores, bins=15, alpha=0.5, label=f'Home (mean={home_scores.mean():.1f})',
color='green', edgecolor='black')
axes[2].hist(away_scores, bins=15, alpha=0.5, label=f'Away (mean={away_scores.mean():.1f})',
color='red', edgecolor='black')
axes[2].set_xlabel('Points Scored')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Home vs Away Scoring')
axes[2].legend()
plt.tight_layout()
plt.savefig('fig_6_1_score_distributions.png', dpi=150, bbox_inches='tight')
plt.show()
# =========================
# FIGURE 2: Box Plots
# =========================
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Figure 6.2: Box Plot Comparisons', fontsize=14, fontweight='bold')
# 2a: Box plot by team
scores_melted = scores_df.melt(var_name='Team', value_name='Points')
sns.boxplot(data=scores_melted, x='Team', y='Points', ax=axes[0], palette='Set2')
axes[0].set_title('Points Scored by Team')
axes[0].set_ylabel('Points')
# 2b: Box plot home vs away
home_away_df = pd.DataFrame({
'Points': np.concatenate([home_scores, away_scores]),
'Location': ['Home'] * len(home_scores) + ['Away'] * len(away_scores)
})
sns.boxplot(data=home_away_df, x='Location', y='Points', ax=axes[1],
palette=['forestgreen', 'firebrick'])
axes[1].set_title('Home vs Away Scoring Distribution')
axes[1].set_ylabel('Points')
plt.tight_layout()
plt.savefig('fig_6_2_box_plots.png', dpi=150, bbox_inches='tight')
plt.show()
# =====================================
# FIGURE 3: Scatter Plots with Trends
# =====================================
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Figure 6.3: Key Relationships with Wins', fontsize=14, fontweight='bold')
scatter_vars = [
('Point Diff', 'Point Differential'),
('Off Yds/Play', 'Offensive Yards per Play'),
('TO Diff', 'Turnover Differential')
]
for idx, (col, title) in enumerate(scatter_vars):
    ax = axes[idx]
    ax.scatter(season_stats[col], season_stats['Wins'], alpha=0.6,
               edgecolors='black', linewidth=0.5, s=60, color='steelblue')
    # Add regression line
    z = np.polyfit(season_stats[col], season_stats['Wins'], 1)
    p = np.poly1d(z)
    x_line = np.linspace(season_stats[col].min(), season_stats[col].max(), 100)
    ax.plot(x_line, p(x_line), color='red', linewidth=2, linestyle='--')
    r = season_stats['Wins'].corr(season_stats[col])
    ax.set_xlabel(title)
    ax.set_ylabel('Wins')
    ax.set_title(f'{title} vs Wins\nr = {r:.3f}')
plt.tight_layout()
plt.savefig('fig_6_3_scatter_plots.png', dpi=150, bbox_inches='tight')
plt.show()
# =============================
# FIGURE 4: Correlation Heatmap
# =============================
fig, ax = plt.subplots(figsize=(10, 8))
fig.suptitle('Figure 6.4: Correlation Heatmap of Team Statistics',
fontsize=14, fontweight='bold')
corr = season_stats.corr()
mask = np.triu(np.ones_like(corr, dtype=bool)) # Show only lower triangle
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, vmin=-1, vmax=1, square=True,
linewidths=1, ax=ax,
cbar_kws={'label': 'Correlation Coefficient'})
ax.set_title('')
plt.tight_layout()
plt.savefig('fig_6_4_correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
# ==============================
# FIGURE 5: Time Series / Season
# ==============================
fig, axes = plt.subplots(2, 1, figsize=(14, 8))
fig.suptitle('Figure 6.5: Season Progression', fontsize=14, fontweight='bold')
# 5a: Raw weekly scores
weeks = range(1, n_games + 1)
for team_name in ['Chiefs', 'Bills', 'Jaguars']:
    axes[0].plot(weeks, teams[team_name], marker='o', linewidth=2,
                 markersize=5, label=team_name)
axes[0].set_xlabel('Week')
axes[0].set_ylabel('Points Scored')
axes[0].set_title('Weekly Scoring by Team')
axes[0].legend()
axes[0].set_xticks(range(1, n_games + 1))
# 5b: Rolling average (4-game window)
window = 4
for team_name in ['Chiefs', 'Bills', 'Jaguars']:
    rolling = pd.Series(teams[team_name]).rolling(window=window).mean()
    axes[1].plot(weeks, rolling, marker='s', linewidth=2,
                 markersize=5, label=f'{team_name} ({window}-game avg)')
axes[1].set_xlabel('Week')
axes[1].set_ylabel(f'{window}-Game Rolling Average')
axes[1].set_title(f'{window}-Game Rolling Average Scoring')
axes[1].legend()
axes[1].set_xticks(range(1, n_games + 1))
plt.tight_layout()
plt.savefig('fig_6_5_time_series.png', dpi=150, bbox_inches='tight')
plt.show()
print("All figures saved successfully.")
Reading the Visualizations:
Figure 6.1 reveals the overall shape of scoring distributions. The density plot is particularly useful: the Chiefs show a tall, narrow peak (consistent scoring), while the Cowboys show a flat, wide curve (volatile). This visual difference maps directly to standard deviation numbers.
Figure 6.2 box plots let you compare teams at a glance. The box shows the IQR, the line inside shows the median, and the whiskers extend to the most extreme values within 1.5 times the IQR of the box edges. Points beyond the whiskers are plotted as outliers. A box plot immediately tells you both where the center is and how much spread exists.
Figure 6.3 scatter plots with regression lines show the strength of relationships. The point differential plot should show a tight cluster around the line (high correlation), while other variables will show more scatter (lower correlation).
Figure 6.4 heatmap makes the correlation matrix visual. Red cells are positive correlations, blue cells are negative, and the intensity shows the strength. This lets you instantly identify which stats cluster together and which are independent.
Figure 6.5 time series plots reveal trends that summary statistics hide. A team's mean might be 24 points, but if they scored 30 per game in the first half and 18 per game in the second half, the time series plot shows that decline. Rolling averages smooth out game-to-game noise and show the underlying trend.
Visualization Best Practices for Sports Analysis
- Always label your axes with units.
- Include the correlation coefficient (r) on scatter plots.
- Use consistent color coding (same team = same color across all plots).
- Prefer density plots over histograms when comparing distributions.
- Use rolling averages (3-5 game windows) for time series to reduce noise.
- Save figures at high resolution (dpi=150 or higher) for presentations.
6.5 Comparing Distributions Across Seasons and Teams
Raw statistics are meaningless without context. A quarterback throwing for 4,500 yards in 2024 is a good season. A quarterback throwing for 4,500 yards in 1975 would have been the greatest passing season in history. To make fair comparisons across different eras, leagues, and contexts, we need standardization techniques.
Z-Scores for Standardization
A z-score tells you how many standard deviations an observation is from the mean of its distribution:
z = (x - mu) / sigma
Where x is the observation, mu is the population (or sample) mean, and sigma is the population (or sample) standard deviation.
Z-scores are essential for cross-era comparisons. If a running back rushed for 1,800 yards when the league average was 900 with a standard deviation of 250, his z-score is:
z = (1800 - 900) / 250 = 3.6
A modern running back who rushes for 1,600 yards when the league average is 800 with a standard deviation of 200 gets:
z = (1600 - 800) / 200 = 4.0
Despite the lower raw total, the modern running back's performance was more exceptional relative to his era. Z-scores reveal this.
Z-scores in betting context: When you see a team's offensive output, converting to z-scores relative to the league average that season tells you how truly exceptional (or poor) they are. A team averaging 28 points in a league where the mean is 22 and the standard deviation is 4 has a z-score of +1.5, meaning they are solidly above average but not historically dominant.
Effect Size: Cohen's d
While z-scores standardize individual observations, Cohen's d measures the standardized difference between two group means:
d = (mean_1 - mean_2) / s_pooled
Where s_pooled is the pooled standard deviation:
s_pooled = sqrt(((n_1 - 1) * s_1^2 + (n_2 - 1) * s_2^2) / (n_1 + n_2 - 2))
Cohen's d guidelines adapted for sports:
| Cohen's d | Interpretation | Sports Example |
|---|---|---|
| 0.0 - 0.2 | Negligible | Home vs. away free throw % in NBA |
| 0.2 - 0.5 | Small | Home vs. away scoring in NFL |
| 0.5 - 0.8 | Medium | Playoff teams vs. non-playoff scoring |
| 0.8+ | Large | Top-5 offense vs. bottom-5 offense |
Why effect size matters for bettors: Statistical significance (p-values) tells you whether a difference is real. Effect size tells you whether the difference is large enough to matter. In sports betting, even a real effect with a tiny effect size is useless because it will not overcome the vig. You need effects that are both statistically significant and practically meaningful.
QQ Plots for Normality Checking
Many statistical techniques assume that data follows a normal (Gaussian) distribution. QQ (quantile-quantile) plots are the standard diagnostic for checking this assumption.
A QQ plot graphs the quantiles of your data against the quantiles of a theoretical normal distribution. If the data is normally distributed, the points will fall approximately along a straight diagonal line. Systematic deviations from the line indicate departures from normality:
- Points curving upward on the right: Right-skewed data (heavy right tail)
- Points curving downward on the left: Left-skewed data (heavy left tail)
- S-shaped curve: Heavy tails on both sides (leptokurtic distribution)
Why normality matters for betting models: If your model assumes normally distributed errors but the actual distribution has heavier tails, you will underestimate the probability of extreme outcomes. This is directly relevant to betting markets like over/unders and point spreads, where tail probabilities determine the value of the bet.
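A small sketch of that point: both distributions below are scaled to a standard deviation of 1, but the heavy-tailed one (a variance-matched Student-t with 4 degrees of freedom, an illustrative choice) produces far more 3- and 4-sigma outcomes than the normal model predicts.
import numpy as np
from scipy import stats
normal = stats.norm(0, 1)
heavy = stats.t(df=4, scale=1 / np.sqrt(2))   # t with 4 df has variance 2, so rescale to sd = 1
for k in (3, 4):
    p_norm = 2 * normal.sf(k)    # two-sided P(|X| > k standard deviations)
    p_heavy = 2 * heavy.sf(k)
    print(f"P(|X| > {k} SD): normal = {p_norm:.5f}, heavy-tailed = {p_heavy:.5f}")
# A model that assumes normality will underprice these extreme outcomes.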
Kernel Density Estimation
Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Unlike a histogram (which depends on bin width and placement), KDE produces a smooth curve.
The KDE at any point x is:
f_hat(x) = (1 / (n * h)) * sum(K((x - x_i) / h)) for i = 1 to n
Where K is the kernel function (typically Gaussian), h is the bandwidth (a smoothing parameter), and n is the number of observations.
The bandwidth h controls the trade-off between smoothness and fidelity:
- Small h: Jagged curve that fits the data closely (possibly overfitting)
- Large h: Smooth curve that may mask important features (underfitting)
In practice, libraries such as seaborn (which uses scipy's gaussian_kde under the hood) choose a reasonable default bandwidth using rules of thumb like Scott's rule or Silverman's rule.
Application in sports betting: KDE allows you to estimate the probability that a team scores more than a certain number of points without assuming any particular distributional form. This is more flexible than parametric approaches and can capture features like bimodality (a team that either scores a lot or very little, with few games in between).
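A minimal sketch of that idea using scipy's gaussian_kde on a made-up bimodal scoring sample: estimate the probability the team tops 30 points directly from the fitted density, and compare it with the answer a normal assumption would give.
import numpy as np
from scipy.stats import gaussian_kde, norm
np.random.seed(3)
# Hypothetical bimodal team: some shootouts, some defensive slogs (made-up numbers)
points = np.concatenate([np.random.normal(17, 3, 9), np.random.normal(31, 4, 8)])
kde = gaussian_kde(points)                 # Gaussian kernel, Scott's rule bandwidth by default
p_kde = kde.integrate_box_1d(30, np.inf)   # P(points > 30) under the estimated density
p_normal = norm.sf(30, points.mean(), points.std(ddof=1))   # same question under a normal model
print(f"P(score > 30): KDE estimate = {p_kde:.2f}, normal assumption = {p_normal:.2f}")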
Python Code: Era-Adjusted Comparisons and Distribution Overlays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')
# ==========================================
# Z-Score Comparisons Across NFL Eras
# ==========================================
# Simulated league-wide passing yards data by era
np.random.seed(42)
eras = {
'2000 Season': {'league_mean': 3400, 'league_std': 600,
'top_qb_yards': 4400, 'top_qb_name': 'QB Alpha'},
'2010 Season': {'league_mean': 3800, 'league_std': 650,
'top_qb_yards': 5100, 'top_qb_name': 'QB Beta'},
'2020 Season': {'league_mean': 4000, 'league_std': 700,
'top_qb_yards': 5200, 'top_qb_name': 'QB Gamma'},
}
print("=== Era-Adjusted Passing Yard Comparisons ===\n")
print(f"{'Era':<15} {'QB':<12} {'Yards':<8} {'League Avg':<12} {'League SD':<12} {'Z-Score':<10}")
print("-" * 70)
for era, data in eras.items():
    z = (data['top_qb_yards'] - data['league_mean']) / data['league_std']
    print(f"{era:<15} {data['top_qb_name']:<12} {data['top_qb_yards']:<8} "
          f"{data['league_mean']:<12} {data['league_std']:<12} {z:<10.2f}")
print("\nHigher z-score = more dominant relative to era, regardless of raw totals.")
# ==========================================
# Effect Size: Home vs Away, Playoff vs Not
# ==========================================
# Simulated home and away scoring data
home_scores = np.random.normal(24.5, 9.5, 256) # 256 home games in a season
away_scores = np.random.normal(21.5, 9.8, 256) # 256 away games
# Simulated playoff vs non-playoff team scoring
playoff_ppg = np.random.normal(26, 4, 14)
non_playoff_ppg = np.random.normal(20, 5, 18)
def cohens_d(group1, group2):
    """Calculate Cohen's d effect size."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (group1.mean() - group2.mean()) / pooled_std
d_home_away = cohens_d(home_scores, away_scores)
d_playoff = cohens_d(playoff_ppg, non_playoff_ppg)
print("\n=== Effect Sizes (Cohen's d) ===\n")
print(f"Home vs Away scoring: d = {d_home_away:.3f} ", end="")
print("(Small)" if abs(d_home_away) < 0.5 else "(Medium)" if abs(d_home_away) < 0.8 else "(Large)")
print(f"Playoff vs Non-playoff team PPG: d = {d_playoff:.3f} ", end="")
print("(Small)" if abs(d_playoff) < 0.5 else "(Medium)" if abs(d_playoff) < 0.8 else "(Large)")
print(f"\nHome advantage effect ({d_home_away:.2f}) is real but small.")
print(f"Playoff team scoring advantage ({d_playoff:.2f}) is much more substantial.")
# ==============================
# FIGURE 6: Z-Score Distributions
# ==============================
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Figure 6.6: Z-Score Standardization Across Eras',
fontsize=14, fontweight='bold')
colors = ['steelblue', 'coral', 'forestgreen']
for idx, (era, data) in enumerate(eras.items()):
    # Generate full distribution for the era
    league_yards = np.random.normal(data['league_mean'], data['league_std'], 32)
    z_scores = (league_yards - data['league_mean']) / data['league_std']
    top_z = (data['top_qb_yards'] - data['league_mean']) / data['league_std']
    axes[idx].hist(z_scores, bins=10, alpha=0.6, color=colors[idx],
                   edgecolor='black', density=True)
    axes[idx].axvline(top_z, color='red', linewidth=2, linestyle='--',
                      label=f'{data["top_qb_name"]}: z={top_z:.2f}')
    axes[idx].set_xlabel('Z-Score')
    axes[idx].set_ylabel('Density')
    axes[idx].set_title(f'{era}')
    axes[idx].legend()
    axes[idx].set_xlim(-3.5, 4.5)
plt.tight_layout()
plt.savefig('fig_6_6_zscore_eras.png', dpi=150, bbox_inches='tight')
plt.show()
# ==============================
# FIGURE 7: QQ Plots
# ==============================
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle('Figure 6.7: QQ Plots for Normality Assessment',
fontsize=14, fontweight='bold')
# Generate three different distributions
np.random.seed(42)
# 1: Approximately normal (NFL point spreads)
normal_data = np.random.normal(0, 13, 200)
stats.probplot(normal_data, dist="norm", plot=axes[0])
axes[0].set_title('NFL Margin vs Spread\n(Approximately Normal)')
axes[0].get_lines()[0].set_markerfacecolor('steelblue')
axes[0].get_lines()[0].set_markersize(4)
# 2: Right-skewed (home runs per player)
skewed_data = np.random.exponential(8, 200)
stats.probplot(skewed_data, dist="norm", plot=axes[1])
axes[1].set_title('Home Runs per Player\n(Right-Skewed)')
axes[1].get_lines()[0].set_markerfacecolor('coral')
axes[1].get_lines()[0].set_markersize(4)
# 3: Heavy-tailed (daily fantasy scores)
heavy_tail_data = np.random.standard_t(df=4, size=200) * 10 + 50
stats.probplot(heavy_tail_data, dist="norm", plot=axes[2])
axes[2].set_title('Daily Fantasy Scores\n(Heavy-Tailed)')
axes[2].get_lines()[0].set_markerfacecolor('forestgreen')
axes[2].get_lines()[0].set_markersize(4)
plt.tight_layout()
plt.savefig('fig_6_7_qq_plots.png', dpi=150, bbox_inches='tight')
plt.show()
# ======================================
# FIGURE 8: KDE Distribution Comparisons
# ======================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle('Figure 6.8: Kernel Density Estimation Comparisons',
fontsize=14, fontweight='bold')
# 8a: KDE of scoring by team tier
np.random.seed(42)
elite_offense = np.random.normal(28, 5, 100)
average_offense = np.random.normal(22, 6, 100)
poor_offense = np.random.normal(16, 4, 100)
sns.kdeplot(elite_offense, ax=axes[0], label='Elite Offense (top 5)',
linewidth=2.5, color='forestgreen')
sns.kdeplot(average_offense, ax=axes[0], label='Average Offense (middle)',
linewidth=2.5, color='steelblue')
sns.kdeplot(poor_offense, ax=axes[0], label='Poor Offense (bottom 5)',
linewidth=2.5, color='firebrick')
axes[0].axvline(22, color='gray', linestyle=':', alpha=0.5)
axes[0].set_xlabel('Points per Game')
axes[0].set_ylabel('Density')
axes[0].set_title('Scoring Distributions by Team Tier')
axes[0].legend()
# 8b: KDE comparing same team across seasons
np.random.seed(42)
team_2022 = np.random.normal(21, 6, 17) # Rebuilding year
team_2023 = np.random.normal(25, 5, 17) # Improved
team_2024 = np.random.normal(28, 4, 17) # Contender
sns.kdeplot(team_2022, ax=axes[1], label='2022 (Rebuilding)',
linewidth=2.5, color='firebrick', linestyle='--')
sns.kdeplot(team_2023, ax=axes[1], label='2023 (Improving)',
linewidth=2.5, color='orange', linestyle='-.')
sns.kdeplot(team_2024, ax=axes[1], label='2024 (Contender)',
linewidth=2.5, color='forestgreen', linestyle='-')
axes[1].set_xlabel('Points per Game')
axes[1].set_ylabel('Density')
axes[1].set_title('Same Team Across Three Seasons')
axes[1].legend()
plt.tight_layout()
plt.savefig('fig_6_8_kde_comparisons.png', dpi=150, bbox_inches='tight')
plt.show()
# ================================================
# Comprehensive Comparison Table: Z-Scores Applied
# ================================================
print("\n=== Team Scoring: Raw vs Z-Score Comparison ===\n")
# Simulated team scoring averages for current season
np.random.seed(42)
team_names = ['Team A', 'Team B', 'Team C', 'Team D', 'Team E',
'Team F', 'Team G', 'Team H']
ppg = np.array([28.5, 24.2, 21.0, 26.8, 19.5, 30.1, 22.3, 17.8])
league_mean = ppg.mean()
league_std = ppg.std()
comparison_df = pd.DataFrame({
'Team': team_names,
'PPG': ppg,
'Z-Score': (ppg - league_mean) / league_std,
'Percentile': [stats.norm.cdf(z) * 100 for z in (ppg - league_mean) / league_std]
})
comparison_df = comparison_df.sort_values('Z-Score', ascending=False)
print(f"League Mean: {league_mean:.1f} PPG")
print(f"League Std: {league_std:.1f} PPG\n")
print(comparison_df.to_string(index=False, float_format='%.2f'))
print("\n--- Interpretation ---")
print("Z-Score of +1.0: This team scores ~1 standard deviation above the league average.")
print("Z-Score of -1.0: This team scores ~1 standard deviation below the league average.")
print("Percentile of 84: This team's scoring is better than ~84% of the league.")
# ================================================
# Bandwidth Sensitivity for KDE
# ================================================
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle('Figure 6.9: KDE Bandwidth Sensitivity',
fontsize=14, fontweight='bold')
np.random.seed(42)
sample_scores = np.random.normal(23, 7, 50)
bandwidths = [1.0, 3.0, 8.0]
labels = ['Small (h=1.0)\nOverfit', 'Moderate (h=3.0)\nGood Balance', 'Large (h=8.0)\nOversmoothed']
for idx, (bw, label) in enumerate(zip(bandwidths, labels)):
    axes[idx].hist(sample_scores, bins=12, density=True, alpha=0.3,
                   color='gray', edgecolor='black')
    sns.kdeplot(sample_scores, ax=axes[idx], bw_adjust=bw/3.0,
                linewidth=2.5, color='steelblue')
    axes[idx].set_xlabel('Points Scored')
    axes[idx].set_ylabel('Density')
    axes[idx].set_title(label)
plt.tight_layout()
plt.savefig('fig_6_9_kde_bandwidth.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nAll figures saved successfully.")
Interpreting the Results:
The z-score comparison across eras is one of the most powerful applications in this chapter. When a commentator says "Player X is having a historic season," z-scores let you test that claim rigorously. A 5,200-yard passing season in 2020 (z = 1.71) is excellent but less dominant than a 5,100-yard season in 2010 (z = 2.00) because the league average has shifted upward.
The QQ plots reveal three common distributional shapes in sports data:
- NFL margins vs. the spread approximate a normal distribution well. This is why point spread markets work relatively efficiently, as the underlying distribution matches the assumptions.
- Home runs per player are right-skewed. A normal model would underestimate the probability of extreme performances (50+ home runs). This matters for season-long prop bets.
- Daily fantasy scores have heavier tails than a normal distribution. Both extremely high and extremely low scores occur more often than a normal model predicts. This means DFS players face more variance than naive models suggest.
The KDE bandwidth comparison is a practical lesson in a parameter that many analysts never think about. The default bandwidth in most libraries is reasonable, but if your data has unusual features (bimodality, sharp peaks), you may need to adjust it. When in doubt, try multiple bandwidths and see if the key features of the distribution are robust across choices.
Applied Betting Insight: Using Z-Scores for Line Shopping
When you encounter a point spread or total, convert the implied team scoring into z-scores relative to that team's season distribution. If the line implies a team will score at a z-score of +1.5 (well above their typical output), ask yourself: what would need to happen for this team to score that far above their mean? If you cannot identify a credible reason (matchup advantage, pace factor, weather), the over on that team total may be a poor bet. Conversely, if a team's implied scoring is at z = -0.5 but they are facing the worst defense in the league, the market may be underestimating their output.
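A sketch of that check with hypothetical numbers: convert an implied team total into a z-score against the team's season scoring distribution and, under a rough normality assumption, into an implied probability of clearing it.
from scipy.stats import norm
# Hypothetical inputs: season scoring profile and a sportsbook's implied team total
season_mean, season_sd = 23.4, 6.1
implied_total = 30.5
z = (implied_total - season_mean) / season_sd
p_over = norm.sf(z)   # rough P(team beats the implied total), assuming normal scoring
print(f"Implied total sits at z = {z:+.2f} for this team's season distribution")
print(f"Naive P(over) under a normal assumption: {p_over:.1%}")
# The further the implied z drifts above +1, the more concrete the justification should be.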
Putting It All Together: A Distribution Comparison Workflow
Here is the systematic process for comparing distributions that you should follow in your own analysis (a compact code sketch follows the list):
- Calculate summary statistics: Mean, median, standard deviation, IQR for each group.
- Visualize with overlapping KDEs: Look for separation between distributions, differences in spread, and distributional shape (symmetric, skewed, bimodal).
- Compute z-scores: Standardize individual observations to enable cross-group comparisons.
- Calculate effect size: Use Cohen's d to quantify whether the difference between groups is practically meaningful, not just statistically significant.
- Check normality with QQ plots: If you plan to use parametric methods later, verify that the normality assumption is reasonable.
- Adjust for context: Consider era, opponent quality, game situation, and other confounders before drawing conclusions.
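A compact sketch covering steps 1, 4, and 5 of this checklist on simulated inputs (a Shapiro-Wilk test stands in for the visual QQ inspection; the group names and numbers are hypothetical):
import numpy as np
from scipy import stats
def compare_groups(a, b, labels=('Group A', 'Group B')):
    """Summary stats, Cohen's d, and a quick normality check for two samples."""
    for name, x in zip(labels, (a, b)):
        q1, q3 = np.percentile(x, [25, 75])
        print(f"{name}: mean={np.mean(x):.1f}, median={np.median(x):.1f}, "
              f"sd={np.std(x, ddof=1):.1f}, IQR={q3 - q1:.1f}")
    # Pooled-SD Cohen's d
    n1, n2 = len(a), len(b)
    pooled = np.sqrt(((n1 - 1) * np.var(a, ddof=1) + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2))
    print(f"Cohen's d: {(np.mean(a) - np.mean(b)) / pooled:+.2f}")
    # Shapiro-Wilk p-values as a rough stand-in for a QQ-plot inspection
    for name, x in zip(labels, (a, b)):
        _, p = stats.shapiro(x)
        print(f"{name} Shapiro-Wilk p = {p:.3f} (small p suggests non-normality)")
np.random.seed(11)
compare_groups(np.random.normal(26, 4, 17), np.random.normal(22, 6, 17),
               labels=('Playoff team PPG', 'Non-playoff team PPG'))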
6.6 Chapter Summary
This chapter has equipped you with the fundamental descriptive tools that every sports analyst and serious bettor needs. Let us review the key concepts and their betting applications.
Central Tendency: Know Your Baseline
- The mean is your default measure, but be aware of its sensitivity to outliers.
- The median is more robust and better represents "typical" performance when data is skewed.
- Weighted averages with exponential decay capture recent form, which is usually more predictive than full-season averages.
- Trimmed means offer a middle ground between mean and median, removing extreme values while retaining most of the data.
Betting application: When evaluating a team's scoring potential, use the weighted average for near-term predictions and the trimmed mean for baseline assessment. Compare these to the median to check for skewness that might affect your confidence.
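As a quick recap of the two calculations, here is a sketch using a hypothetical ten-game log; the exponential-decay weighting assumes the alpha^k scheme used earlier in the chapter, where k counts games back from the most recent.

```python
import numpy as np
from scipy import stats

# Hypothetical last ten games, most recent last.
points = np.array([20, 27, 24, 17, 31, 23, 28, 35, 30, 33])

# Exponentially decaying weights (alpha = 0.9): recent games count more.
alpha = 0.9
weights = alpha ** np.arange(len(points) - 1, -1, -1)
weighted_avg = np.average(points, weights=weights)

trimmed = stats.trim_mean(points, proportiontocut=0.10)   # 10% trimmed mean
print(f"Weighted average: {weighted_avg:.1f}")
print(f"Trimmed mean:     {trimmed:.1f}")
print(f"Median:           {np.median(points):.1f}")
```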
Variability: Measure the Uncertainty
- Standard deviation quantifies consistency. Low standard deviation teams are predictable; high standard deviation teams create opportunities and risks.
- IQR gives you the practical "normal range" for a team's output.
- Coefficient of variation enables comparisons across different sports and statistical categories.
- The upset potential metric synthesizes variability into a practical tool for identifying risky favorites and dangerous underdogs.
Betting application: Before laying a big spread with a favorite, check their standard deviation. A volatile favorite with a standard deviation of 10+ points is a riskier lay than a consistent one with a standard deviation of 5. Similarly, look for high-variance underdogs whose occasional breakout games may not be fully priced into the line.
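A minimal illustration with two hypothetical favorites that score about the same on average:

```python
import numpy as np

# Hypothetical game logs for two favorites laying similar spreads.
volatile_fav   = np.array([41, 17, 38, 13, 45, 20, 35, 16, 42, 23])
consistent_fav = np.array([28, 30, 26, 31, 27, 29, 25, 30, 28, 27])

for name, pts in [("Volatile favorite", volatile_fav),
                  ("Consistent favorite", consistent_fav)]:
    mean, std = pts.mean(), pts.std(ddof=1)
    cv = std / mean * 100
    print(f"{name:20s} mean={mean:.1f}  std={std:.1f}  CV={cv:.0f}%")

# Similar means, very different spreads of outcomes: the volatile favorite
# falls short of a large number far more often despite the same average.
```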
Correlation: Find What Matters
- Pearson correlation measures linear relationships between continuous variables.
- Spearman correlation handles ordinal data and is robust to outliers.
- Not all correlations are meaningful. Spurious correlations, confounding variables, and the multiple comparisons problem are constant threats.
- Correlation does not imply causation. This is especially important in sports, where game flow creates many misleading relationships.
Betting application: Build your models on statistics that have strong, stable correlations with winning: point differential, yards per play, turnover differential. Ignore "analyst favorites" like time of possession and total rushing yards that have weak correlations despite their media prominence.
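One way to screen candidate statistics is a quick correlation-with-wins table. The data frame below is a small hypothetical example constructed for illustration, so treat the output as a pattern to look for rather than a result.

```python
import pandas as pd

# Hypothetical team-season table; replace with real league data.
df = pd.DataFrame({
    "wins":           [13, 11, 10, 9, 8, 7, 6, 5, 4, 3],
    "point_diff":     [120, 95, 60, 40, 15, -5, -30, -55, -80, -110],
    "yards_per_play": [6.1, 5.9, 5.7, 5.6, 5.4, 5.3, 5.1, 5.0, 4.9, 4.7],
    "time_of_poss":   [31.2, 29.8, 30.5, 29.1, 30.9, 30.2, 29.5, 30.7, 29.3, 30.0],
})

corr_with_wins = df.corr()["wins"].drop("wins").sort_values(ascending=False)
print(corr_with_wins)
# In this toy table, point_diff and yards_per_play sit near the top by
# construction and time_of_poss near zero; verify the pattern with real data.
```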
Visualization: See the Story
- Histograms and KDEs reveal distributional shape.
- Box plots enable quick group comparisons.
- Scatter plots with regression lines show relationships.
- Heatmaps make correlation matrices digestible.
- Time series plots with rolling averages show trends.
Betting application: Visualize before you model. A scatter plot might reveal that the linear relationship you assumed is actually nonlinear. A KDE might show bimodality that a mean and standard deviation would completely miss. A time series plot might reveal a midseason breakout that gets hidden in full-season averages.
Standardization: Compare Fairly
- Z-scores enable comparisons across different contexts, eras, and scales.
- Cohen's d measures whether group differences are practically meaningful.
- QQ plots diagnose whether normality assumptions hold.
- KDE provides flexible, assumption-free density estimation.
Betting application: Use z-scores to evaluate lines. If a sportsbook implies a team will perform at z = +2.0 (two standard deviations above their mean), that outcome occurs only about 2.5% of the time in a normal distribution. Unless you have strong reasons to believe in the extreme scenario, the other side of that bet likely has value.
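That tail probability is easy to verify; SciPy gives roughly 2.3% for z >= 2, in line with the rule of thumb quoted above.

```python
from scipy import stats

# Probability of an outcome at or above z = +2.0 under a normal model.
p_above = stats.norm.sf(2.0)
print(f"P(Z >= 2.0) = {p_above:.4f}")   # about 0.0228, close to the 2.5% rule of thumb
```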
Key Formulas Reference Table
| Measure | Formula | When to Use |
|---|---|---|
| Mean | sum(x_i) / n | Default measure, all data points contribute |
| Median | Middle value of sorted data | Skewed data, outlier-resistant |
| Weighted Mean | sum(w_i * x_i) / sum(w_i) | Recency weighting, variable importance |
| Trimmed Mean | Mean of middle 100*(1 - 2p)% of data (proportion p trimmed from each tail) | Outlier-resistant while using most data |
| Variance | sum((x_i - x_bar)^2) / (n-1) | Quantifying spread (squared units) |
| Std Deviation | sqrt(variance) | Quantifying spread (original units) |
| IQR | Q3 - Q1 | Robust measure of spread |
| CV | (std / mean) * 100% | Comparing variability across scales |
| Pearson r | Covariance / (std_x * std_y) | Linear relationship strength |
| Spearman rho | Pearson r on ranked data | Monotonic relationships, ordinal data |
| Z-score | (x - mean) / std | Standardization, cross-era comparison |
| Cohen's d | (mean_1 - mean_2) / pooled_std | Practical significance of group differences |
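For convenience, here is a sketch mapping several of these formulas to NumPy/SciPy equivalents, using sample-based (n - 1) denominators to match the table:

```python
import numpy as np
from scipy import stats

def weighted_mean(x, w):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    return np.sum(w * x) / np.sum(w)

def iqr(x):
    """Interquartile range: Q3 - Q1."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

def coefficient_of_variation(x):
    """CV as a percentage: (std / mean) * 100."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / x.mean() * 100

def z_scores(x):
    """Standardize every observation against the sample mean and std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

# The remaining rows map to built-ins: np.mean, np.median, stats.trim_mean,
# np.var(ddof=1), np.std(ddof=1), np.corrcoef (Pearson), stats.spearmanr,
# and the cohens_d function sketched earlier in the chapter.
```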
Common Mistakes to Avoid
- Using the mean without checking for outliers. Always compare the mean to the median. If they differ substantially, investigate why.
- Ignoring variability. Two teams with the same average are not the same. Standard deviation is as important as the mean for betting purposes.
- Confusing correlation with causation. Just because two variables move together does not mean one causes the other. Always look for lurking variables.
- Data dredging. Testing dozens of correlations and reporting only the significant ones is scientific fraud. Pre-register your hypotheses or at least be honest about how many things you tested.
- Ignoring sample size. A correlation of r = 0.6 based on 10 games is much less reliable than r = 0.3 based on 200 games. We will formalize this with confidence intervals in Chapter 7.
- Failing to standardize across eras. Raw statistics are meaningless without context. Always compute z-scores when comparing across different seasons or leagues.
- Over-relying on a single visualization. Use multiple plot types to explore the same data. Each visualization reveals different aspects of the distribution.
What's Next: Chapter 7 - Probability Distributions in Sports
With descriptive statistics, you can characterize what has happened. In Chapter 7, we move from description to probabilistic modeling, asking: what is likely to happen next?
Chapter 7 will introduce the probability distributions that underpin every sports betting model. You will learn:
- The normal distribution and why NFL point spreads approximately follow it. You will use the z-scores from this chapter to calculate the probability of covering a spread.
- The Poisson distribution for modeling discrete events like goals in soccer, runs in baseball, and touchdowns in football. This is the foundation of many scoring models used by sharp bettors.
- The binomial distribution for modeling win-loss records and against-the-spread performance. Is a 60% ATS record genuinely skillful, or could it be luck? The binomial distribution gives you the answer.
- Heavy-tailed distributions and why they matter when standard models fail. The QQ plots from this chapter showed you that some sports data has heavier tails than the normal distribution predicts. Chapter 7 will introduce the Student's t-distribution and other alternatives that handle this better.
The descriptive statistics you learned in this chapter are the inputs to every probability model in Chapter 7. The means and standard deviations you calculate become the parameters of normal distributions. The count data you summarize becomes the lambda parameter of a Poisson distribution. Every concept in this chapter feeds directly into the next.
Make sure you are comfortable with z-scores, standard deviations, and the visual tools from this chapter before proceeding. If any concept feels shaky, revisit the Python code examples and run them with your own data. The best way to internalize these ideas is to apply them to a sport and team you follow closely. Calculate your team's mean and standard deviation of scoring, check the normality of their distribution with a QQ plot, and compare their z-score to the rest of the league. When you can do this fluently, you are ready for Chapter 7.
Practice Exercises
Exercise 6.1: Central Tendency Analysis. Collect the last 17 games of scoring data for an NFL team (or any sport you follow). Calculate the mean, median, trimmed mean (10%), and weighted average (alpha = 0.9). Which measure do you think best represents the team's current scoring ability? Why?
Exercise 6.2: Variability and Betting. For the same team, calculate the standard deviation and IQR of their scoring. Then find an opponent and do the same. Based on the variability measures, which team is more predictable? If these two teams were playing, how would you use this variability information to evaluate a point spread bet versus a totals bet?
Exercise 6.3: Correlation Investigation. Using publicly available team statistics for any league, compute the correlation between at least five different statistics and wins. Which statistic has the strongest correlation? Which has the weakest? Are any of the weak correlations surprising? Can you explain why a commonly cited statistic might have a weak correlation with winning?
Exercise 6.4: Visualization Portfolio. Create the following visualizations for a sport of your choice: (a) a histogram of scoring with mean and median marked, (b) box plots comparing at least three teams, (c) a scatter plot of the strongest stat-wins correlation you found in Exercise 6.3, (d) a correlation heatmap of at least six variables. Write one paragraph interpreting each visualization.
Exercise 6.5: Era Comparison. Choose a single statistic in any sport (passing yards, home runs, goals per game, etc.) and gather the league average and standard deviation for at least three different seasons spread across different eras. Identify the best individual performance in each era and compute z-scores. Which performance was most dominant relative to its era? Did the raw numbers or the z-scores lead you to a different conclusion?
Exercise 6.6: The Full Workflow. Pick a team and build a complete descriptive profile: all central tendency measures, all variability measures, QQ plot for normality, and z-score relative to the league. Write a one-page summary of what the data tells you about this team. Then find their next game's point spread and total, and explain whether the descriptive statistics support or challenge the market's numbers.