Learning Objectives
- Calculate and interpret measures of central tendency for football metrics
- Apply measures of spread to understand performance variability
- Analyze distributions of football statistics
- Use correlation to identify relationships between metrics
- Build statistical profiles for team and player comparison
Chapter 4: Descriptive Statistics in Football
"Statistics are like a lamp post to a drunk—used more for support than illumination." — Vin Scully (paraphrasing Andrew Lang)
Introduction
Before you can predict outcomes, identify trends, or build models, you must first understand what your data looks like. Descriptive statistics provide the foundation for all football analytics—they summarize, describe, and illuminate the patterns hidden in raw numbers.
This chapter teaches you to see beyond individual data points to understand the story your data tells. When someone says "Alabama averages 35 points per game," that single number hides tremendous variation. Understanding how those points are distributed—the consistency, the outliers, the context—separates amateur analysis from professional insight.
Why Descriptive Statistics Matter in Football
Consider two quarterbacks with identical passer ratings of 95.0. Are they equally valuable? Not necessarily:
- QB A: Consistent performer, rating between 85-105 every game
- QB B: Boom-or-bust, rating swings from 60 to 130
The averages are identical, but these are fundamentally different players. Descriptive statistics help you see these differences—and make better decisions because of them.
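A quick sketch makes the point numerically. The game-by-game ratings below are hypothetical, chosen to match the ranges described above; both players average exactly 95.0, but their spreads differ sharply:

```python
import numpy as np

# Hypothetical game-by-game passer ratings for the two QBs described above
qb_a = np.array([95, 88, 102, 91, 99, 85, 105, 95])    # steady, 85-105
qb_b = np.array([60, 130, 70, 125, 65, 128, 62, 120])  # boom-or-bust

# Identical means, very different variability
print(f"QB A: mean {np.mean(qb_a):.1f}, std {np.std(qb_a):.1f}")
print(f"QB B: mean {np.mean(qb_b):.1f}, std {np.std(qb_b):.1f}")
```

The standard deviation, covered in Part 2, is the number that separates these two profiles.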
What You'll Learn
This chapter covers four essential areas:
- Central Tendency: Mean, median, mode—and when each matters
- Variability: Standard deviation, range, percentiles—understanding consistency
- Distributions: Shape, skewness, outliers—seeing the full picture
- Relationships: Correlation, covariance—connecting metrics
Part 1: Measures of Central Tendency
Central tendency answers the question: "What's typical?" In football, this might be average points per game, typical rushing yards, or expected completion percentage.
The Mean (Average)
The arithmetic mean is the most common measure—add all values and divide by count:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
```python
import pandas as pd
import numpy as np

# Sample team scoring data
games = pd.DataFrame({
    "week": range(1, 13),
    "opponent": ["Duke", "Texas", "Arkansas", "Ole Miss", "Vanderbilt", "Tennessee",
                 "LSU", "Missouri", "Kentucky", "Auburn", "Georgia", "Alabama"],
    "points_scored": [52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35],
    "points_allowed": [10, 24, 17, 28, 7, 21, 28, 14, 10, 24, 41, 28]
})

# Calculate mean
mean_scored = games["points_scored"].mean()
mean_allowed = games["points_allowed"].mean()

print(f"Average points scored: {mean_scored:.1f}")
print(f"Average points allowed: {mean_allowed:.1f}")
print(f"Average margin: {mean_scored - mean_allowed:.1f}")
```

Output:

```
Average points scored: 35.5
Average points allowed: 21.0
Average margin: 14.5
```
When the Mean Misleads
The mean is sensitive to extreme values. Consider rushing yards in a game where most plays gain 3-5 yards, but one 75-yard touchdown run occurs:
```python
rushing_plays = [3, 4, 2, 5, 3, 4, 75, 3, 4, 2]  # Including one big play
mean_yards = np.mean(rushing_plays)
print(f"Mean yards per carry: {mean_yards:.1f}")  # 10.5 yards

# Remove the outlier
typical_plays = [x for x in rushing_plays if x < 20]
typical_mean = np.mean(typical_plays)
print(f"Typical yards per carry: {typical_mean:.1f}")  # 3.3 yards
```
The 75-yard run distorts our understanding of "typical" performance. This is where the median helps.
The Median (Middle Value)
The median is the middle value when data is sorted. It's resistant to outliers:
```python
rushing_plays = [3, 4, 2, 5, 3, 4, 75, 3, 4, 2]
median_yards = np.median(rushing_plays)
print(f"Median yards per carry: {median_yards:.1f}")  # 3.5 yards
```
The median (3.5) better represents typical performance than the mean (10.5).
When to Use Median vs. Mean
| Situation | Use | Reason |
|---|---|---|
| Symmetric data | Mean | More precise estimate |
| Skewed data | Median | Resistant to outliers |
| Comparing players/teams | Both | Show different aspects |
| Financial data (salaries) | Median | Usually skewed |
| Standardized metrics | Mean | By design |
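The "financial data" row is easy to demonstrate. The salary figures below are hypothetical: one highly paid coordinator drags the mean far above what anyone typically earns, while the median stays representative:

```python
import numpy as np

# Hypothetical assistant-coach salaries (in $1000s); one outlier skews the mean
salaries = np.array([95, 110, 120, 130, 140, 150, 160, 1200])
print(f"Mean:   ${np.mean(salaries):,.1f}k")   # pulled up by the outlier
print(f"Median: ${np.median(salaries):,.1f}k") # closer to a typical salary
```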
The Mode (Most Frequent)
The mode is the most common value. It's particularly useful for categorical data:
```python
# Play type distribution
plays = pd.DataFrame({
    "play_type": ["Pass", "Rush", "Pass", "Pass", "Rush", "Pass",
                  "Rush", "Pass", "Pass", "Punt", "Rush", "Pass"]
})
mode_play = plays["play_type"].mode()[0]
print(f"Most common play type: {mode_play}")  # Pass
```
For continuous data like yards gained, mode is less useful unless you bin the data:
```python
# Binned rushing yards
yards = pd.Series([3, 4, 2, 5, 3, 4, 8, 3, 4, 2, 1, 3, 4, 5])
bins = pd.cut(yards, bins=[0, 3, 6, 10, 100],
              labels=["Short", "Medium", "Long", "Explosive"])
print(bins.value_counts())
```
Practical Example: Comparing Offenses
```python
def compare_offenses(team1_points: list, team2_points: list,
                     team1_name: str, team2_name: str) -> pd.DataFrame:
    """
    Compare two teams' offensive production using multiple measures.

    Parameters
    ----------
    team1_points, team2_points : list
        Points scored in each game
    team1_name, team2_name : str
        Team names for display

    Returns
    -------
    pd.DataFrame
        Comparison statistics
    """
    stats = {
        "Metric": ["Mean", "Median", "Min", "Max"],
        team1_name: [
            np.mean(team1_points),
            np.median(team1_points),
            np.min(team1_points),
            np.max(team1_points)
        ],
        team2_name: [
            np.mean(team2_points),
            np.median(team2_points),
            np.min(team2_points),
            np.max(team2_points)
        ]
    }
    return pd.DataFrame(stats)

# Compare Georgia and Alabama
georgia_points = [49, 33, 56, 26, 41, 38, 42, 16, 27, 30, 65, 35]
alabama_points = [55, 24, 63, 42, 49, 34, 42, 24, 49, 27, 41, 31]

comparison = compare_offenses(georgia_points, alabama_points, "Georgia", "Alabama")
print(comparison.to_string(index=False))
```
Part 2: Measures of Variability
Central tendency tells you what's typical; variability tells you how spread out the data is. In football, consistency often matters as much as the average.
Range
The simplest measure of spread—maximum minus minimum:
```python
points = [52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35]
range_points = max(points) - min(points)
print(f"Scoring range: {range_points} points")  # 38 points (17 to 55)
```
Limitation: Range only uses two values, ignoring everything in between.
Variance and Standard Deviation
Standard deviation measures how far values typically fall from the mean (formally, the root-mean-square deviation):
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$
```python
points = np.array([52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35])

# Manual calculation
mean = np.mean(points)
squared_diffs = (points - mean) ** 2
variance = np.mean(squared_diffs)
std_dev = np.sqrt(variance)

print(f"Mean: {mean:.1f}")
print(f"Variance: {variance:.1f}")
print(f"Standard Deviation: {std_dev:.1f}")

# Using NumPy (same result)
print(f"\nNumPy std: {np.std(points):.1f}")
```
Interpreting Standard Deviation
For roughly normal distributions:
- ~68% of values fall within 1 standard deviation of the mean
- ~95% fall within 2 standard deviations
- ~99.7% fall within 3 standard deviations
```python
points = np.array([52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35])
mean = np.mean(points)
std = np.std(points)

print(f"Mean: {mean:.1f}, Std Dev: {std:.1f}")
print(f"One std range: [{mean - std:.1f}, {mean + std:.1f}]")
print(f"Two std range: [{mean - 2*std:.1f}, {mean + 2*std:.1f}]")

# Count how many fall within ranges
within_one = np.sum((points >= mean - std) & (points <= mean + std))
within_two = np.sum((points >= mean - 2*std) & (points <= mean + 2*std))
print(f"Within 1 std: {within_one}/12 ({within_one/12:.0%})")
print(f"Within 2 std: {within_two}/12 ({within_two/12:.0%})")
```
Coefficient of Variation
When comparing variability across different scales, use coefficient of variation (CV):
$$CV = \frac{\sigma}{\bar{x}} \times 100\%$$
```python
def coefficient_of_variation(data):
    """Calculate CV as a percentage."""
    return (np.std(data) / np.mean(data)) * 100

# Compare rushing vs passing variability
rushing_yards = [150, 120, 180, 95, 165, 140]
passing_yards = [280, 310, 250, 340, 290, 275]

cv_rush = coefficient_of_variation(rushing_yards)
cv_pass = coefficient_of_variation(passing_yards)
print(f"Rushing CV: {cv_rush:.1f}%")
print(f"Passing CV: {cv_pass:.1f}%")
print(f"\nMore variable: {'Rushing' if cv_rush > cv_pass else 'Passing'}")
```
Percentiles and Quartiles
Percentiles divide data into 100 equal parts. Quartiles divide it into 4:
- Q1 (25th percentile): 25% of data below
- Q2 (50th percentile): Median
- Q3 (75th percentile): 75% of data below
- IQR (Interquartile Range): Q3 - Q1
```python
def calculate_quartiles(data: np.ndarray) -> dict:
    """Calculate quartile statistics."""
    return {
        "min": np.min(data),
        "Q1": np.percentile(data, 25),
        "median": np.percentile(data, 50),
        "Q3": np.percentile(data, 75),
        "max": np.max(data),
        "IQR": np.percentile(data, 75) - np.percentile(data, 25)
    }

points = np.array([17, 24, 27, 28, 31, 32, 35, 38, 42, 45, 52, 55])
quartiles = calculate_quartiles(points)

print("Five-Number Summary:")
for key, value in quartiles.items():
    print(f"  {key}: {value:.1f}")
```
Practical Example: Consistency Analysis
```python
def analyze_consistency(player_stats: pd.DataFrame, stat_col: str,
                        player_col: str = "player") -> pd.DataFrame:
    """
    Analyze player consistency in a given statistic.

    Parameters
    ----------
    player_stats : pd.DataFrame
        Game-by-game player statistics
    stat_col : str
        Column to analyze
    player_col : str
        Column containing player names

    Returns
    -------
    pd.DataFrame
        Consistency metrics for each player
    """
    consistency = player_stats.groupby(player_col)[stat_col].agg([
        ("mean", "mean"),
        ("std", "std"),
        ("min", "min"),
        ("max", "max"),
        ("range", lambda x: x.max() - x.min()),
        ("cv", lambda x: (x.std() / x.mean() * 100) if x.mean() != 0 else 0)
    ]).round(1)
    return consistency.sort_values("cv")  # Most consistent first

# Sample quarterback data
qb_games = pd.DataFrame({
    "player": ["QB1"]*6 + ["QB2"]*6,
    "passing_yards": [280, 290, 275, 285, 295, 270,  # QB1: consistent
                      180, 350, 220, 320, 190, 340]  # QB2: inconsistent
})
consistency = analyze_consistency(qb_games, "passing_yards")
print("Quarterback Consistency Analysis:")
print(consistency)
```
Part 3: Understanding Distributions
A distribution shows how values are spread across the range of possible outcomes. Understanding distributions helps you set realistic expectations and identify unusual performances.
Visualizing Distributions
While this chapter focuses on calculations, distribution shape is often best understood visually. Key characteristics:
- Symmetric: Mean ≈ Median (normal distribution)
- Right-skewed: Mean > Median (long tail to the right)
- Left-skewed: Mean < Median (long tail to the left)
- Bimodal: Two peaks (two distinct groups)
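The mean-vs-median comparisons above give a quick diagnostic you can run before plotting anything. A short sketch, using made-up play data with a few long gains:

```python
import numpy as np

# Rule of thumb: compare mean and median to guess the skew direction
play_yards = np.array([2, 3, 3, 4, 4, 5, 6, 8, 12, 45])  # a few long gains
mean, median = np.mean(play_yards), np.median(play_yards)
if mean > median:
    shape = "right-skewed"
elif mean < median:
    shape = "left-skewed"
else:
    shape = "roughly symmetric"
print(f"mean {mean:.1f} vs median {median:.1f} -> {shape}")
```

This heuristic is not foolproof (it can fail for multimodal data), but it is a useful first check.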
Skewness
Skewness measures distribution asymmetry:
```python
from scipy import stats

# Rushing yards (right-skewed: many short gains, few long runs)
rushing_yards = [3, 4, 2, 5, 3, 4, 8, 3, 4, 2, 1, 3, 4, 5, 45, 3, 4, 2]
skewness = stats.skew(rushing_yards)
print(f"Skewness: {skewness:.2f}")

# Interpretation
if skewness > 0.5:
    print("Right-skewed (positive): Long tail to the right")
elif skewness < -0.5:
    print("Left-skewed (negative): Long tail to the left")
else:
    print("Approximately symmetric")
```
Football examples:
- Right-skewed: Individual play yards, player salaries, recruiting rankings
- Left-skewed: Completion percentage (ceiling at 100%), time of possession
- Symmetric: Point differentials, standardized metrics (z-scores)
Kurtosis
Kurtosis measures "tailedness"—how extreme the outliers are:
```python
# Compare two running backs
rb1_yards = np.random.normal(4.5, 2, 100)  # Consistent
rb2_yards = np.concatenate([
    np.random.normal(3, 1, 90),    # Mostly short gains
    np.random.uniform(20, 50, 10)  # With some explosive plays
])

kurtosis_rb1 = stats.kurtosis(rb1_yards)
kurtosis_rb2 = stats.kurtosis(rb2_yards)
print(f"RB1 kurtosis: {kurtosis_rb1:.2f} (consistent)")
print(f"RB2 kurtosis: {kurtosis_rb2:.2f} (boom-or-bust)")
```
- High kurtosis (>0): More extreme values than normal
- Low kurtosis (<0): Fewer extreme values than normal
Identifying Outliers
Outliers are extreme values that may indicate special circumstances or data errors.
Method 1: IQR Method
```python
def find_outliers_iqr(data: np.ndarray, multiplier: float = 1.5) -> np.ndarray:
    """
    Find outliers using the IQR method.

    Parameters
    ----------
    data : np.ndarray
        Data to analyze
    multiplier : float
        IQR multiplier (1.5 for outliers, 3.0 for extreme outliers)

    Returns
    -------
    np.ndarray
        Boolean array marking outliers
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    return (data < lower_bound) | (data > upper_bound)

# Example: Find outlier games
points = np.array([35, 42, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 82])
outliers = find_outliers_iqr(points)
print(f"Data: {points}")
print(f"Outliers: {points[outliers]}")
```
Method 2: Z-Score Method
```python
def find_outliers_zscore(data: np.ndarray, threshold: float = 2.5) -> np.ndarray:
    """
    Find outliers using z-scores.

    Parameters
    ----------
    data : np.ndarray
        Data to analyze
    threshold : float
        Z-score threshold (typically 2.5 or 3.0)

    Returns
    -------
    np.ndarray
        Boolean array marking outliers
    """
    z_scores = np.abs(stats.zscore(data))
    return z_scores > threshold

# Example
points = np.array([35, 42, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 82])
outliers = find_outliers_zscore(points)
print(f"Outliers (z-score): {points[outliers]}")
```
Practical Example: Scoring Distribution Analysis
```python
def analyze_scoring_distribution(points: np.ndarray, team_name: str) -> dict:
    """
    Comprehensive scoring distribution analysis.

    Parameters
    ----------
    points : np.ndarray
        Points scored in each game
    team_name : str
        Team name

    Returns
    -------
    dict
        Distribution analysis results
    """
    analysis = {
        "team": team_name,
        "games": len(points),
        "central_tendency": {
            "mean": np.mean(points),
            "median": np.median(points),
            "mode": float(stats.mode(points, keepdims=True)[0][0])
        },
        "spread": {
            "std": np.std(points),
            "variance": np.var(points),
            "range": np.ptp(points),
            "iqr": np.percentile(points, 75) - np.percentile(points, 25)
        },
        "shape": {
            "skewness": stats.skew(points),
            "kurtosis": stats.kurtosis(points)
        },
        "percentiles": {
            "10th": np.percentile(points, 10),
            "25th": np.percentile(points, 25),
            "50th": np.percentile(points, 50),
            "75th": np.percentile(points, 75),
            "90th": np.percentile(points, 90)
        }
    }

    # Interpret skewness
    skew = analysis["shape"]["skewness"]
    if skew > 0.5:
        analysis["shape"]["interpretation"] = "Right-skewed: More low-scoring games"
    elif skew < -0.5:
        analysis["shape"]["interpretation"] = "Left-skewed: More high-scoring games"
    else:
        analysis["shape"]["interpretation"] = "Approximately symmetric"

    return analysis

# Analyze a team's scoring
georgia_points = np.array([49, 33, 56, 26, 41, 38, 42, 16, 27, 30, 65, 35])
analysis = analyze_scoring_distribution(georgia_points, "Georgia")

print(f"SCORING DISTRIBUTION: {analysis['team']}")
print("-" * 50)
print(f"Mean: {analysis['central_tendency']['mean']:.1f}")
print(f"Median: {analysis['central_tendency']['median']:.1f}")
print(f"Std Dev: {analysis['spread']['std']:.1f}")
print(f"Skewness: {analysis['shape']['skewness']:.2f}")
print(f"Interpretation: {analysis['shape']['interpretation']}")
```
Part 4: Relationships Between Variables
Understanding how variables relate to each other is crucial for football analysis. Does a strong rushing attack lead to more wins? Does time of possession correlate with scoring?
Covariance
Covariance measures the direction of the relationship between two variables:
$$Cov(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$$
```python
# Rushing yards and points scored
rushing = np.array([150, 120, 180, 95, 165, 140, 200, 110, 175, 130])
points = np.array([35, 24, 42, 17, 38, 31, 45, 21, 40, 28])

# Note: np.cov returns the sample covariance (n-1 denominator) by default
covariance = np.cov(rushing, points)[0, 1]
print(f"Covariance: {covariance:.1f}")
```
Interpretation:
- Positive covariance: Variables tend to move together
- Negative covariance: Variables tend to move in opposite directions
- Near zero: Little linear relationship
Problem: Covariance magnitude depends on variable scales, making comparison difficult.
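The scale problem is easy to demonstrate: measure the same rushing data in feet instead of yards and the covariance triples, even though the underlying relationship is unchanged. (Correlation, introduced next, is immune to this.)

```python
import numpy as np

rushing_yd = np.array([150, 120, 180, 95, 165, 140])
points = np.array([35, 24, 42, 17, 38, 31])
rushing_ft = rushing_yd * 3  # the same data, measured in feet

print(f"Cov (yards):  {np.cov(rushing_yd, points)[0, 1]:.1f}")
print(f"Cov (feet):   {np.cov(rushing_ft, points)[0, 1]:.1f}")    # 3x larger
print(f"Corr (yards): {np.corrcoef(rushing_yd, points)[0, 1]:.3f}")
print(f"Corr (feet):  {np.corrcoef(rushing_ft, points)[0, 1]:.3f}")  # unchanged
```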
Correlation
Correlation standardizes covariance to a -1 to +1 scale:
$$r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$
```python
def calculate_correlation(x: np.ndarray, y: np.ndarray) -> tuple:
    """
    Calculate Pearson correlation with interpretation.

    Parameters
    ----------
    x, y : np.ndarray
        Variables to correlate

    Returns
    -------
    tuple
        (correlation coefficient, interpretation)
    """
    r = np.corrcoef(x, y)[0, 1]

    if abs(r) >= 0.8:
        strength = "Very strong"
    elif abs(r) >= 0.6:
        strength = "Strong"
    elif abs(r) >= 0.4:
        strength = "Moderate"
    elif abs(r) >= 0.2:
        strength = "Weak"
    else:
        strength = "Very weak or no"
    direction = "positive" if r > 0 else "negative"
    return r, f"{strength} {direction} correlation"

# Examples
rushing = np.array([150, 120, 180, 95, 165, 140, 200, 110, 175, 130])
points = np.array([35, 24, 42, 17, 38, 31, 45, 21, 40, 28])

r, interpretation = calculate_correlation(rushing, points)
print(f"Rushing vs Points: r = {r:.3f} ({interpretation})")
```
Correlation Matrix
When analyzing multiple variables, create a correlation matrix:
```python
def create_correlation_matrix(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """
    Create correlation matrix for specified columns.

    Parameters
    ----------
    df : pd.DataFrame
        Data
    columns : list
        Columns to include

    Returns
    -------
    pd.DataFrame
        Correlation matrix
    """
    return df[columns].corr().round(3)

# Team game statistics
team_stats = pd.DataFrame({
    "points": [35, 24, 42, 17, 38, 31, 45, 21, 40, 28],
    "rush_yards": [150, 120, 180, 95, 165, 140, 200, 110, 175, 130],
    "pass_yards": [280, 250, 310, 220, 275, 260, 290, 240, 300, 265],
    "turnovers": [0, 2, 0, 3, 1, 1, 0, 2, 0, 1],
    "time_of_possession": [32, 28, 35, 25, 33, 30, 36, 27, 34, 29]
})

corr_matrix = create_correlation_matrix(
    team_stats,
    ["points", "rush_yards", "pass_yards", "turnovers", "time_of_possession"]
)
print("Correlation Matrix:")
print(corr_matrix)
```
Correlation Pitfalls
1. Correlation ≠ Causation
```python
# Ice cream sales and drowning deaths are correlated:
# both increase in summer, but ice cream doesn't cause drowning.
```
2. Non-linear Relationships
Correlation measures linear relationships. A U-shaped relationship shows low correlation:
```python
# Time of possession example:
# too little AND too much possession can both correlate with losing
# (not enough offense, or clock-killing when already behind).
```
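A synthetic sketch of this pitfall: the possession and margin values below are fabricated to form a perfectly symmetric U-shape, and Pearson's r comes out near zero despite the obvious pattern.

```python
import numpy as np

# Synthetic U-shape: margin is best at 30 minutes, worst at both extremes
possession = np.arange(20, 41)           # minutes of possession, 20 through 40
margin = -((possession - 30) ** 2) + 50  # symmetric parabola peaking at 30

r = np.corrcoef(possession, margin)[0, 1]
print(f"Pearson r on a symmetric U-shape: {r:.3f}")  # near zero
```

Plotting the data, or binning possession and comparing group means, would reveal the relationship that r misses.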
3. Outliers Can Dominate
```python
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # Outlier
y = np.array([2, 4, 5, 4, 5, 6, 7, 6, 8, 50])

r_with_outlier = np.corrcoef(x, y)[0, 1]
r_without = np.corrcoef(x[:-1], y[:-1])[0, 1]
print(f"With outlier: r = {r_with_outlier:.3f}")
print(f"Without: r = {r_without:.3f}")
```
Practical Example: What Drives Winning?
```python
def analyze_winning_factors(games_df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze which statistics correlate most with winning.

    Parameters
    ----------
    games_df : pd.DataFrame
        Game statistics with win/loss outcome

    Returns
    -------
    pd.DataFrame
        Correlations with winning, sorted by strength
    """
    # Create win indicator (1 = win, 0 = loss)
    games_df = games_df.copy()
    games_df["win"] = (games_df["points_for"] > games_df["points_against"]).astype(int)

    # Stats to analyze
    stat_columns = ["rush_yards", "pass_yards", "total_yards",
                    "turnovers", "time_of_possession", "third_down_pct"]

    correlations = []
    for col in stat_columns:
        if col in games_df.columns:
            r = games_df["win"].corr(games_df[col])
            correlations.append({"Statistic": col, "Correlation": r})

    result = pd.DataFrame(correlations)
    result["Abs_Correlation"] = result["Correlation"].abs()
    result = result.sort_values("Abs_Correlation", ascending=False)
    return result[["Statistic", "Correlation"]]

# Sample data
games = pd.DataFrame({
    "points_for": [35, 24, 42, 17, 38, 31, 45, 21, 40, 28],
    "points_against": [21, 28, 14, 31, 24, 28, 17, 35, 21, 31],
    "rush_yards": [150, 120, 180, 95, 165, 140, 200, 110, 175, 130],
    "pass_yards": [280, 250, 310, 220, 275, 260, 290, 240, 300, 265],
    "total_yards": [430, 370, 490, 315, 440, 400, 490, 350, 475, 395],
    "turnovers": [0, 2, 0, 3, 1, 1, 0, 2, 0, 1],
    "time_of_possession": [32, 28, 35, 25, 33, 30, 36, 27, 34, 29],
    "third_down_pct": [45, 35, 55, 25, 48, 40, 52, 32, 50, 38]
})

winning_factors = analyze_winning_factors(games)
print("Factors Correlated with Winning:")
print(winning_factors.to_string(index=False))
```
Part 5: Building Statistical Profiles
Combining these concepts, we can build comprehensive statistical profiles for teams and players.
Team Profile
```python
def create_team_profile(team_games: pd.DataFrame, team_name: str) -> dict:
    """
    Create comprehensive statistical profile for a team.

    Parameters
    ----------
    team_games : pd.DataFrame
        All games for the team
    team_name : str
        Team name

    Returns
    -------
    dict
        Complete statistical profile
    """
    profile = {
        "team": team_name,
        "games": len(team_games),
        "record": {
            "wins": (team_games["points_for"] > team_games["points_against"]).sum(),
            "losses": (team_games["points_for"] < team_games["points_against"]).sum()
        }
    }

    # Scoring profile
    profile["scoring"] = {
        "offense": {
            "mean": team_games["points_for"].mean(),
            "median": team_games["points_for"].median(),
            "std": team_games["points_for"].std(),
            "min": team_games["points_for"].min(),
            "max": team_games["points_for"].max()
        },
        "defense": {
            "mean": team_games["points_against"].mean(),
            "median": team_games["points_against"].median(),
            "std": team_games["points_against"].std(),
            "min": team_games["points_against"].min(),
            "max": team_games["points_against"].max()
        }
    }

    # Yardage profile (if available)
    if "rush_yards" in team_games.columns:
        profile["rushing"] = {
            "mean": team_games["rush_yards"].mean(),
            "std": team_games["rush_yards"].std(),
            "consistency": 100 - (team_games["rush_yards"].std() /
                                  team_games["rush_yards"].mean() * 100)
        }
    if "pass_yards" in team_games.columns:
        profile["passing"] = {
            "mean": team_games["pass_yards"].mean(),
            "std": team_games["pass_yards"].std(),
            "consistency": 100 - (team_games["pass_yards"].std() /
                                  team_games["pass_yards"].mean() * 100)
        }

    # Turnover profile (if available)
    if "turnovers" in team_games.columns:
        profile["turnovers"] = {
            "mean": team_games["turnovers"].mean(),
            "total": team_games["turnovers"].sum(),
            "games_with_zero": (team_games["turnovers"] == 0).sum()
        }

    return profile

# Create profile
team_data = pd.DataFrame({
    "points_for": [35, 24, 42, 17, 38, 31, 45, 21, 40, 28, 33, 41],
    "points_against": [21, 28, 14, 31, 24, 28, 17, 35, 21, 31, 20, 24],
    "rush_yards": [150, 120, 180, 95, 165, 140, 200, 110, 175, 130, 155, 170],
    "pass_yards": [280, 250, 310, 220, 275, 260, 290, 240, 300, 265, 285, 295],
    "turnovers": [0, 2, 0, 3, 1, 1, 0, 2, 0, 1, 0, 1]
})

profile = create_team_profile(team_data, "Example Team")
print(f"TEAM PROFILE: {profile['team']}")
print(f"Record: {profile['record']['wins']}-{profile['record']['losses']}")
print(f"\nOffense: {profile['scoring']['offense']['mean']:.1f} PPG (±{profile['scoring']['offense']['std']:.1f})")
print(f"Defense: {profile['scoring']['defense']['mean']:.1f} PPG allowed (±{profile['scoring']['defense']['std']:.1f})")
print(f"\nRushing: {profile['rushing']['mean']:.1f} YPG ({profile['rushing']['consistency']:.0f}% consistency)")
print(f"Passing: {profile['passing']['mean']:.1f} YPG ({profile['passing']['consistency']:.0f}% consistency)")
print(f"\nTurnovers: {profile['turnovers']['mean']:.1f} per game ({profile['turnovers']['games_with_zero']} clean games)")
```
Comparing Teams
```python
def compare_team_profiles(profile1: dict, profile2: dict) -> pd.DataFrame:
    """
    Compare two team profiles side-by-side.

    Parameters
    ----------
    profile1, profile2 : dict
        Team profiles to compare

    Returns
    -------
    pd.DataFrame
        Comparison table
    """
    comparison = []

    # Record
    comparison.append({
        "Category": "Record",
        profile1["team"]: f"{profile1['record']['wins']}-{profile1['record']['losses']}",
        profile2["team"]: f"{profile2['record']['wins']}-{profile2['record']['losses']}"
    })

    # Scoring
    comparison.append({
        "Category": "Points Per Game",
        profile1["team"]: f"{profile1['scoring']['offense']['mean']:.1f}",
        profile2["team"]: f"{profile2['scoring']['offense']['mean']:.1f}"
    })
    comparison.append({
        "Category": "Points Allowed",
        profile1["team"]: f"{profile1['scoring']['defense']['mean']:.1f}",
        profile2["team"]: f"{profile2['scoring']['defense']['mean']:.1f}"
    })

    # Consistency
    comparison.append({
        "Category": "Scoring Consistency",
        profile1["team"]: f"±{profile1['scoring']['offense']['std']:.1f}",
        profile2["team"]: f"±{profile2['scoring']['offense']['std']:.1f}"
    })

    return pd.DataFrame(comparison)
```
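The function above is never invoked in this chapter. A usage sketch might look like the following; a condensed stand-in for the comparison logic is repeated so the snippet runs on its own, and the profile numbers are hypothetical:

```python
import pandas as pd

# Condensed stand-in for the compare_team_profiles function defined above,
# repeated here so this usage sketch runs on its own
def compare_profiles_sketch(profile1: dict, profile2: dict) -> pd.DataFrame:
    rows = [
        ("Record", lambda p: f"{p['record']['wins']}-{p['record']['losses']}"),
        ("Points Per Game", lambda p: f"{p['scoring']['offense']['mean']:.1f}"),
        ("Points Allowed", lambda p: f"{p['scoring']['defense']['mean']:.1f}"),
    ]
    return pd.DataFrame([
        {"Category": cat, profile1["team"]: fmt(profile1), profile2["team"]: fmt(profile2)}
        for cat, fmt in rows
    ])

# Hand-built profiles containing only the keys the comparison reads (hypothetical numbers)
p1 = {"team": "Team A", "record": {"wins": 9, "losses": 3},
      "scoring": {"offense": {"mean": 33.5}, "defense": {"mean": 21.0}}}
p2 = {"team": "Team B", "record": {"wins": 7, "losses": 5},
      "scoring": {"offense": {"mean": 28.3}, "defense": {"mean": 24.5}}}

table = compare_profiles_sketch(p1, p2)
print(table.to_string(index=False))
```

In practice you would pass two dicts produced by `create_team_profile` instead of the hand-built ones.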
Part 6: Z-Scores and Standardization
When comparing values across different scales, standardization is essential. A z-score tells you how many standard deviations a value is from the mean.
$$z = \frac{x - \mu}{\sigma}$$
Calculating Z-Scores
```python
def calculate_zscore(value: float, data: np.ndarray) -> float:
    """Calculate z-score for a value relative to a dataset."""
    return (value - np.mean(data)) / np.std(data)

# Example: How does 45 points compare to the season average?
season_points = np.array([35, 24, 42, 17, 38, 31, 45, 21, 40, 28, 33, 41])
z = calculate_zscore(45, season_points)
print(f"45 points: z = {z:.2f}")
print(f"Interpretation: {z:.2f} standard deviations above average")
```
Comparing Across Positions
Z-scores allow fair comparison across different statistics:
```python
def standardize_stats(df: pd.DataFrame, stat_columns: list) -> pd.DataFrame:
    """
    Standardize multiple statistics to z-scores.

    Parameters
    ----------
    df : pd.DataFrame
        Player or team statistics
    stat_columns : list
        Columns to standardize

    Returns
    -------
    pd.DataFrame
        DataFrame with z-scored columns
    """
    result = df.copy()
    for col in stat_columns:
        z_col = f"{col}_z"
        result[z_col] = (result[col] - result[col].mean()) / result[col].std()
    return result

# Compare players with different stats
players = pd.DataFrame({
    "player": ["QB1", "RB1", "WR1", "TE1"],
    "passing_yards": [3500, 0, 0, 0],
    "rushing_yards": [200, 1200, 50, 30],
    "receiving_yards": [0, 300, 1100, 600]
})

# For meaningful comparison, standardize within position groups
```
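One way to follow that advice is to compute z-scores within each position group using `groupby` and `transform`. The players and yardage figures below are hypothetical:

```python
import pandas as pd

# Hypothetical season yardage; z-scores computed within each position group
df = pd.DataFrame({
    "player": ["QB1", "QB2", "QB3", "RB1", "RB2", "RB3"],
    "position": ["QB", "QB", "QB", "RB", "RB", "RB"],
    "yards": [310, 250, 280, 120, 90, 150]
})
# transform keeps the original row order while computing per-group statistics
df["yards_z"] = (df.groupby("position")["yards"]
                   .transform(lambda s: (s - s.mean()) / s.std()))
print(df)
```

Each player's z-score now answers "how does this player compare to others at the same position?" rather than mixing quarterbacks and running backs on one scale.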
Composite Scores
Combine multiple z-scores into a single metric:
```python
def calculate_composite_score(df: pd.DataFrame, metrics: dict) -> pd.Series:
    """
    Calculate weighted composite z-score.

    Parameters
    ----------
    df : pd.DataFrame
        Standardized data with z-score columns
    metrics : dict
        {column: weight} for each metric (weights may be negative
        for stats where lower is better)

    Returns
    -------
    pd.Series
        Composite scores
    """
    composite = pd.Series(0.0, index=df.index)
    for col, weight in metrics.items():
        composite += df[col] * weight
    # Normalize by the total absolute weight so that negative weights
    # don't shrink the denominator and inflate the composite
    total_weight = sum(abs(w) for w in metrics.values())
    return composite / total_weight

# Example: Team efficiency composite
team_stats = pd.DataFrame({
    "team": ["A", "B", "C", "D", "E"],
    "ppg_z": [1.5, 0.8, -0.2, -0.8, -1.3],
    "ypg_z": [1.2, 1.0, 0.1, -0.5, -1.8],
    "turnover_z": [-1.0, -0.5, 0.2, 0.5, 0.8]  # Negative is good for turnovers
})

# Weight: PPG 40%, YPG 30%, Turnovers 30% (inverted)
team_stats["efficiency_composite"] = calculate_composite_score(
    team_stats,
    {"ppg_z": 0.4, "ypg_z": 0.3, "turnover_z": -0.3}  # Negative weight for turnovers
)
print(team_stats.sort_values("efficiency_composite", ascending=False))
```
Summary
Descriptive statistics form the foundation of football analytics:
Central Tendency:
- Mean: Average value, sensitive to outliers
- Median: Middle value, resistant to outliers
- Mode: Most common value

Variability:
- Standard deviation: Typical distance from the mean
- Variance: Squared standard deviation
- Range/IQR: Spread of values
- Coefficient of variation: Relative variability

Distribution Shape:
- Skewness: Direction of asymmetry
- Kurtosis: Tail heaviness
- Outliers: Extreme values

Relationships:
- Correlation: Strength and direction of linear relationship
- Correlation matrix: Multiple relationships at once

Standardization:
- Z-scores: Compare across different scales
- Composite scores: Combine multiple metrics
In the next chapter, we'll apply these concepts to data cleaning and preparation—ensuring your data is accurate before analysis.
Key Terms
| Term | Definition |
|---|---|
| Mean | Sum of values divided by count |
| Median | Middle value when sorted |
| Mode | Most frequent value |
| Standard Deviation | Average distance from the mean |
| Variance | Squared standard deviation |
| Percentile | Value below which a percentage of data falls |
| IQR | Interquartile range (Q3 - Q1) |
| Skewness | Measure of distribution asymmetry |
| Kurtosis | Measure of distribution tail weight |
| Correlation | Standardized measure of linear relationship (-1 to +1) |
| Z-score | Number of standard deviations from the mean |
| Coefficient of Variation | Standard deviation as percentage of mean |