Learning Objectives

  • Calculate and interpret measures of central tendency for football metrics
  • Apply measures of spread to understand performance variability
  • Analyze distributions of football statistics
  • Use correlation to identify relationships between metrics
  • Build statistical profiles for team and player comparison

Chapter 4: Descriptive Statistics in Football

"Statistics are like a lamp post to a drunk—used more for support than illumination." — Vin Scully (paraphrasing Andrew Lang)

Introduction

Before you can predict outcomes, identify trends, or build models, you must first understand what your data looks like. Descriptive statistics provide the foundation for all football analytics—they summarize, describe, and illuminate the patterns hidden in raw numbers.

This chapter teaches you to see beyond individual data points to understand the story your data tells. When someone says "Alabama averages 35 points per game," that single number hides tremendous variation. Understanding how those points are distributed—the consistency, the outliers, the context—separates amateur analysis from professional insight.

Why Descriptive Statistics Matter in Football

Consider two quarterbacks with identical passer ratings of 95.0. Are they equally valuable? Not necessarily:

  • QB A: Consistent performer, rating between 85-105 every game
  • QB B: Boom-or-bust, rating swings from 60 to 130

The averages are identical, but these are fundamentally different players. Descriptive statistics help you see these differences—and make better decisions because of them.
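A quick sketch makes the difference concrete. The ratings below are hypothetical, chosen so both quarterbacks average exactly 95.0 while their spreads diverge sharply:

```python
import numpy as np

# Hypothetical game-by-game passer ratings; both average exactly 95.0
qb_a = np.array([95, 90, 100, 92, 98, 95])    # consistent
qb_b = np.array([60, 130, 70, 120, 65, 125])  # boom-or-bust

for name, ratings in [("QB A", qb_a), ("QB B", qb_b)]:
    print(f"{name}: mean = {ratings.mean():.1f}, std = {ratings.std():.1f}")
```

The means are identical, but QB A's standard deviation is around 3.4 rating points while QB B's is around 30. The spread statistic, not the average, is what separates them.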

What You'll Learn

This chapter covers four essential areas:

  1. Central Tendency: Mean, median, mode—and when each matters
  2. Variability: Standard deviation, range, percentiles—understanding consistency
  3. Distributions: Shape, skewness, outliers—seeing the full picture
  4. Relationships: Correlation, covariance—connecting metrics

Part 1: Measures of Central Tendency

Central tendency answers the question: "What's typical?" In football, this might be average points per game, typical rushing yards, or expected completion percentage.

The Mean (Average)

The arithmetic mean is the most common measure—add all values and divide by count:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

import pandas as pd
import numpy as np

# Sample team scoring data
games = pd.DataFrame({
    "week": range(1, 13),
    "opponent": ["Duke", "Texas", "Arkansas", "Ole Miss", "Vanderbilt", "Tennessee",
                 "LSU", "Missouri", "Kentucky", "Auburn", "Georgia", "Alabama"],
    "points_scored": [52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35],
    "points_allowed": [10, 24, 17, 28, 7, 21, 28, 14, 10, 24, 41, 28]
})

# Calculate mean
mean_scored = games["points_scored"].mean()
mean_allowed = games["points_allowed"].mean()

print(f"Average points scored: {mean_scored:.1f}")
print(f"Average points allowed: {mean_allowed:.1f}")
print(f"Average margin: {mean_scored - mean_allowed:.1f}")

Output:

Average points scored: 35.5
Average points allowed: 21.0
Average margin: 14.5

When the Mean Misleads

The mean is sensitive to extreme values. Consider rushing yards in a game where most plays gain 3-5 yards, but one 75-yard touchdown run occurs:

rushing_plays = [3, 4, 2, 5, 3, 4, 75, 3, 4, 2]  # Including one big play

mean_yards = np.mean(rushing_plays)
print(f"Mean yards per carry: {mean_yards:.1f}")  # 10.5 yards

# Remove the outlier
typical_plays = [x for x in rushing_plays if x < 20]
typical_mean = np.mean(typical_plays)
print(f"Typical yards per carry: {typical_mean:.1f}")  # 3.3 yards

The 75-yard run distorts our understanding of "typical" performance. This is where the median helps.

The Median (Middle Value)

The median is the middle value when data is sorted. It's resistant to outliers:

rushing_plays = [3, 4, 2, 5, 3, 4, 75, 3, 4, 2]

median_yards = np.median(rushing_plays)
print(f"Median yards per carry: {median_yards:.1f}")  # 3.5 yards

The median (3.5) better represents typical performance than the mean (10.5).

When to Use Median vs. Mean

| Situation | Use | Reason |
|-----------|-----|--------|
| Symmetric data | Mean | More precise estimate |
| Skewed data | Median | Resistant to outliers |
| Comparing players/teams | Both | Show different aspects |
| Financial data (salaries) | Median | Usually skewed |
| Standardized metrics | Mean | By design |
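The salary row is easy to demonstrate. A minimal sketch with hypothetical roster salaries (in millions), where a single star contract drags the mean far above the median:

```python
import numpy as np

# Hypothetical salaries in millions; one star contract skews the distribution
salaries = np.array([0.5, 0.55, 0.6, 0.7, 15.0])

print(f"Mean salary:   ${np.mean(salaries):.2f}M")    # pulled up by the star
print(f"Median salary: ${np.median(salaries):.2f}M")  # typical player
```

The mean ($3.47M) is nearly six times the median ($0.60M); only the median describes a typical contract here.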

The Mode (Most Frequent)

The mode is the most common value. It's particularly useful for categorical data:

# Play type distribution
plays = pd.DataFrame({
    "play_type": ["Pass", "Rush", "Pass", "Pass", "Rush", "Pass",
                  "Rush", "Pass", "Pass", "Punt", "Rush", "Pass"]
})

mode_play = plays["play_type"].mode()[0]
print(f"Most common play type: {mode_play}")  # Pass

For continuous data like yards gained, mode is less useful unless you bin the data:

# Binned rushing yards
yards = pd.Series([3, 4, 2, 5, 3, 4, 8, 3, 4, 2, 1, 3, 4, 5])
bins = pd.cut(yards, bins=[0, 3, 6, 10, 100], labels=["Short", "Medium", "Long", "Explosive"])
print(bins.value_counts())

Practical Example: Comparing Offenses

def compare_offenses(team1_points: list, team2_points: list,
                     team1_name: str, team2_name: str) -> pd.DataFrame:
    """
    Compare two teams' offensive production using multiple measures.

    Parameters
    ----------
    team1_points, team2_points : list
        Points scored in each game
    team1_name, team2_name : str
        Team names for display

    Returns
    -------
    pd.DataFrame
        Comparison statistics
    """
    stats = {
        "Metric": ["Mean", "Median", "Min", "Max"],
        team1_name: [
            np.mean(team1_points),
            np.median(team1_points),
            np.min(team1_points),
            np.max(team1_points)
        ],
        team2_name: [
            np.mean(team2_points),
            np.median(team2_points),
            np.min(team2_points),
            np.max(team2_points)
        ]
    }

    return pd.DataFrame(stats)

# Compare Georgia and Alabama
georgia_points = [49, 33, 56, 26, 41, 38, 42, 16, 27, 30, 65, 35]
alabama_points = [55, 24, 63, 42, 49, 34, 42, 24, 49, 27, 41, 31]

comparison = compare_offenses(georgia_points, alabama_points, "Georgia", "Alabama")
print(comparison.to_string(index=False))

Part 2: Measures of Variability

Central tendency tells you what's typical; variability tells you how spread out the data is. In football, consistency often matters as much as the average.

Range

The simplest measure of spread—maximum minus minimum:

points = [52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35]

range_points = max(points) - min(points)
print(f"Scoring range: {range_points} points")  # 38 points (17 to 55)

Limitation: Range only uses two values, ignoring everything in between.

Variance and Standard Deviation

Standard deviation measures the typical distance of values from the mean (the square root of the average squared deviation):

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$

points = np.array([52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35])

# Manual calculation
mean = np.mean(points)
squared_diffs = (points - mean) ** 2
variance = np.mean(squared_diffs)
std_dev = np.sqrt(variance)

print(f"Mean: {mean:.1f}")
print(f"Variance: {variance:.1f}")
print(f"Standard Deviation: {std_dev:.1f}")

# Using NumPy (same result)
print(f"\nNumPy std: {np.std(points):.1f}")

Interpreting Standard Deviation

For roughly normal distributions:

  • ~68% of values fall within 1 standard deviation of the mean
  • ~95% fall within 2 standard deviations
  • ~99.7% fall within 3 standard deviations

points = np.array([52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35])
mean = np.mean(points)
std = np.std(points)

print(f"Mean: {mean:.1f}, Std Dev: {std:.1f}")
print(f"One std range: [{mean - std:.1f}, {mean + std:.1f}]")
print(f"Two std range: [{mean - 2*std:.1f}, {mean + 2*std:.1f}]")

# Count how many fall within ranges
within_one = np.sum((points >= mean - std) & (points <= mean + std))
within_two = np.sum((points >= mean - 2*std) & (points <= mean + 2*std))

print(f"Within 1 std: {within_one}/12 ({within_one/12:.0%})")
print(f"Within 2 std: {within_two}/12 ({within_two/12:.0%})")

Coefficient of Variation

When comparing variability across different scales, use coefficient of variation (CV):

$$CV = \frac{\sigma}{\bar{x}} \times 100\%$$

def coefficient_of_variation(data):
    """Calculate CV as percentage."""
    return (np.std(data) / np.mean(data)) * 100

# Compare rushing vs passing variability
rushing_yards = [150, 120, 180, 95, 165, 140]
passing_yards = [280, 310, 250, 340, 290, 275]

cv_rush = coefficient_of_variation(rushing_yards)
cv_pass = coefficient_of_variation(passing_yards)

print(f"Rushing CV: {cv_rush:.1f}%")
print(f"Passing CV: {cv_pass:.1f}%")
print(f"\nMore variable: {'Rushing' if cv_rush > cv_pass else 'Passing'}")

Percentiles and Quartiles

Percentiles divide data into 100 equal parts. Quartiles divide it into 4:

  • Q1 (25th percentile): 25% of data below
  • Q2 (50th percentile): Median
  • Q3 (75th percentile): 75% of data below
  • IQR (Interquartile Range): Q3 - Q1

def calculate_quartiles(data: np.ndarray) -> dict:
    """Calculate quartile statistics."""
    return {
        "min": np.min(data),
        "Q1": np.percentile(data, 25),
        "median": np.percentile(data, 50),
        "Q3": np.percentile(data, 75),
        "max": np.max(data),
        "IQR": np.percentile(data, 75) - np.percentile(data, 25)
    }

points = np.array([17, 24, 27, 28, 31, 32, 35, 38, 42, 45, 52, 55])
quartiles = calculate_quartiles(points)

print("Five-Number Summary:")
for key, value in quartiles.items():
    print(f"  {key}: {value:.1f}")

Practical Example: Consistency Analysis

def analyze_consistency(player_stats: pd.DataFrame, stat_col: str,
                       player_col: str = "player") -> pd.DataFrame:
    """
    Analyze player consistency in a given statistic.

    Parameters
    ----------
    player_stats : pd.DataFrame
        Game-by-game player statistics
    stat_col : str
        Column to analyze
    player_col : str
        Column containing player names

    Returns
    -------
    pd.DataFrame
        Consistency metrics for each player
    """
    consistency = player_stats.groupby(player_col)[stat_col].agg([
        ("mean", "mean"),
        ("std", "std"),
        ("min", "min"),
        ("max", "max"),
        ("range", lambda x: x.max() - x.min()),
        ("cv", lambda x: (x.std() / x.mean() * 100) if x.mean() != 0 else 0)
    ]).round(1)

    return consistency.sort_values("cv")  # Most consistent first

# Sample quarterback data
qb_games = pd.DataFrame({
    "player": ["QB1"]*6 + ["QB2"]*6,
    "passing_yards": [280, 290, 275, 285, 295, 270,  # QB1: consistent
                      180, 350, 220, 320, 190, 340]   # QB2: inconsistent
})

consistency = analyze_consistency(qb_games, "passing_yards")
print("Quarterback Consistency Analysis:")
print(consistency)

Part 3: Understanding Distributions

A distribution shows how values are spread across the range of possible outcomes. Understanding distributions helps you set realistic expectations and identify unusual performances.

Visualizing Distributions

While this chapter focuses on calculations, distribution shape is often best understood visually. Key characteristics:

  • Symmetric: Mean ≈ Median (normal distribution)
  • Right-skewed: Mean > Median (long tail to the right)
  • Left-skewed: Mean < Median (long tail to the left)
  • Bimodal: Two peaks (two distinct groups)
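A histogram is the quickest visual check for these shapes. A minimal matplotlib sketch using synthetic, right-skewed rushing gains (the data is generated for illustration, similar in spirit to the examples below):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; use plt.show() interactively
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic right-skewed gains: mostly short runs, a few explosive plays
yards = np.concatenate([rng.normal(4, 2, 180), rng.uniform(15, 60, 20)])

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(yards, bins=30, edgecolor="black")
ax.axvline(np.mean(yards), linestyle="--", label=f"Mean = {np.mean(yards):.1f}")
ax.axvline(np.median(yards), linestyle=":", label=f"Median = {np.median(yards):.1f}")
ax.set_xlabel("Yards gained")
ax.set_ylabel("Number of plays")
ax.legend()
fig.savefig("rushing_distribution.png")
```

Because the distribution is right-skewed, the mean line sits to the right of the median line, which is exactly the Mean > Median signature listed above.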

Skewness

Skewness measures distribution asymmetry:

from scipy import stats

# Rushing yards (right-skewed: many short gains, few long runs)
rushing_yards = [3, 4, 2, 5, 3, 4, 8, 3, 4, 2, 1, 3, 4, 5, 45, 3, 4, 2]

skewness = stats.skew(rushing_yards)
print(f"Skewness: {skewness:.2f}")

# Interpretation
if skewness > 0.5:
    print("Right-skewed (positive): Long tail to the right")
elif skewness < -0.5:
    print("Left-skewed (negative): Long tail to the left")
else:
    print("Approximately symmetric")

Football examples:

  • Right-skewed: Individual play yards, player salaries, recruiting rankings
  • Left-skewed: Completion percentage (ceiling at 100%), time of possession
  • Symmetric: Point differentials, standardized metrics (z-scores)

Kurtosis

Kurtosis measures "tailedness"—how extreme the outliers are:

# Compare two rushing backs
rb1_yards = np.random.normal(4.5, 2, 100)  # Consistent
rb2_yards = np.concatenate([
    np.random.normal(3, 1, 90),  # Mostly short gains
    np.random.uniform(20, 50, 10)  # With some explosive plays
])

kurtosis_rb1 = stats.kurtosis(rb1_yards)
kurtosis_rb2 = stats.kurtosis(rb2_yards)

print(f"RB1 kurtosis: {kurtosis_rb1:.2f} (consistent)")
print(f"RB2 kurtosis: {kurtosis_rb2:.2f} (boom-or-bust)")

  • High kurtosis (>0): More extreme values than normal
  • Low kurtosis (<0): Fewer extreme values than normal

Identifying Outliers

Outliers are extreme values that may indicate special circumstances or data errors.

Method 1: IQR Method

def find_outliers_iqr(data: np.ndarray, multiplier: float = 1.5) -> np.ndarray:
    """
    Find outliers using the IQR method.

    Parameters
    ----------
    data : np.ndarray
        Data to analyze
    multiplier : float
        IQR multiplier (1.5 for outliers, 3.0 for extreme outliers)

    Returns
    -------
    np.ndarray
        Boolean array marking outliers
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1

    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr

    return (data < lower_bound) | (data > upper_bound)

# Example: Find outlier games
points = np.array([35, 42, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 82])
outliers = find_outliers_iqr(points)

print(f"Data: {points}")
print(f"Outliers: {points[outliers]}")

Method 2: Z-Score Method

def find_outliers_zscore(data: np.ndarray, threshold: float = 2.5) -> np.ndarray:
    """
    Find outliers using z-scores.

    Parameters
    ----------
    data : np.ndarray
        Data to analyze
    threshold : float
        Z-score threshold (typically 2.5 or 3.0)

    Returns
    -------
    np.ndarray
        Boolean array marking outliers
    """
    z_scores = np.abs(stats.zscore(data))
    return z_scores > threshold

# Example
points = np.array([35, 42, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 82])
outliers = find_outliers_zscore(points)

print(f"Outliers (z-score): {points[outliers]}")

Practical Example: Scoring Distribution Analysis

def analyze_scoring_distribution(points: np.ndarray, team_name: str) -> dict:
    """
    Comprehensive scoring distribution analysis.

    Parameters
    ----------
    points : np.ndarray
        Points scored in each game
    team_name : str
        Team name

    Returns
    -------
    dict
        Distribution analysis results
    """
    analysis = {
        "team": team_name,
        "games": len(points),
        "central_tendency": {
            "mean": np.mean(points),
            "median": np.median(points),
            "mode": float(stats.mode(points, keepdims=True)[0][0])
        },
        "spread": {
            "std": np.std(points),
            "variance": np.var(points),
            "range": np.ptp(points),
            "iqr": np.percentile(points, 75) - np.percentile(points, 25)
        },
        "shape": {
            "skewness": stats.skew(points),
            "kurtosis": stats.kurtosis(points)
        },
        "percentiles": {
            "10th": np.percentile(points, 10),
            "25th": np.percentile(points, 25),
            "50th": np.percentile(points, 50),
            "75th": np.percentile(points, 75),
            "90th": np.percentile(points, 90)
        }
    }

    # Interpret skewness
    skew = analysis["shape"]["skewness"]
    if skew > 0.5:
        analysis["shape"]["interpretation"] = "Right-skewed: More low-scoring games"
    elif skew < -0.5:
        analysis["shape"]["interpretation"] = "Left-skewed: More high-scoring games"
    else:
        analysis["shape"]["interpretation"] = "Approximately symmetric"

    return analysis

# Analyze a team's scoring
georgia_points = np.array([49, 33, 56, 26, 41, 38, 42, 16, 27, 30, 65, 35])
analysis = analyze_scoring_distribution(georgia_points, "Georgia")

print(f"SCORING DISTRIBUTION: {analysis['team']}")
print("-" * 50)
print(f"Mean: {analysis['central_tendency']['mean']:.1f}")
print(f"Median: {analysis['central_tendency']['median']:.1f}")
print(f"Std Dev: {analysis['spread']['std']:.1f}")
print(f"Skewness: {analysis['shape']['skewness']:.2f}")
print(f"Interpretation: {analysis['shape']['interpretation']}")

Part 4: Relationships Between Variables

Understanding how variables relate to each other is crucial for football analysis. Does a strong rushing attack lead to more wins? Does time of possession correlate with scoring?

Covariance

Covariance measures the direction of the relationship between two variables:

$$Cov(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$$

# Rushing yards and points scored
rushing = np.array([150, 120, 180, 95, 165, 140, 200, 110, 175, 130])
points = np.array([35, 24, 42, 17, 38, 31, 45, 21, 40, 28])

covariance = np.cov(rushing, points)[0, 1]
print(f"Covariance: {covariance:.1f}")

Interpretation:

  • Positive covariance: Variables tend to move together
  • Negative covariance: Variables tend to move in opposite directions
  • Near zero: Little linear relationship

Problem: Covariance magnitude depends on variable scales, making comparison difficult.
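The scale dependence is easy to see: converting one variable to different units rescales the covariance but leaves the correlation untouched. A quick sketch reusing the rushing/points data above:

```python
import numpy as np

rushing = np.array([150, 120, 180, 95, 165, 140, 200, 110, 175, 130])
points = np.array([35, 24, 42, 17, 38, 31, 45, 21, 40, 28])

# Same relationship, expressed in feet instead of yards (x3)
rushing_feet = rushing * 3

print(f"Cov (yards): {np.cov(rushing, points)[0, 1]:.1f}")
print(f"Cov (feet):  {np.cov(rushing_feet, points)[0, 1]:.1f}")   # triples
print(f"r (yards):   {np.corrcoef(rushing, points)[0, 1]:.3f}")
print(f"r (feet):    {np.corrcoef(rushing_feet, points)[0, 1]:.3f}")  # unchanged
```

Nothing about the underlying relationship changed, yet the covariance tripled. This is why the standardized version, correlation, is the measure you can actually compare across metrics.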

Correlation

Correlation standardizes covariance to a -1 to +1 scale:

$$r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

def calculate_correlation(x: np.ndarray, y: np.ndarray) -> tuple:
    """
    Calculate Pearson correlation with interpretation.

    Parameters
    ----------
    x, y : np.ndarray
        Variables to correlate

    Returns
    -------
    tuple
        (correlation coefficient, interpretation)
    """
    r = np.corrcoef(x, y)[0, 1]

    if abs(r) >= 0.8:
        strength = "Very strong"
    elif abs(r) >= 0.6:
        strength = "Strong"
    elif abs(r) >= 0.4:
        strength = "Moderate"
    elif abs(r) >= 0.2:
        strength = "Weak"
    else:
        strength = "Very weak or no"

    direction = "positive" if r > 0 else "negative"

    return r, f"{strength} {direction} correlation"

# Examples
rushing = np.array([150, 120, 180, 95, 165, 140, 200, 110, 175, 130])
points = np.array([35, 24, 42, 17, 38, 31, 45, 21, 40, 28])

r, interpretation = calculate_correlation(rushing, points)
print(f"Rushing vs Points: r = {r:.3f} ({interpretation})")

Correlation Matrix

When analyzing multiple variables, create a correlation matrix:

def create_correlation_matrix(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """
    Create correlation matrix for specified columns.

    Parameters
    ----------
    df : pd.DataFrame
        Data
    columns : list
        Columns to include

    Returns
    -------
    pd.DataFrame
        Correlation matrix
    """
    return df[columns].corr().round(3)

# Team game statistics
team_stats = pd.DataFrame({
    "points": [35, 24, 42, 17, 38, 31, 45, 21, 40, 28],
    "rush_yards": [150, 120, 180, 95, 165, 140, 200, 110, 175, 130],
    "pass_yards": [280, 250, 310, 220, 275, 260, 290, 240, 300, 265],
    "turnovers": [0, 2, 0, 3, 1, 1, 0, 2, 0, 1],
    "time_of_possession": [32, 28, 35, 25, 33, 30, 36, 27, 34, 29]
})

corr_matrix = create_correlation_matrix(
    team_stats,
    ["points", "rush_yards", "pass_yards", "turnovers", "time_of_possession"]
)
print("Correlation Matrix:")
print(corr_matrix)

Correlation Pitfalls

1. Correlation ≠ Causation

# Ice cream sales and drowning deaths are correlated
# Both increase in summer, but ice cream doesn't cause drowning
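A football-flavored version of the same trap can be simulated. Suppose a big lead causes both extra rushing attempts (clock-killing) and winning; then rushing attempts will correlate with wins without causing them. The simulation below is entirely hypothetical, with made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Confounder: point margin entering the fourth quarter
lead = rng.normal(0, 10, n)

# Teams with the lead run more to burn clock...
rush_attempts = 30 + 0.8 * lead + rng.normal(0, 4, n)
# ...and teams with the lead usually win
win = (lead + rng.normal(0, 6, n) > 0).astype(float)

r = np.corrcoef(rush_attempts, win)[0, 1]
print(f"Rush attempts vs winning: r = {r:.3f}")
# Strong positive correlation, yet calling more runs doesn't cause the win;
# the lead drives both variables.
```

This is the classic "establish the run" fallacy: rushing volume correlates with winning largely because winning teams run out the clock.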

2. Non-linear Relationships

Correlation measures linear relationships. A U-shaped relationship shows low correlation:

# Time of possession example
# Too little AND too much possession both correlate with losing
# (not enough offense, or clock-killing when behind)
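A symmetric U-shape makes the point starkly: in the sketch below, y is completely determined by x, yet the Pearson correlation is essentially zero.

```python
import numpy as np

# Perfectly U-shaped relationship: y is fully determined by x
x = np.linspace(-3, 3, 61)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r: {r:.3f}")  # ~0.000 despite a perfect (non-linear) relationship
```

Always plot the data before trusting a correlation coefficient: r near zero means "no linear relationship", not "no relationship".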

3. Outliers Can Dominate

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # Outlier
y = np.array([2, 4, 5, 4, 5, 6, 7, 6, 8, 50])

r_with_outlier = np.corrcoef(x, y)[0, 1]
r_without = np.corrcoef(x[:-1], y[:-1])[0, 1]

print(f"With outlier: r = {r_with_outlier:.3f}")
print(f"Without: r = {r_without:.3f}")

Practical Example: What Drives Winning?

def analyze_winning_factors(games_df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze which statistics correlate most with winning.

    Parameters
    ----------
    games_df : pd.DataFrame
        Game statistics with win/loss outcome

    Returns
    -------
    pd.DataFrame
        Correlations with winning, sorted by strength
    """
    # Create win indicator (1 = win, 0 = loss)
    games_df = games_df.copy()
    games_df["win"] = (games_df["points_for"] > games_df["points_against"]).astype(int)

    # Stats to analyze
    stat_columns = ["rush_yards", "pass_yards", "total_yards",
                    "turnovers", "time_of_possession", "third_down_pct"]

    correlations = []
    for col in stat_columns:
        if col in games_df.columns:
            r = games_df["win"].corr(games_df[col])
            correlations.append({"Statistic": col, "Correlation": r})

    result = pd.DataFrame(correlations)
    result["Abs_Correlation"] = result["Correlation"].abs()
    result = result.sort_values("Abs_Correlation", ascending=False)

    return result[["Statistic", "Correlation"]]

# Sample data
games = pd.DataFrame({
    "points_for": [35, 24, 42, 17, 38, 31, 45, 21, 40, 28],
    "points_against": [21, 28, 14, 31, 24, 28, 17, 35, 21, 31],
    "rush_yards": [150, 120, 180, 95, 165, 140, 200, 110, 175, 130],
    "pass_yards": [280, 250, 310, 220, 275, 260, 290, 240, 300, 265],
    "total_yards": [430, 370, 490, 315, 440, 400, 490, 350, 475, 395],
    "turnovers": [0, 2, 0, 3, 1, 1, 0, 2, 0, 1],
    "time_of_possession": [32, 28, 35, 25, 33, 30, 36, 27, 34, 29],
    "third_down_pct": [45, 35, 55, 25, 48, 40, 52, 32, 50, 38]
})

winning_factors = analyze_winning_factors(games)
print("Factors Correlated with Winning:")
print(winning_factors.to_string(index=False))

Part 5: Building Statistical Profiles

Combining these concepts, we can build comprehensive statistical profiles for teams and players.

Team Profile

def create_team_profile(team_games: pd.DataFrame, team_name: str) -> dict:
    """
    Create comprehensive statistical profile for a team.

    Parameters
    ----------
    team_games : pd.DataFrame
        All games for the team
    team_name : str
        Team name

    Returns
    -------
    dict
        Complete statistical profile
    """
    profile = {
        "team": team_name,
        "games": len(team_games),
        "record": {
            "wins": (team_games["points_for"] > team_games["points_against"]).sum(),
            "losses": (team_games["points_for"] < team_games["points_against"]).sum()
        }
    }

    # Scoring profile
    profile["scoring"] = {
        "offense": {
            "mean": team_games["points_for"].mean(),
            "median": team_games["points_for"].median(),
            "std": team_games["points_for"].std(),
            "min": team_games["points_for"].min(),
            "max": team_games["points_for"].max()
        },
        "defense": {
            "mean": team_games["points_against"].mean(),
            "median": team_games["points_against"].median(),
            "std": team_games["points_against"].std(),
            "min": team_games["points_against"].min(),
            "max": team_games["points_against"].max()
        }
    }

    # Yardage profile (if available)
    if "rush_yards" in team_games.columns:
        profile["rushing"] = {
            "mean": team_games["rush_yards"].mean(),
            "std": team_games["rush_yards"].std(),
            "consistency": 100 - (team_games["rush_yards"].std() /
                                  team_games["rush_yards"].mean() * 100)
        }

    if "pass_yards" in team_games.columns:
        profile["passing"] = {
            "mean": team_games["pass_yards"].mean(),
            "std": team_games["pass_yards"].std(),
            "consistency": 100 - (team_games["pass_yards"].std() /
                                  team_games["pass_yards"].mean() * 100)
        }

    # Turnover profile (if available)
    if "turnovers" in team_games.columns:
        profile["turnovers"] = {
            "mean": team_games["turnovers"].mean(),
            "total": team_games["turnovers"].sum(),
            "games_with_zero": (team_games["turnovers"] == 0).sum()
        }

    return profile


# Create profile
team_data = pd.DataFrame({
    "points_for": [35, 24, 42, 17, 38, 31, 45, 21, 40, 28, 33, 41],
    "points_against": [21, 28, 14, 31, 24, 28, 17, 35, 21, 31, 20, 24],
    "rush_yards": [150, 120, 180, 95, 165, 140, 200, 110, 175, 130, 155, 170],
    "pass_yards": [280, 250, 310, 220, 275, 260, 290, 240, 300, 265, 285, 295],
    "turnovers": [0, 2, 0, 3, 1, 1, 0, 2, 0, 1, 0, 1]
})

profile = create_team_profile(team_data, "Example Team")

print(f"TEAM PROFILE: {profile['team']}")
print(f"Record: {profile['record']['wins']}-{profile['record']['losses']}")
print(f"\nOffense: {profile['scoring']['offense']['mean']:.1f} PPG (±{profile['scoring']['offense']['std']:.1f})")
print(f"Defense: {profile['scoring']['defense']['mean']:.1f} PPG allowed (±{profile['scoring']['defense']['std']:.1f})")
print(f"\nRushing: {profile['rushing']['mean']:.1f} YPG ({profile['rushing']['consistency']:.0f}% consistency)")
print(f"Passing: {profile['passing']['mean']:.1f} YPG ({profile['passing']['consistency']:.0f}% consistency)")
print(f"\nTurnovers: {profile['turnovers']['mean']:.1f} per game ({profile['turnovers']['games_with_zero']} clean games)")

Comparing Teams

def compare_team_profiles(profile1: dict, profile2: dict) -> pd.DataFrame:
    """
    Compare two team profiles side-by-side.

    Parameters
    ----------
    profile1, profile2 : dict
        Team profiles to compare

    Returns
    -------
    pd.DataFrame
        Comparison table
    """
    comparison = []

    # Record
    comparison.append({
        "Category": "Record",
        profile1["team"]: f"{profile1['record']['wins']}-{profile1['record']['losses']}",
        profile2["team"]: f"{profile2['record']['wins']}-{profile2['record']['losses']}"
    })

    # Scoring
    comparison.append({
        "Category": "Points Per Game",
        profile1["team"]: f"{profile1['scoring']['offense']['mean']:.1f}",
        profile2["team"]: f"{profile2['scoring']['offense']['mean']:.1f}"
    })

    comparison.append({
        "Category": "Points Allowed",
        profile1["team"]: f"{profile1['scoring']['defense']['mean']:.1f}",
        profile2["team"]: f"{profile2['scoring']['defense']['mean']:.1f}"
    })

    # Consistency
    comparison.append({
        "Category": "Scoring Consistency",
        profile1["team"]: f"±{profile1['scoring']['offense']['std']:.1f}",
        profile2["team"]: f"±{profile2['scoring']['offense']['std']:.1f}"
    })

    return pd.DataFrame(comparison)

Part 6: Z-Scores and Standardization

When comparing values across different scales, standardization is essential. A z-score tells you how many standard deviations a value is from the mean.

$$z = \frac{x - \mu}{\sigma}$$

Calculating Z-Scores

def calculate_zscore(value: float, data: np.ndarray) -> float:
    """Calculate z-score for a value relative to a dataset."""
    return (value - np.mean(data)) / np.std(data)

# Example: How does 45 points compare to season average?
season_points = np.array([35, 24, 42, 17, 38, 31, 45, 21, 40, 28, 33, 41])
z = calculate_zscore(45, season_points)

print(f"45 points: z = {z:.2f}")
print(f"Interpretation: {z:.2f} standard deviations above average")

Comparing Across Positions

Z-scores allow fair comparison across different statistics:

def standardize_stats(df: pd.DataFrame, stat_columns: list) -> pd.DataFrame:
    """
    Standardize multiple statistics to z-scores.

    Parameters
    ----------
    df : pd.DataFrame
        Player or team statistics
    stat_columns : list
        Columns to standardize

    Returns
    -------
    pd.DataFrame
        DataFrame with z-scored columns
    """
    result = df.copy()

    for col in stat_columns:
        z_col = f"{col}_z"
        result[z_col] = (result[col] - result[col].mean()) / result[col].std()

    return result

# Compare players with different stats
players = pd.DataFrame({
    "player": ["QB1", "RB1", "WR1", "TE1"],
    "passing_yards": [3500, 0, 0, 0],
    "rushing_yards": [200, 1200, 50, 30],
    "receiving_yards": [0, 300, 1100, 600]
})

# For meaningful comparison, standardize within position groups
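
The comment above can be made concrete with a groupby: compute each player's z-score relative to his own position group rather than the whole league. A sketch with hypothetical yardage totals (player names and numbers are illustrative):

```python
import pandas as pd

players = pd.DataFrame({
    "player": ["RB1", "RB2", "RB3", "WR1", "WR2", "WR3"],
    "position": ["RB", "RB", "RB", "WR", "WR", "WR"],
    "yards": [1200, 950, 800, 1100, 900, 1250],
})

# Z-score within each position group (pandas .std() uses the sample formula)
players["yards_z"] = players.groupby("position")["yards"].transform(
    lambda s: (s - s.mean()) / s.std()
)

print(players)
```

After standardizing within groups, RB1 and WR3 can be compared fairly: each has the highest z-score in his own position group, even though their raw yardage totals measure different things.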

Composite Scores

Combine multiple z-scores into a single metric:

def calculate_composite_score(df: pd.DataFrame, metrics: dict) -> pd.Series:
    """
    Calculate weighted composite z-score.

    Parameters
    ----------
    df : pd.DataFrame
        Standardized data with z-score columns
    metrics : dict
        {column: weight} for each metric

    Returns
    -------
    pd.Series
        Composite scores
    """
    composite = pd.Series(0, index=df.index)

    for col, weight in metrics.items():
        composite += df[col] * weight

    # Normalize by total absolute weight (weights may be negative for
    # stats where lower is better, such as turnovers)
    total_weight = sum(abs(w) for w in metrics.values())
    return composite / total_weight

# Example: Team efficiency composite
team_stats = pd.DataFrame({
    "team": ["A", "B", "C", "D", "E"],
    "ppg_z": [1.5, 0.8, -0.2, -0.8, -1.3],
    "ypg_z": [1.2, 1.0, 0.1, -0.5, -1.8],
    "turnover_z": [-1.0, -0.5, 0.2, 0.5, 0.8]  # Negative is good for turnovers
})

# Weight: PPG 40%, YPG 30%, Turnovers 30% (inverted)
team_stats["efficiency_composite"] = calculate_composite_score(
    team_stats,
    {"ppg_z": 0.4, "ypg_z": 0.3, "turnover_z": -0.3}  # Negative weight for turnovers
)

print(team_stats.sort_values("efficiency_composite", ascending=False))

Summary

Descriptive statistics form the foundation of football analytics:

Central Tendency:

  • Mean: Average value, sensitive to outliers
  • Median: Middle value, resistant to outliers
  • Mode: Most common value

Variability:

  • Standard deviation: Typical distance from the mean
  • Variance: Squared standard deviation
  • Range/IQR: Spread of values
  • Coefficient of variation: Relative variability

Distribution Shape:

  • Skewness: Direction of asymmetry
  • Kurtosis: Tail heaviness
  • Outliers: Extreme values

Relationships:

  • Correlation: Strength and direction of linear relationship
  • Correlation matrix: Multiple relationships at once

Standardization:

  • Z-scores: Compare across different scales
  • Composite scores: Combine multiple metrics

In the next chapter, we'll apply these concepts to data cleaning and preparation—ensuring your data is accurate before analysis.


Key Terms

| Term | Definition |
|------|------------|
| Mean | Sum of values divided by count |
| Median | Middle value when sorted |
| Mode | Most frequent value |
| Standard Deviation | Typical distance of values from the mean |
| Variance | Squared standard deviation |
| Percentile | Value below which a given percentage of data falls |
| IQR | Interquartile range (Q3 - Q1) |
| Skewness | Measure of distribution asymmetry |
| Kurtosis | Measure of distribution tail weight |
| Correlation | Standardized measure of linear relationship (-1 to +1) |
| Z-score | Number of standard deviations from the mean |
| Coefficient of Variation | Standard deviation as a percentage of the mean |

References

  1. Agresti, A., & Franklin, C. (2018). Statistics: The Art and Science of Learning from Data. Pearson.

  2. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer.

  3. McKinney, W. (2022). Python for Data Analysis, 3rd Edition. O'Reilly Media.

  4. Winston, W. (2012). Mathletics: How Gamblers, Managers, and Sports Enthusiasts Use Mathematics in Baseball, Basketball, and Football. Princeton University Press.