Key Takeaways: Descriptive Statistics in Football

DataField.Dev

Core Concepts at a Glance

Central Tendency

Measure	Formula	Best For	Limitation
Mean	Sum / Count	Symmetric data, totals	Sensitive to outliers
Median	Middle value	Skewed data, typical performance	Ignores the size of extreme values
Mode	Most frequent	Categorical data, common plays	May not exist or may not be unique

Football Rule of Thumb: Use the median to describe a typical play (yards gained on a single carry or target), and the mean for totals and for rates calculated from aggregates.
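To illustrate the rule of thumb above, here is a minimal sketch using invented carry data for one hypothetical running back. One long run is enough to pull the mean well above what a typical carry looks like.

```python
import pandas as pd

# Invented yards gained on ten carries (not real data); one 62-yard breakaway.
carries = pd.Series([2, 3, 1, 4, 3, 2, 0, 5, 3, 62])

mean_ypc = carries.mean()      # pulled upward by the single long run
median_ypc = carries.median()  # the middle carry, unaffected by the outlier

print(f"mean: {mean_ypc:.1f}, median: {median_ypc:.1f}")  # mean: 8.5, median: 3.0
```

The mean here is still the right number for "yards per carry" as a season rate; the median answers a different question, namely what a typical attempt produced.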

Variability Measures

Measure	Interpretation	Use Case
Standard Deviation	Typical distance from the mean	Comparing consistency within same metric
Variance	SD squared	Statistical calculations
Range	Max - Min	Quick spread overview
IQR	Q3 - Q1	Robust spread measure, outlier detection
CV	SD / Mean × 100%	Comparing variability across different scales

Consistency Insight: Lower SD = more consistent performance. A QB with 250 yards/game (SD=30) is more reliable than one with 250 yards/game (SD=75).
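The consistency comparison can be sketched as follows. The game logs are invented for two hypothetical quarterbacks, and the helper name `consistency` is an assumption made for this example.

```python
import pandas as pd

# Invented passing-yard game logs; both average 250 yards per game.
qb_steady = pd.Series([220, 250, 280, 240, 260])
qb_streaky = pd.Series([150, 350, 250, 175, 325])

def consistency(yards: pd.Series) -> dict:
    """Mean, sample SD, and coefficient of variation (in percent)."""
    mean = yards.mean()
    sd = yards.std()  # pandas default is the sample SD (ddof=1)
    return {"mean": mean, "sd": sd, "cv_pct": sd / mean * 100}

steady = consistency(qb_steady)
streaky = consistency(qb_streaky)
print(steady)
print(streaky)
```

Because the means match, SD alone ranks the two players. CV becomes the more useful column when the means differ, for example when comparing a passer's yardage spread with a rusher's.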

Distribution Shapes

Right-Skewed (Positive)     Normal              Left-Skewed (Negative)
        ▲                    ▲                        ▲
       ██                   ███                      ██
      ███                  █████                    ███
     ████                 ███████                  ████
    █████ → tail         █████████               ████████
   Mode < Median < Mean  Mode = Median = Mean   Mean < Median < Mode

   Example: Rushing yards  Example: Pass %      Example: Veteran experience
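A quick way to check which shape a stat follows is to compare the mean with the median and look at the sign of the sample skewness. The sketch below uses invented rushing gains; `shape_summary` is a hypothetical helper, not part of any library.

```python
import pandas as pd

def shape_summary(values: pd.Series) -> dict:
    """Mean, median, and sample skewness for a numeric series."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),  # > 0 suggests a right tail, < 0 a left tail
    }

# Invented gains per carry with one long run (right-skewed).
rush_gains = pd.Series([0, 1, 2, 2, 3, 3, 3, 4, 5, 62])

right = shape_summary(rush_gains)
left = shape_summary(-rush_gains)  # mirroring the data flips the skew

print(right)
print(left)
```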

Correlation Quick Reference

|r| Value	Strength	Football Example
0.90 to 1.00	Very Strong	EPA and Win %
0.70 to 0.89	Strong	Points Scored and Wins
0.40 to 0.69	Moderate	Rushing Yards and Wins
0.20 to 0.39	Weak	Time of Possession and Wins
0.00 to 0.19	Very Weak/None	Jersey Numbers and Performance

Critical Reminder: Correlation ≠ Causation. Third variables, reverse causation, and coincidence can create misleading correlations.
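The third-variable problem can be shown with a small simulation. Everything below is synthetic: a hidden "team quality" factor is assumed to drive both rushing attempts and wins, and the two variables have no direct link to each other. The variable names and the random seed are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated, not real: hidden team quality drives both variables.
quality = rng.normal(size=n)
rush_attempts = quality + rng.normal(size=n)
wins = quality + rng.normal(size=n)

# Raw correlation looks meaningful even though neither causes the other.
r_raw = np.corrcoef(rush_attempts, wins)[0, 1]

def residuals(y, x):
    """Part of y not explained by a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after removing the shared dependence on quality.
r_partial = np.corrcoef(
    residuals(rush_attempts, quality), residuals(wins, quality)
)[0, 1]

print(f"raw r = {r_raw:.2f}, r after removing quality = {r_partial:.2f}")
```

In real data the confounder is rarely observed this cleanly, so this is a demonstration of the mechanism rather than a recipe for removing it.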

Z-Score Interpretation

z = (Value - Mean) / Standard Deviation

z = +2.0  →  Top ~2.3% (excellent)
z = +1.0  →  Top ~16% (above average)
z = 0.0   →  Average (50th percentile)
z = -1.0  →  Bottom ~16% (below average)
z = -2.0  →  Bottom ~2.3% (poor)

Application: Compare players across different statistical categories by converting to z-scores. The percentiles above assume the metric is approximately normally distributed.
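As a sketch of that application, the example below standardizes two metrics with different units and averages the z-scores into one composite. The stat lines are invented, and the equal-weight composite is an assumption made for illustration, not a recommended rating.

```python
import pandas as pd

# Invented season stat lines for four hypothetical quarterbacks.
qbs = pd.DataFrame(
    {
        "comp_pct": [58.0, 62.0, 66.0, 70.0],
        "yards_per_att": [6.5, 7.0, 8.5, 8.0],
    },
    index=["QB A", "QB B", "QB C", "QB D"],
)

# z = (value - mean) / SD, applied column by column (sample SD, ddof=1).
z = (qbs - qbs.mean()) / qbs.std()

# Equal-weight composite across both metrics (an illustrative choice).
z["composite"] = z[["comp_pct", "yards_per_att"]].mean(axis=1)

print(z.round(2))
print("Best composite:", z["composite"].idxmax())
```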

Essential Code Patterns

Calculate All Central Tendency Measures

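import pandas as pd
from scipy import stats

def central_tendency_summary(data: pd.Series) -> dict:
    """Complete central tendency analysis."""
    modes = data.mode()  # may hold several values when there are ties
    return {
        'mean': data.mean(),
        'median': data.median(),
        'mode': modes.iloc[0] if not modes.empty else None,
        # 10% trimmed mean: drops the lowest and highest 10% of values
        'trimmed_mean': stats.trim_mean(data.dropna(), 0.1)
    }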

Robust Variability Analysis

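import pandas as pd

def variability_summary(data: pd.Series) -> dict:
    """Complete variability analysis."""
    q1, q3 = data.quantile([0.25, 0.75])
    mean, std = data.mean(), data.std()  # std is the sample SD (ddof=1)
    return {
        'std': std,
        'variance': data.var(),
        'range': data.max() - data.min(),
        'iqr': q3 - q1,
        # CV is undefined when the mean is zero (e.g. turnover margin)
        'cv': (std / mean) * 100 if mean != 0 else float('nan')
    }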

Outlier Detection

def find_outliers_iqr(data: pd.Series) -> pd.Series:
    """Identify outliers using IQR method."""
    q1, q3 = data.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return data[(data < lower) | (data > upper)]

Correlation Matrix with Interpretation

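import pandas as pd

def correlation_analysis(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Calculate correlation matrix and label each coefficient's strength."""
    corr_matrix = df[cols].corr()

    # Label strength using the bands from the Correlation Quick Reference
    def interpret(r):
        if abs(r) >= 0.9: return "Very Strong"
        if abs(r) >= 0.7: return "Strong"
        if abs(r) >= 0.4: return "Moderate"
        if abs(r) >= 0.2: return "Weak"
        return "Very Weak"

    # Column-wise Series.map works across pandas versions
    # (DataFrame.applymap is deprecated since pandas 2.1)
    return corr_matrix.apply(lambda col: col.map(interpret))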

Decision Framework

Choosing the Right Statistic

START
  │
  ▼
Is your data categorical?
  │
  ├─ YES → Use MODE
  │
  └─ NO → Is there significant skew or outliers?
           │
           ├─ YES → Use MEDIAN and IQR
           │
           └─ NO → Use MEAN and STANDARD DEVIATION
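The flowchart above can be sketched as a small function. The function name `choose_summary`, the skewness cutoff of 1.0, and the use of the 1.5×IQR rule to decide what counts as "outliers" are all assumptions made for this example; the flowchart itself does not fix those thresholds.

```python
import pandas as pd

def choose_summary(values: pd.Series, skew_cutoff: float = 1.0) -> dict:
    """Pick center and spread statistics following the decision flow."""
    # Categorical data: report the mode.
    if not pd.api.types.is_numeric_dtype(values):
        return {"center": "mode", "value": values.mode().iloc[0]}

    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    outside = (clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)

    # Skewed data or outliers: report median and IQR.
    if outside.any() or abs(clean.skew()) > skew_cutoff:
        return {"center": "median", "value": clean.median(),
                "spread": "iqr", "spread_value": iqr}

    # Otherwise: report mean and standard deviation.
    return {"center": "mean", "value": clean.mean(),
            "spread": "std", "spread_value": clean.std()}

print(choose_summary(pd.Series(["run", "pass", "pass"])))
print(choose_summary(pd.Series([2, 3, 1, 4, 3, 2, 0, 5, 3, 62])))
print(choose_summary(pd.Series([1, 2, 3, 4, 5])))
```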

Comparing Players or Teams

Same metric, same context: Direct comparison okay
Same metric, different contexts: Use z-scores or percentiles
Different metrics: Normalize to common scale (z-scores)
Over time: Consider trends, not just snapshots
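For the "same metric, different contexts" case, one possible approach is to compute z-scores within each context rather than across the whole pool. The sketch below uses invented rushing totals and a hypothetical `conference` column as the context.

```python
import pandas as pd

# Invented season rushing totals in two hypothetical conferences.
players = pd.DataFrame(
    {
        "player": ["A", "B", "C", "D", "E", "F"],
        "conference": ["X", "X", "X", "Y", "Y", "Y"],
        "rush_yards": [900, 1100, 1300, 600, 700, 800],
    }
)

# Mean and sample SD of each player's own conference, aligned row by row.
grouped = players.groupby("conference")["rush_yards"]
conf_mean = grouped.transform("mean")
conf_sd = grouped.transform("std")

players["z_in_conf"] = (players["rush_yards"] - conf_mean) / conf_sd
print(players)
```

Player F's 800 yards and player C's 1,300 yards both land one standard deviation above their own conference mean, which the raw totals hide.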

Common Pitfalls to Avoid

Pitfall	Problem	Solution
Averaging averages	Incorrect when sample sizes differ	Weight by attempts/games
Ignoring sample size	Small samples have high variance	Report n alongside stats
Outlier blindness	Mean distorted by extreme values	Always check median too
Correlation = Causation	Misleading conclusions	Look for mechanism, third variables
Comparing raw stats	Different contexts make comparison unfair	Standardize with z-scores
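The "averaging averages" pitfall is easy to reproduce. In this minimal sketch with two invented games, the unweighted mean of per-game yards per carry differs sharply from the rate computed from total yards and total carries.

```python
import pandas as pd

# Invented game logs: a light-workload game and a heavy-workload game.
games = pd.DataFrame({"carries": [5, 25], "yards": [50, 75]})
games["ypc"] = games["yards"] / games["carries"]  # 10.0 and 3.0

# Pitfall: treats a 5-carry game the same as a 25-carry game.
naive_ypc = games["ypc"].mean()

# Fix: weight by attempts, which equals total yards / total carries.
weighted_ypc = games["yards"].sum() / games["carries"].sum()

print(f"average of averages: {naive_ypc:.2f}")   # 6.50
print(f"weighted by carries: {weighted_ypc:.2f}")  # 4.17
```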

Football Analytics Cheat Sheet

Quick Formulas

Mean = Σx / n
Median = Middle value (or average of two middle values)
SD = √(Σ(x - mean)² / (n-1))
IQR = Q3 - Q1
CV = (SD / Mean) × 100%
z-score = (x - mean) / SD
Outlier bounds = [Q1 - 1.5×IQR, Q3 + 1.5×IQR]
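As a hand check of the mean, SD, CV, and z-score formulas above, the sketch below works through five invented values using only the standard library and confirms the n-1 denominator against `statistics.stdev`.

```python
import math
import statistics

yards = [2, 4, 4, 5, 10]  # invented sample of five carries
n = len(yards)

mean = sum(yards) / n                              # 25 / 5 = 5.0
sum_sq = sum((x - mean) ** 2 for x in yards)       # 9 + 1 + 1 + 0 + 25 = 36.0
sd = math.sqrt(sum_sq / (n - 1))                   # sqrt(36 / 4) = 3.0
cv = sd / mean * 100                               # about 60%
z_longest = (max(yards) - mean) / sd               # (10 - 5) / 3, about 1.67

# statistics.stdev also uses the sample (n-1) formula.
matches_stdlib = math.isclose(sd, statistics.stdev(yards))
print(mean, sd, round(cv, 1), round(z_longest, 2), matches_stdlib)
```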

Interpretation Benchmarks

Completion %: Mean ~62%, SD ~8% in FBS
Yards per Carry: Mean ~4.0, SD ~1.5 (right-skewed)
Points per Game: Mean ~28, SD ~10 in FBS
Turnover Margin: Mean ~0, range typically -2 to +2 per game

Chapter Summary

Central tendency describes typical values; choose based on data distribution
Variability measures spread and consistency; essential for player evaluation
Distributions reveal data shape; skewness affects which statistics to use
Correlation quantifies relationships; strength and direction both matter
Z-scores enable cross-metric comparisons; standard deviations are the universal unit
Always consider context: sample size, competition level, era, and position

The best analysts don't just calculate statistics; they interpret them within the context of the game.