Key Takeaways: Descriptive Statistics in Football
Core Concepts at a Glance
Central Tendency
| Measure | Formula | Best For | Limitation |
|---|---|---|---|
| Mean | Sum / Count | Symmetric data, totals | Sensitive to outliers |
| Median | Middle value | Skewed data, typical performance | Ignores the size of extreme values |
| Mode | Most frequent | Categorical data, common plays | May not exist or be unique |
Football Rule of Thumb: Use the median to describe a typical play (yards on a single carry or target); use the mean for totals and for rates calculated from aggregates.
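To see why the rule of thumb matters, the sketch below compares the mean and median of a made-up set of carries in which one breakaway run dominates. The yardage values are invented for illustration and are not real play-by-play data.

```python
import statistics

# Hypothetical yards gained on ten carries (invented values); one breakaway run.
carries = [2, 3, 1, 4, 3, 2, 5, 3, 0, 67]

mean_ypc = statistics.mean(carries)      # total yards / number of carries
median_ypc = statistics.median(carries)  # middle of the sorted carries

print(f"Mean yards per carry:   {mean_ypc:.1f}")    # 9.0
print(f"Median yards per carry: {median_ypc:.1f}")  # 3.0
```

The mean matches what a box score would report, while the median is closer to what this back gains on a typical carry.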
Variability Measures
| Measure | Interpretation | Use Case |
|---|---|---|
| Standard Deviation | Typical distance from mean (root-mean-square deviation) | Comparing consistency within same metric |
| Variance | SD squared | Statistical calculations |
| Range | Max - Min | Quick spread overview |
| IQR | Q3 - Q1 | Robust spread measure, outlier detection |
| CV | SD / Mean × 100% | Comparing variability across different scales |
Consistency Insight: Lower SD = more consistent performance. A QB with 250 yards/game (SD=30) is more reliable than one with 250 yards/game (SD=75).
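A minimal sketch of the consistency comparison, using two invented game logs that share the same mean. The names `steady_qb`, `volatile_qb`, and `summarize` are hypothetical and chosen only for this example.

```python
import statistics

# Hypothetical passing yards per game for two quarterbacks (invented values).
steady_qb = [240, 260, 250, 230, 270, 250]
volatile_qb = [150, 350, 250, 180, 320, 250]

def summarize(yards):
    """Return (mean, sample SD, CV in percent) for a list of game totals."""
    mean = statistics.mean(yards)
    sd = statistics.stdev(yards)  # sample SD, divides by n - 1
    cv = sd / mean * 100
    return mean, sd, cv

for name, yards in [("steady", steady_qb), ("volatile", volatile_qb)]:
    mean, sd, cv = summarize(yards)
    print(f"{name:>8}: mean={mean:.0f}, SD={sd:.1f}, CV={cv:.1f}%")
```

Both quarterbacks average 250 yards, so only the SD (or CV) separates them. The CV becomes the more useful number when the two means differ.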
Distribution Shapes
| Shape | Ordering of Measures | Football Example |
|---|---|---|
| Right-skewed (positive), long tail to the right | Mode < Median < Mean | Rushing yards |
| Normal (symmetric) | Mode = Median = Mean | Pass % |
| Left-skewed (negative), long tail to the left | Mean < Median < Mode | Veteran experience |
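One way to check a distribution's shape numerically is a moment-based skewness coefficient. The sketch below is an illustration under stated assumptions: `sample_skewness` is a hypothetical helper, the data are invented, and the result will differ slightly from `pandas.Series.skew()`, which applies a bias correction.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no bias correction).

    Undefined (division by zero) when every value is identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical rushing gains with a long right tail (invented values).
rush_gains = [0, 1, 2, 2, 3, 3, 3, 4, 5, 12, 35]

skew = sample_skewness(rush_gains)
print(f"skewness = {skew:.2f}")  # positive for a right-skewed sample
print("mean > median:", statistics.mean(rush_gains) > statistics.median(rush_gains))
```

A positive coefficient together with a mean above the median is consistent with right skew.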
Correlation Quick Reference
| Absolute r Value | Strength | Football Example |
|---|---|---|
| 0.90 to 1.00 | Very Strong | EPA and Win % |
| 0.70 to 0.89 | Strong | Points Scored and Wins |
| 0.40 to 0.69 | Moderate | Rushing Yards and Wins |
| 0.20 to 0.39 | Weak | Time of Possession and Wins |
| 0.00 to 0.19 | Very Weak/None | Jersey Numbers and Performance |
Critical Reminder: Correlation ≠ Causation. Third variables, reverse causation, and coincidence can create misleading correlations.
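The third-variable problem can be illustrated with a small simulation. Everything here is synthetic: a hidden `quality` factor drives two otherwise unrelated metrics, and the partial-correlation formula is used only as one possible way to control for it. It is a sketch, not a recommended workflow.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 5000

# Synthetic data: a hidden "team quality" factor drives both metrics.
quality = [rng.gauss(0, 1) for _ in range(n)]
metric_a = [q + rng.gauss(0, 1) for q in quality]  # e.g. late-game rushing volume
metric_b = [q + rng.gauss(0, 1) for q in quality]  # e.g. win margin

r_ab = pearson_r(metric_a, metric_b)
r_aq = pearson_r(metric_a, quality)
r_bq = pearson_r(metric_b, quality)

# Partial correlation of A and B after controlling for quality.
partial = (r_ab - r_aq * r_bq) / math.sqrt((1 - r_aq**2) * (1 - r_bq**2))

print(f"raw r(A, B)            = {r_ab:.2f}")
print(f"r(A, B | team quality) = {partial:.2f}")
```

The raw correlation lands in the moderate band even though neither metric influences the other. Once the shared factor is accounted for, it falls to roughly zero.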
Z-Score Interpretation
z = (Value - Mean) / Standard Deviation
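z = +2.0 → Top ~2.3% (excellent)
z = +1.0 → Top ~16% (above average)
z = 0.0 → Average (50th percentile)
z = -1.0 → Bottom ~16% (below average)
z = -2.0 → Bottom ~2.3% (poor)
These percentiles assume approximately normal data; for skewed metrics, use empirical percentiles instead.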
Application: Compare players across different statistical categories by converting to z-scores.
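A short sketch of a cross-category comparison. The league baselines are assumptions that loosely follow the benchmark figures in this cheat sheet, and the player values are invented. `z_score` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Standardize a value: how many SDs it sits above or below the mean."""
    return (value - mean) / sd

# Illustrative baselines (assumed) and invented player values.
qb_z = z_score(72.0, mean=62.0, sd=8.0)  # completion percentage
rb_z = z_score(5.5, mean=4.0, sd=1.5)    # yards per carry

for label, z in [("QB completion %", qb_z), ("RB yards per carry", rb_z)]:
    # Percentile under a normal model; only approximate for skewed metrics.
    pct = NormalDist().cdf(z) * 100
    print(f"{label}: z = {z:+.2f}, about the {pct:.0f}th percentile")
```

On these assumed baselines the quarterback (z = +1.25) stands out slightly more than the running back (z = +1.00), even though the raw numbers are on different scales.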
Essential Code Patterns
Calculate All Central Tendency Measures
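import pandas as pd
from scipy import stats

def central_tendency_summary(data: pd.Series) -> dict:
    """Complete central tendency analysis."""
    modes = data.mode()  # sorted; may hold several values when there are ties
    return {
        'mean': data.mean(),
        'median': data.median(),
        # Report the smallest mode; None if the series is empty or all-NaN
        'mode': modes.iloc[0] if len(modes) > 0 else None,
        # Mean after trimming the lowest and highest 10% of values
        'trimmed_mean': stats.trim_mean(data.dropna(), 0.1)
    }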
Robust Variability Analysis
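import pandas as pd

def variability_summary(data: pd.Series) -> dict:
    """Complete variability analysis."""
    q1, q3 = data.quantile([0.25, 0.75])
    mean, std = data.mean(), data.std()  # std() is the sample SD (ddof=1)
    return {
        'std': std,
        'variance': data.var(),
        'range': data.max() - data.min(),
        'iqr': q3 - q1,
        # CV is undefined when the mean is 0 (e.g. turnover margin)
        'cv': (std / mean) * 100 if mean != 0 else float('nan')
    }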
Outlier Detection
import pandas as pd

def find_outliers_iqr(data: pd.Series) -> pd.Series:
    """Identify outliers using IQR method."""
    q1, q3 = data.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Fences sit 1.5 × IQR beyond the quartiles
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return data[(data < lower) | (data > upper)]
Correlation Matrix with Interpretation
import pandas as pd

def correlation_analysis(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Calculate correlation matrix with interpretation."""
    corr_matrix = df[cols].corr()

    # Label each coefficient using the Correlation Quick Reference bands
    def interpret(r):
        if abs(r) >= 0.9: return "Very Strong"
        if abs(r) >= 0.7: return "Strong"
        if abs(r) >= 0.4: return "Moderate"
        if abs(r) >= 0.2: return "Weak"
        return "Very Weak"

    # Column-wise map avoids DataFrame.applymap, which newer pandas deprecates
    return corr_matrix.apply(lambda col: col.map(interpret))
Decision Framework
Choosing the Right Statistic
START
  │
  ▼
Is your data categorical?
  │
  ├─ YES → Use MODE
  │
  └─ NO → Is there significant skew or outliers?
            │
            ├─ YES → Use MEDIAN and IQR
            │
            └─ NO → Use MEAN and STANDARD DEVIATION
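The decision tree above can be sketched as a small function. This is one possible encoding, not a prescribed method: `recommend_summary` is a hypothetical name, the `skew_cutoff` of 1.0 is an assumed threshold for "significant" skew that this chapter does not specify, and outliers are flagged with the 1.5 × IQR fences.

```python
import statistics

def recommend_summary(values, skew_cutoff=1.0):
    """Suggest summary statistics following the decision tree.

    skew_cutoff is an assumed threshold for "significant" skew.
    """
    # Categorical data: anything that is not a plain number.
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in values):
        return "mode"

    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0

    # IQR fences for outliers (quartiles by linear interpolation).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    has_outliers = any(x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr for x in values)

    if abs(skew) > skew_cutoff or has_outliers:
        return "median and IQR"
    return "mean and standard deviation"

print(recommend_summary(["run", "pass", "pass", "punt"]))   # mode
print(recommend_summary([2, 3, 1, 4, 3, 2, 5, 3, 0, 67]))   # median and IQR
print(recommend_summary([24, 27, 30, 28, 31, 26, 29, 25]))  # mean and standard deviation
```

In practice the cutoff is a judgment call, and a histogram is usually worth a look before trusting any automated rule.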
Comparing Players or Teams
- Same metric, same context: Direct comparison okay
- Same metric, different contexts: Use z-scores or percentiles
- Different metrics: Normalize to common scale (z-scores)
- Over time: Consider trends, not just snapshots
Common Pitfalls to Avoid
| Pitfall | Problem | Solution |
|---|---|---|
| Averaging averages | Incorrect when sample sizes differ | Weight by attempts/games |
| Ignoring sample size | Small samples have high variance | Report n alongside stats |
| Outlier blindness | Mean distorted by extreme values | Always check median too |
| Correlation = Causation | Misleading conclusions | Look for mechanism, third variables |
| Comparing raw stats | Different contexts make comparison unfair | Standardize with z-scores |
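The "averaging averages" pitfall is easy to demonstrate with arithmetic. The two season lines below are invented, and the field names are hypothetical.

```python
# Hypothetical season lines for two running backs (invented values).
backs = [
    {"name": "RB A", "carries": 10, "yards": 80},    # 8.0 yards per carry
    {"name": "RB B", "carries": 200, "yards": 800},  # 4.0 yards per carry
]

# Naive: average the two per-player averages, ignoring workload.
per_back_ypc = [b["yards"] / b["carries"] for b in backs]
naive_ypc = sum(per_back_ypc) / len(per_back_ypc)

# Weighted: total yards over total carries, so each carry counts once.
total_yards = sum(b["yards"] for b in backs)
total_carries = sum(b["carries"] for b in backs)
weighted_ypc = total_yards / total_carries

print(f"Average of averages: {naive_ypc:.2f}")     # 6.00
print(f"Weighted by carries: {weighted_ypc:.2f}")  # 4.19
```

The naive figure lets a 10-carry sample count as much as a 200-carry workload, which overstates the backfield's yards per carry by nearly two yards.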
Football Analytics Cheat Sheet
Quick Formulas
Mean = Σx / n
Median = Middle value (or average of two middle values)
SD = √(Σ(x - mean)² / (n-1))
IQR = Q3 - Q1
CV = (SD / Mean) × 100%
z-score = (x - mean) / SD
Outlier bounds = [Q1 - 1.5×IQR, Q3 + 1.5×IQR]
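The formulas above can be worked through on a small invented game log. This sketch uses only the standard library. The quartiles use linear interpolation (`method="inclusive"`), which is assumed here to match the pandas default used elsewhere in this chapter.

```python
import math
import statistics

# Hypothetical rushing yards per game for one back (invented values).
yards = [45, 52, 61, 58, 49, 55, 63, 50, 57, 148]

n = len(yards)
mean = sum(yards) / n
median = statistics.median(yards)
sd = math.sqrt(sum((x - mean) ** 2 for x in yards) / (n - 1))  # sample SD

# Quartiles by linear interpolation between sorted values.
q1, _, q3 = statistics.quantiles(yards, n=4, method="inclusive")
iqr = q3 - q1
cv = sd / mean * 100
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

z_best = (max(yards) - mean) / sd
outliers = [x for x in yards if x < lower or x > upper]

print(f"mean={mean:.1f}, median={median:.1f}, SD={sd:.1f}, CV={cv:.0f}%")
print(f"IQR={iqr:.2f}, bounds=[{lower:.3f}, {upper:.3f}]")
print(f"z-score of best game={z_best:.2f}, outliers={outliers}")
```

The single 148-yard game pulls the mean (63.8) well above the median (56.0) and is the only value outside the outlier bounds.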
Interpretation Benchmarks
- Completion %: Mean ~62%, SD ~8% in FBS
- Yards per Carry: Mean ~4.0, SD ~1.5 (right-skewed)
- Points per Game: Mean ~28, SD ~10 in FBS
- Turnover Margin: Mean ~0, range typically -2 to +2 per game
Chapter Summary
- Central tendency describes typical values; choose based on data distribution
- Variability measures spread and consistency; essential for player evaluation
- Distributions reveal data shape; skewness affects which statistics to use
- Correlation quantifies relationships; strength and direction both matter
- Z-scores enable cross-metric comparisons; standard deviations are the universal unit
- Always consider context: sample size, competition level, era, and position
The best analysts don't just calculate statistics; they interpret them within the context of the game.