Key Takeaways: Exploratory Data Analysis for Football

One-page reference for Chapter 4 concepts


The EDA Mindset

  • Explore before modeling: Understand data patterns before formal analysis
  • Visualize often: Charts reveal what statistics hide
  • Question everything: Let the data guide your hypotheses
  • Document findings: Insights are useless if not recorded

EDA Process Flow

1. Load & Inspect → 2. Check Quality → 3. Univariate → 4. Bivariate → 5. Document

Quick Data Inspection

# Essential first-look commands
print(df.shape)                    # Rows, columns
print(df.dtypes)                   # Data types
print(df.isnull().sum())           # Missing values per column
print(df['col'].describe())        # Summary stats
print(df['col'].value_counts())    # Categorical counts

Key Visualizations

Goal            Chart Type   Code
Distribution    Histogram    df['epa'].hist(bins=50)
Compare groups  Box plot     df.boxplot(column='epa', by='down')
Relationship    Scatter      plt.scatter(df['x'], df['y'])
Trend           Line         df.groupby('week')['epa'].mean().plot()
Categories      Bar          df['team'].value_counts().plot(kind='bar')

Common Analysis Patterns

Team Summary

team_stats = (
    pbp
    .query("play_type in ['pass', 'run']")  # keep pass and run plays only
    .groupby('posteam')
    .agg(
        epa=('epa', 'mean'),
        success_rate=('success', 'mean'),
        plays=('play_id', 'count')
    )
    .sort_values('epa', ascending=False)
)

Split Analysis

# Compare home vs away
splits = (
    pbp
    .assign(is_home=pbp['posteam'] == pbp['home_team'])
    .groupby(['posteam', 'is_home'])['epa']
    .mean()
    .unstack()
)

Rolling Average

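# Rolling 100-play mean EPA per passer. Assumes df is already in chronological order.
# Each passer's first 99 plays are NaN; pass min_periods to rolling() to relax that.
df['rolling_epa'] = (
    df
    .groupby('passer')['epa']
    .transform(lambda x: x.rolling(100).mean())
)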

Visualization Checklist

Before sharing any chart:

  • [ ] Clear, descriptive title
  • [ ] Labeled axes with units
  • [ ] Legend if using colors/shapes
  • [ ] Readable font sizes
  • [ ] Data source noted
  • [ ] Insight clearly stated
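
The checklist above can be sketched as a single matplotlib figure. This is a minimal illustration, not a house style: the weekly values, the team labels, and the source note are all made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical weekly EPA-per-play values, invented for illustration only.
weekly = pd.DataFrame({
    "week": [1, 2, 3, 4, 5, 6],
    "team_a": [0.10, 0.05, 0.12, -0.02, 0.08, 0.15],
    "team_b": [0.02, -0.03, 0.04, 0.01, -0.05, 0.06],
})

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(weekly["week"], weekly["team_a"], marker="o", label="Team A")
ax.plot(weekly["week"], weekly["team_b"], marker="s", label="Team B")
ax.axhline(0, color="gray", linewidth=1, linestyle="--")  # benchmark: break-even EPA

# Title states the insight, not just the variable names.
ax.set_title("Team A beat Team B in EPA per play in 5 of 6 weeks", fontsize=14)
ax.set_xlabel("Week", fontsize=12)                    # labeled axes
ax.set_ylabel("EPA per play (points)", fontsize=12)   # with units
ax.tick_params(labelsize=11)                          # readable tick labels
ax.legend(fontsize=11)                                # legend for the two series
fig.text(0.01, 0.01, "Source: synthetic data for illustration", fontsize=9)
fig.tight_layout()
# fig.savefig("epa_by_week.png", dpi=150)  # uncomment to write the chart to disk
```

Each commented line maps to one checklist item, so the same skeleton can be reused by swapping in real data and a real source note.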

Missing Data Patterns

Column                 Typically Missing When
passer_player_name     Non-pass plays
receiver_player_name   Incomplete passes, runs
air_yards              Runs, sacks
cpoe                   Spikes, throwaways
epa                    Some special teams plays
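
One way to confirm patterns like these in your own data is to compute the share of missing values per column, split by play type. The sketch below uses a tiny hand-made frame; the rows and player names are invented, and only the column names follow the table above.

```python
import numpy as np
import pandas as pd

# Toy play-by-play frame; every value here is made up for illustration.
pbp = pd.DataFrame({
    "play_type": ["pass", "pass", "run", "run", "punt", "pass"],
    "passer_player_name": ["A.Smith", "A.Smith", None, None, None, "B.Jones"],
    "air_yards": [7.0, 12.0, np.nan, np.nan, np.nan, np.nan],
    "epa": [0.4, -0.2, 0.1, -0.5, np.nan, -1.1],
})

# Fraction of missing values in each column, grouped by play type.
missing_by_type = (
    pbp
    .drop(columns="play_type")
    .isna()                      # True where a value is missing
    .groupby(pbp["play_type"])   # group the boolean frame by the original play type
    .mean()                      # mean of booleans = share missing
)
print(missing_by_type)
```

A share near 1.0 for a play type suggests the column is structurally missing there (not a data error), so those rows should be filtered out, not imputed.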

EPA Distribution Facts

  • Centered near zero by construction
  • Passes > Runs in average EPA
  • Passes > Runs in variance (higher risk/reward)
  • Heavy tails: Turnovers create extreme negatives
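
These facts can be checked with a grouped summary. The sketch below runs on synthetic EPA values drawn from normal distributions with arbitrary parameters, so it demonstrates the mechanics only; it will not reproduce the heavy tails of real play-by-play data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic EPA values; the means and spreads are arbitrary assumptions,
# chosen only so that passes are more variable than runs.
pbp = pd.DataFrame({
    "play_type": ["pass"] * 500 + ["run"] * 500,
    "epa": np.concatenate([
        rng.normal(0.05, 1.5, 500),   # stand-in for pass plays
        rng.normal(-0.05, 0.9, 500),  # stand-in for run plays
    ]),
})

# Center, spread, and extremes of EPA for each play type.
summary = pbp.groupby("play_type")["epa"].agg(["mean", "var", "min", "max"])
print(summary)

# Overall mean, to check the "centered near zero" claim.
print(pbp["epa"].mean())
```

On real data, comparing the min and max columns against the mean is a quick way to see how far turnovers pull the left tail.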

Common Gotchas

Issue             Solution
Overplotting      Use hexbin or alpha
Misleading scale  Start y-axis at 0 when appropriate
Missing context   Add benchmarks/comparisons
Too much data     Filter, aggregate, or sample
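
The overplotting fix is the least obvious of these, so here is a minimal side-by-side sketch of the two options. The x and y values are synthetic stand-ins (labeled as air yards and EPA only for flavor), and the alpha and gridsize settings are starting points to tune, not recommendations.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# 20,000 synthetic points; a plain scatter of this many points becomes a solid blob.
x = rng.normal(8, 6, 20_000)               # hypothetical stand-in for air yards
y = 0.05 * x + rng.normal(0, 1.2, 20_000)  # hypothetical stand-in for EPA

fig, (ax_alpha, ax_hex) = plt.subplots(1, 2, figsize=(11, 4.5), sharey=True)

# Option 1: low alpha, so dense regions appear darker than sparse ones.
ax_alpha.scatter(x, y, s=5, alpha=0.05)
ax_alpha.set_title("Scatter with alpha=0.05")
ax_alpha.set_xlabel("x")
ax_alpha.set_ylabel("y")

# Option 2: hexbin, which counts points per hexagonal cell and colors by count.
hb = ax_hex.hexbin(x, y, gridsize=40, cmap="viridis", mincnt=1)
ax_hex.set_title("Hexbin (gridsize=40)")
ax_hex.set_xlabel("x")
fig.colorbar(hb, ax=ax_hex, label="Points per bin")

fig.tight_layout()
```

Alpha keeps individual outliers visible, while hexbin gives an explicit count scale; which one reads better depends on whether the outliers or the density is the point of the chart.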

Quick Statistical Checks

# Correlation
df['x'].corr(df['y'])

# Skewness (>0 = right tail, <0 = left tail)
df['epa'].skew()

# Percentiles
df['epa'].quantile([0.10, 0.25, 0.50, 0.75, 0.90])

Presentation Tips

  1. Lead with insight: State the finding clearly
  2. Show evidence: One visualization per point
  3. Provide context: Compare to benchmarks
  4. Acknowledge limits: Sample size, data quality
  5. Suggest action: What should be done?

Preview: Chapter 5

Next: Statistical Foundations — probability, inference, and the math behind EPA and win probability.