Key Takeaways: Exploratory Data Analysis for Football

DataField.Dev

One-page reference for Chapter 4 concepts

The EDA Mindset

Explore before modeling: Understand data patterns before formal analysis
Visualize often: Charts reveal what summary statistics hide
Question everything: Let the data guide your hypotheses
Document findings: Insights are useless if not recorded

EDA Process Flow

1. Load & Inspect → 2. Check Quality → 3. Univariate → 4. Bivariate → 5. Document

Quick Data Inspection

# Essential first-look commands (df is a pandas DataFrame; 'col' is a placeholder column name)
print(df.shape)                    # (rows, columns)
print(df.dtypes)                   # Data type of each column
print(df.isnull().sum())           # Missing values per column
print(df['col'].describe())        # Summary stats for one column
print(df['col'].value_counts())    # Counts per category
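
The first-look commands above can be folded into one small helper. This is a minimal sketch on a made-up frame: the `inspect` function name and the toy values are assumptions for illustration, not part of the chapter.

```python
import numpy as np
import pandas as pd

# Toy play-by-play frame. Column names mirror the ones used in this sheet,
# but the values are invented for illustration.
df = pd.DataFrame({
    "play_type": ["pass", "run", "pass", "punt"],
    "epa": [0.45, -0.10, np.nan, np.nan],
    "air_yards": [7.0, np.nan, 12.0, np.nan],
})


def inspect(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "pct_missing": frame.isnull().mean().round(3),
    })


summary = inspect(df)
print(df.shape)   # (rows, columns)
print(summary)
```

Reporting the missing share next to the raw count makes columns comparable across frames of different sizes.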

Key Visualizations

Goal	Chart Type	Code (assumes `import matplotlib.pyplot as plt`)
Distribution	Histogram	`df['epa'].hist(bins=50)`
Compare groups	Box plot	`df.boxplot(column='epa', by='down')`
Relationship	Scatter	`plt.scatter(df['x'], df['y'])`
Trend	Line	`df.groupby('week')['epa'].mean().plot()`
Categories	Bar	`df['team'].value_counts().plot(kind='bar')`

Common Analysis Patterns

Team Summary

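# pbp is a play-by-play DataFrame with one row per play
team_stats = (
    pbp
    .query("play_type in ['pass', 'run']")   # keep scrimmage plays only
    .groupby('posteam')                       # one row per offense
    .agg(
        epa=('epa', 'mean'),                  # EPA per play
        success_rate=('success', 'mean'),     # share of successful plays
        plays=('play_id', 'count')            # sample size
    )
    .sort_values('epa', ascending=False)
)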
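
To see what the team summary returns, here is a small self-contained run of the same pattern. The six plays and their values are invented, and the filter is written with the `in` form of `query`.

```python
import pandas as pd

# Made-up plays for two teams; real play-by-play data has far more columns.
pbp = pd.DataFrame({
    "play_id": [1, 2, 3, 4, 5, 6],
    "posteam": ["KC", "KC", "KC", "BUF", "BUF", "BUF"],
    "play_type": ["pass", "run", "punt", "pass", "pass", "run"],
    "epa": [0.6, -0.2, 0.1, 0.3, -0.5, 0.5],
    "success": [1, 0, 0, 1, 0, 1],
})

team_stats = (
    pbp
    .query("play_type in ['pass', 'run']")   # the punt row is dropped here
    .groupby("posteam")
    .agg(
        epa=("epa", "mean"),
        success_rate=("success", "mean"),
        plays=("play_id", "count"),
    )
    .sort_values("epa", ascending=False)
)
print(team_stats)
```

Keeping the `plays` column in the output is a cheap guard against over-reading averages built on small samples.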

Split Analysis

# Compare home vs away
splits = (
    pbp
    .assign(is_home=lambda d: d['posteam'] == d['home_team'])   # lambda keeps the chain self-contained
    .groupby(['posteam', 'is_home'])['epa']
    .mean()
    .unstack()
)
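
A possible follow-up to the split is to name the two columns and take their difference. The sketch below uses invented rows, and the column labels `home`, `away`, and `home_minus_away` are illustrative choices.

```python
import pandas as pd

# Invented plays: each team has two plays at home and two away.
pbp = pd.DataFrame({
    "posteam":   ["KC", "KC", "KC", "KC", "BUF", "BUF", "BUF", "BUF"],
    "home_team": ["KC", "KC", "BUF", "BUF", "BUF", "BUF", "KC", "KC"],
    "epa":       [0.4, 0.2, -0.1, 0.1, 0.3, 0.1, 0.0, -0.2],
})

splits = (
    pbp
    .assign(is_home=lambda d: d["posteam"] == d["home_team"])
    .groupby(["posteam", "is_home"])["epa"]
    .mean()
    .unstack()                                         # columns: False, True
    .rename(columns={True: "home", False: "away"})     # readable labels
)
splits["home_minus_away"] = splits["home"] - splits["away"]
print(splits)
```

A single difference column is easier to sort and plot than two separate averages.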

Rolling Average

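# Assumes df is already sorted chronologically within each passer
df['rolling_epa'] = (
    df
    .groupby('passer')['epa']
    .transform(lambda x: x.rolling(100).mean())   # NaN until a passer has 100 plays
)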
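
Rolling windows depend on row order and return NaN until the window fills. The sketch below makes both points explicit on toy data. The `play_order` column, the window of 3, and `min_periods=2` are assumptions chosen to keep the example small; the pattern above uses a window of 100.

```python
import pandas as pd

df = pd.DataFrame({
    "passer": ["A", "B", "A", "B", "A", "A"],
    "play_order": [1, 2, 3, 4, 5, 6],   # hypothetical chronological key
    "epa": [1.0, 0.0, 3.0, 2.0, 5.0, 7.0],
})

WINDOW = 3        # small window for the toy data
MIN_PERIODS = 2   # assumption: emit a value once 2 plays are available

df = df.sort_values("play_order")       # rolling follows row order
df["rolling_epa"] = (
    df.groupby("passer")["epa"]
    .transform(lambda x: x.rolling(WINDOW, min_periods=MIN_PERIODS).mean())
)
print(df)
```

Lowering `min_periods` trades early NaN values for noisier early estimates, so the choice is worth stating on any chart built from the result.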

Visualization Checklist

Before sharing any chart:

[ ] Clear, descriptive title
[ ] Labeled axes with units
[ ] Legend if using colors/shapes
[ ] Readable font sizes
[ ] Data source noted
[ ] Insight clearly stated
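
One way to apply the checklist in matplotlib is sketched below. The EPA values are simulated stand-ins (an assumption, not real play-by-play data), so the title and source note say so.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
epa = rng.normal(loc=0.0, scale=1.3, size=500)  # simulated stand-in for EPA

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(epa, bins=50, color="steelblue", label="All plays")
ax.axvline(epa.mean(), color="black", linestyle="--", label="Mean EPA")

# Checklist items: title, labeled axes with units, legend, font sizes, source.
ax.set_title("EPA per play clusters around zero (simulated data)", fontsize=14)
ax.set_xlabel("EPA (expected points added per play)", fontsize=12)
ax.set_ylabel("Number of plays", fontsize=12)
ax.legend(fontsize=10)
fig.text(0.01, 0.01, "Source: simulated values for illustration", fontsize=8)

# Call fig.savefig("some_name.png", dpi=150) to write the chart to disk.
```

Putting the insight in the title covers the last checklist item without a separate caption.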

Missing Data Patterns

Column	Typically Missing When
`passer_player_name`	Non-pass plays
`receiver_player_name`	Runs, sacks, incomplete passes with no recorded target
`air_yards`	Runs, sacks
`cpoe`	Spikes, throwaways
`epa`	Some special teams plays
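
Patterns like those in the table can be checked by computing the share of missing values per play type. The rows and the `QB_A`/`QB_B` names below are made up; only the technique is the point.

```python
import numpy as np
import pandas as pd

# Invented rows that imitate the missingness patterns listed above.
pbp = pd.DataFrame({
    "play_type": ["pass", "pass", "run", "run", "punt"],
    "passer_player_name": ["QB_A", "QB_B", None, None, None],
    "air_yards": [8.0, np.nan, np.nan, np.nan, np.nan],
    "epa": [0.5, -1.2, 0.1, -0.3, np.nan],
})

cols = ["passer_player_name", "air_yards", "epa"]

# isnull() gives a boolean frame; the mean of booleans is the missing share.
missing_by_type = pbp[cols].isnull().groupby(pbp["play_type"]).mean()
print(missing_by_type)
```

A share near 1.0 for a play type suggests the field is structurally absent there, while a small nonzero share points to a data-quality gap worth inspecting.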

EPA Distribution Facts

Centered near zero by construction
Passes > Runs in average EPA
Passes > Runs in variance (higher risk/reward)
Heavy tails: Turnovers create extreme negatives
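
These facts can be verified on any play-by-play frame with a grouped summary. The helper name `epa_by_play_type` is hypothetical, and the toy values were constructed to mirror the claims, so they demonstrate the check rather than provide evidence for it.

```python
import pandas as pd


def epa_by_play_type(pbp: pd.DataFrame) -> pd.DataFrame:
    """Mean, variance, extremes, and sample size of EPA for passes and runs."""
    return (
        pbp.query("play_type in ['pass', 'run']")
        .groupby("play_type")["epa"]
        .agg(["mean", "var", "min", "max", "count"])
    )


# Constructed toy data: passes have a higher mean and a wider spread.
toy = pd.DataFrame({
    "play_type": ["pass"] * 4 + ["run"] * 4,
    "epa": [2.0, -3.0, 1.5, 0.3, 0.2, -0.4, 0.5, -0.1],
})

out = epa_by_play_type(toy)
print(out)
```

The `min` column is a quick way to spot the extreme negatives that turnovers produce.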

Common Gotchas

Issue	Solution
Overplotting	Use hexbin or alpha
Misleading scale	Start bar-chart y-axes at 0; flag any truncated axis
Missing context	Add benchmarks/comparisons
Too much data	Filter, aggregate, or sample
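
The overplotting fixes from the first row can be compared side by side. This is a sketch on simulated points; the sample size, `alpha`, and `gridsize` values are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5000)             # simulated values
y = 0.5 * x + rng.normal(size=5000)   # loosely related to x

fig, (ax_alpha, ax_hex) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: transparency lets dense regions show up darker.
ax_alpha.scatter(x, y, alpha=0.1, s=10)
ax_alpha.set_title("Scatter with alpha=0.1")

# Option 2: hexbin counts the points falling into each hexagonal cell.
hb = ax_hex.hexbin(x, y, gridsize=30, cmap="viridis")
ax_hex.set_title("Hexbin (count per cell)")
fig.colorbar(hb, ax=ax_hex, label="Points per hexagon")
```

Transparency keeps individual outliers visible, while hexbin scales better as the number of points grows.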

Quick Statistical Checks

# Correlation
df['x'].corr(df['y'])

# Skewness (>0 = right tail, <0 = left tail)
df['epa'].skew()

# Percentiles
df['epa'].quantile([0.10, 0.25, 0.50, 0.75, 0.90])
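
A short worked run of these checks on invented numbers, with a rank correlation added as an optional extra via `method="spearman"`:

```python
import pandas as pd

# Invented values: y is exactly 2 * x, and epa has one large positive outlier.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "epa": [-0.2, -0.1, 0.0, 0.1, 4.0],
})

r = df["x"].corr(df["y"])       # Pearson by default
skew = df["epa"].skew()         # positive because of the right-tail outlier
q = df["epa"].quantile([0.10, 0.25, 0.50, 0.75, 0.90])

# Rank-based correlation is less sensitive to outliers than Pearson.
rho = df["x"].corr(df["epa"], method="spearman")

print(r, skew, rho)
print(q)
```

When the mean and median disagree, as they do for `epa` here, the percentiles usually give the more honest summary.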

Presentation Tips

Lead with insight: State the finding clearly
Show evidence: One visualization per point
Provide context: Compare to benchmarks
Acknowledge limits: Sample size, data quality
Suggest action: What should be done?

Preview: Chapter 5

Next: Statistical Foundations — probability, inference, and the math behind EPA and win probability.