Key Takeaways: Exploratory Data Analysis for Football
One-page reference for Chapter 4 concepts
The EDA Mindset
- Explore before modeling: Understand data patterns before formal analysis
- Visualize often: Charts reveal what statistics hide
- Question everything: Let the data guide your hypotheses
- Document findings: Insights are useless if not recorded
EDA Process Flow
1. Load & Inspect → 2. Check Quality → 3. Univariate → 4. Bivariate → 5. Document
Quick Data Inspection
# Essential first-look commands
print(df.shape) # Rows, columns
print(df.dtypes) # Data types
print(df.isnull().sum()) # Missing values
print(df['col'].describe()) # Summary stats
print(df['col'].value_counts()) # Categorical counts
Key Visualizations
| Goal | Chart Type | Code |
|---|---|---|
| Distribution | Histogram | df['epa'].hist(bins=50) |
| Compare groups | Box plot | df.boxplot(column='epa', by='down') |
| Relationship | Scatter | plt.scatter(df['x'], df['y']) |
| Trend | Line | plt.plot(df['week'], df['epa']) |
| Categories | Bar | df['team'].value_counts().plot(kind='bar') |
Common Analysis Patterns
Team Summary
team_stats = (
    pbp
    .query("play_type.isin(['pass', 'run'])")
    .groupby('posteam')
    .agg(
        epa=('epa', 'mean'),
        success_rate=('success', 'mean'),
        plays=('play_id', 'count')
    )
    .sort_values('epa', ascending=False)
)
Split Analysis
# Compare home vs away
splits = (
    pbp
    .assign(is_home=pbp['posteam'] == pbp['home_team'])
    .groupby(['posteam', 'is_home'])['epa']
    .mean()
    .unstack()
)
Rolling Average
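# Assumes rows are already in chronological order within each passer
df['rolling_epa'] = (
    df
    .groupby('passer')['epa']
    # NaN until the window holds 100 non-missing values
    .transform(lambda x: x.rolling(100).mean())
)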
Visualization Checklist
Before sharing any chart:
- [ ] Clear, descriptive title
- [ ] Labeled axes with units
- [ ] Legend if using colors/shapes
- [ ] Readable font sizes
- [ ] Data source noted
- [ ] Insight clearly stated
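The checklist above can be applied in a few lines of matplotlib. This is a minimal sketch, not a house style: the team names and weekly EPA values are invented for illustration, and the headless `Agg` backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented weekly EPA/play values for two hypothetical teams.
weekly = pd.DataFrame({
    "week": [1, 2, 3, 4],
    "Team A": [0.05, 0.12, -0.03, 0.08],
    "Team B": [-0.02, 0.01, 0.10, -0.06],
})

fig, ax = plt.subplots(figsize=(8, 5))
for team in ["Team A", "Team B"]:
    ax.plot(weekly["week"], weekly[team], marker="o", label=team)

# Benchmark line: EPA is centered near zero, so 0 marks an average play.
ax.axhline(0, color="gray", linewidth=1, linestyle="--")

# Title states the insight rather than just naming the variables.
ax.set_title("Team A out-gained Team B in three of four weeks", fontsize=14)
ax.set_xlabel("Week", fontsize=12)
ax.set_ylabel("EPA per play (points)", fontsize=12)
ax.set_xticks(weekly["week"])
ax.legend(title="Offense", fontsize=11)

# Data source note in the lower-left corner of the figure.
fig.text(0.01, 0.01, "Source: illustrative data, not real games", fontsize=9)
```

To export the chart, `fig.savefig("epa_by_week.png", dpi=150)` keeps the fonts readable when it is pasted into a report.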
Missing Data Patterns
| Column | Typically Missing When |
|---|---|
| passer_player_name | Non-pass plays |
| receiver_player_name | Incomplete passes, runs |
| air_yards | Runs, sacks |
| cpoe | Spikes, throwaways |
| epa | Some special teams plays |
EPA Distribution Facts
- Centered near zero by construction
- Passes > Runs in average EPA
- Passes > Runs in variance (higher risk/reward)
- Heavy tails: Turnovers create extreme negatives
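These facts are easy to check with a grouped summary. The following is a sketch on invented numbers, so the output illustrates the method only and says nothing about real EPA values; swap in your own `pbp` frame to test the claims.

```python
import pandas as pd

# Invented EPA values, only to make the sketch runnable.
pbp = pd.DataFrame({
    "play_type": ["pass"] * 4 + ["run"] * 4,
    "epa": [1.8, -0.6, 0.9, -1.5, 0.3, -0.4, 0.1, -0.2],
})

# Mean, spread, and worst outcome by play type.
summary = (
    pbp
    .groupby("play_type")["epa"]
    .agg(mean="mean", variance="var", worst="min")
)
print(summary)
```

On real data, comparing `worst` and the low quantiles against the mean gives a quick read on how heavy the negative tail is.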
Common Gotchas
| Issue | Solution |
|---|---|
| Overplotting | Use hexbin or alpha |
| Misleading scale | Start y-axis at 0 for bar charts; otherwise make the axis range obvious |
| Missing context | Add benchmarks/comparisons |
| Too much data | Filter, aggregate, or sample |
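For the overplotting row, the two fixes look like this side by side. This is a minimal sketch on randomly generated points (seeded for repeatability), not football data; `gridsize` and `alpha` are starting values to tune by eye.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic correlated points standing in for a dense scatter.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.5 * x + rng.normal(size=5000)

fig, (ax_alpha, ax_hex) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: transparent markers, so dense regions appear darker.
ax_alpha.scatter(x, y, s=5, alpha=0.1)
ax_alpha.set_title("Scatter with alpha=0.1")

# Option 2: hexagonal bins, colored by the number of points in each bin.
hb = ax_hex.hexbin(x, y, gridsize=30, cmap="viridis")
fig.colorbar(hb, ax=ax_hex, label="Points per bin")
ax_hex.set_title("Hexbin density")
```

Transparency keeps individual outliers visible, while hexbin scales better once the point count reaches the tens of thousands.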
Quick Statistical Checks
# Correlation
df['x'].corr(df['y'])
# Skewness (>0 = right tail, <0 = left tail)
df['epa'].skew()
# Percentiles
df['epa'].quantile([0.10, 0.25, 0.50, 0.75, 0.90])
Presentation Tips
- Lead with insight: State the finding clearly
- Show evidence: One visualization per point
- Provide context: Compare to benchmarks
- Acknowledge limits: Sample size, data quality
- Suggest action: What should be done?
Preview: Chapter 5
Next: Statistical Foundations — probability, inference, and the math behind EPA and win probability.