Key Takeaways: The NFL Data Ecosystem
One-page reference for Chapter 2 concepts
Data Categories
| Category | Granularity | Public Access | Primary Source |
|---|---|---|---|
| Aggregated Stats | Season/Game | Full | Pro Football Reference |
| Play-by-Play | Per play | Full (1999+) | nfl_data_py |
| Tracking Data | Per frame (10 Hz) | Limited samples | Big Data Bowl |
| Video | Continuous | Subscription | NFL Game Pass |
Key Play-by-Play Fields
Identification:
- game_id - Unique game identifier
- play_id - Play number within game
- posteam / defteam - Possession and defensive teams
Situation:
- down - Current down (1-4)
- ydstogo - Yards needed for first down
- yardline_100 - Yards from opponent's goal
Metrics:
- epa - Expected Points Added (key metric!)
- wpa - Win Probability Added
- success - Binary flag: 1 if EPA > 0, otherwise 0
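To make the field list concrete, here is a minimal sketch on a tiny hand-made DataFrame. The rows and values are invented for illustration and only mimic the column names above; real data would come from nfl_data_py. It derives the success flag from epa and averages both per offense.

```python
import pandas as pd

# Hypothetical rows that mimic the play-by-play columns listed above;
# the values are made up for illustration, not real NFL plays.
plays = pd.DataFrame(
    {
        "game_id": ["2023_01_DET_KC"] * 4,
        "play_id": [55, 77, 101, 123],
        "posteam": ["KC", "KC", "DET", "DET"],
        "defteam": ["DET", "DET", "KC", "KC"],
        "down": [1, 2, 1, 2],
        "ydstogo": [10, 7, 10, 12],
        "yardline_100": [75, 72, 80, 82],
        "epa": [0.25, -0.5, -0.75, 1.5],
    }
)

# success: 1 when the play gained expected points, else 0.
plays["success"] = (plays["epa"] > 0).astype(int)

# Mean EPA and success rate per offense.
summary = plays.groupby("posteam").agg(
    epa_per_play=("epa", "mean"),
    success_rate=("success", "mean"),
)
print(summary)
```

Grouping by posteam gives offensive efficiency; grouping by defteam instead would give the defensive view of the same plays.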
Expected Points (EP) Explained
Formula: EPA = EP_after - EP_before
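What it measures: How much a single play changed the offense's expected points, where EP is the expected value of the next score given down, distance, and field position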
Interpretation:
- EPA > 0 → Offense helped their scoring chances
- EPA < 0 → Offense hurt their scoring chances
- EPA ≈ 0 → Neutral play
Typical values:
- Touchdown: +4 to +6 EPA
- Interception: -4 to -6 EPA
- Average play: ~0 EPA
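The formula can be checked with simple arithmetic. The sketch below uses a hypothetical helper, expected_points_added, and made-up EP values chosen only to show the sign convention; real EP values come from a fitted model, not from this function.

```python
def expected_points_added(ep_before: float, ep_after: float) -> float:
    """EPA = EP_after - EP_before, from the offense's point of view."""
    return ep_after - ep_before


# Hypothetical EP values, chosen only to illustrate the arithmetic.
first_down_gain = expected_points_added(ep_before=1.0, ep_after=2.5)
sack = expected_points_added(ep_before=1.0, ep_after=-0.25)

print(first_down_gain)  # 1.5
print(sack)             # -1.25
```

A negative EP after the play means the opponent is now the more likely team to score next, which is how a single play can cost more points than the offense was expected to gain.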
Tracking Data Coordinate System
y = 53.3 (far sideline)
┌─────────────────────────┐
│ x=0               x=120 │
│ (own EZ)       (opp EZ) │
└─────────────────────────┘
y = 0 (near sideline)
Key fields: x, y (position, yards), s (speed, yards/second), a (acceleration, yards/second²), o (orientation, degrees)
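As a small illustration of working in this coordinate system, the sketch below checks whether a position falls inside the field rectangle and measures the distance between two players. The helper names and the sample positions are hypothetical, not part of any tracking data release.

```python
import math

FIELD_LENGTH = 120.0  # yards along x, both end zones included
FIELD_WIDTH = 53.3    # yards along y, sideline to sideline


def in_bounds(x: float, y: float) -> bool:
    """True when (x, y) lies inside the field rectangle shown above."""
    return 0.0 <= x <= FIELD_LENGTH and 0.0 <= y <= FIELD_WIDTH


def separation(x1: float, y1: float, x2: float, y2: float) -> float:
    """Straight-line distance in yards between two tracked positions."""
    return math.hypot(x2 - x1, y2 - y1)


# Hypothetical receiver and defender positions in one frame.
receiver = (45.0, 20.0)
defender = (48.0, 24.0)

print(in_bounds(*receiver))              # True
print(separation(*receiver, *defender))  # 5.0
```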
Data Loading Quick Reference
import nfl_data_py as nfl
# Play-by-play
pbp = nfl.import_pbp_data([2023])
# Seasonal stats
stats = nfl.import_seasonal_data([2023])
# Rosters (this function was named import_rosters in older nfl_data_py releases)
rosters = nfl.import_seasonal_rosters([2023])
# Schedules
schedule = nfl.import_schedules([2023])
Common Data Quality Issues
| Issue | Detection | Solution |
|---|---|---|
| Missing values | df.isnull().sum() | Filter or impute |
| Invalid EPA | epa.abs() > 10 | Investigate or exclude |
| Empty team fields | posteam.isna() | Filter non-plays |
| Model updates | Compare versions | Document data version |
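The detection expressions in the table can be run together as a quick validation pass. The sketch below uses a small hypothetical DataFrame with problems deliberately baked in; the threshold of 10 comes from the table, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical play-by-play rows with deliberate problems baked in.
pbp = pd.DataFrame(
    {
        "play_id": [1, 2, 3, 4],
        "posteam": [None, "KC", "KC", "DET"],  # first row is a non-play
        "epa": [None, 0.5, 12.0, -1.0],        # 12.0 is out of range
    }
)

# Missing values per column.
missing_counts = pbp.isnull().sum()

# Suspicious EPA magnitudes (NaN compares as False, so it is not flagged).
extreme_epa = pbp[pbp["epa"].abs() > 10]

# Keep only rows with a possession team, i.e. drop non-plays.
real_plays = pbp[pbp["posteam"].notna()]

print(missing_counts)
print(extreme_epa["play_id"].tolist())  # [3]
print(len(real_plays))                  # 3
```

Whether a flagged row is excluded or kept is an analysis decision; the point of the check is to surface it before it silently skews an average.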
Pipeline Best Practices
- Cache data locally - Avoid repeated downloads
- Filter early - Reduce memory usage
- Validate after loading - Check for expected values
- Document versions - Track data source dates
- Separate raw from processed - Maintain data lineage
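One simple way to cache locally is to wrap the download in a function that reads a local file when it exists. The sketch below is one possible design, not the library's own mechanism: load_with_cache and fake_download are hypothetical names, and the stand-in loader replaces a real call such as nfl.import_pbp_data([2023]) so the example runs without a network connection.

```python
import pickle
import tempfile
from pathlib import Path


def load_with_cache(cache_path, loader):
    """Return the cached object if present; otherwise call loader() and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    data = loader()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data


# Demo with a stand-in loader; in practice the loader might be
# lambda: nfl.import_pbp_data([2023]).
calls = []


def fake_download():
    calls.append(1)
    return [{"play_id": 1, "epa": 0.5}]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "raw" / "pbp_2023.pkl"
    first = load_with_cache(path, fake_download)   # downloads and writes the cache
    second = load_with_cache(path, fake_download)  # reads the cache, no download

print(len(calls))       # 1
print(first == second)  # True
```

Writing the cache under a raw/ directory keeps the untouched download apart from any processed files derived from it, and putting the season or download date in the file name helps document the data version.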
Data Source Selection Cheat Sheet
| Question Type | Best Source |
|---|---|
| Counting stats (yards, TDs) | Pro Football Reference |
| Efficiency metrics (EPA) | nfl_data_py |
| Spatial analysis | Big Data Bowl |
| Player grades | PFF (paid) |
| Contract data | Spotrac, Over The Cap |
Quick Self-Check
Can you:
- [ ] Load play-by-play data with nfl_data_py?
- [ ] Explain what EPA measures?
- [ ] Describe the tracking data coordinate system?
- [ ] List three common data quality issues?
- [ ] Design a simple caching strategy?
If not: Review the relevant section before proceeding to Chapter 3.
Preview: Chapter 3
Next, we dive into Python for Football Analytics, building the programming skills needed to manipulate, analyze, and visualize NFL data efficiently.