Key Takeaways: The NFL Data Ecosystem

One-page reference for Chapter 2 concepts


Data Categories

| Category | Granularity | Public Access | Primary Source |
| --- | --- | --- | --- |
| Aggregated Stats | Season/Game | Full | Pro Football Reference |
| Play-by-Play | Per play | Full (1999+) | nfl_data_py |
| Tracking Data | Per frame (10 Hz) | Limited samples | Big Data Bowl |
| Video | Continuous | Subscription | NFL Game Pass |

Key Play-by-Play Fields

Identification:
- game_id - Unique game identifier
- play_id - Play identifier, unique within a game
- posteam / defteam - Team in possession and team on defense

Situation:
- down - Current down (1-4)
- ydstogo - Yards needed for first down
- yardline_100 - Yards from opponent's goal

Metrics:
- epa - Expected Points Added (key metric!)
- wpa - Win Probability Added
- success - Binary: 1 if EPA > 0, otherwise 0
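As a rough illustration of how these fields fit together, the sketch below builds a tiny hand-made table and summarizes it by possession team. The team codes, game_id, and all numeric values are invented for illustration; real rows would come from nfl_data_py.

```python
import pandas as pd

# Toy rows with invented values (not real plays).
plays = pd.DataFrame(
    {
        "game_id": ["2023_01_AAA_BBB"] * 4,
        "play_id": [55, 76, 101, 123],
        "posteam": ["AAA", "AAA", "BBB", "BBB"],
        "defteam": ["BBB", "BBB", "AAA", "AAA"],
        "down": [1, 2, 1, 2],
        "ydstogo": [10, 4, 10, 12],
        "yardline_100": [75, 69, 80, 82],
        "epa": [0.6, -0.3, -0.5, 1.2],
    }
)

# Derive the binary success flag from EPA: 1 when EPA > 0, else 0.
plays["success"] = (plays["epa"] > 0).astype(int)

# One row per offense: play count, mean EPA, and share of successful plays.
summary = plays.groupby("posteam").agg(
    plays=("play_id", "count"),
    epa_per_play=("epa", "mean"),
    success_rate=("success", "mean"),
)
print(summary)
```

Grouping by posteam gives an offensive view; grouping by defteam instead would give the same metrics from the defense's side.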


Expected Points (EP) Explained

Formula: EPA = EP_after - EP_before

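What it measures: How much a play changed the offense's expected points (EP), where EP is the expected value of the next score given down, distance, and field position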

Interpretation:
- EPA > 0 → Offense helped their scoring chances
- EPA < 0 → Offense hurt their scoring chances
- EPA ≈ 0 → Neutral play

Typical values (approximate; the exact figure depends on the situation before the play):
- Touchdown: +4 to +6 EPA
- Interception: -4 to -6 EPA
- Average play: ~0 EPA
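The formula can be worked through with a couple of made-up numbers. In this minimal sketch the EP values are hypothetical, chosen only to show the arithmetic; they are not outputs of any real EP model.

```python
def epa(ep_before: float, ep_after: float) -> float:
    """Expected Points Added for one play: EP after minus EP before."""
    return ep_after - ep_before


# Hypothetical EP values, both measured from the offense's point of view.
gain = epa(ep_before=1.0, ep_after=2.5)    # a solid gain
loss = epa(ep_before=2.0, ep_after=-1.5)   # a turnover

print(gain)
print(loss)
```

After a turnover, EP_after is still expressed from the original offense's perspective, which is why it can be negative: the other team now holds the scoring advantage.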


Tracking Data Coordinate System

         y = 53.3 (far sideline)
    ┌─────────────────────────┐
    │    x=0        x=120     │
    │   (own EZ)   (opp EZ)   │
    └─────────────────────────┘
         y = 0 (near sideline)

Key fields: x, y (position in yards), s (speed, yards/second), a (acceleration, yards/second²), o (orientation, degrees), dir (direction of motion, degrees)
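One common preprocessing step with these coordinates is to mirror plays so the offense always moves toward increasing x. The sketch below assumes the field dimensions shown in the diagram; the helper names and the `moving_left` flag are hypothetical and would need to be mapped to whatever play-direction column your tracking file provides.

```python
import math

FIELD_LENGTH = 120.0  # yards, including both 10-yard end zones
FIELD_WIDTH = 53.3    # yards, sideline to sideline


def flip_to_offense_right(x, y, moving_left):
    """Mirror a point so the offense always moves toward increasing x."""
    if moving_left:
        return FIELD_LENGTH - x, FIELD_WIDTH - y
    return x, y


def in_bounds(x, y):
    """True if the point lies on the field, end zones included."""
    return 0.0 <= x <= FIELD_LENGTH and 0.0 <= y <= FIELD_WIDTH


def distance(p, q):
    """Straight-line distance in yards between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


flipped = flip_to_offense_right(30.0, 10.0, moving_left=True)
gap = distance((0.0, 0.0), (3.0, 4.0))
print(flipped, gap)
```

Mirroring both axes is a 180-degree rotation of the field, so angle fields such as o would need the same rotation applied if they are used after flipping.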


Data Loading Quick Reference

import nfl_data_py as nfl

# Play-by-play
pbp = nfl.import_pbp_data([2023])

# Seasonal stats
stats = nfl.import_seasonal_data([2023])

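# Rosters (recent nfl_data_py releases name this import_seasonal_rosters;
# older releases used import_rosters)
rosters = nfl.import_seasonal_rosters([2023])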

# Schedules
schedule = nfl.import_schedules([2023])

Common Data Quality Issues

| Issue | Detection | Solution |
| --- | --- | --- |
| Missing values | df.isnull().sum() | Filter or impute |
| Invalid EPA | df["epa"].abs() > 10 | Investigate or exclude |
| Empty team fields | df["posteam"].isna() | Filter non-plays |
| Model updates | Compare versions | Document data version |
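The detection checks in the table can be bundled into a small report that runs right after loading. This is a minimal sketch, not a complete validation suite; `quality_report` and `drop_non_plays` are hypothetical helper names, and the toy rows are invented.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, epa_limit: float = 10.0) -> dict:
    """Count rows flagged by each detection rule."""
    return {
        "missing_by_column": df.isnull().sum().to_dict(),
        "invalid_epa": int((df["epa"].abs() > epa_limit).sum()),
        "empty_posteam": int(df["posteam"].isna().sum()),
    }


def drop_non_plays(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that have a possession team."""
    return df[df["posteam"].notna()].copy()


# Toy data: one row with no team and no EPA, one implausibly large EPA.
toy = pd.DataFrame(
    {"posteam": ["AAA", None, "BBB"], "epa": [0.4, None, 12.5]}
)

report = quality_report(toy)
clean = drop_non_plays(toy)
print(report)
print(len(clean))
```

Rows with missing EPA are not counted as invalid here, because comparisons against NaN evaluate to False; they show up under the missing-value counts instead.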

Pipeline Best Practices

  1. Cache data locally - Avoid repeated downloads
  2. Filter early - Reduce memory usage
  3. Validate after loading - Check for expected values
  4. Document versions - Track data source dates
  5. Separate raw from processed - Maintain data lineage
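Practices 1 and 5 can be combined in a small load-or-fetch helper. The sketch below is one possible approach, assuming pickle files are an acceptable cache format; `load_cached`, `fake_fetch`, and the file layout are hypothetical, and the demo uses a stand-in fetch function so it runs without a network connection.

```python
import pickle
import tempfile
from pathlib import Path


def load_cached(cache_file, fetch):
    """Return cached data if the file exists; otherwise fetch and cache it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    data = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(data, fh)
    return data


# Stand-in for a slow download; records how often it is actually called.
calls = []


def fake_fetch():
    calls.append(1)
    return {"season": 2023, "rows": 3}


with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw" / "pbp_2023.pkl"
    first = load_cached(raw_path, fake_fetch)   # fetches and writes the file
    second = load_cached(raw_path, fake_fetch)  # reads the file, no fetch

print(len(calls))
```

In a real pipeline the fetch argument could be something like `lambda: nfl.import_pbp_data([2023])`, with the cache kept in a raw directory that processing steps read from but never overwrite. Putting the season or download date in the file name is one simple way to document the data version.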

Data Source Selection Cheat Sheet

| Question Type | Best Source |
| --- | --- |
| Counting stats (yards, TDs) | Pro Football Reference |
| Efficiency metrics (EPA) | nfl_data_py |
| Spatial analysis | Big Data Bowl |
| Player grades | PFF (paid) |
| Contract data | Spotrac, Over The Cap |

Quick Self-Check

Can you:
- [ ] Load play-by-play data with nfl_data_py?
- [ ] Explain what EPA measures?
- [ ] Describe the tracking data coordinate system?
- [ ] List three common data quality issues?
- [ ] Design a simple caching strategy?

If not: Review the relevant section before proceeding to Chapter 3.


Preview: Chapter 3

Next, we dive into Python for Football Analytics, building the programming skills to efficiently manipulate, analyze, and visualize NFL data.