Key Takeaways: The NFL Data Ecosystem

DataField.Dev

One-page reference for Chapter 2 concepts

Data Categories

Category	Granularity	Public Access	Primary Source
Aggregated Stats	Season/Game	Full	Pro Football Reference
Play-by-Play	Per play	Full (1999+)	nfl_data_py
Tracking Data	Per frame (10 Hz)	Limited samples	Big Data Bowl
Video	Continuous	Subscription	NFL Game Pass

Key Play-by-Play Fields

Identification:
- game_id - Unique game identifier
- play_id - Play number within game
- posteam / defteam - Possession and defensive teams

Situation:
- down - Current down (1-4)
- ydstogo - Yards needed for first down
- yardline_100 - Yards from opponent's goal

Metrics:
- epa - Expected Points Added (key metric!)
- wpa - Win Probability Added
- success - Binary: did EPA > 0?
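As a rough illustration of how these fields combine, the sketch below summarizes offense efficiency from a tiny hand-made table. The rows and team labels are invented for the example and are not real NFL data; only the column names come from the list above. Real play-by-play already ships a success column, so recomputing it here only shows what the flag means.

```python
import pandas as pd

# Tiny hand-made sample using the play-by-play column names listed above.
# All values are invented for illustration, not real NFL data.
plays = pd.DataFrame({
    "game_id": ["G1"] * 6,
    "play_id": [1, 2, 3, 4, 5, 6],
    "posteam": ["KC", "KC", "KC", "BUF", "BUF", "BUF"],
    "defteam": ["BUF", "BUF", "BUF", "KC", "KC", "KC"],
    "down": [1, 2, 3, 1, 2, 3],
    "ydstogo": [10, 4, 1, 10, 12, 12],
    "yardline_100": [75, 69, 66, 80, 82, 82],
    "epa": [0.5, -0.2, 1.1, -0.6, 0.0, -1.3],
})

# success is a binary flag: 1 when the play gained expected points.
plays["success"] = (plays["epa"] > 0).astype(int)

# Per-offense summary: play count, mean EPA per play, success rate.
summary = plays.groupby("posteam").agg(
    plays=("play_id", "count"),
    epa_per_play=("epa", "mean"),
    success_rate=("success", "mean"),
)
print(summary)
```

Grouping by defteam instead of posteam would give the defensive view of the same plays.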

Expected Points (EP) Explained

Formula: EPA = EP_after - EP_before

What it measures: How much a play changed the expected score

Interpretation:
- EPA > 0 → Offense helped their scoring chances
- EPA < 0 → Offense hurt their scoring chances
- EPA ≈ 0 → Neutral play

Typical values:
- Touchdown: +4 to +6 EPA
- Interception: -4 to -6 EPA
- Average play: ~0 EPA
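The formula and its interpretation can be sketched in a few lines. The EP values below are hypothetical numbers picked to make the arithmetic easy to follow, and the `tol` threshold for a "neutral" play is an assumption of this example, not a standard cutoff.

```python
def expected_points_added(ep_before: float, ep_after: float) -> float:
    """EPA = EP_after - EP_before, from the offense's point of view."""
    return ep_after - ep_before


def classify(epa: float, tol: float = 0.05) -> str:
    """Label a play by the sign of its EPA; tol is an illustrative threshold."""
    if epa > tol:
        return "positive"
    if epa < -tol:
        return "negative"
    return "neutral"


# Hypothetical EP values, chosen only to illustrate the arithmetic.
solid_gain = expected_points_added(ep_before=1.5, ep_after=2.5)
sack = expected_points_added(ep_before=1.5, ep_after=0.5)

print(solid_gain, classify(solid_gain))
print(sack, classify(sack))
```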

Tracking Data Coordinate System

         y = 53.3 (far sideline)
    ┌─────────────────────────┐
    │    x=0        x=120     │
    │   (own EZ)   (opp EZ)   │
    └─────────────────────────┘
         y = 0 (near sideline)

Key fields: x, y (position, yards), s (speed, yards/second), a (acceleration, yards/second²), o (orientation, degrees), dir (direction of motion, degrees)
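Because offenses drive in both directions, a common first step is to flip coordinates so that every play runs toward x = 120. The sketch below assumes a per-play flag saying whether the offense is moving left; the `moving_left` parameter and both function names are hypothetical and should be mapped to whatever direction column your tracking files provide.

```python
FIELD_LENGTH = 120.0  # yards, including both 10-yard end zones
FIELD_WIDTH = 53.3    # yards, sideline to sideline


def standardize(x: float, y: float, moving_left: bool) -> tuple:
    """Return (x, y) as if the offense were always driving toward x = 120."""
    if moving_left:
        # Rotate the field 180 degrees: mirror both axes.
        return FIELD_LENGTH - x, FIELD_WIDTH - y
    return x, y


def yards_to_goal(x_std: float) -> float:
    """Distance from a standardized x to the opponent's goal line (x = 110)."""
    return 110.0 - x_std


# A player at x = 30 on a play moving left sits at x = 90 once standardized.
x_std, y_std = standardize(30.0, 0.0, moving_left=True)
print(x_std, y_std, yards_to_goal(x_std))
```

Angles such as o and dir would need the matching 180-degree rotation, which this sketch leaves out.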

Data Loading Quick Reference

import nfl_data_py as nfl

# Play-by-play
pbp = nfl.import_pbp_data([2023])

# Seasonal stats
stats = nfl.import_seasonal_data([2023])

# Rosters (recent releases; older releases named this import_rosters)
rosters = nfl.import_seasonal_rosters([2023])

# Schedules
schedule = nfl.import_schedules([2023])

Common Data Quality Issues

Issue	Detection	Solution
Missing values	`df.isnull().sum()`	Filter or impute
Invalid EPA	`df['epa'].abs() > 10`	Investigate or exclude
Empty team fields	`df['posteam'].isna()`	Filter non-plays
Model updates	Compare versions	Document data version
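The first three checks in the table can be bundled into a small report and a cleaning step. This is a minimal sketch on invented rows; `quality_report` and `clean` are hypothetical helper names, and the limit of 10 simply mirrors the threshold in the table.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, epa_limit: float = 10.0) -> dict:
    """Count the issues from the table: missing values, extreme EPA, empty posteam."""
    return {
        "missing_by_column": df.isnull().sum().to_dict(),
        "extreme_epa_rows": int((df["epa"].abs() > epa_limit).sum()),
        "rows_without_posteam": int(df["posteam"].isna().sum()),
    }


def clean(df: pd.DataFrame, epa_limit: float = 10.0) -> pd.DataFrame:
    """Keep rows with a possession team and a plausible, non-missing EPA."""
    has_team = df["posteam"].notna()
    # A missing EPA fails this comparison, so those rows are dropped too.
    plausible = df["epa"].abs() <= epa_limit
    return df[has_team & plausible].copy()


# Invented sample: one non-play row (no team, no EPA) and one extreme EPA.
sample = pd.DataFrame({
    "posteam": ["KC", None, "BUF", "KC"],
    "epa": [0.3, float("nan"), 12.5, -0.8],
})

report = quality_report(sample)
cleaned = clean(sample)
print(report)
print(cleaned)
```

Whether to exclude or investigate extreme EPA rows is a judgment call; dropping them here only keeps the example short.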

Pipeline Best Practices

Cache data locally - Avoid repeated downloads
Filter early - Reduce memory usage
Validate after loading - Check for expected values
Document versions - Track data source dates
Separate raw from processed - Maintain data lineage
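One simple way to cache locally is to wrap any download in a function that checks for a file on disk first. The sketch below is an assumption-laden example, not the chapter's implementation: `load_cached` is a hypothetical helper, pickle is used only because it needs no extra dependency, and `fake_fetch` stands in for a real download so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def load_cached(cache_path: Path, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached frame if present; otherwise fetch, save, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df


# Stand-in for a real download; counts how often it is actually called.
calls = {"n": 0}


def fake_fetch() -> pd.DataFrame:
    calls["n"] += 1
    return pd.DataFrame({"play_id": [1, 2], "epa": [0.4, -0.1]})


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "raw" / "pbp_sample.pkl"
    first = load_cached(path, fake_fetch)   # fetches and writes the cache
    second = load_cached(path, fake_fetch)  # served from disk

print(calls["n"])
```

In practice the fetch argument could be something like `lambda: nfl.import_pbp_data([2023])`, with the cache file kept under a raw-data directory and the download date recorded alongside it.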

Data Source Selection Cheat Sheet

Question Type	Best Source
Counting stats (yards, TDs)	Pro Football Reference
Efficiency metrics (EPA)	nfl_data_py
Spatial analysis	Big Data Bowl
Player grades	PFF (paid)
Contract data	Spotrac, Over The Cap

Quick Self-Check

Can you:
- [ ] Load play-by-play data with nfl_data_py?
- [ ] Explain what EPA measures?
- [ ] Describe the tracking data coordinate system?
- [ ] List three common data quality issues?
- [ ] Design a simple caching strategy?

If not: Review the relevant section before proceeding to Chapter 3.

Preview: Chapter 3

Next, we dive into Python for Football Analytics, building the programming skills needed to manipulate, analyze, and visualize NFL data efficiently.