Learning Objectives
- Identify the major sources of NFL data and their characteristics
- Understand the structure of play-by-play data and its key fields
- Explain what player tracking data captures and its limitations
- Evaluate data quality issues common in football analytics
- Access and load NFL data using Python tools
- Design a data pipeline for football analytics projects
In This Chapter
- Chapter Overview
- 2.1 The Data Landscape
- 2.2 Play-by-Play Data
- 2.3 Tracking Data
- 2.4 Other Data Sources
- 2.5 Data Access with Python
- 2.6 Building a Data Pipeline
- 2.7 Chapter Summary
- What's Next
- Chapter 2 Exercises → exercises.md
- Chapter 2 Quiz → quiz.md
- Case Study: Building a Team Stats Dashboard → case-study-01.md
Chapter 2: The NFL Data Ecosystem
"In analytics, data is both the foundation and the ceiling. You can only answer questions your data allows." — Anonymous NFL Analytics Director
Chapter Overview
Every analysis begins with data. Before building models or calculating metrics, you must understand where your data comes from, what it contains, and what it leaves out. This chapter provides a comprehensive tour of the NFL data ecosystem—the sources, structures, and quirks that shape what analysis is possible.
The football analytics landscape has transformed over the past decade. What was once the exclusive domain of NFL teams is now largely accessible to anyone with a laptop and internet connection. Public data repositories provide play-by-play records for every snap, while annual competitions release samples of proprietary tracking data. This democratization has fueled an explosion of public research and created career pathways that didn't exist a generation ago.
Yet data accessibility comes with responsibility. Public data has limitations, biases, and quality issues that can mislead the unwary analyst. Understanding these constraints is as important as understanding the data itself.
In this chapter, you will learn to:

- Navigate the landscape of NFL data providers and their offerings
- Work with play-by-play data structure and key fields
- Understand what tracking data captures and where it falls short
- Identify and address common data quality issues
- Build robust data loading and caching pipelines
2.1 The Data Landscape
2.1.1 A Brief History of Football Data
Football data collection has evolved through distinct eras:
The Paper Era (Pre-1990s)
Before computerization, football statistics were recorded by hand on paper scoresheets. Basic statistics—completions, yards, touchdowns—were compiled by team statisticians and wire services. Play-by-play records existed but were inconsistent and difficult to access.
The Digital Era (1990s-2000s)
The NFL began systematic digital recording of game events. The Game Statistics and Information System (GSIS) created standardized play-by-play records. ESPN, Sports Reference, and other outlets built databases that eventually became publicly accessible.
The Analytics Era (2010s)
Play-by-play data became freely available through nflscrapR and later nflfastR. Expected points and win probability calculations were added to public data. Football analytics moved from specialist hobby to mainstream discourse.
The Tracking Era (2018-Present)
RFID chips in player equipment enabled precise location tracking. The NFL Big Data Bowl democratized access to tracking data samples. Computer vision and machine learning opened new analytical frontiers.
2.1.2 Categories of Football Data
NFL data falls into several distinct categories:
Event Data (Play-by-Play)
Records what happened on each play: formations, play calls, outcomes, penalties. This is the foundation of most football analysis. Available for every play since the 1999 season.
Aggregated Statistics
Player and team totals across games or seasons: passing yards, touchdowns, win-loss records. Easy to work with but loses situational context.
Tracking Data
Position, speed, and acceleration of every player on every frame (10 times per second). Captures spatial relationships invisible to event data. Currently proprietary, with limited public samples.
Video Data
Game film showing the actual plays. Essential for contextual analysis but requires manual review or computer vision processing. Available through NFL Game Pass and team-specific systems.
Transaction Data
Contracts, trades, draft picks, injury reports. Essential for roster construction and value analysis. Available through various public sources with varying reliability.
"""
Data Categories Overview
This code demonstrates the different granularities of NFL data.
"""
import pandas as pd
import numpy as np
# Create sample data showing different granularities
# 1. Aggregated Statistics (Season Level)
season_stats = pd.DataFrame({
    'player': ['Patrick Mahomes', 'Dak Prescott', 'Josh Allen'],
    'season': [2023, 2023, 2023],
    'pass_yards': [4183, 4516, 4306],  # 2023 regular-season totals
    'pass_td': [27, 36, 29],
    'games': [16, 17, 17]
})
# 2. Play-by-Play (Play Level)
play_by_play = pd.DataFrame({
'game_id': ['2023_01_DET_KC', '2023_01_DET_KC', '2023_01_DET_KC'],
'play_id': [1, 2, 3],
'quarter': [1, 1, 1],
'time': ['15:00', '14:42', '14:15'],
'down': [1, 2, 3],
'ydstogo': [10, 7, 4],
'play_type': ['pass', 'run', 'pass'],
'yards_gained': [3, 7, 15],
'epa': [-0.2, 0.8, 1.5]
})
# 3. Tracking Data (Frame Level - 10 Hz)
tracking_data = pd.DataFrame({
'game_id': ['2023_01_DET_KC'] * 6,
'play_id': [1] * 6,
'frame_id': [1, 1, 1, 2, 2, 2],
'nfl_id': [12345, 67890, 11111, 12345, 67890, 11111],
'x': [25.5, 30.2, 28.1, 25.8, 30.5, 28.4],
'y': [23.1, 25.6, 22.8, 23.3, 25.9, 23.0],
's': [2.1, 5.4, 3.2, 2.3, 5.6, 3.4], # Speed
'a': [0.5, 1.2, 0.8, 0.6, 1.1, 0.9] # Acceleration
})
print("DATA GRANULARITY COMPARISON")
print("=" * 50)
print("\n1. Season Statistics (highest aggregation):")
print(f" Rows: {len(season_stats)}")
print(f" One row per: player-season")
print("\n2. Play-by-Play (play level):")
print(f" Rows: {len(play_by_play)}")
print(f" One row per: play")
print("\n3. Tracking Data (frame level):")
print(f" Rows: {len(tracking_data)}")
print(f" One row per: player-frame (10 per second)")
print(f" Typical game: ~2.5 million rows")
2.1.3 Data Providers Overview
| Provider | Data Type | Coverage | Access | Cost |
|---|---|---|---|---|
| nfl_data_py / nflfastR | Play-by-play | 1999-present | Python/R package | Free |
| Pro Football Reference | Aggregated stats | 1920-present | Web scraping | Free |
| NFL Next Gen Stats | Tracking metrics | 2016-present | API/Website | Free (aggregated) |
| Big Data Bowl | Tracking data | Selected games | Kaggle download | Free |
| PFF | Grades, advanced stats | 2006-present | Subscription | $$$$ |
| ESPN | Various stats | Varies | API (limited) | Free |
| NFL Official | Various | Current season | NFL API | Free (limited) |
2.2 Play-by-Play Data
2.2.1 What Is Play-by-Play Data?
Play-by-play (PBP) data records the essential facts about every snap in an NFL game. Each row represents one play, with columns describing the situation, action, and outcome.
This is the workhorse of football analytics. Nearly every metric you've heard of—EPA, success rate, DVOA—is calculated from play-by-play data.
Intuition: Think of play-by-play data as a detailed game log. It doesn't tell you how something happened (you can't see a receiver's route), but it tells you what happened and in what context.
2.2.2 Key Play-by-Play Fields
The nflfastR / nfl_data_py play-by-play dataset contains over 300 columns. Here are the most important:
Game and Play Identification:

| Field | Description | Example |
|---|---|---|
| `game_id` | Unique game identifier | 2023_01_DET_KC |
| `play_id` | Play number within game | 42 |
| `old_game_id` | Legacy NFL game ID | 2023091000 |
Situation Fields:

| Field | Description | Example |
|---|---|---|
| `posteam` | Team with possession | KC |
| `defteam` | Defensive team | DET |
| `yardline_100` | Yards from opponent's goal line | 65 |
| `down` | Current down (1-4) | 2 |
| `ydstogo` | Yards needed for a first down | 7 |
| `game_seconds_remaining` | Time left in game (seconds) | 1823 |
| `half_seconds_remaining` | Time left in half (seconds) | 823 |
| `score_differential` | Offense score minus defense score | 3 |
Play Description Fields:

| Field | Description | Example |
|---|---|---|
| `play_type` | Type of play | pass, run, punt, field_goal |
| `pass` | Binary: was this a pass play? | 1 |
| `rush` | Binary: was this a rush play? | 0 |
| `desc` | Text description of play | (11:25) P.Mahomes pass... |
| `yards_gained` | Net yards on the play | 12 |
Expected Points Fields:

| Field | Description | Example |
|---|---|---|
| `ep` | Expected points before the play | 2.35 |
| `epa` | Expected points added by the play | 0.87 |
| `wp` | Win probability before the play | 0.62 |
| `wpa` | Win probability added | 0.03 |
Player Identification:

| Field | Description | Example |
|---|---|---|
| `passer_player_id` | GSIS ID of passer | 00-0033873 |
| `passer_player_name` | Passer name | P.Mahomes |
| `receiver_player_id` | GSIS ID of targeted receiver | 00-0036212 |
| `rusher_player_id` | GSIS ID of ball carrier | 00-0037744 |
"""
Exploring Play-by-Play Data Structure
This code demonstrates loading and exploring PBP data.
"""
import nfl_data_py as nfl
import pandas as pd
def load_and_explore_pbp(seasons: list = [2023]) -> pd.DataFrame:
"""
Load play-by-play data and display key information.
Parameters
----------
seasons : list
List of seasons to load
Returns
-------
pd.DataFrame
Play-by-play data
"""
print(f"Loading play-by-play data for {seasons}...")
pbp = nfl.import_pbp_data(seasons)
print(f"\nDataset Overview:")
print(f" Total plays: {len(pbp):,}")
print(f" Total columns: {len(pbp.columns)}")
print(f" Memory usage: {pbp.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"\nPlay Type Distribution:")
print(pbp['play_type'].value_counts().head(10))
print(f"\nKey Numeric Fields:")
numeric_fields = ['yards_gained', 'epa', 'wpa', 'air_yards', 'yards_after_catch']
print(pbp[numeric_fields].describe().round(2))
return pbp
# Example usage (commented out to avoid actual data download in textbook)
# pbp = load_and_explore_pbp([2023])
2.2.3 Understanding Expected Points
Expected Points (EP) is perhaps the most important derived field in play-by-play data. It answers the question: "Given this game situation, how many points will the offense score on this drive on average?"
EP is calculated using historical data across millions of plays. Key factors include:
- Field position: Closer to the goal line = higher EP
- Down and distance: First-and-10 is better than third-and-15
- Time remaining: Context for late-game situations
Expected Points Added (EPA) measures how much a single play changed the expected score:
$$ \text{EPA} = \text{EP}_{\text{after}} - \text{EP}_{\text{before}} $$
A positive EPA means the offense improved their scoring expectation; negative means they hurt it.
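As a quick worked example, the calculation is a simple subtraction. The EP values below are hypothetical, chosen to match the `ep`/`epa` examples in the field table above:

```python
# Toy EPA calculation: EPA is just the change in expected points.
# These EP values are illustrative, not outputs of a real model.
ep_before = 2.35  # e.g., 1st-and-10 near midfield
ep_after = 3.22   # e.g., after a 12-yard completion

epa = ep_after - ep_before
print(f"EPA for this play: {epa:+.2f}")  # +0.87
```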
"""
Understanding Expected Points
This code illustrates how EP varies by situation.
"""
import numpy as np
import pandas as pd
# Approximate EP values by distance from the opponent's goal line
# (the yardline_100 convention used in play-by-play data)
yards_from_opp_goal = list(range(1, 100))
ep_values = []

for yardline in yards_from_opp_goal:
    if yardline <= 10:
        # Red zone: high EP
        ep = 4.5 + (10 - yardline) * 0.25
    elif yardline <= 50:
        # Opponent's territory: positive EP
        ep = 0.5 + (50 - yardline) * 0.08
    else:
        # Own territory: lower EP, can go negative
        ep = 0.5 - (yardline - 50) * 0.04
    ep_values.append(ep)

ep_curve = pd.DataFrame({
    'yards_from_opp_goal': yards_from_opp_goal,
    'yards_from_own_goal': [100 - y for y in yards_from_opp_goal],
    'expected_points': ep_values
})

print("Expected Points by Field Position (1st and 10):")
print("-" * 50)
print(f"{'Location':<30} {'EP':>10}")
print("-" * 50)

sample_positions = [
    (99, "Own 1 yard line"),
    (80, "Own 20"),
    (50, "Midfield"),
    (20, "Opponent's 20"),
    (5, "Opponent's 5 (Red Zone)")
]

for yards_from_opp, description in sample_positions:
    ep = ep_curve.loc[
        ep_curve['yards_from_opp_goal'] == yards_from_opp, 'expected_points'
    ].values[0]
    print(f"{description:<30} {ep:>+10.2f}")
2.2.4 Data Quality Considerations
Play-by-play data is not perfect. Common issues include:
Missing Values
Many fields are only populated for certain play types. For example, air_yards is null for rushing plays, and receiver_player_id is null for incompletions to unknown targets.
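One quick way to see this pattern is to count missing values by play type. A minimal sketch with a toy DataFrame (real play-by-play data behaves the same way at much larger scale):

```python
import numpy as np
import pandas as pd

# Toy sample: air_yards is only populated on pass plays,
# and can still be missing on some passes (e.g., throwaways)
pbp = pd.DataFrame({
    'play_type': ['pass', 'run', 'pass', 'run', 'pass'],
    'air_yards': [8.0, np.nan, 12.0, np.nan, np.nan],
})

# Share of missing air_yards values, by play type
null_rates = pbp.groupby('play_type')['air_yards'].apply(
    lambda s: s.isna().mean()
)
print(null_rates)
```

Checking null rates like this before an analysis tells you whether a "missing" value means "not applicable" or "not recorded" — a distinction that changes how you should filter.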
Inconsistent Recording
How plays are categorized can vary by scorer. Some plays classified as "scrambles" might be considered "designed runs" by others.
Timing Imprecision
Game clock values are approximate, especially for real-time data. Post-game cleaning improves accuracy but doesn't eliminate issues.
Model-Dependent Fields
EPA, WPA, and CPOE depend on underlying models that are periodically updated. Historical values may be recalculated, creating version issues.
Common Pitfall: Treating EPA as ground truth rather than an estimate. EPA depends on the expected points model used, which makes assumptions about average team performance. Actual team-specific expected values may differ.
2.3 Tracking Data
2.3.1 What Is Tracking Data?
Tracking data captures the precise location of every player (and the ball) on every frame of every play. RFID chips in player shoulder pads transmit position data at 10 Hz (10 times per second).
This data reveals what play-by-play cannot see:
- How open was the receiver?
- How much time did the quarterback have in the pocket?
- Where were the defenders positioned at the snap?
- How fast was the ball carrier running?
2.3.2 Tracking Data Structure
Each row in tracking data represents one player on one frame:
| Field | Description | Example |
|---|---|---|
| `gameId` | Game identifier | 2023091000 |
| `playId` | Play identifier | 42 |
| `frameId` | Frame number within the play (10 per second) | 15 |
| `nflId` | Player identifier | 12345 |
| `displayName` | Player name | Patrick Mahomes |
| `x` | X coordinate (0-120 yards) | 35.5 |
| `y` | Y coordinate (0-53.3 yards) | 26.7 |
| `s` | Speed (yards/second) | 4.2 |
| `a` | Acceleration (yards/second²) | 1.8 |
| `dis` | Distance traveled since last frame (yards) | 0.42 |
| `o` | Orientation (degrees) | 135.2 |
| `dir` | Direction of movement (degrees) | 142.8 |
| `event` | Event tag (snap, pass, tackle, etc.) | ball_snap |
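The `s` and `dir` fields are often combined into velocity components. A sketch, assuming the Big Data Bowl convention that `dir` is measured in degrees clockwise from the positive y-axis (check the documentation for the specific data release you are using):

```python
import numpy as np

def velocity_components(s: float, direction_deg: float) -> tuple:
    """Split speed into (vx, vy) components.

    Assumes `direction_deg` is degrees clockwise from the +y axis,
    as in Big Data Bowl tracking releases.
    """
    rad = np.deg2rad(direction_deg)
    vx = s * np.sin(rad)  # along the field's length (x-axis)
    vy = s * np.cos(rad)  # across the field's width (y-axis)
    return vx, vy

# A player moving at 5 yd/s with dir = 90 moves purely in +x
vx, vy = velocity_components(5.0, 90.0)
print(round(vx, 2), round(vy, 2))  # 5.0 0.0
```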
2.3.3 Coordinate System
The NFL tracking coordinate system:
SIDELINE (y = 53.3)
┌────────────────────────────────────────────┐
│ │
│ ◄──── OFFENSE MOVING THIS WAY ────► │
│ │
│ x=0 x=120 │
│ (back of (back of │
│ own end opp end │
│ zone) zone) │
│ │
└────────────────────────────────────────────┘
SIDELINE (y = 0)
- X-axis: Length of field (0-120 yards, including end zones)
- Y-axis: Width of field (0-53.3 yards)
- Origin: Bottom-left corner when offense moves left-to-right
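Because raw coordinates depend on which direction the offense is driving, a common first preprocessing step is to standardize every play so the offense always moves left-to-right. A sketch, assuming a `playDirection` column with values `'left'`/`'right'` as in Big Data Bowl releases:

```python
import pandas as pd

def standardize_direction(tracking: pd.DataFrame) -> pd.DataFrame:
    """Flip coordinates so the offense always moves toward x = 120.

    Assumes a 'playDirection' column ('left' or 'right'), as in
    Big Data Bowl tracking data.
    """
    out = tracking.copy()
    left = out['playDirection'] == 'left'
    out.loc[left, 'x'] = 120.0 - out.loc[left, 'x']
    out.loc[left, 'y'] = 53.3 - out.loc[left, 'y']
    return out

# Example: only the 'left' play gets mirrored
df = pd.DataFrame({
    'playDirection': ['left', 'right'],
    'x': [100.0, 100.0],
    'y': [10.0, 10.0],
})
print(standardize_direction(df)[['x', 'y']])
```

After this transformation, "toward the opponent's end zone" always means increasing x, which greatly simplifies downstream feature engineering.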
"""
Understanding Tracking Data Coordinates
This code demonstrates the tracking data coordinate system.
"""
import numpy as np
import pandas as pd
def create_sample_tracking_frame():
"""
Create a sample tracking data frame showing player positions.
"""
# Sample frame: typical passing play at the snap
players = pd.DataFrame({
'nflId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, # Offense
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], # Defense
'team': ['offense'] * 11 + ['defense'] * 11,
'position': ['QB', 'RB', 'WR', 'WR', 'WR', 'TE',
'LT', 'LG', 'C', 'RG', 'RT',
'DE', 'DT', 'DT', 'DE', 'LB', 'LB', 'LB',
'CB', 'CB', 'S', 'S'],
'x': [25, 23, 25, 25, 25, 26, # Offense skill players
27, 27, 27, 27, 27, # OL at line of scrimmage (5 linemen)
28, 28, 28, 28, 30, 30, 31, # Defense front 7
28, 28, 35, 35], # Secondary
'y': [26.65, 24, 5, 48, 15, 35, # Offense y (across width)
22, 24, 26.65, 29, 31, # OL
33, 25, 28, 20, 22, 28, 35, # Front 7
3, 50, 20, 33], # Secondary
's': [0, 0, 0, 0, 0, 0, # At snap, offense stationary
0, 0, 0, 0, 0,
0.5, 0.3, 0.3, 0.5, 0.2, 0.2, 0.1, # Defense slightly moving
0.1, 0.1, 0.1, 0.1]
})
return players
def calculate_distances(tracking_frame: pd.DataFrame) -> pd.DataFrame:
"""
Calculate distances between offensive and defensive players.
Parameters
----------
tracking_frame : pd.DataFrame
Single frame of tracking data
Returns
-------
pd.DataFrame
Distance matrix between key players
"""
offense = tracking_frame[tracking_frame['team'] == 'offense']
defense = tracking_frame[tracking_frame['team'] == 'defense']
# Example: distance from each WR to nearest defender
receivers = offense[offense['position'] == 'WR']
defenders = defense[defense['position'].isin(['CB', 'S'])]
separations = []
for _, wr in receivers.iterrows():
min_dist = float('inf')
for _, db in defenders.iterrows():
dist = np.sqrt((wr['x'] - db['x'])**2 + (wr['y'] - db['y'])**2)
min_dist = min(min_dist, dist)
separations.append({
'receiver_id': wr['nflId'],
'separation': min_dist
})
return pd.DataFrame(separations)
# Demonstration
frame = create_sample_tracking_frame()
print("Sample Tracking Frame (at snap):")
print(frame[['position', 'team', 'x', 'y', 's']].to_string(index=False))
separations = calculate_distances(frame)
print("\n\nReceiver Separations:")
print(separations)
2.3.4 Tracking Data Availability
Tracking data availability is limited compared to play-by-play:
| Source | Scope | Access |
|---|---|---|
| NFL Teams | All games, all seasons | Team only (proprietary) |
| Next Gen Stats | Aggregated metrics | Public (website/API) |
| Big Data Bowl | 50-100 games per year | Public (Kaggle) |
| Research partnerships | Varies | Academic/corporate agreements |
For most analysts, the Big Data Bowl is the primary source of raw tracking data. Each year's competition focuses on different aspects of the game (rushing, pass coverage, special teams, etc.).
2.3.5 Limitations of Tracking Data
Despite its richness, tracking data has limitations:
Doesn't Capture Everything
- No ball tracking in current public data
- Can't see player eye movement or hand placement
- Doesn't record pre-snap communication
Computational Challenges
- Massive data volume (2-3 million rows per game)
- Requires specialized processing tools
- Storage and memory constraints
Interpretation Challenges
- Raw coordinates need context
- "Separation" depends on how you measure it
- Quarterback decisions aren't directly observable
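To make the separation point concrete, here is a toy comparison of two reasonable definitions — nearest-defender distance at the moment of the throw versus at the catch. All positions below are invented:

```python
import numpy as np

# Toy (x, y) positions for one receiver and one defender at two moments
throw_frame = {'wr': (40.0, 20.0), 'cb': (42.0, 23.0)}
catch_frame = {'wr': (48.0, 20.0), 'cb': (48.5, 21.0)}

def separation(frame: dict) -> float:
    """Euclidean distance between receiver and defender in one frame."""
    wr, cb = np.array(frame['wr']), np.array(frame['cb'])
    return float(np.linalg.norm(wr - cb))

print(f"Separation at throw: {separation(throw_frame):.2f} yds")
print(f"Separation at catch: {separation(catch_frame):.2f} yds")
```

The same play can look "open" under one definition and "contested" under the other, so any separation metric should state exactly which frame and which defenders it uses.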
2.4 Other Data Sources
2.4.1 Pro Football Reference
Pro Football Reference (PFR) provides historical statistics and game logs. Key features:
- Coverage: Statistics from 1920 to present
- Granularity: Game-level and season-level statistics
- Special data: Draft history, combine results, snap counts
- Access: Web scraping (terms of service apply)
"""
Example: Scraping Pro Football Reference
(For demonstration purposes - respect rate limits and ToS)
"""
import pandas as pd
# PFR provides HTML tables that pandas can read
# Example URL structure for player game logs:
# https://www.pro-football-reference.com/players/M/MahoPa00/gamelog/2023/
def get_pfr_player_gamelog(player_code: str, season: int) -> pd.DataFrame:
"""
Fetch player game log from Pro Football Reference.
Parameters
----------
player_code : str
PFR player code (e.g., 'MahoPa00' for Patrick Mahomes)
season : int
Season year
Returns
-------
pd.DataFrame
Game log data
Notes
-----
This is a simplified example. Production code should:
- Implement rate limiting
- Handle errors gracefully
- Cache results
- Respect robots.txt
"""
url = f"https://www.pro-football-reference.com/players/{player_code[0]}/{player_code}/gamelog/{season}/"
# In practice:
# tables = pd.read_html(url)
# gamelog = tables[0] # Usually first table
# Placeholder return for demonstration
print(f"Would fetch: {url}")
return pd.DataFrame()
2.4.2 NFL Next Gen Stats
Next Gen Stats provides aggregated tracking-derived metrics:
Available Metrics:

- Completion probability over expected
- Separation at catch point
- Time to throw
- Rushing yards over expected
- Pass rush win rate

Access:

- Website: nextgenstats.nfl.com
- Limited API access
2.4.3 Pro Football Focus (PFF)
PFF employs analysts to watch every play and grade player performance:
Products:

- Player grades (0-100 scale)
- Detailed play-level data
- Position-specific metrics

Considerations:

- Subjective element in grading
- Expensive subscription
- Widely used by NFL teams
2.4.4 Contract and Transaction Data
For roster construction analysis:
- Spotrac: Salary cap and contract details
- Over The Cap: Contract tracking, cap projections
- NFL Transaction Wire: Real-time transaction tracking
2.5 Data Access with Python
2.5.1 The nfl_data_py Package
The nfl_data_py package is the primary tool for accessing NFL data in Python:
"""
nfl_data_py Comprehensive Guide
This module demonstrates all major data loading functions.
"""
import nfl_data_py as nfl
import pandas as pd
from typing import List, Optional
# -----------------------------------------------------------------------------
# PLAY-BY-PLAY DATA
# -----------------------------------------------------------------------------
def load_pbp(seasons: List[int], columns: Optional[List[str]] = None) -> pd.DataFrame:
"""
Load play-by-play data with optional column filtering.
Parameters
----------
seasons : List[int]
Seasons to load (e.g., [2022, 2023])
columns : List[str], optional
Specific columns to load (reduces memory)
Returns
-------
pd.DataFrame
Play-by-play data
"""
pbp = nfl.import_pbp_data(seasons, columns=columns)
return pbp
# Common column subsets for different analyses
PASSING_COLUMNS = [
'game_id', 'play_id', 'posteam', 'defteam', 'down', 'ydstogo',
'yardline_100', 'pass', 'complete_pass', 'incomplete_pass',
'passer_player_id', 'passer_player_name', 'receiver_player_id',
'receiver_player_name', 'yards_gained', 'air_yards', 'yards_after_catch',
'epa', 'wpa', 'cpoe', 'qb_epa'
]
RUSHING_COLUMNS = [
'game_id', 'play_id', 'posteam', 'defteam', 'down', 'ydstogo',
'yardline_100', 'rush', 'rusher_player_id', 'rusher_player_name',
'yards_gained', 'epa', 'wpa', 'success'
]
# -----------------------------------------------------------------------------
# SEASONAL PLAYER STATISTICS
# -----------------------------------------------------------------------------
def load_seasonal_stats(
seasons: List[int],
stat_type: str = 'passing'
) -> pd.DataFrame:
"""
Load seasonal player statistics.
Parameters
----------
seasons : List[int]
Seasons to load
stat_type : str
Type: 'passing', 'rushing', 'receiving', 'defense'
Returns
-------
pd.DataFrame
Seasonal statistics
"""
stats = nfl.import_seasonal_data(seasons)
# Filter by stat type if needed
if stat_type == 'passing':
stats = stats[stats['attempts'] > 0]
elif stat_type == 'rushing':
stats = stats[stats['carries'] > 0]
elif stat_type == 'receiving':
stats = stats[stats['targets'] > 0]
return stats
# -----------------------------------------------------------------------------
# ROSTER AND PLAYER DATA
# -----------------------------------------------------------------------------
def load_rosters(seasons: List[int]) -> pd.DataFrame:
"""Load roster data with player information."""
return nfl.import_rosters(seasons)
def load_players() -> pd.DataFrame:
"""Load player ID mapping table."""
return nfl.import_ids()
# -----------------------------------------------------------------------------
# ADDITIONAL DATA
# -----------------------------------------------------------------------------
def load_schedules(seasons: List[int]) -> pd.DataFrame:
"""Load game schedules with results."""
return nfl.import_schedules(seasons)
def load_draft_picks(seasons: List[int]) -> pd.DataFrame:
"""Load draft pick data."""
return nfl.import_draft_picks(seasons)
def load_combine_data(seasons: List[int]) -> pd.DataFrame:
"""Load NFL Combine results."""
return nfl.import_combine_data(seasons)
def load_injuries(seasons: List[int]) -> pd.DataFrame:
"""Load injury report data."""
return nfl.import_injuries(seasons)
# -----------------------------------------------------------------------------
# DEMONSTRATION
# -----------------------------------------------------------------------------
if __name__ == "__main__":
print("NFL Data Loading Examples")
print("=" * 50)
# Example: Load 2023 passing plays only
print("\nLoading 2023 passing data...")
# pbp = load_pbp([2023], PASSING_COLUMNS)
# pass_plays = pbp[pbp['pass'] == 1].copy()
# print(f"Loaded {len(pass_plays):,} passing plays")
# Example: Load seasonal QB stats
print("\nLoading seasonal statistics...")
# qb_stats = load_seasonal_stats([2022, 2023], 'passing')
# print(f"Loaded {len(qb_stats)} player-seasons")
print("\nAvailable functions:")
print(" - nfl.import_pbp_data(seasons)")
print(" - nfl.import_seasonal_data(seasons)")
print(" - nfl.import_rosters(seasons)")
print(" - nfl.import_schedules(seasons)")
print(" - nfl.import_draft_picks(seasons)")
print(" - nfl.import_combine_data(seasons)")
print(" - nfl.import_injuries(seasons)")
2.5.2 Caching and Data Management
For efficiency, implement caching:
"""
Data Caching Strategies
Efficient data loading with local caching.
"""
import os
import pandas as pd
import nfl_data_py as nfl
from pathlib import Path
from typing import List, Optional
from datetime import datetime, timedelta
class NFLDataCache:
"""
Manages cached NFL data to avoid repeated downloads.
Attributes
----------
cache_dir : Path
Directory for cached files
max_age_days : int
Maximum age before refreshing cache
"""
def __init__(
self,
cache_dir: str = './data/cache',
max_age_days: int = 7
):
"""
Initialize cache manager.
Parameters
----------
cache_dir : str
Directory for cache files
max_age_days : int
Days before cache expires
"""
self.cache_dir = Path(cache_dir)
self.cache_dir.mkdir(parents=True, exist_ok=True)
self.max_age_days = max_age_days
def _cache_path(self, data_type: str, seasons: List[int]) -> Path:
"""Generate cache file path."""
season_str = '_'.join(map(str, sorted(seasons)))
return self.cache_dir / f"{data_type}_{season_str}.parquet"
def _is_cache_valid(self, path: Path) -> bool:
"""Check if cache file is still valid."""
if not path.exists():
return False
file_time = datetime.fromtimestamp(path.stat().st_mtime)
age = datetime.now() - file_time
return age < timedelta(days=self.max_age_days)
def load_pbp(
self,
seasons: List[int],
columns: Optional[List[str]] = None,
force_refresh: bool = False
) -> pd.DataFrame:
"""
Load play-by-play data with caching.
Parameters
----------
seasons : List[int]
Seasons to load
columns : List[str], optional
Columns to select
force_refresh : bool
Force re-download even if cache exists
Returns
-------
pd.DataFrame
Play-by-play data
"""
cache_path = self._cache_path('pbp', seasons)
if not force_refresh and self._is_cache_valid(cache_path):
print(f"Loading from cache: {cache_path}")
df = pd.read_parquet(cache_path)
else:
print(f"Downloading play-by-play for {seasons}...")
df = nfl.import_pbp_data(seasons)
df.to_parquet(cache_path)
print(f"Cached to: {cache_path}")
if columns:
available = [c for c in columns if c in df.columns]
df = df[available]
return df
def clear_cache(self) -> int:
"""Clear all cached files. Returns number deleted."""
count = 0
for file in self.cache_dir.glob('*.parquet'):
file.unlink()
count += 1
return count
# Usage example
# cache = NFLDataCache('./data/cache')
# pbp = cache.load_pbp([2023])
2.6 Building a Data Pipeline
2.6.1 Pipeline Architecture
A robust analytics pipeline separates concerns:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Source │────▶│ Landing │────▶│ Processed │
│ APIs │ │ Zone │ │ Zone │
└─────────────┘ └─────────────┘ └─────────────┘
│ │
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Archive │ │ Analytics │
│ Storage │ │ Ready │
└─────────────┘ └─────────────┘
- Source Layer: APIs and data providers
- Landing Zone: raw data as received
- Processed Zone: cleaned, standardized data
- Analytics Ready: aggregated, feature-engineered datasets
2.6.2 Implementation Example
"""
NFL Data Pipeline
A simple but complete data pipeline for football analytics.
"""
import pandas as pd
import numpy as np
import nfl_data_py as nfl
from pathlib import Path
from typing import List, Dict, Optional
from dataclasses import dataclass
from datetime import datetime
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class PipelineConfig:
"""Configuration for data pipeline."""
data_dir: str = './data'
seasons: List[int] = None
refresh_interval_days: int = 7
def __post_init__(self):
if self.seasons is None:
self.seasons = [2023]
class NFLDataPipeline:
"""
Complete data pipeline for NFL analytics.
This class handles data extraction, transformation, and loading
with built-in caching and validation.
"""
def __init__(self, config: PipelineConfig):
"""Initialize pipeline with configuration."""
self.config = config
self.data_dir = Path(config.data_dir)
# Create directory structure
(self.data_dir / 'raw').mkdir(parents=True, exist_ok=True)
(self.data_dir / 'processed').mkdir(parents=True, exist_ok=True)
(self.data_dir / 'features').mkdir(parents=True, exist_ok=True)
self.data: Dict[str, pd.DataFrame] = {}
def extract_pbp(self) -> pd.DataFrame:
"""Extract play-by-play data from source."""
logger.info(f"Extracting PBP for seasons: {self.config.seasons}")
pbp = nfl.import_pbp_data(self.config.seasons)
self.data['pbp_raw'] = pbp
# Save to landing zone
output_path = self.data_dir / 'raw' / 'pbp_raw.parquet'
pbp.to_parquet(output_path)
logger.info(f"Saved {len(pbp):,} plays to {output_path}")
return pbp
def transform_pbp(self, pbp: pd.DataFrame = None) -> pd.DataFrame:
"""
Transform raw PBP data into clean format.
Transformations:
1. Filter to real plays (no timeouts, end of quarter)
2. Create derived fields
3. Standardize naming
4. Handle missing values
"""
if pbp is None:
pbp = self.data.get('pbp_raw')
if pbp is None:
raise ValueError("No PBP data available. Run extract_pbp first.")
logger.info("Transforming PBP data...")
# Filter to real plays
real_plays = pbp[
(pbp['play_type'].isin(['pass', 'run', 'punt', 'field_goal', 'kickoff'])) |
(pbp['rush'] == 1) |
(pbp['pass'] == 1)
].copy()
# Create derived fields
real_plays['is_pass'] = (real_plays['pass'] == 1).astype(int)
real_plays['is_rush'] = (real_plays['rush'] == 1).astype(int)
real_plays['is_success'] = (real_plays['epa'] > 0).astype(int)
# Score context
real_plays['score_diff'] = real_plays['posteam_score'] - real_plays['defteam_score']
real_plays['is_close_game'] = (abs(real_plays['score_diff']) <= 8).astype(int)
# Time context
real_plays['is_garbage_time'] = (
(real_plays['game_seconds_remaining'] < 300) &
(abs(real_plays['score_diff']) > 16)
).astype(int)
# Store
self.data['pbp_clean'] = real_plays
# Save to processed zone
output_path = self.data_dir / 'processed' / 'pbp_clean.parquet'
real_plays.to_parquet(output_path)
logger.info(f"Saved {len(real_plays):,} clean plays to {output_path}")
return real_plays
def create_player_features(self, pbp: pd.DataFrame = None) -> pd.DataFrame:
"""
Create player-level aggregated features.
Returns
-------
pd.DataFrame
Player statistics aggregated by season
"""
if pbp is None:
pbp = self.data.get('pbp_clean')
logger.info("Creating player features...")
# Passing stats
pass_plays = pbp[pbp['is_pass'] == 1]
passer_stats = pass_plays.groupby(
['season', 'passer_player_id', 'passer_player_name']
).agg({
'play_id': 'count',
'complete_pass': 'sum',
'yards_gained': 'sum',
'air_yards': 'sum',
'epa': 'sum',
'is_success': 'mean',
'cpoe': 'mean'
}).reset_index()
passer_stats.columns = [
'season', 'player_id', 'player_name',
'dropbacks', 'completions', 'pass_yards', 'air_yards',
'total_epa', 'success_rate', 'cpoe'
]
passer_stats['epa_per_dropback'] = passer_stats['total_epa'] / passer_stats['dropbacks']
passer_stats['completion_pct'] = passer_stats['completions'] / passer_stats['dropbacks']  # note: denominator counts all dropbacks, including sacks
# Store
self.data['passer_features'] = passer_stats
# Save to features zone
output_path = self.data_dir / 'features' / 'passer_features.parquet'
passer_stats.to_parquet(output_path)
logger.info(f"Saved features for {len(passer_stats)} passers")
return passer_stats
def run_full_pipeline(self) -> Dict[str, pd.DataFrame]:
"""Run complete pipeline from extract to features."""
logger.info("Starting full pipeline run...")
# Extract
pbp = self.extract_pbp()
# Transform
pbp_clean = self.transform_pbp(pbp)
# Create features
passer_features = self.create_player_features(pbp_clean)
logger.info("Pipeline complete!")
return {
'pbp_raw': pbp,
'pbp_clean': pbp_clean,
'passer_features': passer_features
}
def validate_data(self) -> Dict[str, bool]:
"""Run validation checks on processed data."""
validations = {}
if 'pbp_clean' in self.data:
pbp = self.data['pbp_clean']
# Check for expected columns
required_cols = ['game_id', 'play_id', 'epa', 'posteam', 'defteam']
validations['has_required_columns'] = all(c in pbp.columns for c in required_cols)
# Check for reasonable EPA values
validations['epa_in_range'] = pbp['epa'].between(-10, 10).mean() > 0.99
# Check for non-empty teams
validations['no_empty_teams'] = pbp['posteam'].notna().mean() > 0.99
return validations
# Usage
if __name__ == "__main__":
config = PipelineConfig(seasons=[2023])
pipeline = NFLDataPipeline(config)
# Run pipeline
# results = pipeline.run_full_pipeline()
# Validate
# validations = pipeline.validate_data()
# print("Validation results:", validations)
2.7 Chapter Summary
Key Concepts
- Data categories range from aggregated statistics to frame-level tracking data, each with different use cases and accessibility.
- Play-by-play data is the foundation of football analytics, recording events and outcomes for every play with derived metrics like EPA and WPA.
- Tracking data captures precise player locations at 10 Hz, enabling spatial analysis invisible to event data, but access is limited.
- Data quality issues including missing values, inconsistent recording, and model dependencies must be understood and addressed.
- nfl_data_py provides Python access to the major public NFL data sources.
- Data pipelines separate extraction, transformation, and loading for maintainable analytics workflows.
Key Takeaways for Practice
- Always understand your data's provenance and limitations before analysis
- Use caching to avoid repeated downloads of large datasets
- Filter to relevant subsets early to manage memory
- Validate data quality as part of your pipeline
- Document data processing decisions for reproducibility
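The "filter early" advice can be made concrete with column selection and float downcasting. A minimal sketch (column names follow nflfastR conventions; the data here is synthetic):

```python
import numpy as np
import pandas as pd

def slim_down(pbp: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Keep only needed columns and downcast float64 -> float32."""
    slim = pbp[[c for c in columns if c in pbp.columns]].copy()
    for col in slim.select_dtypes(include='float64').columns:
        slim[col] = slim[col].astype('float32')
    return slim

# Synthetic stand-in for a play-by-play frame
pbp = pd.DataFrame({
    'game_id': ['2023_01_DET_KC'] * 1000,
    'epa': np.random.randn(1000),
    'wpa': np.random.randn(1000),
    'desc': ['(11:25) P.Mahomes pass...'] * 1000,  # wide text column
})

before = pbp.memory_usage(deep=True).sum()
slim = slim_down(pbp, ['game_id', 'epa', 'wpa'])
after = slim.memory_usage(deep=True).sum()
print(f"Memory: {before / 1e3:.0f} KB -> {after / 1e3:.0f} KB")
```

Dropping wide text columns like `desc` typically saves far more memory than downcasting, which is why selecting columns at load time (as shown in Section 2.5) matters for multi-season analyses.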
Data Source Decision Framework
What data do I need?
├── Historical patterns across many seasons?
│ └── Use nfl_data_py play-by-play (1999-present)
│
├── Current season statistics?
│ └── Use nfl_data_py or NFL API
│
├── Spatial/movement analysis?
│ ├── Aggregated metrics only?
│ │ └── Use Next Gen Stats website
│ └── Raw tracking data?
│ └── Use Big Data Bowl data (limited games)
│
├── Player grades/PFF metrics?
│ └── Requires PFF subscription
│
├── Contract/salary data?
│ └── Use Spotrac or Over The Cap
What's Next
In Chapter 3: Python for Football Analytics, we dive deeper into the programming tools and patterns that make efficient football analysis possible. You'll learn to write reusable functions, process large datasets efficiently, and build the coding foundations for everything that follows.
Before moving on, complete the exercises and quiz to ensure you can navigate the NFL data ecosystem confidently.
Chapter 2 Exercises → exercises.md
Chapter 2 Quiz → quiz.md
Case Study: Building a Team Stats Dashboard → case-study-01.md
The analyst who understands their data's limitations will always outperform the one who treats data as truth.