
Learning Objectives

  • Identify the major sources of NFL data and their characteristics
  • Understand the structure of play-by-play data and its key fields
  • Explain what player tracking data captures and its limitations
  • Evaluate data quality issues common in football analytics
  • Access and load NFL data using Python tools
  • Design a data pipeline for football analytics projects

Chapter 2: The NFL Data Ecosystem

"In analytics, data is both the foundation and the ceiling. You can only answer questions your data allows." — Anonymous NFL Analytics Director


Chapter Overview

Every analysis begins with data. Before building models or calculating metrics, you must understand where your data comes from, what it contains, and what it leaves out. This chapter provides a comprehensive tour of the NFL data ecosystem—the sources, structures, and quirks that shape what analysis is possible.

The football analytics landscape has transformed over the past decade. What was once the exclusive domain of NFL teams is now largely accessible to anyone with a laptop and internet connection. Public data repositories provide play-by-play records for every snap, while annual competitions release samples of proprietary tracking data. This democratization has fueled an explosion of public research and created career pathways that didn't exist a generation ago.

Yet data accessibility comes with responsibility. Public data has limitations, biases, and quality issues that can mislead the unwary analyst. Understanding these constraints is as important as understanding the data itself.

In this chapter, you will learn to:

  • Navigate the landscape of NFL data providers and their offerings
  • Work with play-by-play data structure and key fields
  • Understand what tracking data captures and where it falls short
  • Identify and address common data quality issues
  • Build robust data loading and caching pipelines


2.1 The Data Landscape

2.1.1 A Brief History of Football Data

Football data collection has evolved through distinct eras:

The Paper Era (Pre-1990s)

Before computerization, football statistics were recorded by hand on paper scoresheets. Basic statistics—completions, yards, touchdowns—were compiled by team statisticians and wire services. Play-by-play records existed but were inconsistent and difficult to access.

The Digital Era (1990s-2000s)

The NFL began systematic digital recording of game events. The Game Statistics and Information System (GSIS) created standardized play-by-play records. ESPN, Sports Reference, and other outlets built databases that eventually became publicly accessible.

The Analytics Era (2010s)

Play-by-play data became freely available through nflscrapR and later nflfastR. Expected points and win probability calculations were added to public data. Football analytics moved from specialist hobby to mainstream discourse.

The Tracking Era (2018-Present)

RFID chips in player equipment enabled precise location tracking. The NFL Big Data Bowl democratized access to tracking data samples. Computer vision and machine learning opened new analytical frontiers.

2.1.2 Categories of Football Data

NFL data falls into several distinct categories:

Event Data (Play-by-Play)

Records what happened on each play: formations, play calls, outcomes, penalties. This is the foundation of most football analysis. Available for every play since the 1999 season.

Aggregated Statistics

Player and team totals across games or seasons: passing yards, touchdowns, win-loss records. Easy to work with but loses situational context.

Tracking Data

Position, speed, and acceleration of every player on every frame (10 times per second). Captures spatial relationships invisible to event data. Currently proprietary, with limited public samples.

Video Data

Game film showing the actual plays. Essential for contextual analysis but requires manual review or computer vision processing. Available through NFL Game Pass and team-specific systems.

Transaction Data

Contracts, trades, draft picks, injury reports. Essential for roster construction and value analysis. Available through various public sources with varying reliability.

"""
Data Categories Overview

This code demonstrates the different granularities of NFL data.
"""

import pandas as pd
import numpy as np

# Create sample data showing different granularities

# 1. Aggregated Statistics (Season Level)
season_stats = pd.DataFrame({
    'player': ['Patrick Mahomes', 'Dak Prescott', 'Josh Allen'],
    'season': [2023, 2023, 2023],
    'pass_yards': [4183, 4516, 4306],
    'pass_td': [27, 36, 29],
    'games': [16, 17, 17]
})

# 2. Play-by-Play (Play Level)
play_by_play = pd.DataFrame({
    'game_id': ['2023_01_DET_KC', '2023_01_DET_KC', '2023_01_DET_KC'],
    'play_id': [1, 2, 3],
    'quarter': [1, 1, 1],
    'time': ['15:00', '14:42', '14:15'],
    'down': [1, 2, 3],
    'ydstogo': [10, 7, 4],
    'play_type': ['pass', 'run', 'pass'],
    'yards_gained': [3, 7, 15],
    'epa': [-0.2, 0.8, 1.5]
})

# 3. Tracking Data (Frame Level - 10 Hz)
tracking_data = pd.DataFrame({
    'game_id': ['2023_01_DET_KC'] * 6,
    'play_id': [1] * 6,
    'frame_id': [1, 1, 1, 2, 2, 2],
    'nfl_id': [12345, 67890, 11111, 12345, 67890, 11111],
    'x': [25.5, 30.2, 28.1, 25.8, 30.5, 28.4],
    'y': [23.1, 25.6, 22.8, 23.3, 25.9, 23.0],
    's': [2.1, 5.4, 3.2, 2.3, 5.6, 3.4],  # Speed
    'a': [0.5, 1.2, 0.8, 0.6, 1.1, 0.9]   # Acceleration
})

print("DATA GRANULARITY COMPARISON")
print("=" * 50)
print("\n1. Season Statistics (highest aggregation):")
print(f"   Rows: {len(season_stats)}")
print(f"   One row per: player-season")

print("\n2. Play-by-Play (play level):")
print(f"   Rows: {len(play_by_play)}")
print(f"   One row per: play")

print("\n3. Tracking Data (frame level):")
print(f"   Rows: {len(tracking_data)}")
print(f"   One row per: player-frame (10 per second)")
print(f"   Typical game: ~2.5 million rows")

2.1.3 Data Providers Overview

| Provider | Data Type | Coverage | Access | Cost |
|----------|-----------|----------|--------|------|
| nfl_data_py / nflfastR | Play-by-play | 1999-present | Python/R package | Free |
| Pro Football Reference | Aggregated stats | 1920-present | Web scraping | Free |
| NFL Next Gen Stats | Tracking metrics | 2016-present | API/Website | Free (aggregated) |
| Big Data Bowl | Tracking data | Selected games | Kaggle download | Free |
| PFF | Grades, advanced stats | 2006-present | Subscription | $$$$ |
| ESPN | Various stats | Varies | API (limited) | Free |
| NFL Official | Various | Current season | NFL API | Free (limited) |

2.2 Play-by-Play Data

2.2.1 What Is Play-by-Play Data?

Play-by-play (PBP) data records the essential facts about every snap in an NFL game. Each row represents one play, with columns describing the situation, action, and outcome.

This is the workhorse of football analytics. Nearly every metric you've heard of—EPA, success rate, DVOA—is calculated from play-by-play data.

Intuition: Think of play-by-play data as a detailed game log. It doesn't tell you how something happened (you can't see a receiver's route), but it tells you what happened and in what context.

2.2.2 Key Play-by-Play Fields

The nflfastR / nfl_data_py play-by-play dataset contains over 300 columns. Here are the most important:

Game and Play Identification:

| Field | Description | Example |
|-------|-------------|---------|
| game_id | Unique game identifier | 2023_01_DET_KC |
| play_id | Play number within game | 42 |
| old_game_id | Legacy NFL game ID | 2023091000 |

Situation Fields:

| Field | Description | Example |
|-------|-------------|---------|
| posteam | Team with possession | KC |
| defteam | Defensive team | DET |
| yardline_100 | Yards from opponent's goal line | 65 |
| down | Current down (1-4) | 2 |
| ydstogo | Yards needed for a first down | 7 |
| game_seconds_remaining | Time left in game | 1823 |
| half_seconds_remaining | Time left in half | 823 |
| score_differential | Offense score minus defense score | 3 |
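The situation fields are what make situational splits easy. A minimal sketch on a toy frame (column names follow nflfastR; the values are invented):

```python
import pandas as pd

# Toy play-by-play frame using the situation fields above (values invented)
pbp = pd.DataFrame({
    'down': [1, 3, 3, 2],
    'ydstogo': [10, 8, 2, 5],
    'score_differential': [0, -4, 21, 3],
    'epa': [0.1, -0.5, 0.3, 0.8],
})

# Situational splits become one-line boolean filters
third_and_long = pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 7)]
close_game = pbp[pbp['score_differential'].abs() <= 8]

print(len(third_and_long))  # 1
print(len(close_game))      # 3
```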

Play Description Fields:

| Field | Description | Example |
|-------|-------------|---------|
| play_type | Type of play | pass, run, punt, field_goal |
| pass | Binary: was this a pass play? | 1 |
| rush | Binary: was this a rush play? | 0 |
| desc | Text description of play | (11:25) P.Mahomes pass... |
| yards_gained | Net yards on the play | 12 |

Expected Points Fields:

| Field | Description | Example |
|-------|-------------|---------|
| ep | Expected points before the play | 2.35 |
| epa | Expected points added by the play | 0.87 |
| wp | Win probability before the play | 0.62 |
| wpa | Win probability added | 0.03 |

Player Identification:

| Field | Description | Example |
|-------|-------------|---------|
| passer_player_id | GSIS ID of passer | 00-0033873 |
| passer_player_name | Passer name | P.Mahomes |
| receiver_player_id | GSIS ID of targeted receiver | 00-0036212 |
| rusher_player_id | GSIS ID of ball carrier | 00-0037744 |
"""
Exploring Play-by-Play Data Structure

This code demonstrates loading and exploring PBP data.
"""

import nfl_data_py as nfl
import pandas as pd
from typing import Optional

def load_and_explore_pbp(seasons: Optional[list] = None) -> pd.DataFrame:
    """
    Load play-by-play data and display key information.

    Parameters
    ----------
    seasons : list, optional
        Seasons to load; defaults to [2023]

    Returns
    -------
    pd.DataFrame
        Play-by-play data
    """
    if seasons is None:
        seasons = [2023]

    print(f"Loading play-by-play data for {seasons}...")
    pbp = nfl.import_pbp_data(seasons)

    print(f"\nDataset Overview:")
    print(f"  Total plays: {len(pbp):,}")
    print(f"  Total columns: {len(pbp.columns)}")
    print(f"  Memory usage: {pbp.memory_usage(deep=True).sum() / 1e6:.1f} MB")

    print(f"\nPlay Type Distribution:")
    print(pbp['play_type'].value_counts().head(10))

    print(f"\nKey Numeric Fields:")
    numeric_fields = ['yards_gained', 'epa', 'wpa', 'air_yards', 'yards_after_catch']
    print(pbp[numeric_fields].describe().round(2))

    return pbp


# Example usage (commented out to avoid actual data download in textbook)
# pbp = load_and_explore_pbp([2023])

2.2.3 Understanding Expected Points

Expected Points (EP) is perhaps the most important derived field in play-by-play data. It answers the question: "Given this game situation, how many points will the offense score on this drive on average?"

EP is calculated using historical data across millions of plays. Key factors include:

  • Field position: Closer to the goal line = higher EP
  • Down and distance: First-and-10 is better than third-and-15
  • Time remaining: Context for late-game situations

Expected Points Added (EPA) measures how much a single play changed the expected score:

$$ \text{EPA} = \text{EP}_{\text{after}} - \text{EP}_{\text{before}} $$

A positive EPA means the offense improved their scoring expectation; negative means they hurt it.
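A quick worked example of the formula, using illustrative EP values rather than any real model's output:

```python
# EPA = EP_after - EP_before, with illustrative (made-up) EP values
ep_before = 0.9            # 1st-and-10 at own 30
ep_after_gain = 2.4        # 1st-and-10 near midfield after a long completion
epa_gain = ep_after_gain - ep_before
print(round(epa_gain, 2))  # 1.5

# On a turnover the next state belongs to the opponent, so the offense's
# EP_after is the NEGATIVE of the opponent's new EP
ep_opp_after_int = 1.8     # opponent's EP after the interception return
epa_int = -ep_opp_after_int - ep_before
print(round(epa_int, 2))   # -2.7
```

The turnover case is why interceptions carry such large negative EPA: the offense loses not only its own scoring expectation but also hands expectation to the opponent.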

"""
Understanding Expected Points

This code illustrates how EP varies by situation.
"""

import numpy as np
import pandas as pd

# Approximate EP values by field position (1st and 10)
# Uses the yardline_100 convention: yards from the OPPONENT'S goal line
ep_values = []

for yardline_100 in range(1, 100):
    if yardline_100 <= 10:
        # Red zone: high EP
        ep = 4.5 + (10 - yardline_100) * 0.25
    elif yardline_100 <= 50:
        # Opponent's territory: positive EP
        ep = 0.5 + (50 - yardline_100) * 0.08
    else:
        # Own territory: lower EP, can go negative
        ep = 0.5 - (yardline_100 - 50) * 0.04

    ep_values.append(ep)

ep_curve = pd.DataFrame({
    'yards_from_own_goal': list(range(99, 0, -1)),
    'yards_from_opp_goal': list(range(1, 100)),
    'expected_points': ep_values
})

print("Expected Points by Field Position (1st and 10):")
print("-" * 50)
print(f"{'Location':<30} {'EP':>10}")
print("-" * 50)

sample_positions = [
    (1, "Own 1 yard line"),
    (20, "Own 20"),
    (50, "Midfield"),
    (80, "Opponent's 20"),
    (95, "Opponent's 5 (Red Zone)")
]

for yards_from_own, description in sample_positions:
    ep = ep_curve[ep_curve['yards_from_own_goal'] == yards_from_own]['expected_points'].values[0]
    print(f"{description:<30} {ep:>+10.2f}")

2.2.4 Data Quality Considerations

Play-by-play data is not perfect. Common issues include:

Missing Values

Many fields are only populated for certain play types. For example, air_yards is null for rushing plays, and receiver_player_id is null for incompletions to unknown targets.

Inconsistent Recording

How plays are categorized can vary by scorer. Some plays classified as "scrambles" might be considered "designed runs" by others.

Timing Imprecision

Game clock values are approximate, especially for real-time data. Post-game cleaning improves accuracy but doesn't eliminate issues.

Model-Dependent Fields

EPA, WPA, and CPOE depend on underlying models that are periodically updated. Historical values may be recalculated, creating version issues.

Common Pitfall: Treating EPA as ground truth rather than an estimate. EPA depends on the expected points model used, which makes assumptions about average team performance. Actual team-specific expected values may differ.
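These issues can be audited directly. A sketch of a per-play-type null-rate check on a toy frame (on real PBP data, air_yards is null on essentially every rush):

```python
import numpy as np
import pandas as pd

# Toy PBP frame: air_yards is only defined for pass plays
pbp = pd.DataFrame({
    'play_type': ['pass', 'pass', 'run', 'run'],
    'air_yards': [8.0, np.nan, np.nan, np.nan],  # one pass missing a value
})

# Fraction of missing air_yards by play type
null_rates = pbp.groupby('play_type')['air_yards'].apply(lambda s: s.isna().mean())
print(null_rates)
# run plays -> 1.0 (structurally missing); pass plays -> 0.5 (genuinely missing)
```

Distinguishing structural nulls (the field doesn't apply) from data-quality nulls (the field should exist but doesn't) is the first step of any cleaning pass.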


2.3 Tracking Data

2.3.1 What Is Tracking Data?

Tracking data captures the precise location of every player (and the ball) on every frame of every play. RFID chips in player shoulder pads transmit position data at 10 Hz (10 times per second).

This data reveals what play-by-play cannot see:

  • How open was the receiver?
  • How much time did the quarterback have in the pocket?
  • Where were the defenders positioned at the snap?
  • How fast was the ball carrier running?

2.3.2 Tracking Data Structure

Each row in tracking data represents one player on one frame:

| Field | Description | Example |
|-------|-------------|---------|
| gameId | Game identifier | 2023091000 |
| playId | Play identifier | 42 |
| frameId | Frame number within the play (10 per second) | 15 |
| nflId | Player identifier | 12345 |
| displayName | Player name | Patrick Mahomes |
| x | X coordinate (0-120 yards) | 35.5 |
| y | Y coordinate (0-53.3 yards) | 26.7 |
| s | Speed (yards/second) | 4.2 |
| a | Acceleration (yards/second²) | 1.8 |
| dis | Distance traveled since last frame (yards) | 0.42 |
| o | Player orientation (degrees) | 135.2 |
| dir | Direction of movement (degrees) | 142.8 |
| event | Event tag (snap, pass, tackle, etc.) | ball_snap |
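The angle fields need a convention to be useful. A minimal sketch of splitting speed into velocity components, assuming the Big Data Bowl convention that dir is measured in degrees clockwise from the +y axis:

```python
import numpy as np

def velocity_components(s, dir_deg):
    """Split speed into (vx, vy).

    Assumes the Big Data Bowl convention: dir is in degrees,
    0 points along the +y axis, and angles increase clockwise.
    """
    rad = np.deg2rad(dir_deg)
    return s * np.sin(rad), s * np.cos(rad)

# dir = 90 degrees means moving straight downfield (+x direction)
vx, vy = velocity_components(5.0, 90.0)
print(round(vx, 2), round(vy, 2))
```

Check the convention against the data dictionary of the specific release you are using before trusting any angle math.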

2.3.3 Coordinate System

The NFL tracking coordinate system:

                    SIDELINE (y = 53.3)
    ┌────────────────────────────────────────────┐
    │                                            │
    │        OFFENSE MOVING THIS WAY ────►       │
    │                                            │
    │   x=0                              x=120   │
    │   (back of                      (back of   │
    │    own end                       opp end   │
    │    zone)                         zone)     │
    └────────────────────────────────────────────┘
                    SIDELINE (y = 0)

  • X-axis: Length of field (0-120 yards, including end zones)
  • Y-axis: Width of field (0-53.3 yards)
  • Origin: Bottom-left corner when the offense moves left-to-right
"""
Understanding Tracking Data Coordinates

This code demonstrates the tracking data coordinate system.
"""

import numpy as np
import pandas as pd

def create_sample_tracking_frame():
    """
    Create a sample tracking data frame showing player positions.
    """
    # Sample frame: typical passing play at the snap

    players = pd.DataFrame({
        'nflId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,  # Offense
                  12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22],  # Defense
        'team': ['offense'] * 11 + ['defense'] * 11,
        'position': ['QB', 'RB', 'WR', 'WR', 'WR', 'TE',
                     'LT', 'LG', 'C', 'RG', 'RT',
                     'DE', 'DT', 'DT', 'DE', 'LB', 'LB', 'LB',
                     'CB', 'CB', 'S', 'S'],
        'x': [25, 23, 25, 25, 25, 26,  # Offense x
              27, 27, 27, 27, 27,  # OL at line of scrimmage
              28, 28, 28, 28, 30, 30, 31,  # Defense front 7
              25, 25, 35, 35],  # Secondary
        'y': [26.65, 24, 5, 48, 15, 35,  # Offense y (across width)
              22, 24, 26.65, 29, 31,  # OL
              33, 25, 28, 20, 22, 28, 35,  # Front 7
              3, 50, 20, 33],  # Secondary
        's': [0, 0, 0, 0, 0, 0,  # At snap, offense stationary
              0, 0, 0, 0, 0,
              0.5, 0.3, 0.3, 0.5, 0.2, 0.2, 0.1,  # Defense slightly moving
              0.1, 0.1, 0.1, 0.1]
    })

    return players


def calculate_distances(tracking_frame: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate distances between offensive and defensive players.

    Parameters
    ----------
    tracking_frame : pd.DataFrame
        Single frame of tracking data

    Returns
    -------
    pd.DataFrame
        Distance matrix between key players
    """
    offense = tracking_frame[tracking_frame['team'] == 'offense']
    defense = tracking_frame[tracking_frame['team'] == 'defense']

    # Example: distance from each WR to nearest defender
    receivers = offense[offense['position'] == 'WR']
    defenders = defense[defense['position'].isin(['CB', 'S'])]

    separations = []
    for _, wr in receivers.iterrows():
        min_dist = float('inf')
        for _, db in defenders.iterrows():
            dist = np.sqrt((wr['x'] - db['x'])**2 + (wr['y'] - db['y'])**2)
            min_dist = min(min_dist, dist)
        separations.append({
            'receiver_id': wr['nflId'],
            'separation': min_dist
        })

    return pd.DataFrame(separations)


# Demonstration
frame = create_sample_tracking_frame()
print("Sample Tracking Frame (at snap):")
print(frame[['position', 'team', 'x', 'y', 's']].to_string(index=False))

separations = calculate_distances(frame)
print("\n\nReceiver Separations:")
print(separations)
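The nested iterrows loop above is fine for a demo, but at tracking-data scale nearest-defender distances should be vectorized. A sketch with NumPy broadcasting (coordinates invented):

```python
import numpy as np

# (n_receivers, 2) and (n_defenders, 2) coordinate arrays
wr_xy = np.array([[25.0, 5.0], [25.0, 48.0]])
db_xy = np.array([[25.0, 3.0], [25.0, 50.0], [35.0, 20.0]])

# Pairwise distance matrix via broadcasting: shape (n_receivers, n_defenders)
dists = np.linalg.norm(wr_xy[:, None, :] - db_xy[None, :, :], axis=2)

# Nearest defender per receiver
nearest = dists.min(axis=1)
print(nearest)  # [2. 2.]
```

The broadcasting version computes every pairwise distance in one pass, which matters when a single game has millions of frames.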

2.3.4 Tracking Data Availability

Tracking data availability is limited compared to play-by-play:

| Source | Scope | Access |
|--------|-------|--------|
| NFL Teams | All games, all seasons | Team only (proprietary) |
| Next Gen Stats | Aggregated metrics | Public (website/API) |
| Big Data Bowl | 50-100 games per year | Public (Kaggle) |
| Research partnerships | Varies | Academic/corporate agreements |

For most analysts, the Big Data Bowl is the primary source of raw tracking data. Each year's competition focuses on different aspects of the game (rushing, pass coverage, special teams, etc.).
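Big Data Bowl releases ship as flat CSVs keyed by gameId and playId, so the first step in most workflows is joining tracking frames to play context. A sketch with in-memory stand-ins for the files (column names follow recent competitions; values invented):

```python
import io
import pandas as pd

# Stand-ins for the competition's plays and tracking CSV files
plays = pd.read_csv(io.StringIO(
    "gameId,playId,passResult\n"
    "2023091000,42,C\n"
))
tracking = pd.read_csv(io.StringIO(
    "gameId,playId,frameId,nflId,x,y\n"
    "2023091000,42,1,12345,25.5,23.1\n"
    "2023091000,42,2,12345,25.8,23.3\n"
))

# Attach play-level context to every tracking frame
merged = tracking.merge(plays, on=['gameId', 'playId'], how='left')
print(merged.shape)  # (2, 7)
```

With real files, replace the StringIO buffers with paths to the downloaded CSVs; the merge keys are the same.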

2.3.5 Limitations of Tracking Data

Despite its richness, tracking data has limitations:

Doesn't Capture Everything

  • No ball tracking in current public data
  • Can't see player eye movement or hand placement
  • Doesn't record pre-snap communication

Computational Challenges

  • Massive data volume (2-3 million rows per game)
  • Requires specialized processing tools
  • Storage and memory constraints
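
The memory pressure is manageable with standard pandas tricks. A sketch that downcasts floats and converts repeated strings to categoricals (a common convention, not something the data requires):

```python
import numpy as np
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Reduce memory by downcasting float64 -> float32 and
    converting repeated string columns to categoricals."""
    out = df.copy()
    for col in out.select_dtypes(include='float64').columns:
        out[col] = out[col].astype('float32')
    for col in out.select_dtypes(include='object').columns:
        out[col] = out[col].astype('category')
    return out

# Toy frame standing in for a tracking file
df = pd.DataFrame({
    'x': np.random.rand(1000) * 120,
    'y': np.random.rand(1000) * 53.3,
    'displayName': ['Player A', 'Player B'] * 500,
})
small = shrink(df)
print(small.memory_usage(deep=True).sum() < df.memory_usage(deep=True).sum())  # True
```

Tracking coordinates rarely need float64 precision, and name/team columns repeat the same few values millions of times, so these two conversions alone often cut memory by more than half.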

Interpretation Challenges

  • Raw coordinates need context
  • "Separation" depends on how you measure it
  • Quarterback decisions aren't directly observable

2.4 Other Data Sources

2.4.1 Pro Football Reference

Pro Football Reference (PFR) provides historical statistics and game logs. Key features:

  • Coverage: Statistics from 1920 to present
  • Granularity: Game-level and season-level statistics
  • Special data: Draft history, combine results, snap counts
  • Access: Web scraping (terms of service apply)
"""
Example: Scraping Pro Football Reference
(For demonstration purposes - respect rate limits and ToS)
"""

import pandas as pd

# PFR provides HTML tables that pandas can read
# Example URL structure for player game logs:
# https://www.pro-football-reference.com/players/M/MahoPa00/gamelog/2023/

def get_pfr_player_gamelog(player_code: str, season: int) -> pd.DataFrame:
    """
    Fetch player game log from Pro Football Reference.

    Parameters
    ----------
    player_code : str
        PFR player code (e.g., 'MahoPa00' for Patrick Mahomes)
    season : int
        Season year

    Returns
    -------
    pd.DataFrame
        Game log data

    Notes
    -----
    This is a simplified example. Production code should:
    - Implement rate limiting
    - Handle errors gracefully
    - Cache results
    - Respect robots.txt
    """
    # The directory letter is the first letter of the player code
    url = f"https://www.pro-football-reference.com/players/{player_code[0]}/{player_code}/gamelog/{season}/"

    # In practice:
    # tables = pd.read_html(url)
    # gamelog = tables[0]  # Usually first table

    # Placeholder return for demonstration
    print(f"Would fetch: {url}")
    return pd.DataFrame()

2.4.2 NFL Next Gen Stats

Next Gen Stats provides aggregated tracking-derived metrics:

Available Metrics:

  • Completion probability over expected
  • Separation at catch point
  • Time to throw
  • Rushing yards over expected
  • Pass rush win rate

Access:

  • Website: nextgenstats.nfl.com
  • Limited API access

2.4.3 Pro Football Focus (PFF)

PFF employs analysts to watch every play and grade player performance:

Products:

  • Player grades (0-100 scale)
  • Detailed play-level data
  • Position-specific metrics

Considerations:

  • Subjective element in grading
  • Expensive subscription
  • Widely used by NFL teams

2.4.4 Contract and Transaction Data

For roster construction analysis:

  • Spotrac: Salary cap and contract details
  • Over The Cap: Contract tracking, cap projections
  • NFL Transaction Wire: Real-time transaction tracking

2.5 Data Access with Python

2.5.1 The nfl_data_py Package

The nfl_data_py package is the primary tool for accessing NFL data in Python:

"""
nfl_data_py Comprehensive Guide

This module demonstrates all major data loading functions.
"""

import nfl_data_py as nfl
import pandas as pd
from typing import List, Optional

# -----------------------------------------------------------------------------
# PLAY-BY-PLAY DATA
# -----------------------------------------------------------------------------

def load_pbp(seasons: List[int], columns: Optional[List[str]] = None) -> pd.DataFrame:
    """
    Load play-by-play data with optional column filtering.

    Parameters
    ----------
    seasons : List[int]
        Seasons to load (e.g., [2022, 2023])
    columns : List[str], optional
        Specific columns to load (reduces memory)

    Returns
    -------
    pd.DataFrame
        Play-by-play data
    """
    pbp = nfl.import_pbp_data(seasons, columns=columns)
    return pbp


# Common column subsets for different analyses
PASSING_COLUMNS = [
    'game_id', 'play_id', 'posteam', 'defteam', 'down', 'ydstogo',
    'yardline_100', 'pass', 'complete_pass', 'incomplete_pass',
    'passer_player_id', 'passer_player_name', 'receiver_player_id',
    'receiver_player_name', 'yards_gained', 'air_yards', 'yards_after_catch',
    'epa', 'wpa', 'cpoe', 'qb_epa'
]

RUSHING_COLUMNS = [
    'game_id', 'play_id', 'posteam', 'defteam', 'down', 'ydstogo',
    'yardline_100', 'rush', 'rusher_player_id', 'rusher_player_name',
    'yards_gained', 'epa', 'wpa', 'success'
]

# -----------------------------------------------------------------------------
# SEASONAL PLAYER STATISTICS
# -----------------------------------------------------------------------------

def load_seasonal_stats(
    seasons: List[int],
    stat_type: str = 'passing'
) -> pd.DataFrame:
    """
    Load seasonal player statistics.

    Parameters
    ----------
    seasons : List[int]
        Seasons to load
    stat_type : str
        Type: 'passing', 'rushing', 'receiving', 'defense'

    Returns
    -------
    pd.DataFrame
        Seasonal statistics
    """
    stats = nfl.import_seasonal_data(seasons)

    # Filter by stat type if needed
    if stat_type == 'passing':
        stats = stats[stats['attempts'] > 0]
    elif stat_type == 'rushing':
        stats = stats[stats['carries'] > 0]
    elif stat_type == 'receiving':
        stats = stats[stats['targets'] > 0]

    return stats


# -----------------------------------------------------------------------------
# ROSTER AND PLAYER DATA
# -----------------------------------------------------------------------------

def load_rosters(seasons: List[int]) -> pd.DataFrame:
    """Load roster data with player information."""
    return nfl.import_rosters(seasons)


def load_players() -> pd.DataFrame:
    """Load player ID mapping table."""
    return nfl.import_ids()


# -----------------------------------------------------------------------------
# ADDITIONAL DATA
# -----------------------------------------------------------------------------

def load_schedules(seasons: List[int]) -> pd.DataFrame:
    """Load game schedules with results."""
    return nfl.import_schedules(seasons)


def load_draft_picks(seasons: List[int]) -> pd.DataFrame:
    """Load draft pick data."""
    return nfl.import_draft_picks(seasons)


def load_combine_data(seasons: List[int]) -> pd.DataFrame:
    """Load NFL Combine results."""
    return nfl.import_combine_data(seasons)


def load_injuries(seasons: List[int]) -> pd.DataFrame:
    """Load injury report data."""
    return nfl.import_injuries(seasons)


# -----------------------------------------------------------------------------
# DEMONSTRATION
# -----------------------------------------------------------------------------

if __name__ == "__main__":
    print("NFL Data Loading Examples")
    print("=" * 50)

    # Example: Load 2023 passing plays only
    print("\nLoading 2023 passing data...")
    # pbp = load_pbp([2023], PASSING_COLUMNS)
    # pass_plays = pbp[pbp['pass'] == 1].copy()
    # print(f"Loaded {len(pass_plays):,} passing plays")

    # Example: Load seasonal QB stats
    print("\nLoading seasonal statistics...")
    # qb_stats = load_seasonal_stats([2022, 2023], 'passing')
    # print(f"Loaded {len(qb_stats)} player-seasons")

    print("\nAvailable functions:")
    print("  - nfl.import_pbp_data(seasons)")
    print("  - nfl.import_seasonal_data(seasons)")
    print("  - nfl.import_rosters(seasons)")
    print("  - nfl.import_schedules(seasons)")
    print("  - nfl.import_draft_picks(seasons)")
    print("  - nfl.import_combine_data(seasons)")
    print("  - nfl.import_injuries(seasons)")

2.5.2 Caching and Data Management

For efficiency, implement caching:

"""
Data Caching Strategies

Efficient data loading with local caching.
"""

import os
import pandas as pd
import nfl_data_py as nfl
from pathlib import Path
from typing import List, Optional
from datetime import datetime, timedelta


class NFLDataCache:
    """
    Manages cached NFL data to avoid repeated downloads.

    Attributes
    ----------
    cache_dir : Path
        Directory for cached files
    max_age_days : int
        Maximum age before refreshing cache
    """

    def __init__(
        self,
        cache_dir: str = './data/cache',
        max_age_days: int = 7
    ):
        """
        Initialize cache manager.

        Parameters
        ----------
        cache_dir : str
            Directory for cache files
        max_age_days : int
            Days before cache expires
        """
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.max_age_days = max_age_days

    def _cache_path(self, data_type: str, seasons: List[int]) -> Path:
        """Generate cache file path."""
        season_str = '_'.join(map(str, sorted(seasons)))
        return self.cache_dir / f"{data_type}_{season_str}.parquet"

    def _is_cache_valid(self, path: Path) -> bool:
        """Check if cache file is still valid."""
        if not path.exists():
            return False

        file_time = datetime.fromtimestamp(path.stat().st_mtime)
        age = datetime.now() - file_time
        return age < timedelta(days=self.max_age_days)

    def load_pbp(
        self,
        seasons: List[int],
        columns: Optional[List[str]] = None,
        force_refresh: bool = False
    ) -> pd.DataFrame:
        """
        Load play-by-play data with caching.

        Parameters
        ----------
        seasons : List[int]
            Seasons to load
        columns : List[str], optional
            Columns to select
        force_refresh : bool
            Force re-download even if cache exists

        Returns
        -------
        pd.DataFrame
            Play-by-play data
        """
        cache_path = self._cache_path('pbp', seasons)

        if not force_refresh and self._is_cache_valid(cache_path):
            print(f"Loading from cache: {cache_path}")
            df = pd.read_parquet(cache_path)
        else:
            print(f"Downloading play-by-play for {seasons}...")
            df = nfl.import_pbp_data(seasons)
            df.to_parquet(cache_path)
            print(f"Cached to: {cache_path}")

        if columns:
            available = [c for c in columns if c in df.columns]
            df = df[available]

        return df

    def clear_cache(self) -> int:
        """Clear all cached files. Returns number deleted."""
        count = 0
        for file in self.cache_dir.glob('*.parquet'):
            file.unlink()
            count += 1
        return count


# Usage example
# cache = NFLDataCache('./data/cache')
# pbp = cache.load_pbp([2023])

2.6 Building a Data Pipeline

2.6.1 Pipeline Architecture

A robust analytics pipeline separates concerns:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Source    │────▶│   Landing   │────▶│  Processed  │
│    APIs     │     │    Zone     │     │    Zone     │
└─────────────┘     └─────────────┘     └─────────────┘
                          │                    │
                          │                    │
                          ▼                    ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Archive   │     │  Analytics  │
                    │   Storage   │     │    Ready    │
                    └─────────────┘     └─────────────┘

Source Layer: APIs and data providers
Landing Zone: Raw data as received
Processed Zone: Cleaned, standardized data
Analytics Ready: Aggregated, feature-engineered datasets

2.6.2 Implementation Example

"""
NFL Data Pipeline

A simple but complete data pipeline for football analytics.
"""

import pandas as pd
import numpy as np
import nfl_data_py as nfl
from pathlib import Path
from typing import List, Dict, Optional
from dataclasses import dataclass
from datetime import datetime
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class PipelineConfig:
    """Configuration for data pipeline."""
    data_dir: str = './data'
    seasons: Optional[List[int]] = None
    refresh_interval_days: int = 7

    def __post_init__(self):
        if self.seasons is None:
            self.seasons = [2023]


class NFLDataPipeline:
    """
    Complete data pipeline for NFL analytics.

    This class handles data extraction, transformation, and loading
    with built-in caching and validation.
    """

    def __init__(self, config: PipelineConfig):
        """Initialize pipeline with configuration."""
        self.config = config
        self.data_dir = Path(config.data_dir)

        # Create directory structure
        (self.data_dir / 'raw').mkdir(parents=True, exist_ok=True)
        (self.data_dir / 'processed').mkdir(parents=True, exist_ok=True)
        (self.data_dir / 'features').mkdir(parents=True, exist_ok=True)

        self.data: Dict[str, pd.DataFrame] = {}

    def extract_pbp(self) -> pd.DataFrame:
        """Extract play-by-play data from source."""
        logger.info(f"Extracting PBP for seasons: {self.config.seasons}")
        pbp = nfl.import_pbp_data(self.config.seasons)
        self.data['pbp_raw'] = pbp

        # Save to landing zone
        output_path = self.data_dir / 'raw' / 'pbp_raw.parquet'
        pbp.to_parquet(output_path)
        logger.info(f"Saved {len(pbp):,} plays to {output_path}")

        return pbp

    def transform_pbp(self, pbp: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        """
        Transform raw PBP data into clean format.

        Transformations:
        1. Filter to real plays (no timeouts, end of quarter)
        2. Create derived fields
        3. Standardize naming
        4. Handle missing values
        """
        if pbp is None:
            pbp = self.data.get('pbp_raw')
            if pbp is None:
                raise ValueError("No PBP data available. Run extract_pbp first.")

        logger.info("Transforming PBP data...")

        # Filter to real plays
        real_plays = pbp[
            (pbp['play_type'].isin(['pass', 'run', 'punt', 'field_goal', 'kickoff'])) |
            (pbp['rush'] == 1) |
            (pbp['pass'] == 1)
        ].copy()

        # Create derived fields
        real_plays['is_pass'] = (real_plays['pass'] == 1).astype(int)
        real_plays['is_rush'] = (real_plays['rush'] == 1).astype(int)
        real_plays['is_success'] = (real_plays['epa'] > 0).astype(int)

        # Score context
        real_plays['score_diff'] = real_plays['posteam_score'] - real_plays['defteam_score']
        real_plays['is_close_game'] = (abs(real_plays['score_diff']) <= 8).astype(int)

        # Time context
        real_plays['is_garbage_time'] = (
            (real_plays['game_seconds_remaining'] < 300) &
            (abs(real_plays['score_diff']) > 16)
        ).astype(int)

        # Store
        self.data['pbp_clean'] = real_plays

        # Save to processed zone
        output_path = self.data_dir / 'processed' / 'pbp_clean.parquet'
        real_plays.to_parquet(output_path)
        logger.info(f"Saved {len(real_plays):,} clean plays to {output_path}")

        return real_plays

    def create_player_features(self, pbp: pd.DataFrame = None) -> pd.DataFrame:
        """
        Create player-level aggregated features.

        Returns
        -------
        pd.DataFrame
            Player statistics aggregated by season
        """
        if pbp is None:
            pbp = self.data.get('pbp_clean')
            if pbp is None:
                raise ValueError("No clean PBP data available. Run transform_pbp first.")

        logger.info("Creating player features...")

        # Passing stats
        pass_plays = pbp[pbp['is_pass'] == 1]

        passer_stats = pass_plays.groupby(
            ['season', 'passer_player_id', 'passer_player_name']
        ).agg({
            'play_id': 'count',
            'complete_pass': 'sum',
            'yards_gained': 'sum',
            'air_yards': 'sum',
            'epa': 'sum',
            'is_success': 'mean',
            'cpoe': 'mean'
        }).reset_index()

        passer_stats.columns = [
            'season', 'player_id', 'player_name',
            'dropbacks', 'completions', 'pass_yards', 'air_yards',
            'total_epa', 'success_rate', 'cpoe'
        ]

        # Note: 'dropbacks' counts all plays with pass == 1, which in nflfastR
        # includes sacks and scrambles, so completion_pct here is completions
        # per dropback rather than per pass attempt
        passer_stats['epa_per_dropback'] = passer_stats['total_epa'] / passer_stats['dropbacks']
        passer_stats['completion_pct'] = passer_stats['completions'] / passer_stats['dropbacks']

        # Store
        self.data['passer_features'] = passer_stats

        # Save to features zone
        output_path = self.data_dir / 'features' / 'passer_features.parquet'
        passer_stats.to_parquet(output_path)
        logger.info(f"Saved features for {len(passer_stats)} passers")

        return passer_stats

    def run_full_pipeline(self) -> Dict[str, pd.DataFrame]:
        """Run complete pipeline from extract to features."""
        logger.info("Starting full pipeline run...")

        # Extract
        pbp = self.extract_pbp()

        # Transform
        pbp_clean = self.transform_pbp(pbp)

        # Create features
        passer_features = self.create_player_features(pbp_clean)

        logger.info("Pipeline complete!")

        return {
            'pbp_raw': pbp,
            'pbp_clean': pbp_clean,
            'passer_features': passer_features
        }

    def validate_data(self) -> Dict[str, bool]:
        """Run validation checks on processed data."""
        validations = {}

        if 'pbp_clean' in self.data:
            pbp = self.data['pbp_clean']

            # Check for expected columns
            required_cols = ['game_id', 'play_id', 'epa', 'posteam', 'defteam']
            validations['has_required_columns'] = all(c in pbp.columns for c in required_cols)

            # Check for reasonable EPA values
            validations['epa_in_range'] = pbp['epa'].between(-10, 10).mean() > 0.99

            # Check for non-empty teams
            validations['no_empty_teams'] = pbp['posteam'].notna().mean() > 0.99

        return validations


# Usage
if __name__ == "__main__":
    config = PipelineConfig(seasons=[2023])
    pipeline = NFLDataPipeline(config)

    # Run pipeline
    # results = pipeline.run_full_pipeline()

    # Validate
    # validations = pipeline.validate_data()
    # print("Validation results:", validations)

2.7 Chapter Summary

Key Concepts

  1. Data categories range from aggregated statistics to frame-level tracking data, each with different use cases and accessibility.

  2. Play-by-play data is the foundation of football analytics, recording events and outcomes for every play with derived metrics like EPA and WPA.

  3. Tracking data captures precise player locations at 10 Hz, enabling spatial analysis invisible to event data, but access is limited.

  4. Data quality issues, including missing values, inconsistent recording, and model dependencies, must be understood and addressed.

  5. nfl_data_py provides Python access to the major public NFL data sources.

  6. Data pipelines separate extraction, transformation, and loading for maintainable analytics workflows.
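Concept 2 can be seen in miniature: "success" in the EPA framework is simply a play with positive expected points added, so success rate is the mean of that indicator. The sketch below uses a toy, hand-made play-by-play fragment (illustrative values, not real game data) to show the same `is_success` construction the pipeline code uses.

```python
import pandas as pd

# Toy play-by-play fragment -- illustrative EPA values only
plays = pd.DataFrame({
    "epa": [0.45, -0.30, 1.10, -0.85],
    "pass": [1, 0, 1, 1],
})

# A play "succeeds" when it adds expected points (EPA > 0)
plays["is_success"] = (plays["epa"] > 0).astype(int)

# Success rate is the share of successful plays
success_rate = plays["is_success"].mean()  # 2 of 4 plays -> 0.5
```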

Key Takeaways for Practice

  • Always understand your data's provenance and limitations before analysis
  • Use caching to avoid repeated downloads of large datasets
  • Filter to relevant subsets early to manage memory
  • Validate data quality as part of your pipeline
  • Document data processing decisions for reproducibility

Data Source Decision Framework

What data do I need?

├── Historical patterns across many seasons?
│   └── Use nfl_data_py play-by-play (1999-present)
│
├── Current season statistics?
│   └── Use nfl_data_py or NFL API
│
├── Spatial/movement analysis?
│   ├── Aggregated metrics only?
│   │   └── Use Next Gen Stats website
│   └── Raw tracking data?
│       └── Use Big Data Bowl data (limited games)
│
├── Player grades/PFF metrics?
│   └── Requires PFF subscription
│
└── Contract/salary data?
    └── Use Spotrac or Over The Cap

What's Next

In Chapter 3: Python for Football Analytics, we dive deeper into the programming tools and patterns that make efficient football analysis possible. You'll learn to write reusable functions, process large datasets efficiently, and build the coding foundations for everything that follows.

Before moving on, complete the exercises and quiz to ensure you can navigate the NFL data ecosystem confidently.


Chapter 2 Exercises → exercises.md

Chapter 2 Quiz → quiz.md

Case Study: Building a Team Stats Dashboard → case-study-01.md


The analyst who understands their data's limitations will always outperform the one who treats data as truth.