Exercises: The NFL Data Ecosystem
These exercises build practical skills for working with NFL data sources. Estimated completion time: 4 hours.
Scoring Guide:

- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)
Part A: Conceptual Understanding ⭐
A.1. List the four categories of NFL data discussed in this chapter. For each, provide one example use case.
A.2. Explain the difference between play-by-play data and tracking data. What questions can tracking data answer that play-by-play cannot?
A.3. What is Expected Points Added (EPA)? Write the formula and explain what positive and negative values indicate.
A.4. Why is tracking data not fully available to the public? What are the implications for football analytics research?
A.5. Describe three data quality issues that commonly affect play-by-play data. For each, explain how you might detect or address it.
A.6. What is the difference between the game_id and play_id fields in play-by-play data? Why are both necessary?
A.7. Explain why caching is important when working with NFL data. What are the tradeoffs between cache freshness and performance?
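As a concrete anchor for the freshness/performance tradeoff, here is a minimal sketch of a cache-freshness check. `is_cache_fresh` and the one-hour threshold are illustrative inventions, not part of nfl_data_py or any library:

```python
import os
import tempfile
import time

def is_cache_fresh(path: str, max_age_seconds: float) -> bool:
    """Return True if the cached file exists and is newer than max_age_seconds."""
    if not os.path.exists(path):
        return False
    age = time.time() - os.path.getmtime(path)
    return age < max_age_seconds

# A just-written file is fresh under a 1-hour policy.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"cached data")
    cache_path = f.name

print(is_cache_fresh(cache_path, max_age_seconds=3600))    # True
print(is_cache_fresh("/nonexistent/cache.parquet", 3600))  # False
```

A longer `max_age_seconds` means fewer downloads but staler data; historical seasons can safely use a very long window, while in-season data cannot.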
A.8. What does the GSIS ID represent, and why is it important for linking data across sources?
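For context on A.8: linking across sources is ultimately a key join on the player ID. The sketch below uses invented players and EPA values (real GSIS IDs follow the `00-0033873` pattern), with hypothetical frames standing in for play-by-play and roster data:

```python
import pandas as pd

# Toy stand-ins: IDs, names, and EPA values are invented for illustration.
passing = pd.DataFrame({
    "passer_player_id": ["00-0000001", "00-0000002"],
    "epa": [0.25, -0.10],
})
roster = pd.DataFrame({
    "gsis_id": ["00-0000001", "00-0000002"],
    "player_name": ["QB One", "QB Two"],
})

# The GSIS ID is the join key that connects per-play stats to roster metadata.
merged = passing.merge(roster, left_on="passer_player_id", right_on="gsis_id")
print(merged[["player_name", "epa"]])
```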
Part B: Data Exploration ⭐⭐
B.1. Using nfl_data_py, load play-by-play data for the 2023 season. Answer the following:
- How many total rows are in the dataset?
- How many unique games are represented?
- What is the breakdown of play_type values?
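As a starting point, all three summaries reduce to one-line pandas calls. The real load is `nfl.import_pbp_data([2023])` (shown commented out, since it downloads data); a tiny invented frame stands in so the snippet runs offline:

```python
import pandas as pd

# With network access:
#   import nfl_data_py as nfl
#   pbp = nfl.import_pbp_data([2023])
# Here a toy frame with nflfastR-style game_ids stands in for the real data.
pbp = pd.DataFrame({
    "game_id": ["2023_01_DET_KC", "2023_01_DET_KC", "2023_01_BUF_NYJ"],
    "play_type": ["pass", "run", "pass"],
})

print(len(pbp))                         # total rows
print(pbp["game_id"].nunique())         # unique games
print(pbp["play_type"].value_counts())  # play_type breakdown
```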
B.2. Calculate the league-wide EPA per play for passing plays vs. rushing plays in 2023. Which is higher? By how much?
B.3. Identify the top 5 quarterbacks by total EPA in 2023 (minimum 200 pass attempts). Display their names and EPA totals.
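One pattern hint for B.3 (not a full solution): group by passer, aggregate count and sum, filter on the attempt minimum, then sort. Names and EPA values below are invented, and the 2-attempt cutoff stands in for the 200-attempt minimum:

```python
import pandas as pd

# Toy pass plays; passer names and EPA values are invented.
passes = pd.DataFrame({
    "passer_player_name": ["A"] * 3 + ["B"] * 2 + ["C"] * 1,
    "epa": [0.5, 0.2, -0.1, 0.4, 0.4, 2.0],
})

summary = (passes.groupby("passer_player_name")["epa"]
           .agg(attempts="count", total_epa="sum")
           .query("attempts >= 2")  # stand-in for the 200-attempt cutoff
           .sort_values("total_epa", ascending=False))
print(summary)  # C is excluded despite the highest per-play EPA
```

Note how the minimum-attempts filter removes the small-sample outlier, which is exactly why B.3 imposes one.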
B.4. For a single game of your choice, trace the Expected Points values through each play. Create a line chart showing EP over the course of the game.
B.5. Compare the distribution of yards_gained for passes vs. rushes. Calculate:
- Mean, median, standard deviation for each
- The 90th percentile for each
- Create overlapping histograms
B.6. Investigate missing values in the 2023 play-by-play data:
- Which columns have the most missing values?
- Why might air_yards be null for some passing plays?
- What percentage of receiver_player_name values are missing on pass plays?
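A hint for the missingness audit: `isna()` composed with `mean()` gives per-column missing shares directly. The toy frame below is invented; it mimics how `air_yards` is null on non-pass rows and on some pass rows in the real data:

```python
import numpy as np
import pandas as pd

# Toy frame: air_yards is null on the run row and on one pass row.
pbp = pd.DataFrame({
    "play_type": ["pass", "pass", "run"],
    "air_yards": [12.0, np.nan, np.nan],
    "receiver_player_name": ["WR One", np.nan, np.nan],
})

# Columns ranked by share of missing values
print(pbp.isna().mean().sort_values(ascending=False))

# Missing receiver names, restricted to pass plays only
pass_plays = pbp[pbp["play_type"] == "pass"]
print(pass_plays["receiver_player_name"].isna().mean())
```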
Part C: Programming Challenges ⭐⭐-⭐⭐⭐
C.1. Data Loader with Validation ⭐⭐
Create a function that loads play-by-play data and performs basic validation checks.
```python
def load_and_validate_pbp(seasons: list) -> tuple:
    """
    Load PBP data and run validation checks.

    Parameters
    ----------
    seasons : list
        Seasons to load

    Returns
    -------
    tuple
        (DataFrame, validation_report dict)

    The validation_report should include:
    - total_rows: int
    - total_games: int
    - seasons_found: list
    - missing_epa_pct: float
    - play_type_distribution: dict
    - data_quality_warnings: list
    """
    # Your code here
    pass
```
C.2. Caching Decorator ⭐⭐
Create a decorator that caches function results to disk.
```python
def cache_to_parquet(cache_dir: str = './cache'):
    """
    Decorator that caches DataFrame results to parquet files.

    Usage:
        @cache_to_parquet('./cache')
        def load_data(season):
            return nfl.import_pbp_data([season])
    """
    # Your code here
    pass
```
C.3. Play-by-Play Filter Builder ⭐⭐⭐
Create a class that builds complex filters for play-by-play data.
```python
import pandas as pd

class PBPFilter:
    """
    Builder pattern for filtering play-by-play data.

    Usage:
        filtered = (PBPFilter(pbp)
                    .plays_only()
                    .passing()
                    .regular_season()
                    .exclude_garbage_time()
                    .build())
    """
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()

    def plays_only(self):
        """Filter to actual plays (exclude timeouts, etc.)."""
        # Your code here
        return self

    def passing(self):
        """Filter to passing plays only."""
        # Your code here
        return self

    def rushing(self):
        """Filter to rushing plays only."""
        # Your code here
        return self

    def regular_season(self):
        """Exclude playoffs and preseason."""
        # Your code here
        return self

    def exclude_garbage_time(self):
        """Exclude plays in garbage time."""
        # Your code here
        return self

    def min_dropbacks(self, player_col: str, n: int):
        """Keep only players with at least n dropbacks."""
        # Your code here
        return self

    def build(self) -> pd.DataFrame:
        """Return the filtered DataFrame."""
        return self.df
```
C.4. Tracking Data Processor ⭐⭐⭐
Create functions to process tracking data coordinates.
```python
import pandas as pd

def calculate_separation(
    tracking_frame: pd.DataFrame,
    receiver_id: int,
    defender_ids: list,
) -> float:
    """
    Calculate minimum separation between receiver and defenders.

    Parameters
    ----------
    tracking_frame : pd.DataFrame
        Single frame of tracking data
    receiver_id : int
        NFL ID of the receiver
    defender_ids : list
        NFL IDs of defenders to check

    Returns
    -------
    float
        Minimum distance in yards
    """
    # Your code here
    pass
```
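One common approach (not the only one) is a vectorized Euclidean distance. The sketch below is a simplified stand-in for `calculate_separation`, using Big Data Bowl-style `nflId`/`x`/`y` columns and invented coordinates chosen to form a 3-4-5 triangle:

```python
import numpy as np
import pandas as pd

# One toy frame of tracking data; player IDs and coordinates are invented.
frame = pd.DataFrame({
    "nflId": [100, 200, 300],
    "x": [50.0, 53.0, 60.0],
    "y": [20.0, 24.0, 20.0],
})

def min_separation(frame, receiver_id, defender_ids):
    """Minimum Euclidean distance (yards) from receiver to any listed defender."""
    rec = frame.loc[frame["nflId"] == receiver_id, ["x", "y"]].to_numpy()[0]
    defs = frame.loc[frame["nflId"].isin(defender_ids), ["x", "y"]].to_numpy()
    return float(np.min(np.hypot(defs[:, 0] - rec[0], defs[:, 1] - rec[1])))

print(min_separation(frame, 100, [200, 300]))  # 5.0 (vs 10.0 to the far defender)
```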
```python
def calculate_player_speed(
    tracking_data: pd.DataFrame,
    player_id: int,
    frame_start: int,
    frame_end: int,
) -> dict:
    """
    Calculate speed statistics for a player over a range of frames.

    Returns
    -------
    dict
        {'max_speed': float, 'avg_speed': float, 'distance': float}
    """
    # Your code here
    pass
```
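Speed can be recovered from frame-to-frame displacement. This sketch assumes the Big Data Bowl's 10-frames-per-second sampling and an invented straight-line track; note that the real tracking files also ship precomputed `s` (speed) and `a` (acceleration) columns, which are usually preferable to rederiving them:

```python
import numpy as np
import pandas as pd

FPS = 10.0  # Big Data Bowl tracking data is sampled at 10 frames per second

# Toy track for one player: moves 1 yard in x per frame, constant y.
track = pd.DataFrame({
    "frameId": [1, 2, 3, 4],
    "x": [10.0, 11.0, 12.0, 13.0],
    "y": [5.0, 5.0, 5.0, 5.0],
})

def speed_stats(track):
    """Speed (yards/sec) and distance from frame-to-frame displacement."""
    dx = track["x"].diff().to_numpy()[1:]
    dy = track["y"].diff().to_numpy()[1:]
    step = np.hypot(dx, dy)  # yards covered per frame
    speeds = step * FPS      # yards per second
    return {"max_speed": float(speeds.max()),
            "avg_speed": float(speeds.mean()),
            "distance": float(step.sum())}

print(speed_stats(track))
```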
C.5. Complete Data Pipeline ⭐⭐⭐
Implement a full ETL pipeline class that:

- Extracts data from nfl_data_py
- Transforms it (cleaning, derived fields)
- Loads it to processed storage
- Includes logging and error handling
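A minimal skeleton of the shape such a pipeline might take, shown mainly for the logging and error-handling structure. `PBPPipeline` and its method bodies are placeholders (the extract step would call nfl_data_py in practice; here a list of dicts stands in):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nfl_etl")

class PBPPipeline:
    """Skeleton ETL pipeline; each stage body is a placeholder."""

    def run(self, seasons):
        try:
            log.info("extracting seasons %s", seasons)
            raw = self.extract(seasons)
            log.info("transforming %d rows", len(raw))
            clean = self.transform(raw)
            self.load(clean)
            return clean
        except Exception:
            log.exception("pipeline failed")  # log full traceback, then re-raise
            raise

    def extract(self, seasons):
        # In practice: nfl_data_py.import_pbp_data(seasons)
        return [{"season": s} for s in seasons]

    def transform(self, rows):
        # Example cleaning rule: drop seasons before the modern data era
        return [r for r in rows if r["season"] >= 2000]

    def load(self, rows):
        log.info("would write %d rows to processed storage", len(rows))

result = PBPPipeline().run([1999, 2023])
print(result)
```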
Part D: Analysis & Interpretation ⭐⭐-⭐⭐⭐
D.1. Data Source Selection ⭐⭐
For each of the following analysis questions, identify:

- The best data source(s)
- Key fields you would need
- Any limitations of the available data

a) "Which receivers get the most separation from defenders?"
b) "What is the career arc of running back productivity?"
c) "How does home field advantage affect win probability?"
d) "Which offensive line configurations are most effective?"
D.2. Data Quality Investigation ⭐⭐⭐
The 2023 play-by-play data shows some games with unusually high or low EPA totals. Investigate:

- Identify games with total EPA (offense + defense) in the top or bottom 5%
- Are these data quality issues or legitimate outliers?
- What patterns do you observe?
D.3. Cross-Source Validation ⭐⭐⭐
Compare quarterback statistics between nfl_data_py and another source (e.g., Pro Football Reference). For the top 10 QBs by passing yards:

- Do the total yards match exactly?
- If not, what might explain the differences?
- Which source would you trust for different use cases?
Part E: Research & Extension ⭐⭐⭐⭐
E.1. Data Lineage Documentation
Create comprehensive documentation for the nfl_data_py play-by-play data that includes:

- Origin (where does the data ultimately come from?)
- Processing steps (what transformations are applied?)
- Known limitations and biases
- Version history and breaking changes
E.2. Alternative Data Source Evaluation
Research and evaluate one data source not covered in this chapter (e.g., PFF, SIS, TruMedia). Write a report covering:

- Data available and granularity
- Access methods and cost
- Unique features vs. public data
- Use cases where this source would be essential
E.3. Tracking Data Analysis
Using Big Data Bowl data (available on Kaggle), perform an analysis that could not be done with play-by-play data alone. Examples:

- Pre-snap motion effects on play success
- Coverage shell recognition at snap
- Time-to-throw under different pressure situations
Solutions
Selected solutions are available in:
- code/exercise-solutions.py
- appendices/g-answers-to-selected-exercises.md