Further Reading: The NFL Data Ecosystem

Annotated bibliography for deeper exploration of Chapter 2 topics


Essential Technical Resources

nflfastR and nfl_data_py Documentation

nflfastR Official Documentation (nflfastr.com)

The comprehensive guide to the R package that powers most public football analytics. The website includes field descriptions, calculation methodologies, and example analyses. Python users should review this documentation too, since nfl_data_py loads the same underlying data and uses the same field names.

nfl_data_py GitHub Repository

The Python package's source code and examples. Review the README for available functions and check the Issues for known problems and solutions.
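
For orientation, a first load might look like the sketch below. It assumes nfl_data_py is installed and that `import_pbp_data` accepts `years`, `columns`, and `downcast` as the README describes; the `load_pbp` helper and the column subset are illustrative choices, not part of the package.

```python
# Illustrative column subset; names follow the nflfastR field descriptions.
PBP_COLUMNS = [
    "game_id", "play_id", "posteam", "defteam",
    "down", "ydstogo", "yardline_100", "play_type", "epa",
]


def load_pbp(seasons, columns=PBP_COLUMNS):
    """Load play-by-play rows for the given seasons (hypothetical helper)."""
    # Imported lazily so this module can be imported without the package.
    import nfl_data_py as nfl

    # Requesting only the needed columns keeps the DataFrame small;
    # downcast=True converts float64 columns to float32 to save memory.
    return nfl.import_pbp_data(list(seasons), columns=list(columns), downcast=True)
```

Calling `load_pbp([2023])` downloads one season and returns a pandas DataFrame. If the call fails, the repository's Issues are the first place to check whether the signature or the upstream data location has changed.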

Expected Points Methodology

"Building Expected Points Models" - Open Source Football

A detailed walkthrough of how EP and EPA are calculated, including the historical data used for calibration. Essential reading for understanding what these metrics actually measure and their limitations.
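
The relationship between the two metrics can be sketched in a few lines: EPA is the change in expected points over a single play, measured from the perspective of the team that had the ball. The function names and EP values below are hypothetical and are not taken from any fitted model.

```python
def epa(ep_before: float, ep_after: float) -> float:
    """Expected points added, with both values from the offense's perspective."""
    return ep_after - ep_before


def epa_after_possession_change(ep_before: float, opponent_ep_after: float) -> float:
    """EPA when the other team has the ball after the play.

    The opponent's expected points count against the original offense,
    so the sign is flipped before differencing.
    """
    return epa(ep_before, -opponent_ep_after)


# Hypothetical EP values for illustration only (not output from any model).
print(f"{epa(1.0, 2.5):+.2f}")                          # +1.50
print(f"{epa_after_possession_change(1.0, 2.0):+.2f}")  # -3.00
```

The second case shows why turnovers carry such large negative EPA: the offense loses its own expected points and hands the opponent theirs.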

nflfastR EP Model Documentation

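Technical details on the model used to estimate expected points. Note that nflfastR fits this model with gradient-boosted trees (xgboost) rather than the multinomial logistic regression used by the older nflscrapR package, so the documentation describes the input features and model calibration rather than a table of coefficients.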


Data Quality and Limitations

Understanding Data Provenance

"NFL Play-by-Play Data Quirks" - Ben Baldwin

A catalog of known issues in play-by-play data, including inconsistent play classifications, missing values, and changes over time. Essential reading before any serious analysis.

"What the Data Doesn't Tell You" - The Athletic

Discussion of what play-by-play data misses and how this affects analytical conclusions. Includes examples where data-driven conclusions conflicted with film study.


Tracking Data Resources

NFL Big Data Bowl

Kaggle NFL Big Data Bowl Competition Pages (2018-2024)

Each year's competition includes data documentation, starter notebooks, and winning solutions. The focus of the data varies by year (rushing, passing, special teams), and the winning solutions demonstrate state-of-the-art tracking data analysis.

Big Data Bowl Data Dictionaries

Detailed field descriptions for tracking data, including coordinate systems, event tags, and player identification.
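
One common reason to read the coordinate-system section closely is direction normalization: rotating every play so the offense moves toward increasing x. The sketch below assumes a 120-yard x-axis that includes both end zones, a 53.3-yard y-axis, and a play-direction field with the values "left" and "right"; confirm all three against the data dictionary for the year you are using.

```python
FIELD_LENGTH = 120.0  # yards, including both 10-yard end zones (assumed)
FIELD_WIDTH = 53.3    # yards (assumed; check the data dictionary)


def standardize(x: float, y: float, play_direction: str) -> tuple[float, float]:
    """Rotate a tracking point so the offense always moves toward increasing x."""
    if play_direction == "left":
        # A 180-degree rotation about the center of the field flips both axes.
        return FIELD_LENGTH - x, FIELD_WIDTH - y
    if play_direction == "right":
        return x, y
    raise ValueError(f"unexpected play direction: {play_direction!r}")
```

Flipping y along with x keeps each player on the same side relative to the offense; flipping x alone would mirror the formation.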

Next Gen Stats

NFL Next Gen Stats Methodology Guides

Official explanations of how aggregated tracking metrics (completion probability, separation, time to throw) are calculated. Available on the NFL's stats website.


Alternative Data Sources

Pro Football Reference

Sports Reference Data Use Policy

Guidelines for using PFR data, including rate limiting for web scraping. Important to review before building any automated data collection.
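
A minimal way to respect a request limit is to enforce a fixed delay between calls. The `Throttle` class below is a hypothetical helper, and the interval passed to it is a placeholder: take the actual limit from the current policy rather than from this sketch.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one instance per site, for example `throttle = Throttle(min_interval_s=5.0)`, and call `throttle.wait()` immediately before every request so that retries are throttled as well.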

PFR Glossary

Definitions for all statistics on Pro Football Reference, including calculation details. Useful for understanding differences between PFR stats and nflfastR calculations.

Pro Football Focus

PFF Methodology White Papers

Available to subscribers, these documents explain how PFF grades are assigned and what they attempt to measure. Important for understanding the subjective component of PFF data.


Tools and Infrastructure

Data Engineering

"Designing Data-Intensive Applications" by Martin Kleppmann

The definitive guide to data systems design. While not sports-specific, the principles of data pipelines, caching, and storage apply directly to football analytics infrastructure.

Parquet File Format Documentation

Understanding columnar storage formats helps explain why parquet is preferred for large analytical datasets. Review the Apache Parquet specification for details.
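
The core idea can be shown with a toy in-memory comparison. This is only an analogy with made-up values: Parquet's on-disk layout adds row groups, encodings, and compression that the sketch does not model.

```python
# The same three plays stored row-wise and column-wise (made-up values).
rows = [
    {"play_id": 1, "play_type": "pass", "yards_gained": 7},
    {"play_id": 2, "play_type": "run", "yards_gained": 3},
    {"play_id": 3, "play_type": "pass", "yards_gained": 12},
]

columns = {
    "play_id": [1, 2, 3],
    "play_type": ["pass", "run", "pass"],
    "yards_gained": [7, 3, 12],
}

# Row layout: every record is visited even though only one field is needed.
mean_from_rows = sum(r["yards_gained"] for r in rows) / len(rows)

# Column layout: only the one column of interest is read.
yards = columns["yards_gained"]
mean_from_columns = sum(yards) / len(yards)
```

Both computations give the same answer, but analytical queries usually touch a few columns out of hundreds, which is where the columnar layout saves I/O.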

Python Data Stack

Pandas Documentation

The official pandas documentation, including performance tips for working with large DataFrames. Essential reference for efficient data manipulation.

Polars Documentation

A newer, faster alternative to pandas for large datasets. Consider exploring if pandas becomes a bottleneck.


Academic and Research Papers

Data Science in Sports

"A Survey of Machine Learning Approaches for Player and Team Performance Prediction in Football" (2022)

Academic review of how tracking and event data are used in predictive modeling across various football codes. Provides context for NFL-specific applications.

"Measuring Performance in the NFL" - Journal of Quantitative Analysis in Sports

Collection of papers using NFL data for performance analysis. Good examples of rigorous methodology applied to football questions.


Community Resources

Forums and Discussion

r/NFLstatheads (Reddit)

Active community discussing data sources, methodologies, and findings. Good place to ask questions about data quirks.

nflfastR Discord Server

Real-time discussion with package developers and experienced users. Fastest way to get help with data loading issues.

Blogs and Newsletters

Open Source Football (opensourcefootball.com)

Regular tutorials using nflfastR/nfl_data_py data. Excellent for learning practical analysis patterns.

The F5 (thef5.substack.com)

Newsletter covering football analytics with regular data-driven analysis examples.


Reading Schedule Suggestion

Immediately:
- nflfastR EP model documentation
- Big Data Bowl data dictionary for current year

Before Part II:
- "NFL Play-by-Play Data Quirks" article
- Review one Big Data Bowl winning solution

Ongoing Reference:
- Bookmark pandas performance tips
- Keep nflfastR glossary accessible


The best analysts combine understanding of the data with awareness of its limitations. These resources help you develop both.