Further Reading: The NFL Data Ecosystem
Annotated bibliography for deeper exploration of Chapter 2 topics
Essential Technical Resources
nflfastR and nfl_data_py Documentation
nflfastR Official Documentation (nflfastr.com)
The comprehensive guide to the R package that powers most public football analytics. The website includes field descriptions, calculation methodologies, and example analyses. Even Python users should review this documentation, since nfl_data_py exposes the same underlying data.
nfl_data_py GitHub Repository
The Python package's source code and examples. Review the README for available functions and check the Issues for known problems and solutions.
Expected Points Methodology
"Building Expected Points Models" - Open Source Football
A detailed walkthrough of how EP and EPA are calculated, including the historical data used for calibration. Essential reading for understanding what these metrics actually measure and their limitations.
nflfastR EP Model Documentation
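Technical details on the model used to estimate expected points. nflfastR fits a gradient-boosted tree model (XGBoost) that predicts next-score probabilities, rather than the multinomial logistic regression used by the earlier nflscrapR package, so there are no regression coefficients to read off. The documentation covers the features used in the model.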
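As a rough illustration of the quantity these two resources describe, expected points can be read as a probability-weighted sum over the possible next scores, and EPA as the change in that value from before a play to after it. The sketch below is not the nflfastR implementation: the outcome names, point values, and probabilities are illustrative assumptions, and it ignores possession changes, which a real calculation has to handle.

```python
# Point value of each possible next score, from the offense's perspective.
# Outcome names and values are illustrative assumptions, not package output.
NEXT_SCORE_POINTS = {
    "touchdown": 7.0,
    "field_goal": 3.0,
    "safety": 2.0,
    "no_score": 0.0,
    "opp_safety": -2.0,
    "opp_field_goal": -3.0,
    "opp_touchdown": -7.0,
}


def expected_points(probs):
    """Return the probability-weighted sum of next-score point values."""
    unknown = set(probs) - set(NEXT_SCORE_POINTS)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(NEXT_SCORE_POINTS[name] * p for name, p in probs.items())


def epa(probs_before, probs_after):
    """Expected points added: EP after the play minus EP before it.

    Assumes the same team has the ball before and after the play.
    """
    return expected_points(probs_after) - expected_points(probs_before)


# Made-up next-score probabilities before and after a hypothetical play.
BEFORE = {
    "touchdown": 0.30, "field_goal": 0.20, "safety": 0.00, "no_score": 0.10,
    "opp_safety": 0.00, "opp_field_goal": 0.15, "opp_touchdown": 0.25,
}
AFTER = {
    "touchdown": 0.40, "field_goal": 0.25, "safety": 0.00, "no_score": 0.05,
    "opp_safety": 0.00, "opp_field_goal": 0.10, "opp_touchdown": 0.20,
}

if __name__ == "__main__":
    print(round(expected_points(BEFORE), 2))
    print(round(epa(BEFORE, AFTER), 2))
```

Reading the metric this way makes one limitation concrete: EPA is only as good as the estimated probabilities behind it, which is why the model documentation is worth reading before relying on the numbers.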
Data Quality and Limitations
Understanding Data Provenance
"NFL Play-by-Play Data Quirks" - Ben Baldwin
A catalog of known issues in play-by-play data, including inconsistent play classifications, missing values, and changes over time. Essential reading before any serious analysis.
"What the Data Doesn't Tell You" - The Athletic
Discussion of what play-by-play data misses and how this affects analytical conclusions. Includes examples where data-driven conclusions conflicted with film study.
Tracking Data Resources
NFL Big Data Bowl
Kaggle NFL Big Data Bowl Competition Pages (2018-2024)
Each year's competition includes data documentation, starter notebooks, and winning solutions. The focus varies by year (for example rushing, passing, or special teams), and the winning solutions are good models of how experienced practitioners approach tracking data analysis.
Big Data Bowl Data Dictionaries
Detailed field descriptions for tracking data, including coordinate systems, event tags, and player identification.
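One reason the coordinate system matters is that offenses can move in either direction, so analyses often flip plays so that every offense moves the same way. The sketch below assumes a 120-yard x axis (including both end zones), a 53 1/3-yard y axis, and a play-direction value of "left" or "right"; these conventions and the function name are assumptions to check against the data dictionary for the year you are using.

```python
FIELD_LENGTH = 120.0       # yards along x, including both end zones (assumed)
FIELD_WIDTH = 160.0 / 3.0  # yards along y, i.e. 53 1/3 (assumed)


def standardize_position(x, y, play_direction):
    """Flip a tracking coordinate so the offense always moves toward +x.

    Plays moving left are rotated 180 degrees about the center of the field;
    plays moving right are returned unchanged.
    """
    if play_direction == "left":
        return FIELD_LENGTH - x, FIELD_WIDTH - y
    if play_direction == "right":
        return x, y
    raise ValueError(f"unexpected play direction: {play_direction!r}")


if __name__ == "__main__":
    print(standardize_position(30.0, 10.0, "left"))
    print(standardize_position(30.0, 10.0, "right"))
```

Angles such as player orientation and direction of motion need a matching rotation, which this sketch leaves out.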
Next Gen Stats
NFL Next Gen Stats Methodology Guides
Official explanations of how aggregated tracking metrics (completion probability, separation, time to throw) are calculated. Available on the NFL's stats website.
Alternative Data Sources
Pro Football Reference
Sports Reference Data Use Policy
Guidelines for using PFR data, including rate limiting for web scraping. Important to review before building any automated data collection.
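To make the rate-limiting point concrete, here is a minimal sketch of a limiter that enforces a minimum gap between requests. The class name and the interval used in the example are hypothetical; the actual limit must come from the current Sports Reference policy, not from this code.

```python
import time


class MinIntervalLimiter:
    """Block until at least `min_interval_s` seconds separate two calls."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock   # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Sleep if needed and return the number of seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval_s - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


if __name__ == "__main__":
    # Hypothetical interval; replace with a value that satisfies the policy.
    limiter = MinIntervalLimiter(min_interval_s=0.2)
    for page in ["page-1", "page-2", "page-3"]:
        limiter.wait()  # call before each request
        print("would fetch", page)
```

Calling wait() immediately before each request keeps the pacing logic in one place, so the interval can be changed without touching the rest of the collection code.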
PFR Glossary
Definitions for all statistics on Pro Football Reference, including calculation details. Useful for understanding differences between PFR stats and nflfastR calculations.
Pro Football Focus
PFF Methodology White Papers
Available to subscribers, these documents explain how PFF grades are assigned and what they attempt to measure. Important for understanding the subjective component of PFF data.
Tools and Infrastructure
Data Engineering
"Designing Data-Intensive Applications" by Martin Kleppmann
The definitive guide to data systems design. While not sports-specific, the principles of data pipelines, caching, and storage apply directly to football analytics infrastructure.
Parquet File Format Documentation
Columnar storage lets a reader load only the columns a query needs and compresses repeated values well, which helps explain why Parquet is preferred for large analytical datasets. Review the Apache Parquet specification for details.
Python Data Stack
Pandas Documentation
The official pandas documentation, including performance tips for working with large DataFrames. Essential reference for efficient data manipulation.
Polars Documentation
A newer DataFrame library that is often faster than pandas on large datasets, thanks to multithreaded execution and lazy query evaluation. Consider exploring it if pandas becomes a bottleneck.
Academic and Research Papers
Data Science in Sports
"A Survey of Machine Learning Approaches for Player and Team Performance Prediction in Football" (2022)
Academic review of how tracking and event data are used in predictive modeling across various football codes. Provides context for NFL-specific applications.
"Measuring Performance in the NFL" - Journal of Quantitative Analysis in Sports
Collection of papers using NFL data for performance analysis. Good examples of rigorous methodology applied to football questions.
Community Resources
Forums and Discussion
r/NFLstatheads (Reddit)
Active community discussing data sources, methodologies, and findings. Good place to ask questions about data quirks.
nflfastR Discord Server
Real-time discussion with package developers and experienced users. Fastest way to get help with data loading issues.
Blogs and Newsletters
Open Source Football (opensourcefootball.com)
Regular tutorials using nflfastR/nfl_data_py data. Excellent for learning practical analysis patterns.
The F5 (thef5.substack.com)
Newsletter covering football analytics with regular data-driven analysis examples.
Reading Schedule Suggestion
Immediately:
- nflfastR EP model documentation
- Big Data Bowl data dictionary for current year
Before Part II:
- "NFL Play-by-Play Data Quirks" article
- Review one Big Data Bowl winning solution
Ongoing Reference:
- Bookmark pandas performance tips
- Keep nflfastR glossary accessible
The best analysts combine understanding of the data with awareness of its limitations. These resources help you develop both.