Quiz: The NFL Data Ecosystem
Test your understanding before moving to the next chapter. Target: 70% or higher to proceed.
Section 1: Multiple Choice (1 point each)
1. Which data source provides play-by-play data with EPA calculations for free?
- A) Pro Football Focus
- B) nfl_data_py / nflfastR
- C) NFL Game Pass
- D) ESPN API
Answer
**B)** nfl_data_py / nflfastR

*Explanation:* The nflfastR ecosystem (R) and nfl_data_py (Python) provide free access to play-by-play data with EPA, WPA, and other calculated fields. See Section 2.1.3.

2. How many times per second does NFL tracking data capture player positions?
- A) 1 Hz (once per second)
- B) 5 Hz (5 times per second)
- C) 10 Hz (10 times per second)
- D) 30 Hz (30 times per second)
Answer
**C)** 10 Hz (10 times per second)

*Explanation:* RFID chips transmit position data at 10 Hz, creating 10 frames per second of tracking data. See Section 2.3.1.

3. What does EPA measure?
- A) Total points scored on a drive
- B) How much a play changed the expected score
- C) The probability of winning after a play
- D) Player efficiency rating
Answer
**B)** How much a play changed the expected score

*Explanation:* EPA (Expected Points Added) = EP_after - EP_before, measuring how much a single play improved or hurt the offense's scoring expectation. See Section 2.2.3.

4. Which field in play-by-play data identifies the team with the ball?
- A) `home_team`
- B) `offense`
- C) `posteam`
- D) `ball_team`
Answer
**C)** `posteam`

*Explanation:* `posteam` (possession team) identifies which team has the ball, while `defteam` identifies the defense. See Section 2.2.2.

5. What is the primary public source for raw NFL tracking data?
- A) NFL Next Gen Stats website
- B) NFL Big Data Bowl (Kaggle)
- C) Pro Football Reference
- D) ESPN Stats API
Answer
**B)** NFL Big Data Bowl (Kaggle)

*Explanation:* The Big Data Bowl releases samples of tracking data (50-100 games) annually. Next Gen Stats provides aggregated metrics, not raw data. See Section 2.3.4.

6. In the tracking data coordinate system, what does x=0 represent?
- A) The 50-yard line (midfield)
- B) The back of the offense's own end zone
- C) The left sideline
- D) The offense's goal line
Answer
**B)** The back of the offense's own end zone

*Explanation:* x ranges from 0 (back of own end zone) to 120 (back of opponent's end zone), with the field being 100 yards plus two 10-yard end zones. See Section 2.3.3.

7. What does the GSIS ID represent?
- A) A team's game strategy identifier
- B) A unique player identifier used across NFL data systems
- C) The Game Statistics Information System database
- D) A play's unique identifier
Answer
**B)** A unique player identifier used across NFL data systems

*Explanation:* GSIS (Game Statistics and Information System) IDs uniquely identify players and are used to link data across different sources. See Section 2.2.2.

8. Which of the following is NOT a common data quality issue in play-by-play data?
- A) Missing receiver names on some incompletions
- B) Inconsistent play type classifications
- C) Model-dependent EPA values being updated
- D) All NFL teams intentionally falsifying data
Answer
**D)** All NFL teams intentionally falsifying data

*Explanation:* Data quality issues include missing values, inconsistent categorization, and model updates, but not intentional falsification. See Section 2.2.4.

9. What information does tracking data capture that play-by-play data does NOT?
- A) Which team scored
- B) Player speed and location at every moment
- C) Final score of the game
- D) Which players were on the field
Answer
**B)** Player speed and location at every moment

*Explanation:* Tracking data captures x/y coordinates, speed, and acceleration for every player on every frame. PBP captures outcomes but not spatial details. See Section 2.3.2.

10. What does `yardline_100` represent in play-by-play data?
- A) The total yards gained in 100 plays
- B) Yards from the offense's own goal line
- C) Yards from the opponent's goal line
- D) Percentage of field covered
Answer
**C)** Yards from the opponent's goal line

*Explanation:* `yardline_100` ranges from 1 (the opponent's 1-yard line) to 99 (the offense's own 1-yard line), measuring distance to the opponent's end zone. See Section 2.2.2.

Section 2: True/False (1 point each)
11. Play-by-play data is available for every NFL season since 1970.
Answer
**False**

*Explanation:* Play-by-play data (as structured in nflfastR) is available from 1999 to present. Earlier data is incomplete or formatted differently. See Section 2.2.1.

12. The `air_yards` field is populated for all passing plays.
Answer
**False**

*Explanation:* `air_yards` may be null for some plays, particularly incompletions where the target isn't identified or very short passes. See Section 2.2.4.

13. Pro Football Focus grades are derived from publicly available tracking data.
Answer
**False**

*Explanation:* PFF grades are created by analysts watching film and subjectively evaluating each play; they are not automatically derived from tracking data. See Section 2.4.3.

14. Expected Points values depend on the model used to calculate them, so historical values may change.
Answer
**True**

*Explanation:* EPA depends on the underlying EP model. When models are updated, historical values may be recalculated. See Section 2.2.4.

15. The Big Data Bowl releases tracking data for every NFL game each season.
Answer
**False**

*Explanation:* The Big Data Bowl releases tracking data for a limited sample (typically 50-100 games), not all games. See Section 2.3.4.

Section 3: Fill in the Blank (1 point each)
16. The formula for Expected Points Added is: EPA = EP_(__) - EP_(__)
Answer
**after** - **before** (or post - pre)

EPA = EP_after - EP_before

17. In tracking data, player speed is measured in __ per second.
Answer
**yards**

Speed (`s` field) is in yards per second.

18. The __ field in play-by-play data indicates the team that has possession of the ball.
Answer
**posteam**

Short for "possession team."

19. Play-by-play data captures approximately __ columns of information per play.
Answer
**300** (accept 250-350)

The nflfastR dataset contains over 300 columns.

Section 4: Short Answer (2 points each)
20. Explain why caching is important when working with NFL data, and describe one strategy for implementing it.
Sample Answer
Caching is important because downloading play-by-play data is slow (large files) and stresses the data provider's servers. Without caching, every analysis run would require re-downloading the same data.

Strategy: Save downloaded data to local parquet files with timestamps. Before downloading, check if a recent cache exists and load from disk instead. Implement cache expiration (e.g., refresh weekly during the season).

*Key points for full credit:*

- Identifies performance/efficiency benefit
- Mentions server load consideration
- Describes a concrete caching approach

21. Describe two limitations of tracking data that analysts should be aware of.
Sample Answer
1. **Limited public availability**: Raw tracking data is only available through Big Data Bowl samples (50-100 games/year), not for all games. Full tracking data remains proprietary to NFL teams.
2. **No ball tracking in public data**: Current public tracking data doesn't include the ball's position, making it difficult to determine exact timing of catches or throw locations.

Other valid limitations: computational challenges with massive data volume, interpretation complexity (raw coordinates need context), doesn't capture eye movement or communication.

22. What is the difference between aggregated statistics and play-by-play data? When would you use each?
Sample Answer
**Aggregated statistics** are summaries across multiple plays (e.g., season passing yards). **Play-by-play data** records every individual play with situational context.

**Use aggregated statistics** when:

- Comparing players/teams on counting stats
- Quick historical lookups
- Simple rankings and leaderboards

**Use play-by-play data** when:

- Analyzing situational performance (3rd down, red zone)
- Building predictive models
- Calculating context-adjusted metrics like EPA
- Understanding sequences of plays

Section 5: Matching (1 point each)
Match each data source with its primary use case:
| Data Source | Use Case Options |
|---|---|
| 23. nfl_data_py | A. Player grades and subjective evaluations |
| 24. Big Data Bowl | B. Contract and salary cap information |
| 25. Pro Football Focus | C. Play-by-play with EPA calculations |
| | D. Raw tracking data samples |
Answers
- **23. C** - nfl_data_py provides play-by-play data with EPA
- **24. D** - Big Data Bowl provides raw tracking data
- **25. A** - PFF provides player grades and evaluations

(B would match Spotrac/Over The Cap)

Scoring
| Section | Points | Your Score |
|---|---|---|
| Multiple Choice (1-10) | 10 | ___ |
| True/False (11-15) | 5 | ___ |
| Fill in Blank (16-19) | 4 | ___ |
| Short Answer (20-22) | 6 | ___ |
| Matching (23-25) | 3 | ___ |
| Total | 28 | ___ |
Passing Score: 20/28 (about 71%, the lowest whole-point score that meets the 70% target)
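As a quick arithmetic check on the threshold, the short Python sketch below sums the section points from the scoring table and rounds the 70% target up to a whole point. The variable names are illustrative only.

```python
import math

# Points per section, copied from the scoring table.
section_points = {
    "Multiple Choice": 10,
    "True/False": 5,
    "Fill in Blank": 4,
    "Short Answer": 6,
    "Matching": 3,
}

total_points = sum(section_points.values())

# 70% of the total is not a whole number of points, so round up
# to find the lowest score that still meets the target.
passing_score = math.ceil(total_points * 70 / 100)

print(f"Total: {total_points}, passing score: {passing_score}")
```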
Review Recommendations
- Score < 50%: Re-read entire chapter, focusing on Sections 2.2 and 2.3
- Score 50-70%: Review Sections 2.5 and 2.6, practice loading data
- Score 70-85%: Good understanding! Practice the programming exercises
- Score > 85%: Excellent! Ready for Chapter 3
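
For readers who want to practice loading data, the caching strategy from Question 20 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `load_with_cache`, its parameters, and the `pbp_<season>.pkl` file name are all hypothetical, and the file's modification time stands in for the timestamp. To keep the sketch runnable with only the standard library, it stores data with `pickle` instead of the parquet files named in the sample answer; with pandas installed, `DataFrame.to_parquet` and `pandas.read_parquet` would be the natural substitutes.

```python
import pickle
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def load_with_cache(season, fetch, cache_dir="cache", max_age_days=7):
    """Return data for `season`, reusing a local copy when it is fresh enough.

    `fetch` is any callable that takes a season and returns the data
    (hypothetical; in practice it would wrap the real download call).
    """
    cache_path = Path(cache_dir) / f"pbp_{season}.pkl"

    # Use the file's modification time as the cache timestamp.
    if cache_path.exists():
        age_days = (time.time() - cache_path.stat().st_mtime) / SECONDS_PER_DAY
        if age_days < max_age_days:
            with cache_path.open("rb") as f:
                return pickle.load(f)

    # Cache is missing or expired: download once, then save for next time.
    data = fetch(season)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data
```

With nfl_data_py, the `fetch` argument could be something like `lambda season: nfl_data_py.import_pbp_data([season])`. The default of seven days mirrors the "refresh weekly during the season" suggestion; a longer expiry would be reasonable for completed seasons.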