Chapter 2 Quiz: Data Sources and Collection
Instructions
This quiz tests your understanding of NBA data sources, collection methods, and data quality concepts covered in Chapter 2. Answer all questions to the best of your ability. Each question is worth 1 point unless otherwise noted.
Time Limit: 45 minutes
Total Points: 31 (plus up to 4 bonus points)
Section A: The NBA Data Ecosystem (Questions 1-5)
Question 1
Which of the following represents the correct hierarchy of NBA data layers, from most accessible to most restricted?
A) Tracking Data > Official League Data > Third-Party Aggregations
B) Official League Data > Third-Party Aggregations > Tracking Data
C) Third-Party Aggregations > Tracking Data > Official League Data
D) Broadcast Data > Derived Analytics > Tracking Data
Question 2
The NBA Stats API returns data in which format?
A) XML
B) CSV
C) JSON
D) HTML
Question 3
What is the earliest season for which most NBA Stats API endpoints provide data?
A) 1946-47
B) 1979-80
C) 1996-97
D) 2013-14
Question 4
Which Python library provides a structured interface to the NBA Stats API?
A) basketball_reference
B) nba_api
C) espn_stats
D) hoopstats
Question 5
Why is the NBA Stats API considered "unofficial" for public use?
A) It requires a paid subscription
B) It has not been documented by the NBA for public consumption
C) It only works during the regular season
D) It provides incorrect data
Section B: API Access and Rate Limiting (Questions 6-10)
Question 6
What HTTP status code typically indicates that you have exceeded the NBA API's rate limit?
A) 200
B) 404
C) 429
D) 500
Question 7
When using the NBA API, what is a recommended maximum request rate to avoid blocking?
A) 10 requests per second
B) 1 request per 2 seconds (0.5 requests/second)
C) 100 requests per minute
D) 1 request per 10 seconds
Question 8
Which of the following HTTP headers is typically required when making requests to the NBA Stats API?
A) Authorization
B) Referer
C) Content-Length
D) Accept-Encoding
Question 9
In the rate limiting decorator pattern shown in Chapter 2, what is the purpose of the last_called variable?
A) To count total API calls
B) To track when the most recent request was made
C) To store the API response
D) To log error messages
Question 10
What is the primary benefit of implementing request caching in a data collection system?
A) It makes the code run faster
B) It reduces redundant API calls and respects rate limits
C) It improves data accuracy
D) It encrypts sensitive data
Section C: Web Scraping and Basketball-Reference (Questions 11-15)
Question 11
What should you always check before scraping a website?
A) The website's color scheme
B) The robots.txt file and terms of service
C) The website's loading speed
D) The number of pages on the site
Question 12
Basketball-Reference wraps some tables in HTML comments for performance optimization. Which technique is used to extract these tables?
A) Regular JavaScript evaluation
B) Standard pandas.read_html()
C) Regular expression matching to extract commented HTML
D) Direct database queries
Question 13
What is the recommended minimum delay between requests when scraping Basketball-Reference?
A) 0.1 seconds
B) 1 second
C) 3-5 seconds
D) 30 seconds
Question 14
Which Python library is commonly used alongside pandas for parsing HTML tables from web pages?
A) numpy
B) BeautifulSoup
C) matplotlib
D) scikit-learn
Question 15
When scraping data, what does "respecting robots.txt" mean in practice?
A) Sending robot-like requests
B) Following the directives about which pages can and cannot be accessed by crawlers
C) Using automated testing tools
D) Creating robotic data visualizations
Section D: Play-by-Play and Shot Data (Questions 16-20)
Question 16
In NBA play-by-play data, which EVENTMSGTYPE code represents a made field goal?
A) 1
B) 2
C) 3
D) 4
Question 17
What information does the PCTIMESTRING field in play-by-play data represent?
A) The total game time elapsed
B) The time remaining in the current period
C) The shot clock time
D) The player's time on court
Question 18
In NBA shot chart data, what do the LOC_X and LOC_Y coordinates represent?
A) GPS coordinates of the arena
B) Position on the court in tenths of feet from the basket
C) Pixel positions on the broadcast screen
D) Distance from half court in meters
Question 19
Which event in play-by-play data does NOT typically end a possession?
A) Made field goal (excluding and-one situations)
B) Defensive rebound
C) Offensive rebound
D) Turnover
Question 20
The formula for calculating elapsed game seconds in regulation accounts for what period length?
A) 10 minutes
B) 12 minutes
C) 15 minutes
D) 20 minutes
Section E: Tracking Data (Questions 21-25)
Question 21
Which company became the NBA's tracking data provider beginning with the 2017-18 season?
A) SportVU
B) Second Spectrum
C) Synergy Sports
D) Stats Perform
Question 22
At what frame rate does the NBA's current tracking system capture player and ball positions?
A) 10 frames per second
B) 25 frames per second
C) 60 frames per second
D) 120 frames per second
Question 23
In tracking data coordinate systems, what is typically located at the origin (0, 0)?
A) Half court
B) Center court
C) The basket
D) The scorer's table
Question 24
Which of the following is a "derived" tracking metric rather than raw positional data?
A) Player X coordinate
B) Ball height
C) Contested shot percentage
D) Player Y coordinate
Question 25
What is the primary limitation for analysts wanting to work with raw NBA tracking data?
A) The data is too large to process
B) Access is restricted to league partners and teams
C) The tracking system is unreliable
D) The data format is proprietary and unreadable
Section F: Data Quality and Cleaning (Questions 26-30)
Question 26
Which of the following is NOT a common data quality issue in basketball datasets?
A) Missing play-by-play events
B) Player ID swaps in tracking data
C) Excessive data availability
D) Box score and play-by-play totals that don't reconcile
Question 27
When validating shot chart coordinates, what is the approximate valid range for LOC_X (in tenths of feet)?
A) -50 to 50
B) -100 to 100
C) -250 to 250
D) -500 to 500
Question 28
What is the purpose of the check_robots_txt() function shown in Chapter 2?
A) To test if a URL is accessible
B) To verify if automated access to a URL is permitted by the site
C) To measure website response time
D) To check for broken links
Question 29 (2 points)
Match each data era with its characteristics:
| Era | Characteristics |
|---|---|
| 1. Pre-1980 | A. Tracking data available, play-by-play complete |
| 2. 1980-1996 | B. No three-point line, limited statistics |
| 3. 1997-2013 | C. Three-pointers recorded, no play-by-play |
| 4. 2014-present | D. Play-by-play available, no tracking |
Question 30
Which storage format is recommended in Chapter 2 for analytical workloads due to its columnar structure?
A) CSV
B) JSON
C) Parquet
D) Excel
Bonus Questions (2 points each)
Bonus Question 1
Explain the difference between "Missing Completely at Random" (MCAR) and "Missing Not at Random" (MNAR) in the context of NBA data. Provide an example of each from basketball statistics.
Bonus Question 2
A team's analytics department needs to collect shot chart data for all players in the 2023-24 season. Describe a data collection strategy that respects API rate limits, implements caching, and handles potential errors. What considerations should inform your approach?
Answer Key
Section A: The NBA Data Ecosystem
1. B) Official League Data > Third-Party Aggregations > Tracking Data - Official league data via the API is most accessible; third-party sites aggregate and republish it; raw tracking data is restricted to partners.
2. C) JSON - The NBA Stats API follows RESTful architecture and returns JSON-formatted responses.
3. C) 1996-97 - Most endpoints provide data back to the 1996-97 season, with box score data available from earlier eras.
4. B) nba_api - The nba_api library, maintained by Swar Patel and contributors, provides a Pythonic interface to the NBA Stats API.
5. B) It has not been documented by the NBA for public consumption - While the API powers NBA.com, it has been reverse-engineered by the community rather than officially documented for developers.
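To make the JSON answer concrete, here is a minimal sketch of turning one response into row dictionaries. It assumes the commonly observed layout in which each entry of `resultSets` carries a `headers` list and a `rowSet` list of rows; the payload below is hand-written for illustration, and `result_set_to_records` is a hypothetical helper, not part of nba_api.

```python
import json

# Hand-written payload shaped like a stats.nba.com response (illustrative only).
SAMPLE_RESPONSE = """
{
  "resource": "example",
  "resultSets": [
    {
      "name": "ExampleSet",
      "headers": ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR"],
      "rowSet": [
        [101, "Example Player", "2015"],
        [102, "Another Player", "2019"]
      ]
    }
  ]
}
"""


def result_set_to_records(payload, index=0):
    """Pair the headers of one result set with each row to build dicts."""
    result_set = payload["resultSets"][index]
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]


payload = json.loads(SAMPLE_RESPONSE)
records = result_set_to_records(payload)
print(records[0]["DISPLAY_FIRST_LAST"])
```

Keeping headers and rows separate makes the payload compact, which is why a small conversion step like this usually comes before any analysis.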
Section B: API Access and Rate Limiting
6. C) 429 - HTTP 429 "Too Many Requests" indicates that the rate limit has been exceeded.
7. B) 1 request per 2 seconds (0.5 requests/second) - The chapter recommends approximately 0.5 requests per second to avoid rate limiting.
8. B) Referer - Requests must include headers like Referer (stats.nba.com) to receive valid responses.
9. B) To track when the most recent request was made - This timestamp allows calculation of elapsed time to enforce minimum intervals between requests.
10. B) It reduces redundant API calls and respects rate limits - Caching prevents re-fetching data that has already been collected, reducing API load.
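The role of `last_called` in Question 9 is easier to see in code. The following is a minimal sketch of a rate limiting decorator, not a reproduction of the chapter's listing; the `rate_limited` name, the `min_interval` parameter, and the placeholder `fetch` function are assumptions made for illustration.

```python
import time
from functools import wraps


def rate_limited(min_interval=2.0):
    """Decorator that spaces calls at least `min_interval` seconds apart."""
    def decorator(func):
        # Timestamp of the most recent call; None until the first call.
        last_called = [None]

        @wraps(func)
        def wrapper(*args, **kwargs):
            if last_called[0] is not None:
                elapsed = time.monotonic() - last_called[0]
                wait = min_interval - elapsed
                if wait > 0:
                    time.sleep(wait)  # pause until the interval has passed
            last_called[0] = time.monotonic()
            return func(*args, **kwargs)

        return wrapper

    return decorator


@rate_limited(min_interval=2.0)  # one request per 2 seconds
def fetch(endpoint):
    # Placeholder for a real HTTP request; returns a fake payload here.
    return {"endpoint": endpoint}
```

The first call runs immediately; each later call sleeps only for whatever part of the interval has not already elapsed.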
Section C: Web Scraping and Basketball-Reference
11. B) The robots.txt file and terms of service - These documents specify what automated access is permitted.
12. C) Regular expression matching to extract commented HTML - The chapter shows using regex to extract tables wrapped in HTML comments.
13. C) 3-5 seconds - A delay of 3-5 seconds is recommended to avoid imposing undue server load.
14. B) BeautifulSoup - BeautifulSoup (from bs4) is commonly used with pandas for HTML parsing.
15. B) Following the directives about which pages can and cannot be accessed by crawlers - robots.txt specifies crawling permissions for automated agents.
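The comment-extraction technique from Question 12 can be sketched with the standard library alone. The HTML fragment and the `extract_commented_tables` helper below are hypothetical and only illustrate the idea; the chapter's own regex may differ.

```python
import re

# Hypothetical page fragment: one visible table and one wrapped in a comment.
SAMPLE_HTML = """
<div id="all_per_game"><table id="per_game"><tr><td>1</td></tr></table></div>
<div id="all_advanced">
<!--
<table id="advanced"><tr><td>2</td></tr></table>
-->
</div>
"""

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def extract_commented_tables(html):
    """Return the HTML of every <table> found inside an HTML comment."""
    tables = []
    for comment_body in COMMENT_RE.findall(html):
        tables.extend(TABLE_RE.findall(comment_body))
    return tables


commented = extract_commented_tables(SAMPLE_HTML)
print(len(commented))
```

Each returned string is ordinary table markup, so it can then be handed to an HTML table parser such as `pandas.read_html` (wrapped in `io.StringIO`) or BeautifulSoup.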
Section D: Play-by-Play and Shot Data
16. A) 1 - EVENTMSGTYPE 1 represents a made shot; 2 represents a missed shot.
17. B) The time remaining in the current period - PCTIMESTRING shows time remaining (e.g., "8:34") in the current period.
18. B) Position on the court in tenths of feet from the basket - Coordinates are measured in tenths of feet relative to the basket location.
19. C) Offensive rebound - Offensive rebounds extend the possession; the other events end it.
20. B) 12 minutes - NBA regulation periods are 12 minutes; overtime periods are 5 minutes.
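The unit conventions in Questions 17, 18, and 20 combine into two small conversions. This is a sketch under the assumptions stated in the answers above (12-minute regulation periods, 5-minute overtimes, coordinates in tenths of feet from the basket); the function names are hypothetical and the chapter's formula may be written differently.

```python
import math

REGULATION_PERIODS = 4
REGULATION_PERIOD_SECONDS = 12 * 60
OVERTIME_PERIOD_SECONDS = 5 * 60


def clock_to_seconds(pctimestring):
    """Convert a 'M:SS' time-remaining string to seconds."""
    minutes, seconds = pctimestring.split(":")
    return int(minutes) * 60 + int(seconds)


def elapsed_game_seconds(period, pctimestring):
    """Seconds elapsed since tip-off, given the period and time remaining."""
    remaining = clock_to_seconds(pctimestring)
    if period <= REGULATION_PERIODS:
        completed = (period - 1) * REGULATION_PERIOD_SECONDS
        return completed + REGULATION_PERIOD_SECONDS - remaining
    # Overtime: all of regulation plus any completed overtime periods.
    completed = (
        REGULATION_PERIODS * REGULATION_PERIOD_SECONDS
        + (period - REGULATION_PERIODS - 1) * OVERTIME_PERIOD_SECONDS
    )
    return completed + OVERTIME_PERIOD_SECONDS - remaining


def shot_distance_feet(loc_x, loc_y):
    """Distance from the basket in feet; inputs are tenths of feet."""
    return math.hypot(loc_x, loc_y) / 10.0
```

For example, `elapsed_game_seconds(1, "8:34")` gives 206, because 514 of the first period's 720 seconds remain.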
Section E: Tracking Data
21. B) Second Spectrum - Second Spectrum became the NBA's tracking partner beginning with the 2017-18 season, replacing SportVU.
22. B) 25 frames per second - The tracking system captures positions at 25 fps.
23. B) Center court - The coordinate system places the origin at center court, with axes running toward the sidelines and baselines.
24. C) Contested shot percentage - This is derived from raw positions; it is computed, not directly measured.
25. B) Access is restricted to league partners and teams - Raw frame-by-frame tracking data remains proprietary and requires partnerships.
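The raw versus derived distinction in Question 24 can be illustrated with a simpler derived metric than contested shot percentage: player speed. The sketch below assumes positions are given in feet, one (x, y) pair per consecutive frame at the 25 fps rate from Question 22; the function name and sample frames are hypothetical.

```python
import math

FRAMES_PER_SECOND = 25  # capture rate cited in Question 22


def speeds_feet_per_second(positions, fps=FRAMES_PER_SECOND):
    """Derive per-frame speed from raw (x, y) positions given in feet."""
    return [
        math.hypot(x1 - x0, y1 - y0) * fps
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]


# Three made-up consecutive frames for one player.
frames = [(0.0, 0.0), (0.6, 0.8), (1.2, 1.6)]
speeds = speeds_feet_per_second(frames)
print(speeds)
```

The coordinates are what the cameras measure; the speeds exist only after a computation, which is what makes a metric "derived".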
Section F: Data Quality and Cleaning
26. C) Excessive data availability - This is not a data quality issue; the others are common problems discussed in the chapter.
27. C) -250 to 250 - The valid X range is -250 to 250 tenths of feet, i.e., -25 to 25 feet either side of the basket, which spans the 50-foot width of the court.
28. B) To verify if automated access to a URL is permitted by the site - The function checks robots.txt to determine if scraping is allowed.
29. Answers: 1-B, 2-C, 3-D, 4-A (2 points)
- Pre-1980: No three-point line, limited stats
- 1980-1996: Three-pointers added, still no play-by-play
- 1997-2013: Play-by-play available, no tracking
- 2014-present: Tracking data era with full data availability
30. C) Parquet - Parquet's columnar format provides significant advantages for analytical workloads.
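The kind of check described in Question 28 can be sketched with Python's standard `urllib.robotparser`. This is not the chapter's `check_robots_txt()` implementation: the `is_allowed` helper is hypothetical, and it parses robots.txt text passed in as a string so the example runs without network access. The rules and URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/public/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

In a live collector the rules would be downloaded from the site's `/robots.txt` first, and the terms of service still need to be read separately, since robots.txt does not cover them.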
Bonus Questions
Bonus 1: (2 points)
- MCAR (Missing Completely at Random): The missingness has no relationship to any variables. Example: A random data entry error causing missing assist values for a game.
- MNAR (Missing Not at Random): The missingness depends on the value itself. Example: Three-point percentage is "missing" for players who don't attempt three-pointers because the statistic is undefined (0/0), not randomly missing.
Bonus 2: (2 points) Strategy should include:
- Implement rate limiting (about 0.5 requests per second, i.e., one request every 2 seconds)
- Use file-based caching (Parquet format) to store retrieved data
- Check the cache before making API calls to avoid redundant requests
- Implement exponential backoff for failed requests
- Log all operations for debugging
- Handle HTTP errors gracefully (429, 500, timeouts)
- Process players in batches with progress tracking
- Consider time of day for API access (off-peak hours)
- Validate data integrity after collection
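Two items from the strategy above, cache-first lookup and exponential backoff, can be combined in one small function. The sketch below is one possible shape, not the chapter's implementation: it caches to JSON files rather than Parquet to stay within the standard library, and `fetch_with_cache`, `flaky_fetch`, and the `player_123` key are hypothetical names. The `sleep` parameter is injectable so the demo does not actually wait.

```python
import json
import tempfile
import time
from pathlib import Path


def fetch_with_cache(key, fetch, cache_dir, max_attempts=3,
                     base_delay=1.0, sleep=time.sleep):
    """Return cached data for `key`, or fetch it with exponential backoff."""
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())  # cache hit: no API call

    for attempt in range(max_attempts):
        try:
            data = fetch(key)
        except Exception:  # broad on purpose for this sketch
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
        else:
            cache_file.write_text(json.dumps(data))
            return data


# Demo with a made-up fetcher that fails twice, then succeeds.
attempts = []
delays = []


def flaky_fetch(key):
    attempts.append(key)
    if len(attempts) < 3:
        raise ConnectionError("simulated timeout")
    return {"player_id": key, "shots": []}


with tempfile.TemporaryDirectory() as cache_dir:
    first = fetch_with_cache("player_123", flaky_fetch, cache_dir, sleep=delays.append)
    second = fetch_with_cache("player_123", flaky_fetch, cache_dir, sleep=delays.append)
```

The second call is served from the cache, so the fetcher is never invoked again. A production version would narrow the exception types, add logging, and pair this with rate limiting between successful requests.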
Scoring Guide
| Score | Grade | Feedback |
|---|---|---|
| 28-35 | A | Excellent understanding of NBA data sources and collection |
| 24-27 | B | Good grasp of core concepts; review data quality sections |
| 20-23 | C | Adequate understanding; more practice with APIs recommended |
| 16-19 | D | Review chapter material and complete all exercises |
| Below 16 | F | Seek additional help; re-read chapter before proceeding |
Post-Quiz Reflection
After completing this quiz, consider:
- Which data sources are most relevant to your basketball analytics projects?
- What data quality challenges do you anticipate encountering?
- How will you implement responsible data collection practices?
- What additional data sources might complement the ones discussed?
Take time to revisit sections where you scored below 80% before moving to the next chapter.