Chapter 2 Quiz: Data Sources and Collection

Instructions

This quiz tests your understanding of NBA data sources, collection methods, and data quality concepts covered in Chapter 2. Answer all questions to the best of your ability. Each question is worth 1 point unless otherwise noted.

Time Limit: 45 minutes
Total Points: 31 (29 one-point questions and one two-point question), plus up to 4 bonus points


Section A: The NBA Data Ecosystem (Questions 1-5)

Question 1

Which of the following represents the correct hierarchy of NBA data layers, from most accessible to most restricted?

A) Tracking Data > Official League Data > Third-Party Aggregations
B) Official League Data > Third-Party Aggregations > Tracking Data
C) Third-Party Aggregations > Tracking Data > Official League Data
D) Broadcast Data > Derived Analytics > Tracking Data


Question 2

The NBA Stats API returns data in which format?

A) XML
B) CSV
C) JSON
D) HTML


Question 3

What is the earliest season for which most NBA Stats API endpoints provide data?

A) 1946-47
B) 1979-80
C) 1996-97
D) 2013-14


Question 4

Which Python library provides a structured interface to the NBA Stats API?

A) basketball_reference
B) nba_api
C) espn_stats
D) hoopstats


Question 5

Why is the NBA Stats API considered "unofficial" for public use?

A) It requires a paid subscription
B) It has not been documented by the NBA for public consumption
C) It only works during the regular season
D) It provides incorrect data


Section B: API Access and Rate Limiting (Questions 6-10)

Question 6

What HTTP status code typically indicates that you have exceeded the NBA API's rate limit?

A) 200
B) 404
C) 429
D) 500


Question 7

When using the NBA API, what is a recommended maximum request rate to avoid blocking?

A) 10 requests per second
B) 1 request per 2 seconds (0.5 requests/second)
C) 100 requests per minute
D) 1 request per 10 seconds


Question 8

Which of the following HTTP headers is typically required when making requests to the NBA Stats API?

A) Authorization
B) Referer
C) Content-Length
D) Accept-Encoding


Question 9

In the rate limiting decorator pattern shown in Chapter 2, what is the purpose of the last_called variable?

A) To count total API calls
B) To track when the most recent request was made
C) To store the API response
D) To log error messages


Question 10

What is the primary benefit of implementing request caching in a data collection system?

A) It makes the code run faster
B) It reduces redundant API calls and respects rate limits
C) It improves data accuracy
D) It encrypts sensitive data


Section C: Web Scraping and Basketball-Reference (Questions 11-15)

Question 11

What should you always check before scraping a website?

A) The website's color scheme
B) The robots.txt file and terms of service
C) The website's loading speed
D) The number of pages on the site


Question 12

Basketball-Reference wraps some tables in HTML comments for performance optimization. Which technique is used to extract these tables?

A) Regular JavaScript evaluation
B) Standard pandas.read_html()
C) Regular expression matching to extract commented HTML
D) Direct database queries


Question 13

What is the recommended minimum delay between requests when scraping Basketball-Reference?

A) 0.1 seconds
B) 1 second
C) 3-5 seconds
D) 30 seconds


Question 14

Which Python library is commonly used alongside pandas for parsing HTML tables from web pages?

A) numpy
B) BeautifulSoup
C) matplotlib
D) scikit-learn


Question 15

When scraping data, what does "respecting robots.txt" mean in practice?

A) Sending robot-like requests
B) Following the directives about which pages can and cannot be accessed by crawlers
C) Using automated testing tools
D) Creating robotic data visualizations


Section D: Play-by-Play and Shot Data (Questions 16-20)

Question 16

In NBA play-by-play data, which EVENTMSGTYPE code represents a made field goal?

A) 1
B) 2
C) 3
D) 4


Question 17

What information does the PCTIMESTRING field in play-by-play data represent?

A) The total game time elapsed
B) The time remaining in the current period
C) The shot clock time
D) The player's time on court


Question 18

In NBA shot chart data, what do the LOC_X and LOC_Y coordinates represent?

A) GPS coordinates of the arena
B) Position on the court in tenths of feet from the basket
C) Pixel positions on the broadcast screen
D) Distance from half court in meters


Question 19

Which event in play-by-play data does NOT typically end a possession?

A) Made field goal (excluding and-one situations)
B) Defensive rebound
C) Offensive rebound
D) Turnover


Question 20

The formula for calculating elapsed game seconds in regulation accounts for what period length?

A) 10 minutes
B) 12 minutes
C) 15 minutes
D) 20 minutes


Section E: Tracking Data (Questions 21-25)

Question 21

Which company became the NBA's official tracking data provider starting with the 2017-18 season?

A) SportVU
B) Second Spectrum
C) Synergy Sports
D) Stats Perform


Question 22

At what frame rate does the NBA's current tracking system capture player and ball positions?

A) 10 frames per second
B) 25 frames per second
C) 60 frames per second
D) 120 frames per second


Question 23

In tracking data coordinate systems, what is typically located at the origin (0, 0)?

A) Half court
B) Center court
C) The basket
D) The scorer's table


Question 24

Which of the following is a "derived" tracking metric rather than raw positional data?

A) Player X coordinate
B) Ball height
C) Contested shot percentage
D) Player Y coordinate


Question 25

What is the primary limitation for analysts wanting to work with raw NBA tracking data?

A) The data is too large to process
B) Access is restricted to league partners and teams
C) The tracking system is unreliable
D) The data format is proprietary and unreadable


Section F: Data Quality and Cleaning (Questions 26-30)

Question 26

Which of the following is NOT a common data quality issue in basketball datasets?

A) Missing play-by-play events
B) Player ID swaps in tracking data
C) Excessive data availability
D) Box score and play-by-play totals that don't reconcile


Question 27

When validating shot chart coordinates, what is the approximate valid range for LOC_X (in tenths of feet)?

A) -50 to 50
B) -100 to 100
C) -250 to 250
D) -500 to 500


Question 28

What is the purpose of the check_robots_txt() function shown in Chapter 2?

A) To test if a URL is accessible
B) To verify if automated access to a URL is permitted by the site
C) To measure website response time
D) To check for broken links


Question 29 (2 points)

Match each data era with its characteristics:

Era                Characteristics
1. Pre-1980        A. Tracking data available, play-by-play complete
2. 1980-1996       B. No three-point line, limited statistics
3. 1997-2013       C. Three-pointers recorded, no play-by-play
4. 2014-present    D. Play-by-play available, no tracking

Question 30

Which storage format is recommended in Chapter 2 for analytical workloads due to its columnar structure?

A) CSV
B) JSON
C) Parquet
D) Excel


Bonus Questions (2 points each)

Bonus Question 1

Explain the difference between "Missing Completely at Random" (MCAR) and "Missing Not at Random" (MNAR) in the context of NBA data. Provide an example of each from basketball statistics.


Bonus Question 2

A team's analytics department needs to collect shot chart data for all players in the 2023-24 season. Describe a data collection strategy that respects API rate limits, implements caching, and handles potential errors. What considerations should inform your approach?


Answer Key

Section A: The NBA Data Ecosystem

  1. B) Official League Data > Third-Party Aggregations > Tracking Data - Official league data via the API is most accessible; third-party sites aggregate and republish; raw tracking data is restricted to partners.

  2. C) JSON - The NBA Stats API follows RESTful architecture and returns JSON-formatted responses.

  3. C) 1996-97 - Most endpoints provide data back to the 1996-97 season, with box score data available from earlier eras.

  4. B) nba_api - The nba_api library, maintained by Swar Patel and contributors, provides a Pythonic interface to the NBA Stats API.

  5. B) It has not been documented by the NBA for public consumption - While the API powers NBA.com, it's been reverse-engineered by the community rather than officially documented for developers.
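To illustrate the JSON format from Question 2, here is a minimal sketch of reshaping a response into records. It assumes the commonly observed payload layout, a `resultSets` list whose entries carry `headers` and `rowSet`; because the API is undocumented, that layout may differ by endpoint. The sample payload and the helper name `result_set_to_records` are invented for illustration.

```python
import json

# Hypothetical, trimmed-down payload in the shape commonly seen from
# stats.nba.com endpoints. The values are placeholders, not real data.
SAMPLE = json.loads("""
{
  "resultSets": [
    {
      "name": "ExampleTable",
      "headers": ["PLAYER_ID", "PLAYER_NAME", "PTS"],
      "rowSet": [[1, "Player A", 20], [2, "Player B", 15]]
    }
  ]
}
""")


def result_set_to_records(payload, index=0):
    """Pair each row with the column headers to get a list of dicts."""
    result_set = payload["resultSets"][index]
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]


records = result_set_to_records(SAMPLE)
print(records[0])
```

A list of dicts like this can be handed directly to a DataFrame constructor or written to a cache file.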

Section B: API Access and Rate Limiting

  6. C) 429 - HTTP 429 "Too Many Requests" indicates that the rate limit has been exceeded.

  7. B) 1 request per 2 seconds (0.5 requests/second) - The chapter recommends approximately 0.5 requests per second to avoid rate limiting.

  8. B) Referer - Requests must include headers like Referer (stats.nba.com) to receive valid responses.

  9. B) To track when the most recent request was made - This timestamp allows calculation of elapsed time to enforce minimum intervals between requests.

  10. B) It reduces redundant API calls and respects rate limits - Caching prevents re-fetching data that's already been collected, reducing API load.
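The role of `last_called` in Question 9 can be sketched as follows. This is not the chapter's exact code; it is a minimal reconstruction of the pattern, and the injectable `clock` and `sleep` parameters and the placeholder `fetch` function are assumptions added so the behavior can be checked without real waiting.

```python
import time
from functools import wraps


def rate_limited(min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Decorator factory: keep at least `min_interval` seconds between calls."""
    def decorator(func):
        # Time of the most recent call; None until the first call happens.
        last_called = [None]

        @wraps(func)
        def wrapper(*args, **kwargs):
            if last_called[0] is not None:
                elapsed = clock() - last_called[0]
                if elapsed < min_interval:
                    # Wait out the remainder of the interval.
                    sleep(min_interval - elapsed)
            last_called[0] = clock()
            return func(*args, **kwargs)

        return wrapper
    return decorator


@rate_limited(min_interval=2.0)  # roughly 0.5 requests per second
def fetch(endpoint):
    # Placeholder standing in for a real HTTP request.
    return f"fetched {endpoint}"


print(fetch("example_endpoint"))
```

The first call goes through immediately; any call arriving sooner than two seconds after the previous one sleeps for the difference.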

Section C: Web Scraping and Basketball-Reference

  11. B) The robots.txt file and terms of service

    • These documents specify what automated access is permitted.

  12. C) Regular expression matching to extract commented HTML

    • The chapter shows using regex to extract tables wrapped in HTML comments.

  13. C) 3-5 seconds

    • A delay of 3-5 seconds is recommended to avoid imposing undue server load.

  14. B) BeautifulSoup

    • BeautifulSoup (from bs4) is commonly used with pandas for HTML parsing.

  15. B) Following the directives about which pages can and cannot be accessed by crawlers

    • robots.txt specifies crawling permissions for automated agents.
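The technique from Question 12 might look like the sketch below. The HTML snippet is a made-up stand-in for a page that hides a table inside a comment, and `extract_commented_tables` is a hypothetical helper name, not necessarily the function used in the chapter.

```python
import re

# Made-up page fragment: one table hidden in a comment, one visible.
SAMPLE_HTML = """
<div id="all_totals">
<!--
<table id="totals"><tr><th>Player</th><th>PTS</th></tr>
<tr><td>Player A</td><td>20</td></tr></table>
-->
</div>
<table id="visible"><tr><td>shown</td></tr></table>
"""

# Non-greedy match of everything between comment delimiters.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Non-greedy match of each complete <table> element.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def extract_commented_tables(html):
    """Return the raw HTML of every <table> found inside an HTML comment."""
    tables = []
    for comment_body in COMMENT_RE.findall(html):
        tables.extend(TABLE_RE.findall(comment_body))
    return tables


hidden = extract_commented_tables(SAMPLE_HTML)
print(len(hidden))
```

Each returned string is ordinary table markup, so it can then be parsed with BeautifulSoup or passed to pandas.read_html() wrapped in io.StringIO.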

Section D: Play-by-Play and Shot Data

  16. A) 1

    • EVENTMSGTYPE 1 represents a made shot; 2 represents a missed shot.

  17. B) The time remaining in the current period

    • PCTIMESTRING shows time remaining (e.g., "8:34") in the current period.

  18. B) Position on the court in tenths of feet from the basket

    • Coordinates are measured in tenths of feet relative to the basket location.

  19. C) Offensive rebound

    • Offensive rebounds extend the possession; the other events end it.

  20. B) 12 minutes

    • NBA regulation periods are 12 minutes; overtime periods are 5 minutes.
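Questions 17 and 20 combine into a single conversion: elapsed game time is the length of all completed periods plus the time already used in the current one. The sketch below is one way to write it, assuming the period number is an integer (as in the PERIOD field) and PCTIMESTRING is in "M:SS" form; the function name is hypothetical and the chapter's own formula may be arranged differently.

```python
REGULATION_PERIODS = 4
REGULATION_PERIOD_SECONDS = 12 * 60  # 12-minute quarters
OVERTIME_PERIOD_SECONDS = 5 * 60     # 5-minute overtime periods


def elapsed_game_seconds(period, pctimestring):
    """Convert (period, time remaining in period) to seconds since tip-off."""
    minutes, seconds = (int(part) for part in pctimestring.split(":"))
    remaining = minutes * 60 + seconds

    if period <= REGULATION_PERIODS:
        period_length = REGULATION_PERIOD_SECONDS
        completed = (period - 1) * REGULATION_PERIOD_SECONDS
    else:
        period_length = OVERTIME_PERIOD_SECONDS
        completed = (
            REGULATION_PERIODS * REGULATION_PERIOD_SECONDS
            + (period - REGULATION_PERIODS - 1) * OVERTIME_PERIOD_SECONDS
        )
    return completed + (period_length - remaining)


# 8:34 left in the first quarter means 3:26 has been played.
print(elapsed_game_seconds(1, "8:34"))  # 206
```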

Section E: Tracking Data

  21. B) Second Spectrum

    • Second Spectrum became the NBA's tracking partner in 2017, replacing SportVU.

  22. B) 25 frames per second

    • The tracking system captures positions at 25 fps.

  23. B) Center court

    • The coordinate system places the origin at center court with axes to sidelines and baselines.

  24. C) Contested shot percentage

    • This is derived from raw positions; it's computed, not directly measured.

  25. B) Access is restricted to league partners and teams

    • Raw frame-by-frame tracking data remains proprietary and requires partnerships.
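The raw-versus-derived distinction in Question 24 can be made concrete with a small sketch. It computes the nearest-defender distance from raw (x, y) positions and labels a shot as contested below a threshold. The 4-foot threshold, the coordinates in feet, and both function names are assumptions for illustration, not the league's or the chapter's definition.

```python
import math


def nearest_defender_distance(shooter_xy, defender_xys):
    """Smallest Euclidean distance from the shooter to any defender."""
    sx, sy = shooter_xy
    return min(math.hypot(dx - sx, dy - sy) for dx, dy in defender_xys)


def is_contested(shooter_xy, defender_xys, threshold_ft=4.0):
    """Derived label: True if a defender is within `threshold_ft` (assumed)."""
    return nearest_defender_distance(shooter_xy, defender_xys) <= threshold_ft


# Raw positions (placeholder values, in feet).
shooter = (0.0, 0.0)
defenders = [(3.0, 4.0), (10.0, 0.0)]

print(nearest_defender_distance(shooter, defenders))  # 5.0
print(is_contested(shooter, defenders))               # False
```

A contested shot percentage would then be the share of a player's shots for which such a label is true, which is why it counts as derived rather than raw data.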

Section F: Data Quality and Cleaning

  26. C) Excessive data availability

    • This is not a data quality issue; the others are common problems discussed in the chapter.

  27. C) -250 to 250

    • The valid X range is -250 to 250, i.e., up to 25 feet to either side of the basket, which spans the 50-foot width of the court.

  28. B) To verify if automated access to a URL is permitted by the site

    • The function checks robots.txt to determine if scraping is allowed.

  29. Answers: 1-B, 2-C, 3-D, 4-A (2 points)

    • Pre-1980: No three-point line, limited stats
    • 1980-1996: Three-pointers added, still no PBP
    • 1997-2013: Play-by-play available, no tracking
    • 2014-present: Tracking data era with full data availability

  30. C) Parquet

    • Parquet's columnar format provides significant advantages for analytical workloads.
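The chapter's check_robots_txt() is not reproduced in this answer key. As a rough stand-in for the idea behind Question 28, the sketch below uses the standard library's urllib.robotparser. The helper name `is_allowed`, the sample rules, and the example.com URLs are all hypothetical, and the rules are parsed from a string so the example runs offline; a real check would load the site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/players/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

Passing robots.txt does not replace reading the terms of service, which Question 11 lists alongside it.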

Bonus Questions

Bonus 1: (2 points)

    • MCAR (Missing Completely at Random): The missingness has no relationship to any variables. Example: A random data entry error causing missing assist values for a game.
    • MNAR (Missing Not at Random): The missingness depends on the value itself. Example: Three-point percentage is "missing" for players who don't attempt three-pointers because the statistic is undefined (0/0), not randomly missing.

Bonus 2: (2 points) Strategy should include:

    • Implement rate limiting (0.5-1 request per second)
    • Use file-based caching (Parquet format) to store retrieved data
    • Check cache before making API calls to avoid redundant requests
    • Implement exponential backoff for failed requests
    • Log all operations for debugging
    • Handle HTTP errors gracefully (429, 500, timeouts)
    • Process players in batches with progress tracking
    • Consider time of day for API access (off-peak hours)
    • Validate data integrity after collection
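A few of the Bonus 2 points (check the cache first, retry with exponential backoff, store the result) can be sketched together. This is one possible arrangement, not the chapter's implementation. It caches to JSON files rather than Parquet to stay within the standard library, and `fetch_with_cache`, the `flaky_fetch` stand-in, and the key "player_1" are hypothetical names.

```python
import json
import tempfile
import time
from pathlib import Path


def fetch_with_cache(key, fetch_fn, cache_dir, max_retries=4, base_delay=1.0,
                     retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Return cached data for `key`, or fetch it with exponential backoff."""
    cache_path = Path(cache_dir) / f"{key}.json"
    if cache_path.exists():
        # Cache hit: no request is made at all.
        return json.loads(cache_path.read_text())

    for attempt in range(max_retries):
        try:
            data = fetch_fn(key)
        except retry_on:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the error to the caller.
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        else:
            cache_path.parent.mkdir(parents=True, exist_ok=True)
            cache_path.write_text(json.dumps(data))
            return data


# Demo with a stand-in fetcher that fails twice before succeeding.
calls = []


def flaky_fetch(key):
    calls.append(key)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return {"player": key, "shots": []}


delays = []  # Record the backoff delays instead of actually sleeping.
with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_cache("player_1", flaky_fetch, tmp, sleep=delays.append)
    second = fetch_with_cache("player_1", flaky_fetch, tmp, sleep=delays.append)

print(len(calls), delays)
```

The second call is served from the cache, so the fetcher runs only three times in total. In a real collector the fetcher would also be rate limited and each step logged.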


Scoring Guide

Score       Grade   Feedback
28-35       A       Excellent understanding of NBA data sources and collection
24-27       B       Good grasp of core concepts; review data quality sections
20-23       C       Adequate understanding; more practice with APIs recommended
16-19       D       Review chapter material and complete all exercises
Below 16    F       Seek additional help; re-read chapter before proceeding

Post-Quiz Reflection

After completing this quiz, consider:

  1. Which data sources are most relevant to your basketball analytics projects?
  2. What data quality challenges do you anticipate encountering?
  3. How will you implement responsible data collection practices?
  4. What additional data sources might complement the ones discussed?

Take time to revisit sections where you scored below 80% before moving to the next chapter.