Quiz: The Data Landscape of NCAA Football
Test your understanding before moving to the next chapter. Target: 70% or higher to proceed. Time: ~30 minutes
Section 1: Multiple Choice (1 point each)
1. What is play-by-play data?
- A) A summary of each game's final score
- B) Data with one row per play, capturing detailed information about each play
- C) Video recordings of football games
- D) A list of all players on each team's roster
Answer
**B)** Data with one row per play, capturing detailed information about each play

*Explanation:* Play-by-play data is the most granular form of football data, with each row representing a single play, including down, distance, yards gained, play type, and other details. See Section 2.1.1.

2. Which of the following is the PRIMARY free data source for college football analytics?
- A) ESPN Stats API
- B) Pro Football Focus
- C) College Football Data API (CFBD)
- D) NFL Next Gen Stats
Answer
**C)** College Football Data API (CFBD)

*Explanation:* CFBD is the primary free source for comprehensive college football data, including play-by-play data, statistics, and pre-calculated advanced metrics. See Section 2.2.1.

3. What does "endpoint" mean in the context of an API?
- A) The last game of the season
- B) A specific URL that returns particular types of data
- C) The maximum number of requests allowed
- D) The password for authentication
Answer
**B)** A specific URL that returns particular types of data

*Explanation:* An endpoint is a specific URL in an API that returns particular data. For example, /games returns game data while /plays returns play-by-play data. See Section 2.3.1.

4. Why should you cache API data locally?
- A) To avoid paying for the API
- B) To reduce repeated API calls, save time, and respect rate limits
- C) Because the API might delete the data
- D) Caching is not recommended
Answer
**B)** To reduce repeated API calls, save time, and respect rate limits

*Explanation:* Caching saves data locally after the first retrieval. This reduces redundant API calls, speeds up subsequent analyses, and helps you stay within rate limits. See Section 2.3.4.

5. Which file format is most efficient for storing large play-by-play datasets?
- A) CSV
- B) JSON
- C) Parquet
- D) Text files
Answer
**C)** Parquet

*Explanation:* Parquet is a columnar binary format that offers significant compression (often 80% smaller than CSV) and much faster read times for large datasets. It's ideal for play-by-play data with millions of rows. See Section 2.4.4.
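To see the size difference yourself, here is a minimal sketch (the exact savings depend on the data; real play-by-play columns, with their many repeated values, compress far better than the random numbers used here):

```python
import os

import numpy as np
import pandas as pd

# A toy frame standing in for a large play-by-play table
df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))

df.to_csv("plays.csv", index=False)
df.to_parquet("plays.parquet")  # requires pyarrow or fastparquet

print(os.path.getsize("plays.csv") // 1_000_000, "MB as CSV")
print(os.path.getsize("plays.parquet") // 1_000_000, "MB as Parquet")
```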
6. What is a "rate limit" on an API?
- A) How fast the data downloads
- B) A restriction on how many requests can be made in a time period
- C) The speed at which players run
- D) The maximum file size for responses
Answer
**B)** A restriction on how many requests can be made in a time period

*Explanation:* Rate limits prevent any single user from overloading the API server. CFBD typically allows around 1000 requests per hour. Exceeding limits results in error responses. See Section 2.3.2.

7. Sports Reference is best suited for which type of analysis?
- A) Real-time game predictions
- B) Historical research with deep coverage back to 1869
- C) Detailed play-by-play analysis
- D) Player tracking data analysis
Answer
**B)** Historical research with deep coverage back to 1869

*Explanation:* Sports Reference has the deepest historical coverage of any source, with data going back to the earliest days of college football. However, it doesn't provide play-by-play data. See Section 2.2.2.

8. What does MCAR mean in the context of missing data?
- A) Missing Completely at Random
- B) Most Common Analysis Result
- C) Missing Column After Review
- D) Multiple Category Assessment Rating
Answer
**A)** Missing Completely at Random

*Explanation:* MCAR (Missing Completely at Random) is one pattern of missing data where the missingness has no relationship to any observed or unobserved data; the gaps are essentially random. See Section 2.5.1.

9. Which of the following is stored in a typical play-by-play record but NOT in a game-level box score?
- A) Final score
- B) Total rushing yards
- C) Down and distance for each play
- D) Winning team
Answer
**C)** Down and distance for each play

*Explanation:* Box scores aggregate totals (scores, yards) but don't capture the situational context of each play. Down, distance, field position, and other play-level details require play-by-play data. See Section 2.1.2.

10. When organizing your data files, where should you store the original data from the API?
- A) In the processed/ folder
- B) In the raw/ folder
- C) In the output/ folder
- D) In the notebooks/ folder
Answer
**B)** In the raw/ folder

*Explanation:* Original data from sources should be stored in a raw/ folder and never modified. All cleaning and transformations should produce new files in a processed/ folder. This preserves the ability to trace back to the source. See Section 2.6.1.

Section 2: True/False (1 point each)
11. JSON is the most common format for API responses.
Answer
**True**

*Explanation:* JSON (JavaScript Object Notation) is the standard format for web API responses, including CFBD. It supports nested data structures and preserves data types. See Section 2.4.2.

12. You can aggregate play-level data to game-level data, but you cannot disaggregate game-level data to play-level data.
Answer
**True**

*Explanation:* Aggregation combines detailed records into summaries (plays → games). This process loses information that cannot be recovered. You can always summarize up but cannot expand down. See Section 2.1.2.
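A quick pandas illustration (the frame and column names are made up for the example):

```python
import pandas as pd

# One row per play
plays = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2],
    "yards_gained": [5, -2, 12, 3, 45],
})

# Aggregating up is easy: plays -> game totals
games = plays.groupby("game_id", as_index=False)["yards_gained"].sum()
print(games)
# The reverse is impossible: the five individual plays cannot be
# reconstructed from the two game totals alone.
```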
13. PFF (Pro Football Focus) data is freely available to the public.

Answer
**False**

*Explanation:* PFF data requires paid subscriptions. PFF provides premium data not available elsewhere, including play-by-play grades and detailed charting, but at significant cost. See Section 2.2.4.

14. It's acceptable to hardcode your API key directly in your Python script.
Answer
**False**

*Explanation:* API keys should be stored in environment variables or config files, never hardcoded. Hardcoded keys can be accidentally exposed when you share code, compromising your access. See Section 2.3.2.
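A minimal sketch of the safer pattern, assuming a `CFBD_API_KEY` environment variable and the same placeholder base URL used elsewhere in this quiz:

```python
import os

import requests

# Read the key from the environment instead of hardcoding it,
# e.g. after `export CFBD_API_KEY=your-key-here` in your shell
api_key = os.environ["CFBD_API_KEY"]

response = requests.get(
    "https://api.cfbd.com/games",
    params={"year": 2023},
    headers={"Authorization": f"Bearer {api_key}"},
)
```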
15. Historical football data is equally complete and reliable across all time periods.

Answer
**False**

*Explanation:* Data quality generally improves over time. Play-by-play data is sparse before 2000, many statistics weren't tracked historically, and rule changes affect comparability across eras. See Section 2.5.4.

Section 3: Fill in the Blank (1 point each)
16. An ____ is a software interface that allows programs to communicate with each other and request data.
Answer
**API** (Application Programming Interface)

*Explanation:* APIs enable your code to request data from remote servers like CFBD. See Section 2.3.1.

17. A ____ is documentation that describes every column in a dataset, including its type, meaning, and valid values.
Answer
**data dictionary**

*Explanation:* Data dictionaries document what each field means, helping users understand and correctly use the data. See Section 2.6.3.

18. ____ format stores data in columns rather than rows, enabling efficient compression and fast queries for analytical workloads.
Answer
**Parquet** (or columnar)

*Explanation:* Parquet is a columnar binary format that's highly efficient for large analytical datasets. See Section 2.4.4.

19. When data is missing because of factors related to the missing values themselves, this pattern is called ____.
Answer
**MNAR** (Missing Not at Random)

*Explanation:* MNAR occurs when missingness is related to the value that's missing (e.g., injured players missing game stats). See Section 2.5.1.

Section 4: Short Answer (2 points each)
20. Explain why you should separate raw and processed data files into different folders.
Answer
**Sample Answer:** Separating raw and processed data preserves the original data exactly as received from the source. This enables you to: (1) trace any issue back to determine whether it came from the source or from your processing, (2) reprocess the data if you discover errors in your transformations, and (3) reproduce your analysis by starting from the same raw inputs. If you modify raw files, you lose the ability to verify your processing steps or start over.

*Key points for full credit:*
- Preserves the ability to trace back to the source
- Enables reprocessing if needed

21. What are two advantages and two disadvantages of using CSV files for storing football data?
Answer
**Sample Answer:**

*Advantages:*
1. Human-readable: can be opened in any text editor or spreadsheet
2. Universal compatibility: works with virtually any tool (Python, R, Excel, etc.)

*Disadvantages:*
1. No data type information: all values are stored as strings (demonstrated in the sketch below)
2. Larger file sizes compared to binary formats like Parquet

*Key points for full credit:*
- At least two valid advantages
- At least two valid disadvantages
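A small round-trip sketch of the type-loss problem (hypothetical data; the Parquet step assumes pyarrow is installed):

```python
import pandas as pd

df = pd.DataFrame({
    "team": pd.Categorical(["UGA", "BAMA"]),
    "week": [1, 2],
})
print(df.dtypes)  # team: category, week: int64

df.to_csv("games.csv", index=False)
print(pd.read_csv("games.csv").dtypes)  # team: object -- the category dtype is gone

df.to_parquet("games.parquet")
print(pd.read_parquet("games.parquet").dtypes)  # dtypes round-trip intact
```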
22. Describe a situation where you would need to use the CFBD API's /plays endpoint rather than the /games endpoint.

Answer
**Sample Answer:** You would need /plays for any analysis that requires situational context within games. For example: calculating third-down conversion rates (you need to know which plays were third downs; see the sketch below), computing EPA (you need each play's before/after situation), analyzing red zone efficiency (you need to filter plays by field position), or studying fourth-down decision making (you need down, distance, and outcome for each fourth-down situation).

*Key points for full credit:*
- Identifies a specific analysis requiring play-level data
- Explains why game-level data is insufficient
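A simplified sketch of the third-down example from play-level data (the column names are hypothetical, and it counts a conversion only when the gain covers the distance, ignoring penalty first downs):

```python
import pandas as pd

# Assumes play-level columns: down, distance, yards_gained
plays = pd.read_parquet("data/raw/plays_2023.parquet")

third_downs = plays[plays["down"] == 3]
converted = third_downs["yards_gained"] >= third_downs["distance"]
print(f"Third-down conversion rate: {converted.mean():.1%}")
```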
Section 5: Code Analysis (2 points each)

23. What is wrong with the following code for handling API data?
```python
import requests

def get_all_games(years):
    all_games = []
    for year in years:
        for week in range(1, 16):
            for team in all_teams:  # 130+ teams
                response = requests.get(
                    f"https://api.cfbd.com/games?year={year}&week={week}&team={team}"
                )
                all_games.extend(response.json())
    return all_games
```
Answer
**Problems:**

1. **Inefficient API usage:** Making separate requests for each team, week, and year creates far too many requests (130+ teams × 15 weeks × multiple years = thousands of requests). Fetch all games for a year in one request, then filter.
2. **No rate limit handling:** There are no delays between requests; the loop will likely hit rate limits and get errors.
3. **No error handling:** There is no try/except to handle failed requests gracefully.
4. **No caching:** The function re-fetches all data every time it runs.
5. **No authentication:** The API key is missing from the request headers (required for CFBD).

**Better approach:**

```python
# One request per year instead of one per team, week, and year
response = requests.get(f"https://api.cfbd.com/games?year={year}")
# Then filter by team in Python
```
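Putting those fixes together, one possible corrected version might look like this (a sketch, not CFBD's official client; it assumes a `CFBD_API_KEY` environment variable and keeps the question's placeholder URL):

```python
import os
import time

import requests

def get_all_games(years):
    """Fetch all games for the given years, one request per year."""
    headers = {"Authorization": f"Bearer {os.environ['CFBD_API_KEY']}"}
    all_games = []
    for year in years:
        try:
            response = requests.get(
                "https://api.cfbd.com/games",
                params={"year": year},
                headers=headers,
                timeout=30,
            )
            response.raise_for_status()  # surface HTTP errors (e.g., 429)
            all_games.extend(response.json())
        except requests.RequestException as exc:
            print(f"Request for {year} failed: {exc}")
        time.sleep(1)  # small delay to stay well under the rate limit
    return all_games
```

Caching, the remaining fix, is exactly what the code in question 24 adds.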
24. What does this code do, and why is it useful?
```python
import os
import pandas as pd

def load_data(year, cache_dir="cache"):
    cache_path = f"{cache_dir}/games_{year}.parquet"
    if os.path.exists(cache_path):
        # Cache hit: read from disk instead of calling the API
        return pd.read_parquet(cache_path)
    # Cache miss: fetch, then save for next time
    data = fetch_from_api(year)  # assumed to return a DataFrame
    os.makedirs(cache_dir, exist_ok=True)
    data.to_parquet(cache_path)
    return data
```
Answer
**What it does:** This function implements a caching pattern for API data:

1. First checks whether a cached file exists for the requested year
2. If the cache exists, loads and returns it (fast)
3. If there is no cache, fetches from the API, saves to the cache, then returns the data
4. Creates the cache directory if it doesn't exist

**Why it's useful:**

1. Saves time: subsequent requests are nearly instant (read from disk)
2. Reduces API calls: respects rate limits
3. Works offline: cached data is available without an internet connection
4. Efficient format: uses Parquet for fast reads and small files

*Key points for full credit:*
- Explains the caching pattern
- Identifies at least two benefits
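For instance, calling the function twice shows the caching pattern at work:

```python
df = load_data(2023)  # first call: fetches from the API, writes cache/games_2023.parquet
df = load_data(2023)  # second call: returns instantly from the cached Parquet file
```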
Section 6: Applied Problem (5 points)

25. You are building a system to analyze rushing efficiency for SEC teams. Design your data collection and storage approach:

a) What data do you need? (1 point)
b) Which CFBD endpoints would you use? (1 point)
c) What file format would you store the data in and why? (1 point)
d) How would you organize your files? (1 point)
e) What data quality checks would you perform? (1 point)
Answer
**a) Data needed:**
- Play-by-play data filtered to rushing plays
- Fields: game_id, team, down, distance, yards_gained, EPA, success
- Potentially team schedules and opponent information

**b) CFBD endpoints:**
- `/plays` with filters: year, conference="SEC", playType="Rush"
- `/games` for game context and opponent information
- Possibly `/teams` for team metadata

**c) File format:**
- Parquet for the main play-by-play data (it will have many rows)
- Reasoning: smaller file size, faster reads, preserved data types
- CSV is fine for smaller reference tables

**d) File organization:**

```text
project/
├── data/
│   ├── raw/
│   │   └── sec_rushing_plays_2023.parquet
│   ├── processed/
│   │   └── rushing_efficiency_by_team.parquet
│   └── reference/
│       └── teams.csv
├── notebooks/
│   └── rushing_analysis.ipynb
└── output/
    └── sec_rushing_report.html
```
**e) Data quality checks:**
- Verify every SEC team is represented (14 teams through the 2023 season; 16 beginning in 2024)
- Check for missing values in key fields (yards, down, distance)
- Validate that yards_gained falls within a reasonable range (-20 to 99)
- Confirm the play_type filter worked correctly
- Cross-check total rushes against published box scores
- Check for duplicate plays
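A minimal pandas sketch of a few of these checks, assuming the raw file and column names from parts (a) and (d):

```python
import pandas as pd

plays = pd.read_parquet("data/raw/sec_rushing_plays_2023.parquet")

# Team coverage: every SEC team should appear
print(plays["team"].nunique(), "teams found")

# Missing values in key fields
print(plays[["yards_gained", "down", "distance"]].isna().sum())

# Range check on yards gained
outliers = plays[(plays["yards_gained"] < -20) | (plays["yards_gained"] > 99)]
print(len(outliers), "plays outside the expected yardage range")

# Duplicate plays
print(plays.duplicated().sum(), "duplicate rows")
```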
*Key points for full credit:*
- Answers all five parts
- Demonstrates understanding of data flow
- Shows consideration of quality issues
Scoring
| Section | Points | Your Score |
|---|---|---|
| Multiple Choice (1-10) | 10 | ___ |
| True/False (11-15) | 5 | ___ |
| Fill in Blank (16-19) | 4 | ___ |
| Short Answer (20-22) | 6 | ___ |
| Code Analysis (23-24) | 4 | ___ |
| Applied Problem (25) | 5 | ___ |
| Total | 34 | ___ |
Passing Score: 24/34 (70%)
Review Recommendations
- Score < 50%: Re-read the entire chapter, focusing on Sections 2.2 and 2.3
- Score 50-69%: Review Sections 2.4-2.6 and redo the Part C exercises
- Score 70-85%: Good understanding! Review any missed topics before proceeding
- Score > 85%: Excellent! Ready for Chapter 3