Exercises: The Data Landscape of NCAA Football
These exercises progress from foundational concept checks to hands-on data retrieval. Estimated completion time: 3-4 hours.
Scoring Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)
Part A: Conceptual Understanding ⭐
Test your understanding of core concepts.
A.1. Explain the difference between play-level data and game-level data. Give an example of an analysis that requires play-level data and one that could be done with game-level data alone.
A.2. What is an API? Explain in plain language what happens when you make an API request to CFBD.
A.3. List three pieces of information typically found in a play-by-play record that would NOT be available in a game-level box score.
A.4. What is a rate limit, and why do API providers implement them? How should you design your code to respect rate limits?
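As a point of comparison for A.4, one common design for respecting rate limits is retry with exponential backoff. The sketch below assumes the client surfaces a rate-limit response as an exception (a stand-in `RuntimeError` here, rather than a real HTTP 429 handler):

```python
import time

def polite_get(fetch_fn, max_retries=4, base_delay=0.5):
    """Call fetch_fn, backing off exponentially when it signals a rate limit."""
    for attempt in range(max_retries):
        try:
            return fetch_fn()
        except RuntimeError:  # stand-in for catching an HTTP 429 response
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("still rate-limited after retries")

# Demo: a fake endpoint that rejects the first two calls, then succeeds
state = {"calls": 0}
def flaky_endpoint():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return {"ok": True}

result = polite_get(flaky_endpoint, base_delay=0.01)
```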
A.5. Compare CSV and Parquet file formats. When would you choose each one?
A.6. Explain why you should never modify raw data files. What is the proper approach instead?
A.7. What is a data dictionary, and why is it important to create one for your datasets?
A.8. Describe three types of data quality issues and give a football-related example of each.
Part B: Data Source Identification ⭐⭐
Match analysis needs to appropriate data sources.
B.1. For each analysis below, identify which data source(s) would be most appropriate and what endpoint or data type you would need:
a) Calculate total rushing yards for all SEC teams in 2023
b) Find the recruiting rankings for Ohio State's 2024 signing class
c) Determine the average yards gained on third-and-long plays for Alabama
d) Compare historical win percentages of Big Ten teams since 1950
e) Get the betting line for next week's games
B.2. A colleague wants to analyze how often teams use pre-snap motion. Which of the following data sources could provide this information?
- CFBD play-by-play
- Sports Reference
- ESPN box scores
- PFF premium data
Explain your reasoning.
B.3. You're building a recruiting database and need:
- High school player ratings (1-5 stars)
- Player physical measurements
- Commitment dates
Identify potential sources for each piece of information and discuss their accessibility.
B.4. Rank the following data sources by historical depth (earliest data available) and explain any limitations:
- CFBD API
- Sports Reference
- ESPN Stats
- PFF
Part C: API Practice ⭐⭐
Practice working with the CFBD API. Requires an API key.
C.1. Basic API Request ⭐⭐
Write Python code to:
1. Connect to the CFBD API
2. Retrieve all games from Week 1 of the 2023 season
3. Print the number of games returned
4. Display the first 5 games showing home team, away team, and score
```python
# Your code here
# Hint: Use the cfbd library or the requests library
```
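A starting-point sketch for C.1: the endpoint path, auth header, and parameter names below are assumptions modeled on the public CFBD REST API, so verify them against the API docs. The summary step runs on a small sample payload so the logic can be checked without a key:

```python
API_KEY = "YOUR_KEY_HERE"  # placeholder; register at collegefootballdata.com

def fetch_week1_games(year=2023):
    """Live-call sketch; endpoint and params assumed from CFBD's public docs."""
    import requests  # imported here so the offline demo below runs without it
    resp = requests.get(
        "https://api.collegefootballdata.com/games",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"year": year, "week": 1, "seasonType": "regular"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def summarize_games(games, n=5):
    """Return (game count, first-n 'Home vs Away: h-a' lines) from a games payload."""
    lines = [
        f"{g['home_team']} vs {g['away_team']}: {g['home_points']}-{g['away_points']}"
        for g in games[:n]
    ]
    return len(games), lines

# Sample payload so the summary logic can be checked without an API key
sample = [
    {"home_team": "Alabama", "away_team": "Georgia", "home_points": 27, "away_points": 24},
    {"home_team": "Ohio State", "away_team": "Michigan", "home_points": 30, "away_points": 31},
]
count, lines = summarize_games(sample)
```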
C.2. Filtered Data Retrieval ⭐⭐
Write code to retrieve all 2023 regular season games for a specific team (your choice). Then:
a) Calculate the team's record
b) Calculate average points scored and allowed
c) Find their largest margin of victory and defeat
C.3. Play-by-Play Exploration ⭐⭐
Retrieve play-by-play data for a single game. Then:
a) Count the total number of plays
b) Calculate the percentage of pass plays vs. run plays
c) Find the longest play of the game
d) Identify all touchdown plays
C.4. Multi-Endpoint Query ⭐⭐⭐
Write a function that takes a team name and year, then returns:
- Win-loss record
- Total points scored
- Average yards per game (requires combining endpoints)
- Top rusher (player stats endpoint)
```python
def get_team_summary(team: str, year: int) -> dict:
    """
    Get comprehensive team summary from multiple endpoints.

    Parameters
    ----------
    team : str
        Team name (e.g., "Alabama")
    year : int
        Season year

    Returns
    -------
    dict
        Dictionary with record, points, yards, top_rusher
    """
    # Your code here
    pass
```
C.5. Data Caching ⭐⭐⭐
Implement a caching system for API requests:
1. Check if data exists locally before making an API call
2. Save API responses to local files
3. Load from cache on subsequent requests
4. Include cache invalidation (e.g., refresh if older than 24 hours)
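The four steps above can be sketched with a file-based cache keyed by request name; the cache directory name and 24-hour window are assumptions to adapt. The demo uses a stand-in fetch function in place of a real API call:

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path("cache")        # hypothetical location; adjust to your project layout
MAX_AGE_SECONDS = 24 * 3600      # refresh anything older than 24 hours

def cached_fetch(key, fetch_fn, max_age=MAX_AGE_SECONDS):
    """Return cached JSON for `key` if fresh; otherwise call fetch_fn() and cache it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{key}.json"
    # Steps 1 and 4: use the local copy only if it exists and is fresh enough
    if path.exists() and time.time() - path.stat().st_mtime < max_age:
        return json.loads(path.read_text())
    # Steps 2 and 3: fetch, save, and return
    data = fetch_fn()
    path.write_text(json.dumps(data))
    return data

# Demo with a stand-in for an API call
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return {"games": [1, 2, 3]}

first = cached_fetch("demo_2023_wk1", fake_fetch)
second = cached_fetch("demo_2023_wk1", fake_fetch)  # served from cache; no second fetch
```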
Part D: Data Manipulation ⭐⭐-⭐⭐⭐
Work with football data in different formats.
D.1. JSON Processing ⭐⭐
Given the following JSON structure (typical CFBD response), write code to convert it to a pandas DataFrame:
```json
[
  {
    "id": 401520180,
    "home_team": "Alabama",
    "away_team": "Georgia",
    "home_points": 27,
    "away_points": 24,
    "venue": {"name": "Mercedes-Benz Stadium", "city": "Atlanta"}
  }
]
```
Handle the nested "venue" field appropriately.
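One way to handle the nested field is `pandas.json_normalize`, which flattens nested dicts into dotted column names; a sketch (the rename to underscore-style names is a stylistic choice):

```python
import pandas as pd

games = [
    {
        "id": 401520180,
        "home_team": "Alabama",
        "away_team": "Georgia",
        "home_points": 27,
        "away_points": 24,
        "venue": {"name": "Mercedes-Benz Stadium", "city": "Atlanta"},
    }
]

# json_normalize expands "venue" into "venue.name" and "venue.city" columns
df = pd.json_normalize(games)
df = df.rename(columns={"venue.name": "venue_name", "venue.city": "venue_city"})
```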
D.2. Data Cleaning ⭐⭐
You receive a dataset with the following issues:
- Team names inconsistent ("Ohio St", "Ohio State", "OSU")
- Some scores are stored as strings
- Missing values represented as -999
- Duplicate rows
Write code to clean this dataset, creating a mapping dictionary for team names and handling all other issues.
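A sketch of the cleaning steps on a toy frame (the mapping dictionary and column names are illustrative assumptions; extend the mapping as new name variants appear):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "team": ["Ohio St", "Ohio State", "OSU", "OSU"],
    "points": ["27", "24", -999, -999],
})

# One canonical name per team
TEAM_MAP = {"Ohio St": "Ohio State", "OSU": "Ohio State", "Ohio State": "Ohio State"}

clean = raw.copy()
clean["team"] = clean["team"].map(TEAM_MAP)
clean["points"] = pd.to_numeric(clean["points"], errors="coerce")  # strings -> numbers
clean["points"] = clean["points"].replace(-999, np.nan)            # sentinel -> missing
clean = clean.drop_duplicates().reset_index(drop=True)             # exact duplicates dropped
```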
D.3. Format Conversion ⭐⭐
Write a script that:
1. Reads a CSV file of game data
2. Converts it to Parquet format
3. Compares file sizes
4. Compares read times for both formats
Include timing code to measure performance.
D.4. Building a Play Database ⭐⭐⭐
Create a SQLite database with the following tables:
- games (game_id, home_team, away_team, date, etc.)
- plays (play_id, game_id, down, distance, yards, etc.)
- teams (team_id, name, conference, etc.)
Write functions to:
a) Insert game data
b) Insert play data with a foreign key to games
c) Query average yards per play for a specific team
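A minimal sketch of the schema and the three functions using the standard-library `sqlite3` module; the column sets are trimmed to the essentials (extend them to match your data), and an in-memory database stands in for a file:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT, conference TEXT);
CREATE TABLE games (game_id INTEGER PRIMARY KEY, home_team TEXT, away_team TEXT, date TEXT);
CREATE TABLE plays (
    play_id INTEGER PRIMARY KEY,
    game_id INTEGER REFERENCES games(game_id),
    offense TEXT, down INTEGER, distance INTEGER, yards INTEGER
);
""")

def insert_game(conn, game_id, home, away, date):
    conn.execute("INSERT INTO games VALUES (?, ?, ?, ?)", (game_id, home, away, date))

def insert_play(conn, play_id, game_id, offense, down, distance, yards):
    conn.execute(
        "INSERT INTO plays VALUES (?, ?, ?, ?, ?, ?)",
        (play_id, game_id, offense, down, distance, yards),
    )

def avg_yards_per_play(conn, team):
    row = conn.execute(
        "SELECT AVG(yards) FROM plays WHERE offense = ?", (team,)
    ).fetchone()
    return row[0]

insert_game(conn, 1, "Alabama", "Georgia", "2023-09-02")
insert_play(conn, 1, 1, "Alabama", 1, 10, 4)
insert_play(conn, 2, 1, "Alabama", 2, 6, 8)
```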
Part E: Data Quality Analysis ⭐⭐⭐
Investigate and address data quality issues.
E.1. Missing Data Investigation
Load a play-by-play dataset and analyze missing values:
a) Which columns have the most missing values?
b) Is the missingness random or patterned (e.g., more missing in certain game types)?
c) Propose appropriate handling strategies for each column with missing data.
E.2. Outlier Detection
Write code to identify potential data entry errors in game results:
a) Games with unusually high scores (> 70 points for either team)
b) Games with negative point values
c) Games where the point total seems unreasonable
For each outlier found, investigate whether it's a real result or a data error.
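A simple rule-based pass for E.2 might look like the sketch below; the thresholds for a) and b) mirror the prompt, while the 120-point combined total is an arbitrary assumption to tune:

```python
def flag_suspect_games(games):
    """Return (game, reason) pairs for results worth a manual check."""
    flagged = []
    for g in games:
        h, a = g["home_points"], g["away_points"]
        if h < 0 or a < 0:
            flagged.append((g, "negative score"))
        elif h > 70 or a > 70:
            # Real blowouts do happen; flag for verification, don't auto-correct
            flagged.append((g, "unusually high score"))
        elif h + a > 120:  # assumed cutoff for an implausible combined total
            flagged.append((g, "implausible point total"))
    return flagged

sample = [
    {"id": 1, "home_points": 24, "away_points": 17},
    {"id": 2, "home_points": 77, "away_points": 0},
    {"id": 3, "home_points": -7, "away_points": 21},
]
flags = flag_suspect_games(sample)
```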
E.3. Cross-Source Validation ⭐⭐⭐
Pick 10 random games from your CFBD data. Manually check the scores against Sports Reference or ESPN. Calculate:
a) Percentage of exact matches
b) Any discrepancies found
c) Hypotheses about sources of discrepancies
E.4. Data Completeness Audit ⭐⭐⭐
For a single season, check completeness:
a) Are all 133 FBS teams represented?
b) Are there the expected number of games (roughly 12-15 per team)?
c) For play-by-play data, are any games missing plays?
d) Document any gaps found.
Part F: Project Setup ⭐⭐⭐-⭐⭐⭐⭐
Build infrastructure for ongoing analysis.
F.1. Folder Structure Setup ⭐⭐⭐
Create a complete project structure for a college football analytics project:
a) Create the directory structure from Section 2.6.1
b) Add README.md files to each major directory
c) Create a sample data dictionary template
d) Initialize a Git repository (ignore data folders appropriately)
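A scaffold script to start from; the directory names here are assumptions, so adapt them to match Section 2.6.1, and run `git init` in the project root as a separate step:

```python
from pathlib import Path

# Assumed layout modeled on a typical raw/processed split; adjust to Section 2.6.1
STRUCTURE = ["data/raw", "data/processed", "code", "notebooks", "reports", "docs"]

def scaffold(root="cfb-analytics"):
    """Create the directory tree, a README per directory, and a data .gitignore."""
    root = Path(root)
    for rel in STRUCTURE:
        d = root / rel
        d.mkdir(parents=True, exist_ok=True)
        (d / "README.md").write_text(f"# {rel}\n\nDescribe the contents of {rel} here.\n")
    # Keep raw and processed data out of version control
    (root / ".gitignore").write_text("data/raw/\ndata/processed/\n")
    return root

project = scaffold()
```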
F.2. Data Pipeline ⭐⭐⭐⭐
Build an automated data pipeline that:
a) Fetches all games for a given season from CFBD
b) Fetches play-by-play data for each game
c) Combines into a unified dataset
d) Saves in both CSV and Parquet formats
e) Logs progress and any errors
f) Can be re-run safely (idempotent)
```python
def build_season_dataset(year: int, output_dir: str) -> None:
    """
    Build complete season dataset from CFBD API.

    Parameters
    ----------
    year : int
        Season year to fetch
    output_dir : str
        Directory to save output files
    """
    # Your implementation
    pass
```
F.3. Documentation Project ⭐⭐⭐
Create comprehensive documentation for the CFBD plays endpoint data:
a) Data dictionary for all columns
b) Notes on any data quirks discovered
c) Sample queries for common use cases
d) Known limitations
Part G: Research Extensions ⭐⭐⭐⭐
Open-ended exploration.
G.1. Historical Data Availability
Research and document:
a) What years of play-by-play data are available from CFBD?
b) How does data completeness change across years?
c) What statistics were added or changed over time?
d) What rule changes might affect historical comparisons?
Write a 500-word report on historical data considerations.
G.2. Alternative Data Sources
Investigate one data source not covered in detail in this chapter:
- College football betting data
- Weather data for games
- Stadium/venue information
- Social media data
Document:
a) What data is available
b) How to access it
c) Potential analytical applications
d) Limitations and challenges
G.3. Data Source Comparison
Design and execute a systematic comparison of two data sources:
a) Define metrics for comparison (completeness, accuracy, timeliness)
b) Sample data from both sources
c) Analyze differences
d) Make recommendations for when to use each
Solutions
Selected solutions are available in:
- `code/exercise-solutions.py` (programming problems)
- `appendices/g-answers-to-selected-exercises.md` (odd-numbered problems)
Full solutions available to instructors upon request.
Notes
- Exercises C.1-C.5 require a CFBD API key. Register at collegefootballdata.com.
- Some exercises use the `cfbd` Python library. Install it with `pip install cfbd`.
- Data manipulation exercises can use sample data if API access is unavailable.