Case Study: Building a Complete Season Database
"Give me six hours to chop down a tree and I will spend the first four sharpening the axe." — Abraham Lincoln
Executive Summary
This case study walks through building a comprehensive college football database for a complete season. You will collect game results, play-by-play data, and team statistics from the CFBD API, organize them properly, and validate the data for quality. By the end, you'll have a reusable template for building season databases.
Skills Applied:
- API data collection at scale
- Data organization and storage
- Quality validation techniques
- Documentation practices
Output: A complete 2023 FBS season database with games, plays, and team stats
Background
The Analysis Need
Imagine you're starting a college football analytics blog. You want to analyze the 2023 season comprehensively: team rankings, player performance, strategic trends, and more. Before any analysis, you need data—and not just scattered downloads, but a well-organized database you can query reliably.
Requirements
Our database needs to include:
- All FBS games from the 2023 regular season and bowl games
- Play-by-play data for every game
- Team statistics aggregated by season
- Game metadata (venue, attendance, weather if available)
Design Decisions
Before writing code, we make key decisions:
| Decision | Choice | Reasoning |
|---|---|---|
| Primary format | Parquet | Large datasets, analytical queries |
| Backup format | CSV | Human inspection, sharing |
| Organization | By data type | Logical grouping, easy to find |
| Caching | Aggressive | Minimize API calls |
| Validation | Automated | Catch issues early |
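If you want to sanity-check the Parquet-versus-CSV decision before committing to it, a minimal sketch like the one below (the sample data and file names are invented for illustration, and Parquet support requires pyarrow or fastparquet) writes the same DataFrame in both formats and compares file sizes.

```python
# format_check.py -- rough comparison of the two storage formats.
# The sample DataFrame is synthetic; real play-by-play data compresses even
# better because of heavily repeated team names and play types.
import os

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "game_id": rng.integers(400_000_000, 402_000_000, size=50_000),
    "offense": rng.choice(["Alabama", "Georgia", "Michigan", "Washington"], size=50_000),
    "yards_gained": rng.integers(-10, 30, size=50_000),
})

sample.to_parquet("sample.parquet", index=False)  # requires pyarrow or fastparquet
sample.to_csv("sample.csv", index=False)

for path in ("sample.parquet", "sample.csv"):
    print(f"{path}: {os.path.getsize(path) / 1024:.0f} KB")
```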
Phase 1: Setup and Configuration
Project Structure
First, create the project directory structure:
# setup_project.py
import os
def create_project_structure(base_dir: str = "cfb_2023"):
"""Create standardized project directory structure."""
directories = [
f"{base_dir}/data/raw/games",
f"{base_dir}/data/raw/plays",
f"{base_dir}/data/raw/stats",
f"{base_dir}/data/processed",
f"{base_dir}/data/cache",
f"{base_dir}/notebooks",
f"{base_dir}/src",
f"{base_dir}/output/figures",
f"{base_dir}/output/reports",
f"{base_dir}/docs",
]
for directory in directories:
os.makedirs(directory, exist_ok=True)
print(f"Created: {directory}")
# Create README files
readme_content = {
f"{base_dir}/data/raw": "# Raw Data\n\nOriginal data from CFBD API. Do not modify.",
f"{base_dir}/data/processed": "# Processed Data\n\nCleaned and transformed data.",
f"{base_dir}/data/cache": "# Cache\n\nTemporary API response cache.",
f"{base_dir}/docs": "# Documentation\n\nData dictionaries and notes.",
}
for path, content in readme_content.items():
readme_path = f"{path}/README.md"
with open(readme_path, "w") as f:
f.write(content)
print(f"Created: {readme_path}")
print(f"\nProject structure created at: {base_dir}/")
if __name__ == "__main__":
create_project_structure()
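After running the script, a quick check confirms the structure exists before you move on. The list below just mirrors a few of the paths created above.

```python
# verify_structure.py -- optional sanity check; paths mirror create_project_structure().
import os

expected = [
    "cfb_2023/data/raw/games",
    "cfb_2023/data/raw/plays",
    "cfb_2023/data/cache",
    "cfb_2023/docs",
]
for directory in expected:
    status = "ok" if os.path.isdir(directory) else "MISSING"
    print(f"{status:>8}  {directory}")
```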
Configuration
Create a configuration file for API settings:
# config.py
import os
from dataclasses import dataclass
@dataclass
class Config:
"""Configuration for data collection."""
# API Settings
api_key: str = os.environ.get("CFBD_API_KEY", "")
base_url: str = "https://api.collegefootballdata.com"
rate_limit_delay: float = 0.5 # Seconds between requests
# Data Settings
season: int = 2023
data_dir: str = "cfb_2023/data"
# Validation
expected_fbs_teams: int = 133
expected_games_per_team: tuple = (11, 15) # Min, max
def validate(self):
"""Validate configuration."""
if not self.api_key:
raise ValueError("CFBD_API_KEY environment variable not set")
config = Config()
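Before running any collection script, it is worth confirming the key is actually picked up from the environment. A minimal check, assuming CFBD_API_KEY has been exported in your shell, might look like this:

```python
# check_config.py -- minimal sketch; assumes `export CFBD_API_KEY="..."` was run first.
from config import config

try:
    config.validate()
    print(f"Config OK: season={config.season}, data_dir={config.data_dir}")
except ValueError as exc:
    print(f"Configuration problem: {exc}")
```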
Phase 2: Data Collection
API Client
Build a reusable API client:
# api_client.py
import requests
import time
import json
import os
from typing import Optional, Dict, Any, List
from config import config
class CFBDClient:
"""Client for College Football Data API."""
def __init__(self):
self.base_url = config.base_url
self.headers = {
"Authorization": f"Bearer {config.api_key}",
"Accept": "application/json"
}
self.cache_dir = f"{config.data_dir}/cache"
def _cache_path(self, endpoint: str, params: Dict) -> str:
"""Generate cache file path for request."""
param_str = "_".join(f"{k}={v}" for k, v in sorted(params.items()))
filename = f"{endpoint.replace('/', '_')}_{param_str}.json"
return os.path.join(self.cache_dir, filename)
def _get_cached(self, cache_path: str) -> Optional[Any]:
"""Load data from cache if exists."""
if os.path.exists(cache_path):
with open(cache_path, "r") as f:
return json.load(f)
return None
def _save_cache(self, cache_path: str, data: Any):
"""Save data to cache."""
os.makedirs(os.path.dirname(cache_path), exist_ok=True)
with open(cache_path, "w") as f:
json.dump(data, f)
def request(
self,
endpoint: str,
params: Optional[Dict[str, Any]] = None,
use_cache: bool = True
) -> Any:
"""
Make API request with caching.
Parameters
----------
endpoint : str
API endpoint (e.g., "/games")
params : Dict
Query parameters
use_cache : bool
Whether to use cached responses
Returns
-------
Any
Parsed JSON response
"""
params = params or {}
# Check cache
cache_path = self._cache_path(endpoint, params)
if use_cache:
cached = self._get_cached(cache_path)
if cached is not None:
print(f" [CACHE] {endpoint}")
return cached
# Make request
url = f"{self.base_url}{endpoint}"
print(f" [API] {endpoint} {params}")
response = requests.get(url, headers=self.headers, params=params)
response.raise_for_status()
data = response.json()
# Save to cache
if use_cache:
self._save_cache(cache_path, data)
# Rate limiting
time.sleep(config.rate_limit_delay)
return data
def get_games(self, year: int, **kwargs) -> List[Dict]:
"""Get games for a season."""
return self.request("/games", {"year": year, **kwargs})
def get_plays(self, year: int, week: int, **kwargs) -> List[Dict]:
"""Get plays for a specific week."""
return self.request("/plays", {"year": year, "week": week, **kwargs})
def get_team_stats(self, year: int, **kwargs) -> List[Dict]:
"""Get team season statistics."""
return self.request("/stats/season", {"year": year, **kwargs})
def get_teams(self, **kwargs) -> List[Dict]:
"""Get team information."""
return self.request("/teams", kwargs)
Collecting Games
# collect_games.py
import pandas as pd
from api_client import CFBDClient
from config import config
def collect_all_games(year: int = config.season) -> pd.DataFrame:
"""
Collect all FBS games for a season.
Parameters
----------
year : int
Season year
Returns
-------
pd.DataFrame
All games for the season
"""
client = CFBDClient()
print(f"Collecting {year} games...")
# Get regular season games
regular = client.get_games(year, seasonType="regular")
print(f" Regular season: {len(regular)} games")
# Get postseason games
postseason = client.get_games(year, seasonType="postseason")
print(f" Postseason: {len(postseason)} games")
# Combine
all_games = regular + postseason
games_df = pd.DataFrame(all_games)
print(f" Total: {len(games_df)} games")
return games_df
def save_games(games_df: pd.DataFrame, year: int = config.season):
"""Save games to raw data directory."""
# Save as Parquet
parquet_path = f"{config.data_dir}/raw/games/games_{year}.parquet"
games_df.to_parquet(parquet_path, index=False)
print(f"Saved: {parquet_path}")
# Save as CSV backup
csv_path = f"{config.data_dir}/raw/games/games_{year}.csv"
games_df.to_csv(csv_path, index=False)
print(f"Saved: {csv_path}")
if __name__ == "__main__":
games = collect_all_games()
save_games(games)
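Once the games file is written, a quick inspection confirms the shape and column names before building anything on top of it. The exact columns depend on what the API returned, so treat this as a look-and-see step rather than a fixed schema check.

```python
# inspect_games.py -- quick look at the collected games file.
import pandas as pd

from config import config

games = pd.read_parquet(f"{config.data_dir}/raw/games/games_{config.season}.parquet")
print(games.shape)
print(games.dtypes)
print(games.head(3).T)  # transpose so a wide row is readable in the terminal
```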
Collecting Play-by-Play Data
# collect_plays.py
import pandas as pd
from api_client import CFBDClient
from config import config
from typing import List
def collect_plays_for_week(year: int, week: int, client: CFBDClient) -> pd.DataFrame:
"""Collect play-by-play data for a specific week."""
plays = client.get_plays(year, week)
if plays:
df = pd.DataFrame(plays)
df["collection_week"] = week
return df
return pd.DataFrame()
def collect_all_plays(year: int = config.season) -> pd.DataFrame:
"""
Collect play-by-play data for entire season.
Parameters
----------
year : int
Season year
Returns
-------
pd.DataFrame
All plays for the season
"""
client = CFBDClient()
print(f"Collecting {year} play-by-play data...")
all_plays = []
# Regular season weeks (1-15 covers most schedules)
for week in range(1, 16):
print(f" Week {week}...")
week_plays = collect_plays_for_week(year, week, client)
if not week_plays.empty:
all_plays.append(week_plays)
print(f" {len(week_plays)} plays")
# Postseason
print(" Postseason...")
post_plays = client.get_plays(year, 1, seasonType="postseason")
if post_plays:
post_df = pd.DataFrame(post_plays)
post_df["collection_week"] = "postseason"
all_plays.append(post_df)
print(f" {len(post_df)} plays")
# Combine
plays_df = pd.concat(all_plays, ignore_index=True)
print(f" Total: {len(plays_df):,} plays")
return plays_df
def save_plays(plays_df: pd.DataFrame, year: int = config.season):
"""Save plays to raw data directory."""
# Save as Parquet (much smaller for large datasets)
parquet_path = f"{config.data_dir}/raw/plays/plays_{year}.parquet"
plays_df.to_parquet(parquet_path, index=False)
print(f"Saved: {parquet_path}")
# Get file size
import os
size_mb = os.path.getsize(parquet_path) / (1024 * 1024)
print(f" File size: {size_mb:.1f} MB")
if __name__ == "__main__":
plays = collect_all_plays()
save_plays(plays)
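The requirements also call for season-level team statistics. The client already exposes get_team_stats, so a short collection script in the same pattern fills the data/raw/stats directory; this is a sketch, and the columns in the resulting DataFrame are simply whatever the /stats/season endpoint returns.

```python
# collect_stats.py -- same pattern as the other collectors; fills data/raw/stats/.
import pandas as pd

from api_client import CFBDClient
from config import config

def collect_team_stats(year: int = config.season) -> pd.DataFrame:
    """Collect season-level team statistics."""
    client = CFBDClient()
    print(f"Collecting {year} team stats...")
    stats = client.get_team_stats(year)
    stats_df = pd.DataFrame(stats)
    print(f"  {len(stats_df)} rows")
    return stats_df

if __name__ == "__main__":
    stats = collect_team_stats()
    path = f"{config.data_dir}/raw/stats/team_stats_{config.season}.parquet"
    stats.to_parquet(path, index=False)
    print(f"Saved: {path}")
```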
Phase 3: Data Validation
Validation Functions
# validate_data.py
import pandas as pd
from config import config
from typing import Any, Dict, List
class DataValidator:
"""Validate collected football data."""
def __init__(self):
self.issues: List[str] = []
self.stats: Dict[str, Any] = {}
def validate_games(self, games_df: pd.DataFrame) -> bool:
"""Validate games dataset."""
print("\nValidating games...")
# Check number of games
num_games = len(games_df)
expected_min = config.expected_fbs_teams * config.expected_games_per_team[0] / 2
expected_max = config.expected_fbs_teams * config.expected_games_per_team[1] / 2
self.stats["num_games"] = num_games
if num_games < expected_min:
    self.issues.append(f"Too few games: {num_games} (expected >{expected_min})")
elif num_games > expected_max:
    self.issues.append(f"Too many games: {num_games} (expected <{expected_max})")
else:
    print(f" ✓ Game count: {num_games}")
# Check for missing scores
missing_scores = games_df["home_points"].isna().sum()
if missing_scores > 0:
# Some might be future games - check dates
print(f" ⚠ Games with missing scores: {missing_scores}")
# Check for duplicate games
if "id" in games_df.columns:
duplicates = games_df["id"].duplicated().sum()
if duplicates > 0:
self.issues.append(f"Duplicate game IDs: {duplicates}")
else:
print(f" ✓ No duplicate game IDs")
# Check for required columns
required_cols = ["id", "home_team", "away_team", "home_points", "away_points"]
missing_cols = [c for c in required_cols if c not in games_df.columns]
if missing_cols:
self.issues.append(f"Missing columns: {missing_cols}")
else:
print(f" ✓ All required columns present")
# Check score validity
completed_games = games_df.dropna(subset=["home_points", "away_points"])
invalid_scores = (
(completed_games["home_points"] < 0) |
(completed_games["away_points"] < 0) |
(completed_games["home_points"] > 100) |
(completed_games["away_points"] > 100)
).sum()
if invalid_scores > 0:
self.issues.append(f"Invalid scores: {invalid_scores}")
else:
print(f" ✓ All scores in valid range")
return len(self.issues) == 0
def validate_plays(self, plays_df: pd.DataFrame) -> bool:
"""Validate play-by-play dataset."""
print("\nValidating plays...")
num_plays = len(plays_df)
self.stats["num_plays"] = num_plays
print(f" Total plays: {num_plays:,}")
# Check for required columns
required_cols = ["id", "offense", "defense", "down", "distance", "yards_gained"]
present_cols = [c for c in required_cols if c in plays_df.columns]
missing_cols = [c for c in required_cols if c not in plays_df.columns]
if missing_cols:
print(f" ⚠ Missing columns: {missing_cols}")
# Check down values
if "down" in plays_df.columns:
invalid_downs = ~plays_df["down"].isin([1, 2, 3, 4, None])
if invalid_downs.any():
self.issues.append(f"Invalid down values: {invalid_downs.sum()}")
else:
print(f" ✓ All down values valid")
# Check for extreme yard values
if "yards_gained" in plays_df.columns:
extreme_yards = (
(plays_df["yards_gained"] < -50) |
(plays_df["yards_gained"] > 100)
)
if extreme_yards.any():
print(f" ⚠ Extreme yard values: {extreme_yards.sum()}")
# These might be valid (long TDs, big losses) - just flag
# Check play type distribution
if "play_type" in plays_df.columns:
play_types = plays_df["play_type"].value_counts()
print(f" Play type distribution:")
for pt, count in play_types.head(5).items():
print(f" {pt}: {count:,}")
return len(self.issues) == 0
def generate_report(self) -> str:
"""Generate validation report."""
report = ["=" * 60]
report.append("DATA VALIDATION REPORT")
report.append("=" * 60)
report.append("")
report.append("STATISTICS:")
for key, value in self.stats.items():
report.append(f" {key}: {value:,}" if isinstance(value, int) else f" {key}: {value}")
report.append("")
if self.issues:
report.append("ISSUES FOUND:")
for issue in self.issues:
report.append(f" ❌ {issue}")
else:
report.append("✓ NO ISSUES FOUND")
report.append("")
report.append("=" * 60)
return "\n".join(report)
def run_validation():
"""Run full validation on collected data."""
validator = DataValidator()
# Load and validate games
games = pd.read_parquet(f"{config.data_dir}/raw/games/games_{config.season}.parquet")
validator.validate_games(games)
# Load and validate plays
plays = pd.read_parquet(f"{config.data_dir}/raw/plays/plays_{config.season}.parquet")
validator.validate_plays(plays)
# Print report
print(validator.generate_report())
# Save report
report_path = f"{config.data_dir}/../docs/validation_report.txt"
with open(report_path, "w") as f:
f.write(validator.generate_report())
print(f"Report saved: {report_path}")
if __name__ == "__main__":
run_validation()
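The validator is easy to extend. For example, a per-team game-count check ties back to the expected_games_per_team setting in config.py. The sketch below assumes home_team and away_team columns as documented in the data dictionary; note that FCS opponents will legitimately appear only once or twice, so low counts need to be filtered before being treated as errors.

```python
# Hypothetical additional check: per-team appearance counts.
import pandas as pd

from config import config

games = pd.read_parquet(f"{config.data_dir}/raw/games/games_{config.season}.parquet")
appearances = pd.concat([games["home_team"], games["away_team"]]).value_counts()

low, high = config.expected_games_per_team
outside_range = appearances[(appearances < low) | (appearances > high)]

# Many of these will be FCS opponents with one or two games; cross-reference the
# /teams endpoint (FBS classification) to narrow the list to true anomalies.
print(f"Teams outside {low}-{high} appearances: {len(outside_range)}")
print(outside_range.head(10))
```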
Phase 4: Documentation
Data Dictionary
Create comprehensive documentation:
# 2023 Season Database - Data Dictionary
## games_2023.parquet
| Column | Type | Description | Example |
|--------|------|-------------|---------|
| id | int64 | Unique game identifier | 401520180 |
| season | int32 | Season year | 2023 |
| week | int32 | Week number (1-15 for regular season; see season_type for postseason) | 12 |
| season_type | string | "regular" or "postseason" | regular |
| start_date | datetime | Game date and time | 2023-11-25T15:30:00 |
| home_team | string | Home team name | Alabama |
| away_team | string | Away team name | Auburn |
| home_points | int32 | Home team final score | 27 |
| away_points | int32 | Away team final score | 24 |
| venue | string | Stadium name | Bryant-Denny Stadium |
| attendance | int32 | Attendance (may be null) | 101821 |
## plays_2023.parquet
| Column | Type | Description | Example |
|--------|------|-------------|---------|
| id | int64 | Unique play identifier | 401520180001 |
| game_id | int64 | Parent game ID | 401520180 |
| drive_id | int64 | Parent drive ID | 4015201801 |
| period | int32 | Quarter (1-4, 5+ OT) | 2 |
| clock | string | Time remaining MM:SS | 7:32 |
| offense | string | Team on offense | Alabama |
| defense | string | Team on defense | Auburn |
| down | int32 | Down number (1-4) | 3 |
| distance | int32 | Yards to first down | 7 |
| yard_line | int32 | Field position (1-99) | 65 |
| yards_gained | int32 | Yards gained on play | 12 |
| play_type | string | Type of play | Pass |
| play_text | string | Play description | Milroe pass complete to... |
## Collection Notes
- **Source:** College Football Data API (collegefootballdata.com)
- **Collection Date:** [DATE]
- **Season:** 2023 regular season + bowl games
- **Coverage:** All FBS teams
## Known Limitations
1. Some bowl games may be missing if collected before bowl season completion
2. Attendance data incomplete for some games
3. A small number of plays may have missing yards_gained values
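Documentation drifts, so it helps to regenerate the column list straight from the files whenever the dictionary is updated. A small sketch (paths match the project layout above):

```python
# dump_schema.py -- print actual columns and dtypes to keep the data dictionary honest.
import pandas as pd

for name, path in [
    ("games", "cfb_2023/data/raw/games/games_2023.parquet"),
    ("plays", "cfb_2023/data/raw/plays/plays_2023.parquet"),
]:
    df = pd.read_parquet(path)
    print(f"\n{name} ({len(df):,} rows)")
    print(df.dtypes.to_string())
```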
Results Summary
What We Built
After completing this case study, you have:
cfb_2023/
├── data/
│ ├── raw/
│ │ ├── games/
│ │ │ ├── games_2023.parquet (850 games)
│ │ │ └── games_2023.csv
│ │ └── plays/
│ │ └── plays_2023.parquet (150,000+ plays)
│ ├── processed/
│ └── cache/
├── docs/
│ ├── data_dictionary.md
│ └── validation_report.txt
└── src/
├── config.py
├── api_client.py
├── collect_games.py
├── collect_plays.py
└── validate_data.py
Key Metrics
| Metric | Value |
|---|---|
| Games collected | ~850 |
| Plays collected | ~150,000 |
| FBS teams represented | 133 |
| Raw data size (Parquet) | ~25 MB |
| Raw data size (CSV) | ~120 MB |
| API requests made | ~20 |
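The exact numbers will vary with when you collect, so it is worth recomputing them from your own files rather than trusting the table. A short sketch:

```python
# metrics.py -- recompute the headline metrics from the collected files.
import os

import pandas as pd

game_path = "cfb_2023/data/raw/games/games_2023.parquet"
play_path = "cfb_2023/data/raw/plays/plays_2023.parquet"

games = pd.read_parquet(game_path)
plays = pd.read_parquet(play_path)
parquet_mb = (os.path.getsize(game_path) + os.path.getsize(play_path)) / (1024 * 1024)

print(f"Games collected:  {len(games):,}")
print(f"Plays collected:  {len(plays):,}")
print(f"Parquet size:     {parquet_mb:.1f} MB")
```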
Discussion Questions
1. What would you change if you needed to collect data for multiple seasons?
2. How would you handle API rate limit errors gracefully?
3. What additional validation checks would increase confidence in data quality?
4. How might you automate this process to run weekly during the season?
Your Turn: Extensions
Option A: Add recruiting data collection to the pipeline
Option B: Build an incremental update system that only fetches new games
Option C: Create visualizations to explore the collected data
Key Takeaways
- Plan before collecting - Design your structure and validation criteria first
- Cache aggressively - Minimize API calls through local caching
- Validate early - Catch data issues before analysis begins
- Document everything - Future you will thank present you
- Use efficient formats - Parquet saves significant space for large datasets