Case Study: Building a Complete Season Database

"Give me six hours to chop down a tree and I will spend the first four sharpening the axe." — Abraham Lincoln

Executive Summary

This case study walks through building a comprehensive college football database for a complete season. You will collect game results, play-by-play data, and team statistics from the CFBD API, organize them properly, and validate the data for quality. By the end, you'll have a reusable template for building season databases.

Skills Applied:

  - API data collection at scale
  - Data organization and storage
  - Quality validation techniques
  - Documentation practices

Output: A complete 2023 FBS season database with games, plays, and team stats


Background

The Analysis Need

Imagine you're starting a college football analytics blog. You want to analyze the 2023 season comprehensively: team rankings, player performance, strategic trends, and more. Before any analysis, you need data—and not just scattered downloads, but a well-organized database you can query reliably.

Requirements

Our database needs to include:

  1. All FBS games from the 2023 regular season and bowl games
  2. Play-by-play data for every game
  3. Team statistics aggregated by season
  4. Game metadata (venue, attendance, weather if available)

Design Decisions

Before writing code, we make key decisions:

| Decision | Choice | Reasoning |
|----------|--------|-----------|
| Primary format | Parquet | Large datasets, analytical queries |
| Backup format | CSV | Human inspection, sharing |
| Organization | By data type | Logical grouping, easy to find |
| Caching | Aggressive | Minimize API calls |
| Validation | Automated | Catch issues early |
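
To see why Parquet is the primary format, it helps to write the same table both ways and compare file sizes. A minimal sketch using a synthetic stand-in frame (the file names and numbers here are illustrative only; the real play-by-play data is far larger):

# format_check.py — a stand-in comparison, not part of the pipeline.

import os

import numpy as np
import pandas as pd

n = 100_000
sample = pd.DataFrame({
    "game_id": np.random.randint(400_000_000, 402_000_000, size=n),
    "down": np.random.randint(1, 5, size=n),
    "yards_gained": np.random.randint(-10, 30, size=n),
    "play_type": np.random.choice(["Rush", "Pass", "Punt", "Kickoff"], size=n),
})

sample.to_parquet("sample.parquet", index=False)
sample.to_csv("sample.csv", index=False)

# Compare on-disk sizes of the two formats
for path in ("sample.parquet", "sample.csv"):
    print(f"{path}: {os.path.getsize(path) / 1024:.0f} KB")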

Phase 1: Setup and Configuration

Project Structure

First, create the project directory structure:

# setup_project.py

import os

def create_project_structure(base_dir: str = "cfb_2023"):
    """Create standardized project directory structure."""

    directories = [
        f"{base_dir}/data/raw/games",
        f"{base_dir}/data/raw/plays",
        f"{base_dir}/data/raw/stats",
        f"{base_dir}/data/processed",
        f"{base_dir}/data/cache",
        f"{base_dir}/notebooks",
        f"{base_dir}/src",
        f"{base_dir}/output/figures",
        f"{base_dir}/output/reports",
        f"{base_dir}/docs",
    ]

    for directory in directories:
        os.makedirs(directory, exist_ok=True)
        print(f"Created: {directory}")

    # Create README files
    readme_content = {
        f"{base_dir}/data/raw": "# Raw Data\n\nOriginal data from CFBD API. Do not modify.",
        f"{base_dir}/data/processed": "# Processed Data\n\nCleaned and transformed data.",
        f"{base_dir}/data/cache": "# Cache\n\nTemporary API response cache.",
        f"{base_dir}/docs": "# Documentation\n\nData dictionaries and notes.",
    }

    for path, content in readme_content.items():
        readme_path = f"{path}/README.md"
        with open(readme_path, "w") as f:
            f.write(content)
        print(f"Created: {readme_path}")

    print(f"\nProject structure created at: {base_dir}/")

if __name__ == "__main__":
    create_project_structure()

Configuration

Create a configuration file for API settings:

# config.py

import os
from dataclasses import dataclass

@dataclass
class Config:
    """Configuration for data collection."""

    # API Settings
    api_key: str = os.environ.get("CFBD_API_KEY", "")
    base_url: str = "https://api.collegefootballdata.com"
    rate_limit_delay: float = 0.5  # Seconds between requests

    # Data Settings
    season: int = 2023
    data_dir: str = "cfb_2023/data"

    # Validation
    expected_fbs_teams: int = 133
    expected_games_per_team: tuple = (11, 15)  # Min, max

    def validate(self):
        """Validate configuration."""
        if not self.api_key:
            raise ValueError("CFBD_API_KEY environment variable not set")

config = Config()
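
Before running any collection script, export the API key and give the configuration a quick check. A minimal sketch (the shell command assumes a Unix-like environment):

# check_config.py — quick sanity check before collecting anything.
# In the shell first:  export CFBD_API_KEY="your-key-here"

from config import config

config.validate()  # raises ValueError if CFBD_API_KEY is not set
print(f"Collecting season {config.season} into {config.data_dir}/")
print(f"Rate limit delay: {config.rate_limit_delay}s between requests")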

Phase 2: Data Collection

API Client

Build a reusable API client:

# api_client.py

import requests
import time
import json
import os
from typing import Optional, Dict, Any, List
from config import config

class CFBDClient:
    """Client for College Football Data API."""

    def __init__(self):
        self.base_url = config.base_url
        self.headers = {
            "Authorization": f"Bearer {config.api_key}",
            "Accept": "application/json"
        }
        self.cache_dir = f"{config.data_dir}/cache"

    def _cache_path(self, endpoint: str, params: Dict) -> str:
        """Generate cache file path for request."""
        param_str = "_".join(f"{k}={v}" for k, v in sorted(params.items()))
        filename = f"{endpoint.replace('/', '_')}_{param_str}.json"
        return os.path.join(self.cache_dir, filename)

    def _get_cached(self, cache_path: str) -> Optional[Any]:
        """Load data from cache if exists."""
        if os.path.exists(cache_path):
            with open(cache_path, "r") as f:
                return json.load(f)
        return None

    def _save_cache(self, cache_path: str, data: Any):
        """Save data to cache."""
        os.makedirs(os.path.dirname(cache_path), exist_ok=True)
        with open(cache_path, "w") as f:
            json.dump(data, f)

    def request(
        self,
        endpoint: str,
        params: Optional[Dict[str, Any]] = None,
        use_cache: bool = True
    ) -> Any:
        """
        Make API request with caching.

        Parameters
        ----------
        endpoint : str
            API endpoint (e.g., "/games")
        params : Dict
            Query parameters
        use_cache : bool
            Whether to use cached responses

        Returns
        -------
        Any
            Parsed JSON response
        """
        params = params or {}

        # Check cache
        cache_path = self._cache_path(endpoint, params)
        if use_cache:
            cached = self._get_cached(cache_path)
            if cached is not None:
                print(f"  [CACHE] {endpoint}")
                return cached

        # Make request
        url = f"{self.base_url}{endpoint}"
        print(f"  [API] {endpoint} {params}")

        response = requests.get(url, headers=self.headers, params=params)
        response.raise_for_status()

        data = response.json()

        # Save to cache
        if use_cache:
            self._save_cache(cache_path, data)

        # Rate limiting
        time.sleep(config.rate_limit_delay)

        return data

    def get_games(self, year: int, **kwargs) -> List[Dict]:
        """Get games for a season."""
        return self.request("/games", {"year": year, **kwargs})

    def get_plays(self, year: int, week: int, **kwargs) -> List[Dict]:
        """Get plays for a specific week."""
        return self.request("/plays", {"year": year, "week": week, **kwargs})

    def get_team_stats(self, year: int, **kwargs) -> List[Dict]:
        """Get team season statistics."""
        return self.request("/stats/season", {"year": year, **kwargs})

    def get_teams(self, **kwargs) -> List[Dict]:
        """Get team information."""
        return self.request("/teams", kwargs)

Collecting Games

# collect_games.py

import pandas as pd
from api_client import CFBDClient
from config import config

def collect_all_games(year: int = config.season) -> pd.DataFrame:
    """
    Collect all FBS games for a season.

    Parameters
    ----------
    year : int
        Season year

    Returns
    -------
    pd.DataFrame
        All games for the season
    """
    client = CFBDClient()
    print(f"Collecting {year} games...")

    # Get regular season games
    regular = client.get_games(year, seasonType="regular")
    print(f"  Regular season: {len(regular)} games")

    # Get postseason games
    postseason = client.get_games(year, seasonType="postseason")
    print(f"  Postseason: {len(postseason)} games")

    # Combine
    all_games = regular + postseason
    games_df = pd.DataFrame(all_games)

    print(f"  Total: {len(games_df)} games")

    return games_df

def save_games(games_df: pd.DataFrame, year: int = config.season):
    """Save games to raw data directory."""
    # Save as Parquet
    parquet_path = f"{config.data_dir}/raw/games/games_{year}.parquet"
    games_df.to_parquet(parquet_path, index=False)
    print(f"Saved: {parquet_path}")

    # Save as CSV backup
    csv_path = f"{config.data_dir}/raw/games/games_{year}.csv"
    games_df.to_csv(csv_path, index=False)
    print(f"Saved: {csv_path}")

if __name__ == "__main__":
    games = collect_all_games()
    save_games(games)
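
Before collecting plays, a quick look at the saved games confirms the columns match what the validation phase expects. This sketch assumes the snake_case names (week, home_team) used throughout the case study; adjust if your responses use different casing:

# inspect_games.py — quick sanity look at the collected games.

import pandas as pd
from config import config

games = pd.read_parquet(f"{config.data_dir}/raw/games/games_{config.season}.parquet")

print(games.shape)
print(games.columns.tolist())

# Games per week is a fast visual check for weeks that failed to download
if "week" in games.columns:
    print(games.groupby("week").size())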

Collecting Play-by-Play Data

# collect_plays.py

import pandas as pd
from api_client import CFBDClient
from config import config
from typing import List

def collect_plays_for_week(year: int, week: int, client: CFBDClient) -> pd.DataFrame:
    """Collect play-by-play data for a specific week."""
    plays = client.get_plays(year, week)
    if plays:
        df = pd.DataFrame(plays)
        df["collection_week"] = week
        return df
    return pd.DataFrame()

def collect_all_plays(year: int = config.season) -> pd.DataFrame:
    """
    Collect play-by-play data for entire season.

    Parameters
    ----------
    year : int
        Season year

    Returns
    -------
    pd.DataFrame
        All plays for the season
    """
    client = CFBDClient()
    print(f"Collecting {year} play-by-play data...")

    all_plays = []

    # Regular season weeks (1-15 covers most schedules)
    for week in range(1, 16):
        print(f"  Week {week}...")
        week_plays = collect_plays_for_week(year, week, client)
        if not week_plays.empty:
            all_plays.append(week_plays)
            print(f"    {len(week_plays)} plays")

    # Postseason
    print("  Postseason...")
    post_plays = client.get_plays(year, 1, seasonType="postseason")
    if post_plays:
        post_df = pd.DataFrame(post_plays)
        post_df["collection_week"] = "postseason"
        all_plays.append(post_df)
        print(f"    {len(post_df)} plays")

    # Combine
    plays_df = pd.concat(all_plays, ignore_index=True)
    print(f"  Total: {len(plays_df):,} plays")

    return plays_df

def save_plays(plays_df: pd.DataFrame, year: int = config.season):
    """Save plays to raw data directory."""
    # Save as Parquet (much smaller for large datasets)
    parquet_path = f"{config.data_dir}/raw/plays/plays_{year}.parquet"
    plays_df.to_parquet(parquet_path, index=False)
    print(f"Saved: {parquet_path}")

    # Get file size
    import os
    size_mb = os.path.getsize(parquet_path) / (1024 * 1024)
    print(f"  File size: {size_mb:.1f} MB")

if __name__ == "__main__":
    plays = collect_all_plays()
    save_plays(plays)
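
Requirement 3 calls for season-level team statistics, and the client already exposes get_team_stats for the /stats/season endpoint. A minimal collection script in the same pattern as the games script (the columns returned may differ from those shown elsewhere in this case study):

# collect_stats.py

import pandas as pd
from api_client import CFBDClient
from config import config

def collect_team_stats(year: int = config.season) -> pd.DataFrame:
    """Collect season-level team statistics."""
    client = CFBDClient()
    print(f"Collecting {year} team season stats...")

    stats = client.get_team_stats(year)
    stats_df = pd.DataFrame(stats)
    print(f"  Rows: {len(stats_df)}")

    return stats_df

def save_team_stats(stats_df: pd.DataFrame, year: int = config.season):
    """Save team stats to the raw data directory."""
    parquet_path = f"{config.data_dir}/raw/stats/stats_{year}.parquet"
    stats_df.to_parquet(parquet_path, index=False)
    print(f"Saved: {parquet_path}")

if __name__ == "__main__":
    save_team_stats(collect_team_stats())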

Phase 3: Data Validation

Validation Functions

# validate_data.py

import pandas as pd
from config import config
from typing import Any, Dict, List

class DataValidator:
    """Validate collected football data."""

    def __init__(self):
        self.issues: List[str] = []
        self.stats: Dict[str, Any] = {}

    def validate_games(self, games_df: pd.DataFrame) -> bool:
        """Validate games dataset."""
        print("\nValidating games...")

        # Check number of games
        num_games = len(games_df)
        expected_min = config.expected_fbs_teams * config.expected_games_per_team[0] / 2
        expected_max = config.expected_fbs_teams * config.expected_games_per_team[1] / 2

        self.stats["num_games"] = num_games

        if num_games < expected_min:
            self.issues.append(f"Too few games: {num_games} (expected >{expected_min})")
        else:
            print(f"  ✓ Game count: {num_games}")

        # Check for missing scores
        missing_scores = games_df["home_points"].isna().sum()
        if missing_scores > 0:
            # Some might be future games - check dates
            print(f"  ⚠ Games with missing scores: {missing_scores}")

        # Check for duplicate games
        if "id" in games_df.columns:
            duplicates = games_df["id"].duplicated().sum()
            if duplicates > 0:
                self.issues.append(f"Duplicate game IDs: {duplicates}")
            else:
                print(f"  ✓ No duplicate game IDs")

        # Check for required columns
        required_cols = ["id", "home_team", "away_team", "home_points", "away_points"]
        missing_cols = [c for c in required_cols if c not in games_df.columns]
        if missing_cols:
            self.issues.append(f"Missing columns: {missing_cols}")
        else:
            print(f"  ✓ All required columns present")

        # Check score validity
        completed_games = games_df.dropna(subset=["home_points", "away_points"])
        invalid_scores = (
            (completed_games["home_points"] < 0) |
            (completed_games["away_points"] < 0) |
            (completed_games["home_points"] > 100) |
            (completed_games["away_points"] > 100)
        ).sum()

        if invalid_scores > 0:
            self.issues.append(f"Invalid scores: {invalid_scores}")
        else:
            print(f"  ✓ All scores in valid range")

        return len(self.issues) == 0

    def validate_plays(self, plays_df: pd.DataFrame) -> bool:
        """Validate play-by-play dataset."""
        print("\nValidating plays...")

        num_plays = len(plays_df)
        self.stats["num_plays"] = num_plays
        print(f"  Total plays: {num_plays:,}")

        # Check for required columns
        required_cols = ["id", "offense", "defense", "down", "distance", "yards_gained"]
        missing_cols = [c for c in required_cols if c not in plays_df.columns]

        if missing_cols:
            print(f"  ⚠ Missing columns: {missing_cols}")

        # Check down values
        if "down" in plays_df.columns:
            invalid_downs = ~plays_df["down"].isin([1, 2, 3, 4, None])
            if invalid_downs.any():
                self.issues.append(f"Invalid down values: {invalid_downs.sum()}")
            else:
                print(f"  ✓ All down values valid")

        # Check for extreme yard values
        if "yards_gained" in plays_df.columns:
            extreme_yards = (
                (plays_df["yards_gained"] < -50) |
                (plays_df["yards_gained"] > 100)
            )
            if extreme_yards.any():
                print(f"  ⚠ Extreme yard values: {extreme_yards.sum()}")
                # These might be valid (long TDs, big losses) - just flag

        # Check play type distribution
        if "play_type" in plays_df.columns:
            play_types = plays_df["play_type"].value_counts()
            print(f"  Play type distribution:")
            for pt, count in play_types.head(5).items():
                print(f"    {pt}: {count:,}")

        return len(self.issues) == 0

    def generate_report(self) -> str:
        """Generate validation report."""
        report = ["=" * 60]
        report.append("DATA VALIDATION REPORT")
        report.append("=" * 60)
        report.append("")

        report.append("STATISTICS:")
        for key, value in self.stats.items():
            report.append(f"  {key}: {value:,}" if isinstance(value, int) else f"  {key}: {value}")
        report.append("")

        if self.issues:
            report.append("ISSUES FOUND:")
            for issue in self.issues:
                report.append(f"  ❌ {issue}")
        else:
            report.append("✓ NO ISSUES FOUND")

        report.append("")
        report.append("=" * 60)

        return "\n".join(report)


def run_validation():
    """Run full validation on collected data."""
    validator = DataValidator()

    # Load and validate games
    games = pd.read_parquet(f"{config.data_dir}/raw/games/games_{config.season}.parquet")
    validator.validate_games(games)

    # Load and validate plays
    plays = pd.read_parquet(f"{config.data_dir}/raw/plays/plays_{config.season}.parquet")
    validator.validate_plays(plays)

    # Print report
    print(validator.generate_report())

    # Save report
    report_path = f"{config.data_dir}/../docs/validation_report.txt"
    with open(report_path, "w") as f:
        f.write(validator.generate_report())
    print(f"Report saved: {report_path}")

if __name__ == "__main__":
    run_validation()
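
The validator checks each dataset in isolation; a cross-dataset check catches gaps neither would see alone. A sketch, assuming the game_id column listed in the data dictionary below:

# cross_check.py — every play should point at a game we actually collected.

import pandas as pd
from config import config

games = pd.read_parquet(f"{config.data_dir}/raw/games/games_{config.season}.parquet")
plays = pd.read_parquet(f"{config.data_dir}/raw/plays/plays_{config.season}.parquet")

# Plays whose game_id is missing from the games table indicate a collection gap
orphan_plays = ~plays["game_id"].isin(games["id"])
print(f"Plays with no matching game: {orphan_plays.sum():,}")

# Games with no plays at all may mean a week failed to download
covered = plays["game_id"].nunique()
print(f"Games covered by play-by-play: {covered} of {len(games)}")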

Phase 4: Documentation

Data Dictionary

Create comprehensive documentation:

# 2023 Season Database - Data Dictionary

## games_2023.parquet

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| id | int64 | Unique game identifier | 401520180 |
| season | int32 | Season year | 2023 |
| week | int32 | Week number (1-15 regular season; postseason games are identified by season_type) | 12 |
| season_type | string | "regular" or "postseason" | regular |
| start_date | datetime | Game date and time | 2023-11-25T15:30:00 |
| home_team | string | Home team name | Alabama |
| away_team | string | Away team name | Auburn |
| home_points | int32 | Home team final score | 27 |
| away_points | int32 | Away team final score | 24 |
| venue | string | Stadium name | Bryant-Denny Stadium |
| attendance | int32 | Attendance (may be null) | 101821 |

## plays_2023.parquet

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| id | int64 | Unique play identifier | 401520180001 |
| game_id | int64 | Parent game ID | 401520180 |
| drive_id | int64 | Parent drive ID | 4015201801 |
| period | int32 | Quarter (1-4, 5+ OT) | 2 |
| clock | string | Time remaining MM:SS | 7:32 |
| offense | string | Team on offense | Alabama |
| defense | string | Team on defense | Auburn |
| down | int32 | Down number (1-4) | 3 |
| distance | int32 | Yards to first down | 7 |
| yard_line | int32 | Field position (1-99) | 65 |
| yards_gained | int32 | Yards gained on play | 12 |
| play_type | string | Type of play | Pass |
| play_text | string | Play description | Milroe pass complete to... |

## Collection Notes

- **Source:** College Football Data API (collegefootballdata.com)
- **Collection Date:** [DATE]
- **Season:** 2023 regular season + bowl games
- **Coverage:** All FBS teams

## Known Limitations

1. Some bowl games may be missing if collected before bowl season completion
2. Attendance data incomplete for some games
3. A small number of plays may have missing yards_gained values

Results Summary

What We Built

After completing this case study, you have:

cfb_2023/
├── data/
│   ├── raw/
│   │   ├── games/
│   │   │   ├── games_2023.parquet (850 games)
│   │   │   └── games_2023.csv
│   │   └── plays/
│   │       └── plays_2023.parquet (150,000+ plays)
│   ├── processed/
│   └── cache/
├── docs/
│   ├── data_dictionary.md
│   └── validation_report.txt
└── src/
    ├── config.py
    ├── api_client.py
    ├── collect_games.py
    ├── collect_plays.py
    └── validate_data.py

Key Metrics

| Metric | Value |
|--------|-------|
| Games collected | ~850 |
| Plays collected | ~150,000 |
| FBS teams represented | 133 |
| Raw data size (Parquet) | ~25 MB |
| Raw data size (CSV) | ~120 MB |
| API requests made | ~20 |
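
These values are approximate; they can be recomputed directly from the saved files:

# summarize_database.py — recompute the headline metrics from disk.

import os

import pandas as pd
from config import config

games = pd.read_parquet(f"{config.data_dir}/raw/games/games_{config.season}.parquet")
plays = pd.read_parquet(f"{config.data_dir}/raw/plays/plays_{config.season}.parquet")

print(f"Games collected: {len(games)}")
print(f"Plays collected: {len(plays):,}")

# Note: this count includes any non-FBS opponents appearing on FBS schedules
teams = pd.concat([games["home_team"], games["away_team"]]).nunique()
print(f"Teams appearing in the schedule: {teams}")

for kind in ("games", "plays"):
    path = f"{config.data_dir}/raw/{kind}/{kind}_{config.season}.parquet"
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:.1f} MB")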

Discussion Questions

  1. What would you change if you needed to collect data for multiple seasons?

  2. How would you handle API rate limit errors gracefully?

  3. What additional validation checks would increase confidence in data quality?

  4. How might you automate this process to run weekly during the season?


Your Turn: Extensions

Option A: Add recruiting data collection to the pipeline

Option B: Build an incremental update system that only fetches new games

Option C: Create visualizations to explore the collected data


Key Takeaways

  1. Plan before collecting - Design your structure and validation criteria first
  2. Cache aggressively - Minimize API calls through local caching
  3. Validate early - Catch data issues before analysis begins
  4. Document everything - Future you will thank present you
  5. Use efficient formats - Parquet saves significant space for large datasets