Appendix C: Data Sources and APIs

Free Data Sources

College Football Data API

Website: https://collegefootballdata.com

Description: Comprehensive API for college football data including play-by-play, team statistics, recruiting, and more.

Access: Free with API key registration

Available Data:
- Play-by-play data (2001-present)
- Team and game statistics
- Recruiting information
- Drive data
- Betting lines
- Team talent composites

Python Example:

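import requests

API_KEY = 'your_api_key'  # issued when you register on the website
headers = {'Authorization': f'Bearer {API_KEY}'}

# Fetch games for one team and season
response = requests.get(
    'https://api.collegefootballdata.com/games',
    params={'year': 2024, 'team': 'Alabama'},
    headers=headers,
    timeout=30,
)
response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body
games = response.json()

# Fetch plays for one team in a single week
response = requests.get(
    'https://api.collegefootballdata.com/plays',
    params={'year': 2024, 'week': 1, 'team': 'Alabama'},
    headers=headers,
    timeout=30,
)
response.raise_for_status()
plays = response.json()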

Rate Limits: 100 requests per minute (limits may change; confirm against the current API documentation)


Sports Reference / College Football Reference

Website: https://www.sports-reference.com/cfb/

Description: Historical college football statistics and records.

Access: Free (web scraping with appropriate limits)

Available Data:
- Season statistics (1869-present)
- Player statistics
- Team records
- Bowl game results
- Draft history

Python Example:

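import pandas as pd

# Read every HTML table on the page into a list of DataFrames
# (pd.read_html needs an HTML parser such as lxml installed)
url = 'https://www.sports-reference.com/cfb/years/2024-team-offense.html'
tables = pd.read_html(url)

# The first table holds the team offense statistics; its columns may come
# back as a multi-level header that needs flattening before analysis
team_offense = tables[0]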

Note: Respect robots.txt and rate limits. Consider using their data export features.
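
The note above can be made concrete with the standard library. The following is a minimal sketch, not a definitive scraper: it parses a robots.txt body, filters out disallowed URLs, and paces the remaining requests. The sample rules, the user-agent string cfb-research-bot, and the 3-second delay are illustrative assumptions, not Sports Reference's actual policy; read the site's live robots.txt and terms of use before scraping.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cfb/private/
Crawl-delay: 3
"""


def build_parser(robots_txt):
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent, default=3.0):
    """Use the site's Crawl-delay if it declares one, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def paced(urls, delay, sleep=time.sleep):
    """Yield URLs one at a time, pausing `delay` seconds between them."""
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)
        yield url
```

In use, each URL yielded by paced(allowed_urls(...), polite_delay(...)) would be passed to pd.read_html or requests.get, so that no disallowed page is requested and consecutive requests are spaced out.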


cfbfastR (R Package)

Repository: https://github.com/sportsdataverse/cfbfastR

Description: R package for accessing college football data with pre-calculated EPA.

Installation:

install.packages("cfbfastR")

Usage:

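library(cfbfastR)

# Load play-by-play with EPA
pbp <- load_cfb_pbp(seasons = 2024)

# The cfbd_* functions call the College Football Data API and need an API key
Sys.setenv(CFBD_API_KEY = "your_api_key")

# Get team information
teams <- cfbd_team_info()

# Get recruiting data
recruits <- cfbd_recruiting_player(year = 2024)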

nflfastR (NFL Data)

Repository: https://github.com/nflverse/nflfastR

Description: R package for NFL play-by-play data. Although it focuses on the NFL, its concepts and methodologies transfer to college football.

Python Access (via nfl_data_py):

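import nfl_data_py as nfl

# Load play-by-play for the listed seasons
pbp = nfl.import_pbp_data([2023])

# Load roster data (recent nfl_data_py releases expose seasonal and weekly
# roster loaders rather than a single import_rosters function)
rosters = nfl.import_seasonal_rosters([2023])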

Commercial/Professional Data Sources

Sportradar

Website: https://sportradar.com

Description: Professional sports data provider used by leagues and media.

Available Data:
- Real-time play-by-play
- Player tracking data
- Detailed game statistics
- Odds and betting data

Access: Commercial license required


Stats Perform / Opta

Website: https://www.statsperform.com

Description: Professional data provider with extensive sports coverage.

Available Data:
- Advanced statistics
- Player tracking
- Expected metrics
- Historical archives

Access: Commercial license required


Pro Football Focus (PFF)

Website: https://www.pff.com

Description: Detailed player grades and advanced metrics.

Available Data:
- Player grades (offense, defense, special teams)
- Position-specific metrics
- Premium statistics
- Coverage and pressure data

Access:
- Basic: Free registration
- Premium: Subscription required
- API: Commercial license


Catapult / Zebra Technologies

Website: https://www.catapultsports.com

Description: Player tracking and performance monitoring.

Available Data:
- GPS tracking
- Accelerometer data
- Load monitoring
- Player movement patterns

Access: Team-level contracts; typically not publicly available


Academic and Research Data

Kaggle Datasets

Website: https://www.kaggle.com

Relevant Datasets:
- NFL Big Data Bowl datasets (tracking data)
- College football historical data
- Fantasy football datasets

Access: Free with registration


Harvard Dataverse

Website: https://dataverse.harvard.edu

Description: Research data repository with sports analytics datasets.

Access: Free; some datasets require registration


Data Quality Considerations

Validation Checks

When using any data source, verify:

  1. Completeness
     - All expected games present
     - All plays within games recorded
     - No unexpected gaps in sequences

  2. Accuracy
     - Scores sum correctly
     - Yard line progressions make sense
     - Play results align with descriptions

  3. Consistency
     - Team names standardized
     - Date formats uniform
     - ID schemes consistent
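
One check from each of the three categories above can be sketched as a small pure function. This is an illustrative sketch only: the inputs (a list of play numbers, per-quarter point totals, a set of canonical team names) are assumptions about how the data might be shaped, and real sources will use different field names.

```python
def find_sequence_gaps(play_numbers):
    """Completeness: return the play numbers missing between min and max."""
    present = set(play_numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def quarter_scores_match(points_by_quarter, final_score):
    """Accuracy: points scored in each period should sum to the final score."""
    return sum(points_by_quarter) == final_score


def unmapped_team_names(names, canonical_names):
    """Consistency: return names that are not in the canonical set."""
    return sorted(set(names) - set(canonical_names))


# Example inputs (hypothetical data)
gaps = find_sequence_gaps([1, 2, 4, 5, 7])
scores_ok = quarter_scores_match([7, 10, 0, 14], 31)
unknown = unmapped_team_names(["Ohio State", "OSU", "Michigan"],
                              {"Ohio State", "Michigan"})
```

Returning the offending values, rather than a bare pass or fail, makes it easier to decide whether a game should be repaired or flagged.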

Common Issues

Issue             | Description                   | Solution
------------------|-------------------------------|----------------------------------
Missing plays     | Plays not recorded            | Fill gaps or flag affected games
Duplicate records | Same play appears twice       | Deduplicate by unique identifiers
Name variations   | "Ohio State" vs "OSU"         | Create standardization mapping
Timezone issues   | Game times in different zones | Normalize to single timezone
Encoding errors   | Special characters corrupted  | Specify encoding when reading
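
Three of the fixes listed above (deduplication, name standardization, and timezone normalization) can be sketched with the standard library alone. The alias table, the key fields game_id and play_id, and the ISO-8601 timestamp format are assumptions for illustration; note that an abbreviation such as "OSU" is ambiguous across schools, so a real mapping should be built per data source.

```python
from datetime import datetime, timezone

# Hypothetical alias table; build one per data source in practice.
TEAM_ALIASES = {"OSU": "Ohio State", "Ohio St.": "Ohio State"}


def standardize_team(name, aliases=TEAM_ALIASES):
    """Map a known alias to its canonical name; leave other names unchanged."""
    cleaned = name.strip()
    return aliases.get(cleaned, cleaned)


def deduplicate(plays, key_fields=("game_id", "play_id")):
    """Keep the first occurrence of each play, identified by key_fields."""
    seen = set()
    unique = []
    for play in plays:
        key = tuple(play.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(play)
    return unique


def to_utc(iso_timestamp):
    """Convert an ISO-8601 timestamp with a UTC offset to UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Without an offset the source timezone is unknown, so refuse to guess
        raise ValueError("Timestamp has no UTC offset")
    return parsed.astimezone(timezone.utc)


# Example usage with hypothetical records
plays = [
    {"game_id": 1, "play_id": 10, "offense": "OSU"},
    {"game_id": 1, "play_id": 10, "offense": "OSU"},  # duplicate record
    {"game_id": 1, "play_id": 11, "offense": " Ohio State "},
]
clean = deduplicate(plays)
teams = [standardize_team(p["offense"]) for p in clean]
kickoff_utc = to_utc("2024-09-07T19:30:00-04:00")
```

Storing every timestamp in UTC and converting to local time only for display avoids mixing zones within one table.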

Data Pipeline Best Practices

class DataValidator:
    """Validate incoming football data."""

    REQUIRED_FIELDS = ('game_id', 'play_id', 'down', 'distance')

    def validate_play(self, play):
        """Return a list of error messages for a single play dict."""
        errors = []

        # Required fields must be present and non-null
        for field in self.REQUIRED_FIELDS:
            if play.get(field) is None:
                errors.append(f"Missing: {field}")

        # Range checks; compare against None so that a value of 0 is still checked
        down = play.get('down')
        if down is not None and not 1 <= down <= 4:
            errors.append(f"Invalid down: {down}")

        yard_line = play.get('yard_line')
        if yard_line is not None and not 0 <= yard_line <= 100:
            errors.append(f"Invalid yard_line: {yard_line}")

        return errors

    def validate_game(self, plays):
        """Validate all plays in a game (plays must be in game order)."""
        errors = []

        # Check play sequence; a missing play_number also breaks the sequence
        play_numbers = [p.get('play_number') for p in plays]
        if None in play_numbers or sorted(play_numbers) != list(range(1, len(plays) + 1)):
            errors.append("Non-sequential play numbers")

        # Check score progression: neither team's score should decrease
        for i in range(1, len(plays)):
            prev, play = plays[i - 1], plays[i]
            for side in ('home_score', 'away_score'):
                if (play.get(side) or 0) < (prev.get(side) or 0):
                    errors.append(f"{side} decreased at play index {i}")

        return errors

API Rate Limiting

Best Practices

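import time
from functools import wraps

import requests

BASE_URL = 'https://api.example.com'  # placeholder; set to your provider's base URL

def rate_limit(calls_per_minute):
    """Decorator that spaces calls at least 60 / calls_per_minute seconds apart."""
    min_interval = 60.0 / calls_per_minute

    def decorator(func):
        last_called = None  # monotonic timestamp of the previous call

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal last_called
            if last_called is not None:
                # The monotonic clock is unaffected by system clock changes
                elapsed = time.monotonic() - last_called
                if elapsed < min_interval:
                    time.sleep(min_interval - elapsed)
            last_called = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(calls_per_minute=60)
def fetch_plays(game_id):
    """Fetch plays with rate limiting."""
    # Endpoint and parameter names vary by provider; check its documentation
    response = requests.get(f'{BASE_URL}/plays', params={'gameId': game_id}, timeout=30)
    response.raise_for_status()
    return response.json()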

Exponential Backoff

import random
import time

import requests

def fetch_with_backoff(url, max_retries=5, timeout=30):
    """Fetch JSON with exponential backoff and jitter on failure."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            # Give up after the last attempt and re-raise the original error
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus up to 1s of random jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Retry {attempt + 1} in {wait_time:.1f}s: {e}")
            time.sleep(wait_time)

Data Storage Recommendations

For Learning/Personal Projects

SQLite: Simple, file-based database

import sqlite3

# df is a pandas DataFrame of plays loaded earlier
conn = sqlite3.connect('cfb_data.db')
df.to_sql('plays', conn, if_exists='replace', index=False)  # overwrites an existing table
conn.close()
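
Once plays are stored, they can be queried with plain SQL. The sketch below uses an in-memory database so that it runs without any files; the table layout (game_id, play_id, down, yards_gained) and the sample rows are hypothetical and stand in for whatever columns your source provides.

```python
import sqlite3


def average_yards_by_down(conn):
    """Return a {down: average yards gained} mapping from the plays table."""
    cursor = conn.execute(
        "SELECT down, AVG(yards_gained) FROM plays GROUP BY down ORDER BY down"
    )
    return dict(cursor.fetchall())


# In-memory database for illustration; pass a file path to persist the data
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE plays (
        game_id INTEGER,
        play_id INTEGER,
        down INTEGER,
        yards_gained REAL,
        PRIMARY KEY (game_id, play_id)  -- rejects duplicate plays on insert
    )
    """
)

# Hypothetical sample rows: (game_id, play_id, down, yards_gained)
rows = [(1, 1, 1, 4.0), (1, 2, 2, 6.0), (1, 3, 1, 10.0)]
conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?)", rows)
conn.commit()

result = average_yards_by_down(conn)
conn.close()
```

Declaring a primary key on the play identifiers lets the database enforce the deduplication rule described earlier instead of relying on application code alone.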

For Team/Production Use

PostgreSQL: Robust relational database

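from sqlalchemy import create_engine

# Requires a PostgreSQL driver such as psycopg2; replace the placeholder credentials
engine = create_engine('postgresql://user:pass@localhost/cfb')
df.to_sql('plays', engine, if_exists='append', index=False)  # adds rows to the table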

For Large-Scale Analytics

Cloud Data Warehouse: BigQuery, Snowflake, or Redshift for large datasets and complex queries.


Legal and Ethical Considerations

  1. Terms of Service: Always review and comply with data provider terms
  2. Rate Limits: Respect rate limits to avoid bans
  3. Attribution: Credit data sources appropriately
  4. Privacy: Handle any personal data (player information) responsibly
  5. Commercial Use: Verify licensing for commercial applications