Appendix C: Data Sources and APIs
Free Data Sources
College Football Data API
Website: https://collegefootballdata.com
Description: Comprehensive API for college football data including play-by-play, team statistics, recruiting, and more.
Access: Free with API key registration
Available Data:
- Play-by-play data (2001-present)
- Team and game statistics
- Recruiting information
- Drive data
- Betting lines
- Team talent composites
Python Example:
import requests

API_KEY = 'your_api_key'
headers = {'Authorization': f'Bearer {API_KEY}'}

# Fetch games
response = requests.get(
    'https://api.collegefootballdata.com/games',
    params={'year': 2024, 'team': 'Alabama'},
    headers=headers,
    timeout=30,
)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
games = response.json()

# Fetch plays
response = requests.get(
    'https://api.collegefootballdata.com/plays',
    params={'year': 2024, 'week': 1, 'team': 'Alabama'},
    headers=headers,
    timeout=30,
)
response.raise_for_status()
plays = response.json()
Rate Limits: 100 requests per minute (limits can change, so confirm against the current API documentation)
Sports Reference / College Football Reference
Website: https://www.sports-reference.com/cfb/
Description: Historical college football statistics and records.
Access: Free (web scraping with appropriate limits)
Available Data:
- Season statistics (1869-present)
- Player statistics
- Team records
- Bowl game results
- Draft history
Python Example:
import pandas as pd
# Read table directly
url = 'https://www.sports-reference.com/cfb/years/2024-team-offense.html'
tables = pd.read_html(url)
team_offense = tables[0]
Note: Respect robots.txt and rate limits. Consider using their data export features.
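The note above can be made concrete with the standard library's `urllib.robotparser`. The following is a minimal sketch, not the site's actual policy: the robots.txt content is a hypothetical sample so the example runs offline, and the helper names `allowed` and `polite_delay` are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the sketch runs offline.
# For a real site, call parser.set_url("<site>/robots.txt") and parser.read().
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
Crawl-delay: 3
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS)


def allowed(parser, url, user_agent="*"):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser, user_agent="*", default=3.0):
    """Return the crawl delay requested by robots.txt, or a default in seconds."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

In a scraping loop, one would check `allowed(...)` before each request and call `time.sleep(polite_delay(parser))` between requests.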
cfbfastR (R Package)
Repository: https://github.com/sportsdataverse/cfbfastR
Description: R package for accessing college football data with pre-calculated EPA.
Installation:
install.packages("cfbfastR")
Usage:
library(cfbfastR)
# Load play-by-play with EPA
pbp <- load_cfb_pbp(seasons = 2024)
# Get team information
teams <- cfbd_team_info()
# Get recruiting data
recruits <- cfbd_recruiting_player(year = 2024)
nflfastR (NFL Data)
Repository: https://github.com/nflverse/nflfastR
Description: R package for NFL play-by-play data. Although it targets the NFL, its concepts and methods transfer to college football.
Python Access (via nfl_data_py):
import nfl_data_py as nfl
# Load play-by-play
pbp = nfl.import_pbp_data([2023])
# Load roster data (older nfl_data_py releases exposed this as import_rosters)
rosters = nfl.import_seasonal_rosters([2023])
Commercial/Professional Data Sources
Sportradar
Website: https://sportradar.com
Description: Professional sports data provider used by leagues and media.
Available Data:
- Real-time play-by-play
- Player tracking data
- Detailed game statistics
- Odds and betting data
Access: Commercial license required
Stats Perform / Opta
Website: https://www.statsperform.com
Description: Professional data provider with extensive sports coverage.
Available Data:
- Advanced statistics
- Player tracking
- Expected metrics
- Historical archives
Access: Commercial license required
Pro Football Focus (PFF)
Website: https://www.pff.com
Description: Detailed player grades and advanced metrics.
Available Data:
- Player grades (offense, defense, special teams)
- Position-specific metrics
- Premium statistics
- Coverage and pressure data
Access:
- Basic: Free registration
- Premium: Subscription required
- API: Commercial license
Catapult / Zebra Technologies
Website: https://www.catapultsports.com
Description: Player tracking and performance monitoring.
Available Data:
- GPS tracking
- Accelerometer data
- Load monitoring
- Player movement patterns
Access: Team-level contracts; typically not publicly available
Academic and Research Data
Kaggle Datasets
Website: https://www.kaggle.com
Relevant Datasets:
- NFL Big Data Bowl datasets (tracking data)
- College football historical data
- Fantasy football datasets
Access: Free with registration
Harvard Dataverse
Website: https://dataverse.harvard.edu
Description: Research data repository with sports analytics datasets.
Access: Free; some datasets require registration
Data Quality Considerations
Validation Checks
When using any data source, verify:
- Completeness
  - All expected games present
  - All plays within games recorded
  - No unexpected gaps in sequences
- Accuracy
  - Scores sum correctly
  - Yard line progressions make sense
  - Play results align with descriptions
- Consistency
  - Team names standardized
  - Date formats uniform
  - ID schemes consistent
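A few of these checks can be sketched as small pure functions. This is an illustration only: the field name `game_id` and the function names are assumptions, and a real dataset may use different keys.

```python
def missing_games(expected_ids, plays):
    """Completeness: expected game IDs that have no recorded plays."""
    seen = {play["game_id"] for play in plays}
    return sorted(set(expected_ids) - seen)


def sequence_gaps(play_numbers):
    """Completeness: play numbers absent between the lowest and highest seen."""
    present = set(play_numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def score_matches_periods(final_score, period_scores):
    """Accuracy: per-period points should sum to the final score."""
    return sum(period_scores) == final_score
```

Keeping each check separate makes it easy to report which rule failed and to flag a game rather than discard it.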
Common Issues
| Issue | Description | Solution |
|---|---|---|
| Missing plays | Plays not recorded | Fill gaps or flag affected games |
| Duplicate records | Same play appears twice | Deduplicate by unique identifiers |
| Name variations | "Ohio State" vs "OSU" | Create standardization mapping |
| Timezone issues | Game times in different zones | Normalize to single timezone |
| Encoding errors | Special characters corrupted | Specify encoding when reading |
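Three of the solutions in the table (name standardization, deduplication, and timezone normalization) might look like the sketch below. The alias mapping, the `game_id`/`play_id` keys, and the timestamp format are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Illustrative alias mapping; a real project would maintain a fuller table.
TEAM_ALIASES = {"OSU": "Ohio State", "Ohio St.": "Ohio State"}


def standardize_team(name):
    """Map a known alias to its canonical team name; pass other names through."""
    name = name.strip()
    return TEAM_ALIASES.get(name, name)


def deduplicate(plays):
    """Keep the first occurrence of each (game_id, play_id) pair, in order."""
    seen = set()
    unique = []
    for play in plays:
        key = (play["game_id"], play["play_id"])
        if key not in seen:
            seen.add(key)
            unique.append(play)
    return unique


def to_utc(local_str, tz):
    """Convert a 'YYYY-MM-DD HH:MM' local kickoff time to UTC.

    tz is any tzinfo object, for example a fixed offset as used in the
    usage line below.
    """
    local = datetime.strptime(local_str, "%Y-%m-%d %H:%M").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


# Usage: a 7:30 PM kickoff at UTC-4 becomes 23:30 UTC.
kickoff_utc = to_utc("2024-08-31 19:30", timezone(timedelta(hours=-4)))
```

A fixed offset ignores daylight saving time; passing `zoneinfo.ZoneInfo("America/New_York")` as `tz` handles it where time zone data is installed. For encoding errors, passing an explicit `encoding=` argument when opening files is usually sufficient.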
Data Pipeline Best Practices
class DataValidator:
    """Validate incoming football data."""

    def validate_play(self, play):
        """Return a list of error strings for a single play dict."""
        errors = []

        # Required fields
        required = ['game_id', 'play_id', 'down', 'distance']
        for field in required:
            if field not in play or play[field] is None:
                errors.append(f"Missing: {field}")

        # Range checks (compare against None so that a value of 0 is still checked)
        down = play.get('down')
        if down is not None and not 1 <= down <= 4:
            errors.append(f"Invalid down: {down}")

        yard_line = play.get('yard_line')
        if yard_line is not None and not 0 <= yard_line <= 100:
            errors.append(f"Invalid yard_line: {yard_line}")

        return errors

    def validate_game(self, plays):
        """Validate all plays in a game."""
        errors = []

        # Check play sequence; a missing play_number counts as non-sequential
        play_numbers = [p.get('play_number') for p in plays]
        if None in play_numbers or sorted(play_numbers) != list(range(1, len(plays) + 1)):
            errors.append("Non-sequential play numbers")

        # Check score progression for both teams (a score should never decrease)
        for i, play in enumerate(plays[1:], 1):
            prev = plays[i - 1]
            for side in ('home_score', 'away_score'):
                if (play.get(side) or 0) < (prev.get(side) or 0):
                    errors.append(f"{side} decreased at play {i}")

        return errors
API Rate Limiting
Best Practices
import time
from functools import wraps

import requests

BASE_URL = 'https://api.collegefootballdata.com'

def rate_limit(calls_per_minute):
    """Decorator to rate limit API calls."""
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]  # mutable cell holding the time of the most recent call

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Sleep just long enough to keep calls min_interval apart
            elapsed = time.time() - last_called[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            last_called[0] = time.time()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(calls_per_minute=60)
def fetch_plays(game_id):
    """Fetch plays with rate limiting."""
    response = requests.get(f'{BASE_URL}/plays', params={'gameId': game_id}, timeout=30)
    response.raise_for_status()
    return response.json()
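Throttling controls how fast requests go out; caching responses on disk avoids repeating them at all, which also helps stay under a provider's limits. The following is a minimal sketch under stated assumptions: `JsonCache` is a hypothetical helper, and `fetch` stands for any callable that takes a URL and a params dict and returns JSON-serializable data.

```python
import hashlib
import json
from pathlib import Path


class JsonCache:
    """Store JSON responses on disk, keyed by URL and query parameters."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url, params):
        # Sort params so that key order does not change the cache key
        raw = json.dumps([url, sorted(params.items())])
        digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_fetch(self, url, params, fetch):
        """Return cached data if present; otherwise call fetch and store the result."""
        path = self._path(url, params)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        data = fetch(url, params)
        path.write_text(json.dumps(data), encoding="utf-8")
        return data
```

This sketch never expires entries, which suits completed seasons; data for in-progress games would need an expiry rule.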
Exponential Backoff
import random
import time

import requests

def fetch_with_backoff(url, max_retries=5):
    """Fetch with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            # Out of retries: re-raise the last error to the caller
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus up to 1s of random jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Retry {attempt + 1} in {wait_time:.1f}s: {e}")
            time.sleep(wait_time)
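The wait schedule in the retry loop above grows without bound as retries increase. One possible refinement, sketched here with the hypothetical parameters `base`, `cap`, and `rng`, is to isolate the schedule in a generator and cap each delay so a long outage does not produce multi-minute sleeps.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=30.0, rng=random.random):
    """Yield one delay per retry: capped exponential growth plus up to 1s of jitter.

    rng is injectable so the schedule can be tested deterministically.
    """
    for attempt in range(max_retries):
        yield min(cap, base * (2 ** attempt)) + rng()
```

Because the schedule is a pure generator, it can be checked without sleeping or making network calls, and the retry loop only needs to iterate over it.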
Data Storage Recommendations
For Learning/Personal Projects
SQLite: Simple, file-based database
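import sqlite3
# df is assumed to be a pandas DataFrame of plays loaded earlier
conn = sqlite3.connect('cfb_data.db')
df.to_sql('plays', conn, if_exists='replace', index=False)  # 'replace' drops and recreates the table
conn.close()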
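When data is reloaded repeatedly, a primary key lets SQLite reject duplicates instead of the whole table being rebuilt each time. The sketch below uses only the standard `sqlite3` module; the column set and the `(game_id, play_id)` key are assumptions about the schema, and an in-memory database is used so the example leaves no file behind.

```python
import sqlite3


def init_db(conn):
    """Create the plays table with a composite primary key if it is missing."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS plays (
            game_id  INTEGER NOT NULL,
            play_id  INTEGER NOT NULL,
            down     INTEGER,
            distance INTEGER,
            PRIMARY KEY (game_id, play_id)
        )
        """
    )


def upsert_plays(conn, plays):
    """Insert plays, replacing any existing row with the same key."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO plays (game_id, play_id, down, distance) "
            "VALUES (:game_id, :play_id, :down, :distance)",
            plays,
        )


conn = sqlite3.connect(":memory:")
init_db(conn)
first_load = [{"game_id": 1, "play_id": 1, "down": 1, "distance": 10}]
second_load = first_load + [{"game_id": 1, "play_id": 2, "down": 2, "distance": 7}]
upsert_plays(conn, first_load)
upsert_plays(conn, second_load)  # the repeated play overwrites itself
row_count = conn.execute("SELECT COUNT(*) FROM plays").fetchone()[0]
```

Loading overlapping batches this way is idempotent: rerunning a fetch does not create duplicate rows.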
For Team/Production Use
PostgreSQL: Robust relational database
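from sqlalchemy import create_engine
# Replace user, pass, and localhost with real connection details; a PostgreSQL driver such as psycopg2 must be installed
engine = create_engine('postgresql://user:pass@localhost/cfb')
df.to_sql('plays', engine, if_exists='append', index=False)  # 'append' adds rows to the existing table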
For Large-Scale Analytics
Cloud Data Warehouse: BigQuery, Snowflake, or Redshift for large datasets and complex queries.
Legal and Ethical Considerations
- Terms of Service: Always review and comply with data provider terms
- Rate Limits: Respect rate limits to avoid bans
- Attribution: Credit data sources appropriately
- Privacy: Handle any personal data (player information) responsibly
- Commercial Use: Verify licensing for commercial applications