Key Takeaways: The Data Landscape of NCAA Football


One-Page Summary

Data Types and Granularity

AGGREGATION HIERARCHY

Play-by-Play (most detailed)
  │  - One row per play
  │  - Includes: down, distance, yards, play type
  │  - Required for: EPA, success rate, situational analysis
  ↓
Drive-Level
  │  - One row per possession
  │  - Includes: start/end position, result, play count
  ↓
Game-Level
  │  - One row per game
  │  - Traditional box score stats
  ↓
Season-Level (most summarized)
     - One row per team per season

Key insight: You can always aggregate up, but you can't disaggregate down.
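
The one-way nature of aggregation can be seen in a small pandas sketch. The table and column names below (drive_id, yards_gained, and so on) are made up for illustration and are not the CFBD schema.

```python
import pandas as pd

# Toy play-by-play rows; column names are illustrative, not the CFBD schema.
plays = pd.DataFrame({
    "game_id":      [1, 1, 1, 1, 1],
    "drive_id":     [10, 10, 10, 11, 11],
    "offense":      ["Alabama", "Alabama", "Alabama", "Auburn", "Auburn"],
    "yards_gained": [5, 12, -2, 3, 40],
})

# Play level -> drive level: one row per possession.
drives = (
    plays.groupby(["game_id", "drive_id", "offense"], as_index=False)
         .agg(play_count=("yards_gained", "size"),
              drive_yards=("yards_gained", "sum"))
)

# Drive level -> game level: one row per team per game.
games = (
    drives.groupby(["game_id", "offense"], as_index=False)
          .agg(drive_count=("drive_id", "nunique"),
               total_yards=("drive_yards", "sum"))
)

print(drives)
print(games)
```

Going the other way is not possible: nothing in the games table records that Alabama's 15 yards came from plays of 5, 12, and -2.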


Primary Data Sources

Source           | Best For                           | Access       | Cost
CFBD API         | Play-by-play, stats, recruiting    | API          | Free
Sports Reference | Historical research (back to 1869) | Web          | Free
ESPN             | Quick lookups, real-time           | Web          | Free
PFF              | Grades, detailed charting          | Subscription | $$$

CFBD is your primary tool - free, comprehensive, programmatic access.


API Essentials

Key Concepts:

  • Endpoint: URL that returns specific data (e.g., /games, /plays)
  • Parameters: Filters for your request (year=2023, team=Alabama)
  • Rate Limit: Maximum requests per hour (~1000 for CFBD)
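
One way to stay under a rate limit is to space requests evenly. The sketch below is a minimal illustration, not part of any CFBD client: the Throttle class and its parameters are hypothetical, and the defaults simply reuse the ~1000 requests per hour figure above.

```python
import time

class Throttle:
    """Space calls at least period / max_calls seconds apart."""

    def __init__(self, max_calls=1000, period=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls  # 3.6 s with the defaults
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling wait() on a shared Throttle instance before each API request keeps a long download loop from bursting past the limit. The clock and sleep arguments exist only so the class can be tested without real delays.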

Required Setup:

# 1. Get API key from collegefootballdata.com
# 2. Store in environment variable (NEVER hardcode!)
import os
api_key = os.environ["CFBD_API_KEY"]  # Raises KeyError if unset, rather than sending "Bearer None"

# 3. Include in request header
headers = {"Authorization": f"Bearer {api_key}"}
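
To show how the endpoint, parameters, and header fit together, here is a sketch that uses only the standard library. The base URL is an assumption to confirm against the current CFBD documentation, and build_request and fetch_json are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for the CFBD API; confirm it against the current docs.
BASE_URL = "https://api.collegefootballdata.com"

def build_request(endpoint, api_key, **params):
    """Return an authenticated GET request for an endpoint such as "/games"."""
    query = urllib.parse.urlencode(params)
    url = f"{BASE_URL}{endpoint}?{query}" if query else f"{BASE_URL}{endpoint}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})

def fetch_json(request):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With a valid key, fetch_json(build_request("/games", api_key, year=2023, team="Alabama")) would send the request and return the decoded response. Keeping request construction separate from sending makes the URL and header easy to inspect before spending a call against the rate limit.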

File Formats

Format  | Use When                         | Pros                      | Cons
CSV     | Sharing, inspection              | Human-readable, universal | No types, large files
JSON    | API responses, nested data       | Preserves structure       | Not ideal for tables
Parquet | Large datasets (>100K rows)      | 80% smaller, fast         | Not human-readable
SQLite  | Complex queries, multiple tables | Relationships, SQL        | More setup

Rule of thumb: Use Parquet for play-by-play data.
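
The "no types" drawback of CSV is easy to demonstrate. The sketch below round-trips a tiny, made-up table through CSV in memory and compares column types before and after.

```python
import io
import pandas as pd

# Made-up rows with a datetime column and a categorical column.
plays = pd.DataFrame({
    "game_date": pd.to_datetime(["2023-09-02", "2023-09-09"]),
    "team": pd.Categorical(["Alabama", "Texas"]),
    "yards_gained": [7, -3],
})

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
plays.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(plays.dtypes)     # datetime and category types are present
print(from_csv.dtypes)  # both columns come back as plain strings
```

Writing the same table with plays.to_parquet(...) and reading it back with pd.read_parquet(...) keeps those types, but it requires a Parquet engine such as pyarrow to be installed.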


Data Quality Issues

Issue                    | Example                     | How to Detect
Missing values           | Null attendance             | df.isnull().sum()
Typos                    | "Alamaba" vs "Alabama"      | df['team'].value_counts()
Duplicates               | Same game twice             | df.duplicated().sum()
Invalid values           | Score of -7                 | Check ranges: df.describe()
Definitional differences | Different yard calculations | Cross-source validation
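
The detection calls above can be tried on a small table that contains one of each problem. The data here is invented for illustration.

```python
import pandas as pd

# Invented rows: one missing attendance, one misspelled team,
# one negative score, and one fully repeated row (game 101).
games = pd.DataFrame({
    "game_id":    [101, 102, 103, 104, 101],
    "team":       ["Alabama", "Alamaba", "Georgia", "Alabama", "Alabama"],
    "points":     [34, 21, -7, 27, 34],
    "attendance": [100077.0, None, 92746.0, 101821.0, 100077.0],
})

missing_per_column = games.isnull().sum()     # missing values per column
team_counts = games["team"].value_counts()    # rare spellings hint at typos
duplicate_rows = games.duplicated().sum()     # rows identical to an earlier row
negative_scores = games[games["points"] < 0]  # values outside the valid range

print(missing_per_column)
print(team_counts)
print(duplicate_rows)
print(negative_scores)
```

Definitional differences do not show up in checks like these, which is why the table lists cross-source validation for them.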

Project Organization

project/
├── data/
│   ├── raw/          # Never modify! Original from API
│   ├── processed/    # Cleaned, transformed
│   └── cache/        # Temporary API cache
├── notebooks/        # Jupyter notebooks
├── src/              # Python scripts
├── output/           # Results, figures
└── docs/             # Documentation
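
A layout like this can be created once with a short script. The following is a sketch; create_project is a hypothetical helper name, and the folder list simply mirrors the tree above.

```python
from pathlib import Path

# Folder names mirror the layout above.
PROJECT_DIRS = [
    "data/raw", "data/processed", "data/cache",
    "notebooks", "src", "output", "docs",
]

def create_project(root):
    """Create the standard folder layout under root; safe to re-run."""
    root = Path(root)
    for relative in PROJECT_DIRS:
        (root / relative).mkdir(parents=True, exist_ok=True)
    return root
```

Because exist_ok=True is set, running create_project("project") a second time leaves existing folders and their contents untouched.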

Essential Code Patterns

Pattern 1: Caching API requests

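import os
import pandas as pd

def get_data_cached(params, cache_dir="cache"):
    # params_to_filename and fetch_from_api are helpers you define yourself
    os.makedirs(cache_dir, exist_ok=True)   # Create the cache folder on first use
    cache_file = os.path.join(cache_dir, params_to_filename(params))
    if os.path.exists(cache_file):
        return pd.read_parquet(cache_file)  # Cache hit
    data = fetch_from_api(params)           # Cache miss
    data.to_parquet(cache_file)
    return data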
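
Pattern 1 calls params_to_filename without defining it. One possible implementation, offered as a sketch rather than the intended one, hashes the parameters so that the same request always maps to the same file name.

```python
import hashlib
import json

def params_to_filename(params, extension=".parquet"):
    """Map a dict of request parameters to a stable, filesystem-safe name."""
    # Sorting keys makes {"year": 2023, "team": "Alabama"} and
    # {"team": "Alabama", "year": 2023} hash to the same file.
    canonical = json.dumps(params, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{digest}{extension}"
```

A hash avoids problems with characters that are awkward in file names, at the cost of readability. If you prefer names you can recognize at a glance, joining sorted key=value pairs with underscores is a reasonable alternative for simple parameters.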

Pattern 2: Data validation

# Check completeness
assert len(df) > 0, "Empty dataset"
assert df["game_id"].notna().all(), "Missing game IDs"

# Check ranges
assert df["points"].between(0, 100).all(), "Invalid scores"

# Check for duplicates
assert df["id"].is_unique, "Duplicate records"

Documentation Checklist

For every dataset, document:

  • [ ] Source: Where did the data come from?
  • [ ] Collection date: When was it retrieved?
  • [ ] Columns: What does each field mean?
  • [ ] Coverage: What's included/excluded?
  • [ ] Known issues: Any data quality problems?
  • [ ] Processing: What transformations were applied?
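
One lightweight way to keep these answers next to the data is a small JSON file saved alongside each dataset. The sketch below is only a suggestion: write_dataset_notes and its field names are hypothetical, chosen to match the checklist items.

```python
import json
from datetime import date

def write_dataset_notes(path, source, columns, coverage,
                        known_issues=(), processing=(), collected=None):
    """Write the checklist items for one dataset to a JSON file."""
    notes = {
        "source": source,
        # Defaults to today's date when no collection date is given.
        "collection_date": collected or date.today().isoformat(),
        "columns": columns,  # {column name: meaning}
        "coverage": coverage,
        "known_issues": list(known_issues),
        "processing": list(processing),
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notes, handle, indent=2)
    return notes
```

Because the required arguments cover source, columns, and coverage, the function cannot be called without answering at least those checklist items.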

Key Terms Quick Reference

Term            | Definition
API             | Interface for programmatic data access
Endpoint        | Specific URL returning particular data
Play-by-play    | Data with one row per play
Rate limit      | Max requests allowed per time period
Data dictionary | Documentation of column definitions
Parquet         | Efficient columnar file format
Cache           | Local storage of API responses

Decision Tree: Choosing a Data Source

What data do you need?
│
├── Play-by-play detail?
│   └── Yes → CFBD API (/plays endpoint)
│
├── Data before 2000?
│   └── Yes → Sports Reference
│
├── Player grades/charting?
│   └── Yes → PFF (paid)
│
├── Quick lookup?
│   └── Yes → ESPN website
│
└── Default → CFBD API

Looking Ahead

Chapter 3 introduces Python for Sports Analytics:

  • pandas for data manipulation
  • NumPy for numerical computing
  • matplotlib for visualization
  • Working with the football data you now know how to access