Key Takeaways: The Data Landscape of NCAA Football

One-Page Summary

Data Types and Granularity

AGGREGATION HIERARCHY

Play-by-Play (most detailed)
  │  - One row per play
  │  - Includes: down, distance, yards, play type
  │  - Required for: EPA, success rate, situational analysis
  ↓
Drive-Level
  │  - One row per possession
  │  - Includes: start/end position, result, play count
  ↓
Game-Level
  │  - One row per game
  │  - Traditional box score stats
  ↓
Season-Level (most summarized)
     - One row per team per season

Key insight: You can always aggregate up, but you can't disaggregate down.
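
A minimal sketch of "aggregating up" with pandas, assuming a toy play-by-play table. The column names (drive_id, offense, yards) are illustrative and not the actual CFBD schema.

```python
import pandas as pd

# Toy play-by-play table: one row per play (made-up values).
plays = pd.DataFrame({
    "game_id":  [1, 1, 1, 1, 1, 1],
    "drive_id": [10, 10, 10, 11, 11, 11],
    "offense":  ["Alabama", "Alabama", "Alabama", "Auburn", "Auburn", "Auburn"],
    "yards":    [5, 12, -2, 3, 0, 40],
})

# Play -> drive: one row per possession.
drives = plays.groupby(["game_id", "drive_id", "offense"], as_index=False).agg(
    play_count=("yards", "size"),
    total_yards=("yards", "sum"),
)

# Drive -> game: one row per team per game.
games = drives.groupby(["game_id", "offense"], as_index=False).agg(
    drives=("drive_id", "count"),
    total_yards=("total_yards", "sum"),
)

print(drives)
print(games)
```

Going the other way is not possible: the games table no longer knows how the yards were split across individual plays.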

Primary Data Sources

Source	Best For	Access	Cost
CFBD API	Play-by-play, stats, recruiting	API	Free
Sports Reference	Historical research (back to 1869)	Web	Free
ESPN	Quick lookups, real-time	Web	Free
PFF	Grades, detailed charting	Subscription	$$$

CFBD is your primary tool: it is free, comprehensive, and offers programmatic access.

API Essentials

Key Concepts:

- Endpoint: URL that returns specific data (e.g., /games, /plays)
- Parameters: Filters for your request (year=2023, team=Alabama)
- Rate Limit: Maximum requests per hour (~1000 for CFBD)
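
One way to stay under a rate limit is to space requests evenly. The sketch below is an illustration only: the Throttle class is hypothetical (not part of any CFBD client), and the limit is passed in as a parameter, so use whatever figure applies to your account. A fake clock is used in the demo so it finishes instantly.

```python
import time


class Throttle:
    """Space calls so that no more than max_per_hour happen per hour."""

    def __init__(self, max_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demo with a fake clock and fake sleep (no real waiting).
fake_now = [0.0]
naps = []


def fake_sleep(seconds):
    naps.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(1000, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # in real code, the API request would follow here
print(naps)
```

In real use, create the object as Throttle(1000) and call wait() before each request.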

Required Setup:

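# 1. Get API key from collegefootballdata.com
# 2. Store in environment variable (NEVER hardcode!)
import os

api_key = os.environ.get("CFBD_API_KEY")
if not api_key:
    # Fail early instead of sending "Bearer None" to the API
    raise RuntimeError("Set the CFBD_API_KEY environment variable first")

# 3. Include in request header
headers = {"Authorization": f"Bearer {api_key}"}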
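
To show how endpoint, parameters, and header fit together, here is a sketch that builds a request without sending it. The base URL is an assumption (confirm it in the CFBD documentation), build_request is a hypothetical helper, and "demo-key" is a placeholder.

```python
from urllib.parse import urlencode

# Assumed base URL for the CFBD API; verify against the official docs.
BASE_URL = "https://api.collegefootballdata.com"


def build_request(endpoint, api_key, **params):
    """Return (url, headers) for a GET request; nothing is sent."""
    query = urlencode(sorted(params.items()))  # sorted for a stable URL
    url = f"{BASE_URL}{endpoint}"
    if query:
        url = f"{url}?{query}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


url, headers = build_request("/games", "demo-key", year=2023, team="Alabama")
print(url)
print(headers)
```

The resulting pair can be handed to an HTTP client, for example requests.get(url, headers=headers, timeout=30).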

File Formats

Format	Use When	Pros	Cons
CSV	Sharing, inspection	Human-readable, universal	No types, large files
JSON	API responses, nested data	Preserves structure	Not ideal for tables
Parquet	Large datasets (>100K rows)	~80% smaller than CSV, fast to read	Not human-readable
SQLite	Complex queries, multiple tables	Relationships, SQL	More setup

Rule of thumb: Use Parquet for play-by-play data.
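
The "No types" drawback of CSV can be seen with a round trip through an in-memory buffer. This is a small illustration with made-up game IDs and attendance; Parquet, by contrast, stores column types in the file itself.

```python
import io

import pandas as pd

# Made-up rows with a datetime column and a nullable integer column.
games = pd.DataFrame({
    "game_id": [401520001, 401520002],
    "start_date": pd.to_datetime(["2023-09-02", "2023-09-09"]),
    "attendance": pd.array([100077, None], dtype="Int64"),
})

# Write to CSV and read it back.
buffer = io.StringIO()
games.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print(games.dtypes)      # datetime and nullable-integer types
print(roundtrip.dtypes)  # dates come back as text, attendance as float
```

With CSV you have to restore types yourself on every load, for example with read_csv's parse_dates and dtype arguments.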

Data Quality Issues

Issue	Example	How to Detect
Missing values	Null attendance	`df.isnull().sum()`
Typos	"Alamaba" vs "Alabama"	`df['team'].value_counts()`
Duplicates	Same game twice	`df.duplicated().sum()`
Invalid values	Score of -7	Check ranges: `df.describe()`
Definitional differences	Different yard calculations	Cross-source validation
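
The detection calls in the table can be tried on a small table that contains one of each problem. All values below are made up for illustration.

```python
import pandas as pd

# Toy data with a duplicate row, a null, a typo, and a negative score.
df = pd.DataFrame({
    "game_id":    [1, 1, 2, 3, 4, 5],
    "team":       ["Alabama", "Alabama", "Alabama", "Alamaba", "Auburn", "Auburn"],
    "points":     [35, 35, 24, 21, -7, 17],
    "attendance": [100077, 100077, None, 98000, 87451, 87000],
})

# Missing values: count nulls per column.
missing = df.isnull().sum()

# Typos: names that appear only once deserve a second look.
counts = df["team"].value_counts()
rare_teams = counts[counts == 1]

# Duplicates: rows identical to an earlier row.
n_duplicates = int(df.duplicated().sum())

# Invalid values: scores can never be negative.
bad_scores = df[df["points"] < 0]

print(missing, rare_teams, n_duplicates, bad_scores, sep="\n\n")
```

A rare name is only a hint, not proof of a typo; a team with a single game in the data would be flagged the same way.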

Project Organization

project/
├── data/
│   ├── raw/          # Never modify! Original from API
│   ├── processed/    # Cleaned, transformed
│   └── cache/        # Temporary API cache
├── notebooks/        # Jupyter notebooks
├── src/              # Python scripts
├── output/           # Results, figures
└── docs/             # Documentation
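
The layout can be created with a few lines of pathlib. This is a sketch: create_project is a hypothetical helper, and the demo writes into a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path

# Folders from the layout above, relative to the project root.
SUBDIRS = [
    "data/raw",
    "data/processed",
    "data/cache",
    "notebooks",
    "src",
    "output",
    "docs",
]


def create_project(root):
    """Create the folder skeleton under root; safe to run more than once."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "project")
    created = sorted(
        p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_dir()
    )

print(created)
```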

Essential Code Patterns

Pattern 1: Caching API requests

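import os
import pandas as pd

# params_to_filename() and fetch_from_api() are project helpers you define yourself
def get_data_cached(params, cache_dir="data/cache"):
    os.makedirs(cache_dir, exist_ok=True)   # Create cache folder if missing
    cache_file = os.path.join(cache_dir, params_to_filename(params))
    if os.path.exists(cache_file):
        return pd.read_parquet(cache_file)  # Cache hit
    data = fetch_from_api(params)           # Cache miss
    data.to_parquet(cache_file)
    return data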
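
The caching pattern relies on a params_to_filename helper that is not defined here. One possible sketch, assuming params is a plain dict of request filters: hash a canonical form of the parameters so the same request always maps to the same file.

```python
import hashlib
import json


def params_to_filename(params, suffix=".parquet"):
    """Map a dict of request parameters to a stable cache filename."""
    # sort_keys makes the result independent of key order;
    # default=str handles values json cannot serialize directly.
    canonical = json.dumps(params, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{digest}{suffix}"


a = params_to_filename({"year": 2023, "team": "Alabama"})
b = params_to_filename({"team": "Alabama", "year": 2023})
c = params_to_filename({"year": 2022, "team": "Alabama"})
print(a == b, a == c)
```

Hashing keeps filenames short and free of awkward characters, at the cost of not being readable; storing the parameters alongside the data is one way to compensate.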

Pattern 2: Data validation

# Check completeness
assert len(df) > 0, "Empty dataset"
assert df["game_id"].notna().all(), "Missing game IDs"

# Check ranges
assert df["points"].between(0, 100).all(), "Invalid scores"

# Check for duplicates
assert df["id"].is_unique, "Duplicate records"
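
Assertions stop at the first failure and are skipped when Python runs with the -O flag. As an alternative sketch, the same checks can be collected into a list of problems. The function name validate_games and the max_points default are illustrative choices, and the column names follow the asserts above.

```python
import pandas as pd


def validate_games(df, max_points=100):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    if df.empty:
        return ["Empty dataset"]  # remaining checks need rows and columns
    if df["game_id"].isna().any():
        problems.append("Missing game IDs")
    if not df["points"].between(0, max_points).all():
        problems.append("Invalid scores")
    if not df["id"].is_unique:
        problems.append("Duplicate records")
    return problems


clean = pd.DataFrame({"id": [1, 2], "game_id": [10, 11], "points": [35, 21]})
dirty = pd.DataFrame({"id": [1, 1], "game_id": [10, None], "points": [35, -7]})

print(validate_games(clean))
print(validate_games(dirty))
```

Note that between() treats a missing score as out of range, so null points are reported as invalid scores.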

Documentation Checklist

For every dataset, document:

[ ] Source: Where did the data come from?
[ ] Collection date: When was it retrieved?
[ ] Columns: What does each field mean?
[ ] Coverage: What's included/excluded?
[ ] Known issues: Any data quality problems?
[ ] Processing: What transformations were applied?
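
The checklist can be stored next to the data as a small JSON record. A sketch, assuming a hypothetical DatasetRecord structure; every value below is a placeholder, not a real dataset description.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    """One field per checklist item."""
    source: str
    collection_date: str
    columns: dict
    coverage: str
    known_issues: list = field(default_factory=list)
    processing: list = field(default_factory=list)


# Placeholder values for illustration only.
record = DatasetRecord(
    source="CFBD API, /plays endpoint",
    collection_date="2024-01-15",
    columns={"down": "Down number (1-4)", "distance": "Yards to first down"},
    coverage="Example: one season of play-by-play",
    known_issues=["Example: some games missing attendance"],
    processing=["Example: removed duplicate rows"],
)

text = json.dumps(asdict(record), indent=2)
print(text)
```

Writing text to a file such as README.json in the same folder as the dataset keeps the documentation from drifting away from the data.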

Key Terms Quick Reference

Term	Definition
API	Interface for programmatic data access
Endpoint	Specific URL returning particular data
Play-by-play	Data with one row per play
Rate limit	Max requests allowed per time period
Data dictionary	Documentation of column definitions
Parquet	Efficient columnar file format
Cache	Local storage of API responses

Decision Tree: Choosing a Data Source

What data do you need?
│
├── Play-by-play detail?
│   └── Yes → CFBD API (/plays endpoint)
│
├── Data before 2000?
│   └── Yes → Sports Reference
│
├── Player grades/charting?
│   └── Yes → PFF (paid)
│
├── Quick lookup?
│   └── Yes → ESPN website
│
└── Default → CFBD API
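
The same tree written as a function, checking the questions in the order shown. The function and argument names are illustrative.

```python
def choose_source(play_by_play=False, before_2000=False,
                  player_grades=False, quick_lookup=False):
    """Walk the decision tree top to bottom and return the first match."""
    if play_by_play:
        return "CFBD API (/plays endpoint)"
    if before_2000:
        return "Sports Reference"
    if player_grades:
        return "PFF (paid)"
    if quick_lookup:
        return "ESPN website"
    return "CFBD API"  # default


print(choose_source(before_2000=True))
print(choose_source())
```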

Looking Ahead

Chapter 3 introduces Python for Sports Analytics:

- pandas for data manipulation
- NumPy for numerical computing
- matplotlib for visualization
- Working with the football data you now know how to access