Key Takeaways: The Data Landscape of NCAA Football
One-Page Summary
Data Types and Granularity
AGGREGATION HIERARCHY
Play-by-Play (most detailed)
│ - One row per play
│ - Includes: down, distance, yards, play type
│ - Required for: EPA, success rate, situational analysis
↓
Drive-Level
│ - One row per possession
│ - Includes: start/end position, result, play count
↓
Game-Level
│ - One row per game
│ - Traditional box score stats
↓
Season-Level (most summarized)
- One row per team per season
Key insight: You can always aggregate up, but you can't disaggregate down.
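As a minimal sketch of aggregating up the hierarchy, the pandas example below rolls play-level rows into drive-level and then game-level rows. The column names (`game_id`, `drive_id`, `team`, `yards`) and the five sample plays are invented for illustration and are not the CFBD schema.

```python
import pandas as pd

# Illustrative play-level rows; column names and values are made up.
plays = pd.DataFrame(
    {
        "game_id": [1, 1, 1, 1, 2],
        "drive_id": [10, 10, 11, 11, 20],
        "team": ["A", "A", "B", "B", "A"],
        "yards": [5, 12, -2, 30, 7],
    }
)

# Play-level -> drive-level: one row per possession.
drives = (
    plays.groupby(["game_id", "drive_id", "team"], as_index=False)
    .agg(play_count=("yards", "size"), total_yards=("yards", "sum"))
)

# Drive-level -> game-level: one row per team per game.
games = (
    drives.groupby(["game_id", "team"], as_index=False)
    .agg(drive_count=("drive_id", "nunique"), total_yards=("total_yards", "sum"))
)
```

The reverse direction has no equivalent: from `games` alone there is no way to recover how many yards each individual play gained.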
Primary Data Sources
| Source | Best For | Access | Cost |
|---|---|---|---|
| CFBD API | Play-by-play, stats, recruiting | API | Free |
| Sports Reference | Historical research (back to 1869) | Web | Free |
| ESPN | Quick lookups, real-time | Web | Free |
| PFF | Grades, detailed charting | Subscription | $$$ |
CFBD is your primary tool: it is free, comprehensive, and offers programmatic access.
API Essentials
Key Concepts:
- Endpoint: URL that returns specific data (e.g., /games, /plays)
- Parameters: Filters for your request (year=2023, team=Alabama)
- Rate Limit: Maximum requests allowed per time period (roughly 1,000 per hour for CFBD; confirm the current limit for your key)
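One simple way to stay under a rate limit is to space requests evenly. The sketch below is a hypothetical helper, not part of any CFBD client; the `Throttle` name and its injectable `clock` and `sleep` arguments are assumptions made so the logic can be tested without actually waiting.

```python
import time


class Throttle:
    """Space calls at least 3600 / max_per_hour seconds apart."""

    def __init__(self, max_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each API request keeps the request rate at or below the configured limit, at the cost of a short pause between calls.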
Required Setup:
# 1. Get API key from collegefootballdata.com
# 2. Store in environment variable (NEVER hardcode!)
import os
api_key = os.environ["CFBD_API_KEY"]  # raises KeyError if the variable is unset
# 3. Include in request header
headers = {"Authorization": f"Bearer {api_key}"}
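To show how an endpoint, parameters, and the authorization header fit together, here is a sketch that builds a request without sending it, using only the standard library. The base URL is an assumption to confirm against the CFBD documentation, and `build_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL; confirm against the CFBD documentation.
BASE_URL = "https://api.collegefootballdata.com"


def build_request(endpoint, api_key, **params):
    """Return an unsent Request for `endpoint` with a Bearer auth header."""
    if not api_key:
        raise ValueError("API key is missing; set CFBD_API_KEY first")
    query = urlencode(params)
    url = f"{BASE_URL}{endpoint}?{query}" if query else f"{BASE_URL}{endpoint}"
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})
```

For example, `build_request("/games", api_key, year=2023, team="Alabama")` produces a request whose URL ends in `/games?year=2023&team=Alabama`. Passing that object to `urllib.request.urlopen` would send it.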
File Formats
| Format | Use When | Pros | Cons |
|---|---|---|---|
| CSV | Sharing, inspection | Human-readable, universal | No types, large files |
| JSON | API responses, nested data | Preserves structure | Not ideal for tables |
| Parquet | Large datasets (>100K rows) | Much smaller than CSV (up to ~80%), fast, keeps column types | Not human-readable |
| SQLite | Complex queries, multiple tables | Relationships, SQL | More setup |
Rule of thumb: Use Parquet for play-by-play data.
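The "no types" drawback of CSV and the "preserves structure" advantage of JSON can be seen in a small round trip. This is an illustrative sketch using the standard library; the record and its field names are made up.

```python
import csv
import io
import json

# Illustrative record; field names and values are made up.
game = {"game_id": 401, "home_points": 27, "teams": {"home": "A", "away": "B"}}

# JSON round trip: numbers stay numbers and nesting is preserved.
from_json = json.loads(json.dumps(game))

# CSV round trip: only flat columns fit, and every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["game_id", "home_points"])
writer.writeheader()
writer.writerow({"game_id": game["game_id"], "home_points": game["home_points"]})
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
```

After the round trip, `from_json["home_points"]` is still the integer `27`, while `from_csv["home_points"]` is the string `"27"` and must be converted before any arithmetic.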
Data Quality Issues
| Issue | Example | How to Detect |
|---|---|---|
| Missing values | Null attendance | df.isnull().sum() |
| Typos | "Alamaba" vs "Alabama" | df['team'].value_counts() |
| Duplicates | Same game twice | df.duplicated().sum() |
| Invalid values | Score of -7 | Check ranges: df.describe() |
| Definitional differences | Different yard calculations | Cross-source validation |
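Scanning `value_counts()` output by eye works for small datasets. As a complement, the sketch below flags unrecognized team names and suggests the closest valid one using the standard library's `difflib`. The `CANONICAL_TEAMS` list and the `suggest_corrections` helper are assumptions for illustration; in practice the valid names would come from a trusted reference table.

```python
from difflib import get_close_matches

# Assumed list of valid team names, for illustration only.
CANONICAL_TEAMS = ["Alabama", "Auburn", "Georgia"]


def suggest_corrections(names, canonical=CANONICAL_TEAMS):
    """Map each unrecognized name to its closest canonical name (or None)."""
    suggestions = {}
    for name in set(names):
        if name in canonical:
            continue
        matches = get_close_matches(name, canonical, n=1)
        suggestions[name] = matches[0] if matches else None
    return suggestions
```

A name that maps to `None` has no close match and probably needs manual review rather than an automatic fix.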
Project Organization
project/
├── data/
│ ├── raw/ # Never modify! Original from API
│ ├── processed/ # Cleaned, transformed
│ └── cache/ # Temporary API cache
├── notebooks/ # Jupyter notebooks
├── src/ # Python scripts
├── output/ # Results, figures
└── docs/ # Documentation
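If you want to create this layout programmatically, a short `pathlib` sketch along these lines should work. The `create_project` helper name is an assumption; the directory names are taken from the tree above.

```python
from pathlib import Path

# Directory names taken from the layout above.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "data/cache",
    "notebooks",
    "src",
    "output",
    "docs",
]


def create_project(root):
    """Create the project skeleton under `root` and return the created paths."""
    root = Path(root)
    created = []
    for relative in PROJECT_DIRS:
        path = root / relative
        # exist_ok=True makes the function safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```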
Essential Code Patterns
Pattern 1: Caching API requests
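import os
import pandas as pd

def get_data_cached(params, cache_dir="data/cache"):
    # params_to_filename and fetch_from_api are helpers you supply
    cache_file = os.path.join(cache_dir, params_to_filename(params))
    if os.path.exists(cache_file):
        return pd.read_parquet(cache_file)  # Cache hit
    data = fetch_from_api(params)  # Cache miss
    os.makedirs(cache_dir, exist_ok=True)  # Create the cache folder on first use
    data.to_parquet(cache_file)
    return data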
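Pattern 1 relies on a `params_to_filename` helper that is not defined here. One possible sketch, assuming the parameters are a flat dict of JSON-serializable values, hashes a sorted representation so that the same request always maps to the same file.

```python
import hashlib
import json


def params_to_filename(params, extension=".parquet"):
    """Build a deterministic cache filename from a dict of request parameters."""
    # Sorting keys makes {"year": 2023, "team": "Alabama"} and
    # {"team": "Alabama", "year": 2023} map to the same file.
    canonical = json.dumps(params, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{digest}{extension}"
```

Hashing keeps filenames short and free of characters that are awkward in paths. The trade-off is that the filename no longer shows which request it holds, so a readable prefix may be worth adding.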
Pattern 2: Data validation
# Check completeness
assert len(df) > 0, "Empty dataset"
assert df["game_id"].notna().all(), "Missing game IDs"
# Check ranges (raise the upper bound for historical data, where some scores exceed 100)
assert df["points"].between(0, 100).all(), "Invalid scores"
# Check for duplicates
assert df["id"].is_unique, "Duplicate records"
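Assertions stop at the first failure. When you would rather see every problem at once, the same checks can be collected into a list. The sketch below assumes the same column names used in Pattern 2 (`game_id`, `points`, `id`); the `validate` function name and the `max_points` parameter are illustrative.

```python
import pandas as pd


def validate(df, max_points=100):
    """Return a list of problems found in `df` (an empty list means it passed)."""
    if len(df) == 0:
        return ["Empty dataset"]
    problems = []
    if df["game_id"].isna().any():
        problems.append("Missing game IDs")
    if not df["points"].between(0, max_points).all():
        problems.append("Invalid scores")
    if not df["id"].is_unique:
        problems.append("Duplicate records")
    return problems
```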
Documentation Checklist
For every dataset, document:
- [ ] Source: Where did the data come from?
- [ ] Collection date: When was it retrieved?
- [ ] Columns: What does each field mean?
- [ ] Coverage: What's included/excluded?
- [ ] Known issues: Any data quality problems?
- [ ] Processing: What transformations were applied?
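One way to make the checklist enforceable is to keep the answers in a small metadata record next to the data and check it for gaps. In the sketch below, the field names mirror the checklist, while `missing_fields` and the example values are placeholders for illustration and not a real dataset record.

```python
# Field names mirror the checklist above.
REQUIRED_FIELDS = [
    "source",
    "collection_date",
    "columns",
    "coverage",
    "known_issues",
    "processing",
]


def missing_fields(metadata):
    """Return the checklist fields that are absent or empty in `metadata`."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Illustrative entry; the values are placeholders.
example = {
    "source": "CFBD API, /plays endpoint",
    "collection_date": "2024-01-15",
    "columns": {"down": "Down number"},
    "coverage": "One season of play-by-play data",
    "known_issues": "",
    "processing": "None; raw API output",
}
```

Here `missing_fields(example)` reports `known_issues`, because an empty string counts as unanswered. Writing "none found" explicitly records that someone actually checked.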
Key Terms Quick Reference
| Term | Definition |
|---|---|
| API | Interface for programmatic data access |
| Endpoint | Specific URL returning particular data |
| Play-by-play | Data with one row per play |
| Rate limit | Max requests allowed per time period |
| Data dictionary | Documentation of column definitions |
| Parquet | Efficient columnar file format |
| Cache | Local storage of API responses |
Decision Tree: Choosing a Data Source
What data do you need?
│
├── Play-by-play detail?
│ └── Yes → CFBD API (/plays endpoint)
│
├── Data before 2000?
│ └── Yes → Sports Reference
│
├── Player grades/charting?
│ └── Yes → PFF (paid)
│
├── Quick lookup?
│ └── Yes → ESPN website
│
└── Default → CFBD API
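The tree reads top to bottom, so the first "yes" wins. Written as a function, that ordering becomes explicit. The function and argument names here are illustrative.

```python
def choose_source(play_by_play=False, before_2000=False,
                  player_grades=False, quick_lookup=False):
    """Return the suggested source, checking questions in the tree's order."""
    if play_by_play:
        return "CFBD API (/plays endpoint)"
    if before_2000:
        return "Sports Reference"
    if player_grades:
        return "PFF (paid)"
    if quick_lookup:
        return "ESPN website"
    return "CFBD API"
```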
Looking Ahead
Chapter 3 introduces Python for Sports Analytics:
- pandas for data manipulation
- NumPy for numerical computing
- matplotlib for visualization
- Working with the football data you now know how to access