
> "Data is the new oil. It's valuable, but if unrefined it cannot really be used."

Learning Objectives

  • Identify and describe the major sources of college football data
  • Explain the structure and content of play-by-play data
  • Access and retrieve data from the College Football Data API
  • Compare different data formats and their appropriate use cases
  • Evaluate data quality and recognize common data issues
  • Design an organized system for storing and managing football data

Chapter 2: The Data Landscape of NCAA Football

"Data is the new oil. It's valuable, but if unrefined it cannot really be used." — Clive Humby

Chapter Overview

In 2014, a group of college football enthusiasts launched the College Football Data API (CFBD), making detailed play-by-play data freely available to anyone with an internet connection. Before this, such data was locked behind expensive subscriptions or tedious manual collection. The democratization of college football data transformed the field—suddenly, a student with a laptop could perform analyses that previously required professional resources.

This chapter introduces you to the world of college football data. You will learn where data comes from, how it's structured, and how to access it. By the end, you will have retrieved your first dataset from the CFBD API and understand the landscape of information available for your analyses.

The data you learn to access here forms the foundation for everything else in this textbook. Every metric calculation, every visualization, and every predictive model starts with data. Understanding your data sources—their strengths, limitations, and quirks—is essential for producing trustworthy analysis.

In this chapter, you will learn to:

  • Navigate the major sources of college football data
  • Understand play-by-play data structure and content
  • Make API requests to retrieve data programmatically
  • Choose appropriate data formats for different use cases
  • Evaluate and document data quality issues


2.1 Understanding Football Data

Before diving into specific sources, let's understand what types of data exist in college football and how they relate to each other.

2.1.1 Play-by-Play Data Structure

Play-by-play (PBP) data is the most granular form of football data. Each row represents a single play, and columns capture everything that happened on that play.

A typical play-by-play record includes:

| Category | Fields | Example |
|----------|--------|---------|
| Game Context | game_id, home_team, away_team, season, week | 401520180, "Alabama", "Georgia", 2023, 14 |
| Time Context | period, clock, game_seconds_remaining | 2, "7:32", 1948 |
| Situation | down, distance, yard_line, yards_to_goal | 3, 7, 65, 35 |
| Play Details | play_type, play_text, yards_gained | "Pass", "Jalen Milroe pass complete to...", 12 |
| Outcome | first_down, touchdown, turnover, penalty | true, false, false, false |
| Advanced | epa, wpa, success | 0.85, 0.03, true |

💡 Intuition: Think of play-by-play data as a detailed transcript of the game. If you wanted to reconstruct exactly what happened, play-by-play data would let you follow along from kickoff to final whistle.

The power of play-by-play data lies in its granularity. Want to know how a team performs on third-and-long? Filter to those plays. Want to analyze red zone efficiency? Filter to plays inside the 20. The atomic unit of football analysis is the play, and PBP data gives you access to every one.
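For example, with play-by-play data loaded as a pandas DataFrame, situational filters are one-liners. The sketch below assumes column names matching the table above (down, distance, yards_to_goal) plus the offense and epa fields described later in this chapter; the file path is illustrative.

import pandas as pd

# Assumes a plays DataFrame with the fields described in this chapter;
# the file path is illustrative
plays = pd.read_parquet("data/processed/plays_with_epa.parquet")

# Third-and-long: third down with 7 or more yards to go
third_and_long = plays[(plays["down"] == 3) & (plays["distance"] >= 7)]

# Red zone: plays snapped inside the opponent's 20-yard line
red_zone = plays[plays["yards_to_goal"] <= 20]

# EPA per play by offense in each situation
print(third_and_long.groupby("offense")["epa"].mean().sort_values(ascending=False).head())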

2.1.2 Game-Level vs. Play-Level Data

Data exists at multiple levels of aggregation:

Play-Level Data - One row per play - Maximum detail and flexibility - Large file sizes (millions of rows per season) - Required for advanced metrics like EPA

Drive-Level Data - One row per possession - Includes start/end field position, result, plays count - Useful for studying possession efficiency - Medium granularity

Game-Level Data - One row per game (or per team-game) - Traditional box score statistics - Compact and easy to work with - Loses play-by-play context

Season-Level Data - One row per team per season - Aggregated totals and averages - Good for year-over-year comparisons - Maximum information loss

AGGREGATION HIERARCHY

Play-by-Play (most granular)
    ↓ aggregate
Drive-Level
    ↓ aggregate
Game-Level
    ↓ aggregate
Season-Level (most summarized)

⚠️ Common Pitfall: Beginners often work with pre-aggregated data because it's simpler. This limits what you can analyze. Start with play-level data when possible—you can always aggregate up, but you can't disaggregate down.
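To make "aggregate up" concrete, here is a minimal sketch that rolls play-level rows up to one row per team per game. Column names follow the play-by-play fields shown earlier; the file path is illustrative.

import pandas as pd

plays = pd.read_parquet("data/processed/plays_with_epa.parquet")

# One row per (game, offense): play counts, total yards, and per-play EPA
game_level = (
    plays.groupby(["game_id", "offense"])
    .agg(
        plays_run=("play_id", "count"),
        total_yards=("yards_gained", "sum"),
        epa_per_play=("epa", "mean"),
    )
    .reset_index()
)

print(game_level.head())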

2.1.3 Player-Level Data

Player data adds another dimension to analysis:

Player Statistics - Passing: completions, attempts, yards, TDs, INTs - Rushing: carries, yards, TDs, fumbles - Receiving: receptions, targets, yards, TDs - Defense: tackles, sacks, interceptions, pass breakups

Player Information - Name, position, jersey number - Height, weight, hometown - Class year, eligibility status - Recruiting ranking (for college players)

Player-Play Linkage - Which players were involved in each play - Allows attribution of plays to specific players - Essential for player evaluation

The challenge with player data is attribution. A completed pass involves at least the quarterback and receiver, but also the offensive line's pass protection and potentially other receivers running routes. Determining who deserves credit requires careful thinking about what questions you're trying to answer.


2.2 Primary Data Sources

Several organizations collect and distribute college football data. Understanding each source's strengths and limitations helps you choose the right one for your analysis.

2.2.1 College Football Data API (CFBD)

The College Football Data API is the primary free source for college football data and the foundation for most public college football analytics.

What CFBD Provides: - Play-by-play data from 2001 to present - Game results and box scores - Team and player statistics - Recruiting data and rankings - Betting lines and spreads - Pre-calculated advanced metrics (EPA, WPA, etc.) - Draft and NFL data for former college players

Accessing CFBD:

  1. Register for a free API key at collegefootballdata.com
  2. Use the API directly via HTTP requests
  3. Or use wrapper libraries (cfbd for Python, cfbfastR for R)

CFBD Strengths: - Free and open access - Comprehensive historical data - Active development and community - Pre-calculated advanced metrics - Well-documented API

CFBD Limitations: - No tracking data (player locations/movements) - Some historical data gaps - Occasional data entry errors - Rate limits on API requests

📊 Real-World Application: The vast majority of public college football analysis—from blog posts to academic research—relies on CFBD data. Learning to use CFBD effectively is the single most valuable data skill for college football analytics.

2.2.2 Sports Reference

Sports Reference (sports-reference.com/cfb) provides comprehensive historical statistics in a browsable web format.

What Sports Reference Provides: - Team and player statistics back to 1869 - Game logs and box scores - Historical records and milestones - Award voting results - Conference standings and results - Bowl game history

Sports Reference Strengths: - Deepest historical coverage - Clean, consistent formatting - Excellent for historical research - No API key needed for browsing

Sports Reference Limitations: - No official API (must scrape or use unofficial tools) - No play-by-play data - Terms of service restrict automated access - Pre-aggregated data only

Using Sports Reference Data:

While Sports Reference doesn't offer an official API, the data is structured consistently, making it possible to access programmatically:

# Example: Using pandas to read Sports Reference tables
# Note: Check Terms of Service before scraping
import pandas as pd

# Sports Reference provides data in HTML tables
# This is for educational illustration
url = "https://www.sports-reference.com/cfb/years/2023-standings.html"

# pandas can read HTML tables directly
# tables = pd.read_html(url)

2.2.3 ESPN and Official NCAA Statistics

ESPN and the NCAA provide official statistics through their websites.

What's Available: - Official NCAA statistics and records - ESPN's Team and Player pages - Real-time game updates - QBR and other ESPN-specific metrics - Depth charts and injury reports

Strengths: - Official/authoritative source - Real-time updates during games - Some proprietary metrics (QBR)

Limitations: - Limited API access - Less historical depth than other sources - Data often requires manual collection - Format changes can break scrapers

2.2.4 PFF and Premium Data Providers

Pro Football Focus (PFF) and similar services provide premium data not available elsewhere.

What PFF Provides: - Play-by-play grades for every player - Detailed charting (coverage assignments, pressure, etc.) - Snap counts by position - Premium metrics (grades 0-100 for each player)

Other Premium Providers: - Sports Info Solutions: Detailed charting data - Telemetry Sports: Tracking data - Pro Football Reference: NFL data with college crossover

Premium Data Strengths: - Information not available publicly - Human-reviewed play charting - Granular player attribution - Tracking/location data (some providers)

Premium Data Limitations: - Expensive subscriptions ($$$) - May require institutional access - Terms often restrict redistribution - Subjective elements (grades)

📝 Note: This textbook uses freely available data from CFBD. Premium data providers offer valuable information, but learning the fundamentals with free data prepares you to use any data source effectively.


2.3 Working with the CFBD API

Let's get hands-on with the College Football Data API. This section walks you through registration, making requests, and understanding responses.

2.3.1 API Fundamentals

An API (Application Programming Interface) allows programs to communicate with each other. The CFBD API lets your Python code request data from CFBD's servers.

Key API Concepts:

Endpoint: A specific URL that returns particular data. CFBD has endpoints for games, plays, teams, players, and more.

Base URL: https://api.collegefootballdata.com

Example Endpoints:
  /games          → Game results
  /plays          → Play-by-play data
  /teams          → Team information
  /player/stats   → Player statistics
  /recruiting/players → Recruiting data

Request: Your code asks the API for data by calling an endpoint with specific parameters.

Response: The API returns data, typically in JSON format, which your code then processes.

Parameters: Filters that specify what data you want. For example, year=2023 and team=Alabama would request 2023 Alabama data.

2.3.2 Authentication and Rate Limits

CFBD requires an API key for authentication. This identifies you and allows tracking of usage.

Getting Your API Key:

  1. Go to collegefootballdata.com
  2. Click "Get API Key" or navigate to the API section
  3. Register with your email
  4. Receive your key (a long string of characters)

Using Your API Key:

# Include key in request header
headers = {
    "Authorization": "Bearer YOUR_API_KEY_HERE"
}

Rate Limits: CFBD limits how many requests you can make to ensure fair access for all users: - Typical limit: 1000 requests per hour - Exceeding limits returns error responses - Solution: Cache data locally, batch requests efficiently

✅ Best Practice: Store your API key in an environment variable or config file, never in your code directly. This prevents accidentally sharing your key when sharing code.

# Good: Load key from environment
import os
api_key = os.environ.get("CFBD_API_KEY")

# Bad: Hardcoded key (don't do this!)
# api_key = "abc123xyz789"  # Never commit this!
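Putting these pieces together, the sketch below makes a raw HTTP request to the /games endpoint with the requests library. It assumes your key is stored in the CFBD_API_KEY environment variable; the exact fields returned vary by endpoint, so check the API documentation.

import os

import pandas as pd
import requests

BASE_URL = "https://api.collegefootballdata.com"

# Key loaded from an environment variable, as recommended above
headers = {"Authorization": f"Bearer {os.environ.get('CFBD_API_KEY')}"}

# Parameters filter the request: 2023 Alabama games
params = {"year": 2023, "team": "Alabama"}

response = requests.get(f"{BASE_URL}/games", headers=headers, params=params)
response.raise_for_status()  # Raise an error if the request failed

games = pd.DataFrame(response.json())  # JSON list of games -> DataFrame
print(len(games), "games retrieved")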

2.3.3 Available Endpoints

CFBD offers numerous endpoints. Here are the most commonly used:

Game Data:

| Endpoint | Description | Key Parameters |
|----------|-------------|----------------|
| /games | Game results and box scores | year, week, team, conference |
| /games/teams | Team stats by game | year, week, team |
| /games/players | Player stats by game | year, week, team |

Play-by-Play:

| Endpoint | Description | Key Parameters |
|----------|-------------|----------------|
| /plays | Detailed play-by-play | year, week, team, offense, defense |
| /drives | Drive-level summaries | year, week, team |

Teams and Conferences:

| Endpoint | Description | Key Parameters |
|----------|-------------|----------------|
| /teams | Team information | conference |
| /conferences | Conference information | - |
| /teams/fbs | FBS teams by year | year |

Players and Stats:

| Endpoint | Description | Key Parameters |
|----------|-------------|----------------|
| /player/search | Find players by name | searchTerm |
| /stats/player/season | Season player stats | year, team, category |
| /stats/categories | Available stat categories | - |

Recruiting:

| Endpoint | Description | Key Parameters |
|----------|-------------|----------------|
| /recruiting/players | Player recruiting rankings | year, team, position |
| /recruiting/teams | Team recruiting rankings | year |

Advanced Metrics:

| Endpoint | Description | Key Parameters |
|----------|-------------|----------------|
| /ppa/games | Predicted Points Added by game | year, week, team |
| /metrics/wp | Win probability data | year, week |
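As a sketch of how the key parameters map onto a request, the same pattern shown earlier works for any endpoint in these tables. Here it is applied to /plays; the parameter values are illustrative.

import os

import pandas as pd
import requests

BASE_URL = "https://api.collegefootballdata.com"
headers = {"Authorization": f"Bearer {os.environ.get('CFBD_API_KEY')}"}

# year, week, and team are the key parameters listed for /plays above
params = {"year": 2023, "week": 14, "team": "Alabama"}

response = requests.get(f"{BASE_URL}/plays", headers=headers, params=params)
response.raise_for_status()

plays = pd.DataFrame(response.json())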

2.3.4 Best Practices for API Usage

Follow these practices for efficient, responsible API usage:

1. Cache Data Locally

import os
import pandas as pd

def get_games_cached(year, cache_dir="data/cache"):
    """Load games from cache if available, else fetch from API."""
    cache_file = f"{cache_dir}/games_{year}.csv"

    if os.path.exists(cache_file):
        print(f"Loading from cache: {cache_file}")
        return pd.read_csv(cache_file)

    # Fetch from API
    print(f"Fetching from API: {year} games")
    games = fetch_games_from_api(year)  # Your API function

    # Save to cache
    os.makedirs(cache_dir, exist_ok=True)
    games.to_csv(cache_file, index=False)

    return games

2. Batch Requests

# Inefficient: One request per team
for team in teams:
    data = api.get_games(year=2023, team=team)  # 130+ requests!

# Efficient: One request for all data, then filter locally
all_games = api.get_games(year=2023)  # 1 request
team_games = all_games[
    (all_games["home_team"] == team) | (all_games["away_team"] == team)
]

3. Handle Errors Gracefully

import time

def api_request_with_retry(func, max_retries=3, delay=5):
    """Make API request with retry logic."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"Request failed, retrying in {delay}s: {e}")
                time.sleep(delay)
            else:
                raise

4. Respect Rate Limits

import time

import pandas as pd

def fetch_multiple_seasons(start_year, end_year, delay=1):
    """Fetch data across seasons with rate limiting."""
    all_data = []

    for year in range(start_year, end_year + 1):
        data = fetch_year_data(year)  # Your API function
        all_data.append(data)
        time.sleep(delay)  # Pause between requests

    return pd.concat(all_data, ignore_index=True)

2.4 Data Formats and Storage

Data can be stored in various formats. Choosing the right format affects performance, compatibility, and ease of use.

2.4.1 CSV Files

CSV (Comma-Separated Values) is the most common format for tabular data.

Structure:

game_id,home_team,away_team,home_points,away_points
401520180,Alabama,Georgia,27,24
401520181,Ohio State,Michigan,42,27

Advantages: - Human-readable (open in any text editor) - Universal compatibility (Excel, Python, R, etc.) - Simple structure - Easy to inspect and debug

Disadvantages: - No data type information (all values are strings) - Larger file sizes than binary formats - Slow for very large datasets - No native support for nested data

Working with CSV in Python:

import pandas as pd

# Reading CSV
games = pd.read_csv("games_2023.csv")

# Writing CSV
games.to_csv("output.csv", index=False)

# Specifying data types for efficiency
dtypes = {
    "game_id": "int64",
    "home_points": "int32",
    "away_points": "int32"
}
games = pd.read_csv("games_2023.csv", dtype=dtypes)

2.4.2 JSON Data

JSON (JavaScript Object Notation) is the standard format for API responses.

Structure:

{
  "game_id": 401520180,
  "home_team": "Alabama",
  "away_team": "Georgia",
  "home_points": 27,
  "away_points": 24,
  "plays": [
    {"play_id": 1, "type": "Kickoff", "yards": 65},
    {"play_id": 2, "type": "Rush", "yards": 3}
  ]
}

Advantages: - Supports nested/hierarchical data - Native format for web APIs - Preserves data types (numbers, strings, booleans) - Human-readable

Disadvantages: - Not ideal for purely tabular data - Larger than binary formats - Can be complex to work with in tabular tools

Working with JSON in Python:

import json
import pandas as pd

# Reading JSON
with open("game.json", "r") as f:
    data = json.load(f)

# Converting JSON to DataFrame
games_df = pd.DataFrame(data)

# Reading JSON with pandas directly
games = pd.read_json("games.json")

# Handling nested JSON
plays = pd.json_normalize(data, "plays", ["game_id", "home_team"])

2.4.3 SQL Databases

For larger projects, storing data in a SQL database offers advantages.

Common Database Options: - SQLite: File-based, no server needed, good for personal projects - PostgreSQL: Full-featured, good for production systems - MySQL: Popular, widely supported

Advantages: - Efficient queries on large datasets - Relationships between tables - Data integrity constraints - Concurrent access support

Disadvantages: - More setup required - Need to learn SQL - Overkill for simple analyses

Working with SQLite in Python:

import sqlite3
import pandas as pd

# Create/connect to database
conn = sqlite3.connect("cfb_data.db")

# Write DataFrame to table
games.to_sql("games", conn, if_exists="replace", index=False)

# Query data
query = """
    SELECT home_team, AVG(home_points) as avg_points
    FROM games
    WHERE season = 2023
    GROUP BY home_team
    ORDER BY avg_points DESC
"""
results = pd.read_sql(query, conn)

conn.close()

2.4.4 Parquet Files for Large Datasets

Parquet is a columnar binary format optimized for analytical workloads.

Advantages: - Much smaller file sizes (compression) - Much faster read/write times - Preserves data types - Excellent for large datasets (millions of rows)

Disadvantages: - Not human-readable - Requires special libraries - Less universal compatibility

When to Use Parquet: - Play-by-play data across multiple seasons - Any dataset with 100K+ rows - Data you'll read repeatedly

Working with Parquet in Python:

import pandas as pd

# Reading Parquet (requires pyarrow or fastparquet)
plays = pd.read_parquet("plays_2015_2023.parquet")

# Writing Parquet
plays.to_parquet("plays.parquet", index=False)

# File size comparison example:
# plays_2023.csv:      450 MB
# plays_2023.parquet:   85 MB (81% smaller!)

2.5 Data Quality Considerations

All data has imperfections. Understanding common data quality issues helps you avoid flawed analyses.

2.5.1 Missing Data Patterns

Missing data appears in several ways:

Explicit Missing Values - Null, None, NaN, empty string - pandas represents as NaN (Not a Number) - Easy to detect: df.isnull().sum()

Implicit Missing Values - Data that should exist but doesn't - Harder to detect (requires knowing what should be there) - Example: A game missing from the games table

Patterns of Missingness:

| Pattern | Description | Example |
|---------|-------------|---------|
| MCAR | Missing Completely at Random | Random sensor failures |
| MAR | Missing at Random (conditional) | FCS games less likely to have detailed data |
| MNAR | Missing Not at Random | Players with injuries may have missing game stats |

Handling Missing Data:

import pandas as pd

# Check for missing values
print(games.isnull().sum())

# Drop rows with any missing values (aggressive)
games_complete = games.dropna()

# Drop rows only if key columns missing (conservative)
games_clean = games.dropna(subset=["home_points", "away_points"])

# Fill missing values
games["attendance"] = games["attendance"].fillna(games["attendance"].median())

2.5.2 Data Entry Errors

Humans enter data, and humans make mistakes.

Common Error Types:

Typos and Misspellings:

"Alabama" vs "Alamaba" vs "ALABAMA"
"Ohio State" vs "Ohio St" vs "OSU"

Transposition Errors:

Score: 27-24 entered as 24-27 (teams swapped)
Yards: 15 entered as 51

Unit Errors:

Clock showing "7:32" vs "732" vs "452 seconds"

Detection Strategies:

# Check for unexpected values
print(games["home_team"].value_counts())  # Look for typos

# Check for outliers
print(games["home_points"].describe())
print(games[games["home_points"] > 80])  # Unusually high scores

# Cross-validate with other sources
# Compare your game scores to ESPN or Sports Reference
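Once variants are detected, a small alias map can standardize them before analysis. This is a minimal sketch; the aliases and file name are illustrative.

import pandas as pd

games = pd.read_csv("games_2023.csv")

# Map known variants and typos to one canonical name
team_aliases = {
    "Ohio St": "Ohio State",
    "OSU": "Ohio State",
    "ALABAMA": "Alabama",
    "Alamaba": "Alabama",
}
games["home_team"] = games["home_team"].replace(team_aliases)
games["away_team"] = games["away_team"].replace(team_aliases)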

2.5.3 Definitional Inconsistencies

The same concept may be measured differently across sources or time periods.

Examples:

Sack Statistics: - NCAA didn't officially track sacks until 2000 - Historical sack data is incomplete or inconsistent

Fumbles: - "Fumbles" vs "Fumbles Lost" - Some sources count both, others only lost fumbles

Pass Attempts: - Do spikes count? - What about aborted plays?

Solutions: - Document your definitions explicitly - Check source documentation - Be cautious when combining data from different sources - Note definitional changes in historical analyses

2.5.4 Historical Data Limitations

Data quality generally improves over time. Historical analyses face additional challenges.

Pre-2000s Limitations: - Play-by-play data sparse or nonexistent - Many statistics not tracked - Different rules affect comparability

Early 2000s: - Play-by-play becomes more available - Still missing advanced charting

2010s onward: - Comprehensive play-by-play - Pre-calculated advanced metrics - Better player attribution

⚠️ Common Pitfall: Don't compare 1995 statistics directly to 2023 statistics without acknowledging rule changes, pace differences, and data quality differences. The kickoff fair catch rule change, for example, affects return statistics across eras.


2.6 Building Your Data Library

Organizing your data systematically saves time and prevents errors.

2.6.1 Organizing Data Files

Establish a consistent folder structure:

cfb-analytics/
│
├── data/
│   ├── raw/                    # Original data from sources
│   │   ├── cfbd/
│   │   │   ├── games_2023.json
│   │   │   ├── plays_2023.json
│   │   │   └── ...
│   │   └── manual/
│   │       └── weather_data.csv
│   │
│   ├── processed/              # Cleaned, transformed data
│   │   ├── games_clean.parquet
│   │   ├── plays_with_epa.parquet
│   │   └── ...
│   │
│   ├── cache/                  # Temporary cached API responses
│   │   └── ...
│   │
│   └── external/               # Data from other sources
│       ├── recruiting/
│       └── weather/
│
├── notebooks/                  # Jupyter notebooks for analysis
├── src/                        # Python source code
└── output/                     # Results, figures, reports

Key Principles:

  1. Separate raw from processed: Never modify raw data files
  2. Use descriptive names: Include year, type, date in filenames
  3. Document everything: README files in each directory
  4. Be consistent: Same structure across projects
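A short script can create this skeleton so every project starts the same way. This is a sketch using pathlib; adjust the directory names to your own layout.

from pathlib import Path

subdirs = [
    "data/raw/cfbd", "data/raw/manual",
    "data/processed", "data/cache",
    "data/external/recruiting", "data/external/weather",
    "notebooks", "src", "output",
]

# Create each directory (and any missing parents), skipping ones that already exist
for sub in subdirs:
    Path(sub).mkdir(parents=True, exist_ok=True)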

2.6.2 Version Control for Data

While Git is excellent for code, it handles large data files poorly. Alternative approaches:

For Small Data (< 10 MB): - Include in Git repository - Track changes with normal commits

For Medium Data (10 MB - 1 GB): - Use Git LFS (Large File Storage) - Or keep data separate and document how to obtain it

For Large Data (> 1 GB): - Store on cloud storage (S3, Google Cloud, etc.) - Version with timestamps in filenames - Document data lineage

Data Versioning Example:

data/processed/
├── plays_v1_20231015.parquet    # Initial version
├── plays_v2_20231022.parquet    # Added EPA calculations
├── plays_v3_20231105.parquet    # Fixed penalty handling
└── plays_current.parquet        # Symlink to latest

2.6.3 Documentation Practices

Good documentation prevents confusion and enables reproducibility.

Data Dictionary: Document every column in your datasets:

# plays.parquet Data Dictionary

## Columns

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| play_id | int64 | Unique play identifier | 4015201800001 |
| game_id | int64 | Parent game identifier | 401520180 |
| drive_id | int64 | Parent drive identifier | 40152018001 |
| period | int32 | Game period (1-4, 5+ for OT) | 2 |
| clock | string | Game clock MM:SS | "12:45" |
| offense | string | Team with possession | "Alabama" |
| defense | string | Team on defense | "Georgia" |
| down | int32 | Current down (1-4) | 3 |
| distance | int32 | Yards needed for first down | 7 |
| yard_line | int32 | Yard line (0-100) | 65 |
| yards_gained | int32 | Yards gained on play | 12 |
| play_type | string | Type of play | "Pass" |
| play_text | string | Description of play | "Milroe pass..." |
| epa | float64 | Expected Points Added | 0.85 |
| success | bool | Whether play was successful | True |

## Notes
- yard_line uses 0-100 scale where 0 is own goal line
- epa calculated using cfbd model
- success defined per standard analytics definition

## Source
- CFBD API /plays endpoint
- Retrieved: 2023-10-15

## Processing Applied
- Removed plays with null yard_line
- Calculated success column

README Files: Include README.md in each data directory explaining: - What data is contained - Where it came from - When it was last updated - Any known issues or limitations - How to regenerate if needed


2.7 Chapter Summary

This chapter introduced the data landscape of college football, providing the foundation for all subsequent analyses.

Key Concepts

  1. Play-by-play data is the most granular form of football data, with each row representing a single play. It enables the most sophisticated analyses but requires more storage and processing.

  2. The College Football Data API (CFBD) is the primary free source for college football data, providing play-by-play data, statistics, recruiting information, and pre-calculated advanced metrics.

  3. API usage requires understanding endpoints, authentication, parameters, and rate limits. Efficient API usage involves caching, batching, and error handling.

  4. Data formats (CSV, JSON, SQL, Parquet) each have tradeoffs. Choose based on file size, compatibility needs, and query patterns.

  5. Data quality issues—missing data, errors, inconsistencies—are present in all datasets. Awareness and explicit handling are essential for trustworthy analysis.

  6. Organized data management with consistent folder structures, versioning, and documentation prevents confusion and enables reproducibility.

Key Terms

| Term | Definition |
|------|------------|
| Play-by-play (PBP) | Data with one row per play, maximum granularity |
| API | Application Programming Interface; allows programmatic data access |
| Endpoint | A specific URL that returns particular data |
| JSON | JavaScript Object Notation; common data format for APIs |
| CSV | Comma-Separated Values; simple tabular text format |
| Parquet | Columnar binary format efficient for large datasets |
| Data dictionary | Documentation describing each column in a dataset |
| Rate limit | Restriction on how many API requests can be made |

Code Patterns

# Pattern: Load from cache or API
def get_data_cached(year, data_type, cache_dir="data/cache"):
    cache_file = f"{cache_dir}/{data_type}_{year}.parquet"
    if os.path.exists(cache_file):
        return pd.read_parquet(cache_file)
    data = fetch_from_api(year, data_type)
    data.to_parquet(cache_file)
    return data

# Pattern: Handle missing values
df_clean = df.dropna(subset=required_columns)
df["optional_col"] = df["optional_col"].fillna(default_value)

# Pattern: Validate data
assert df["points"].between(0, 100).all(), "Invalid point values"
assert df["game_id"].is_unique, "Duplicate game IDs found"

What's Next

In Chapter 3: Python for Sports Analytics, you will build the programming skills needed to work with the data you've learned to access. You'll master pandas for data manipulation, NumPy for numerical computing, and matplotlib for visualization—the core toolkit for football analysis.

Before moving on, complete the exercises and quiz to solidify your understanding of college football data sources and access patterns.


Chapter 2 Exercises → exercises.md

Chapter 2 Quiz → quiz.md

Case Study: Building a Complete Season Database → case-study-01.md

Case Study: Comparing Data Sources for Accuracy → case-study-02.md


Understanding your data sources is the foundation of trustworthy analysis. Every metric, model, and visualization starts here.