Case Study: Building an NFL Season Database from Scratch

Overview

In this case study, you will build a complete NFL season database from raw data to queryable SQLite tables. We will walk through every step: defining the schema, collecting data from public sources, cleaning and transforming it, loading it into the database, and running analytical queries that answer real betting questions. By the end, you will have a reusable template for any sport and any season.

The complete code for this case study is available in code/case-study-code.py.


Step 1: Define the Requirements

Before writing any code, we need to answer three questions:

  1. What questions do we want to answer? We want to compute ATS records, analyze home-field advantage, identify scheduling edges, track over/under trends, and eventually feed data into predictive models.

  2. What data do we need? Game results (date, teams, scores), odds data (spreads, totals, moneylines), and contextual information (week number, day of week, division games, playoff status).

  3. What is the right granularity? Game-level data is sufficient for ATS and totals analysis. Play-by-play data would enable deeper analysis but is out of scope for this initial build.

With these requirements defined, we can design the schema.


Step 2: Design the Database Schema

We will use three normalized tables plus one view for convenience:

-- Teams table: canonical team information
CREATE TABLE IF NOT EXISTS teams (
    team_id TEXT PRIMARY KEY,
    team_name TEXT NOT NULL,
    team_city TEXT NOT NULL,
    conference TEXT NOT NULL CHECK (conference IN ('AFC', 'NFC')),
    division TEXT NOT NULL
);

-- Games table: one row per game
CREATE TABLE IF NOT EXISTS games (
    game_id TEXT PRIMARY KEY,
    season INTEGER NOT NULL,
    week INTEGER NOT NULL,
    game_date TEXT NOT NULL,
    day_of_week TEXT NOT NULL,
    home_team_id TEXT NOT NULL,
    away_team_id TEXT NOT NULL,
    home_score INTEGER,
    away_score INTEGER,
    is_playoff INTEGER DEFAULT 0,
    is_divisional INTEGER DEFAULT 0,
    FOREIGN KEY (home_team_id) REFERENCES teams(team_id),
    FOREIGN KEY (away_team_id) REFERENCES teams(team_id)
);

-- Odds table: one row per game, linked to games
CREATE TABLE IF NOT EXISTS odds (
    odds_id INTEGER PRIMARY KEY AUTOINCREMENT,
    game_id TEXT NOT NULL UNIQUE,
    spread REAL,           -- negative = home favored
    total REAL,
    home_ml INTEGER,
    away_ml INTEGER,
    closing_spread REAL,
    closing_total REAL,
    FOREIGN KEY (game_id) REFERENCES games(game_id)
);

Key design decisions:

  • team_id uses abbreviations (e.g., KC, BUF) because they are short, memorable, and match most data sources after standardization.
  • game_id is a composite string combining season, week, and teams (e.g., 2023_01_KC_DET) for human readability.
  • Odds are in a separate table because not every game will have odds data (historical games, preseason), and keeping them separate maintains clean referential integrity.
  • is_divisional is precomputed during loading rather than derived at query time, because division membership changes rarely and the flag speeds up common queries.
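The overview above promises one view alongside the three tables, but the listing stops at the odds table. A plausible sketch of that convenience view (the view name and column selection here are illustrative, not from the original code) flattens the joins most queries need, using a LEFT JOIN so games without odds rows still appear:

```sql
-- Convenience view: games joined with team names and odds
CREATE VIEW IF NOT EXISTS game_results AS
SELECT
    g.game_id,
    g.season,
    g.week,
    g.game_date,
    ht.team_name AS home_team,
    aw.team_name AS away_team,
    g.home_score,
    g.away_score,
    o.spread,
    o.total,
    g.is_divisional,
    g.is_playoff
FROM games g
JOIN teams ht ON g.home_team_id = ht.team_id
JOIN teams aw ON g.away_team_id = aw.team_id
LEFT JOIN odds o ON g.game_id = o.game_id;
```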

Step 3: Build the Data Collection Pipeline

We will simulate data collection using a structured approach that mirrors real-world scraping. In production, you would replace the sample data generator with actual API calls or scraper functions.

import sqlite3
import pandas as pd
from datetime import datetime, timedelta
from typing import Optional


# NFL team definitions
NFL_TEAMS: dict[str, dict[str, str]] = {
    "KC":  {"name": "Chiefs",     "city": "Kansas City",   "conf": "AFC", "div": "AFC West"},
    "BUF": {"name": "Bills",      "city": "Buffalo",       "conf": "AFC", "div": "AFC East"},
    "MIA": {"name": "Dolphins",   "city": "Miami",         "conf": "AFC", "div": "AFC East"},
    "NE":  {"name": "Patriots",   "city": "New England",   "conf": "AFC", "div": "AFC East"},
    "NYJ": {"name": "Jets",       "city": "New York",      "conf": "AFC", "div": "AFC East"},
    "BAL": {"name": "Ravens",     "city": "Baltimore",     "conf": "AFC", "div": "AFC North"},
    "CIN": {"name": "Bengals",    "city": "Cincinnati",    "conf": "AFC", "div": "AFC North"},
    "CLE": {"name": "Browns",     "city": "Cleveland",     "conf": "AFC", "div": "AFC North"},
    "PIT": {"name": "Steelers",   "city": "Pittsburgh",    "conf": "AFC", "div": "AFC North"},
    "HOU": {"name": "Texans",     "city": "Houston",       "conf": "AFC", "div": "AFC South"},
    "IND": {"name": "Colts",      "city": "Indianapolis",  "conf": "AFC", "div": "AFC South"},
    "JAX": {"name": "Jaguars",    "city": "Jacksonville",  "conf": "AFC", "div": "AFC South"},
    "TEN": {"name": "Titans",     "city": "Tennessee",     "conf": "AFC", "div": "AFC South"},
    "DEN": {"name": "Broncos",    "city": "Denver",        "conf": "AFC", "div": "AFC West"},
    "LV":  {"name": "Raiders",    "city": "Las Vegas",     "conf": "AFC", "div": "AFC West"},
    "LAC": {"name": "Chargers",   "city": "Los Angeles",   "conf": "AFC", "div": "AFC West"},
    "DAL": {"name": "Cowboys",    "city": "Dallas",        "conf": "NFC", "div": "NFC East"},
    "NYG": {"name": "Giants",     "city": "New York",      "conf": "NFC", "div": "NFC East"},
    "PHI": {"name": "Eagles",     "city": "Philadelphia",  "conf": "NFC", "div": "NFC East"},
    "WAS": {"name": "Commanders", "city": "Washington",    "conf": "NFC", "div": "NFC East"},
    "CHI": {"name": "Bears",      "city": "Chicago",       "conf": "NFC", "div": "NFC North"},
    "DET": {"name": "Lions",      "city": "Detroit",       "conf": "NFC", "div": "NFC North"},
    "GB":  {"name": "Packers",    "city": "Green Bay",     "conf": "NFC", "div": "NFC North"},
    "MIN": {"name": "Vikings",    "city": "Minnesota",     "conf": "NFC", "div": "NFC North"},
    "ATL": {"name": "Falcons",    "city": "Atlanta",       "conf": "NFC", "div": "NFC South"},
    "CAR": {"name": "Panthers",   "city": "Carolina",      "conf": "NFC", "div": "NFC South"},
    "NO":  {"name": "Saints",     "city": "New Orleans",   "conf": "NFC", "div": "NFC South"},
    "TB":  {"name": "Buccaneers", "city": "Tampa Bay",     "conf": "NFC", "div": "NFC South"},
    "ARI": {"name": "Cardinals",  "city": "Arizona",       "conf": "NFC", "div": "NFC West"},
    "LAR": {"name": "Rams",       "city": "Los Angeles",   "conf": "NFC", "div": "NFC West"},
    "SF":  {"name": "49ers",      "city": "San Francisco", "conf": "NFC", "div": "NFC West"},
    "SEA": {"name": "Seahawks",   "city": "Seattle",       "conf": "NFC", "div": "NFC West"},
}

This dictionary serves as our canonical team reference. Every data source we ingest will be mapped to these abbreviations.


Step 4: Team Name Standardization

Different data sources use different names for the same team. This function handles the mapping:

# Common alternate names found in various data sources
TEAM_NAME_ALIASES: dict[str, str] = {
    "Kansas City Chiefs": "KC",
    "KAN": "KC",
    "Buffalo Bills": "BUF",
    "Miami Dolphins": "MIA",
    "New England Patriots": "NE",
    "New York Jets": "NYJ",
    "Baltimore Ravens": "BAL",
    "Cincinnati Bengals": "CIN",
    "Cleveland Browns": "CLE",
    "Pittsburgh Steelers": "PIT",
    "Houston Texans": "HOU",
    "Indianapolis Colts": "IND",
    "Jacksonville Jaguars": "JAX",
    "JAC": "JAX",  # Common alternate abbreviation
    "Tennessee Titans": "TEN",
    "Denver Broncos": "DEN",
    "Las Vegas Raiders": "LV",
    "Oakland Raiders": "LV",  # Historical name
    "OAK": "LV",
    "Los Angeles Chargers": "LAC",
    "San Diego Chargers": "LAC",  # Historical name
    "SD": "LAC",
    "Dallas Cowboys": "DAL",
    "New York Giants": "NYG",
    "Philadelphia Eagles": "PHI",
    "Washington Commanders": "WAS",
    "Washington Football Team": "WAS",
    "Washington Redskins": "WAS",  # Historical name
    "WSH": "WAS",
    "Chicago Bears": "CHI",
    "Detroit Lions": "DET",
    "Green Bay Packers": "GB",
    "Minnesota Vikings": "MIN",
    "Atlanta Falcons": "ATL",
    "Carolina Panthers": "CAR",
    "New Orleans Saints": "NO",
    "Tampa Bay Buccaneers": "TB",
    "Arizona Cardinals": "ARI",
    "Los Angeles Rams": "LAR",
    "St. Louis Rams": "LAR",  # Historical name
    "STL": "LAR",
    "San Francisco 49ers": "SF",
    "Seattle Seahawks": "SEA",
}


def standardize_team_name(raw_name: str) -> str:
    """Convert any team name variant to its canonical abbreviation.

    Args:
        raw_name: The team name as it appears in the source data.

    Returns:
        The canonical team abbreviation (e.g., 'KC', 'BUF').

    Raises:
        ValueError: If the name cannot be mapped to a known team.
    """
    cleaned = raw_name.strip()

    # Already a canonical abbreviation
    if cleaned in NFL_TEAMS:
        return cleaned

    # Check the alias dictionary
    if cleaned in TEAM_NAME_ALIASES:
        return TEAM_NAME_ALIASES[cleaned]

    # Case-insensitive exact matching as a fallback
    cleaned_lower = cleaned.lower()
    for alias, team_id in TEAM_NAME_ALIASES.items():
        if alias.lower() == cleaned_lower:
            return team_id

    raise ValueError(f"Unknown team name: '{raw_name}'. Add it to TEAM_NAME_ALIASES.")

This function is called on every team name in every row of incoming data. The alias dictionary handles historical names (Oakland Raiders to LV), alternate abbreviations (JAC to JAX), and full names. The ValueError at the end ensures that unknown names fail loudly rather than silently corrupting data.
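A quick smoke test shows the three-step lookup order in action. The snippet below uses trimmed copies of the two tables so it runs standalone; the full dictionaries above behave the same way:

```python
# Trimmed copies of the lookup tables, just enough for a smoke test.
NFL_TEAMS = {"KC": {}, "LV": {}}
TEAM_NAME_ALIASES = {"Kansas City Chiefs": "KC", "Oakland Raiders": "LV", "OAK": "LV"}

def standardize_team_name(raw_name: str) -> str:
    cleaned = raw_name.strip()
    if cleaned in NFL_TEAMS:              # 1. already canonical
        return cleaned
    if cleaned in TEAM_NAME_ALIASES:      # 2. known alias
        return TEAM_NAME_ALIASES[cleaned]
    cleaned_lower = cleaned.lower()       # 3. case-insensitive fallback
    for alias, team_id in TEAM_NAME_ALIASES.items():
        if alias.lower() == cleaned_lower:
            return team_id
    raise ValueError(f"Unknown team name: {raw_name!r}")

print(standardize_team_name("KC"))                   # KC (canonical passes through)
print(standardize_team_name("OAK"))                  # LV (historical abbreviation)
print(standardize_team_name("kansas city chiefs"))   # KC (case-insensitive fallback)
```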


Step 5: Data Loading and Cleaning

The loading function takes a pandas DataFrame (from any source: CSV, API, scraper) and inserts it into the database:

def load_games_to_database(
    df: pd.DataFrame,
    db_path: str,
    season: int
) -> dict[str, int]:
    """Load a DataFrame of game data into the SQLite database.

    Args:
        df: DataFrame with required columns: date, week, home_team,
            away_team, home_score, away_score. Optional columns:
            day_of_week, is_playoff, spread, total, home_ml, away_ml.
        db_path: Path to the SQLite database file.
        season: The NFL season year.

    Returns:
        A dictionary summarizing the loading results:
        {'inserted': N, 'skipped_duplicates': N, 'errors': N}.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")  # Enforce referential integrity
    cursor = conn.cursor()

    results = {"inserted": 0, "skipped_duplicates": 0, "errors": 0}

    for _, row in df.iterrows():
        try:
            # Standardize team names
            home_id = standardize_team_name(row["home_team"])
            away_id = standardize_team_name(row["away_team"])

            # Determine if this is a divisional game
            home_div = NFL_TEAMS[home_id]["div"]
            away_div = NFL_TEAMS[away_id]["div"]
            is_divisional = 1 if home_div == away_div else 0

            # Build game_id
            week = int(row["week"])
            game_id = f"{season}_{week:02d}_{home_id}_{away_id}"

            # Insert the game
            cursor.execute(
                """INSERT OR IGNORE INTO games
                   (game_id, season, week, game_date, day_of_week,
                    home_team_id, away_team_id, home_score, away_score,
                    is_playoff, is_divisional)
                   VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
                (
                    game_id, season, week,
                    row["date"], row.get("day_of_week", "Unknown"),
                    home_id, away_id,
                    int(row["home_score"]), int(row["away_score"]),
                    int(row.get("is_playoff", 0)), is_divisional,
                ),
            )

            if cursor.rowcount == 0:
                results["skipped_duplicates"] += 1
            else:
                results["inserted"] += 1

                # Insert odds if available
                if pd.notna(row.get("spread")) and pd.notna(row.get("total")):
                    cursor.execute(
                        """INSERT OR IGNORE INTO odds
                           (game_id, spread, total, home_ml, away_ml)
                           VALUES (?, ?, ?, ?, ?)""",
                        (
                            game_id,
                            float(row["spread"]),
                            float(row["total"]),
                            int(row["home_ml"]) if pd.notna(row.get("home_ml")) else None,
                            int(row["away_ml"]) if pd.notna(row.get("away_ml")) else None,
                        ),
                    )

        except (ValueError, KeyError) as e:
            print(f"Error processing row: {e}")
            results["errors"] += 1

    conn.commit()
    conn.close()
    return results

The INSERT OR IGNORE pattern handles duplicate game IDs gracefully: if we accidentally load the same game twice, it is silently skipped rather than causing an error. The results dictionary provides a clear audit trail.
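The duplicate check hinges on cursor.rowcount, which the sqlite3 module sets to 0 when INSERT OR IGNORE skips a conflicting row. A minimal demonstration against an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id TEXT PRIMARY KEY)")
cur = conn.cursor()

cur.execute("INSERT OR IGNORE INTO games VALUES (?)", ("2023_01_KC_DET",))
first = cur.rowcount    # 1: row was inserted

cur.execute("INSERT OR IGNORE INTO games VALUES (?)", ("2023_01_KC_DET",))
second = cur.rowcount   # 0: duplicate was silently skipped

print(first, second)
conn.close()
```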


Step 6: Data Quality Validation

After loading, we run automated checks:

def validate_database(db_path: str, season: int) -> list[str]:
    """Run data quality checks on the loaded database.

    Args:
        db_path: Path to the SQLite database file.
        season: The season to validate.

    Returns:
        A list of warning messages. An empty list means all checks passed.
    """
    conn = sqlite3.connect(db_path)
    warnings: list[str] = []

    # Check 1: Expected number of regular-season games (32 teams playing
    # 17 games each = 544 team-games = 272 unique games)
    game_count = conn.execute(
        "SELECT COUNT(*) FROM games WHERE season = ? AND is_playoff = 0",
        (season,),
    ).fetchone()[0]

    if game_count < 272:
        warnings.append(
            f"Only {game_count} regular-season games found (expected 272). "
            f"Missing {272 - game_count} games."
        )
    elif game_count > 272:
        warnings.append(
            f"Found {game_count} regular-season games (expected 272). "
            f"Possible duplicates."
        )

    # Check 2: Every team should appear in exactly 17 regular-season games
    team_counts = conn.execute(
        """SELECT team_id, COUNT(*) as games FROM (
               SELECT home_team_id AS team_id FROM games
               WHERE season = ? AND is_playoff = 0
               UNION ALL
               SELECT away_team_id AS team_id FROM games
               WHERE season = ? AND is_playoff = 0
           ) GROUP BY team_id""",
        (season, season),
    ).fetchall()

    for team_id, count in team_counts:
        if count != 17:
            warnings.append(f"Team {team_id} has {count} games (expected 17).")

    # Check 3: No negative scores
    bad_scores = conn.execute(
        """SELECT game_id FROM games
           WHERE (home_score < 0 OR away_score < 0)
           AND season = ?""",
        (season,),
    ).fetchall()

    if bad_scores:
        warnings.append(
            f"Found {len(bad_scores)} games with negative scores: "
            f"{[g[0] for g in bad_scores]}"
        )

    # Check 4: Odds coverage
    total_games = conn.execute(
        "SELECT COUNT(*) FROM games WHERE season = ?", (season,)
    ).fetchone()[0]
    odds_count = conn.execute(
        """SELECT COUNT(*) FROM odds o
           JOIN games g ON o.game_id = g.game_id
           WHERE g.season = ?""",
        (season,),
    ).fetchone()[0]
    coverage = (odds_count / total_games * 100) if total_games > 0 else 0
    if coverage < 95:
        warnings.append(
            f"Odds coverage is only {coverage:.1f}% ({odds_count}/{total_games} games)."
        )

    conn.close()
    return warnings
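Check 2's UNION ALL trick, which counts each team's home and away appearances together, is easier to see in isolation. Here is a toy three-game schedule against a table trimmed to just the columns the query touches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (home_team_id TEXT, away_team_id TEXT, "
    "season INTEGER, is_playoff INTEGER)"
)
# Toy schedule: three games among three teams, two appearances each.
conn.executemany(
    "INSERT INTO games VALUES (?, ?, 2023, 0)",
    [("KC", "DET"), ("KC", "BUF"), ("BUF", "DET")],
)
rows = conn.execute(
    """SELECT team_id, COUNT(*) AS games FROM (
           SELECT home_team_id AS team_id FROM games
           WHERE season = 2023 AND is_playoff = 0
           UNION ALL
           SELECT away_team_id AS team_id FROM games
           WHERE season = 2023 AND is_playoff = 0
       ) GROUP BY team_id ORDER BY team_id"""
).fetchall()
print(rows)  # [('BUF', 2), ('DET', 2), ('KC', 2)]
conn.close()
```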

Step 7: Analytical Queries

With clean data in the database, we can answer betting questions with SQL:

def get_ats_records(db_path: str, season: int) -> pd.DataFrame:
    """Compute each team's against-the-spread record for a season.

    A team 'covers' when its score differential plus the spread (from its
    perspective) is positive; the spread is negative when the team is favored.
    For the home team: (home_score - away_score) + spread > 0 means a cover.
    """
    query = """
    WITH team_games AS (
        -- Home games
        SELECT
            g.home_team_id AS team_id,
            (g.home_score - g.away_score) AS score_diff,
            o.spread AS spread,
            'home' AS location
        FROM games g
        JOIN odds o ON g.game_id = o.game_id
        WHERE g.season = ? AND g.is_playoff = 0

        UNION ALL

        -- Away games (flip the score diff and spread)
        SELECT
            g.away_team_id AS team_id,
            (g.away_score - g.home_score) AS score_diff,
            -o.spread AS spread,
            'away' AS location
        FROM games g
        JOIN odds o ON g.game_id = o.game_id
        WHERE g.season = ? AND g.is_playoff = 0
    )
    SELECT
        team_id,
        COUNT(*) AS games,
        SUM(CASE WHEN score_diff + spread > 0 THEN 1 ELSE 0 END) AS covers,
        SUM(CASE WHEN score_diff + spread < 0 THEN 1 ELSE 0 END) AS non_covers,
        SUM(CASE WHEN score_diff + spread = 0 THEN 1 ELSE 0 END) AS pushes,
        ROUND(
            100.0 * SUM(CASE WHEN score_diff + spread > 0 THEN 1 ELSE 0 END)
            / COUNT(*), 1
        ) AS cover_pct
    FROM team_games
    GROUP BY team_id
    ORDER BY cover_pct DESC;
    """
    conn = sqlite3.connect(db_path)
    df = pd.read_sql_query(query, conn, params=(season, season))
    conn.close()
    return df
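The cover arithmetic in the docstring is easy to verify by hand. This tiny helper (hypothetical, not part of the pipeline) makes the three possible outcomes explicit:

```python
def ats_result(score_diff: float, spread: float) -> str:
    """Classify an ATS outcome. spread is from the team's own perspective:
    negative when the team is favored."""
    margin = score_diff + spread
    if margin > 0:
        return "cover"
    if margin < 0:
        return "no cover"
    return "push"

# Favored by 3.5 (spread -3.5), wins by 7: covers.
print(ats_result(7, -3.5))   # cover
# Favored by 7 (spread -7), wins by only 3: fails to cover.
print(ats_result(3, -7))     # no cover
# Favored by 3 (spread -3), wins by exactly 3: push.
print(ats_result(3, -3))     # push
```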


def get_home_field_advantage(db_path: str, season: int) -> dict[str, float]:
    """Compute home-field advantage metrics for a season."""
    conn = sqlite3.connect(db_path)

    # Straight-up home win percentage
    home_wins = conn.execute(
        """SELECT
               COUNT(*) as total,
               SUM(CASE WHEN home_score > away_score THEN 1 ELSE 0 END) as home_wins
           FROM games WHERE season = ? AND is_playoff = 0""",
        (season,),
    ).fetchone()

    # Average home margin
    avg_margin = conn.execute(
        """SELECT AVG(home_score - away_score)
           FROM games WHERE season = ? AND is_playoff = 0""",
        (season,),
    ).fetchone()[0]

    # Home ATS record
    home_ats = conn.execute(
        """SELECT
               SUM(CASE WHEN (g.home_score - g.away_score) + o.spread > 0
                   THEN 1 ELSE 0 END) as covers,
               COUNT(*) as total
           FROM games g JOIN odds o ON g.game_id = o.game_id
           WHERE g.season = ? AND g.is_playoff = 0""",
        (season,),
    ).fetchone()

    conn.close()

    total_games = home_wins[0]
    return {
        "home_win_pct": round(home_wins[1] / total_games * 100, 1),
        "avg_home_margin": round(avg_margin, 2),
        "home_ats_cover_pct": round(home_ats[0] / home_ats[1] * 100, 1),
        "total_games": total_games,
    }

Step 8: Putting It All Together

The main orchestration function ties every step together:

def build_nfl_database(
    data_source: pd.DataFrame,
    db_path: str,
    season: int,
) -> None:
    """Complete pipeline: create schema, load data, validate, report.

    Args:
        data_source: DataFrame containing the raw game data.
        db_path: Path where the SQLite database will be created.
        season: The NFL season year (e.g., 2023).
    """
    # Step 1: Create schema
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA_SQL)  # The CREATE TABLE statements from Step 2

    # Step 2: Load teams
    for team_id, info in NFL_TEAMS.items():
        conn.execute(
            "INSERT OR IGNORE INTO teams VALUES (?, ?, ?, ?, ?)",
            (team_id, info["name"], info["city"], info["conf"], info["div"]),
        )
    conn.commit()
    conn.close()

    # Step 3: Load games and odds
    results = load_games_to_database(data_source, db_path, season)
    print(f"Loading complete: {results}")

    # Step 4: Validate
    warnings = validate_database(db_path, season)
    if warnings:
        print("Validation warnings:")
        for w in warnings:
            print(f"  - {w}")
    else:
        print("All validation checks passed.")

    # Step 5: Generate summary report
    ats_df = get_ats_records(db_path, season)
    print(f"\nATS Records ({season} season):")
    print(ats_df.to_string(index=False))

    hfa = get_home_field_advantage(db_path, season)
    print("\nHome-Field Advantage:")
    print(f"  Win%: {hfa['home_win_pct']}%")
    print(f"  Avg Margin: {hfa['avg_home_margin']} points")
    print(f"  ATS Cover%: {hfa['home_ats_cover_pct']}%")
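build_nfl_database assumes a module-level SCHEMA_SQL constant; per the comment in the function, it is simply the Step 2 CREATE TABLE statements bundled into one string so conn.executescript() can create every table in a single call:

```python
import sqlite3

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS teams (
    team_id TEXT PRIMARY KEY,
    team_name TEXT NOT NULL,
    team_city TEXT NOT NULL,
    conference TEXT NOT NULL CHECK (conference IN ('AFC', 'NFC')),
    division TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS games (
    game_id TEXT PRIMARY KEY,
    season INTEGER NOT NULL,
    week INTEGER NOT NULL,
    game_date TEXT NOT NULL,
    day_of_week TEXT NOT NULL,
    home_team_id TEXT NOT NULL,
    away_team_id TEXT NOT NULL,
    home_score INTEGER,
    away_score INTEGER,
    is_playoff INTEGER DEFAULT 0,
    is_divisional INTEGER DEFAULT 0,
    FOREIGN KEY (home_team_id) REFERENCES teams(team_id),
    FOREIGN KEY (away_team_id) REFERENCES teams(team_id)
);

CREATE TABLE IF NOT EXISTS odds (
    odds_id INTEGER PRIMARY KEY AUTOINCREMENT,
    game_id TEXT NOT NULL UNIQUE,
    spread REAL,
    total REAL,
    home_ml INTEGER,
    away_ml INTEGER,
    closing_spread REAL,
    closing_total REAL,
    FOREIGN KEY (game_id) REFERENCES games(game_id)
);
"""

# Quick sanity check: all three tables exist after executescript.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_SQL)
tables = [
    r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
]
print(tables)  # ['games', 'odds', 'teams']
conn.close()
```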

Lessons Learned

  1. Schema design comes first. Spending an hour on schema design saves dozens of hours of painful queries and data restructuring later.

  2. Team name standardization is the most common data-cleaning task in sports analytics. Build a robust mapping early and expand it as you encounter new sources.

  3. Validation should be automated and run after every load. The five-minute investment in writing validation functions catches errors that could otherwise go unnoticed for weeks.

  4. SQLite is sufficient for individual bettors. It requires no server setup, stores the entire database in a single file, and handles millions of rows efficiently. Graduate to PostgreSQL only when you need concurrent access or advanced features.

  5. Separate concerns. Collection, cleaning, loading, validation, and analysis are distinct pipeline stages. Keeping them modular makes debugging and extending the system straightforward.


Extension Ideas

  • Add a bet_tracking table to log your personal bets against the database and compute ROI.
  • Extend the schema with a weather table for outdoor games.
  • Add play-by-play data in a plays table with a foreign key to games.
  • Build a scheduled script that runs the pipeline weekly during the NFL season.
  • Create a simple Flask or Streamlit dashboard that queries the database and displays ATS standings.