Case Study: Building an NFL Season Database from Scratch
Overview
In this case study, you will build a complete NFL season database from raw data to queryable SQLite tables. We will walk through every step: defining the schema, collecting data from public sources, cleaning and transforming it, loading it into the database, and running analytical queries that answer real betting questions. By the end, you will have a reusable template for any sport and any season.
The complete code for this case study is available in code/case-study-code.py.
Step 1: Define the Requirements
Before writing any code, we need to answer three questions:
- What questions do we want to answer? We want to compute ATS records, analyze home-field advantage, identify scheduling edges, track over/under trends, and eventually feed data into predictive models.
- What data do we need? Game results (date, teams, scores), odds data (spreads, totals, moneylines), and contextual information (week number, day of week, division games, playoff status).
- What is the right granularity? Game-level data is sufficient for ATS and totals analysis. Play-by-play data would enable deeper analysis but is out of scope for this initial build.
With these requirements defined, we can design the schema.
Step 2: Design the Database Schema
We will use three normalized tables plus one view for convenience:
-- Teams table: canonical team information
CREATE TABLE IF NOT EXISTS teams (
team_id TEXT PRIMARY KEY,
team_name TEXT NOT NULL,
team_city TEXT NOT NULL,
conference TEXT NOT NULL CHECK (conference IN ('AFC', 'NFC')),
division TEXT NOT NULL
);
-- Games table: one row per game
CREATE TABLE IF NOT EXISTS games (
game_id TEXT PRIMARY KEY,
season INTEGER NOT NULL,
week INTEGER NOT NULL,
game_date TEXT NOT NULL,
day_of_week TEXT NOT NULL,
home_team_id TEXT NOT NULL,
away_team_id TEXT NOT NULL,
home_score INTEGER,
away_score INTEGER,
is_playoff INTEGER DEFAULT 0,
is_divisional INTEGER DEFAULT 0,
FOREIGN KEY (home_team_id) REFERENCES teams(team_id),
FOREIGN KEY (away_team_id) REFERENCES teams(team_id)
);
-- Odds table: one row per game, linked to games
CREATE TABLE IF NOT EXISTS odds (
odds_id INTEGER PRIMARY KEY AUTOINCREMENT,
game_id TEXT NOT NULL UNIQUE,
spread REAL, -- negative = home favored
total REAL,
home_ml INTEGER,
away_ml INTEGER,
closing_spread REAL,
closing_total REAL,
FOREIGN KEY (game_id) REFERENCES games(game_id)
);
Key design decisions:
- team_id uses abbreviations (e.g., KC, BUF) because they are short, memorable, and match most data sources after standardization.
- game_id is a composite string combining season, week, and teams (e.g., 2023_01_KC_DET) for human readability.
- Odds are in a separate table because not every game will have odds data (historical games, preseason), and keeping them separate maintains clean referential integrity.
- is_divisional is precomputed during loading rather than derived at query time, because division membership changes rarely and the flag speeds up common queries.
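The game_id convention is easy to sketch as a small helper; make_game_id is a hypothetical name introduced here for illustration, and the values are made up:

```python
def make_game_id(season: int, week: int, home_id: str, away_id: str) -> str:
    """Build the composite game identifier, zero-padding the week number."""
    return f"{season}_{week:02d}_{home_id}_{away_id}"

print(make_game_id(2023, 1, "KC", "DET"))   # 2023_01_KC_DET
print(make_game_id(2023, 14, "BUF", "KC"))  # 2023_14_BUF_KC
```

Zero-padding the week keeps game IDs in chronological order when sorted as plain strings, which is convenient for ad-hoc queries and file listings.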
Step 3: Build the Data Collection Pipeline
We will simulate data collection using a structured approach that mirrors real-world scraping. In production, you would replace the sample data generator with actual API calls or scraper functions.
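A minimal sketch of the DataFrame shape the loader expects; the rows below are illustrative values, not real results, and the column names match what load_games_to_database consumes in Step 5:

```python
import pandas as pd

# Illustrative rows only; a real pipeline would fill this from an API or scraper.
sample_games = pd.DataFrame([
    {"week": 1, "date": "2023-09-07", "day_of_week": "Thu",
     "home_team": "Kansas City Chiefs", "away_team": "Detroit Lions",
     "home_score": 20, "away_score": 21,
     "spread": -4.5, "total": 53.0, "home_ml": -220, "away_ml": 185},
    {"week": 1, "date": "2023-09-10", "day_of_week": "Sun",
     "home_team": "NYJ", "away_team": "Buffalo Bills",
     "home_score": 22, "away_score": 16,
     "spread": 2.5, "total": 45.5, "home_ml": 110, "away_ml": -130},
])
print(sample_games.shape)  # (2, 11)
```

Note that team names arrive in mixed formats (full names, abbreviations); the standardization step below is what reconciles them.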
import sqlite3
import pandas as pd
from datetime import datetime, timedelta
from typing import Optional
# NFL team definitions
NFL_TEAMS: dict[str, dict[str, str]] = {
"KC": {"name": "Chiefs", "city": "Kansas City", "conf": "AFC", "div": "AFC West"},
"BUF": {"name": "Bills", "city": "Buffalo", "conf": "AFC", "div": "AFC East"},
"MIA": {"name": "Dolphins", "city": "Miami", "conf": "AFC", "div": "AFC East"},
"NE": {"name": "Patriots", "city": "New England", "conf": "AFC", "div": "AFC East"},
"NYJ": {"name": "Jets", "city": "New York", "conf": "AFC", "div": "AFC East"},
"BAL": {"name": "Ravens", "city": "Baltimore", "conf": "AFC", "div": "AFC North"},
"CIN": {"name": "Bengals", "city": "Cincinnati", "conf": "AFC", "div": "AFC North"},
"CLE": {"name": "Browns", "city": "Cleveland", "conf": "AFC", "div": "AFC North"},
"PIT": {"name": "Steelers", "city": "Pittsburgh", "conf": "AFC", "div": "AFC North"},
"HOU": {"name": "Texans", "city": "Houston", "conf": "AFC", "div": "AFC South"},
"IND": {"name": "Colts", "city": "Indianapolis", "conf": "AFC", "div": "AFC South"},
"JAX": {"name": "Jaguars", "city": "Jacksonville", "conf": "AFC", "div": "AFC South"},
"TEN": {"name": "Titans", "city": "Tennessee", "conf": "AFC", "div": "AFC South"},
"DEN": {"name": "Broncos", "city": "Denver", "conf": "AFC", "div": "AFC West"},
"LV": {"name": "Raiders", "city": "Las Vegas", "conf": "AFC", "div": "AFC West"},
"LAC": {"name": "Chargers", "city": "Los Angeles", "conf": "AFC", "div": "AFC West"},
"DAL": {"name": "Cowboys", "city": "Dallas", "conf": "NFC", "div": "NFC East"},
"NYG": {"name": "Giants", "city": "New York", "conf": "NFC", "div": "NFC East"},
"PHI": {"name": "Eagles", "city": "Philadelphia", "conf": "NFC", "div": "NFC East"},
"WAS": {"name": "Commanders", "city": "Washington", "conf": "NFC", "div": "NFC East"},
"CHI": {"name": "Bears", "city": "Chicago", "conf": "NFC", "div": "NFC North"},
"DET": {"name": "Lions", "city": "Detroit", "conf": "NFC", "div": "NFC North"},
"GB": {"name": "Packers", "city": "Green Bay", "conf": "NFC", "div": "NFC North"},
"MIN": {"name": "Vikings", "city": "Minnesota", "conf": "NFC", "div": "NFC North"},
"ATL": {"name": "Falcons", "city": "Atlanta", "conf": "NFC", "div": "NFC South"},
"CAR": {"name": "Panthers", "city": "Carolina", "conf": "NFC", "div": "NFC South"},
"NO": {"name": "Saints", "city": "New Orleans", "conf": "NFC", "div": "NFC South"},
"TB": {"name": "Buccaneers", "city": "Tampa Bay", "conf": "NFC", "div": "NFC South"},
"ARI": {"name": "Cardinals", "city": "Arizona", "conf": "NFC", "div": "NFC West"},
"LAR": {"name": "Rams", "city": "Los Angeles", "conf": "NFC", "div": "NFC West"},
"SF": {"name": "49ers", "city": "San Francisco", "conf": "NFC", "div": "NFC West"},
"SEA": {"name": "Seahawks", "city": "Seattle", "conf": "NFC", "div": "NFC West"},
}
This dictionary serves as our canonical team reference. Every data source we ingest will be mapped to these abbreviations.
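Because the dictionary carries division membership, grouping teams by division is a one-liner away. A sketch with a trimmed copy of the dictionary (three entries shown for illustration):

```python
from collections import defaultdict

# Trimmed copy of NFL_TEAMS for illustration only.
NFL_TEAMS = {
    "KC": {"div": "AFC West"},
    "DEN": {"div": "AFC West"},
    "DET": {"div": "NFC North"},
}

by_div: dict[str, list[str]] = defaultdict(list)
for team_id, info in NFL_TEAMS.items():
    by_div[info["div"]].append(team_id)

print(dict(by_div))  # {'AFC West': ['KC', 'DEN'], 'NFC North': ['DET']}
```

The same lookup powers the is_divisional flag computed during loading: two teams are division rivals exactly when their "div" values match.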
Step 4: Team Name Standardization
Different data sources use different names for the same team. This function handles the mapping:
# Common alternate names found in various data sources
TEAM_NAME_ALIASES: dict[str, str] = {
"Kansas City Chiefs": "KC",
"KAN": "KC",
"Buffalo Bills": "BUF",
"Miami Dolphins": "MIA",
"New England Patriots": "NE",
"New York Jets": "NYJ",
"Baltimore Ravens": "BAL",
"Cincinnati Bengals": "CIN",
"Cleveland Browns": "CLE",
"Pittsburgh Steelers": "PIT",
"Houston Texans": "HOU",
"Indianapolis Colts": "IND",
"Jacksonville Jaguars": "JAX",
"JAC": "JAX", # Common alternate abbreviation
"Tennessee Titans": "TEN",
"Denver Broncos": "DEN",
"Las Vegas Raiders": "LV",
"Oakland Raiders": "LV", # Historical name
"OAK": "LV",
"Los Angeles Chargers": "LAC",
"San Diego Chargers": "LAC", # Historical name
"SD": "LAC",
"Dallas Cowboys": "DAL",
"New York Giants": "NYG",
"Philadelphia Eagles": "PHI",
"Washington Commanders": "WAS",
"Washington Football Team": "WAS",
"Washington Redskins": "WAS", # Historical name
"WSH": "WAS",
"Chicago Bears": "CHI",
"Detroit Lions": "DET",
"Green Bay Packers": "GB",
"Minnesota Vikings": "MIN",
"Atlanta Falcons": "ATL",
"Carolina Panthers": "CAR",
"New Orleans Saints": "NO",
"Tampa Bay Buccaneers": "TB",
"Arizona Cardinals": "ARI",
"Los Angeles Rams": "LAR",
"St. Louis Rams": "LAR", # Historical name
"STL": "LAR",
"San Francisco 49ers": "SF",
"Seattle Seahawks": "SEA",
}
def standardize_team_name(raw_name: str) -> str:
"""Convert any team name variant to its canonical abbreviation.
Args:
raw_name: The team name as it appears in the source data.
Returns:
The canonical team abbreviation (e.g., 'KC', 'BUF').
Raises:
ValueError: If the name cannot be mapped to a known team.
"""
cleaned = raw_name.strip()
# Already a canonical abbreviation
if cleaned in NFL_TEAMS:
return cleaned
# Check the alias dictionary
if cleaned in TEAM_NAME_ALIASES:
return TEAM_NAME_ALIASES[cleaned]
# Case-insensitive partial matching as a fallback
cleaned_lower = cleaned.lower()
for alias, team_id in TEAM_NAME_ALIASES.items():
if alias.lower() == cleaned_lower:
return team_id
raise ValueError(f"Unknown team name: '{raw_name}'. Add it to TEAM_NAME_ALIASES.")
This function is called on every team name in every row of incoming data. The alias dictionary handles historical names (Oakland Raiders to LV), alternate abbreviations (JAC to JAX), and full names. The ValueError at the end ensures that unknown names fail loudly rather than silently corrupting data.
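To make the three-stage lookup order concrete, here is a self-contained copy of the function with trimmed dictionaries (a few entries each, for illustration only):

```python
# Trimmed copies of the reference dictionaries, for illustration only.
NFL_TEAMS = {"KC": {}, "LV": {}, "JAX": {}}
TEAM_NAME_ALIASES = {
    "Kansas City Chiefs": "KC",
    "Oakland Raiders": "LV",
    "OAK": "LV",
    "JAC": "JAX",
}

def standardize_team_name(raw_name: str) -> str:
    cleaned = raw_name.strip()
    if cleaned in NFL_TEAMS:            # stage 1: already canonical
        return cleaned
    if cleaned in TEAM_NAME_ALIASES:    # stage 2: exact alias match
        return TEAM_NAME_ALIASES[cleaned]
    cleaned_lower = cleaned.lower()     # stage 3: case-insensitive fallback
    for alias, team_id in TEAM_NAME_ALIASES.items():
        if alias.lower() == cleaned_lower:
            return team_id
    raise ValueError(f"Unknown team name: {raw_name!r}")

print(standardize_team_name("KC"))                    # KC (stage 1)
print(standardize_team_name("OAK"))                   # LV (stage 2)
print(standardize_team_name(" kansas city chiefs "))  # KC (stage 3)
```

The ordering matters: exact matches are O(1) dictionary lookups, so the linear scan in stage 3 only runs for oddly cased input.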
Step 5: Data Loading and Cleaning
The loading function takes a pandas DataFrame (from any source: CSV, API, scraper) and inserts it into the database:
def load_games_to_database(
df: pd.DataFrame,
db_path: str,
season: int
) -> dict[str, int]:
"""Load a DataFrame of game data into the SQLite database.
Args:
df: DataFrame with columns: date, home_team, away_team,
home_score, away_score, spread, total.
db_path: Path to the SQLite database file.
season: The NFL season year.
Returns:
A dictionary summarizing the loading results:
{'inserted': N, 'skipped_duplicates': N, 'errors': N}.
"""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA foreign_keys = ON") # Enforce referential integrity
cursor = conn.cursor()
results = {"inserted": 0, "skipped_duplicates": 0, "errors": 0}
for _, row in df.iterrows():
try:
# Standardize team names
home_id = standardize_team_name(row["home_team"])
away_id = standardize_team_name(row["away_team"])
# Determine if this is a divisional game
home_div = NFL_TEAMS[home_id]["div"]
away_div = NFL_TEAMS[away_id]["div"]
is_divisional = 1 if home_div == away_div else 0
# Build game_id
week = int(row["week"])
game_id = f"{season}_{week:02d}_{home_id}_{away_id}"
# Insert the game
cursor.execute(
"""INSERT OR IGNORE INTO games
(game_id, season, week, game_date, day_of_week,
home_team_id, away_team_id, home_score, away_score,
is_playoff, is_divisional)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
game_id, season, week,
row["date"], row.get("day_of_week", "Unknown"),
home_id, away_id,
int(row["home_score"]), int(row["away_score"]),
int(row.get("is_playoff", 0)), is_divisional,
),
)
if cursor.rowcount == 0:
results["skipped_duplicates"] += 1
else:
results["inserted"] += 1
# Insert odds if available
if pd.notna(row.get("spread")) and pd.notna(row.get("total")):
cursor.execute(
"""INSERT OR IGNORE INTO odds
(game_id, spread, total, home_ml, away_ml)
VALUES (?, ?, ?, ?, ?)""",
(
game_id,
float(row["spread"]),
float(row["total"]),
int(row["home_ml"]) if pd.notna(row.get("home_ml")) else None,
int(row["away_ml"]) if pd.notna(row.get("away_ml")) else None,
),
)
except (ValueError, KeyError) as e:
print(f"Error processing row: {e}")
results["errors"] += 1
conn.commit()
conn.close()
return results
The INSERT OR IGNORE pattern handles duplicate game IDs gracefully --- if we accidentally load the same game twice, it is silently skipped rather than causing an error. The results dictionary provides a clear audit trail.
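The behavior is easy to verify against an in-memory database; cursor.rowcount is what distinguishes a real insert from a silent skip:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id TEXT PRIMARY KEY)")
cur = conn.cursor()

cur.execute("INSERT OR IGNORE INTO games VALUES ('2023_01_KC_DET')")
first = cur.rowcount   # 1: new row inserted

cur.execute("INSERT OR IGNORE INTO games VALUES ('2023_01_KC_DET')")
second = cur.rowcount  # 0: duplicate primary key, silently skipped

count = conn.execute("SELECT COUNT(*) FROM games").fetchone()[0]
print(first, second, count)  # 1 0 1
```

This is exactly the check the loader uses to increment either inserted or skipped_duplicates.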
Step 6: Data Quality Validation
After loading, we run automated checks:
def validate_database(db_path: str, season: int) -> list[str]:
"""Run data quality checks on the loaded database.
Args:
db_path: Path to the SQLite database file.
season: The season to validate.
Returns:
A list of warning messages. An empty list means all checks passed.
"""
conn = sqlite3.connect(db_path)
warnings: list[str] = []
    # Check 1: expected number of regular-season games. The 18-week,
    # 32-team NFL schedule gives each team 17 games (plus a bye):
    # 32 * 17 = 544 team-games = 272 unique games.
game_count = conn.execute(
"SELECT COUNT(*) FROM games WHERE season = ? AND is_playoff = 0",
(season,),
).fetchone()[0]
if game_count < 272:
warnings.append(
f"Only {game_count} regular-season games found (expected 272). "
f"Missing {272 - game_count} games."
)
elif game_count > 272:
warnings.append(
f"Found {game_count} regular-season games (expected 272). "
f"Possible duplicates."
)
# Check 2: Every team should appear in exactly 17 regular-season games
team_counts = conn.execute(
"""SELECT team_id, COUNT(*) as games FROM (
SELECT home_team_id AS team_id FROM games
WHERE season = ? AND is_playoff = 0
UNION ALL
SELECT away_team_id AS team_id FROM games
WHERE season = ? AND is_playoff = 0
) GROUP BY team_id""",
(season, season),
).fetchall()
for team_id, count in team_counts:
if count != 17:
warnings.append(f"Team {team_id} has {count} games (expected 17).")
# Check 3: No negative scores
bad_scores = conn.execute(
"""SELECT game_id FROM games
WHERE (home_score < 0 OR away_score < 0)
AND season = ?""",
(season,),
).fetchall()
if bad_scores:
warnings.append(
f"Found {len(bad_scores)} games with negative scores: "
f"{[g[0] for g in bad_scores]}"
)
# Check 4: Odds coverage
total_games = conn.execute(
"SELECT COUNT(*) FROM games WHERE season = ?", (season,)
).fetchone()[0]
odds_count = conn.execute(
"""SELECT COUNT(*) FROM odds o
JOIN games g ON o.game_id = g.game_id
WHERE g.season = ?""",
(season,),
).fetchone()[0]
coverage = (odds_count / total_games * 100) if total_games > 0 else 0
if coverage < 95:
warnings.append(
f"Odds coverage is only {coverage:.1f}% ({odds_count}/{total_games} games)."
)
conn.close()
return warnings
Step 7: Analytical Queries
With clean data in the database, we can answer betting questions with SQL:
def get_ats_records(db_path: str, season: int) -> pd.DataFrame:
"""Compute each team's against-the-spread record for a season.
A team 'covers' when their actual margin of victory exceeds the spread.
For the home team: (home_score - away_score) + spread > 0 means cover.
(spread is negative when home is favored.)
"""
query = """
WITH team_games AS (
-- Home games
SELECT
g.home_team_id AS team_id,
(g.home_score - g.away_score) AS score_diff,
o.spread AS spread,
'home' AS location
FROM games g
JOIN odds o ON g.game_id = o.game_id
WHERE g.season = ? AND g.is_playoff = 0
UNION ALL
-- Away games (flip the score diff and spread)
SELECT
g.away_team_id AS team_id,
(g.away_score - g.home_score) AS score_diff,
-o.spread AS spread,
'away' AS location
FROM games g
JOIN odds o ON g.game_id = o.game_id
WHERE g.season = ? AND g.is_playoff = 0
)
SELECT
team_id,
COUNT(*) AS games,
SUM(CASE WHEN score_diff + spread > 0 THEN 1 ELSE 0 END) AS covers,
SUM(CASE WHEN score_diff + spread < 0 THEN 1 ELSE 0 END) AS non_covers,
SUM(CASE WHEN score_diff + spread = 0 THEN 1 ELSE 0 END) AS pushes,
    ROUND(
        100.0 * SUM(CASE WHEN score_diff + spread > 0 THEN 1 ELSE 0 END)
        / COUNT(*), 1
    ) AS cover_pct  -- note: pushes remain in the denominator here
FROM team_games
GROUP BY team_id
ORDER BY cover_pct DESC;
"""
conn = sqlite3.connect(db_path)
df = pd.read_sql_query(query, conn, params=(season, season))
conn.close()
return df
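The cover condition in the CTE reduces to simple arithmetic. A sketch with illustrative numbers (covers is a helper name introduced here, not part of the case-study code):

```python
def covers(score_diff: int, spread: float) -> str:
    """Classify an ATS result from one team's perspective.

    score_diff: the team's points minus the opponent's points.
    spread: the line from that team's perspective (negative = favored).
    """
    margin = score_diff + spread
    if margin > 0:
        return "cover"
    if margin < 0:
        return "no cover"
    return "push"

# Home team favored by 3 (spread = -3.0) wins by 7: 7 - 3 = 4 > 0.
print(covers(7, -3.0))  # cover
# Same favorite wins by only 2: 2 - 3 = -1 < 0.
print(covers(2, -3.0))  # no cover
# Wins by exactly 3: 3 - 3 = 0.
print(covers(3, -3.0))  # push
```

Negating both score_diff and spread for away games, as the query's second branch does, leaves this arithmetic unchanged, which is why one set of CASE expressions serves both branches.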
def get_home_field_advantage(db_path: str, season: int) -> dict[str, float]:
"""Compute home-field advantage metrics for a season."""
conn = sqlite3.connect(db_path)
# Straight-up home win percentage
home_wins = conn.execute(
"""SELECT
COUNT(*) as total,
SUM(CASE WHEN home_score > away_score THEN 1 ELSE 0 END) as home_wins
FROM games WHERE season = ? AND is_playoff = 0""",
(season,),
).fetchone()
# Average home margin
avg_margin = conn.execute(
"""SELECT AVG(home_score - away_score)
FROM games WHERE season = ? AND is_playoff = 0""",
(season,),
).fetchone()[0]
# Home ATS record
home_ats = conn.execute(
"""SELECT
SUM(CASE WHEN (g.home_score - g.away_score) + o.spread > 0
THEN 1 ELSE 0 END) as covers,
COUNT(*) as total
FROM games g JOIN odds o ON g.game_id = o.game_id
WHERE g.season = ? AND g.is_playoff = 0""",
(season,),
).fetchone()
conn.close()
total_games = home_wins[0]
return {
"home_win_pct": round(home_wins[1] / total_games * 100, 1),
"avg_home_margin": round(avg_margin, 2),
"home_ats_cover_pct": round(home_ats[0] / home_ats[1] * 100, 1),
"total_games": total_games,
}
Step 8: Putting It All Together
The main orchestration function ties every step together:
def build_nfl_database(
data_source: pd.DataFrame,
db_path: str,
season: int,
) -> None:
"""Complete pipeline: create schema, load data, validate, report.
Args:
data_source: DataFrame containing the raw game data.
db_path: Path where the SQLite database will be created.
season: The NFL season year (e.g., 2023).
"""
# Step 1: Create schema
conn = sqlite3.connect(db_path)
conn.executescript(SCHEMA_SQL) # The CREATE TABLE statements from Step 2
# Step 2: Load teams
for team_id, info in NFL_TEAMS.items():
conn.execute(
"INSERT OR IGNORE INTO teams VALUES (?, ?, ?, ?, ?)",
(team_id, info["name"], info["city"], info["conf"], info["div"]),
)
conn.commit()
conn.close()
# Step 3: Load games and odds
results = load_games_to_database(data_source, db_path, season)
print(f"Loading complete: {results}")
# Step 4: Validate
warnings = validate_database(db_path, season)
if warnings:
print("Validation warnings:")
for w in warnings:
print(f" - {w}")
else:
print("All validation checks passed.")
# Step 5: Generate summary report
ats_df = get_ats_records(db_path, season)
print(f"\nATS Records ({season} season):")
print(ats_df.to_string(index=False))
hfa = get_home_field_advantage(db_path, season)
print(f"\nHome-Field Advantage:")
print(f" Win%: {hfa['home_win_pct']}%")
print(f" Avg Margin: {hfa['avg_home_margin']} points")
print(f" ATS Cover%: {hfa['home_ats_cover_pct']}%")
Lessons Learned
- Schema design comes first. Spending an hour on schema design saves dozens of hours of painful queries and data restructuring later.
- Team name standardization is the most common data-cleaning task in sports analytics. Build a robust mapping early and expand it as you encounter new sources.
- Validation should be automated and run after every load. The five-minute investment in writing validation functions catches errors that could otherwise go unnoticed for weeks.
- SQLite is sufficient for individual bettors. It requires no server setup, stores the entire database in a single file, and handles millions of rows efficiently. Graduate to PostgreSQL only when you need concurrent access or advanced features.
- Separate concerns. Collection, cleaning, loading, validation, and analysis are distinct pipeline stages. Keeping them modular makes debugging and extending the system straightforward.
Extension Ideas
- Add a bet_tracking table to log your personal bets against the database and compute ROI.
- Extend the schema with a weather table for outdoor games.
- Add play-by-play data in a plays table with a foreign key to games.
- Build a scheduled script that runs the pipeline weekly during the NFL season.
- Create a simple Flask or Streamlit dashboard that queries the database and displays ATS standings.
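As one example, the bet_tracking extension might look like the sketch below. The column set is an assumption for illustration, not part of the case-study schema; adjust it to your own workflow:

```python
import sqlite3

# Hypothetical extension schema; columns are an assumption, not canon.
BET_TRACKING_SQL = """
CREATE TABLE IF NOT EXISTS bet_tracking (
    bet_id INTEGER PRIMARY KEY AUTOINCREMENT,
    game_id TEXT NOT NULL,
    bet_type TEXT NOT NULL CHECK (bet_type IN ('spread', 'total', 'moneyline')),
    line REAL,                 -- the number you bet at (spread or total)
    odds INTEGER NOT NULL,     -- American odds, e.g. -110
    stake REAL NOT NULL,
    result TEXT CHECK (result IN ('win', 'loss', 'push')),
    payout REAL,
    FOREIGN KEY (game_id) REFERENCES games(game_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id TEXT PRIMARY KEY)")  # stand-in parent
conn.executescript(BET_TRACKING_SQL)
name = conn.execute(
    "SELECT name FROM sqlite_master WHERE name = 'bet_tracking'"
).fetchone()[0]
print(name)  # bet_tracking
```

With result and payout recorded per bet, season ROI is a single SUM(payout - stake) / SUM(stake) aggregate away.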