Learning Objectives

  • Identify and evaluate major sports data sources, APIs, and Python packages for obtaining historical game statistics and odds data
  • Perform fundamental data manipulation tasks using pandas DataFrames, including loading, filtering, grouping, and merging sports datasets
  • Apply systematic data cleaning techniques to handle missing values, inconsistent naming conventions, and outliers in sports data
  • Conduct exploratory data analysis on sports datasets using summary statistics, visualizations, and correlation analysis to uncover betting-relevant patterns
  • Design and implement a personal SQLite betting database with proper schema design, automated data collection, and maintenance workflows

Chapter 5: Data Literacy for Bettors

"Without data, you're just another person with an opinion." -- W. Edwards Deming

Chapter Overview

In the first four chapters, we established the conceptual foundations of sports betting: how markets work, how odds encode probabilities, how bookmakers set lines, and how to think about expected value. Those ideas are necessary but insufficient. To move from theory into practice -- to actually build the models that identify value in betting markets -- you need data. Lots of it. And you need to know how to find it, clean it, reshape it, and interrogate it.

This chapter is the most code-heavy in Part I. It marks the transition from conceptual understanding to hands-on quantitative work. By the end, you will have a functioning data pipeline: you will know where to get sports data, how to wrangle it with Python and pandas, how to explore it visually, and how to store it in a personal database that grows over time.

A word of calibration before we begin. This chapter assumes you have basic Python literacy -- you can write functions, use loops, and install packages with pip. If you are new to Python, spend a week with an introductory tutorial before proceeding. We will not explain what a for loop is, but we will explain every pandas operation and data concept in detail. The goal is not to teach you programming from scratch; it is to teach you how programmers handle sports data.


5.1 Sports Data Sources and APIs

The sports betting data ecosystem is surprisingly rich. A decade ago, obtaining clean historical data required expensive subscriptions or tedious manual collection. Today, free sources cover most major sports with remarkable depth, and well-maintained Python packages wrap the best of them into clean interfaces. The challenge has shifted from finding data to choosing among abundant sources and combining them intelligently.

Free Reference Sites

The Sports Reference family of websites is the gold standard for free historical sports statistics.

Pro Football Reference (pro-football-reference.com) provides game logs, player statistics, team statistics, play-by-play data, and advanced metrics for every NFL season back to 1920. The depth is extraordinary: you can find snap counts, target shares, defensive pressure rates, and dozens of advanced metrics. The site exposes its data through well-structured HTML tables, making it accessible to both manual researchers and automated scrapers.

Basketball Reference (basketball-reference.com) offers similarly comprehensive coverage for the NBA, including box scores, advanced statistics (PER, Win Shares, BPM, VORP), and detailed game logs. Historical coverage extends back to the BAA era of the late 1940s.

Baseball Reference (baseball-reference.com) is the oldest and deepest of the family. Baseball's statistical tradition means you can find Statcast data, pitch-level data, situational splits, and nearly every metric the sabermetrics community has developed.

FBRef (fbref.com) covers global soccer (football) with data sourced from StatsBomb and Opta. This includes expected goals (xG), passing networks, defensive actions, and progressive carries -- the kinds of advanced metrics that power modern soccer analytics.

These sites are free for personal use, but they have terms of service that restrict automated bulk scraping. Respect those terms. For most analytical work, the Python packages described below provide a cleaner and more ethical path to the same data.

Python Packages for Sports Data

The Python ecosystem has produced excellent packages that wrap sports data APIs into clean, pandas-friendly interfaces. These are your primary tools for programmatic data access.

nfl_data_py is the standard package for NFL data. It pulls from the nflverse ecosystem, which maintains cleaned and standardized NFL datasets including play-by-play data, roster information, schedule results, and draft picks. The data is hosted on GitHub and updated weekly during the season.

import nfl_data_py as nfl
import pandas as pd

# Pull play-by-play data for recent seasons
pbp = nfl.import_pbp_data([2023, 2024])
print(f"Play-by-play dataset: {pbp.shape[0]:,} plays, {pbp.shape[1]} columns")
print(f"Columns include: {list(pbp.columns[:10])}")

# Pull weekly roster data
rosters = nfl.import_rosters([2024])
print(f"\nRoster entries: {rosters.shape[0]:,}")

# Pull schedule and game results
schedules = nfl.import_schedules([2023, 2024])
print(f"\nGames in schedule: {schedules.shape[0]:,}")
print(f"Columns: {list(schedules.columns)}")

# The schedule data includes spreads and over/unders
print(schedules[['game_id', 'home_team', 'away_team', 'home_score',
                  'away_score', 'spread_line', 'total_line']].head(10))

This single package gives you access to the richest play-by-play dataset in sports analytics. Each play record contains over 300 columns: down, distance, yard line, personnel groupings, expected points added, win probability, completion probability, and much more.
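
To give a flavor of what those columns enable, here is a brief sketch that ranks 2024 offenses by expected points added (EPA) per play, using the pbp DataFrame loaded above; posteam, epa, play_type, and season are standard nflverse columns.

# Offensive efficiency: EPA per play by team (pass and run plays only)
offense = (pbp[(pbp['season'] == 2024) & (pbp['play_type'].isin(['pass', 'run']))]
           .groupby('posteam')['epa']
           .agg(['mean', 'count'])
           .rename(columns={'mean': 'epa_per_play', 'count': 'plays'})
           .sort_values('epa_per_play', ascending=False))
print(offense.head(10))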

nba_api provides access to the NBA's official statistics API (stats.nba.com). The API is comprehensive but can be temperamental -- it occasionally rate-limits requests or changes endpoints. Always include delays between requests.

from nba_api.stats.endpoints import leaguegamefinder, teamgamelogs
from nba_api.stats.static import teams
import pandas as pd
import time

# Get all NBA teams
nba_teams = teams.get_teams()
print(f"Number of NBA teams: {len(nba_teams)}")

# Find a specific team
celtics = [t for t in nba_teams if t['abbreviation'] == 'BOS'][0]
print(f"Team: {celtics['full_name']}, ID: {celtics['id']}")

# Pull game logs for a team in a specific season
# Note: always add a delay between API calls to avoid rate limiting
time.sleep(0.6)  # Respect rate limits: pause before each request to the stats endpoint
game_logs = teamgamelogs.TeamGameLogs(
    season_nullable='2024-25',
    team_id_nullable=celtics['id']
)

df_games = game_logs.get_data_frames()[0]
print(f"\nGames found: {len(df_games)}")
print(df_games[['GAME_ID', 'TEAM_NAME', 'GAME_DATE', 'MATCHUP',
                'WL', 'PTS', 'REB', 'AST']].head(10))

pybaseball wraps multiple baseball data sources, including Statcast pitch-level data from Baseball Savant, FanGraphs statistics, and Baseball Reference data.

from pybaseball import statcast, batting_stats, pitching_stats, schedule_and_record
import pandas as pd

# Pull Statcast data for a date range
# Warning: large date ranges will pull millions of rows
statcast_data = statcast(start_dt='2024-06-01', end_dt='2024-06-07')
print(f"Pitches tracked: {statcast_data.shape[0]:,}")
print(f"Columns: {statcast_data.shape[1]}")

# Pull season-level batting stats from FanGraphs
batters = batting_stats(2024, qual=100)  # Minimum 100 PA
print(f"\nQualified batters: {len(batters)}")
print(batters[['Name', 'Team', 'G', 'PA', 'HR', 'AVG', 'OBP',
               'SLG', 'wOBA', 'WAR']].head(10))

# Pull team schedule and results
nyy_schedule = schedule_and_record(2024, 'NYY')
print(f"\nYankees games: {len(nyy_schedule)}")

Odds Data Sources

Betting analysis requires more than game statistics -- you need historical odds data to evaluate line movements, closing lines, and market efficiency.

Historical odds databases are available from several sources. The Australian Sports Betting site (aussportsbetting.com) provides free historical closing odds for major sports. Kaggle hosts several historical odds datasets. The Covers.com website archives historical spreads and totals.

Odds APIs provide real-time and historical odds from multiple bookmakers. The-Odds-API (the-odds-api.com) offers a freemium model with access to live odds from dozens of bookmakers across major sports. The free tier typically allows several hundred requests per month.

import requests
import pandas as pd

# Example: fetching live odds from The Odds API
# You'll need to sign up for a free API key
API_KEY = 'YOUR_API_KEY_HERE'  # Replace with your actual key

def fetch_odds(sport='americanfootball_nfl', markets='spreads,totals'):
    """Fetch current odds from The Odds API."""
    url = f'https://api.the-odds-api.com/v4/sports/{sport}/odds'
    params = {
        'apiKey': API_KEY,
        'regions': 'us',
        'markets': markets,
        'oddsFormat': 'american'
    }
    response = requests.get(url, params=params)

    if response.status_code == 200:
        data = response.json()
        remaining = response.headers.get('x-requests-remaining', 'unknown')
        print(f"Requests remaining this month: {remaining}")
        return data
    else:
        print(f"Error: {response.status_code}")
        return None

# Parse odds data into a DataFrame
def odds_to_dataframe(odds_data):
    """Convert raw odds JSON into a structured DataFrame."""
    rows = []
    for game in odds_data:
        game_info = {
            'game_id': game['id'],
            'sport': game['sport_key'],
            'commence_time': game['commence_time'],
            'home_team': game['home_team'],
            'away_team': game['away_team']
        }
        for bookmaker in game.get('bookmakers', []):
            for market in bookmaker.get('markets', []):
                for outcome in market.get('outcomes', []):
                    row = {**game_info}
                    row['bookmaker'] = bookmaker['key']
                    row['market'] = market['key']
                    row['outcome_name'] = outcome['name']
                    row['price'] = outcome['price']
                    row['point'] = outcome.get('point', None)
                    rows.append(row)
    return pd.DataFrame(rows)
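
A typical call sequence, shown commented out (as with the other live examples in this chapter) because it requires a valid API key:

# Example usage:
# raw_odds = fetch_odds('americanfootball_nfl', markets='spreads,totals')
# if raw_odds:
#     odds_df = odds_to_dataframe(raw_odds)
#     print(odds_df.head())
#     odds_df.to_csv('nfl_odds_snapshot.csv', index=False)  # Cache the snapshot locally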

For serious bettors and researchers, paid data providers offer cleaner data, broader coverage, and more reliable access.

Sportradar is the official data provider for the NFL, NBA, MLB, and NHL. Their feeds include real-time play-by-play, detailed player tracking data, and proprietary metrics. Pricing is typically enterprise-level.

Stats Perform (Opta) dominates soccer data globally. Their event-level data includes every touch, pass, shot, and tactical action in a match. Opta data powers most advanced soccer analytics.

Second Spectrum and Hawk-Eye provide player tracking data for basketball and cricket/tennis respectively. These datasets enable spatial analysis that goes far beyond box score statistics.

For most individual bettors, the free sources and Python packages described above provide more than enough data to build competitive models. Paid sources become necessary when you are working with less popular sports, need real-time data feeds for live betting, or require player tracking data for advanced spatial models.

Web Scraping Basics

Sometimes the data you need is not available through an API or a Python package. It sits on a website in an HTML table, and you need to extract it. This is web scraping, and it comes with both technical and ethical considerations.

Ethical guidelines for web scraping:

  1. Check robots.txt first. Every website has a robots.txt file (e.g., pro-football-reference.com/robots.txt) that specifies which pages automated tools are allowed to access. Respect these directives.
  2. Read the Terms of Service. Many sports data sites explicitly prohibit automated scraping. Violating these terms can result in IP bans and, in extreme cases, legal action.
  3. Rate-limit your requests. Even when scraping is allowed, hammering a server with hundreds of requests per second is abusive. Add delays of at least one second between requests. Two to three seconds is better.
  4. Cache your results. Download data once and save it locally. Do not re-scrape the same page repeatedly. (A small caching sketch follows the scraping example below.)
  5. Prefer APIs and packages. If a Python package or API provides the same data, use it instead of scraping. These maintained interfaces are more stable and more ethical.

When scraping is appropriate, Python's requests and BeautifulSoup libraries are the standard tools. For pages that render content with JavaScript, selenium or playwright may be necessary.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_pfr_table(url, table_id, delay=3):
    """
    Scrape a specific table from Pro Football Reference.

    Parameters:
        url: Full URL of the page
        table_id: HTML id attribute of the target table
        delay: Seconds to wait before making the request (rate limiting)

    Returns:
        pandas DataFrame of the table contents
    """
    time.sleep(delay)  # Rate limiting -- be respectful

    headers = {
        'User-Agent': 'Mozilla/5.0 (research project, contact@example.com)'
    }
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        raise Exception(f"Request failed with status {response.status_code}")

    soup = BeautifulSoup(response.content, 'html.parser')

    # PFR sometimes wraps tables in comments to reduce initial page load
    # We need to check both visible tables and commented-out tables
    table = soup.find('table', {'id': table_id})

    if table is None:
        # Check inside HTML comments
        from bs4 import Comment
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))
        for comment in comments:
            comment_soup = BeautifulSoup(comment, 'html.parser')
            table = comment_soup.find('table', {'id': table_id})
            if table is not None:
                break

    if table is None:
        raise Exception(f"Table '{table_id}' not found on page")

    # Use pandas to parse the HTML table directly
    # (wrap in StringIO to avoid the literal-HTML deprecation in newer pandas)
    from io import StringIO
    df = pd.read_html(StringIO(str(table)))[0]
    return df

# Example usage (do not run without checking robots.txt):
# standings = scrape_pfr_table(
#     'https://www.pro-football-reference.com/years/2024/',
#     'AFC'
# )
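
Guideline 4 above recommends caching whatever you download. A minimal sketch of that idea, with an arbitrary cache directory and naming scheme rather than any standard convention:

import hashlib
import os
import time
import requests

CACHE_DIR = 'scrape_cache'

def fetch_with_cache(url, delay=3):
    """Fetch a URL, saving the raw HTML to disk so repeat calls never re-scrape."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest() + '.html')

    # Serve from the local cache if we have already downloaded this page
    if os.path.exists(cache_path):
        with open(cache_path, 'r', encoding='utf-8') as f:
            return f.read()

    time.sleep(delay)  # Rate limit only when a live request is actually made
    response = requests.get(url)
    response.raise_for_status()
    with open(cache_path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text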

5.2 Python and Pandas Fundamentals for Betting Data

Pandas is the workhorse library of data analysis in Python. If you are going to build sports betting models, you will spend more time writing pandas code than any other single activity. This section is not a comprehensive pandas tutorial -- entire books exist for that purpose. Instead, it focuses on the specific patterns and operations that arise constantly in sports data analysis.

DataFrames and Series

A DataFrame is a two-dimensional labeled data structure -- think of it as a spreadsheet or a SQL table. Each column is a Series -- a one-dimensional array with labels. When you load NFL play-by-play data, each row is a play and each column is an attribute of that play (down, distance, yard line, passer, receiver, expected points added, and so on).

import pandas as pd
import numpy as np

# Create a simple DataFrame from a dictionary
games = pd.DataFrame({
    'home_team': ['KC', 'BUF', 'SF', 'DAL', 'PHI'],
    'away_team': ['BAL', 'MIA', 'SEA', 'NYG', 'WSH'],
    'home_score': [27, 31, 24, 20, 34],
    'away_score': [24, 17, 21, 13, 28],
    'spread': [-3.5, -6.0, -3.0, -7.5, -4.5],
    'total': [51.5, 48.0, 47.5, 42.0, 49.5]
})

# Basic inspection
print(games.shape)         # (5, 6) -- 5 rows, 6 columns
print(games.dtypes)        # Data types of each column
print(games.describe())    # Summary statistics for numeric columns

# A Series is a single column
home_scores = games['home_score']
print(type(home_scores))   # <class 'pandas.core.series.Series'>
print(home_scores.mean())  # 27.2

Loading Data from Various Sources

Sports data arrives in many formats. Here are the most common loading patterns.

import pandas as pd
import json

# From CSV (the most common format for downloaded datasets)
df_csv = pd.read_csv('nfl_games_2024.csv')

# From CSV with specific parsing options
df_csv = pd.read_csv(
    'nfl_games_2024.csv',
    parse_dates=['game_date'],          # Parse date columns automatically
    dtype={'game_id': str},             # Force specific column types
    na_values=['', 'NA', 'N/A', '-']   # Treat these as missing values
)

# From JSON (common with API responses)
with open('odds_data.json', 'r') as f:
    raw_data = json.load(f)
df_json = pd.json_normalize(raw_data)  # Flatten nested JSON structures

# From an API response (requests library)
import requests
response = requests.get('https://api.example.com/games')
df_api = pd.DataFrame(response.json())

# From Excel (some data providers deliver Excel files)
df_excel = pd.read_excel('historical_odds.xlsx', sheet_name='NFL')

# From a SQL database
import sqlite3
conn = sqlite3.connect('betting.db')
df_sql = pd.read_sql('SELECT * FROM games WHERE season = 2024', conn)
conn.close()

Selecting, Filtering, and Grouping

These three operations form the backbone of sports data analysis. You will use them in virtually every analysis you perform.

import nfl_data_py as nfl
import pandas as pd

# Load schedule data
schedule = nfl.import_schedules([2023, 2024])

# --- SELECTING COLUMNS ---
# Select specific columns
subset = schedule[['game_id', 'season', 'week', 'home_team', 'away_team',
                   'home_score', 'away_score', 'spread_line', 'total_line']]

# --- FILTERING ROWS ---
# Filter to completed games only (non-null scores)
completed = subset.dropna(subset=['home_score', 'away_score'])

# Filter by condition: regular season games only (weeks 1-18)
regular_season = completed[completed['week'] <= 18]

# Multiple conditions: 2024 regular season, home favorites
home_favs_2024 = completed[
    (completed['season'] == 2024) &
    (completed['week'] <= 18) &
    (completed['spread_line'] < 0)  # Negative spread = home favorite
]

# Filter using .query() for cleaner syntax
home_favs_2024_alt = completed.query(
    'season == 2024 and week <= 18 and spread_line < 0'
)

# --- DERIVED COLUMNS ---
# Add columns for analytical work
completed = completed.copy()
completed['home_margin'] = completed['home_score'] - completed['away_score']
completed['total_points'] = completed['home_score'] + completed['away_score']
completed['home_covered'] = completed['home_margin'] > -completed['spread_line']
completed['over_hit'] = completed['total_points'] > completed['total_line']

# --- GROUPING AND AGGREGATION ---
# How often do home teams cover, by season?
cover_rates = (completed
    .groupby('season')['home_covered']
    .agg(['mean', 'count'])
    .rename(columns={'mean': 'cover_rate', 'count': 'num_games'})
)
print("Home cover rates by season:")
print(cover_rates)

# Average scoring by week
weekly_scoring = (completed
    .groupby(['season', 'week'])['total_points']
    .mean()
    .reset_index()
    .rename(columns={'total_points': 'avg_total'})
)
print("\nAverage total points by week (2024 season):")
print(weekly_scoring[weekly_scoring['season'] == 2024].head(10))

# Team-level home record
home_records = (completed[completed['season'] == 2024]
    .assign(home_win=lambda df: df['home_margin'] > 0)
    .groupby('home_team')['home_win']
    .agg(['sum', 'count', 'mean'])
    .rename(columns={'sum': 'wins', 'count': 'games', 'mean': 'win_pct'})
    .sort_values('win_pct', ascending=False)
)
print("\n2024 Home Records:")
print(home_records)

Merge and Join Operations

Real betting analysis almost always requires combining data from multiple sources. You might need to join game statistics with odds data, or combine player-level data with team-level results. Mastering merges is essential.

import pandas as pd

# Suppose we have two DataFrames:
# 1) Game results from one source
results = pd.DataFrame({
    'game_id': ['2024_01_KC_BAL', '2024_01_BUF_MIA', '2024_01_SF_SEA'],
    'home_team': ['BAL', 'MIA', 'SEA'],
    'away_team': ['KC', 'BUF', 'SF'],
    'home_score': [20, 17, 21],
    'away_score': [27, 31, 24]
})

# 2) Odds data from another source
odds = pd.DataFrame({
    'game_id': ['2024_01_KC_BAL', '2024_01_BUF_MIA', '2024_01_SF_SEA'],
    'spread': [1.5, 3.0, 6.5],  # Home team spread
    'total': [51.5, 48.0, 43.5],
    'home_ml': [110, 130, -250],  # Moneyline odds
    'away_ml': [-130, -150, 210]
})

# Inner merge: only rows that exist in both DataFrames
merged = pd.merge(results, odds, on='game_id', how='inner')
print("Merged shape:", merged.shape)

# Left merge: keep all rows from the left DataFrame
# (useful when you want all games even if odds are missing)
merged_left = pd.merge(results, odds, on='game_id', how='left')

# When column names conflict, use suffixes
# For example, if both DataFrames have a 'home_team' column:
merged_suffix = pd.merge(
    results, odds,
    on='game_id',
    how='inner',
    suffixes=('_result', '_odds')
)

# Merging on multiple columns (when there's no shared game_id)
# This is common when combining data from different providers
results_alt = pd.DataFrame({
    'date': ['2024-09-05', '2024-09-08', '2024-09-08'],
    'home': ['BAL', 'MIA', 'SEA'],
    'away': ['KC', 'BUF', 'SF'],
    'home_pts': [20, 17, 21],
    'away_pts': [27, 31, 24]
})

odds_alt = pd.DataFrame({
    'game_date': ['2024-09-05', '2024-09-08', '2024-09-08'],
    'home_team': ['BAL', 'MIA', 'SEA'],
    'away_team': ['KC', 'BUF', 'SF'],
    'spread': [1.5, 3.0, 6.5],
    'total': [51.5, 48.0, 43.5]
})

# Merge on multiple columns with different names
combined = pd.merge(
    results_alt, odds_alt,
    left_on=['date', 'home', 'away'],
    right_on=['game_date', 'home_team', 'away_team'],
    how='inner'
)
print(combined)
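
Whichever merge you use, check for rows that failed to match: silent mismatches, often caused by inconsistent team abbreviations or dates, are a common source of bugs. The indicator argument makes them visible.

# Diagnose unmatched rows with the merge indicator
check = pd.merge(results, odds, on='game_id', how='outer', indicator=True)
print(check['_merge'].value_counts())          # 'both', 'left_only', 'right_only'
unmatched = check[check['_merge'] != 'both']
print(f"Rows without a match: {len(unmatched)}")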

Common Pandas Patterns for Sports Analysis

Certain pandas operations come up so frequently in sports analytics that they deserve specific attention. These patterns will save you hours of trial and error.

import pandas as pd
import numpy as np

# --- ROLLING AVERAGES ---
# Essential for tracking team form / momentum
# Sort by team and date, then compute rolling stats within each team
def add_rolling_stats(df, stat_col, windows=[3, 5, 10], group_col='team'):
    """Add rolling averages for a statistic, grouped by team."""
    df = df.sort_values([group_col, 'game_date']).copy()
    for w in windows:
        col_name = f'{stat_col}_rolling_{w}'
        df[col_name] = (df.groupby(group_col)[stat_col]
                        .transform(lambda x: x.rolling(w, min_periods=1).mean()))
    return df

# --- SHIFT FOR LAG FEATURES ---
# Critical: you must use LAGGED data to avoid lookahead bias
# If you're predicting game N, you can only use data from games 1 through N-1
def add_lagged_features(df, stat_cols, group_col='team'):
    """Create lagged versions of columns (previous game's value)."""
    df = df.sort_values([group_col, 'game_date']).copy()
    for col in stat_cols:
        df[f'{col}_prev'] = df.groupby(group_col)[col].shift(1)
    return df

# --- PIVOT TABLES ---
# Useful for creating team-vs-team matrices or weekly summaries
# Example: create a matrix of average point differential by matchup
def create_matchup_matrix(df):
    """Create a pivot table of average margins by home/away matchup."""
    df = df.copy()
    df['margin'] = df['home_score'] - df['away_score']
    pivot = df.pivot_table(
        values='margin',
        index='home_team',
        columns='away_team',
        aggfunc='mean'
    )
    return pivot

# --- BINNING CONTINUOUS VARIABLES ---
# Useful for analyzing performance by spread range
def analyze_by_spread_bin(df):
    """Break down cover rates by spread magnitude."""
    df = df.copy()
    df['spread_bin'] = pd.cut(
        df['spread_line'].abs(),
        bins=[0, 3, 6, 10, 20],
        labels=['1-3', '3.5-6', '6.5-10', '10+']
    )
    return df.groupby('spread_bin')['home_covered'].agg(['mean', 'count'])

# --- APPLYING FUNCTIONS ROW-WISE ---
# Sometimes you need custom logic applied to each row
def classify_game_result(row):
    """Classify a game result relative to the spread."""
    margin = row['home_score'] - row['away_score']
    spread = row['spread_line']
    result = margin + spread  # Positive = home covers

    if result > 0:
        return 'home_cover'
    elif result < 0:
        return 'away_cover'
    else:
        return 'push'

# df['result_vs_spread'] = df.apply(classify_game_result, axis=1)
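
To show the shape of the output, here is a small illustrative call against a hypothetical team-game DataFrame (the teams, dates, and point totals below are made up):

# Tiny made-up dataset: four games each for two teams
team_games = pd.DataFrame({
    'team': ['KC'] * 4 + ['BUF'] * 4,
    'game_date': pd.to_datetime(['2024-09-05', '2024-09-15',
                                 '2024-09-22', '2024-09-29'] * 2),
    'points_scored': [27, 26, 17, 28, 34, 31, 10, 35]
})

team_games = add_rolling_stats(team_games, 'points_scored', windows=[3], group_col='team')
team_games = add_lagged_features(team_games, ['points_scored'], group_col='team')
print(team_games[['team', 'game_date', 'points_scored',
                  'points_scored_rolling_3', 'points_scored_prev']])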

5.3 Data Cleaning and Preparation

Raw sports data is messy. Every data source has its quirks, inconsistencies, and gaps. The difference between amateur and professional data analysis often comes down to the rigor of the cleaning process. A model trained on dirty data will produce dirty predictions.

This section walks through the most common data quality issues in sports datasets and provides systematic approaches for handling them.

Handling Missing Data in Sports Contexts

Missing data in sports is not random. A player's stats might be missing because they were injured, suspended, or rested. A game's weather data might be missing because it was played indoors. Understanding why data is missing determines how you should handle it.

Types of missing data in sports:

  • Did Not Play (DNP): A player was on the roster but did not enter the game. Their stats are legitimately zero for counting stats (points, rebounds, assists) but should be treated as missing for rate stats (shooting percentage, yards per attempt).
  • Injury/Suspension: The player was unavailable. For team-level models, you may want to create a feature indicating that a key player was absent rather than imputing their statistics.
  • Weather data gaps: Indoor games have no weather data. This is not missing data in the traditional sense -- it is a categorical distinction. Create an "indoor" flag rather than imputing wind speed.
  • Historical data limitations: Older seasons may lack advanced metrics that exist for recent seasons. Do not impute these; instead, restrict your analysis to seasons where the data exists.

import pandas as pd
import numpy as np

def audit_missing_data(df, threshold=0.05):
    """
    Audit a DataFrame for missing values.

    Parameters:
        df: Input DataFrame
        threshold: Flag columns with more than this fraction missing

    Returns:
        DataFrame summarizing missing data by column
    """
    missing = pd.DataFrame({
        'total_missing': df.isnull().sum(),
        'pct_missing': df.isnull().mean(),
        'dtype': df.dtypes
    })
    missing = missing[missing['total_missing'] > 0].sort_values(
        'pct_missing', ascending=False
    )

    flagged = missing[missing['pct_missing'] > threshold]
    if len(flagged) > 0:
        print(f"WARNING: {len(flagged)} columns exceed {threshold:.0%} missing:")
        for col, row in flagged.iterrows():
            print(f"  {col}: {row['pct_missing']:.1%} missing ({row['total_missing']:,} rows)")

    return missing


def clean_player_game_logs(df):
    """
    Clean player game log data with sports-aware missing value handling.

    Handles DNPs, injuries, and rate statistics appropriately.
    """
    df = df.copy()

    # Step 1: Identify DNP rows
    # Players who were inactive or did not play should be flagged, not dropped
    if 'minutes_played' in df.columns:
        df['did_not_play'] = df['minutes_played'].isna() | (df['minutes_played'] == 0)

        # For counting stats, DNP means 0 (they contributed nothing)
        counting_stats = ['points', 'rebounds', 'assists', 'steals', 'blocks',
                          'turnovers']
        for col in counting_stats:
            if col in df.columns:
                df.loc[df['did_not_play'], col] = 0

        # For rate stats, DNP means NaN (undefined, not zero)
        rate_stats = ['fg_pct', 'ft_pct', 'three_pt_pct', 'ts_pct']
        for col in rate_stats:
            if col in df.columns:
                df.loc[df['did_not_play'], col] = np.nan

    # Step 2: Handle numeric columns with reasonable defaults
    # Only fill with median for columns where it makes sense
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        pct_missing = df[col].isna().mean()
        if 0 < pct_missing < 0.1:  # Less than 10% missing
            df[col] = df[col].fillna(df[col].median())
        elif pct_missing >= 0.1:
            print(f"  Warning: {col} has {pct_missing:.1%} missing -- not auto-filling")

    return df

Dealing with Team Name Changes and Relocations

This is one of the most frustrating data quality issues in sports analytics. Teams change names, relocate cities, and rebrand -- but they retain continuity of the franchise. Your data must handle this correctly.

NFL examples:

  • Washington: Redskins (through 2019) -> Football Team (2020-2021) -> Commanders (2022+)
  • Oakland Raiders -> Las Vegas Raiders (2020)
  • San Diego Chargers -> Los Angeles Chargers (2017)
  • St. Louis Rams -> Los Angeles Rams (2016)

NBA examples:

  • Charlotte Bobcats -> Charlotte Hornets (2014)
  • New Jersey Nets -> Brooklyn Nets (2012)
  • Seattle SuperSonics -> Oklahoma City Thunder (2008)

import pandas as pd

# Comprehensive team name mapping for NFL
NFL_TEAM_STANDARDIZATION = {
    # Current abbreviations are the standard
    # Map common variations to the standard
    'WSH': 'WAS', 'Washington': 'WAS', 'Commanders': 'WAS',
    'Football Team': 'WAS', 'Redskins': 'WAS',
    'OAK': 'LV', 'Oakland': 'LV', 'Raiders': 'LV',
    'Las Vegas': 'LV',
    'SD': 'LAC', 'San Diego': 'LAC', 'Chargers': 'LAC',
    'STL': 'LAR', 'St. Louis': 'LAR', 'LA Rams': 'LAR',
    # Some sources use JAX, others JAC
    'JAC': 'JAX', 'Jacksonville': 'JAX',
}

def standardize_team_names(df, team_columns, mapping=None):
    """
    Standardize team names/abbreviations across a DataFrame.

    Parameters:
        df: Input DataFrame
        team_columns: List of column names containing team identifiers
        mapping: Dictionary mapping variant names to standard names

    Returns:
        DataFrame with standardized team names
    """
    if mapping is None:
        mapping = NFL_TEAM_STANDARDIZATION

    df = df.copy()
    for col in team_columns:
        df[col] = df[col].replace(mapping)

    return df


def validate_team_names(df, team_column, expected_teams):
    """Check for unexpected team names that may indicate data issues."""
    unique_teams = set(df[team_column].unique())
    unexpected = unique_teams - set(expected_teams)
    missing = set(expected_teams) - unique_teams

    if unexpected:
        print(f"WARNING: Unexpected team names in '{team_column}': {unexpected}")
    if missing:
        print(f"INFO: Teams not found in '{team_column}': {missing}")

    return unexpected, missing

# Expected current NFL teams (32 teams)
NFL_TEAMS = [
    'ARI', 'ATL', 'BAL', 'BUF', 'CAR', 'CHI', 'CIN', 'CLE',
    'DAL', 'DEN', 'DET', 'GB', 'HOU', 'IND', 'JAX', 'KC',
    'LAC', 'LAR', 'LV', 'MIA', 'MIN', 'NE', 'NO', 'NYG',
    'NYJ', 'PHI', 'PIT', 'SEA', 'SF', 'TB', 'TEN', 'WAS'
]
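
A typical call sequence, assuming a games DataFrame with home_team and away_team columns:

# Example usage:
# games = standardize_team_names(games, ['home_team', 'away_team'])
# validate_team_names(games, 'home_team', NFL_TEAMS)
# validate_team_names(games, 'away_team', NFL_TEAMS)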

Standardizing Date and Time Formats

Date handling causes more subtle bugs than almost any other data issue. Time zones are particularly treacherous: an NFL game that kicks off at 1:00 PM Eastern is stored as 1:00 PM in some sources, 10:00 AM Pacific in others, and 17:00 or 18:00 UTC (depending on daylight saving time) in yet others. If you are merging data from multiple sources, mismatched date formats will cause silent join failures.

import pandas as pd
from datetime import datetime

def standardize_dates(df, date_columns, target_format='%Y-%m-%d'):
    """
    Parse and standardize date columns to a consistent format.

    Handles multiple input formats automatically.
    """
    df = df.copy()

    # Common date formats in sports data
    date_formats = [
        '%Y-%m-%d',          # 2024-09-08
        '%m/%d/%Y',          # 09/08/2024
        '%m-%d-%Y',          # 09-08-2024
        '%B %d, %Y',        # September 8, 2024
        '%b %d, %Y',        # Sep 8, 2024
        '%Y%m%d',            # 20240908
        '%d/%m/%Y',          # 08/09/2024 (European format)
    ]

    for col in date_columns:
        if col not in df.columns:
            continue

        # Try pandas' flexible parser first
        # (infer_datetime_format is deprecated in recent pandas; the default parser suffices)
        try:
            df[col] = pd.to_datetime(df[col])
        except (ValueError, TypeError):
            # Fall back to trying each format
            for fmt in date_formats:
                try:
                    df[col] = pd.to_datetime(df[col], format=fmt)
                    break
                except (ValueError, TypeError):
                    continue

        # Keep the column as datetime64 for downstream work; apply
        # .dt.strftime(target_format) only if string output is explicitly needed
        if target_format and pd.api.types.is_datetime64_any_dtype(df[col]):
            print(f"  Parsed '{col}' successfully as datetime64")

    return df


def add_temporal_features(df, date_column='game_date'):
    """
    Extract useful temporal features from a date column.

    These features can be valuable for modeling seasonal patterns.
    """
    df = df.copy()
    dt = pd.to_datetime(df[date_column])

    df['day_of_week'] = dt.dt.day_name()
    df['month'] = dt.dt.month
    df['is_weekend'] = dt.dt.dayofweek >= 5
    df['days_since_season_start'] = (dt - dt.min()).dt.days

    return df
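
The functions above handle formats but not time zones. When a source stores kickoff times in UTC, a sketch like the following converts everything to one reference zone before merging; the commence_time column name simply mirrors the odds API field used earlier, and the timestamps are made up.

import pandas as pd

# Hypothetical kickoff timestamps stored in UTC (ISO 8601 strings)
kickoffs = pd.DataFrame({
    'game_id': ['game_1', 'game_2'],
    'commence_time': ['2024-09-08 17:00:00+00:00', '2024-12-15 18:00:00+00:00']
})

kickoffs['commence_time'] = pd.to_datetime(kickoffs['commence_time'], utc=True)
# Convert to US Eastern so every source shares one reference zone
kickoffs['kickoff_eastern'] = kickoffs['commence_time'].dt.tz_convert('America/New_York')
kickoffs['kickoff_date'] = kickoffs['kickoff_eastern'].dt.date  # Local date for joins
print(kickoffs)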

Outlier Detection and Handling

Sports data contains legitimate extreme values that look like outliers. A team scoring 70 points in an NFL game is extraordinarily rare but has happened. A pitcher throwing a perfect game is an outlier but a real event. Blindly removing outliers will bias your analysis. Instead, you need context-aware outlier detection.

import pandas as pd
import numpy as np

def detect_outliers_iqr(series, multiplier=1.5):
    """
    Detect outliers using the Interquartile Range method.

    Returns a boolean Series where True indicates an outlier.
    """
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    return (series < lower_bound) | (series > upper_bound)


def detect_outliers_zscore(series, threshold=3.0):
    """
    Detect outliers using Z-score method.

    Returns a boolean Series where True indicates an outlier.
    """
    z_scores = np.abs((series - series.mean()) / series.std())
    return z_scores > threshold


def handle_sports_outliers(df, score_cols, method='flag'):
    """
    Handle outliers in sports data with context awareness.

    Parameters:
        df: Input DataFrame
        score_cols: Columns to check for outliers
        method: 'flag' (add indicator column), 'cap' (winsorize), or 'drop'

    Returns:
        DataFrame with outliers handled
    """
    df = df.copy()

    for col in score_cols:
        if col not in df.columns:
            continue

        outlier_mask = detect_outliers_iqr(df[col])
        n_outliers = outlier_mask.sum()

        if n_outliers > 0:
            print(f"  {col}: {n_outliers} outliers detected "
                  f"({n_outliers / len(df):.1%})")

            if method == 'flag':
                df[f'{col}_is_outlier'] = outlier_mask
            elif method == 'cap':
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower = Q1 - 1.5 * IQR
                upper = Q3 + 1.5 * IQR
                df[col] = df[col].clip(lower=lower, upper=upper)
            elif method == 'drop':
                df = df[~outlier_mask]

    return df
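
For betting data, flagging is usually the safest default: extreme scores are real events you may want to study rather than erase. A hypothetical call:

# Example usage: flag rather than drop extreme scoring games
# games = handle_sports_outliers(games, ['home_score', 'away_score', 'total_points'],
#                                method='flag')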

Creating Derived Features

Raw statistics rarely tell the full story. Derived features -- new variables computed from existing ones -- often have more predictive power than the raw inputs. This process is called feature engineering, and it is where domain knowledge about sports meets data science.

import pandas as pd
import numpy as np

def engineer_nfl_team_features(games_df):
    """
    Create derived features for NFL team-level game data.

    This function demonstrates common feature engineering patterns
    for sports betting models.
    """
    df = games_df.sort_values(['home_team', 'game_date']).copy()

    # --- MARGIN AND TOTAL ---
    df['margin'] = df['home_score'] - df['away_score']
    df['total_points'] = df['home_score'] + df['away_score']

    # --- PERFORMANCE VS. EXPECTATION ---
    # How did the team perform relative to the market's expectation?
    df['margin_vs_spread'] = df['margin'] - (-df['spread_line'])
    df['total_vs_line'] = df['total_points'] - df['total_line']

    # --- REST DAYS ---
    # Teams with more rest tend to perform better
    # Note: grouping by home_team means this measures days since the team's
    # previous HOME game, not true rest; a full version would stack home and
    # away rows first (see the home/away splits restructuring in Section 5.4)
    df['game_date_dt'] = pd.to_datetime(df['game_date'])
    df['home_rest_days'] = (
        df.groupby('home_team')['game_date_dt']
        .diff()
        .dt.days
    )

    # --- ROLLING PERFORMANCE METRICS ---
    # Team's recent form over last 3 and 5 games
    for window in [3, 5]:
        # Rolling average points scored (home games only in this simple version)
        df[f'home_pts_rolling_{window}'] = (
            df.groupby('home_team')['home_score']
            .transform(lambda x: x.shift(1).rolling(window, min_periods=1).mean())
        )
        # Rolling average margin
        df[f'margin_rolling_{window}'] = (
            df.groupby('home_team')['margin']
            .transform(lambda x: x.shift(1).rolling(window, min_periods=1).mean())
        )

    # --- STREAK FEATURES ---
    # Current winning/losing streak
    df['home_win'] = (df['margin'] > 0).astype(int)
    df['win_streak'] = (
        df.groupby('home_team')['home_win']
        .transform(lambda x: x.shift(1)
                   .groupby((x.shift(1) != x.shift(1).shift(1)).cumsum())
                   .cumcount() + 1)
    )

    # --- SEASON CONTEXT ---
    df['season_game_number'] = df.groupby(['home_team', 'season']).cumcount() + 1
    df['is_divisional'] = False  # Would need division data to populate

    return df

A Complete Data Cleaning Pipeline

Let us bring all of these techniques together into a single, reusable pipeline.

import pandas as pd
import numpy as np
from datetime import datetime


class SportsDataPipeline:
    """
    A reusable data cleaning pipeline for sports betting data.

    Usage:
        pipeline = SportsDataPipeline(sport='nfl')
        clean_data = pipeline.run(raw_data)
    """

    def __init__(self, sport='nfl', verbose=True):
        self.sport = sport
        self.verbose = verbose
        self.cleaning_log = []

    def log(self, message):
        """Log a cleaning step."""
        self.cleaning_log.append(f"[{datetime.now():%H:%M:%S}] {message}")
        if self.verbose:
            print(f"  {message}")

    def run(self, df):
        """Execute the full cleaning pipeline."""
        print(f"Starting cleaning pipeline for {self.sport.upper()} data")
        print(f"Input shape: {df.shape}")

        df = self.remove_duplicates(df)
        df = self.standardize_teams(df)
        df = self.standardize_dates(df)
        df = self.handle_missing_values(df)
        df = self.validate_scores(df)
        df = self.add_derived_columns(df)

        print(f"Output shape: {df.shape}")
        print(f"Cleaning steps completed: {len(self.cleaning_log)}")
        return df

    def remove_duplicates(self, df):
        """Remove duplicate rows."""
        n_before = len(df)
        df = df.drop_duplicates()
        n_removed = n_before - len(df)
        if n_removed > 0:
            self.log(f"Removed {n_removed} duplicate rows")
        return df

    def standardize_teams(self, df):
        """Standardize team name columns."""
        team_cols = [c for c in df.columns if 'team' in c.lower()]
        if team_cols:
            df = standardize_team_names(df, team_cols)
            self.log(f"Standardized team names in columns: {team_cols}")
        return df

    def standardize_dates(self, df):
        """Parse and standardize date columns."""
        date_cols = [c for c in df.columns
                     if any(d in c.lower() for d in ['date', 'time', 'dt'])]
        if date_cols:
            for col in date_cols:
                try:
                    df[col] = pd.to_datetime(df[col])
                    self.log(f"Parsed '{col}' as datetime")
                except (ValueError, TypeError):
                    self.log(f"Could not parse '{col}' as datetime")
        return df

    def handle_missing_values(self, df):
        """Handle missing values with appropriate strategies."""
        missing_report = df.isnull().sum()
        missing_cols = missing_report[missing_report > 0]

        for col, count in missing_cols.items():
            pct = count / len(df)
            if pct > 0.5:
                self.log(f"WARNING: '{col}' is {pct:.0%} missing -- consider dropping")
            elif pct > 0.1:
                self.log(f"'{col}' is {pct:.0%} missing -- flagging for review")

        return df

    def validate_scores(self, df):
        """Validate that scores are reasonable."""
        score_cols = [c for c in df.columns if 'score' in c.lower()]
        for col in score_cols:
            if df[col].dtype in ['int64', 'float64']:
                negatives = (df[col] < 0).sum()
                if negatives > 0:
                    self.log(f"WARNING: {negatives} negative values in '{col}'")
                    df = df[df[col] >= 0]
        return df

    def add_derived_columns(self, df):
        """Add commonly needed derived columns."""
        if 'home_score' in df.columns and 'away_score' in df.columns:
            df['margin'] = df['home_score'] - df['away_score']
            df['total_points'] = df['home_score'] + df['away_score']
            self.log("Added 'margin' and 'total_points' columns")

        if 'spread_line' in df.columns and 'margin' in df.columns:
            df['home_covered'] = df['margin'] > -df['spread_line']
            self.log("Added 'home_covered' column")

        return df


# Usage:
# pipeline = SportsDataPipeline(sport='nfl')
# clean_df = pipeline.run(raw_df)

5.4 Exploratory Data Analysis for Sports

Before building any predictive model, you must understand your data. Exploratory Data Analysis (EDA) is the process of summarizing datasets, discovering patterns, and identifying anomalies through statistics and visualization. For sports betting, EDA answers questions like: How are point spreads distributed? Is there a home-field advantage trend? Are totals predictable from simple statistics?

EDA is not a luxury or a preliminary step to rush through. It is where you develop the intuition that shapes your models. Patterns you notice during EDA become hypotheses you can test. Anomalies you spot during EDA become data quality issues you need to resolve.

Summary Statistics That Matter for Betting

Not all summary statistics are equally useful for betting analysis. The statistics that matter most are those related to margins, consistency, and market accuracy.

import nfl_data_py as nfl
import pandas as pd
import numpy as np

# Load and prepare data
schedule = nfl.import_schedules([2021, 2022, 2023, 2024])
games = schedule.dropna(subset=['home_score', 'away_score']).copy()
games = games[games['week'] <= 18]  # Regular season only

# Add key derived columns
games['margin'] = games['home_score'] - games['away_score']
games['total_points'] = games['home_score'] + games['away_score']
games['home_covered'] = games['margin'] > -games['spread_line']
games['over_hit'] = games['total_points'] > games['total_line']
games['spread_error'] = games['margin'] + games['spread_line']
games['total_error'] = games['total_points'] - games['total_line']
games['abs_margin'] = games['margin'].abs()

# --- KEY SUMMARY STATISTICS ---

# 1. Home field advantage
print("=" * 60)
print("HOME FIELD ADVANTAGE BY SEASON")
print("=" * 60)
home_adv = games.groupby('season').agg(
    avg_margin=('margin', 'mean'),
    home_win_pct=('margin', lambda x: (x > 0).mean()),
    home_cover_pct=('home_covered', 'mean'),
    games=('margin', 'count')
).round(3)
print(home_adv)
print(f"\nOverall home margin: {games['margin'].mean():.2f} points")
print(f"Overall home win%: {(games['margin'] > 0).mean():.1%}")

# 2. Market accuracy
print("\n" + "=" * 60)
print("MARKET ACCURACY (SPREAD)")
print("=" * 60)
print(f"Mean spread error: {games['spread_error'].mean():.2f}")
print(f"Median spread error: {games['spread_error'].median():.2f}")
print(f"Std dev of spread error: {games['spread_error'].std():.2f}")
print(f"MAE of spread: {games['spread_error'].abs().mean():.2f}")
print(f"Home cover rate: {games['home_covered'].mean():.3f}")
print(f"  (0.500 = perfectly calibrated)")

# 3. Totals accuracy
print("\n" + "=" * 60)
print("MARKET ACCURACY (TOTALS)")
print("=" * 60)
print(f"Mean total error: {games['total_error'].mean():.2f}")
print(f"Std dev of total error: {games['total_error'].std():.2f}")
print(f"MAE of total: {games['total_error'].abs().mean():.2f}")
print(f"Over rate: {games['over_hit'].mean():.3f}")

# 4. Score distribution characteristics
print("\n" + "=" * 60)
print("SCORE DISTRIBUTIONS")
print("=" * 60)
for col, label in [('margin', 'Home Margin'), ('total_points', 'Total Points')]:
    print(f"\n{label}:")
    print(f"  Mean: {games[col].mean():.1f}")
    print(f"  Std Dev: {games[col].std():.1f}")
    print(f"  Skewness: {games[col].skew():.3f}")
    print(f"  Kurtosis: {games[col].kurtosis():.3f}")
    print(f"  5th percentile: {games[col].quantile(0.05):.0f}")
    print(f"  95th percentile: {games[col].quantile(0.95):.0f}")

Visualization is where patterns leap off the screen. The following code block demonstrates the core visualizations every sports bettor should produce when exploring a new dataset.

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
import pandas as pd
import numpy as np

# Set a clean style for all plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind')

def create_betting_eda_plots(games, save_prefix='nfl_eda'):
    """
    Create a comprehensive set of EDA visualizations for betting data.

    Parameters:
        games: DataFrame with columns margin, total_points, spread_line,
               total_line, spread_error, total_error, season, week
        save_prefix: Prefix for saved figure filenames
    """

    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle('NFL Betting Data: Exploratory Analysis', fontsize=16, y=1.02)

    # --- Plot 1: Distribution of Margins ---
    ax = axes[0, 0]
    ax.hist(games['margin'], bins=range(-40, 41, 1), edgecolor='black',
            linewidth=0.5, alpha=0.7, color='steelblue')
    ax.axvline(x=0, color='red', linestyle='--', linewidth=1.5, label='Pick em')
    ax.axvline(x=games['margin'].mean(), color='orange', linestyle='-',
               linewidth=1.5, label=f"Mean: {games['margin'].mean():.1f}")
    ax.set_xlabel('Home Team Margin')
    ax.set_ylabel('Frequency')
    ax.set_title('Distribution of Game Margins')
    ax.legend(fontsize=9)

    # Note the key numbers: margins of 3, 7, 10 are elevated due to
    # football's scoring structure (FG=3, TD+XP=7)

    # --- Plot 2: Distribution of Totals ---
    ax = axes[0, 1]
    ax.hist(games['total_points'], bins=range(10, 80, 2), edgecolor='black',
            linewidth=0.5, alpha=0.7, color='coral')
    ax.axvline(x=games['total_points'].mean(), color='blue', linestyle='-',
               linewidth=1.5, label=f"Mean: {games['total_points'].mean():.1f}")
    ax.set_xlabel('Total Points Scored')
    ax.set_ylabel('Frequency')
    ax.set_title('Distribution of Game Totals')
    ax.legend(fontsize=9)

    # --- Plot 3: Spread Error Distribution ---
    ax = axes[0, 2]
    ax.hist(games['spread_error'], bins=range(-40, 41, 1), edgecolor='black',
            linewidth=0.5, alpha=0.7, color='mediumseagreen')
    ax.axvline(x=0, color='red', linestyle='--', linewidth=1.5)
    ax.set_xlabel('Actual Margin - (-Spread)')
    ax.set_ylabel('Frequency')
    ax.set_title('Spread Prediction Error\n(0 = perfect prediction)')

    # --- Plot 4: Home Cover Rate by Week ---
    ax = axes[1, 0]
    weekly_cover = games.groupby('week')['home_covered'].mean()
    ax.bar(weekly_cover.index, weekly_cover.values, color='steelblue',
           edgecolor='black', linewidth=0.5)
    ax.axhline(y=0.5, color='red', linestyle='--', linewidth=1.5,
               label='50% (no bias)')
    ax.set_xlabel('Week')
    ax.set_ylabel('Home Cover Rate')
    ax.set_title('Home Cover Rate by Week')
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
    ax.legend(fontsize=9)

    # --- Plot 5: Scoring Trends by Season ---
    ax = axes[1, 1]
    seasonal = games.groupby('season').agg(
        avg_total=('total_points', 'mean'),
        avg_margin=('margin', lambda x: x.abs().mean())
    )
    ax.plot(seasonal.index, seasonal['avg_total'], 'o-', color='coral',
            linewidth=2, markersize=8, label='Avg Total')
    ax.set_xlabel('Season')
    ax.set_ylabel('Points')
    ax.set_title('Scoring Trends Over Time')
    ax.legend(fontsize=9)

    # --- Plot 6: Spread vs. Actual Margin Scatter ---
    ax = axes[1, 2]
    ax.scatter(-games['spread_line'], games['margin'], alpha=0.15,
               s=15, color='purple')
    # Add perfect prediction line
    spread_range = np.linspace(-20, 20, 100)
    ax.plot(spread_range, spread_range, 'r--', linewidth=2,
            label='Perfect prediction')
    ax.set_xlabel('Predicted Margin (-Spread)')
    ax.set_ylabel('Actual Margin')
    ax.set_title('Predicted vs. Actual Margins')
    ax.legend(fontsize=9)
    ax.set_xlim(-25, 25)
    ax.set_ylim(-45, 45)

    plt.tight_layout()
    plt.savefig(f'{save_prefix}_overview.png', dpi=150, bbox_inches='tight')
    plt.show()

    print("Plots saved successfully.")


# Additional focused visualization: key number analysis for NFL
def plot_margin_key_numbers(games):
    """
    Analyze the frequency of key numbers in NFL margins.

    Key numbers (3, 7, 10) are critical for spread betting because
    they represent common scoring differentials in football.
    """
    fig, ax = plt.subplots(figsize=(14, 6))

    margin_counts = games['margin'].abs().value_counts().sort_index()
    margin_counts = margin_counts[margin_counts.index <= 30]

    colors = ['gold' if m in [3, 7, 10, 6, 4, 14, 17] else 'steelblue'
              for m in margin_counts.index]

    ax.bar(margin_counts.index, margin_counts.values, color=colors,
           edgecolor='black', linewidth=0.5)

    # Annotate key numbers
    for key_num in [3, 7, 10]:
        if key_num in margin_counts.index:
            count = margin_counts[key_num]
            pct = count / len(games)
            ax.annotate(f'{key_num}\n({pct:.1%})',
                       xy=(key_num, count),
                       xytext=(key_num, count + 15),
                       ha='center', fontsize=10, fontweight='bold',
                       color='darkred')

    ax.set_xlabel('Absolute Margin', fontsize=12)
    ax.set_ylabel('Number of Games', fontsize=12)
    ax.set_title('NFL Game Margins: Key Number Analysis\n'
                 '(Gold bars = key numbers in football)', fontsize=14)
    plt.tight_layout()
    plt.savefig('nfl_key_numbers.png', dpi=150, bbox_inches='tight')
    plt.show()


# Run the visualizations:
# create_betting_eda_plots(games)
# plot_margin_key_numbers(games)

Home/Away Splits Analysis

Home-field advantage is one of the most studied phenomena in sports. Understanding its magnitude, how it varies across teams and situations, and whether it has been declining over time is directly relevant to spread betting.

import pandas as pd
import numpy as np

def analyze_home_away_splits(games, min_games=10):
    """
    Comprehensive home/away splits analysis.

    Returns team-level home/away performance differentials.
    """
    # Restructure data so each row is a team-game
    # (each original game produces two rows: one home, one away)
    home = games[['season', 'week', 'home_team', 'home_score',
                   'away_score', 'spread_line']].copy()
    home.columns = ['season', 'week', 'team', 'pts_for', 'pts_against', 'spread']
    home['location'] = 'home'
    # spread_line is already stated from the home team's perspective
    # (negative = home favorite), so no sign flip is needed here

    away = games[['season', 'week', 'away_team', 'away_score',
                   'home_score', 'spread_line']].copy()
    away.columns = ['season', 'week', 'team', 'pts_for', 'pts_against', 'spread']
    away['location'] = 'away'
    away['spread'] = -away['spread']  # Flip to the away team's perspective

    team_games = pd.concat([home, away], ignore_index=True)
    team_games['margin'] = team_games['pts_for'] - team_games['pts_against']
    team_games['won'] = team_games['margin'] > 0
    team_games['covered'] = team_games['margin'] > -team_games['spread']

    # Team-level home vs away comparison
    splits = team_games.groupby(['team', 'location']).agg(
        games=('margin', 'count'),
        avg_pts_for=('pts_for', 'mean'),
        avg_pts_against=('pts_against', 'mean'),
        avg_margin=('margin', 'mean'),
        win_pct=('won', 'mean'),
        cover_pct=('covered', 'mean')
    ).round(2)

    # Compute home-away differential for each team
    splits_wide = splits.unstack('location')
    home_advantage = pd.DataFrame({
        'home_margin': splits_wide[('avg_margin', 'home')],
        'away_margin': splits_wide[('avg_margin', 'away')],
        'hfa_margin': (splits_wide[('avg_margin', 'home')]
                       - splits_wide[('avg_margin', 'away')]),
        'home_win_pct': splits_wide[('win_pct', 'home')],
        'away_win_pct': splits_wide[('win_pct', 'away')],
        'hfa_win_pct': (splits_wide[('win_pct', 'home')]
                        - splits_wide[('win_pct', 'away')])
    }).round(3)

    home_advantage = home_advantage.sort_values('hfa_margin', ascending=False)

    print("HOME FIELD ADVANTAGE BY TEAM")
    print("(Ordered by margin differential)")
    print(home_advantage)

    print(f"\nLeague-wide HFA (margin): "
          f"{home_advantage['hfa_margin'].mean():.2f} points")
    print(f"League-wide HFA (win%): "
          f"{home_advantage['hfa_win_pct'].mean():.1%}")

    return home_advantage

# home_splits = analyze_home_away_splits(games)

Correlation Analysis

Correlation analysis reveals which statistics are most strongly associated with winning, covering spreads, and hitting overs. These correlations guide feature selection for predictive models.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def betting_correlation_analysis(games):
    """
    Analyze correlations between game statistics and betting outcomes.
    """
    # Select numeric columns relevant to betting
    analysis_cols = [
        'home_score', 'away_score', 'margin', 'total_points',
        'spread_line', 'total_line', 'spread_error', 'total_error'
    ]

    # Filter to columns that exist in the DataFrame
    available_cols = [c for c in analysis_cols if c in games.columns]
    corr_matrix = games[available_cols].corr()

    # Plot correlation heatmap
    fig, ax = plt.subplots(figsize=(10, 8))
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # Upper triangle mask
    sns.heatmap(
        corr_matrix,
        mask=mask,
        annot=True,
        fmt='.2f',
        cmap='RdBu_r',
        center=0,
        vmin=-1, vmax=1,
        square=True,
        linewidths=1,
        ax=ax
    )
    ax.set_title('Correlation Matrix: Game Statistics and Betting Lines',
                 fontsize=14)
    plt.tight_layout()
    plt.savefig('correlation_matrix.png', dpi=150, bbox_inches='tight')
    plt.show()

    # Key correlations to examine
    print("KEY CORRELATIONS")
    print("=" * 50)
    print(f"Spread vs. Actual Margin: "
          f"{games['spread_line'].corr(-games['margin']):.3f}")
    print(f"Total Line vs. Actual Total: "
          f"{games['total_line'].corr(games['total_points']):.3f}")

    # Which basic stats correlate with covering the spread?
    if 'home_covered' in games.columns:
        numeric_cols = games.select_dtypes(include=[np.number]).columns
        cover_corrs = games[numeric_cols].corrwith(
            games['home_covered'].astype(float)
        ).sort_values(ascending=False)
        print(f"\nCorrelations with Home Covering the Spread:")
        print(cover_corrs.head(10))

    return corr_matrix


# corr = betting_correlation_analysis(games)

A crucial insight from correlation analysis: the spread-to-margin correlation in the NFL is typically around 0.40-0.45. This tells you that the market explains roughly 16-20% of the variance in game outcomes (r-squared). The remaining 80%+ is noise, weather, injuries, randomness, and -- potentially -- information your model can capture that the market missed. This framing sets realistic expectations: even a very good model will only marginally improve on the market.
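
Expressed in code, using the games DataFrame prepared earlier in this section:

# Spread-to-margin correlation and the share of variance it explains
r = games['spread_line'].corr(-games['margin'])
print(f"Correlation: {r:.3f}")
print(f"Variance explained (r-squared): {r**2:.1%}")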


5.5 Building Your Personal Betting Database

Scattered CSV files and one-off scripts will eventually become unmanageable. As your betting analysis matures, you need a proper database: a central, structured repository for game data, odds data, your bets, and your results. This section shows you how to build one using SQLite and SQLAlchemy.

Why a Database?

A database solves several problems that accumulate as your data grows:

  1. Single source of truth. Instead of multiple CSV files with different cleaning states, you have one canonical dataset.
  2. Efficient querying. SQL lets you ask complex questions without loading entire datasets into memory.
  3. Referential integrity. Foreign keys ensure that every bet record references a valid game, and every game references valid teams.
  4. Incremental updates. You can add new data without reprocessing the entire dataset.
  5. Reproducibility. When you or someone else revisits your analysis months later, the data is all in one place with a clear schema.

SQLite for Personal Use

SQLite is the ideal database for individual bettors and researchers. It stores the entire database in a single file, requires no server setup, and is included with Python's standard library. For a personal betting database that will contain tens of thousands to low millions of rows, SQLite is more than sufficient.

If your data eventually grows beyond what SQLite handles comfortably (roughly 100GB or heavy concurrent write loads), you can migrate to PostgreSQL using the same SQLAlchemy code with minimal changes.
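
To make that point concrete, here is a minimal sketch of the swap: only the connection URL changes. The PostgreSQL credentials below are placeholders, and the psycopg2 driver is assumed to be installed.

from sqlalchemy import create_engine

# SQLite: a single local file, no server process required
sqlite_engine = create_engine('sqlite:///betting.db')

# PostgreSQL: the same SQLAlchemy models and queries work unchanged
postgres_engine = create_engine(
    'postgresql+psycopg2://bettor:secret@localhost:5432/betting'
)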

Schema Design

Good schema design reflects the structure of the domain. In sports betting, the key entities are: seasons, teams, games, odds, and bets. Here is a schema that balances normalization with practical convenience.

from sqlalchemy import (
    create_engine, Column, Integer, Float, String, Boolean,
    Date, DateTime, ForeignKey, UniqueConstraint, Index
)
from sqlalchemy.orm import declarative_base, relationship, sessionmaker
from datetime import datetime, date

Base = declarative_base()


class Team(Base):
    """Teams table: canonical team information."""
    __tablename__ = 'teams'

    id = Column(Integer, primary_key=True)
    abbreviation = Column(String(10), unique=True, nullable=False)
    full_name = Column(String(100), nullable=False)
    city = Column(String(50))
    conference = Column(String(10))
    division = Column(String(20))
    sport = Column(String(10), nullable=False, default='NFL')

    # Relationships
    home_games = relationship('Game', foreign_keys='Game.home_team_id',
                              back_populates='home_team')
    away_games = relationship('Game', foreign_keys='Game.away_team_id',
                              back_populates='away_team')

    def __repr__(self):
        return f"<Team({self.abbreviation}: {self.full_name})>"


class Game(Base):
    """Games table: one row per completed game."""
    __tablename__ = 'games'

    id = Column(Integer, primary_key=True)
    external_id = Column(String(50), unique=True)  # ID from data source
    sport = Column(String(10), nullable=False, default='NFL')
    season = Column(Integer, nullable=False)
    week = Column(Integer)
    game_date = Column(Date, nullable=False)
    game_time = Column(String(10))

    home_team_id = Column(Integer, ForeignKey('teams.id'), nullable=False)
    away_team_id = Column(Integer, ForeignKey('teams.id'), nullable=False)

    home_score = Column(Integer)
    away_score = Column(Integer)

    # Venue and conditions
    venue = Column(String(100))
    is_neutral_site = Column(Boolean, default=False)
    surface = Column(String(20))
    roof_type = Column(String(20))  # 'dome', 'open', 'retractable'
    temperature = Column(Float)     # Fahrenheit
    wind_speed = Column(Float)      # MPH
    weather_detail = Column(String(50))

    # Derived columns (computed on insert)
    home_margin = Column(Float)
    total_points = Column(Float)

    # Relationships
    home_team = relationship('Team', foreign_keys=[home_team_id],
                             back_populates='home_games')
    away_team = relationship('Team', foreign_keys=[away_team_id],
                             back_populates='away_games')
    odds = relationship('Odds', back_populates='game')
    bets = relationship('Bet', back_populates='game')

    # Indexes for common queries
    __table_args__ = (
        Index('idx_game_date', 'game_date'),
        Index('idx_season_week', 'season', 'week'),
        Index('idx_home_team', 'home_team_id'),
        Index('idx_away_team', 'away_team_id'),
    )

    def __repr__(self):
        return (f"<Game({self.season} W{self.week}: "
                f"Team {self.away_team_id} @ Team {self.home_team_id})>")


class Odds(Base):
    """Odds table: opening and closing lines from various books."""
    __tablename__ = 'odds'

    id = Column(Integer, primary_key=True)
    game_id = Column(Integer, ForeignKey('games.id'), nullable=False)
    bookmaker = Column(String(50), nullable=False)
    timestamp = Column(DateTime, default=datetime.utcnow)

    # Spread
    home_spread = Column(Float)
    home_spread_odds = Column(Integer)   # American odds (e.g., -110)
    away_spread = Column(Float)
    away_spread_odds = Column(Integer)

    # Total
    total_line = Column(Float)
    over_odds = Column(Integer)
    under_odds = Column(Integer)

    # Moneyline
    home_ml = Column(Integer)
    away_ml = Column(Integer)

    # Is this the closing line?
    is_closing = Column(Boolean, default=False)

    # Relationship
    game = relationship('Game', back_populates='odds')

    __table_args__ = (
        Index('idx_odds_game', 'game_id'),
        Index('idx_odds_book', 'bookmaker'),
        UniqueConstraint('game_id', 'bookmaker', 'is_closing',
                         name='uq_game_book_closing'),
    )


class Bet(Base):
    """Bets table: your personal betting record."""
    __tablename__ = 'bets'

    id = Column(Integer, primary_key=True)
    game_id = Column(Integer, ForeignKey('games.id'), nullable=False)
    placed_at = Column(DateTime, default=datetime.utcnow)

    # Bet details
    bet_type = Column(String(20), nullable=False)  # 'spread', 'total', 'ml', 'prop'
    selection = Column(String(100), nullable=False)  # 'home', 'away', 'over', 'under'
    line = Column(Float)              # The line you bet (e.g., -3.5)
    odds = Column(Integer)            # American odds you got (e.g., -110)
    stake = Column(Float, nullable=False)

    # Bookmaker
    bookmaker = Column(String(50))

    # Result (filled in after game completes)
    result = Column(String(10))       # 'win', 'loss', 'push'
    payout = Column(Float)            # Total return (0 for loss, stake for push)
    profit = Column(Float)            # Payout minus stake

    # Model info (which model suggested this bet?)
    model_name = Column(String(50))
    model_edge = Column(Float)        # Estimated edge when bet was placed
    model_probability = Column(Float) # Model's implied probability

    # Notes
    notes = Column(String(500))

    # Relationship
    game = relationship('Game', back_populates='bets')

    __table_args__ = (
        Index('idx_bet_game', 'game_id'),
        Index('idx_bet_date', 'placed_at'),
        Index('idx_bet_type', 'bet_type'),
    )


# --- DATABASE INITIALIZATION ---

def create_database(db_path='betting.db'):
    """Create the database and all tables."""
    engine = create_engine(f'sqlite:///{db_path}', echo=False)
    Base.metadata.create_all(engine)
    print(f"Database created at: {db_path}")
    print(f"Tables: {list(Base.metadata.tables.keys())}")
    return engine


def get_session(engine):
    """Create a database session."""
    Session = sessionmaker(bind=engine)
    return Session()

Populating the Database

Once the schema is defined, you need to populate it with data. Here is a workflow for loading NFL data into the database.

import nfl_data_py as nfl
import pandas as pd
from datetime import datetime

def populate_nfl_teams(session):
    """Insert all 32 NFL teams into the database."""
    nfl_teams_data = [
        ('ARI', 'Arizona Cardinals', 'Glendale', 'NFC', 'NFC West'),
        ('ATL', 'Atlanta Falcons', 'Atlanta', 'NFC', 'NFC South'),
        ('BAL', 'Baltimore Ravens', 'Baltimore', 'AFC', 'AFC North'),
        ('BUF', 'Buffalo Bills', 'Orchard Park', 'AFC', 'AFC East'),
        ('CAR', 'Carolina Panthers', 'Charlotte', 'NFC', 'NFC South'),
        ('CHI', 'Chicago Bears', 'Chicago', 'NFC', 'NFC North'),
        ('CIN', 'Cincinnati Bengals', 'Cincinnati', 'AFC', 'AFC North'),
        ('CLE', 'Cleveland Browns', 'Cleveland', 'AFC', 'AFC North'),
        ('DAL', 'Dallas Cowboys', 'Arlington', 'NFC', 'NFC East'),
        ('DEN', 'Denver Broncos', 'Denver', 'AFC', 'AFC West'),
        ('DET', 'Detroit Lions', 'Detroit', 'NFC', 'NFC North'),
        ('GB', 'Green Bay Packers', 'Green Bay', 'NFC', 'NFC North'),
        ('HOU', 'Houston Texans', 'Houston', 'AFC', 'AFC South'),
        ('IND', 'Indianapolis Colts', 'Indianapolis', 'AFC', 'AFC South'),
        ('JAX', 'Jacksonville Jaguars', 'Jacksonville', 'AFC', 'AFC South'),
        ('KC', 'Kansas City Chiefs', 'Kansas City', 'AFC', 'AFC West'),
        ('LAC', 'Los Angeles Chargers', 'Inglewood', 'AFC', 'AFC West'),
        ('LAR', 'Los Angeles Rams', 'Inglewood', 'NFC', 'NFC West'),
        ('LV', 'Las Vegas Raiders', 'Las Vegas', 'AFC', 'AFC West'),
        ('MIA', 'Miami Dolphins', 'Miami Gardens', 'AFC', 'AFC East'),
        ('MIN', 'Minnesota Vikings', 'Minneapolis', 'NFC', 'NFC North'),
        ('NE', 'New England Patriots', 'Foxborough', 'AFC', 'AFC East'),
        ('NO', 'New Orleans Saints', 'New Orleans', 'NFC', 'NFC South'),
        ('NYG', 'New York Giants', 'East Rutherford', 'NFC', 'NFC East'),
        ('NYJ', 'New York Jets', 'East Rutherford', 'AFC', 'AFC East'),
        ('PHI', 'Philadelphia Eagles', 'Philadelphia', 'NFC', 'NFC East'),
        ('PIT', 'Pittsburgh Steelers', 'Pittsburgh', 'AFC', 'AFC North'),
        ('SEA', 'Seattle Seahawks', 'Seattle', 'NFC', 'NFC West'),
        ('SF', 'San Francisco 49ers', 'Santa Clara', 'NFC', 'NFC West'),
        ('TB', 'Tampa Bay Buccaneers', 'Tampa', 'NFC', 'NFC South'),
        ('TEN', 'Tennessee Titans', 'Nashville', 'AFC', 'AFC South'),
        ('WAS', 'Washington Commanders', 'Landover', 'NFC', 'NFC East'),
    ]

    for abbr, name, city, conf, div in nfl_teams_data:
        # Skip teams already in the database so reruns stay idempotent
        # (merge() matches on primary key, which is unset here, so it would
        #  attempt duplicate inserts and trip the unique constraint)
        if session.query(Team).filter_by(abbreviation=abbr).first():
            continue
        team = Team(
            abbreviation=abbr,
            full_name=name,
            city=city,
            conference=conf,
            division=div,
            sport='NFL'
        )
        session.add(team)

    session.commit()
    print(f"Inserted or verified {len(nfl_teams_data)} NFL teams")


def populate_games_from_nfl_data(session, seasons):
    """
    Load NFL game data and insert into the database.

    Uses nfl_data_py to fetch schedule data with scores and spreads.
    """
    schedule = nfl.import_schedules(seasons)
    completed = schedule.dropna(subset=['home_score', 'away_score']).copy()

    # Build a team abbreviation -> id lookup
    teams = {t.abbreviation: t.id for t in session.query(Team).all()}

    inserted = 0
    skipped = 0

    for _, row in completed.iterrows():
        home_abbr = row['home_team']
        away_abbr = row['away_team']

        # Skip if teams not in our database
        if home_abbr not in teams or away_abbr not in teams:
            skipped += 1
            continue

        # Check for existing game (avoid duplicates)
        existing = session.query(Game).filter_by(
            external_id=row.get('game_id', None)
        ).first()

        if existing:
            skipped += 1
            continue

        game = Game(
            external_id=row.get('game_id'),
            sport='NFL',
            season=int(row['season']),
            week=int(row['week']) if pd.notna(row.get('week')) else None,
            game_date=pd.to_datetime(row['gameday']).date(),
            home_team_id=teams[home_abbr],
            away_team_id=teams[away_abbr],
            home_score=int(row['home_score']),
            away_score=int(row['away_score']),
            home_margin=int(row['home_score']) - int(row['away_score']),
            total_points=int(row['home_score']) + int(row['away_score']),
            roof_type=row.get('roof'),
            surface=row.get('surface'),
            temperature=row.get('temp') if pd.notna(row.get('temp')) else None,
            wind_speed=row.get('wind') if pd.notna(row.get('wind')) else None,
        )
        session.add(game)

        # Also add odds data if available
        if pd.notna(row.get('spread_line')):
            odds_record = Odds(
                game=game,
                bookmaker='consensus',
                home_spread=row['spread_line'],
                total_line=row.get('total_line'),
                home_ml=row.get('home_moneyline'),
                away_ml=row.get('away_moneyline'),
                is_closing=True
            )
            session.add(odds_record)

        inserted += 1

    session.commit()
    print(f"Inserted {inserted} games, skipped {skipped}")


# --- FULL SETUP WORKFLOW ---

def setup_betting_database(db_path='betting.db', seasons=None):
    """
    Complete workflow: create database, populate teams, load game data.

    Usage:
        engine = setup_betting_database(
            db_path='my_betting.db',
            seasons=[2021, 2022, 2023, 2024]
        )
    """
    if seasons is None:
        seasons = [2021, 2022, 2023, 2024]

    print("=" * 60)
    print("SETTING UP BETTING DATABASE")
    print("=" * 60)

    # Step 1: Create database and tables
    engine = create_database(db_path)
    session = get_session(engine)

    # Step 2: Populate teams
    print("\nPopulating teams...")
    populate_nfl_teams(session)

    # Step 3: Load game data
    print(f"\nLoading game data for seasons: {seasons}")
    populate_games_from_nfl_data(session, seasons)

    # Step 4: Verify
    game_count = session.query(Game).count()
    odds_count = session.query(Odds).count()
    team_count = session.query(Team).count()
    print(f"\nDatabase summary:")
    print(f"  Teams: {team_count}")
    print(f"  Games: {game_count}")
    print(f"  Odds records: {odds_count}")

    session.close()
    return engine

Querying Your Database

Once the database is populated, you can use both SQL and SQLAlchemy's ORM to query it.

import pandas as pd
from sqlalchemy import create_engine, func, and_

def query_examples(db_path='betting.db'):
    """Demonstrate common queries against the betting database."""

    engine = create_engine(f'sqlite:///{db_path}')
    session = get_session(engine)

    # --- QUERY 1: Home records by team (the query joins on home_team_id only) ---
    print("=" * 50)
    print("QUERY 1: Home Win/Loss Records by Team (2024)")
    print("=" * 50)

    # Using raw SQL via pandas (often the simplest approach)
    query = """
    SELECT
        t.abbreviation AS team,
        COUNT(*) AS games,
        SUM(CASE WHEN g.home_margin > 0 THEN 1 ELSE 0 END) AS wins,
        SUM(CASE WHEN g.home_margin < 0 THEN 1 ELSE 0 END) AS losses,
        ROUND(AVG(g.home_margin), 1) AS avg_margin,
        ROUND(AVG(g.total_points), 1) AS avg_total
    FROM games g
    JOIN teams t ON g.home_team_id = t.id
    WHERE g.season = 2024
    GROUP BY t.abbreviation
    ORDER BY wins DESC
    """
    df = pd.read_sql(query, engine)
    print(df.head(10))

    # --- QUERY 2: Spread performance by favorite size ---
    print("\n" + "=" * 50)
    print("QUERY 2: Cover Rates by Spread Range")
    print("=" * 50)

    query = """
    SELECT
        CASE
            WHEN ABS(o.home_spread) <= 3 THEN '0-3'
            WHEN ABS(o.home_spread) <= 6 THEN '3.5-6'
            WHEN ABS(o.home_spread) <= 10 THEN '6.5-10'
            ELSE '10+'
        END AS spread_range,
        COUNT(*) AS games,
        ROUND(AVG(CASE WHEN g.home_margin > -o.home_spread
                  THEN 1.0 ELSE 0.0 END), 3) AS home_cover_rate,
        ROUND(AVG(g.home_margin + o.home_spread), 1) AS avg_spread_error
    FROM games g
    JOIN odds o ON g.id = o.game_id AND o.is_closing = 1
    WHERE o.home_spread IS NOT NULL
    GROUP BY spread_range
    ORDER BY MIN(ABS(o.home_spread))
    """
    df = pd.read_sql(query, engine)
    print(df)

    # --- QUERY 3: Using SQLAlchemy ORM ---
    print("\n" + "=" * 50)
    print("QUERY 3: Recent Games with Largest Upsets")
    print("=" * 50)

    # Find games where the underdog won by the largest margin
    results = (
        session.query(
            Game, Odds, Team
        )
        .join(Odds, Game.id == Odds.game_id)
        .join(Team, Game.home_team_id == Team.id)
        .filter(Odds.is_closing == True)
        .filter(Game.season == 2024)
        .all()
    )

    upsets = []
    for game, odds, team in results:
        if odds.home_spread and game.home_margin:
            spread_error = game.home_margin - (-odds.home_spread)
            upsets.append({
                'date': game.game_date,
                'home_team': team.abbreviation,
                'score': f"{game.home_score}-{game.away_score}",
                'spread': odds.home_spread,
                'margin_vs_spread': round(spread_error, 1)
            })

    upset_df = pd.DataFrame(upsets)
    upset_df = upset_df.reindex(
        upset_df['margin_vs_spread'].abs().sort_values(ascending=False).index
    )
    print(upset_df.head(10))

    session.close()

# query_examples()

Recording and Analyzing Your Bets

The most important table in your database is the bets table. Without disciplined tracking of every bet you place, you cannot evaluate your performance, identify leaks in your strategy, or know whether your models actually work.

import pandas as pd
from datetime import datetime

def record_bet(session, game_id, bet_type, selection, line, odds,
               stake, bookmaker, model_name=None, model_edge=None,
               model_probability=None, notes=None):
    """
    Record a new bet in the database.

    Parameters:
        session: SQLAlchemy session
        game_id: ID of the game in the games table
        bet_type: 'spread', 'total', 'ml', or 'prop'
        selection: 'home', 'away', 'over', 'under', or custom string
        line: The line you bet (e.g., -3.5 for a spread)
        odds: American odds (e.g., -110)
        stake: Dollar amount wagered
        bookmaker: Name of the bookmaker
        model_name: Which model (if any) generated this recommendation
        model_edge: Model's estimated edge
        model_probability: Model's probability for this outcome
        notes: Any additional context

    Returns:
        The created Bet object
    """
    # Validate the game exists
    game = session.query(Game).filter_by(id=game_id).first()
    if not game:
        raise ValueError(f"Game {game_id} not found in database")

    bet = Bet(
        game_id=game_id,
        bet_type=bet_type,
        selection=selection,
        line=line,
        odds=odds,
        stake=float(stake),
        bookmaker=bookmaker,
        model_name=model_name,
        model_edge=model_edge,
        model_probability=model_probability,
        notes=notes
    )

    session.add(bet)
    session.commit()
    print(f"Bet recorded: {bet_type} {selection} @ {odds} "
          f"(${stake:.2f} on game {game_id})")
    return bet


def settle_bets(session, game_id):
    """
    Settle all bets for a completed game.

    Determines win/loss/push and calculates profit for each bet.
    """
    game = session.query(Game).filter_by(id=game_id).first()
    if not game or game.home_score is None:
        print(f"Game {game_id} not found or not yet completed")
        return

    bets = session.query(Bet).filter_by(game_id=game_id).all()
    margin = game.home_margin
    total = game.total_points

    for bet in bets:
        if bet.result is not None:
            continue  # Already settled

        # Determine result based on bet type
        if bet.bet_type == 'spread':
            if bet.selection == 'home':
                diff = margin - (-bet.line)
            else:  # away
                diff = -margin - (-bet.line)

            if diff > 0:
                bet.result = 'win'
            elif diff < 0:
                bet.result = 'loss'
            else:
                bet.result = 'push'

        elif bet.bet_type == 'total':
            if bet.selection == 'over':
                diff = total - bet.line
            else:  # under
                diff = bet.line - total

            if diff > 0:
                bet.result = 'win'
            elif diff < 0:
                bet.result = 'loss'
            else:
                bet.result = 'push'

        elif bet.bet_type == 'ml':
            if margin == 0:
                bet.result = 'push'   # tie game: moneyline bets are typically refunded
            elif bet.selection == 'home':
                bet.result = 'win' if margin > 0 else 'loss'
            else:
                bet.result = 'win' if margin < 0 else 'loss'

        # Calculate payout
        if bet.result == 'win':
            if bet.odds > 0:
                profit = bet.stake * (bet.odds / 100)
            else:
                profit = bet.stake * (100 / abs(bet.odds))
            bet.payout = bet.stake + profit
            bet.profit = profit
        elif bet.result == 'push':
            bet.payout = bet.stake
            bet.profit = 0
        else:
            bet.payout = 0
            bet.profit = -bet.stake

    session.commit()
    print(f"Settled {len(bets)} bets for game {game_id}")


def generate_betting_report(engine, season=None):
    """
    Generate a comprehensive betting performance report.
    """
    where_clause = ""
    if season:
        where_clause = f"AND g.season = {season}"

    query = f"""
    SELECT
        b.bet_type,
        COUNT(*) AS total_bets,
        SUM(CASE WHEN b.result = 'win' THEN 1 ELSE 0 END) AS wins,
        SUM(CASE WHEN b.result = 'loss' THEN 1 ELSE 0 END) AS losses,
        SUM(CASE WHEN b.result = 'push' THEN 1 ELSE 0 END) AS pushes,
        ROUND(SUM(CASE WHEN b.result = 'win' THEN 1.0 ELSE 0.0 END)
              / NULLIF(SUM(CASE WHEN b.result IN ('win','loss')
                      THEN 1 ELSE 0 END), 0), 3) AS win_rate,
        ROUND(SUM(b.stake), 2) AS total_staked,
        ROUND(SUM(b.profit), 2) AS total_profit,
        ROUND(SUM(b.profit) / NULLIF(SUM(b.stake), 0) * 100, 2) AS roi_pct,
        ROUND(AVG(b.model_edge), 4) AS avg_model_edge
    FROM bets b
    JOIN games g ON b.game_id = g.id
    WHERE b.result IS NOT NULL {where_clause}
    GROUP BY b.bet_type
    """

    report = pd.read_sql(query, engine)

    print("=" * 70)
    print("BETTING PERFORMANCE REPORT")
    if season:
        print(f"Season: {season}")
    print("=" * 70)
    print(report.to_string(index=False))

    # Overall summary
    total_query = f"""
    SELECT
        COUNT(*) AS total_bets,
        SUM(b.stake) AS total_staked,
        SUM(b.profit) AS total_profit,
        SUM(b.profit) / NULLIF(SUM(b.stake), 0) * 100 AS roi_pct
    FROM bets b
    JOIN games g ON b.game_id = g.id
    WHERE b.result IS NOT NULL {where_clause}
    """
    totals = pd.read_sql(total_query, engine)
    print(f"\nOverall: {totals['total_bets'].iloc[0]} bets, "
          f"${totals['total_staked'].iloc[0]:.2f} staked, "
          f"${totals['total_profit'].iloc[0]:.2f} profit "
          f"({totals['roi_pct'].iloc[0]:.2f}% ROI)")

    return report
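
Tying the three functions together, a minimal (hypothetical) usage sketch might look like the following; the game id, stake, edge, and bookmaker are placeholders.

# engine = create_engine('sqlite:///betting.db')
# session = get_session(engine)
#
# record_bet(session, game_id=123, bet_type='spread', selection='home',
#            line=-3.5, odds=-110, stake=50.0, bookmaker='consensus',
#            model_name='baseline_v1', model_edge=0.02,
#            model_probability=0.54, notes='example entry')
#
# settle_bets(session, game_id=123)             # run after the game completes
# generate_betting_report(engine, season=2024)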

Automating Data Collection

A database is only useful if it stays current. Automating data collection ensures your database is updated regularly without manual intervention.

import schedule
import time
import logging
import pandas as pd
from datetime import datetime, timedelta
from sqlalchemy import create_engine, func

# Set up logging
logging.basicConfig(
    filename='data_collection.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def daily_update(db_path='betting.db'):
    """
    Daily data update routine.

    Run this as a scheduled task (cron job on Linux/Mac,
    Task Scheduler on Windows) or use Python's schedule library.
    """
    logging.info("Starting daily data update")

    try:
        engine = create_engine(f'sqlite:///{db_path}')
        session = get_session(engine)

        # 1. Check for newly completed games
        import nfl_data_py as nfl
        current_year = datetime.now().year
        schedule_data = nfl.import_schedules([current_year])

        # Find games completed since our last update
        latest_game = session.query(func.max(Game.game_date)).scalar()
        if latest_game:
            new_games = schedule_data[
                (pd.to_datetime(schedule_data['gameday']).dt.date > latest_game) &
                (schedule_data['home_score'].notna())
            ]
        else:
            new_games = schedule_data[schedule_data['home_score'].notna()]

        if len(new_games) > 0:
            logging.info(f"Found {len(new_games)} new completed games")
            # Insert new games (using populate function from above)
            populate_games_from_nfl_data(session, [current_year])

            # Settle any pending bets for completed games
            pending_bets = session.query(Bet).filter(Bet.result.is_(None)).all()
            game_ids_to_settle = set(b.game_id for b in pending_bets)
            for gid in game_ids_to_settle:
                game = session.query(Game).filter_by(id=gid).first()
                if game and game.home_score is not None:
                    settle_bets(session, gid)
        else:
            logging.info("No new completed games found")

        session.close()
        logging.info("Daily update completed successfully")

    except Exception as e:
        logging.error(f"Daily update failed: {str(e)}")
        raise


def run_scheduler():
    """
    Run the data collection on a schedule.

    For production use, prefer OS-level scheduling (cron/Task Scheduler)
    over this approach, as it requires the Python process to stay running.
    """
    schedule.every().day.at("06:00").do(daily_update)

    print("Scheduler started. Press Ctrl+C to stop.")
    while True:
        schedule.run_pending()
        time.sleep(60)


# For a simpler approach, just run the update manually or via cron:
# daily_update('my_betting.db')
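
If you go the cron/Task Scheduler route, a thin wrapper script is all you need. A minimal sketch, assuming daily_update is importable from wherever you keep the code above (the file and module names here are hypothetical):

# run_daily_update.py -- invoked once per morning by the OS scheduler,
# e.g. a crontab entry such as: 0 6 * * * /usr/bin/python3 /path/to/run_daily_update.py
from betting_pipeline import daily_update  # hypothetical module containing the code above

if __name__ == '__main__':
    daily_update('betting.db')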

5.6 Chapter Summary

This chapter covered the practical data skills that transform theoretical betting knowledge into actionable analysis. Let us review the key concepts, essential code patterns, and decision frameworks.

Key Concepts

Data sources are abundant but imperfect. The Sports Reference family, nfl_data_py, nba_api, and pybaseball provide rich, free data for major North American sports. FBRef covers global soccer. Odds data requires separate sourcing from providers like The Odds API or historical databases. No single source is complete; serious analysis requires combining multiple sources.

Pandas is the essential tool. DataFrames are the primary data structure for sports analysis. The operations you will use most frequently are filtering (boolean indexing), grouping (groupby/agg), merging (pd.merge), and computing rolling statistics (rolling/shift). Mastering these four patterns covers the majority of sports data manipulation tasks.

Data cleaning is not optional. Sports data has domain-specific quality issues: team relocations change names, DNP entries are not the same as zero, time zones create silent merge failures, and legitimate extreme performances look like outliers. Context-aware cleaning -- understanding why data looks the way it does -- produces better results than mechanical approaches.

Exploratory analysis reveals structure. Before modeling, you must understand your data's distributions, trends, and correlations. In the NFL, margins cluster around key numbers (3, 7, 10). Home-field advantage has been declining over the past two decades. The spread-to-margin correlation of roughly 0.40-0.45 means the market line explains only about 16-20% of margin variance. These structural facts shape every model you build.

A personal database compounds in value. A well-designed SQLite database with proper schema (teams, games, odds, bets) becomes more valuable over time. It provides a single source of truth, enables complex queries, and -- critically -- tracks your actual betting performance against your models' predictions.

Key Code Patterns

Here is a reference card of the pandas patterns you will use most often in sports betting analysis.

  • Filter rows: df[df['season'] == 2024] -- restrict to a specific season
  • Multiple conditions: df[(df['week'] <= 18) & (df['spread'] < 0)] -- regular-season home favorites
  • Group and aggregate: df.groupby('team')['margin'].agg(['mean', 'std']) -- team-level summaries
  • Rolling average: df.groupby('team')['pts'].transform(lambda x: x.rolling(5).mean()) -- recent form
  • Lag features: df.groupby('team')['pts'].shift(1) -- previous game's stats (avoid lookahead)
  • Merge datasets: pd.merge(stats, odds, on='game_id', how='inner') -- combine sources
  • Pivot table: df.pivot_table(values='margin', index='home', columns='away') -- matchup matrices
  • Apply function: df.apply(custom_function, axis=1) -- row-wise custom logic
  • Value counts: df['result'].value_counts(normalize=True) -- frequency distributions
  • Date parsing: pd.to_datetime(df['date']) -- standardize date formats

Decision Framework for Choosing Data Sources

When starting a new analysis, use this decision tree to choose your data source:

  1. Is there a maintained Python package? (nfl_data_py, nba_api, pybaseball, etc.)
     - Yes: Use it. These packages handle authentication, rate limiting, and data formatting for you.
     - No: Continue to step 2.

  2. Is there a public API?
     - Yes: Use the requests library with proper rate limiting and caching.
     - No: Continue to step 3.

  3. Is the data available on a reference site?
     - Yes: Check if pandas.read_html() can extract it directly (see the sketch after this list). If not, use requests + BeautifulSoup while respecting robots.txt and rate limits.
     - No: Continue to step 4.

  4. Is paid data necessary? If your analysis requires data not available through free sources (real-time feeds, player tracking, niche sports), evaluate paid providers. Start with free tiers and trial periods.

  5. For odds data specifically: Combine The Odds API (live odds) with historical datasets from Kaggle or Australian Sports Betting (historical closing lines). Store everything in your personal database so you build a proprietary historical record over time.
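
For step 3, here is a minimal sketch of the read_html() route; the URL is a placeholder, and a real page may expose several tables.

import pandas as pd

# read_html returns a list of DataFrames, one per HTML table found on the page
tables = pd.read_html('https://www.example.com/nfl/2024/games.htm')
games = tables[0]          # inspect the list to find the table you actually want
print(games.head())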

Common Pitfalls to Avoid

Lookahead bias is the most dangerous data error in sports modeling. If your features for predicting game N include any information from game N or later, your backtest results will be fraudulently good and your live performance will be terrible. Always use .shift(1) or equivalent to ensure you are only using past data.
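
A minimal illustration of the fix (the DataFrame and column names are hypothetical):

# WRONG: the rolling window includes the current game's own result
df['form'] = df.groupby('team')['pts'].transform(lambda x: x.rolling(3).mean())

# RIGHT: shift first so each row sees only games that finished before it
df['form'] = df.groupby('team')['pts'].transform(lambda x: x.shift(1).rolling(3).mean())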

Survivorship bias occurs when your dataset only includes teams or players that existed throughout your analysis period. Teams relocate, players retire, and leagues expand. If you analyze "all current NFL teams" over 20 years, you miss the St. Louis Rams, San Diego Chargers, and Oakland Raiders.

Small sample deception is especially treacherous in sports. A team that covers 8 of 10 spreads is on a hot streak; it does not mean they will keep covering 80% of spreads going forward. Always compute confidence intervals and think about sample sizes. A single NFL season provides only 272 regular-season games -- barely enough to estimate a league-wide cover rate, let alone team-specific tendencies.
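
As a quick sketch of how wide that uncertainty is (using a normal approximation, which is already generous at n = 10):

import numpy as np

covers, n = 8, 10
p_hat = covers / n                              # 0.80 observed cover rate
se = np.sqrt(p_hat * (1 - p_hat) / n)           # standard error of a proportion
print(f"95% CI: {p_hat - 1.96*se:.2f} to {p_hat + 1.96*se:.2f}")   # roughly 0.55 to 1.05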

Data source mismatches occur when you merge data from sources that define variables differently. One source might report spreads from the home team's perspective; another from the favorite's perspective. One source might use Eastern Time; another UTC. Always verify alignment after merging by spot-checking individual games.
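
A small sanity-check pattern after any such merge (the DataFrame and column names below are hypothetical):

import pandas as pd

merged = pd.merge(source_a, source_b, on='game_id', suffixes=('_a', '_b'))

# Flag games where the two sources disagree on the final score
mismatch = merged[(merged['home_score_a'] != merged['home_score_b']) |
                  (merged['away_score_a'] != merged['away_score_b'])]
print(f"{len(mismatch)} of {len(merged)} merged games disagree on the final score")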


What's Next

With the data foundations established in this chapter, you are prepared for Part II: Statistical Foundations. The next three chapters build the mathematical framework that transforms data into models.

Chapter 6: Descriptive Statistics and Distributions dives deep into the statistical properties of sports outcomes. You will learn why NFL margins approximate a normal distribution centered on the spread, how to characterize the variance of different sports, and how to use distributional thinking to estimate probabilities. The pandas and visualization skills from this chapter will be applied immediately as we analyze real distributions of scores, margins, and totals.

Chapter 7: Probability Theory for Betting formalizes the probability concepts introduced informally in earlier chapters. Conditional probability, Bayes' theorem, and the law of large numbers will be developed with sports-specific examples and applications.

Chapter 8: Regression Analysis introduces the first true predictive modeling technique. You will build linear regression models to predict point spreads and totals, using the data loading, cleaning, and feature engineering techniques from this chapter as inputs.

The arc of this textbook is deliberate: Part I gave you the vocabulary (Chapters 1-4) and the data skills (Chapter 5) to speak the language of quantitative betting. Part II gives you the mathematical tools to build models. By the end of Part II, you will have everything needed to construct your first working sports betting model.


Exercises

Exercise 5.1: Data Acquisition

Using nfl_data_py, pull play-by-play data for the most recent completed NFL season. Filter to passing plays only (hint: look for play_type == 'pass'). Calculate the top 10 quarterbacks by expected points added (EPA) per dropback. Compare your rankings to the final season standings.

Exercise 5.2: Data Cleaning Challenge

Download game data from two different sources (e.g., nfl_data_py and a CSV from Pro Football Reference). Merge them on game identifiers. Document every discrepancy you find: different team abbreviations, different date formats, different column names for the same statistic. Write a cleaning function that reconciles the two sources into a single, consistent DataFrame.

Exercise 5.3: Exploratory Analysis

Using at least three seasons of NFL data, create visualizations that answer these questions: 1. Has the average total points per game changed over the period? 2. Do Thursday Night Football games have different scoring patterns than Sunday games? 3. Is there a relationship between a team's turnover differential and their ATS (against the spread) record?

Exercise 5.4: Database Construction

Build the SQLite betting database described in Section 5.5. Populate it with at least three seasons of NFL data. Write SQL queries that answer: 1. Which teams have been the most profitable to bet on (home spread) over the period? 2. What is the over/under hit rate for games with totals above 50 versus below 40? 3. Do divisional games have different ATS characteristics than non-divisional games?

Exercise 5.5: Automation Pipeline

Extend the SportsDataPipeline class from Section 5.3 to handle NBA data. Your pipeline should: 1. Pull game data using nba_api 2. Standardize team names (handle the Charlotte Bobcats/Hornets transition) 3. Handle the different season notation (e.g., "2024-25" vs. 2024) 4. Create derived features appropriate for basketball (pace, offensive rating, etc.) 5. Output a clean DataFrame ready for analysis


Further Reading

  • McKinney, W. Python for Data Analysis, 3rd Edition. O'Reilly Media. The definitive guide to pandas by its creator.
  • VanderPlas, J. Python Data Science Handbook. O'Reilly Media. Excellent coverage of NumPy, pandas, matplotlib, and scikit-learn.
  • The nflverse documentation (nflverse.nflverse.com) provides comprehensive guides to the NFL data ecosystem.
  • Kopf, D. and Zuckerman, A. "The Home Field Advantage Is Disappearing." FiveThirtyEight. A data-driven analysis of declining home-field advantage across professional sports.
  • SQLAlchemy documentation (docs.sqlalchemy.org) for deeper understanding of the ORM patterns used in Section 5.5.