Chapter 5: Key Takeaways
Data Literacy for Bettors: Summary Card
1. Data Is the Foundation of Modern Betting
- Every profitable model, every edge, and every strategy depends on data that is accurate, complete, and properly structured.
- The bettor who controls their data pipeline controls their analytical destiny.
- Garbage in, garbage out applies with special force in sports betting, where decisions are made on thin margins.
2. Know Your Data Sources
- Primary sources (official league APIs, sportsbook feeds) are authoritative but may have access restrictions.
- Secondary sources (aggregators, third-party APIs, scraped datasets) are convenient but introduce additional error risk.
- Always document data provenance: where it came from, when it was collected, and what transformations were applied.
- Evaluate sources on six dimensions: accuracy, completeness, consistency, timeliness, cost, and coverage.
3. Master Pandas for Sports Data
- pandas is the essential tool for loading, filtering, transforming, and analyzing sports datasets in Python.
- Core operations to memorize: boolean indexing, `groupby` with aggregation, `merge`/`join`, `rolling` windows, `pivot_table`, and `apply`.
- Always sort time-series data by date before computing rolling statistics.
- Use `pd.concat()` to reshape game-level data into team-game-level data (doubling the rows, one per team), as sketched below.
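A minimal sketch of these operations on a toy game-level frame (column names such as `home_pts` are illustrative, not from any particular feed):

```python
import pandas as pd

# Hypothetical game-level data: one row per game.
games = pd.DataFrame({
    "game_id": [1, 2, 3],
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-12"]),
    "home_team": ["BOS", "LAL", "BOS"],
    "away_team": ["NYK", "DEN", "LAL"],
    "home_pts": [112, 105, 99],
    "away_pts": [104, 110, 101],
})

# Reshape to team-game level with pd.concat: two rows per game, one per team.
home = games.rename(columns={"home_team": "team", "home_pts": "pts",
                             "away_team": "opp", "away_pts": "opp_pts"})
away = games.rename(columns={"away_team": "team", "away_pts": "pts",
                             "home_team": "opp", "home_pts": "opp_pts"})
team_games = pd.concat([home, away], ignore_index=True)

# Sort by date before any rolling statistic, then compute a 3-game average.
team_games = team_games.sort_values(["team", "date"])
team_games["pts_ma3"] = (
    team_games.groupby("team")["pts"]
    .rolling(3, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)  # realign with the original row index
)

# Boolean indexing and a groupby aggregation.
century_games = team_games[team_games["pts"] > 100]
avg_pts_by_team = team_games.groupby("team")["pts"].mean()
```

The `reset_index(level=0, drop=True)` after the grouped rolling mean realigns the result with the original rows; skipping it is a common source of misaligned rolling columns.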
4. Data Cleaning Is Non-Negotiable
- Real sports data is messy: inconsistent team names, missing values, duplicated rows, invalid dates, and conflicting records across sources.
- Build a reusable cleaning pipeline with standardized steps: schema validation, team-name mapping, duplicate detection, missing-value handling, and domain-constraint checks (see the sketch after this list).
- Never blindly impute missing data. Understand why data is missing before choosing a strategy (drop, fill, flag, or interpolate).
- Test your cleaning code. A bug in your pipeline corrupts every analysis downstream.
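One shape such a pipeline might take, as a sketch (the `NAME_MAP` entries and column names are placeholders to adapt to your own sources):

```python
import pandas as pd

# Placeholder name map; real maps are built per source.
NAME_MAP = {"Los Angeles Lakers": "LAL", "LA Lakers": "LAL",
            "Boston Celtics": "BOS"}
REQUIRED_COLS = {"game_id", "date", "team", "pts"}

def clean_games(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the standardized cleaning steps in a fixed, testable order."""
    # 1. Schema validation: fail fast if expected columns are missing.
    missing = REQUIRED_COLS - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")

    out = df.copy()
    # 2. Team-name mapping to one canonical abbreviation.
    out["team"] = out["team"].replace(NAME_MAP)
    # 3. Duplicate detection: one record per team per game.
    out = out.drop_duplicates(subset=["game_id", "team"])
    # 4. Missing-value handling: flag rather than silently impute.
    out["pts_missing"] = out["pts"].isna()
    # 5. Domain-constraint check: point totals must be non-negative.
    if (out["pts"].dropna() < 0).any():
        raise ValueError("negative point totals found")
    return out
```

Because each step is a plain function of a DataFrame, it can be unit-tested with small hand-built fixtures, which is what "test your cleaning code" means in practice.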
5. Exploratory Data Analysis (EDA) Reveals Opportunities
- EDA is not optional. It is the process of understanding your data before modeling.
- The EDA checklist: shape, dtypes, summary statistics, missing values, distributions, correlations, outliers, and domain-specific patterns (see the helper sketched after this list).
- Use histograms to see distributions (key numbers in NFL spreads, scoring distributions in NBA).
- Use scatter plots and correlation matrices to identify relationships between features.
- Investigate outliers before removing them. Some are errors; some are real and informative.
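A first pass over that checklist can be bundled into a small helper (a sketch; `quick_eda` is a made-up name):

```python
import pandas as pd

def quick_eda(df: pd.DataFrame) -> None:
    """Print the first-pass EDA checklist for any sports DataFrame."""
    print(df.shape)                            # shape: rows x columns
    print(df.dtypes)                           # dtypes: catch dates read as strings
    print(df.describe())                       # summary statistics
    print(df.isna().sum())                     # missing values per column
    print(df.select_dtypes("number").corr())   # correlation matrix

# Distributions deserve a plot, e.g. margins of victory in NFL data,
# where mass piles up on key numbers like 3 and 7 (assumes a 'margin' column):
# df["margin"].hist(bins=range(-30, 31))
```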
6. Store Data Properly
- Flat CSV files are fine for exploration but do not scale for serious betting research.
- A relational database (SQLite for personal use, PostgreSQL for production) provides structure, indexing, referential integrity, and query power.
- Normalize your schema: separate tables for teams, games, odds, and bets, linked by foreign keys (see the schema sketch after this list).
- Create indexes on columns you query frequently (date, team, game_id).
- SQL is a skill worth learning. Complex aggregations (ATS records, rolling averages, conditional filters) are often easier and faster in SQL than in pandas.
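A normalized schema along these lines might look as follows in SQLite (a sketch; table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect("betting.db")
# SQLite only enforces REFERENCES constraints with this pragma enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE IF NOT EXISTS teams (
    team_id INTEGER PRIMARY KEY,
    abbrev  TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS games (
    game_id      INTEGER PRIMARY KEY,
    date         TEXT NOT NULL,
    home_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    away_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    home_pts     INTEGER,
    away_pts     INTEGER
);
CREATE TABLE IF NOT EXISTS odds (
    odds_id INTEGER PRIMARY KEY,
    game_id INTEGER NOT NULL REFERENCES games(game_id),
    spread  REAL,
    total   REAL
);
-- Index the columns you filter on most often.
CREATE INDEX IF NOT EXISTS idx_games_date ON games(date);
CREATE INDEX IF NOT EXISTS idx_odds_game_id ON odds(game_id);
""")
conn.commit()
```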
7. Automate Data Quality
- Run automated quality checks every time new data enters your pipeline.
- The five essential checks: schema validation, row-count verification, duplicate detection, domain-constraint validation, and referential integrity (see the sketch after this list).
- Log everything. When a model suddenly performs poorly, the first place to look is the data pipeline.
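Those checks can live in a single gatekeeper function that runs before new rows are written anywhere (a sketch; the function name and threshold parameter are illustrative):

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

REQUIRED_COLS = {"game_id", "date", "team", "pts"}

def run_quality_checks(new: pd.DataFrame, known_teams: set,
                       min_rows: int) -> None:
    """Run the five essential checks; raise on the first failure."""
    # 1. Schema validation.
    if not REQUIRED_COLS <= set(new.columns):
        raise ValueError(f"schema drift: {REQUIRED_COLS - set(new.columns)}")
    # 2. Row-count verification: a half-empty feed should never pass silently.
    if len(new) < min_rows:
        raise ValueError(f"expected >= {min_rows} rows, got {len(new)}")
    # 3. Duplicate detection.
    if new.duplicated(["game_id", "team"]).any():
        raise ValueError("duplicate team-game rows")
    # 4. Domain-constraint validation.
    if (new["pts"].dropna() < 0).any():
        raise ValueError("negative point totals")
    # 5. Referential integrity: every team must already exist upstream.
    unknown = set(new["team"]) - known_teams
    if unknown:
        raise ValueError(f"unknown teams: {unknown}")
    log.info("quality checks passed: %d rows", len(new))
```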
8. Beware Common Pitfalls
- Survivorship bias: Datasets that only include successful outcomes (e.g., winning picks) are misleading.
- Look-ahead bias: Using information that was not available at the time of the bet (e.g., treating final injury reports as if they were known before publication).
- Multiple comparisons: Testing 50 hypotheses and celebrating the 3 that pass at p < 0.05 is not discovery; it is noise (see the simulation after this list).
- Overfitting to history: A pattern that appeared in 3 seasons of data may be sample noise, not a durable signal.
- Key mismatch: Team abbreviations, date formats, and naming conventions vary across sources. Standardize before merging.
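The multiple-comparisons point is easy to see in simulation. Below, 50 betting "angles" are tested on pure coin-flip data, so by construction there is no edge anywhere; a handful will still clear p < 0.05 (a sketch using a normal-approximation z-test):

```python
import numpy as np

rng = np.random.default_rng(42)
n_angles, n_bets = 50, 200

# Each "angle" is just 200 fair coin flips: no real edge anywhere.
wins = rng.binomial(n_bets, 0.5, size=n_angles)

# Two-sided z-test of each angle's win rate against 50%.
z = (wins - n_bets * 0.5) / np.sqrt(n_bets * 0.25)
hits = int((np.abs(z) > 1.96).sum())  # |z| > 1.96 is p < 0.05 two-sided

print(f"{hits} of {n_angles} pure-noise angles look 'significant' at p < 0.05")
```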
9. The Data Literacy Mindset
- Always ask: Where did this data come from? How was it collected? What might be wrong with it?
- Treat every dataset as guilty until proven innocent.
- Document your work. Future-you will thank present-you when debugging a pipeline at 2 AM.
- The best bettors are not the ones with the most data. They are the ones who understand their data most deeply.
Quick Reference: Chapter Formulas and Code Patterns
| Task | pandas Code |
|---|---|
| Filter rows | `df[df['spread'] < -7]` |
| Group and aggregate | `df.groupby('team')['points'].mean()` |
| Merge datasets | `pd.merge(games, odds, on='game_id', how='left')` |
| Rolling average | `df.groupby('team')['pts'].rolling(10).mean()` |
| Pivot table | `pd.pivot_table(df, values='covered', index='team', columns='location', aggfunc='mean')` |
| Fill missing values | `df['col'].fillna(df['col'].median())` |
| Standardize names | `df['team'].replace(name_mapping)` |
| ATS calculation | `df['covered'] = (df['score_diff'] + df['spread'] > 0)` |