Chapter 5 Exercises: Data Literacy for Bettors
Part A: Conceptual Questions (Data Quality, Sources, and EDA Principles)
Exercise A-1: Defining Data Quality Dimensions
List and define the six core dimensions of data quality (accuracy, completeness, consistency, timeliness, validity, and uniqueness). For each dimension, provide a concrete example of how a failure in that dimension would affect a sports betting dataset. For instance, how would an accuracy failure in point-spread closing lines corrupt a closing-line-value analysis?
Exercise A-2: Primary vs. Secondary Data Sources
Explain the difference between a primary data source and a secondary data source in the context of sports statistics. Give two examples of each for NFL data. Discuss the trade-offs a bettor faces when choosing between a primary source (e.g., the NFL's official stats API) and a secondary aggregator (e.g., a third-party website that scrapes and repackages the data).
Exercise A-3: Survivorship Bias in Betting Data
A bettor downloads a dataset of "sharp plays" from a tipster's website. The dataset contains only the tipster's winning picks over the last three seasons. Explain how survivorship bias affects the conclusions the bettor might draw from this dataset. What additional data would be needed to properly evaluate the tipster's track record?
Exercise A-4: Tidy Data Principles
Define what Hadley Wickham's "tidy data" principles require. Then take the following messy table of NFL game results and describe, step by step, how you would restructure it into tidy format:
| Week | Home Team | Away Team | Home Score | Away Score | Home Spread | Away Spread |
|---|---|---|---|---|---|---|
| 1 | KC | DET | 21 | 20 | -3.5 | +3.5 |
Discuss why the Home Spread and Away Spread columns violate tidy principles and how you would resolve this.
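For orientation, here is one possible restructuring, sketched in pandas. The wide table above is rebuilt inline, and treating each team-game as the observational unit is one defensible tidy layout, not the only one:

```python
import pandas as pd

wide = pd.DataFrame({
    "week": [1], "home_team": ["KC"], "away_team": ["DET"],
    "home_score": [21], "away_score": [20],
    "home_spread": [-3.5], "away_spread": [3.5],
})

# One tidy candidate: one row per team-game, with the home/away role as a
# value in a 'side' column instead of being encoded in the column names.
home = wide[["week", "home_team", "home_score", "home_spread"]]
home = home.set_axis(["week", "team", "score", "spread"], axis=1).assign(side="home")
away = wide[["week", "away_team", "away_score", "away_spread"]]
away = away.set_axis(["week", "team", "score", "spread"], axis=1).assign(side="away")

tidy = pd.concat([home, away], ignore_index=True)
# Note: away_spread is always the negation of home_spread, so the wide
# table also stores the same fact twice -- a redundancy tidying removes.
print(tidy)
```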
Exercise A-5: The Role of Domain Knowledge in EDA
Why is domain expertise essential during exploratory data analysis of sports data? Give three specific examples where a bettor's knowledge of the sport would lead them to investigate patterns that a pure statistician might overlook. For instance, how does understanding NFL bye weeks change the way you explore team performance data?
Exercise A-6: Data Provenance and Reproducibility
You find two free NBA datasets online. Dataset A includes a detailed README with collection methodology, update frequency, and known limitations. Dataset B has no documentation. Both contain similar columns. Make the case for choosing Dataset A even if Dataset B has more rows. Define "data provenance" and explain why it matters for reproducible betting research.
Exercise A-7: Granularity and Aggregation
Explain the concept of data granularity. Compare the analytical value of play-by-play NFL data versus game-level summary data. For each of the following questions, state which granularity level is required and why:
- What is a team's red-zone conversion rate?
- What is a team's average margin of victory?
- How does a team's offensive efficiency change in the fourth quarter of close games?
- What is a team's season win total?
Exercise A-8: Ethical Considerations in Sports Data Collection
Discuss the ethical and legal considerations involved in scraping sports data from websites. Address the following points: (a) terms-of-service compliance, (b) rate limiting and server impact, (c) copyright of statistical data vs. creative content, (d) the distinction between publicly available data and proprietary data. Propose a set of best practices for responsible data collection.
Part B: Calculation and Data Transformation Exercises
Exercise B-1: Pandas Indexing and Selection
Given a DataFrame `df` with columns `['date', 'team', 'opponent', 'points_scored', 'points_allowed', 'spread', 'total']`, write pandas expressions to do the following (a starter sketch appears after the list):
- Select all rows where `team == 'LAL'` and `points_scored > 110`.
- Select all rows from January 2024.
- Calculate a margin-of-victory column (`points_scored - points_allowed`).
- Find the game with the largest upset (biggest negative margin relative to the spread).
- Return the top 5 teams by average points scored.
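A starter sketch for these selections, assuming the column names above. The toy `df` is a stand-in so the snippet runs on its own, and the sign convention for `spread` (negative = favorite) is an assumption to confirm against your data:

```python
import pandas as pd

df = pd.DataFrame({  # tiny stand-in with the columns described above
    "date": pd.to_datetime(["2024-01-05", "2024-01-07", "2023-12-30"]),
    "team": ["LAL", "LAL", "BOS"],
    "opponent": ["DEN", "MEM", "TOR"],
    "points_scored": [114, 106, 120],
    "points_allowed": [120, 127, 118],
    "spread": [2.5, -6.5, -7.5],
    "total": [230.5, 228.0, 224.5],
})

lal_high = df[(df["team"] == "LAL") & (df["points_scored"] > 110)]
jan_2024 = df[(df["date"].dt.year == 2024) & (df["date"].dt.month == 1)]
df["margin"] = df["points_scored"] - df["points_allowed"]

# Margin relative to the spread: actual margin minus expected margin,
# where a spread of -6.5 means the team was expected to win by 6.5.
df["ats_margin"] = df["margin"] + df["spread"]
biggest_upset = df.loc[df["ats_margin"].idxmin()]

top5 = df.groupby("team")["points_scored"].mean().nlargest(5)
```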
Exercise B-2: GroupBy Aggregations
Using the same DataFrame structure from B-1, write pandas code to do the following (a partial sketch appears after the list):
- Compute each team's average points scored and points allowed per game.
- Calculate each team's ATS (against the spread) record as wins, losses, and pushes.
- Find the team with the highest "over" percentage on the game total.
- Compute monthly scoring averages for each team.
- Rank teams by their cover percentage (wins ATS / total games).
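A partial sketch of the groupby patterns involved, assuming one row per team per game with the B-1 columns; the ATS sign convention (negative spread = favorite) is again an assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({  # minimal stand-in, one row per team-game
    "date": pd.to_datetime(["2024-01-05", "2024-01-07", "2024-01-07"]),
    "team": ["LAL", "LAL", "BOS"],
    "points_scored": [114, 106, 120],
    "points_allowed": [120, 127, 118],
    "spread": [2.5, -6.5, -7.5],
    "total": [230.5, 228.0, 224.5],
})

scoring = df.groupby("team")[["points_scored", "points_allowed"]].mean()

# ATS outcome per game: win / loss / push
ats_diff = df["points_scored"] - df["points_allowed"] + df["spread"]
df["ats"] = np.select([ats_diff > 0, ats_diff < 0], ["win", "loss"], default="push")
ats_record = df.groupby("team")["ats"].value_counts().unstack(fill_value=0)

# Over percentage and monthly scoring averages
df["went_over"] = df["points_scored"] + df["points_allowed"] > df["total"]
best_over = df.groupby("team")["went_over"].mean().idxmax()
monthly = df.groupby(["team", df["date"].dt.to_period("M")])["points_scored"].mean()
```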
Exercise B-3: Merging Datasets
You have two DataFrames:
- `games_df` with columns `['game_id', 'date', 'home_team', 'away_team', 'home_score', 'away_score']`
- `odds_df` with columns `['game_id', 'spread', 'total', 'home_ml', 'away_ml']`
Write the pandas code to do the following (a sketch appears after the list):
- Inner join on `game_id`.
- Left join (keeping all games even if odds are missing).
- After joining, identify games where odds data is missing.
- Create a validation check that flags games where the `date` in `games_df` falls on a day when no games should have been played (e.g., NFL games on a Wednesday).
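A sketch of the joins, with toy frames standing in for the two DataFrames; the `indicator` flag is the idiomatic way to surface non-matching rows, and the third game is deliberately dated on a Wednesday so the validation check has something to catch:

```python
import pandas as pd

games_df = pd.DataFrame({
    "game_id": [1, 2, 3],
    "date": pd.to_datetime(["2023-09-10", "2023-09-10", "2023-09-13"]),
    "home_team": ["KC", "BUF", "DAL"],
    "away_team": ["DET", "NYJ", "NYG"],
    "home_score": [21, 22, 40],
    "away_score": [20, 16, 0],
})
odds_df = pd.DataFrame({
    "game_id": [1, 2],
    "spread": [-3.5, -2.5],
    "total": [53.0, 46.5],
    "home_ml": [-180, -135],
    "away_ml": [155, 115],
})

inner = games_df.merge(odds_df, on="game_id", how="inner")

# Left join keeps every game; _merge == 'left_only' marks missing odds
left = games_df.merge(odds_df, on="game_id", how="left", indicator=True)
missing_odds = left[left["_merge"] == "left_only"]

# Validation: flag NFL games dated on a Wednesday (Monday == 0, so 2)
suspect_dates = games_df[games_df["date"].dt.dayofweek == 2]
```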
Exercise B-4: Handling Missing Data
A betting dataset has the following missing-data patterns:
- `spread` is missing for 3% of games (older games where lines were not recorded).
- `closing_line` is missing for 12% of games (pre-2015 data).
- `weather_temperature` is missing for 40% of games (indoor stadiums).
For each column, recommend and justify a strategy: drop, fill with a constant, interpolate, or flag and analyze separately. Write the pandas code implementing your chosen strategy for each.
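One defensible set of choices, sketched below. The toy frame exists only so the code runs; the key judgment (structural absence of temperature for dome games) is called out in the comments:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({  # stand-in rows exhibiting each missing-data pattern
    "spread": [-3.5, np.nan, -7.0],
    "closing_line": [-3.0, -2.5, np.nan],
    "weather_temperature": [61.0, np.nan, 45.0],
})

# spread (3%): a small share of old games that is hard to impute credibly
# -- drop the rows.
df = df.dropna(subset=["spread"])

# closing_line (12%, pre-2015): keep the rows but flag them, so any
# closing-line-value analysis can filter explicitly instead of silently.
df["has_closing_line"] = df["closing_line"].notna()

# weather_temperature (40%, indoor stadiums): not missing at random but
# structurally absent. Record dome status as its own feature rather than
# inventing a temperature for a climate-controlled building.
df["is_indoor"] = df["weather_temperature"].isna()
```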
Exercise B-5: Rolling and Expanding Windows
Write pandas code to compute the following for each team in an NBA season:
- A 10-game rolling average of points scored.
- A 10-game rolling ATS record.
- An expanding (cumulative) season win percentage.
- A 5-game rolling standard deviation of point differential (a measure of consistency).
Explain why you must sort by date within each team before applying rolling calculations.
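A sketch of the rolling patterns, using randomly generated stand-in data. `transform` keeps results aligned with the original index, and the sort on the marked line is exactly what the last question is driving at:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # reproducible stand-in data
n = 30
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=n),
    "team": ["LAL"] * (n // 2) + ["BOS"] * (n // 2),
    "points_scored": rng.integers(95, 130, n),
    "margin": rng.integers(-20, 21, n),
    "win": rng.integers(0, 2, n),
    "ats_win": rng.integers(0, 2, n),
})

# Rolling windows walk through rows in order, so each team's games must
# be in chronological order before any window is applied.
df = df.sort_values(["team", "date"])
g = df.groupby("team")

df["pts_roll10"] = g["points_scored"].transform(lambda s: s.rolling(10).mean())
df["ats_roll10"] = g["ats_win"].transform(lambda s: s.rolling(10).mean())
df["win_pct_exp"] = g["win"].transform(lambda s: s.expanding().mean())
df["margin_std5"] = g["margin"].transform(lambda s: s.rolling(5).std())
```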
Exercise B-6: Pivot Tables for Betting Analysis
Using an NFL dataset with columns `['season', 'week', 'team', 'opponent', 'location', 'result', 'spread', 'covered']`, create pivot tables that answer:
- Each team's home vs. away ATS record by season.
- The league-wide ATS cover rate by week number (do certain weeks have more upsets?).
- A cross-tabulation of favorite/underdog status vs. cover result.
Write the full pandas `pivot_table` or `crosstab` code for each.
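A starter sketch with a toy frame; `covered` is assumed to be 0/1, and deriving favorite status from the sign of the spread is an assumption to verify:

```python
import pandas as pd

df = pd.DataFrame({  # toy stand-in with the columns listed above
    "season": [2023] * 4,
    "week": [1, 1, 2, 2],
    "team": ["KC", "DET", "KC", "DET"],
    "location": ["home", "away", "away", "home"],
    "spread": [-3.5, 3.5, -6.5, 1.0],
    "covered": [0, 1, 1, 0],
})

# Team x season ATS cover rate, split home vs. away
home_away = pd.pivot_table(
    df, values="covered", index=["team", "season"],
    columns="location", aggfunc="mean",
)

# League-wide cover rate by week
weekly = pd.pivot_table(df, values="covered", index="week", aggfunc="mean")

# Favorite/underdog status vs. cover result, as row-normalized shares
df["is_favorite"] = df["spread"] < 0
fav_vs_cover = pd.crosstab(df["is_favorite"], df["covered"], normalize="index")
```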
Exercise B-7: Date and Time Operations
An NBA schedule dataset has a `game_datetime` column in UTC. Write pandas code to do the following (a partial sketch appears after the list):
- Convert all times to US Eastern.
- Create a `rest_days` column showing days since each team's previous game.
- Flag "back-to-back" games (zero rest days), remembering that each game involves two teams whose rest must be tracked separately.
- Calculate travel distance between consecutive games using a dictionary of city coordinates and the Haversine formula.
- Create a `schedule_advantage` column: the difference in rest days between the home team and the away team for each game.
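A partial sketch covering the timezone conversion, rest days, and a Haversine helper; the toy schedule and coordinates are stand-ins. Note that the first two UTC timestamps fall on different local calendar days than their UTC dates, which is why the conversion must happen before computing rest:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "team": ["LAL", "LAL", "LAL"],
    "game_datetime": pd.to_datetime(
        ["2024-01-05 03:00", "2024-01-06 02:30", "2024-01-08 01:00"], utc=True
    ),
})

df["local_time"] = df["game_datetime"].dt.tz_convert("US/Eastern")

# Rest days per team: calendar days between consecutive games, minus one
df = df.sort_values(["team", "game_datetime"])
df["rest_days"] = df.groupby("team")["local_time"].transform(
    lambda s: s.dt.normalize().diff().dt.days - 1
)
df["back_to_back"] = df["rest_days"] == 0

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))  # Earth radius ~ 3959 miles

print(haversine_miles(34.05, -118.24, 40.75, -73.99))  # LA to NYC, ~2450 mi
```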
Part C: Programming Exercises
Exercise C-1: Build a Sports Data Scraper
Write a complete Python script that scrapes NBA game results from Basketball Reference (or a similar public source) for a given season. Requirements (a partial skeleton appears after the list):
- Accept a season year as a command-line argument.
- Respect `robots.txt` and include a 3-second delay between requests.
- Parse the HTML tables into a pandas DataFrame.
- Clean column names (lowercase, underscores).
- Save the result to a CSV file named `nba_{season}_games.csv`.
- Include proper error handling for network failures and parsing errors.
- Add logging that records which pages were fetched and any errors encountered.
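A partial skeleton showing the shape such a script might take. The URL pattern is an assumption to verify against the live site (and its robots.txt and terms) before running, and a full solution would loop over the site's per-month schedule pages:

```python
import argparse
import logging
import time
from io import StringIO

import pandas as pd
import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nba_scraper")

def fetch_season(season: int) -> pd.DataFrame:
    # Hypothetical URL pattern -- confirm against the live site first.
    url = f"https://www.basketball-reference.com/leagues/NBA_{season}_games.html"
    log.info("Fetching %s", url)
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        log.exception("Request failed for %s", url)
        raise
    time.sleep(3)  # politeness delay before any subsequent request
    try:
        df = pd.read_html(StringIO(resp.text))[0]
    except ValueError:
        log.exception("No parseable table at %s", url)
        raise
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    return df

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("season", type=int, help="season end year, e.g. 2024")
    args = parser.parse_args()
    fetch_season(args.season).to_csv(f"nba_{args.season}_games.csv", index=False)
```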
Exercise C-2: Data Cleaning Pipeline
Write a Python module `cleaning_pipeline.py` that implements a reusable data-cleaning pipeline for sports betting data. The pipeline should include the following functions:
- `standardize_team_names(df, sport)` -- maps variant team names to a canonical form.
- `validate_scores(df)` -- checks that scores are non-negative integers and that no game has both teams scoring zero.
- `detect_duplicates(df)` -- identifies duplicate games using a composite key of date, home team, and away team.
- `fill_missing_odds(df, method)` -- fills missing odds data using the specified method (drop, ffill, or median).
- `run_pipeline(df, sport)` -- orchestrates all cleaning steps and returns a cleaned DataFrame plus a report summarizing the actions taken.
Each function must include type hints, a docstring, and at least one inline comment explaining a non-obvious decision.
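Two of the five functions, sketched to show the expected shape (type hints, docstring, inline comment); the alias table and column names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical alias map -- a real module would maintain one per sport.
NBA_ALIASES = {"Los Angeles Lakers": "LAL", "LA Lakers": "LAL", "Boston": "BOS"}

def standardize_team_names(df: pd.DataFrame, sport: str) -> pd.DataFrame:
    """Map variant team names in the 'team' column to a canonical form."""
    aliases = {"nba": NBA_ALIASES}.get(sport.lower(), {})
    out = df.copy()  # copy so pipeline steps never mutate the caller's frame
    out["team"] = out["team"].map(lambda name: aliases.get(name, name))
    return out

def detect_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that repeat the (date, home_team, away_team) key."""
    key = ["date", "home_team", "away_team"]
    # keep=False marks every member of a duplicate group, not just the
    # later occurrences, so the report shows all conflicting rows.
    return df[df.duplicated(subset=key, keep=False)]
```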
Exercise C-3: EDA Notebook
Create a Python script (simulating a Jupyter notebook workflow) that performs a complete exploratory data analysis on an NFL season dataset. The script should do the following (a partial sketch appears after the list):
- Load data from CSV.
- Print shape, dtypes, and first five rows.
- Generate summary statistics for all numeric columns.
- Check for and report missing values.
- Plot the distribution of point spreads (using matplotlib or seaborn).
- Plot the distribution of game totals.
- Compute and display a correlation matrix of key numeric features.
- Identify outlier games (e.g., total points > 3 standard deviations from mean).
- Analyze home-field advantage: average margin of victory for home teams.
- Output a text summary of all findings.
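A sketch of the outlier and home-field steps, assuming game-level rows with `home_score` and `away_score`; the toy frame exists only so the snippet runs:

```python
import pandas as pd

df = pd.DataFrame({  # stand-in game-level data
    "home_score": [27, 31, 17, 56, 20],
    "away_score": [24, 9, 16, 51, 23],
})

df["total_points"] = df["home_score"] + df["away_score"]
mu, sigma = df["total_points"].mean(), df["total_points"].std()
outliers = df[(df["total_points"] - mu).abs() > 3 * sigma]  # 3-sigma rule

df["home_margin"] = df["home_score"] - df["away_score"]
print(f"Average home margin: {df['home_margin'].mean():+.2f}")
print(f"Missing values:\n{df.isna().sum()}")
print(f"Outlier games:\n{outliers}")
```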
Exercise C-4: SQLite Betting Database
Design and implement a SQLite database for storing multi-sport betting data. Requirements (a schema sketch appears after the list):
- Tables: `sports`, `teams`, `seasons`, `games`, `odds`, `bet_tracking`.
- Enforce referential integrity with foreign keys.
- Create indexes on frequently queried columns (`date`, `team`, `game_id`).
- Write Python functions to insert data, query results, and export to a DataFrame.
- Include a function that computes a bettor's ROI from the `bet_tracking` table.
- Seed the database with at least 20 sample rows across all tables.
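A pared-down schema sketch (three of the six tables) showing foreign keys, an index, and the ROI query, where ROI is taken to be total profit divided by total amount staked:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE teams (
    team_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE games (
    game_id    INTEGER PRIMARY KEY,
    date       TEXT NOT NULL,
    home_team  INTEGER NOT NULL REFERENCES teams(team_id),
    away_team  INTEGER NOT NULL REFERENCES teams(team_id),
    home_score INTEGER,
    away_score INTEGER
);
CREATE INDEX idx_games_date ON games(date);
CREATE TABLE bet_tracking (
    bet_id  INTEGER PRIMARY KEY,
    game_id INTEGER NOT NULL REFERENCES games(game_id),
    stake   REAL NOT NULL,
    profit  REAL NOT NULL  -- negative for a losing bet
);
""")

def roi(conn: sqlite3.Connection) -> float:
    """ROI = total profit / total amount staked."""
    profit, staked = conn.execute(
        "SELECT COALESCE(SUM(profit), 0), COALESCE(SUM(stake), 0) FROM bet_tracking"
    ).fetchone()
    return profit / staked if staked else 0.0
```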
Exercise C-5: API Wrapper Library
Write a Python class `SportsDataAPI` that wraps a free sports data API (e.g., the-odds-api.com or a similar service). The class should do the following (a partial skeleton appears after the list):
- Accept an API key via the constructor or an environment variable.
- Implement methods: `get_upcoming_odds(sport)`, `get_historical_odds(sport, date)`, `get_scores(sport, date)`.
- Handle rate limiting with exponential backoff.
- Cache responses locally (to a JSON file or SQLite) to avoid redundant API calls.
- Return results as pandas DataFrames.
- Include a `__repr__` method that shows the API base URL and rate-limit status.
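A partial skeleton of the wrapper; the base URL, endpoint path, and query-parameter names are placeholders to replace with the provider's documented values:

```python
import os
import time

import pandas as pd
import requests

class SportsDataAPI:
    """Sketch of a wrapper around a hypothetical odds API."""

    BASE_URL = "https://api.example-odds.com/v4"  # placeholder base URL

    def __init__(self, api_key: str | None = None):
        # Fall back to an environment variable when no key is passed in.
        self.api_key = api_key or os.environ["ODDS_API_KEY"]

    def _get(self, path: str, params: dict, retries: int = 4) -> list[dict]:
        params = {**params, "apiKey": self.api_key}
        for attempt in range(retries):
            resp = requests.get(f"{self.BASE_URL}/{path}", params=params, timeout=30)
            if resp.status_code == 429:       # rate limited
                time.sleep(2 ** attempt)      # exponential backoff: 1, 2, 4, 8 s
                continue
            resp.raise_for_status()
            return resp.json()
        raise RuntimeError("rate limit not cleared after retries")

    def get_upcoming_odds(self, sport: str) -> pd.DataFrame:
        # Endpoint path is hypothetical; shape the JSON into a DataFrame.
        return pd.DataFrame(self._get(f"sports/{sport}/odds", {"regions": "us"}))

    def __repr__(self) -> str:
        return f"SportsDataAPI(base_url={self.BASE_URL!r})"
```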
Part D: Analysis Exercises
Exercise D-1: Explore an NFL Dataset
Download (or generate) a dataset of NFL game results for the 2023 season. Perform the following analyses and write up your findings in 300-500 words:
- Distribution of home vs. away wins.
- Average margin of victory by week.
- Identify the three most "clutch" teams (teams that covered the spread most often in games decided by 7 or fewer points).
- Correlation between a team's turnover differential and its ATS record.
- Are primetime games (SNF, MNF, TNF) more or less likely to go over the total than early Sunday games?
Exercise D-2: Data Quality Audit
Obtain two different sources of NBA box-score data for the same season (e.g., Basketball Reference and ESPN). Merge them on a common key and perform a data-quality audit:
- How many games appear in one source but not the other?
- For games in both sources, do the scores agree? Track discrepancies.
- Are player names spelled consistently across sources?
- Do the dates match, or are there timezone-related off-by-one errors?
- Write a summary report quantifying the differences and recommending which source to trust.
Exercise D-3: Feature Engineering for Betting
Starting with a raw NBA game-results dataset, engineer at least 10 features that could be useful for predicting game outcomes or betting results. For each feature, provide:
- The feature name and a one-sentence description.
- The pandas code to compute it.
- A brief justification for why it might be predictive.
Examples: rolling point differential, rest days, travel distance, win streak, back-to-back flag, opponent strength (rolling opponent win percentage), pace, etc.
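A sketch of two of these features, using generated stand-in data. The one-game `shift` is the crucial detail: each feature must be computed from games played *before* the one being predicted, or the model leaks the outcome:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # reproducible stand-in data
df = pd.DataFrame({
    "team": ["LAL"] * 12 + ["BOS"] * 12,
    "date": list(pd.date_range("2024-01-01", periods=12)) * 2,
    "margin": rng.integers(-15, 16, 24),
    "win": rng.integers(0, 2, 24),
})
df = df.sort_values(["team", "date"])
g = df.groupby("team")

# Feature 1: rolling 10-game point differential entering the game
df["roll_margin_10"] = g["margin"].transform(
    lambda s: s.rolling(10, min_periods=3).mean().shift(1)
)

# Feature 2: win streak entering the game
def streak(wins: pd.Series) -> pd.Series:
    runs = (wins != wins.shift()).cumsum()       # new id at every W/L flip
    length = wins.groupby(runs).cumcount() + 1   # position within the run
    return (length * wins).shift(1)              # zero out losing runs

df["win_streak"] = g["win"].transform(streak)
```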
Exercise D-4: Identifying Systematic Patterns
Using at least three full NFL seasons of data, test the following common betting narratives and report whether the data supports them:
- Home underdogs cover at a higher rate than road underdogs.
- Teams coming off a bye week perform better ATS than their opponents.
- Division rivalry games tend to go under the total more than non-division games.
- Heavy favorites (favored by more than 10 points) cover at a rate below the vig-free break-even rate of 50%.
- For each claim, compute the relevant cover rate and a 95% confidence interval, and state whether the result is statistically significant (a sketch of the cover-rate calculation follows).
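A sketch of the cover-rate test under a normal approximation; the counts below are placeholders, not real results. Recall that at standard -110 pricing the break-even cover rate is about 52.4% (110/210), while the vig-free baseline is 50%:

```python
import numpy as np

def cover_rate_ci(covers: int, games: int, z: float = 1.96):
    """Cover rate with a normal-approximation 95% CI:
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p = covers / games
    half = z * np.sqrt(p * (1 - p) / games)
    return p, (p - half, p + half)

# Placeholder counts -- substitute the tallies from your own dataset.
p, (lo, hi) = cover_rate_ci(covers=310, games=570)
print(f"cover rate {p:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
# The claim is "significant" at the 5% level if the relevant baseline
# (0.5 vig-free, ~0.524 at -110) falls outside the interval.
```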
Exercise D-5: Outlier Investigation
Select any sport and identify the 10 most extreme outlier games in your dataset (by total points, margin of victory, or deviation from the spread). For each game:
- Record the date, teams, score, spread, and total.
- Research what happened (injuries, weather, garbage time, etc.).
- Determine whether the game should be included in future modeling or excluded as an anomaly.
- Write a brief recommendation on outlier-handling policy for your dataset.
Part E: Research Exercises
Exercise E-1: Data Source Comparison Matrix
Research and compare at least five sports data sources (free and paid) across the following dimensions:
| Source | Sports Covered | Data Types | Update Frequency | API Available | Cost | Limitations |
|---|---|---|---|---|---|---|
Write a 400-word summary recommending which sources you would use for: (a) a casual bettor tracking personal bets, (b) a serious bettor building models, (c) a professional syndicate requiring real-time data.
Exercise E-2: API Evaluation
Evaluate three sports odds APIs (e.g., The Odds API, Sportradar, ESPN's public endpoints). For each API:
- Document the available endpoints and data types.
- Test the free tier: What are the rate limits? How much historical data is available?
- Assess data quality: Are the odds accurate? How quickly are they updated?
- Estimate the monthly cost for a bettor who needs to pull odds for 4 sports, 3 times per day.
- Rank the APIs by value (quality per dollar) and justify your ranking.
Exercise E-3: Database Design Review
Research three different approaches to storing sports betting data: (a) flat CSV files, (b) a relational database (SQLite/PostgreSQL), (c) a NoSQL document store (MongoDB). For each approach:
- Describe the data model and schema design.
- List three queries that are easy to perform and three that are difficult.
- Estimate storage requirements for 10 years of NFL, NBA, MLB, and NHL data.
- Assess scalability, backup/recovery, and sharing with collaborators.
- Recommend when each approach is most appropriate.
Exercise E-4: The State of Public Sports Data
Write a 600-word essay on the current state of publicly available sports data for betting research. Address:
- What data is freely available and what requires payment?
- How has data availability changed over the past decade?
- What gaps exist (e.g., real-time player tracking data, historical weather conditions)?
- How do data vendors protect their products (rate limits, licensing, legal action)?
- What is the outlook for open-source sports data projects?
Exercise E-5: Building a Data Collection Plan
You are starting a new sports betting research project focused on NBA player props. Design a complete data collection plan that addresses:
- What data do you need? (Player stats, team stats, odds, injury reports, lineups, etc.)
- Where will you get each data type? Name specific sources.
- How will you collect it? (API, scraping, manual entry, purchased dataset.)
- How often does each data type need to be updated?
- How will you store and organize the data?
- What data quality checks will you implement?
- What is your estimated cost (time and money) for the first season of data collection?
Present your plan as a structured document with a timeline and task list.