Chapter 5 Quiz: Data Literacy for Bettors

Instructions: Answer all 25 questions. Each question is worth 4 points (100 total). For multiple-choice questions, select the single best answer. For short-answer questions, respond in 1-3 sentences.


Section 1: Data Sources and Collection (Questions 1-5)

Question 1. Which of the following is a primary data source for official NBA game statistics?

  • (A) A sports betting forum post summarizing game results
  • (B) The NBA's official stats API at stats.nba.com
  • (C) A Kaggle dataset uploaded by another user
  • (D) An ESPN article recapping the game
Answer **(B)** The NBA's official stats API is a primary data source because it provides data directly from the authoritative originator. Options A, C, and D are all secondary sources that repackage or interpret data originally published elsewhere.

Question 2. A bettor is building a web scraper to collect historical odds from a sportsbook website. Which of the following is the most important ethical and practical consideration?

  • (A) Using the fastest possible request rate to finish quickly
  • (B) Checking the site's robots.txt and terms of service before scraping
  • (C) Downloading the entire site in case it goes offline
  • (D) Using multiple IP addresses to avoid detection
Answer **(B)** Checking `robots.txt` and the terms of service is the first and most important step. It tells you what the site owner permits, helps you avoid legal issues, and is the foundation of ethical data collection. Options A, C, and D describe aggressive or evasive tactics that violate responsible scraping principles.
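
For anyone scripting this check, here is a minimal sketch using Python's standard-library `urllib.robotparser`; the sportsbook URL, path, and user-agent string are placeholders, and the site's terms of service still need to be read separately.

```python
from urllib import robotparser

# Placeholder site and user agent; substitute the site you actually intend to scrape.
BASE_URL = "https://www.example-sportsbook.com"
USER_AGENT = "my-research-bot"

rp = robotparser.RobotFileParser()
rp.set_url(f"{BASE_URL}/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

target = f"{BASE_URL}/nfl/odds/history"
if rp.can_fetch(USER_AGENT, target):
    print("robots.txt permits fetching", target)
else:
    print("robots.txt disallows", target, "- do not scrape this path")
```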

Question 3. You are comparing The Odds API (free tier) and a premium Sportradar feed. The free API offers 500 requests/month with a 10-second delay on odds updates. Sportradar offers real-time odds with no request cap for $500/month. For which user profile does the free tier make the most sense?

  • (A) A professional syndicate placing bets within seconds of line movement
  • (B) A casual bettor who checks odds once per day and bets on weekends
  • (C) A quantitative researcher backtesting a live-betting model
  • (D) A sportsbook operator setting their own lines
Answer **(B)** A casual bettor checking odds once daily and betting on weekends has low request volume and no need for real-time data. The 500 requests/month and 10-second delay are perfectly adequate for this use case. The other profiles all require either real-time data or high request volumes.

Question 4. (Short Answer) Explain the difference between structured and unstructured sports data. Give one example of each that a bettor might use.

Answer **Structured data** is organized in a predefined format with rows and columns, such as a CSV file of game scores with columns for date, teams, and points. **Unstructured data** lacks a fixed schema and requires processing to extract information, such as the text of post-game coach interviews or social media posts about player injuries. A bettor might use structured box-score data for statistical modeling and unstructured injury report text for sentiment analysis.

Question 5. What does the term "data provenance" mean, and why is it critical when using a third-party sports dataset for betting research?

Answer **Data provenance** refers to the documented history of a dataset: where the data originated, how it was collected, what transformations were applied, and who maintains it. It is critical for betting research because models built on unreliable or incorrectly processed data will produce misleading results. Without provenance, you cannot assess accuracy, identify potential biases, or reproduce analyses if the data changes.

Section 2: Pandas Fundamentals (Questions 6-10)

Question 6. Given a pandas DataFrame `df` with 1,000 rows and a column `spread`, which expression correctly computes the number of games where the home team was favored by more than 7 points (`spread < -7`)?

  • (A) `df[df.spread < -7].shape[0]`
  • (B) `df.query('spread > 7').count()`
  • (C) `len(df[df['spread'] < -7])`
  • (D) Both (A) and (C)
Answer **(D)** Both (A) and (C) are correct. Expression (A) uses boolean indexing and `.shape[0]` to get the row count. Expression (C) uses `len()` on the filtered DataFrame. Both return the number of rows where `spread < -7`. Option (B) is wrong because it filters for `spread > 7` (positive, meaning the home team is an underdog) and `.count()` returns per-column counts, not a single integer.
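
A quick sketch, using a tiny made-up DataFrame, confirming that expressions (A) and (C) return the same integer:

```python
import pandas as pd

df = pd.DataFrame({"spread": [-10.5, -3.0, -7.5, 2.5, -9.0]})

count_a = df[df.spread < -7].shape[0]   # boolean mask, then row count via .shape
count_c = len(df[df["spread"] < -7])    # boolean mask, then len() of the filtered frame

assert count_a == count_c == 3          # -10.5, -7.5, and -9.0 are steeper than -7
print(count_a)
```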

Question 7. What is the difference between `df.groupby('team').mean()` and `df.groupby('team').transform('mean')` in pandas?

  • (A) There is no difference; both return the same DataFrame
  • (B) `.mean()` returns one row per group; `.transform('mean')` returns a Series the same length as the original DataFrame, with each row's value replaced by its group mean
  • (C) `.transform('mean')` is faster but less accurate
  • (D) `.mean()` works only on numeric columns; `.transform('mean')` works on all column types
Answer **(B)** The key difference is output shape. `.mean()` aggregates, producing one row per group. `.transform('mean')` broadcasts the group-level result back to every row in the original DataFrame, preserving the original index and row count. This is useful when you want to add a column for "team average" alongside each individual game row.
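
A small illustration of the shape difference, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "team":   ["BOS", "BOS", "LAL", "LAL", "LAL"],
    "points": [110, 120, 100, 104, 108],
})

# Aggregation: one row per team (BOS 115.0, LAL 104.0).
per_team = df.groupby("team")["points"].mean()
print(per_team)

# Transform: same length as df, with each team's mean broadcast back to every row.
df["team_avg_points"] = df.groupby("team")["points"].transform("mean")
print(df)
```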

Question 8. You have two DataFrames: `games` (5,000 rows) and `odds` (4,800 rows), both with a `game_id` column. You run `merged = games.merge(odds, on='game_id', how='left')`. How many rows will `merged` have, assuming `game_id` is unique in both DataFrames?

  • (A) 4,800
  • (B) 5,000
  • (C) 9,800
  • (D) It depends on how many game_id values match
Answer **(B)** A left merge keeps all rows from the left DataFrame (`games`, 5,000 rows). Since `game_id` is unique in both DataFrames, each row in `games` matches at most one row in `odds`. The 200 games without matching odds will have `NaN` in the odds columns. The result has exactly 5,000 rows.
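
A scaled-down sketch of the same situation (3 games and 2 odds rows instead of 5,000 and 4,800):

```python
import pandas as pd

games = pd.DataFrame({"game_id": [1, 2, 3], "home_team": ["NE", "KC", "DAL"]})
odds = pd.DataFrame({"game_id": [1, 2], "closing_spread": [-3.0, -6.5]})  # game 3 has no odds

merged = games.merge(odds, on="game_id", how="left")
print(len(merged))                             # 3: every row of `games` is kept
print(merged["closing_spread"].isna().sum())   # 1: the unmatched game gets NaN in the odds columns
```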

Question 9. (Short Answer) Explain why you should call `df.sort_values(['team', 'date'])` before computing `df.groupby('team')['points'].rolling(10).mean()`. What error or incorrect result would occur without sorting?

Answer Rolling window calculations operate on rows in their current order. If the data is not sorted by date within each team, the "10-game rolling average" would mix games from different points in the season, producing a meaningless result. For example, game 30 of the season might be averaged with game 5 if the rows are not chronologically ordered. Sorting by team and then date ensures each team's games are in temporal sequence so the rolling window captures the correct recent history.
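
A minimal sketch of the sorted workflow, using made-up rows and a 3-game window in place of the question's 10-game window:

```python
import pandas as pd

df = pd.DataFrame({
    "team":   ["BOS", "LAL", "BOS", "LAL", "BOS"],
    "date":   pd.to_datetime(["2024-01-05", "2024-01-04", "2024-01-02", "2024-01-08", "2024-01-09"]),
    "points": [110, 104, 98, 112, 120],
})

# Sort so each team's games are in chronological order before any rolling window.
df = df.sort_values(["team", "date"])

# groupby(...).rolling(...) returns a Series with a (team, original index) MultiIndex;
# droplevel(0) realigns it to df's index so it can be assigned back as a column.
df["points_rolling3"] = (
    df.groupby("team")["points"]
      .rolling(3, min_periods=1)
      .mean()
      .droplevel(0)
)
print(df)
```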

Question 10. Which pandas method is best suited for reshaping a DataFrame from "one row per game" (with `home_team`, `away_team`, `home_score`, `away_score`) into "one row per team-game" (with `team`, `opponent`, `points_for`, `points_against`, `location`)?

  • (A) `pd.pivot_table()`
  • (B) `pd.melt()` followed by additional transformations
  • (C) Manual concatenation of home and away subsets using `pd.concat()`
  • (D) `df.stack()`
Answer **(C)** The most straightforward approach is to create two subsets --- one for home teams and one for away teams --- renaming columns appropriately, and then concatenate them with `pd.concat()`. For each game, the home subset gets `location='home'` and the away subset gets `location='away'`. While `pd.melt()` (B) could partially help, the column renaming and restructuring required make manual concatenation clearer and more explicit for this specific transformation.
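
A sketch of the concatenation approach on a two-game toy DataFrame:

```python
import pandas as pd

games = pd.DataFrame({
    "home_team":  ["BOS", "DEN"],
    "away_team":  ["LAL", "MIA"],
    "home_score": [115, 108],
    "away_score": [110, 102],
})

# Home perspective: home columns become the "team" columns.
home = games.rename(columns={
    "home_team": "team", "away_team": "opponent",
    "home_score": "points_for", "away_score": "points_against",
}).assign(location="home")

# Away perspective: the same game seen from the visiting team's side.
away = games.rename(columns={
    "away_team": "team", "home_team": "opponent",
    "away_score": "points_for", "home_score": "points_against",
}).assign(location="away")

# Two rows per game, one from each team's perspective.
team_games = pd.concat([home, away], ignore_index=True)
print(team_games)
```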

Section 3: Data Cleaning (Questions 11-15)

Question 11. A dataset contains team names in multiple formats: "LA Lakers", "Los Angeles Lakers", "LAL", and "L.A. Lakers". What is this data quality problem called, and what is the standard solution?

  • (A) Missing data; impute the correct name
  • (B) Inconsistency; create a canonical mapping dictionary and apply it to the column
  • (C) Invalidity; drop rows with non-standard names
  • (D) Duplication; use `df.drop_duplicates()`
Answer **(B)** This is a data consistency (or standardization) problem. The solution is to create a mapping dictionary like `{"LA Lakers": "Los Angeles Lakers", "LAL": "Los Angeles Lakers", "L.A. Lakers": "Los Angeles Lakers"}` and apply it using `df['team'].map(mapping)` or `df['team'].replace(mapping)`. Dropping rows (C) would lose valid data. This is not a missing-data or duplication problem.
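
A minimal sketch of the mapping approach:

```python
import pandas as pd

df = pd.DataFrame({"team": ["LA Lakers", "Los Angeles Lakers", "LAL", "L.A. Lakers", "Boston Celtics"]})

canonical = {
    "LA Lakers": "Los Angeles Lakers",
    "LAL": "Los Angeles Lakers",
    "L.A. Lakers": "Los Angeles Lakers",
}

# .replace() leaves values that are not in the mapping untouched, which suits an alias
# table like this; .map() would turn unmapped values into NaN unless the dict is complete.
df["team"] = df["team"].replace(canonical)
print(df["team"].unique())  # ['Los Angeles Lakers' 'Boston Celtics']
```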

Question 12. You discover that 15% of the `closing_line` values in your dataset are NaN. Which of the following approaches is most appropriate?

  • (A) Drop all rows with missing `closing_line` values
  • (B) Fill missing values with the column mean
  • (C) Investigate why the values are missing before choosing a strategy
  • (D) Fill missing values with 0
Answer **(C)** Before choosing any imputation or deletion strategy, you must understand the missing-data mechanism. If closing lines are missing because the data source does not cover older seasons (a systematic gap, not random noise), filling with the column mean would introduce bias. If they are missing at random due to scraping errors, a different strategy applies. Filling with 0 (D) would be especially harmful: a spread of 0 is itself a meaningful line (a pick'em game), so imputing it silently fabricates plausible-looking data. Always investigate first.
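
One way to begin that investigation, assuming the DataFrame also carries `season` and `source` columns (hypothetical names), is to see where the gaps cluster:

```python
import pandas as pd

# Hypothetical layout: season, source, closing_line.
df = pd.DataFrame({
    "season":       [2018, 2018, 2023, 2023, 2023],
    "source":       ["archive", "archive", "api", "api", "api"],
    "closing_line": [None, None, -3.5, -7.0, 2.5],
})

# Share of missing closing lines by season and by source: a 100% gap in one slice
# points to systematic missingness rather than random scraping failures.
print(df["closing_line"].isna().groupby(df["season"]).mean())
print(df["closing_line"].isna().groupby(df["source"]).mean())
```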

Question 13. A dataset of MLB games contains a row where the date is 2024-02-30. What type of data quality issue is this, and how should you handle it?

  • (A) A missing-value issue; replace with `NaT`
  • (B) A validity issue; February 30 does not exist, so flag the row for investigation
  • (C) A consistency issue; reformat to 02/30/2024
  • (D) A timeliness issue; the data is too old
Answer **(B)** February 30 is an impossible date, making this a validity issue. The value fails a basic domain constraint (valid calendar dates). The correct response is to flag the row, investigate the source of the error (perhaps a data-entry mistake or parsing error), and attempt to recover the correct date from another source. Simply replacing with `NaT` loses information unnecessarily.
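
A sketch of how such dates can be flagged automatically with pandas:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-02-28", "2024-02-30", "2024-03-01"]})

# errors="coerce" turns impossible dates (like Feb 30) into NaT instead of raising,
# which lets you isolate the offending rows for manual investigation.
parsed = pd.to_datetime(df["date"], errors="coerce")
invalid_rows = df[parsed.isna()]
print(invalid_rows)  # the 2024-02-30 row, flagged for follow-up rather than silently dropped
```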

Question 14. (Short Answer) Explain the difference between `df.dropna()` and `df.fillna(method='ffill')`. When would forward-fill be appropriate for a time-series sports dataset, and when would it be dangerous?

Answer `df.dropna()` removes any row containing at least one `NaN` value, reducing the dataset size. `df.fillna(method='ffill')` (equivalently `df.ffill()`, the preferred spelling in recent pandas versions, where the `method=` keyword is deprecated) replaces each `NaN` with the most recent non-null value above it in the column. Forward-fill is appropriate for slowly changing values in time-ordered data, such as a team's Elo rating or a player's injury status that persists until updated. It is dangerous when applied to values that change independently each game (like scores or odds), because it fabricates data that looks real but is actually copied from a previous game, potentially corrupting analyses without any visible warning.
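
A tiny illustration of the difference, using a made-up Elo series:

```python
import pandas as pd

elo = pd.Series([1550.0, None, None, 1562.0], name="team_elo")

print(elo.dropna())  # two values remain; the NaN rows disappear entirely
print(elo.ffill())   # the NaNs become 1550.0: the last known rating carries forward

# Applying the same forward-fill to per-game scores or closing odds would silently copy
# a previous game's value into a new game, which is the dangerous case described above.
```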

Question 15. You are merging NFL game data from two sources using team abbreviations as the join key. Source A uses "JAX" for Jacksonville; Source B uses "JAC". After merging, Jacksonville games show NaN in all columns from Source B. What is the term for this problem, and how would you prevent it in a pipeline?

Answer This is a **key mismatch** (or join-key inconsistency) problem. It occurs when the same real-world entity is represented differently across datasets. To prevent it, implement a team-name standardization step *before* merging: create a canonical abbreviation mapping for each source and apply it so both sides use the same key values. In a pipeline, this would be a dedicated `standardize_team_names()` function called on each DataFrame before any join operation.
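
A sketch of what that standardization step might look like; the alias dictionaries and function name are illustrative, not a prescribed API:

```python
import pandas as pd

# Hypothetical per-source alias tables; extend with every abbreviation each source uses.
SOURCE_A_ALIASES = {"JAX": "JAX"}
SOURCE_B_ALIASES = {"JAC": "JAX"}

def standardize_team_names(df: pd.DataFrame, aliases: dict, col: str = "team") -> pd.DataFrame:
    """Map one source's abbreviations onto the canonical set before any merge."""
    out = df.copy()
    out[col] = out[col].replace(aliases)
    return out

games_a = pd.DataFrame({"team": ["JAX"], "score": [24]})
games_b = pd.DataFrame({"team": ["JAC"], "spread": [-2.5]})

merged = standardize_team_names(games_a, SOURCE_A_ALIASES).merge(
    standardize_team_names(games_b, SOURCE_B_ALIASES), on="team", how="left"
)
print(merged)  # Jacksonville now joins cleanly instead of producing NaN
```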

Section 4: Exploratory Data Analysis (Questions 16-20)

Question 16. During EDA on an NBA dataset, you compute that the mean total points per game is 215.4, but the median is 213.0. What does this tell you about the distribution of game totals?

  • (A) The distribution is perfectly symmetric
  • (B) The distribution is left-skewed (long tail on the low end)
  • (C) The distribution is right-skewed (long tail on the high end)
  • (D) The mean and median are too close to draw any conclusion
Answer **(C)** When the mean exceeds the median, the distribution is right-skewed, meaning there is a longer tail on the high end. This indicates that a few very high-scoring games pull the mean above the median. In NBA data, this makes sense: occasional overtime games or defensive collapses produce extremely high totals, creating the right skew.

Question 17. You plot a histogram of NFL point spreads and observe that the values -3, -7, +3, and +7 have significantly higher frequencies than adjacent values. Why do these "key numbers" appear in the distribution, and what is their significance for bettors?

Answer These key numbers correspond to the most common margins of victory in the NFL: 3 points (a field goal) and 7 points (a touchdown with extra point). Because many games are decided by exactly 3 or 7 points, spreads set at these values produce a disproportionate number of pushes, and games landing on these margins determine many ATS results. For bettors, the significance is that buying or selling a half-point through -3 or -7 (e.g., moving from -3 to -2.5) has outsized value compared to the same adjustment at non-key numbers (e.g., -4 to -3.5). This is why sportsbooks charge extra juice for half-point moves across key numbers.

Question 18. (Short Answer) You compute a correlation matrix for an NBA dataset and find that pace (possessions per 48 minutes) has a correlation of +0.72 with total_points. Does this mean that pace causes higher-scoring games? Explain your reasoning.

Answer No, correlation does not imply causation. While the +0.72 correlation indicates a strong positive linear relationship between pace and total points, it does not establish a causal mechanism. There are plausible confounding variables: for example, teams that play fast may also have better offensive talent (which independently increases scoring), or the causality may run in both directions (high-scoring games may inflate pace statistics). To investigate causality, you would need controlled experiments or causal inference methods (e.g., instrumental variables, natural experiments), not just observational correlation.

Question 19. During EDA, you identify an NBA game in which the total points scored was 312, more than 4 standard deviations above the mean. What should your first step be?

  • (A) Delete the row immediately --- it is clearly an error
  • (B) Cap the value at the mean plus 3 standard deviations
  • (C) Investigate whether the value is legitimate (e.g., a quadruple-overtime game) before deciding
  • (D) Replace it with the median total
Answer **(C)** Extreme values should be investigated, not automatically removed. A total of 312 is unusual but not impossible --- a quadruple-overtime game could produce such a total. Your first step is to look up the actual game to determine whether the value is a legitimate outlier (keep it, possibly with a flag) or a data error (correct or remove it). Blindly deleting or capping outliers destroys real information.
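
A sketch of how such candidates might be flagged for review, using synthetic totals rather than real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
totals = pd.Series(rng.normal(215, 20, size=500).round())  # synthetic season of game totals
totals.loc[len(totals)] = 312                              # the suspicious value

# Flag anything more than 3 standard deviations from the mean for manual review.
z = (totals - totals.mean()) / totals.std()
print(totals[z.abs() > 3])  # look these games up by hand before keeping, fixing, or dropping
```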

Question 20. You are conducting EDA on three seasons of NFL data to look for potential betting angles. You test 50 different hypotheses (home dogs on Monday night, teams after a loss, etc.) and find that 3 of them show "statistically significant" results at the p < 0.05 level. Should you be excited? Explain.

Answer No. This is a classic **multiple comparisons problem**. When testing 50 independent hypotheses at the p < 0.05 significance level, you would expect approximately 2-3 "significant" results purely by chance (50 x 0.05 = 2.5). Finding 3 significant results is consistent with random noise. To draw meaningful conclusions, you should apply a multiple-comparisons correction (such as Bonferroni, which would require p < 0.001 per test) or, better yet, use a holdout dataset to validate any pattern discovered during exploratory analysis.
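
The arithmetic behind this, assuming SciPy is available for the exact binomial calculation:

```python
from scipy.stats import binom

n_tests, alpha = 50, 0.05

expected_false_positives = n_tests * alpha              # 2.5 "significant" angles by chance alone
prob_three_or_more = 1 - binom.cdf(2, n_tests, alpha)   # ~0.46: three or more hits from pure noise is common
bonferroni_threshold = alpha / n_tests                  # 0.001 per-test cutoff after correction

print(expected_false_positives, round(prob_three_or_more, 2), bonferroni_threshold)
```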

Section 5: Database Design (Questions 21-23)

Question 21. You are designing a relational database for storing NFL betting data. Which of the following table structures best implements normalization?

  • (A) A single table with columns: game_id, date, home_team_name, home_team_city, away_team_name, away_team_city, score, spread, total, bet_amount, bet_result
  • (B) Separate tables for teams, games, odds, and bets, linked by foreign keys
  • (C) A single JSON document per game containing all related data
  • (D) Separate CSV files for each week of the season
Answer **(B)** Normalization requires separating data into distinct tables to eliminate redundancy and ensure referential integrity. Storing team information in a `teams` table (with `team_id`, `name`, `city`) means you update a team's city once, not in every game row. Foreign keys enforce valid relationships between tables. Option A has extensive redundancy (team city repeated every row). Options C and D are not relational database approaches.
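
One possible normalized layout, sketched with Python's built-in `sqlite3` module; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the foreign keys below

conn.executescript("""
CREATE TABLE teams (
    team_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    city    TEXT NOT NULL
);
CREATE TABLE games (
    game_id      INTEGER PRIMARY KEY,
    game_date    TEXT NOT NULL,
    home_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    away_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    home_score   INTEGER,
    away_score   INTEGER
);
CREATE TABLE odds (
    odds_id INTEGER PRIMARY KEY,
    game_id INTEGER NOT NULL REFERENCES games(game_id),
    spread  REAL,
    total   REAL
);
CREATE TABLE bets (
    bet_id  INTEGER PRIMARY KEY,
    game_id INTEGER NOT NULL REFERENCES games(game_id),
    amount  REAL NOT NULL,
    result  TEXT
);
""")
```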

Question 22. (Short Answer) Explain what a database index is and why you would create one on the `date` column of a `games` table that contains 50,000 rows.

Answer A **database index** is a data structure (typically a B-tree) that allows the database engine to locate rows matching a query condition without scanning every row in the table. Creating an index on the `date` column of a 50,000-row `games` table dramatically speeds up queries that filter by date (e.g., `SELECT * FROM games WHERE date = '2024-01-15'` or `WHERE date BETWEEN '2024-01-01' AND '2024-01-31'`). Without an index, the database performs a full table scan of all 50,000 rows. With an index, it can jump directly to the relevant rows, often reducing query time from hundreds of milliseconds to under one millisecond. The trade-off is slightly slower insert operations and additional disk space.
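
A sketch of creating and checking such an index with Python's built-in `sqlite3` module, using a minimal stand-in `games` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id INTEGER PRIMARY KEY, date TEXT, home TEXT, away TEXT)")

# Create the index once; SQLite maintains it automatically as rows are inserted or updated.
conn.execute("CREATE INDEX IF NOT EXISTS idx_games_date ON games(date)")

# EXPLAIN QUERY PLAN reports "SEARCH ... USING INDEX idx_games_date" when the index
# is used, versus "SCAN games" for a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM games WHERE date BETWEEN ? AND ?",
    ("2024-01-01", "2024-01-31"),
).fetchall()
print(plan)
```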

Question 23. Which SQL query correctly computes each NFL team's ATS record (covers, non-covers, and pushes) for the 2023 season from a `games` table with columns `season`, `team`, `score_diff` (positive = win), and `spread`?

  • (A)

    ```sql
    SELECT team,
           SUM(CASE WHEN score_diff + spread > 0 THEN 1 ELSE 0 END) AS covers,
           SUM(CASE WHEN score_diff + spread < 0 THEN 1 ELSE 0 END) AS non_covers,
           SUM(CASE WHEN score_diff + spread = 0 THEN 1 ELSE 0 END) AS pushes
    FROM games
    WHERE season = 2023
    GROUP BY team;
    ```

  • (B)

    ```sql
    SELECT team, COUNT(*) AS covers
    FROM games
    WHERE score_diff > spread AND season = 2023
    GROUP BY team;
    ```

  • (C)

    ```sql
    SELECT team, AVG(spread) AS ats_record
    FROM games
    WHERE season = 2023
    GROUP BY team;
    ```
Answer **(A)** This query correctly computes all three ATS outcomes. The logic `score_diff + spread` determines the ATS result: if a team won by 10 (`score_diff = 10`) and was favored by 7 (`spread = -7`), then `10 + (-7) = 3 > 0`, so they covered. The `CASE` expressions correctly categorize each game as a cover, non-cover, or push, and `GROUP BY team` aggregates across all games. Option B uses the wrong comparison (`score_diff > spread` rather than `score_diff + spread > 0`) and, even with the condition corrected, would count only covers while ignoring non-covers and pushes. Option C computes the average spread, which is unrelated to the ATS record.

Section 6: Applied Data Literacy (Questions 24-25)

Question 24. A fellow bettor shows you a chart titled "Teams on 3+ Game Losing Streaks Cover 58% of the Time." The chart is based on 5 NFL seasons of data. List at least three questions you should ask before accepting this claim as a viable betting angle.

Answer Critical questions include:

  1. **Sample size:** How many games in the dataset involve teams on 3+ game losing streaks? If the sample is small (e.g., 50-80 games), a 58% rate may not be statistically significant.
  2. **Statistical significance:** What is the p-value or confidence interval? A 58% rate with a wide confidence interval (e.g., 48%-68%) is not meaningfully different from 50%.
  3. **Vig adjustment:** A 58% cover rate sounds profitable, but at standard -110 juice, you need approximately 52.4% to break even. After accounting for vig, what is the actual ROI? (A rough numerical check of points 1-3 is sketched after this list.)
  4. **Time period stability:** Does the 58% hold across each individual season, or is it driven by one anomalous season?
  5. **Definition consistency:** How exactly is "3+ game losing streak" defined? Straight-up losses or ATS losses? Does the count reset after a win?
  6. **Data snooping:** How many other "streaks" or "situations" were tested before this one was found significant? (Multiple comparisons problem.)
  7. **Market awareness:** Is this a well-known angle that the market has already priced in?
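
A rough back-of-the-envelope check of points 1-3, assuming SciPy is available and a hypothetical sample of 80 qualifying games:

```python
from scipy.stats import binomtest

breakeven = 110 / 210           # ~52.4% win rate needed at standard -110 pricing
n_games, covers = 80, 46        # hypothetical sample: 46/80 = 57.5% ATS

result = binomtest(covers, n_games, p=0.5, alternative="greater")
print(round(breakeven, 4), round(result.pvalue, 3))
# On 80 games, a ~58% cover rate is usually not statistically distinguishable from a coin flip.
```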

Question 25. You have built a complete data pipeline: scraping scores and odds, cleaning and standardizing the data, storing it in a SQLite database, and performing EDA. Describe, in order, the five most important automated data-quality checks you would run every time new data is ingested.

Answer The five most important automated checks, in order:

  1. **Schema validation:** Verify that all expected columns are present, data types are correct (dates are dates, scores are integers, odds are floats), and no unexpected columns have appeared. This catches upstream source changes immediately.
  2. **Completeness check:** Count the number of games ingested and compare against the expected number (e.g., an NFL Sunday should have 13-16 games). Flag if the count is significantly lower or higher than expected.
  3. **Duplicate detection:** Check for duplicate games using a composite key (date + home team + away team). Duplicates indicate scraping errors or overlapping data pulls.
  4. **Domain constraint validation:** Verify that all values fall within valid ranges --- scores are non-negative, spreads are within a reasonable range (e.g., -30 to +30 for NFL), dates fall within the current season, and team names match the canonical team list.
  5. **Referential integrity check:** Confirm that every team name, game ID, and foreign key in the new data matches existing records in the database. Flag any orphaned records or unknown team names that would indicate a data-quality issue upstream.

These checks should run automatically, log all results, and halt the pipeline with an alert if any critical check fails.
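
A minimal sketch of how these checks might be wired together, assuming a pandas DataFrame of newly ingested games; the column names, canonical team list, and function name are illustrative:

```python
import pandas as pd

EXPECTED_COLUMNS = {"date", "home_team", "away_team", "home_score", "away_score", "spread"}
CANONICAL_TEAMS = {"JAX", "KC", "NE", "DAL"}  # truncated to keep the sketch short

def validate_ingest(new_games: pd.DataFrame, expected_game_count: int) -> list[str]:
    """Return a list of failed-check messages; an empty list means the batch passed."""
    problems = []

    # 1. Schema validation (bail out early: the later checks assume these columns exist).
    missing = EXPECTED_COLUMNS - set(new_games.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    # 2. Completeness: compare the row count against the expected slate size.
    if len(new_games) != expected_game_count:
        problems.append(f"expected {expected_game_count} games, got {len(new_games)}")

    # 3. Duplicate detection on the composite key.
    dupes = new_games.duplicated(subset=["date", "home_team", "away_team"]).sum()
    if dupes:
        problems.append(f"{dupes} duplicate game rows")

    # 4. Domain constraints: non-negative scores, spreads inside a plausible NFL range.
    if (new_games[["home_score", "away_score"]] < 0).any().any():
        problems.append("negative scores present")
    if not new_games["spread"].between(-30, 30).all():
        problems.append("spread missing or outside the plausible NFL range")

    # 5. Referential integrity against the canonical team list.
    unknown = (set(new_games["home_team"]) | set(new_games["away_team"])) - CANONICAL_TEAMS
    if unknown:
        problems.append(f"unknown team names: {sorted(unknown)}")

    return problems
```

In the pipeline itself, a non-empty return value would be logged and would halt ingestion with an alert, as described above.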