Chapter 32 Quiz: Natural Language Processing for Betting

Test your understanding of NLP techniques applied to sports betting. Total: 100 points.


Section 1: Multiple Choice (10 questions, 3 points each = 30 points)

Question 1. Which text source is considered the most valuable for sports betting NLP systems?

a. ESPN news articles
b. Beat reporter social media posts
c. Fan commentary on Reddit
d. Official team press releases

Answer **b. Beat reporter social media posts.** Beat reporters break news about injuries, lineup changes, coaching decisions, and locker room dynamics before this information is widely known or priced into the market. Their posts have the highest signal-to-noise ratio of any text source.

Question 2. What is the primary advantage of VADER sentiment analysis over transformer-based models for production sports betting systems?

a. Higher accuracy on nuanced text
b. Better handling of sarcasm
c. Speed and zero training requirement
d. Ability to handle multiple languages

Answer **c. Speed and zero training requirement.** VADER is rule-based, requires no GPU, processes text in microseconds, and works out of the box. In production, where thousands of texts must be processed quickly, VADER's speed advantage outweighs the transformer's higher accuracy for most use cases.
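
For illustration, a minimal sketch of VADER in practice (assuming the `vaderSentiment` package is installed; the lexicon entries shown are illustrative additions, not the chapter's full sports lexicon):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()  # rule-based: no GPU, no training

# Domain terms can be added to the lexicon without any retraining.
analyzer.lexicon.update({"doubtful": -2.0, "probable": 1.0})  # illustrative values

scores = analyzer.polarity_scores("Tatum is doubtful for tonight's game")
print(scores["compound"])  # negative compound score, computed near-instantly
```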

Question 3. In the chapter's sports-specific VADER lexicon, what sentiment score is assigned to "ruled out"?

a. -1.0
b. -2.0
c. -3.0
d. -1.5

Answer **c. -3.0.** "Ruled out" receives the most negative score among injury status terms because it definitively removes a player from an upcoming game, creating the maximum negative impact on the team's prospects.

Question 4. What does the InjuryReportParser use as its primary method for extracting player names from unstructured text?

a. Regular expressions
b. spaCy Named Entity Recognition (NER)
c. TF-IDF keyword matching
d. LLM API calls

Answer **b. spaCy Named Entity Recognition (NER).** The parser loads a spaCy language model (en_core_web_lg) and identifies PERSON entities in the text. It then looks for injury-related context in the surrounding sentence to determine status, body part, and injury type.
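
A minimal sketch of this NER step (assuming `en_core_web_lg` has been downloaded via `python -m spacy download en_core_web_lg`):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Jayson Tatum did not practice today due to right ankle soreness.")

# PERSON entities are candidate player names; the parser then scans the
# containing sentence for injury-related context keywords.
players = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(players)  # ['Jayson Tatum']
```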

Question 5. According to the chapter, what is the confidence score assigned to injury entries parsed from structured (bullet-point) reports versus freeform text?

a. 1.0 and 0.5
b. 0.95 and 0.70
c. 0.90 and 0.80
d. 0.85 and 0.60

Answer **b. 0.95 and 0.70.** Structured reports follow a predictable format with explicit status labels, warranting high confidence (0.95). Freeform text requires inference from context, with more room for parsing errors, hence lower confidence (0.70).

Question 6. Which event type has the highest base significance score in the EventDetector?

a. Injury update (0.7)
b. Trade (0.8)
c. Coaching change (0.9)
d. Player return (0.7)

Answer **c. Coaching change (0.9).** Coaching changes have the highest base significance because they affect team strategy, play-calling, and player usage across all positions, creating the broadest impact on game outcomes and betting lines.

Question 7. The chapter explicitly states that LLMs should NOT be used for which of the following tasks?

a. Information extraction from complex text
b. Direct probability estimation
c. Summarizing team news
d. Classifying news impact

Answer **b. Direct probability estimation.** LLMs are not calibrated probability estimators. A probability number from an LLM reflects what a human analyst might say, not a mathematically grounded estimate. The chapter emphasizes using LLMs as sophisticated parsing tools, not as oracles for win probabilities.

Question 8. In the NLPSignal composite formula, what weight is assigned to injury impact?

a. 0.15
b. 0.25
c. 0.35
d. 0.45

Answer **d. 0.45.** Injury impact receives 45% of the composite signal weight because injuries are the single largest driver of betting line movements and provide the most reliable, actionable NLP signal. The remaining weights are: sentiment (0.15), news (0.25), and qualitative (0.15).

Question 9. Why does the NLP composite signal normalize the injury component to the range [-1, 0] rather than [-1, +1]?

a. Because injuries are always bad for a team
b. Because positive injury news is captured by the sentiment component
c. Because the injury parser only detects negative events
d. Both a and b

Answer **d. Both a and b.** Injuries can only hurt a team (negative impact), never help, so the injury component is always non-positive. When a player returns from injury (a positive event), that positive information is captured by the sentiment component and the player return event type in the news component, not by the injury impact score itself.

Question 10. What is the gold standard method for evaluating whether NLP features genuinely improve a betting model?

a. In-sample cross-validation
b. Walk-forward backtesting
c. Leave-one-out cross-validation
d. Bootstrap resampling

Answer **b. Walk-forward backtesting.** Walk-forward testing trains the model on a rolling historical window and tests on the subsequent period, advancing through time. This prevents data leakage and simulates real production conditions where you can only use past data to predict future outcomes. It is the gold standard because it captures the temporal dynamics of NLP signals.
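
A minimal walk-forward sketch, assuming `games` is a chronologically sorted DataFrame with a binary `won` column; the model choice and window sizes are placeholders, not the chapter's exact implementation:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def walk_forward(games: pd.DataFrame, features: list, window: int = 500, step: int = 50):
    """Train on a rolling past window, test on the next period, advance in time."""
    briers = []
    for start in range(0, len(games) - window - step + 1, step):
        train = games.iloc[start : start + window]                  # past only
        test = games.iloc[start + window : start + window + step]   # future only
        model = LogisticRegression(max_iter=1000)
        model.fit(train[features], train["won"])
        probs = model.predict_proba(test[features])[:, 1]
        briers.append(np.mean((probs - test["won"]) ** 2))          # Brier score
    return briers
```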

Section 2: Short Answer (8 questions, 5 points each = 40 points)

Question 11. Explain the TeamSentimentAggregator's weighting scheme and why beat reporter sentiment receives 3x the weight of fan sentiment.

Answer The TeamSentimentAggregator assigns source-based weights to sentiment scores: beat reporters receive weight 3.0, official sources 2.0, and all others (fans, general media) 1.0. Additionally, engagement metrics (likes, retweets) provide a logarithmic boost via `1 + log1p(engagement) / 10`. Beat reporters receive 3x weight because they: (1) have direct access to teams, practices, and locker rooms; (2) break news before it is widely known; (3) provide factual reporting rather than emotional reactions; and (4) their sentiment reflects actual team conditions rather than narrative biases. Fan sentiment is noisy, emotionally driven, and often already priced into the market.
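
A sketch of this weighting scheme; the source weights and engagement boost match the values cited above, but the post field names are assumptions:

```python
import math

SOURCE_WEIGHTS = {"beat_reporter": 3.0, "official": 2.0}  # everyone else: 1.0

def weighted_team_sentiment(posts):
    """posts: iterable of dicts with 'sentiment', 'source', and 'engagement' keys."""
    num = den = 0.0
    for post in posts:
        weight = SOURCE_WEIGHTS.get(post["source"], 1.0)
        weight *= 1 + math.log1p(post["engagement"]) / 10  # logarithmic engagement boost
        num += weight * post["sentiment"]
        den += weight
    return num / den if den else 0.0
```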

Question 12. Describe how the InjuryReportParser handles the extraction of body part and laterality (left/right) from injury descriptions.

Answer The parser maintains a list of known body parts (knee, ankle, hamstring, shoulder, etc.) and searches the text for these terms. When a body part is found, the parser examines the 10 characters immediately preceding it for a laterality modifier ("left" or "right"). If one is found, the modifier is prepended to the body part string (e.g., "left knee", "right ankle"). This two-step approach handles both explicit descriptions ("Left Knee Soreness") and contextual mentions ("his left knee is bothering him"). The method `_extract_body_part()` returns the combined string, or None if no body part is identified.
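
A sketch of the laterality lookup, with an abbreviated body-part list:

```python
from typing import Optional

BODY_PARTS = ["knee", "ankle", "hamstring", "shoulder", "groin", "wrist"]

def _extract_body_part(text: str) -> Optional[str]:
    lowered = text.lower()
    for part in BODY_PARTS:
        idx = lowered.find(part)
        if idx == -1:
            continue
        window = lowered[max(0, idx - 10) : idx]  # 10 chars before the term
        for side in ("left", "right"):
            if side in window:
                return f"{side} {part}"
        return part
    return None

print(_extract_body_part("his left knee is bothering him"))  # "left knee"
```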

Question 13. What is the difference between the parse_structured_report() and parse_freeform_text() methods in the InjuryReportParser? When would you use each?

Answer `parse_structured_report()` uses a compiled regex pattern to match bullet-point formatted injury reports where each line follows the pattern: "- Player Name (Injury Description) - Status". It returns a `ParsedInjuryReport` with high confidence (0.95) entries. `parse_freeform_text()` uses spaCy NER to find PERSON entities in unstructured text (news articles, tweets), then checks the surrounding sentence for injury-related context keywords. It returns individual `InjuryEntry` objects with lower confidence (0.70). Use `parse_structured_report()` for official league injury reports and team announcements that follow standard formatting. Use `parse_freeform_text()` for beat reporter tweets, news articles, and any unstructured text where injury information is embedded in natural language.
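
For illustration, a regex in the spirit of the structured parser (the chapter's exact pattern may differ):

```python
import re

# Matches lines like "- Jayson Tatum (Right Ankle Soreness) - Questionable"
LINE_PATTERN = re.compile(
    r"^-\s*(?P<player>[^(]+?)\s*\((?P<injury>[^)]+)\)\s*-\s*(?P<status>\w+)"
)

report = "- Jayson Tatum (Right Ankle Soreness) - Questionable\n- Derrick White (Rest) - Out"
for line in report.splitlines():
    m = LINE_PATTERN.match(line)
    if m:
        print(m.group("player"), "|", m.group("injury"), "|", m.group("status"))
```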

Question 14. Explain how the EventDetector estimates expected line movement and what modifiers adjust the base estimate.

Answer The EventDetector starts with a base line movement per event type: injury_update = 2.0 points, trade = 1.5, coaching_change = 3.0, suspension = 1.5, lineup_change = 1.0, player_return = 2.0, weather = 0.5, general = 0.0. Three multiplicative modifiers adjust the base: (1) Star player cues ("star", "all-star", "mvp", "starter") multiply by 1.5x; (2) Severity cues ("season-ending", "torn", "surgery", "acl", "achilles") multiply by 2.0x; (3) Minimizing cues ("minor", "precautionary", "rest") multiply by 0.5x. These modifiers are cumulative, so a star player's season-ending ACL tear would be 2.0 * 1.5 * 2.0 = 6.0 points of expected movement. This is a rough heuristic; more accurate estimation requires historical data on similar events.
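
A sketch of the heuristic, using the base values and cue lists above:

```python
BASE_MOVEMENT = {
    "injury_update": 2.0, "trade": 1.5, "coaching_change": 3.0,
    "suspension": 1.5, "lineup_change": 1.0, "player_return": 2.0,
    "weather": 0.5, "general": 0.0,
}
STAR_CUES = ("star", "all-star", "mvp", "starter")
SEVERITY_CUES = ("season-ending", "torn", "surgery", "acl", "achilles")
MINIMIZING_CUES = ("minor", "precautionary", "rest")

def expected_line_movement(event_type: str, text: str) -> float:
    move = BASE_MOVEMENT.get(event_type, 0.0)
    lowered = text.lower()
    if any(cue in lowered for cue in STAR_CUES):
        move *= 1.5  # star player modifier
    if any(cue in lowered for cue in SEVERITY_CUES):
        move *= 2.0  # severity modifier
    if any(cue in lowered for cue in MINIMIZING_CUES):
        move *= 0.5  # minimizing modifier
    return move

# Star player, season-ending ACL tear: 2.0 * 1.5 * 2.0 = 6.0 points.
print(expected_line_movement("injury_update", "Star forward suffers season-ending torn ACL"))
```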

Question 15. The LLMSportsAnalyzer includes a caching mechanism. Explain why caching is important for LLM-based analysis in a betting pipeline and describe one scenario where caching could produce incorrect results.

Answer Caching is important because LLM API calls are expensive (cost per token) and slow (hundreds of milliseconds to seconds). In a betting pipeline that may process the same team's news multiple times (e.g., checking different game matchups involving the same team), caching avoids redundant API calls, reducing both cost and latency. A scenario where caching produces incorrect results: if an injury status update arrives after the initial LLM analysis was cached, the cache will return stale results based on the old text. For example, if the LLM analyzed "Player X is questionable" and cached the result, then 30 minutes later "Player X has been ruled out" arrives, the cache would still return the "questionable" analysis if the same prompt is reused. This is why the cache should use content-based keys (hashing the prompt text) and potentially implement time-based expiration.
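
A sketch of such a cache, with content-hash keys and a TTL; the class and parameter names are illustrative, not the chapter's:

```python
import hashlib
import time

class AnalysisCache:
    def __init__(self, ttl_seconds: float = 1800):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()  # content-based key

    def get(self, prompt: str):
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        result, stored_at = entry
        if time.time() - stored_at > self.ttl:  # time-based expiration
            return None
        return result

    def put(self, prompt: str, result) -> None:
        self._store[self._key(prompt)] = (result, time.time())
```

Content hashing guarantees that genuinely new text misses the cache; the TTL guards against reusing an identical prompt after conditions have changed.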

Question 16. Describe the two approaches for integrating NLP signals with existing statistical models: feature injection and model stacking. Which does the chapter recommend and why?

Answer Feature injection adds NLP-derived features directly into the main model's feature vector alongside statistical features. The model learns the optimal weight for NLP features during training. It is simple, requires no additional infrastructure, and keeps everything in a single model. Model stacking trains a separate NLP-only model and combines its predictions with the main model's predictions, typically through a meta-learner. This allows the NLP model to specialize and can capture nonlinear interactions between NLP and statistical signals. The chapter recommends feature injection because it is simpler and usually sufficient. NLP signals are typically additive corrections rather than fundamentally different predictors, so the main model can learn to weight them appropriately. Model stacking adds complexity (two models to maintain, potential overfitting of the meta-learner) without substantial gains for the typical NLP signal strength.
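
A minimal feature-injection sketch (synthetic data; the column counts are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
stat_features = rng.random((500, 12))  # e.g., ratings, pace, rest days
nlp_features = rng.random((500, 4))    # e.g., sentiment, injury, news, qualitative
y = rng.integers(0, 2, 500)            # placeholder game outcomes

X = np.hstack([stat_features, nlp_features])         # one feature matrix
model = LogisticRegression(max_iter=1000).fit(X, y)  # one model learns all weights
```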

Question 17. What is the NLPBacktester's evaluate_feature_value() method testing when it computes "residual correlation"? Why is this metric more informative than simple correlation between the NLP feature and outcomes?

Answer Residual correlation measures the correlation between the NLP feature and the base model's prediction errors (residuals = actual_outcome - predicted_probability). This tests whether the NLP feature contains information that the base model is missing. Simple correlation between the NLP feature and outcomes would be misleading because the NLP feature might correlate with outcomes only because it correlates with factors the base model already captures (e.g., team quality). In that case, adding the NLP feature would be redundant. Residual correlation specifically isolates the NLP feature's incremental information beyond what the base model already knows, making it a direct measure of whether the feature would actually improve predictions if added to the model.
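
A sketch of the computation; variable names are assumed:

```python
import numpy as np

def residual_correlation(outcomes, base_probs, nlp_feature):
    """Correlate the base model's errors with the candidate NLP feature."""
    residuals = np.asarray(outcomes, dtype=float) - np.asarray(base_probs)
    return np.corrcoef(residuals, np.asarray(nlp_feature, dtype=float))[0, 1]
```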

Question 18. The chapter states that "the most valuable sentiment signal comes from changes in sentiment, not absolute levels, because the market has already priced in persistent conditions." Explain this principle with a concrete example involving a team on a 5-game losing streak.

Answer Consider a team on a 5-game losing streak. Their absolute sentiment will be very negative. However, the betting market has already observed the losing streak and adjusted the team's odds accordingly. Betting on this team based solely on negative sentiment would not yield an edge because the market price already reflects the poor performance. The valuable signal is the change in sentiment. If the team's sentiment starts improving --- perhaps beat reporters note improved practice intensity, a key player returns to practice, or the coach implements a new strategy --- this changing sentiment may signal a turning point before the market fully adjusts. A team going from very negative to moderately negative sentiment might indicate upcoming improvement that the market, still anchored on recent results, has not fully priced in. This makes sentiment momentum (the delta) more predictive than the sentiment level itself.
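
A toy sketch of the distinction (the values are invented):

```python
import pandas as pd

daily = pd.Series(
    [-0.8, -0.75, -0.8, -0.6, -0.4],  # losing streak, then improving coverage
    index=pd.date_range("2025-01-01", periods=5),
)

level = daily.iloc[-1]                   # -0.4: still negative, already priced in
delta = daily.iloc[-1] - daily.iloc[-3]  # +0.4: the momentum the market may lag on
print(level, delta)
```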

Section 3: Applied Problems (4 questions, 7.5 points each = 30 points)

Question 19. You receive the following three texts about the Boston Celtics before their game tonight:

  • Text A (beat reporter, 2 hours ago): "Jayson Tatum did not practice today due to right ankle soreness. His status for tonight is uncertain."
  • Text B (ESPN article, 4 hours ago): "The Celtics have won 8 of their last 10 games and are playing at an elite level on both ends of the floor."
  • Text C (beat reporter, 30 minutes ago): "BREAKING: Jayson Tatum has been officially ruled out for tonight's game. Derrick White will start in his place."

Walk through the complete NLP pipeline processing of these three texts: (a) Sentiment analysis scores (estimate compound scores), (b) Injury report parsing (extract structured data), (c) Event detection and significance classification, (d) Composite NLP signal computation (use chapter weights), and (e) Whether this information should change a betting decision and why.

Answer **(a) Sentiment Analysis:**

- Text A: compound approximately -0.5 (injury and uncertainty terms dominate)
- Text B: compound approximately +0.7 (winning, elite, and other positive terms)
- Text C: compound approximately -0.8 ("ruled out" is -3.0, plus breaking-news context)

Weighted average (beat reporters at 3.0, ESPN at 2.0): (3.0 * -0.5 + 2.0 * 0.7 + 3.0 * -0.8) / (3.0 + 2.0 + 3.0) = (-1.5 + 1.4 - 2.4) / 8.0 = -2.5 / 8.0 = -0.3125

**(b) Injury Parsing:**

- Text A: {player: "Jayson Tatum", injury: "right ankle soreness", status: "questionable", body_part: "right ankle", injury_type: "soreness", confidence: 0.70}
- Text C: {player: "Jayson Tatum", injury: "ruled out", status: "out", confidence: 0.70}; updates the status from questionable to out.

**(c) Event Detection:**

- Text A: INJURY_UPDATE, significance 0.7, star-player modifier 1.5x = expected line move ~3.0 pts
- Text B: GENERAL_NEWS, significance 0.1
- Text C: INJURY_UPDATE, significance 0.7 + 0.2 (BREAKING) = 0.9, star-player modifier 1.5x = expected line move ~4.5 pts

**(d) Composite Signal** (assuming Tatum has a win-share value of ~3.5):

- Sentiment: 0.15 * clip(-0.3125, -1, 1) = -0.047
- Injury: 0.45 * clip(-3.5 * 1.0 / 5.0, -1, 0) = 0.45 * -0.70 = -0.315
- News: 0.25 * clip(4.5 / 3.0, -1, 1) = 0.25 at the cap, but the direction is negative news for the Celtics, so approximately -0.25
- Qualitative: 0.15 * 0 = 0
- Composite: approximately -0.61

**(e) Betting Decision:** This should significantly affect the betting decision. Tatum being ruled out materially weakens the Celtics. If the betting line has not yet moved to reflect this information (especially if Text C just broke), there is an opportunity to bet against the Celtics before the line adjusts. If the line has already moved, the edge may have disappeared. Pre-execution odds verification is essential.
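
Reproducing the part (d) arithmetic as a sketch (the weights are the chapter's; the win-share scaling is the assumption stated above):

```python
import numpy as np

def clip(x, lo, hi):
    return float(np.clip(x, lo, hi))

sentiment   = 0.15 * clip(-0.3125, -1, 1)     # -0.047
injury      = 0.45 * clip(-3.5 / 5.0, -1, 0)  # -0.315
news        = 0.25 * -clip(4.5 / 3.0, -1, 1)  # -0.25 (negative for the Celtics)
qualitative = 0.15 * 0.0                      # no LLM qualitative edge here

print(round(sentiment + injury + news + qualitative, 2))  # -0.61
```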

Question 20. Design the error handling strategy for a production NLP pipeline that processes 200 texts per day across 5 data sources (RSS feeds, social media API, injury report scraper, press conference transcripts, and an LLM API for complex analysis). For each source, specify: (a) the most likely failure mode, (b) the retry strategy, (c) the fallback when the source is unavailable, and (d) the monitoring metric that detects the failure.

Answer

| Source | Likely Failure | Retry | Fallback | Monitor |
|---|---|---|---|---|
| RSS feeds | Feed down or schema change | 3 retries with exponential backoff (2s, 4s, 8s) | Use cached articles from last successful fetch; flag as stale | Article count per fetch (alert if 0 for 2+ hours) |
| Social media API | Rate limiting (429) or auth expiration | Respect rate limit headers, retry after wait period | Reduce query frequency; use cached results | Response status code distribution, posts per query |
| Injury report scraper | HTML structure change (parsing fails) | 2 retries; then switch to alternate URL pattern | Use previous day's injury report with staleness warning | Parse success rate (alert if < 80%) |
| Press conference | Audio/video unavailable; transcript delay | No retry for live events; retry transcript fetch hourly | Skip source for this game; increase weight on other sources | Transcript availability rate per team |
| LLM API | API timeout, rate limit, or model error | 3 retries with 5s backoff; max 30s per call | Use rule-based classification as fallback; set qualitative_edge = 0 | API latency p95, error rate, cost per day |

Global monitoring: Track the composite signal distribution daily. If its standard deviation drops 50% below the historical average, a source may be silently failing. Track feature completeness (percentage of games with all NLP features populated) and alert if it falls below 90%.
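
A generic retry wrapper implementing the backoff pattern in the table; delays and retry counts would be per-source configuration:

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=2.0):
    """Call fetch(); on failure retry with exponential backoff (2s, 4s, 8s)."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # caller falls back to cached data
            time.sleep(base_delay * (2 ** attempt))
```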

Question 21. You have historical data showing that your NLP sentiment signal improves Brier score by an average of 0.003 when sentiment is computed from beat reporter tweets, but degrades Brier score by 0.001 when computed from fan social media posts. Design a filtering strategy that maximizes the value of sentiment data. Specify: (a) how to classify sources as high-value vs. low-value, (b) how to weight them differently, (c) how to handle the case where beat reporter data is unavailable, and (d) how to adapt the filtering as new sources are added.

Answer **(a) Source Classification:** Compute a "source value score" for each source by running a mini walk-forward test: for each source individually, compute the average Brier improvement when that source's sentiment is added. Sources with positive improvement are high-value; sources with negative or zero improvement are low-value.

**(b) Weighting:** Use the Brier improvement magnitude as the weight. Beat reporters (improvement 0.003) receive weight proportional to 0.003. Fan posts (improvement -0.001) receive weight 0. A minimum threshold of 0.0005 Brier improvement is required for any weight; this effectively excludes noise sources without manual curation.

**(c) Fallback When Beat Reporter Data Unavailable:** When high-value sources are unavailable, do NOT substitute with low-value sources --- this would degrade the signal. Instead, set the sentiment feature to the neutral default (0.0) and reduce the overall NLP composite signal weight. Specifically, reduce the sentiment weight in the composite from 0.15 to 0.05 and redistribute the remaining 0.10 to the injury component.

**(d) Adaptive Filtering:** Maintain a rolling 90-day evaluation window. Every 30 days, recompute source value scores using the most recent data. New sources start with a 30-day probationary period during which they receive half their computed weight; after probation, they receive the full weight. This allows the system to discover new valuable sources (e.g., a new beat reporter starts breaking news) and demote degraded ones (e.g., a formerly reliable source turns to clickbait).
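
A sketch of the weighting rule from (a) and (b); function and key names are illustrative:

```python
MIN_IMPROVEMENT = 0.0005  # minimum Brier improvement to earn any weight

def source_weights(brier_improvements: dict) -> dict:
    """brier_improvements: {source: avg Brier improvement from walk-forward tests}."""
    return {
        source: (imp if imp >= MIN_IMPROVEMENT else 0.0)
        for source, imp in brier_improvements.items()
    }

print(source_weights({"beat_reporters": 0.003, "fan_posts": -0.001}))
# {'beat_reporters': 0.003, 'fan_posts': 0.0}
```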

Question 22. An NLP system for MLB betting must handle the following unique challenges: starting pitcher announcements (made 1-3 days before games), bullpen usage patterns (which cannot be directly parsed from text), weather effects on outdoor stadiums, and umpire assignments. Design a complete NLP signal builder for MLB that addresses each of these challenges. Specify: (a) the data sources for each signal, (b) the parsing approach, (c) how each signal enters the betting model, and (d) the expected relative importance of each signal compared to sentiment.

Answer **(a) Data Sources and (b) Parsing:**

| Signal | Data Source | Parsing Approach |
|---|---|---|
| Starting pitcher | Official probable pitchers lists, beat reporter tweets | Regex: "[Team] will start [Name]"; NER for player names; match against roster |
| Bullpen status | Post-game recaps, box scores mentioning innings pitched | Parse recent game logs for reliever IP; compute 3-day workload from structured box score data |
| Weather | Weather API (primary), game preview articles (secondary) | Structured API data (no NLP needed) for primary; keyword extraction ("rain delay", "wind advisory") for secondary |
| Umpire assignment | Official umpire assignment lists | Structured parsing of assignment pages; map umpire names to historical strike zone tendency stats |

**(c) Model Integration:**

- Starting pitcher: feature = pitcher's season ERA, WHIP, and expected win-share value. This is the highest-impact feature, as the starting pitcher determines 30-40% of outcome variance.
- Bullpen: feature = bullpen freshness score (inverse of aggregate recent IP across relievers). High workload = negative impact.
- Weather: features = wind speed, temperature, precipitation probability. Primarily affects totals (over/under) rather than sides.
- Umpire: feature = umpire's historical runs-per-game tendency relative to league average.

**(d) Relative Importance:** Starting pitcher identification >> sentiment (roughly 5x more important). Bullpen status is approximately equal to sentiment. Weather matters less than sentiment (0.5x), and umpire assignments much less (0.25x). Recommended composite weights: pitcher (0.50), sentiment (0.15), bullpen (0.15), weather (0.10), umpire (0.10).
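
A sketch of the recommended composite, assuming each component signal has already been normalized to [-1, 1]; key names are illustrative:

```python
MLB_WEIGHTS = {
    "pitcher": 0.50, "sentiment": 0.15, "bullpen": 0.15,
    "weather": 0.10, "umpire": 0.10,
}

def mlb_composite(signals: dict) -> float:
    """Weighted sum of normalized MLB signals; missing signals default to neutral."""
    return sum(MLB_WEIGHTS[name] * signals.get(name, 0.0) for name in MLB_WEIGHTS)

print(mlb_composite({"pitcher": -0.6, "sentiment": 0.2, "bullpen": -0.3}))  # -0.315
```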

Scoring Guide:

- 90--100: Exceptional understanding of NLP for betting
- 80--89: Strong grasp of core concepts and practical application
- 70--79: Adequate understanding with room for depth
- Below 70: Review chapter material, focusing on areas of weakness