Chapter 32 Exercises: Natural Language Processing for Betting


Part A: Conceptual Questions (Exercises 1--8)

Exercise 1. Explain why beat reporter social media posts are considered the most valuable text source for sports betting NLP systems, and describe two characteristics that distinguish them from general sports news articles.

Exercise 2. VADER sentiment analysis augments its base lexicon with sports-specific terms. For each of the following terms, explain why the assigned sentiment polarity and magnitude are appropriate for a sports betting context:

a. "ruled out" (score: -3.0) b. "probable" (score: +0.5) c. "traded" (score: -0.5) d. "clutch" (score: +2.0)

Exercise 3. A transformer-based sentiment analyzer (e.g., RoBERTa fine-tuned on Twitter data) is significantly more accurate than VADER for nuanced text. Despite this, VADER is often preferred as the primary sentiment tool in production betting pipelines. Give three reasons why this trade-off is justified and one scenario where the transformer model would be the better choice.

Exercise 4. Consider the following injury report tweet from a beat reporter: "LeBron James practiced fully today and is expected to play tomorrow. Anthony Davis remains out with a left knee contusion. Austin Reaves is questionable with flu-like symptoms." Explain step by step how a spaCy NER + rule-based parsing system would extract structured injury data from this text, including named entity recognition, status keyword matching, and body part extraction.

Exercise 5. Large Language Models have several documented limitations in the betting context. For each of the following limitations, describe a concrete failure mode and explain the financial risk it creates:

a. Hallucination
b. Narrative bias
c. Lack of calibration
d. Inconsistency across calls

Exercise 6. The NLPSignal composite signal uses these weights: sentiment (0.15), injury (0.45), news (0.25), qualitative (0.15). Justify why injury impact receives the highest weight. Under what market conditions might you increase the weight on sentiment relative to injury?

Exercise 7. Explain the fundamental methodological challenge of backtesting NLP features. Why can you not simply compute today's sentiment scores for historical games and test whether they predict outcomes? Describe the walk-forward approach and why it is considered the gold standard for NLP feature evaluation.

Exercise 8. Feature injection and model stacking are two approaches for integrating NLP signals into a statistical betting model. Compare and contrast these approaches, discussing the advantages and disadvantages of each in terms of implementation complexity, interpretability, overfitting risk, and marginal predictive value.


Part B: Calculation and Analysis (Exercises 9--15)

Exercise 9. A VADER sentiment analyzer produces the following compound scores for five tweets about the Lakers before tonight's game:

Tweet                                                Source         Compound Score
"LeBron back from injury, looks great in warmups"    beat_reporter  +0.76
"Lakers on a 5-game losing streak, embarrassing"     fan            -0.82
"AD questionable tonight, knee still bothering him"  beat_reporter  -0.45
"Lakers starting lineup is strong tonight"           analyst        +0.38
"This team is trash, fire the coach"                 fan            -0.91

Using the weighting scheme from the chapter (beat_reporter = 3.0, analyst = 2.0, fan = 1.0), compute the weighted average sentiment score for the Lakers. Then compute the unweighted average and explain why the weighted version is more informative for betting.
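
To check your arithmetic, here is a minimal starter sketch in plain Python. The scores come from the table above and the source weights from the chapter's weighting scheme; the weighted average is the standard weight-normalized sum.

```python
# Starter sketch for Exercise 9: source-weighted vs. unweighted sentiment.
# Scores are from the tweet table; weights from the chapter's scheme.
SOURCE_WEIGHTS = {"beat_reporter": 3.0, "analyst": 2.0, "fan": 1.0}

tweets = [
    ("beat_reporter", +0.76),
    ("fan",           -0.82),
    ("beat_reporter", -0.45),
    ("analyst",       +0.38),
    ("fan",           -0.91),
]

weighted = (sum(SOURCE_WEIGHTS[src] * score for src, score in tweets)
            / sum(SOURCE_WEIGHTS[src] for src, _ in tweets))
unweighted = sum(score for _, score in tweets) / len(tweets)

print(f"weighted:   {weighted:+.4f}")
print(f"unweighted: {unweighted:+.4f}")
```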

Exercise 10. An InjuryImpactEstimator uses the following status-to-play-probability mapping: out = 0.0, doubtful = 0.15, questionable = 0.50, probable = 0.85, available = 1.0. Given the following injury report for a team:

Player       Win-Share Value  Status
Star PG      3.5              questionable
Starting SF  1.8              probable
Bench C      0.4              out
Rotation SG  1.0              doubtful

Compute: (a) the expected missing value, (b) the injury severity score, (c) the star player risk, and (d) the average play probability across all entries.
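
A starter sketch for parts (a) and (d). The severity score in (b) and the star player risk in (c) follow the chapter's definitions, which are not reproduced here; the expected-missing-value formula below (win-share value times the probability of sitting) is an assumption for this sketch.

```python
# Starter sketch for Exercise 10. The status-to-play-probability mapping and
# roster come from the exercise. Assumption: expected missing value is
# sum(win_share_value * (1 - play_probability)) over all listed players.
PLAY_PROB = {"out": 0.0, "doubtful": 0.15, "questionable": 0.50,
             "probable": 0.85, "available": 1.0}

roster = [
    ("Star PG",     3.5, "questionable"),
    ("Starting SF", 1.8, "probable"),
    ("Bench C",     0.4, "out"),
    ("Rotation SG", 1.0, "doubtful"),
]

expected_missing = sum(v * (1 - PLAY_PROB[s]) for _, v, s in roster)
avg_play_prob = sum(PLAY_PROB[s] for _, _, s in roster) / len(roster)

print(f"expected missing value: {expected_missing:.2f}")
print(f"average play prob:      {avg_play_prob:.3f}")
```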

Exercise 11. An EventDetector assigns base significance scores and expected line movements to different event types. A news article about a starting quarterback being ruled out for a season-ending ACL tear triggers the following modifiers: base line move = 2.0 points (injury_update), multiplied by 1.5 (star player), multiplied by 2.0 (season-ending/ACL). Compute the estimated line movement. Now compare this to a "probable" designation for a backup safety, which applies only a "minor" modifier (0.5x) to the same 2.0-point injury_update base. What is the ratio of expected line movements, and what does this imply about which news events the pipeline should prioritize?

Exercise 12. A LineMovementTracker records the following spread observations for Game X over 45 minutes:

Time      Spread
10:00 AM  -3.0
10:10 AM  -3.0
10:15 AM  -3.5
10:20 AM  -4.0
10:25 AM  -5.0
10:30 AM  -5.5
10:45 AM  -5.5

With a significance threshold of 1.0 points, identify: (a) when the significant move first occurred, (b) the total movement, (c) the velocity in points per minute during the sharpest 10-minute window, and (d) what type of news event most likely caused this movement.
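
A starter sketch for parts (b) and (c), assuming all timestamps fall on the same morning. Because the feed is unevenly spaced, the sketch measures velocity between observation pairs exactly 10 minutes apart.

```python
# Starter sketch for Exercise 12: total movement and the velocity of the
# sharpest 10-minute window, taken between observation pairs exactly
# 10 minutes apart (the feed is unevenly spaced).
from datetime import datetime

obs = [("10:00", -3.0), ("10:10", -3.0), ("10:15", -3.5), ("10:20", -4.0),
       ("10:25", -5.0), ("10:30", -5.5), ("10:45", -5.5)]
ts = [(datetime.strptime(t, "%H:%M"), s) for t, s in obs]

total_move = abs(ts[-1][1] - ts[0][1])

best_velocity = 0.0
for i in range(len(ts)):
    for j in range(i + 1, len(ts)):
        minutes = (ts[j][0] - ts[i][0]).total_seconds() / 60
        if abs(minutes - 10) < 1e-9:  # exact 10-minute windows only
            best_velocity = max(best_velocity,
                                abs(ts[j][1] - ts[i][1]) / minutes)

print(f"total movement: {total_move:.1f} points")
print(f"sharpest 10-minute velocity: {best_velocity:.2f} pts/min")
```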

Exercise 13. An NLPBacktester runs a walk-forward test comparing a base model (Brier = 0.2300) to a model augmented with NLP features. The results across 6 test periods are:

Period  Base Brier  NLP Brier  Improvement
1       0.2312      0.2298     +0.0014
2       0.2287      0.2301     -0.0014
3       0.2340      0.2308     +0.0032
4       0.2275      0.2260     +0.0015
5       0.2310      0.2295     +0.0015
6       0.2295      0.2310     -0.0015

Compute: (a) the average Brier improvement, (b) the percentage of periods where NLP improved performance, and (c) whether this evidence is strong enough to deploy the NLP features in production. Justify your answer with reference to statistical significance and practical effect size.
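
A stdlib-only starter sketch for parts (a) and (b), plus a paired t-statistic (mean improvement over its standard error) to inform part (c); a full answer would also look up the p-value for 5 degrees of freedom.

```python
# Starter sketch for Exercise 13: summarize the walk-forward improvements
# and compute a paired t-statistic with the standard library only.
from statistics import mean, stdev
from math import sqrt

improvements = [0.0014, -0.0014, 0.0032, 0.0015, 0.0015, -0.0015]

avg_improvement = mean(improvements)
win_rate = sum(d > 0 for d in improvements) / len(improvements)
t_stat = avg_improvement / (stdev(improvements) / sqrt(len(improvements)))

print(f"average improvement: {avg_improvement:+.5f}")
print(f"periods improved:    {win_rate:.1%}")
print(f"paired t-statistic:  {t_stat:.2f}")
```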

Exercise 14. The NLPSignal composite formula normalizes injury impact by dividing by 5.0 and clipping to [-1, 0]. Explain why the injury component is always non-positive (it can only hurt, not help). Compute the composite signal for a team with: sentiment_score = 0.4, injury_impact_score = 2.5, news_magnitude = 1.8, qualitative_edge = 0.2. Then recompute with injury_impact_score = 0.0 (no injuries) and explain the difference.
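
The injury normalization can be sketched as below. Only the injury rule (divide by 5.0, clip to [-1, 0]) is specified in the exercise; for illustration, the sketch assumes the other three components enter the weighted sum unchanged, which is a simplification of the chapter's NLPSignal definition.

```python
# Exercise 14 starter. Only the injury normalization is given; the other
# components are assumed here, for illustration, to pass through unchanged.
WEIGHTS = {"sentiment": 0.15, "injury": 0.45, "news": 0.25, "qualitative": 0.15}

def injury_component(injury_impact_score: float) -> float:
    # Injuries can only hurt: normalize by 5.0, negate, clip to [-1, 0].
    return max(-1.0, min(0.0, -injury_impact_score / 5.0))

def composite(sentiment, injury_impact, news, qualitative):
    return (WEIGHTS["sentiment"] * sentiment
            + WEIGHTS["injury"] * injury_component(injury_impact)
            + WEIGHTS["news"] * news
            + WEIGHTS["qualitative"] * qualitative)

print(composite(0.4, 2.5, 1.8, 0.2))  # with injuries
print(composite(0.4, 0.0, 1.8, 0.2))  # injury-free baseline
```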

Exercise 15. A sentiment monitoring system observes that team sentiment momentum (the rate of change in sentiment) has a residual correlation of 0.12 with model prediction errors. The system has 400 prediction-outcome pairs. Compute the t-statistic for this correlation and determine if it is statistically significant at the 5% level. If the base model has a Brier score of 0.230 and the residual-correlated NLP feature improves it to 0.228, compute the percentage improvement and discuss whether this is practically meaningful in a betting context.
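
A starter sketch using the standard t-statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), compared against the two-sided 5% critical value of 1.96 (valid for large n).

```python
# Exercise 15 starter: significance test for the residual correlation and
# the percentage Brier improvement.
from math import sqrt

r, n = 0.12, 400
t_stat = r * sqrt(n - 2) / sqrt(1 - r**2)
significant = abs(t_stat) > 1.96  # two-sided 5% critical value, large n

brier_base, brier_nlp = 0.230, 0.228
pct_improvement = (brier_base - brier_nlp) / brier_base * 100

print(f"t = {t_stat:.2f}, significant at 5%: {significant}")
print(f"Brier improvement: {pct_improvement:.2f}%")
```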


Part C: Programming Exercises (Exercises 16--20)

Exercise 16. Implement a SportsSentimentAnalyzer class that extends VADER with a configurable sports-specific lexicon. The class should:

- Accept a dictionary of custom terms and scores at initialization
- Provide analyze(text) and analyze_batch(texts) methods returning SentimentResult dataclasses
- Include a top_terms(text, n) method that returns the n terms contributing most to the sentiment score
- Handle edge cases: empty text, non-English text, and very long text (truncate to 500 characters)

Exercise 17. Build an InjuryReportParserV2 class that handles both structured (bullet-point format) and unstructured (news article) injury reports. Requirements:

- Parse structured reports using regex patterns
- Parse unstructured text using keyword matching (do not require spaCy)
- Normalize player names (handle "LeBron James", "LeBron", "James" as the same entity given a roster mapping)
- Track status changes over time (e.g., player went from "questionable" to "out")
- Return ParsedInjuryReport dataclasses with confidence scores

Exercise 18. Create a NewsEventClassifier that classifies sports news text into event categories (injury, trade, coaching change, suspension, lineup, weather, return, general) using keyword matching and simple TF-IDF scoring. The classifier should:

- Accept training examples (text, label pairs)
- Build term frequency profiles for each event type
- Classify new text by comparing its term frequencies to each profile
- Return classification with confidence score
- Handle multi-label classification (a trade can also be an injury if a player is traded while injured)
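
As a starting point, a deliberately minimal term-frequency profile classifier is sketched below (raw counts with cosine-style scoring, not full TF-IDF; the class and method names are placeholders). Your solution should add IDF weighting, the confidence threshold, and multi-label output.

```python
# Starter scaffold for Exercise 18: per-label term-frequency profiles with
# cosine-style scoring. A minimal stand-in for real TF-IDF.
from collections import Counter, defaultdict
from math import sqrt

def tokenize(text):
    # Lowercase whitespace tokens, alphabetic only; real code needs more care.
    return [w for w in text.lower().split() if w.isalpha()]

class TinyEventClassifier:
    def __init__(self):
        self.profiles = defaultdict(Counter)  # label -> term counts

    def train(self, examples):
        # examples: iterable of (text, label) pairs
        for text, label in examples:
            self.profiles[label].update(tokenize(text))

    def classify(self, text):
        query = Counter(tokenize(text))
        best_label, best_score = "general", 0.0
        for label, profile in self.profiles.items():
            dot = sum(query[w] * profile[w] for w in query)
            norm = (sqrt(sum(c * c for c in query.values()))
                    * sqrt(sum(c * c for c in profile.values())))
            score = dot / norm if norm else 0.0
            if score > best_score:
                best_label, best_score = label, score
        return best_label, best_score
```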

Exercise 19. Implement an NLPPipelineIntegrator that connects a sentiment analyzer, injury parser, and event detector into a unified pipeline. The class should:

- Accept text documents with metadata (source, timestamp, author)
- Run all three NLP components on each document
- Aggregate results by team over a configurable time window
- Produce a single feature vector per team for integration with a betting model
- Include monitoring: track processing latency, error rates, and signal distribution

Exercise 20. Build an NLPBacktestRunner that evaluates whether NLP features improve a betting model using walk-forward testing. Requirements:

- Accept a DataFrame of games with base features, NLP features, outcomes, and dates
- Implement walk-forward splits (train on past N days, test on next M days)
- Train GradientBoostingClassifier with and without NLP features for each window
- Compare Brier scores, log-loss, and AUC for each window
- Compute paired t-test for statistical significance of improvement
- Return a summary DataFrame and a boolean recommendation for deployment


Part D: Analysis and Design (Exercises 21--25)

Exercise 21. You are designing a sentiment analysis pipeline for NFL betting. Your data sources include: ESPN RSS feeds, beat reporter Twitter accounts, Reddit r/nfl, official team injury reports, and press conference transcripts. For each source, specify: (a) the scraping method, (b) the expected volume per day, (c) the signal-to-noise ratio, (d) the optimal sentiment analysis method (VADER vs. transformer), and (e) the weighting in the aggregate sentiment score.

Exercise 22. Your injury parsing system correctly parses 93% of structured injury reports but only 67% of unstructured tweets about injuries. Analyze the sources of errors in the unstructured case by listing five common failure modes with examples. For each failure mode, propose a specific improvement to the parser (regex pattern, NER model change, or LLM fallback).

Exercise 23. A walk-forward backtest shows that your NLP composite signal improves Brier score by 0.002 on average but worsens it in 35% of test periods. Design a decision framework for whether to include the NLP features in production. Your framework should consider: (a) statistical significance, (b) practical effect size in dollars, (c) computational cost, (d) robustness across sports and seasons, and (e) implementation complexity. Clearly state the threshold for each criterion and the final decision rule.

Exercise 24. You observe that your sentiment signal is most predictive on Mondays and Tuesdays (when injury reports are released and beat reporters provide practice updates) but has near-zero predictive power on game days (when the market has fully incorporated the information). Design a time-aware NLP pipeline that adjusts its behavior based on the day of the week and proximity to game time. Specify how the pipeline changes its: data collection frequency, analysis depth, feature weighting, and bet execution timing.

Exercise 25. Compare three NLP architectures for a sports betting application: (a) a fully rule-based system (VADER + regex), (b) a hybrid system (VADER + spaCy NER + regex), and (c) an LLM-first system (GPT-4 for all text analysis). For each architecture, estimate: setup time, per-prediction cost, latency, accuracy, maintainability, and failure modes. Present your analysis as a table and recommend which architecture to use for a pipeline processing 500 articles/day with a $100/month NLP budget.


Part E: Research and Extension (Exercises 26--30)

Exercise 26. Research the concept of "efficient market hypothesis" as applied to sports betting and explain how it relates to the value of NLP signals. If markets are semi-strong form efficient (all public information is priced in), can NLP features still provide an edge? Under what conditions does the speed of NLP processing create a window of exploitable inefficiency?

Exercise 27. Investigate how transfer learning can improve sentiment analysis for sports text. Fine-tune a pre-trained transformer model (e.g., DistilBERT) on a sports-specific sentiment dataset. Describe the dataset you would construct, the fine-tuning procedure, and how you would evaluate whether the fine-tuned model outperforms VADER and the generic RoBERTa model for sports betting applications.

Exercise 28. The chapter mentions that LLMs should not be used for direct probability estimation. Research recent work on calibrating LLM confidence scores and discuss whether calibrated LLM probabilities could eventually replace or supplement trained statistical models. What would an "LLM-calibrated probability" system look like, and what evidence would you need to trust it for bet sizing?

Exercise 29. Design a multi-sport NLP system that handles football, basketball, baseball, and hockey simultaneously. For each sport, identify: (a) the unique text patterns (e.g., pitch counts in baseball, line combinations in hockey), (b) sport-specific sentiment lexicon entries, (c) the injury report format and parsing requirements, and (d) the relative importance of NLP features versus statistical features. Explain how you would share components across sports and where sport-specific modules are necessary.

Exercise 30. The chapter uses a simple weighted average to combine sentiment, injury, news, and qualitative signals into a composite NLP score. Research alternative signal combination methods --- including ensemble models, Bayesian updating, and attention-based weighting --- and propose a more sophisticated combination approach. Implement a prototype and argue (with theoretical justification) why your approach should outperform the simple weighted average.