Chapter 32: Natural Language Processing for Betting
Sports betting markets are fundamentally information markets. The price of a bet reflects the aggregated beliefs of all participants about the probability of an outcome. Most of this information enters the market through structured data---box scores, statistics, odds movements. But a significant portion enters through unstructured text: injury reports, press conferences, beat reporter tweets, news articles, and team announcements. If you can systematically extract and quantify the information in these texts faster and more accurately than the market, you gain an edge.
This chapter covers the full spectrum of Natural Language Processing (NLP) techniques applied to sports betting: from basic sentiment analysis to advanced Large Language Model (LLM) applications. We build practical systems that ingest text data, extract structured signals, and integrate those signals into betting models.
32.1 Sentiment Analysis from Sports Media
The Information Landscape
Sports media produces enormous volumes of text daily. Each piece of text carries implicit or explicit information about team quality, player health, morale, and expectations. The key sources are:
- Beat reporters on social media: The single most valuable text source. Beat reporters break news about injuries, lineup changes, coaching decisions, and locker room dynamics before this information is widely known.
- Sports news articles: Pre-game previews, post-game analyses, and feature articles from outlets like ESPN, The Athletic, and team-specific sites.
- Press conferences and interviews: Coach and player quotes reveal tactical intentions, injury status, and team morale.
- Fan and analyst commentary: Higher volume but noisier. Can capture overall market sentiment.
- Official team communications: Injury reports, roster moves, and transaction announcements.
Scraping Sports News and Social Media
Collecting text data requires automated scraping pipelines. Legal and ethical considerations are important: respect robots.txt, rate limits, and terms of service. Many platforms offer official APIs that should be preferred over scraping.
# nlp/scrapers.py
import requests
import time
import json
from datetime import datetime, timedelta
from typing import List, Dict, Optional
from dataclasses import dataclass, field
from bs4 import BeautifulSoup
import logging
logger = logging.getLogger(__name__)
@dataclass
class TextDocument:
"""Represents a single text document from any source."""
source: str # "twitter", "espn", "injury_report", etc.
text: str
author: str
published_at: datetime
url: Optional[str] = None
metadata: Dict = field(default_factory=dict)
entities: List[str] = field(default_factory=list) # Teams, players mentioned
class NewsScraperBase:
"""Base class for news scrapers with rate limiting."""
def __init__(self, rate_limit_seconds: float = 2.0):
self.session = requests.Session()
self.session.headers.update({
"User-Agent": (
"BettingResearchBot/1.0 "
"(academic research; respects robots.txt)"
),
})
self.rate_limit = rate_limit_seconds
self.last_request_time = 0.0
def _throttled_get(self, url: str, **kwargs) -> requests.Response:
elapsed = time.time() - self.last_request_time
if elapsed < self.rate_limit:
time.sleep(self.rate_limit - elapsed)
self.last_request_time = time.time()
return self.session.get(url, timeout=30, **kwargs)
class RSSNewsScraper(NewsScraperBase):
"""Scrape sports news from RSS feeds."""
def __init__(self, feeds: Dict[str, str] = None):
super().__init__()
self.feeds = feeds or {
"espn_nba": "https://www.espn.com/espn/rss/nba/news",
"espn_nfl": "https://www.espn.com/espn/rss/nfl/news",
}
def fetch_articles(self, feed_name: str = None
) -> List[TextDocument]:
"""Fetch articles from configured RSS feeds."""
import feedparser
documents = []
feeds_to_check = (
{feed_name: self.feeds[feed_name]}
if feed_name
else self.feeds
)
for name, url in feeds_to_check.items():
try:
feed = feedparser.parse(url)
for entry in feed.entries:
# Parse publication date
published = datetime.now()
if hasattr(entry, "published_parsed") and entry.published_parsed:
published = datetime(*entry.published_parsed[:6])
# Extract text content
text = entry.get("summary", "")
if hasattr(entry, "content"):
text = entry.content[0].get("value", text)
# Clean HTML
soup = BeautifulSoup(text, "html.parser")
clean_text = soup.get_text(separator=" ").strip()
documents.append(TextDocument(
source=name,
text=f"{entry.title}. {clean_text}",
author=entry.get("author", "unknown"),
published_at=published,
url=entry.get("link"),
))
logger.info(
f"Fetched {len(feed.entries)} articles from {name}"
)
except Exception as e:
logger.error(f"Error fetching {name}: {e}")
return documents
class SocialMediaCollector(NewsScraperBase):
"""Collect sports-related social media posts via API."""
def __init__(self, bearer_token: str = None):
super().__init__(rate_limit_seconds=3.0)
self.bearer_token = bearer_token
if bearer_token:
self.session.headers.update({
"Authorization": f"Bearer {bearer_token}",
})
def search_recent(self, query: str, max_results: int = 100
) -> List[TextDocument]:
"""Search recent posts using the API.
Note: This is a generic pattern. Actual implementation depends
on the specific social media API you have access to.
"""
documents = []
# Generic API search pattern
url = "https://api.social-platform.com/2/search/recent"
params = {
"query": query,
"max_results": min(max_results, 100),
"tweet.fields": "created_at,author_id,public_metrics",
}
try:
response = self._throttled_get(url, params=params)
if response.status_code == 200:
data = response.json()
for post in data.get("data", []):
documents.append(TextDocument(
source="social_media",
text=post["text"],
author=post.get("author_id", "unknown"),
published_at=datetime.fromisoformat(
post["created_at"].replace("Z", "+00:00")
),
metadata={
"likes": post.get("public_metrics", {}).get(
"like_count", 0
),
"retweets": post.get("public_metrics", {}).get(
"retweet_count", 0
),
},
))
except Exception as e:
logger.error(f"Social media search failed: {e}")
return documents
def track_beat_reporters(self, reporter_ids: List[str],
hours_back: int = 24
) -> List[TextDocument]:
"""Fetch recent posts from specific beat reporters.
Beat reporters are the most valuable source of breaking news
about injuries, lineup changes, and team dynamics.
"""
documents = []
cutoff = datetime.utcnow() - timedelta(hours=hours_back)
for reporter_id in reporter_ids:
try:
url = (
f"https://api.social-platform.com/2/"
f"users/{reporter_id}/posts"
)
params = {
"max_results": 50,
"start_time": cutoff.isoformat() + "Z",
"tweet.fields": "created_at,public_metrics",
}
response = self._throttled_get(url, params=params)
if response.status_code == 200:
data = response.json()
for post in data.get("data", []):
documents.append(TextDocument(
source="beat_reporter",
text=post["text"],
author=reporter_id,
published_at=datetime.fromisoformat(
post["created_at"].replace("Z", "+00:00")
),
metadata={"reporter_id": reporter_id},
))
except Exception as e:
logger.error(
f"Failed to fetch posts for reporter {reporter_id}: {e}"
)
return documents
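A minimal usage sketch ties the scrapers together. It assumes the classes above are importable from nlp/scrapers.py, that feedparser and beautifulsoup4 are installed, and that the feed name matches the default configuration:

# Hypothetical usage: fetch recent ESPN NBA articles, newest first.
from nlp.scrapers import RSSNewsScraper

scraper = RSSNewsScraper()
articles = scraper.fetch_articles("espn_nba")
for doc in sorted(articles, key=lambda d: d.published_at, reverse=True)[:5]:
    print(f"[{doc.published_at:%Y-%m-%d %H:%M}] {doc.text[:80]}...")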
Sentiment Scoring
Sentiment analysis assigns a numerical score to text reflecting its emotional valence---positive, negative, or neutral. For sports betting, we care about sentiment toward specific teams and players, not just the overall emotional tone.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment tool specifically designed for social media text. It handles slang, emoticons, and intensity modifiers well. It is fast, requires no training, and works reasonably well as a baseline.
# nlp/sentiment.py
import numpy as np
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
import logging
logger = logging.getLogger(__name__)
@dataclass
class SentimentResult:
text: str
compound_score: float # -1 (negative) to +1 (positive)
positive: float
negative: float
neutral: float
method: str
entities: List[str] = None
class VADERSentimentAnalyzer:
"""VADER-based sentiment analysis for sports text."""
def __init__(self):
from nltk.sentiment.vader import SentimentIntensityAnalyzer
self.analyzer = SentimentIntensityAnalyzer()
# Add sports-specific lexicon entries
sports_lexicon = {
"injury": -2.0,
"injured": -2.5,
"doubtful": -1.5,
"questionable": -1.0,
"out": -2.0,
"ruled out": -3.0,
"dnp": -2.0,
"sidelined": -2.5,
"day-to-day": -1.0,
"probable": 0.5,
"cleared": 2.0,
"returning": 2.0,
"comeback": 2.0,
"healthy": 2.0,
"full practice": 1.5,
"limited practice": -0.5,
"did not practice": -2.0,
"suspension": -2.5,
"suspended": -2.5,
"traded": -0.5, # Slightly negative (disruption)
"signed": 1.0,
"extension": 1.5,
"mvp": 2.5,
"dominant": 2.0,
"blowout": 1.5, # Context-dependent but often positive
"upset": -1.0,
"collapse": -2.5,
"slump": -2.0,
"streak": 1.0, # Usually "winning streak" context
"losing streak": -2.0,
"hot": 1.5,
"cold": -1.5,
"clutch": 2.0,
"choke": -2.5,
}
for word, score in sports_lexicon.items():
self.analyzer.lexicon[word] = score
def analyze(self, text: str) -> SentimentResult:
"""Analyze sentiment of a single text."""
scores = self.analyzer.polarity_scores(text)
return SentimentResult(
text=text,
compound_score=scores["compound"],
positive=scores["pos"],
negative=scores["neg"],
neutral=scores["neu"],
method="vader",
)
def analyze_batch(self, texts: List[str]) -> List[SentimentResult]:
"""Analyze sentiment of multiple texts."""
return [self.analyze(text) for text in texts]
class TransformerSentimentAnalyzer:
"""Transformer-based sentiment analysis using Hugging Face models.
Significantly more accurate than VADER, especially for nuanced
or context-dependent statements, but much slower.
"""
def __init__(self, model_name: str = "cardiffnlp/twitter-roberta-base-sentiment-latest"):
from transformers import pipeline
self.pipe = pipeline(
"sentiment-analysis",
model=model_name,
top_k=None, # Return all scores
truncation=True,
max_length=512,
)
        # Label conventions vary by model: some emit "positive" /
        # "negative" / "neutral", others generic "LABEL_0" (negative),
        # "LABEL_1" (neutral), "LABEL_2" (positive). Both conventions
        # are handled when scores are aggregated below.
    def analyze(self, text: str) -> SentimentResult:
        """Analyze sentiment using transformer model."""
        # Wrap the input in a list so the pipeline reliably returns one
        # list of {label, score} dicts per input across library versions.
        results = self.pipe([text])[0]
# Convert to compound score
compound = 0.0
pos_score = 0.0
neg_score = 0.0
neu_score = 0.0
        for result in results:
            label = result["label"].lower()
            score = result["score"]
            if "positive" in label or label == "label_2":
                pos_score = score
                compound += score
            elif "negative" in label or label == "label_0":
                neg_score = score
                compound -= score
            else:
                neu_score = score
return SentimentResult(
text=text,
compound_score=compound,
positive=pos_score,
negative=neg_score,
neutral=neu_score,
method="transformer",
)
def analyze_batch(self, texts: List[str],
batch_size: int = 32) -> List[SentimentResult]:
"""Analyze a batch of texts efficiently."""
results = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
pipe_results = self.pipe(batch)
for text, pipe_result in zip(batch, pipe_results):
compound = 0.0
pos_score = neg_score = neu_score = 0.0
                for r in pipe_result:
                    label = r["label"].lower()
                    if "positive" in label or label == "label_2":
                        pos_score = r["score"]
                        compound += r["score"]
                    elif "negative" in label or label == "label_0":
                        neg_score = r["score"]
                        compound -= r["score"]
                    else:
                        neu_score = r["score"]
results.append(SentimentResult(
text=text,
compound_score=compound,
positive=pos_score,
negative=neg_score,
neutral=neu_score,
method="transformer",
))
return results
class TeamSentimentAggregator:
"""Aggregate sentiment scores at the team level for betting features."""
def __init__(self, team_aliases: Dict[str, List[str]] = None):
self.team_aliases = team_aliases or {}
def identify_team(self, text: str) -> List[str]:
"""Identify which teams are mentioned in text."""
text_lower = text.lower()
mentioned_teams = []
for team_id, aliases in self.team_aliases.items():
for alias in aliases:
if alias.lower() in text_lower:
mentioned_teams.append(team_id)
break
return mentioned_teams
def aggregate(self, documents: List[TextDocument],
sentiment_results: List[SentimentResult],
window_hours: int = 48
) -> Dict[str, Dict[str, float]]:
"""Aggregate sentiment by team over a time window.
Returns dict mapping team_id to sentiment metrics:
{
"LAL": {
"mean_sentiment": 0.35,
"median_sentiment": 0.28,
"sentiment_std": 0.45,
"num_mentions": 15,
"positive_ratio": 0.67,
"negative_ratio": 0.20,
"weighted_sentiment": 0.32, # Weighted by engagement
}
}
"""
from collections import defaultdict
from datetime import datetime, timedelta
cutoff = datetime.utcnow() - timedelta(hours=window_hours)
team_sentiments = defaultdict(list)
team_weights = defaultdict(list)
        for doc, sentiment in zip(documents, sentiment_results):
            published = doc.published_at
            if published.tzinfo is not None:
                # Normalize aware (UTC) timestamps to naive UTC so the
                # comparison with the naive cutoff does not raise.
                published = published.replace(tzinfo=None)
            if published < cutoff:
                continue
teams = self.identify_team(doc.text)
for team in teams:
team_sentiments[team].append(sentiment.compound_score)
# Weight by source credibility and engagement
weight = 1.0
if doc.source == "beat_reporter":
weight = 3.0 # Beat reporters are highest signal
elif doc.source == "official":
weight = 2.0
engagement = (
doc.metadata.get("likes", 0)
+ doc.metadata.get("retweets", 0) * 2
)
weight *= 1 + np.log1p(engagement) / 10
team_weights[team].append(weight)
result = {}
for team, scores in team_sentiments.items():
scores_arr = np.array(scores)
weights_arr = np.array(team_weights[team])
result[team] = {
"mean_sentiment": float(np.mean(scores_arr)),
"median_sentiment": float(np.median(scores_arr)),
"sentiment_std": float(np.std(scores_arr)),
"num_mentions": len(scores),
"positive_ratio": float(np.mean(scores_arr > 0.05)),
"negative_ratio": float(np.mean(scores_arr < -0.05)),
"weighted_sentiment": float(
np.average(scores_arr, weights=weights_arr)
),
}
return result
Correlation with Betting Markets
Sentiment scores are only useful if they correlate with betting-relevant outcomes. The key question is not whether sentiment predicts who wins, but whether it predicts outcomes better than the market. Specifically, we look for cases where sentiment diverges from market-implied probabilities.
A team with very negative sentiment (due to a recent loss streak) that the market has already priced in is not a signal. But a team with quietly improving sentiment (a key player returning from injury, positive practice reports) that the market has not yet fully adjusted to represents an exploitable edge.
The empirical literature shows that:
- Extreme negative sentiment following a loss often overshoots, creating value on the losing team's next game.
- Injury-related sentiment changes have the most predictive power because injury information takes time to be fully priced in.
- Aggregate social media volume (not just sentiment) correlates with public betting action, which can indicate opportunities on the other side.
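To make the divergence idea concrete, the sketch below compares a change in aggregated sentiment against the movement in market-implied probability derived from American odds. The function names and the sentiment-to-probability scaling constant are illustrative assumptions, not calibrated values:

def american_to_prob(odds: float) -> float:
    """Market-implied win probability from American odds (vig included)."""
    if odds < 0:
        return -odds / (-odds + 100.0)
    return 100.0 / (odds + 100.0)

def sentiment_market_divergence(sentiment_change: float,
                                opening_odds: float,
                                current_odds: float) -> float:
    """Positive when sentiment improved more than the line moved.

    Treats a full-range sentiment swing (2.0) as roughly a 10-point
    probability move (an assumed scaling, to be fit on historical data).
    """
    prob_move = american_to_prob(current_odds) - american_to_prob(opening_odds)
    return (sentiment_change / 2.0) * 0.10 - prob_move

Large absolute divergence flags games where the text signal and the market disagree; those are the candidates worth deeper analysis, not automatic bets.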
32.2 Injury Report Parsing
Extracting Structured Data from Injury Reports
Injury reports are the most directly actionable text data in sports betting. An official NBA injury report or an NFL practice participation report contains highly structured information that moves betting lines, but it arrives in semi-structured text format that requires parsing.
Here is an example of raw injury report text:
Los Angeles Lakers Injury Report - January 15, 2025
- LeBron James (Left Knee Soreness) - Questionable
- Anthony Davis (Right Ankle Sprain) - Doubtful
- Austin Reaves (Illness) - Probable
- D'Angelo Russell (Right Hamstring Tightness) - Out
This needs to be converted into structured data:
{
"team": "Los Angeles Lakers",
"date": "2025-01-15",
"players": [
{"name": "LeBron James", "injury": "Left Knee Soreness", "status": "Questionable"},
{"name": "Anthony Davis", "injury": "Right Ankle Sprain", "status": "Doubtful"},
{"name": "Austin Reaves", "injury": "Illness", "status": "Probable"},
{"name": "D'Angelo Russell", "injury": "Right Hamstring Tightness", "status": "Out"}
]
}
Named Entity Recognition and Status Classification
We use a combination of rule-based patterns and NLP models to parse injury reports reliably.
# nlp/injury_parser.py
import re
import numpy as np  # used by InjuryImpactEstimator below
import spacy
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, field
from datetime import datetime
import logging
logger = logging.getLogger(__name__)
@dataclass
class InjuryEntry:
player_name: str
team: str
injury_description: str
status: str # "out", "doubtful", "questionable", "probable", "available"
body_part: Optional[str] = None
injury_type: Optional[str] = None
reported_date: Optional[str] = None
estimated_return: Optional[str] = None
confidence: float = 1.0
@dataclass
class ParsedInjuryReport:
team: str
report_date: str
entries: List[InjuryEntry] = field(default_factory=list)
source: str = "unknown"
raw_text: str = ""
class InjuryReportParser:
"""Parse injury reports into structured data using spaCy NER
and rule-based patterns."""
def __init__(self, model_name: str = "en_core_web_lg"):
self.nlp = spacy.load(model_name)
# Status keywords and their normalized forms
self.status_keywords = {
"out": "out",
"ruled out": "out",
"will not play": "out",
"sidelined": "out",
"inactive": "out",
"dnp": "out",
"did not participate": "out",
"doubtful": "doubtful",
"game-time decision": "doubtful",
"unlikely": "doubtful",
"questionable": "questionable",
"uncertain": "questionable",
"day-to-day": "questionable",
"50-50": "questionable",
"probable": "probable",
"likely to play": "probable",
"expected to play": "probable",
"available": "available",
"cleared": "available",
"full go": "available",
"no restrictions": "available",
"upgraded": "probable", # Usually positive
"downgraded": "doubtful", # Usually negative
}
# Body part patterns
self.body_parts = [
"knee", "ankle", "hamstring", "groin", "shoulder", "back",
"hip", "calf", "quadricep", "quad", "foot", "toe", "finger",
"wrist", "hand", "elbow", "neck", "head", "concussion",
"achilles", "rib", "oblique", "abdominal", "thigh",
"shin", "fibula", "tibia", "acl", "mcl",
]
# Injury type patterns
self.injury_types = [
"sprain", "strain", "soreness", "tightness", "bruise",
"contusion", "fracture", "tear", "inflammation",
"tendinitis", "tendinopathy", "dislocation", "illness",
"flu", "personal", "rest", "load management", "surgery",
"bone bruise", "stress fracture", "hyperextension",
]
# Compile regex for structured injury report lines
self.line_pattern = re.compile(
r"[-*•]\s*" # Bullet point
r"([A-Z][a-zA-Z'\-.\s]+?)" # Player name
r"\s*\(([^)]+)\)\s*[-–—:]\s*" # Injury description in parens
r"(Out|Doubtful|Questionable|Probable|Available|Day-to-Day)",
re.IGNORECASE,
)
def parse_structured_report(self, text: str, team: str = "",
date: str = "") -> ParsedInjuryReport:
"""Parse a formally structured injury report (bullet-point format)."""
report = ParsedInjuryReport(
team=team,
report_date=date or datetime.utcnow().strftime("%Y-%m-%d"),
raw_text=text,
)
for match in self.line_pattern.finditer(text):
player_name = match.group(1).strip()
injury_desc = match.group(2).strip()
status_raw = match.group(3).strip()
status = self.status_keywords.get(
status_raw.lower(), status_raw.lower()
)
body_part = self._extract_body_part(injury_desc)
injury_type = self._extract_injury_type(injury_desc)
report.entries.append(InjuryEntry(
player_name=player_name,
team=team,
injury_description=injury_desc,
status=status,
body_part=body_part,
injury_type=injury_type,
reported_date=report.report_date,
confidence=0.95, # High confidence for structured reports
))
logger.info(
f"Parsed {len(report.entries)} entries from "
f"structured report for {team}"
)
return report
def parse_freeform_text(self, text: str, team: str = ""
) -> List[InjuryEntry]:
"""Extract injury information from unstructured text
(news articles, tweets).
Uses spaCy NER to identify player names and rule-based
patterns to extract injury and status information.
"""
doc = self.nlp(text)
entries = []
# Find all person entities
person_entities = [
ent for ent in doc.ents if ent.label_ == "PERSON"
]
for person in person_entities:
# Look in the surrounding context (sentence or nearby text)
# for injury and status information
sent = person.sent
sent_text = sent.text.lower()
# Check if the sentence mentions an injury-related term
has_injury_context = any(
term in sent_text
for term in [
"injur", "hurt", "miss", "out", "doubtful",
"questionable", "probable", "sidelined", "sprain",
"strain", "surgery", "return", "cleared", "day-to-day",
"health", "practice",
]
)
if not has_injury_context:
continue
# Extract status
status = self._extract_status(sent_text)
body_part = self._extract_body_part(sent_text)
injury_type = self._extract_injury_type(sent_text)
if status:
entries.append(InjuryEntry(
player_name=person.text,
team=team,
injury_description=sent.text.strip(),
status=status,
body_part=body_part,
injury_type=injury_type,
reported_date=datetime.utcnow().strftime("%Y-%m-%d"),
confidence=0.7, # Lower confidence for freeform
))
logger.info(
f"Extracted {len(entries)} injury entries from freeform text"
)
return entries
def _extract_status(self, text: str) -> Optional[str]:
"""Extract injury status from text."""
text_lower = text.lower()
# Check for explicit status keywords (longest match first)
sorted_keywords = sorted(
self.status_keywords.keys(), key=len, reverse=True
)
for keyword in sorted_keywords:
if keyword in text_lower:
return self.status_keywords[keyword]
# Inference from context
negative_indicators = ["miss", "won't play", "will not play",
"not expected", "held out"]
positive_indicators = ["return", "cleared", "back in",
"practiced", "full participant"]
for indicator in negative_indicators:
if indicator in text_lower:
return "out"
for indicator in positive_indicators:
if indicator in text_lower:
return "available"
return None
def _extract_body_part(self, text: str) -> Optional[str]:
"""Extract injured body part from text."""
text_lower = text.lower()
for part in self.body_parts:
if part in text_lower:
# Check for left/right modifier
idx = text_lower.index(part)
prefix = text_lower[max(0, idx - 10):idx]
side = ""
if "left" in prefix:
side = "left "
elif "right" in prefix:
side = "right "
return f"{side}{part}"
return None
def _extract_injury_type(self, text: str) -> Optional[str]:
"""Extract injury type from text."""
text_lower = text.lower()
for injury_type in self.injury_types:
if injury_type in text_lower:
return injury_type
return None
class InjuryImpactEstimator:
"""Estimate the betting market impact of injury status changes."""
def __init__(self, player_values: Dict[str, float] = None):
"""
player_values maps player names to their approximate
win-share impact per game (e.g., from advanced stats).
Higher values indicate more impactful players.
"""
self.player_values = player_values or {}
# Status-to-probability-of-playing mapping
self.status_play_probs = {
"out": 0.0,
"doubtful": 0.15,
"questionable": 0.50,
"day-to-day": 0.55,
"probable": 0.85,
"available": 1.0,
}
def estimate_impact(self, entries: List[InjuryEntry],
team_id: str) -> Dict[str, float]:
"""Estimate the net impact of injuries on a team.
Returns metrics useful as betting model features:
- expected_missing_value: Expected win-share value of missing players
- injury_severity_score: Weighted injury severity
- star_player_risk: Risk of losing a top player
"""
expected_missing = 0.0
severity_scores = []
star_risk = 0.0
for entry in entries:
if entry.team != team_id and team_id:
continue
player_val = self.player_values.get(entry.player_name, 0.5)
play_prob = self.status_play_probs.get(entry.status, 0.5)
miss_prob = 1.0 - play_prob
expected_missing += player_val * miss_prob
# Severity score (weighted by player value)
severity = miss_prob * player_val
severity_scores.append(severity)
# Star player risk (top-tier player uncertain)
if player_val > 2.0 and play_prob < 0.9:
star_risk = max(star_risk, miss_prob * player_val)
return {
"expected_missing_value": expected_missing,
"injury_severity_score": sum(severity_scores),
"num_injured_players": len(entries),
"star_player_risk": star_risk,
"avg_play_probability": np.mean([
self.status_play_probs.get(e.status, 0.5)
for e in entries
]) if entries else 1.0,
}
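A short usage sketch under stated assumptions: the classes above live in nlp/injury_parser.py, the spaCy en_core_web_lg model is installed, and the win-share values passed to the estimator are placeholder numbers for illustration:

from nlp.injury_parser import InjuryReportParser, InjuryImpactEstimator

report_text = """
- LeBron James (Left Knee Soreness) - Questionable
- Anthony Davis (Right Ankle Sprain) - Doubtful
- Austin Reaves (Illness) - Probable
- D'Angelo Russell (Right Hamstring Tightness) - Out
"""

parser = InjuryReportParser()
report = parser.parse_structured_report(report_text, team="LAL",
                                        date="2025-01-15")
# Placeholder win-share values; real values come from advanced stats.
estimator = InjuryImpactEstimator(player_values={
    "LeBron James": 3.0, "Anthony Davis": 2.8,
    "Austin Reaves": 1.2, "D'Angelo Russell": 1.5,
})
impact = estimator.estimate_impact(report.entries, team_id="LAL")
print(impact["expected_missing_value"], impact["star_player_risk"])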
32.3 News Impact Quantification
Event Detection
Not all news is equally important. A coaching change is more impactful than a routine practice report. An event detection system classifies incoming text into event categories and estimates their significance.
# nlp/event_detection.py
import re
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
from enum import Enum
import numpy as np
import logging
logger = logging.getLogger(__name__)
class EventType(Enum):
INJURY_UPDATE = "injury_update"
TRADE = "trade"
COACHING_CHANGE = "coaching_change"
SUSPENSION = "suspension"
LINEUP_CHANGE = "lineup_change"
WEATHER = "weather"
PLAYER_RETURN = "player_return"
CONTRACT = "contract"
CONTROVERSY = "controversy"
GENERAL_NEWS = "general_news"
@dataclass
class DetectedEvent:
event_type: EventType
headline: str
entities: List[str] # Teams and players involved
significance: float # 0 to 1 scale
confidence: float # Confidence in classification
expected_line_move: float # Estimated points of line movement
timestamp: str
source_text: str
class EventDetector:
"""Detect and classify sports events from text."""
def __init__(self):
# Event type detection patterns
self.patterns = {
EventType.INJURY_UPDATE: [
r"injur(y|ed|ies)",
r"(out|doubtful|questionable|probable)\s+(for|vs|against)",
r"(knee|ankle|hamstring|shoulder|concussion)",
r"(ruled out|sidelined|day-to-day|miss(es|ing)?)",
r"MRI|X-ray|scan|surgery",
r"(limited|full|did not)\s+participa",
],
EventType.TRADE: [
r"trad(e[ds]?|ing)\s+(to|from|for|with)",
r"acquir(e[ds]?|ing)",
r"(deal|swap|package|pick[s]?|draft)",
r"trade deadline",
r"waiv(e[ds]?|er)",
],
EventType.COACHING_CHANGE: [
r"(fir(e[ds]?|ing)|dismiss(ed)?)\s+(coach|manager|head)",
r"(hir(e[ds]?|ing)|appoint(ed)?)\s+(as|new)\s+(coach|manager)",
r"coaching\s+change",
r"interim\s+(coach|manager|head)",
r"resign(s|ed|ation)",
],
EventType.SUSPENSION: [
r"suspend(ed|sion)",
r"ban(ned)?",
r"disciplin(e[ds]?|ary)",
r"violation",
r"PED|substance",
],
EventType.LINEUP_CHANGE: [
r"start(ing|s|er)",
r"bench(ed|ing)?",
r"lineup",
r"rotation",
r"minutes\s+(restriction|limit)",
],
EventType.PLAYER_RETURN: [
r"return(s|ing|ed)\s+(to|from)",
r"(back|cleared)\s+(in|to|for)\s+(action|play|lineup)",
r"activated\s+from",
r"off\s+(injured|injury)\s+(list|reserve)",
r"full\s+(practice|participation)",
],
EventType.WEATHER: [
r"(rain|snow|wind|cold|heat|weather)",
r"(dome|indoor|outdoor|roof)",
r"(delay|postpone|cancel)",
r"(mph|degrees|temperature)",
],
}
# Compile patterns
self.compiled_patterns = {
event_type: [re.compile(p, re.IGNORECASE) for p in patterns]
for event_type, patterns in self.patterns.items()
}
# Base significance by event type
self.base_significance = {
EventType.INJURY_UPDATE: 0.7,
EventType.TRADE: 0.8,
EventType.COACHING_CHANGE: 0.9,
EventType.SUSPENSION: 0.6,
EventType.LINEUP_CHANGE: 0.5,
EventType.PLAYER_RETURN: 0.7,
EventType.WEATHER: 0.3,
EventType.CONTRACT: 0.2,
EventType.CONTROVERSY: 0.4,
EventType.GENERAL_NEWS: 0.1,
}
def detect(self, text: str, timestamp: str = ""
) -> List[DetectedEvent]:
"""Detect events in text."""
events = []
# Score each event type
type_scores = {}
for event_type, patterns in self.compiled_patterns.items():
matches = sum(
1 for p in patterns if p.search(text)
)
if matches > 0:
type_scores[event_type] = matches / len(patterns)
if not type_scores:
return [DetectedEvent(
event_type=EventType.GENERAL_NEWS,
headline=text[:100],
entities=[],
significance=0.1,
confidence=0.5,
expected_line_move=0.0,
timestamp=timestamp,
source_text=text,
)]
# Take the highest-scoring event type
best_type = max(type_scores, key=type_scores.get)
confidence = min(type_scores[best_type] * 2, 1.0)
significance = self.base_significance.get(best_type, 0.1)
# Adjust significance based on text cues
if any(word in text.lower() for word in
["breaking", "just in", "sources say", "confirmed"]):
significance = min(significance + 0.2, 1.0)
# Estimate line movement
line_move = self._estimate_line_move(best_type, text)
events.append(DetectedEvent(
event_type=best_type,
headline=text[:150],
entities=self._extract_entities(text),
significance=significance,
confidence=confidence,
expected_line_move=line_move,
timestamp=timestamp,
source_text=text,
))
return events
def _estimate_line_move(self, event_type: EventType,
text: str) -> float:
"""Estimate expected line movement in points.
This is a rough heuristic. More accurate estimation
requires historical data on similar events.
"""
base_moves = {
EventType.INJURY_UPDATE: 2.0,
EventType.TRADE: 1.5,
EventType.COACHING_CHANGE: 3.0,
EventType.SUSPENSION: 1.5,
EventType.LINEUP_CHANGE: 1.0,
EventType.PLAYER_RETURN: 2.0,
EventType.WEATHER: 0.5,
EventType.GENERAL_NEWS: 0.0,
}
move = base_moves.get(event_type, 0.0)
# Adjust for severity cues
text_lower = text.lower()
if any(w in text_lower for w in
["star", "all-star", "mvp", "starter"]):
move *= 1.5
if any(w in text_lower for w in
["season-ending", "torn", "surgery", "acl", "achilles"]):
move *= 2.0
if any(w in text_lower for w in
["minor", "precautionary", "rest"]):
move *= 0.5
return round(move, 1)
    def _extract_entities(self, text: str) -> List[str]:
        """Extract team and player names (simplified)."""
        # In production, use a custom NER model. Here we lazily load and
        # cache a small spaCy model instead of reloading it on every call.
        import spacy
        if not hasattr(self, "_nlp"):
            self._nlp = spacy.load("en_core_web_sm")
        doc = self._nlp(text)
        return [
            ent.text for ent in doc.ents
            if ent.label_ in ("PERSON", "ORG")
        ]
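Usage is a single call per document. A minimal sketch, assuming the detector above is importable from nlp/event_detection.py:

from nlp.event_detection import EventDetector

detector = EventDetector()
events = detector.detect(
    "BREAKING: Star guard ruled out for Friday's game with a "
    "sprained left ankle, sources say.",
    timestamp="2025-01-15T18:30:00Z",
)
for event in events:
    print(event.event_type.value, event.significance,
          event.expected_line_move)
# Classified as injury_update, with significance boosted by the
# "breaking" cue and the line-move estimate scaled up by "star".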
Measuring Market Reaction to News
To build a news impact model, you need historical data linking news events to subsequent line movements. This requires capturing the betting line before and after news breaks.
# nlp/news_impact.py
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
from datetime import datetime, timedelta
from scipy import stats
import logging
logger = logging.getLogger(__name__)
class NewsImpactModel:
"""Quantify the historical impact of news events on betting lines.
Collects pairs of (news_event, line_movement) to build a model
of how different types of news affect the market.
"""
def __init__(self):
self.impact_data = []
def record_event_impact(self, event_type: str,
player_name: str,
team: str,
pre_event_line: float,
post_event_line: float,
time_to_react_minutes: int,
metadata: Dict = None):
"""Record the market impact of a news event."""
line_movement = post_event_line - pre_event_line
self.impact_data.append({
"event_type": event_type,
"player_name": player_name,
"team": team,
"pre_line": pre_event_line,
"post_line": post_event_line,
"line_movement": line_movement,
"abs_line_movement": abs(line_movement),
"time_to_react": time_to_react_minutes,
"timestamp": datetime.utcnow().isoformat(),
**(metadata or {}),
})
def analyze_impacts(self) -> pd.DataFrame:
"""Analyze historical news impacts by category."""
if not self.impact_data:
return pd.DataFrame()
df = pd.DataFrame(self.impact_data)
summary = df.groupby("event_type").agg({
"abs_line_movement": ["mean", "median", "std", "count"],
"time_to_react": "mean",
}).round(2)
return summary
def estimate_current_impact(self, event_type: str,
player_value: float = 1.0
) -> Dict[str, float]:
"""Estimate the expected market impact of a new event.
Uses historical data on similar events to predict line movement.
"""
df = pd.DataFrame(self.impact_data)
if df.empty or event_type not in df["event_type"].values:
return {
"expected_move": 0.0,
"confidence_interval_low": 0.0,
"confidence_interval_high": 0.0,
"sample_size": 0,
}
similar = df[df["event_type"] == event_type]
movements = similar["abs_line_movement"].values
mean_move = np.mean(movements) * player_value
std_move = np.std(movements)
# 90% confidence interval
ci = stats.t.interval(
0.90, len(movements) - 1,
loc=mean_move, scale=std_move / np.sqrt(len(movements))
) if len(movements) > 1 else (0, 0)
return {
"expected_move": float(mean_move),
"confidence_interval_low": float(ci[0]),
"confidence_interval_high": float(ci[1]),
"sample_size": len(movements),
"median_react_time_min": float(similar["time_to_react"].median()),
}
class LineMovementTracker:
"""Track line movements to detect news-driven shifts.
Monitors odds feeds and flags significant movements that may
indicate breaking news, creating opportunities for informed bettors.
"""
def __init__(self, significant_move_threshold: float = 1.0):
self.threshold = significant_move_threshold
self.line_history: Dict[str, List[Tuple[datetime, float]]] = {}
def update_line(self, game_id: str, current_line: float):
"""Record a new line observation."""
timestamp = datetime.utcnow()
if game_id not in self.line_history:
self.line_history[game_id] = []
self.line_history[game_id].append((timestamp, current_line))
def detect_significant_moves(self) -> List[Dict]:
"""Detect games with significant recent line movements."""
alerts = []
for game_id, history in self.line_history.items():
if len(history) < 2:
continue
# Check movement in last 30 minutes
now = datetime.utcnow()
recent = [
(t, line) for t, line in history
if (now - t).total_seconds() < 1800
]
if len(recent) < 2:
continue
start_line = recent[0][1]
current_line = recent[-1][1]
movement = abs(current_line - start_line)
if movement >= self.threshold:
# Calculate velocity (points per minute)
time_span = (recent[-1][0] - recent[0][0]).total_seconds() / 60
velocity = movement / max(time_span, 1)
alerts.append({
"game_id": game_id,
"start_line": start_line,
"current_line": current_line,
"movement": current_line - start_line,
"abs_movement": movement,
"velocity_pts_per_min": velocity,
"time_span_minutes": time_span,
"detected_at": now.isoformat(),
})
return sorted(alerts, key=lambda x: x["abs_movement"], reverse=True)
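A minimal sketch of the tracker in use, with hand-fed line updates standing in for a real odds feed poller:

from nlp.news_impact import LineMovementTracker

tracker = LineMovementTracker(significant_move_threshold=1.0)
# In production these updates come from polling an odds feed.
tracker.update_line("LAL@BOS-2025-01-15", -3.5)
tracker.update_line("LAL@BOS-2025-01-15", -5.0)  # sharp move in minutes
for alert in tracker.detect_significant_moves():
    print(alert["game_id"], alert["movement"],
          round(alert["velocity_pts_per_min"], 2))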
32.4 Large Language Models for Sports Analysis
Using LLM APIs for Sports Analysis
Large Language Models (LLMs) such as GPT-4 and Claude represent a step change in NLP capability. They can understand nuanced sports context, extract information from complex text, and even generate structured analyses. However, they must be used carefully in a betting context---their outputs are probabilistic, can hallucinate facts, and should never be trusted as primary prediction sources.
LLMs are best used for:
1. Information extraction: Parsing complex, unstructured text into structured data.
2. Summarization: Condensing many articles into key points.
3. Contextual analysis: Understanding subtle implications that rule-based systems miss.
4. Feature engineering assistance: Suggesting new features or interpreting unusual data patterns.
LLMs should NOT be used for:
1. Direct probability estimation: LLMs are not calibrated probability estimators.
2. Replacing statistical models: They lack the mathematical rigor of trained ML models.
3. Real-time decision making: API latency makes them unsuitable for time-critical decisions.
# nlp/llm_analysis.py
import json
import time
import hashlib
from typing import Dict, List, Optional, Any
from dataclasses import dataclass
import logging
logger = logging.getLogger(__name__)
class LLMSportsAnalyzer:
"""Use Large Language Models for sports text analysis.
Supports both OpenAI and Anthropic APIs.
Includes caching, rate limiting, and structured output parsing.
"""
def __init__(self, provider: str = "anthropic",
api_key: str = "",
model: str = None,
cache_enabled: bool = True):
self.provider = provider
self.api_key = api_key
self.cache = {} if cache_enabled else None
if provider == "anthropic":
import anthropic
self.client = anthropic.Anthropic(api_key=api_key)
self.model = model or "claude-sonnet-4-20250514"
elif provider == "openai":
import openai
self.client = openai.OpenAI(api_key=api_key)
self.model = model or "gpt-4o"
else:
raise ValueError(f"Unsupported provider: {provider}")
def _cache_key(self, prompt: str) -> str:
return hashlib.md5(prompt.encode()).hexdigest()
def _call_llm(self, system_prompt: str, user_prompt: str,
temperature: float = 0.0,
max_tokens: int = 2000) -> str:
"""Make an LLM API call with caching and error handling."""
full_prompt = f"{system_prompt}\n{user_prompt}"
# Check cache
if self.cache is not None:
cache_key = self._cache_key(full_prompt)
if cache_key in self.cache:
return self.cache[cache_key]
try:
if self.provider == "anthropic":
message = self.client.messages.create(
model=self.model,
max_tokens=max_tokens,
temperature=temperature,
system=system_prompt,
messages=[
{"role": "user", "content": user_prompt}
],
)
response_text = message.content[0].text
elif self.provider == "openai":
completion = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
temperature=temperature,
max_tokens=max_tokens,
)
response_text = completion.choices[0].message.content
# Cache the result
if self.cache is not None:
self.cache[self._cache_key(full_prompt)] = response_text
return response_text
except Exception as e:
logger.error(f"LLM API call failed: {e}")
return ""
def extract_injury_info(self, text: str) -> List[Dict]:
"""Use LLM to extract structured injury data from complex text."""
system_prompt = """You are a sports injury data extraction system.
Extract ALL injury-related information from the given text and return
it as a JSON array. Each entry should have these fields:
- player_name: Full name of the player
- team: Team name if mentioned
- injury: Description of the injury
- body_part: Specific body part if mentioned
- status: One of "out", "doubtful", "questionable", "probable", "available", or "unknown"
- severity: One of "minor", "moderate", "severe", or "unknown"
- context: Brief explanation of why you classified it this way
Return ONLY valid JSON. If no injury information is found, return an empty array []."""
user_prompt = f"Extract injury information from this text:\n\n{text}"
response = self._call_llm(system_prompt, user_prompt)
try:
# Extract JSON from response (handle markdown code blocks)
json_text = response
if "```json" in response:
json_text = response.split("```json")[1].split("```")[0]
elif "```" in response:
json_text = response.split("```")[1].split("```")[0]
return json.loads(json_text.strip())
except (json.JSONDecodeError, IndexError):
logger.warning(f"Failed to parse LLM response as JSON: {response[:200]}")
return []
def analyze_matchup(self, home_team: str, away_team: str,
context_texts: List[str]) -> Dict:
"""Use LLM to generate a qualitative matchup analysis.
This is NOT for probability estimation. It extracts qualitative
factors that might be missed by statistical models.
"""
combined_context = "\n---\n".join(context_texts[:10]) # Limit context
system_prompt = """You are a sports analyst assistant. Analyze the
provided matchup context and identify QUALITATIVE factors that might
affect the game outcome. Focus on factors that statistical models
might miss: team chemistry, scheduling spots, motivation, coaching
matchups, recent trends in play style, etc.
Return a JSON object with these fields:
- home_advantages: List of qualitative advantages for the home team
- away_advantages: List of qualitative advantages for the away team
- key_uncertainties: List of uncertain factors that could swing either way
- narrative_factors: Any narrative or psychological factors (revenge game,
rivalry, etc.)
- confidence: "low", "medium", or "high" in your analysis
IMPORTANT: Do NOT provide win probabilities or predictions. Focus only
on qualitative analysis. Return ONLY valid JSON."""
user_prompt = f"""Matchup: {home_team} (home) vs {away_team} (away)
Recent news and context:
{combined_context}"""
response = self._call_llm(system_prompt, user_prompt,
temperature=0.1)
try:
json_text = response
if "```json" in response:
json_text = response.split("```json")[1].split("```")[0]
elif "```" in response:
json_text = response.split("```")[1].split("```")[0]
return json.loads(json_text.strip())
except (json.JSONDecodeError, IndexError):
logger.warning("Failed to parse matchup analysis")
return {}
def classify_news_impact(self, headline: str,
article_text: str = "") -> Dict:
"""Use LLM to classify the betting impact of a news item."""
system_prompt = """You are a sports betting news classifier.
Analyze the given news item and classify its potential impact on
betting markets.
Return a JSON object with:
- event_type: One of "injury", "trade", "coaching", "suspension",
"lineup", "weather", "return", "off_field", "general"
- teams_affected: List of team names affected
- players_affected: List of player names affected
- direction: "positive" or "negative" for the first team mentioned
- magnitude: "negligible", "minor", "moderate", "major", "extreme"
- time_sensitivity: "immediate", "hours", "days", "weeks"
- reasoning: One sentence explaining the classification
Return ONLY valid JSON."""
user_prompt = f"Headline: {headline}"
if article_text:
user_prompt += f"\n\nFull text: {article_text[:1000]}"
response = self._call_llm(system_prompt, user_prompt)
try:
json_text = response
if "```json" in response:
json_text = response.split("```json")[1].split("```")[0]
elif "```" in response:
json_text = response.split("```")[1].split("```")[0]
return json.loads(json_text.strip())
except (json.JSONDecodeError, IndexError):
return {"event_type": "general", "magnitude": "negligible"}
def summarize_team_news(self, team: str,
articles: List[str]) -> str:
"""Summarize recent team news for human review."""
combined = "\n---\n".join(articles[:15])
system_prompt = f"""Summarize the most important recent news about
{team} as it relates to their upcoming games. Focus on:
1. Injury updates and player availability
2. Recent performance trends
3. Roster or coaching changes
4. Any other factors that could affect game outcomes
Be concise (3-5 bullet points). Include only factual information
from the provided texts."""
return self._call_llm(system_prompt, combined, temperature=0.0)
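A usage sketch for the extraction path, assuming a valid API key (the placeholder below is not real) and the module layout above:

from nlp.llm_analysis import LLMSportsAnalyzer

analyzer = LLMSportsAnalyzer(provider="anthropic",
                             api_key="YOUR_API_KEY")  # placeholder
entries = analyzer.extract_injury_info(
    "Sources say the veteran forward tweaked his hamstring in practice "
    "and is now doubtful for Sunday, while his backup was a full "
    "participant and is expected to start."
)
for entry in entries:
    print(entry.get("player_name"), entry.get("status"),
          entry.get("severity"))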
Limitations and Biases
LLMs have several important limitations in the betting context that you must understand:
Recency bias. LLMs weight recent information heavily in their responses. A team that lost their last two games will be described more negatively than their overall season warrants.
Narrative bias. LLMs are trained on human-written text, which is full of narrative fallacies. They may emphasize "momentum," "clutch genes," or "team chemistry" beyond what the evidence supports.
Hallucination. LLMs can fabricate facts, statistics, and even injury reports. Never trust LLM outputs without verification against ground truth data.
Lack of calibration. If you ask an LLM for a win probability, the number it gives you is not a calibrated probability. It is a language model's best guess at what number a human analyst might say, not a mathematically grounded estimate.
Cost and latency. LLM API calls cost money and take time. At scale, the cost of analyzing every piece of sports news with a large model can be significant. Use LLMs selectively, for the highest-value analysis tasks, and use cheaper methods (VADER, rule-based systems) for bulk processing.
Inconsistency. The same prompt with the same input can produce different outputs across calls (even with temperature=0 due to batched inference). This makes reproducibility challenging. Always log both inputs and outputs.
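One way to make that logging habit concrete is an append-only record per call. The helper below is a hypothetical sketch (the function name and file path are illustrative), separate from the analyzer above:

import json
import hashlib
from datetime import datetime

def log_llm_call(prompt: str, response: str, model: str,
                 log_path: str = "llm_calls.jsonl") -> None:
    """Append a reproducibility record for a single LLM call."""
    record = {
        "ts": datetime.utcnow().isoformat(),
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "response": response,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")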
32.5 Building a News-Driven Signal
Combining NLP Signals into a Betting Feature
The ultimate goal is to distill all NLP analysis into one or more numerical features that can be fed into your betting model alongside traditional statistical features. This requires careful aggregation and normalization.
# nlp/signal_builder.py
import numpy as np
import pandas as pd
from typing import Dict, List, Optional, Tuple
from datetime import datetime, timedelta
from dataclasses import dataclass
import logging
logger = logging.getLogger(__name__)
@dataclass
class NLPSignal:
"""A single NLP-derived betting signal for a team/game."""
team_id: str
game_id: str
game_date: str
# Sentiment features
sentiment_score: float # -1 to 1
sentiment_momentum: float # Change in sentiment over time
sentiment_volume: int # Number of mentions
# Injury features
injury_impact_score: float # Expected missing player value
injury_change_score: float # Net change since last report
star_availability: float # Probability key players play
# News impact features
recent_news_magnitude: float # Aggregate news impact
event_count: int # Number of significant events
# LLM-derived features (optional)
qualitative_edge: float = 0.0 # LLM assessment, normalized
@property
def composite_signal(self) -> float:
"""Weighted combination of all NLP signals."""
weights = {
"sentiment": 0.15,
"injury": 0.45,
"news": 0.25,
"qualitative": 0.15,
}
# Normalize each component to [-1, 1] range
sentiment_component = np.clip(self.sentiment_score, -1, 1)
injury_component = np.clip(-self.injury_impact_score / 5.0, -1, 0)
news_component = np.clip(self.recent_news_magnitude / 3.0, -1, 1)
qual_component = np.clip(self.qualitative_edge, -1, 1)
composite = (
weights["sentiment"] * sentiment_component
+ weights["injury"] * injury_component
+ weights["news"] * news_component
+ weights["qualitative"] * qual_component
)
return float(np.clip(composite, -1, 1))
class NLPSignalBuilder:
"""Builds NLP-derived features for each game in the pipeline."""
def __init__(self, sentiment_analyzer, injury_parser,
event_detector, llm_analyzer=None,
team_aliases: Dict[str, List[str]] = None):
self.sentiment = sentiment_analyzer
self.injuries = injury_parser
self.events = event_detector
self.llm = llm_analyzer
self.team_aliases = team_aliases or {}
# Historical signals for momentum calculation
self.signal_history: Dict[str, List[Tuple[str, float]]] = {}
def build_signal(self, team_id: str, game_id: str,
game_date: str,
documents: List, # TextDocument list
injury_entries: List, # InjuryEntry list
player_values: Dict[str, float] = None,
use_llm: bool = False
) -> NLPSignal:
"""Build a complete NLP signal for a team's upcoming game."""
# 1. Sentiment analysis
sentiment_features = self._compute_sentiment_features(
team_id, documents
)
# 2. Injury impact
injury_features = self._compute_injury_features(
team_id, injury_entries, player_values or {}
)
# 3. News events
news_features = self._compute_news_features(team_id, documents)
# 4. Optional LLM analysis
qual_edge = 0.0
if use_llm and self.llm:
qual_edge = self._compute_qualitative_edge(
team_id, documents
)
# 5. Compute sentiment momentum
momentum = self._compute_momentum(team_id, game_date,
sentiment_features["score"])
signal = NLPSignal(
team_id=team_id,
game_id=game_id,
game_date=game_date,
sentiment_score=sentiment_features["score"],
sentiment_momentum=momentum,
sentiment_volume=sentiment_features["volume"],
injury_impact_score=injury_features["impact"],
injury_change_score=injury_features["change"],
star_availability=injury_features["star_availability"],
recent_news_magnitude=news_features["magnitude"],
event_count=news_features["event_count"],
qualitative_edge=qual_edge,
)
logger.info(
f"NLP signal for {team_id} ({game_id}): "
f"composite={signal.composite_signal:.3f}, "
f"sentiment={signal.sentiment_score:.3f}, "
f"injury={signal.injury_impact_score:.3f}"
)
return signal
def _compute_sentiment_features(self, team_id: str,
documents: list
) -> Dict[str, float]:
"""Compute sentiment features from documents."""
# Filter documents mentioning this team
team_docs = []
for doc in documents:
text_lower = doc.text.lower()
aliases = self.team_aliases.get(team_id, [team_id])
if any(alias.lower() in text_lower for alias in aliases):
team_docs.append(doc)
if not team_docs:
return {"score": 0.0, "volume": 0}
# Analyze sentiment
sentiments = self.sentiment.analyze_batch(
[d.text for d in team_docs]
)
scores = [s.compound_score for s in sentiments]
# Weight by source credibility
weights = []
for doc in team_docs:
w = 1.0
if doc.source == "beat_reporter":
w = 3.0
elif doc.source in ("espn", "official"):
w = 2.0
weights.append(w)
weighted_score = np.average(scores, weights=weights)
return {
"score": float(weighted_score),
"volume": len(team_docs),
}
def _compute_injury_features(self, team_id: str,
injury_entries: list,
player_values: Dict[str, float]
) -> Dict[str, float]:
"""Compute injury impact features."""
from nlp.injury_parser import InjuryImpactEstimator
estimator = InjuryImpactEstimator(player_values)
team_entries = [
e for e in injury_entries if e.team == team_id
]
impact = estimator.estimate_impact(team_entries, team_id)
return {
"impact": impact["expected_missing_value"],
"change": 0.0, # Computed by comparing to previous report
"star_availability": impact["avg_play_probability"],
}
def _compute_news_features(self, team_id: str,
documents: list) -> Dict[str, float]:
"""Compute news event features."""
team_docs = []
for doc in documents:
text_lower = doc.text.lower()
aliases = self.team_aliases.get(team_id, [team_id])
if any(alias.lower() in text_lower for alias in aliases):
team_docs.append(doc)
if not team_docs:
return {"magnitude": 0.0, "event_count": 0}
total_magnitude = 0.0
event_count = 0
for doc in team_docs:
events = self.events.detect(doc.text)
for event in events:
if event.significance > 0.3: # Only count notable events
total_magnitude += (
event.significance * event.expected_line_move
)
event_count += 1
return {
"magnitude": total_magnitude,
"event_count": event_count,
}
def _compute_momentum(self, team_id: str, game_date: str,
current_sentiment: float) -> float:
"""Compute sentiment momentum (rate of change)."""
if team_id not in self.signal_history:
self.signal_history[team_id] = []
history = self.signal_history[team_id]
history.append((game_date, current_sentiment))
# Keep last 30 entries
if len(history) > 30:
history = history[-30:]
self.signal_history[team_id] = history
if len(history) < 3:
return 0.0
# Simple momentum: difference between recent and older average
recent = np.mean([s for _, s in history[-3:]])
older = np.mean([s for _, s in history[:-3]])
return float(recent - older)
def _compute_qualitative_edge(self, team_id: str,
documents: list) -> float:
"""Use LLM to estimate qualitative edge (expensive, use sparingly)."""
if not self.llm:
return 0.0
team_texts = [
d.text for d in documents
if team_id.lower() in d.text.lower()
][:5]
if not team_texts:
return 0.0
try:
analysis = self.llm.classify_news_impact(
headline=f"Recent news about {team_id}",
article_text=" ".join(team_texts),
)
magnitude_map = {
"negligible": 0.0,
"minor": 0.1,
"moderate": 0.3,
"major": 0.6,
"extreme": 0.9,
}
magnitude = magnitude_map.get(
analysis.get("magnitude", "negligible"), 0.0
)
direction = 1.0 if analysis.get("direction") == "positive" else -1.0
return float(magnitude * direction)
except Exception as e:
logger.warning(f"LLM qualitative analysis failed: {e}")
return 0.0
Integration with Existing Models
NLP signals should not replace your statistical model---they should augment it. The integration can happen in two ways:
- Feature injection: Add NLP-derived features directly into your model's feature vector.
- Model stacking: Train a separate NLP model and combine its predictions with the main model.
Feature injection is simpler and usually sufficient:
# nlp/integration.py
import numpy as np
import pandas as pd
from typing import Dict, List, Optional
import logging
logger = logging.getLogger(__name__)
class NLPFeatureIntegrator:
"""Integrate NLP signals into the main prediction pipeline."""
def __init__(self, feature_store, signal_builder):
self.feature_store = feature_store
self.signal_builder = signal_builder
def compute_and_store_features(self, team_id: str,
game_id: str,
game_date: str,
documents: list,
injury_entries: list,
player_values: Dict[str, float] = None
):
"""Compute NLP features and store in the feature store."""
signal = self.signal_builder.build_signal(
team_id=team_id,
game_id=game_id,
game_date=game_date,
documents=documents,
injury_entries=injury_entries,
player_values=player_values,
)
# Convert signal to feature store format
features_df = pd.DataFrame([{
"entity_id": team_id,
"event_date": game_date,
"nlp_sentiment_score": signal.sentiment_score,
"nlp_sentiment_momentum": signal.sentiment_momentum,
"nlp_sentiment_volume": signal.sentiment_volume,
"nlp_injury_impact": signal.injury_impact_score,
"nlp_injury_change": signal.injury_change_score,
"nlp_star_availability": signal.star_availability,
"nlp_news_magnitude": signal.recent_news_magnitude,
"nlp_event_count": signal.event_count,
"nlp_composite_signal": signal.composite_signal,
}])
self.feature_store.store_features(
entity_type="team",
features_df=features_df,
feature_version="nlp_v1",
)
logger.info(
f"Stored NLP features for {team_id} on {game_date}"
)
return signal
@staticmethod
def get_nlp_feature_names() -> List[str]:
"""Return the list of NLP feature names for model training."""
return [
"nlp_sentiment_score",
"nlp_sentiment_momentum",
"nlp_sentiment_volume",
"nlp_injury_impact",
"nlp_injury_change",
"nlp_star_availability",
"nlp_news_magnitude",
"nlp_event_count",
"nlp_composite_signal",
]
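At training time, the stored NLP columns are joined onto the game-level feature table. A minimal sketch with tiny illustrative frames (real rows would come from the feature store; column names like home_team_id are assumptions):

import pandas as pd
from nlp.integration import NLPFeatureIntegrator

nlp_cols = NLPFeatureIntegrator.get_nlp_feature_names()

# Tiny illustrative frames; real data comes from the feature store.
games_df = pd.DataFrame({
    "home_team_id": ["LAL"], "game_date": ["2025-01-15"],
    "home_off_rating": [117.2],
})
nlp_df = pd.DataFrame({
    "entity_id": ["LAL"], "event_date": ["2025-01-15"],
    **{col: [0.1] for col in nlp_cols},
})

merged = games_df.merge(
    nlp_df, left_on=["home_team_id", "game_date"],
    right_on=["entity_id", "event_date"], how="left",
)
# Games without NLP coverage get a neutral (zero) signal, not NaN.
merged[nlp_cols] = merged[nlp_cols].fillna(0.0)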
Backtesting NLP Features
Backtesting NLP features is methodologically challenging because you need historical text data aligned with historical betting lines. You cannot simply compute today's NLP features and apply them to historical games---that would be data leakage. You need to reconstruct what the NLP features would have been at the time of each historical bet.
# nlp/backtesting.py
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
from datetime import datetime
from sklearn.metrics import roc_auc_score, brier_score_loss
import logging
logger = logging.getLogger(__name__)
class NLPBacktester:
"""Backtest NLP features to measure their predictive value.
Key principle: NLP features must be computed using only information
available BEFORE each game. This requires historical text data
with timestamps.
"""
def __init__(self, feature_store):
self.feature_store = feature_store
def evaluate_feature_value(self, nlp_feature_name: str,
base_model_probs: pd.Series,
actual_outcomes: pd.Series,
nlp_feature_values: pd.Series
) -> Dict:
"""Evaluate whether an NLP feature adds value beyond the base model.
Compares:
1. Base model predictions alone
2. Base model predictions adjusted by NLP feature
A valuable NLP feature should improve calibration and
discrimination when added to the base model.
"""
results = {}
# Base model metrics
base_brier = brier_score_loss(actual_outcomes, base_model_probs)
base_auc = roc_auc_score(actual_outcomes, base_model_probs)
results["base_brier"] = base_brier
results["base_auc"] = base_auc
# Simple adjustment: shift base probability by NLP signal
# (This is a simplified illustration; in practice, you would
# retrain the model with and without the NLP feature)
adjustment_weight = 0.05 # Small weight for NLP adjustment
adjusted_probs = base_model_probs + adjustment_weight * nlp_feature_values
adjusted_probs = np.clip(adjusted_probs, 0.01, 0.99)
adj_brier = brier_score_loss(actual_outcomes, adjusted_probs)
adj_auc = roc_auc_score(actual_outcomes, adjusted_probs)
results["adjusted_brier"] = adj_brier
results["adjusted_auc"] = adj_auc
results["brier_improvement"] = base_brier - adj_brier
results["auc_improvement"] = adj_auc - base_auc
# Feature correlation with residuals
# If the NLP feature correlates with model errors,
# it contains information the model is missing
residuals = actual_outcomes - base_model_probs
correlation = np.corrcoef(nlp_feature_values, residuals)[0, 1]
results["residual_correlation"] = correlation
# Profitability test: would betting based on NLP signal be profitable?
# Look at cases where NLP signal strongly disagrees with market
strong_signal = np.abs(nlp_feature_values) > nlp_feature_values.std()
        if strong_signal.sum() > 10:
            results["strong_signal_count"] = int(strong_signal.sum())
            # Did the signal's direction agree with the actual outcome?
            results["strong_signal_accuracy"] = float(
                ((nlp_feature_values[strong_signal] > 0) ==
                 (actual_outcomes[strong_signal] == 1)).mean()
            )
# Statistical significance test
from scipy.stats import ttest_rel
if len(base_model_probs) > 30:
base_errors = (actual_outcomes - base_model_probs) ** 2
adj_errors = (actual_outcomes - adjusted_probs) ** 2
t_stat, p_value = ttest_rel(base_errors, adj_errors)
results["improvement_p_value"] = float(p_value)
results["statistically_significant"] = p_value < 0.05
logger.info(
f"NLP feature '{nlp_feature_name}': "
f"Brier improvement={results['brier_improvement']:.4f}, "
f"residual corr={results['residual_correlation']:.4f}"
)
return results
def run_walk_forward_test(self, games_df: pd.DataFrame,
nlp_features: List[str],
target_col: str = "home_win",
base_features: List[str] = None,
train_window_days: int = 365,
test_window_days: int = 30
) -> pd.DataFrame:
"""Walk-forward backtest comparing model with and without NLP features.
This is the gold standard for evaluating whether NLP features
genuinely improve predictions out of sample.
"""
        from sklearn.ensemble import GradientBoostingClassifier
        if base_features is None:
            raise ValueError("base_features is required for the comparison")
        games_df = games_df.sort_values("game_date").reset_index(drop=True)
dates = pd.to_datetime(games_df["game_date"])
results = []
start_date = dates.min() + pd.Timedelta(days=train_window_days)
end_date = dates.max()
current_date = start_date
while current_date < end_date:
test_end = current_date + pd.Timedelta(days=test_window_days)
train_mask = (dates < current_date) & (
dates >= current_date - pd.Timedelta(days=train_window_days)
)
test_mask = (dates >= current_date) & (dates < test_end)
if train_mask.sum() < 100 or test_mask.sum() < 10:
current_date = test_end
continue
train_data = games_df[train_mask]
test_data = games_df[test_mask]
# Model WITHOUT NLP features
X_train_base = train_data[base_features].fillna(0)
X_test_base = test_data[base_features].fillna(0)
y_train = train_data[target_col]
y_test = test_data[target_col]
model_base = GradientBoostingClassifier(
n_estimators=100, max_depth=4, random_state=42
)
model_base.fit(X_train_base, y_train)
probs_base = model_base.predict_proba(X_test_base)[:, 1]
# Model WITH NLP features
all_features = base_features + nlp_features
X_train_nlp = train_data[all_features].fillna(0)
X_test_nlp = test_data[all_features].fillna(0)
model_nlp = GradientBoostingClassifier(
n_estimators=100, max_depth=4, random_state=42
)
model_nlp.fit(X_train_nlp, y_train)
probs_nlp = model_nlp.predict_proba(X_test_nlp)[:, 1]
# Compare
brier_base = brier_score_loss(y_test, probs_base)
brier_nlp = brier_score_loss(y_test, probs_nlp)
results.append({
"period_start": current_date.strftime("%Y-%m-%d"),
"period_end": test_end.strftime("%Y-%m-%d"),
"n_test": len(y_test),
"brier_base": brier_base,
"brier_with_nlp": brier_nlp,
"brier_improvement": brier_base - brier_nlp,
"improvement_pct": (
(brier_base - brier_nlp) / brier_base * 100
),
})
current_date = test_end
results_df = pd.DataFrame(results)
if not results_df.empty:
avg_improvement = results_df["brier_improvement"].mean()
pct_improved = (results_df["brier_improvement"] > 0).mean()
logger.info(
f"Walk-forward backtest complete: "
f"Avg Brier improvement={avg_improvement:.4f}, "
f"Improved in {pct_improved:.0%} of periods"
)
return results_df
32.6 Chapter Summary
This chapter explored how Natural Language Processing techniques can extract predictive signals from the vast ocean of sports text data. The key insights and takeaways are:
Sentiment analysis provides a noisy but real signal. Sports media sentiment, especially from beat reporters, correlates with factors that statistical models miss: team morale, behind-the-scenes dynamics, and emerging trends. VADER provides a fast baseline, while transformer-based models offer higher accuracy at greater computational cost. The most valuable sentiment signal comes from changes in sentiment, not absolute levels, because the market has already priced in persistent conditions.
Injury report parsing is the highest-value NLP application. Injuries are the single largest driver of betting line movements, and the information arrives in semi-structured text that can be systematically parsed. A combination of regex patterns, spaCy NER, and rule-based classification handles the structured reports that leagues publish, while LLMs can extract injury information from unstructured news articles and social media. The injury impact estimator translates parsing results into quantitative features by combining player status with player value metrics.
News event detection and impact quantification close the loop. By classifying news events by type and tracking historical line reactions to similar events, you can estimate the expected market impact of new information. The line movement tracker monitors for sharp moves that may indicate breaking news, creating time-sensitive opportunities.
Large Language Models are powerful tools with important limitations. LLMs excel at information extraction, classification, and summarization. They fail at probability estimation and are prone to hallucination and narrative bias. Use them as sophisticated parsing tools, not as oracles. Cache responses, log everything, and verify outputs against ground truth. The cost of LLM API calls requires judicious use---reserve them for high-value analyses where simpler methods fall short.
NLP features must be backtested rigorously. The walk-forward backtest is the gold standard. Compare a model with and without NLP features over multiple out-of-sample periods. A genuine NLP edge should consistently improve Brier scores and show statistically significant improvement. Many NLP signals that look promising in-sample fail to generalize. Guard against overfitting by using simple, interpretable NLP features (composite sentiment score, injury impact score) rather than high-dimensional text embeddings.
Integration is additive, not substitutive. NLP features augment your statistical model; they do not replace it. The composite NLP signal enters the model as one (or a few) features alongside dozens of statistical features. Its contribution to the overall prediction is typically small---a few percentage points of improvement---but in betting markets, small edges compound into significant long-term profits.
The marriage of NLP with traditional sports analytics represents one of the most promising frontiers in quantitative sports betting. As language models continue to improve, the ability to systematically process and quantify textual information will become an increasingly important source of competitive advantage.