
Chapter 32: Natural Language Processing for Betting

Sports betting markets are fundamentally information markets. The price of a bet reflects the aggregated beliefs of all participants about the probability of an outcome. Most of this information enters the market through structured data---box scores, statistics, odds movements. But a significant portion enters through unstructured text: injury reports, press conferences, beat reporter tweets, news articles, and team announcements. If you can systematically extract and quantify the information in these texts faster and more accurately than the market, you gain an edge.

This chapter covers the full spectrum of Natural Language Processing (NLP) techniques applied to sports betting: from basic sentiment analysis to advanced Large Language Model (LLM) applications. We build practical systems that ingest text data, extract structured signals, and integrate those signals into betting models.


32.1 Sentiment Analysis from Sports Media

The Information Landscape

Sports media produces enormous volumes of text daily. Each piece of text carries implicit or explicit information about team quality, player health, morale, and expectations. The key sources are:

  • Beat reporters on social media: The single most valuable text source. Beat reporters break news about injuries, lineup changes, coaching decisions, and locker room dynamics before this information is widely known.
  • Sports news articles: Pre-game previews, post-game analyses, and feature articles from outlets like ESPN, The Athletic, and team-specific sites.
  • Press conferences and interviews: Coach and player quotes reveal tactical intentions, injury status, and team morale.
  • Fan and analyst commentary: Higher volume but noisier. Can capture overall market sentiment.
  • Official team communications: Injury reports, roster moves, and transaction announcements.

Scraping Sports News and Social Media

Collecting text data requires automated scraping pipelines. Legal and ethical considerations are important: respect robots.txt, rate limits, and terms of service. Many platforms offer official APIs that should be preferred over scraping.
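A robots.txt check can be automated with the standard library's `urllib.robotparser` before any fetch. A minimal sketch (the rules, domain, and user-agent string below are illustrative; in a live scraper you would load the real file with `set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: list, url: str,
               user_agent: str = "BettingResearchBot") -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(rules)  # in production: rp.set_url(...); rp.read()
    return rp.can_fetch(user_agent, url)


# Hypothetical robots.txt that disallows the /private/ tree for all agents
rules = ["User-agent: *", "Disallow: /private/"]
print(is_allowed(rules, "https://example.com/news/article"))    # True
print(is_allowed(rules, "https://example.com/private/report"))  # False
```

Gating every `_throttled_get` call behind a check like this keeps the scraper compliant even when feed configurations change.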

# nlp/scrapers.py
import requests
import time
import json
from datetime import datetime, timedelta
from typing import List, Dict, Optional
from dataclasses import dataclass, field
from bs4 import BeautifulSoup
import logging

logger = logging.getLogger(__name__)


@dataclass
class TextDocument:
    """Represents a single text document from any source."""
    source: str           # "twitter", "espn", "injury_report", etc.
    text: str
    author: str
    published_at: datetime
    url: Optional[str] = None
    metadata: Dict = field(default_factory=dict)
    entities: List[str] = field(default_factory=list)  # Teams, players mentioned


class NewsScraperBase:
    """Base class for news scrapers with rate limiting."""

    def __init__(self, rate_limit_seconds: float = 2.0):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": (
                "BettingResearchBot/1.0 "
                "(academic research; respects robots.txt)"
            ),
        })
        self.rate_limit = rate_limit_seconds
        self.last_request_time = 0.0

    def _throttled_get(self, url: str, **kwargs) -> requests.Response:
        elapsed = time.time() - self.last_request_time
        if elapsed < self.rate_limit:
            time.sleep(self.rate_limit - elapsed)
        self.last_request_time = time.time()
        return self.session.get(url, timeout=30, **kwargs)


class RSSNewsScraper(NewsScraperBase):
    """Scrape sports news from RSS feeds."""

    def __init__(self, feeds: Dict[str, str] = None):
        super().__init__()
        self.feeds = feeds or {
            "espn_nba": "https://www.espn.com/espn/rss/nba/news",
            "espn_nfl": "https://www.espn.com/espn/rss/nfl/news",
        }

    def fetch_articles(self, feed_name: str = None
                        ) -> List[TextDocument]:
        """Fetch articles from configured RSS feeds."""
        import feedparser

        documents = []
        feeds_to_check = (
            {feed_name: self.feeds[feed_name]}
            if feed_name
            else self.feeds
        )

        for name, url in feeds_to_check.items():
            try:
                feed = feedparser.parse(url)
                for entry in feed.entries:
                    # Parse publication date
                    published = datetime.now()
                    if hasattr(entry, "published_parsed") and entry.published_parsed:
                        published = datetime(*entry.published_parsed[:6])

                    # Extract text content
                    text = entry.get("summary", "")
                    if hasattr(entry, "content"):
                        text = entry.content[0].get("value", text)

                    # Clean HTML
                    soup = BeautifulSoup(text, "html.parser")
                    clean_text = soup.get_text(separator=" ").strip()

                    documents.append(TextDocument(
                        source=name,
                        text=f"{entry.title}. {clean_text}",
                        author=entry.get("author", "unknown"),
                        published_at=published,
                        url=entry.get("link"),
                    ))

                logger.info(
                    f"Fetched {len(feed.entries)} articles from {name}"
                )
            except Exception as e:
                logger.error(f"Error fetching {name}: {e}")

        return documents


class SocialMediaCollector(NewsScraperBase):
    """Collect sports-related social media posts via API."""

    def __init__(self, bearer_token: str = None):
        super().__init__(rate_limit_seconds=3.0)
        self.bearer_token = bearer_token
        if bearer_token:
            self.session.headers.update({
                "Authorization": f"Bearer {bearer_token}",
            })

    def search_recent(self, query: str, max_results: int = 100
                       ) -> List[TextDocument]:
        """Search recent posts using the API.

        Note: This is a generic pattern. Actual implementation depends
        on the specific social media API you have access to.
        """
        documents = []

        # Generic API search pattern
        url = "https://api.social-platform.com/2/search/recent"
        params = {
            "query": query,
            "max_results": min(max_results, 100),
            "tweet.fields": "created_at,author_id,public_metrics",
        }

        try:
            response = self._throttled_get(url, params=params)
            if response.status_code == 200:
                data = response.json()
                for post in data.get("data", []):
                    documents.append(TextDocument(
                        source="social_media",
                        text=post["text"],
                        author=post.get("author_id", "unknown"),
                        published_at=datetime.fromisoformat(
                            post["created_at"].replace("Z", "+00:00")
                        ),
                        metadata={
                            "likes": post.get("public_metrics", {}).get(
                                "like_count", 0
                            ),
                            "retweets": post.get("public_metrics", {}).get(
                                "retweet_count", 0
                            ),
                        },
                    ))
        except Exception as e:
            logger.error(f"Social media search failed: {e}")

        return documents

    def track_beat_reporters(self, reporter_ids: List[str],
                              hours_back: int = 24
                              ) -> List[TextDocument]:
        """Fetch recent posts from specific beat reporters.

        Beat reporters are the most valuable source of breaking news
        about injuries, lineup changes, and team dynamics.
        """
        documents = []
        cutoff = datetime.utcnow() - timedelta(hours=hours_back)

        for reporter_id in reporter_ids:
            try:
                url = (
                    f"https://api.social-platform.com/2/"
                    f"users/{reporter_id}/posts"
                )
                params = {
                    "max_results": 50,
                    "start_time": cutoff.isoformat() + "Z",
                    "tweet.fields": "created_at,public_metrics",
                }

                response = self._throttled_get(url, params=params)
                if response.status_code == 200:
                    data = response.json()
                    for post in data.get("data", []):
                        documents.append(TextDocument(
                            source="beat_reporter",
                            text=post["text"],
                            author=reporter_id,
                            published_at=datetime.fromisoformat(
                                post["created_at"].replace("Z", "+00:00")
                            ),
                            metadata={"reporter_id": reporter_id},
                        ))
            except Exception as e:
                logger.error(
                    f"Failed to fetch posts for reporter {reporter_id}: {e}"
                )

        return documents

Sentiment Scoring

Sentiment analysis assigns a numerical score to text reflecting its emotional valence---positive, negative, or neutral. For sports betting, we care about sentiment toward specific teams and players, not just the overall emotional tone.

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment tool specifically designed for social media text. It handles slang, emoticons, and intensity modifiers well. It is fast, requires no training, and works reasonably well as a baseline.

# nlp/sentiment.py
import numpy as np
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


@dataclass
class SentimentResult:
    text: str
    compound_score: float    # -1 (negative) to +1 (positive)
    positive: float
    negative: float
    neutral: float
    method: str
    entities: List[str] = None


class VADERSentimentAnalyzer:
    """VADER-based sentiment analysis for sports text."""

    def __init__(self):
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        self.analyzer = SentimentIntensityAnalyzer()

        # Add sports-specific lexicon entries
        sports_lexicon = {
            "injury": -2.0,
            "injured": -2.5,
            "doubtful": -1.5,
            "questionable": -1.0,
            "out": -2.0,
            "ruled out": -3.0,
            "dnp": -2.0,
            "sidelined": -2.5,
            "day-to-day": -1.0,
            "probable": 0.5,
            "cleared": 2.0,
            "returning": 2.0,
            "comeback": 2.0,
            "healthy": 2.0,
            "full practice": 1.5,
            "limited practice": -0.5,
            "did not practice": -2.0,
            "suspension": -2.5,
            "suspended": -2.5,
            "traded": -0.5,  # Slightly negative (disruption)
            "signed": 1.0,
            "extension": 1.5,
            "mvp": 2.5,
            "dominant": 2.0,
            "blowout": 1.5,   # Context-dependent but often positive
            "upset": -1.0,
            "collapse": -2.5,
            "slump": -2.0,
            "streak": 1.0,    # Usually "winning streak" context
            "losing streak": -2.0,
            "hot": 1.5,
            "cold": -1.5,
            "clutch": 2.0,
            "choke": -2.5,
        }

        # VADER matches lexicon entries per token, so multi-word phrases
        # (e.g. "ruled out") are collapsed into single sentinel tokens
        # before scoring.
        self.phrase_map = {
            phrase: phrase.replace(" ", "_")
            for phrase in sports_lexicon if " " in phrase
        }
        for word, score in sports_lexicon.items():
            self.analyzer.lexicon[self.phrase_map.get(word, word)] = score

    def analyze(self, text: str) -> SentimentResult:
        """Analyze sentiment of a single text."""
        import re

        # Substitute multi-word phrases with their sentinel tokens so the
        # custom lexicon entries actually fire
        scored_text = text
        for phrase, token in self.phrase_map.items():
            scored_text = re.sub(
                re.escape(phrase), token, scored_text, flags=re.IGNORECASE
            )
        scores = self.analyzer.polarity_scores(scored_text)
        return SentimentResult(
            text=text,
            compound_score=scores["compound"],
            positive=scores["pos"],
            negative=scores["neg"],
            neutral=scores["neu"],
            method="vader",
        )

    def analyze_batch(self, texts: List[str]) -> List[SentimentResult]:
        """Analyze sentiment of multiple texts."""
        return [self.analyze(text) for text in texts]


class TransformerSentimentAnalyzer:
    """Transformer-based sentiment analysis using Hugging Face models.

    Significantly more accurate than VADER, especially for nuanced
    or context-dependent statements, but much slower.
    """

    def __init__(self, model_name: str = "cardiffnlp/twitter-roberta-base-sentiment-latest"):
        from transformers import pipeline

        self.pipe = pipeline(
            "sentiment-analysis",
            model=model_name,
            top_k=None,  # Return all scores
            truncation=True,
            max_length=512,
        )
        self.label_map = {
            "positive": 1.0,
            "negative": -1.0,
            "neutral": 0.0,
            # Some models use different labels
            "LABEL_0": -1.0,
            "LABEL_1": 0.0,
            "LABEL_2": 1.0,
        }

    def _to_result(self, text: str, label_scores: List[Dict]
                   ) -> SentimentResult:
        """Convert per-label pipeline scores into a SentimentResult."""
        compound = pos_score = neg_score = neu_score = 0.0

        for r in label_scores:
            # label_map covers both descriptive labels ("positive") and
            # generic ones ("LABEL_2"); unknown labels count as neutral
            valence = self.label_map.get(
                r["label"], self.label_map.get(r["label"].lower(), 0.0)
            )
            if valence > 0:
                pos_score = r["score"]
                compound += r["score"]
            elif valence < 0:
                neg_score = r["score"]
                compound -= r["score"]
            else:
                neu_score = r["score"]

        return SentimentResult(
            text=text,
            compound_score=compound,
            positive=pos_score,
            negative=neg_score,
            neutral=neu_score,
            method="transformer",
        )

    def analyze(self, text: str) -> SentimentResult:
        """Analyze sentiment using the transformer model."""
        # Pass a list so the output shape matches the batched case
        return self._to_result(text, self.pipe([text])[0])

    def analyze_batch(self, texts: List[str],
                       batch_size: int = 32) -> List[SentimentResult]:
        """Analyze a batch of texts efficiently."""
        results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            for text, label_scores in zip(batch, self.pipe(batch)):
                results.append(self._to_result(text, label_scores))
        return results


class TeamSentimentAggregator:
    """Aggregate sentiment scores at the team level for betting features."""

    def __init__(self, team_aliases: Dict[str, List[str]] = None):
        self.team_aliases = team_aliases or {}

    def identify_team(self, text: str) -> List[str]:
        """Identify which teams are mentioned in text."""
        text_lower = text.lower()
        mentioned_teams = []

        for team_id, aliases in self.team_aliases.items():
            for alias in aliases:
                if alias.lower() in text_lower:
                    mentioned_teams.append(team_id)
                    break

        return mentioned_teams

    def aggregate(self, documents: List[TextDocument],
                   sentiment_results: List[SentimentResult],
                   window_hours: int = 48
                   ) -> Dict[str, Dict[str, float]]:
        """Aggregate sentiment by team over a time window.

        Returns dict mapping team_id to sentiment metrics:
        {
            "LAL": {
                "mean_sentiment": 0.35,
                "median_sentiment": 0.28,
                "sentiment_std": 0.45,
                "num_mentions": 15,
                "positive_ratio": 0.67,
                "negative_ratio": 0.20,
                "weighted_sentiment": 0.32,  # Weighted by engagement
            }
        }
        """
        from collections import defaultdict
        from datetime import datetime, timedelta, timezone

        cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)

        team_sentiments = defaultdict(list)
        team_weights = defaultdict(list)

        for doc, sentiment in zip(documents, sentiment_results):
            # RSS timestamps are naive while API timestamps are tz-aware;
            # normalize to UTC so the comparison cannot raise TypeError
            published = doc.published_at
            if published.tzinfo is None:
                published = published.replace(tzinfo=timezone.utc)
            if published < cutoff:
                continue

            teams = self.identify_team(doc.text)
            for team in teams:
                team_sentiments[team].append(sentiment.compound_score)

                # Weight by source credibility and engagement
                weight = 1.0
                if doc.source == "beat_reporter":
                    weight = 3.0  # Beat reporters are highest signal
                elif doc.source == "official":
                    weight = 2.0
                engagement = (
                    doc.metadata.get("likes", 0)
                    + doc.metadata.get("retweets", 0) * 2
                )
                weight *= 1 + np.log1p(engagement) / 10

                team_weights[team].append(weight)

        result = {}
        for team, scores in team_sentiments.items():
            scores_arr = np.array(scores)
            weights_arr = np.array(team_weights[team])

            result[team] = {
                "mean_sentiment": float(np.mean(scores_arr)),
                "median_sentiment": float(np.median(scores_arr)),
                "sentiment_std": float(np.std(scores_arr)),
                "num_mentions": len(scores),
                "positive_ratio": float(np.mean(scores_arr > 0.05)),
                "negative_ratio": float(np.mean(scores_arr < -0.05)),
                "weighted_sentiment": float(
                    np.average(scores_arr, weights=weights_arr)
                ),
            }

        return result
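The credibility-times-engagement weighting inside `aggregate()` is worth sanity-checking in isolation. This standalone sketch mirrors its arithmetic: base weights of 3.0 for beat reporters and 2.0 for official sources, with retweets counted double and a logarithmic engagement bump.

```python
import numpy as np


def source_weight(source: str, likes: int = 0, retweets: int = 0) -> float:
    """Mirror of the aggregator's credibility x engagement weight."""
    base = {"beat_reporter": 3.0, "official": 2.0}.get(source, 1.0)
    engagement = likes + 2 * retweets        # retweets count double
    return base * (1 + np.log1p(engagement) / 10)


# A beat reporter with zero engagement still outweighs a viral fan post:
print(round(source_weight("beat_reporter"), 3))                      # 3.0
print(round(source_weight("social_media", likes=500, retweets=100), 3))
```

The `log1p` dampening is deliberate: engagement spans several orders of magnitude, and a linear weight would let one viral post dominate the aggregate.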

Correlation with Betting Markets

Sentiment scores are only useful if they correlate with betting-relevant outcomes. The key question is not whether sentiment predicts who wins, but whether it predicts outcomes better than the market. Specifically, we look for cases where sentiment diverges from market-implied probabilities.

A team with very negative sentiment (due to a recent loss streak) that the market has already priced in is not a signal. But a team with quietly improving sentiment (a key player returning from injury, positive practice reports) that the market has not yet fully adjusted to represents an exploitable edge.

Empirical studies of text sentiment and betting markets point to three recurring patterns:

  1. Extreme negative sentiment following a loss often overshoots, creating value on the losing team's next game.
  2. Injury-related sentiment changes have the most predictive power because injury information takes time to be fully priced in.
  3. Aggregate social media volume (not just sentiment) correlates with public betting action, which can indicate opportunities on the other side.
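Hunting for these divergences starts with converting quoted odds into implied probabilities. The sketch below uses standard American-odds arithmetic; the 0.05 divergence threshold is an illustrative choice, not a tuned one.

```python
def implied_prob(american_odds: int) -> float:
    """Convert American moneyline odds to implied probability (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def sentiment_divergence(sentiment_prob: float, odds: int,
                         threshold: float = 0.05) -> bool:
    """Flag a bet when the sentiment-informed estimate exceeds the
    market-implied probability by more than the threshold."""
    return sentiment_prob - implied_prob(odds) > threshold


print(round(implied_prob(-150), 3))      # 0.6
print(sentiment_divergence(0.68, -150))  # True: 0.68 - 0.60 > 0.05
print(sentiment_divergence(0.62, -150))  # False: already priced in
```

Note that `implied_prob` includes the bookmaker's vig; a production system would de-vig the full market before comparing against model estimates.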

32.2 Injury Report Parsing

Extracting Structured Data from Injury Reports

Injury reports are the most directly actionable text data in sports betting. An official NBA injury report or an NFL practice participation report contains highly structured information that moves betting lines, but it arrives in semi-structured text format that requires parsing.

Here is an example of raw injury report text:

Los Angeles Lakers Injury Report - January 15, 2025
- LeBron James (Left Knee Soreness) - Questionable
- Anthony Davis (Right Ankle Sprain) - Doubtful
- Austin Reaves (Illness) - Probable
- D'Angelo Russell (Right Hamstring Tightness) - Out

This needs to be converted into structured data:

{
    "team": "Los Angeles Lakers",
    "date": "2025-01-15",
    "players": [
        {"name": "LeBron James", "injury": "Left Knee Soreness", "status": "Questionable"},
        {"name": "Anthony Davis", "injury": "Right Ankle Sprain", "status": "Doubtful"},
        {"name": "Austin Reaves", "injury": "Illness", "status": "Probable"},
        {"name": "D'Angelo Russell", "injury": "Right Hamstring Tightness", "status": "Out"}
    ]
}

Named Entity Recognition and Status Classification

We use a combination of rule-based patterns and NLP models to parse injury reports reliably.

# nlp/injury_parser.py
import re
import numpy as np
import spacy
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, field
from datetime import datetime
import logging

logger = logging.getLogger(__name__)


@dataclass
class InjuryEntry:
    player_name: str
    team: str
    injury_description: str
    status: str          # "out", "doubtful", "questionable", "probable", "available"
    body_part: Optional[str] = None
    injury_type: Optional[str] = None
    reported_date: Optional[str] = None
    estimated_return: Optional[str] = None
    confidence: float = 1.0


@dataclass
class ParsedInjuryReport:
    team: str
    report_date: str
    entries: List[InjuryEntry] = field(default_factory=list)
    source: str = "unknown"
    raw_text: str = ""


class InjuryReportParser:
    """Parse injury reports into structured data using spaCy NER
    and rule-based patterns."""

    def __init__(self, model_name: str = "en_core_web_lg"):
        self.nlp = spacy.load(model_name)

        # Status keywords and their normalized forms
        self.status_keywords = {
            "out": "out",
            "ruled out": "out",
            "will not play": "out",
            "sidelined": "out",
            "inactive": "out",
            "dnp": "out",
            "did not participate": "out",
            "doubtful": "doubtful",
            "game-time decision": "doubtful",
            "unlikely": "doubtful",
            "questionable": "questionable",
            "uncertain": "questionable",
            "day-to-day": "questionable",
            "50-50": "questionable",
            "probable": "probable",
            "likely to play": "probable",
            "expected to play": "probable",
            "available": "available",
            "cleared": "available",
            "full go": "available",
            "no restrictions": "available",
            "upgraded": "probable",   # Usually positive
            "downgraded": "doubtful",  # Usually negative
        }

        # Body part patterns
        self.body_parts = [
            "knee", "ankle", "hamstring", "groin", "shoulder", "back",
            "hip", "calf", "quadricep", "quad", "foot", "toe", "finger",
            "wrist", "hand", "elbow", "neck", "head", "concussion",
            "achilles", "rib", "oblique", "abdominal", "thigh",
            "shin", "fibula", "tibia", "acl", "mcl",
        ]

        # Injury type patterns
        self.injury_types = [
            "sprain", "strain", "soreness", "tightness", "bruise",
            "contusion", "fracture", "tear", "inflammation",
            "tendinitis", "tendinopathy", "dislocation", "illness",
            "flu", "personal", "rest", "load management", "surgery",
            "bone bruise", "stress fracture", "hyperextension",
        ]

        # Compile regex for structured injury report lines
        self.line_pattern = re.compile(
            r"[-*•]\s*"                          # Bullet point
            r"([A-Z][a-zA-Z'\-.\s]+?)"           # Player name
            r"\s*\(([^)]+)\)\s*[-–—:]\s*"        # Injury description in parens
            r"(Out|Doubtful|Questionable|Probable|Available|Day-to-Day)",
            re.IGNORECASE,
        )

    def parse_structured_report(self, text: str, team: str = "",
                                  date: str = "") -> ParsedInjuryReport:
        """Parse a formally structured injury report (bullet-point format)."""
        report = ParsedInjuryReport(
            team=team,
            report_date=date or datetime.utcnow().strftime("%Y-%m-%d"),
            raw_text=text,
        )

        for match in self.line_pattern.finditer(text):
            player_name = match.group(1).strip()
            injury_desc = match.group(2).strip()
            status_raw = match.group(3).strip()

            status = self.status_keywords.get(
                status_raw.lower(), status_raw.lower()
            )

            body_part = self._extract_body_part(injury_desc)
            injury_type = self._extract_injury_type(injury_desc)

            report.entries.append(InjuryEntry(
                player_name=player_name,
                team=team,
                injury_description=injury_desc,
                status=status,
                body_part=body_part,
                injury_type=injury_type,
                reported_date=report.report_date,
                confidence=0.95,  # High confidence for structured reports
            ))

        logger.info(
            f"Parsed {len(report.entries)} entries from "
            f"structured report for {team}"
        )
        return report

    def parse_freeform_text(self, text: str, team: str = ""
                             ) -> List[InjuryEntry]:
        """Extract injury information from unstructured text
        (news articles, tweets).

        Uses spaCy NER to identify player names and rule-based
        patterns to extract injury and status information.
        """
        doc = self.nlp(text)
        entries = []

        # Find all person entities
        person_entities = [
            ent for ent in doc.ents if ent.label_ == "PERSON"
        ]

        for person in person_entities:
            # Look in the surrounding context (sentence or nearby text)
            # for injury and status information
            sent = person.sent
            sent_text = sent.text.lower()

            # Check if the sentence mentions an injury-related term
            has_injury_context = any(
                term in sent_text
                for term in [
                    "injur", "hurt", "miss", "out", "doubtful",
                    "questionable", "probable", "sidelined", "sprain",
                    "strain", "surgery", "return", "cleared", "day-to-day",
                    "health", "practice",
                ]
            )

            if not has_injury_context:
                continue

            # Extract status
            status = self._extract_status(sent_text)
            body_part = self._extract_body_part(sent_text)
            injury_type = self._extract_injury_type(sent_text)

            if status:
                entries.append(InjuryEntry(
                    player_name=person.text,
                    team=team,
                    injury_description=sent.text.strip(),
                    status=status,
                    body_part=body_part,
                    injury_type=injury_type,
                    reported_date=datetime.utcnow().strftime("%Y-%m-%d"),
                    confidence=0.7,  # Lower confidence for freeform
                ))

        logger.info(
            f"Extracted {len(entries)} injury entries from freeform text"
        )
        return entries

    def _extract_status(self, text: str) -> Optional[str]:
        """Extract injury status from text."""
        text_lower = text.lower()

        # Check for explicit status keywords (longest match first), using
        # word boundaries so e.g. "out" cannot match inside "scouting"
        sorted_keywords = sorted(
            self.status_keywords.keys(), key=len, reverse=True
        )
        for keyword in sorted_keywords:
            if re.search(rf"\b{re.escape(keyword)}\b", text_lower):
                return self.status_keywords[keyword]

        # Inference from context
        negative_indicators = ["miss", "won't play", "will not play",
                                "not expected", "held out"]
        positive_indicators = ["return", "cleared", "back in",
                                "practiced", "full participant"]

        for indicator in negative_indicators:
            if indicator in text_lower:
                return "out"
        for indicator in positive_indicators:
            if indicator in text_lower:
                return "available"

        return None

    def _extract_body_part(self, text: str) -> Optional[str]:
        """Extract injured body part from text."""
        text_lower = text.lower()
        for part in self.body_parts:
            # Word-boundary match so "hip" cannot fire inside "championship"
            match = re.search(rf"\b{re.escape(part)}\b", text_lower)
            if match:
                # Check for left/right modifier
                prefix = text_lower[max(0, match.start() - 10):match.start()]
                side = ""
                if "left" in prefix:
                    side = "left "
                elif "right" in prefix:
                    side = "right "
                return f"{side}{part}"
        return None

    def _extract_injury_type(self, text: str) -> Optional[str]:
        """Extract injury type from text."""
        text_lower = text.lower()
        for injury_type in self.injury_types:
            if re.search(rf"\b{re.escape(injury_type)}\b", text_lower):
                return injury_type
        return None


class InjuryImpactEstimator:
    """Estimate the betting market impact of injury status changes."""

    def __init__(self, player_values: Dict[str, float] = None):
        """
        player_values maps player names to their approximate
        win-share impact per game (e.g., from advanced stats).
        Higher values indicate more impactful players.
        """
        self.player_values = player_values or {}

        # Status-to-probability-of-playing mapping
        self.status_play_probs = {
            "out": 0.0,
            "doubtful": 0.15,
            "questionable": 0.50,
            "day-to-day": 0.55,
            "probable": 0.85,
            "available": 1.0,
        }

    def estimate_impact(self, entries: List[InjuryEntry],
                         team_id: str) -> Dict[str, float]:
        """Estimate the net impact of injuries on a team.

        Returns metrics useful as betting model features:
        - expected_missing_value: Expected win-share value of missing players
        - injury_severity_score: Weighted injury severity
        - star_player_risk: Risk of losing a top player
        """
        expected_missing = 0.0
        severity_scores = []
        star_risk = 0.0

        for entry in entries:
            if entry.team != team_id and team_id:
                continue

            player_val = self.player_values.get(entry.player_name, 0.5)
            play_prob = self.status_play_probs.get(entry.status, 0.5)
            miss_prob = 1.0 - play_prob

            expected_missing += player_val * miss_prob

            # Severity score (weighted by player value)
            severity = miss_prob * player_val
            severity_scores.append(severity)

            # Star player risk (top-tier player uncertain)
            if player_val > 2.0 and play_prob < 0.9:
                star_risk = max(star_risk, miss_prob * player_val)

        return {
            "expected_missing_value": expected_missing,
            "injury_severity_score": sum(severity_scores),
            "num_injured_players": len(entries),
            "star_player_risk": star_risk,
            "avg_play_probability": np.mean([
                self.status_play_probs.get(e.status, 0.5)
                for e in entries
            ]) if entries else 1.0,
        }
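
The status-to-probability mapping above turns an injury report into an expected value. A minimal standalone sketch of that arithmetic, using hypothetical player names and win-share values:

```python
# Expected missing value: sum of player_value * P(miss) over the report.
# Player names, values, and statuses below are hypothetical.
status_play_probs = {
    "out": 0.0, "doubtful": 0.15, "questionable": 0.50,
    "day-to-day": 0.55, "probable": 0.85, "available": 1.0,
}

report = [
    ("Star Guard", 3.0, "questionable"),  # high-value player, 50/50
    ("Role Player", 0.8, "probable"),     # low-value, likely plays
]

expected_missing = sum(
    value * (1.0 - status_play_probs[status])
    for _, value, status in report
)
print(round(expected_missing, 2))  # 3.0*0.50 + 0.8*0.15 = 1.62
```

Note how the 50/50 star dominates the total: uncertainty about one high-value player moves this feature far more than certainty about a role player.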

32.3 News Impact Quantification

Event Detection

Not all news is equally important. A coaching change is more impactful than a routine practice report. An event detection system classifies incoming text into event categories and estimates their significance.

# nlp/event_detection.py
import re
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
from enum import Enum
import numpy as np
import logging

logger = logging.getLogger(__name__)


class EventType(Enum):
    INJURY_UPDATE = "injury_update"
    TRADE = "trade"
    COACHING_CHANGE = "coaching_change"
    SUSPENSION = "suspension"
    LINEUP_CHANGE = "lineup_change"
    WEATHER = "weather"
    PLAYER_RETURN = "player_return"
    CONTRACT = "contract"
    CONTROVERSY = "controversy"
    GENERAL_NEWS = "general_news"


@dataclass
class DetectedEvent:
    event_type: EventType
    headline: str
    entities: List[str]        # Teams and players involved
    significance: float        # 0 to 1 scale
    confidence: float          # Confidence in classification
    expected_line_move: float  # Estimated points of line movement
    timestamp: str
    source_text: str


class EventDetector:
    """Detect and classify sports events from text."""

    def __init__(self):
        # Event type detection patterns
        self.patterns = {
            EventType.INJURY_UPDATE: [
                r"injur(y|ed|ies)",
                r"(out|doubtful|questionable|probable)\s+(for|vs|against)",
                r"(knee|ankle|hamstring|shoulder|concussion)",
                r"(ruled out|sidelined|day-to-day|miss(es|ing)?)",
                r"MRI|X-ray|scan|surgery",
                r"(limited|full|did not)\s+participa",
            ],
            EventType.TRADE: [
                r"trad(e[ds]?|ing)\s+(to|from|for|with)",
                r"acquir(e[ds]?|ing)",
                r"(deal|swap|package|pick[s]?|draft)",
                r"trade deadline",
                r"waiv(e[ds]?|er)",
            ],
            EventType.COACHING_CHANGE: [
                r"(fir(e[ds]?|ing)|dismiss(ed)?)\s+(coach|manager|head)",
                r"(hir(e[ds]?|ing)|appoint(ed)?)\s+(as|new)\s+(coach|manager)",
                r"coaching\s+change",
                r"interim\s+(coach|manager|head)",
                r"resign(s|ed|ation)",
            ],
            EventType.SUSPENSION: [
                r"suspend(ed|sion)",
                r"ban(ned)?",
                r"disciplin(e[ds]?|ary)",
                r"violation",
                r"PED|substance",
            ],
            EventType.LINEUP_CHANGE: [
                r"start(ing|s|er)",
                r"bench(ed|ing)?",
                r"lineup",
                r"rotation",
                r"minutes\s+(restriction|limit)",
            ],
            EventType.PLAYER_RETURN: [
                r"return(s|ing|ed)\s+(to|from)",
                r"(back|cleared)\s+(in|to|for)\s+(action|play|lineup)",
                r"activated\s+from",
                r"off\s+(injured|injury)\s+(list|reserve)",
                r"full\s+(practice|participation)",
            ],
            EventType.WEATHER: [
                r"(rain|snow|wind|cold|heat|weather)",
                r"(dome|indoor|outdoor|roof)",
                r"(delay|postpone|cancel)",
                r"(mph|degrees|temperature)",
            ],
        }

        # Compile patterns
        self.compiled_patterns = {
            event_type: [re.compile(p, re.IGNORECASE) for p in patterns]
            for event_type, patterns in self.patterns.items()
        }

        # Base significance by event type
        self.base_significance = {
            EventType.INJURY_UPDATE: 0.7,
            EventType.TRADE: 0.8,
            EventType.COACHING_CHANGE: 0.9,
            EventType.SUSPENSION: 0.6,
            EventType.LINEUP_CHANGE: 0.5,
            EventType.PLAYER_RETURN: 0.7,
            EventType.WEATHER: 0.3,
            EventType.CONTRACT: 0.2,
            EventType.CONTROVERSY: 0.4,
            EventType.GENERAL_NEWS: 0.1,
        }

    def detect(self, text: str, timestamp: str = ""
               ) -> List[DetectedEvent]:
        """Detect events in text."""
        events = []

        # Score each event type
        type_scores = {}
        for event_type, patterns in self.compiled_patterns.items():
            matches = sum(
                1 for p in patterns if p.search(text)
            )
            if matches > 0:
                type_scores[event_type] = matches / len(patterns)

        if not type_scores:
            return [DetectedEvent(
                event_type=EventType.GENERAL_NEWS,
                headline=text[:100],
                entities=[],
                significance=0.1,
                confidence=0.5,
                expected_line_move=0.0,
                timestamp=timestamp,
                source_text=text,
            )]

        # Take the highest-scoring event type
        best_type = max(type_scores, key=type_scores.get)
        confidence = min(type_scores[best_type] * 2, 1.0)

        significance = self.base_significance.get(best_type, 0.1)

        # Adjust significance based on text cues
        if any(word in text.lower() for word in
               ["breaking", "just in", "sources say", "confirmed"]):
            significance = min(significance + 0.2, 1.0)

        # Estimate line movement
        line_move = self._estimate_line_move(best_type, text)

        events.append(DetectedEvent(
            event_type=best_type,
            headline=text[:150],
            entities=self._extract_entities(text),
            significance=significance,
            confidence=confidence,
            expected_line_move=line_move,
            timestamp=timestamp,
            source_text=text,
        ))

        return events

    def _estimate_line_move(self, event_type: EventType,
                             text: str) -> float:
        """Estimate expected line movement in points.

        This is a rough heuristic. More accurate estimation
        requires historical data on similar events.
        """
        base_moves = {
            EventType.INJURY_UPDATE: 2.0,
            EventType.TRADE: 1.5,
            EventType.COACHING_CHANGE: 3.0,
            EventType.SUSPENSION: 1.5,
            EventType.LINEUP_CHANGE: 1.0,
            EventType.PLAYER_RETURN: 2.0,
            EventType.WEATHER: 0.5,
            EventType.GENERAL_NEWS: 0.0,
        }

        move = base_moves.get(event_type, 0.0)

        # Adjust for severity cues
        text_lower = text.lower()
        if any(w in text_lower for w in
               ["star", "all-star", "mvp", "starter"]):
            move *= 1.5
        if any(w in text_lower for w in
               ["season-ending", "torn", "surgery", "acl", "achilles"]):
            move *= 2.0
        if any(w in text_lower for w in
               ["minor", "precautionary", "rest"]):
            move *= 0.5

        return round(move, 1)

    def _extract_entities(self, text: str) -> List[str]:
        """Extract team and player names via spaCy NER (simplified)."""
        # Load the model once and cache it; reloading on every call
        # would dominate runtime. In production, prefer a custom NER
        # model trained on sports text.
        if not hasattr(self, "_nlp"):
            try:
                import spacy
                self._nlp = spacy.load("en_core_web_sm")
            except (ImportError, OSError):
                logger.warning("spaCy model unavailable; skipping NER")
                self._nlp = None
        if self._nlp is None:
            return []
        doc = self._nlp(text)
        return [
            ent.text for ent in doc.ents
            if ent.label_ in ("PERSON", "ORG")
        ]

Measuring Market Reaction to News

To build a news impact model, you need historical data linking news events to subsequent line movements. This requires capturing the betting line before and after news breaks.

# nlp/news_impact.py
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
from datetime import datetime, timedelta
from scipy import stats
import logging

logger = logging.getLogger(__name__)


class NewsImpactModel:
    """Quantify the historical impact of news events on betting lines.

    Collects pairs of (news_event, line_movement) to build a model
    of how different types of news affect the market.
    """

    def __init__(self):
        self.impact_data = []

    def record_event_impact(self, event_type: str,
                              player_name: str,
                              team: str,
                              pre_event_line: float,
                              post_event_line: float,
                              time_to_react_minutes: int,
                              metadata: Dict = None):
        """Record the market impact of a news event."""
        line_movement = post_event_line - pre_event_line
        self.impact_data.append({
            "event_type": event_type,
            "player_name": player_name,
            "team": team,
            "pre_line": pre_event_line,
            "post_line": post_event_line,
            "line_movement": line_movement,
            "abs_line_movement": abs(line_movement),
            "time_to_react": time_to_react_minutes,
            "timestamp": datetime.utcnow().isoformat(),
            **(metadata or {}),
        })

    def analyze_impacts(self) -> pd.DataFrame:
        """Analyze historical news impacts by category."""
        if not self.impact_data:
            return pd.DataFrame()

        df = pd.DataFrame(self.impact_data)

        summary = df.groupby("event_type").agg({
            "abs_line_movement": ["mean", "median", "std", "count"],
            "time_to_react": "mean",
        }).round(2)

        return summary

    def estimate_current_impact(self, event_type: str,
                                  player_value: float = 1.0
                                  ) -> Dict[str, float]:
        """Estimate the expected market impact of a new event.

        Uses historical data on similar events to predict line movement.
        """
        df = pd.DataFrame(self.impact_data)

        if df.empty or event_type not in df["event_type"].values:
            return {
                "expected_move": 0.0,
                "confidence_interval_low": 0.0,
                "confidence_interval_high": 0.0,
                "sample_size": 0,
            }

        similar = df[df["event_type"] == event_type]
        movements = similar["abs_line_movement"].values

        mean_move = float(np.mean(movements))

        # 90% confidence interval around the historical mean movement
        if len(movements) > 1:
            sem = np.std(movements, ddof=1) / np.sqrt(len(movements))
            ci = stats.t.interval(0.90, len(movements) - 1,
                                  loc=mean_move, scale=sem)
        else:
            ci = (mean_move, mean_move)

        # Scale the estimate by the player's value relative to average
        return {
            "expected_move": mean_move * player_value,
            "confidence_interval_low": float(ci[0]) * player_value,
            "confidence_interval_high": float(ci[1]) * player_value,
            "sample_size": len(movements),
            "median_react_time_min": float(similar["time_to_react"].median()),
        }


class LineMovementTracker:
    """Track line movements to detect news-driven shifts.

    Monitors odds feeds and flags significant movements that may
    indicate breaking news, creating opportunities for informed bettors.
    """

    def __init__(self, significant_move_threshold: float = 1.0):
        self.threshold = significant_move_threshold
        self.line_history: Dict[str, List[Tuple[datetime, float]]] = {}

    def update_line(self, game_id: str, current_line: float):
        """Record a new line observation."""
        timestamp = datetime.utcnow()

        if game_id not in self.line_history:
            self.line_history[game_id] = []

        self.line_history[game_id].append((timestamp, current_line))

    def detect_significant_moves(self) -> List[Dict]:
        """Detect games with significant recent line movements."""
        alerts = []

        for game_id, history in self.line_history.items():
            if len(history) < 2:
                continue

            # Check movement in last 30 minutes
            now = datetime.utcnow()
            recent = [
                (t, line) for t, line in history
                if (now - t).total_seconds() < 1800
            ]

            if len(recent) < 2:
                continue

            start_line = recent[0][1]
            current_line = recent[-1][1]
            movement = abs(current_line - start_line)

            if movement >= self.threshold:
                # Calculate velocity (points per minute)
                time_span = (recent[-1][0] - recent[0][0]).total_seconds() / 60
                velocity = movement / max(time_span, 1)

                alerts.append({
                    "game_id": game_id,
                    "start_line": start_line,
                    "current_line": current_line,
                    "movement": current_line - start_line,
                    "abs_movement": movement,
                    "velocity_pts_per_min": velocity,
                    "time_span_minutes": time_span,
                    "detected_at": now.isoformat(),
                })

        return sorted(alerts, key=lambda x: x["abs_movement"], reverse=True)
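
The t-interval used in estimate_current_impact can be sanity-checked in isolation. A minimal sketch with a hypothetical sample of five recorded line movements:

```python
import numpy as np
from scipy import stats

# Hypothetical absolute line movements (points) for one event type
movements = np.array([1.5, 2.0, 2.5, 3.0, 1.0])

mean_move = movements.mean()
# Standard error of the mean, using the sample standard deviation
sem = movements.std(ddof=1) / np.sqrt(len(movements))

# 90% t-interval around the historical mean line movement
low, high = stats.t.interval(0.90, len(movements) - 1,
                             loc=mean_move, scale=sem)
print(round(mean_move, 2), round(low, 2), round(high, 2))
```

With only five observations the interval is wide; the sample_size field returned by estimate_current_impact exists precisely so downstream code can discount thin samples.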

32.4 Large Language Models for Sports Analysis

Using LLM APIs for Sports Analysis

Large Language Models (LLMs) such as GPT-4 and Claude represent a step change in NLP capability. They can understand nuanced sports context, extract information from complex text, and even generate structured analyses. However, they must be used carefully in a betting context---their outputs are probabilistic, can hallucinate facts, and should never be trusted as primary prediction sources.

LLMs are best used for:

1. Information extraction: Parsing complex, unstructured text into structured data.
2. Summarization: Condensing many articles into key points.
3. Contextual analysis: Understanding subtle implications that rule-based systems miss.
4. Feature engineering assistance: Suggesting new features or interpreting unusual data patterns.

LLMs should NOT be used for:

1. Direct probability estimation: LLMs are not calibrated probability estimators.
2. Replacing statistical models: They lack the mathematical rigor of trained ML models.
3. Real-time decision making: API latency makes them unsuitable for time-critical decisions.

# nlp/llm_analysis.py
import json
import time
import hashlib
from typing import Dict, List, Optional, Any
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


class LLMSportsAnalyzer:
    """Use Large Language Models for sports text analysis.

    Supports both OpenAI and Anthropic APIs.
    Includes caching, rate limiting, and structured output parsing.
    """

    def __init__(self, provider: str = "anthropic",
                 api_key: str = "",
                 model: str = None,
                 cache_enabled: bool = True):
        self.provider = provider
        self.api_key = api_key
        self.cache = {} if cache_enabled else None

        if provider == "anthropic":
            import anthropic
            self.client = anthropic.Anthropic(api_key=api_key)
            self.model = model or "claude-sonnet-4-20250514"
        elif provider == "openai":
            import openai
            self.client = openai.OpenAI(api_key=api_key)
            self.model = model or "gpt-4o"
        else:
            raise ValueError(f"Unsupported provider: {provider}")

    def _cache_key(self, prompt: str) -> str:
        return hashlib.md5(prompt.encode()).hexdigest()

    def _call_llm(self, system_prompt: str, user_prompt: str,
                   temperature: float = 0.0,
                   max_tokens: int = 2000) -> str:
        """Make an LLM API call with caching and error handling."""
        # Key the cache on model and temperature as well as the prompts,
        # so calls with different settings do not collide
        full_prompt = f"{self.model}|{temperature}|{system_prompt}\n{user_prompt}"

        # Check cache
        if self.cache is not None:
            cache_key = self._cache_key(full_prompt)
            if cache_key in self.cache:
                return self.cache[cache_key]

        try:
            if self.provider == "anthropic":
                message = self.client.messages.create(
                    model=self.model,
                    max_tokens=max_tokens,
                    temperature=temperature,
                    system=system_prompt,
                    messages=[
                        {"role": "user", "content": user_prompt}
                    ],
                )
                response_text = message.content[0].text

            elif self.provider == "openai":
                completion = self.client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_prompt},
                    ],
                    temperature=temperature,
                    max_tokens=max_tokens,
                )
                response_text = completion.choices[0].message.content

            # Cache the result
            if self.cache is not None:
                self.cache[self._cache_key(full_prompt)] = response_text

            return response_text

        except Exception as e:
            logger.error(f"LLM API call failed: {e}")
            return ""

    def extract_injury_info(self, text: str) -> List[Dict]:
        """Use LLM to extract structured injury data from complex text."""
        system_prompt = """You are a sports injury data extraction system.
Extract ALL injury-related information from the given text and return
it as a JSON array. Each entry should have these fields:
- player_name: Full name of the player
- team: Team name if mentioned
- injury: Description of the injury
- body_part: Specific body part if mentioned
- status: One of "out", "doubtful", "questionable", "probable", "available", or "unknown"
- severity: One of "minor", "moderate", "severe", or "unknown"
- context: Brief explanation of why you classified it this way

Return ONLY valid JSON. If no injury information is found, return an empty array []."""

        user_prompt = f"Extract injury information from this text:\n\n{text}"

        response = self._call_llm(system_prompt, user_prompt)

        try:
            # Extract JSON from response (handle markdown code blocks)
            json_text = response
            if "```json" in response:
                json_text = response.split("```json")[1].split("```")[0]
            elif "```" in response:
                json_text = response.split("```")[1].split("```")[0]
            return json.loads(json_text.strip())
        except (json.JSONDecodeError, IndexError):
            logger.warning(f"Failed to parse LLM response as JSON: {response[:200]}")
            return []

    def analyze_matchup(self, home_team: str, away_team: str,
                         context_texts: List[str]) -> Dict:
        """Use LLM to generate a qualitative matchup analysis.

        This is NOT for probability estimation. It extracts qualitative
        factors that might be missed by statistical models.
        """
        combined_context = "\n---\n".join(context_texts[:10])  # Limit context

        system_prompt = """You are a sports analyst assistant. Analyze the
provided matchup context and identify QUALITATIVE factors that might
affect the game outcome. Focus on factors that statistical models
might miss: team chemistry, scheduling spots, motivation, coaching
matchups, recent trends in play style, etc.

Return a JSON object with these fields:
- home_advantages: List of qualitative advantages for the home team
- away_advantages: List of qualitative advantages for the away team
- key_uncertainties: List of uncertain factors that could swing either way
- narrative_factors: Any narrative or psychological factors (revenge game,
  rivalry, etc.)
- confidence: "low", "medium", or "high" in your analysis

IMPORTANT: Do NOT provide win probabilities or predictions. Focus only
on qualitative analysis. Return ONLY valid JSON."""

        user_prompt = f"""Matchup: {home_team} (home) vs {away_team} (away)

Recent news and context:
{combined_context}"""

        response = self._call_llm(system_prompt, user_prompt,
                                   temperature=0.1)

        try:
            json_text = response
            if "```json" in response:
                json_text = response.split("```json")[1].split("```")[0]
            elif "```" in response:
                json_text = response.split("```")[1].split("```")[0]
            return json.loads(json_text.strip())
        except (json.JSONDecodeError, IndexError):
            logger.warning("Failed to parse matchup analysis")
            return {}

    def classify_news_impact(self, headline: str,
                               article_text: str = "") -> Dict:
        """Use LLM to classify the betting impact of a news item."""
        system_prompt = """You are a sports betting news classifier.
Analyze the given news item and classify its potential impact on
betting markets.

Return a JSON object with:
- event_type: One of "injury", "trade", "coaching", "suspension",
  "lineup", "weather", "return", "off_field", "general"
- teams_affected: List of team names affected
- players_affected: List of player names affected
- direction: "positive" or "negative" for the first team mentioned
- magnitude: "negligible", "minor", "moderate", "major", "extreme"
- time_sensitivity: "immediate", "hours", "days", "weeks"
- reasoning: One sentence explaining the classification

Return ONLY valid JSON."""

        user_prompt = f"Headline: {headline}"
        if article_text:
            user_prompt += f"\n\nFull text: {article_text[:1000]}"

        response = self._call_llm(system_prompt, user_prompt)

        try:
            json_text = response
            if "```json" in response:
                json_text = response.split("```json")[1].split("```")[0]
            elif "```" in response:
                json_text = response.split("```")[1].split("```")[0]
            return json.loads(json_text.strip())
        except (json.JSONDecodeError, IndexError):
            return {"event_type": "general", "magnitude": "negligible"}

    def summarize_team_news(self, team: str,
                              articles: List[str]) -> str:
        """Summarize recent team news for human review."""
        combined = "\n---\n".join(articles[:15])

        system_prompt = f"""Summarize the most important recent news about
{team} as it relates to their upcoming games. Focus on:
1. Injury updates and player availability
2. Recent performance trends
3. Roster or coaching changes
4. Any other factors that could affect game outcomes

Be concise (3-5 bullet points). Include only factual information
from the provided texts."""

        return self._call_llm(system_prompt, combined, temperature=0.0)

Limitations and Biases

LLMs have several important limitations in the betting context that you must understand:

Recency bias. LLMs weight recent information heavily in their responses. A team that lost their last two games will be described more negatively than their overall season warrants.

Narrative bias. LLMs are trained on human-written text, which is full of narrative fallacies. They may emphasize "momentum," "clutch genes," or "team chemistry" beyond what the evidence supports.

Hallucination. LLMs can fabricate facts, statistics, and even injury reports. Never trust LLM outputs without verification against ground truth data.
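
A cheap guard is to validate every extracted entity against ground-truth data before it enters the pipeline. A minimal sketch, with a hypothetical roster and LLM output:

```python
# Reject LLM-extracted entries whose player is not on a known roster.
# The roster and entries below are hypothetical.
known_roster = {"A. Guard", "B. Forward", "C. Center"}

llm_entries = [
    {"player_name": "A. Guard", "status": "questionable"},
    {"player_name": "Z. Phantom", "status": "out"},  # hallucinated name
]

verified = [e for e in llm_entries if e["player_name"] in known_roster]
rejected = [e for e in llm_entries if e["player_name"] not in known_roster]
print(len(verified), len(rejected))  # 1 1
```

The same pattern applies to team names, game dates, and quoted statistics: anything checkable against structured data should be checked.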

Lack of calibration. If you ask an LLM for a win probability, the number it gives you is not a calibrated probability. It is a language model's best guess at what number a human analyst might say, not a mathematically grounded estimate.
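
You can demonstrate this by scoring stated probabilities against outcomes. A minimal sketch with hypothetical numbers, using the Brier score:

```python
import numpy as np

# Hypothetical LLM-stated win probabilities vs. actual outcomes (1 = win)
stated = np.array([0.70, 0.65, 0.80, 0.55, 0.75, 0.60])
outcomes = np.array([1, 0, 1, 0, 0, 1])

# Brier score: mean squared error of the probabilities (lower is better)
brier = np.mean((stated - outcomes) ** 2)

# Compare against an uninformed 0.5-for-everything baseline
baseline = np.mean((0.5 - outcomes) ** 2)
print(round(brier, 3), round(baseline, 3))
```

In this toy sample the confident-sounding probabilities score worse than the 0.5 baseline, which is exactly the overconfidence failure mode to test for before trusting any model-stated probability.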

Cost and latency. LLM API calls cost money and take time. At scale, the cost of analyzing every piece of sports news with a large model can be significant. Use LLMs selectively, for the highest-value analysis tasks, and use cheaper methods (VADER, rule-based systems) for bulk processing.
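
A common pattern is a tiered pipeline: cheap methods handle the bulk, and only high-significance items are escalated to an LLM. A minimal routing sketch (the thresholds and tier names here are illustrative, not prescriptive):

```python
def route_news_item(significance: float, llm_budget_remaining: int) -> str:
    """Choose a processing tier for a news item by estimated value.

    In a real pipeline the tiers would dispatch to an LLM call, a
    regex/keyword extractor, or bulk VADER-style sentiment scoring.
    """
    if significance >= 0.7 and llm_budget_remaining > 0:
        return "llm"            # expensive, reserved for major events
    if significance >= 0.3:
        return "rule_based"     # regex/keyword extraction
    return "sentiment_only"     # cheap bulk scoring

# Illustrative routing decisions
print(route_news_item(0.9, 10))  # llm
print(route_news_item(0.4, 10))  # rule_based
print(route_news_item(0.1, 10))  # sentiment_only
```

The budget parameter matters: on heavy news days the LLM tier should degrade gracefully to rule-based processing rather than blow through the API budget.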

Inconsistency. The same prompt with the same input can produce different outputs across calls (even with temperature=0 due to batched inference). This makes reproducibility challenging. Always log both inputs and outputs.
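
A minimal sketch of such a call log, keyed by a hash of the prompt (the function and field names are illustrative):

```python
import hashlib
from datetime import datetime, timezone

def log_llm_call(log: list, model: str, prompt: str, response: str) -> dict:
    """Append an auditable record of one LLM call to an in-memory log.

    A production system would write these records to durable storage,
    so every signal derived from an LLM can be traced back to the
    exact prompt and response that produced it.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "response": response,
    }
    log.append(record)
    return record

calls = []
rec = log_llm_call(calls, "example-model", "Summarize this article...",
                   "- key point one")
print(rec["prompt_sha256"][:12])
```

Hashing the prompt makes it easy to group repeated calls and spot divergent responses to identical inputs.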


32.5 Building a News-Driven Signal

Combining NLP Signals into a Betting Feature

The ultimate goal is to distill all NLP analysis into one or more numerical features that can be fed into your betting model alongside traditional statistical features. This requires careful aggregation and normalization.

# nlp/signal_builder.py
import numpy as np
import pandas as pd
from typing import Dict, List, Optional, Tuple
from datetime import datetime, timedelta
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


@dataclass
class NLPSignal:
    """A single NLP-derived betting signal for a team/game."""
    team_id: str
    game_id: str
    game_date: str

    # Sentiment features
    sentiment_score: float          # -1 to 1
    sentiment_momentum: float       # Change in sentiment over time
    sentiment_volume: int           # Number of mentions

    # Injury features
    injury_impact_score: float      # Expected missing player value
    injury_change_score: float      # Net change since last report
    star_availability: float        # Probability key players play

    # News impact features
    recent_news_magnitude: float    # Aggregate news impact
    event_count: int                # Number of significant events

    # LLM-derived features (optional)
    qualitative_edge: float = 0.0   # LLM assessment, normalized

    @property
    def composite_signal(self) -> float:
        """Weighted combination of all NLP signals."""
        weights = {
            "sentiment": 0.15,
            "injury": 0.45,
            "news": 0.25,
            "qualitative": 0.15,
        }

        # Normalize each component to [-1, 1] range
        sentiment_component = np.clip(self.sentiment_score, -1, 1)
        injury_component = np.clip(-self.injury_impact_score / 5.0, -1, 0)
        news_component = np.clip(self.recent_news_magnitude / 3.0, -1, 1)
        qual_component = np.clip(self.qualitative_edge, -1, 1)

        composite = (
            weights["sentiment"] * sentiment_component
            + weights["injury"] * injury_component
            + weights["news"] * news_component
            + weights["qualitative"] * qual_component
        )

        return float(np.clip(composite, -1, 1))


class NLPSignalBuilder:
    """Builds NLP-derived features for each game in the pipeline."""

    def __init__(self, sentiment_analyzer, injury_parser,
                 event_detector, llm_analyzer=None,
                 team_aliases: Dict[str, List[str]] = None):
        self.sentiment = sentiment_analyzer
        self.injuries = injury_parser
        self.events = event_detector
        self.llm = llm_analyzer
        self.team_aliases = team_aliases or {}

        # Historical signals for momentum calculation
        self.signal_history: Dict[str, List[Tuple[str, float]]] = {}

    def build_signal(self, team_id: str, game_id: str,
                      game_date: str,
                      documents: List,      # TextDocument list
                      injury_entries: List,  # InjuryEntry list
                      player_values: Dict[str, float] = None,
                      use_llm: bool = False
                      ) -> NLPSignal:
        """Build a complete NLP signal for a team's upcoming game."""

        # 1. Sentiment analysis
        sentiment_features = self._compute_sentiment_features(
            team_id, documents
        )

        # 2. Injury impact
        injury_features = self._compute_injury_features(
            team_id, injury_entries, player_values or {}
        )

        # 3. News events
        news_features = self._compute_news_features(team_id, documents)

        # 4. Optional LLM analysis
        qual_edge = 0.0
        if use_llm and self.llm:
            qual_edge = self._compute_qualitative_edge(
                team_id, documents
            )

        # 5. Compute sentiment momentum
        momentum = self._compute_momentum(team_id, game_date,
                                           sentiment_features["score"])

        signal = NLPSignal(
            team_id=team_id,
            game_id=game_id,
            game_date=game_date,
            sentiment_score=sentiment_features["score"],
            sentiment_momentum=momentum,
            sentiment_volume=sentiment_features["volume"],
            injury_impact_score=injury_features["impact"],
            injury_change_score=injury_features["change"],
            star_availability=injury_features["star_availability"],
            recent_news_magnitude=news_features["magnitude"],
            event_count=news_features["event_count"],
            qualitative_edge=qual_edge,
        )

        logger.info(
            f"NLP signal for {team_id} ({game_id}): "
            f"composite={signal.composite_signal:.3f}, "
            f"sentiment={signal.sentiment_score:.3f}, "
            f"injury={signal.injury_impact_score:.3f}"
        )

        return signal

    def _compute_sentiment_features(self, team_id: str,
                                      documents: list
                                      ) -> Dict[str, float]:
        """Compute sentiment features from documents."""
        # Filter documents mentioning this team
        team_docs = []
        for doc in documents:
            text_lower = doc.text.lower()
            aliases = self.team_aliases.get(team_id, [team_id])
            if any(alias.lower() in text_lower for alias in aliases):
                team_docs.append(doc)

        if not team_docs:
            return {"score": 0.0, "volume": 0}

        # Analyze sentiment
        sentiments = self.sentiment.analyze_batch(
            [d.text for d in team_docs]
        )

        scores = [s.compound_score for s in sentiments]

        # Weight by source credibility
        weights = []
        for doc in team_docs:
            w = 1.0
            if doc.source == "beat_reporter":
                w = 3.0
            elif doc.source in ("espn", "official"):
                w = 2.0
            weights.append(w)

        weighted_score = np.average(scores, weights=weights)

        return {
            "score": float(weighted_score),
            "volume": len(team_docs),
        }

    def _compute_injury_features(self, team_id: str,
                                   injury_entries: list,
                                   player_values: Dict[str, float]
                                   ) -> Dict[str, float]:
        """Compute injury impact features."""
        from nlp.injury_parser import InjuryImpactEstimator

        estimator = InjuryImpactEstimator(player_values)
        team_entries = [
            e for e in injury_entries if e.team == team_id
        ]

        impact = estimator.estimate_impact(team_entries, team_id)

        return {
            "impact": impact["expected_missing_value"],
            "change": 0.0,  # Computed by comparing to previous report
            "star_availability": impact["avg_play_probability"],
        }

    def _compute_news_features(self, team_id: str,
                                 documents: list) -> Dict[str, float]:
        """Compute news event features."""
        team_docs = []
        for doc in documents:
            text_lower = doc.text.lower()
            aliases = self.team_aliases.get(team_id, [team_id])
            if any(alias.lower() in text_lower for alias in aliases):
                team_docs.append(doc)

        if not team_docs:
            return {"magnitude": 0.0, "event_count": 0}

        total_magnitude = 0.0
        event_count = 0

        for doc in team_docs:
            events = self.events.detect(doc.text)
            for event in events:
                if event.significance > 0.3:  # Only count notable events
                    total_magnitude += (
                        event.significance * event.expected_line_move
                    )
                    event_count += 1

        return {
            "magnitude": total_magnitude,
            "event_count": event_count,
        }

    def _compute_momentum(self, team_id: str, game_date: str,
                            current_sentiment: float) -> float:
        """Compute sentiment momentum (rate of change)."""
        if team_id not in self.signal_history:
            self.signal_history[team_id] = []

        history = self.signal_history[team_id]
        history.append((game_date, current_sentiment))

        # Keep last 30 entries
        if len(history) > 30:
            history = history[-30:]
            self.signal_history[team_id] = history

        # Need at least one entry older than the 3-entry recent window,
        # otherwise the np.mean over an empty slice below returns NaN
        if len(history) < 4:
            return 0.0

        # Simple momentum: difference between recent and older average
        recent = np.mean([s for _, s in history[-3:]])
        older = np.mean([s for _, s in history[:-3]])

        return float(recent - older)

    def _compute_qualitative_edge(self, team_id: str,
                                    documents: list) -> float:
        """Use LLM to estimate qualitative edge (expensive, use sparingly)."""
        if not self.llm:
            return 0.0

        team_texts = [
            d.text for d in documents
            if team_id.lower() in d.text.lower()
        ][:5]

        if not team_texts:
            return 0.0

        try:
            analysis = self.llm.classify_news_impact(
                headline=f"Recent news about {team_id}",
                article_text=" ".join(team_texts),
            )

            magnitude_map = {
                "negligible": 0.0,
                "minor": 0.1,
                "moderate": 0.3,
                "major": 0.6,
                "extreme": 0.9,
            }

            magnitude = magnitude_map.get(
                analysis.get("magnitude", "negligible"), 0.0
            )
            # Map direction explicitly; an unknown or "neutral" direction
            # should zero out the signal rather than count as negative
            direction_map = {"positive": 1.0, "negative": -1.0}
            direction = direction_map.get(analysis.get("direction"), 0.0)

            return float(magnitude * direction)

        except Exception as e:
            logger.warning(f"LLM qualitative analysis failed: {e}")
            return 0.0
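
Because each qualitative-edge call hits a paid API, caching responses by a hash of the inputs keeps repeated analyses free. A minimal sketch of such a cache; the wrapped client and its classify_news_impact method are assumptions based on the usage above, so adapt the interface to your actual LLM client:

```python
# Illustrative LLM response cache keyed by a hash of the prompt inputs.
import hashlib
import json


class CachedLLMClient:
    """Wrap an LLM client so identical requests are answered from memory."""

    def __init__(self, client):
        self.client = client
        self.cache = {}

    def classify_news_impact(self, headline: str, article_text: str) -> dict:
        # Deterministic key over all inputs that affect the response
        key = hashlib.sha256(
            json.dumps([headline, article_text]).encode()
        ).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.client.classify_news_impact(
                headline=headline, article_text=article_text
            )
        return self.cache[key]
```

In production you would back the dictionary with a persistent store so cached analyses survive restarts, and include the model name and prompt version in the key so cache entries invalidate when either changes.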

Integration with Existing Models

NLP signals should not replace your statistical model---they should augment it. The integration can happen in two ways:

  1. Feature injection: Add NLP-derived features directly into your model's feature vector.
  2. Model stacking: Train a separate NLP model and combine its predictions with the main model.

Feature injection is simpler and usually sufficient:

# nlp/integration.py
import numpy as np
import pandas as pd
from typing import Dict, List, Optional
import logging

logger = logging.getLogger(__name__)


class NLPFeatureIntegrator:
    """Integrate NLP signals into the main prediction pipeline."""

    def __init__(self, feature_store, signal_builder):
        self.feature_store = feature_store
        self.signal_builder = signal_builder

    def compute_and_store_features(self, team_id: str,
                                     game_id: str,
                                     game_date: str,
                                     documents: list,
                                     injury_entries: list,
                                     player_values: Optional[Dict[str, float]] = None
                                     ):
        """Compute NLP features and store in the feature store."""
        signal = self.signal_builder.build_signal(
            team_id=team_id,
            game_id=game_id,
            game_date=game_date,
            documents=documents,
            injury_entries=injury_entries,
            player_values=player_values,
        )

        # Convert signal to feature store format
        features_df = pd.DataFrame([{
            "entity_id": team_id,
            "event_date": game_date,
            "nlp_sentiment_score": signal.sentiment_score,
            "nlp_sentiment_momentum": signal.sentiment_momentum,
            "nlp_sentiment_volume": signal.sentiment_volume,
            "nlp_injury_impact": signal.injury_impact_score,
            "nlp_injury_change": signal.injury_change_score,
            "nlp_star_availability": signal.star_availability,
            "nlp_news_magnitude": signal.recent_news_magnitude,
            "nlp_event_count": signal.event_count,
            "nlp_composite_signal": signal.composite_signal,
        }])

        self.feature_store.store_features(
            entity_type="team",
            features_df=features_df,
            feature_version="nlp_v1",
        )

        logger.info(
            f"Stored NLP features for {team_id} on {game_date}"
        )

        return signal

    @staticmethod
    def get_nlp_feature_names() -> List[str]:
        """Return the list of NLP feature names for model training."""
        return [
            "nlp_sentiment_score",
            "nlp_sentiment_momentum",
            "nlp_sentiment_volume",
            "nlp_injury_impact",
            "nlp_injury_change",
            "nlp_star_availability",
            "nlp_news_magnitude",
            "nlp_event_count",
            "nlp_composite_signal",
        ]
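
The stacking alternative (option 2 earlier in this section) can be sketched with a logistic meta-model that blends the base model's probability with a probability derived from the NLP features alone. This is an illustrative sketch, not part of the pipeline above, and the class name and interface are hypothetical:

```python
# Illustrative model-stacking blender (hypothetical interface)
import numpy as np
from sklearn.linear_model import LogisticRegression


class StackedNLPBlender:
    """Blend base-model and NLP-model probabilities via a logistic meta-model."""

    def __init__(self):
        self.meta = LogisticRegression()

    def fit(self, base_probs: np.ndarray, nlp_probs: np.ndarray,
            outcomes: np.ndarray) -> "StackedNLPBlender":
        # The two probability streams become the meta-model's features;
        # the meta-model learns how much weight each deserves
        X = np.column_stack([base_probs, nlp_probs])
        self.meta.fit(X, outcomes)
        return self

    def predict_proba(self, base_probs: np.ndarray,
                      nlp_probs: np.ndarray) -> np.ndarray:
        X = np.column_stack([base_probs, nlp_probs])
        return self.meta.predict_proba(X)[:, 1]
```

The meta-model must be fit on out-of-fold predictions from both underlying models; fitting it on in-sample predictions overstates the NLP model's contribution.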

Backtesting NLP Features

Backtesting NLP features is methodologically challenging because you need historical text data aligned with historical betting lines. You cannot simply compute today's NLP features and apply them to historical games---that would be data leakage. You need to reconstruct what the NLP features would have been at the time of each historical bet.

# nlp/backtesting.py
import numpy as np
import pandas as pd
from typing import Dict, List, Tuple
from datetime import datetime
from sklearn.metrics import roc_auc_score, brier_score_loss
import logging

logger = logging.getLogger(__name__)


class NLPBacktester:
    """Backtest NLP features to measure their predictive value.

    Key principle: NLP features must be computed using only information
    available BEFORE each game. This requires historical text data
    with timestamps.
    """

    def __init__(self, feature_store):
        self.feature_store = feature_store

    def evaluate_feature_value(self, nlp_feature_name: str,
                                 base_model_probs: pd.Series,
                                 actual_outcomes: pd.Series,
                                 nlp_feature_values: pd.Series
                                 ) -> Dict:
        """Evaluate whether an NLP feature adds value beyond the base model.

        Compares:
        1. Base model predictions alone
        2. Base model predictions adjusted by NLP feature

        A valuable NLP feature should improve calibration and
        discrimination when added to the base model.
        """
        results = {}

        # Base model metrics
        base_brier = brier_score_loss(actual_outcomes, base_model_probs)
        base_auc = roc_auc_score(actual_outcomes, base_model_probs)

        results["base_brier"] = base_brier
        results["base_auc"] = base_auc

        # Simple adjustment: shift base probability by NLP signal
        # (This is a simplified illustration; in practice, you would
        # retrain the model with and without the NLP feature)
        adjustment_weight = 0.05  # Small weight for NLP adjustment
        adjusted_probs = base_model_probs + adjustment_weight * nlp_feature_values
        adjusted_probs = np.clip(adjusted_probs, 0.01, 0.99)

        adj_brier = brier_score_loss(actual_outcomes, adjusted_probs)
        adj_auc = roc_auc_score(actual_outcomes, adjusted_probs)

        results["adjusted_brier"] = adj_brier
        results["adjusted_auc"] = adj_auc
        results["brier_improvement"] = base_brier - adj_brier
        results["auc_improvement"] = adj_auc - base_auc

        # Feature correlation with residuals
        # If the NLP feature correlates with model errors,
        # it contains information the model is missing
        residuals = actual_outcomes - base_model_probs
        correlation = np.corrcoef(nlp_feature_values, residuals)[0, 1]
        results["residual_correlation"] = correlation

        # Profitability test: would betting based on NLP signal be profitable?
        # Look at cases where NLP signal strongly disagrees with market
        strong_signal = np.abs(nlp_feature_values) > nlp_feature_values.std()
        if strong_signal.sum() > 10:
            results["strong_signal_count"] = int(strong_signal.sum())
            # Accuracy of the signal's sign as a predictor of the outcome
            results["strong_signal_accuracy"] = float(
                ((nlp_feature_values[strong_signal] > 0) ==
                 (actual_outcomes[strong_signal] == 1)).mean()
            )

        # Statistical significance test
        from scipy.stats import ttest_rel
        if len(base_model_probs) > 30:
            base_errors = (actual_outcomes - base_model_probs) ** 2
            adj_errors = (actual_outcomes - adjusted_probs) ** 2
            t_stat, p_value = ttest_rel(base_errors, adj_errors)
            results["improvement_p_value"] = float(p_value)
            results["statistically_significant"] = p_value < 0.05

        logger.info(
            f"NLP feature '{nlp_feature_name}': "
            f"Brier improvement={results['brier_improvement']:.4f}, "
            f"residual corr={results['residual_correlation']:.4f}"
        )

        return results

    def run_walk_forward_test(self, games_df: pd.DataFrame,
                                nlp_features: List[str],
                                target_col: str = "home_win",
                                base_features: List[str] = None,
                                train_window_days: int = 365,
                                test_window_days: int = 30
                                ) -> pd.DataFrame:
        """Walk-forward backtest comparing model with and without NLP features.

        This is the gold standard for evaluating whether NLP features
        genuinely improve predictions out of sample.
        """
        from sklearn.ensemble import GradientBoostingClassifier

        if base_features is None:
            raise ValueError("base_features is required for the comparison")

        games_df = games_df.sort_values("game_date").reset_index(drop=True)
        dates = pd.to_datetime(games_df["game_date"])

        results = []
        start_date = dates.min() + pd.Timedelta(days=train_window_days)
        end_date = dates.max()
        current_date = start_date

        while current_date < end_date:
            test_end = current_date + pd.Timedelta(days=test_window_days)

            train_mask = (dates < current_date) & (
                dates >= current_date - pd.Timedelta(days=train_window_days)
            )
            test_mask = (dates >= current_date) & (dates < test_end)

            if train_mask.sum() < 100 or test_mask.sum() < 10:
                current_date = test_end
                continue

            train_data = games_df[train_mask]
            test_data = games_df[test_mask]

            # Model WITHOUT NLP features
            X_train_base = train_data[base_features].fillna(0)
            X_test_base = test_data[base_features].fillna(0)
            y_train = train_data[target_col]
            y_test = test_data[target_col]

            model_base = GradientBoostingClassifier(
                n_estimators=100, max_depth=4, random_state=42
            )
            model_base.fit(X_train_base, y_train)
            probs_base = model_base.predict_proba(X_test_base)[:, 1]

            # Model WITH NLP features
            all_features = base_features + nlp_features
            X_train_nlp = train_data[all_features].fillna(0)
            X_test_nlp = test_data[all_features].fillna(0)

            model_nlp = GradientBoostingClassifier(
                n_estimators=100, max_depth=4, random_state=42
            )
            model_nlp.fit(X_train_nlp, y_train)
            probs_nlp = model_nlp.predict_proba(X_test_nlp)[:, 1]

            # Compare
            brier_base = brier_score_loss(y_test, probs_base)
            brier_nlp = brier_score_loss(y_test, probs_nlp)

            results.append({
                "period_start": current_date.strftime("%Y-%m-%d"),
                "period_end": test_end.strftime("%Y-%m-%d"),
                "n_test": len(y_test),
                "brier_base": brier_base,
                "brier_with_nlp": brier_nlp,
                "brier_improvement": brier_base - brier_nlp,
                "improvement_pct": (
                    (brier_base - brier_nlp) / brier_base * 100
                ),
            })

            current_date = test_end

        results_df = pd.DataFrame(results)

        if not results_df.empty:
            avg_improvement = results_df["brier_improvement"].mean()
            pct_improved = (results_df["brier_improvement"] > 0).mean()
            logger.info(
                f"Walk-forward backtest complete: "
                f"Avg Brier improvement={avg_improvement:.4f}, "
                f"Improved in {pct_improved:.0%} of periods"
            )

        return results_df
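
The rolling windows in run_walk_forward_test reduce to simple date arithmetic. A standalone sketch with hypothetical daily game dates, showing how many 365-day training and 30-day test folds a season and a half yields:

```python
import pandas as pd

# Hypothetical daily game dates spanning a season and a half
dates = pd.date_range("2022-01-01", "2023-06-30", freq="D")
train_window = pd.Timedelta(days=365)
test_window = pd.Timedelta(days=30)

current = dates.min() + train_window
windows = []
while current < dates.max():
    test_end = current + test_window
    # Train strictly before the test period; test on the next 30 days
    train_mask = (dates >= current - train_window) & (dates < current)
    test_mask = (dates >= current) & (dates < test_end)
    windows.append((int(train_mask.sum()), int(test_mask.sum())))
    current = test_end

print(f"{len(windows)} folds, first fold sizes: {windows[0]}")
```

Each fold trains only on dates before its test period, which is what prevents look-ahead leakage in the full backtester above.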

32.6 Chapter Summary

This chapter explored how Natural Language Processing techniques can extract predictive signals from the vast ocean of sports text data. The key insights and takeaways are:

Sentiment analysis provides a noisy but real signal. Sports media sentiment, especially from beat reporters, correlates with factors that statistical models miss: team morale, behind-the-scenes dynamics, and emerging trends. VADER provides a fast baseline, while transformer-based models offer higher accuracy at greater computational cost. The most valuable sentiment signal comes from changes in sentiment, not absolute levels, because the market has already priced in persistent conditions.

Injury report parsing is the highest-value NLP application. Injuries are the single largest driver of betting line movements, and the information arrives in semi-structured text that can be systematically parsed. A combination of regex patterns, spaCy NER, and rule-based classification handles the structured reports that leagues publish, while LLMs can extract injury information from unstructured news articles and social media. The injury impact estimator translates parsing results into quantitative features by combining player status with player value metrics.

News event detection and impact quantification close the loop. By classifying news events by type and tracking historical line reactions to similar events, you can estimate the expected market impact of new information. The line movement tracker monitors for sharp moves that may indicate breaking news, creating time-sensitive opportunities.

Large Language Models are powerful tools with important limitations. LLMs excel at information extraction, classification, and summarization. They fail at probability estimation and are prone to hallucination and narrative bias. Use them as sophisticated parsing tools, not as oracles. Cache responses, log everything, and verify outputs against ground truth. The cost of LLM API calls requires judicious use---reserve them for high-value analyses where simpler methods fall short.

NLP features must be backtested rigorously. The walk-forward backtest is the gold standard. Compare a model with and without NLP features over multiple out-of-sample periods. A genuine NLP edge should consistently improve Brier scores and show statistically significant improvement. Many NLP signals that look promising in-sample fail to generalize. Guard against overfitting by using simple, interpretable NLP features (composite sentiment score, injury impact score) rather than high-dimensional text embeddings.

Integration is additive, not substitutive. NLP features augment your statistical model; they do not replace it. The composite NLP signal enters the model as one (or a few) features alongside dozens of statistical features. Its contribution to the overall prediction is typically small---a few percentage points of improvement---but in betting markets, small edges compound into significant long-term profits.

The marriage of NLP with traditional sports analytics represents one of the most promising frontiers in quantitative sports betting. As language models continue to improve, the ability to systematically process and quantify textual information will become an increasingly important source of competitive advantage.