Chapter 24: NLP and Sentiment Analysis

"The market is a discounting mechanism for all known information -- but text is where that information lives before it becomes known."

Prediction markets exist at the intersection of human beliefs and real-world events. The raw material of those beliefs -- news articles, social media posts, political speeches, earnings calls, expert commentary -- is overwhelmingly textual. A trader who can systematically extract signal from this torrent of text holds a structural advantage over one who relies on manual reading alone.

This chapter bridges the gap between the machine learning foundations of Chapter 23 and the unstructured world of natural language. We will build a complete NLP toolkit for prediction market trading: from classical text preprocessing through modern transformer architectures to the frontier question of whether large language models can themselves serve as forecasters. Every technique is grounded in practical, runnable Python code and framed around the unique demands of prediction market analysis.


24.1 Why Text Data Matters for Prediction Markets

24.1.1 News Drives Markets

Prediction markets are, at their core, aggregators of information. When a breaking news story changes the probability of an event, it does so because traders read that story, update their beliefs, and place trades. The causal chain is:

$$ \text{Event occurs} \rightarrow \text{Text published} \rightarrow \text{Traders read text} \rightarrow \text{Beliefs update} \rightarrow \text{Prices move} $$

This chain implies a latency between text publication and price movement. For liquid markets like Polymarket or Kalshi, this latency can be seconds to minutes for major breaking news. For less liquid markets, it can be hours or days. That latency is your opportunity.

Consider some concrete examples:

  • Political markets: A poll showing a 5-point shift in a swing state will move presidential election markets. The poll is published as a news article or social media post before it is reflected in market prices.
  • Regulatory markets: An SEC commissioner's speech hinting at new cryptocurrency regulations contains signal about "Will the SEC approve a Bitcoin ETF?" markets.
  • Geopolitical markets: Diplomatic communiques, military movements reported by journalists, and satellite imagery analyses all appear as text before geopolitical event markets adjust.
  • Economic markets: Fed minutes, earnings call transcripts, and economic commentary contain forward-looking language that precedes market movements.

24.1.2 Text as a Leading Indicator

Empirical research consistently shows that textual information leads market price movements. A seminal study by Tetlock (2007) demonstrated that the pessimism in Wall Street Journal columns predicted negative market returns and increased trading volume. Loughran and McDonald (2011) showed that the tone of 10-K filings predicted stock return volatility.

For prediction markets specifically, text is an even more powerful leading indicator because:

  1. Information asymmetry is higher: Unlike equities where thousands of analysts parse every data point, many prediction market questions have thin analyst coverage. A trader with an NLP pipeline monitoring relevant news has a larger edge.
  2. The connection between text and outcome is more direct: A news article about a candidate's scandal directly pertains to "Will candidate X win?" -- unlike the more tenuous connection between a news article and a stock price.
  3. Markets are thinner: Lower liquidity means that it takes longer for information to be fully reflected in prices, extending the window during which text-derived signals are profitable.

24.1.3 Sentiment as a Feature

Sentiment analysis transforms unstructured text into a numerical feature that can be fed into the predictive models we built in Chapter 23. At its simplest, sentiment is a single number on a scale from -1 (very negative) to +1 (very positive). More sophisticated approaches produce multi-dimensional representations:

  • Polarity: How positive or negative is the text?
  • Subjectivity: Is the text factual reporting or opinion?
  • Intensity: How strong is the expressed sentiment?
  • Aspect-specific sentiment: What is the sentiment toward specific entities or topics mentioned in the text?
  • Uncertainty: Does the text express confidence or doubt?

When aggregated over time, sentiment features capture the evolving information environment around a prediction market question. A sustained shift toward negative sentiment about a political candidate, even before any single decisive event, can predict a gradual decline in that candidate's market price.

24.1.4 The NLP Opportunity for Prediction Market Traders

The NLP opportunity in prediction markets is larger than in traditional financial markets for several reasons:

  1. Less competition from sophisticated NLP systems: Major hedge funds deploy enormous NLP infrastructure for equity trading. Prediction markets see far less automated text analysis, meaning the marginal value of even basic NLP is higher.
  2. Diverse text sources: Prediction market questions span politics, sports, entertainment, science, and more. This diversity means that general-purpose NLP models are particularly valuable -- they can extract signal across many domains.
  3. Rapid question lifecycle: New prediction market questions are created frequently, often in response to current events. An NLP system that can quickly assess the information environment around a new question provides an immediate advantage.
  4. Clear outcome labels: Unlike stock returns, prediction market outcomes are binary. This makes it easier to evaluate whether text-derived features actually improve forecasts, and it simplifies the supervised learning setup for training custom models.

The remainder of this chapter builds the technical infrastructure to exploit these opportunities.


24.2 Text Preprocessing Pipeline

Before any NLP model can extract meaning from text, the raw text must be cleaned and standardized. This preprocessing step is critical: garbage in, garbage out. The specific preprocessing steps depend on the downstream model -- classical models require more aggressive preprocessing than modern transformers -- but understanding the full pipeline is essential.

24.2.1 Tokenization

Tokenization is the process of splitting text into discrete units called tokens. These tokens might be words, subwords, or characters, depending on the approach.

Word tokenization splits on whitespace and punctuation:

"The Fed raised rates by 0.25%" -> ["The", "Fed", "raised", "rates", "by", "0.25", "%"]

Subword tokenization (used by BERT, GPT, and other transformers) splits rare words into common subword pieces:

"cryptocurrency" -> ["crypto", "##currency"]
"geopolitical" -> ["geo", "##political"]

The advantage of subword tokenization is that it handles out-of-vocabulary words gracefully. A model that has never seen "cryptocurrency" during training can still process it as the combination of "crypto" and "currency," both of which it likely has seen.

Sentence tokenization splits text into sentences, which is useful when you want to analyze sentiment at the sentence level:

"The candidate won the primary. However, polls show a tough general election."
-> ["The candidate won the primary.", "However, polls show a tough general election."]

24.2.2 Lowercasing

Converting all text to lowercase reduces vocabulary size and ensures that "Fed," "fed," and "FED" are treated as the same token. However, lowercasing destroys information in some cases -- "US" (United States) vs. "us" (pronoun), for example. For classical models, lowercasing is almost always applied. For transformer models, cased and uncased variants exist, and the cased variants often perform better on tasks where capitalization carries meaning (like named entity recognition).

24.2.3 Stopword Removal

Stopwords are high-frequency words that carry little semantic content: "the," "is," "at," "which," "on," etc. Removing them reduces noise for bag-of-words and TF-IDF models. However, stopword removal should be applied judiciously:

  • For sentiment analysis, words like "not," "no," "never" are sometimes in stopword lists but are critical for sentiment (compare "this is good" vs. "this is not good").
  • For transformer models, stopwords should generally NOT be removed, as the model was trained with them and relies on them for understanding syntax and context.

24.2.4 Stemming and Lemmatization

Stemming reduces words to their root form by stripping suffixes:

"running" -> "run"
"election" -> "elect"
"presidential" -> "presidenti"  (imperfect!)

The Porter stemmer and Snowball stemmer are common implementations. Stemming is fast but crude -- it sometimes produces non-words.

Lemmatization reduces words to their dictionary form (lemma) using linguistic knowledge:

"running" -> "run"
"better" -> "good"
"elections" -> "election"

Lemmatization is more accurate but slower. For prediction market text analysis, lemmatization is generally preferred when using classical models, because prediction market text often contains domain-specific terms where crude stemming can destroy meaning.
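A quick comparison with NLTK (a minimal sketch; the WordNet corpus must be downloaded first, and the lemmatizer needs a part-of-speech hint to do its best work):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))        # 'run'
print(stemmer.stem("presidential"))   # 'presidenti' -- a non-word

# The lemmatizer defaults to treating words as nouns; pass a POS tag
# ('v' = verb, 'a' = adjective) for correct results on other classes.
print(lemmatizer.lemmatize("running", pos="v"))   # 'run'
print(lemmatizer.lemmatize("better", pos="a"))    # 'good'
print(lemmatizer.lemmatize("elections"))          # 'election'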

24.2.5 Text Cleaning for News and Social Media

Real-world text from news articles and social media requires additional cleaning:

  • HTML tag removal: Web-scraped text often contains HTML artifacts.
  • URL removal: Social media posts frequently contain URLs that are not useful for sentiment analysis.
  • Mention removal: Twitter/X @mentions may or may not be relevant.
  • Hashtag processing: #Election2024 should probably become "Election 2024."
  • Emoji handling: Emojis carry sentiment information. They can be converted to text descriptions ("thumbs up," "angry face") or removed.
  • Special character handling: Prediction market text often contains percentages, dollar signs, and other special characters that need standardized handling.
  • Encoding normalization: Convert Unicode characters to ASCII where appropriate ("naive" for "naïve").

24.2.6 Python Text Preprocessing Pipeline

import re
import string
from typing import List, Optional

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

# Download required NLTK data (run once)
# nltk.download('punkt')
# nltk.download('punkt_tab')  # required by newer NLTK releases
# nltk.download('stopwords')
# nltk.download('wordnet')

class TextPreprocessor:
    """
    A configurable text preprocessing pipeline for prediction market
    news and social media text.
    """

    def __init__(
        self,
        lowercase: bool = True,
        remove_stopwords: bool = True,
        lemmatize: bool = True,
        remove_urls: bool = True,
        remove_mentions: bool = True,
        remove_html: bool = True,
        min_word_length: int = 2,
        custom_stopwords: Optional[List[str]] = None,
        preserve_negation: bool = True,
    ):
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        self.remove_urls = remove_urls
        self.remove_mentions = remove_mentions
        self.remove_html = remove_html
        self.min_word_length = min_word_length
        self.preserve_negation = preserve_negation

        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))

        # Preserve negation words even if removing stopwords
        self.negation_words = {
            'not', 'no', 'never', 'neither', 'nor',
            'nobody', 'nothing', 'nowhere', 'hardly',
            'barely', 'scarcely', "n't", "nt"
        }
        if preserve_negation:
            self.stop_words -= self.negation_words

        if custom_stopwords:
            self.stop_words.update(custom_stopwords)

    def clean_html(self, text: str) -> str:
        """Remove HTML tags."""
        return re.sub(r'<[^>]+>', '', text)

    def clean_urls(self, text: str) -> str:
        """Remove URLs."""
        return re.sub(
            r'https?://\S+|www\.\S+', '', text
        )

    def clean_mentions(self, text: str) -> str:
        """Remove @mentions."""
        return re.sub(r'@\w+', '', text)

    def clean_hashtags(self, text: str) -> str:
        """Convert hashtags to words: #Election2024 -> Election 2024."""
        def hashtag_to_words(match):
            tag = match.group(1)
            # Insert space before capital letters
            words = re.sub(r'([A-Z])', r' \1', tag)
            # Insert space before numbers
            words = re.sub(r'(\d+)', r' \1', words)
            return words.strip()
        return re.sub(r'#(\w+)', hashtag_to_words, text)

    def normalize_whitespace(self, text: str) -> str:
        """Collapse multiple whitespace into single spaces."""
        return re.sub(r'\s+', ' ', text).strip()

    def preprocess(self, text: str) -> str:
        """
        Apply the full preprocessing pipeline.

        Parameters
        ----------
        text : str
            Raw input text.

        Returns
        -------
        str
            Cleaned and preprocessed text.
        """
        if self.remove_html:
            text = self.clean_html(text)
        if self.remove_urls:
            text = self.clean_urls(text)
        if self.remove_mentions:
            text = self.clean_mentions(text)

        text = self.clean_hashtags(text)

        if self.lowercase:
            text = text.lower()

        # Tokenize
        tokens = word_tokenize(text)

        # Remove stopwords
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]

        # Lemmatize
        if self.lemmatize:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]

        # Remove short tokens and pure punctuation
        tokens = [
            t for t in tokens
            if len(t) >= self.min_word_length
            and not all(c in string.punctuation for c in t)
        ]

        return ' '.join(tokens)

    def preprocess_batch(self, texts: List[str]) -> List[str]:
        """Preprocess a list of texts."""
        return [self.preprocess(t) for t in texts]


# Example usage
if __name__ == "__main__":
    preprocessor = TextPreprocessor()

    sample_texts = [
        "BREAKING: The Fed just raised interest rates by 0.25%! "
        "https://t.co/abc123 #FedDecision",
        "@SenatorSmith says the new bill will NOT pass the Senate. "
        "Very pessimistic outlook. #Politics2024",
        "<p>Markets are <b>crashing</b> after the announcement.</p>",
    ]

    for raw in sample_texts:
        cleaned = preprocessor.preprocess(raw)
        print(f"RAW:     {raw[:80]}...")
        print(f"CLEANED: {cleaned}")
        print()

This preprocessing pipeline is designed with prediction market text in mind. The preservation of negation words is particularly important: "The bill will NOT pass" and "The bill will pass" have opposite implications for a "Will the bill pass?" market.


24.3 Classical NLP: Bag of Words and TF-IDF

Before diving into deep learning, it is essential to understand classical NLP representations. These methods are fast, interpretable, and surprisingly effective for many prediction market applications.

24.3.1 Bag of Words Representation

The bag of words (BoW) model represents each document as a vector of word counts, ignoring word order entirely. Given a vocabulary of $V$ unique words across all documents, each document $d$ is represented as a vector $\mathbf{x}_d \in \mathbb{R}^V$ where $x_{d,i}$ is the count of word $i$ in document $d$.

For example, consider two sentences:

  • $d_1$: "The market is bullish on the election."
  • $d_2$: "The election market shows bearish sentiment."

The vocabulary is: {the, market, is, bullish, on, election, shows, bearish, sentiment}

$$ \mathbf{x}_{d_1} = [2, 1, 1, 1, 1, 1, 0, 0, 0] $$ $$ \mathbf{x}_{d_2} = [1, 1, 0, 0, 0, 1, 1, 1, 1] $$

The obvious limitation is that "The dog bit the man" and "The man bit the dog" have identical BoW representations. For sentiment analysis, this is often acceptable because the presence of sentiment-bearing words matters more than their exact arrangement.
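The same vectors can be reproduced with scikit-learn's CountVectorizer (a small sketch; note that columns come out in alphabetical vocabulary order rather than the order shown above):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The market is bullish on the election.",
    "The election market shows bearish sentiment.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())  # counts per document, columns in vocabulary order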

24.3.2 TF-IDF Weighting

Raw word counts overweight common words. Term Frequency-Inverse Document Frequency (TF-IDF) addresses this by weighting each word by how informative it is across the document collection.

For a term $t$ in document $d$ within a corpus of $N$ documents:

$$ \text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d} $$

$$ \text{IDF}(t) = \log\left(\frac{N}{\text{number of documents containing } t}\right) $$

$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$

Words that appear in every document (like "the") get a low IDF and thus a low TF-IDF score. Words that appear frequently in a specific document but rarely in others get a high TF-IDF score. For prediction market text, this means that generic words are downweighted while topically distinctive words (candidate names, policy terms, specific events) are upweighted.
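A hand computation makes the weighting concrete (the corpus size and counts below are illustrative):

import math

N = 10_000        # documents in the corpus
n_the = 9_800     # documents containing "the"
n_recount = 25    # documents containing "recount"

tf = 3 / 400      # term appears 3 times in a 400-word article

print(tf * math.log(N / n_the))      # ~0.0002 -- generic word, downweighted
print(tf * math.log(N / n_recount))  # ~0.0449 -- distinctive word, upweighted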

24.3.3 Document-Term Matrices

When you apply TF-IDF to a collection of $N$ documents with vocabulary size $V$, you get a document-term matrix $\mathbf{X} \in \mathbb{R}^{N \times V}$. This matrix is typically very sparse (most documents contain only a tiny fraction of the full vocabulary) and is stored in sparse matrix format for efficiency.

For prediction market applications, a typical document-term matrix might have:

  • $N = 10{,}000$ news articles about a particular topic
  • $V = 50{,}000$ unique terms (after preprocessing)
  • Sparsity > 99% (each article uses perhaps 200 unique terms)

24.3.4 Text Classification with TF-IDF and Logistic Regression

Combining TF-IDF features with logistic regression gives a strong baseline for text classification. For prediction markets, a common task is classifying news articles as positive, negative, or neutral for a particular market question.

The pipeline is:

  1. Collect labeled training data (news articles labeled with their impact on a market).
  2. Preprocess text using the pipeline from Section 24.2.
  3. Compute TF-IDF features.
  4. Train a logistic regression classifier.
  5. Apply to new articles to predict their market impact.

24.3.5 Python Implementation

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Simulated prediction market news dataset
# Labels: 1 = positive for market (price should go up)
#         0 = negative for market (price should go down)
texts = [
    "Candidate leads in latest swing state polls by wide margin",
    "New endorsement from popular governor boosts campaign",
    "Campaign raises record-breaking funds in Q3",
    "Debate performance widely praised by undecided voters",
    "Candidate stumbles in interview, makes gaffe on key policy",
    "Major donor withdraws support citing lack of confidence",
    "Scandal allegations surface involving campaign staff",
    "Poll numbers declining steadily over the past month",
    "New economic data supports candidate policy position",
    "Opposition candidate gains momentum in key demographics",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# Build a TF-IDF + Logistic Regression pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=10000,
        ngram_range=(1, 2),  # Unigrams and bigrams
        min_df=1,
        max_df=0.95,
        sublinear_tf=True,  # Apply log normalization to TF
    )),
    ('clf', LogisticRegression(
        C=1.0,
        max_iter=1000,
        class_weight='balanced',
    )),
])

# Train the pipeline
pipeline.fit(texts, labels)

# Examine the most informative features
feature_names = pipeline.named_steps['tfidf'].get_feature_names_out()
coefficients = pipeline.named_steps['clf'].coef_[0]

# Top positive and negative features
top_positive_idx = np.argsort(coefficients)[-10:]
top_negative_idx = np.argsort(coefficients)[:10]

print("Most POSITIVE features:")
for idx in reversed(top_positive_idx):
    print(f"  {feature_names[idx]:30s} {coefficients[idx]:.4f}")

print("\nMost NEGATIVE features:")
for idx in top_negative_idx:
    print(f"  {feature_names[idx]:30s} {coefficients[idx]:.4f}")

# Predict on new text
new_texts = [
    "Candidate surges in latest national polls",
    "Campaign faces new allegations of financial misconduct",
]

predictions = pipeline.predict(new_texts)
probabilities = pipeline.predict_proba(new_texts)

for text, pred, prob in zip(new_texts, predictions, probabilities):
    sentiment = "POSITIVE" if pred == 1 else "NEGATIVE"
    print(f"\nText: {text}")
    print(f"Prediction: {sentiment} (confidence: {max(prob):.2%})")

The n-gram parameter ngram_range=(1, 2) is particularly important for prediction market text. Bigrams like "not pass," "wide margin," and "record breaking" carry more information than individual words. The sublinear_tf=True parameter applies logarithmic scaling to term frequencies, which prevents very long documents from dominating.


24.4 Sentiment Analysis Fundamentals

Sentiment analysis is the task of determining the emotional tone or opinion expressed in text. For prediction markets, we care not just about general positive/negative sentiment, but about sentiment directed toward specific outcomes.

24.4.1 Lexicon-Based Approaches: VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool specifically designed for social media text. It uses a curated lexicon of words with associated sentiment scores and a set of rules that handle:

  • Punctuation: Exclamation marks amplify sentiment ("Great!!!" > "Great").
  • Capitalization: ALL CAPS amplifies sentiment ("AMAZING" > "amazing").
  • Degree modifiers: "extremely good" > "good" > "somewhat good."
  • Negation: "not good" flips the sentiment of "good."
  • Conjunctions: "but" signals a shift in sentiment ("The food was great, but the service was terrible").

VADER outputs four scores:

  • pos: Proportion of text that is positive.
  • neg: Proportion of text that is negative.
  • neu: Proportion of text that is neutral.
  • compound: Normalized composite score from -1 (most negative) to +1 (most positive).

The compound score is computed as:

$$ \text{compound} = \frac{x}{\sqrt{x^2 + \alpha}} $$

where $x$ is the sum of the valence scores of the sentiment-bearing words in the text and $\alpha$ is a normalization constant (default 15).
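These rules are easy to see in action (a small sketch assuming the vaderSentiment package is installed):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

examples = [
    "The rally was great.",
    "The rally was GREAT!!!",          # caps + punctuation amplify
    "The rally was not great.",        # negation flips polarity
    "The rally was great, but turnout was terrible.",  # "but" shifts weight
]
for text in examples:
    print(f"{vader.polarity_scores(text)['compound']:+.3f}  {text}")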

24.4.2 Lexicon-Based Approaches: TextBlob

TextBlob provides a simpler sentiment analysis interface that returns:

  • Polarity: Range from -1 (negative) to +1 (positive).
  • Subjectivity: Range from 0 (objective/factual) to 1 (subjective/opinion).

TextBlob's sentiment is based on the Pattern library's lexicon and is computed as a weighted average of the sentiment scores of individual words, adjusted for modifiers and negation.

24.4.3 Rule-Based Approaches for Domain-Specific Text

Generic sentiment lexicons often perform poorly on domain-specific text. In financial and political text, many words have domain-specific sentiment:

| Word | General Sentiment | Financial Sentiment | Political Sentiment |
|---|---|---|---|
| "volatile" | Neutral | Negative | Negative |
| "aggressive" | Negative | Positive (policy) | Negative |
| "conservative" | Neutral | Positive (estimate) | Varies by context |
| "liberal" | Positive | Negative (spending) | Varies by context |
| "unprecedented" | Neutral | Negative (risk) | Context-dependent |

The Loughran-McDonald financial sentiment lexicon addresses this for financial text by providing word lists specifically labeled for financial contexts: positive, negative, uncertainty, litigious, strong modal, and weak modal.

For prediction market text, no single existing lexicon is ideal. A practical approach is to start with VADER (which handles social media well) and augment it with domain-specific terms.

24.4.4 Sentiment Scoring for Financial and Political Text

When applying sentiment analysis to prediction market text, several considerations arise:

Directionality matters. "The candidate is doing terribly in polls" is negative for the candidate but may be positive for the opposing candidate's market. Sentiment must be interpreted relative to the specific market question.

Aggregation over time. A single article's sentiment is noisy. Aggregating sentiment over time windows (1 hour, 1 day, 1 week) produces more reliable signals. Common aggregation methods include:

$$ \bar{S}_t = \frac{1}{|D_t|} \sum_{d \in D_t} s_d \quad \text{(simple average)} $$

$$ \bar{S}_t^{\text{exp}} = \alpha \cdot s_t + (1 - \alpha) \cdot \bar{S}_{t-1}^{\text{exp}} \quad \text{(exponential moving average)} $$

$$ \bar{S}_t^{\text{vol}} = \frac{\sum_{d \in D_t} v_d \cdot s_d}{\sum_{d \in D_t} v_d} \quad \text{(volume-weighted, } v_d \text{ = source importance)} $$

Source weighting. Not all text sources are equally informative. A Reuters article likely contains more reliable information than a random tweet. Source credibility can be incorporated through weighting.
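A compact sketch of the three aggregation schemes (the scores and weights below are made up for illustration):

import numpy as np

scores  = np.array([0.42, -0.10, 0.65, 0.30])  # per-article sentiment, oldest first
weights = np.array([3.0, 1.0, 2.0, 0.5])       # source credibility weights

print(scores.mean())                        # simple average
print(np.average(scores, weights=weights))  # source-weighted average

alpha, ema = 0.3, 0.0
for s in scores:  # exponential moving average, most recent score weighted most
    ema = alpha * s + (1 - alpha) * ema
print(ema)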

24.4.5 Python Sentiment Analyzer

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from dataclasses import dataclass
from typing import Dict, List, Optional
import numpy as np


@dataclass
class SentimentResult:
    """Container for multi-method sentiment scores."""
    text: str
    vader_compound: float
    vader_pos: float
    vader_neg: float
    vader_neu: float
    textblob_polarity: float
    textblob_subjectivity: float
    ensemble_score: float


class PredictionMarketSentimentAnalyzer:
    """
    Sentiment analyzer tailored for prediction market text.

    Combines VADER, TextBlob, and custom domain adjustments
    to produce robust sentiment scores.
    """

    def __init__(self, domain_lexicon: Optional[Dict[str, float]] = None):
        self.vader = SentimentIntensityAnalyzer()

        # Add domain-specific terms to VADER lexicon
        if domain_lexicon:
            self.vader.lexicon.update(domain_lexicon)

        # Default prediction market lexicon adjustments
        default_updates = {
            'landslide': 2.0,
            'frontrunner': 1.5,
            'momentum': 1.0,
            'surge': 1.5,
            'plummet': -2.0,
            'scandal': -2.5,
            'indictment': -2.5,
            'gaffe': -1.5,
            'endorse': 1.5,
            'endorsement': 1.5,
            'collapse': -2.5,
            'unprecedented': -0.5,
            'bipartisan': 1.0,
            'deadlock': -1.0,
            'gridlock': -1.0,
            'filibuster': -0.5,
            'veto': -1.5,
            'override': 1.0,
            'unanimous': 2.0,
            'contested': -1.0,
            'recount': -1.5,
        }
        self.vader.lexicon.update(default_updates)

    def analyze(self, text: str) -> SentimentResult:
        """
        Analyze sentiment of a single text using multiple methods.

        Parameters
        ----------
        text : str
            The text to analyze.

        Returns
        -------
        SentimentResult
            Multi-method sentiment scores.
        """
        # VADER analysis
        vader_scores = self.vader.polarity_scores(text)

        # TextBlob analysis
        blob = TextBlob(text)
        tb_polarity = blob.sentiment.polarity
        tb_subjectivity = blob.sentiment.subjectivity

        # Ensemble: weighted average of VADER compound and TextBlob polarity
        # VADER gets more weight because it handles social media better
        ensemble = 0.6 * vader_scores['compound'] + 0.4 * tb_polarity

        return SentimentResult(
            text=text,
            vader_compound=vader_scores['compound'],
            vader_pos=vader_scores['pos'],
            vader_neg=vader_scores['neg'],
            vader_neu=vader_scores['neu'],
            textblob_polarity=tb_polarity,
            textblob_subjectivity=tb_subjectivity,
            ensemble_score=ensemble,
        )

    def analyze_batch(self, texts: List[str]) -> List[SentimentResult]:
        """Analyze a batch of texts."""
        return [self.analyze(t) for t in texts]

    def aggregate_sentiment(
        self,
        results: List[SentimentResult],
        method: str = 'mean',
        weights: List[float] = None,
    ) -> float:
        """
        Aggregate sentiment scores from multiple texts.

        Parameters
        ----------
        results : List[SentimentResult]
            Sentiment results to aggregate.
        method : str
            Aggregation method: 'mean', 'median', or 'weighted'.
        weights : List[float], optional
            Weights for weighted aggregation.

        Returns
        -------
        float
            Aggregated sentiment score.
        """
        scores = [r.ensemble_score for r in results]

        if method == 'mean':
            return np.mean(scores)
        elif method == 'median':
            return np.median(scores)
        elif method == 'weighted' and weights is not None:
            return np.average(scores, weights=weights)
        else:
            raise ValueError(f"Unknown method: {method}")


# Example usage
if __name__ == "__main__":
    analyzer = PredictionMarketSentimentAnalyzer()

    articles = [
        "Candidate surges in polls after strong debate performance, "
        "gaining endorsement from key swing state governor.",
        "Campaign rocked by scandal as financial irregularities surface. "
        "Major donors threatening to pull support.",
        "New bipartisan agreement reached on infrastructure bill. "
        "Both parties claim victory in negotiations.",
        "BREAKING: Candidate faces indictment on federal charges. "
        "Legal team says allegations are baseless.",
    ]

    results = analyzer.analyze_batch(articles)

    for result in results:
        print(f"Text: {result.text[:60]}...")
        print(f"  VADER compound:       {result.vader_compound:+.4f}")
        print(f"  TextBlob polarity:    {result.textblob_polarity:+.4f}")
        print(f"  TextBlob subjectivity: {result.textblob_subjectivity:.4f}")
        print(f"  Ensemble score:       {result.ensemble_score:+.4f}")
        print()

    # Aggregate
    agg = analyzer.aggregate_sentiment(results)
    print(f"Aggregated sentiment (mean): {agg:+.4f}")

24.5 Transformer Models: BERT and Beyond

The classical methods of the previous sections are useful baselines, but modern NLP has been revolutionized by transformer architectures. Understanding transformers is essential for building state-of-the-art NLP features for prediction market trading.

24.5.1 Attention Mechanism Intuition

The key innovation behind transformers is the attention mechanism. Traditional sequential models (RNNs, LSTMs) process text left-to-right, maintaining a hidden state that serves as a compressed memory of everything seen so far. This creates a bottleneck: information from early in a long document must survive through many processing steps to influence the interpretation of later text.

Attention allows the model to directly focus on any part of the input when processing any other part. When processing the word "rates" in "The Fed raised interest rates by 25 basis points," an attention mechanism can directly attend to "Fed," "raised," and "interest" to determine the meaning of "rates" in this context.

Mathematically, attention computes a weighted sum of value vectors, where the weights are determined by the compatibility between a query vector and each key vector:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

where:

  • $Q$ (queries), $K$ (keys), and $V$ (values) are matrices derived from the input embeddings through learned linear projections.
  • $d_k$ is the dimensionality of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products from becoming too large.
  • The softmax normalizes the compatibility scores into a probability distribution over input positions.

Multi-head attention runs multiple attention mechanisms in parallel, each with different learned projections, allowing the model to attend to different types of relationships simultaneously:

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O $$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
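To make the formula concrete, here is a toy scaled dot-product attention in NumPy (a single head with random inputs):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for 2-D arrays."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): each token's output is a weighted mix of all values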

24.5.2 BERT Architecture Overview

BERT (Bidirectional Encoder Representations from Transformers) was released by Google in 2018 and fundamentally changed NLP. Its key innovations were:

  1. Bidirectional context: Unlike GPT (which reads left-to-right), BERT processes text in both directions simultaneously, giving it a richer understanding of context.

  2. Pre-training on massive text: BERT is pre-trained on two tasks:
       • Masked Language Modeling (MLM): Randomly mask 15% of input tokens and train the model to predict them. This forces the model to learn contextual representations.
       • Next Sentence Prediction (NSP): Given two sentences, predict whether the second follows the first. This teaches the model about sentence relationships.

  3. Fine-tuning for downstream tasks: After pre-training, BERT can be fine-tuned on specific tasks with relatively small labeled datasets.

The BERT-base model has:

  • 12 transformer layers (blocks)
  • 768 hidden dimensions
  • 12 attention heads
  • 110 million parameters

BERT-large doubles these: 24 layers, 1024 dimensions, 16 heads, 340 million parameters.
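The MLM objective can be probed directly with HuggingFace's fill-mask pipeline (a quick sketch; the completions and scores shown depend on the model):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts a distribution over its vocabulary for the [MASK] slot
for pred in fill_mask("The Fed decided to [MASK] interest rates.")[:3]:
    print(f"{pred['token_str']:>10s}  {pred['score']:.3f}")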

24.5.3 Pre-Training and Fine-Tuning

The pre-training/fine-tuning paradigm is crucial for prediction market applications:

Pre-training teaches the model general language understanding. BERT was pre-trained on BookCorpus (800M words) and English Wikipedia (2,500M words). This gives it knowledge of grammar, facts, and reasoning patterns.

Fine-tuning adapts the pre-trained model to a specific task. For a prediction market sentiment classifier, fine-tuning involves:

  1. Adding a classification head (a simple linear layer) on top of BERT.
  2. Training the entire model (BERT + classification head) on labeled prediction market text.
  3. Using a small learning rate (2e-5 to 5e-5) to preserve the pre-trained knowledge.

Fine-tuning is remarkably data-efficient. Where training a text classifier from scratch might require hundreds of thousands of labeled examples, fine-tuning BERT can achieve strong performance with just a few thousand.

24.5.4 Using Pre-Trained Models from HuggingFace

The HuggingFace transformers library provides easy access to thousands of pre-trained models. For prediction market applications, the most relevant pre-trained models include:

Model Description Best For
bert-base-uncased Base BERT, 110M params General text classification
finbert BERT fine-tuned on financial text Financial market text
cardiffnlp/twitter-roberta-base-sentiment RoBERTa fine-tuned on tweets Social media sentiment
distilbert-base-uncased Distilled BERT, 66M params When speed matters
roberta-base Robustly optimized BERT General NLP, often better than BERT

24.5.5 Python BERT Sentiment Classification

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    pipeline,
)
import numpy as np
from typing import List, Dict


class TransformerSentimentClassifier:
    """
    Sentiment classifier using pre-trained transformer models.

    Wraps HuggingFace transformers for easy use in prediction
    market applications.
    """

    def __init__(
        self,
        model_name: str = "cardiffnlp/twitter-roberta-base-sentiment-latest",
        device: str = None,
    ):
        """
        Parameters
        ----------
        model_name : str
            HuggingFace model identifier.
        device : str
            'cuda', 'cpu', or None for auto-detection.
        """
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"

        self.device = device
        self.model_name = model_name

        # Load tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name
        ).to(device)
        self.model.eval()

        # Also create a pipeline for convenience
        self.pipe = pipeline(
            "sentiment-analysis",
            model=self.model,
            tokenizer=self.tokenizer,
            device=0 if device == "cuda" else -1,
            top_k=None,  # Return all scores
        )

    def predict(self, text: str) -> Dict[str, float]:
        """
        Predict sentiment for a single text.

        Returns
        -------
        Dict[str, float]
            Mapping from label to score.
        """
        results = self.pipe(text, truncation=True, max_length=512)
        # results is a list of lists of dicts
        return {r['label']: r['score'] for r in results[0]}

    def predict_batch(
        self, texts: List[str], batch_size: int = 32
    ) -> List[Dict[str, float]]:
        """
        Predict sentiment for a batch of texts.

        Parameters
        ----------
        texts : List[str]
            Texts to analyze.
        batch_size : int
            Batch size for inference.

        Returns
        -------
        List[Dict[str, float]]
            Sentiment scores for each text.
        """
        all_results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            results = self.pipe(
                batch, truncation=True, max_length=512, batch_size=batch_size
            )
            for result in results:
                all_results.append(
                    {r['label']: r['score'] for r in result}
                )
        return all_results

    def get_sentiment_score(self, text: str) -> float:
        """
        Get a single normalized sentiment score in [-1, 1].

        Maps model outputs to a continuous sentiment scale.
        """
        scores = self.predict(text)

        # Handle different label formats from different models
        if 'positive' in scores and 'negative' in scores:
            return scores['positive'] - scores['negative']
        elif 'POSITIVE' in scores and 'NEGATIVE' in scores:
            return scores['POSITIVE'] - scores['NEGATIVE']
        elif 'LABEL_2' in scores and 'LABEL_0' in scores:
            # Common format: 0=negative, 1=neutral, 2=positive
            return scores.get('LABEL_2', 0) - scores.get('LABEL_0', 0)
        else:
            # Fallback: return the single largest score (label scheme unknown)
            return max(scores.values()) if scores else 0.0


# Example usage
if __name__ == "__main__":
    # Use a pre-trained sentiment model
    classifier = TransformerSentimentClassifier(
        model_name="cardiffnlp/twitter-roberta-base-sentiment-latest"
    )

    texts = [
        "The candidate delivered an incredible speech that energized voters",
        "Polls show a devastating collapse in support after the scandal",
        "Economic indicators remain steady with moderate growth expected",
        "BREAKING: Major endorsement from former president boosts campaign",
    ]

    for text in texts:
        scores = classifier.predict(text)
        normalized = classifier.get_sentiment_score(text)
        print(f"Text: {text[:65]}...")
        print(f"  Scores: {scores}")
        print(f"  Normalized: {normalized:+.4f}")
        print()

24.6 Fine-Tuning for Prediction Market Text

Pre-trained models provide a strong starting point, but fine-tuning on domain-specific data unlocks significantly better performance for prediction market applications.

24.6.1 Creating Training Data from Market-Relevant Text

The biggest challenge in fine-tuning for prediction markets is obtaining labeled training data. Here are practical approaches:

Approach 1: Market price changes as labels. Collect news articles published during specific time windows and label them based on the subsequent market price movement (a minimal sketch follows this list of approaches):

  • An article is published at time $t$.
  • The market price changes from $p_t$ to $p_{t+\Delta}$.
  • If $p_{t+\Delta} - p_t > \epsilon$: label = "positive" (price moved up).
  • If $p_{t+\Delta} - p_t < -\epsilon$: label = "negative" (price moved down).
  • Otherwise: label = "neutral".

This approach is noisy (many articles may not be causally related to the price change) but scalable.

Approach 2: Manual annotation. Recruit annotators (or use your own judgment) to label articles as positive, negative, or neutral for specific market questions. This produces cleaner labels but is labor-intensive. A few hundred high-quality labels can be sufficient for fine-tuning.

Approach 3: Semi-supervised labeling. Use a pre-trained sentiment model to label a large corpus, manually correct the most uncertain predictions, and use the result for fine-tuning.

Approach 4: LLM-assisted labeling. Use a large language model (GPT-4, Claude) to label articles. Prompt the LLM with the market question and the article, and ask it to classify the article's likely impact on the market. This can produce thousands of labels quickly, though they should be spot-checked for quality.
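A minimal sketch of Approach 1, assuming an articles DataFrame with a 'timestamp' column and a price Series with a sorted DatetimeIndex (the one-hour horizon and the 0.03 threshold are illustrative):

import pandas as pd

def label_by_price_move(
    articles: pd.DataFrame,
    prices: pd.Series,
    horizon: pd.Timedelta = pd.Timedelta(hours=1),
    eps: float = 0.03,
) -> pd.Series:
    """Label each article 0=negative, 1=neutral, 2=positive by price move."""
    labels = []
    for t in articles['timestamp']:
        p_before = prices.asof(t)            # last price at or before t
        p_after = prices.asof(t + horizon)   # last price at or before t + horizon
        move = p_after - p_before
        labels.append(2 if move > eps else 0 if move < -eps else 1)
    return pd.Series(labels, index=articles.index)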

24.6.2 Fine-Tuning BERT for Political/Financial Sentiment

import torch
from torch.optim import AdamW  # transformers' own AdamW was deprecated and removed
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


class MarketTextDataset(Dataset):
    """PyTorch dataset for market-relevant text classification."""

    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long),
        }


def fine_tune_bert(
    texts: list,
    labels: list,
    model_name: str = "distilbert-base-uncased",
    num_labels: int = 3,  # negative, neutral, positive
    epochs: int = 3,
    batch_size: int = 16,
    learning_rate: float = 2e-5,
    warmup_ratio: float = 0.1,
    max_length: int = 256,
    test_size: float = 0.2,
):
    """
    Fine-tune a transformer model on market-relevant text.

    Parameters
    ----------
    texts : list
        List of text strings.
    labels : list
        List of integer labels (0=negative, 1=neutral, 2=positive).
    model_name : str
        Pre-trained model to fine-tune.
    num_labels : int
        Number of classification labels.
    epochs : int
        Number of training epochs.
    batch_size : int
        Training batch size.
    learning_rate : float
        Peak learning rate.
    warmup_ratio : float
        Fraction of training steps for warmup.
    max_length : int
        Maximum token sequence length.
    test_size : float
        Fraction of data to hold out for evaluation.

    Returns
    -------
    model : transformers model
        The fine-tuned model.
    tokenizer : transformers tokenizer
        The associated tokenizer.
    metrics : dict
        Evaluation metrics on the test set.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    ).to(device)

    # Train/test split
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        texts, labels, test_size=test_size, random_state=42, stratify=labels
    )

    # Create datasets
    train_dataset = MarketTextDataset(
        train_texts, train_labels, tokenizer, max_length
    )
    test_dataset = MarketTextDataset(
        test_texts, test_labels, tokenizer, max_length
    )

    train_loader = DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True
    )
    test_loader = DataLoader(
        test_dataset, batch_size=batch_size, shuffle=False
    )

    # Optimizer and scheduler
    optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
    total_steps = len(train_loader) * epochs
    warmup_steps = int(total_steps * warmup_ratio)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps,
        num_training_steps=total_steps
    )

    # Training loop
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        correct = 0
        total = 0

        for batch in train_loader:
            optimizer.zero_grad()

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels_batch = batch['labels'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels_batch,
            )

            loss = outputs.loss
            logits = outputs.logits

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()

            total_loss += loss.item()
            predictions = torch.argmax(logits, dim=-1)
            correct += (predictions == labels_batch).sum().item()
            total += labels_batch.size(0)

        avg_loss = total_loss / len(train_loader)
        accuracy = correct / total
        print(
            f"Epoch {epoch + 1}/{epochs} | "
            f"Loss: {avg_loss:.4f} | "
            f"Accuracy: {accuracy:.4f}"
        )

    # Evaluation
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels_batch = batch['labels'].to(device)

            outputs = model(
                input_ids=input_ids, attention_mask=attention_mask
            )
            predictions = torch.argmax(outputs.logits, dim=-1)

            all_preds.extend(predictions.cpu().numpy())
            all_labels.extend(labels_batch.cpu().numpy())

    label_names = ['negative', 'neutral', 'positive']
    report = classification_report(
        all_labels, all_preds, target_names=label_names, output_dict=True
    )
    print("\nEvaluation Report:")
    print(classification_report(all_labels, all_preds, target_names=label_names))

    return model, tokenizer, report


# Example usage with synthetic data
if __name__ == "__main__":
    # In practice, you would load real market-relevant text data
    sample_texts = [
        "Candidate wins decisive victory in key primary state",
        "New policy proposal receives bipartisan support in Congress",
        "Market analysts predict strong growth in the upcoming quarter",
        "Campaign faces legal challenges that could derail nomination",
        "Polling data shows a significant decline in voter approval",
        "Economic downturn fears mount as unemployment claims rise",
        "Trade negotiations reach a standstill with no resolution",
        "Central bank signals potential rate cuts in response to data",
        "Senate confirms new cabinet appointment with broad support",
        "International tensions escalate following border incident",
    ] * 20  # Repeat to create a small demo training set

    # Labels: 0=negative, 1=neutral, 2=positive
    sample_labels = [2, 2, 2, 0, 0, 0, 0, 1, 2, 0] * 20

    model, tokenizer, metrics = fine_tune_bert(
        sample_texts,
        sample_labels,
        model_name="distilbert-base-uncased",
        epochs=3,
        batch_size=8,
    )

24.6.3 Transfer Learning Strategies

When fine-tuning for prediction markets, there are several transfer learning strategies to consider:

Strategy 1: Direct fine-tuning. Take a general pre-trained model (e.g., bert-base-uncased) and fine-tune it directly on your prediction market data. This works well when you have at least a few thousand labeled examples.

Strategy 2: Domain-adaptive pre-training. First, continue pre-training the model (using MLM) on a large corpus of unlabeled prediction market text (news articles, social media posts about markets). Then fine-tune on your labeled data. This helps the model adapt to domain-specific vocabulary and writing styles.

Strategy 3: Multi-task fine-tuning. Fine-tune on multiple related tasks simultaneously -- for example, sentiment classification and market impact prediction. The shared representation learning can improve performance on both tasks.

Strategy 4: Few-shot with a large model. If you have very few labeled examples (fewer than 100), use a larger pre-trained model (e.g., roberta-large or a model fine-tuned on NLI tasks like cross-encoder/nli-deberta-v3-large) and train only the classification head, freezing the transformer layers.
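A short sketch of the head-only variant in Strategy 4, using DistilBERT (the parameter-name prefixes 'pre_classifier' and 'classifier' follow DistilBERT's naming; other architectures differ):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

# Freeze everything except the classification head
for name, param in model.named_parameters():
    if not name.startswith(("pre_classifier", "classifier")):
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} of {total:,} parameters")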

24.6.4 Handling Domain-Specific Language

Prediction market text has several domain-specific challenges:

  • Hedging language: "The candidate is likely to win" vs. "The candidate could potentially win" express different confidence levels that matter for probability estimation.
  • Conditional statements: "If the economy slows, the incumbent will lose" requires understanding conditional logic.
  • Negated expectations: "The market surprised by NOT raising rates" contains a negation of an expectation, which is semantically complex.
  • Quantitative claims: "Polls show a 5-point lead" contains a numerical value that a text model may not properly weigh.
  • Sarcasm and irony: Social media sources often contain sarcasm that can fool sentiment models ("Great, another candidate flip-flopping on policy").

Addressing these challenges requires:

  1. Including examples of each in your training data.
  2. Data augmentation techniques that introduce these patterns.
  3. Potentially adding numerical features extracted separately from the text.


24.7 News Impact Analysis

Beyond sentiment, we can measure the impact that specific news events have on prediction market prices. This goes beyond asking "is this article positive or negative?" to asking "how much did this article move the market?"

24.7.1 Measuring How News Affects Market Prices

The basic framework for news impact analysis involves:

  1. Identifying news events: Detect when a significant news article or cluster of articles is published.
  2. Measuring the price change: Calculate the market price change in a window around the news event.
  3. Attributing the change: Determine how much of the price change is attributable to the news (vs. other factors).

Formally, define the abnormal price change around a news event at time $t$ as:

$$ \Delta p_t^{\text{abnormal}} = \Delta p_t^{\text{actual}} - \Delta p_t^{\text{expected}} $$

where $\Delta p_t^{\text{expected}}$ is the price change we would have expected in the absence of the news event. For prediction markets (unlike stocks), the expected price change in the absence of news is typically zero (prices should be a martingale under risk-neutral pricing).

24.7.2 Event Study Methodology Adapted for Prediction Markets

Classical event study methodology from financial economics can be adapted for prediction markets:

  1. Define the event window: The period during which you expect the news to affect prices. For breaking news, this might be [0, +2 hours]. For a scheduled event like a debate, the window might be [-1 hour, +24 hours].

  2. Define the estimation window: A period before the event used to estimate "normal" price behavior. For prediction markets, this might be the 7 days before the event window.

  3. Calculate abnormal returns: The difference between actual price changes and expected changes during the event window.

  4. Test significance: Determine whether the abnormal price change is statistically significant.

For prediction markets, the "return" is typically the change in the probability:

$$ \text{Abnormal change} = p_{t + \tau} - p_t - \hat{\mu} \cdot \tau $$

where $\hat{\mu}$ is the estimated daily drift in the market price (often assumed to be zero for short windows).

24.7.3 News Surprise vs. Expected

Not all news is equally informative. Expected news (a poll showing the frontrunner still leading) should have minimal price impact. Surprising news (an unexpected endorsement, a scandal) should have large impact. Quantifying "surprise" is a key challenge.

Approaches to measuring news surprise:

Approach 1: Deviation from consensus. If polls consistently show Candidate A leading by 5 points, a new poll showing a 10-point lead is surprising. The surprise is the deviation from recent averages.

Approach 2: Novelty of content. Using NLP, measure how different a new article's content is from recent articles. Articles that introduce new topics or entities are more likely to be surprising. This can be measured using cosine distance between TF-IDF vectors:

$$ \text{novelty}(d) = 1 - \max_{d' \in D_{\text{recent}}} \cos(\mathbf{v}_d, \mathbf{v}_{d'}) $$
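A small sketch of this novelty measure using TF-IDF vectors (the example headlines are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

recent = [
    "Frontrunner keeps steady five-point lead in state polls",
    "Campaign schedules rallies across three swing states",
]
new_article = "Surprise endorsement from rival party senator shakes up race"

vectorizer = TfidfVectorizer().fit(recent + [new_article])
v_recent = vectorizer.transform(recent)
v_new = vectorizer.transform([new_article])

novelty = 1 - cosine_similarity(v_new, v_recent).max()
print(f"novelty = {novelty:.3f}")  # near 1.0 = little overlap with recent news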

Approach 3: Market response magnitude. If the market moves significantly after a news event, the news was presumably surprising. This is a retrospective measure but useful for building labeled datasets.

24.7.4 Quantifying News Impact

To systematically quantify news impact, we compute several features for each news event:

$$ \text{Impact}_{\text{immediate}} = p_{t+\delta} - p_t \quad \text{(} \delta \text{ = 1-5 minutes)} $$

$$ \text{Impact}_{\text{short}} = p_{t+1\text{h}} - p_t $$

$$ \text{Impact}_{\text{sustained}} = p_{t+24\text{h}} - p_t $$

$$ \text{Reversal} = \frac{p_{t+24\text{h}} - p_{t+1\text{h}}}{p_{t+1\text{h}} - p_t} $$

The reversal metric is particularly interesting: if news causes an immediate price spike that is fully reversed within 24 hours, the initial reaction was an overreaction, and a trading strategy could exploit this pattern.

24.7.5 Python News Impact Analyzer

import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, timedelta


@dataclass
class NewsEvent:
    """Represents a news event with associated metadata."""
    timestamp: datetime
    headline: str
    text: str
    source: str
    sentiment: float  # Pre-computed sentiment score


@dataclass
class ImpactResult:
    """Results of a news impact analysis."""
    event: NewsEvent
    price_before: float
    price_after_5min: float
    price_after_1h: float
    price_after_24h: float
    impact_immediate: float
    impact_short: float
    impact_sustained: float
    reversal_ratio: Optional[float]
    is_significant: bool


class NewsImpactAnalyzer:
    """
    Analyzes the impact of news events on prediction market prices.

    Implements event study methodology adapted for prediction markets.
    """

    def __init__(
        self,
        significance_threshold: float = 0.02,
        min_volume_threshold: int = 5,
    ):
        """
        Parameters
        ----------
        significance_threshold : float
            Minimum absolute price change to consider significant.
        min_volume_threshold : int
            Minimum number of trades in the event window
            to consider the price change meaningful.
        """
        self.significance_threshold = significance_threshold
        self.min_volume_threshold = min_volume_threshold

    def get_price_at_time(
        self,
        price_data: pd.DataFrame,
        target_time: datetime,
        tolerance: timedelta = timedelta(minutes=5),
    ) -> Optional[float]:
        """
        Get the market price closest to the target time.

        Parameters
        ----------
        price_data : pd.DataFrame
            DataFrame with 'timestamp' and 'price' columns.
        target_time : datetime
            The target time to look up.
        tolerance : timedelta
            Maximum time difference to accept.

        Returns
        -------
        float or None
            The price at the target time, or None if no data available.
        """
        time_diffs = abs(price_data['timestamp'] - target_time)
        min_idx = time_diffs.idxmin()
        if time_diffs[min_idx] <= tolerance:
            return price_data.loc[min_idx, 'price']
        return None

    def analyze_event(
        self,
        event: NewsEvent,
        price_data: pd.DataFrame,
    ) -> Optional[ImpactResult]:
        """
        Analyze the impact of a single news event.

        Parameters
        ----------
        event : NewsEvent
            The news event to analyze.
        price_data : pd.DataFrame
            Market price data with 'timestamp' and 'price' columns.

        Returns
        -------
        ImpactResult or None
            Impact analysis results, or None if insufficient data.
        """
        t = event.timestamp

        price_before = self.get_price_at_time(price_data, t)
        price_5min = self.get_price_at_time(
            price_data, t + timedelta(minutes=5)
        )
        price_1h = self.get_price_at_time(
            price_data, t + timedelta(hours=1)
        )
        price_24h = self.get_price_at_time(
            price_data, t + timedelta(hours=24)
        )

        if any(p is None for p in [price_before, price_5min, price_1h, price_24h]):
            return None

        impact_imm = price_5min - price_before
        impact_short = price_1h - price_before
        impact_sust = price_24h - price_before

        # Reversal ratio
        if abs(impact_short) > 1e-6:
            reversal = (price_24h - price_1h) / (price_1h - price_before)
        else:
            reversal = None

        is_significant = abs(impact_short) >= self.significance_threshold

        return ImpactResult(
            event=event,
            price_before=price_before,
            price_after_5min=price_5min,
            price_after_1h=price_1h,
            price_after_24h=price_24h,
            impact_immediate=impact_imm,
            impact_short=impact_short,
            impact_sustained=impact_sust,
            reversal_ratio=reversal,
            is_significant=is_significant,
        )

    def analyze_events(
        self,
        events: List[NewsEvent],
        price_data: pd.DataFrame,
    ) -> pd.DataFrame:
        """
        Analyze multiple news events and return a summary DataFrame.

        Parameters
        ----------
        events : List[NewsEvent]
            List of news events to analyze.
        price_data : pd.DataFrame
            Market price data.

        Returns
        -------
        pd.DataFrame
            Summary of impact analysis for all events.
        """
        results = []
        for event in events:
            result = self.analyze_event(event, price_data)
            if result is not None:
                results.append({
                    'timestamp': event.timestamp,
                    'headline': event.headline,
                    'source': event.source,
                    'sentiment': event.sentiment,
                    'price_before': result.price_before,
                    'impact_immediate': result.impact_immediate,
                    'impact_short': result.impact_short,
                    'impact_sustained': result.impact_sustained,
                    'reversal_ratio': result.reversal_ratio,
                    'is_significant': result.is_significant,
                })

        df = pd.DataFrame(results)

        if not df.empty:
            # Compute summary statistics
            sig_events = df[df['is_significant']]
            print(f"Total events analyzed: {len(df)}")
            print(f"Significant events: {len(sig_events)}")
            if len(sig_events) > 0:
                print(f"Mean absolute impact (significant): "
                      f"{sig_events['impact_short'].abs().mean():.4f}")
                print(f"Sentiment-impact correlation: "
                      f"{df['sentiment'].corr(df['impact_short']):.4f}")

        return df

    def compute_news_surprise(
        self,
        event_text: str,
        recent_texts: List[str],
        vectorizer=None,
    ) -> float:
        """
        Compute the novelty/surprise of a news event relative to
        recent news.

        Parameters
        ----------
        event_text : str
            The text of the new event.
        recent_texts : List[str]
            Recent article texts for comparison.
        vectorizer : sklearn vectorizer, optional
            Pre-fitted TF-IDF vectorizer. If None, creates one.

        Returns
        -------
        float
            Novelty score in [0, 1], where 1 = very novel.
        """
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        if not recent_texts:
            return 1.0  # No baseline, so everything is novel

        if vectorizer is None:
            vectorizer = TfidfVectorizer(max_features=5000)
            all_texts = recent_texts + [event_text]
            tfidf_matrix = vectorizer.fit_transform(all_texts)
        else:
            all_texts = recent_texts + [event_text]
            tfidf_matrix = vectorizer.transform(all_texts)

        # Similarity between the new event and each recent text
        event_vector = tfidf_matrix[-1:]
        recent_vectors = tfidf_matrix[:-1]

        similarities = cosine_similarity(event_vector, recent_vectors)[0]
        max_similarity = similarities.max()

        return 1.0 - max_similarity


# Example usage
if __name__ == "__main__":
    # Create synthetic price data
    np.random.seed(42)
    base_time = datetime(2024, 11, 1, 8, 0, 0)
    timestamps = [
        base_time + timedelta(minutes=i) for i in range(0, 1440 * 3, 5)
    ]
    prices = [0.55]  # Starting probability
    for _ in range(len(timestamps) - 1):
        prices.append(
            np.clip(prices[-1] + np.random.normal(0, 0.002), 0.01, 0.99)
        )

    # Inject a news event impact
    event_idx = 200  # ~16.7 hours in
    for i in range(event_idx, min(event_idx + 20, len(prices))):
        prices[i] += 0.05  # 5 percentage point jump

    price_data = pd.DataFrame({
        'timestamp': timestamps,
        'price': prices,
    })

    # Create a news event
    event = NewsEvent(
        timestamp=timestamps[event_idx],
        headline="Major endorsement boosts candidate's campaign",
        text="A key political figure has endorsed the candidate...",
        source="Reuters",
        sentiment=0.8,
    )

    # Analyze
    analyzer = NewsImpactAnalyzer(significance_threshold=0.02)
    result = analyzer.analyze_event(event, price_data)

    if result:
        print(f"Event: {result.event.headline}")
        print(f"Price before:      {result.price_before:.4f}")
        print(f"Price after 5min:  {result.price_after_5min:.4f}")
        print(f"Price after 1h:    {result.price_after_1h:.4f}")
        print(f"Price after 24h:   {result.price_after_24h:.4f}")
        print(f"Immediate impact:  {result.impact_immediate:+.4f}")
        print(f"Short-term impact: {result.impact_short:+.4f}")
        print(f"Sustained impact:  {result.impact_sustained:+.4f}")
        print(f"Significant:       {result.is_significant}")

24.8 Real-Time News Monitoring

To act on NLP signals in a timely manner, you need a system that monitors news sources in real time, computes sentiment scores, and generates alerts when significant events occur.

24.8.1 Building a News Pipeline

A real-time news pipeline for prediction markets has several components:

[News Sources] -> [Collector] -> [Preprocessor] -> [NLP Models]
                                                       |
[Market Data] -> [Feature Store] <---------------------+
                      |
               [Trading Logic] -> [Execution]

News Sources include:

  • RSS feeds: Major news outlets (Reuters, AP, Bloomberg) publish RSS feeds that can be polled every few minutes.
  • News APIs: Services like NewsAPI, GDELT, or Event Registry provide structured access to news articles. NewsAPI offers a free tier suitable for development.
  • Social media APIs: Twitter/X API (now expensive), Reddit API, and Bluesky API provide access to social media posts.
  • Web scraping: For sources without APIs, targeted web scraping can extract headlines and article text. (Always respect terms of service and robots.txt.)
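As a sketch of the News API route, a minimal poller against NewsAPI's /v2/everything endpoint might look like the following (the query, parameters, and key handling are assumptions; check the service's current documentation):

import requests

def fetch_articles(query, api_key):
    """Fetch recent articles matching a query from NewsAPI."""
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": query,
            "sortBy": "publishedAt",
            "pageSize": 20,
            "apiKey": api_key,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("articles", [])

# e.g. fetch_articles("bitcoin ETF", api_key="...")  # key from newsapi.org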

24.8.2 RSS Feeds and News APIs

RSS feeds are the simplest and most reliable way to monitor news:

import feedparser
import time
from datetime import datetime
from typing import List, Dict, Callable
import hashlib


class RSSMonitor:
    """
    Monitors RSS feeds for new articles.

    Tracks seen articles to avoid processing duplicates.
    """

    def __init__(self, feeds: Dict[str, str]):
        """
        Parameters
        ----------
        feeds : Dict[str, str]
            Mapping from feed name to feed URL.
        """
        self.feeds = feeds
        self.seen_ids = set()

    def _article_id(self, entry) -> str:
        """Generate a unique ID for an article."""
        key = entry.get('id', '') or entry.get('link', '') or entry.get('title', '')
        return hashlib.md5(key.encode()).hexdigest()

    def check_feeds(self) -> List[Dict]:
        """
        Check all feeds for new articles.

        Returns
        -------
        List[Dict]
            List of new articles with title, summary, link,
            published date, and source.
        """
        new_articles = []
        for name, url in self.feeds.items():
            try:
                feed = feedparser.parse(url)
                for entry in feed.entries:
                    article_id = self._article_id(entry)
                    if article_id not in self.seen_ids:
                        self.seen_ids.add(article_id)
                        new_articles.append({
                            'source': name,
                            'title': entry.get('title', ''),
                            'summary': entry.get('summary', ''),
                            'link': entry.get('link', ''),
                            'published': entry.get('published', ''),
                            'fetched_at': datetime.utcnow().isoformat(),
                        })
            except Exception as e:
                print(f"Error fetching {name}: {e}")

        return new_articles

    def monitor(
        self,
        callback: Callable[[List[Dict]], None],
        interval_seconds: int = 60,
    ):
        """
        Continuously monitor feeds and call the callback
        with new articles.

        Parameters
        ----------
        callback : callable
            Function to call with new articles.
        interval_seconds : int
            Seconds between feed checks.
        """
        print(f"Monitoring {len(self.feeds)} feeds "
              f"every {interval_seconds}s...")
        while True:
            new_articles = self.check_feeds()
            if new_articles:
                callback(new_articles)
            time.sleep(interval_seconds)
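
Wiring the monitor up is straightforward; a usage sketch (the feed URLs are illustrative placeholders):

if __name__ == "__main__":
    monitor = RSSMonitor({
        "bbc_world": "http://feeds.bbci.co.uk/news/world/rss.xml",
        "example_politics": "https://example.com/politics/rss",
    })

    def handle_new_articles(articles):
        for a in articles:
            print(f"[{a['source']}] {a['title']}")

    # Poll every two minutes; blocks indefinitely.
    monitor.monitor(handle_new_articles, interval_seconds=120)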

24.8.3 Real-Time Sentiment Scoring and Alert System

Combining the news monitor with sentiment analysis creates a real-time alert system:

import numpy as np
from dataclasses import dataclass
from typing import List, Dict, Optional
from collections import deque
from datetime import datetime


@dataclass
class SentimentAlert:
    """An alert generated when sentiment crosses a threshold."""
    timestamp: str
    alert_type: str  # 'spike', 'shift', 'volume'
    description: str
    articles: List[Dict]
    sentiment_score: float
    market_relevance: str


class RealTimeNewsSentimentMonitor:
    """
    Real-time news monitoring with sentiment analysis and alerting.

    Monitors news feeds, scores sentiment, tracks rolling averages,
    and generates alerts when conditions are met.
    """

    def __init__(
        self,
        sentiment_analyzer,  # The analyzer from Section 24.4
        keywords: Optional[List[str]] = None,
        alert_threshold: float = 0.3,
        volume_alert_multiplier: float = 3.0,
        rolling_window: int = 50,
    ):
        self.sentiment_analyzer = sentiment_analyzer
        self.keywords = [k.lower() for k in (keywords or [])]
        self.alert_threshold = alert_threshold
        self.volume_alert_multiplier = volume_alert_multiplier
        self.rolling_window = rolling_window

        self.sentiment_history = deque(maxlen=rolling_window)
        self.article_counts = deque(maxlen=24)  # completed hourly counts
        self.alerts = []
        self.current_hour_count = 0
        self.current_hour = None  # hour bucket currently being counted

    def is_relevant(self, article: Dict) -> bool:
        """Check if an article is relevant to our monitored keywords."""
        if not self.keywords:
            return True
        text = (article.get('title', '') + ' ' +
                article.get('summary', '')).lower()
        return any(kw in text for kw in self.keywords)

    def process_articles(self, articles: List[Dict]) -> List[SentimentAlert]:
        """
        Process new articles: filter, score sentiment, check alerts.

        Parameters
        ----------
        articles : List[Dict]
            New articles from the RSS monitor.

        Returns
        -------
        List[SentimentAlert]
            Any alerts generated.
        """
        # Roll the hourly counter when a new hour begins so that
        # article_counts accumulates completed hourly totals; without
        # this, the volume-spike check would never have a baseline.
        hour = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
        if self.current_hour is None:
            self.current_hour = hour
        elif hour != self.current_hour:
            self.article_counts.append(self.current_hour_count)
            self.current_hour_count = 0
            self.current_hour = hour

        relevant = [a for a in articles if self.is_relevant(a)]
        if not relevant:
            return []

        new_alerts = []

        for article in relevant:
            text = article.get('title', '') + '. ' + article.get('summary', '')
            result = self.sentiment_analyzer.analyze(text)
            article['sentiment'] = result.ensemble_score
            self.sentiment_history.append(result.ensemble_score)
            self.current_hour_count += 1

        # Check for sentiment spike
        if len(self.sentiment_history) >= 5:
            recent_avg = np.mean(list(self.sentiment_history)[-5:])
            overall_avg = np.mean(list(self.sentiment_history))

            if abs(recent_avg - overall_avg) > self.alert_threshold:
                direction = "positive" if recent_avg > overall_avg else "negative"
                alert = SentimentAlert(
                    timestamp=datetime.utcnow().isoformat(),
                    alert_type='spike',
                    description=(
                        f"Sentiment {direction} spike detected. "
                        f"Recent avg: {recent_avg:.3f}, "
                        f"Overall avg: {overall_avg:.3f}"
                    ),
                    articles=relevant,
                    sentiment_score=recent_avg,
                    market_relevance="HIGH",
                )
                new_alerts.append(alert)
                self.alerts.append(alert)

        # Check for volume spike
        if len(self.article_counts) > 0:
            avg_hourly = np.mean(list(self.article_counts))
            if (avg_hourly > 0 and
                    self.current_hour_count >
                    avg_hourly * self.volume_alert_multiplier):
                alert = SentimentAlert(
                    timestamp=datetime.utcnow().isoformat(),
                    alert_type='volume',
                    description=(
                        f"News volume spike: {self.current_hour_count} "
                        f"articles vs avg {avg_hourly:.1f}/hour"
                    ),
                    articles=relevant,
                    sentiment_score=np.mean(
                        [a['sentiment'] for a in relevant]
                    ),
                    market_relevance="MEDIUM",
                )
                new_alerts.append(alert)
                self.alerts.append(alert)

        # Print alerts
        for alert in new_alerts:
            print(f"\n{'='*60}")
            print(f"ALERT [{alert.alert_type.upper()}] "
                  f"at {alert.timestamp}")
            print(f"  {alert.description}")
            print(f"  Market relevance: {alert.market_relevance}")
            print(f"  Related articles: {len(alert.articles)}")
            for a in alert.articles[:3]:
                print(f"    - {a['title'][:80]}")
            print(f"{'='*60}")

        return new_alerts

This real-time monitoring system provides the infrastructure needed to act on NLP signals quickly. In practice, you would connect the alert system to your trading logic, potentially placing automated trades when sentiment signals are strong and confirmed by multiple sources.


24.9 Named Entity Recognition and Topic Extraction

Beyond sentiment, we can extract structured information from text using Named Entity Recognition (NER) and topic modeling. These techniques allow us to identify who and what is being discussed, and to track how the conversation around a prediction market question evolves over time.

24.9.1 NER for Extracting People, Organizations, and Events

Named Entity Recognition identifies and classifies named entities in text into predefined categories:

  • PERSON: "Joe Biden," "Elon Musk"
  • ORG: "Federal Reserve," "Polymarket," "Democratic Party"
  • GPE (Geo-Political Entity): "United States," "Ukraine," "Georgia"
  • DATE: "November 2024," "next Tuesday"
  • MONEY: "$1 billion," "25 basis points"
  • EVENT: "Super Bowl," "G7 summit"

For prediction markets, NER is valuable because it connects text to specific markets. An article mentioning "Biden" and "approval rating" is relevant to presidential election markets. An article mentioning "Federal Reserve" and "interest rates" is relevant to monetary policy markets.
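A minimal demonstration with spaCy (this assumes the en_core_web_sm model has been installed via `python -m spacy download en_core_web_sm`; the exact spans and labels depend on the model version):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Federal Reserve held rates steady as Biden campaigned in Georgia.")
for ent in doc.ents:
    print(f"{ent.text:20s} {ent.label_}")
# Typically: Federal Reserve -> ORG, Biden -> PERSON, Georgia -> GPE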

24.9.2 Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) is an unsupervised model that discovers topics in a collection of documents. Each topic is represented as a distribution over words, and each document is a mixture of topics.

The generative model for LDA is:

  1. For each topic $k$, draw a word distribution $\phi_k \sim \text{Dir}(\beta)$.
  2. For each document $d$:
     a. Draw a topic distribution $\theta_d \sim \text{Dir}(\alpha)$.
     b. For each word in the document:
        • Draw a topic $z \sim \text{Multinomial}(\theta_d)$.
        • Draw a word $w \sim \text{Multinomial}(\phi_z)$.

The hyperparameters $\alpha$ and $\beta$ control the sparsity of the topic and word distributions, respectively. A small $\alpha$ encourages documents to be about a few topics; a small $\beta$ encourages topics to use a few words.
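The effect of $\alpha$ is easy to see by sampling document-topic vectors directly (a small numpy sketch; printed values vary with the seed):

import numpy as np

rng = np.random.default_rng(0)
# Small alpha: mass concentrates on a few topics (sparse mixtures).
print(rng.dirichlet([0.1] * 5))
# Large alpha: mass spreads nearly evenly across topics.
print(rng.dirichlet([10.0] * 5))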

For prediction markets, LDA can be used to:

  • Track which topics are dominating the news cycle for a particular market.
  • Detect when a new topic emerges that is relevant to a market.
  • Cluster articles by topic for more targeted sentiment analysis.

24.9.3 Connecting Entities to Markets

The key insight is that NER and topic modeling can be used to create a mapping between text and markets. This mapping enables automated routing of news to relevant market models:

# Entity-to-market mapping example
entity_market_map = {
    "Biden": ["presidential-election-2024", "us-foreign-policy"],
    "Trump": ["presidential-election-2024", "us-foreign-policy"],
    "Federal Reserve": ["fed-rate-decision", "inflation-target"],
    "SEC": ["bitcoin-etf-approval", "crypto-regulation"],
    "Ukraine": ["ukraine-conflict-resolution", "nato-expansion"],
}

24.9.4 Python NER and Topic Pipeline

import spacy
from collections import defaultdict
from typing import List, Dict
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np


class NERTopicPipeline:
    """
    Combined NER and topic modeling pipeline for prediction
    market text analysis.
    """

    def __init__(
        self,
        spacy_model: str = "en_core_web_sm",
        n_topics: int = 10,
        entity_market_map: Dict[str, List[str]] = None,
    ):
        """
        Parameters
        ----------
        spacy_model : str
            SpaCy model to use for NER.
        n_topics : int
            Number of topics for LDA.
        entity_market_map : dict
            Mapping from entity names to related market IDs.
        """
        self.nlp = spacy.load(spacy_model)
        self.n_topics = n_topics
        self.entity_market_map = entity_market_map or {}

        # LDA components
        self.vectorizer = CountVectorizer(
            max_features=5000,
            stop_words='english',
            min_df=2,
            max_df=0.95,
        )
        self.lda = LatentDirichletAllocation(
            n_components=n_topics,
            random_state=42,
            max_iter=20,
        )
        self.is_fitted = False

    def extract_entities(self, text: str) -> Dict[str, List[str]]:
        """
        Extract named entities from text.

        Parameters
        ----------
        text : str
            Input text.

        Returns
        -------
        Dict[str, List[str]]
            Mapping from entity type to list of entities found.
        """
        doc = self.nlp(text)
        entities = defaultdict(list)
        for ent in doc.ents:
            entities[ent.label_].append(ent.text)
        return dict(entities)

    def find_relevant_markets(self, text: str) -> List[str]:
        """
        Identify prediction markets relevant to the given text
        based on entity mentions.

        Parameters
        ----------
        text : str
            Input text.

        Returns
        -------
        List[str]
            List of relevant market IDs.
        """
        entities = self.extract_entities(text)
        relevant_markets = set()

        for entity_type, entity_list in entities.items():
            for entity in entity_list:
                # Check direct matches
                if entity in self.entity_market_map:
                    relevant_markets.update(self.entity_market_map[entity])
                # Check partial matches
                for key, markets in self.entity_market_map.items():
                    if key.lower() in entity.lower() or entity.lower() in key.lower():
                        relevant_markets.update(markets)

        return list(relevant_markets)

    def fit_topics(self, texts: List[str]):
        """
        Fit the LDA topic model on a corpus.

        Parameters
        ----------
        texts : List[str]
            Corpus of texts to fit on.
        """
        dtm = self.vectorizer.fit_transform(texts)
        self.lda.fit(dtm)
        self.is_fitted = True

    def get_topic_distribution(self, text: str) -> np.ndarray:
        """
        Get the topic distribution for a single text.

        Parameters
        ----------
        text : str
            Input text.

        Returns
        -------
        np.ndarray
            Topic distribution vector of shape (n_topics,).
        """
        if not self.is_fitted:
            raise RuntimeError("Must call fit_topics() before get_topic_distribution()")
        dtm = self.vectorizer.transform([text])
        return self.lda.transform(dtm)[0]

    def get_topic_words(self, n_words: int = 10) -> Dict[int, List[str]]:
        """
        Get the top words for each topic.

        Parameters
        ----------
        n_words : int
            Number of top words per topic.

        Returns
        -------
        Dict[int, List[str]]
            Mapping from topic index to list of top words.
        """
        if not self.is_fitted:
            raise RuntimeError("Must call fit_topics() first.")
        feature_names = self.vectorizer.get_feature_names_out()
        topics = {}
        for idx, topic in enumerate(self.lda.components_):
            top_indices = topic.argsort()[-n_words:][::-1]
            topics[idx] = [feature_names[i] for i in top_indices]
        return topics

    def analyze_document(self, text: str) -> Dict:
        """
        Full analysis of a single document: entities, markets,
        and topics.

        Parameters
        ----------
        text : str
            Input text.

        Returns
        -------
        Dict
            Combined analysis results.
        """
        entities = self.extract_entities(text)
        markets = self.find_relevant_markets(text)

        result = {
            'entities': entities,
            'relevant_markets': markets,
        }

        if self.is_fitted:
            topic_dist = self.get_topic_distribution(text)
            result['topic_distribution'] = topic_dist.tolist()
            result['dominant_topic'] = int(np.argmax(topic_dist))

        return result


# Example usage
if __name__ == "__main__":
    entity_map = {
        "Biden": ["presidential-election-2024"],
        "Trump": ["presidential-election-2024"],
        "Federal Reserve": ["fed-rate-decision-2024"],
        "Fed": ["fed-rate-decision-2024"],
        "SEC": ["bitcoin-etf-approval"],
        "Ukraine": ["ukraine-ceasefire-2024"],
    }

    pipeline = NERTopicPipeline(
        n_topics=5,
        entity_market_map=entity_map,
    )

    # Analyze a single article
    article = (
        "President Biden announced new sanctions against Russia "
        "following the latest escalation in Ukraine. The Federal "
        "Reserve is expected to hold rates steady at its next meeting, "
        "according to market analysts."
    )

    entities = pipeline.extract_entities(article)
    markets = pipeline.find_relevant_markets(article)

    print("Entities found:")
    for etype, elist in entities.items():
        print(f"  {etype}: {elist}")

    print(f"\nRelevant markets: {markets}")

    # Fit topic model on a corpus
    corpus = [
        "The election polls show a tight race between candidates",
        "Federal Reserve signals no change in interest rate policy",
        "Bitcoin ETF application faces SEC scrutiny and delays",
        "Ukraine peace talks stall as both sides set conditions",
        "Campaign fundraising breaks records in the third quarter",
        "Inflation data comes in lower than expected for the month",
        "Cryptocurrency markets rally on institutional adoption news",
        "Diplomatic efforts intensify as ceasefire deadline approaches",
    ] * 5  # Repeat so terms clear the vectorizer's min_df=2 threshold

    pipeline.fit_topics(corpus)

    topics = pipeline.get_topic_words(n_words=5)
    print("\nDiscovered Topics:")
    for topic_id, words in topics.items():
        print(f"  Topic {topic_id}: {', '.join(words)}")

    analysis = pipeline.analyze_document(article)
    print(f"\nDominant topic: {analysis['dominant_topic']}")
    print(f"Topic distribution: {[f'{x:.3f}' for x in analysis['topic_distribution']]}")

24.10 Building NLP Features for Trading Models

The ultimate goal of all the NLP techniques in this chapter is to produce features that improve prediction market trading models. This section shows how to combine text-derived features with the tabular models from Chapter 23.

24.10.1 Text-Derived Features

From the techniques developed in this chapter, we can extract the following feature categories:

Sentiment Features:

  • sentiment_mean_1h: Average sentiment of articles in the past 1 hour.
  • sentiment_mean_24h: Average sentiment over the past 24 hours.
  • sentiment_std_24h: Standard deviation of sentiment (higher = more disagreement).
  • sentiment_momentum: Change in average sentiment from 24h ago to now.
  • sentiment_extreme: Count of articles with |sentiment| > 0.5.

Volume Features:

  • article_count_1h: Number of relevant articles in the past hour.
  • article_count_24h: Number of relevant articles in the past 24 hours.
  • volume_ratio: Ratio of recent volume to historical average.
  • source_diversity: Number of unique sources covering the topic.

Topic Features:

  • topic_k_weight: Weight of topic $k$ in recent articles (one feature per topic).
  • topic_shift: Change in dominant topic over time.
  • topic_entropy: Entropy of topic distribution (higher = more diverse coverage).

Entity Features:

  • entity_count: Number of unique entities mentioned.
  • key_entity_frequency: How often a specific key entity is mentioned.
  • new_entity_flag: Whether a previously unseen entity appears.

Novelty Features:

  • news_surprise: How novel the latest article is (from Section 24.7).
  • information_velocity: Rate at which new information is appearing.

24.10.2 Feature Integration with Tabular Models

These NLP features can be combined with the tabular features from Chapter 23 (market features like price, volume, spread; time features; cross-market features) into a unified feature set:

$$ \mathbf{x}_t = [\underbrace{x_1, \ldots, x_m}_{\text{market features}}, \underbrace{x_{m+1}, \ldots, x_{m+n}}_{\text{NLP features}}] $$

The combined feature vector is then fed into gradient boosted trees, random forests, or neural networks for probability estimation or trading signal generation.
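A sketch of this integration with synthetic stand-ins for both feature groups (the column names and labels are illustrative, not outputs of the earlier pipelines):

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 500

# Illustrative stand-ins for Chapter 23 market features and this
# chapter's NLP features, aligned on the same observation index.
market_features = pd.DataFrame({
    "price": rng.uniform(0.2, 0.8, n),
    "volume": rng.poisson(30, n),
    "spread": rng.uniform(0.005, 0.03, n),
})
nlp_features = pd.DataFrame({
    "nlp_24h_sentiment_mean": rng.normal(0.0, 0.3, n),
    "nlp_volume_ratio": rng.gamma(2.0, 0.5, n),
})

X = pd.concat([market_features, nlp_features], axis=1)
y = (rng.random(n) < 0.5).astype(int)  # placeholder outcome labels

model = GradientBoostingClassifier(random_state=0).fit(X, y)
print(f"P(outcome) for first row: {model.predict_proba(X.iloc[[0]])[0, 1]:.3f}")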

24.10.3 Python NLP Feature Generator

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Dict, Optional
from collections import deque


class NLPFeatureGenerator:
    """
    Generates NLP-derived features for prediction market
    trading models.

    Maintains rolling windows of article data and computes
    features at each time step.
    """

    def __init__(
        self,
        sentiment_analyzer,
        windows: List[int] = None,
        n_topics: int = 5,
    ):
        """
        Parameters
        ----------
        sentiment_analyzer : object
            Sentiment analyzer with an analyze() method.
        windows : List[int]
            Window sizes in hours for rolling features.
        n_topics : int
            Number of topics in the topic model.
        """
        self.analyzer = sentiment_analyzer
        self.windows = windows or [1, 6, 24, 72]
        self.n_topics = n_topics
        self.article_buffer = []  # List of (timestamp, sentiment, source, ...)

    def add_article(
        self,
        timestamp: datetime,
        text: str,
        source: str,
        topic_distribution: Optional[np.ndarray] = None,
        entities: Optional[List[str]] = None,
    ):
        """
        Add a new article to the buffer and compute its features.

        Parameters
        ----------
        timestamp : datetime
            Publication time.
        text : str
            Article text.
        source : str
            Source name.
        topic_distribution : np.ndarray, optional
            Pre-computed topic distribution.
        entities : List[str], optional
            Pre-extracted entities.
        """
        result = self.analyzer.analyze(text)

        self.article_buffer.append({
            'timestamp': timestamp,
            'sentiment': result.ensemble_score,
            'subjectivity': result.textblob_subjectivity,
            'source': source,
            'topic_dist': topic_distribution,
            'entities': entities or [],
            'text_length': len(text),
        })

    def _get_articles_in_window(
        self, current_time: datetime, hours: int
    ) -> List[Dict]:
        """Get articles within the specified time window."""
        cutoff = current_time - timedelta(hours=hours)
        return [
            a for a in self.article_buffer
            if a['timestamp'] >= cutoff and a['timestamp'] <= current_time
        ]

    def generate_features(self, current_time: datetime) -> Dict[str, float]:
        """
        Generate all NLP features for the current time step.

        Parameters
        ----------
        current_time : datetime
            The current time for feature generation.

        Returns
        -------
        Dict[str, float]
            Dictionary of feature names to values.
        """
        features = {}

        for window in self.windows:
            articles = self._get_articles_in_window(current_time, window)
            prefix = f"nlp_{window}h"

            if not articles:
                # No articles in this window: emit zeros for every
                # feature so all rows share the same columns.
                features[f"{prefix}_sentiment_mean"] = 0.0
                features[f"{prefix}_sentiment_std"] = 0.0
                features[f"{prefix}_sentiment_min"] = 0.0
                features[f"{prefix}_sentiment_max"] = 0.0
                features[f"{prefix}_sentiment_skew"] = 0.0
                features[f"{prefix}_article_count"] = 0
                features[f"{prefix}_source_diversity"] = 0
                features[f"{prefix}_subjectivity_mean"] = 0.0
                for k in range(self.n_topics):
                    features[f"{prefix}_topic_{k}"] = 0.0
                features[f"{prefix}_topic_entropy"] = 0.0
                features[f"{prefix}_entity_count"] = 0
                features[f"{prefix}_entity_mentions"] = 0
                continue

            sentiments = [a['sentiment'] for a in articles]

            # Sentiment features
            features[f"{prefix}_sentiment_mean"] = np.mean(sentiments)
            features[f"{prefix}_sentiment_std"] = (
                np.std(sentiments) if len(sentiments) > 1 else 0.0
            )
            features[f"{prefix}_sentiment_min"] = np.min(sentiments)
            features[f"{prefix}_sentiment_max"] = np.max(sentiments)
            features[f"{prefix}_sentiment_skew"] = (
                float(pd.Series(sentiments).skew())
                if len(sentiments) > 2 else 0.0
            )

            # Volume features
            features[f"{prefix}_article_count"] = len(articles)
            features[f"{prefix}_source_diversity"] = len(
                set(a['source'] for a in articles)
            )

            # Subjectivity features
            subj = [a['subjectivity'] for a in articles]
            features[f"{prefix}_subjectivity_mean"] = np.mean(subj)

            # Topic features
            topic_dists = [
                a['topic_dist'] for a in articles
                if a['topic_dist'] is not None
            ]
            if topic_dists:
                avg_topics = np.mean(topic_dists, axis=0)
                for k in range(min(self.n_topics, len(avg_topics))):
                    features[f"{prefix}_topic_{k}"] = avg_topics[k]
                # Topic entropy
                features[f"{prefix}_topic_entropy"] = float(
                    -np.sum(avg_topics * np.log(avg_topics + 1e-10))
                )
            else:
                for k in range(self.n_topics):
                    features[f"{prefix}_topic_{k}"] = 0.0
                features[f"{prefix}_topic_entropy"] = 0.0

            # Entity features
            all_entities = []
            for a in articles:
                all_entities.extend(a['entities'])
            features[f"{prefix}_entity_count"] = len(set(all_entities))
            features[f"{prefix}_entity_mentions"] = len(all_entities)

        # Cross-window features
        short_articles = self._get_articles_in_window(current_time, 1)
        long_articles = self._get_articles_in_window(current_time, 24)

        if short_articles and long_articles:
            short_sent = np.mean([a['sentiment'] for a in short_articles])
            long_sent = np.mean([a['sentiment'] for a in long_articles])
            features["nlp_sentiment_momentum"] = short_sent - long_sent
        else:
            features["nlp_sentiment_momentum"] = 0.0

        # Volume ratio
        if long_articles:
            expected_hourly = len(long_articles) / 24
            actual_hourly = len(short_articles) if short_articles else 0
            features["nlp_volume_ratio"] = (
                actual_hourly / expected_hourly
                if expected_hourly > 0 else 0.0
            )
        else:
            features["nlp_volume_ratio"] = 0.0

        return features

    def generate_feature_dataframe(
        self,
        timestamps: List[datetime],
    ) -> pd.DataFrame:
        """
        Generate features for multiple time steps.

        Parameters
        ----------
        timestamps : List[datetime]
            Time steps at which to generate features.

        Returns
        -------
        pd.DataFrame
            DataFrame with one row per timestamp and one column per feature.
        """
        rows = []
        for ts in timestamps:
            features = self.generate_features(ts)
            features['timestamp'] = ts
            rows.append(features)
        return pd.DataFrame(rows).set_index('timestamp')

This feature generator produces a comprehensive set of NLP features that can be joined with market data features for model training. The multi-window approach (1h, 6h, 24h, 72h) captures both immediate reactions and longer-term sentiment trends.
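A usage sketch with a stub analyzer standing in for the ensemble analyzer of Section 24.4 (the stub's attribute names mirror what the generator reads; the articles are invented):

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class _StubResult:
    ensemble_score: float
    textblob_subjectivity: float

class _StubAnalyzer:
    """Stands in for the Section 24.4 ensemble analyzer."""
    def analyze(self, text):
        score = 0.5 if "endorse" in text.lower() else 0.0
        return _StubResult(ensemble_score=score, textblob_subjectivity=0.4)

gen = NLPFeatureGenerator(_StubAnalyzer(), windows=[1, 24])
now = datetime(2024, 11, 1, 12, 0)
gen.add_article(now - timedelta(minutes=30),
                "Key figure endorses the candidate", "Reuters")
gen.add_article(now - timedelta(hours=10),
                "Race remains tight in swing states", "AP")

features = gen.generate_features(now)
print(features["nlp_1h_sentiment_mean"])   # 0.5 (only the endorsement article)
print(features["nlp_24h_article_count"])   # 2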


24.11 LLMs as Forecasters

Perhaps the most fascinating frontier in NLP for prediction markets is the direct use of large language models (LLMs) as forecasters. Rather than using NLP to extract features that are fed into a separate model, what if we simply ask an LLM "What is the probability that event X occurs?"

24.11.1 Using GPT/Claude for Probability Estimation

Modern LLMs like GPT-4 and Claude have been trained on vast amounts of text that includes forecasting discussions, base rates, historical outcomes, and analytical reasoning. When prompted appropriately, they can produce probability estimates for future events.

The basic approach is:

Prompt: "Based on current information as of [date], what is the
probability that [event description]? Provide your estimate as
a number between 0 and 1, and explain your reasoning."

Research has shown that LLM probability estimates are surprisingly well-calibrated for some types of questions, particularly those where:

  • Historical base rates are informative.
  • The question has been widely discussed in the training data.
  • The answer can be reasoned out analytically rather than hinging on private information.

24.11.2 Prompt Engineering for Forecasting

The quality of LLM forecasts depends heavily on the prompt. Key principles:

Principle 1: Provide context. Give the LLM relevant background information, recent developments, and the current date.

Principle 2: Request calibrated probabilities. Ask the LLM to think about base rates, reference classes, and to consider both sides of the argument.

Principle 3: Use structured reasoning. Ask the LLM to list factors for and against, assign weights, and then synthesize a probability.

Principle 4: Request confidence intervals. Ask for a range (e.g., "70-80% likely") rather than a point estimate, to capture the LLM's uncertainty.

Principle 5: Decompose complex questions. Break a complex question into sub-questions, get probabilities for each, and combine them.
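As a worked example of Principle 5, "Will the bill become law this year?" might decompose into passage in each chamber plus a signature, with illustrative sub-estimates (the numbers are assumed, not sourced):

$$ P(\text{law}) = P(\text{House}) \cdot P(\text{Senate} \mid \text{House}) \cdot P(\text{signed} \mid \text{passed}) = 0.8 \times 0.6 \times 0.9 \approx 0.43 $$

Each sub-question is easier for the LLM to reason about than the conjunction, and an error in any one estimate is visible rather than buried in a single holistic number.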

An effective prompting template:

You are a professional forecaster. Estimate the probability
of the following event:

EVENT: {event_description}
RESOLUTION DATE: {resolution_date}
CURRENT DATE: {current_date}

Please structure your analysis as follows:
1. BASE RATE: What is the historical base rate for similar events?
2. FACTORS FOR: What factors increase the probability?
3. FACTORS AGAINST: What factors decrease the probability?
4. RECENT DEVELOPMENTS: What recent information is most relevant?
5. PROBABILITY ESTIMATE: Your best estimate as a number between
   0.01 and 0.99.
6. CONFIDENCE: How confident are you in this estimate?
   (low/medium/high)

24.11.3 Limitations and Calibration of LLM Predictions

LLM forecasts have several important limitations:

Knowledge cutoff. LLMs have a training data cutoff date and may not know about recent events. This is particularly important for prediction markets where current information is critical.

Sycophancy. LLMs may anchor on probabilities suggested in the prompt or adjust their estimates to match what they perceive the user wants to hear.

Compression of probabilities. LLMs tend to avoid extreme probabilities. They rarely output values below 0.05 or above 0.95, even when these might be appropriate.

Inconsistency. The same question asked in different ways may produce different probability estimates. Running multiple prompts and averaging can mitigate this.

Lack of true updating. LLMs do not update their beliefs in the Bayesian sense. They generate plausible-sounding analyses that may not properly weight new evidence.

Calibration analysis of LLM forecasts typically shows:

  • Overconfidence in the 50-80% range (events they say are 70% likely actually happen about 60% of the time).
  • Underconfidence at extremes (events they say are 90% likely actually happen 95% of the time).
  • Better calibration on well-known topics than obscure ones.
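One common post-hoc correction for the compression effect is to extremize forecasts in logit space. A minimal sketch (the exponent is illustrative; in practice it would be fit on resolved questions):

import numpy as np

def extremize(p, alpha=1.3):
    """Push a probability away from 0.5 in logit space.

    alpha > 1 counteracts the compression described above;
    alpha = 1.3 is an illustrative value, not a tuned one.
    """
    logit = np.log(p / (1 - p))
    return 1.0 / (1.0 + np.exp(-alpha * logit))

print(extremize(0.70))  # ~0.75
print(extremize(0.95))  # ~0.98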

24.11.4 Python LLM Forecasting Harness

import re
from dataclasses import dataclass
from typing import List, Dict, Optional, Tuple
import numpy as np
from datetime import datetime


@dataclass
class ForecastResult:
    """Container for an LLM forecast result."""
    question: str
    probability: float
    confidence: str
    reasoning: str
    base_rate: Optional[float]
    model: str
    timestamp: str
    prompt_version: str


class LLMForecaster:
    """
    Uses LLMs to generate probability forecasts for prediction
    market questions.

    Supports multiple prompting strategies and ensemble methods.
    """

    def __init__(
        self,
        model: str = "gpt-4",
        api_key: str = None,
        temperature: float = 0.3,
        n_samples: int = 3,
    ):
        """
        Parameters
        ----------
        model : str
            LLM model identifier.
        api_key : str
            API key for the LLM service.
        temperature : float
            Sampling temperature (lower = more deterministic).
        n_samples : int
            Number of independent samples to average.
        """
        self.model = model
        self.api_key = api_key
        self.temperature = temperature
        self.n_samples = n_samples

    def _build_prompt(
        self,
        question: str,
        context: str = "",
        current_date: str = None,
        resolution_date: str = None,
        prompt_version: str = "structured",
    ) -> str:
        """Build the forecasting prompt."""
        if current_date is None:
            current_date = datetime.now().strftime("%Y-%m-%d")

        if prompt_version == "structured":
            prompt = f"""You are an expert superforecaster who is well-calibrated
and thinks carefully about base rates and evidence.

Estimate the probability of the following event:

EVENT: {question}
CURRENT DATE: {current_date}
"""
            if resolution_date:
                prompt += f"RESOLUTION DATE: {resolution_date}\n"
            if context:
                prompt += f"\nRELEVANT CONTEXT:\n{context}\n"

            prompt += """
Please structure your analysis:
1. BASE RATE: Historical base rate for similar events.
2. FACTORS FOR: Evidence increasing the probability.
3. FACTORS AGAINST: Evidence decreasing the probability.
4. SYNTHESIS: Weigh the factors and arrive at an estimate.

IMPORTANT: End your response with exactly this format:
PROBABILITY: [your estimate as a decimal between 0.01 and 0.99]
CONFIDENCE: [low/medium/high]
"""

        elif prompt_version == "simple":
            prompt = f"""What is the probability that: {question}

As of {current_date}. {context}

Respond with a single number between 0.01 and 0.99.
PROBABILITY:"""

        elif prompt_version == "devil_advocate":
            prompt = f"""You are an expert forecaster. Consider the question:
{question}

Current date: {current_date}
{context}

First, make the STRONGEST CASE that this event WILL happen.
Then, make the STRONGEST CASE that it will NOT happen.
Finally, weigh both cases and provide your probability estimate.

PROBABILITY: [decimal between 0.01 and 0.99]
CONFIDENCE: [low/medium/high]
"""
        else:
            raise ValueError(f"Unknown prompt version: {prompt_version}")

        return prompt

    def _parse_response(self, response_text: str) -> Tuple[float, str]:
        """
        Parse the LLM response to extract probability and confidence.

        Parameters
        ----------
        response_text : str
            The raw LLM response.

        Returns
        -------
        Tuple[float, str]
            (probability, confidence_level)
        """

        # Extract probability
        prob_match = re.search(
            r'PROBABILITY:\s*(0?\.\d+|1\.0?)', response_text
        )
        if prob_match:
            prob = float(prob_match.group(1))
        else:
            # Try to find any decimal between 0 and 1
            numbers = re.findall(r'0\.\d+', response_text)
            if numbers:
                prob = float(numbers[-1])  # Take the last one
            else:
                prob = 0.5  # Default fallback

        prob = np.clip(prob, 0.01, 0.99)

        # Extract confidence
        conf_match = re.search(
            r'CONFIDENCE:\s*(low|medium|high)', response_text, re.IGNORECASE
        )
        confidence = conf_match.group(1).lower() if conf_match else "medium"

        return prob, confidence

    def _call_llm(self, prompt: str) -> str:
        """
        Call the LLM API.

        In production, this would make an actual API call.
        Here we simulate the response for demonstration.
        """
        # SIMULATION: In production, replace with actual API call
        # Example with OpenAI:
        #
        # from openai import OpenAI
        # client = OpenAI(api_key=self.api_key)
        # response = client.chat.completions.create(
        #     model=self.model,
        #     messages=[{"role": "user", "content": prompt}],
        #     temperature=self.temperature,
        # )
        # return response.choices[0].message.content

        # Simulated response for demonstration
        simulated = """
1. BASE RATE: Historically, incumbent parties retain the presidency
about 50-60% of the time in modern US elections.

2. FACTORS FOR: Strong economic indicators, no major scandals,
incumbency advantage.

3. FACTORS AGAINST: Historical trend of party fatigue after two
terms, opposition candidate polling strength.

4. SYNTHESIS: Weighing the incumbency advantage against historical
patterns of party alternation, and considering current polling data.

PROBABILITY: 0.55
CONFIDENCE: medium
"""
        return simulated

    def forecast(
        self,
        question: str,
        context: str = "",
        current_date: str = None,
        resolution_date: str = None,
    ) -> ForecastResult:
        """
        Generate a probability forecast for a question.

        Uses multiple samples and prompt versions to produce
        a robust estimate.

        Parameters
        ----------
        question : str
            The prediction market question.
        context : str
            Relevant context to provide to the LLM.
        current_date : str
            Current date string.
        resolution_date : str
            When the question resolves.

        Returns
        -------
        ForecastResult
            The forecast with probability, confidence, and reasoning.
        """
        probabilities = []
        all_reasoning = []

        prompt_versions = ["structured", "devil_advocate", "simple"]

        for version in prompt_versions:
            for _ in range(self.n_samples):
                prompt = self._build_prompt(
                    question, context, current_date,
                    resolution_date, version
                )
                response = self._call_llm(prompt)
                prob, confidence = self._parse_response(response)
                probabilities.append(prob)
                all_reasoning.append(response)

        # Aggregate: trimmed mean (remove most extreme estimates)
        sorted_probs = sorted(probabilities)
        trimmed = sorted_probs[1:-1] if len(sorted_probs) > 2 else sorted_probs
        final_prob = np.mean(trimmed)

        # Determine overall confidence from variance
        prob_std = np.std(probabilities)
        if prob_std < 0.05:
            overall_confidence = "high"
        elif prob_std < 0.15:
            overall_confidence = "medium"
        else:
            overall_confidence = "low"

        return ForecastResult(
            question=question,
            probability=float(final_prob),
            confidence=overall_confidence,
            reasoning=all_reasoning[0],  # Use first structured response
            base_rate=None,
            model=self.model,
            timestamp=datetime.now().isoformat(),
            prompt_version="ensemble",
        )

    def evaluate_calibration(
        self,
        questions: List[str],
        outcomes: List[int],
        contexts: List[str] = None,
    ) -> Dict[str, float]:
        """
        Evaluate calibration of LLM forecasts against actual outcomes.

        Parameters
        ----------
        questions : List[str]
            List of questions that have resolved.
        outcomes : List[int]
            Actual outcomes (0 or 1).
        contexts : List[str], optional
            Contexts for each question.

        Returns
        -------
        Dict[str, float]
            Calibration metrics including Brier score and
            calibration error.
        """
        if contexts is None:
            contexts = [""] * len(questions)

        forecasts = []
        for q, ctx in zip(questions, contexts):
            result = self.forecast(q, ctx)
            forecasts.append(result.probability)

        forecasts = np.array(forecasts)
        outcomes = np.array(outcomes)

        # Brier score
        brier = np.mean((forecasts - outcomes) ** 2)

        # Calibration by bins
        n_bins = 5
        bins = np.linspace(0, 1, n_bins + 1)
        calibration_errors = []

        for i in range(n_bins):
            mask = (forecasts >= bins[i]) & (forecasts < bins[i + 1])
            if mask.sum() > 0:
                predicted_avg = forecasts[mask].mean()
                actual_avg = outcomes[mask].mean()
                calibration_errors.append(abs(predicted_avg - actual_avg))

        avg_calibration_error = (
            np.mean(calibration_errors) if calibration_errors else 0.0
        )

        # Log loss
        eps = 1e-10
        log_loss = -np.mean(
            outcomes * np.log(forecasts + eps) +
            (1 - outcomes) * np.log(1 - forecasts + eps)
        )

        return {
            'brier_score': float(brier),
            'avg_calibration_error': float(avg_calibration_error),
            'log_loss': float(log_loss),
            'mean_forecast': float(forecasts.mean()),
            'forecast_std': float(forecasts.std()),
            'n_questions': len(questions),
        }


# Example usage
if __name__ == "__main__":
    forecaster = LLMForecaster(
        model="gpt-4",
        temperature=0.3,
        n_samples=2,
    )

    # Single forecast
    result = forecaster.forecast(
        question="Will the incumbent party win the 2024 US presidential election?",
        context="Current polls show a tight race. Economic indicators are mixed.",
        current_date="2024-06-01",
        resolution_date="2024-11-05",
    )

    print(f"Question: {result.question}")
    print(f"Probability: {result.probability:.3f}")
    print(f"Confidence: {result.confidence}")
    print(f"Model: {result.model}")
    print(f"\nReasoning excerpt: {result.reasoning[:200]}...")

    # Calibration evaluation (with simulated data)
    questions = [
        "Will GDP growth exceed 3% in Q1?",
        "Will the bill pass the Senate?",
        "Will the candidate win the primary?",
    ]
    outcomes = [1, 0, 1]

    metrics = forecaster.evaluate_calibration(questions, outcomes)
    print(f"\nCalibration Metrics:")
    for k, v in metrics.items():
        print(f"  {k}: {v:.4f}")

24.11.5 Practical Considerations for LLM Forecasting

Using LLMs as forecasters in a trading context requires careful consideration:

Cost management. LLM API calls are expensive. At $0.01-0.06 per 1,000 tokens for frontier models, running multiple prompts across hundreds of markets can quickly become costly. Budget accordingly and cache results.

Latency. LLM responses take 2-30 seconds. For time-sensitive trading, LLM forecasts should be generated on a schedule (e.g., hourly) rather than in response to individual news events.

Reproducibility. Even with temperature=0, LLM outputs are not perfectly deterministic. Log all prompts and responses for reproducibility.

Complementary use. LLM forecasts are most valuable as one input among many, not as a sole trading signal. They are particularly useful for:

  • Initial probability estimates for new markets.
  • Sanity checks on model-derived probabilities.
  • Identifying qualitative factors that quantitative models might miss.
  • Generating hypotheses about what information the market may be overlooking.


24.12 Chapter Summary

This chapter has built a comprehensive NLP toolkit for prediction market trading, progressing from foundational text processing through modern deep learning to the frontier of LLM-based forecasting.

Key technical contributions:

  1. Text preprocessing (Section 24.2): A configurable pipeline that handles the unique challenges of prediction market text, including news articles, social media, and political/financial language.

  2. Classical NLP (Section 24.3): TF-IDF representations paired with logistic regression provide fast, interpretable baselines for text classification. The n-gram features and sublinear TF scaling are particularly important for domain-specific text.

  3. Sentiment analysis (Section 24.4): Multiple sentiment scoring methods (VADER, TextBlob, transformers) with domain-specific lexicon augmentation. The ensemble approach reduces the risk of any single method failing on unusual text.

  4. Transformer models (Sections 24.5-24.6): BERT-based classification with fine-tuning on market-relevant text. The attention mechanism provides superior context understanding, and the pre-training/fine-tuning paradigm enables strong performance with limited labeled data.

  5. News impact analysis (Section 24.7): Event study methodology adapted for prediction markets, measuring how news events affect prices and quantifying the surprise content of news.

  6. Real-time monitoring (Section 24.8): An end-to-end pipeline from RSS feeds through sentiment scoring to alerting, enabling timely action on NLP-derived signals.

  7. Entity and topic extraction (Section 24.9): NER connects text to specific markets, while topic modeling tracks the evolving information environment around market questions.

  8. Feature engineering (Section 24.10): A systematic framework for generating NLP features (sentiment, volume, topic, entity, novelty) that integrate with the tabular models from Chapter 23.

  9. LLM forecasting (Section 24.11): Direct use of large language models for probability estimation, including prompt engineering, calibration evaluation, and practical deployment considerations.

The unified perspective: Text data is the raw material from which prediction market prices are ultimately derived. By systematically extracting signal from text -- whether through simple lexicon-based sentiment or sophisticated transformer models -- we can identify information before it is fully reflected in prices. The NLP features developed in this chapter complement the market-derived features of Chapter 23 to create a more complete picture of the information environment.


What's Next

In Chapter 25, we will explore Time Series and Sequential Models, building on the features developed here and in Chapter 23 to model the temporal dynamics of prediction markets. We will cover autoregressive models, recurrent neural networks, and state-space models, showing how to capture the time-dependent patterns in both market data and NLP-derived features. The combination of NLP features from this chapter with the sequential models of Chapter 25 will form the foundation of a sophisticated prediction market trading system.