> "The market is a discounting mechanism for all known information -- but text is where that information lives before it becomes known."
In This Chapter
- 24.1 Why Text Data Matters for Prediction Markets
- 24.2 Text Preprocessing Pipeline
- 24.3 Classical NLP: Bag of Words and TF-IDF
- 24.4 Sentiment Analysis Fundamentals
- 24.5 Transformer Models: BERT and Beyond
- 24.6 Fine-Tuning for Prediction Market Text
- 24.7 News Impact Analysis
- 24.8 Real-Time News Monitoring
- 24.9 Named Entity Recognition and Topic Extraction
- 24.10 Building NLP Features for Trading Models
- 24.11 LLMs as Forecasters
- 24.12 Chapter Summary
- What's Next
Chapter 24: NLP and Sentiment Analysis
"The market is a discounting mechanism for all known information -- but text is where that information lives before it becomes known."
Prediction markets exist at the intersection of human beliefs and real-world events. The raw material of those beliefs -- news articles, social media posts, political speeches, earnings calls, expert commentary -- is overwhelmingly textual. A trader who can systematically extract signal from this torrent of text holds a structural advantage over one who relies on manual reading alone.
This chapter bridges the gap between the machine learning foundations of Chapter 23 and the unstructured world of natural language. We will build a complete NLP toolkit for prediction market trading: from classical text preprocessing through modern transformer architectures to the frontier question of whether large language models can themselves serve as forecasters. Every technique is grounded in practical, runnable Python code and framed around the unique demands of prediction market analysis.
24.1 Why Text Data Matters for Prediction Markets
24.1.1 News Drives Markets
Prediction markets are, at their core, aggregators of information. When a breaking news story changes the probability of an event, it does so because traders read that story, update their beliefs, and place trades. The causal chain is:
$$ \text{Event occurs} \rightarrow \text{Text published} \rightarrow \text{Traders read text} \rightarrow \text{Beliefs update} \rightarrow \text{Prices move} $$
This chain implies a latency between text publication and price movement. For liquid markets like Polymarket or Kalshi, this latency can be seconds to minutes for major breaking news. For less liquid markets, it can be hours or days. That latency is your opportunity.
Consider some concrete examples:
- Political markets: A poll showing a 5-point shift in a swing state will move presidential election markets. The poll is published as a news article or social media post before it is reflected in market prices.
- Regulatory markets: An SEC commissioner's speech hinting at new cryptocurrency regulations contains signal about "Will the SEC approve a Bitcoin ETF?" markets.
- Geopolitical markets: Diplomatic communiques, military movements reported by journalists, and satellite imagery analyses all appear as text before geopolitical event markets adjust.
- Economic markets: Fed minutes, earnings call transcripts, and economic commentary contain forward-looking language that precedes market movements.
24.1.2 Text as a Leading Indicator
Empirical research consistently shows that textual information leads market price movements. A seminal study by Tetlock (2007) demonstrated that the pessimism in Wall Street Journal columns predicted negative market returns and increased trading volume. Loughran and McDonald (2011) showed that the tone of 10-K filings predicted stock return volatility.
For prediction markets specifically, text is an even more powerful leading indicator because:
- Information asymmetry is higher: Unlike equities where thousands of analysts parse every data point, many prediction market questions have thin analyst coverage. A trader with an NLP pipeline monitoring relevant news has a larger edge.
- The connection between text and outcome is more direct: A news article about a candidate's scandal directly pertains to "Will candidate X win?" -- unlike the more tenuous connection between a news article and a stock price.
- Markets are thinner: Lower liquidity means that it takes longer for information to be fully reflected in prices, extending the window during which text-derived signals are profitable.
24.1.3 Sentiment as a Feature
Sentiment analysis transforms unstructured text into a numerical feature that can be fed into the predictive models we built in Chapter 23. At its simplest, sentiment is a single number on a scale from -1 (very negative) to +1 (very positive). More sophisticated approaches produce multi-dimensional representations:
- Polarity: How positive or negative is the text?
- Subjectivity: Is the text factual reporting or opinion?
- Intensity: How strong is the expressed sentiment?
- Aspect-specific sentiment: What is the sentiment toward specific entities or topics mentioned in the text?
- Uncertainty: Does the text express confidence or doubt?
When aggregated over time, sentiment features capture the evolving information environment around a prediction market question. A sustained shift toward negative sentiment about a political candidate, even before any single decisive event, can predict a gradual decline in that candidate's market price.
24.1.4 The NLP Opportunity for Prediction Market Traders
The NLP opportunity in prediction markets is larger than in traditional financial markets for several reasons:
- Less competition from sophisticated NLP systems: Major hedge funds deploy enormous NLP infrastructure for equity trading. Prediction markets see far less automated text analysis, meaning the marginal value of even basic NLP is higher.
- Diverse text sources: Prediction market questions span politics, sports, entertainment, science, and more. This diversity means that general-purpose NLP models are particularly valuable -- they can extract signal across many domains.
- Rapid question lifecycle: New prediction market questions are created frequently, often in response to current events. An NLP system that can quickly assess the information environment around a new question provides an immediate advantage.
- Clear outcome labels: Unlike stock returns, prediction market outcomes are binary. This makes it easier to evaluate whether text-derived features actually improve forecasts, and it simplifies the supervised learning setup for training custom models.
The remainder of this chapter builds the technical infrastructure to exploit these opportunities.
24.2 Text Preprocessing Pipeline
Before any NLP model can extract meaning from text, the raw text must be cleaned and standardized. This preprocessing step is critical: garbage in, garbage out. The specific preprocessing steps depend on the downstream model -- classical models require more aggressive preprocessing than modern transformers -- but understanding the full pipeline is essential.
24.2.1 Tokenization
Tokenization is the process of splitting text into discrete units called tokens. These tokens might be words, subwords, or characters, depending on the approach.
Word tokenization splits on whitespace and punctuation:
"The Fed raised rates by 0.25%" -> ["The", "Fed", "raised", "rates", "by", "0.25", "%"]
Subword tokenization (used by BERT, GPT, and other transformers) splits rare words into common subword pieces:
"cryptocurrency" -> ["crypto", "##currency"]
"geopolitical" -> ["geo", "##political"]
The advantage of subword tokenization is that it handles out-of-vocabulary words gracefully. A model that has never seen "cryptocurrency" during training can still process it as the combination of "crypto" and "currency," both of which it likely has seen.
Sentence tokenization splits text into sentences, which is useful when you want to analyze sentiment at the sentence level:
"The candidate won the primary. However, polls show a tough general election."
-> ["The candidate won the primary.", "However, polls show a tough general election."]
24.2.2 Lowercasing
Converting all text to lowercase reduces vocabulary size and ensures that "Fed," "fed," and "FED" are treated as the same token. However, lowercasing destroys information in some cases -- "US" (United States) vs. "us" (pronoun), for example. For classical models, lowercasing is almost always applied. For transformer models, cased and uncased variants exist, and the cased variants often perform better on tasks where capitalization carries meaning (like named entity recognition).
24.2.3 Stopword Removal
Stopwords are high-frequency words that carry little semantic content: "the," "is," "at," "which," "on," etc. Removing them reduces noise for bag-of-words and TF-IDF models. However, stopword removal should be applied judiciously:
- For sentiment analysis, words like "not," "no," "never" are sometimes in stopword lists but are critical for sentiment (compare "this is good" vs. "this is not good").
- For transformer models, stopwords should generally NOT be removed, as the model was trained with them and relies on them for understanding syntax and context.
24.2.4 Stemming and Lemmatization
Stemming reduces words to their root form by stripping suffixes:
"running" -> "run"
"election" -> "elect"
"presidential" -> "presidenti" (imperfect!)
The Porter stemmer and Snowball stemmer are common implementations. Stemming is fast but crude -- it sometimes produces non-words.
Lemmatization reduces words to their dictionary form (lemma) using linguistic knowledge:
"running" -> "run"
"better" -> "good"
"elections" -> "election"
Lemmatization is more accurate but slower. For prediction market text analysis, lemmatization is generally preferred when using classical models, because prediction market text often contains domain-specific terms where crude stemming can destroy meaning.
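A quick comparison is sketched below (assuming the NLTK wordnet data has been downloaded). Note that the WordNet lemmatizer needs a part-of-speech hint to map verbs and adjectives correctly.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("presidential"))               # crude: 'presidenti'
print(lemmatizer.lemmatize("elections"))          # 'election'
print(lemmatizer.lemmatize("running", pos="v"))   # 'run' (verb hint required)
print(lemmatizer.lemmatize("better", pos="a"))    # 'good' (adjective hint required)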
24.2.5 Text Cleaning for News and Social Media
Real-world text from news articles and social media requires additional cleaning:
- HTML tag removal: Web-scraped text often contains HTML artifacts.
- URL removal: Social media posts frequently contain URLs that are not useful for sentiment analysis.
- Mention removal: Twitter/X @mentions may or may not be relevant.
- Hashtag processing: #Election2024 should probably become "Election 2024."
- Emoji handling: Emojis carry sentiment information. They can be converted to text descriptions ("thumbs up," "angry face") or removed.
- Special character handling: Prediction market text often contains percentages, dollar signs, and other special characters that need standardized handling.
- Encoding normalization: Convert Unicode characters to ASCII where appropriate ("naive" for "naïve").
24.2.6 Python Text Preprocessing Pipeline
import re
import string
from typing import List, Optional
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
# Download required NLTK data (run once)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
class TextPreprocessor:
"""
A configurable text preprocessing pipeline for prediction market
news and social media text.
"""
def __init__(
self,
lowercase: bool = True,
remove_stopwords: bool = True,
lemmatize: bool = True,
remove_urls: bool = True,
remove_mentions: bool = True,
remove_html: bool = True,
min_word_length: int = 2,
custom_stopwords: Optional[List[str]] = None,
preserve_negation: bool = True,
):
self.lowercase = lowercase
self.remove_stopwords = remove_stopwords
self.lemmatize = lemmatize
self.remove_urls = remove_urls
self.remove_mentions = remove_mentions
self.remove_html = remove_html
self.min_word_length = min_word_length
self.preserve_negation = preserve_negation
self.lemmatizer = WordNetLemmatizer()
self.stop_words = set(stopwords.words('english'))
# Preserve negation words even if removing stopwords
self.negation_words = {
'not', 'no', 'never', 'neither', 'nor',
'nobody', 'nothing', 'nowhere', 'hardly',
'barely', 'scarcely', "n't", "nt"
}
if preserve_negation:
self.stop_words -= self.negation_words
if custom_stopwords:
self.stop_words.update(custom_stopwords)
def clean_html(self, text: str) -> str:
"""Remove HTML tags."""
return re.sub(r'<[^>]+>', '', text)
def clean_urls(self, text: str) -> str:
"""Remove URLs."""
return re.sub(
r'https?://\S+|www\.\S+', '', text
)
def clean_mentions(self, text: str) -> str:
"""Remove @mentions."""
return re.sub(r'@\w+', '', text)
def clean_hashtags(self, text: str) -> str:
"""Convert hashtags to words: #Election2024 -> Election 2024."""
def hashtag_to_words(match):
tag = match.group(1)
# Insert space before capital letters
words = re.sub(r'([A-Z])', r' \1', tag)
# Insert space before numbers
words = re.sub(r'(\d+)', r' \1', words)
return words.strip()
return re.sub(r'#(\w+)', hashtag_to_words, text)
def normalize_whitespace(self, text: str) -> str:
"""Collapse multiple whitespace into single spaces."""
return re.sub(r'\s+', ' ', text).strip()
def preprocess(self, text: str) -> str:
"""
Apply the full preprocessing pipeline.
Parameters
----------
text : str
Raw input text.
Returns
-------
str
Cleaned and preprocessed text.
"""
if self.remove_html:
text = self.clean_html(text)
if self.remove_urls:
text = self.clean_urls(text)
if self.remove_mentions:
text = self.clean_mentions(text)
text = self.clean_hashtags(text)
if self.lowercase:
text = text.lower()
# Tokenize
tokens = word_tokenize(text)
# Remove stopwords
if self.remove_stopwords:
tokens = [t for t in tokens if t not in self.stop_words]
# Lemmatize
if self.lemmatize:
tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
# Remove short tokens and pure punctuation
tokens = [
t for t in tokens
if len(t) >= self.min_word_length
and not all(c in string.punctuation for c in t)
]
return ' '.join(tokens)
def preprocess_batch(self, texts: List[str]) -> List[str]:
"""Preprocess a list of texts."""
return [self.preprocess(t) for t in texts]
# Example usage
if __name__ == "__main__":
preprocessor = TextPreprocessor()
sample_texts = [
"BREAKING: The Fed just raised interest rates by 0.25%! "
"https://t.co/abc123 #FedDecision",
"@SenatorSmith says the new bill will NOT pass the Senate. "
"Very pessimistic outlook. #Politics2024",
"<p>Markets are <b>crashing</b> after the announcement.</p>",
]
for raw in sample_texts:
cleaned = preprocessor.preprocess(raw)
print(f"RAW: {raw[:80]}...")
print(f"CLEANED: {cleaned}")
print()
This preprocessing pipeline is designed with prediction market text in mind. The preservation of negation words is particularly important: "The bill will NOT pass" and "The bill will pass" have opposite implications for a "Will the bill pass?" market.
24.3 Classical NLP: Bag of Words and TF-IDF
Before diving into deep learning, it is essential to understand classical NLP representations. These methods are fast, interpretable, and surprisingly effective for many prediction market applications.
24.3.1 Bag of Words Representation
The bag of words (BoW) model represents each document as a vector of word counts, ignoring word order entirely. Given a vocabulary of $V$ unique words across all documents, each document $d$ is represented as a vector $\mathbf{x}_d \in \mathbb{R}^V$ where $x_{d,i}$ is the count of word $i$ in document $d$.
For example, consider two sentences:
- $d_1$: "The market is bullish on the election."
- $d_2$: "The election market shows bearish sentiment."
The vocabulary is: {the, market, is, bullish, on, election, shows, bearish, sentiment}
$$ \mathbf{x}_{d_1} = [2, 1, 1, 1, 1, 1, 0, 0, 0] $$ $$ \mathbf{x}_{d_2} = [1, 1, 0, 0, 0, 1, 1, 1, 1] $$
The obvious limitation is that "The dog bit the man" and "The man bit the dog" have identical BoW representations. For sentiment analysis, this is often acceptable because the presence of sentiment-bearing words matters more than their exact arrangement.
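A minimal sketch of the same idea with scikit-learn's CountVectorizer follows. Note that it lowercases, drops single-character tokens by default, and orders the vocabulary alphabetically, so the column order differs from the hand-worked example above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The market is bullish on the election.",
    "The election market shows bearish sentiment.",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix
print(vectorizer.get_feature_names_out())    # learned vocabulary (alphabetical)
print(X.toarray())                           # word counts per document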
24.3.2 TF-IDF Weighting
Raw word counts overweight common words. Term Frequency-Inverse Document Frequency (TF-IDF) addresses this by weighting each word by how informative it is across the document collection.
For a term $t$ in document $d$ within a corpus of $N$ documents:
$$ \text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d} $$
$$ \text{IDF}(t) = \log\left(\frac{N}{\text{number of documents containing } t}\right) $$
$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
Words that appear in every document (like "the") get a low IDF and thus a low TF-IDF score. Words that appear frequently in a specific document but rarely in others get a high TF-IDF score. For prediction market text, this means that generic words are downweighted while topically distinctive words (candidate names, policy terms, specific events) are upweighted.
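As a quick worked example of the formulas (a sketch; note that scikit-learn's TfidfVectorizer uses a smoothed IDF variant and L2-normalizes rows, so its numbers differ slightly from this textbook version):
import math

# Hypothetical corpus of N = 4 documents. The word "scandal" appears in
# exactly 1 of them, 3 times out of that document's 100 words.
N = 4
tf = 3 / 100                 # term frequency within the document
idf = math.log(N / 1)        # inverse document frequency
print(tf * idf)              # TF-IDF weight, roughly 0.0416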
24.3.3 Document-Term Matrices
When you apply TF-IDF to a collection of $N$ documents with vocabulary size $V$, you get a document-term matrix $\mathbf{X} \in \mathbb{R}^{N \times V}$. This matrix is typically very sparse (most documents contain only a tiny fraction of the full vocabulary) and is stored in sparse matrix format for efficiency.
For prediction market applications, a typical document-term matrix might have:
- $N = 10{,}000$ news articles about a particular topic
- $V = 50{,}000$ unique terms (after preprocessing)
- Sparsity > 99% (each article uses perhaps 200 unique terms)
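The sketch below builds a small document-term matrix and reports its sparsity; with a real corpus of thousands of articles the sparsity would be far higher than in this toy example.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Candidate leads in swing state polls",
    "Fed signals possible rate cuts amid slowing inflation",
    "Scandal rocks campaign ahead of the final debate",
]
X = TfidfVectorizer().fit_transform(docs)     # scipy sparse CSR matrix
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
print(f"{X.shape[0]} docs x {X.shape[1]} terms, sparsity = {sparsity:.1%}")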
24.3.4 Text Classification with TF-IDF and Logistic Regression
Combining TF-IDF features with logistic regression gives a strong baseline for text classification. For prediction markets, a common task is classifying news articles as positive, negative, or neutral for a particular market question.
The pipeline is:
- Collect labeled training data (news articles labeled with their impact on a market).
- Preprocess text using the pipeline from Section 24.2.
- Compute TF-IDF features.
- Train a logistic regression classifier.
- Apply to new articles to predict their market impact.
24.3.5 Python Implementation
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
# Simulated prediction market news dataset
# Labels: 1 = positive for market (price should go up)
# 0 = negative for market (price should go down)
texts = [
"Candidate leads in latest swing state polls by wide margin",
"New endorsement from popular governor boosts campaign",
"Campaign raises record-breaking funds in Q3",
"Debate performance widely praised by undecided voters",
"Candidate stumbles in interview, makes gaffe on key policy",
"Major donor withdraws support citing lack of confidence",
"Scandal allegations surface involving campaign staff",
"Poll numbers declining steadily over the past month",
"New economic data supports candidate policy position",
"Opposition candidate gains momentum in key demographics",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
# Build a TF-IDF + Logistic Regression pipeline
pipeline = Pipeline([
('tfidf', TfidfVectorizer(
max_features=10000,
ngram_range=(1, 2), # Unigrams and bigrams
min_df=1,
max_df=0.95,
sublinear_tf=True, # Apply log normalization to TF
)),
('clf', LogisticRegression(
C=1.0,
max_iter=1000,
class_weight='balanced',
)),
])
# Train the pipeline
pipeline.fit(texts, labels)
# Examine the most informative features
feature_names = pipeline.named_steps['tfidf'].get_feature_names_out()
coefficients = pipeline.named_steps['clf'].coef_[0]
# Top positive and negative features
top_positive_idx = np.argsort(coefficients)[-10:]
top_negative_idx = np.argsort(coefficients)[:10]
print("Most POSITIVE features:")
for idx in reversed(top_positive_idx):
print(f" {feature_names[idx]:30s} {coefficients[idx]:.4f}")
print("\nMost NEGATIVE features:")
for idx in top_negative_idx:
print(f" {feature_names[idx]:30s} {coefficients[idx]:.4f}")
# Predict on new text
new_texts = [
"Candidate surges in latest national polls",
"Campaign faces new allegations of financial misconduct",
]
predictions = pipeline.predict(new_texts)
probabilities = pipeline.predict_proba(new_texts)
for text, pred, prob in zip(new_texts, predictions, probabilities):
sentiment = "POSITIVE" if pred == 1 else "NEGATIVE"
print(f"\nText: {text}")
print(f"Prediction: {sentiment} (confidence: {max(prob):.2%})")
The n-gram parameter ngram_range=(1, 2) is particularly important for prediction market text. Bigrams like "not pass," "wide margin," and "record breaking" carry more information than individual words. The sublinear_tf=True parameter applies logarithmic scaling to term frequencies, which dampens the influence of terms repeated many times within a single document.
24.4 Sentiment Analysis Fundamentals
Sentiment analysis is the task of determining the emotional tone or opinion expressed in text. For prediction markets, we care not just about general positive/negative sentiment, but about sentiment directed toward specific outcomes.
24.4.1 Lexicon-Based Approaches: VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool specifically designed for social media text. It uses a curated lexicon of words with associated sentiment scores and a set of rules that handle:
- Punctuation: Exclamation marks amplify sentiment ("Great!!!" > "Great").
- Capitalization: ALL CAPS amplifies sentiment ("AMAZING" > "amazing").
- Degree modifiers: "extremely good" > "good" > "somewhat good."
- Negation: "not good" flips the sentiment of "good."
- Conjunctions: "but" signals a shift in sentiment ("The food was great, but the service was terrible").
VADER outputs four scores:
- pos: Proportion of the text that is positive.
- neg: Proportion of the text that is negative.
- neu: Proportion of the text that is neutral.
- compound: Normalized composite score from -1 (most negative) to +1 (most positive).
The compound score is computed as:
$$ \text{compound} = \frac{x}{\sqrt{x^2 + \alpha}}, \quad x = \text{sum of the valence scores of the individual words} $$
where $\alpha$ is a normalization constant (default 15).
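A quick look at raw VADER output (a sketch assuming the vaderSentiment package is installed; the full analyzer class in Section 24.4.5 builds on this):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()
print(vader.polarity_scores("The debate performance was GREAT!!!"))
print(vader.polarity_scores("The debate performance was not great."))
# Each call returns a dict with 'neg', 'neu', 'pos', and 'compound' scores.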
24.4.2 Lexicon-Based Approaches: TextBlob
TextBlob provides a simpler sentiment analysis interface that returns:
- Polarity: Range from -1 (negative) to +1 (positive).
- Subjectivity: Range from 0 (objective/factual) to 1 (subjective/opinion).
TextBlob's sentiment is based on the Pattern library's lexicon and is computed as a weighted average of the sentiment scores of individual words, adjusted for modifiers and negation.
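A minimal usage sketch (assuming the textblob package and its corpora are installed):
from textblob import TextBlob

blob = TextBlob("Analysts doubt the bill will pass, calling the plan deeply flawed.")
print(blob.sentiment.polarity)       # in [-1, 1]
print(blob.sentiment.subjectivity)   # in [0, 1]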
24.4.3 Rule-Based Approaches for Domain-Specific Text
Generic sentiment lexicons often perform poorly on domain-specific text. In financial and political text, many words have domain-specific sentiment:
| Word | General Sentiment | Financial Sentiment | Political Sentiment |
|---|---|---|---|
| "volatile" | Neutral | Negative | Negative |
| "aggressive" | Negative | Positive (policy) | Negative |
| "conservative" | Neutral | Positive (estimate) | Varies by context |
| "liberal" | Positive | Negative (spending) | Varies by context |
| "unprecedented" | Neutral | Negative (risk) | Context-dependent |
The Loughran-McDonald financial sentiment lexicon addresses this for financial text by providing word lists specifically labeled for financial contexts: positive, negative, uncertainty, litigious, strong modal, and weak modal.
For prediction market text, no single existing lexicon is ideal. A practical approach is to start with VADER (which handles social media well) and augment it with domain-specific terms.
24.4.4 Sentiment Scoring for Financial and Political Text
When applying sentiment analysis to prediction market text, several considerations arise:
Directionality matters. "The candidate is doing terribly in polls" is negative for the candidate but may be positive for the opposing candidate's market. Sentiment must be interpreted relative to the specific market question.
Aggregation over time. A single article's sentiment is noisy. Aggregating sentiment over time windows (1 hour, 1 day, 1 week) produces more reliable signals. Common aggregation methods include:
$$ \bar{S}_t = \frac{1}{|D_t|} \sum_{d \in D_t} s_d \quad \text{(simple average)} $$
$$ \bar{S}_t^{\text{exp}} = \alpha \cdot s_t + (1 - \alpha) \cdot \bar{S}_{t-1}^{\text{exp}} \quad \text{(exponential moving average)} $$
$$ \bar{S}_t^{\text{vol}} = \frac{\sum_{d \in D_t} v_d \cdot s_d}{\sum_{d \in D_t} v_d} \quad \text{(volume-weighted, } v_d \text{ = source importance)} $$
Source weighting. Not all text sources are equally informative. A Reuters article likely contains more reliable information than a random tweet. Source credibility can be incorporated through weighting.
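The sketch below implements the three aggregation schemes with pandas; the column names ('timestamp', 'sentiment', 'source_weight') are illustrative assumptions about how per-article scores might be stored.
import pandas as pd

def aggregate_daily_sentiment(df: pd.DataFrame, ema_alpha: float = 0.3) -> pd.DataFrame:
    """Daily simple, source-weighted, and exponentially smoothed sentiment."""
    df = df.set_index('timestamp').sort_index()
    daily = pd.DataFrame()
    # Simple average per day
    daily['mean'] = df['sentiment'].resample('1D').mean()
    # Source-weighted average per day
    weighted_sum = (df['sentiment'] * df['source_weight']).resample('1D').sum()
    weight_sum = df['source_weight'].resample('1D').sum()
    daily['weighted'] = weighted_sum / weight_sum
    # Exponential moving average of the daily simple averages
    daily['ema'] = daily['mean'].ewm(alpha=ema_alpha).mean()
    return daily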
24.4.5 Python Sentiment Analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from dataclasses import dataclass
from typing import Dict, List
import numpy as np
@dataclass
class SentimentResult:
"""Container for multi-method sentiment scores."""
text: str
vader_compound: float
vader_pos: float
vader_neg: float
vader_neu: float
textblob_polarity: float
textblob_subjectivity: float
ensemble_score: float
class PredictionMarketSentimentAnalyzer:
"""
Sentiment analyzer tailored for prediction market text.
Combines VADER, TextBlob, and custom domain adjustments
to produce robust sentiment scores.
"""
def __init__(self, domain_lexicon: Dict[str, float] = None):
self.vader = SentimentIntensityAnalyzer()
# Add domain-specific terms to VADER lexicon
if domain_lexicon:
self.vader.lexicon.update(domain_lexicon)
# Default prediction market lexicon adjustments
default_updates = {
'landslide': 2.0,
'frontrunner': 1.5,
'momentum': 1.0,
'surge': 1.5,
'plummet': -2.0,
'scandal': -2.5,
'indictment': -2.5,
'gaffe': -1.5,
'endorse': 1.5,
'endorsement': 1.5,
'collapse': -2.5,
'unprecedented': -0.5,
'bipartisan': 1.0,
'deadlock': -1.0,
'gridlock': -1.0,
'filibuster': -0.5,
'veto': -1.5,
'override': 1.0,
'unanimous': 2.0,
'contested': -1.0,
'recount': -1.5,
}
self.vader.lexicon.update(default_updates)
def analyze(self, text: str) -> SentimentResult:
"""
Analyze sentiment of a single text using multiple methods.
Parameters
----------
text : str
The text to analyze.
Returns
-------
SentimentResult
Multi-method sentiment scores.
"""
# VADER analysis
vader_scores = self.vader.polarity_scores(text)
# TextBlob analysis
blob = TextBlob(text)
tb_polarity = blob.sentiment.polarity
tb_subjectivity = blob.sentiment.subjectivity
# Ensemble: weighted average of VADER compound and TextBlob polarity
# VADER gets more weight because it handles social media better
ensemble = 0.6 * vader_scores['compound'] + 0.4 * tb_polarity
return SentimentResult(
text=text,
vader_compound=vader_scores['compound'],
vader_pos=vader_scores['pos'],
vader_neg=vader_scores['neg'],
vader_neu=vader_scores['neu'],
textblob_polarity=tb_polarity,
textblob_subjectivity=tb_subjectivity,
ensemble_score=ensemble,
)
def analyze_batch(self, texts: List[str]) -> List[SentimentResult]:
"""Analyze a batch of texts."""
return [self.analyze(t) for t in texts]
def aggregate_sentiment(
self,
results: List[SentimentResult],
method: str = 'mean',
weights: List[float] = None,
) -> float:
"""
Aggregate sentiment scores from multiple texts.
Parameters
----------
results : List[SentimentResult]
Sentiment results to aggregate.
method : str
Aggregation method: 'mean', 'median', or 'weighted'.
weights : List[float], optional
Weights for weighted aggregation.
Returns
-------
float
Aggregated sentiment score.
"""
scores = [r.ensemble_score for r in results]
if method == 'mean':
return np.mean(scores)
elif method == 'median':
return np.median(scores)
elif method == 'weighted' and weights is not None:
return np.average(scores, weights=weights)
else:
raise ValueError(f"Unknown method: {method}")
# Example usage
if __name__ == "__main__":
analyzer = PredictionMarketSentimentAnalyzer()
articles = [
"Candidate surges in polls after strong debate performance, "
"gaining endorsement from key swing state governor.",
"Campaign rocked by scandal as financial irregularities surface. "
"Major donors threatening to pull support.",
"New bipartisan agreement reached on infrastructure bill. "
"Both parties claim victory in negotiations.",
"BREAKING: Candidate faces indictment on federal charges. "
"Legal team says allegations are baseless.",
]
results = analyzer.analyze_batch(articles)
for result in results:
print(f"Text: {result.text[:60]}...")
print(f" VADER compound: {result.vader_compound:+.4f}")
print(f" TextBlob polarity: {result.textblob_polarity:+.4f}")
print(f" TextBlob subjectivity: {result.textblob_subjectivity:.4f}")
print(f" Ensemble score: {result.ensemble_score:+.4f}")
print()
# Aggregate
agg = analyzer.aggregate_sentiment(results)
print(f"Aggregated sentiment (mean): {agg:+.4f}")
24.5 Transformer Models: BERT and Beyond
The classical methods of the previous sections are useful baselines, but modern NLP has been revolutionized by transformer architectures. Understanding transformers is essential for building state-of-the-art NLP features for prediction market trading.
24.5.1 Attention Mechanism Intuition
The key innovation behind transformers is the attention mechanism. Traditional sequential models (RNNs, LSTMs) process text left-to-right, maintaining a hidden state that serves as a compressed memory of everything seen so far. This creates a bottleneck: information from early in a long document must survive through many processing steps to influence the interpretation of later text.
Attention allows the model to directly focus on any part of the input when processing any other part. When processing the word "rates" in "The Fed raised interest rates by 25 basis points," an attention mechanism can directly attend to "Fed," "raised," and "interest" to determine the meaning of "rates" in this context.
Mathematically, attention computes a weighted sum of value vectors, where the weights are determined by the compatibility between query and key vectors:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$
where:
- $Q$ (queries), $K$ (keys), and $V$ (values) are matrices derived from the input embeddings through learned linear projections.
- $d_k$ is the dimensionality of the key vectors (the $\sqrt{d_k}$ denominator prevents the dot products from becoming too large).
- The softmax normalizes the compatibility scores into a probability distribution.
Multi-head attention runs multiple attention mechanisms in parallel, each with different learned projections, allowing the model to attend to different types of relationships simultaneously:
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O $$
where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
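The following is a minimal NumPy sketch of single-head scaled dot-product attention, purely to make the formula concrete; production transformers implement this with batched tensors and learned projections.
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                          # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))           # 4 tokens, d_k = 8
print(scaled_dot_product_attention(Q, K, V).shape)              # (4, 8)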
24.5.2 BERT Architecture Overview
BERT (Bidirectional Encoder Representations from Transformers) was released by Google in 2018 and fundamentally changed NLP. Its key innovations were:
- Bidirectional context: Unlike GPT (which reads left-to-right), BERT processes text in both directions simultaneously, giving it a richer understanding of context.
- Pre-training on massive text: BERT is pre-trained on two tasks:
  - Masked Language Modeling (MLM): Randomly mask 15% of input tokens and train the model to predict them. This forces the model to learn contextual representations.
  - Next Sentence Prediction (NSP): Given two sentences, predict whether the second follows the first. This teaches the model about sentence relationships.
- Fine-tuning for downstream tasks: After pre-training, BERT can be fine-tuned on specific tasks with relatively small labeled datasets.
The BERT-base model has:
- 12 transformer layers (blocks)
- 768 hidden dimensions
- 12 attention heads
- 110 million parameters
BERT-large scales this up: 24 layers, 1024 hidden dimensions, 16 attention heads, and 340 million parameters.
24.5.3 Pre-Training and Fine-Tuning
The pre-training/fine-tuning paradigm is crucial for prediction market applications:
Pre-training teaches the model general language understanding. BERT was pre-trained on BookCorpus (800M words) and English Wikipedia (2,500M words). This gives it knowledge of grammar, facts, and reasoning patterns.
Fine-tuning adapts the pre-trained model to a specific task. For a prediction market sentiment classifier, fine-tuning involves:
- Adding a classification head (a simple linear layer) on top of BERT.
- Training the entire model (BERT + classification head) on labeled prediction market text.
- Using a small learning rate (2e-5 to 5e-5) to preserve the pre-trained knowledge.
Fine-tuning is remarkably data-efficient. Where training a text classifier from scratch might require hundreds of thousands of labeled examples, fine-tuning BERT can achieve strong performance with just a few thousand.
24.5.4 Using Pre-Trained Models from HuggingFace
The HuggingFace transformers library provides easy access to thousands of pre-trained models. For prediction market applications, the most relevant pre-trained models include:
| Model | Description | Best For |
|---|---|---|
| `bert-base-uncased` | Base BERT, 110M params | General text classification |
| `finbert` | BERT fine-tuned on financial text | Financial market text |
| `cardiffnlp/twitter-roberta-base-sentiment` | RoBERTa fine-tuned on tweets | Social media sentiment |
| `distilbert-base-uncased` | Distilled BERT, 66M params | When speed matters |
| `roberta-base` | Robustly optimized BERT | General NLP, often better than BERT |
24.5.5 Python BERT Sentiment Classification
import torch
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
pipeline,
)
import numpy as np
from typing import List, Dict
class TransformerSentimentClassifier:
"""
Sentiment classifier using pre-trained transformer models.
Wraps HuggingFace transformers for easy use in prediction
market applications.
"""
def __init__(
self,
model_name: str = "cardiffnlp/twitter-roberta-base-sentiment-latest",
device: str = None,
):
"""
Parameters
----------
model_name : str
HuggingFace model identifier.
device : str
'cuda', 'cpu', or None for auto-detection.
"""
if device is None:
device = "cuda" if torch.cuda.is_available() else "cpu"
self.device = device
self.model_name = model_name
# Load tokenizer and model
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSequenceClassification.from_pretrained(
model_name
).to(device)
self.model.eval()
# Also create a pipeline for convenience
self.pipe = pipeline(
"sentiment-analysis",
model=self.model,
tokenizer=self.tokenizer,
device=0 if device == "cuda" else -1,
top_k=None, # Return all scores
)
def predict(self, text: str) -> Dict[str, float]:
"""
Predict sentiment for a single text.
Returns
-------
Dict[str, float]
Mapping from label to score.
"""
results = self.pipe(text, truncation=True, max_length=512)
# results is a list of lists of dicts
return {r['label']: r['score'] for r in results[0]}
def predict_batch(
self, texts: List[str], batch_size: int = 32
) -> List[Dict[str, float]]:
"""
Predict sentiment for a batch of texts.
Parameters
----------
texts : List[str]
Texts to analyze.
batch_size : int
Batch size for inference.
Returns
-------
List[Dict[str, float]]
Sentiment scores for each text.
"""
all_results = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
results = self.pipe(
batch, truncation=True, max_length=512, batch_size=batch_size
)
for result in results:
all_results.append(
{r['label']: r['score'] for r in result}
)
return all_results
def get_sentiment_score(self, text: str) -> float:
"""
Get a single normalized sentiment score in [-1, 1].
Maps model outputs to a continuous sentiment scale.
"""
scores = self.predict(text)
# Handle different label formats from different models
if 'positive' in scores and 'negative' in scores:
return scores['positive'] - scores['negative']
elif 'POSITIVE' in scores and 'NEGATIVE' in scores:
return scores['POSITIVE'] - scores['NEGATIVE']
elif 'LABEL_2' in scores and 'LABEL_0' in scores:
# Common format: 0=negative, 1=neutral, 2=positive
return scores.get('LABEL_2', 0) - scores.get('LABEL_0', 0)
else:
# Fallback: return the highest-scoring positive label
return max(scores.values()) if scores else 0.0
# Example usage
if __name__ == "__main__":
# Use a pre-trained sentiment model
classifier = TransformerSentimentClassifier(
model_name="cardiffnlp/twitter-roberta-base-sentiment-latest"
)
texts = [
"The candidate delivered an incredible speech that energized voters",
"Polls show a devastating collapse in support after the scandal",
"Economic indicators remain steady with moderate growth expected",
"BREAKING: Major endorsement from former president boosts campaign",
]
for text in texts:
scores = classifier.predict(text)
normalized = classifier.get_sentiment_score(text)
print(f"Text: {text[:65]}...")
print(f" Scores: {scores}")
print(f" Normalized: {normalized:+.4f}")
print()
24.6 Fine-Tuning for Prediction Market Text
Pre-trained models provide a strong starting point, but fine-tuning on domain-specific data unlocks significantly better performance for prediction market applications.
24.6.1 Creating Training Data from Market-Relevant Text
The biggest challenge in fine-tuning for prediction markets is obtaining labeled training data. Here are practical approaches:
Approach 1: Market price changes as labels. Collect news articles published during specific time windows and label them based on the subsequent market price movement:
- Article published at time $t$.
- Market price changes from $p_t$ to $p_{t+\Delta}$.
- If $p_{t+\Delta} - p_t > \epsilon$: label = "positive" (price moved up).
- If $p_{t+\Delta} - p_t < -\epsilon$: label = "negative" (price moved down).
- Otherwise: label = "neutral".
This approach is noisy (many articles may not be causally related to the price change) but scalable.
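A sketch of this labeling scheme is shown below; the horizon and epsilon threshold are illustrative assumptions, and prices is assumed to be a pandas Series of market probabilities indexed by sorted timestamps.
import pandas as pd

def label_by_price_move(
    article_times: pd.Series,
    prices: pd.Series,
    horizon: pd.Timedelta = pd.Timedelta(hours=4),
    epsilon: float = 0.02,
) -> list:
    """Label each article 2 (positive), 0 (negative), or 1 (neutral)."""
    labels = []
    for t in article_times:
        p_now = prices.asof(t)                 # last price at or before t
        p_later = prices.asof(t + horizon)     # last price at or before t + horizon
        change = p_later - p_now
        if change > epsilon:
            labels.append(2)
        elif change < -epsilon:
            labels.append(0)
        else:
            labels.append(1)
    return labels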
Approach 2: Manual annotation. Recruit annotators (or use your own judgment) to label articles as positive, negative, or neutral for specific market questions. This produces cleaner labels but is labor-intensive. A few hundred high-quality labels can be sufficient for fine-tuning.
Approach 3: Semi-supervised labeling. Use a pre-trained sentiment model to label a large corpus, manually correct the most uncertain predictions, and use the result for fine-tuning.
Approach 4: LLM-assisted labeling. Use a large language model (GPT-4, Claude) to label articles. Prompt the LLM with the market question and the article, and ask it to classify the article's likely impact on the market. This can produce thousands of labels quickly, though they should be spot-checked for quality.
24.6.2 Fine-Tuning BERT for Political/Financial Sentiment
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
class MarketTextDataset(Dataset):
"""PyTorch dataset for market-relevant text classification."""
def __init__(self, texts, labels, tokenizer, max_length=256):
self.texts = texts
self.labels = labels
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
truncation=True,
padding='max_length',
max_length=self.max_length,
return_tensors='pt',
)
return {
'input_ids': encoding['input_ids'].squeeze(),
'attention_mask': encoding['attention_mask'].squeeze(),
'labels': torch.tensor(self.labels[idx], dtype=torch.long),
}
def fine_tune_bert(
texts: list,
labels: list,
model_name: str = "distilbert-base-uncased",
num_labels: int = 3, # negative, neutral, positive
epochs: int = 3,
batch_size: int = 16,
learning_rate: float = 2e-5,
warmup_ratio: float = 0.1,
max_length: int = 256,
test_size: float = 0.2,
):
"""
Fine-tune a transformer model on market-relevant text.
Parameters
----------
texts : list
List of text strings.
labels : list
List of integer labels (0=negative, 1=neutral, 2=positive).
model_name : str
Pre-trained model to fine-tune.
num_labels : int
Number of classification labels.
epochs : int
Number of training epochs.
batch_size : int
Training batch size.
learning_rate : float
Peak learning rate.
warmup_ratio : float
Fraction of training steps for warmup.
max_length : int
Maximum token sequence length.
test_size : float
Fraction of data to hold out for evaluation.
Returns
-------
model : transformers model
The fine-tuned model.
tokenizer : transformers tokenizer
The associated tokenizer.
metrics : dict
Evaluation metrics on the test set.
"""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=num_labels
).to(device)
# Train/test split
train_texts, test_texts, train_labels, test_labels = train_test_split(
texts, labels, test_size=test_size, random_state=42, stratify=labels
)
# Create datasets
train_dataset = MarketTextDataset(
train_texts, train_labels, tokenizer, max_length
)
test_dataset = MarketTextDataset(
test_texts, test_labels, tokenizer, max_length
)
train_loader = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True
)
test_loader = DataLoader(
test_dataset, batch_size=batch_size, shuffle=False
)
# Optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
total_steps = len(train_loader) * epochs
warmup_steps = int(total_steps * warmup_ratio)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=warmup_steps,
num_training_steps=total_steps
)
# Training loop
model.train()
for epoch in range(epochs):
total_loss = 0
correct = 0
total = 0
for batch in train_loader:
optimizer.zero_grad()
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels_batch = batch['labels'].to(device)
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels_batch,
)
loss = outputs.loss
logits = outputs.logits
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
total_loss += loss.item()
predictions = torch.argmax(logits, dim=-1)
correct += (predictions == labels_batch).sum().item()
total += labels_batch.size(0)
avg_loss = total_loss / len(train_loader)
accuracy = correct / total
print(
f"Epoch {epoch + 1}/{epochs} | "
f"Loss: {avg_loss:.4f} | "
f"Accuracy: {accuracy:.4f}"
)
# Evaluation
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
for batch in test_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels_batch = batch['labels'].to(device)
outputs = model(
input_ids=input_ids, attention_mask=attention_mask
)
predictions = torch.argmax(outputs.logits, dim=-1)
all_preds.extend(predictions.cpu().numpy())
all_labels.extend(labels_batch.cpu().numpy())
label_names = ['negative', 'neutral', 'positive']
report = classification_report(
all_labels, all_preds, target_names=label_names, output_dict=True
)
print("\nEvaluation Report:")
print(classification_report(all_labels, all_preds, target_names=label_names))
return model, tokenizer, report
# Example usage with synthetic data
if __name__ == "__main__":
# In practice, you would load real market-relevant text data
sample_texts = [
"Candidate wins decisive victory in key primary state",
"New policy proposal receives bipartisan support in Congress",
"Market analysts predict strong growth in the upcoming quarter",
"Campaign faces legal challenges that could derail nomination",
"Polling data shows a significant decline in voter approval",
"Economic downturn fears mount as unemployment claims rise",
"Trade negotiations reach a standstill with no resolution",
"Central bank signals potential rate cuts in response to data",
"Senate confirms new cabinet appointment with broad support",
"International tensions escalate following border incident",
] * 20  # Repeat to create a small synthetic training set
# Labels: 0=negative, 1=neutral, 2=positive
sample_labels = [2, 2, 2, 0, 0, 0, 0, 1, 2, 0] * 20
model, tokenizer, metrics = fine_tune_bert(
sample_texts,
sample_labels,
model_name="distilbert-base-uncased",
epochs=3,
batch_size=8,
)
24.6.3 Transfer Learning Strategies
When fine-tuning for prediction markets, there are several transfer learning strategies to consider:
Strategy 1: Direct fine-tuning. Take a general pre-trained model (e.g., bert-base-uncased) and fine-tune it directly on your prediction market data. This works well when you have at least a few thousand labeled examples.
Strategy 2: Domain-adaptive pre-training. First, continue pre-training the model (using MLM) on a large corpus of unlabeled prediction market text (news articles, social media posts about markets). Then fine-tune on your labeled data. This helps the model adapt to domain-specific vocabulary and writing styles.
Strategy 3: Multi-task fine-tuning. Fine-tune on multiple related tasks simultaneously -- for example, sentiment classification and market impact prediction. The shared representation learning can improve performance on both tasks.
Strategy 4: Few-shot with a large model. If you have very few labeled examples (fewer than 100), use a larger pre-trained model (e.g., roberta-large or a model fine-tuned on NLI tasks like cross-encoder/nli-deberta-v3-large) and train only the classification head, freezing the transformer layers.
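A minimal sketch of Strategy 4: load a larger pre-trained model and freeze everything except the classification head. The `classifier` module name below is what HuggingFace uses for RoBERTa-style sequence classification heads; verify the name for other architectures.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=3
)
# Freeze the transformer body; train only the classification head
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")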
24.6.4 Handling Domain-Specific Language
Prediction market text has several domain-specific challenges:
- Hedging language: "The candidate is likely to win" vs. "The candidate could potentially win" express different confidence levels that matter for probability estimation.
- Conditional statements: "If the economy slows, the incumbent will lose" requires understanding conditional logic.
- Negated expectations: "The market surprised by NOT raising rates" contains a negation of an expectation, which is semantically complex.
- Quantitative claims: "Polls show a 5-point lead" contains a numerical value that a text model may not properly weigh.
- Sarcasm and irony: Social media sources often contain sarcasm that can fool sentiment models ("Great, another candidate flip-flopping on policy").
Addressing these challenges requires:
1. Including examples of each in your training data.
2. Data augmentation techniques that introduce these patterns.
3. Potentially adding numerical features extracted separately from the text.
24.7 News Impact Analysis
Beyond sentiment, we can measure the impact that specific news events have on prediction market prices. This goes beyond asking "is this article positive or negative?" to asking "how much did this article move the market?"
24.7.1 Measuring How News Affects Market Prices
The basic framework for news impact analysis involves:
- Identifying news events: Detect when a significant news article or cluster of articles is published.
- Measuring the price change: Calculate the market price change in a window around the news event.
- Attributing the change: Determine how much of the price change is attributable to the news (vs. other factors).
Formally, define the abnormal price change around a news event at time $t$ as:
$$ \Delta p_t^{\text{abnormal}} = \Delta p_t^{\text{actual}} - \Delta p_t^{\text{expected}} $$
where $\Delta p_t^{\text{expected}}$ is the price change we would have expected in the absence of the news event. For prediction markets (unlike stocks), the expected price change in the absence of news is typically zero (prices should be a martingale under risk-neutral pricing).
24.7.2 Event Study Methodology Adapted for Prediction Markets
Classical event study methodology from financial economics can be adapted for prediction markets:
- Define the event window: The period during which you expect the news to affect prices. For breaking news, this might be [0, +2 hours]. For a scheduled event like a debate, the window might be [-1 hour, +24 hours].
- Define the estimation window: A period before the event used to estimate "normal" price behavior. For prediction markets, this might be the 7 days before the event window.
- Calculate abnormal returns: The difference between actual price changes and expected changes during the event window.
- Test significance: Determine whether the abnormal price change is statistically significant.
For prediction markets, the "return" is typically the change in the probability:
$$ \text{Abnormal change} = p_{t + \tau} - p_t - \hat{\mu} \cdot \tau $$
where $\hat{\mu}$ is the estimated daily drift in the market price (often assumed to be zero for short windows).
24.7.3 News Surprise vs. Expected
Not all news is equally informative. Expected news (a poll showing the frontrunner still leading) should have minimal price impact. Surprising news (an unexpected endorsement, a scandal) should have large impact. Quantifying "surprise" is a key challenge.
Approaches to measuring news surprise:
Approach 1: Deviation from consensus. If polls consistently show Candidate A leading by 5 points, a new poll showing a 10-point lead is surprising. The surprise is the deviation from recent averages.
Approach 2: Novelty of content. Using NLP, measure how different a new article's content is from recent articles. Articles that introduce new topics or entities are more likely to be surprising. This can be measured using cosine distance between TF-IDF vectors:
$$ \text{novelty}(d) = 1 - \max_{d' \in D_{\text{recent}}} \cos(\mathbf{v}_d, \mathbf{v}_{d'}) $$
Approach 3: Market response magnitude. If the market moves significantly after a news event, the news was presumably surprising. This is a retrospective measure but useful for building labeled datasets.
24.7.4 Quantifying News Impact
To systematically quantify news impact, we compute several features for each news event:
$$ \text{Impact}_{\text{immediate}} = p_{t+\delta} - p_t \quad \text{(} \delta \text{ = 1-5 minutes)} $$
$$ \text{Impact}_{\text{short}} = p_{t+1\text{h}} - p_t $$
$$ \text{Impact}_{\text{sustained}} = p_{t+24\text{h}} - p_t $$
$$ \text{Reversal} = \frac{p_{t+24\text{h}} - p_{t+1\text{h}}}{p_{t+1\text{h}} - p_t} $$
The reversal metric is particularly interesting: if news causes an immediate price spike that is fully reversed within 24 hours, the initial reaction was an overreaction, and a trading strategy could exploit this pattern.
24.7.5 Python News Impact Analyzer
import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple, Optional
from datetime import datetime, timedelta
@dataclass
class NewsEvent:
"""Represents a news event with associated metadata."""
timestamp: datetime
headline: str
text: str
source: str
sentiment: float # Pre-computed sentiment score
@dataclass
class ImpactResult:
"""Results of a news impact analysis."""
event: NewsEvent
price_before: float
price_after_5min: float
price_after_1h: float
price_after_24h: float
impact_immediate: float
impact_short: float
impact_sustained: float
reversal_ratio: Optional[float]
is_significant: bool
class NewsImpactAnalyzer:
"""
Analyzes the impact of news events on prediction market prices.
Implements event study methodology adapted for prediction markets.
"""
def __init__(
self,
significance_threshold: float = 0.02,
min_volume_threshold: int = 5,
):
"""
Parameters
----------
significance_threshold : float
Minimum absolute price change to consider significant.
min_volume_threshold : int
Minimum number of trades in the event window
to consider the price change meaningful.
"""
self.significance_threshold = significance_threshold
self.min_volume_threshold = min_volume_threshold
def get_price_at_time(
self,
price_data: pd.DataFrame,
target_time: datetime,
tolerance: timedelta = timedelta(minutes=5),
) -> Optional[float]:
"""
Get the market price closest to the target time.
Parameters
----------
price_data : pd.DataFrame
DataFrame with 'timestamp' and 'price' columns.
target_time : datetime
The target time to look up.
tolerance : timedelta
Maximum time difference to accept.
Returns
-------
float or None
The price at the target time, or None if no data available.
"""
time_diffs = abs(price_data['timestamp'] - target_time)
min_idx = time_diffs.idxmin()
if time_diffs[min_idx] <= tolerance:
return price_data.loc[min_idx, 'price']
return None
def analyze_event(
self,
event: NewsEvent,
price_data: pd.DataFrame,
) -> Optional[ImpactResult]:
"""
Analyze the impact of a single news event.
Parameters
----------
event : NewsEvent
The news event to analyze.
price_data : pd.DataFrame
Market price data with 'timestamp' and 'price' columns.
Returns
-------
ImpactResult or None
Impact analysis results, or None if insufficient data.
"""
t = event.timestamp
price_before = self.get_price_at_time(price_data, t)
price_5min = self.get_price_at_time(
price_data, t + timedelta(minutes=5)
)
price_1h = self.get_price_at_time(
price_data, t + timedelta(hours=1)
)
price_24h = self.get_price_at_time(
price_data, t + timedelta(hours=24)
)
if any(p is None for p in [price_before, price_5min, price_1h, price_24h]):
return None
impact_imm = price_5min - price_before
impact_short = price_1h - price_before
impact_sust = price_24h - price_before
# Reversal ratio
if abs(impact_short) > 1e-6:
reversal = (price_24h - price_1h) / (price_1h - price_before)
else:
reversal = None
is_significant = abs(impact_short) >= self.significance_threshold
return ImpactResult(
event=event,
price_before=price_before,
price_after_5min=price_5min,
price_after_1h=price_1h,
price_after_24h=price_24h,
impact_immediate=impact_imm,
impact_short=impact_short,
impact_sustained=impact_sust,
reversal_ratio=reversal,
is_significant=is_significant,
)
def analyze_events(
self,
events: List[NewsEvent],
price_data: pd.DataFrame,
) -> pd.DataFrame:
"""
Analyze multiple news events and return a summary DataFrame.
Parameters
----------
events : List[NewsEvent]
List of news events to analyze.
price_data : pd.DataFrame
Market price data.
Returns
-------
pd.DataFrame
Summary of impact analysis for all events.
"""
results = []
for event in events:
result = self.analyze_event(event, price_data)
if result is not None:
results.append({
'timestamp': event.timestamp,
'headline': event.headline,
'source': event.source,
'sentiment': event.sentiment,
'price_before': result.price_before,
'impact_immediate': result.impact_immediate,
'impact_short': result.impact_short,
'impact_sustained': result.impact_sustained,
'reversal_ratio': result.reversal_ratio,
'is_significant': result.is_significant,
})
df = pd.DataFrame(results)
if not df.empty:
# Compute summary statistics
sig_events = df[df['is_significant']]
print(f"Total events analyzed: {len(df)}")
print(f"Significant events: {len(sig_events)}")
if len(sig_events) > 0:
print(f"Mean absolute impact (significant): "
f"{sig_events['impact_short'].abs().mean():.4f}")
print(f"Sentiment-impact correlation: "
f"{df['sentiment'].corr(df['impact_short']):.4f}")
return df
def compute_news_surprise(
self,
event_text: str,
recent_texts: List[str],
vectorizer=None,
) -> float:
"""
Compute the novelty/surprise of a news event relative to
recent news.
Parameters
----------
event_text : str
The text of the new event.
recent_texts : List[str]
Recent article texts for comparison.
vectorizer : sklearn vectorizer, optional
Pre-fitted TF-IDF vectorizer. If None, creates one.
Returns
-------
float
Novelty score in [0, 1], where 1 = very novel.
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
if not recent_texts:
return 1.0 # No baseline, so everything is novel
if vectorizer is None:
vectorizer = TfidfVectorizer(max_features=5000)
all_texts = recent_texts + [event_text]
tfidf_matrix = vectorizer.fit_transform(all_texts)
else:
all_texts = recent_texts + [event_text]
tfidf_matrix = vectorizer.transform(all_texts)
# Similarity between the new event and each recent text
event_vector = tfidf_matrix[-1:]
recent_vectors = tfidf_matrix[:-1]
similarities = cosine_similarity(event_vector, recent_vectors)[0]
max_similarity = similarities.max()
return 1.0 - max_similarity
# Example usage
if __name__ == "__main__":
# Create synthetic price data
np.random.seed(42)
base_time = datetime(2024, 11, 1, 8, 0, 0)
timestamps = [
base_time + timedelta(minutes=i) for i in range(0, 1440 * 3, 5)
]
prices = [0.55] # Starting probability
for _ in range(len(timestamps) - 1):
prices.append(
np.clip(prices[-1] + np.random.normal(0, 0.002), 0.01, 0.99)
)
# Inject a news event impact
event_idx = 200 # ~16.7 hours in
for i in range(event_idx, min(event_idx + 20, len(prices))):
prices[i] += 0.05 # 5% jump
price_data = pd.DataFrame({
'timestamp': timestamps,
'price': prices,
})
# Create a news event
event = NewsEvent(
timestamp=timestamps[event_idx],
headline="Major endorsement boosts candidate's campaign",
text="A key political figure has endorsed the candidate...",
source="Reuters",
sentiment=0.8,
)
# Analyze
analyzer = NewsImpactAnalyzer(significance_threshold=0.02)
result = analyzer.analyze_event(event, price_data)
if result:
print(f"Event: {result.event.headline}")
print(f"Price before: {result.price_before:.4f}")
print(f"Price after 5min: {result.price_after_5min:.4f}")
print(f"Price after 1h: {result.price_after_1h:.4f}")
print(f"Price after 24h: {result.price_after_24h:.4f}")
print(f"Immediate impact: {result.impact_immediate:+.4f}")
print(f"Short-term impact: {result.impact_short:+.4f}")
print(f"Sustained impact: {result.impact_sustained:+.4f}")
print(f"Significant: {result.is_significant}")
24.8 Real-Time News Monitoring
To act on NLP signals in a timely manner, you need a system that monitors news sources in real time, computes sentiment scores, and generates alerts when significant events occur.
24.8.1 Building a News Pipeline
A real-time news pipeline for prediction markets has several components:
[News Sources] -> [Collector] -> [Preprocessor] -> [NLP Models]
                                                         |
[Market Data] ---------------> [Feature Store] <---------+
                                      |
                                      v
                             [Trading Logic] -> [Execution]
News Sources include:
- RSS feeds: Major news outlets (Reuters, AP, Bloomberg) publish RSS feeds that can be polled every few minutes.
- News APIs: Services like NewsAPI, GDELT, or Event Registry provide structured access to news articles. NewsAPI offers a free tier suitable for development.
- Social media APIs: The Twitter/X API (now expensive), Reddit API, and Bluesky API provide access to social media posts.
- Web scraping: For sources without APIs, targeted web scraping can extract headlines and article text. (Always respect terms of service and robots.txt.)
24.8.2 RSS Feeds and News APIs
RSS feeds are the simplest and most reliable way to monitor news:
import feedparser
import time
from datetime import datetime
from typing import List, Dict, Callable
import hashlib
class RSSMonitor:
"""
Monitors RSS feeds for new articles.
Tracks seen articles to avoid processing duplicates.
"""
def __init__(self, feeds: Dict[str, str]):
"""
Parameters
----------
feeds : Dict[str, str]
Mapping from feed name to feed URL.
"""
self.feeds = feeds
self.seen_ids = set()
def _article_id(self, entry) -> str:
"""Generate a unique ID for an article."""
key = entry.get('id', '') or entry.get('link', '') or entry.get('title', '')
return hashlib.md5(key.encode()).hexdigest()
def check_feeds(self) -> List[Dict]:
"""
Check all feeds for new articles.
Returns
-------
List[Dict]
List of new articles with title, summary, link,
published date, and source.
"""
new_articles = []
for name, url in self.feeds.items():
try:
feed = feedparser.parse(url)
for entry in feed.entries:
article_id = self._article_id(entry)
if article_id not in self.seen_ids:
self.seen_ids.add(article_id)
new_articles.append({
'source': name,
'title': entry.get('title', ''),
'summary': entry.get('summary', ''),
'link': entry.get('link', ''),
'published': entry.get('published', ''),
'fetched_at': datetime.utcnow().isoformat(),
})
except Exception as e:
print(f"Error fetching {name}: {e}")
return new_articles
def monitor(
self,
callback: Callable[[List[Dict]], None],
interval_seconds: int = 60,
):
"""
Continuously monitor feeds and call the callback
with new articles.
Parameters
----------
callback : callable
Function to call with new articles.
interval_seconds : int
Seconds between feed checks.
"""
print(f"Monitoring {len(self.feeds)} feeds "
f"every {interval_seconds}s...")
while True:
new_articles = self.check_feeds()
if new_articles:
callback(new_articles)
time.sleep(interval_seconds)
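A minimal usage sketch follows; the feed URLs are placeholders, to be replaced with the RSS endpoints you actually monitor:
def handle_new_articles(articles):
    # Print each new headline; in practice, forward the batch to the
    # sentiment and alerting logic described in the next subsection.
    for article in articles:
        print(f"[{article['source']}] {article['title']}")

feeds = {
    # Placeholder URLs -- substitute real RSS endpoints
    "wire_news": "https://example.com/world/rss.xml",
    "politics": "https://example.com/politics/rss.xml",
}
monitor = RSSMonitor(feeds)
monitor.monitor(handle_new_articles, interval_seconds=120)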
24.8.3 Real-Time Sentiment Scoring and Alert System
Combining the news monitor with sentiment analysis creates a real-time alert system:
import numpy as np
from dataclasses import dataclass
from typing import List, Dict, Optional
from collections import deque
from datetime import datetime
@dataclass
class SentimentAlert:
"""An alert generated when sentiment crosses a threshold."""
timestamp: str
alert_type: str # 'spike', 'shift', 'volume'
description: str
articles: List[Dict]
sentiment_score: float
market_relevance: str
class RealTimeNewsSentimentMonitor:
"""
Real-time news monitoring with sentiment analysis and alerting.
Monitors news feeds, scores sentiment, tracks rolling averages,
and generates alerts when conditions are met.
"""
def __init__(
self,
sentiment_analyzer, # The analyzer from Section 24.4
keywords: List[str] = None,
alert_threshold: float = 0.3,
volume_alert_multiplier: float = 3.0,
rolling_window: int = 50,
):
self.sentiment_analyzer = sentiment_analyzer
self.keywords = [k.lower() for k in (keywords or [])]
self.alert_threshold = alert_threshold
self.volume_alert_multiplier = volume_alert_multiplier
self.rolling_window = rolling_window
self.sentiment_history = deque(maxlen=rolling_window)
self.article_counts = deque(maxlen=24) # hourly counts
self.alerts = []
self.current_hour_count = 0
def is_relevant(self, article: Dict) -> bool:
"""Check if an article is relevant to our monitored keywords."""
if not self.keywords:
return True
text = (article.get('title', '') + ' ' +
article.get('summary', '')).lower()
return any(kw in text for kw in self.keywords)
def process_articles(self, articles: List[Dict]) -> List[SentimentAlert]:
"""
Process new articles: filter, score sentiment, check alerts.
Parameters
----------
articles : List[Dict]
New articles from the RSS monitor.
Returns
-------
List[SentimentAlert]
Any alerts generated.
"""
relevant = [a for a in articles if self.is_relevant(a)]
if not relevant:
return []
new_alerts = []
for article in relevant:
text = article.get('title', '') + '. ' + article.get('summary', '')
result = self.sentiment_analyzer.analyze(text)
article['sentiment'] = result.ensemble_score
self.sentiment_history.append(result.ensemble_score)
self.current_hour_count += 1
# Check for sentiment spike
if len(self.sentiment_history) >= 5:
recent_avg = np.mean(list(self.sentiment_history)[-5:])
overall_avg = np.mean(list(self.sentiment_history))
if abs(recent_avg - overall_avg) > self.alert_threshold:
direction = "positive" if recent_avg > overall_avg else "negative"
alert = SentimentAlert(
timestamp=datetime.utcnow().isoformat(),
alert_type='spike',
description=(
f"Sentiment {direction} spike detected. "
f"Recent avg: {recent_avg:.3f}, "
f"Overall avg: {overall_avg:.3f}"
),
articles=relevant,
sentiment_score=recent_avg,
market_relevance="HIGH",
)
new_alerts.append(alert)
self.alerts.append(alert)
# Check for volume spike
if len(self.article_counts) > 0:
avg_hourly = np.mean(list(self.article_counts))
if (avg_hourly > 0 and
self.current_hour_count >
avg_hourly * self.volume_alert_multiplier):
alert = SentimentAlert(
timestamp=datetime.utcnow().isoformat(),
alert_type='volume',
description=(
f"News volume spike: {self.current_hour_count} "
f"articles vs avg {avg_hourly:.1f}/hour"
),
articles=relevant,
sentiment_score=np.mean(
[a['sentiment'] for a in relevant]
),
market_relevance="MEDIUM",
)
new_alerts.append(alert)
self.alerts.append(alert)
# Print alerts
for alert in new_alerts:
print(f"\n{'='*60}")
print(f"ALERT [{alert.alert_type.upper()}] "
f"at {alert.timestamp}")
print(f" {alert.description}")
print(f" Market relevance: {alert.market_relevance}")
print(f" Related articles: {len(alert.articles)}")
for a in alert.articles[:3]:
print(f" - {a['title'][:80]}")
print(f"{'='*60}")
return new_alerts
This real-time monitoring system provides the infrastructure needed to act on NLP signals quickly. In practice, you would connect the alert system to your trading logic, potentially placing automated trades when sentiment signals are strong and confirmed by multiple sources.
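One way to wire the components together is to pass the sentiment monitor's process_articles method as the RSS monitor's callback. A sketch, assuming my_sentiment_analyzer is an analyzer from Section 24.4 and using a placeholder feed URL:
sentiment_monitor = RealTimeNewsSentimentMonitor(
    sentiment_analyzer=my_sentiment_analyzer,  # assumed, from Section 24.4
    keywords=["election", "poll", "candidate"],
    alert_threshold=0.3,
)
rss_monitor = RSSMonitor({
    "politics": "https://example.com/politics/rss.xml",  # placeholder URL
})
# Every batch of new articles is filtered, scored, and checked for alerts
rss_monitor.monitor(sentiment_monitor.process_articles, interval_seconds=60)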
24.9 Named Entity Recognition and Topic Extraction
Beyond sentiment, we can extract structured information from text using Named Entity Recognition (NER) and topic modeling. These techniques allow us to identify who and what is being discussed, and to track how the conversation around a prediction market question evolves over time.
24.9.1 NER for Extracting People, Organizations, and Events
Named Entity Recognition identifies and classifies named entities in text into predefined categories:
- PERSON: "Joe Biden," "Elon Musk"
- ORG: "Federal Reserve," "Polymarket," "Democratic Party"
- GPE (Geo-Political Entity): "United States," "Ukraine," "Georgia"
- DATE: "November 2024," "next Tuesday"
- MONEY: "$1 billion," "25 basis points"
- EVENT: "Super Bowl," "G7 summit"
For prediction markets, NER is valuable because it connects text to specific markets. An article mentioning "Biden" and "approval rating" is relevant to presidential election markets. An article mentioning "Federal Reserve" and "interest rates" is relevant to monetary policy markets.
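A minimal illustration with spaCy, assuming the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Federal Reserve is expected to cut rates before the November election.")
for ent in doc.ents:
    # Typical output includes ("Federal Reserve", "ORG") and ("November", "DATE")
    print(ent.text, ent.label_)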
24.9.2 Topic Modeling with LDA
Latent Dirichlet Allocation (LDA) is an unsupervised model that discovers topics in a collection of documents. Each topic is represented as a distribution over words, and each document is a mixture of topics.
The generative model for LDA is:
- For each topic $k$, draw a word distribution $\phi_k \sim \text{Dir}(\beta)$.
- For each document $d$:
  a. Draw a topic distribution $\theta_d \sim \text{Dir}(\alpha)$.
  b. For each word position in the document:
     - Draw a topic $z \sim \text{Multinomial}(\theta_d)$.
     - Draw a word $w \sim \text{Multinomial}(\phi_z)$.
The hyperparameters $\alpha$ and $\beta$ control the sparsity of the topic and word distributions, respectively. A small $\alpha$ encourages documents to be about a few topics; a small $\beta$ encourages topics to use a few words.
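In scikit-learn's LatentDirichletAllocation, these priors are exposed as doc_topic_prior (alpha) and topic_word_prior (beta). A brief sketch of setting them explicitly:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=10,
    doc_topic_prior=0.1,    # small alpha: each document concentrates on a few topics
    topic_word_prior=0.01,  # small beta: each topic concentrates on a few words
    random_state=42,
)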
For prediction markets, LDA can be used to:
- Track which topics are dominating the news cycle for a particular market.
- Detect when a new topic emerges that is relevant to a market.
- Cluster articles by topic for more targeted sentiment analysis.
24.9.3 Connecting Entities to Markets
The key insight is that NER and topic modeling can be used to create a mapping between text and markets. This mapping enables automated routing of news to relevant market models:
# Entity-to-market mapping example
entity_market_map = {
"Biden": ["presidential-election-2024", "us-foreign-policy"],
"Trump": ["presidential-election-2024", "us-foreign-policy"],
"Federal Reserve": ["fed-rate-decision", "inflation-target"],
"SEC": ["bitcoin-etf-approval", "crypto-regulation"],
"Ukraine": ["ukraine-conflict-resolution", "nato-expansion"],
}
24.9.4 Python NER and Topic Pipeline
import spacy
from collections import Counter, defaultdict
from typing import List, Dict, Tuple
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
class NERTopicPipeline:
"""
Combined NER and topic modeling pipeline for prediction
market text analysis.
"""
def __init__(
self,
spacy_model: str = "en_core_web_sm",
n_topics: int = 10,
entity_market_map: Dict[str, List[str]] = None,
):
"""
Parameters
----------
spacy_model : str
SpaCy model to use for NER.
n_topics : int
Number of topics for LDA.
entity_market_map : dict
Mapping from entity names to related market IDs.
"""
self.nlp = spacy.load(spacy_model)
self.n_topics = n_topics
self.entity_market_map = entity_market_map or {}
# LDA components
self.vectorizer = CountVectorizer(
max_features=5000,
stop_words='english',
min_df=2,
max_df=0.95,
)
self.lda = LatentDirichletAllocation(
n_components=n_topics,
random_state=42,
max_iter=20,
)
self.is_fitted = False
def extract_entities(self, text: str) -> Dict[str, List[str]]:
"""
Extract named entities from text.
Parameters
----------
text : str
Input text.
Returns
-------
Dict[str, List[str]]
Mapping from entity type to list of entities found.
"""
doc = self.nlp(text)
entities = defaultdict(list)
for ent in doc.ents:
entities[ent.label_].append(ent.text)
return dict(entities)
def find_relevant_markets(self, text: str) -> List[str]:
"""
Identify prediction markets relevant to the given text
based on entity mentions.
Parameters
----------
text : str
Input text.
Returns
-------
List[str]
List of relevant market IDs.
"""
entities = self.extract_entities(text)
relevant_markets = set()
for entity_type, entity_list in entities.items():
for entity in entity_list:
# Check direct matches
if entity in self.entity_market_map:
relevant_markets.update(self.entity_market_map[entity])
# Check partial matches
for key, markets in self.entity_market_map.items():
if key.lower() in entity.lower() or entity.lower() in key.lower():
relevant_markets.update(markets)
return list(relevant_markets)
def fit_topics(self, texts: List[str]):
"""
Fit the LDA topic model on a corpus.
Parameters
----------
texts : List[str]
Corpus of texts to fit on.
"""
dtm = self.vectorizer.fit_transform(texts)
self.lda.fit(dtm)
self.is_fitted = True
def get_topic_distribution(self, text: str) -> np.ndarray:
"""
Get the topic distribution for a single text.
Parameters
----------
text : str
Input text.
Returns
-------
np.ndarray
Topic distribution vector of shape (n_topics,).
"""
if not self.is_fitted:
raise RuntimeError("Must call fit_topics() before get_topic_distribution()")
dtm = self.vectorizer.transform([text])
return self.lda.transform(dtm)[0]
def get_topic_words(self, n_words: int = 10) -> Dict[int, List[str]]:
"""
Get the top words for each topic.
Parameters
----------
n_words : int
Number of top words per topic.
Returns
-------
Dict[int, List[str]]
Mapping from topic index to list of top words.
"""
if not self.is_fitted:
raise RuntimeError("Must call fit_topics() first.")
feature_names = self.vectorizer.get_feature_names_out()
topics = {}
for idx, topic in enumerate(self.lda.components_):
top_indices = topic.argsort()[-n_words:][::-1]
topics[idx] = [feature_names[i] for i in top_indices]
return topics
def analyze_document(self, text: str) -> Dict:
"""
Full analysis of a single document: entities, markets,
and topics.
Parameters
----------
text : str
Input text.
Returns
-------
Dict
Combined analysis results.
"""
entities = self.extract_entities(text)
markets = self.find_relevant_markets(text)
result = {
'entities': entities,
'relevant_markets': markets,
}
if self.is_fitted:
topic_dist = self.get_topic_distribution(text)
result['topic_distribution'] = topic_dist.tolist()
result['dominant_topic'] = int(np.argmax(topic_dist))
return result
# Example usage
if __name__ == "__main__":
entity_map = {
"Biden": ["presidential-election-2024"],
"Trump": ["presidential-election-2024"],
"Federal Reserve": ["fed-rate-decision-2024"],
"Fed": ["fed-rate-decision-2024"],
"SEC": ["bitcoin-etf-approval"],
"Ukraine": ["ukraine-ceasefire-2024"],
}
pipeline = NERTopicPipeline(
n_topics=5,
entity_market_map=entity_map,
)
# Analyze a single article
article = (
"President Biden announced new sanctions against Russia "
"following the latest escalation in Ukraine. The Federal "
"Reserve is expected to hold rates steady at its next meeting, "
"according to market analysts."
)
entities = pipeline.extract_entities(article)
markets = pipeline.find_relevant_markets(article)
print("Entities found:")
for etype, elist in entities.items():
print(f" {etype}: {elist}")
print(f"\nRelevant markets: {markets}")
# Fit topic model on a corpus
corpus = [
"The election polls show a tight race between candidates",
"Federal Reserve signals no change in interest rate policy",
"Bitcoin ETF application faces SEC scrutiny and delays",
"Ukraine peace talks stall as both sides set conditions",
"Campaign fundraising breaks records in the third quarter",
"Inflation data comes in lower than expected for the month",
"Cryptocurrency markets rally on institutional adoption news",
"Diplomatic efforts intensify as ceasefire deadline approaches",
] * 5 # Repeat for minimum LDA requirements
pipeline.fit_topics(corpus)
topics = pipeline.get_topic_words(n_words=5)
print("\nDiscovered Topics:")
for topic_id, words in topics.items():
print(f" Topic {topic_id}: {', '.join(words)}")
analysis = pipeline.analyze_document(article)
print(f"\nDominant topic: {analysis['dominant_topic']}")
print(f"Topic distribution: {[f'{x:.3f}' for x in analysis['topic_distribution']]}")
24.10 Building NLP Features for Trading Models
The ultimate goal of all the NLP techniques in this chapter is to produce features that improve prediction market trading models. This section shows how to combine text-derived features with the tabular models from Chapter 23.
24.10.1 Text-Derived Features
From the techniques developed in this chapter, we can extract the following feature categories:
Sentiment Features:
- sentiment_mean_1h: Average sentiment of articles in the past 1 hour.
- sentiment_mean_24h: Average sentiment over the past 24 hours.
- sentiment_std_24h: Standard deviation of sentiment (higher = more disagreement).
- sentiment_momentum: Change in average sentiment from 24h ago to now.
- sentiment_extreme: Count of articles with |sentiment| > 0.5.
Volume Features:
- article_count_1h: Number of relevant articles in the past hour.
- article_count_24h: Number of relevant articles in the past 24 hours.
- volume_ratio: Ratio of recent volume to historical average.
- source_diversity: Number of unique sources covering the topic.
Topic Features:
- topic_k_weight: Weight of topic $k$ in recent articles (one feature per topic).
- topic_shift: Change in dominant topic over time.
- topic_entropy: Entropy of topic distribution (higher = more diverse coverage).
Entity Features:
- entity_count: Number of unique entities mentioned.
- key_entity_frequency: How often a specific key entity is mentioned.
- new_entity_flag: Whether a previously unseen entity appears.
Novelty Features:
- news_surprise: How novel the latest article is (from Section 24.7).
- information_velocity: Rate at which new information is appearing.
24.10.2 Feature Integration with Tabular Models
These NLP features can be combined with the tabular features from Chapter 23 (market features like price, volume, spread; time features; cross-market features) into a unified feature set:
$$ \mathbf{x}_t = [\underbrace{x_1, \ldots, x_m}_{\text{market features}}, \underbrace{x_{m+1}, \ldots, x_{m+n}}_{\text{NLP features}}] $$
The combined feature vector is then fed into gradient boosted trees, random forests, or neural networks for probability estimation or trading signal generation.
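A minimal sketch of this integration, assuming market_features and nlp_features are DataFrames indexed by timestamp and outcome is a hypothetical binary label column (all names are illustrative):
from sklearn.ensemble import GradientBoostingClassifier

# Join market-derived and NLP-derived features on their timestamp index
combined = market_features.join(nlp_features, how="inner")
X = combined.drop(columns=["outcome"])  # hypothetical label column
y = combined["outcome"]

model = GradientBoostingClassifier(n_estimators=300, max_depth=3)
model.fit(X, y)
probabilities = model.predict_proba(X)[:, 1]  # estimated event probabilities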
24.10.3 Python NLP Feature Generator
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Dict, Optional
from collections import deque
class NLPFeatureGenerator:
"""
Generates NLP-derived features for prediction market
trading models.
Maintains rolling windows of article data and computes
features at each time step.
"""
def __init__(
self,
sentiment_analyzer,
windows: List[int] = None,
n_topics: int = 5,
):
"""
Parameters
----------
sentiment_analyzer : object
Sentiment analyzer with an analyze() method.
windows : List[int]
Window sizes in hours for rolling features.
n_topics : int
Number of topics in the topic model.
"""
self.analyzer = sentiment_analyzer
self.windows = windows or [1, 6, 24, 72]
self.n_topics = n_topics
self.article_buffer = [] # List of (timestamp, sentiment, source, ...)
def add_article(
self,
timestamp: datetime,
text: str,
source: str,
topic_distribution: Optional[np.ndarray] = None,
entities: Optional[List[str]] = None,
):
"""
Add a new article to the buffer and compute its features.
Parameters
----------
timestamp : datetime
Publication time.
text : str
Article text.
source : str
Source name.
topic_distribution : np.ndarray, optional
Pre-computed topic distribution.
entities : List[str], optional
Pre-extracted entities.
"""
result = self.analyzer.analyze(text)
self.article_buffer.append({
'timestamp': timestamp,
'sentiment': result.ensemble_score,
'subjectivity': result.textblob_subjectivity,
'source': source,
'topic_dist': topic_distribution,
'entities': entities or [],
'text_length': len(text),
})
def _get_articles_in_window(
self, current_time: datetime, hours: int
) -> List[Dict]:
"""Get articles within the specified time window."""
cutoff = current_time - timedelta(hours=hours)
return [
a for a in self.article_buffer
if a['timestamp'] >= cutoff and a['timestamp'] <= current_time
]
def generate_features(self, current_time: datetime) -> Dict[str, float]:
"""
Generate all NLP features for the current time step.
Parameters
----------
current_time : datetime
The current time for feature generation.
Returns
-------
Dict[str, float]
Dictionary of feature names to values.
"""
features = {}
for window in self.windows:
articles = self._get_articles_in_window(current_time, window)
prefix = f"nlp_{window}h"
if not articles:
# No articles in this window
features[f"{prefix}_sentiment_mean"] = 0.0
features[f"{prefix}_sentiment_std"] = 0.0
features[f"{prefix}_sentiment_min"] = 0.0
features[f"{prefix}_sentiment_max"] = 0.0
features[f"{prefix}_article_count"] = 0
features[f"{prefix}_source_diversity"] = 0
features[f"{prefix}_subjectivity_mean"] = 0.0
for k in range(self.n_topics):
features[f"{prefix}_topic_{k}"] = 0.0
continue
sentiments = [a['sentiment'] for a in articles]
# Sentiment features
features[f"{prefix}_sentiment_mean"] = np.mean(sentiments)
features[f"{prefix}_sentiment_std"] = (
np.std(sentiments) if len(sentiments) > 1 else 0.0
)
features[f"{prefix}_sentiment_min"] = np.min(sentiments)
features[f"{prefix}_sentiment_max"] = np.max(sentiments)
features[f"{prefix}_sentiment_skew"] = (
float(pd.Series(sentiments).skew())
if len(sentiments) > 2 else 0.0
)
# Volume features
features[f"{prefix}_article_count"] = len(articles)
features[f"{prefix}_source_diversity"] = len(
set(a['source'] for a in articles)
)
# Subjectivity features
subj = [a['subjectivity'] for a in articles]
features[f"{prefix}_subjectivity_mean"] = np.mean(subj)
# Topic features
topic_dists = [
a['topic_dist'] for a in articles
if a['topic_dist'] is not None
]
if topic_dists:
avg_topics = np.mean(topic_dists, axis=0)
for k in range(min(self.n_topics, len(avg_topics))):
features[f"{prefix}_topic_{k}"] = avg_topics[k]
# Topic entropy
features[f"{prefix}_topic_entropy"] = float(
-np.sum(avg_topics * np.log(avg_topics + 1e-10))
)
else:
for k in range(self.n_topics):
features[f"{prefix}_topic_{k}"] = 0.0
features[f"{prefix}_topic_entropy"] = 0.0
# Entity features
all_entities = []
for a in articles:
all_entities.extend(a['entities'])
features[f"{prefix}_entity_count"] = len(set(all_entities))
features[f"{prefix}_entity_mentions"] = len(all_entities)
# Cross-window features
short_articles = self._get_articles_in_window(current_time, 1)
long_articles = self._get_articles_in_window(current_time, 24)
if short_articles and long_articles:
short_sent = np.mean([a['sentiment'] for a in short_articles])
long_sent = np.mean([a['sentiment'] for a in long_articles])
features["nlp_sentiment_momentum"] = short_sent - long_sent
else:
features["nlp_sentiment_momentum"] = 0.0
# Volume ratio
if long_articles:
expected_hourly = len(long_articles) / 24
actual_hourly = len(short_articles) if short_articles else 0
features["nlp_volume_ratio"] = (
actual_hourly / expected_hourly
if expected_hourly > 0 else 0.0
)
else:
features["nlp_volume_ratio"] = 0.0
return features
def generate_feature_dataframe(
self,
timestamps: List[datetime],
) -> pd.DataFrame:
"""
Generate features for multiple time steps.
Parameters
----------
timestamps : List[datetime]
Time steps at which to generate features.
Returns
-------
pd.DataFrame
DataFrame with one row per timestamp and one column per feature.
"""
rows = []
for ts in timestamps:
features = self.generate_features(ts)
features['timestamp'] = ts
rows.append(features)
return pd.DataFrame(rows).set_index('timestamp')
This feature generator produces a comprehensive set of NLP features that can be joined with market data features for model training. The multi-window approach (1h, 6h, 24h, 72h) captures both immediate reactions and longer-term sentiment trends.
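A usage sketch, assuming my_sentiment_analyzer is an analyzer from Section 24.4 whose analyze() method returns ensemble_score and textblob_subjectivity attributes (article texts here are illustrative):
from datetime import datetime

gen = NLPFeatureGenerator(sentiment_analyzer=my_sentiment_analyzer, n_topics=5)

# Stream in articles as they arrive
gen.add_article(datetime(2024, 11, 1, 9, 0), "Polls show a tight race ...", "Reuters")
gen.add_article(datetime(2024, 11, 1, 9, 30), "Candidate gains a key endorsement ...", "AP")

# Generate the feature vector at a decision time
features = gen.generate_features(datetime(2024, 11, 1, 10, 0))
print(features["nlp_1h_sentiment_mean"], features["nlp_sentiment_momentum"])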
24.11 LLMs as Forecasters
Perhaps the most fascinating frontier in NLP for prediction markets is the direct use of large language models (LLMs) as forecasters. Rather than using NLP to extract features that are fed into a separate model, what if we simply ask an LLM "What is the probability that event X occurs?"
24.11.1 Using GPT/Claude for Probability Estimation
Modern LLMs like GPT-4 and Claude have been trained on vast amounts of text that includes forecasting discussions, base rates, historical outcomes, and analytical reasoning. When prompted appropriately, they can produce probability estimates for future events.
The basic approach is:
Prompt: "Based on current information as of [date], what is the
probability that [event description]? Provide your estimate as
a number between 0 and 1, and explain your reasoning."
Research has shown that LLM probability estimates are surprisingly well-calibrated for some types of questions, particularly those where:
- Historical base rates are informative.
- The question has been widely discussed in the training data.
- The reasoning required is primarily analytical rather than dependent on private information.
24.11.2 Prompt Engineering for Forecasting
The quality of LLM forecasts depends heavily on the prompt. Key principles:
Principle 1: Provide context. Give the LLM relevant background information, recent developments, and the current date.
Principle 2: Request calibrated probabilities. Ask the LLM to think about base rates, reference classes, and to consider both sides of the argument.
Principle 3: Use structured reasoning. Ask the LLM to list factors for and against, assign weights, and then synthesize a probability.
Principle 4: Request confidence intervals. Ask for a range (e.g., "70-80% likely") rather than a point estimate, to capture the LLM's uncertainty.
Principle 5: Decompose complex questions. Break a complex question into sub-questions, get probabilities for each, and combine them.
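For example, a conjunctive question can be split into conditional sub-questions and recombined with the chain rule. A small sketch with purely illustrative numbers:
# "Will candidate X win the presidency?" decomposed into two sub-questions
p_nomination = 0.80           # P(X wins the nomination) -- illustrative
p_general_given_nom = 0.45    # P(X wins the general | nomination) -- illustrative
p_presidency = p_nomination * p_general_given_nom
print(f"Combined estimate: {p_presidency:.2f}")  # 0.36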
An effective prompting template:
You are a professional forecaster. Estimate the probability
of the following event:
EVENT: {event_description}
RESOLUTION DATE: {resolution_date}
CURRENT DATE: {current_date}
Please structure your analysis as follows:
1. BASE RATE: What is the historical base rate for similar events?
2. FACTORS FOR: What factors increase the probability?
3. FACTORS AGAINST: What factors decrease the probability?
4. RECENT DEVELOPMENTS: What recent information is most relevant?
5. PROBABILITY ESTIMATE: Your best estimate as a number between
0.01 and 0.99.
6. CONFIDENCE: How confident are you in this estimate?
(low/medium/high)
24.11.3 Limitations and Calibration of LLM Predictions
LLM forecasts have several important limitations:
Knowledge cutoff. LLMs have a training data cutoff date and may not know about recent events. This is particularly important for prediction markets where current information is critical.
Sycophancy and anchoring. LLMs may anchor on probabilities suggested in the prompt, or shift their estimates toward what they perceive the user wants to hear.
Compression of probabilities. LLMs tend to avoid extreme probabilities. They rarely output values below 0.05 or above 0.95, even when these might be appropriate.
Inconsistency. The same question asked in different ways may produce different probability estimates. Running multiple prompts and averaging can mitigate this.
Lack of true updating. LLMs do not update their beliefs in the Bayesian sense. They generate plausible-sounding analyses that may not properly weight new evidence.
Calibration analysis of LLM forecasts typically shows:
- Overconfidence in the 50-80% range (events they say are 70% likely actually happen about 60% of the time).
- Underconfidence at the extremes (events they say are 90% likely actually happen about 95% of the time).
- Better calibration on well-known topics than on obscure ones.
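One common remedy for the compression of probabilities noted above is an extremizing transform that pushes estimates away from 0.5 in log-odds space. A sketch; the extremizing factor k is an assumption that would normally be tuned on resolved questions:
import numpy as np

def extremize(p: float, k: float = 1.5) -> float:
    """Scale the log-odds of a probability by k to counteract compression."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    log_odds = np.log(p / (1 - p))
    return float(1 / (1 + np.exp(-k * log_odds)))

print(extremize(0.70))  # roughly 0.78 with k = 1.5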
24.11.4 Python LLM Forecasting Harness
import json
import time
from dataclasses import dataclass
from typing import List, Dict, Optional, Tuple
import numpy as np
from datetime import datetime
@dataclass
class ForecastResult:
"""Container for an LLM forecast result."""
question: str
probability: float
confidence: str
reasoning: str
base_rate: Optional[float]
model: str
timestamp: str
prompt_version: str
class LLMForecaster:
"""
Uses LLMs to generate probability forecasts for prediction
market questions.
Supports multiple prompting strategies and ensemble methods.
"""
def __init__(
self,
model: str = "gpt-4",
api_key: str = None,
temperature: float = 0.3,
n_samples: int = 3,
):
"""
Parameters
----------
model : str
LLM model identifier.
api_key : str
API key for the LLM service.
temperature : float
Sampling temperature (lower = more deterministic).
n_samples : int
Number of independent samples to average.
"""
self.model = model
self.api_key = api_key
self.temperature = temperature
self.n_samples = n_samples
def _build_prompt(
self,
question: str,
context: str = "",
current_date: str = None,
resolution_date: str = None,
prompt_version: str = "structured",
) -> str:
"""Build the forecasting prompt."""
if current_date is None:
current_date = datetime.now().strftime("%Y-%m-%d")
if prompt_version == "structured":
prompt = f"""You are an expert superforecaster who is well-calibrated
and thinks carefully about base rates and evidence.
Estimate the probability of the following event:
EVENT: {question}
CURRENT DATE: {current_date}
"""
if resolution_date:
prompt += f"RESOLUTION DATE: {resolution_date}\n"
if context:
prompt += f"\nRELEVANT CONTEXT:\n{context}\n"
prompt += """
Please structure your analysis:
1. BASE RATE: Historical base rate for similar events.
2. FACTORS FOR: Evidence increasing the probability.
3. FACTORS AGAINST: Evidence decreasing the probability.
4. SYNTHESIS: Weigh the factors and arrive at an estimate.
IMPORTANT: End your response with exactly this format:
PROBABILITY: [your estimate as a decimal between 0.01 and 0.99]
CONFIDENCE: [low/medium/high]
"""
elif prompt_version == "simple":
prompt = f"""What is the probability that: {question}
As of {current_date}. {context}
Respond with a single number between 0.01 and 0.99.
PROBABILITY:"""
elif prompt_version == "devil_advocate":
prompt = f"""You are an expert forecaster. Consider the question:
{question}
Current date: {current_date}
{context}
First, make the STRONGEST CASE that this event WILL happen.
Then, make the STRONGEST CASE that it will NOT happen.
Finally, weigh both cases and provide your probability estimate.
PROBABILITY: [decimal between 0.01 and 0.99]
CONFIDENCE: [low/medium/high]
"""
else:
raise ValueError(f"Unknown prompt version: {prompt_version}")
return prompt
def _parse_response(self, response_text: str) -> Tuple[float, str]:
"""
Parse the LLM response to extract probability and confidence.
Parameters
----------
response_text : str
The raw LLM response.
Returns
-------
Tuple[float, str]
(probability, confidence_level)
"""
import re
# Extract probability
prob_match = re.search(
r'PROBABILITY:\s*(0?\.\d+|1\.0?)', response_text
)
if prob_match:
prob = float(prob_match.group(1))
else:
# Try to find any decimal between 0 and 1
numbers = re.findall(r'0\.\d+', response_text)
if numbers:
prob = float(numbers[-1]) # Take the last one
else:
prob = 0.5 # Default fallback
prob = np.clip(prob, 0.01, 0.99)
# Extract confidence
conf_match = re.search(
r'CONFIDENCE:\s*(low|medium|high)', response_text, re.IGNORECASE
)
confidence = conf_match.group(1).lower() if conf_match else "medium"
return prob, confidence
def _call_llm(self, prompt: str) -> str:
"""
Call the LLM API.
In production, this would make an actual API call.
Here we simulate the response for demonstration.
"""
# SIMULATION: In production, replace with actual API call
# Example with OpenAI:
#
# from openai import OpenAI
# client = OpenAI(api_key=self.api_key)
# response = client.chat.completions.create(
# model=self.model,
# messages=[{"role": "user", "content": prompt}],
# temperature=self.temperature,
# )
# return response.choices[0].message.content
# Simulated response for demonstration
simulated = """
1. BASE RATE: Historically, incumbent parties retain the presidency
about 50-60% of the time in modern US elections.
2. FACTORS FOR: Strong economic indicators, no major scandals,
incumbency advantage.
3. FACTORS AGAINST: Historical trend of party fatigue after two
terms, opposition candidate polling strength.
4. SYNTHESIS: Weighing the incumbency advantage against historical
patterns of party alternation, and considering current polling data.
PROBABILITY: 0.55
CONFIDENCE: medium
"""
return simulated
def forecast(
self,
question: str,
context: str = "",
current_date: str = None,
resolution_date: str = None,
) -> ForecastResult:
"""
Generate a probability forecast for a question.
Uses multiple samples and prompt versions to produce
a robust estimate.
Parameters
----------
question : str
The prediction market question.
context : str
Relevant context to provide to the LLM.
current_date : str
Current date string.
resolution_date : str
When the question resolves.
Returns
-------
ForecastResult
The forecast with probability, confidence, and reasoning.
"""
probabilities = []
all_reasoning = []
prompt_versions = ["structured", "devil_advocate", "simple"]
for version in prompt_versions:
for _ in range(self.n_samples):
prompt = self._build_prompt(
question, context, current_date,
resolution_date, version
)
response = self._call_llm(prompt)
prob, confidence = self._parse_response(response)
probabilities.append(prob)
all_reasoning.append(response)
# Aggregate: trimmed mean (remove most extreme estimates)
sorted_probs = sorted(probabilities)
trimmed = sorted_probs[1:-1] if len(sorted_probs) > 2 else sorted_probs
final_prob = np.mean(trimmed)
# Determine overall confidence from variance
prob_std = np.std(probabilities)
if prob_std < 0.05:
overall_confidence = "high"
elif prob_std < 0.15:
overall_confidence = "medium"
else:
overall_confidence = "low"
return ForecastResult(
question=question,
probability=float(final_prob),
confidence=overall_confidence,
reasoning=all_reasoning[0], # Use first structured response
base_rate=None,
model=self.model,
timestamp=datetime.now().isoformat(),
prompt_version="ensemble",
)
def evaluate_calibration(
self,
questions: List[str],
outcomes: List[int],
contexts: List[str] = None,
) -> Dict[str, float]:
"""
Evaluate calibration of LLM forecasts against actual outcomes.
Parameters
----------
questions : List[str]
List of questions that have resolved.
outcomes : List[int]
Actual outcomes (0 or 1).
contexts : List[str], optional
Contexts for each question.
Returns
-------
Dict[str, float]
Calibration metrics including Brier score and
calibration error.
"""
if contexts is None:
contexts = [""] * len(questions)
forecasts = []
for q, ctx in zip(questions, contexts):
result = self.forecast(q, ctx)
forecasts.append(result.probability)
forecasts = np.array(forecasts)
outcomes = np.array(outcomes)
# Brier score
brier = np.mean((forecasts - outcomes) ** 2)
# Calibration by bins
n_bins = 5
bins = np.linspace(0, 1, n_bins + 1)
calibration_errors = []
for i in range(n_bins):
mask = (forecasts >= bins[i]) & (forecasts < bins[i + 1])
if mask.sum() > 0:
predicted_avg = forecasts[mask].mean()
actual_avg = outcomes[mask].mean()
calibration_errors.append(abs(predicted_avg - actual_avg))
avg_calibration_error = (
np.mean(calibration_errors) if calibration_errors else 0.0
)
# Log loss
eps = 1e-10
log_loss = -np.mean(
outcomes * np.log(forecasts + eps) +
(1 - outcomes) * np.log(1 - forecasts + eps)
)
return {
'brier_score': float(brier),
'avg_calibration_error': float(avg_calibration_error),
'log_loss': float(log_loss),
'mean_forecast': float(forecasts.mean()),
'forecast_std': float(forecasts.std()),
'n_questions': len(questions),
}
# Example usage
if __name__ == "__main__":
forecaster = LLMForecaster(
model="gpt-4",
temperature=0.3,
n_samples=2,
)
# Single forecast
result = forecaster.forecast(
question="Will the incumbent party win the 2024 US presidential election?",
context="Current polls show a tight race. Economic indicators are mixed.",
current_date="2024-06-01",
resolution_date="2024-11-05",
)
print(f"Question: {result.question}")
print(f"Probability: {result.probability:.3f}")
print(f"Confidence: {result.confidence}")
print(f"Model: {result.model}")
print(f"\nReasoning excerpt: {result.reasoning[:200]}...")
# Calibration evaluation (with simulated data)
questions = [
"Will GDP growth exceed 3% in Q1?",
"Will the bill pass the Senate?",
"Will the candidate win the primary?",
]
outcomes = [1, 0, 1]
metrics = forecaster.evaluate_calibration(questions, outcomes)
print(f"\nCalibration Metrics:")
for k, v in metrics.items():
print(f" {k}: {v:.4f}")
24.11.5 Practical Considerations for LLM Forecasting
Using LLMs as forecasters in a trading context requires careful consideration:
Cost management. LLM API calls are expensive. At $0.01-0.06 per 1,000 tokens for frontier models, running multiple prompts across hundreds of markets can quickly become costly. Budget accordingly and cache results.
Latency. LLM responses take 2-30 seconds. For time-sensitive trading, LLM forecasts should be generated on a schedule (e.g., hourly) rather than in response to individual news events.
Reproducibility. Even with temperature=0, LLM outputs are not perfectly deterministic. Log all prompts and responses for reproducibility.
Complementary use. LLM forecasts are most valuable as one input among many, not as a sole trading signal. They are particularly useful for:
- Initial probability estimates for new markets.
- Sanity checks on model-derived probabilities.
- Identifying qualitative factors that quantitative models might miss.
- Generating hypotheses about what information the market may be overlooking.
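As a concrete pattern for complementary use, an LLM estimate can be blended with the current market price in log-odds space rather than traded on directly. A sketch with an illustrative weight:
import numpy as np

def blend_probabilities(p_market: float, p_llm: float, w_llm: float = 0.2) -> float:
    """Weighted average of the market price and an LLM forecast in log-odds space."""
    def logit(p):
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return np.log(p / (1 - p))
    blended = (1 - w_llm) * logit(p_market) + w_llm * logit(p_llm)
    return float(1 / (1 + np.exp(-blended)))

print(blend_probabilities(p_market=0.55, p_llm=0.70))  # roughly 0.58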
24.12 Chapter Summary
This chapter has built a comprehensive NLP toolkit for prediction market trading, progressing from foundational text processing through modern deep learning to the frontier of LLM-based forecasting.
Key technical contributions:
- Text preprocessing (Section 24.2): A configurable pipeline that handles the unique challenges of prediction market text, including news articles, social media, and political/financial language.
- Classical NLP (Section 24.3): TF-IDF representations paired with logistic regression provide fast, interpretable baselines for text classification. The n-gram features and sublinear TF scaling are particularly important for domain-specific text.
- Sentiment analysis (Section 24.4): Multiple sentiment scoring methods (VADER, TextBlob, transformers) with domain-specific lexicon augmentation. The ensemble approach reduces the risk of any single method failing on unusual text.
- Transformer models (Sections 24.5-24.6): BERT-based classification with fine-tuning on market-relevant text. The attention mechanism provides superior context understanding, and the pre-training/fine-tuning paradigm enables strong performance with limited labeled data.
- News impact analysis (Section 24.7): Event study methodology adapted for prediction markets, measuring how news events affect prices and quantifying the surprise content of news.
- Real-time monitoring (Section 24.8): An end-to-end pipeline from RSS feeds through sentiment scoring to alerting, enabling timely action on NLP-derived signals.
- Entity and topic extraction (Section 24.9): NER connects text to specific markets, while topic modeling tracks the evolving information environment around market questions.
- Feature engineering (Section 24.10): A systematic framework for generating NLP features (sentiment, volume, topic, entity, novelty) that integrate with the tabular models from Chapter 23.
- LLM forecasting (Section 24.11): Direct use of large language models for probability estimation, including prompt engineering, calibration evaluation, and practical deployment considerations.
The unified perspective: Text data is the raw material from which prediction market prices are ultimately derived. By systematically extracting signal from text -- whether through simple lexicon-based sentiment or sophisticated transformer models -- we can identify information before it is fully reflected in prices. The NLP features developed in this chapter complement the market-derived features of Chapter 23 to create a more complete picture of the information environment.
What's Next
In Chapter 25, we will explore Time Series and Sequential Models, building on the features developed here and in Chapter 23 to model the temporal dynamics of prediction markets. We will cover autoregressive models, recurrent neural networks, and state-space models, showing how to capture the time-dependent patterns in both market data and NLP-derived features. The combination of NLP features from this chapter with the sequential models of Chapter 25 will form the foundation of a sophisticated prediction market trading system.