Chapter 24 Exercises

Section 24.2: Text Preprocessing

Exercise 1: Custom Preprocessing Pipeline

Write a text preprocessing function that handles the following edge cases commonly found in prediction market text:

- Percentage values ("The candidate leads by 5.3%") should be preserved as "5.3_percent"
- Dollar amounts ("$1.5 billion") should be normalized to "1.5_billion_dollars"
- Political party abbreviations (GOP, DNC, RNC) should be expanded to full names
- State abbreviations (CA, TX, FL) should be expanded to full state names

Test your function on at least 5 sample sentences from political news.
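
A minimal starting point, assuming plain Python with the standard re module; the substitution patterns and expansion dictionaries below are illustrative and would need to be extended for real coverage:

import re

PARTIES = {"GOP": "Republican Party", "DNC": "Democratic National Committee",
           "RNC": "Republican National Committee"}
STATES = {"CA": "California", "TX": "Texas", "FL": "Florida"}  # extend to all 50 states

def preprocess(text):
    # "$1.5 billion" -> "1.5_billion_dollars"
    text = re.sub(r"\$(\d+(?:\.\d+)?)\s+(million|billion|trillion)",
                  r"\1_\2_dollars", text, flags=re.IGNORECASE)
    # "5.3%" -> "5.3_percent"
    text = re.sub(r"(\d+(?:\.\d+)?)%", r"\1_percent", text)
    # Expand party and state abbreviations on word boundaries only.
    for abbr, full in {**PARTIES, **STATES}.items():
        text = re.sub(rf"\b{abbr}\b", full, text)
    return text

print(preprocess("The GOP candidate leads by 5.3% in CA after a $1.5 billion ad buy."))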

Exercise 2: Tokenizer Comparison

Compare word tokenization, sentence tokenization, and subword tokenization (using the BERT tokenizer) on the following text:

"The Fed's decision to raise rates by 0.25% surprised markets.
Polymarket's 'Will the Fed raise rates?' contract jumped from
$0.35 to $0.78 in minutes."

For each tokenizer, count the number of tokens produced and identify any tokens that seem problematic (e.g., splitting important terms, merging separate concepts).
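
A sketch of the comparison, assuming nltk and the Hugging Face transformers package (the NLTK punkt resources must be downloaded once; newer NLTK versions also need punkt_tab):

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from transformers import AutoTokenizer

nltk.download("punkt", quiet=True)

text = ("The Fed's decision to raise rates by 0.25% surprised markets. "
        "Polymarket's 'Will the Fed raise rates?' contract jumped from "
        "$0.35 to $0.78 in minutes.")

word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)
subword_tokens = AutoTokenizer.from_pretrained("bert-base-uncased").tokenize(text)

for name, tokens in [("word", word_tokens), ("sentence", sentence_tokens),
                     ("BERT subword", subword_tokens)]:
    print(f"{name}: {len(tokens)} tokens -> {tokens}")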

Exercise 3: Stopword Analysis

Using the NLTK English stopwords list, analyze 100 news headlines about prediction markets. Calculate:

- What percentage of words in a typical headline are stopwords?
- Which stopwords appear most frequently?
- Identify at least 3 cases where removing stopwords would change the meaning of a headline (e.g., "not" in "Fed will not raise rates").

Modify the stopword list to be appropriate for prediction market text analysis.
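
A possible starting point, assuming nltk; the set of negation and modal terms to keep is an illustrative choice to refine:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

standard = set(stopwords.words("english"))
# Keep words that carry meaning in market headlines (negations, modals).
keep = {"not", "no", "nor", "against", "won't", "wouldn't", "shouldn't", "will"}
custom_stopwords = standard - keep

headline = "Fed will not raise rates"
tokens = headline.lower().split()
print(f"Stopword share: {sum(t in standard for t in tokens) / len(tokens):.0%}")
print("Kept with custom list:", [t for t in tokens if t not in custom_stopwords])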

Section 24.3: Classical NLP

Exercise 4: TF-IDF Feature Analysis

Collect or create a dataset of 50 news headlines (25 positive and 25 negative for a specific prediction market question). Compute TF-IDF features with unigrams and bigrams. Answer:

- What are the top 10 most informative features (by TF-IDF weight) for positive articles?
- What are the top 10 most informative features for negative articles?
- Plot a heatmap of the TF-IDF values for the top 20 features across all 50 documents.
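
A minimal sketch of the feature-extraction step, assuming scikit-learn; the four inline headlines are placeholders to replace with your 50 labeled examples:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

headlines = [
    "Candidate surges after strong debate performance",   # positive
    "Poll shows commanding lead in key swing state",       # positive
    "Campaign rocked by fundraising scandal",               # negative
    "Support collapses after indictment news",              # negative
]
labels = np.array([1, 1, 0, 0])

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(headlines)
features = vectorizer.get_feature_names_out()

# Mean TF-IDF weight of each feature within the positive class.
pos_mean = np.asarray(X[labels == 1].mean(axis=0)).ravel()
print("Top positive features:", features[np.argsort(pos_mean)[::-1][:10]])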

Exercise 5: N-gram Analysis

Implement a function that computes the most frequent n-grams (n=1,2,3) in a corpus of prediction market news text. Run it on a corpus related to a specific market question and identify:

- The top 20 unigrams, bigrams, and trigrams
- Which n-grams are most useful for sentiment classification and why
- Whether bigrams or trigrams provide additional information beyond unigrams for this domain
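
One way to implement the counting step with only the standard library; tokenization here is a naive lowercase whitespace split, so plug in your preprocessing from Exercise 1 for real use:

from collections import Counter

def top_ngrams(docs, n, k=20):
    # Count n-grams across all documents (each doc is a raw text string).
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

corpus = ["The Fed will raise rates", "The Fed will not raise rates in March"]
for n in (1, 2, 3):
    print(n, top_ngrams(corpus, n, k=5))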

Exercise 6: TF-IDF Classifier Comparison

Build three text classifiers using TF-IDF features:

1. Logistic Regression
2. Naive Bayes (MultinomialNB)
3. Support Vector Machine (LinearSVC)

Compare their performance on a sentiment classification task using 5-fold cross-validation. Report accuracy, precision, recall, and F1 for each. Which performs best and why?
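
A sketch of the comparison loop, assuming scikit-learn; texts and labels are placeholders for your labeled sentiment data:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

texts = [...]   # documents
labels = [...]  # 0/1 sentiment labels

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
}
scoring = ["accuracy", "precision", "recall", "f1"]

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_validate(pipe, texts, labels, cv=5, scoring=scoring)
    print(name, {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring})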

Exercise 7: Document Similarity

Using TF-IDF vectors and cosine similarity, build a system that, given a new news article, finds the 5 most similar articles from a historical corpus. Test it on political news articles and evaluate whether the retrieved articles are genuinely similar in topic and sentiment.
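
A compact retrieval sketch using scikit-learn's cosine_similarity; the three inline articles are placeholders for your historical corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Fed raises interest rates by a quarter point",
    "Senate passes the spending bill after a long debate",
    "Markets rally as inflation data comes in below expectations",
]
new_article = "Federal Reserve hikes rates again amid inflation worries"

vectorizer = TfidfVectorizer(stop_words="english")
corpus_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([new_article])

similarities = cosine_similarity(query_vector, corpus_vectors).ravel()
for idx in similarities.argsort()[::-1][:5]:   # top 5 (or fewer) matches
    print(f"{similarities[idx]:.3f}  {corpus[idx]}")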

Section 24.4: Sentiment Analysis

Exercise 8: VADER vs. TextBlob

Analyze the following 10 sentences with both VADER and TextBlob:

1. "The candidate won in a landslide victory."
2. "Polls show a devastating collapse in support."
3. "The market remains relatively stable."
4. "BREAKING: Major scandal rocks the campaign!!!!"
5. "The candidate performed well, but the debate format was unfair."
6. "Not great, not terrible. An average performance."
7. "The policy is NOT going to help the economy."
8. "Analysts are cautiously optimistic about the outcome."
9. "This is the worst decision in modern political history."
10. "Sources say the deal is likely to go through."

Compare the scores and identify cases where the two methods disagree significantly. Which method handles each case better?
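
A scoring sketch, assuming the vaderSentiment and textblob packages; NLTK's bundled VADER implementation would work the same way:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

sentences = [
    "The candidate won in a landslide victory.",
    "The policy is NOT going to help the economy.",
    # ... add the remaining sentences from the exercise
]

vader = SentimentIntensityAnalyzer()
for s in sentences:
    vader_compound = vader.polarity_scores(s)["compound"]   # range -1 to +1
    blob_polarity = TextBlob(s).sentiment.polarity          # range -1 to +1
    flag = "DISAGREE" if vader_compound * blob_polarity < 0 else ""
    print(f"{vader_compound:+.2f}  {blob_polarity:+.2f}  {flag}  {s}")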

Exercise 9: Custom Sentiment Lexicon

Create a custom sentiment lexicon for prediction market text with at least 50 terms. Include:

- 20 positive terms (e.g., "landslide," "surge," "endorse")
- 20 negative terms (e.g., "scandal," "collapse," "indict")
- 10 modifier terms that amplify or reduce sentiment (e.g., "slightly," "massively," "unprecedented")

Implement a simple rule-based sentiment scorer that uses your lexicon and handles negation. Compare its performance to VADER on 20 prediction market headlines.
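
A minimal rule-based scorer to build on; the lexicon weights, modifier multipliers, three-token negation window, and example sentence are all illustrative assumptions, not the chapter's reference implementation:

LEXICON = {"landslide": 2.0, "surge": 1.5, "endorse": 1.0,
           "scandal": -2.0, "collapse": -2.0, "indict": -1.5}
MODIFIERS = {"slightly": 0.5, "massively": 2.0, "unprecedented": 1.5}
NEGATIONS = {"not", "no", "never", "n't"}

def lexicon_score(text):
    tokens = text.lower().replace("n't", " n't").split()
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = LEXICON[tok]
        for prev in tokens[max(0, i - 3):i]:   # look back up to 3 tokens
            if prev in MODIFIERS:
                value *= MODIFIERS[prev]
            if prev in NEGATIONS:
                value = -value                 # flip polarity under negation
        total += value
    return total

print(lexicon_score("Support did not collapse after the debate"))   # +2.0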

Exercise 10: Temporal Sentiment Aggregation

Given a time series of article sentiments:

timestamps = [...]  # hourly timestamps for 7 days
sentiments = [...]  # sentiment score for each article

Implement three aggregation methods:

1. Simple moving average (window = 6 hours)
2. Exponential moving average (alpha = 0.3)
3. Volume-weighted average (weight by number of articles per hour)

Plot all three on the same chart and discuss which would be most useful as a trading feature.
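
A sketch of the three aggregations with pandas (and matplotlib for the plot), reusing the timestamps and sentiments lists from the exercise and assuming a recent pandas version:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"timestamp": pd.to_datetime(timestamps), "sentiment": sentiments})
hourly_mean = df.set_index("timestamp")["sentiment"].resample("1h").mean()
hourly_count = df.set_index("timestamp")["sentiment"].resample("1h").count()

sma = hourly_mean.rolling(window=6, min_periods=1).mean()
ema = hourly_mean.ewm(alpha=0.3).mean()
volume_weighted = (hourly_mean * hourly_count).rolling(6, min_periods=1).sum() \
    / hourly_count.rolling(6, min_periods=1).sum()

pd.DataFrame({"SMA (6h)": sma, "EMA (alpha=0.3)": ema,
              "Volume-weighted": volume_weighted}).plot()
plt.show()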

Section 24.5: Transformer Models

Exercise 11: Tokenizer Deep Dive

Using the BERT tokenizer (bert-base-uncased), tokenize the following prediction market-relevant sentences and examine the token IDs and attention masks:

1. "Will Bitcoin exceed $100,000 by December 2025?"
2. "The probability of a government shutdown is 35%."
3. "NATO expansion to include Finland and Sweden."

For each sentence, visualize the subword tokens and explain any unexpected tokenization choices. How does the tokenizer handle numbers, special characters, and proper nouns?
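
A sketch of the inspection loop, assuming the transformers package:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = [
    "Will Bitcoin exceed $100,000 by December 2025?",
    "The probability of a government shutdown is 35%.",
    "NATO expansion to include Finland and Sweden.",
]

for s in sentences:
    encoded = tokenizer(s)
    print(s)
    print("  tokens:        ", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
    print("  input_ids:     ", encoded["input_ids"])
    print("  attention_mask:", encoded["attention_mask"])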

Exercise 12: Embedding Comparison

Extract the [CLS] embeddings from BERT for the following pairs of sentences and compute their cosine similarity:

- ("The Fed raised rates", "Interest rates went up") -- semantically similar
- ("The Fed raised rates", "The Fed lowered rates") -- semantically opposite
- ("The Fed raised rates", "The dog chased the cat") -- unrelated

Do the cosine similarities match your intuition? What does this tell you about BERT's understanding of semantic similarity?
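
One way to extract the [CLS] vectors, assuming transformers and PyTorch:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def cls_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]   # hidden state of the [CLS] token

pairs = [("The Fed raised rates", "Interest rates went up"),
         ("The Fed raised rates", "The Fed lowered rates"),
         ("The Fed raised rates", "The dog chased the cat")]

for a, b in pairs:
    sim = torch.cosine_similarity(cls_embedding(a), cls_embedding(b), dim=0)
    print(f"{sim.item():.3f}  {a!r} vs. {b!r}")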

Exercise 13: Model Size vs. Performance

Compare the performance of three models of increasing size on a sentiment classification task:

1. DistilBERT (66M parameters)
2. BERT-base (110M parameters)
3. RoBERTa-base (125M parameters)

Measure accuracy, inference time per example, and memory usage. At what point do diminishing returns set in? Which model would you recommend for a real-time trading application?

Exercise 14: Zero-Shot Classification

Use a zero-shot classification model (e.g., facebook/bart-large-mnli) to classify prediction market headlines into categories without any training data. Define 5 categories relevant to prediction markets (e.g., "polling data," "scandal," "policy announcement," "endorsement," "economic indicator"). Evaluate the classifier on 20 headlines and report the accuracy.
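
A sketch using the transformers zero-shot pipeline; the candidate labels mirror the example categories above, and the headline is a made-up test case:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["polling data", "scandal", "policy announcement", "endorsement", "economic indicator"]

headline = "Former governor backs the incumbent ahead of the primary"
result = classifier(headline, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 3))   # top predicted category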

Section 24.6: Fine-Tuning

Exercise 15: Data Labeling Strategy

Design a data labeling strategy for fine-tuning a sentiment classifier on prediction market text. Specify:

- What data sources you would collect from
- How you would define the label scheme (binary, ternary, or more granular)
- How you would handle ambiguous cases
- Quality control measures (inter-annotator agreement, etc.)
- Target dataset size and estimated labeling cost/time

Exercise 16: Fine-Tuning with Limited Data

Starting with a pre-trained DistilBERT model, fine-tune a sentiment classifier using only 100 labeled examples (created manually or synthetically). Then incrementally add more data: 200, 500, 1000 examples. Plot the learning curve (test accuracy vs. training set size) and determine the minimum amount of data needed for acceptable performance (> 80% accuracy).

Exercise 17: Domain Adaptation

Implement a domain-adaptive pre-training approach:

1. Collect 10,000 unlabeled prediction market news articles (you can simulate this with political/financial news).
2. Continue pre-training a BERT model using masked language modeling on this corpus for 1 epoch.
3. Fine-tune the domain-adapted model on your labeled sentiment data.
4. Compare performance with a model fine-tuned without domain adaptation.

Report the improvement (if any) in classification accuracy.

Exercise 18: Hyperparameter Sensitivity

Using the fine-tuning pipeline from Section 24.6, systematically vary the following hyperparameters and measure their effect on test accuracy:

- Learning rate: [1e-5, 2e-5, 3e-5, 5e-5]
- Batch size: [8, 16, 32]
- Number of epochs: [2, 3, 5, 10]
- Max sequence length: [64, 128, 256, 512]

Identify which hyperparameter has the largest effect on performance.

Section 24.7: News Impact

Exercise 19: Event Study Implementation

Using historical prediction market data (real or simulated), implement a complete event study:

1. Identify 10 significant news events for a specific market.
2. Define event windows of [-1 hour, +24 hours].
3. Calculate abnormal price changes for each event.
4. Test whether the average abnormal change is statistically significant (use a t-test).
5. Visualize the average cumulative abnormal change over the event window.
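
A sketch of the significance test in step 4, assuming SciPy; abnormal_changes should hold one abnormal price change per event (observed change minus the change expected from a pre-event baseline), and the numbers below are placeholders to replace with your computed values:

import numpy as np
from scipy import stats

abnormal_changes = np.array([0.04, 0.07, -0.01, 0.05, 0.09, 0.02, 0.06, -0.02, 0.03, 0.08])

t_stat, p_value = stats.ttest_1samp(abnormal_changes, popmean=0.0)
print(f"mean abnormal change = {abnormal_changes.mean():.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.3f}")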

Exercise 20: News Surprise Metric

Implement and compare three methods for measuring news surprise:

1. TF-IDF cosine distance from recent articles
2. Perplexity of the headline under a language model (higher perplexity = more surprising)
3. Magnitude of market price change in the 5 minutes after publication

Compute the correlation between each surprise metric and the 1-hour market impact. Which metric is the best predictor of market-moving news?

Exercise 21: Overreaction Detection

Build a detector for market overreactions to news:

1. Identify cases where a news event causes a large immediate price movement (> 5%).
2. Track the price over the following 24 hours.
3. Calculate the reversal ratio for each event.
4. Determine whether overreactions are more common for positive or negative news.
5. Propose a trading strategy that exploits overreactions and backtest it.
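
One reasonable way to define the reversal ratio in step 3 (the exercise leaves the exact formula open): the fraction of the immediate move that is given back over the following 24 hours.

def reversal_ratio(pre_price, post_news_price, price_24h_later):
    # 0 = no reversal, 1 = full reversal back to the pre-event price.
    immediate_move = post_news_price - pre_price
    if immediate_move == 0:
        return 0.0
    return (post_news_price - price_24h_later) / immediate_move

print(reversal_ratio(0.40, 0.55, 0.46))   # 0.6 -> 60% of the jump was reversed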

Section 24.8: Real-Time Monitoring

Exercise 22: RSS Feed Monitor

Build a working RSS feed monitor that:

1. Polls at least 3 real news RSS feeds every 60 seconds.
2. Filters articles by relevance to a specific prediction market topic.
3. Scores sentiment on each relevant article using VADER.
4. Prints a formatted summary including headline, source, timestamp, and sentiment score.
5. Maintains a running average of sentiment over the past hour.

Run the monitor for at least 30 minutes and document the output.
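
A skeleton of the polling loop, assuming the feedparser and vaderSentiment packages; the feed URL and keyword filter are placeholders, and the one-hour running average (item 5) is left for you to add:

import time
import feedparser
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

FEEDS = ["https://example.com/politics.rss"]   # replace with at least 3 real feeds
KEYWORDS = {"fed", "rates", "election"}          # relevance filter for your market
analyzer = SentimentIntensityAnalyzer()
seen = set()

while True:
    for url in FEEDS:
        for entry in feedparser.parse(url).entries:
            if entry.link in seen:
                continue
            seen.add(entry.link)
            if not any(k in entry.title.lower() for k in KEYWORDS):
                continue
            score = analyzer.polarity_scores(entry.title)["compound"]
            print(f"[{entry.get('published', 'n/a')}] {url} | {score:+.2f} | {entry.title}")
    time.sleep(60)   # poll every 60 seconds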

Exercise 23: Alert System Design

Design and implement an alert system for a prediction market trader that generates alerts based on:

1. Sentiment spike: Average sentiment moves more than 2 standard deviations from the rolling mean.
2. Volume spike: Number of articles in the past hour exceeds 3x the daily average.
3. New entity: A previously unseen entity (person/organization) appears in relevant news.
4. Sentiment divergence: Sentiment from social media diverges from sentiment from mainstream news.

Test your system with simulated or real data and provide example alerts for each type.
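
A sketch of the first rule (sentiment spike) with pandas; the other three rules follow the same pattern of comparing a live value against a rolling baseline:

import pandas as pd

def sentiment_spike_alerts(hourly_sentiment: pd.Series, window: int = 24, threshold: float = 2.0):
    # Flag hours where sentiment sits more than `threshold` standard deviations
    # from the trailing rolling mean (shifted so the baseline excludes the current hour).
    rolling_mean = hourly_sentiment.rolling(window).mean().shift(1)
    rolling_std = hourly_sentiment.rolling(window).std().shift(1)
    z = (hourly_sentiment - rolling_mean) / rolling_std
    return hourly_sentiment[z.abs() > threshold]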

Section 24.9: NER and Topics

Exercise 24: NER for Market Routing

Build a system that uses NER to automatically route news articles to relevant prediction markets. Given a list of 10 prediction market questions and a corpus of 50 news articles, use NER to:

1. Extract entities from each article.
2. Match entities to relevant markets.
3. Evaluate the routing accuracy by manually checking whether articles were correctly matched.

Report the precision and recall of your routing system.
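
A sketch of steps 1 and 2 with spaCy (the en_core_web_sm model must be downloaded first); the two markets and their keyword sets are placeholders, and the overlap heuristic is deliberately simple:

import spacy

nlp = spacy.load("en_core_web_sm")   # python -m spacy download en_core_web_sm

markets = {
    "Will the Fed raise rates in March?": {"fed", "federal reserve"},
    "Will Finland join NATO this year?": {"finland", "nato"},
}

def route(article_text):
    entities = {ent.text.lower() for ent in nlp(article_text).ents}
    # Match a market if any keyword overlaps any extracted entity string.
    return [question for question, keywords in markets.items()
            if any(k in ent or ent in k for ent in entities for k in keywords)]

print(route("The Federal Reserve signaled another rate hike at its March meeting."))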

Exercise 25: Topic Evolution Tracking

Using LDA topic modeling, analyze a corpus of news articles over time (at least 30 days, real or simulated). For each day:

1. Compute the topic distribution.
2. Identify the dominant topic.
3. Track how topic distributions change day-to-day.

Visualize the topic evolution as a stacked area chart. Identify any significant topic shifts and discuss how they might relate to prediction market price movements.
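
A sketch of fitting LDA and reading per-document topic distributions with scikit-learn; docs is a placeholder for your dated article texts, and the per-day aggregation and stacked area chart are left to the exercise:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [...]   # article texts, one per article, each with a known publication date

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(counts)    # shape: (n_documents, n_topics)

# Top words per topic, useful for labeling the chart.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    print(f"Topic {k}:", ", ".join(terms[weights.argsort()[::-1][:8]]))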

Section 24.10: Feature Engineering

Exercise 26: Complete Feature Pipeline

Build a complete NLP feature pipeline that, given a stream of articles and a prediction market, generates a feature vector at each time step. Your feature vector should include at least:

- 5 sentiment features
- 3 volume features
- 3 topic features
- 2 entity features
- 2 novelty features

Generate features for a 30-day period and analyze the correlation between each feature and subsequent market price changes. Which features have the strongest predictive power?

Exercise 27: Feature Importance Analysis

Train a gradient boosted tree model (XGBoost or LightGBM) using a combination of NLP features (from this chapter) and market features (price, volume, spread). Use SHAP values to analyze feature importance:

1. Which NLP features are most important?
2. How do NLP features compare to market features in importance?
3. Are there interaction effects between NLP and market features?

Visualize the SHAP summary plot and interpret the results.
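
A sketch of the SHAP analysis, assuming the xgboost and shap packages; X is a placeholder DataFrame of combined NLP and market features and y the subsequent price change:

import shap
import xgboost as xgb

model = xgb.XGBRegressor(n_estimators=300, max_depth=4)
model.fit(X, y)   # X: feature DataFrame, y: target (e.g., next-hour price change)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)   # global importance and direction of each feature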

Section 24.11: LLM Forecasting

Exercise 28: Prompt Engineering Experiment

Design 5 different prompting strategies for LLM probability estimation:

1. Simple direct question
2. Structured analysis (base rate, factors for/against)
3. Devil's advocate (argue both sides)
4. Reference class forecasting (find similar historical events)
5. Decomposition (break into sub-questions)

Test each strategy on 10 prediction market questions and compare the variance of estimates across strategies. Which strategy produces the most calibrated results?

Exercise 29: LLM vs. Market Comparison

Select 20 recently resolved prediction market questions for which you can obtain both:

- LLM probability estimates (generated before resolution)
- Market prices at the time the LLM estimates were generated

Compare the Brier scores of the LLM and the market. In which types of questions does the LLM outperform the market? In which does the market outperform? What patterns do you observe?
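
The Brier comparison itself is a few lines; a sketch with NumPy, where the three lists are placeholders for your 20 questions:

import numpy as np

def brier_score(probabilities, outcomes):
    # Mean squared error between forecast probabilities and 0/1 outcomes (lower is better).
    p = np.asarray(probabilities, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))

llm_probs = [...]       # LLM estimates at prediction time
market_prices = [...]   # market prices at the same moment
outcomes = [...]        # 1 if the event occurred, else 0

print("LLM Brier:   ", brier_score(llm_probs, outcomes))
print("Market Brier:", brier_score(market_prices, outcomes))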

Exercise 30: Ensemble Forecaster

Build an ensemble forecaster that combines:

1. Market price (from a prediction market)
2. LLM probability estimate
3. Sentiment-based probability (map sentiment to probability using logistic function)
4. Historical base rate

Use a simple weighted average or a logistic regression to combine these inputs. Evaluate the ensemble against each individual method on a set of resolved questions. Does the ensemble outperform the best individual method?
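
A sketch of both combination options on resolved questions, assuming scikit-learn; the input matrix, outcomes, and fixed weights are placeholders to replace or tune (and the learned weights should be evaluated out-of-sample):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [market_price, llm_estimate, sentiment_prob, base_rate] for one resolved question.
X = np.array(ensemble_inputs)   # shape (n_questions, 4), values in [0, 1]
y = np.array(outcomes)          # resolved 0/1 outcomes

# Option 1: fixed-weight average.
weights = np.array([0.5, 0.2, 0.2, 0.1])
simple_ensemble = X @ weights

# Option 2: logistic regression learns the combination weights.
stacker = LogisticRegression()
stacker.fit(X, y)
learned_ensemble = stacker.predict_proba(X)[:, 1]

for name, p in [("simple average", simple_ensemble), ("logistic stacker", learned_ensemble)]:
    print(name, "Brier:", np.mean((p - y) ** 2))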