Chapter 24 Quiz

Question 1

What is the primary advantage of subword tokenization (used by BERT and GPT) over word-level tokenization?

A) It produces fewer tokens, reducing computational cost
B) It can handle out-of-vocabulary words by decomposing them into known subword pieces
C) It always preserves the original meaning of words better
D) It eliminates the need for any text preprocessing

Answer: B

Subword tokenization splits rare or unseen words into smaller, known pieces (e.g., "cryptocurrency" becomes "crypto" + "##currency"). This means the model can process words it has never seen during training by combining its understanding of the component parts, whereas word-level tokenization would simply assign an unknown token.
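
You can see this directly; a minimal sketch assuming the Hugging Face transformers package and the bert-base-uncased vocabulary:

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer; the exact pieces depend on the vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("cryptocurrency"))
# A rare word comes back as known subword pieces (a stem plus "##"-prefixed
# continuations) rather than a single unknown token.
```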


Question 2

When removing stopwords from prediction market text for sentiment analysis, which of the following words should typically be preserved?

A) "the" B) "is" C) "not" D) "and"

Answer: C

The word "not" is a negation word that is critical for sentiment analysis. "The bill will pass" and "The bill will NOT pass" have opposite sentiments and opposite implications for a prediction market. While "not" appears in many standard stopword lists, it should be preserved when the downstream task involves sentiment analysis.


Question 3

In the TF-IDF formula, what does the IDF (Inverse Document Frequency) component accomplish?

A) It increases the weight of words that appear in many documents
B) It decreases the weight of words that appear in many documents, highlighting distinctive terms
C) It normalizes word counts by document length only
D) It removes rare words from the vocabulary

Answer: B

IDF = log(N / number of documents containing the term). Words that appear in many documents get a low IDF value, which downweights them. Words that appear in few documents get a high IDF value, which upweights them. This highlights terms that are distinctive to specific documents rather than generic across the corpus.
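
A hand-rolled illustration of the downweighting, using only the standard library and a toy corpus:

```python
import math

corpus = [
    "the senate passed the bill",
    "the election results are in",
    "the bill faces a senate vote",
]
N = len(corpus)

def idf(term):
    # Document frequency: how many documents contain the term at least once.
    df = sum(term in doc.split() for doc in corpus)
    return math.log(N / df)

print(idf("the"))   # 0.0   -- appears in every document, carries no weight
print(idf("bill"))  # ~0.405 -- appears in 2 of 3 documents, more distinctive
```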


Question 4

VADER sentiment analysis produces a "compound" score. What is the range of this score?

A) 0 to 1
B) -1 to 1
C) 0 to 100
D) -100 to 100

Answer: B

The VADER compound score ranges from -1 (most negative) to +1 (most positive). It is computed as a normalized sum of the valence scores of all words in the text. Scores near 0 indicate neutral sentiment, while scores near -1 or +1 indicate strong negative or positive sentiment.
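
A minimal usage sketch, assuming the vaderSentiment package:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The polls look terrible for the incumbent.")
print(scores["compound"])  # a value in [-1, 1]; negative for this sentence
```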


Question 5

Why are generic sentiment lexicons often inadequate for prediction market text?

A) Prediction market text is always written in formal language
B) Many words have domain-specific sentiment that differs from their general sentiment (e.g., "aggressive" is negative generally but can be positive in policy contexts)
C) Prediction market text contains no sentiment-bearing words
D) Lexicons cannot handle text longer than 100 words

Answer: B

Domain-specific language creates a mismatch between general sentiment lexicons and the actual sentiment in specialized text. For example, "volatile" is neutral in general usage but negative in financial contexts. "Conservative" can be positive (conservative estimate) or politically charged depending on context. Custom or augmented lexicons are needed for accurate sentiment scoring of prediction market text.
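
One lightweight remedy is to extend the lexicon in place. A sketch assuming the vaderSentiment package; the words and valence values below are illustrative assumptions, not a validated domain lexicon:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Overlay domain-specific valences on top of the general-purpose lexicon.
analyzer.lexicon.update({
    "volatile": -1.5,   # neutral in general English, negative in finance
    "landslide": 2.0,   # strongly positive for the leading candidate
})
print(analyzer.polarity_scores("Markets turned volatile overnight."))
```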


Question 6

In the transformer attention mechanism, what is the purpose of dividing by sqrt(d_k) in the formula Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V?

A) To make the computation faster
B) To prevent the dot products from becoming too large, which would push the softmax into regions with very small gradients
C) To normalize the output to be between 0 and 1
D) To reduce the number of parameters in the model

Answer: B

When the dimensionality d_k is large, the dot products QK^T can become very large in magnitude. This pushes the softmax function into regions where it has extremely small gradients, making training difficult (the gradients vanish). Dividing by sqrt(d_k) scales the dot products to a reasonable range, keeping the softmax in a well-behaved region.
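
A minimal NumPy sketch of the scaled dot-product:

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling keeps the softmax well-behaved
    # Row-wise softmax (shift by the max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.random.randn(4, 64)
K = np.random.randn(4, 64)
V = np.random.randn(4, 64)
print(attention(Q, K, V).shape)  # (4, 64)
```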


Question 7

What does BERT's "masked language modeling" pre-training task involve?

A) Removing entire sentences and predicting them
B) Randomly masking 15% of input tokens and training the model to predict the masked tokens from context
C) Training the model to generate text from left to right
D) Classifying documents into predefined categories

Answer: B

In masked language modeling, 15% of the input tokens are selected for prediction; most of these are replaced with a [MASK] token, while a small fraction are swapped for random tokens or left unchanged. The model must predict the original token using the surrounding bidirectional context. This forces BERT to learn deep contextual representations of language, as it must understand the meaning of the surrounding text to fill in the blanks.
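
A quick demonstration, assuming the transformers fill-mask pipeline:

```python
from transformers import pipeline

# Ask BERT to fill a masked token from bidirectional context.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The bill will [MASK] the Senate."):
    print(pred["token_str"], round(pred["score"], 3))
```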


Question 8

When fine-tuning BERT for a prediction market sentiment classifier, why is a small learning rate (2e-5 to 5e-5) recommended?

A) Larger learning rates would cause the model to train too slowly
B) Small learning rates preserve the valuable pre-trained representations while allowing adaptation to the new task
C) BERT cannot be trained with learning rates above 1e-4
D) Small learning rates prevent the model from learning the new task

Answer: B

BERT's pre-trained weights encode valuable linguistic knowledge learned from billions of words. A large learning rate would dramatically modify these weights during fine-tuning, destroying the pre-trained knowledge (catastrophic forgetting). A small learning rate allows the model to gently adapt its representations to the new task while preserving most of the general language understanding.
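
A typical configuration sketch, assuming the transformers Trainer API (model and dataset setup omitted):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-pm-sentiment",
    learning_rate=2e-5,              # small: adapt without catastrophic forgetting
    num_train_epochs=3,              # fine-tuning typically needs only a few epochs
    per_device_train_batch_size=16,
    warmup_ratio=0.1,                # gentle ramp-up further protects pre-trained weights
)
```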


Question 9

Which of the following is NOT a valid approach for creating labeled training data for a prediction market sentiment classifier?

A) Labeling articles based on subsequent market price movements
B) Manual annotation by domain experts
C) Using LLM-assisted labeling with quality checks
D) Using the model's own predictions as training labels (without any external validation)

Answer: D

Using a model's own predictions as training labels is circular reasoning and would not improve the model. The other approaches all provide external signal: market prices reflect real-world information, manual annotation provides human judgment, and LLM-assisted labeling (when quality-checked) provides reasonably accurate labels from a different system. Approach D would simply reinforce the model's existing biases.


Question 10

In news impact analysis for prediction markets, what is the "abnormal price change"?

A) Any price change greater than 5%
B) The difference between the actual price change and the expected price change in the absence of the news event
C) The total price change over a 24-hour period
D) The price change caused by market manipulation

Answer: B

The abnormal price change is the actual price change minus the expected price change: Delta_p_abnormal = Delta_p_actual - Delta_p_expected. For prediction markets, the expected price change in the absence of news is typically zero (since prices should be a martingale), making the abnormal change approximately equal to the actual change around a news event.


Question 11

What does the "reversal ratio" measure in news impact analysis?

A) The speed at which news spreads across markets
B) How much of the initial price reaction to news is reversed over the following hours, indicating potential overreaction
C) The number of articles about a topic that are later retracted
D) The probability that a news event's impact is permanent

Answer: B

The reversal ratio is calculated as (price_24h - price_1h) / (price_1h - price_before). A reversal ratio near -1 means the initial 1-hour impact was fully reversed over the next 23 hours (overreaction). A reversal ratio near 0 means the impact was sustained. This metric is useful for identifying overreaction patterns that can be exploited through contrarian trading strategies.
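
A minimal sketch of the calculation (a production version should guard against a zero initial move):

```python
def reversal_ratio(price_before, price_1h, price_24h):
    initial_move = price_1h - price_before
    return (price_24h - price_1h) / initial_move

# Full reversal: price jumps 0.50 -> 0.60, then falls back to 0.50.
print(reversal_ratio(0.50, 0.60, 0.50))  # -1.0 (overreaction)
# Sustained impact: price jumps and stays.
print(reversal_ratio(0.50, 0.60, 0.60))  # 0.0
```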


Question 12

When measuring "news surprise," the TF-IDF cosine novelty metric computes:

A) The average similarity between all pairs of articles
B) One minus the maximum cosine similarity between the new article and all recent articles
C) The number of new words introduced by the article
D) The sentiment of the article divided by the average sentiment

Answer: B

Novelty(d) = 1 - max(cos(v_d, v_d')) for all d' in recent articles. If the new article is very similar to at least one recent article (high cosine similarity), the novelty is low. If the new article is unlike anything recently published (low cosine similarity to all recent articles), the novelty is high. High-novelty articles are more likely to represent surprising news that will move markets.
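
A minimal sketch assuming scikit-learn, with toy articles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

recent = [
    "Senate schedules vote on the budget bill",
    "Budget bill vote expected this week in the Senate",
]
new_article = "Central bank makes surprise emergency rate cut"

# Vectorize the recent window and the new article in one vocabulary.
vectors = TfidfVectorizer().fit_transform(recent + [new_article])
sims = cosine_similarity(vectors[-1], vectors[:-1])
novelty = 1 - sims.max()
print(round(novelty, 3))  # near 1.0: unlike anything in the recent window
```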


Question 13

In a real-time news monitoring pipeline, what is the purpose of maintaining a set of "seen article IDs"?

A) To track which articles have been profitable to trade on
B) To prevent processing the same article multiple times when RSS feeds are polled repeatedly
C) To prioritize newer articles over older ones
D) To limit the total number of articles processed per day

Answer: B

RSS feeds return the most recent articles each time they are polled. Without tracking seen article IDs, the system would re-process the same articles on every poll cycle, leading to duplicate sentiment scores, inflated article counts, and potentially false volume alerts. The seen ID set ensures each article is processed exactly once.
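
A sketch of the dedup step, assuming the feedparser package; the feed URL and the process() handler are placeholders:

```python
import time
import feedparser  # third-party RSS/Atom parser

seen_ids = set()

def process(entry):
    # Hypothetical downstream handler (sentiment scoring, routing, ...).
    print("new article:", entry.get("title"))

def poll(feed_url):
    for entry in feedparser.parse(feed_url).entries:
        article_id = entry.get("id") or entry.get("link")
        if article_id in seen_ids:
            continue              # already handled on an earlier poll cycle
        seen_ids.add(article_id)
        process(entry)

while True:
    poll("https://example.com/feed.xml")  # placeholder feed URL
    time.sleep(60)                        # poll interval in seconds
```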


Question 14

Which NER entity type would be most relevant for routing a news article to a "Will Sweden join NATO?" prediction market?

A) PERSON
B) MONEY
C) GPE (Geo-Political Entity)
D) DATE

Answer: C

GPE (Geo-Political Entity) identifies countries, states, and cities. "Sweden" and "NATO" would be recognized as GPE and ORG entities respectively. For routing to geopolitical prediction markets, GPE entities are the most directly relevant, as they identify the countries and regions involved in the market question.
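
A quick check, assuming spaCy with the en_core_web_sm model installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sweden moved a step closer to joining NATO on Monday.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expected roughly: [('Sweden', 'GPE'), ('NATO', 'ORG'), ('Monday', 'DATE')]
```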


Question 15

What is the key difference between stemming and lemmatization?

A) Stemming is more accurate but slower; lemmatization is faster but less accurate
B) Stemming strips suffixes using rules and may produce non-words; lemmatization uses linguistic knowledge to produce valid dictionary forms
C) Stemming works on all languages; lemmatization only works on English
D) There is no meaningful difference; they produce identical results

Answer: B

Stemming applies rule-based suffix stripping (e.g., Porter stemmer) and can produce non-words like "presidenti" from "presidential." Lemmatization uses linguistic knowledge (part-of-speech, morphological analysis) to produce valid dictionary forms like "good" from "better" or "election" from "elections." Lemmatization is more accurate but slower.
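
A side-by-side sketch assuming NLTK (the lemmatizer needs the 'wordnet' corpus downloaded once):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("presidential"))             # 'presidenti' -- not a word
print(lemmatizer.lemmatize("elections"))        # 'election'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' -- needs the POS hint
```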


Question 16

In LDA topic modeling, what does the hyperparameter alpha control?

A) The number of topics
B) The sparsity of the document-topic distribution (smaller alpha = documents use fewer topics)
C) The learning rate during training
D) The maximum number of words per topic

Answer: B

Alpha is the concentration parameter of the Dirichlet prior on the document-topic distribution. A smaller alpha encourages documents to be associated with fewer topics (sparser distribution), while a larger alpha encourages documents to be about many topics (more uniform distribution). For prediction market text, a small alpha is often appropriate since individual articles typically focus on one or two topics.
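
In scikit-learn this prior is exposed as doc_topic_prior; a minimal sketch on toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["senate vote budget bill", "election polls swing state",
        "rate cut central bank", "budget vote delayed again"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,
    doc_topic_prior=0.1,   # small alpha: each article concentrates on few topics
    random_state=0,
)
doc_topics = lda.fit_transform(X)
print(doc_topics.round(2))  # rows are near-sparse document-topic distributions
```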


Question 17

When building NLP features for trading models, why is "sentiment momentum" (change in average sentiment from 24h ago to now) potentially more useful than raw sentiment?

A) It is faster to compute
B) It captures the direction and rate of change in information, which may predict future price movements better than the level of sentiment
C) It eliminates all noise from the sentiment signal
D) It is always positively correlated with market price

Answer: B

Raw sentiment level can be informative, but markets may already reflect a sustained positive or negative sentiment environment. Sentiment momentum captures whether the information environment is improving or deteriorating relative to the recent past. A shift from neutral to positive sentiment may be more predictive of future price increases than a consistently positive sentiment level, because the shift represents new information not yet fully priced in.
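
A minimal sketch of the computation, assuming pandas and an illustrative score series:

```python
import pandas as pd

# Hypothetical sentiment scores sampled every 6 hours over 48 hours.
scores = pd.Series(
    [0.05, 0.02, 0.04, 0.01, 0.03, 0.10, 0.25, 0.35, 0.40],
    index=pd.date_range("2024-01-01", periods=9, freq="6h"),
)
cutoff = scores.index[-1] - pd.Timedelta("24h")
momentum = scores[scores.index > cutoff].mean() - scores[scores.index <= cutoff].mean()
print(round(momentum, 3))  # positive: the information environment is improving
```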


Question 18

What is the primary limitation of using LLMs as forecasters for prediction markets?

A) LLMs cannot output numbers between 0 and 1
B) LLMs have a knowledge cutoff date and may not know about recent events that are critical for current market questions
C) LLMs always predict 50% for every question
D) LLMs require more training data than any other method

Answer: B

LLMs are trained on data up to a certain date (their knowledge cutoff). For prediction markets that depend heavily on current information (e.g., "Will this candidate win the election next month?"), the LLM may be missing weeks or months of crucial developments. While you can provide context in the prompt, the LLM's underlying world model may not properly integrate this new information with its training data.


Question 19

In the ensemble sentiment scoring approach combining VADER and TextBlob, why might VADER receive more weight (e.g., 0.6 vs. 0.4)?

A) VADER is always more accurate than TextBlob
B) VADER was specifically designed for social media text and handles features like capitalization, punctuation, and emoticons that are common in prediction market-related social media posts
C) TextBlob is deprecated and should not be used
D) VADER is faster to compute

Answer: B

VADER (Valence Aware Dictionary and sEntiment Reasoner) was specifically designed for social media text and includes rules for handling capitalization (ALL CAPS amplifies sentiment), punctuation (exclamation marks amplify), degree modifiers ("extremely"), and negation. Since much prediction market-relevant text comes from social media or informal news sources, VADER's design gives it an advantage in this domain.
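
A sketch of the weighted ensemble, assuming the vaderSentiment and textblob packages; the 0.6/0.4 weights follow the question:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

analyzer = SentimentIntensityAnalyzer()

def ensemble_sentiment(text, w_vader=0.6, w_blob=0.4):
    vader = analyzer.polarity_scores(text)["compound"]  # in [-1, 1]
    blob = TextBlob(text).sentiment.polarity            # in [-1, 1]
    return w_vader * vader + w_blob * blob

# VADER's social-media rules pick up the caps and exclamation marks here.
print(ensemble_sentiment("HUGE win for the campaign!!!"))
```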


Question 20

When using a zero-shot classification model for prediction market text, what is the key advantage over a fine-tuned model?

A) It is always more accurate
B) It requires no labeled training data and can classify text into arbitrary categories defined at inference time
C) It runs faster than any fine-tuned model
D) It can handle longer documents

Answer: B

Zero-shot classification allows you to define categories at inference time without any training examples. For prediction markets, this is valuable because new market questions arise frequently, and you may need to classify text relevant to a brand-new topic immediately, before you have time to collect and label training data. The trade-off is that zero-shot classifiers are generally less accurate than fine-tuned models for well-defined tasks.
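
A minimal sketch assuming the transformers package and an NLI-based zero-shot model such as facebook/bart-large-mnli:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Categories are defined on the fly -- no labeled training data required.
result = classifier(
    "Parliament ratified the accession protocol this morning.",
    candidate_labels=["NATO expansion", "central bank policy", "US election"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```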


Question 21

What does the "sublinear_tf=True" parameter do in scikit-learn's TfidfVectorizer?

A) It removes terms with low frequency
B) It applies 1 + log(tf) instead of raw term frequency, preventing very long documents from dominating
C) It limits the vocabulary size
D) It enables GPU acceleration

Answer: B

With sublinear_tf=True, the term frequency is replaced by 1 + log(tf). This means that doubling the number of occurrences of a word does not double its weight -- instead, the relationship is logarithmic. This prevents long documents (which naturally have higher word counts) from having disproportionately large feature values, making the TF-IDF representation more robust across documents of varying lengths.
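
A small sketch of the effect on toy documents (weights are L2-normalized by default, so compare relative magnitudes):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["vote vote vote vote bill", "vote bill"]
for sublinear in (False, True):
    vec = TfidfVectorizer(sublinear_tf=sublinear)
    X = vec.fit_transform(docs)
    idx = vec.vocabulary_["vote"]
    print(sublinear, round(X[0, idx], 3))
# With sublinear_tf=True the four repetitions of "vote" contribute
# 1 + log(4) ~= 2.39 instead of 4, shrinking their dominance.
```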


Question 22

In the context of prediction market NLP, what is "aspect-specific sentiment"?

A) Sentiment computed separately for each sentence in a document
B) Sentiment toward specific entities or topics mentioned in the text, rather than the overall sentiment of the document
C) Sentiment that only considers adjectives and adverbs
D) Sentiment computed using only financial terms

Answer: B

Aspect-specific sentiment analyzes sentiment toward particular aspects or entities within a text. For example, "The candidate's economic policy is strong, but their foreign policy is weak" has positive sentiment toward economic policy and negative sentiment toward foreign policy. For prediction markets, this allows you to extract the sentiment specifically relevant to your market question, even from articles that discuss multiple topics.
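
A naive sentence-level approximation (a crude stand-in for true aspect-based sentiment models), assuming the vaderSentiment package:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
text = ("The candidate's economic policy is strong. "
        "But their foreign policy is weak.")

def aspect_sentiment(text, aspect):
    # Score only sentences that mention the aspect of interest.
    sentences = [s for s in text.split(". ") if aspect in s.lower()]
    if not sentences:
        return None
    return sum(analyzer.polarity_scores(s)["compound"] for s in sentences) / len(sentences)

print(aspect_sentiment(text, "economic"))  # positive
print(aspect_sentiment(text, "foreign"))   # negative
```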


Question 23

Why is the exponential moving average (EMA) often preferred over a simple moving average for sentiment aggregation in trading applications?

A) EMA is easier to compute
B) EMA gives more weight to recent observations, making it more responsive to sudden sentiment shifts that may be relevant for trading
C) EMA always produces more accurate predictions
D) EMA uses less memory

Answer: B

The EMA formula S_t = alpha * s_t + (1 - alpha) * S_{t-1} gives exponentially decaying weights to older observations. This means recent sentiment shifts are reflected quickly in the aggregated score, while older data still contributes but with diminishing influence. For trading, this responsiveness is important because a sudden sentiment shift (e.g., from a breaking news event) should be quickly reflected in your trading signals.
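
A minimal sketch of the update rule applied to a stream of scores:

```python
def ema_stream(scores, alpha=0.3):
    smoothed = None
    for s in scores:
        # S_t = alpha * s_t + (1 - alpha) * S_{t-1}
        smoothed = s if smoothed is None else alpha * s + (1 - alpha) * smoothed
        yield smoothed

# A sudden sentiment shift shows up quickly but is still damped:
print(list(ema_stream([0.0, 0.0, 0.0, 0.8, 0.8])))
# [0.0, 0.0, 0.0, 0.24, 0.408]
```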


Question 24

When using multiple prompt versions for LLM forecasting and taking a trimmed mean of the resulting probabilities, what is the purpose of trimming?

A) To reduce the total number of API calls
B) To remove the most extreme estimates that may result from prompt sensitivity or LLM inconsistency, producing a more robust aggregate estimate
C) To ensure the final probability is always 50%
D) To make the computation faster

Answer: B

LLM probability estimates can vary significantly based on prompt wording, and occasionally an estimate may be unreasonably extreme due to the stochastic nature of LLM generation. By removing the highest and lowest estimates (trimming), the remaining estimates are more tightly clustered around the LLM's "true" central tendency, producing a more robust and reliable aggregate forecast.
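
A sketch using SciPy's trim_mean; the probabilities below are illustrative:

```python
from scipy import stats

# Probability estimates from five prompt variants; one outlier from a flaky prompt.
estimates = [0.62, 0.58, 0.65, 0.60, 0.95]
robust = stats.trim_mean(estimates, proportiontocut=0.2)  # drop top/bottom 20%
print(round(robust, 3))  # ~0.623, insulated from the 0.95 outlier
```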


Question 25

In the NLP feature generator, what does "source diversity" (number of unique sources covering a topic) capture that article count alone does not?

A) The total word count across all articles
B) Whether the information is being reported broadly across the media ecosystem (corroboration) or coming from a single source (potentially less reliable)
C) The average quality of the articles
D) Whether the articles are in English

Answer: B

A high article count from a single source may indicate that one outlet is running many stories on a topic, but this does not necessarily indicate broad significance. High source diversity (many different outlets covering the same topic) suggests that the information is being independently corroborated across the media ecosystem, which typically indicates higher reliability and greater market significance. Source diversity is an important signal of how widely an event is perceived as newsworthy.
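
A minimal sketch contrasting the two signals on illustrative article records:

```python
from collections import defaultdict

articles = [
    {"topic": "nato", "source": "reuters"},
    {"topic": "nato", "source": "ap"},
    {"topic": "nato", "source": "bbc"},
    {"topic": "fed",  "source": "blog-x"},
    {"topic": "fed",  "source": "blog-x"},
    {"topic": "fed",  "source": "blog-x"},
]

sources = defaultdict(set)
counts = defaultdict(int)
for a in articles:
    sources[a["topic"]].add(a["source"])
    counts[a["topic"]] += 1

for topic in counts:
    print(topic, "articles:", counts[topic], "unique sources:", len(sources[topic]))
# Same article count (3) for both topics, but "nato" is corroborated by
# three outlets while "fed" comes from a single source.
```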