Chapter 22 Exercises: Natural Language Processing for Misinformation Detection
Instructions
Exercises are organized by section. Problems marked (Coding) require Python. (Conceptual) problems require written analysis. (Research) problems require consulting external sources. For coding exercises, use the implementations from the chapter's example code files as starting points.
Part A: NLP for Social Good (Section 22.1)
Exercise 22.1 — (Conceptual) What Can and Cannot Be Automated
For each of the following misinformation detection tasks, evaluate whether it is well-suited to automation, partially suited, or poorly suited. Justify your reasoning with reference to specific technical limitations.
(a) Detecting whether a news headline uses clickbait language patterns
(b) Determining whether a claim about vaccine efficacy is scientifically accurate
(c) Identifying whether two different news articles are reporting on the same underlying event
(d) Detecting coordinated networks of fake accounts posting similar content
(e) Evaluating whether a political claim is "misleading" (as opposed to false)
(f) Matching a new conspiracy theory claim to existing fact-check verdicts
(g) Determining the intent of a piece of content (disinformation vs. misinformation vs. satire)
Exercise 22.2 — (Conceptual) The Human Oversight Continuum
Design a graduated content moderation system for a hypothetical social media platform that uses NLP-based misinformation detection at four severity levels:
(a) Level 1 (0–30% confidence the content is misinformation): What should happen?
(b) Level 2 (30–60% confidence): What should happen?
(c) Level 3 (60–80% confidence): What should happen?
(d) Level 4 (>80% confidence): What should happen?
For each level, specify: whether human review is required, what the user sees, whether the content is reduced/removed, and what the user's recourse is. Justify each design choice in terms of the false positive / false negative trade-off.
Part B: Text Preprocessing (Section 22.2)
Exercise 22.3 — (Coding) Implement a Text Preprocessing Pipeline
Write a Python class MisinformationPreprocessor that:
(a) Accepts raw text and returns a preprocessed string.
(b) Performs the following transformations in order:
1. URL removal (replace with [URL] token)
2. HTML tag removal
3. Lowercase conversion
4. Mention/hashtag extraction (return as metadata, not removed)
5. Punctuation normalization (collapse multiple !!!! to single !)
6. Tokenization using NLTK or spaCy
7. Stopword removal (with a configurable list)
8. Lemmatization
(c) Before and after applying the preprocessor, count: unique tokens, type-token ratio, average token length.
(d) Apply your preprocessor to the following five texts and show the before/after comparison:
- A news article headline (your choice)
- A tweet with hashtags and mentions
- A text with ALL CAPS words
- A text with URLs and HTML fragments
- A satire headline from a known satire site
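As a starting point, the regex-based steps (1, 2, 3, 5) plus a crude tokenizer and stopword filter can be sketched with the standard library alone. The class name matches the exercise, but the tiny default stopword list is a placeholder, and real tokenization (step 6) and lemmatization (step 8) still require NLTK or spaCy:

```python
import re

class MisinformationPreprocessor:
    """Regex-only sketch of steps 1-5 and 7; swap in NLTK/spaCy for
    real tokenization (step 6) and lemmatization (step 8)."""

    def __init__(self, stopwords=None):
        # toy default list; the exercise asks for a configurable one
        self.stopwords = set(stopwords or {"the", "a", "an", "of", "to", "is"})

    def __call__(self, text):
        # 4. extract mentions/hashtags as metadata before any rewriting
        meta = {"mentions": re.findall(r"@\w+", text),
                "hashtags": re.findall(r"#\w+", text)}
        text = re.sub(r"https?://\S+", "[URL]", text)  # 1. URLs -> [URL] token
        text = re.sub(r"<[^>]+>", " ", text)           # 2. drop HTML tags
        text = text.lower()                            # 3. (lowercases [URL] too)
        text = re.sub(r"([!?.])\1+", r"\1", text)      # 5. collapse !!!! -> !
        tokens = re.findall(r"\[url\]|[#@]?\w+|[!?.]", text)     # 6. crude tokenizer
        tokens = [t for t in tokens if t not in self.stopwords]  # 7. stopwords
        return " ".join(tokens), meta
```

A full solution would replace the `re.findall` tokenizer with `nltk.word_tokenize` or a spaCy pipeline and add a lemmatizer such as NLTK's `WordNetLemmatizer`.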
Exercise 22.4 — (Coding) Tokenization Comparison
Compare three tokenization approaches on five texts representing different linguistic styles:
(a) Simple whitespace split
(b) NLTK word_tokenize()
(c) A character n-gram approach (extract all 3-grams and 4-grams)
For each text and approach, report:
- Number of tokens
- Vocabulary size (unique tokens)
- How the approach handles: contractions, hyphenated words, emoticons, URLs, hashtags
Discuss which approach you would choose for: (1) fake news classification on news articles, (2) misinformation detection on tweets, (3) a model that must handle misspellings.
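Approach (c) is the least familiar of the three; a minimal sketch of overlapping character n-gram extraction (the function name is ours) shows why it tolerates misspellings:

```python
def char_ngrams(text, sizes=(3, 4)):
    """Return all overlapping character n-grams of the given sizes.
    Misspellings change only a few n-grams, so 'vaccine' and the
    misspelling 'vacine' still share most of their 3-grams."""
    text = text.lower()
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]
```

For example, `char_ngrams("vaccine", sizes=(3,))` yields five 3-grams, three of which also appear in the misspelling "vacine".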
Exercise 22.5 — (Conceptual) Stopword Removal and Misinformation
Standard English stopword lists include words like "not", "never", "always", "every", "no".
(a) Explain why these words might be important for misinformation detection and should NOT be removed.
(b) Give three specific examples of misinformation-relevant phrases that would be destroyed by stopword removal.
(c) Propose a modified stopword list for the specific task of misinformation detection that retains the words you identified in (a) while still reducing noise.
(d) Some researchers have proposed task-adaptive stopword selection using mutual information between words and class labels. Explain how this approach works conceptually.
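For part (d), the idea can be made concrete with a toy sketch (the helper name and the four-document corpus in the example are ours): mutual information is estimated from the joint distribution of word presence and class label, and words with near-zero MI are candidates for removal, while words like "not" may carry substantial label information.

```python
import math
from collections import Counter

def word_label_mi(docs, labels, word):
    """Estimate mutual information (in bits) between a word's presence
    and the class label. Low-MI words are task-specific stopword
    candidates; high-MI words should be kept."""
    n = len(docs)
    joint = Counter((word in d.split(), y) for d, y in zip(docs, labels))
    p_word = Counter(word in d.split() for d in docs)
    p_label = Counter(labels)
    mi = 0.0
    for (w, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((p_word[w] / n) * (p_label[y] / n)))
    return mi
```

On a toy corpus where "not" perfectly separates the classes, it scores 1 bit, whereas a word that appears equally in both classes scores 0.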
Part C: Feature Engineering (Section 22.3)
Exercise 22.6 — (Coding) TF-IDF Feature Analysis
Using scikit-learn's TfidfVectorizer and a corpus of at least 20 documents (use synthetic data or the LIAR dataset sample):
(a) Fit a TF-IDF vectorizer on the corpus with vocabulary size 1,000.
(b) For each document, identify the 10 highest TF-IDF terms.
(c) Compare the top TF-IDF terms for documents labeled "true" vs. "false" in your corpus. What patterns do you observe?
(d) Re-run with bigrams (ngram_range=(1,2)). Do bigrams capture patterns that unigrams miss?
(e) Create a scatter plot of the first two principal components of the TF-IDF matrix, colored by label. Do the classes separate visually?
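Before reaching for TfidfVectorizer, it helps to compute the score by hand once. This stdlib sketch uses the raw tf × log(N/df) definition; note that scikit-learn smooths the idf and L2-normalizes rows by default, so its numbers will differ even though the term rankings are similar:

```python
import math
from collections import Counter

def tfidf(docs):
    """Raw tf-idf scores per document: tf(t, d) * log(N / df(t)).
    A term appearing in every document gets idf log(N/N) = 0."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return scores
```

On a three-document corpus, a term unique to one document (high idf) outranks a term shared across documents, which is exactly the pattern part (c) asks you to look for.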
Exercise 22.7 — (Coding) Stylometric Feature Extraction
Write a function extract_stylometric_features(text: str) -> dict that computes:
(a) Fraction of ALL CAPS words (words where all alphabetic characters are uppercase)
(b) Count of exclamation marks per 100 words
(c) Count of question marks per 100 words
(d) Type-token ratio (unique words / total words)
(e) Average sentence length in words
(f) Flesch reading-ease score (formula: 206.835 - 1.015 × avg_words_per_sentence - 84.6 × avg_syllables_per_word)
(g) Sentiment polarity score using TextBlob or VADER
(h) Count of hedging words ("allegedly", "reportedly", "claimed", "said to be", "supposedly")
(i) Count of certainty words ("definitely", "absolutely", "certainly", "always", "never", "proven")
(j) Citation-like patterns (mentions of "study", "research", "according to", "researchers")
Apply this to 10 text examples you consider credible vs. 10 you consider non-credible. Report means for each class. Which features differ most between classes?
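A partial sketch covering the stdlib-computable features ((a), (b), (d), (e), (f), (h), (i)); sentiment (g) needs TextBlob or VADER, and the syllable counter here is a crude vowel-group heuristic, so treat the readability value as approximate:

```python
import re

HEDGES = {"allegedly", "reportedly", "claimed", "supposedly"}
CERTAINTY = {"definitely", "absolutely", "certainly", "always", "never", "proven"}

def _syllables(word):
    # crude heuristic: count vowel groups (good enough for a sketch)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def extract_stylometric_features(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(words) or 1
    wps = n / (len(sentences) or 1)                       # words per sentence
    spw = sum(_syllables(w) for w in words) / n           # syllables per word
    lower = [w.lower() for w in words]
    return {
        "caps_fraction": sum(w.isupper() and len(w) > 1 for w in words) / n,
        "exclaims_per_100": 100 * text.count("!") / n,
        "type_token_ratio": len(set(lower)) / n,
        "avg_sentence_len": wps,
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "hedge_count": sum(w in HEDGES for w in lower),
        "certainty_count": sum(w in CERTAINTY for w in lower),
    }
```

The word lists above are abbreviated from the exercise; extend them (and add the multi-word patterns like "said to be" and "according to") for the full solution.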
Exercise 22.8 — (Conceptual) Feature Engineering Trade-offs
For each feature engineering approach below, evaluate: (1) what information it captures, (2) what misinformation signals it might detect, (3) what it misses, and (4) what false signals it might produce.
(a) TF-IDF unigrams
(b) Sentiment polarity score
(c) Source credibility score (publisher reputation rating)
(d) Social engagement features (shares, likes, comments)
(e) Named entity frequency (number of named people, organizations, locations)
(f) Temporal features (time of posting, day of week)
Part D: Classical ML Approaches (Section 22.4)
Exercise 22.9 — (Coding) Naive Bayes Classifier for Fake News
Using a synthetic or real labeled dataset (at minimum 500 examples, binary labels):
(a) Preprocess text using your pipeline from Exercise 22.3.
(b) Vectorize using TF-IDF with unigrams and bigrams.
(c) Train a Multinomial Naive Bayes classifier.
(d) Evaluate using 5-fold cross-validation reporting: accuracy, precision, recall, F1 for each class.
(e) Generate and visualize a confusion matrix.
(f) List the top 20 most discriminative features (words) for each class using feature_log_prob_.
(g) Discuss: do the top features make intuitive sense for misinformation detection? What artifacts or spurious patterns do you see?
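The scikit-learn pipeline for (a)-(f) is only a few lines; to demystify part (f)'s feature_log_prob_, here is a from-scratch multinomial Naive Bayes with Laplace smoothing over whitespace token counts (the class name and the toy corpus in the example are ours; the exercise itself should still use sklearn's MultinomialNB):

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.
    The per-class log-probability tables built in fit() correspond to
    what sklearn's MultinomialNB stores in feature_log_prob_."""

    def fit(self, docs, labels):
        self.vocab = {t for d in docs for t in d.split()}
        counts = defaultdict(Counter)
        for d, y in zip(docs, labels):
            counts[y].update(d.split())
        class_n = Counter(labels)
        self.log_prior = {y: math.log(c / len(docs)) for y, c in class_n.items()}
        self.log_prob = {}
        for y, cnt in counts.items():
            total = sum(cnt.values()) + len(self.vocab)  # smoothed denominator
            self.log_prob[y] = {t: math.log((cnt[t] + 1) / total)
                                for t in self.vocab}
        return self

    def predict(self, doc):
        def score(y):
            return self.log_prior[y] + sum(
                self.log_prob[y][t] for t in doc.split() if t in self.vocab)
        return max(self.log_prob, key=score)
```

Sorting a class's `log_prob` table (or sklearn's `feature_log_prob_` row) gives exactly the "most discriminative features" list that part (f) asks for.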
Exercise 22.10 — (Coding) SVM vs. Naive Bayes Comparison
Extend Exercise 22.9 to compare:
(a) Multinomial Naive Bayes (from 22.9)
(b) LinearSVC with TF-IDF features
(c) LinearSVC with TF-IDF + stylometric features (from 22.7)
For each model:
- Report accuracy, macro F1, and class-specific F1
- Plot ROC curves (for binary classification)
- Report training and inference time
(d) Is there a consistent winner? Under what conditions might Naive Bayes be preferred over SVM despite lower accuracy?
Exercise 22.11 — (Coding) Random Forest with Mixed Features
Using the same dataset:
(a) Create a feature matrix that combines TF-IDF vectors (reduced to 100 dimensions via PCA) with stylometric features (from 22.7).
(b) Train a Random Forest classifier with 200 trees.
(c) Plot feature importances. Are text-based or stylometric features more important?
(d) Compare performance to the SVM models from 22.10.
(e) Explain why Random Forest typically performs worse than SVM on pure TF-IDF features but may do better with mixed feature types.
Exercise 22.12 — (Conceptual) LIAR Dataset Analysis
Research the LIAR dataset (Wang, 2017):
(a) What is the distribution of the six truth labels? Is the dataset balanced or imbalanced? What effect does this have on accuracy metrics?
(b) What metadata is available alongside the text? List at least five metadata fields.
(c) Why might a model that uses only metadata (no text) achieve competitive accuracy on LIAR? What does this imply about what text-only models are actually learning?
(d) LIAR contains statements from 2007–2016. How might temporal drift affect a model trained on LIAR and deployed in 2024?
(e) LIAR's labels are produced by PolitiFact, a single US-based political fact-checking organization. What systematic biases might this introduce into the dataset?
Part E: Word Embeddings (Section 22.5)
Exercise 22.13 — (Coding) Word2Vec Exploration
Train a Word2Vec model on a news corpus (at minimum 1,000 articles; use any available news dataset):
(a) Train with Skip-gram and window size 5, dimension 100.
(b) Find the 10 most similar words to: "vaccine", "misinformation", "election", "fake", "evidence".
(c) Test the analogy task: king - man + woman = ? ; doctor - man + woman = ?
(d) Visualize 50 selected words in 2D using UMAP or t-SNE. Do semantically related clusters emerge?
(e) Compare results when you use pre-trained GloVe vectors (glove.6B.100d). How do the similarity lists differ?
Exercise 22.14 — (Coding) Document Embedding for Claim Matching
Given a small set of 10 fact-checked claims (you may use real PolitiFact claims):
(a) Compute document embeddings by averaging word embeddings of each claim.
(b) Compute pairwise cosine similarities between all 45 pairs.
(c) For a new query claim, find the most similar existing claim by cosine similarity.
(d) Implement a simple retrieval system: given a query, return the top-3 most similar claims with their similarity scores and fact-check verdicts.
(e) Evaluate on 5 test queries: does the retrieved similar claim have the same or different truth label? What does this tell you about the faithfulness of embedding similarity as a proxy for truth-label similarity?
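Parts (a)-(c) reduce to two small functions. The toy 2-dimensional "embeddings" below are hand-made stand-ins for trained vectors, chosen only so the geometry is easy to check by eye:

```python
import math

def mean_pool(tokens, vectors):
    """(a) Average the available word vectors into a document vector."""
    hits = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(hits) for dim in zip(*hits)]

def cosine(u, v):
    """(b) Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# hand-made toy vectors (assumption: stand-ins for real trained embeddings)
vecs = {"vaccines": [1.0, 0.1], "cause": [0.9, 0.3], "autism": [0.8, 0.2],
        "taxes": [0.1, 1.0], "rose": [0.0, 0.9]}
```

Part (c) is then a `max` over `cosine(query_vec, claim_vec)` across the indexed claims, each pooled with `mean_pool`.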
Exercise 22.15 — (Conceptual) Bias in Word Embeddings
Bolukbasi et al. (2016) demonstrated that Word2Vec embeddings trained on Google News encode gender stereotypes.
(a) What specific test did they use to demonstrate this bias? Describe the methodology.
(b) How might biased word embeddings affect a misinformation detection system that uses them as features?
(c) Debiasing techniques for word embeddings include "hard debiasing" and "soft debiasing." Describe each conceptually.
(d) Is debiasing word embeddings sufficient to remove bias from a downstream classifier? What other sources of bias would remain?
Part F: Transformer Models (Section 22.6)
Exercise 22.16 — (Coding) Fine-Tuning DistilBERT for Text Classification
Using HuggingFace Transformers:
(a) Load DistilBERT (distilbert-base-uncased) and its tokenizer.
(b) Create a small labeled dataset (200 examples minimum) for binary classification (real/fake news).
(c) Tokenize the dataset using the DistilBERT tokenizer with max_length=256.
(d) Fine-tune for 3 epochs using the HuggingFace Trainer API with:
- Learning rate: 2e-5
- Batch size: 16
- Warmup steps: 50
(e) Evaluate on a held-out test set and report accuracy, precision, recall, F1.
(f) Compare to Logistic Regression + TF-IDF on the same data split. By how much does fine-tuned DistilBERT improve over TF-IDF?
Exercise 22.17 — (Conceptual) Understanding Self-Attention
The self-attention mechanism is core to transformer models.
(a) Explain in your own words what a "query," "key," and "value" are in the self-attention mechanism. Use an analogy to explain the computation.
(b) Why does multi-head attention (using multiple attention heads in parallel) capture richer information than single-head attention?
(c) How does BERT's bidirectional attention differ from GPT's unidirectional (left-to-right) attention? Which architecture is more appropriate for text classification, and why?
(d) A BERT-base model has 12 attention layers, each with 12 heads. What is this model learning at different layers? (Hint: research "what do attention heads learn?")
(e) The self-attention computation has O(n²) complexity where n is sequence length. Why is this a limitation for very long documents (e.g., 10,000-word news articles)?
Exercise 22.18 — (Coding) BERT Attention Visualization
Using the bertviz library or manual attention extraction:
(a) Load a pre-trained BERT model and extract attention weights for an example sentence.
(b) Visualize the attention pattern for a misinformation-relevant sentence (e.g., "Scientists have found no evidence that vaccines cause autism").
(c) Identify which tokens attend to which other tokens in the first, middle, and last BERT layers.
(d) What linguistic patterns are captured by attention in each layer? (Subject-verb, negation, entity references?)
(e) Discuss: can attention weights be used as explanations for BERT's classification decisions? What are the limitations of attention-based explanations?
Part G: Claim Verification Pipeline (Section 22.7)
Exercise 22.19 — (Coding) Build a Simple Claim Matching System
Using pre-computed sentence embeddings (use the sentence-transformers library, e.g., 'all-MiniLM-L6-v2'):
(a) Create an index of 20 fact-checked claims (use any public fact-checking database).
(b) For each claim, store: claim text, embedding, verdict (true/false/mixed), source URL.
(c) Implement a function find_similar_claims(query: str, top_k: int = 3) -> list that returns the top-k most similar indexed claims.
(d) Test on 10 novel claims and assess: (1) whether the retrieved claims are topically relevant, (2) whether the truth labels of retrieved claims match the query's actual truth label.
(e) What recall@3 do you achieve — how often is the correct matching claim in the top 3 results?
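Part (e)'s metric is worth pinning down precisely. A small helper (the names are ours), where `gold` maps each test query to the id of its correct matching claim and `retrieved` holds each query's ranked result list:

```python
def recall_at_k(retrieved, gold, k=3):
    """Fraction of queries whose correct claim id appears in the top-k
    results. `retrieved` maps query id -> ranked list of claim ids;
    `gold` maps query id -> the correct claim id."""
    hits = sum(gold[q] in ranked[:k] for q, ranked in retrieved.items())
    return hits / len(retrieved)
```

Reporting recall@1 alongside recall@3 shows how often the correct claim is not merely retrieved but ranked first.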
Exercise 22.20 — (Conceptual) Stance Detection and Its Limits
(a) Define stance detection formally as a classification problem. What are the input(s), output(s), and label set?
(b) The Fake News Challenge (FNC-1) dataset defines four stance labels: agree, disagree, discuss, unrelated. Why is "discuss" an important label? What does it mean for a document to "discuss" a claim without agreeing or disagreeing?
(c) A stance detection model achieves 75% accuracy on the FNC-1 test set. Is this good? Research the distribution of FNC-1 labels and calculate what accuracy a majority-class baseline would achieve.
(d) How does stance detection differ from fact-checking? Can a document discuss a false claim while the document itself is factually accurate?
(e) Describe a realistic scenario where a stance detection model could be misused.
Exercise 22.21 — (Conceptual) FEVER Benchmark Analysis
(a) How were the FEVER claims generated? What mutation operations were used to create false claims?
(b) What fraction of FEVER claims are SUPPORTED, REFUTED, and NOT ENOUGH INFORMATION? How does this distribution affect evaluation?
(c) Explain the "multi-hop" problem in FEVER: why are some claims particularly hard to verify?
(d) The FEVER 2.0 task allows adversarial claim generation — humans try to write claims that fool a given model. Describe how this changes the benchmark dynamics and what it reveals about model brittleness.
(e) FEVER uses Wikipedia as the sole evidence source. List five categories of real-world misinformation that FEVER-trained systems could not evaluate.
Part H: Limitations and Ethics (Sections 22.8–22.9)
Exercise 22.22 — (Coding) Adversarial Example Generation
Demonstrate the vulnerability of a text classifier to adversarial attacks:
(a) Train a TF-IDF + Logistic Regression classifier on your fake news dataset from earlier exercises.
(b) For 10 correctly-classified "fake news" examples, modify each in three ways:
- Synonym substitution for the top 3 discriminative words
- Random character insertion (add one extra letter per word in 5 random words)
- Simple paraphrase (manually rewrite in different syntactic form)
(c) For each modification, test whether the classifier changes its prediction.
(d) Report the adversarial success rate (fraction of modifications that changed the prediction).
(e) What does this exercise reveal about what the classifier is actually learning?
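The character-insertion perturbation in (b) can be sketched directly (the function name and alphabet are ours; a seeded `random.Random` keeps the attack reproducible across runs):

```python
import random

def char_insert_attack(text, n_words=5, seed=0):
    """Insert one random letter into each of n_words randomly chosen
    words: a crude way to push tokens out of a TF-IDF vocabulary
    while leaving the text readable to humans."""
    rng = random.Random(seed)
    words = text.split()
    for i in rng.sample(range(len(words)), min(n_words, len(words))):
        w = words[i]
        pos = rng.randrange(len(w) + 1)  # insertion point, incl. word edges
        words[i] = w[:pos] + rng.choice("abcdefghijklmnopqrstuvwxyz") + w[pos:]
    return " ".join(words)
```

Run the classifier on the perturbed text and count prediction flips to get the success rate in (d).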
Exercise 22.23 — (Coding) Label Leakage Investigation
Demonstrate the label leakage problem:
(a) Using a fake news dataset where articles come from labeled sources (reliable vs. unreliable publishers), train a classifier using ONLY the domain/URL of the article (no article text).
(b) Report accuracy of the domain-only classifier.
(c) Train a text-only classifier (no domain information).
(d) Compare performance. If the domain-only classifier performs comparably to text-only, what does this suggest about what the text classifier may be learning?
(e) Propose a test to diagnose whether a model is learning content-based vs. source-based features. How would you design a dataset that forces content-based learning?
Exercise 22.24 — (Conceptual) Dataset Bias Audit
Propose a systematic audit of a fake news classification dataset:
(a) What distributional properties would you measure to characterize potential biases? List at least eight properties.
(b) How would you test for topic imbalance between real and fake news classes?
(c) How would you test for annotation artifacts (systematic lexical differences introduced by the labeling process rather than the true content difference)?
(d) Design a "stress test" for a classifier trained on this dataset: what distribution-shifted test sets would reveal the boundaries of what the model has actually learned?
Exercise 22.25 — (Conceptual) False Positive / False Negative Ethics
A large social media platform deploys an automated system that flags political posts containing potential misinformation, adding a warning label. The system has:
- 85% precision (15% of flagged posts are not actually misinformation)
- 70% recall (30% of actual misinformation goes unflagged)
The platform processes 10 million political posts per week; assume 30% of them actually contain misinformation (precision and recall alone do not determine the counts below, so a prevalence must be assumed).
(a) Calculate the number of correctly flagged misinformation posts per week.
(b) Calculate the number of false positives (correct speech incorrectly labeled as misinformation) per week.
(c) Calculate the number of false negatives (actual misinformation missed) per week.
(d) A civil liberties organization argues that the weekly false positive count you computed in (b) is an unacceptable infringement on free expression. The platform argues that the weekly count of missed misinformation from (c) is an unacceptable public harm. How would you navigate this trade-off?
(e) Is there a precision/recall threshold that satisfies both concerns? If not, what non-technical interventions (policy, legal, institutional) might be needed?
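The arithmetic behind (a)-(c) can be sketched in a few lines. Note that precision and recall alone do not pin down absolute counts: the sketch below assumes a prevalence of 30% misinformation among political posts, which is our illustrative assumption, not a platform figure:

```python
def confusion_counts(total, prevalence, precision, recall):
    """Derive weekly confusion-matrix counts from precision and recall.
    `prevalence` is an assumed fraction of posts that actually contain
    misinformation; the two metrics alone do not determine the counts."""
    positives = total * prevalence   # posts that actually contain misinformation
    tp = recall * positives          # (a) misinformation correctly flagged
    flagged = tp / precision         # everything the system flags
    fp = flagged - tp                # (b) correct speech wrongly flagged
    fn = positives - tp              # (c) misinformation missed
    return {"tp": tp, "fp": fp, "fn": fn}

counts = confusion_counts(10_000_000, 0.30, 0.85, 0.70)
```

Under this assumption the system catches 2.1 million posts, misses 900,000, and wrongly flags roughly 370,000 per week; varying the prevalence argument shows how sensitive the trade-off in (d) is to that assumption.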
Exercise 22.26 — (Research) Audit a Real Platform's Automated Moderation
Research the publicly available documentation on ONE major social media platform's automated content moderation system:
(a) What types of content does the platform claim to automatically detect?
(b) What information does the platform publish about error rates? Is this information sufficient to evaluate the system's fairness?
(c) Have independent auditors (journalists, academics, NGOs) found evidence of systematic bias? Describe one documented case.
(d) What human review processes exist for contested moderation decisions?
(e) How does the platform handle appeals? What fraction of appeals result in reversal?
Exercise 22.27 — (Coding) Model Card for a Fake News Classifier
Write a "Model Card" (following Mitchell et al., 2019) for the classifier you built in Exercise 22.16:
(a) Model Details: Architecture, training framework, primary language, date.
(b) Intended Uses: Primary use cases, out-of-scope uses.
(c) Factors: Groups, instrumentation, environment affecting performance.
(d) Metrics: Accuracy, precision, recall, F1 on test set; disaggregated by topic if possible.
(e) Evaluation Data: Description of test set, how it was collected, known limitations.
(f) Training Data: Description, known biases.
(g) Ethical Considerations: Potential harms, false positive/negative trade-offs.
(h) Caveats and Recommendations: Who should and should not use this model.
Exercise 22.28 — (Research) The FEVER Shared Task
Research the FEVER shared task papers from 2018 and 2019:
(a) Describe the three-stage pipeline used by the top-performing system from FEVER 2018.
(b) What retrieval method did top systems use, and how did retrieval quality affect overall verification accuracy?
(c) The winning system from FEVER 2018 achieved approximately 64% FEVER score. What does this metric measure, and why is it more demanding than simple verdict accuracy?
(d) How did systems improve between FEVER 2018 and FEVER 2019? What architectural changes drove improvement?
(e) What categories of claims were most and least reliably verified? What does this tell us about the limitations of Wikipedia-based verification?
Exercise 22.29 — (Coding) End-to-End Misinformation Detection Pipeline
Combine components from previous exercises to build an end-to-end pipeline:
(a) Input: A piece of text (article or tweet)
(b) Stage 1: Preprocess using your pipeline from 22.3
(c) Stage 2: Extract stylometric features using your function from 22.7
(d) Stage 3: Compute sentence embedding using sentence-transformers
(e) Stage 4: Retrieve the top-3 most similar fact-checked claims (from 22.19)
(f) Stage 5: Run TF-IDF + LR classifier from 22.10
(g) Output: Credibility score (0–1), top similar fact-checked claims, key stylometric warnings
Test on at least five examples and present results in a structured report format.
Exercise 22.30 — (Conceptual) Capstone: Design a Responsible Misinformation Detection System
You have been hired to design a misinformation detection system for a major news aggregator that surfaces news articles from thousands of publishers. Write a 1,500-word technical and ethical design specification covering:
(a) Scope definition: What categories of content will and will not be in scope for automated analysis?
(b) Technical architecture: Which components (preprocessing, features, models) would you deploy, and why?
(c) Training data: What datasets would you use or collect? What biases would you audit for?
(d) Evaluation framework: Beyond accuracy, what metrics matter? How would you evaluate for fairness?
(e) Deployment safeguards: What human oversight mechanisms, appeals processes, and monitoring systems would you require?
(f) Transparency commitments: What would you disclose publicly about the system?
(g) What you would NOT do: List specific design choices you would explicitly reject and explain why.
(a) Scope definition: What categories of content will and will not be in scope for automated analysis? (b) Technical architecture: Which components (preprocessing, features, models) would you deploy, and why? (c) Training data: What datasets would you use or collect? What biases would you audit for? (d) Evaluation framework: Beyond accuracy, what metrics matter? How would you evaluate for fairness? (e) Deployment safeguards: What human oversight mechanisms, appeals processes, and monitoring systems would you require? (f) Transparency commitments: What would you disclose publicly about the system? (g) What you would NOT do: List specific design choices you would explicitly reject and explain why.