Chapter 27 Exercises

Exercise 27.1 — Preprocessing Choices and Their Effects (Individual, 45 minutes)

The preprocess_text() function from Section 27.3 accepts a method parameter that controls whether stemming or lemmatization is used. This exercise explores how that choice affects downstream analysis.

Part A: Preprocessing Comparison Apply the preprocessing function to the same 10 speech excerpts using (a) lemmatization and (b) stemming. For each method, print the first 20 tokens from each excerpt. Answer: 1. Which method produces more readable tokens? 2. Which method produces shorter, more compact tokens? 3. For a word frequency analysis where you want to group "economy" and "economic" together, which method is more appropriate? 4. For a classifier where you want maximum discriminating power regardless of readability, which might you prefer?

Part B: Stopword Sensitivity Remove the political_stopwords set from the preprocessing function (keeping only NLTK's default English stopwords). Re-run the top word analysis from Section 27.4.1. How does the list of top 20 words change? Which words appear in the original analysis but not the revised one, and vice versa? Were the political stopwords the right choice?

Part C: Write-Up Write 300–400 words explaining how preprocessing choices constitute methodological decisions — not technical details — and why they should be reported explicitly in any publication using computational text analysis.


Exercise 27.2 — VADER Validation (Pair, 60 minutes)

VADER's performance on political text may differ from its performance on social media text (the domain it was calibrated for). This exercise assesses its validity for the ODA speech corpus.

Part A: Manual Coding Select 30 speech excerpts at random from the ODA corpus. For each, independently assign a sentiment score on a -3 to +3 scale (very negative to very positive), where 0 = neutral. Average your two scores per excerpt.

Part B: Correlation with VADER Compute the Pearson correlation between your human codes and VADER's compound scores for the same 30 excerpts. Visualize the relationship with a scatter plot.
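One way to sketch Part B, with hypothetical score arrays standing in for your 30 excerpts (np.corrcoef returns the correlation matrix; its off-diagonal entry is Pearson's r):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: mean of the two coders' scores, and VADER compound scores.
human_codes = np.array([-2.5, -1.0, 0.0, 0.5, 1.5, 2.0, -0.5, 3.0])
vader_scores = np.array([-0.8, -0.4, 0.1, 0.2, 0.5, 0.7, -0.1, 0.9])

r = np.corrcoef(human_codes, vader_scores)[0, 1]
print(f"Pearson r = {r:.3f}")

# Scatter plot with a least-squares fit line.
slope, intercept = np.polyfit(human_codes, vader_scores, 1)
fig, ax = plt.subplots()
ax.scatter(human_codes, vader_scores)
ax.plot(human_codes, slope * human_codes + intercept)
ax.set_xlabel("Mean human code (-3 to +3)")
ax.set_ylabel("VADER compound score")
fig.savefig("vader_validation.png")
```

scipy.stats.pearsonr is an equivalent option that also returns a p-value, if SciPy is available.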

Part C: Disagreement Analysis Identify the five excerpts where VADER's score and your human code diverged most (measured by absolute difference). For each: 1. What feature of the text explains the divergence? 2. Is VADER over-scoring positively, over-scoring negatively, or failing to detect sentiment? 3. What category of text does VADER appear to handle poorly for political speech?

Part D: Implication Based on your validation, write a paragraph that would be appropriate for inclusion in a research paper as a "measurement limitations" note. Be specific about VADER's strengths and weaknesses as observed in this exercise.


Exercise 27.3 — Topic Modeling Sensitivity (Individual, 60 minutes)

LDA results are sensitive to the number of topics specified. This exercise explores that sensitivity.

Part A: Multi-k Comparison Run LDA with k = 4, 6, 8, 10, and 12 topics on the ODA speech corpus (using the same preprocessing and vectorizer settings as in Section 27.8). For each value of k, print the top 10 words for each topic.

Part B: Topic Labeling For each value of k, attempt to label each topic (e.g., "Economy," "Healthcare," "Immigration"). Document your reasoning for each label.

Part C: Coherence For each k, compute the perplexity score on a held-out 20% of the corpus (sklearn's LDA model exposes this via its .perplexity() method). Plot perplexity vs. k; lower perplexity indicates a better fit to the held-out data. Does the perplexity measure agree with your qualitative assessment of topic coherence?
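The multi-k loop can be sketched as follows on a toy corpus; substitute the ODA speeches and the chapter's vectorizer settings for the invented documents below:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in corpus with three rough themes (economy, healthcare, immigration).
docs = [
    "jobs economy taxes growth wages economy",
    "healthcare insurance hospitals coverage premiums",
    "border immigration visas asylum enforcement",
    "jobs wages factory economy trade taxes",
    "insurance coverage medicare hospitals doctors",
    "asylum border enforcement immigration courts",
    "growth taxes small business economy jobs",
    "doctors premiums healthcare medicare coverage",
    "visas courts border asylum immigration",
    "trade factory wages jobs growth economy",
]
train_docs, test_docs = train_test_split(docs, test_size=0.2, random_state=42)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

for k in [2, 3, 4]:
    lda = LatentDirichletAllocation(n_components=k, random_state=42, max_iter=20)
    lda.fit(X_train)
    print(f"k={k}: held-out perplexity = {lda.perplexity(X_test):.1f}")
```

Fitting the vectorizer on the training split only, then transforming the held-out split, keeps the perplexity comparison honest.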

Part D: Recommendation Write a 200-word justification for your recommended value of k for this corpus, integrating both the quantitative perplexity evidence and your qualitative assessment of topic interpretability.


Exercise 27.4 — Media Framing Analysis (Individual or Pair, 75 minutes)

Using the oda_media.csv dataset, conduct an extended analysis of media framing patterns.

Part A: Sentiment by Topic and Source Type Create a heatmap showing mean sentiment score by topic (rows) and source type (columns). Use seaborn.heatmap() or build the equivalent visualization manually with matplotlib. Which topic-source type combinations show the most extreme sentiment? The most neutral?

Part B: Temporal Sentiment Trends Plot the rolling 14-day average sentiment for each source type over the campaign period. Do different source types show different temporal dynamics? Are there periods where sentiment spikes across all source types (suggesting a shared external event)?
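The rolling average per source type can be sketched with pandas' time-based windows; the column names and synthetic data below are assumptions standing in for oda_media.csv:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2024-08-01", periods=60, freq="D")
df = pd.DataFrame({
    "date": np.tile(dates, 2),
    "source_type": ["national"] * 60 + ["local"] * 60,
    "sentiment": rng.normal(0.0, 0.3, 120),
})

# A "14D" window is time-based (14 calendar days), so irregular publication
# dates are handled correctly; min_periods defaults to 1 for offset windows.
rolled = (
    df.sort_values("date")
      .set_index("date")
      .groupby("source_type")["sentiment"]
      .rolling("14D")
      .mean()
)
print(rolled.groupby(level="source_type").head(2))
```

rolled carries a (source_type, date) MultiIndex; rolled.unstack(level=0).plot() draws one line per source type.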

Part C: Fact-Check Rating and Language For articles with fact-check ratings of "true" or "mostly_true" versus "false" or "mostly_false," extract the article excerpts and apply the VADER analysis and word frequency comparison from Sections 27.4 and 27.5. Are fact-checked-false articles more emotionally extreme? Which specific word patterns distinguish the two groups?

Part D: Research Brief Write a 500-word research brief summarizing your findings about media framing patterns in the ODA coverage dataset. Your brief should: (1) state your core finding, (2) present supporting evidence with specific numbers, (3) identify one alternative explanation for your main finding, and (4) note at least one limitation of the analysis.


Exercise 27.5 — Populism Score Analysis (Individual, 45 minutes)

The populism_score variable in oda_speeches.csv is generated by a validated scale measuring the degree of populist communication in political speech (high scores = more populist framing: people vs. elite, anti-establishment rhetoric, claims of popular sovereignty).

Part A: Descriptive Analysis Compute and visualize the distribution of populism_score by party, event type, and state. Which combinations show the highest and lowest populism scores?

Part B: Correlates Compute the correlation between populism_score and: (a) VADER compound score, (b) Flesch-Kincaid Grade Level, (c) word_count, (d) date (does populism increase as Election Day approaches?). Visualize each correlation with a scatter plot including a regression line.

Part C: High-Populism Language Using the top-words-by-subgroup analysis from Section 27.4.2, compare speeches in the top quartile of populism score to speeches in the bottom quartile. What specific words and phrases are most distinctively associated with high-populism speeches?
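The quartile split itself is a one-liner with pandas quantiles; a hedged sketch, with invented scores and pre-tokenized excerpts standing in for the real corpus:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "populism_score": [0.9, 0.8, 0.85, 0.1, 0.15, 0.2, 0.5, 0.55],
    "tokens": [
        ["people", "elites", "corrupt"], ["people", "real", "forgotten"],
        ["elites", "people", "rigged"], ["policy", "budget", "committee"],
        ["bipartisan", "reform", "policy"], ["budget", "appropriations", "policy"],
        ["jobs", "economy"], ["economy", "trade"],
    ],
})

# Quartile cutoffs, then token counts within each extreme group.
q1, q3 = df["populism_score"].quantile([0.25, 0.75])
top = df[df["populism_score"] >= q3]
bottom = df[df["populism_score"] <= q1]

for name, group in [("top quartile", name_group := top), ("bottom quartile", bottom)]:
    counts = Counter(tok for toks in group["tokens"] for tok in toks)
    print(name, counts.most_common(5))
```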

Part D: Interpretation Challenge VADER scores political language as having positive or negative sentiment. High-populism speeches often use negative language about elites and positive language about "the people." Does the VADER score of high-populism speeches tend to be positive (because of positive claims about ordinary people) or negative (because of negative claims about elites)? What does your finding tell you about how VADER handles populist rhetoric specifically?


Exercise 27.6 — Building and Evaluating the Classifier (Individual, 90 minutes)

The partisan language classifier in Section 27.10 uses Logistic Regression with TF-IDF features. This exercise extends and evaluates it.

Part A: Alternative Models Train two additional models on the same classification task: (a) a Naive Bayes classifier (MultinomialNB) and (b) a Random Forest classifier (RandomForestClassifier; you may need to convert the sparse TF-IDF matrix with .toarray()). Report cross-validated accuracy for all three classifiers. Which performs best?
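One way to sketch the three-model comparison on a toy labeled corpus; in the exercise, the texts are the ODA speech excerpts and the labels are party. The examples below are invented:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "cut taxes and shrink government", "secure the border now",
    "defend gun rights and freedom", "lower taxes for small business",
    "strong military and law enforcement", "repeal burdensome regulations",
    "expand healthcare for all families", "raise the minimum wage",
    "protect voting rights and unions", "invest in climate and clean energy",
    "fund public schools and teachers", "affordable childcare for workers",
]
labels = ["R"] * 6 + ["D"] * 6

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, clf in models.items():
    # Pipelining the vectorizer prevents test-fold vocabulary leakage.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3)
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```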

Part B: Confusion Matrix Interpretation For the Logistic Regression model, produce a confusion matrix. Which party's speeches is the model more likely to misclassify? Examine 5 specific misclassified excerpts. What features of the language led to misclassification? What does this tell you about the limits of party-as-label?
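A sketch of the matrix and the retrieval of misclassified excerpts, again on invented toy data in place of the speech corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

texts = [
    "cut taxes and shrink government", "secure the border now",
    "defend gun rights and freedom", "lower taxes for small business",
    "expand healthcare for all families", "raise the minimum wage",
    "protect voting rights and unions", "invest in clean energy",
]
labels = np.array(["R", "R", "R", "R", "D", "D", "D", "D"])

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_train), y_train)
pred = clf.predict(vec.transform(X_test))

# Rows are true labels, columns are predictions, in the order given by labels=.
print(confusion_matrix(y_test, pred, labels=["D", "R"]))

# Pull the misclassified excerpts for qualitative inspection.
for text, true, p in zip(X_test, y_test, pred):
    if true != p:
        print(f"misclassified ({true} -> {p}): {text}")
```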

Part C: Temporal Generalization Train the model on speeches from the first campaign cycle in the dataset and test it on speeches from the second campaign cycle. Does accuracy decline when the training and test data come from different time periods? What would explain a temporal decline in accuracy?

Part D: Ethical Statement Write a 200-word statement, as if you were Sam presenting this classifier to Adaeze, describing: (1) what the classifier can legitimately be used for, (2) what it should not be used for, and (3) what safeguards you would recommend before any public deployment.


Exercise 27.7 — Replication and Extension (Capstone Exercise, 2–3 hours)

Design and execute a complete text analysis pipeline for a corpus of your own construction.

Step 1: Data Collection Identify a corpus of at least 50 political text documents. Options include: congressional floor speeches from the Congressional Record, State of the State speeches (available from Ballotpedia), campaign website policy pages, candidate debate transcripts (available from the Commission on Presidential Debates), or political party platform documents.

Step 2: Pipeline Construction Apply the full pipeline from this chapter: preprocessing, word frequency analysis (with party or ideological group comparison), VADER sentiment analysis, readability analysis, and n-gram analysis. For corpora of 200+ documents, include a topic model.

Step 3: Reproducibility Package Document your pipeline so it can be reproduced: include all preprocessing choices as explicit decisions with justification, report library versions, set and document random seeds, and describe your data provenance.

Step 4: Interpretive Report Write a 1,000–1,500-word analytical report presenting your findings. Your report must include: (a) at least three specific findings with supporting visualizations, (b) at least two limitations of the computational approach for this specific corpus, and (c) at least one finding that surprised you and a discussion of whether the surprise reflects a genuine empirical discovery or a limitation of the methodology.