Chapter 27 Further Reading

Foundational Computational Text Analysis

Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media. The canonical NLTK reference text, available free online at nltk.org/book. Chapters 1–3 cover tokenization, normalization, and frequency analysis; Chapter 6 covers text classification; Chapter 7 covers information extraction. The examples are somewhat dated, but the conceptual grounding remains essential.

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press. The definitive contemporary textbook on computational text analysis for social scientists. Covers the full spectrum from preprocessing to supervised and unsupervised learning, with consistent attention to measurement validity and interpretive caution. Essential reading for anyone working seriously in political text analysis.

Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. The foundational text for the statistical NLP approach. More mathematical than Bird et al., but the conceptual framework for understanding why computational methods work the way they do is invaluable.


Political Text Analysis

Laver, M., Benoit, K., & Garry, J. (2003). Extracting policy positions from political texts using words as data. American Political Science Review, 97(2), 311–331. Foundational application of Wordscores — a text scaling approach that uses reference texts to place political documents on policy dimensions. One of the earliest rigorous applications of computational methods to political science.
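The core Wordscores computation is simple enough to sketch in a few lines. The toy sketch below illustrates the logic only — the function names, toy corpora, and reference positions are invented for this example, and the published method adds score transformations and uncertainty estimates omitted here:

```python
from collections import Counter

def wordscores(ref_counts, ref_positions):
    """Laver-Benoit-Garry word scores: S_w = sum_r P(r|w) * A_r,
    where A_r is the known position of reference text r."""
    vocab = set().union(*(c.keys() for c in ref_counts.values()))
    scores = {}
    for w in vocab:
        # relative rate of word w within each reference text
        rates = {r: ref_counts[r][w] / sum(ref_counts[r].values())
                 for r in ref_counts}
        total = sum(rates.values())
        if total == 0:
            continue
        scores[w] = sum((rates[r] / total) * ref_positions[r]
                        for r in ref_counts)
    return scores

def score_text(counts, scores):
    """Score a 'virgin' text as the frequency-weighted mean of word scores."""
    scored = {w: n for w, n in counts.items() if w in scores}
    return sum(n * scores[w] for w, n in scored.items()) / sum(scored.values())

left = Counter("tax spend welfare welfare spend".split())
right = Counter("tax cut market market freedom".split())
scores = wordscores({"L": left, "R": right}, {"L": -1.0, "R": 1.0})
print(score_text(Counter("welfare spend market".split()), scores))  # ≈ -0.33: closer to the left reference
```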

Slapin, J. B., & Proksch, S. O. (2008). A scaling model for estimating time-series party positions from texts. American Journal of Political Science, 52(3), 705–722. Introduces Wordfish, a model that scales parties or speakers in a policy space without requiring reference texts. Widely used for analyzing legislative speech and party manifestos.

Monroe, B. L., Colaresi, M. P., & Quinn, K. M. (2008). Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4), 372–403. Develops the log-odds ratio approach to identifying distinctive vocabulary — a more statistically rigorous alternative to simple frequency comparisons. Standard reference for partisan vocabulary analysis.
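The weighted log-odds-ratio at the heart of the fightin'-words approach can be sketched compactly. The version below uses a symmetric (uninformative) prior for brevity; Monroe et al. recommend an informative Dirichlet prior estimated from a background corpus, so treat this as a simplified illustration:

```python
import math
from collections import Counter

def log_odds_z(counts_a, counts_b, prior_strength=0.5):
    """Log-odds ratio with a symmetric Dirichlet prior.

    Returns a z-score per word: large positive values mark words
    distinctive of corpus A, large negative values words distinctive of B.
    """
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = prior_strength * len(vocab)          # total prior mass
    z = {}
    for w in vocab:
        a_w = prior_strength
        ya, yb = counts_a[w], counts_b[w]     # Counter returns 0 if absent
        delta = (math.log((ya + a_w) / (n_a + a0 - ya - a_w))
                 - math.log((yb + a_w) / (n_b + a0 - yb - a_w)))
        var = 1.0 / (ya + a_w) + 1.0 / (yb + a_w)
        z[w] = delta / math.sqrt(var)
    return z
```

Sorting the resulting dictionary by value surfaces the most distinctive vocabulary at each end — the standard way partisan-vocabulary tables are built from this statistic.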

Gentzkow, M., Shapiro, J. M., & Taddy, M. (2019). Measuring group differences in high-dimensional choices: Method and application to congressional speech. Econometrica, 87(4), 1307–1340. Rigorous econometric approach to measuring partisan speech divergence. Documents the substantial increase in partisan language differentiation in the US Congress from 1873 to 2016.


Sentiment Analysis

Hutto, C. J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. The original VADER paper. Read for the methodological rationale and the human evaluation study that validated the approach against other methods on social media text.

Young, L., & Soroka, S. (2012). Affective news: The automated coding of sentiment in political texts. Political Communication, 29(2), 205–231. Validation study comparing automated sentiment coding (including lexical and machine learning approaches) to human coding in political text. Key reading for understanding what automated sentiment tools actually measure.

Soroka, S., Young, L., & Balmas, M. (2015). Two kinds of negativity: The effects of incivility and slant in political messages. Political Psychology, 36, 83–95. Distinguishes between incivility (tone) and negative issue framing — two dimensions of negativity that automated tools often conflate. Important for interpreting VADER scores in political contexts.


Readability and Political Communication

Schaffner, B. F., & Sellers, P. J. (2003). The structural determinants of local congressional news coverage. Political Communication, 20(1), 41–57. Research showing that clarity and concreteness in political communication affect media coverage and audience engagement. Background for why readability metrics matter politically.

Tedin, K. L., & Murray, R. W. (1981). Dynamics of candidate evaluation in a statewide election. American Politics Quarterly, 9(3), 345–362. Classic research on how voters evaluate political communication, with attention to language accessibility and credibility.

For tracking the readability of contemporary political speeches, the Flesch-Kincaid scores of major addresses (by Trump, Clinton, and other recent candidates) are compiled by journalism organizations including Politico (politico.com) and FiveThirtyEight (fivethirtyeight.com). These provide calibration benchmarks for comparing your own analyses.
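The Flesch-Kincaid grade-level formula behind such benchmarks is straightforward once you have word, sentence, and syllable counts (syllable counting is the error-prone part in practice; the function name here is invented for illustration):

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level:
    0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    """
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# e.g. a 100-word passage with 5 sentences and 130 syllables
print(round(flesch_kincaid_grade(100, 5, 130), 2))  # 7.55, roughly grade 7-8 prose
```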


Topic Modeling

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. The original LDA paper. Mathematically technical but the introduction provides an accessible motivation for the model. Essential citation for any research using LDA.

Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., ... & Rand, D. G. (2014). Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4), 1064–1082. Introduces Structural Topic Models (STM), an extension of LDA that incorporates document-level metadata (like party, year, or office) directly into the topic model. More appropriate than basic LDA when you have theoretically relevant covariates.

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems, 22. The "word intrusion" and "topic intrusion" tests for evaluating topic model quality. Essential for any researcher using LDA in publishable work.


Text Classification

Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297. Four principles for responsible use of automated text analysis in political science. Essential reading before conducting any publishable text classification work.

Recchia, G., & Laver, M. (2016). Where do parties come from? Political Science Research and Methods, 4(1), 73–97. Uses supervised machine learning to classify political text by party positions. Good applied example of the classification approach used in Section 27.10.


Media Framing and Computational Analysis

Boydstun, A. E., Card, D., Gross, J., Resnik, P., & Smith, N. A. (2014). Tracking the development of media frames within and across policy issues. New Directions for Child and Adolescent Development, 2014(145), 79–98. Application of computational methods to tracking media frame dynamics. Useful model for the framing analysis in Section 27.9.

Green-Pedersen, C., & Mortensen, P. B. (2015). Avoidance and engagement: How US and Danish parties respond to each other's agendas. European Journal of Political Research, 54(1), 3–20. Research on agenda-setting and party responsiveness using text analysis. Illustrates how computational methods can address substantive political science questions about strategic communication.


Tools and Python Resources

spaCy documentation (spacy.io) — For production-quality NLP pipelines beyond what NLTK provides. spaCy's named entity recognition, dependency parsing, and transformer-based models are state of the art for applied text analysis.

Hugging Face Transformers (huggingface.co/docs/transformers) — For transformer-based language models (BERT, RoBERTa, etc.). Substantially more powerful than VADER for sentiment and classification tasks, but require more computational resources and training data. The next step for analysts who have mastered the NLTK pipeline.

quanteda — An R package for text analysis widely used in political science, with methods parallel to those covered in this chapter. For analysts fluent in R, quanteda's documentation and tutorials at quanteda.io are the best entry point.

SAGE Research Methods: Content Analysis (methods.sagepub.com) — Comprehensive collection of resources on content analysis methodology, including both computational and manual approaches. Particularly useful for understanding construct validity in text analysis measures.