Further Reading: Text and NLP Visualization


Tier 1: Essential Reading

Harris, Jacob. "Word Clouds Considered Harmful." NiemanLab, 2011. The canonical critique of word clouds from a working data journalist. Short, direct, and influential. The post is freely available online (search for "word clouds considered harmful") and still makes the argument cleanly more than a decade later.

Sievert, Carson, and Kenneth Shirley. "LDAvis: A Method for Visualizing and Interpreting Topics." Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 2014. The paper introducing LDAvis, the R package that pyLDAvis ports to Python. Describes the design decisions behind the 2D topic scatter and the relevance slider.

Michel, Jean-Baptiste, et al. "Quantitative Analysis of Culture Using Millions of Digitized Books." Science 331, no. 6014 (2011): 176-182. The paper that introduced the Google Ngram dataset and the term "culturomics." Freely available and directly relevant to Case Study 2.
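
For readers who want to see what the relevance slider in the LDAvis paper above actually adjusts, here is a minimal sketch of the relevance score the paper defines: relevance(w, k) = lam * log p(w | k) + (1 - lam) * log(p(w | k) / p(w)), where lam is the slider value between 0 and 1. The function names and the toy probabilities are hypothetical and chosen only for illustration; this is not pyLDAvis code, which derives these quantities from a fitted topic model.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a term to a topic, after Sievert and Shirley (2014).

    lam = 1 ranks terms by their probability within the topic;
    lam = 0 ranks them by lift, p(w | topic) / p(w).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical toy values: term -> (p(w | topic), p(w) across the corpus).
toy_terms = {
    "the": (0.050, 0.060),         # frequent everywhere, so lift is below 1
    "goalkeeper": (0.010, 0.001),  # rarer overall, concentrated in this topic
}


def rank_terms(terms, lam):
    """Return term names sorted from most to least relevant."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)


print(rank_terms(toy_terms, 1.0))  # ranking by in-topic probability
print(rank_terms(toy_terms, 0.0))  # ranking by lift
```

With these toy numbers, moving lam from 1 toward 0 demotes the common word and promotes the topic-specific one, which is the trade-off the slider exposes.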


Tier 2: Recommended Reading

Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009. The classic NLTK book, freely available at nltk.org/book. Covers tokenization, stopwords, stemming, and basic text analysis. It predates modern libraries but remains a good introduction to the concepts.

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing. 3rd ed. draft, 2024. The most comprehensive modern NLP textbook. Covers the full pipeline from tokenization to modern transformers. Freely available at stanford.edu/~jurafsky/slp3. Essential for serious NLP work.

Kessler, Jason. "Scattertext: A Browser-Based Tool for Visualizing How Corpora Differ." ACL 2017 Demo Session, 2017. The paper introducing Scattertext. Covers the visualization philosophy and the specific design decisions.

Cairo, Alberto. The Truthful Art. New Riders, 2016. Chapter on text visualization covers word clouds, topic models, and sentiment. Contains Cairo's critique of word clouds and recommendations for alternatives.

Blei, David M. "Probabilistic Topic Models." Communications of the ACM 55, no. 4 (2012): 77-84. A readable introduction to topic models (LDA and variants) from one of the field's co-founders.

Mohammad, Saif M., and Peter D. Turney. "Crowdsourcing a Word-Emotion Association Lexicon." Computational Intelligence 29, no. 3 (2013): 436-465. The paper introducing the NRC Emotion Lexicon, a widely used resource for emotion analysis.
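
The preprocessing concepts covered in the NLTK book listed above (tokenization and stopword removal) can be sketched with the standard library alone. This sketch deliberately avoids NLTK so that it runs without installs or corpus downloads; the regex tokenizer and the tiny stopword set are simplifying assumptions, not NLTK's tokenizers or stopword lists, and the sample sentence is made up.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(text, n=2):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(tokens).most_common(n)


sample = "The cat and the hat: the cat sat in the hat, and the cat slept."
print(top_terms(sample))
```

Ranked counts like these are the raw material for a sorted bar chart of term frequencies.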


Tier 3: Tools and Online Resources

| Resource | URL / Source | Description |
| --- | --- | --- |
| wordcloud | github.com/amueller/word_cloud | The Python wordcloud library. Despite the chapter's critique, still useful for decorative purposes. |
| nltk | nltk.org | The Python NLP toolkit. Comprehensive, with built-in visualization utilities. |
| spaCy | spacy.io | Modern industrial NLP library with fast tokenization, NER, and dependency parsing. |
| gensim | radimrehurek.com/gensim | Topic modeling and similarity in Python. |
| pyLDAvis | github.com/bmabey/pyLDAvis | The interactive topic model visualizer. |
| Scattertext | github.com/JasonKessler/scattertext | Comparative text visualization by Jason Kessler. |
| Google Ngram Viewer | books.google.com/ngrams | The interactive Ngram Viewer discussed in Case Study 2. |
| Google Ngram data | storage.googleapis.com/books/ngrams/books/datasetsv3.html | The raw Ngram dataset for bulk analysis. |
| Hugging Face Transformers | huggingface.co | Pretrained transformer models for sentiment, classification, and more. |
| textblob | textblob.readthedocs.io | Simple sentiment analysis library (wraps NLTK). |
| VADER sentiment | github.com/cjhutto/vaderSentiment | Rule-based sentiment analysis tuned for social media. |
| Voyant Tools | voyant-tools.org | Web-based interactive text analysis with built-in visualizations. |

A note on reading order: If you want one additional source, read Harris's "Word Clouds Considered Harmful" blog post; it is short, sharp, and still relevant. For serious NLP work, bookmark Jurafsky and Martin's free textbook. For practical visualization, start with the Scattertext paper, which is a good example of text visualization design.