
Learning Objectives

  • Create word clouds with the wordcloud library, including custom shapes and color maps
  • Evaluate word clouds critically: their limitations and when simpler bar charts are better
  • Visualize topic model outputs: pyLDAvis, topic-term bar charts, topic proportion area charts
  • Create sentiment-over-time charts with smoothing and event annotation
  • Visualize text frequency data: term frequency bar charts, TF-IDF comparisons, co-occurrence networks
  • Create dispersion plots and concordance visualizations for corpus analysis

Chapter 26: Text and NLP Visualization — Word Clouds, Topic Models, and Sentiment

"The limits of my language mean the limits of my world." — Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1921)


26.1 Why Text Is Hard to Visualize

Every chart in this textbook so far has visualized numeric data. Even categorical data — species, country, product line — has been about entities with numeric attributes that can be compared, counted, or aggregated. Text is different. A document is not a number; it is a sequence of words, each with its own meaning, context, and relationship to other words. You cannot scatter-plot "Hamlet" against "Macbeth." You cannot histogram the characters of a novel. Text does not fit naturally into the numeric visualization tools we have developed.

But text is everywhere. Social media posts, news articles, scientific papers, customer reviews, legal documents, emails — the modern data pipeline is at least as much text as numbers. And there are genuine questions about text that benefit from visualization: What words appear most often in this corpus? How has the sentiment of product reviews changed over time? What topics dominate a year of news coverage? How do two documents differ in vocabulary? Each question calls for a specific visualization approach.

This chapter covers the main techniques for visualizing text and the outputs of natural language processing (NLP) pipelines. The core insight is that text visualization almost always requires reducing text to numbers first. You do not visualize the raw text; you visualize a frequency distribution, a topic model's output, a sentiment score, or a similarity measure. The visualization step is mostly a standard chart type — a bar chart, a line chart, a network — applied to the derived numeric data. The text-specific work is the reduction, not the chart.

There is no new threshold concept in this chapter. What is new is the vocabulary: word clouds, TF-IDF, topic models, pyLDAvis, sentiment time series, co-occurrence networks, dispersion plots. Each is a specific tool for a specific text analysis question. We will also spend significant time critiquing the most famous text visualization — the word cloud — because despite its popularity, it is one of the weaker options for most tasks.

26.2 Word Clouds

A word cloud (also called a tag cloud) is a visualization where words from a corpus are displayed with size proportional to their frequency. The most frequent words are biggest; less frequent words are smaller. The words are arranged in a rough cluster, often with random rotation for visual variety.

Word clouds became popular in the early 2000s, initially on Flickr (for tag browsing) and later as a general text-summary visualization. They were easy to generate, visually striking, and seemed to convey "what's in this text" at a glance. By the mid-2010s, they were ubiquitous — in blog posts, presentations, infographics, and introductory NLP tutorials.

The wordcloud Python library makes them trivial to create:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

fig, ax = plt.subplots(figsize=(10, 5))
ax.imshow(wc, interpolation="bilinear")
ax.axis("off")
plt.show()

The WordCloud object tokenizes the text, removes stopwords, counts frequencies, and lays out the words with sizes proportional to counts. The generate method takes a string; generate_from_frequencies takes a dictionary of {word: count} pairs for pre-tokenized data.
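
For the pre-tokenized path, the frequency dictionary is just a Counter. A minimal stdlib sketch (the WordCloud call itself is shown as a comment so the snippet stands alone without the library):

```python
from collections import Counter

tokens = ["fox", "dog", "fox", "jumps", "dog", "fox"]
freqs = dict(Counter(tokens))   # {"fox": 3, "dog": 2, "jumps": 1}

# With wordcloud installed, this dict feeds directly into the layout step:
# from wordcloud import WordCloud
# wc = WordCloud(width=800, height=400).generate_from_frequencies(freqs)
```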

Word clouds support customization: custom colors via colormap, custom shapes via mask (an image that constrains the layout), custom stopwords, and minimum word length. Advanced versions can create word clouds in the shape of a logo, a country, or any silhouette.

26.3 The Word Cloud Critique

Despite their popularity, word clouds are bad visualizations for most analytical tasks. This is not a niche opinion — it is the consensus view of the data visualization community, and it is worth understanding why.

Critique 1: Area is perceived inconsistently. In a word cloud, word size encodes frequency. But the relationship between "size" and "word" is ambiguous. Is it the width of the word? The height? The bounding box area? The total ink? Different words with the same frequency can end up looking very different depending on their letter count and shape. "I" and "antidisestablishmentarianism" at the same frequency will have vastly different visual weights.

Critique 2: Rotation hurts readability. Word clouds often rotate some words 90 degrees to fit them into the layout. Rotated words are harder to read than horizontal ones, and readers systematically underestimate the visual weight of rotated words. The rotation is for aesthetic variety, but it distorts the information.

Critique 3: Frequency comparisons are imprecise. Can you tell from a word cloud whether "election" is twice as frequent as "politics," or three times, or ten? You cannot. The sizes are visually ordered but not quantitatively comparable. For any question requiring quantitative answers, a bar chart is strictly better.

Critique 4: Long words dominate regardless of frequency. A seven-letter word at a given font size takes much more visual space than a three-letter word at the same font size. This gives long words disproportionate prominence, regardless of their actual frequency.

Critique 5: Color is usually random. Most word clouds assign colors for visual variety, not to encode anything. The random colors add visual noise without adding information.

Critique 6: The layout carries no information. A bar chart of word frequencies has a specific order the reader can scan. A word cloud's layout is arbitrary, so the reader's eye wanders and extracting information is slow.

The alternative is simple: a horizontal bar chart of the top N most frequent words. Each word is a row, sorted by frequency descending, with the bar length encoding frequency. You can read off the exact counts. You can compare between words quantitatively. You can read the words easily because they are not rotated. The bar chart is strictly more informative than the word cloud for the question "what are the most frequent words?"

import pandas as pd
import seaborn as sns

word_counts = pd.DataFrame({"word": ["the", "of", "and", ...], "count": [5000, 4200, 3800, ...]})
top_20 = word_counts.nlargest(20, "count")

fig, ax = plt.subplots(figsize=(8, 6))
sns.barplot(data=top_20, y="word", x="count", ax=ax)
ax.set_title("Top 20 Most Frequent Words")

When should you use a word cloud? Almost never for analytical purposes. The only legitimate uses are:

  • Decoration — when you want a text-themed visual element for a cover page, a title card, or similar aesthetic purpose. In this case, the word cloud is graphic design, not data visualization, and the lack of analytical precision does not matter.
  • Novelty for general audiences — when you want to catch someone's attention with something that looks interesting. A word cloud is more eye-catching than a bar chart, which may be appropriate for an infographic or a marketing piece.
  • When the shape is the point — word clouds in the shape of a country, a logo, or a subject of the text can be a design element that reinforces the theme.

For any task involving analysis, quantitative comparison, or serious communication of frequency information, use a bar chart. The word cloud is the most popular text visualization for the wrong reasons — it looks impressive without being informative.

26.4 Term Frequency Bar Charts

The simplest and most effective text visualization is a term frequency bar chart. Count the frequency of each word in the corpus, take the top N, and display as a horizontal bar chart sorted by frequency.

from collections import Counter
import re

def tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

STOPWORDS = set(["the", "a", "an", "of", "to", "in", "and", "or", "for", "on", "at", ...])

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()
tokens = tokenize(text)
tokens_filtered = [t for t in tokens if t not in STOPWORDS and len(t) > 2]

counts = Counter(tokens_filtered)
top_20 = counts.most_common(20)

fig, ax = plt.subplots(figsize=(8, 6))
words, freqs = zip(*top_20)
ax.barh(words, freqs)
ax.invert_yaxis()  # highest on top
ax.set_xlabel("Frequency")
ax.set_title("Top 20 Words in Corpus")

Key design decisions:

  • Stopword removal is almost always necessary. Common words ("the", "a", "of") dominate any natural-language corpus and drown out the interesting words. The list of stopwords depends on the language and domain — nltk.corpus.stopwords and sklearn.feature_extraction.text.ENGLISH_STOP_WORDS provide standard lists.
  • Minimum length filter removes short noise tokens. len(t) > 2 eliminates many abbreviations and typos.
  • Case folding (lowering all text) combines "The" and "the" into one count. Usually correct; sometimes problematic for proper nouns.
  • Horizontal orientation makes the word labels easier to read than vertical bars.
  • Top N rather than all words — usually 15-25 is the sweet spot.

For comparing two documents or two groups, use a grouped bar chart:

# Compare word frequencies between two documents
# counts1, counts2: Counter objects of word frequencies for each document
common_words = [w for w, _ in (counts1 + counts2).most_common(15)]
df = pd.DataFrame({
    "word": common_words,
    "doc1": [counts1[w] for w in common_words],
    "doc2": [counts2[w] for w in common_words],
})

df_melted = df.melt(id_vars="word", var_name="document", value_name="count")
sns.barplot(data=df_melted, y="word", x="count", hue="document")

This shows which words are more frequent in doc1 vs. doc2, directly comparable.

26.5 TF-IDF: What Makes a Document Distinctive

Raw term frequency has a limitation: the most frequent words in any document are usually the generic common words of the language. "The," "a," "of" — even after stopword removal, many documents share their top-frequent words, making comparisons uninformative. TF-IDF (Term Frequency–Inverse Document Frequency) solves this by weighting words by how specifically they appear in a document.

The TF-IDF score of word w in document d, given a corpus of N documents, is:

TF-IDF(w, d) = (frequency of w in d) × log(N / (number of documents containing w))

Words that appear frequently in d but rarely in the rest of the corpus get high TF-IDF scores. Words that appear in every document (like common words) get low scores. TF-IDF emphasizes the distinctive vocabulary of each document.
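
The formula is easy to verify by hand. A minimal sketch with made-up counts (4 occurrences of the word in the document, 2 of 10 corpus documents containing it):

```python
import math

def tfidf(tf, n_docs, df):
    """Term frequency times inverse document frequency."""
    return tf * math.log(n_docs / df)

score = tfidf(tf=4, n_docs=10, df=2)   # 4 * log(5), roughly 6.44
```

Real implementations differ in the details: sklearn's TfidfVectorizer defaults to a smoothed IDF, log((1 + N) / (1 + df)) + 1, and normalizes each document vector, so its scores will not match this textbook formula exactly.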

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["doc1 text here", "doc2 text here", "doc3 text here"]
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
tfidf_matrix = vectorizer.fit_transform(docs)
feature_names = vectorizer.get_feature_names_out()

# Top TF-IDF words for doc 0
tfidf_scores = tfidf_matrix[0].toarray()[0]
top_idx = tfidf_scores.argsort()[::-1][:20]
top_words = [(feature_names[i], tfidf_scores[i]) for i in top_idx]

Visualizing TF-IDF scores is usually done with horizontal bar charts, one per document. The charts show "what makes this document different from the others" rather than "what words does this document have." For comparing genre-specific corpora (news vs. blogs vs. scientific papers, say), the TF-IDF bar charts reveal the distinctive vocabulary of each.

A parallel bar chart can show TF-IDF scores for multiple documents side by side:

fig, axes = plt.subplots(1, 3, figsize=(15, 6), sharey=False)
for ax, doc_idx, title in zip(axes, [0, 1, 2], ["Doc 1", "Doc 2", "Doc 3"]):
    scores = tfidf_matrix[doc_idx].toarray()[0]
    top = sorted(enumerate(scores), key=lambda x: -x[1])[:15]
    words = [feature_names[i] for i, _ in top]
    vals = [v for _, v in top]
    ax.barh(words, vals)
    ax.set_title(title)
    ax.invert_yaxis()

26.6 Topic Models and pyLDAvis

Topic models are unsupervised methods for discovering the latent themes in a corpus. Latent Dirichlet Allocation (LDA), the most common topic model, assumes each document is a mixture of topics, and each topic is a distribution over words. Running LDA on a corpus produces:

  • A set of topics, each characterized by a distribution over the vocabulary.
  • For each document, a distribution over topics.

Visualizing topic models is non-trivial. The standard tool is pyLDAvis, an interactive visualization developed by Sievert and Shirley in 2014. pyLDAvis shows:

  • A 2D scatter plot of topics (via multidimensional scaling of inter-topic distances), where proximity indicates similarity. Clicking a topic selects it for detailed inspection.
  • A bar chart of the top terms for the selected topic, with bars showing both overall frequency and topic-specific relevance.
  • An adjustable relevance slider (λ) that balances topic-specific terms against overall frequent terms.

import pyLDAvis
import pyLDAvis.sklearn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(doc_term_matrix)

pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, doc_term_matrix, vectorizer)

The resulting interactive visualization lets the analyst explore the topics and their terms. It is the de facto standard for topic model exploration in Python. (In pyLDAvis 3.4 and later, the pyLDAvis.sklearn module was renamed pyLDAvis.lda_model; adjust the imports accordingly.)

Alternative topic visualizations:

  • Topic-term bar charts: for each topic, show the top 10-15 terms as a bar chart. Simple and reproducible; less interactive than pyLDAvis.
  • Topic proportion over time: if the documents have timestamps, plot the proportion of each topic per time period as a stacked area chart. This reveals how the topic mixture evolves.
  • Topic co-occurrence heatmap: show which topics tend to appear together in the same document via a topic-topic correlation heatmap.
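
The topic-term bar chart needs only the topic-word weight matrix, following the sklearn convention that lda.components_ has one row of word weights per topic. A library-free sketch of the term-extraction step, with an invented matrix and vocabulary:

```python
# Rows are topics; columns align with the vectorizer's feature names.
components = [
    [0.1, 5.0, 0.3, 4.0],   # topic 0 weights
    [3.0, 0.2, 6.0, 0.1],   # topic 1 weights
]
feature_names = ["election", "goal", "vote", "match"]

def top_terms(topic_weights, names, k=2):
    """Return the k highest-weighted terms for one topic."""
    order = sorted(range(len(topic_weights)),
                   key=lambda i: topic_weights[i], reverse=True)
    return [names[i] for i in order[:k]]

topic_terms = [top_terms(row, feature_names) for row in components]
# topic_terms == [["goal", "match"], ["vote", "election"]]
```

Each list of terms, paired with its weights, feeds straight into ax.barh for one small-multiple panel per topic.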

Topic models are useful for exploring large corpora where the themes are not obvious. They are less useful for precise analysis because the topics are algorithmically defined and may not correspond to meaningful human categories. Always inspect the topics manually to verify they make sense before drawing conclusions.

26.7 Sentiment Analysis Over Time

Sentiment analysis assigns a positive/negative score to each piece of text. When applied to a time series of text — tweets, reviews, news articles — the result is a sentiment score time series that can be visualized with all the time series tools from Chapter 25.

The basic workflow:

from textblob import TextBlob  # or VADER, or transformer-based sentiment

tweets["sentiment"] = tweets["text"].apply(lambda t: TextBlob(t).sentiment.polarity)
daily_sentiment = tweets.resample("D", on="timestamp")["sentiment"].mean()

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(daily_sentiment.index, daily_sentiment, color="steelblue", alpha=0.4, label="Daily")
ax.plot(daily_sentiment.index, daily_sentiment.rolling(7).mean(), color="darkblue", label="7-day MA")
ax.axhline(0, color="gray", linestyle="--")
ax.fill_between(daily_sentiment.index, 0, daily_sentiment, where=daily_sentiment > 0,
                color="green", alpha=0.2)
ax.fill_between(daily_sentiment.index, 0, daily_sentiment, where=daily_sentiment < 0,
                color="red", alpha=0.2)
ax.set_ylabel("Sentiment")
ax.set_title("Daily Sentiment Over Time")

Design decisions specific to sentiment over time:

  • A zero line is essential because sentiment has a meaningful midpoint (neutral). Without the zero line, readers cannot immediately tell whether sentiment is positive or negative.
  • Positive/negative fills (green for above zero, red for below) make the polarity immediately visible.
  • Smoothing is almost always necessary. Raw daily sentiment is extremely noisy because individual posts vary dramatically.
  • Event annotations give context. A sentiment spike on a specific date is meaningless without knowing what happened that day.
  • Confidence bands from aggregating sentiment scores of many posts can be shown as fill_between regions around the central line.
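
A sketch of the event-annotation step, with an invented series and event date: axvline marks the day and annotate attaches the label with an arrow.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=60, freq="D")
daily_sentiment = pd.Series(np.sin(np.arange(60) / 8) * 0.3, index=dates)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(daily_sentiment.index, daily_sentiment)
ax.axhline(0, color="gray", linestyle="--")

event = pd.Timestamp("2024-02-05")          # hypothetical event date
ax.axvline(event, color="black", alpha=0.5)
ax.annotate("product recall",               # hypothetical label
            xy=(event, daily_sentiment.loc[event]),
            xytext=(10, 20), textcoords="offset points",
            arrowprops=dict(arrowstyle="->"))
```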

Sentiment charts are a staple of social media analytics, product review analysis, and political communication research. The basic visualization is straightforward; the challenge is usually in computing reliable sentiment scores, not in plotting them.

26.8 Co-Occurrence Networks

When you want to see which words appear together, a co-occurrence network is the right tool. Each node is a word; each edge represents co-occurrence (within a window, a sentence, or a document). Edge weights represent frequency of co-occurrence.

import networkx as nx
from itertools import combinations
from collections import Counter

# Build co-occurrence counts within sentences
cooccur = Counter()
for sentence in sentences:
    tokens = set(tokenize(sentence)) - STOPWORDS
    for w1, w2 in combinations(sorted(tokens), 2):
        cooccur[(w1, w2)] += 1

# Build the graph
G = nx.Graph()
for (w1, w2), count in cooccur.most_common(100):
    G.add_edge(w1, w2, weight=count)

# Visualize with NetworkX
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_size=500, node_color="lightblue",
        edge_color="gray", alpha=0.7)

The resulting network shows which words form clusters. In a corpus about politics, you might see clusters around "election," "vote," "candidate," "debate" — these words tend to appear together in political articles. The structure of the network reveals the topical structure of the corpus without requiring a formal topic model.

Co-occurrence networks can suffer from the hairball problem (Chapter 24) when there are many terms. The fixes: limit to top N words by total frequency, filter edges by minimum weight, or use community detection and color nodes by cluster.
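
The first two fixes can be applied before any graph library is involved. A stdlib sketch that prunes the co-occurrence Counter, with invented counts and cutoffs:

```python
from collections import Counter

cooccur = Counter({("vote", "election"): 12, ("vote", "debate"): 5,
                   ("fox", "dog"): 2, ("dog", "sleeps"): 1})

MIN_WEIGHT = 3     # drop rare co-occurrences
TOP_N_WORDS = 3    # keep only the most connected words

# Edge filter: remove weak edges first.
strong = {pair: w for pair, w in cooccur.items() if w >= MIN_WEIGHT}

# Node filter: rank words by total edge weight, keep the top N.
node_weight = Counter()
for (w1, w2), w in strong.items():
    node_weight[w1] += w
    node_weight[w2] += w
keep = {w for w, _ in node_weight.most_common(TOP_N_WORDS)}

edges = {pair: w for pair, w in strong.items()
         if pair[0] in keep and pair[1] in keep}
```

The pruned edges dict then feeds G.add_edge as before; only the surviving nodes and edges are drawn.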

26.9 Dispersion Plots

A dispersion plot (also called a lexical dispersion plot) shows where specific words appear within a document or corpus. Each word has a horizontal line representing the document's length, and vertical marks indicate the positions where the word appears.

import matplotlib.pyplot as plt
import numpy as np

tokens = tokenize(novel_text)
target_words = ["Emma", "Knightley", "Harriet", "Elton", "Frank"]

fig, ax = plt.subplots(figsize=(12, 4))
for i, word in enumerate(target_words):
    positions = [idx for idx, t in enumerate(tokens) if t == word.lower()]
    ax.scatter(positions, [i] * len(positions), marker="|", s=100)

ax.set_yticks(range(len(target_words)))
ax.set_yticklabels(target_words)
ax.set_xlabel("Position in text")
ax.set_title("Lexical Dispersion: Character Appearances")

Dispersion plots are useful for corpus linguistics and literary analysis. They reveal:

  • Character entrance/exit points: where characters appear and disappear in a novel.
  • Topic shifts: where specific terms cluster or spread out.
  • Plot structure: by looking at which terms appear when, you can see the shape of the narrative.

NLTK has a built-in dispersion plot (the dispersion_plot method on nltk.Text objects) that handles the details. For custom work, the matplotlib code above is a starting point.

26.10 Text Visualization Pitfalls

Text visualization has its own set of pitfalls, some of which overlap with other chapters but are worth mentioning explicitly.

Pitfall 1: Unfiltered stopwords. Any chart of word frequencies dominated by "the," "a," "of" is a failure. Always remove stopwords, usually with an off-the-shelf list plus a few corpus-specific additions.

Pitfall 2: Case-sensitivity bugs. "Apple" and "apple" are the same word for most analyses but different to a naive tokenizer. Always lowercase unless proper nouns matter.

Pitfall 3: Tokenization artifacts. Punctuation, contractions, URLs, and emojis confuse simple tokenizers. Use re.findall(r"\b\w+\b", text) as a baseline and consider nltk.word_tokenize for better handling. For social media, use specialized tokenizers that handle hashtags, mentions, and emojis.

Pitfall 4: Short-text problems. Topic models and TF-IDF assume documents have substantial length. For tweets (which are under 280 characters), standard LDA produces noisy topics. Consider aggregating tweets by user, hashtag, or time period before topic modeling.

Pitfall 5: Cross-language confusion. Tokenization, stopwords, and sentiment models are language-specific. Mixing English and Spanish in the same analysis without language detection produces garbage.

Pitfall 6: Over-interpretation of topic models. LDA topics are statistical artifacts; nothing in their construction guarantees they correspond to human-interpretable categories. The topics often look meaningful to the analyst but may not correspond to anything real. Always validate with domain knowledge and examples.

26.11 Progressive Project (Alternate): Social Media Text Visualization

The climate project does not really apply to this chapter because climate data is mostly numeric. For this chapter's exercises, we will use a different dataset: a corpus of social media posts about climate change, with timestamps, author metadata, and post text.

The exercises will cover:

  1. Term frequency bar charts of the most common words, with and without stopword removal.
  2. A word cloud (for decoration) and a critique comparing it to the bar chart.
  3. TF-IDF comparison between posts from different time periods (pre-2015 vs. post-2015).
  4. LDA topic modeling of the corpus with pyLDAvis visualization.
  5. Sentiment over time showing how sentiment about climate has evolved.
  6. A co-occurrence network of the top 50 terms, colored by community.

The goal is to build a complete text analysis pipeline from raw posts to visualizations. Each step reduces the text to a numeric representation, and each visualization answers a specific question about the corpus. No single visualization captures everything — the pipeline is cumulative.

26.12 Text Preprocessing: The Foundation

Every text visualization depends on preprocessing — the pipeline that turns raw text into the cleaned tokens you analyze. Getting preprocessing right is more consequential than any visualization choice, because bad preprocessing makes even the best chart misleading. This section summarizes the steps.

Tokenization. Split the text into words (or subwords, or sentences). The simplest tokenizer uses a regex: re.findall(r"\b\w+\b", text). For better handling of punctuation, contractions, and language-specific conventions, use NLTK (nltk.word_tokenize) or spaCy. For social media, use a specialized tokenizer like TweetTokenizer that handles hashtags, mentions, and URLs.

Case folding. Lowercase the tokens: tokens = [t.lower() for t in tokens]. This combines "The" and "the" into one count. Skip this step if proper nouns matter (e.g., entity analysis).

Stopword removal. Remove common words that carry little meaning. Standard lists: nltk.corpus.stopwords.words("english") (179 English stopwords) or sklearn.feature_extraction.text.ENGLISH_STOP_WORDS. Domain-specific corpora often need additional stopwords (e.g., "said" in news text, "thanks" in customer support).

Stemming and lemmatization. Collapse word variants to a root form. "Running", "ran", and "runs" all become "run." Stemmers (Porter, Snowball) use rules and are fast but imprecise. Lemmatizers (WordNet, spaCy) use linguistic dictionaries and are slower but more accurate. For frequency counting, lemmatization gives cleaner results.

Punctuation and number handling. Decide whether punctuation is meaningful (rarely) and whether numbers are meaningful (sometimes — dates in historical text, quantities in scientific text). By default, strip both.

N-grams. Instead of single words, you can count pairs (bigrams) or triples (trigrams). "climate change" as a bigram is more informative than "climate" and "change" separately. Most libraries support n-gram extraction — sklearn's CountVectorizer(ngram_range=(1, 2)) counts unigrams and bigrams together.
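
Bigram extraction needs nothing more than zip over the token list; a stdlib sketch:

```python
from collections import Counter

tokens = ["climate", "change", "drives", "climate", "change", "policy"]
# Pair each token with its successor, then count the pairs.
bigrams = Counter(zip(tokens, tokens[1:]))
# ("climate", "change") appears twice; every other bigram once
```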

Custom filters. Remove URLs, email addresses, emojis, specific phrases, or anything else that is noise for your analysis. Regex substitutions are the usual tool.

The output of this pipeline is a list of cleaned tokens (or a document-term matrix) that the visualization step consumes. Every chart in this chapter assumes the pipeline has been run; skipping preprocessing produces visualizations dominated by stopwords, punctuation, and case-variant duplicates.

26.13 Visualizing Entity Recognition Output

Named Entity Recognition (NER) is an NLP task that identifies mentions of people, organizations, locations, dates, and other entities in text. NER output is typically a list of (entity, type, start, end) tuples indicating which spans of text belong to which entity class.

Visualizing NER output requires specialized tools. The standard ones:

displaCy (spaCy's built-in visualizer) renders entities as colored spans inline with the text:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
displacy.render(doc, style="ent")

The output is HTML with each entity highlighted in a colored background and a small label indicating its type. This is ideal for validating NER models on specific examples and for interactive exploration of text.

Entity frequency bar charts show which entities appear most often:

from collections import Counter
entities = Counter((ent.text, ent.label_) for ent in doc.ents)
top = entities.most_common(20)

This gives a summary of "who and what the text is about" — useful for corpus overview.

Entity co-occurrence networks show which entities appear together, using the co-occurrence pattern from Section 26.8 applied to entities instead of words. This reveals relationships: which people are mentioned alongside which organizations, which places are associated with which events.

Entity timelines plot entity mentions over time in a temporal corpus. For news data, this reveals when specific entities enter and leave the public conversation. Implementation: extract entities with timestamps, resample, and plot with the time series tools from Chapter 25.
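
A sketch of that pipeline, assuming a hypothetical entity_mentions DataFrame with timestamp and entity columns (the names and data are invented):

```python
import pandas as pd

entity_mentions = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-03", "2024-01-05", "2024-02-10",
                                 "2024-02-11", "2024-03-02"]),
    "entity": ["Apple", "Apple", "Apple", "NASA", "NASA"],
})

# Count mentions per entity per month; one column per entity.
monthly = (entity_mentions
           .groupby([pd.Grouper(key="timestamp", freq="MS"), "entity"])
           .size()
           .unstack(fill_value=0))
# monthly.plot(ax=ax) then draws one line per entity over time
```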

26.14 Word Embeddings and Dimensionality Reduction

Modern NLP represents words as high-dimensional vectors (Word2Vec, GloVe, BERT embeddings). These vectors encode semantic similarity: words with similar meanings have similar vectors. Visualizing these vectors requires dimensionality reduction to 2D.

The standard tools are:

t-SNE (t-distributed Stochastic Neighbor Embedding): reduces high-dimensional vectors to 2D while preserving local structure. Nearby points in the high-dimensional space stay nearby in the 2D projection. Good for visualization; not good for preserving global structure.

UMAP (Uniform Manifold Approximation and Projection): a newer alternative to t-SNE that is faster and often produces more interpretable global structure. Usually the default choice for modern word embedding visualization.

PCA (Principal Components Analysis): the classical dimensionality reduction method. Fast, deterministic, but captures only linear relationships in the embedding space. Useful as a baseline.

A typical word embedding visualization workflow:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assume `vectors` is an N x D matrix of word vectors and `words` is the list of words
reducer = TSNE(n_components=2, random_state=42, perplexity=30)
projected = reducer.fit_transform(vectors)

fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(projected[:, 0], projected[:, 1], s=10, alpha=0.5)
for i, word in enumerate(words):
    if i < 50:  # label only the first 50 to avoid clutter
        ax.annotate(word, (projected[i, 0], projected[i, 1]), fontsize=8)
ax.set_title("Word Embeddings (t-SNE)")

The resulting scatter plot shows words grouped by semantic similarity. Related words cluster together: "king," "queen," "prince," "princess" form one group; "run," "walk," "jog," "sprint" form another. The visualization is qualitative — you cannot read off specific distances — but it reveals the structure of the embedding space.

Warning on t-SNE: the algorithm is stochastic and parameter-sensitive. Different runs with different seeds produce different layouts. The distances in the 2D projection are not meaningful — only nearby points can be trusted to be similar, and far points are not necessarily dissimilar. Always disclose the parameters (perplexity, learning rate) and treat t-SNE plots as exploratory, not definitive.

26.15 Text Comparison Visualizations

When you want to compare two or more texts, several specialized visualizations exist.

Word frequency difference plots: compute word frequencies in both corpora, subtract, and plot the most different words as a bar chart. Words more common in corpus A appear as positive bars; words more common in corpus B appear as negative bars. This shows distinctive vocabulary at a glance.

# counts_A, counts_B: pd.Series of word frequencies for corpora A and B
diff = counts_A.sub(counts_B, fill_value=0)  # positive -> more common in A
top_A = diff.nlargest(10)
top_B = (-diff).nlargest(10)
combined = pd.concat([top_A, -top_B]).sort_values()

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(combined.index, combined.values,
        color=["blue" if v > 0 else "red" for v in combined.values])

Scattertext: a specialized library by Jason Kessler that visualizes word-level differences between two text categories. Words are plotted on axes representing their frequency in each category, with distinctive words labeled and visible. The output is an interactive HTML scatter plot that is particularly effective for political text analysis (e.g., Republican vs. Democrat speeches).

Concordance / keyword-in-context (KWIC) views: for specific words, show the surrounding context in each occurrence. Not a chart per se, but a specialized text display. NLTK provides text.concordance(word, width=80) which lists every occurrence with surrounding words.

Text similarity heatmaps: compute pairwise similarities (cosine similarity of TF-IDF vectors, embedding distance) between all documents and display as a heatmap. Clusters of similar documents appear as blocks. Useful for corpora with natural groupings.
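
Cosine similarity itself is a few lines; a stdlib sketch over {word: count} vectors (libraries such as sklearn compute the same quantity over a whole matrix at once):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two {word: count} vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

d1 = Counter("the cat sat on the mat".split())
d2 = Counter("the cat ate the rat".split())
sim = cosine(d1, d2)   # between 0 and 1; higher means more similar
```

Computing this for every document pair fills the matrix that the heatmap displays.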

26.16 Advanced Sentiment: Aspect-Based and Emotion Dimensions

Sentiment analysis at its simplest assigns one number (polarity from -1 to +1) per document. More sophisticated approaches decompose sentiment into multiple dimensions or attach it to specific aspects of the text.

Aspect-based sentiment analysis assigns sentiment to specific features or aspects rather than the whole text. For a product review, instead of "overall positive," you get "positive about battery, negative about screen, neutral about price." Visualizing aspect-based sentiment is harder because you have multiple scores per document.

A useful pattern is a heatmap of products × aspects colored by sentiment:

import seaborn as sns
sns.heatmap(aspect_matrix, cmap="RdYlGn", center=0, vmin=-1, vmax=1,
            annot=True, fmt=".2f", cbar_kws={"label": "Sentiment"})

Each row is a product (or company or topic), each column is an aspect (battery, screen, price, software), and the cell color encodes the sentiment. This reveals which products are weak on which dimensions and which aspects are universally liked or disliked.

Emotion dimensions decompose sentiment into multiple emotional categories: joy, anger, fear, sadness, surprise, disgust. Each document gets a vector of emotion scores rather than a single polarity. Visualizing emotion is typically done with:

  • Radar charts (one axis per emotion) for individual documents or groups. Effective for showing emotional "profiles."
  • Stacked area charts over time showing how the emotion mixture evolves.
  • Small multiples with one line chart per emotion, sharing the x-axis.

Emotion profiles are one of the few legitimate uses of radar charts, which are otherwise considered weak visualizations because they are hard to read quantitatively. For emotion profiles, the radar chart's circular structure matches the intuition that emotions are categorically related, and the shape of the "emotion polygon" can be meaningfully compared across documents.
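A radar chart is a line-plus-fill on a polar axis. A minimal matplotlib sketch, with made-up emotion scores for one document group (the scores and the 0.5 axis limit are illustrative assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

emotions = ["joy", "anger", "fear", "sadness", "surprise", "disgust"]
scores = [0.45, 0.10, 0.15, 0.20, 0.30, 0.05]  # made-up profile

# One angle per emotion; repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(emotions), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=1.5)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(emotions)
ax.set_ylim(0, 0.5)
```

To compare groups, plot several polygons on the same axes with distinct colors and a legend; more than three or four profiles on one radar becomes unreadable, at which point small multiples are the better choice.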

26.17 Corpus Comparison at Scale

When you have many documents and want to compare them visually, several scale-friendly approaches exist beyond the pairwise methods in Section 26.15.

Document clustering with dimensionality reduction: compute document embeddings (TF-IDF vectors, or modern sentence-level embeddings from a transformer model), project to 2D with UMAP, and scatter-plot. Each point is a document; clusters reveal groups of similar documents.

from sklearn.feature_extraction.text import TfidfVectorizer
import umap
import matplotlib.pyplot as plt

# docs: list of document strings; doc_labels: numeric category labels
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
doc_vectors = vectorizer.fit_transform(docs)

# UMAP accepts the sparse TF-IDF matrix directly
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
projected = reducer.fit_transform(doc_vectors)

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(projected[:, 0], projected[:, 1], c=doc_labels,
           cmap="tab10", alpha=0.5, s=20)

The scatter plot color-codes by category or cluster label, revealing whether documents of the same category cluster together. This is one of the few visualization techniques that scales to tens of thousands of documents.

Topic prevalence heatmaps: after fitting a topic model, make a heatmap of documents × topics where each cell is the document's probability for that topic. Sort rows by dominant topic. The result reveals the overall topic distribution and identifies documents that span multiple topics.
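A minimal sketch of the sort-then-heatmap step, using a random Dirichlet matrix as a stand-in for a fitted model's document-topic probabilities (rows sum to 1):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in for a fitted topic model's output: 30 documents x 4 topics,
# each row a probability distribution over topics.
doc_topic = rng.dirichlet(alpha=[0.3] * 4, size=30)

# Sort rows by dominant topic so same-topic documents form blocks.
order = np.argsort(doc_topic.argmax(axis=1))
sorted_matrix = doc_topic[order]

fig, ax = plt.subplots(figsize=(6, 8))
im = ax.imshow(sorted_matrix, aspect="auto", cmap="Blues", vmin=0, vmax=1)
ax.set_xlabel("Topic")
ax.set_ylabel("Document (sorted by dominant topic)")
fig.colorbar(im, ax=ax, label="Topic probability")
```

Documents that span multiple topics show up as rows with two or more moderately dark cells rather than one dominant cell, which makes them easy to spot against the block structure.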

Temporal topic streams: stream graphs (a variant of stacked area charts) show the proportion of each topic over time. Each "stream" is one topic, and its width at each time point is proportional to how much of the corpus at that time is on that topic. The visual metaphor is a river of topics flowing through time. Matplotlib's stackplot with baseline="wiggle" produces them directly; dedicated stream graph libraries exist in the JavaScript ecosystem.
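A basic stream graph in matplotlib is a stackplot with a non-zero baseline. A sketch with made-up monthly counts for three topics:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

# Made-up monthly document counts per topic.
months = np.arange(12)
topic_a = np.array([5, 6, 8, 10, 12, 11, 9, 8, 7, 6, 5, 4])
topic_b = np.array([3, 3, 4, 5, 6, 8, 10, 11, 10, 9, 8, 7])
topic_c = np.array([8, 7, 6, 5, 4, 4, 5, 6, 7, 8, 9, 10])

fig, ax = plt.subplots(figsize=(9, 4))
# baseline="wiggle" centers the streams around a moving midline instead
# of stacking from zero, giving the river-like stream graph shape.
ax.stackplot(months, topic_a, topic_b, topic_c,
             labels=["Topic A", "Topic B", "Topic C"],
             baseline="wiggle", alpha=0.8)
ax.legend(loc="upper left")
ax.set_xlabel("Month")
```

Because the wiggle baseline removes the shared zero line, stream graphs trade precise readability for flow aesthetics; when readers need to compare exact proportions, a plain stacked area chart (baseline="zero") is the safer default.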

Author-topic visualizations: when documents have authors, you can visualize which authors write about which topics. A bipartite graph (authors on one side, topics on the other, edges weighted by how much each author writes about each topic) is one option. A heatmap (authors × topics) is another. Both reveal authorial specializations.

26.18 Ethics of Text Visualization

Text data often comes from individuals — tweets, reviews, comments, messages. Visualizing text data raises ethical questions that are easy to ignore but worth taking seriously.

Privacy: even "public" text data (tweets, reviews) may identify individuals who did not expect aggregate analysis of their words. Research ethics (IRB review, consent) applies to some of this work, and even non-research uses should consider what the individuals would want. A dispersion plot showing every instance of a controversial word in a public dataset may expose individual contributors in a way that a summary statistic would not.

Representativeness: text corpora are usually biased. Twitter users are not representative of the general population; customer reviewers are not representative of all customers; historical texts are biased toward literate populations of specific eras. Visualizations of text data inherit these biases, and without caveats, they can mislead.

Amplification: visualizing which words are most frequent in hate speech, extremist content, or misinformation can inadvertently amplify it by making it visible. The analytical goal (understanding the phenomenon) is legitimate, but the visualization can be weaponized by the same groups being studied. Careful researchers think about publication and dissemination before visualizing.

Misinterpretation: sentiment scores are algorithmic outputs, not ground truth about human emotions. A sentiment time series chart can imply that "people felt X on date Y" when the actual data is "a machine learning model classified posts as X on date Y." The two are not the same, and presenting the latter as the former is a form of overclaiming.

Consent and scraping: much text data is obtained by scraping websites or social media in ways that violate terms of service. The visualization is downstream of the data collection, but the chart maker is participating in the pipeline. Consider whether the data was obtained legitimately before building visualizations on it.

None of these considerations invalidates text visualization. They are caveats that apply with particular force to text data because of its human and personal nature. A chart that handles them honestly is more trustworthy than one that ignores them.

26.19 Text Visualization Tools and Libraries

A quick inventory of the main Python libraries for text visualization, to help you choose the right tool for a given task.

Core tokenization and NLP:

  • NLTK — The Python NLP toolkit. Has tokenization, stopwords, stemming, tagging, parsing, and a few visualization utilities like dispersion_plot. Comprehensive but sometimes slow for large corpora.
  • spaCy — Modern industrial NLP. Faster than NLTK, with better entity recognition and dependency parsing. The displaCy visualizer renders entities, syntactic structure, and dependency trees inline with text.
  • gensim — Topic modeling (LDA, LSI, Word2Vec). The canonical library for document-level text analysis in Python.
  • transformers (Hugging Face) — Modern transformer-based models (BERT, GPT, RoBERTa) with pretrained embeddings and classifiers. The source for state-of-the-art sentiment, entity recognition, and semantic similarity.

Visualization-specific:

  • wordcloud — The de facto word cloud library. Despite this chapter's critique, it exists and is widely used; learning it is useful even if you do not use it for serious analytics.
  • pyLDAvis — The standard topic model visualization tool. Interactive, well-designed, and the de facto choice for LDA exploration.
  • Scattertext — Specialized visualization for comparing two text categories. Excellent for political speech analysis, product review comparison, and similar bipartite comparisons.
  • Yellowbrick — A scikit-learn visualization library with modules for text visualization (frequency distributions, t-SNE for text, dispersion plots). Useful for model diagnostics.
  • Altair — The grammar-of-graphics library from Chapter 22 works well for sentiment time series, TF-IDF bar charts, and other standard text visualizations.
  • bertviz — Visualizes attention weights in transformer models. Useful for research but too specialized for most applications.

Interactive exploration:

  • TextVis libraries in D3/Observable — Many advanced text visualizations exist in JavaScript but not in Python. For novel text visualization, JavaScript ecosystems often have more to offer.
  • Voyant Tools — A web-based interactive text analysis tool with built-in visualizations. Free and useful for quick exploration without any coding.
  • Overview — An open-source tool for exploring document collections with topic-model-style clustering. Useful for journalism and document review tasks.

A practical workflow: use spaCy for preprocessing (tokenization, entities), gensim for topic modeling, pyLDAvis for topic exploration, matplotlib or seaborn for frequency bar charts, and Altair for interactive dashboards. For advanced research, add transformers and bertviz.

26.20 A Final Word on Word Clouds

The chapter has been critical of word clouds throughout, and it is worth ending with a balanced statement of the case. Word clouds are not intrinsically evil. They are a specific visualization with specific strengths and weaknesses, and the mistake is using them for tasks where a bar chart would be strictly better.

Where word clouds legitimately work:

  • As decorative elements for covers, title pages, presentations, or marketing materials where analytical precision is not the goal.
  • As attention-getters for general audiences who might not engage with a bar chart but will look at a word cloud.
  • When shape is meaningful — a word cloud in the silhouette of a person, a country, or an object can add thematic resonance that a bar chart cannot.
  • As a quick sanity check on whether preprocessing is working — if the word cloud is dominated by "the" and "of," your stopwords list is incomplete.

Where word clouds fail:

  • Any task requiring quantitative comparison between words.
  • Any publication-quality analytical visualization where precision matters.
  • Frequency distribution overviews where the top-N bar chart is more informative.
  • Sentiment or polarity displays where the magnitude and sign of values matter.

The broader lesson is that no visualization type is universally good or universally bad. Word clouds became the poster child for "bad viz" because they are widely misused, not because they are inherently flawed. The same logic applies to pie charts, dual-axis charts, and many other controversial chart types: they have legitimate uses in specific contexts, and the rule against them is really a rule against using them in the wrong contexts.

When you see a word cloud in someone else's work, ask: what question is this answering? If the question is "what is this text about, roughly?" the word cloud is acceptable. If the question is "which words are most frequent, and how do their frequencies compare?" the word cloud is failing and a bar chart would be better. The question determines the right chart; the chart does not determine the question.

The same test applies to bar charts, pie charts, heatmaps, and every other chart type. A chart is a choice among many, and the right choice is the one that answers the reader's question with the least cognitive load. Popularity is not a defense, and neither is familiarity — "everyone uses word clouds" is as weak a reason as "everyone uses pie charts." The discipline is to ask each time whether the chart is serving the question or merely filling space.

A practical test: if you replaced your word cloud with a horizontal bar chart of the top 20 words, would your audience lose something? If yes — for example, you are creating a visually evocative image for a magazine spread where aesthetic impact matters more than readable counts — the word cloud is the right choice. If no — for example, you are producing a dashboard panel for a product manager who wants to see trending terms — the bar chart is strictly better and the word cloud is a habit, not a design decision. Being able to give this answer deliberately separates thoughtful chart making from reflexive chart making. The word cloud is merely the most visible instance of a choice that applies to every chart type in this book: never reach for a default without asking whether the default is doing the work you need.

26.21 Check Your Understanding

Before continuing to Chapter 27 (Statistical and Scientific Visualization), make sure you can answer:

  1. Why are word clouds considered weak analytical visualizations?
  2. What is the right alternative to a word cloud for showing word frequencies?
  3. What does TF-IDF measure, and why is it more informative than raw frequency?
  4. What is LDA, and what visualization tool is standard for exploring its output?
  5. How do you visualize sentiment over time, and what design elements are specific to sentiment charts?
  6. What is a co-occurrence network, and when is it appropriate?
  7. What is a dispersion plot, and what question does it answer?
  8. Name three text visualization pitfalls and their remedies.

If any of these are unclear, re-read the relevant section. Chapter 27 moves to the specialized domain of statistical and scientific publication figures.

26.22 Chapter Summary

This chapter covered the main techniques for visualizing text data:

  • Word clouds are popular but flawed. Use them for decoration, not analysis.
  • Term frequency bar charts are the effective default for showing word frequencies.
  • TF-IDF weights words by their distinctiveness, emphasizing what makes each document different from the rest of the corpus.
  • Topic models (LDA) discover latent themes; pyLDAvis is the standard interactive exploration tool.
  • Sentiment over time charts combine time series techniques with sentiment scores, using zero-line reference and positive/negative fills.
  • Co-occurrence networks show which words appear together, revealing topical clusters.
  • Dispersion plots show where specific words appear in a document.
  • Text visualization pitfalls include unfiltered stopwords, case-sensitivity bugs, tokenization artifacts, short-text problems, cross-language confusion, and over-interpretation of topic models.

No new threshold concept — this chapter applied standard visualization tools (bar charts, line charts, networks) to text data. The text-specific work is in the preprocessing and reduction, not in the chart type. The main lesson is that text visualization is mostly about choosing the right numeric reduction; once you have the numbers, the visualization follows standard principles.

Chapter 27 addresses publication-quality statistical and scientific figures — the matplotlib techniques, color choices, and export formats that journals require.

26.23 Spaced Review

  • From Chapter 6 (Data-Ink Ratio): Why do word clouds fail the data-ink ratio test? What would an extreme data-ink advocate say about them?
  • From Chapter 24 (Networks): Co-occurrence networks are a specific application of network visualization. When should you choose a co-occurrence network over a clustered bar chart?
  • From Chapter 25 (Time Series): Sentiment over time is a special case of time series. Which techniques from Chapter 25 apply, and which sentiment-specific additions matter?
  • From Chapter 5 (Choosing the Right Chart): The chapter argues that bar charts beat word clouds for most text tasks. How does this fit into Chapter 5's chart-selection framework?
  • From Chapter 9 (Storytelling): A sentiment chart without event annotations is raw data. How does annotation turn it into a narrative?

Text visualization is a young field compared to numeric visualization, and the tools are correspondingly less polished. The word cloud remains the most famous example, but it is also the weakest one. For serious text analysis, use bar charts, TF-IDF, topic models, and networks — the standard visualization toolkit applied to text-derived numbers. The workflow is always the same: preprocess the text carefully, reduce it into one or more appropriate numeric representations, and then visualize the resulting numbers with the familiar chart types that the rest of this book has covered in detail. There is nothing exotic about text visualization beyond the preprocessing step, and the discipline is in choosing the right reduction rather than inventing new chart forms. Chapter 27 moves to the specialized domain of publication-quality statistical figures.