Exercises: Text and NLP Visualization
Install: `pip install wordcloud nltk scikit-learn gensim pyldavis textblob spacy`. For spaCy, also run `python -m spacy download en_core_web_sm`.
Part A: Conceptual (6 problems)
A.1 ★☆☆ | Recall
Name three critiques of word clouds as analytical visualizations.
Guidance
(1) Word size conflates length and frequency — long words dominate regardless of count. (2) Rotation hurts readability and distorts visual weight. (3) Quantitative comparisons between word sizes are imprecise. (Others: arbitrary layout, random color, hard to scan systematically.)
A.2 ★☆☆ | Recall
What does TF-IDF measure?
Guidance
Term Frequency–Inverse Document Frequency. It combines how often a word appears in a document (TF) with how rare the word is across the corpus (IDF). A high TF-IDF score means the word is frequent in this document but unusual in others, making it distinctive for that document.
A.3 ★★☆ | Understand
What is LDA and what does pyLDAvis show?
Guidance
LDA (Latent Dirichlet Allocation) is an unsupervised topic model that assumes each document is a mixture of topics and each topic is a distribution over words. pyLDAvis is an interactive visualization that shows (1) a 2D scatter of topics by similarity, (2) a bar chart of top terms for the selected topic, and (3) an adjustable relevance slider.
A.4 ★★☆ | Understand
Why is a zero reference line important on sentiment-over-time charts?
Guidance
Sentiment has a meaningful midpoint (neutral). Without a zero line, readers cannot tell at a glance whether sentiment is positive or negative. The zero line plus positive/negative fills (green/red) makes the polarity immediately visible.
A.5 ★★★ | Analyze
When is a word cloud an acceptable choice?
Guidance
As decoration (covers, marketing, infographics), when the shape is meaningful (country silhouette, logo), as an attention-getter for general audiences, or as a quick sanity check on preprocessing. Never for quantitative analysis or publication-quality analytical figures.
A.6 ★★★ | Evaluate
A colleague visualizes sentiment about a brand with a single line chart (no smoothing, no zero line, no annotations). What do you suggest?
Guidance
Add a zero reference line. Add smoothing (7- or 30-day rolling mean), because raw daily sentiment is noisy. Add positive/negative fills to make polarity visible. Annotate key events (product launches, PR incidents) to give context. Consider the number of underlying observations per day — sparse days have unreliable sentiment and should be flagged.
Part B: Applied (10 problems)
B.1 ★☆☆ | Apply
Create a basic word cloud from a short text.
Guidance
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "the quick brown fox jumps over the lazy dog " * 50
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")  # bilinear smooths the rendered glyphs
plt.axis("off")
plt.show()
B.2 ★☆☆ | Apply
Count the top 20 words in a corpus and plot a horizontal bar chart.
Guidance
from collections import Counter
import re
tokens = re.findall(r"\b\w+\b", text.lower())  # text from B.1, or any corpus string
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "for"}
filtered = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
top = Counter(filtered).most_common(20)
words, counts = zip(*top)
plt.barh(words, counts)
plt.gca().invert_yaxis()  # most frequent word at the top
B.3 ★★☆ | Apply
Compute TF-IDF scores for a corpus of 5 documents and plot the top 10 words per document.
Guidance
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["doc one text", "doc two text", "doc three text", "doc four text", "doc five text"]
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
features = vec.get_feature_names_out()
for i in range(len(docs)):
    scores = tfidf[i].toarray()[0]
    top_idx = scores.argsort()[::-1][:10]
    print([features[j] for j in top_idx])
B.4 ★★☆ | Apply
Fit an LDA topic model with 5 topics and visualize it with pyLDAvis.
Guidance
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn  # in pyLDAvis >= 3.4 this module is pyLDAvis.lda_model
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=42).fit(X)
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, X, vec)
B.5 ★★☆ | Apply
Compute TextBlob sentiment for a list of texts and plot the distribution.
Guidance
from textblob import TextBlob
import seaborn as sns
texts = ["I love it", "It's okay", "I hate it", ...]
sentiments = [TextBlob(t).sentiment.polarity for t in texts]  # polarity in [-1, 1]
sns.histplot(sentiments, bins=20)
plt.axvline(0, color="red", linestyle="--")  # neutral reference line
B.6 ★★★ | Apply
Build a sentiment-over-time chart with smoothing and zero reference line.
Guidance
import pandas as pd
# assumes dates (datetimes) and sentiments (floats) are already defined
df = pd.DataFrame({"date": dates, "sentiment": sentiments}).set_index("date")
daily = df.resample("D").mean()["sentiment"]
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(daily.index, daily, color="lightgray", alpha=0.5, label="Daily")
ax.plot(daily.index, daily.rolling(7).mean(), color="steelblue", label="7-day MA")
ax.axhline(0, color="gray", linestyle="--")  # neutral sentiment reference
ax.fill_between(daily.index, 0, daily, where=(daily > 0),
                color="green", alpha=0.2)
ax.fill_between(daily.index, 0, daily, where=(daily < 0),
                color="red", alpha=0.2)
ax.legend()
B.7 ★★☆ | Apply
Use spaCy to extract named entities from a paragraph and visualize with displaCy.
Guidance
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
displacy.render(doc, style="ent")  # in a script, use displacy.serve(doc, style="ent")
B.8 ★★★ | Apply
Build a co-occurrence network of the top 30 terms in a corpus, using sentence windows.
Guidance
from itertools import combinations
from collections import Counter
import networkx as nx
# assumes sentences (list of str), tokenize(), and top_30_words
# (set of the 30 most frequent terms, e.g. from B.2's Counter) are defined
cooccur = Counter()
for sentence in sentences:
    toks = set(tokenize(sentence)) & top_30_words
    for a, b in combinations(sorted(toks), 2):
        cooccur[(a, b)] += 1
G = nx.Graph()
for (a, b), c in cooccur.items():
    if c > 5:  # prune rare co-occurrences
        G.add_edge(a, b, weight=c)
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_size=500, edge_color="gray", alpha=0.7)
B.9 ★★☆ | Apply
Create a dispersion plot showing where 5 specific words appear in a novel-length text.
Guidance
targets = ["alice", "rabbit", "queen", "hatter", "cat"]
tokens = re.findall(r"\b\w+\b", text.lower())  # or any tokenizer
fig, ax = plt.subplots(figsize=(12, 3))
for i, word in enumerate(targets):
    positions = [j for j, t in enumerate(tokens) if t == word]
    ax.scatter(positions, [i] * len(positions), marker="|", s=100)
ax.set_yticks(range(len(targets)))
ax.set_yticklabels(targets)
ax.set_xlabel("Token position in text")
B.10 ★★★ | Create
Use UMAP to reduce TF-IDF vectors of a small document collection to 2D and plot with category colors.
Guidance
import umap  # package name on PyPI: umap-learn
vec = TfidfVectorizer(stop_words="english", max_features=500)
X = vec.fit_transform(docs).toarray()
reducer = umap.UMAP(random_state=42)
proj = reducer.fit_transform(X)
# categories must be numeric codes; map string labels first (e.g. pd.factorize)
plt.scatter(proj[:, 0], proj[:, 1], c=categories, cmap="tab10", alpha=0.6)
Part C: Synthesis (4 problems)
C.1 ★★★ | Analyze
Take a corpus of political speeches and compare the vocabulary of two parties using a word-frequency-difference bar chart. What does the chart reveal that a single word cloud would not?
Guidance
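A minimal sketch of such a frequency-difference chart, with hypothetical token lists `party_a_tokens` and `party_b_tokens` standing in for the tokenized speeches:

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical tokens standing in for each party's tokenized speeches
party_a_tokens = ["economy", "tax", "economy", "freedom", "tax", "jobs"]
party_b_tokens = ["economy", "climate", "health", "climate", "jobs", "health"]

def rel_freq(tokens):
    """Relative frequency of each word, so corpora of different sizes compare fairly."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

fa, fb = rel_freq(party_a_tokens), rel_freq(party_b_tokens)
vocab = sorted(set(fa) | set(fb))
diff = {w: fa.get(w, 0) - fb.get(w, 0) for w in vocab}

# Sort so the most A-leaning words sit at one end, B-leaning at the other
ordered = sorted(diff.items(), key=lambda kv: kv[1])
words, deltas = zip(*ordered)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(words, deltas, color=["steelblue" if d > 0 else "firebrick" for d in deltas])
ax.axvline(0, color="gray")  # zero = equally frequent in both parties
ax.set_xlabel("Relative frequency: party A minus party B")
```

Using relative rather than raw frequencies is the key step; otherwise the party with more speeches dominates every bar.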
The difference bar chart shows which words are more common in party A vs. party B directly — the magnitude and direction of each difference are readable. A word cloud would show the most common words for each party separately, but you could not tell which words are distinctive, because the most common words (generic political vocabulary) dominate both clouds. TF-IDF between the two parties would work similarly well.
C.2 ★★★ | Evaluate
A journalist wants to show "what people are saying about climate change" from 10 million tweets. Which visualizations would you recommend, and which would you avoid?
Guidance
**Recommend**: sentiment over time (with smoothing and annotations), topic model + pyLDAvis for themes, TF-IDF comparison between time periods, term frequency bar charts for headline numbers, co-occurrence network for key topics. **Avoid**: raw word clouds (uninformative at scale), scatter plots of individual tweets (million-point hairballs), and anything that treats all 10 million tweets as independent observations without aggregation. Aggregation is mandatory at this scale.
C.3 ★★★ | Create
Build a dashboard of text analysis visualizations for a customer review corpus: top words, TF-IDF by product, sentiment over time, topic model, co-occurrence network. Which insights does each panel contribute?
Guidance
**Top words**: overall vocabulary ("battery", "screen", "price"). **TF-IDF by product**: distinctive features of each product (what customers mention about product X specifically). **Sentiment over time**: trends in satisfaction. **Topic model**: latent themes across reviews (problems, features, comparisons). **Co-occurrence**: which product features are discussed together. Each panel answers a different question; together they form a complete review analysis.
C.4 ★★★ | Evaluate
The chapter argues that text visualization is mostly about choosing the right numeric reduction, and the chart type follows standard principles. Do you agree? Are there visualizations that are genuinely text-specific and cannot be reduced to charts of numbers?
Guidance
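A concordance/KWIC (keyword-in-context) view is the one candidate for a genuinely text-specific display; a minimal sketch (the `kwic` helper and `sample` string are hypothetical):

```python
import re

def kwic(text, keyword, width=30):
    """Return each occurrence of keyword with `width` characters of context,
    aligned so the keyword forms a fixed column down the page."""
    rows = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append(f"{left:>{width}} | {m.group()} | {right:<{width}}")
    return rows

sample = "The cat sat. The cat ran. A dog chased the cat."
for row in kwic(sample, "cat", width=15):
    print(row)
```

Note that the "chart" here is just monospaced text alignment, which is exactly the point made in this answer.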
Mostly agree. Even dispersion plots, word clouds, and co-occurrence networks reduce text to numeric representations before plotting (positions, counts, edge weights). The one arguable exception is **concordance / KWIC views**, where the actual text fragments are shown with highlights — these display raw text alongside position information. Even here, however, the "visualization" is text formatting rather than a chart. The broader point stands: text visualization is about numeric reduction plus standard charts, not a separate category of visual technique.
Chapter 27 moves from specialized data types to the production of publication-quality statistical and scientific figures.