Exercises: Text and NLP Visualization
Install: `pip install wordcloud nltk scikit-learn gensim pyldavis textblob spacy`. For spaCy, also run `python -m spacy download en_core_web_sm`.
Part A: Conceptual (6 problems)
A.1 ★☆☆ | Recall
Name three critiques of word clouds as analytical visualizations.
Guidance
(1) Word size conflates length and frequency — long words dominate regardless of count. (2) Rotation hurts readability and distorts visual weight. (3) Quantitative comparisons between word sizes are imprecise. (Others: arbitrary layout, random color, hard to scan systematically.)
A.2 ★☆☆ | Recall
What does TF-IDF measure?
Guidance
Term Frequency–Inverse Document Frequency. It combines how often a word appears in a document (TF) with how rare the word is across the corpus (IDF). A high TF-IDF score means the word is frequent in this document but unusual in others, making it distinctive for that document.
A.3 ★★☆ | Understand
What is LDA and what does pyLDAvis show?
Guidance
LDA (Latent Dirichlet Allocation) is an unsupervised topic model that assumes each document is a mixture of topics and each topic is a distribution over words. pyLDAvis is an interactive visualization that shows (1) a 2D scatter of topics by similarity, (2) a bar chart of top terms for the selected topic, and (3) an adjustable relevance slider.
A.4 ★★☆ | Understand
Why is a zero reference line important on sentiment-over-time charts?
Guidance
Sentiment has a meaningful midpoint (neutral). Without a zero line, readers cannot tell at a glance whether sentiment is positive or negative. The zero line plus positive/negative fills (green/red) makes the polarity immediately visible.
A.5 ★★★ | Analyze
When is a word cloud an acceptable choice?
Guidance
As decoration (covers, marketing, infographics), when the shape is meaningful (country silhouette, logo), as an attention-getter for general audiences, or as a quick sanity check on preprocessing. Never for quantitative analysis or publication-quality analytical figures.
A.6 ★★★ | Evaluate
A colleague visualizes sentiment about a brand with a single line chart (no smoothing, no zero line, no annotations). What do you suggest?
Guidance
Add a zero reference line. Add smoothing (7- or 30-day rolling mean), because raw daily sentiment is noisy. Add positive/negative fills to make polarity visible. Annotate key events (product launches, PR incidents) to give context. Consider the number of underlying observations per day — sparse days have unreliable sentiment and should be flagged.
Part B: Applied (10 problems)
B.1 ★☆☆ | Apply
Create a basic word cloud from a short text.
Guidance
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "the quick brown fox jumps over the lazy dog " * 50
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")  # bilinear smooths the rendered glyphs
plt.axis("off")
plt.show()
B.2 ★☆☆ | Apply
Count the top 20 words in a corpus and plot a horizontal bar chart.
Guidance
from collections import Counter
import re
tokens = re.findall(r"\b\w+\b", text.lower())  # text from B.1, or any corpus string
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "for"}
filtered = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
top = Counter(filtered).most_common(20)
words, counts = zip(*top)
plt.barh(words, counts)
plt.gca().invert_yaxis()  # most frequent word at the top
B.3 ★★☆ | Apply
Compute TF-IDF scores for a corpus of 5 documents and plot the top 10 words per document.
Guidance
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["doc one text", "doc two text", "doc three text", "doc four text", "doc five text"]
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
features = vec.get_feature_names_out()
for i in range(len(docs)):
    scores = tfidf[i].toarray()[0]
    top_idx = scores.argsort()[::-1][:10]
    print([features[j] for j in top_idx])
B.4 ★★☆ | Apply
Fit an LDA topic model with 5 topics and visualize it with pyLDAvis.
Guidance
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn  # in pyLDAvis >= 3.4 this module is pyLDAvis.lda_model
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=42).fit(X)
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, X, vec)
B.5 ★★☆ | Apply
Compute TextBlob sentiment for a list of texts and plot the distribution.
Guidance
from textblob import TextBlob
import seaborn as sns
texts = ["I love it", "It's okay", "I hate it", ...]
sentiments = [TextBlob(t).sentiment.polarity for t in texts]  # polarity in [-1, 1]
sns.histplot(sentiments, bins=20)
plt.axvline(0, color="red", linestyle="--")  # neutral reference line
B.6 ★★★ | Apply
Build a sentiment-over-time chart with smoothing and zero reference line.
Guidance
import pandas as pd
# assumes dates (datetimes) and sentiments (floats) are already defined
df = pd.DataFrame({"date": dates, "sentiment": sentiments}).set_index("date")
daily = df.resample("D").mean()["sentiment"]
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(daily.index, daily, color="lightgray", alpha=0.5, label="Daily")
ax.plot(daily.index, daily.rolling(7).mean(), color="steelblue", label="7-day MA")
ax.axhline(0, color="gray", linestyle="--")  # neutral sentiment reference
ax.fill_between(daily.index, 0, daily, where=(daily > 0),
                color="green", alpha=0.2)
ax.fill_between(daily.index, 0, daily, where=(daily < 0),
                color="red", alpha=0.2)
ax.legend()
B.7 ★★☆ | Apply
Use spaCy to extract named entities from a paragraph and visualize with displaCy.
Guidance
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
displacy.render(doc, style="ent")  # in a script, use displacy.serve(doc, style="ent")
B.8 ★★★ | Apply
Build a co-occurrence network of the top 30 terms in a corpus, using sentence windows.
Guidance
from itertools import combinations
from collections import Counter
import networkx as nx
# assumes sentences (list of str), tokenize(), and top_30_words
# (set of the 30 most frequent terms, e.g. from B.2's Counter) are defined
cooccur = Counter()
for sentence in sentences:
    toks = set(tokenize(sentence)) & top_30_words
    for a, b in combinations(sorted(toks), 2):
        cooccur[(a, b)] += 1
G = nx.Graph()
for (a, b), c in cooccur.items():
    if c > 5:  # prune rare co-occurrences
        G.add_edge(a, b, weight=c)
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_size=500, edge_color="gray", alpha=0.7)
B.9 ★★☆ | Apply
Create a dispersion plot showing where 5 specific words appear in a novel-length text.
Guidance
targets = ["alice", "rabbit", "queen", "hatter", "cat"]
tokens = re.findall(r"\b\w+\b", text.lower())  # or any tokenizer
fig, ax = plt.subplots(figsize=(12, 3))
for i, word in enumerate(targets):
    positions = [j for j, t in enumerate(tokens) if t == word]
    ax.scatter(positions, [i] * len(positions), marker="|", s=100)
ax.set_yticks(range(len(targets)))
ax.set_yticklabels(targets)
ax.set_xlabel("Token position in text")
B.10 ★★★ | Create
Use UMAP to reduce TF-IDF vectors of a small document collection to 2D and plot with category colors.
Guidance
import umap  # package name on PyPI: umap-learn
vec = TfidfVectorizer(stop_words="english", max_features=500)
X = vec.fit_transform(docs).toarray()
reducer = umap.UMAP(random_state=42)
proj = reducer.fit_transform(X)
# categories must be numeric codes; map string labels first (e.g. pd.factorize)
plt.scatter(proj[:, 0], proj[:, 1], c=categories, cmap="tab10", alpha=0.6)
Part C: Synthesis (4 problems)
C.1 ★★★ | Analyze
Take a corpus of political speeches and compare the vocabulary of two parties using a word-frequency-difference bar chart. What does the chart reveal that a single word cloud would not?
Guidance
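A minimal sketch of such a frequency-difference chart, with hypothetical token lists `party_a_tokens` and `party_b_tokens` standing in for the tokenized speeches:

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical tokens standing in for each party's tokenized speeches
party_a_tokens = ["economy", "tax", "economy", "freedom", "tax", "jobs"]
party_b_tokens = ["economy", "climate", "health", "climate", "jobs", "health"]

def rel_freq(tokens):
    """Relative frequency of each word, so corpora of different sizes compare fairly."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

fa, fb = rel_freq(party_a_tokens), rel_freq(party_b_tokens)
vocab = sorted(set(fa) | set(fb))
diff = {w: fa.get(w, 0) - fb.get(w, 0) for w in vocab}

# Sort so the most A-leaning words sit at one end, B-leaning at the other
ordered = sorted(diff.items(), key=lambda kv: kv[1])
words, deltas = zip(*ordered)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(words, deltas, color=["steelblue" if d > 0 else "firebrick" for d in deltas])
ax.axvline(0, color="gray")  # zero = equally frequent in both parties
ax.set_xlabel("Relative frequency: party A minus party B")
```

Using relative rather than raw frequencies is the key step; otherwise the party with more speeches dominates every bar.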
The difference bar chart shows which words are more common in party A vs. party B directly — the magnitude and direction of each difference are readable. A word cloud would show the most common words for each party separately, but you could not tell which words are distinctive, because the most common words (generic political vocabulary) dominate both clouds. TF-IDF between the two parties would work similarly well.
C.2 ★★★ | Evaluate
A journalist wants to show "what people are saying about climate change" from 10 million tweets. Which visualizations would you recommend, and which would you avoid?
Guidance
**Recommend**: sentiment over time (with smoothing and annotations), topic model + pyLDAvis for themes, TF-IDF comparison between time periods, term frequency bar charts for headline numbers, co-occurrence network for key topics. **Avoid**: raw word clouds (uninformative at scale), scatter plots of individual tweets (million-point hairballs), and anything that treats all 10 million tweets as independent observations without aggregation. Aggregation is mandatory at this scale.
C.3 ★★★ | Create
Build a dashboard of text analysis visualizations for a customer review corpus: top words, TF-IDF by product, sentiment over time, topic model, co-occurrence network. Which insights does each panel contribute?
Guidance
**Top words**: overall vocabulary ("battery", "screen", "price"). **TF-IDF by product**: distinctive features of each product (what customers mention about product X specifically). **Sentiment over time**: trends in satisfaction. **Topic model**: latent themes across reviews (problems, features, comparisons). **Co-occurrence**: which product features are discussed together. Each panel answers a different question; together they form a complete review analysis.
C.4 ★★★ | Evaluate
The chapter argues that text visualization is mostly about choosing the right numeric reduction, and the chart type follows standard principles. Do you agree? Are there visualizations that are genuinely text-specific and cannot be reduced to charts of numbers?
Guidance
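A concordance/KWIC (keyword-in-context) view is the one candidate for a genuinely text-specific display; a minimal sketch (the `kwic` helper and `sample` string are hypothetical):

```python
import re

def kwic(text, keyword, width=30):
    """Return each occurrence of keyword with `width` characters of context,
    aligned so the keyword forms a fixed column down the page."""
    rows = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append(f"{left:>{width}} | {m.group()} | {right:<{width}}")
    return rows

sample = "The cat sat. The cat ran. A dog chased the cat."
for row in kwic(sample, "cat", width=15):
    print(row)
```

Note that the "chart" here is just monospaced text alignment, which is exactly the point made in this answer.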
Mostly agree. Even dispersion plots, word clouds, and co-occurrence networks reduce text to numeric representations before plotting (positions, counts, edge weights). The one arguable exception is **concordance / KWIC views**, where the actual text fragments are shown with highlights — these display raw text alongside position information. Even here, however, the "visualization" is text formatting rather than a chart. The broader point stands: text visualization is about numeric reduction plus standard charts, not a separate category of visual technique.
Chapter 27 moves from specialized data types to the production of publication-quality statistical and scientific figures.