Quiz: Text and NLP Visualization


Part I: Multiple Choice (10 questions)

Q1. Which Python library is the de facto standard for creating word clouds?

A) nltk B) wordcloud C) matplotlib D) seaborn

**Answer: B.** The `wordcloud` library by Andreas Mueller is the standard. It handles tokenization, stopword removal, layout, and rendering.

Q2. TF-IDF stands for:

A) Token Frequency–Inverse Document Frequency B) Term Frequency–Inverse Document Frequency C) Token Frequency–Inferred Document Frequency D) Term Frequency–Internal Document Frequency

**Answer: B.** Term Frequency–Inverse Document Frequency. The TF part measures how often a word appears in a document; the IDF part measures how rare the word is across the corpus.
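The definition above can be sketched in plain Python. This is a minimal, standard-library implementation of the unsmoothed textbook variant (TF as relative frequency, IDF as log of N over document frequency); production libraries such as scikit-learn use smoothed variants, so exact scores will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for a list of tokenized documents."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
scores = tf_idf(docs)
# "the" appears in every document, so its IDF (and TF-IDF) is zero;
# document-specific words like "sat" score highest
```

Note how the everywhere-word "the" is zeroed out while the rarest words rise to the top, which is exactly the "distinctiveness" behavior described above.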

Q3. Which visualization is the standard for exploring LDA topic models?

A) seaborn heatmap B) pyLDAvis C) matplotlib bar chart D) NetworkX graph

**Answer: B.** pyLDAvis provides an interactive 2D scatter of topics plus adjustable term relevance bars. Created by Sievert and Shirley in 2014.

Q4. Why is a zero reference line important on sentiment-over-time charts?

A) It aligns with the x-axis B) Sentiment has a meaningful midpoint (neutral) and the line makes polarity visible at a glance C) Matplotlib requires it D) It helps with grid alignment

**Answer: B.** Without a zero line, the reader cannot immediately tell whether sentiment is positive or negative. The zero line plus positive/negative fills make polarity obvious.

Q5. Which chart type is strictly better than a word cloud for comparing word frequencies quantitatively?

A) Pie chart B) Horizontal bar chart C) Heatmap D) Scatter plot

**Answer: B.** A horizontal bar chart of top-N words sorted by frequency gives exact values, supports quantitative comparison, and is easier to read than a word cloud.
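In practice you would draw this with matplotlib's `ax.barh`; as a library-free sketch of the same principle (top-N words, sorted, bar length proportional to count), using only the standard library:

```python
from collections import Counter

def ascii_bar_chart(tokens, top_n=5, width=40):
    """Return a text-mode horizontal bar chart of top-N word frequencies."""
    counts = Counter(tokens).most_common(top_n)
    if not counts:
        return []
    max_count = counts[0][1]
    rows = []
    for word, count in counts:
        bar = "#" * round(width * count / max_count)
        rows.append(f"{word:>10} | {bar} {count}")
    return rows

tokens = ["data"] * 8 + ["chart"] * 5 + ["word"] * 3 + ["cloud"] * 2
for row in ascii_bar_chart(tokens):
    print(row)
```

Unlike a word cloud, every value here can be read off exactly, and the sorted layout makes ranking instantaneous.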

Q6. A co-occurrence network has:

A) Nodes as documents, edges as similarity B) Nodes as words, edges indicating they appear together C) Nodes as topics, edges as shared terms D) Nodes as sentences, edges as references

**Answer: B.** Each node is a word; each edge represents co-occurrence (within a window, sentence, or document). Edge weights typically encode co-occurrence frequency.
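Building the edge list is straightforward with the standard library; a minimal sketch using sentence-level co-occurrence (the window and sentence variants only change how tokens are grouped):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences):
    """Count word pairs that appear together within the same sentence."""
    edges = Counter()
    for sent in sentences:
        # sorted() gives a canonical key so (a, b) and (b, a) merge
        for pair in combinations(sorted(set(sent)), 2):
            edges[pair] += 1
    return edges

sentences = [
    ["topic", "model", "lda"],
    ["topic", "model", "nmf"],
    ["word", "cloud"],
]
edges = cooccurrence_edges(sentences)
# ("model", "topic") co-occurs in two sentences; that count is the edge weight
```

From here, each `(pair, weight)` entry can be handed to a graph library, e.g. NetworkX's `G.add_edge(u, v, weight=w)`, for layout and rendering.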

Q7. Which spaCy tool renders named entities inline with text?

A) displaCy B) pyLDAvis C) bertviz D) wordcloud

**Answer: A.** displaCy is spaCy's built-in visualizer for entities and dependency structure. `displacy.render(doc, style="ent")` produces HTML with entities highlighted.

Q8. A dispersion plot shows:

A) Sentiment distribution B) Where specific words appear in a document C) Word embedding dimensions D) Topic prevalence

**Answer: B.** Dispersion plots show the positions of target words within a document as vertical marks on horizontal lines. Useful for corpus linguistics and literary analysis.
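The data behind a dispersion plot is just the token offsets of each target word; a minimal sketch of extracting them with the standard library:

```python
def dispersion_positions(tokens, targets):
    """Map each target word to its token offsets within the document."""
    positions = {t: [] for t in targets}
    for i, tok in enumerate(tokens):
        if tok in positions:
            positions[tok].append(i)
    return positions

tokens = "the whale surfaced and the whale dived near the ship".split()
pos = dispersion_positions(tokens, ["whale", "ship"])
# pos["whale"] -> [1, 5]; pos["ship"] -> [9]
```

These offsets are what a plotting step (e.g. NLTK's `Text.dispersion_plot`) draws as vertical marks along one horizontal line per target word.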

Q9. Which dimensionality reduction technique is the modern alternative to t-SNE for word embeddings?

A) PCA B) UMAP C) LDA D) SVD

**Answer: B.** UMAP (Uniform Manifold Approximation and Projection) is newer, faster, and often preserves more global structure than t-SNE.

Q10. Which step is almost always necessary in text preprocessing?

A) Stopword removal B) Upper-casing C) Adding punctuation D) Splitting characters

**Answer: A.** Without stopword removal, common function words like "the" and "of" dominate any frequency analysis and drown out meaningful patterns.

Part II: Short Answer (10 questions)

Q11. Explain the main criticism of word clouds as analytical visualizations.

**Answer:** Word size conflates word length with frequency (long words appear prominent regardless of count), rotation hurts readability, quantitative comparisons are imprecise, the layout is arbitrary so scanning is inefficient, and color is usually random and non-informative. A horizontal bar chart of top-N words is strictly better for analysis.

Q12. Write pandas/Counter code to count the top 20 words in a text after stopword removal.

**Answer:**

```python
from collections import Counter
import re

# minimal stopword set for illustration; in practice use a fuller list
# (e.g. NLTK's stopwords corpus or sklearn's ENGLISH_STOP_WORDS)
STOPWORDS = {"the", "a", "of", "to", "in", "and"}

tokens = re.findall(r"\b\w+\b", text.lower())
filtered = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
top = Counter(filtered).most_common(20)
```

Q13. What is the difference between raw term frequency and TF-IDF, and when is each appropriate?

**Answer:** Raw term frequency counts how often each word appears. TF-IDF adjusts for how rare the word is across the corpus — frequent-everywhere words get downweighted, and document-specific words get upweighted. Use raw frequency for "what words dominate this text?" and TF-IDF for "what makes this text distinctive compared to others?"

Q14. Describe the three components of a pyLDAvis visualization.

**Answer:** (1) A 2D scatter plot of topics where proximity indicates similarity. (2) A bar chart of the top terms for the selected topic. (3) An adjustable relevance slider (λ) that balances topic-specific terms vs. overall frequent terms.
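The λ slider implements the relevance measure from Sievert and Shirley's paper: relevance = λ·log p(w|topic) + (1−λ)·log(p(w|topic)/p(w)). A minimal sketch, with the probability values purely hypothetical:

```python
import math

def relevance(p_w_topic, p_w, lam):
    """Relevance of a term to a topic (Sievert & Shirley, 2014)."""
    # lam = 1 ranks by p(w | topic); lam = 0 ranks by lift p(w | topic) / p(w)
    return lam * math.log(p_w_topic) + (1 - lam) * math.log(p_w_topic / p_w)

# hypothetical probabilities: a globally common word vs. a topic-specific one
common_word = relevance(0.05, 0.04, lam=1.0)     # high p(w|topic), low lift
specific_word = relevance(0.02, 0.001, lam=1.0)  # lower p(w|topic), high lift
# at lam = 1 the common word ranks higher; lowering lam promotes the specific word
```

This is why dragging the slider toward 0 surfaces terms that are distinctive to the topic rather than merely frequent in it.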

Q15. Write code to compute daily sentiment from a DataFrame of posts with date and text columns.

**Answer:**

```python
import pandas as pd
from textblob import TextBlob

# polarity is in [-1, 1]; 0 is neutral
df["sentiment"] = df["text"].apply(lambda t: TextBlob(t).sentiment.polarity)

# resample("D") requires a DatetimeIndex
df["date"] = pd.to_datetime(df["date"])
daily = df.set_index("date")["sentiment"].resample("D").mean()
```

Q16. Why are radar charts a legitimate choice for visualizing emotion profiles?

**Answer:** Radar charts are usually weak for quantitative analysis, but emotions are a natural categorical set where the circular layout matches the intuition that emotions are related categories. The shape of the polygon gives a memorable "emotional fingerprint" that can be compared across documents. For emotion specifically, the radar chart's weaknesses matter less than its visual memorability.

Q17. What are three ethical considerations when visualizing text data?

**Answer:** Any three of: (1) **Privacy** — even public text can identify individuals; consider aggregation. (2) **Representativeness** — text corpora are biased, and visualizations inherit that bias; disclose it. (3) **Amplification** — visualizing extremist or harmful content can inadvertently amplify it. (4) **Over-interpretation** — sentiment scores are model outputs, not ground truth about human emotions. (5) **Consent and scraping** — whether the text was obtained legitimately matters.

Q18. What is a scattertext chart, and when is it useful?

**Answer:** Scattertext is a specialized visualization library by Jason Kessler that plots word-level differences between two text categories. Words appear as points with coordinates representing their frequency in each category; distinctive words are labeled. Useful for comparing two corpora (Republican vs. Democrat speeches, positive vs. negative reviews).

Q19. Describe a typical pipeline for visualizing word embeddings.

**Answer:** (1) Obtain word vectors (Word2Vec, GloVe, BERT). (2) Select a subset of words for visualization (too many makes the plot unreadable). (3) Reduce dimensions to 2D with t-SNE or UMAP. (4) Scatter-plot the 2D positions with word labels. (5) Disclose the reduction parameters (perplexity, neighbors, random seed) since the layout depends on them. (6) Treat the result as qualitative — global distances in 2D are not meaningful, only local similarity is.

Q20. Explain why the chapter says "text visualization is mostly about numeric reduction."

**Answer:** Text cannot be visualized directly; you first reduce it to numbers (word frequencies, TF-IDF scores, topic proportions, sentiment scores, embedding coordinates) and then visualize the numbers with standard chart types (bar charts, line charts, scatter plots, heatmaps, networks). The text-specific work is in the reduction, not the chart. Choosing the right reduction is the design decision; the chart type follows.

Scoring Rubric

| Score | Level | Meaning |
|-------|-------|---------|
| 18–20 | Mastery | You understand the main text visualization techniques and their limitations. |
| 14–17 | Proficient | You know the basics; review topic models and sentiment visualization. |
| 10–13 | Developing | You grasp the preprocessing pipeline; re-read Sections 26.4–26.8. |
| < 10 | Review | Re-read the full chapter. |

After this quiz, move on to Chapter 27 (Statistical and Scientific Visualization).