Quiz: Relational and Categorical Visualization

DataField.Dev

20 questions. Aim for mastery (18+).

Multiple Choice (10 questions)

1. The chapter's "dynamite plot critique" argues that:

(a) Bar charts are never appropriate (b) Bar charts with error bars hide the underlying distribution of data points (c) Error bars should always be omitted (d) Bar charts should use 3D effects

Answer

**(b)** Bar charts with error bars hide the underlying distribution of data points. A dynamite plot reduces the data to two numbers (mean and spread), hiding the sample size, shape, individual observations, and outliers. The alternative is to show the raw data (strip or swarm) alongside a summary (box or violin).

2. The difference between `sns.barplot` and matplotlib's `ax.bar` is:

(a) They are the same function (b) `sns.barplot` automatically computes an aggregate (mean by default) from raw data and plots it with an error bar; `ax.bar` plots bars at specified heights (c) `ax.bar` is faster (d) `sns.barplot` only works with integers

Answer

**(b)** `sns.barplot` automatically computes an aggregate (mean by default) from raw data and plots it with an error bar; `ax.bar` plots bars at specified heights. The seaborn version is a statistical plotting function; the matplotlib version is a plotting-what-you-give-it function.
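The contrast can be sketched with a tiny made-up dataset (the `raw` frame and its column names are assumptions for illustration). With `ax.bar` you compute the heights yourself; with `sns.barplot` you hand over the raw rows and seaborn aggregates them.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical raw data: three observations per group
raw = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(8, 3))

# matplotlib: aggregate first, then plot exactly the heights you pass in
means = raw.groupby("group", sort=False)["value"].mean()
ax_mpl.bar(means.index, means.values)

# seaborn: pass the raw rows; the mean and an error bar are computed for you
sns.barplot(data=raw, x="group", y="value", ax=ax_sns)

mpl_heights = [p.get_height() for p in ax_mpl.patches]
sns_heights = [p.get_height() for p in ax_sns.patches]
```

Both panels end up with bars at the group means; only the seaborn panel also carries error bars.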

3. To fit and overlay a regression line on a scatter plot in seaborn, you use:

(a) `sns.scatterplot(..., regression=True)` (b) `sns.regplot(data=df, x="x", y="y")` (c) `sns.fit_line(df)` (d) `sns.plot(..., fit="linear")`

Answer

**(b)** `sns.regplot(data=df, x="x", y="y")`. `regplot` is the axes-level function for scatter + regression. `lmplot` is the figure-level version that also supports faceting via `col` and `row`.
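A minimal sketch of the two functions side by side, using synthetic data (the `site` column is a hypothetical faceting variable added for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: a noisy linear relationship, split across two made-up sites
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.linspace(0, 10, 40)})
df["y"] = 2.0 * df["x"] + rng.normal(0, 1, len(df))
df["site"] = np.where(np.arange(len(df)) % 2 == 0, "north", "south")

# Axes-level: draws scatter + fitted line onto an Axes you control
fig, ax = plt.subplots()
sns.regplot(data=df, x="x", y="y", ax=ax)

# Figure-level: creates its own figure and facets by the `site` column
g = sns.lmplot(data=df, x="x", y="y", col="site")
```

`regplot` fits into an existing matplotlib layout, while `lmplot` returns a grid object with one panel per level of `col`.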

4. The automatic aggregation in `sns.lineplot` computes:

(a) The sum of y values for each x (b) The mean of y values for each x, with a bootstrap 95% confidence interval by default (c) The median of y values (d) All y values without aggregation

Answer

**(b)** The mean of y values for each x, with a bootstrap 95% confidence interval by default. When there are multiple observations per x-value, seaborn aggregates automatically. To disable aggregation, pass `estimator=None`. To change the aggregator, pass `estimator=np.median` or similar.

5. Which chart type is best for showing every observation in a group comparison?

(a) Bar plot (b) Point plot (c) Strip plot or swarm plot (d) Count plot

Answer

**(c)** Strip plot or swarm plot. Both show every individual observation. Strip uses jitter to prevent exact overlap; swarm uses an algorithm for non-overlapping placement. For showing every data point, these are the right choice.
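As a rough illustration with invented values, both functions draw one marker per row; they differ only in how overlapping markers are spread out.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data with repeated values, so markers would overlap without help
df = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 6,
    "value": [1, 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6],
})

fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))

# Strip: random horizontal jitter
sns.stripplot(data=df, x="group", y="value", jitter=0.2, ax=ax_strip)

# Swarm: deterministic placement that avoids overlap
sns.swarmplot(data=df, x="group", y="value", ax=ax_swarm)

# Count the markers actually drawn in each panel
n_strip = sum(len(c.get_offsets()) for c in ax_strip.collections)
n_swarm = sum(len(c.get_offsets()) for c in ax_swarm.collections)
```

Each panel contains twelve markers, one for every observation in `df`.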

6. The `split=True` parameter on `sns.violinplot` is useful for:

(a) Splitting the x-axis into multiple panels (b) Comparing two levels of a hue variable by putting one on each side of each violin (c) Splitting the data into training and test sets (d) Splitting the figure into subplots

Answer

**(b)** Comparing two levels of a hue variable by putting one on each side of each violin. This is useful for direct pairwise comparison, such as male vs. female body mass within each species.
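A short sketch of a split violin, using synthetic body-mass values rather than a real dataset (the numbers and column names are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: two species, two sexes, 40 animals per combination
rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({
    "species": np.repeat(["adelie", "gentoo"], 2 * n),
    "sex": np.tile(np.repeat(["female", "male"], n), 2),
})
base = df["species"].map({"adelie": 3700, "gentoo": 5000})
shift = df["sex"].map({"female": -300, "male": 300})
df["body_mass_g"] = base + shift + rng.normal(0, 250, len(df))

# One violin per species; each half shows one level of the hue variable
ax = sns.violinplot(data=df, x="species", y="body_mass_g",
                    hue="sex", split=True)
```

`split=True` is only meaningful when the hue variable has exactly two levels, as it does here.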

7. For a small sample (8 observations per group), the recommended categorical chart is:

(a) Violin plot (b) Strip plot or swarm plot (c) Kernel density estimate (d) Box plot only

Answer

**(b)** Strip plot or swarm plot. With 8 observations, KDE-based charts (violin, KDE) are unreliable because there is not enough data to estimate a distribution. Strip and swarm plots show each individual point honestly. Box plots work but hide the individual observations that matter most at small sample sizes.

8. The figure-level function that wraps relational chart types with faceting is:

(a) `sns.relplot` (b) `sns.displot` (c) `sns.catplot` (d) `sns.facet`

Answer

**(a)** `sns.relplot`. It wraps `scatterplot` and `lineplot` (selected via `kind="scatter"` or `kind="line"`) and supports faceting via `col` and `row`. For categorical charts, use `catplot`. For distributional, `displot`.

9. The `sns.regplot` chart fits a regression line. To fit a polynomial regression (e.g., quadratic), you pass:

(a) `polynomial=True` (b) `order=2` (c) `fit_type="polynomial"` (d) `kind="quadratic"`

Answer

**(b)** `order=2`. The `order` parameter specifies the polynomial degree. `order=1` is linear (default); `order=2` is quadratic; `order=3` is cubic; and so on. For logistic regression, use `logistic=True` instead.
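One way to check what `order=2` does is to feed it noise-free quadratic data and read back the fitted curve. This is a sketch on synthetic data; `ci=None` only switches off the confidence band to keep it fast.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, noise-free quadratic: y = x^2
x = np.linspace(-3, 3, 25)
df = pd.DataFrame({"x": x, "y": x ** 2})

fig, ax = plt.subplots()
sns.regplot(data=df, x="x", y="y", order=2, ci=None, ax=ax)

# The fitted curve is the only Line2D on the axes (scatter points are a collection)
fit_x, fit_y = ax.lines[0].get_data()
```

With a quadratic fit the curve passes through the data exactly; with the default `order=1` it would be a straight line that cannot follow the curvature.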

10. The threshold concept of this chapter is:

(a) Always use bar charts for comparisons (b) Show the data, not just the summary (c) Seaborn is better than matplotlib (d) Error bars are unnecessary

Answer

**(b)** Show the data, not just the summary. The dynamite plot critique is the clearest expression of this principle: a bar chart with error bars hides information that a strip+box combination preserves. Once you see the difference, you cannot unsee it.

True / False (5 questions)

11. "`sns.lineplot` with multiple observations per x will draw individual lines for each observation by default."

Answer

**False.** With multiple observations per x, `sns.lineplot` aggregates: it computes the mean and draws a confidence band rather than drawing individual lines. To disable aggregation, set `estimator=None`. To draw one line per sampling unit (for example, per subject), pass the identifier column to `units=` together with `estimator=None`; seaborn raises an error if `units` is given while an estimator is still active.

12. "`sns.barplot` and `sns.countplot` are the same function."

Answer

**False.** `sns.barplot` plots a bar for the aggregated (mean by default) value of a y variable for each category. `sns.countplot` plots a bar for the count of observations in each category (no y-variable needed). Different purposes: `barplot` shows summary statistics, `countplot` shows frequencies.

13. "The `order` parameter is needed whenever you want non-alphabetical category ordering."

Answer

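**True in practice.** seaborn has no way of knowing which order is meaningful. For plain string columns it uses the order in which the levels first appear in the data (not alphabetical order); for a pandas `Categorical` column it uses the category order. For meaningful orderings (days of week, months of year, severity levels), pass `order=[list]` explicitly or convert the column to an ordered `Categorical`. This applies to all categorical functions and is one of the most commonly overlooked seaborn parameters.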
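A minimal sketch of passing an explicit `order`, using made-up visit counts whose rows arrive in an arbitrary day order:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: rows are not in weekday order
df = pd.DataFrame({
    "day": ["Wed", "Fri", "Mon", "Fri", "Mon", "Wed"],
    "visits": [12, 30, 8, 28, 10, 15],
})

day_order = ["Mon", "Wed", "Fri"]  # the meaningful order, stated explicitly

fig, ax = plt.subplots()
sns.stripplot(data=df, x="day", y="visits", order=day_order, ax=ax)

fig.canvas.draw()  # make sure tick labels are populated before reading them
labels = [t.get_text() for t in ax.get_xticklabels()]
```

The x-axis follows `day_order` regardless of how the rows are arranged in the frame.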

14. "The strip plot + box plot combination is the chapter's recommended alternative to the dynamite plot."

Answer

**True.** The chapter recommends combining strip plots (showing every point) with box plots (showing the summary) as the alternative to bar charts with error bars. The strip plot shows the raw data that the bar+error bar hides. Other combinations like strip+violin or swarm+box are also appropriate.

15. "The `estimator` parameter on `sns.lineplot` only accepts `np.mean` or `np.median`."

Answer

**False.** `estimator` accepts any callable that takes an array and returns a scalar. You can use `np.mean`, `np.median`, custom functions, or even lambda expressions. For example, `estimator=lambda x: np.percentile(x, 90)` computes the 90th percentile.

Short Answer (3 questions)

16. Explain the "dynamite plot" critique in three to four sentences and describe the recommended alternative.

Answer

A dynamite plot is a bar chart where each bar represents a group's mean (or median) with an error bar showing some measure of spread (SE, SD, or CI). The problem is that this reduces the data to two numbers and hides everything else: the sample size, the distribution shape, individual observations, and outliers. Two dramatically different datasets can produce identical dynamite plots. The chapter recommends a strip plot (showing every observation) overlaid with a box plot or violin plot (showing the distribution summary), so the reader sees both the raw data and the summary.
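The claim that very different datasets can share one dynamite plot is easy to check numerically. The sketch below uses two invented groups, rescaled to the same mean and standard deviation; `rescale` is a helper written for this illustration, not a library function.

```python
import numpy as np

def rescale(values, mean, sd):
    """Linearly rescale values to have exactly the given mean and sample SD."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std(ddof=1)
    return mean + sd * z

# Hypothetical groups: one roughly bell-shaped, one split into two clusters
bell = rescale([1, 2, 2, 3, 3, 3, 3, 4, 4, 5], mean=10, sd=2)
bimodal = rescale([1, 1, 1, 1, 1, 5, 5, 5, 5, 5], mean=10, sd=2)

def bar_and_error(v):
    """The two numbers a dynamite plot shows: the mean and the standard error."""
    return v.mean(), v.std(ddof=1) / np.sqrt(len(v))

bell_summary = bar_and_error(bell)
bimodal_summary = bar_and_error(bimodal)
```

Both groups give the same bar height and the same error bar, yet the second group has no observations near its own mean. A strip plot would show that immediately.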

17. Describe the three subfamilies of categorical charts covered in this chapter and give an example chart from each.

Answer

The three subfamilies are: **showing every point** (`sns.stripplot`, `sns.swarmplot`) — best when sample size is small and you want to see individual observations; **showing distributions** (`sns.boxplot`, `sns.violinplot`) — best when sample size is moderate to large and you want to see the summary statistics and shape; **showing summaries** (`sns.barplot`, `sns.pointplot`, `sns.countplot`) — best for quick aggregate comparisons, though the dynamite plot critique applies. Each answers a different question; the choice depends on the sample size, the audience, and what specific aspect of the data you want to show.

18. Explain the difference between a confidence interval (`ci`) and a standard error of the mean (`se`) in the context of `sns.lineplot` error bars, and describe when each is appropriate.

Answer

A **confidence interval** (e.g., `("ci", 95)`) represents the range within which the true population parameter is likely to fall; seaborn computes it via bootstrap resampling. A 95% CI means "if we repeated the sampling many times, 95% of the CIs would contain the true mean." A **standard error** (`("se", 1)`) is one standard error of the mean: `SE = SD / sqrt(n)`. For reasonably large samples, a ±1 SE bar is shorter than the 95% CI by a factor of ~1.96, and it describes how precisely the mean was estimated rather than a probability range. Use CI to show uncertainty in the estimate (the most common choice for publication). Use SE when your field's convention is to report precision without a probability framing (common in neuroscience). Avoid using them interchangeably: they are not the same size and do not have the same interpretation.
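The size difference can be worked through by hand. The sketch below uses a synthetic sample and a plain percentile bootstrap; it illustrates the idea and is not seaborn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=30)  # hypothetical measurements
n = len(sample)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)            # SE = SD / sqrt(n)

# Normal-approximation 95% CI: mean +/- 1.96 * SE
normal_ci = (mean - 1.96 * se, mean + 1.96 * se)

# Percentile bootstrap 95% CI: resample with replacement, take the
# 2.5th and 97.5th percentiles of the resampled means
boot_means = np.array([
    rng.choice(sample, size=n, replace=True).mean() for _ in range(2000)
])
boot_lo, boot_hi = np.percentile(boot_means, [2.5, 97.5])

se_bar_width = 2 * se                 # full width of a +/- 1 SE bar
boot_ci_width = boot_hi - boot_lo     # full width of the bootstrap CI
```

The bootstrap interval comes out close to the normal-approximation interval, and both are roughly twice as wide as the ±1 SE bar, which is why the two kinds of error bar should always be labeled.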

Applied Scenarios (2 questions)

19. A colleague produces the following chart comparing drug treatment effects:

```python
sns.barplot(data=trial, x="treatment", y="outcome")
```

With n=15 patients per group, identify three problems with this chart and write a better version that addresses them.

Answer

**Problems:**

1. **Dynamite plot**: the bar+error bar hides the individual observations. Readers cannot see the sample size, the distribution shape, or outliers.
2. **Default error bar ambiguity**: without specification, readers may not know whether the error bar is SE, SD, or 95% CI. seaborn's default is 95% CI, but this is not always clear.
3. **Small sample size not shown**: 15 patients per group is small enough that showing individual observations would be informative. The bar chart hides this fact.

**Better version:**

```python
import matplotlib.pyplot as plt
import seaborn as sns

# `trial` is the DataFrame from the question (columns: treatment, outcome)
fig, ax = plt.subplots(figsize=(8, 5))

# Individual observations via strip plot
sns.stripplot(data=trial, x="treatment", y="outcome", size=8, alpha=0.7, color="black", ax=ax)

# Box plot for summary (with transparent fill)
sns.boxplot(data=trial, x="treatment", y="outcome", showfliers=False,
            boxprops=dict(alpha=0.3), ax=ax)

# Mean overlay with 95% CI; no connecting line, because treatments are
# unordered categories (linestyle="none" needs seaborn >= 0.13; on older
# versions use join=False instead)
sns.pointplot(data=trial, x="treatment", y="outcome",
              estimator="mean", errorbar=("ci", 95),
              markers="_", linestyle="none", color="#d62728", ax=ax)

ax.set_title("Treatment Effect on Outcome (n=15 per group)")
plt.show()
```

This shows every patient (strip), the distribution summary (box), and the mean with 95% CI (point). Sample size is visible in the strip plot count. Error bar type is explicit.

20. You have a dataset with columns `month`, `category`, and `sales` (12 months × 5 categories × ~30 observations per combination). Write seaborn code to visualize how sales differ across categories and how the pattern changes across months. Use a figure-level function for faceting.

Answer

```python
import seaborn as sns
sns.set_theme(style="whitegrid", context="notebook")

# `df` is the dataset from the question (columns: month, category, sales)

# Option 1: Box plot faceted by month (12 panels)
g = sns.catplot(
    data=df,
    x="category",
    y="sales",
    col="month",
    col_wrap=4,
    kind="box",
    height=3,
    aspect=1,
    showfliers=False,
)

# Option 2: Line plot of monthly means by category (single panel)
g2 = sns.relplot(
    data=df,
    x="month",
    y="sales",
    hue="category",
    kind="line",
    height=5,
    aspect=1.8,
    errorbar=("ci", 95),
)

# Option 3: Strip plot faceted by month (every observation, 12 panels)
g3 = sns.catplot(
    data=df,
    x="category",
    y="sales",
    col="month",
    col_wrap=4,
    kind="strip",
    height=3,
)
```

The first option gives 12 faceted box plots — good for comparing category distributions within each month. The second option gives a single line chart with 5 category lines over 12 months — good for seeing the temporal patterns. The third option shows every individual observation per category × month — most honest for a moderate sample size. Which is best depends on the specific question: "how does the distribution differ?" → box; "how does the mean change over time?" → line; "what do individual observations look like?" → strip. All three are valid and answer different questions.

Review against mastery thresholds. Chapter 19 covers multi-variable exploration with pairplot, jointplot, heatmap, and clustermap — the final chapter of Part IV.