Exercises: Relational and Categorical Visualization

DataField.Dev

These exercises use seaborn's tips and penguins example datasets, which sns.load_dataset downloads on first use and then caches. Assume import seaborn as sns, import matplotlib.pyplot as plt, sns.set_theme(style="whitegrid"), and tips = sns.load_dataset("tips").

Part A: Conceptual (6 problems)

A.1 ★☆☆ | Recall

List the seaborn functions in the relational family and the categorical family covered in this chapter.

Guidance

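**Relational**: `sns.scatterplot`, `sns.lineplot`, `sns.relplot` (figure-level), plus the regression functions `sns.regplot` and `sns.lmplot` (figure-level), which seaborn's documentation lists separately but which these exercises use alongside the relational plots. **Categorical**: `sns.stripplot`, `sns.swarmplot`, `sns.boxplot`, `sns.violinplot`, `sns.barplot`, `sns.pointplot`, `sns.countplot`, `sns.catplot` (figure-level).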

A.2 ★★☆ | Understand

Explain the "dynamite plot" critique in your own words. What does a bar-with-error-bar chart hide?

Guidance

A dynamite plot is a bar chart with an error bar (SE, SD, or CI). It reduces the data to two numbers — the summary and one measure of spread — and hides everything else: the sample size, the distribution shape, individual outliers, skewness, bimodality. Two datasets can produce identical dynamite plots while having completely different point distributions. The chapter recommends showing the raw data (strip plot) alongside a summary (box or violin) instead.
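
The claim that two datasets can share one dynamite plot can be checked with a small sketch. The two samples below are made-up values chosen for illustration (they are not from tips or penguins), and `bar_summary` is a hypothetical helper that returns only what the bar and its SD error bar would show.

```python
import math
import statistics

# Two made-up samples (illustrative values, not from the tips or penguins data).
clumped = [-3, 5, 5, 5, 5, 5, 5, 13]   # tight cluster plus two extreme values
bimodal = [1, 1, 1, 1, 9, 9, 9, 9]     # two separate clusters, nothing near the mean


def bar_summary(values):
    """Return the two numbers a dynamite plot shows: bar height and SD error bar."""
    return statistics.mean(values), statistics.stdev(values)


clumped_summary = bar_summary(clumped)
bimodal_summary = bar_summary(bimodal)

# Same bar height and same error bar, even though the raw points differ.
same_bar = (clumped_summary[0] == bimodal_summary[0]
            and math.isclose(clumped_summary[1], bimodal_summary[1]))

print(clumped_summary)
print(bimodal_summary)
print("identical dynamite plot:", same_bar)
```

A strip plot of the same two samples would separate them immediately: one has six points sitting on the mean, the other has none.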

A.3 ★★☆ | Understand

What is the difference between sns.barplot and matplotlib's ax.bar?

Guidance

`sns.barplot` computes an aggregate (the mean by default) from the raw data and draws it as a bar, with a 95% bootstrap confidence interval as the default error bar. It takes `data=` and column names. `ax.bar` just draws bars at the heights you specify: no aggregation and no computed error bars (you can pass your own via `yerr=`). For plotting summary statistics of grouped raw data, use seaborn. For plotting specific heights you have already computed, use matplotlib.

A.4 ★★☆ | Understand

Describe the automatic aggregation in sns.lineplot. When is it useful, and when can it confuse?

Guidance

`sns.lineplot` automatically sorts and groups by x-value, computes the mean of y at each x, and draws a confidence band around that mean (a 95% bootstrap interval by default). This is useful when you have multiple observations per x (repeated measurements, multiple subjects, multiple trials) and want to see the typical trajectory with uncertainty. It can confuse when readers assume each row is one observation and do not realize the line represents a mean. Pass `estimator=None` to turn aggregation off and draw every observation, usually together with `units=` so that each subject or trial gets its own line.
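
The aggregation step can be pictured with a few lines of plain Python. This is only a sketch of the idea (group by x, average y), not seaborn's internal code, and it leaves out the bootstrap band; the `rows` values and the `aggregate_by_x` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-form rows: several y observations share the same x.
rows = [(1, 2.0), (1, 4.0), (2, 3.0), (2, 5.0), (2, 7.0), (3, 6.0)]


def aggregate_by_x(pairs):
    """Group y values by x and return {x: mean of y}, ordered by x."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in sorted(groups.items())}


line_points = aggregate_by_x(rows)
print(line_points)  # {1: 3.0, 2: 5.0, 3: 6.0}
```

Six rows go in and three line vertices come out, which is exactly the step a reader can miss when looking at the finished chart.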

A.5 ★★☆ | Analyze

Explain the difference between ("ci", 95), ("se", 1), and ("sd", 1) as error bar specifications. When is each appropriate?

Guidance

**`("ci", 95)`**: 95% bootstrap confidence interval. Represents the uncertainty about the mean estimate. Appropriate for most scientific contexts. **`("se", 1)`**: one standard error of the mean. About half the width of the 95% CI, which spans roughly ±1.96 SE under a normal approximation. Appropriate for "how precisely we estimated the mean." **`("sd", 1)`**: one standard deviation. Represents the spread of individual observations, not uncertainty about the mean. Because SE = SD/√n, the SD bar is much larger than the SE or CI bar for large samples and does not shrink as n grows. Appropriate for "how variable is the data." In seaborn 0.12 and later these tuples are passed through the `errorbar=` parameter. The three are not interchangeable; always specify and label which one you use.
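
To get a feel for how different the three bars are, the sketch below computes each one for a simulated sample. The sample (50 seeded normal draws) and the `bootstrap_ci` helper are assumptions made for illustration; the helper is a simple percentile bootstrap and is not claimed to match seaborn's implementation in detail.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical sample: 50 draws from a normal distribution (mean 20, SD 5).
sample = [rng.gauss(20, 5) for _ in range(50)]

n = len(sample)
sample_mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # spread of individual observations
se = sd / math.sqrt(n)          # precision of the mean estimate


def bootstrap_ci(values, level=95, n_boot=2000):
    """Percentile bootstrap interval for the mean."""
    boot_means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    tail = (100 - level) / 200            # 0.025 for a 95% interval
    low = boot_means[int(tail * n_boot)]
    high = boot_means[int((1 - tail) * n_boot) - 1]
    return low, high


ci_low, ci_high = bootstrap_ci(sample)
ci_half_width = (ci_high - ci_low) / 2

print(f"sd bar half-width:     {sd:.2f}")
print(f"se bar half-width:     {se:.2f}")
print(f"95% ci bar half-width: {ci_half_width:.2f}")
```

The SD bar stays near the population spread however many rows you collect, while the SE and CI bars shrink as n grows, which is why an unlabeled error bar is ambiguous.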

A.6 ★★★ | Evaluate

The chapter recommends combining a strip plot with a box plot for group comparisons. When would a plain box plot be preferable?

Guidance

Plain box plots are preferable when: (1) the dataset is very large (10,000+ points per group) and the strip plot would be unreadable; (2) the chart is part of a publication with strict space constraints; (3) the audience is trained to read box plots and does not need individual points; (4) the specific quartiles matter more than the individual observations. For most exploratory and teaching contexts, the strip+box combination is more informative.

Part B: Applied (8 problems)

B.1 ★★☆ | Apply

Create a scatter plot of tip vs. total_bill with hue by smoker and size by size.

Guidance

sns.scatterplot(data=tips, x="total_bill", y="tip", hue="smoker", size="size", sizes=(30, 200))

B.2 ★★☆ | Apply

Create a regression plot (sns.regplot) of tip vs. total_bill. Add a quadratic regression (order=2) as a second chart and compare.

Guidance

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.regplot(data=tips, x="total_bill", y="tip", ax=axes[0])
axes[0].set_title("Linear")

sns.regplot(data=tips, x="total_bill", y="tip", order=2, ax=axes[1])
axes[1].set_title("Quadratic")

B.3 ★★☆ | Apply

Create a strip plot of total_bill by day, then overlay a box plot with showfliers=False and boxprops=dict(alpha=0.3).

Guidance

fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=tips, x="day", y="total_bill", showfliers=False, boxprops=dict(alpha=0.3), ax=ax)
sns.stripplot(data=tips, x="day", y="total_bill", color="black", alpha=0.4, size=4, ax=ax)

B.4 ★★☆ | Apply

Create a grouped box plot of total_bill by day and smoker using hue. Use the tips dataset.

Guidance

sns.boxplot(data=tips, x="day", y="total_bill", hue="smoker")

B.5 ★★☆ | Apply

Create a faceted scatter plot with sns.relplot, showing tip vs. total_bill with col="day" and hue="smoker".

Guidance

sns.relplot(data=tips, x="total_bill", y="tip", col="day", hue="smoker", kind="scatter", height=4)

B.6 ★★☆ | Apply

Create a count plot showing the number of observations for each day in the tips dataset, grouped by time (Lunch/Dinner).

Guidance

sns.countplot(data=tips, x="day", hue="time")

B.7 ★★★ | Apply

Replace a dynamite plot with a strip+box combination using the penguins dataset. Show body_mass_g by species.

Guidance

penguins = sns.load_dataset("penguins")
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=penguins, x="species", y="body_mass_g", showfliers=False,
            boxprops=dict(alpha=0.3), ax=ax)
sns.stripplot(data=penguins, x="species", y="body_mass_g", color="black", alpha=0.5, size=4, ax=ax)

B.8 ★★★ | Create

Create a faceted regression plot using sns.lmplot that fits a linear regression to tip vs. total_bill, with one panel per day of the week.

Guidance

sns.lmplot(data=tips, x="total_bill", y="tip", col="day", height=4, aspect=1)

Part C: Synthesis (4 problems)

C.1 ★★★ | Analyze

Take a published scientific figure that uses a bar chart with error bars (the dynamite plot format) and critique it using the chapter's arguments. What information is hidden?

Guidance

Find a real figure from a scientific paper. Identify: the sample size (often not stated), the distribution shape (hidden), the individual observations (hidden), the type of error bar (may or may not be labeled). Propose a redesign using strip+box or violin+point that shows more information.

C.2 ★★★ | Create

Design an analysis for comparing test scores across three schools (100 students each). Propose three specific chart types (relational, distributional, categorical) that you would produce together to tell the complete story.

Guidance

Possible trio: (1) a strip+box combination showing individual scores by school; (2) an ECDF comparing the three distributions; (3) a scatter plot of score vs. a covariate (e.g., study hours) colored by school. Together, these show the group comparison, the full distribution, and the underlying relationship.

C.3 ★★★ | Evaluate

The chapter argues that sns.lmplot makes regression modeling too easy. Defend or critique this concern. Is the automatic regression a productivity win or a source of errors?

Guidance

Defense of lmplot: for exploratory work, a quick regression overlay is genuinely useful. You see relationships at a glance without writing model fitting code. Critique: the chart does not tell the user that the model may be inappropriate. No residual check, no R² reported, no significance test. Users who are not trained statisticians can mistakenly treat the line as a model they can trust. The right answer is probably "use it for exploration; do formal modeling in scikit-learn or statsmodels for serious work."
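
For a sense of what the regression overlay leaves unreported, the sketch below fits a least-squares line by hand and computes R² and the residuals. The data points and the `fit_line` helper are hypothetical; for real work the guidance's suggestion of statsmodels or scikit-learn still applies.

```python
from statistics import mean

# Hypothetical (x, y) pairs, standing in for something like total_bill and tip.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, plus R^2 and residuals."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot, residuals


slope, intercept, r_squared, residuals = fit_line(xs, ys)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.3f}")
print("residuals:", [round(r, 2) for r in residuals])
```

Looking at the residuals for a pattern (curvature, a widening spread) is the check the chart alone never prompts you to make.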

C.4 ★★★ | Create

Build a categorical chart decision tree for yourself. Start with the question "how many observations per group do I have?" and branch through the appropriate chart types. Include at least four chart types as leaves.

Guidance

A decision tree example: < 10 points per group → strip plot; 10-50 → swarm plot or strip+pointplot; 50-200 → box+strip or violin+strip; 200+ → box plot, violin plot, or histograms per group. Then further branch by question type (show every point vs. show summary vs. show shape). The specific tree depends on your typical data.
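
One way to make the first branch of such a tree concrete is a small lookup function. `suggest_chart` is a hypothetical helper, and because the ranges in the guidance share their endpoints, the sketch assumes that a boundary value (50 or 200) falls into the lower range.

```python
def suggest_chart(points_per_group):
    """Map a per-group sample size to the chart types named in the guidance.

    Assumption: boundary values (50, 200) are assigned to the lower range.
    """
    if points_per_group < 10:
        return "strip plot"
    if points_per_group <= 50:
        return "swarm plot or strip + point plot"
    if points_per_group <= 200:
        return "box + strip or violin + strip"
    return "box plot, violin plot, or per-group histograms"


for n in (5, 30, 120, 5000):
    print(n, "->", suggest_chart(n))
```

The second level of the tree (show every point, show a summary, or show the shape) would become a second argument; the thresholds themselves should be tuned to your own typical data.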

These exercises build fluency with seaborn's relational and categorical families. Do at least five Part B exercises before Chapter 19.