Chapter 16 Exercises: Statistical Visualization with seaborn

Contributors to Introduction to Data Science

How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 16 material, building from recall to original analysis. Part E applies your skills to a new dataset. Part M mixes in concepts from earlier chapters. You will need Python with seaborn, pandas, and matplotlib installed, plus access to the WHO vaccination CSV.

Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension

Part A: Conceptual Understanding (1-star)

Exercise 16.1 — Figure-level vs. axes-level

Explain the difference between seaborn's figure-level functions (displot, relplot, catplot) and axes-level functions (histplot, scatterplot, boxplot). When would you prefer one over the other? Give a concrete example.

Guidance

Figure-level functions create their own Figure and return a FacetGrid object. They support `col` and `row` parameters for multi-panel layouts. Axes-level functions draw onto an existing matplotlib Axes, making them better for embedding seaborn plots into custom subplot layouts you built with `plt.subplots()`. Use figure-level when you want faceting; use axes-level when you want to combine seaborn with other matplotlib elements on the same Axes.

Exercise 16.2 — When to use which categorical plot

For each scenario below, name the most appropriate seaborn categorical plot type (strip, swarm, box, violin, bar, point) and explain why:

You have 30 data points per category and want to show every observation.
You want to compare median values and identify outliers across 8 groups.
You suspect the distribution within one category is bimodal.
You want to show mean values with confidence intervals for a presentation.
You want to show how the relationship between category and outcome changes across two different experimental conditions.

Guidance

1. Swarm plot — small dataset, non-overlapping display of every point. 2. Box plot — explicitly shows median, quartiles, and outliers. 3. Violin plot — shows the full distribution shape, revealing bimodality that box plots hide. 4. Bar plot — simple, widely understood, CI is shown by default. 5. Point plot — shows means connected by lines, making interactions (crossing lines) visually obvious.

Exercise 16.3 — hue vs. col

A classmate creates a scatter plot with hue="region" and the result has six overlapping colors that are hard to distinguish. Suggest two different strategies for improving readability, and explain when each is preferable.

Guidance

Strategy 1: Replace `hue` with `col` (or add `col` and remove `hue`), creating separate panels for each region. Preferable when you have many categories or when the data overlaps heavily. Strategy 2: Keep `hue` but reduce the number of categories (e.g., group regions into broader categories) or use the `"colorblind"` palette with `alpha` adjustments. Preferable when you want direct comparison on the same axes and have few enough groups to make colors distinguishable.

Exercise 16.4 — KDE intuition

In your own words, explain what a kernel density estimate (KDE) does. Why might a KDE give a misleading impression for a very small dataset (say, 10 data points)? What would you use instead?

Guidance

A KDE places a small smooth curve (kernel) at each data point and sums them to estimate a continuous probability density. With only 10 points, the KDE can create smooth bumps and valleys that suggest structure that is really just noise. With very small datasets, prefer a rug plot, strip plot, or a histogram with few bins, which show the actual data without imposing smoothness assumptions.
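To make the "one bump per point" idea concrete, here is a hand-rolled Gaussian KDE in NumPy. It is a sketch for intuition, not seaborn's implementation: the bandwidths are arbitrary choices, and the bumps are averaged (summed and divided by the number of points) so the result integrates to roughly 1.

```python
import numpy as np

def gaussian_kde_manual(points, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated on `grid`."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

def count_peaks(y):
    """Count strict local maxima in a 1-D curve."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=10)  # only 10 points
grid = np.linspace(-6, 6, 601)

narrow = gaussian_kde_manual(sample, grid, bandwidth=0.15)  # arbitrary small bandwidth
wide = gaussian_kde_manual(sample, grid, bandwidth=0.8)     # arbitrary large bandwidth

print("peaks with narrow bandwidth:", count_peaks(narrow))
print("peaks with wide bandwidth:", count_peaks(wide))
```

With so few points, the narrow-bandwidth curve tends to show several bumps even though the sample came from a single normal distribution, which is the misleading structure the guidance warns about.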

Exercise 16.5 — Correlation heatmap interpretation

You create a correlation heatmap and see a cell showing r = 0.95 between "ice cream sales" and "drowning deaths." A colleague concludes that ice cream causes drowning. What is wrong with this interpretation, and what visualization might help clarify the relationship?

Guidance

Correlation does not imply causation. Both variables are likely confounded by a third variable — temperature or season. A scatter plot with `hue="month"` or a faceted line plot showing both variables over time would reveal that both rise in summer and fall in winter, suggesting temperature as the confound. The heatmap shows association, not causation.

Part B: Applied Skills (2-star)

Exercise 16.6 — Distribution exploration

Using the WHO vaccination dataset, create the following distribution plots:

A histogram of coverage_pct with 25 bins and a KDE overlay.
A KDE plot of coverage_pct split by region using hue, with filled curves.
An ECDF plot of coverage_pct split by income_group.

For each plot, write one sentence describing the key pattern you observe.

Guidance

Use `sns.displot()` with `kind="hist"`, `bins=25`, and `kde=True` for part 1. For part 2, use `kind="kde"`, `hue="region"`, `fill=True`. For part 3, use `kind="ecdf"`, `hue="income_group"`. Observations should note things like skewness, multimodality, or differences between groups. For example, the ECDF curve for high-income countries might stay near zero until coverage reaches about 90% and then rise steeply, while the curve for low-income countries starts climbing at much lower coverage values.

Exercise 16.7 — Categorical comparisons

Create three different categorical plots showing coverage_pct by region:

A box plot
A violin plot with quartile lines
A combined violin + strip plot (using axes-level functions on the same Axes)

Which plot tells you the most about the data? Write two sentences justifying your answer.

Guidance

For the combined plot, create a figure with `plt.subplots()`, then call `sns.violinplot()` with `inner=None` and `alpha=0.3`, followed by `sns.stripplot()` with small marker size. The combined violin + strip is typically most informative because you get both the distribution shape and the individual data points. However, the box plot is superior for quickly comparing medians and spotting outliers.
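A sketch of the combined plot on synthetic data follows (the values are assumptions). How `violinplot()` treats an `alpha` keyword may differ between seaborn versions, so this version sets transparency on the drawn violin bodies instead; that is a swapped-in alternative, not the only way to do it.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values, three hypothetical regions).
rng = np.random.default_rng(3)
regions = ["AFR", "EUR", "SEAR"]
df = pd.DataFrame({
    "region": np.repeat(regions, 60),
    "coverage_pct": np.concatenate([
        rng.normal(70, 12, 60), rng.normal(92, 4, 60), rng.normal(82, 8, 60),
    ]),
})

fig, ax = plt.subplots(figsize=(7, 4))

# Layer 1: violins without inner annotations, made translucent afterwards.
sns.violinplot(data=df, x="region", y="coverage_pct",
               inner=None, color="lightgray", ax=ax)
for body in ax.collections:
    body.set_alpha(0.3)
n_violin_artists = len(ax.collections)

# Layer 2: individual observations drawn on the same Axes.
sns.stripplot(data=df, x="region", y="coverage_pct",
              size=2, color="black", ax=ax)
```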

Exercise 16.8 — Regression exploration

Create a regression plot (lmplot) showing the relationship between gdp_per_capita and coverage_pct. Then create a second version with hue="income_group". Compare the single regression line to the group-specific lines. What does this tell you about the potential for Simpson's paradox in this data?

Guidance

The overall regression line may show a different slope than the within-group lines. This is related to Simpson's paradox: the aggregate relationship can differ from the group-specific relationships when groups have different baseline levels. For example, the overall positive slope might be steeper than individual group slopes because the groups themselves are arranged along the GDP axis.
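The effect can be reproduced with synthetic data in which the groups sit at different points along the GDP axis. Everything numeric here is an assumption chosen to make the pattern visible: GDP is in hypothetical thousands of dollars, and the within-group slope is deliberately small.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(5)
# Hypothetical groups: (GDP center in thousands, baseline coverage).
groups = {"Low": (5, 60), "Middle": (15, 78), "High": (25, 93)}

frames = []
for name, (gdp_center, base_cov) in groups.items():
    gdp = rng.normal(gdp_center, 2, 100)
    # Within a group, coverage rises only weakly with GDP.
    cov = base_cov + 0.2 * (gdp - gdp_center) + rng.normal(0, 1.5, 100)
    frames.append(pd.DataFrame(
        {"income_group": name, "gdp_per_capita": gdp, "coverage_pct": cov}))
df = pd.concat(frames, ignore_index=True)

def slope(frame):
    """Least-squares slope of coverage on GDP."""
    return np.polyfit(frame["gdp_per_capita"], frame["coverage_pct"], 1)[0]

overall = slope(df)
within = {name: slope(part) for name, part in df.groupby("income_group")}
print("overall slope:", round(overall, 2))
print("within-group slopes:", {k: round(v, 2) for k, v in within.items()})

# The same comparison, visually.
g_all = sns.lmplot(data=df, x="gdp_per_capita", y="coverage_pct")
g_grp = sns.lmplot(data=df, x="gdp_per_capita", y="coverage_pct", hue="income_group")
```

Here the pooled slope is much steeper than any group's slope because most of the pooled trend comes from differences between groups, not from changes within them.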

Exercise 16.9 — Correlation heatmap

Select five numeric columns from the vaccination dataset and create a correlation heatmap with:

Annotations showing correlation values to two decimal places
A diverging colormap centered at zero
The upper triangle masked
A descriptive title

Identify the pair of variables with the strongest positive correlation and the pair with the strongest negative correlation. Explain whether these correlations make intuitive sense.

Guidance

Use `df[cols].corr()` to compute the matrix, `np.triu(np.ones_like(corr, dtype=bool))` for the mask, and `sns.heatmap()` with `annot=True`, `fmt=".2f"`, `cmap="coolwarm"`, `center=0`, `mask=mask`. The explanation should connect the statistical values to domain knowledge — for example, GDP and health expenditure are likely positively correlated because wealthier countries spend more on health.
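Put together, the steps might look like this. The five columns are synthetic, and names such as `infant_mortality` and `population` are hypothetical placeholders; substitute whichever numeric columns your dataset actually has.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data; column names beyond gdp/coverage are hypothetical.
rng = np.random.default_rng(4)
n = 150
gdp = rng.normal(10, 3, n)
df = pd.DataFrame({
    "gdp_per_capita": gdp,
    "health_expenditure": 0.5 * gdp + rng.normal(0, 1, n),
    "coverage_pct": 60 + 2 * gdp + rng.normal(0, 5, n),
    "infant_mortality": 50 - 3 * gdp + rng.normal(0, 5, n),
    "population": rng.normal(30, 10, n),
})

corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide upper triangle and diagonal

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            cmap="coolwarm", center=0, vmin=-1, vmax=1, ax=ax)
ax.set_title("Pairwise correlations (synthetic data)")

# Read the strongest pairs from the visible lower triangle.
lower = corr.where(~mask).stack()
print("strongest positive:", lower.idxmax())
print("strongest negative:", lower.idxmin())
```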

Exercise 16.10 — Pair plot

Create a pair plot of four numeric variables from the dataset, colored by region. Use KDE for the diagonal. What pair of variables shows the most interesting relationship? Create a standalone lmplot for that pair to examine it more closely.

Guidance

Use `sns.pairplot()` with `vars=` set to your four numeric columns, `hue="region"`, and `diag_kind="kde"`. Identify the pair with the most visible pattern (positive/negative correlation, clustering, or nonlinear relationship). Then create `sns.lmplot()` for that specific pair. The student should practice the workflow of overview (pairplot) followed by focused investigation (lmplot).

Exercise 16.11 — FacetGrid

Create a FacetGrid that shows histograms of coverage_pct faceted by region (using col with col_wrap=3). Add a vertical red dashed line at the global median coverage. Each panel should have its own title showing just the region name.

Guidance

Use `sns.displot()` with `col="region"`, `col_wrap=3`. To add the median line, either use `FacetGrid` directly with `.map()` or loop over the axes of the returned FacetGrid and call `axvline` on each. Compute the global median with `df["coverage_pct"].median()` before plotting. Call `g.set_titles("{col_name}")` on the returned grid so each panel title shows only the region name.

Exercise 16.12 — Theme and palette exploration

Create the same box plot of coverage_pct by region four times, each with a different combination of theme and palette:

style="darkgrid", palette="deep"
style="whitegrid", palette="colorblind"
style="ticks", palette="pastel"
style="white", palette="muted"

Which combination would you choose for (a) a notebook exploration, (b) a formal paper, and (c) a presentation? Justify your choices.

Guidance

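Four separate figures are the simplest approach: call `sns.set_theme()` with the new style and palette before creating each one. A single 2x2 figure needs more care, because a style takes effect when an Axes is created; create each panel inside a `with sns.axes_style(...)` block rather than changing the theme after `plt.subplots()`. For (a) notebooks, `darkgrid` or `whitegrid` with any palette provides helpful reference lines. For (b) formal papers, `ticks` or `white` with `muted` or `colorblind` is clean and accessible. For (c) presentations, `whitegrid` with `colorblind` and `context="talk"` for larger text.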
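If you want all four combinations in one figure, the sketch below creates each panel inside its own style context, since seaborn styles are picked up when an Axes is created. The data are synthetic, and passing `hue="region"` with `dodge=False` is just one way to color boxes by category; treat it as an assumption that may need adjusting for your seaborn version.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values).
rng = np.random.default_rng(8)
df = pd.DataFrame({
    "region": np.repeat(["AFR", "EUR", "SEAR"], 50),
    "coverage_pct": np.clip(rng.normal(80, 10, 150), 0, 100),
})

combos = [("darkgrid", "deep"), ("whitegrid", "colorblind"),
          ("ticks", "pastel"), ("white", "muted")]

fig = plt.figure(figsize=(10, 8))
panels = []
for i, (style, palette) in enumerate(combos, start=1):
    # The style applies to Axes created inside this block.
    with sns.axes_style(style):
        ax = fig.add_subplot(2, 2, i)
    sns.boxplot(data=df, x="region", y="coverage_pct",
                hue="region", palette=palette, dodge=False, ax=ax)
    if ax.get_legend() is not None:  # the legend would duplicate the x labels
        ax.get_legend().remove()
    ax.set_title(f'style="{style}", palette="{palette}"')
    panels.append(ax)
fig.tight_layout()
```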

Exercise 16.13 — Context parameter

Create the same violin plot three times using context="paper", context="notebook", and context="talk". Save each to a PNG file and compare the file sizes and visual appearance. What changes between contexts?

Guidance

The `context` parameter scales font sizes, line widths, and marker sizes proportionally. `"paper"` uses the smallest sizes (for print), `"notebook"` is the default, and `"talk"` uses larger sizes (for slides). File sizes may increase slightly with larger contexts due to thicker lines, but the main difference is visual scale.
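You can inspect what each context changes without drawing anything. This short sketch prints two of the scaled parameters; the exact numbers depend on your seaborn version, so none are quoted here.

```python
import seaborn as sns

# plotting_context(name) returns the rcParams that the context would set.
scales = {ctx: sns.plotting_context(ctx) for ctx in ("paper", "notebook", "talk")}

for ctx, params in scales.items():
    print(ctx, "font.size =", params["font.size"],
          "| lines.linewidth =", params["lines.linewidth"])
```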

Part C: Real-World Applications (2-3 star)

Exercise 16.14 — Weather data analysis (3-star)

Load a weather dataset (or create a synthetic one with columns: month, temperature_c, precipitation_mm, city). Create the following visualizations:

A violin plot of temperature by month, with the hue encoding distinguishing cities.
A scatter plot of temperature vs. precipitation, faceted by city.
A heatmap of average temperature by city and month.

Write a brief paragraph summarizing what patterns the visualizations reveal.

Guidance

If creating synthetic data, use `np.random.normal()` with different means per city and month. For the heatmap, use `df.pivot_table(values="temperature_c", index="city", columns="month", aggfunc="mean")`. The summary should discuss seasonal patterns, city-to-city differences, and any relationship between temperature and precipitation.
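A possible way to build the synthetic dataset and the heatmap is sketched below. The two city names, their annual means, the seasonal amplitude, and the precipitation distribution are all invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(10)
annual_mean = {"Oslo": 6, "Cairo": 22}  # hypothetical annual mean temperatures (°C)

frames = []
for city, base in annual_mean.items():
    for month in range(1, 13):
        # Simple seasonal cycle that peaks in July (an assumption).
        seasonal = 10 * np.cos(2 * np.pi * (month - 7) / 12)
        frames.append(pd.DataFrame({
            "city": city,
            "month": month,
            "temperature_c": rng.normal(base + seasonal, 2, 30),
            "precipitation_mm": rng.gamma(2.0, 15.0, 30),
        }))
df = pd.concat(frames, ignore_index=True)

# One row per city, one column per month, cell = mean temperature.
pivot = df.pivot_table(values="temperature_c", index="city",
                       columns="month", aggfunc="mean")

fig, ax = plt.subplots(figsize=(9, 2.5))
sns.heatmap(pivot, annot=True, fmt=".0f", cmap="coolwarm", ax=ax)
ax.set_title("Mean temperature by city and month (synthetic)")
```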

Exercise 16.15 — Student grades exploration (2-star)

Create a DataFrame with fictional student data (100 students, columns: name, major, gpa, study_hours_per_week, class_year). Using seaborn, create:

A scatter plot of study_hours_per_week vs. gpa with a regression line.
A box plot of gpa by major.
A pair plot of the numeric variables, colored by class_year.

What does the regression plot suggest about the relationship between study time and GPA?

Guidance

Generate synthetic data where study hours and GPA have a moderate positive correlation (r around 0.5-0.6) with noise. Use `sns.lmplot()` for the regression plot. The regression should show a positive but imperfect relationship — study hours help, but do not guarantee a high GPA.

Exercise 16.16 — E-commerce analysis (3-star)

You have an e-commerce DataFrame with columns: product_category, price, rating, review_count, is_returned. Create a multi-panel visualization using FacetGrid that shows the relationship between price and rating for each product_category, with points colored by is_returned. Include regression lines. What categories show the strongest price-rating relationship?

Guidance

Use `sns.lmplot()` with `col="product_category"`, `hue="is_returned"`, `col_wrap=3`. Alternatively, build a `FacetGrid` manually and map `sns.regplot`. The student should identify which categories have steep slopes (price matters more for rating) versus flat slopes (price is unrelated to rating).

Exercise 16.17 — The complete exploration workflow (3-star)

Pick a dataset of your choice (or use the vaccination data). Perform a complete seaborn exploration workflow:

Start with pairplot for an overview.
Identify the two most interesting pairwise relationships.
Create detailed lmplot or relplot visualizations for each.
Create distribution plots (displot) for key variables.
Create categorical comparisons (catplot) for key groupings.
Create a correlation heatmap.
Write a one-paragraph summary of your findings.

Guidance

This is an open-ended exercise. The goal is to practice the full exploration cycle: overview, identify, investigate, summarize. The summary should connect the visual patterns to substantive conclusions about the data.

Exercise 16.18 — Publication-ready figure (3-star)

Create a single figure with four subplots arranged in a 2x2 grid, each showing a different seaborn visualization of the same dataset:

Top-left: KDE of a key numeric variable by group
Top-right: Scatter plot with regression line
Bottom-left: Box plot by category
Bottom-right: Correlation heatmap

Use axes-level functions. Add a main title, ensure consistent styling, and save to a high-resolution PNG (300 dpi). The figure should look professional enough for a report.

Guidance

Call `sns.set_theme(style="ticks", context="paper")` first, so the styling is in place when the Axes are created. Then create the layout with `fig, axes = plt.subplots(2, 2, figsize=(12, 10))`. Use axes-level functions (`sns.kdeplot`, `sns.regplot`, `sns.boxplot`, `sns.heatmap`) and pass `ax=axes[i, j]` to each. Use `fig.suptitle()` for the main title and `fig.savefig("figure.png", dpi=300, bbox_inches="tight")`.

Part D: Synthesis and Extension (3-4 star)

Exercise 16.19 — Custom FacetGrid mapping (3-star)

Create a custom function that draws a histogram with a KDE overlay and a vertical line at the mean, then use FacetGrid.map() to apply it across all regions. The mean line should be labeled with the actual mean value using plt.text().

Guidance

Define a function like `def hist_with_mean(data, **kwargs)` that calls `sns.histplot(data, kde=True, **kwargs)`, computes `mean = data.mean()`, draws `plt.axvline(mean, ...)`, and adds `plt.text(mean, ..., f"Mean: {mean:.1f}")`. Then use `g = sns.FacetGrid(df, col="region", col_wrap=3)` and `g.map(hist_with_mean, "coverage_pct")`.

Exercise 16.20 — Comparing distributions rigorously (4-star)

Create a visualization that compares the distribution of coverage_pct between two income groups, showing:

Overlapping KDEs
An ECDF plot
A Q-Q plot (using scipy.stats.probplot or manual quantile computation)
A combined violin plot

Arrange these four views in a 2x2 grid. Based on all four views, describe how the distributions differ in terms of center, spread, shape, and tails.

Guidance

This requires combining seaborn axes-level functions with a Q-Q plot built separately. Note that `scipy.stats.probplot` compares one sample against a theoretical distribution (normal by default), so to compare the two groups directly, compute matching quantiles for both groups and plot them against each other. The comparison should go beyond "one group is higher" to discuss skewness, bimodality, tail behavior, and spread. Multiple views provide different information: KDE for shape, ECDF for cumulative differences, Q-Q for distributional similarity, violin for quartile comparison.
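The manual two-sample Q-Q computation can be sketched with NumPy alone. The two groups below are synthetic, with means and spreads chosen arbitrarily so that the groups differ in both center and spread.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic coverage values for two hypothetical income groups.
rng = np.random.default_rng(14)
low = np.clip(rng.normal(65, 10, 200), 0, 100)
high = np.clip(rng.normal(88, 5, 200), 0, 100)

# Evaluate both samples at the same probability levels.
probs = np.linspace(0.01, 0.99, 99)
q_low = np.quantile(low, probs)
q_high = np.quantile(high, probs)

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(q_low, q_high, s=10)

# Reference line y = x: identical distributions would fall on it.
lo = min(q_low.min(), q_high.min())
hi = max(q_low.max(), q_high.max())
ax.plot([lo, hi], [lo, hi], color="gray", linestyle="--")
ax.set_xlabel("Low-income quantiles of coverage_pct")
ax.set_ylabel("High-income quantiles of coverage_pct")
```

Points above the reference line mean the high-income group has the larger value at that quantile, and a point cloud flatter than the line indicates the high-income group is less spread out.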

Exercise 16.21 — seaborn vs. matplotlib comparison (3-star)

Choose one of the seaborn visualizations you created in this exercise set. Recreate the identical (or near-identical) visualization using only matplotlib. Compare the two code listings side by side. How many lines of code does each require? What does seaborn handle automatically that you had to specify manually?

Guidance

The student should discover that seaborn handles: automatic color cycling for groups, legend creation, axis labeling from column names, statistical computation (means, CIs, KDEs), sensible bin widths, figure sizing, and theme/style application. The matplotlib version will typically be 3-5 times longer. This reinforces why seaborn exists while appreciating that matplotlib gives more control.

Part M: Mixed Review (integrating earlier chapters)

Exercise 16.22 — Data cleaning before visualization (2-star)

Load a messy dataset that contains missing values, inconsistent category names (e.g., "USA", "U.S.A.", "United States"), and outliers. Using pandas skills from Part II, clean the data first, then create appropriate seaborn visualizations. Document your cleaning steps and explain how each one would have affected the visualization if skipped.

Guidance

This integrates [Chapter 8](../../part-02-data-wrangling/chapter-08-cleaning-messy-data/index.md) (cleaning) and [Chapter 10](../../part-02-data-wrangling/chapter-10-text-data/index.md) (text data) with Chapter 16 visualization. The student should show that inconsistent names create extra categories in categorical plots, missing values can silently reduce the data, and outliers can compress the scale of box or scatter plots. Cleaning first, then visualizing, produces more accurate and readable charts.
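A tiny made-up example shows how the category count changes with cleaning. The alias table and the 0 to 100 validity range are assumptions for this illustration; a real dataset needs its own rules.

```python
import pandas as pd

# Deliberately messy, invented data.
raw = pd.DataFrame({
    "country": ["USA", "U.S.A.", "United States", "Canada", "canada ", None, "Canada"],
    "coverage_pct": [92.0, 91.0, 93.0, 89.0, 90.0, 88.0, 250.0],
})

# Normalize spelling variants, then map them to one canonical name.
normalized = (raw["country"].str.strip().str.lower()
              .str.replace(".", "", regex=False))
aliases = {"usa": "United States", "united states": "United States",
           "canada": "Canada"}  # hypothetical alias table

clean = raw.assign(country=normalized.map(aliases))
clean = clean.dropna(subset=["country"])               # drop rows with no country
clean = clean[clean["coverage_pct"].between(0, 100)]   # drop impossible percentages

print(raw["country"].nunique(), "->", clean["country"].nunique())  # 5 -> 2
```

Plotting `raw` would produce five boxes where two belong, and the value of 250 would stretch the y-axis so far that the real variation is hard to see.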

Exercise 16.23 — Reshaping for visualization (2-star)

You have a DataFrame in wide format with columns: country, coverage_2018, coverage_2019, coverage_2020, coverage_2021. Melt it into long format, then create a line plot with relplot showing coverage over time, colored by country. Explain why the wide format could not be used directly with seaborn's hue parameter.

Guidance

Use `pd.melt()` from [Chapter 9](../../part-02-data-wrangling/chapter-09-reshaping-transforming/index.md) to reshape. seaborn's `hue` expects a single column whose values define the groups. In wide format, the year information is encoded in column names, not in a column of values. After melting, you have `country`, `year`, and `coverage` columns, which map directly to `hue`, `x`, and `y`. This reinforces the "tidy data" principle.

Exercise 16.24 — Grammar of Graphics revisited (2-star)

For each of the following seaborn function calls, identify the Grammar of Graphics components (data, aesthetic mappings, geometric object, faceting, statistical transformation):

```python
# (a)
sns.relplot(data=df, x="gdp", y="coverage",
            hue="region", size="population",
            kind="scatter", col="year")

# (b)
sns.catplot(data=df, x="region", y="coverage",
            kind="violin", hue="income_group")

# (c)
sns.displot(data=df, x="coverage",
            kind="kde", hue="region", fill=True)
```

Guidance

(a) Data: `df`. Aesthetics: x=gdp, y=coverage, color=region, size=population. Geom: scatter points. Facet: col=year. Stat: none (raw values). (b) Data: `df`. Aesthetics: x=region, y=coverage, color=income_group. Geom: violin (mirrored density curve). Facet: none. Stat: kernel density estimation. (c) Data: `df`. Aesthetics: x=coverage, color=region. Geom: filled area (density curve). Facet: none. Stat: kernel density estimation.

Exercise 16.25 — From question to visualization (3-star)

For each question below, choose the most appropriate seaborn function and parameters, then write the code and create the plot. Justify your choices.

"Has vaccination coverage improved over the past decade across all regions?"
"Which income group has the most variation in coverage?"
"Is there a relationship between health expenditure and coverage, and does it differ by region?"
"What does the joint distribution of GDP and literacy rate look like?"
"How do all numeric variables in this dataset relate to each other?"

Guidance

1. `relplot(kind="line", x="year", y="coverage_pct", hue="region")` — time series comparison. 2. `catplot(kind="box", x="income_group", y="coverage_pct")` — box plots show spread via IQR. 3. `lmplot(x="health_expenditure", y="coverage_pct", hue="region")` — regression with grouping. 4. `displot(x="gdp_per_capita", y="literacy_rate", kind="kde")` — bivariate KDE. 5. `pairplot()` or a correlation `heatmap()` — multivariate overview.