Exercises: Distributional Visualization

DataField.Dev

These exercises use seaborn's built-in datasets (penguins, tips) so you can run them without external data. All assume `import seaborn as sns` and `sns.set_theme(style="whitegrid")`. Solutions that build their own subplot grids also need `import matplotlib.pyplot as plt`.

Part A: Conceptual (6 problems)

A.1 ★☆☆ | Recall

Name the six distributional chart types covered in this chapter and state what each reveals about the data.

Guidance

**Histogram** (`histplot`): binned counts, shows overall shape. **KDE** (`kdeplot`): smooth density estimate, shows continuous shape. **Rug plot** (`rugplot`): individual observations as marginal marks, supplementary. **ECDF** (`ecdfplot`): cumulative distribution, exact quantile reading. **Violin plot** (`violinplot`): combines box plot summary with KDE shape. **Ridge plot** (FacetGrid + kdeplot): stacked KDEs for comparing many groups.

A.2 ★★☆ | Understand

Explain why the shape of a distribution is information that summary statistics (mean, standard deviation, median) cannot capture. Give a specific example.

Guidance

Summary statistics are lossy compressions. A bimodal distribution and a unimodal normal distribution can have identical means and standard deviations but very different shapes. Example: a bank's customer savings balances. Given only "mean = $5,000, SD = $4,500," you might conclude that the typical customer holds about $5,000. But the actual distribution might be bimodal: half the customers have balances near $500 (low engagement) and half near $9,500 (high engagement), so almost no one sits near the mean. The mean hides the two groups. Only a view of the full distribution, such as a histogram or KDE, reveals them.
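
The claim that two very different shapes can share a mean and standard deviation is easy to check numerically. The sketch below uses only Python's standard library and made-up balances (the group centers, spreads, and sample sizes are assumptions for illustration, not data from the chapter). It builds a bimodal sample, rescales a unimodal sample to the same mean and SD, and then counts how many observations fall near the mean in each.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical bimodal balances: half near $500, half near $9,500.
bimodal = [rng.gauss(500, 100) for _ in range(500)] + [
    rng.gauss(9500, 100) for _ in range(500)
]


def rescale(values, target_mean, target_sd):
    """Shift and scale values to the target mean and population SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [target_mean + (v - mean) * target_sd / sd for v in values]


# A unimodal (bell-shaped) sample forced to the same mean and SD.
unimodal = rescale(
    [rng.gauss(0, 1) for _ in range(1000)],
    statistics.fmean(bimodal),
    statistics.pstdev(bimodal),
)


def share_near_mean(values, width=1000):
    """Fraction of observations within `width` of the sample mean."""
    mean = statistics.fmean(values)
    return sum(abs(v - mean) <= width for v in values) / len(values)


print("means:", round(statistics.fmean(bimodal)), round(statistics.fmean(unimodal)))
print("SDs:  ", round(statistics.pstdev(bimodal)), round(statistics.pstdev(unimodal)))
print("share within $1,000 of the mean:",
      share_near_mean(bimodal), share_near_mean(unimodal))
```

The two samples agree on both summary statistics, yet the share of customers near the mean differs sharply, which is exactly the information a histogram or KDE would expose.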

A.3 ★★☆ | Understand

Explain the role of bandwidth in a KDE. What happens when bandwidth is too small? Too large? How should you choose a bandwidth?

Guidance

Bandwidth controls the smoothness of the KDE curve. **Too small**: the curve is jagged and follows individual data points, so sampling noise shows up as spurious bumps. **Too large**: the curve is over-smoothed and hides real features like bimodality. Choose by starting with the default, comparing to a histogram, and adjusting if the KDE misses features you see in the histogram or adds features you do not see. seaborn's `bw_adjust` parameter multiplies the automatic bandwidth: values below 1 (such as 0.5) give a less smooth curve, and values above 1 (such as 1.5) give a smoother one.
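
To make the bandwidth effect concrete, here is a minimal hand-rolled Gaussian KDE using only the standard library. It is a sketch, not seaborn's implementation: the bandwidth here is an absolute kernel width rather than a multiplier like `bw_adjust`, and the data, grid, and three bandwidth values are assumptions chosen for illustration. The code counts the peaks in the estimated density at each bandwidth.

```python
import math


def gaussian_kde(data, bandwidth, grid):
    """Evaluate a Gaussian KDE with a fixed bandwidth at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        for x in grid
    ]


def count_modes(density):
    """Count strict local maxima of a density evaluated on a grid."""
    return sum(
        1
        for left, mid, right in zip(density, density[1:], density[2:])
        if mid > left and mid > right
    )


# Hypothetical data: two tight clusters centered at -2 and +2.
data = [center + 0.1 * k for center in (-2.0, 2.0) for k in range(-5, 6)]
grid = [-6.0 + 0.01 * i for i in range(1201)]

modes_by_bandwidth = {
    bw: count_modes(gaussian_kde(data, bw, grid)) for bw in (0.02, 0.5, 3.0)
}
for bw, modes in modes_by_bandwidth.items():
    print(f"bandwidth={bw}: {modes} mode(s)")
```

The smallest bandwidth produces a peak for nearly every observation, the middle one recovers the two clusters, and the largest merges them into a single hump.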

A.4 ★★☆ | Understand

The chapter argues that ECDFs are "the most honest distributional visualization." Explain why.

Guidance

ECDFs require no tuning choices that change the visual output. Histograms require a bin count (a poor choice hides features). KDEs require a bandwidth (a poor choice smooths too much or too little). Box plots reduce the data to five numbers (hiding shape). Violin plots inherit the KDE's smoothing choices. ECDFs show every data point as a step in a cumulative curve, with no binning, no smoothing, and no summarization. You can read quantiles directly off the curve. The downside is that ECDFs are less familiar and require reader training.
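
The "no binning, no smoothing" point is visible in how little code an ECDF needs. The following standard-library sketch (the function names and the sample are hypothetical) sorts the data, assigns each observation its cumulative proportion, and reads a quantile as the first value where the curve reaches the requested height.

```python
def ecdf(values):
    """Return sorted values and the cumulative proportion reached at each one."""
    xs = sorted(values)
    n = len(xs)
    ps = [(i + 1) / n for i in range(n)]
    return xs, ps


def read_quantile(xs, ps, q):
    """Return the smallest observation at which the ECDF reaches at least q."""
    for x, p in zip(xs, ps):
        if p >= q:
            return x
    raise ValueError("q must be in (0, 1]")


# Hypothetical sample; any list of numbers works.
sample = [3, 1, 4, 1, 5, 9, 2, 6]
xs, ps = ecdf(sample)
median_from_ecdf = read_quantile(xs, ps, 0.5)
p90_from_ecdf = read_quantile(xs, ps, 0.9)

print(list(zip(xs, ps)))
print("median read from ECDF:", median_from_ecdf)
print("90th percentile read from ECDF:", p90_from_ecdf)
```

Note that a quantile read this way is always an observed value, so for an even-sized sample it can differ slightly from the conventional median that averages the two middle observations.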

A.5 ★★☆ | Analyze

When should you NOT use a KDE or violin plot?

Guidance

(1) **Small samples** (fewer than ~30 observations): KDEs and violins are unreliable and produce smooth shapes that imply more data than exists. Use strip plots instead. (2) **Data with hard boundaries** (like ages, which cannot be negative): the KDE extends its tails past the boundary, which is misleading. Use `clip` or `cut=0` in `kdeplot`, `cut=0` in `violinplot`, or switch to a histogram. (3) **Bimodal distributions with default bandwidth**: the default may over-smooth and hide bimodality. Check with a histogram first.
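
The boundary problem in point (2) can be quantified. For a Gaussian KDE, the area that spills below a boundary is the average of each kernel's normal CDF at that boundary. The sketch below assumes made-up ages and a hypothetical fixed bandwidth of 5 years, and it is not how seaborn chooses its bandwidth. It estimates how much of the density would be drawn over impossible negative ages.

```python
from statistics import NormalDist


def kde_mass_below(data, bandwidth, boundary=0.0):
    """Fraction of a Gaussian KDE's total area that lies below `boundary`."""
    std_normal = NormalDist()
    return sum(
        std_normal.cdf((boundary - x) / bandwidth) for x in data
    ) / len(data)


# Hypothetical ages in years; every value is non-negative.
ages = [0.5, 1, 2, 3, 5, 8, 13, 21, 34, 55]

leaked = kde_mass_below(ages, bandwidth=5.0)
print(f"Share of estimated density below zero: {leaked:.1%}")
```

A non-trivial share of the curve lands below zero even though no observation does, which is why the curve needs to be limited to the valid range or replaced by a histogram.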

A.6 ★★★ | Evaluate

The chapter gives five real-world examples and recommends a chart type for each. Compare two of them (e.g., "web response times" vs. "test scores by class") and explain what makes the recommended chart type right for each context.

Guidance

Response times are right-skewed with a long tail; a log-scale histogram or an ECDF reveals the tail that a linear axis would compress into a few nearly empty bins. Test scores are bounded (0 to 100) and roughly bell-shaped; violin plots by class compare distributions across groups while showing medians and spreads. Each recommendation reflects the specific shape characteristics and the audience's needs. The pattern: match the chart type to the data shape, not to aesthetic preferences.
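
The response-time argument can be checked without plotting. This standard-library sketch draws hypothetical response times from a log-normal distribution (the distribution, its parameters, and the bin count are assumptions for illustration) and compares how the observations spread across ten equal-width bins on a linear axis versus a log10 axis.

```python
import math
import random


def bin_shares(values, n_bins, transform=lambda v: v):
    """Share of observations in each of n_bins equal-width bins of transform(value)."""
    ts = [transform(v) for v in values]
    lo, hi = min(ts), max(ts)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for t in ts:
        # Clamp so the maximum value falls in the last bin.
        idx = min(int((t - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(ts) for c in counts]


rng = random.Random(1)
# Hypothetical response times in milliseconds (right-skewed, long tail).
response_ms = [rng.lognormvariate(math.log(100), 1.0) for _ in range(2000)]

linear = bin_shares(response_ms, 10)
logged = bin_shares(response_ms, 10, transform=math.log10)

print("linear bins:", [round(s, 3) for s in linear])
print("log10 bins: ", [round(s, 3) for s in logged])
```

On the linear axis most observations pile into the first bin and the tail bins are nearly empty; on the log axis the same data spread across the bins, so both the bulk and the tail are readable.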

Part B: Applied (8 problems)

B.1 ★★☆ | Apply

Load the penguins dataset and create a histogram of body_mass_g with 30 bins. Add kde=True to overlay a kernel density estimate.

Guidance

import seaborn as sns
penguins = sns.load_dataset("penguins")
sns.histplot(data=penguins, x="body_mass_g", bins=30, kde=True)

B.2 ★★☆ | Apply

Create a multi-group histogram of body_mass_g split by species, using multiple="layer" with alpha=0.6 for transparency.

Guidance

sns.histplot(data=penguins, x="body_mass_g", hue="species", multiple="layer", alpha=0.6, bins=30)

B.3 ★★☆ | Apply

Create a KDE comparison of body_mass_g across species with filled curves. Try two values of bw_adjust (0.5 and 1.5) and compare.

Guidance

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.kdeplot(data=penguins, x="body_mass_g", hue="species", fill=True, alpha=0.4, bw_adjust=0.5, ax=axes[0])
axes[0].set_title("bw_adjust=0.5")

sns.kdeplot(data=penguins, x="body_mass_g", hue="species", fill=True, alpha=0.4, bw_adjust=1.5, ax=axes[1])
axes[1].set_title("bw_adjust=1.5")

The `0.5` version shows more detail; the `1.5` version is smoother. Neither is strictly better; which you prefer depends on what you want to emphasize.

B.4 ★★☆ | Apply

Create an ECDF of body_mass_g split by species. Identify the approximate median for each species by reading the ECDF at y=0.5.

Guidance

sns.ecdfplot(data=penguins, x="body_mass_g", hue="species")

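The y-axis is cumulative proportion. At y=0.5, the x-value is the median for each species. Gentoo has a clearly higher median than the other two species; the Adelie and Chinstrap curves cross y=0.5 at nearly the same x-value, so their medians are close and should be read carefully rather than ranked by eye.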

B.5 ★★☆ | Apply

Create a violin plot of body_mass_g by species with inner="quartile". Then create a split violin plot with hue="sex" and split=True.

Guidance

sns.violinplot(data=penguins, x="species", y="body_mass_g", inner="quartile")

# Split by sex
sns.violinplot(data=penguins, x="species", y="body_mass_g", hue="sex", split=True, inner="quartile")

The split version shows male vs. female on each side of the violin. This is useful for direct pairwise comparison.

B.6 ★★★ | Apply

Create a 2D KDE of bill_length_mm vs. bill_depth_mm, colored by species, with filled contours.

Guidance

sns.kdeplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",
    fill=True,
    alpha=0.4,
)

The result is a set of filled contours showing where each species' bills fall in the 2D space. The three species form largely separate clusters, with some overlap at the edges.

B.7 ★★★ | Apply

Build a ridge plot of total bill by day from the tips dataset using the FacetGrid pattern from Section 17.7.

Guidance

See Section 17.7 of the chapter for the complete pattern. Key elements: `sns.FacetGrid(tips, row="day", aspect=6, height=1)`, `g.map(sns.kdeplot, "total_bill", fill=True)`, `g.figure.subplots_adjust(hspace=-0.6)` to overlap the rows, and inline labels for each row.

B.8 ★★★ | Create

Use sns.displot (figure-level) to create a faceted histogram of body_mass_g with one panel per species. Then recreate the same chart using sns.histplot (axes-level) with manual subplot arrangement, and compare the code.

Guidance

# Figure-level (seaborn handles layout)
sns.displot(data=penguins, x="body_mass_g", col="species", kind="hist", bins=20, height=4)

# Axes-level (manual layout)
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)
species_list = penguins["species"].unique()
for ax, sp in zip(axes, species_list):
    subset = penguins[penguins["species"] == sp]
    sns.histplot(data=subset, x="body_mass_g", bins=20, ax=ax)
    ax.set_title(sp)

The figure-level version is much shorter. Use axes-level when you need custom layout control.
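
One detail to watch in the manual version: each `sns.histplot` call in the loop picks its bin edges from its own subset, so `bins=20` can mean different bin widths in each panel even with a shared x-axis. A possible fix is to compute one set of edges from the full column and pass it to every call. The helper below is a hypothetical standard-library sketch with made-up masses; with real data, drop missing values before taking the minimum and maximum.

```python
def shared_bin_edges(values, n_bins):
    """Equal-width bin edges spanning the full range of `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Append `hi` explicitly so rounding never leaves the maximum outside.
    return [lo + i * width for i in range(n_bins)] + [hi]


# Hypothetical body masses in grams for two groups with different ranges.
group_a = [3200, 3400, 3550, 3700, 3900, 4100]
group_b = [4300, 4700, 5000, 5200, 5600, 6000]

edges = shared_bin_edges(group_a + group_b, n_bins=7)
print(edges)
```

Because `histplot` accepts a sequence of bin breaks for `bins`, passing `bins=edges` in every iteration of the loop should give each panel identical bin boundaries.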

Part C: Synthesis (4 problems)

C.1 ★★★ | Analyze

For the penguins dataset, produce five different distributional views of body_mass_g: histogram, KDE, ECDF, box plot, and violin plot. Compare what each reveals.

Guidance

Each view reveals different things. Histogram: the overall shape and individual bin counts. KDE: the smooth shape. ECDF: the exact quantiles. Box plot: the five-number summary. Violin: the shape combined with the summary. No single view is sufficient; together, they give a complete picture.

C.2 ★★★ | Evaluate

Section 17.4 claims that ECDFs are "underused." Find a Stack Overflow question or a data visualization tutorial about distributions, and identify whether the answer uses a histogram, KDE, or ECDF. Is the choice justified?

Guidance

Most online answers use histograms because they are familiar. The choice is usually defensible (histograms work), but ECDFs would often be clearer for the specific question. The exercise is training your eye to spot when an ECDF would be better.

C.3 ★★★ | Create

Design a figure with four panels showing the penguin body mass distribution by species using four different chart types: histogram, KDE, ECDF, and violin. Use a 2x2 grid with plt.subplots.

Guidance

fig, axes = plt.subplots(2, 2, figsize=(12, 10), constrained_layout=True)

sns.histplot(data=penguins, x="body_mass_g", hue="species", multiple="layer", alpha=0.6, ax=axes[0, 0])
axes[0, 0].set_title("Histogram")

sns.kdeplot(data=penguins, x="body_mass_g", hue="species", fill=True, alpha=0.4, ax=axes[0, 1])
axes[0, 1].set_title("KDE")

sns.ecdfplot(data=penguins, x="body_mass_g", hue="species", ax=axes[1, 0])
axes[1, 0].set_title("ECDF")

sns.violinplot(data=penguins, x="species", y="body_mass_g", inner="quartile", ax=axes[1, 1])
axes[1, 1].set_title("Violin")

C.4 ★★★ | Evaluate

The chapter argues for using strip plots for small samples instead of KDE or violin. Find a published chart (in a paper or a blog post) where a KDE is used on a small sample. Is it misleading? How would you redesign it?

Guidance

The exercise is about recognizing when distributional estimates are unreliable. A KDE on 10 data points can look just as smooth and confident as a KDE on 1,000 points, but the former is much less reliable. Finding an example in the wild and critiquing it builds the habit of checking sample sizes before trusting distributional charts.
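
One way to see the unreliability is to repeat the experiment many times. The sketch below is a standard-library simulation under stated assumptions: data drawn from a standard normal, a rule-of-thumb bandwidth of sample SD times n^(-1/5), and 100 repeats per sample size. It measures how much the estimated density at a single point changes from one sample to the next.

```python
import math
import random
import statistics


def kde_at(point, data, bandwidth):
    """Gaussian KDE estimate of the density at a single point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((point - x) / bandwidth) ** 2) for x in data
    )


def spread_of_estimates(n, repeats, rng):
    """Standard deviation of the KDE height at 0 across repeated samples of size n."""
    estimates = []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Rule-of-thumb bandwidth (an assumption; other rules exist).
        bandwidth = statistics.stdev(sample) * n ** (-1 / 5)
        estimates.append(kde_at(0.0, sample, bandwidth))
    return statistics.stdev(estimates)


rng = random.Random(42)
spread_small = spread_of_estimates(n=10, repeats=100, rng=rng)
spread_large = spread_of_estimates(n=1000, repeats=100, rng=rng)

print(f"n=10:   spread of density at 0 = {spread_small:.3f}")
print(f"n=1000: spread of density at 0 = {spread_large:.3f}")
```

Each individual curve would look smooth on a chart, but with 10 observations the height of the curve varies far more between samples than it does with 1,000, which is the instability a strip plot makes visible instead of hiding.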

These exercises build fluency with seaborn's distributional functions. Do at least five Part B exercises before Chapter 18.