Quiz: Distributional Visualization

DataField.Dev


20 questions. Aim for mastery (18+).

Multiple Choice (10 questions)

1. The chapter's threshold concept is that:

(a) Every distribution should be visualized as a histogram (b) Distribution shape contains information that summary statistics cannot capture (c) KDE is always better than histograms (d) Box plots are obsolete

Answer

**(b)** Distribution shape contains information that summary statistics cannot capture. Bimodal, skewed, heavy-tailed, and truncated distributions all have features that mean, median, and standard deviation miss. Visualizing the distribution is how you recover that information.

2. To compare two distributions with different sample sizes fairly, you should set:

(a) `stat="count"` (default) (b) `stat="density"` or `stat="probability"` or `multiple="fill"` (c) Larger bin count for the smaller group (d) Different colors for each group

Answer

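**(b)** `stat="density"` or `stat="probability"` or `multiple="fill"`. These normalize for sample size so that the larger group does not visually dominate. When the groups are split with `hue`, also pass `common_norm=False`: by default seaborn normalizes across all groups together, which leaves the larger group dominant. With that setting, `stat="density"` normalizes each group's area to 1 and `stat="probability"` normalizes each group's bar heights to sum to 1; `multiple="fill"` normalizes the total at each bin to 1.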

3. The `bw_adjust` parameter in `sns.kdeplot` controls:

(a) The number of bins (b) The bandwidth of the kernel (smoothness of the curve) as a multiplier on the automatic default (c) The color of the fill (d) The transparency of the curve

Answer

**(b)** The bandwidth of the kernel as a multiplier on the automatic default. `bw_adjust=0.5` halves the bandwidth (less smooth, more detail); `bw_adjust=2.0` doubles it (more smooth, fewer features). The default (1.0) leaves seaborn's automatic bandwidth unchanged; that automatic choice is Scott's rule unless `bw_method` is set to something else.
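To make the "multiplier" idea concrete, here is a sketch that uses SciPy's `gaussian_kde` directly rather than seaborn. It assumes the multiplier simply scales the Scott's-rule bandwidth factor; the helper `kde_with_adjust` and the bimodal sample are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Hypothetical bimodal sample
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 300)])

def kde_with_adjust(sample, bw_adjust=1.0):
    """Gaussian KDE whose Scott's-rule bandwidth factor is scaled by bw_adjust."""
    kde = gaussian_kde(sample, bw_method="scott")
    kde.set_bandwidth(kde.factor * bw_adjust)  # a scalar sets the factor directly
    return kde

base = kde_with_adjust(x, 1.0)
narrow = kde_with_adjust(x, 0.5)
wide = kde_with_adjust(x, 2.0)

grid = np.linspace(x.min() - 3, x.max() + 3, 400)
for name, kde in [("0.5", narrow), ("1.0", base), ("2.0", wide)]:
    density = kde(grid)
    print(f"bw_adjust={name}: factor={kde.factor:.3f}, peak density={density.max():.3f}")
```

The narrower bandwidth produces taller, sharper peaks, while the wider one flattens the two modes toward a single hump.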

4. An ECDF (`sns.ecdfplot`) shows:

(a) The density function of the data (b) The cumulative proportion of data values less than or equal to x (c) The histogram counts (d) The violin plot

Answer

**(b)** The cumulative proportion of data values less than or equal to x. The ECDF is a step function that starts at 0 on the left and ends at 1 on the right. At any y-value, you can read the corresponding quantile directly: at y=0.5, the x-value is the median.
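The definition is short enough to compute by hand. This sketch builds an ECDF with NumPy only; the `ecdf` helper and the sample values are illustrative and not part of seaborn.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the proportion of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

sample = [3.0, 1.0, 4.0, 1.5, 9.0, 2.5, 6.0]
x, y = ecdf(sample)

# Smallest x whose cumulative proportion reaches 0.5
median_from_ecdf = x[np.searchsorted(y, 0.5)]
print(median_from_ecdf, np.median(sample))
```

With an even number of points, this read-off returns the lower of the two middle values, whereas `np.median` averages them.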

5. A violin plot with `inner="quartile"` draws:

(a) Nothing inside the violin (b) Lines at the 25th, 50th, and 75th percentiles (c) A small box plot (d) Individual observations

Answer

**(b)** Lines at the 25th, 50th, and 75th percentiles. These lines show the quartile positions without obscuring the violin shape. Note that this is not seaborn's default: the default is `inner="box"`, so quartile lines must be requested explicitly. Other options: `"box"` (small box plot), `"point"` (individual observations as points), `"stick"` (individual observations as short lines), `None` (nothing).

6. For a split violin plot comparing two levels of a categorical variable, you pass:

(a) `kind="split"` (b) `hue="grouping_var", split=True` (c) `two_groups=True` (d) `overlay=True`

Answer

**(b)** `hue="grouping_var", split=True`. The `split=True` parameter puts one group on each side of the violin, enabling direct visual comparison. It is designed for a `hue` variable with exactly 2 levels; seaborn versions before 0.13 raise an error otherwise.
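A minimal sketch of a split violin on synthetic data follows. The `site`, `arm`, and `outcome` columns are made up for the example, and the quartile lines are requested with `inner="quart"`, the spelling used in recent seaborn documentation.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Hypothetical data: two sites, each with a control and a treated arm
df = pd.DataFrame({
    "site": np.repeat(["A", "B"], 400),
    "arm": np.tile(np.repeat(["control", "treated"], 200), 2),
    "outcome": rng.normal(0, 1, 800),
})
df.loc[df["arm"] == "treated", "outcome"] += 0.8  # shift the treated arm upward

fig, ax = plt.subplots(figsize=(6, 4))
# One half-violin per arm, back to back at each site
sns.violinplot(data=df, x="site", y="outcome", hue="arm",
               split=True, inner="quart", ax=ax)
ax.set_title("Outcome by site, split by arm")
```

Because the two halves share an axis at each site, the upward shift of the treated arm shows up as an offset between the left and right profiles.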

7. Ridge plots (joy plots) are:

(a) A built-in seaborn function called `sns.ridgeplot` (b) Built manually using `FacetGrid` with one row per group and overlapping KDEs (c) Only available in matplotlib (d) Obsolete

Answer

**(b)** Built manually using FacetGrid with one row per group and overlapping KDEs. seaborn does not have a built-in ridge plot function. The pattern uses `sns.FacetGrid` with `row="group"`, `hspace=-0.6` for negative vertical spacing (creating the overlap), and inline labels. See Section 17.7.

8. For a strictly positive variable (like income or age), a KDE can produce:

(a) Negative densities (b) A tail extending below zero, which is misleading (c) Only correct results (d) Infinite values

Answer

**(b)** A tail extending below zero, which is misleading. The KDE kernel is symmetric and extends past the data boundary. For strictly positive data, this creates a tail that implies negative values are possible. Use the `clip=(0, None)` parameter to truncate the KDE at zero.

9. For 5 data points per group, the recommended chart type is:

(a) Kernel density estimate (b) Violin plot (c) Strip plot (showing individual observations) (d) Ridge plot

Answer

**(c)** Strip plot (showing individual observations). With only 5 points, KDE and violin plots are unreliable — they produce smooth shapes that imply more data than exists. A strip plot shows every point honestly without pretending to estimate a distribution.

10. The main advantage of ECDFs over histograms is:

(a) ECDFs are faster to render (b) No binning decisions are required, and every data point is shown exactly (c) ECDFs look more impressive (d) ECDFs work only with large samples

Answer

**(b)** No binning decisions are required, and every data point is shown exactly. Histograms require you to choose a bin count, which affects the visual shape. ECDFs have no such decision — the step curve is determined entirely by the data. This makes them the most honest distributional visualization, even if they are less familiar to readers.

True / False (5 questions)

11. "A histogram with `hue` and `multiple='stack'` is good for comparing the shapes of the individual distributions."

Answer

**False.** Stacked histograms are good for showing totals and the bottom group's distribution, but the stacked groups do not share a common baseline, making shape comparison difficult. For comparing shapes, use `multiple="layer"` (overlaid with transparency) or `multiple="dodge"` (side-by-side).

12. "KDE plots are always more accurate than histograms for showing distribution shape."

Answer

**False.** A KDE's accuracy depends on the bandwidth, a smoothing choice that the data alone does not determine. A poorly chosen bandwidth can over-smooth (hiding features) or under-smooth (adding noise). For small samples, KDE is unreliable. Histograms are often more honest because the binning is explicit and the reader can see every bar.

13. "seaborn's `ecdfplot` supports the `hue` parameter for comparing groups."

Answer

**True.** `sns.ecdfplot(data=df, x="variable", hue="group")` produces one ECDF per group with automatic color assignment. The overlapping ECDFs make group comparison precise.

14. "Violin plots should be used for any sample size, including very small groups."

Answer

**False.** Violin plots use KDE under the hood, which is unreliable for small samples. For groups with fewer than ~30 observations, use strip plots (showing individual observations) instead. A violin plot on 5 points produces a smooth shape that implies more data than exists.

15. "Ridge plots are a built-in seaborn function."

Answer

**False.** Ridge plots are not a built-in function. They are built manually using seaborn's FacetGrid with `row="group"` and overlapping KDEs. See Section 17.7 for the complete recipe.

Short Answer (3 questions)

16. Explain in three to four sentences when you would use a KDE instead of a histogram, and when you would use a histogram instead of a KDE.

Answer

Use a **KDE** when you want a smooth estimate of the distribution without the visual noise of binning, especially for comparing distributions across groups (overlaid KDEs are easier to read than overlaid histograms), and when the sample size is large enough (~100+ points) for the estimate to be reliable. Use a **histogram** when you want to see the raw counts per bin, when the sample size is small to moderate, when you want to avoid bandwidth choices, or when your audience is unfamiliar with KDE. Both have valid use cases; the right choice depends on the specific question and audience.

17. Describe the ECDF's advantages over histograms and explain why it is "underused."

Answer

ECDFs have three main advantages: (1) no binning decisions, since every data point is shown exactly as a step in the cumulative curve; (2) exact quantile reading, for example the x-value at y=0.5 is the median; (3) precise group comparison, because overlapping ECDFs show exactly where two distributions differ. ECDFs are underused because they are less familiar to general audiences. They do not "look like" distributions the way histograms do, so readers need training to interpret them. The cumulative nature means you read slopes rather than heights, which is a different perceptual task. For technical audiences, who can be expected to read them fluently, ECDFs should be preferred for serious distributional analysis.

18. Explain what the `inner` parameter does in `sns.violinplot` and describe when to use each of the common values.

Answer

The `inner` parameter controls what is drawn inside each violin to summarize the distribution. **`"box"`** (the seaborn default) draws a small box plot: it adds the median, interquartile range, and whiskers, at the cost of being more visually dense. **`"quartile"`** draws lines at the 25th, 50th, and 75th percentiles, showing the median and IQR without obscuring the violin shape. **`"point"`** or **`"stick"`** shows individual observations, which is good for small samples where you want to see every data point. **`None`** draws nothing inside, which is useful when the violin shape is the only thing you want to show. For most purposes, `"quartile"` is a good choice, but it has to be requested explicitly.
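One way to compare the options is to draw the same sample once per setting. This is a sketch on simulated values; the sample size of 60 is an arbitrary choice that keeps the points and sticks legible, and the quartile option is written `"quart"`, the spelling used in recent seaborn documentation.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
values = rng.normal(loc=10, scale=2, size=60)  # hypothetical sample

inners = ["box", "quart", "point", "stick", None]
fig, axes = plt.subplots(1, len(inners), figsize=(15, 3.5), sharey=True)

# Same data in every panel; only the interior annotation changes
for ax, inner in zip(axes, inners):
    sns.violinplot(y=values, inner=inner, ax=ax)
    ax.set_title(f"inner={inner!r}")
```

Seen together, the panels make the trade-off visible: the box and quartile variants summarize, the point and stick variants expose every observation, and `None` leaves only the density outline.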

Applied Scenarios (2 questions)

19. You have purchase amounts for 10,000 customers split into two groups: "regular" and "premium." Write seaborn code to visualize the two distributions using an ECDF, and explain what the reader would learn from it that a histogram or box plot would not reveal.

Answer

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

# `purchases` is assumed to be a DataFrame with "amount" and "segment" columns
sns.ecdfplot(data=purchases, x="amount", hue="segment")
plt.xscale("log")  # often useful for purchase amounts
plt.title("Cumulative Distribution of Purchase Amounts by Segment")
plt.xlabel("Amount (USD, log scale)")
plt.ylabel("Cumulative proportion")

**What the ECDF reveals that other charts do not:**

- **Exact quantiles**: at any y-value, the reader can read off the exact amount below which that proportion of customers fall. Premium vs. regular medians, 90th percentiles, 99th percentiles all readable directly.
- **Precise comparison across the distribution**: the vertical gap between the two ECDFs shows exactly how much one distribution lags the other at each amount level.
- **Whole distribution visible**: ECDFs show every data point. Histograms hide individual points in bins; box plots hide everything except the five-number summary.
- **No binning artifacts**: histograms with different bin counts look different and may mislead. ECDFs have no binning.

The trade-off is that ECDFs require training to read. First-time viewers may need help understanding what the y-axis means, but once they do, the chart is more informative than a histogram for most comparisons.

20. A colleague produces the following chart for a clinical trial with 8 patients in the treatment group and 10 in the control group: `sns.violinplot(data=trial, x="treatment", y="outcome")`. Identify the problem and propose a better visualization.

Answer

**Problem:** Violin plots are unreliable for small samples. With 8 patients in one group and 10 in the other, the KDE that generates the violin shape is essentially making up density estimates from very few points. The resulting smooth shapes imply much more data than exists and can mislead the reader into seeing distributional features that are really just sampling noise.

**Better visualization:** Show individual observations with a summary overlay. A strip plot or swarm plot displays every data point directly, and a point plot adds the mean or median for each group:

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))

# Individual observations
sns.stripplot(data=trial, x="treatment", y="outcome", size=8, alpha=0.7, ax=ax)

# Mean overlay as a dash
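sns.pointplot(
    data=trial,
    x="treatment",
    y="outcome",
    estimator="mean",
    markers="_",
    linestyle="none",  # no line joining the group means (seaborn >= 0.13; use join=False on older versions)
    color="black",
    errorbar=None,
    ax=ax,
)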

ax.set_title("Treatment vs. Control: Individual Outcomes with Group Means")

This version shows every patient honestly, with the group mean as a reference mark. The reader can see the exact sample size (by counting the points) and form their own judgment about the variability and overlap. For small samples, showing every point is the ethical choice; pretending the data is dense enough for a smooth distribution estimate is not.

Review against mastery thresholds. Chapter 18 covers relational and categorical visualization in seaborn, building on the distributional tools from this chapter.