Key Takeaways — Chapter 17: Distributional Visualization

DataField.Dev

1. Distribution Shape Is Information

The threshold concept: the shape of a distribution contains information that summary statistics cannot capture. A bimodal and a unimodal distribution can have identical mean and standard deviation but different structures. Visualization is how you preserve and communicate that structure.

2. Six Distributional Chart Types

Histogram (sns.histplot): binned counts. Familiar, requires a bin count choice.
KDE (sns.kdeplot): smooth density estimate. Requires a bandwidth choice.
Rug plot (sns.rugplot): individual observations as marginal marks. Supplementary.
ECDF (sns.ecdfplot): cumulative proportion. No decisions required. Exact quantile reading.
Violin plot (sns.violinplot): box plot with KDE draped over it. Good for group comparison.
Ridge plot (FacetGrid + kdeplot): stacked KDEs with overlap. Striking for many groups.

3. Histograms with seaborn

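sns.histplot(data=df, x="col", bins=30, hue="group", multiple="layer"). Use multiple="layer" for shape comparison, "stack" for totals, "dodge" for side-by-side, "fill" for proportional. Use stat="density" or stat="probability" for groups of different sizes, together with common_norm=False so that each group is normalized on its own; by default, normalization is applied across all groups combined, so smaller groups still appear smaller.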

4. KDE Bandwidth Is a Design Decision

The bw_adjust parameter controls smoothness. It is a multiplier on the bandwidth chosen by bw_method (Scott's rule by default), and its own default is 1.0. Smaller values (0.3-0.5) show more detail; larger values (1.5-3.0) show a smoother shape. Check against a histogram to verify that the KDE bandwidth is reasonable. For bimodal data, the default often over-smooths.

5. ECDFs Are Underused and Worth Using

ECDFs (sns.ecdfplot) make no binning or smoothing decisions. They show every data point as a step in a cumulative curve, enable exact quantile reading, and are the cleanest way to compare groups. They are less familiar to general audiences but are the most honest distributional chart. Learn to read them and prefer them for serious analytical work.

6. Violin Plots Combine Summary and Shape

sns.violinplot mirrors the KDE shape on both sides of a central axis and draws summary statistics inside it. Use inner="quartile" for clean quartile lines, inner="box" (the seaborn default) for a miniature box plot, inner="point" or "stick" for small samples. Watch out for misleading tails: by default (cut=2) the KDE extends past the observed minimum and maximum, which can suggest values that do not exist; cut=0 truncates the violin at the data range. For two-level comparisons, use split=True with hue.
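The options above can be combined in one call. This is a sketch on synthetic, non-negative data; the column names "site", "arm", and "value" are hypothetical. Because the values cannot go below zero, cut=0 keeps the violins from extending into impossible negative territory.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(5)
n = 150

# Hypothetical skewed, non-negative measurements for two sites and two arms.
df = pd.DataFrame({
    "site": np.repeat(["north", "south"], n),
    "arm": np.tile(["control", "treated"], n),
    "value": rng.gamma(shape=2.0, scale=1.5, size=2 * n),
})
df.loc[df["arm"] == "treated", "value"] += 1.0

fig, ax = plt.subplots(figsize=(6, 3.5))
sns.violinplot(
    data=df,
    x="site",
    y="value",
    hue="arm",
    split=True,        # one half-violin per hue level
    inner="quartile",  # quartile lines instead of the default mini box plot
    cut=0,             # stop the KDE at the observed min and max
    ax=ax,
)
```

Removing cut=0 and re-running is a quick way to see how far the default tails extend below zero for this kind of data.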

7. Ridge Plots Are Built from FacetGrid

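seaborn has no ridgeplot function. Build ridge plots with sns.FacetGrid using row="group", hue="group", aspect=9, and height=0.7, then overlap the rows with a negative hspace, for example g.figure.subplots_adjust(hspace=-0.6). Map filled KDEs with g.map(sns.kdeplot, "variable", fill=True). Make the axes background transparent so that overlapping rows do not hide each other, clear the default facet titles with g.set_titles(""), and remove axis clutter with g.despine(bottom=True, left=True) and g.set(yticks=[], ylabel=""). Use inline labels for each row.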

8. Small Samples Need Strip Plots

KDE and violin plots are unreliable for samples smaller than roughly 30 observations. For small groups, use strip plots (sns.stripplot) to show individual observations, optionally with a point plot overlay for the group mean or median. Never use a smooth distributional estimate on data too sparse to support it: the smoothness implies more data than exists.
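A minimal sketch for small groups, with 12 synthetic observations per group. To keep the example independent of seaborn version differences in sns.pointplot, the median markers here are drawn with a plain matplotlib scatter call; that is a substitution for the point plot overlay mentioned above, not the only way to do it.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(6)

# Hypothetical small samples: 12 observations in each of three groups.
order = ["A", "B", "C"]
df = pd.DataFrame({
    "group": np.repeat(order, 12),
    "value": np.concatenate([rng.normal(m, 1.0, 12) for m in (5, 6, 8)]),
})

# Group medians, in the same order as the categories on the x axis.
medians = df.groupby("group")["value"].median().reindex(order)

fig, ax = plt.subplots(figsize=(5, 3.5))

# Every observation is drawn, with a little jitter to reduce overlap.
sns.stripplot(data=df, x="group", y="value", order=order,
              jitter=0.15, alpha=0.7, ax=ax)

# Categories sit at x = 0, 1, 2, so the medians can be placed directly.
ax.scatter(range(len(order)), medians.to_numpy(), marker="D", s=60,
           color="black", zorder=3, label="median")
ax.legend()
```

Because each point is one observation, the reader can see exactly how much data stands behind each group.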

9. 2D Distributions with kdeplot and histplot

For joint distributions, pass both x and y to sns.kdeplot (for 2D contours) or sns.histplot (for 2D histograms). The sns.jointplot function combines a bivariate plot with marginal distributions on the top and right. 2D KDE with hue shows overlapping cluster regions for multiple groups.

10. Choose the Chart to Match the Question

No single distributional chart type is always right. Histograms for familiar overview. KDE for smooth shapes. ECDF for precise quantile reading and group comparison. Violin for group summary + shape. Ridge for many-group temporal evolution. Match the chart type to the specific question, the sample size, the number of groups, and the audience. When in doubt, try several and see which reveals the feature you care about.