Key Takeaways — Chapter 18: Relational and Categorical Visualization

DataField.Dev

1. The Relational Family

The relational family comprises sns.scatterplot, sns.lineplot, sns.relplot (figure-level), sns.regplot, and sns.lmplot (figure-level with regression). These functions visualize the relationship between two continuous variables. Encoding further variables through hue, style, and size adds dimensions to the same plot. sns.lineplot also aggregates automatically: given several observations per x-value, it plots their mean with a confidence band.
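A minimal sketch of multi-variable encoding with sns.scatterplot. The DataFrame and its column names (hours, score, group, site, weight) are invented for illustration and are not from the chapter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "hours": rng.uniform(0, 10, n),
    "group": rng.choice(["control", "treated"], n),
    "site": rng.choice(["north", "south"], n),
    "weight": rng.uniform(50, 90, n),
})
df["score"] = 40 + 4 * df["hours"] + rng.normal(0, 5, n)

fig, ax = plt.subplots(figsize=(6, 4))
# x and y carry the two continuous variables; hue, style, and size
# each encode one further column on the same Axes.
sns.scatterplot(
    data=df, x="hours", y="score",
    hue="group", style="site", size="weight", ax=ax,
)
ax.set_title("Score vs. hours (synthetic data)")
```

Three extra encodings on one Axes is close to the limit of what most readers can decode; in practice one or two is often enough.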

2. The Categorical Family Has Three Subfamilies

Showing every point: sns.stripplot and sns.swarmplot, best for small samples where individual observations matter. Showing distributions: sns.boxplot and sns.violinplot, best for moderate to large samples where both the distribution shape and the summary matter. Showing summaries: sns.barplot, sns.pointplot, and sns.countplot, best for quick aggregate comparisons, but beware the dynamite plot critique.

3. The Dynamite Plot Critique

The threshold concept: bar charts with error bars hide the distribution of individual data points. A mean with a single error bar tells you nothing about sample size, distribution shape, outliers, or skewness. Two dramatically different datasets can produce identical dynamite plots. The fix is to show the data alongside the summary — strip plots combined with box or violin plots.
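The claim that different datasets can yield the same dynamite plot can be checked numerically. The sketch below builds two made-up samples and rescales them to share exactly the same mean and sample standard deviation; the rescale helper is hypothetical, written only for this illustration.

```python
import numpy as np


def rescale(values, mean, sd):
    """Shift and scale values to have exactly the given mean and sample SD."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return mean + sd * z


# Two made-up samples of 30 points each with very different shapes:
# one split into two tight clusters, one with 29 identical values and one outlier.
two_clusters = rescale(np.repeat([-1.0, 1.0], 15), mean=10.0, sd=2.0)
one_outlier = rescale(np.r_[np.zeros(29), 1.0], mean=10.0, sd=2.0)

for name, sample in [("two clusters", two_clusters), ("one outlier", one_outlier)]:
    print(
        f"{name:>12}: mean={sample.mean():.2f}  sd={sample.std(ddof=1):.2f}  "
        f"median={np.median(sample):.2f}  max={sample.max():.2f}"
    )
```

A bar at the mean with an SD error bar would look the same for both samples, while the medians and maxima differ; only a plot of the individual points would reveal the difference.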

4. Strip Plus Box Is the Recommended Alternative

sns.stripplot(alpha=0.4, size=4) combined with sns.boxplot(showfliers=False, boxprops=dict(alpha=0.3)) on the same Axes produces a chart that shows both the individual observations and the distribution summary. Setting showfliers=False keeps the box plot from drawing outliers that the strip plot already shows. This is the modern replacement for the dynamite plot in biomedical publication and scientific visualization.
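The combination can be sketched as follows, using the parameter values quoted above. The data, the column names (group, value), and the choice of color="black" for the points are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data, invented for illustration only; group C is deliberately skewed.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 25),
    "value": np.concatenate([
        rng.normal(10, 2, 25),
        rng.normal(12, 2, 25),
        rng.exponential(4, 25) + 8,
    ]),
})

fig, ax = plt.subplots(figsize=(5, 4))
# Translucent boxes give the summary; fliers are hidden because the
# strip plot already draws every observation, including outliers.
sns.boxplot(
    data=df, x="group", y="value",
    showfliers=False, boxprops=dict(alpha=0.3), ax=ax,
)
# Every individual observation, drawn on the same Axes.
sns.stripplot(
    data=df, x="group", y="value",
    alpha=0.4, size=4, color="black", ax=ax,
)
ax.set_title("Strip plus box (synthetic data)")
```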

5. Automatic Regression Overlays

sns.regplot(data=df, x="x", y="y") fits a linear regression and overlays it with a confidence band; pass x and y as keyword arguments, since recent seaborn versions do not accept them positionally. sns.lmplot is the figure-level version with faceting. Use order=N for polynomial fits; use logistic=True for binary outcomes (this option requires statsmodels). Check residuals with sns.residplot. Beware extrapolation beyond the data range.
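A minimal sketch of a regression overlay next to its residual check. The data and the column names (dose, response) are made up for the example, and all arguments are passed by keyword.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, roughly linear data, invented for illustration only.
rng = np.random.default_rng(3)
dose = rng.uniform(0, 10, 80)
df = pd.DataFrame({
    "dose": dose,
    "response": 2.0 + 1.5 * dose + rng.normal(0, 2, 80),
})

fig, (ax_fit, ax_resid) = plt.subplots(1, 2, figsize=(9, 4))

# Left: scatter, fitted line, and confidence band.
sns.regplot(data=df, x="dose", y="response", ax=ax_fit)
ax_fit.set_title("Linear fit with confidence band")

# Right: residuals of the same linear fit; visible structure here
# would suggest the linear model is a poor description.
sns.residplot(data=df, x="dose", y="response", ax=ax_resid)
ax_resid.set_title("Residuals")
```

For a curved relationship, adding order=2 to both calls would fit and check a quadratic instead.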

6. Automatic Aggregation in lineplot

sns.lineplot computes the mean per x-value and draws a 95% bootstrap CI band by default. To disable aggregation, pass estimator=None. To change the aggregator, pass estimator=np.median. To draw each series individually, pass units= to identify the series together with estimator=None, which units= requires. Always label the aggregation in the chart subtitle or caption.

7. Error Bar Types Matter

errorbar=("ci", 95) is a 95% bootstrap confidence interval (the seaborn default). errorbar=("se", 1) is one standard error of the mean, which is about half as wide as a 95% CI. errorbar=("sd", 1) is one standard deviation, which represents data spread rather than uncertainty. These are not interchangeable: CI and SE represent uncertainty about the mean, while SD represents the spread of individual observations. Always specify the type explicitly and label the chart.
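To make the size difference concrete, the sketch below computes the three quantities by hand for one made-up sample and then draws the same data with each errorbar setting. The hand-rolled percentile bootstrap is an approximation written for illustration, not seaborn's own implementation, and the errorbar= parameter assumes seaborn 0.12 or later.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# One synthetic sample of 40 values, invented for illustration only.
rng = np.random.default_rng(5)
values = rng.normal(50, 10, 40)
n = len(values)

sd = values.std(ddof=1)   # spread of individual observations
se = sd / np.sqrt(n)      # uncertainty about the mean

# Illustrative percentile bootstrap of the mean (2000 resamples).
boot_means = np.array([
    rng.choice(values, size=n, replace=True).mean() for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
ci_half_width = (ci_high - ci_low) / 2

print(f"SD={sd:.2f}  SE={se:.2f}  95% CI half-width={ci_half_width:.2f}")

# Same data drawn three times; only the errorbar setting changes.
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], n),
    "value": np.concatenate([values, values + 5.0]),
})
fig, axes = plt.subplots(1, 3, figsize=(10, 3.5), sharey=True)
for ax, errorbar in zip(axes, [("ci", 95), ("se", 1), ("sd", 1)]):
    sns.pointplot(data=df, x="group", y="value", errorbar=errorbar, ax=ax)
    ax.set_title(f"errorbar={errorbar}")
```

With 40 observations the SD bar is several times longer than the SE bar, so an unlabeled error bar can be misread by a large factor.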

8. Categorical Order Matters

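By default, seaborn orders string categories by their first appearance in the data (or by the declared category order of a pandas Categorical column) and sorts numeric categories. Neither that order nor an alphabetical sort is guaranteed to match an ordinal scale. For ordered categories (days of week, months, severity levels), pass order=[list] explicitly. Leaving ordinal data in an arbitrary order is almost always wrong, and overlooking this is one of the most common seaborn mistakes.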
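A small sketch of fixing the category order explicitly. The weekday data and the column names (day, sales) are invented; the rows are deliberately shuffled so that the order seaborn would infer on its own is not the calendar order.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Intended ordinal order of the categories.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]

# Synthetic data with the days in random row order.
rng = np.random.default_rng(6)
df = pd.DataFrame({
    "day": rng.choice(days, 100),
    "sales": rng.normal(100, 15, 100),
})

fig, ax = plt.subplots(figsize=(6, 4))
# order= pins the x-axis categories to the intended sequence.
sns.boxplot(data=df, x="day", y="sales", order=days, ax=ax)

fig.canvas.draw()  # make sure tick labels are populated before reading them
tick_labels = [t.get_text() for t in ax.get_xticklabels()]
print(tick_labels)
```

An alternative is to convert the column to an ordered pandas Categorical once, so that every later plot picks up the same order without repeating order=.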

9. catplot and relplot Handle Faceting

Figure-level sns.catplot and sns.relplot support col and row parameters for automatic small multiples. Use them when the whole figure is a single faceted visualization. Use the axes-level functions when integrating with manual matplotlib layouts.
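The two usage patterns can be sketched as follows: a figure-level call that builds its own faceted figure, and an axes-level call placed into a manually created matplotlib layout. The data and the column names (group, site, value) are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "group": np.tile(["A", "B"], 40),
    "site": np.repeat(["north", "south"], 40),
    "value": rng.normal(10, 2, 80),
})

# Figure-level: catplot creates the figure and one panel per site.
g = sns.catplot(
    data=df, x="group", y="value",
    col="site", kind="box", height=3, aspect=1,
)
g.set_axis_labels("Group", "Value")

# Axes-level: the same kind of plot placed into a manual layout
# next to an unrelated matplotlib panel.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))
sns.boxplot(data=df, x="group", y="value", ax=ax_left)
ax_right.hist(df["value"], bins=15)
ax_right.set_xlabel("value")
```

The figure-level call returns a grid object rather than an Axes, which is why it does not take an ax= argument and cannot be dropped into an existing layout.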

10. Show the Data, Not Just the Summary

The chapter's threshold concept: summaries hide information that the raw data reveals. Whenever you produce a group comparison chart, ask whether showing every individual observation would add value. Usually the answer is yes. The strip+box combination, the swarm+box combination, the violin+strip combination — all of these show both the data and the summary. Default to showing the data; make the case for hiding it when you choose to do so.