Chapter 18: Relational and Categorical Visualization



Learning Objectives

Create scatter plots with sns.scatterplot() and relplot including multi-variable encoding
Create line plots with sns.lineplot() including confidence bands and multiple series
Add regression lines with sns.regplot() and sns.lmplot()
Create categorical charts: strip, swarm, box, violin, bar, point, count
Distinguish between categorical plots that show every point, that show distribution, and that show a summary
Apply catplot() to facet categorical comparisons across panels
Recognize and critique the 'dynamite plot' anti-pattern

In This Chapter

18.1 Scatter Plots with sns.scatterplot
18.2 Line Plots with sns.lineplot
18.3 Regression Overlays with sns.regplot and sns.lmplot
18.4 Categorical Charts: Showing Every Point
18.5 Categorical Charts: Showing Distributions
18.6 Categorical Charts: Showing Summaries
18.7 The Dynamite Plot Critique
18.8 Figure-Level catplot
18.9 The Error Bar Controversy
18.10 Best Practices Summary
18.10 Common Categorical Pitfalls
18.10 seaborn vs. matplotlib for the Same Chart
18.10 Residuals and Regression Diagnostics
18.10 seaborn catplot Patterns
18.11 Combining Relational and Categorical Analysis
18.10 A Quick Reference for the Categorical Family
18.10 The Dynamite Plot Alternative in Detail
18.10 Using relplot for Faceted Relational Plots
18.10 Advanced Scatter Plot Techniques
18.10 Relational Pitfalls
18.10 Categorical Plot Decision Matrix
18.11 The Climate Relational and Categorical Story
Chapter Summary
Spaced Review: Concepts from Chapters 1-17



"Show the data." — Every serious data visualization book, applied here to group comparisons.

Chapter 17 covered seaborn's distributional family — how to visualize the shape of a single variable or a joint distribution. This chapter covers the other two families: relational (how two variables relate) and categorical (how values compare across categories).

Relational visualization answers questions like: "Is there a correlation between X and Y?" "How does this variable change over time?" "Are there clusters in the 2D space?" The main seaborn functions are sns.scatterplot, sns.lineplot, sns.regplot, and sns.lmplot, with the figure-level sns.relplot wrapping the scatter and line cases with faceting support.

Categorical visualization answers questions like: "How do values compare across groups?" "Is there a difference between these treatments?" "Which category has the highest median?" The main seaborn functions include sns.stripplot, sns.swarmplot, sns.boxplot, sns.violinplot, sns.barplot, sns.pointplot, and sns.countplot, with the figure-level sns.catplot wrapping them all.

This chapter walks through the main functions in each family, covers the automatic statistical overlays (regression lines, confidence intervals), and makes the case against the "dynamite plot" — the bar-with-error-bar chart that hides individual data points behind a summary. The threshold concept is that showing the data is better than just showing the summary. Once you internalize this, you will find yourself favoring strip+box combinations over plain bar charts for most group comparisons.

18.1 Scatter Plots with sns.scatterplot

The basic scatter plot is the foundation of relational visualization. seaborn's sns.scatterplot is more capable than matplotlib's ax.scatter in several ways: automatic handling of categorical encodings, built-in legends, and integration with the data parameter.

Basic Usage

import seaborn as sns
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip")

This produces a simple scatter with default colors. Every observation becomes a point.

Multi-Variable Encoding

The real power of scatterplot is encoding additional variables as hue, style, and size:

sns.scatterplot(
    data=tips,
    x="total_bill",
    y="tip",
    hue="smoker",
    style="time",
    size="size",
    sizes=(40, 200),  # range of marker sizes
)

This encodes five variables (x, y, color, shape, size) on a single chart. seaborn handles the palette, the legend, and the size scaling automatically. The equivalent matplotlib code would require manual iteration over each combination of hue, style, and size — possibly dozens of lines.

Overplotting

For dense scatter plots, use alpha for transparency or switch to a hexbin or 2D KDE:

sns.scatterplot(data=big_df, x="x", y="y", alpha=0.3, s=20)

Small markers and low alpha let the density become visible through overlap. For very dense data, use sns.histplot with both x and y for a 2D histogram, or sns.kdeplot with both x and y for contours.

18.2 Line Plots with sns.lineplot

Line plots connect observations in sequence. sns.lineplot is the seaborn equivalent of ax.plot for continuous relationships.

Basic Usage with Aggregation

flights = sns.load_dataset("flights")
sns.lineplot(data=flights, x="year", y="passengers")

This plots passengers over year. If there are multiple observations per year (the flights dataset has 12 months per year), seaborn aggregates automatically — computing the mean for each year and drawing a confidence band around it. The confidence band is the 95% bootstrap interval by default.
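To make the aggregation concrete, here is a rough sketch of the kind of computation involved: a mean per x value plus a percentile bootstrap interval for that mean. The data frame, its column names, and the bootstrap_ci helper are hypothetical, and seaborn's own implementation may differ in its details.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in with 12 observations per year (illustrative only).
years = np.repeat(np.arange(2000, 2005), 12)
df = pd.DataFrame({
    "year": years,
    "value": 100 + 10 * (years - 2000) + rng.normal(scale=15, size=years.size),
})

def bootstrap_ci(values, n_boot=1000, level=95):
    """Percentile bootstrap interval for the mean of `values`."""
    values = np.asarray(values)
    # Resample with replacement n_boot times and take the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    return np.percentile(boot_means, [tail, 100 - tail])

rows = []
for year, group in df.groupby("year"):
    low, high = bootstrap_ci(group["value"])
    rows.append({"year": year, "mean": group["value"].mean(),
                 "ci_low": low, "ci_high": high})
summary = pd.DataFrame(rows)
print(summary.round(1))
```

The mean column corresponds to the plotted line, and ci_low and ci_high correspond to the edges of the shaded band.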

Disabling Aggregation

If each row is already one observation and you want to see individual points without aggregation:

sns.lineplot(data=df, x="x", y="y", estimator=None)

estimator=None disables the aggregation, so no mean or confidence band is computed. Because sort=True is the default, the line connects the raw points in order of increasing x.

Multiple Series

sns.lineplot(data=flights, x="year", y="passengers", hue="month")

With hue="month", you get one line per month with a legend. seaborn handles the color assignment and legend automatically.

Customizing the Confidence Band

sns.lineplot(data=df, x="x", y="y", errorbar=("ci", 95))   # 95% CI (default)
sns.lineplot(data=df, x="x", y="y", errorbar=("se", 1))    # 1 standard error
sns.lineplot(data=df, x="x", y="y", errorbar=("sd", 2))    # 2 standard deviations
sns.lineplot(data=df, x="x", y="y", errorbar=None)         # no band

The errorbar parameter accepts tuples specifying the type and size. Pass None to disable the band entirely.

18.3 Regression Overlays with sns.regplot and sns.lmplot

Fitting a regression line and overlaying it on a scatter plot is one of the most common statistical visualizations. seaborn provides two functions for this: sns.regplot (axes-level) and sns.lmplot (figure-level with faceting).

Basic Regression Plot

sns.regplot(data=tips, x="total_bill", y="tip")

This produces a scatter plot with a fitted linear regression line and a shaded 95% confidence interval around the line. seaborn computes the fit internally with numpy and estimates the confidence interval by bootstrap resampling.

Polynomial Regression

sns.regplot(data=tips, x="total_bill", y="tip", order=2)  # quadratic
sns.regplot(data=tips, x="total_bill", y="tip", order=3)  # cubic

The order parameter fits a polynomial of the specified degree. Higher orders can fit curved relationships but risk overfitting. For most exploratory work, stick with order=1 (linear) unless you have a specific reason to believe the relationship is curved.

Logistic Regression

sns.regplot(data=df, x="predictor", y="binary_outcome", logistic=True)

For binary outcomes, logistic=True fits a logistic regression instead of a linear one (this option requires the statsmodels package). The result is an S-shaped curve that saturates at 0 and 1.

Faceted Regression with lmplot

sns.lmplot(
    data=tips,
    x="total_bill",
    y="tip",
    col="day",
    hue="smoker",
    height=4,
)

sns.lmplot is the figure-level version, with built-in faceting via col and row. Each panel gets its own regression fit. With hue, each color group gets its own regression line within each panel.

Regression Pitfalls

1. Correlation is not causation. A fitted line does not prove the relationship is causal. The regression line shows the best linear fit to the data; interpret it as "if the relationship were linear, this is the best line" not as "X causes Y."

2. Polynomial overfitting. Higher-order polynomials can fit noise rather than signal. A quadratic or cubic fit that looks perfect on training data may not generalize. For serious modeling, use scikit-learn with train/test splits.

3. Extrapolation is unsafe. By default the fitted line stops at the limits of the data, but with truncate=False it extends to the axis limits, which can mislead readers into assuming the model applies to values you have no data for.

4. Non-linear relationships. If the underlying relationship is not linear, a linear fit will be wrong. Check residuals with sns.residplot to see if the linear assumption holds.
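Pitfall 2 can be illustrated with a small numpy sketch. The data here are synthetic and truly linear, and np.polyfit stands in for the polynomial fit that order= requests; the sample sizes and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n points from a truly linear relationship plus noise."""
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for order in (1, 3, 8):
    coeffs = np.polyfit(x_train, y_train, deg=order)
    results[order] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"order={order}: train MSE={results[order][0]:.3f}, "
          f"test MSE={results[order][1]:.3f}")
```

Training error can only go down as the order increases, because each higher-order model contains the lower-order ones. The held-out error is the number to watch: it typically stops improving, or gets worse, once the extra terms start fitting noise.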

18.4 Categorical Charts: Showing Every Point

Categorical charts compare values across categories. seaborn's categorical family has three subfamilies based on what they show: every point (strip, swarm), distributions (box, violin), and summaries (bar, point).

Strip Plot

sns.stripplot shows every observation as a point, with jitter to prevent exact overlap.

sns.stripplot(data=tips, x="day", y="total_bill")

With the default jitter=True, points within the same category are spread slightly horizontally so you can see them individually. Without jitter, points at the same value would overlap exactly.

Strip plots are good when:

Sample size is small (< 50 per group) and showing every point matters.
The individual observations are important (e.g., clinical data).
Combined with a box or violin plot, they show both the summary and the data.

Swarm Plot

sns.swarmplot is similar to strip but uses an algorithm to avoid overlap without jitter:

sns.swarmplot(data=tips, x="day", y="total_bill")

Swarm plots look cleaner than strip plots for small to moderate datasets. For larger datasets (more than ~100 per category), the swarm algorithm can warn about overlap — in those cases, strip plots with smaller markers are better.

Combining Strip with Box or Violin

The most informative categorical chart combines a summary visualization with the raw data points:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=tips, x="day", y="total_bill", showfliers=False, ax=ax)
sns.stripplot(data=tips, x="day", y="total_bill", color="black", alpha=0.4, size=4, ax=ax)

This shows the box plot summary (median, quartiles, whiskers) underneath with the individual data points overlaid. The reader sees both the distribution summary and every observation. showfliers=False hides the box plot's outlier markers because the strip plot already shows them.

18.5 Categorical Charts: Showing Distributions

For larger groups where showing every point is impractical, use distribution-based categorical charts.

Box Plot

sns.boxplot(data=tips, x="day", y="total_bill")

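Shows the median, the quartiles (Q1 and Q3) as the box, and whiskers that reach the most extreme observations within 1.5 × IQR of the box; points beyond the whiskers are drawn individually as outliers. Same visual as matplotlib's box plot but with seaborn's theming.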
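The numbers behind a box plot can be sketched with numpy. This assumes the common default convention of whiskers at 1.5 × IQR; box_stats is a hypothetical helper written for illustration, not a seaborn function.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends, and outliers under the whis * IQR rule."""
    values = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),   # most extreme point inside the fences
        "whisker_high": inside.max(),
        "fliers": values[(values < low_fence) | (values > high_fence)],
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

In this example the value 30 lies beyond the upper fence, so the upper whisker stops at 9 and 30 is reported as an outlier rather than as the end of the whisker.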

Violin Plot

sns.violinplot(data=tips, x="day", y="total_bill", inner="quartile")

Shows the KDE shape plus an inner representation (quartile lines, box plot, etc.). Covered in Chapter 17. Combines distribution shape with summary statistics.

Box Plot with Hue

sns.boxplot(data=tips, x="day", y="total_bill", hue="smoker")

With hue, each category gets multiple boxes side by side (one per hue value). Useful for two-dimensional categorical comparison. seaborn handles the positioning automatically.

18.6 Categorical Charts: Showing Summaries

For quick summaries, the categorical family has bar and point plots that show aggregated statistics.

Bar Plot (the Statistical Version)

sns.barplot(data=tips, x="day", y="total_bill")

Important note: sns.barplot is different from matplotlib's ax.bar. seaborn's barplot computes an aggregate (mean by default) and plots it as a bar with a confidence interval as an error bar. Matplotlib's ax.bar just draws bars of specified heights.

The default aggregation is the mean. To change it:

import numpy as np
sns.barplot(data=tips, x="day", y="total_bill", estimator=np.median)

Point Plot

sns.pointplot is similar to barplot but uses dots connected by lines instead of bars:

sns.pointplot(data=tips, x="day", y="total_bill", hue="smoker")

Point plots are useful for showing how a summary statistic changes across categories, especially when you have multiple groups (via hue) and want to compare their trends.

Count Plot

sns.countplot shows the frequency of each category (no y-variable needed):

sns.countplot(data=tips, x="day")

This plots the count of observations for each day. It is equivalent to sns.barplot with the y-variable being the count itself. Useful for quick frequency checks.
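The bar heights in a count plot are simply the number of rows per category, which pandas can compute directly. The tiny data frame below is made up for illustration; only the column name day mirrors the tips example.

```python
import pandas as pd

# Tiny illustrative frame (not the real tips data).
df = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sat", "Sun", "Sun", "Sun"]})

# Number of rows per category: these are the heights a count plot would draw.
counts = df["day"].value_counts()
print(counts)
```

Checking value_counts() first is a quick way to confirm that the chart shows what you expect.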

18.7 The Dynamite Plot Critique

This section develops the chapter's threshold concept in detail: why the default sns.barplot (mean with error bar) is often a bad choice, and what to use instead.

What the Dynamite Plot Is

A "dynamite plot" is a bar chart where each bar represents the mean of a group and an error bar extends above (and sometimes below) the bar showing one standard error, one standard deviation, or a confidence interval. The visual resembles a stick of dynamite — hence the nickname.

sns.barplot(data=tips, x="day", y="total_bill")  # default: mean with 95% CI

This is the default seaborn barplot. It looks like the dynamite plot described above.

Why It Is Bad

Dynamite plots hide the actual distribution of the data. From the chart, you cannot tell:

Whether the data is unimodal or bimodal. The mean could be between two clusters.
The sample size. A bar with a small error bar could come from 10 points or 10,000 — the chart does not say.
Skewness. The error bar is symmetric, implying a symmetric distribution, but the actual data may be heavily skewed.
Outliers. The mean is pulled by outliers, but the chart does not show them.
Individual observations. Two different datasets can produce identical bar-with-error-bar charts while having completely different point distributions.

The dynamite plot reduces the data to two numbers (mean and one measure of spread) and hides everything else. This is sometimes what you want — for a quick headline statistic — but it is usually a loss of information.
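The point about identical summaries can be checked numerically. The sketch below builds two synthetic groups, one bell-shaped and one made of two separated clusters, and rescales both to the same mean and standard deviation. The rescale helper and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

def rescale(values, mean=20.0, sd=5.0):
    """Shift and scale values so they have the requested mean and SD."""
    z = (values - values.mean()) / values.std(ddof=1)
    return mean + sd * z

# One bell-shaped group and one group made of two well-separated clusters.
unimodal = rescale(rng.normal(size=n))
bimodal = rescale(np.concatenate([rng.normal(-3, 0.5, n // 2),
                                  rng.normal(3, 0.5, n // 2)]))

shares = {}
for name, values in [("unimodal", unimodal), ("bimodal", bimodal)]:
    se = values.std(ddof=1) / np.sqrt(len(values))
    # Share of observations within 0.25 SD (1.25 units) of the group mean.
    shares[name] = float(np.mean(np.abs(values - values.mean()) < 1.25))
    print(f"{name}: mean={values.mean():.2f}, SE={se:.3f}, "
          f"share near mean={shares[name]:.2f}")
```

Both groups would produce the same bar and the same error bar, yet in the bimodal group almost no observation lies near the mean that the bar reports.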

The Alternative: Strip + Box + Point

Instead of a dynamite plot, show the raw data with a summary overlay:

fig, ax = plt.subplots(figsize=(8, 5))
sns.stripplot(data=tips, x="day", y="total_bill", alpha=0.4, color="gray", size=4, ax=ax)
sns.boxplot(data=tips, x="day", y="total_bill", showfliers=False, boxprops=dict(alpha=0.3), ax=ax)
sns.pointplot(data=tips, x="day", y="total_bill", color="#d62728", estimator="mean",
              markers="D", errorbar=("ci", 95), ax=ax)

This produces a three-layer chart:

1. Strip plot showing every individual observation (the raw data).

2. Box plot (with transparent fill) showing the median, quartiles, and whiskers.

3. Point plot showing the mean with a 95% CI as a diamond marker with an error bar.

The reader sees everything: the individual points, the distribution summary, and the mean estimate. No information is hidden. The mean is one piece of the story, not the whole story.

When Dynamite Plots Are OK

Dynamite plots are acceptable when:

The audience expects the convention. In some fields (especially biomedical publication), dynamite plots are the standard, and readers know how to interpret them in context.
Space is extremely limited. In a small supplementary figure where only the mean matters, a bar + error bar is the most compact representation.
You provide the raw data elsewhere. If the paper includes the raw data as a supplementary file or a separate figure, the dynamite plot in the main figure is a summary, not a hiding of information.

But these cases are exceptions. For most visualization work, showing the data is better than just showing the summary.

The Threshold Concept

Once you see the problem with dynamite plots, you cannot unsee it. You will find yourself critiquing every bar-with-error-bar chart you encounter. You will notice when a paper or report shows only the summary without the underlying distribution. You will prefer strip+box combinations in your own work. This is the mark of internalizing the principle: show the data, not just the summary.

18.8 Figure-Level catplot

For faceted categorical charts, use the figure-level sns.catplot:

g = sns.catplot(
    data=tips,
    x="day",
    y="total_bill",
    hue="smoker",
    col="sex",
    kind="box",
    height=4,
    aspect=1,
)

kind can be "strip", "swarm", "box", "violin", "boxen", "point", "bar", or "count". The figure-level function supports faceting via col and row, handles the legend automatically, and returns a FacetGrid.

For most categorical work, start with the axes-level functions to verify the chart looks right, then switch to catplot when you need faceting.

18.9 The Error Bar Controversy

Error bars on charts are ubiquitous, but they are not all the same. This short section clarifies what the different error bar options in seaborn actually mean and when each is appropriate.

The Four Common Error Bar Types

("ci", 95) — 95% confidence interval. Computed via bootstrap resampling. Represents the range within which the true population parameter is likely to fall. This is the seaborn default for lineplot and barplot. Appropriate for most scientific contexts.

("se", 1) — 1 standard error of the mean. SE = SD / sqrt(n). Shorter than a 95% CI by a factor of about 1.96. Appropriate when you want to show "how precisely we estimated the mean" rather than a confidence interval. Common in neuroscience and psychology.

("sd", 1) — 1 standard deviation. Represents the spread of the data, not the uncertainty about the mean. Much larger than SE or CI for large samples. Appropriate when you want to show "how variable the data is" rather than "how confident we are in the mean."

("pi", 95) — 95% percentile interval. The range between the 2.5th and 97.5th percentiles of the observations themselves. Like SD, it describes the spread of the data rather than the uncertainty about the mean, but it does not assume that the spread is symmetric. Useful for showing "where most observations fall."

Why It Matters

The difference between SE and SD is huge and commonly confused. Consider a dataset with SD = 10 and n = 100. The SE is 10 / √100 = 1. A chart using SE shows tiny error bars (width 2); a chart using SD shows 10× larger bars (width 20). Same data, drastically different visual impression.
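The arithmetic above can be reproduced in a few lines. The sample is synthetic, and the 95% interval uses the normal approximation (1.96 × SE) rather than the bootstrap that seaborn applies, so treat it as a rough illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
sample = rng.normal(loc=50, scale=10, size=n)  # synthetic data with SD near 10

sd = sample.std(ddof=1)      # spread of the observations
se = sd / np.sqrt(n)         # uncertainty about the mean
ci_half_width = 1.96 * se    # normal-approximation 95% CI half-width

print(f"SD bar:     +/- {sd:.2f}")
print(f"SE bar:     +/- {se:.2f}")
print(f"95% CI bar: +/- {ci_half_width:.2f}")
print(f"SD / SE ratio: {sd / se:.1f}")  # equals sqrt(n), which is 10 here
```

The ordering SE < 95% CI < SD holds for this sample size, which is why an unlabeled error bar can be misread in either direction.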

Papers sometimes label error bars "SEM" (standard error of the mean) without explaining what it means. Readers who assume the bars show SD will underestimate the variability of the data. Readers who assume they show a 95% CI will underestimate the uncertainty about the mean, because a 95% CI is roughly twice as wide. Both get the wrong impression.

The Practical Rule

Always specify the errorbar type explicitly and state it in the chart caption or subtitle:

sns.barplot(data=df, x="group", y="value", errorbar=("ci", 95))
# In the caption: "Bars show mean; error bars show 95% bootstrap CI."

Readers who know the convention can interpret the chart correctly. Readers who do not can look it up. The key is not to hide the convention.

When to Avoid Error Bars

For small samples or highly skewed data, error bars can mislead because they imply symmetric uncertainty when the actual distribution is not symmetric. In these cases, prefer showing the raw data (strip plot) or a violin plot that reveals the actual shape.

18.10 Best Practices Summary

Before moving on, here is a condensed summary of best practices for relational and categorical visualization in seaborn.

For relational visualization:

Use scatterplot for two continuous variables with optional hue/style/size.
Use lineplot for time series or other ordered continuous data, letting automatic aggregation handle multiple observations per x.
Use regplot for fitted regression lines with confidence bands. Check residuals with residplot.
Use lmplot for faceted regressions. Stick to linear (order=1) unless you have a specific reason for polynomial.
Use relplot for figure-level faceted relational charts with col and row.

For categorical visualization:

For every-point displays, use stripplot or swarmplot. Swarm is cleaner for small samples.
For distribution displays, use boxplot or violinplot. Violin adds shape; box adds the classic summary.
For summary displays, use pointplot or barplot — but show the raw data alongside when possible.
Avoid the dynamite plot (bar + error bar alone). Combine strip/swarm with box/violin to show both data and summary.
Use catplot for figure-level faceted categorical charts.

General rules:

Order categorical axes meaningfully (order=), not alphabetically.
Specify the estimator and error bar explicitly for any summary chart, so the reader knows what they are seeing.
For small samples (< 30 per group), show individual observations, not smooth distributional estimates.
Combine chart types when it adds information (strip + box, scatter + regression, violin + point).
When in doubt, start with the simplest chart that shows the question, and add complexity only when needed.

These are not rules to memorize rigidly — they are defaults to apply when you are unsure. Professional chart makers develop their own judgment over time, and the best charts often depart from defaults deliberately. The point is to have defaults that produce reasonable charts without requiring per-chart decision making for every choice.

18.10 Common Categorical Pitfalls

Beyond the dynamite plot, several common mistakes appear in categorical visualization. This section names them and their fixes.

Pitfall 1: Categorical Ordering by Alphabet

Symptom: Days of the week appear as "Fri, Sat, Sun, Thur" instead of "Thur, Fri, Sat, Sun."

Cause: For plain string columns, seaborn draws categories in the order they first appear in the data, which is often alphabetical after a sort or groupby and is rarely meaningful. Columns with a pandas categorical dtype use their category order instead.

Fix: Pass order explicitly: sns.boxplot(..., order=["Thur", "Fri", "Sat", "Sun"]). For ordinal categories (severity, month, day), always specify order.
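An alternative to passing order on every call is to declare the order once in the data, since seaborn's categorical functions follow the category order of a pandas categorical column. A minimal sketch with a made-up data frame:

```python
import pandas as pd

# Illustrative frame whose rows arrive in an arbitrary day order.
df = pd.DataFrame({
    "day": ["Sat", "Thur", "Sun", "Fri", "Sat"],
    "total_bill": [20.5, 11.0, 18.2, 15.3, 25.1],
})

day_order = ["Thur", "Fri", "Sat", "Sun"]
df["day"] = pd.Categorical(df["day"], categories=day_order, ordered=True)

# Sorting now follows the declared order, not the row order or the alphabet.
print(df.sort_values("day")["day"].tolist())
print(list(df["day"].cat.categories))
```

This keeps tables, groupby results, and charts consistent with one another, at the cost of one extra line of data preparation.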

Pitfall 2: Too Many Categories

Symptom: A bar chart has 50+ categories, and the labels are rotated and overlapping.

Cause: seaborn draws all categories by default regardless of count.

Fix: Use sns.barplot with orient="h" (horizontal) for long category labels. Or filter to top-N: df.nlargest(20, "value") before plotting. Or use catplot with col_wrap to facet.
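The top-N filter can be sketched as follows. The data frame and its category and value columns are hypothetical, with one summary value per category.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative data: 60 categories with one summary value each.
df = pd.DataFrame({
    "category": [f"cat_{i:02d}" for i in range(60)],
    "value": rng.gamma(shape=2.0, scale=10.0, size=60),
})

# Keep the 20 largest categories; nlargest returns them sorted descending.
top = df.nlargest(20, "value")
print(top.head())
```

Passing the filtered frame to sns.barplot(data=top, x="value", y="category") then gives horizontal bars that are already sorted by value.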

Pitfall 3: Skipping the Statistical Check

Symptom: You see a difference between groups on a chart and assume it is statistically significant.

Cause: Visual differences can be driven by noise, especially for small samples. A chart showing a difference is not proof of a statistical effect.

Fix: Always accompany the chart with a statistical test (t-test, Mann-Whitney, ANOVA, etc.) and report the p-value. The chart shows the visual difference; the test confirms whether it is larger than chance.
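As one possible check, here is a sketch of a two-sided permutation test for a difference in means, written with numpy only. It is swapped in for the tests named above because it needs no extra library; permutation_test is a hypothetical helper and the two groups are synthetic.

```python
import numpy as np

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Shuffle the group labels and recompute the difference in means.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        count += int(diff >= observed)
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
group_a = rng.normal(loc=20, scale=5, size=60)
group_b = rng.normal(loc=26, scale=5, size=60)  # mean shifted by 6 units

p_shifted = permutation_test(group_a, group_b)
print(f"p-value for the shifted groups: {p_shifted:.4f}")
```

The p-value estimates how often a difference at least this large appears when the group labels are shuffled at random, which is the question the chart alone cannot answer.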

Pitfall 4: Stacked Bar Chart for Comparison

Symptom: A stacked bar chart is used to compare the sizes of individual segments across categories, but the comparison is hard because segments do not share baselines.

Cause: Stacked bars work for showing totals and the bottom segment but poorly for comparing middle or top segments.

Fix: Use grouped bars (dodge=True) or small multiples instead.

Pitfall 5: Pie Chart in a Categorical Display

Symptom: A pie chart is used for categorical data with many slices.

Cause: Pie charts are a classic anti-pattern. Humans cannot read angles or areas as accurately as lengths.

Fix: Use a horizontal bar chart sorted by value. Same data, much more readable. seaborn does not even include a pie-chart function, which is a deliberate design choice.

Pitfall 6: Color as Encoding Without a Legend

Symptom: A categorical chart uses hue to color groups, but the output has no legend, leaving the reader unable to decode.

Cause: When axes-level functions are called multiple times, the legend can be lost. Or legend=False is set inadvertently.

Fix: Check that the legend appears. For multi-call patterns (box + strip overlay), use one call's legend and suppress the other's.

18.10 seaborn vs. matplotlib for the Same Chart

This section explicitly compares seaborn and matplotlib versions of typical relational and categorical charts, to make the productivity difference concrete.

Scatter with Regression

matplotlib:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

x = tips["total_bill"]
y = tips["tip"]

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)

# Fit regression
slope, intercept, _, _, _ = stats.linregress(x, y)
x_fit = np.linspace(x.min(), x.max(), 100)
y_fit = slope * x_fit + intercept

# Compute confidence band
n = len(x)
std_err = np.sqrt(np.sum((y - (slope*x + intercept))**2) / (n - 2))
ci = 1.96 * std_err * np.sqrt(1/n + (x_fit - x.mean())**2 / np.sum((x - x.mean())**2))

ax.plot(x_fit, y_fit, color="red")
ax.fill_between(x_fit, y_fit - ci, y_fit + ci, alpha=0.2, color="red")

seaborn:

sns.regplot(data=tips, x="total_bill", y="tip")

One line vs. about 15 lines. seaborn handles the regression fit, the confidence band computation, and the plotting.

Grouped Bar with Error Bars

matplotlib:

fig, ax = plt.subplots()
days = tips["day"].unique()
x_pos = np.arange(len(days))
width = 0.35

for i, smoker in enumerate(["Yes", "No"]):
    means = []
    errs = []
    for day in days:
        subset = tips[(tips["day"] == day) & (tips["smoker"] == smoker)]["total_bill"]
        means.append(subset.mean())
        errs.append(subset.std() / np.sqrt(len(subset)))
    ax.bar(x_pos + i * width - width/2, means, width, yerr=errs, label=smoker)

ax.set_xticks(x_pos)
ax.set_xticklabels(days)
ax.legend()

seaborn:

sns.barplot(data=tips, x="day", y="total_bill", hue="smoker")

Again, one line vs. many. The productivity difference is especially large for grouped categorical charts.

Small Multiple

matplotlib:

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
for ax, day in zip(axes, ["Thur", "Fri", "Sat", "Sun"]):
    subset = tips[tips["day"] == day]
    ax.scatter(subset["total_bill"], subset["tip"], alpha=0.6)
    ax.set_title(day)
    ax.set_xlabel("Total bill")
axes[0].set_ylabel("Tip")

seaborn:

sns.relplot(data=tips, x="total_bill", y="tip", col="day", kind="scatter", height=4)

Again, dramatically shorter. For any faceted analysis, seaborn is the productivity win.

When matplotlib Is Still Better

For highly customized publication figures where you need exact control over every element, matplotlib is often cleaner. seaborn's automation is a productivity win for standard charts but can get in the way of unusual requirements. For a single extremely polished chart with custom typography and layout, consider starting with matplotlib directly.

18.10 Residuals and Regression Diagnostics

When you fit a regression line, checking the residuals tells you whether the linear assumption is appropriate. seaborn provides sns.residplot for this purpose.

Basic Residual Plot

sns.residplot(data=tips, x="total_bill", y="tip")

This fits a linear regression internally, computes the residuals (observed minus predicted), and plots them against x. Ideally, the residuals should be scattered randomly around zero with no visible pattern. If there is a pattern — a curve, a funnel, a cluster — the linear model is wrong.

With LOWESS Smoother

sns.residplot(data=tips, x="total_bill", y="tip", lowess=True, line_kws={"color": "red"})

lowess=True adds a LOWESS (locally weighted) smooth over the residuals. If the smooth is approximately flat at zero, the linear fit is reasonable. If the smooth shows a curve, the linear fit is missing a non-linear component.

Interpreting Residual Plots

Flat random cloud: linear fit is appropriate.

Curved pattern: the relationship is non-linear. Try a polynomial fit or a different model.

Funnel shape (residuals get wider as x increases): heteroscedasticity. The variance of y increases with x, violating the linear model's constant-variance assumption. Log-transforming y or using weighted regression may help.

Clustering: the data has structure the linear model misses. Check for group effects or non-independent observations.

Residual plots are a basic diagnostic tool for any regression analysis. They are underused because most practitioners skip them when the fitted line "looks right," but a flat-looking fit can still hide patterns in the residuals.
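The curved-pattern case can be reproduced numerically. In this sketch the data are synthetic and deliberately quadratic, and np.polyfit stands in for the linear fit that residplot performs internally.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)  # truly curved relationship

# Fit a straight line anyway and compute residuals (observed minus predicted).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Residuals of a least-squares line always average to zero, so the mean is
# not informative; look for leftover structure instead.
curvature = np.corrcoef((x - x.mean()) ** 2, residuals)[0, 1]
print(f"mean residual: {residuals.mean():.3f}")
print(f"correlation of residuals with (x - mean)^2: {curvature:.2f}")
```

A strong correlation between the residuals and the squared distance from the mean of x is the numeric counterpart of the U-shaped pattern you would see in the residual plot.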

18.10 seaborn catplot Patterns

The figure-level sns.catplot is useful for faceted categorical comparisons. Here are the most common patterns.

sns.catplot(
    data=tips,
    x="day",
    y="total_bill",
    col="sex",
    kind="box",
    height=4,
)

One box plot per sex (two panels), each showing total bill by day. Useful for showing whether the day-of-week pattern differs between subgroups.

Grouped Comparison with Hue and Columns

sns.catplot(
    data=tips,
    x="day",
    y="total_bill",
    hue="smoker",
    col="sex",
    kind="box",
    height=4,
)

Each panel has grouped box plots (smoker vs. non-smoker) for each day, and there are two panels (one per sex). A single figure therefore shows four variables at once: day, smoker status, sex, and total bill.

Per-Panel Chart Type

kind can be "strip", "swarm", "box", "violin", "boxen", "point", "bar", or "count". You can switch chart types easily:

sns.catplot(data=tips, x="day", y="total_bill", col="sex", kind="strip", height=4)
sns.catplot(data=tips, x="day", y="total_bill", col="sex", kind="violin", height=4)
sns.catplot(data=tips, x="day", y="total_bill", col="sex", kind="bar", height=4)

For exploratory analysis, try several kind values to see which reveals the patterns you are interested in.

Order of Categories

By default, seaborn draws categories in the order they first appear in the data (for string columns) or in their category order (for columns with a pandas categorical dtype). To control the order explicitly, pass order:

sns.catplot(
    data=tips,
    x="day",
    y="total_bill",
    kind="box",
    order=["Thur", "Fri", "Sat", "Sun"],
)

Custom ordering is essential when the natural order is not alphabetical (days of week, months of year, severity levels).

18.11 Combining Relational and Categorical Analysis

Real analyses often combine relational and categorical questions. This section shows common patterns.

Pattern 1: Scatter with Categorical Overlay

You have continuous x and y variables and want to see how a categorical variable affects the relationship.

sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day", style="smoker")

Two continuous variables (scatter) with two categorical encodings (hue and style). This is the basic multi-variable scatter pattern.

Pattern 2: Faceted Regression

You want to see whether the relationship between two variables differs across groups.

sns.lmplot(
    data=tips,
    x="total_bill",
    y="tip",
    col="smoker",
    row="time",
    height=4,
)

Four panels (2 smoker statuses × 2 times of day), each with its own scatter and regression line. You can see whether the slope or intercept differs across groups.

Pattern 3: Group Comparison with Raw Data

You want to compare groups while showing every observation.

fig, ax = plt.subplots(figsize=(10, 6))
sns.stripplot(data=tips, x="day", y="total_bill", hue="smoker", dodge=True, alpha=0.6, ax=ax)
sns.boxplot(data=tips, x="day", y="total_bill", hue="smoker", dodge=True,
            showfliers=False, boxprops=dict(alpha=0.3), ax=ax)

Overlapping strip and box plots with hue encoding for a second categorical dimension. The reader sees every point, the distribution summary, and the group structure.

Pattern 4: Small-Multiple Time Series by Category

You have time-series data for multiple groups and want to see each group's trajectory separately.

sns.relplot(
    data=flights,
    x="month",
    y="passengers",
    col="year",
    col_wrap=4,
    kind="line",
    height=3,
    aspect=1.5,
)

One panel per year, showing monthly passenger counts within that year. Visible patterns: each year has a similar seasonal shape, but the overall level is increasing.

When to Combine

Combining relational and categorical analysis is the right approach when:

Your data has both continuous and categorical dimensions you care about.
Group differences matter for the relationship being studied.
You want to show subgroup effects without losing the big picture.

The key is to layer the visualizations deliberately. Start with the relational question (scatter or line), add categorical encoding (hue, style, or col), and add raw-data overlays (strip plots) where they clarify the picture.
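The layering order described above can be sketched on a small synthetic dataset. The column names (dose, response, group) and the generated values are hypothetical stand-ins, not part of the chapter's data; the point is only the order of the steps.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (hypothetical columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dose": rng.uniform(0, 10, 120),
    "group": rng.choice(["control", "treated"], 120),
})
df["response"] = (
    2 + 0.5 * df["dose"] + (df["group"] == "treated") * 1.5 + rng.normal(0, 1, 120)
)

fig, (ax_rel, ax_cat) = plt.subplots(1, 2, figsize=(11, 4))

# Steps 1 and 2: start with the relational question, then add categorical encoding
sns.scatterplot(data=df, x="dose", y="response", hue="group", ax=ax_rel)

# Step 3: compare the groups directly, with a raw-data overlay on the summary
sns.boxplot(data=df, x="group", y="response", showfliers=False,
            color="lightgray", ax=ax_cat)
sns.stripplot(data=df, x="group", y="response", color="black", alpha=0.5, ax=ax_cat)
```

Putting the relational and categorical views side by side keeps each panel simple while still answering both questions.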

18.10 A Quick Reference for the Categorical Family

Here is a compact reference for the categorical functions covered in this chapter.

Strip Plot

sns.stripplot(data=df, x="category", y="value", hue="group",
              jitter=True, size=5, alpha=0.7, dodge=True)

jitter: horizontal spread to prevent exact overlap. Default True.
dodge: separate hue groups horizontally. Default False.
size: marker size.

Swarm Plot

sns.swarmplot(data=df, x="category", y="value", hue="group",
              size=5, dodge=True)

Same as strip but uses an algorithm to avoid overlap without jitter. Works best for small to moderate samples. Warns at large sample sizes.

Box Plot

sns.boxplot(data=df, x="category", y="value", hue="group",
            showfliers=True, dodge=True, linewidth=1, palette="deep")

showfliers: whether to draw the outlier markers. Set to False when you are overlaying a strip or swarm plot.
dodge: separate hue groups within each category.

Violin Plot

sns.violinplot(data=df, x="category", y="value", hue="group",
               inner="quartile", split=False, cut=2, bw_method="scott")

inner: "quartile", "box", "point", "stick", or None.
split: for two-level hue, put one group on each side of the violin.
cut: extend the KDE beyond the data range (in bandwidths).

Bar Plot (Statistical)

sns.barplot(data=df, x="category", y="value", hue="group",
            estimator="mean", errorbar=("ci", 95), dodge=True)

estimator: aggregation function ("mean", "median", or any callable).
errorbar: error bar specification. Prefer showing the underlying data with a strip plot.

Point Plot

sns.pointplot(data=df, x="category", y="value", hue="group",
              estimator="mean", errorbar=("ci", 95), markers=["o", "s"])

markers: marker shapes for each hue group.
Draws connected points (useful for showing trends across ordered categories).

Count Plot

sns.countplot(data=df, x="category", hue="group", dodge=True)

Plots the frequency (count) of each category. No y parameter needed.
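To make the "no y parameter" point concrete, the sketch below checks that the bar heights of a count plot match what pandas value_counts reports. The tiny DataFrame is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up categorical column: "a" appears 3 times, "b" twice, "c" once
df = pd.DataFrame({"category": ["a", "b", "a", "c", "a", "b"]})

# The tabulation you would otherwise do by hand
counts = df["category"].value_counts()

fig, ax = plt.subplots()
sns.countplot(data=df, x="category", ax=ax)

# Each bar is a rectangle patch whose height is the category's frequency
bar_heights = sorted(p.get_height() for p in ax.patches)
```

If you already have precomputed counts, a bar plot of those counts is the more direct choice; countplot is for raw, one-row-per-observation data.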

18.10 The Dynamite Plot Alternative in Detail

The previous sections introduced the dynamite plot critique. This section walks through the specific alternatives and when to use each.

Alternative 1: Strip + Point

Simplest alternative. Show every point with the mean as a dark overlay.

fig, ax = plt.subplots(figsize=(8, 5))
sns.stripplot(data=tips, x="day", y="total_bill", alpha=0.5, size=5, ax=ax)
sns.pointplot(
    data=tips,
    x="day",
    y="total_bill",
    estimator="mean",
    color="black",
    markers="_",
    errorbar=("ci", 95),
    ax=ax,
)

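The strip plot shows every observation; the point plot overlays the mean as a dash with a 95% CI. By default pointplot also connects the group means with a line; if you want only the dashes, turn the line off (linestyle="none" in seaborn 0.13 and later, join=False in earlier versions). The reader sees every data point plus the summary.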

Alternative 2: Box + Strip

Shows the five-number summary plus every observation.

fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=tips, x="day", y="total_bill", showfliers=False, ax=ax)
sns.stripplot(data=tips, x="day", y="total_bill", color="black", alpha=0.3, size=4, ax=ax)

showfliers=False hides the box plot's outlier markers because the strip plot already shows them. The result is a box for each category with every point visible inside.

Alternative 3: Violin + Strip

Shows the distribution shape plus every observation.

fig, ax = plt.subplots(figsize=(8, 5))
sns.violinplot(data=tips, x="day", y="total_bill", inner=None, ax=ax)
sns.stripplot(data=tips, x="day", y="total_bill", color="black", alpha=0.4, size=4, ax=ax)

inner=None removes the miniature box plot that violinplot draws inside each violin by default, because the strip plot already shows the individual points. The violin provides the smooth shape; the strip provides the raw data.

Alternative 4: Box + Swarm

For smaller samples, swarm plots are cleaner than strip plots.

subset = tips.sample(80, random_state=0)  # sample once so both layers draw the same rows
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=subset, x="day", y="total_bill", showfliers=False, ax=ax)
sns.swarmplot(data=subset, x="day", y="total_bill", color="black", size=4, ax=ax)

The swarm algorithm spreads points without overlap, producing a compact but complete display. For datasets of 20-100 observations per category, swarm plots are the cleanest choice.

Choosing Among Alternatives

Strip + point: simplest, smallest footprint. Good for quick charts and presentations.
Box + strip: most informative for moderate samples. Classic scientific-figure style.
Violin + strip: most complete — shows shape, summary, and individual data. Good for serious analysis.
Box + swarm: cleanest for small samples (20-100 per category). Great for publication.

All four are better than a dynamite plot for the same data. The specific choice depends on the sample size, the audience, and the space available.

18.10 Using relplot for Faceted Relational Plots

As with the other function families, seaborn has a figure-level function for relational plots: sns.relplot. It wraps scatterplot and lineplot with faceting support.

Basic relplot

sns.relplot(
    data=tips,
    x="total_bill",
    y="tip",
    col="day",
    kind="scatter",
    height=3,
)

The kind parameter selects scatter or line. col (and optionally row) facet across panels. height and aspect control panel size. The return value is a FacetGrid that supports further customization.

Line relplot with Aggregation

sns.relplot(
    data=tips,
    x="size",
    y="total_bill",
    kind="line",
    hue="sex",
    col="time",
    height=4,
    aspect=1.3,
)

With kind="line", seaborn aggregates within each facet separately — each panel's line shows the mean and confidence band for that subset of the data. This is useful for comparing how a relationship differs across subgroups.

When relplot Is the Right Choice

Use relplot when:

You want faceting across multiple panels.
The whole figure is a single relational visualization.
You do not need fine-grained matplotlib layout control.

Use scatterplot or lineplot (axes-level) when:

You are integrating the chart into a manually-constructed matplotlib figure.
You need multiple chart types in one figure.
You want the most control over the matplotlib Axes.

18.10 Advanced Scatter Plot Techniques

Beyond basic scatters, seaborn supports several advanced techniques for specific scatter plot needs.

Jittering for Discrete Values

When x or y is discrete (like integer ages or rating scores), points stack exactly on top of each other and overlap completely. Adding jitter spreads them slightly:

sns.stripplot(data=df, x="category", y="score", jitter=0.15)

stripplot applies jitter by default. For scatter plots where one axis is categorical, stripplot is actually the right function — not scatterplot. Use scatter plots for continuous data on both axes.

Connecting Scatter Points

For data that represents a path or a trajectory, you can combine scatter with a connecting line:

sns.scatterplot(data=path, x="x", y="y")
sns.lineplot(data=path, x="x", y="y", estimator=None, sort=False)

The sort=False parameter on lineplot connects the points in their original order rather than sorting by x. This is useful for trajectories where the sequence matters (like a random walk or a GPS trail).

Marker Customization

sns.scatterplot(
    data=tips,
    x="total_bill",
    y="tip",
    hue="smoker",
    markers={"Yes": "o", "No": "s"},  # circles for Yes, squares for No
    style="smoker",
)

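The markers parameter accepts a dictionary mapping levels of the style variable to marker shapes, so it only takes effect when style is set. Here style and hue both use smoker, so each group gets its own color and its own shape. This is useful when you want specific shapes (e.g., one shape per treatment in an experiment) rather than seaborn's automatic choices.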

Relationship with matplotlib's Axes.scatter

Under the hood, sns.scatterplot calls matplotlib's Axes.scatter. Keyword arguments that seaborn does not handle itself are passed through to it, for example edgecolor="black" for dark marker borders or linewidth=0 to remove the borders (seaborn's own default is a thin white edge). The seaborn documentation lists the commonly used parameters; for unusual ones, pass them as keyword arguments.
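The pass-through can be checked directly by inspecting the matplotlib artist that seaborn creates. This is a minimal sketch on made-up data; the parameter values are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 1, 4, 3]})  # made-up points

fig, ax = plt.subplots()
# edgecolor and linewidth are matplotlib scatter arguments, forwarded by seaborn
sns.scatterplot(data=df, x="x", y="y", edgecolor="black", linewidth=0.5, ax=ax)

# The scatter is stored as a collection; its edge color is an RGBA array
edge = ax.collections[0].get_edgecolor()
```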

18.10 Relational Pitfalls

Relational charts have their own failure modes. This section catalogs the common ones.

Overplotting in Scatter Plots

Symptom: You have 50,000 data points in a scatter plot and the chart looks like a solid mass of ink.

Cause: Too many points overlap to distinguish.

Fix: Use transparency (alpha=0.1-0.3), smaller markers (s=10 or smaller), or switch to a hexbin (sns.jointplot(kind="hex")) or a 2D KDE contour plot. For very large datasets, consider Datashader (Chapter 28).

Misleading Regression Extrapolation

Symptom: A regression line extends past the data range, and the reader assumes the model applies everywhere.

Cause: With truncate=False, regplot extends the line to the axis limits, which can include regions with no data.

Fix: Keep truncate=True (the default in recent seaborn releases) so the line stops at the observed x range, or call ax.set_xlim to crop the view to the data range. Even a truncated line can pass through sparse regions inside that range, so check where the points actually are.
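To see what truncate=True does, the sketch below widens the x-axis beyond the data on purpose and then checks where the fitted line starts and stops. The data is synthetic and confined to x between 2 and 8.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic linear data that only covers part of the axis
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(2, 8, 60)})
df["y"] = 1.5 * df["x"] + rng.normal(0, 1, 60)

fig, ax = plt.subplots()
ax.set_xlim(0, 10)  # deliberately wider than the data
sns.regplot(data=df, x="x", y="y", truncate=True, ax=ax)

# x extent of the fitted line that was actually drawn
line_x = ax.lines[0].get_xdata()
```

With truncate=False the same call would draw the line to the axis limits instead of stopping at the data.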

Linear Fit on Non-Linear Data

Symptom: A straight regression line cuts through visibly curved data.

Cause: The relationship is not linear, and a linear model is the wrong tool.

Fix: Check residuals with sns.residplot(data=df, x="x", y="y"). If residuals show a pattern (e.g., a clear curve), the linear model is inappropriate. Try a polynomial fit (order=2 or higher) or a non-linear model.
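The residual check can also be done numerically. The sketch below uses NumPy only (it is not how residplot is implemented) on synthetic quadratic data, and compares the leftover error from a straight-line fit with that from an order-2 fit.

```python
import numpy as np

# Synthetic curved relationship: y grows with x squared, plus noise
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 200)
y = 0.5 * x**2 + rng.normal(0, 1, x.size)

# Residuals = observed minus fitted, for a line and for a quadratic
lin_resid = y - np.polyval(np.polyfit(x, y, 1), x)
quad_resid = y - np.polyval(np.polyfit(x, y, 2), x)

# Residual sum of squares: how much structure each model leaves unexplained
lin_rss = float(np.sum(lin_resid**2))
quad_rss = float(np.sum(quad_resid**2))
```

A large drop from lin_rss to quad_rss is the numeric counterpart of the curved pattern you would see in the residual plot of the linear fit.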

Automatic Aggregation Confusion

Symptom: sns.lineplot aggregates automatically, but the reader assumed it was showing individual observations.

Cause: Without explicit documentation, viewers do not know that the line represents the mean and the band represents a 95% CI.

Fix: Always label the chart clearly: "mean with 95% confidence interval" in the caption or subtitle. For transparency, consider showing the raw observations (for example, a faint scatter layer) alongside the aggregated line.

Regression Line Without Uncertainty

Symptom: A scatter plot with a single fitted line and no indication of the fit's uncertainty.

Cause: sns.regplot draws the confidence band by default, but some tutorials disable it with ci=None.

Fix: Leave the confidence band visible. The band shows how much the fitted line could wiggle under different samples of the same size. Hiding it implies more precision than the fit actually has.
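The "wiggle under different samples" idea can be sketched with a simple bootstrap in NumPy. This is an illustration of the concept on synthetic data, not a reproduction of seaborn's internal code: resample the rows with replacement, refit the line each time, and take percentiles of the fitted values.

```python
import numpy as np

# Synthetic noisy linear data
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 1.0 + 0.8 * x + rng.normal(0, 2, 50)
grid = np.linspace(x.min(), x.max(), 25)  # x positions where the band is evaluated

boot_lines = []
for _ in range(500):
    idx = rng.integers(0, x.size, x.size)             # resample rows with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], 1)  # refit on the resample
    boot_lines.append(intercept + slope * grid)
boot_lines = np.array(boot_lines)

# 95% band: the middle 95% of refitted lines at each grid position
lower, upper = np.percentile(boot_lines, [2.5, 97.5], axis=0)
```

Plotting lower and upper with ax.fill_between would give a band comparable in spirit to the shaded region regplot draws.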

18.10 Categorical Plot Decision Matrix

With seven categorical chart types covered in this chapter, which do you use when? Here is a decision guide.

By Sample Size

Tiny (< 10 points per group): strip plot or scatter. Never use KDE-based charts (violin).

Small (10-50 points per group): strip plot with pointplot overlay, or swarm plot. Box plots work but hide individual observations.

Medium (50-500 points per group): strip or swarm with box overlay, or violin plot. You have enough data for distributional estimates.

Large (500+ points per group): box plot, violin plot, or histogram / KDE. Strip plots become cluttered; use transparency.
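The sample-size guide above can be written as a small lookup. The function name is hypothetical, and because the ranges in the text share their endpoints (50 and 500), the sketch assumes each boundary value belongs to the larger bracket.

```python
def suggest_by_sample_size(n_per_group: int) -> str:
    """Return the chart suggestion from the sample-size guide (hypothetical helper)."""
    if n_per_group < 10:
        # Tiny: too few points for a KDE-based chart such as a violin
        return "strip plot or scatter"
    if n_per_group < 50:
        # Small: show every point, optionally with a summary overlay
        return "strip plot with pointplot overlay, or swarm plot"
    if n_per_group < 500:
        # Medium: enough data for distributional estimates
        return "strip or swarm with box overlay, or violin plot"
    # Large: individual points clutter, so lean on summaries
    return "box plot, violin plot, or histogram / KDE"


suggestion = suggest_by_sample_size(30)
```

With a DataFrame, the smallest group size from df.groupby("category").size().min() is a reasonable input, since the smallest group is the one most at risk of a misleading density estimate.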

By Question

"What are the individual values?" — strip plot or swarm plot.

"What is the distribution shape?" — violin plot.

"What are the five-number summary statistics?" — box plot.

"What is the mean (or median) for each group?" — point plot or bar plot (with awareness of the dynamite plot critique).

"How does the count compare across categories?" — count plot.

"How does a summary statistic change across groups and subgroups?" — point plot with hue (parallel lines showing mean by subgroup).

By Number of Groups

2-5 groups: any categorical chart type. Pick based on the question.

6-15 groups: box plot, violin plot, or strip plot. Bar plots become cluttered.

16-50 groups: horizontal categorical chart (swap x and y). Or small multiples (catplot with col_wrap).

50+ groups: consider whether a different chart type is better. Heatmap for matrix-like data; ridge plot for many distributions.
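For the 16-50 group range, swapping x and y is a one-line change. The sketch below uses 20 made-up groups (the station names and values are hypothetical) and puts the categorical variable on the y-axis so the labels read horizontally.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# 20 hypothetical groups with 30 synthetic observations each
rng = np.random.default_rng(4)
stations = [f"station_{i:02d}" for i in range(20)]
df = pd.DataFrame({
    "station": np.repeat(stations, 30),
    "value": rng.normal(loc=np.repeat(np.arange(20), 30), scale=2.0),
})

# Tall figure, numeric variable on x, categorical variable on y
fig, ax = plt.subplots(figsize=(6, 8))
sns.boxplot(data=df, x="value", y="station", ax=ax)
```

seaborn infers the horizontal orientation from the variable types, so no extra orientation argument is needed here.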

18.11 The Climate Relational and Categorical Story

For the progressive project, here are relational and categorical views of the climate data.

Relational: CO2 vs. Temperature Regression

climate_merged = pd.merge(climate, co2, on="year")  # combine the datasets
sns.regplot(data=climate_merged, x="co2_ppm", y="anomaly")  # same anomaly column as the plots below

This produces a scatter of CO2 vs. temperature with a linear regression overlay. The result shows the strong positive relationship the reader would expect.

Categorical: Anomalies by Decade with Strip + Box

climate["decade"] = (climate["year"] // 10 * 10).astype(int)

fig, ax = plt.subplots(figsize=(14, 5))
sns.stripplot(data=climate, x="decade", y="anomaly", alpha=0.4, color="gray", size=4, ax=ax)
sns.boxplot(data=climate, x="decade", y="anomaly", showfliers=False,
            boxprops=dict(alpha=0.3), ax=ax)

This shows every annual anomaly as a strip-plot point plus a transparent box plot overlay showing the decade-level summary. The reader sees both the individual years and the distribution shift across decades.
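The decade column relies on integer floor division. A quick check on a few made-up rows (the anomaly values here are placeholders, not real measurements) shows how years map to decades.

```python
import pandas as pd

# Placeholder rows; only the year values matter for this check
climate = pd.DataFrame({
    "year": [1979, 1980, 1989, 1990, 2005],
    "anomaly": [0.1, 0.2, 0.3, 0.4, 0.6],
})

# Floor-divide by 10 to drop the last digit, then multiply back to get the decade start
climate["decade"] = (climate["year"] // 10 * 10).astype(int)

decades = climate["decade"].tolist()
```

Because the result is an integer, seaborn treats each decade as one category on the x-axis and orders them numerically.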

Relational: Time Series with Decade Coloring

sns.lineplot(data=climate, x="year", y="anomaly",
             hue="decade", palette="viridis", legend=False)

A line chart of temperature over time, colored by decade with a continuous progression through viridis. Because hue splits the data into one line per decade, the segments do not join across decade boundaries. The temporal flow is visible both in the line shape and in the color gradient.

Chapter Summary

This chapter covered seaborn's relational and categorical visualization families.

Relational functions (sns.scatterplot, sns.lineplot, sns.relplot, sns.regplot, sns.lmplot) visualize the relationship between two continuous variables. Key features: automatic multi-variable encoding via hue, style, and size; automatic aggregation and confidence bands for lineplot; regression overlays with polynomial and logistic options in regplot and lmplot.

Categorical functions fall into three subfamilies:

Showing every point: stripplot, swarmplot.
Showing distributions: boxplot, violinplot.
Showing summaries: barplot, pointplot, countplot.

The figure-level sns.catplot wraps all of them with faceting support.

The dynamite plot critique is the chapter's threshold concept. A bar chart with error bars hides the actual distribution behind a summary. The alternative — a strip plot with a box or violin overlay — shows every data point plus the summary, giving the reader a more honest and complete view. Once you internalize this, you will favor strip+box combinations over plain bar charts for most group comparisons.

Next in Chapter 19: multi-variable exploration with pairplot, jointplot, heatmap, and clustermap. These tools apply the concepts from Chapters 16-18 to datasets with more than two variables, where the challenge is showing all pairwise relationships or the full structure of the data.

Spaced Review: Concepts from Chapters 1-17

Chapter 4: The ethics chapter argued against hiding uncertainty. How does the dynamite plot hide uncertainty, and how does the strip+box alternative restore it?
Chapter 5: The chart selection matrix maps questions to chart types. Which seaborn functions correspond to "comparison" questions? "Relationship" questions?
Chapter 11: matplotlib has ax.bar. seaborn has sns.barplot. How are they different? When would you use each?
Chapter 16: Figure-level and axes-level functions. Which relational and categorical functions are figure-level, and which are axes-level?
Chapter 17: Violin plots and box plots appear in both the distributional chapter and this chapter. Why? Which family are they really part of?
Chapter 17: Small samples deserve strip plots rather than violin plots. Does the same advice apply to categorical comparisons in this chapter?
