Exercises: Multi-Variable Exploration

DataField.Dev

These exercises use seaborn's built-in datasets (penguins, iris, mpg) and a synthetic climate dataset. All exercises assume import seaborn as sns, import matplotlib.pyplot as plt, import pandas as pd, and import numpy as np.

Part A: Conceptual (6 problems)

A.1 ★☆☆ | Recall

List the four main multi-variable exploration tools covered in this chapter. For each, describe what the tool shows and when to use it.

Guidance

**Pair plot** (`sns.pairplot`): grid of pairwise scatter plots plus diagonals. Use for an overview of all pairwise relationships in a small dataset (≤8 variables). **Joint plot** (`sns.jointplot`): bivariate view plus marginal distributions. Use to zoom in on one important pair. **Heatmap** (`sns.heatmap`): colored matrix of a summary statistic (usually correlation). Use for compact overview of many variables. **Cluster map** (`sns.clustermap`): heatmap with hierarchical clustering reordering. Use to see grouping structure.

A.2 ★☆☆ | Recall

What is the difference between sns.pairplot and PairGrid? When should you use each?

Guidance

`sns.pairplot` is a convenience wrapper: it builds a `PairGrid` for you and applies one plot kind to every off-diagonal panel (`kind`) and one to the diagonal (`diag_kind`). `PairGrid` is the lower-level class that lets you apply different functions to the diagonal, upper triangle, and lower triangle. Use `pairplot` when the defaults work; use `PairGrid` when you need asymmetric panels (e.g., scatter below and KDE contours above) or custom functions.

A.3 ★★☆ | Understand

Explain why correlation heatmaps should use diverging colormaps centered at zero, while count heatmaps should use sequential colormaps.

Guidance

Correlation has a meaningful midpoint: zero means no linear association, positive means one direction of association, negative means the other. A diverging colormap uses two hues to distinguish the signs and a neutral color at the midpoint. Counts have no meaningful midpoint: they start at zero and only increase, so there is no central value to diverge from, and a sequential colormap is more honest. Using a diverging colormap for counts would imply a false midpoint; using a sequential colormap for correlation would hide the sign of the relationship.
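
The midpoint argument can be checked numerically. The sketch below samples `coolwarm` and `viridis` at the positions that r = -1, 0, and +1 occupy when `vmin=-1` and `vmax=1`. The helper names `sample` and `channel_spread` are illustrative, not part of matplotlib, and the lookup assumes matplotlib 3.5 or later for `matplotlib.colormaps`.

```python
import matplotlib

def sample(cmap_name):
    """Return the RGB color a colormap assigns to r = -1, 0, +1 (vmin=-1, vmax=1)."""
    cmap = matplotlib.colormaps[cmap_name]
    # (r + 1) / 2 rescales the correlation range [-1, 1] onto the colormap's [0, 1].
    return {r: cmap((r + 1) / 2)[:3] for r in (-1, 0, 1)}

def channel_spread(rgb):
    """Difference between the largest and smallest channel; near 0 means neutral gray."""
    return max(rgb) - min(rgb)

coolwarm = sample("coolwarm")
viridis = sample("viridis")

for name, colors in (("coolwarm", coolwarm), ("viridis", viridis)):
    rounded = {r: tuple(round(c, 2) for c in rgb) for r, rgb in colors.items()}
    print(name, rounded)

print("coolwarm spread at r=0:", round(channel_spread(coolwarm[0]), 3))
print("viridis spread at r=0:", round(channel_spread(viridis[0]), 3))
```

In `coolwarm` the color at r = 0 is close to neutral gray, with blue and red on either side. In `viridis` the color at r = 0 is a saturated teal that carries no "zero" meaning.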

A.4 ★★☆ | Understand

What is Shneiderman's information-seeking mantra, and how does it map onto the four tools in this chapter?

Guidance

The mantra is "overview first, zoom and filter, details on demand." **Overview**: the heatmap (compact) and pair plot (richer). **Zoom and filter**: the joint plot (focus on one pair) and the `vars` parameter of pair plot (focus on a subset). **Details on demand**: back to the DataFrame — use the chart to identify suspicious observations, then inspect the underlying rows. The cluster map straddles overview and filter: it gives an overview *and* suggests natural groupings to filter on.

A.5 ★★★ | Analyze

Why does hierarchical clustering always produce a tree, even when the data has no real cluster structure? What visual check tells you whether the tree is meaningful?

Guidance

Hierarchical clustering is a deterministic algorithm: it merges the two closest clusters at each step, regardless of whether the merge "makes sense." It will produce a tree even on random noise. The visual check is the heatmap itself — if clustering reorders the rows and produces a clean block-diagonal pattern of strong within-block correlations and weak between-block correlations, the tree reflects real structure. If the reordered heatmap still looks random, the tree is an artifact.
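
To see that a tree always comes out, here is a minimal sketch that runs SciPy's `linkage` on pure noise. The data is synthetic and the array shape is arbitrary. For n rows the result always contains n - 1 merges, whether or not the rows have any structure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
noise = rng.normal(size=(20, 5))   # 20 rows of pure noise, no cluster structure

Z = linkage(noise, method="ward")

# Each row of Z records one merge: (cluster_i, cluster_j, merge_distance, new_cluster_size).
n_merges = Z.shape[0]
final_cluster_size = int(Z[-1, 3])
print(n_merges, final_cluster_size)
```

The last merge contains all 20 rows, so the dendrogram is complete even though the input is random. That is why the reordered heatmap, not the tree itself, is the evidence to inspect.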

A.6 ★★★ | Evaluate

A researcher uses a correlation heatmap to claim that variable A causes variable B because their correlation is 0.92. What three objections would you raise?

Guidance

(1) Correlation is not causation: a third variable could drive both. (2) The sample size behind the correlation matters: pandas computes each correlation from only the rows where both variables are present, and a small sample can produce a large correlation by chance. (3) A heatmap cell hides the shape and direction of the relationship: Pearson correlation measures linear association only and is symmetric, so 0.92 cannot say whether A drives B or B drives A. Verify with a scatter plot and consider confounders, time-ordering, and alternative explanations before making any causal claim.
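
The sample-size objection can be made concrete. The sketch below uses a small hand-built DataFrame (the values are invented for illustration) to count how many rows sit behind each cell of a correlation matrix, and shows how the `min_periods` argument of `DataFrame.corr` blanks out cells with too few shared rows.

```python
import numpy as np
import pandas as pd

# Invented data: A is observed in only three rows.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, np.nan, np.nan, np.nan],
    "B": [2.0, 4.1, 6.2, 1.0, np.nan, 5.0],
    "C": [5.0, 3.0, 4.0, 1.0, 2.0, 6.0],
})

# Cell (i, j) counts the rows where both column i and column j are present.
present = df.notna().astype(int)
pairwise_n = present.T @ present

corr = df.corr()                       # uses pairwise-complete rows by default
corr_strict = df.corr(min_periods=4)   # NaN wherever fewer than 4 shared rows exist

print(pairwise_n)
print(corr.round(2))
print(corr_strict.round(2))
```

Here the A-B correlation is essentially perfect but rests on three rows, which is exactly the kind of cell that deserves suspicion before any causal reading.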

Part B: Applied (10 problems)

B.1 ★☆☆ | Apply

Load the penguins dataset and produce a default pair plot with sns.pairplot. How many panels does the plot have?

Guidance

penguins = sns.load_dataset("penguins")
sns.pairplot(data=penguins)

The penguins dataset has four numeric columns, so the pair plot has a 4×4 grid = 16 panels.

B.2 ★☆☆ | Apply

Extend B.1 by adding hue="species". Describe what changes in the diagonal and off-diagonal panels.

Guidance

sns.pairplot(data=penguins, hue="species")

With `hue`, the diagonal histograms become KDE curves (one per species). The off-diagonal scatter plots color each point by species. Clusters that were invisible in the hueless version become obvious: bill length alone does not separate Chinstrap from Gentoo, but bill length plus flipper length does.

B.3 ★★☆ | Apply

Create a pair plot using PairGrid with scatter plots on the lower triangle, KDE contours on the upper triangle, and histograms on the diagonal.

Guidance

g = sns.PairGrid(data=penguins, hue="species")
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g.map_upper(sns.kdeplot)
g.add_legend()

The result is asymmetric: the lower triangle shows raw data, the upper triangle shows smoothed density, and the diagonal shows marginal distributions.

B.4 ★★☆ | Apply

Create a joint plot of bill_length_mm vs. bill_depth_mm with kind="hex". Why might the hex version be preferable to the default scatter?

Guidance

sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", kind="hex")

The hex version bins points into hexagonal cells and colors by count. For large datasets, this avoids the overplotting that a scatter plot suffers. The penguins dataset is small enough that both work, but on a dataset with tens of thousands of points, the hex version would reveal density structure that the scatter would obscure.

B.5 ★★☆ | Apply

Create a joint plot of bill_length_mm vs. bill_depth_mm with hue="species" and kind="scatter". What do the marginal panels look like?

Guidance

sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="scatter")

The center panel is a colored scatter. The marginal panels become layered KDEs — one per species — rather than plain histograms. Seaborn switches to KDEs automatically when `hue` is set because layered histograms can be hard to read.

B.6 ★★☆ | Apply

Compute a correlation matrix on the four numeric columns of penguins and display it with sns.heatmap. Use a diverging colormap, set vmin=-1, vmax=1, center=0, and annot=True.

Guidance

numeric_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
corr = penguins[numeric_cols].corr()
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, center=0, annot=True, fmt=".2f")

The heatmap shows that bill length and flipper length are positively correlated (about 0.66), while bill depth and flipper length are negatively correlated (about -0.58). The strongest cell is flipper length vs. body mass (about 0.87).

B.7 ★★★ | Apply

Mask the upper triangle of the heatmap from B.6 using np.triu. Show only the lower triangle plus the diagonal.

Guidance

mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
sns.heatmap(corr, mask=mask, cmap="coolwarm", vmin=-1, vmax=1, center=0, annot=True, fmt=".2f")

The `k=1` argument excludes the diagonal from the mask. The result is a lower-triangular heatmap with the diagonal visible.
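
The effect of `k` is easiest to see on a tiny matrix. This sketch uses an invented 3×3 correlation matrix (the name `corr_demo` is hypothetical) and prints the mask for `k=1` and `k=0`. In `sns.heatmap`, cells where the mask is `True` are hidden.

```python
import numpy as np

# Invented 3x3 correlation matrix, for illustration only.
corr_demo = np.array([[1.0, 0.3, -0.2],
                      [0.3, 1.0, 0.5],
                      [-0.2, 0.5, 1.0]])

# k=1 starts one step above the diagonal, so the diagonal is NOT masked.
mask_k1 = np.triu(np.ones_like(corr_demo, dtype=bool), k=1)

# k=0 starts on the diagonal, so the diagonal IS masked as well.
mask_k0 = np.triu(np.ones_like(corr_demo, dtype=bool), k=0)

print(mask_k1)
print(mask_k0)
```

With `k=1` three cells are hidden and the diagonal of ones stays visible. With `k=0` six cells are hidden, which is the variant to use if the diagonal is considered redundant.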

B.8 ★★★ | Apply

Produce a cluster map of the penguins correlation matrix using Ward linkage. Describe what the dendrograms tell you.

Guidance

sns.clustermap(corr, cmap="coolwarm", vmin=-1, vmax=1, center=0, annot=True, fmt=".2f", method="ward")

The dendrograms on the top and left show which variables cluster together. Bill length, flipper length, and body mass form one tight cluster (they all correlate positively). Bill depth sits separately (it correlates negatively with the others). The reordered heatmap shows this structure visually: a block of red cells for the positively correlated cluster and an isolated blue-tinted row for bill depth.

B.9 ★★☆ | Apply

Load the iris dataset with sns.load_dataset("iris"). Create a pair plot with kind="reg" and hue="species". What does the regression fit add to each panel?

Guidance

iris = sns.load_dataset("iris")
sns.pairplot(data=iris, kind="reg", hue="species")

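With `hue` set, seaborn draws `sns.regplot` once per hue level, so each off-diagonal panel gets a separate regression line and confidence band for each species. This shows whether the within-species slopes agree with each other. To compare against a single pooled fit, drop `hue`:

sns.pairplot(data=iris, kind="reg")

This fits one regression line per panel to all data pooled together. In iris, the pooled line and the per-species lines can differ noticeably (for example, sepal width vs. sepal length), which is why it is worth looking at both.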

B.10 ★★★ | Create

Using the mpg dataset (sns.load_dataset("mpg")), identify the top three correlated pairs among the numeric columns via a heatmap, then zoom in on the strongest pair with a joint plot.

Guidance

mpg = sns.load_dataset("mpg")
numeric_cols = mpg.select_dtypes(include="number").columns
corr = mpg[numeric_cols].corr()
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, center=0, annot=True, fmt=".2f")
plt.show()

# After inspection: cylinders vs. displacement has the highest correlation (about 0.95),
# but cylinders is discrete. Among continuous pairs, the strongest is weight vs. displacement.
sns.jointplot(data=mpg, x="weight", y="displacement", kind="reg")

Heavier cars have larger engines (displacement): the correlation is about 0.93. The joint plot confirms the relationship is nearly linear with a modest spread around the regression line.
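
Reading the top pairs off a heatmap by eye gets error-prone as the matrix grows. One way to rank them programmatically is sketched below. It uses a small synthetic DataFrame (the column names `x1` to `x4` are hypothetical) so that it runs without downloading a dataset; the same three lines after `corr = ...` should apply to the mpg correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
base = rng.normal(size=n)

# Synthetic stand-in data with a known correlation structure.
df = pd.DataFrame({
    "x1": base,
    "x2": base + 0.1 * rng.normal(size=n),    # nearly a copy of x1
    "x3": -base + 0.5 * rng.normal(size=n),   # negatively related, noisier
    "x4": rng.normal(size=n),                 # unrelated to the others
})

corr = df.corr()

# Keep only the cells strictly above the diagonal so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank by absolute value so strong negative correlations are not missed.
top3 = pairs.sort_values(key=np.abs, ascending=False).head(3)
print(top3)
```

Sorting on the absolute value matters: a pair with r = -0.9 is as strong as one with r = +0.9, and a plain descending sort would push it to the bottom.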

Part C: Synthesis (4 problems)

C.1 ★★★ | Analyze

Take the climate dataset used throughout this textbook (year, temperature_anomaly, co2_ppm, sea_level_mm, era). Build a pair plot colored by era. Describe what the diagonal and off-diagonal panels reveal about the three eras.

Guidance

The diagonals show each variable's distribution per era: temperature and CO2 both shift dramatically from pre-industrial (near zero / ~285 ppm) to modern (+1°C / ~410 ppm). The off-diagonals show that all four variables move together — the three eras form distinct clouds in every pairwise scatter, with the industrial era bridging the gap. The pair plot reveals multi-variable cluster separation that a single scatter would only partially show.

C.2 ★★★ | Evaluate

Suppose a colleague sends you a correlation heatmap of 50 variables with cmap="viridis". What do you tell them to change, and why?

Guidance

Viridis is a sequential colormap designed for quantities without a meaningful midpoint. For correlation, you need a diverging colormap (coolwarm, RdBu_r, vlag) centered at zero, with `vmin=-1` and `vmax=1`. The current choice hides the sign of the correlations: viridis runs from dark purple through teal to yellow, so zero lands on an arbitrary mid-range teal, and a strong negative correlation reads as merely "low" rather than "opposite direction." Additionally, 50 variables is too many for readable annotation. Suggest omitting `annot=True` and relying on colors alone, or producing a cluster map to bring similar variables together.

C.3 ★★★ | Create

Build a cluster map on the mpg dataset correlation matrix. Compare the variable ordering to the original column order. What groups does the clustering reveal?

Guidance

mpg = sns.load_dataset("mpg")
numeric_cols = mpg.select_dtypes(include="number").columns
corr = mpg[numeric_cols].corr()
sns.clustermap(corr, cmap="coolwarm", center=0, vmin=-1, vmax=1, annot=True, fmt=".2f", method="ward")

The clustering typically groups "engine size" variables (cylinders, displacement, horsepower, weight) as one cluster and "efficiency/performance" variables (mpg, acceleration) as another. `model_year` often sits separately because it correlates less strongly with the engine-size variables. The block structure in the reordered heatmap confirms this grouping.
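
To compare the clustered ordering with the original column order without squinting at tick labels, the row order can be read from the object that `sns.clustermap` returns. The sketch below uses a synthetic four-column DataFrame (all column names are hypothetical) instead of mpg so that it runs offline, and selects the non-interactive `Agg` backend as an assumption about running in a script.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 300
size = rng.normal(size=n)  # hypothetical latent "engine size" factor

# Original column order deliberately separates the two related columns.
df = pd.DataFrame({
    "a_big": size + 0.2 * rng.normal(size=n),
    "efficiency": -size + 0.2 * rng.normal(size=n),
    "b_big": size + 0.2 * rng.normal(size=n),
    "unrelated": rng.normal(size=n),
})
corr = df.corr()

g = sns.clustermap(corr, cmap="coolwarm", center=0, vmin=-1, vmax=1, method="ward")

# reordered_ind holds the original row positions in their clustered order.
order = g.dendrogram_row.reordered_ind
reordered = [corr.index[i] for i in order]

print("original: ", list(corr.index))
print("clustered:", reordered)
```

In this synthetic case the two columns built from the same factor end up next to each other even though they were not adjacent in the original order. On mpg, the same two lines after the `clustermap` call list the grouping the dendrogram found.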

C.4 ★★★ | Evaluate

The chapter argues that pair plots do not scale beyond about eight variables. Devise a workflow for exploring a 30-variable dataset using the tools in this chapter.

Guidance

Step 1: Compute the full correlation matrix and display as a heatmap (no annotation; 30 labels is already crowded). Step 2: Identify groups of strongly correlated variables, either by eye or via `clustermap`. Step 3: Pick one representative variable from each group (the one most meaningful to the analysis). Step 4: Produce a pair plot of the ~6-8 representatives. Step 5: For the most interesting pair found in step 4, produce a joint plot with `kind="reg"` or `kind="hex"`. The heatmap and cluster map do the initial filtering; the pair plot and joint plot do the deep inspection.
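
Steps 1 to 3 can be sketched in code. The example below is one possible implementation under stated assumptions, not the chapter's prescribed method: the 30-variable dataset is synthetic (three hidden factors with ten noisy copies each), the distance is taken as 1 - |r|, the tree is cut at a distance of 0.5, and the "representative" is simply the first column in each group. In a real analysis the representative should be the variable most meaningful to the question.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)
n_rows, n_groups, per_group = 500, 3, 10

# Synthetic stand-in for a 30-variable dataset: 3 hidden factors, 10 noisy copies each.
factors = rng.normal(size=(n_rows, n_groups))
df = pd.DataFrame({
    f"g{g}_v{j}": factors[:, g] + 0.3 * rng.normal(size=n_rows)
    for g in range(n_groups)
    for j in range(per_group)
})

# Step 1: full correlation matrix.
corr = df.corr()

# Step 2: group variables. Strong |r| becomes a small distance.
dist = 1 - corr.abs().to_numpy()
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")  # cut threshold is an assumption

# Step 3 (placeholder rule): first column of each group stands in for the group.
groups = pd.Series(labels, index=corr.columns)
representatives = [groups[groups == k].index[0] for k in sorted(groups.unique())]

print(groups.value_counts().sort_index())
print(representatives)
```

Step 4 would then be `sns.pairplot(df[representatives])`, followed by a joint plot of whichever pair looks most interesting.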

These exercises span the full multi-variable exploration workflow — from compact overview (heatmap) to detailed zoom (joint plot). Chapter 20 introduces Plotly Express, where many of these chart types reappear with interactive features layered on top.