Case Study 1: The Datasaurus Dozen Revisited with ECDFs

Chapter 11 Case Study 1 introduced the Datasaurus Dozen — 13 datasets with identical summary statistics but wildly different 2D shapes. This case study takes the same data and examines the marginal (1D) distributions using ECDFs. The result illustrates how distributional charts reveal things that scatter plots alone cannot.

The Situation

The Datasaurus Dozen (Matejka and Fitzmaurice, 2017) showed that 13 datasets can share the same summary statistics (mean, standard deviation, and correlation, matched to two decimal places) but look completely different as 2D scatter plots. One dataset is shaped like a dinosaur; another is an X shape; another is a bullseye. It is one of the most vivid demonstrations of "statistics alone are lossy" in modern data visualization.

Chapter 11 of this textbook reproduced the Datasaurus Dozen as a 4×4 grid of small scatter plots, demonstrating the principle that visualization reveals what summaries hide. That chapter used matplotlib's ax.scatter and focused on the 2D shape comparison.

This case study takes the same dataset and asks a different question: what do the marginal distributions look like? Each dataset has 142 (x, y) pairs. If we ignore the 2D structure and just look at the 1D distribution of x values (or y values), what do we see? Do the same patterns hold? Are the distributions similar or different?

The answer is instructive. The 2D scatter plots are dramatically different, but the 1D marginal distributions are nearly identical. The mean and standard deviation constrain the marginals, and within those constraints, the 13 datasets end up with very similar 1D shapes. The differences are all in the 2D structure: how the points are clustered and arranged relative to one another, given that the linear correlation is held fixed as well.

This is a useful lesson about what different chart types reveal. A scatter plot shows 2D structure. A histogram or KDE shows 1D structure. They are not substitutes for each other — they answer different questions. Looking at just one or just the other gives you an incomplete picture.

The Data

The Datasaurus Dozen data can be loaded from several sources. For this case study, assume you have it as a pandas DataFrame with columns dataset, x, and y:

import pandas as pd

# Load from the datasaurus package or a CSV
# pip install datasaurus
from datasaurus import datasaurus_dozen
df = datasaurus_dozen()

# Or from a CSV
# df = pd.read_csv("datasaurus.csv")

The data has 13 datasets × 142 points = 1,846 rows. For this case study, we will focus on the x values, although the analysis of y values would produce similar results.
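
Before plotting, it can be worth confirming the premise that the groups share their summary statistics. The following is a minimal sketch, assuming only the dataset, x, and y columns described above. The helper name summarize_groups and the tiny stand-in DataFrame are illustrative and are not part of the original data.

```python
import pandas as pd


def summarize_groups(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-dataset mean, standard deviation, and x-y correlation."""
    grouped = frame.groupby("dataset")[["x", "y"]]
    summary = grouped.agg(["mean", "std"])
    # Flatten the (column, statistic) MultiIndex into names like "x_mean"
    summary.columns = [f"{col}_{stat}" for col, stat in summary.columns]
    # Pearson correlation between x and y within each dataset
    summary["xy_corr"] = grouped.apply(lambda g: g["x"].corr(g["y"]))
    return summary.round(2)


# Tiny stand-in frame (hypothetical values) so the sketch runs on its own;
# with the real data, call summarize_groups(df) instead.
demo = pd.DataFrame({
    "dataset": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [1.0, 2.0, 3.0, 3.0, 2.0, 1.0],
})
print(summarize_groups(demo))
```

In the stand-in frame the two groups share their x and y means and standard deviations but have opposite correlations, which is why the correlation column is worth including in the check.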

The Visualization

Here is the seaborn code to produce ECDFs for all 13 datasets:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid", context="notebook")

# Gray out every dataset except "dino" to focus attention
colors = {
    name: "#d62728" if name == "dino" else "#cccccc"
    for name in df["dataset"].unique()
}

fig, ax = plt.subplots(figsize=(10, 6))
sns.ecdfplot(
    data=df,
    x="x",
    hue="dataset",
    palette=colors,
    ax=ax,
    legend=False,
)

ax.set_title("Marginal Distributions of x Across the Datasaurus Dozen",
             fontsize=14, loc="left", fontweight="semibold")
ax.set_xlabel("x value")
ax.set_ylabel("Cumulative proportion")
fig.savefig("datasaurus_ecdf.png", dpi=300, bbox_inches="tight")
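
To make explicit what each curve in that chart represents, here is a minimal sketch of an ECDF computed by hand. It is not seaborn's implementation; the function name ecdf and the four sample values are illustrative only.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and the cumulative proportion at each one."""
    x = np.sort(np.asarray(values, dtype=float))
    # The i-th smallest value (1-based) has cumulative proportion i / n
    y = np.arange(1, x.size + 1) / x.size
    return x, y


xs, ps = ecdf([3.0, 1.0, 2.0, 2.0])
print(xs)  # [1. 2. 2. 3.]
print(ps)  # [0.25 0.5  0.75 1.  ]
```

Every observation contributes one step of height 1/n, so nothing is lost to binning. Tied values simply stack their steps at the same x position.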

Key design decisions:

1. ECDF rather than histogram. ECDFs show every data point without binning. The reader can see exactly where each dataset's values fall without bin-count noise. For comparing 13 datasets, ECDFs are cleaner than 13 overlaid histograms.

2. Highlight the dinosaur. The dino dataset is colored red; the other 12 are muted gray. This applies the highlight strategy from Chapter 8 — focus the reader's attention on the interesting case while keeping context visible.

3. No legend. With 13 datasets, a full legend would be cluttered. Since only the dinosaur is highlighted, an explicit inline label for "dino" would suffice, but for this case study we skip the legend entirely to focus on the shape comparison.
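
The inline label mentioned in point 3 could be added with a single annotation. The sketch below is one way to do it, assuming a matplotlib Axes that holds the ECDF curves. The helper name label_at_median is hypothetical, the sample is synthetic rather than the real dino values, and the label is anchored where the curve reaches 0.5 (the median).

```python
import matplotlib.pyplot as plt
import numpy as np


def label_at_median(ax, values, text, color):
    """Annotate an ECDF curve at its median with a short inline label."""
    median = float(np.median(values))
    return ax.annotate(
        text,
        xy=(median, 0.5),            # point on the curve, in data coordinates
        xytext=(8, -12),             # label offset from that point, in points
        textcoords="offset points",
        color=color,
        fontweight="semibold",
    )


# Synthetic stand-in sample so the sketch runs on its own
rng = np.random.default_rng(0)
sample = rng.normal(loc=54, scale=17, size=142)

fig, ax = plt.subplots(figsize=(6, 4))
xs = np.sort(sample)
ax.step(xs, np.arange(1, xs.size + 1) / xs.size, where="post", color="#d62728")
label = label_at_median(ax, sample, "dino", "#d62728")
```

With the real data, the same helper could be called on the existing axes as label_at_median(ax, df.loc[df["dataset"] == "dino", "x"], "dino", "#d62728").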

What the Chart Shows

When the chart is drawn, something surprising happens: all 13 ECDFs are nearly identical. The dinosaur's ECDF (red) traces almost the same path as the 12 gray ECDFs. You cannot tell from the marginal x-distributions that some of these datasets are shaped like a dinosaur, a bullseye, or an x-shape.

This is the key insight. The 2D scatter plots are dramatically different (you can easily tell the dino from the bullseye), but the 1D marginal distributions are nearly identical. The difference is not in the 1D distributions; it is in the 2D structure (the clustering and the specific arrangement of points, since the linear correlation is also held fixed).

This matches what we know about the data. Matejka and Fitzmaurice constrained the 13 datasets to have identical mean and standard deviation in both x and y. Fixing the mean and standard deviation of a 1D distribution heavily constrains its shape, especially for a distribution with 142 points. Within those constraints, the 1D distributions end up looking very similar.
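
Visual overlap is easier to judge with a number beside it. One option is the largest vertical gap between two ECDFs (the two-sample Kolmogorov-Smirnov statistic). The sketch below is an illustration under that assumption, not part of the original analysis; the function name max_ecdf_gap is hypothetical, and no values for the real data are asserted here.

```python
import numpy as np


def max_ecdf_gap(a, b):
    """Largest vertical distance between the ECDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Both ECDFs are step functions, so the largest gap occurs at a data point
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side="right") / a.size
    ecdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(ecdf_a - ecdf_b)))


print(max_ecdf_gap([1, 2, 3], [1, 2, 3]))  # 0.0 (identical samples)
print(max_ecdf_gap([1, 2, 3], [4, 5, 6]))  # 1.0 (no overlap at all)

# With the real data, each dataset could be compared against the dino:
# dino_x = df.loc[df["dataset"] == "dino", "x"]
# gaps = df.groupby("dataset")["x"].apply(lambda s: max_ecdf_gap(s, dino_x))
```

A gap near 0 means two curves are nearly indistinguishable. Larger gaps flag datasets whose marginal deserves a closer look than the overlay alone suggests.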

Why It Matters

Combining the Datasaurus Dozen with ECDFs teaches several specific lessons:

1. 1D and 2D information are different. A chart that shows the 1D distribution tells you about the marginal shape. A chart that shows the 2D scatter tells you about the joint structure. These are different questions with different answers. Looking at only one is incomplete.

2. Constraining statistics constrains shape. When you fix a distribution's mean and standard deviation, you heavily constrain the possible shapes. You cannot, for example, produce a strongly bimodal distribution with a target mean and standard deviation unless the data points are strategically placed. The Datasaurus Dozen exploits this: the 2D arrangement of the points can vary wildly within the constrained marginals, even though the linear correlation is held fixed too.

3. ECDFs are excellent for group comparison. Overlaying 13 ECDFs shows all 13 distributions at once without the visual noise of overlapping histograms or KDEs. The comparison is precise and exhaustive — you can see where distributions differ, even slightly, without needing summary statistics.

4. The highlight strategy works at scale. 13 groups is too many to compare visually if all are colored. But graying out 12 and highlighting 1 (the dino) lets the reader see both the specific case and the overall pattern. This is the small-multiples-with-context pattern from Chapter 9.

5. "Statistics alone are lossy" cuts both ways. The original Datasaurus Dozen showed that summary statistics hide the 2D structure. This case study shows that visualization (the scatter plot) hides the 1D distribution agreement. Neither the statistics nor the scatter plot alone tells the complete story. You need both.
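
The first lesson, that 1D and 2D information are different, can be demonstrated without the Datasaurus data at all. The sketch below uses a hypothetical synthetic sample: shuffling the y values leaves both marginals exactly unchanged while destroying the joint structure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical, strongly correlated 2D sample (not the Datasaurus data)
x = rng.normal(size=142)
y = 0.9 * x + 0.1 * rng.normal(size=142)

# Shuffling y changes which y is paired with which x, and nothing else
y_shuffled = rng.permutation(y)

same_marginal = np.array_equal(np.sort(y), np.sort(y_shuffled))
corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(x, y_shuffled)[0, 1]

print(same_marginal)  # True: the 1D distribution of y is untouched
print(round(corr_before, 2), round(corr_after, 2))
```

An ECDF of y would be identical before and after the shuffle, while a scatter plot would change completely. That is the sense in which the two chart types answer different questions.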

The Complementary Analysis: 2D Structure

For completeness, it is worth plotting the dinosaur dataset as a scatter plot to see the full 2D structure alongside the 1D marginals:

dino = df[df["dataset"] == "dino"]

fig, axes = plt.subplots(1, 2, figsize=(12, 5), constrained_layout=True)

# Scatter of the dinosaur
axes[0].scatter(dino["x"], dino["y"], alpha=0.7, s=30)
axes[0].set_title("(a) Dinosaur scatter")
axes[0].set_aspect("equal")

# ECDF of x values for all 13 datasets, with the dinosaur highlighted
colors = {
    name: "#d62728" if name == "dino" else "#cccccc"
    for name in df["dataset"].unique()
}
sns.ecdfplot(data=df, x="x", hue="dataset", palette=colors,
             ax=axes[1], legend=False)
axes[1].set_title("(b) All 13 marginal distributions")

Side by side, this figure tells the full story: the 2D shape is a dinosaur (panel a), but the 1D marginal distribution is indistinguishable from the other 12 datasets (panel b). The statistics are identical; the 2D structure is unique; the 1D distribution is constrained to be similar. Three different facts, one dataset.

Lessons for Practice

1. Different chart types answer different questions. A scatter plot, a histogram, an ECDF, and a box plot are not redundant. They reveal different aspects of the data. For any important dataset, produce several chart types and see what each reveals.

2. The Datasaurus Dozen is a teaching tool for more than one lesson. The original lesson ("statistics alone are lossy") is widely cited. The lesson here ("1D marginals can be similar even when 2D structures differ") is less widely cited but equally important for anyone using summary statistics.

3. ECDFs are great for comparing many groups. With 13 groups, the ECDF overlay is cleaner than 13 histograms or 13 KDEs. The step curves are easier to distinguish than binned bars or smoothed densities. When you have many groups to compare, ECDFs should be the first choice.

4. The highlight strategy scales. Graying out context and highlighting one group is an effective way to manage visual complexity with many categories. This applies to ECDFs, to line charts, to scatter plots, and to any chart where multiple series would otherwise be overwhelming.

5. Summary statistics are a starting point, not an endpoint. Always visualize the distribution after computing summary statistics. The statistics may be informative; the distribution may show features the statistics miss. For data you care about, both are cheap to produce, and together they are much more informative than either alone.

Discussion Questions

1. On the complementary information. The original Datasaurus Dozen showed that 2D scatter plots reveal what summary statistics hide. This case study shows that 1D marginals are nearly identical across the 13 datasets. What does this pair of observations tell you about the relationship between 1D and 2D visualization?

2. On the highlight strategy. The case study uses gray for 12 datasets and red for the dinosaur. Why does this work better than using 13 different colors? At what number of groups does the highlight strategy become necessary?

3. On ECDFs vs. histograms. The case study uses ECDFs for the group comparison. Would histograms or KDEs have worked as well? What specific advantages do ECDFs offer for this particular visualization?

4. On teaching with the Datasaurus. The Datasaurus Dozen is a canonical teaching example. Is it a good vehicle for teaching ECDFs specifically, or is it more useful for teaching the general "visualize your data" lesson?

5. On the generalization. The case study's insight, that constraining statistics constrains 1D shape but not 2D structure, is a general principle. What other datasets or visualizations could illustrate this principle?

6. On your own data. Think of a dataset you have analyzed recently. Would comparing the 1D marginals to the 2D scatter reveal anything you missed? Is the 2D structure more interesting than the 1D distribution, or vice versa?

The Datasaurus Dozen was designed as a demonstration of a specific principle, but it rewards re-analysis from different angles. Adding ECDFs to the original scatter plot analysis shows that the same data can be visualized in complementary ways, each revealing different aspects of the structure. This is the general lesson of distributional visualization: the 1D shape, the 2D structure, and the summary statistics are three different kinds of information. For serious analysis, you need all three.