Case Study 1: Reproducing the Datasaurus Dozen with ax.scatter()

In 2017, Justin Matejka and George Fitzmaurice published a paper called "Same Stats, Different Graphs" that created thirteen datasets — including one shaped like a dinosaur — that all share the same summary statistics. The paper's graphics are all simple scatter plots. This case study reconstructs them in matplotlib to show how a few parameter choices on ax.scatter() turn a generic API call into a teaching moment that connects directly to Chapter 1's threshold concept.


The Situation

In Chapter 1, we introduced Anscombe's Quartet — the 1973 example by Frank Anscombe of four datasets with identical summary statistics but radically different visual patterns. The quartet has become the canonical demonstration of why visualization matters: the numbers alone cannot reveal the structure of the data, but a scatter plot can. We have been referring back to Anscombe's Quartet throughout the book as evidence for Chapter 1's "statistics lie" threshold concept.

But Anscombe's Quartet has only four datasets. It is compelling but small. In 2017, two researchers at Autodesk, Justin Matejka and George Fitzmaurice, decided to dramatically expand Anscombe's demonstration. They built a system that could generate datasets with specific summary statistics using simulated annealing — essentially, a trial-and-error optimization algorithm that perturbs a dataset toward a target shape while keeping its statistics constant. With this system, they produced thirteen datasets with exactly the same mean, standard deviation, and Pearson correlation (to two decimal places) but radically different visual shapes. The most famous of the thirteen is the dinosaur — a clean silhouette of a Tyrannosaurus Rex, made of points, with statistics identical to the others.

The paper, titled "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing," was published at the CHI 2017 conference. The accompanying dataset — the Datasaurus Dozen — has become a standard teaching tool in data visualization education. It is the Anscombe's Quartet of the modern era, just with thirteen examples instead of four, and with one of them shaped like a dinosaur for maximum rhetorical effect.

This case study is worth including in Chapter 11 for a specific reason: reproducing the Datasaurus Dozen is an exercise in using ax.scatter() with thirteen different datasets, and the exercise illustrates several things simultaneously. It shows how simple the API is — you can reproduce the entire dataset with ax.scatter() calls in a loop. It shows how the chart type (scatter) is what makes the structure visible. And it reinforces Chapter 1's threshold concept: summary statistics alone are lossy compressions of the full data, and visualization is how you recover the structure the statistics discarded.

The Data

The Datasaurus Dozen consists of thirteen groups of 142 (x, y) pairs each. The groups are named: dino, away, bullseye, circle, dots, h_lines, high_lines, slant_down, slant_up, star, v_lines, wide_lines, and x_shape. The summary statistics are:

  • Mean of x: ~54.26
  • Mean of y: ~47.83
  • Standard deviation of x: ~16.76
  • Standard deviation of y: ~26.93
  • Pearson correlation (x, y): ~-0.06

These statistics are identical to two decimal places across all thirteen groups. An analyst who computed only the statistics would conclude that all thirteen datasets are "about the same." The scatter plot reveals that they are radically different: a dinosaur, an x-shape, a circle, a bullseye, sets of lines, and so on.

The dataset is available from several sources:

  • The original paper's supplementary materials (autodeskresearch.com/publications/samestats)
  • The R "datasauRus" package (CRAN)
  • The Python "datasaurus" package (pip install datasaurus)
  • GitHub repositories that include the raw CSV files

For this case study, assume you have loaded the data as a pandas DataFrame with columns dataset, x, and y.
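To check the "same stats" claim directly, you can compute the per-group statistics with a pandas groupby. The sketch below builds a small synthetic stand-in with the same schema (dataset, x, y), since only the DataFrame layout is assumed here; with the real file you would start from pd.read_csv("datasaurus.csv") instead.

```python
import numpy as np
import pandas as pd

# Stand-in data with the datasaurus schema: columns dataset, x, y.
rng = np.random.default_rng(0)
frames = []
for name in ["dino", "away", "bullseye"]:
    frames.append(pd.DataFrame({
        "dataset": name,
        "x": rng.uniform(0, 100, 142),
        "y": rng.uniform(0, 100, 142),
    }))
df = pd.concat(frames, ignore_index=True)

# One row of summary statistics per group.
stats = df.groupby("dataset").agg(
    mean_x=("x", "mean"),
    mean_y=("y", "mean"),
    std_x=("x", "std"),
    std_y=("y", "std"),
)
# Pearson correlation needs both columns at once, so compute it separately.
stats["corr_xy"] = df.groupby("dataset")[["x", "y"]].apply(
    lambda g: g["x"].corr(g["y"])
)
stats = stats.round(2)
print(stats)
```

Run against the real CSV, every row of this table comes out identical to two decimal places — which is the entire point of the dataset.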

The Visualization

The reproduction of the Datasaurus Dozen is almost trivially simple in matplotlib. Here is a complete implementation:

import matplotlib.pyplot as plt
import pandas as pd

# Load the data (assume it's in a CSV named datasaurus.csv)
df = pd.read_csv("datasaurus.csv")

# Create a 4x4 grid of subplots (13 datasets need 16 slots; we'll leave 3 empty)
fig, axes = plt.subplots(4, 4, figsize=(12, 12))

# Flatten the 2D array of axes into a 1D list for easier iteration
axes_flat = axes.flatten()

# Plot each dataset in its own panel
datasets = df["dataset"].unique()
for i, dataset_name in enumerate(datasets):
    subset = df[df["dataset"] == dataset_name]
    ax = axes_flat[i]
    ax.scatter(
        subset["x"],
        subset["y"],
        s=20,
        alpha=0.7,
        color="#1f77b4",
    )
    ax.set_title(dataset_name, fontsize=10)
    ax.set_xlim(-5, 105)
    ax.set_ylim(-5, 105)
    ax.set_xticks([])
    ax.set_yticks([])

# Hide the unused panels
for i in range(len(datasets), len(axes_flat)):
    axes_flat[i].set_visible(False)

fig.suptitle("The Datasaurus Dozen: Same Stats, Different Graphs", fontsize=14, y=0.99)
fig.savefig("datasaurus.png", dpi=300, bbox_inches="tight")

Twenty-five lines of code (including comments). The result is a 4x4 grid (with 3 empty cells) showing all thirteen Datasaurus Dozen datasets. The dinosaur shape is visible in the "dino" panel. The concentric circles are visible in the "bullseye" panel. The x-shape is visible in the "x_shape" panel. The three "lines" datasets (h_lines, v_lines, wide_lines) show clearly geometric patterns. And critically, if you compute the mean, standard deviation, and correlation for each panel, they are all identical.

Key Technical Details

A few matplotlib specifics worth highlighting:

1. plt.subplots(4, 4) returns a 2D array. The axes variable is a 2D numpy array with shape (4, 4). To iterate over the panels as a flat list, we use axes.flatten(), which returns a flattened 1D copy of the array (the related axes.ravel() returns a view where possible; for iterating over Axes objects the distinction does not matter). This is a common pattern when you have more panels than a single row or column can display.

2. ax.set_xlim() and ax.set_ylim() ensure consistent scales. Without these, each panel would auto-scale to its own data range, and the panels would be incomparable. Setting explicit limits to (-5, 105) (slightly beyond the data's 0-100 range) ensures every panel uses the same scale, which is essential for the comparison to work. This implements the shared-axis principle from Chapter 8.

3. ax.set_xticks([]) and ax.set_yticks([]) remove tick marks. For a 4x4 grid of small panels, tick marks would clutter the display. Removing them is a Chapter 6 declutter move that lets the scatter shapes be the visual focus.

4. axes_flat[i].set_visible(False) hides unused panels. With 13 datasets in a 16-cell grid, three cells would otherwise display empty Axes. Calling set_visible(False) hides them cleanly. Alternatives are a layout with exactly thirteen cells (harder to do cleanly in matplotlib) or a non-rectangular layout with GridSpec (which we will cover in Chapter 13).

5. The ax.scatter call itself is minimal. Just ax.scatter(x, y, s=20, alpha=0.7, color=...). The arguments are: x-values, y-values, marker size, transparency, and color. Nothing fancy. The power of the demonstration comes from the data, not from elaborate styling.
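One footnote to items 1 and 2: matplotlib can also enforce the shared scales for you via the sharex and sharey parameters of plt.subplots(), so a single set_xlim call propagates to every panel. A minimal sketch on synthetic stand-in data (the random points substitute for the real CSV, which is not assumed loaded here):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# sharex/sharey link every panel's limits, so one explicit set_xlim/set_ylim
# call governs the whole grid.
fig, axes = plt.subplots(4, 4, figsize=(12, 12), sharex=True, sharey=True)
axes_flat = axes.flatten()  # (4, 4) ndarray -> length-16 1D array

for ax in axes_flat[:13]:
    ax.scatter(rng.uniform(0, 100, 142), rng.uniform(0, 100, 142),
               s=20, alpha=0.7)
    ax.set_xticks([])
    ax.set_yticks([])

for ax in axes_flat[13:]:
    ax.set_visible(False)  # hide the three unused cells

axes_flat[0].set_xlim(-5, 105)  # propagates to all shared panels
axes_flat[0].set_ylim(-5, 105)
```

The explicit set_xlim/set_ylim calls in the main listing do the same job; sharex/sharey simply makes the shared-axis intent part of the figure's construction rather than a per-panel afterthought.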

Making the Dinosaur Prominent

In the original paper, the dino dataset is usually shown prominently because it is the one that most dramatically refutes the "statistics alone are enough" position. To reproduce the paper's emphasis, you might highlight the dino panel while keeping the others as gray context:

fig, axes = plt.subplots(4, 4, figsize=(12, 12))
axes_flat = axes.flatten()

for i, dataset_name in enumerate(datasets):
    subset = df[df["dataset"] == dataset_name]
    ax = axes_flat[i]
    if dataset_name == "dino":
        color = "#d62728"  # red, emphasized
        alpha = 0.9
    else:
        color = "#999999"  # gray, context
        alpha = 0.5
    ax.scatter(subset["x"], subset["y"], s=20, alpha=alpha, color=color)
    ax.set_title(dataset_name, fontsize=10)
    ax.set_xlim(-5, 105)
    ax.set_ylim(-5, 105)
    ax.set_xticks([])
    ax.set_yticks([])

for i in range(len(datasets), len(axes_flat)):
    axes_flat[i].set_visible(False)

fig.suptitle("The Datasaurus Dozen: Same Stats, Different Graphs", fontsize=14, y=0.99)

This is the grayed-out strategy from Chapter 9 applied to a small-multiple layout. The dinosaur is the focus; the other twelve datasets are context that establishes the "same stats" claim without competing for the reader's attention.

The Impact

The Datasaurus Dozen has become one of the most widely shared data visualization teaching examples of the past decade. It appears in:

  • Visualization courses at dozens of universities.
  • Statistics textbooks as an example of why visualization is not optional.
  • Data science bootcamps as a motivator for the importance of EDA (exploratory data analysis).
  • Corporate training programs on data literacy.
  • Conference talks on visualization and statistics.
  • Social media as a viral example of the "statistics can mislead" principle.

The impact is not in the code (which is simple) or the dataset (which is a small CSV file) but in the rhetorical power of the demonstration. A single image of the Datasaurus Dozen grid says, in an instant, that summary statistics are lossy compressions. No amount of text can make the argument as effectively. A student who sees the image remembers it for years.

The paper's specific contribution was not the dinosaur — Anscombe's Quartet had already made the point — but the method: simulated annealing to generate datasets with arbitrary shapes and fixed statistics. This is a reproducible, extensible technique. You can use it to generate your own "same stats, different graphs" datasets for your own domains. Matejka and Fitzmaurice released their code publicly, and others have extended it to produce shapes for specific teaching purposes.
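To make the method concrete, here is a heavily simplified sketch of the invariant at its heart: perturb one point at a time and accept the move only if the rounded statistics are unchanged. This is an illustration of the idea, not the authors' code — their released version additionally biases accepted moves toward a target shape, which is the simulated-annealing part, and all function names below are ours.

```python
import numpy as np

def rounded_stats(x, y):
    """Summary statistics rounded to 2 decimals: the invariant to preserve."""
    return (round(x.mean(), 2), round(y.mean(), 2),
            round(x.std(ddof=1), 2), round(y.std(ddof=1), 2),
            round(float(np.corrcoef(x, y)[0, 1]), 2))

def perturb_preserving_stats(x, y, iterations=2000, step=0.3, seed=0):
    """Randomly nudge one point per iteration, accepting only moves that
    leave the rounded statistics unchanged; otherwise revert the point."""
    rng = np.random.default_rng(seed)
    x, y = x.copy(), y.copy()
    target = rounded_stats(x, y)
    for _ in range(iterations):
        i = rng.integers(len(x))
        old = x[i], y[i]
        x[i] += rng.normal(0, step)
        y[i] += rng.normal(0, step)
        if rounded_stats(x, y) != target:
            x[i], y[i] = old  # reject: the statistics drifted
    return x, y
```

After enough iterations the point cloud has visibly moved, yet every accepted configuration reports the same two-decimal statistics. Steering those accepted moves toward a dinosaur silhouette is what turns this loop into the paper's method.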

The Datasaurus Dozen also illustrates a subtler point about the Python visualization ecosystem. The reproduction above is about 25 lines of matplotlib code. The data is 13 × 142 = 1,846 rows in a CSV file. Together, they demonstrate that a meaningful teaching example does not require elaborate tools or massive datasets. The minimal combination of pandas.read_csv(), plt.subplots(4, 4), and ax.scatter() is enough to reproduce one of the most influential visualization teaching tools of the last decade. This is what the Python visualization ecosystem does well: it gives you the primitives to build meaningful demonstrations with a small amount of code.

Why It Works: A Theoretical Analysis

The Datasaurus Dozen's effectiveness as a teaching tool rests on several principles you have already encountered.

1. Small multiples enforce comparison. The 4x4 grid is a small multiple in the exact sense of Chapter 8: same chart type, same axes, same design grammar, different data. The consistency makes it possible to compare the thirteen datasets at a glance. Without the small-multiple layout, the "same stats" claim would be harder to believe because the reader would have to trust the analyst's summary; with the layout, the reader can see the thirteen datasets side by side and verify that they are radically different.

2. The shared axes are essential. All thirteen panels use the same x- and y-limits. Without shared scales, the reader could not compare shapes directly — each panel would auto-scale to its own data, and patterns at different spatial scales would look similar. The explicit set_xlim and set_ylim calls enforce the comparison.

3. Minimal styling keeps the focus on the shapes. Each panel has no tick marks, no axis labels, just the dataset name as a small title. The visual noise is stripped out so the only thing the reader notices is the shape of the scatter. This is the declutter procedure from Chapter 6 applied at the panel level.

4. The surprise is the point. A reader who sees the first few panels (the dinosaur, the x-shape, the concentric circles) has an "aha" moment: these are all so obviously different! Then they read the caption or the title: same mean, same standard deviation, same correlation. The cognitive dissonance between "obviously different" and "identical statistics" is the rhetorical engine of the demonstration. A less visually striking example (four similar-looking scatterplots) would not produce the same effect.

5. It connects to a foundational insight. The Datasaurus Dozen does not teach anything new; it teaches the same insight as Anscombe's Quartet from 1973. What makes it effective is the expansion and the surprise. The underlying claim — that summary statistics are lossy compressions of the full data — is the same. But the visual proof is stronger, and a stronger proof reaches readers that Anscombe's smaller example did not.

6. It works in the specific medium of the scatter plot. The scatter plot is uniquely well-suited to revealing shapes in 2D point data because it places no constraints on the points. Every (x, y) pair is drawn where the data says it belongs, regardless of whether the result is a cloud, a line, a dinosaur, or a set of concentric circles. Other chart types (histograms, bar charts, box plots) would aggregate or summarize the points and lose the shape information. The scatter plot preserves it.

Lessons for Modern Practice

Reproducing the Datasaurus Dozen teaches more than matplotlib syntax.

1. Visualization reveals what statistics hide. This is the Chapter 1 threshold concept, and the Datasaurus Dozen is its strongest single illustration. Whenever you are about to present a dataset by quoting its mean, standard deviation, and correlation, ask yourself: would looking at a scatter plot change your impression of the data? If the answer is yes — and it often is — the scatter plot should be part of your presentation.

2. Small multiples are powerful. The 4x4 grid of scatter plots is not decorative. It is the mechanism that makes the comparison possible. When you have many groups that need to be compared on the same dimensions, reach for small multiples first, not overlay charts or separate figures.

3. Consistent scales matter. The explicit set_xlim and set_ylim calls are not optional. Without them, the demonstration would fail because the auto-scaled panels would obscure the shape comparison. When you build a small multiple, set consistent scales explicitly.

4. The minimum viable version is usually enough. The Datasaurus Dozen reproduction is 25 lines of code. You do not need a specialized visualization library, a complex framework, or elaborate styling. The basic ax.scatter() + plt.subplots(4, 4) + set_xlim pattern is enough. Reach for more elaborate tools only when the minimum viable version genuinely is not enough.

5. Teaching examples reward visual impact. The Datasaurus Dozen is not taught because it is the most statistically rigorous example; it is taught because the dinosaur is memorable. When you build your own teaching materials, do not underestimate the value of visual impact. A memorable example is taught more widely, remembered longer, and shared more often than a forgettable but technically superior alternative.

6. Free datasets enable teaching. The Datasaurus Dozen is freely available, which is part of why it spread so widely. The original paper released the data and the simulated annealing code under a permissive license. If you want your teaching example to reach a wide audience, release the data freely — the ROI on free data is enormous in terms of reach and citation.

7. The threshold concept is universal. Chapter 1's claim that "statistics alone are lossy" applies to every dataset you will ever work with, in every context, for every question. The Datasaurus Dozen is not a special case; it is a particularly vivid example of a claim that is always true. Every time you plot your data before running a statistical summary, you are following the practical implication of the threshold concept. The plotting is not optional polish; it is the thing that reveals the structure the statistics discard.


Discussion Questions

  1. On the dinosaur specifically. The dinosaur is the most-cited example in the Datasaurus Dozen. Why is it more effective than the other twelve shapes? What does the dinosaur add that an x-shape or a bullseye does not?

  2. On reproducibility. The paper released the data and the code publicly. This is consistent with Chapter 4's discussion of transparency and Chapter 7's source attribution principle. How does this kind of openness change the impact of a paper or a dataset?

  3. On generating your own "Datasaurus." The simulated annealing technique used to create the Datasaurus Dozen is applicable to any shape. Could you generate a "Datasaurus" for your own domain — a cluster of datasets with identical statistics but visually different shapes that illustrate a domain-specific point? What would the shapes represent?

  4. On the connection to Anscombe. Anscombe's Quartet (1973) made the same point forty-four years earlier. Why did the Datasaurus Dozen get more attention despite saying essentially the same thing? What does the comparison tell you about the role of visual impact in teaching?

  5. On the role of scatter plots in EDA. The Datasaurus Dozen makes the case that scatter plots are uniquely valuable for revealing structure in 2D data. Is this a special case, or are there other chart types that play similar "revelation" roles for their specific data types? (Histograms for distributions? Heatmaps for 2D categorical data? Maps for geographic data?)

  6. On the code itself. The reproduction is 25 lines of matplotlib. Could it be simpler? Could it be more polished? What trade-offs are involved in the choices (small multiples vs. single chart, minimal styling vs. full styling, dinosaur highlight vs. equal treatment)?


The Datasaurus Dozen is a small example in the grand scheme of data visualization, but it carries a disproportionate teaching load. In a single image, it reminds every student of data science that statistics alone are not enough, that visualization is a necessary part of the workflow, and that the chart type you choose determines what the reader can see. Reproducing it in matplotlib takes a few minutes. Understanding what it teaches takes a career.