Case Study 1: Anscombe's Quartet and the Datasaurus Dozen Revisited
In 1973, statistician Francis Anscombe published four small datasets that had identical means, variances, and correlations — and wildly different shapes. His point was that summary statistics are not enough; you have to look at the data. Four decades later, Justin Matejka and George Fitzmaurice extended the argument with a dataset shaped like a dinosaur. Both demonstrations reinforce the central message of this chapter: a correlation heatmap is a starting point, not an ending point.
The Situation: Numbers Alone Do Not Describe a Dataset
Summary statistics — mean, variance, correlation, regression slope — are the currency of statistical reporting. A paper that says "the correlation between X and Y was 0.82" is assumed to have told you something meaningful about the relationship. An analyst who reports r = 0.82 to a stakeholder is assumed to have summarized the data accurately. But summary statistics are a projection — a reduction of a complex dataset to a handful of numbers. And like any projection, they lose information.
Francis Anscombe, a statistician at Yale in the 1970s, was frustrated by this reduction. He had spent years reviewing statistical analyses and noticed a pattern: analysts would compute summary statistics, make claims based on those statistics, and never actually look at the data. Sometimes the analyses were correct. Sometimes the summary statistics hid patterns that would have changed the conclusion entirely. Anscombe wanted to demonstrate the problem in the most vivid way possible.
In 1973, he published a short paper with a title as blunt as its content: "Graphs in Statistical Analysis." The paper introduced four datasets, now known collectively as Anscombe's Quartet. Each dataset had 11 points. Each had the same x-mean (9.0), the same y-mean (7.5), the same x-variance (11.0), the same y-variance (4.1), the same correlation (0.816), and the same regression line (y = 3 + 0.5x). By every standard summary statistic, they were identical.
But when Anscombe plotted them, the datasets were utterly different. Dataset I was a cloud of points around a linear trend. Dataset II was a perfect parabola — the linear regression had completely mischaracterized the relationship. Dataset III was a linear cloud with one extreme outlier that was pulling the regression line. Dataset IV was a vertical cluster of points at x=8 with a single leverage point at x=19 — the regression was entirely determined by that one outlier.
All four datasets would look identical in a correlation heatmap. All four would appear in the same cell with the same color, the same annotation, the same place in the clustering hierarchy. A researcher who relied on the heatmap alone would draw the same conclusion about all four: strong positive correlation, linear relationship, good candidate for regression. And in three out of four cases, that researcher would be wrong.
The Data: Four Datasets with Identical Statistics
Anscombe's four datasets are small — 11 points each — and easy to reproduce. In the first three datasets, the x values are the integers 4 through 14; the fourth, described below, uses a different x pattern. The y values are carefully constructed to produce the same summary statistics despite different shapes.
Dataset I is constructed from a linear model with added noise: y ≈ 0.5x + 3 plus a small random term. This is what most analysts would expect to see when reporting r = 0.816 — a noisy linear trend.
Dataset II is a perfect parabola. The x values run from 4 to 14, and the y values lie exactly on the curve y ≈ -0.127x² + 2.78x - 6.0. The relationship is deterministic, but it is quadratic, not linear. Because the parabola peaks near x ≈ 11, well to the right of the x-mean of 9, the data rises over most of the x range, and the linear correlation still comes out to 0.816. A linear regression fit to this data produces exactly the same line as for Dataset I, but the line is nonsense — the actual relationship is curved.
Dataset III is a linear cloud with an outlier. Ten of the 11 points lie almost perfectly on a line with slope ≈ 0.35. The eleventh point is dramatically off the line, pulling the linear regression toward itself. The resulting regression has slope 0.5 — higher than the slope that fits the bulk of the data. The correlation is 0.816, but the "true" relationship, if you believe the bulk points are the signal and the outlier is noise, is a line with a different slope.
Dataset IV is the strangest of the four. Ten of the 11 points have x = 8, meaning they form a vertical cluster with no x-variation at all. The eleventh point has x = 19 and pulls the regression line into existence — without that single leverage point, there would be no x-variance and no regression. The correlation value of 0.816 is almost entirely determined by one point.
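The identical summaries are easy to check directly. The sketch below hard-codes the quartet's values (transcribed from Anscombe's paper; seaborn ships the same data as sns.load_dataset("anscombe")) and computes the shared statistics in pure Python:

```python
import statistics

# Anscombe's four datasets, values transcribed from the 1973 paper.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x for datasets I-III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8] * 7 + [19] + [8] * 3,  # the vertical cluster plus one leverage point
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation from first principles."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

for name, (xs, ys) in quartet.items():
    print(name,
          round(statistics.mean(xs), 2),      # 9.0 for all four
          round(statistics.mean(ys), 2),      # 7.5 for all four
          round(statistics.variance(xs), 2),  # 11.0 for all four
          round(pearson(xs, ys), 2))          # 0.82 for all four
```

All four rows print the same numbers; only a plot can tell the datasets apart.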
Anscombe's brilliant demonstration was that all of these different scenarios produce the same summary statistics. The implication was clear: if you want to know what your data looks like, you have to look at it.
Anscombe's Plot
Anscombe's original paper included a 2×2 grid of scatter plots — one per dataset — arranged so the reader could compare them side by side. The plots were small, hand-drawn by the standards of the time, and devastating. Readers who had been reporting correlation coefficients as definitive summaries were forced to confront the fact that their summaries were not definitive at all.
In modern seaborn, the 2×2 grid is a one-liner with sns.lmplot (a regression-plot wrapper around FacetGrid); a plain matplotlib subplot arrangement works too. Note that in recent seaborn versions, regplot takes keyword-only arguments, so the older g.map(sns.regplot, "x", "y") pattern no longer works:
import seaborn as sns
import matplotlib.pyplot as plt
anscombe = sns.load_dataset("anscombe")  # seaborn ships this dataset
g = sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2, height=3, ci=None)
plt.show()
Seaborn ships Anscombe's Quartet as a built-in dataset (sns.load_dataset("anscombe")), which is itself a kind of monument — the dataset is important enough that the seaborn maintainers decided to include it. The one-line plot reproduces Anscombe's original figure with minimal effort.
The regression line in each panel is identical (y ≈ 3 + 0.5x, with the small differences being rounding). The underlying data is not. Dataset I looks reasonable. Dataset II is obviously a parabola. Dataset III has an obvious outlier pulling the line. Dataset IV has the leverage-point problem. The regression line is the wrong tool for three of the four datasets, and the scatter plot tells you so immediately.
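One quick numerical way to catch the Dataset II problem is to inspect residuals: a linear fit to a parabola leaves a systematic inverted-U pattern rather than random scatter. A minimal check, with Dataset II's values transcribed from Anscombe's paper and sorted by x:

```python
import statistics

# Anscombe's Dataset II, sorted by x (values from the 1973 paper)
xs = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
ys = [3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10]

# ordinary least squares by hand
mx, my = statistics.mean(xs), statistics.mean(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx  # fit comes out to approximately y = 3 + 0.5x

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
# ends below the line, middle above it: the signature of curvature
print([round(r, 2) for r in residuals])
```

Random scatter would mix positive and negative residuals along x; here the signs run negative, then positive, then negative, which is the fitted line slicing through a curve.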
The Modern Extension: The Datasaurus Dozen
Anscombe's demonstration was powerful but small. Four datasets, 11 points each, carefully hand-constructed to produce identical statistics. A skeptic might argue that the four examples were cherry-picked — that in practice, most datasets with similar statistics look similar.
In 2017, Justin Matejka and George Fitzmaurice of Autodesk Research published a paper titled "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing." The paper introduced a method for constructing datasets that share the same summary statistics but look wildly different. The method used simulated annealing, an optimization technique, to iteratively modify a dataset while preserving its means, variances, and correlations.
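The core of the method can be sketched with nothing beyond the standard library. This toy version keeps only the perturb-and-check acceptance test: nudge one point at random and keep the move only if the rounded statistics are unchanged. The real algorithm additionally uses an annealing temperature and steers points toward a target silhouette (the dinosaur), which this sketch omits:

```python
import random
import statistics

def summary(xs, ys):
    """Summary statistics rounded to two decimals -- the only thing we preserve."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return (round(mx, 2), round(my, 2),
            round(statistics.stdev(xs), 2), round(statistics.stdev(ys), 2),
            round(cov / (sx * sy), 2))

random.seed(42)
# start from a 6x5 grid of points (an arbitrary initial shape)
xs = [float(i % 6) for i in range(30)]
ys = [float(i // 6) for i in range(30)]
target = summary(xs, ys)

for _ in range(20_000):
    i = random.randrange(len(xs))
    cand_x, cand_y = xs[:], ys[:]
    cand_x[i] += random.gauss(0, 0.1)
    cand_y[i] += random.gauss(0, 0.1)
    if summary(cand_x, cand_y) == target:  # accept only stat-preserving moves
        xs, ys = cand_x, cand_y
```

After the loop, the point cloud has drifted away from the grid, yet its rounded means, standard deviations, and correlation are exactly the starting values — the shape changed while the summary did not.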
The centerpiece of the paper was a dataset shaped like a dinosaur. Matejka and Fitzmaurice started with a set of points forming a rough dinosaur silhouette (x-y coordinates tracing a T-Rex outline) and derived twelve other datasets — the "Datasaurus Dozen" — that had identical means, standard deviations, and Pearson correlations but looked completely different. The twelve variations included an X-shape, a star, a slanted cross, parallel lines, circles, a bullseye, and several more.
All thirteen datasets share:
- x mean = 54.26
- y mean = 47.83
- x standard deviation = 16.76
- y standard deviation = 26.93
- Pearson correlation = -0.06
An analyst using only these summary statistics could not distinguish the dinosaur from the star from the parallel lines. All three would appear in the same cell of a correlation heatmap. All three would cluster with all the others in a cluster map. The numerical summary is useless for telling them apart. Only the visual reveals the structure.
This extension of Anscombe's argument is valuable because it is not cherry-picked. The Datasaurus Dozen was produced by an algorithm, not by a statistician sitting with graph paper. The algorithm could have produced many more such datasets — the paper is, in effect, a demonstration that for any target statistics there are arbitrarily many dataset shapes that satisfy them.
Why This Matters for Multi-Variable Exploration
Anscombe's Quartet and the Datasaurus Dozen are single-pair demonstrations — two variables, x and y, with no third dimension to worry about. So why include them in a chapter on multi-variable exploration?
Because the lesson scales. A correlation heatmap is a projection from raw data to a single number per pair. A 20-variable heatmap reduces 20 full distributions and 190 bivariate relationships to 190 colored cells. Every cell that Anscombe's argument applies to is a cell that could be hiding a Dataset II, a Dataset III, or a Dataset IV — a non-linear relationship, an outlier pulling a coefficient, a leverage point driving an entire correlation. The heatmap cannot warn you about any of them.
This is why the multi-variable workflow of this chapter is iterative, not terminal. The heatmap is where you start, not where you end. After the heatmap surfaces an interesting correlation, you must follow up with a pair plot or a joint plot to verify the shape of the relationship. The pair plot cannot be skipped. The joint plot cannot be skipped. Anscombe's quartet is the reason why.
A specific application: when your heatmap shows a correlation of 0.8 between two variables, before you write anything about the relationship, produce a scatter plot and check. Is the cloud linear, or curved? Are there outliers? Does the regression line fit the bulk of the data, or is it being dragged by leverage points? The answer determines whether 0.8 is a meaningful summary or a misleading one.
The Impact: From Statistical Caution to Visual Discipline
Anscombe's paper was not a revelation in 1973 — statisticians had long known that summary statistics could be misleading. What the paper did was make the problem concrete. Before Anscombe, arguments for visualizing data were abstract ("look at your data, it's important"). After Anscombe, the argument had a picture: four datasets, identical statistics, obviously different shapes. The quartet became a standard teaching example in statistics courses, included in textbooks, reproduced in lectures, and eventually — by the 2010s — built into tools like seaborn as a convenient demonstration.
The Datasaurus Dozen had a similar impact in the data science community of the late 2010s. The dinosaur-shaped scatter plot became a meme, circulated on Twitter and in blog posts and in conference talks. The meme was funny, but the underlying point was serious: you cannot trust summary statistics to describe a dataset. You have to visualize.
Both demonstrations have been incorporated into the standard curriculum for data science and statistics. Beginning analysts learn the quartet as a cautionary tale. Experienced analysts use the dinosaur as a punchline in talks. The message is the same: look at the data.
Theory Connection: Visualization as Dimensionality Expansion
The theoretical significance of Anscombe's demonstration is that summary statistics are dimensionality reductions. A dataset with 11 points in two dimensions has 22 numbers. Reducing it to mean, variance, and correlation keeps only five numbers. Seventeen numbers of structure are lost in the reduction, and those seventeen numbers can contain the entire story.
Visualization inverts this reduction. A scatter plot shows all 22 numbers simultaneously, preserving everything. You can read off the mean, the variance, and the correlation, and you also see the shape, the outliers, the clusters, the trends, and the leverage points. The scatter plot is higher-bandwidth than the summary statistics; it carries more information per unit of cognitive effort.
This is why the chapter's multi-variable tools matter. The heatmap is a compressed summary. The pair plot is an expanded view. The workflow of "heatmap first, then pair plot" is a workflow of "compress first to find what matters, then expand to see what is there." Neither tool is complete on its own. Together they cover the trade-off between breadth (many variables in a compact display) and depth (full visual detail on a few pairs).
The Datasaurus Dozen adds one more theoretical point. Summary statistics are not just incomplete — they are fundamentally non-unique. For any set of summary statistics, many different datasets satisfy them. The mapping from dataset to summary is many-to-one, and the inverse mapping is impossible. Visualization is the only way to recover the structure that summaries hide.
Discussion Questions
- On cherry-picked examples. Anscombe's quartet was hand-constructed to make a point. Does this weaken the argument? The Datasaurus Dozen used an algorithm; does that strengthen it?
- On reporting standards. Should scientific journals require scatter plots alongside every reported correlation coefficient? What would this change about publication length and review practice?
- On heatmaps in industry. Large tech companies produce correlation heatmaps of hundreds of metrics as part of their dashboarding. Given Anscombe's argument, are these heatmaps useful? What safeguards would make them more trustworthy?
- On the workflow. This chapter argues for "heatmap first, then pair plot." What are the costs of this workflow — the time, the cognitive load, the temptation to skip? How do you resist the temptation?
- On the next step. If a heatmap cannot be trusted alone and a pair plot does not scale past eight variables, what tool would you want next? Dimensionality reduction (PCA, UMAP)? Interactive exploration? Both?
Anscombe's Quartet is small enough to fit in a 44-row DataFrame (11 points for each of the four datasets). Load it with sns.load_dataset("anscombe") and plot it the next time you are tempted to report a correlation coefficient without looking at the scatter. The lesson — that summary statistics hide structure — is the foundational warning that every chapter in this book has been reinforcing since Chapter 1. Multi-variable exploration is the practical response: look at the data, with the right tools, at the right level of abstraction, iteratively.