Case Study 1: The Experiment That Changed Chart Design — Cleveland & McGill (1984)


Background

Before 1984, chart design was largely a matter of convention and intuition. Statisticians and graphic designers had strong opinions about which charts worked best, but those opinions were grounded in experience and aesthetic preference, not experimental evidence. Nobody had systematically tested whether the human visual system decodes a bar chart more accurately than a pie chart, or a scatter plot more accurately than a bubble chart. The question sounds simple. The answer required careful experimentation.

William Cleveland was a researcher at Bell Laboratories in Murray Hill, New Jersey — the same institution where the transistor was invented, where the Unix operating system was born, and where John Tukey was developing the intellectual foundations of exploratory data analysis. Robert McGill was his colleague at AT&T Bell Laboratories. Both were statisticians who took graphical methods seriously — not as decoration, but as analytical tools whose effectiveness was an empirical question.

Their 1984 paper, "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods," published in the Journal of the American Statistical Association, proposed a theory of graphical perception, tested it experimentally, and derived practical recommendations for chart design. It remains one of the most cited papers in visualization perception research.

The Central Question

The paper asked: When a viewer extracts a quantitative value from a chart, how accurate is that extraction, and how does accuracy depend on the visual encoding used?

This is a deceptively precise question. It does not ask "which chart looks better" or "which chart do people prefer." It asks which chart allows the viewer to make more accurate quantitative judgments. Preference and accuracy, as Cleveland and McGill would demonstrate, are not the same thing.

Theoretical Framework

Cleveland and McGill began by cataloging the "elementary perceptual tasks" that a viewer performs when reading a chart. When you look at a bar chart and judge that one bar is about twice the height of another, you are performing a length-ratio judgment. When you look at a pie chart and judge that one slice is about a quarter of the whole, you are performing an angle-proportion judgment. When you look at a scatter plot and judge that one point is higher than another, you are performing a position-comparison judgment.

They identified ten elementary perceptual tasks, ordered by their theoretical accuracy:

  1. Position along a common scale
  2. Position along identical, non-aligned scales
  3. Length
  4. Direction (angle, slope)
  5. Area
  6. Volume
  7. Curvature
  8. Shading / color saturation
  9. Color hue (for quantitative judgment)
  10. Density

The theoretical ordering was based on prior psychophysical research — particularly Stevens's Power Law, which describes how perceived stimulus intensity relates to actual physical intensity for different sensory channels. But theory alone was not enough. Cleveland and McGill wanted empirical confirmation.
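The power-law relationship can be illustrated with a short sketch. The exponents below are commonly cited approximations (length ≈ 1.0, area ≈ 0.7) used purely for illustration, not values taken from the paper itself:

```python
def perceived_ratio(true_ratio, exponent):
    """Stevens's Power Law: perceived magnitude grows as a power of the
    physical magnitude, so a true ratio r is perceived as r ** exponent."""
    return true_ratio ** exponent

# Length is judged close to veridically (exponent ~ 1.0)...
length_judgment = perceived_ratio(2.0, 1.0)   # a 2x length looks 2x
# ...while area is perceptually compressed (exponent ~ 0.7):
area_judgment = perceived_ratio(2.0, 0.7)     # a 2x area looks ~1.6x
print(length_judgment, round(area_judgment, 2))
```

The compression for area is why channels lower in the hierarchy produce systematic, not merely random, misjudgments.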

The Experimental Design

The experiments were elegant in their simplicity. Subjects were shown simple graphical displays — not full charts with titles, legends, and real data, but stripped-down stimuli that isolated a single perceptual task. The task was always the same: judge the smaller of two marked values as a percentage of the larger.

For example, in a position-along-common-scale experiment, subjects saw two dots on a common horizontal axis and judged the position of the lower dot as a percentage of the higher dot's position. In an angle experiment, subjects saw two pie-chart slices and judged the smaller angle as a percentage of the larger. In an area experiment, subjects saw two circles of different sizes and judged the smaller circle's area as a percentage of the larger's.

By holding the judgment task constant (always "what percentage?") and varying only the encoding, the researchers isolated the effect of the visual channel itself. This was the methodological insight that made the paper so influential: it separated the encoding from the context, the judgment from the chart type.

The accuracy measure was log absolute error — the base-2 logarithm of the absolute difference between the judged percentage and the true percentage, with a small constant (1/8) added so the logarithm remains defined when a judgment is exactly correct. Lower log absolute error meant more accurate perception.
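The measure is simple enough to sketch directly; this follows the formula as commonly reported from the paper, log2(|judged − true| + 1/8):

```python
import math

def log_absolute_error(judged_pct, true_pct):
    """Cleveland & McGill's accuracy measure:
    log2(|judged - true| + 1/8); the 1/8 keeps the log finite at zero error."""
    return math.log2(abs(judged_pct - true_pct) + 1 / 8)

print(log_absolute_error(50.0, 50.0))             # perfect judgment -> -3.0
print(round(log_absolute_error(47.0, 50.0), 2))   # a 3-point miss -> ~1.64
```

The log scale compresses large misses, so mean log absolute error reflects typical performance rather than being dominated by a few wild guesses.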

Key Results

The experiments confirmed the theoretical ordering, with some refinements:

Position along a common scale produced the lowest error. Subjects could judge the ratio of two positions to within a few percentage points. This is the encoding used by scatter plots, line charts, and dot plots.

Position along non-aligned scales was nearly as good, but measurably less accurate. This is the encoding used when comparing values across small-multiple panels that share a common axis range but are physically separated.

Length was slightly less accurate than position, but still quite good — provided the lengths shared a common baseline. Without a common baseline (as in the interior segments of a stacked bar chart), length judgments degraded substantially.

Angle was markedly less accurate. Subjects showed systematic errors when judging the ratio of two angles, particularly in the mid-range (around 30-70 degrees). This is the encoding used by pie charts. The errors were not random — they were biased, meaning that pie chart readings are not just imprecise but systematically distorted.

Area produced large errors. Subjects consistently underestimated the ratio of two areas — a circle twice the area of another was judged to be less than twice as large. This is consistent with Stevens's Power Law: with an exponent of roughly 0.7 for area, a true doubling is perceived as only about 2^0.7 ≈ 1.6 times larger. This is the encoding used by bubble charts and treemaps.

Color saturation produced the largest errors among the tested channels. Subjects could distinguish "lighter" from "darker" but could not accurately judge the ratio. This is the encoding used by choropleth maps and heatmaps.

The Stacked Bar Chart Discovery

One of the paper's most practically important findings concerned stacked bar charts. In a regular bar chart, all bars share a common baseline (typically the x-axis), and the viewer judges each bar's height — a length judgment from a common base. This is accurate. But in a stacked bar chart, only the bottom segment has a common baseline. The upper segments are "floating" — their bases depend on the values of the segments below them.

Cleveland and McGill found that judging the size of these non-baselined segments was substantially less accurate than judging baselined segments. In perceptual terms, the task changed from "judge a length from a common baseline" (high accuracy) to "judge a length from a varying baseline" (lower accuracy, closer to an area or position-on-non-aligned-scales judgment).

This finding directly challenged the widespread use of stacked bar charts for precise comparison. It did not say stacked bar charts are useless — they can still show part-to-whole composition and overall totals effectively — but it demonstrated that comparing individual segments across stacks is perceptually harder than most designers assumed.

The Pie Chart Controversy

The paper's implications for pie charts were immediately controversial and remain so. The data showed that angle judgments (the perceptual task underlying pie chart reading) were significantly less accurate than position or length judgments (the tasks underlying bar charts and dot plots). The straightforward conclusion: for any comparison task that a pie chart can perform, a bar chart or dot plot will perform it more accurately.

But the conclusion requires nuance. Cleveland and McGill tested a specific task: judging the ratio of two values. They did not test the "part-of-whole" framing that pie charts uniquely emphasize. A pie chart communicates "these slices make up 100% of something" in a way that a bar chart does not, because the circular enclosure naturally implies a complete whole. Whether that framing benefit outweighs the accuracy cost is a design judgment, not a purely empirical question.

The pie chart debate continues to this day. Stephen Few argues forcefully against pie charts in almost all circumstances. Robert Kosara has conducted more nuanced research showing that pie charts perform reasonably well for simple part-of-whole judgments (especially "is this slice more or less than 25%?"). The Cleveland-McGill hierarchy does not resolve the debate, but it gives it a scientific grounding.

The Replication: Heer and Bostock (2010)

Twenty-six years later, Jeffrey Heer and Michael Bostock replicated the Cleveland-McGill experiments using Amazon Mechanical Turk — an online crowdsourcing platform. Their 2010 paper, "Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design," collected thousands of judgments from a far larger participant pool than the original study's small sample.

The results largely confirmed the original hierarchy. Position judgments were most accurate, followed by length, then angle, then area. The rankings were stable across a much larger and more demographically diverse sample than the original Bell Labs employees.

Heer and Bostock also introduced methodological refinements. They found that the accuracy gap between aligned and non-aligned position judgments was larger than Cleveland and McGill had estimated — suggesting that the spatial separation of small multiples imposes a real perceptual cost, even when the scales are identical.

The replication was significant for two reasons. First, it validated the original findings with modern methods and a broader population. Second, it demonstrated that crowdsourcing could be a viable method for visualization perception research, opening the door to larger and more varied experiments.

Practical Implications: What Changed After 1984

The Cleveland-McGill results had immediate practical consequences. Cleveland himself, in his subsequent books The Elements of Graphing Data (1985) and Visualizing Data (1993), applied the hierarchy to redesign dozens of common chart types. Specific recommendations that flowed directly from the experimental results include:

Dot plots over bar charts for some tasks. Cleveland argued that when the baseline is arbitrary or when the viewer needs to compare values across categories without a meaningful zero point, a dot plot (position encoding only) is more effective than a bar chart (position + length encoding). The dot plot strips away the bar — a visual element that implies a meaningful distance from zero — and relies on the highest-accuracy channel alone.

Trellis displays (small multiples) over overlaid plots. Cleveland's research showed that position along non-aligned scales, while less accurate than position along a common scale, was still far more accurate than the angle, area, or color encodings that overloaded single plots often require. This led to the development of trellis displays (now called small multiples or faceted plots): a grid of simple charts, each showing a subset of the data, rather than one complicated chart trying to show everything at once.

Replacing pie charts with grouped dot plots. The hierarchy placed angle well below position and length. For any comparison task that a pie chart performs, a grouped dot plot (position on a common scale) outperforms it. Cleveland was blunt about this recommendation, and it became one of the most debated prescriptions in the field.

Legacy and Influence

The Cleveland-McGill hierarchy has become the de facto standard for teaching and evaluating visualization design. It is cited in every major visualization textbook. It underlies the default chart-type recommendations in tools like Tableau, which was designed by researchers (including Jock Mackinlay, who extended Cleveland and McGill's work) who took perceptual accuracy seriously.

The hierarchy also influenced the development of grammar-of-graphics frameworks (ggplot2, Altair, Vega-Lite), which separate the data-to-visual mapping step from the rendering step. When these frameworks ask you to map a variable to an "aesthetic" or an "encoding channel," the concept traces directly to Cleveland and McGill's elementary perceptual tasks.

Perhaps most importantly, the paper established the principle that visualization design is an empirical question, not merely an aesthetic one. You can test whether one design works better than another. You can measure the accuracy of visual judgments. You can rank encoding channels. This empirical stance transformed visualization from a craft tradition into a science-informed discipline.

Limitations and Ongoing Debate

No study is without limitations, and the Cleveland-McGill paper is no exception. The experiments tested a narrow task: judging the ratio of two values. Real chart reading involves many other tasks — identifying trends, spotting outliers, estimating distributions, comparing groups — and the hierarchy may not apply equally to all of them. Some researchers have argued that for trend detection (as opposed to precise value comparison), line charts and even area charts may perform better than the hierarchy would predict, because the Gestalt principles of connection and continuity support trend perception in ways that the elementary perceptual task framework does not fully capture.

The stimuli were also simplified — stripped-down displays with no titles, legends, or contextual information. Real charts operate in rich contexts where labeling, annotation, and familiarity influence comprehension. A viewer who is deeply familiar with pie charts may extract information from them more quickly than the hierarchy suggests, because practice and expectation partially compensate for perceptual inefficiency.

These limitations do not invalidate the hierarchy. They define its scope. The Cleveland-McGill ranking tells you which channels are most accurate for quantitative ratio judgments under controlled conditions. For other tasks and in richer contexts, the ranking remains a strong default but is not the final word.

Discussion Questions

  1. Cleveland and McGill tested accuracy of ratio judgments. Are there chart-reading tasks where their hierarchy might not apply? For instance, would the ranking change if the task were "identify the trend" rather than "judge the ratio of two values"?

  2. The original study used a small sample of Bell Labs employees — a highly educated, technically trained population. The Heer-Bostock replication used Mechanical Turk workers, a more diverse but self-selected population. How might the results differ with other populations (e.g., children, elderly adults, people with low numeracy)?

  3. The hierarchy ranks individual encoding channels. But real charts use multiple channels simultaneously. How should a designer think about the interaction between channels — for example, when position and color reinforce the same variable, or when position and area encode different variables?

  4. The paper's impact has been enormous, but some critics argue that it has led to an overly narrow focus on accuracy at the expense of other goals (engagement, memorability, emotional impact). Do you agree? When might a less accurate chart be the better design choice?

  5. How would you design an experiment to test the perceptual accuracy of a visual encoding that Cleveland and McGill did not test — for example, animation speed, transparency, or border thickness?


Key Takeaways from This Case Study

  • The Cleveland-McGill hierarchy is grounded in controlled experiments, not opinion.
  • The fundamental test was ratio judgment accuracy across different visual encodings.
  • Position > length > angle > area > color saturation, with stacked (non-baselined) segments performing worse than baselined ones.
  • The hierarchy has been replicated with larger and more diverse samples and remains robust.
  • The hierarchy provides a ranking, not a prohibition — context and communication goals may justify lower-ranked encodings.
  • The paper established that visualization design effectiveness is an empirical question.