Case Study 2: DABEST and the Rise of Estimation Plots

In 2019, a group of biologists led by Joses Ho at the A*STAR research agency in Singapore published a paper in Nature Methods titled "Moving beyond P values: data analysis with estimation graphics." The paper argued that the traditional bar chart + p-value approach to reporting biological experiments was misleading and should be replaced with estimation plots that show effect sizes and confidence intervals. They released a Python package called DABEST (Data Analysis with Bootstrap-coupled ESTimation) that made estimation plots accessible to working biologists. Within a few years, estimation plots had been endorsed by multiple journals and were appearing in papers across biomedicine. The DABEST story illustrates how a well-designed visualization can change not just how data is displayed but how scientists reason about statistics.


The Situation: Bar Charts and P-Values in Biological Statistics

Biological experiments typically compare groups: a control group and a treated group, or several doses of a drug, or cells before and after a stimulus. The standard way to report these comparisons has been, for decades, a bar chart with error bars and p-value annotations. Each bar is a group's mean; the error bar is usually the standard error of the mean (SEM); one or more asterisks above the bars indicate p < 0.05.

This format has several problems. The dynamite plot critique from Chapter 18 covered the most obvious one: showing only mean ± SEM hides the full distribution of the data. Outliers, skewness, and bimodality are invisible. Two groups with very different distributions can produce identical bar charts if their means and SEMs happen to match.

But there is a deeper problem. The bar chart + p-value format encourages a specific style of statistical reasoning: "is the effect significant at α = 0.05?" A star means significant; no star means not significant. The actual magnitude of the effect and its uncertainty fade into the background. Researchers and readers fall into a binary mindset — significant vs. not significant — that the underlying statistics do not support.

Statisticians have been criticizing this approach for decades. The American Statistical Association published a 2016 statement explicitly warning against the misuse of p-values. A 2019 comment in Nature (by Amrhein, Greenland, and McShane) called for the retirement of "statistical significance" as a binary concept. More than 800 researchers signed on.

But the bar-chart-plus-p-value format persisted. It was what authors had always done, what reviewers expected, and what was easy to produce with GraphPad or Excel. Changing it would require a new tool with a new convention and enough momentum to displace the old one.

The DABEST Solution

Joses Ho and colleagues at the A*STAR laboratory in Singapore decided to build that tool. They wanted a visualization that would:

  1. Show the full data distribution rather than hiding it behind a bar.
  2. Emphasize effect sizes and confidence intervals rather than p-value thresholds.
  3. Make it easy to interpret — no special training required to read.
  4. Be as easy to produce as a bar chart, so the barrier to adoption was low.

The result was the estimation plot (also called a Gardner-Altman plot, after Martin Gardner and Douglas Altman, who advocated plotting the mean difference with its confidence interval in the 1980s). The layout:

  • Left panel: shows the raw data for each group, as a swarm plot or strip plot. Every observation is visible as a dot.
  • Right panel (narrower): shows the effect size — the difference between the two groups' means — with its confidence interval. The reference line is set at the control group's mean, and the effect is plotted as a distance from that reference.

The right panel is the key innovation. Instead of showing two means and letting the reader compute the difference mentally, it shows the difference directly, with uncertainty. The reader can see at a glance: "the treatment increased the outcome by 15 units ± 3 units (95% CI)." No p-values required.
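The statistical machinery behind this panel is straightforward to reproduce. A minimal sketch of a bootstrap confidence interval for a mean difference, using NumPy (the data, sample sizes, and resample count here are illustrative, not DABEST's internals):

```python
import numpy as np

rng = np.random.default_rng(42)
# Illustrative data: the treatment shifts the mean up by roughly 15 units
control = rng.normal(100, 10, 20)
treatment = rng.normal(115, 10, 20)

observed_diff = treatment.mean() - control.mean()

# Resample each group with replacement and recompute the mean difference
boot_diffs = np.array([
    rng.choice(treatment, size=treatment.size, replace=True).mean()
    - rng.choice(control, size=control.size, replace=True).mean()
    for _ in range(5000)
])

# 95% percentile confidence interval of the mean difference
lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"mean difference: {observed_diff:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```

The right panel of an estimation plot is essentially the `boot_diffs` distribution drawn vertically, with `lo` and `hi` marking the interval.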

Joses Ho et al. packaged this visualization as a Python library called DABEST. The API is straightforward:

import numpy as np
import pandas as pd
import dabest

# Illustrative data: 20 control and 20 treated observations
rng = np.random.default_rng(0)
control_data = rng.normal(100, 10, 20)
treatment_data = rng.normal(115, 10, 20)

df = pd.DataFrame({
    "group": ["control"] * 20 + ["treatment"] * 20,
    "outcome": np.concatenate([control_data, treatment_data]),
})

analysis = dabest.load(df, x="group", y="outcome", idx=("control", "treatment"))
analysis.mean_diff.plot()

Accessing the mean_diff attribute runs bootstrap resampling to compute the confidence interval of the mean difference, and .plot() draws the two-panel estimation plot. Variations are available: median difference instead of mean difference, Cohen's d standardized effect size, Hedges' g, shared-control multi-group comparisons, and paired comparisons for repeated measures.
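For readers who want the formulas, the standardized effect sizes named above can be computed by hand. This sketch uses the standard pooled-SD definitions, not DABEST's internal code:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
                        / (na + nb - 2))
    return (np.mean(b) - np.mean(a)) / pooled_sd

def hedges_g(a, b):
    """Hedges' g: Cohen's d with a small-sample bias correction factor."""
    n = len(a) + len(b)
    return cohens_d(a, b) * (1 - 3 / (4 * n - 9))
```

A standardized effect size expresses the difference in units of spread, so groups measured on different scales become comparable; the Hedges correction shrinks d slightly, which matters most for the small samples typical of bench biology.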

The package was released open-source on GitHub and PyPI. The Ho et al. paper introducing it was published in Nature Methods in 2019 and accompanied by a live-updating website (estimationstats.com) where anyone could upload data and produce estimation plots without installing anything.

The Response

The response was mixed but significant. Supporters welcomed DABEST as the tool they had been waiting for — an easy way to report biological experiments honestly. Critics argued that the format was unfamiliar and that reviewers and editors were not ready for it.

Over the following years, several things happened:

Journal endorsements: Nature Methods published an editorial explicitly supporting estimation plots as a preferred format. eLife added guidance to its author instructions recommending them. A few specialty journals made estimation plots mandatory for certain types of submissions.

Growing adoption: by 2022-2023, estimation plots were appearing regularly in biomedical papers. Not majority practice, but a visible minority, and growing. Some labs adopted them as their default and never went back to bar charts.

Software integration: DABEST was integrated into several data science platforms. An R version was released. A plugin was developed for JASP, an open-source statistics package. The visualization spread beyond the original Python library.

Pedagogy: statistics courses started teaching estimation plots alongside traditional bar-chart-plus-t-test reporting. Textbooks began including them. Students who learned estimation plots as the default would carry the format into their future careers.

Resistance: some reviewers and senior researchers continued to prefer the old format. The resistance was based partly on habit (p-values were what they grew up with) and partly on legitimate concerns (effect sizes are harder to interpret for multi-way comparisons). The old format has not disappeared; it has become one of several options rather than the only one.

By 2024, estimation plots were a recognized and legitimate alternative to bar charts in biological statistics. They were not universal, but they were no longer marginal. The DABEST project had succeeded in its core goal: changing what it was possible to publish without extraordinary justification.

The Visual Design of Estimation Plots

The DABEST estimation plot is a specific visual design that makes several deliberate choices. Each choice is worth examining.

The two-panel layout (data on the left, effect on the right) is chosen so the reader's eye naturally moves from the raw data to the summary. A one-panel version (just the effect size) would lose the data context; a separate-figure version (effect size in a separate figure) would lose the visual connection. The two-panel layout keeps both in view.

The swarm plot of individual data points shows the distribution. Each dot is one observation. The reader can see outliers, skewness, bimodality, and sample size all at a glance. This is the key difference from bar charts, which collapse all this information into a single mean.

The bootstrap distribution of the effect size is drawn vertically on the right panel, often as a density curve, with the confidence interval marked as a bar along it. This uses position encoding to show the effect size and its uncertainty. Position is the strongest visual encoding (Chapter 2), so this is an appropriate choice for the primary statistical information.

The reference line at the control group's mean gives the effect size a zero point. The treatment's effect is shown as a distance above or below this line, not as an absolute value. This emphasizes the comparison rather than the absolute levels.

The absence of p-value annotations is a deliberate statement. DABEST does not show stars or "n.s." labels. The reader is invited to judge the effect size and confidence interval directly, without the crutch of a significance threshold.

These choices are not arbitrary. Each one reflects a specific statistical and visual principle, and the combined effect is a chart that encourages different reasoning than a bar chart does. The reader of a bar chart asks "is it significant?" The reader of an estimation plot asks "how big is the effect, and how certain are we about the estimate?"
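The combined layout can be approximated by hand. A rough matplotlib sketch, with jittered strip plots standing in for DABEST's swarm plots and all data simulated for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
control = rng.normal(100, 10, 20)
treatment = rng.normal(115, 10, 20)

# Bootstrap distribution of the mean difference for the right panel
boot = np.array([
    rng.choice(treatment, 20).mean() - rng.choice(control, 20).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
diff = treatment.mean() - control.mean()

# Left panel wide (raw data), right panel narrow (effect size)
fig, (ax_data, ax_diff) = plt.subplots(
    1, 2, figsize=(6, 4), gridspec_kw={"width_ratios": [3, 1]}
)

# Left panel: every observation visible as a jittered dot
for i, vals in enumerate([control, treatment]):
    ax_data.scatter(i + rng.uniform(-0.1, 0.1, vals.size), vals, alpha=0.7)
ax_data.set_xticks([0, 1])
ax_data.set_xticklabels(["control", "treatment"])
ax_data.set_ylabel("outcome")

# Right panel: mean difference with its CI, zero at the control mean
ax_diff.axhline(0, color="gray", lw=1)  # reference line = control mean
ax_diff.errorbar([0], [diff], yerr=[[diff - lo], [hi - diff]], fmt="o")
ax_diff.set_xticks([0])
ax_diff.set_xticklabels(["mean diff"])
fig.tight_layout()
```

The `width_ratios` argument implements the "narrower right panel" choice directly, and the `axhline` at zero is the reference-line convention described above.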

Theory Connection: How Chart Types Shape Statistical Reasoning

The DABEST story illustrates an important principle: the way data is visualized affects the questions people ask about it. A bar chart encourages binary significance thinking because it displays only means and error bars, with stars as the summary. An estimation plot encourages magnitude-plus-uncertainty thinking because it displays distributions and confidence intervals.

This is not a new observation. Statisticians have argued for decades that p-values are overused and confidence intervals underused. But the arguments did not change practice until a visualization tool made the alternative easier. DABEST succeeded where text-based arguments failed because it gave researchers a concrete tool that was as easy to use as the old format.

The lesson for practitioners: changing what people see changes what people think about. If you want researchers to focus on effect sizes, give them a chart that shows effect sizes prominently. If you want readers to notice uncertainty, give them a chart with uncertainty visible. The chart type is not just a display; it is a cognitive tool that shapes what questions get asked and answered.

This is also a lesson about the limits of prescriptive advice. Telling people "please use effect sizes instead of p-values" rarely works. Giving people a tool that produces effect-size plots with a single function call does work, because it aligns the incentives: the new behavior is easier than the old one.

The DABEST Package in Practice

Using DABEST in a Python workflow is straightforward. A typical session:

import numpy as np
import pandas as pd
import dabest

# control_values, low_dose_values, high_dose_values: arrays of 30 measurements each
df = pd.DataFrame({
    "condition": ["control"] * 30 + ["low_dose"] * 30 + ["high_dose"] * 30,
    "response": np.concatenate([control_values, low_dose_values, high_dose_values]),
})

analysis = dabest.load(df, x="condition", y="response",
                        idx=("control", "low_dose", "high_dose"))

# Various plot types:
analysis.mean_diff.plot(fig_size=(6, 4))     # mean differences
analysis.median_diff.plot()                   # median differences
analysis.cohens_d.plot()                      # standardized Cohen's d
analysis.hedges_g.plot()                      # Hedges' g (small-sample correction)

# Access the underlying statistics:
print(analysis.mean_diff.results)

The output includes the estimation plot and a numeric summary with the effect sizes, confidence intervals, and (optionally) p-values. The statistics are computed via bootstrap resampling, which does not assume the data follow a normal distribution.

For a paired comparison (before-and-after measurements on the same subjects), pass paired="baseline" and name the column that identifies each subject. Here df_paired holds one "before" and one "after" row per subject:

paired = dabest.load(df_paired, x="condition", y="response",
                     idx=("before", "after"), paired="baseline", id_col="subject")
paired.mean_diff.plot()
paired.mean_diff.plot()

The resulting chart shows each subject as a line connecting the "before" and "after" observations, revealing individual responses alongside the aggregate effect. This is more informative than separate bar charts of before and after means.
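The paired left panel, one line per subject, can be sketched in a few lines of matplotlib (subject count and data here are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
before = rng.normal(100, 10, 12)
after = before + rng.normal(15, 5, 12)  # each subject improves by ~15 units

fig, ax = plt.subplots(figsize=(3, 4))
for b, a in zip(before, after):
    ax.plot([0, 1], [b, a], color="gray", alpha=0.6)  # one line per subject
ax.set_xticks([0, 1])
ax.set_xticklabels(["before", "after"])
ax.set_ylabel("response")
```

The slope of each line is that subject's individual response; a consistent upward slope across subjects is visible even when the group means alone would look noisy.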


Discussion Questions

  1. On format change. DABEST succeeded where text-based arguments for effect sizes failed. Why are visual changes sometimes more effective than verbal ones?

  2. On barriers to adoption. Some reviewers still prefer bar charts and p-values. What are their reasons, and are they legitimate?

  3. On pedagogy. If estimation plots replaced bar charts in introductory statistics courses, what would students gain? What might they lose?

  4. On multi-group comparisons. Estimation plots work best for 2-group and small multi-group comparisons. How do they scale to many groups? What are the alternatives at scale?

  5. On the DABEST API. The package makes estimation plots as easy to produce as bar charts. Is this the key to its success, or would the format have spread even without the package?

  6. On your own practice. The next time you compare two groups in a figure, will you use a bar chart or an estimation plot? What does the answer depend on?


DABEST is a specific tool, but it is also a case study in how visualization can change statistical culture. The estimation plot format makes effect sizes and uncertainty visible in a way that bar charts do not, and by making the format easy to produce, DABEST removed the last practical barrier to adoption. The lesson — that changing what people see changes what people think about — applies far beyond biology. When you design a visualization, you are shaping how readers reason about the data, not just displaying it. Choose the format that matches the reasoning you want to encourage.