Case Study 2: The Gapminder Relationship Plot

Hans Rosling's Gapminder visualizations (Chapter 9 Case Study 1) used animated bubble charts to tell the story of global development. The same data can be visualized statically using seaborn, with scatter plots, regression overlays, and faceting. This case study walks through the reconstruction and examines what seaborn adds (and does not add) compared to Rosling's original tool.

The Situation

Hans Rosling's Gapminder project, covered in Chapter 9, used an animated bubble chart to show how countries changed over time on four dimensions: income per capita (x-axis), life expectancy (y-axis), population size (bubble size), and region (color). The animation played from 1800 to 2010, showing how countries moved through the health-wealth space as they developed.

The Rosling version was animated and interactive. It ran on custom software (Trendalyzer, later acquired by Google). The visualization was memorable and shaped how millions of people understood global development.

A static reconstruction in seaborn loses the animation but preserves the core insight: countries cluster in different parts of the health-wealth space, and the clusters have changed over time. With seaborn's scatter plot, regression overlays, and faceting, you can produce a static version that conveys much of the same information without requiring custom animation software.

This case study walks through the reconstruction. It is a practical exercise in applying the techniques from Chapters 16-18 to a meaningful real-world dataset. It also illustrates the trade-offs between static and animated visualization: what each medium does well, and when each is appropriate.

The Data

The Gapminder data includes:

Country: ~200 countries worldwide.
Year: annual data from ~1800 to present (exact range varies by variable).
GDP per capita: income in inflation-adjusted dollars.
Life expectancy: years at birth.
Population: total population.
Region: continent or development classification (Africa, Americas, Asia, Europe, Oceania).

The data is available from Gapminder (gapminder.org) or through the gapminder Python package. For this case study, we will use a simplified version covering 142 countries and five snapshot years: 1952, 1972, 1992, 2007, and 2020.

import pandas as pd
# Load from a local CSV or the gapminder package
gap = pd.read_csv("gapminder.csv")

Each row is one (country, year) combination. The plotting code below assumes the columns are named country, continent, year, gdpPercap, lifeExp, and pop.
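Before plotting, it can help to confirm that the loaded table has the columns the plotting code relies on and that each (country, year) pair appears once. The following is a minimal sketch under those assumptions; the helper name check_gapminder_columns is hypothetical, and the two sample rows are made-up placeholders rather than real Gapminder values.

```python
import pandas as pd

# Column names assumed from the plotting code in this case study.
EXPECTED_COLUMNS = ["country", "continent", "year", "lifeExp", "pop", "gdpPercap"]


def check_gapminder_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Tiny made-up table with the assumed schema; the values are placeholders.
sample = pd.DataFrame(
    {
        "country": ["A", "A"],
        "continent": ["Asia", "Asia"],
        "year": [1952, 2007],
        "lifeExp": [40.0, 70.0],
        "pop": [1_000_000, 2_000_000],
        "gdpPercap": [500.0, 5000.0],
    }
)

missing = check_gapminder_columns(sample)
# One row per (country, year): duplicates would suggest a malformed file.
has_duplicates = bool(sample.duplicated(subset=["country", "year"]).any())
```

In practice the same two checks would be run on gap right after pd.read_csv, before any filtering by year.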

The Visualizations

Visualization 1: Static Scatter for a Single Year

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid", context="notebook")

gap_2007 = gap[gap["year"] == 2007]

fig, ax = plt.subplots(figsize=(10, 7))
sns.scatterplot(
    data=gap_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    hue="continent",
    sizes=(30, 2000),
    alpha=0.7,
    palette="tab10",
    ax=ax,
)
ax.set_xscale("log")  # GDP per capita is highly skewed
ax.set_xlabel("GDP per Capita (USD, log scale)")
ax.set_ylabel("Life Expectancy (years)")
ax.set_title("Global Development, 2007")
ax.legend(bbox_to_anchor=(1.02, 1), loc="upper left")
fig.savefig("gapminder_2007.png", dpi=300, bbox_inches="tight")

This is the basic Rosling-style bubble chart for a single year. Key elements:

Log x-axis: GDP per capita spans several orders of magnitude. Log scale prevents the high-income outliers from dominating the chart and makes developing-country details visible.
Size by population: larger bubbles for larger countries (China and India are the biggest).
Hue by continent: each region gets its own color for visual grouping.
alpha=0.7: partial transparency keeps overlapping bubbles visible.
Legend outside the plot: avoids overlapping with the data.

The chart shows the well-known pattern: high-income countries have high life expectancy (upper right), low-income countries have lower life expectancy (lower left), and there is a clear curve connecting them. The continental grouping is visible — Africa clusters in the lower-left, Europe and Oceania in the upper-right.
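The effect of the log x-axis can be checked with a little arithmetic. The sketch below uses three hypothetical GDP-per-capita values (not taken from the dataset) and computes where each would sit, as a fraction of the axis width, on a linear axis and on a log axis.

```python
import math

# Hypothetical GDP-per-capita values spanning two orders of magnitude.
gdp_values = [500, 5_000, 50_000]
lo, hi = min(gdp_values), max(gdp_values)

# Fractional position along a linear axis running from lo to hi.
linear_pos = [(v - lo) / (hi - lo) for v in gdp_values]

# Fractional position along a log axis over the same range.
log_span = math.log10(hi) - math.log10(lo)
log_pos = [(math.log10(v) - math.log10(lo)) / log_span for v in gdp_values]
```

With these values, the middle country sits about 9% of the way across the linear axis but at the midpoint of the log axis, which is why the log scale spreads out the lower-income range.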

Visualization 2: Regression Overlay

fig, ax = plt.subplots(figsize=(10, 7))
sns.regplot(
    data=gap_2007,
    x="gdpPercap",
    y="lifeExp",
    scatter_kws={"alpha": 0.6, "s": 50},
    line_kws={"color": "red", "linewidth": 2},
    logx=True,
    ax=ax,
)
ax.set_xscale("log")
ax.set_title("GDP per Capita vs. Life Expectancy (2007)")

sns.regplot fits a regression line and overlays it on the scatter. The logx=True parameter fits the model y ~ log(x), which suits GDP data. On a linear x-axis that fit would appear as a curve that flattens at high incomes; on the log x-axis set here it appears as a straight line. Either way it shows the diminishing returns of income: each doubling of GDP per capita is associated with roughly the same absolute gain in life expectancy.
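The arithmetic behind a fit in log(x) space can be sketched with NumPy. This is an illustration of the idea, not a reproduction of regplot's internals: the data are synthetic, generated from invented coefficients, and base-10 logs are used for readability.

```python
import numpy as np

# Synthetic data generated from lifeExp = a + b * log10(gdp); the
# coefficients are invented for illustration only.
a_true, b_true = 20.0, 12.0
gdp = np.array([500.0, 1_000.0, 2_000.0, 4_000.0, 8_000.0, 16_000.0, 32_000.0])
life_exp = a_true + b_true * np.log10(gdp)

# Ordinary least squares of y on log10(x): a straight line in log space.
b_hat, a_hat = np.polyfit(np.log10(gdp), life_exp, deg=1)

# Under this model every doubling of GDP adds the same number of years.
gain_per_doubling = b_hat * np.log10(2)
```

Because the model is linear in log(x), the gain from 500 to 1,000 is the same as the gain from 16,000 to 32,000; that constant step per doubling is what "diminishing returns" means here.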

Visualization 3: Faceted by Year

gap_snapshots = gap[gap["year"].isin([1952, 1972, 1992, 2007, 2020])]

g = sns.relplot(
    data=gap_snapshots,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    hue="continent",
    col="year",
    col_wrap=3,
    kind="scatter",
    height=4,
    aspect=1.1,
    sizes=(20, 1500),
    alpha=0.7,
    palette="tab10",
)

# Apply log scale to all panels
for ax in g.axes.flat:
    ax.set_xscale("log")

g.set_axis_labels("GDP per Capita (USD, log scale)", "Life Expectancy (years)")
g.figure.suptitle("Global Development Over Time", fontsize=14, y=1.02)
g.savefig("gapminder_faceted.png", dpi=300, bbox_inches="tight")

This produces a 2×3 faceted grid (with one empty panel) showing one scatter plot per year. The same colors and size mapping are preserved across panels, so the reader can compare years directly.

The static faceted version shows the progression over time — countries moving from the lower-left (low income, low life expectancy) toward the upper-right (high income, high life expectancy). The pattern that Rosling's animation showed as motion is visible here as static snapshots, one per year.
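As an alternative to looping over the panels, FacetGrid.set can apply one axis setting to every facet. The sketch below assumes a small synthetic stand-in for gap_snapshots (placeholder countries and invented values) and uses a non-interactive backend so it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import pandas as pd
import seaborn as sns

# Synthetic stand-in for gap_snapshots: 4 made-up countries x 5 years.
years = [1952, 1972, 1992, 2007, 2020]
continents = ["Africa", "Americas", "Asia", "Europe"]
rows = []
for i, year in enumerate(years):
    for j, continent in enumerate(continents):
        rows.append(
            {
                "country": f"country_{j}",
                "continent": continent,
                "year": year,
                "gdpPercap": 500.0 * (j + 1) * (1.5 ** i),
                "lifeExp": 40.0 + 5.0 * i + 3.0 * j,
            }
        )
demo = pd.DataFrame(rows)

g = sns.relplot(
    data=demo, x="gdpPercap", y="lifeExp", hue="continent",
    col="year", col_wrap=3, kind="scatter", height=3,
)
g.set(xscale="log")  # applies the same setting to every panel

n_panels = len(g.axes.flat)
scales = [ax.get_xscale() for ax in g.axes.flat]
```

Five years with col_wrap=3 yield five axes laid out on a 2×3 grid, so the sixth slot stays empty.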

Visualization 4: A Specific Country's Trajectory

country = "China"
country_data = gap[gap["country"] == country].sort_values("year")  # year order for sort=False

fig, ax = plt.subplots(figsize=(10, 7))
sns.scatterplot(
    data=country_data,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    sizes=(100, 2000),
    alpha=0.7,
    ax=ax,
)
sns.lineplot(
    data=country_data,
    x="gdpPercap",
    y="lifeExp",
    estimator=None,
    sort=False,  # keep the temporal order, not sorted by x
    linewidth=1,
    color="gray",
    alpha=0.5,
    ax=ax,
)

# Annotate select years
for _, row in country_data.iloc[::10].iterrows():
    ax.annotate(str(int(row["year"])), (row["gdpPercap"], row["lifeExp"]),
                fontsize=8, ha="center")

ax.set_xscale("log")
ax.set_xlabel("GDP per Capita (USD, log scale)")
ax.set_ylabel("Life Expectancy (years)")
ax.set_title(f"{country}'s Development Trajectory, 1800-2020")

This shows China's path through the health-wealth space over time. Each year is a scatter point; a gray line connects them in temporal order. The sort=False parameter is critical: it tells lineplot to connect the points in the order given by the DataFrame, not in x-sorted order, so the rows must already be in year order. Year labels at every 10th row show when China reached each position.

The result is a "country trajectory" chart that shows how one country moved through the space. Do this for several countries (use a loop or faceting) and you get a multi-country trajectory plot.

Trade-offs vs. Rosling's Animation

The static seaborn versions preserve much of the Rosling visualization but lose one thing: the temporal animation. Here is what each medium does well.

Rosling's animation:

Vivid temporal progression. Watching countries move through the space is compelling.
Gestalt of movement. You see the overall pattern of global development as a visual sweep.
Live presentation. Works beautifully in a talk where Rosling himself was narrating.

seaborn static version:

Print-compatible. Works in papers, reports, and PDFs.
Precise reading. You can study any specific year or country at your own pace.
No custom software. Any seaborn user can reproduce it.
Shareable. A PNG can be embedded anywhere.
Composable. The faceted version shows multiple years simultaneously.

Neither is strictly better. For a live presentation or a memorable TED talk, the animation is right. For a report, a paper, or a static article, the seaborn version is right. For serious data exploration, the seaborn version is often more useful because you can study a fixed frame at your own pace and examine specific points.

The Rosling animation and the seaborn static version complement each other. A good data journalism project might include both: the animation for impact and the static version for careful reading.

Lessons for Practice

1. The same data supports multiple visualizations. Gapminder data supports bubble charts, line charts, small multiples, regression plots, trajectory plots, and many more. The right chart depends on the question and the medium.

2. Log scales are often essential. GDP per capita spans orders of magnitude. A linear x-axis would squish developing countries into a thin strip at the left. Log scale reveals the structure in every income range.

3. Size encoding for population is conventional but imperfect. Rosling's bubble chart uses size for population, but size is a low-accuracy encoding (Chapter 2). The reader can tell that China and India are large but cannot accurately compare their sizes. Use size when it is conventional for the domain (bubble charts for Gapminder-style plots) but remember that precision is limited.

4. Color for categorical encoding is the natural fit for continents. seaborn's hue parameter handles this automatically. Pick a palette that distinguishes the categories: tab10 provides 10 distinct colors and Set2 provides 8, so either works for the five continents here.

5. Faceting is the static alternative to animation. Small multiples over time are the static equivalent of an animation. They let the reader see the temporal progression at their own pace. The Chapter 8 small-multiples principle applies directly.

6. Country trajectories use sort=False. When you want to connect points in temporal order rather than x-sorted order, pass sort=False to lineplot. This is the canonical way to show a path through a 2D space.

7. Direct annotation beats legends for specific points. When you want to call out specific years or countries, use ax.annotate to place text directly at the relevant points. Legends are for categorical encodings; annotations are for specific callouts.
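The limited precision of size encoding (lesson 3) has a simple geometric component, sketched below with hypothetical populations. The sketch assumes marker area is proportional to population; seaborn's sizes=(lo, hi) range only approximates this, because the smallest value maps to lo rather than to zero.

```python
import math

# Hypothetical populations: the second country is four times the first.
pop_small, pop_large = 50_000_000, 200_000_000

# Assumption: marker area is proportional to population.
area_ratio = pop_large / pop_small

# Diameter scales with the square root of area.
diameter_ratio = math.sqrt(area_ratio)
```

A fourfold difference in population shows up as only a twofold difference in bubble diameter, which is one reason readers tend to underestimate how much larger the big bubbles are.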

Discussion Questions

On the static vs. animated trade-off. Rosling's animation is memorable; the seaborn static version is precise. Which would you choose for: a TED talk, a scientific paper, a policy brief, a news article?
On the bubble chart itself. Size-by-population is a convention for Gapminder-style plots but is an imperfect encoding. Would a different encoding (filter by top-N countries? color by population bin?) work better? Why or why not?
On trajectory plots. The country trajectory version shows one country's path over time. Is this more informative than the faceted multi-year snapshot, or less? When would you use each?
On log scales. GDP per capita requires log scale. Other variables (life expectancy, literacy, infant mortality) are usually plotted on linear scales. How do you explain log scales to a general audience when you need them?
On color choices. The case study uses tab10 for continents. For five regions (Africa, Americas, Asia, Europe, Oceania), is this the right palette? What other choices would work?
On reproducibility. The seaborn version is fully reproducible — any reader can run the code and get the same chart. Rosling's original animation relied on custom software. How does reproducibility affect the long-term value of a visualization?

Hans Rosling's Gapminder visualizations are famous for their animation and presentation impact. The seaborn static reconstructions preserve most of the information content at the cost of the animation. Both are valid; both serve different contexts. The exercise of reproducing famous visualizations in your own tools is a good way to understand what each medium does well, and to build fluency with the seaborn functions you have learned in this chapter.