Case Study 1: Michael Waskom and the Birth of seaborn

In 2012, Michael Waskom was a PhD student in cognitive neuroscience at Stanford who needed better tools for plotting experimental data. He wrote a small Python library on the side and named it seaborn, after Samuel Norman Seaborn from The West Wing. A decade later, his side project is one of the most widely used statistical visualization libraries in the world.

The Situation

In 2012, the Python data visualization ecosystem was dominated by matplotlib. The library had been around for nearly a decade, was powerful and flexible, and could produce publication-quality figures. But it was verbose for statistical work. Plotting grouped data required loops. Computing confidence intervals required manual pandas operations before the matplotlib call. Small multiples required manual subplot arrangement with GridSpec. For a working scientist who needed to iterate on statistical plots dozens of times a day, matplotlib felt heavy.

Meanwhile, R had ggplot2 — Hadley Wickham's implementation of the grammar of graphics. ggplot2 was concise, declarative, and optimized for statistical workflows. An R user could produce a faceted scatter plot with confidence bands in three lines of code; the matplotlib equivalent required thirty. Scientists who worked in both languages felt the gap acutely. Some of them switched to R entirely; some waited for Python to catch up.

Michael Waskom stayed with Python, but rather than wait he started closing the gap himself. He was a PhD student at Stanford in Anthony Wagner's cognitive neuroscience lab, working on experiments that involved fMRI data and psychological measurements. His workflow required constant statistical visualization: distribution plots, group comparisons, regression overlays, small multiples. He used matplotlib but kept writing utility functions to handle the common cases, and those functions accumulated. At some point, he decided to clean them up into a proper library and release it publicly.

The library was called seaborn. Waskom first released it on PyPI in 2012 as version 0.1. The name was an inside joke — Samuel Norman Seaborn was a fictional speechwriter on the television show The West Wing, and his initials SNS became the library's Python alias (import seaborn as sns). The joke became so ingrained that a decade later, seaborn documentation still uses sns as the conventional alias and Waskom has mentioned the origin in interviews.

This case study examines seaborn's early history, the specific design decisions that shaped it, and what the library's trajectory teaches about open source development and how a personal tool becomes a community standard.

The Early Versions

Seaborn 0.1 (2012) was a small library — a few hundred lines of Python code that implemented specific statistical plot types. The initial focus was on:

Distribution plots: histograms with kernel density overlay, rug plots, box plots, violin plots.
Regression overlays: scatter plots with fitted regression lines and confidence bands.
Categorical plots: bar plots with error bars, strip plots showing individual observations.
Time-series plots: line charts with multiple series and aggregation.

The API was initially imperative — you called functions that took numpy arrays or pandas Series, and each function produced one chart. Over the next several versions, the API became more DataFrame-centric and more declarative, converging on the data=df, x="col", y="col", hue="col" pattern that is now canonical.
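The difference between the two styles is easiest to see side by side. The following is a minimal sketch, not code from seaborn's history: the DataFrame and its column names (dose, group, response) are invented for illustration. It draws the same grouped scatter plot first with a manual matplotlib loop and then with the data=df, x=..., y=..., hue=... pattern.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic long-form data, one row per observation (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dose": rng.uniform(0, 10, size=90),
    "group": np.repeat(["control", "low", "high"], 30),
})
df["response"] = 2.0 * df["dose"] + rng.normal(0, 1, size=90)

fig, (ax_manual, ax_sns) = plt.subplots(1, 2, figsize=(9, 4))

# Manual matplotlib: split the data yourself and plot each subset in a loop.
for name, subset in df.groupby("group"):
    ax_manual.scatter(subset["dose"], subset["response"], label=name)
ax_manual.set_xlabel("dose")
ax_manual.set_ylabel("response")
ax_manual.legend(title="group")

# seaborn: name the columns and let hue handle the grouping and the legend.
sns.scatterplot(data=df, x="dose", y="response", hue="group", ax=ax_sns)
```

Both panels show the same data. The seaborn call takes its axis labels from the column names and assigns colors per group without an explicit loop.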

Waskom's design philosophy, articulated in early seaborn documentation and blog posts, was:

1. Do not reinvent matplotlib. Every seaborn function should produce matplotlib Figure and Axes objects that users can further customize. seaborn should be a convenience layer, not a replacement.

2. Make statistical operations first-class. Common statistics (mean, confidence interval, regression fit, kernel density estimate) should be built into the plotting call, not require separate pandas or scipy steps.

3. Use pandas DataFrames as the primary input. By 2012, pandas was becoming the standard Python data structure. seaborn should accept DataFrames directly with string column names, eliminating the need to extract arrays manually.

4. Produce good default aesthetics. matplotlib's defaults at the time were ugly (Chapter 12 Case Study 2 covers this history). seaborn should ship with better defaults — cleaner spines, better color palettes, more readable fonts — so that simple calls produced publication-quality output.

5. Support faceting. The single most painful matplotlib task was building small multiples. seaborn's FacetGrid abstraction should make faceting a one-line operation.
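Several of these principles can be seen in a few lines of current seaborn. The sketch below uses synthetic data with made-up column names (condition, session, score); it illustrates the ideas and is not a reconstruction of the early API.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with hypothetical column names.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "condition": np.tile(["A", "B"], 60),
    "session": np.repeat(["day1", "day2", "day3"], 40),
    "score": rng.normal(50, 10, size=120),
})

# Principles 1-3: barplot takes the DataFrame directly, aggregates each
# condition to its mean with an error bar (a bootstrapped 95% confidence
# interval by default), and returns an ordinary matplotlib Axes.
ax = sns.barplot(data=df, x="condition", y="score")
ax.set_title("Mean score by condition")  # plain matplotlib from here on
ax.set_ylabel("Score (a.u.)")

# Principle 5: the figure-level catplot builds one panel per session
# from a single col= argument and returns a FacetGrid.
grid = sns.catplot(data=df, x="condition", y="score", col="session", kind="bar")
grid.set_axis_labels("Condition", "Score (a.u.)")
```

Because ax is a regular matplotlib Axes, any matplotlib method works on it after the seaborn call. This is the convenience-layer relationship that the first principle describes.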

These design choices look obvious now because they are the standard Python statistical visualization approach. At the time, they were choices — Waskom made them deliberately, and the alternative choices (ignoring pandas, requiring explicit statistics, leaving aesthetics alone) would have produced a very different library.

Key Moments in the Library's History

seaborn's development has had several inflection points that shaped its current state.

2012-2014: Incubation. seaborn grew slowly, used mostly within the Stanford neuroscience community and a handful of early adopters. Waskom continued his PhD and developed seaborn on the side. Major releases added new chart types and refined the API.

2015: First major API overhaul. seaborn 0.6 introduced the figure-level function abstraction, formalizing the FacetGrid pattern. The factorplot function (later renamed to catplot) became a canonical figure-level interface for categorical data.

2016-2018: Adoption by data science. As pandas and Jupyter gained traction in the data science community, seaborn became increasingly popular as the default statistical visualization tool. VanderPlas's Python Data Science Handbook (2016) included extensive seaborn coverage, which introduced thousands of readers to the library.

2018: Waskom completes his PhD. Waskom finished his doctorate in cognitive neuroscience and took a postdoctoral research position at New York University. seaborn remained a side project that he maintained in his spare time.

2020: seaborn 0.11 — the modern API. This was the biggest API change in seaborn's history. The release introduced displot, histplot, ecdfplot, kdeplot with unified signatures, and deprecated the old distplot function. The new API was cleaner, more consistent, and easier to teach. Many tutorials and books had to be updated.

2022: seaborn 0.12 — the experimental objects interface. Waskom introduced a new API called seaborn.objects that is closer to ggplot2's grammar of graphics. The classic API continues alongside it; the objects API is a more composable alternative for advanced users.

2023+: Continued maintenance. seaborn has a small team of maintainers, with Waskom still closely involved. The project is hosted on GitHub, has thousands of stars, and receives regular contributions from the community.

The Adoption Story

Seaborn's adoption curve was gradual rather than explosive. The library gained users slowly over many years as word of mouth, tutorial coverage, and academic use compounded. By 2018-2019, seaborn was the de facto default for Python statistical visualization. By 2023, it was installed in most scientific Python distributions and used by millions of practitioners.

Several factors drove the adoption:

1. The pandas integration. As pandas became the standard Python data structure, seaborn's DataFrame-first API became a natural fit. Users who worked with DataFrames gravitated to seaborn because it "just worked" with their data.

2. The Jupyter notebook boom. Jupyter notebooks emphasized iteration and exploration, and seaborn's conciseness was a perfect match. Users could produce a complex statistical plot in one line and see the result immediately.

3. The ggplot2 gap. Python users who admired ggplot2's declarative style found seaborn as the closest equivalent. While not identical to ggplot2, seaborn's data=df, x="col", hue="col" pattern was close enough to satisfy most ggplot2 refugees.

4. Educational adoption. Data science courses in universities and bootcamps often taught seaborn as the default statistical visualization library. This created a pipeline of new practitioners who learned seaborn first and adopted it in their professional work.

5. Compatibility with matplotlib. The ability to drop down to matplotlib for customization meant that seaborn's limitations were never blockers. Users could always escape to the underlying matplotlib layer when they needed fine control.

6. Waskom's thoughtful API design. seaborn's API has evolved deliberately, with careful attention to user feedback and consistency. The 0.11 overhaul, though disruptive, made the library cleaner and more learnable. Users trust that the API will continue to improve rather than fragment.

The Open Source Sustainability Question

seaborn is maintained primarily by volunteers. Michael Waskom has been the lead maintainer since 2012, with a small team of contributors. The project receives some corporate support through sponsorships (Tidelift, grants from NASA and the Chan Zuckerberg Initiative), but it is fundamentally a volunteer effort.

This raises questions about sustainability. A library used by millions of people, installed as a dependency in countless projects, is maintained largely by one person in his spare time, with help from a small group of contributors. If Waskom stopped maintaining seaborn tomorrow, the library would continue to exist but new features and bug fixes would slow dramatically. This is a common pattern in open source: NumPy, pandas, and matplotlib all face similar sustainability concerns.

The Python scientific community has made progress on this problem through organizations like NumFOCUS, which provides fiscal sponsorship and organizational support for major scientific Python projects. NumFOCUS sponsors matplotlib, pandas, numpy, and other libraries. seaborn is not officially a NumFOCUS project but benefits from the broader ecosystem of support.

The sustainability question is not unique to seaborn, but seaborn illustrates it particularly vividly. The library is widely used, professionally critical, and maintained by a small volunteer team. This tension — between the value the library provides and the resources available to maintain it — is a recurring theme in modern open source software.

What the History Teaches

The seaborn origin story offers several lessons for open source development and tool choice.

1. Side projects can become critical infrastructure. Waskom built seaborn as a set of utility functions for his research. Within a decade, it became a standard tool used by millions. The distance from "personal tool" to "community standard" is shorter than it seems.

2. Thoughtful API design compounds. seaborn's API has been refined over a decade with attention to consistency and teachability. The 0.11 overhaul was disruptive but produced a better API. Users who committed to the library now benefit from that thoughtfulness.

3. Compatibility with existing tools matters. seaborn's decision to produce matplotlib objects, rather than inventing a new rendering system, is a huge part of why the library succeeded. Users could adopt seaborn incrementally, dropping down to matplotlib whenever needed. A library that required a full stack change would have had a much harder adoption curve.

4. Good defaults are a feature. seaborn's prettier-than-matplotlib defaults were one of the library's earliest selling points. Users did not just get new functions; they got better-looking charts with no extra work. Defaults matter.

5. Community-driven development is slow but durable. seaborn grew over ten years, gradually and steadily, through community adoption rather than marketing. This growth pattern is slower than commercial tools but produces more resilient ecosystems.

6. Academic origins shape the library. seaborn's focus on statistical operations reflects Waskom's academic background. The library prioritizes scientific use cases — confidence intervals, regression fits, distributional plots — over business dashboard use cases. This is not a criticism; it is a consequence of who built the library and for what purpose.

Discussion Questions

1. On the side-project-to-infrastructure trajectory. seaborn started as a personal tool and became a standard. What other open source libraries have followed similar trajectories? What does this pattern teach us about how useful software gets built?

2. On sustainability. seaborn is maintained by a small team of volunteers. Is this sustainable long-term? What should happen if the lead maintainer steps away?

3. On API overhauls. The seaborn 0.11 release changed the API significantly. Was this the right call, given the disruption to existing tutorials and code? How should libraries balance API improvement with backward compatibility?

4. On the name. seaborn is named after a character from a TV show. Does the whimsical name help or hurt the library's adoption? Would a more serious name (like "statplot" or "pyviz") have changed the trajectory?

5. On the relationship with matplotlib. seaborn depends on matplotlib but competes with it for user attention. Is this relationship symbiotic or parasitic? How should the two communities coordinate?

6. On your own use. Based on this case study, would you use seaborn for your own projects? What would make you choose a different library?

Seaborn is one of many Python libraries that started as a personal tool and grew into community infrastructure. Michael Waskom's contribution — a decade of careful design and maintenance — has made statistical visualization in Python substantially easier for millions of practitioners. The next time you type import seaborn as sns, remember the grad student who wrote the original code on the side, the TV character who lent his initials, and the slow-but-steady community adoption that turned one scientist's utility functions into a standard tool.