Case Study 1: How matplotlib Came to Dominate Scientific Python
In 2002, a neurobiology postdoc named John Hunter needed to plot EEG data from epilepsy patients. The tools he had access to were not good enough. He built his own plotting library on weekends and named it matplotlib. Twenty years later, a large share of the charts made in Python are drawn by that weekend project or by libraries built on top of it. This case study is about what it took, why it mattered, and what it teaches about the design decisions you are learning to work with.
The Situation
In 2002, the Python data ecosystem was tiny compared to what it is today. There was no pandas (that came in 2008). There was no scikit-learn (2007). NumPy did not exist yet either: it would be created in 2005-2006 by unifying the older Numeric and Numarray array packages. Scientific Python users had to stitch together tools from multiple languages and libraries: MATLAB for interactive analysis, Fortran for heavy numerics, Python for scripting and glue code. Plotting was a particular pain point. Python had a few plotting libraries (Chaco, Biggles, a couple of wrappers around gnuplot), but none of them had the combination of ease of use, flexibility, and output quality that MATLAB's plot function provided.
John Hunter was a postdoctoral fellow in neurobiology at the University of Chicago, working on the analysis of electrocorticography data from epilepsy patients. His workflow was unusual for scientific computing at the time: instead of MATLAB, he wanted to use Python, because Python had better string handling, better file parsing, and better general-purpose programming capabilities. But he needed to plot time-series data, and the existing Python plotting options did not meet his needs. He wanted to produce MATLAB-quality figures from Python code.
The practical problem was the paper he was writing. Medical journal figures had specific requirements: high resolution, precise control over layout, publication-ready vector output. MATLAB could do this, but MATLAB's licensing costs were prohibitive (especially for academic work at scale), and its text handling was primitive. Python was free, had better text handling, and was already the language Hunter used for the rest of his analysis pipeline. The only missing piece was the plotting.
Rather than switch away from Python, Hunter decided to build a plotting library that would feel familiar to MATLAB users. The design principle was simple: match MATLAB's interface and output quality so closely that a MATLAB user could transition to Python with minimal friction. The weekend project started in early 2003. By the end of 2003, matplotlib version 0.5 was released to the Python scientific community. By 2005, matplotlib was the de facto standard for Python plotting. By 2010, it was included in every major Python scientific distribution, used in thousands of research papers, and taught in universities around the world.
Hunter's neurobiology research was important, but matplotlib, the side project, has turned out to be his most widely used contribution. He passed away in 2012 at the age of 44, years before matplotlib's adoption reached its current scale. The library is now maintained by a team of volunteer developers and sustained by NumFOCUS, with contributions from hundreds of people around the world. But the core design decisions that shape matplotlib today were made by one postdoc in 2003 who needed to produce better EEG figures for his research.
The Design Goals
Hunter's original design goals for matplotlib, stated in early papers and documentation, were:
1. MATLAB compatibility. The API should feel familiar to MATLAB users. The pyplot interface — plt.plot(x, y), plt.title(...), plt.xlabel(...) — was deliberately modeled on MATLAB's plot function. This was not laziness; it was a strategic decision to lower the barrier to entry for the large MATLAB community that might otherwise stay on MATLAB.
2. Publication-quality output. The library should produce figures suitable for academic journals, which means high DPI, precise layout, vector output formats (PDF, EPS, SVG), and embedded fonts. From the beginning, savefig supported multiple formats and DPI settings, and the output quality matched what MATLAB and Mathematica were producing.
3. Interactive and non-interactive use. The library should work both in interactive Python sessions (where you want to see the chart on screen) and in batch scripts (where you want to save to files). The backend architecture, introduced early in matplotlib's history, was the solution: Agg for batch raster output, interactive backends (Tk, GTK, Qt) for live display, vector backends (PDF, SVG, PS) for publication.
4. Embedded in applications. The library should be usable inside other Python applications: GUI programs, web servers, batch processing pipelines. This required the backend abstraction to support embedding in Qt, Tk, wxWidgets, and later web-based applications.
5. Customization without rewriting. The library should be flexible enough that advanced users could customize every aspect of the chart, while still being approachable for beginners who just wanted to call plt.plot(). The answer was the Artist hierarchy: beginners use pyplot and accept defaults; advanced users traverse the Artist tree and configure individual objects.
These five goals explain nearly every architectural decision matplotlib made. The pyplot API exists because of goal 1. The backend system exists because of goals 3 and 4. The Artist hierarchy exists because of goal 5. The savefig function exists because of goal 2. Twenty years later, these decisions still shape how matplotlib works, for better and for worse.
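Goals 1 and 2 can be seen together in a few lines. The following is a minimal sketch, not code from Hunter's original papers: the data values are made up for illustration, and the Agg backend is selected only so the example runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Illustrative data (an assumption, not from the case study).
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

# Goal 1: MATLAB-style calls that act on an implicit "current" figure.
plt.plot(x, y)
plt.title("Squares")
plt.xlabel("x")
plt.ylabel("x squared")

# Goal 2: the same figure saved as a high-DPI raster and as a vector file.
# In-memory buffers stand in for file names such as "figure.png".
png_buffer = io.BytesIO()
pdf_buffer = io.BytesIO()
plt.savefig(png_buffer, format="png", dpi=300)
plt.savefig(pdf_buffer, format="pdf")
plt.close("all")

png_bytes = png_buffer.getvalue()
pdf_bytes = pdf_buffer.getvalue()
```

A MATLAB user would recognize the first four plotting calls almost word for word, which is the point of goal 1.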
The Technical Architecture (Briefly)
The chapter covered matplotlib's architecture in detail, so here is a brief recap of the decisions Hunter made in 2003-2005 that still define the library:
Three-layer architecture. Hunter separated matplotlib into three layers: backend (rendering to specific output formats), artist (the Python object tree that represents the chart), and scripting (the pyplot wrapper for quick chart creation). This separation allowed different users to engage at different levels: casual users just use pyplot, power users work with Artists directly, and developers can write new backends for new output formats.
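One way to see the layers is to skip the scripting layer entirely. The sketch below, with an arbitrary figure size and made-up data, builds the Artist tree by hand and then attaches an Agg canvas from the backend layer to rasterize it; pyplot is never imported.

```python
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

# Artist layer: build the object tree directly, with no pyplot involved.
fig = Figure(figsize=(4, 3), dpi=100)
ax = fig.add_subplot(1, 1, 1)
(line,) = ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("Built without pyplot")

# Backend layer: attach an Agg canvas and rasterize the tree into pixels.
canvas = FigureCanvasAgg(fig)
canvas.draw()
width, height = canvas.get_width_height()  # 4 in x 3 in at 100 DPI
```

This is roughly what pyplot does on your behalf; the scripting layer mostly saves you from creating the Figure and canvas yourself.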
The Figure-Axes-Axis trinity. The core classes — Figure for the whole image, Axes for a single plotting area, Axis for a single numerical axis — were designed to support both single-chart figures and multi-panel layouts. The plural-vs-singular naming (Axes vs. Axis) has been a source of confusion ever since, but it comes directly from the underlying mathematical concept of "a coordinate system" (a single Axes contains both a horizontal and a vertical Axis).
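The naming is easier to keep straight once you see the three classes side by side. This is a small sketch with made-up data and label text: one Figure holds two Axes, and each Axes owns an x Axis and a y Axis.

```python
import matplotlib
matplotlib.use("Agg")  # no display needed for this sketch
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.axis import XAxis, YAxis
from matplotlib.figure import Figure

# One Figure containing two Axes (two plotting areas, side by side).
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
left, right = axes

left.plot([0, 1, 2], [0, 1, 4])
right.plot([0, 1, 2], [4, 1, 0])

# Each Axes owns two Axis objects: one horizontal, one vertical.
left.xaxis.set_label_text("time (s)")
left.yaxis.set_label_text("amplitude")

plt.close(fig)
```

In everyday code you would usually call left.set_xlabel(...) instead; going through left.xaxis here just makes the ownership explicit.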
The Artist tree. Every visible element on a matplotlib chart is an instance of the Artist class. Text is a subclass of Artist. Line2D is a subclass of Artist. Legend is a subclass of Artist. This design means that every element is configurable through a consistent interface: you find the Artist in the tree, set its properties, and re-render.
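The tree can be walked directly. In the sketch below, walk is a hypothetical helper written for this example (it is not part of matplotlib); it relies only on Artist.get_children(), which every Artist provides.

```python
import matplotlib
matplotlib.use("Agg")  # no display needed for this sketch
import matplotlib.pyplot as plt
from matplotlib.artist import Artist
from matplotlib.legend import Legend
from matplotlib.lines import Line2D
from matplotlib.text import Text

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4], label="signal")
title = ax.set_title("Demo")
legend = ax.legend()


def walk(artist, depth=0):
    """Yield (depth, artist) for an Artist and all of its descendants."""
    yield depth, artist
    for child in artist.get_children():
        yield from walk(child, depth + 1)


tree = list(walk(fig))

# Find an Artist, set its properties; the change appears on the next render.
line.set_linewidth(3)
title.set_color("darkred")

plt.close(fig)
```

Printing type(artist).__name__ indented by depth for each entry in tree gives a quick picture of how much structure sits behind even a one-line chart.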
The pyplot state machine. The pyplot interface maintains a "current Figure" and "current Axes," which are set implicitly by pyplot calls. This was a deliberate concession to MATLAB users, whose plot, title, and xlabel functions implicitly operate on the current figure. The state machine is convenient for simple cases and confusing for complex ones, which is why modern best practice (and this book) recommends the OO API for anything beyond a one-liner.
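A short sketch of how the implicit state can surprise you, using two figures with made-up titles: the pyplot call lands on whichever Axes happens to be current, while the OO call names its target.

```python
import matplotlib
matplotlib.use("Agg")  # no display needed for this sketch
import matplotlib.pyplot as plt

fig_a, ax_a = plt.subplots()
fig_b, ax_b = plt.subplots()  # creating fig_b makes it the "current" figure

# pyplot functions act on the current Figure and Axes, here fig_b and ax_b.
current_is_b = plt.gcf() is fig_b and plt.gca() is ax_b
plt.title("Lands on the current Axes")

# OO methods say exactly which Axes they modify, regardless of what is current.
ax_a.set_title("Lands on ax_a")

title_a = ax_a.get_title()
title_b = ax_b.get_title()
plt.close("all")
```

In a long script or notebook, the "current" Axes depends on everything that ran before, which is where the hard-to-trace bugs come from.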
Backend flexibility. The backend system allows the same Artist tree to be rendered in different ways: to PNG via Agg, to PDF via the PDF backend, to an interactive window via Qt or Tk. This meant that matplotlib could be used for publication output AND interactive exploration, both in the same workflow, without switching libraries.
Each of these decisions has trade-offs. The Figure-Axes-Axis naming is confusing. The pyplot state machine creates bugs in complex code. The Artist tree is verbose to work with directly. But the trade-offs are all in the direction of flexibility and power, which is why matplotlib has survived twenty years of competition from newer libraries (Plotly, Bokeh, Altair, HoloViews) without being displaced. The newer libraries often handle specific use cases better, and some of them (HoloViews, for example) can render through matplotlib as one of their backends, but none of them has replaced matplotlib as the foundation of the Python scientific visualization ecosystem.
The Impact
Matplotlib's impact on Python and on scientific computing more broadly is hard to overstate. A few specific effects:
Impact on Python's scientific rise. In the early 2000s, Python was not yet a serious competitor to MATLAB, R, or IDL for scientific computing. Matplotlib was one of the critical pieces that made Python viable for scientific workflows. Combined with NumPy (for numerics), SciPy (for scientific algorithms), and eventually IPython/Jupyter (for interactive exploration), matplotlib enabled Python to become a full-featured scientific computing environment. By 2010, Python was widely adopted in physics, astronomy, biology, and engineering research, and matplotlib was the visualization layer in nearly every Python scientific paper.
Impact on the Python visualization ecosystem. Matplotlib established the design patterns that later Python visualization libraries built on or reacted against. Seaborn (Michael Waskom, 2012) is a high-level API built directly on top of matplotlib — every seaborn chart is a matplotlib chart underneath. Pandas plotting is a thin wrapper around matplotlib. geopandas uses matplotlib for mapping. Plotly, Bokeh, and Altair were designed partly as alternatives to matplotlib but still interoperate with it. The Python visualization ecosystem is built around, on top of, or in conversation with matplotlib.
Impact on research reproducibility. Before matplotlib, scientific figures were often produced in proprietary tools (MATLAB, Origin, SigmaPlot) and could not be reproduced without the same proprietary software. With matplotlib, the figure and the code that produced it could both be published together as open source. Researchers could share their plotting code alongside their data and their analysis code, and anyone with a Python installation could reproduce the figures exactly. This capability underlies the modern culture of reproducible research in computational science.
Impact on education. Matplotlib became the standard teaching tool for data visualization in Python courses. The reasons: it is free, it is installed by default in most scientific Python distributions, it is widely documented, and it handles the full range from toy examples to publication-quality figures. A student who learns matplotlib can produce the charts they need for homework, research, and professional work, without switching tools. Students who learn MATLAB or R face different tooling concerns, but the underlying point is the same: the plotting library shapes what students can do, and matplotlib made capable plotting freely available to anyone learning Python.
Impact on individual practitioners. For anyone who works with data in Python, matplotlib has probably saved thousands of hours cumulatively. The ability to produce a chart with three lines of code, to customize every element, to save to any format, to embed in any application — these capabilities are now taken for granted, but before matplotlib they required multiple tools and custom glue code. Matplotlib absorbed the entire job of "getting data into a chart" and made it a single library call.
Why It Worked: The Design Decisions in Retrospect
Looking back with twenty years of hindsight, we can see why matplotlib succeeded where other plotting libraries did not.
1. The MATLAB-compatible interface lowered adoption costs. Users who already knew MATLAB could start using matplotlib in minutes. They did not have to learn a new conceptual model; they just had to learn that plot was now plt.plot and title was now plt.title. The transition path was clear, and the transition cost was low. Other Python plotting libraries tried to invent new APIs, and they failed to gain users because the new APIs required relearning.
2. The Artist tree allowed unlimited customization. Even when the pyplot shortcuts did not do what a user needed, the Artist tree was always available. A scientist who needed precise control over tick mark positioning could traverse the tree, find the relevant Artist, and modify it directly. This meant that no matter what the user wanted to do, there was always a path. Other plotting libraries often had beautiful defaults but no clear way to override them for edge cases. Matplotlib was ugly by default but infinitely customizable.
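As one illustration of that kind of control, the sketch below places ticks at hand-picked positions and then edits the tick label Artists directly. The positions and rotation angle are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # no display needed for this sketch
import matplotlib.pyplot as plt
from matplotlib.ticker import FixedLocator

fig, ax = plt.subplots()
ax.plot([0, 10], [0, 1])

# Put major ticks at exactly these x positions.
ax.xaxis.set_major_locator(FixedLocator([0, 2.5, 5, 7.5, 10]))

# Draw once so the tick Artists exist, then modify each label Text object.
fig.canvas.draw()
for label in ax.get_xticklabels():
    label.set_rotation(45)
    label.set_horizontalalignment("right")

tick_positions = list(ax.get_xticks())
plt.close(fig)
```

Each tick label is an ordinary Text Artist, so anything you can do to a title or annotation (color, font size, alignment) you can also do here.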
3. The backend abstraction supported every use case. Interactive users used interactive backends. Publication users used PDF. Web applications used Agg and embedded the resulting images in HTML. Same library, same code, different outputs. This meant that a user's investment in learning matplotlib paid off across multiple use cases, rather than requiring them to learn different libraries for different outputs.
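The web case can be sketched in a few lines: render with Agg into an in-memory buffer, then embed the PNG in an HTML img tag as a base64 data URI. The data and the alt text are placeholders; a real web application would also decide how to cache and serve the image.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # raster backend that needs no display or GUI toolkit
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2, 3], [3, 1, 4, 1])
ax.set_title("Rendered server-side")

# Save into memory instead of onto disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

# Embed the PNG bytes directly in the page as a data URI.
encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
html = f'<img alt="chart" src="data:image/png;base64,{encoded}">'
```

The plotting code is identical to what an interactive user would write; only the backend and the destination of savefig differ.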
4. The community built what the core team didn't. Matplotlib's core team was small, but the community added specialized functionality: statistical plots (which became seaborn), geographic plots (which became cartopy and geopandas), 3D plots (mpl_toolkits.mplot3d), animation support, and dozens of other extensions. The community could contribute because the Artist tree and the backend abstraction were well-documented and stable.
5. The documentation and gallery set a new standard. The matplotlib gallery, with hundreds of example charts and their full source code, became the canonical way to learn the library. Users could browse the gallery, find something close to what they wanted, copy the code, and modify it. The gallery is still the most common way experienced users approach a new chart type — they search the gallery before they think about the API. Other Python libraries have emulated this pattern, but matplotlib's gallery was the original.
6. It was free and open source under a permissive license. The BSD license meant that anyone could use matplotlib in any context, including commercial products, without licensing fees or legal concerns. This was a significant advantage over MATLAB (expensive) and some other plotting tools (restrictive licenses). The permissive license also meant that downstream libraries could build on matplotlib without constraint, which accelerated the ecosystem.
Complications and Criticisms
Matplotlib is not universally loved. Several legitimate criticisms are worth acknowledging.
The API is inconsistent. Different methods on the Axes class take slightly different argument patterns, with some using keyword arguments one way and others using different conventions. The naming is not always consistent. This is a common complaint about matplotlib, and it reflects the library's organic growth: features were added over time by different contributors, and the API cleanup has been incremental.
The pyplot state machine causes bugs. As we have discussed, the pyplot state machine is convenient for simple cases but becomes a source of bugs in complex code. The OO API exists to address this, but because pyplot is taught first in most tutorials, many users do not discover the OO API until they have already been burned by pyplot bugs.
The default styles are ugly. Matplotlib's default output has been criticized for decades as ugly: heavy black spines, saturated primary colors, cramped layouts, and so on. Matplotlib 2.0 (2017) made significant improvements to the defaults, but users still frequently override them with style sheets or custom rcParams to get acceptable output. The "ugly climate plot" you produced in Section 10.9 was ugly partly because matplotlib defaults are ugly.
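Overriding defaults does not have to be global. A minimal sketch, with override values chosen only for illustration: plt.rc_context applies a set of rcParams inside a with block and restores the previous values afterward.

```python
import matplotlib
matplotlib.use("Agg")  # no display needed for this sketch
import matplotlib.pyplot as plt

# Remember the setting in force before the override, for comparison.
top_spine_before = plt.rcParams["axes.spines.top"]

# Illustrative overrides: drop the top and right spines, enlarge the font.
overrides = {
    "axes.spines.top": False,
    "axes.spines.right": False,
    "font.size": 11,
}

with plt.rc_context(overrides):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    top_visible = ax.spines["top"].get_visible()

# Outside the block, the earlier rcParams are back in effect.
top_spine_after = plt.rcParams["axes.spines.top"]
plt.close(fig)
```

The same dictionary of settings can later be moved into a style sheet file and applied with plt.style.use, so that every figure in a project starts from the same baseline.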
The documentation is overwhelming. Matplotlib's API is huge — thousands of methods, parameters, and options — and the documentation is comprehensive but dense. New users can spend hours trying to find the specific parameter they need. The gallery helps, but it does not replace good tutorials, and the official tutorials are uneven in quality.
The layout engine is fragile. Matplotlib's automatic layout (tight_layout, constrained_layout) works most of the time but occasionally produces unexpected results. Users have to fall back on manual subplots_adjust calls, which are tedious and not portable across figure sizes. This is one area where newer libraries (especially Altair, which delegates layout to Vega) have a genuine advantage.
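The two approaches look like this in practice. The sketch assumes a reasonably recent matplotlib (the layout="constrained" keyword is not available in older releases), and make_grid is a hypothetical helper defined only for this example; the spacing values in the manual version are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # no display needed for this sketch
import matplotlib.pyplot as plt


def make_grid(**fig_kwargs):
    """Build a 2x2 grid where every panel has a title and an x label."""
    fig, axes = plt.subplots(2, 2, figsize=(6, 4), **fig_kwargs)
    for ax in axes.flat:
        ax.plot([0, 1, 2], [0, 1, 4])
        ax.set_title("Panel title")
        ax.set_xlabel("x label")
    return fig, axes


# Automatic: the constrained layout engine makes room for titles and labels.
auto_fig, auto_axes = make_grid(layout="constrained")
auto_fig.canvas.draw()  # the layout engine does its work at draw time

# Check whether the top-left panel (with its x label) runs into the
# bottom-left panel (with its title), in display coordinates.
renderer = auto_fig.canvas.get_renderer()
top_box = auto_axes[0, 0].get_tightbbox(renderer)
bottom_box = auto_axes[1, 0].get_tightbbox(renderer)
rows_overlap = top_box.y0 < bottom_box.y1

# Manual fallback: fixed spacing fractions that suit this figure size only.
manual_fig, manual_axes = make_grid()
manual_fig.subplots_adjust(hspace=0.8, wspace=0.3)

plt.close("all")
```

The manual values are expressed as fractions of the average Axes size, so a spacing that works at 6 by 4 inches may look too tight or too loose at another size, which is the portability problem described above.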
It is slow for large datasets. Matplotlib is not optimized for datasets with millions of points. A scatter plot of 10 million points will be slow to render and will produce huge PNG files. For big data visualization, specialized libraries (Datashader, HoloViews, Vaex) have emerged to fill the gap. Matplotlib works for everything, but it does not always work well for everything.
Lessons for Modern Practice
Matplotlib's history offers several lessons for anyone using the library today.
The OO API is the right default. Hunter built pyplot as a gesture to MATLAB users, and it served that purpose well. But the OO API — which is also matplotlib-native — is a better way to write code for anything beyond the simplest exploratory plot. Use it by default, and you will avoid the class of bugs that the pyplot state machine creates.
Defaults are starting points, not finished products. Matplotlib defaults reflect 2003 design sensibilities and a "safe for any use case" approach. They are ugly on purpose, in the sense that they do not commit to any particular aesthetic. You should expect to override them. Chapter 12 will cover the specific overrides, and you will build your own style sheet that encodes your preferred defaults.
Read the gallery before writing new code. The matplotlib gallery has solutions for nearly every chart type you will ever need. Searching the gallery is almost always faster than reading the API documentation or writing code from scratch. Modern matplotlib users rely on the gallery as a starting point, not as a last resort.
Understand the Artist tree when the pyplot shortcuts fail. You can get far with pyplot and OO methods without ever touching the Artist tree directly. But when something is not working as expected — when a tick label is in the wrong place, when a legend is cut off, when a custom annotation needs precise positioning — the Artist tree is where you find the object and fix it. Knowing that the tree exists and how to navigate it is part of being a proficient matplotlib user.
Contribute back if you can. Matplotlib is maintained by volunteers, sustained by a nonprofit, and used by millions. If you benefit from it, consider contributing: filing bug reports with reproducible examples, improving documentation, reviewing pull requests, or donating to NumFOCUS. The library exists because of the generosity of its contributors, starting with John Hunter in 2003 and continuing through the current maintainers.
The best plotting library is the one you already know. Matplotlib has competitors, and some of them are better for specific tasks. But matplotlib is installed everywhere, taught everywhere, and supported by every other Python data tool. The network effects are enormous. Unless you have a specific reason to use something else, matplotlib is the right choice — not because it is the best, but because it is the one everyone else is using, and the interoperability pays off.
Discussion Questions
- On the MATLAB-compatible interface. Hunter designed pyplot to feel like MATLAB to ease adoption by MATLAB users. In retrospect, this decision is responsible for both matplotlib's rapid adoption and the pyplot state-machine bugs that frustrate modern users. Was it the right call? What would matplotlib look like if Hunter had designed a cleaner API from scratch?
- On the Artist tree. The threshold concept of this chapter is that everything in matplotlib is an object in a tree. This conceptual model is powerful but verbose: most users do not want to think about trees of Artists every time they make a chart. Would matplotlib be better if it hid the tree and exposed a simpler model, at the cost of flexibility? What does Altair (which does hide the model) do differently?
- On the pace of API change. Matplotlib's API has grown organically over twenty years, with new features added and old features preserved for backward compatibility. This has produced an API that is comprehensive but inconsistent. Should matplotlib make a breaking-change release that cleans up the API, at the cost of breaking existing code? What are the trade-offs?
- On defaults. Matplotlib's default styles have been criticized as ugly for two decades. Matplotlib 2.0 (2017) improved them significantly, but many users still override them heavily. Should matplotlib ship with stronger aesthetic opinions (a "house style") or remain neutral so that users can apply their own styles? What would each choice imply?
- On open source sustainability. Matplotlib is maintained by volunteers and one part-time paid maintainer, and the library is used by millions of people, including large tech companies. Is this sustainable? What would it take for matplotlib to have dedicated full-time maintenance, and who should fund it?
- On the relationship between matplotlib and newer libraries. Seaborn, Plotly, Altair, and others have emerged as alternatives or complements to matplotlib. Yet matplotlib remains the foundation that most of them build on. Is matplotlib's dominance a good thing for the Python ecosystem, or does it hold back more radical innovation? What would a "post-matplotlib" visualization ecosystem look like?
John Hunter built matplotlib because he needed to plot EEG data. Twenty years later, the library he wrote is the foundation of Python data visualization for millions of users. His weekend project became the shared infrastructure of a global scientific community. This is part of how open source works: one person's specific problem, solved well, becomes a tool that everyone else can use for their own specific problems. The next time you type import matplotlib.pyplot as plt, remember the postdoc who typed those lines first — and the long chain of decisions that made it possible for you to use his library today.