Case Study 2: How Jake VanderPlas Taught a Generation to Read matplotlib

DataField.Dev

Matplotlib's documentation is comprehensive but dense. Its API is powerful but confusing. Most people did not learn matplotlib from the official docs. They learned it from books and free online resources, and one resource in particular — Jake VanderPlas's Python Data Science Handbook — shaped how a generation of practitioners think about matplotlib.

The Situation

By 2015, matplotlib had been around for twelve years and was the de facto standard for Python data visualization. But it had a problem: the learning curve was steep. The official documentation was accurate but intimidating — hundreds of classes, thousands of methods, and a fundamental distinction between pyplot and the OO API that most tutorials glossed over. Books on matplotlib existed but were either too technical (focused on specific use cases) or too shallow (a chapter of examples in a broader Python book). For a student or practitioner coming to matplotlib for the first time, the resources were inadequate.

The practical problem was cognitive: matplotlib has two APIs, and most tutorials taught only one of them — pyplot — because pyplot is easier to explain in a first example. But pyplot has well-known limitations that bite once a user moves past simple charts. The tutorials that taught pyplot first were setting students up for a painful transition when they hit those limitations. The few tutorials that taught the OO API first were rare and felt unnecessarily complicated for beginners. Nobody had found a way to teach both at the right depth in the right order.

Jake VanderPlas, an astronomer-turned-data-scientist who worked at the University of Washington's eScience Institute, had been teaching Python data science courses for years. He had seen the same confusion repeatedly: students would learn pyplot, get stuck, and have to unlearn bad habits. In 2016, he released the Python Data Science Handbook, a comprehensive book covering NumPy, pandas, matplotlib, and scikit-learn. O'Reilly published it as a physical book, but, critically, the full text was also released free online as a series of Jupyter notebooks, with the text under a Creative Commons license (CC BY-NC-ND) and the code under the MIT license.

The matplotlib chapter in the Python Data Science Handbook became one of the most widely read matplotlib tutorials in the world. It taught both pyplot and the OO API, explained the distinction clearly, and walked through chart types, customization, and advanced features with worked examples. Students and self-learners read it. University courses assigned it. Corporate training programs adopted it. By 2020, the book had been translated into several languages and had been read by hundreds of thousands of people. VanderPlas's approach to teaching matplotlib shaped how a generation of Python practitioners learned the library.

The case study is worth examining not because the book is a unique artifact (many good matplotlib resources exist) but because it is a representative example of a specific phenomenon: free, high-quality educational content that reaches an audience the official documentation cannot reach. Understanding why the book worked — and what made it a better teaching resource than the official matplotlib docs — illuminates a lesson about documentation, education, and the social infrastructure around technical tools.

The Content

The Python Data Science Handbook's matplotlib chapter (Chapter 4 in the book) covered the following topics in order:

Introduction: Why matplotlib? VanderPlas opened with a brief argument for using matplotlib: it is the foundation of Python visualization, it is installed everywhere, it is flexible enough for any use case. This was not a marketing pitch; it was a practical justification that acknowledged the library's limitations while making the case for learning it.

Two interfaces: pyplot and object-oriented. VanderPlas introduced both APIs explicitly, early in the chapter, and explained the distinction: pyplot is a state-machine interface suited for quick exploration, and the OO API is explicit about which Figure and Axes are being manipulated. He recommended using the OO API for anything beyond the simplest cases, but he did not pretend pyplot was obsolete. This was a key move: students learned that both exist, when to use each, and that the choice is a deliberate design decision.

Simple line plots. The first hands-on section walked through creating a basic line chart with plt.plot() (pyplot) and then recreating it with fig, ax = plt.subplots() and ax.plot() (OO). The side-by-side comparison made the two APIs concrete and showed exactly what differs.
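
To make the side-by-side comparison concrete, here is a minimal sketch of the same line chart written both ways. It is not taken from the book: the data, labels, and variable names are illustrative assumptions, and the Agg backend line is there only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# pyplot (state-machine) style: every call acts on the "current" Figure and Axes
plt.figure()
plt.plot(x, np.sin(x))
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("pyplot style")
pyplot_axes = plt.gca()  # the implicit Axes that the calls above were drawing on

# Object-oriented style: the Figure and Axes are named, and methods are called on them
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("OO style")
```

The two halves produce equivalent charts. What differs is that the OO half never depends on which Axes happens to be "current", which is the property that matters once a figure has more than one panel.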

Simple scatter plots. Same pattern: show the pyplot version, then the OO version, highlight the differences, and move on.

Visualizing errors. Error bars, shaded regions, confidence intervals. This section was useful for statistical visualization and introduced fill_between and errorbar.
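
As a rough illustration of the two functions named here, the sketch below draws noisy points with error bars and a shaded band around a reference curve. The synthetic data, the fixed error size, and the band width are assumptions chosen for the example, not values from the book.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements: a sine curve plus Gaussian noise
x = np.linspace(0, 10, 20)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

fig, ax = plt.subplots()

# Discrete points with a constant +/- 0.2 vertical error bar
container = ax.errorbar(x, y, yerr=0.2, fmt="o", color="black",
                        ecolor="lightgray", capsize=3)

# Continuous uncertainty: shade a band of +/- 0.4 around the reference curve
xfit = np.linspace(0, 10, 200)
yfit = np.sin(xfit)
ax.plot(xfit, yfit, color="gray")
band = ax.fill_between(xfit, yfit - 0.4, yfit + 0.4, color="gray", alpha=0.2)
```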

Density and contour plots. For showing 2D distributions. This went slightly beyond the typical beginner tutorial but was still accessible.
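
A minimal sketch of the idea, assuming a made-up surface function f(x, y) chosen only to give the contours some structure: evaluate the function on a grid, then draw filled contours with line contours on top.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def f(x, y):
    # Hypothetical test surface, not a function from the book
    return np.sin(x) ** 2 + np.cos(y)


x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)  # 2D coordinate grids, one row per y value
Z = f(X, Y)

fig, ax = plt.subplots()
filled = ax.contourf(X, Y, Z, levels=20, cmap="viridis")       # filled contours
ax.contour(X, Y, Z, levels=5, colors="black", linewidths=0.5)  # line contours on top
fig.colorbar(filled, ax=ax, label="f(x, y)")
```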

Histograms, binnings, and density. Distribution visualization with hist, hist2d, and kernel density estimation.
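
The sketch below shows hist and hist2d side by side on synthetic data. The sample sizes, bin counts, and covariance matrix are illustrative assumptions, and kernel density estimation is left out because it needs an additional library.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# 1D histogram; density=True rescales bar heights so the total area is 1
data = rng.normal(size=1000)
counts, edges, patches = ax1.hist(data, bins=30, density=True, alpha=0.6)

# 2D histogram of correlated samples; each cell holds the number of points in that bin
xy = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], size=5000)
h, xedges, yedges, image = ax2.hist2d(xy[:, 0], xy[:, 1], bins=30, cmap="Blues")
fig.colorbar(image, ax=ax2, label="count per bin")
```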

Customizing plot legends. Legend placement, formatting, and common customizations. This was a short section, but it addressed one of the most common pain points for beginners.
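
As a small sketch of the kind of customization this section covers (the curves and the particular options are assumptions for illustration): only lines that carry a label appear in the legend, and placement and framing are set through keyword arguments.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), "-b", label="sine")
ax.plot(x, np.cos(x), "--r", label="cosine")
ax.plot(x, np.zeros_like(x), ":k")  # no label, so this line is left out of the legend

# Place the legend, drop its frame, and lay the entries out in two columns
legend = ax.legend(loc="upper left", frameon=False, ncol=2)
labels = [text.get_text() for text in legend.get_texts()]
```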

Customizing colorbars. Colorbars for heatmaps and other color-encoded charts. Another common pain point.
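
A minimal colorbar sketch, assuming a synthetic image: the color limits are pinned with vmin and vmax, and extend="both" marks values that fall outside those limits.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 2D data: an outer product of a sine and a cosine
x = np.linspace(0, 10, 100)
intensity = np.sin(x) * np.cos(x[:, np.newaxis])

fig, ax = plt.subplots()
im = ax.imshow(intensity, cmap="RdBu", vmin=-1, vmax=1)  # diverging map centered on 0

# The colorbar is attached to the image (the "mappable") and gets its own Axes
cbar = fig.colorbar(im, ax=ax, extend="both", label="intensity")
```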

Multiple subplots. plt.subplot(), plt.subplots(), and GridSpec. This section introduced the patterns for multi-panel figures.
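
The two main multi-panel patterns can be sketched as follows. The grid sizes and spacing values are arbitrary assumptions; the first figure uses a regular grid from plt.subplots(), and the second uses a GridSpec so that panels can span several cells.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt

# Regular grid: plt.subplots returns the Figure and a 2D array of Axes
fig, axes = plt.subplots(2, 3, sharex="col", sharey="row")
for i in range(2):
    for j in range(3):
        axes[i, j].text(0.5, 0.5, str((i, j)), ha="center", fontsize=14)

# Irregular layout: slice a GridSpec to make panels that span rows or columns
fig2 = plt.figure()
grid = fig2.add_gridspec(2, 3, wspace=0.4, hspace=0.3)
ax_top = fig2.add_subplot(grid[0, :])    # full top row
ax_left = fig2.add_subplot(grid[1, :2])  # bottom row, first two columns
ax_right = fig2.add_subplot(grid[1, 2])  # bottom row, last column
```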

Text and annotation. plt.text() and plt.annotate() for adding labels and callouts.
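
Here is a short sketch of both calls in their Axes-method form, ax.text() and ax.annotate(). The curve, the label strings, and the positions are illustrative assumptions; the point is the difference between data coordinates and axes coordinates, and how an arrow is attached to a callout.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 20, 500)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))
ax.set_ylim(-2, 2)

# Positioned in data coordinates (the default): moves if the axis limits change
ax.text(1, 1.5, "data coordinates", transform=ax.transData)

# Positioned in axes coordinates: (0, 0) is bottom-left, (1, 1) is top-right
ax.text(0.02, 0.05, "axes coordinates", transform=ax.transAxes)

# A callout: xy is the point being annotated, xytext is where the label sits
note = ax.annotate("local maximum", xy=(2 * np.pi, 1), xytext=(10, 1.6),
                   arrowprops=dict(arrowstyle="->"))
```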

Customizing ticks. Tick positions, tick labels, and tick formatters.
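
Tick positions are controlled by locator objects and tick labels by formatter objects. The sketch below places ticks at multiples of pi/2 and labels them accordingly; the helper name format_pi is a hypothetical one introduced for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import ticker


def format_pi(value, pos):
    """Label a tick at a multiple of pi/2 (hypothetical helper for this sketch)."""
    n = int(np.round(2 * value / np.pi))  # how many half-pi steps from zero
    if n == 0:
        return "0"
    if n == 1:
        return r"$\pi/2$"
    if n == 2:
        return r"$\pi$"
    if n % 2:
        return rf"${n}\pi/2$"
    return rf"${n // 2}\pi$"


x = np.linspace(0, 3 * np.pi, 300)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

ax.xaxis.set_major_locator(ticker.MultipleLocator(np.pi / 2))  # where the ticks go
ax.xaxis.set_major_formatter(ticker.FuncFormatter(format_pi))  # how they are labeled
ax.yaxis.set_major_locator(ticker.MaxNLocator(3))              # cap the number of y intervals
```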

Customizing matplotlib: configurations and stylesheets. rcParams and mpl.style. This section introduced the idea of building a consistent house style.
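
One way to sketch the house-style idea is with context managers, which apply settings temporarily and restore the defaults afterwards. The dictionary name house_style and its values are assumptions for illustration; "ggplot" is one of the style sheets that ships with matplotlib.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt

default_width = mpl.rcParams["lines.linewidth"]

# Apply a bundled style sheet to one figure only
with plt.style.context("ggplot"):
    fig1, ax1 = plt.subplots()
    ax1.plot([0, 1, 2], [0, 1, 4])

# Apply a hypothetical house style through rcParams overrides
house_style = {"lines.linewidth": 3, "axes.grid": True, "font.size": 12}
with mpl.rc_context(house_style):
    fig2, ax2 = plt.subplots()
    (line,) = ax2.plot([0, 1, 2], [0, 1, 4])

# Outside the with-blocks the global defaults are back in force
restored_width = mpl.rcParams["lines.linewidth"]
```

To apply a style for a whole session instead, plt.style.use("ggplot") or direct assignment to mpl.rcParams changes the defaults globally.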

Three-dimensional plotting. A brief look at mpl_toolkits.mplot3d for 3D plots (which the chapter acknowledged are usually a bad idea but sometimes necessary).
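
For reference, a minimal 3D sketch looks like this. The spiral data is an assumption chosen for the example; recent matplotlib versions register the "3d" projection automatically, and the explicit import is kept here for older installations.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # registers the "3d" projection on older versions

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # a 3D Axes instead of the usual 2D one

# Hypothetical data: a spiral rising along the z axis
z = np.linspace(0, 15, 500)
ax.plot(np.sin(z), np.cos(z), z, color="gray")

# A sparser set of points on the same spiral, colored by height
zs = np.linspace(0, 15, 50)
ax.scatter(np.sin(zs), np.cos(zs), zs, c=zs, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```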

Geographic data with Basemap. Brief introduction to mapping (now dated, as Basemap has been superseded by cartopy).

Visualization with seaborn. A brief bridge to seaborn, which VanderPlas explained is a high-level interface built on top of matplotlib.

Further resources. Links to documentation, the gallery, and other books.

That is the table of contents. What makes it effective is not the topic list (many matplotlib tutorials cover the same topics) but the sequencing, the depth, and the code.

Why It Worked

Several features of VanderPlas's approach are worth examining as examples of technical writing done well.

1. It taught both APIs together. Most matplotlib tutorials taught pyplot first, then later mentioned "by the way, there is also an object-oriented API." This framing made the OO API feel like an advanced topic that beginners could defer. VanderPlas inverted the framing: the OO API is the default, pyplot is a shortcut for simple cases, and both are introduced in the first hands-on section. Students came away understanding that they have a choice and knowing when to make it.

2. The examples were runnable Jupyter notebooks. The book was written as Jupyter notebooks, which meant every code example was a real piece of runnable code with a visible output. Students could download the notebooks, open them in their own environment, and modify the examples. This is a much better learning mode than reading static code in a printed book, because the student can experiment with the code and see what happens. The Jupyter format was rare for technical books in 2016 but has since become common.

3. The explanations were in natural language. VanderPlas wrote in the voice of a teacher explaining something to a friend, not in the dry reference style of the official documentation. Sentences like "matplotlib's interface will feel familiar to anyone who has used MATLAB" or "the simplest way to create a figure with one axes is to use plt.subplots()" are conversational and welcoming. They do not assume the reader is already an expert. The official matplotlib documentation is accurate but often reads as if it were written for people who already know the answer.

4. The topics were ordered by dependency. The book introduced concepts in an order that respected their dependencies: basic plots first, then customization, then multi-panel layouts, then advanced topics. A student reading the chapter sequentially encountered each concept in a context where the prerequisites had already been covered. This seems obvious, but many tutorials violate it by jumping between topics without building up the foundation.

5. Every section had a "why" before a "how." VanderPlas introduced each topic with a brief motivation ("you will often want to customize the legend for clarity") before showing the code. The why-before-how pattern helps readers understand when to apply a technique, not just how to apply it. Tutorials that show code without motivation leave readers able to copy-paste but unable to adapt the code to their own needs.

6. The book was free online. VanderPlas negotiated with O'Reilly to release the full text online at no cost, hosted on GitHub as a repository of Jupyter notebooks (the text under a CC BY-NC-ND license, the code under the MIT license). This meant that anyone with an internet connection could read the book, regardless of whether they could afford the physical edition. The free online version dramatically expanded the reach of the book and made it a default recommendation in many learning paths.

7. The code was high quality. Every example in the book follows good Python style, uses meaningful variable names, and produces output that matches the accompanying explanation. Students who learn from well-written examples develop better coding habits than students who learn from sloppy examples. The quality of the code is a subtle but important factor in the book's effectiveness as a learning resource.

8. It explained the Artist tree without belaboring it. The book introduced the concept that matplotlib charts are built from objects and that the OO API exposes those objects, but it did not try to teach the entire Artist hierarchy as a formal type system. Students learned to think of Figures and Axes as objects they could configure, without needing to memorize the entire class hierarchy. This was the right level of abstraction for beginners.
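
The "charts are built from objects" idea can be seen directly with a few lines of code. This is a sketch written for this case study, not an example from the book: it builds a small chart, walks one level of the object tree, and then configures individual objects.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 0])
ax.set_title("Artist tree")

# The Figure contains the Axes; the Axes contains the line, title, spines, and so on
for child in fig.get_children():
    print("Figure child:", type(child).__name__)
for child in ax.get_children():
    print("Axes child:", type(child).__name__)

# Each object can be configured directly once you hold a reference to it
line.set_linewidth(3)
line.set_color("tab:red")
ax.title.set_fontsize(14)
```

A beginner does not need to know the class names that this prints. It is enough to know that fig, ax, and line are ordinary Python objects with methods, which is the level of abstraction the book aimed for.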

The Impact

The Python Data Science Handbook's impact on matplotlib learning is hard to measure precisely, but several indicators suggest its influence:

Adoption in university courses. Many universities teaching Python data science courses either assigned the book or recommended it as supplementary reading. Students in these courses learned matplotlib through VanderPlas's chapter, which shaped their mental model of the library.

Adoption in corporate training. Data science training programs at large companies often included the book in their curriculum. Workers who were transitioning from Excel or R to Python used the book to learn matplotlib alongside NumPy and pandas.

Stack Overflow answers. After the book was released, Stack Overflow answers about matplotlib began to reflect the book's style more frequently. Answers that used the OO API (fig, ax = plt.subplots()) became more common. Answers that explained the difference between pyplot and OO became more common. The book probably did not cause this shift alone, but it contributed to a broader trend toward the OO API as the preferred approach.

Second editions and updates. VanderPlas has updated the book several times, keeping it current with matplotlib's evolving API. The second edition (2023) reflects matplotlib 3.x conventions and includes updates to pandas and other libraries. The continuing maintenance keeps the book relevant, which extends its effective lifespan.

Similar educational resources. The book's success encouraged similar resources. Hans Fangohr, Philipp Rudiger, and others produced comparable free online books and tutorials. Other books cover adjacent ground: Cole Nussbaumer Knaflic's Storytelling with Data focuses on the design side, and Claus Wilke's Fundamentals of Data Visualization, whose figures are built with R and ggplot2, is likewise free to read online. The free-book model that VanderPlas demonstrated has been adopted by many technical educators.

The broader impact: the Python data science community now has a large body of high-quality free educational content. Matplotlib has been a beneficiary of this, because good free tutorials reduce the friction of learning the library, which increases adoption. A student in 2016 had the Python Data Science Handbook as a primary resource; a student in 2024 has the book, the matplotlib gallery, and a dozen other high-quality free resources. The ecosystem is richer than it was when VanderPlas wrote the first edition, and his book is part of why.

What the Book Did Not Do (and Why That Matters)

The Python Data Science Handbook is influential, but it is not comprehensive. A few things the book did not do are worth noting:

It did not deeply integrate with design principles. The book focuses on matplotlib as a tool: how to produce charts, how to customize them, how to build multi-panel layouts. It does not spend much time on the perceptual science of why certain designs work better than others (Chapter 2 of this textbook), the chart selection framework (Chapter 5 of this textbook), or the narrative structure of data stories (Chapter 9 of this textbook). The book assumes readers know what to visualize and why; it teaches them how to visualize with matplotlib.

This is not a criticism — the book is about Python data science, not about visualization design as a craft. But it means the book is a complement to, not a substitute for, the principles in Parts I and II of this textbook. A student who learns matplotlib from VanderPlas but does not learn design principles from Tufte, Knaflic, Cairo, Wilke, or this textbook will be able to produce technically correct charts that are not necessarily well-designed.

It did not cover every chart type. The book focuses on the most common chart types (line, scatter, bar, histogram, contour) and the most common customizations. It does not cover specialized visualizations (network graphs, Sankey diagrams, geographic heatmaps, animated plots, custom chart types). A reader who needs something unusual has to go elsewhere — the gallery, Stack Overflow, third-party libraries, or specialized books.

It did not address accessibility deeply. Colorblind-safe palettes, screen-reader-compatible figures, high-contrast modes — these topics are addressed briefly but not at the depth that a modern accessibility-aware design practice would require. This reflects the state of the field in 2016 more than any failure of the book; accessibility in data visualization has become a bigger topic in the years since.

It did not emphasize the OO API strongly enough for some tastes. Some readers (including your author) think the book could have pushed the OO API even harder, treating pyplot as legacy. VanderPlas took a more balanced approach, acknowledging that pyplot is common and useful in some contexts. This is a judgment call, not a failing, but it is worth noting.

It did not cover matplotlib's internal architecture in depth. The book teaches users how to use matplotlib effectively, not how matplotlib works under the hood. Users who want to understand the rendering pipeline, the backend abstraction, or the Artist class hierarchy in detail have to read other sources. This is appropriate for a general data science book — most users do not need to understand internals — but it is a gap that this textbook (specifically, this Chapter 10) tries to fill.

Lessons for Modern Practice

The book's success offers several lessons for anyone learning matplotlib or producing educational content.

Learn both APIs but default to OO. VanderPlas's approach of teaching both pyplot and OO, then defaulting to OO for anything beyond simple cases, is the right pedagogical balance. Students end up knowing both, being able to read pyplot code on Stack Overflow, and writing their own code in the cleaner OO style.

Free resources expand access. The single most important decision VanderPlas made was negotiating a free online version. This decision is responsible for the book's enormous reach. Commercial publishers often resist free release, but the free version did not hurt physical book sales — it increased them, because the free version generated word-of-mouth recommendations that the paid version alone could not.

Jupyter notebooks are the right format for code-heavy educational content. The book's presentation as runnable notebooks is a significant advantage over static book formats. Readers can experiment, modify, and verify. Modern educational content should default to this format whenever possible.

Natural-language explanations matter. The gap between VanderPlas's conversational style and the official matplotlib documentation is a reminder that good technical writing is a skill distinct from technical accuracy. Documentation that is correct but dry can fail to teach; documentation that is wrong is worse, but documentation that is correct and approachable is a rare and valuable thing.

Teaching order matters. The book introduces concepts in an order that respects dependencies. This is a basic pedagogical principle, but many tutorials violate it. When you are learning a new library, try to find resources that walk through concepts in a logical order, and be skeptical of tutorials that jump between topics without building up foundations.

The gallery is a complement to tutorials, not a substitute. VanderPlas's book teaches concepts; the matplotlib gallery shows examples. Both are needed. Use tutorials to understand what is possible and how to think about the library; use the gallery to find specific solutions for specific chart types. Relying only on one or the other will leave gaps.

Writing comes back to you. VanderPlas was a working data scientist whose book emerged from his teaching. The act of writing the book clarified his own understanding and forced him to think through issues he might otherwise have skipped. If you are learning matplotlib, consider teaching it to someone else, writing a blog post, or contributing to documentation. Teaching is one of the fastest ways to consolidate your own understanding.

Matplotlib rewards patience. The book is long — more than 500 pages in its physical form. Most readers do not read it all. The readers who do come out with a thorough understanding of matplotlib, pandas, NumPy, and scikit-learn that supports years of subsequent work. The reward for putting in the time is compounding: every hour spent on the book saves many hours later when the concepts are already internalized.

Discussion Questions

1. On the role of free educational resources. VanderPlas's book reached an audience that the official matplotlib documentation never could. What is the appropriate relationship between free community-produced educational content and the official docs of a tool like matplotlib? Should projects invest more in their own docs, or should they rely on the community to fill the gap?

2. On teaching both APIs. The book teaches pyplot and OO together. Some educators argue that this is confusing for beginners and that one API should be taught first (whichever one the educator prefers). Which approach do you think is better, and why?

3. On the sustainability of free books. VanderPlas updated the book for a second edition, but many free technical books stop being maintained after a few years and become outdated. What structural changes could make free educational content more sustainable? Is there a role for institutional or commercial support that does not compromise the "free" part?

4. On what the book chose not to cover. The Python Data Science Handbook does not deeply integrate design principles. Is this a reasonable scope choice for a "data science" book, or should technical books on visualization tools always include design education? Where is the boundary between "how to use the tool" and "how to make good charts"?

5. On the Jupyter notebook format. The book's notebooks are a major factor in its effectiveness as a learning resource. Is the notebook format always better than static text for code-heavy content? What are the costs of notebooks (versioning, rendering, accessibility) that static formats avoid?

6. On your own learning path. How did you learn matplotlib (if you have)? What resources did you use? Looking back, what would you do differently? Is there a specific book or tutorial that changed how you thought about the library?

The Python Data Science Handbook is one example of how a single well-crafted educational resource can shape how a technical tool is learned. Matplotlib is harder to learn than it should be, but the community has gradually produced better resources than the official documentation provides. VanderPlas's book is a particularly influential example, but it is not the only one. The broader lesson: the social infrastructure around a technical tool — the books, tutorials, Stack Overflow answers, blog posts, example galleries — is as important as the tool itself. A tool with good community infrastructure is more useful than a tool without, even if the tools themselves are technically equivalent. Matplotlib has good community infrastructure, and that is part of why it has remained dominant even as newer alternatives have emerged.