Learning Objectives
- Create line charts with `ax.plot()` including multiple series, line styles, markers, and transparency
- Create bar charts with `ax.bar()` and `ax.barh()` including grouped bars, stacked bars, and horizontal orientation
- Create scatter plots with `ax.scatter()` including color encoding, size encoding, and transparency for overplotting
- Create histograms with `ax.hist()` including bin selection, normalization, and overlaid histograms
- Create box plots with `ax.boxplot()` including grouped boxes, horizontal orientation, and outlier display
- Select the appropriate chart type from these five for a given data question, referencing the Chapter 5 framework
- Handle common data-preparation tasks: datetime parsing, categorical ordering, binning
- Combine pandas DataFrames with matplotlib using direct column access
- Recognize and avoid common chart-type pitfalls (unsorted x-axis for line charts, overplotting in scatter, bin-width issues in histograms)
In This Chapter
- 11.1 Line Charts: ax.plot()
- 11.2 Bar Charts: ax.bar() and ax.barh()
- 11.3 Scatter Plots: ax.scatter()
- 11.4 Histograms: ax.hist()
- 11.5 Box Plots: ax.boxplot()
- 11.6 Combining Chart Types: When One Chart Is Not Enough
- 11.7 pandas Integration: The Shortcut
- 11.8 Error Bars and Uncertainty: A Small but Essential Extension
- 11.9 Five Chart Types, One Dataset: The Climate Example
- Chapter Summary
- Spaced Review: Concepts from Chapters 1-10
Chapter 11: Essential Chart Types in matplotlib
"The five essential chart types — line, bar, scatter, histogram, and box plot — answer the five most common data questions. Learn them, and you can handle eighty percent of real-world visualization work." — A common (and roughly correct) rule of thumb in data science education
Chapter 10 introduced the matplotlib architecture and the canonical fig, ax = plt.subplots() pattern. You wrote your first real climate chart with a single ax.plot() call and saved it to a PNG. The chart was deliberately ugly, but it was correct — the line was the right shape, the axes were labeled (minimally), and the file saved to disk. You have the basic pattern.
This chapter takes that basic pattern and expands it in two directions. First, we learn the full range of options on ax.plot() — color, linewidth, linestyle, marker, transparency — so that a line chart can be more than a default blue line. Second, we introduce the other essential chart methods: ax.bar() for bar charts, ax.scatter() for scatter plots, ax.hist() for histograms, and ax.boxplot() for box plots. Each of these is an Axes method that follows the same pattern as ax.plot(): create the Artists, configure the parameters, let matplotlib render.
The chapter organizes itself around the five chart types, with one section per type. Each section covers the basic call, the most important parameters, the common variations (grouped bars, multi-series lines, overlaid histograms), and the common pitfalls. At the end of each section, we apply the chart type to the climate dataset and produce a version of the progressive project. By the end of the chapter, you will have five different matplotlib charts of the same underlying climate data, each answering a different question — a concrete demonstration of the Chapter 5 principle that the chart type follows from the question.
A word about the threshold concept. Every parameter in every one of these methods is a design decision. When you set color="steelblue", you are choosing a palette. When you set linewidth=2, you are making a visual-weight decision. When you set alpha=0.5, you are managing overplotting in a way that connects to pre-attentive processing. The API is not just technical syntax — it is the instrument for implementing the perception science and design principles from Parts I and II. As you read this chapter, try to connect each parameter to the principle it implements. This is what separates matplotlib users who produce default charts from matplotlib users who produce thoughtful charts: the deliberate mapping of parameters to design decisions.
Chapter 10 warned that Part III is code-heavy. This chapter is where the warning becomes real. There are more code blocks in this chapter than in all nine previous chapters combined. Type them. Run them. Modify them. Matplotlib is a library you learn by doing, and reading without typing produces weak retention. Every exercise at the end of the chapter has a concrete goal; do at least five of them before moving on to Chapter 12.
11.1 Line Charts: ax.plot()
Line charts are the most common chart type in data visualization — appropriate whenever you are showing change over time, a continuous relationship between two variables, or a smooth function. The matplotlib method is ax.plot(), and it is one of the most-used methods in the entire library.
The Basic Call
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot([1, 2, 3, 4, 5], [10, 15, 13, 18, 21])
ax.set_title("A Simple Line")
fig.savefig("line.png", dpi=300, bbox_inches="tight")
The ax.plot(x, y) call creates a Line2D Artist and adds it to the Axes. The first argument is the x-values, the second is the y-values. Both can be Python lists, numpy arrays, pandas Series, or any other array-like sequence of numbers. The x-values and y-values must be the same length.
If you pass only one argument, matplotlib treats it as the y-values and uses 0, 1, 2, ... as the default x-values:
ax.plot([10, 15, 13, 18, 21]) # x defaults to [0, 1, 2, 3, 4]
This is convenient for quick exploratory charts but not recommended for publication code, because the x-axis becomes an index rather than a meaningful variable.
Styling Parameters
ax.plot() accepts dozens of keyword arguments that control the appearance of the line. The most important ones:
color (or c): the color of the line. Accepts many formats — named colors ("red", "steelblue", "darkgreen"), hex codes ("#1f77b4"), RGB tuples ((0.12, 0.47, 0.71)), or single-letter codes ("b" for blue). The named colors are the most readable:
ax.plot(x, y, color="steelblue")
ax.plot(x, y, color="#ff6600") # hex
ax.plot(x, y, color=(0.2, 0.4, 0.8)) # RGB tuple
Chapter 3 was about what colors to choose; this parameter is where you implement those choices.
linewidth (or lw): the thickness of the line in points. Default is 1.5. A thin line (linewidth=0.5) recedes visually; a thick line (linewidth=3) draws attention. This is your tool for implementing visual hierarchy from Chapter 8 and emphasis from Chapter 9.
ax.plot(x, y, linewidth=2.5)
linestyle (or ls): the pattern of the line. Options include "-" (solid, default), "--" (dashed), ":" (dotted), "-." (dash-dot), and "None" (no line, only markers). Dashed and dotted lines are useful for reference lines, forecasts, and "not real data" indicators.
ax.plot(x, historical, linestyle="-", label="Historical")
ax.plot(x, forecast, linestyle="--", label="Forecast")
marker: the shape drawn at each data point. Options include "o" (circle), "s" (square), "^" (triangle), "D" (diamond), "x" (cross), "+" (plus), and many others. By default, no marker is drawn. Adding markers is useful for sparse data where you want the reader to see individual observations:
ax.plot(x, y, marker="o") # line with circle markers
ax.plot(x, y, marker="o", linestyle="None") # markers only, no line — equivalent to scatter
markersize (or ms): the size of the markers in points. Default is 6 (the rcParams["lines.markersize"] setting), regardless of marker shape. Adjust for visual weight.
alpha: transparency, from 0 (fully transparent) to 1 (fully opaque). Default is 1. Use alpha=0.5 or similar when you have many overlapping lines and want the overlap to be visible:
ax.plot(x, y, alpha=0.6) # 60% opaque
Alpha is the primary tool for managing overplotting, which we will see again with scatter plots.
label: the text that will appear in a legend if you call ax.legend() later:
ax.plot(x, y, label="Temperature (°C)")
ax.legend() # display the legend
Setting labels and calling legend() is the standard pattern for multi-series line charts.
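As a minimal sketch of how these parameters combine, the following draws one emphasized series and one de-emphasized reference line. The data (a sine curve) and the names main_line and ref_line are placeholders for illustration, not part of the climate project.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: 25 samples of a sine curve
x = np.linspace(0, 10, 25)
observed = np.sin(x)

fig, ax = plt.subplots(figsize=(8, 3))

# Emphasized series: saturated color, thick line, small markers
(main_line,) = ax.plot(
    x, observed,
    color="steelblue", linewidth=2.5,
    marker="o", markersize=4, alpha=0.9,
    label="Observed",
)

# De-emphasized reference: thin, gray, dashed
(ref_line,) = ax.plot(
    x, np.zeros_like(x),
    color="gray", linewidth=0.8, linestyle="--",
    label="Reference",
)

ax.legend()  # picks up the two label= strings
```

Note that ax.plot() returns a list of Line2D objects, which is why the result is unpacked with a one-element tuple.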
Multiple Lines on One Axes
To plot multiple series on the same Axes, just call ax.plot() multiple times:
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(years, usa, label="USA", color="steelblue")
ax.plot(years, europe, label="Europe", color="darkorange")
ax.plot(years, asia, label="Asia", color="seagreen")
ax.legend()
ax.set_title("Regional Trends")
Each call to ax.plot() adds another Line2D to the Axes. The colors cycle through matplotlib's default color cycle unless you specify explicit colors. For publication-quality charts, you should always specify colors explicitly rather than relying on the default cycle — this is a Chapter 3 concern.
Handling Datetime x-axes
For time-series data, the x-axis is often a date or timestamp rather than a simple integer. Matplotlib handles datetime objects natively, but there are a few things to know:
import pandas as pd
import matplotlib.pyplot as plt
dates = pd.date_range("2020-01-01", periods=100, freq="D")
values = [...] # some data
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(dates, values)
# Optional: rotate the x-axis labels if they overlap
fig.autofmt_xdate()
The fig.autofmt_xdate() call is a convenience that rotates the date labels and adjusts their spacing automatically. Without it, long date labels can overlap. For more control over date formatting, use matplotlib.dates.DateFormatter — see Chapter 12 for details.
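For a version of the datetime snippet that runs as written, here is a sketch that fills in the data with a synthetic random walk. The values are invented purely so that the call has something to draw.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 100 consecutive daily timestamps
dates = pd.date_range("2020-01-01", periods=100, freq="D")

# Synthetic stand-in data: a seeded random walk (not real measurements)
rng = np.random.default_rng(0)
values = np.cumsum(rng.normal(size=len(dates)))

fig, ax = plt.subplots(figsize=(10, 4))
(line,) = ax.plot(dates, values)  # matplotlib converts the dates itself
fig.autofmt_xdate()               # rotate and right-align the date labels
```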
Common Line Chart Pitfalls
Pitfall 1: Connecting missing data with a line. If your data has gaps (missing values), matplotlib will connect the points across the gap by default, producing a line that falsely implies continuity. The fix is to mark missing values as NaN (numpy's Not-a-Number), which matplotlib handles correctly by breaking the line:
import numpy as np
values = [1.0, 2.0, np.nan, np.nan, 5.0, 6.0] # gap in the middle
ax.plot(x, values) # line will break at the NaN values
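In practice, gaps often show up as missing rows rather than as NaN values. One way to make them explicit, sketched here with pandas on a small invented series, is to reindex onto the full range of x-values so that the absent years become NaN.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented observations: the years 2002 and 2003 are simply absent
obs = pd.Series([1.0, 2.0, 5.0, 6.0], index=[2000, 2001, 2004, 2005])

# Reindex onto every year in the range; absent years become NaN
full_years = np.arange(obs.index.min(), obs.index.max() + 1)
with_gaps = obs.reindex(full_years)

fig, ax = plt.subplots()
ax.plot(with_gaps.index, with_gaps.values, marker="o")  # line breaks at the NaNs
```

Plotting obs directly would draw a straight segment from 2001 to 2004; plotting with_gaps leaves that stretch empty.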
Pitfall 2: Unsorted x-axis. If your x-values are not sorted, the line will zigzag back and forth because matplotlib connects the points in the order you provide them. Sort your data first:
df = df.sort_values("year")
ax.plot(df["year"], df["value"])
Pitfall 3: Too many lines. A line chart with 20+ series becomes a "spaghetti chart" that the reader cannot parse. Chapter 8 covered the solution: small multiples (one line per panel) or the highlight strategy (gray out most lines, highlight a few). Do not try to force 20 lines onto one Axes.
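The highlight strategy can be sketched in a few lines. The 20 series below are synthetic random walks, and the choice of "Region 3" as the highlighted series is arbitrary; both are assumptions made only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(2000, 2021)

# 20 synthetic series (random walks), keyed by a made-up region name
series = {f"Region {i}": np.cumsum(rng.normal(size=years.size)) for i in range(20)}
highlight = "Region 3"

fig, ax = plt.subplots(figsize=(10, 4))

# Context lines: thin, light gray, drawn underneath
for name, vals in series.items():
    if name != highlight:
        ax.plot(years, vals, color="lightgray", linewidth=0.8, zorder=1)

# The one series the reader should follow: thick, saturated, drawn on top
(hl_line,) = ax.plot(
    years, series[highlight],
    color="#d62728", linewidth=2.5, zorder=2, label=highlight,
)
ax.legend()
```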
Pitfall 4: Missing units. The line itself shows the pattern, but without axis labels (Chapter 7), the reader does not know what the values mean. Always set ax.set_ylabel("Temperature (°C)") or similar.
The Climate Line Chart
Apply line charts to the progressive project:
import matplotlib.pyplot as plt
import pandas as pd
climate = pd.read_csv("climate_data.csv") # columns: year, anomaly
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(climate["year"], climate["anomaly"], color="#d62728", linewidth=1.5)
ax.axhline(0, color="gray", linewidth=0.8, linestyle="--") # baseline reference line
ax.set_title("Global Temperature Anomaly")
ax.set_xlabel("Year")
ax.set_ylabel("Temperature Anomaly (°C)")
fig.savefig("climate_line.png", dpi=300, bbox_inches="tight")
The ax.axhline(0, ...) call adds a horizontal reference line at y=0 (the baseline). This is useful context for anomaly data. The color="#d62728" is a warm red, chosen to suggest "temperature." The linewidth=1.5 is the default but stated explicitly. This is still not polished — the title is descriptive, not an action title; the spines are still there; no annotations — but it is a meaningful improvement over the Chapter 10 version.
Check Your Understanding — In the code above, identify every place where a Chapter 3, Chapter 6, or Chapter 7 principle is being implemented (or not). Which parameters encode design decisions, and which defaults are being accepted?
11.2 Bar Charts: ax.bar() and ax.barh()
Bar charts are the right choice for comparing values across categories; they are the signature chart type for the comparison question from Chapter 5. The matplotlib methods are ax.bar() (vertical bars) and ax.barh() (horizontal bars).
The Basic Call
fig, ax = plt.subplots(figsize=(8, 5))
categories = ["Enterprise", "Professional", "Starter", "Growth", "Legacy"]
revenue = [450, 320, 180, 250, 95] # USD millions
ax.bar(categories, revenue)
ax.set_title("Revenue by Product Line")
ax.set_ylabel("Revenue (USD millions)")
The ax.bar(categories, values) call takes a list of category names (x-positions) and a list of values (bar heights). matplotlib automatically positions the bars at integer x-coordinates and uses the category names as tick labels.
Horizontal Bars
For many categories or long category labels, horizontal bars are usually cleaner:
fig, ax = plt.subplots(figsize=(8, 6))
categories = ["Enterprise Plan", "Professional Plan", "Starter Plan", "Growth Plan", "Legacy Plan"]
revenue = [450, 320, 180, 250, 95]
ax.barh(categories, revenue)
ax.set_title("Revenue by Product Line")
ax.set_xlabel("Revenue (USD millions)")
ax.barh() flips the orientation: categories run down the y-axis, bars extend horizontally. Long category labels fit comfortably on the y-axis, which is why horizontal bars are the default choice for rankings with many categories (Chapter 8's "tall aspect ratio" principle).
Sorted Bars for Ranking
For comparison questions, sort the bars by value so the ranking is visible:
import pandas as pd
df = pd.DataFrame({"category": categories, "revenue": revenue})
df_sorted = df.sort_values("revenue", ascending=True) # ascending for barh so biggest is at top
fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(df_sorted["category"], df_sorted["revenue"])
ax.set_title("Revenue by Product Line (Sorted)")
ax.set_xlabel("Revenue (USD millions)")
Sorting is a small choice with a large effect on readability. The sorted version lets the reader see the ranking instantly; the unsorted version requires the reader to compare bar lengths by eye.
Styling Parameters
Bar charts share many parameters with other chart types, plus a few specific to bars:
ax.bar(
    categories,
    values,
    color="steelblue",   # fill color
    edgecolor="black",   # outline color (optional)
    width=0.8,           # bar width in x data units (default 0.8)
    alpha=0.8,           # transparency
    linewidth=0.5,       # edge line width
    label="2024",        # for the legend
)
The most important parameters are color and width. Use edgecolor only when you want visible borders around the bars (usually not necessary for clean designs).
Grouped Bars
To compare multiple series across the same categories — for example, 2023 vs 2024 revenue by product line — use grouped bars. This requires manually positioning the bars side by side:
import numpy as np
categories = ["Enterprise", "Professional", "Starter", "Growth", "Legacy"]
revenue_2023 = [380, 300, 170, 220, 110]
revenue_2024 = [450, 320, 180, 250, 95]
x = np.arange(len(categories)) # [0, 1, 2, 3, 4]
width = 0.35
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(x - width/2, revenue_2023, width, label="2023", color="#1f77b4")
ax.bar(x + width/2, revenue_2024, width, label="2024", color="#ff7f0e")
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("Revenue (USD millions)")
ax.set_title("Revenue by Product Line: 2023 vs 2024")
ax.legend()
Key details: use numpy integers for the base x positions; offset each series by ±width/2; set the width parameter so the bars do not overlap; use set_xticks and set_xticklabels explicitly to position the category labels at the center of each group. This is more verbose than a pandas or seaborn grouped bar chart, but it is the pure matplotlib way.
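The ±width/2 offsets are a special case of a more general pattern. As a sketch (the variable names group_width and bar_width are our own, not matplotlib parameters), the offsets for any number of series can be computed in a loop:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

categories = ["Enterprise", "Professional", "Starter", "Growth", "Legacy"]
series = {
    "2023": [380, 300, 170, 220, 110],
    "2024": [450, 320, 180, 250, 95],
}

x = np.arange(len(categories))   # one base position per category
n = len(series)
group_width = 0.8                # total width occupied by each group
bar_width = group_width / n      # width of a single bar

fig, ax = plt.subplots(figsize=(10, 5))
for i, (label, vals) in enumerate(series.items()):
    # Center the group on x: offsets run from -(n-1)/2 to +(n-1)/2 bar widths
    offset = (i - (n - 1) / 2) * bar_width
    ax.bar(x + offset, vals, bar_width, label=label)

ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("Revenue (USD millions)")
ax.legend()
```

Adding a third entry to series would shrink each bar and recompute the offsets without any other changes.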
Stacked Bars
Stacked bars show composition — how a total breaks down into parts. Use the bottom parameter to stack one bar on top of another:
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(categories, subscription_revenue, label="Subscription", color="#1f77b4")
ax.bar(categories, services_revenue, bottom=subscription_revenue, label="Services", color="#ff7f0e")
ax.set_ylabel("Revenue (USD millions)")
ax.legend()
The bottom=subscription_revenue parameter tells matplotlib to start the services bar where the subscription bar ended, stacking them vertically. For three or more series, accumulate the bottoms:
import numpy as np
subscription = np.array([300, 250, 120, 180, 80])
services = np.array([100, 50, 40, 50, 10])
licensing = np.array([50, 20, 20, 20, 5])
ax.bar(categories, subscription, label="Subscription")
ax.bar(categories, services, bottom=subscription, label="Services")
ax.bar(categories, licensing, bottom=subscription + services, label="Licensing")
A warning: stacked bars are good at showing totals and at comparing the first (bottom) segment across categories. They are bad at comparing the middle or top segments because those segments do not share a common baseline. Chapter 8 discussed this limitation. For comparing specific segments, use grouped bars instead.
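Writing out bottom=subscription + services by hand gets tedious beyond three series. A running total handles any number of segments; the sketch below reuses the arrays from above, and the name running_bottom is our own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

categories = ["Enterprise", "Professional", "Starter", "Growth", "Legacy"]
segments = {
    "Subscription": np.array([300, 250, 120, 180, 80]),
    "Services": np.array([100, 50, 40, 50, 10]),
    "Licensing": np.array([50, 20, 20, 20, 5]),
}

fig, ax = plt.subplots(figsize=(8, 5))
running_bottom = np.zeros(len(categories))  # where the next segment starts
for label, vals in segments.items():
    ax.bar(categories, vals, bottom=running_bottom, label=label)
    running_bottom = running_bottom + vals  # stack the next segment on top

totals = running_bottom  # after the loop, this is the full bar height
ax.set_ylabel("Revenue (USD millions)")
ax.legend()
```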
Common Bar Chart Pitfalls
Pitfall 1: Non-zero baseline. Chapter 4 was explicit: bar charts must have a y-axis that starts at zero. The bar length is the encoding, and a truncated baseline distorts the comparison. If your default matplotlib output starts at a non-zero value, override it: ax.set_ylim(0, max(values) * 1.1).
Pitfall 2: Too many bars. A bar chart with 50+ bars becomes unreadable. Consider aggregating, filtering, or using a different chart type (a heatmap or a strip plot for denser displays).
Pitfall 3: Unsorted categories. Alphabetical order is rarely meaningful. Sort by value (for rankings), by time (for temporal categories), or by a natural order (for ordinal categories like "Low", "Medium", "High").
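For ordinal categories, one option is to declare the order with an ordered pandas Categorical and sort by it before plotting. The data frame below is invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented data, deliberately out of order
df = pd.DataFrame({"level": ["Medium", "High", "Low"], "count": [12, 5, 20]})

# Declare the natural order explicitly; sorting then follows it, not the alphabet
order = ["Low", "Medium", "High"]
df["level"] = pd.Categorical(df["level"], categories=order, ordered=True)
df_sorted = df.sort_values("level")

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(df_sorted["level"].astype(str), df_sorted["count"])
ax.set_ylabel("Count")
```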
Pitfall 4: Long category labels. Switch to horizontal bars (barh) rather than rotating vertical labels. Horizontal bars are more legible for long labels.
The Climate Bar Chart: Decade Averages
For the climate progressive project, a bar chart makes sense if we average temperatures by decade:
climate["decade"] = (climate["year"] // 10) * 10
decade_avg = climate.groupby("decade")["anomaly"].mean()
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(decade_avg.index, decade_avg.values, width=8, color="#d62728", edgecolor="white")
ax.axhline(0, color="gray", linewidth=0.8)
ax.set_title("Decadal Average Temperature Anomaly")
ax.set_xlabel("Decade")
ax.set_ylabel("Average Anomaly (°C)")
Here, width=8 makes each bar 8 data units (years) wide out of a 10-year decade, leaving a small gap between decades. Because ax.bar() centers bars on their x-values by default, each bar is centered on the first year of its decade. The axhline(0) shows the baseline clearly. The result is a chart that answers "how has each decade's average compared to the 1951-1980 baseline?", which is a different question than the line chart answered.
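To see the decade binning in isolation, here is a sketch on a handful of invented rows (the anomaly values are made up, not climate data). It also passes align="edge", one option if you would rather have each bar start at the first year of its decade than be centered on it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "year": [1989, 1990, 1995, 1999, 2000, 2009],
    "anomaly": [0.2, 0.4, 0.4, 0.4, 0.5, 0.7],
})

# Integer-divide by 10 to drop the last digit, then scale back up
toy["decade"] = (toy["year"] // 10) * 10
decade_avg = toy.groupby("decade")["anomaly"].mean()

fig, ax = plt.subplots(figsize=(6, 4))
# align="edge": the bar's left edge, not its center, sits on the decade value
bars = ax.bar(decade_avg.index, decade_avg.values, width=8, align="edge")
```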
Check Your Understanding — What question does the decade bar chart answer that the line chart does not? What does the line chart show that the decade bar chart hides?
11.3 Scatter Plots: ax.scatter()
Scatter plots are the right chart type for relationship questions — showing how two continuous variables relate to each other. The matplotlib method is ax.scatter().
The Basic Call
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x, y)
ax.set_title("Simple Scatter")
ax.set_xlabel("x")
ax.set_ylabel("y")
The ax.scatter(x, y) call plots one dot per data point. It returns a single PathCollection Artist that holds all of the markers, rather than one Artist per dot. The basic call produces default blue dots with default size.
Why ax.scatter() and not ax.plot(x, y, marker="o", linestyle="None")? Both produce dot plots, but ax.scatter() supports per-point color and size encodings, while ax.plot() applies a single color and size to all points. For scatter plots with encoded variables, always use ax.scatter().
Styling Parameters
ax.scatter(
    x, y,
    c="steelblue",        # color (single value or array of values)
    s=30,                 # size in points^2 (single value or array)
    alpha=0.6,            # transparency
    edgecolors="white",   # marker edge color
    linewidths=0.5,       # marker edge width
    marker="o",           # shape
    label="Data Points",
)
The key parameters specific to scatter are:
c: color. Can be a single color (like ax.plot()) or an array of values that matplotlib will map to colors via a colormap. When you pass an array, matplotlib uses the cmap parameter to determine the color mapping.
s: size in points-squared. Can be a single number or an array. When you pass an array, each point is sized by its corresponding value. This is how bubble charts are made.
cmap: the colormap to use when c is an array. Default is "viridis". Options include "plasma", "coolwarm", "RdYlBu_r", and many others. Chapter 3's color palette discussion applies directly here.
Encoding Extra Variables with Color and Size
The power of scatter is encoding additional variables. A scatter plot with:
- x-position = variable 1
- y-position = variable 2
- color = variable 3
- size = variable 4
...encodes four variables simultaneously. This is the Minard-style multi-variable encoding from Chapter 5.
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(
    df["gdp_per_capita"],
    df["life_expectancy"],
    c=df["child_mortality"],    # third variable: color
    s=df["population"] / 1e6,   # fourth variable: size (scaled)
    cmap="RdYlGn_r",            # red (high mortality) to green (low)
    alpha=0.7,
    edgecolors="black",
    linewidths=0.5,
)
ax.set_xlabel("GDP per Capita (USD)")
ax.set_ylabel("Life Expectancy (years)")
ax.set_title("Global Development")
# Add a colorbar for the color encoding
fig.colorbar(scatter, label="Child Mortality (per 1000)")
The fig.colorbar(scatter, ...) call adds a colorbar that shows the mapping from color to value. This is essential when using color as an encoding — without a colorbar, the reader cannot decode the values.
Managing Overplotting
When you have many points, they can overlap and obscure the distribution. This is overplotting, and it is the main perceptual problem with scatter plots.
Techniques to manage overplotting:
1. Transparency (alpha). Make each point partly transparent so overlap becomes visible:
ax.scatter(x, y, alpha=0.3)
With alpha=0.3, a single point appears pale, but 10 overlapping points appear dark. The density is encoded in the opacity.
2. Smaller markers. With many points, smaller markers are easier to read:
ax.scatter(x, y, s=5) # small dots
3. Different colors or markers for groups. If the overplotting is between distinct groups, use color to separate them:
for group_name, group_df in df.groupby("category"):
    ax.scatter(group_df["x"], group_df["y"], label=group_name, alpha=0.6)
ax.legend()
4. Hexagonal binning. For very large datasets, replace the scatter with a hexbin plot that shows density:
ax.hexbin(x, y, gridsize=30, cmap="viridis")
We will see hexbin in more detail in Chapter 14. For now, know that it exists as an option when overplotting is severe.
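To compare these techniques on the same data, the sketch below draws 20,000 synthetic correlated points three ways: opaque, nearly transparent, and hexbinned. The data and the specific alpha and gridsize values are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, correlated point cloud
rng = np.random.default_rng(42)
x = rng.normal(size=20_000)
y = 0.6 * x + rng.normal(scale=0.8, size=x.size)

fig, (ax_raw, ax_alpha, ax_hex) = plt.subplots(
    1, 3, figsize=(12, 4), sharex=True, sharey=True
)

ax_raw.scatter(x, y, s=5)                             # opaque: a solid blob
ax_raw.set_title("Default")

ax_alpha.scatter(x, y, s=5, alpha=0.05, linewidths=0) # density shows as darkness
ax_alpha.set_title("alpha=0.05")

hb = ax_hex.hexbin(x, y, gridsize=30, cmap="viridis") # counts per hexagon
ax_hex.set_title("hexbin")
fig.colorbar(hb, ax=ax_hex, label="Points per hexagon")
```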
Common Scatter Pitfalls
Pitfall 1: Misleading bubble sizes. If you encode a variable as size, make the marker's area, not its diameter, proportional to the value. matplotlib's s parameter is in points squared, which is an area, so s=value (times a constant scale factor) is usually correct. If your quantity really is a radius, convert it first with s=pi * radius**2. The danger is code like s=value where value is a radius: marker area then grows only linearly with the radius instead of quadratically, making small values look too big relative to large ones. Chapter 4 discussed this distortion.
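A small sketch of area-proportional scaling, using three invented population values and an arbitrary maximum marker area of 400 points squared (the name max_area is ours):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Invented values: each is 4x the previous one
population = np.array([1e6, 4e6, 16e6])

# Scale so the largest value gets max_area; s is an area (points^2),
# so area is proportional to value and diameter to its square root
max_area = 400
sizes = population / population.max() * max_area

fig, ax = plt.subplots(figsize=(5, 3))
bubbles = ax.scatter([1, 2, 3], [1, 1, 1], s=sizes, alpha=0.6)
ax.set_xlim(0, 4)
```

The areas come out in a 1 : 4 : 16 ratio, so the diameters are in a 1 : 2 : 4 ratio, which is what the reader's eye should be comparing.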
Pitfall 2: Too many points. A scatter plot with 100,000+ points will be slow to render and will obscure patterns. Consider sampling, hexbin, or a density plot.
Pitfall 3: Using scatter for categorical data. If one of your variables is categorical (e.g., product type), a scatter plot stacks every point in a category at the same x-position, creating one vertical strip per category in which points hide each other. For categorical data, use a bar chart, a box plot, or a strip plot with jittering.
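A jittered strip plot can be built from ax.scatter() directly. In this sketch the three groups and their values are synthetic, and the jitter range of ±0.15 is an arbitrary choice that keeps the strips visually separate.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)

# Synthetic values for three made-up categories
groups = {
    "Starter": rng.normal(50, 10, size=40),
    "Growth": rng.normal(65, 12, size=40),
    "Enterprise": rng.normal(80, 8, size=40),
}

fig, ax = plt.subplots(figsize=(6, 4))
for pos, (name, vals) in enumerate(groups.items()):
    # Spread points horizontally around the category's integer position
    jitter = rng.uniform(-0.15, 0.15, size=vals.size)
    ax.scatter(pos + jitter, vals, s=15, alpha=0.6)

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(list(groups))
ax.set_ylabel("Value")
```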
The Climate Scatter: CO2 vs Temperature
co2 = pd.read_csv("co2_data.csv") # year, co2_ppm
merged = pd.merge(climate, co2, on="year")
fig, ax = plt.subplots(figsize=(7, 6))
scatter = ax.scatter(
    merged["co2_ppm"],
    merged["anomaly"],
    c=merged["year"],
    cmap="viridis",
    s=30,
    alpha=0.7,
    edgecolors="white",
    linewidths=0.3,
)
ax.set_xlabel("CO2 Concentration (ppm)")
ax.set_ylabel("Temperature Anomaly (°C)")
ax.set_title("CO2 vs Temperature")
fig.colorbar(scatter, label="Year")
The scatter plot answers a different question than the line or bar charts: "Is there a relationship between CO2 and temperature?" The color encodes year, so the reader can see both the relationship and how it evolved over time. The near-linear pattern (from low-left to high-right) makes the correlation visible.
Check Your Understanding — The climate scatter uses year as a color encoding. Why is this better than adding year as a third axis (a 3D scatter plot) or as two separate charts (one per decade)?
11.4 Histograms: ax.hist()
Histograms show the distribution of a single continuous variable — how values are spread out, where they cluster, whether the distribution is symmetric or skewed. The matplotlib method is ax.hist().
The Basic Call
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(values, bins=30)
ax.set_title("Distribution of Values")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
The ax.hist(values, bins=N) call divides the range of values into N equal-width bins and counts how many values fall into each bin. The result is a bar chart where the x-axis is the value and the y-axis is the count.
Key Parameters
bins: the number of bins (or a sequence of bin edges). Default is 10. Bin count dramatically affects the shape of the histogram:
- Too few bins (e.g., 5): the distribution looks lumpy and simplified.
- Too many bins (e.g., 100): the distribution looks noisy and every small variation is visible.
- Just right (usually 20-50 for most datasets): the shape is smooth and informative.
ax.hist(values, bins=20)
ax.hist(values, bins=[0, 5, 10, 20, 40]) # explicit bin edges
density: if True, normalize the bar heights so the total area equals 1 (a probability density). Useful when comparing distributions of different sizes.
ax.hist(values, bins=30, density=True)
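A quick check of what density=True does, on synthetic data: ax.hist() returns the bar heights and bin edges, and the heights multiplied by the bin widths should sum to 1.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=500)  # synthetic sample

fig, ax = plt.subplots(figsize=(7, 5))
# ax.hist returns (bar heights, bin edges, the bar Artists)
heights, edges, patches = ax.hist(values, bins=30, density=True)

# Total area = sum of height * width over all bins
area = np.sum(heights * np.diff(edges))
```

Note that it is the area, not the sum of the heights, that equals 1, so individual bar heights can exceed 1 when the bins are narrow.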
histtype: the rendering style. Options are "bar" (default, filled bars; multiple datasets are drawn side by side), "barstacked" (multiple datasets stacked on top of each other), "step" (outline only, good for overlaid histograms), and "stepfilled" (a filled outline without interior bar edges).
ax.hist(values, bins=30, histtype="step", linewidth=2)
color, alpha, edgecolor: as in bar charts.
Overlaid Histograms
To compare distributions of different groups, plot them on the same axes with transparency:
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(group_a, bins=30, alpha=0.6, label="Group A", color="#1f77b4")
ax.hist(group_b, bins=30, alpha=0.6, label="Group B", color="#ff7f0e")
ax.legend()
ax.set_title("Distribution Comparison")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
With alpha=0.6, the overlap between the two histograms is visible. The reader can see where the distributions overlap and where they differ.
Alternative: use histtype="step" to show only the outlines, which works well for two or three groups without cluttering:
ax.hist(group_a, bins=30, histtype="step", label="Group A", linewidth=2)
ax.hist(group_b, bins=30, histtype="step", label="Group B", linewidth=2)
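One caveat with the calls above: bins=30 is computed separately from each group's own range, so the two histograms generally end up with different bin edges and widths. A common remedy, sketched here on synthetic groups, is to compute one set of edges from the pooled data and pass it to both calls.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
group_a = rng.normal(0.0, 1.0, size=1000)  # synthetic
group_b = rng.normal(1.5, 0.5, size=100)   # synthetic, smaller and narrower

# One shared set of 30 bins spanning both groups
shared_edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=30)

fig, ax = plt.subplots(figsize=(8, 5))
_, edges_a, _ = ax.hist(group_a, bins=shared_edges, density=True,
                        alpha=0.6, label="Group A", color="#1f77b4")
_, edges_b, _ = ax.hist(group_b, bins=shared_edges, density=True,
                        alpha=0.6, label="Group B", color="#ff7f0e")
ax.legend()
ax.set_xlabel("Value")
ax.set_ylabel("Density")
```

density=True is used here because the groups differ in size by a factor of ten; with raw counts Group B would barely be visible.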
Bin Width Selection
Choosing the number of bins is not arbitrary. Several heuristics exist:
Sturges' rule: bins = ceil(log2(n) + 1) where n is the number of data points. Good for small-to-medium datasets. Available in matplotlib as bins="sturges"; it is not the default, which is a fixed 10 bins.
Scott's rule: bin_width = 3.49 * std / n^(1/3). Assumes normality. Produces wider bins for high-variance data.
Freedman-Diaconis rule: bin_width = 2 * IQR / n^(1/3). More robust to outliers than Scott's rule.
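In practice, try a few bin counts and pick the one that shows the structure clearly. matplotlib does not apply any of these rules by default, but ax.hist() accepts the rule names that numpy's np.histogram_bin_edges() understands, such as bins="sturges", bins="scott", bins="fd" (Freedman-Diaconis), or bins="auto". The default of 10 bins is a reasonable starting point, not a recommendation.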
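The three rules are short enough to compute by hand. This sketch applies them to a synthetic sample of 1,000 points and, for comparison, asks numpy for its own Freedman-Diaconis edges; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(11)
values = rng.normal(size=1000)  # synthetic sample
n = values.size

# Sturges: a bin count that depends only on n
sturges_bins = int(np.ceil(np.log2(n) + 1))

# Scott: a bin width from the sample standard deviation
scott_width = 3.49 * values.std(ddof=1) / n ** (1 / 3)

# Freedman-Diaconis: a bin width from the interquartile range
q25, q75 = np.percentile(values, [25, 50 + 25])
fd_width = 2 * (q75 - q25) / n ** (1 / 3)

# Convert a width to a bin count by dividing the data range by it
data_range = values.max() - values.min()
fd_bins_by_hand = data_range / fd_width

# numpy's implementation of the same rule, for comparison
fd_edges = np.histogram_bin_edges(values, bins="fd")
```

For n = 1000, Sturges gives 11 bins regardless of the data's shape, which is one reason the width-based rules are often preferred for larger samples.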
Common Histogram Pitfalls
Pitfall 1: Confusing histograms with bar charts. They look similar, but they answer different questions. Bar charts compare discrete categories; histograms show the distribution of a continuous variable. Chapter 5's chart selection framework addressed this: if your x-axis is categorical, use a bar chart; if it is continuous bins, use a histogram.
Pitfall 2: Misleading bin boundaries. With a small number of bins, the exact boundaries can matter. A bin edge that falls between two clusters can split them in ways that hide the bimodality. Experiment with slightly different bin counts to see if the structure is robust.
Pitfall 3: Not normalizing when comparing groups of different sizes. If Group A has 1000 points and Group B has 100, the raw counts will make Group A look "bigger" even if the distributions have the same shape. Use density=True to normalize, so the histograms show shape rather than count.
The Climate Histogram: Distribution of Annual Anomalies
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(climate["anomaly"], bins=30, color="#d62728", edgecolor="white", alpha=0.8)
ax.axvline(0, color="gray", linewidth=0.8, linestyle="--")
ax.set_title("Distribution of Annual Temperature Anomalies")
ax.set_xlabel("Temperature Anomaly (°C)")
ax.set_ylabel("Number of Years")
The histogram answers: "What is the distribution of yearly temperature anomalies over the full time range?" The shape reveals whether warming has been gradual (a single shifted peak) or bimodal (distinct cold and warm eras). The axvline(0) shows the baseline reference. This is a different question than the line or bar charts answered.
11.5 Box Plots: ax.boxplot()
Box plots show the distribution of a variable in a compact form — the median, the quartiles, the whiskers (range), and the outliers. They are particularly useful for comparing distributions across groups. The matplotlib method is ax.boxplot().
The Basic Call
fig, ax = plt.subplots(figsize=(7, 5))
ax.boxplot(values)
ax.set_title("Box Plot of Values")
ax.set_ylabel("Value")
For a single group of values, ax.boxplot(values) produces one box with:
- A line inside the box at the median.
- The box itself showing the interquartile range (IQR) — from the 25th percentile to the 75th percentile.
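- Whiskers extending to the most extreme data points that lie within 1.5 × IQR of the box edges (for normally distributed data, that limit sits about 2.7 standard deviations from the mean, so roughly 99% of points fall inside the whiskers).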
- Outlier points beyond the whiskers.
This five-number summary compactly represents the distribution without the detail of a histogram.
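The components listed above can be reproduced with numpy and checked against matplotlib.cbook.boxplot_stats(), the helper that computes these statistics. The ten values are invented so that exactly one point falls outside the whiskers.

```python
import numpy as np
from matplotlib import cbook

# Invented data with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, med, q3 = np.percentile(values, [25, 50, 75])  # 3.25, 5.5, 7.75
iqr = q3 - q1                                       # 4.5

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr                        # -3.5
upper_fence = q3 + 1.5 * iqr                        # 14.5

# Whiskers stop at the most extreme data point inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

# Everything outside the fences is drawn as an outlier ("flier")
outliers = values[(values < lower_fence) | (values > upper_fence)]

# matplotlib's own computation, one dict per dataset
stats = cbook.boxplot_stats(values)[0]
```

Here the upper whisker ends at 9, not at the fence value of 14.5, because 9 is the largest observation inside the fence.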
Grouped Box Plots
The real power of box plots is comparing distributions across groups:
fig, ax = plt.subplots(figsize=(8, 5))
ax.boxplot(
    [group_a, group_b, group_c, group_d],
    labels=["A", "B", "C", "D"],
)
ax.set_title("Distribution by Group")
ax.set_ylabel("Value")
The first argument is a list of arrays (one per group). The labels parameter gives each box a category name; matplotlib 3.9 renamed it to tick_labels and deprecated the old spelling, so use whichever matches your installed version. The result is four side-by-side boxes that let the reader compare medians, spreads, and outliers at a glance.
Styling Parameters
matplotlib's box plots have many styling options through dictionaries passed to ax.boxplot:
ax.boxplot(
    groups,
    labels=labels,
    patch_artist=True,   # fill the boxes with color
    showmeans=True,      # show the mean as a marker
    notch=True,          # add a notch at the median for confidence indication
    medianprops=dict(color="black", linewidth=1.5),
    boxprops=dict(facecolor="lightblue", edgecolor="navy"),
    whiskerprops=dict(color="navy"),
    capprops=dict(color="navy"),
    flierprops=dict(marker="o", markersize=3, alpha=0.4),
)
The styling is verbose because each component (box, median, whiskers, caps, outliers) is styled separately. For complex customization, many practitioners use seaborn's sns.boxplot() instead, which has a simpler API. We will cover seaborn in Part IV.
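The props dictionaries apply one style to every box. To color boxes individually, one approach is to use the dictionary that ax.boxplot() returns, which maps component names such as "boxes" and "medians" to lists of Artists. The groups and colors in this sketch are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
groups = [rng.normal(loc, 1.0, size=50) for loc in (0, 1, 2)]  # synthetic
colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]

fig, ax = plt.subplots(figsize=(7, 5))
# patch_artist=True makes each box a fillable patch
bp = ax.boxplot(groups, patch_artist=True)

# bp["boxes"] holds one patch per group, in plotting order
for box, color in zip(bp["boxes"], colors):
    box.set_facecolor(color)
    box.set_alpha(0.6)

# Boxes sit at x = 1, 2, 3 by default; label them after the fact
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["A", "B", "C"])
```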
Horizontal Box Plots
Like bar charts, box plots can be horizontal:
ax.boxplot(groups, labels=labels, vert=False)
The vert=False flag flips the orientation. Use horizontal box plots for the same reason as horizontal bars: long category labels or many groups that would crowd a vertical layout.
Reading a Box Plot
Box plots are compact but require training to read. The key elements:
- The box = middle 50% of the data. A tall box means high variability in the middle values; a short box means the middle values are clustered.
- The median line = the 50th percentile. A median near the middle of the box means the distribution is symmetric; a median near the edge means it is skewed.
- The whiskers = the most extreme data points that lie within 1.5 × IQR of the box edges. Long whiskers mean the data spread well beyond the middle 50%; short whiskers mean the extremes are close to the box.
- Outliers = points beyond the whiskers. Many outliers mean the distribution has a long tail; few mean the distribution is compact.
- Notch (if enabled) = a narrower region at the median showing an approximate 95% confidence interval for the median. Useful for comparing medians across groups: if two notches do not overlap, that is reasonable evidence (roughly at the 95% level) that the medians differ. Treat it as a visual guide, not a formal test.
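When no bootstrap is requested, matplotlib sets the notch limits to the median ± 1.57 × IQR / √n. The sketch below reproduces that calculation on made-up data (the `sample` array is an assumption) and compares it with the `cilo` and `cihi` values reported by `matplotlib.cbook.boxplot_stats`.

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(2)
sample = rng.normal(loc=10.0, scale=2.0, size=150)  # made-up data

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1
n = len(sample)

# Notch half-width used by matplotlib when bootstrap is not requested
half_width = 1.57 * iqr / np.sqrt(n)
notch_low, notch_high = median - half_width, median + half_width

stats = cbook.boxplot_stats(sample)[0]  # "cilo"/"cihi" hold the notch limits
print(f"notch: [{notch_low:.3f}, {notch_high:.3f}]")
```

Because the half-width shrinks with √n, small groups produce wide notches. That is one reason notches are most informative when each group has a reasonable number of observations.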
Box plots are sometimes criticized as hard to read for non-specialists. Alternatives include violin plots (which show the full distribution as a shape) and strip plots (which show individual points). We will cover both in Chapter 17 (seaborn distributional visualization).
Common Box Plot Pitfalls
Pitfall 1: Using box plots for too few data points. With fewer than ~10 data points per group, a box plot is misleading because the quartiles are unstable. For small groups, use a strip plot (individual points) instead.
Pitfall 2: Box plots hide bimodality. A bimodal distribution (two peaks) can look identical to a unimodal distribution in a box plot because two very different shapes can share the same median and quartiles. If you suspect bimodality, check with a histogram or violin plot.
Pitfall 3: Inconsistent whisker definitions. matplotlib's default is 1.5 × IQR, but other tools use different definitions (5th-95th percentile, min-max, etc.). When comparing box plots across tools, verify the whisker definition.
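The whisker definition in Pitfall 3 is controlled by the `whis` parameter of `ax.boxplot()`: a single number is a multiple of the IQR, and a pair of numbers is a pair of percentiles. The following sketch draws the same made-up, right-skewed sample under three definitions so the difference is visible; the `settings` and `results` names are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
data = rng.lognormal(mean=0.0, sigma=0.6, size=300)  # made-up, right-skewed

fig, axes = plt.subplots(1, 3, figsize=(10, 4), sharey=True)
settings = [
    ("1.5 × IQR (default)", 1.5),
    ("5th-95th percentile", (5, 95)),
    ("min-max", (0, 100)),
]
results = {}
for ax, (title, whis) in zip(axes, settings):
    parts = ax.boxplot(data, whis=whis)
    ax.set_title(title)
    # Each whisker is a two-point line; its second y-value is the whisker end
    low_line, high_line = parts["whiskers"]
    results[title] = (low_line.get_ydata()[1], high_line.get_ydata()[1])
```

With min-max whiskers no point can fall outside them, so that panel shows no outliers at all. The same data looks "cleaner" purely because of the whisker convention.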
The Climate Box Plot: Anomalies by Decade
climate["decade"] = (climate["year"] // 10) * 10
decades = sorted(climate["decade"].unique())
decade_groups = [climate[climate["decade"] == d]["anomaly"].values for d in decades]
fig, ax = plt.subplots(figsize=(12, 5))
ax.boxplot(decade_groups, labels=[f"{d}s" for d in decades])
ax.axhline(0, color="gray", linewidth=0.8, linestyle="--")
ax.set_title("Temperature Anomalies by Decade")
ax.set_xlabel("Decade")
ax.set_ylabel("Anomaly (°C)")
The box plot answers: "How did the distribution of annual temperatures change from decade to decade?" This is different from both the line chart (which shows year-by-year values) and the bar chart (which shows decade averages). The box plot shows the range within each decade — some decades had high variability, others were more stable.
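The per-decade arrays can also be built with a single `groupby`, which avoids filtering the DataFrame once per decade. This is a sketch under stated assumptions: the `climate` DataFrame below is a made-up stand-in with the same `year` and `anomaly` columns, not the chapter's real data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the chapter's climate DataFrame (columns: year, anomaly)
rng = np.random.default_rng(4)
years = np.arange(1880, 2020)
climate = pd.DataFrame({
    "year": years,
    "anomaly": 0.008 * (years - 1950) + rng.normal(0, 0.1, size=years.size),
})

climate["decade"] = (climate["year"] // 10) * 10

# groupby yields (decade, Series) pairs in sorted key order
grouped = climate.groupby("decade")["anomaly"]
decades = [decade for decade, _ in grouped]
decade_groups = [series.to_numpy() for _, series in grouped]

# Numeric companion to the box plot: count and quartiles per decade
summary = grouped.describe()[["count", "25%", "50%", "75%"]]
print(summary.head())
```

The resulting `decade_groups` list can be passed to `ax.boxplot()` exactly as in the code above. The `summary` table is useful for checking that every decade has enough years to support a box.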
11.6 Combining Chart Types: When One Chart Is Not Enough
Sometimes the question has multiple dimensions that one chart type cannot address. You can add a line over a histogram, or bars next to a box plot, by calling multiple plot methods on the same Axes:
fig, ax = plt.subplots(figsize=(8, 5))
# Histogram in the background
ax.hist(values, bins=30, color="lightgray", alpha=0.8)
# A line showing the mean on top
ax.axvline(np.mean(values), color="red", linewidth=2, label=f"Mean = {np.mean(values):.2f}")
# A line showing the median
ax.axvline(np.median(values), color="blue", linewidth=2, label=f"Median = {np.median(values):.2f}")
ax.legend()
ax.set_title("Distribution with Mean and Median")
The histogram and the vertical lines share the same Axes. Each is a separate Artist added to the Axes's tree. This combination technique works for any pair of chart types that make sense together.
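The same layering works with a curve instead of vertical lines. The sketch below overlays a normal density on a histogram of made-up data (the `values` array and the choice of a normal curve are assumptions for illustration). The key detail is `density=True`, which puts the bars on the same vertical scale as the curve.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
values = rng.normal(loc=50.0, scale=8.0, size=1000)  # made-up data

fig, ax = plt.subplots(figsize=(8, 5))

# density=True rescales bar heights so the histogram integrates to 1,
# which puts it on the same scale as a probability density curve
counts, edges, _ = ax.hist(values, bins=30, density=True, color="lightgray")

# Normal density using the sample mean and standard deviation
mu, sigma = values.mean(), values.std(ddof=1)
x = np.linspace(edges[0], edges[-1], 300)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
ax.plot(x, pdf, color="#d62728", linewidth=2, label="Normal fit")

ax.legend()
ax.set_title("Histogram with Fitted Normal Curve")
```

Without `density=True` the bars would show raw counts and the curve would sit flat along the bottom of the chart.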
For more elaborate multi-chart figures (different chart types in different panels), use multi-panel layouts with plt.subplots(rows, cols) — which we will cover in detail in Chapter 13.
11.7 pandas Integration: The Shortcut
pandas DataFrames have built-in plot methods that are thin wrappers around matplotlib. They are convenient for exploratory charts:
df.plot(x="year", y="value", kind="line")
df.plot(x="category", y="count", kind="bar")
df.plot(kind="scatter", x="gdp", y="life")
df["value"].hist(bins=30)
df.boxplot(column="value", by="category")
Each of these methods creates a Figure and Axes internally, plots the data, and returns the Axes object. You can continue to customize the chart using the returned Axes:
ax = df.plot(x="year", y="value", figsize=(10, 4))
ax.set_title("Custom Title")
ax.set_ylabel("Value (units)")
ax.grid(False)
For exploratory analysis in a Jupyter notebook, df.plot() is faster to type than the explicit fig, ax = plt.subplots() pattern. For production code, the explicit OO pattern is usually clearer because it makes the Figure and Axes references visible.
Recommendation: use pandas plot methods for quick exploration; use the explicit OO pattern for publication-quality code. Both produce the same matplotlib output; the difference is in how much control you have over the process.
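The two approaches can also be combined: the pandas plot methods accept an `ax` argument, so you can create the Figure and Axes explicitly and let pandas draw onto them. A minimal sketch with a made-up DataFrame:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up example data
df = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021, 2022],
    "value": [3.1, 3.4, 2.9, 3.8, 4.0],
})

fig, ax = plt.subplots(figsize=(10, 4))  # explicit Figure and Axes
returned = df.plot(x="year", y="value", ax=ax, legend=False)  # pandas draws on our Axes

ax.set_title("Custom Title")
ax.set_ylabel("Value (units)")
```

Here `returned` is the same object as `ax`, so the Figure and Axes references stay visible in the code while pandas handles the column lookup.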
11.8 Error Bars and Uncertainty: A Small but Essential Extension
No chart of real data is complete without some acknowledgment of uncertainty. Measurement noise, sampling variance, confidence intervals — these are the things that a raw data point hides but that an honest chart should show. matplotlib supports error bars directly through several methods.
Error Bars on Line and Scatter Charts
The ax.errorbar() method is a line chart (or scatter, depending on parameters) with error bars attached:
fig, ax = plt.subplots(figsize=(10, 4))
ax.errorbar(
    years,
    temperature,
    yerr=uncertainty,  # error in y (can be a single value or an array)
    fmt="o-",  # format: dots connected by lines
    capsize=3,  # size of the little caps at the ends of error bars
    color="#d62728",
    ecolor="gray",  # error bar color (different from line color)
    elinewidth=0.8,  # error bar line width
    alpha=0.9,
)
ax.set_title("Temperature with Measurement Uncertainty")
The yerr parameter takes a single value (the same error for every point) or an array (a different error for each point). You can also pass xerr for horizontal error. The fmt argument uses matplotlib's abbreviated marker+linestyle syntax: "o-" is circles with solid lines, "s--" is squares with dashed lines, and "none" draws only the error bars, with no markers or connecting line. To show markers without a connecting line, pass a marker alone, such as "o".
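Uncertainty is not always symmetric. For that case `yerr` also accepts a two-row sequence: the first row holds the distance below each point and the second row the distance above. The sketch below uses made-up numbers; the names `lower` and `upper` are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Made-up measurements with different lower and upper uncertainties
x = np.array([1, 2, 3, 4])
y = np.array([2.0, 2.5, 3.1, 3.0])
lower = np.array([0.2, 0.1, 0.3, 0.2])  # distance below each point
upper = np.array([0.4, 0.3, 0.5, 0.6])  # distance above each point

fig, ax = plt.subplots(figsize=(8, 4))

# yerr with shape (2, N): first row is the lower error, second row the upper error
container = ax.errorbar(x, y, yerr=[lower, upper], fmt="o", capsize=3,
                        color="#d62728", ecolor="gray")

# The container unpacks into the data line, the cap lines, and the bar collections
data_line, cap_lines, bar_collections = container
segments = bar_collections[0].get_segments()  # one vertical segment per point
```

The values in `yerr` are distances from the point, not absolute bounds, so they must be non-negative.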
Shaded Confidence Bands
For time series with a continuous uncertainty range, a shaded band is often clearer than per-point error bars:
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(years, temperature, color="#d62728", linewidth=1.5)
ax.fill_between(
    years,
    temperature - uncertainty,  # lower bound
    temperature + uncertainty,  # upper bound
    color="#d62728",
    alpha=0.2,  # transparent fill
)
ax.set_title("Temperature with Confidence Band")
The ax.fill_between() method fills the vertical region between two y-value arrays at each x-value. Combined with a central line, it produces the "line with shaded band" pattern that is standard for climate reconstructions, economic forecasts, and any other time series with continuous uncertainty.
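The band's bounds have to come from somewhere. One common source is a set of replicate series, from which you can compute a mean and an approximate interval at each x-value. The sketch below is illustrative only: the replicates are simulated, and the 1.96 multiplier is a normal approximation (an assumption that is slightly optimistic for small numbers of replicates).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 20 noisy replicate series around a slow upward trend
rng = np.random.default_rng(6)
years = np.arange(1950, 2021)
trend = 0.015 * (years - 1950)
replicates = trend + rng.normal(0, 0.15, size=(20, years.size))

# Point estimate and an approximate 95% interval for the mean at each year
mean = replicates.mean(axis=0)
sem = replicates.std(axis=0, ddof=1) / np.sqrt(replicates.shape[0])
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(years, mean, color="#d62728", linewidth=1.5, label="Mean")
band = ax.fill_between(years, lower, upper, color="#d62728", alpha=0.2,
                       label="Approx. 95% interval")
ax.legend()
```

Labeling the band in the legend matters: a shaded region could mean a standard deviation, a standard error, or a confidence interval, and the reader cannot tell which unless the chart says so.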
Error Bars on Bar Charts
Bar charts support error bars through the yerr parameter:
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(
    categories,
    means,
    yerr=standard_errors,
    capsize=5,
    color="steelblue",
    ecolor="black",
)
ax.set_title("Group Means with Standard Error")
ax.set_ylabel("Mean value")
Error bars on bar charts are the visual equivalent of saying "here is the estimate, and here is how much it could wiggle if we ran the experiment again." Without them, the bar chart implies precision that the data does not support.
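The `means` and `standard_errors` arrays in the example above are usually derived from raw measurements. A minimal sketch of that step follows, using made-up samples; the group names and values are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Made-up raw measurements per group
samples = {
    "Control": np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    "Treatment": np.array([3.0, 4.0, 4.5, 5.0, 6.5]),
}

categories = list(samples)
means = np.array([v.mean() for v in samples.values()])
# Standard error of the mean: sample standard deviation (ddof=1) over sqrt(n)
standard_errors = np.array(
    [v.std(ddof=1) / np.sqrt(v.size) for v in samples.values()]
)

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(categories, means, yerr=standard_errors, capsize=5,
       color="steelblue", ecolor="black")
ax.set_ylabel("Mean value")
```

Whatever the bars represent (standard error, standard deviation, or a confidence interval), state it in the title, axis label, or caption, because the three can differ in length by a large factor.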
Why Uncertainty Matters
Chapter 4 was explicit: hiding uncertainty is a form of visualization dishonesty. Every real measurement has noise; every estimate has a confidence interval; every forecast has a range of plausible outcomes. A chart that shows point estimates without any uncertainty indicator implies a precision the data does not warrant and leaves the reader no way to judge whether small differences are meaningful. The ax.errorbar() and ax.fill_between() methods give you the tools to show uncertainty; whether to use them is a question of honest communication, not optional decoration.
11.9 Five Chart Types, One Dataset: The Climate Example
To close the chapter, here is the full set of five climate chart types produced by the methods you have just learned.
Chart 1: Line chart of annual temperature anomalies.
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(climate["year"], climate["anomaly"], color="#d62728", linewidth=1.5)
ax.axhline(0, color="gray", linewidth=0.8, linestyle="--")
ax.set_title("Annual Temperature Anomaly")
ax.set_ylabel("Anomaly (°C)")
Answer: "How has the annual temperature anomaly evolved over 140+ years?" (Change over time.)
Chart 2: Bar chart of decade averages.
decade_avg = climate.groupby("decade")["anomaly"].mean()
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(decade_avg.index, decade_avg.values, width=8, color="#d62728")
ax.axhline(0, color="gray", linewidth=0.8)
ax.set_title("Decadal Average Anomaly")
ax.set_ylabel("Average Anomaly (°C)")
Answer: "How does each decade compare to the baseline, averaged across all years in the decade?" (Comparison.)
Chart 3: Scatter plot of CO2 vs temperature.
fig, ax = plt.subplots(figsize=(7, 6))
scatter = ax.scatter(merged["co2_ppm"], merged["anomaly"], c=merged["year"], cmap="viridis", alpha=0.7)
ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Anomaly (°C)")
ax.set_title("CO2 vs Temperature")
fig.colorbar(scatter, label="Year")
Answer: "Is there a relationship between CO2 concentration and temperature?" (Relationship.)
Chart 4: Histogram of annual anomalies.
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(climate["anomaly"], bins=30, color="#d62728", alpha=0.8, edgecolor="white")
ax.axvline(0, color="gray", linewidth=0.8, linestyle="--")
ax.set_title("Distribution of Annual Anomalies")
ax.set_xlabel("Anomaly (°C)")
ax.set_ylabel("Number of Years")
Answer: "How are annual anomalies distributed across the entire time range?" (Distribution.)
Chart 5: Box plot of anomalies by decade.
fig, ax = plt.subplots(figsize=(12, 5))
ax.boxplot(decade_groups, labels=[f"{d}s" for d in decades])
ax.axhline(0, color="gray", linewidth=0.8, linestyle="--")
ax.set_title("Anomaly Distribution by Decade")
ax.set_ylabel("Anomaly (°C)")
Answer: "How does the variability of anomalies within each decade compare?" (Distribution by group.)
Five chart types, one dataset, five different answers. This is the Chapter 5 principle — "the chart type follows from the question" — expressed in matplotlib code. Choosing which chart type to produce is a design decision; the code to produce each one is a mechanical translation of that decision into the appropriate ax method.
None of these charts are polished yet. They are still missing the Chapter 7 disciplines: action titles, annotations, source attribution, custom fonts, cleaned-up spines. Chapter 12 will fix all of these. But the charts are correct, they use the right chart type for each question, and they demonstrate that matplotlib can produce the full Chapter 5 chart matrix with a handful of method calls.
Chapter Summary
This chapter covered the five essential chart types — line, bar, scatter, histogram, and box plot — as implemented in matplotlib's object-oriented API. Each chart type has a corresponding Axes method: ax.plot(), ax.bar() / ax.barh(), ax.scatter(), ax.hist(), and ax.boxplot(). Together, these five methods cover the vast majority of real-world visualization work.
Line charts (ax.plot()) are for change over time or continuous relationships. Key parameters: color, linewidth, linestyle, marker, alpha. Multiple series go on the same Axes by calling ax.plot() multiple times.
Bar charts (ax.bar(), ax.barh()) are for comparing values across categories. Use horizontal bars for long category labels or many categories. Sort by value when the ranking matters. Grouped bars (manual x-position offsets) and stacked bars (the bottom parameter) handle multi-series comparisons.
Scatter plots (ax.scatter()) are for relationships between two continuous variables. The c and s parameters allow encoding additional variables as color and size (bubble chart style). Manage overplotting with alpha, smaller markers, grouping, or hexbin for very large datasets.
Histograms (ax.hist()) show the distribution of a single continuous variable. Choose bin count carefully (usually 20-50 for most datasets). Use density=True to normalize when comparing groups of different sizes. Overlay with transparency (alpha=0.5) for two or three groups.
Box plots (ax.boxplot()) show distribution summaries (median, quartiles, whiskers, outliers) in compact form — particularly useful for comparing distributions across groups. The matplotlib box plot API is verbose; consider seaborn's sns.boxplot() for simpler customization.
The pandas plot methods (df.plot(), df.plot.bar(), etc.) are thin wrappers around matplotlib suitable for quick exploratory charts. For production code, prefer the explicit fig, ax = plt.subplots() pattern.
The threshold concept is that every parameter in every plot method is a design decision. color implements Chapter 3 palette choices. linewidth implements Chapter 6 visual weight. alpha manages the pre-attentive processing concerns from Chapter 2. The API is not just syntax; it is the instrument for implementing the principles from Parts I and II.
The five climate charts in Section 11.9 demonstrate the full Chapter 5 chart selection matrix applied to one dataset. Each chart answers a different question, and each is a direct implementation of the "chart type follows the question" principle.
Next in Chapter 12: Customization Mastery. Now that you can produce each chart type, you learn how to polish them: custom colors, typography, decluttering, annotations, style sheets, and exporting publication-quality output. Chapter 12 is where Parts I and II pay off fully — you take the ugly default charts from this chapter and turn them into publication-quality figures that meet every standard from the first nine chapters.
Spaced Review: Concepts from Chapters 1-10
These questions reinforce ideas from earlier chapters. If any feel unfamiliar, revisit the relevant chapter before proceeding.
- Chapter 2: Cleveland and McGill's encoding accuracy hierarchy ranks position on a common scale as the most accurately perceived encoding. Which of the five chart types in this chapter uses position on a common scale most directly? Which uses less accurate encodings?
- Chapter 3: The cmap parameter in ax.scatter(..., cmap="viridis") selects a colormap. How should you choose a cmap for sequential data (e.g., years)? For diverging data (e.g., above/below a baseline)? For categorical data (e.g., product lines)?
- Chapter 4: Chapter 4 argued that bar charts must start their y-axis at zero. How do you enforce this in matplotlib code? What happens if you accept the default matplotlib autoscaling?
- Chapter 5: The chart selection matrix asks you to classify data and question. For each of the five climate charts in Section 11.9, identify the data type classification and the question type.
- Chapter 6: The declutter procedure says "remove, lighten, simplify." In the code examples in this chapter, where does decluttering happen (or not)? Which lines implement decluttering, and which defaults would you override in Chapter 12?
- Chapter 7: Action titles state the finding. None of the titles in this chapter are action titles; they are descriptive. For each of the five climate charts, write an action title that states a plausible finding.
- Chapter 10: The canonical fig, ax = plt.subplots() pattern introduces each chart type in this chapter. Why is the pattern uniform across chart types? What would change if you used pyplot instead?