Exercises: Essential Chart Types

DataField.Dev

These exercises are hands-on. Every Part B exercise should be run in a Python environment, not just read. For each chart type, produce a working chart, modify its parameters, and verify the result. The only way to learn matplotlib is to type code.

Part A: Conceptual (6 problems)

A.1 ★☆☆ | Recall

Name the five essential matplotlib Axes methods covered in this chapter. For each, state the chart type it produces and the Chapter 5 question type it answers best.

Guidance

`ax.plot()` — line chart, for change over time or continuous relationships. `ax.bar()` / `ax.barh()` — bar chart, for comparison across categories. `ax.scatter()` — scatter plot, for relationships between two continuous variables. `ax.hist()` — histogram, for distribution of a single continuous variable. `ax.boxplot()` — box plot, for distribution summaries, especially across groups.

A.2 ★☆☆ | Recall

Explain the difference between ax.plot() and ax.scatter() when producing a dot plot. Why does matplotlib have two methods that can produce similar-looking outputs?

Guidance

`ax.plot(x, y, marker="o", linestyle="None")` produces dots but applies a single color and size to all points. `ax.scatter(x, y)` supports per-point color (`c`) and size (`s`) encodings via arrays, plus a colormap (`cmap`). For simple dot plots, either works. For encoded variables (bubble charts, color-mapped scatter), use `ax.scatter()`, because a single `ax.plot()` call cannot vary color or size from point to point.
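
A minimal sketch of the contrast, using made-up data. The names `values` and `sizes` are illustrative stand-ins for a third and fourth variable, not something defined in the chapter.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = rng.normal(size=40)
values = rng.uniform(0, 1, size=40)     # hypothetical third variable, mapped to color
sizes = rng.uniform(20, 200, size=40)   # hypothetical fourth variable, mapped to marker area

fig, (ax_plot, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# ax.plot: one color and one marker size shared by every point
lines = ax_plot.plot(x, y, marker="o", linestyle="None", color="steelblue")
ax_plot.set_title("ax.plot: uniform markers")

# ax.scatter: color and size supplied as arrays, one entry per point
points = ax_scatter.scatter(x, y, c=values, s=sizes, cmap="viridis")
ax_scatter.set_title("ax.scatter: encoded markers")
fig.colorbar(points, ax=ax_scatter, label="values")
```

The left panel is a single `Line2D` object with uniform markers; the right panel is a collection whose markers each carry their own color and size.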

A.3 ★★☆ | Understand

The chapter says that bar charts "must start their y-axis at zero." How do you enforce this in matplotlib code? What happens if you accept the default autoscaling?

Guidance

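Use `ax.set_ylim(0, max_value * 1.1)` or similar to set the y-axis to start at zero explicitly. For a plain `ax.bar()` call with positive values, autoscaling usually already includes zero, because each bar is drawn from its baseline (`bottom=0` by default) up to its height. The baseline is lost when the limits are set by hand to zoom in on small differences (for example `ax.set_ylim(90, 110)`) or when a non-zero `bottom` is passed. A truncated axis violates [Chapter 4](../../part-01-seeing-data/chapter-04-lies-distortions-honest-charts/index.md)'s zero-baseline rule and distorts the length encoding. For bar charts intended for publication, set the lower limit explicitly so that the zero baseline is a deliberate decision rather than a side effect of the defaults.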
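
A minimal sketch of the explicit baseline, using made-up values that sit close together (the case where a truncated axis would exaggerate differences the most). Reading the limits back with `ax.get_ylim()` is one way to verify the result.

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [96, 98, 103]  # made-up values, deliberately close together

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, values, color="steelblue")
ax.set_ylim(0, max(values) * 1.1)  # explicit zero baseline with 10% headroom

bottom, top = ax.get_ylim()
print(bottom, top)  # lower limit should be exactly 0
```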

A.4 ★★☆ | Understand

Explain what "overplotting" means in the context of scatter plots. Name three techniques from Section 11.3 for managing overplotting and describe when each is appropriate.

Guidance

Overplotting occurs when many data points overlap in a scatter plot, obscuring the true density. Techniques: (1) **Alpha (transparency)**: set `alpha=0.3` or similar so overlap shows as darker regions; works for a few hundred points. (2) **Smaller markers**: reduce the marker area (`s=5` or similar) so individual points take less space; works for medium density. (3) **Hexbin / 2D histogram**: replace the scatter with a density plot (`ax.hexbin` or `ax.hist2d`); works for very large datasets where individual points are not meaningful.
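
The three techniques can be compared side by side. This is a sketch on synthetic data; the point count (20,000) and `gridsize=40` are arbitrary choices for illustration, not values from the chapter.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(size=20_000)
y = x * 0.5 + rng.normal(size=20_000)  # correlated synthetic data

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# (1) Transparency: overlapping points accumulate into darker regions
axes[0].scatter(x, y, alpha=0.3)
axes[0].set_title("Alpha")

# (2) Smaller markers: each point covers less area
axes[1].scatter(x, y, s=5)
axes[1].set_title("Small markers")

# (3) Hexbin: count points per hexagonal cell instead of drawing each one
hb = axes[2].hexbin(x, y, gridsize=40, cmap="viridis")
axes[2].set_title("Hexbin")
fig.colorbar(hb, ax=axes[2], label="Points per hexagon")
```

At this density the first two panels are likely to saturate in the center, which is the situation where the density plot becomes the better choice.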

A.5 ★★☆ | Analyze

Explain the difference between a histogram and a bar chart. Why does the chapter insist these are different chart types, even though they both use rectangular bars?

Guidance

Bar charts compare values across discrete categories (product A vs. product B), and each bar corresponds to one category. Histograms show the distribution of a continuous variable by dividing its range into bins and counting how many values fall in each bin. The x-axis of a bar chart is categorical; the x-axis of a histogram is continuous. Bar chart bars have gaps between them (emphasizing discreteness); histogram bars touch (emphasizing continuity). Using one when you need the other produces charts that display data without answering the intended question.
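
The difference shows up in the data preparation before any drawing happens. The sketch below uses made-up values: the bar-chart input is one count per category, while the histogram input is a set of counts over contiguous numeric bins.

```python
from collections import Counter
import numpy as np

# Bar-chart data: one count per discrete category (category order is arbitrary)
orders = ["A", "B", "A", "C", "B", "A"]
category_counts = Counter(orders)

# Histogram data: a continuous variable divided into contiguous bins
heights_cm = np.array([150.2, 161.5, 158.9, 172.3, 165.0, 180.1, 169.4, 175.8])
bin_counts, bin_edges = np.histogram(heights_cm, bins=[150, 160, 170, 180, 190])

print(dict(category_counts))  # {'A': 3, 'B': 2, 'C': 1}
print(bin_counts)             # [2 3 2 1]
print(bin_edges)              # five edges define four touching bins
```

Reordering the categories leaves the bar chart's meaning intact; reordering the bins would destroy the histogram, because the bin edges lie on a numeric axis.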

A.6 ★★★ | Evaluate

The chapter's threshold concept says that "every parameter in every plot method is a design decision." Defend this claim with specific examples from ax.scatter(). Which parameters implement which principles from Parts I and II?

Guidance

`c` (color) implements [Chapter 3](../../part-01-seeing-data/chapter-03-color/index.md) palette choices — the color is chosen from the perceptual categories (sequential, diverging, qualitative). `s` (size) implements [Chapter 2](../../part-01-seeing-data/chapter-02-how-the-eye-sees/index.md) encoding of magnitude through area. `alpha` (transparency) manages pre-attentive processing by letting overlap become visible. `cmap` (colormap) implements perceptual uniformity from Chapter 3. `edgecolors` controls visual contrast between points and background. Every parameter connects to a principle; the API is not just syntax but an instrument for implementing design decisions.

Part B: Applied — Hands-On with Each Chart Type (11 problems)

B.1 ★☆☆ | Apply

Create a line chart of the values [3, 5, 8, 13, 21, 34, 55] against their indices (use range(7) for the x-values). Add a title, x-label, y-label, and save as line.png with dpi=300.

Guidance

import matplotlib.pyplot as plt

x = range(7)
y = [3, 5, 8, 13, 21, 34, 55]

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(x, y)
ax.set_title("Fibonacci Sequence (Partial)")
ax.set_xlabel("Index")
ax.set_ylabel("Value")
fig.savefig("line.png", dpi=300, bbox_inches="tight")

Verify the chart opens correctly and the line shape matches the data (a clear exponential-looking curve).

B.2 ★☆☆ | Apply

Plot two lines on the same Axes: y1 = [10, 20, 30, 40, 50] and y2 = [5, 15, 25, 35, 45]. Give each a color, a label, and different linewidths. Add a legend.

Guidance

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))

x = range(5)
y1 = [10, 20, 30, 40, 50]
y2 = [5, 15, 25, 35, 45]

# Different linewidths, as the exercise asks: Series A is drawn heavier
ax.plot(x, y1, color="#1f77b4", linewidth=2.5, label="Series A")
ax.plot(x, y2, color="#ff7f0e", linewidth=1.5, label="Series B")
ax.legend()
ax.set_title("Two Series on One Axes")
fig.savefig("two_lines.png", dpi=300, bbox_inches="tight")

Notice: each call to `ax.plot()` adds another line to the same Axes. The `label` parameter is what the legend picks up.

B.3 ★★☆ | Apply

Create a horizontal bar chart (ax.barh) showing the populations of five cities of your choice (real or made up). Sort the bars so the largest city is at the top. Include city names on the y-axis and population on the x-axis.

Guidance

import matplotlib.pyplot as plt
import pandas as pd

cities = pd.DataFrame({
    "city": ["Tokyo", "Delhi", "Shanghai", "São Paulo", "Mexico City"],
    "population": [37, 31, 27, 22, 22],  # millions
})
cities = cities.sort_values("population", ascending=True)  # ascending for barh so biggest is at top

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(cities["city"], cities["population"], color="steelblue")
ax.set_title("Population of Five Megacities")
ax.set_xlabel("Population (millions)")
fig.savefig("cities_barh.png", dpi=300, bbox_inches="tight")

Notice: `ascending=True` in the sort puts the biggest city at the top of the horizontal bar chart (because the y-axis origin is at the bottom). This is a common source of confusion.
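
An alternative that avoids the reversed sort is to keep the data largest-first and flip the axis with `ax.invert_yaxis()`. This is a sketch of that variant using plain lists rather than the DataFrame above.

```python
import matplotlib.pyplot as plt

# Already ordered largest-first
cities = ["Tokyo", "Delhi", "Shanghai", "São Paulo", "Mexico City"]
population = [37, 31, 27, 22, 22]  # millions

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(cities, population, color="steelblue")
ax.invert_yaxis()  # first list item is now drawn at the top
ax.set_xlabel("Population (millions)")
```

Either approach gives the same picture; the choice is mostly about whether the sort order or the axis direction is easier to read in your code.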

B.4 ★★☆ | Apply

Create a grouped bar chart comparing two years of revenue across five product lines. Use explicit x-position offsets to place the bars side by side. Add a legend.

Guidance

import numpy as np
import matplotlib.pyplot as plt

categories = ["Product A", "Product B", "Product C", "Product D", "Product E"]
rev_2023 = [100, 150, 80, 120, 90]
rev_2024 = [130, 155, 90, 140, 85]

x = np.arange(len(categories))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(x - width/2, rev_2023, width, label="2023", color="#1f77b4")
ax.bar(x + width/2, rev_2024, width, label="2024", color="#ff7f0e")
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_title("Revenue by Product Line: 2023 vs 2024")
ax.set_ylabel("Revenue (USD millions)")
ax.legend()
fig.savefig("grouped_bars.png", dpi=300, bbox_inches="tight")

The `np.arange(len(categories))` gives you integer x-positions 0 through 4. The `±width/2` offsets place the two bars side by side at each category center. The `set_xticks` and `set_xticklabels` override the default ticks to show the category names at the group centers.
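
The same offset arithmetic generalizes to more than two series. The helper below, `group_offsets`, is a hypothetical convenience function written for this sketch and is not part of matplotlib.

```python
import numpy as np

def group_offsets(n_series, group_width=0.8):
    """Return (offsets, bar_width) that center n_series bars on each tick."""
    bar_width = group_width / n_series
    # Positions are symmetric around zero: e.g. -w/2, +w/2 for two series
    offsets = (np.arange(n_series) - (n_series - 1) / 2) * bar_width
    return offsets, bar_width

# Two series with a total group width of 0.7 reproduce the ±width/2 pattern above
offsets, bar_width = group_offsets(2, group_width=0.7)
print(offsets, bar_width)

# Three series: left, center, right
offsets3, bar_width3 = group_offsets(3)
print(offsets3, bar_width3)
```

Each series `i` would then be drawn with `ax.bar(x + offsets[i], values_i, bar_width)`.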

B.5 ★★☆ | Apply

Create a scatter plot of 100 random points using numpy.random.randn(). Use a single color with alpha=0.6 for transparency. Add grid lines with ax.grid(True, alpha=0.3).

Guidance

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # reproducibility
x = np.random.randn(100)
y = np.random.randn(100)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x, y, alpha=0.6, color="steelblue", s=30)
ax.grid(True, alpha=0.3)
ax.set_title("Random Scatter")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("scatter.png", dpi=300, bbox_inches="tight")

The `alpha=0.6` makes each point slightly transparent so overlap is visible. The `grid(True, alpha=0.3)` adds faint gridlines (the `alpha=0.3` is a [Chapter 6](../../part-02-design-principles/chapter-06-data-ink-ratio/index.md) declutter move — making the grid recede behind the data).

B.6 ★★☆ | Apply

Create a scatter plot where the color of each point encodes a third variable (e.g., a "category score") and the size encodes a fourth variable ("population"). Add a colorbar. Use a diverging or sequential colormap from Chapter 3.

Guidance

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
n = 50
x = np.random.randn(n)
y = np.random.randn(n)
score = np.random.randn(n)
population = np.abs(np.random.randn(n)) * 100 + 20  # positive, scaled

fig, ax = plt.subplots(figsize=(8, 6))
sc = ax.scatter(x, y, c=score, s=population, cmap="viridis", alpha=0.7, edgecolors="white", linewidths=0.5)
ax.set_title("Four-Variable Scatter")
ax.set_xlabel("X variable")
ax.set_ylabel("Y variable")
fig.colorbar(sc, label="Score")
fig.savefig("bubble.png", dpi=300, bbox_inches="tight")

The scatter encodes four variables: x-position, y-position, color (`c=score`), and size (`s=population`). The colorbar is essential — without it, the color has no defined meaning.

B.7 ★★☆ | Apply

Create a histogram of 1000 normally-distributed random values. Try three different bin counts (10, 30, 100) and compare the results.

Guidance

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
data = np.random.randn(1000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, bins in zip(axes, [10, 30, 100]):
    ax.hist(data, bins=bins, color="steelblue", alpha=0.8, edgecolor="white")
    ax.set_title(f"{bins} bins")
    ax.set_xlabel("Value")
fig.savefig("hist_bins.png", dpi=300, bbox_inches="tight")

With 10 bins the histogram looks chunky and over-smoothed; with 100 bins it looks noisy. With 30 bins the bell shape is clear, which makes it the best of the three for this dataset. This is why bin count matters: too few bins hide structure, too many add noise, and a well-chosen number shows the distribution.
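
As a starting point before tuning by eye, NumPy can suggest a bin count from the data. The sketch below asks `np.histogram_bin_edges` for three of its built-in rules; the dictionary name `suggested` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal(1000)

# Number of bins proposed by each rule for this sample
suggested = {}
for rule in ["sturges", "fd", "auto"]:
    edges = np.histogram_bin_edges(data, bins=rule)
    suggested[rule] = len(edges) - 1
    print(f"{rule}: {suggested[rule]} bins")
```

`ax.hist` accepts the same rule names (for example `bins="auto"`), so a suggestion can be passed straight through. Treat it as a first guess and still compare a few alternatives as in the exercise.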

B.8 ★★☆ | Apply

Create two overlaid histograms on the same Axes: one from np.random.normal(0, 1, 500) and one from np.random.normal(2, 1, 500). Use different colors and alpha=0.6 so both are visible. Add a legend.

Guidance

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
group_a = np.random.normal(0, 1, 500)
group_b = np.random.normal(2, 1, 500)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(group_a, bins=30, alpha=0.6, label="Group A", color="#1f77b4")
ax.hist(group_b, bins=30, alpha=0.6, label="Group B", color="#ff7f0e")
ax.legend()
ax.set_title("Two Distributions Compared")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
fig.savefig("two_hists.png", dpi=300, bbox_inches="tight")

The two distributions overlap in the middle (values around 1). With `alpha=0.6`, the overlap region shows both colors mixing, and the separate peaks are clearly visible. This is the standard pattern for comparing two distributions.
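
One refinement worth knowing: with `bins=30` passed separately, each histogram computes its own bin edges from its own data range, so the bars of the two groups do not line up. A sketch of one way to share edges, computing them once from the pooled data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
group_a = rng.normal(0, 1, 500)
group_b = rng.normal(2, 1, 500)

# One set of edges from the pooled data, reused for both groups
pooled = np.concatenate([group_a, group_b])
shared_edges = np.histogram_bin_edges(pooled, bins=30)

fig, ax = plt.subplots(figsize=(8, 5))
counts_a, edges_a, _ = ax.hist(group_a, bins=shared_edges, alpha=0.6, label="Group A")
counts_b, edges_b, _ = ax.hist(group_b, bins=shared_edges, alpha=0.6, label="Group B")
ax.legend()
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
```

With shared edges, bar heights in the overlap region can be compared bin by bin.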

B.9 ★★☆ | Apply

Create a grouped box plot comparing three random samples (different means, different spreads). Label each box.

Guidance

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
group_a = np.random.normal(10, 2, 100)
group_b = np.random.normal(12, 3, 100)
group_c = np.random.normal(15, 1.5, 100)

fig, ax = plt.subplots(figsize=(8, 5))
ax.boxplot(
    [group_a, group_b, group_c],
    # Matplotlib 3.9+ renames this parameter to tick_labels; labels still works there but warns
    labels=["Group A", "Group B", "Group C"],
)
ax.set_title("Three Groups Compared")
ax.set_ylabel("Value")
fig.savefig("boxplot.png", dpi=300, bbox_inches="tight")

The box plot shows median, IQR, whiskers, and outliers for each group in compact form. You can see at a glance which group has the highest median (C), which has the most spread (B), and whether there are outliers.
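
To verify what the boxes encode, the same summary can be computed by hand. This sketch assumes matplotlib's default whisker rule (1.5 times the IQR beyond the quartiles) and regenerates a sample like Group B with a different random generator, so the exact numbers will differ from the chart above.

```python
import numpy as np

rng = np.random.default_rng(42)
group_b = rng.normal(12, 3, 100)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(group_b, [25, 50, 75])
iqr = q3 - q1

# Fences: whiskers end at the most extreme data points inside these limits
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers
outliers = group_b[(group_b < lower_fence) | (group_b > upper_fence)]

print(f"median={median:.2f}, IQR={iqr:.2f}, outliers={len(outliers)}")
```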

B.10 ★★★ | Apply

Reproduce the climate line chart from Section 11.1 using synthetic data. The x-axis should be years 1880 to 2024. The y-values should increase roughly linearly with some noise, simulating temperature anomalies. Add a horizontal reference line at y=0.

Guidance

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
years = np.arange(1880, 2025)
# Fabricated anomalies: roughly linear trend from -0.3 to about +1.1 (0.01 °C per year over 144 years), plus noise
anomalies = -0.3 + (years - 1880) * 0.01 + np.random.randn(len(years)) * 0.15

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(years, anomalies, color="#d62728", linewidth=1.5)
ax.axhline(0, color="gray", linewidth=0.8, linestyle="--")
ax.set_title("Global Temperature Anomaly")
ax.set_xlabel("Year")
ax.set_ylabel("Temperature Anomaly (°C)")
fig.savefig("climate_line.png", dpi=300, bbox_inches="tight")

Notice that the chart is still unpolished by [Chapter 7](../../part-02-design-principles/chapter-07-typography-annotation/index.md) standards (the title only describes the data rather than stating a finding, and there are no annotations), but the chart type and parameters are correct.

B.11 ★★★ | Create

Using the same synthetic climate data, produce all five climate charts from Section 11.9 (line, bar of decade averages, scatter of CO2 vs temperature using fabricated CO2 data, histogram of anomalies, box plot by decade). Save each to a separate PNG file.

Guidance

This is the capstone exercise for the chapter. Produce each chart type using the synthetic data from B.10, plus fabricated CO2 data (e.g., starting at 290 ppm in 1880 and rising to 420 ppm in 2024). Save each chart to a separate file. At the end, you should have five PNG files showing the same data from five different angles. Compare them: which chart is best for which question? Which charts emphasize different aspects of the data?
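
A possible starting point for the data preparation, leaving the five charts themselves as the exercise. This is a sketch under stated assumptions: the CO2 series is a straight line between the two example endpoints above (purely fabricated), and the column names `anomaly`, `co2`, and `decade` are illustrative.

```python
import numpy as np
import pandas as pd

# Same synthetic anomalies as B.10
np.random.seed(42)
years = np.arange(1880, 2025)
anomalies = -0.3 + (years - 1880) * 0.01 + np.random.randn(len(years)) * 0.15

# Fabricated CO2: linear rise from 290 ppm (1880) to 420 ppm (2024)
co2 = np.linspace(290, 420, len(years))

climate = pd.DataFrame({"year": years, "anomaly": anomalies, "co2": co2})
climate["decade"] = (climate["year"] // 10) * 10

# Input for the bar chart: one mean anomaly per decade
decade_means = climate.groupby("decade")["anomaly"].mean()

# Input for the box plot: one array of anomalies per decade
by_decade = [grp["anomaly"].to_numpy() for _, grp in climate.groupby("decade")]
decade_labels = [str(d) for d in decade_means.index]
```

From here, `decade_means` feeds `ax.bar`, `climate["co2"]` against `climate["anomaly"]` feeds `ax.scatter`, `climate["anomaly"]` feeds `ax.hist`, and `by_decade` feeds `ax.boxplot`.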

Part C: Synthesis and Design Judgment (5 problems)

C.1 ★★★ | Analyze

Take any of the five climate charts from Section 11.9 and identify every place where the code accepts a matplotlib default that you would override for publication. For each default, state what the default is and what you would change it to.

Guidance

Every chart accepts defaults for: line width, marker size, axis limits, tick positions, font family, title styling, axis label formatting, spine visibility, grid display, legend placement (if applicable). For publication, you would override most of these: set explicit figsize, explicit colors, explicit linewidth, explicit title with action wording, explicit axis labels with units, remove top and right spines, lighten gridlines, position legend deliberately, set explicit tick formatting. [Chapter 12](../chapter-12-customization-mastery/index.md) covers these overrides in detail.
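
A few of these overrides, sketched on placeholder data. The specific values chosen here (colors, tick spacing, title wording) are illustrative assumptions, not the chapter's recommendations.

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2000, 2021)
values = np.linspace(0.4, 1.0, len(years))  # placeholder data

fig, ax = plt.subplots(figsize=(10, 4))           # explicit figure size
ax.plot(years, values, color="#d62728", linewidth=2)  # explicit color and linewidth

# Remove the top and right spines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# Light horizontal gridlines drawn behind the data
ax.grid(True, axis="y", alpha=0.3)
ax.set_axisbelow(True)

# Explicit tick positions and labels with units
ax.set_xticks(np.arange(2000, 2021, 5))
ax.set_ylabel("Temperature anomaly (°C)")
ax.set_title("Placeholder title: state the finding here", loc="left")
```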

C.2 ★★★ | Create

Write a reusable function plot_time_series(df, x_col, y_col, title, ylabel, ax=None) that takes a DataFrame and column names, plots a line chart on the given Axes (creating one if None), and returns the Axes. Demonstrate by calling it twice with different data.

Guidance

import matplotlib.pyplot as plt
import pandas as pd

def plot_time_series(df, x_col, y_col, title, ylabel, ax=None):
    """Plot y_col against x_col as a line chart and return the Axes."""
    if ax is None:
        # No Axes supplied: create a standalone figure
        fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(df[x_col], df[y_col], linewidth=1.5)
    ax.set_title(title)
    ax.set_xlabel(x_col.title())
    ax.set_ylabel(ylabel)
    return ax

# Demonstration: each call creates its own figure because no ax is passed
df1 = pd.DataFrame({"year": range(2000, 2021), "value": range(21)})
df2 = pd.DataFrame({"year": range(2000, 2021), "value": [i**2 for i in range(21)]})

plot_time_series(df1, "year", "value", "Linear Growth", "Value")
plot_time_series(df2, "year", "value", "Quadratic Growth", "Value")

The optional `ax=None` parameter lets the function create its own Axes or accept one from the caller. This is the canonical pattern for reusable matplotlib utilities — it makes the function work both for standalone charts and as part of a multi-panel figure.
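
The multi-panel case can be sketched as follows. The function is repeated from the guidance above so that the block runs on its own; the two-panel layout is an illustrative choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_time_series(df, x_col, y_col, title, ylabel, ax=None):
    """Same function as in the guidance above, repeated for self-containment."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(df[x_col], df[y_col], linewidth=1.5)
    ax.set_title(title)
    ax.set_xlabel(x_col.title())
    ax.set_ylabel(ylabel)
    return ax

df1 = pd.DataFrame({"year": range(2000, 2021), "value": range(21)})
df2 = pd.DataFrame({"year": range(2000, 2021), "value": [i**2 for i in range(21)]})

# The caller owns the figure and hands one Axes to each call
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 4))
returned = plot_time_series(df1, "year", "value", "Linear Growth", "Value", ax=ax_left)
plot_time_series(df2, "year", "value", "Quadratic Growth", "Value", ax=ax_right)
fig.tight_layout()
```

Because the function returns the Axes it drew on, the caller can keep customizing it afterwards (for example `returned.axhline(0)`).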

C.3 ★★★ | Evaluate

Compare df.plot.line(...) (pandas shortcut) with the explicit fig, ax = plt.subplots(); ax.plot(df["x"], df["y"]) (OO pattern). For which use cases is each preferable? Write your answer as a short practitioner's guide.

Guidance

Pandas shortcut: good for quick exploratory charts in a notebook, when the goal is to see the data quickly and you are not going to publish the result. Fewer keystrokes. OO pattern: good for production code, publication-quality charts, functions that need to accept an Axes argument, multi-panel figures, charts that need extensive customization. More explicit, easier to debug. Rule of thumb: use pandas for exploration, OO for anything else. Both produce the same matplotlib output under the hood.
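
The two styles, plus a hybrid that is often convenient, can be sketched on a small made-up DataFrame:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [i * 2 for i in range(10)]})

# Pandas shortcut: one line, creates its own figure, returns a matplotlib Axes
ax_quick = df.plot.line(x="x", y="y")

# OO pattern: the figure and Axes are created explicitly, then drawn on
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["x"], df["y"], linewidth=1.5)
ax.set_xlabel("x")
ax.set_ylabel("y")

# Hybrid: create the Axes yourself, let pandas draw on it via the ax argument
fig2, ax2 = plt.subplots(figsize=(8, 4))
df.plot.line(x="x", y="y", ax=ax2)
ax2.set_title("Drawn by pandas, customized with matplotlib")
```

The hybrid works because the pandas call returns (or draws on) an ordinary matplotlib Axes, so every `ax.set_*` method from this chapter still applies.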

C.4 ★★★ | Evaluate

The chapter argues that the five chart types in this chapter "handle eighty percent of real-world visualization work." Do you agree? Name three use cases that require chart types not in the Big Five, and explain why.

Guidance

Legitimate extensions (any three would answer the question): (1) Geographic maps (choropleth, dot map): essential for spatial data, not covered by the Big Five. (2) Network / graph visualization: for showing connections between entities. (3) Heatmaps or correlation matrices: for 2D comparisons. (4) Sankey diagrams or flow charts: for showing transitions between states. The 80% claim is approximately right for tabular data with basic comparisons, distributions, and trends; it underestimates the importance of specialized chart types for specific domains.

C.5 ★★★ | Create

Take the Chapter 5 chart selection matrix (data type × question type) and write a small Python dictionary mapping (data_type, question_type) pairs to the recommended matplotlib method. Use it to look up the right method for three specific scenarios of your choice.

Guidance

chart_method = {
    ("categorical", "comparison"): "ax.bar / ax.barh",
    ("continuous", "distribution"): "ax.hist or ax.boxplot",
    ("continuous", "relationship"): "ax.scatter",
    ("temporal", "change_over_time"): "ax.plot",
    ("continuous", "comparison_by_group"): "ax.boxplot",
    # etc.
}

# Usage:
scenario = ("temporal", "change_over_time")
print(f"For {scenario}, use: {chart_method[scenario]}")

This kind of lookup dictionary is not production-ready (real chart selection needs more nuance), but writing it forces you to internalize the mapping from the [Chapter 5](../../part-01-seeing-data/chapter-05-choosing-the-right-chart/index.md) framework to the matplotlib API. This is exactly the translation that experienced matplotlib users do automatically.

These exercises are hands-on. Every Part B exercise should be typed and run, not just read. If you only read the exercises, you will not retain the syntax. By the end of the set, you should be able to produce any of the five chart types without consulting the chapter.