Exercises: Statistical and Scientific Visualization

DataField.Dev

These exercises assume `import matplotlib.pyplot as plt`, `import matplotlib as mpl`, `import numpy as np`, and `import pandas as pd`. Some also use the optional `import seaborn as sns` and `from statannotations.Annotator import Annotator`.

Part A: Conceptual (6 problems)

A.1 ★☆☆ | Recall

What are typical single-column and double-column figure widths for Nature?

Guidance

Single-column: 89 mm ≈ 3.5 inches. Double-column: 183 mm ≈ 7.2 inches. Most journals have similar conventions with slight variations. Science uses 55/120/180 mm; PLOS uses 789/1651 pixels at 300 DPI.
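
Because matplotlib's `figsize` is given in inches, a small conversion helper avoids rounding mistakes. The sketch below is illustrative only; the helper names `mm_to_inches` and `figsize_from_mm` and the default aspect ratio are assumptions, not part of matplotlib.

```python
MM_PER_INCH = 25.4


def mm_to_inches(width_mm):
    """Convert a width in millimetres to inches (the unit figsize expects)."""
    return width_mm / MM_PER_INCH


def figsize_from_mm(width_mm, aspect=0.75):
    """Return (width, height) in inches, with height = width * aspect."""
    width_in = mm_to_inches(width_mm)
    return (width_in, width_in * aspect)


single = figsize_from_mm(89)    # Nature single column
double = figsize_from_mm(183)   # Nature double column
print(f"single column: {single[0]:.2f} in, double column: {double[0]:.2f} in")
# prints "single column: 3.50 in, double column: 7.20 in"
```

The returned tuple can be passed directly as `plt.subplots(figsize=figsize_from_mm(89))`.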

A.2 ★☆☆ | Recall

What is the difference between Type 42 and Type 3 fonts in PDF output?

Guidance

**Type 42** embeds the TrueType font itself, so text stays selectable and editable; it is widely accepted by journal production systems. **Type 3** is an older PostScript font format in which glyphs are stored as drawing procedures rather than as a proper font definition. It is matplotlib's default for PDF and PS output (`pdf.fonttype` and `ps.fonttype` both default to 3), and many journals reject it. Set `mpl.rcParams["pdf.fonttype"] = 42` and `mpl.rcParams["ps.fonttype"] = 42` to force Type 42.

A.3 ★★☆ | Understand

Why must scientific figures show uncertainty (error bars, confidence intervals, bands)?

Guidance

A point estimate without uncertainty is incomplete. The reader cannot tell whether a difference between two means is meaningful or within noise. Uncertainty visualization lets the reader judge whether the data supports the claims. Peer reviewers routinely reject figures that report point estimates without uncertainty.
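
To make the point concrete, the sketch below computes a mean with an approximate 95% confidence interval for two groups. The data are simulated, the helper name `mean_and_ci` is hypothetical, and the interval uses the normal approximation (1.96 × SEM); for small samples a t-based critical value would be more appropriate.

```python
import numpy as np

rng = np.random.default_rng(0)  # simulated measurements, for illustration only
group_a = rng.normal(10.0, 2.0, size=20)
group_b = rng.normal(11.0, 2.0, size=20)


def mean_and_ci(values, z=1.96):
    """Return (mean, half_width) of an approximate 95% CI (normal approximation)."""
    values = np.asarray(values, dtype=float)
    sem = values.std(ddof=1) / np.sqrt(values.size)  # standard error of the mean
    return values.mean(), z * sem


for name, values in [("A", group_a), ("B", group_b)]:
    mean, half_width = mean_and_ci(values)
    print(f"group {name}: {mean:.2f} ± {half_width:.2f}")
```

If the two intervals overlap heavily, the difference between the means may be within noise, which is exactly what a bare point estimate hides.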

A.4 ★★☆ | Understand

Describe the Wong colorblind-safe palette and when to use it.

Guidance

Bang Wong's 2011 palette has 8 colors (#000000, #E69F00, #56B4E9, #009E73, #F0E442, #0072B2, #D55E00, #CC79A7) designed to remain distinguishable under the most common forms of color vision deficiency. Use it for any publication figure with categorical color encoding. When the journal also expects grayscale-printable output, pair it with redundant encoding (line style, marker shape), since color alone may not survive grayscale conversion.
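
One way to apply the palette is to install it as matplotlib's default color cycle. This is a minimal sketch; the constant name `WONG` is an assumption, and the plotted lines are placeholder data.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
from cycler import cycler

# The eight Wong colors, in the order listed above
WONG = ["#000000", "#E69F00", "#56B4E9", "#009E73",
        "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]

# Every subsequent ax.plot call cycles through these colors
mpl.rcParams["axes.prop_cycle"] = cycler(color=WONG)

fig, ax = plt.subplots(figsize=(3.5, 2.5))
for k in range(4):
    ax.plot([0, 1], [k, k + 1], label=f"series {k}")  # placeholder lines
ax.legend()
```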

A.5 ★★☆ | Analyze

Explain why a bar chart with error bars (dynamite plot) is considered weak for modern publication.

Guidance

Dynamite plots ([Chapter 18](../../part-04-seaborn/chapter-18-relational-categorical/index.md)) show only the mean and error, hiding the full distribution. They cannot reveal outliers, skewness, or bimodality. Modern reproducibility-focused guidance recommends showing individual data points (strip + box + summary) so the reader can see the underlying distribution. Some journals (e.g., *Nature Methods*) explicitly discourage dynamite plots.
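
The following sketch shows one alternative using plain matplotlib: a box plot with jittered individual points on top. The data are simulated (the second group is deliberately bimodal), and the group names are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)  # simulated data, for illustration only
groups = {
    "control": rng.normal(5.0, 1.0, 30),
    # Bimodal group: a bar with error bars would hide the two clusters
    "treated": np.concatenate([rng.normal(4.0, 0.4, 15), rng.normal(7.0, 0.4, 15)]),
}

fig, ax = plt.subplots(figsize=(3.5, 2.5))
positions = np.arange(1, len(groups) + 1)
ax.boxplot(list(groups.values()), positions=positions, widths=0.5, showfliers=False)

# Overlay every observation with a small horizontal jitter
for pos, values in zip(positions, groups.values()):
    jitter = rng.uniform(-0.12, 0.12, size=values.size)
    ax.scatter(pos + jitter, values, s=8, alpha=0.6, color="black", zorder=3)

ax.set_xticks(positions)
ax.set_xticklabels(list(groups))
ax.set_ylabel("Value (arbitrary units)")
```

Both groups could have similar means, yet the points make the two clusters in the second group visible.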

A.6 ★★★ | Evaluate

A colleague has prepared a 2-panel figure with 6 pt axis labels, red and green lines that become indistinguishable in a black-and-white printout, and no panel labels. List the problems and fixes.

Guidance

(1) 6 pt is below the typical 7-9 pt range for axis labels; increase to 7 or 8 pt. (2) Red-green is not colorblind-safe and fails in grayscale; switch to the Wong palette or use redundant encoding (line style + marker). (3) Panel labels are missing; add "a", "b" in bold at the top-left of each panel using `ax.text` with `transform=ax.transAxes`. Also check font embedding (Type 42) and figure width (match single- or double-column).

Part B: Applied (10 problems)

B.1 ★☆☆ | Apply

Set up matplotlib rcParams for a Nature single-column figure.

Guidance

import matplotlib as mpl
mpl.rcParams.update({
    "pdf.fonttype": 42,
    "ps.fonttype": 42,
    "font.family": "sans-serif",
    "font.sans-serif": ["Arial", "Helvetica", "DejaVu Sans"],
    "font.size": 7,
    "axes.titlesize": 8,
    "axes.labelsize": 7,
    "xtick.labelsize": 6,
    "ytick.labelsize": 6,
    "legend.fontsize": 6,
    "figure.figsize": (3.5, 2.5),
    "savefig.dpi": 300,
})

B.2 ★☆☆ | Apply

Create a scatter plot with error bars and an axis label using LaTeX math notation.

Guidance

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 4, 9, 16, 25])
yerr = np.array([0.5, 0.8, 1.2, 1.5, 2.0])

fig, ax = plt.subplots(figsize=(3.5, 2.5))
ax.errorbar(x, y, yerr=yerr, fmt="o", capsize=3, color="black")
ax.set_xlabel(r"$x$ (unit)")
ax.set_ylabel(r"$y = x^2$")
ax.set_title(r"$y$ vs. $x$")

B.3 ★★☆ | Apply

Build a 2×2 panel figure with panel labels "a", "b", "c", "d" in bold.

Guidance

fig, axes = plt.subplots(2, 2, figsize=(7.2, 5.4))
for ax, letter in zip(axes.flat, "abcd"):
    ax.text(-0.15, 1.05, letter, transform=ax.transAxes,
            fontsize=12, fontweight="bold", va="top")

B.4 ★★☆ | Apply

Plot a regression line with a 95% confidence band using fill_between.

Guidance

# Assumes x is sorted and that y_pred, y_lower, y_upper come from a fitted model evaluated at x.
fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(x, y, alpha=0.5, s=10)
ax.plot(x, y_pred, color="steelblue")
ax.fill_between(x, y_lower, y_upper, color="steelblue", alpha=0.2, label="95% CI")
ax.legend()
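
The snippet above takes `y_pred`, `y_lower`, and `y_upper` as given. One possible way to obtain them, using only NumPy, is a percentile bootstrap: refit the line on resampled data and take the 2.5th and 97.5th percentiles of the predictions. This is a sketch on simulated data, not necessarily the method the chapter uses; the analytic t-based interval is a common alternative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # simulated data, for illustration only
x = np.sort(rng.uniform(0, 10, 60))
y = 1.5 * x + 2.0 + rng.normal(0, 2.0, x.size)

# Ordinary least-squares fit on the full sample
slope, intercept = np.polyfit(x, y, 1)
x_grid = np.linspace(x.min(), x.max(), 100)
y_pred = slope * x_grid + intercept

# Percentile bootstrap: refit on resampled (x, y) pairs and collect predictions
n_boot = 1000
boot_preds = np.empty((n_boot, x_grid.size))
for b in range(n_boot):
    idx = rng.integers(0, x.size, x.size)          # sample rows with replacement
    b_slope, b_intercept = np.polyfit(x[idx], y[idx], 1)
    boot_preds[b] = b_slope * x_grid + b_intercept
y_lower, y_upper = np.percentile(boot_preds, [2.5, 97.5], axis=0)

fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(x, y, alpha=0.5, s=10)
ax.plot(x_grid, y_pred, color="steelblue")
ax.fill_between(x_grid, y_lower, y_upper, color="steelblue", alpha=0.2, label="95% CI")
ax.legend()
```

Evaluating the band on a sorted grid (`x_grid`) matters: `fill_between` connects points in the order given, so unsorted x values produce a tangled polygon.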

B.5 ★★☆ | Apply

Create a QQ plot of residuals using scipy.stats.probplot.

Guidance

import scipy.stats as stats

residuals = np.random.randn(100)
fig, ax = plt.subplots(figsize=(3.5, 3.5))
stats.probplot(residuals, dist="norm", plot=ax)
ax.set_title("Normal QQ Plot")

B.6 ★★☆ | Apply

Use statannotations to add significance brackets to a seaborn boxplot.

Guidance

import seaborn as sns
from statannotations.Annotator import Annotator

df = pd.DataFrame({"group": ["A", "A", "B", "B", "C", "C"] * 10,
                   "value": np.random.randn(60)})

fig, ax = plt.subplots(figsize=(4, 3))
sns.boxplot(data=df, x="group", y="value", ax=ax)
annotator = Annotator(ax, [("A", "B"), ("A", "C"), ("B", "C")], data=df, x="group", y="value")
annotator.configure(test="t-test_ind", text_format="star", loc="outside")
annotator.apply_and_annotate()

B.7 ★★★ | Apply

Build a forest plot showing 4 study effects plus a pooled estimate.

Guidance

studies = ["A", "B", "C", "D", "Pooled"]
effects = [0.1, 0.2, -0.05, 0.15, 0.1]
lowers = [0.0, 0.1, -0.15, 0.05, 0.04]
uppers = [0.2, 0.3, 0.05, 0.25, 0.16]

fig, ax = plt.subplots(figsize=(4, 3))
y_pos = np.arange(len(studies))[::-1]  # first study at the top, pooled estimate at the bottom
for y, s, e, lo, hi in zip(y_pos, studies, effects, lowers, uppers):
    m = "D" if s == "Pooled" else "s"  # diamond for the pooled estimate, squares for studies
    ax.plot([lo, hi], [y, y], color="black")  # confidence interval as a horizontal line
    ax.scatter(e, y, marker=m, s=60 if m == "D" else 40, color="black")

ax.axvline(0, color="gray", linestyle="--")
ax.set_yticks(y_pos)
ax.set_yticklabels(studies)
ax.set_xlabel("Effect size (95% CI)")

B.8 ★★★ | Apply

Create a volcano plot from a dataset of effect sizes and p-values.

Guidance

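# Simulated stand-in data; replace with your own effect sizes and p-values.
rng = np.random.default_rng(0)
effects = rng.normal(0, 1.5, 2000)       # log2 fold changes
p_values = rng.uniform(1e-6, 1, 2000)    # raw p-values

fig, ax = plt.subplots(figsize=(4, 4))
significant = (p_values < 0.05) & (np.abs(effects) > 1)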
ax.scatter(effects, -np.log10(p_values),
           c=["red" if s else "gray" for s in significant], s=10, alpha=0.6)
ax.axhline(-np.log10(0.05), color="black", linestyle="--", linewidth=0.5)
ax.axvline(1, color="black", linestyle="--", linewidth=0.5)
ax.axvline(-1, color="black", linestyle="--", linewidth=0.5)
ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10 p-value")

B.9 ★★☆ | Apply

Save a figure as both PDF (vector) and TIFF (raster at 300 DPI).

Guidance

fig.savefig("figure.pdf", bbox_inches="tight")
fig.savefig("figure.tif", dpi=300, bbox_inches="tight")

B.10 ★★★ | Create

Build a complete 4-panel publication figure for the climate dataset: (a) time series with rolling mean, (b) regression scatter, (c) calendar heatmap, (d) regional bar chart. Use journal-style rcParams and panel labels, and export as PDF.

Guidance

Follow the Section 27.12 structure: apply the Nature style module, create a 2×2 subplot figure at `figsize=(7.2, 6)`, build each panel separately, add panel labels with `ax.text`, use `tight_layout` or `constrained_layout` for spacing, and save as PDF. The full code is 50-80 lines.
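
A compressed skeleton of that structure might look like the following. It is only a sketch: the climate dataset is replaced by synthetic series, all variable names and values are hypothetical, the style is a small inline subset rather than the chapter's Nature style module, and the PDF is written to an in-memory buffer (a filename works the same way).

```python
import io
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

# Synthetic stand-in for the climate dataset (hypothetical values)
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=240, freq="MS")  # 20 years, monthly
temp = pd.Series(0.002 * np.arange(240) + rng.normal(0, 0.3, 240), index=dates)
co2 = 370 + 0.17 * np.arange(240) + rng.normal(0, 1.0, 240)
regions = ["North", "South", "East", "West"]
regional_mean = rng.normal(0.5, 0.2, len(regions))

style = {"pdf.fonttype": 42, "font.size": 7, "axes.labelsize": 7}
with mpl.rc_context(style):
    fig, axes = plt.subplots(2, 2, figsize=(7.2, 6), constrained_layout=True)
    ax_a, ax_b, ax_c, ax_d = axes.flat

    # (a) time series with a 12-month rolling mean
    ax_a.plot(temp.index, temp.values, color="gray", linewidth=0.5)
    ax_a.plot(temp.index, temp.rolling(12, center=True).mean(), color="black")
    ax_a.set_ylabel("Anomaly (°C)")

    # (b) regression scatter with a least-squares line
    slope, intercept = np.polyfit(co2, temp.values, 1)
    grid = np.linspace(co2.min(), co2.max(), 50)
    ax_b.scatter(co2, temp.values, s=4, alpha=0.5)
    ax_b.plot(grid, slope * grid + intercept, color="black")
    ax_b.set_xlabel("CO2 (ppm)")

    # (c) calendar-style heatmap: one row per year, one column per month
    calendar = temp.values.reshape(20, 12)
    im = ax_c.imshow(calendar, aspect="auto", cmap="viridis")
    fig.colorbar(im, ax=ax_c, label="Anomaly (°C)")
    ax_c.set_xlabel("Month index")
    ax_c.set_ylabel("Year index")

    # (d) regional bar chart
    ax_d.bar(regions, regional_mean, color="#0072B2")

    # Bold panel labels at the top-left of each panel
    for ax, letter in zip(axes.flat, "abcd"):
        ax.text(-0.15, 1.05, letter, transform=ax.transAxes,
                fontsize=8, fontweight="bold", va="top")

    buffer = io.BytesIO()  # swap in a path such as "figure.pdf" for real use
    fig.savefig(buffer, format="pdf")
```

A full solution would add axis units, a legend for panel (a), and uncertainty on panels (b) and (d), which is where the remaining lines go.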

Part C: Synthesis (4 problems)

C.1 ★★★ | Analyze

Take a figure from a recent paper you have read and evaluate it against the Section 27.11 checklist. Which items pass, which fail?

Guidance

Most published figures pass most items, but many fail at least one or two. Common failures: missing panel labels, insufficient font size, non-colorblind-safe palettes, missing n values in caption, incomplete error-bar definitions. The exercise is subjective but develops a critical eye for production quality.

C.2 ★★★ | Evaluate

A reviewer comments: "Figure 2 uses Type 3 fonts. Please resubmit with Type 42." What happened, and how do you fix it?

Guidance

By default, matplotlib embeds fonts in PDF output as Type 3, and some journal PDF processors cannot handle them. Fix: set `mpl.rcParams["pdf.fonttype"] = 42` and `mpl.rcParams["ps.fonttype"] = 42` at the top of the script, then re-run the figure code so the file is written again. No content changes; just re-export.
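
If changing global settings is undesirable, the export can be wrapped in `mpl.rc_context` so the font type applies only while the file is written. This is a sketch; the function name `export_pdf_type42` is hypothetical, and it assumes the PDF backend reads `pdf.fonttype` at save time, which is worth confirming on your matplotlib version by inspecting the fonts in the output file.

```python
import io
import matplotlib as mpl
import matplotlib.pyplot as plt


def export_pdf_type42(fig, target):
    """Save fig as a PDF with Type 42 (TrueType) font embedding.

    target may be a filename or a file-like object.
    """
    with mpl.rc_context({"pdf.fonttype": 42, "ps.fonttype": 42}):
        fig.savefig(target, format="pdf", bbox_inches="tight")


# Placeholder figure standing in for "Figure 2"
fig, ax = plt.subplots(figsize=(3.5, 2.5))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel("x (unit)")

buffer = io.BytesIO()  # a filename such as "figure2.pdf" works the same way
export_pdf_type42(fig, buffer)
```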

C.3 ★★★ | Create

Create a reusable Python module that applies your preferred journal style and provides convenience functions for sized figures and panel labels. Write a docstring explaining how to use it.

Guidance

See Section 27.13 for the basic template. Add docstrings, a main function that applies the style, and helpers like `figsize_single_col(aspect)` and `add_panel_label(ax, letter)`. Save as a module you can import into every figure script.
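
As a starting point, such a module could look like the sketch below. It is not the Section 27.13 template; the module name `journal_style`, the constants, and the default arguments are assumptions chosen to match the Nature settings used earlier in these exercises.

```python
"""journal_style: hypothetical helper module for publication figures.

Usage:
    import matplotlib.pyplot as plt
    import journal_style as js

    js.apply_style()
    fig, ax = plt.subplots(figsize=js.figsize_single_col(aspect=0.7))
    js.add_panel_label(ax, "a")
"""
import matplotlib as mpl
import matplotlib.pyplot as plt

SINGLE_COL_IN = 3.5  # 89 mm
DOUBLE_COL_IN = 7.2  # 183 mm

STYLE = {
    "pdf.fonttype": 42,
    "ps.fonttype": 42,
    "font.family": "sans-serif",
    "font.sans-serif": ["Arial", "Helvetica", "DejaVu Sans"],
    "font.size": 7,
    "axes.labelsize": 7,
    "xtick.labelsize": 6,
    "ytick.labelsize": 6,
    "legend.fontsize": 6,
    "savefig.dpi": 300,
}


def apply_style():
    """Apply the journal style to matplotlib's global rcParams."""
    mpl.rcParams.update(STYLE)


def figsize_single_col(aspect=0.7):
    """Return (width, height) in inches for a single-column figure."""
    return (SINGLE_COL_IN, SINGLE_COL_IN * aspect)


def figsize_double_col(aspect=0.5):
    """Return (width, height) in inches for a double-column figure."""
    return (DOUBLE_COL_IN, DOUBLE_COL_IN * aspect)


def add_panel_label(ax, letter, x=-0.15, y=1.05, fontsize=8):
    """Place a bold panel label just outside the top-left corner of ax."""
    return ax.text(x, y, letter, transform=ax.transAxes,
                   fontsize=fontsize, fontweight="bold", va="top")
```

Keeping the style in a dictionary rather than calling `rcParams.update` at import time means importing the module has no side effects; the style is applied only when `apply_style()` is called.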

C.4 ★★★ | Evaluate

The chapter argues that modern publication figures should emphasize effect sizes and reproducibility over p-value thresholds. Do you agree? What are the costs of this shift?

Guidance

The shift is well-motivated: effect sizes communicate practical significance better than p-values, and reproducibility catches errors that peer review misses. The costs: some readers are not yet fluent in effect-size interpretation, and the shift requires learning new visualization conventions (estimation plots, forest plots). The transition is gradual but real, and the direction is clear.

Chapter 28 covers big data visualization strategies for datasets too large for standard tools.