Exercises: Big Data Visualization

Install: pip install datashader holoviews bokeh dask plotly, plus numpy, pandas, matplotlib, and seaborn for the exercise code. Generate test data with numpy and pandas.


Part A: Conceptual (6 problems)

A.1 ★☆☆ | Recall

Name three symptoms of big-data overplotting.

Guidance (1) Scatter plots become solid clouds; individual points are invisible. (2) Rendering takes a long time; the chart becomes slow or unresponsive. (3) File size grows dramatically; vector formats become unwieldy. (4) Meaning disappears; the chart conveys "a lot of points" but no specific pattern. Any three of these indicate you need a big-data strategy.

A.2 ★☆☆ | Recall

At what approximate sizes does each technique become necessary?

Guidance Under 10k points: regular scatter. 10k-50k: alpha blending. 50k-500k: hex bins, 2D histograms, or WebGL. 500k-5M: hex bins or scattergl for interactive; datashader for massive. 5M-100M: datashader. Beyond 100M: datashader + Dask for out-of-core.
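The thresholds above can be sketched as a small dispatcher. This is a hypothetical helper, not part of any library; the cutoffs are rules of thumb, not hard limits.

```python
def choose_technique(n_points: int) -> str:
    """Map a dataset size to the rendering strategy suggested above."""
    if n_points < 10_000:
        return "regular scatter"
    if n_points < 50_000:
        return "alpha blending"
    if n_points < 500_000:
        return "hex bins / 2D histogram / WebGL"
    if n_points < 5_000_000:
        return "hex bins or scattergl; datashader for massive"
    if n_points < 100_000_000:
        return "datashader"
    return "datashader + Dask (out-of-core)"
```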

A.3 ★★☆ | Understand

What is the chapter's threshold concept and why does it matter?

Guidance "Aggregation is interpretation." When you bin, sample, or rasterize, you are making a design decision about what level of detail matters. There is no "raw" visualization of a million points — every rendering is a summary. Understanding this prevents the practitioner from treating aggregated visualizations as neutral displays of the data.

A.4 ★★☆ | Understand

What does datashader do differently from matplotlib for big data?

Guidance Matplotlib renders each point individually, so rendering cost grows linearly with input size. Datashader instead rasterizes into a fixed pixel grid (e.g., 800×600): a single O(N) pass aggregates points into per-pixel counts, and everything downstream (color mapping, display) costs O(pixels), which is constant regardless of input size. This lets datashader handle millions to billions of points where matplotlib would fail.
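The fixed-grid idea can be demonstrated with np.histogram2d standing in for datashader's aggregation step (a sketch of the principle, not datashader's actual code path):

```python
import numpy as np

rng = np.random.default_rng(0)

def rasterize(n_points, width=800, height=600):
    """Aggregate n_points into a fixed width-by-height grid of counts."""
    x = rng.standard_normal(n_points)
    y = rng.standard_normal(n_points)
    counts, _, _ = np.histogram2d(x, y, bins=(width, height))
    return counts

small = rasterize(10_000)
large = rasterize(1_000_000)
# The aggregation pass touches every point (O(N)), but the object
# handed to the display is always the same fixed-size grid.
assert small.shape == large.shape == (800, 600)
```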

A.5 ★★☆ | Analyze

Compare WebGL (Plotly scattergl) with datashader. When would you choose each?

Guidance **WebGL (scattergl)**: preserves individual points, allows hover tooltips, good for interactive exploration where specific points matter. Handles up to ~1-2 million points on a modern laptop. Best for: interactive sharing, dashboards, HTML reports where readers want to inspect points. **Datashader**: rasterizes to a fixed resolution, scales to billions, better for overview visualization at extreme scale. Best for: static images of massive datasets, out-of-core processing, publication-quality overview charts.

A.6 ★★★ | Evaluate

A colleague uses a random sample of 10,000 points from a 10-million-point dataset and reports that "there is no correlation between X and Y." Is this reliable?

Guidance For the bulk correlation, yes — 10,000 random points are plenty to estimate a correlation coefficient reliably. But for outlier detection, sparse regions, or rare subgroups, no — a random sample misses exactly the patterns that might matter. If the claim is "the average relationship is null," the sample is adequate. If the claim is "there is no pattern anywhere," the sample cannot support it. Suggest verifying with a datashader view of the full dataset to check for patterns that sampling might miss.
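A quick numerical sketch of this argument, using synthetic data with an uncorrelated bulk plus a small, strongly correlated subgroup (all numbers here are illustrative choices, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Bulk: uncorrelated noise. Rare subgroup: 200 points with a strong trend.
x = rng.standard_normal(n)
y = rng.standard_normal(n)
x[:200] = np.linspace(5, 6, 200)
y[:200] = x[:200] + rng.normal(0, 0.1, 200)

r_full = np.corrcoef(x, y)[0, 1]
idx = rng.choice(n, size=10_000, replace=False)
r_sample = np.corrcoef(x[idx], y[idx])[0, 1]

# The bulk correlation estimate from the sample is reliable...
assert abs(r_full - r_sample) < 0.05
# ...but the sample is expected to contain only ~2 of the 200
# subgroup points (200 * 10_000 / 1_000_000), so the rare pattern
# is invisible in the sampled scatter.
```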

Part B: Applied (10 problems)

B.1 ★☆☆ | Apply

Create a scatter plot of 100,000 random points with alpha=0.05. What does the density pattern look like?

Guidance
import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(100_000)
y = np.random.randn(100_000)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x, y, alpha=0.05, s=1)

B.2 ★☆☆ | Apply

Use matplotlib's hexbin to visualize the same 100,000 points.

Guidance
fig, ax = plt.subplots(figsize=(6, 6))
hb = ax.hexbin(x, y, gridsize=50, cmap="viridis")
fig.colorbar(hb, ax=ax, label="Count")

B.3 ★★☆ | Apply

Create a 2D histogram of 1 million random points using ax.hist2d.

Guidance
x = np.random.randn(1_000_000)
y = np.random.randn(1_000_000)

fig, ax = plt.subplots(figsize=(6, 6))
h = ax.hist2d(x, y, bins=100, cmap="viridis")
fig.colorbar(h[3], ax=ax)

B.4 ★★☆ | Apply

Use datashader to render 1 million points into an 800×600 image.

Guidance
import datashader as ds
import datashader.transfer_functions as tf
import pandas as pd
from matplotlib import colormaps

df = pd.DataFrame({"x": x, "y": y})
canvas = ds.Canvas(plot_width=800, plot_height=600)
agg = canvas.points(df, "x", "y")  # per-pixel point counts
# tf.shade expects a list of colors or a colormap object, not a name string
img = tf.shade(agg, cmap=colormaps["viridis"], how="eq_hist")
img  # displays in Jupyter

B.5 ★★☆ | Apply

Create a Plotly scatter with WebGL rendering for 500,000 points.

Guidance
import plotly.express as px
import pandas as pd

df = pd.DataFrame({"x": x[:500_000], "y": y[:500_000]})
fig = px.scatter(df, x="x", y="y", render_mode="webgl")
fig.show()

B.6 ★★☆ | Apply

Take a random sample of 10,000 from a million-point dataset and plot it.

Guidance
sample = df.sample(n=10_000, random_state=42)
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(sample["x"], sample["y"], alpha=0.3, s=5)

B.7 ★★★ | Apply

Build a stratified sample that takes 1000 points from each category, then plot.

Guidance
import seaborn as sns

# assumes df has a "category" column, e.g.:
# df["category"] = np.random.choice(list("ABC"), len(df))
sample = df.groupby("category", group_keys=False).apply(
    lambda g: g.sample(min(len(g), 1000), random_state=42)
)

sns.scatterplot(data=sample, x="x", y="y", hue="category")

B.8 ★★★ | Apply

Use datashader with a categorical aggregation to show different classes in different colors.

Guidance
# reuses the canvas from B.4 and the "category" column from B.7
df["category"] = df["category"].astype("category")
agg = canvas.points(df, "x", "y", agg=ds.count_cat("category"))
color_key = {"A": "red", "B": "blue", "C": "green"}
img = tf.shade(agg, color_key=color_key, how="eq_hist")

B.9 ★★☆ | Apply

Build a HoloViews + datashader interactive plot of a million points.

Guidance
import holoviews as hv
from holoviews.operation.datashader import datashade

hv.extension("bokeh")

points = hv.Points(df, kdims=["x", "y"])
shaded = datashade(points, cmap="viridis")
shaded

B.10 ★★★ | Create

Build a four-panel comparison of the same 1M-point dataset using: (a) alpha blending with a sample, (b) hex binning, (c) datashader, (d) scattergl. Compare rendering times.

Guidance Use `%time` in Jupyter to measure each. Expect alpha to be slow, hex and datashader to be fast, and scattergl to be fast but with larger output size. The fastest tool depends on the machine; the point is to develop intuition about relative costs.
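Outside Jupyter, the same comparison can be scripted with time.perf_counter. The helper below times the matplotlib-based panels; wrap the datashader and Plotly calls with the same perf_counter pattern. The point counts and figure sizes here are illustrative choices (scaled down so the demo stays quick), and forcing a draw matters because matplotlib defers rasterization.

```python
import time
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so timing measures rendering only
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = rng.standard_normal(100_000)

def time_panel(plot_fn):
    """Wall-clock seconds to render one panel, forcing a real draw."""
    fig, ax = plt.subplots(figsize=(4, 4))
    t0 = time.perf_counter()
    plot_fn(ax)
    fig.canvas.draw()  # rasterize now; otherwise timings are misleading
    elapsed = time.perf_counter() - t0
    plt.close(fig)
    return elapsed

timings = {
    "alpha scatter": time_panel(lambda ax: ax.scatter(x, y, s=1, alpha=0.05)),
    "hexbin": time_panel(lambda ax: ax.hexbin(x, y, gridsize=50)),
    "hist2d": time_panel(lambda ax: ax.hist2d(x, y, bins=100)),
}
```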

Part C: Synthesis (4 problems)

C.1 ★★★ | Analyze

A dataset has 50 million rows with two outliers at extreme coordinates. Which big-data visualization technique would preserve the outliers, and which would lose them?

Guidance **Preserves outliers**: datashader (shows them as individual or near-individual pixels at their coordinates), hex bin with fine grid (shows them as individual cells), scattergl (shows them as individual markers). **Loses outliers**: random sampling (probably misses them), aggressive aggregation (hex bin with coarse grid), KDE (smooths them into the density). For outlier-sensitive analysis, use datashader or scattergl.
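Why random sampling "probably misses them" is simple arithmetic: with a uniform random sample of n rows from N, each given outlier lands in the sample with probability n/N.

```python
N = 50_000_000   # total rows
n = 10_000       # sample size
k = 2            # outliers

p_each = n / N                    # chance one given outlier is sampled
p_miss_all = (1 - p_each) ** k    # chance the sample contains neither

assert p_each == 0.0002
assert p_miss_all > 0.999         # the outliers are almost surely lost
```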

C.2 ★★★ | Evaluate

A colleague has prepared a 100-million-point visualization and says "rendering takes 10 minutes, but that's fine because the final image is beautiful." Is this acceptable?

Guidance For a one-time static image, 10 minutes may be acceptable — especially if the alternative is a lower-quality visualization. For interactive exploration or production dashboards, 10 minutes is not acceptable. Suggest profiling to find the bottleneck: usually it's data loading, not rendering. Parquet over CSV, Dask for out-of-core, and caching pre-aggregated data can reduce times dramatically.

C.3 ★★★ | Create

Design a multi-scale visualization pipeline for a 10-million-point dataset: overview, zoom, and drill-down. Which tool for each level?

Guidance **Overview**: datashader image of all 10M points, rasterized to 1200×800. **Zoom**: HoloViews with re-aggregation as the user zooms. **Drill-down**: filtered subset displayed with Plotly scattergl, showing individual points with hover tooltips for specific inspection. Each level uses a different tool tuned to the scale of its view.
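The switch from the zoom level to the drill-down level can be sketched as a viewport filter that picks a rendering mode by visible point count. Both `choose_view` and the 50,000-point threshold are hypothetical choices for illustration, not part of any library:

```python
import numpy as np
import pandas as pd

def choose_view(df, x_range, y_range, raw_threshold=50_000):
    """Filter to the current viewport and pick a rendering mode.

    Below raw_threshold visible points, show individual markers
    (e.g., scattergl); above it, keep aggregating (e.g., datashader).
    """
    sub = df[df["x"].between(*x_range) & df["y"].between(*y_range)]
    mode = "raw points" if len(sub) <= raw_threshold else "aggregate"
    return sub, mode

# Demo: the full view aggregates; a small zoom window drills down.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.uniform(0, 1, 100_000),
                     "y": rng.uniform(0, 1, 100_000)})
_, full_mode = choose_view(demo, (0.0, 1.0), (0.0, 1.0))
_, zoom_mode = choose_view(demo, (0.0, 0.05), (0.0, 0.05))
```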

C.4 ★★★ | Evaluate

The chapter argues that "aggregation is interpretation." Does this mean big-data visualizations are inherently less honest than small-data ones?

Guidance Not necessarily. Every visualization is a summary of the data — even a scatter plot of 10 points makes design choices about axis range, color, and emphasis. Big-data visualizations are more obviously summaries because the aggregation is explicit. Small-data visualizations hide their design choices better, but the choices are still there. The chapter's point is not that big-data visualizations are less honest, but that the design decisions are more consequential and should be disclosed. An honest big-data visualization discloses its aggregation; a dishonest small-data one can hide its choices behind "raw" data.

Chapter 29 begins Part VII (Dashboards and Production) with Streamlit for building full interactive applications.