

Learning Objectives

  • Identify when a dataset is too large for standard scatter/line plots
  • Apply aggregation strategies: alpha blending, hex binning, 2D histograms, rasterization
  • Use datashader for rendering millions of points into meaningful images
  • Use Plotly with WebGL rendering (scattergl) for interactive large-data visualization
  • Apply sampling strategies: random, stratified, and reservoir sampling
  • Create aggregated summaries at multiple scales: overview, detail, drill-down
  • Explain the trade-offs: aggregation loses individual points, sampling introduces randomness, rasterization loses vector quality

Chapter 28: Big Data Visualization — When You Have a Million Points

"When you have millions of data points, you stop visualizing data and start visualizing summaries of data. The summary is a design decision." — paraphrased from James Bednar, developer of datashader


28.1 The Big Data Problem

Every chart in this textbook so far has assumed that the reader can distinguish individual data points. A scatter plot shows dots; the reader can count them, compare them, notice outliers. A line chart shows a line; the reader can trace it from start to end. A bar chart shows bars; the reader can read off each category's value. The implicit assumption is that the number of points is comfortably below the number of pixels on the screen — maybe a few thousand at most.

This assumption breaks down at scale. A modern dataset might have a million rows of sensor data, a billion rows of clickstream events, ten million rows of genomic variants, or hundreds of millions of social media posts. Naively plotting these datasets with standard scatter or line plots fails in specific and predictable ways:

Problem 1: Overplotting. A scatter of 100,000 points produces a dark cloud where individual points are invisible. Adding more points does not help — the cloud just gets darker. The reader learns nothing about the distribution except that it exists.

Problem 2: Rendering performance. Plotting millions of markers in matplotlib takes a long time and produces a huge file. Plotly with standard SVG rendering becomes unresponsive. Most libraries simply give up or render something unusable.

Problem 3: Meaningless individual points. Even if you could see each point, there are too many for individual attention. The reader cannot process a million data points. The question shifts from "what does this point mean?" to "what does the overall pattern mean?"

Problem 4: File size. A vector SVG of a million points is enormous — many megabytes. A raster PNG at screen resolution is smaller but still slow to load.

The solution is not to abandon visualization. It is to aggregate before plotting. Instead of drawing individual points, compute a summary (density, count, average) and draw the summary. The summary might be a hex bin map, a 2D histogram, a contour plot, a rasterized density image, or an aggregated sample. Each summary loses information — you can no longer identify individual points — but the aggregate pattern is visible in a way it would not be otherwise.

This is the chapter's threshold concept: aggregation is interpretation. When you bin, sample, or rasterize a million points, you are making a design decision about what level of detail matters. There is no "raw" visualization of a million points. Every rendering is a summary, and the choice of summary is the most consequential decision in the visualization.

28.2 Alpha Blending: The Simplest Strategy

The simplest response to overplotting is alpha blending — making each point semi-transparent so that overlapping points darken and reveal density. A scatter with alpha=0.1 turns a cloud of solid dots into a gradient where dense regions are darker and sparse regions are lighter. Implementation is trivial:

import matplotlib.pyplot as plt
import numpy as np

x = np.random.randn(100_000)
y = np.random.randn(100_000)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x, y, alpha=0.05, s=1)

The alpha=0.05 means each point contributes 5% opacity. Overlapping points compound: stacked markers darken toward (but never quite reach) full opacity — twenty markers at alpha=0.05 reach roughly 64%. The result reveals the Gaussian density of the underlying distribution as a gradient.
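Opacity compounds rather than adds: under standard "over" compositing, n stacked markers with per-marker alpha a reach an effective opacity of 1 - (1 - a)^n. A quick check:

```python
def stacked_opacity(alpha: float, n: int) -> float:
    """Effective opacity of n markers stacked at one spot under
    standard "over" compositing: 1 - (1 - alpha)^n."""
    return 1 - (1 - alpha) ** n

for n in (1, 5, 20, 60):
    # 20 stacked markers at alpha=0.05 reach ~64% opacity, not 100%
    print(n, round(stacked_opacity(0.05, n), 3))
```

This is why the density-to-darkness mapping is non-linear: each additional point darkens the pixel less than the one before.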

Alpha blending has limits. It works for up to a few hundred thousand points; beyond that, rendering time becomes unacceptable (each point is still a separate marker). It also has a perceptual problem: the mapping from density to darkness is non-linear and hard to read quantitatively. A dark region might be 100 points stacked or 20 points stacked — you cannot tell precisely.

For up to about 50,000 points, alpha blending is a good first resort. Beyond that, other techniques are needed.

28.3 Hex Binning

Hex binning divides the plot area into hexagonal cells, counts the points in each cell, and colors each cell by count. The result is a 2D density map where individual points are invisible but the pattern of density is clear. Matplotlib has built-in hex binning:

fig, ax = plt.subplots(figsize=(6, 6))
hb = ax.hexbin(x, y, gridsize=50, cmap="viridis")
fig.colorbar(hb, ax=ax, label="Count")

The gridsize=50 parameter controls the resolution — 50 hexagonal bins across the x-axis range. Higher values give finer detail at the cost of sparser bins. Typical values are 30-100 depending on the data size and the plot dimensions.

Hex bins have several advantages over rectangular bins:

  • Better space coverage. Hexagons tile the plane efficiently without gaps.
  • No directional bias. Rectangular bins can create visual artifacts along their axes. Hexagons do not.
  • Visually pleasing. The honeycomb pattern is aesthetically clean.

Plotly's px.density_heatmap(..., nbinsx=50, nbinsy=50) is the rectangular-bin equivalent; matplotlib's ax.hist2d likewise produces rectangular 2D histograms.

# Plotly version
import plotly.express as px
fig = px.density_heatmap(df, x="x", y="y", nbinsx=50, nbinsy=50)
fig.show()

Hex bins and 2D histograms work well up to several million points. The performance is good because the aggregation is O(N) (linear in the number of points) and the display is a small 2D grid (typically 50×50 = 2500 cells, regardless of input size).

28.4 2D Histograms and Density Estimation

A 2D histogram is the rectangular cousin of hex binning. Each cell is a rectangle; the count is the number of points inside. For most purposes it is interchangeable with hex binning, but the rectangular layout has slight advantages for certain domains (genomics, astronomy) where grid-aligned structure is meaningful.

fig, ax = plt.subplots(figsize=(6, 6))
h = ax.hist2d(x, y, bins=50, cmap="viridis")
fig.colorbar(h[3], ax=ax, label="Count")

Kernel density estimation (KDE) produces a smooth density estimate instead of discrete bins. The result looks like a smoothed 2D histogram and can be displayed as contour lines or a filled heatmap:

import seaborn as sns

sns.kdeplot(x=x, y=y, fill=True, cmap="viridis", levels=20)

KDE has the advantage of producing a smooth continuous surface — no visible bin boundaries — at the cost of being slower than hex binning for very large datasets. For up to 100,000 points it is fast enough; beyond that, it starts to slow down.

Which to use?

  • Hex bins for most cases, especially up to several million points.
  • 2D histograms when grid alignment matters or when you need specific rectangular bin sizes.
  • KDE when you want a smooth visualization and performance is not a concern.

All three produce similar information — a 2D density map. The differences are aesthetic and performance-related.

28.5 Datashader: Rasterization at Scale

For datasets beyond a few million points, even hex binning becomes slow. The solution is datashader, a specialized library developed by Continuum Analytics (now Anaconda) for rendering massive datasets through rasterization.

The datashader approach: instead of drawing individual points or hex bins, create a grid of pixels (say, 800×600) and count how many data points fall in each pixel. The result is a 2D array that can be displayed as an image. Because the output grid is fixed (480,000 pixels regardless of input size), aggregation cost scales linearly with the number of input points, while shading and display cost are constant — there is no per-point rendering.

import datashader as ds
import datashader.transfer_functions as tf
import matplotlib.cm as cm
import pandas as pd

df = pd.DataFrame({"x": x, "y": y})
canvas = ds.Canvas(plot_width=800, plot_height=600)
agg = canvas.points(df, "x", "y")     # count of points per pixel
img = tf.shade(agg, cmap=cm.viridis)  # render counts as a colored image

The canvas.points call computes the point counts per pixel; tf.shade renders the counts as a colored image. The resulting image can be displayed directly, saved as a PNG, or embedded in a Plotly or matplotlib figure.

Datashader handles datasets of millions to billions of points efficiently. It uses Numba for just-in-time compilation and Dask for out-of-core processing, so even datasets that do not fit in memory can be rendered. The output is always a raster image — you cannot zoom beyond the pixel resolution without regenerating — but for big-data visualization this is usually fine.

Datashader's main features:

  • Scalable aggregation: counts, sums, means, standard deviations, counts-by-category, etc.
  • Multiple shading modes: linear, logarithmic, equal-histogram (for heavy-tailed distributions), categorical coloring.
  • Interactive integration: with HoloViews and Bokeh for pan/zoom that re-aggregates at each zoom level.
  • Out-of-core processing: with Dask for datasets larger than memory.

Datashader is the right tool when alpha blending, hex bins, and 2D histograms run out of steam. For datasets under 1 million points, simpler tools are usually enough. For datasets above 10 million, datashader is often the only option.

28.6 WebGL Rendering in Plotly

WebGL (Web Graphics Library) is a JavaScript API for GPU-accelerated 2D and 3D graphics in web browsers. Plotly uses WebGL for large-data chart types, where it can handle orders of magnitude more points than the default SVG rendering.

The main WebGL trace types in Plotly:

  • scattergl: WebGL-accelerated scatter. Drop-in replacement for scatter.
  • heatmapgl: WebGL-accelerated heatmap.
  • scatterpolargl: WebGL-accelerated polar scatter.

To use them with Plotly Express:

import plotly.express as px

fig = px.scatter(df, x="x", y="y", render_mode="webgl")
fig.show()

The render_mode="webgl" tells Plotly Express to use Scattergl instead of Scatter. With WebGL, a scatter of 500,000 points renders smoothly; 1 million is borderline; 5 million is too much for most browsers.

For direct Graph Objects use:

import plotly.graph_objects as go

fig = go.Figure(go.Scattergl(x=x, y=y, mode="markers", marker=dict(size=2, opacity=0.5)))

WebGL's advantage is interactivity — you can pan, zoom, and hover on large datasets without the lag that SVG would produce. Its limits: browser memory (WebGL has a per-browser point limit), aesthetic rendering (anti-aliasing is less precise than SVG), and compatibility (very old browsers do not support WebGL).

For interactive web delivery of large datasets, WebGL is usually the right choice. For static print output, rasterization (datashader → PNG) is better because the final output is a fixed image anyway.

28.7 Sampling Strategies

Sometimes the right response to big data is not to visualize all of it, but to visualize a representative sample. A scatter of 1000 random points from a million can convey the same pattern as a scatter of all million, at a fraction of the rendering cost. The key is choosing a sampling strategy that preserves the patterns you care about.

Random sampling: select N points uniformly at random. Simple and unbiased, but can miss rare patterns.

sample = df.sample(n=10_000, random_state=42)

Stratified sampling: sample within each category so that every category stays represented — proportionally, or (as below) capped at a fixed number per category — even when some categories are rare.

sample = df.groupby("category", group_keys=False).apply(lambda g: g.sample(min(len(g), 1000)))

Reservoir sampling: streaming algorithm that produces a random sample of fixed size from a stream of unknown length. Useful for datasets too large to load into memory.
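Reservoir sampling (Vitter's Algorithm R) is short enough to sketch in full; every item in the stream ends up in the sample with equal probability:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: uniform random sample of size k from a stream
    of unknown length, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), 1000, seed=42)
```

The stream is consumed once and never held in memory, which is exactly what a too-large-for-RAM dataset requires.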

Weighted sampling: weight the sampling probability by some feature (e.g., sample more heavily from rare categories or from recent dates).

Stratified by 2D bins: divide the plot area into bins, then sample a fixed number of points from each non-empty bin. This ensures that sparse regions of the plot are represented even when random sampling would miss them.

def bin_and_sample(df, n_per_bin=10, bins=50):
    df = df.copy()
    df["x_bin"] = pd.cut(df["x"], bins)
    df["y_bin"] = pd.cut(df["y"], bins)
    # observed=True restricts the groupby to non-empty bins
    return df.groupby(["x_bin", "y_bin"], group_keys=False, observed=True).apply(
        lambda g: g.sample(min(len(g), n_per_bin))
    )

Sampling has a trade-off: smaller samples render faster but lose detail. The right sample size depends on the pattern you want to preserve. For visual inspection of a scatter plot, 5,000-50,000 points is usually enough. For statistical inference, more.

A warning: sampling can introduce bias if not done carefully. If the data has outliers that matter, random sampling will usually miss them (because outliers are rare by definition). For outlier detection, use strategies that explicitly include extreme points (e.g., always include the top/bottom 1% alongside a random sample of the rest).
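One way to implement that strategy — a hypothetical helper, with the column name, tail fraction, and sample size as illustrative parameters:

```python
import numpy as np
import pandas as pd

def sample_with_extremes(df, col, n_random=5000, tail=0.01, seed=42):
    """Keep the extreme `tail` fraction of `col` on both sides,
    plus a random sample of the bulk in between."""
    lo, hi = df[col].quantile([tail, 1 - tail])
    extremes = df[(df[col] < lo) | (df[col] > hi)]
    bulk = df[(df[col] >= lo) & (df[col] <= hi)]
    n = min(n_random, len(bulk))
    return pd.concat([extremes, bulk.sample(n, random_state=seed)])
```

The resulting sample always contains the global minimum and maximum, so outlier structure survives the downsampling.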

28.8 Multi-Scale Visualization

A sophisticated big-data visualization strategy is multi-scale: provide an aggregated overview for the whole dataset, and a detailed view for zoomed-in regions. The reader starts with the overview, identifies interesting regions, and drills down with higher resolution.

This is Shneiderman's mantra (Chapter 9 and Chapter 19) applied to scale: overview first, zoom and filter, details on demand. For big data, each level of the mantra uses a different technique:

Overview: aggregated view (hex bins, datashader, 2D histogram) showing the whole dataset.

Zoom and filter: progressive refinement as the user zooms. Interactive tools re-aggregate at each zoom level so the detail emerges naturally.

Details on demand: individual points revealed on hover, click, or deep zoom. Only the specific points the user asks about are rendered as individuals.

The HoloViews + datashader combination implements this natively:

import holoviews as hv
from holoviews.operation.datashader import datashade

hv.extension("bokeh")

points = hv.Points(df, kdims=["x", "y"])
shaded = datashade(points, cmap="viridis")
shaded  # displays with interactive zoom and re-rasterization

The resulting plot is interactive: when the user zooms, datashader re-aggregates the visible region at the new resolution, so the image always has the same effective pixel count regardless of zoom level. Zoom in, and you see progressively finer detail. This is one of the most elegant solutions to big-data visualization in the Python ecosystem.

Plotly + Dash can achieve similar effects through callback-driven re-aggregation. For custom implementations, the pattern is: listen for zoom events, recompute the aggregation on the visible subset, update the displayed image.
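The recompute step of that pattern is just a windowed 2D histogram; a minimal sketch (the Dash wiring — a callback listening to the graph's relayoutData — is omitted here):

```python
import numpy as np
import pandas as pd

def reaggregate(df, x_range, y_range, bins=200):
    """Re-bin only the visible region at the new resolution.
    In a Dash app this would run inside a zoom-event callback,
    with the result fed back into the figure as a heatmap."""
    visible = df[df["x"].between(*x_range) & df["y"].between(*y_range)]
    counts, x_edges, y_edges = np.histogram2d(
        visible["x"], visible["y"], bins=bins,
        range=[list(x_range), list(y_range)])
    return counts, x_edges, y_edges
```

Because the output grid size is fixed, each zoom level costs the same to display — only the filtering step depends on the data size.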

28.9 Big Data Pitfalls

Big-data visualization has its own set of pitfalls that deserve explicit mention.

Pitfall 1: Assuming aggregation is neutral. Aggregation loses information. A hex bin hides the individual points; sampling randomly drops points; datashader renders at a fixed resolution. Every aggregation is a choice about what to preserve and what to discard. Readers may assume the visualization shows "the data" when it actually shows a summary.

Pitfall 2: Simpson's paradox in aggregation. When you aggregate, subgroup patterns can reverse. A 2D histogram that shows an overall positive correlation might hide that specific categories have negative correlations. Always verify aggregated patterns against category-specific views.

Pitfall 3: Color scale saturation. A density heatmap with a linear color scale is dominated by the peak densities. Sparse regions disappear because they map to near-zero colors. Use a log color scale for heavy-tailed distributions (norm=matplotlib.colors.LogNorm()).
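A minimal example of the fix, on synthetic Gaussian data:

```python
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import numpy as np

x = np.random.randn(100_000)
y = np.random.randn(100_000)

fig, ax = plt.subplots(figsize=(6, 6))
# LogNorm maps counts logarithmically, so sparse bins stay visible
# instead of being crushed to near-zero color by the dense peak
h = ax.hist2d(x, y, bins=50, norm=LogNorm())
fig.colorbar(h[3], ax=ax, label="Count (log scale)")
```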

Pitfall 4: Sampling bias from outlier loss. Random sampling preserves the bulk distribution but loses outliers. If outliers matter, include them explicitly or use a larger sample.

Pitfall 5: Aliasing in rasterization. When a high-resolution pattern is rendered to a low-resolution grid, the grid can introduce visual artifacts (moire patterns, discrete jumps). Use anti-aliasing or a finer grid if the artifacts are visible.

Pitfall 6: False density. Coarse bins can imply structure where none exists: within a single cell, 100 points concentrated at one spot look identical to 100 points spread evenly across the cell — the bin cannot distinguish them. Use finer bins or supplement with individual points for clarity.

Pitfall 7: Losing the analytical thread. Big-data visualization is often more about the technical problem (how do I render a million points?) than about the analytical question (what am I trying to learn?). Stay focused on the question.

28.10 Progressive Project (Alternate): A Million-Point Social Media Scatter

For this chapter's exercises, we use a synthetic dataset of 1 million social media events with (longitude, latitude) locations. The goal is to visualize the geographic distribution of events at multiple levels of detail.

Level 1: Alpha blending with a small sample (10,000 points). Fast but loses information about the full dataset.

Level 2: Hex binning of the full million. Reveals the global density pattern but cannot be zoomed without losing detail.

Level 3: Datashader of the full million with out-of-core processing. Produces a pixel-perfect image that can be rendered at any resolution.

Level 4: HoloViews + datashader interactive. Multi-scale exploration with re-aggregation at each zoom level. The reader starts with the overview and drills down to specific regions.

Level 5: Plotly scattergl for a sampled subset (100,000 points). Interactive with hover tooltips for individual events, at the cost of requiring sampling.

Each level answers a slightly different question and has different trade-offs. Level 1 is fastest to build and best for quick exploration. Level 2 is a reliable middle ground. Level 3 scales to any size. Level 4 is the most powerful for exploration. Level 5 is best for interactive sharing where individual events matter.

No single level is "the" visualization. The whole point of this chapter is that big-data visualization requires choosing the right level for the job.

28.11 Datashader in Depth

Datashader deserves more detailed treatment because it is the most capable tool for truly large visualizations and because its API is distinctive. Understanding the core pattern unlocks most of its power.

The datashader pipeline has four stages:

Stage 1: Canvas. A canvas defines the output pixel grid. You specify width, height, and optionally the data extent (x_range, y_range). The canvas is the container for the rasterization.

import datashader as ds
canvas = ds.Canvas(plot_width=800, plot_height=600,
                   x_range=(-5, 5), y_range=(-5, 5))

Stage 2: Aggregation. You project the data onto the canvas and compute an aggregation per pixel. The most common aggregation is count (number of points per pixel), but datashader supports many: sum, mean, max, min, count_cat (count by category), any (binary presence), and more.

agg = canvas.points(df, "x", "y", agg=ds.count())
# Or for categorical:
agg = canvas.points(df, "x", "y", agg=ds.count_cat("category"))
# Or for numeric values:
agg = canvas.points(df, "x", "y", agg=ds.mean("value"))

The agg result is an xarray Dataset (or DataArray) — a 2D grid of aggregation results indexed by pixel position. It is not yet an image; it is the raw counts.

Stage 3: Shading. Convert the aggregation grid into a colored image. tf.shade handles this, with options for colormap, how='linear'/'log'/'eq_hist', color_key for categorical data, and more.

import datashader.transfer_functions as tf
import matplotlib.cm as cm

img = tf.shade(agg, cmap=cm.viridis, how="eq_hist")

The how="eq_hist" option is particularly useful for heavy-tailed data: it equalizes the histogram so every intensity level is used approximately equally, which brings out detail in both sparse and dense regions. how="log" uses a log scale; how="linear" is the default.
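The idea behind eq_hist can be sketched in a few lines of NumPy — a simplified illustration of the principle, not datashader's actual implementation:

```python
import numpy as np

def eq_hist(counts):
    """Map each nonzero count through the empirical CDF of all
    nonzero counts, so every intensity level is used about equally."""
    flat = counts.ravel().astype(float)
    nonzero = np.sort(flat[flat > 0])
    ranks = np.searchsorted(nonzero, flat, side="right")
    out = np.where(flat > 0, ranks / len(nonzero), 0.0)
    return out.reshape(counts.shape)

# Heavy-tailed counts: linear scaling would crush the 1 and 10
# to near-zero intensity; eq_hist spreads them across the range.
counts = np.array([[0, 1], [10, 1000]])
print(eq_hist(counts))
```

Each nonzero count maps to its rank in the distribution, so a pixel with 10 points is clearly distinguishable from one with 1000 even when the raw ratio is 100:1.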

Stage 4: Display. The shaded image can be displayed in Jupyter, saved to a file, or embedded in another plotting library.

img = tf.set_background(img, "white")  # returns a new image with the background set
img.to_pil().save("datashader_output.png")

Datashader with categorical data. For data with categories, datashader can produce an image where each pixel's color reflects the category counts. Colors blend in proportion to the counts, so a pixel with 30 red points and 10 blue points appears predominantly red with a blue tint.

agg = canvas.points(df, "x", "y", agg=ds.count_cat("category"))
color_key = {"A": "red", "B": "blue", "C": "green"}
img = tf.shade(agg, color_key=color_key, how="eq_hist")

The result is a visualization where the spatial structure of each category is visible, even when the categories overlap. This is particularly useful for multi-class spatial data — e.g., showing different types of events on a map.

Out-of-core with Dask. For datasets too large to fit in memory, datashader integrates with Dask:

import dask.dataframe as dd
ddf = dd.read_csv("huge_dataset.csv")
agg = canvas.points(ddf, "x", "y")

Dask handles the chunking and parallelization; datashader aggregates each chunk and combines them. This scales to datasets well beyond the available memory, at the cost of slower processing.

28.12 Integrating Datashader with Interactive Tools

Datashader by itself produces static images. To make it interactive (pan, zoom, hover), you combine it with a charting library that supports callbacks.

HoloViews + datashader + Bokeh: the standard combination.

import holoviews as hv
from holoviews.operation.datashader import datashade, dynspread
import datashader as ds
import pandas as pd

hv.extension("bokeh")

df = pd.DataFrame({"x": x, "y": y})
points = hv.Points(df, kdims=["x", "y"])
shaded = datashade(points, aggregator=ds.count(), cmap="viridis")
dynspread(shaded, threshold=0.5)  # spread sparse points for visibility

The datashade operation wraps the rasterization in a HoloViews object that re-runs at each zoom level. The dynspread operation makes isolated points slightly larger so they remain visible against the background when dense regions are also present. The result is a Bokeh-rendered plot that zooms to infinite detail.

Plotly with datashader: possible but less integrated. You can compute the datashader image manually and embed it as a background image in a Plotly figure. This is more work than HoloViews but gives you Plotly's other features (hover, range selectors, layout customization).

Panel dashboards: Panel is a dashboarding library that pairs well with datashader and HoloViews. A Panel dashboard can include datashader visualizations alongside other widgets and charts, producing full interactive applications.

For exploration of unknown large datasets, HoloViews + datashader is the most productive starting point. For production-grade applications that need custom layouts and interactions, Plotly + Dash or Panel may be better depending on the team's preferences.

28.13 Comparing the Strategies

Each big-data strategy has strengths and weaknesses. The table below summarizes.

| Strategy | Dataset size | Performance | Interactivity | Individual points visible? |
| --- | --- | --- | --- | --- |
| Alpha blending | Up to ~50k | Slow for >100k | Yes (if wrapped) | No (but density visible) |
| Hex binning | Up to several million | Fast | Yes (in most libs) | No |
| 2D histogram | Up to several million | Fast | Yes | No |
| KDE | Up to ~100k | Slow for >100k | Yes | No (smoothed) |
| Datashader (static) | Millions to billions | Fast, scalable | No | No |
| Datashader + HoloViews | Millions to billions | Fast, interactive | Yes, re-aggregating | At high zoom |
| Plotly scattergl | Up to ~1 million | Fast with WebGL | Yes | Yes |
| Random sampling + scatter | Up to original | Depends on sample size | Yes | Only sampled points |

Decision guide:

  • Under 10k points: regular scatter plot. No big-data tools needed.
  • 10k–50k points: alpha blending or small hex binning.
  • 50k–500k points: hex binning, 2D histogram, or Plotly scattergl.
  • 500k–5M points: hex binning for static, scattergl for interactive, datashader for massive scale.
  • 5M–100M points: datashader. For interactivity, HoloViews + datashader.
  • >100M points: datashader with Dask for out-of-core processing.

The boundaries are approximate and depend on the specific machine. On a modern laptop, scattergl handles about 1-2 million points comfortably; on a powerful workstation, more. The principles are the same regardless.

28.14 Big Data Comes with Big Preprocessing

A theme of this chapter has been that visualization at scale requires aggregation. But the aggregation step is often more time-consuming than the visualization itself. Preparing a million-row dataset for visualization typically involves:

Loading: reading from CSV, Parquet, database, or cloud storage. Parquet is usually the fastest format for large tabular data. CSV is slow and should be avoided when possible.

Cleaning: removing nulls, handling outliers, type conversions. For big data, consider the trade-off between filtering early (smaller dataset to visualize) and filtering late (more information available).

Feature engineering: computing derived columns. For visualizations, this often means computing categories, quantile bins, or time-period labels that will become chart facets.

Aggregation: if the full-resolution data is not needed for the visualization, pre-aggregate to a smaller dataset. For a time series visualization of a year of minute-level data, daily aggregates are often sufficient and 1440 times smaller.
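The minute-to-daily reduction is a one-liner with pandas (synthetic data assumed):

```python
import numpy as np
import pandas as pd

# A non-leap year of minute-level readings: 365 * 24 * 60 = 525,600 rows
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=525_600, freq="min")
ts = pd.Series(rng.standard_normal(len(idx)), index=idx)

daily = ts.resample("D").mean()  # 525,600 rows -> 365
```

The daily series plots instantly with any standard line chart; the minute-level original would need the big-data machinery of this chapter.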

Spatial indexing: for geographic data, pre-compute spatial indices (Uber's H3, or a simple grid) to speed up filtering and aggregation.

Parquet over CSV: if you are going to visualize the same dataset multiple times, convert it to Parquet once and reuse. Parquet is typically 5-10x faster to read than CSV and 3-5x smaller on disk.

These preprocessing steps are part of the visualization workflow, even though they are not "visualization" in the narrow sense. Budget time for them. A big-data visualization that takes 10 minutes to render is usually bottlenecked by the data loading and aggregation, not by the plotting library.

28.15 Honest Aggregation

The chapter's threshold concept is that aggregation is interpretation. This has practical implications for how you communicate big-data visualizations.

Disclose the aggregation. A hex bin plot should mention the grid size and the aggregation function. A datashader image should mention the pixel resolution. A sampled visualization should mention the sample size and selection method. Captions that say "1 million points" without explaining how they were rendered are incomplete.

Show multiple levels. If possible, show both the aggregated view and a sample of individual points. The reader gets the overall pattern from the aggregation and a sense of the raw data from the sample.

Cross-check with filters. Aggregate patterns can hide subgroup effects. Before making claims based on a big-data visualization, filter by relevant categories and verify the aggregate holds up.

Report outliers separately. Aggregation often loses outliers. If outliers matter for your analysis, show them explicitly — either as overlaid markers or as a supplementary list.

Use log color scales for heavy tails. Most real-world data has heavy-tailed distributions (a few dense regions, many sparse ones). Linear color scales hide the sparse regions. Log scales (or eq_hist in datashader) reveal them.

Validate with synthetic data. If you are using a novel aggregation technique, verify it on synthetic data with known patterns. Generate data with a specific structure, aggregate it, and check whether the aggregation recovers the structure.

These practices are the big-data equivalent of the "honest chart" principles from Chapter 4. The rules look different at scale (aggregation rather than axis manipulation is the main risk), but the underlying principle is the same: make the design choices visible and defensible.

28.16 Case Examples at Different Scales

To ground the abstract strategies in specific situations, consider four real-world scenarios and which tool fits each.

Scenario 1: NYC taxi trip data. ~170 million trips, each with pickup and dropoff coordinates, distance, fare, time. Total size: ~40 GB in Parquet.

A scatter plot is obviously impossible. Even hex binning of the full dataset takes minutes. The right tool is datashader with Dask: out-of-core processing of the full dataset, rasterized into a pixel grid. With the eq_hist shading option, the full spatial distribution of trips is visible in a single image, and interactive zoom can drill into specific boroughs. The famous "Pictures from 1.5 Million Taxi Trips" visualization by Chris Whong (2014) used similar techniques and became a landmark example of urban data visualization.

Scenario 2: IoT sensor data. 10,000 sensors each reporting a value every minute for a year. Total rows: ~5.2 billion. Each value is a float and a timestamp.

Even datashader struggles at this scale without aggregation. The right approach is pre-aggregation: compute hourly or daily averages per sensor (reducing 5.2 billion to ~88 million), then visualize with datashader or WebGL. For time-series views, aggregate over all sensors (reducing to 525,600 time points), then visualize as a normal line chart. The 5.2-billion-row raw data never needs to be visualized directly; aggregation produces tractable derivatives.

Scenario 3: User behavior analytics. ~1 million users, each with a few dozen events per day over a year. Total rows: ~11 billion, but user-level aggregates fit in memory.

The right approach is hierarchical aggregation: user-level aggregates for user-comparison charts, cohort-level aggregates for cohort comparisons, event-type aggregates for event-type comparisons. Each analytical question uses a different aggregation, and the visualizations reflect those aggregations directly. Raw event data is rarely needed for visualization at this scale.

Scenario 4: Genomic variants. ~50 million variants across ~1000 samples, each variant with a position and several per-sample measurements.

The scale is manageable but the data is complex. Manhattan plots (Chapter 27) handle per-variant p-values via scatter with WebGL. Heatmaps handle variant × sample matrices with aggregation by chromosome region. The specific tool depends on the question; the general approach is to pre-aggregate along the relevant dimension before plotting.

Each scenario uses a different combination of tools from this chapter, and the right combination depends on the scale, the question, and the available infrastructure. The through-line is that raw data is rarely visualized at big scales. Aggregation is always part of the pipeline.

28.17 Cloud and Streaming Scales

Beyond the "millions to billions on a laptop" scale, there are datasets at the "petabyte on a cluster" scale. These require distributed computing (Spark, Dask, BigQuery) for aggregation, and the visualization is almost always of pre-aggregated results rather than raw data.

Pattern for cluster-scale visualization:

  1. Run a distributed query (Spark SQL, BigQuery, Dask) to aggregate the data. The output is a small (usually under 1 million rows) summary.
  2. Load the summary into pandas.
  3. Visualize the summary with any of the tools from earlier chapters.

The visualization step is standard; the aggregation step is the challenge. Tools like BigQuery's INFORMATION_SCHEMA and Spark's DataFrame API support the aggregations needed.

Streaming visualization is a different beast. Real-time dashboards (trading systems, operations monitoring, social media analytics) need to update as data arrives. The pattern:

  1. A streaming pipeline (Kafka, Kinesis, Flink) processes events in real time.
  2. An aggregation layer (time-windowed counts, moving averages) produces a current snapshot.
  3. A visualization layer (Plotly Dash with dcc.Interval, Streamlit with periodic refresh, Grafana for metrics) displays the snapshot.

For most data science use cases, streaming is overkill. Periodic refresh (once per minute or hour) is enough. When true real-time is needed, dedicated tools like Grafana often beat Python-native options.
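The aggregation layer (step 2 above) can be sketched in pure Python with a sliding window over event timestamps. The class name and design are illustrative, not from any streaming library.

```python
from collections import deque

class WindowedCounter:
    """Sketch of a streaming aggregation layer: keep event timestamps
    inside a sliding window and report the current count as a snapshot."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()

    def add(self, timestamp: float) -> None:
        # Assumes events arrive in timestamp order.
        self.events.append(timestamp)

    def snapshot(self, now: float) -> int:
        # Evict events older than the window, then report the count.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events)

counter = WindowedCounter(window_seconds=60)
for t in [0, 10, 30, 55, 70]:           # synthetic event timestamps
    counter.add(t)
print(counter.snapshot(now=90))         # events at t >= 30 remain: 3
```

A dashboard's refresh callback would call snapshot() once per interval; the visualization layer never sees the raw event stream.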

28.18 The Limits of Visual Perception

A philosophical point worth making: even with perfect tools, there is a limit to how much information a human can extract from a single chart. A screen has maybe 10 million pixels. A human's visual attention can process maybe a few dozen distinct visual elements at once. A complicated chart with millions of points can display density information, but it cannot display a million distinct facts.

This limit is not a technology problem. It is a human-perception problem. No matter how good your visualization tool is, there is a ceiling on how much a human can learn from a single image. The response is to focus on the specific pattern you want the reader to see, rather than trying to display everything.

For a million-point scatter, the interesting patterns are usually: overall density, regions of high and low density, outliers, specific shapes. A visualization that emphasizes these patterns — via datashader with eq_hist, via outlier highlighting, via zoom-and-drill — is more useful than one that simply renders the million points.

The broader principle: big data is not about showing more; it is about showing the right summary. The summary chosen depends on what you want the reader to learn. If the analyst does not know what the reader should learn, no amount of visualization technology will produce a useful chart. The analytical question must come first; the visualization serves it.

This is also the reason why this chapter is shorter on code examples and longer on strategic thinking. The code for hex binning or datashader is easy to learn — a few lines per tool. The hard part is deciding which tool fits which question, and which aggregation preserves the analytical signal. That judgment develops with experience and with exposure to specific domain problems.

28.19 Practical Optimization Tips

When big-data visualization is slow, the bottleneck is usually not the plotting library. It is the data loading, the aggregation step, or the transfer between Python and the browser. A few practical optimizations that help:

Read Parquet, not CSV. Parquet files are typically 5-10x faster to read than CSV, and they store column types, so no parsing or type inference is needed on load. Convert your CSV once with df.to_parquet("file.parquet") and then read from Parquet thereafter.

Filter during load. If you only need some rows or columns, push the filter into the read: pd.read_parquet("file.parquet", filters=[("year", ">=", 2020)]) skips row groups that fail the predicate, and the columns= argument reads only the columns you need. This avoids loading data you do not need.

Use Dask for out-of-core. When data does not fit in memory, Dask lets you process it in chunks. dd.read_parquet("huge.parquet") gives you a Dask DataFrame that behaves like pandas but processes the data in chunks under the hood.

Pre-aggregate at load time. If the visualization needs monthly aggregates, compute them at load time and cache the result. Do not re-aggregate on every chart refresh.

Sample before visualizing. For exploratory work, df.sample(n=100_000) is often enough to see the pattern. Save the full-data visualization for final output.
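A sketch of the exploratory-sampling habit, assuming a synthetic million-row frame; a fixed random_state keeps the preview identical across notebook reruns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({"x": rng.normal(size=1_000_000),
                    "y": rng.normal(size=1_000_000)})

# Fixed seed: every rerun sees the same 100k-point preview.
preview = big.sample(n=100_000, random_state=42)

# The sample tracks the full distribution closely enough for exploration.
print(big["x"].mean(), preview["x"].mean())
```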

Use Plotly Express render_mode="webgl". For scatter plots with more than 10,000 points, always use WebGL. The switch is a single argument and prevents most slowness; the main trade-off is that WebGL traces export as raster rather than vector graphics.

Cache expensive computations. Use @functools.lru_cache or an explicit dict to cache aggregations that will be recomputed. In notebooks, this prevents re-running the same aggregation when you tweak a chart style.
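A minimal sketch with functools.lru_cache; the function name and the frame are invented for illustration. Note that cached arguments must be hashable, so pass column names rather than DataFrames.

```python
import functools

import numpy as np
import pandas as pd

df = pd.DataFrame({"month": np.arange(1_200) % 12,
                   "value": np.arange(1_200, dtype="float64")})

@functools.lru_cache(maxsize=None)
def monthly_totals(column: str) -> pd.Series:
    # The groupby runs once per distinct argument; repeat calls made
    # while tweaking chart styling are served from the cache.
    return df.groupby("month")[column].sum()

first = monthly_totals("value")
second = monthly_totals("value")      # cache hit: same object returned
print(monthly_totals.cache_info())
```

One caveat: the cache keys on the arguments only, so if df itself changes, call monthly_totals.cache_clear() to invalidate stale results.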

Profile before optimizing. Use %time or %%time in Jupyter to find where the slowdown actually is. Often it is not where you think. A chart that takes 30 seconds to render might be spending 28 seconds on data loading and 2 seconds on rendering; optimizing the render loop would be wasted effort.
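Outside Jupyter, time.perf_counter gives the same per-stage breakdown; the timed helper below is a hypothetical convenience, not a library function.

```python
import time

import numpy as np
import pandas as pd

def timed(label, fn):
    # Illustrative stand-in for %time: run fn once, report wall time.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# Time each pipeline stage separately; the slow one is often loading
# or aggregation, not the final plot call.
df = timed("generate", lambda: pd.DataFrame(
    {"x": np.random.default_rng(0).random(1_000_000)}))
agg = timed("aggregate", lambda: df["x"].groupby(
    (df["x"] * 100).astype(int)).mean())
```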

Use rasterized PNG for static output. For a static image of a million-point visualization, an 800×600 PNG at ~500 KB is usually better than a multi-MB SVG or PDF. Vector formats do not help when the data is rasterized anyway.

Use smaller dtypes for very large frames. For datasets in the hundreds of millions of rows, use data types that take less memory (int32 instead of int64, float32 instead of float64). df.astype({"col": "int32"}) can halve the memory footprint.
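The dtype savings can be checked directly on a synthetic frame with memory_usage:

```python
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({"count": np.arange(n, dtype="int64"),
                   "score": np.random.default_rng(0).random(n)})  # float64

before = df.memory_usage(deep=True).sum()
small = df.astype({"count": "int32", "score": "float32"})
after = small.memory_usage(deep=True).sum()
print(f"{before / after:.2f}x smaller")   # roughly 2x for these dtypes
```

The halving holds whenever every column drops to a half-width dtype; check value ranges first, since int32 overflows silently past ~2.1 billion.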

These optimizations add up. A visualization pipeline that takes 60 seconds naively can often be reduced to 5 seconds with a few targeted improvements. For interactive dashboards (Chapters 29-30), this speedup is the difference between a usable tool and a frustrating one.

28.20 When Not to Visualize at All

The chapter has been about strategies for visualizing massive datasets. It is worth acknowledging that sometimes the right answer is not to visualize at all. A table of summary statistics, a machine-learning model's output, or a narrative report may communicate the underlying message better than any chart.

Situations where visualization may not be the right response:

When the pattern is a single number. If the answer to the analytical question is "the mean increased by 15%", a text statement is clearer than a chart. Charts add value when the pattern is spatial, distributional, or relational — not when it is a single scalar.

When the audience does not read charts. Some stakeholders genuinely prefer tables or prose. For these audiences, a well-designed table beats a fancy chart. Respect the audience's preferences rather than insisting on visualization.

When the data is too uncertain. A visualization implies precision. If the underlying data is so noisy or biased that any pattern could be an artifact, a cautious text summary is more honest than a chart that "shows" the pattern.

When the cost of the visualization exceeds the value. For exploratory work in a notebook, quick charts are fine. For a report that will be viewed once by one person, a carefully engineered visualization may not be worth the effort. Match the investment to the usage.

When the ML model is the output. For some modern analyses, the output is a trained model rather than a chart. A fraud-detection model does not need a visualization of its own predictions; it needs a score per transaction. Visualizations at the training and evaluation stage are appropriate (confusion matrices, ROC curves), but the final product may be a model-serving API, not a chart.

The broader principle: visualization is a tool, not an end in itself. Use it when it helps communicate, understand, or explore. Skip it when it does not. Big-data visualization specifically is tempting because it looks impressive, but the test is still whether it helps the reader learn something. A hex bin plot that reveals a pattern is valuable. A hex bin plot that looks cool but conveys nothing new is just decoration.

This honesty about when not to visualize is the counterweight to the rest of this book, which has mostly been about when and how to visualize. Both directions matter. The mature practitioner knows both when to reach for a chart and when to put the chart aside.

A related observation: many big-data visualizations exist because the data exists, not because the reader needs them. A team that has collected a million rows of sensor data feels obligated to show them all, even when summary statistics would serve better. The obligation is self-imposed and worth questioning. "We have the data, so we should visualize it" is not the same as "the reader needs to see the data." The first is about the producer's investment; the second is about the reader's need. When these come apart, the reader should win.

The big-data visualization techniques described in this chapter are tools for when the reader genuinely needs to see patterns in a large dataset — spatial density, temporal trends, multi-class distributions, high-dimensional outliers. They are not tools for showing off the size of your data. The size of your data is interesting to you; the pattern in your data is interesting to the reader. Keep the distinction in mind, and use the tools for the patterns the reader needs to see rather than for the sheer volume of the data that happens to be available. The discipline of matching the visualization to a genuine analytical need is the same discipline we have been developing throughout this entire book, and it applies with full force at big-data scales as well.

28.21 Check Your Understanding

Before continuing to Chapter 29 (Dashboards with Streamlit), make sure you can answer:

  1. What are the symptoms of big-data overplotting, and at what dataset sizes do they appear?
  2. What is alpha blending, and at what sizes does it stop working?
  3. What does hex binning do, and why is it better than a raw scatter for large data?
  4. What is datashader, and what is its scaling advantage?
  5. What is WebGL, and how does Plotly use it for large-data charts?
  6. What are three sampling strategies, and when would you use each?
  7. What does multi-scale visualization mean, and what tools implement it in Python?
  8. Name three big-data visualization pitfalls.

If any of these are unclear, re-read the relevant section. Chapter 29 moves beyond individual charts to the production of full interactive dashboards with Streamlit.

28.22 Chapter Summary

This chapter covered the main strategies for visualizing big data:

  • Alpha blending works for up to ~50,000 points; reveals density through transparency.
  • Hex binning aggregates points into hexagonal cells; scales to several million points.
  • 2D histograms are the rectangular equivalent of hex bins.
  • KDE produces smooth density estimates but is slow for very large data.
  • Datashader rasterizes millions of points into pixel grids; scales to billions with Dask.
  • WebGL rendering in Plotly (scattergl) handles hundreds of thousands of points interactively.
  • Sampling strategies (random, stratified, reservoir, 2D-bin sampling) provide representative subsets when aggregation is inappropriate.
  • Multi-scale visualization via HoloViews + datashader or Plotly + Dash supports zoom-and-filter exploration.
  • Big-data pitfalls include Simpson's paradox in aggregation, color scale saturation, outlier loss, rasterization aliasing, and false density.

The chapter's threshold concept — aggregation is interpretation — argues that every big-data visualization is a summary, and the choice of summary is the most consequential design decision. Understanding this shifts the practitioner's mindset from "display all the data" to "choose an informative summary of the data."

Chapter 28 closes Part VI (Specialized Domains). Part VII (Dashboards and Production) begins with Chapter 29 (Streamlit) and covers the techniques for building full interactive applications with visualization at their core.

28.23 Spaced Review

  • From Chapter 19 (Multi-Variable Exploration): Heatmaps and clustermaps are a form of aggregation. How do they relate to the big-data strategies in this chapter?
  • From Chapter 20 (Plotly Express): The render_mode="webgl" parameter was mentioned briefly in Chapter 20. Why is it essential for big-data interactive charts?
  • From Chapter 14 (Specialized Charts): 2D histograms and hex bins were introduced in Chapter 14. How does the big-data framing change their use?
  • From Chapter 4 (Honest Charts): Aggregation can hide outliers and subgroup patterns. How is this a form of the lie-factor problem from Chapter 4?
  • From Chapter 5 (Choosing the Right Chart): The chapter's lesson is "aggregation is a design decision." How does this fit into Chapter 5's chart-selection framework?

With Chapter 28, Part VI (Specialized Domains) is now complete. Big-data visualization is a real engineering problem, not just an analytical one. The tools in this chapter — alpha, hex bin, datashader, WebGL, sampling — are the vocabulary for solving it. The discipline is knowing which tool fits which dataset size and which analytical question. For most data science work today, you will encounter datasets that break standard tools at least occasionally, and this chapter's techniques are the response. Part VII begins next with Chapter 29 (Streamlit), where individual charts combine into full interactive applications.