Quiz: Big Data Visualization
Part I: Multiple Choice (10 questions)
Q1. What is "overplotting" in visualization?
A) Using too many colors B) Points stacking on top of each other in a scatter, obscuring individual points C) Drawing a chart that is too large for the page D) Applying multiple transforms to the same data
Answer
**B.** Overplotting occurs when too many points are drawn in a small area and individual points become invisible. Alpha blending, hex binning, and aggregation are all responses to overplotting.
Q2. Which technique works up to roughly 50,000 points and is the simplest response to overplotting?
A) Datashader B) Alpha blending C) Sampling D) KDE
Answer
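A minimal demonstration of the technique in question (a sketch assuming matplotlib and NumPy; the data and file name are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 50_000))  # 50,000 random points

fig, ax = plt.subplots(figsize=(6, 6))
# alpha=0.05: roughly 20 points must overlap before a pixel reaches
# full opacity, so density becomes visible as darkness
ax.scatter(x, y, s=4, alpha=0.05, edgecolors="none")
fig.savefig("alpha_scatter.png", dpi=150)
```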
**B.** Alpha blending (setting a low alpha value such as 0.05) makes overlapping points sum visually to reveal density. It is simple to implement but limited to moderately sized datasets.
Q3. Which library rasterizes massive datasets into pixel grids?
A) plotly B) datashader C) seaborn D) matplotlib
Answer
**B.** Datashader, developed by Anaconda, rasterizes data into a fixed pixel grid. It scales to millions and billions of points and integrates with Dask for out-of-core processing.
Q4. Plotly's WebGL-accelerated scatter trace is:
A) go.Scatter3d
B) go.Scattergl
C) go.Webscatter
D) go.Fastscatter
Answer
**B.** `go.Scattergl` (or `render_mode="webgl"` in Plotly Express) uses WebGL for GPU-accelerated rendering of large scatter plots, handling hundreds of thousands of points smoothly.
Q5. What is the chapter's threshold concept?
A) More data is always better B) Aggregation is interpretation C) Alpha blending is the only big-data strategy D) Rasterization is always worse than vectors
Answer
**B.** Every big-data visualization is a summary produced by aggregation, sampling, or rasterization. The choice of aggregation is a design decision that affects what the reader sees.
Q6. Which hexbin parameter controls the resolution of the aggregation grid?
A) size
B) gridsize
C) bins
D) resolution
Answer
**B.** `ax.hexbin(x, y, gridsize=50)` uses a grid of 50 hexagons across the x range. Higher values give finer detail.
Q7. Which shading mode in datashader equalizes the histogram to reveal detail in both sparse and dense regions?
A) how="linear"
B) how="log"
C) how="eq_hist"
D) how="square"
Answer
**C.** `how="eq_hist"` applies histogram equalization, which is particularly useful for heavy-tailed distributions where linear scales hide the sparse regions.
Q8. What does stratified sampling preserve that random sampling may not?
A) Temporal order B) Proportions of categories C) Color information D) Pixel density
Answer
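A sketch with pandas (`DataFrame.groupby(...).sample` is available in pandas 1.1+; the column names and proportions are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "category": rng.choice(["common", "rare"], size=100_000, p=[0.99, 0.01]),
    "value": rng.standard_normal(100_000),
})

# 1% sample drawn within each category, so "rare" keeps its ~1% share;
# a plain df.sample(frac=0.01) could miss "rare" rows entirely
strat = df.groupby("category").sample(frac=0.01, random_state=0)
```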
**B.** Stratified sampling samples proportionally within each category, preserving the balance of categories. Random sampling can miss rare categories entirely.
Q9. For out-of-core processing of datasets larger than memory, datashader integrates with:
A) Pandas B) Dask C) NumPy D) Scikit-learn
Answer
**B.** Dask provides chunked, parallel processing of large datasets. Datashader can aggregate Dask DataFrames without loading them fully into memory.
Q10. Which library combines datashader with interactive Bokeh-rendered plots that re-aggregate on zoom?
A) matplotlib B) seaborn C) holoviews D) altair
Answer
**C.** HoloViews + datashader + Bokeh is the standard combination for interactive big-data exploration. `datashade(points)` wraps a HoloViews Points object to re-aggregate at each zoom level.
Part II: Short Answer (10 questions)
Q11. Describe the four stages of a datashader pipeline.
Answer
(1) **Canvas**: define the output pixel grid. (2) **Aggregation**: project data onto the canvas and compute counts/sums/means per pixel. (3) **Shading**: convert the aggregation to a colored image. (4) **Display**: show or save the image.
Q12. Write matplotlib code to create a hex bin plot of 1 million random points.
Answer
```python
import matplotlib.pyplot as plt
import numpy as np

# 1 million standard-normal points
x = np.random.randn(1_000_000)
y = np.random.randn(1_000_000)

fig, ax = plt.subplots(figsize=(6, 6))
# hexagonal binning: each hexagon is colored by the number of points it contains
hb = ax.hexbin(x, y, gridsize=80, cmap="viridis")
fig.colorbar(hb, ax=ax, label="Count")
```
Q13. What does render_mode="webgl" do in Plotly Express?
Answer
It tells Plotly Express to use Scattergl (WebGL-rendered) instead of Scatter (SVG-rendered). WebGL uses the GPU and handles far more points smoothly than SVG, at the cost of slightly less polished anti-aliasing. It is essential for scatter plots with more than ~50,000 points.
Q14. Explain why KDE is slower than hex binning for very large datasets.
Answer
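The O(N) character of binning can be seen with NumPy's rectangular analogue of hexbin (a sketch; `np.histogram2d` does the same constant-work-per-point aggregation with square bins):

```python
import numpy as np

rng = np.random.default_rng(5)
x, y = rng.standard_normal((2, 1_000_000))

# one pass over the data: each point lands in exactly one bin,
# so the cost is O(N) regardless of grid resolution
counts, xedges, yedges = np.histogram2d(x, y, bins=100)
```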
KDE (kernel density estimation) computes a smoothed density at every evaluation point by summing contributions from all data points. The cost is O(N × M), where N is the data size and M is the number of evaluation points. Hex binning is O(N): each point is placed in one bin with constant work. For large N, the KDE cost dominates; hex binning stays fast.
Q15. What is reservoir sampling?
Answer
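A sketch of the classic Algorithm R in pure Python (the function name is illustrative):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniform random sample of size k from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=10, seed=42)
```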
A streaming algorithm that produces a uniformly random sample of fixed size from a stream of unknown length. You process each element once and probabilistically replace an element in the reservoir. It is useful when the full dataset does not fit in memory and cannot be loaded all at once.
Q16. What are three big-data visualization pitfalls to avoid?
Answer
Any three of: (1) **Assuming aggregation is neutral**: the choice of aggregation affects what is visible. (2) **Simpson's paradox in aggregation**: subgroup patterns can reverse. (3) **Color scale saturation**: linear scales hide sparse regions in heavy-tailed data; use log or eq_hist. (4) **Sampling bias**: random sampling loses outliers. (5) **Aliasing in rasterization**: coarse grids create artifacts.
Q17. When should you use Parquet instead of CSV for big-data visualization workflows?
Answer
Always, if possible. Parquet is 5–10x faster to read, 3–5x smaller on disk, and preserves data types. For datasets that will be loaded repeatedly in notebook sessions, converting CSV to Parquet once pays for itself quickly. CSV is only appropriate for human-readable data exchange.
Q18. Describe the multi-scale visualization pattern and which Python tools implement it.
Answer
Multi-scale visualization provides an overview of the full dataset and allows the reader to zoom in for detail. At each zoom level, the visualization re-aggregates at the new resolution. **Tools**: HoloViews + datashader + Bokeh (the canonical combination); Plotly + Dash with callback-based re-aggregation; custom implementations listening for zoom events. The key is that the overview uses one technique (datashader) and detail uses another (scattergl or plain scatter), with a seamless transition between them.
Q19. Explain the trade-off between vector formats (SVG, PDF) and raster formats (PNG, TIFF) for big-data visualization output.
Answer
**Vector** formats scale losslessly but become huge when rendering millions of points, because each point is a separate DOM/XML element. Beyond roughly 100k points, vector files become unwieldy. **Raster** formats have fixed pixel dimensions regardless of input size; file size depends on resolution, not data size. For big-data visualization, raster (PNG at 300 DPI) is almost always the better choice. Datashader's output is always raster for this reason.
Q20. The chapter says "big data is not about showing more; it is about showing the right summary." Explain what this means.
Answer
The instinct with big data is to try to display every point. But at large scales, displaying everything is impossible (there are more points than pixels), and even if it were possible, a human cannot process millions of individual facts. The actual goal is to convey the pattern (the density, the trends, the outliers), and that requires choosing an appropriate aggregation. Big-data visualization is fundamentally about choosing the right summary, not about rendering the raw data. The size of the data is not the interesting thing; the pattern in the data is.
Scoring Rubric
| Score | Level | Meaning |
|---|---|---|
| 18–20 | Mastery | You understand the main big-data strategies and can choose appropriately. |
| 14–17 | Proficient | You know the basics; review datashader and multi-scale patterns. |
| 10–13 | Developing | You grasp the concepts; re-read Sections 28.2–28.6 and work all Part B exercises. |
| < 10 | Review | Re-read the full chapter. |
With this quiz, Part VI (Specialized Domains) is now complete. Chapter 29 begins Part VII (Dashboards and Production) with Streamlit.