Further Reading: Big Data Visualization

Tier 1: Essential Reading

datashader documentation. datashader.org The official datashader documentation. Organized by pipeline stage (canvas → aggregate → shade) with many examples. Essential reference for any big-data visualization in Python.

HoloViews documentation. holoviews.org The companion library that makes datashader interactive. Tutorials cover the datashader integration and the broader HoloViews approach to data visualization.

Bednar, James A. "Datashader: Revealing the Structure of Genuinely Big Data." SciPy 2018 talk. Bednar is the primary developer of datashader. His SciPy talk (available on YouTube) is the best introduction to why datashader exists and how it works. Highly recommended.

Tier 2: Recommended Specialized Sources

Whong, Chris. "Pictures from 1.5 Million Taxi Trips." chriswhong.com, 2014. The original blog post that introduced the NYC taxi dataset to the visualization community. Whong obtained the data through a Freedom of Information Law (FOIL) request, and his visualizations were among the first to show what the dataset contained.

Sloan Digital Sky Survey Data Release documentation. sdss.org The official SDSS documentation. Includes details on the catalog format, the data access tools, and example visualizations. Read alongside Case Study 2.

Shneiderman, Ben. "Extreme visualization: squeezing a billion records into a million pixels." SIGMOD Conference, 2008. Shneiderman's paper on visualizing very large datasets. Covers the theoretical foundations and the practical techniques. Cited in many datashader and big-data visualization papers.

Wickham, Hadley. "Bin-summarise-smooth: A framework for visualising large data." had.co.nz, 2013. Wickham's paper on scalable visualization using a bin → summarize → smooth pipeline. Predates datashader but describes the same general approach. Freely available on Wickham's website.

Perrot, Alexandre, et al. "HeatPipe: High Throughput, Low Latency Big Data Heatmap." VLDB, 2015. A research paper on scalable heatmap construction for streaming big data. Relevant for readers interested in the algorithmic side.
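
The bin → summarize → smooth pipeline mentioned in the Wickham entry above can be illustrated with a small one-dimensional sketch. This is not Wickham's implementation or datashader's; it uses only the Python standard library, and the function names (bin_counts, smooth) and the synthetic Gaussian data are assumptions made for illustration.

```python
import random


def bin_counts(values, lo, hi, n_bins):
    """Bin + summarize: count the values that fall in each equal-width bin."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # Clamp the index so that v == hi lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


def smooth(counts, radius=1):
    """Smooth: centered moving average, with the window truncated at the edges."""
    out = []
    for i in range(len(counts)):
        window = counts[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


# Synthetic data standing in for a large dataset (hypothetical example).
random.seed(0)
points = [random.gauss(0.0, 1.0) for _ in range(100_000)]

counts = bin_counts(points, -4.0, 4.0, 40)   # 100,000 points -> 40 numbers
smoothed = smooth(counts, radius=2)          # reduce bin-to-bin noise
```

The point of the design is that everything after the binning step works on 40 numbers rather than 100,000 points, so the cost of smoothing and drawing depends on the number of bins, not on the size of the data.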

Tier 3: Tools and Online Resources

Resource	URL / Source	Description
datashader GitHub	github.com/holoviz/datashader	The datashader source code and issue tracker.
HoloViews GitHub	github.com/holoviz/holoviews	HoloViews source code and examples.
Dask	dask.org	Parallel computing library. Integrates with datashader for out-of-core processing.
Bokeh	bokeh.org	Interactive web plotting library. Used by HoloViews for rendering.
Plotly scattergl docs	plotly.com/python/webgl-vs-svg/	Plotly's guide to WebGL vs. SVG rendering, with performance benchmarks.
Parquet	parquet.apache.org	The columnar data format that should replace CSV for most big-data work.
NYC TLC Trip Record Data	nyc.gov/site/tlc/about/tlc-trip-record-data.page	The official NYC taxi dataset, continuously updated since 2014.
SDSS data	sdss.org/dr17	SDSS Data Release 17 (DR17), the final data release of SDSS-IV. Freely available; newer releases are listed on sdss.org.
TOPCAT	star.bristol.ac.uk/~mbt/topcat	The astronomical catalog viewer discussed in Case Study 2.
vaex	vaex.io	Another out-of-core Python library for large datasets, with built-in visualization tools.
pyarrow	arrow.apache.org/docs/python/	Python bindings for Apache Arrow, the columnar memory format that powers fast big-data processing.

A note on reading order: If you want one additional source, watch James Bednar's SciPy 2018 talk on datashader. It is the best single introduction to why big-data visualization needs specialized tools and how datashader solves the problem. For more depth, the HoloViews tutorial on interactive datashader is the right next step.