Further Reading: Big Data Visualization


Tier 1: Essential Reading

datashader documentation. datashader.org The official datashader documentation. Organized by pipeline stage (canvas → aggregate → shade) with many examples. Essential reference for any big-data visualization in Python.

HoloViews documentation. holoviews.org The companion library that makes datashader interactive. Tutorials cover the datashader integration and the broader HoloViews approach to data visualization.

Bednar, James A. "Datashader: Revealing the Structure of Genuinely Big Data." SciPy 2018 talk. Bednar is the primary developer of datashader. His SciPy talk (available on YouTube) is the best introduction to why datashader exists and how it works. Highly recommended.
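
The canvas → aggregate → shade pipeline named in the datashader entry above can be illustrated without datashader itself. The following is a minimal sketch using only the Python standard library, not datashader's API: the function names `aggregate` and `shade`, the character ramp, and the log scaling are all assumptions chosen for illustration. It counts points per pixel on a fixed-size canvas, then maps the counts to a small set of display levels.

```python
import math
import random


def aggregate(points, x_range, y_range, width, height):
    """Count points per pixel on a width x height canvas (the 'aggregate' stage)."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # point falls outside the canvas
        # Scale to pixel indices; clamp so the upper edge lands in the last pixel.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade(grid, levels=" .:-=+*#%@"):
    """Map counts to characters on a log scale (the 'shade' stage)."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [levels[0] * len(row) for row in grid]
    scale = (len(levels) - 1) / math.log1p(peak)
    lines = []
    for row in reversed(grid):  # draw the largest y values at the top
        chars = []
        for count in row:
            if count == 0:
                chars.append(levels[0])
            else:
                # Any non-empty pixel gets at least the faintest visible level.
                chars.append(levels[max(1, round(math.log1p(count) * scale))])
        lines.append("".join(chars))
    return lines


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical data: 100,000 points drawn from a 2D Gaussian.
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]
    for line in shade(aggregate(pts, (-3, 3), (-3, 3), width=60, height=24)):
        print(line)
```

The point of the sketch is that rendering cost depends on the canvas size rather than the number of points: the per-point work is a single counter increment, and everything after aggregation operates on a grid of `width * height` cells.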


Tier 2: Recommended Reading

Whong, Chris. "Pictures from 1.5 Million Taxi Trips." chriswhong.com, 2014. The original blog post that introduced the NYC taxi dataset to the visualization community. Whong obtained the data through a Freedom of Information Law (FOIL) request, and his visualizations were among the first to show what the dataset contained.

Sloan Digital Sky Survey Data Release documentation. sdss.org The official SDSS documentation. Includes details on the catalog format, the data access tools, and example visualizations. Read alongside Case Study 2.

Shneiderman, Ben. "Extreme visualization: squeezing a billion records into a million pixels." SIGMOD Conference, 2008. Shneiderman's paper on visualizing very large datasets. Covers the theoretical foundations and the practical techniques. Cited in many datashader and big-data visualization papers.

Wickham, Hadley. "Bin-summarise-smooth: A framework for visualising large data." had.co.nz, 2013. Wickham's paper on scalable visualization using a bin → summarize → smooth pipeline. Predates datashader but describes the same general approach. Freely available on Wickham's website.

Perrot, Alexandre, et al. "HeatPipe: High Throughput, Low Latency Big Data Heatmap." VLDB, 2015. A research paper on scalable heatmap construction for streaming big data. Relevant for readers interested in the algorithmic side.
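
The bin → summarize → smooth pipeline described in the Wickham entry above can be sketched in a few lines. This is a simplified one-dimensional illustration using only the Python standard library, not Wickham's implementation: the function names, the choice of the mean as the summary, and the count-weighted moving average used for smoothing are assumptions made for this example.

```python
def bin_summarise(xs, ys, origin, width, n_bins):
    """Bin x into fixed-width bins and summarise y by its mean in each bin.

    Returns (means, counts); a bin with no observations has mean None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        i = int((x - origin) // width)
        if 0 <= i < n_bins:  # ignore observations outside the binned range
            sums[i] += y
            counts[i] += 1
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return means, counts


def smooth(means, counts, radius=1):
    """Smooth binned means with a count-weighted moving average.

    Each output value averages the bins within `radius` of it, weighting
    every bin by how many observations it holds.
    """
    out = []
    for i in range(len(means)):
        lo, hi = max(0, i - radius), min(len(means), i + radius + 1)
        weight = sum(counts[lo:hi])
        total = sum(means[j] * counts[j] for j in range(lo, hi) if counts[j])
        out.append(total / weight if weight else None)
    return out


if __name__ == "__main__":
    # Hypothetical data: four observations falling into three unit-width bins.
    xs = [0.5, 1.5, 1.5, 2.5]
    ys = [1.0, 2.0, 4.0, 6.0]
    means, counts = bin_summarise(xs, ys, origin=0.0, width=1.0, n_bins=3)
    print(means, counts)
    print(smooth(means, counts))
```

Smoothing operates on the binned summaries rather than the raw observations, so its cost scales with the number of bins and not with the size of the dataset.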


Tier 3: Tools and Online Resources

| Resource | URL / Source | Description |
| --- | --- | --- |
| datashader GitHub | github.com/holoviz/datashader | The datashader source code and issue tracker. |
| HoloViews GitHub | github.com/holoviz/holoviews | HoloViews source code and examples. |
| Dask | dask.org | Parallel computing library. Integrates with datashader for out-of-core processing. |
| Bokeh | bokeh.org | Interactive web plotting library. Used by HoloViews for rendering. |
| Plotly scattergl docs | plotly.com/python/webgl-vs-svg/ | Plotly's guide to WebGL vs. SVG rendering, with performance benchmarks. |
| Parquet | parquet.apache.org | The columnar data format that should replace CSV for most big-data work. |
| NYC TLC Trip Record Data | nyc.gov/site/tlc/about/tlc-trip-record-data.page | The official NYC taxi dataset, continuously updated since 2014. |
| SDSS data | sdss.org/dr17 | The SDSS Data Release 17, current as of 2024. Freely available. |
| TOPCAT | star.bristol.ac.uk/~mbt/topcat | The astronomical catalog viewer discussed in Case Study 2. |
| vaex | vaex.io | Another out-of-core Python library for large datasets, with built-in visualization tools. |
| pyarrow | arrow.apache.org/docs/python/ | Python bindings for Apache Arrow, the columnar memory format that powers fast big-data processing. |

A note on reading order: If you want one additional source, watch James Bednar's SciPy 2018 talk on datashader. It is the best single introduction to why big-data visualization needs specialized tools and how datashader solves the problem. For more depth, the HoloViews tutorial on interactive datashader is the right next step.