Further Reading: Big Data Visualization
Tier 1: Essential Reading
datashader documentation. datashader.org. The official documentation, organized by pipeline stage (canvas → aggregate → shade) with many examples. Essential reference for any big-data visualization in Python.
HoloViews documentation. holoviews.org. The companion library that makes datashader interactive. Tutorials cover the datashader integration and the broader HoloViews approach to data visualization.
Bednar, James A. "Datashader: Revealing the Structure of Genuinely Big Data." SciPy 2018 talk. Bednar is the primary developer of datashader. His SciPy talk (available on YouTube) is the best introduction to why datashader exists and how it works. Highly recommended.
Tier 2: Recommended Specialized Sources
Whong, Chris. "Pictures from 1.5 Million Taxi Trips." chriswhong.com, 2014. The original blog post that introduced the NYC taxi dataset to the visualization community. Whong obtained the data through a Freedom of Information Law (FOIL) request, and his visualizations were among the first to show what the dataset contained.
Sloan Digital Sky Survey Data Release documentation. sdss.org. The official SDSS documentation. Includes details on the catalog format, the data access tools, and example visualizations. Read alongside Case Study 2.
Shneiderman, Ben. "Extreme visualization: squeezing a billion records into a million pixels." SIGMOD Conference, 2008. Shneiderman's paper on visualizing very large datasets. Covers the theoretical foundations and the practical techniques. Cited in many datashader and big-data visualization papers.
Wickham, Hadley. "Bin-summarise-smooth: A framework for visualising large data." had.co.nz, 2013. Wickham's paper on scalable visualization using a bin → summarize → smooth pipeline. Predates datashader but describes the same general approach. Freely available on Wickham's website.
Perrot, Alexandre, et al. "HeatPipe: High Throughput, Low Latency Big Data Heatmap." VLDB, 2015. A research paper on scalable heatmap construction for streaming big data. Relevant for readers interested in the algorithmic side.
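The bin → summarize → smooth pipeline named in the Wickham entry above is easy to try by hand before reading the paper. The following is a minimal pure-Python sketch, not Wickham's implementation and not datashader's API: the function names (`bin_points`, `summarise`, `smooth`), the 4×4 grid, the count reducer, and the 3×3 mean filter are all illustrative assumptions.

```python
from collections import defaultdict


def bin_points(points, x_range, y_range, nx, ny):
    """Bin: assign each (x, y, value) point to a cell of an nx-by-ny grid."""
    (x0, x1), (y0, y1) = x_range, y_range
    cells = defaultdict(list)
    for x, y, value in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # drop points outside the viewport
        # Clamp so points on the upper edge land in the last cell.
        i = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        j = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        cells[(i, j)].append(value)
    return cells


def summarise(cells, nx, ny, reducer=len):
    """Summarise: reduce each cell's values to one number (count by default)."""
    grid = [[0.0] * nx for _ in range(ny)]
    for (i, j), values in cells.items():
        grid[j][i] = float(reducer(values))
    return grid


def smooth(grid):
    """Smooth: replace each cell with the mean of its 3x3 neighbourhood."""
    ny, nx = len(grid), len(grid[0])
    out = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            neighbours = [
                grid[jj][ii]
                for jj in range(max(j - 1, 0), min(j + 2, ny))
                for ii in range(max(i - 1, 0), min(i + 2, nx))
            ]
            out[j][i] = sum(neighbours) / len(neighbours)
    return out


# Hypothetical toy data: four points in the unit square.
points = [(0.1, 0.1, 1.0), (0.15, 0.2, 2.0), (0.9, 0.9, 3.0), (1.0, 1.0, 4.0)]
cells = bin_points(points, (0.0, 1.0), (0.0, 1.0), nx=4, ny=4)
counts = summarise(cells, nx=4, ny=4)
smoothed = smooth(counts)
```

The cost of the summarise and smooth steps depends on the grid size rather than the number of points, which is why the approach scales: a billion points still reduce to a fixed number of cells. Passing a different `reducer` (for example `sum` or `max`) changes the per-cell statistic without touching the other stages.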
Tier 3: Tools and Online Resources
| Resource | URL / Source | Description |
|---|---|---|
| datashader GitHub | github.com/holoviz/datashader | The datashader source code and issue tracker. |
| HoloViews GitHub | github.com/holoviz/holoviews | HoloViews source code and examples. |
| Dask | dask.org | Parallel computing library. Integrates with datashader for out-of-core processing. |
| Bokeh | bokeh.org | Interactive web plotting library. Used by HoloViews for rendering. |
| Plotly scattergl docs | plotly.com/python/webgl-vs-svg/ | Plotly's guide to WebGL vs. SVG rendering, with performance benchmarks. |
| Parquet | parquet.apache.org | The columnar data format that should replace CSV for most big-data work. |
| NYC TLC Trip Record Data | nyc.gov/site/tlc/about/tlc-trip-record-data.page | The official NYC taxi dataset, continuously updated since 2014. |
| SDSS data | sdss.org/dr17 | SDSS Data Release 17, the final release of the SDSS-IV phase; check sdss.org for later releases. Freely available. |
| TOPCAT | star.bristol.ac.uk/~mbt/topcat | The astronomical catalog viewer discussed in Case Study 2. |
| vaex | vaex.io | Another out-of-core Python library for large datasets, with built-in visualization tools. |
| pyarrow | arrow.apache.org/docs/python/ | Python bindings for Apache Arrow, the columnar memory format that powers fast big-data processing. |
A note on reading order: If you want one additional source, watch James Bednar's SciPy 2018 talk on datashader. It is the best single introduction to why big-data visualization needs specialized tools and how datashader solves the problem. For more depth, the HoloViews tutorial on interactive datashader is the right next step.