Further Reading: Chapter 28

Working with Large Datasets


Dask

1. Dask Documentation --- docs.dask.org The official Dask documentation is well-organized and practical. The "Best Practices" page is essential reading: it covers when to use Dask (and when not to), partition sizing, memory management, and common anti-patterns. The "DataFrame" section provides a thorough comparison of Dask's API with pandas, including a table of supported and unsupported operations. The "Delayed" section explains custom parallelism for functions that do not fit the DataFrame API.
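As a quick illustration of the delayed model that the "Delayed" section documents, here is a minimal sketch (assuming Dask is installed; the function names `load` and `total` are invented for the example). Wrapped calls build a task graph instead of executing; `.compute()` runs the graph in parallel.

```python
import dask

# dask.delayed wraps ordinary functions so that calling them records a
# task in a graph instead of executing immediately.
@dask.delayed
def load(part):
    return list(range(part * 3, part * 3 + 3))

@dask.delayed
def total(chunks):
    return sum(sum(c) for c in chunks)

chunks = [load(i) for i in range(4)]  # no work done yet
result = total(chunks)                # still lazy: a task graph
print(result.compute())               # executes the graph -> 66
```

The point of the pattern is that independent `load` calls can run in parallel on whatever scheduler you configure, without changing the function bodies.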

2. "Scalable Machine Learning with Dask" --- Hussain Sultan (2023, Packt) A hands-on book focused on using Dask for ML pipelines. Covers distributed DataFrame operations, Dask-ML for distributed model training, and integration with scikit-learn, XGBoost, and PyTorch. Chapters 3-5 (Dask DataFrames, delayed computation, and the distributed scheduler) are the most relevant to this chapter's material. The production deployment chapters cover Dask on Kubernetes and cloud environments.

3. "Dask Tutorial" --- tutorial.dask.org A Jupyter notebook-based tutorial maintained by the Dask core team. Works through progressively complex examples: delayed computation, DataFrames, arrays, and distributed scheduling. Takes about 3 hours to complete and is the fastest way to build fluency with Dask's task graph model. Available as a GitHub repository that you can clone and run locally.


Polars

4. Polars User Guide --- docs.pola.rs The official Polars documentation is the definitive reference. The "Getting Started" section covers the expression system, lazy vs. eager evaluation, and the key differences from pandas. The "Expressions" section is a masterclass in Polars' query language: filtering, groupby, window functions, string operations, and temporal operations, all with examples. The "Lazy API" section explains the query optimizer's predicate pushdown, projection pushdown, and common subexpression elimination.

5. "Polars Cookbook" --- Regan Carey (2024, Packt) A recipe-based guide organized by task: data loading, cleaning, transformation, aggregation, joining, and visualization. Each recipe includes both the Polars and pandas equivalent, making it an efficient translation guide for pandas users. The chapter on performance benchmarking provides reproducible comparisons on datasets of varying sizes.

6. "Why Polars Is So Fast" --- Ritchie Vink (Polars blog) A blog post by the creator of Polars explaining the architectural decisions that drive its performance: the Rust execution engine, Apache Arrow memory format, lazy evaluation with query optimization, and multithreaded execution. Technical but accessible. Understanding why Polars is fast helps you write code that takes advantage of its strengths rather than fighting its design.


Apache Arrow

7. Apache Arrow Documentation --- arrow.apache.org Arrow is the columnar memory specification that underpins Polars, DuckDB, and increasingly pandas. The documentation covers the columnar format, the IPC (Feather) file format, and the compute kernel API. The "Format" section is the specification itself --- dense but essential for understanding why Arrow enables zero-copy data sharing. The Python bindings documentation (pyarrow) covers practical usage.

8. "Apache Arrow: A Multi-Language Toolbox for Accelerated Data Interchange" --- Wes McKinney and the Arrow Authors (2024) A white paper by the Arrow project's co-creator explaining the rationale, design, and ecosystem. The key insight: by standardizing the in-memory format, Arrow eliminates the serialization/deserialization overhead that occurs when data moves between tools (e.g., from pandas to Spark, or from R to Python). This paper explains the "why" that the documentation does not always make explicit.


SQL Optimization

9. SQL Performance Explained --- Markus Winand (2012, self-published) The single best book on SQL indexing and query optimization, despite its age. Winand focuses on how databases actually use indexes --- B-trees, composite indexes, index-only scans, partial indexes --- with visual diagrams that make the mechanics clear. The companion website use-the-index-luke.com is freely available and covers the same material. If you read one resource on SQL optimization, read this one.
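Winand's central point, that an index changes the access path from a full scan to a targeted search, can be seen from Python using SQLite's stdlib driver as a stand-in (EXPLAIN output formats differ per database; the table and index names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"cust{i % 100}", i * 1.5) for i in range(1000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM orders WHERE customer = 'cust7'"
print(plan(q))  # full table SCAN: no usable index yet

conn.execute("CREATE INDEX idx_customer ON orders (customer)")
print(plan(q))  # now a SEARCH ... USING INDEX idx_customer
```

The same before/after comparison, run against your real database's EXPLAIN, is the habit both of Winand's resources try to instill.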

10. High Performance MySQL --- Silvia Botros and Jeremy Tinley (4th edition, 2021) The standard reference for MySQL performance tuning. Chapters 5-7 (indexing, query optimization, and server tuning) are relevant regardless of which database you use, because the principles of index selection, query plan analysis, and partition pruning are universal. Chapter 5 on indexing strategies is among the most practical treatments available.

11. "Use The Index, Luke" --- use-the-index-luke.com --- Markus Winand A free online tutorial on SQL indexing, organized by topic: WHERE clause optimization, JOINs, ORDER BY, partial results, and INSERT performance. Each page includes example queries, execution plans, and index recommendations for PostgreSQL, MySQL, Oracle, and SQL Server. The "Myth Directory" section debunks common misconceptions (e.g., "indexes slow down writes too much to be worth it").

12. PostgreSQL Documentation: Performance Tips --- postgresql.org/docs/current/performance-tips.html PostgreSQL's own guide to EXPLAIN output, statistics, and query planning. The "Using EXPLAIN" page teaches you to read query plans, identify sequential scans vs. index scans, and understand the cost model. The "Controlling the Planner" page explains when and why the planner chooses suboptimal plans and how to guide it.


File Formats

13. Parquet Format Specification --- parquet.apache.org The technical specification for the Parquet file format: row groups, column chunks, page encoding, dictionary encoding, and statistics-based predicate pushdown. Understanding this specification explains why reading 3 of 30 columns from a Parquet file can be an order of magnitude faster than reading the full file: each column is stored in a separate column chunk, and the reader can seek directly to the relevant chunks.

14. "Feather V2 and the Apache Arrow IPC Format" --- arrow.apache.org/docs/format/Columnar.html Feather V2 is the Arrow IPC (Inter-Process Communication) file format. It is the fastest format for read/write within a single-machine session because it stores data in Arrow's native memory layout with minimal transformation. This page explains the wire format, compression options, and the distinction between IPC streaming format (for pipes) and IPC file format (Feather, for random access).


Memory Management and Profiling

15. tracemalloc Documentation --- docs.python.org/3/library/tracemalloc.html Python's built-in memory tracing module. Essential for understanding where memory is being consumed in a data pipeline. The documentation covers snapshot comparison, top memory consumers, and filtering by file/module. Use tracemalloc.get_traced_memory() to track peak memory during DataFrame operations.
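The two tracemalloc idioms the documentation covers, peak tracking and snapshot comparison, fit in a few lines (the allocation is a placeholder for a real pipeline step):

```python
import tracemalloc

tracemalloc.start()
data = [list(range(1000)) for _ in range(100)]    # stand-in workload

current, peak = tracemalloc.get_traced_memory()   # bytes since start()
print(f"current={current}, peak={peak}")

# Snapshot comparison: which lines allocated the most between two points?
snap1 = tracemalloc.take_snapshot()
data.extend(list(range(1000)) for _ in range(100))
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, "lineno")[:3]:
    print(stat)                                   # top growth by source line
tracemalloc.stop()
```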

16. memory-profiler --- pypi.org/project/memory-profiler A line-by-line memory profiling tool for Python. Decorate a function with @profile and run with mprof run script.py to generate memory usage plots over time. This is the tool to use when you need to understand when during execution memory spikes occur --- for example, to identify which step in a pipeline is causing memory pressure.


Scaling Beyond This Chapter

17. Learning Spark --- Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee (2nd edition, 2020) The standard introduction to Apache Spark, covering Spark SQL, DataFrames, Structured Streaming, and MLlib. Chapters 3-5 (Spark SQL, DataFrames, and Spark Catalyst optimizer) are the natural next step if your data exceeds the 100 GB threshold discussed in this chapter. Spark operates on the same principles --- lazy evaluation, columnar processing, partition-based parallelism --- but on distributed clusters.

18. DuckDB Documentation --- duckdb.org DuckDB is an embedded analytical database (like SQLite, but columnar and vectorized) that runs inside your Python process. It can query Parquet files directly, execute SQL on pandas DataFrames, and often matches or beats Polars on analytical queries. DuckDB is increasingly the tool of choice for SQL-heavy analytics on local files in the 1-100 GB range. The Python API documentation is comprehensive.

19. "Practical Guide to Large-Scale Data Processing" --- Various Authors (Towards Data Science / Medium) A collection of practitioner articles covering real-world experiences with data scaling. Search for "large dataset pandas" or "pandas memory optimization" on Towards Data Science for dozens of practical walkthroughs. Quality varies, but the best articles include reproducible benchmarks and honest assessments of tradeoffs.


Benchmarking and Comparisons

20. "Database-like ops benchmark" --- duckdblabs.github.io/db-benchmark A regularly updated benchmark comparing pandas, Polars, Dask, DuckDB, data.table (R), and Spark on common DataFrame operations: groupby, join, and sort. Datasets range from 500 MB to 50 GB. The methodology is transparent and reproducible. This is the most credible source for "which tool is fastest for X" comparisons, and the results confirm the general guidance in this chapter: Polars and DuckDB lead on single-machine performance, Dask and Spark lead on distributed scale.

21. "Is Polars the New pandas?" --- Al Sweigart (2024, Real Python) A tutorial-style comparison of pandas and Polars for common data science tasks. Includes side-by-side code examples, performance benchmarks on realistic datasets, and practical guidance on when to switch. The tone is balanced (not "pandas is dead"), which matches this chapter's perspective: pandas is not going away, but knowing Polars gives you a significant performance option.


How to Use This List

If you need to get productive with Dask quickly, start with the Dask Tutorial (item 3) and the best practices page in the documentation (item 1). The tutorial takes 3 hours and builds genuine fluency.

If you need to get productive with Polars quickly, start with the User Guide (item 4) and Ritchie Vink's blog post (item 6). Understanding the expression system is the key hurdle; once past it, Polars is a joy to use.

If your bottleneck is SQL performance, read Winand (item 9 or 11) before anything else. His explanation of how indexes work will save you more time than any Python-side optimization.

If you are choosing between tools, consult the DuckDB Labs benchmark (item 20) for hard numbers, then match the tool to your specific constraints: data size, infrastructure, ecosystem needs, and team expertise.

If your data is growing beyond 100 GB and you are considering Spark, start with Damji et al. (item 17). But first, ask whether SQL push-down and materialized views can reduce your data to a size that Polars or Dask can handle on a single machine. Spark adds significant infrastructure complexity, and the payoff is only worth it at true distributed scale.


This reading list supports Chapter 28: Working with Large Datasets. Return to the chapter to review concepts before diving in.