Key Takeaways: Chapter 28

Working with Large Datasets


  1. The first thing to try when pandas is slow is not a new tool --- it is dtype optimization. Converting low-cardinality string columns to category, downcasting integers from int64 to int16 or int32, and using float32 instead of float64 (where reduced precision is acceptable) can cut memory usage by 50-80% with little or no practical loss of information. For datasets in the 1-5 GB range, dtype optimization alone can be the difference between a MemoryError and a smooth analysis. Always apply this step before reaching for Dask or Polars.
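A minimal pandas sketch of this step (the column names and sizes below are illustrative, not from the chapter):

```python
import numpy as np
import pandas as pd

# A frame with deliberately wasteful default dtypes.
df = pd.DataFrame({
    "city": ["NYC", "LA", "SF"] * 30_000,        # low-cardinality strings
    "count": np.arange(90_000, dtype=np.int64),  # values fit in int32
    "price": np.random.rand(90_000),             # float64 by default
})
before = df.memory_usage(deep=True).sum()

df["city"] = df["city"].astype("category")       # strings -> codes + small lookup table
df["count"] = pd.to_numeric(df["count"], downcast="integer")  # int64 -> int32 here
df["price"] = df["price"].astype("float32")      # half the bytes, reduced precision

after = df.memory_usage(deep=True).sum()
print(f"{before:,} -> {after:,} bytes ({1 - after / before:.0%} saved)")
```

The category conversion dominates the savings here because each repeated string is replaced by a small integer code.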

  2. Dask extends pandas to datasets larger than RAM by partitioning data and evaluating lazily. A Dask DataFrame is a collection of pandas DataFrames (partitions), and operations are recorded in a task graph rather than executed immediately. Computation happens only when you call .compute(). This means Dask never needs the full dataset in memory at once --- it processes partitions sequentially or in parallel, combines partial results, and discards intermediate data. The cost is that not all pandas operations translate directly (no positional row indexing with .iloc, expensive sorts, no in-place mutation).

  3. Polars is the speed champion for single-machine analytics. Built in Rust on Apache Arrow, Polars is typically 10-50x faster than pandas for groupby, filter, and join operations. Its expression system (pl.col(), pl.when(), chained methods) differs from pandas but enables a query optimizer that pushes predicates down, projects only needed columns, and eliminates redundant computation. Use Polars when data fits on one machine but pandas is too slow, or when you need maximum throughput in a production pipeline.

  4. Lazy evaluation is not just a performance trick --- it changes how you think about data pipelines. In eager evaluation (pandas), every line of code produces a materialized DataFrame. In lazy evaluation (Dask, Polars LazyFrame), you build a computation plan that is optimized and executed as a whole. The practical implication: chain all your operations and compute once at the end. Every intermediate .compute() or .collect() breaks the optimization chain and may force unnecessary materialization of large intermediate results.

  5. The fastest way to handle large data in Python is to not handle it in Python. If your data lives in a database, push filters, aggregations, and joins to SQL. A properly indexed SQL query can reduce 100 million rows to 1,000 aggregated rows in milliseconds. The anti-pattern is SELECT * FROM large_table followed by pandas filtering and groupby. The correct pattern is a targeted SQL query that returns only the rows and columns you actually need. Indexes, partitioning, and materialized views are the tools that make this work at scale.
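A sketch of the push-down pattern using an in-memory SQLite database as a stand-in for a real server (table and column names are invented for the example):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL, day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i % 10, float(i), "2024-01-01") for i in range(1000)],
)

# Anti-pattern: pd.read_sql_query("SELECT * FROM events", conn), then filter
# and group in pandas. Correct pattern: let the database do the reduction.
df = pd.read_sql_query(
    """
    SELECT user_id, SUM(amount) AS total
    FROM events
    WHERE day = '2024-01-01'
    GROUP BY user_id
    """,
    conn,
)
print(len(df), "rows")   # 10 aggregated rows come back, not 1,000 raw ones
```

Only the aggregated result crosses the database boundary; pandas never sees the raw rows.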

  6. SQL optimization hinges on three techniques: indexing, partitioning, and query rewriting. An index on a filtered column turns a full table scan into a seek (seconds to milliseconds). Table partitioning by date or key ensures queries only read relevant data segments. Query rewriting --- avoiding functions on indexed columns, using range predicates instead of YEAR(), limiting exploratory queries --- enables the database optimizer to choose efficient execution plans. Always use EXPLAIN ANALYZE to verify that your query is using indexes as intended.
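A sketch of the sargability point using SQLite (whose plan inspector is EXPLAIN QUERY PLAN; EXPLAIN ANALYZE is the PostgreSQL spelling). The table and index names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, order_date TEXT, total REAL)")
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")

# A function on the indexed column defeats the index; a range predicate uses it.
function_q = "SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'"
range_q = ("SELECT * FROM orders "
           "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")

plans = {}
for name, q in (("function", function_q), ("range", range_q)):
    rows = conn.execute("EXPLAIN QUERY PLAN " + q).fetchall()
    plans[name] = " ".join(r[3] for r in rows)   # last field is the plan text
    print(name, "->", plans[name])
```

The function form produces a full SCAN of the table; the range form is a SEARCH using the index, which is exactly the seek-versus-scan difference the chapter describes.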

  7. File format matters more than most practitioners realize. Parquet is 3-10x smaller than CSV and 5-15x faster to read. It is columnar (reading 3 of 30 columns reads 10% of the data), compressed, and preserves dtypes. Feather (Arrow IPC) is even faster for read/write in a single session. CSV should be a last resort for large datasets. Switching from CSV to Parquet is often the single largest performance improvement in a data pipeline, requiring no code changes beyond the read/write calls.

  8. Apache Arrow is the emerging standard for in-memory columnar data. Arrow provides a common memory format that Polars, pandas (2.0+), DuckDB, and Spark can share without copying. Arrow-backed string dtypes in pandas reduce memory usage by 50-80% compared to the default object dtype. As the ecosystem converges on Arrow, the boundaries between tools become thinner: data can flow from SQL to Arrow to Polars to pandas to scikit-learn with minimal conversion overhead.

  9. The tool selection decision follows a clear framework: data size, operation type, and ecosystem needs. Less than 1 GB: pandas. 1-10 GB on one machine with speed needs: Polars. 1-100 GB or out-of-core: Dask. More than 100 GB or distributed: Spark. Database-resident data with SQL-expressible operations: push to SQL. And for the modeling step (scikit-learn, statsmodels, XGBoost): always convert to pandas at the boundary, because the ecosystem integration is unmatched.
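The framework can be encoded as a first-cut default (this helper is purely illustrative, not an API from the chapter; the thresholds are the ones listed above):

```python
def pick_tool(size_gb: float, distributed: bool = False,
              in_database: bool = False) -> str:
    """Illustrative encoding of the chapter's tool-selection framework."""
    if in_database:
        return "SQL push-down"   # reduce before data leaves the database
    if distributed or size_gb > 100:
        return "Spark"
    if size_gb > 10:
        return "Dask"            # out-of-core on one machine
    if size_gb > 1:
        return "Polars"          # fits in memory, needs speed
    return "pandas"              # small data, richest ecosystem

print(pick_tool(0.5), pick_tool(5), pick_tool(50), pick_tool(500))
```

Real decisions also weigh operation type and ecosystem needs, so treat this as a starting point, not a rule.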

  10. Develop on a sample, validate on the full dataset. Build your pipeline logic on 1% of the data, where iteration is fast and errors are cheap. Once the logic is correct, switch to the appropriate tool (Dask, Polars, or SQL push-down) for the full dataset. This is faster than debugging on 100 million rows and produces the same pipeline code --- only the data source and tool change, not the business logic.
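A sketch of the sample-first workflow (the column names and the build_features helper are invented for illustration):

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Business logic -- identical for the sample run and the full run."""
    return (df[df["amount"] > 0]
              .groupby("user_id", as_index=False)["amount"]
              .sum())

full = pd.DataFrame({"user_id": range(1000), "amount": [1.0] * 1000})

# Development: iterate on a 1% sample, where mistakes are cheap.
preview = build_features(full.sample(frac=0.01, random_state=0))

# Production: the same function on the full data (or driven by Dask, Polars,
# or a SQL extract at this point); only the data source changes.
final = build_features(full)
```

Keeping the logic in one function is what makes the swap safe: the sample run and the full run cannot drift apart.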


If You Remember One Thing

The goal of large-data engineering is reduction. Every technique in this chapter --- SQL push-down, Dask aggregation, Polars transformation, dtype optimization --- exists to reduce billions of raw rows into the thousands or millions of rows that your model actually needs. The raw data is not the input to your model; it is the input to your pipeline, and the pipeline's job is to compress information, not to pass it through unchanged. A 2.6-billion-row event log becomes a 2.4-million-row feature matrix. A 622-million-row sensor table becomes a 1,500-row daily summary. The tools differ, but the principle is the same: get to the smallest DataFrame that preserves the information you need, and do the rest in pandas.


These takeaways summarize Chapter 28: Working with Large Datasets. Return to the chapter for full context.