Quiz: Chapter 28

Working with Large Datasets


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

A pandas DataFrame with 10 million rows and a column of type object containing only 4 unique string values ("low", "medium", "high", "critical") uses 620 MB for that column. After converting it to category dtype, the column uses approximately:

  • A) 310 MB (50% reduction)
  • B) 80 MB (87% reduction)
  • C) 10 MB (98% reduction)
  • D) 620 MB (no change --- categories are just labels)

Answer: C) 10 MB (98% reduction). Categorical dtype stores an integer code per row (one byte for 4 categories, so ~10 MB for 10M rows) plus a small lookup table mapping codes to strings. The object dtype stores a separate Python string object for each of the 10 million rows, each with ~50+ bytes of overhead. For low-cardinality string columns, the savings are dramatic --- often 95-99%.
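Illustration (a sketch, not the question's exact figures): the memory saving can be measured directly with pandas. One million rows are used here to keep the demo fast; the quiz scenario uses 10 million.

```python
import pandas as pd

# Compare memory for a low-cardinality string column before and after
# converting to category dtype.
n = 1_000_000
levels = ["low", "medium", "high", "critical"]
s_object = pd.Series([levels[i % 4] for i in range(n)], dtype="object")
s_category = s_object.astype("category")

obj_mb = s_object.memory_usage(deep=True) / 1e6
cat_mb = s_category.memory_usage(deep=True) / 1e6
print(f"object:   {obj_mb:.1f} MB")
print(f"category: {cat_mb:.1f} MB")
# With only 4 categories the codes fit in int8: roughly 1 byte per row,
# plus a tiny lookup table of the 4 category strings.
```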


Question 2 (Multiple Choice)

A Dask DataFrame has 16 partitions. When you write ddf.groupby('key').mean(), Dask:

  • A) Loads all partitions into memory, concatenates them, and runs a single pandas groupby
  • B) Computes a partial groupby on each partition independently, then merges the partial results
  • C) Sorts all data by the key column first, then runs the groupby
  • D) Sends the data to a database engine for processing

Answer: B) Computes a partial groupby on each partition independently, then merges the partial results. This is the map-reduce pattern: each partition (which is a regular pandas DataFrame) runs its own groupby to produce partial sums and counts, and then a final reduction step combines these partials across partitions. This works because mean = sum / count, and both sum and count are commutative and associative. Not all operations can be split this way --- median, for example, cannot be computed from partial medians.
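The map-reduce pattern can be sketched in plain pandas (illustrative only; Dask's real implementation is more general, but the idea is the same):

```python
import pandas as pd

# Each "partition" is an ordinary pandas DataFrame.
partitions = [
    pd.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]}),
    pd.DataFrame({"key": ["b", "a"], "value": [4.0, 5.0]}),
]

# Map step: per-partition partial sums and counts.
partials = [
    p.groupby("key")["value"].agg(["sum", "count"]) for p in partitions
]

# Reduce step: combine the partials, then divide once at the end.
combined = pd.concat(partials).groupby(level=0).sum()
mean = combined["sum"] / combined["count"]

# Matches a single pandas groupby over all the data:
expected = pd.concat(partitions).groupby("key")["value"].mean()
assert mean.equals(expected)
print(mean)
```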


Question 3 (Short Answer)

Explain the difference between eager evaluation and lazy evaluation in the context of DataFrame operations. Give one advantage of each approach.

Answer: Eager evaluation (pandas default) executes each operation immediately and returns the result. Lazy evaluation (Dask, Polars LazyFrame) records operations in a task graph or query plan without executing them; computation only happens when explicitly triggered (e.g., .compute() or .collect()). Eager evaluation is easier to debug because you can inspect intermediate results at each step. Lazy evaluation enables optimization: the engine can reorder operations, eliminate unnecessary intermediate DataFrames, push filters before joins, and parallelize independent branches of the computation graph.
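The distinction can be shown with a minimal pure-Python sketch (a hypothetical class, not any real library's API): operations are recorded in a plan and only executed when `.collect()` is called, mirroring Dask's `.compute()` and Polars' `.collect()`.

```python
class LazyPipeline:
    def __init__(self, data):
        self._data = data
        self._plan = []          # operations are recorded, not executed

    def filter(self, predicate):
        self._plan.append(("filter", predicate))
        return self              # still lazy: nothing has run yet

    def map(self, fn):
        self._plan.append(("map", fn))
        return self

    def collect(self):
        # Execution happens here; a real engine would first optimize
        # self._plan (e.g., reorder filters, fuse adjacent maps).
        result = self._data
        for op, fn in self._plan:
            if op == "filter":
                result = [x for x in result if fn(x)]
            else:
                result = [fn(x) for x in result]
        return result

lazy = LazyPipeline(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(len(lazy._plan))   # two recorded steps, no work done yet
print(lazy.collect())    # [0, 20, 40, 60, 80]
```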


Question 4 (Multiple Choice)

You have a SQL table with 200 million rows. The query SELECT * FROM events WHERE created_at >= '2024-06-01' AND created_at < '2024-07-01' takes 45 seconds. Which optimization is most likely to produce the largest improvement?

  • A) Adding LIMIT 1000 to the query
  • B) Creating an index on the created_at column
  • C) Changing SELECT * to SELECT event_id, created_at
  • D) Running the query from a faster Python client library

Answer: B) Creating an index on the created_at column. Without an index, the database performs a full table scan of all 200 million rows, checking each one against the date range. An index on created_at allows the database to use a B-tree lookup to jump directly to the rows in the date range, turning an O(n) scan into an O(log n) seek plus a sequential read of matching rows. Column selection (C) reduces data transfer but not scan time. LIMIT (A) stops after finding 1,000 matching rows but still scans sequentially until it finds them.
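The effect is easy to observe with SQLite (standing in for the question's database; table and column names follow the question). `EXPLAIN QUERY PLAN` reports whether the engine scans the whole table or seeks through the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, created_at TEXT)")

query = """
    SELECT * FROM events
    WHERE created_at >= '2024-06-01' AND created_at < '2024-07-01'
"""

before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)   # full table SCAN of events

conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)    # SEARCH using idx_events_created_at
```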


Question 5 (Multiple Choice)

A Polars LazyFrame plan shows the following optimized query:

FILTER [(col("status") == "critical") & (col("value") > 100.0)]
  FROM
    SCAN parquet [3 of 12 columns]

The "3 of 12 columns" indicates that the optimizer:

  • A) Randomly selected 3 columns to reduce memory usage
  • B) Applied projection pushdown, reading only the columns needed by downstream operations
  • C) Detected that 9 columns had no variance and removed them
  • D) Partitioned the data into 3 column groups for parallel processing

Answer: B) Applied projection pushdown, reading only the columns needed by downstream operations. The Polars query optimizer analyzes the entire query plan and determines which columns are actually used in filters, computations, and outputs. It then pushes this column selection down to the Parquet reader, which reads only those 3 columns from disk. Since Parquet is a columnar format, reading 3 of 12 columns means reading approximately 25% of the data from disk, providing a significant I/O speedup.
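A toy pure-Python sketch of why projection pushdown pays off in a columnar layout (hypothetical structures, not Polars internals): columns live in separate buffers, so a reader can skip the buffers it does not need. Real Parquet readers work the same way at the level of column chunks.

```python
# Stand-in for a 12-column Parquet file: one buffer per column.
columnar_file = {f"col_{i}": list(range(1000)) for i in range(12)}

def scan(file, columns):
    """Read only the requested columns -- the pushed-down projection."""
    return {name: file[name] for name in columns}

needed = ["col_0", "col_1", "col_2"]    # the "3 of 12 columns" in the plan
table = scan(columnar_file, needed)
fraction_read = len(table) / len(columnar_file)
print(f"read {len(table)} of {len(columnar_file)} columns "
      f"({fraction_read:.0%} of the data)")
```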


Question 6 (Short Answer)

A colleague writes the following Dask code:

ddf = dd.read_parquet('large_data/')
filtered = ddf[ddf['value'] > 100].compute()
grouped = filtered.groupby('key').mean()

Identify the performance problem and write the corrected version.

Answer: The .compute() is called too early --- after the filter but before the groupby. This materializes the entire filtered dataset as a pandas DataFrame in memory, losing all of Dask's lazy evaluation benefits. If the filtered data is large, this may cause a memory error. The corrected version chains both operations lazily and calls .compute() only at the end: result = ddf[ddf['value'] > 100].groupby('key').mean().compute(). This lets Dask build a single task graph that filters and aggregates per partition, keeping memory usage proportional to partition size rather than total filtered data size.


Question 7 (Multiple Choice)

Which file format is best suited for a 20 GB analytical dataset that will be read many times, often selecting only 4 of 30 columns?

  • A) CSV (universal compatibility)
  • B) Parquet (columnar, compressed, supports column selection)
  • C) JSON (human-readable, flexible schema)
  • D) HDF5 (fast read/write, good for numerical data)

Answer: B) Parquet. As a columnar format, Parquet stores each column contiguously on disk. Reading 4 of 30 columns means reading approximately 13% of the file, compared to CSV or JSON which require reading every byte of every row. Parquet also applies column-level compression (typically 3-10x), so the 20 GB dataset might be 3-5 GB on disk. Additionally, Parquet preserves dtypes (no re-inference on load), supports predicate pushdown (filtering during read), and is natively supported by pandas, Polars, Dask, and Spark.


Question 8 (Multiple Choice)

Apache Arrow is best described as:

  • A) A database engine optimized for analytical queries
  • B) A columnar in-memory format that enables zero-copy data sharing between libraries
  • C) A file format for storing large datasets on disk
  • D) A Python library for parallel computation

Answer: B) A columnar in-memory format that enables zero-copy data sharing between libraries. Arrow defines a standard memory layout for columnar data. When data is stored in Arrow format, it can be passed between Polars, pandas (with Arrow-backed dtypes), DuckDB, Spark, and other Arrow-compatible tools without copying or serializing the data. Arrow IPC (Feather) is the corresponding file format, but Arrow itself is primarily a memory specification. Polars uses Arrow as its native memory format, which is one reason it interoperates efficiently with other tools.


Question 9 (Short Answer)

You have a 50 GB dataset that fits in your machine's 64 GB of RAM --- barely. Should you use pandas, Polars, or Dask? Explain your reasoning, considering both the immediate task and what happens when the dataset grows.

Answer: Polars is the best choice. pandas would consume roughly 50 GB for the data plus overhead for operations (copies during groupby, merge, etc.), likely exceeding 64 GB and crashing. Polars uses Arrow memory format with lower overhead and avoids unnecessary copies, so it can handle the 50 GB more safely within the 64 GB limit. Dask could also work (processing the data in partitions), but adds complexity that is unnecessary if the data fits on one machine. Pragmatically, Polars should be the primary tool with Dask as a fallback if the dataset grows beyond 64 GB, at which point Dask's out-of-core processing becomes essential.


Question 10 (Multiple Choice)

The SQL query SELECT * FROM events WHERE YEAR(created_at) = 2024 is slow even though an index exists on created_at. The reason is:

  • A) The YEAR() function is computationally expensive
  • B) Wrapping the indexed column in a function prevents the database from using the index
  • C) The index only works for equality comparisons, not function results
  • D) The database needs to read the entire created_at column to extract the year

Answer: B) Wrapping the indexed column in a function prevents the database from using the index. The index is built on the raw created_at values. When you apply YEAR(), the database cannot look up the transformed value in the index --- it must compute YEAR(created_at) for every row (a full table scan) and then filter. The fix is to rewrite the query as WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01', which allows the database to use the index directly on the raw column values.
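This, too, can be demonstrated with SQLite (whose `strftime` stands in for `YEAR()`, which SQLite lacks): the function-wrapped predicate forces a scan, while the equivalent range predicate can use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, created_at TEXT)")
conn.execute("CREATE INDEX idx_created_at ON events (created_at)")

wrapped = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM events WHERE strftime('%Y', created_at) = '2024'"
).fetchall()
print(wrapped)   # SCAN: the index cannot be used

sargable = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM events "
    "WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'"
).fetchall()
print(sargable)  # SEARCH using idx_created_at
```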


Question 11 (Multiple Choice)

You need to compute the median value of a column across a 100 GB dataset using chunked reading (500 MB chunks). Which statement is correct?

  • A) Compute the median of each chunk and then average the per-chunk medians
  • B) Compute the median of each chunk and then take the median of the per-chunk medians
  • C) Neither A nor B produces the correct global median --- you need to sort the entire dataset or use an approximate algorithm
  • D) The median is undefined for datasets larger than RAM

Answer: C) Neither A nor B produces the correct global median. Unlike mean (which can be computed from partial sums and counts) or max (which can be computed from partial maxes), the median requires knowledge of the global ordering of all values. The median of per-chunk medians is not the global median. For out-of-core median computation, you either need to sort the entire dataset (expensive), use an approximate algorithm (like t-digest or the Greenwald-Khanna algorithm), or sample and compute the sample median with known error bounds.
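A concrete counterexample makes this vivid: three equal-sized chunks whose per-chunk medians mislead both approaches.

```python
import statistics

chunks = [[1, 2, 6], [3, 4, 5], [7, 8, 9]]
chunk_medians = [statistics.median(c) for c in chunks]           # [2, 4, 8]

mean_of_medians = statistics.mean(chunk_medians)                 # ~4.67
median_of_medians = statistics.median(chunk_medians)             # 4
true_median = statistics.median([x for c in chunks for x in c])  # 5

print(mean_of_medians, median_of_medians, true_median)
# Neither approach recovers the global median of 5.
```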


Question 12 (Short Answer)

Explain why Polars is often faster than Dask for single-machine operations, even though both support lazy evaluation and parallelism.

Answer: Polars executes all computation in compiled Rust code with no Python interpreter overhead, while Dask dispatches work to pandas; pandas operations run through NumPy's C extensions but still incur Python object overhead, GIL contention, and serialization costs when moving data between partitions. Polars' query optimizer also performs semantic optimizations (predicate pushdown, projection pushdown, common subexpression elimination) that Dask's task graph scheduler does not. Additionally, Polars uses Apache Arrow as its native memory format, avoiding the memory inefficiency of pandas' object dtype for strings. Dask's strength is distributed computation across multiple machines, which Polars does not support.


Question 13 (Multiple Choice)

A Dask pipeline processes 50 GB of data across 100 partitions. At one point, it calls .repartition(npartitions=10). This operation:

  • A) Is free because Dask is lazy
  • B) Is cheap because it only changes metadata
  • C) Is expensive because it triggers a data shuffle, moving data between partitions
  • D) Fails because you cannot reduce the number of partitions

Answer: C) Is expensive because it triggers a data shuffle, moving data between partitions. Repartitioning requires redistributing data across the new partition boundaries. If the number of partitions decreases (100 to 10), data from approximately 10 original partitions must be combined into each new partition. This involves reading, serializing, transferring, and deserializing data. While the repartition itself is lazy (it will not execute until .compute()), when it does execute, it is one of the most expensive operations in a Dask pipeline. Use repartition sparingly and only when the partition count is clearly wrong for your workload.


Question 14 (Short Answer)

You are building a dashboard that refreshes every 5 minutes, querying a PostgreSQL table with 500 million rows. The query computes daily revenue by product category for the last 90 days. Describe two database-side optimizations (not Python-side) that would make this dashboard responsive.

Answer: First, create a materialized view that pre-computes the daily revenue aggregation. Refresh the materialized view on a schedule (e.g., every 5 minutes), and have the dashboard query the materialized view instead of the raw table --- the view has on the order of hundreds of rows (90 days times the number of categories) instead of 500 million. Second, partition the underlying table by date (e.g., monthly range partitions) and create a composite index on (created_at, product_category). Partition pruning ensures that the 90-day query only scans 3-4 monthly partitions rather than the full table, and the composite index enables an index-only scan within each partition.
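The materialized-view idea can be sketched with SQLite (which lacks materialized views, so a summary table is refreshed by hand; in PostgreSQL you would use CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW instead). Table and column names follow the question; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        created_at TEXT, product_category TEXT, revenue REAL
    )
""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-06-01", "widgets", 10.0),
        ("2024-06-01", "widgets", 5.0),
        ("2024-06-02", "gadgets", 7.5),
    ],
)
conn.execute("""
    CREATE TABLE daily_revenue (
        day TEXT, product_category TEXT, revenue REAL
    )
""")

def refresh_daily_revenue(conn):
    """Run the expensive aggregation once; the dashboard reads the result."""
    conn.execute("DELETE FROM daily_revenue")
    conn.execute("""
        INSERT INTO daily_revenue
        SELECT date(created_at), product_category, SUM(revenue)
        FROM events
        GROUP BY date(created_at), product_category
    """)

refresh_daily_revenue(conn)
rows = conn.execute(
    "SELECT * FROM daily_revenue ORDER BY day, product_category"
).fetchall()
print(rows)
```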


Question 15 (Multiple Choice)

You convert a pandas DataFrame from object (string) dtype to Arrow-backed string dtype using pd.ArrowDtype(pa.string()). The primary benefit is:

  • A) Strings are stored in a compact binary format rather than as individual Python objects
  • B) String operations become case-insensitive
  • C) The DataFrame becomes immutable
  • D) The DataFrame can now be used with Polars without conversion

Answer: A) Strings are stored in a compact binary format rather than as individual Python objects. pandas' object dtype stores each string as a separate Python object on the heap, with 50+ bytes of overhead per object (reference count, type pointer, length, hash, plus the string data). Arrow-backed strings store all string data in a contiguous binary buffer with an offset array, eliminating per-object overhead. For a column with 10 million short strings, this can reduce memory from hundreds of MB to tens of MB.
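The per-object overhead is visible in plain Python, and the Arrow-style layout can be sketched without pyarrow (a simplified model of the real format, which also carries validity bitmaps and other metadata):

```python
import sys

# Every Python str carries tens of bytes of header before any character
# data (exact numbers vary by Python version and build).
payload = "low"                      # 3 characters of actual data
per_object = sys.getsizeof(payload)
print(f"{len(payload)} chars of data, {per_object} bytes as a Python object")

# An Arrow-style string column: one contiguous data buffer plus an
# offsets array -- no per-string header.
strings = ["low", "medium", "high", "critical"]
data_buffer = "".join(strings).encode()   # contiguous bytes
offsets = [0]
for s in strings:
    offsets.append(offsets[-1] + len(s.encode()))
print(len(data_buffer), "bytes of data +", len(offsets), "offsets")
```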


This quiz covers the major concepts from Chapter 28. Review any questions you found difficult by returning to the relevant section of the chapter.