Chapter 31 — Exercises

DataField.Dev

Chapter 31 — Exercises

Hands-on with COPY/\copy against mercado. (answer in Appendix) = worked solution in Answers. ⭐ = stretch.

Group A — COPY basics

31.1 Export all products to a CSV with COPY ... TO (or \copy). Open the file and check the header row. (answer in Appendix)

31.2 Bulk-load that CSV into a fresh products_copy table with \copy.

31.3 Explain the difference between COPY ... FROM '/path' and \copy ... FROM 'path'. When does each apply? (answer in Appendix)

Group B — Speed

31.4 Load ~10,000 rows two ways — an INSERT loop and a single COPY — and compare the time. (answer in Appendix)

31.5 Rank these slowest→fastest and explain: single-row inserts, multi-row inserts, batched transactions, COPY.

31.6 ⭐ From Python (psycopg), load rows via the driver's copy API instead of an execute() loop. Why is it faster?

Group C — Staging & transform

31.7 Create a loose staging_products table (text columns), COPY raw CSV into it, then INSERT ... SELECT into the real products (cast types, resolve category name→id, skip bad rows, upsert). (answer in Appendix)

31.8 Why load into a loose staging table first instead of straight into the constrained table?

31.9 ⭐ Why might you make the staging table UNLOGGED, and what's the trade-off?

Group D — Pipelines

31.10 Explain ETL vs ELT and when you'd choose each. (answer in Appendix)

31.11 What makes a load idempotent, and why does that matter for retries? (Tie to ON CONFLICT, Ch. 13.)

31.12 ⭐ Sketch an incremental load that only processes new rows since the last run (high-water mark).

Group E — Big-load performance

31.13 Why might you drop indexes before a huge initial load and rebuild after? When is this NOT safe? (answer in Appendix)

31.14 Why run ANALYZE after a big load? (Tie to Chapter 24.)

Group F — Progressive project

31.15 Bulk-load a dataset into your project with COPY/\copy; compare to an INSERT loop.

31.16 Use a staging table to load + transform + upsert; make it idempotent and re-run it.

31.17 ⭐ Describe how you'd turn this into an incremental, scheduled pipeline.

Self-check. If you reach for COPY (not loops) for bulk, stage-then-transform raw data, and build idempotent loads, you can move data at scale. Next: securing it.