Chapter 31: Bulk Data Operations and ETL Foundations — COPY, Loading, and Moving Data at Scale

DataField.Dev

29 min read

> Where you are: Part V, Chapter 31 of 40 (a ★ added chapter). Chapters 29–30 moved data a row or object at a time. This chapter moves it by the million — bulk loading and unloading, and the foundations of the ETL/ELT pipelines that feed analytics...

In This Chapter

Row-by-row is the slow way
COPY: the fast path
COPY from application code
Why row-by-row is slow, quantified
COPY in depth
Staging tables: load, then transform
Data formats and interchange
ETL and ELT
Bulk-load performance tips
Incremental loads and change data capture
Idempotency: pipelines you can safely re-run
A worked scenario: a daily import pipeline
Common mistakes
Bulk-load performance, in depth
The modern data stack and where pipelines run
Data movement as a discipline
Progressive project: load data the fast way
A field guide to bulk-loading mistakes
Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 31: Bulk Data Operations and ETL Foundations — COPY, Loading, and Moving Data at Scale

Where you are: Part V, Chapter 31 of 40 (a ★ added chapter). Chapters 29–30 moved data a row or object at a time. This chapter moves it by the million — bulk loading and unloading, and the foundations of the ETL/ELT pipelines that feed analytics and integrations.

Learning paths: 💻 Developer · 🏗️ DBA · 📊 Analyst/data engineer (core — this is how data gets into the warehouse). Uses SQL, psql, and a little Python.

Row-by-row is the slow way

You've loaded data with INSERT (Chapter 13). For a few rows, fine. For millions — importing a dataset, migrating a system, feeding a warehouse — INSERT-per-row is painfully slow: each statement is a round trip and (by default) its own transaction, with all the per-statement overhead. Loading a million rows this way can take hours.

PostgreSQL's bulk path, COPY, loads the same data orders of magnitude faster — often minutes or seconds instead of hours — by streaming rows in bulk with minimal per-row overhead. (This book's own generate_data.sql and any real data load lean on bulk techniques.) Knowing the bulk path is the difference between a data load that's a coffee break and one that's an overnight job.

COPY: the fast path

COPY transfers data between a table and a file (or a stream) in bulk. Two directions:

-- Load: file → table (server-side; the file is on the DB server)
COPY products (sku, name, price, category_id)
FROM '/data/products.csv' WITH (FORMAT csv, HEADER true);

-- Unload: table → file
COPY (SELECT * FROM orders WHERE order_date >= '2024-01-01')
TO '/data/orders_2024.csv' WITH (FORMAT csv, HEADER true);

COPY ... FROM reads a file into a table; COPY ... TO writes a query's result to a file. The FORMAT csv, HEADER true options handle comma-separated files with a header row (other formats: text, binary). COPY is fast because it bypasses the per-statement overhead of individual INSERTs — it's a tight bulk stream.

Server-side vs. client-side: `\copy`

COPY ... FROM '/path' reads a file on the database server (and requires file privileges). When your file is on your machine (the client), use psql's \copy meta-command, which streams the file from the client over the connection:

\copy products (sku, name, price, category_id) FROM 'products.csv' WITH (FORMAT csv, HEADER true)

\copy is the same speed and format options, but the file path is client-side — the common case for loading a file you have locally. (COPY = server file + privileges; \copy = client file via psql.)

COPY from application code

Drivers expose COPY for fast loads from your program. In psycopg (3), you stream rows through a copy operation:

# psycopg 3 — fast bulk load from Python
with conn.cursor() as cur:
    with cur.copy("COPY products (sku, name, price, category_id) FROM STDIN") as copy:
        for row in big_iterable_of_rows:
            copy.write_row(row)
conn.commit()

This is dramatically faster than looping INSERTs. When you must load many rows from an application — a daily import, a migration — reach for the driver's COPY support, not a loop of execute(). (psycopg2 has copy_expert/copy_from; psycopg3's cursor.copy() is cleaner.)

Performance note. Even short of COPY, a multi-row INSERT (INSERT ... VALUES (...), (...), (...)) and batching many rows per transaction beat one-row-per-statement. But for genuine bulk, COPY wins by a wide margin. The hierarchy, slowest to fastest: single-row inserts → multi-row inserts → batched transactions → COPY.

Why row-by-row is slow, quantified

Understanding why COPY is so much faster than row-by-row INSERT — not just that it is — helps you recognize when bulk techniques matter and reason about data-loading performance generally. The difference comes down to per-row overhead, and quantifying it makes the case vivid.

Each individual INSERT statement carries substantial fixed overhead independent of the actual data: the statement must be sent to the server (a network round trip), parsed and planned, executed within a transaction (with its own commit overhead if autocommitted), and acknowledged back to the client. For a single row, this overhead dwarfs the trivial work of storing a few values. Now multiply by a million: a million round trips, a million parse-plan cycles, a million transaction commits — the overhead, repeated a million times, is what makes row-by-row loading take hours. The actual data being small makes no difference; it's the per-statement machinery, paid once per row, that dominates. This is why a million single-row inserts can take hours while the same data via COPY takes seconds — the data is identical, but COPY pays the overhead once for the whole stream rather than once per row.

COPY is fast because it amortizes this overhead across the entire load. It's a single statement that streams all the rows in one operation — one round trip's worth of protocol overhead, one parse, one transaction, minimal per-row work — so the fixed costs are paid once instead of a million times. The rows flow as a tight bulk stream with almost no per-row ceremony. This is the same principle as batching generally: doing N things in one operation is vastly cheaper than N separate operations when each operation has fixed overhead. It's why the performance hierarchy runs single-row inserts (worst — full overhead per row) → multi-row inserts (better — overhead amortized across the rows in each statement) → batched transactions (better still — commit overhead amortized) → COPY (best — nearly all overhead eliminated). Understanding this hierarchy lets you reach for the right tool by data volume: a handful of rows, single inserts are fine; thousands, multi-row inserts or batching; millions, COPY. The quantified intuition — overhead is per-statement, so fewer statements for the same data is faster — is what makes "use COPY for bulk" obvious rather than arbitrary, and it generalizes to any "should I batch this?" decision in data work.

COPY in depth

COPY is the workhorse of bulk data movement, and understanding its full capabilities — formats, options, directions, and error handling — lets you use it for the wide range of bulk tasks it handles. At its core, COPY transfers data in bulk between a table and an external representation (a file or a stream), in either direction: COPY ... FROM loads data into a table, COPY ... TO unloads a query's results out to a file. This bidirectionality makes COPY the tool for both importing data (load a CSV into a table) and exporting it (dump a query's results to a file for another system).

The format options handle the variety of real-world data files. FORMAT csv handles comma-separated values, the most common interchange format, with sub-options for the delimiter, quote character, null representation, and HEADER true to skip/expect a header row. FORMAT text is PostgreSQL's tab-delimited text format. FORMAT binary is a compact binary format that's fastest (no text parsing) but PostgreSQL-specific (only useful for PostgreSQL-to-PostgreSQL transfer). For most interchange with other systems, CSV is the lingua franca; for PostgreSQL-to-PostgreSQL bulk transfer, binary is fastest. COPY also lets you specify which columns to load (COPY products (sku, name, price)), so a file with a subset of columns, or columns in a different order, maps correctly to the table.

A crucial distinction, often confused, is server-side COPY versus client-side \copy. COPY ... FROM '/path/file.csv' reads a file on the database server's filesystem and requires the database to have file-access privileges — appropriate when the file is already on the server (a scheduled load of a file placed there). But when your file is on your machine (the client), the server can't read it, so you use psql's \copy meta-command, which reads the client-side file and streams it to the server over the connection. The two are otherwise identical in format and speed; the difference is purely where the file lives — server (COPY) or client (\copy). This trips people up constantly: they try COPY ... FROM 'myfile.csv' for a local file and get a "file not found" error because the server looked on its filesystem, not theirs. The rule: file on the server, COPY; file on your machine, \copy. And from application code, drivers expose COPY over the connection (psycopg's cursor.copy()), streaming rows from the program — the fast way to bulk-load from an application, far better than an INSERT loop.

Error handling during COPY matters because bulk loads encounter bad data. By default, COPY is all-or-nothing within its transaction — if any row fails (a type mismatch, a constraint violation), the whole COPY aborts and loads nothing, which prevents partial loads but means one bad row stops everything. This is why the staging-table pattern (next) loads into a loose table with no constraints first — so the bulk load succeeds regardless of data quality, and validation happens afterward in SQL where you can quarantine bad rows rather than aborting the load. Newer PostgreSQL versions add options to handle errors more gracefully (skipping or logging bad rows), but the staging approach remains the robust general pattern. Understanding COPY fully — its directions, formats, the server/client distinction, and its error behavior — equips you for the full range of bulk loading and unloading, which is the foundation of every data pipeline.

Staging tables: load, then transform

Raw incoming data is rarely in the exact shape of your clean, constrained tables — it needs validation, deduplication, type fixing, or transformation. The standard pattern is a staging table: load the raw data fast into a loose temporary table, then transform and move it into the real tables with SQL.

-- 1. A loose staging table (text columns, no constraints — fast to load)
CREATE TABLE staging_products (sku text, name text, price text, category text);

-- 2. Bulk-load raw data into it (fast, no constraint checks)
COPY staging_products FROM '/data/import.csv' WITH (FORMAT csv, HEADER true);

-- 3. Transform + load into the real table with SQL (validate, convert, look up keys)
INSERT INTO products (sku, name, price, category_id)
SELECT s.sku, s.name, s.price::numeric, c.category_id
FROM staging_products s
JOIN categories c ON c.name = s.category          -- resolve category name → id
WHERE s.price ~ '^[0-9.]+$'                        -- skip bad rows
ON CONFLICT (sku) DO UPDATE SET ...;               -- upsert (idempotent — Ch. 13)

-- 4. Drop or truncate the staging table when done
DROP TABLE staging_products;

The staging pattern separates fast loading (bulk COPY into a loose table) from correct transformation (SQL with the database's full power: joins, type casts, constraints, upserts). It also lets you validate and report on bad rows before they reach production tables. For a throwaway staging table, an UNLOGGED table loads even faster (it skips the WAL — fine for data you can re-load).

Data formats and interchange

Bulk data moves between systems in formats, and understanding the common ones — their strengths and when each is appropriate — is part of practical data movement, since you'll constantly load from and export to various formats. The format is the contract between the systems exchanging data, and choosing or handling it well matters.

CSV (comma-separated values) is the universal interchange format — every system reads and writes it, it's human-readable, and COPY's FORMAT csv handles it directly. Its weaknesses are real, though: it has no type information (everything is text, so the receiver must know that column 3 is a number and column 5 is a date), it handles nested or structured data poorly (it's flat), and its quoting/escaping rules have edge cases (commas and newlines within values, requiring careful quoting). Despite these flaws, CSV's universality makes it the default for cross-system data exchange, and COPY/\copy make it efficient. JSON (and newline-delimited JSON, JSONL) carries structure and some type information (numbers vs. strings vs. nulls), making it better for nested or semi-structured data, at the cost of verbosity. PostgreSQL can load JSON data (into jsonb columns, Chapter 16, or by parsing during transformation), useful when the source is API responses or document data.

For analytical and big-data contexts, columnar formats like Parquet have become standard. Parquet stores data column-by-column (rather than row-by-row), compressed and with type information, which is enormously efficient for analytical queries that read few columns of many rows (the OLAP access pattern, Chapter 20/34) — and far more compact than CSV. While PostgreSQL doesn't load Parquet natively in core, the broader data ecosystem (warehouses, data lakes, the modern data stack) uses it pervasively, and extensions/tools bridge PostgreSQL to it. The format landscape, then, runs from CSV (universal, flat, text) through JSON (structured, verbose) to columnar formats like Parquet (analytical, compressed, typed), each suited to different contexts — CSV for general interchange, JSON for structured/document data, Parquet for analytical big data. Knowing which format suits which situation, and how to load from and export to each (CSV via COPY directly, others via transformation or tools), is part of the practical fluency of moving data between systems. The format is often dictated by what the other system produces or expects, so handling the common ones competently — especially CSV, the lingua franca, via COPY — is essential bulk-data skill.

ETL and ELT

Bulk loading is the heart of a data pipeline. Two classic shapes:

ETL — Extract, Transform, Load: extract from sources, transform the data (clean, reshape, aggregate) in a separate tool/process, then load the result into the target. Transformation happens before loading.
ELT — Extract, Load, Transform: extract, load the raw data into the target (often a warehouse) first, then transform it there with SQL. Transformation happens after loading, using the database's power.

The staging-table pattern above is essentially ELT in miniature: load raw (E, L), transform with SQL (T). Modern data stacks (with powerful warehouses, Chapter 34) increasingly favor ELT — load everything, transform in SQL (tools like dbt) — because the warehouse can transform faster than an external tool, and raw data is preserved. ETL still suits cases where you must clean/filter before it touches the target (size, sensitivity).

A robust pipeline also handles:

Incremental loads — load only new/changed data each run (by timestamp or a high-water mark), not the whole source every time.
Idempotency — a re-run produces the same result (upserts, ON CONFLICT, Chapter 13) — so a failed/retried run is safe.
Error handling & reporting — bad rows quarantined, not silently dropped (the staging table helps).

Bulk-load performance tips

For very large loads:

Drop/disable indexes and rebuild after — maintaining indexes during a bulk load is expensive (Chapter 23); for a big initial load, dropping indexes, loading, then CREATE INDEX is often far faster. (Don't do this on a live table serving queries.)
Use COPY, not INSERT loops — the headline.
Batch and use few transactions — one big load, not a million tiny commits.
UNLOGGED staging tables — skip the WAL for throwaway data.
ANALYZE after loading — refresh statistics so the optimizer plans well on the new data (Chapter 24 — exactly why generate_data.sql ends with ANALYZE).

Incremental loads and change data capture

A pipeline that re-loads all the source data on every run is wasteful and doesn't scale — as the source grows, re-loading everything takes ever longer. Real pipelines load incrementally: each run processes only the new or changed data since the last run. Understanding incremental loading is essential for pipelines that run repeatedly (nightly, hourly) against growing sources.

The common incremental approach uses a high-water mark — a record of how far the last run got, so the next run starts from there. If the source data has a reliable timestamp (a created_at or updated_at), each run loads rows newer than the last run's maximum timestamp, then records the new maximum for next time. "Load orders created since the last run" processes only the day's new orders, not the entire order history, keeping each run's work proportional to the new data, not the total data. Similarly, an incrementing id can serve as a high-water mark for append-only sources (load rows with id greater than the last loaded id). The key is having a reliable way to identify "what's new since last time" — a timestamp or a monotonic id — and tracking the watermark between runs. This turns an O(total-data) full reload into an O(new-data) incremental load, which is what makes pipelines sustainable as data grows.

A more sophisticated approach is change data capture (CDC), which captures every change (insert, update, delete) to source tables as it happens, often by reading the database's transaction log (PostgreSQL's WAL, via logical replication — Chapters 28, 35). CDC propagates changes to downstream systems in near-real-time, rather than periodically polling for changes by timestamp. It's more complex than high-water-mark polling but enables real-time data integration and handles updates and deletes (not just inserts) cleanly. For many pipelines, timestamp-based incremental loading suffices; for real-time or update-heavy integration, CDC is the tool. Either way, the principle is the same: don't reprocess what you've already processed — identify and handle only the changes, whether by polling a high-water mark or capturing changes from the log. Incremental processing is what separates a pipeline that scales from one that grinds to a halt as its source grows, and designing for it from the start (ensuring sources have reliable timestamps or change-capture) is part of building sustainable data pipelines.

Idempotency: pipelines you can safely re-run

Idempotency — the property that running an operation twice has the same effect as running it once — is arguably the most important property of a robust data pipeline, because pipelines fail and get re-run constantly, and a non-idempotent pipeline corrupts data when it does. Understanding why, and how to achieve it, is central to building pipelines that survive contact with reality.

The reality is that pipeline runs fail partway — a network blip, a source unavailable, a crash, a bad row that aborts a load — and the response is to re-run them. If the pipeline isn't idempotent, re-running after a partial failure double-loads the data that succeeded before the failure: rows get inserted twice, counts get incremented twice, totals get corrupted. A pipeline that produces wrong data when re-run is a pipeline you can't safely retry, which means every failure requires careful manual cleanup before re-running — operationally miserable and error-prone. An idempotent pipeline, by contrast, can simply be re-run after any failure: the parts that succeeded are harmlessly re-applied (no-ops), the parts that didn't are completed, and the result is correct regardless of how many times or at what point it's re-run. This transforms failure handling from "carefully clean up, then re-run" to "just re-run," which is enormously more robust.

Achieving idempotency uses the tools from Chapter 13. Upserts (INSERT ... ON CONFLICT DO UPDATE/NOTHING) make loading idempotent: re-loading a row that already exists updates it (or skips it) rather than duplicating it, so the final state is the same whether the row was loaded once or five times. Incremental high-water marks contribute: if a re-run starts from the same watermark, it reprocesses the same rows, and the upserts make that reprocessing harmless. Deterministic transformations ensure the same input produces the same output, so re-running yields the same result. The combination — upsert-based loading, watermark-based incrementing, deterministic transforms — produces a pipeline where re-running is always safe. The discipline is to design for re-runnability from the start: assume the pipeline will fail partway and be re-run, and make every step idempotent so that's safe. This is the data-pipeline expression of the idempotency principle from Chapter 13, and it's what separates pipelines that are robust (re-run freely, recover from any failure) from those that are fragile (every failure a manual cleanup). A pipeline you can re-run without thinking is a pipeline you can operate confidently; one you can't is a perpetual source of stress and data-corruption risk.

A worked scenario: a daily import pipeline

Let's assemble the chapter's techniques into a realistic daily import pipeline, because seeing them work together shows how a robust pipeline is built. The scenario: every night, Mercado receives a CSV of product updates from a supplier (new products, price changes, discontinued items) and must merge it into the products table. Walk through building this to be fast, correct, incremental, and idempotent.

The pipeline stages map to the chapter's techniques. Extract: the supplier's CSV arrives (placed on a server, or fetched). Load (fast): bulk-load the raw CSV into a loose staging table with COPY (or \copy) — text columns, no constraints, optionally UNLOGGED for speed — so the load is fast and succeeds regardless of data quality. Transform (correct): with the raw data staged, use SQL to validate it (filter out rows with bad prices or missing SKUs, reporting them rather than silently dropping), convert types (price::numeric), and resolve foreign keys (look up category_id from the category name). Merge (idempotent): upsert the clean rows into products with INSERT ... SELECT ... ON CONFLICT (sku) DO UPDATE, so new products are inserted and existing ones updated — idempotently, so re-running the night's load doesn't duplicate. Finalize: handle discontinued items (perhaps a soft-delete, Chapter 21), drop the staging table, and ANALYZE products to refresh statistics for the optimizer.

This pipeline embodies every lesson: it's fast (bulk COPY into staging, not INSERT loops), correct (SQL transformation with validation and key resolution, not loading raw data into constrained tables), robust to bad data (loose staging table loads regardless, validation quarantines bad rows), idempotent (upserts make re-running safe after a failure), and well-planned afterward (ANALYZE so the new data is queried efficiently). If it could be incremental (only new/changed products), a high-water mark would limit each run's work. Notice how it draws on the whole book: COPY (this chapter), staging and transformation in SQL (Part II), upserts (Chapter 13), foreign-key resolution via joins (Chapter 6), soft deletes (Chapter 21), ANALYZE (Chapter 24). A real data pipeline is a synthesis of database skills, orchestrated into a repeatable, robust process. Building one like this — staged, validated, idempotent, analyzed — is the practical capstone of bulk data work, and it's exactly the kind of pipeline that feeds real systems and warehouses reliably, night after night, surviving the bad data and partial failures that reality throws at it.

Common mistakes

Loading millions of rows with INSERT loops — hours instead of minutes. Use COPY.
Confusing COPY (server file) with \copy (client file) — COPY ... FROM '/path' looks for the file on the server; for a local file use \copy.
Loading raw data straight into constrained production tables — bad rows abort the load or corrupt data. Stage, validate, then transform.
Non-idempotent pipelines — a retried run double-loads. Use upserts / incremental high-water marks.
Forgetting ANALYZE after a big load — stale stats → bad plans (Chapter 24's Case Study 2).

Bulk-load performance, in depth

For genuinely large loads — millions or billions of rows, initial migrations, warehouse loads — several techniques beyond "use COPY" can multiply performance further, and understanding why they work (grounded in the internals of Part IV) lets you apply them appropriately. These are the tools for when even COPY needs to go faster.

The biggest lever for a large initial load is dropping indexes before the load and rebuilding them after. Recall (Chapter 23) that every index must be updated on every insert — so loading a million rows into a table with five indexes means a million index updates per index, five million index maintenance operations during the load, which dominates the load time. If you drop the indexes, bulk-load the data (now just writing rows, no index maintenance), then CREATE INDEX to rebuild them, the index build is a single bulk sort-and-build operation — far faster than incremental per-row maintenance. For a large initial load into an empty or to-be-populated table, drop-load-rebuild can be several times faster than loading with indexes in place. (The caveat: this is for initial loads or maintenance windows, not live tables serving queries, since dropping indexes on a live table would cripple its queries during the load.) Similarly, deferring constraint checking or loading into a constraint-free staging table avoids per-row constraint validation during the bulk load.

Other techniques target the internals. UNLOGGED tables skip the write-ahead log (Chapter 28) — since WAL is written for durability and crash recovery, skipping it makes writes faster, appropriate for staging tables holding data you can re-load if lost (the WAL's protection is unneeded for throwaway data). Batching into few transactions (rather than many small commits) amortizes commit overhead — each commit involves a WAL flush (Chapter 28's fsync), so fewer commits mean fewer expensive flushes. Tuning maintenance_work_mem higher gives index builds more memory to sort in (avoiding disk spills). And ANALYZE after the load is essential — not for load speed, but so the optimizer has accurate statistics for querying the newly-loaded data (stale statistics after a big load cause the bad plans of Chapter 24; this is exactly why generate_data.sql ends with ANALYZE). The throughline is that bulk loading is where the internals knowledge of Part IV pays off concretely: understanding indexes (drop/rebuild), the WAL (UNLOGGED, batching), and statistics (ANALYZE) lets you load data dramatically faster than naive approaches. These techniques, combined with COPY, are how you load enormous datasets in reasonable time — the difference between a migration that takes a weekend and one that takes an hour.

The modern data stack and where pipelines run

Bulk data movement and ETL/ELT pipelines exist within a broader ecosystem — the "modern data stack" — and understanding where PostgreSQL and these techniques fit in that larger picture orients you for real data-engineering work. The pipelines you build don't run in isolation; they're part of how data flows through an organization.

The classic flow moves data from operational systems (transactional databases like Mercado, where business happens) to analytical systems (data warehouses, Chapter 34, where analysis happens), because the two have opposite needs (OLTP vs. OLAP, Chapter 20). Pipelines extract from the operational sources, optionally transform, and load into the warehouse — the ETL/ELT shapes this chapter covered. The modern data stack has shifted strongly toward ELT: load raw data into a powerful cloud warehouse first, then transform it there with SQL, because warehouses have become fast enough to do the transformation better than external tools, and preserving the raw data enables re-transformation as needs change. Tools like dbt (data build tool) have popularized this "transform in SQL, in the warehouse" approach — dbt is essentially organized, version-controlled, tested SQL transformations, which is why the SQL skills of this book are directly the skills of modern data engineering. Orchestration tools (Airflow, Dagster, Prefect) schedule and manage pipeline runs, handling dependencies, retries, and monitoring — the operational layer around the data movement.

Where does PostgreSQL fit? In several roles: as an operational source (the transactional database data is extracted from — Mercado), as a target (a PostgreSQL-based warehouse or analytical database for smaller-scale analytics — and theme #4 again, since PostgreSQL with partitioning and good indexing handles substantial analytical workloads without a separate warehouse), and as the engine where staging-and-transform happens (the SQL-based transformation of ELT). The COPY and staging techniques you've learned are exactly how data moves into and through these roles. The broader lesson is that data movement is a discipline in its own right — data engineering — built on the bulk and pipeline foundations of this chapter, and that the SQL and database skills of this book are directly the skills that discipline requires. The modern data engineer's core competence is moving data reliably (pipelines: incremental, idempotent, robust) and transforming it correctly (SQL: the whole of Part II), which is precisely what this chapter and the book have taught. Whether you become a data engineer or simply build the occasional import, understanding where pipelines fit in the larger data ecosystem — operational to analytical, increasingly via ELT and SQL transformation — orients the bulk techniques within the bigger picture of how organizations move and use their data.

Data movement as a discipline

Stepping back, this chapter has been about a distinct discipline within database work: moving data at scale, reliably. It's worth recognizing it as such, because it's a skill set that grows in importance as data volumes grow and as organizations increasingly run on data flowing between systems. Data movement isn't an afterthought to "real" database work — it's a substantial competence in its own right, with its own principles and pitfalls.

The principles this chapter established form that discipline's core. Move data in bulk, not row by row — COPY, not INSERT loops, because per-statement overhead dominates at scale. Separate fast loading from correct transformation — the staging-table pattern, bulk-load loose, transform with SQL. Process incrementally — load only what's new, via high-water marks or CDC, so pipelines scale with data growth. Design for idempotency — make re-running safe, because pipelines fail and get re-run constantly. Handle bad data deliberately — quarantine and report it, don't silently drop or let it abort the load. Optimize for the internals — drop/rebuild indexes, use UNLOGGED staging, batch transactions, ANALYZE after. Each principle addresses a way bulk data movement goes wrong, and together they're the discipline of doing it reliably.

This discipline connects to everything else in the book. It uses the SQL of Part II (transformation is SQL), the design of Part III (loading into well-designed, constrained tables; the staging pattern), the performance and internals of Part IV (why COPY is fast, index/WAL/statistics techniques), and the integration of Part V (loading from application code, pipelines). It's a synthesis — which is fitting, because real data work is a synthesis, drawing on the whole range of database skills to accomplish concrete goals like "load this dataset" or "feed this warehouse nightly." As data continues to grow in volume and importance across every industry, the ability to move it reliably at scale — to build pipelines that are fast, correct, incremental, and idempotent — becomes ever more valuable. This chapter is the foundation of that ability, and the discipline it teaches, grounded in the rest of the book's database skills, is what lets you handle data not just at the row-at-a-time scale of application queries but at the million-row scale of real datasets and the ongoing flow of real pipelines. That capability — bulk, reliable, scalable data movement — is an increasingly central part of what it means to work with data professionally.

The next and final chapter of Part V turns from moving data to protecting it: database security — SQL injection (revisited with the full defensive treatment), roles and privileges, row-level security, and encryption. It's a fitting close to the application-integration part, because the seam where applications meet the database is not only where data flows (Chapters 29–31) but where it's most exposed to attack. Having learned to connect, map, and move data across that seam, you'll learn to secure it — completing the picture of responsible application-database integration, where data moves both efficiently and safely.

It's worth noting that bulk data movement and security intersect directly: the staging tables, export files, and pipeline transfers of this chapter are themselves data that must be protected (an exported CSV of customer data is as sensitive as the table it came from), and pipelines that move data between systems must do so securely (encrypted transfer, access-controlled staging). So the efficiency this chapter taught and the security the next one teaches aren't separate concerns but two requirements of the same responsible data handling. Moving data fast is only half the job; moving it safely — and protecting it everywhere it lands — is the other half, which is exactly where Part V concludes.

Progressive project: load data the fast way

For your project:

Export a query's result to CSV with COPY ... TO (or \copy).
Bulk-load a dataset with COPY/\copy and compare the time to an INSERT loop of the same data.
Use a staging table: load raw CSV into a loose table, then transform + upsert into your real tables with SQL (resolving keys, fixing types, skipping bad rows).
Make it idempotent — re-run the load and confirm it doesn't duplicate (via ON CONFLICT).
ANALYZE after loading, and (stretch) describe how you'd do an incremental load.

A field guide to bulk-loading mistakes

Bulk data operations have characteristic failure modes, and a checklist of them helps you avoid the expensive ones. Row-by-row loading at scale is the headline mistake: using INSERT loops for millions of rows, turning a minutes-long COPY job into an hours-long crawl. Use COPY for bulk. Confusing COPY and \copy wastes time on baffling "file not found" errors: COPY ... FROM '/path' reads a file on the server; for a local file, use \copy. Loading raw data straight into constrained production tables causes loads to abort on the first bad row, or worse, lets bad data through: stage into a loose table, validate, then transform into the real tables. Non-idempotent pipelines double-load data when re-run after a failure: use upserts (ON CONFLICT) so re-running is safe. Forgetting ANALYZE after a big load leaves the optimizer with stale statistics, causing terrible plans on the freshly-loaded data (Chapter 24's classic): always ANALYZE after a substantial load.

A few more: full reloads instead of incremental make pipelines slower every run as the source grows — load only what's new via a high-water mark. Silently dropping bad rows loses data without anyone noticing — quarantine and report bad rows from the staging table. Maintaining indexes during a huge initial load is far slower than dropping and rebuilding them (for initial loads, not live tables). Huge single transactions for a massive load can exhaust resources or hold locks too long — batch appropriately. The unifying theme is that bulk data operations have different performance characteristics and failure modes than the row-at-a-time operations of earlier chapters, so the row-at-a-time instincts (just INSERT it, load straight into the table, re-run if it fails) cause problems at bulk scale. The bulk discipline — COPY not loops, stage-validate-transform, incremental and idempotent, ANALYZE after, drop/rebuild indexes for big loads — is what avoids these mistakes. Each is a specific, common, avoidable error, and knowing them as a checklist lets you load data at scale without the painful surprises that catch those who apply row-at-a-time thinking to million-row jobs.

Summary

Row-by-row INSERT is the slow way to move bulk data; COPY is the fast path — streaming rows in bulk, orders of magnitude faster, in both directions (COPY ... FROM to load, COPY ... TO to unload), with CSV/text/binary formats. Use \copy for client-side files and the driver's copy support for fast loads from code. The staging-table pattern separates fast loading (bulk COPY into a loose, optionally UNLOGGED table) from correct transformation (SQL: validate, cast, resolve keys, upsert) — essentially ELT in miniature. Pipelines come in ETL (transform before load) and ELT (load raw, transform in SQL — favored with modern warehouses) shapes, and robust ones are incremental and idempotent. For big loads: COPY not loops, drop/rebuild indexes, batch transactions, and ANALYZE after.

You can now: - Use COPY/\copy to bulk-load and unload data, and from application code. - Explain why COPY is far faster than INSERT loops. - Apply the staging-table pattern (load raw → transform with SQL → upsert). - Distinguish ETL vs. ELT and design incremental, idempotent loads. - Apply bulk-load performance techniques (drop/rebuild indexes, UNLOGGED, ANALYZE).

What's next. Chapter 32 — Database Security — the close of Part V: SQL injection (revisited), roles and privileges (GRANT/REVOKE), row-level security, encryption in transit and at rest, backups, and data-privacy (GDPR) implications. Protecting the data you've learned to move.

Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.

In This Chapter

Chapter 31: Bulk Data Operations and ETL Foundations — COPY, Loading, and Moving Data at Scale

Row-by-row is the slow way

COPY: the fast path

Server-side vs. client-side: \copy

COPY from application code

Why row-by-row is slow, quantified

COPY in depth

Staging tables: load, then transform

Data formats and interchange

ETL and ELT

Bulk-load performance tips

Incremental loads and change data capture

Idempotency: pipelines you can safely re-run

A worked scenario: a daily import pipeline

Common mistakes

Bulk-load performance, in depth

The modern data stack and where pipelines run

Data movement as a discipline

Progressive project: load data the fast way

A field guide to bulk-loading mistakes

Summary

Server-side vs. client-side: `\copy`