Case Study 1 — The Overnight Import That Became a Coffee Break

DataField.Dev

Case Study 1 — The Overnight Import That Became a Coffee Break

The same data, the same database, the same machine — but loaded with COPY instead of an INSERT loop, a nightly import dropped from six hours to four minutes. A vivid lesson in why the bulk path exists.

Background

A company ingested a daily product/pricing feed from a supplier — about 4 million rows of CSV. The import job, written by an application developer, read the CSV in Python and inserted each row with a separate execute():

# ❌ row-by-row INSERT loop
for row in csv_reader:
    cur.execute(
        "INSERT INTO products_feed (sku, name, price, category) VALUES (%s, %s, %s, %s)",
        row,
    )
    conn.commit()        # (worse: committing per row!)

The job took about six hours every night. It barely finished before the business day began, and any hiccup meant it ran into working hours, slowing the live database. The team assumed they needed a bigger server.

Diagnosis

The bottleneck wasn't CPU or disk capacity — it was the method. Four million separate INSERT statements means four million round trips to the server, four million statement parses/plans, and (because of the per-row commit()) four million transaction commits, each forcing a WAL flush to disk (Chapter 28). The per-row overhead, multiplied by four million, was the six hours. The data volume was trivial for PostgreSQL; the row-at-a-time protocol was the cost.

This is precisely what COPY exists to eliminate.

The fix: COPY

The team rewrote the load to use COPY (via the driver's copy support), streaming the rows in bulk into a staging table:

# ✅ bulk load with COPY (psycopg 3)
with conn.cursor() as cur:
    with cur.copy("COPY products_feed (sku, name, price, category) FROM STDIN WITH (FORMAT csv)") as copy:
        for row in csv_reader:
            copy.write_row(row)
conn.commit()            # ONE commit for the whole load

(Equivalently, \copy products_feed FROM 'feed.csv' WITH (FORMAT csv, HEADER true) from psql.) The result: the load dropped from ~6 hours to ~4 minutes — roughly two orders of magnitude. Nothing about the hardware or the data changed; only the path the rows took into the database. COPY streams rows in bulk with minimal per-row overhead and a single transaction, instead of millions of individual statements and commits.

They layered on the other bulk techniques for the few remaining minutes:

Loaded into an UNLOGGED staging table (skip the WAL for re-loadable data), then transformed into the real table.
Dropped non-essential indexes on the staging target before the load and rebuilt after (Chapter 23).
Ran one transaction, not per-row commits.
ANALYZEd after loading so downstream queries planned well (Chapter 24).

The analysis

COPY is the bulk path; INSERT loops are not. For millions of rows, the per-statement overhead of individual INSERTs — round trips, parsing, and especially per-row commits — dominates. COPY streams in bulk and pays that overhead once. Two orders of magnitude is typical, not exceptional.
Per-row commit() is a special disaster. Committing each row forces a WAL flush per row (durability has a cost, Chapter 28). Even short of COPY, batching many rows per transaction (or one transaction for the load) is a huge win. The original code committed 4 million times.
It wasn't a hardware problem. The team almost bought a bigger server. The real issue was the load method. "It's slow, buy more hardware" is often wrong — diagnose the how first (as with indexing, Chapter 23, and N+1, Chapter 30).
Stack the bulk techniques. COPY + UNLOGGED staging + drop/rebuild indexes + one transaction + ANALYZE after — together they turn a punishing load into a quick one. Each addresses a specific overhead.
Know the loading hierarchy. single-row inserts → multi-row inserts → batched transactions → COPY. When data volume grows, move up the hierarchy. For genuine bulk, go straight to COPY.

Discussion questions

Why did 4 million INSERTs take six hours when the data volume is trivial for PostgreSQL?
What specific overheads does COPY eliminate that an INSERT loop pays per row?
Why is per-row commit() especially costly (tie to the WAL, Chapter 28)?
Why was "buy a bigger server" the wrong instinct here?
⭐ List the additional bulk-load techniques the team used and what overhead each removes.