Case Study 1 — The Overnight Import That Became a Coffee Break
The same data, the same database, the same machine — but loaded with
COPYinstead of anINSERTloop, a nightly import dropped from six hours to four minutes. A vivid lesson in why the bulk path exists.
Background
A company ingested a daily product/pricing feed from a supplier — about 4 million rows of CSV. The import job, written by an application developer, read the CSV in Python and inserted each row with a separate execute():
# ❌ row-by-row INSERT loop
for row in csv_reader:
cur.execute(
"INSERT INTO products_feed (sku, name, price, category) VALUES (%s, %s, %s, %s)",
row,
)
conn.commit() # (worse: committing per row!)
The job took about six hours every night. It barely finished before the business day began, and any hiccup meant it ran into working hours, slowing the live database. The team assumed they needed a bigger server.
Diagnosis
The bottleneck wasn't CPU or disk capacity — it was the method. Four million separate INSERT statements means four million round trips to the server, four million statement parses/plans, and (because of the per-row commit()) four million transaction commits, each forcing a WAL flush to disk (Chapter 28). The per-row overhead, multiplied by four million, was the six hours. The data volume was trivial for PostgreSQL; the row-at-a-time protocol was the cost.
This is precisely what COPY exists to eliminate.
The fix: COPY
The team rewrote the load to use COPY (via the driver's copy support), streaming the rows in bulk into a staging table:
# ✅ bulk load with COPY (psycopg 3)
with conn.cursor() as cur:
with cur.copy("COPY products_feed (sku, name, price, category) FROM STDIN WITH (FORMAT csv)") as copy:
for row in csv_reader:
copy.write_row(row)
conn.commit() # ONE commit for the whole load
(Equivalently, \copy products_feed FROM 'feed.csv' WITH (FORMAT csv, HEADER true) from psql.) The result: the load dropped from ~6 hours to ~4 minutes — roughly two orders of magnitude. Nothing about the hardware or the data changed; only the path the rows took into the database. COPY streams rows in bulk with minimal per-row overhead and a single transaction, instead of millions of individual statements and commits.
They layered on the other bulk techniques for the few remaining minutes:
- Loaded into an
UNLOGGEDstaging table (skip the WAL for re-loadable data), then transformed into the real table. - Dropped non-essential indexes on the staging target before the load and rebuilt after (Chapter 23).
- Ran one transaction, not per-row commits.
ANALYZEd after loading so downstream queries planned well (Chapter 24).
The analysis
-
COPYis the bulk path;INSERTloops are not. For millions of rows, the per-statement overhead of individualINSERTs — round trips, parsing, and especially per-row commits — dominates.COPYstreams in bulk and pays that overhead once. Two orders of magnitude is typical, not exceptional. -
Per-row
commit()is a special disaster. Committing each row forces a WAL flush per row (durability has a cost, Chapter 28). Even short ofCOPY, batching many rows per transaction (or one transaction for the load) is a huge win. The original code committed 4 million times. -
It wasn't a hardware problem. The team almost bought a bigger server. The real issue was the load method. "It's slow, buy more hardware" is often wrong — diagnose the how first (as with indexing, Chapter 23, and N+1, Chapter 30).
-
Stack the bulk techniques.
COPY+UNLOGGEDstaging + drop/rebuild indexes + one transaction +ANALYZEafter — together they turn a punishing load into a quick one. Each addresses a specific overhead. -
Know the loading hierarchy. single-row inserts → multi-row inserts → batched transactions →
COPY. When data volume grows, move up the hierarchy. For genuine bulk, go straight toCOPY.
Discussion questions
- Why did 4 million
INSERTs take six hours when the data volume is trivial for PostgreSQL? - What specific overheads does
COPYeliminate that anINSERTloop pays per row? - Why is per-row
commit()especially costly (tie to the WAL, Chapter 28)? - Why was "buy a bigger server" the wrong instinct here?
- ⭐ List the additional bulk-load techniques the team used and what overhead each removes.