Case Study 2 — The Import That Corrupted Production (and the Staging Table That Would Have Saved It)

DataField.Dev

Case Study 2 — The Import That Corrupted Production (and the Staging Table That Would Have Saved It)

Loading raw, unvalidated data straight into constrained production tables — and committing it directly — turned one bad source file into a production data incident. The staging-table pattern, plus idempotency, would have caught the problem safely and made the retry harmless.

Background

A nightly job imported a partner's CSV directly into the live products table:

# ❌ load raw partner data straight into production, row by row, committing as it goes
for row in csv_reader:
    cur.execute("INSERT INTO products (sku, name, price, category_id) VALUES (%s,%s,%s,%s)",
                (row.sku, row.name, row.price, lookup_category(row.category)))
    conn.commit()

One night the partner's file was malformed: a shifted column meant price values landed in the name-adjacent field, some prices were the string "N/A", and some category names didn't exist. The job:

Aborted partway on the first "N/A" price (a type error) — but because it had been committing per row, the half-loaded, partly-corrupt rows were already in production. The catalog now had products with garbage names and missing/wrong data, visible to customers.
When ops re-ran the job after "fixing" it, the rows that had loaded were inserted again (no upsert), creating duplicates.

A bad input file had corrupted the live catalog and then duplicated data on retry. Cleanup took a day and an emergency restore of the affected rows.

What went wrong

Two design failures, both addressed by this chapter:

No staging / validation layer. Raw, untrusted partner data went straight into the constrained production table, row by row, committing as it went. So a bad file partially corrupted production before failing, instead of being caught in a safe holding area. There was no "validate before it touches production" step.
Not idempotent. The plain INSERT (no ON CONFLICT) meant a re-run duplicated the already-loaded rows. A failed/retried load wasn't safe.

The fix: stage, validate, transform, upsert

The team rebuilt the import around the staging-table pattern and idempotency:

-- 1. Bulk-load raw data into a LOOSE staging table (all text, no constraints) — fast, can't corrupt prod
CREATE UNLOGGED TABLE staging_products (sku text, name text, price text, category text);
TRUNCATE staging_products;
-- \copy staging_products FROM 'partner_feed.csv' WITH (FORMAT csv, HEADER true)

-- 2. VALIDATE in the staging table — report/quarantine bad rows BEFORE touching production
SELECT * FROM staging_products
WHERE price !~ '^[0-9]+(\.[0-9]+)?$'                    -- non-numeric prices
   OR NOT EXISTS (SELECT 1 FROM categories c WHERE c.name = category);  -- unknown categories
-- (if too many bad rows → ABORT the whole import; the partner file is broken)

-- 3. Transform + UPSERT only the GOOD rows into production, in ONE transaction
INSERT INTO products (sku, name, price, category_id)
SELECT s.sku, s.name, s.price::numeric, c.category_id
FROM staging_products s
JOIN categories c ON c.name = s.category
WHERE s.price ~ '^[0-9]+(\.[0-9]+)?$'
ON CONFLICT (sku) DO UPDATE                              -- idempotent: re-run is safe
  SET name = EXCLUDED.name, price = EXCLUDED.price, category_id = EXCLUDED.category_id;

Now:

Bad data can't reach production. The raw file lands in the loose staging table; validation runs there. If the file is malformed (many bad rows), the import aborts before touching products — production stays clean.
One transaction moves only good rows into production atomically — no half-loaded state (Chapter 26).
Idempotent. ON CONFLICT means a re-run updates rather than duplicates — a retried or repeated load is harmless.
Bad rows are reported, not silently dropped — the team sees what the partner sent wrong.

When the same malformed file arrived again under the new pipeline, validation flagged it, the import aborted with a clear report ("3,200 rows have non-numeric prices"), and production was untouched. The team emailed the partner instead of cleaning up a corrupted catalog.

The analysis

Never load raw, untrusted data straight into constrained production tables. A bad file can abort mid-load (leaving partial corruption) or push subtly-wrong data into live tables. Stage it first, validate, then move only good rows.
The staging table is a safety buffer. Loading into a loose table (text columns, no constraints) is fast and isolates raw data from production. Validation and transformation happen in the buffer; production sees only clean, transformed rows.
Make loads idempotent. Without ON CONFLICT/upsert, a re-run duplicates data — so the natural response to a failure (retry) causes a new problem. Idempotency makes retries safe, which is essential for any reliable pipeline.
Move good rows in one transaction. Per-row commits (here) leave half-applied corruption on failure; a single transforming INSERT ... SELECT is atomic — all good rows or none (Chapter 26).
Report bad rows; don't silently drop or blindly insert them. The staging table lets you count and inspect bad rows and decide (abort? quarantine? proceed?) — turning a silent corruption into an actionable signal.

Discussion questions

How did loading raw data + per-row commits let one bad file corrupt production?
Why did the re-run create duplicates, and what fixes that?
How does the staging table prevent bad data from reaching production?
Why move good rows in one transaction rather than row-by-row?
⭐ Design the abort policy: how many/what fraction of bad rows should halt the whole import vs. proceed with the good ones? What factors decide?