Case Study 1 — Splitting a Column Without Downtime

DataField.Dev

Case Study 1 — Splitting a Column Without Downtime

Expand-contract in practice. A team needed to split a single name column into first_name and last_name on a live, high-traffic users table — a breaking change. Done in one step, it would have caused an outage. Done in three backward-compatible phases, it was invisible to users.

Background

A product stored each user's name in a single users.name column. A new feature (personalized greetings, sorting by last name) needed it split into first_name and last_name. The table had tens of millions of rows and served constant read/write traffic from a web app and several background services.

The naive plan — "rename/split in one migration" — has a fatal flaw: the moment the schema changes, every piece of running code that still does INSERT INTO users (name, ...) or SELECT name breaks. You can't deploy the schema change and all the code that uses it at the same instant across a fleet of servers and services. There's always a window where old code meets new schema (or vice versa). For a breaking change, that window is an outage.

So they used expand-contract: three deploys, each individually safe and backward-compatible.

Phase 1 — Expand (add the new, keep the old)

A migration adds the new columns, nullable, without touching name:

ALTER TABLE users ADD COLUMN first_name text;
ALTER TABLE users ADD COLUMN last_name  text;

This is additive and safe (nullable columns, brief lock). The app is unchanged and keeps using name. Nothing breaks; the new columns just sit empty.

Then deploy app code that writes to both — when a user is created or updated, it sets name and first_name/last_name (dual-write), but still reads from name. Old rows have null new-columns, but that's fine because reads still use name.

Phase 2 — Migrate (backfill + switch reads)

Backfill the new columns for existing rows, in batches (so no long lock):

-- run repeatedly until no rows remain (re-runnable):
UPDATE users
SET first_name = split_part(name, ' ', 1),
    last_name  = substring(name from position(' ' in name) + 1)
WHERE first_name IS NULL
  AND user_id IN (SELECT user_id FROM users WHERE first_name IS NULL LIMIT 10000);

(Name-splitting is imperfect in reality — they handled edge cases separately — but the pattern is the point.) Once backfill is complete and verified, deploy app code that reads from first_name/last_name while still writing both. Now the app uses the new columns everywhere, but name is still maintained, so any not-yet-updated service still works.

Phase 3 — Contract (remove the old)

Once every service reads and writes the new columns and nothing references name, deploy code that stops writing name, then a final migration drops it:

ALTER TABLE users DROP COLUMN name;

Safe now, because nothing depends on it. The split is complete, and at no point did the running system see a schema it couldn't handle.

The outcome

Users noticed nothing. Each phase was a normal, low-risk deploy, independently verifiable and reversible (until the final drop). The change took longer in wall-clock time (three deploys over days) than a one-line RENAME — but a one-line rename would have meant an outage and a frantic rollback. Slower-but-safe won decisively.

The analysis

Breaking changes can't be atomic across schema + code. You deploy the schema and the (many) code consumers at different times, so there's always a window where they coexist. For a breaking change, that window is downtime. Expand-contract eliminates the window by making every phase backward-compatible.
Expand is additive; contract is subtractive; migrate bridges them. Add the new without removing the old (safe), move data and traffic to the new while the old still works (no breakage), then remove the old once unused (safe). Each phase, on its own, never breaks anything.
Dual-write is the bridge. Writing to both old and new during the transition is what lets old and new code coexist. It's temporary scaffolding, removed in the contract phase.
Backfill in batches, re-runnably. Tens of millions of rows can't be updated in one statement without a punishing lock; batching keeps the system responsive, and WHERE ... IS NULL makes it resumable.
The drop is last and deliberate. Dropping name is irreversible (the data's gone), so it happens only after you're certain nothing uses it — the textbook contract-phase action.

Discussion questions

Why can't a column rename/split be done safely in a single migration on a live system?
Walk through what each of the three deploys (and migrations) does, and why each is backward-compatible.
What is "dual-write," and why is it essential during the migrate phase?
Why must the backfill be batched and re-runnable?
⭐ Why is dropping the old column the last step, and what's the rollback plan if a problem appears after the drop?