Chapter 22: Schema Migration and Evolution — Changing Your Database Without Breaking Everything

DataField.Dev

29 min read

> Where you are: Part III, Chapter 22 of 40 — the close of Database Design. You can design a schema; this chapter is about changing one that's already live and full of data, safely. It's the discipline that turns the scary ALTERs of Chapter 14 into...

In This Chapter

Schemas change — the only question is how
What a migration looks like
Migrations in a team and in CI/CD
Migration tools
Safe vs. dangerous changes
Zero-downtime: the expand-contract pattern
Data migrations
Testing migrations and rollback
The anatomy of a migration system
Safe versus dangerous changes: the locking model
Expand-contract, fully worked
Common mistakes
PostgreSQL's safe-migration toolkit
Data migrations, testing, and rollback
Part III in retrospect
A field guide to migration disasters
Progressive project: evolve your schema safely
Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 22: Schema Migration and Evolution — Changing Your Database Without Breaking Everything

Where you are: Part III, Chapter 22 of 40 — the close of Database Design. You can design a schema; this chapter is about changing one that's already live and full of data, safely. It's the discipline that turns the scary ALTERs of Chapter 14 into routine, reversible operations.

Learning paths: 💻 🏗️ especially — production schema change is a core professional skill; 🔬 CS students should understand the expand-contract pattern.

Schemas change — the only question is how

No schema is final. Requirements evolve, features ship, mistakes get corrected. The naive approach — hand-typing ALTER TABLE into a production shell — is how outages happen (Chapter 14's "ALTER that locked the store") and how environments drift (dev ≠ staging ≠ prod). The professional approach treats schema change as a migration: a versioned, reviewed, repeatable, reversible script, applied the same way to every environment.

A migration is a small, ordered change to the schema (and sometimes the data), recorded as a file in version control alongside your application code. The set of applied migrations defines the database's current structure; running them in order rebuilds it from scratch. This is infrastructure as code for your database.

What a migration looks like

Each migration is a step with an "up" (apply) and ideally a "down" (revert):

-- migrations/0042_add_wishlists.up.sql
CREATE TABLE wishlists (
    wishlist_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customers(customer_id) ON DELETE CASCADE,
    product_id  integer NOT NULL REFERENCES products(product_id)  ON DELETE CASCADE,
    added_at    timestamptz NOT NULL DEFAULT now(),
    UNIQUE (customer_id, product_id)
);

-- migrations/0042_add_wishlists.down.sql
DROP TABLE wishlists;

Migrations are numbered/ordered (or timestamped) so they always apply in the same sequence, and a tool tracks which have run (in a schema_migrations table). The properties that make this work:

Version-controlled — schema changes live in git with the code that needs them.
Ordered & repeatable — the same migrations applied to dev, staging, and prod yield identical schemas.
Reviewed — a migration is code; it goes through code review (where a reviewer catches the missing WHERE, the lock-heavy ALTER, the irreversible drop).
Reversible (ideally) — a "down" migration undoes the change if something's wrong.

Why this matters. This is the cure for the two recurring disasters of Part II/III: ad-hoc production edits (Chapter 13's missing WHERE, Chapter 14's locking ALTER) and environment drift (Chapter 2's version-skew). Migrations make schema change boring — which, in production, is exactly what you want.

Migrations in a team and in CI/CD

Migrations truly earn their keep on a team, where multiple developers change the schema concurrently and changes flow through automated pipelines to multiple environments. Understanding how migrations fit a team workflow shows why the discipline matters beyond solo work, and it's where the version-controlled, ordered, append-only properties pay off most.

On a team, each developer writes migrations alongside the feature code that needs them, and both go through code review together. This is a major benefit: a migration is code, so it gets reviewed like code, and a reviewer can catch the missing WHERE, the lock-heavy ALTER that needs CONCURRENTLY, the breaking change that should be expand-contract, or the irreversible drop that lacks a backup plan — before it ever runs on production. Reviewing schema changes as code, rather than typing them ad hoc into a production console, is one of the largest safety improvements migrations provide. The version-control integration also means schema changes are tied to the code that depends on them: the migration adding a wishlists table lands in the same pull request as the wishlist feature, so the schema and the code that uses it move together through review, staging, and production.

The ordering discipline becomes important with concurrent development. If two developers each write migration number 43 on separate branches, merging them creates a conflict that must be resolved (renumbering one to 44) — which is good, because it forces an explicit decision about the order, rather than two changes silently colliding. Timestamp-based versioning (common in tools like Rails and Alembic) reduces these conflicts since timestamps rarely collide, though it can allow migrations to apply in a surprising order across branches; numbered migrations make order explicit at the cost of merge conflicts. Either way, the system surfaces concurrent schema changes rather than letting them clash invisibly.

In CI/CD pipelines, migrations are applied automatically as part of deployment: the pipeline runs pending migrations against each environment (staging, then production) before or alongside deploying the new application code. This automation is what makes the expand-contract pattern practical — the three phases become three deploys through the pipeline, each applying its migrations and its code together. It also enforces that every environment stays in sync: because the pipeline applies the same migrations everywhere, dev, staging, and production converge on identical schemas, eliminating the drift that ad-hoc changes cause. The migration tracking table per environment lets the pipeline apply exactly what each needs. Well-run teams treat database migrations with the same rigor as application deploys — reviewed, automated, tested on staging first, with rollback plans — because a bad migration can cause an outage just as surely as bad code. This integration of schema change into the normal software-delivery process, rather than treating it as a separate manual ritual, is the mature endpoint of the migration discipline.

Migration tools

You rarely hand-roll the runner; mature tools manage ordering, tracking, and rollback:

Flyway — SQL-first; numbered .sql files; simple and widely used.
Liquibase — XML/YAML/SQL changelogs; database-agnostic; rich features.
Alembic — the Python/SQLAlchemy migration tool (you'll meet it in Chapter 30); autogenerates migrations from model changes.
Framework-built-in: Rails Active Record migrations, Django migrations, etc.

They all do the same core job: apply pending migrations in order, record what's applied, and (usually) support rollback. Pick the one that fits your stack.

Safe vs. dangerous changes

The heart of the discipline: knowing which changes are cheap and safe versus which are risky on a live, populated table (building on Chapter 14).

Generally safe (fast, brief lock): - Adding a nullable column with no default (or a constant default, on modern PostgreSQL). - Adding a new table. - Adding an index CONCURRENTLY (doesn't block writes). - Adding a CHECK/foreign-key constraint as NOT VALID, then VALIDATE separately.

Dangerous (lock-heavy, slow, or irreversible) — handle with care: - Dropping a column — irreversible (data gone); also can break running app code that still references it. - Renaming a column or table — breaks any code still using the old name the instant it runs. - Changing a column's type — may rewrite the whole table (long exclusive lock) and can fail if data doesn't convert. - Adding NOT NULL to a populated column, or a validating constraint — scans/locks the table.

Common mistake. Treating a rename or type change as a one-step migration on a live system. The instant the migration runs, application code expecting the old shape breaks — even a few seconds of mismatch is an outage. The solution is to never make a breaking change in one step. That's the expand-contract pattern.

Zero-downtime: the expand-contract pattern

The key insight for changing a live system: the database schema and the application code deploy at different times, so for a window they must both work. A breaking change (rename, type change, splitting a column) is therefore done in three phases, each individually safe and backward-compatible:

Expand — add the new structure alongside the old, without removing anything. (Safe: additive.)
Migrate — backfill data into the new structure and update the application to write to both (dual-write) and read from the new. (No breakage: old still works.)
Contract — once everything uses the new structure and no code references the old, remove the old. (Safe: nothing depends on it anymore.)

Example — renaming customers.phone to phone_number with zero downtime:

   EXPAND:   ALTER TABLE customers ADD COLUMN phone_number text;   -- add new (nullable)
             [deploy code that writes BOTH phone and phone_number, reads phone]

   MIGRATE:  UPDATE customers SET phone_number = phone WHERE phone_number IS NULL;  -- backfill (batched)
             [deploy code that reads phone_number, still writes both]

   CONTRACT: [deploy code that uses ONLY phone_number]
             ALTER TABLE customers DROP COLUMN phone;             -- remove old, now unused

At no point is there a moment where the running code and the schema disagree. Each step is reversible, deployed and verified independently. It's more steps than a one-line RENAME, but it's the difference between a routine change and an outage. The same three-phase shape handles splitting a column, changing a type (add new-typed column, backfill, switch, drop old), and extracting a table.

Data migrations

Migrations aren't only structural — sometimes you must transform data: backfill a new column, normalize a denormalized one (Chapter 19), split a full_name into first/last. Key practices:

Batch large data migrations (update N thousand rows at a time, committing each) so you don't hold a long lock or generate a huge transaction (Chapter 14, 27).
Make them idempotent / re-runnable where possible (WHERE new_col IS NULL), so a partial run can resume safely.
Separate big backfills from the schema change — add the column in one migration, backfill in a controlled job, enforce constraints in a later migration (the safe NOT NULL pattern from Chapter 14).

Testing migrations and rollback

Test every migration against a copy of production-like data before prod — including the "down" path. A migration that works on an empty dev DB can fail or lock on millions of real rows.
Apply in a transaction where possible (PostgreSQL runs most DDL transactionally) so a failing migration rolls back cleanly. (Note: CREATE INDEX CONCURRENTLY and a few others can't run in a transaction.)
Have a rollback plan. A "down" migration is ideal, but some changes (a dropped column's data) can't be reverted — for those, the rollback plan is "restore from backup" (Chapter 38), which is why dropping is a contract-phase, last-step action after you're sure.
lock_timeout so a blocked migration fails fast instead of queuing behind app queries and freezing the system (Chapter 14).

The anatomy of a migration system

Understanding how a migration system works underneath the tools demystifies the whole discipline and helps you use any tool (or build a simple one) correctly. At its core, a migration system solves one problem: given a database in some state and a set of ordered change-scripts, apply exactly the scripts that haven't run yet, in order, and record what's been done. Everything else is refinement of that idea.

The mechanism is a special tracking table — conventionally schema_migrations — that the system maintains in the database itself, recording which migrations have been applied (by their version number or filename). When you run the migration tool, it compares the migrations present in your migrations/ folder against the records in schema_migrations, finds the ones not yet recorded, and applies them in order, inserting a record for each as it succeeds. This is how the same set of migration files produces the same schema on every environment: dev, staging, and production each have their own schema_migrations table tracking their own progress, and running the tool brings each up to date by applying whatever it's missing. A freshly-created database has an empty tracking table, so running the tool applies all migrations in order, rebuilding the schema from nothing — which is why "run the migrations" is how you stand up a new environment.

This design gives migrations their defining properties. They're ordered because each has a version (a sequential number or a timestamp) and the tool applies them in that order, so a migration can safely depend on the ones before it. They're idempotent at the suite level because the tracking table prevents re-applying what's already done — running the tool twice applies nothing the second time. They're repeatable because the same files plus an empty database always produce the same result. And they form a history — the ordered sequence of migrations is a complete, version-controlled record of how the schema evolved, readable like a changelog. The tracking table is the small, clever piece that makes all of this work, turning a pile of SQL scripts into a reliable, repeatable schema-evolution system.

The practical implication is that migrations are append-only: once a migration has been applied to any shared environment, you don't edit it — you write a new migration to make further changes. Editing an already-applied migration breaks the model, because environments that already ran the old version won't pick up your edit (it's already recorded as done), causing the very drift migrations exist to prevent. This append-only discipline feels restrictive at first ("I just want to fix the migration I wrote an hour ago") but it's essential: the migration history is a shared, immutable record, and changing the past breaks everyone who already moved past it. The rule is: before it's merged/shared, edit freely; once it's applied anywhere others rely on, only ever add new migrations forward. Understanding the tracking-table mechanism is what makes this rule make sense rather than seem arbitrary.

Safe versus dangerous changes: the locking model

The single most important practical knowledge in this chapter is which schema changes are safe to run on a live, populated, busy table and which are dangerous — and the key to understanding it is locking. Most schema changes take a lock on the table while they run, and the question is which lock, held for how long, because that determines whether the change is invisible to users or an outage.

The dangerous changes are dangerous for one of two reasons: they take a strong lock (blocking reads and/or writes) for a long time, or they break running application code. A change that rewrites the whole table — like changing a column's type in a way that requires reformatting every row — takes an exclusive lock for as long as the rewrite takes, which on a large table is minutes, during which every query on that table queues behind the lock, cascading into timeouts and an effective outage. Adding a NOT NULL constraint or a validating CHECK/foreign-key constraint must scan the whole table to verify existing rows comply, holding a lock during the scan. These aren't dangerous because they're wrong — they're correct SQL — they're dangerous because of the lock duration on a live system. The insidious part is that they're instant on your empty dev database and a multi-minute outage on production, so they pass every test and then strike in production. This gap between dev and prod behavior is why "it worked in dev" is no assurance for schema changes.

The other danger — breaking running code — is subtler and is the reason renames and type changes can't be done in one step. The database schema and the application code deploy at different moments, so there's always a window where the new schema must work with the old code, or the old schema with the new code. Rename phone to phone_number in one migration, and the instant it runs, all the still-deployed code referencing phone breaks — even a few seconds of mismatch between schema and code is an outage. The change is "safe" in the sense of not locking long, but "dangerous" in the sense of breaking the code that expects the old shape. This is what the expand-contract pattern exists to solve.

The safe changes, by contrast, are additive and quick-locking. Adding a nullable column (or one with a constant default, which modern PostgreSQL handles without rewriting rows) is fast and breaks nothing (old code ignores the new column). Adding a new table affects nothing existing. Creating an index CONCURRENTLY builds it without blocking writes. Adding a constraint as NOT VALID (which skips the initial full-table scan) and then VALIDATEing it separately (which takes a weaker lock) splits a dangerous operation into two safe-ish ones. The discipline, then, is to classify every migration before running it on production: does this take a strong lock for long (rewrite, scan)? Does this break code expecting the old shape (rename, drop, type change)? If either, it needs special handling — the expand-contract pattern, the CONCURRENTLY/NOT VALID techniques, batching, or a maintenance window. If neither (additive, quick), it's routine. That classification, run on every change, is what keeps schema migrations from becoming outages.

Expand-contract, fully worked

The expand-contract pattern is the key to zero-downtime breaking changes, and it's worth working through completely because it's both essential and initially counterintuitive. The core insight, restated because everything depends on it: the schema and the application code deploy at different times, so during every change there's a window where both the old and new must work. Expand-contract makes a breaking change as a sequence of individually non-breaking steps, so that window is always safe.

Let's split customers.full_name into first_name and last_name — a genuinely breaking change (code reads and writes full_name; afterward it must use the two new columns). Done naively in one step, the moment the migration runs, all deployed code referencing full_name breaks. Expand-contract does it in phases, each safe:

PHASE 1 — EXPAND (additive, safe):
   ALTER TABLE customers ADD COLUMN first_name text;   -- add new columns, nullable
   ALTER TABLE customers ADD COLUMN last_name  text;
   -- Deploy code that WRITES all three (full_name + first/last) but still READS full_name.
   -- Old code (reading/writing full_name) still works; new columns are just along for the ride.

PHASE 2 — MIGRATE (backfill + switch reads):
   UPDATE customers SET first_name = split_part(full_name,' ',1),     -- backfill, batched
                        last_name  = split_part(full_name,' ',2)
   WHERE first_name IS NULL;
   -- Deploy code that READS first/last, still WRITES all three.
   -- Now the app uses the new columns; full_name is still maintained for safety/rollback.

PHASE 3 — CONTRACT (remove the old, now-unused):
   -- Deploy code that uses ONLY first/last (stops touching full_name).
   ALTER TABLE customers DROP COLUMN full_name;        -- safe: nothing references it now

Trace why each phase is safe. Phase 1 is purely additive — adding nullable columns breaks nothing, and the deployed code (old or new) all still works because the new columns are optional. Phase 2's backfill is a data migration (batched, re-runnable via the WHERE ... IS NULL guard), and the code switch is safe because both old and new columns are still being written, so code reading either works. Phase 3 only removes full_name after a deploy in which no code references it anymore — so the drop breaks nothing. At no single moment do the running code and the schema disagree, which is the entire definition of zero-downtime. Each phase is also independently deployable and verifiable: you can pause between phases, confirm things are healthy, even roll back a phase, before proceeding.

The cost is obvious — what could be a one-line ALTER ... RENAME becomes three migrations and three coordinated deploys — but the benefit is that a change which would otherwise require downtime (take the app offline, change the schema, deploy new code, come back up) becomes a routine, online, reversible operation. For a system that can't take downtime (most production systems), this trade is overwhelmingly worth it. The same three-phase shape — expand (add new alongside old), migrate (backfill, dual-write, switch reads), contract (remove old) — handles every breaking change: column renames, type changes (add new-typed column, backfill with conversion, switch, drop old), splitting or merging columns, and extracting a table. Once you internalize the pattern, breaking changes stop being scary special cases and become a known, repeatable procedure. That transformation — from "breaking changes need downtime" to "breaking changes are three safe steps" — is one of the most valuable things a database practitioner learns.

Common mistakes

Ad-hoc production edits instead of versioned migrations — drift and untracked changes.
Breaking changes in one step (rename/type change) — use expand-contract.
Unbatched data migrations on huge tables — long locks, outages.
No "down"/rollback plan for a risky change.
Testing only on empty dev databases — the change that's instant on 100 rows locks on 100 million.

PostgreSQL's safe-migration toolkit

PostgreSQL provides specific features that make otherwise-dangerous migrations safe, and knowing them is what lets you apply heavy changes to live tables without outages. These are the tools that turn "this would lock the table for minutes" into "this runs online."

CREATE INDEX CONCURRENTLY builds an index without taking the write-blocking lock that an ordinary CREATE INDEX requires. A normal index build locks the table against writes for the entire build — minutes on a large table, an outage. The CONCURRENTLY variant builds the index in the background, allowing reads and writes throughout, at the cost of taking longer and requiring two passes over the table. The trade-off is almost always worth it on a live system: a slightly slower build that doesn't block anyone beats a fast build that freezes writes. (One caveat: CONCURRENTLY can't run inside a transaction block, which interacts with migration tools that wrap migrations in transactions — you typically mark such migrations as non-transactional.) Adding constraints as NOT VALID splits a dangerous validating change into two safe-ish steps: ADD CONSTRAINT ... NOT VALID adds the constraint without the full-table scan to verify existing rows (so it's fast and takes a brief lock), enforcing it only on new writes; then a separate VALIDATE CONSTRAINT checks the existing rows with a weaker lock that doesn't block reads and writes. The combination achieves what a one-step constraint add would, without the long table-scanning lock.

lock_timeout is a crucial safety net: setting it before a migration (SET lock_timeout = '5s') makes the migration fail fast if it can't acquire its lock within the timeout, rather than queuing indefinitely behind running queries. Without it, a migration that needs a lock can wait behind a long-running query, and while it waits, it blocks every query that arrives after it — so one migration waiting for one slow query can freeze the entire table. With lock_timeout, the migration gives up quickly and you retry later, rather than triggering a pile-up. This single setting prevents a whole class of migration-induced outages and should be standard practice for any migration touching a busy table. Related, statement_timeout bounds how long the migration's statements run, another guard against a runaway operation holding locks too long.

The meta-point is that PostgreSQL gives you the tools to do almost any schema change online, if you know them and use them. The dangerous version (CREATE INDEX, a validating ADD CONSTRAINT, an unbounded lock wait) and the safe version (CREATE INDEX CONCURRENTLY, NOT VALID + VALIDATE, lock_timeout) achieve the same end state, but one is an outage on a live table and the other is invisible to users. Mastering this toolkit — knowing that there's a concurrent or split or timeout-guarded way to do the scary thing — is what separates a DBA who can change production safely from one who needs a maintenance window for every change. Combined with the expand-contract pattern for breaking changes, these techniques mean that nearly every schema change can be done on a live system without downtime. That capability — evolve the running database freely and safely — is the practical payoff of this whole chapter.

Data migrations, testing, and rollback

Schema migrations often come paired with data migrations — backfilling a new column, transforming existing values, normalizing a denormalized field — and these have their own discipline. The cardinal rule is to batch large data migrations rather than doing them in one statement: a single UPDATE over millions of rows holds a long lock and generates a huge transaction (Chapters 14, 20), so instead you update in chunks (10,000 rows at a time), committing between batches, keeping each transaction short and locks brief. Make data migrations idempotent and re-runnable with a WHERE ... IS NULL guard (or similar), so that if a backfill is interrupted halfway, re-running it resumes from where it stopped rather than redoing or double-applying work. And separate the big backfill from the schema change: add the column in one migration, backfill it in a controlled, batched job, and enforce any NOT NULL constraint in a later migration once the data is populated — the safe-NOT NULL pattern from Chapter 14, which avoids the table-scanning lock of adding NOT NULL to an unpopulated column.

Testing migrations is non-negotiable, and the key insight is that a migration must be tested against production-like data, not an empty dev database. The whole danger of schema migrations — the locking, the rewrite time, the constraint-scan duration — scales with table size, so a migration that's instant on 100 dev rows can lock for minutes on 100 million production rows. Testing on a realistic data volume (a recent production backup restored to a staging environment) is the only way to discover that a migration is too slow or lock-heavy before it hits production. Test the down migration too — the rollback path is the one you'll need in a crisis, and it's the one most likely to be untested and broken. PostgreSQL runs most DDL transactionally, so wrapping a migration in a transaction means a failure rolls back cleanly rather than leaving the schema half-changed (the exceptions, like CREATE INDEX CONCURRENTLY, can't be transactional, which is why migration tools handle them specially).

The rollback plan deserves explicit thought for every risky change. A "down" migration that cleanly reverses the change is ideal, and you should write one whenever possible. But some changes can't be reversed by a down migration: once you DROP a column, its data is gone, and no down migration can recreate it. For genuinely irreversible changes, the rollback plan is "restore from backup" (Chapter 38) — which is precisely why destructive operations like dropping a column are done in the contract phase, as the last step, only after you're confident nothing needs the old structure. The sequencing isn't arbitrary: additive and reversible changes go first and freely; irreversible destruction goes last, after verification, with a backup as the ultimate safety net. This layering — reversible-first, destructive-last-with-backup — means that even if something goes wrong, you can almost always recover, either by running the down migration or, in the worst case, by restoring. A schema change without a rollback plan is a bet you can't unwind; with one, even a failed migration is recoverable.

Part III in retrospect

Chapter 22 closes Part III, the heart of database design, and it's worth seeing how the part's six chapters form one integrated discipline. You began with conceptual modeling (Chapter 17) — extracting entities, attributes, and relationships from requirements, the pure-thinking stage where design decisions are cheapest. You learned to translate that model into tables via the mapping rules (Chapter 18), turning a blueprint into a schema. You learned normalization (Chapter 19) — the theory that ensures each fact lives in one place, eliminating the redundancy that causes anomalies — and its deliberate counterpart, denormalization (Chapter 20), reintroducing controlled redundancy where measured read performance demands it. You added a vocabulary of reusable patterns (Chapter 21) — audit trails, soft deletes, hierarchies, multi-tenancy, polymorphic associations — for the problems that recur in every domain. And now you've learned to evolve a schema safely once it's live (Chapter 22), turning change from a feared outage into a routine operation.

Together these form the complete arc of design: model it, build it, make it correct, tune it, pattern it, and change it safely. The unifying theme, theme #1, has been that design is the most important skill — and across these six chapters you've seen why, repeatedly. The schema is the long-lived foundation everything else rests on; it's expensive to change once it holds data (which is exactly why Chapter 22's techniques matter); and its quality determines how clean your SQL is, how automatic your integrity is, how achievable your performance is, and how comprehensible your whole system is. A well-designed schema makes everything downstream easier; a poorly-designed one makes everything a struggle. That asymmetry is why a whole part of the book — teaching no new SQL syntax — is devoted to design.

What's striking is how the stages reinforce each other. Good conceptual modeling naturally produces normalized schemas (both express "each fact in its place"). The mapping rules are mechanical because the conceptual model did the hard thinking. Normalization and denormalization are two ends of one trade-off, not opposing camps. The patterns build on all of it. And safe migration is what lets you apply this design knowledge to systems that are already running, evolving them without fear. By the end of Part III, you can take a domain from a vague description to a well-designed, correct, performant, pattern-informed schema — and then change that schema safely as requirements evolve. That is the full skill of database design, and it's the foundation on which Part IV's performance work, Part V's application integration, and everything after will build. The SQL of Part II expresses queries against a structure; Part III is where you create the structure, and create it well — which is, in the end, the most consequential thing a database practitioner does.

With design behind you, the natural next question is speed: a well-designed schema can still be slow if the database has to do too much work to answer your queries. Part IV — Performance and Internals — takes that on, beginning with the single most impactful performance tool, the index (Chapter 23), and continuing through query optimization, partitioning, transactions, concurrency, and the internals (MVCC, WAL, VACUUM) that explain why the database behaves as it does. Theme #5 — performance is basic competence, not premature optimization — takes the lead. But everything there builds on the foundation you've just completed: you can only make a well-designed schema fast, because performance work on a badly-designed schema is mostly fighting the design. Design first, then performance — which is exactly the order the book takes, and exactly the order that works.

A field guide to migration disasters

Schema migrations cause some of the most memorable production incidents in software, and nearly all of them fall into a few recognizable categories. Knowing the disasters by name — and their preventions — is how you avoid joining the people who tell these stories at their own expense.

The locking ALTER. Someone runs an ALTER TABLE that takes a strong lock on a large, busy table — adding a NOT NULL constraint that scans every row, changing a column type that rewrites the table, or building an index without CONCURRENTLY. The lock is held for minutes; every query on the table queues behind it; the application times out; the site goes down. The cause is treating a change that's instant in dev as instant in prod. The prevention is the safe-migration toolkit: CONCURRENTLY for indexes, NOT VALID + VALIDATE for constraints, lock_timeout to fail fast rather than queue, and testing on production-sized data to discover lock duration before prod. The breaking rename. Someone renames a column or table in one migration, and the instant it runs, all the still-deployed code referencing the old name breaks — an outage from a "simple" rename. The prevention is expand-contract: never make a breaking change in one step.

The unbatched mega-update. A data migration updates millions of rows in a single statement, holding a long lock and generating a massive transaction that bloats the WAL and can exhaust disk or replication capacity, freezing the system. The prevention is batching — update in chunks, commit between them. The untested down migration. Something goes wrong in production, the team reaches for the rollback, and the down migration is broken or was never written — so there's no clean way back, and recovery means a panicked restore from backup. The prevention is testing the down path as rigorously as the up path, and having a backup-based rollback plan for irreversible changes. The edited applied migration. Someone "fixes" a migration that's already run on some environments, so those environments never pick up the fix while fresh environments get the corrected version — silent drift, the exact thing migrations exist to prevent. The prevention is the append-only rule: never edit an applied migration; always add a new one forward.

The dev-only test. A migration is verified only against an empty or tiny dev database, so its real cost on production volume — the lock duration, the rewrite time, the constraint-scan length — is never measured until it strikes production. The prevention is testing every non-trivial migration against a production-sized dataset (a restored backup on staging) before it ships. Notice the common thread across all these disasters: they happen when a migration is treated casually — typed ad hoc, not reviewed, not tested at scale, not classified for safety, not given a rollback plan. The entire migration discipline exists to prevent exactly these failures, and following it — versioned, reviewed, classified, tested-at-scale, reversible, applied through a pipeline — turns each disaster into a non-event. The teams that experience these incidents are almost always the ones that skipped the discipline; the teams that follow it find schema change genuinely boring, which, for production database changes, is the highest possible praise. Boring is the goal. Every technique in this chapter is in service of making the inherently risky act of changing a live database's structure into something routine and safe — and that transformation, from feared to boring, is the mark of a mature engineering practice.

It's fitting that Part III ends here, on change, because it completes the design story. The earlier chapters taught you to design a schema well; this one teaches you that "well-designed" must include "changeable," since no schema is final and the cost of change is what makes initial design so consequential. A team that can migrate fearlessly can also correct design mistakes that slip through — so migration competence is, paradoxically, what makes imperfect initial design survivable. Design carefully because change is costly; master migration so that when change is needed (and it always is), it's safe. Those two truths, held together, are the complete professional stance toward the database schema: respect it as a long-lived foundation, and command the tools to evolve it without fear.

Progressive project: evolve your schema safely

For your project:

Adopt a migration approach — even just numbered .sql files in a migrations/ folder, applied in order (or a tool like Flyway/Alembic).
Make a routine change as a migration (add a table or a nullable column) with an up and a down.
Plan a breaking change with expand-contract — e.g., rename a column or split name into first/last. Write the three phases (expand, migrate/backfill, contract) and note what each deploy does.
Write a batched data migration for a backfill, and make it re-runnable.

Document your migration workflow. Your schema is now something you can evolve confidently, not fearfully.

Summary

Schemas always change; the professional approach is the migration — a versioned, ordered, reviewed, reversible script in source control, applied identically to every environment by a tool (Flyway, Liquibase, Alembic). Know safe changes (add nullable column/table, CREATE INDEX CONCURRENTLY, NOT VALID+VALIDATE) from dangerous ones (drop, rename, type change, NOT NULL/constraint scans). Make breaking changes with the expand-contract pattern — add the new alongside the old, migrate/dual-write, then remove the old — so the schema and running code never disagree (zero downtime). Batch large data migrations and make them re-runnable; test migrations (including rollback) on production-like data; use lock_timeout and transactions. Migrations turn schema change from a feared outage into a routine, boring, safe operation.

You can now: - Write versioned migrations (up/down) and explain why they beat ad-hoc edits. - Classify changes as safe vs. dangerous on a live table. - Apply the expand-contract pattern for zero-downtime breaking changes. - Write batched, re-runnable data migrations. - Test migrations and plan rollback (down migration or backup).

What's next — Part IV. Your schema is designed and you can evolve it. Part IV — Performance and Internals begins with Chapter 23 (Indexing): making queries fast. Theme #5 — performance is basic competence — takes over, starting with the single most impactful performance tool: the index.

Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.