30 min read

> Where you are: Part IV, Chapter 23 of 40 — the start of Performance and Internals, where theme #5 takes over: performance is basic competence, not premature optimization. This chapter is home to the book's slow-query anchor: a query that takes 45...

Chapter 23: Indexing — The Data Structure That Makes Databases Fast

Where you are: Part IV, Chapter 23 of 40 — the start of Performance and Internals, where theme #5 takes over: performance is basic competence, not premature optimization. This chapter is home to the book's slow-query anchor: a query that takes 45 seconds, fixed to 12 milliseconds — by adding one index.

Learning paths: 💻 📊 🔬 🏗️ — everyone. Reading a slow query and knowing whether an index will fix it is a baseline professional skill, not wizardry.


The most impactful thing you'll learn about performance

Throughout this book we've previewed it: the same query can take milliseconds or minutes depending on whether the database can use an index. Now we make good on it. Indexing is the single highest-leverage performance tool you have — and understanding it is, as theme #5 insists, basic competence, not advanced wizardry.

Load the large practice dataset for this part (sql/generate_data.sql — ~100K orders, ~250K order items), so the numbers are real.

The anchor: 45 seconds → 12 milliseconds

A common query: "all orders for customer #4267."

SELECT * FROM orders WHERE customer_id = 4267;

On the large dataset without the right index, PostgreSQL must read every row of orders and check each one's customer_id — a sequential scan of 100,000 rows. As the table grows to millions, this takes seconds, then tens of seconds. Add an index on customer_id:

CREATE INDEX idx_orders_customer_id ON orders (customer_id);

Now PostgreSQL jumps directly to the matching rows via the index — an index scan — returning them in milliseconds. Same query, same result, ~3,000× faster. That gap is the difference between an app that works and one that falls over, and it comes down to one line of DDL. (Mercado pre-creates this particular index; the dramatic version is when you find a missing one.)


What an index actually is

An index is a separate data structure, stored alongside your table, that lets the database find rows by a column's value without scanning the whole table — exactly like the index at the back of a book lets you find a topic without reading every page.

Without an index, finding rows matching a condition is O(n) — proportional to table size (read everything). With the right index, it's roughly O(log n) — proportional to the logarithm of table size. The difference is staggering at scale: for a billion rows, O(n) is a billion operations; O(log n) is about 30. That's the whole game.

The trade, stated up front (we'll return to it): an index makes reads faster but makes writes slower (every insert/update must also update the index) and uses storage. Indexes are not free — they're a deliberate trade of write cost for read speed.


The B-tree: the default index

PostgreSQL's default (and most useful) index is the B-tree — a balanced tree that keeps keys sorted and shallow. A B-tree node holds many keys, so the tree stays very short even for huge tables:

   B-tree (simplified): each lookup walks root → branch → leaf

                 [ 50 | 100 ]              ← root
                /     |      \
        [..30..] [..70..] [..150..]        ← branch nodes
         /  |  \   ...        ...
      leaves with actual keys → row pointers

Because each level multiplies the keys, a B-tree's height grows logarithmically: even a table of hundreds of millions of rows has a B-tree only ~4–5 levels deep. So a lookup touches ~5 nodes instead of scanning millions of rows — that's the O(log n) win, concretely. B-trees also serve range queries (price BETWEEN 100 AND 500), sorting (ORDER BY), and prefix matching (LIKE 'Ultra%'), because they keep keys in order.

Theory → Practice. "An index turns an O(n) scan into an O(log n) lookup by maintaining a sorted structure" is theme #3 (understand the WHY) in one sentence. Knowing why tells you when: an index helps when you're searching, range-scanning, or sorting by its column(s) — and doesn't help when you need most of the table anyway.


When an index helps

The database can use an index for a column that appears in:

  • WHERE equality/range: WHERE customer_id = 4267, WHERE price > 200, WHERE order_date >= '2024-01-01'.
  • JOIN conditions: JOIN orders o ON o.customer_id = c.customer_id — index the join columns (Chapter 6's lesson). The #1 cause of slow joins is an unindexed join column.
  • ORDER BY / GROUP BY: an index in the right order can supply sorted rows without a sort step.
  • UNIQUE constraints: enforced by a unique index (automatic).

Mercado pre-indexes the busiest columns (orders.customer_id, order_items.product_id, products.category_id, reviews.product_id) — and deliberately leaves others unindexed so the performance chapters can have you diagnose and add them.


Finding the missing index

The skill isn't memorizing which columns to index — it's diagnosing a slow query and recognizing it needs one. The tool is EXPLAIN (the whole of Chapter 24), but the headline: run EXPLAIN ANALYZE <query> and look for a Seq Scan (sequential scan) on a large table where you expected to fetch few rows. That's the smell of a missing index. Add the index on the filtered/joined column, re-run, and watch the plan switch to an Index Scan — the 45s→12ms moment.

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 4267;
-- BEFORE: Seq Scan on orders  (... actual time=... rows=100000 scanned)  → slow
-- (add the index)
-- AFTER:  Index Scan using idx_orders_customer_id  (... rows=N)           → fast

The index types

B-tree handles most needs, but PostgreSQL offers specialized indexes for specialized data:

Index Best for Example
B-tree (default) equality, ranges, sorting, prefix LIKE WHERE price > 200, ORDER BY order_date
Hash equality only (=) rarely needed; B-tree usually as good
GIN "contains" on composite values: JSONB, arrays, full-text attributes @> '{"ram_gb":16}', tsvector @@ ... (Ch. 16)
GiST ranges, geometric/spatial, nearest-neighbor range overlaps, PostGIS, daterange &&
BRIN huge tables naturally ordered by the column (e.g., time-series) a billion rows by created_at, tiny index

Plus important variations on these:

  • Multi-column index: indexes several columns together — (customer_id, order_date). Column order matters: it serves queries filtering on a leftmost prefix (customer_id, or customer_id + order_date), but not order_date alone. Order the columns by how you query.
  • Partial index: indexes only some rows — CREATE INDEX ... ON orders (customer_id) WHERE status = 'pending'. Smaller and faster when you only ever query that subset (and the basis for the soft-delete unique trick, Chapter 21).
  • Expression index: indexes a computed value — CREATE INDEX ... ON customers (lower(email)) so WHERE lower(email) = '...' can use it. This is the fix for the "function on a column defeats the index" problem from Chapter 8.
  • Covering index / index-only scan: if an index INCLUDEs all the columns a query needs, PostgreSQL can answer from the index alone, never touching the table — the fastest case.

Composite indexes and the leftmost-prefix rule

Multi-column (composite) indexes are where indexing gets genuinely strategic, and the rule that governs them — the leftmost-prefix rule — is one of the most important and most-misunderstood pieces of indexing knowledge. A composite index on (customer_id, order_date) doesn't just index two columns separately; it indexes them together, in order, building a single sorted structure ordered first by customer_id, then by order_date within each customer. That ordering is the key to understanding what the index can and can't do.

The leftmost-prefix rule states that a composite index can serve queries filtering on a leftmost prefix of its columns — the first column alone, or the first two together, or the first three together, and so on — but not columns that skip the leading ones. An index on (customer_id, order_date) accelerates WHERE customer_id = 4267 (uses the first column), and WHERE customer_id = 4267 AND order_date >= '2024-01-01' (uses both, in order), but it does not help WHERE order_date >= '2024-01-01' alone — because the index is sorted by customer_id first, so the order dates are scattered throughout it, not grouped. To find all rows in a date range regardless of customer, the database would have to scan the whole index, which is no better than scanning the table. The analogy is a phone book sorted by (last name, first name): it's great for finding "Smith" or "Smith, John," but useless for finding everyone named "John" regardless of surname, because the Johns are scattered across every letter.

This rule has direct consequences for how you design composite indexes: order the columns by how you query them, putting the columns used in equality filters first, then range filters, then sort columns. A query that filters customer_id = X (equality) and then ranges on order_date wants the index (customer_id, order_date) — equality column first, so the index narrows to one customer's rows, then those are already sorted by date for the range and any ORDER BY order_date. Reverse the order to (order_date, customer_id) and the same query is poorly served. So a single composite index isn't symmetric — the column order encodes which queries it accelerates, and getting it right requires knowing your query patterns. A useful corollary: a composite index on (a, b) can often replace a single-column index on a (since a is a leftmost prefix), so you sometimes get two indexes' worth of coverage from one — but it can't replace an index on b alone.

The practical skill is matching composite indexes to your actual multi-column query patterns. When you repeatedly filter on the same combination of columns — customer_id and status, or category_id and price — a composite index on that combination, ordered with the equality-filtered column first, serves those queries far better than separate single-column indexes (which the database would have to combine, less efficiently). But don't over-create them: each composite index is still an index with write and storage costs, and a composite index serves fewer distinct query shapes than its column count might suggest (only the leftmost prefixes). Design composite indexes deliberately for your hot multi-column queries, order their columns by the equality-then-range-then-sort principle, and remember the leftmost-prefix rule when reasoning about whether a given query can use a given index. That reasoning — "can this query use this index, given its column order?" — is exactly what EXPLAIN (Chapter 24) lets you verify, and what separates effective composite indexing from hopeful index-sprinkling.


Partial and expression indexes

Two specialized index variations deserve emphasis because they solve common problems elegantly and are underused. Partial indexes index only a subset of a table's rows — those matching a WHERE clause in the index definition — making them smaller, faster to maintain, and more targeted than a full index. The classic use is when you only ever query a particular subset: if your application constantly queries pending orders (WHERE status = 'pending') but rarely cares about the completed ones that dominate the table, a partial index CREATE INDEX ... ON orders (created_at) WHERE status = 'pending' indexes only the pending rows. Since pending orders might be 1% of the table, this index is tiny — a fraction of a full index's size — yet it fully accelerates the queries that matter, and it costs far less to maintain on writes (only writes that touch pending rows update it). Partial indexes are also the solution to the soft-delete unique-constraint problem from Chapter 21 (UNIQUE ... WHERE deleted_at IS NULL), enforcing uniqueness only among active rows.

Expression indexes index the result of a computed expression rather than a raw column, which fixes the recurring "function on a column defeats the index" problem (Chapter 8). When you query WHERE lower(email) = 'alice@example.com', a plain index on email can't help, because the index stores the original-case emails, not their lowercased forms — so the database can't use it to find a lowercased match. An expression index, CREATE INDEX ON customers (lower(email)), indexes the lowercased values, so a query filtering on lower(email) can use it directly. This is the proper fix for case-insensitive search and any query that must filter on a transformed value: index the transformation. The same applies to date extraction (CREATE INDEX ON orders (date_trunc('month', order_date)) for monthly queries), JSON extraction, and computed values generally. The rule is that the index must match the expression in your query exactly — an index on lower(email) helps WHERE lower(email) = ... but not WHERE email = ..., and vice versa.

Both partial and expression indexes embody a deeper principle: the index should match how you actually query. A general-purpose full index on the raw column is the default, but when your queries consistently target a subset (partial) or a transformation (expression), a specialized index serves those specific queries better and cheaper than forcing a general index to do a job it's not shaped for. The trade-off is specificity: a partial index only helps queries matching its WHERE, and an expression index only helps queries using its exact expression, so they're targeted tools, not general ones. But for the right query pattern, they're dramatically more efficient than the alternatives — a tiny partial index where a huge full one would be wasteful, an expression index where no plain index could help at all. Recognizing when your query pattern calls for one of these, rather than reflexively reaching for a plain full index, is a mark of indexing sophistication, and it often turns a query that "can't be indexed" into one that's fast.


The specialized index types

While B-tree handles the great majority of indexing needs — equality, ranges, sorting, prefix matching — PostgreSQL's specialized index types handle data shapes B-tree can't, and knowing which to reach for is essential when you work with JSON, full-text, arrays, ranges, or huge ordered tables. Each type matches a specific kind of query that B-tree serves poorly or not at all.

GIN (Generalized Inverted Index) is the type for "does this composite value contain that element?" queries — exactly the operations on JSONB, arrays, and full-text from Chapter 16. A GIN index on a JSONB column accelerates containment (@>) and key-existence (?); on a tsvector, it accelerates full-text match (@@); on an array, it accelerates contains and overlap. GIN indexes the elements inside a composite value (the keys in a document, the lexemes in a document, the items in an array), which is what lets it answer "which rows contain this element" quickly — the inverted-index idea that powers search engines. If you use JSONB, full-text, or array operators on a real-sized table, a GIN index is what makes them fast; without it, those queries scan every row. GiST (Generalized Search Tree) handles geometric, spatial, and range queries — "do these ranges overlap?" (&&), "is this point near that one?", PostGIS geospatial operations, and the range exclusion constraints from Chapter 16. Where B-tree understands linear order (less-than, greater-than), GiST understands multidimensional and overlapping relationships that have no single linear ordering.

BRIN (Block Range Index) is a fascinating specialist for very large tables that are naturally ordered by the indexed column — the classic case being time-series data, where rows are inserted in roughly timestamp order and the table grows to billions of rows. Instead of indexing every row (like B-tree), BRIN stores only the min and max value per block of the table, making it astonishingly small — a BRIN index on a billion-row table might be kilobytes where a B-tree would be gigabytes. The trade-off is that BRIN only works well when the data's physical order correlates with the indexed column's values (so each block covers a narrow value range); for a column whose values are scattered randomly through the table, BRIN is useless. But for append-only, time-ordered big data, BRIN delivers most of an index's benefit at a tiny fraction of the size and maintenance cost. Hash indexes serve equality-only lookups and are rarely needed, since B-tree handles equality nearly as well while also serving ranges and sorting — reach for Hash only in narrow cases where you've measured a benefit.

The selection logic is straightforward once you know the types: B-tree for the common cases (equality, ranges, sorting, prefix); GIN for "contains" queries on JSONB/arrays/full-text; GiST for ranges, spatial, and overlap; BRIN for huge naturally-ordered tables; Hash almost never. The key insight is that the index type must match the kind of query, not just the column — a JSONB column queried by containment needs GIN, not B-tree, even though both are "indexes." Reaching for B-tree on a JSONB containment query (and finding it doesn't help) is a common confusion; knowing the specialized types prevents it. Most of your indexes will be B-tree, but recognizing the cases that call for GIN, GiST, or BRIN — and using them — is what makes PostgreSQL's advanced data types (Chapter 16) actually perform, completing the picture of that chapter's "index them or they're slow" warning.


The cost of indexes — and when NOT to index

Every index you add:

  • Slows writes. Each INSERT/UPDATE/DELETE must also update every affected index. A table with 10 indexes writes much slower than one with 2.
  • Uses storage. Indexes can collectively rival the table's own size.
  • Must be maintained (VACUUM, bloat — Chapter 28).

So you do not index everything. Skip or reconsider an index when:

  • The table is small. A sequential scan of a few hundred rows is already instant; an index adds cost for no benefit. (Mercado's sample data is tiny — indexes matter on the generated data.)
  • The column has low cardinality. Indexing a boolean or a status with 4 values rarely helps — the index can't narrow much (a partial index on the rare value can, though).
  • The table is write-heavy and the column is rarely queried. The write penalty outweighs the read benefit.
  • You're fetching most of the table anyway. If a query returns 80% of rows, a sequential scan is faster than an index scan (the optimizer knows this and will choose the scan).

The discipline: index the columns you actually filter, join, and sort on — and no more. Add indexes in response to measured slow queries (via EXPLAIN), not speculatively.

Common mistake. Indexing a column but then wrapping it in a function in the WHEREWHERE lower(email) = '...' can't use a plain index on email (Chapter 8). Either index the expression (ON customers (lower(email))) or store the data normalized. The index must match how you query.


Why the planner sometimes ignores your index

A confusing experience for newcomers: you add an index, and the database doesn't use it — the query still does a sequential scan. This isn't a bug; it's the planner making a cost-based decision, and understanding why deepens your indexing intuition considerably. PostgreSQL's planner (Chapter 24) doesn't use an index just because one exists — it estimates the cost of using the index versus scanning the table, and picks the cheaper plan. Sometimes the scan genuinely is cheaper, and the planner is right to skip the index.

The key concept is selectivity — what fraction of the table a query's condition matches. An index shines when a query selects a small fraction of rows: jumping to a handful of matching rows via the index beats reading the whole table. But when a query selects a large fraction — say, 60% of the table — using the index can actually be slower than a sequential scan, because the database would follow the index to most of the rows anyway, jumping around the table in index order (random access), when it could have just read the whole table sequentially (fast sequential access) and filtered. There's a crossover point — typically somewhere around 5–10% of the table, depending on row size and storage — above which a sequential scan wins. So a query like WHERE status = 'active' where most rows are active will (correctly) ignore an index on status and scan, because the index can't narrow the result enough to be worth the random access. This is why low-cardinality columns are poor index candidates: their conditions usually match too many rows for an index to help.

The planner makes these decisions using statistics it maintains about each table — how many rows, how many distinct values per column, the distribution of values — gathered by ANALYZE (run automatically by autovacuum, Chapter 28). When the statistics are accurate, the planner estimates selectivity well and chooses correctly. When they're stale — after a big data change that ANALYZE hasn't caught up with — the planner can mis-estimate and choose a bad plan, including ignoring an index it should use or using one it shouldn't. This is why "the query got slow after a big import" sometimes resolves with a manual ANALYZE: you refreshed the statistics, and the planner now estimates correctly. Understanding that the planner is cost-based and statistics-driven explains both why it sometimes ignores your index (the scan was genuinely cheaper) and why it sometimes makes mistakes (stale statistics led it astray).

The practical implications are twofold. First, don't assume an unused index is a problem — if the planner skips your index for a query that selects most of the table, it's probably right, and the fix (if the query is slow) is rethinking the query or accepting that this query just touches a lot of data. Second, keep statistics fresh — autovacuum usually handles this, but after bulk data changes, a manual ANALYZE ensures the planner has accurate information. Chapter 24 goes deep on reading the planner's decisions; the lesson here is that an index is an option the planner chooses among, not a command it must obey, and the choice is a cost estimate based on selectivity and statistics. That understanding turns "why isn't it using my index?!" from a frustration into a diagnosable question: what's the selectivity, and are the statistics current?


An indexing strategy

Pulling the chapter together into a practical strategy: how do you actually decide what to index for a real application? The disciplined approach is demand-driven — index in response to the queries you actually run, measured, rather than speculatively up front. But a few defaults and a repeatable process make it concrete.

Start with the indexes that are almost always right: primary keys (automatically indexed), foreign keys used in joins (frequently missing by default and a top cause of slow joins — index them), and columns in your most common WHERE filters. These cover the bulk of typical query patterns. Then, as the application runs and you observe its behavior, diagnose and add: use the EXPLAIN ANALYZE loop to find slow queries doing sequential scans on large tables to fetch few rows, and add the matching index — the right type (B-tree usually, GIN/GiST/BRIN for special data), the right columns (composite with the leftmost-prefix rule in mind), and the right variation (partial for subsets, expression for transformations). This responsive process ensures every index earns its place by serving a real, measured query.

Periodically, prune: audit pg_stat_user_indexes for indexes with near-zero usage and drop the freeloaders, because every unused index taxes writes for no benefit. And on write-heavy tables especially, be conservative — the write cost of indexes is real, so resist adding indexes that serve rare queries on hot-write tables, where the write penalty outweighs the occasional read benefit. The overarching principle, restated: index the columns you actually filter, join, and sort on, justified by measured queries, and no more. An index without a query to justify it is write-tax; a slow query without an index is a missed opportunity; the strategy is to keep those in balance through measurement.

This strategy embodies theme #5 — performance is basic competence, addressed through measurement rather than guesswork or premature optimization. You don't index everything (premature, costly); you don't index nothing (negligent); you index what the application's real queries need, found by profiling, maintained by pruning. That disciplined, demand-driven, measured approach to indexing is exactly what makes the difference between an application that scales smoothly and one that grinds to a halt as data grows — and it's entirely learnable, which is the whole point of treating performance as basic competence rather than arcane wizardry. Indexes are the highest-leverage tool; this strategy is how you wield them well.

If you remember one thing from this chapter, make it this: an index is a deliberate trade — read speed bought with write cost — applied per-query, justified by measurement, and verified by EXPLAIN. That single sentence captures the whole discipline, and it's the foundation for everything in the rest of Part IV, where you'll learn to read those EXPLAIN plans in full, understand the query optimizer's choices, and round out performance from a single high-leverage tool into a complete, measured practice.

And it's worth ending where the chapter began: indexing is not arcane wizardry reserved for specialists — it's basic competence, learnable by anyone willing to understand what an index is (a sorted side structure), when it helps (selective filters, joins, sorts), what it costs (writes, storage, maintenance), and how to apply it (measure, add the matching index, verify). Every developer who works with a database should be able to look at a slow query, recognize that it's doing a sequential scan to fetch a few rows, and fix it with the right index. That ability — undramatic, methodical, enormously impactful — is exactly the kind of "performance is basic competence" skill that theme #5 promises, and you now have it.


Building indexes safely

On a large, live table, a plain CREATE INDEX locks the table against writes while it builds. Use CONCURRENTLY to build without blocking writes (Chapters 14, 22):

CREATE INDEX CONCURRENTLY idx_orders_order_date ON orders (order_date);

It's slower to build and can't run inside a transaction, but it doesn't take the application down — essential for production.


The write cost, quantified

The trade-off at the heart of indexing — faster reads, slower writes — deserves to be made concrete, because "indexes slow writes" is easy to say and easy to underweight until it bites. Every index on a table is a separate data structure that must be kept consistent with the table, which means every INSERT must add an entry to every index, every DELETE must remove entries from every index, and every UPDATE that changes an indexed column must update the relevant indexes. A table with one index pays one index-maintenance cost per write; a table with ten indexes pays ten. This is linear in the number of indexes, and on a write-heavy table it's very real: an over-indexed table can write several times slower than a lean one.

This reframes the indexing decision as a genuine trade, not a free win. Each index you add speeds some reads but taxes every write. The right number of indexes is therefore the set that accelerates your actual read queries and no more — every index beyond that is pure write-tax with no compensating read benefit. This is why "index everything" is wrong: it's not that extra indexes are merely useless, it's that they actively slow every write to the table while sitting unused. The discipline is to add indexes in response to measured read needs (a slow query that EXPLAIN shows needs one) and to periodically remove indexes that no read query actually uses. An unused index is the worst of both worlds — it pays the full write and storage cost while providing zero read benefit.

PostgreSQL gives you the data to find unused indexes: the pg_stat_user_indexes view tracks how many times each index has been used by a scan, so an index with a near-zero scan count over a long period is a candidate for removal — it's costing writes without earning its keep. Auditing index usage periodically, and dropping the freeloaders, is a real part of database maintenance that keeps write performance healthy. The mental model to carry: indexes are not a sprinkle-on-everything seasoning but a deliberate trade made per-query, where you pay write cost to buy read speed, and you should only make the trade where the read speed is actually needed. A lean set of well-chosen indexes outperforms a sprawling set of speculative ones on any workload that does meaningful writes — which is most of them.


Index maintenance and bloat

Indexes, like tables, require maintenance over time, and understanding why connects to the internals you'll meet in Chapter 28. Because PostgreSQL uses MVCC (multi-version concurrency control), updates and deletes don't immediately reclaim space — they leave dead entries that accumulate in both tables and indexes as bloat. Over time, a heavily-updated table's indexes can bloat, growing larger and slower than the live data warrants, because they retain entries for rows that have been updated or deleted. PostgreSQL's autovacuum process (Chapter 28) routinely cleans this up, but on very high-churn tables, indexes can still degrade, and occasionally need rebuilding with REINDEX (or REINDEX CONCURRENTLY, to avoid locking) to restore them to peak efficiency.

This maintenance dimension is another reason indexes aren't free: beyond the write cost and storage, they're ongoing objects that the database must vacuum and that may occasionally need attention. For most tables, autovacuum handles index maintenance invisibly and you never think about it. But for the highest-churn tables — a queue table with constant inserts and deletes, a counter updated thousands of times a second — index bloat can become a measurable performance problem, and recognizing it (an index much larger than its data warrants, or a query that's slowed despite an index existing) is part of the DBA skill set of Part VII. The tools to inspect it (pg_stat_user_indexes for usage, size functions for bloat) let you monitor index health.

The practical takeaway for now is simply awareness: indexes are living structures, not set-and-forget. They speed reads, tax writes, consume storage, and require maintenance — a fuller accounting of the "indexes aren't free" principle than the headline suggests. None of this argues against indexing; the right indexes are still the single most impactful performance tool you have. It argues for deliberate indexing — adding the indexes your queries need, removing the ones they don't, and understanding that each one is an ongoing commitment, not a one-time gift. That complete picture — indexes as a powerful but costed, maintained trade — is what lets you index like a professional rather than scattering indexes hopefully and wondering why writes got slow.


A worked diagnostic: from slow query to the right index

Let's walk the full diagnostic loop that is the slow-query anchor, because the process of diagnosing and fixing — not just the concept — is the skill this chapter exists to build. The scenario: users report that loading a particular customer's order history has become slow as the database grew. You have a query, SELECT * FROM orders WHERE customer_id = 4267, and a complaint. What do you actually do?

First, reproduce and measure with EXPLAIN ANALYZE (Chapter 24's tool, previewed here): EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 4267. You read the output and see Seq Scan on orders with an actual time of, say, 45 seconds, having examined all 100,000+ rows to find the handful matching customer 4267. That is the diagnosis: a sequential scan on a large table to fetch few rows — the unmistakable smell of a missing index. The database is reading every row and checking each one's customer_id, because it has no faster way to find them. The EXPLAIN output told you both the symptom (seq scan) and the waste (scanned 100,000 rows to return a dozen).

Second, add the targeted index: CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id)CONCURRENTLY because this is a (pretend) live table you mustn't lock. The index is a B-tree on customer_id, exactly matching the equality filter in the query. Third, verify the fix: re-run EXPLAIN ANALYZE on the same query. Now the output shows Index Scan using idx_orders_customer_id, with an actual time of milliseconds, having examined only the matching rows. The plan switched from Seq Scan to Index Scan, and the time dropped from 45 seconds to 12 milliseconds — the same query, the same result, ~3,000× faster, because the database can now jump straight to the matching rows via the index instead of reading the whole table. That switch in the plan is the concrete, observable signature of the fix working.

This loop — measure with EXPLAIN ANALYZE, recognize the seq-scan smell, add the matching index, verify the plan switched to an index scan and the time dropped — is the most valuable practical skill in the chapter, and it generalizes to every indexing problem. The specifics vary (the slow query might need a composite index for a multi-column filter, an expression index for a function-wrapped column, or a GIN index for a JSONB containment), but the loop is always the same: measure, diagnose the scan, add the index that matches the query, verify. Notice what makes it disciplined rather than hopeful: you add the index in response to a measured problem (theme #5 — optimize on evidence), you add the specific index the query needs (not a speculative sprinkle), and you verify it actually helped (the plan changed, the time dropped) rather than assuming. Do this loop on your own slow queries — that's the chapter's progressive-project task — and the 45s→12ms moment stops being a story in a book and becomes a tool in your hands. It is, genuinely, one of the most satisfying and high-leverage skills in all of database work: the ability to take a query that's killing an application and, with one well-chosen index, make it instant.


Progressive project: index your hot paths

With your project's data at some scale (generate test data if needed):

  1. Find a slow query. Run EXPLAIN ANALYZE on the queries your app runs most; look for Seq Scan on large tables where you fetch few rows.
  2. Add the index on the filtered/joined column; re-run EXPLAIN ANALYZE; confirm it switched to an index scan and got faster. (This is the anchor — do it yourself.)
  3. Index your foreign keys that are used in joins (often missing by default).
  4. Identify a column you should NOT index (low cardinality, or write-heavy + rarely queried) and explain why.
  5. Try a partial or expression index if your domain has a natural fit (e.g., WHERE active, or lower(email)).

Document each index and the query it serves. Indexes without a query to justify them are just write-tax.


Summary

An index is a separate, sorted data structure that turns finding rows from an O(n) scan into an O(log n) lookup — the single most impactful performance tool. The default B-tree stays shallow (logarithmic height) and serves equality, ranges, sorting, and prefix matches; specialized types handle other data: GIN (JSONB/arrays/full-text), GiST (ranges/spatial), Hash, BRIN (huge ordered tables), plus multi-column (leftmost-prefix; order matters), partial (subset of rows), expression (computed values — the fix for function-on-column), and covering/index-only scans. Index the columns you filter, join, and sort on — especially foreign keys in joins. But indexes cost write speed, storage, and maintenance, so don't index small/low-cardinality/rarely-queried columns or queries that return most of the table. Diagnose missing indexes with EXPLAIN (a Seq Scan on a big table fetching few rows), build them CONCURRENTLY in production, and justify each with a real query. That is the 45s→12ms skill.

You can now: - Explain what an index is and why it's O(log n) vs O(n). - Recognize when an index helps (WHERE/JOIN/ORDER BY) and add the right one. - Choose among B-tree, GIN, GiST, BRIN, and multi-column/partial/expression indexes. - State the costs of indexes and when not to index. - Diagnose a missing index (Seq Scan smell) and build indexes safely (CONCURRENTLY).

What's next. Chapter 24 — Query Optimization — reading EXPLAIN plans in depth: scan types, join algorithms, cost estimates, and how to diagnose why a query is slow and fix it. The skill that turns "it's slow" into "here's the problem, and here's the fix."


Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.