Case Study 1 — Reading the Plan to Fix a Slow Report

DataField.Dev

Case Study 1 — Reading the Plan to Fix a Slow Report

A full diagnose → fix → verify cycle with EXPLAIN ANALYZE. A nightly report degraded to the point of missing its window. Instead of guessing, the engineer read the plan, found the exact bottleneck, applied a targeted fix, and confirmed it — the entire optimization discipline in one story.

Background

A nightly "top customers by revenue this quarter" report had to finish before the morning batch. It started taking over 20 minutes and began overrunning its window, delaying everything downstream. The query:

SELECT c.customer_id, c.last_name, SUM(oi.quantity * oi.unit_price) AS revenue
FROM customers c
JOIN orders o       ON o.customer_id = c.customer_id
JOIN order_items oi ON oi.order_id   = o.order_id
WHERE o.order_date >= '2024-01-01' AND o.order_date < '2024-04-01'
GROUP BY c.customer_id, c.last_name
ORDER BY revenue DESC
LIMIT 50;

Step 1 — Diagnose (read the plan)

The engineer ran EXPLAIN (ANALYZE, BUFFERS) and read it bottom-up. The key nodes:

 ... Sort (revenue DESC) ...
   -> HashAggregate ...
     -> Hash Join (o.order_id = oi.order_id) ...
       -> Seq Scan on orders o
            Filter: (order_date >= '2024-01-01' AND order_date < '2024-04-01')
            Rows Removed by Filter: 92,000,000           ← red flag #1
       -> Seq Scan on order_items oi  (actual time=... rows=250000000)  ← red flag #2
 Execution Time: 1,260,000 ms

Two red flags jumped out:

Seq Scan on orders with Rows Removed by Filter: 92,000,000 — the date filter was applied by scanning all 100M orders and discarding 92M of them. There was no index on order_date, so the database had no way to jump to the quarter's orders.
A full Seq Scan on order_items (hundreds of millions of rows) feeding the join — the entire line-item table was being read even though only one quarter's orders mattered.

The plan made the problem unambiguous: the query was reading the two largest tables in their entirety because nothing let it narrow to the date range first.

Step 2 — Fix (targeted, plan-driven)

The fix followed directly from the diagnosis:

-- index the date filter so the quarter's orders can be found directly
CREATE INDEX CONCURRENTLY idx_orders_order_date ON orders (order_date);
-- ensure the join column on order_items is indexed (it is, in Mercado)
-- CREATE INDEX CONCURRENTLY idx_order_items_order_id ON order_items (order_id);
ANALYZE orders;   -- refresh stats so the planner trusts the new index

The order_date index lets the database fetch only Q1's orders (a few million) instead of scanning 100M; the existing order_items(order_id) index then lets the join fetch only the matching line items via an indexed lookup, instead of scanning the whole table.

Step 3 — Verify (re-read the plan)

Re-running EXPLAIN ANALYZE confirmed the plan changed exactly as intended:

 ... Sort ...
   -> HashAggregate ...
     -> Nested Loop ...
       -> Index Scan using idx_orders_order_date on orders   (rows=3,100,000)   ← narrowed first!
            Index Cond: (order_date >= '2024-01-01' AND order_date < '2024-04-01')
       -> Index Scan using idx_order_items_order_id on order_items   (per matching order)
 Execution Time: 41,000 ms

The two full Seq Scans were gone. The query now narrows to the quarter first (the index on order_date), then joins only the relevant line items. Runtime dropped from ~21 minutes to ~41 seconds — and the report comfortably made its window. The engineer didn't guess which index to add; the plan told them (Rows Removed by Filter on order_date), and re-reading the plan proved the fix.

The analysis

The plan tells you the problem — read it, don't guess. Two red flags (Rows Removed by Filter and full Seq Scans) named the exact issue: no way to narrow by date. Optimization without EXPLAIN is guesswork; with it, it's diagnosis.
Rows Removed by Filter is gold. A filter discarding tens of millions of rows means those rows were read only to be thrown away — an index on the filter column would skip them entirely. It points straight at the missing index.
Filter early (Chapter 4, again). The fix made the query narrow to the date range first, before the expensive join — exactly "selection pushdown," now achieved by giving the optimizer an index it could use.
Verify by re-reading the plan. The before/after plans prove the Seq Scans became Index Scans. "It's faster now" is a hope; "the plan no longer scans 100M rows" is evidence.
ANALYZE after adding an index/loading data. Refreshing statistics ensures the planner actually uses the new index instead of sticking with a stale plan.

Discussion questions

What two red flags in the original plan identified the problem?
Why does Rows Removed by Filter: 92,000,000 imply a missing index?
How did the fix make the query "filter early," and why is that faster?
Why re-run EXPLAIN ANALYZE after the fix instead of just timing the query?
⭐ Why ANALYZE orders after creating the index? What might happen without it?