Case Study 2 — The Plan That Went Insane After a Data Load

DataField.Dev

Case Study 2 — The Plan That Went Insane After a Data Load

Sometimes the indexes are fine and the query is fine, but the plan is catastrophic — because the optimizer is working from stale statistics. A bulk import turned a millisecond query into a multi-minute one, and the fix was a single ANALYZE.

Background

A data team ran a nightly job that bulk-loaded the day's events into a large events table (tens of millions of rows), then ran several downstream queries to aggregate them. The load itself was fast. But one downstream query — a join between the freshly-loaded events and a devices table — suddenly took minutes, where the day before it had taken under a second. No code had changed. No index had been dropped. The query was identical.

Diagnosis

The engineer ran EXPLAIN ANALYZE and immediately saw something bizarre:

 Nested Loop  (cost=... rows=12 ...) (actual time=... rows=4,800,000 ...)   ← estimate 12, actual 4.8 MILLION
   ->  Seq Scan on events  (estimated rows=10  ... actual rows=4,800,000)
   ->  Index Scan on devices  (loops=4,800,000)
 Execution Time: 240,000 ms

The smoking gun was the massive estimate-vs-actual mismatch: the optimizer estimated the events scan would return about 10 rows, but it actually returned 4.8 million. Believing only ~10 rows would come out, the optimizer chose a nested loop — a fine choice for 10 rows, a disaster for 4.8 million (it did 4.8 million index lookups into devices). The plan wasn't "broken"; it was a reasonable plan for the row counts the optimizer believed — which were catastrophically wrong.

Why were the estimates so wrong? The bulk load had inserted millions of new rows, but statistics hadn't been updated yet. The optimizer was planning based on pg_statistic data from before the load — when the table was nearly empty or had a very different distribution. Autovacuum's ANALYZE hadn't caught up to the sudden influx. The planner was navigating with a stale map.

The fix

One command — refresh the statistics:

ANALYZE events;

Re-running EXPLAIN ANALYZE, the estimate now matched reality (~4.8M rows), and the optimizer — correctly seeing it was joining millions of rows — switched from a nested loop to a hash join:

 Hash Join  (cost=... rows=4,800,000 ...) (actual rows=4,800,000 ...)
   ->  Seq Scan on events   (estimated rows=4,800,000 — now accurate)
   ->  Hash on devices ...
 Execution Time: 3,200 ms

240 seconds → 3 seconds. The query, the indexes, and the data were all unchanged; only the statistics — and therefore the optimizer's plan — changed. The team added an explicit ANALYZE events; to the end of the bulk-load job, so the optimizer always plans against fresh statistics after a load.

The analysis

Bad plans often mean stale statistics, not a broken optimizer. The optimizer chooses the cheapest plan for the row counts it believes. When those beliefs (statistics) are wrong, it makes confident, terrible choices. The query and indexes can be perfect and the plan still be a disaster.
The estimate-vs-actual rows gap is the tell. Estimated 10, actual 4.8M is the signature of stale stats. Whenever a plan looks insane, first compare estimated to actual rows — if they diverge wildly, run ANALYZE before suspecting anything else.
Big data changes invalidate statistics. A bulk load, a mass UPDATE/DELETE, or a TRUNCATE+reload can leave statistics describing a table that no longer exists. Autovacuum's ANALYZE will catch up eventually, but not necessarily before your downstream queries run.
ANALYZE after bulk operations. Make refreshing statistics part of any job that changes a lot of data — exactly why this book's generate_data.sql ends with ANALYZE. It's cheap insurance against the optimizer planning blind.
A reasonable algorithm with bad inputs gives bad outputs. The nested loop wasn't a "wrong" choice in the abstract — it's optimal for ~10 rows. The failure was the input (the row estimate), not the decision rule. Fix the input (statistics), and the optimizer makes the right call on its own.

Discussion questions

Why did the optimizer choose a nested loop, and why was that catastrophic here?
What single number in the plan revealed the problem? What does the gap mean?
Why were the statistics stale, and why didn't autovacuum save the query in time?
Why did ANALYZE alone fix it, with no query or index change?
⭐ What operations should trigger an explicit ANALYZE in a data pipeline, and why?