Case Study 2 — The Catalog Page That Timed Out

Correlated subqueries are expressive and readable — and occasionally a performance trap, because conceptually they run once per outer row. A product page that worked fine in testing fell over in production, and the fix was to stop asking the same little question thousands of times.

Background

An e-commerce catalog page listed products with, for each, its review count and average rating. The query used correlated subqueries — one per metric, per product:

-- Catalog listing   ⚠️ correlated subqueries per row
SELECT p.product_id, p.name, p.price,
       (SELECT COUNT(*)    FROM reviews r WHERE r.product_id = p.product_id) AS num_reviews,
       (SELECT AVG(rating) FROM reviews r WHERE r.product_id = p.product_id) AS avg_rating
FROM products p
WHERE p.is_active
ORDER BY p.name;

On the test database (a few hundred products), it returned in milliseconds. In production — 80,000 active products and millions of reviews — the page took 40+ seconds and frequently timed out. Customers got spinners and errors on the main catalog.

What went wrong

The query has two correlated subqueries, each evaluated once per product row. With 80,000 products, that's up to 160,000 separate little aggregate queries against the reviews table for a single page load. Even if each is fast, 160,000 of them is not — and they scanned reviews repeatedly. The cost scaled with the number of products times the work per subquery, which is exactly the shape of a query that's instant on small data and catastrophic on large data.

(In some cases a modern optimizer can de-correlate such subqueries into a single join + group-by. Here it could not fully, partly because there were two separate correlated subqueries over the same table — so the per-row cost stood.)

The fix: aggregate once, then join

Instead of asking "how many reviews does this product have?" 80,000 times, ask "how many reviews does each product have?" once, then join the pre-computed summary:

-- Aggregate reviews ONCE per product, then join
SELECT p.product_id, p.name, p.price,
       COALESCE(r.num_reviews, 0) AS num_reviews,
       r.avg_rating
FROM products p
LEFT JOIN (
    SELECT product_id, COUNT(*) AS num_reviews, AVG(rating) AS avg_rating
    FROM reviews
    GROUP BY product_id
) r ON r.product_id = p.product_id
WHERE p.is_active
ORDER BY p.name;

The derived table scans reviews one time, producing one summary row per product; the LEFT JOIN attaches it (with COALESCE so products with no reviews show 0). Page time dropped from 40 seconds to well under a second. The two queries return the same data — the difference is entirely in how many times the work is done.

For even better latency on a read-heavy page, the team later cached the per-product summary in a materialized view (Chapter 15), refreshed periodically — so the page reads pre-aggregated rows directly. And an index on reviews(product_id) (Chapter 23) made both the join and any remaining lookups fast.

The analysis

  1. Correlated subqueries run per outer row. Two correlated subqueries over 80,000 rows is up to 160,000 executions. Expressive, but the cost is multiplicative. Watch for them in SELECT lists over large tables.

  2. "Aggregate once, then join" is the standard remedy. Replace per-row correlated lookups with a single grouped subquery (derived table or CTE) joined to the outer table. One pass over the child table instead of N.

  3. Small-data testing hides scaling bugs. The query was instant on hundreds of rows and lethal on tens of thousands. Test performance-sensitive queries against production-scale data (this is exactly why the book ships generate_data.sql — load it for Chapters 23–25).

  4. EXPLAIN ANALYZE would have shown it immediately (Chapter 24): a plan with the subqueries re-executing per row, versus the join plan with a single aggregate. Reading the plan turns "why is this slow?" from guesswork into diagnosis.

  5. Layer the fixes by need: rewrite to a join (correctness-preserving, big win) → index the join column → cache with a materialized view if it's hot. Each step is covered later in the book; the instinct to form now is don't ask the same question thousands of times.

Discussion questions

  1. Why was the original query fast in testing but slow in production? What scaled?
  2. Explain how "aggregate once, then join" reduces the number of times reviews is scanned.
  3. Why is COALESCE(r.num_reviews, 0) needed after the LEFT JOIN?
  4. What tool would you use to confirm which plan each version produces, and what would you look for?
  5. ⭐ When might two correlated subqueries be acceptable (i.e., when does the per-row cost not matter)? When is a materialized view the better answer than a rewrite?