Case Study 1 — Why Two "Identical" Queries Ran 400× Apart

DataField.Dev

Case Study 1 — Why Two "Identical" Queries Ran 400× Apart

Relational algebra isn't just notation — its equivalences are the engine of query optimization. This case shows two queries that ask for exactly the same thing, are provably equivalent in algebra, and yet run hundreds of times apart in wall-clock time. It's a gentle first encounter with the book's slow-query-to-fast-query anchor (Chapter 23 makes it dramatic).

Background

An analytics team at an online retailer needed a recurring figure: the total revenue from a single large customer (a wholesale account, customer #5) over the company's full order history. The orders table had grown to tens of millions of rows; order_items, the line-item table, was several times larger again.

Two analysts wrote two queries. Both returned the same correct number. One finished in about 30 milliseconds. The other took roughly 12 seconds — 400 times slower. The difference wasn't the data, the hardware, or luck. It was which algebra expression each query handed the database.

The two queries

Query A joined everything first, then filtered:

-- Conceptually: σ customer_id = 5 ( orders ⋈ order_items )  -- filter AFTER the join
SELECT SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.customer_id = 5;

Query B expressed the same idea but in a way that filtered customer #5's orders before reaching the giant line-item table:

-- Conceptually: ( σ customer_id = 5 (orders) ) ⋈ order_items  -- filter BEFORE the join
SELECT SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
WHERE oi.order_id IN (SELECT order_id FROM orders WHERE customer_id = 5);

In pure relational algebra, these are the same expression — the selection on customer_id = 5 can be pushed down past the join (the equivalence from the chapter):

   σ customer_id = 5 (orders ⋈ order_items)
   ≡
   (σ customer_id = 5 (orders)) ⋈ order_items

So why the 400× gap?

What actually happened

A modern optimizer often performs this push-down automatically — and in an ideal world, both queries would compile to the same fast plan. But in this case the team was on a database (and a statistics state) where, for Query A, the planner chose to build the full join first and filter afterward: it scanned and joined tens of millions of order/line-item pairs, then threw away all but customer #5's handful. Most of the work was wasted before the WHERE ever applied.

Query B's shape forced the efficient order: first find customer #5's orders (a tiny set, especially with an index on orders.customer_id), then fetch only the matching line items. The expensive table was touched only for the rows that could possibly contribute. Same answer; a fraction of the work.

When the team ran EXPLAIN on both (the tool you'll master in Chapter 24), the plans told the story at a glance: Query A showed a massive join feeding a late filter; Query B showed a tiny indexed lookup feeding a small join. The algebra was identical; the execution was not.

The resolution

Two lessons turned into action:

They added the right index. With an index on orders.customer_id, even Query A's planner gained a cheaper option and the gap narrowed dramatically. (Indexes are how you give the optimizer good choices — Chapter 23.)
They learned to write filter-early shapes and to check the plan rather than trust that two equivalent queries would run equally. Equivalence guarantees the same answer, not the same cost.

The analysis

Algebra equivalence is about correctness; cost is about execution. Two expressions can be provably equal and still differ enormously in how much work each plan does. The optimizer's job is to find a cheap plan among the equivalent ones — but it can only choose from plans it can see, and only as well as its statistics and your indexes allow.
"Push selection down" is the single most valuable instinct. Filter to the rows you care about as early as possible, before expensive joins and aggregations. Often the optimizer does it for you; writing the query that way anyway never hurts and sometimes rescues a bad plan.
You are handing the database an algebra expression. Internalizing that reframes query writing: you're not issuing a command to be obeyed literally, you're describing a result and a structure the optimizer will work from. Better structure → more chances for a fast plan.
This is the seed of the book's performance anchor. A query that takes seconds dropping to milliseconds by changing how the same result is computed is the recurring drama of Chapters 23–24. Here it came from algebra ordering; there it will come from indexes and reading EXPLAIN. Same theme: performance is basic competence.

Discussion questions

State the algebra equivalence that makes Query A and Query B return the same result. Why is it valid?
Explain in plain terms why filtering before the join did less work than filtering after.
The team's first fix was an index, not a query rewrite. Why might adding an index let both queries run fast?
Equivalence guarantees the same answer but not the same speed. What tool would you use to see why two equivalent queries differ in cost? (Foreshadowing Chapter 24.)
⭐ Is "always filter early" an absolute rule, or are there cases where the optimizer legitimately filters late? Speculate, then revisit after Chapter 24.