Chapter 9: Subqueries — Queries Inside Queries

DataField.Dev

27 min read

> Where you are: Part II, Chapter 9 of 40. You can join and aggregate. Now you'll nest one query inside another — the first big payoff of closure (Chapter 4): because a query returns a relation, you can use it inside another query.

In This Chapter

The closure payoff
Scalar subqueries: a query that yields one value
Subqueries with IN: filter by a list
Subqueries in FROM: derived tables
Correlated subqueries: re-evaluated per row
EXISTS and NOT EXISTS: the right way to ask "is there any?"
ANY and ALL
Subquery or join? They often overlap
Subqueries everywhere: the places they can appear
Correlated subqueries and the "compared to its group" pattern
Row and multi-column subqueries
One question, four tools: subquery, join, CTE, window
A gallery of subquery questions
How the optimizer sees subqueries
A performance note (preview of Chapters 23–24)
Subqueries beyond SELECT: modification and constraints
A recap: choosing your subquery shape
A mental model: name the inner question first
When subqueries help, and when they hurt
Debugging subqueries
Progressive project: nest your queries
The closure thread
Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 9: Subqueries — Queries Inside Queries

Where you are: Part II, Chapter 9 of 40. You can join and aggregate. Now you'll nest one query inside another — the first big payoff of closure (Chapter 4): because a query returns a relation, you can use it inside another query.

Learning paths: 💻 📊 🔬 🏗️ — everyone. Subqueries are how you express "compared to the average," "where no matching row exists," and "from the result of this other question."

The closure payoff

In Chapter 4 you learned that relational operations take relations in and return relations out — closure. The practical consequence is enormous: the result of a SELECT is itself a table-like thing, so you can drop a whole query into the spot where a value, a list, or a table would go. A query nested inside another is a subquery (or inner query), and it lets you answer questions that a single flat query can't.

Subqueries come in a few shapes, by what they return:

Scalar subquery — returns a single value (one row, one column). Usable anywhere a value goes.
Row/list subquery — returns one column of several rows. Used with IN, ANY, ALL.
Table subquery — returns a whole table. Used in the FROM clause (a derived table).
Correlated subquery — references the outer query; re-evaluated per outer row. Often paired with EXISTS.

Scalar subqueries: a query that yields one value

A scalar subquery returns exactly one value, so it can sit anywhere a single value is expected — in SELECT, in WHERE, in an expression.

"Products priced above the overall average price":

SELECT name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products)   -- the inner query yields one number
ORDER BY price DESC;

The inner (SELECT AVG(price) FROM products) runs first, produces one value (≈ 482.07), and the outer query filters against it. You can also put a scalar subquery in the SELECT list:

SELECT name, price,
       price - (SELECT AVG(price) FROM products) AS diff_from_avg
FROM products
ORDER BY diff_from_avg DESC;

Common mistake. A scalar subquery that accidentally returns more than one row errors out: ERROR: more than one row returned by a subquery used as an expression. If you write WHERE price = (SELECT price FROM products WHERE category_id = 3) and category 3 has several products, it fails. Use IN (for a list) or add LIMIT 1/an aggregate (for a single value) as appropriate.

Subqueries with IN: filter by a list

When the subquery returns a list of values, test membership with IN. "All orders from gold-tier customers":

SELECT order_id, customer_id, order_date
FROM orders
WHERE customer_id IN (
    SELECT customer_id FROM customers WHERE loyalty_tier = 'gold'
)
ORDER BY order_id;

The inner query produces the set of gold customers' ids; the outer keeps orders whose customer_id is in that set. This is often equivalent to a join — more on choosing between them below.

Common mistake — NOT IN with NULLs (again). As warned in Chapters 3 and 5: if the subquery's result contains any NULL, NOT IN returns no rows (three-valued logic). "Customers with no orders" written as WHERE customer_id NOT IN (SELECT customer_id FROM orders) breaks the day any order has a NULL customer_id. Prefer NOT EXISTS (below) for anti-joins — it's NULL-safe.

Subqueries in FROM: derived tables

A subquery in the FROM clause acts as a temporary table — a derived table (or inline view). It must be given an alias. This is how you aggregate, then query the aggregate:

-- Average revenue per order, computed from a per-order subtotal
SELECT AVG(order_total) AS avg_order_value
FROM (
    SELECT o.order_id, SUM(oi.quantity * oi.unit_price) AS order_total
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY o.order_id
) per_order;     -- the alias is required

The inner query computes one total per order; the outer averages those totals. This "aggregate, then aggregate again at a different grain" pattern is exactly how you'd correctly compute average order value (recall Chapter 7's average-of-averages warning — here each order is weighted equally, which is right). In Chapter 11 you'll see CTEs (WITH) as a more readable way to write the same thing.

Correlated subqueries: re-evaluated per row

A correlated subquery references a column from the outer query, so it can't be evaluated once — it's (conceptually) re-evaluated for each outer row. "Each product alongside the number of reviews it has":

SELECT p.name,
       (SELECT COUNT(*) FROM reviews r WHERE r.product_id = p.product_id) AS num_reviews
FROM products p
ORDER BY num_reviews DESC;

The inner query depends on p.product_id from the outer query — that's the correlation. For each product, it counts that product's reviews. Correlated subqueries are expressive but can be slow (they run per outer row), so the optimizer often rewrites them into joins; when it can't, a join or window function may be faster (Chapters 6, 12).

EXISTS and NOT EXISTS: the right way to ask "is there any?"

EXISTS (subquery) is true if the subquery returns at least one row, false otherwise. It's almost always correlated, and it's the idiomatic, NULL-safe way to express "has a match" / "has no match."

"Customers who have placed at least one order":

SELECT c.first_name, c.last_name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

"Customers who have never placed an order" — the anti-join, done safely:

SELECT c.first_name, c.last_name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

SELECT 1 is a convention — EXISTS only cares whether rows come back, not what's in them, so we don't bother selecting real columns. EXISTS stops at the first matching row (it's efficient), and unlike NOT IN, it handles NULLs correctly. This is why, back in Chapter 6, we said the anti-join can be written as LEFT JOIN … IS NULL or NOT EXISTS — and NOT EXISTS is usually the clearest and safest.

Why this matters. "Find the X that have no related Y" is one of the most common real questions — products never ordered, employees managing no one, accounts with no activity. Burn NOT EXISTS into muscle memory as the default tool, and reserve NOT IN for lists you're certain contain no NULL.

ANY and ALL

ANY (a.k.a. SOME) and ALL compare a value against every element of a subquery's result:

-- Products more expensive than AT LEAST ONE audio product (> the minimum audio price)
SELECT name, price FROM products
WHERE price > ANY (SELECT price FROM products WHERE category_id = 5);

-- Products more expensive than EVERY audio product (> the maximum audio price)
SELECT name, price FROM products
WHERE price > ALL (SELECT price FROM products WHERE category_id = 5);

> ANY (...) means "greater than at least one" (i.e., greater than the minimum); > ALL (...) means "greater than all" (greater than the maximum). In fact IN (...) is exactly = ANY (...). These are occasionally the clearest way to phrase a comparison-against-a-set, though an aggregate (> (SELECT MAX(...))) is often more readable.

Subquery or join? They often overlap

Many subqueries can be rewritten as joins, and vice versa. "Orders from gold customers," three ways:

-- (a) IN subquery
SELECT o.* FROM orders o
WHERE o.customer_id IN (SELECT customer_id FROM customers WHERE loyalty_tier = 'gold');

-- (b) JOIN
SELECT o.* FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE c.loyalty_tier = 'gold';

-- (c) EXISTS
SELECT o.* FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c
              WHERE c.customer_id = o.customer_id AND c.loyalty_tier = 'gold');

All three return the same rows. How to choose:

Use a JOIN when you need columns from both tables in the output. A subquery in WHERE can filter by another table but can't return its columns.
Use IN/EXISTS when you only need to test membership/existence and don't want the other table's columns — it reads as "orders where the customer is gold," which can be clearer.
Use EXISTS/NOT EXISTS for existence and anti-joins — it's NULL-safe and stops at the first match.
Watch for fan-out: a join to a one-to-many table can duplicate rows (Chapter 6); EXISTS never does, because it doesn't multiply — it just tests. This makes EXISTS the safer choice when you only want "does a match exist" without risking duplicates.

The good news (Chapter 4 again): the optimizer frequently rewrites between these forms anyway, so they often run identically. Choose the one that reads clearest, and consult EXPLAIN (Chapter 24) if performance matters.

Subqueries everywhere: the places they can appear

Because of closure, a subquery can stand almost anywhere SQL expects a value, a list, or a table. You've seen WHERE and FROM; it's worth seeing the full range, because recognizing "a subquery fits here" is half the skill.

-- In the SELECT list (a scalar subquery as a computed column):
SELECT p.name,
       p.price,
       (SELECT ROUND(AVG(price), 2) FROM products) AS catalog_avg,
       p.price - (SELECT AVG(price) FROM products)  AS vs_avg
FROM products p;

-- In HAVING (filter groups against a subquery value):
SELECT category_id, AVG(price) AS avg_price
FROM products
GROUP BY category_id
HAVING AVG(price) > (SELECT AVG(price) FROM products);   -- categories pricier than overall

-- In a CASE expression (a subquery as one branch's value):
SELECT name, price,
       CASE WHEN price > (SELECT AVG(price) FROM products)
            THEN 'above average' ELSE 'at or below' END AS standing
FROM products;

The HAVING example is especially useful: "groups whose aggregate beats a global benchmark" is a common analytical question, and it's just a group filter compared against a scalar subquery. Once you internalize that a parenthesized query is a value, you stop seeing subqueries as a special construct and start seeing them as "I can compute that right here." That mental shift — query-as-value, anywhere — is the real payoff of closure made practical.

Correlated subqueries and the "compared to its group" pattern

The correlated subquery's superpower is per-row, per-group comparison. A classic, genuinely hard-to-express-otherwise question: "products priced above the average for their own category." The benchmark is different for each row — it depends on the row's category — which is exactly what correlation provides:

SELECT p.name, p.category_id, p.price
FROM products p
WHERE p.price > (
    SELECT AVG(p2.price)
    FROM products p2
    WHERE p2.category_id = p.category_id     -- correlation: same category as the outer row
)
ORDER BY p.category_id, p.price DESC;

For each product, the inner query computes the average price of that product's category, and the outer query keeps it only if it beats that category-specific average. You could not write this with a single global subquery — the comparison value varies per row. This "compared to my own group" shape appears everywhere: employees paid above their department's average, orders larger than their customer's typical order, scores above a cohort's mean. Recognize the phrase "above average for their…" as the signature of a correlated subquery (or, often more efficiently, a window function — Chapter 12, which can compute the per-group average without re-scanning).

Theory → Practice. Notice the inner query reuses the products table under a different alias (p2). That's the same rename trick (ρ) that made self-joins work in Chapter 6 — the inner and outer references must be distinguishable. Correlated subqueries are, in a sense, self-joins expressed as nesting.

Row and multi-column subqueries

Subqueries aren't limited to a single column. A row subquery returns a tuple, letting you compare several columns at once — handy for "find the row matching this combination":

-- The order_item matching a specific (order, product) pair, by comparing a row:
SELECT * FROM order_items
WHERE (order_id, product_id) = (SELECT 1, 7);

-- Multi-column IN: items whose (order, product) is among a set of pairs:
SELECT * FROM order_items
WHERE (order_id, product_id) IN (
    SELECT order_id, product_id FROM order_items WHERE quantity > 2
);

Comparing (order_id, product_id) as a unit is occasionally exactly what you need — for instance when a table's identity is a composite key (recall order_items' composite PK from Chapter 3). It's a niche tool, but when the question is naturally about a combination of columns, row comparison expresses it directly instead of forcing you into a tangle of ANDs that can subtly mishandle NULL.

One question, four tools: subquery, join, CTE, window

It's worth seeing explicitly that a single analytical question often has four equivalent expressions, because choosing among them is a recurring judgment call for the rest of the book. Take "each product with how many times it was ordered":

-- (1) Correlated subquery in the SELECT list — clear, but runs per product:
SELECT p.name,
       (SELECT COUNT(*) FROM order_items oi WHERE oi.product_id = p.product_id) AS times
FROM products p;

-- (2) JOIN + GROUP BY — the aggregation workhorse:
SELECT p.name, COUNT(oi.order_item_id) AS times
FROM products p
LEFT JOIN order_items oi ON oi.product_id = p.product_id
GROUP BY p.product_id, p.name;

-- (3) CTE (Chapter 11) — same as (2) but readable in stages:
WITH counts AS (
    SELECT product_id, COUNT(*) AS times FROM order_items GROUP BY product_id
)
SELECT p.name, COALESCE(c.times, 0) AS times
FROM products p LEFT JOIN counts c ON c.product_id = p.product_id;

-- (4) Window function (Chapter 12) — when you want detail AND the aggregate together.

All four are legitimate; the right choice depends on context. The correlated subquery (1) reads most directly for a single derived column but can be slow at scale. The join-and-group (2) is the standard for aggregations. The CTE (3) wins when the query is complex and you want named, readable stages. The window function (4) shines when you need each row and a group summary side by side. You don't have to choose perfectly — the optimizer often executes (1) and (2) identically — but knowing the menu means you can always reach for the spelling that makes a given query clearest. This four-way equivalence is a theme that recurs through Part II; subqueries are simply the first of the four you've now met in full.

A gallery of subquery questions

More patterns, to cement the shapes:

-- "The single most expensive product" (scalar subquery for the max):
SELECT name, price FROM products
WHERE price = (SELECT MAX(price) FROM products);

-- "Customers who have ordered every... " — well, that's division (Chapter 4);
-- here: "customers who have at least one delivered AND one cancelled order":
SELECT c.last_name FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.status='delivered')
  AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.status='cancelled');

-- "Products never reviewed" (NULL-safe anti-join):
SELECT p.name FROM products p
WHERE NOT EXISTS (SELECT 1 FROM reviews r WHERE r.product_id = p.product_id);

-- "Orders larger than this customer's own average order" (correlated, per-group):
SELECT o.order_id, o.customer_id, o.total_amount FROM orders o
WHERE o.total_amount > (
    SELECT AVG(o2.total_amount) FROM orders o2 WHERE o2.customer_id = o.customer_id
);

Across the gallery, notice how often the English telegraphs the tool: "the single most…" → scalar subquery on an aggregate; "never / no…" → NOT EXISTS; "both X and Y" → two EXISTS; "compared to its own…" → correlated subquery. Learning to hear those signals — to map a phrase to a subquery shape on reflex — is what makes subqueries feel natural rather than puzzling. That mapping, more than any single syntax, is what this chapter is really teaching.

Try this. Rewrite "orders larger than this customer's own average" using a CTE that pre-computes each customer's average, joined back to orders. Compare which version you find clearer. There's no single right answer — developing that taste is exactly the goal.

How the optimizer sees subqueries

A reassuring truth runs underneath this whole chapter: the way you write a subquery is frequently not the way the database executes it. PostgreSQL's optimizer is sophisticated, and one of its most valuable tricks is rewriting subqueries into other forms — usually joins — that it can execute more efficiently. Understanding this changes how you think about subquery performance and frees you to write for clarity.

Consider an IN subquery like "orders from gold customers." Naively, you might imagine PostgreSQL running the inner query to build a list of gold customer ids, then checking each order against that list. Sometimes it does exactly that. But very often the optimizer recognizes that this is logically a join between orders and customers, and un-nests it into a hash join — the same plan it would produce if you'd written the join yourself. The two forms you thought were different turn out to be the same query to the planner, executed by identical machinery. This is why the advice throughout the chapter has been "choose the form that reads clearest" — the performance difference you might fear often doesn't exist, because the optimizer collapses the distinction.

Correlated subqueries are the interesting case, because they can't always be un-nested, and that's where the real performance story lives. A correlated subquery in the SELECT list — say, counting each product's reviews by re-querying reviews per product — is, in the worst case, executed once per outer row. For fifteen products that's nothing; for a catalog of half a million, it's half a million little queries, and you'll feel it. The good news is that modern PostgreSQL frequently transforms even correlated subqueries into a single join or a hash aggregate, computing all the counts in one pass. The less-good news is that it can't always do so, particularly for complex correlations, and when it can't, the per-row cost is real. This is precisely why correlated subqueries have efficient alternatives — a join with GROUP BY, or a window function (Chapter 12) — that compute the same per-group values without the per-row re-scan.

The practical wisdom that falls out of this is a two-step rhythm you'll use for the rest of the book. First, write the subquery in whatever form expresses your intent most clearly — don't pre-optimize based on guesses about what's fast. Second, if and only if a query is actually slow, look at its plan with EXPLAIN (Chapter 24) to see what the optimizer actually did. The plan will tell you whether your correlated subquery became a tidy join or remained a per-row loop, and that — not folklore about "subqueries are slow" or "joins are faster" — is what guides any rewrite. Optimization is measured, not assumed (theme #5). You'll be wrong about performance often enough by guessing that the only reliable method is to ask the database what it's doing. Until a query proves slow, clarity wins; once it does, the plan, not your intuition, decides the fix.

This is also why the four-equivalent-forms idea from earlier matters so much in practice. Because the optimizer treats many of these spellings alike, you genuinely are free to pick the one a human reads best — and only when measurement shows a problem do you reach for a different form and verify it actually helped. That discipline, writing for people and optimizing for machines only with evidence, is the mark of someone who understands both the language and the engine beneath it.

A performance note (preview of Chapters 23–24)

Correlated subqueries can be slow because, conceptually, they run once per outer row — a product catalog of 100,000 rows each running a COUNT(*) subquery is 100,000 little queries. Modern optimizers often transform a correlated subquery into a single join or hash aggregate, erasing the cost; but when they can't, you'll see it in the plan. The fixes are usually a join or a window function (Chapter 12). For now: correlated subqueries are expressive, occasionally expensive, and EXPLAIN (Chapter 24) tells you which.

Subqueries beyond SELECT: modification and constraints

So far every subquery has lived inside a SELECT, but closure means they work in the data-modification statements too — and previewing that here shows just how pervasive the idea is. You'll meet these statements properly in Chapters 13 and 14, but the subquery patterns are the same ones you've just learned, simply relocated.

A subquery can choose which rows an UPDATE or DELETE touches. "Mark as inactive every product that has never been ordered" is a NOT EXISTS anti-join driving an UPDATE; "delete reviews belonging to customers who have closed their accounts" is an IN or EXISTS subquery driving a DELETE. The subquery identifies the target set exactly as it would in a SELECT's WHERE; the modification statement then acts on whatever it identifies:

-- Deactivate products with no order history (subquery selects the targets):
UPDATE products
SET is_active = false
WHERE NOT EXISTS (
    SELECT 1 FROM order_items oi WHERE oi.product_id = products.product_id
);

A subquery can also supply the data an INSERT adds — INSERT INTO archive_orders SELECT * FROM orders WHERE order_date < '2020-01-01' copies a query's result straight into another table, the foundation of archiving and ETL (Chapter 31). And subqueries appear in CHECK constraints' more powerful cousins and in the WITH CHECK OPTION of updatable views (Chapter 15), enforcing rules defined by other tables. The unifying point is that a subquery is a relation-valued expression, and anywhere SQL can use a relation or a value — filtering, modifying, inserting, constraining — a subquery can stand in. The handful of shapes you've learned (scalar, list, derived table, correlated, EXISTS) are the complete vocabulary; the rest of the book just deploys them in new positions.

This is worth internalizing now because it reframes subqueries from "a SELECT feature" to "a fundamental building block of the entire language." When you reach data modification and see a subquery choosing rows to update, or a view defined by a subquery, you won't meet anything new — you'll recognize old friends in new clothes. That recognition, the sense that a few core ideas recur everywhere, is exactly the fluency Part II is building toward.

A recap: choosing your subquery shape

Let's consolidate the chapter into a decision guide you can carry, because the hardest part of subqueries in practice is not writing any one shape but knowing which shape a problem calls for. The good news is that the choice follows directly from what the inner question returns and how its answer is used.

If you need a single computed value to compare against or display — an average, a maximum, a total, a count of the whole — reach for a scalar subquery, and place it wherever a value goes: WHERE, the SELECT list, HAVING, a CASE. Guard against it accidentally returning more than one row. If you need to test membership in a set of values the answer is IN for the positive case, but NOT EXISTS (not NOT IN) for the negative case, because of the NULL trap that has now been flagged in four separate chapters. If you need to test whether any related row exists — a "has a match" or "has no match" question — EXISTS/NOT EXISTS is the idiomatic, NULL-safe, duplicate-free choice, and it should be your default for anti-joins. If you need to compare each row against a benchmark specific to its own group — "above average for its category," "larger than this customer's typical order" — that's a correlated subquery (or a window function, when you also want the row detail alongside). And if you need to aggregate and then work with the aggregate — average of per-order totals, filtering on a computed-per-group value — a derived table in FROM (or a CTE, Chapter 11) gives the intermediate result a name and a shape.

Underneath all of these sits one decision that overrides the rest: do you need the other table's columns in your output? If yes, you almost certainly want a join, not a subquery, because a subquery in WHERE can filter by another table but cannot return its columns. If you only need to test against the other table — does a match exist, is this value in that set, how does this compare to that benchmark — a subquery often reads more clearly and avoids the duplicate-row hazards a join introduces. Hold that question first, and the rest of the guide falls into place.

Finally, remember the meta-lesson the optimizer taught us: many of these forms are interchangeable to the database, executed by the same machinery, so the choice is usually about the human reader, not the machine. Write the shape whose structure mirrors the question, run the inner query alone when something's wrong, and consult the plan only when a query is genuinely slow. Master that rhythm and subqueries stop being the intimidating corner of SQL they're reputed to be, and become what they actually are — the simplest expression of closure, and one of the most natural tools in the language.

A mental model: name the inner question first

Subqueries intimidate people because a nested query looks like one tangled thing. The cure is a habit of mind: read and write subqueries from the inside out, naming the inner question before you worry about the outer one. Almost every subquery is the answer to a small, self-contained question that the outer query then uses. Find that small question, state it in plain English, and the structure becomes obvious.

Take "products priced above the overall average." The inner question is "what is the overall average price?" — a complete, standalone question with a single-number answer. Once you've named it, the outer query is trivial: "keep products whose price exceeds that number." The nesting isn't one complicated query; it's two simple questions, one feeding the other. When you write it, write the inner query first, run it on its own to confirm it returns what you expect (a single number, a list, a table — whatever the outer query needs), and only then wrap the outer query around it. When you read someone else's subquery, do the same in reverse: find the innermost parentheses, understand that little query in isolation, then step outward one layer at a time.

This inside-out discipline also tells you immediately what kind of subquery you're dealing with, which determines where it can go. If the inner question has a one-number answer ("the average," "the maximum," "the count"), it's a scalar subquery and fits anywhere a value fits. If it answers with a list ("the ids of gold customers"), it pairs with IN or EXISTS. If it answers with a whole table ("each order with its total"), it's a derived table for the FROM clause. If the inner question can't be answered without knowing which outer row you're on ("the average for this product's category"), it's correlated. You don't classify subqueries by staring at syntax; you classify them by what the inner question returns, and naming that question is how you find out.

The payoff is that arbitrarily deep nesting stops being scary. A subquery inside a subquery inside a query is just three small questions stacked — answer the innermost, feed it up, repeat. Closure (Chapter 4) is what makes this legal; the inside-out reading habit is what makes it manageable. Whenever a nested query overwhelms you, retreat to the innermost parentheses and ask, "what small question does this one answer?" The fog lifts every time.

When subqueries help, and when they hurt

Subqueries are powerful, but power misused makes queries harder to read, not easier, and part of maturing as a SQL writer is developing taste for when to nest and when not to. The guiding principle is clarity: a subquery is worth it when it makes the query's intent more obvious, and a liability when it buries that intent under layers of parentheses.

Subqueries genuinely help in a few recognizable situations. They're the natural tool for comparison against a computed benchmark — "above the average," "more than the maximum," "compared to the group" — because the benchmark is itself a query result. They're the right call for existence and absence tests via EXISTS/NOT EXISTS, which read almost like English and sidestep the duplicate-row problems a join would create. And a derived table in FROM is the clean way to "aggregate, then work with the aggregate," giving an intermediate result a name and a shape the outer query can use.

But subqueries hurt when they pile up. A query with three levels of nesting, each doing a little aggregation, is often far clearer rewritten as a sequence of CTEs (Chapter 11), which lay the same logic out as named, top-to-bottom steps instead of inside-out nesting. Deeply correlated subqueries in the SELECT list — several of them, each re-scanning a table per row — are usually both slower and harder to read than a single join-and-group or a window function (Chapter 12). And a subquery that merely filters by another table, when you actually need that table's columns in the output, is a sign you wanted a join all along. The tell is almost always readability: if you find yourself losing track of which parenthesis closes which query, the structure is fighting you, and a flatter form (a join, a CTE, a window function) will serve you and the next reader better.

None of these are hard rules, because the optimizer often executes the different forms identically — so the choice is about the human reading the query six months from now, not about the machine. The mature instinct is to reach for the spelling whose shape mirrors the thought: a benchmark comparison wants a scalar subquery, a multi-step transformation wants CTEs, a "does it exist" wants EXISTS, a "I need both tables' data" wants a join. As you write more SQL, this taste develops on its own; for now, simply notice when a subquery clarifies versus when it obscures, and don't be afraid to rewrite a working-but-murky nested query into something flatter and plainer. Correct and readable beats merely correct.

Debugging subqueries

When a query with subqueries returns the wrong result, the inside-out habit becomes a debugging method, and it's reliable enough to be worth stating as a procedure. The cardinal rule: a subquery is a query — so run it by itself.

Start at the innermost subquery and execute it in isolation, exactly as written, and confirm it returns what you assumed. This single step catches a remarkable share of bugs. The scalar subquery you thought returned one row actually returns three (hence the "more than one row returned" error, or a silently wrong comparison). The IN subquery you expected to contain customer ids actually contains NULLs (hence NOT IN returning nothing). The correlated subquery's condition references the wrong alias. You cannot see any of this while it's buried inside the outer query; pulled out and run on its own, the problem is usually staring at you.

For correlated subqueries, which can't run entirely on their own because they reference the outer row, the trick is to substitute a concrete value for the correlation. If the inner query says WHERE p2.category_id = p.category_id, replace p.category_id with an actual category id and run it — now it's a standalone query you can verify. Confirm it returns the right benchmark for that one case, and you've validated the logic; the correlation just applies that same logic per row. Then check a second category to be sure your assumption holds across cases.

The other frequent culprits are the ones this chapter has flagged repeatedly, and a quick mental checklist dispatches them: Is this a NOT IN over a subquery that might contain NULL? (Switch to NOT EXISTS.) Is a scalar subquery at risk of returning more than one row? (Add an aggregate or constrain it.) Did a join inside a subquery fan out and inflate a count? (Use COUNT(DISTINCT …) or EXISTS.) Is the correlation referencing the column you meant? (Check the aliases.) Running the inner query alone, substituting values for correlations, and walking that short checklist will resolve nearly every subquery bug you'll meet — and it reinforces the chapter's core truth: a subquery is nothing more exotic than a query you've placed inside another one, and it can be understood, tested, and fixed exactly like any other query.

Why this matters. The skill of decomposing a confusing nested query into its standalone parts is the same skill, scaled down, as decomposing a hard problem into subproblems — and it's the difference between debugging by educated diagnosis and debugging by random flailing. Theme #3 once more: when you understand that a subquery is just a query in a particular position, its mysteries reduce to ordinary, checkable questions.

Progressive project: nest your queries

In your domain:

A scalar subquery: rows above an average (members who borrowed more than the average member; products priced above average).
An IN subquery: rows related to a filtered set (orders from premium customers; loans of books in a given category).
A NOT EXISTS anti-join: "X that never Y" (members who never borrowed; courses with no enrollments) — written NULL-safely.
A derived table in FROM: aggregate then aggregate (average per-group value).

Add these to project-notes.md.

The closure thread

Pull back from the mechanics for a moment and notice what this chapter really demonstrated: the quiet power of a single idea introduced way back in Chapter 4. Closure — the property that every query returns a relation, which can therefore be used wherever a relation is expected — is the entire reason subqueries exist. A scalar subquery works because a one-row, one-column result is a value. An IN subquery works because a one-column result is a list. A derived table works because any result is a table. Correlated subqueries, EXISTS, row comparisons — all of them are closure cashed out in a particular position. You did not learn six unrelated features in this chapter; you learned six faces of one principle.

That framing matters because closure does not stop at subqueries. The next chapters are, in a real sense, more closure. Chapter 11's CTEs are named subqueries you can chain and even make recursive — closure with better ergonomics. Chapter 15's views are subqueries saved under a name and queried like tables — closure made permanent. Every time SQL lets you build something from the result of something else, closure is what makes it legal. Having seen the principle work hard here, you'll recognize it everywhere, and new features will feel less like things to memorize and more like the same idea reappearing in a new guise.

This is what it means to learn fundamentals rather than syntax. Someone who learned subqueries as "a list of patterns to copy" has to learn CTEs and views as more patterns, separately. Someone who learned subqueries as "closure, applied" already half-understands CTEs and views before reading a word about them, because they grasp the underlying idea those features express. Theme #3 — understand the why — is not a study tip; it's a multiplier. The why you understood in this chapter (a query is a relation, and relations compose) is the why that unlocks a third of everything still ahead. Carry it forward, and watch how much of Part II's remainder you can almost predict.

Summary

A subquery nests one query inside another — the practical payoff of closure. Scalar subqueries return one value (usable in SELECT/WHERE); a scalar that returns >1 row is an error. IN subqueries filter by a list (beware NOT IN with NULL). Derived tables (subqueries in FROM, alias required) let you aggregate then query the aggregate. Correlated subqueries reference the outer query and run per row — expressive but potentially slow. EXISTS/NOT EXISTS are the idiomatic, NULL-safe way to ask "has a match / has no match" — your default for anti-joins. ANY/ALL compare against a whole set. Many subqueries equal joins; choose by readability and whether you need the other table's columns, and let the optimizer (and EXPLAIN) settle performance.

You can now: - Write scalar, list (IN), derived-table, and correlated subqueries. - Use EXISTS/NOT EXISTS for existence tests and NULL-safe anti-joins. - Use ANY/ALL (and know IN = = ANY). - Choose between a subquery and a join based on output needs and fan-out. - Recognize the NOT IN-with-NULL and multi-row-scalar pitfalls.

What's next. Chapter 10 — Set Operations — UNION, INTERSECT, EXCEPT: combining the results of two queries the way Chapter 4's algebra intended, for "this plus that," "in both," and "in one but not the other."

Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.