Some questions require answers that depend on other answers. That's where subqueries shine.
In This Chapter
- 8.1 Scalar Subqueries — A Single Value from a Query
- 8.2 Table Subqueries — A Result Set Within a Query
- 8.3 Correlated Subqueries — When the Inner Query Depends on the Outer
- 8.4 EXISTS and NOT EXISTS — The Existence Test
- 8.5 Subqueries in FROM — Derived Tables
- 8.6 Common Table Expressions — Named Subqueries
- 8.7 CTEs vs. Subqueries vs. Views — When to Use Which
- 8.8 CTE Materialization in DB2
- 8.9 Nested Subqueries — Queries Within Queries Within Queries
- 8.10 Meridian Bank Complex Queries
- 8.11 The Threshold Concept — Thinking in Sets, Not Loops
- Spaced Review: Chapters 2, 5, and 6
- Summary
Chapter 8: Subqueries and Common Table Expressions — Queries Within Queries
Some questions require answers that depend on other answers. That's where subqueries shine.
Consider a deceptively simple business question: "Which Meridian National Bank customers have an account balance above the bank-wide average?" To answer it, you need two pieces of information. First, you need the average balance across all accounts. Second, you need to compare each customer's balance against that average. Neither piece is useful alone. The average means nothing without the comparison; the comparison is impossible without the average.
SQL solves this with subqueries — complete SELECT statements embedded inside other SELECT statements. A subquery computes one result, and the outer query uses that result as if it were a constant, a list, or an entire table. Once you internalize this pattern, the complexity ceiling of what you can express in a single SQL statement rises dramatically.
This chapter takes you from simple scalar subqueries through correlated subqueries, existence tests, derived tables, and finally into Common Table Expressions (CTEs) — DB2's mechanism for naming and reusing subqueries within a single statement. By the end, you will be writing multi-step analytical queries that would have required procedural code or temporary tables in earlier chapters.
We assume you are comfortable with SELECT, WHERE, joins (Chapter 6), and GROUP BY with aggregate functions (Chapter 7). Everything here builds directly on those foundations.
8.1 Scalar Subqueries — A Single Value from a Query
A scalar subquery is a subquery that returns exactly one row and one column — a single value. DB2 treats this value as if you had typed a literal constant in its place. If the subquery returns more than one row, DB2 raises an error. If it returns no rows, the result is NULL.
Scalar Subqueries in WHERE
The most common use is comparing a column to a computed value:
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
a.BALANCE
FROM CUSTOMER c
JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
WHERE a.BALANCE > (SELECT AVG(BALANCE) FROM ACCOUNT);
DB2 processes this in two logical steps. First, it evaluates the subquery SELECT AVG(BALANCE) FROM ACCOUNT and obtains a single number — say, 14,832.50. Then it substitutes that number into the outer query's WHERE clause, effectively producing WHERE a.BALANCE > 14832.50. The outer query runs against that fixed threshold.
This is an uncorrelated subquery: the inner query does not reference any column from the outer query. It runs once, produces one value, and that value is reused for every row the outer query examines. This distinction matters enormously for performance, as we will see in Section 8.3.
Scalar Subqueries in the SELECT List
You can also place a scalar subquery in the SELECT list to compute a per-row value:
SELECT b.BRANCH_NAME,
b.CITY,
(SELECT COUNT(*)
FROM ACCOUNT a
WHERE a.BRANCH_ID = b.BRANCH_ID) AS ACCOUNT_COUNT
FROM BRANCH b;
Here, for each branch row, DB2 runs the inner SELECT to count accounts belonging to that branch. This is technically a correlated scalar subquery because the inner query references b.BRANCH_ID from the outer query. The result is one integer per branch row — perfectly valid as a scalar.
This pattern is convenient for quick column additions, but be cautious: DB2 executes the subquery once per outer row unless the optimizer can transform it into a join internally. For large result sets, an explicit JOIN with GROUP BY is usually more efficient:
SELECT b.BRANCH_NAME,
b.CITY,
COALESCE(ac.ACCOUNT_COUNT, 0) AS ACCOUNT_COUNT
FROM BRANCH b
LEFT JOIN (SELECT BRANCH_ID, COUNT(*) AS ACCOUNT_COUNT
FROM ACCOUNT
GROUP BY BRANCH_ID) ac
ON b.BRANCH_ID = ac.BRANCH_ID;
Both queries return the same result. The second form gives the optimizer more room to choose an efficient execution plan.
Scalar Subqueries in HAVING
Scalar subqueries also work in the HAVING clause, enabling comparisons against aggregated results:
SELECT a.BRANCH_ID,
AVG(a.BALANCE) AS AVG_BRANCH_BALANCE
FROM ACCOUNT a
GROUP BY a.BRANCH_ID
HAVING AVG(a.BALANCE) > (SELECT AVG(BALANCE) FROM ACCOUNT);
This returns branches whose average account balance exceeds the bank-wide average. The subquery computes the bank-wide average once; the HAVING clause compares each branch's average against it.
Scalar Subqueries in CASE Expressions
You can embed scalar subqueries inside CASE expressions for conditional logic based on computed values:
SELECT a.ACCOUNT_ID,
a.BALANCE,
CASE
WHEN a.BALANCE > (SELECT AVG(BALANCE) * 2 FROM ACCOUNT)
THEN 'High Value'
WHEN a.BALANCE > (SELECT AVG(BALANCE) FROM ACCOUNT)
THEN 'Above Average'
ELSE 'Standard'
END AS BALANCE_TIER
FROM ACCOUNT a
ORDER BY a.BALANCE DESC;
Each CASE branch contains its own scalar subquery. DB2 evaluates these subqueries independently. In practice, this pattern can be expensive if the subqueries are complex — consider computing the thresholds in a CTE and cross-joining them to the main query for better performance.
Scalar Subquery Rules
| Rule | What Happens If Violated |
|---|---|
| Must return exactly one column | SQL error at parse time |
| Must return at most one row | SQLCODE -811 at runtime |
| May return zero rows | Result is NULL |
A common mistake is writing a scalar subquery that can return multiple rows for certain data distributions. Always ask yourself: "Is it guaranteed that this inner query returns at most one row for every possible state of the data?" If the answer is no, add appropriate filtering, use MAX/MIN to force a single row, or switch to a table subquery with IN.
Debugging SQLCODE -811
When you encounter SQLCODE -811 ("The result of a scalar fullselect, SELECT INTO statement, or VALUES INTO statement is more than one row"), the fix depends on the root cause:
- Missing filter: The subquery's WHERE clause does not narrow results to one row. Add the missing predicate.
- Data anomaly: Duplicate data exists where you expected uniqueness. Investigate the data quality issue.
- Wrong approach: The problem genuinely requires multiple values. Switch to IN, EXISTS, or a derived table.
A quick diagnostic technique: temporarily change the scalar subquery to a standalone SELECT with COUNT(*) to see how many rows it produces for sample outer-row values.
8.2 Table Subqueries — A Result Set Within a Query
When a subquery returns multiple rows (but typically one column), you use it with set-comparison operators: IN, NOT IN, ANY, ALL, and SOME.
The IN Operator
IN checks whether a value matches any value in the subquery's result set:
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE c.CUSTOMER_ID IN (
SELECT DISTINCT a.CUSTOMER_ID
FROM ACCOUNT a
WHERE a.ACCOUNT_TYPE = 'SAVINGS'
);
This returns every customer who has at least one savings account. The subquery produces a list of CUSTOMER_ID values, and the outer WHERE checks membership in that list.
Performance note: For small result sets from the subquery, IN works well. For large sets, EXISTS (Section 8.4) often performs better because DB2 can stop searching as soon as it finds the first match rather than materializing the entire list.
NOT IN and the NULL Trap
NOT IN has a dangerous subtlety involving NULL values. Consider:
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE c.CUSTOMER_ID NOT IN (
SELECT a.CUSTOMER_ID
FROM ACCOUNT a
WHERE a.ACCOUNT_TYPE = 'CHECKING'
);
This is intended to find customers without a checking account. It works correctly — unless the subquery returns any NULL values. If even one row in the subquery result has a NULL CUSTOMER_ID, the NOT IN condition evaluates to UNKNOWN for every outer row, and the entire query returns zero rows.
Why? SQL's three-valued logic. 5 NOT IN (3, 7, NULL) evaluates as 5 <> 3 AND 5 <> 7 AND 5 <> NULL. The last comparison yields UNKNOWN, and TRUE AND TRUE AND UNKNOWN is UNKNOWN. WHERE filters out UNKNOWN, so no rows pass.
The fix is either to filter NULLs in the subquery:
WHERE c.CUSTOMER_ID NOT IN (
SELECT a.CUSTOMER_ID
FROM ACCOUNT a
WHERE a.ACCOUNT_TYPE = 'CHECKING'
AND a.CUSTOMER_ID IS NOT NULL
)
Or, better, use NOT EXISTS (Section 8.4), which handles NULLs correctly by design.
ANY, ALL, and SOME
These operators compare a value to each row of the subquery using a comparison operator:
-- Accounts with balance greater than ANY savings account balance
-- (i.e., greater than the minimum savings balance)
SELECT ACCOUNT_ID, BALANCE
FROM ACCOUNT
WHERE BALANCE > ANY (
SELECT BALANCE FROM ACCOUNT WHERE ACCOUNT_TYPE = 'SAVINGS'
);
-- Accounts with balance greater than ALL savings account balances
-- (i.e., greater than the maximum savings balance)
SELECT ACCOUNT_ID, BALANCE
FROM ACCOUNT
WHERE BALANCE > ALL (
SELECT BALANCE FROM ACCOUNT WHERE ACCOUNT_TYPE = 'SAVINGS'
);
> ANY means "greater than at least one value in the set" — equivalent to > MIN(...). > ALL means "greater than every value in the set" — equivalent to > MAX(...). The keyword SOME is a synonym for ANY in the SQL standard; DB2 supports both.
In practice, ANY and ALL appear less frequently than IN and EXISTS. Most developers find explicit MIN/MAX aggregates clearer:
-- Equivalent to > ALL, but more readable
SELECT ACCOUNT_ID, BALANCE
FROM ACCOUNT
WHERE BALANCE > (SELECT MAX(BALANCE) FROM ACCOUNT WHERE ACCOUNT_TYPE = 'SAVINGS');
Use whichever form makes the intent most obvious to the next person reading your SQL.
Multi-Row, Multi-Column Subqueries
DB2 also supports comparing multiple columns against a subquery simultaneously using row-value constructors:
-- Find transactions that match the most recent transaction
-- for any account (matching on both account and date)
SELECT t.TRANSACTION_ID,
t.ACCOUNT_ID,
t.TRANSACTION_DATE,
t.AMOUNT
FROM TRANSACTION t
WHERE (t.ACCOUNT_ID, t.TRANSACTION_DATE) IN (
SELECT t2.ACCOUNT_ID, MAX(t2.TRANSACTION_DATE)
FROM TRANSACTION t2
GROUP BY t2.ACCOUNT_ID
);
This compares two columns simultaneously. The subquery returns (ACCOUNT_ID, MAX_DATE) pairs, and the outer query checks whether each transaction's (ACCOUNT_ID, TRANSACTION_DATE) matches any pair. This is a clean way to express "the latest transaction per account" without a correlated subquery.
Subqueries with BETWEEN and LIKE
Scalar subqueries can appear anywhere a value expression is valid, including as operands of BETWEEN:
SELECT a.ACCOUNT_ID, a.BALANCE
FROM ACCOUNT a
WHERE a.BALANCE BETWEEN
(SELECT AVG(BALANCE) - STDDEV(BALANCE) FROM ACCOUNT)
AND
(SELECT AVG(BALANCE) + STDDEV(BALANCE) FROM ACCOUNT);
This returns accounts within one standard deviation of the mean balance — a statistical band filter. Both bounds are scalar subqueries.
8.3 Correlated Subqueries — When the Inner Query Depends on the Outer
An uncorrelated subquery is self-contained. It runs once, produces a result, and the outer query uses that result. A correlated subquery references one or more columns from the outer query, which means the inner query must be logically re-evaluated for every row of the outer query.
Execution Model
Consider finding every customer whose total balance exceeds the average total balance for their branch:
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
c.BRANCH_ID
FROM CUSTOMER c
WHERE (SELECT SUM(a.BALANCE)
FROM ACCOUNT a
WHERE a.CUSTOMER_ID = c.CUSTOMER_ID)
>
(SELECT AVG(branch_total)
FROM (SELECT SUM(a2.BALANCE) AS branch_total
FROM ACCOUNT a2
JOIN CUSTOMER c2 ON a2.CUSTOMER_ID = c2.CUSTOMER_ID
WHERE c2.BRANCH_ID = c.BRANCH_ID
GROUP BY a2.CUSTOMER_ID) AS bt);
Both subqueries reference c.CUSTOMER_ID or c.BRANCH_ID from the outer query. Logically, DB2 must:
- Pick a row from CUSTOMER (the outer query).
- Plug that row's CUSTOMER_ID and BRANCH_ID into both subqueries.
- Run both subqueries.
- Compare the results.
- Decide whether the outer row passes the WHERE filter.
- Move to the next outer row and repeat.
For 10,000 customers, that is logically 10,000 executions of each subquery. In practice, DB2's optimizer may transform this into a more efficient plan — perhaps a hash join or a pre-aggregation — but the worst case is O(n * m) where n is the outer row count and m is the work per subquery evaluation.
Performance Cost
Correlated subqueries are the most expensive subquery pattern. The optimizer can sometimes decorrelate them — rewriting the correlated subquery as an equivalent join — but it cannot always do so. When decorrelation fails, you get nested-loop execution: one subquery evaluation per outer row.
Signs that a correlated subquery is hurting performance:
- The EXPLAIN output shows a nested-loop join with the subquery on the inner side and no index support.
- Query runtime scales linearly (or worse) with the number of outer rows.
- Adding rows to the outer table causes disproportionate slowdown.
When They're Unavoidable
Some problems are most naturally expressed as correlated subqueries. The classic example is "find the most recent transaction for each account":
SELECT t.*
FROM TRANSACTION t
WHERE t.TRANSACTION_DATE = (
SELECT MAX(t2.TRANSACTION_DATE)
FROM TRANSACTION t2
WHERE t2.ACCOUNT_ID = t.ACCOUNT_ID
);
This is clean, readable, and correct. DB2's optimizer can often execute it efficiently by building an index on TRANSACTION(ACCOUNT_ID, TRANSACTION_DATE DESC). Before rewriting for performance, check the EXPLAIN plan — the optimizer may already be handling it well.
Alternative formulations using ROW_NUMBER() (covered in Chapter 10) or self-joins exist, but the correlated subquery version is often the most readable for straightforward "latest/earliest/max/min per group" patterns.
Correlated Subqueries in UPDATE and DELETE
Correlated subqueries are common in data modification statements:
-- Update each account's LAST_ACTIVITY_DATE from the TRANSACTION table
UPDATE ACCOUNT a
SET LAST_ACTIVITY_DATE = (
SELECT MAX(t.TRANSACTION_DATE)
FROM TRANSACTION t
WHERE t.ACCOUNT_ID = a.ACCOUNT_ID
)
WHERE EXISTS (
SELECT 1 FROM TRANSACTION t
WHERE t.ACCOUNT_ID = a.ACCOUNT_ID
);
The SET clause uses a correlated subquery to compute the new value. The WHERE EXISTS clause ensures we only update accounts that actually have transactions — without it, accounts with no transactions would have LAST_ACTIVITY_DATE set to NULL.
8.4 EXISTS and NOT EXISTS — The Existence Test
EXISTS is a Boolean operator. It returns TRUE if the subquery produces at least one row, FALSE if the subquery produces zero rows. It does not care about the values in those rows — only whether rows exist.
Basic EXISTS Pattern
-- Find customers who have at least one account
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE EXISTS (
SELECT 1
FROM ACCOUNT a
WHERE a.CUSTOMER_ID = c.CUSTOMER_ID
);
The SELECT 1 in the subquery is a convention. You could write SELECT *, SELECT a.ACCOUNT_ID, or even SELECT 'hello'. DB2 ignores the SELECT list in an EXISTS subquery — it only checks whether the subquery produces rows. Using SELECT 1 signals to other developers that you are performing an existence check, not retrieving data.
Why EXISTS Often Beats IN for Large Sets
Consider these two equivalent queries:
-- Version 1: IN
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE c.CUSTOMER_ID IN (
SELECT a.CUSTOMER_ID FROM ACCOUNT a
);
-- Version 2: EXISTS
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE EXISTS (
SELECT 1 FROM ACCOUNT a WHERE a.CUSTOMER_ID = c.CUSTOMER_ID
);
Both return customers who have accounts. The difference is in how DB2 can execute them:
- IN logically produces the complete list of CUSTOMER_ID values from ACCOUNT, then checks membership for each outer row. If ACCOUNT has millions of rows, that list can be large.
- EXISTS only needs to find one matching row in ACCOUNT for each outer CUSTOMER_ID. With an index on
ACCOUNT(CUSTOMER_ID), this is a quick probe — and DB2 can stop at the first match.
Modern DB2 optimizers often transform IN into a semi-join (the same execution strategy as EXISTS), so the performance difference has narrowed. But when the optimizer does not make this transformation — particularly with complex subqueries or when statistics are stale — EXISTS with a correlated condition and an index is the safer bet.
NOT EXISTS and the NULL Advantage
NOT EXISTS is the negative form: it returns TRUE when the subquery produces zero rows. This is where it fundamentally differs from NOT IN.
-- Customers with no checking account — NOT EXISTS version
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE NOT EXISTS (
SELECT 1
FROM ACCOUNT a
WHERE a.CUSTOMER_ID = c.CUSTOMER_ID
AND a.ACCOUNT_TYPE = 'CHECKING'
);
This query is NULL-safe. Even if some ACCOUNT rows have NULL CUSTOMER_ID values, NOT EXISTS still works correctly. It checks "does any row exist where the condition is true?" If no such row exists, the customer passes the filter. NULL values in the ACCOUNT table simply fail the a.CUSTOMER_ID = c.CUSTOMER_ID equality check and do not produce a match — exactly the behavior you want.
Compare this with the NOT IN version from Section 8.2, which breaks in the presence of NULLs. For this reason, experienced SQL developers prefer NOT EXISTS over NOT IN as a default habit.
The Semi-Join and Anti-Semi-Join Patterns
Internally, DB2 executes EXISTS as a semi-join: a join that returns the outer row as soon as any match is found in the inner table, without duplicating the outer row even if multiple matches exist. NOT EXISTS becomes an anti-semi-join: return the outer row only if no match is found.
These are powerful query plan operations. You can see them in EXPLAIN output as SEMI JOIN or ANTI JOIN access types. If you do not see these and instead see a full nested-loop with a subquery scan, consider:
- Ensuring appropriate indexes exist on the correlation columns.
- Running RUNSTATS to update table statistics.
- Rewriting the query to help the optimizer recognize the semi-join pattern.
EXISTS with Multiple Conditions
EXISTS subqueries can contain arbitrarily complex logic:
-- Customers who have made a deposit over $5,000 in the last 30 days
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE EXISTS (
SELECT 1
FROM TRANSACTION t
JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID
WHERE a.CUSTOMER_ID = c.CUSTOMER_ID
AND t.TRANSACTION_TYPE = 'DEPOSIT'
AND t.AMOUNT > 5000
AND t.TRANSACTION_DATE >= CURRENT DATE - 30 DAYS
);
The subquery joins two tables, applies multiple filters, and correlates back to the outer query. DB2 returns each qualifying customer exactly once, regardless of how many large deposits they made.
8.5 Subqueries in FROM — Derived Tables
A subquery in the FROM clause produces a derived table (also called an inline view). The outer query treats it as if it were a regular table.
Basic Syntax
SELECT dt.BRANCH_ID,
dt.TOTAL_DEPOSITS,
b.BRANCH_NAME
FROM (SELECT a.BRANCH_ID,
SUM(a.BALANCE) AS TOTAL_DEPOSITS
FROM ACCOUNT a
WHERE a.ACCOUNT_TYPE = 'SAVINGS'
GROUP BY a.BRANCH_ID) AS dt
JOIN BRANCH b ON dt.BRANCH_ID = b.BRANCH_ID
WHERE dt.TOTAL_DEPOSITS > 1000000
ORDER BY dt.TOTAL_DEPOSITS DESC;
The derived table dt aggregates savings account balances by branch. The outer query joins this result to BRANCH for the branch name and filters to only branches with over $1 million in savings deposits.
Nested Derived Tables
Derived tables can be nested — a derived table's FROM clause can itself contain another derived table:
SELECT final.BRANCH_NAME,
final.CUSTOMER_SEGMENT,
final.SEGMENT_COUNT
FROM (SELECT b.BRANCH_NAME,
dt.CUSTOMER_SEGMENT,
COUNT(*) AS SEGMENT_COUNT
FROM (SELECT CUSTOMER_ID,
CASE
WHEN TOTAL_BAL >= 100000 THEN 'Premium'
WHEN TOTAL_BAL >= 25000 THEN 'Standard'
ELSE 'Basic'
END AS CUSTOMER_SEGMENT,
BRANCH_ID
FROM (SELECT c.CUSTOMER_ID,
c.BRANCH_ID,
COALESCE(SUM(a.BALANCE), 0) AS TOTAL_BAL
FROM CUSTOMER c
LEFT JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
GROUP BY c.CUSTOMER_ID, c.BRANCH_ID) AS totals
) AS dt
JOIN BRANCH b ON dt.BRANCH_ID = b.BRANCH_ID
GROUP BY b.BRANCH_NAME, dt.CUSTOMER_SEGMENT) AS final
ORDER BY final.BRANCH_NAME, final.CUSTOMER_SEGMENT;
This three-level nesting computes customer totals, assigns segments, then counts segments per branch. It works, but the readability is poor. This is precisely the kind of query that CTEs were designed to simplify — we will revisit this example in Section 8.9.
When to Use Derived Tables
Derived tables are useful when you need to:
-
Pre-aggregate before joining. Joining first and then aggregating can produce incorrect results due to row duplication (the "fan-out" problem from Chapter 7). A derived table that aggregates first, then joins, avoids this.
-
Filter on computed columns. You cannot use a column alias in the same query level's WHERE clause. A derived table lets you compute the column in the inner query and filter on it in the outer query.
-
Isolate complex logic. Breaking a complex query into a derived table plus an outer query can make the logic clearer, even when a single-level query would work.
-
Limit the scope of a join. When you want to join only a filtered or aggregated subset of a table rather than the full table, wrapping the filter and aggregation in a derived table makes the intent explicit.
Derived Tables vs. CTEs
Derived tables and CTEs (Section 8.6) accomplish similar things. The key differences:
| Feature | Derived Table | CTE |
|---|---|---|
| Defined in | FROM clause | WITH clause before the main query |
| Named | Alias required (AS dt) | Name given in WITH clause |
| Reusable in same query | No — must repeat the subquery | Yes — reference the CTE name multiple times |
| Readability for complex logic | Nests deeply | Reads top-to-bottom |
For one-time use with simple logic, derived tables are fine. For anything complex or reused, CTEs are superior. We cover CTEs next.
8.6 Common Table Expressions — Named Subqueries
A Common Table Expression (CTE) is a named, temporary result set defined with the WITH keyword. It exists only for the duration of the single SQL statement it belongs to. Think of it as giving a name to a subquery so you can reference it by name — possibly multiple times — in the main query.
Basic Syntax
WITH high_balance_accounts AS (
SELECT a.ACCOUNT_ID,
a.CUSTOMER_ID,
a.BALANCE,
a.BRANCH_ID
FROM ACCOUNT a
WHERE a.BALANCE > 50000
)
SELECT c.FIRST_NAME,
c.LAST_NAME,
hba.BALANCE,
b.BRANCH_NAME
FROM high_balance_accounts hba
JOIN CUSTOMER c ON hba.CUSTOMER_ID = c.CUSTOMER_ID
JOIN BRANCH b ON hba.BRANCH_ID = b.BRANCH_ID
ORDER BY hba.BALANCE DESC;
The CTE high_balance_accounts filters the ACCOUNT table once. The main query joins the CTE's results with CUSTOMER and BRANCH. This is logically identical to writing the subquery as a derived table in the FROM clause, but the CTE version reads from top to bottom: "First, define the high-balance accounts. Then, join them with customers and branches."
Readability Benefits
CTEs transform inside-out SQL into top-to-bottom SQL. Without CTEs, a complex query might read:
SELECT ...
FROM (SELECT ...
FROM (SELECT ...
FROM ...
WHERE ...) AS inner1
JOIN ... ON ...
WHERE ...) AS inner2
JOIN ... ON ...
WHERE ...
Reading this, you must start from the innermost subquery and work outward. With CTEs:
WITH step1 AS (
SELECT ... FROM ... WHERE ...
),
step2 AS (
SELECT ... FROM step1 JOIN ... ON ... WHERE ...
)
SELECT ...
FROM step2
JOIN ... ON ...
WHERE ...;
Each step is labeled and reads in logical order. This is a substantial readability improvement for queries with three or more levels of nesting.
Multi-CTE Chains
You can define multiple CTEs, separated by commas, and each subsequent CTE can reference any previously defined CTE:
WITH monthly_totals AS (
SELECT a.BRANCH_ID,
MONTH(t.TRANSACTION_DATE) AS TXN_MONTH,
SUM(t.AMOUNT) AS TOTAL_AMOUNT,
COUNT(*) AS TXN_COUNT
FROM TRANSACTION t
JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID
WHERE YEAR(t.TRANSACTION_DATE) = 2025
GROUP BY a.BRANCH_ID, MONTH(t.TRANSACTION_DATE)
),
branch_averages AS (
SELECT BRANCH_ID,
AVG(TOTAL_AMOUNT) AS AVG_MONTHLY_AMOUNT,
AVG(TXN_COUNT) AS AVG_MONTHLY_COUNT
FROM monthly_totals
GROUP BY BRANCH_ID
),
above_average_months AS (
SELECT mt.BRANCH_ID,
mt.TXN_MONTH,
mt.TOTAL_AMOUNT,
ba.AVG_MONTHLY_AMOUNT,
mt.TOTAL_AMOUNT - ba.AVG_MONTHLY_AMOUNT AS VARIANCE
FROM monthly_totals mt
JOIN branch_averages ba ON mt.BRANCH_ID = ba.BRANCH_ID
WHERE mt.TOTAL_AMOUNT > ba.AVG_MONTHLY_AMOUNT
)
SELECT b.BRANCH_NAME,
aam.TXN_MONTH,
aam.TOTAL_AMOUNT,
aam.AVG_MONTHLY_AMOUNT,
aam.VARIANCE
FROM above_average_months aam
JOIN BRANCH b ON aam.BRANCH_ID = b.BRANCH_ID
ORDER BY b.BRANCH_NAME, aam.TXN_MONTH;
This three-CTE chain: 1. Computes monthly transaction totals per branch. 2. Computes the average monthly total per branch. 3. Identifies months where a branch exceeded its own average.
Reading the query from top to bottom tells you exactly what each step does. Try expressing this as nested derived tables — it would be far less readable.
CTEs Referenced Multiple Times
A key advantage of CTEs over derived tables is that you can reference the same CTE multiple times:
WITH account_stats AS (
SELECT BRANCH_ID,
AVG(BALANCE) AS AVG_BALANCE,
MAX(BALANCE) AS MAX_BALANCE,
MIN(BALANCE) AS MIN_BALANCE
FROM ACCOUNT
GROUP BY BRANCH_ID
)
SELECT a.BRANCH_ID AS branch_a,
b.BRANCH_ID AS branch_b,
a.AVG_BALANCE - b.AVG_BALANCE AS BALANCE_DIFF
FROM account_stats a
CROSS JOIN account_stats b
WHERE a.BRANCH_ID < b.BRANCH_ID
ORDER BY ABS(a.AVG_BALANCE - b.AVG_BALANCE) DESC;
The CTE account_stats is referenced twice — once as a and once as b — to compare every pair of branches. With a derived table, you would have to write the aggregation query twice, which means double the code and double the risk of introducing inconsistencies.
Column Naming in CTEs
You can explicitly name the CTE's columns in the CTE definition:
WITH branch_summary (branch_id, account_count, total_balance) AS (
SELECT BRANCH_ID, COUNT(*), SUM(BALANCE)
FROM ACCOUNT
GROUP BY BRANCH_ID
)
SELECT * FROM branch_summary;
This is optional — if omitted, the CTE inherits column names from its SELECT list. Explicit naming is useful when the SELECT list contains expressions without natural names, or when you want to document the CTE's interface clearly.
CTEs in INSERT, UPDATE, and DELETE
CTEs are not limited to SELECT statements. DB2 allows CTEs in data modification statements:
-- INSERT using a CTE to compute the source data
WITH premium_customers AS (
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME,
SUM(a.BALANCE) AS TOTAL_BALANCE
FROM CUSTOMER c
JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
GROUP BY c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
HAVING SUM(a.BALANCE) >= 100000
)
INSERT INTO PREMIUM_CUSTOMER_LIST (CUSTOMER_ID, NAME, BALANCE, ADDED_DATE)
SELECT CUSTOMER_ID,
FIRST_NAME || ' ' || LAST_NAME,
TOTAL_BALANCE,
CURRENT DATE
FROM premium_customers;
-- DELETE using a CTE to identify targets
WITH inactive_accounts AS (
SELECT a.ACCOUNT_ID
FROM ACCOUNT a
WHERE a.STATUS = 'CLOSED'
AND NOT EXISTS (
SELECT 1
FROM TRANSACTION t
WHERE t.ACCOUNT_ID = a.ACCOUNT_ID
AND t.TRANSACTION_DATE >= CURRENT DATE - 7 YEARS
)
)
DELETE FROM ACCOUNT_NOTES n
WHERE n.ACCOUNT_ID IN (SELECT ACCOUNT_ID FROM inactive_accounts);
Using CTEs in data modification statements keeps the targeting logic separate from the modification action, improving readability and reducing errors.
Developing CTEs Incrementally
One of the most practical benefits of CTEs is incremental development. When building a complex query, follow this workflow:
- Write and test the first CTE standalone. Run it as a plain SELECT to verify it produces the expected rows.
- Add the second CTE. Reference the first CTE if needed. Run the second CTE's SELECT to verify.
- Continue adding CTEs one at a time, testing each step.
- Write the final SELECT that combines the CTEs.
This is far easier to debug than writing a deeply nested query all at once. If the final result is wrong, you can test each CTE independently to isolate the problem.
8.7 CTEs vs. Subqueries vs. Views — When to Use Which
All three mechanisms let you encapsulate a query and reference it from another query. Here is when to use each.
Use a Subquery When...
- The logic is simple (one level, no reuse).
- You need a quick scalar comparison in WHERE.
- The query is ad hoc and you are exploring data interactively.
- The subquery is small enough to read in place without confusion.
Use a CTE When...
- The query has multiple levels of logic (multi-step analysis).
- You need to reference the same intermediate result more than once.
- Readability matters (production code, shared queries, code reviews).
- You are building the query incrementally — CTEs let you develop and test each step.
Use a View When...
- The encapsulated query is reused across multiple SQL statements or applications.
- You need to grant access to the result set without exposing underlying tables.
- The logic represents a stable business concept (e.g., "active customers," "overdue loans").
- You want the definition stored in the database catalog for documentation and governance.
Practical Decision Guide
Ask these questions in order:
- Will this query be used in multiple SQL statements? Yes: create a VIEW. No: continue.
- Does the query have more than one level of nesting? Yes: use a CTE. No: continue.
- Is the subquery referenced more than once? Yes: use a CTE. No: continue.
- Is the subquery more than five lines? Yes: consider a CTE for readability. No: a subquery is fine.
This is a guideline, not a rule. Experienced developers develop an intuition for when nesting becomes unreadable. When in doubt, use a CTE — the performance is the same (Section 8.8), and the readability is always better.
8.8 CTE Materialization in DB2
When you write a CTE, DB2 has two choices for how to execute it:
-
Inline the CTE: merge the CTE's logic into the main query, as if you had written a derived table or subquery directly. The optimizer can then rearrange, push down predicates, and optimize the combined query holistically.
-
Materialize the CTE: execute the CTE's query, store the result in a temporary work area (a temp table in memory or on disk), and have the main query read from that materialized result.
Default Behavior
In DB2 for LUW (Linux, UNIX, Windows), the optimizer generally prefers to inline CTEs because this gives it the most freedom to optimize. The CTE is treated as a syntactic convenience — a named subquery — and the optimizer folds it into the main query plan.
DB2 will sometimes choose to materialize a CTE when:
- The CTE is referenced multiple times in the main query (materializing avoids redundant computation).
- The CTE contains an operation that acts as an optimization fence (e.g., certain OLAP functions, DISTINCT, or GROUP BY that the optimizer cannot push through).
- The optimizer estimates that materialization is cheaper than repeated inline evaluation.
Performance Implications
Inlining is usually best for simple CTEs that filter or project. The optimizer can push WHERE predicates from the outer query into the CTE, use indexes on the underlying tables, and choose the optimal join order.
Materialization is better when the CTE is expensive to compute and is referenced multiple times, or when the CTE result is much smaller than the underlying tables and subsequent operations benefit from the reduced data volume.
The risk of materialization is that it creates a temporary result set that: - May not have indexes (so subsequent joins against it use table scans). - Prevents the optimizer from pushing predicates through to the underlying tables. - Consumes memory or temp space.
Materialization Hints
In DB2 11.5 and later, you can influence the optimizer's choice with hints in the CTE definition. While the exact syntax can vary by DB2 version and fix pack, DB2 has introduced optimization profiles and guidelines that can encourage or discourage CTE materialization.
For cases where you want to control materialization behavior, consider these approaches:
Encouraging materialization when the CTE is expensive and referenced multiple times: - Use optimization profiles to specify materialization preferences. - In some contexts, creating a declared global temporary table (DGTT) and inserting the CTE result gives you explicit control, though this is outside the single-statement CTE model.
Discouraging materialization when you want the optimizer to inline and push predicates: - Keep the CTE simple — avoid DISTINCT, GROUP BY, or OLAP functions when they are not needed. - Reference the CTE only once if possible. - Ensure table statistics are current so the optimizer can make good cost estimates.
Checking Materialization in EXPLAIN
To see whether DB2 materialized a CTE, examine the EXPLAIN output:
EXPLAIN PLAN FOR
WITH expensive_cte AS (
SELECT BRANCH_ID, SUM(BALANCE) AS TOTAL
FROM ACCOUNT
GROUP BY BRANCH_ID
)
SELECT * FROM expensive_cte e1
JOIN expensive_cte e2 ON e1.BRANCH_ID <> e2.BRANCH_ID;
In the access plan, look for a TEMP operator or a reference to a temporary table. If present, the CTE was materialized. If the plan shows direct access to the ACCOUNT table in two separate branches of the plan tree, the CTE was inlined (and the aggregation is computed twice).
Practical Guidance
For most queries, do not worry about materialization. Write your CTEs for readability and let the optimizer decide. Intervene only when:
- EXPLAIN shows an inefficient plan.
- A CTE referenced multiple times is clearly being computed repeatedly.
- A materialized CTE's temporary result set is causing temp space issues.
In these cases, restructure the query, update statistics, or use optimization profiles to guide DB2's behavior.
Materialization and Recursive CTEs
Recursive CTEs (covered in Case Study 2) are always materialized. DB2 must store intermediate results to detect when recursion terminates (i.e., when a recursive step produces no new rows). The temporary work area grows with each recursion level. For hierarchical data with moderate depth (5-10 levels) and moderate breadth (hundreds of nodes per level), this is efficient. For extremely large or deep hierarchies, monitor temp space usage.
The Optimization Fence Concept
Certain SQL operations act as "optimization fences" — the optimizer cannot push predicates or reorder operations across them. In the context of CTEs, the following create fences:
- DISTINCT — DB2 must deduplicate before the outer query can process results.
- GROUP BY with aggregation — Aggregates must complete before outer predicates can apply.
- UNION (but not UNION ALL) — Deduplication creates a fence.
- OLAP/window functions — Must compute over the complete partition before outer filtering.
When a CTE contains an optimization fence, DB2 is more likely to materialize it, because it cannot fold the CTE's logic into the outer query's predicate evaluation. If you notice that a CTE with a simple filter is being materialized unexpectedly, check whether it contains any fence-creating operations.
8.9 Nested Subqueries — Queries Within Queries Within Queries
SQL allows arbitrary nesting depth. A subquery can contain another subquery, which can contain another. But just because you can nest deeply does not mean you should.
A Three-Level Example
"Find branches where the highest-balance customer has a balance above the bank-wide median balance":
SELECT b.BRANCH_NAME
FROM BRANCH b
WHERE (SELECT MAX(a.BALANCE)
FROM ACCOUNT a
WHERE a.BRANCH_ID = b.BRANCH_ID)
>
(SELECT AVG(mid.BALANCE)
FROM (SELECT a2.BALANCE
FROM ACCOUNT a2
ORDER BY a2.BALANCE
FETCH FIRST 50 PERCENT ROWS ONLY) AS mid);
This has three levels: the outer BRANCH query, two second-level subqueries (one correlated, one uncorrelated), and the uncorrelated one contains a derived table. It works, but reading it requires holding multiple contexts in your head simultaneously.
Readability Limits
Research on code comprehension consistently shows that humans struggle with more than two or three levels of nesting in any context — code blocks, parentheses, or SQL subqueries. At four levels, most developers cannot confidently predict what the query does without tracing through it line by line.
Signs your nesting is too deep:
- You have to scroll horizontally to see the full query.
- You lose track of which closing parenthesis belongs to which subquery.
- You cannot explain the query's purpose in one sentence per level.
- Your colleagues ask "what does this do?" every time they encounter it.
Refactoring to CTEs
The three-level example above becomes much clearer as CTEs:
WITH median_balance AS (
SELECT AVG(BALANCE) AS MEDIAN_APPROX
FROM (SELECT BALANCE
FROM ACCOUNT
ORDER BY BALANCE
FETCH FIRST 50 PERCENT ROWS ONLY) AS mid
),
branch_max_balance AS (
SELECT BRANCH_ID,
MAX(BALANCE) AS MAX_BALANCE
FROM ACCOUNT
GROUP BY BRANCH_ID
)
SELECT b.BRANCH_NAME
FROM BRANCH b
JOIN branch_max_balance bmb ON b.BRANCH_ID = bmb.BRANCH_ID
CROSS JOIN median_balance mb
WHERE bmb.MAX_BALANCE > mb.MEDIAN_APPROX;
Each CTE has a clear name and a single responsibility. The main query is a straightforward join with a filter. The logic is identical, but the readability improvement is substantial.
Refactoring Strategy
When faced with deeply nested subqueries:
- Identify the innermost subquery. Make it the first CTE.
- Work outward. Each level becomes the next CTE, referencing the previous one.
- Name each CTE after what it represents (e.g.,
branch_max_balance), not how it was computed. - Simplify the main query. After extracting CTEs, the main query should be a simple SELECT/JOIN/WHERE with no subqueries.
This mechanical process works for any nesting depth. The result is always more readable than the nested original.
8.10 Meridian Bank Complex Queries
Let's apply everything from this chapter to realistic Meridian National Bank scenarios. Each query addresses a business question that requires subqueries, CTEs, or both.
Query 1: Customers Above Average Balance
Business question: "List every customer whose total account balance exceeds the bank-wide average total balance per customer."
WITH customer_totals AS (
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
SUM(a.BALANCE) AS TOTAL_BALANCE
FROM CUSTOMER c
JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
GROUP BY c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
),
bank_average AS (
SELECT AVG(TOTAL_BALANCE) AS AVG_TOTAL_BALANCE
FROM customer_totals
)
SELECT ct.CUSTOMER_ID,
ct.FIRST_NAME,
ct.LAST_NAME,
ct.TOTAL_BALANCE,
ba.AVG_TOTAL_BALANCE,
ct.TOTAL_BALANCE - ba.AVG_TOTAL_BALANCE AS ABOVE_AVERAGE_BY
FROM customer_totals ct
CROSS JOIN bank_average ba
WHERE ct.TOTAL_BALANCE > ba.AVG_TOTAL_BALANCE
ORDER BY ct.TOTAL_BALANCE DESC;
The first CTE computes each customer's total balance. The second CTE computes the average of those totals. The main query compares and filters. Notice how the second CTE references the first — this is the multi-CTE chain pattern from Section 8.6.
Query 2: Accounts with No Transactions This Month
Business question: "Find all accounts that have had zero transactions in the current calendar month."
SELECT a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
a.BALANCE,
c.FIRST_NAME,
c.LAST_NAME
FROM ACCOUNT a
JOIN CUSTOMER c ON a.CUSTOMER_ID = c.CUSTOMER_ID
WHERE NOT EXISTS (
SELECT 1
FROM TRANSACTION t
WHERE t.ACCOUNT_ID = a.ACCOUNT_ID
AND YEAR(t.TRANSACTION_DATE) = YEAR(CURRENT DATE)
AND MONTH(t.TRANSACTION_DATE) = MONTH(CURRENT DATE)
)
ORDER BY a.BALANCE DESC;
NOT EXISTS is the natural choice here. We are looking for the absence of something — accounts with no matching transactions. The correlated subquery checks each account against the TRANSACTION table, filtering to the current month. An index on TRANSACTION(ACCOUNT_ID, TRANSACTION_DATE) makes this efficient.
Query 3: Branch Ranking by Performance
Business question: "Rank branches by total transaction volume this year, showing each branch's percentage of the bank-wide total and its rank."
WITH branch_volume AS (
SELECT a.BRANCH_ID,
COUNT(t.TRANSACTION_ID) AS TXN_COUNT,
SUM(t.AMOUNT) AS TXN_TOTAL
FROM TRANSACTION t
JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID
WHERE YEAR(t.TRANSACTION_DATE) = YEAR(CURRENT DATE)
GROUP BY a.BRANCH_ID
),
bank_total AS (
SELECT SUM(TXN_TOTAL) AS TOTAL_VOLUME
FROM branch_volume
)
SELECT b.BRANCH_NAME,
bv.TXN_COUNT,
bv.TXN_TOTAL,
bt.TOTAL_VOLUME,
DECIMAL(bv.TXN_TOTAL * 100.0 / bt.TOTAL_VOLUME, 5, 2)
AS PERCENT_OF_TOTAL,
ROW_NUMBER() OVER (ORDER BY bv.TXN_TOTAL DESC) AS RANK
FROM branch_volume bv
JOIN BRANCH b ON bv.BRANCH_ID = b.BRANCH_ID
CROSS JOIN bank_total bt
ORDER BY bv.TXN_TOTAL DESC;
This combines CTEs with a window function (ROW_NUMBER, previewing Chapter 10). The first CTE aggregates transactions by branch. The second CTE sums across all branches to get the bank-wide total. The main query joins both, computes percentages, and ranks.
Query 4: Customers Eligible for Premium Status
Business question: "A customer qualifies for premium status if they have at least $100,000 in total deposits AND at least one loan in good standing (no missed payments in the last 12 months) AND have been a customer for at least 2 years."
WITH customer_deposits AS (
SELECT CUSTOMER_ID,
SUM(BALANCE) AS TOTAL_DEPOSITS
FROM ACCOUNT
WHERE ACCOUNT_TYPE IN ('SAVINGS', 'CHECKING', 'MONEY_MARKET')
GROUP BY CUSTOMER_ID
HAVING SUM(BALANCE) >= 100000
),
good_standing_loans AS (
SELECT DISTINCT l.CUSTOMER_ID
FROM LOAN l
WHERE l.STATUS = 'ACTIVE'
AND NOT EXISTS (
SELECT 1
FROM LOAN_PAYMENT lp
WHERE lp.LOAN_ID = l.LOAN_ID
AND lp.STATUS = 'MISSED'
AND lp.DUE_DATE >= CURRENT DATE - 12 MONTHS
)
),
tenured_customers AS (
SELECT CUSTOMER_ID
FROM CUSTOMER
WHERE DATE_OPENED <= CURRENT DATE - 2 YEARS
)
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
cd.TOTAL_DEPOSITS
FROM CUSTOMER c
JOIN customer_deposits cd ON c.CUSTOMER_ID = cd.CUSTOMER_ID
JOIN good_standing_loans gsl ON c.CUSTOMER_ID = gsl.CUSTOMER_ID
JOIN tenured_customers tc ON c.CUSTOMER_ID = tc.CUSTOMER_ID
ORDER BY cd.TOTAL_DEPOSITS DESC;
Three CTEs, each encoding one business rule. The main query joins them with INNER JOINs, which enforces the AND logic: a customer must appear in all three CTEs to be in the final result. This is a powerful pattern for complex eligibility rules — each rule is isolated, testable, and readable.
Query 5: Month-Over-Month Transaction Growth by Branch
Business question: "For each branch, show the transaction count and total for each month this year, along with the change from the previous month."
WITH monthly_stats AS (
SELECT a.BRANCH_ID,
MONTH(t.TRANSACTION_DATE) AS TXN_MONTH,
COUNT(*) AS TXN_COUNT,
SUM(t.AMOUNT) AS TXN_TOTAL
FROM TRANSACTION t
JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID
WHERE YEAR(t.TRANSACTION_DATE) = YEAR(CURRENT DATE)
GROUP BY a.BRANCH_ID, MONTH(t.TRANSACTION_DATE)
),
with_previous AS (
SELECT curr.BRANCH_ID,
curr.TXN_MONTH,
curr.TXN_COUNT,
curr.TXN_TOTAL,
prev.TXN_COUNT AS PREV_COUNT,
prev.TXN_TOTAL AS PREV_TOTAL
FROM monthly_stats curr
LEFT JOIN monthly_stats prev
ON curr.BRANCH_ID = prev.BRANCH_ID
AND curr.TXN_MONTH = prev.TXN_MONTH + 1
)
SELECT b.BRANCH_NAME,
wp.TXN_MONTH,
wp.TXN_COUNT,
wp.TXN_TOTAL,
wp.TXN_COUNT - COALESCE(wp.PREV_COUNT, 0) AS COUNT_CHANGE,
wp.TXN_TOTAL - COALESCE(wp.PREV_TOTAL, 0) AS TOTAL_CHANGE,
CASE
WHEN wp.PREV_TOTAL IS NULL OR wp.PREV_TOTAL = 0 THEN NULL
ELSE DECIMAL((wp.TXN_TOTAL - wp.PREV_TOTAL) * 100.0
/ wp.PREV_TOTAL, 7, 2)
END AS PERCENT_CHANGE
FROM with_previous wp
JOIN BRANCH b ON wp.BRANCH_ID = b.BRANCH_ID
ORDER BY b.BRANCH_NAME, wp.TXN_MONTH;
The monthly_stats CTE is referenced twice in with_previous — once for the current month and once for the previous month — using a self-join on TXN_MONTH = TXN_MONTH + 1. This is a case where the CTE's reusability shines. With a derived table, you would have to write the aggregation query twice.
8.11 The Threshold Concept — Thinking in Sets, Not Loops
If you come from a procedural programming background — Java, Python, C# — you are accustomed to thinking in loops. "For each customer, look up their accounts. For each account, check if the balance is above average. If yes, add the customer to the result list."
SQL does not work this way. SQL operates on sets. A query defines a set of rows that satisfy certain conditions. The database engine figures out how to compute that set efficiently. Your job is to describe what you want, not how to compute it.
Subqueries and CTEs are the bridge between procedural thinking and set-based thinking. Consider the premium status query from Section 8.10. A procedural programmer might write:
result = []
for customer in all_customers:
total = sum of customer's deposit account balances
if total < 100000: continue
has_good_loan = false
for loan in customer's loans:
if loan is active and no missed payments in 12 months:
has_good_loan = true
break
if not has_good_loan: continue
if customer.tenure < 2 years: continue
result.append(customer)
The CTE version expresses the same logic without loops:
- Define the set of customers with sufficient deposits.
- Define the set of customers with good-standing loans.
- Define the set of customers with sufficient tenure.
- Intersect these three sets (using JOIN).
The set-based version is not just different syntax for the same thing. It gives DB2 freedom to execute the operations in any order, use any access method (index scan, hash join, merge join), and parallelize across partitions. The procedural version locks in a specific execution order — outer loop over customers, inner lookups for each — which may be far from optimal.
The Mental Shift
The mental shift from loops to sets involves asking different questions:
| Procedural Question | Set-Based Question |
|---|---|
| "For each customer, what is their balance?" | "What is the set of (customer, balance) pairs?" |
| "Does this customer have a savings account?" | "What is the set of customers who have savings accounts?" |
| "Loop through transactions and sum the deposits" | "What is the SUM of amounts WHERE type = 'DEPOSIT'?" |
| "If the count is zero, skip this customer" | "NOT EXISTS (matching rows)" |
When you catch yourself thinking "for each... if... then...", translate it to "the set of rows where...". This translation is exactly what subqueries and CTEs help you express.
A Practical Heuristic
If your query requires a correlated subquery that you cannot express as a join or CTE-based set operation, ask: "Am I thinking in loops?" Often, the correlated subquery is a translation artifact from procedural thinking, and a set-based reformulation exists that is both more readable and more efficient.
This is not always the case — some problems genuinely require correlation (like "the most recent transaction per account"). But it is worth checking before accepting a correlated subquery as the final form.
Spaced Review: Chapters 2, 5, and 6
Subqueries and CTEs build on everything you have learned so far. Let's reinforce key concepts from earlier chapters.
From Chapter 2: Data Types and Table Structure
Review question: Why does the data type of a scalar subquery's result matter?
When you write WHERE a.BALANCE > (SELECT AVG(BALANCE) FROM ACCOUNT), the subquery returns a DECIMAL or DOUBLE value (depending on the BALANCE column's type and the AVG function's return type). DB2 must be able to compare this with a.BALANCE using compatible types. If the subquery returned a VARCHAR, the comparison would fail with a type error. Always ensure that scalar subqueries return a type compatible with the context they appear in.
From Chapter 5: SELECT and WHERE
Review question: Can you use a column alias from the SELECT list in the WHERE clause of the same query?
No. This is why derived tables and CTEs exist. If you compute SUM(BALANCE) AS TOTAL_BALANCE in a subquery or CTE, the outer query can reference TOTAL_BALANCE in its WHERE clause. Within a single query level, WHERE is logically evaluated before SELECT, so aliases from SELECT are not available in WHERE.
From Chapter 6: Joins
Review question: What is the relationship between IN with a subquery and a semi-join?
WHERE c.CUSTOMER_ID IN (SELECT CUSTOMER_ID FROM ACCOUNT) is logically equivalent to a semi-join: "return each CUSTOMER row that has at least one matching ACCOUNT row." DB2's optimizer may execute it as a semi-join internally, using the same machinery as an EXISTS-based query. Understanding this equivalence helps you recognize that IN-subquery, EXISTS-subquery, and semi-join are three ways of expressing the same logical operation.
From Chapter 7: Aggregation and GROUP BY
Review question: Why does placing a GROUP BY inside a CTE and then joining the CTE to another table avoid the fan-out problem?
The fan-out problem occurs when you join two tables before aggregating, and the join multiplies rows. For example, joining CUSTOMER to ACCOUNT (one-to-many) before counting gives inflated counts. By aggregating inside a CTE first, the CTE produces one row per group. Joining this pre-aggregated result to other tables does not multiply rows, because the CTE's output is already at the correct granularity.
Summary
This chapter covered the mechanisms SQL provides for composing queries from other queries:
Scalar subqueries return a single value. Use them for simple comparisons in WHERE or computed columns in SELECT. Watch for the runtime error when a scalar subquery unexpectedly returns multiple rows.
Table subqueries with IN, NOT IN, ANY, and ALL compare a value against a set. Beware the NULL trap with NOT IN — prefer NOT EXISTS instead.
Correlated subqueries reference the outer query, requiring logical re-evaluation per outer row. They are powerful but expensive. Use them when the pattern is natural (e.g., "latest per group"), but look for set-based alternatives when possible.
EXISTS and NOT EXISTS perform existence checks efficiently. They implement semi-joins and anti-semi-joins, handle NULLs correctly, and often outperform IN for large data sets.
Derived tables place subqueries in the FROM clause, useful for pre-aggregation and predicate isolation.
Common Table Expressions name subqueries for readability and reuse. Multi-CTE chains transform deeply nested queries into top-to-bottom logical steps. They are the single most impactful readability tool for complex SQL.
CTE materialization in DB2 is an optimizer decision. CTEs are usually inlined, but may be materialized when referenced multiple times or when optimization fences are present. Use EXPLAIN to diagnose, and intervene only when performance demands it.
The underlying theme is set-based thinking: define the sets you need, then combine them with joins and filters. This mindset, more than any specific syntax, is what separates effective SQL practitioners from those who fight the language.
In Chapter 9, we will explore INSERT, UPDATE, DELETE, and MERGE — the data modification statements that change the contents of your tables. Subqueries and CTEs play a central role there too, enabling targeted modifications based on complex conditions.