> "Data lives in separate tables for good reason. Joins bring it together."
In This Chapter
- 6.1 Why Joins Exist — The Relational Design Trade-Off
- 6.2 INNER JOIN — The Workhorse
- 6.3 Table Aliases — Writing Readable Join Queries
- 6.4 LEFT OUTER JOIN — When the Right Side Might Be Missing
- 6.5 RIGHT OUTER JOIN and FULL OUTER JOIN
- 6.6 CROSS JOIN — The Cartesian Product
- 6.7 LATERAL Joins and TABLE Functions
- 6.8 Joining Three or More Tables
- 6.9 Join Conditions vs. WHERE Conditions
- 6.10 Common Join Mistakes
- 6.11 Meridian Bank Join Queries
- 6.12 A Preview: How the Optimizer Processes Joins
- Spaced Review — Chapters 1 Through 5
- Chapter Summary
Chapter 6: Joining Tables — INNER, OUTER, CROSS, and LATERAL
"Data lives in separate tables for good reason. Joins bring it together."
In Chapter 5 you learned to query a single table with precision — filtering rows, sorting results, transforming columns with scalar functions, and handling NULLs. Every query targeted one table at a time. That constraint ends now.
Relational databases spread data across many tables by design. A customer's name lives in CUSTOMER. Their account balances live in ACCOUNT. The branch where they opened that account lives in BRANCH. The transactions they made live in TRANSACTION. The employee who helped them lives in EMPLOYEE. No single table tells the whole story. Joins are how you reassemble the pieces.
This chapter is the pivot point of SQL fluency. Everything that follows — subqueries, aggregation over multiple entities, reporting, analytics — depends on your ability to join tables correctly and efficiently. We will cover every join type DB2 supports, show exactly when each one is appropriate, and work through realistic scenarios using our Meridian National Bank schema.
By the end of this chapter you will be able to:
- Explain why normalization requires joins and articulate the trade-offs
- Write INNER JOIN queries with single-column, multi-column, and non-equi join conditions
- Choose correctly among LEFT, RIGHT, and FULL OUTER JOIN based on business requirements
- Apply CROSS JOIN deliberately for Cartesian products (and recognize when you have created one by accident)
- Use LATERAL joins for correlated subquery-like behavior
- Join three, four, or more tables in a single query with confidence
- Distinguish join conditions from WHERE conditions, especially in outer joins
- Recognize the most common join mistakes before they reach production
6.1 Why Joins Exist — The Relational Design Trade-Off
Before we write a single join, let us understand why data is split across tables in the first place.
The Redundancy Problem
Consider what happens when you try to store everything in a single flat table. In the early days of computing — before relational databases — this is exactly how data was organized: one giant file with every piece of information on every record. Some organizations still maintain spreadsheets this way, and the problems become apparent quickly.
Imagine storing everything about a customer's transactions in one massive table:
| CUST_ID | CUST_NAME | CUST_ADDRESS | ACCT_ID | ACCT_TYPE | BRANCH_NAME | BRANCH_CITY | TXN_ID | TXN_DATE | TXN_AMT |
|---|---|---|---|---|---|---|---|---|---|
| 1001 | Alice Morgan | 42 Elm St, Chicago | 5001 | CHECKING | Downtown Branch | Chicago | 90001 | 2024-01-15 | 250.00 |
| 1001 | Alice Morgan | 42 Elm St, Chicago | 5001 | CHECKING | Downtown Branch | Chicago | 90002 | 2024-01-16 | -45.00 |
| 1001 | Alice Morgan | 42 Elm St, Chicago | 5001 | CHECKING | Downtown Branch | Chicago | 90003 | 2024-01-17 | 1200.00 |
| 1001 | Alice Morgan | 42 Elm St, Chicago | 5002 | SAVINGS | Downtown Branch | Chicago | 90004 | 2024-02-01 | 5000.00 |
Alice's name, address, account type, and branch information are repeated on every single transaction row. If Alice has 500 transactions across two accounts, her name appears 500 times. Her branch city appears 500 times. This is called update anomaly territory: if Alice moves to a new address, you must update 500 rows — and if you miss one, your data is inconsistent.
The problems with this flat structure go beyond storage waste:
- Update anomalies: Change Alice's address and you must find every row. Miss one and you have contradictory data.
- Insert anomalies: You cannot record a new branch until someone opens an account there — the branch has no independent existence.
- Delete anomalies: If you delete Alice's last transaction, you lose her customer information entirely.
These are the classic motivations for normalization, and they explain why every well-designed relational database distributes data across multiple tables.
The Normalized Solution
Relational database design (normalization, covered in depth in Chapter 16) solves this by storing each fact exactly once:
- CUSTOMER stores Alice's name and address once.
- ACCOUNT stores each account once, linking to CUSTOMER via CUSTOMER_ID.
- BRANCH stores each branch once, linked from ACCOUNT via BRANCH_ID.
- TRANSACTION stores each transaction once, linking to ACCOUNT via ACCOUNT_ID.
This eliminates redundancy, but it creates a new requirement: when you need a report that shows customer names alongside their transaction amounts and branch locations, you must join the tables back together.
The Meridian National Bank Schema (Reminder)
Here are the tables we have been building since Chapter 2, shown with their primary and foreign keys:
CUSTOMER (CUSTOMER_ID PK, FIRST_NAME, LAST_NAME, DATE_OF_BIRTH, EMAIL, PHONE, ADDRESS, CITY, STATE, ZIP_CODE, CREATED_DATE)
ACCOUNT (ACCOUNT_ID PK, CUSTOMER_ID FK→CUSTOMER, BRANCH_ID FK→BRANCH, ACCOUNT_TYPE, BALANCE, OPENED_DATE, STATUS)
BRANCH (BRANCH_ID PK, BRANCH_NAME, ADDRESS, CITY, STATE, ZIP_CODE, PHONE, MANAGER_ID FK→EMPLOYEE)
TRANSACTION (TXN_ID PK, ACCOUNT_ID FK→ACCOUNT, TXN_TYPE, AMOUNT, TXN_DATE, DESCRIPTION, RELATED_ACCOUNT_ID FK→ACCOUNT)
EMPLOYEE (EMPLOYEE_ID PK, FIRST_NAME, LAST_NAME, BRANCH_ID FK→BRANCH, POSITION, HIRE_DATE, SALARY, MANAGER_EMPLOYEE_ID FK→EMPLOYEE)
The foreign key arrows are your roadmap for joins. Every FK→PK relationship is a natural join path. Keep this schema in front of you as we work through the chapter.
The Join Mental Model
Here is a way to think about joins that will serve you throughout your career: a join is a question about relationships. When you write CUSTOMER INNER JOIN ACCOUNT ON c.CUSTOMER_ID = a.CUSTOMER_ID, you are asking "for each customer, what accounts do they have?" When you write ACCOUNT LEFT OUTER JOIN TRANSACTION ON a.ACCOUNT_ID = t.ACCOUNT_ID, you are asking "for each account, what transactions exist — and show me the account even if the answer is none."
Every join answers a relationship question. The join type determines what happens when one side of the relationship is empty.
6.2 INNER JOIN — The Workhorse
The INNER JOIN is the most common join type. It returns only the rows where matching values exist in both tables. If a row in the left table has no match in the right table, it is excluded. If a row in the right table has no match in the left table, it is excluded.
Basic Syntax
SELECT columns
FROM table_a
INNER JOIN table_b
ON table_a.join_column = table_b.join_column;
The INNER keyword is optional — a plain JOIN defaults to INNER JOIN in DB2 (and in the SQL standard). However, we will always write INNER JOIN explicitly for clarity.
Your First Join: Customers and Their Accounts
Business question: Show every customer alongside their account information.
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
a.BALANCE
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID;
Expected output:
CUSTOMER_ID FIRST_NAME LAST_NAME ACCOUNT_ID ACCOUNT_TYPE BALANCE
----------- ---------- --------- ---------- ------------ ----------
1001 Alice Morgan 5001 CHECKING 4,250.00
1001 Alice Morgan 5002 SAVINGS 12,800.00
1002 Brian Chen 5003 CHECKING 1,100.00
1003 Clara Davis 5004 CHECKING 8,900.00
1003 Clara Davis 5005 SAVINGS 45,000.00
1003 Clara Davis 5006 CD 25,000.00
1004 David Kim 5007 CHECKING 320.00
Notice that Alice appears twice (she has two accounts) and Clara appears three times (she has three accounts). This is expected and correct — the join produces one row for every matching pair.
This one-to-many relationship is fundamental. The CUSTOMER table has a one-to-many relationship with ACCOUNT (one customer can own many accounts). When you join them, the "one" side (CUSTOMER) is duplicated to match each row on the "many" side (ACCOUNT). The result set has as many rows as there are accounts — not as many as there are customers.
Understanding this row-multiplication behavior is crucial. It is not a bug; it is the defined behavior of SQL joins. But it can cause problems if you later aggregate these results without accounting for it (we cover this trap in Section 6.10).
How INNER JOIN Works Conceptually
Think of it as two nested loops:
- Pick a row from CUSTOMER.
- Scan every row in ACCOUNT looking for rows where
ACCOUNT.CUSTOMER_ID = CUSTOMER.CUSTOMER_ID. - For each match, produce an output row combining columns from both tables.
- Move to the next CUSTOMER row and repeat.
(The actual execution may differ — DB2's optimizer may use hash joins, merge joins, or index lookups — but the logical result is identical. We will preview optimizer strategies in Section 6.12.)
Multi-Column Join Conditions
Sometimes a single column is not enough to establish the relationship. Consider a hypothetical ACCOUNT_HISTORY table that tracks balance snapshots by account and date:
SELECT a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
h.SNAPSHOT_DATE,
h.BALANCE_SNAPSHOT
FROM ACCOUNT a
INNER JOIN ACCOUNT_HISTORY h
ON a.ACCOUNT_ID = h.ACCOUNT_ID
AND h.SNAPSHOT_DATE = CURRENT DATE;
The ON clause can contain multiple conditions connected with AND. Both conditions must be true for a row pair to appear in the result.
Multi-column joins are common in several scenarios:
- Composite keys: Some tables use multiple columns as their primary key. For example, a DAILY_BALANCE table might have a composite key of (ACCOUNT_ID, BALANCE_DATE).
- Temporal joins: Joining to a history or audit table where you need to match both the entity and a specific time period.
- Business rules: The relationship between two tables may require more than one column to be meaningful — for example, matching on both department and job code.
You can include as many conditions in the ON clause as needed. All must evaluate to TRUE for a row pair to be included in the result.
Non-Equi Joins
Most joins use equality (=), but DB2 allows any comparison operator in the ON clause:
-- Find employees whose salary exceeds their branch's average
SELECT e.FIRST_NAME,
e.LAST_NAME,
e.SALARY,
ba.AVG_SALARY
FROM EMPLOYEE e
INNER JOIN (
SELECT BRANCH_ID, AVG(SALARY) AS AVG_SALARY
FROM EMPLOYEE
GROUP BY BRANCH_ID
) ba
ON e.BRANCH_ID = ba.BRANCH_ID
AND e.SALARY > ba.AVG_SALARY;
Non-equi joins are also useful for range-based lookups — for example, joining a transaction to a fee schedule where the fee depends on the transaction amount falling within a range:
SELECT t.TXN_ID,
t.AMOUNT,
f.FEE_PERCENTAGE
FROM TRANSACTION t
INNER JOIN FEE_SCHEDULE f
ON t.AMOUNT >= f.MIN_AMOUNT
AND t.AMOUNT < f.MAX_AMOUNT;
Self-Joins
A self-join joins a table to itself. This is essential when a table contains a hierarchical or self-referencing relationship. Our EMPLOYEE table has a MANAGER_EMPLOYEE_ID column that points back to EMPLOYEE_ID in the same table.
Business question: List every employee alongside their manager's name.
SELECT e.EMPLOYEE_ID,
e.FIRST_NAME || ' ' || e.LAST_NAME AS EMPLOYEE_NAME,
e.POSITION,
m.FIRST_NAME || ' ' || m.LAST_NAME AS MANAGER_NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE m
ON e.MANAGER_EMPLOYEE_ID = m.EMPLOYEE_ID;
Expected output:
EMPLOYEE_ID EMPLOYEE_NAME POSITION MANAGER_NAME
----------- --------------- -------------------- ---------------
2002 Sarah Palmer Senior Teller James Whitfield
2003 Miguel Reyes Loan Officer James Whitfield
2004 Karen Liu Teller Sarah Palmer
2005 Tom Bradley Operations Analyst James Whitfield
2006 Nina Patel Branch Manager Robert Haines
2007 Derek Olsen Teller Nina Patel
Notice that James Whitfield (the top-level manager or CEO-equivalent in our sample data) does not appear as an employee in this result — his MANAGER_EMPLOYEE_ID is NULL, and an INNER JOIN excludes NULLs. We will fix this with a LEFT OUTER JOIN in Section 6.4.
Self-joins are also useful for comparing rows within the same table. For example, finding all pairs of employees at the same branch who have a salary difference of more than $10,000:
SELECT e1.FIRST_NAME || ' ' || e1.LAST_NAME AS EMPLOYEE_A,
e1.SALARY AS SALARY_A,
e2.FIRST_NAME || ' ' || e2.LAST_NAME AS EMPLOYEE_B,
e2.SALARY AS SALARY_B,
ABS(e1.SALARY - e2.SALARY) AS SALARY_GAP
FROM EMPLOYEE e1
INNER JOIN EMPLOYEE e2
ON e1.BRANCH_ID = e2.BRANCH_ID
AND e1.EMPLOYEE_ID < e2.EMPLOYEE_ID
WHERE ABS(e1.SALARY - e2.SALARY) > 10000.00
ORDER BY SALARY_GAP DESC;
The condition e1.EMPLOYEE_ID < e2.EMPLOYEE_ID ensures each pair appears only once (not both A-B and B-A) and excludes self-pairs (A-A). This is a standard technique whenever you compare rows within the same table.
The USING Clause (Alternative to ON)
DB2 supports an alternative syntax when the join columns have the same name in both tables:
SELECT c.FIRST_NAME, a.ACCOUNT_ID, a.BALANCE
FROM CUSTOMER c
INNER JOIN ACCOUNT a
USING (CUSTOMER_ID);
The USING clause is shorthand for ON c.CUSTOMER_ID = a.CUSTOMER_ID. There is one behavioral difference: with USING, the join column appears only once in the result (not qualified by either table alias). With ON, both c.CUSTOMER_ID and a.CUSTOMER_ID are available.
In this book we prefer the ON clause because it is more explicit and works for all join conditions, including multi-column joins and non-equi joins. However, you may encounter USING in production code, and it is perfectly valid.
6.3 Table Aliases — Writing Readable Join Queries
You have already seen aliases like c for CUSTOMER and a for ACCOUNT. In single-table queries, aliases are a convenience. In joins, they are a necessity.
Why Aliases Are Essential
When two tables share a column name (e.g., both CUSTOMER and ACCOUNT have a CUSTOMER_ID column), you must qualify every reference:
-- Without aliases (verbose but legal)
SELECT CUSTOMER.FIRST_NAME,
ACCOUNT.ACCOUNT_ID,
ACCOUNT.BALANCE
FROM CUSTOMER
INNER JOIN ACCOUNT
ON CUSTOMER.CUSTOMER_ID = ACCOUNT.CUSTOMER_ID;
With aliases, the same query is far more readable:
-- With aliases (preferred)
SELECT c.FIRST_NAME,
a.ACCOUNT_ID,
a.BALANCE
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID;
Self-Join Aliases Are Mandatory
In a self-join, aliases are not optional — they are required to distinguish the two "copies" of the table:
FROM EMPLOYEE e -- the employee copy
INNER JOIN EMPLOYEE m -- the manager copy
ON e.MANAGER_EMPLOYEE_ID = m.EMPLOYEE_ID
Without aliases, DB2 would have no way to know which EMPLOYEE table reference you mean.
Alias Conventions
Adopt a consistent convention. Here are common approaches:
| Convention | Example | Pros |
|---|---|---|
| First letter | c for CUSTOMER |
Short, fast to type |
| Abbreviation | cust for CUSTOMER |
More readable in complex queries |
| Role-based | mgr for EMPLOYEE (manager copy) |
Clarifies purpose in self-joins |
In this book, we use single-letter or short abbreviations. In production SQL with five or more tables joined, slightly longer aliases like cust, acct, txn improve readability significantly.
Column Qualification Best Practices
Even when a column name is unique across all joined tables, qualify it with the alias anyway. Consider:
-- Ambiguous: which table does CITY come from?
SELECT c.FIRST_NAME, a.ACCOUNT_ID, CITY
FROM CUSTOMER c
INNER JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b ON a.BRANCH_ID = b.BRANCH_ID;
Both CUSTOMER and BRANCH have a CITY column. If you do not qualify it, DB2 will raise an error: SQLCODE -203 (SQLSTATE 42702): column name is ambiguous. Even for columns that are currently unambiguous, qualifying them protects you against future schema changes — if someone adds a CITY column to ACCOUNT later, your unqualified query will break.
Rule of thumb: In any query with more than one table, qualify every column with its table alias. No exceptions.
6.4 LEFT OUTER JOIN — When the Right Side Might Be Missing
An INNER JOIN excludes rows that have no match. Sometimes that is exactly what you want. Other times, you need to preserve all rows from one table, whether or not a match exists in the other.
The Syntax
SELECT columns
FROM table_a
LEFT OUTER JOIN table_b
ON table_a.join_column = table_b.join_column;
The OUTER keyword is optional in DB2. LEFT JOIN and LEFT OUTER JOIN are identical. We write LEFT OUTER JOIN for clarity.
How It Works
A LEFT OUTER JOIN returns:
- Every matched pair (same as INNER JOIN), plus
- Every row from the left table that has no match in the right table, with NULLs filling in the right table's columns.
Customers With or Without Accounts
Business question: List all customers, showing their accounts if they have any. Include customers who have no accounts.
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
a.BALANCE
FROM CUSTOMER c
LEFT OUTER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
ORDER BY c.CUSTOMER_ID;
Expected output:
CUSTOMER_ID FIRST_NAME LAST_NAME ACCOUNT_ID ACCOUNT_TYPE BALANCE
----------- ---------- --------- ---------- ------------ ----------
1001 Alice Morgan 5001 CHECKING 4,250.00
1001 Alice Morgan 5002 SAVINGS 12,800.00
1002 Brian Chen 5003 CHECKING 1,100.00
1003 Clara Davis 5004 CHECKING 8,900.00
1003 Clara Davis 5005 SAVINGS 45,000.00
1003 Clara Davis 5006 CD 25,000.00
1004 David Kim 5007 CHECKING 320.00
1005 Elena Vargas - - -
Elena Vargas has no accounts. With an INNER JOIN she would vanish from the result. With a LEFT OUTER JOIN she appears, and all ACCOUNT columns are NULL (shown as - here for readability).
Finding Rows With No Match (Anti-Join Pattern)
One of the most practical uses of LEFT OUTER JOIN is finding rows in the left table that have no corresponding row in the right table:
-- Find customers with no accounts
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME
FROM CUSTOMER c
LEFT OUTER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
WHERE a.ACCOUNT_ID IS NULL;
Expected output:
CUSTOMER_ID FIRST_NAME LAST_NAME
----------- ---------- ---------
1005 Elena Vargas
The WHERE a.ACCOUNT_ID IS NULL filter keeps only the unmatched rows. This is called an anti-join pattern. It answers questions like:
- Which customers have never made a transaction?
- Which branches have no employees assigned?
- Which products have never been ordered?
This pattern is functionally equivalent to NOT EXISTS and NOT IN subqueries, but many developers find the LEFT JOIN version more readable. DB2's optimizer typically generates the same access plan for all three forms.
Here are the three equivalent ways to express an anti-join, for comparison:
-- Method 1: LEFT OUTER JOIN + IS NULL (covered above)
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
LEFT OUTER JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
WHERE a.ACCOUNT_ID IS NULL;
-- Method 2: NOT EXISTS (correlated subquery)
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE NOT EXISTS (
SELECT 1 FROM ACCOUNT a WHERE a.CUSTOMER_ID = c.CUSTOMER_ID
);
-- Method 3: NOT IN (simple subquery — be careful with NULLs!)
SELECT c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
FROM CUSTOMER c
WHERE c.CUSTOMER_ID NOT IN (
SELECT a.CUSTOMER_ID FROM ACCOUNT a WHERE a.CUSTOMER_ID IS NOT NULL
);
All three return the same result. Method 3 has a subtle trap: if the subquery returns any NULL values, NOT IN returns no rows at all (because x NOT IN (1, 2, NULL) evaluates to UNKNOWN for any x). The WHERE a.CUSTOMER_ID IS NOT NULL guard clause prevents this, but it is easy to forget. For this reason, most DB2 practitioners prefer Method 1 or Method 2.
Fixing the Self-Join: Employees Including Top-Level Managers
Remember that our INNER JOIN self-join in Section 6.2 excluded the top-level manager because their MANAGER_EMPLOYEE_ID is NULL. A LEFT OUTER JOIN fixes this:
SELECT e.EMPLOYEE_ID,
e.FIRST_NAME || ' ' || e.LAST_NAME AS EMPLOYEE_NAME,
e.POSITION,
COALESCE(m.FIRST_NAME || ' ' || m.LAST_NAME, '(none — top level)') AS MANAGER_NAME
FROM EMPLOYEE e
LEFT OUTER JOIN EMPLOYEE m
ON e.MANAGER_EMPLOYEE_ID = m.EMPLOYEE_ID
ORDER BY e.EMPLOYEE_ID;
Expected output:
EMPLOYEE_ID EMPLOYEE_NAME POSITION MANAGER_NAME
----------- --------------- -------------------- -------------------
2001 James Whitfield Regional Director (none — top level)
2002 Sarah Palmer Senior Teller James Whitfield
2003 Miguel Reyes Loan Officer James Whitfield
2004 Karen Liu Teller Sarah Palmer
2005 Tom Bradley Operations Analyst James Whitfield
2006 Nina Patel Branch Manager Robert Haines
2007 Derek Olsen Teller Nina Patel
James Whitfield now appears, with (none — top level) for his manager name, thanks to COALESCE handling the NULL from the unmatched right side.
Branches With or Without Employees
Business question: Show all branches and their employee count, including branches that might have zero employees.
SELECT b.BRANCH_ID,
b.BRANCH_NAME,
COUNT(e.EMPLOYEE_ID) AS EMPLOYEE_COUNT
FROM BRANCH b
LEFT OUTER JOIN EMPLOYEE e
ON b.BRANCH_ID = e.BRANCH_ID
GROUP BY b.BRANCH_ID, b.BRANCH_NAME
ORDER BY EMPLOYEE_COUNT DESC;
Expected output:
BRANCH_ID BRANCH_NAME EMPLOYEE_COUNT
--------- ------------------ --------------
101 Downtown Branch 4
102 Northside Branch 3
103 Airport Branch 2
104 University Branch 0
COUNT(e.EMPLOYEE_ID) correctly returns 0 for University Branch because COUNT(column) does not count NULLs. If we had written COUNT(*) instead, it would return 1 for University Branch (counting the preserved left-side row), which would be incorrect.
This is one of the most common bugs in LEFT OUTER JOIN queries with aggregation. Memorize this rule:
| Function | NULL Behavior | Use With Outer Joins |
|---|---|---|
COUNT(*) |
Counts all rows including NULL rows | Almost always wrong |
COUNT(column) |
Skips NULLs | Correct for counting matched rows |
SUM(column) |
Skips NULLs (returns NULL if all NULL) | Wrap in COALESCE for 0 |
AVG(column) |
Skips NULLs | Wrap in COALESCE for 0 |
When you combine LEFT OUTER JOIN with GROUP BY, always use COUNT(specific_column) from the right-side table, and wrap SUM() and AVG() in COALESCE() to handle the all-NULL case.
6.5 RIGHT OUTER JOIN and FULL OUTER JOIN
RIGHT OUTER JOIN
A RIGHT OUTER JOIN preserves all rows from the right table, filling in NULLs for the left table when there is no match. It is the mirror image of LEFT OUTER JOIN.
SELECT a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
t.TXN_ID,
t.TXN_TYPE,
t.AMOUNT
FROM TRANSACTION t
RIGHT OUTER JOIN ACCOUNT a
ON t.ACCOUNT_ID = a.ACCOUNT_ID
ORDER BY a.ACCOUNT_ID;
This returns all accounts, including those with no transactions.
In practice, RIGHT OUTER JOIN is rarely used. Most SQL authors structure their queries so that the "must-preserve" table appears on the left side, making LEFT OUTER JOIN the natural choice. You can always rewrite a RIGHT JOIN as a LEFT JOIN by swapping the table order:
-- These two queries produce identical results
-- Version 1: RIGHT OUTER JOIN
SELECT a.ACCOUNT_ID, t.TXN_ID
FROM TRANSACTION t
RIGHT OUTER JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID;
-- Version 2: LEFT OUTER JOIN (preferred — same result, more conventional)
SELECT a.ACCOUNT_ID, t.TXN_ID
FROM ACCOUNT a
LEFT OUTER JOIN TRANSACTION t ON a.ACCOUNT_ID = t.ACCOUNT_ID;
Our recommendation: prefer LEFT OUTER JOIN for consistency. Reserve RIGHT OUTER JOIN for situations where rewriting would make the query less readable — for example, when adding a new table to an existing multi-join query and you do not want to rearrange the FROM clause.
In over twenty years of DB2 production code, the author has seen RIGHT OUTER JOIN used intentionally fewer than a dozen times. It is not wrong, but the SQL community has standardized on LEFT OUTER JOIN as the default outer join direction. When you see RIGHT OUTER JOIN in someone else's code, the first question to ask is whether the tables could be swapped to use LEFT instead.
FULL OUTER JOIN
A FULL OUTER JOIN preserves unmatched rows from both sides:
SELECT c.CUSTOMER_ID,
c.LAST_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE
FROM CUSTOMER c
FULL OUTER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
ORDER BY c.CUSTOMER_ID;
This returns: - All matched customer-account pairs - Customers with no accounts (ACCOUNT columns are NULL) - Accounts with no customer (CUSTOMER columns are NULL — this would indicate orphaned data)
When is FULL OUTER JOIN useful?
FULL OUTER JOIN is less common than LEFT OUTER JOIN but invaluable in specific scenarios:
- Data reconciliation: Comparing two data sources to find records present in one but not the other.
-- Compare internal accounts with external audit file
SELECT i.ACCOUNT_ID AS INTERNAL_ACCT,
e.ACCOUNT_ID AS EXTERNAL_ACCT,
CASE
WHEN i.ACCOUNT_ID IS NULL THEN 'Missing internally'
WHEN e.ACCOUNT_ID IS NULL THEN 'Missing externally'
ELSE 'Matched'
END AS STATUS
FROM INTERNAL_ACCOUNTS i
FULL OUTER JOIN EXTERNAL_AUDIT e
ON i.ACCOUNT_ID = e.ACCOUNT_ID
WHERE i.ACCOUNT_ID IS NULL
OR e.ACCOUNT_ID IS NULL;
-
Merging time series: Combining data from two sources that may have entries on different dates.
-
Gap detection: Finding periods where expected data is missing from either side.
FULL OUTER JOIN is sometimes called "the reconciliation join" because of its power in data quality scenarios. In the Meridian Bank context, consider a nightly batch process that imports transaction data from an ATM network. A FULL OUTER JOIN between the internal transaction log and the external ATM feed quickly reveals:
- Transactions recorded internally but not in the ATM feed (potential system error)
- Transactions in the ATM feed but not recorded internally (potential data loss)
- Transactions in both systems but with mismatched amounts (potential corruption)
No other single join type provides this complete three-way classification.
6.6 CROSS JOIN — The Cartesian Product
A CROSS JOIN combines every row from the left table with every row from the right table. If the left table has m rows and the right table has n rows, the result has m × n rows.
Syntax
SELECT columns
FROM table_a
CROSS JOIN table_b;
There is no ON clause — every row pairs with every other row.
Why Would You Want This?
Cross joins are useful for generating combinations:
Example 1: Generate all account-type/branch combinations for a report template.
SELECT b.BRANCH_NAME,
at.ACCOUNT_TYPE
FROM BRANCH b
CROSS JOIN (VALUES ('CHECKING'), ('SAVINGS'), ('CD'), ('MONEY_MARKET')) AS at(ACCOUNT_TYPE)
ORDER BY b.BRANCH_NAME, at.ACCOUNT_TYPE;
Expected output:
BRANCH_NAME ACCOUNT_TYPE
------------------ ------------
Airport Branch CD
Airport Branch CHECKING
Airport Branch MONEY_MARKET
Airport Branch SAVINGS
Downtown Branch CD
Downtown Branch CHECKING
Downtown Branch MONEY_MARKET
Downtown Branch SAVINGS
Northside Branch CD
Northside Branch CHECKING
Northside Branch MONEY_MARKET
Northside Branch SAVINGS
University Branch CD
University Branch CHECKING
University Branch MONEY_MARKET
University Branch SAVINGS
This produces a skeleton that can be LEFT OUTER JOINed to actual account data to create a report showing zero where no accounts of that type exist at a branch.
Example 2: Generate a date range for reporting.
-- Generate every date in January 2024
WITH DATE_RANGE(D) AS (
VALUES DATE('2024-01-01')
UNION ALL
SELECT D + 1 DAY FROM DATE_RANGE WHERE D < DATE('2024-01-31')
)
SELECT dr.D AS REPORT_DATE,
b.BRANCH_ID,
b.BRANCH_NAME
FROM DATE_RANGE dr
CROSS JOIN BRANCH b
ORDER BY dr.D, b.BRANCH_ID;
This produces 31 days × 4 branches = 124 rows — a reporting skeleton for daily branch metrics.
Example 3: Comparing every pair of branches.
SELECT b1.BRANCH_NAME AS BRANCH_A,
b2.BRANCH_NAME AS BRANCH_B
FROM BRANCH b1
CROSS JOIN BRANCH b2
WHERE b1.BRANCH_ID < b2.BRANCH_ID
ORDER BY b1.BRANCH_NAME, b2.BRANCH_NAME;
The WHERE b1.BRANCH_ID < b2.BRANCH_ID clause eliminates duplicates and self-pairs, giving unique combinations.
Understanding Cross Join Sizes
Always calculate the expected result size before running a CROSS JOIN:
| Left Table Rows | Right Table Rows | Result Rows | Assessment |
|---|---|---|---|
| 4 | 4 | 16 | Trivial |
| 31 | 4 | 124 | Small reporting skeleton |
| 365 | 4 | 1,460 | Yearly daily report — still fine |
| 10,000 | 10,000 | 100,000,000 | Dangerous — probably unintended |
A CROSS JOIN between two tables of 10,000 rows each produces 100 million rows. If the tables have wide columns, this can exhaust memory and fill temp tablespaces. Always ask yourself: "Do I genuinely need every combination?" If the answer is no, you probably want an INNER JOIN with a specific condition instead.
The Danger of Accidental Cross Joins
If you join two tables and forget the ON clause, you get a cross join whether you intended one or not:
-- DANGEROUS: missing ON clause produces a Cartesian product!
SELECT c.FIRST_NAME, a.ACCOUNT_ID
FROM CUSTOMER c, ACCOUNT a;
-- If CUSTOMER has 5,000 rows and ACCOUNT has 20,000 rows,
-- this returns 100,000,000 rows.
The old-style comma-separated FROM clause (pre-SQL:1992 syntax) still works in DB2 but makes accidental cross joins easy. Always use explicit JOIN ... ON syntax. We will return to this in Section 6.10.
CROSS JOIN vs. Comma Syntax
Technically, FROM A, B (with no WHERE clause) and FROM A CROSS JOIN B are equivalent — both produce a Cartesian product. The difference is intent. When you write CROSS JOIN, you are explicitly declaring that you want every combination. When you write FROM A, B and forget the WHERE clause, you have a bug. Use CROSS JOIN when you mean it, and INNER JOIN ... ON when you want matched rows.
6.7 LATERAL Joins and TABLE Functions
LATERAL is a powerful but often overlooked join type. It allows the right side of a join to reference columns from the left side — effectively embedding a correlated subquery in the FROM clause.
The Problem LATERAL Solves
Suppose you want to find, for each customer, their two most recent transactions. With a regular join you cannot express "top N per group" cleanly in the FROM clause. LATERAL makes it possible.
Syntax
SELECT columns
FROM table_a
, LATERAL (
SELECT columns
FROM table_b
WHERE table_b.some_column = table_a.some_column
ORDER BY ...
FETCH FIRST n ROWS ONLY
) AS alias;
The subquery inside LATERAL (...) can reference table_a columns, which a regular subquery in FROM cannot do.
Example: Top 2 Transactions Per Customer
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
lt.TXN_ID,
lt.AMOUNT,
lt.TXN_DATE
FROM CUSTOMER c,
LATERAL (
SELECT t.TXN_ID, t.AMOUNT, t.TXN_DATE
FROM TRANSACTION t
INNER JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID
WHERE a.CUSTOMER_ID = c.CUSTOMER_ID
ORDER BY t.TXN_DATE DESC
FETCH FIRST 2 ROWS ONLY
) AS lt
ORDER BY c.CUSTOMER_ID, lt.TXN_DATE DESC;
Expected output:
CUSTOMER_ID FIRST_NAME LAST_NAME TXN_ID AMOUNT TXN_DATE
----------- ---------- --------- ------ --------- ----------
1001 Alice Morgan 90015 1,200.00 2024-03-15
1001 Alice Morgan 90012 -200.00 2024-03-10
1002 Brian Chen 90020 500.00 2024-03-14
1002 Brian Chen 90018 -75.00 2024-03-12
1003 Clara Davis 90025 5,000.00 2024-03-16
1003 Clara Davis 90022 -1,500.00 2024-03-13
Without LATERAL, achieving this result requires window functions (Chapter 14) or complex correlated subqueries in the SELECT list. LATERAL provides a clean, readable alternative.
Important Notes on LATERAL in DB2
- DB2 for LUW (Linux, UNIX, Windows) has supported LATERAL since version 9.1.
- DB2 for z/OS added LATERAL support in version 12.
- The comma before LATERAL acts as a cross-apply-style join. If the LATERAL subquery returns no rows for a given left-side row, that left-side row is excluded (like an INNER JOIN). To preserve left-side rows with no matches, use
LEFT OUTER JOIN LATERAL (...):
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
lt.TXN_ID,
lt.AMOUNT
FROM CUSTOMER c
LEFT OUTER JOIN LATERAL (
SELECT t.TXN_ID, t.AMOUNT
FROM TRANSACTION t
INNER JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID
WHERE a.CUSTOMER_ID = c.CUSTOMER_ID
ORDER BY t.TXN_DATE DESC
FETCH FIRST 2 ROWS ONLY
) AS lt ON 1=1
ORDER BY c.CUSTOMER_ID;
The ON 1=1 is required syntactically for the LEFT OUTER JOIN but has no filtering effect — the correlation is inside the LATERAL subquery.
TABLE Functions
DB2 also supports TABLE() function calls in the FROM clause, which work similarly to LATERAL in that they can accept correlated arguments:
SELECT c.CUSTOMER_ID,
c.LAST_NAME,
t.*
FROM CUSTOMER c,
TABLE(GET_CUSTOMER_TRANSACTIONS(c.CUSTOMER_ID)) AS t;
Here GET_CUSTOMER_TRANSACTIONS is a user-defined table function that returns a result set. The TABLE() syntax allows the function to receive c.CUSTOMER_ID from the outer query — the same correlated behavior as LATERAL.
When to Use LATERAL vs. Other Approaches
LATERAL is not the only way to solve "top N per group" and similar problems. Here is a decision guide:
| Approach | Syntax Complexity | Performance | DB2 Version Required |
|---|---|---|---|
| LATERAL | Moderate | Good (optimizer can push predicates) | LUW 9.1+, z/OS 12+ |
| Window functions (ROW_NUMBER) | Moderate | Good | LUW 8.2+, z/OS 9+ |
| Correlated subquery in SELECT | Low | Can be poor for many columns | All versions |
| Correlated subquery with EXISTS | Low | Good with indexes | All versions |
LATERAL shines when you need multiple columns from the correlated result and when the correlated subquery benefits from ORDER BY with FETCH FIRST. Window functions (covered in Chapter 14) are often the most efficient approach and are widely understood. Choose based on readability, your DB2 version, and your team's SQL familiarity.
LATERAL and Performance Considerations
When DB2 processes a LATERAL join, it evaluates the LATERAL subquery once for each row from the left side — similar to a correlated subquery. However, the optimizer can often transform the LATERAL into a more efficient join internally. Index support on the correlated column (e.g., an index on ACCOUNT.CUSTOMER_ID or TRANSACTION.ACCOUNT_ID) is critical for good LATERAL performance. Without an index, each evaluation of the LATERAL subquery requires a table scan, and the total cost becomes proportional to left_rows × right_table_size.
6.8 Joining Three or More Tables
Real-world queries frequently join three, four, five, or more tables. The principle is simple: chain your joins.
Three-Table Join: Customer + Account + Branch
Business question: For each customer, show their account numbers, account types, and the name of the branch where each account was opened.
SELECT c.FIRST_NAME,
c.LAST_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
a.BALANCE,
b.BRANCH_NAME,
b.CITY AS BRANCH_CITY
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b
ON a.BRANCH_ID = b.BRANCH_ID
ORDER BY c.LAST_NAME, a.ACCOUNT_ID;
Expected output:
FIRST_NAME LAST_NAME ACCOUNT_ID ACCOUNT_TYPE BALANCE BRANCH_NAME BRANCH_CITY
---------- --------- ---------- ------------ ---------- ---------------- -----------
Brian Chen 5003 CHECKING 1,100.00 Northside Branch Chicago
Clara Davis 5004 CHECKING 8,900.00 Downtown Branch Chicago
Clara Davis 5005 SAVINGS 45,000.00 Downtown Branch Chicago
Clara Davis 5006 CD 25,000.00 Airport Branch Chicago
David Kim 5007 CHECKING 320.00 Northside Branch Chicago
Alice Morgan 5001 CHECKING 4,250.00 Downtown Branch Chicago
Alice Morgan 5002 SAVINGS 12,800.00 Downtown Branch Chicago
Four-Table Join: Adding Transaction Details
Business question: Show customer names, account types, branch names, and transaction details for all transactions over $1,000.
SELECT c.FIRST_NAME || ' ' || c.LAST_NAME AS CUSTOMER_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
b.BRANCH_NAME,
t.TXN_ID,
t.TXN_TYPE,
t.AMOUNT,
t.TXN_DATE
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b
ON a.BRANCH_ID = b.BRANCH_ID
INNER JOIN TRANSACTION t
ON a.ACCOUNT_ID = t.ACCOUNT_ID
WHERE t.AMOUNT > 1000.00
ORDER BY t.TXN_DATE DESC;
Five-Table Join: Full Picture with Employee
Business question: For each large transaction (over $5,000), show the customer name, account, branch, and the branch manager who oversees that branch.
SELECT c.FIRST_NAME || ' ' || c.LAST_NAME AS CUSTOMER_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
b.BRANCH_NAME,
e.FIRST_NAME || ' ' || e.LAST_NAME AS BRANCH_MANAGER,
t.TXN_ID,
t.AMOUNT,
t.TXN_DATE
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b
ON a.BRANCH_ID = b.BRANCH_ID
INNER JOIN EMPLOYEE e
ON b.MANAGER_ID = e.EMPLOYEE_ID
INNER JOIN TRANSACTION t
ON a.ACCOUNT_ID = t.ACCOUNT_ID
WHERE t.AMOUNT > 5000.00
ORDER BY t.AMOUNT DESC;
Readability Tips for Multi-Table Joins
- Indent the ON clause under its JOIN to visually pair them.
- List tables in logical order — typically starting from the "main" entity (CUSTOMER) and joining outward through foreign keys.
- Use meaningful aliases —
c,a,b,t,emap naturally to our five tables. - Comment complex joins:
FROM CUSTOMER c
INNER JOIN ACCOUNT a -- customer's accounts
ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b -- branch where account opened
ON a.BRANCH_ID = b.BRANCH_ID
LEFT OUTER JOIN EMPLOYEE e -- branch manager (might be vacant)
ON b.MANAGER_ID = e.EMPLOYEE_ID
Mixing Join Types
You can freely mix INNER, LEFT, RIGHT, and CROSS joins in a single query. A common pattern is an INNER JOIN chain with one LEFT OUTER JOIN at the end for an optional lookup:
SELECT c.FIRST_NAME,
c.LAST_NAME,
a.ACCOUNT_ID,
b.BRANCH_NAME,
t.TXN_ID,
t.AMOUNT
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b
ON a.BRANCH_ID = b.BRANCH_ID
LEFT OUTER JOIN TRANSACTION t
ON a.ACCOUNT_ID = t.ACCOUNT_ID
ORDER BY c.LAST_NAME, a.ACCOUNT_ID, t.TXN_DATE;
This ensures we see every customer's accounts and branches, even if some accounts have no transactions yet.
Join Order and the Concept of "Driving Table"
When writing multi-table queries, developers sometimes worry about which table to list first. In the query above, we started with CUSTOMER. Should we have started with TRANSACTION instead?
From a correctness standpoint, the order does not matter. All of the following produce identical results:
-- Order 1: Start from CUSTOMER
FROM CUSTOMER c
INNER JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b ON a.BRANCH_ID = b.BRANCH_ID
-- Order 2: Start from BRANCH
FROM BRANCH b
INNER JOIN ACCOUNT a ON b.BRANCH_ID = a.BRANCH_ID
INNER JOIN CUSTOMER c ON a.CUSTOMER_ID = c.CUSTOMER_ID
-- Order 3: Start from ACCOUNT
FROM ACCOUNT a
INNER JOIN CUSTOMER c ON a.CUSTOMER_ID = c.CUSTOMER_ID
INNER JOIN BRANCH b ON a.BRANCH_ID = b.BRANCH_ID
From a performance standpoint, the textual order in your SQL does not determine the physical execution order. DB2's cost-based optimizer evaluates multiple join orderings and picks the cheapest one. Write the order that makes the query most readable to humans.
From a readability standpoint, start with the "main" entity of your query — the one the business question is about — and join outward through the foreign key relationships. If the question is "show me each customer's accounts," start with CUSTOMER. If the question is "show me each transaction with context," start with TRANSACTION.
6.9 Join Conditions vs. WHERE Conditions
This is one of the most misunderstood aspects of SQL joins, and getting it wrong with outer joins produces incorrect results.
For INNER JOINs: No Difference
With INNER JOIN, putting a condition in the ON clause or the WHERE clause produces the same result:
-- Version A: condition in ON
SELECT c.FIRST_NAME, a.ACCOUNT_ID, a.BALANCE
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
AND a.ACCOUNT_TYPE = 'CHECKING';
-- Version B: condition in WHERE
SELECT c.FIRST_NAME, a.ACCOUNT_ID, a.BALANCE
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
WHERE a.ACCOUNT_TYPE = 'CHECKING';
Both return only checking accounts. DB2's optimizer treats them identically.
For OUTER JOINs: Critical Difference
With LEFT OUTER JOIN, placing a condition in the ON clause versus the WHERE clause changes the result dramatically.
Scenario: Show all customers with their checking accounts, but also show customers who have no checking accounts.
Correct — condition in ON clause:
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
a.BALANCE
FROM CUSTOMER c
LEFT OUTER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
AND a.ACCOUNT_TYPE = 'CHECKING'
ORDER BY c.CUSTOMER_ID;
Expected output:
CUSTOMER_ID FIRST_NAME LAST_NAME ACCOUNT_ID ACCOUNT_TYPE BALANCE
----------- ---------- --------- ---------- ------------ ----------
1001 Alice Morgan 5001 CHECKING 4,250.00
1002 Brian Chen 5003 CHECKING 1,100.00
1003 Clara Davis 5004 CHECKING 8,900.00
1004 David Kim 5007 CHECKING 320.00
1005 Elena Vargas - - -
Elena Vargas appears with NULLs because she has no checking account — the LEFT OUTER JOIN preserves her. Customers who have only savings or CD accounts (but no checking) would also appear with NULLs.
Incorrect — condition in WHERE clause:
SELECT c.CUSTOMER_ID,
c.FIRST_NAME,
c.LAST_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
a.BALANCE
FROM CUSTOMER c
LEFT OUTER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
WHERE a.ACCOUNT_TYPE = 'CHECKING'
ORDER BY c.CUSTOMER_ID;
Expected output:
CUSTOMER_ID FIRST_NAME LAST_NAME ACCOUNT_ID ACCOUNT_TYPE BALANCE
----------- ---------- --------- ---------- ------------ ----------
1001 Alice Morgan 5001 CHECKING 4,250.00
1002 Brian Chen 5003 CHECKING 1,100.00
1003 Clara Davis 5004 CHECKING 8,900.00
1004 David Kim 5007 CHECKING 320.00
Elena is gone. The WHERE clause filters after the join. For Elena, a.ACCOUNT_TYPE is NULL (no match), and NULL = 'CHECKING' evaluates to UNKNOWN, which is treated as FALSE. The WHERE clause eliminates her. The LEFT OUTER JOIN effectively became an INNER JOIN.
The Rule
| Condition Placement | Effect on OUTER JOIN |
|---|---|
| ON clause | Filters during the join — unmatched rows are still preserved with NULLs |
| WHERE clause | Filters after the join — can eliminate the preserved NULL rows, converting the outer join to an inner join |
Best practice: Place conditions about the right-side table (the "optional" table) in the ON clause. Place conditions about the left-side table (the "preserved" table) in the WHERE clause.
A Visual Way to Think About It
Imagine the query execution in two phases:
- Phase 1 (JOIN + ON): Build the joined result set, applying ON conditions during the join. For LEFT OUTER JOIN, add NULLs for unmatched left-side rows.
- Phase 2 (WHERE): Filter the joined result set, removing rows that do not satisfy WHERE conditions.
If your WHERE condition tests a right-side column, it removes the NULL rows that the LEFT OUTER JOIN just added. That is why conditions on the right-side table belong in the ON clause — they participate in Phase 1, where the outer join logic preserves unmatched rows.
This is probably the single most important concept in this chapter. If you take away one lesson, let it be this: in a LEFT OUTER JOIN, filtering the right-side table in WHERE destroys the "outer" behavior.
6.10 Common Join Mistakes
After reviewing thousands of SQL queries in production systems, these are the mistakes we see most often.
Mistake 1: Accidental Cartesian Product
-- Missing join condition between CUSTOMER and BRANCH
SELECT c.FIRST_NAME, a.ACCOUNT_ID, b.BRANCH_NAME
FROM CUSTOMER c
INNER JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b ON 1 = 1; -- oops, this is a cross join!
Every account row gets paired with every branch row. If you expected 7 result rows but got 28 (7 accounts × 4 branches), check your join conditions.
How to spot it: If your result set is much larger than expected, count the rows and check whether it equals the product of the table sizes.
Mistake 2: Wrong Outer Join Side
-- Intent: show ALL branches, even those with no employees
-- Bug: RIGHT side is BRANCH, but we used LEFT OUTER JOIN
SELECT b.BRANCH_NAME, e.FIRST_NAME
FROM BRANCH b
LEFT OUTER JOIN EMPLOYEE e ON b.BRANCH_ID = e.BRANCH_ID;
Actually, this one is correct — BRANCH is on the left, and we are preserving it. The mistake version would be:
-- Bug: BRANCH is on the RIGHT, and we used LEFT OUTER JOIN
-- This preserves EMPLOYEE rows with no branch (not what we wanted)
SELECT b.BRANCH_NAME, e.FIRST_NAME
FROM EMPLOYEE e
LEFT OUTER JOIN BRANCH b ON e.BRANCH_ID = b.BRANCH_ID;
This preserves all employees (even those without a branch), not all branches. Read the join direction carefully.
Mistake 3: Filtering Outer Join Results in WHERE
We covered this in Section 6.9, but it is worth repeating because it is so common:
-- Bug: WHERE clause nullifies the LEFT OUTER JOIN
FROM CUSTOMER c
LEFT OUTER JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
WHERE a.STATUS = 'ACTIVE' -- this eliminates customers with no accounts!
Fix: move the filter to the ON clause.
FROM CUSTOMER c
LEFT OUTER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
AND a.STATUS = 'ACTIVE'
Mistake 4: Duplicate Rows from One-to-Many Joins
When CUSTOMER has 1 row but ACCOUNT has 3 rows for that customer, a join produces 3 rows. If you then join TRANSACTION (and each account has 100 transactions), you get 300 rows. If you aggregate (SUM, COUNT) without understanding this multiplication, you get inflated numbers.
-- Bug: double-counting balances when customer has multiple accounts
SELECT c.CUSTOMER_ID,
c.LAST_NAME,
SUM(a.BALANCE) AS TOTAL_BALANCE, -- correct
COUNT(t.TXN_ID) AS TXN_COUNT -- correct
FROM CUSTOMER c
INNER JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN TRANSACTION t ON a.ACCOUNT_ID = t.ACCOUNT_ID
GROUP BY c.CUSTOMER_ID, c.LAST_NAME;
This query actually works correctly for COUNT(t.TXN_ID) because each transaction row appears once. But SUM(a.BALANCE) is wrong — Alice's checking account balance is summed once for each transaction on that account, wildly inflating the total. The fix is to aggregate at the right level, often using subqueries or CTEs:
-- Correct: aggregate balances separately from transaction counts
SELECT c.CUSTOMER_ID,
c.LAST_NAME,
acct.TOTAL_BALANCE,
txn.TXN_COUNT
FROM CUSTOMER c
INNER JOIN (
SELECT CUSTOMER_ID, SUM(BALANCE) AS TOTAL_BALANCE
FROM ACCOUNT
GROUP BY CUSTOMER_ID
) acct ON c.CUSTOMER_ID = acct.CUSTOMER_ID
INNER JOIN (
SELECT a.CUSTOMER_ID, COUNT(*) AS TXN_COUNT
FROM TRANSACTION t
INNER JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID
GROUP BY a.CUSTOMER_ID
) txn ON c.CUSTOMER_ID = txn.CUSTOMER_ID;
Mistake 5: Using Old-Style Comma Joins
-- Old style (pre-SQL:1992) — avoid this
SELECT c.FIRST_NAME, a.ACCOUNT_ID
FROM CUSTOMER c, ACCOUNT a
WHERE c.CUSTOMER_ID = a.CUSTOMER_ID;
This works, but it mixes join logic (how tables relate) with filter logic (which rows you want) in the WHERE clause. It makes it easy to accidentally omit a join condition, and it cannot express outer joins cleanly. Always use explicit JOIN ... ON syntax.
A counting trick for verifying comma-style joins: If you have N tables in the FROM clause, you need exactly N-1 equi-join conditions in the WHERE clause to avoid a Cartesian product. With 5 tables, you need 4 join conditions. Count them. If you are one short, you have a hidden cross join. With explicit JOIN ... ON syntax, this verification is unnecessary because each JOIN requires its own ON clause.
Mistake 6: Join Elimination Surprises
DB2's optimizer can eliminate a join entirely if it determines the join does not affect the result. For example:
SELECT a.ACCOUNT_ID, a.BALANCE
FROM ACCOUNT a
INNER JOIN CUSTOMER c ON a.CUSTOMER_ID = c.CUSTOMER_ID;
If there is a foreign key constraint guaranteeing that every ACCOUNT.CUSTOMER_ID exists in CUSTOMER, and you are not selecting any columns from CUSTOMER, DB2 may skip the CUSTOMER table entirely. This is usually a performance benefit, but it can be confusing when analyzing EXPLAIN plans.
Join elimination is one of many optimizer transformations that rely on foreign key constraints. This is one practical reason to define foreign keys even when your application layer already enforces referential integrity — the optimizer uses them to produce better plans.
Debugging Join Problems
When a join query produces unexpected results, use this systematic approach:
-
Count the result rows. Is it more or fewer than expected? - Way too many rows: probably a missing or wrong join condition (accidental Cartesian product). - Fewer rows than expected: probably using INNER JOIN where you need LEFT OUTER JOIN, or a WHERE clause that eliminates outer-joined NULLs. - Duplicate rows: probably a many-to-many relationship you did not anticipate.
-
Isolate each join. Start with two tables and verify the result. Add the third table and verify again. Continue until you find where the problem appears.
-
Check for NULLs in join columns. If a foreign key column contains NULLs, INNER JOIN silently drops those rows. This is correct behavior but can surprise you.
-
Verify your aliases. In a five-table query with single-letter aliases, it is easy to type
a.BRANCH_IDwhen you meanb.BRANCH_ID. The query might still run if the column exists in both tables — but it gives the wrong answer.
6.11 Meridian Bank Join Queries
Let us put it all together with comprehensive queries against our Meridian National Bank schema.
Query 1: Complete Customer Account Summary
Business question: For each customer, show all their accounts with branch information, ordered by customer name and account type.
SELECT c.CUSTOMER_ID,
c.FIRST_NAME || ' ' || c.LAST_NAME AS CUSTOMER_NAME,
c.CITY AS CUSTOMER_CITY,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
a.BALANCE,
a.STATUS,
b.BRANCH_NAME,
b.CITY AS BRANCH_CITY
FROM CUSTOMER c
INNER JOIN ACCOUNT a
ON c.CUSTOMER_ID = a.CUSTOMER_ID
INNER JOIN BRANCH b
ON a.BRANCH_ID = b.BRANCH_ID
ORDER BY c.LAST_NAME, c.FIRST_NAME, a.ACCOUNT_TYPE;
Query 2: Transaction Ledger with Full Context
Business question: Build a complete transaction ledger showing customer, account, and branch for each transaction in March 2024.
SELECT t.TXN_ID,
t.TXN_DATE,
c.FIRST_NAME || ' ' || c.LAST_NAME AS CUSTOMER_NAME,
a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
b.BRANCH_NAME,
t.TXN_TYPE,
t.AMOUNT,
t.DESCRIPTION
FROM TRANSACTION t
INNER JOIN ACCOUNT a
ON t.ACCOUNT_ID = a.ACCOUNT_ID
INNER JOIN CUSTOMER c
ON a.CUSTOMER_ID = c.CUSTOMER_ID
INNER JOIN BRANCH b
ON a.BRANCH_ID = b.BRANCH_ID
WHERE t.TXN_DATE BETWEEN '2024-03-01' AND '2024-03-31'
ORDER BY t.TXN_DATE, t.TXN_ID;
Query 3: Branch Staffing Report
Business question: For each branch, show the branch manager, total employee count, and total salary expense. Include branches with no employees.
SELECT b.BRANCH_ID,
b.BRANCH_NAME,
COALESCE(mgr.FIRST_NAME || ' ' || mgr.LAST_NAME, 'Vacant') AS MANAGER_NAME,
COUNT(e.EMPLOYEE_ID) AS EMPLOYEE_COUNT,
COALESCE(SUM(e.SALARY), 0) AS TOTAL_SALARY_EXPENSE
FROM BRANCH b
LEFT OUTER JOIN EMPLOYEE mgr
ON b.MANAGER_ID = mgr.EMPLOYEE_ID
LEFT OUTER JOIN EMPLOYEE e
ON b.BRANCH_ID = e.BRANCH_ID
GROUP BY b.BRANCH_ID, b.BRANCH_NAME,
mgr.FIRST_NAME, mgr.LAST_NAME
ORDER BY b.BRANCH_NAME;
Query 4: Transfer Pairs (Self-Join on TRANSACTION)
Business question: For every transfer transaction, show the source account, destination account, and both account holders' names.
SELECT t.TXN_ID,
t.TXN_DATE,
t.AMOUNT,
-- Source account details
src.ACCOUNT_ID AS FROM_ACCOUNT,
src_cust.FIRST_NAME || ' ' || src_cust.LAST_NAME AS FROM_CUSTOMER,
-- Destination account details
dst.ACCOUNT_ID AS TO_ACCOUNT,
dst_cust.FIRST_NAME || ' ' || dst_cust.LAST_NAME AS TO_CUSTOMER
FROM TRANSACTION t
INNER JOIN ACCOUNT src
ON t.ACCOUNT_ID = src.ACCOUNT_ID
INNER JOIN CUSTOMER src_cust
ON src.CUSTOMER_ID = src_cust.CUSTOMER_ID
INNER JOIN ACCOUNT dst
ON t.RELATED_ACCOUNT_ID = dst.ACCOUNT_ID
INNER JOIN CUSTOMER dst_cust
ON dst.CUSTOMER_ID = dst_cust.CUSTOMER_ID
WHERE t.TXN_TYPE = 'TRANSFER'
ORDER BY t.TXN_DATE;
Query 5: Customers and Their Most Recent Transaction (LATERAL)
Business question: For every customer, show their most recent transaction across all accounts.
SELECT c.CUSTOMER_ID,
c.FIRST_NAME || ' ' || c.LAST_NAME AS CUSTOMER_NAME,
lt.ACCOUNT_ID,
lt.TXN_ID,
lt.TXN_TYPE,
lt.AMOUNT,
lt.TXN_DATE
FROM CUSTOMER c
LEFT OUTER JOIN LATERAL (
SELECT t.TXN_ID, t.ACCOUNT_ID, t.TXN_TYPE, t.AMOUNT, t.TXN_DATE
FROM TRANSACTION t
INNER JOIN ACCOUNT a ON t.ACCOUNT_ID = a.ACCOUNT_ID
WHERE a.CUSTOMER_ID = c.CUSTOMER_ID
ORDER BY t.TXN_DATE DESC
FETCH FIRST 1 ROW ONLY
) AS lt ON 1=1
ORDER BY c.CUSTOMER_ID;
Query 6: Inactive Accounts Report (Anti-Join)
Business question: Find all accounts that have had no transactions in the last 90 days.
SELECT a.ACCOUNT_ID,
a.ACCOUNT_TYPE,
a.BALANCE,
c.FIRST_NAME || ' ' || c.LAST_NAME AS CUSTOMER_NAME,
b.BRANCH_NAME
FROM ACCOUNT a
INNER JOIN CUSTOMER c
ON a.CUSTOMER_ID = c.CUSTOMER_ID
INNER JOIN BRANCH b
ON a.BRANCH_ID = b.BRANCH_ID
LEFT OUTER JOIN TRANSACTION t
ON a.ACCOUNT_ID = t.ACCOUNT_ID
AND t.TXN_DATE >= CURRENT DATE - 90 DAYS
WHERE a.STATUS = 'ACTIVE'
AND t.TXN_ID IS NULL
ORDER BY a.BALANCE DESC;
Notice the anti-join pattern: the transaction date filter is in the ON clause (to only consider recent transactions as matches), and the IS NULL check is in the WHERE clause (to keep only unmatched accounts).
This query is a perfect example of the ON-vs-WHERE distinction in action. If we moved t.TXN_DATE >= CURRENT DATE - 90 DAYS to the WHERE clause, it would eliminate every account whose only transactions are older than 90 days — those accounts would show t.TXN_DATE as NULL (from the outer join), and NULL >= CURRENT DATE - 90 DAYS evaluates to UNKNOWN (filtered out). The ON placement ensures we only look for recent transactions as matches, while still preserving accounts that have no recent transactions.
Query 7: Account Type Distribution by Branch (CROSS JOIN + LEFT JOIN)
Business question: For each branch, show the count of each account type. Include zero counts for account types that do not exist at a branch.
SELECT b.BRANCH_NAME,
at.ACCOUNT_TYPE,
COUNT(a.ACCOUNT_ID) AS ACCOUNT_COUNT,
COALESCE(SUM(a.BALANCE), 0) AS TOTAL_BALANCE
FROM BRANCH b
CROSS JOIN (
VALUES ('CHECKING'), ('SAVINGS'), ('CD'), ('MONEY_MARKET')
) AS at(ACCOUNT_TYPE)
LEFT OUTER JOIN ACCOUNT a
ON b.BRANCH_ID = a.BRANCH_ID
AND a.ACCOUNT_TYPE = at.ACCOUNT_TYPE
GROUP BY b.BRANCH_NAME, at.ACCOUNT_TYPE
ORDER BY b.BRANCH_NAME, at.ACCOUNT_TYPE;
This is a three-step pattern that appears frequently in reporting: 1. CROSS JOIN creates every branch-accounttype combination (the "skeleton"). 2. LEFT OUTER JOIN brings in actual data where it exists. 3. COUNT and SUM with COALESCE fill in zeros where there is no data.
Without the CROSS JOIN skeleton, branches that have no CD accounts would simply not appear in the CD row — the report would have invisible gaps.
6.12 A Preview: How the Optimizer Processes Joins
DB2's query optimizer is one of the most sophisticated in the industry. When you write a join, you describe what you want. The optimizer decides how to get it. Understanding the basics helps you write better queries and diagnose performance problems.
Three Join Methods
DB2 uses three primary strategies to execute a join:
1. Nested Loop Join
The conceptual model from Section 6.2: for each row in the outer table, look up matching rows in the inner table (typically using an index).
For each row in CUSTOMER:
Look up matching rows in ACCOUNT using index on CUSTOMER_ID
Output combined rows
- Best for: Small outer table, indexed inner table, or highly selective joins.
- Cost: Proportional to
outer_rows × (index lookup cost).
2. Merge Scan Join (Sort-Merge Join)
Both tables are sorted on the join column, then merged in a single pass:
Sort CUSTOMER by CUSTOMER_ID
Sort ACCOUNT by CUSTOMER_ID
Merge both sorted streams, matching rows in sequence
- Best for: Large tables that are already sorted or have clustered indexes on the join columns.
- Cost: Proportional to
sort_cost + (rows_a + rows_b).
3. Hash Join
Build a hash table from the smaller table, then probe it with rows from the larger table:
Build hash table from CUSTOMER on CUSTOMER_ID
For each row in ACCOUNT:
Probe hash table with ACCOUNT.CUSTOMER_ID
If match found, output combined row
- Best for: Large tables with no useful indexes, equi-joins only.
- Cost: Proportional to
rows_a + rows_b, but requires memory for the hash table.
What You Control
You do not (and should not) tell DB2 which join method to use. Instead, you influence the optimizer's choices by:
- Creating appropriate indexes (Chapter 20) — especially on foreign key columns and commonly filtered columns.
- Keeping statistics current with
RUNSTATS(Chapter 21) — so the optimizer has accurate row counts and data distribution information. - Writing clear join conditions — ambiguous or overly complex ON clauses can confuse the optimizer.
Join Order
When you join five tables, there are 5! = 120 possible orders in which the optimizer could process them. The optimizer evaluates many of these (pruning obviously bad candidates) and chooses the order with the lowest estimated cost.
You might write:
FROM CUSTOMER c
INNER JOIN ACCOUNT a ON ...
INNER JOIN BRANCH b ON ...
INNER JOIN TRANSACTION t ON ...
INNER JOIN EMPLOYEE e ON ...
But the optimizer might process them as: TRANSACTION → ACCOUNT → CUSTOMER → BRANCH → EMPLOYEE, because starting with TRANSACTION (after a selective date filter) produces the fewest intermediate rows.
Key insight: The order you list tables in the FROM clause does not determine execution order. The optimizer makes that decision. Write your joins in the order that makes the query most readable for humans.
What Affects Join Performance in Practice
While the full treatment belongs in Chapter 22, here are the factors that most commonly affect join performance in day-to-day DB2 work:
-
Indexes on join columns. An index on the foreign key column (e.g.,
ACCOUNT.CUSTOMER_ID) dramatically speeds up nested loop joins. Without it, each lookup requires a full table scan. -
Data volume. A join between two 100-row tables is instantaneous regardless of method. A join between two 100-million-row tables requires careful optimization. The strategies that work at small scale may fail catastrophically at large scale.
-
Selectivity of WHERE predicates. If your WHERE clause filters TRANSACTION to just 50 rows before joining, even a nested loop join is fast. If no filtering happens until after the join, DB2 must join millions of rows before discarding most of them.
-
Available memory (sort heap, buffer pool). Hash joins need memory for the hash table. Merge joins may need memory for sorting. If memory is insufficient, DB2 spills to disk, which is orders of magnitude slower.
-
Statistics freshness. Run
RUNSTATSafter loading data, after major updates, and after adding or dropping indexes. The optimizer cannot make good decisions without accurate statistics.
A Rule of Thumb
For most OLTP (online transaction processing) queries that join 2-5 tables with selective WHERE conditions and proper indexes, DB2 will choose an efficient plan automatically. Spend your time on clear SQL and good indexing. Reserve optimizer hints and manual plan overrides for the rare cases where the optimizer genuinely makes a mistake — and always confirm with EXPLAIN before and after.
We will return to join optimization in depth in Chapter 22 (Query Performance Tuning), including how to read EXPLAIN plans, identify join bottlenecks, and use optimization hints when the optimizer makes suboptimal choices.
Spaced Review — Chapters 1 Through 5
Each chapter includes review questions from earlier material to reinforce retention.
From Chapter 1 (Introduction to DB2): 1. What is the difference between DB2 for LUW and DB2 for z/OS? Name two differences in features or configuration. 2. What tool do you use to run SQL interactively in DB2 for LUW?
From Chapter 2 (Creating Your First Database): 3. Write the DB2 command to create a database named MERIDIAN_BANK with UTF-8 encoding. 4. What is the purpose of a tablespace, and how does it relate to tables?
From Chapter 3 (Tables, Columns, and Data Types): 5. What is the difference between CHAR(10) and VARCHAR(10)? 6. Why should you use DECIMAL(12,2) rather than FLOAT for monetary values?
From Chapter 4 (Inserting, Updating, and Deleting Data):
7. Write a single INSERT statement that adds three rows to a table.
8. What happens if you run DELETE FROM CUSTOMER without a WHERE clause?
From Chapter 5 (SELECT, WHERE, and Ordering):
9. What is the difference between WHERE CITY = 'Chicago' and WHERE CITY LIKE 'Chi%'?
10. Explain why WHERE SALARY > NULL returns no rows, and how you should test for NULL instead.
Chapter Summary
Joins are the mechanism that transforms a normalized relational design into meaningful, combined results. Here is what we covered:
| Join Type | Returns | Use When |
|---|---|---|
| INNER JOIN | Only matching rows from both tables | You want only rows with data on both sides |
| LEFT OUTER JOIN | All left rows + matching right rows (NULLs for no match) | You need every row from the left table regardless |
| RIGHT OUTER JOIN | All right rows + matching left rows (NULLs for no match) | Mirror of LEFT (prefer LEFT for consistency) |
| FULL OUTER JOIN | All rows from both tables (NULLs on both sides for no match) | Data reconciliation, gap detection |
| CROSS JOIN | Every row from A combined with every row from B | Deliberate Cartesian products, generating combinations |
| LATERAL | Right side can reference left side columns | Top-N per group, correlated row generation |
The most important lessons from this chapter:
- INNER JOIN is your default. Use it when both sides must have data.
- LEFT OUTER JOIN is your safety net. Use it when the right side is optional.
- Put right-side filters in ON, not WHERE, when using outer joins.
- Use explicit JOIN ... ON syntax. Never use comma-separated FROM clauses.
- Watch for row multiplication when joining one-to-many relationships and aggregating.
- The optimizer chooses the execution strategy. Your job is clear SQL and good indexes.
In Chapter 7, we will build on joins by adding aggregation — GROUP BY, HAVING, and aggregate functions — to answer questions like "What is the total balance per branch?" and "Which customers have more than three accounts?"
Next: Chapter 7 — Aggregation: GROUP BY, HAVING, and Aggregate Functions