Chapter 7: Aggregation and Grouping — COUNT, SUM, AVG, GROUP BY, HAVING, and Analytical Patterns


Learning Objectives

By the end of this chapter, you will be able to:

  1. Use the core aggregate functions — COUNT, SUM, AVG, MIN, MAX — and the statistical functions STDDEV and VARIANCE
  2. Group result sets with GROUP BY using single columns, multiple columns, and expressions
  3. Distinguish between WHERE and HAVING, and choose the correct one for any filtering task
  4. Predict how NULLs behave inside every aggregate function and avoid the traps that catch even experienced developers
  5. Construct multi-level summaries with GROUPING SETS, ROLLUP, and CUBE
  6. Identify summary rows using the GROUPING() and GROUPING_ID() functions
  7. Combine aggregation with joins to produce the reporting queries that make up the majority of real-world SQL
  8. Build financial reports for Meridian National Bank that answer questions about branch performance, customer analytics, and transaction trends
  9. Evaluate the performance implications of aggregation queries and apply strategies for optimization

Opening: The Questions Managers Actually Ask

Nobody in a bank boardroom says, "Show me every row in the transactions table." Not once. Not ever.

What they say is this:

  • "How many active accounts does the Hartford branch have?"
  • "What is the average checking account balance across all branches?"
  • "Which loan officer originated the most dollar volume this quarter?"
  • "Show me monthly transaction counts for the last six months, broken down by channel."
  • "Give me totals by branch, subtotals by state, and a grand total at the bottom."

Every one of these questions requires aggregation — the process of taking many rows and collapsing them into summary values. And every one of them requires grouping — the process of defining which rows belong together before that collapsing happens.

If SELECT, WHERE, and JOIN are the verbs of SQL, then GROUP BY and the aggregate functions are the accounting department. They are where raw data becomes information. They are where a list of 120 transactions becomes a monthly summary, where 35 individual account balances become a branch average, where a year of loan originations becomes a performance report that determines whether a loan officer gets promoted.

This chapter will make you fluent in aggregation. We will start with the individual aggregate functions, build up through GROUP BY and HAVING, handle the NULL traps that catch even experienced developers, and then move into the advanced grouping operations — GROUPING SETS, ROLLUP, and CUBE — that let you produce multi-dimensional reports in a single query. By the end, you will write the financial reports that Meridian National Bank's management team actually needs.

Let us begin with the functions themselves.


7.1 Aggregate Functions — Summarizing Data

An aggregate function takes a set of values (typically all the values in a column, or all the values in a column within a group) and returns a single value. DB2 provides a rich set of aggregate functions, but five form the core:

Function                     Purpose                           Returns
COUNT(*)                     Count rows                        INTEGER
COUNT(expression)            Count non-NULL values             INTEGER
COUNT(DISTINCT expression)   Count distinct non-NULL values    INTEGER
SUM(expression)              Sum of values                     Same numeric type as input
AVG(expression)              Arithmetic mean                   Same numeric type as input
MIN(expression)              Smallest value                    Same type as input
MAX(expression)              Largest value                     Same type as input

Beyond these, DB2 also supports STDDEV (standard deviation), VARIANCE, STDDEV_SAMP, VAR_SAMP, and several others we will encounter as needed.

COUNT — The Most Used Aggregate

COUNT comes in three forms, and understanding the difference between them is non-negotiable.

COUNT(*) counts rows. Period. It does not look at any column. It does not care about NULLs. If the group has 35 rows, COUNT(*) returns 35.

-- How many accounts does Meridian National Bank have?
SELECT COUNT(*) AS total_accounts
FROM MERIDIAN.ACCOUNTS;

Result:

TOTAL_ACCOUNTS
35

COUNT(column_name) counts the number of non-NULL values in that column. If five of the 35 accounts have a NULL CLOSE_DATE (because they are still open), and 30 have been closed with a date filled in, then COUNT(CLOSE_DATE) returns 30 — not 35.

-- How many accounts have been closed (have a close date)?
SELECT COUNT(CLOSE_DATE) AS closed_count
FROM MERIDIAN.ACCOUNTS;

If all accounts are currently open with NULL close dates, this returns 0. That surprises people. The table has 35 rows. COUNT(*) returns 35. COUNT(CLOSE_DATE) returns 0. Both are correct.

COUNT(DISTINCT column_name) counts the number of distinct non-NULL values. This is your tool for answering "how many different X" questions.

-- How many distinct account types are in use?
SELECT COUNT(DISTINCT ACCOUNT_TYPE_CODE) AS types_in_use
FROM MERIDIAN.ACCOUNTS;
-- How many branches have at least one account?
SELECT COUNT(DISTINCT BRANCH_ID) AS branches_with_accounts
FROM MERIDIAN.ACCOUNTS;

Critical Distinction: COUNT(*) counts rows. COUNT(column) counts non-NULL values. COUNT(DISTINCT column) counts distinct non-NULL values. Confusing these three is one of the most common SQL errors in production code. We will revisit this in Section 7.4.
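The semantics here are standard SQL, so you can see all three forms side by side without a DB2 connection. A minimal sketch using SQLite from Python, with an illustrative five-row table (not the Meridian schema):

```python
# Minimal demonstration of the three COUNT forms. SQLite's COUNT
# semantics match DB2's on this point; the table is a toy stand-in.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (branch_id INTEGER, close_date TEXT)")
con.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1, None), (1, "2025-01-15"), (2, None), (2, "2025-03-02"), (2, None)],
)

row = con.execute("""
    SELECT COUNT(*),                  -- every row
           COUNT(close_date),         -- non-NULL close dates only
           COUNT(DISTINCT branch_id)  -- distinct non-NULL branch ids
    FROM accounts
""").fetchone()

print(row)  # (5, 2, 2)
```

Five rows, two closed accounts, two distinct branches: three different answers from three different COUNTs.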

SUM — Adding Things Up

SUM computes the total of a numeric expression. It ignores NULLs (it does not treat them as zero — it simply skips them).

-- Total balance across all accounts
SELECT SUM(CURRENT_BALANCE) AS total_deposits
FROM MERIDIAN.ACCOUNTS;
-- Total dollar volume of all transactions in November 2025
SELECT SUM(AMOUNT) AS november_volume
FROM MERIDIAN.TRANSACTIONS
WHERE TRANSACTION_DATE BETWEEN '2025-11-01' AND '2025-11-30';

Note that SUM on a column containing only NULLs returns NULL, not zero. This is a gotcha we will address in Section 7.4.

You can also use SUM with expressions:

-- Total overdraft capacity across all accounts
SELECT SUM(OVERDRAFT_LIMIT) AS total_overdraft_capacity
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'A';

And SUM with DISTINCT removes duplicate values before summing — which is rarely useful in practice but exists for completeness:

-- Sum of distinct salary values (unusual, but valid)
SELECT SUM(DISTINCT SALARY) AS distinct_salary_total
FROM MERIDIAN.EMPLOYEES;

AVG — The Arithmetic Mean

AVG computes the arithmetic mean. Like SUM, it ignores NULLs — and this is where the trouble starts.

-- Average account balance
SELECT AVG(CURRENT_BALANCE) AS avg_balance
FROM MERIDIAN.ACCOUNTS;

Consider this carefully: AVG(CURRENT_BALANCE) divides the SUM of all non-NULL balances by the COUNT of non-NULL balances. If some balances were NULL, those rows would be excluded from both the numerator and the denominator. The average is computed over the rows that have data, not over all rows.

This matters enormously. If you have 100 employees, and 20 of them have NULL salaries (perhaps they are contractors whose compensation is tracked elsewhere), then AVG(SALARY) gives you the average salary of the 80 employees who have one — not a value diluted by 20 zeros. Whether that is what you want depends on the business question.
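The denominator behavior can be verified with any standard SQL engine. A sketch using SQLite from Python, with the same 50000 / 60000 / NULL / 70000 / NULL salary values that reappear in Section 7.4 (illustrative data, not the Meridian schema):

```python
# AVG divides by the count of non-NULL values, not by COUNT(*).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (salary INTEGER)")
con.executemany("INSERT INTO emp VALUES (?)",
                [(50000,), (60000,), (None,), (70000,), (None,)])

avg, total, nonnull, rows = con.execute(
    "SELECT AVG(salary), SUM(salary), COUNT(salary), COUNT(*) FROM emp"
).fetchone()

print(avg)              # 60000.0 -> 180000 / 3
print(total / nonnull)  # 60000.0 -> the same thing
print(total / rows)     # 36000.0 -> what AVG would be if NULL rows counted
```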

-- Average interest rate on interest-bearing accounts only
SELECT AVG(INTEREST_RATE) AS avg_interest_rate
FROM MERIDIAN.ACCOUNTS
WHERE INTEREST_RATE > 0;

Precision warning: AVG on INTEGER columns returns an INTEGER in DB2, which means it truncates the fractional part. If your balances are stored as INTEGER (they should not be, but if they are), wrap the column in a CAST or multiply by 1.0:

-- Force decimal precision in average calculation
SELECT AVG(CAST(CURRENT_BALANCE AS DECIMAL(15,2))) AS avg_balance
FROM MERIDIAN.ACCOUNTS;

Since our ACCOUNTS table defines CURRENT_BALANCE as DECIMAL(15,2), this is not a concern for our Meridian data — but it bites people regularly in other schemas.

MIN and MAX — Extremes

MIN and MAX find the smallest and largest values, respectively. They work on any data type that supports ordering: numbers, strings, dates, timestamps.

-- What is the highest and lowest balance across all accounts?
SELECT MIN(CURRENT_BALANCE) AS lowest_balance,
       MAX(CURRENT_BALANCE) AS highest_balance
FROM MERIDIAN.ACCOUNTS;
-- When was the first and most recent transaction?
SELECT MIN(TRANSACTION_DATE) AS earliest_txn,
       MAX(TRANSACTION_DATE) AS latest_txn
FROM MERIDIAN.TRANSACTIONS;
-- Alphabetically first and last customer surname
SELECT MIN(LAST_NAME) AS first_alpha,
       MAX(LAST_NAME) AS last_alpha
FROM MERIDIAN.CUSTOMERS;

MIN and MAX also ignore NULLs. If every value in the set is NULL, they return NULL.

Combining Multiple Aggregates

You can use multiple aggregate functions in a single SELECT:

-- Comprehensive account summary
SELECT COUNT(*)              AS total_accounts,
       SUM(CURRENT_BALANCE)  AS total_balance,
       AVG(CURRENT_BALANCE)  AS avg_balance,
       MIN(CURRENT_BALANCE)  AS min_balance,
       MAX(CURRENT_BALANCE)  AS max_balance
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'A';

This query scans the ACCOUNTS table once and computes all five aggregates in a single pass. DB2's optimizer is smart enough to do this — you never need to write five separate queries.

STDDEV and VARIANCE — Statistical Measures

For analytical work, DB2 provides population and sample statistics:

-- Standard deviation and variance of account balances
SELECT STDDEV(CURRENT_BALANCE)  AS balance_stddev,
       VARIANCE(CURRENT_BALANCE) AS balance_variance
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'A';

STDDEV computes the population standard deviation. Use STDDEV_SAMP for the sample standard deviation (which uses N-1 in the denominator). Similarly, VARIANCE is the population variance, and VAR_SAMP is the sample variance.

These functions are particularly useful in risk analysis. A branch where account balances have high variance behaves differently from one where everyone carries a similar balance — even if the averages are the same.

-- Compare balance distribution across branches
SELECT b.BRANCH_NAME,
       COUNT(*)                  AS account_count,
       AVG(a.CURRENT_BALANCE)   AS avg_balance,
       STDDEV(a.CURRENT_BALANCE) AS balance_stddev
FROM MERIDIAN.ACCOUNTS a
JOIN MERIDIAN.BRANCHES b ON a.BRANCH_ID = b.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY b.BRANCH_NAME;

We just used GROUP BY in that example. Let us formally introduce it.


Check Your Understanding (Box 1)

Pause and answer from memory:

  1. What is the difference between COUNT(*) and COUNT(COLUMN_NAME)?
  2. If a column has 10 non-NULL values and 5 NULLs, what does AVG return — the average of 10 values or 15 values?
  3. What does SUM return when applied to a column where every value is NULL?

If you cannot answer all three confidently, re-read Section 7.1.


7.2 GROUP BY — The Foundation of Reporting

Without GROUP BY, every aggregate function in your SELECT operates on the entire result set as one giant group. That is fine for "What is the total balance across all accounts?" but useless for "What is the total balance per branch?"

GROUP BY partitions the result set into groups based on one or more columns, and the aggregate functions then operate independently on each group.

Grouping by One Column

-- Total balance by branch
SELECT BRANCH_ID,
       SUM(CURRENT_BALANCE) AS branch_total
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'A'
GROUP BY BRANCH_ID;

Result:

BRANCH_ID  BRANCH_TOTAL
        1    682,776.70
        2     38,540.30
        3    925,170.80
        4  1,673,560.60
        5     87,460.75

What happened here? DB2 took the 35 account rows, put each row into a "bucket" based on its BRANCH_ID, and then computed SUM(CURRENT_BALANCE) separately within each bucket. Branch 4 (Park Avenue, New York) has the highest total because it holds accounts for high-net-worth customers like Jack Pemberton and Henry Nakamura.

The Fundamental Rule of GROUP BY

Here is the rule you must internalize:

Every column in the SELECT list must either appear in the GROUP BY clause or be contained within an aggregate function.

This rule is not optional. It is not a guideline. It is enforced by the SQL engine, and violating it produces an error.

Correct:

SELECT BRANCH_ID, COUNT(*) AS acct_count
FROM MERIDIAN.ACCOUNTS
GROUP BY BRANCH_ID;

BRANCH_ID appears in GROUP BY. COUNT(*) is an aggregate. Both are legal.

Incorrect — this will fail:

-- ERROR: ACCOUNT_NUMBER is not in GROUP BY and not aggregated
SELECT BRANCH_ID, ACCOUNT_NUMBER, COUNT(*)
FROM MERIDIAN.ACCOUNTS
GROUP BY BRANCH_ID;

Why does this fail? Because each group (each BRANCH_ID) contains multiple account numbers. DB2 cannot pick one to display — there is no rule that says "pick the first one" or "pick a random one." It refuses to guess, and that refusal is a feature, not a limitation.

Grouping by Multiple Columns

You can group by two or more columns to create finer-grained buckets:

-- Account count by branch and account type
SELECT BRANCH_ID,
       ACCOUNT_TYPE_CODE,
       COUNT(*)              AS acct_count,
       SUM(CURRENT_BALANCE)  AS total_balance
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'A'
GROUP BY BRANCH_ID, ACCOUNT_TYPE_CODE
ORDER BY BRANCH_ID, ACCOUNT_TYPE_CODE;

Now each group is defined by the combination of BRANCH_ID and ACCOUNT_TYPE_CODE. Branch 1 with account type CHK is one group, Branch 1 with account type SAV is another, Branch 1 with account type MMA is a third, and so on. You get a row for every unique combination that exists in the data.

Grouping by Expressions

GROUP BY is not limited to bare column names. You can group by expressions:

-- Transaction count by month
SELECT YEAR(TRANSACTION_DATE)  AS txn_year,
       MONTH(TRANSACTION_DATE) AS txn_month,
       COUNT(*)                AS txn_count,
       SUM(AMOUNT)             AS total_amount
FROM MERIDIAN.TRANSACTIONS
GROUP BY YEAR(TRANSACTION_DATE), MONTH(TRANSACTION_DATE)
ORDER BY txn_year, txn_month;
-- Account count by balance tier
SELECT CASE
           WHEN CURRENT_BALANCE < 1000    THEN 'Under $1K'
           WHEN CURRENT_BALANCE < 10000   THEN '$1K - $10K'
           WHEN CURRENT_BALANCE < 100000  THEN '$10K - $100K'
           WHEN CURRENT_BALANCE < 500000  THEN '$100K - $500K'
           ELSE '$500K+'
       END AS balance_tier,
       COUNT(*)              AS acct_count,
       SUM(CURRENT_BALANCE)  AS tier_total
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'A'
GROUP BY CASE
             WHEN CURRENT_BALANCE < 1000    THEN 'Under $1K'
             WHEN CURRENT_BALANCE < 10000   THEN '$1K - $10K'
             WHEN CURRENT_BALANCE < 100000  THEN '$10K - $100K'
             WHEN CURRENT_BALANCE < 500000  THEN '$100K - $500K'
             ELSE '$500K+'
         END
ORDER BY tier_total;

Notice that the CASE expression in the GROUP BY must be an exact textual match with the one in the SELECT. DB2 does not recognize column aliases in the GROUP BY clause — this is one area where DB2 is stricter than some other databases.

GROUP BY with Joins

In practice, you almost always join first and then group. The bare BRANCH_ID in the previous examples is not helpful in a report — nobody wants to see "Branch 4" when they could see "Meridian Park Avenue."

-- Branch performance summary (the realistic version)
SELECT b.BRANCH_NAME,
       b.STATE_CODE,
       COUNT(a.ACCOUNT_ID)     AS total_accounts,
       SUM(a.CURRENT_BALANCE)  AS total_deposits,
       AVG(a.CURRENT_BALANCE)  AS avg_balance
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY b.BRANCH_NAME, b.STATE_CODE
ORDER BY total_deposits DESC;

This pattern — join to get descriptive names, filter with WHERE, group, and order — is the bread and butter of reporting SQL. You will write hundreds of queries like this.

ORDER BY with Aggregates

You can ORDER BY aggregate expressions, either by alias or by repeating the expression:

-- Branches ordered by total deposits, highest first
SELECT b.BRANCH_NAME,
       SUM(a.CURRENT_BALANCE) AS total_deposits
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY b.BRANCH_NAME
ORDER BY total_deposits DESC;

You can also ORDER BY an aggregate that does not appear in the SELECT:

-- Branch names ordered by average balance (but we don't show the avg)
SELECT b.BRANCH_NAME
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY b.BRANCH_NAME
ORDER BY AVG(a.CURRENT_BALANCE) DESC;

This is legal because ORDER BY is evaluated after GROUP BY. The aggregate is computed and used for sorting, even though it is not projected.
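The same pattern works in any engine that evaluates ORDER BY after GROUP BY. A sketch with SQLite from Python (toy data, not the Meridian schema):

```python
# Sorting groups by an aggregate that is never projected.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (branch TEXT, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("Main St", 100.0), ("Main St", 300.0),
                 ("Park Ave", 5000.0)])

rows = con.execute(
    "SELECT branch FROM accounts "
    "GROUP BY branch ORDER BY AVG(balance) DESC"
).fetchall()

print(rows)  # [('Park Ave',), ('Main St',)] -- sorted by the hidden average
```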


Check Your Understanding (Box 2)

  1. What is the fundamental rule about columns in a SELECT list when GROUP BY is used?
  2. Can you GROUP BY an expression that uses CASE WHEN? If so, what is the syntax requirement?
  3. In the typical join-then-group pattern, which clause do you use to filter individual rows before grouping?

7.3 HAVING vs. WHERE — Filtering Before and After Aggregation

This is the section that prevents a hundred future debugging sessions. The difference between WHERE and HAVING is simple in principle but causes confusion in practice because both "filter rows." They just filter at different stages.

WHERE filters individual rows before grouping. It operates on the raw data.

HAVING filters groups after aggregation. It operates on the results of aggregate functions.

The Logical Order of SQL Evaluation

To understand HAVING, you need to understand the order in which DB2 conceptually processes a query:

  1. FROM — Identify the table(s)
  2. JOIN / ON — Combine tables
  3. WHERE — Filter individual rows
  4. GROUP BY — Form groups
  5. HAVING — Filter groups
  6. SELECT — Compute output columns
  7. ORDER BY — Sort the result
  8. FETCH FIRST — Limit the result set

HAVING sits between GROUP BY and SELECT. It can reference aggregate functions because the groups have already been formed and the aggregates have already been computed (logically, at least — the optimizer may reorder things physically).

WHERE: Filtering Rows

-- Average balance of ACTIVE checking accounts only
SELECT AVG(CURRENT_BALANCE) AS avg_checking_balance
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'A'
  AND ACCOUNT_TYPE_CODE = 'CHK';

WHERE removes rows before any aggregation happens. Inactive accounts and non-checking accounts are excluded first, and then AVG operates on what remains.

HAVING: Filtering Groups

-- Branches with total deposits over $500,000
SELECT b.BRANCH_NAME,
       SUM(a.CURRENT_BALANCE) AS total_deposits
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY b.BRANCH_NAME
HAVING SUM(a.CURRENT_BALANCE) > 500000
ORDER BY total_deposits DESC;

Here, WHERE filters individual accounts (only active ones), GROUP BY creates one group per branch, SUM is computed within each group, and then HAVING eliminates branches whose total is $500,000 or less. Only branches that pass the HAVING test appear in the output.

The Common Mistake

Here is what goes wrong when people confuse the two:

-- WRONG: WHERE cannot reference aggregate functions
SELECT BRANCH_ID, SUM(CURRENT_BALANCE) AS total
FROM MERIDIAN.ACCOUNTS
WHERE SUM(CURRENT_BALANCE) > 500000   -- ERROR!
GROUP BY BRANCH_ID;

This fails because WHERE is evaluated before groups are formed. At the WHERE stage, there are no groups and no sums — just individual rows. You cannot test SUM(CURRENT_BALANCE) > 500000 when the SUM has not been computed yet.

The fix is to use HAVING:

-- CORRECT: HAVING filters after aggregation
SELECT BRANCH_ID, SUM(CURRENT_BALANCE) AS total
FROM MERIDIAN.ACCOUNTS
GROUP BY BRANCH_ID
HAVING SUM(CURRENT_BALANCE) > 500000;
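Both behaviors, the WHERE failure and the HAVING fix, are easy to reproduce outside DB2 because this is standard SQL. A sketch with SQLite from Python (toy schema; the exact error text varies by engine):

```python
# WHERE cannot reference aggregates; HAVING can.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (branch_id INTEGER, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [(1, 400000.0), (1, 200000.0), (2, 100000.0)])

err = None
try:
    con.execute("SELECT branch_id FROM accounts "
                "WHERE SUM(balance) > 500000 GROUP BY branch_id")
except sqlite3.OperationalError as e:
    err = str(e)
print(err)  # an error about misusing an aggregate function

rows = con.execute("SELECT branch_id, SUM(balance) FROM accounts "
                   "GROUP BY branch_id "
                   "HAVING SUM(balance) > 500000").fetchall()
print(rows)  # [(1, 600000.0)]
```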

Using WHERE and HAVING Together

The most powerful queries use both:

-- Among active accounts, find branches where the average
-- checking account balance exceeds $10,000
SELECT b.BRANCH_NAME,
       AVG(a.CURRENT_BALANCE) AS avg_checking_balance,
       COUNT(*)               AS checking_count
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'                    -- row-level filter
  AND a.ACCOUNT_TYPE_CODE IN ('CHK', 'CHI')
GROUP BY b.BRANCH_NAME
HAVING AVG(a.CURRENT_BALANCE) > 10000   -- group-level filter
ORDER BY avg_checking_balance DESC;

Read it step by step:

  1. Join branches and accounts
  2. WHERE keeps only active checking accounts
  3. GROUP BY forms one group per branch name
  4. HAVING keeps only groups where the average balance exceeds $10,000
  5. SELECT computes the output
  6. ORDER BY sorts by average, highest first

Performance Implication

WHERE is almost always more efficient than HAVING for the same filter. If you can express a filter in WHERE, do it there — fewer rows enter the grouping stage, which means less work for DB2.

The following query is syntactically valid but wasteful:

-- Technically works, but wasteful
SELECT BRANCH_ID, COUNT(*) AS acct_count
FROM MERIDIAN.ACCOUNTS
GROUP BY BRANCH_ID
HAVING BRANCH_ID IN (1, 2, 3);

BRANCH_ID is not an aggregate — it is a grouping column. This filter belongs in WHERE:

-- Better: filter before grouping
SELECT BRANCH_ID, COUNT(*) AS acct_count
FROM MERIDIAN.ACCOUNTS
WHERE BRANCH_ID IN (1, 2, 3)
GROUP BY BRANCH_ID;

Both return the same result, but the second version filters out rows from branches 4 and 5 before they ever enter the grouping stage.

Rule of Thumb: If the filter condition does not involve an aggregate function, put it in WHERE. If it involves an aggregate function, put it in HAVING.

HAVING without GROUP BY

An unusual but valid pattern: HAVING without GROUP BY. When GROUP BY is omitted, the entire result set is one implicit group. HAVING then decides whether to return that single group or nothing.

-- Return the total balance only if it exceeds $3,000,000
SELECT SUM(CURRENT_BALANCE) AS total_balance
FROM MERIDIAN.ACCOUNTS
HAVING SUM(CURRENT_BALANCE) > 3000000;

If the sum exceeds three million, you get one row. If it does not, you get zero rows. This is occasionally useful for conditional reporting.


7.4 NULLs in Aggregation — The Trap That Catches Everyone

NULLs in SQL mean "unknown." They do not mean zero, they do not mean empty string, and they do not mean "not applicable." And aggregate functions have specific, well-defined rules for handling them — rules that will bite you if you do not know them.

The Universal Rule

All aggregate functions except COUNT(*) ignore NULLs. They skip over NULL values as if those rows did not exist for that particular column.

Let us make this concrete. Imagine a table with five rows where the SALARY column contains: 50000, 60000, NULL, 70000, NULL.

Function        Result   Explanation
COUNT(*)        5        Counts rows, ignores column values entirely
COUNT(SALARY)   3        Counts non-NULL values only
SUM(SALARY)     180000   50000 + 60000 + 70000 (NULLs skipped)
AVG(SALARY)     60000    180000 / 3 (not 180000 / 5)
MIN(SALARY)     50000    Smallest non-NULL value
MAX(SALARY)     70000    Largest non-NULL value

Gotcha #1: AVG Denominator

This is the most dangerous gotcha. AVG divides by the count of non-NULL values, not the count of all rows. If you want NULLs treated as zeros in the average, you must say so explicitly:

-- AVG that treats NULL as zero
SELECT AVG(COALESCE(SALARY, 0)) AS avg_salary_with_nulls
FROM MERIDIAN.EMPLOYEES;

COALESCE(SALARY, 0) replaces NULLs with 0 before AVG sees them, so the denominator includes those rows. Whether this is correct depends on your business question. For Meridian Bank, if an employee's salary is NULL because they are a contractor, treating it as zero in the average would be misleading. If it is NULL because the data entry was incomplete and the real salary is zero, then COALESCE is appropriate.

Gotcha #2: SUM of All NULLs Returns NULL, Not Zero

-- If there are no closed accounts, this returns NULL
SELECT SUM(CURRENT_BALANCE) AS closed_balance_total
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'C';

If no rows match the WHERE clause — or if all matching rows have NULL in the summed column — SUM returns NULL. This is technically correct (the sum of zero non-NULL values is undefined), but it causes problems when you try to use the result in arithmetic or display it in a report.

The fix:

-- Guaranteed numeric result
SELECT COALESCE(SUM(CURRENT_BALANCE), 0) AS closed_balance_total
FROM MERIDIAN.ACCOUNTS
WHERE STATUS = 'C';

Use COALESCE(SUM(...), 0) defensively whenever a sum could operate on an empty set.
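The NULL-versus-zero behavior is standard SQL, so it can be demonstrated with SQLite from Python (toy table, not the Meridian schema):

```python
# SUM over an empty set returns NULL; COALESCE pins it to 0.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (status TEXT, balance REAL)")
con.execute("INSERT INTO accounts VALUES ('A', 500.0)")

empty_sum = con.execute(
    "SELECT SUM(balance) FROM accounts WHERE status = 'C'").fetchone()
safe_sum = con.execute(
    "SELECT COALESCE(SUM(balance), 0) FROM accounts WHERE status = 'C'"
).fetchone()

print(empty_sum)  # (None,)  -- NULL, not zero
print(safe_sum)   # (0,)
```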

Gotcha #3: COUNT(*) vs. COUNT(column)

We covered this in Section 7.1, but it bears repeating in the context of NULLs:

SELECT COUNT(*)            AS total_rows,
       COUNT(MIDDLE_NAME)  AS rows_with_middle_name
FROM MERIDIAN.CUSTOMERS;

For our 20 Meridian customers, COUNT(*) returns 20, and COUNT(MIDDLE_NAME) returns the number of customers who have a middle name recorded (those where the MIDDLE_NAME column is not NULL). Run the same pattern against MERIDIAN.ACCOUNTS with COUNT(CLOSE_DATE) and you count only the accounts that have actually been closed.

Gotcha #4: DISTINCT and NULLs

COUNT(DISTINCT column_name) counts distinct non-NULL values. NULL is not counted, no matter how many NULLs exist.

-- How many distinct risk ratings exist?
SELECT COUNT(DISTINCT RISK_RATING) AS distinct_ratings
FROM MERIDIAN.CUSTOMERS;

If RISK_RATING values are 1, 2, 3, and some rows have NULL, the result is 3 — not 4. NULL is not a distinct value in the context of COUNT(DISTINCT).
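A sketch of exactly this case with SQLite from Python (toy table, not the Meridian schema):

```python
# COUNT(DISTINCT ...) never counts NULL, no matter how many NULLs exist.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (risk_rating INTEGER)")
con.executemany("INSERT INTO customers VALUES (?)",
                [(1,), (2,), (3,), (2,), (None,), (None,)])

distinct_ratings = con.execute(
    "SELECT COUNT(DISTINCT risk_rating) FROM customers").fetchone()[0]
print(distinct_ratings)  # 3
```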

Gotcha #5: GROUP BY and NULLs

When you GROUP BY a column that contains NULLs, all the NULLs are collected into a single group. This is one of the few places in SQL where two NULLs are treated as "equal":

-- Group employees by manager
SELECT MANAGER_ID, COUNT(*) AS direct_reports
FROM MERIDIAN.EMPLOYEES
GROUP BY MANAGER_ID;

If MANAGER_ID is NULL for the top-level manager (Margaret Chen, who has no manager), she forms a group by herself. The result includes a row where MANAGER_ID is NULL and the count is 1.
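The single-NULL-group behavior is easy to confirm with SQLite from Python (toy table, not the Meridian schema):

```python
# GROUP BY collects all NULLs into a single group.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, manager_id INTEGER)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [("Chen", None), ("Ortiz", 1), ("Patel", 1), ("Webb", 2)])

rows = con.execute(
    "SELECT manager_id, COUNT(*) FROM employees "
    "GROUP BY manager_id ORDER BY manager_id").fetchall()
print(rows)  # [(None, 1), (1, 2), (2, 1)] -- the NULL group appears once
```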

Defensive Aggregation Patterns

Here is a template for defensive aggregation that handles NULLs correctly:

-- Defensive reporting query
SELECT b.BRANCH_NAME,
       COUNT(a.ACCOUNT_ID)                         AS total_accounts,
       COALESCE(SUM(a.CURRENT_BALANCE), 0)         AS total_balance,
       COALESCE(AVG(a.CURRENT_BALANCE), 0)         AS avg_balance,
       COALESCE(MIN(a.CURRENT_BALANCE), 0)         AS min_balance,
       COALESCE(MAX(a.CURRENT_BALANCE), 0)         AS max_balance,
       COUNT(a.ACCOUNT_ID) - COUNT(a.CLOSE_DATE)   AS still_open
FROM MERIDIAN.BRANCHES b
LEFT JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
GROUP BY b.BRANCH_NAME;

The LEFT JOIN ensures branches with zero accounts still appear. Note COUNT(a.ACCOUNT_ID) rather than COUNT(*): for a branch with no accounts, the NULL-extended row would make COUNT(*) report 1 account, while COUNT(a.ACCOUNT_ID) correctly reports 0. COALESCE prevents NULL results for branches that have no accounts. And COUNT(a.ACCOUNT_ID) - COUNT(a.CLOSE_DATE) cleverly computes the number of accounts where CLOSE_DATE is NULL (i.e., still open) by exploiting the rule that COUNT(column) counts only non-NULL values.
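The interaction between LEFT JOIN and the COUNT variants is worth seeing in miniature. A sketch with SQLite from Python (toy two-table schema, not the Meridian one):

```python
# With LEFT JOIN, COUNT(a.column) returns 0 for unmatched branches,
# while COUNT(*) would count the NULL-extended row as 1.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE branches (id INTEGER, name TEXT);
    CREATE TABLE accounts (branch_id INTEGER, bal REAL);
    INSERT INTO branches VALUES (1, 'Main St'), (2, 'Empty Branch');
    INSERT INTO accounts VALUES (1, 100.0), (1, 250.0);
""")

rows = con.execute("""
    SELECT b.name,
           COUNT(a.branch_id),       -- 0 for the empty branch
           COALESCE(SUM(a.bal), 0)   -- 0 instead of NULL
    FROM branches b
    LEFT JOIN accounts a ON b.id = a.branch_id
    GROUP BY b.name
    ORDER BY b.name
""").fetchall()

print(rows)  # [('Empty Branch', 0, 0), ('Main St', 2, 350.0)]
```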


Check Your Understanding (Box 3)

Given a column with values: 100, 200, NULL, 300, NULL

  1. What does COUNT(*) return?
  2. What does COUNT(column) return?
  3. What does AVG(column) return?
  4. What does SUM(column) / COUNT(*) return? Is this the same as AVG(column)?

The difference between questions 3 and 4 is subtle and important. Work it out on paper.
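Once you have worked it out, checking your answers is mechanical with SQLite from Python (same five values):

```python
# Checking Box 3: values 100, 200, NULL, 300, NULL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t VALUES (?)",
                [(100,), (200,), (None,), (300,), (None,)])

answers = con.execute("""
    SELECT COUNT(*),                 -- question 1
           COUNT(v),                 -- question 2
           AVG(v),                   -- question 3
           SUM(v) * 1.0 / COUNT(*)   -- question 4
    FROM t
""").fetchone()

print(answers)  # (5, 3, 200.0, 120.0) -- questions 3 and 4 differ
```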


7.5 GROUPING SETS — Flexible Multi-Level Summaries

Standard GROUP BY produces exactly one level of grouping. But reports often need multiple levels: a subtotal by branch, a subtotal by state, and a grand total — all in one result set.

Before GROUPING SETS, you had two options: run multiple queries and UNION ALL the results, or make the application layer compute the subtotals. Both are painful. GROUPING SETS solves the problem at the database level.

The Problem

Suppose management wants three things from one report:

  1. Total deposits by branch
  2. Total deposits by state
  3. Grand total of all deposits

Without GROUPING SETS, you would write:

-- Old way: three queries stitched together
SELECT BRANCH_NAME, STATE_CODE, SUM(CURRENT_BALANCE) AS total
FROM MERIDIAN.BRANCHES b JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
GROUP BY BRANCH_NAME, STATE_CODE

UNION ALL

SELECT NULL, STATE_CODE, SUM(CURRENT_BALANCE)
FROM MERIDIAN.BRANCHES b JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
GROUP BY STATE_CODE

UNION ALL

SELECT NULL, NULL, SUM(CURRENT_BALANCE)
FROM MERIDIAN.BRANCHES b JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID;

This scans the data three times. With GROUPING SETS, you scan it once:

The Solution

SELECT b.BRANCH_NAME,
       b.STATE_CODE,
       SUM(a.CURRENT_BALANCE) AS total_deposits
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY GROUPING SETS (
    (b.BRANCH_NAME, b.STATE_CODE),   -- detail by branch
    (b.STATE_CODE),                    -- subtotal by state
    ()                                 -- grand total
)
ORDER BY b.STATE_CODE NULLS LAST, b.BRANCH_NAME NULLS LAST;

Each entry in GROUPING SETS defines one grouping level. The empty parentheses () represent the grand total — no grouping columns at all.

The result looks like this (conceptually):

BRANCH_NAME               STATE_CODE  TOTAL_DEPOSITS
Meridian Main Street      CT              682,776.70
Meridian Westside         CT               38,540.30
NULL                      CT              721,317.00
Meridian Downtown Boston  MA              925,170.80
NULL                      MA              925,170.80
Meridian Princeton        NJ               87,460.75
NULL                      NJ               87,460.75
Meridian Park Avenue      NY            1,673,560.60
NULL                      NY            1,673,560.60
NULL                      NULL          3,407,509.15

The NULLs in the grouping columns indicate summary rows. BRANCH_NAME is NULL in state subtotals (because the subtotal is for all branches in that state). Both BRANCH_NAME and STATE_CODE are NULL in the grand total row.

But wait — how do you distinguish a NULL that means "this is a summary row" from a NULL that means "the data is actually NULL"? That is what the GROUPING() function is for, which we will cover in Section 7.8.


7.6 ROLLUP — Hierarchical Subtotals

ROLLUP is a specialized form of GROUPING SETS that produces hierarchical subtotals. It is perfect for reports with a natural drill-down structure: branch, state, region, grand total.

Syntax and Behavior

GROUP BY ROLLUP (a, b, c)

is equivalent to:

GROUP BY GROUPING SETS (
    (a, b, c),    -- most detailed
    (a, b),       -- subtotal without c
    (a),          -- subtotal without b and c
    ()            -- grand total
)

ROLLUP removes columns from right to left, one at a time. It produces N+1 grouping levels for N columns.
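SQLite (used here from Python for illustration) has no ROLLUP, which makes it a good place to see the expansion spelled out by hand: exactly the GROUPING SETS list above, written as UNION ALL over a toy table.

```python
# Emulating ROLLUP(state, branch) with its GROUPING SETS expansion:
# (state, branch), (state), and () -- one UNION ALL branch each.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accts (state TEXT, branch TEXT, bal REAL)")
con.executemany("INSERT INTO accts VALUES (?, ?, ?)",
                [("CT", "Main St", 100.0), ("CT", "Main St", 200.0),
                 ("CT", "Westside", 50.0), ("NY", "Park Ave", 1000.0)])

rows = con.execute("""
    SELECT state, branch, SUM(bal) FROM accts GROUP BY state, branch
    UNION ALL
    SELECT state, NULL, SUM(bal) FROM accts GROUP BY state
    UNION ALL
    SELECT NULL, NULL, SUM(bal) FROM accts
""").fetchall()

print(len(rows))                     # 6: 3 detail + 2 subtotals + 1 total
print((None, None, 1350.0) in rows)  # True -- the grand total row
```

This is the "old way" shown in Section 7.5, and it scans the data once per grouping set; ROLLUP gets DB2 the same rows in a single pass.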

Meridian Bank Example: Branch Hierarchy

-- Hierarchical report: Branch -> State -> Grand Total
SELECT b.STATE_CODE,
       b.BRANCH_NAME,
       COUNT(a.ACCOUNT_ID)     AS acct_count,
       SUM(a.CURRENT_BALANCE)  AS total_deposits,
       AVG(a.CURRENT_BALANCE)  AS avg_balance
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY ROLLUP (b.STATE_CODE, b.BRANCH_NAME)
ORDER BY b.STATE_CODE NULLS LAST, b.BRANCH_NAME NULLS LAST;

Result (conceptually):

STATE_CODE  BRANCH_NAME               ACCT_COUNT  TOTAL_DEPOSITS  AVG_BALANCE
CT          Meridian Main Street              12      682,776.70    56,898.06
CT          Meridian Westside                  4       38,540.30     9,635.08
CT          NULL                              16      721,317.00    45,082.31
MA          Meridian Downtown Boston           6      925,170.80   154,195.13
MA          NULL                               6      925,170.80   154,195.13
NJ          Meridian Princeton                 6       87,460.75    14,576.79
NJ          NULL                               6       87,460.75    14,576.79
NY          Meridian Park Avenue               7    1,673,560.60   239,080.09
NY          NULL                               7    1,673,560.60   239,080.09
NULL        NULL                              35    3,407,509.15    97,357.40

The row where STATE_CODE is "CT" and BRANCH_NAME is NULL is the Connecticut subtotal. The row where both are NULL is the grand total.

Why Column Order Matters in ROLLUP

ROLLUP(STATE_CODE, BRANCH_NAME) gives you subtotals by state and a grand total — but not subtotals by branch that cut across states. The rollup hierarchy goes from right to left. If you want every possible combination of subtotals, you need CUBE.

ROLLUP with More Than Two Levels

-- Three-level rollup: account type within branch within state
SELECT b.STATE_CODE,
       b.BRANCH_NAME,
       at.ACCOUNT_TYPE_NAME,
       COUNT(*)              AS acct_count,
       SUM(a.CURRENT_BALANCE) AS total_balance
FROM MERIDIAN.ACCOUNTS a
JOIN MERIDIAN.BRANCHES b ON a.BRANCH_ID = b.BRANCH_ID
JOIN MERIDIAN.ACCOUNT_TYPES at ON a.ACCOUNT_TYPE_CODE = at.ACCOUNT_TYPE_CODE
WHERE a.STATUS = 'A'
GROUP BY ROLLUP (b.STATE_CODE, b.BRANCH_NAME, at.ACCOUNT_TYPE_NAME)
ORDER BY b.STATE_CODE NULLS LAST,
         b.BRANCH_NAME NULLS LAST,
         at.ACCOUNT_TYPE_NAME NULLS LAST;

This produces four grouping levels:

  1. State + Branch + Account Type (detail)
  2. State + Branch (branch subtotal across account types)
  3. State (state subtotal across all branches)
  4. Grand total

That is the power of ROLLUP: one keyword, and you get the entire subtotal hierarchy that a financial report needs.


7.7 CUBE — All Possible Combinations

CUBE is ROLLUP's more aggressive sibling. Where ROLLUP produces a hierarchy, CUBE produces every possible combination of the grouping columns.

Syntax and Behavior

GROUP BY CUBE (a, b)

is equivalent to:

GROUP BY GROUPING SETS (
    (a, b),   -- grouped by both
    (a),      -- grouped by a only
    (b),      -- grouped by b only
    ()        -- grand total
)

For N columns, CUBE produces 2^N grouping levels. With 2 columns that is 4 levels. With 3 columns that is 8. With 4 columns, 16 levels. CUBE grows exponentially, so use it judiciously.

Cross-Tabulation Report

-- Every combination of state and account type
SELECT b.STATE_CODE,
       at.CATEGORY,
       COUNT(*)                AS acct_count,
       SUM(a.CURRENT_BALANCE)  AS total_balance
FROM MERIDIAN.ACCOUNTS a
JOIN MERIDIAN.BRANCHES b ON a.BRANCH_ID = b.BRANCH_ID
JOIN MERIDIAN.ACCOUNT_TYPES at ON a.ACCOUNT_TYPE_CODE = at.ACCOUNT_TYPE_CODE
WHERE a.STATUS = 'A'
GROUP BY CUBE (b.STATE_CODE, at.CATEGORY)
ORDER BY b.STATE_CODE NULLS LAST, at.CATEGORY NULLS LAST;

This gives you:

  • Count and total for each state + category combination (CT/CHECKING, CT/SAVINGS, etc.)
  • Subtotals for each state (across all categories)
  • Subtotals for each category (across all states)
  • Grand total

This is exactly the data you need for a cross-tabulation (pivot table) report. The application layer can format it into a grid.

When to Use Each

Technique      Grouping Levels  Best For
GROUP BY       1                Simple summaries
ROLLUP         N+1              Hierarchical subtotals (drill-down)
CUBE           2^N              Cross-tabulation, all-combinations analysis
GROUPING SETS  Custom           Specific combinations you define

GROUPING SETS is the general case. ROLLUP and CUBE are convenient shortcuts:

  • ROLLUP(a, b, c) is shorthand for GROUPING SETS ((a,b,c), (a,b), (a), ())
  • CUBE(a, b, c) is shorthand for GROUPING SETS ((a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), ())


7.8 The GROUPING() Function — Identifying Summary Rows

When ROLLUP or CUBE produces summary rows, the grouping columns that are "rolled up" show as NULL. But what if the underlying data also has NULLs? How do you distinguish "this is a summary row" from "the data is actually NULL"?

The GROUPING() function answers this question.

How GROUPING() Works

GROUPING(column) returns:

  • 0 if the column is part of the current grouping level (the value is a real data value, or a real NULL)
  • 1 if the column has been aggregated away (the NULL represents a summary row)

SELECT GROUPING(b.STATE_CODE)  AS is_state_total,
       GROUPING(b.BRANCH_NAME) AS is_branch_total,
       b.STATE_CODE,
       b.BRANCH_NAME,
       SUM(a.CURRENT_BALANCE)  AS total_deposits
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY ROLLUP (b.STATE_CODE, b.BRANCH_NAME)
ORDER BY b.STATE_CODE NULLS LAST, b.BRANCH_NAME NULLS LAST;

Result:

IS_STATE_TOTAL  IS_BRANCH_TOTAL  STATE_CODE  BRANCH_NAME               TOTAL_DEPOSITS
0               0                CT          Meridian Main Street          682,776.70
0               0                CT          Meridian Westside              38,540.30
0               1                CT          NULL                          721,317.00
0               0                MA          Meridian Downtown Boston      925,170.80
0               1                MA          NULL                          925,170.80
1               1                NULL        NULL                        3,407,509.15

When IS_BRANCH_TOTAL = 1, the BRANCH_NAME is NULL because it is a subtotal row, not because the branch name is actually NULL. When IS_STATE_TOTAL = 1, the STATE_CODE is NULL for the same reason.

Using GROUPING() for Display Labels

A common pattern is to use GROUPING() with CASE to replace NULLs with meaningful labels:

SELECT CASE GROUPING(b.STATE_CODE)
           WHEN 1 THEN '*** ALL STATES ***'
           ELSE b.STATE_CODE
       END AS state,
       CASE GROUPING(b.BRANCH_NAME)
           WHEN 1 THEN '--- All Branches ---'
           ELSE b.BRANCH_NAME
       END AS branch,
       COUNT(a.ACCOUNT_ID)     AS acct_count,
       SUM(a.CURRENT_BALANCE)  AS total_deposits
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY ROLLUP (b.STATE_CODE, b.BRANCH_NAME)
ORDER BY GROUPING(b.STATE_CODE), b.STATE_CODE,
         GROUPING(b.BRANCH_NAME), b.BRANCH_NAME;

This produces clean output where summary rows are clearly labeled instead of showing cryptic NULLs.

GROUPING_ID() — Multiple Columns at Once

When you have many grouping columns, checking each with GROUPING() individually becomes tedious. DB2 provides GROUPING_ID(), which returns a single integer representing the grouping state of multiple columns. It treats the GROUPING() values as bits in a binary number.

For example, with two columns:

  • GROUPING_ID(state, branch) = 0 means both are detail (binary 00)
  • GROUPING_ID(state, branch) = 1 means branch is aggregated (binary 01)
  • GROUPING_ID(state, branch) = 2 means state is aggregated (binary 10)
  • GROUPING_ID(state, branch) = 3 means both are aggregated (binary 11) — the grand total

This is particularly useful in application code that needs to programmatically identify summary levels without checking each column individually.
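As a sketch (check that GROUPING_ID() is available in your DB2 version before relying on it), the branch rollup from earlier in this section can expose its summary level as a single integer:

```sql
-- Sketch: one integer identifies the summary level of each row.
-- Assumes GROUPING_ID() is available in your DB2 version.
SELECT GROUPING_ID(b.STATE_CODE, b.BRANCH_NAME) AS level_id,
       CASE GROUPING_ID(b.STATE_CODE, b.BRANCH_NAME)
            WHEN 0 THEN 'Detail'          -- binary 00: both columns present
            WHEN 1 THEN 'State subtotal'  -- binary 01: branch aggregated away
            WHEN 3 THEN 'Grand total'     -- binary 11: both aggregated away
       END AS level_label,
       b.STATE_CODE,
       b.BRANCH_NAME,
       SUM(a.CURRENT_BALANCE) AS total_deposits
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY ROLLUP (b.STATE_CODE, b.BRANCH_NAME);
```

Note that level 2 (state aggregated, branch kept) never appears here: ROLLUP removes columns right to left, so that combination only arises with CUBE.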


7.9 Combining Aggregation with Joins — The Real-World Reporting Pattern

In production, aggregation queries almost never operate on a single table. The typical pattern is:

  1. Join multiple tables to assemble the raw data
  2. Filter with WHERE to exclude irrelevant rows
  3. Group with GROUP BY to define the summary level
  4. Filter groups with HAVING if needed
  5. Order the result for presentation

Let us walk through progressively more complex examples.

Pattern 1: Two-Table Join with Aggregation

-- Customer account summary
SELECT c.CUSTOMER_ID,
       c.FIRST_NAME || ' ' || c.LAST_NAME AS customer_name,
       COUNT(a.ACCOUNT_ID)                 AS num_accounts,
       SUM(a.CURRENT_BALANCE)              AS total_balance,
       MAX(a.OPEN_DATE)                    AS newest_account
FROM MERIDIAN.CUSTOMERS c
JOIN MERIDIAN.ACCOUNTS a ON c.CUSTOMER_ID = a.CUSTOMER_ID
WHERE c.STATUS = 'A'
  AND a.STATUS = 'A'
GROUP BY c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
ORDER BY total_balance DESC;

Notice that FIRST_NAME and LAST_NAME must appear in GROUP BY because they appear in SELECT without an aggregate. Since CUSTOMER_ID is the primary key and functionally determines the name columns, this grouping is logically equivalent to grouping by CUSTOMER_ID alone; DB2 nevertheless requires every non-aggregated SELECT column to be listed.

Pattern 2: Three-Table Join

-- Transaction summary by branch and transaction type
SELECT b.BRANCH_NAME,
       t.TRANSACTION_TYPE,
       COUNT(*)         AS txn_count,
       SUM(t.AMOUNT)    AS total_amount,
       AVG(t.AMOUNT)    AS avg_amount
FROM MERIDIAN.TRANSACTIONS t
JOIN MERIDIAN.ACCOUNTS a ON t.ACCOUNT_ID = a.ACCOUNT_ID
JOIN MERIDIAN.BRANCHES b ON a.BRANCH_ID = b.BRANCH_ID
WHERE t.TRANSACTION_DATE >= '2025-11-01'
GROUP BY b.BRANCH_NAME, t.TRANSACTION_TYPE
ORDER BY b.BRANCH_NAME, txn_count DESC;

This joins three tables: transactions to accounts (to get BRANCH_ID) to branches (to get the name). Then it groups by branch and transaction type to show how many deposits, withdrawals, payments, etc. each branch processed.

Pattern 3: Join, Aggregate, and Filter Groups

-- Find customers with more than $100,000 in total deposits
-- who have accounts at multiple branches
SELECT c.FIRST_NAME || ' ' || c.LAST_NAME AS customer_name,
       COUNT(DISTINCT a.BRANCH_ID)         AS branch_count,
       COUNT(a.ACCOUNT_ID)                 AS account_count,
       SUM(a.CURRENT_BALANCE)              AS total_deposits
FROM MERIDIAN.CUSTOMERS c
JOIN MERIDIAN.ACCOUNTS a ON c.CUSTOMER_ID = a.CUSTOMER_ID
WHERE a.STATUS = 'A'
GROUP BY c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME
HAVING SUM(a.CURRENT_BALANCE) > 100000
ORDER BY total_deposits DESC;

Pattern 4: LEFT JOIN with Aggregation

LEFT JOIN is critical for reports that must include entities with zero related rows:

-- All branches with their account counts (including branches with 0)
SELECT b.BRANCH_NAME,
       COUNT(a.ACCOUNT_ID)                     AS acct_count,
       COALESCE(SUM(a.CURRENT_BALANCE), 0)     AS total_balance
FROM MERIDIAN.BRANCHES b
LEFT JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
                               AND a.STATUS = 'A'
GROUP BY b.BRANCH_NAME
ORDER BY total_balance DESC;

Two important points here:

  1. We use COUNT(a.ACCOUNT_ID) rather than COUNT(*). With a LEFT JOIN, branches that have no accounts still produce a row (with NULLs in the account columns). COUNT(*) would count that row as 1. COUNT(a.ACCOUNT_ID) correctly returns 0 because ACCOUNT_ID is NULL for the unmatched rows.

  2. The filter a.STATUS = 'A' is in the ON clause, not in WHERE. If it were in WHERE, it would filter out the NULL rows produced by the LEFT JOIN, effectively converting it to an inner join. Putting the filter in ON means "when joining, only consider active accounts" while still preserving branches that have no active accounts.

Pattern 5: Subquery in HAVING

-- Branches whose average balance exceeds the bank-wide average
SELECT b.BRANCH_NAME,
       AVG(a.CURRENT_BALANCE) AS branch_avg
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
WHERE a.STATUS = 'A'
GROUP BY b.BRANCH_NAME
HAVING AVG(a.CURRENT_BALANCE) > (
    SELECT AVG(CURRENT_BALANCE)
    FROM MERIDIAN.ACCOUNTS
    WHERE STATUS = 'A'
)
ORDER BY branch_avg DESC;

The subquery in HAVING computes the overall average once, and then each branch's average is compared against it. This is a powerful pattern for finding above-average or below-average groups.


Check Your Understanding (Box 4)

  1. In a LEFT JOIN with GROUP BY, why should you use COUNT(column) instead of COUNT(*)?
  2. When combining a LEFT JOIN with a filter on the right table, where should the filter go — WHERE or ON?
  3. Can you use a subquery inside a HAVING clause?

7.10 Meridian Bank Financial Reports

Now let us apply everything to the kinds of reports Meridian National Bank's management team actually needs. These are production-grade queries.

Report 1: Daily Transaction Summary

-- Daily transaction summary by type and channel
SELECT t.TRANSACTION_DATE,
       t.TRANSACTION_TYPE,
       t.CHANNEL,
       COUNT(*)                          AS txn_count,
       SUM(CASE WHEN t.AMOUNT > 0
                THEN t.AMOUNT ELSE 0 END) AS total_credits,
       SUM(CASE WHEN t.AMOUNT < 0
                THEN t.AMOUNT ELSE 0 END) AS total_debits,
       SUM(t.AMOUNT)                     AS net_amount
FROM MERIDIAN.TRANSACTIONS t
WHERE t.TRANSACTION_DATE >= '2025-11-01'
  AND t.STATUS = 'C'
GROUP BY t.TRANSACTION_DATE, t.TRANSACTION_TYPE, t.CHANNEL
ORDER BY t.TRANSACTION_DATE, t.TRANSACTION_TYPE;

The CASE expressions inside SUM are a standard technique for conditional aggregation. They allow you to split a single column into positive and negative buckets without running separate queries.

Report 2: Branch Performance Dashboard

-- Monthly branch performance with all key metrics
SELECT b.BRANCH_NAME,
       b.STATE_CODE,
       COUNT(DISTINCT c.CUSTOMER_ID)       AS unique_customers,
       COUNT(a.ACCOUNT_ID)                 AS total_accounts,
       SUM(a.CURRENT_BALANCE)              AS total_deposits,
       AVG(a.CURRENT_BALANCE)              AS avg_balance,
       MIN(a.CURRENT_BALANCE)              AS min_balance,
       MAX(a.CURRENT_BALANCE)              AS max_balance,
       SUM(CASE WHEN at.CATEGORY = 'CHECKING'
                THEN a.CURRENT_BALANCE ELSE 0 END) AS checking_total,
       SUM(CASE WHEN at.CATEGORY = 'SAVINGS'
                THEN a.CURRENT_BALANCE ELSE 0 END) AS savings_total,
       SUM(CASE WHEN at.CATEGORY = 'CERTIFICATE'
                THEN a.CURRENT_BALANCE ELSE 0 END) AS certificate_total
FROM MERIDIAN.BRANCHES b
JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
JOIN MERIDIAN.CUSTOMERS c ON a.CUSTOMER_ID = c.CUSTOMER_ID
JOIN MERIDIAN.ACCOUNT_TYPES at ON a.ACCOUNT_TYPE_CODE = at.ACCOUNT_TYPE_CODE
WHERE a.STATUS = 'A'
GROUP BY b.BRANCH_NAME, b.STATE_CODE
ORDER BY total_deposits DESC;

This single query produces every number a branch manager's dashboard needs. The conditional SUM technique breaks deposits into categories without additional queries.

Report 3: Customer Relationship Summary

-- Top customers by total relationship value
SELECT c.CUSTOMER_ID,
       c.FIRST_NAME || ' ' || c.LAST_NAME AS customer_name,
       c.CUSTOMER_TYPE,
       b.BRANCH_NAME                       AS primary_branch,
       COUNT(a.ACCOUNT_ID)                 AS num_accounts,
       SUM(a.CURRENT_BALANCE)              AS total_balance,
       MIN(a.OPEN_DATE)                    AS relationship_since,
       MAX(a.LAST_ACTIVITY_DATE)           AS last_activity,
       COALESCE(SUM(l.CURRENT_PRINCIPAL), 0) AS total_loan_balance
FROM MERIDIAN.CUSTOMERS c
JOIN MERIDIAN.BRANCHES b ON c.PRIMARY_BRANCH_ID = b.BRANCH_ID
LEFT JOIN MERIDIAN.ACCOUNTS a ON c.CUSTOMER_ID = a.CUSTOMER_ID
                                AND a.STATUS = 'A'
LEFT JOIN MERIDIAN.LOANS l ON a.ACCOUNT_ID = l.ACCOUNT_ID
                             AND l.STATUS = 'A'
WHERE c.STATUS = 'A'
GROUP BY c.CUSTOMER_ID, c.FIRST_NAME, c.LAST_NAME,
         c.CUSTOMER_TYPE, b.BRANCH_NAME
ORDER BY total_balance DESC
FETCH FIRST 10 ROWS ONLY;

This report joins four tables and uses LEFT JOINs to include customers who might not have loans. The COALESCE ensures zero rather than NULL for customers with no active loans. One caveat: if a single account carries more than one active loan, the LOANS join fans out and SUM(a.CURRENT_BALANCE) counts that account's balance once per loan row; pre-aggregating loans per account in a subquery avoids that inflation.

Report 4: Loan Officer Performance

-- Loan officer production report
SELECT e.FIRST_NAME || ' ' || e.LAST_NAME AS loan_officer,
       b.BRANCH_NAME,
       COUNT(l.LOAN_ID)                     AS loans_originated,
       SUM(l.ORIGINAL_AMOUNT)               AS total_originated,
       AVG(l.ORIGINAL_AMOUNT)               AS avg_loan_size,
       SUM(l.CURRENT_PRINCIPAL)             AS outstanding_principal,
       AVG(l.INTEREST_RATE)                 AS avg_rate
FROM MERIDIAN.EMPLOYEES e
JOIN MERIDIAN.BRANCHES b ON e.BRANCH_ID = b.BRANCH_ID
LEFT JOIN MERIDIAN.LOANS l ON e.EMPLOYEE_ID = l.LOAN_OFFICER_ID
WHERE e.DEPARTMENT = 'Lending'
GROUP BY e.EMPLOYEE_ID, e.FIRST_NAME, e.LAST_NAME, b.BRANCH_NAME
ORDER BY total_originated DESC NULLS LAST;

Report 5: Monthly Transaction Trend with ROLLUP

-- Monthly trend with state subtotals and grand total
SELECT CASE GROUPING(b.STATE_CODE)
           WHEN 1 THEN '=== ALL ==='
           ELSE b.STATE_CODE
       END AS state,
       CASE GROUPING(MONTH(t.TRANSACTION_DATE))
           WHEN 1 THEN 'TOTAL'
           ELSE CAST(MONTH(t.TRANSACTION_DATE) AS VARCHAR(2))
       END AS month,
       COUNT(*)         AS txn_count,
       SUM(t.AMOUNT)    AS net_amount,
       SUM(CASE WHEN t.AMOUNT > 0
                THEN t.AMOUNT ELSE 0 END) AS credits,
       SUM(CASE WHEN t.AMOUNT < 0
                THEN ABS(t.AMOUNT) ELSE 0 END) AS debits
FROM MERIDIAN.TRANSACTIONS t
JOIN MERIDIAN.ACCOUNTS a ON t.ACCOUNT_ID = a.ACCOUNT_ID
JOIN MERIDIAN.BRANCHES b ON a.BRANCH_ID = b.BRANCH_ID
WHERE t.TRANSACTION_DATE >= '2025-11-01'
  AND t.STATUS = 'C'
GROUP BY ROLLUP (b.STATE_CODE, MONTH(t.TRANSACTION_DATE))
ORDER BY GROUPING(b.STATE_CODE), b.STATE_CODE,
         GROUPING(MONTH(t.TRANSACTION_DATE)), MONTH(t.TRANSACTION_DATE);

Report 6: Account Type Mix Analysis with CUBE

-- Cross-tabulation: state vs. account category
SELECT CASE GROUPING(b.STATE_CODE)
           WHEN 1 THEN 'ALL STATES'
           ELSE b.STATE_CODE
       END AS state,
       CASE GROUPING(at.CATEGORY)
           WHEN 1 THEN 'ALL TYPES'
           ELSE at.CATEGORY
       END AS category,
       COUNT(*)                AS acct_count,
       SUM(a.CURRENT_BALANCE)  AS total_balance,
       DECIMAL(AVG(a.CURRENT_BALANCE), 15, 2) AS avg_balance
FROM MERIDIAN.ACCOUNTS a
JOIN MERIDIAN.BRANCHES b ON a.BRANCH_ID = b.BRANCH_ID
JOIN MERIDIAN.ACCOUNT_TYPES at ON a.ACCOUNT_TYPE_CODE = at.ACCOUNT_TYPE_CODE
WHERE a.STATUS = 'A'
GROUP BY CUBE (b.STATE_CODE, at.CATEGORY)
ORDER BY GROUPING(b.STATE_CODE), b.STATE_CODE,
         GROUPING(at.CATEGORY), at.CATEGORY;

This produces every possible subtotal: totals for each state, totals for each account category, and the grand total. It is exactly the data needed for a matrix report.


7.11 Performance Considerations for Aggregation

Aggregation queries can be expensive. They typically need to read every row in the group (or the entire table if there is no WHERE clause), sort or hash the data to form groups, and then compute the aggregate expressions. Here is what you need to know to keep them fast.

When GROUP BY Triggers Sorts

DB2 has two strategies for computing GROUP BY:

  1. Sort-based grouping: DB2 sorts the rows by the grouping columns, then scans the sorted stream and computes aggregates as each group boundary is reached. This requires a sort operation, which can be expensive for large result sets and may spill to disk if the sort heap is too small.

  2. Hash-based grouping: DB2 builds a hash table keyed on the grouping columns and accumulates aggregates as it scans the rows. This is often faster because it avoids a full sort, but it requires enough memory to hold the hash table.

The optimizer chooses between these strategies based on cost estimates. You generally do not need to force one or the other, but understanding that both exist helps you read EXPLAIN output.

Index Exploitation for Aggregation

DB2 can sometimes satisfy aggregate queries using indexes alone, without touching the base table. This is called index-only access and is extremely fast.

For example, consider this query:

SELECT BRANCH_ID, COUNT(*)
FROM MERIDIAN.ACCOUNTS
GROUP BY BRANCH_ID;

If there is an index on BRANCH_ID, DB2 might scan that index (which is already sorted by BRANCH_ID) and count entries without ever reading the ACCOUNTS table. The index contains all the information the query needs.

For this to work, every column referenced in the query — in SELECT, WHERE, GROUP BY, and HAVING — must be present in the index. This is the covering index concept.

-- This index would cover the query above
CREATE INDEX MERIDIAN.IX_ACCT_BRANCH_TYPE
    ON MERIDIAN.ACCOUNTS (BRANCH_ID, ACCOUNT_TYPE_CODE, CURRENT_BALANCE);

With this composite index, DB2 could satisfy queries like:

SELECT BRANCH_ID, ACCOUNT_TYPE_CODE,
       SUM(CURRENT_BALANCE) AS total
FROM MERIDIAN.ACCOUNTS
GROUP BY BRANCH_ID, ACCOUNT_TYPE_CODE;

entirely from the index, because all three columns are in the index.

Filtering Early

We covered this in Section 7.3, but it bears repeating: always filter with WHERE before grouping. The fewer rows that enter the GROUP BY operation, the faster it runs.

-- Slow: groups all accounts, then filters
SELECT BRANCH_ID, SUM(CURRENT_BALANCE)
FROM MERIDIAN.ACCOUNTS
GROUP BY BRANCH_ID
HAVING BRANCH_ID = 1;

-- Fast: filters first, then groups one branch
SELECT BRANCH_ID, SUM(CURRENT_BALANCE)
FROM MERIDIAN.ACCOUNTS
WHERE BRANCH_ID = 1
GROUP BY BRANCH_ID;

ROLLUP and CUBE Performance

ROLLUP and CUBE are implemented internally as multiple GROUP BY passes over the same data. DB2 optimizes this to avoid rescanning the base tables, but the computation is still more expensive than a simple GROUP BY. Specifically:

  • ROLLUP(a, b, c) computes 4 grouping levels
  • CUBE(a, b, c) computes 8 grouping levels
  • CUBE(a, b, c, d) computes 16 grouping levels

If you only need a few specific combinations, GROUPING SETS with explicit sets is more efficient than CUBE because it avoids computing combinations you do not need.
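As a sketch against the Meridian schema, the query below asks for only the state subtotals and the category subtotals, skipping the detail rows and grand total that CUBE would also compute:

```sql
-- Sketch: compute only the two subtotal levels we actually need
SELECT b.STATE_CODE,
       at.CATEGORY,
       SUM(a.CURRENT_BALANCE) AS total_balance
FROM MERIDIAN.ACCOUNTS a
JOIN MERIDIAN.BRANCHES b ON a.BRANCH_ID = b.BRANCH_ID
JOIN MERIDIAN.ACCOUNT_TYPES at ON a.ACCOUNT_TYPE_CODE = at.ACCOUNT_TYPE_CODE
WHERE a.STATUS = 'A'
GROUP BY GROUPING SETS ((b.STATE_CODE), (at.CATEGORY));
```

In each result row, the column that is not part of the current grouping set shows NULL, exactly as with ROLLUP and CUBE.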

Aggregation Over Large Tables

For tables with millions or billions of rows (which is common in banking), consider:

  1. Materialized Query Tables (MQTs): DB2 supports materialized views that pre-compute aggregations. If the daily transaction summary report is run 50 times a day, materializing it once and refreshing it periodically is far more efficient than recomputing it every time.
-- Create a materialized query table for branch totals
CREATE TABLE MERIDIAN.MQT_BRANCH_SUMMARY AS (
    SELECT b.BRANCH_ID,
           b.BRANCH_NAME,
           COUNT(*) AS acct_count,
           SUM(a.CURRENT_BALANCE) AS total_balance
    FROM MERIDIAN.BRANCHES b
    JOIN MERIDIAN.ACCOUNTS a ON b.BRANCH_ID = a.BRANCH_ID
    WHERE a.STATUS = 'A'
    GROUP BY b.BRANCH_ID, b.BRANCH_NAME
) DATA INITIALLY DEFERRED REFRESH DEFERRED;
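Because the MQT above is defined with DATA INITIALLY DEFERRED and REFRESH DEFERRED, it starts out empty. Populating it, and re-populating it on whatever schedule the report tolerates, is a single statement:

```sql
-- Compute (or recompute) the deferred MQT's contents
REFRESH TABLE MERIDIAN.MQT_BRANCH_SUMMARY;
```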
  2. Summary tables maintained by triggers: For real-time summaries, maintain a summary table that is updated by triggers on the base table. This trades write performance for read performance.

  3. Partition elimination: If the TRANSACTIONS table is range-partitioned by TRANSACTION_DATE, a query that filters on a date range can skip entire partitions, dramatically reducing the data scanned.

  4. Sort heap tuning: If GROUP BY queries frequently spill to disk (visible in EXPLAIN as sort overflows), increase the SORTHEAP database configuration parameter. We set it to 2048 pages in Chapter 4, but heavy aggregation workloads may need more.
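The trigger-maintained summary approach above can be sketched as a DB2 trigger. MERIDIAN.BRANCH_BALANCE_SUMMARY is a hypothetical pre-built summary table, and this sketch covers only balance updates; inserts, deletes, and branch transfers would each need their own trigger:

```sql
-- Sketch: keep a hypothetical per-branch summary current on balance changes
CREATE TRIGGER MERIDIAN.TRG_ACCT_BAL_SUM
    AFTER UPDATE OF CURRENT_BALANCE ON MERIDIAN.ACCOUNTS
    REFERENCING OLD AS o NEW AS n
    FOR EACH ROW
    UPDATE MERIDIAN.BRANCH_BALANCE_SUMMARY s
       SET s.TOTAL_BALANCE = s.TOTAL_BALANCE
                             + (n.CURRENT_BALANCE - o.CURRENT_BALANCE)
     WHERE s.BRANCH_ID = n.BRANCH_ID;
```

Every UPDATE of CURRENT_BALANCE now also pays for one summary-row update, which is exactly the write-for-read trade described above.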

The EXPLAIN Perspective

To understand how DB2 executes your aggregation query, use EXPLAIN:

EXPLAIN PLAN FOR
SELECT BRANCH_ID, SUM(CURRENT_BALANCE)
FROM MERIDIAN.ACCOUNTS
GROUP BY BRANCH_ID;

Then query the EXPLAIN tables. Look for:

  • GRPBY operators: these perform the grouping
  • SORT operators before GRPBY: indicate sort-based grouping
  • HSJOIN or TEMP nodes: indicate hash-based grouping or temporary tables
  • IXSCAN without FETCH: indicates index-only access

We will cover EXPLAIN in depth in the performance chapters (Part 5), but even at this stage, knowing that aggregation queries involve sorts and hash operations helps you reason about why they behave the way they do.


Spaced Review: Chapters 1-6

This section reinforces key concepts from earlier chapters. Spend 60-90 seconds recalling each answer before checking.

From Chapter 1 (What Is DB2?)

Q: What are the two major DB2 product families, and what platforms do they run on?

Answer: DB2 for z/OS runs on IBM Z mainframes. DB2 for LUW (Linux, UNIX, Windows) runs on distributed platforms including Linux, AIX, and Windows Server. They share the same SQL foundation but differ in architecture, tooling, and operational practices.

From Chapter 2 (Relational Model)

Q: What is the difference between a candidate key and a primary key?

Answer: A candidate key is any column or combination of columns that uniquely identifies every row in a table. A primary key is the candidate key that is selected as the table's official unique identifier. A table may have multiple candidate keys but only one primary key.

From Chapter 3 (DB2 Architecture)

Q: What is the purpose of the buffer pool in DB2?

Answer: The buffer pool is an area of memory where DB2 caches data pages and index pages read from disk. By keeping frequently accessed pages in memory, the buffer pool dramatically reduces physical I/O, which is the single largest factor in database performance.

From Chapter 4 (Setting Up Your Environment)

Q: What does the SORTHEAP database configuration parameter control, and why is it relevant to this chapter?

Answer: SORTHEAP defines the amount of private memory (in pages) available for each sort operation. GROUP BY often triggers sort operations, and if the sort data exceeds SORTHEAP, DB2 must spill intermediate results to disk (a sort overflow), which dramatically slows the query.

From Chapter 5 (SELECT, WHERE, ORDER BY)

Q: In what order does DB2 conceptually evaluate FROM, WHERE, SELECT, and ORDER BY?

Answer: FROM first (identify the table), WHERE second (filter rows), SELECT third (compute output columns), ORDER BY last (sort the result). This is why you cannot use a column alias defined in SELECT inside a WHERE clause — WHERE is evaluated before SELECT.

From Chapter 6 (Joins)

Q: What is the difference between INNER JOIN and LEFT JOIN when one side has no matching rows?

Answer: INNER JOIN excludes the row entirely — if there is no match, the row does not appear in the result. LEFT JOIN preserves the row from the left table and fills the right table's columns with NULLs. This is why LEFT JOIN is essential for reporting queries that must include entities with zero related rows (e.g., branches with no accounts).

Chapter Summary

Aggregation is where SQL transitions from data retrieval to data analysis. In this chapter, you learned:

  1. Aggregate functions — COUNT, SUM, AVG, MIN, MAX, STDDEV, and VARIANCE — collapse multiple rows into single summary values. COUNT(*) counts rows; COUNT(column) counts non-NULL values; COUNT(DISTINCT column) counts unique non-NULL values.

  2. GROUP BY partitions the result set into groups and applies aggregate functions independently to each group. Every non-aggregated column in SELECT must appear in GROUP BY.

  3. HAVING filters groups after aggregation, while WHERE filters rows before aggregation. If the condition does not involve an aggregate, use WHERE for better performance.

  4. NULLs are ignored by all aggregate functions except COUNT(*). AVG divides by the count of non-NULL values, not total rows. SUM over an empty set returns NULL, not zero. Use COALESCE defensively.

  5. GROUPING SETS let you compute multiple grouping levels in a single query. ROLLUP produces hierarchical subtotals (N+1 levels). CUBE produces all possible combinations (2^N levels).

  6. GROUPING() identifies summary rows by returning 1 when a column has been rolled up. Use it with CASE to label summary rows in reports.

  7. Join-then-aggregate is the standard reporting pattern. Use LEFT JOIN with COUNT(column) and COALESCE to correctly handle entities with zero related rows.

  8. Performance depends on filtering early (WHERE before GROUP BY), covering indexes that enable index-only access, and appropriate SORTHEAP configuration. For frequently-run reports, consider materialized query tables.

These techniques form the foundation of every financial report, every dashboard, and every data analysis query you will ever write against DB2. Master them, and you can answer any question that starts with "How many," "What is the total," or "What is the average."


Next chapter: Chapter 8 will introduce subqueries and common table expressions (CTEs), giving you the tools to nest queries inside queries and build complex analytical pipelines step by step.