26 min read

> Where you are: Part II, Chapter 7 of 40. Chapter 5 filtered rows; Chapter 6 connected tables. Now you condense rows into summaries — counts, totals, averages — the numbers that answer business questions.

Chapter 7: Aggregation — GROUP BY, HAVING, and Summarizing Data

Where you are: Part II, Chapter 7 of 40. Chapter 5 filtered rows; Chapter 6 connected tables. Now you condense rows into summaries — counts, totals, averages — the numbers that answer business questions.

Learning paths: 💻 📊 🔬 🏗️ — everyone, but 📊 analysts especially: aggregation is the core of reporting and analytics.


From rows to answers

So far every query returned rows — individual records. But most real questions are about summaries: "How many orders did we get last month?" "What's the average order value per customer?" "Which category sells the most?" None of those want a list of rows; they want a computed number, often per group. That's aggregation, and together with joins it lets you answer almost any analytical question.

Aggregation works in two layers:

  1. Aggregate functions (COUNT, SUM, AVG, MIN, MAX) collapse many rows into one value.
  2. GROUP BY splits rows into groups and applies the aggregate to each group, so you get one summary row per group.

Aggregate functions over the whole table

With no GROUP BY, an aggregate collapses the entire result to a single row:

SELECT COUNT(*)        AS num_products,
       MIN(price)      AS cheapest,
       MAX(price)      AS priciest,
       ROUND(AVG(price), 2) AS avg_price,
       SUM(price)      AS total_catalog_value
FROM products;
 num_products | cheapest | priciest | avg_price | total_catalog_value
--------------+----------+----------+-----------+---------------------
           15 |    34.00 |  1899.00 |    482.07 |             7231.00

COUNT has three forms — know the difference

SELECT COUNT(*)              AS rows_total,      -- counts ROWS (including NULLs)
       COUNT(phone)          AS with_phone,      -- counts non-NULL phone values
       COUNT(DISTINCT loyalty_tier) AS num_tiers -- counts distinct non-NULL values
FROM customers;
 rows_total | with_phone | num_tiers
------------+------------+-----------
         12 |         10 |         4
  • COUNT(*) counts rows, period.
  • COUNT(column) counts rows where that column is not NULL (recall Chapter 3: aggregates skip nulls).
  • COUNT(DISTINCT column) counts distinct non-null values.

That rows_total vs with_phone gap (12 vs 10) is exactly the two null phones — and the source of countless reporting bugs when people use COUNT(col) thinking it counts rows.

Common mistake. SUM, AVG, MIN, MAX also skip NULLs. AVG(discount) averages only the rows that have a discount — if you want to treat missing as zero, use AVG(COALESCE(discount, 0)). The two can differ enormously. Always ask whether "no value" should mean "skip it" or "count it as zero."


GROUP BY: one summary per group

GROUP BY partitions rows into groups that share a value, then computes each aggregate within each group. "How many customers are in each loyalty tier?"

SELECT loyalty_tier, COUNT(*) AS num_customers
FROM customers
GROUP BY loyalty_tier
ORDER BY num_customers DESC;
 loyalty_tier | num_customers
--------------+---------------
 standard     |             6
 gold         |             3
 silver       |             2
 platinum     |             1

The query produced one row per distinct loyalty_tier. That's the essence of GROUP BY: the grouping column(s) become the identity of each output row, and every aggregate is computed over just the rows in that group.

You can group by multiple columns — the group is then each distinct combination:

-- Average price per category, by active/inactive status
SELECT category_id, is_active, ROUND(AVG(price), 2) AS avg_price, COUNT(*) AS n
FROM products
GROUP BY category_id, is_active
ORDER BY category_id, is_active;

The cardinal rule of GROUP BY

Every column in the SELECT list must either be inside an aggregate function or appear in the GROUP BY.

This makes sense: if you group by loyalty_tier, each output row represents many customers — so PostgreSQL can't show a single first_name, because there are several. It must be aggregated (COUNT, MAX, etc.) or be the grouping column itself.

-- ERROR: column "customers.first_name" must appear in the GROUP BY clause
-- or be used in an aggregate function
SELECT loyalty_tier, first_name, COUNT(*)
FROM customers
GROUP BY loyalty_tier;

This error is one of the most common in all of SQL. The fix is always the same: either add the column to GROUP BY, or wrap it in an aggregate, depending on what you actually mean.


HAVING: filtering groups

WHERE filters rows — but it runs before grouping, so it can't reference an aggregate. To filter groups by an aggregate result, use HAVING, which runs after grouping.

"Which loyalty tiers have more than two customers?"

SELECT loyalty_tier, COUNT(*) AS n
FROM customers
GROUP BY loyalty_tier
HAVING COUNT(*) > 2
ORDER BY n DESC;
 loyalty_tier | n
--------------+---
 standard     | 6
 gold         | 3

The distinction is fundamental and worth stating crisply:

  • WHERE filters individual rows before they're grouped. ("Only consider active products.")
  • HAVING filters whole groups after aggregation, using aggregate results. ("Only keep categories whose average price exceeds $100.")

You frequently use both together:

-- Among ACTIVE products (WHERE), find categories (GROUP BY)
-- whose average price exceeds $100 (HAVING):
SELECT category_id, ROUND(AVG(price), 2) AS avg_price, COUNT(*) AS n
FROM products
WHERE is_active = true            -- row filter, before grouping
GROUP BY category_id
HAVING AVG(price) > 100           -- group filter, after aggregation
ORDER BY avg_price DESC;

Common mistake. Putting an aggregate in WHERE (WHERE COUNT(*) > 2) — that's an error, because WHERE runs before groups exist. Aggregate conditions go in HAVING. Conversely, don't put a plain row condition in HAVING (HAVING is_active = true) when it belongs in WHERE — filtering rows early is clearer and lets the database do less work.


The logical evaluation order, completed

Aggregation adds two stages to the order you learned in Chapter 5. The full picture:

   1. FROM / JOIN   assemble the rows
   2. WHERE         filter rows                    (before grouping)
   3. GROUP BY      partition rows into groups
   4. HAVING        filter groups                  (after aggregation)
   5. SELECT        compute the output columns / aggregates
   6. ORDER BY      sort
   7. LIMIT         take a slice

This order explains everything: why WHERE can't see aggregates (it runs at step 2, before grouping at step 3), why HAVING can (step 4, after), and why ORDER BY can use a SELECT alias (step 6, after step 5). Internalize this sequence and SQL stops surprising you.


Aggregation across joins (and the fan-out fix)

Aggregation shines when combined with joins. "Total revenue and number of orders per customer":

SELECT c.first_name, c.last_name,
       COUNT(DISTINCT o.order_id)                 AS num_orders,
       COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS revenue
FROM customers c
LEFT JOIN orders o      ON o.customer_id = c.customer_id
LEFT JOIN order_items oi ON oi.order_id   = o.order_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY revenue DESC;

Note the deliberate choices, straight from Chapter 6's lessons:

  • LEFT JOIN so customers with no orders still appear (with 0, via COALESCE).
  • COUNT(DISTINCT o.order_id) not COUNT(*) — because the join to order_items fans out (one row per line item), a plain COUNT(*) would count line items, not orders. COUNT(DISTINCT order_id) counts each order once. This is the aggregation-side answer to fan-out.
  • GROUP BY c.customer_id, ... — grouping by the primary key, then listing the other customer columns (safe because they're functionally determined by the key).

Theory → Practice. Grouping by a table's primary key and then selecting its other columns is so common that PostgreSQL specifically allows it: once you GROUP BY customer_id (the PK), you may SELECT first_name without aggregating it, because the key determines it. This is functional dependency (Chapter 19) showing up in the SQL standard.


Advanced grouping: ROLLUP, CUBE, GROUPING SETS

Sometimes you want subtotals and a grand total in one query. PostgreSQL provides grouping extensions:

-- Revenue per category, plus a grand total row (ROLLUP):
SELECT cat.name AS category,
       SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN products p     ON p.product_id  = oi.product_id
JOIN categories cat ON cat.category_id = p.category_id
GROUP BY ROLLUP (cat.name)
ORDER BY revenue NULLS LAST;

ROLLUP adds a summary row (with NULL for the rolled-up column) holding the grand total. CUBE produces subtotals for all combinations of the grouped columns; GROUPING SETS lets you specify exactly which subtotal levels you want. These are reporting power tools — you reach for them when building dashboards and pivot-style summaries.

Dialect Difference. GROUP BY ROLLUP/CUBE/GROUPING SETS is standard SQL and supported by PostgreSQL, SQL Server, and Oracle; MySQL supports WITH ROLLUP with different syntax. See Appendix J.


FILTER: conditional aggregation done cleanly

A frequent need is to aggregate only some of the rows in each group — count the delivered orders and the cancelled orders side by side, sum revenue from one channel but not another. The classic technique was a CASE inside an aggregate; PostgreSQL offers a cleaner, standard syntax: the FILTER clause.

-- Per customer: total orders, plus how many are delivered vs cancelled,
-- all in a single pass over the data:
SELECT c.last_name,
       COUNT(*)                                   AS total_orders,
       COUNT(*) FILTER (WHERE o.status = 'delivered') AS delivered,
       COUNT(*) FILTER (WHERE o.status = 'cancelled') AS cancelled,
       SUM(o.total_amount) FILTER (WHERE o.status = 'delivered') AS delivered_revenue
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.last_name
ORDER BY total_orders DESC;

Each FILTER (WHERE …) restricts that one aggregate to the rows matching its condition, while the rest of the aggregates see all the group's rows. The older equivalent — SUM(CASE WHEN status = 'delivered' THEN total_amount ELSE 0 END) — still works and you'll see it everywhere, but FILTER reads far better, especially when several conditional aggregates pile up. This pattern is also how you pivot rows into columns: one FILTERed aggregate per category turns "orders by status" from many rows into one wide row, which is exactly what a dashboard wants.

-- Pivot: one row, a column per status (rows → columns):
SELECT
    COUNT(*) FILTER (WHERE status = 'pending')   AS pending,
    COUNT(*) FILTER (WHERE status = 'paid')      AS paid,
    COUNT(*) FILTER (WHERE status = 'shipped')   AS shipped,
    COUNT(*) FILTER (WHERE status = 'delivered') AS delivered,
    COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled
FROM orders;

Dialect Difference. FILTER (WHERE …) is standard SQL and supported by PostgreSQL and SQLite; many other databases lack it, so there you fall back to the CASE-inside-aggregate form. Both compute the same thing; FILTER is simply the more legible spelling where it's available.


Aggregating text and arrays: string_agg and array_agg

Not every summary is a number. Sometimes you want to collect the values in a group into a single combined value — all the products in an order listed in one cell, all the tags on an article gathered into an array. Two aggregates do this, and they're enormously handy:

-- One row per order, with all its product names rolled into a single string:
SELECT o.order_id,
       string_agg(p.name, ', ' ORDER BY p.name) AS products,
       array_agg(p.product_id ORDER BY p.name)  AS product_ids
FROM orders o
JOIN order_items oi ON oi.order_id  = o.order_id
JOIN products p     ON p.product_id = oi.product_id
GROUP BY o.order_id
ORDER BY o.order_id;
 order_id |             products              | product_ids
----------+----------------------------------+-------------
        1 | NordicSound Headphones, UltraBook | {7,1}
        2 | BudgetBook 13, Chefer Toaster     | {4,9}

string_agg(expr, delimiter) concatenates the group's values into one delimited string; array_agg(expr) collects them into a PostgreSQL array. The optional ORDER BY inside the aggregate controls the order of the collected elements — a small but important detail, because without it the order is unspecified. These turn "one row per line item" back into "one readable row per order," which is often precisely the shape a report or an API response needs. (array_agg also pairs beautifully with the JSON functions of Chapter 16 when you need nested output.)

Common mistake. Reaching for application code to "join up" a group's values into a comma-separated list — fetching all the rows and looping in Python or JavaScript. string_agg does it in the database in one pass, faster and simpler. Whenever you catch yourself planning to loop-and-concatenate in app code, ask whether string_agg/array_agg belongs there instead.


Statistical aggregates and percentiles

For analytics work, PostgreSQL goes well beyond AVG. The average alone is famously misleading — a few huge orders drag the mean far from what a "typical" order looks like — so real analysis reaches for spread and percentiles:

SELECT
    ROUND(AVG(total_amount), 2)      AS mean,
    ROUND(stddev(total_amount), 2)   AS std_dev,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY total_amount) AS median,
    percentile_cont(0.9) WITHIN GROUP (ORDER BY total_amount) AS p90,
    mode() WITHIN GROUP (ORDER BY status)                     AS most_common_status
FROM orders;

The median (percentile_cont(0.5)) is often a better "typical value" than the mean, precisely because it ignores the pull of outliers; the 90th percentile answers "what does a high-but-not-extreme order look like?"; mode() finds the most frequent value. The slightly unusual WITHIN GROUP (ORDER BY …) syntax marks these as ordered-set aggregates — they need the data sorted to compute a position-based result. You don't need them every day, but knowing they exist means you'll never again export to a spreadsheet just to get a median. This is theme #4 in miniature: PostgreSQL's reach often removes the need for a separate analytics tool.


Grouping by expressions, not just columns

GROUP BY can group by any expression, not only a bare column — which unlocks time-series and bucketed summaries. The two workhorses are date_trunc (for time periods) and CASE (for custom buckets):

-- Revenue per month — group by a truncated date:
SELECT date_trunc('month', o.order_date) AS month,
       SUM(oi.quantity * oi.unit_price)   AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY date_trunc('month', o.order_date)
ORDER BY month;

-- Customers bucketed by spend tier — group by a CASE expression:
SELECT CASE
           WHEN SUM(o.total_amount) >= 1000 THEN 'high'
           WHEN SUM(o.total_amount) >= 200  THEN 'medium'
           ELSE 'low'
       END AS spend_tier,
       COUNT(*) AS num_customers
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id;     -- group per customer first, then the outer query buckets

Grouping by date_trunc('month', …) is the idiom for "per month / per week / per day" reporting, and you'll write it constantly. Grouping by a CASE lets you define arbitrary categories on the fly. Both are just the cardinal rule applied to expressions: whatever you group by, that's what each output row represents.

Why this matters. A surprising amount of "business intelligence" is exactly this — date_trunc for the time axis, CASE for the category axis, an aggregate for the measure. Once these three moves are fluent, you can build most of a reporting dashboard in plain SQL, no special tooling required.


The average-of-averages trap, revisited

Chapter 6 warned about grain; aggregation is where grain mistakes turn into wrong numbers that look plausible, so it deserves a sharper treatment. Consider "the average order value." It is tempting to average a column that's already an average, or to average at the wrong grain — and the result is subtly, dangerously wrong.

Suppose you want average revenue per order. If you join orders to line items and then AVG(oi.quantity * oi.unit_price), you've computed the average line item value, not the average order value — a completely different number. The correct approach aggregates to order grain first, then averages those order totals:

-- RIGHT: total each order, THEN average the order totals (each order weighted once):
SELECT ROUND(AVG(order_total), 2) AS avg_order_value
FROM (
    SELECT o.order_id, SUM(oi.quantity * oi.unit_price) AS order_total
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY o.order_id
) per_order;

The general principle: an average of averages is not the average. A category with 100 orders and a category with 2 orders should not contribute equally to an overall average if you want a per-order figure — but they will if you naively average their per-category averages. Whenever you compute a rate, ratio, or average, pause and ask: at what grain, and weighted how? The query that answers "average per order" and the one that answers "average of category averages" look almost identical and return different numbers; only clear thinking about grain tells them apart. This is the analytical heart of theme #3 — understanding why a number is what it is.


More patterns, to build the reflex of mapping a business question to a GROUP BY shape:

-- "Top 5 products by revenue."
SELECT p.name, SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi JOIN products p ON p.product_id = oi.product_id
GROUP BY p.product_id, p.name
ORDER BY revenue DESC
LIMIT 5;

-- "Average rating and review count per product, for products with >= 3 reviews."
SELECT p.name, ROUND(AVG(r.rating), 2) AS avg_rating, COUNT(*) AS reviews
FROM products p JOIN reviews r ON r.product_id = p.product_id
GROUP BY p.product_id, p.name
HAVING COUNT(*) >= 3
ORDER BY avg_rating DESC;

-- "Number of distinct customers who ordered in each month."
SELECT date_trunc('month', order_date) AS month,
       COUNT(DISTINCT customer_id)       AS active_customers
FROM orders
GROUP BY date_trunc('month', order_date)
ORDER BY month;

-- "Each category's share of total revenue" (a subquery for the grand total).
SELECT cat.name,
       SUM(oi.quantity * oi.unit_price) AS revenue,
       ROUND(100.0 * SUM(oi.quantity * oi.unit_price)
             / (SELECT SUM(quantity * unit_price) FROM order_items), 1) AS pct_of_total
FROM order_items oi
JOIN products p     ON p.product_id  = oi.product_id
JOIN categories cat ON cat.category_id = p.category_id
GROUP BY cat.name
ORDER BY revenue DESC;

Study the last one: it mixes a GROUP BY aggregate with a scalar subquery for the overall total, and uses 100.0 * (not 100 *) to force decimal division — every lesson of this chapter and the last, working together. That combination — group to get the parts, a subquery for the whole, careful arithmetic for the ratio — is the skeleton of countless "share of total" and "percent of" reports.

Try this. Take the "top 5 products by revenue" query and turn it into "top 5 products by revenue within each category." That's a per-group top-N — a hint that you've outgrown plain GROUP BY and want window functions (Chapter 12). Feeling that limit firsthand is the best motivation for the chapter that lifts it.


A worked scenario: building a monthly KPI report

Let's tie the whole chapter together by building something a real business actually asks for: a monthly key-performance-indicator report for Mercado. Stakeholders want, per month: the number of orders, the number of distinct customers who ordered, total revenue, average order value, and the percentage of orders that were cancelled. Watch how each requirement maps to a tool you now own, and how they compose into one query.

We start from the grain. "Per month" means the output has one row per month, so date_trunc('month', order_date) is the grouping key — and we'll want a date spine later if any month could be empty, but let's assume continuous activity for now. "Number of orders" is COUNT(*) at order grain, so we must be careful not to join to order_items (which would fan out to line-item grain and inflate the count) unless we de-duplicate. "Distinct customers" is COUNT(DISTINCT customer_id) — the word distinct is the signpost. "Total revenue" needs line-item data (quantity * unit_price), which lives at a finer grain than orders, so we have a grain conflict to resolve. "Average order value" is total revenue divided by order count — but computed carefully, not as a naive average of a fanned-out column. And "percentage cancelled" is a FILTERed count over the total, with attention to integer division.

The grain conflict between order-level counts and line-item-level revenue is the crux, and it's exactly the kind of thing that produces plausible-but-wrong reports. The clean solution is to compute revenue at its own grain in a derived table, then join the order-level and revenue figures together — or, more simply here, to recognize that orders.total_amount already stores each order's total at order grain, letting us avoid the line-item join entirely for this report:

SELECT
    date_trunc('month', order_date)                     AS month,
    COUNT(*)                                            AS orders,
    COUNT(DISTINCT customer_id)                         AS distinct_customers,
    SUM(total_amount)                                   AS revenue,
    ROUND(AVG(total_amount), 2)                         AS avg_order_value,
    ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'cancelled')
          / COUNT(*), 1)                                AS pct_cancelled
FROM orders
GROUP BY date_trunc('month', order_date)
ORDER BY month;

Every clause is a lesson from this chapter. The GROUP BY date_trunc sets the monthly grain. COUNT(*) and COUNT(DISTINCT customer_id) are two of the three COUNT forms, chosen deliberately. SUM and AVG summarize the money. The FILTER clause counts only cancelled orders without a second query. And 100.0 * forces decimal division so the percentage isn't truncated to zero — the integer-division trap defused. Notice what we didn't do: we didn't join to order_items, because doing so would have multiplied each order's row by its line-item count and quietly broken every order-grain figure in the report. Choosing to read total_amount at order grain instead is the single most important decision in the query, and it's a grain decision, not a syntax one.

This is what real aggregation work looks like: a business question decomposed into measures, each measure mapped to the right aggregate, and — above all — relentless attention to grain so the numbers mean what the stakeholders think they mean. The syntax was the easy part. The judgment about grain, about which COUNT form, about where division could go wrong, is the part that separates a report you can trust from one that merely runs. When you can build a query like this and defend every column — why this aggregate, at this grain, with this guard — you've genuinely mastered aggregation.

Try this. Extend the report with a "revenue per customer" column. Think carefully: is that revenue / distinct_customers? Watch the integer division, and watch what "per customer" means when one customer placed several orders. Then add a date spine with generate_series (Chapter 8) so months with zero orders still appear as a row of zeros rather than vanishing. Both extensions are small, and both hide exactly the kind of subtlety this chapter has trained you to catch.


A performance note (preview of Chapters 23–24)

Aggregation requires the database to scan and group rows, which can be expensive on large tables. Two things help: an index on the grouped/filtered columns can let the database avoid a full scan or produce groups in order; and filtering early with WHERE (before grouping) shrinks the work. For pre-computed summaries that are queried often, materialized views (Chapter 15) cache the aggregated result. You'll quantify all this with EXPLAIN in Chapter 24. For now: filter before you group, and know that GROUP BY on millions of rows benefits from the right index.


Sorting, presenting, and reading grouped results

Once you've computed a summary, you usually need to present it, and a few habits make grouped output genuinely useful rather than merely correct. The most common is sorting by the aggregate itself: a "revenue per category" report is far more useful ordered by revenue descending, so the biggest categories surface first. Because ORDER BY runs after SELECT (step 6 of the evaluation order), you can sort by an aggregate or its alias directly — ORDER BY revenue DESC or ORDER BY SUM(...) DESC — which is exactly what turns a raw summary into a ranked, scannable answer.

Beyond sorting, presentation often calls for limiting and labeling. A "top 5 categories" report is a GROUP BY followed by ORDER BY aggregate DESC LIMIT 5 — the aggregation defines the ranking, and the limit takes the head of it. Be deliberate about ties: if the fifth and sixth categories have identical revenue, a plain LIMIT 5 picks one arbitrarily, and if that matters you'll want a window function's ranking (Chapter 12) instead. For human-facing reports, wrap money in formatting (to_char(SUM(...), 'FM999,990.00') from Chapter 8), round averages to a sensible precision, and give every column a clear alias — the difference between a query that's technically right and a report someone can actually read at a glance.

Reading grouped output back is a skill too. Always confirm the number of groups matches your expectation: if "orders per status" returns eight rows but you know there are five statuses, you've probably grouped by something extra (a stray column in the GROUP BY) or there's dirty data with unexpected status values. The grouping columns identify each row, so scanning them is a quick sanity check on whether your GROUP BY did what you intended. And when an aggregate looks surprising — a suspiciously round number, a zero where you expected data, a total that's an order of magnitude off — return to grain: nine times out of ten the surprise is a fan-out from a join or a NULL that an aggregate quietly skipped. Reading results critically, rather than trusting the first number that appears, is the habit that catches errors before they reach a stakeholder.

Common mistake. Presenting a GROUP BY result without an ORDER BY and assuming the groups come back in a meaningful order. As with any query, grouped results have no inherent order unless you ask — and "no order" on a summary is especially jarring to readers who expect the biggest or most recent at the top. A summary almost always wants an explicit ORDER BY, usually on the aggregate that the report is about.


The mental model: split, apply, combine

Underneath all the syntax, every grouped aggregation follows the same three-beat rhythm, and holding that rhythm in mind makes even complex summaries easy to reason about. Practitioners borrow a phrase from data science for it: split, apply, combine.

First, split: the rows are partitioned into groups, one group per distinct value (or combination of values) of the GROUP BY columns. If you group orders by customer, you get one pile of rows per customer. Second, apply: each aggregate function runs independently within each pile — COUNT(*) counts the rows in that pile, SUM(total) adds them up, AVG(rating) averages them. The function never sees across piles; it only ever works inside one group at a time. Third, combine: the per-group results are assembled into the final result set, one output row per group, with the grouping columns identifying each row and the aggregate values describing it.

This model explains the cardinal rule effortlessly. Why must every non-aggregated SELECT column appear in the GROUP BY? Because after the split, a group contains many rows — so there is no single first_name to show for a group of customers unless first_name is itself what defined the group. The grouping columns are the only per-row values that are guaranteed identical across the whole group; everything else must be collapsed by an aggregate into one value. The error message "must appear in the GROUP BY clause or be used in an aggregate function" is just the database insisting you tell it how to collapse a column that isn't constant within the group.

The model also clarifies the difference between WHERE and HAVING once and for all. WHERE acts before the split — it decides which rows enter the piles in the first place. HAVING acts after the apply — it decides which finished group-results survive, based on their aggregate values. You can't filter on COUNT(*) in WHERE because, before the split, there are no groups and therefore no counts; the count is a property of a finished pile, so the filter on it has to wait until HAVING. Picture the three beats and you'll never again confuse where a condition belongs: row conditions go before the split (WHERE), group conditions go after the apply (HAVING).

Once split-apply-combine is automatic, you can read any aggregation query — however many joins and conditions it carries — by asking three questions in order: What defines a group here? What is being computed within each group? Which groups make it to the output? Answer those, and the query holds no mysteries. This is theme #3 in its most useful form: a single mental model that turns a family of syntax rules into one coherent idea.


Translating business questions into GROUP BY

The hardest part of aggregation in practice is rarely the syntax — it's hearing a question posed in plain business language and recognizing the grouping shape hiding inside it. With a little practice, certain phrases become reliable signposts, and learning to hear them is what turns a stakeholder's request into a query in seconds rather than minutes.

The phrase "per" or "for each" almost always names the grouping column. "Revenue per category" groups by category; "orders for each customer" groups by customer; "signups per month" groups by date_trunc('month', …). Whenever you hear "per X," X is your GROUP BY. The phrase "total," "number of," "average," "highest," "lowest" names the aggregate: total → SUM, number of → COUNT, average → AVG, highest/lowest → MAX/MIN. So "the average order value per customer" parses almost mechanically into "AVG of order value, GROUP BY customer."

The word "only" attached to a summary condition signals HAVING. "Show me categories — but only those averaging over \$100" is a HAVING AVG(price) > 100. Contrast "only active products," which is a row condition (WHERE is_active), not a group condition — the tell is whether the "only" applies to individual rows (before grouping) or to the computed summaries (after). The word "distinct" or "unique" signals COUNT(DISTINCT …): "how many different customers ordered this month" is COUNT(DISTINCT customer_id), a different and usually more meaningful number than COUNT(*).

Finally, "top N" or "the most" signals an ORDER BY … DESC LIMIT N wrapped around the aggregation — and if it's "top N within each group," that's your cue that plain GROUP BY has run out of road and a window function (Chapter 12) is coming. Training your ear to these signals is one of the highest-leverage skills in all of data work, because the people asking the questions almost never speak SQL — they speak business, and your job is the translation. The grammar above is most of the dictionary.

Why this matters. An analyst who can turn "which regions are underperforming relative to their size?" into the right GROUP BY, HAVING, and ratio on the spot is worth their weight in gold, precisely because that translation is where most people stall. The syntax in this chapter is finite and learnable in an afternoon; the translation skill is what you'll keep sharpening for years, and it starts with hearing "per," "only," and "distinct" as the signposts they are.


Aggregates and NULL, in depth

Chapter 3 told you aggregates skip NULLs; now that you're aggregating for real, the consequences deserve a closer look, because they cause quiet, plausible-looking errors that survive code review and reach reports. The rule is simple to state — every aggregate except COUNT(*) ignores NULL inputs — but its effects ripple.

Consider AVG. If a discount column is NULL for orders that had no discount, AVG(discount) averages only the discounted orders, because the NULLs are skipped entirely. If your intent was "the average discount across all orders, treating no-discount as zero," that's a different and larger denominator, and you must write AVG(COALESCE(discount, 0)) to get it. Both are legitimate questions — "average discount among discounted orders" versus "average discount across all orders" — but they are different questions, and a stray NULL silently decides which one you asked. The only defense is to pause at every aggregate over a nullable column and consciously decide whether "missing" should mean "skip" or "count as zero."

COUNT is where the confusion peaks, which is why it has three forms. COUNT(*) counts rows unconditionally — NULLs included, because it counts rows, not values. COUNT(column) counts only the rows where that column is non-NULL. The gap between them is precisely the number of NULLs in that column, which is occasionally useful as a data-quality probe (COUNT(*) - COUNT(phone) is "how many customers have no phone"). But it bites when someone writes COUNT(some_nullable_column) meaning "how many rows" and quietly undercounts every time that column is null. The habit to build: use COUNT(*) when you mean rows, and reserve COUNT(column) for the rare case when you specifically mean "rows where this value is present."

SUM over an all-NULL set (or an empty set) returns NULL, not 0 — which is why you so often wrap it in COALESCE(SUM(…), 0), especially after a LEFT JOIN where some groups have no matching rows at all. A revenue report that shows blank instead of $0.00 for a customer with no orders is this exact issue, and the COALESCE is the fix. MIN and MAX likewise return NULL over empty or all-NULL groups. None of this is arbitrary — it all follows from the single principle that NULL means "unknown," and you cannot meaningfully add, average, or rank an unknown — but the practical upshot is that every aggregate over nullable data hides a decision, and your job is to make that decision on purpose rather than inherit it by accident.

Common mistake. Trusting an AVG over a column riddled with NULLs without asking what the NULLs mean. "Average satisfaction score" over a table where most rows never recorded a score is averaging a self-selected minority — and that number will mislead everyone who sees it. Before you publish an average, know your denominator. The NULL-skipping behavior is convenient when you want it and treacherous when you don't notice it.


Progressive project: summarize your data

Answer summary questions about your domain:

  1. A simple count or total over the whole table ("how many members?", "total fines outstanding").
  2. A GROUP BY summary ("loans per member category", "appointments per provider", "enrollments per course").
  3. A GROUP BY + HAVING ("courses with more than 30 enrollments", "members with more than 5 overdue items").
  4. A summary across a join, using LEFT JOIN + COALESCE so empty groups show as 0, and COUNT(DISTINCT ...) if a join fans out.

Add these to project-notes.md — they're the metrics your application's reports will show.


Aggregation and the road ahead

Aggregation is the point where SQL stops merely retrieving data and starts answering questions — and that shift deserves a moment's reflection before you move on. Up to now, queries returned records: here are the matching products, here are the orders, here is the joined detail. Useful, but still raw. Aggregation transforms those records into the numbers people actually act on: how many, how much, the average, the trend, the share of total. Almost every dashboard, every report, every "how are we doing?" question a business asks resolves, underneath, to a GROUP BY with the right aggregate at the right grain. You have just learned the grammar of business intelligence.

What you've also learned, threaded through this chapter, is that the syntax is the easy half. The hard, valuable half is judgment: choosing the right grain so the numbers mean what people think; picking the right COUNT form; deciding whether a NULL should be skipped or counted as zero; forcing decimal division so a percentage isn't silently truncated; recognizing when "per group" has outgrown plain GROUP BY. These judgments are what separate a report you can stake a decision on from one that merely runs and quietly misleads. Every one of them traces back to theme #3 — understand why a number is what it is — and to the split-apply-combine model that makes the whole topic coherent.

The road immediately ahead builds directly on this foundation. Chapter 8's functions let you clean and bucket the values you aggregate. Chapter 11's CTEs make multi-stage aggregations readable. And Chapter 12's window functions lift the one real limitation you've bumped against here — the inability to show each row and its group's summary together, or to rank within groups — which is why "top N per category" kept appearing as a teaser. When you reach window functions and find you can compute a running total, a rank, or a per-group average without collapsing your rows, you'll appreciate exactly what GROUP BY does and doesn't do, because you'll have felt its edges here. For now, you can answer the great majority of analytical questions any organization will pose — which is no small thing to carry out of a single chapter.


Summary

Aggregation condenses rows into summaries. Aggregate functions (COUNT, SUM, AVG, MIN, MAX) collapse rows into one value and skip NULLs (mind COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)). GROUP BY produces one summary row per group; every non-aggregated SELECT column must be grouped (or be the PK). WHERE filters rows before grouping; HAVING filters groups after, using aggregates. The full evaluation order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. Combined with joins, aggregation answers real business questions — use LEFT JOIN + COALESCE to include empty groups and COUNT(DISTINCT ...) to defeat fan-out. ROLLUP/CUBE/GROUPING SETS add subtotals and grand totals.

You can now: - Use COUNT/SUM/AVG/MIN/MAX and the three forms of COUNT. - Write GROUP BY summaries (single and multi-column) and obey the cardinal rule. - Filter groups with HAVING and distinguish it from WHERE. - State the full logical evaluation order and why it governs WHERE/HAVING/alias visibility. - Aggregate across joins correctly (LEFT JOIN + COALESCE, COUNT(DISTINCT ...)). - Produce subtotals with ROLLUP/CUBE/GROUPING SETS.

What's next. Chapter 8 — Built-in Functions — the string, numeric, date/time, and type-conversion functions you'll use in nearly every query to clean, format, and compute on the values you're selecting, filtering, grouping, and aggregating.


Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.