28 min read

> Where you are: Part II, Chapter 10 of 40. Joins combine tables side by side (adding columns). Set operations combine query results stacked (adding rows) — the SQL realization of Chapter 4's set algebra (∪, ∩, −).

Chapter 10: Set Operations — UNION, INTERSECT, EXCEPT

Where you are: Part II, Chapter 10 of 40. Joins combine tables side by side (adding columns). Set operations combine query results stacked (adding rows) — the SQL realization of Chapter 4's set algebra (∪, ∩, −).

Learning paths: 💻 📊 🔬 🏗️ — everyone. These are the cleanest tools for "this list plus that list," "in both," and "in one but not the other."


Stacking results, not widening them

A join answers "for each row in A, what matches in B?" — it widens rows with more columns. A set operation answers a different shape of question: "combine the rows of these two result sets." You write two (or more) full SELECT statements and connect them with a set operator:

SELECT ... FROM ...
UNION            -- or INTERSECT, or EXCEPT
SELECT ... FROM ...;

The three operators come straight from Chapter 4's relational algebra:

  • UNION — rows in the first query or the second (∪).
  • INTERSECT — rows in both (∩).
  • EXCEPT — rows in the first but not the second (−).

Union compatibility

Because you're stacking rows, the two queries must be union-compatible:

  1. Same number of columns, in the same order.
  2. Compatible types column-by-column (PostgreSQL will coerce where it can — e.g., integer and numeric — but text and date won't mix).

The names of the output columns come from the first query; the second query's column names are ignored. A simple example — a unified contact list of customers and suppliers:

SELECT first_name || ' ' || last_name AS name, email
FROM customers
UNION
SELECT name, contact_email
FROM suppliers
ORDER BY name;

Both queries produce two columns (a name and an email) of compatible types, so they stack cleanly.

Common mistake. Mismatched column counts or types: each UNION query must have the same number of columns, or a type error. Line the SELECT lists up column-for-column — same count, same order, compatible types.


UNION vs UNION ALL: the crucial distinction

UNION removes duplicate rows (true set behavior, ∪). UNION ALL keeps all rows, including duplicates (bag behavior). This is one of the most important practical distinctions in SQL:

SELECT category_id FROM products WHERE price > 1000
UNION
SELECT category_id FROM products WHERE is_active = false;   -- deduped

SELECT category_id FROM products WHERE price > 1000
UNION ALL
SELECT category_id FROM products WHERE is_active = false;   -- duplicates kept

Two reasons the distinction matters:

  • Correctness: sometimes you want the duplicates (e.g., concatenating two logs to count total events); sometimes you don't (a deduplicated mailing list). Choose deliberately.
  • Performance: UNION must do extra work to find and remove duplicates (a sort or hash). UNION ALL just concatenates. If you know there are no duplicates, or you want to keep them, use UNION ALL — it's faster.

Common mistake. Reaching for UNION by default. On large result sets, the dedup step is a real cost. Ask: "could there be duplicates, and do I care?" If no duplicates are possible (e.g., the two queries cover disjoint sets) or duplicates are wanted, UNION ALL is the right, faster choice.


INTERSECT: rows in both

INTERSECT returns rows present in both result sets. "Customers who are also reviewers" — appear in both the customers-who-ordered set and the customers-who-reviewed set:

SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM reviews;

This returns customer ids that appear in both orders and reviews — i.e., customers who have both ordered and reviewed. Like UNION, INTERSECT removes duplicates by default (INTERSECT ALL keeps them).


EXCEPT: rows in the first but not the second

EXCEPT is set difference (−) — rows in the first query that are not in the second. This is the set-algebra form of the anti-join you met in Chapters 4, 6, and 9. "Customers who ordered but never reviewed":

SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM reviews;

And the classic "products never ordered":

SELECT product_id FROM products
EXCEPT
SELECT product_id FROM order_items;

EXCEPT removes duplicates (EXCEPT ALL keeps multiplicity). It's often the most readable way to express a difference — though NOT EXISTS (Chapter 9) is the more flexible tool when you need to keep other columns or add conditions. (Recall Chapter 4's Case Study 2: EXCEPT, NOT EXISTS, and LEFT JOIN … IS NULL are three spellings of the same difference.)

Note

EXCEPT compares entire rows, so it naturally handles multi-column differences too — "(customer, product) pairs that were ordered but never reviewed" works by selecting both columns in each side.


ORDER BY and set operations

An ORDER BY applies to the whole combined result, so it goes at the very end, after the last query — not inside the individual SELECTs:

SELECT name, price FROM products WHERE category_id = 3
UNION ALL
SELECT name, price FROM products WHERE category_id = 5
ORDER BY price DESC;        -- sorts the combined result; uses first query's column names

If you need to order within each part before combining (rare), you wrap each in a subquery with its own ORDER BY and LIMIT. But the common case is one ORDER BY at the end.


Combining more than two queries

You can chain set operators. Be aware of precedence: INTERSECT binds tighter than UNION/EXCEPT, so use parentheses when mixing them to make intent explicit:

-- Explicit grouping with parentheses:
(SELECT customer_id FROM orders WHERE status = 'delivered'
 INTERSECT
 SELECT customer_id FROM reviews)
EXCEPT
SELECT customer_id FROM customers WHERE loyalty_tier = 'standard';

As with AND/OR (Chapter 5), don't rely on precedence memory — parenthesize.


Set operations vs. joins and subqueries

When is a set operation the right tool?

  • Use UNION/UNION ALL to stack rows from different sources into one list — combining two queries' results, building a unified report from heterogeneous tables, or concatenating periods.
  • Use INTERSECT/EXCEPT for clean "in both" / "in one not the other" membership questions, especially over a single column of ids. They read like the English question.
  • Use a JOIN when you need to widen rows (columns from multiple tables together).
  • Use EXISTS/NOT EXISTS (Chapter 9) when the difference/intersection needs extra conditions or you want to keep other columns — it's more flexible than EXCEPT/INTERSECT, which compare whole rows and return only the selected columns.

Often several work; pick the one that most directly says what you mean.


Thinking in sets: the mental model

Set operations reward a particular way of seeing a problem — not as "rows to filter" but as "sets to combine." When a question naturally splits into two populations that you then want to add together, overlap, or subtract, set operations express it more directly than any join or subquery. Training yourself to recognize that shape is the real skill of this chapter.

The signals are in the language of the question. "Customers who are either gold-tier or have spent over \$1000" is two sets joined by or — a UNION. "Products that are both on sale and low in stock" is two sets joined by and — an INTERSECT. "Active customers who have never placed an order" is one set minus another — an EXCEPT. The moment you hear "either/or," "both/and," or "minus/except/never" applied to two populations, your mind should reach for the set algebra of Chapter 4 made concrete. The Venn diagram is the mental picture: UNION is the whole of both circles, INTERSECT is the lens where they overlap, EXCEPT is the part of the first circle outside the second.

This set-thinking is genuinely different from join-thinking, and the difference is worth feeling. A join asks a relational question — "for each row here, what connects to it there?" — and produces wider rows. A set operation asks a membership question — "which items are in this collection, that collection, both, or neither?" — and produces a combined list of the same shape. Many questions can be twisted into either form, but each has a natural home: when you care about combining or comparing populations of the same kind of thing, set operations say it cleanly; when you care about relating different kinds of things, joins do. Recognizing which kind of question you're facing is half of writing the query well.

The practical payoff is that set-shaped questions, written as set operations, are almost self-documenting. SELECT customer_id FROM orders EXCEPT SELECT customer_id FROM reviews reads, almost in English, as "customers who ordered but didn't review." The same question as a LEFT JOIN ... WHERE r.customer_id IS NULL is correct but reads as machinery; the EXCEPT reads as intent. When a question is genuinely about set membership, honoring that with a set operation makes the query clearer for everyone who reads it later — which is the recurring quiet theme of all of Part II.


INTERSECT, EXCEPT, and whole-row comparison

A subtlety that elevates INTERSECT and EXCEPT from "single-column id tricks" to genuinely powerful tools: they compare entire rows, across all selected columns, not just one. This makes them excellent for comparing multi-column data — and especially for the everyday task of reconciliation, finding how two datasets differ.

Because comparison is whole-row, you can ask multi-column membership questions directly. "Which (customer, product) pairs were ordered but never reviewed?" selects both columns on each side, and EXCEPT matches pairs as units:

SELECT o.customer_id, oi.product_id
FROM orders o JOIN order_items oi ON oi.order_id = o.order_id
EXCEPT
SELECT customer_id, product_id FROM reviews;

A row appears in the result only if its exact combination of customer and product is absent from reviews. Trying to express "this pair is not in that set" with NOT IN over two columns is awkward and NULL-fragile; EXCEPT handles it naturally because it was built on whole-row set difference. The same applies to INTERSECT — "which (customer, product) pairs were both ordered and reviewed?" is a two-column INTERSECT.

The ALL variants deserve a second look here too, because whole-row comparison plus multiplicity enables precise reconciliation. EXCEPT ALL keeps multiplicity: if a row appears three times on the left and once on the right, EXCEPT ALL returns it twice (the unmatched copies), whereas plain EXCEPT returns it zero times (it exists on both sides, so the difference is empty). This matters when you're comparing two snapshots of the same data and need to know not just whether a row differs but how many copies differ — the bread and butter of data-migration verification, where you compare a table before and after a transformation and expect the difference to be empty.

Common mistake. Selecting different columns on the two sides of an INTERSECT/EXCEPT and expecting it to match on just the "important" one. These operators compare every selected column. If you select (customer_id, order_date) on one side and (customer_id, status) on the other, you're comparing two-column rows that will essentially never match. Select exactly the columns whose combination defines "the same row" for your question — no more, no less.


Data-modifying source: building reports from heterogeneous tables

One of the most practical uses of UNION ALL is assembling a single unified view from different tables that represent similar things — a pattern that appears constantly in real reporting. Mercado records financial events in more than one place; a unified "account activity" feed needs to stack them into one chronological list, tagging each with its source.

SELECT order_date AS event_time, 'order'   AS type, total_amount AS amount
FROM orders
UNION ALL
SELECT paid_at,                'payment' AS type, amount
FROM payments
UNION ALL
SELECT shipped_at,             'shipment' AS type, NULL          AS amount
FROM shipments
ORDER BY event_time DESC;

Each SELECT shapes a different table into the same three-column form — a timestamp, a type label, and an amount — and UNION ALL stacks them into one feed. The literal 'order', 'payment', 'shipment' columns are how you tag each row with where it came from, since once they're stacked you can no longer tell by the table. The NULL AS amount for shipments keeps the columns union-compatible even though shipments have no monetary amount. This "normalize disparate sources into a common shape, tag, and stack" pattern is the backbone of activity feeds, audit logs, unified search results, and countless dashboards.

Note the deliberate UNION ALL, not UNION: these events are genuinely distinct (an order and a payment that happen to share a timestamp and amount are still two real events), so deduplication would be wrong here — it would silently merge coincidentally-identical events. This is the correctness dimension of the UNION/UNION ALL choice in action: you reach for UNION ALL not merely because it's faster, but because for an event feed, every event must be preserved. Choosing the wrong one wouldn't error; it would just quietly lose data, which is the most dangerous kind of bug.


A worked scenario: reconciling two datasets

Let's put the comparison tools to work on a realistic task. Suppose Mercado runs a nightly job that's supposed to keep a customer_search_index table in sync with the customers table, and you suspect it's drifting. Set operations are the perfect instrument for finding exactly how two datasets differ — what's missing, what's extra, what's changed.

"Customers in the main table but missing from the index" (the job failed to add them) is a straightforward EXCEPT on the identifying key:

SELECT customer_id FROM customers
EXCEPT
SELECT customer_id FROM customer_search_index;

"Stale entries in the index for customers who no longer exist" (the job failed to remove them) is the same EXCEPT reversed. And "customers whose indexed data no longer matches the source" — the subtlest drift — is a whole-row EXCEPT over the columns that should agree:

-- Rows whose (id, name, email) in the source don't match the index exactly:
SELECT customer_id, first_name, last_name, email FROM customers
EXCEPT
SELECT customer_id, first_name, last_name, email FROM customer_search_index;

Run all three and you have a complete diagnosis: what's missing, what's orphaned, what's out of date. If the sync job were perfect, all three would return zero rows — and "the difference is empty" is exactly the assertion you want from a reconciliation check. This is how data engineers verify migrations, audit ETL pipelines, and catch synchronization bugs: express "these two datasets should be identical" as a pair of EXCEPTs and confirm both come back empty. It's a technique you'll use far beyond this chapter, and it's nothing more than set difference applied with intent. The same logic underpins the integration tests that guard data pipelines in production (Chapter 31).


More patterns, to build the reflex of spotting set-shaped questions:

-- "All email addresses we have, from customers OR suppliers, deduplicated."
SELECT email FROM customers
UNION
SELECT contact_email FROM suppliers;

-- "Customers who have ordered in BOTH 2023 and 2024."
SELECT customer_id FROM orders WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
INTERSECT
SELECT customer_id FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

-- "Products in the catalog that have NEVER been ordered."
SELECT product_id FROM products
EXCEPT
SELECT product_id FROM order_items;

-- "Customers who reviewed but never ordered" (a possible data-quality red flag).
SELECT customer_id FROM reviews
EXCEPT
SELECT customer_id FROM orders;

Look at how each maps to a phrase: "from X or Y, deduplicated" → UNION; "in both" → INTERSECT; "never / but not" → EXCEPT. The second query is especially instructive — "ordered in both periods" is naturally two date-filtered sets intersected, which is far clearer than the self-join or double-EXISTS that would otherwise be required. And the last one shows set operations doubling as data-quality probes: a customer who reviewed without ordering might indicate a bug or fraud, and EXCEPT surfaces them in one line. The recurring lesson is that when a question is genuinely about combining or comparing populations, the set operators say it most directly — and reading them back is nearly reading the question aloud.

Try this. Take "customers who ordered in both 2023 and 2024" and change it to "customers who ordered in 2023 but not 2024" (a churn signal). It's a one-operator change — INTERSECT becomes EXCEPT — and that single word swap flips the question from loyalty to attrition. Feeling how the operator is the meaning is the point.


A performance note

UNION, INTERSECT, and EXCEPT all deduplicate, which costs a sort or hash over the combined data. The ALL variants skip that. On large results the difference is significant — so default to UNION ALL unless you specifically need deduplication. Indexes on the compared columns can also speed INTERSECT/EXCEPT. You'll measure all this with EXPLAIN in Chapter 24.


A practical recipe: deduplication with UNION

One small but constantly useful trick falls out of UNION's deduplicating nature: UNION of a query with nothing, or of overlapping queries, is a clean way to deduplicate rows that arrive from multiple overlapping sources. Suppose you have two mailing lists that partly overlap — newsletter subscribers and past purchasers — and you want each person exactly once. UNION does the deduplication for you, automatically, across the combined set:

SELECT email FROM newsletter_subscribers
UNION                              -- dedup happens here, across both lists
SELECT email FROM customers WHERE loyalty_tier IS NOT NULL;

Anyone appearing in both lists appears once in the result. This is cleaner than concatenating with UNION ALL and then wrapping the whole thing in a SELECT DISTINCT, though the two are equivalent in effect — UNION is essentially UNION ALL followed by a DISTINCT, fused into one operation the optimizer can handle efficiently. Understanding that equivalence helps you predict behavior and cost: UNION does exactly the work of UNION ALL plus a deduplication pass, which is why UNION ALL is faster whenever you don't need the dedup.

The recipe generalizes to any "combine these overlapping sources, one row each" need. Just be precise about what counts as a duplicate: UNION deduplicates on the entire selected row, so if you select (email, signup_source) rather than just email, the same email from two different sources counts as two distinct rows and both survive. Select exactly the columns whose combination defines "the same person" (or the same whatever) — usually fewer columns than you might first reach for. This is the same whole-row-comparison principle that governs INTERSECT and EXCEPT, applied to UNION's deduplication. Get the column list right and UNION becomes a precise, declarative deduplicator across any number of overlapping sources.


A decision recap: which set operation, and when

To consolidate, here's the reasoning you'll run whenever a question smells set-shaped. The first fork is are you combining or comparing? Combining — stacking rows from sources into one list — is UNION (deduplicated) or UNION ALL (everything, faster). Comparing — asking about membership across two sets — is INTERSECT (in both) or EXCEPT (in the first not the second). That single distinction routes most decisions.

Within combining, the second fork is do duplicates matter? If the sources are disjoint, or duplicates are wanted (an event feed, a concatenation of periods), use UNION ALL — it's correct and faster. If duplicates are possible and unwanted (a deduplicated contact list), use UNION and let it do the dedup. Defaulting to UNION ALL and reaching for UNION only when you specifically need deduplication is the habit that avoids both the silent-data-loss bug and the needless performance cost.

Within comparing, the second fork is do you need other columns? INTERSECT and EXCEPT return only the compared columns, so if you need the full row, either use the "identify with the set operation, then join back to enrich" pattern, or switch to EXISTS/NOT EXISTS, which keep the outer table's columns and accept additional conditions. For a clean single-column membership question, INTERSECT/EXCEPT read beautifully; for anything needing extra columns or conditions, the existence predicates are more flexible.

Run those two forks — combine-vs-compare, then the follow-up — and you'll pick the right operator essentially every time. And keep the escape hatch in mind: if the question turns out to need columns from multiple related tables side by side, it was a join question wearing set-operation clothing, and you should switch tools. The operators in this chapter are a small, sharp set; their value comes entirely from recognizing the questions they fit and not forcing them onto questions they don't.


How NULL behaves in set operations (a surprising twist)

Here is a subtlety that catches even experienced SQL writers, and it's the opposite of what Chapters 3 and 5 taught you about NULL. In a WHERE clause, NULL = NULL is unknown, so two NULLs never "match." But in set operations, two NULLs are treated as equal for the purpose of deduplication and comparison. This is deliberate — set membership needs a definite notion of "the same row," and treating all NULLs as one value is what makes UNION, INTERSECT, and EXCEPT behave sensibly — but it means NULL plays by different rules here than it does in filtering.

The practical consequences are worth seeing concretely. A UNION of two queries that each produce a NULL in some column will collapse those NULL-bearing rows together if they're otherwise identical — the NULLs are considered the same, so the duplicate is removed. An INTERSECT will report a NULL-containing row as "in both" if both sides have a matching NULL. And an EXCEPT will remove a NULL-bearing row from the result if the right side has a matching one. None of this would happen with =-based comparison in a WHERE or a join ON clause, where NULLs stubbornly refuse to match anything.

-- These two NULLs are treated as EQUAL by UNION, so the result has ONE such row:
SELECT NULL::int AS x
UNION
SELECT NULL::int;
-- → a single row containing NULL (the duplicate NULL was removed)

-- Contrast a join's ON clause, where NULL = NULL is unknown and would NOT match.

This difference is rooted in a precise distinction in the SQL standard: set operations compare rows using "not distinct" semantics (the same idea as IS NOT DISTINCT FROM from Chapter 3), which treats two NULLs as equal, whereas WHERE and join conditions use ordinary =, which treats them as unknown. You don't need to memorize the terminology, but you do need to remember the headline: NULL matches NULL in set operations, but not in WHERE. When a UNION deduplicates rows you expected to remain separate, or an EXCEPT removes a row you expected to keep, suspect NULL-equality semantics — it's the one place where NULL stops being slippery and starts being definite.

Common mistake. Assuming an EXCEPT will keep a row with a NULL because "NULL never equals anything." It will not — if the right side has a matching row with a NULL in the same position, EXCEPT treats them as equal and removes it. This trips people precisely because it contradicts the WHERE-clause NULL behavior they've internalized. Two different contexts, two different rules.


Combining set operations with CTEs, joins, and ordering

Set operations rarely live alone; they combine with everything else you've learned, and seeing the combinations removes the last bit of mystery. The most common pairing is with CTEs (Chapter 11), which let you name each side of a set operation for readability — invaluable when the individual queries are themselves complex:

WITH big_spenders AS (
    SELECT customer_id FROM orders
    GROUP BY customer_id HAVING SUM(total_amount) > 1000
),
reviewers AS (
    SELECT DISTINCT customer_id FROM reviews
)
SELECT customer_id FROM big_spenders      -- high-value customers who...
INTERSECT
SELECT customer_id FROM reviewers;        -- ...also leave reviews

The CTEs turn each population into a named, self-documenting set, and the INTERSECT reads as the plain-English question. You can also feed a set operation's result into an outer query by wrapping it in a CTE or subquery, then joining it back to get more columns — a frequent pattern, since INTERSECT/EXCEPT return only the compared columns, but you often want the full row:

WITH dormant AS (                          -- ids of customers who never ordered
    SELECT customer_id FROM customers
    EXCEPT
    SELECT customer_id FROM orders
)
SELECT c.first_name, c.last_name, c.email  -- now fetch their full details
FROM customers c
JOIN dormant d ON d.customer_id = c.customer_id;

This two-step pattern — use a set operation to identify the rows, then join back to the table to enrich them with columns — is how you get the readability of set operations without their limitation of returning only the compared columns. It's the idiomatic answer to "I love how EXCEPT reads, but I need the customer's name and email too." Finally, remember the ordering rule from earlier: a single ORDER BY at the very end sorts the combined result, and it can only reference the output column names (which come from the first query) — not columns that exist in the underlying tables but aren't in the result.


A field guide to set-operation pitfalls

Gathering the traps in one place makes a useful checklist when a set operation misbehaves:

Union-incompatibility. The error "each UNION query must have the same number of columns" or a type mismatch means your SELECT lists don't line up. Count columns left to right on both sides; confirm the types are compatible position by position. Remember the output names come from the first query — a surprise when you expected the second query's names.

Reaching for UNION when you want UNION ALL. The default instinct to write UNION carries a hidden deduplication cost and can silently merge rows you meant to keep distinct (the event-feed problem). Ask every time: could there be duplicates, and do I want them gone? If duplicates are impossible or desired, UNION ALL is both faster and more correct.

The NULL-equality surprise. As above — NULL matches NULL in set operations, unlike in WHERE. Watch for unexpected deduplication or removal of NULL-bearing rows.

Precedence when mixing operators. INTERSECT binds tighter than UNION and EXCEPT. Mixing them without parentheses produces a result that's correct by the rules but probably not what you meant. Parenthesize, exactly as you do with AND/OR.

Whole-row comparison in INTERSECT/EXCEPT. These compare every selected column, not just the first. Select precisely the columns whose combination defines "the same row" — extra columns will cause near-zero matches; too few will over-match.

Expecting other columns from INTERSECT/EXCEPT. They return only the compared columns. If you need the full row, use the "identify then join back" pattern above, or switch to EXISTS/NOT EXISTS, which keep the outer table's columns naturally.

Run a misbehaving set operation past this list and you'll find the problem quickly. Most set-operation bugs are one of these six, and each has a clear, mechanical fix once you recognize it. The operators themselves are simple; the pitfalls cluster entirely around compatibility, the UNION/UNION ALL choice, and NULL/whole-row comparison semantics.


Set operations in data pipelines and warehousing

Looking ahead, set operations become structural load-bearing tools once you move from answering ad-hoc questions to building data pipelines — the repeatable flows that move and reshape data, which Chapters 31 and 34 treat in full. Seeing their role there now explains why a seemingly modest feature deserves a whole chapter.

In ETL and data loading (Chapter 31), EXCEPT is the natural way to compute what changed between a source and a destination — the delta to apply. Rather than reloading an entire table every night, a pipeline can compute "rows in the new extract but not in the current table" with an EXCEPT and insert only those, and "rows in the current table but not in the new extract" to find deletions. This change-data-capture-by-difference is far cheaper than full reloads and is expressed almost entirely with set operations. The reconciliation pattern from earlier — assert two datasets are identical by checking both directions of EXCEPT come back empty — is exactly how such pipelines verify they worked.

In data warehousing (Chapter 34), UNION ALL is the workhorse for assembling fact data from multiple source systems into a unified table. A company that acquired competitors, or that runs separate systems per region, often needs to stack structurally-similar data from many sources into one analytical table — each source shaped to a common schema and concatenated with UNION ALL, tagged with a source column exactly as in the activity-feed example. Whole categories of "combine these monthly partitions," "union these regional sales tables," and "merge these historical and current datasets" are UNION ALL at their core.

The throughline is that set operations are how you treat whole datasets as values — combining, differencing, and comparing them wholesale rather than row-by-row. That dataset-level thinking is precisely what pipelines and warehouses are built on, which is why a feature you might first meet as a tidy way to write "in both" or "never" turns out to be infrastructure. When you reach those later chapters, the operators will already be familiar; you'll simply be applying them at a larger scale and on a schedule.


A note on portability

Set operations are among the more portable corners of SQL, which is worth knowing if your career takes you across databases. UNION, UNION ALL, INTERSECT, and EXCEPT are all standard SQL and behave consistently across PostgreSQL, SQL Server, Oracle, and SQLite — the same syntax, the same deduplication semantics, the same union-compatibility rules. This makes set-operation skills among the most transferable you'll build in Part II.

The one notable wrinkle is EXCEPT itself: Oracle historically spelled it MINUS (it added EXCEPT only recently), so older Oracle code and documentation uses MINUS for the identical operation. MySQL was the laggard, gaining INTERSECT and EXCEPT only in version 8.0.31; before that, developers emulated them with joins and NOT EXISTS. The ALL variants (INTERSECT ALL, EXCEPT ALL) are less universally implemented than the deduplicating defaults, so check before relying on multiplicity-preserving differences. And as always, the ORDER BY at the end of a set operation, and the rule that output names come from the first query, are standard and consistent.

These are minor caveats against a backdrop of strong portability. If you learn set operations on PostgreSQL, you can use them almost unchanged nearly everywhere — a pleasant contrast to date/time functions (Chapter 8) or the LIMIT/TOP/FETCH divergence (Chapter 5), where databases disagree far more. Appendix J records the specifics, but the headline is reassuring: this is one chapter whose skills travel with you intact.

That portability is itself a small lesson about the relational model. Set operations transfer cleanly precisely because they're rooted in mathematical set theory, which is the same everywhere — there's no room for vendors to disagree about what "the union of two sets" means. The parts of SQL that diverge most across databases (date handling, string functions, row-limiting) are the parts furthest from the mathematical core, where each vendor improvised. The closer a feature sits to Codd's foundations, the more universal it is. Set operations sit right on those foundations, and their consistency across every database is the dividend.


Progressive project: combine result sets

In your domain:

  1. A UNION ALL that stacks two related lists into one report (e.g., upcoming and overdue loans; this-month and last-month enrollments).
  2. An INTERSECT "in both" question (members who borrowed and attended an event; students enrolled in two specific courses).
  3. An EXCEPT difference (active members who never borrowed; books in the catalog never loaned).
  4. Decide consciously, for each, whether you need UNION (dedup) or UNION ALL (keep duplicates), and why.

Add these to project-notes.md.


Where set operations earn their place

It's worth understanding why set operations exist as first-class features rather than being left to joins and subqueries, because the answer reveals when they're genuinely the right tool. The relational model, as Codd conceived it, is built directly on mathematical set theory — relations are sets of tuples — so the set operations of union, intersection, and difference aren't bolted-on conveniences but fundamental operations that the model inherits from its mathematical foundation. SQL would be an incomplete realization of the relational algebra without them. That lineage is why they map so cleanly to the algebra of Chapter 4, and why they compose so predictably: they're operations on sets, and sets have well-understood, centuries-old algebra.

In day-to-day engineering, set operations turn out to occupy a few specific niches where they're not just a solution but clearly the best one. The first is combining heterogeneous sources into a uniform stream — the activity-feed pattern, where rows from different tables representing different kinds of events are normalized to a common shape and stacked with UNION ALL. No join expresses this, because the rows aren't related; they're simply being concatenated. The second niche is set-membership questions over a single dimension — "in both," "in one but not the other," "in either" applied to a column of identifiers — where INTERSECT and EXCEPT read like the English question and sidestep the duplicate-row hazards a join would introduce. The third, and perhaps most valuable in practice, is data reconciliation and verification: asserting that two datasets are identical by checking that their symmetric difference is empty, the technique that underpins migration testing and pipeline auditing.

Outside these niches, joins and EXISTS/NOT EXISTS are often the better choice — particularly whenever you need columns from more than one table, since set operations return only the compared columns. The mature instinct is to recognize the set-shaped questions when they appear (the language gives them away: either, both, except, never, combine, in common) and to reach for the set operator that says the question most directly, while defaulting to joins and existence tests for the relational, column-widening questions that make up the bulk of querying. Knowing which kind of question you have is the meta-skill; the operators themselves are simple once you've classified the question correctly.

There's also a quiet professional payoff in fluency with set operations: they make you better at thinking about data as populations. A great deal of analysis and data quality work is fundamentally about comparing groups — who's in this cohort but not that one, what overlaps, what's unique to each — and the habit of reaching for set operations trains you to frame problems that way. When a colleague asks "are these two reports counting the same customers?", the person who immediately thinks "that's an EXCEPT in both directions; if both come back empty, yes" has a sharper, more set-theoretic way of reasoning about data than someone who would laboriously join and eyeball. That way of thinking, more than the specific syntax, is the lasting takeaway of this short but foundational chapter.


Summary

Set operations stack query results by row, realizing Chapter 4's set algebra: UNION (∪, in either — deduplicated), UNION ALL (in either — duplicates kept, faster), INTERSECT (∩, in both), and EXCEPT (−, in the first but not the second). The queries must be union-compatible (same column count and compatible types; output names come from the first query). UNION/INTERSECT/EXCEPT deduplicate by default (use the ALL variants to keep duplicates and run faster). A single ORDER BY at the end sorts the combined result. EXCEPT is a readable set-difference anti-join, but NOT EXISTS is more flexible when you need extra columns or conditions. Default to UNION ALL unless you truly need dedup.

You can now: - Combine result sets with UNION/UNION ALL, INTERSECT, and EXCEPT. - Ensure union compatibility (column count, order, types). - Choose UNION vs UNION ALL deliberately (dedup cost vs. keeping duplicates). - Express "in both" and "in one not the other" with INTERSECT/EXCEPT. - Order a combined result correctly and parenthesize mixed operators. - Choose between set operations, joins, and EXISTS for a given question.

What's next. Chapter 11 — CTEs and Recursive Queries — the WITH clause that makes complex queries readable (naming the derived tables you've been nesting), and recursive CTEs for traversing hierarchies like Mercado's category tree and employee org chart.


Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.