Case Study 2 — "Bought X but Never Y": Getting the Algebra Right
A marketing question that sounds simple, traps nearly everyone who attacks it with joins alone, and falls out cleanly the moment you think in relational algebra. The lesson: choosing the right operation (here, difference / anti-join) is half the battle, and it's a thinking skill, not a syntax skill.
Background
A retailer's marketing team wanted to run a campaign targeting a specific group: customers who have bought a laptop but have never bought any accessory (headphones, earbuds, a speaker, etc.). The idea was to upsell accessories to people who clearly own a laptop but haven't added anything to it.
The request landed with a junior analyst, who reached — as most people do — for a join.
The wrong turns
Attempt 1: join and filter. The analyst joined customers to their orders and line items, then tried to filter:
-- WRONG: this can't express "never bought Y" with a row filter
SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
JOIN order_items oi ON oi.order_id = o.order_id
JOIN products p ON p.product_id = oi.product_id
WHERE p.category = 'Laptops'
AND p.category <> 'Audio'; -- "and not an accessory"??
This returns customers who bought a laptop — the category <> 'Audio' is meaningless here, because it filters individual line-item rows, not the customer's entire history. A laptop line item is never in the 'Audio' category, so the condition is always true for the rows that survive, and the query happily includes customers who also bought accessories in other rows. The result was far too large, and the campaign would have annoyed exactly the people it meant to exclude.
Attempt 2: the trap of "AND". Realizing the first was wrong, the analyst tried to require "bought laptop AND not audio" on the same join. But a single joined row represents one line item, so you can't say "this customer has a laptop row AND has no audio row" with a condition on one row. The information about whether a customer ever bought audio lives across many rows — it's a property of a set, not of a row. No amount of WHERE on the flat join can express "the absence of a matching row elsewhere."
This is the crux: "never bought Y" is a statement about the non-existence of rows, and row filters can't express non-existence. You need a set difference.
Thinking in algebra
Step back and express the goal as algebra. Two sets of customers:
L = π customer_id ( customers who bought a Laptop )
A = π customer_id ( customers who bought an Accessory )
answer = L − A -- bought a laptop, MINUS those who bought an accessory
That's a difference. Once you see it, the SQL writes itself — three equivalent ways, all correct:
-- (a) Set difference, straight from the algebra:
SELECT customer_id FROM laptop_buyers
EXCEPT
SELECT customer_id FROM accessory_buyers;
-- (b) Anti-join with NOT EXISTS (often the clearest and fastest):
SELECT DISTINCT lb.customer_id
FROM laptop_buyers lb
WHERE NOT EXISTS (
SELECT 1 FROM accessory_buyers ab
WHERE ab.customer_id = lb.customer_id
);
-- (c) LEFT JOIN ... IS NULL (the "find the non-matches" idiom):
SELECT lb.customer_id
FROM laptop_buyers lb
LEFT JOIN accessory_buyers ab ON ab.customer_id = lb.customer_id
WHERE ab.customer_id IS NULL;
(Here laptop_buyers and accessory_buyers are the two π customer_id (σ category = …) sub-results; in real SQL they'd be subqueries or CTEs — Chapters 9 and 11.) All three implement the same algebra — a difference — and all three give the correct, much smaller target list.
The analysis
-
Choosing the operation is the hard part. The analyst's struggle wasn't SQL syntax; it was not recognizing that "never bought Y" is a difference (an anti-join), not a filter. Relational algebra gives you the vocabulary to classify a question: is this a join? a difference? an intersection? Get the operation right and the syntax is mechanical.
-
Row filters can't express "no matching row exists." A
WHEREcondition tests one row at a time. Statements about the absence of related rows ("never," "none," "not any") require difference,NOT EXISTS, orLEFT JOIN ... IS NULL. This is one of the most common and important patterns in all of SQL — you'll meet it formally in Chapters 9 and 10. -
One algebra, several SQL spellings.
EXCEPT,NOT EXISTS, andLEFT JOIN ... IS NULLare three SQL renderings of the same algebraic difference. They're equivalent in result (with care aroundNULLs); which is fastest depends on the data and indexes (Chapter 24). Knowing they're "the same thing" lets you pick freely. -
Set thinking beats row thinking for "membership" questions. Whenever a question is about which entities satisfy a condition across their whole history ("customers who…", "products that have never…"), think in sets of ids and combine them with set operations. It's clearer and less error-prone than wrestling a flat join.
Discussion questions
- Explain precisely why
p.category <> 'Audio'in the flat join fails to express "never bought audio." - Express the target group in algebra. Which operation is essential, and why can't a
WHEREfilter replace it? - Write the three SQL forms (EXCEPT, NOT EXISTS, LEFT JOIN…IS NULL) and explain why they're equivalent.
- ⭐ The marketing team next asks for "customers who bought a laptop AND a phone but no accessory." Sketch the algebra (hint: an intersection combined with a difference). You won't write the full SQL until Part II, but get the operations right.
- ⭐ With
NULLs in play, theLEFT JOIN ... IS NULLandNOT INforms can behave differently. Recall Chapter 3: why isNOT EXISTSgenerally safer thanNOT INfor anti-joins?