Case Study: Auditing a Mailing List with Set Algebra

DataField.Dev

Case Study: Auditing a Mailing List with Set Algebra

"The most important property of a program is whether it accomplishes the intention of its user." — C. A. R. Hoare

Executive Summary

A marketing team is about to send ten million emails, and a regulator could fine the company for every message sent to someone who unsubscribed or who never consented. Your job is not to build a fancy new system — it is to audit the one they have, find out exactly who would receive a message, prove that the audited "send list" is correct, and explain why the naive list-based code the team wrote is both wrong and unacceptably slow. Every requirement in the audit turns out to be a single set operation from §8.3, and the whole correctness argument is the element method from §8.5. By the end you will have translated a compliance problem into set algebra, derived the right answer by hand, and seen why a set (not a list) is the only reasonable data structure for the job.

Skills applied

Translating informal business rules into precise statements about union, intersection, and difference (§8.3, §8.4).
Reasoning about cardinality and overlap — why $\lvert A \cup B \rvert \ne \lvert A \rvert + \lvert B \rvert$ in general.
Proving an identity about the audited set by the element method (§8.5).
Choosing set over list on complexity grounds (§8.6), and quantifying the difference.

Background

The scenario

This is a constructed teaching scenario (a hypothetical example), but the data shapes and the bug are the kind you meet constantly. The team maintains four collections of email addresses:

Name	Meaning
$C$ — `contacts`	every address ever collected (the raw list, full of duplicates)
$U$ — `unsubscribed`	addresses that asked to stop receiving mail
$B$ — `bounced`	addresses that hard-bounced (the mailbox does not exist)
$V$ — `consented`	addresses that explicitly opted in to marketing

The compliance rule, in plain English, is:

Send to every consented address that has not unsubscribed and has not bounced — and never send to the same address twice.

The team's current code looks like this:

def build_send_list_naive(contacts, unsubscribed, bounced, consented):
    """The team's current approach -- it has TWO problems. Find them."""
    send = []
    for addr in contacts:                 # iterate the raw, duplicate-laden list
        if addr in consented:             # 'in' on a list is O(n) -- a linear scan
            if addr not in unsubscribed:
                if addr not in bounced:
                    send.append(addr)     # may append the same address many times
    return send

💡 Intuition: The compliance rule is a sentence with three filters ("consented", "not unsubscribed", "not bounced") joined by and, plus a global "no duplicates" requirement. Three filters joined by and is an intersection with two differences; "no duplicates" is exactly what a set gives you for free. The team wrote loops where the mathematics wanted set operations, and that mismatch is the source of both bugs.

Why this matters

This is the deduplication-and-difference problem from the chapter's overview, with money attached. Get the set algebra wrong and you mail people who opted out (a legal problem); use the wrong data structure and the job that should take seconds takes hours (an engineering problem). Auditing it correctly means being able to write the exact set the send list should equal — and to prove the code computes that set — before a single message goes out.

Phase 1: Translate the rule into one set expression

Read the compliance sentence one clause at a time and attach a set operation to each.

"Send to every consented address" — start from $V$.
"that has not unsubscribed" — remove $U$: this is the difference $V \setminus U$.
"and has not bounced" — remove $B$ as well: $(V \setminus U) \setminus B$.
"never send to the same address twice" — the result must be a set (distinct elements), which the difference of sets already guarantees.

So the correct send list is the set $$\text{SEND} = (V \setminus U) \setminus B.$$

There is a cleaner equivalent worth having, using the identity $A \setminus B = A \cap \overline{B}$ (Exercise 8.7) and the fact that removing $U$ then removing $B$ is removing $U \cup B$: $$\text{SEND} = V \setminus (U \cup B).$$

🧩 Productive Struggle. Before reading on, convince yourself that $(V \setminus U) \setminus B = V \setminus (U \cup B)$. Which set identity is this? (We prove it in Phase 3 — try it first.)

Notice what is absent from this expression: the raw contacts list $C$. The rule never says "send to contacts"; it says "send to consented addresses." The team's code looped over $C$, which is the first sign something is off — $C$ should not appear in the answer at all.

🔄 Check Your Understanding Rewrite each business rule as a set expression. (a) "addresses we have a record of but who never consented." (b) "consented addresses that nonetheless bounced." (c) "addresses that both unsubscribed and bounced."

Answer

(a) $C \setminus V$ (in contacts, not in consented). (b) $V \cap B$. (c) $U \cap B$. Each English "and/but-not/both" maps to $\cap$ or $\setminus$ exactly as in the §8.3 correspondence table.

Phase 2: Work a small instance by hand

Auditing means we must know the right answer independently of the code. Take a tiny instance — six addresses, abbreviated to single letters — and compute SEND by hand.

$$ C = \{a, b, c, d, e, f, a, c\}, \quad V = \{a, b, c, d\}, \quad U = \{b\}, \quad B = \{c\}. $$

(Note $C$ was written with a and c repeated, mimicking the duplicate-laden raw list; as a set it is just $\{a,b,c,d,e,f\}$.) Compute step by step:

$U \cup B = \{b\} \cup \{c\} = \{b, c\}$.
$\text{SEND} = V \setminus (U \cup B) = \{a, b, c, d\} \setminus \{b, c\} = \{a, d\}$.

So exactly two addresses, $a$ and $d$, should be mailed. Let us double-check with the other form: $(V \setminus U) \setminus B = ({a,b,c,d} \setminus {b}) \setminus {c} = {a,c,d} \setminus {c} = {a, d}$. The two expressions agree, as Phase 3 will prove they must.

Now hand-trace the team's naive code on the same data. It loops over $C = [a, b, c, d, e, f, a, c]$ (the list, with repeats) and keeps addr when addr in V and addr not in U and addr not in B:

`addr`	in $V$?	not in $U$?	not in $B$?	appended?
a	yes	yes	yes	a
b	yes	no	—	no
c	yes	yes	no	no
d	yes	yes	yes	d
e	no	—	—	no
f	no	—	—	no
a	yes	yes	yes	a (again!)
c	yes	yes	no	no

The naive function returns the list [a, d, a]. The set of recipients is correct ($\{a, d\}$), but the list contains a twice — address $a$ gets two emails. That is bug #1: iterating the duplicate-laden $C$ and appending to a list reproduces the duplicates.

⚠️ Common Pitfall: "The recipients are correct, so the code is fine." No — the requirement was "never send to the same address twice," and the returned list violates it. Correct membership is not the same as a correct result when duplicates are forbidden. The fix is to make the result a set, where the second a simply collapses (§8.1).

Phase 3: Prove the two send-list formulas are equal

The audit produced two expressions for SEND. If we are going to rewrite the team's pipeline using the more efficient $V \setminus (U \cup B)$ form, we must prove it equals the rule-by-rule form $(V \setminus U) \setminus B$. This is the element method from §8.5.

Claim. For all sets $V, U, B$: $(V \setminus U) \setminus B = V \setminus (U \cup B)$.

The strategy first. Chase one arbitrary element through both sides as a chain of iff steps. The only real content is De Morgan's law from Chapter 1, $\neg(p \lor q) \equiv \neg p \land \neg q$, turning "not in the union" into "not in either." Everything else is unfolding the definitions of difference and union.

Proof. Let $x$ be arbitrary. Then $$ > \begin{aligned} > x \in (V \setminus U) \setminus B > &\iff (x \in V \setminus U) \land (x \notin B) && \text{(definition of difference)} \\ > &\iff (x \in V \land x \notin U) \land x \notin B && \text{(definition of difference again)} \\ > &\iff x \in V \land (x \notin U \land x \notin B) && \text{(associativity of } \land\text{)} \\ > &\iff x \in V \land \neg(x \in U \lor x \in B) && \text{(De Morgan, [Chapter 1](../../part-01-logic-and-proof/chapter-01-propositional-logic/index.md))} \\ > &\iff x \in V \land x \notin (U \cup B) && \text{(definition of union)} \\ > &\iff x \in V \setminus (U \cup B) && \text{(definition of difference).} > \end{aligned} > $$ Since $x \in (V \setminus U) \setminus B \iff x \in V \setminus (U \cup B)$ for every $x$, the two sets have the same elements, so they are equal. $\blacksquare$

Re-read the proof and notice the §8.5 lesson in action: the entire mathematical content is the single line "De Morgan, Chapter 1." The rest is bookkeeping — unfolding difference and union into their logical connectives, applying one logical law, folding back up. The audit's two formulas are not merely "both reasonable"; they are provably the same set, so we may use whichever is cheaper to compute.

🔗 Connection. This is the same shape as the textbook's proof of $A \setminus (B \cap C) = (A \setminus B) \cup (A \setminus C)$ in §8.5: difference distributing over a set operation always reduces to a De Morgan step. Once you have seen one, you can write them all.

Phase 4: The corrected, audited pipeline

Now rebuild the send-list computation using sets, transcribing the proven expression $V \setminus (U \cup B)$ directly. This fixes both bugs at once.

def build_send_list(contacts, unsubscribed, bounced, consented):
    """Audited send list = consented \ (unsubscribed u bounced).
    Inputs may be any iterables; we convert to sets first."""
    V = set(consented)
    U = set(unsubscribed)
    B = set(bounced)
    return V - (U | B)          # V \ (U u B): one line, mirrors the proven identity

C = ["a", "b", "c", "d", "e", "f", "a", "c"]   # raw, with duplicates
V = ["a", "b", "c", "d"]
U = ["b"]
B = ["c"]
result = build_send_list(C, U, B, V)
print(sorted(result))          # the recipients, deduplicated and rule-correct
print(len(result))             # how many emails actually go out
# Expected output:
# ['a', 'd']
# 2

The result is the set $\{a, d\}$ — the hand-derived answer from Phase 2 — and it is a set, so address $a$ appears once no matter how many duplicates lurked in the inputs. Bug #1 (duplicate sends) is gone because the result type is a set; the "loop over $C$" mistake is gone because we start from the correct set $V$.

Bug #2: the hidden quadratic blowup

The naive code had a second problem the small example could not reveal: speed. In build_send_list_naive, the inner test addr in unsubscribed runs on a list, so each test is an $O(n)$ scan, and it runs once per contact. With $\lvert C \rvert$ contacts and unsubscribe/bounce lists of size up to $n$, the total cost is on the order of $\lvert C \rvert \cdot n$ — quadratic. At ten million addresses that is on the order of $10^{14}$ comparisons: hours of work.

The audited version converts each input to a set once (near-linear) and then does set difference and union, each of which is linear in the set sizes because membership is average $O(1)$ (§8.6). The total is near-linear instead of quadratic. Let us make the contrast explicit.

# Illustrative cost comparison (NOT executed; counts reasoned by hand).
# Let n = size of each collection. We count membership tests.
#
#   naive:  for each of n contacts, do up to 3 membership tests,
#           each an O(n) list scan      ->  ~ 3 * n * n      = O(n^2)
#   sets:   build 3 sets (O(n) total),
#           then differences/unions touch each element once  ->  O(n)
#
# For n = 10_000_000:
naive_ops = 3 * (10_000_000 ** 2)     # ~ 3e14 -- infeasible
set_ops   = 4 * 10_000_000            # ~ 4e7  -- a few seconds
print(naive_ops)
print(set_ops)
print(naive_ops // set_ops)           # how many times faster, roughly
# Expected output:
# 300000000000000
# 40000000
# 7500000

The audited pipeline is on the order of seven million times fewer operations at this scale. The constants are rough, but the asymptotic story — $O(n^2)$ versus $O(n)$ — is exactly the §8.6 promise: "set = fast membership, no duplicates," which is "your mathematics, compiled."

⚠️ Common Pitfall: Converting to a set inside a loop throws the win away. Writing if addr in set(unsubscribed): rebuilds the set on every iteration, restoring the $O(n^2)$ cost (now with worse constants). Build each set once, before the loop or — better — let the set operators do the looping, as the audited version does.

Discussion Questions

The audit dropped the raw contacts set $C$ entirely. Under what (mis-stated) business rule would $C$ belong in the send-list expression? Write that rule and its set expression, and explain why it is not the compliance rule.
Suppose a fifth list arrives: $P$ = "addresses on a legal hold who must receive a notice regardless of preferences." Modify the send-list expression to always include $P$ while still excluding bounced addresses. Is your new expression's order of operations important? Prove or disprove that $(V \setminus (U \cup B)) \cup P = (V \cup P) \setminus (U \cup B)$.
The cost analysis assumed list membership is $O(n)$ and set membership is $O(1)$ "on average." What has to be true about the addresses (and the hash function) for the average case to hold, and what is the worst case for a set? (You will study hashing and collisions properly later in the book — give the intuition now.)
The compliance rule used "and" three times. If a sloppy spec instead said "send to consented addresses or addresses that have not unsubscribed," how would the expression change, and why is that rule almost certainly a bug in the spec? Translate it to set algebra and describe who would wrongly get mail.

Your Turn: Extensions

Option A (inclusion–exclusion preview). Management asks: "how many distinct addresses are in $U$ or $B$ (so we can report total opt-outs)?" Show by a small hand example that $\lvert U \cup B \rvert \ne \lvert U \rvert + \lvert B \rvert$ when they overlap, and state the corrected formula $\lvert U \cup B \rvert = \lvert U \rvert + \lvert B \rvert - \lvert U \cap B \rvert$. (This is inclusion– exclusion, formalized in Chapter 15.)
Option B (audit as a test). Using the Toolkit's difference and union from this chapter's Project Checkpoint, write code that checks build_send_list agrees with the rule-by-rule form $(V \setminus U) \setminus B$ on several hand-made inputs. Explain what agreement does and does not prove — and why you also proved the identity in Phase 3 (theme four: test and proof are partners).
Option C (symmetric difference report). The team wants "addresses that are in exactly one of $U$ and $B$" for a reconciliation report. Express this with the symmetric difference $U \triangle B$, compute it on the Phase-2 data by hand, and write the one-line set expression that produces it.

Key Takeaways

Business rules are set expressions. "Consented, not unsubscribed, not bounced, no duplicates" is $V \setminus (U \cup B)$ — three filters joined by and become differences, and "no duplicates" is what a set is for.
Audit by computing the answer independently. Hand-deriving SEND on a six-address instance exposed both bugs (duplicate sends, looping the wrong collection) before any code ran.
Two formulas are equal only when you prove it. $(V \setminus U) \setminus B = V \setminus (U \cup B)$ rests on one De Morgan step — the §8.5 engine — and proving it licenses the faster implementation.
The data structure is part of correctness and speed. A set makes "no duplicates" automatic and turns an $O(n^2)$ audit into an $O(n)$ one; this is §8.6's "fast membership, no duplicates, no order" contract paying for itself at scale.