Case Study: Building a Record Deduplicator with Equivalence Closure
"The goal is to turn data into information, and information into insight." — Carly Fiorina
Executive Summary
A customer database has accumulated duplicates: the same person entered three times with slightly different details. Your task is to build a deduplicator that groups records belonging to the same real person into clusters, picks one representative per cluster, and reports the merged set. The catch — and the reason this is a relations problem — is that your matching rules give you only pairwise hints ("record 1 and record 4 share an email," "record 4 and record 7 share a phone"), and those hints are not transitive on their own. To get clean, non-overlapping clusters you must take the equivalence closure of the match relation (reflexive + symmetric + transitive closure together) and then read off its equivalence classes, which the chapter guarantees form a partition — exactly the no-record-in-two-clusters property a deduplicator needs.
This case study is build-heavy: you will construct a small but real component, step by step, each step
justified by a theorem from §12.4–12.5. (Case Study 1 was analysis-heavy — interrogating a system you were
handed.) By the end you'll have a deduplicate function whose correctness rests directly on the
equivalence-relation/partition correspondence.
Skills applied
- Building a match relation from pairwise rules (§12.1) and recognizing it is not yet an equivalence relation (§12.3).
- Taking the reflexive, symmetric, and transitive closures together to form the equivalence closure (§12.5, the Common Pitfall box).
- Extracting equivalence classes and relying on the partition theorem for correctness (§12.4).
- Choosing class representatives — the deduplication output.
- Hand-deriving expected output at each phase.
Background
The scenario
A table of customer records, keyed by an internal id, with messy contact fields:
id name email phone
1 Ada Lovelace ada@calc.org 555-0100
2 Grace Hopper grace@navy.mil 555-0200
3 A. Lovelace ada@calc.org (none)
4 Ada L. (none) 555-0100
5 Alan Turing turing@bletchley.uk 555-0300
6 G. Hopper grace@navy.mil 555-0299
Two matching rules are agreed with the data team: two records match if they share a non-empty email, or they share a non-empty phone. Applying the rules pairwise:
- Records 1 and 3 share email
ada@calc.org. - Records 1 and 4 share phone
555-0100. - Records 2 and 6 share email
grace@navy.mil. - Record 5 matches nothing.
Notice what is not directly stated: rules never said "3 and 4 match." But 3 matches 1 and 1 matches 4, so 3 and 4 are the same person transitively. A correct deduplicator must infer that. The raw match relation is symmetric (sharing is mutual) but not transitive, and that gap is the whole engineering problem.
🔗 Connection. "Same person, inferred through a chain of shared attributes" is the textbook signature of an equivalence relation built by closure. The same pattern powers connected-components in a graph (Chapter 28), "same cluster" in entity resolution, and union-find. We are about to build the relation version by hand.
Why this matters
Deduplication done wrong is worse than not deduplicating. If clusters overlap — record 4 lands in two "people" — you double-count, merge the wrong fields, and corrupt downstream analytics. The chapter's partition theorem (§12.4) is precisely the guarantee that cannot happen if you cluster by an equivalence relation: classes are disjoint by theorem, not by hope. So the safe design is "make the match relation an equivalence relation, then cluster by its classes." We build exactly that.
Phase 1: Build the raw match relation
Encode the records and derive the directly stated matches from the rules. We store records as a dict and build the match relation as a set of pairs.
records = {
1: ("Ada Lovelace", "ada@calc.org", "555-0100"),
2: ("Grace Hopper", "grace@navy.mil", "555-0200"),
3: ("A. Lovelace", "ada@calc.org", ""),
4: ("Ada L.", "", "555-0100"),
5: ("Alan Turing", "turing@bletchley.uk", "555-0300"),
6: ("G. Hopper", "grace@navy.mil", "555-0299"),
}
def raw_matches(records):
"""Pairs (i, j), i != j, that share a non-empty email or phone."""
pairs = set()
for i in records:
for j in records:
if i == j:
continue
_, ei, pi = records[i]
_, ej, pj = records[j]
if (ei and ei == ej) or (pi and pi == pj):
pairs.add((i, j))
return pairs
M = raw_matches(records)
print(sorted(M))
# Expected output:
# [(1, 3), (1, 4), (2, 6), (3, 1), (4, 1), (6, 2)]
Hand-derive it. The sharing facts are 1–3 (email), 1–4 (phone), 2–6 (email); record 5 shares nothing.
Because the test is symmetric (ei == ej is the same as ej == ei), each fact produces both directions:
$(1,3)$ and $(3,1)$, $(1,4)$ and $(4,1)$, $(2,6)$ and $(6,2)$. Six pairs total, matching the output. So
$M$ is already symmetric — but it is not reflexive (no $(i,i)$) and not transitive (we have
$(3,1)$ and $(1,4)$ but no $(3,4)$). It is not yet an equivalence relation, so clustering by it directly
would be unsafe.
🔄 Check Your Understanding Verify that $M$ is not transitive by naming a specific two-step chain in $M$ whose shortcut is missing.
Answer
$(3, 1) \in M$ and $(1, 4) \in M$, but $(3, 4) \notin M$. The chain "3 matches 1, 1 matches 4" has no direct shortcut, so transitivity fails. (This is exactly why records 3 and 4 — the same person — are not yet grouped.)
Phase 2: Add reflexive and transitive closure (build the equivalence closure)
We want the smallest equivalence relation containing $M$. Per §12.5's Common Pitfall, transitive closure alone is not enough — an equivalence relation also needs reflexivity. $M$ is already symmetric, but to be safe and general we close under all three properties. Build small, composable closure functions, then chain them.
def reflexive_closure(R, A):
return R | {(a, a) for a in A}
def symmetric_closure(R):
return R | {(b, a) for (a, b) in R}
def closure_transitive(R):
S = set(R)
while True:
new = {(a, c) for (a, b) in S for (x, c) in S if b == x}
if new <= S:
return S
S |= new
def equivalence_closure(R, A):
"""Smallest equivalence relation containing R: reflexive + symmetric + transitive."""
R = symmetric_closure(R) # ensure both directions
R = reflexive_closure(R, A) # ensure every (a, a)
return closure_transitive(R) # ensure chains close
ids = set(records) # {1, 2, 3, 4, 5, 6}
E = equivalence_closure(M, ids)
print((3, 4) in E, (4, 3) in E, (5, 5) in E, (1, 2) in E)
# Expected output:
# True True True False
Hand-derive the four checks.
- $(3, 4)$: symmetric closure leaves $(3,1)$ and $(1,4)$; transitive closure chains them into $(3,4)$. So
True— records 3 and 4 are now correctly linked. - $(4, 3)$: symmetry plus the chain $(4,1),(1,3)$ gives $(4,3)$.
True. - $(5, 5)$: reflexive closure adds every diagonal pair, including $(5,5)$, even though record 5 matched
nobody.
True— a lone record is its own cluster. - $(1, 2)$: records 1 and 2 share no attribute and no chain connects them (the Lovelace cluster never
touches the Hopper cluster).
False.
We now have a genuine equivalence relation. The order of closures matters less than getting all three, as §12.5 noted; we did symmetric, then reflexive, then transitive so that the transitive pass closes a relation that is already symmetric — which guarantees the result stays symmetric too.
⚠️ Common Pitfall. A tempting shortcut is to call only
closure_transitive(M)and cluster on that. It would link 3 and 4 correctly, but record 5 would be in no pair (no self-loop), so the class-extraction in Phase 3 would either crash or silently drop record 5. Reflexivity is what guarantees every record lands in exactly one class. Build the full equivalence closure, not just the transitive one.💡 Intuition. Think of each closure as adding the arrows one property demands: reflexive adds "every record is the same person as itself," symmetric adds "matching is mutual," transitive adds "sameness chains through shared attributes." Together they upgrade a pile of pairwise hints into a clean notion of "same person" that obeys all the rules equality should.
Phase 3: Extract the clusters (equivalence classes)
Now the payoff. Because $E$ is an equivalence relation, its equivalence classes form a partition
(§12.4's theorem) — every record is in exactly one class, and no two classes overlap. So we can safely walk
the records, and for each not-yet-clustered record, form its class $[r] = \{x \mid (r, x) \in E\}$. This is
the chapter's equivalence_classes routine, and its correctness rests on the partition theorem.
def equivalence_classes(E, A):
seen, classes = set(), []
for a in sorted(A):
if a in seen:
continue
cls = {x for x in A if (a, x) in E} # everything related to a
classes.append(cls)
seen |= cls
return classes
clusters = equivalence_classes(E, ids)
print(clusters)
# Expected output:
# [{1, 3, 4}, {2, 6}, {5}]
Hand-derive it. Walk records in order:
- $r = 1$, unseen. $[1] = \{x \mid (1, x) \in E\} = \{1, 3, 4\}$ (1 links to 3 by email, to 4 by phone, to itself reflexively). Mark 1, 3, 4 seen.
- $r = 2$, unseen. $[2] = \{2, 6\}$ (shared email with 6). Mark 2, 6 seen.
- $r = 3$, already seen — skip. $r = 4$, seen — skip.
- $r = 5$, unseen. $[5] = \{5\}$ (matched nobody; only its reflexive self-loop). Mark 5 seen.
- $r = 6$, seen — skip.
Three clusters: $\{1, 3, 4\}$ (Ada), $\{2, 6\}$ (Grace), $\{5\}$ (Alan). They are non-empty, pairwise disjoint, and cover all six records — a partition, exactly as the theorem promised. The algorithm visits each class once because the classes do not overlap; if $E$ were not a genuine equivalence relation, this loop could produce overlapping or duplicated clusters. The theorem is not decoration here — it is the correctness proof of the extraction loop.
🚪 Threshold idea, used. "Cluster records into non-overlapping groups by an all-or-nothing sameness rule" is "find the equivalence classes of an equivalence relation." Once you see that, deduplication, connected components, strongly-connected components, and union-find are all the same act — extracting the blocks of a partition. We just did it with nothing but closures and class extraction.
Phase 4: Pick representatives and emit the merged records
A deduplicator must output one record per real person. Any element of a class is a valid representative
(§12.4); a sensible policy is "smallest id," and a sensible merge is "fill missing fields from
clustermates." We keep the merge simple: take the representative's name, and the first non-empty email and
phone found in the cluster.
def merge_cluster(cluster, records):
rep = min(cluster) # representative = smallest id
name = records[rep][0]
email = next((records[i][1] for i in sorted(cluster) if records[i][1]), "")
phone = next((records[i][2] for i in sorted(cluster) if records[i][2]), "")
return (rep, name, email, phone)
def deduplicate(records):
M = raw_matches(records)
E = equivalence_closure(M, set(records))
clusters = equivalence_classes(E, set(records))
return [merge_cluster(c, records) for c in clusters]
for row in deduplicate(records):
print(row)
# Expected output:
# (1, 'Ada Lovelace', 'ada@calc.org', '555-0100')
# (2, 'Grace Hopper', 'grace@navy.mil', '555-0200')
# (5, 'Alan Turing', 'turing@bletchley.uk', '555-0300')
Hand-derive the merge. For cluster $\{1, 3, 4\}$: representative is min = 1, name Ada Lovelace;
first non-empty email scanning ids $1, 3, 4$ is record 1's ada@calc.org; first non-empty phone is
record 1's 555-0100. For $\{2, 6\}$: rep 2, name Grace Hopper, email grace@navy.mil, phone
555-0200 (record 2's, found before record 6's). For $\{5\}$: rep 5, all of Alan's own fields. Three merged
records out of six raw rows — the duplicates are gone, and crucially, no record contributed to two outputs,
because the clusters partitioned the input.
🐛 Find the Error. A teammate "optimizes"
equivalence_closureby dropping thereflexive_closureline, reasoning that "every record matches itself anyway." Tracededuplicate(records)with that change and identify which record disappears from the output, and why. (Hint: which record never appears as the first coordinate of any pair in the transitive-only closure?)
Answer
Record 5 matches nobody, so without reflexive closure it appears in no pair of $E$. Then in
equivalence_classes, when $a = 5$ we compute $[5] = \{x \mid (5, x) \in E\} = \emptyset$, so the cluster{5}is never formed (the empty set adds nothing toseen, and5is silently never emitted). The output would contain only the Ada and Grace clusters — Alan Turing vanishes. Reflexivity is exactly what guarantees every element is in its own class; dropping it loses every record that matched nothing.
Discussion Questions
- Our merge policy ("smallest id is representative; first non-empty field wins") is one of many. Suppose two records in a cluster have different non-empty emails. Does the choice of representative affect which records cluster together, or only how they are merged? Tie your answer to the fact that any element of a class is a valid representative.
- The rules were "share an email or share a phone." Suppose we changed them to "share an email and share a phone." Is the resulting raw match relation still symmetric? Would the equivalence-closure step still be necessary, and would the clusters get larger or smaller?
- We rebuilt
equivalence_closurefrom three small closures. In production with millions of records, this is far too slow. The same partition-into-clusters task is solved by the union-find data structure in near-linear time. What does each closure step correspond to in union-find terms (hint:union, and the "same set?" query)? - The Find-the-Error box showed that dropping reflexivity silently loses singletons. Why is "silently" the dangerous word here — and what single test case over the chapter's partition definition would have caught the bug immediately?
Your Turn: Extensions
- Option A (add a rule, keep the math). Add a third matching rule: "two records match if their names
are equal after lowercasing and removing periods/spaces" (so
A. Lovelace~Ada Lovelace? decide and justify). Re-derive the raw matches and the clusters by hand, and confirm the partition property still holds. - Option B (union-find). Implement the same
deduplicateusing a union-find (disjoint-set) structure instead of closures:union(i, j)for each raw match, then group byfind(i). Argue that the groups it produces are exactly the equivalence classes of the equivalence closure — i.e., that union-find computes a partition. - Option C (verify the partition). Write
assert_partition(clusters, A)that checks the three partition conditions (non-empty, pairwise disjoint, union equalsA) and call it on the output ofequivalence_classes. Explain why this assertion should never fire if the input toequivalence_classesis a true equivalence relation — and why it would fire if you fed it the transitive-only closure with record 5 missing.
Key Takeaways
- Pairwise match hints are not an equivalence relation. They are typically symmetric but neither reflexive nor transitive; clustering on them directly is unsafe.
- The equivalence closure fixes that — all three closures, not just transitive. Reflexive closure guarantees singletons survive; symmetric closure guarantees mutual matching; transitive closure links chains. Drop any one and the deduplicator breaks (and often silently).
- Clusters are equivalence classes, and the partition theorem is the correctness proof. Because the classes of an equivalence relation are non-empty, disjoint, and covering, "no record in two clusters" is guaranteed by §12.4's theorem, not by careful coding.
- Representatives make the output. Any element of a class can stand for it; choosing one (smallest id) and merging fields turns the partition into a clean, deduplicated table — three records out of six, with no double-counting.