Case Study: Building a Record Deduplicator with Equivalence Closure

DataField.Dev

Case Study: Building a Record Deduplicator with Equivalence Closure

"The goal is to turn data into information, and information into insight." — Carly Fiorina

Executive Summary

A customer database has accumulated duplicates: the same person entered three times with slightly different details. Your task is to build a deduplicator that groups records belonging to the same real person into clusters, picks one representative per cluster, and reports the merged set. The catch — and the reason this is a relations problem — is that your matching rules give you only pairwise hints ("record 1 and record 4 share an email," "record 4 and record 7 share a phone"), and those hints are not transitive on their own. To get clean, non-overlapping clusters you must take the equivalence closure of the match relation (reflexive + symmetric + transitive closure together) and then read off its equivalence classes, which the chapter guarantees form a partition — exactly the no-record-in-two-clusters property a deduplicator needs.

This case study is build-heavy: you will construct a small but real component, step by step, each step justified by a theorem from §12.4–12.5. (Case Study 1 was analysis-heavy — interrogating a system you were handed.) By the end you'll have a deduplicate function whose correctness rests directly on the equivalence-relation/partition correspondence.

Skills applied

Building a match relation from pairwise rules (§12.1) and recognizing it is not yet an equivalence relation (§12.3).
Taking the reflexive, symmetric, and transitive closures together to form the equivalence closure (§12.5, the Common Pitfall box).
Extracting equivalence classes and relying on the partition theorem for correctness (§12.4).
Choosing class representatives — the deduplication output.
Hand-deriving expected output at each phase.

Background

The scenario

A table of customer records, keyed by an internal id, with messy contact fields:

id  name            email                phone
1   Ada Lovelace    ada@calc.org         555-0100
2   Grace Hopper    grace@navy.mil       555-0200
3   A. Lovelace     ada@calc.org         (none)
4   Ada L.          (none)               555-0100
5   Alan Turing     turing@bletchley.uk  555-0300
6   G. Hopper       grace@navy.mil       555-0299

Two matching rules are agreed with the data team: two records match if they share a non-empty email, or they share a non-empty phone. Applying the rules pairwise:

Records 1 and 3 share email ada@calc.org.
Records 1 and 4 share phone 555-0100.
Records 2 and 6 share email grace@navy.mil.
Record 5 matches nothing.

Notice what is not directly stated: rules never said "3 and 4 match." But 3 matches 1 and 1 matches 4, so 3 and 4 are the same person transitively. A correct deduplicator must infer that. The raw match relation is symmetric (sharing is mutual) but not transitive, and that gap is the whole engineering problem.

🔗 Connection. "Same person, inferred through a chain of shared attributes" is the textbook signature of an equivalence relation built by closure. The same pattern powers connected-components in a graph (Chapter 28), "same cluster" in entity resolution, and union-find. We are about to build the relation version by hand.

Why this matters

Deduplication done wrong is worse than not deduplicating. If clusters overlap — record 4 lands in two "people" — you double-count, merge the wrong fields, and corrupt downstream analytics. The chapter's partition theorem (§12.4) is precisely the guarantee that cannot happen if you cluster by an equivalence relation: classes are disjoint by theorem, not by hope. So the safe design is "make the match relation an equivalence relation, then cluster by its classes." We build exactly that.

Phase 1: Build the raw match relation

Encode the records and derive the directly stated matches from the rules. We store records as a dict and build the match relation as a set of pairs.

records = {
    1: ("Ada Lovelace", "ada@calc.org",        "555-0100"),
    2: ("Grace Hopper", "grace@navy.mil",       "555-0200"),
    3: ("A. Lovelace",  "ada@calc.org",         ""),
    4: ("Ada L.",       "",                     "555-0100"),
    5: ("Alan Turing",  "turing@bletchley.uk",  "555-0300"),
    6: ("G. Hopper",    "grace@navy.mil",       "555-0299"),
}

def raw_matches(records):
    """Pairs (i, j), i != j, that share a non-empty email or phone."""
    pairs = set()
    for i in records:
        for j in records:
            if i == j:
                continue
            _, ei, pi = records[i]
            _, ej, pj = records[j]
            if (ei and ei == ej) or (pi and pi == pj):
                pairs.add((i, j))
    return pairs

M = raw_matches(records)
print(sorted(M))
# Expected output:
# [(1, 3), (1, 4), (2, 6), (3, 1), (4, 1), (6, 2)]

Hand-derive it. The sharing facts are 1–3 (email), 1–4 (phone), 2–6 (email); record 5 shares nothing. Because the test is symmetric (ei == ej is the same as ej == ei), each fact produces both directions: $(1,3)$ and $(3,1)$, $(1,4)$ and $(4,1)$, $(2,6)$ and $(6,2)$. Six pairs total, matching the output. So $M$ is already symmetric — but it is not reflexive (no $(i,i)$) and not transitive (we have $(3,1)$ and $(1,4)$ but no $(3,4)$). It is not yet an equivalence relation, so clustering by it directly would be unsafe.

🔄 Check Your Understanding Verify that $M$ is not transitive by naming a specific two-step chain in $M$ whose shortcut is missing.

Answer

$(3, 1) \in M$ and $(1, 4) \in M$, but $(3, 4) \notin M$. The chain "3 matches 1, 1 matches 4" has no direct shortcut, so transitivity fails. (This is exactly why records 3 and 4 — the same person — are not yet grouped.)

Phase 2: Add reflexive and transitive closure (build the equivalence closure)

We want the smallest equivalence relation containing $M$. Per §12.5's Common Pitfall, transitive closure alone is not enough — an equivalence relation also needs reflexivity. $M$ is already symmetric, but to be safe and general we close under all three properties. Build small, composable closure functions, then chain them.

def reflexive_closure(R, A):
    return R | {(a, a) for a in A}

def symmetric_closure(R):
    return R | {(b, a) for (a, b) in R}

def closure_transitive(R):
    S = set(R)
    while True:
        new = {(a, c) for (a, b) in S for (x, c) in S if b == x}
        if new <= S:
            return S
        S |= new

def equivalence_closure(R, A):
    """Smallest equivalence relation containing R: reflexive + symmetric + transitive."""
    R = symmetric_closure(R)             # ensure both directions
    R = reflexive_closure(R, A)          # ensure every (a, a)
    return closure_transitive(R)         # ensure chains close

ids = set(records)                       # {1, 2, 3, 4, 5, 6}
E = equivalence_closure(M, ids)
print((3, 4) in E, (4, 3) in E, (5, 5) in E, (1, 2) in E)
# Expected output:
# True True True False

Hand-derive the four checks.

$(3, 4)$: symmetric closure leaves $(3,1)$ and $(1,4)$; transitive closure chains them into $(3,4)$. So True — records 3 and 4 are now correctly linked.
$(4, 3)$: symmetry plus the chain $(4,1),(1,3)$ gives $(4,3)$. True.
$(5, 5)$: reflexive closure adds every diagonal pair, including $(5,5)$, even though record 5 matched nobody. True — a lone record is its own cluster.
$(1, 2)$: records 1 and 2 share no attribute and no chain connects them (the Lovelace cluster never touches the Hopper cluster). False.

We now have a genuine equivalence relation. The order of closures matters less than getting all three, as §12.5 noted; we did symmetric, then reflexive, then transitive so that the transitive pass closes a relation that is already symmetric — which guarantees the result stays symmetric too.

⚠️ Common Pitfall. A tempting shortcut is to call only closure_transitive(M) and cluster on that. It would link 3 and 4 correctly, but record 5 would be in no pair (no self-loop), so the class-extraction in Phase 3 would either crash or silently drop record 5. Reflexivity is what guarantees every record lands in exactly one class. Build the full equivalence closure, not just the transitive one.

💡 Intuition. Think of each closure as adding the arrows one property demands: reflexive adds "every record is the same person as itself," symmetric adds "matching is mutual," transitive adds "sameness chains through shared attributes." Together they upgrade a pile of pairwise hints into a clean notion of "same person" that obeys all the rules equality should.

Phase 3: Extract the clusters (equivalence classes)

Now the payoff. Because $E$ is an equivalence relation, its equivalence classes form a partition (§12.4's theorem) — every record is in exactly one class, and no two classes overlap. So we can safely walk the records, and for each not-yet-clustered record, form its class $[r] = \{x \mid (r, x) \in E\}$. This is the chapter's equivalence_classes routine, and its correctness rests on the partition theorem.

def equivalence_classes(E, A):
    seen, classes = set(), []
    for a in sorted(A):
        if a in seen:
            continue
        cls = {x for x in A if (a, x) in E}   # everything related to a
        classes.append(cls)
        seen |= cls
    return classes

clusters = equivalence_classes(E, ids)
print(clusters)
# Expected output:
# [{1, 3, 4}, {2, 6}, {5}]

Hand-derive it. Walk records in order:

$r = 1$, unseen. $[1] = \{x \mid (1, x) \in E\} = \{1, 3, 4\}$ (1 links to 3 by email, to 4 by phone, to itself reflexively). Mark 1, 3, 4 seen.
$r = 2$, unseen. $[2] = \{2, 6\}$ (shared email with 6). Mark 2, 6 seen.
$r = 3$, already seen — skip. $r = 4$, seen — skip.
$r = 5$, unseen. $[5] = \{5\}$ (matched nobody; only its reflexive self-loop). Mark 5 seen.
$r = 6$, seen — skip.

Three clusters: $\{1, 3, 4\}$ (Ada), $\{2, 6\}$ (Grace), $\{5\}$ (Alan). They are non-empty, pairwise disjoint, and cover all six records — a partition, exactly as the theorem promised. The algorithm visits each class once because the classes do not overlap; if $E$ were not a genuine equivalence relation, this loop could produce overlapping or duplicated clusters. The theorem is not decoration here — it is the correctness proof of the extraction loop.

🚪 Threshold idea, used. "Cluster records into non-overlapping groups by an all-or-nothing sameness rule" is "find the equivalence classes of an equivalence relation." Once you see that, deduplication, connected components, strongly-connected components, and union-find are all the same act — extracting the blocks of a partition. We just did it with nothing but closures and class extraction.

Phase 4: Pick representatives and emit the merged records

A deduplicator must output one record per real person. Any element of a class is a valid representative (§12.4); a sensible policy is "smallest id," and a sensible merge is "fill missing fields from clustermates." We keep the merge simple: take the representative's name, and the first non-empty email and phone found in the cluster.

def merge_cluster(cluster, records):
    rep = min(cluster)                                  # representative = smallest id
    name = records[rep][0]
    email = next((records[i][1] for i in sorted(cluster) if records[i][1]), "")
    phone = next((records[i][2] for i in sorted(cluster) if records[i][2]), "")
    return (rep, name, email, phone)

def deduplicate(records):
    M = raw_matches(records)
    E = equivalence_closure(M, set(records))
    clusters = equivalence_classes(E, set(records))
    return [merge_cluster(c, records) for c in clusters]

for row in deduplicate(records):
    print(row)
# Expected output:
# (1, 'Ada Lovelace', 'ada@calc.org', '555-0100')
# (2, 'Grace Hopper', 'grace@navy.mil', '555-0200')
# (5, 'Alan Turing', 'turing@bletchley.uk', '555-0300')

Hand-derive the merge. For cluster $\{1, 3, 4\}$: representative is min = 1, name Ada Lovelace; first non-empty email scanning ids $1, 3, 4$ is record 1's ada@calc.org; first non-empty phone is record 1's 555-0100. For $\{2, 6\}$: rep 2, name Grace Hopper, email grace@navy.mil, phone 555-0200 (record 2's, found before record 6's). For $\{5\}$: rep 5, all of Alan's own fields. Three merged records out of six raw rows — the duplicates are gone, and crucially, no record contributed to two outputs, because the clusters partitioned the input.

🐛 Find the Error. A teammate "optimizes" equivalence_closure by dropping the reflexive_closure line, reasoning that "every record matches itself anyway." Trace deduplicate(records) with that change and identify which record disappears from the output, and why. (Hint: which record never appears as the first coordinate of any pair in the transitive-only closure?)

Answer

Record 5 matches nobody, so without reflexive closure it appears in no pair of $E$. Then in equivalence_classes, when $a = 5$ we compute $[5] = \{x \mid (5, x) \in E\} = \emptyset$, so the cluster {5} is never formed (the empty set adds nothing to seen, and 5 is silently never emitted). The output would contain only the Ada and Grace clusters — Alan Turing vanishes. Reflexivity is exactly what guarantees every element is in its own class; dropping it loses every record that matched nothing.

Discussion Questions

Our merge policy ("smallest id is representative; first non-empty field wins") is one of many. Suppose two records in a cluster have different non-empty emails. Does the choice of representative affect which records cluster together, or only how they are merged? Tie your answer to the fact that any element of a class is a valid representative.
The rules were "share an email or share a phone." Suppose we changed them to "share an email and share a phone." Is the resulting raw match relation still symmetric? Would the equivalence-closure step still be necessary, and would the clusters get larger or smaller?
We rebuilt equivalence_closure from three small closures. In production with millions of records, this is far too slow. The same partition-into-clusters task is solved by the union-find data structure in near-linear time. What does each closure step correspond to in union-find terms (hint: union, and the "same set?" query)?
The Find-the-Error box showed that dropping reflexivity silently loses singletons. Why is "silently" the dangerous word here — and what single test case over the chapter's partition definition would have caught the bug immediately?

Your Turn: Extensions

Option A (add a rule, keep the math). Add a third matching rule: "two records match if their names are equal after lowercasing and removing periods/spaces" (so A. Lovelace ~ Ada Lovelace? decide and justify). Re-derive the raw matches and the clusters by hand, and confirm the partition property still holds.
Option B (union-find). Implement the same deduplicate using a union-find (disjoint-set) structure instead of closures: union(i, j) for each raw match, then group by find(i). Argue that the groups it produces are exactly the equivalence classes of the equivalence closure — i.e., that union-find computes a partition.
Option C (verify the partition). Write assert_partition(clusters, A) that checks the three partition conditions (non-empty, pairwise disjoint, union equals A) and call it on the output of equivalence_classes. Explain why this assertion should never fire if the input to equivalence_classes is a true equivalence relation — and why it would fire if you fed it the transitive-only closure with record 5 missing.

Key Takeaways

Pairwise match hints are not an equivalence relation. They are typically symmetric but neither reflexive nor transitive; clustering on them directly is unsafe.
The equivalence closure fixes that — all three closures, not just transitive. Reflexive closure guarantees singletons survive; symmetric closure guarantees mutual matching; transitive closure links chains. Drop any one and the deduplicator breaks (and often silently).
Clusters are equivalence classes, and the partition theorem is the correctness proof. Because the classes of an equivalence relation are non-empty, disjoint, and covering, "no record in two clusters" is guaranteed by §12.4's theorem, not by careful coding.
Representatives make the output. Any element of a class can stand for it; choosing one (smallest id) and merging fields turns the partition into a clean, deduplicated table — three records out of six, with no double-counting.