Case Study: Auditing a Hash Function for Collisions

DataField.Dev

Case Study: Auditing a Hash Function for Collisions

"A function is a black box that turns an input into exactly one output. Everything else is detail — but the detail is where the power is."

Executive Summary

A teammate has shipped a small in-memory cache keyed by a homegrown hash function, and the cache is "mysteriously" returning the wrong values for some keys. You'll analyze the hash as a mathematical function — pin down its domain, codomain, and range; decide whether it is injective; compute the preimages that reveal exactly which keys collide; and explain, in this chapter's vocabulary, both why the bug exists and why it was guaranteed to exist before a single line ran. The point of this case study is diagnosis, not construction: you will read code you did not write, model it precisely, and use §9.3's finite theorem to turn "it's flaky" into "here is the theorem it violates and here are the exact colliding inputs."

Skills applied

Modeling a piece of code as a function $f\colon A \to B$ with an explicit domain and codomain (§9.1–9.2).
Computing the range and distinguishing it from the codomain (§9.2).
Deciding injectivity and locating collisions via preimages $f^{-1}(\{v\})$ (§9.2–9.3).
Applying the finite injective/surjective theorem and its pigeonhole reading to prove collisions unavoidable (§9.3).
Connecting non-injectivity to a concrete correctness failure in a cache (§9.6).

Background

The scenario

The team's cache is supposed to map short product codes (strings) to prices. To avoid storing strings, a previous engineer hashes each code to a small integer index and stores prices in a fixed-size list. Here is the hash and the cache, lightly simplified for review:

def code_hash(s, table_size=8):
    """Map a product code (string) to an index in 0..table_size-1."""
    total = 0
    for ch in s:
        total += ord(ch)          # add the code point of each character
    return total % table_size

class Cache:
    def __init__(self, table_size=8):
        self.size = table_size
        self.slots = [None] * table_size   # slots[i] = (code, price) or None

    def put(self, code, price):
        self.slots[code_hash(code, self.size)] = (code, price)

    def get(self, code):
        entry = self.slots[code_hash(code, self.size)]
        return entry[1] if entry is not None else None

The cache stores one entry per slot and overwrites silently on a clash. The team reports: after caching prices for the codes "ab", "ba", and "c!", a lookup for "ab" sometimes returns the wrong price. Your job is to explain why — precisely, in the language of functions — and to predict exactly which codes are unsafe to store together.

💡 Intuition: A cache slot is a mailbox. code_hash decides which mailbox a code's price goes into. If two different codes are assigned the same mailbox, the second delivery overwrites the first — and a later reader of the first code opens the mailbox to find someone else's mail. "Two inputs, one output" is exactly a collision, and a cache with one slot per output cannot survive one. The whole bug is a non-injective function wearing a costume.

Why this matters

Hash-backed lookups sit under dictionaries, sets, caches, routers, and databases. When a hash collides and the surrounding code does not handle the collision, you get silent data corruption — the worst kind of bug, because nothing crashes. Reading the hash as a function $f\colon \text{codes} \to \text{slots}$ tells you immediately what to check (is it injective on the keys we actually use?) and what is impossible to fix by "improving the hash" alone (over a large enough key set, collisions are a theorem). This is theme one of the book in the wild: the dict is not like a function, it is one, and its failure modes are a function's failure modes.

Phase 1: Model the hash as a function

Before debugging, name the object. Restrict attention to two-character codes drawn from a small character set, so we can reason exhaustively; the same analysis scales.

Domain $A$: the set of codes we will actually cache. For the bug report, $A \supseteq \{$ "ab", "ba", "c!" $\}$. More generally, take $A$ to be all length-2 strings over a fixed alphabet.
Codomain $B$: the slot indices $\{0, 1, 2, 3, 4, 5, 6, 7\}$ — by construction, because the function ends in % 8.
Rule: $f(s) = \big(\sum_{\text{ch} \in s} \operatorname{ord}(\text{ch})\big) \bmod 8$.

So $f\colon A \to \{0,\dots,7\}$. Note immediately that $f$ factors as a composition (§9.4): first $s \mapsto \sum \operatorname{ord}(\text{ch})$ (a digit-sum-like map into $\mathbb{Z}$), then $t \mapsto t \bmod 8$ (the §9.5 mod function onto $\{0,\dots,7\}$). The second stage is surjective but not injective — the textbook fingerprint of a collision source.

🔗 Connection: We met $x \mapsto x \bmod n$ in §9.5 and watched it map a large domain onto the small set $\{0,\dots,n-1\}$, hitting every remainder (surjective) while infinitely many integers share each remainder (not injective). Every modular hash inherits that non-injectivity. Chapter 17 proves the resulting collisions unavoidable via the pigeonhole principle; Chapter 26 builds hash tables that handle collisions instead of corrupting on them.

🔄 Check Your Understanding Why is it useful, for an audit, to write $f$ as a composition of "sum the code points" and "reduce mod 8" rather than as one opaque rule?

Answer

The composition isolates where injectivity dies. The sum-of-code-points stage is already non-injective (two different strings can have the same character-sum), and the mod-8 stage is non-injective again (many integers share a remainder). Naming the stages tells you that both layers create collisions, so "use a bigger table" (fixing only the second stage) cannot make $f$ injective on a large key set.

Phase 2: Compute the range — codomain is not range

The codomain is all eight slots. But does $f$ actually use all eight on the team's keys? Compute the character-sums first (using $\operatorname{ord}$: a=97, b=98, c=99, !=33):

"ab": $97 + 98 = 195$, and $195 \bmod 8 = 3$ (since $195 = 24\cdot 8 + 3$).
"ba": $98 + 97 = 195$, and $195 \bmod 8 = 3$.
"c!": $99 + 33 = 132$, and $132 \bmod 8 = 4$ (since $132 = 16\cdot 8 + 4$).

def code_hash(s, table_size=8):
    return sum(ord(ch) for ch in s) % table_size

codes = ["ab", "ba", "c!"]
hashes = {c: code_hash(c) for c in codes}
print(hashes)
print("range on these keys:", sorted(set(hashes.values())))

# Expected output:
# {'ab': 3, 'ba': 3, 'c!': 4}
# range on these keys: [3, 4]

The range on these three keys is $\{3, 4\}$ — a strict subset of the codomain $\{0,\dots,7\}$. That already tells us $f$ is not surjective onto the eight slots from this key set (six slots go unused). More importantly, two distinct keys, "ab" and "ba", both map to slot $3$: that is the collision.

⚠️ Common Pitfall: Concluding "the table is too small" from the unused slots. The opposite is true here — six of eight slots are empty, yet there is still a collision. Collisions are about two inputs sharing one output, not about running out of room. A table can be mostly empty and still corrupt data if the hash sends two live keys to the same slot.

Phase 3: Decide injectivity and locate every collision via preimages

"Is $f$ injective on the keys we use?" is the precise question. By §9.3, $f$ is injective iff distinct inputs give distinct outputs. We already have a counterexample: $f(\text{`ab'}) = f(\text{`ba'}) = 3$ with "ab" $\ne$ "ba". So $f$ is not injective on this key set. One collision is all it takes to disprove injectivity (a counterexample, Chapter 2).

To find all unsafe groupings, compute the preimage of each slot — the set of keys landing in it, $f^{-1}(\{v\}) = \{\, s \in A \mid f(s) = v \,\}$ (§9.2). Any preimage with two or more keys is a collision set: those keys cannot coexist in the cache. Let's widen the domain to a small fixed alphabet and group by slot.

def code_hash(s, table_size=8):
    return sum(ord(ch) for ch in s) % table_size

alphabet = "abc!"
domain = [a + b for a in alphabet for b in alphabet]   # all length-2 codes

preimages = {v: [] for v in range(8)}
for s in domain:
    preimages[code_hash(s)].append(s)

for v in range(8):
    if preimages[v]:
        print(v, preimages[v])

# Expected output:
# 2 ['aa', 'a!', '!a', '!!']
# 3 ['ab', 'ba', 'b!', '!b']
# 4 ['ac', 'bb', 'ca', 'c!', '!c']
# 5 ['bc', 'cb']
# 6 ['cc']

Read the preimages as collision sets. Slot $3$ has preimage $\{$"ab", "ba", "b!", "!b"$\}$ — and the first two are the reported bug. Slot $4$ has a five-element preimage $\{$"ac", "bb", "ca", "c!", "!c"$\}$: any two of these five corrupt each other, and note "c!" (from the bug report) is among them, so caching "c!" alongside, say, "bb" would clash too. The large preimage of slot $4$ is the §9.2 fingerprint of a badly non-injective function: the preimage of a small set (one slot) is large. (Slots $0$, $1$, and $7$ are empty over this alphabet — no length-2 code sums to those residues mod $8$ — another reminder that empty slots and collisions happily coexist.)

🔗 Connection: Notice why slot 4 is crowded. Our hash sums code points, so any two codes with the same character multiset — like "ac", "ca", and "bb" (because $97+99 = 98+98$) — are forced to the same character-sum before the mod even happens. This is the first composition stage failing injectivity, exactly as Phase 1 predicted. A hash that ignores character order cannot distinguish anagrams; "ab" and "ba" are the smallest example.

🔄 Check Your Understanding Using the preimage table, which of these pairs are safe to cache together, and which collide: (i) "aa" and "cc"; (ii) "ac" and "ca"; (iii) "a!" and "b!"?

Answer

(i) Safe — "aa" is in slot 2, "cc" in slot 6. (ii) Collide — both in slot 4 (anagrams, same character-sum). (iii) Safe — "a!" is in slot 2, "b!" in slot 3 (different slots). The preimage table is the safe/unsafe oracle: two keys are unsafe exactly when they share a preimage.

Phase 4: Prove the collision was unavoidable

The bug is not bad luck — it is forced. Here the chapter's finite theorem does the heavy lifting.

Claim. Once the number of distinct codes stored exceeds the table size (here, more than $8$), the hash $f\colon \text{codes} \to \{0,\dots,7\}$ cannot be injective on those codes — at least two must collide.

Analysis. Suppose we store $k$ distinct codes with $k > 8$. Restrict $f$ to that $k$-element set of codes; its codomain still has only $8$ elements. An injective function maps its domain to a set of outputs of the same size as the domain (§9.3: distinct inputs give distinct outputs, so the range has exactly as many elements as the domain). If $f$ were injective on $k > 8$ codes, its range would need $k > 8$ distinct slots — but only $8$ exist. A subset of an $8$-element set cannot have more than $8$ elements. Contradiction. Therefore $f$ is not injective on any set of more than $8$ codes. $\blacksquare$

This is exactly the §9.3 reasoning read in reverse — the pigeonhole principle: placing more than $8$ items into $8$ slots forces some slot to receive two. No choice of hash function avoids it once the key count exceeds the slot count; the only cures are more slots (which merely raises the threshold) or handling collisions instead of overwriting (Chapter 26). The current cache does neither, so corruption is inevitable as it fills.

⚠️ Common Pitfall: "If we pick a really good hash, we won't get collisions." A good hash makes collisions rare and unpredictable for typical inputs; it cannot make them impossible when the domain outnumbers the codomain. That impossibility is a theorem, not an engineering shortfall — the finite theorem of §9.3 guarantees it. Good hashing plus collision handling is the real fix.

Phase 5: From diagnosis to fix (in this chapter's terms)

The audit yields three precise statements the team can act on:

The hash is not injective, so the cache — which keeps one entry per slot — is only correct on key sets with no two keys sharing a preimage. The cache silently violates the function definition: a get(code) is supposed to be a function of code, but with overwrites it depends on insertion order, so "same input, same output" fails (§9.1, §9.6). It is, in the language of §9.6, no longer a pure lookup.
The collision set of any slot is its preimage $f^{-1}(\{v\})$. Phase 3's table is the exact list of unsafe groupings; the team can reproduce the bug deterministically with any two keys from one preimage.
Collisions are unavoidable beyond $8$ keys (Phase 4). The fix is not "a better hash" but collision handling: store a small list per slot (chaining) and keep the code in the entry so get can find the right one. That restores correctness by making the cache's lookup a genuine function of code again, even though the hash itself stays non-injective.

class ChainedCache:
    def __init__(self, table_size=8):
        self.slots = [[] for _ in range(table_size)]
        self.size = table_size

    def put(self, code, price):
        bucket = self.slots[code_hash(code, self.size)]
        for i, (c, _) in enumerate(bucket):
            if c == code:
                bucket[i] = (code, price)   # update in place
                return
        bucket.append((code, price))        # new key in this slot

    def get(self, code):
        for c, price in self.slots[code_hash(code, self.size)]:
            if c == code:
                return price
        return None

cache = ChainedCache()
for code, price in [("ab", 10), ("ba", 20), ("c!", 30)]:
    cache.put(code, price)
print(cache.get("ab"), cache.get("ba"), cache.get("c!"))

# Expected output:
# 10 20 30

The chained cache stores "ab" and "ba" in the same slot (still slot 3 — the hash is unchanged and still non-injective) but distinguishes them by keeping the code in the entry. Now get("ab") returns 10 regardless of insertion order: the lookup is once again a function of the code. We did not make the hash injective — we made the surrounding lookup tolerate the hash's non-injectivity. That is the whole discipline of hash tables, and it rests entirely on this chapter's vocabulary.

Discussion Questions

The audit modeled code_hash as a composition of "sum code points" and "reduce mod 8." Which of the two stages would you change to distinguish anagrams like "ab" and "ba", and would that change make the overall function injective on all length-2 codes? Why or why not (cite the finite theorem)?
The original Cache.get is not a pure function of its argument once overwrites happen. State, in the language of §9.1, exactly which part of the function definition it violates and what hidden input it secretly depends on.
Suppose you raise table_size from $8$ to $1024$. Does that make code_hash injective? Explain the precise sense in which it "helps" without ever making collisions impossible.
The preimage of slot $4$ had five elements while slot $1$ had one. In a real cache, why is a hash that produces wildly uneven preimage sizes worse than one with evenly sized preimages, even though both have collisions? (Connect to the cost of scanning a chained bucket, Chapter 14's lens.)
The fixed ChainedCache.get scans a bucket comparing codes. Argue that, given the stored entries, this lookup now satisfies "exactly one output per input" — i.e., it is a genuine function of code.

Your Turn: Extensions

Option A (quantify the damage). Write collision_report(f, domain) that returns a dict mapping each output value to its preimage list, then prints only the entries whose preimage has size $\ge 2$. Run it (by hand-tracing) on a domain of all length-2 codes over "abcde" and identify the most crowded slot. State, in one sentence, the §9.2 phenomenon you are measuring.
Option B (order-sensitive hash). Replace code_hash with a polynomial hash $h(s) = \big(\sum_i \operatorname{ord}(s_i)\cdot 31^{i}\big) \bmod m$ so that "ab" and "ba" differ. Recompute the preimages on the same domain and confirm the anagram collisions disappear while new collisions appear elsewhere. Explain why the finite theorem still forbids injectivity once $\lvert \text{domain} \rvert > m$. (This is the hash you will study properly in Chapter 26.)
Option C (preimage as a set operation). Using the Toolkit's functions.py, write the predicate "codes $c_1, c_2$ are safe to cache together" purely in terms of code_hash and confirm it agrees with membership in the same preimage. Tie your answer to the §9.3 definition of injectivity.

Key Takeaways

Model code as a function first. Naming the domain, codomain, and rule — and spotting the composition sum ∘ mod — told us where injectivity dies before we touched a debugger.
Preimages are the collision map. $f^{-1}(\{v\})$ is exactly the set of keys that share slot $v$; any preimage of size $\ge 2$ is an unsafe grouping, and a small set with a large preimage is the signature of a badly non-injective function.
Collisions are a theorem, not a bug to out-engineer. Once keys outnumber slots, §9.3's finite theorem (the pigeonhole principle in reverse) forbids injectivity. The cure is collision handling, not a "perfect" hash.
A correct lookup must be a real function of its key. The original cache's get secretly depended on insertion order, breaking "same input, same output"; chaining restored it. Correctness here is the §9.1 function definition, enforced.