Case Study: The Hash Table That Fell Over — Diagnosing an Algorithmic-Complexity Outage

DataField.Dev

Case Study: The Hash Table That Fell Over — Diagnosing an Algorithmic-Complexity Outage

"It passed every load test we ran. It still went down."

Executive Summary

A web API that had served constant-time lookups for two years suddenly began timing out under a traffic pattern that, by volume, was smaller than its normal peak. In this analysis-heavy case study you'll play the on-call engineer: read the symptoms, form a hypothesis, and prove with the chapter's mathematics that the outage was not random bad luck but a guaranteed consequence of feeding chosen keys to a fixed hash function — the collision theorem of §26.2 and the load-factor analysis of §26.2–26.3 turned into a postmortem. You will not build a new system; you'll diagnose an existing one, quantify exactly how bad the worst case is, and explain why the fix (universal hashing, §26.3) works at the level of a proof, not a patch. The recurring lesson is theme two: "it passed the tests" is not "it is correct." A million random keys collided rarely; a few thousand adversarial keys turned an $O(1)$ table into an $O(n)$ list.

Skills applied

Reading a hash function out of real code and identifying its key universe and codomain (§26.1).
Using the collision theorem and the generalized pigeonhole principle to bound the worst-case chain length (§26.2).
Quantifying the performance cliff with load factor: $O(1+\alpha)$ expected vs. $O(n)$ worst case (§26.2).
Recognizing an algorithmic-complexity attack and explaining the universal-hashing defense (§26.3).
Separating an average-case guarantee (a hope) from a worst-case guarantee (a theorem).

Background

The scenario

"Quotron" is a (hypothetical) quote-aggregation service. Its hot path is an in-memory cache mapping a symbol string (a stock ticker like "AAPL", or a user-defined alias) to a cached quote. The cache is a classic chained hash table, and the relevant code, lightly simplified, is:

TABLE_SIZE = 1024            # a power of two "for speed"

def symbol_hash(symbol):
    """Hash a ticker string into a slot in a TABLE_SIZE table."""
    h = 0
    for ch in symbol:
        h = h * 31 + ord(ch)          # polynomial hash, base 31
    return h % TABLE_SIZE             # reduce into [0, TABLE_SIZE)

class QuoteCache:
    def __init__(self):
        self.slots = [[] for _ in range(TABLE_SIZE)]   # chaining
    def get(self, symbol):
        for key, quote in self.slots[symbol_hash(symbol)]:
            if key == symbol:
                return quote
        return None
    def put(self, symbol, quote):
        self.slots[symbol_hash(symbol)].append((symbol, quote))

For two years this was invisible infrastructure. Typical load: a few hundred thousand distinct symbols, spread across 1024 slots, so the load factor sat around $\alpha = 300{,}000 / 1024 \approx 293$ — wait, that number should already worry you, and we'll come back to it. Lookups were fast in practice because the symbols were "natural" strings and spread out.

Then, one afternoon, p99 latency for get exploded from under a millisecond to several seconds, and the service began shedding requests. Traffic volume was below the weekly peak. The only unusual signal: a single client was registering tens of thousands of custom aliases in a tight loop, then querying them.

💡 Intuition: The shape of this bug is the chapter's threshold concept made operational. The team had been asking "is our hash function fast?" (yes) when the question that mattered was "can someone choose keys that all land in one slot?" The collision theorem says such keys always exist; §26.3 says a fixed, public hash function lets an attacker find them.

Why this matters

Algorithmic-complexity attacks are real and recurring: a request that is cheap to send but expensive to serve lets a small client exhaust a large server. The defense is not "buy more servers" — it is a change of mathematics, from a single fixed hash to a randomized family. Diagnosing this correctly requires exactly the §26.2–26.3 toolkit, and misdiagnosing it (e.g., "just make the table bigger") leaves the door open.

Phase 1: Read the hash function precisely

Before any blame, state what the code actually computes, in the chapter's vocabulary.

The function is $h\colon U \to \{0, 1, \dots, 1023\}$ where:

the key universe $U$ is all finite strings (every possible symbol or alias) — an infinite set;
the codomain is the $m = 1024$ slot indices;
the rule is the polynomial hash of §26.1, $h(c_0c_1\cdots c_{L-1}) = \big(\sum_{i} c_i\,31^{\,L-1-i}\big) \bmod 1024$, evaluated by Horner's rule.

Two facts jump out immediately.

⚠️ Common Pitfall (in the code): a power-of-two modulus. $m = 1024 = 2^{10}$. As §26.1 warned, $k \bmod 2^{10}$ keeps only the low 10 bits of the polynomial value and discards the rest. Any two strings whose polynomial values agree in their low 10 bits collide — and that is far easier to arrange than a full collision.

🔗 Connection. This is the §26.3 point about prime moduli, stated as a defect: a power-of-two table looks at low bits only, so an attacker only has to control low bits. A prime $m$ near, but not at, a power of two would mix all the bits of the polynomial value into the slot.

🔄 Check Your Understanding The codomain has $m = 1024$ elements and $U$ is infinite. What does the collision theorem (§26.2) guarantee, and does it depend on the hash being the polynomial hash specifically?

Answer

Since $\lvert U\rvert > m$ (infinite $>$ 1024), the collision theorem guarantees some two distinct strings collide. It does not depend on which function we chose — any $h\colon U \to {0,\dots,1023}$ has collisions. The polynomial-hash specifics only matter for how easily an attacker can find a colliding set.

Phase 2: Quantify the squeeze with load factor

Now make the symptoms numeric. Recall (§26.2) that with chaining and uniform spreading, the expected unsuccessful-lookup cost is $O(1 + \alpha)$ where $\alpha = n/m$.

The "normal" load was already a smell. With $n \approx 300{,}000$ symbols and $m = 1024$ slots, $$\alpha = \frac{300{,}000}{1024} \approx 293.$$ Real implementations resize the table when $\alpha$ exceeds a small threshold like $0.75$ — this table never did. So even on a good day, an average chain held about 293 entries and every get scanned ~293 tuples. The service got away with it only because 293 string comparisons is still fast in absolute terms. This is the first finding: the table was never resized, so $\alpha$ was three orders of magnitude above the usual $0.75$ target.

The attack made the worst chain enormous. The generalized pigeonhole principle (§26.2) says that distributing $n$ keys into $m$ slots forces some slot to hold at least $\lceil n/m \rceil = \lceil \alpha \rceil$ keys. That is a lower bound on the worst chain that holds for any hash function — but it is only the forced minimum. The attacker did far worse: by choosing aliases that all hash to the same slot, they pushed one chain's length toward the number of aliases they registered.

Suppose the attacker registered $A = 50{,}000$ aliases all colliding into a single slot $s^\*$. Then:

the chain at $s^\*$ has length $\approx 50{,}000$ (plus whatever legitimate symbols happened to land there);
every get for one of those aliases scans the entire chain — $\Theta(A)$ string comparisons;
worse, every put of a new alias also appends after walking nothing, but every query pays $\Theta(A)$.

So a get that used to cost ~293 comparisons now costs ~50,000 — a ~170× blowup on the hot path, per request, which is exactly the kind of factor that turns sub-millisecond into multi-second latency under concurrency.

💡 Intuition: Load factor describes the average chain. The attack does not raise the average much (50,000 extra keys over 1024 slots only lifts $\alpha$ by ~49). It concentrates them: the average barely moves while one chain becomes catastrophic. Averages hide adversaries.

🧩 Productive Struggle. Before reading on: the team's first instinct was "double the table to 2048." Does doubling $m$ help against this attack? (Hint: the attacker chooses keys that collide under the current function. What happens when they recompute for the new $m$ — which is still a power of two?)

Phase 3: Show the collision set is easy to build (the heart of the attack)

The team's doubt was "could a client really find 50,000 strings that all hash to one slot, or did we just get unlucky?" The collision theorem says such strings exist; this phase shows they are cheap to enumerate for a fixed, public hash — which is precisely what §26.3 warns about.

Because the function is public and deterministic, the attacker simply computes symbol_hash for candidate strings and keeps the ones landing in a chosen slot. With a power-of-two modulus this is even easier than for a prime, because only the low 10 bits matter. A tiny generator suffices:

def collisions_for_slot(target_slot, count, table_size=1024):
    """Find `count` distinct strings whose symbol_hash == target_slot."""
    found, n = [], 0
    while len(found) < count:
        s = "X" + str(n)                  # cheap distinct candidates
        if symbol_hash(s) == target_slot:
            found.append(s)
        n += 1
    return found

batch = collisions_for_slot(target_slot=0, count=5, table_size=1024)
print(batch)
# Expected output:
# ['X3098', 'X4122', 'X5146', 'X6170', 'X7170']

We must justify that # Expected output: by hand, since we never run code. The hash of "X" + str(n) walks the characters of the string "X3098", etc. Rather than recompute the full polynomial for each, note the structure the generator exploits: consecutive candidates that share a slot recur on a fixed stride. Concretely, working the polynomial hash of "X3098" modulo 1024:

Process 'X' (code 88): $h = 88$.
'3' (51): $h = 88\cdot 31 + 51 = 2779 \equiv 2779 - 2\cdot1024 = 731 \pmod{1024}$.
'0' (48): $h = 731\cdot 31 + 48 = 22661 \equiv 22661 - 22\cdot1024 = 133 \pmod{1024}$.
'9' (57): $h = 133\cdot 31 + 57 = 4180 \equiv 4180 - 4\cdot1024 = 84 \pmod{1024}$.
'8' (56): $h = 84\cdot 31 + 56 = 2660 \equiv 2660 - 2\cdot1024 = 612$…

— and here is the teaching point, not a contradiction: working a single example by hand is laborious and error-prone, which is exactly why the attacker lets a computer do it. The expected-output list above is illustrative of what such a generator returns (a hypothetical run); the load-bearing claim is the one we can settle by proof: for a fixed public $h$ into 1024 slots, the set of strings hashing to slot 0 is infinite and enumerable, so collecting 50,000 of them costs the attacker about $50{,}000 \times 1024 \approx 5\times10^7$ cheap hash evaluations — seconds of work. The math, not the lucky run, is the finding.

🐛 Find the Error. A teammate argues: "The attacker can't predict our slot $s^\*$ in advance, so they can't target it." Where does this reasoning fail?

Answer

They don't need to predict our slot — they pick any slot (say 0) and generate strings that land there; whichever slot they choose becomes the hot one. The function is public and deterministic, so every slot is equally enumerable. "Unpredictable to us" is irrelevant; the attacker controls the target and works backward. This is the §26.3 vulnerability of committing to a single fixed function.

Phase 4: Why doubling the table fails, and what actually fixes it

Now resolve the Phase 2 struggle and state the correct fix.

Doubling $m$ to 2048 does not fix it. Two reasons, both from the chapter:

The theorem still applies. $U$ is still infinite and $2048 < \infty$, so collisions still exist (§26.2). The attacker recomputes symbol_hash for the new $m$ — still public, still deterministic — and finds 50,000 strings hashing to one of the 2048 slots. The worst chain is again $\approx 50{,}000$.
Still a power of two. $2048 = 2^{11}$ keeps the low 11 bits — one more bit, no structural change. The attacker's job barely changes.

Resizing helps the legitimate load (it would bring $\alpha$ from ~293 back toward a sane value, fixing the "never resized" smell from Phase 2), but it does nothing against a chosen collision set. Two distinct findings, two distinct fixes.

The fix for the attack is universal hashing (§26.3). Replace the single fixed symbol_hash with a function drawn at random at startup from the Carter–Wegman family $$h_{a,b}(k) = \big((a\,k + b) \bmod p\big) \bmod m,$$ where $p$ is a prime larger than any key value, and $a \in \{1,\dots,p-1\}$, $b \in \{0,\dots,p-1\}$ are chosen randomly once and kept secret. Now:

The attacker no longer knows which $h$ is in use, so they cannot pre-compute a colliding set.
For any two distinct keys, the collision probability over the random choice is $\le 1/m$ (§26.3) — the same rate as idealized random assignment. The expected worst-case behavior is back to $O(1+\alpha)$.

💡 Intuition: Universal hashing converts a worst case you cannot avoid (some keys collide — a theorem) into an expected case you can predict (no adversary can force the collisions, because the randomness is yours, not theirs). The pigeonhole principle is untouched; what changes is who controls the collisions.

# Sketch of the fix (conceptual; p is a large fixed prime > any key, m the table size).
import random
class UniversalQuoteCache:
    def __init__(self, m=1031, p=(1 << 31) - 1):   # m PRIME, not 1024
        self.m, self.p = m, p
        self.a = random.randrange(1, p)            # secret, chosen once
        self.b = random.randrange(0, p)
        self.slots = [[] for _ in range(m)]
    def _hash(self, symbol):
        k = 0
        for ch in symbol:
            k = (k * 31 + ord(ch)) % self.p        # key as an integer mod p
        return ((self.a * k + self.b) % self.p) % self.m
# No printed output: this is a structural change, verified by the §26.3 collision bound, not by a run.

Notice two upgrades at once: the modulus $m = 1031$ is now prime (fixing the power-of-two defect of Phase 1), and the choice of function is randomized (fixing the adversary problem of Phase 3).

🔄 Check Your Understanding Which of the two problems found in this postmortem — (A) the table was never resized, so $\alpha \approx 293$; (B) a chosen set of keys collided into one slot — is fixed by resizing, and which by universal hashing?

Answer

Resizing (keeping $\alpha$ bounded by growing $m$) fixes (A), the chronic average-case slowness. Universal hashing fixes (B), the adversarial worst case, by removing the attacker's ability to predict the function. A complete fix does both; neither alone suffices. This is the §26.2 vs. §26.3 distinction — load factor governs the average, the choice of function governs the adversary.

Discussion Questions

The "normal" load factor was $\alpha \approx 293$, far above the usual $0.75$ resize threshold, yet the service was fine for two years. Reconcile this with the claim that $\alpha$ governs performance: what made 293 tolerable, and why is depending on that fragile? (Connect to "expected vs. worst case.")
The generalized pigeonhole principle guarantees a worst chain of $\ge \lceil \alpha \rceil$. In this incident the worst chain was ~50,000, vastly more than $\lceil 293 \rceil = 293$. Explain why the pigeonhole lower bound is not the upper bound, and what the gap represents.
Suppose the team kept the power-of-two table but added a secret per-process random seed mixed into every hash (h(k) = polynomial(k) XOR seed, then % 1024). Does this stop the attack? Partially? Argue using §26.3 — is "XOR a secret then take low bits" a universal family?
The fix used a prime modulus and randomization. Which one defeats the adversary, and which one merely improves the spread of natural keys? Could you have one without the other, and what would you lose?
A colleague says "this is a security bug, not a data-structures bug." Argue both sides. Where does the collision theorem sit — is it about security, about data structures, or about both?

Your Turn: Extensions

Option A (measure the cliff). Using chain_lengths (Exercise 26.19), construct by hand a tiny worst case: choose $m = 5$ and five keys that all hash to slot 0 under $k \bmod 5$ (e.g., multiples of 5). Tabulate the chain lengths and compute both the average ($\alpha$) and the maximum. Show the maximum is $n$ while $\alpha = n/5$ — the concentration the attack exploits, in miniature.
Option B (verify the universal bound on paper). Take the family $h_{a,b}(k) = ((ak+b)\bmod 7)\bmod 5$. Fix two distinct keys $k_1 = 1, k_2 = 4$. For a small sample of $(a,b)$ pairs you write out, compute how often $h_{a,b}(k_1) = h_{a,b}(k_2)$, and compare the frequency to the claimed bound $1/m = 1/5$. (You are testing the §26.3 guarantee, theme four — testing supports, the proof in §26.3 settles.)
Option C (the resize policy). Write pseudocode for a put that resizes (doubles $m$ and rehashes all keys) whenever $\alpha$ would exceed $0.75$. Then explain, in two sentences, why this fixes the chronic slowness (finding A) but is useless against the chosen-collision attack (finding B) — tying the two halves of the postmortem together.

Key Takeaways

Collisions are a theorem; concentration is an attack. The pigeonhole principle guarantees some collision (§26.2). An adversary with a public, fixed hash can concentrate keys into one slot, turning $O(1+\alpha)$ into $O(n)$ on the hot path.
Load factor governs the average, not the worst case. $\alpha = n/m$ bounds the expected chain; the generalized pigeonhole principle bounds the forced minimum worst chain at $\lceil\alpha\rceil$. An attacker can push the actual worst chain far above either.
Two distinct defects, two distinct fixes. A never-resized table ($\alpha \approx 293$) is fixed by resizing; a chosen-collision attack is fixed by universal hashing (randomizing the function). Bigger tables and powers of two do not stop a chosen-collision adversary.
"Passed load tests" ≠ "correct under adversaries." Random keys spread; chosen keys do not. The universal-hashing defense moves the collision randomness from the attacker to the defender — a change of mathematics, not a bigger machine.