Case Study: Diagnosing a Hash Table That Collapsed to One Bucket

DataField.Dev

Case Study: Diagnosing a Hash Table That Collapsed to One Bucket

"The trouble with quotes on the internet is that you can't always be sure of their accuracy — but a remainder, at least, is exactly what the Division Algorithm says it is."

Executive Summary

A production cache is timing out. Lookups that should be $O(1)$ have degraded to $O(n)$, and the on-call engineer's profiler points at a single hash bucket holding every key. In this analysis-focused case study you will play that engineer: instead of guessing, you will use the number theory of this chapter — divisibility, gcd, and the Division Algorithm — to prove why the table degenerated, predict exactly how many buckets the keys can reach, and justify the standard fix (a prime-sized table). The lesson is the one that runs through the whole book: "it usually works" is not an explanation, and a remainder computation you can reason about beats a benchmark you can only observe.

Skills applied

Reading a hash function key % m as the remainder $r$ guaranteed by the Division Algorithm (§22.3).
Using gcd (§22.4) to count how many buckets a strided key set can occupy.
Applying the definition of divisibility (§22.1) to characterize collisions precisely.
Explaining, from coprimality (§22.5), why a prime table size repairs the distribution.
Distinguishing "tested and fast" from "provably well-distributed" (theme four).

Background

The scenario

A service stores session records in an open-addressing hash table. The hash is the textbook one:

def bucket(key, m):
    """Map an integer key into one of m buckets."""
    return key % m            # the Division Algorithm's remainder r, with 0 <= r < m

The table size is a comfortable power of two, m = 1024, chosen (as powers of two often are) because key % 1024 can be computed with a fast bit-mask. For months it was fine. Then a new upstream system began assigning session IDs as monotonic, evenly spaced integers: the IDs come in as $0, 4096, 8192, 12288, \dots$ — every ID a multiple of $4096$. Throughput cratered. The profiler shows load factor near zero on $1023$ buckets and catastrophic pile-up on one.

💡 Intuition: A hash table works when the hash scatters keys. The Division Algorithm tells us key % m is a perfectly well-defined remainder — but "well-defined" is not "well-scattered." If the keys all share a factor with $m$, the remainder can only land in a few places. That shared factor is a gcd, and this entire investigation is the story of one gcd that should have been $1$ and wasn't.

Why this matters

Hash tables are the most-used data structure in computing, and "choose a good table size" is folklore advice that most engineers follow without knowing why. The why is pure §22.4: the number of distinct buckets a strided key set can reach is controlled by $\gcd(\text{stride}, m)$. Understanding this turns a superstition ("use a prime size") into a theorem you can derive on a whiteboard during an incident.

Phase 1: Reproduce and quantify the failure

First, make the failure concrete and small enough to reason about by hand. Shrink the table to $m = 8$ and feed it the same kind of input — keys that are multiples of $4$ (stride $g = 4$):

def bucket(key, m):
    return key % m

keys = [0, 4, 8, 12, 16, 20, 24, 28]   # stride 4
m = 8
hits = [bucket(k, m) for k in keys]
print(hits)
print("distinct buckets used:", len(set(hits)))
# Expected output:
# [0, 4, 0, 4, 0, 4, 0, 4]
# distinct buckets used: 2

Eight keys, two buckets. Let us check that against the Division Algorithm by hand. For $k = 12$, $12 = 8 \cdot 1 + 4$, so $12 \bmod 8 = 4$; for $k = 16$, $16 = 8 \cdot 2 + 0$, so $16 \bmod 8 = 0$. The remainders alternate $0, 4, 0, 4, \dots$ exactly as printed. The table is not "randomly unlucky" — it is deterministically confined.

🔄 Check Your Understanding With stride $g = 4$ and $m = 8$, the keys reached buckets $\{0, 4\}$ — that is $8/4 = 2$ buckets. What is $\gcd(4, 8)$, and how does it relate to the count $2$?

Answer

$\gcd(4, 8) = 4$, and the number of reachable buckets is $m / \gcd(g, m) = 8/4 = 2$. The gcd is the spacing of the reachable buckets ($0, 4, 8 \equiv 0, \dots$), and there are exactly $m/\gcd$ of them.

Phase 2: Characterize a collision with divisibility

Before counting buckets, nail down what a collision even is, in the language of §22.1 and §22.3. Two keys $k_1$ and $k_2$ collide exactly when they leave the same remainder mod $m$: $$k_1 \bmod m = k_2 \bmod m.$$ By the Division Algorithm, $k_1 = m q_1 + r$ and $k_2 = m q_2 + r$ with the same $r$. Subtracting, $$k_1 - k_2 = m(q_1 - q_2),$$ which is precisely the statement $m \mid (k_1 - k_2)$.

Claim (collision criterion). Keys $k_1, k_2$ land in the same bucket of an $m$-bucket table if and only if $m \mid (k_1 - k_2)$.

Proof. ($\Rightarrow$) If they collide they share a remainder $r$, and the subtraction above gives $k_1 - k_2 = m(q_1 - q_2)$, so $m \mid (k_1 - k_2)$ by the definition of divisibility (witness $q_1 - q_2$). ($\Leftarrow$) If $m \mid (k_1 - k_2)$, write $k_1 - k_2 = mt$. Then $k_1 = k_2 + mt$, so $k_1 \bmod m = (k_2 + mt) \bmod m = k_2 \bmod m$ (adding a multiple of $m$ does not change the remainder — a one-line consequence of uniqueness in the Division Algorithm). Hence they collide. $\blacksquare$

This is the engineer's diagnostic lens. Our incoming keys are $k_i = g \cdot i$ for a fixed stride $g$. The difference of any two is $g(i - j)$ — always a multiple of $g$. The question "how concentrated are the collisions?" becomes "how much of $m$ does $g$ already account for?" — a gcd question.

🔗 Connection. The criterion $m \mid (k_1 - k_2)$ is the seed of congruence mod $m$, the opening definition of Chapter 23: $k_1 \equiv k_2 \pmod m$ will mean exactly $m \mid (k_1 - k_2)$. You are already doing modular arithmetic here; next chapter just gives it its name and its algebra.

Phase 3: Count the reachable buckets with a gcd

Now the central computation. The keys are the multiples of $g$. Which buckets can $g \cdot i \bmod m$ ever hit, as $i$ ranges over all integers?

Theorem (reachable buckets). The set of buckets reachable by keys $\{g \cdot i : i \in \mathbb{Z}\}$ in an $m$-bucket table is exactly the multiples of $d = \gcd(g, m)$ that lie in $\{0, 1, \dots, m-1\}$. There are exactly $m / d$ of them.

The strategy first. A bucket $r$ is reachable iff $g i \equiv r \pmod m$ for some $i$ — i.e. iff $r = g i - m q$ for some integers $i, q$. That right-hand side is an integer combination of $g$ and $m$, and §22.5 (Bézout) tells us the integer combinations of $g$ and $m$ are precisely the multiples of $\gcd(g, m)$. So the reachable buckets are exactly the multiples of $d$ in range.

Proof. Bucket $r$ (with $0 \le r < m$) is reachable iff there is an integer $i$ with $g i \bmod m = r$, equivalently $r = g i - m q$ for some integers $i, q$ (peeling off the quotient). Thus $r$ is reachable iff $r$ is an integer combination $g i + m(-q)$ of $g$ and $m$. By Bézout's identity and its converse (§22.5), the set of all integer combinations of $g$ and $m$ is exactly the set of multiples of $d = \gcd(g, m)$. Intersecting with $\{0, \dots, m-1\}$, the reachable buckets are $0, d, 2d, \dots, m - d$ — that is, the $m/d$ multiples of $d$ below $m$. (There are $m/d$ of them because $d \mid m$, so $m/d$ is a whole number.) $\blacksquare$

Apply it to the incident. The real keys were multiples of $g = 4096$ in a table of size $m = 1024$. Then $$d = \gcd(4096, 1024) = 1024 \quad (\text{since } 1024 \mid 4096),$$ so the number of reachable buckets is $$\frac{m}{d} = \frac{1024}{1024} = 1.$$ Every key lands in bucket $0$. The table did not "perform poorly" — it provably collapsed to a single bucket, and we can say so without running it on production traffic. The profiler's "one hot bucket" is a theorem.

⚠️ Common Pitfall: It is tempting to blame "bad luck" or "too many keys." Neither is the cause. Even two keys here would collide; even a lightly loaded table degenerates. The failure is structural — a gcd greater than $1$ between the key stride and the table size — not statistical.

Phase 4: Why a prime table size fixes it

The folklore fix is "use a prime table size." Now we can prove it does what's claimed. Suppose we re-size the table to a prime $m = p$ with $p$ larger than the stride's prime factors — concretely, take $p = 1031$ (a prime; $1024$ is $2^{10}$, and the next prime above it is $1031$). The keys are still multiples of $g = 4096 = 2^{12}$.

The reachable-bucket count is $m / \gcd(g, m) = p / \gcd(2^{12}, p)$. Because $p = 1031$ is an odd prime, it shares no factor of $2$ with $g$; so $\gcd(2^{12}, 1031) = 1$, and the keys can reach $$\frac{p}{1} = p = 1031 \text{ buckets} —$$ the entire table. More generally:

Corollary. If the table size $m$ is coprime to the key stride $g$ (in particular, whenever $m$ is a prime that does not divide $g$), then $\gcd(g, m) = 1$ and the keys can reach all $m$ buckets.

Let us watch the repair on the small model. Stride $g = 4$, but now a prime table size $m = 7$:

def bucket(key, m):
    return key % m

keys = [0, 4, 8, 12, 16, 20, 24, 28]   # stride 4
m = 7                                  # prime, coprime to 4
hits = [bucket(k, m) for k in keys]
print(hits)
print("distinct buckets used:", len(set(hits)))
# Expected output:
# [0, 4, 1, 5, 2, 6, 3, 0]
# distinct buckets used: 7

Hand-check two entries: $8 = 7 \cdot 1 + 1$ so $8 \bmod 7 = 1$; $24 = 7 \cdot 3 + 3$ so $24 \bmod 7 = 3$. The remainders march $0, 4, 1, 5, 2, 6, 3$ and only repeat once we have used all seven buckets — exactly the "reach everything" behavior the corollary promises, because $\gcd(4, 7) = 1$.

💡 Intuition: When $\gcd(g, m) = 1$, adding the stride $g$ each time acts like a generator that steps through every residue before returning home — it cycles through all $m$ buckets. When $\gcd(g, m) = d > 1$, the same stepping is trapped on the sub-lattice of multiples of $d$, visiting only $m/d$ of them. The whole difference between a healthy and a degenerate table is whether that gcd is $1$.

🐛 Find the Error: A colleague proposes: "Powers of two are fine — just XOR the key with a random constant before masking, that breaks the pattern." For the specific strided input above, explain why a fixed XOR constant $c$ does not change the reachable-bucket count when $m$ is a power of two.

Answer

XOR with a constant $c$ is a bijection on the low bits, but it is a translation, not a scrambling of spacing: $(g i) \oplus c$ for $m = 2^b$ still produces values whose pairwise differences in the low $b$ bits are governed by $g$'s factor of two. Concretely, the bucket count is determined by how many distinct values $((g i)\oplus c) \bmod 2^b$ takes; since $\oplus c$ just permutes a fixed set of reachable low-bit patterns, the number of distinct buckets is unchanged — still $m/\gcd(g,m)$. The structural fix is to break the factor sharing (a coprime/prime size or a multiplicative mix), not to translate the keys.

Phase 5: The recommendation, justified

The incident write-up can now be three lines of mathematics rather than a shrug:

Root cause. Upstream keys arrive with stride $g = 4096 = 2^{12}$. The table size $m = 1024 = 2^{10}$ satisfies $\gcd(g, m) = 1024$, so by the reachable-buckets theorem the keys occupy only $m/\gcd(g,m) = 1$ bucket. The $O(n)$ lookups are the inevitable consequence.
Fix. Choose a table size coprime to the key stride. A prime $m$ (e.g. $1031$) gives $\gcd(g, m) = 1$ and full $m$-bucket reach. (If powers of two are required for masking speed, mix the key first with a multiplier coprime to $m$ — a multiplicative hash — so the effective stride is coprime to $m$.)
Verification. The fix is not "it benchmarked better"; it is a theorem (§22.4–22.5): coprimality forces all $m$ buckets reachable. Benchmarks then confirm what the proof guarantees.

🔗 Connection. The "multiplier coprime to $m$" trick is the multiplicative-hashing idea you will meet again in Chapter 26 (hashing and collisions), and the demand that the multiplier be coprime to the modulus so it is invertible is exactly the modular-inverse condition $\gcd(a, m) = 1$ from §22.5 — the same coprimality, wearing a different hat.

Discussion Questions

The reachable-buckets theorem says strided keys hit $m / \gcd(g, m)$ buckets. For $m = 12$ and stride $g = 8$, how many buckets are reachable, and which ones? Verify by listing $8i \bmod 12$ for $i = 0, \dots, 5$.
Real session IDs are rarely a perfect arithmetic progression — they have stride $g$ plus occasional jitter. If most keys are multiples of $g$ but a few are not, does the table still degrade? Argue using the collision criterion $m \mid (k_1 - k_2)$.
The proof of the reachable-buckets theorem leaned on Bézout's identity to say "integer combinations of $g$ and $m$ are exactly the multiples of $\gcd(g, m)$." Point to the exact sentence in §22.5 that establishes both directions of that fact.
Powers of two are popular table sizes because key % 2^b is a single bit-mask. State the trade-off this case study exposes between speed of the modulo and quality of the distribution, and explain when each consideration should win.
Suppose an adversary knows you use $m = 1024$ and deliberately sends keys that are multiples of $1024$ to force the one-bucket collapse (a hash-flooding denial-of-service). How does choosing $m$ at random from a set of primes at startup blunt the attack? (This connects to Chapter 26's treatment of adversarial hashing.)

Your Turn: Extensions

Option A (general stride audit). Write reachable_buckets(g, m) that returns the sorted list of buckets the keys $\{g i\}$ can occupy, predicting it as the multiples of $\gcd(g, m)$ below $m$ (using the chapter's gcd) and then confirming by simulation over $i = 0, \dots, m$. Include a hand-derived # Expected output: for reachable_buckets(4, 12).
Option B (pick the size). Given an observed key stride $g$ and a desired minimum table size $M$, write good_table_size(g, M) that returns the smallest prime $\ge M$ that does not divide $g$, using the chapter's sieve to generate candidates. Argue why the result is guaranteed coprime to $g$.
Option C (multiplicative repair). Keep $m = 1024$ but pre-multiply each key by an odd constant $a$ (so $\gcd(a, 1024) = 1$, making $a$ invertible mod $1024$ by §22.5) before masking. Empirically count buckets for the strided input and explain, via "$a$ is a bijection mod $m$," why the count is restored. State the invertibility condition you relied on.

Key Takeaways

key % m is the Division Algorithm's remainder — well-defined, not well-scattered. A hash that is mathematically valid can still be statistically useless if the keys share structure with $m$.
Collisions are a divisibility statement: $k_1, k_2$ collide iff $m \mid (k_1 - k_2)$. This single criterion drives the whole diagnosis.
Strided keys reach exactly $m / \gcd(\text{stride}, m)$ buckets. The gcd between the key spacing and the table size is the entire story; when it exceeds $1$, the table degenerates by a provable factor.
Coprimality is the cure. A prime table size (or a multiplier coprime to $m$) forces $\gcd = 1$ and full reach — the folklore "use a prime" is a theorem from §22.4–22.5, not a superstition.
Proof beats benchmark for explaining a failure. We diagnosed and fixed a production incident with gcd arithmetic before touching a load test; the benchmark's role is to confirm, not to discover.