Case Study: Sizing a Hash Table Before It Falls Over

DataField.Dev

Case Study: Sizing a Hash Table Before It Falls Over

"It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge." — Pierre-Simon Laplace, Théorie analytique des probabilités (1812)

Executive Summary

A service is dropping requests under load, and the on-call engineer suspects the in-memory hash table at its core. In this analysis-heavy case study you will not build a new data structure — you will reach for the probability you learned in this chapter to explain, quantitatively, three things every hash-table user eventually has to reason about: when collisions become likely, how full the fullest bucket gets, and how many buckets sit idle while another overflows. Each answer is a few lines of counting, the complement rule, or linearity of expectation — and each is checked against a Monte-Carlo simulation, exactly the proof-vs-computation duality (theme four) the chapter is built around.

The point is not that hash tables are mysterious. The point is that "it worked in testing" told the team nothing about when it would stop working, and the probability in this chapter turns a vague worry into a number you can put in a capacity plan.

Skills applied

Modeling "$n$ keys hashed into $m$ buckets" as an equally-likely sample space (§20.1–20.2).
Using the complement rule and the birthday calculation to find the probability of any collision (§20.2).
Defining indicator random variables and using linearity of expectation to compute the expected number of empty buckets and the expected number of collisions (§20.4).
Reading variance as the warning sign behind worst-case bucket load (§20.5).
Using Monte-Carlo simulation to check each hand derivation, and knowing what the check does and does not prove (§20.6).

Background

The scenario

SessionStore is an in-memory cache: it maps a user's session token (a long random-looking string) to that user's session data, using a hash table with $m$ buckets and chaining (each bucket holds a linked list of the keys that land there). A "good" hash function spreads tokens across buckets so uniformly that, for the purposes of analysis, we can treat each token as landing in a uniformly random bucket, independently of the others. That modeling assumption — uniform and independent — is what lets us bring this chapter's machinery to bear; in Chapter 26 you will study what makes a real hash function deserve it.

The team sized the table at $m = 1000$ buckets "because we'll never have more than a few hundred sessions." Traffic grew. Now there are $n$ active sessions, $n$ has crept past 1000, lookups have slowed, and someone asks the questions that this chapter answers:

At what $n$ do collisions stop being rare and become expected?
With $n$ sessions in $m$ buckets, how many collisions should we expect in total?
How many buckets are wasted (empty) — and should we worry that one bucket is far longer than the rest?

💡 Intuition: A hash table is the birthday problem wearing work clothes. "Two people share a birthday" becomes "two keys land in the same bucket"; the 365 days become the $m$ buckets. Everything surprising about the birthday problem — collisions appearing far sooner than your intuition expects — is surprising about hash tables for exactly the same reason.

Why this matters

Collisions are not a bug; chaining handles them. But collisions are slow: a lookup that should be $O(1)$ becomes $O(\text{bucket length})$, and if one bucket gets long, the tail latency for the unlucky users hitting it spikes. Capacity planning for a hash table is, at bottom, a probability question: given my load, how bad does the worst bucket get, and how much headroom do I have? You cannot answer that by running it once and looking; you answer it by modeling it.

Phase 1: Model the experiment

Fix the sample space first — the recurring discipline of the chapter. We have $n$ keys (sessions) and $m$ buckets. An outcome records, for each key, which bucket it landed in. So an outcome is a tuple $$(b_1, b_2, \dots, b_n), \qquad b_i \in \{1, 2, \dots, m\},$$ where $b_i$ is the bucket of key $i$. By the product rule (Chapter 15), the sample space has $$\lvert S \rvert = m^n$$ outcomes, and because the hash is uniform and independent, all $m^n$ outcomes are equally likely — the Laplace model applies, and every probability becomes a count over $m^n$.

🔄 Check Your Understanding For $n = 3$ keys and $m = 4$ buckets, how many outcomes are in the sample space? Write one specific outcome in which all three keys collide in bucket 2.

Answer

$\lvert S \rvert = 4^3 = 64$ equally likely outcomes. The all-collide-in-bucket-2 outcome is the tuple $(2, 2, 2)$ — one of the 64.

Notice the modeling choice that makes everything tractable: we record which bucket each key chose, in order. The alternative ("how many keys per bucket") is a smaller description but its outcomes are not equally likely — the same trap as the two-coin sample space in §20.2. Keep the equally-likely tuple space and the counting stays honest.

Phase 2: When do collisions become likely? (the complement rule)

"At least one collision" is a messy event — there could be one collision, or many, in any combination. Its complement, "all $n$ keys land in distinct buckets," is clean, so we use the complement rule exactly as the birthday problem did in §20.2.

Count the no-collision outcomes. The first key may land anywhere ($m$ choices), the second must avoid the first ($m-1$), the third must avoid both ($m-2$), and so on — a falling product (Chapter 16): $$\lvert \text{all distinct} \rvert = m (m-1)(m-2)\cdots(m-n+1) = P(m, n).$$ Therefore $$P(\text{at least one collision}) = 1 - \frac{P(m, n)}{m^n} = 1 - \frac{m(m-1)\cdots(m-n+1)}{m^n}.$$ This is identical to the birthday formula with "days" replaced by "buckets." For $m = 1000$, here is the hand-derived behavior at a few values of $n$:

$n$ (sessions)	$P(\text{some collision})$	Reading
10	$\approx 0.044$	rare
38	$\approx 0.50$	a coin flip
100	$\approx 0.994$	nearly certain

The shock the birthday problem trained you to expect: with a thousand buckets, collisions hit 50–50 at just 38 keys — roughly $\sqrt{m}$, not $m$. The team's "we have plenty of buckets" intuition was off by more than an order of magnitude.

🔗 Connection. The $\sqrt{m}$ threshold is not a coincidence; it is the birthday bound, and it is exactly why cryptographic hash functions need outputs far longer than you would naively guess: to push the collision-likely point ($\approx \sqrt{m} = 2^{b/2}$ for a $b$-bit hash) out of reach. The same mathematics that sizes SessionStore sizes a SHA-256 digest. Chapter 26 returns to this.

Here is the calculation in code — the complement formula, evaluated as a falling product to avoid overflow:

def collision_prob(n, m):
    """P(at least one of n keys collides) in m buckets, via the complement."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (m - k) / m      # k buckets already taken
    return 1 - p_all_distinct

print(round(collision_prob(38, 1000), 3))
print(round(collision_prob(100, 1000), 3))
# Expected output:
# 0.503
# 0.994

The loop multiplies the survival factors $\frac{m}{m}, \frac{m-1}{m}, \dots, \frac{m-n+1}{m}$ exactly as the falling product prescribes, then subtracts from 1. The hand table and the code agree: $\approx 0.50$ at $n = 38$, $\approx 0.99$ at $n = 100$.

Phase 3: How many collisions, total? (linearity of expectation)

Knowing collisions are likely is not the same as knowing how many there are. Counting "the number of colliding pairs" directly means reasoning about the joint behavior of all the keys — a nightmare. Linearity of expectation sidesteps the joint behavior entirely, just as it did for the hat-check problem in §20.4.

Define one indicator per pair of keys. For keys $i < j$, let $$X_{ij} = \mathbf{1}[\text{keys } i \text{ and } j \text{ land in the same bucket}],$$ and let $X = \sum_{iChapter 16), so by linearity (no independence needed — the pairs overlap and interact, and linearity does not care): $$E[X] = \sum_{i

Set this expression equal to 1 to find the $n$ at which we expect one colliding pair: $\frac{n(n-1)}{2m} = 1$ gives $n \approx \sqrt{2m} \approx 45$ for $m = 1000$ — the same $\sqrt{m}$-scale answer Phase 2 found by a completely different route, a reassuring cross-check. For the current load $n = 1500, m = 1000$: $$E[X] = \frac{1500 \cdot 1499}{2000} \approx 1124 \text{ colliding pairs.}$$ Over a thousand pairs of sessions are sharing a bucket — which is exactly why lookups slowed.

🚪 Threshold Concept: the indicator decomposition makes "impossible" expectations routine. Computing $E[X]$ for "number of colliding pairs" looks like it needs the full distribution of bucket loads. It does not. Break $X$ into a sum of $\binom{n}{2}$ tiny indicator variables, take the expectation of each (just a probability, $1/m$), and add. You never analyze how the pairs interact — and they interact in genuinely complicated ways. This write-it-as-a-sum-of-indicators move is the single most reusable idea in the chapter; once you see it work here, you will reach for it everywhere.

from fractions import Fraction

def expected_collisions(n, m):
    """E[number of colliding pairs] = C(n,2)/m, exactly."""
    return Fraction(n * (n - 1), 2 * m)

print(expected_collisions(100, 1000))     # exact
print(float(expected_collisions(1500, 1000)))
# Expected output:
# 99/20
# 1124.25

The exact arithmetic confirms the hand derivation: at $n = 100$, $E[X] = \frac{100 \cdot 99}{2000} = \frac{9900}{2000} = \frac{99}{20} = 4.95$ expected colliding pairs; at $n = 1500$, about $1124$.

Phase 4: Empty buckets and the warning behind worst-case load (more linearity, plus variance)

Two final quantities a capacity planner wants. First, how many buckets are wasted (empty)? Define an indicator per bucket: $Y_j = \mathbf{1}[\text{bucket } j \text{ is empty}]$, and $Y = \sum_{j=1}^{m} Y_j$. Bucket $j$ is empty iff every one of the $n$ keys missed it; each key misses with probability $\frac{m-1}{m}$, and the keys are independent, so $P(\text{bucket } j \text{ empty}) = \left(1 - \frac1m\right)^n$. By linearity (this is precisely the empty-buckets result of §20.4), $$E[\text{empty buckets}] = \sum_{j=1}^{m} E[Y_j] = m \left(1 - \frac1m\right)^n.$$ For $n = m = 1000$, this is $1000\left(1 - \tfrac{1}{1000}\right)^{1000} \approx 1000/e \approx 368$. About 37% of buckets sit empty even when there are as many keys as buckets — the keys clump, because randomness is lumpier than intuition expects.

The flip side of empty buckets is overfull ones. The expected load of any single bucket is just $n/m$ (by symmetry, or by linearity on per-key indicators), which is a comforting $1.0$ when $n = m$. But expectation hides the spread, and §20.5 warned that two distributions with the same mean can behave very differently. The load of a single bucket is a sum of $n$ independent indicators (key $i$ lands in this bucket with probability $1/m$); its variance works out to $n \cdot \frac1m\left(1 - \frac1m\right) \approx n/m$ for large $m$. A mean of $1$ with variance near $1$ is a distribution with a meaningful right tail — and the maximum over $m$ buckets is what bites you. (A classic result, beyond this chapter's tools but worth knowing, says that for $n = m$ the fullest bucket holds about $\Theta(\log m / \log\log m)$ keys — small, but not the $1$ the average suggests.)

🔗 Connection. "The average is fine but the maximum is what hurts" is the whole reason concentration inequalities (Chebyshev, Chernoff) exist: they use the variance you just computed to bound how far the worst bucket can stray from the mean. That toolkit — built directly on the expectation and variance of this chapter — is how you prove a randomized load balancer "almost never" overloads a server. Chapter 21 takes the first step by analyzing randomized algorithms.

Finally, the empty-bucket count in code:

def expected_empty(n, m):
    """E[number of empty buckets] = m * (1 - 1/m)**n."""
    return m * (1 - 1 / m) ** n

print(round(expected_empty(1000, 1000), 1))
print(round(expected_empty(2000, 1000), 1))
# Expected output:
# 367.7
# 135.3

At $n = m = 1000$, about $368$ empty buckets ($\approx m/e$); doubling the load to $n = 2000$ drops that to about $135$ ($\approx m/e^2$, since $(1 - 1/m)^{2m} \approx e^{-2}$).

Phase 5: Check everything by simulation

Every number above came from a hand derivation. Before trusting any of it in a capacity plan, check it with Monte-Carlo — theme four. We simulate the experiment (drop $n$ keys into $m$ buckets at random) many times and measure the empty-bucket count and collision count, then compare to the formulas.

import random
from collections import Counter

def simulate_table(n, m, trials):
    """Average empty buckets and colliding pairs over `trials` random fills."""
    empty_total, collide_total = 0, 0
    for _ in range(trials):
        buckets = Counter(random.randint(1, m) for _ in range(n))
        empty_total += m - len(buckets)                       # buckets never hit
        collide_total += sum(c * (c - 1) // 2 for c in buckets.values())  # pairs per bucket
    return empty_total / trials, collide_total / trials

random.seed(0)
avg_empty, avg_collide = simulate_table(1000, 1000, 2000)
print(round(avg_empty, 1))      # compare to formula: 367.7
print(round(avg_collide, 1))    # compare to formula: C(1000,2)/1000 = 499.5
# Expected output (approximate; hovers near the exact formulas):
# 367.9
# 499.6

The simulated averages land right on the hand-derived formulas: empty buckets $\approx 368$ matches $m/e$, and colliding pairs $\approx 499.5$ matches $\binom{1000}{2}/1000 = \frac{999}{2}$. The agreement is evidence, not proof — a different seed shifts the last digit, and no finite number of trials makes it exact. But the derivations are proofs, so we know the formulas hold for every $n$ and $m$; the simulation's job was to catch a modeling blunder or an arithmetic slip, and it found none.

⚠️ Common Pitfall: the per-bucket-count sample space is not equally likely. A tempting shortcut is to model the experiment by its load vector (how many keys in each bucket) and count those. But two different load vectors are not equally likely — "all keys in bucket 1" is one tuple out of $m^n$, while "one key in each of $n$ buckets" corresponds to $n!$ tuples. Always reason on the equally-likely tuple space $(b_1, \dots, b_n)$; collapsing to counts too early is the §20.2 two-coin error at scale.

Discussion Questions

Phase 2 and Phase 3 both produced a $\sqrt{m}$-scale threshold for "collisions start mattering," by different arguments (complement rule vs. expected colliding pairs). Are they answering exactly the same question? Where would the two thresholds disagree, and why is it still reassuring that they roughly agree?
The expected number of colliding pairs is $\binom{n}{2}/m$, and the keys are not independent in their pairwise collisions (if keys 1 and 2 collide and keys 2 and 3 collide, then keys 1 and 3 must collide too). Explain precisely why linearity of expectation is nonetheless valid here. Which hypothesis does linearity not require?
Expectation says the average bucket holds $n/m$ keys, but the fullest bucket holds more. In your own words, why is the expectation an inadequate summary for a latency-sensitive service, and what does variance add to the picture? Tie your answer to the two-server example in §20.5.
Suppose the hash function is bad — it sends 90% of keys to the first half of the buckets. Which of the five phases' results break, and which modeling assumption from Phase 1 fails? (This previews why Chapter 26 cares so much about hash-function quality.)
The simulation in Phase 5 used 2000 trials and matched the formulas to about one part in a thousand. By the $1/\sqrt{k}$ error rule from §20.6, roughly how many trials would you need to trust the third decimal place, and why does this make the closed-form derivation so much more attractive when it is available?

Your Turn: Extensions

Option A (resize policy). Most hash tables resize (double $m$) when the load factor $n/m$ exceeds a threshold like $0.75$. Using the expected-colliding-pairs formula $\binom{n}{2}/m$, compute the expected collision count just before and just after a resize, and argue why keeping $n/m$ bounded keeps expected lookup cost $O(1)$. State the claim precisely as a statement about $E[\text{bucket load}]$.
Option B (the maximum bucket, empirically). Extend simulate_table to also record the maximum bucket length over each fill, and tabulate the average maximum for $m = 100, 1000, 10000$ with $n = m$. Do your numbers grow like $\log m / \log\log m$? You are checking a theorem the chapter did not prove — exactly the role §20.6 reserves for simulation.
Option C (two-choice hashing). A famous improvement: for each key, pick two random buckets and put the key in the less full one. Simulate this "power of two choices" variant and compare its average maximum bucket length to the one-choice version. The dramatic improvement (from $\Theta(\log m/\log\log m)$ down to $\Theta(\log\log m)$) is one of the most celebrated results in randomized algorithms — see it for yourself, then read about why in the further-reading list.

Key Takeaways

A hash table is the birthday problem in disguise. Modeling "$n$ keys into $m$ buckets" as the equally-likely tuple space $(b_1, \dots, b_n)$ over $m^n$ outcomes turns every capacity question into counting, the complement rule, or linearity.
The complement rule finds the collision-likely point. $P(\text{some collision}) = 1 - P(m,n)/m^n$ crosses $1/2$ at $n \approx \sqrt{m}$ — far sooner than the table is full, and the same bound that sizes cryptographic hashes.
Linearity of expectation computes what brute force cannot. Expected colliding pairs $= \binom{n}{2}/m$ and expected empty buckets $= m(1 - 1/m)^n$ both come from summing simple indicators — with no independence assumption, despite the keys interacting in tangled ways.
Expectation is not the whole story; variance is the warning. The average bucket load is $n/m$, but the maximum is what hurts, and variance is the first tool for bounding how far the worst bucket strays — the gateway to the concentration bounds of later chapters.
Simulate to check, derive to settle. Monte-Carlo confirmed every formula to a part in a thousand and caught no modeling error; the derivations proved the formulas for all $n, m$. Used together, they are hard to fool — theme four, in a capacity plan.