Case Study: Sizing a Hash Table Before It Falls Over
"It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge." — Pierre-Simon Laplace, Théorie analytique des probabilités (1812)
Executive Summary
A service is dropping requests under load, and the on-call engineer suspects the in-memory hash table at its core. In this analysis-heavy case study you will not build a new data structure — you will reach for the probability you learned in this chapter to explain, quantitatively, three things every hash-table user eventually has to reason about: when collisions become likely, how full the fullest bucket gets, and how many buckets sit idle while another overflows. Each answer is a few lines of counting, the complement rule, or linearity of expectation — and each is checked against a Monte-Carlo simulation, exactly the proof-vs-computation duality (theme four) the chapter is built around.
The point is not that hash tables are mysterious. The point is that "it worked in testing" told the team nothing about when it would stop working, and the probability in this chapter turns a vague worry into a number you can put in a capacity plan.
Skills applied
- Modeling "$n$ keys hashed into $m$ buckets" as an equally-likely sample space (§20.1–20.2).
- Using the complement rule and the birthday calculation to find the probability of any collision (§20.2).
- Defining indicator random variables and using linearity of expectation to compute the expected number of empty buckets and the expected number of collisions (§20.4).
- Reading variance as the warning sign behind worst-case bucket load (§20.5).
- Using Monte-Carlo simulation to check each hand derivation, and knowing what the check does and does not prove (§20.6).
Background
The scenario
SessionStore is an in-memory cache: it maps a user's session token (a long random-looking string) to
that user's session data, using a hash table with $m$ buckets and chaining (each bucket holds a linked
list of the keys that land there). A "good" hash function spreads tokens across buckets so uniformly that,
for the purposes of analysis, we can treat each token as landing in a uniformly random bucket,
independently of the others. That modeling assumption — uniform and independent — is what lets us bring
this chapter's machinery to bear; in Chapter 26 you will study what makes a real hash function deserve it.
The team sized the table at $m = 1000$ buckets "because we'll never have more than a few hundred sessions." Traffic grew. Now there are $n$ active sessions, $n$ has crept past 1000, lookups have slowed, and someone asks the questions that this chapter answers:
- At what $n$ do collisions stop being rare and become expected?
- With $n$ sessions in $m$ buckets, how many collisions should we expect in total?
- How many buckets are wasted (empty) — and should we worry that one bucket is far longer than the rest?
💡 Intuition: A hash table is the birthday problem wearing work clothes. "Two people share a birthday" becomes "two keys land in the same bucket"; the 365 days become the $m$ buckets. Everything surprising about the birthday problem — collisions appearing far sooner than your intuition expects — is surprising about hash tables for exactly the same reason.
Why this matters
Collisions are not a bug; chaining handles them. But collisions are slow: a lookup that should be $O(1)$ becomes $O(\text{bucket length})$, and if one bucket gets long, the tail latency for the unlucky users hitting it spikes. Capacity planning for a hash table is, at bottom, a probability question: given my load, how bad does the worst bucket get, and how much headroom do I have? You cannot answer that by running it once and looking; you answer it by modeling it.
Phase 1: Model the experiment
Fix the sample space first — the recurring discipline of the chapter. We have $n$ keys (sessions) and $m$ buckets. An outcome records, for each key, which bucket it landed in. So an outcome is a tuple $$(b_1, b_2, \dots, b_n), \qquad b_i \in \{1, 2, \dots, m\},$$ where $b_i$ is the bucket of key $i$. By the product rule (Chapter 15), the sample space has $$\lvert S \rvert = m^n$$ outcomes, and because the hash is uniform and independent, all $m^n$ outcomes are equally likely — the Laplace model applies, and every probability becomes a count over $m^n$.
🔄 Check Your Understanding For $n = 3$ keys and $m = 4$ buckets, how many outcomes are in the sample space? Write one specific outcome in which all three keys collide in bucket 2.
Answer
$\lvert S \rvert = 4^3 = 64$ equally likely outcomes. The all-collide-in-bucket-2 outcome is the tuple $(2, 2, 2)$ — one of the 64.
Notice the modeling choice that makes everything tractable: we record which bucket each key chose, in order. The alternative ("how many keys per bucket") is a smaller description but its outcomes are not equally likely — the same trap as the two-coin sample space in §20.2. Keep the equally-likely tuple space and the counting stays honest.
Phase 2: When do collisions become likely? (the complement rule)
"At least one collision" is a messy event — there could be one collision, or many, in any combination. Its complement, "all $n$ keys land in distinct buckets," is clean, so we use the complement rule exactly as the birthday problem did in §20.2.
Count the no-collision outcomes. The first key may land anywhere ($m$ choices), the second must avoid the first ($m-1$), the third must avoid both ($m-2$), and so on — a falling product (Chapter 16): $$\lvert \text{all distinct} \rvert = m (m-1)(m-2)\cdots(m-n+1) = P(m, n).$$ Therefore $$P(\text{at least one collision}) = 1 - \frac{P(m, n)}{m^n} = 1 - \frac{m(m-1)\cdots(m-n+1)}{m^n}.$$ This is identical to the birthday formula with "days" replaced by "buckets." For $m = 1000$, here is the hand-derived behavior at a few values of $n$:
| $n$ (sessions) | $P(\text{some collision})$ | Reading |
|---|---|---|
| 10 | $\approx 0.044$ | rare |
| 38 | $\approx 0.50$ | a coin flip |
| 100 | $\approx 0.994$ | nearly certain |
The shock the birthday problem trained you to expect: with a thousand buckets, collisions hit 50–50 at just 38 keys — roughly $\sqrt{m}$, not $m$. The team's "we have plenty of buckets" intuition was off by more than an order of magnitude.
🔗 Connection. The $\sqrt{m}$ threshold is not a coincidence; it is the birthday bound, and it is exactly why cryptographic hash functions need outputs far longer than you would naively guess: to push the collision-likely point ($\approx \sqrt{m} = 2^{b/2}$ for a $b$-bit hash) out of reach. The same mathematics that sizes
SessionStoresizes a SHA-256 digest. Chapter 26 returns to this.
Here is the calculation in code — the complement formula, evaluated as a falling product to avoid overflow:
def collision_prob(n, m):
"""P(at least one of n keys collides) in m buckets, via the complement."""
p_all_distinct = 1.0
for k in range(n):
p_all_distinct *= (m - k) / m # k buckets already taken
return 1 - p_all_distinct
print(round(collision_prob(38, 1000), 3))
print(round(collision_prob(100, 1000), 3))
# Expected output:
# 0.503
# 0.994
The loop multiplies the survival factors $\frac{m}{m}, \frac{m-1}{m}, \dots, \frac{m-n+1}{m}$ exactly as the falling product prescribes, then subtracts from 1. The hand table and the code agree: $\approx 0.50$ at $n = 38$, $\approx 0.99$ at $n = 100$.
Phase 3: How many collisions, total? (linearity of expectation)
Knowing collisions are likely is not the same as knowing how many there are. Counting "the number of colliding pairs" directly means reasoning about the joint behavior of all the keys — a nightmare. Linearity of expectation sidesteps the joint behavior entirely, just as it did for the hat-check problem in §20.4.
Define one indicator per pair of keys. For keys $i < j$, let
$$X_{ij} = \mathbf{1}[\text{keys } i \text{ and } j \text{ land in the same bucket}],$$
and let $X = \sum_{i Set this expression equal to 1 to find the $n$ at which we expect one colliding pair: $\frac{n(n-1)}{2m}
= 1$ gives $n \approx \sqrt{2m} \approx 45$ for $m = 1000$ — the same $\sqrt{m}$-scale answer Phase 2 found
by a completely different route, a reassuring cross-check. For the current load $n = 1500, m = 1000$:
$$E[X] = \frac{1500 \cdot 1499}{2000} \approx 1124 \text{ colliding pairs.}$$
Over a thousand pairs of sessions are sharing a bucket — which is exactly why lookups slowed. 🚪 Threshold Concept: the indicator decomposition makes "impossible" expectations routine. Computing
$E[X]$ for "number of colliding pairs" looks like it needs the full distribution of bucket loads. It does
not. Break $X$ into a sum of $\binom{n}{2}$ tiny indicator variables, take the expectation of each (just
a probability, $1/m$), and add. You never analyze how the pairs interact — and they interact in genuinely
complicated ways. This write-it-as-a-sum-of-indicators move is the single most reusable idea in the
chapter; once you see it work here, you will reach for it everywhere. The exact arithmetic confirms the hand derivation: at $n = 100$, $E[X] = \frac{100 \cdot 99}{2000} =
\frac{9900}{2000} = \frac{99}{20} = 4.95$ expected colliding pairs; at $n = 1500$, about $1124$. Two final quantities a capacity planner wants. First, how many buckets are wasted (empty)? Define an
indicator per bucket: $Y_j = \mathbf{1}[\text{bucket } j \text{ is empty}]$, and $Y = \sum_{j=1}^{m} Y_j$.
Bucket $j$ is empty iff every one of the $n$ keys missed it; each key misses with probability
$\frac{m-1}{m}$, and the keys are independent, so $P(\text{bucket } j \text{ empty}) = \left(1 -
\frac1m\right)^n$. By linearity (this is precisely the empty-buckets result of §20.4),
$$E[\text{empty buckets}] = \sum_{j=1}^{m} E[Y_j] = m \left(1 - \frac1m\right)^n.$$
For $n = m = 1000$, this is $1000\left(1 - \tfrac{1}{1000}\right)^{1000} \approx 1000/e \approx 368$.
About 37% of buckets sit empty even when there are as many keys as buckets — the keys clump, because
randomness is lumpier than intuition expects. The flip side of empty buckets is overfull ones. The expected load of any single bucket is just
$n/m$ (by symmetry, or by linearity on per-key indicators), which is a comforting $1.0$ when $n = m$. But
expectation hides the spread, and §20.5 warned that two distributions with the same mean can behave very
differently. The load of a single bucket is a sum of $n$ independent indicators (key $i$ lands in this
bucket with probability $1/m$); its variance works out to $n \cdot \frac1m\left(1 - \frac1m\right) \approx
n/m$ for large $m$. A mean of $1$ with variance near $1$ is a distribution with a meaningful right tail —
and the maximum over $m$ buckets is what bites you. (A classic result, beyond this chapter's tools but
worth knowing, says that for $n = m$ the fullest bucket holds about $\Theta(\log m / \log\log m)$ keys —
small, but not the $1$ the average suggests.) 🔗 Connection. "The average is fine but the maximum is what hurts" is the whole reason
concentration inequalities (Chebyshev, Chernoff) exist: they use the variance you just computed to
bound how far the worst bucket can stray from the mean. That toolkit — built directly on the expectation
and variance of this chapter — is how you prove a randomized load balancer "almost never" overloads a
server. Chapter 21 takes the first step by analyzing randomized algorithms. Finally, the empty-bucket count in code: At $n = m = 1000$, about $368$ empty buckets ($\approx m/e$); doubling the load to $n = 2000$ drops that
to about $135$ ($\approx m/e^2$, since $(1 - 1/m)^{2m} \approx e^{-2}$). Every number above came from a hand derivation. Before trusting any of it in a capacity plan, check it
with Monte-Carlo — theme four. We simulate the experiment (drop $n$ keys into $m$ buckets at random) many
times and measure the empty-bucket count and collision count, then compare to the formulas. The simulated averages land right on the hand-derived formulas: empty buckets $\approx 368$ matches
$m/e$, and colliding pairs $\approx 499.5$ matches $\binom{1000}{2}/1000 = \frac{999}{2}$. The agreement
is evidence, not proof — a different seed shifts the last digit, and no finite number of trials makes it
exact. But the derivations are proofs, so we know the formulas hold for every $n$ and $m$; the
simulation's job was to catch a modeling blunder or an arithmetic slip, and it found none. ⚠️ Common Pitfall: the per-bucket-count sample space is not equally likely. A tempting shortcut is
to model the experiment by its load vector (how many keys in each bucket) and count those. But two
different load vectors are not equally likely — "all keys in bucket 1" is one tuple out of $m^n$, while
"one key in each of $n$ buckets" corresponds to $n!$ tuples. Always reason on the equally-likely tuple
space $(b_1, \dots, b_n)$; collapsing to counts too early is the §20.2 two-coin error at scale.
from fractions import Fraction
def expected_collisions(n, m):
"""E[number of colliding pairs] = C(n,2)/m, exactly."""
return Fraction(n * (n - 1), 2 * m)
print(expected_collisions(100, 1000)) # exact
print(float(expected_collisions(1500, 1000)))
# Expected output:
# 99/20
# 1124.25
Phase 4: Empty buckets and the warning behind worst-case load (more linearity, plus variance)
def expected_empty(n, m):
"""E[number of empty buckets] = m * (1 - 1/m)**n."""
return m * (1 - 1 / m) ** n
print(round(expected_empty(1000, 1000), 1))
print(round(expected_empty(2000, 1000), 1))
# Expected output:
# 367.7
# 135.3
Phase 5: Check everything by simulation
import random
from collections import Counter
def simulate_table(n, m, trials):
"""Average empty buckets and colliding pairs over `trials` random fills."""
empty_total, collide_total = 0, 0
for _ in range(trials):
buckets = Counter(random.randint(1, m) for _ in range(n))
empty_total += m - len(buckets) # buckets never hit
collide_total += sum(c * (c - 1) // 2 for c in buckets.values()) # pairs per bucket
return empty_total / trials, collide_total / trials
random.seed(0)
avg_empty, avg_collide = simulate_table(1000, 1000, 2000)
print(round(avg_empty, 1)) # compare to formula: 367.7
print(round(avg_collide, 1)) # compare to formula: C(1000,2)/1000 = 499.5
# Expected output (approximate; hovers near the exact formulas):
# 367.9
# 499.6
Discussion Questions
Your Turn: Extensions
simulate_table to also record the maximum
bucket length over each fill, and tabulate the average maximum for $m = 100, 1000, 10000$ with $n = m$.
Do your numbers grow like $\log m / \log\log m$? You are checking a theorem the chapter did not prove —
exactly the role §20.6 reserves for simulation.Key Takeaways