Case Study: Why the Sharding Plan Cannot Work — A Pigeonhole Audit

DataField.Dev

Case Study: Why the Sharding Plan Cannot Work — A Pigeonhole Audit

"Counting is the sanity check that comes before the benchmark. If the numbers forbid it, no amount of tuning will save you."

Executive Summary

A team is about to ship a data-sharding plan: route each customer to one of a fixed set of database shards by hashing the customer ID, with two guarantees written into the design doc — "no two active customers will ever land on the same shard" and "every shard will be used." In this analysis-heavy case study you will audit that plan with nothing but the counting tools of this chapter — no profiler, no load test — and show, before a single byte moves, that the first guarantee is mathematically impossible and the second is only achievable under a condition the team has not checked. You will then turn the same tools to a constructive question: given the impossibility, exactly how unbalanced can the shards get, and how many distinct "every shard used" configurations even exist?

This is the pigeonhole principle (§17.1), the generalized pigeonhole principle (§17.2), and the surjection count (§17.6) used the way a senior engineer uses them: as a cheap, decisive filter applied to a design on paper.

Skills applied

Identifying the pigeons and the holes in a real system, and reading off an impossibility result.
Using the generalized pigeonhole principle to derive a hard worst-case load bound.
Recognizing "every shard used" as a surjection and counting the valid configurations by inclusion–exclusion.
Distinguishing an existence guarantee (pigeonhole) from a probability statement, and knowing which question each answers.

Background

The scenario

ShopRing (a hypothetical e-commerce platform — Tier 3, constructed for teaching) runs a customer database that has outgrown one machine. The proposed fix is hash sharding: a customer with integer ID cid is routed to shard number h(cid) = cid % S, where S is the number of shards. The design doc makes two load-bearing claims:

Claim D1 (uniqueness). "Because our hash spreads IDs evenly, no two active customers will collide on the same shard."

Claim D2 (coverage). "Every shard will receive at least one customer, so we never provision a machine that sits idle."

ShopRing has 8 shards ($S = 8$) and, at launch, 5,000,000 active customer IDs. The IDs are arbitrary 64-bit integers (assigned over years by an upstream service); you do not control them.

def shard_of(cid, S):
    """Route a customer id to one of S shards by simple modular hashing."""
    return cid % S

# Three sample customers, 8 shards:
for cid in (1001, 2009, 4_000_017):
    print(cid, "->", shard_of(cid, 8))
# Expected output:
# 1001 -> 1
# 2009 -> 1
# 4000017 -> 1

Already a hint: three unrelated IDs all landed on shard 1. Let us make the informal worry precise.

💡 Intuition: A hash function is a function from a large set (customer IDs) to a small set (shards). The design doc is, without realizing it, claiming this function is both injective ("no collisions") and surjective ("every shard used"). The whole audit is just checking whether a function from a 5-million-element set to an 8-element set can have those properties. Pigeonhole answers the first; surjection counting answers the second.

Why this matters

Sharding bugs are expensive: a "unique per shard" assumption baked into application code (say, using the shard as a lock or as a primary-key namespace) corrupts data the moment two customers collide. And an idle, fully provisioned shard is wasted money. Both failure modes are counting questions, and counting is far cheaper than discovering them in production. This is theme two of the book in its purest form — a proof settles all cases at once, where a load test only ever samples a few.

Phase 1: Claim D1 is impossible (basic pigeonhole)

Set up the pigeons and the holes explicitly — this is the entire skill.

Pigeons: the 5,000,000 active customer IDs being routed.
Holes: the 8 shards.

We are placing $n = 5{,}000{,}000$ pigeons into $m = 8$ holes, and $n > m$ in spectacular fashion. By the pigeonhole principle (§17.1), at least one shard receives at least two customers. Two customers on one shard is exactly a collision. Therefore Claim D1 — "no two active customers will collide" — is false for any hash function whatsoever, not merely for cid % 8. The cleverness of the hash is irrelevant; the conclusion is forced by $5{,}000{,}000 > 8$.

State it in the function language to see how total the impossibility is:

Restatement. $h\colon \{\text{5{,}000{,}000 IDs}\} \to \{0, 1, \dots, 7\}$ has a domain larger than its codomain, so $h$ is not injective (§17.1). Some $\text{cid}_1 \ne \text{cid}_2$ satisfy $h(\text{cid}_1) = h(\text{cid}_2)$.

We can detect the collision the principle promises, which is all code can do here — it cannot repeal the theorem:

def first_collision(cids, S):
    """Return the first pair of ids that share a shard, or None.
    Pigeonhole guarantees a pair exists once len(cids) > S."""
    seen = {}                          # shard -> first id seen there
    for cid in cids:
        s = cid % S
        if s in seen:
            return (seen[s], cid)
        seen[s] = cid
    return None

print(first_collision([1001, 2009, 4_000_017], S=8))
# Expected output:
# (1001, 2009)

Hand-trace it: $1001 \bmod 8 = 1$ (store shard 1 → 1001); $2009 \bmod 8 = 1$ (shard 1 already seen) → return (1001, 2009). With three IDs and eight shards we have $n \le m$, so pigeonhole made no guarantee for this tiny list — yet a collision happened anyway, which is the next pitfall.

⚠️ Common Pitfall: The team might respond, "fine, then we'll just keep active customers below 8 and we're safe." That confuses the two directions of the theorem. $n > m$ guarantees a collision; $n \le m$ merely fails to guarantee one — collisions are still entirely possible (the three-ID trace above had one with $n = 3 < 8$). Safety is never something pigeonhole grants you below the threshold.

🔄 Check Your Understanding The team proposes raising the shard count to $S = 5{,}000{,}001$ so that "there are more shards than customers, so D1 holds." Does that make D1 true? What does pigeonhole now say, and what does it not say?

Answer

It removes the guarantee of a collision (now $n = 5{,}000{,}000 \le m = 5{,}000{,}001$, so pigeonhole is silent), but it does not make D1 true: two IDs can still hash to the same shard (e.g. any two IDs congruent mod $S$). Pigeonhole gives a sufficient condition for collision, never a guarantee of no collision. To truly guarantee uniqueness you would need an injective assignment you control — not a hash of IDs you don't.

Phase 2: How unbalanced can it get? (generalized pigeonhole)

Killing D1 raises the real operational question: collisions are inevitable, so how bad is the worst shard? The application's tail latency is governed by the busiest shard, so we want a guaranteed lower bound on it.

Pigeons: $n = 5{,}000{,}000$ customers.
Holes: $m = 8$ shards.

The generalized pigeonhole principle (§17.2) says some shard holds at least $$\left\lceil \frac{n}{m}\right\rceil = \left\lceil \frac{5{,}000{,}000}{8}\right\rceil = \lceil 625{,}000\rceil = 625{,}000$$ customers. Here $n/m$ is exactly an integer, so the ceiling changes nothing — but the bound is a hard floor on the worst case: no routing of these 5,000,000 customers across 8 shards can keep every shard at or below 624,999. If a single shard's hardware is sized for 600,000 customers, the plan is dead-on-arrival, and we know it without a benchmark.

from math import ceil

def guaranteed_busiest(n, S):
    """Lower bound on the busiest shard's load: no routing beats this."""
    return ceil(n / S)

print(guaranteed_busiest(5_000_000, 8))     # exact integer case
print(guaranteed_busiest(5_000_001, 8))     # one more customer
# Expected output:
# 625000
# 625001

The second call shows the ceiling doing real work: $\lceil 5{,}000{,}001 / 8\rceil = \lceil 625{,}000.125\rceil = 625{,}001$. One extra customer raises the guaranteed worst shard by one, because that customer must pile onto an already-occupied shard.

🚪 Threshold Concept: a lower bound no engineering can beat. Most performance numbers are measurements — they describe what happened on one run. The generalized pigeonhole bound is different in kind: it is a proof about every possible run at once. "Some shard handles $\ge 625{,}000$ customers" holds for the cleverest future hash, the luckiest input, any rebalancing scheme. When you can derive such a bound, you have changed the conversation from "let's tune it" to "this target is unreachable; change the target or change $m$." Learning to reach for a counting bound before profiling is a habit that separates an engineer who argues from data from one who argues from proof.

Is the bound tight — could the shards actually be that even? Yes: if the IDs happened to spread perfectly, every shard would hold exactly $625{,}000$, meeting the bound. So $625{,}000$ is the best guarantee available; we cannot honestly promise "$\le 624{,}999$" or claim "$\ge 625{,}001$ always."

🔄 Check Your Understanding Management asks: "If we want to guarantee the busiest shard stays below 700,000, how many shards do we need?" Set up the inequality $\lceil 5{,}000{,}000 / S\rceil \le 700{,}000$ and find the smallest such $S$.

Answer

We need $\lceil 5{,}000{,}000/S\rceil \le 700{,}000$, which holds once $5{,}000{,}000/S \le 700{,}000$, i.e. $S \ge 5{,}000{,}000/700{,}000 = 7.14\ldots$, so $S = 8$ suffices ($\lceil 5{,}000{,}000/8\rceil = 625{,}000 \le 700{,}000$). With $S = 7$, $\lceil 5{,}000{,}000/7\rceil = 714{,}286 > 700{,}000$, which fails. So 8 shards is the minimum — note this is the guarantee on the worst case, not a promise the actual data will be balanced.

Phase 3: Will every shard be used? (surjections and inclusion–exclusion)

Now Claim D2: "every shard receives at least one customer." That is precisely the statement that the routing function $h$ is surjective (§17.6) — every one of the 8 shards is hit. Surjectivity is not forced by having many customers; in principle the hash could map all 5,000,000 IDs to shards $\{0, 1, \dots, 6\}$ and never touch shard 7. So D2 is a genuine question, and the chapter gives us the exact count of how many routings satisfy it.

Think of the assignment abstractly as a function from the $n$ customers to the $m = 8$ shards. The number of surjective such functions — assignments that use every shard — is, by inclusion–exclusion over the missed shards (§17.6), $$\text{surjections}(n \to m) = \sum_{j=0}^{m}(-1)^j\binom{m}{j}(m-j)^n.$$

To make the arithmetic legible, audit a scaled-down model first: $n = 5$ customers across $m = 3$ shards (the structure is identical; only the numbers shrink).

from math import comb

def surjections(n, m):
    """Assignments of n customers to m shards that use EVERY shard."""
    return sum((-1)**j * comb(m, j) * (m - j)**n for j in range(m + 1))

total      = 3 ** 5
onto       = surjections(5, 3)
print(total)
print(onto)
print(total - onto)        # assignments that leave some shard idle
# Expected output:
# 243
# 150
# 93

Hand-derive the surjection count to keep the tool honest: $$\sum_{j=0}^{3}(-1)^j\binom{3}{j}(3-j)^5 = \binom30 3^5 - \binom31 2^5 + \binom32 1^5 - \binom33 0^5 = 243 - 3\cdot 32 + 3\cdot 1 - 0 = 243 - 96 + 3 = 150.$$ So of the $3^5 = 243$ possible assignments of 5 customers to 3 shards, exactly 150 use all three shards and the remaining 93 leave at least one shard idle. D2 is satisfied by 150 of 243 configurations — a clear majority here, but not all, and that gap is the whole point: coverage is not automatic.

💡 Intuition: The term $(m-j)^n$ counts assignments confined to a chosen set of $m-j$ shards (each of the $n$ customers independently picks one of $m-j$ allowed shards — the product rule of Chapter 15), and $\binom{m}{j}$ counts which $j$ shards we declared "missed." Inclusion–exclusion alternates these to strike out every assignment that misses at least one shard, leaving exactly the surjections. It is the same "add singles, subtract pairs, …" machinery of §17.4, now applied to "missed targets."

For the real numbers ($n = 5{,}000{,}000$, $m = 8$) the surjection count is astronomically close to the total $8^{5{,}000{,}000}$ — missing a specific shard requires all five million IDs to dodge it, which is vanishingly rare but not impossible. So the correct audit verdict on D2 is nuanced:

Verdict on D2. With 5,000,000 customers and 8 shards, the overwhelming majority of routings are surjective, so an idle shard is extremely unlikely — but it is not guaranteed. D2 should be downgraded from "will" to "almost certainly will, and we will monitor for an unused shard." Unlike D1 (provably false), D2 is provably-possible-to-fail, which is a weaker but still important caveat.

🔗 Connection: "How many functions hit every target" is the surjection question, and it is the same shape as the coupon-collector problem you will meet in Chapter 20 (how long until every shard has been used at least once) and connects to how full a hash table gets. The inclusion–exclusion skeleton here is reused verbatim there.

🔄 Check Your Understanding In the scaled model, what fraction of assignments leave at least one shard idle, and why does this fraction shrink toward zero as the number of customers grows (with shards fixed)?

Answer

$93/243 \approx 0.383$. As $n$ grows with $m$ fixed, leaving a particular shard idle requires all $n$ customers to avoid it, which happens for a $((m-1)/m)^n$ fraction — shrinking geometrically toward 0. So with millions of customers, the idle-shard fraction is negligible, matching the D2 verdict.

Phase 4: The audit memo

Assemble the three results into the verdict the team needs. Counting alone — no profiler — settled all three claims:

Design claim	Tool	Verdict
D1 "no two customers collide on a shard"	Pigeonhole (§17.1)	Impossible. $5{,}000{,}000 > 8 \Rightarrow$ some shard has $\ge 2$ customers, for any hash.
Worst-case shard load	Generalized pigeonhole (§17.2)	$\ge 625{,}000$ customers on the busiest shard, unbeatable; size hardware accordingly.
D2 "every shard is used"	Surjections / I–E (§17.6)	Not guaranteed, but overwhelmingly likely; monitor for an unused shard rather than assume coverage.

The actionable recommendations: drop the uniqueness assumption from application code entirely (it can never hold); provision every shard for at least 625,000 customers; and replace D2's "will" with a monitor. None of this required running the system — the counting forbade the broken design and bounded the workable one.

Discussion Questions

The pigeonhole argument against D1 never used any property of the hash function cid % 8. Re-examine the proof and state the only fact about the routing it relied on. Why does this make the impossibility robust to any future "better hash function"?
Phase 2's bound $\lceil n/m\rceil$ was shown to be tight by exhibiting a perfectly even distribution. In practice, hashing arbitrary IDs almost never gives a perfectly even split. Does the lower bound $\ge 625{,}000$ still hold for an uneven split? Does the tightness (some distribution achieving exactly 625,000) say anything about what your data will actually do?
D2 was "provably-possible-to-fail" rather than "provably false." Explain the difference between an existence guarantee (what pigeonhole gives for D1) and the counting result for D2, and why the team should treat the two verdicts differently in code.
Suppose the team adds a constraint: customer IDs are guaranteed to be consecutive integers $0, 1, \dots, 4{,}999{,}999$. Under cid % 8, what is the actual load on each shard, and does it meet the generalized-pigeonhole lower bound exactly? (This is a special, controlled input — contrast it with the arbitrary-ID case.)

Your Turn: Extensions

Option A (re-shard math). The team wants the busiest shard guaranteed below 500,000 for a future 8,000,000 customers. Find the minimum shard count $S$ from $\lceil 8{,}000{,}000/S\rceil \le 500{,}000$, and write a one-line min_shards(n, cap) function (with a hand-derived expected output) that returns it.
Option B (coverage probability sketch). For the scaled model ($m = 3$ shards), tabulate the number of surjective assignments for $n = 3, 4, 5, 6$ using your surjections function, and the fraction $\text{onto}/3^n$. Describe how fast coverage becomes near-certain, foreshadowing the coupon-collector analysis of Chapter 20.
Option C (the uniqueness fix). Pigeonhole forbids unique-per-shard routing of uncontrolled IDs. Propose a design where uniqueness is achievable (hint: assign customers to shards by a counter you control, not by hashing a value you don't), and explain in pigeonhole terms why your design escapes the impossibility — i.e., why its assignment can be injective.

Key Takeaways

Pigeonhole is a design filter, not just a puzzle. "More customers than shards" instantly disproved a shipped design assumption (D1) with no test — identify the pigeons and holes, check $n > m$, and read off the impossibility.
The generalized principle yields hard worst-case bounds. $\lceil n/m\rceil = 625{,}000$ is a floor on the busiest shard that no routing can beat — a proof about every run, the kind of number you size hardware against before benchmarking.
"Every target used" is a surjection. Coverage (D2) is counted exactly by inclusion–exclusion over missed shards; it is usually likely but rarely guaranteed, a distinction that belongs in the design doc and the code.
Existence vs. probability are different verdicts. Pigeonhole settles whether something must happen; the surjection count tells you how many configurations have a property. Knowing which question you are asking is half the audit.