Case Study 2 — Sharded Before They Needed To
Distribution's complexity is real, and taking it on prematurely is its own disaster. A startup sharded its database from day one "to be ready for scale" — and spent two years fighting the complexity for a scale that never came, while a single PostgreSQL server would have served them effortlessly.
Background
A startup's founding engineers had previously worked at a web-scale company and were determined not to "hit a scaling wall." So from the very beginning — before they had a single paying customer — they architected the database as a sharded system: data split across multiple nodes by customer_id, with a custom routing layer in the application to send each query to the right shard. "We'll never have to re-architect when we grow," they reasoned.
For two years, they had hundreds of customers and a database that would have fit comfortably — with room to spare for years — on a single modest PostgreSQL server. Yet they paid the full cost of a distributed system the entire time.
The cost they paid for scale they didn't have
Sharding made everything harder, daily:
- Cross-shard queries were a nightmare. Any query spanning customers — "total revenue across all customers," admin dashboards, analytics — had to query every shard and merge results in application code. Simple reports became distributed-systems problems. Joins across shards were effectively impossible.
- No cross-shard transactions. Operations touching data on different shards couldn't be atomic without a fragile, slow distributed-transaction layer they had to build and maintain (Chapter 35's 2PC pain).
- Constant routing complexity. Every query had to determine its shard; bugs in the routing layer sent queries to the wrong shard or missed data. The routing code was a perpetual source of subtle bugs.
- Operational burden. Backups, migrations (Chapter 22), and monitoring all had to be done N times across shards and kept consistent. A schema change meant coordinating a migration across every node.
- Rebalancing fear. Adding or removing a shard meant moving data — risky and complex — so they avoided it, even when one shard got hotter than others.
All of this for a dataset that a single server's index could have scanned in milliseconds. The "scaling preparation" was pure overhead — slowing development, multiplying bugs, and consuming engineering time that a startup desperately needed for product.
The fix: de-shard
The team eventually consolidated back to a single PostgreSQL server (with read replicas for availability and read scaling — Case Study 1). The result was liberating:
- Cross-customer queries became normal SQL — one join, one query, no merging.
- Transactions across any data became normal transactions — atomic, simple.
- The routing layer was deleted — a whole category of bugs gone.
- Backups, migrations, and monitoring became single-system tasks.
Development sped up dramatically, and the single server (plus replicas) handled their load with enormous headroom. They kept a note: if they ever genuinely exceeded one server's write/storage capacity, they'd revisit sharding (or move to a NewSQL database like CockroachDB that handles sharding automatically) — at that point, with real data on where the bottleneck was. Until then, simplicity won.
The analysis
-
Premature distribution is as costly as premature anything — more so. Sharding imposes cross-shard query/transaction complexity, routing bugs, and N-fold operational burden immediately and continuously, in exchange for scale you may never need. The complexity is paid every day; the benefit may never arrive.
-
A single PostgreSQL server scales much further than people assume. Hundreds of customers, millions of rows, modest write rates — that's comfortable for one well-tuned server, with read replicas extending it further. Most companies never outgrow this. The web-scale architectures that inspire premature sharding solve problems most apps will never have.
-
Sharding is the last resort, not the first. The order is: one server → tune/index → read replicas → (only if you genuinely exceed one server's write/storage capacity) sharding or NewSQL. Each step is far simpler than the next; don't skip to the end. (This is Chapter 1's Lumen lesson — "we might scale huge" — at the infrastructure level.)
-
Distribute on evidence, not anticipation. The right time to shard is when you have measured evidence that a single server's capacity is the bottleneck — not a hypothetical fear. Then you'll also know how to shard (which key, which queries matter) from real data.
-
If you must distribute, prefer tools that hide the complexity. Had they truly needed scale, a NewSQL database (CockroachDB/YugabyteDB) that shards and replicates automatically while preserving SQL/ACID would have spared them the hand-rolled routing and distributed-transaction layer (Chapter 35).
Discussion questions
- List the daily costs the team paid for sharding before they needed it.
- Why does a single PostgreSQL server (+ replicas) suffice for far more workloads than people assume?
- What's the right escalation order from one server to a sharded system?
- What evidence should trigger an actual decision to shard?
- ⭐ Contrast this with Case Study 1. What single principle (about when to add distribution complexity) ties both together?