Case Study 2 — Sharded Before They Needed To

Distribution's complexity is real, and taking it on prematurely is its own disaster. A startup sharded its database from day one "to be ready for scale" — and spent two years fighting the complexity for a scale that never came, while a single PostgreSQL server would have served them effortlessly.

Background

A startup's founding engineers had previously worked at a web-scale company and were determined not to "hit a scaling wall." So from the very beginning — before they had a single paying customer — they architected the database as a sharded system: data split across multiple nodes by customer_id, with a custom routing layer in the application to send each query to the right shard. "We'll never have to re-architect when we grow," they reasoned.

For two years, they had hundreds of customers and a database that would have fit comfortably — with room to spare for years — on a single modest PostgreSQL server. Yet they paid the full cost of a distributed system the entire time.
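To make the "custom routing layer" concrete, here is a minimal sketch of what such a layer might look like. The source does not say how the team mapped customer_id to a shard, so the hash-modulo scheme, the shard count, and the connection strings below are all assumptions for illustration.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count, not from the case study

# Hypothetical connection strings, one per shard.
SHARD_DSNS = [f"postgresql://db-shard-{i}.internal/app" for i in range(NUM_SHARDS)]


def shard_for(customer_id: int) -> int:
    """Map a customer_id to a shard index using a stable hash."""
    # crc32 gives the same result in every process and on every host,
    # which a routing function needs.
    return zlib.crc32(str(customer_id).encode("utf-8")) % NUM_SHARDS


def dsn_for(customer_id: int) -> str:
    """Return the connection string of the shard that owns this customer."""
    return SHARD_DSNS[shard_for(customer_id)]


# Every query in the application has to pass through a lookup like this one.
print(dsn_for(42))
```

Even this small sketch shows where the risk sits: any code path that forgets to call the router, or calls it with the wrong key, reads or writes the wrong node. With a modulo scheme like this one, changing NUM_SHARDS also remaps most customers to different shards, so adding a node means moving data.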

The cost they paid for scale they didn't have

Sharding made everything harder, daily:

  • Cross-shard queries were a nightmare. Any query spanning customers — "total revenue across all customers," admin dashboards, analytics — had to query every shard and merge results in application code. Simple reports became distributed-systems problems. Joins across shards were effectively impossible.
  • No cross-shard transactions. Operations touching data on different shards couldn't be atomic without a fragile, slow distributed-transaction layer that they had to build and maintain themselves (the two-phase commit pain described in Chapter 35).
  • Constant routing complexity. Every query had to determine its shard; bugs in the routing layer sent queries to the wrong shard or missed data. The routing code was a perpetual source of subtle bugs.
  • Operational burden. Backups, migrations (Chapter 22), and monitoring all had to be done N times across shards and kept consistent. A schema change meant coordinating a migration across every node.
  • Rebalancing fear. Adding or removing a shard meant moving data — risky and complex — so they avoided it, even when one shard got hotter than others.

All of this for a dataset that a single server could have served from an index in milliseconds. The "scaling preparation" was pure overhead: it slowed development, multiplied bugs, and consumed engineering time that a startup desperately needed for product.
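The cross-shard query cost can be sketched as a scatter-gather in application code. This is an illustration only: in-memory SQLite databases stand in for the PostgreSQL shards, and the invoices table, its columns, and the sample rows are hypothetical.

```python
import heapq
import sqlite3


def make_shard(rows):
    """Create an in-memory SQLite database standing in for one shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (customer_id INTEGER, amount_cents INTEGER)")
    conn.executemany("INSERT INTO invoices VALUES (?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical data: each customer lives on exactly one shard.
shards = [
    make_shard([(1, 1000), (1, 2500)]),
    make_shard([(2, 4000)]),
    make_shard([]),  # an empty shard still has to be queried
]


def total_revenue_cents(shards):
    """Scatter the aggregate to every shard, then sum the partial results."""
    total = 0
    for conn in shards:
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(amount_cents), 0) FROM invoices"
        ).fetchone()
        total += partial
    return total


def top_customers(shards, n):
    """Collect per-customer totals from every shard and merge them in Python."""
    per_customer = []
    for conn in shards:
        per_customer.extend(
            conn.execute(
                "SELECT customer_id, SUM(amount_cents) FROM invoices GROUP BY customer_id"
            ).fetchall()
        )
    return heapq.nlargest(n, per_customer, key=lambda row: row[1])


print(total_revenue_cents(shards))  # 7500
print(top_customers(shards, 1))     # [(2, 4000)]
```

Each report needs its own merge logic (sum, sort, top-N), and the sketch ignores what a real version would also have to handle: a shard that is slow or unreachable, and results that are not taken from a single consistent snapshot.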

The fix: de-shard

The team eventually consolidated back to a single PostgreSQL server, with read replicas for availability and read scaling (see Case Study 1). The result was liberating:

  • Cross-customer queries became normal SQL — one join, one query, no merging.
  • Transactions across any data became normal transactions — atomic, simple.
  • The routing layer was deleted — a whole category of bugs gone.
  • Backups, migrations, and monitoring became single-system tasks.

Development sped up dramatically, and the single server (plus replicas) handled their load with enormous headroom. They kept a note: if they ever genuinely exceeded one server's write or storage capacity, they would revisit sharding (or move to a NewSQL database like CockroachDB that handles sharding automatically), and by then they would have real data on where the bottleneck was. Until then, simplicity won.
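For contrast, here is a minimal sketch of the same kind of work on one database. An in-memory SQLite database stands in for the single PostgreSQL server so the example is self-contained; the schema, the customer names, and the transfer_credit operation are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    credit_cents INTEGER NOT NULL CHECK (credit_cents >= 0)
);
CREATE TABLE invoices (
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount_cents INTEGER NOT NULL
);
INSERT INTO customers VALUES (1, 'Acme', 5000), (2, 'Globex', 1000);
INSERT INTO invoices VALUES (1, 1000), (1, 2500), (2, 4000);
""")


def revenue_by_customer(conn):
    """A cross-customer report is one ordinary join, with no merging code."""
    return conn.execute("""
        SELECT c.name, SUM(i.amount_cents)
        FROM customers c
        JOIN invoices i ON i.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.id
    """).fetchall()


def transfer_credit(conn, src, dst, cents):
    """Move credit between two customers in one atomic transaction."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE customers SET credit_cents = credit_cents + ? WHERE id = ?",
            (cents, dst),
        )
        conn.execute(
            "UPDATE customers SET credit_cents = credit_cents - ? WHERE id = ?",
            (cents, src),
        )


print(revenue_by_customer(conn))  # [('Acme', 3500), ('Globex', 4000)]

try:
    transfer_credit(conn, 2, 1, 3000)  # Globex only has 1000, so the CHECK fails
except sqlite3.IntegrityError:
    pass

# The first UPDATE was rolled back together with the failed one.
print(conn.execute("SELECT credit_cents FROM customers ORDER BY id").fetchall())
# [(5000,), (1000,)]
```

In the sharded design these two customers could have lived on different nodes, and the same transfer would have needed the hand-built distributed-transaction layer. Here the database's own transaction provides the guarantee.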

The analysis

  1. Premature distribution is at least as costly as any other premature optimization, and usually more so. Sharding imposes cross-shard query/transaction complexity, routing bugs, and N-fold operational burden immediately and continuously, in exchange for scale you may never need. The complexity is paid every day; the benefit may never arrive.

  2. A single PostgreSQL server scales much further than people assume. Hundreds of customers, millions of rows, modest write rates — that's comfortable for one well-tuned server, with read replicas extending it further. Most companies never outgrow this. The web-scale architectures that inspire premature sharding solve problems most apps will never have.

  3. Sharding is the last resort, not the first. The order is: one server → tune/index → read replicas → (only if you genuinely exceed one server's write/storage capacity) sharding or NewSQL. Each step is far simpler than the next; don't skip to the end. (This is Chapter 1's Lumen lesson — "we might scale huge" — at the infrastructure level.)

  4. Distribute on evidence, not anticipation. The right time to shard is when you have measured evidence that a single server's capacity is the bottleneck — not a hypothetical fear. Then you'll also know how to shard (which key, which queries matter) from real data.

  5. If you must distribute, prefer tools that hide the complexity. Had they truly needed scale, a NewSQL database (CockroachDB/YugabyteDB) that shards and replicates automatically while preserving SQL/ACID would have spared them the hand-rolled routing and distributed-transaction layer (Chapter 35).
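The escalation order in point 3 and the evidence rule in point 4 can be sketched as a small decision helper. Everything here is an assumption for illustration: the Measurements fields, the idea of expressing load as a fraction of one server's capacity, and the 0.8 saturation threshold are hypothetical, not figures from the case study.

```python
from dataclasses import dataclass

SATURATION = 0.8  # hypothetical threshold; choose one from your own measurements


@dataclass
class Measurements:
    """Measured facts about the current single-server setup (hypothetical shape)."""
    has_untuned_slow_queries: bool  # slow queries that tuning or indexing could still fix
    read_utilization: float         # reads as a fraction of one server's capacity
    write_utilization: float        # writes as a fraction of one server's capacity
    storage_utilization: float      # data size as a fraction of one server's storage


def next_step(m: Measurements) -> str:
    """Return the cheapest step in the escalation order that the evidence supports."""
    if m.has_untuned_slow_queries:
        return "tune/index"
    if m.write_utilization >= SATURATION or m.storage_utilization >= SATURATION:
        # Only writes or storage beyond one server justify distributing the data.
        return "sharding or NewSQL"
    if m.read_utilization >= SATURATION:
        return "read replicas"
    return "stay on one server"


# A workload like the startup's: small data, modest writes, nothing saturated.
print(next_step(Measurements(False, 0.10, 0.05, 0.02)))  # stay on one server
```

The helper takes only measurements as input. Anticipated growth has no parameter, so a fear of future scale alone can never return "sharding or NewSQL".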

Discussion questions

  1. List the daily costs the team paid for sharding before they needed it.
  2. Why does a single PostgreSQL server (+ replicas) suffice for far more workloads than people assume?
  3. What's the right escalation order from one server to a sharded system?
  4. What evidence should trigger an actual decision to shard?
  5. ⭐ Contrast this with Case Study 1. What single principle (about when to add distribution complexity) ties both together?