Case Study 1 — Read Replicas: The 80% Solution to "We Need to Scale"
Before sharding or NewSQL, there's a far simpler distribution step that solves most scaling and availability needs: read replicas. A company facing read-load and uptime worries got both with managed PostgreSQL replicas — no sharding, no rewrite.
Background
A SaaS company's single PostgreSQL server was under growing pressure. Two worries:
- Read load — the app was read-heavy (dashboards, list views, reports), and at peak the single server's CPU and I/O were near capacity. Reads were starting to slow.
- Availability — the entire business ran on one database server. If it failed, everything was down. Leadership wanted resilience.
The engineering team's first instinct, fueled by blog posts about web-scale companies, was "we need to shard" or "we should move to a distributed NoSQL database." Both would have meant a major, risky rewrite. They paused and asked what would actually solve their two specific problems.
The right step: replication
Neither problem required sharding (their data fit comfortably on one machine; write volume was modest). Both were solved by read replicas — copies of the database that receive a stream of changes and serve reads.
Using their managed cloud provider (RDS), they added two read replicas of the primary:
writes ──► PRIMARY ──(streaming WAL)──► REPLICA 1 (reads)
│ REPLICA 2 (reads)
└── promote a replica on failure (automatic failover)
- Read scaling: they routed read-only traffic — dashboards, reports, list endpoints — to the replicas, leaving the primary to handle writes and the reads that needed absolute freshness. The primary's load dropped sharply; read capacity roughly tripled.
- High availability: the managed service monitored the primary and, on failure, automatically promoted a replica to primary (failover) in under a minute. The single-point-of-failure was gone.
This took days, not months, and required almost no application changes — just directing read queries to a replica connection. No sharding, no rewrite, no abandoning SQL or transactions.
The one wrinkle: replication lag
Async replicas lag the primary slightly (Chapter 35) — usually milliseconds, occasionally more under load. So a user who writes data and immediately reads it back from a replica might not see their own change yet ("read-your-writes" inconsistency). The team handled this simply: reads that must reflect a just-made write (e.g., "show the order I just placed") go to the primary; everything else (dashboards, browsing, reports — where milliseconds of staleness is invisible) goes to replicas. A small routing rule, matched to each read's freshness needs.
The analysis
-
Read replicas are the highest-leverage distribution step. They solve the two most common scaling needs — read capacity and availability — with minimal complexity and no rewrite. For the vast majority of "we need to scale the database" situations, this is the answer, and it should be tried before sharding or NewSQL.
-
Diagnose the actual need. The team almost sharded (huge complexity) when their problems were read load and availability — both solved by replicas. As with indexing (Ch. 23) and caching (Ch. 33), identify the specific bottleneck before reaching for the heaviest tool.
-
Replication scales reads and provides HA; it does not scale writes. All writes still go to the one primary. That's fine until write volume or data size exceeds one machine — only then do you need sharding (Case Study 2 warns against doing it early). Replicas buy you a long runway.
-
Managed cloud databases make this trivial. RDS/Cloud SQL/Aurora handle replica provisioning, streaming, and automatic failover. You get distribution's benefits without operating the machinery — the right default for most teams (Chapter 35).
-
Handle replication lag deliberately. Route freshness-critical reads to the primary, lag-tolerant reads to replicas. A simple, explicit rule (matched to each read's needs) sidesteps the read-your-writes pitfall.
Discussion questions
- The team's two problems were read load and availability. Why are replicas the right fix for both?
- Why did they not need sharding or NewSQL?
- What does replication not solve, and what would push them to the next step?
- What is replication lag, and how did they handle read-your-writes consistency?
- ⭐ Why is "managed PostgreSQL + replicas" the recommended default before sharding? When does it stop being enough?