Chapter 35: Distributed Databases — Replication, Sharding, and the CAP Theorem

DataField.Dev

30 min read

> Where you are: Part VI, Chapter 35 of 40. So far, one database server. This chapter is about spreading a database across many machines — for availability, geographic reach, and scale beyond one server — and the fundamental trade-off (CAP) that...

In This Chapter

Why distribute?
Replication: copies of the data
Sharding: splitting the data across nodes
Replication, in depth
Sharding, in depth
The CAP theorem
Eventual consistency
Distributed transactions
NewSQL: distributed and relational
A worked scaling scenario
Cloud managed databases
When you actually need distribution
The CAP theorem, deeply
NewSQL, consensus, and the scaling ladder
Common mistakes
Consistency models and distributed transactions, deeply
The verdict on distribution
Progressive project: reason about scale
Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 35: Distributed Databases — Replication, Sharding, and the CAP Theorem

Where you are: Part VI, Chapter 35 of 40. So far, one database server. This chapter is about spreading a database across many machines — for availability, geographic reach, and scale beyond one server — and the fundamental trade-off (CAP) that distribution forces.

Learning paths: 🏗️ DBA · 💻 Developer (at scale) · 🔬 CS students (CAP, consistency models). 📊 analysts can skim. A conceptual chapter — the why and the trade-offs matter more than commands.

Why distribute?

A single PostgreSQL server handles a lot — most applications never outgrow one well-tuned machine (plus replicas). But sometimes one server isn't enough:

Availability / fault tolerance — if the one server dies, the database is down. Multiple machines let the system survive a failure (high availability).
Read scaling — one server has finite capacity; spreading reads across copies handles more traffic.
Geographic proximity — users worldwide get lower latency from a database copy near them.
Scale beyond one server — data or write volume too large for any single machine's disk/CPU/memory.

Distribution delivers these, but at a cost: the moment data lives on multiple machines that can't always communicate, you face hard trade-offs that don't exist on one server. That's the theme of this chapter.

Replication: copies of the data

Replication keeps copies of the database on multiple servers. The common form is primary-replica (formerly "master-slave"):

        writes ──► PRIMARY ──(stream WAL)──► REPLICA 1 ──► reads
                      │                       REPLICA 2 ──► reads
                      └──(WAL)──────────────► REPLICA 3 ──► reads

The primary accepts writes; replicas receive a continuous stream of changes (PostgreSQL ships the WAL — Chapter 28 — and replicas replay it) and serve reads.
Read scaling: point reporting and read-heavy traffic at replicas (Chapter 34's quick win), freeing the primary for writes.
High availability: if the primary fails, a replica is promoted to become the new primary (failover).

Synchronous vs. asynchronous replication is a key choice:

Asynchronous (default) — the primary commits without waiting for replicas; replicas lag slightly behind. Fast, but a replica might be a moment stale (and a just-committed write could be lost if the primary dies before shipping it).
Synchronous — the primary waits for a replica to confirm before committing. No data loss on failover, but slower writes (every commit waits for the network round trip).

This is a durability/latency trade — choose by how much you can afford to lose vs. how fast writes must be.

Multi-primary replication (multiple writable nodes) exists but is harder — concurrent writes to different nodes can conflict, requiring conflict resolution. Most systems use single-primary for simplicity.

Sharding: splitting the data across nodes

Replication copies the whole database to each node — but every node still holds all the data, so it doesn't help when the data is too big for one machine, or when write volume exceeds one server. Sharding (horizontal partitioning across servers) splits the data itself: each shard holds a subset of the rows, on a different node.

   shard by customer_id:
     node A: customers 1–1M       node B: customers 1M–2M     node C: customers 2M–3M
   a query for customer 1.5M goes only to node B

Sharding is partitioning (Chapter 25) taken across machines: choose a shard key (like a partition key) and route each row — and each query — to the right node. Done well, it scales writes and storage near-linearly by adding nodes.

But sharding is hard:

Cross-shard queries/joins are expensive (data on different machines) — you design to keep related data on the same shard, and avoid queries that span shards.
Cross-shard transactions are slow and complex (distributed transactions, below).
Choosing the shard key is critical — a bad key creates hot spots (one overloaded shard) or forces cross-shard queries (the Chapter 25 wrong-key lesson, across machines).
Rebalancing when adding nodes is operationally complex.

Sharding is a serious step taken only at genuine scale; most applications never need it (a big server plus read replicas suffices far longer than people assume).

Replication, in depth

Replication — keeping copies of the database on multiple servers — is the most common and most accessible form of distribution, serving availability and read-scaling needs that most growing applications eventually have. Understanding it deeply, including the crucial synchronous-versus-asynchronous choice, is practical knowledge for anyone running a database at any scale.

The standard model is primary-replica replication: one primary server accepts all writes, and one or more replica servers receive a continuous stream of the primary's changes and serve reads. PostgreSQL implements this elegantly using the WAL (Chapter 28): since the WAL is a complete, ordered record of every change, the primary ships its WAL to the replicas, which replay it to stay in sync — the same mechanism that provides durability also provides replication, a beautiful reuse. This delivers two major benefits. Read scaling: read-heavy traffic (reporting, analytics, the read side of a read-heavy application) can be directed to replicas, offloading the primary so it can focus on writes — often the first and easiest scaling step (Chapter 34's quick win for analytics). High availability: if the primary fails, a replica can be promoted to become the new primary (failover), so the database survives a server failure rather than going down — turning a single point of failure into a recoverable event.

The crucial design choice is synchronous versus asynchronous replication, a genuine trade-off between durability and latency. Asynchronous replication (the common default) has the primary commit a write without waiting for replicas to confirm — so writes are fast (no waiting for the network), but replicas lag slightly behind the primary, and a write committed on the primary just before it crashes could be lost if it hadn't yet shipped to a replica. Synchronous replication has the primary wait for at least one replica to confirm receipt before reporting the commit successful — so no committed write is lost on failover (it's guaranteed to be on a replica), but every commit pays the latency of the network round trip to the replica, making writes slower. The choice is a durability-versus-latency trade: asynchronous for speed (accepting tiny potential data loss on failover), synchronous for guaranteed durability (accepting slower writes). Many systems use asynchronous replication for read replicas (where lag is acceptable) and synchronous to one replica for failover safety (no data loss), getting both read-scaling and durable failover.

A subtlety worth knowing is replica lag and its consequences for application correctness. With asynchronous replication, a replica is slightly behind the primary, so a read from a replica might not reflect a write that just committed on the primary — the "read your own writes" problem. If a user updates their profile (write to primary) and immediately views it (read from a lagging replica), they might see the old data, confusingly. The fixes: route reads that must reflect recent writes to the primary (not a replica), or use synchronous replication for those, or design the application to tolerate the lag. This is the consistency cost of read-scaling via async replicas — usually acceptable (most reads don't immediately follow a write), but a real consideration for read-after-write scenarios. Understanding replication — primary-replica via WAL shipping, read-scaling and failover benefits, the sync/async durability-latency trade, and replica lag's read-after-write implications — equips you for the scaling step most applications actually take, which is not sharding or NewSQL but simply adding read replicas to a primary, often via managed cloud PostgreSQL. It's the accessible, common form of distribution, and it goes a long way.

Sharding, in depth

Sharding — splitting the data itself across multiple servers, each holding a subset — is the heavier form of distribution, the one for when data or write volume exceeds any single machine, and understanding both its power and its genuine difficulty is essential to knowing when (rarely) to reach for it. Where replication copies the whole database to each node (so every node holds all the data, helping reads and availability but not capacity), sharding divides the data, so each shard holds only some rows, on a different node — scaling writes and storage by adding nodes.

Sharding is partitioning (Chapter 25) taken across machines: you choose a shard key (analogous to a partition key), and each row is routed to a shard based on it, with queries directed to the relevant shard(s). Done well, sharding scales near-linearly — double the nodes, roughly double the write and storage capacity — which is how the largest systems handle data volumes no single machine could hold. But sharding is genuinely hard, and its difficulties are why it's a serious step taken only at real scale. Cross-shard queries and joins are expensive: if a query needs data from multiple shards (on different machines), it must gather from each and combine — slow and complex — so you must design to keep related data on the same shard and avoid queries that span shards, which constrains your data model and queries significantly. Cross-shard transactions are slow and fragile (distributed transactions, below), so you design to keep transactions within a single shard. The shard key choice is critical: a bad key creates hot spots (one shard getting disproportionate load while others sit idle) or forces constant cross-shard queries — the wrong-key problem of Chapter 25, now across machines with worse consequences. Rebalancing when you add nodes (redistributing data to use the new capacity) is operationally complex.

These difficulties mean sharding fundamentally changes how you design the application — you must think about the shard key in every query, keep related data co-located, avoid cross-shard operations — which is a substantial ongoing constraint, not just a one-time setup. This is why sharding is a last resort, taken only when genuinely necessary: when you've truly exceeded what a single powerful server (plus read replicas) can handle in write volume or data size. And that point is much further out than people assume — a single modern PostgreSQL server with good design, indexing, and partitioning, plus read replicas, handles enormous workloads, far beyond what most applications ever reach. The premature-sharding mistake — taking on sharding's complexity and design constraints for scale you don't have and may never reach — is a serious one (the infrastructure-level version of the Lumen over-engineering mistake), adding permanent complexity to solve a hypothetical problem. The discipline is to scale up (a bigger server) and scale reads (replicas) as far as they go — which is far — before scaling out (sharding), and to shard only when a demonstrated, measured need exceeds those simpler options. When you genuinely need it, sharding (or NewSQL, which automates it) is the answer; until then, it's complexity to avoid. Knowing both its power (linear scale-out) and its cost (design constraints, operational complexity, cross-shard difficulties) is what lets you make that call correctly — almost always "not yet," occasionally "now, genuinely."

The CAP theorem

Here is the fundamental law of distributed databases. The CAP theorem states that when a network partition occurs (nodes can't communicate — which will happen in any distributed system), a distributed system can guarantee at most two of:

Consistency — every read sees the latest write (all nodes agree).
Availability — every request gets a (non-error) response.
Partition tolerance — the system keeps working despite network partitions.

Since partitions are inevitable in a distributed system, P is not optional — so the real choice, during a partition, is between C and A:

CP (consistency over availability): during a partition, refuse requests that can't be made consistent (return errors / become unavailable on the minority side) rather than serve stale/conflicting data. Choose this when correctness is paramount (banking, inventory).
AP (availability over consistency): during a partition, keep serving (possibly stale) data and reconcile later. Choose this when uptime matters more than perfect freshness (social feeds, shopping carts, caches).

   Network partition splits the nodes. You must choose:
     CP → stay consistent, but some requests fail (unavailable)
     AP → stay available, but some reads are stale (inconsistent)

CAP is often oversimplified ("pick two"), and modern systems are more nuanced (the trade-off only bites during a partition; PACELC extends it to latency-vs-consistency even when there's no partition). But the core insight is essential: distribution forces a consistency/availability trade-off that a single server never has to make.

Eventual consistency

Many AP systems (and most NoSQL, Chapter 33) offer eventual consistency: after a write, replicas converge to the latest value eventually, but a read right after a write might see a stale value. This is fine for some data (a like count, a social feed — briefly stale is OK) and unacceptable for others (an account balance — must be correct now). Choosing a database means choosing its consistency model to match your data's needs (Chapter 33's lesson, formalized).

Distributed transactions

A transaction (Chapter 26) that spans multiple nodes is a distributed transaction, and atomicity across machines is hard. The classic protocol is two-phase commit (2PC): a coordinator asks all nodes to prepare (phase 1), and if all agree, tells them to commit (phase 2). It works, but it's slow (multiple round trips) and fragile (if the coordinator fails mid-protocol, nodes can be left blocked). This is why sharded systems try hard to keep transactions within a single shard, and why distributed transactions are avoided when possible.

NewSQL: distributed and relational

For a long time the choice seemed binary: relational (single-server, ACID) or NoSQL (distributed, eventually consistent). NewSQL databases aim for both — horizontal scale-out with relational guarantees (SQL, ACID transactions, strong consistency):

Google Spanner — globally distributed, strongly consistent, SQL (uses synchronized clocks for global ordering).
CockroachDB, YugabyteDB — open-source, PostgreSQL-compatible, distributed SQL databases that shard and replicate automatically while preserving ACID.
TiDB — distributed SQL (MySQL-compatible).

NewSQL is the "have it all" category — they make real trade-offs (often higher latency for the strong consistency, via consensus protocols like Raft/Paxos that coordinate replicas), but they let you scale horizontally without abandoning SQL and transactions. For applications that genuinely outgrow a single PostgreSQL server but need relational guarantees, NewSQL is increasingly the answer.

A worked scaling scenario

Let's walk through how a growing application's database scales over time, because seeing the sequence of scaling decisions — climbing the ladder rung by rung as need demands — makes the abstract guidance concrete and shows how rarely the higher rungs are reached. Follow an application from launch to large scale.

At launch, the application runs on a single PostgreSQL server. With good design (Part III), indexing (Chapter 23), and query optimization (Chapter 24), this handles the early traffic easily — and will handle far more growth than the team might expect. Most applications live their entire lives here, on one well-tuned server, never needing more. As traffic grows, the first pressure is usually read load (more users viewing data) and a desire for availability (the single server is a single point of failure). The response is read replicas: add one or two replicas (easily, via managed cloud PostgreSQL), direct read-heavy traffic (reporting, the read side of the app) to them, and configure failover so a replica can be promoted if the primary fails. This single step — primary plus replicas — addresses both read-scaling and availability, and it serves the great majority of applications that grow beyond launch. The team is still on "PostgreSQL," just replicated, with all the relational power intact.

If the application keeps growing, the next pressures might be a very large table (the orders table reaches hundreds of millions of rows) — addressed by partitioning (Chapter 25), still within the single primary, splitting the big table physically while keeping it on one server. Vertical scaling (a bigger server) also buys substantial headroom cheaply. Only if the application grows to where a single server's write or storage capacity is genuinely exceeded — a real, measured limit, after exhausting vertical scaling, replicas, and partitioning — does the team face the heavy decision: shard (split data across servers, taking on the design constraints and complexity) or adopt NewSQL (a distributed relational database like CockroachDB that shards and replicates automatically while preserving ACID, paying coordination latency). This rung is reached by relatively few applications — those at genuine large scale — and reached late, after the simpler options are exhausted. The team that arrives here does so because they have a demonstrated need, not a hypothetical one.

The scenario's lesson is the sequence and its timing: single server → vertical scaling and read replicas → partitioning → (rarely, at genuine scale) sharding or NewSQL, with each step taken only when a measured need pushes past the previous. The crucial observations are how far the early rungs go (a single replicated server handles enormous load) and how rarely the late rungs are needed (most applications never shard). This vindicates the book's relational, single-server-focused approach: the skills you've built — design, SQL, indexing, optimization, replication — carry an application through almost its entire scaling journey, and the heavy distribution that seems glamorous is reached by few, late, on genuine need. The mistake is to skip ahead — to shard at launch because you "might scale huge" — taking on the late-rung complexity before climbing the cheap early rungs that would have sufficed. Scale the way the scenario shows: deliberately, rung by rung, as measured need requires, staying on the simpler rungs (single server, replicas) as long as they serve — which is far longer than the premature-distribution enthusiasts assume. That disciplined, need-driven scaling is how real applications grow, and it's why the single PostgreSQL server you've mastered is the foundation of an application's entire scaling journey, not just its start.

Cloud managed databases

In practice, most teams get distribution's benefits via managed cloud databases rather than running it themselves:

Amazon RDS / Google Cloud SQL / Azure Database for PostgreSQL — managed PostgreSQL: the cloud provider handles replication, failover, backups, and patching. You get HA and read replicas without operating them.
Amazon Aurora / Google AlloyDB — PostgreSQL-compatible databases re-engineered for the cloud with distributed storage, faster replication, and auto-scaling.

For most applications, managed PostgreSQL with a replica or two delivers the availability and read-scaling they need — far simpler than sharding or NewSQL, and the right default until you have a concrete reason to go further.

When you actually need distribution

The honest guidance (theme of Part VI):

Almost everyone is well served by one strong (managed) PostgreSQL primary + read replicas for HA and read scaling. This scales much further than people assume.
Reach for sharding or NewSQL only when you genuinely exceed a single server's write/storage capacity, or need multi-region strong consistency — a real, demonstrated need, not a hypothetical "we might scale huge" (Chapter 1's Lumen mistake, at the infrastructure level).
Match the consistency model to the data (CAP) — strong consistency where correctness demands it, eventual where availability matters more.

The CAP theorem, deeply

The CAP theorem is the fundamental law of distributed databases — the inescapable trade-off that distribution forces and a single server never faces — so understanding it deeply, beyond the oversimplified "pick two," is essential to reasoning about distributed systems. The theorem concerns what happens during a network partition, and grasping that framing is the key to understanding it correctly.

The theorem states that a distributed system can guarantee at most two of three properties: Consistency (every read sees the latest write — all nodes agree on the current state), Availability (every request gets a non-error response), and Partition tolerance (the system keeps working despite network partitions — nodes unable to communicate). The crucial insight that the "pick two" framing obscures is that partitions are inevitable in any distributed system — networks fail, packets drop, nodes become unreachable — so partition tolerance (P) is not optional; you must tolerate partitions if you're distributed. This means the real choice isn't "pick two of three" but rather: during a partition, when nodes can't communicate, you must choose between Consistency and Availability. You can't have both during a partition, because keeping all nodes consistent requires them to communicate (impossible during a partition), so you either refuse requests that can't be made consistent (sacrificing availability) or serve possibly-stale data (sacrificing consistency).

This gives the two real stances. A CP system (consistency over availability) chooses, during a partition, to refuse requests it can't guarantee are consistent — the minority side of a partition becomes unavailable rather than serve stale or conflicting data. You choose CP when correctness is paramount: a banking system should refuse a transaction it can't make consistent rather than risk a wrong balance. An AP system (availability over consistency) chooses, during a partition, to keep serving (possibly stale) data and reconcile later — staying available at the cost of temporary inconsistency. You choose AP when uptime matters more than perfect freshness: a social media feed should keep showing posts (even slightly stale) rather than go down. The choice is per system (or even per operation) and reflects what the data needs: CP for data where being wrong is unacceptable, AP for data where being briefly stale is fine but being unavailable is costly. Critically, this trade-off only bites during a partition — when the network is healthy, a well-designed distributed system can offer both consistency and availability; CAP is about the partition scenario specifically.

The theorem is often oversimplified, and the nuances matter. The "pick two" phrasing misleads (P isn't really optional, so it's "pick C or A during a partition"). The extension PACELC captures more: it says that during a Partition you trade Availability vs. Consistency (the CAP part), Else (when there's no partition) you trade Latency vs. Consistency — because even without a partition, keeping nodes strongly consistent requires coordination that adds latency, so distributed systems face a consistency-latency trade even in normal operation. This is why NewSQL systems (strongly consistent and distributed) often have higher latency than single-node databases — they pay coordination latency for their consistency. The deep lesson of CAP (and PACELC) is that distribution forces trade-offs that a single server never has to make: a single PostgreSQL server is trivially consistent and available (no partition possible within one machine) with no coordination latency, while any distributed system must navigate consistency versus availability (during partitions) and consistency versus latency (always). Understanding this — why distribution forces these trades, and how to choose (CP vs AP) based on what your data needs — is the conceptual core of distributed databases, and it explains why distribution is a cost (these unavoidable trade-offs) accepted only for the benefits (scale, availability, geography) when genuinely needed. The single server's freedom from these trades is, itself, a strong argument for staying on one (well-replicated) server as long as you can.

NewSQL, consensus, and the scaling ladder

Two final pieces complete the distributed-databases picture: how NewSQL achieves distributed strong consistency, and where all these techniques sit on the ladder of scaling options — which frames the practical decision of how far to climb.

NewSQL databases (Google Spanner, CockroachDB, YugabyteDB, TiDB) deliver what once seemed impossible: horizontal scale-out with relational guarantees (SQL, ACID, strong consistency). They achieve distributed strong consistency using consensus protocols — Raft or Paxos — which let a group of nodes agree on the order of operations despite failures, so the distributed system behaves consistently. The cost, per PACELC, is latency: reaching consensus requires nodes to coordinate (network round trips), so writes are slower than a single-node database's. NewSQL accepts this latency cost to provide strong consistency at scale — a different trade than NoSQL's (which sacrifices consistency for availability and lower latency). Spanner famously uses synchronized atomic clocks to order operations globally; CockroachDB and YugabyteDB are open-source and PostgreSQL-compatible, automatically sharding and replicating while preserving ACID. For applications that genuinely outgrow a single PostgreSQL server and need relational guarantees and strong consistency, NewSQL is the answer — distributed scale without abandoning SQL and transactions, paying coordination latency for the privilege.

These techniques form a scaling ladder, and knowing where each rung sits frames the practical decision of how high to climb. The rungs, from simplest to most complex: a single well-tuned server (handles most applications — never underestimate it); vertical scaling (a bigger server — simple, goes far); read replicas (offload reads, add availability via failover — the common, accessible scaling step, often via managed cloud PostgreSQL); partitioning (Chapter 25 — split a large table within one server); sharding (split data across servers — heavy, for genuine write/storage scale beyond one machine); and NewSQL (distributed relational with strong consistency — for multi-node scale that needs ACID). Each rung adds capability and complexity, so you climb only as high as a measured need requires. The crucial, repeated guidance is that most applications never climb past read replicas — a strong managed PostgreSQL primary with a replica or two handles far more than people assume, providing availability and read-scaling without the complexity of sharding or NewSQL. The teams that thrive scale up and read-replicate as far as those go (which is far) before scaling out, and reach for sharding or NewSQL only on demonstrated need. The teams that struggle often distribute prematurely — climbing to sharding for scale they don't have, taking on its enormous complexity for a hypothetical future.

This is Part VI's recurring theme at the infrastructure level: match the solution to the actual, measured need, and prefer the simpler option until a concrete requirement justifies the complex one. Distribution's benefits (availability, read-scaling, scale-out, geography) are real, and the techniques (replication, sharding, NewSQL, managed cloud) deliver them — but every step up the ladder costs complexity and, per CAP/PACELC, forces trade-offs a single server avoids. So the wisdom is to climb deliberately: a single (managed, replicated) PostgreSQL for almost everyone, the heavier distribution for the genuine, rare cases that truly need it. Knowing the full ladder — and that most applications stay near its bottom — is what lets you scale appropriately, neither under-provisioning (a single server crushed by load it can't handle) nor over-engineering (a sharded NewSQL cluster for an application a single server would serve). That judgment, matching infrastructure to need, is the practical payoff of understanding distributed databases: not to distribute, but to know when to, and when not to.

Common mistakes

Distributing/sharding prematurely — taking on enormous complexity for scale you don't have. A big server + replicas goes a long way.
Ignoring CAP — assuming you can have perfect consistency and availability during a partition. You can't; choose deliberately.
Using eventual consistency for data that needs strong consistency (balances, inventory) — stale reads cause real errors.
Cross-shard queries/transactions as a routine — they're slow and complex; design to avoid them.
A bad shard key — hot spots or constant cross-shard queries (the wrong-key lesson, Chapter 25).

Consistency models and distributed transactions, deeply

Distribution forces nuanced choices about consistency and makes transactions across nodes genuinely hard — two topics that deserve a closer look because they're where distributed systems most differ from the single-server world you've mastered. Both stem from the fundamental reality that coordinating multiple machines is expensive and fallible.

Consistency in distributed systems isn't binary (consistent or not) but a spectrum of models offering different guarantees at different costs. Strong consistency — every read sees the latest write, as if there were one copy — is what single-server databases provide trivially and what distributed systems must work hard (and pay latency) to achieve. Eventual consistency — replicas converge to the latest value eventually, so a read after a write might be stale — is the relaxed model many AP systems and NoSQL databases offer, trading immediate consistency for availability and lower latency. Between them lie intermediate models: read-your-writes consistency (you always see your own writes, even if others' are delayed), monotonic reads (you never see data go backwards in time), causal consistency (causally-related operations are seen in order). These intermediate models offer useful guarantees weaker than strong but stronger than eventual, letting systems provide enough consistency for correctness while avoiding the full cost of strong consistency. The practical skill is matching the consistency model to what the data needs: an account balance needs strong consistency (must be correct now); a social feed tolerates eventual (briefly stale is fine); a user viewing their own profile edits needs read-your-writes (they must see their change). Choosing a distributed database means choosing where on the consistency spectrum it sits, and ensuring that matches your data's actual requirements — too weak risks correctness bugs (stale balances), too strong wastes latency (coordinating data that didn't need it).

Distributed transactions — a transaction (Chapter 26) spanning multiple nodes — are where atomicity meets the hard reality of coordinating machines. Achieving ACID atomicity across nodes (all commit or all roll back, even across a network) is genuinely difficult. The classic protocol is two-phase commit (2PC): a coordinator asks all participating nodes to prepare (phase 1 — each node verifies it can commit and promises to), and if all agree, the coordinator tells them all to commit (phase 2). This achieves atomicity across nodes, but it's slow (multiple network round trips for every distributed transaction) and fragile (if the coordinator fails between phases, participants can be left blocked — having promised to commit but not knowing whether to, holding locks while waiting). These costs — latency and blocking-on-coordinator-failure — are why distributed transactions are avoided when possible: sharded systems are designed to keep each transaction within a single shard (where normal single-node transactions work), precisely to avoid the distributed-transaction cost. NewSQL systems use sophisticated consensus protocols (Raft/Paxos) rather than naive 2PC to make distributed transactions more robust, but even they pay coordination latency. The lesson is that transactions across nodes are expensive and fragile, so good distributed design minimizes them — keeping related data co-located (on the same shard) so transactions stay local. This is yet another way distribution complicates what's simple on a single server: a single-server transaction is fast and atomic trivially, while a distributed transaction is slow, fragile, and best avoided — which is part of why staying on a single server (where transactions are easy) is valuable as long as you can.

The verdict on distribution

Pulling Chapter 35 together into a clear verdict, since "should I distribute, and how?" is the practical question all this theory serves. The verdict: most applications should run on a single (managed, replicated) PostgreSQL server for as long as possible, distributing further only when a concrete, measured need genuinely requires it — and then choosing the technique that matches the specific need.

The reasoning rests on everything the chapter covered. A single server is trivially consistent and available with no coordination latency (free of CAP's trades) and makes transactions easy — so it's the simplest, most capable default, handling far more than people assume. The accessible first scaling step is read replicas (and managed-cloud failover): they add availability and read-scaling without the deep complexity of sharding, via the WAL-based replication PostgreSQL provides naturally, and this alone serves the great majority of applications that ever need to scale beyond one server's raw capacity. Only when you genuinely exceed a single server's write or storage capacity — a real, demonstrated limit, not a hypothetical — do you climb to sharding (with its design constraints and operational complexity) or NewSQL (distributed relational with strong consistency at a latency cost). And throughout, you match the consistency model (CAP/PACELC) to what the data needs — strong where correctness demands it, eventual where availability matters more — choosing deliberately rather than accepting whatever a chosen system defaults to.

The two failure modes to avoid frame the verdict. Premature distribution — sharding or going NewSQL for scale you don't have — takes on enormous complexity and design constraints to solve a problem that may never arise, the infrastructure-level over-engineering mistake; a single server plus replicas would have served for years. Ignoring the trade-offs — assuming you can have perfect consistency and availability during a partition, or using eventual consistency for data that needs strong consistency — causes subtle, serious bugs (stale balances, oversold inventory) that the CAP theorem warned were inevitable. Between them: distribute deliberately, on measured need, with the consistency model matched to the data, climbing the scaling ladder only as high as required. This is Part VI's theme — match the solution to the actual need — applied to the scaling question, and it closes the distributed-databases chapter where it should: not with "distribute for scale" but with "understand distribution's costs and trade-offs so you distribute only when truly needed, and correctly when you do." The single server you've mastered throughout this book is, for most applications, the right answer for a remarkably long time; distribution is the powerful, complex tool for when you genuinely outgrow it, used with the understanding this chapter provides. Knowing when not to distribute is as valuable as knowing how — perhaps more so, given how often distribution is reached for prematurely.

The next chapter turns from scaling the relational model to specializing it — the time-series, vector, geospatial, and search workloads that have their own optimal tools, and (theme #4 yet again) PostgreSQL's extensions that handle a remarkable number of them without a separate system. Having seen when data must spread across machines (this chapter), you'll see when data has a shape specialized enough to warrant a purpose-built tool — completing Part VI's survey of the boundaries of the general-purpose relational database.

Progressive project: reason about scale

For your project (a reasoning exercise):

Would one server + replicas suffice? For most domains, yes — say why.
Identify the first thing that would push you to distribute (read load? availability? data size? geography?) and which technique addresses it (replicas / sharding / NewSQL / managed cloud).
Pick a CAP stance for your most critical data: does it need strong consistency (CP) or is availability more important (AP)? Justify.
(Stretch) If you sharded, what would the shard key be, and what cross-shard queries would you need to avoid?

Summary

Distributed databases spread data across machines for availability, read scaling, geography, and scale beyond one server. Replication copies the database (primary accepts writes, replicas serve reads and enable failover; sync = no loss but slower, async = fast but lag). Sharding splits the data across nodes (partitioning across machines) to scale writes/storage — powerful but hard (cross-shard queries/transactions, shard-key choice, rebalancing). The CAP theorem is the fundamental law: during an inevitable network partition, you choose consistency (CP) or availability (AP) — distribution forces a trade-off a single server never faces. Many systems offer eventual consistency (converge eventually; match it to data that tolerates staleness). Distributed transactions (2PC) are slow/fragile, so keep transactions within a shard. NewSQL (Spanner, CockroachDB, YugabyteDB) delivers distributed scale with relational/ACID guarantees via consensus. Most teams get distribution via managed cloud PostgreSQL + replicas — and most applications never need more. Distribute for a concrete need, not prematurely.

You can now: - Explain why and when to distribute (availability, read scaling, geography, scale). - Describe replication (primary-replica, sync vs async) and sharding, and their trade-offs. - State the CAP theorem and choose CP vs AP for given data. - Explain eventual consistency and distributed-transaction (2PC) difficulty. - Place NewSQL and managed cloud databases, and judge when distribution is warranted.

What's next. Chapter 36 — Time-Series, Vector, and Specialized Databases — the right tool for unusual data: time-series (metrics/IoT), vector (AI embeddings/similarity search), spatial (geographic), and search engines — and PostgreSQL's extensions for each.

Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.