Chapter 3 Exercises: Distributed Systems Fundamentals

Exercise 3.1: Byzantine Fault Tolerance Threshold Analysis

Difficulty: Intermediate | Estimated Time: 30 minutes

Part A: The PBFT Bound

Prove or disprove the following statement: A 5-node system using PBFT can tolerate 2 Byzantine faults.

Work through the following steps:

  1. Recall the Byzantine fault tolerance bound: a BFT system requires n >= 3f + 1 nodes to tolerate f Byzantine faults.
  2. For n = 5, calculate the maximum f.
  3. For f = 2, calculate the minimum n required.
  4. State your conclusion and provide an intuitive explanation for why the bound exists.

Hint: Consider what happens during the commit phase if 2 out of 5 nodes are Byzantine. How many honest nodes must agree? Can the protocol distinguish between a scenario where 2 nodes are Byzantine and one where 2 honest nodes are simply slow?
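After working through steps 2 and 3 by hand, you can sanity-check your arithmetic with a short script. This is our own helper sketch (the function names are illustrative, not part of the chapter's code), assuming the standard bound n >= 3f + 1:

```python
# Byzantine fault tolerance bound: n >= 3f + 1.

def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1 still holds."""
    return (n - 1) // 3

def min_nodes(f: int) -> int:
    """Smallest n that tolerates f Byzantine faults."""
    return 3 * f + 1

print(max_byzantine_faults(5))  # max f for a 5-node system
print(min_nodes(2))             # min n to tolerate 2 Byzantine faults
```

Compare the two printed values against your answers to steps 2 and 3 before stating your conclusion.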

Part B: Practical Implications

Suppose you are designing a consortium blockchain for 7 banks. Using PBFT:

  1. What is the maximum number of Byzantine banks the system can tolerate?
  2. If you need to tolerate 3 Byzantine banks, how many total participants do you need?
  3. One bank argues that adding more participants improves security linearly. Is this correct? Explain the relationship between n and fault tolerance.

Part C: Comparing Thresholds

Fill in the following table:

Total Nodes (n) | Max Crash Faults (CFT) | Max Byzantine Faults (BFT) | Difference
--------------- | ---------------------- | -------------------------- | ----------
3               |                        |                            |
5               |                        |                            |
7               |                        |                            |
10              |                        |                            |
100             |                        |                            |
1,000           |                        |                            |

What pattern do you observe as n grows? Express the asymptotic relationship between CFT and BFT fault tolerance.
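After filling in the table by hand, you can verify your entries with a short script. This sketch assumes the usual quorum bounds (crash tolerance f < n/2, Byzantine tolerance n >= 3f + 1); the helper name is ours, not from the chapter's code:

```python
# CFT: a majority quorum tolerates f < n/2 crash faults -> max f = (n - 1) // 2
# BFT: n >= 3f + 1                                      -> max f = (n - 1) // 3

def fault_bounds(n: int):
    """Return (max crash faults, max Byzantine faults, difference) for n nodes."""
    cft = (n - 1) // 2
    bft = (n - 1) // 3
    return cft, bft, cft - bft

for n in [3, 5, 7, 10, 100, 1000]:
    print(f"{n:>5}  {fault_bounds(n)}")
```

The growth of the third column as n increases is the pattern the question asks you to characterize.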


Exercise 3.2: CAP Theorem Classification

Difficulty: Intermediate | Estimated Time: 45 minutes

Classify each of the following systems as CP, AP, or CA. For each classification:

  • Identify which two properties the system prioritizes.
  • Explain what happens to the sacrificed property during a network partition.
  • Describe a specific scenario where the tradeoff becomes visible to the user.

System 1: MySQL with Synchronous Replication (Primary-Replica)

A MySQL primary database replicates all writes synchronously to a single replica. If the primary cannot reach the replica, it refuses to accept writes until connectivity is restored.

System 2: Apache Cassandra (Default Configuration)

Cassandra distributes data across multiple nodes. With the default consistency level of ONE, a write succeeds as long as one replica acknowledges it, and a read returns data from the first replica that responds. During a partition, both sides of the partition continue accepting reads and writes independently.

System 3: Bitcoin

Miners produce blocks approximately every 10 minutes. During a network partition, miners on both sides continue producing blocks, creating two divergent chains. When the partition heals, the chain with less accumulated proof of work (usually the shorter one) is abandoned, and transactions that appeared only on that chain are reversed.
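The reconciliation behaviour described here can be sketched as a toy model. This is not Bitcoin's actual implementation (real clients compare accumulated proof of work and rebroadcast orphaned transactions via the mempool); it only illustrates which transactions get "reversed" when one chain wins:

```python
# Toy longest-chain fork resolution. Chains are lists of blocks;
# each block is a list of transaction identifiers.

def resolve_fork(chain_a: list, chain_b: list):
    """Return (winning_chain, reversed_txs) under a longest-chain rule."""
    winner, loser = (
        (chain_a, chain_b) if len(chain_a) >= len(chain_b) else (chain_b, chain_a)
    )
    # Transactions already confirmed on the winning chain survive.
    confirmed = {tx for block in winner for tx in block}
    # Transactions unique to the losing chain are reversed.
    reversed_txs = [tx for block in loser for tx in block if tx not in confirmed]
    return winner, reversed_txs

winner, rev = resolve_fork([["tx1"], ["tx2"]], [["tx1"], ["tx3"], ["tx4"]])
print(rev)  # transactions only on the abandoned side
```

Keep this model in mind when deciding which CAP property the reversal sacrifices.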

System 4: Ethereum 2.0 (Post-Merge)

Ethereum uses Proof of Stake with the Casper FFG finality gadget. Blocks are produced every 12 seconds, and checkpoints occur at epoch boundaries (every 6.4 minutes); a checkpoint is finalized once two-thirds of validators attest to it, typically about two epochs after it is proposed. If more than one-third of validators go offline, the chain stops finalizing (though block production continues in a degraded mode).

System 5: CockroachDB

CockroachDB is a distributed SQL database that uses the Raft consensus protocol. It divides data into ranges, each replicated across multiple nodes. During a partition, a range is available only if a majority of its replicas are on the same side of the partition.

Bonus: PACELC Classification

For each of the five systems above, extend your classification using the PACELC framework. What tradeoff does each system make when the network is operating normally (no partition)?


Exercise 3.3: Consensus Simulation Lab

Difficulty: Intermediate-Advanced | Estimated Time: 60 minutes

This exercise uses the consensus_simulation.py script provided in the code/ directory.

Part A: Baseline Behavior

Run the simulation with the default parameters (10 nodes, 0 Byzantine):

python consensus_simulation.py --nodes 10 --byzantine 0 --rounds 20

  1. How many rounds does it typically take to reach consensus?
  2. What is the final agreed-upon value?
  3. Run it 5 times. Is the outcome deterministic? Why or why not?

Part B: Introducing Byzantine Nodes

Run the simulation with increasing numbers of Byzantine nodes:

python consensus_simulation.py --nodes 10 --byzantine 1 --rounds 20
python consensus_simulation.py --nodes 10 --byzantine 2 --rounds 20
python consensus_simulation.py --nodes 10 --byzantine 3 --rounds 20
python consensus_simulation.py --nodes 10 --byzantine 4 --rounds 20

  1. At what point does consensus fail to be reached?
  2. How does the number of rounds to consensus change as Byzantine nodes increase?
  3. Does the result match the theoretical f < n/3 bound? Show your calculation.

Part C: Scaling Analysis

Run the simulation with varying total node counts, keeping the Byzantine fraction just below the theoretical limit:

python consensus_simulation.py --nodes 7 --byzantine 2 --rounds 30
python consensus_simulation.py --nodes 13 --byzantine 4 --rounds 30
python consensus_simulation.py --nodes 31 --byzantine 10 --rounds 30
python consensus_simulation.py --nodes 100 --byzantine 33 --rounds 50

  1. How does the number of rounds to consensus scale with the number of nodes?
  2. What does this tell you about the scalability of BFT protocols?
  3. Relate your observations to the O(n^2) message complexity of PBFT.
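For question 3, it helps to estimate how many messages PBFT exchanges per consensus round. This is a rough model of our own (it counts only the normal-case pre-prepare, prepare, and commit phases, ignoring view changes and checkpointing):

```python
# PBFT normal case: the leader sends pre-prepare to n-1 replicas, then the
# prepare and commit phases are each roughly all-to-all broadcasts.

def pbft_messages_per_round(n: int) -> int:
    """Approximate normal-case message count: O(n^2)."""
    return (n - 1) + 2 * n * (n - 1)

for n in [7, 13, 31, 100]:
    print(f"n = {n:>3}: ~{pbft_messages_per_round(n)} messages per round")
```

Compare how this count grows with n against how your simulated round counts grew in Part C.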

Part D: Strategy Analysis

The simulation supports different Byzantine strategies:

python consensus_simulation.py --nodes 10 --byzantine 3 --strategy random --rounds 30
python consensus_simulation.py --nodes 10 --byzantine 3 --strategy split --rounds 30
python consensus_simulation.py --nodes 10 --byzantine 3 --strategy stubborn --rounds 30

  1. Which Byzantine strategy is most disruptive to consensus? Why?
  2. Which strategy most closely resembles a real-world attack on a blockchain?
  3. Design a fourth strategy that you believe would be even more disruptive. Describe it in pseudocode.

Exercise 3.4: Consensus Protocol Design

Difficulty: Advanced | Estimated Time: 45 minutes

Scenario

You are designing a consensus protocol for a decentralized ride-sharing application. The system has the following requirements:

  • Participants: 50 vehicles in a city, each running a node.
  • Trust model: Vehicles are operated by independent drivers. Most are honest, but up to 20% might report false locations or fabricate ride requests to manipulate the system.
  • Latency: A ride request must be matched to a driver within 5 seconds.
  • Availability: The system must remain operational even if 10 vehicles go offline simultaneously.
  • Data: The consensus is over ride-matching decisions: which driver is assigned to which rider.

Part A: Requirements Analysis

  1. Is this a crash fault tolerance or Byzantine fault tolerance scenario? Justify your answer.
  2. Calculate the minimum number of nodes needed given your fault model and the requirement to tolerate 20% malicious nodes.
  3. Is the 5-second latency requirement compatible with Nakamoto consensus? With PBFT? With Raft? Explain.
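For question 2, a quick sanity check of the arithmetic can be sketched as follows. This is our own helper (not part of the chapter's code), and it assumes you chose a Byzantine fault model with the standard bound n >= 3f + 1; if your answer to question 1 was crash fault tolerance, the relevant bound would be n >= 2f + 1 instead:

```python
# Requirements check, assuming a BFT model with bound n >= 3f + 1.
n = 50                 # vehicles in the city, each running a node
f = n // 5             # up to 20% malicious
required = 3 * f + 1   # minimum nodes to tolerate f Byzantine faults
print(f"f = {f}, minimum n = {required}, have n = {n}: "
      + ("sufficient" if n >= required else "insufficient"))
```

Note that the availability requirement (10 vehicles offline) is a separate constraint from the malicious-node bound; your answer should address both.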

Part B: Protocol Design

Design a consensus protocol for this scenario. Your design should include:

  1. Membership: How do vehicles join and leave the consensus? Is this permissioned or permissionless?
  2. Leader selection: Is there a leader? How is the leader chosen? What happens if the leader is malicious?
  3. Agreement process: How do nodes propose, validate, and commit ride-matching decisions?
  4. Fault handling: What happens when a node crashes? What happens when a node sends contradictory messages?
  5. Finality: Is the ride assignment deterministically or probabilistically final? How quickly?

Part C: Tradeoff Analysis

  1. Classify your protocol using the CAP theorem. Which property did you sacrifice?
  2. Classify your protocol using the PACELC framework.
  3. Identify the two biggest weaknesses of your design. How might an attacker exploit them?
  4. Compare your design to simply using a centralized server. What does decentralization gain? What does it cost?

Exercise 3.5: The FLP Impossibility — Conceptual Deep Dive

Difficulty: Advanced | Estimated Time: 30 minutes

Part A: Understanding the Result

In your own words, explain why the FLP impossibility result does not prevent Bitcoin from working. Your explanation should address:

  1. Which assumption of FLP does Bitcoin violate or relax?
  2. What guarantee does Bitcoin sacrifice that FLP says is necessary?
  3. Why is this sacrifice acceptable for a cryptocurrency but might not be acceptable for, say, an air traffic control system?

Part B: Circumventing FLP

For each of the following approaches to circumventing FLP, identify which assumption of the impossibility result is relaxed and give a real-world system that uses this approach:

  1. Assuming partial synchrony instead of full asynchrony.
  2. Using randomized (non-deterministic) algorithms.
  3. Accepting probabilistic rather than absolute guarantees.
  4. Using failure detectors (oracles that indicate which nodes have failed).

Part C: Thought Experiment

Suppose someone claims to have invented a deterministic consensus protocol that works in a fully asynchronous network and tolerates one crash fault, with guaranteed termination. What must be wrong with their claim? Describe at least two possible flaws in their reasoning or assumptions.


Exercise 3.6: Discussion — Why Did Nakamoto Consensus Succeed?

Difficulty: Intermediate | Estimated Time: 30 minutes (individual reflection) + 30 minutes (group discussion)

The PBFT paper was published in 1999. The Bitcoin whitepaper was published in 2008. For nearly a decade, PBFT was the state of the art in practical BFT, yet it never achieved widespread adoption outside of academic research. Nakamoto consensus, by contrast, went from a pseudonymous whitepaper to securing hundreds of billions of dollars in value within a decade.

Discussion Questions

  1. Technical factors: What specific technical limitations of PBFT prevented its adoption for open-membership networks? Could PBFT have been adapted for permissionless use? Why or why not?

  2. Economic factors: Nakamoto consensus introduced economic incentives (block rewards, transaction fees) as an integral part of the consensus mechanism. PBFT has no built-in incentive mechanism. How important was the economic layer to Bitcoin's success? Could PBFT work with added economic incentives?

  3. Social factors: Bitcoin launched as open-source software that anyone could run. PBFT required coordinated deployment among known participants. How did the permissionless nature of Bitcoin contribute to its adoption? What role did timing play (the 2008 financial crisis)?

  4. The tradeoff thesis: Some argue that Nakamoto consensus succeeded because of its tradeoffs, not despite them. The slow finality and low throughput are features, not bugs, because they enable truly permissionless participation. Do you agree? What would a system that provided both fast finality and permissionless participation look like — or is this impossible?

  5. Counterfactual: If PBFT had been published in 2008 and Bitcoin in 1999, would the adoption trajectory have been different? What does this tell you about the role of context (technology, economics, politics) in the success of distributed systems protocols?

Writing Prompt

Write a 500-word essay arguing either (a) that Nakamoto consensus was an inevitable development given the trajectory of distributed systems research, or (b) that it was a contingent breakthrough that required a specific set of circumstances to emerge. Support your argument with specific technical and historical evidence from this chapter.


Exercise 3.7: CAP Theorem Interactive Demo

Difficulty: Beginner-Intermediate | Estimated Time: 30 minutes

This exercise uses the cap_theorem_demo.py script provided in the code/ directory.

Part A: Exploring the Tradeoffs

Run the demo and experiment with different configurations:

python cap_theorem_demo.py

  1. Set up a 3-node cluster with no partition. Perform several reads and writes. Verify that all nodes return consistent data.
  2. Introduce a network partition between node 1 and nodes 2-3. Attempt writes on both sides of the partition.
  3. Switch between CP mode and AP mode. Observe what happens to:
     • Write availability (can you write to all nodes?)
     • Read consistency (do all nodes return the same value?)
     • The system's behavior when the partition heals

Part B: Measuring Staleness

In AP mode during a partition:

  1. Perform 10 writes to one side of the partition.
  2. Read from nodes on the other side.
  3. How "stale" is the data? (How many writes behind is the disconnected side?)
  4. When the partition heals, how does the system reconcile the divergent states?

Part C: Relating to Blockchain

  1. In the demo, which mode most closely resembles Bitcoin's behavior during a network partition?
  2. Which mode most closely resembles a PBFT-based blockchain?
  3. If you were building a decentralized exchange (where incorrect data could cost users money), which mode would you choose? Why?