Case Study 1: Solana's Outages — The Cost of Optimizing for Speed

Background

Solana launched its mainnet beta in March 2020 with an audacious promise: a blockchain that could process 65,000 transactions per second with 400-millisecond block times, rivaling the throughput of centralized systems like Visa. Founded by Anatoly Yakovenko, a former Qualcomm engineer, the project built its architecture from the ground up for speed. Its Proof of History (PoH) clock, parallel transaction execution engine (Sealevel), Tower BFT consensus, and Gulf Stream mempool-less transaction forwarding worked together as a tightly integrated system optimized for raw performance.

By early 2022, Solana had become one of the most popular blockchains in the industry. Its DeFi ecosystem managed billions in total value locked. Its NFT marketplaces thrived thanks to sub-cent transaction fees. Venture capital firms had invested heavily, and the SOL token reached an all-time high above $250 in November 2021. The blockchain appeared to vindicate the design philosophy that hardware should do the heavy lifting: that by requiring validators to run powerful servers, you could achieve performance that software-only optimizations on commodity hardware could never match.

Then the outages began.

The Incidents

September 14, 2021: The First Major Outage (17 Hours)

On September 14, 2021, Solana's network went completely offline for approximately 17 hours. The trigger was a surge in transaction volume associated with a token sale (the Grape Protocol IDO) hosted on the Raydium decentralized exchange. Bot activity generated an enormous flood of transactions, reportedly peaking at hundreds of thousands per second, far beyond what the network had previously handled.

The transaction flood caused validators' memory usage to spike as their queues filled with unprocessed transactions. The networking layer became overwhelmed, and validators began falling behind the PoH clock. When validators cannot keep up with the PoH leader's output, they cannot vote on new blocks, and without sufficient votes, consensus stalls.

The validators tried to catch up, but the backlog was too large. Eventually, the validator set could not reach the two-thirds supermajority needed for Tower BFT consensus, and the network halted completely. Recovery required the validator community to coordinate a restart using a snapshot of the last confirmed state — a process that took hours and required manual intervention from a significant portion of the validator set.
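
The stall condition described above can be sketched as a toy model. The two-thirds threshold comes from Tower BFT as described in this case study; the stake numbers and the binary "behind" flag are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Validator:
    stake: float
    behind: bool  # True if the validator has fallen behind the PoH stream

def can_confirm(validators: list[Validator]) -> bool:
    """Tower BFT needs more than two-thirds of total stake voting.

    Validators that have fallen behind cannot vote on new blocks, so
    their stake is effectively removed from the voting set.
    """
    total = sum(v.stake for v in validators)
    voting = sum(v.stake for v in validators if not v.behind)
    return voting > total * 2 / 3

# Hypothetical scenario: a transaction flood knocks 40% of stake behind.
flooded = ([Validator(stake=10, behind=False) for _ in range(6)]
           + [Validator(stake=10, behind=True) for _ in range(4)])
print(can_confirm(flooded))  # prints False: 60% voting stake < 2/3, so consensus stalls
```

Because voting power disappears in stake-weighted chunks, the network does not degrade gradually: once more than one-third of stake falls behind, confirmation stops entirely.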

January 2022: Repeated Degradation

In January 2022, Solana experienced multiple periods of severe network degradation (though not complete outages). Transaction processing slowed to a fraction of normal throughput, and many user transactions failed or timed out. The cause was again excessive transaction volume, primarily from bot activity associated with NFT minting events (known as "mints" or "drops") and automated arbitrage.

The pattern was consistent: when network activity spiked, the system's components — all designed to operate in lockstep at high speed — degraded together rather than failing gracefully. There was no effective mechanism to prioritize legitimate user transactions over bot traffic.

February 25, 2023: Forking Event (18+ Hours)

On February 25, 2023, a bug in the Solana client software caused validators to produce conflicting blocks, leading to a fork that the network could not automatically resolve. The network experienced severe degradation for over 18 hours before a client patch was deployed and validators restarted.

This incident highlighted a different vulnerability: because virtually all Solana validators ran the same client implementation (the Solana Labs client, later called Agave), a bug in that client affected every validator simultaneously. There was no client diversity to provide resilience — unlike Ethereum, where a bug in one client (say, Geth) would only affect the validators running that specific client, while validators running Nethermind, Besu, or Erigon would continue operating normally.
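
The resilience argument can be made concrete with a toy calculation. All stake shares below are illustrative, not measured data:

```python
# Illustrative stake shares only; not real network measurements.
SOLANA_LIKE = {"agave": 0.95, "other": 0.05}  # near-monoculture
ETHEREUM_LIKE = {                              # hypothetical multi-client split
    "prysm": 0.30, "lighthouse": 0.30, "teku": 0.22, "nimbus": 0.18,
}

def survives_client_bug(stake_by_client: dict[str, float], buggy: str) -> bool:
    """If a bug takes one client offline network-wide, can the remaining
    clients still muster the >2/3 stake supermajority needed to keep
    confirming blocks?"""
    remaining = sum(s for c, s in stake_by_client.items() if c != buggy)
    return remaining > 2 / 3

print(survives_client_bug(SOLANA_LIKE, "agave"))     # prints False: network halts
print(survives_client_bug(ETHEREUM_LIKE, "prysm"))   # prints True: 70% of stake remains
```

The key property is not the number of clients but the ceiling on any single client's share: as long as no client controls more than one-third of stake, no single-client bug can stop confirmation.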

February 6, 2024: Five-Hour Outage

On February 6, 2024, the Solana network halted for approximately five hours due to a bug in the program loader that caused an infinite loop when processing a specific transaction. The bug caused all validators to get stuck at the same block, and the network could not advance until a patched client was deployed and validators restarted.

Technical Analysis: Why Solana Is Fragile

The pattern across these incidents reveals architectural characteristics that are the flip side of Solana's speed advantages:

Tight Coupling

Solana's components (PoH clock, Tower BFT, Sealevel execution, Gulf Stream forwarding, Turbine block propagation) are designed to work together as an integrated system rather than as independent modules. When one component is stressed, the stress propagates to the others. A transaction flood does not just fill the mempool — it overwhelms the PoH generator, causes validators to fall behind on voting, disrupts block propagation, and ultimately stalls consensus.

In a more loosely coupled system (like Ethereum's separation of execution and consensus clients), problems in one layer can be isolated. Solana's integrated design makes isolation difficult.

The Single-Client Problem

For most of its history, Solana had effectively one production-quality validator client. While a second client (Firedancer, developed by Jump Trading) began development in 2022 and reached testnet milestones by 2024, the mainnet validator set remained overwhelmingly homogeneous. This meant that every software bug was a network-wide bug.

Ethereum, by contrast, has maintained multiple independent clients from its earliest days. During the May 2023 finality incidents, bugs triggered under exceptional load in some consensus clients (notably Prysm) delayed finality, but validators running unaffected clients kept attesting and the network never stopped producing blocks. Because no single client controlled a supermajority of validators, a single-client bug could slow the chain but not halt it.

Hardware Monoculture

Solana's high hardware requirements (256 GB of RAM, high-core-count CPUs, fast NVMe storage, 1 Gbps networking) mean that validators tend to use similar hardware configurations, often from the same cloud providers. This creates a hardware monoculture in which infrastructure failures (an AWS region going down, for example) can disproportionately affect the validator set. It also means that a bug triggered by a specific hardware characteristic could affect all validators simultaneously.

No Effective Fee Market Under Stress

During the early outage period, Solana lacked an effective mechanism to price out spam during high-demand periods. Unlike Ethereum's EIP-1559 base fee, which increases exponentially when blocks are full (making spam increasingly expensive), Solana's fixed low fees meant that bots could flood the network at minimal cost. Later protocol updates introduced priority fees and QUIC-based transaction forwarding to mitigate this, but the fundamental tension between "low fees for users" and "resistance to transaction spam" remains an active area of development.
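
The EIP-1559 contrast drawn above can be sketched as follows. The update rule and the 1/8 maximum change follow the EIP's published defaults; the gas numbers in the loop are a hypothetical scenario:

```python
def next_base_fee(base_fee: int, gas_used: int, gas_target: int,
                  max_change_denominator: int = 8) -> int:
    """EIP-1559 base fee update: at most a 1/8 (12.5%) move per block.

    A full block (gas_used = 2 * gas_target) raises the base fee by 12.5%,
    so sustained spam becomes exponentially more expensive block by block.
    """
    delta = base_fee * (gas_used - gas_target) // (gas_target * max_change_denominator)
    return max(base_fee + delta, 0)

# At target usage the fee is stable; under sustained full blocks it compounds.
fee = 100
for _ in range(10):
    fee = next_base_fee(fee, gas_used=30_000_000, gas_target=15_000_000)
print(fee)  # prints 316: roughly 100 * 1.125**10, reduced slightly by integer rounding
```

A fixed low fee, by contrast, charges the tenth consecutive full block the same as the first, which is exactly why bot floods on early Solana were so cheap to sustain.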

The Trilemma Perspective

Solana's outages are not random bad luck. They are the predictable consequence of the specific position Solana occupies on the blockchain trilemma:

Scalability (Optimized): Solana achieves high throughput through powerful hardware, tight integration, and aggressive parallelism. When everything works, the performance is remarkable.

Security (Moderate): Tower BFT provides Byzantine fault tolerance, and the validator set is reasonably large (1,500+). However, the single-client monoculture means that safety depends on the correctness of one implementation, and the high hardware bar means validators are less diverse than the raw count suggests.

Decentralization (Sacrificed): The high hardware requirements limit who can validate. The tightly coupled architecture means the network behaves more like a single coordinated system than a loose collection of independent participants. And the restart coordination required after outages — where the Solana Foundation and major validators must manually agree on a recovery plan — reveals a practical centralization that contradicts the permissionless ideal.

Solana's Response and Evolution

The Solana team and community have not ignored these problems. Significant engineering effort has gone into mitigation:

  • QUIC-based networking replaced the original UDP-based protocol, providing better congestion control and resistance to transaction spam.
  • Priority fees were introduced to allow users to pay for faster processing during high-demand periods, creating a basic fee market.
  • Local fee markets (introduced 2024) allow fees to increase for in-demand programs without affecting fees for unrelated transactions.
  • The Firedancer client, developed by Jump Trading, represents a complete independent reimplementation of the Solana validator. When fully deployed, it will provide the client diversity that Ethereum has had for years.
  • Improved block packing algorithms prioritize transactions more effectively under load.
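
The local-fee-market idea in the list above can be illustrated with a toy model. The escalation rule, capacity, and program names here are invented; Solana's actual mechanism prices compute units and scopes contention per write-locked account:

```python
from collections import defaultdict

class LocalFeeMarket:
    """Toy per-program fee floor: demand on one program raises its own
    price without touching unrelated programs. The escalation rule is
    invented purely for illustration."""

    def __init__(self, base: int = 1, bump: float = 1.5, capacity: int = 100):
        self.base = base            # fee floor when a program is uncongested
        self.bump = bump            # multiplier per transaction over capacity
        self.capacity = capacity    # per-slot budget before fees escalate
        self.demand = defaultdict(int)

    def submit(self, program: str) -> int:
        """Record a transaction and return the fee floor it must pay."""
        self.demand[program] += 1
        over = max(0, self.demand[program] - self.capacity)
        return int(self.base * self.bump ** over)

market = LocalFeeMarket()
for _ in range(105):                        # a hot NFT mint gets expensive...
    hot_fee = market.submit("nft_mint")
cold_fee = market.submit("token_transfer")  # ...an unrelated transfer does not
print(hot_fee, cold_fee)  # prints 7 1: escalated floor vs. unchanged base fee
```

The design point is isolation: a global fee market (like EIP-1559's single base fee) makes a hot NFT mint raise costs for everyone, while a local market confines the price spike to the contended resource.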

By 2025, Solana's reliability had improved substantially compared to the 2022-2023 period. The network had not experienced a full outage in several months, and performance degradation events were shorter and less severe. Whether this reflects fundamental architectural improvements or simply a period of lighter network stress is a matter of ongoing debate.

Discussion Questions

  1. Design philosophy. Solana's outages occurred precisely because of the same design choices that enable its speed. If you were designing a new high-performance blockchain, would you replicate Solana's approach, or would you accept lower peak performance in exchange for greater resilience? What specific design changes would you make?

  2. Client diversity as infrastructure. Ethereum treats multiple independent client implementations as a security requirement, not a luxury. Solana initially prioritized a single high-quality client. What are the costs and benefits of each approach? Is it realistic to expect client diversity in blockchain systems with complex, tightly integrated architectures?

  3. The restart problem. When Solana goes down and validators must coordinate a restart, who makes the decisions? How does this process compare to the decentralized ideal? Could a truly decentralized network (where no one has authority to coordinate) recover from a similar failure?

  4. Fee markets and spam resistance. Solana's low fees are a major user experience advantage. But low fees also mean low cost for attackers and spammers. Is there a way to have both low fees for legitimate users and high costs for spam, or is this an inherent tradeoff?

  5. Evaluation framework. If you were an enterprise deciding whether to build on Solana, how would you weigh the high performance against the outage history? What uptime guarantees would you need, and how would you design your application to handle potential network outages?