Case Study 2: Supply Chains and the Power Grid -- Two Systems That Got Redundancy Wrong

"There are decades where nothing happens, and there are weeks where decades happen." -- attributed to Vladimir Lenin

Two Systems, One Lesson

This case study examines two systems that chose efficiency over redundancy and paid the price: global supply chains (with particular attention to the semiconductor industry) and the North American power grid. Both systems were designed by talented engineers and managed by sophisticated organizations. Both achieved impressive efficiency under normal conditions. Both collapsed spectacularly when conditions deviated from normal -- revealing that the efficiency they had achieved was purchased at the cost of resilience they could not afford to lose.

The parallel is instructive because the two systems broke in different ways -- the supply chain broke slowly, over months, while the power grid broke in seconds -- but for the same structural reason: both had eliminated the redundancy that would have contained the failure.

Part I: The Supply Chain That Snapped

The Architecture of Efficiency

Modern global supply chains are marvels of optimization. A smartphone contains components from dozens of countries: rare earth minerals mined in the Democratic Republic of Congo, processed into magnets in China, assembled into speakers in Vietnam, integrated into circuit boards in South Korea, installed in a phone assembled in China, shipped to a distribution center in the Netherlands, and delivered to a consumer in Kansas. Each step has been optimized for cost, speed, and quality. Inventory at every stage has been minimized. Suppliers have been consolidated to the lowest-cost option. Transportation routes have been planned to the hour.

This system can deliver a $1,200 smartphone to your doorstep in two days at a price that still leaves the manufacturer a profit. It is an astonishing achievement of engineering, logistics, and coordination.

It is also a system that was designed to operate within a narrow band of normal conditions and that has no capacity to handle anything outside that band.

The Semiconductor Bottleneck

The semiconductor industry illustrates the efficiency trap with particular clarity, because semiconductors are both the most critical component in the modern economy and the most concentrated.

As of 2020, the world's most advanced semiconductors -- the chips that power smartphones, data centers, medical devices, and advanced automobiles -- were manufactured overwhelmingly in Taiwan, and predominantly by a single company: Taiwan Semiconductor Manufacturing Company (TSMC). Samsung in South Korea and Intel in the United States produced some advanced chips, but neither came close to TSMC in cutting-edge fabrication. For the most advanced process nodes (5 nanometers and below), TSMC's global market share exceeded 90 percent.

This concentration was the result of decades of efficiency-driven consolidation. Building a semiconductor fabrication plant ("fab") costs $15 to $20 billion and takes three to five years. Operating one requires extraordinary technical expertise, ultra-pure water supplies, specialized chemicals, and a workforce trained in disciplines that are difficult to develop outside existing semiconductor ecosystems. The economics of the industry favor concentration: a single large fab produces chips at lower cost per unit than multiple smaller fabs. Efficiency demanded consolidation, and consolidation delivered efficiency.

The result was a global supply chain with a single point of failure the size of a small island in the western Pacific.

The Cascade of 2020-2023

When COVID-19 began shutting down factories in early 2020, the first impact was a supply shock: semiconductor fabs reduced capacity or closed temporarily. But the deeper problem emerged in the months that followed, through a sequence of individually manageable events that collectively overwhelmed a system with no slack.

Step 1: Demand shift. The pandemic shifted consumer spending from services (restaurants, travel, entertainment) to goods (laptops, monitors, webcams, gaming consoles). Demand for consumer electronics surged. Semiconductor demand spiked accordingly.

Step 2: Automotive cancellation. Automakers, expecting a prolonged recession, canceled their semiconductor orders in early 2020 to conserve cash. This was rational from each automaker's individual perspective: why pay for chips you cannot use?

Step 3: Capacity reallocation. Chipmakers, facing surging demand from consumer electronics and canceled orders from automakers, reallocated capacity to their more reliable customers. When automakers tried to reinstate their orders six months later, the capacity was gone. There was no buffer. Fabs were already running at maximum utilization.

Step 4: Compounding disruptions. A drought in Taiwan restricted water supplies to semiconductor fabs. A fire at a Renesas factory in Japan knocked out a significant source of automotive-grade chips. A winter storm in Texas shut down petrochemical plants that produced the specialty chemicals used in chip manufacturing. Each event was individually survivable. In a system with slack, any one of them could have been absorbed. In a system with zero slack, they cascaded.

Step 5: The whiplash effect. As shortages became apparent, companies began ordering more chips than they actually needed, hoping to build buffer stock. This "double ordering" inflated apparent demand further, making the shortage worse. Chip delivery times, normally measured in weeks, stretched to months. Some automotive chips had lead times exceeding a year.
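The feedback loop in Step 5 can be sketched as a toy model. Every number here (true demand, fab capacity, the `panic_factor` by which buyers pad their orders) is a hypothetical value chosen to show the mechanism, not data from the 2020-2023 shortage.

```python
def simulate_double_ordering(true_demand, capacity, panic_factor, weeks):
    """Toy model of double ordering (all parameters are illustrative).

    Each week buyers order what they truly need plus extra "insurance"
    proportional to the backlog they can see. Returns a list of
    (orders_placed, apparent_backlog) tuples, one per week.
    """
    backlog = 0.0
    history = []
    for _ in range(weeks):
        # Padding grows with the visible backlog: the worse the shortage
        # looks, the more everyone over-orders.
        orders = true_demand + panic_factor * backlog
        # Fabs run flat out; anything beyond capacity joins the backlog.
        backlog = max(0.0, backlog + orders - capacity)
        history.append((orders, backlog))
    return history


# Real demand exceeds capacity by only 5 units per week in both runs.
calm = simulate_double_ordering(true_demand=105, capacity=100, panic_factor=0.0, weeks=10)
panic = simulate_double_ordering(true_demand=105, capacity=100, panic_factor=0.2, weeks=10)

print("backlog without padding:", calm[-1][1])
print("backlog with padding:   ", round(panic[-1][1], 1))
```

Under these assumptions the underlying shortfall is identical in both runs, yet after ten weeks the apparent backlog with padding is well over twice the backlog without it. The extra backlog is phantom demand, which is why lead times stretched far beyond what the real shortfall alone would explain.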

The result: automakers worldwide lost an estimated $210 billion in revenue in 2021 alone. Ford, General Motors, Toyota, and Volkswagen all idled production lines. Consumers waited months for new vehicles. Prices for both new and used cars spiked. The effects rippled through every industry that depended on semiconductors -- which is to say, every industry.

Why the Supply Chain Had No Redundancy

The semiconductor supply chain's fragility was not accidental. It was the predictable result of decades of efficiency-driven decisions.

Supplier consolidation. Automakers had consolidated their semiconductor sourcing to minimize cost and complexity. Many components were single-sourced: one fab, one chip design, no alternative. This was efficient (lower costs, simpler procurement) and fragile (any disruption at the single source stopped production).

Zero inventory. Just-in-time principles had been applied aggressively across the automotive industry. Automakers maintained days or weeks of chip inventory, not months. This freed up working capital and reduced storage costs. It also meant that any interruption in supply immediately halted production.

Long lead times without buffers. Building new semiconductor capacity takes three to five years. This means that once a shortage begins, there is no rapid way to increase supply. The system can only recover by waiting for new fabs to come online -- a wait of years, not months. In a system with buffer inventory, this lead time is manageable. In a system with zero buffer, it is catastrophic.

Misaligned incentives. The executives who made the efficiency decisions (consolidating suppliers, eliminating inventory) were rewarded for the cost savings those decisions produced. The costs of fragility -- production shutdowns, lost revenue, customer frustration -- were borne by different people, at a different time, in a different part of the organization. The decision-maker did not bear the consequence.
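The role of buffer inventory described above can be made concrete with a deliberately simple calculation. It assumes constant daily chip usage, a complete halt in deliveries, and no alternative supplier; the buffer sizes and the 60-day disruption are hypothetical.

```python
def production_days_lost(buffer_days, disruption_days):
    """Days a plant sits idle when supply stops for `disruption_days`
    and the plant holds `buffer_days` of component inventory.

    Simplifying assumptions: constant daily usage, no partial
    deliveries, no second source.
    """
    return max(0, disruption_days - buffer_days)


# Hypothetical 60-day supply interruption against three inventory policies.
for buffer in (5, 30, 90):
    idle = production_days_lost(buffer, disruption_days=60)
    print(f"{buffer:>2} days of inventory -> {idle} days of lost production")
```

A plant holding five days of chips loses almost the whole disruption, while one holding ninety days never stops. The working capital saved by the first policy is visible every quarter; the idle days it risks appear only when the disruption arrives.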

Part II: The Grid That Cascaded

The Architecture of Interconnection

The North American power grid is divided into three major interconnections: the Eastern Interconnection (covering nearly everything east of the Rockies except most of Texas), the Western Interconnection (covering everything west of the Rockies), and the Texas Interconnection, managed by the Electric Reliability Council of Texas (ERCOT), which covers most of Texas and operates largely independently of the other two. Within each interconnection, thousands of power plants, transmission lines, substations, and distribution networks are linked together into a single synchronized system.

Interconnection provides enormous efficiency benefits. Different regions have different patterns of electricity demand (air conditioning peaks in the South during summer; heating peaks in the North during winter). Different regions have different sources of generation (hydropower in the Pacific Northwest, coal and gas in the Midwest, wind in the Great Plains). By linking these regions together, the grid can move power from where it is abundant to where it is needed, reducing the total generating capacity required and lowering the average cost of electricity.

But interconnection also means that the grid is tightly coupled: what happens in one region affects every other region. This tight coupling is the structural prerequisite for cascading failure.
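The capacity saving from linking regions whose demand peaks at different times can be illustrated with a small calculation. The demand figures below are hypothetical and only two seasons are modeled; the point is the comparison, not the magnitudes.

```python
# Hypothetical seasonal peak demand in gigawatts.
south = {"winter": 40, "summer": 70}  # air-conditioning peak in summer
north = {"winter": 65, "summer": 45}  # heating peak in winter


def capacity_needed_separately(*regions):
    """Each region must cover its own worst season on its own."""
    return sum(max(region.values()) for region in regions)


def capacity_needed_pooled(*regions):
    """An interconnected system must cover the worst combined season."""
    seasons = regions[0].keys()
    return max(sum(region[s] for region in regions) for s in seasons)


separate = capacity_needed_separately(south, north)
pooled = capacity_needed_pooled(south, north)
print(f"separate: {separate} GW, pooled: {pooled} GW, saved: {separate - pooled} GW")
```

With these numbers the pooled system needs 115 GW instead of 135 GW. That 20 GW is exactly the kind of spare capacity the interconnection removes, which is the efficiency gain and the loss of slack at the same time.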

August 14, 2003: The Cascade

The sequence of events that produced the largest blackout in North American history turned on something utterly mundane: overgrown trees.

2:14 PM. A software bug disabled the alarm system at FirstEnergy Corporation's control room in Akron, Ohio. Operators did not know the alarms had stopped, so from this point on they could not see equipment failures on their own system. This was a critical failure of situational awareness, but it was not, by itself, catastrophic. Backup systems should have caught the problem. They did not, because they too were affected by the software bug.

3:05 PM. A 345-kilovolt transmission line in northern Ohio sagged into untrimmed trees and tripped offline. This kind of event happens regularly on the grid; individual lines trip for various reasons, and the system is designed to handle it. The load that line had been carrying shifted automatically to neighboring lines. With the alarms silent, the control room did not see that the line had tripped.

3:32 PM. A second transmission line sagged into trees and tripped. Its load shifted to the remaining lines.

3:41 PM. A third line tripped. The remaining lines were now carrying significantly more than their rated capacity. They began to overheat and sag -- creating the conditions for more tree contacts and more trips.

4:05 PM. A fourth line tripped. The cascade accelerated. Lines were now failing faster than operators could respond. Each failure shifted load to the remaining lines, which were already overloaded, causing them to fail in turn.

4:10:34 PM. The cascade reached critical mass. In the next nine seconds -- nine seconds -- the failure propagated from Ohio across a large swath of the Eastern Interconnection. Power plants tripped offline to protect themselves from the grid instability. Within minutes, 55 million people in eight U.S. states and the Canadian province of Ontario were without power.

The blackout lasted up to four days in some areas. Eleven people died. Water treatment plants stopped operating. Hospitals switched to backup generators. Traffic lights went dark. Refrigerated food spoiled. Economic losses were estimated at $6 billion.
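The mechanism in the timeline above, where each trip pushes load onto lines that are already near their limits, can be sketched with a toy model. It assumes the total load is shared equally among the lines still in service, which real grids do not do (actual flows follow network physics), and the capacities and load are hypothetical.

```python
def cascade(capacities, total_load, initially_tripped):
    """Toy equal-share cascade model (not a power-flow simulation).

    capacities: rating of each parallel line
    total_load: load that must be carried by whatever lines remain
    initially_tripped: set of line indices that fail first
    Returns (surviving line indices, number of cascade rounds).
    """
    in_service = {i for i in range(len(capacities)) if i not in initially_tripped}
    rounds = 0
    while in_service:
        share = total_load / len(in_service)
        overloaded = {i for i in in_service if share > capacities[i]}
        if not overloaded:
            return in_service, rounds  # remaining lines can hold the load
        in_service -= overloaded       # overloaded lines trip together
        rounds += 1
    return in_service, rounds          # empty set: total loss


tight = [100, 110, 120, 130, 140]      # little headroom above normal loading
slack = [c + 50 for c in tight]        # same lines with spare rating

print("tight:", cascade(tight, total_load=420, initially_tripped={4}))
print("slack:", cascade(slack, total_load=420, initially_tripped={4}))
```

In the tight configuration a single trip overloads the weakest remaining line, and its loss overloads all the others: every line is out after two rounds. With spare rating on each line, the same initial trip is absorbed and nothing else fails.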

The Structural Analysis

The 2003 blackout was not caused by any single failure. It was caused by the interaction of multiple failures in a tightly coupled system with insufficient redundancy at multiple levels.

Insufficient transmission redundancy. The grid in northeastern Ohio had been operating with insufficient transmission capacity relative to its load. There was not enough slack in the transmission system to absorb the loss of several lines simultaneously. This was an efficiency choice: building additional transmission lines costs money, and the existing lines were adequate for normal conditions.

Insufficient monitoring redundancy. The software bug that blinded the FirstEnergy control room was a single point of failure in the monitoring system. There was no independent backup monitoring system that would have alerted operators to the line trips. This was a modularity failure: the monitoring system was not designed so that a failure in one component would be caught by another.

Insufficient modularity in the grid itself. The Eastern Interconnection operates as a single synchronized system. This provides efficiency (power flows freely across the entire network) but makes the entire network vulnerable to cascading failure. A more modular design -- with circuit breakers that could isolate failing sections before the cascade spread -- would have contained the failure to a local area. Some such protections existed but were either inadequate or improperly configured.

Insufficient vegetation management. The triggering event -- transmission lines sagging into trees -- was the result of inadequate vegetation management, itself a cost-cutting measure. Trimming trees along transmission corridors is expensive and ongoing. Deferring maintenance saved money in the short term and created the conditions for catastrophe in the long term.

Texas, 2021: The Same Lesson, Relearned

The 2003 blackout led to improved reliability standards, better monitoring requirements, and more stringent vegetation management rules. But the fundamental architecture of the grid -- its tight coupling, its efficiency-driven operation near capacity limits, its insufficient reserves -- remained largely unchanged.

In February 2021, a winter storm brought unprecedented cold to Texas. The ERCOT grid, which operates independently from the other two interconnections, was overwhelmed. Natural gas wells froze. Natural gas pipelines lost pressure. Gas-fired power plants, unable to get fuel, shut down. Wind turbines froze. Coal plants experienced equipment failures in the extreme cold. The grid lost approximately one-third of its generating capacity in a matter of hours.

Because ERCOT operates independently -- it has only minimal connections to the Eastern and Western Interconnections -- there was no way to import power from other regions. The grid's isolation, which normally provided efficiency benefits (avoiding federal regulation, optimizing for the Texas market), became a fatal vulnerability. There was no redundancy outside the state's borders to draw upon.

The grid operator implemented rolling blackouts to prevent a total grid collapse. But "rolling" quickly became "prolonged" -- some areas lost power for days. An estimated 246 people died, many from hypothermia. Property damage exceeded $195 billion. The grid came within minutes of a total, uncontrolled collapse that could have left Texas without power for weeks or months.

The ERCOT grid had been designed for efficiency. Generating capacity was maintained at minimal reserve margins, because maintaining excess capacity costs money. Weatherization of power plants -- insulating equipment against extreme cold -- had been recommended after a previous cold snap in 2011 but was not mandated, because the cost was considered unjustified given the perceived rarity of extreme cold events in Texas.

The cost of the weatherization that was not done was estimated at $2 to $5 billion. The cost of the disaster that resulted was $195 billion. The ratio, roughly 40 to 100 times the cost of prevention, is a stark measure of the efficiency trap's cost.
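The arithmetic behind that ratio, using only the figures quoted above, is shown below. The break-even probability at the end is an added simplification: it ignores discounting and assumes weatherization would have prevented the entire loss.

```python
disaster_cost = 195e9                        # dollars, figure quoted in the text
prevention_low, prevention_high = 2e9, 5e9   # dollars, range quoted in the text

# How many times more the disaster cost than prevention would have.
ratio_low = disaster_cost / prevention_high
ratio_high = disaster_cost / prevention_low

# Simplified break-even: prevention pays for itself in expectation if the
# chance of such an event over the investment's life exceeds this value.
breakeven_low = prevention_low / disaster_cost
breakeven_high = prevention_high / disaster_cost

print(f"disaster cost was {ratio_low:.0f}x to {ratio_high:.1f}x the cost of prevention")
print(f"break-even event probability: {breakeven_low:.1%} to {breakeven_high:.1%}")
```

Read the other way around, the ratio says that weatherization was worth buying even if decision-makers believed there was only a few percent chance of a storm like this over the life of the equipment.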

The Structural Isomorphism

Feature	Supply Chain (Semiconductors)	Power Grid (2003, 2021)
Efficiency benefit	Lower costs through supplier consolidation and zero inventory	Lower costs through interconnection and minimal reserve margins
Form of redundancy eliminated	Buffer inventory, alternative suppliers, geographic diversity	Excess generating capacity, grid modularity, weatherization
Triggering event	Pandemic, drought, factory fire, winter storm	Overgrown trees (2003), winter storm (2021)
Why the trigger cascaded	Zero slack meant any disruption immediately halted production	Tight coupling meant any local failure propagated system-wide
Speed of cascade	Months (supply disruptions accumulated gradually)	Seconds (2003), hours (2021)
Cost of the cascade	~$210 billion in lost automotive revenue (2021 alone)	$6 billion (2003), $195 billion (Texas 2021)
Could redundancy have prevented it?	Buffer inventory and diversified sourcing would have absorbed the disruption	Excess capacity, weatherization, and better grid modularity would have contained the failure
Why redundancy was absent	Efficiency pressure: cost savings from consolidation and zero inventory	Efficiency pressure: cost savings from minimal reserves and deferred maintenance
Who made the decision	Supply chain executives rewarded for cost reduction	Grid operators and regulators prioritizing low electricity prices
Who bore the cost	Consumers, workers, downstream industries	Residents, businesses, families of the deceased

The shared structure is identical to the efficiency trap described in the main chapter: competitive pressure drives out redundancy; the system becomes fragile; a shock arrives; the system fails catastrophically; and the cost of the failure vastly exceeds the cost of the redundancy that was eliminated.

The Deeper Pattern: Time Horizons and Discounting

Both the supply chain and the power grid reveal a deeper pattern in how organizations value redundancy: the problem of temporal discounting.

Redundancy costs money now and provides benefits later -- specifically, during low-probability, high-consequence events that may not occur for years or decades. Efficiency savings are realized now and create risks later. Every standard method of evaluating investments -- discounted cash flow, return on investment, quarterly earnings -- systematically favors near-term benefits over far-term risks. A dollar saved today is worth more than a dollar of prevented loss next year, and far more than a dollar of prevented loss in a decade.

This temporal discounting is rational in a narrow financial sense. But it is catastrophically irrational when applied to fat-tailed risks -- risks where the extreme events are far more probable and far more costly than standard financial models assume (Chapter 4). A supply chain disruption or a grid collapse is not a normally distributed risk that can be safely discounted. It is a fat-tailed risk whose true expected cost is vastly larger than the discounted present value that appears on a financial model.

The organizations that manage supply chains and power grids are not staffed by foolish people. They are staffed by people using tools -- financial models, risk assessments, quarterly performance metrics -- that systematically undervalue redundancy. The tools are the problem, not the people.
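A minimal sketch shows how a standard discounted-cash-flow calculation can reject redundancy when the probability of the extreme event is underestimated. The loss and upfront cost reuse the Texas figures from this case study; the discount rate, the 30-year horizon, and both annual probabilities are hypothetical assumptions, and the model assumes the investment prevents the loss entirely.

```python
def present_value_of_avoided_loss(annual_probability, loss, discount_rate, years):
    """Discounted expected loss avoided over `years`.

    Assumes the investment fully prevents the loss and that the event
    probability is the same every year (both strong simplifications).
    """
    expected_annual_loss = annual_probability * loss
    return sum(
        expected_annual_loss / (1 + discount_rate) ** t
        for t in range(1, years + 1)
    )


UPFRONT_COST = 5.0     # $ billions, upper weatherization estimate from the text
LOSS = 195.0           # $ billions, Texas 2021 figure from the text
DISCOUNT_RATE = 0.08   # hypothetical
YEARS = 30             # hypothetical planning horizon

# Same tool, two beliefs about how likely the extreme event is per year.
thin_tailed = present_value_of_avoided_loss(0.001, LOSS, DISCOUNT_RATE, YEARS)
fat_tailed = present_value_of_avoided_loss(0.02, LOSS, DISCOUNT_RATE, YEARS)

print(f"thin-tailed estimate: ${thin_tailed:.1f}B of benefit vs ${UPFRONT_COST}B cost")
print(f"fat-tailed estimate:  ${fat_tailed:.1f}B of benefit vs ${UPFRONT_COST}B cost")
```

With the thin-tailed probability the discounted benefit is about $2.2 billion, less than the $5 billion cost, so the model says not to invest. With the fat-tailed probability the benefit is about $43.9 billion and the same model says to invest. The formula is identical in both cases; only the tail assumption fed into it differs.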

The Path Forward

Both supply chains and power grids have begun, slowly and incompletely, to invest in redundancy in the wake of their respective crises.

The semiconductor industry has seen massive investment in new fabrication capacity outside Taiwan: TSMC is building fabs in Arizona and Japan; Samsung is building in Texas; Intel is expanding in Arizona, Ohio, and Germany. The U.S. CHIPS and Science Act of 2022 allocated $52 billion to support domestic semiconductor manufacturing. These investments are driven by the recognition that geographic concentration is a catastrophic single point of failure.

Power grid operators are investing in grid modernization, energy storage (batteries that can provide power during generation shortfalls), and distributed generation (rooftop solar and local microgrids that can operate independently during grid failures). ERCOT has implemented winterization requirements and increased reserve margins.

But these investments face the same efficiency pressure that created the problem in the first place. Every dollar spent on a new fab in Arizona could produce chips more cheaply if spent on expanding the existing fab in Taiwan. Every dollar spent on grid weatherization is a dollar that does not reduce electricity prices. The competitive pressure to maximize efficiency has not disappeared. It has merely been temporarily overridden by the memory of catastrophe.

The question is whether the memory will last. After the 2003 blackout, reliability standards were tightened -- but the Texas grid, operating under different regulatory authority, was not affected by those standards, and collapsed 18 years later. After the 2011 semiconductor supply disruption (caused by the Japanese earthquake and tsunami), the industry discussed diversification -- but continued to concentrate in Taiwan for a decade until the pandemic forced the issue.

The historical pattern is discouraging: catastrophe produces reform, reform is gradually eroded by efficiency pressure, and the next catastrophe finds the system just as fragile as before. Breaking this cycle requires not just one-time investments in redundancy, but permanent institutional structures that protect redundancy from the relentless gravitational pull of efficiency.

Questions for Reflection

The semiconductor shortage of 2020-2023 was widely described as "unprecedented." In what sense was it predictable, given the industry's structure? What would an analyst applying the redundancy framework from Chapter 17 have predicted, and when?
The 2003 blackout cascaded in nine seconds, while the semiconductor shortage took months to fully develop. How does the speed of cascade affect the possibility of intervention? What structural features determine whether a failure cascades fast or slow?
The chapter identifies temporal discounting as a key driver of the efficiency trap. Can you design an institutional mechanism that would counteract temporal discounting in infrastructure investment? (Consider: who has a sufficiently long time horizon, and how can their interests be represented in investment decisions?)
Both the semiconductor industry and the power grid are now investing in redundancy. What evidence would you look for, five or ten years from now, to determine whether these investments have been sustained or whether efficiency pressure has eroded them again?
The case study argues that "the tools are the problem, not the people" -- that standard financial models systematically undervalue redundancy. Do you agree? If so, what alternative tools or frameworks would produce better decisions about the redundancy-efficiency tradeoff?