> "When a measure becomes a target, it ceases to be a good measure."
Learning Objectives
- State Goodhart's Law and Strathern's generalization and explain why metrics degrade under optimization pressure
- Identify Goodhart's Law operating across at least five domains: manufacturing, education, policing, medicine, and digital platforms
- Distinguish between a measure and the thing it is supposed to measure, and explain why optimization drives a wedge between them
- Analyze the principal-agent problem as the structural foundation of metric gaming
- Evaluate proposed solutions to Goodhart's Law including multi-metric approaches and Ostrom's polycentric governance
- Apply the threshold concept -- Metrics Are Models -- to recognize when a proxy measure has decoupled from the underlying reality
In This Chapter
- How Soviet Factories, Standardized Tests, Crime Statistics, Hospital Rankings, Search Engines, and Social Media All Break in the Same Way
- 15.1 The Nail Factory
- 15.2 Teaching to the Test
- 15.3 The Body Count and the Crime Rate
- 15.4 Hospitals, Search Engines, and the Engagement Machine
- 15.5 The Deeper Pattern: Why This Keeps Happening
- 15.6 Publish or Perish: The Academy Eats Itself
- 15.7 Strathern's Generalization and the Lucas Critique
- 15.8 Solutions: What Can Be Done?
- 15.9 Pattern Library Checkpoint
- 15.10 The Metric Paradox
- Chapter Summary
- Spaced Review
- What's Next
Chapter 15: Goodhart's Law — When Every Metric Becomes a Target
How Soviet Factories, Standardized Tests, Crime Statistics, Hospital Rankings, Search Engines, and Social Media All Break in the Same Way
"When a measure becomes a target, it ceases to be a good measure." — Marilyn Strathern, paraphrasing Charles Goodhart
15.1 The Nail Factory
In the Soviet Union, central planners faced an impossible problem. They needed to coordinate production across thousands of factories, spanning eleven time zones, manufacturing everything from tractors to toothbrushes. They could not be present at every factory. They could not observe every worker. They needed a way to measure output from a distance -- a metric that would tell them whether a factory was doing its job.
For nail factories, they chose weight. Produce your quota of nails, measured in tons, and you have fulfilled the plan. Fall short, and you, as factory manager, face consequences ranging from lost bonuses to reassignment to Siberia.
The nail factories responded rationally. They produced enormous nails -- massive iron spikes, heavy and useless for most construction purposes, but gloriously heavy on the scale. A factory that needed to produce five tons of nails per month could meet its quota with far less effort by producing a few hundred railroad-spike-sized monstrosities than by manufacturing the thousands of small, useful nails that Soviet citizens actually needed to build houses, hang pictures, and repair furniture.
The planners noticed the problem and changed the metric. Instead of weight, they measured nail production by count. Produce your quota of nails, measured in number of units, and you have fulfilled the plan.
The factories adapted again. They produced tiny nails -- little slivers of metal, barely functional, but easy to stamp out by the thousands. A machine that once produced a hundred usable nails per hour could now produce ten thousand miniature pins. The quota was met. The metric was satisfied. And Soviet citizens still could not get the nails they needed.
This story, variations of which circulated widely in Soviet economic literature and Western analyses of planned economies, illustrates a pattern so fundamental that it has been independently discovered, named, and formalized by scholars across at least four disciplines. In economics, it is Goodhart's Law, named after the British economist Charles Goodhart, who observed in 1975 that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." In sociology, it is Campbell's Law, formulated by the American psychologist Donald Campbell in 1979: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." In anthropology, it is Strathern's generalization, articulated by Marilyn Strathern in 1997 in a characteristically elegant compression: "When a measure becomes a target, it ceases to be a good measure."
Three scholars. Three disciplines. Three independent formulations of the same insight. If Chapter 1 taught us anything, it is that convergent discovery of this kind signals a deep structural pattern -- not a quirk of Soviet planning or British monetary policy, but something fundamental about the relationship between measurement and the thing being measured.
Fast Track: Goodhart's Law says that metrics break when you optimize for them. The chapter traces this pattern across Soviet factories, education, the military, policing, medicine, SEO, social media, and academic publishing, then explains why the pattern keeps recurring: every metric is a simplified model of reality, and optimization pressure exploits the gap between the model and the reality. If you already grasp the core idea, skip to Section 15.5 (The Deeper Pattern) for the structural analysis, then read Section 15.8 (Solutions).
Deep Dive: The full chapter examines seven domains in detail, connects Goodhart's Law to the principal-agent problem, the map/territory distinction (preview of Chapter 22), and the legibility problem (Chapter 16), and evaluates proposed solutions including multi-metric approaches, qualitative assessment, and Ostrom's polycentric governance. For the richest understanding, read the full chapter in sequence.
15.2 Teaching to the Test
Leave the nail factory. Walk into an American public school.
In 2001, the United States Congress passed the No Child Left Behind Act (NCLB), which tied federal funding to student performance on standardized tests. The logic was intuitive and well-intentioned. We want students to learn. We cannot sit in every classroom. We need a metric that tells us, from a distance, whether learning is occurring. Standardized test scores seem like a reasonable proxy: if students are learning, their scores should go up.
The schools responded rationally -- and the nail factory story repeated itself in a different substrate.
Teachers began teaching to the test -- restructuring their curricula not around what students most needed to learn, but around what would appear on the standardized exam. Subjects not tested -- art, music, physical education, science in some states, history in others -- were squeezed out of the school day. Within tested subjects, teachers narrowed their focus to the specific question formats and content areas that appeared on the exam. Students practiced filling in bubbles. They memorized test-taking strategies. They learned to eliminate obviously wrong answers and guess strategically among the remainder.
Test scores went up. Learning, by many measures, did not.
Researchers documented the phenomenon extensively. A study by the RAND Corporation found that score gains on state-mandated tests substantially overstated actual learning gains as measured by independent assessments. Students who showed impressive improvement on the state test showed far less improvement -- sometimes none at all -- on the National Assessment of Educational Progress (NAEP), a separate assessment with no stakes attached. The state test measured not what students had learned, but what students had been trained to do on that specific test. The metric had decoupled from the thing it was supposed to measure.
The corruption ran deeper than mere curriculum narrowing. In some cases, it became outright fraud. In the Atlanta Public Schools cheating scandal, uncovered in 2011, 178 educators across 44 schools were found to have systematically altered students' answer sheets to boost test scores. Teachers held "erasure parties" where they corrected students' wrong answers after the exams were submitted. Principals pressured teachers to cheat with explicit and implicit threats of termination. The superintendent, Beverly Hall, won national awards for the district's dramatic test score improvements -- improvements that were, in significant part, fictional.
Campbell's Law predicted exactly this outcome. "The more any quantitative social indicator is used for social decision-making," Campbell wrote, "the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." Note the precision of that formulation. Campbell did not say the indicator merely becomes less accurate. He said it corrupts the processes it was designed to monitor. The metric does not just break. It actively damages the thing it was supposed to improve.
💡 Intuition: Imagine you want to know if your garden is healthy, so you start measuring the number of green leaves. At first, the count correlates well with health -- healthy plants have more green leaves. But now suppose you start paying your gardener based on the leaf count. The gardener might start spray-painting dead leaves green, gluing artificial leaves onto stems, or planting fast-growing but shallow-rooted species that produce lots of leaves but crowd out the deeper-rooted plants that stabilize the soil. The leaf count goes up. The garden gets worse.
🔗 Connection to Chapter 14 (Overfitting): Teaching to the test is a form of overfitting -- the educational analogue of fitting your model too closely to your training data. Students optimized for the specific test perform well on that test (training data) but poorly on different assessments (out-of-sample data). The underlying pattern is the same: optimizing too aggressively for a specific measure captures noise and artifacts rather than the true signal. Chapter 14's insight that overfitting destroys generalization applies directly here.
🔄 Check Your Understanding
- In your own words, explain why Goodhart's Law is not just a problem with bad metrics, but a problem with what happens to any metric once it becomes a target.
- How does the Atlanta cheating scandal illustrate Campbell's Law specifically -- not just metric gaming, but the corruption of the process being measured?
- What structural feature do the Soviet nail factory and the American public school share? (Hint: think about the relationship between the measurer and the measured.)
15.3 The Body Count and the Crime Rate
The military version of Goodhart's Law may be the deadliest.
During the Vietnam War, American military commanders needed to measure progress in a war without clear front lines, territorial objectives, or conventional benchmarks of victory. In a traditional war, you measure success by territory captured. In Vietnam, territory was taken and retaken constantly; holding ground was not the strategic objective. Commanders needed a different metric.
They chose the body count -- the number of enemy combatants killed. The logic was straightforward: if you are killing more of the enemy than they are killing of you, you must be winning. Defense Secretary Robert McNamara, a former president of Ford Motor Company with a deep faith in quantitative management, embraced the body count as the primary metric of progress. Briefings to the press and to Congress featured body counts prominently. Officers who reported high body counts were promoted. Units that reported high body counts received favorable evaluations.
The metric became a target. And the target corrupted the process.
Soldiers and officers, under intense pressure to produce high body counts, began counting liberally. Civilian casualties were reclassified as enemy combatants. Bodies were counted multiple times. Estimates replaced actual counts. The journalist Neil Sheehan, in his Pulitzer Prize-winning book A Bright Shining Lie, documented cases where body counts were inflated by factors of three or four. David Hackworth, one of the most decorated American soldiers of the Vietnam era, described the system as "the sham of body count" and documented how units routinely doubled or tripled their actual kills to satisfy superiors who demanded ever-higher numbers.
The consequences went beyond mere inflation. Because body count was the metric that determined career success, some commanders chose tactics that maximized kills rather than tactics that achieved strategic objectives. Artillery and air strikes were called in on suspected enemy positions, even when the tactical situation did not require it, because the resulting destruction could be reported as enemy casualties. Patrols were sent into dangerous areas not because the area was strategically important but because contact with the enemy would produce a body count. Soldiers died so that officers could report numbers.
Meanwhile, the actual strategic situation in Vietnam deteriorated steadily. The body counts told a story of American dominance -- we were killing the enemy at ratios of ten to one, twenty to one. The Tet Offensive of 1968, in which the North Vietnamese and Viet Cong launched coordinated attacks across the entire country, shocked the American public and military leadership precisely because the body count metrics had told them they were winning. The metric said victory was near. The reality said otherwise.
CompStat and the Numbers Game
The same pattern reappeared in American policing three decades later.
In 1994, the New York City Police Department introduced CompStat (Computer Statistics), a data-driven management system that tracked crime statistics at the precinct level and held precinct commanders accountable for reducing crime in their areas. The system was initially credited with dramatic crime reductions -- New York's violent crime dropped substantially through the late 1990s and 2000s.
But as the pressure to produce favorable numbers intensified, the metrics began to corrupt. Investigations by journalists and by the department's own inspectors found multiple forms of gaming.
Downgrading. Officers reclassified serious crimes as less serious ones. A grand larceny (theft over $1,000) became a petit larceny (theft under $1,000). A felony assault became a misdemeanor. A burglary -- a crime where an intruder enters a home -- became a trespass. The number of serious crimes fell on paper without any change in the actual safety of the city.
Discouraging reports. Officers discouraged citizens from filing reports. A robbery victim might be told there was nothing the police could do, or might be subjected to extended questioning that implicitly conveyed the message that reporting the crime was more trouble than it was worth. Fewer reports meant fewer recorded crimes, which meant better statistics.
Manipulating response categories. Complaints that should have triggered a formal crime report were instead classified as "unfounded" or handled as complaints rather than crimes.
The criminologist John Eterno and retired NYPD captain Eli Silverman surveyed nearly 500 retired NYPD officers. Over 100 said they had witnessed downgrading of crime statistics. Several described the pressure as systemic: precinct commanders were judged on their numbers, and commanders who reported rising crime faced professional consequences. The incentive to make the numbers look good was overwhelming.
📜 Historical Context: The McNamara body count and CompStat are not merely parallel stories. They are connected by a shared intellectual heritage. McNamara brought to the Pentagon the same quantitative management philosophy he had used at Ford Motor Company -- the belief that if you measure it, you can manage it. CompStat applied the same philosophy to policing. Both systems assumed that metrics could serve as reliable proxies for complex realities (military progress, public safety) and that holding managers accountable for those metrics would drive improvement. Both systems worked, up to a point. And both systems were eventually corrupted, in identical ways, by the same structural flaw: the metric was not the reality, and optimization pressure drove a wedge between them.
🔗 Connection to Chapter 6 (Signal and Noise): Crime statistics are supposed to be a signal about public safety. But when the people producing the statistics have an incentive to distort them, the signal-to-noise ratio collapses. The noise is no longer random measurement error. It is systematic bias -- a form of noise that no amount of statistical sophistication can correct, because the data itself has been corrupted at the source. Chapter 6's framework for signal detection assumed that noise was honest -- random, unbiased, indifferent. Goodhart's Law introduces dishonest noise: noise with an agenda.
🔄 Check Your Understanding
- Both the Vietnam body count and CompStat were implemented by intelligent, well-intentioned people trying to improve outcomes. Why did rational optimization of these metrics lead to worse outcomes rather than better ones?
- The chapter describes three forms of CompStat gaming: downgrading, discouraging reports, and manipulating response categories. For each, explain how it satisfies the metric while failing to achieve the underlying goal of public safety.
- What does the phrase "dishonest noise" mean in the context of Goodhart's Law? How does it differ from the noise discussed in Chapter 6?
15.4 Hospitals, Search Engines, and the Engagement Machine
The Surgeon Who Refuses to Operate
In 2010, the United States created the Hospital Readmissions Reduction Program (HRRP) as part of the Affordable Care Act; beginning in 2012, it penalized hospitals with higher-than-expected rates of patient readmission within 30 days of discharge. The logic was sound: if a patient is readmitted shortly after discharge, something probably went wrong with their initial treatment. Reducing readmissions should improve patient care.
Hospitals responded to the metric. Some improved their discharge planning, follow-up care, and care coordination -- exactly the behavioral changes the policy intended. But others found a different approach.
Observation status gaming. Instead of formally readmitting a returning patient, hospitals placed them in "observation status" -- a classification that technically means the patient is being observed to determine whether admission is necessary, but which functionally provides many of the same services as an inpatient admission. A patient in observation status does not count as a readmission. The metric improves. The patient's experience -- and often their insurance coverage -- may worsen.
Surgical mortality gaming operates by a related mechanism. When hospitals and surgeons are ranked by surgical mortality rates, surgeons face a perverse incentive: refusing to operate on the highest-risk patients. A surgeon who takes on only low-risk cases will have an excellent mortality rate. A surgeon who accepts the most difficult cases -- the patients who most desperately need surgical intervention -- will have a higher mortality rate, even if they are the more skilled surgeon. The metric punishes the behavior it should reward.
This is not hypothetical. Studies published in medical journals have documented evidence that cardiac surgeons in states with publicly reported mortality data were more likely to refuse high-risk patients than surgeons in states without such reporting. The patients who were refused did not simply vanish; they either went untreated or sought care elsewhere, often with worse outcomes. The metric improved cardiac surgery mortality rates on paper while potentially increasing actual patient deaths.
PageRank and the SEO Industry
Google's original search ranking was built around PageRank, an algorithm resting on an elegant insight: the quality of a web page could be estimated by counting how many other pages linked to it, weighted by the quality of the linking pages. A page linked to by many high-quality pages was probably high-quality itself. PageRank was, in effect, a metric for web page quality.
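To make the mechanism concrete, here is a minimal sketch of the PageRank iteration on a toy link graph. The graph, damping factor, and iteration count are illustrative assumptions, not Google's production system, but the recursive logic -- your score is the quality-weighted sum of your inlinks -- is the real thing.

```python
# Minimal PageRank sketch on a toy link graph (illustrative assumptions,
# not Google's production system). Each page's score is redistributed
# along its outlinks; the damping factor models a reader who sometimes
# jumps to a random page instead of following links.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with uniform scores
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank everywhere
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# The farm pages (f1..f3) exist only to link to "spam", lifting its score
# several-fold above what its zero organic inlinks would justify --
# Goodhart's Law in miniature.
web = {
    "good":  ["other"],
    "other": ["good"],
    "spam":  [],
    "f1":    ["spam"], "f2": ["spam"], "f3": ["spam"],
}
print(sorted(pagerank(web).items(), key=lambda kv: -kv[1]))
```

The farm inflates "spam" purely through manufactured links, with no genuine endorsement behind them; closing that gap at scale is exactly the arms race described below.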
The metric worked beautifully -- until it became a target.
The moment website owners understood how Google ranked pages, an entire industry sprang up to exploit the metric. Search engine optimization (SEO) is, in its less scrupulous forms, Goodhart's Law as a business model. Link farms -- networks of low-quality sites created solely to link to a target site -- proliferated. Content mills churned out keyword-stuffed articles designed to satisfy search algorithms rather than human readers. Guest-posting schemes traded links for links, inflating the apparent authority of participating sites.
Google fought back with algorithm updates -- Panda (targeting low-quality content), Penguin (targeting link manipulation), and dozens of others. Each update represented a new metric, or a refinement of the old one. And each update triggered a new round of gaming. The SEO industry did not disappear. It adapted. The arms race between Google's algorithms and the SEO industry is a permanent, ongoing demonstration of Goodhart's Law: no matter how clever the metric, optimization pressure will find its gaps.
💡 Intuition: Imagine a teacher who uses raised hands to measure classroom engagement. At first, raised hands correlate with genuine understanding -- students who understand the material are more willing to volunteer answers. But now suppose the teacher starts grading based on hand-raising frequency. Students begin raising their hands before they have understood the question, raising hands on easy questions to pad their count, or raising hands and then saying "never mind" when called on. The hand-raising frequency goes up. The actual engagement may go down.
The Engagement Machine
Perhaps the most consequential modern instance of Goodhart's Law operates inside your phone.
Social media platforms optimize for engagement -- the metric that captures likes, shares, comments, time spent on the platform, and other indicators of user interaction. Engagement is a reasonable proxy for value: if people are spending time on your platform and interacting with content, you must be providing something they value.
But engagement-as-target generates a specific, predictable pathology. Research in cognitive science and behavioral psychology has consistently found that certain types of content generate disproportionately high engagement: content that provokes outrage, fear, moral indignation, and tribal identification. A post that makes you angry is far more likely to elicit a comment than a post that makes you thoughtfully consider an opposing view. A headline that triggers fear generates more clicks than a headline that provides nuanced context. Content that confirms your existing beliefs and demonizes the outgroup produces more shares than content that acknowledges the complexity of the world.
When the engagement metric becomes the target -- when algorithms are tuned to maximize it, when content creators are rewarded for it, when the platform's business model depends on it -- the platform systematically promotes outrage over nuance, fear over context, and tribalism over understanding. The metric goes up. The quality of public discourse goes down.
This is not a conspiracy theory about evil tech executives. It is Goodhart's Law, operating at scale. The people who designed engagement metrics were not trying to polarize society. They were trying to measure whether users found their platform valuable. But when the measure became a target -- when algorithms, content creators, and advertisers all optimized for engagement -- the metric decoupled from the thing it was supposed to measure. High engagement no longer meant "users are getting value." It meant "users are getting angry."
Spaced Review -- Cooperation Mechanisms (Ch. 11): In Chapter 11, we examined how cooperation emerges when the structure of the game makes it the self-interested strategy. Goodhart's Law reveals the dark mirror of this insight. When the structure of the incentive system makes gaming the self-interested strategy, gaming will emerge just as reliably as cooperation does. The mechanism is identical: agents respond rationally to the payoff structure they face. In Chapter 11, the payoff structure rewarded cooperation. In Goodhart's Law, the payoff structure rewards metric manipulation. Same engine, different fuel, opposite destination.
🔄 Check Your Understanding
- Explain how surgical mortality metrics can lead to worse patient outcomes even while the metrics themselves improve.
- Why is the SEO industry a permanent demonstration of Goodhart's Law rather than a problem that can be solved once and for all?
- How does the engagement metric on social media platforms illustrate the difference between a measure and the thing it is supposed to measure?
15.5 The Deeper Pattern: Why This Keeps Happening
We have now traced Goodhart's Law through seven domains -- Soviet manufacturing, education, military statistics, policing, medicine, search engines, and social media. The pattern is unmistakable. But why does it keep happening? Why do intelligent, well-intentioned people keep falling into the same trap?
The answer has three layers.
Layer 1: The Principal-Agent Problem
In every case of Goodhart's Law, there is a principal -- someone who wants something -- and an agent -- someone who is supposed to deliver it. The Soviet planners (principals) want nails. The factory managers (agents) are supposed to make them. Congress (principal) wants educated students. Teachers (agents) are supposed to educate them. The public (principal) wants safe streets. Police officers (agents) are supposed to provide safety.
The principal cannot directly observe the thing they want. Soviet planners cannot examine every nail. Congress cannot sit in every classroom. Citizens cannot follow every police officer. So the principal chooses a proxy measure -- a metric that is supposed to correlate with the thing they actually care about. Weight of nails, test scores, crime statistics.
The agent's incentives are now tied not to the thing the principal wants, but to the proxy measure. And the agent, who is closer to the ground, inevitably discovers that there are ways to improve the proxy that do not improve the underlying reality -- and that these shortcuts are easier, cheaper, or less risky than actually doing the hard work of producing the real thing.
This is the principal-agent problem, and it is the structural engine that drives Goodhart's Law. Wherever there is a gap between what the principal wants and what the principal can observe, there is an opportunity for metric gaming. The wider the gap -- the worse the metric is as a proxy for the real objective -- the more gaming will occur.
Layer 2: The Map Is Not the Territory
Every metric is a model of reality -- a simplified representation that captures some features and ignores others. Weight captures something about nail production, but not everything. Test scores capture something about learning, but not everything. Body counts capture something about military progress, but not everything.
The problem is not that these models are wrong. All models are incomplete. The problem is that when you optimize for the model, you exploit its incompleteness. Every feature of reality that the metric ignores becomes a dimension along which gaming can occur. Weight ignores nail size. Test scores ignore deep understanding. Body counts ignore strategic progress. The metric is a map, and the territory is always more complex than the map.
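A small simulation makes the gap visible. In this toy model (all quantities invented for illustration), the planner's metric is total weight, the citizens' true objective is a count of usable nails near a standard size, and a hill-climbing factory manager with a fixed stamping budget optimizes the metric.

```python
# Toy model of metric/reality decoupling (all quantities invented).
# The planner measures WEIGHT; citizens need USABLE nails near size 1.0.
import random

EFFORT_BUDGET = 100  # the factory can stamp at most 100 nails per period

def weight_metric(size, count):
    return size * count

def usable_nails(size, count):
    # Nails far from the standard size are useless for construction.
    return count * max(0.0, 1.0 - abs(size - 1.0))

random.seed(0)
size, count = 1.0, EFFORT_BUDGET  # start out making useful nails
for step in range(1, 201):
    candidate = max(0.1, size + random.uniform(-0.2, 0.2))
    # Hill-climb on the METRIC: keep any tweak that raises reported weight.
    if weight_metric(candidate, count) > weight_metric(size, count):
        size = candidate
    if step % 50 == 0:
        print(f"step {step:3d}: size={size:5.2f}  "
              f"weight={weight_metric(size, count):7.1f}  "
              f"usable nails={usable_nails(size, count):6.1f}")
```

The metric climbs monotonically while the thing it was supposed to track falls to zero -- the giant-spike factory, in a dozen lines.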
This is the map/territory confusion that we will explore in greater depth in Chapter 22. For now, the key insight is that optimization pressure does not respect the boundaries of your model. It finds the gaps between the map and the territory, and it drives through them.
🚪 Threshold Concept: Metrics Are Models
Here is the central insight of this chapter: every metric is a model -- a simplified representation of something you actually care about. And like all models, every metric has a domain of validity beyond which it breaks down.
When a metric is used passively -- as a thermometer, not a thermostat -- it can be a useful window into reality. But when a metric is used actively -- as a target that drives incentives -- optimization pressure systematically exploits the gap between the metric and the reality it represents. The metric becomes like a photograph of a landscape: useful for getting an impression, but disastrous as a guide if you try to walk through the photograph instead of the actual terrain.
This threshold concept connects to Chapter 14 (Overfitting), where we saw that optimization too tightly fitted to a particular dataset captures noise rather than signal. Goodhart's Law is overfitting applied to institutions: when you optimize an organization too tightly for a particular metric, you capture the metric's artifacts rather than the underlying reality.
How to know you have grasped this concept: You reflexively ask, whenever you encounter a metric being used as a target: "What aspects of reality does this metric fail to capture, and how might someone exploit those gaps?" You recognize that the danger is not bad metrics but the act of turning any metric into an optimization target.
Layer 3: Optimization Pressure Is Relentless
The third layer is about the sheer force of incentives. Soviet factory managers faced career consequences. Teachers faced job loss. Police commanders faced demotion. Surgeons faced reputational damage. Website owners faced financial ruin. Social media platforms faced competition.
In every case, the agents were not evil. They were responding to intense optimization pressure -- the relentless, systematic force exerted by an incentive structure that rewards metric improvement regardless of whether the underlying reality improves. Optimization pressure is like water flowing downhill: it finds every crack, every gap, every weakness in the metric's relationship to reality, and it flows through it.
Spaced Review -- Annealing (Ch. 13): In Chapter 13, we explored simulated annealing -- the idea that optimization sometimes needs random perturbation to avoid getting trapped in local optima. Goodhart's Law reveals a complementary problem: when optimization is too effective, too focused, too relentless, it does not get stuck in a local optimum. Instead, it finds a "false optimum" -- a point that scores well on the metric but poorly on the actual objective. The nail factory that produces giant spikes has found exactly such a false optimum: a point that scores superbly on the weight metric and terribly on the "useful nails" objective. Annealing is the cure for too little optimization. Goodhart's Law is the pathology of too much.
15.6 Publish or Perish: The Academy Eats Itself
The academic world offers perhaps the most ironic demonstration of Goodhart's Law, because academics are the people who named and formalized the law -- and yet they are subject to one of its most severe manifestations.
The currency of academic life is the publication. "Publish or perish" is not a joke. It is a literal description of the incentive structure in modern universities: your career advancement, your tenure, your salary, your grant funding, and your professional reputation all depend on the number and perceived quality of your publications.
The primary metric of publication quality is the impact factor -- a number assigned to each journal that represents the average number of citations received by papers published in that journal. A paper published in a high-impact-factor journal (like Nature or Science) is presumed to be more important than a paper published in a low-impact-factor journal. Impact factor is a proxy for quality: important papers get cited more, so journals that publish important papers should have higher citation rates.
The metric became a target. And the target corrupted the process.
Salami slicing. Researchers divide a single substantial study into multiple smaller papers -- "least publishable units" -- to maximize their publication count. A study that would be most useful to the scientific community as a single comprehensive paper is instead scattered across three or four journals, each containing an incomplete piece of the puzzle. The publication count goes up. The coherence of the scientific literature goes down.
Citation gaming. Authors cite their own previous work extensively, whether or not it is relevant, to boost their citation counts. Journals encourage authors to cite other papers published in the same journal, boosting the journal's impact factor. Citation rings -- informal agreements between groups of researchers to cite each other's work -- have been documented in several fields.
Novelty bias. Journals preferentially publish novel, surprising results because novel results attract more citations than replications of existing work. This creates a systematic bias against replication studies -- the studies that are most essential for verifying whether published findings are actually true. When a published finding cannot be replicated, the replication failure often goes unpublished because it is not "novel" enough for a high-impact journal.
P-hacking and the replication crisis. Under pressure to produce statistically significant results (another metric), researchers engage in p-hacking -- running multiple statistical analyses on the same data, selectively reporting the ones that produce significant results, and ignoring the ones that do not. This practice inflates the rate of false discoveries. The result is the replication crisis: when independent researchers attempt to replicate published findings in psychology, medicine, economics, and other fields, they find that a disturbingly large fraction -- estimates range from 30 percent to more than 60 percent in some fields -- fail to replicate.
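The arithmetic of p-hacking is easy to verify for yourself. The sketch below (sample sizes and outcome counts are invented for illustration) simulates studies of pure noise: an honest, preregistered single test produces false positives at roughly the nominal 5 percent rate, while testing twenty outcomes and reporting whichever one "worked" produces false discoveries roughly 64 percent of the time.

```python
# Simulating p-hacking: run many tests on pure noise, report the best one.
# There is NO real effect anywhere; every "discovery" below is false.
import math
import random
import statistics

def p_value_two_groups(a, b):
    """Rough two-sample z-test p-value (normal approximation)."""
    na, nb = len(a), len(b)
    se = math.sqrt(statistics.variance(a) / na + statistics.variance(b) / nb)
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(1)
STUDIES, OUTCOMES, N = 1000, 20, 30
honest = hacked = 0
for _ in range(STUDIES):
    pvals = []
    for _ in range(OUTCOMES):
        a = [random.gauss(0, 1) for _ in range(N)]
        b = [random.gauss(0, 1) for _ in range(N)]  # same distribution: no effect
        pvals.append(p_value_two_groups(a, b))
    honest += pvals[0] < 0.05    # preregistered single outcome
    hacked += min(pvals) < 0.05  # report whichever outcome "worked"

print(f"honest false-positive rate: {honest / STUDIES:.2%}")  # ~5%
print(f"p-hacked 'discovery' rate:  {hacked / STUDIES:.2%}")  # ~64%
```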
The replication crisis is Goodhart's Law applied to the production of knowledge itself. The metric (publications and citations) was supposed to measure scientific progress. When the metric became the target, scientists optimized for publications and citations rather than for truth. The metric improved -- more papers, higher citation counts, more impressive-seeming results. The underlying reality -- our actual knowledge of the world -- may have degraded.
⚠️ Common Pitfall: It is tempting to blame individual researchers for gaming the system. But this misses the structural point. Most researchers who salami-slice, self-cite excessively, or engage in p-hacking are not dishonest. They are responding rationally to an incentive structure that rewards metric optimization. The problem is not bad people. The problem is a system in which the metrics have decoupled from the values they are supposed to represent. Blaming individuals for responding to structural incentives is like blaming water for flowing downhill. The solution is to change the landscape, not to moralize at the water.
🔗 Connection to Chapter 2 (Feedback Loops): The replication crisis illustrates a destructive feedback loop. Journals publish novel, surprising results. These results get cited. The journal's impact factor rises. More researchers submit novel, surprising results to the journal. The journal becomes more selective, favoring even more novel and surprising findings. The bar for "surprising" rises. The temptation to p-hack or cherry-pick results increases. The fraction of published results that are actually true decreases. The crisis deepens. This is a positive (reinforcing) feedback loop driving the system toward an ever-greater divergence between what is published and what is real.
🔄 Check Your Understanding
- Explain how the impact factor illustrates Goodhart's Law. What is the metric, what is the underlying reality it is supposed to measure, and how has optimization pressure driven a wedge between them?
- Why is the replication crisis a particularly ironic manifestation of Goodhart's Law?
- The text argues that blaming individual researchers for metric gaming "misses the structural point." Do you agree? What are the strengths and limitations of a structural explanation versus an individual-responsibility explanation?
15.7 Strathern's Generalization and the Lucas Critique
We have been circling this idea from multiple directions. It is time to state it with full generality.
Marilyn Strathern, a social anthropologist at Cambridge, crystallized the insight in a single sentence: "When a measure becomes a target, it ceases to be a good measure." Strathern was not the first to notice the pattern -- Goodhart and Campbell got there first -- but her formulation strips away the domain-specific language and reveals the universal structure.
Notice the elegance of the formulation. Strathern does not say that the measure becomes useless. She says it ceases to be good. It may still contain information. But the information is now contaminated by the very act of using the measure as a target. The measure was calibrated in a world where no one was trying to manipulate it. Once it becomes a target, the world changes -- people start optimizing for it -- and the calibration breaks.
This connects to a parallel insight from economics. In 1976, the economist Robert Lucas articulated what became known as the Lucas critique: the idea that the statistical relationships observed in historical economic data will change once policymakers try to exploit them. If the Federal Reserve notices that inflation and unemployment have a stable inverse relationship (the Phillips curve) and tries to exploit that relationship by accepting higher inflation to reduce unemployment, the very act of exploiting the relationship will change people's behavior -- they will come to expect inflation, build it into their contracts and wage demands -- and the relationship will break down.
The Lucas critique is Goodhart's Law applied to macroeconomic policy. Both say the same thing: observed regularities are not invariant under intervention. The act of using a regularity for control changes the system in ways that destroy the regularity.
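The breakdown can be simulated with a toy expectations-augmented Phillips curve (all coefficients invented for illustration). Unemployment falls only when inflation exceeds what workers expect; once the policy is in place, adaptive expectations close the gap, and the exploitable regularity disappears.

```python
# Toy Lucas-critique simulation: exploiting the Phillips curve destroys it.
# Unemployment dips only when inflation SURPRISES people; adaptive
# expectations catch up, and the "stable" tradeoff evaporates.
# All coefficients are invented for illustration.

NATURAL_RATE = 5.0  # unemployment (%) when inflation matches expectations
SENSITIVITY = 1.0   # unemployment drop per point of inflation surprise
ADAPTATION = 0.5    # how fast expectations adjust toward observed inflation

expected = 0.0
for year in range(1, 11):
    inflation = 4.0  # policy: hold inflation at 4% to "buy" lower unemployment
    surprise = inflation - expected
    unemployment = NATURAL_RATE - SENSITIVITY * surprise
    print(f"year {year:2d}: inflation={inflation:.1f}%  "
          f"expected={expected:.2f}%  unemployment={unemployment:.2f}%")
    expected += ADAPTATION * (inflation - expected)  # workers adapt
```

In year one the tradeoff looks real -- unemployment drops well below the natural rate -- but within a few years it is back where it started, with inflation permanently higher. The historical correlation was real; the policy lever built on it was not.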
This is the deepest version of the insight, and it connects to fundamental questions in epistemology and systems theory. In physics, measuring a quantum system unavoidably disturbs it -- an effect popularly associated with the Heisenberg uncertainty principle. In sociology, the Hawthorne effect tells us that the act of observing workers changes their behavior. In ecology, the observer effect tells us that the presence of a researcher changes the behavior of the animals being studied. In all these cases, observation is not passive. It is an intervention. And Goodhart's Law tells us that when observation becomes optimization -- when you don't just measure the system but try to steer it using the measurement -- the distortion is not subtle. It is systematic, predictable, and devastating.
🔗 Forward Connection to Chapter 16 (Legibility and Control): The next chapter examines what happens when authorities try to make complex systems readable -- "legible" -- in order to control them. Goodhart's Law is a preview of this deeper problem. Metrics make systems legible: they translate complex realities into simple numbers that can be monitored from a distance. But the act of making a system legible changes the system. Chapter 16 will show that this problem is not confined to individual metrics. It is a fundamental feature of the relationship between those who govern and the systems they try to govern.
🔗 Forward Connection to Chapter 21 (The Cobra Effect): Chapter 21 will explore a closely related pattern: incentive systems that produce the opposite of their intended effect. The British colonial government in India, worried about cobras in Delhi, offered a bounty for every dead cobra brought to the government office. People began breeding cobras to collect the bounty. When the government discovered the fraud and cancelled the bounty, the breeders released their now-worthless cobras into the wild -- and Delhi had more cobras than before. This is Goodhart's Law pushed to its logical extreme: a metric-based incentive system that not only fails to solve the problem but actively makes it worse.
15.8 Solutions: What Can Be Done?
If Goodhart's Law is a universal structural pattern, is there any defense against it? The answer is yes -- imperfect, partial, always requiring vigilance, but real. Here are five approaches that have shown promise across domains.
Solution 1: Multi-Metric Approaches
If any single metric can be gamed, use multiple metrics simultaneously. The more dimensions you measure, the harder it is to game all of them at once. A hospital measured only by readmission rates might game that single metric. But a hospital measured by readmission rates, patient satisfaction surveys, clinical outcomes, complication rates, and process measures will find it much harder to game all five simultaneously -- and the easiest path to improving all five is usually to actually improve patient care.
There is a subtlety here. Multi-metric approaches work only if the metrics are independent -- if they measure genuinely different aspects of the underlying reality. If all your metrics are correlated (they all improve when the same gaming strategy is applied), adding more metrics provides no additional protection. The key is to choose metrics that are difficult to improve simultaneously through any means other than genuinely improving the thing you care about.
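Here is a small sketch of why independence matters (all weights invented for illustration). Each metric is modeled as a weighted mix of genuine quality and a gameable channel; a panel whose metrics all flow through the same gameable channel rewards gaming, while a panel of independent probes rewards honest effort -- especially if you judge by the worst score on the panel.

```python
# Sketch: why multi-metric panels need INDEPENDENT metrics.
# A hospital allocates effort between genuine care and gaming paperwork.
# All weights below are invented for illustration.

def scores(genuine, gaming, panel):
    """Each metric = weight_on_real_quality * genuine + gameability * gaming."""
    return [real_w * genuine + game_w * gaming for real_w, game_w in panel]

# Three metrics that all flow through the same gameable channel
# (e.g., all derived from self-reported paperwork):
correlated_panel = [(0.3, 1.0), (0.3, 0.9), (0.3, 1.1)]

# Three metrics probing different aspects, hard to game simultaneously
# (e.g., chart audits, independent patient surveys, long-run outcomes):
independent_panel = [(1.0, 0.3), (0.9, 0.1), (0.8, 0.0)]

for name, panel in [("correlated", correlated_panel),
                    ("independent", independent_panel)]:
    gamer = min(scores(0.0, 1.0, panel))   # all effort into gaming
    honest = min(scores(1.0, 0.0, panel))  # all effort into genuine care
    # Judge by the WORST metric on the panel -- hardest to satisfy by gaming.
    print(f"{name:12s} panel: gaming scores {gamer:.2f}, "
          f"honest effort scores {honest:.2f}")
```

On the correlated panel, gaming beats honest effort on every metric at once; on the independent panel, the only way to score well everywhere is to actually improve care.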
Solution 2: Qualitative Assessment
Numbers are not the only way to evaluate performance. Expert human judgment, while imperfect and subjective, has an enormous advantage over quantitative metrics: it is much harder to game. A standardized test can be gamed by teaching to the test. But fooling an experienced teacher who is spending time in a classroom, observing how students think and engage with material, is much more difficult.
The pharmaceutical industry provides an instructive example. Clinical trials often use surrogate endpoints -- measurable biological markers that are supposed to correlate with the outcome you actually care about (does the patient get better and live longer?). A drug that lowers cholesterol (the surrogate) is assumed to reduce heart attacks (the real outcome). But some drugs that successfully lower cholesterol have been found to have no effect on heart attack rates -- or even to increase them. The surrogate endpoint was a metric that had decoupled from the real outcome. The solution: measure the actual outcome. Run longer trials. Follow patients for years. Accept that measuring what you actually care about is harder, slower, and more expensive than measuring a proxy -- and do it anyway.
Solution 3: Rotating and Unannounced Metrics
If people know exactly what metric they will be evaluated on, they will optimize for it. If the metric is unknown, changing, or revealed only after the evaluation period, gaming becomes much more difficult. Some educational systems have adopted the practice of evaluating schools on different criteria each year, or using random audits rather than predetermined tests. The uncertainty forces the evaluated party to invest in general capability rather than specific metric optimization.
Solution 4: Measuring the Gaming
Some of the most effective responses to Goodhart's Law involve explicitly monitoring for gaming itself. Statistical methods can detect anomalies that suggest metric manipulation: unusual clustering of values just above a threshold, suspicious patterns in the timing of reported data, sudden improvements not accompanied by changes in process. The Atlanta cheating scandal was eventually uncovered by a statistical analysis of erasure rates on answer sheets -- an improbably high rate of wrong-to-right corrections was the forensic fingerprint of systematic cheating.
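The forensic logic is simple enough to sketch (the baseline rate and class sizes below are invented for illustration): model each answer as having some small baseline probability of a wrong-to-right erasure, then flag classrooms whose counts are astronomically improbable under that baseline.

```python
# Sketch of erasure-rate forensics (baseline rate and class sizes invented).
# Flag classrooms whose wrong-to-right erasure counts are wildly improbable
# under the district-wide baseline rate.
import math

def log_binom_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), via log-gamma for stability."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def binomial_tail(k, n, p):
    """P(X >= k): exact sum of the upper tail, computed in log space."""
    return sum(math.exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))

BASELINE_RATE = 0.02        # assumed wrong-to-right erasures per answer
ANSWERS_PER_ROOM = 30 * 50  # 30 students x 50 questions

for room, erasures in [("Room A", 35), ("Room B", 28), ("Room C", 160)]:
    tail = binomial_tail(erasures, ANSWERS_PER_ROOM, BASELINE_RATE)
    flag = "  <-- investigate" if tail < 1e-6 else ""
    print(f"{room}: {erasures:3d} erasures, "
          f"P(>= this by chance) = {tail:.1e}{flag}")
```

A count modestly above expectation (Room A) is unremarkable; a count five times expectation (Room C) has a tail probability so small that innocent explanations are effectively ruled out -- the same style of analysis that surfaced the Atlanta scandal.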
Solution 5: Ostrom's Polycentric Governance
In Chapter 11, we encountered Elinor Ostrom's work on governing the commons. Ostrom's approach is directly relevant to Goodhart's Law. Her key insight was that effective governance of shared resources requires polycentric governance -- multiple overlapping centers of decision-making, adapted to local conditions, with rules shaped by the people they affect.
Applied to Goodhart's Law, polycentric governance means: do not impose a single metric from the center. Allow local actors to develop their own evaluation criteria, adapted to their specific context, and overseen by multiple stakeholders with different perspectives. A school evaluated by its own community -- parents, teachers, local employers -- using a mix of quantitative and qualitative measures adapted to local needs is far less susceptible to Goodhart's Law than a school evaluated by a distant bureaucracy using a single standardized test.
This is not a perfect solution. Local evaluation can be captured by local interests, can be inconsistent across contexts, and can be difficult to aggregate for large-scale decision-making. But the tradeoff is clear: the more distance between the measurer and the measured, and the more the evaluation relies on a single metric, the more vulnerable the system is to Goodhart's Law.
⚠️ Common Pitfall: Knowing about Goodhart's Law can tempt you into a nihilistic conclusion: "All metrics are useless, so we should stop measuring things." This is the wrong lesson. Metrics are essential. You cannot manage what you cannot see, and measurement is how you see. The lesson is not to abandon measurement but to hold metrics lightly -- to use them as one input among many, to watch for signs of gaming, to change metrics regularly, and above all, to never mistake the metric for the thing you actually care about. The map is useful. Just do not confuse it with the territory.
🔄 Check Your Understanding
- Why must the metrics in a multi-metric approach be independent of each other? What happens if they are all correlated?
- Explain the tradeoff between the precision of quantitative metrics and the gaming-resistance of qualitative assessment.
- How does Ostrom's polycentric governance address the structural conditions that enable Goodhart's Law?
15.9 Pattern Library Checkpoint
You now have a powerful new pattern in your library. Here is how to catalog it.
Pattern Name: Goodhart's Law / Campbell's Law / Metric Corruption
Abstract Structure: When a proxy measure is used as an optimization target, agents exploit the gap between the proxy and the underlying reality, causing the proxy to decouple from the reality it was designed to measure.
Structural Signature:
- A principal who cares about an outcome they cannot directly observe
- A proxy metric chosen to represent that outcome
- Agents whose incentives are tied to the proxy metric
- Optimization pressure that exploits the gap between metric and reality
- Progressive decoupling of the metric from the underlying outcome
Domains Where It Appears:
- Manufacturing (Soviet nail factories, output quotas)
- Education (standardized testing, teaching to the test)
- Military (body counts, kill ratios)
- Policing (crime statistics, CompStat)
- Medicine (readmission rates, surgical mortality)
- Digital platforms (PageRank/SEO, engagement metrics)
- Academia (publications, citations, impact factors)
- Finance (quarterly earnings targets, credit ratings)
Diagnostic Questions:
1. Is anyone's incentive tied to a metric? If so, Goodhart's Law applies.
2. What aspects of the underlying reality does the metric fail to capture?
3. How might an agent improve the metric without improving the reality?
4. Is there a principal-agent gap -- a distance between the person who cares about the outcome and the person who produces the metric?
5. How strong is the optimization pressure? (Higher pressure = more gaming.)
Connections:
- Overfitting (Ch. 14): Metric gaming is institutional overfitting
- Legibility and Control (Ch. 16): Metrics are instruments of legibility
- Cobra Effect (Ch. 21): Metric-based incentives that backfire
- Cooperation (Ch. 11): Gaming is defection against the spirit of the metric
- Feedback Loops (Ch. 2): Metric corruption follows reinforcing feedback dynamics
Cross-Domain Transfer Exercise: Choose a domain from your own expertise. Identify a metric that is currently used as a target. Analyze it using the diagnostic questions above. Predict how agents might be gaming the metric, and check whether your prediction matches reality. If you can find evidence of gaming, you have successfully applied Goodhart's Law to a new domain.
15.10 The Metric Paradox
We end where we began -- with nails.
The Soviet planning system collapsed, and with it, the particular form of Goodhart's Law that produced giant useless nails and tiny useless nails. But the pattern did not collapse. It migrated. It found new hosts. It is alive today in every school district that teaches to the test, every hospital that games its readmission numbers, every social media platform that optimizes for engagement, every academic department that counts publications instead of ideas.
The paradox of Goodhart's Law is this: we need metrics. Complex systems cannot be managed without measurement. You cannot improve what you cannot see. But the very act of measuring, when coupled with the act of optimizing, corrupts the measurement. Metrics are like antibiotics -- essential, powerful, and dangerous when overused. The dose makes the poison.
The threshold concept of this chapter -- Metrics Are Models -- is the key to navigating this paradox. Every metric is a simplified representation of a complex reality. As long as you remember that the metric is a model, you can use it wisely: as one input among many, held lightly, rotated frequently, supplemented by qualitative judgment, and always, always checked against the underlying reality it claims to represent.
The moment you forget that the metric is a model -- the moment you treat the map as the territory, the proxy as the thing itself, the measure as the meaning -- you have stepped into Goodhart's Law. And the system will respond. It always does. Not because people are corrupt. But because optimization pressure, like water, finds every crack.
In the next chapter, we will see that the problem runs deeper than individual metrics. Chapter 16 examines what happens when entire systems are simplified for the purpose of control -- the phenomenon that James C. Scott calls "legibility." If Goodhart's Law is about what happens when a single measure becomes a target, legibility is about what happens when an entire complex reality is flattened into a form that can be administered from a distance. The patterns are the same. The scale is larger. The consequences are more severe.
But that is a story for the next chapter.
🔗 Looking Ahead: Chapter 16 (Legibility and Control) will generalize the insight from this chapter. Chapter 21 (The Cobra Effect) will explore cases where metric-based incentive systems not only fail but actively produce the opposite of their intended effect. Chapter 22 (Map and Territory) will provide the philosophical foundation for the map/territory distinction that underlies Goodhart's Law.
Chapter Summary
Goodhart's Law -- "When a measure becomes a target, it ceases to be a good measure" -- describes a universal pattern in which optimization pressure corrupts the relationship between a metric and the underlying reality it was designed to represent. The pattern appears identically across Soviet manufacturing, education, military strategy, policing, medicine, digital platforms, and academic publishing. It arises from the principal-agent problem (those who set metrics cannot directly observe the reality they care about), the map/territory confusion (every metric is an incomplete model of reality), and the relentless force of optimization pressure (incentivized agents will find and exploit every gap between the metric and the reality). Solutions include multi-metric approaches, qualitative assessment, rotating and unannounced metrics, monitoring for gaming, and polycentric governance -- but no solution is permanent, because the pressure to game is as constant as the pressure to optimize. The threshold concept -- Metrics Are Models -- provides the conceptual key: as long as you remember that every metric is a simplified model, not the reality itself, you can use metrics wisely while remaining vigilant against their corruption.
Spaced Review
Revisiting earlier material to strengthen retention.
- (From Chapter 11 — Cooperation Without Trust) In Axelrod's iterated prisoner's dilemma tournaments, tit-for-tat won by being "nice, retaliatory, forgiving, and clear." Consider a Goodhart's Law scenario where employees are measured on a metric. Why might a tit-for-tat-like response to metric gaming (reward honest reporting, punish gaming, forgive reformed behavior) be more effective than a purely punitive approach? How does the iterated nature of employment relationships create conditions where cooperation mechanisms from game theory become relevant to metric design?
- (From Chapter 13 — Annealing and Shaking) In Chapter 13, we learned that systems sometimes need randomness — controlled disorder — to escape suboptimal solutions. How might the concept of annealing apply to metric systems? Consider an organization trapped in a Goodhart equilibrium where everyone is gaming a particular KPI. What would a "controlled shaking" of the metric system look like? How does the idea of a cooling schedule (explore early, exploit later) apply to the process of developing and refining metrics over time? Could rotating metrics on a schedule function as a form of organizational annealing?
- (From Chapter 14 — Overfitting) We argued in the previous chapter that overfitting occurs when a model captures noise rather than signal. How is Goodhart's Law related to overfitting? Consider this parallel: a machine learning model that overfits memorizes the training data rather than learning the underlying pattern. A metric system suffering from Goodhart's Law captures the optimization behavior of agents rather than the underlying quality it was designed to measure. In both cases, the "model" (whether statistical or managerial) appears to be performing well by its own internal standards while actually diverging from reality. What regularization techniques from machine learning might have analogues in metric design?
Answers
1. Employment is an iterated game, not a one-shot interaction. In iterated games, strategies that combine cooperation with conditional punishment outperform pure punishment. A purely punitive approach to metric gaming (fire everyone caught gaming) can create an adversarial culture where employees hide gaming behavior rather than reducing it. A tit-for-tat approach — reward honest reporting, punish clear gaming, but forgive reformed behavior — builds a cooperative norm while maintaining accountability. The "clarity" aspect of tit-for-tat is also relevant: employees need to understand exactly what counts as gaming and what counts as legitimate optimization. Ambiguity in metric rules is like ambiguity in the rules of a game — it breeds defection.
2. An organization trapped in a Goodhart equilibrium is stuck in a local optimum — like a system at a low energy state that is stable but suboptimal. "Shaking" the metric system might involve: temporarily randomizing which metrics are evaluated, introducing qualitative assessments that disrupt the gaming strategies agents have developed, or deliberately changing metrics faster than agents can adapt. The cooling schedule analogy suggests that new metric systems should begin with high variability (frequent changes, multiple metrics, qualitative elements) and gradually stabilize as the organization develops genuine capability rather than gaming strategies.
3. Goodhart's Law is institutional overfitting. Machine learning regularization techniques with metric design analogues include: (a) **Early stopping** → time-limited metrics that are retired before gaming fully develops; (b) **Dropout** → randomly excluding some metrics from each evaluation period; (c) **Cross-validation** → checking whether improvements in the metric correspond to improvements in independent measures of the underlying reality; (d) **Regularization penalties** → penalizing extreme values that suggest gaming rather than genuine performance; (e) **Ensemble methods** → using multiple independent metrics rather than relying on any single measure.
What's Next
In Chapter 16: Legibility and Control, we will see that the problem of Goodhart's Law runs deeper than individual metrics. James C. Scott's concept of "legibility" describes what happens when entire complex realities are simplified for the purpose of administrative control — and why the results are predictably catastrophic. If Goodhart's Law is about the corruption of a single measure, legibility is about the corruption of an entire worldview. The patterns are the same. The scale is larger. The stakes are higher.