Learning Objectives
- Define the streetlight effect and explain why it is a structural failure mode, not just a bad habit
- Apply Goodhart's Law to identify cases where measurement has distorted the phenomenon being measured
- Distinguish between proxy measures and the underlying constructs they claim to represent
- Identify the streetlight effect operating in at least three different fields
- Add the measurement validity lens to your Epistemic Audit
In This Chapter
- Chapter Overview
- 4.1 The McNamara Fallacy
- 4.2 Goodhart's Law and Campbell's Law
- 4.3 The Streetlight Effect Across Fields
- 4.4 Active Right Now: Where the Streetlight Effect May Be Operating
- 4.5 The Measurement Trap: Why Fields Can't Simply "Measure Better"
- 4.6 What It Looked Like From Inside
- 4.7 The Citation Count Problem: When Science Measures Its Own Activity
- 4.8 Practical Considerations: Working With Imperfect Metrics
- 4.9 Chapter Summary
- Spaced Review
- What's Next
- Chapter 4 Exercises → exercises.md
- Chapter 4 Quiz → quiz.md
- Case Study: The Body Count — Vietnam and the McNamara Fallacy → case-study-01.md
- Case Study: When Hospital Ratings Kill Patients → case-study-02.md
Chapter 4: The Streetlight Effect
"Not everything that counts can be counted, and not everything that can be counted counts." — Often attributed to Albert Einstein (though the attribution is uncertain; the quote may originate with sociologist William Bruce Cameron)
Chapter Overview
There is an old joke about a drunk searching for his keys under a streetlight. A passerby stops to help and asks, "Are you sure you lost them here?" The drunk replies, "No, I lost them in the park, but the light is better here."
This joke is funny because the error is obvious. Of course you should search where you lost the keys, not where the light is best. No rational person would make this mistake deliberately.
Except that every field does it. Constantly. Systematically. And the consequences are not funny at all.
The joke works because it makes visible a pattern that is normally invisible. In our own fields, the light seems like it's in the right place — because we've been searching there for so long that we've forgotten the keys might be somewhere else. The streetlight doesn't just illuminate the wrong area. It makes us forget that the park exists.
In the Vietnam War, the U.S. military measured success by body counts — not because enemy casualties were a reliable indicator of strategic progress, but because they were countable. The things that actually mattered for the war's outcome — popular support, political legitimacy, the strength of the insurgency's infrastructure, the will of the South Vietnamese government — were harder to quantify and therefore harder to report to Congress. So the military reported what it could count. The body counts went up. The war was being lost. And the metrics said everything was fine.
Robert McNamara, the Secretary of Defense who championed this approach, was not stupid. He was one of the most analytically gifted public officials of the twentieth century — a former president of Ford Motor Company who had applied quantitative management techniques to great effect in the private sector. His error was structural, not intellectual: he applied a measurement framework designed for manufacturing output to a domain where the important variables couldn't be measured, and the measurable variables weren't important.
This is the streetlight effect: the systematic tendency of fields, organizations, and institutions to study, measure, optimize, and manage what is quantifiable rather than what is significant. It is the third major entry mechanism for wrong ideas, and in some ways it is the most insidious — because it doesn't introduce a specific wrong answer. Instead, it gradually reshapes an entire field around the wrong questions.
Unlike the authority cascade (Chapter 2), which introduces a specific wrong answer through prestige dynamics, and unfalsifiability (Chapter 3), which protects specific wrong answers through structural immunity to evidence, the streetlight effect distorts the entire landscape of inquiry. It determines which questions get asked, which evidence gets collected, and which aspects of reality receive institutional attention. The wrong answer isn't a specific claim — it's the implicit assumption that what can be measured is what matters.
In this chapter, you will learn to:
- Recognize the streetlight effect as a structural failure mode, not just careless thinking
- Apply Goodhart's Law and Campbell's Law to real-world measurement systems
- Distinguish between proxy measures and the constructs they represent
- Identify when metric fixation is distorting your field's priorities
- Add the measurement validity lens to your Epistemic Audit
🏃 Fast Track: If you're familiar with Goodhart's Law and metric fixation, skim sections 4.1–4.2 and start at section 4.3 (the cross-domain analysis). Complete exercise B.3 to verify.
🔬 Deep Dive: After this chapter, read Jerry Muller's The Tyranny of Metrics for an extended treatment, and explore the measurement validity literature in psychometrics for the most rigorous technical account of when measurements capture what they claim to capture.
4.1 The McNamara Fallacy
The Vietnam body count example is so instructive that it has earned its own name: the McNamara Fallacy. The sociologist Daniel Yankelovich described it in four steps:
The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide.
Each step follows logically from the last, and each one represents a deeper commitment to the streetlight. By the time you reach step four, you have not just failed to find your keys — you have convinced yourself that the keys don't exist.
Vietnam: The Numbers That Lied
Let's trace the McNamara Fallacy through the Vietnam War in detail, because the pattern is so clear.
Step 1: Measure what's measurable. The military tracked metrics it could quantify: enemy killed, weapons captured, territory controlled, sorties flown, bombs dropped, hamlet evaluation scores. These were reported in precise numbers. Briefings were filled with charts and graphs showing upward trends.
Step 2: Disregard what can't be measured. The factors that actually determined the war's trajectory — the political legitimacy of the South Vietnamese government, the resilience and adaptability of the Viet Cong's infrastructure, the morale and motivation of opposing forces, the attitudes of the rural population, the effectiveness of North Vietnamese logistics along the Ho Chi Minh Trail — were harder to quantify. They appeared in intelligence reports as qualitative assessments, hedged and uncertain. Compared to the crisp precision of body count charts, they seemed soft and unreliable.
Step 3: Presume the unmeasurable isn't important. Over time, the quantitative metrics dominated decision-making. When the body counts showed progress but field officers reported that the strategic situation was deteriorating, the numbers won. Commanders who reported high body counts were rewarded. Commanders who filed nuanced, ambiguous assessments were not. The incentive structure guaranteed that the measurable metrics would crowd out the unmeasurable reality.
Step 4: Declare the unmeasurable doesn't exist. By 1967, the metrics showed a war being won. General Westmoreland told Congress that the enemy was losing and that "the end begins to come into view." The Tet Offensive in January 1968 — a coordinated attack across South Vietnam that shocked the American public and shattered confidence in the war effort — demonstrated catastrophically that the metrics had been measuring the wrong things entirely.
The cost of this measurement failure was staggering. Over 58,000 American soldiers died. Estimates of Vietnamese casualties — military and civilian — range into the millions. The war lasted a decade longer than many analysts believed was productive. And throughout, the metrics told a story of progress that bore no relationship to strategic reality.
The body counts had been going up because the military was rewarded for producing body counts. The territory controlled had been expanding because the criteria for "controlled" had been defined to produce expanding numbers. The war was being lost in dimensions that the measurement system was blind to.
📜 Historical Context: McNamara himself later acknowledged the error. In his 1995 memoir In Retrospect, he wrote that he and his colleagues were "wrong, terribly wrong." But the structural lesson goes beyond personal regret. McNamara applied quantitative management techniques because they had worked at Ford Motor Company. Cars can be counted. Quality can be measured on an assembly line. The techniques that worked for automobiles failed catastrophically when applied to a domain where the important variables resist quantification. The error was not in using metrics — it was in assuming that what could be measured was what mattered.
🧩 Productive Struggle
Before reading the next section, consider your own field: What does your field measure? What does it NOT measure? Is there a gap between the two? List 3 things your field measures routinely, and 3 things that matter but are rarely or never measured.
Spend 3–5 minutes, then read on.
4.2 Goodhart's Law and Campbell's Law
The streetlight effect has two formal expressions, both discovered independently, both pointing at the same structural dynamic.
Goodhart's Law
In 1975, British economist Charles Goodhart observed that when the Bank of England targeted specific monetary measures as part of its policy, those measures lost their usefulness as indicators. His original formulation was more technical; the famous paraphrase, often credited to anthropologist Marilyn Strathern, runs: "When a measure becomes a target, it ceases to be a good measure."
The mechanism is straightforward: when people know they're being measured on something, they optimize for the measurement rather than for the underlying construct the measurement was supposed to represent. The measure and the construct decouple. The measure goes up. The construct may or may not.
Campbell's Law
In 1979, social scientist Donald Campbell articulated the same insight in a broader context: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."
Campbell was more explicit than Goodhart about the mechanism: it's not just that people optimize for the metric. It's that the act of using a metric for high-stakes decisions creates incentives to manipulate the metric, and these incentives distort the very process the metric was meant to track.
The Mechanism in Three Steps
1. A proxy measure is adopted because the underlying construct is hard to measure directly. (Test scores as a proxy for learning. Citation counts as a proxy for research quality. GDP as a proxy for national wellbeing. Body counts as a proxy for military progress.)
2. The proxy becomes a target. Resources, rewards, and penalties are attached to the proxy. People are evaluated, promoted, funded, or punished based on it.
3. The proxy decouples from the construct. People optimize for the proxy rather than the construct. Test scores rise without corresponding increases in learning. Citation counts rise through citation rings and self-citation. GDP rises through activities that don't improve wellbeing. Body counts rise through counting methods that inflate the numbers.
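The three-step mechanism above can be sketched as a toy simulation. All numbers here are illustrative assumptions, not empirical estimates: the point is only the shape of the curves, not their values.

```python
# Toy model of Goodhart's Law: a proxy tracks a construct until the proxy
# becomes a target, after which optimization effort shifts to gaming it.
# All parameters below are illustrative assumptions.

def run_period(effort_on_construct: float, effort_on_gaming: float):
    """Return (construct_gain, proxy_gain) for one period.

    Assumption: real work improves both the construct and the proxy;
    gaming improves only the proxy, and does so slightly more cheaply.
    """
    construct_gain = effort_on_construct
    proxy_gain = effort_on_construct + 1.5 * effort_on_gaming
    return construct_gain, proxy_gain

construct = proxy = 0.0
history = []
for period in range(10):
    if period < 5:
        # The proxy is just an indicator: all effort goes to real work.
        c, p = run_period(effort_on_construct=1.0, effort_on_gaming=0.0)
    else:
        # The proxy becomes a high-stakes target: effort shifts to gaming.
        c, p = run_period(effort_on_construct=0.2, effort_on_gaming=0.8)
    construct += c
    proxy += p
    history.append((period, round(construct, 2), round(proxy, 2)))

for period, c, p in history:
    print(period, c, p)
```

For the first five periods the proxy and the construct rise in lockstep; after the proxy becomes a target, the proxy keeps climbing while the construct nearly stalls. Anyone watching only the proxy would report accelerating progress.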
🚪 Threshold Concept
Goodhart's Law is a threshold concept that transforms how you see measurement systems. Before understanding it, rising metrics feel like evidence of progress. After understanding it, you learn to ask: Is this metric rising because the underlying reality is improving, or because people are optimizing for the metric?
Before this clicks: "Test scores went up! Education is improving!" After this clicks: "Test scores went up. But did learning go up? Or did schools get better at producing test scores specifically?"
This distinction is not cynical — it's essential. Metrics can track genuine progress. But they can also track gaming, manipulation, and optimization for the proxy rather than the construct. The only way to tell the difference is to ask whether the metric and the construct are still correlated.
4.3 The Streetlight Effect Across Fields
The streetlight effect is not confined to the military or to monetary policy. It appears in every field that uses quantitative measures — which is to say, every field. Let us trace it through several domains.
Education: Testing What's Testable
Standardized testing is the most visible and contentious example of the streetlight effect in modern life.
The underlying construct: education. What do we actually want students to gain? Knowledge, certainly. But also: critical thinking, creativity, curiosity, resilience, ethical reasoning, collaboration, self-directed learning, and the ability to apply knowledge to novel situations.
The proxy measure: test scores. Standardized tests can measure factual recall and certain procedural skills with high reliability. They are efficient, scalable, and cheap to administer. They produce precise numbers that can be compared across schools, districts, states, and nations.
The decoupling: when test scores became high-stakes targets (through policies like No Child Left Behind in the U.S.), schools predictably optimized for the proxy:
- Teaching to the test replaced teaching to the curriculum
- Subjects not on the test (art, music, physical education, recess) were reduced or eliminated
- Schools invested in test preparation at the expense of deeper learning
- In extreme cases, cheating scandals erupted (the Atlanta Public Schools scandal of 2009, where administrators altered student answer sheets)
Test scores went up in many cases. Whether learning improved is a separate — and much harder — question. The metrics said education was improving. The construct may or may not have been.
The education case is particularly instructive because it illustrates a phenomenon we might call metric displacement: the metric doesn't just fail to capture the construct — it actively displaces it. Time spent on test preparation is time not spent on the kinds of learning that tests can't measure: collaborative problem-solving, creative projects, sustained inquiry, physical activity, artistic expression, and the development of intrinsic motivation. The metric doesn't just miss these things; it crowds them out.
Research by educational scholars suggests that the narrowing of curriculum under high-stakes testing disproportionately affects disadvantaged students, who are more likely to attend schools under performance pressure. The students who most need a broad, enriching education are the ones most likely to receive a test-preparation curriculum. The metric, designed to ensure educational equity, may have worsened educational inequality — not through any individual's malice, but through the structural dynamics of Goodhart's Law.
🌍 Global Perspective: The streetlight effect in education plays out differently across cultures. The United States emphasizes standardized testing; Finland emphasizes teacher professionalism and student wellbeing with minimal standardized testing. South Korea and Japan optimize heavily for university entrance exams, producing high test scores alongside widespread student burnout and mental health crises. Each system has its streetlight — but the lamps illuminate different patches of ground.
🔗 Connection: The standardized testing debate illustrates both the streetlight effect (measuring what's testable) and Goodhart's Law (the measure becoming a target). It also interacts with the authority cascade (Chapter 2): the authority of psychometrics — the science of testing — lent credibility to test scores as indicators of learning, even as the correlation between testing and learning weakened under the pressure of high-stakes optimization.
Healthcare: The Hospital That Optimized for Ratings
The healthcare industry is particularly susceptible to the streetlight effect because the underlying construct — patient health and wellbeing — is extraordinarily complex, while the metrics used to evaluate it are necessarily reductive. A patient's outcome depends on their biology, psychology, social support, lifestyle, environmental factors, access to care, adherence to treatment, and dozens of other variables that interact in ways that no single metric can capture.
Hospital quality ratings provide another vivid example. The underlying construct: quality of patient care. The proxy measures include: patient satisfaction scores, readmission rates, mortality rates, infection rates, and wait times.
Each of these metrics captures something about quality. But when they become targets, Goodhart's Law activates:
- Patient satisfaction scores incentivize hospitals to make patients happy rather than healthy. Studies have found correlations between high patient satisfaction and higher rates of unnecessary prescriptions (including opioids), unnecessary admissions, and higher healthcare costs. A patient who receives an unnecessary antibiotic may be more satisfied than one who is correctly told their infection is viral and doesn't need antibiotics.
- Mortality rates incentivize hospitals to avoid high-risk patients. If your mortality rate is a public metric tied to funding and reputation, the rational strategy is to refer the sickest patients elsewhere — not because that's better for them, but because it improves your numbers. Research has documented "risk selection" behavior in which hospitals manage their patient mix to optimize reported outcomes.
- Readmission rates incentivize keeping patients longer than necessary (to avoid a readmission) or labeling readmissions as new admissions rather than returns.
The underlying construct — quality of care — may or may not improve. The metrics almost certainly improve, because the entire system optimizes for them.
Criminal Justice: Measuring Policing, Not Crime
Crime statistics are among the most consequential metrics in public life. They determine where police resources are deployed, which neighborhoods receive investment, how politicians are evaluated, and how safe people feel.
But what do crime statistics actually measure? They measure reported crime, which is a proxy for actual crime. The gap between the two is enormous and systematic:
- Crimes that are reported tend to be those that affect communities with political power, that involve property (which has clear monetary value), and that are visible to police.
- Crimes that are underreported include domestic violence, sexual assault, wage theft, white-collar fraud, and crimes in communities that distrust the police.
When police departments are evaluated on crime statistics, they face a perverse incentive: the easiest way to reduce reported crime is not to reduce actual crime but to discourage reporting, reclassify crimes (e.g., downgrading a burglary to a "lost property" report), or concentrate enforcement in areas where arrests are easiest (targeting minor drug offenses rather than complex fraud).
The CompStat system, implemented by the NYPD in the 1990s and widely adopted by police departments worldwide, was designed to use crime data to drive policing decisions. It produced impressive reductions in reported crime. Whether it produced corresponding reductions in actual crime — and at what cost to community trust and civil liberties — remains debated.
The streetlight effect in criminal justice is particularly consequential because the metrics directly affect people's freedom. When arrest numbers are used to evaluate police performance, officers are incentivized to make arrests — which means targeting minor offenses in high-policing areas rather than investigating complex crimes that take months to resolve. The result: police departments that look productive by the metrics while serious crime goes uninvestigated, and communities that are over-policed for minor violations while under-protected from major ones. The Innocence Project cases (our anchor example from Chapter 1) represent the ultimate failure of a measurement-fixated system: convictions were the metric of prosecutorial success, and the metric was achieved even when the convicted person was innocent.
The Deeper Pattern: What All These Examples Share
Step back and notice the common structure across the chapter's six examples (Vietnam, education, healthcare, and criminal justice above; business and economics below):
- The construct (the thing that actually matters) is multidimensional, complex, and hard to quantify.
- The proxy (the metric adopted) captures one dimension of the construct and is easy to quantify.
- The system (the institutional environment) attaches rewards and punishments to the proxy.
- The decoupling occurs when actors optimize for the proxy at the expense of the construct.
- The blindness sets in when the proxy's improvement is treated as evidence that the construct is improving.
This is not a story about bad metrics or lazy measurement. It is a story about a fundamental tension between the complexity of the things we care about and the simplicity of the numbers we use to represent them. Every proxy is a simplification. Every simplification loses information. And the information lost is often exactly the information that matters most.
🪞 Learning Check-In
Pause and reflect: - Which example from this chapter resonated most with your own experience? - Can you identify a metric in your daily work that you suspect has decoupled from the construct it represents? - How would your work change if the current metrics disappeared? - What concept from this chapter was most surprising to you?
🔄 Check Your Understanding (try to answer without scrolling up)
- State Goodhart's Law in your own words.
- Give one example from this chapter where a proxy measure decoupled from the construct it was meant to represent.
- Why does making a metric high-stakes tend to weaken its correlation with the underlying construct?
Verify
1. When a measure becomes a target, it ceases to be a good measure — because people optimize for the metric rather than the underlying reality.
2. Any of: test scores vs. learning, body counts vs. military progress, patient satisfaction vs. quality of care, crime statistics vs. actual crime, hospital mortality rates vs. quality of care.
3. Because high stakes create incentives to game the metric — to produce good numbers without producing the underlying improvement. The optimization effort shifts from the construct to the proxy.
Business: The Quarterly Earnings Trap
The corporate world provides perhaps the purest demonstration of metric fixation. Publicly traded companies are evaluated primarily on one metric: quarterly earnings per share (EPS). This metric determines stock price, executive compensation, analyst ratings, and institutional investor behavior.
The underlying construct — corporate health and long-term value creation — is multidimensional: it includes innovation capacity, employee development, customer relationships, supply chain resilience, environmental sustainability, ethical practices, and strategic positioning. None of these appear in the quarterly EPS number.
The consequences of Goodhart's Law applied to quarterly earnings are well-documented:
- Short-termism: Companies sacrifice long-term investments (R&D, employee training, infrastructure) to meet quarterly targets. Research suggests that companies managed to quarterly EPS targets underinvest in innovation by significant margins compared to privately held competitors.
- Earnings management: Accounting techniques that "smooth" earnings across quarters — accelerating revenue recognition, deferring expenses — create the appearance of steady growth without corresponding reality. Studies estimate that a substantial fraction of publicly traded companies engage in some form of earnings management.
- Stock buybacks: Companies spend billions on stock buybacks that inflate EPS (by reducing the number of shares outstanding) without creating any underlying value. The money spent on buybacks could have funded research, raised wages, or built infrastructure.
- Layoffs as metric optimization: Reducing headcount immediately improves EPS by cutting expenses. Whether this improves the company's long-term health is a separate question — but the metric rewards it regardless.
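The buyback effect is pure arithmetic. A minimal sketch with hypothetical numbers shows EPS rising while earnings, the closest thing here to the underlying construct, do not move at all:

```python
# Illustrative arithmetic (hypothetical figures): a buyback raises EPS
# without changing the company's earnings in any way.
net_income = 1_000_000_000        # $1B in annual earnings, unchanged throughout
shares_outstanding = 500_000_000  # shares before the buyback

eps_before = net_income / shares_outstanding  # $2.00 per share

# The company spends cash repurchasing 10% of its own shares.
shares_after = shares_outstanding - 50_000_000
eps_after = net_income / shares_after         # about $2.22 per share

print(f"EPS before buyback: ${eps_before:.2f}")
print(f"EPS after buyback:  ${eps_after:.2f}")
# EPS rose roughly 11% while the business produced exactly the same
# earnings: the metric moved, the construct did not.
```

The same arithmetic explains why layoffs flatter the metric: any action that shrinks the denominator (shares) or pads the numerator (short-run income) improves EPS, whether or not it creates value.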
The executives making these decisions are not confused about what matters. When asked in private, most CEOs acknowledge that quarterly EPS is a poor measure of corporate health. But the measurement system forces their hand: miss the consensus EPS estimate, and your stock drops 5–10% in a day. Beat it, and your stock rises. The streetlight is blinding.
📊 Real-World Application: In a survey conducted by Duke University and the National Bureau of Economic Research, a majority of CFOs reported that they would sacrifice long-term economic value — cancel a value-creating project, delay R&D — to meet quarterly earnings expectations. They were not confused about what creates value. They were trapped by a metric that doesn't measure it.
Economics: GDP and the Illusion of Progress
Gross Domestic Product (GDP) may be the most consequential streetlight metric in the world. Originally developed as a wartime planning tool to measure production capacity, GDP has become the primary indicator of national economic health, the basis for international comparisons, and the target that drives economic policy worldwide.
What GDP measures: the total monetary value of all goods and services produced within a country in a given period. This captures economic activity.
What GDP does not measure: whether people are healthy, educated, safe, happy, or free. Whether the environment is being sustained or degraded. Whether income is distributed equitably or concentrated. Whether economic activity is productive (building infrastructure) or destructive (cleaning up an oil spill — which increases GDP). Whether people have leisure time, meaningful work, or strong social connections.
Simon Kuznets, the economist who developed the national income accounting framework that became GDP, warned Congress in 1934: "The welfare of a nation can scarcely be inferred from a measure of national income." His warning was ignored. GDP became the streetlight under which nations search for progress, and the dimensions of human wellbeing that GDP cannot capture became progressively invisible to policymakers.
The result: countries optimize for GDP growth — which sometimes aligns with human welfare and sometimes doesn't. China's extraordinary GDP growth involved massive environmental degradation and worker exploitation that GDP did not capture. The United States has the highest GDP per capita among large nations but lags behind many developed countries on life expectancy, infant mortality, inequality, happiness, and social mobility.
Alternative measures have been proposed — the Human Development Index (HDI), Genuine Progress Indicator (GPI), Gross National Happiness (Bhutan's experiment), the OECD's Better Life Index — but none has displaced GDP as the primary measure of national success. The reasons are structural: GDP is deeply embedded in international institutions (the IMF, World Bank, credit rating agencies), in domestic policy frameworks (growth targets, budget projections, tax policy), and in the media's reporting of economic news. Changing the metric would require changing all of these downstream systems simultaneously. This is a constituency problem: the measurement system has created a network of dependencies that resist change regardless of the metric's validity.
📊 Real-World Application: Consider what happens when a natural disaster destroys a city. The cleanup and rebuilding efforts increase GDP — money is spent on construction, debris removal, new infrastructure. By the GDP metric, the disaster was economically stimulating. The homes destroyed, the lives disrupted, the psychological trauma, the community bonds severed — none of these appear in the number. This is not a subtle problem. It is a glaring demonstration of the gap between the metric and the construct. And yet GDP remains the dominant measure of economic health.
💡 Intuition: Think of the streetlight effect as a slow-motion version of looking for your keys in the wrong place. Except instead of an individual drunk, it's an entire institution. And instead of keys, it's the thing the institution was created to achieve. And the "wrong place" becomes the only place anyone looks, because the metrics create the illusion that it's the right place.
4.4 Active Right Now: Where the Streetlight Effect May Be Operating
Social media metrics and democratic discourse. Platforms measure engagement (likes, shares, comments, time on site) as a proxy for user value. But engagement correlates more strongly with outrage, fear, and novelty than with accuracy, nuance, or civic value. The metric (engagement) has become the target, and the construct (informed public discourse) has decoupled from it. Algorithmic amplification of high-engagement content means the streetlight effect operates at a scale and speed never before possible.
University rankings. Rankings like U.S. News & World Report measure proxy variables (acceptance rates, alumni giving, faculty salaries, graduation rates) and combine them into a single number. Universities have responded by gaming the inputs: rejecting more applicants to lower acceptance rates, manipulating financial aid to attract students who boost graduation statistics, and restructuring spending to optimize ranked categories rather than educational quality.
AI model benchmarks. AI research measures model quality through benchmark performance (accuracy on specific test sets). Models are increasingly "trained on the test" — optimized for benchmark performance rather than general capability. This is Goodhart's Law applied to machine learning: the benchmark ceases to measure what it was designed to measure once it becomes the target of optimization. Some researchers have argued that benchmark scores now significantly overestimate real-world AI capabilities.
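The benchmark case can be made concrete with a deliberately extreme toy (the "model," questions, and answers below are all hypothetical): a model that has memorized the benchmark scores perfectly on it and fails completely on anything else.

```python
# Toy illustration of "training on the test": a model that is just a
# lookup table of benchmark answers. All data here is hypothetical.
benchmark = {"2+2": "4", "3+3": "6", "capital of France": "Paris"}
fresh_inputs = {"4+4": "8", "capital of Spain": "Madrid"}

# The "model" memorizes the benchmark test set outright.
model = dict(benchmark)

def accuracy(model: dict, dataset: dict) -> float:
    """Fraction of questions the model answers correctly."""
    correct = sum(model.get(q) == a for q, a in dataset.items())
    return correct / len(dataset)

print(accuracy(model, benchmark))     # 1.0 — the proxy (benchmark score)
print(accuracy(model, fresh_inputs))  # 0.0 — the construct (general capability)
```

Real benchmark contamination is subtler than a lookup table, but the direction of the error is the same: the benchmark score stays a valid measure only as long as no one is optimizing against it.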
Workplace productivity metrics. Remote work has intensified the measurement of proxy variables for productivity: emails sent, meetings attended, messages posted, code committed. These measure activity, not output or value. The result is "performative work" — employees optimizing for visible activity metrics rather than actual productive output.
4.5 The Measurement Trap: Why Fields Can't Simply "Measure Better"
A natural response to the streetlight effect is: "We should just measure the right things." This is correct in principle and extraordinarily difficult in practice, for three structural reasons.
Reason 1: The Most Important Things Are Often the Hardest to Measure
There is an inverse relationship between the importance of a construct and the ease of measuring it. This is not accidental — it reflects the nature of what humans care about.
Learning is harder to measure than test scores. Health is harder to measure than lab values. Justice is harder to measure than conviction rates. Security is harder to measure than crime statistics. Wellbeing is harder to measure than GDP.
This reflects a deep feature of the relationship between measurement and reality: the constructs that matter most to human life — education, health, justice, security, wellbeing, meaning — are complex, multidimensional, and context-dependent. They resist reduction to a single number. The metrics that are easiest to construct and most reliably measured are, almost by definition, simplifications that leave out much of what matters.
Nor is this a technical limitation waiting to be solved by better instruments. We can measure blood pressure precisely but not health. We can measure income precisely but not wellbeing. We can measure words per minute but not communication quality. The precision is inversely proportional to the significance.
Reason 2: Better Metrics Get Gamed Too
Creating a more nuanced metric doesn't solve the problem — it just moves the gaming to a higher level. This is one of the most discouraging aspects of the streetlight effect: the problem is not solvable by technical improvement of the metrics alone. If you replace test scores with "portfolio assessments," teachers will optimize for portfolio assessments. If you replace GDP with a "happiness index," governments will optimize for whatever the happiness index measures (which may or may not correlate with actual happiness). Campbell's Law applies to any metric that becomes a target, regardless of how sophisticated the metric is.
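The decoupling that Goodhart's Law describes can be made concrete with a toy simulation (the setup and all parameters are my own illustrative assumptions, not empirical estimates): a proxy that honestly tracks a construct loses most of its correlation with that construct once actors add gaming effort that is unrelated to the construct itself.

```python
import random

random.seed(1)

# Toy model: each of 300 "schools" has a true learning level; its test
# score is learning + measurement noise + whatever gaming it applies.
def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

learning = [random.gauss(50, 10) for _ in range(300)]

# Before targeting: the score is an honest, noisy proxy for learning.
honest = [v + random.gauss(0, 5) for v in learning]

# After targeting: schools add gaming effort unrelated to learning,
# and that gaming term dominates the between-school variation.
gamed = [v + random.gauss(0, 5) + random.uniform(0, 60) for v in learning]

print(f"proxy-construct correlation before targeting: {correlation(learning, honest):.2f}")
print(f"proxy-construct correlation after targeting:  {correlation(learning, gamed):.2f}")
```

Nothing about the metric's definition changed between the two conditions; only the incentive to game it did. That is why redesigning the metric alone cannot fix the problem.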
Reason 3: Measurement Creates Legibility, and Legibility Creates Power
Political scientist James C. Scott, in Seeing Like a State, argues that modern states require "legibility" — the ability to see, categorize, and measure their populations and resources. Metrics provide this legibility. They make complex realities readable to administrators, policymakers, and managers.
But legibility comes at a cost: it simplifies. A forest becomes "board feet of timber." A population becomes "census categories." A student becomes "test scores." The simplification is necessary for governance — you can't manage what you can't see — but it systematically deletes the dimensions of reality that resist categorization.
This creates a power dynamic: people who control metrics control what is visible, and what is visible controls what gets attention, resources, and care. The streetlight effect is not just a measurement problem — it is a power problem. Whoever defines the metrics defines what counts as progress, success, and failure. And the things that fall outside the metrics — the dimensions of human experience that resist quantification — become politically invisible.
Reason 4: Measurement Systems Create Economic Constituencies
Once a measurement system is established, people invest in it. Testing companies build businesses around standardized tests. Economists build careers around GDP analysis. Hospital administrators build management systems around quality metrics. These constituencies have a material interest in maintaining the current measurement system, even if it doesn't measure what matters. This is the cascade lock-in component (Chapter 2) applied to measurement: the metric persists not because it's valid but because too many people are invested in it. The standardized testing industry alone generates billions in annual revenue. The consulting firms that help hospitals optimize quality metrics have built business models around the current measurement regime. The economists who have spent careers analyzing GDP have professional stakes in its continued dominance. Changing the metric threatens all of these constituencies simultaneously, creating resistance that has nothing to do with the metric's validity.
This is why the streetlight effect is so persistent even when everyone involved acknowledges its existence. The intellectual argument against metric fixation is well-understood. The structural incentives for maintaining it are overwhelming.
⚠️ Common Pitfall: The streetlight effect does not mean that measurement is bad. Measurement is essential. The error is not in measuring — it's in confusing the measurement with the thing being measured. The map is not the territory. The test score is not the learning. The body count is not the war. Keeping this distinction alive — reminding yourself that the metric is a proxy, not the reality — is the first defense against the streetlight effect.
4.6 What It Looked Like From Inside
Consider two decision-makers trapped by metrics: a hospital CEO in 2015 facing public quality ratings, and a school principal in 2005 operating under No Child Left Behind. First, the hospital CEO:
- Your hospital's mortality rate is a public metric. Patients, insurers, and regulators use it to evaluate your institution.
- A patient with multiple organ failure and a less than 10% chance of survival is referred to your hospital. Accepting this patient will likely increase your mortality rate. Referring them elsewhere will protect your numbers.
- You know that your hospital has the best intensive care unit in the region. The patient's best chance of survival is with you.
- But if you accept too many high-risk patients, your publicly reported mortality rate rises, your ranking drops, insurers negotiate harder, and your funding decreases — which means worse care for all your patients in the future.
The metric has created a genuine ethical dilemma where none existed before. Without the metric, you would simply treat the sickest patients. With the metric, treating the sickest patients is a strategic liability. The measurement system — designed to improve quality — has created an incentive structure that can harm the most vulnerable patients.
Now consider the school principal:
- Your school's funding depends on standardized test scores. If scores don't reach specified targets, your school faces sanctions — reduced funding, mandatory restructuring, potential closure.
- You believe deeply in education. You became a principal because you want children to learn and thrive.
- You can see that test preparation is crowding out art, music, science labs, recess, and the kinds of deep, project-based learning that you know are valuable.
- But if you don't prioritize test preparation, your school will lose funding, teachers will be laid off, and the children will be worse off than they are now.
- The rational response — the one that serves your students' immediate interests — is to teach to the test. Not because you believe test scores are education, but because the system has made test scores into the only thing that matters for your school's survival.
This is why the streetlight effect is a structural failure mode, not an individual one. The principal is not confused about what education means. They are trapped in a system that measures one thing (test performance) and rewards or punishes based on it. Their individual understanding is correct; the system's design is wrong. And their rational response to the system's design makes the problem worse.
🔍 Why Does This Work?
The streetlight effect works because of a fundamental asymmetry: measurable proxies provide certainty (precise numbers, clear rankings), while the underlying constructs they represent offer only ambiguity (complex, multidimensional, context-dependent). In any institutional environment where decisions must be justified, certainty wins. A school board that defunds a school because "test scores fell below the threshold" has a clear, defensible rationale. A school board that says "we believe deeper learning is occurring but we can't quantify it" does not. The streetlight effect exploits the institutional demand for legible, defensible numbers.
4.7 The Citation Count Problem: When Science Measures Its Own Activity
One of the most consequential examples of the streetlight effect is happening inside science itself.
Research quality — the value of a scientist's contribution to knowledge — is a construct that is complex, multidimensional, and extremely difficult to measure. A researcher might produce a single paper that transforms a field, or a hundred papers that advance nothing. Quality depends on the question asked, the rigor of the method, the significance of the findings, the long-term impact on the field, and whether the results are replicable.
The proxy measures that academia uses instead include: number of publications, citation count, h-index, journal impact factor, and grant funding obtained. Each of these captures something about research activity. None captures research quality.
The consequences of Goodhart's Law applied to academic metrics are well-documented:
- Salami slicing: Publishing the minimum publishable unit rather than comprehensive studies, to maximize publication count
- Citation gaming: Self-citation, citation rings (groups of researchers who cite each other), and strategic citation placement to boost metrics
- Impact factor chasing: Submitting to journals based on their impact factor rather than their relevance to the research question
- P-hacking and HARKing: Manipulating statistical analyses to produce "significant" results that are publishable (we'll examine this in depth in Chapter 10)
- Novelty bias: Pursuing surprising, counterintuitive findings (which are more publishable) rather than important, incremental findings (which are less publishable but may be more valuable)
The collective effect of these incentives is a scientific literature that is biased toward novelty over replication, toward positive results over null findings, toward dramatic effect sizes over incremental progress, and toward high-volume output over deep, careful work. This is the structural foundation of the replication crisis (which we'll examine in detail in Chapter 10) — and it is driven largely by the streetlight effect applied to the measurement of scientific quality.
The scientific community is not confused about the difference between publication metrics and research quality. Most scientists will, if asked privately, acknowledge that the metrics are flawed proxies. But the metrics determine hiring, tenure, promotion, and funding — the survival of scientific careers. Individual scientists cannot opt out of the measurement system without risking their careers. The streetlight is the only source of institutional light.
This creates a particularly painful irony: the institution dedicated to discovering truth about reality — the research university — evaluates its own members using metrics that distort the truth-seeking process. The measurement system designed to identify good science instead selects for a specific type of science (novel, surprising, publishable in high-impact journals) that may not be the science we most need (careful, replicable, incremental, focused on important rather than surprising questions).
🎓 Advanced: The philosophical dimension here is worth noting. The streetlight effect in science represents a case where the epistemology of a field (how it produces knowledge) is in tension with its sociology (how it rewards practitioners). The epistemological ideal says: pursue the most important questions using the most rigorous methods. The sociological reality says: pursue publishable questions and optimize citation counts. When epistemology and sociology pull in different directions, sociology usually wins — because sociology controls careers, and careers control who stays in the field to do the work.
📐 Project Checkpoint
Your Epistemic Audit — Chapter 4 Addition
Return to your audit target and ask:
What does your field measure? List the 3–5 primary metrics used to evaluate quality, success, or progress in your domain.
What does your field NOT measure? What important constructs are left out of the measurement system? Why?
Is there a gap? For each metric, ask: does this proxy still correlate with the construct it's supposed to represent? Or has Goodhart's Law decoupled them?
Who benefits from the current metrics? Map the constituencies that have invested in the current measurement system. Who would lose if the metrics changed?
What would happen if the metrics disappeared? If your field couldn't measure what it currently measures, what would it do instead? Would that be better or worse?
Add 300–500 words to your Epistemic Audit document addressing these questions.
4.8 Practical Considerations: Working With Imperfect Metrics
The streetlight effect cannot be eliminated. Measurement is necessary, and all measurement is imperfect. The question is how to use metrics without being captured by them.
Strategy 1: Multiple Metrics, No Single Target
Use a dashboard of metrics rather than a single number. When multiple metrics are tracked simultaneously, gaming any one metric is harder, and the overall picture is more likely to reflect reality. The Balanced Scorecard approach in business, despite its own limitations, represents this principle. A hospital that tracks not just mortality rates but also patient-reported outcomes, staff satisfaction, referral patterns, long-term follow-up results, and independent clinical audits is harder to game than one that tracks mortality alone.
Strategy 2: Rotate Metrics
Periodically change which metrics are targeted. This prevents the build-up of gaming strategies and forces organizations to attend to different dimensions of the construct at different times. If a school district focuses on math scores this year, reading comprehension next year, and science inquiry the following year — while tracking all three continuously — the incentive to teach to any single test is reduced because the target keeps moving.
Strategy 3: Measure the Gap Between Proxy and Construct
Explicitly study whether your metrics still correlate with the underlying constructs they represent. If test scores are rising but independent assessments of learning are not, the proxy has decoupled. Make this gap visible.
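One way to operationalize this strategy is a periodic decoupling check: compare the proxy against an independent assessment of the construct, and flag any period in which the correlation between them collapses. A minimal sketch follows (the function names, threshold, and all data are invented for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def decoupling_alerts(yearly_pairs, threshold=0.4):
    """yearly_pairs: {year: (proxy_scores, independent_scores)}.
    Returns the years in which proxy and construct have decoupled."""
    return [year for year, (proxy, construct) in sorted(yearly_pairs.items())
            if pearson(proxy, construct) < threshold]

# Invented example: five schools' test scores (proxy) vs. an external
# learning audit (construct), in two different years.
data = {
    2018: ([61, 55, 70, 48, 66], [58, 52, 71, 45, 63]),  # proxy tracks construct
    2023: ([88, 90, 91, 87, 92], [70, 48, 55, 61, 44]),  # scores up, learning flat
}
print(decoupling_alerts(data))  # → [2023]
```

The point is not the specific threshold but the habit: the gap between proxy and construct is itself measurable, and making it a standing report keeps the decoupling visible instead of silent.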
Strategy 4: Include Qualitative Assessment
Quantitative metrics should be supplemented with qualitative assessment — expert judgment, narrative evaluation, case-based review — that can capture dimensions the metrics miss. This is more expensive and less scalable, which is precisely why it's underused. But the cost of not including qualitative assessment — the cost of a measurement system that systematically misses what matters — may be far higher than the cost of the assessment itself. A school that supplements test scores with classroom observations, student portfolios, and teacher evaluations may spend more on assessment but produce better education.
Strategy 5: Sunlight on the Streetlight
Ask explicitly, in every meeting where metrics are discussed: "What are we NOT measuring that might matter?" Make this a standing agenda item. Create a formal "unmeasured dimensions" document that sits alongside every dashboard. This doesn't solve the problem, but it makes the streetlight visible — which is the first step toward not being blinded by it.
Strategy 6: Separate Measurement from Targeting
The most direct application of Goodhart's Law is: measure things, but don't make every measurement a target. Track a wide range of indicators for information purposes, but attach high-stakes consequences to only a small, carefully chosen subset — and rotate which subset is targeted. This preserves the informational value of metrics while reducing gaming incentives.
Underlying all six strategies is awareness. When everyone in an organization understands the streetlight effect and Goodhart's Law, they are less likely to confuse the metric with the reality. Name the effect explicitly: "We know that test scores are a proxy for learning, not learning itself. What evidence do we have about actual learning?"
✅ Best Practice: Before adopting any new metric, ask two questions: (1) "If this metric improves but the underlying reality doesn't, will we know?" (2) "Who will be incentivized to game this metric, and how?" If you can't answer both questions, the metric isn't ready.
4.9 Chapter Summary
Key Arguments
- The streetlight effect is the systematic tendency to study, measure, and optimize what is quantifiable rather than what is significant
- Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure") and Campbell's Law describe the mechanism by which metrics decouple from the constructs they represent
- The effect operates across all fields: education (test scores vs. learning), military (body counts vs. strategic progress), healthcare (ratings vs. quality), criminal justice (crime statistics vs. actual crime), economics (GDP vs. wellbeing), and science (citations vs. research quality)
- The effect is structural: individual actors understand the distinction between proxy and construct but are trapped by systems that reward proxy optimization
- Better metrics don't solve the problem because better metrics get gamed too (Campbell's Law)
Key Debates
- Can meaningful constructs ever be adequately measured, or is the gap between proxy and construct permanent?
- Is the solution more measurement (better metrics) or less measurement (more qualitative judgment)?
- How do you balance the need for accountability (which requires metrics) with the dangers of metric fixation?
Analytical Framework
- The McNamara Fallacy (four steps from measuring the measurable to denying the unmeasurable)
- Goodhart's Law and Campbell's Law as formal diagnoses
- The six practical strategies for working with imperfect metrics
Spaced Review
Revisiting earlier material to strengthen retention.
- (From Chapter 2) What are the three components of an authority cascade? How might authority cascade interact with the streetlight effect? (Hint: who decides which metrics are valid?)
- (From Chapter 1) Where in the lifecycle of a wrong idea does the streetlight effect typically operate?
- (From Chapter 3) How does the streetlight effect relate to falsifiability? Can a metric-driven claim be unfalsifiable?
Answers
1. Prestige investment, deference amplification, cascade lock-in. Authority cascade interacts with the streetlight effect when prestigious researchers or institutions endorse a particular metric — their endorsement amplifies its adoption regardless of its validity (e.g., the authority of psychometrics lending credibility to standardized testing).
2. Primarily at Stages 1–3 (Introduction, Adoption, Entrenchment): the streetlight effect shapes which questions are asked and which evidence is collected, determining which ideas enter the field's awareness.
3. A metric can become unfalsifiable if "success" is defined entirely in terms of the metric: "Our education system is improving" becomes unfalsifiable if improvement is *defined* as rising test scores, because no evidence of declining learning (in dimensions not captured by tests) can count against the claim.
What's Next
In Chapter 5: Survivorship Bias at Scale, we'll examine the fourth entry mechanism: how entire fields build their knowledge on the evidence that survived while systematically ignoring what didn't. You'll meet Abraham Wald's brilliant insight about WWII bombers, the publication bias that distorts medical research, and the startup mythology that studies winners while ignoring identical companies that failed.
Before moving on, complete the exercises and quiz to solidify your understanding.