Chapter 15 Exercises
How to use these exercises: Work through the parts in order. Part A builds recognition skills, Part B develops analysis, Part C applies concepts to your own domain, Part D requires synthesis across multiple ideas, Part E stretches into advanced territory, and Part M provides interleaved practice that mixes skills from all levels.
For self-study, aim to complete at least Parts A and B. For a course, your instructor will assign specific sections. For the Deep Dive path, do everything.
Part A: Pattern Recognition
These exercises develop the fundamental skill of recognizing Goodhart's Law across domains.
A1. For each of the following scenarios, identify (i) the metric, (ii) the underlying reality the metric is supposed to represent, (iii) the principal, (iv) the agent, and (v) at least one plausible form of gaming.
a) A call center evaluates employees based on the number of calls handled per hour.
b) A restaurant is rated by health inspectors based on a checklist of observable conditions (clean floors, food stored at correct temperature, handwashing station available).
c) A country measures its economic progress by GDP (Gross Domestic Product).
d) A dating app measures success by the number of matches users make.
e) A hospital ranks surgeons by their patients' 30-day mortality rate.
f) A city government evaluates its parks department by the number of maintenance work orders completed per month.
g) An environmental agency measures air quality progress by the number of days per year that pollution levels stay below a threshold.
h) A fitness tracker measures health by daily step count.
A2. Classify each of the following as an example of Goodhart's Law, Campbell's Law, the Lucas critique, or the cobra effect (from the Chapter 21 preview). Some may fit more than one category. Explain your reasoning.
a) A company measures customer satisfaction with a survey. Employees begin coaching customers to give high scores before the survey is administered.
b) A government subsidizes electric car purchases to reduce emissions. Car manufacturers produce cheap, low-quality electric vehicles designed to qualify for the subsidy rather than to be practical transportation.
c) The Federal Reserve observes that low interest rates stimulate economic growth and maintains low rates for an extended period. Asset bubbles form as markets adjust their behavior to the expectation of permanently low rates.
d) A university ranks departments by research output. A department that was once known for excellent teaching and mentoring shifts its resources toward research, and teaching quality declines.
e) A city offers a bounty for reporting potholes. Residents begin filing duplicate reports and reporting minor surface imperfections as potholes.
A3. The chapter distinguishes between using a metric as a "thermometer" (passive indicator) and using it as a "thermostat" (optimization target). For each of the following, explain whether the metric is being used as a thermometer or a thermostat, and predict whether Goodhart's Law is likely to apply.
a) A doctor checks your blood pressure as part of a routine annual physical.
b) A health insurance company offers premium discounts to members who maintain blood pressure below a specific threshold.
c) A teacher reviews student quiz results to decide which topics need more instruction.
d) A school board publishes a ranking of schools based on student quiz results, and the bottom-performing schools face closure.
e) A website owner checks Google Analytics to understand how visitors use their site.
f) A website owner's annual bonus is tied to their site's Google Analytics engagement metrics.
A4. Identify the Goodhart's Law dynamic in each of the following historical examples. What was the metric? How was it gamed? What was the consequence?
a) The Dutch tulip mania of the 1630s, in which the price of tulip bulbs was treated as a signal of their value.
b) Medieval trial by ordeal, in which an accused person's physical reaction to a painful test (holding hot iron, being submerged in water) was treated as a metric of guilt or innocence.
c) The British Navy's historical practice of measuring the effectiveness of its blockades by the number of ships captured.
d) Mao's Great Leap Forward, in which local officials reported grain harvests to Beijing.
A5. The chapter discusses Strathern's generalization: "When a measure becomes a target, it ceases to be a good measure." Restate this principle in your own words, using an example from a domain not discussed in the chapter.
Part B: Analysis
These exercises require deeper analysis of metric corruption dynamics.
B1. The Principal-Agent Analysis. Choose one of the following systems and perform a complete principal-agent analysis:
- University admissions (metric: SAT/ACT scores, GPA)
- Criminal justice (metric: conviction rate, recidivism rate)
- Environmental regulation (metric: emissions per unit of output)
- Software development (metric: lines of code written, bugs fixed per week)
For your chosen system:
a) Identify all principals and agents. (Note: some systems have multiple levels -- e.g., the public is the principal of legislators, who are the principals of agency heads, who are the principals of inspectors.)
b) Identify the proxy metric(s) currently in use.
c) Explain what the metric fails to capture about the underlying reality.
d) Predict at least three specific forms of gaming that agents might engage in.
e) Research whether any of your predicted forms of gaming have actually been documented. Were your predictions accurate?
f) Propose an alternative metric design that would be more resistant to gaming. Explain why, using the concepts from Section 15.8.
B2. The Multi-Metric Paradox. Section 15.8 argues that using multiple independent metrics reduces gaming. But there is a problem: the more metrics you use, the more complex the evaluation system becomes, and the harder it is for agents to understand what they are being evaluated on.
a) Explain why clarity was identified as one of tit-for-tat's key strengths in Chapter 11. How does this relate to metric design?
b) If an evaluation system has twenty different metrics, and the agent does not know how they are weighted, what problems might arise? Could the agent simply give up trying to perform well, since the evaluation seems arbitrary?
c) Propose a middle ground: a multi-metric system with enough dimensions to resist gaming but enough clarity to guide behavior. How many metrics? How would you communicate them? How would you weight them?
d) Is there a fundamental tradeoff between gaming-resistance and clarity? If so, where is the optimal point on this tradeoff, and how does it depend on the domain?
B3. Honest Noise vs. Dishonest Noise. The chapter introduces the concept of "dishonest noise" -- noise with an agenda.
a) Define "honest noise" and "dishonest noise" using the signal detection framework from Chapter 6.
b) Explain why standard statistical methods (e.g., averaging, increasing sample size) can correct for honest noise but not for dishonest noise.
c) Give an example of dishonest noise from a domain not discussed in the chapter.
d) What statistical methods, if any, can detect dishonest noise? (Hint: think about the distribution of errors. Are gamed metrics likely to produce the same error distributions as honest measurement error?)
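One way to make part (d) concrete: gamed metrics often leave a statistical fingerprint that honest noise does not, such as results "bunching" just above a pass threshold. The following is a minimal sketch -- all thresholds, distributions, and gaming rates are invented for illustration -- comparing the density of scores just above versus just below a cutoff:

```python
import random

random.seed(42)
THRESHOLD = 60  # hypothetical pass mark

def honest_scores(n):
    """Genuine performance plus honest (zero-mean) measurement noise."""
    return [random.gauss(55, 10) for _ in range(n)]

def gamed_scores(n, p_game=0.3):
    """Same distribution, but some near-miss scores are nudged over the line."""
    return [THRESHOLD + random.uniform(0, 1)
            if (THRESHOLD - 5 < s < THRESHOLD and random.random() < p_game)
            else s
            for s in honest_scores(n)]

def bunching_ratio(scores, width=2):
    """Density just above the threshold divided by density just below it.
    Roughly 1 under honest noise; inflated when results are pushed over the line."""
    above = sum(1 for s in scores if THRESHOLD <= s < THRESHOLD + width)
    below = sum(1 for s in scores if THRESHOLD - width <= s < THRESHOLD)
    return above / max(below, 1)

print(round(bunching_ratio(honest_scores(100_000)), 2))  # close to 1
print(round(bunching_ratio(gamed_scores(100_000)), 2))   # clearly above 1
```

A ratio near 1 is consistent with honest measurement error; a spike just above the cutoff is exactly the kind of anomaly that density and regression-discontinuity tests are designed to detect.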
B4. The Thermometer/Thermostat Distinction. The case study distinguishes between using a metric as a thermometer (diagnostic) and using it as a thermostat (optimization target).
a) Give a detailed example of the same metric being used first as a thermometer and then converted to a thermostat. Describe what changes in the system when the conversion happens.
b) Is it possible to prevent a metric from becoming a thermostat once it is published? Why or why not?
c) Some have argued that the very act of publishing a metric converts it to a thermostat, because any public metric inevitably becomes something people optimize for. Evaluate this argument. Is it always true? Under what conditions might a published metric remain a useful thermometer?
Part C: Application to Your Own Domain
These exercises connect Goodhart's Law to your area of expertise.
C1. Identify a metric used as an evaluation target in your field of study or professional domain.
a) What is the metric, and what underlying reality is it supposed to represent?
b) What aspects of the underlying reality does the metric fail to capture?
c) What forms of gaming have you personally observed or heard about?
d) Has the gaming been documented formally (in academic studies, journalistic investigations, or industry reports)? If so, what did the documentation find?
e) Using the five solutions from Section 15.8, propose specific changes that would reduce gaming in your domain.
C2. Design a metric system for evaluating performance in your domain that would be resistant to Goodhart's Law. Your design should:
a) Include at least three independent metrics that together capture the underlying reality more completely than any one metric alone
b) Include at least one qualitative component (human judgment, peer evaluation, narrative assessment)
c) Include a mechanism for detecting gaming (statistical anomaly detection, random audits, etc.)
d) Include a plan for rotating or updating metrics to prevent long-term adaptation
e) Be practically implementable with available resources
C3. Write a one-page "Goodhart's Law Audit" of your organization, department, or team. For each major metric currently in use:
a) Rate the distance between the metric and the underlying reality it represents (1 = very close, 5 = very distant)
b) Rate the intensity of optimization pressure on the metric (1 = low stakes, 5 = career-defining)
c) Rate the observability of gaming (1 = gaming would be immediately visible, 5 = gaming would be invisible)
d) For each metric where you rate all three dimensions above 3, predict specific gaming behaviors and propose countermeasures.
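The audit in parts (a)-(d) can be tallied with a short script. This sketch -- metric names and ratings are invented examples, not data from the chapter -- flags the metrics where all three ratings exceed 3:

```python
# Hypothetical audit entries: (metric name, distance, pressure, invisibility),
# each rated 1-5 per parts (a)-(c).
metrics = [
    ("calls handled per hour", 4, 4, 4),
    ("customer satisfaction survey", 3, 5, 2),
    ("bugs fixed per week", 5, 2, 5),
]

def at_risk(entries):
    """Flag metrics where every dimension exceeds 3 -- the audit's trigger for
    predicting gaming behaviors and proposing countermeasures (part d)."""
    return [name for name, dist, pressure, invis in entries
            if dist > 3 and pressure > 3 and invis > 3]

print(at_risk(metrics))  # only the first entry clears the bar on all three
```

The "all three above 3" rule is deliberately conjunctive: a distant metric under low pressure, or an intense metric whose gaming would be obvious, is less urgent than one that is distant, high-stakes, and opaque at once.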
Part D: Synthesis
These exercises require integrating ideas across multiple chapters.
D1. Goodhart's Law and Overfitting. Chapter 14 introduced overfitting -- fitting a model too closely to training data, capturing noise rather than signal.
a) In what specific sense is teaching to the test an example of overfitting? What is the training data, what is the model, and what is the out-of-sample test?
b) Chapter 14 discussed regularization -- techniques for preventing overfitting by constraining the model's complexity. What would "regularization" look like in the context of Goodhart's Law? Propose a concrete example.
c) Chapter 14 noted that overfitting is worst when the training data is small and the model is complex. Under what conditions is Goodhart's Law worst -- when the metric is simple or complex? When the optimization pressure is high or low? Draw the analogy explicitly.
D2. Goodhart's Law and Cooperation. Chapter 11 examined how cooperation emerges when the game structure makes it the self-interested strategy.
a) Metric gaming can be understood as a form of defection against the spirit of the evaluation system. Using Chapter 11's framework, explain why gaming is individually rational and collectively harmful -- the same structure as the prisoner's dilemma.
b) Ostrom's design principles were proposed as solutions to the tragedy of the commons. Section 15.8 proposes Ostrom's polycentric governance as a partial solution to Goodhart's Law. Explain the connection: how is Goodhart's Law a tragedy of the commons?
c) In Chapter 11, tit-for-tat succeeded because it was nice, retaliatory, forgiving, and clear. Design a metric evaluation system that embodies these four properties. What would "nice" mean for a metric system? What would "retaliatory" mean? "Forgiving"? "Clear"?
D3. Goodhart's Law and Feedback Loops. Chapter 2 distinguished positive (reinforcing) and negative (balancing) feedback loops.
a) Identify the positive feedback loop in the academic publish-or-perish system. Draw the loop explicitly, showing how each step leads to the next.
b) Identify a negative (balancing) feedback loop that could counteract the metric corruption feedback loop. Does such a loop currently exist in any of the domains discussed in Chapter 15? If not, could one be designed?
c) Chapter 2 discussed the role of delay in feedback loops. How does the speed of the feedback loop affect the severity of Goodhart's Law? Compare the slow loop of education testing with the fast loop of social media engagement.
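The loop structure in parts (a)-(c) can be sketched numerically. This toy simulation -- every parameter is invented for illustration -- models a reinforcing loop in which an inflated metric raises next period's optimization pressure, plus an optional balancing "audit" loop that drains gaming in proportion to its level:

```python
def simulate(steps, gaming_gain=0.1, audit_strength=0.0):
    """Reinforcing loop: pressure to hit the metric raises gaming, gaming inflates
    the metric relative to reality, and the inflated metric raises next period's
    pressure. The balancing loop (audits), if present, removes a fixed fraction
    of gaming each period."""
    gaming, gap = 0.0, 0.0  # gap = metric minus underlying reality
    history = []
    for _ in range(steps):
        gaming += gaming_gain * (1 + gap)   # pressure grows with the inflated metric
        gaming -= audit_strength * gaming   # balancing loop, if any
        gap = gaming
        history.append(round(gap, 2))
    return history

print(simulate(5))                       # reinforcing loop alone: the gap keeps growing
print(simulate(5, audit_strength=0.5))   # with a balancing loop: the gap levels off
```

The contrast illustrates part (b): without a balancing loop the gap between metric and reality compounds, while even a crude audit loop caps it. Shortening the delay between gaming and correction (part c) is equivalent to strengthening the balancing loop.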
D4. Goodhart's Law and Signal Detection. Chapter 6 introduced signal detection theory -- the framework for separating meaningful patterns from noise.
a) Goodhart's Law can be understood as a signal detection problem: the principal is trying to detect the signal of genuine performance in a metric that is increasingly contaminated by the noise of gaming. Using Chapter 6's framework, analyze how Goodhart's Law changes the signal-to-noise ratio over time.
b) Chapter 6 discussed the tradeoff between sensitivity and specificity. Apply this tradeoff to metric design: a metric with high sensitivity (catches all genuine good performance) may have low specificity (also rewards gaming). How would you manage this tradeoff?
c) Base rate neglect (Chapter 6) can compound Goodhart's Law. Explain how. (Hint: if gaming is rare, a test for gaming will produce many false accusations. If gaming is common, the metric itself becomes unreliable. Either way, there is a problem.)
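The hint in part (c) can be checked with a line of Bayes' rule arithmetic. This sketch assumes a hypothetical gaming detector with 90% sensitivity and 90% specificity -- the numbers are illustrative, not from the chapter:

```python
def accusation_accuracy(base_rate, sensitivity=0.9, specificity=0.9):
    """Probability that an agent flagged by a gaming detector is actually gaming
    (the positive predictive value), via Bayes' rule."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# When gaming is rare, most accusations are false even with a 90%-accurate test.
print(round(accusation_accuracy(0.01), 3))  # ~0.083
print(round(accusation_accuracy(0.50), 3))  # 0.9
```

This is the dilemma in the hint: at a 1% base rate, roughly eleven of every twelve flagged agents are innocent; at a 50% base rate the detector works well, but by then the metric itself is half noise.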
Part E: Advanced Challenges
These exercises push beyond the chapter's material into deeper or more speculative territory.
E1. The Lucas critique argues that observed statistical relationships change when policymakers try to exploit them. Research the Lucas critique in macroeconomics. How did it change economic policy-making? What was the relationship between the Lucas critique and the rational expectations revolution in economics? Write a 500-word analysis connecting the Lucas critique to Goodhart's Law.
E2. The chapter discusses five solutions to Goodhart's Law. Evaluate each of the five solutions, and identify which are themselves vulnerable to Goodhart's Law. (For example: if you monitor for gaming, can the monitoring metric itself be gamed? If you use qualitative assessment, can qualitative assessors be influenced or corrupted?) Is there a meta-Goodhart's Law -- a pattern in which solutions to Goodhart's Law are themselves subject to Goodhart's Law?
E3. Some thinkers have argued that artificial intelligence, particularly large language models, represents a new frontier of Goodhart's Law: AI systems are trained to optimize for proxy metrics (human preference ratings, performance on benchmarks) that may not capture the actual capability or safety of the system. Research the concept of "reward hacking" or "specification gaming" in AI alignment literature. How does it relate to Goodhart's Law? Is it a special case, or does it introduce genuinely new dynamics?
E4. Design a thought experiment: a world where Goodhart's Law does not apply. What structural features would this world need? Would such a world require perfect information (the principal can observe the underlying reality directly), perfect alignment (the agent's incentives are identical to the principal's), or something else entirely? What would the consequences of such a world be -- and would they all be positive?
Part M: Mixed Practice (Interleaved Review)
These exercises mix concepts from Chapters 11-15 to build integrated understanding.
M1. A company implements a peer review system where employees rate each other's performance (cooperation mechanism, Ch. 11). The ratings are used for promotion decisions (metric as target, Ch. 15). Employees begin exchanging favorable reviews (gaming, Ch. 15) in a pattern that resembles reciprocal altruism (Ch. 11). Analyze this system using concepts from both chapters. Is the exchange of favorable reviews cooperation or defection? From whose perspective?
M2. A social media platform discovers that its engagement metric is driving outrage (Goodhart's Law, Ch. 15). It adds a new metric: "meaningful social interactions" -- prioritizing comments and shares between friends over passive consumption of viral content. Analyze this change using signal detection theory (Ch. 6). What is the signal (meaningful interaction)? What is the noise (gaming of the new metric)? Predict how content creators will adapt to the new metric.
M3. A researcher discovers that a published finding in their field does not replicate (replication crisis, Ch. 15). They suspect p-hacking. Using Bayesian reasoning (Ch. 10), explain how the prior probability of a hypothesis being true affects the reliability of a statistically significant result. Why does the replication crisis disproportionately affect fields where most tested hypotheses have low prior probabilities?
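The Bayesian argument in M3 reduces to one formula. This sketch assumes the conventional illustrative values of 80% statistical power and a 0.05 significance threshold; the priors are hypothetical:

```python
def posterior_true(prior, power=0.8, alpha=0.05):
    """Probability a hypothesis is true given a statistically significant result,
    via Bayes' rule: P(true | significant)."""
    return (prior * power) / (prior * power + (1 - prior) * alpha)

# A significant result means much less in a field of long-shot hypotheses.
print(round(posterior_true(0.50), 2))  # high-prior field: ~0.94
print(round(posterior_true(0.05), 2))  # low-prior field: ~0.46
```

At a 5% prior, nearly half of "significant" findings are false even before any p-hacking -- and p-hacking effectively inflates alpha, making the low-prior case worse still. This is why the replication crisis concentrates in fields where most tested hypotheses are improbable.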
M4. Consider the explore/exploit tradeoff (Ch. 8) in the context of metric design. Exploring new metrics risks disruption and confusion. Exploiting existing metrics risks Goodhart's Law. How should an organization balance the need for metric stability (exploitation) against the need to update metrics to prevent gaming (exploration)? Is there an optimal rotation frequency?
M5. A government implements a carbon tax to reduce emissions (mechanism design, Ch. 11). Companies optimize by moving their most carbon-intensive operations to countries without carbon taxes (metric gaming, Ch. 15; this is known as "carbon leakage"). The global emission level does not change, but the national metric improves. Analyze this situation using the concepts of polycentric governance (Ch. 11, Ch. 15), the tragedy of the commons (Ch. 11), and Goodhart's Law (Ch. 15). Propose a structural solution.