> "It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so."
Learning Objectives
- Apply 15 diagnostic questions to any claim in any field to assess the likelihood that the claim is wrong
- Connect each diagnostic question to the specific failure modes from Parts I-III that it detects
- Distinguish between red flags that suggest active error and yellow flags that suggest uncertainty
- Use the Red Flag Scorecard to produce a structured assessment of any claim, field, or consensus
- Advance the Epistemic Audit by applying all 15 questions to the target field's core claims
In This Chapter
- Chapter Overview
- The 15 Diagnostic Questions
- The Red Flag Scorecard
- Worked Example: Applying the Scorecard
- 📐 Project Checkpoint
- 31.2 Chapter Summary
- Spaced Review
- What's Next
- Chapter 31 Exercises → exercises.md
- Chapter 31 Quiz → quiz.md
- Case Study: Scoring the 2008 Financial Crisis — A Retrospective Red Flag Analysis → case-study-01.md
- Case Study: Scoring a Current Controversy — What the Scorecard Says Now → case-study-02.md
Chapter 31: Red Flags
"It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so." — Attributed to Mark Twain
Chapter Overview
For thirty chapters, this book has been building a diagnostic vocabulary. You can now identify authority cascades, unfalsifiable claims, streetlight effects, survivorship bias, plausible stories, sunk cost defenses, replication failures, perverse incentives, Einstellung effects, consensus enforcement, zombie ideas, crisis-driven corrections, revision myths, and overcorrections. You have seen these failure modes dissected in medicine, economics, psychology, nutrition, criminal justice, military strategy, technology, and education.
The question is: what do you do with all of this?
This chapter translates the diagnostic vocabulary into a practical tool — 15 questions you can ask about any claim, in any field, to assess the probability that the claim is wrong. These are not trick questions. They are not debating tactics. They are structural diagnostics — each one grounded in a specific failure mode from Parts I through III and field-tested against the cases from Part IV.
No single red flag proves a claim is wrong. But when multiple red flags cluster around the same claim, the probability that the consensus is protecting an error rises sharply. The 15 questions function like a medical screening: individually, each question has limited diagnostic power, but the pattern across all 15 is informative.
In this chapter, you will learn to:
- Apply 15 diagnostic questions to any claim in any field
- Connect each question to the failure modes it detects
- Distinguish between red flags (strong signals) and yellow flags (uncertainty signals)
- Use the Red Flag Scorecard to produce a structured assessment
🏃 Fast Track: This chapter is designed as a reference — you can read the 15 questions in order, or use the summary table at the end to identify the questions most relevant to your field. If you've absorbed Parts I-IV, the questions will feel like natural applications.
🔬 Deep Dive: After completing this chapter, apply all 15 questions to three claims in your own field — one you're confident is correct, one you suspect might be wrong, and one you're uncertain about. The contrast will calibrate your sense of how the tool works.
The 15 Diagnostic Questions
Question 1: Who Funded This?
Failure mode detected: Incentive structures manufacturing error (Chapter 11)
The question: Who paid for the research, the product, the program, or the policy behind this claim? Do the funders benefit from the claim being true?
Why it matters: When the funder benefits from a specific outcome, the research is structurally biased toward that outcome — not necessarily through fraud, but through the subtler mechanisms of study design, selective publication, and framing. The pharmaceutical industry funds studies of its drugs. The food industry funds nutrition research. EdTech companies fund evaluations of their products. The bias is not universal, but it is systematic and well-documented.
What would have caught:
- The dietary fat hypothesis was sustained partly by food industry funding of research that pointed to fat rather than sugar as the dietary villain (Chapter 26)
- Forensic science techniques were validated by the forensic science community itself, not by independent scientists (Chapter 27)
- EdTech products are evaluated in studies funded or designed by the companies selling them (Chapter 30)
Scoring:
- 🟢 Green: Funding is independent of the outcome; funders have no financial stake in the claim being true
- 🟡 Yellow: Mixed funding; some independence, some conflict
- 🔴 Red: Primary funding comes from entities that benefit financially from the claim being true
Question 2: Has This Been Independently Replicated?
Failure mode detected: The replication problem (Chapter 10)
The question: Has the core finding been reproduced by independent researchers — people who were not involved in the original study, have no professional relationship with the original authors, and have no career investment in the result?
Why it matters: A finding that has been produced only by the original research team may reflect the team's methodology, assumptions, or unconscious biases rather than reality. Independent replication is the most powerful filter against false positives. Its absence is one of the strongest red flags.
What would have caught:
- Many of the landmark psychology findings that failed to replicate in the 2010s had never been independently replicated before those attempts (Chapter 25)
- Forensic science techniques (bite marks, hair microscopy) had never been subjected to independent validation (Chapter 27)
- Learning styles had not been validated through rigorous independent testing (Chapter 30)
Scoring:
- 🟢 Green: Multiple independent replications with consistent results
- 🟡 Yellow: Partial replication or replication by related groups
- 🔴 Red: Never independently replicated, or replication attempts have failed
Question 3: What Would Disprove This?
Failure mode detected: Unfalsifiable by design (Chapter 3)
The question: Can you specify, in advance, what evidence would convince you that the claim is wrong? Can the claim's proponents specify what evidence would convince them?
Why it matters: A claim that cannot be disproven — because any evidence can be reinterpreted as consistent with it — is not a scientific claim. It may still be true, but it cannot be evaluated empirically. The inability to specify disproving conditions is a structural feature that protects error.
What would have caught:
- Strategic bombing doctrine — "we would have won if politicians hadn't constrained us" (Chapter 28)
- Crypto utopianism — "that wasn't real crypto" (Chapter 29)
- Social media's "connecting the world" narrative — negative outcomes attributed to misuse rather than to the platform (Chapter 29)
- Psychoanalytic claims that explain any outcome after the fact (Chapter 3)
Scoring:
- 🟢 Green: Clear falsification criteria that proponents agree to in advance
- 🟡 Yellow: Vague or post-hoc falsification criteria
- 🔴 Red: No falsification criteria; proponents cannot specify what would change their mind
Question 4: Who Benefits From This Being True?
Failure mode detected: Incentive structures (Chapter 11) + sunk cost (Chapter 9)
The question: Beyond the funders, who benefits professionally, financially, or reputationally from this claim being accepted as true? Who would lose if it turned out to be wrong?
Why it matters: When powerful actors benefit from a claim, the institutional machinery around it — peer review, funding, conferences, media coverage — will tend to protect the claim. This is not conspiracy; it is the predictable operation of incentive structures. The more powerful the beneficiaries, the more structural protection the claim receives.
What would have caught:
- The acid-stress ulcer hypothesis benefited the pharmaceutical industry (billions in acid-suppression drugs) and the gastroenterology establishment (careers, textbooks, treatment protocols) — Chapter 1
- The body count metric benefited military commanders (career advancement) and politicians (evidence of progress) — Chapter 28
- VaR models benefited banks (regulatory compliance with lower capital requirements) and quants (career justification) — Chapter 12
Scoring:
- 🟢 Green: No concentrated beneficiary; the claim would be accepted or rejected on evidence alone
- 🟡 Yellow: Some beneficiaries exist but don't dominate the field
- 🔴 Red: Powerful actors with careers, institutions, or fortunes at stake
Question 5: How Old Is the Core Evidence?
Failure mode detected: Authority cascade (Chapter 2) + anchoring (Chapter 7)
The question: When was the foundational evidence for this claim produced? Has anyone re-evaluated it with modern methods?
Why it matters: Old evidence can be excellent — the laws of thermodynamics are old and solid. But old evidence can also be outdated, methodologically weak by modern standards, or sustained by citation amplification rather than ongoing evaluation. The longer a claim has been accepted, the less likely it is to be re-examined — because re-examination implies doubt, and doubt is professionally risky.
What would have caught:
- Ancel Keys's Seven Countries Study (1958-1970) formed the foundation of dietary fat guidelines for decades without being rigorously re-evaluated (Chapter 26)
- Bite mark analysis was admitted in courts based on precedents from decades earlier, without re-evaluation of the underlying science (Chapter 27)
- The Maginot Line doctrine was built on 1918 evidence without adequate reassessment of changed conditions (Chapter 28)
Scoring:
- 🟢 Green: Core evidence has been re-evaluated with modern methods and holds up
- 🟡 Yellow: Core evidence is old but unchallenged — no one has checked
- 🔴 Red: Core evidence is old, methodologically weak by modern standards, and has not been re-evaluated
Question 6: Is the Confidence Backed by Precision or Accuracy?
Failure mode detected: Precision without accuracy (Chapter 12)
The question: When experts express high confidence in this claim, is that confidence grounded in the accuracy of the underlying measurement, or in the precision of the numbers? Do the numbers look more exact than the underlying knowledge justifies?
What would have caught:
- Financial risk models (VaR) that provided precise daily loss estimates while being systematically wrong about tail risk — Chapter 12
- Body count metrics in Vietnam that were reported to exact numbers but were systematically inflated — Chapter 28
- Forensic testimony stating matches "to the exclusion of all others" when the technique had no established error rate — Chapter 27
Scoring:
- 🟢 Green: Confidence is grounded in validated measurements with known error rates
- 🟡 Yellow: Numbers are precise but the underlying measurement has uncertain accuracy
- 🔴 Red: Precise numbers masking unknown or unacknowledged error
Question 7: What Happens to People Who Disagree?
Failure mode detected: Consensus enforcement (Chapter 14) + outsider problem (Chapter 18)
The question: What happens to researchers, practitioners, or critics who publicly challenge this claim? Are they engaged with, marginalized, or punished?
Why it matters: In a healthy field, dissent is engaged with — challenged on its merits, tested against evidence, and either integrated or refuted. In a field protecting a wrong consensus, dissent is suppressed — through career penalties, funding denial, ridicule, or social exclusion. The treatment of dissenters is a signal about the quality of the consensus.
What would have caught:
- Marshall and Warren's marginalization for the H. pylori hypothesis — Chapter 1
- Neural network researchers' marginalization during the AI winter — Chapter 29
- De Gaulle's marginalization for advocating armored warfare — Chapter 28
- Shinseki's marginalization for Iraq troop estimates — Chapter 28
Scoring:
- 🟢 Green: Dissenters are engaged with on the merits; disagreement is professionally safe
- 🟡 Yellow: Dissent is tolerated but discouraged; mild career costs
- 🔴 Red: Dissenters are actively marginalized, ridiculed, or punished
Question 8: Does the Evidence Come from One Source or Many Independent Sources?
Failure mode detected: Authority cascade (Chapter 2) + citation amplification
The question: Does the evidence base consist of genuinely independent data points, or does it trace back to a single study, a single research group, or a single methodological tradition?
What would have caught:
- Bite mark analysis — all validation came from the forensic odontology community itself — Chapter 27
- The dietary fat hypothesis — much of the early evidence traced back to Keys's original work — Chapter 26
- Perceptrons' influence — a single book by two prestigious authors shaped an entire field's direction — Chapter 29
Scoring:
- 🟢 Green: Multiple independent lines of evidence from different research traditions
- 🟡 Yellow: Several studies, but from the same research group or methodological approach
- 🔴 Red: The evidence traces back to a single source, group, or tradition
Question 9: Is the Effect Size Meaningful?
Failure mode detected: Precision without accuracy (Chapter 12) + streetlight effect (Chapter 4)
The question: Setting aside statistical significance — is the actual magnitude of the effect large enough to matter? Could a real-world practitioner tell the difference?
What would have caught:
- Class size reduction — statistically significant but effect sizes dwarfed by teacher quality effects (Chapter 30)
- Many psychology findings that were statistically significant but too small to have practical importance (Chapter 25)
- Nutritional epidemiology findings with small relative risks that were treated as definitive (Chapter 26)
Scoring:
- 🟢 Green: Large, practically meaningful effect size
- 🟡 Yellow: Statistically significant but small effect; practical importance uncertain
- 🔴 Red: Small effect being treated as large, or effect size not reported
Question 10: Has Anyone Checked Whether This Works Outside the Lab?
Failure mode detected: Imported error (Chapter 8) + streetlight effect (Chapter 4)
The question: Has the finding been tested in real-world conditions with real-world populations, or only in controlled settings that may not generalize?
What would have caught:
- Educational interventions that work in small pilot programs but fail at scale (Chapter 30)
- Psychological findings from WEIRD (Western, Educated, Industrialized, Rich, Democratic) populations that may not generalize (Chapter 25)
- Economic models that perform well in-sample but fail out-of-sample (Chapter 24)
Scoring:
- 🟢 Green: Validated in multiple real-world contexts
- 🟡 Yellow: Lab results only, or real-world testing in limited contexts
- 🔴 Red: Never tested outside controlled conditions
Question 11: Is There a Simpler Explanation?
Failure mode detected: Plausible story problem (Chapter 6) + complexity hiding (Chapter 15)
The question: Is the accepted explanation the simplest one that fits the evidence, or is there a simpler alternative that has not been considered because the complex explanation arrived first?
What would have caught:
- The acid-stress ulcer hypothesis was more complex than the bacterial hypothesis — but it arrived first and became entrenched (Chapter 1)
- Epicycles in Ptolemaic astronomy — increasingly complex modifications to save a wrong framework (Chapter 3)
Scoring:
- 🟢 Green: The accepted explanation is parsimonious and has been tested against alternatives
- 🟡 Yellow: The explanation is complex, but no simpler alternative has been tested
- 🔴 Red: A simpler alternative exists but has been dismissed without adequate testing
Question 12: What Is the Prediction Track Record?
Failure mode detected: Unfalsifiability (Chapter 3) + revision myth (Chapter 20)
The question: Has this framework, model, or theory made specific predictions that were later verified? Or does it only explain events after the fact?
What would have caught:
- Macroeconomic models that failed to predict the 2008 crisis (Chapter 24)
- Autonomous vehicle timeline predictions that were systematically wrong (Chapter 29)
- Psychoanalytic theories that explain everything after the fact but predict nothing in advance (Chapter 3)
Scoring:
- 🟢 Green: Track record of specific, verified predictions
- 🟡 Yellow: Some predictions correct, others wrong, or predictions too vague to evaluate
- 🔴 Red: No verified predictions; framework is only used for post-hoc explanation
Question 13: How Does the Field Tell Its Own History?
Failure mode detected: Revision myth (Chapter 20)
The question: Does the field present its history as smooth, inevitable progress — or does it acknowledge the wrong turns, suppressed correct ideas, and costly corrections that are documented in the historical record?
Why it matters: A field that sanitizes its own history is a field that cannot learn from its mistakes — because it has erased the evidence that mistakes were made. The revision myth creates the illusion that the field's correction mechanisms work automatically, which reduces the urgency of building better ones.
What would have caught:
- The AI community's smooth narrative about the deep learning revolution, erasing three decades of neural network suppression (Chapter 29)
- Medicine's presentation of evidence-based practice as inevitable progress, erasing decades of resistance to clinical trials (Chapter 23)
- The military's narrative of "learning from the last war" that obscures the counterinsurgency amnesia cycle (Chapter 28)
Scoring:
- 🟢 Green: The field acknowledges its history honestly, including errors and resistance to correction
- 🟡 Yellow: History is simplified but not actively distorted
- 🔴 Red: History is sanitized; the field presents its current position as the inevitable outcome of rational progress
Question 14: Are Outsiders Saying Something Different?
Failure mode detected: Outsider problem (Chapter 18) + consensus enforcement (Chapter 14)
The question: Are people outside the field — adjacent researchers, practitioners, informed critics — reaching different conclusions from the insiders? If so, are the outsiders being engaged with or dismissed?
What would have caught:
- DNA scientists challenging forensic science practices that the forensic community defended (Chapter 27)
- Behavioral economists challenging rational expectations models that the macro community defended (Chapter 24)
- Neural network researchers in the margins while the AI mainstream pursued symbolic approaches (Chapter 29)
Scoring:
- 🟢 Green: Outsiders and insiders reach similar conclusions; no systematic gap
- 🟡 Yellow: Some outsider critique exists but is partially engaged with
- 🔴 Red: Outsiders are reaching different conclusions and being dismissed or ignored
Question 15: If This Turns Out to Be Wrong, How Would We Know?
Failure mode detected: Error visibility asymmetry (Chapter 27) + crisis threshold (Chapter 19)
The question: Is there a mechanism for detecting that this claim is wrong? Are errors visible, or could the claim be wrong for a long time before anyone notices?
Why it matters: This is the meta-question — it asks whether the field has the structural capacity to detect its own errors. A field where errors are invisible (criminal justice), where outcomes take decades to manifest (education), or where the metric doesn't track the reality (military body counts) can be wrong for a very long time without anyone knowing.
What would have caught:
- Criminal justice's inability to detect wrongful convictions before DNA evidence (Chapter 27)
- Education's inability to measure the difference between effective and ineffective instruction at scale (Chapter 30)
- The financial industry's inability to detect systemic risk that VaR models missed (Chapter 12)
Scoring:
- 🟢 Green: Clear mechanisms for error detection; errors become visible relatively quickly
- 🟡 Yellow: Some error detection capacity, but significant blind spots
- 🔴 Red: No systematic mechanism for detecting errors; the claim could be wrong indefinitely
The Red Flag Scorecard
After applying all 15 questions to a claim, field, or consensus, compile the results:
| Question | Score | Notes |
|---|---|---|
| 1. Who funded this? | 🟢🟡🔴 | |
| 2. Independently replicated? | 🟢🟡🔴 | |
| 3. What would disprove this? | 🟢🟡🔴 | |
| 4. Who benefits? | 🟢🟡🔴 | |
| 5. How old is the core evidence? | 🟢🟡🔴 | |
| 6. Precision or accuracy? | 🟢🟡🔴 | |
| 7. What happens to dissenters? | 🟢🟡🔴 | |
| 8. Independent sources? | 🟢🟡🔴 | |
| 9. Meaningful effect size? | 🟢🟡🔴 | |
| 10. Works outside the lab? | 🟢🟡🔴 | |
| 11. Simpler explanation? | 🟢🟡🔴 | |
| 12. Prediction track record? | 🟢🟡🔴 | |
| 13. How does the field tell its history? | 🟢🟡🔴 | |
| 14. Outsiders saying something different? | 🟢🟡🔴 | |
| 15. How would we know if it's wrong? | 🟢🟡🔴 | |
Interpreting the Scorecard
0-2 red flags: The claim is probably sound, or at least has not triggered structural warning signs. This does not guarantee correctness — some wrong consensuses score well on most questions — but the structural indicators are favorable.
3-5 red flags: Caution warranted. The claim may be correct, but there are structural features that increase the probability of error. Investigate the red-flagged dimensions specifically.
6-9 red flags: Significant concern. The structural conditions strongly favor error persistence. This doesn't prove the claim is wrong — a correct claim can exist in a structurally compromised field — but the conditions that sustain wrong consensuses are substantially present.
10+ red flags: The structural conditions are so unfavorable that the claim deserves deep skepticism. In the cases examined in this book, claims with this many red flags have more often been wrong than right.
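Because the interpretation bands are simply a mapping from red-flag counts to risk levels, the tallying step is easy to automate. The sketch below (Python, purely illustrative; the `Flag` enum and the `score_claim` and `interpret` helpers are hypothetical names, not part of the book's toolkit) takes one score per question and returns the red-flag count plus the corresponding band from the thresholds above.

```python
from enum import Enum

class Flag(Enum):
    GREEN = "🟢"
    YELLOW = "🟡"
    RED = "🔴"

# The 15 diagnostic questions, in the order presented in this chapter.
QUESTIONS = [
    "Who funded this?",
    "Independently replicated?",
    "What would disprove this?",
    "Who benefits?",
    "How old is the core evidence?",
    "Precision or accuracy?",
    "What happens to dissenters?",
    "Independent sources?",
    "Meaningful effect size?",
    "Works outside the lab?",
    "Simpler explanation?",
    "Prediction track record?",
    "How does the field tell its history?",
    "Outsiders saying something different?",
    "How would we know if it's wrong?",
]

def interpret(red_count: int) -> str:
    """Map a red-flag count onto the interpretation bands from this chapter."""
    if red_count <= 2:
        return "Probably sound: structural indicators are favorable"
    if red_count <= 5:
        return "Caution warranted: investigate the red-flagged dimensions"
    if red_count <= 9:
        return "Significant concern: conditions strongly favor error persistence"
    return "Deep skepticism warranted"

def score_claim(scores: list[Flag]) -> tuple[int, str]:
    """Tally red flags across all 15 questions and return (count, interpretation)."""
    if len(scores) != len(QUESTIONS):
        raise ValueError("Provide exactly one score per question")
    reds = sum(1 for s in scores if s is Flag.RED)
    return reds, interpret(reds)
```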
🔄 Check Your Understanding (try to answer without scrolling up)
- What is the purpose of the Red Flag Scorecard — is it designed to prove claims wrong or to assess structural risk?
- Name the red flag question that detects each of these failure modes: authority cascade, unfalsifiability, incentive structures.
Verify
1. It assesses structural risk, not truth value. No scorecard can prove a claim wrong — only evidence can do that. The scorecard identifies structural conditions that increase the probability of error, helping the reader decide where to focus their skepticism and investigation.
2. Authority cascade → Q8 (Does the evidence come from one source or many independent sources?) and Q5 (How old is the core evidence?). Unfalsifiability → Q3 (What would disprove this?). Incentive structures → Q1 (Who funded this?) and Q4 (Who benefits from this being true?).
Worked Example: Applying the Scorecard
The Dietary Fat Hypothesis (circa 1990)
| Question | Score | Assessment |
|---|---|---|
| 1. Funding | 🔴 | Food industry funding influenced research direction |
| 2. Replication | 🟡 | Some supporting studies, but contradictory evidence suppressed |
| 3. Falsification | 🟡 | In principle testable; in practice, contradictions were explained away |
| 4. Beneficiaries | 🔴 | Food industry, nutrition establishment, low-fat product manufacturers |
| 5. Evidence age | 🔴 | Core evidence from 1950s-1970s; not rigorously re-evaluated |
| 6. Precision vs. accuracy | 🟡 | Calorie counts and fat percentages precise; dietary causation uncertain |
| 7. Dissent treatment | 🔴 | Researchers who challenged were marginalized (Gary Taubes documents this) |
| 8. Source independence | 🔴 | Much evidence traced to Keys's original work and tradition |
| 9. Effect size | 🟡 | Epidemiological associations were modest |
| 10. Real-world validation | 🟡 | Population-level interventions (low-fat guidelines) did not produce expected health improvements |
| 11. Simpler explanation | 🔴 | Alternative hypotheses (sugar, processed food) existed but were suppressed |
| 12. Predictions | 🔴 | Predicted that low-fat diets would reduce heart disease; results were disappointing |
| 13. Field history | 🔴 | Nutrition science presented the fat hypothesis as settled science |
| 14. Outsiders | 🔴 | Endocrinologists, biochemists reached different conclusions |
| 15. Error detection | 🟡 | Long-term health outcomes are measurable but take decades |
Score: 9 red, 6 yellow, 0 green — 9 red flags.
This score — well into the "significant concern" range — would have warranted deep skepticism of the dietary fat consensus in 1990. The claim turned out to be substantially wrong.
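Fed into the illustrative sketch from earlier in the chapter (again, a hypothetical helper rather than an official tool), the same table reproduces this tally; the list below simply transcribes the scores in question order.

```python
# Dietary fat hypothesis, circa 1990, transcribed from the worked-example table.
R, Y, G = Flag.RED, Flag.YELLOW, Flag.GREEN
dietary_fat_1990 = [R, Y, Y, R, R, Y, R, R, Y, Y, R, R, R, R, Y]

reds, verdict = score_claim(dietary_fat_1990)
print(reds, verdict)
# -> 9 Significant concern: conditions strongly favor error persistence
```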
📐 Project Checkpoint
Epistemic Audit — Chapter 31 Addition: The Full Red Flag Assessment
31A. Core Claim Scorecard. Apply all 15 Red Flag questions to your field's most important consensus claim. Score each question and document your reasoning.
31B. Comparison Scorecard. Apply the same 15 questions to a claim that you believe is correct in your field. Compare the two scorecards. Where do they differ? What structural features separate a well-supported consensus from a potentially vulnerable one?
31C. Highest-Risk Identification. Based on your scorecards, which claim in your field has the highest red flag count? What would you need to investigate further to determine whether the red flags indicate actual error or merely structural vulnerability?
31.2 Chapter Summary
Key Concepts
- 15 diagnostic questions grounded in specific failure modes from Parts I-III, applicable to any claim in any field
- Red Flag Scorecard — a structured assessment tool that compiles results across all 15 questions
- Traffic-light scoring (green/yellow/red) for each question, enabling pattern recognition
- Structural risk assessment — the scorecard detects the conditions that sustain wrong consensuses, not the wrongness itself
Key Arguments
- No single red flag proves a claim is wrong, but clusters of red flags indicate structural conditions that strongly favor error persistence
- The 15 questions synthesize every failure mode in the book into a practical, usable diagnostic tool
- The scorecard works because the failure modes are structural and predictable — the same conditions that sustained past errors can be detected around current claims
- The worked example (dietary fat hypothesis) demonstrates that the scorecard, applied in 1990, would have correctly identified a claim that turned out to be substantially wrong
Spaced Review
This chapter is the synthesis of Parts I-III. Instead of reviewing specific prior concepts, complete this integration exercise.
Apply the Red Flag Scorecard to one of the following and share your results:
- Your own field's core consensus — the claim that most practitioners accept without question
- A current controversy — a claim that is actively debated, where some experts are confident and others are skeptical
- A historical case from this book — apply the scorecard to the H. pylori hypothesis circa 1985, or to neural networks circa 1975, to calibrate the tool against known outcomes
Calibration Note
The H. pylori hypothesis circa 1985 would score approximately: Q1 🟢 (no industry funding), Q2 🔴 (not independently replicated yet), Q3 🟢 (clearly falsifiable), Q4 🟢 (no powerful beneficiary of the bacterial hypothesis), Q5 🟢 (new evidence), Q6 🟢, Q7 🔴 (Marshall and Warren marginalized), Q8 🔴 (evidence from one lab), Q9 🟢 (large effect — antibiotics cure ulcers), Q10 🟡, Q11 🟢 (simpler explanation than acid-stress), Q12 🟡, Q13 🟢, Q14 🔴 (outsiders vs. insiders), Q15 🟢. Score: 4 red flags on the *challenger's* claim — consistent with a correct idea facing institutional resistance. The *defender's* claim (acid-stress) would score much higher on red flags.
What's Next
In Chapter 32: The Epistemic Health Checklist, we move from evaluating individual claims to evaluating entire fields and organizations. The Red Flag Scorecard asks "Is this claim trustworthy?" The Epistemic Health Checklist asks "Is this field capable of producing trustworthy claims?" — a deeper structural question about institutional capacity for self-correction.
Before moving on, complete the exercises and quiz to solidify your understanding.