Further Reading — Chapter 15

Calibration: Why You Think You Know It When You Don't (and How to Fix It)

This annotated bibliography provides resources for deeper exploration of the concepts introduced in Chapter 15. Sources are organized by tier following this textbook's citation honesty system.


Tier 1 — Verified Sources

These are well-known, widely available works that the authors are confident exist with the details provided.

Books

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

The definitive popular treatment of cognitive biases, including overconfidence. Kahneman's chapters on "The Illusion of Validity" and "Intuitions vs. Formulas" are directly relevant to this chapter's discussion of why confidence feels real even when it's wrong. Kahneman's framework of System 1 (fast, automatic, confidence-generating) and System 2 (slow, deliberate, accuracy-checking) maps neatly onto the distinction between raw metacognitive feelings and deliberate calibration techniques. Required reading for anyone who wants to understand the broader cognitive science context of calibration research.

Tetlock, P. E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.

Tetlock's landmark study tracked thousands of predictions made by political experts over two decades and found that their calibration was remarkably poor — in many cases, no better than chance. Particularly relevant to this chapter's argument that experience and expertise don't automatically fix calibration. Tetlock's work also identified the characteristics of better-calibrated forecasters: intellectual humility, willingness to update, and systematic tracking of prediction accuracy. The concepts of "foxes" (broad thinkers, better calibrated) versus "hedgehogs" (deep specialists, worse calibrated) offer a powerful framework for understanding individual differences in calibration.

Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers.

The follow-up to Expert Political Judgment, focusing on the Good Judgment Project — a large-scale study that identified and trained "superforecasters" who achieved remarkable calibration accuracy on geopolitical predictions. Directly relevant to the calibration training techniques in this chapter: the superforecasters used structured prediction, explicit probability estimates, systematic comparison of predictions to outcomes, and progressive updating. Accessible and engaging. If you want evidence that calibration training works at scale, this is the book.

Dunlosky, J., & Metcalfe, J. (2009). Metacognition. SAGE Publications.

Referenced in Chapter 13's further reading as well. The chapters on calibration, monitoring accuracy, and the overconfidence effect provide the academic foundation for this chapter's content. More technical than the treatment here, but comprehensive and precise. The discussion of the Brier score, calibration measurement methodologies, and the relationship between calibration and resolution is particularly valuable for readers who want to understand the quantitative framework behind the concepts.
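
For readers who want to see that quantitative framework concretely, here is a minimal sketch in Python of the Brier score and its standard decomposition into reliability (calibration), resolution, and uncertainty. The data and function names are our own illustration, not Dunlosky and Metcalfe's notation.

    # Brier score and its decomposition into reliability (calibration),
    # resolution, and uncertainty. Inputs: one confidence in [0, 1] per
    # answer and one 0/1 outcome per answer. Data below are invented.
    from collections import defaultdict

    def brier_score(confidences, outcomes):
        # Mean squared gap between stated confidence and outcome; lower is better.
        return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

    def murphy_decomposition(confidences, outcomes):
        # Brier = reliability - resolution + uncertainty, grouping judgments
        # by the confidence level the judge actually stated.
        n = len(outcomes)
        base_rate = sum(outcomes) / n
        groups = defaultdict(list)
        for c, o in zip(confidences, outcomes):
            groups[c].append(o)
        reliability = sum(len(v) * (c - sum(v) / len(v)) ** 2
                          for c, v in groups.items()) / n
        resolution = sum(len(v) * (sum(v) / len(v) - base_rate) ** 2
                         for v in groups.values()) / n
        uncertainty = base_rate * (1 - base_rate)
        return reliability, resolution, uncertainty

    # Ten practice questions: stated confidence vs. whether the answer was right.
    conf = [0.9, 0.9, 0.9, 0.7, 0.7, 0.7, 0.7, 0.5, 0.5, 0.5]
    hits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    rel, res, unc = murphy_decomposition(conf, hits)
    print(f"Brier score: {brier_score(conf, hits):.3f}")              # 0.274
    print(f"reliability {rel:.3f}  resolution {res:.3f}  uncertainty {unc:.3f}")
    # Check: 0.041 - 0.017 + 0.250 = 0.274

In this decomposition, lower reliability means stated confidence tracks accuracy more closely (better calibration), while higher resolution means the confidence levels genuinely separate correct from incorrect answers.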

Brown, P. C., Roediger, H. L., III, & McDaniel, M. A. (2014). Make It Stick: The Science of Successful Learning. The Belknap Press of Harvard University Press.

Chapters 5 ("Avoid Illusions of Knowing") and 6 ("Get Beyond Learning Styles") address overconfidence and calibration in the context of learning. The book's practical focus complements this chapter's more systematic treatment. Particularly useful for readers who want narrative examples of how miscalibrated confidence leads to study failures.

Research Articles

Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). "Calibration of probabilities: The state of the art to 1980." In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press.

The foundational review of calibration research. Lichtenstein, Fischhoff, and Phillips synthesized decades of studies on how people's confidence in their answers compares to their actual accuracy. Their findings established the overconfidence effect as a robust, replicable phenomenon and introduced the calibration curve methodology used in this chapter. Though published in 1982, the core findings have been replicated extensively and remain the standard reference in the field. Dense but essential for serious students of calibration.
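
The calibration-curve methodology is simple enough to sketch in a few lines of Python: bin answers by stated confidence, then compare each bin's mean confidence with its actual hit rate. The binning choices and data below are our own illustration, not the paper's procedure.

    # Calibration curve for two-alternative questions (confidence 0.5-1.0):
    # points on the diagonal indicate good calibration; accuracy below
    # confidence indicates overconfidence. Data below are invented.
    def calibration_curve(confidences, outcomes,
                          edges=(0.5, 0.6, 0.7, 0.8, 0.9, 1.01)):
        # Return (mean confidence, accuracy, count) for each confidence bin.
        rows = []
        for lo, hi in zip(edges, edges[1:]):
            group = [(c, o) for c, o in zip(confidences, outcomes) if lo <= c < hi]
            if group:
                mean_conf = sum(c for c, _ in group) / len(group)
                accuracy = sum(o for _, o in group) / len(group)
                rows.append((mean_conf, accuracy, len(group)))
        return rows

    # The classic overconfident pattern: accuracy stays flat at 60% while
    # stated confidence climbs from 60% to 90%.
    conf = [0.6] * 5 + [0.7] * 5 + [0.8] * 5 + [0.9] * 5
    hits = [1, 1, 1, 0, 0] * 4
    for mean_conf, accuracy, n in calibration_curve(conf, hits):
        print(f"stated {mean_conf:.2f} -> correct {accuracy:.2f} (n={n})")

Plotted, these points fall increasingly below the diagonal, the characteristic overconfident curve the review describes.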

Kruger, J., & Dunning, D. (1999). "Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments." Journal of Personality and Social Psychology, 77(6), 1121-1134.

The original paper describing the unskilled-and-unaware effect (commonly known as the Dunning-Kruger effect). Across four studies on logic, grammar, and humor, Kruger and Dunning demonstrated that bottom-quartile performers dramatically overestimated their performance, while top-quartile performers slightly underestimated theirs. Crucially, the paper also showed that training the least-skilled participants improved both their performance and their ability to recognize their previous incompetence — supporting this chapter's argument that the double bind is addressable through training.

Fischhoff, B. (1975). "Hindsight ≠ foresight: The effect of outcome knowledge on judgment under uncertainty." Journal of Experimental Psychology: Human Perception and Performance, 1(3), 288-299.

The foundational study on hindsight bias. Fischhoff demonstrated that once people learn the outcome of an event, they overestimate the degree to which they would have predicted it. This paper is essential for understanding why overconfidence persists despite experience: hindsight bias retroactively revises your memory of your predictions, making it impossible to learn from your past calibration errors unless you keep written records.


Tier 2 — Attributed Sources

These are findings and claims attributed to specific researchers or research traditions. The general claims are well-established in the literature, but specific publication details beyond what is provided have not been independently verified for this bibliography.

Research by Sarah Lichtenstein and Baruch Fischhoff on calibration of confidence judgments.

Lichtenstein and Fischhoff's extensive program of research through the 1970s and 1980s established the core empirical findings of calibration research: the overconfidence effect, the hard-easy effect, and the characteristic shape of the calibration curve. Their work also explored whether training could improve calibration — an early version of the calibration training approach described in this chapter. Their finding that providing people with calibration feedback improved their subsequent confidence judgments is one of the foundations for the predict-test-compare technique.
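
As a rough illustration of that feedback idea, the Python sketch below logs predict-test-compare rounds and reports the running gap between stated confidence and actual accuracy; the field and function names are our own, not Lichtenstein and Fischhoff's.

    # A predict-test-compare log: record the confidence stated before each
    # test alongside the outcome, then summarize the confidence-accuracy gap.
    # A positive gap signals overconfidence; near zero signals good calibration.
    def calibration_gap(log):
        mean_conf = sum(r["confidence"] for r in log) / len(log)
        accuracy = sum(r["correct"] for r in log) / len(log)
        return mean_conf - accuracy

    log = [
        {"task": "quiz 1", "confidence": 0.85, "correct": 1},
        {"task": "quiz 2", "confidence": 0.90, "correct": 0},
        {"task": "quiz 3", "confidence": 0.80, "correct": 1},
        {"task": "quiz 4", "confidence": 0.90, "correct": 0},
    ]
    print(f"confidence-accuracy gap: {calibration_gap(log):+.2f}")  # +0.36

Reviewing that single number after each round is one simple form of the calibration feedback their studies found beneficial.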

Research by Asher Koriat on the cue-utilization framework for metacognitive judgments.

Koriat's work, also referenced in Chapter 13's further reading, provides the theoretical mechanism for understanding why people are overconfident. His cue-utilization framework shows that metacognitive judgments (including confidence) are based on heuristic cues — fluency, accessibility, familiarity — rather than direct access to memory strength. This explains why confidence can be high even when knowledge is low: the cues that generate confidence are correlated with knowledge but not identical to it. The framework also explains why calibration training works: training supplies alternative, more diagnostic cues to replace the biased defaults.

Research on expert calibration in medicine, law, and finance.

A substantial body of research has examined calibration accuracy across professional domains. Findings are mixed: weather forecasters tend to be well-calibrated (because they receive rapid, unambiguous, frequent feedback on their predictions), while doctors, lawyers, and financial analysts tend to be poorly calibrated (because their feedback is delayed, ambiguous, and selective). This domain-specificity is relevant to this chapter's argument that calibration depends on feedback structure, not just experience level.

Research by David Dunning and colleagues on metacognitive deficits in low performers.

Building on the original 1999 Kruger and Dunning paper, Dunning and his collaborators have explored the mechanisms behind the unskilled-and-unaware effect in multiple domains. Key findings include: (a) training low performers improves both their performance and their metacognitive accuracy; (b) the effect is driven by a genuine deficit in metacognitive skills, not by motivational bias; (c) the effect is reduced when people are given external comparison information (e.g., seeing how others performed). These findings support this chapter's emphasis on external feedback as the primary tool for calibration improvement.

Research on the planning fallacy and overconfidence in time estimates.

Originally described by Daniel Kahneman and Amos Tversky, the planning fallacy is a specific form of overconfidence: people consistently underestimate how long tasks will take and how much they will cost. The planning fallacy illustrates that overconfidence extends beyond knowledge assessments to any domain involving prediction under uncertainty. For students, the planning fallacy manifests as underestimating study time requirements — a direct consequence of the calibration errors described in this chapter.


Tier 3 — Illustrative Sources

These are constructed examples, composite cases, or pedagogical resources created for this textbook.

Mia Chen — composite character. Continued from Chapters 1, 7, 8, and 13. In this chapter, Mia illustrates the three-layer progression of metacognitive development: from strategy improvement (Ch 1-8) to monitoring improvement (Ch 13) to calibration improvement (Ch 15). Her arc — predicting B+ and getting C-, then predicting C and getting B+ — dramatizes the gap between felt confidence and actual accuracy, and her calibration audit shows how structured prediction data can recalibrate internal signals.

Diane and Kenji Park — composite characters. Continued from Chapters 5 and 13, among earlier appearances. In this chapter, the tracking sheet on the refrigerator serves as a simple calibration training tool, and Diane's discovery that her own teaching predictions are miscalibrated illustrates that overconfidence affects teachers and helpers as well as learners.

Dr. Amara Hassan — composite character. Introduced in this chapter's Case Study 2. Dr. Hassan is a composite based on patterns documented in research on diagnostic confidence and calibration training programs in medical education. Her story illustrates that expert overconfidence persists in domains with poor feedback structures and responds to structured calibration training.


If you want to go deeper on Chapter 15's topics before moving to Chapter 16, here's a prioritized reading path:

  1. Highest priority: Read the relevant chapters of Thinking, Fast and Slow (Kahneman, 2011). Chapters 19-24 cover overconfidence, expert judgment, and the illusion of validity. Kahneman writes with exceptional clarity, and these chapters will deepen your understanding of why calibration is so hard to achieve.

  2. If you want the superforecasting angle: Read Superforecasting (Tetlock & Gardner, 2015). It provides the most compelling evidence that calibration training works, along with specific techniques that extend beyond what this chapter covers. Particularly relevant for students interested in decision-making, policy, or any field that involves prediction.

  3. If you want the original research on the unskilled-and-unaware effect: Read Kruger & Dunning (1999). It's a well-written, accessible paper (14 pages) that reports four clean studies with clear implications. Understanding the original research will help you distinguish the actual findings from popular oversimplifications of the "Dunning-Kruger effect."

  4. If you want the academic treatment of calibration measurement: Read the calibration chapters in Dunlosky & Metcalfe (2009), Metacognition. They cover the Brier score, alternative calibration measures, and the methodological challenges of studying calibration in controlled settings.

  5. If you're interested in hindsight bias specifically: Read Fischhoff (1975). It's a foundational 12-page paper that demonstrates the effect with elegant experimental design. Understanding hindsight bias at a deep level will help you appreciate why written predictions are so critical for calibration training.


End of Further Reading for Chapter 15.