Chapter 13 Further Reading: Diagnosing and Fixing Bad Outputs

Research Papers on Hallucination

"Survey of Hallucination in Natural Language Generation" Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023) ACM Computing Surveys

The most comprehensive academic survey on AI hallucination, covering types, causes, detection methods, and mitigation strategies. Categorizes hallucination into intrinsic (contradicting the source) and extrinsic (generating information not in the source or unverifiable from it) types. Essential background reading for anyone who wants to understand hallucination at a deeper level than this chapter provides.

Available: arxiv.org/abs/2202.03629


"TruthfulQA: Measuring How Models Mimic Human Falsehoods" Lin, S., Hilton, J., & Evans, O. (2022) Association for Computational Linguistics

A benchmark study measuring AI model truthfulness on questions where humans commonly hold false beliefs. Demonstrates that larger models can be less truthful than smaller models in specific domains because they more effectively reproduce common but incorrect beliefs from training data. Relevant to understanding training data bias (Root Cause 6) as a hallucination driver.

Available: arxiv.org/abs/2109.07958


"Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models" Various authors, Anthropic (2023)

Research on AI sycophancy — the tendency of models to agree with or validate user-stated positions rather than providing accurate information. Directly relevant to understanding why self-critique prompting sometimes produces validating rather than genuinely critical responses, and why models may avoid correcting incorrect information provided in a prompt.

Available: anthropic.com/research


"Hallucination is Inevitable: An Innate Limitation of Large Language Models" Xu, Z., Jain, S., & Kankanhalli, M. (2024)

A theoretical analysis arguing that hallucination is a fundamental property of language models, not a fixable bug. The paper's argument — that statistical language modeling cannot reliably distinguish memorized truth from plausible confabulation — informs the chapter's approach of treating hallucination as a known risk to be managed (with verification protocols) rather than eliminated.

Available: arxiv.org/abs/2401.11817


Research on AI Code Generation Reliability

"Evaluating Large Language Models Trained on Code" Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., ... & Zaremba, W. (2021) OpenAI (Codex paper)

The original paper introducing Codex and evaluating code generation capabilities. Includes discussion of reliability metrics and failure modes — relevant context for understanding how AI code generation works and where its reliability boundaries are.

Available: arxiv.org/abs/2107.03374


"Is Your Code Generated by ChatGPT Really Correct?" Various authors, multiple universities (2023)

A study evaluating the correctness of AI-generated code on practical programming tasks. Finds that while code generation is often syntactically correct, logical errors, edge case failures, and incorrect API usage are common. Supports the verification practices Raj's case study established.

Available through Google Scholar; multiple papers on this topic emerged in 2023-2024.


Books on Professional Judgment and Verification

"Noise: A Flaw in Human Judgment" Daniel Kahneman, Olivier Sibony, Cass R. Sunstein (Little, Brown Spark, 2021)

A book about variability in professional judgment — how the same information assessed by different people (or by the same person on different days) produces different conclusions. The book's framework for noise in human judgment maps directly onto variance in AI output: both arise from probabilistic processes that produce different outputs from the same inputs. Its concept of "decision hygiene" — systematic procedures for reducing noise and error in judgment — is directly analogous to the diagnostic framework this chapter establishes for AI output.


"The Intelligence Trap" David Robson (W. W. Norton & Company, 2019)

A study of how intelligent people make systematic errors — overconfidence in their own reasoning, resistance to correction, and vulnerability to specific types of motivated thinking. The chapter's observation that authoritative-sounding AI output is more dangerous than obviously wrong output parallels Robson's finding that intelligent people's errors are more dangerous than obvious ones because they are harder to catch. The lesson for AI users: fluency and confidence in AI output are not signals of accuracy.


"The Checklist Manifesto" Atul Gawande (Metropolitan Books, 2009)

As mentioned in Chapter 11's further reading, Gawande's study of how checklists reduce professional errors in high-stakes contexts applies directly to the verification protocols this chapter recommends. The chapter's "does this method exist?" verification step and Elena's factual audit protocol are both checklist-style approaches to systematic error reduction. Reading this book provides the intellectual foundation for why systematic verification habits work.


Practical Resources for Verification

FactCheck.org and Snopes.com

For checking specific factual claims in AI-generated content about current events, public figures, or commonly circulated claims. Useful as a first-pass verification resource for AI outputs that make claims about public facts.


Semantic Scholar https://www.semanticscholar.org

For verifying academic citations produced by AI. Semantic Scholar indexes millions of academic papers and allows you to search by title, author, or DOI. When AI generates a paper citation, search Semantic Scholar to verify the paper exists, the authors are correct, and the stated findings match the actual paper.
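This kind of check can also be scripted. The sketch below queries Semantic Scholar's public Graph API to look up a citation title; the endpoint path and field names (`paper/search`, `title`, `authors`, `year`) reflect that API's public documentation as of this writing, so confirm them against the current docs before relying on this.

```python
import json
import urllib.parse
import urllib.request

# Public Semantic Scholar Graph API search endpoint (verify against current docs).
SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(title: str, fields: str = "title,authors,year") -> str:
    """Build a paper-search URL for a citation title an AI produced."""
    params = urllib.parse.urlencode({"query": title, "fields": fields, "limit": 3})
    return f"{SEARCH_ENDPOINT}?{params}"


def verify_citation(title: str) -> list[dict]:
    """Fetch candidate matches to compare against the AI's claimed citation."""
    with urllib.request.urlopen(build_search_url(title), timeout=10) as resp:
        return json.load(resp).get("data", [])


if __name__ == "__main__":
    # Compare each result's title, authors, and year with what the AI claimed.
    for paper in verify_citation("Survey of Hallucination in Natural Language Generation"):
        authors = ", ".join(a["name"] for a in paper.get("authors", []))
        print(paper.get("year"), "|", paper.get("title"), "|", authors)
```

If the search returns no close title match, or the authors and year differ from the AI's citation, treat the citation as unverified until checked by hand.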


Google Scholar https://scholar.google.com

Similar to Semantic Scholar for academic citation verification. Also useful for checking whether a claim that AI attributes to research is actually supported by research in the field.


Wayback Machine (web.archive.org)

For verifying whether AI-generated claims about companies, products, or websites reflect a historical state vs. current state. If the AI's training data cutoff means it's working from outdated information, the Wayback Machine can help you see what the source looked like at a specific time.


Community Resources

AI Snake Oil Newsletter, by Arvind Narayanan and Sayash Kapoor (aisnakeoil.substack.com)

A rigorous, skeptical newsletter on AI capabilities and limitations from Princeton researchers. Regularly publishes research on hallucination rates, reliability issues, and cases where AI capability has been overstated. Essential reading for calibrated trust — neither dismissing AI capabilities nor over-trusting them.


"Papers with Code" — Benchmarking Section https://paperswithcode.com/sota

Tracks state-of-the-art AI benchmark results, including hallucination and factual accuracy benchmarks. Useful for understanding where current models perform well and where they remain unreliable. The TruthfulQA and HaluEval benchmarks are particularly relevant to this chapter.


A Note on the Evolving Reliability Landscape

AI model reliability is improving with each generation. Some hallucination rates that were problematic in 2022-2023 models have improved significantly in more recent versions. Techniques like retrieval-augmented generation (RAG), where models access verified knowledge bases rather than relying solely on training data, address some of the limitations this chapter discusses.

However, the fundamental challenge described in the "Hallucination is Inevitable" paper — that statistical language modeling cannot reliably distinguish memorized fact from plausible confabulation — remains. Even as rates improve, the skill of verification and the diagnostic framework for identifying failure types remain valuable, because:

  1. The failure modes are predictable even as their frequency decreases
  2. High-stakes outputs (legal, financial, medical, client-facing) require verification regardless of base error rate
  3. Internal library and custom API hallucination (the Raj case) is a structural problem that will persist regardless of general model improvements

The verification habits and diagnostic skills in this chapter are not temporary workarounds for early AI limitations — they are permanent features of responsible professional AI use.