Chapter 29 Further Reading: Hallucinations, Errors, and How to Catch Them
Foundational Research
1. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1-38. The most comprehensive academic survey of hallucination as a research problem. Covers definitions, taxonomies, evaluation methods, and mitigation approaches. Dense but essential reading for anyone who wants a technical grounding in why hallucinations happen and how they are measured. Accessible via ACM Digital Library.
2. Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). On Faithfulness and Factuality in Abstractive Summarization. Proceedings of ACL 2020. An early and influential study showing that abstractive summarization models frequently introduce factual errors even when summarizing provided content. Directly relevant to the context collapse and distortion failure modes covered in Chapter 29. Available via ACL Anthology (free).
3. Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023). When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. Proceedings of ACL 2023. A detailed empirical analysis of when LLMs' internal (parametric) knowledge can and cannot be trusted on knowledge-intensive tasks. Shows that factual accuracy drops sharply for less popular, long-tail facts, and examines when retrieval augmentation helps. Available via ACL Anthology (free).
The Citation and Legal Hallucination Problem
4. U.S. District Court for the Southern District of New York. (2023). Mata v. Avianca, Inc., No. 22-cv-1461. Opinion and Order on Sanctions. The court order in the case where attorneys submitted ChatGPT-fabricated case citations. The document itself is public record and worth reading directly. It lays out exactly what happened, how it was discovered, and the court's response. A primary source for the most prominent documented hallucination harm case.
5. Weiser, B. (2023, May 27). Here's What Happens When Your Lawyer Uses ChatGPT. The New York Times. The Times reporting on the Mata v. Avianca case and the broader pattern of legal AI citation errors. Accessible and informative for non-technical readers who want the human story behind the court documents.
6. Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis. A systematic study of legal hallucination rates across multiple LLMs. Directly relevant to the statistics cited in Chapter 29 Section 4. One of the most rigorous domain-specific hallucination studies available.
Medical and Healthcare Hallucinations
7. Sallam, M. (2023). ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare, 11(6), 887. A peer-reviewed systematic review examining both the potential and the risks of LLM use in healthcare. Includes discussion of accuracy failures in clinical information generation. Open access through MDPI.
8. Omiye, J.A., Lester, J.C., Spichak, S., Rotemberg, V., & Daneshjou, R. (2023). Large language models propagate race-based medicine. npj Digital Medicine, 6, 195. Examines a specific and critical healthcare hallucination failure: AI models propagating scientifically debunked race-based clinical guidelines. An important case study in how hallucinations can embed and amplify existing errors rather than generating entirely novel ones.
Understanding the Mechanism
9. Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of FAccT 2021. The influential "Stochastic Parrots" paper. Though it addresses broader issues than hallucination specifically, its core argument — that LLMs manipulate linguistic form without having semantic grounding — is the foundational theoretical frame for understanding why hallucinations happen. Essential background reading.
10. Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv preprint. A more recent and comprehensive survey than Ji et al. (2023), with updated taxonomy and coverage of newer mitigation techniques including retrieval-augmented generation. Useful for understanding where the research is going, not just where it has been.
Practical Verification Resources
11. CrossRef (crossref.org) The registration agency for DOIs. The most authoritative resource for verifying whether a cited DOI resolves to the claimed paper. Essential first stop in any citation verification workflow. Free to use.
12. PubMed (pubmed.ncbi.nlm.nih.gov) National Library of Medicine's database covering biomedical and life sciences literature. Essential for verifying citations in healthcare, clinical, and biological research domains. Free to use. If a medical paper claimed in AI output doesn't appear in PubMed, treat it as suspect.
13. Google Scholar (scholar.google.com) Broad academic search covering multiple fields. Useful for title searches (searching by exact quoted title is most diagnostic), author profile verification, and citation tracking. Free to use. Does not guarantee completeness, but absence from Scholar is a strong signal of fabrication for papers claimed to be from major journals.
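The first step of the workflow these three resources support can be partly automated. The sketch below is a minimal illustration, not a production tool: it uses CrossRef's public REST API, where a GET request to api.crossref.org/works/{doi} returns JSON metadata for a registered DOI (and a 404 for an unregistered one). The helper names (`crossref_url`, `title_matches`, `verify_doi`) are hypothetical, chosen for this example.

```python
import json
import re
from urllib.request import urlopen


def crossref_url(doi: str) -> str:
    """Build the CrossRef REST API lookup URL for a DOI."""
    return f"https://api.crossref.org/works/{doi}"


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so minor
    formatting differences don't cause false mismatches."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_matches(metadata: dict, claimed_title: str) -> bool:
    """Check whether CrossRef metadata carries the claimed title.

    `metadata` is the parsed JSON body returned by the CrossRef API;
    the registered title lives at message -> title (a list of strings).
    """
    titles = metadata.get("message", {}).get("title", [])
    return any(normalize(t) == normalize(claimed_title) for t in titles)


def verify_doi(doi: str, claimed_title: str) -> bool:
    """Fetch DOI metadata from CrossRef and compare it against the
    claimed title. An HTTPError (404) means the DOI is unregistered,
    which is itself a strong fabrication signal."""
    with urlopen(crossref_url(doi)) as resp:
        metadata = json.load(resp)
    return title_matches(metadata, claimed_title)
```

A DOI that does not resolve, or that resolves to a paper with a different title, is exactly the suspect case items 11-13 describe. Automation only narrows the search: a flagged citation still needs human follow-up in Google Scholar or PubMed before it is declared fabricated.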
Hallucination Detection Tools and Research
14. Min, S., Krishna, K., Lyu, X., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Proceedings of EMNLP 2023. Code available via GitHub (shmsw25/FActScore). A research tool and framework for evaluating factual accuracy in AI-generated long-form text. While primarily a research tool rather than a practitioner product, understanding the methodology clarifies what "hallucination rate" measurements actually measure and how to interpret published accuracy statistics.
15. Guo, Z., Schlichtkrull, M., & Vlachos, A. (2022). A Survey on Automated Fact-Checking. Transactions of the Association for Computational Linguistics, 10, 178-206. A survey of the automated fact-checking literature. Useful for understanding both what AI can and cannot do when asked to verify its own output, and what external tools exist for factual verification. Available through ACL Anthology (free).
For Further Exploration
- The ACL Anthology (aclanthology.org) hosts free access to most computational linguistics and NLP research, including hallucination studies.
- arXiv (arxiv.org) hosts pre-prints of recent AI research, including many hallucination and reliability studies before formal publication.
- The AI Incident Database (incidentdatabase.ai) catalogs documented cases of AI systems causing harm, including hallucination-related incidents, in a searchable format.