Chapter 14: Further Reading and Annotated Bibliography
This list covers the foundational literature on explainable AI, the key critique papers, and important regulatory and policy sources. Entries are organized thematically. All sources are real publications; annotations describe both the paper's content and its significance for business and governance practitioners.
Foundational XAI Methods
1. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135-1144.
The paper that introduced LIME. Ribeiro and colleagues distinguished two trust problems, trusting an individual prediction and trusting the model as a whole, and proposed LIME as a model-agnostic solution that approximates the model near a single instance with a weighted linear surrogate. The paper includes evaluations across text, image, and tabular domains and introduces the concept of "locally faithful" explanations. The title's question — "Why should I trust you?" — identifies the governance motivation that has made this one of the most cited papers in machine learning. Essential reading for anyone designing XAI-based compliance workflows. The paper is freely available on arXiv.
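The local-surrogate idea at LIME's core can be sketched in a few lines. This is a simplified illustration, not the lime library's actual implementation: the black-box function, kernel width, and sample count are all invented for the example.

```python
import numpy as np

# A stand-in black box to explain (hypothetical): nonlinear in both features.
def black_box(X):
    return 1.0 / (1.0 + np.exp(-(X[:, 0] ** 2 - X[:, 1])))

rng = np.random.default_rng(0)
x0 = np.array([1.0, 0.5])                       # instance to explain

# 1. Sample perturbations around x0.
Z = x0 + rng.normal(scale=0.3, size=(500, 2))
# 2. Weight each sample by proximity to x0 (exponential kernel).
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.25)
# 3. Fit a weighted linear surrogate (intercept plus two coefficients)
#    to the black box's outputs via weighted least squares.
A = np.column_stack([np.ones(len(Z)), Z]) * np.sqrt(w)[:, None]
b = black_box(Z) * np.sqrt(w)
coef = np.linalg.lstsq(A, b, rcond=None)[0][1:]  # drop the intercept

print(coef)  # local importances: positive for feature 1, negative for feature 2
```

The surrogate's coefficients recover the local behavior of the black box near x0 (here, the signs of its partial derivatives), which is all that "locally faithful" promises; they say nothing about the model's behavior elsewhere.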
2. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems (NeurIPS), 30, 4765-4774.
The paper introducing SHAP. Lundberg and Lee showed that several existing explanation methods (LIME, DeepLIFT, Layer-Wise Relevance Propagation) could be unified under the Shapley value framework, and that SHAP values uniquely satisfy three desirable properties (local accuracy, missingness, and consistency) that other additive attribution methods lack. The paper also introduced Kernel SHAP, a model-agnostic approximation algorithm; the exact tree-based algorithm came in the follow-on work below. The theoretical grounding of SHAP in Shapley (1953) is made accessible here. This paper is the appropriate technical reference for data scientists implementing SHAP in production systems; the appendix provides the formal proofs of the key properties.
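The Shapley machinery behind SHAP can be made concrete on a toy model. The sketch below computes exact Shapley values by enumerating all coalitions (feasible only for a handful of features; the model, instance, and baseline are invented for illustration) and verifies the local accuracy property: the attributions sum to the gap between the model's output and its baseline output.

```python
from itertools import combinations
from math import factorial

# Toy model with a feature interaction (hypothetical; any f would do).
def f(x1, x2, x3):
    return 2 * x1 + x2 * x3

x = (1.0, 2.0, 3.0)          # instance to explain: f(x) = 8.0
baseline = (0.0, 0.0, 0.0)   # reference input for "absent" features

def value(S):
    """Model output with features in S taken from x, the rest from baseline."""
    z = [x[i] if i in S else baseline[i] for i in range(3)]
    return f(*z)

n = 3
phi = [0.0] * n
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for S in combinations(others, k):
            # Shapley weight for a coalition of size |S|.
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi[i] += weight * (value(set(S) | {i}) - value(set(S)))

print(phi)                               # [2.0, 3.0, 3.0]
print(sum(phi), f(*x) - f(*baseline))    # local accuracy: both equal 8.0
```

Note how the interaction term x2*x3 = 6 is split evenly between the two interacting features (3 each), a direct consequence of the symmetry built into the Shapley weighting. Exact enumeration costs 2^n model evaluations per feature, which is why Kernel SHAP approximates and TreeSHAP exploits tree structure.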
3. Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., ... & Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56-67.
The follow-on paper introducing TreeSHAP (the exact, polynomial-time algorithm for tree-based models) and several SHAP visualization methods including the beeswarm plot, waterfall plot, and dependence plot. This paper is the practical reference for SHAP deployment with XGBoost, Random Forest, LightGBM, and other tree ensembles. The healthcare application examples (ICU prediction, cancer risk) make it particularly relevant for readers interested in clinical AI governance.
4. Shapley, L. S. (1953). A value for n-person games. In H. Kuhn & A. Tucker (Eds.), Contributions to the Theory of Games, Volume II (pp. 307-317). Princeton University Press.
The original paper introducing Shapley values in the context of cooperative game theory. Lloyd Shapley shared the 2012 Nobel Memorial Prize in Economic Sciences (with Alvin Roth) for work on stable allocations and market design. While this is a mathematical paper that predates machine learning by decades, its core idea — the fair allocation of collaborative contributions — translates directly into the XAI context. Practitioners interested in understanding why SHAP values have their mathematical properties should read the first few pages of this paper; the axiomatic derivation of the Shapley value is elegant and accessible with college-level mathematics.
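For reference, the quantity Shapley derives axiomatically, the value assigned to player $i$ in a game with player set $N$ and coalition value function $v$, is the weighted average of $i$'s marginal contribution over all coalitions:

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}
  \Bigl( v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr)
```

In the SHAP translation, the players are features, the coalition value $v(S)$ is the model's expected output when only the features in $S$ are known, and $\phi_i$ is feature $i$'s attribution.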
5. Wachter, S., Mittelstadt, B., & Russell, C. (2017). Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2), 841-887.
The paper that formalized counterfactual explanations in the AI context and connected them to GDPR Article 22's provisions on automated decision-making. Wachter and colleagues argue that a "right to explanation" under GDPR is better interpreted as a "right to a counterfactual explanation" — information about what would have led to a different decision — because this type of explanation is both actionable for affected individuals and technically feasible for complex models. Essential reading for anyone working on AI governance in EU-regulated contexts, and the foundational reference for the algorithmic recourse literature.
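The counterfactual idea can be illustrated with a toy linear model, where the nearest counterfactual has a closed form. (Wachter et al. handle arbitrary models by optimizing a distance-plus-validity loss; the weights and applicant below are invented for the example.)

```python
import numpy as np

# Hypothetical linear credit model: approve iff w @ x + b >= 0.
w = np.array([0.8, 0.5, -0.3])     # income, tenure, debt ratio (invented weights)
b = -2.0
x = np.array([1.0, 2.0, 1.0])      # a denied applicant

score = w @ x + b                   # -0.5, so: deny

# For a linear model, the closest (Euclidean) point on the decision
# boundary lies along w, so the minimal counterfactual change is:
delta = -(score / (w @ w)) * w
x_cf = x + 1.001 * delta            # nudge just past the boundary

print("change required:", delta)
print("counterfactual score:", w @ x_cf + b)  # just above 0: approve
```

The vector `delta` is exactly the "what would have led to a different decision" information Wachter et al. argue for: it is actionable for the applicant and required no access to the model's internals beyond query access plus, here, linearity.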
Critiques and Limitations
6. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215.
One of the most provocative and consequential papers in AI ethics. Rudin argues that the project of explaining black-box models for high-stakes decisions is fundamentally misguided — that the explanations are unreliable, gameable, and unnecessary, because interpretable models often achieve comparable accuracy. The paper invokes the Rashomon set concept (the observation that many near-equally accurate models typically exist for a given task, often including interpretable ones), reviews evidence on accuracy-interpretability tradeoffs, and argues for a paradigm shift from post-hoc explanation to interpretability-by-design. Whether or not readers agree with Rudin's position, engaging with her argument is essential for anyone making model selection decisions in high-stakes domains. Available freely on arXiv.
7. Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2020). Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), 180-186.
The paper demonstrating that LIME and SHAP can be defeated by adversarially constructed classifiers. This finding directly challenges any regulatory framework that treats XAI explanation as sufficient evidence of non-discrimination. The paper is technically accessible (the attack mechanism is explained clearly) and its governance implications are profound. Case Study 14.2 in this chapter is built around this paper's findings. Free preprint available on arXiv.
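The mechanism of the attack is worth internalizing: LIME and SHAP probe a model with perturbed inputs that typically fall off the data manifold, so an adversary can route those probes to an innocuous model while real inputs hit a biased one. A stripped-down sketch follows; the data, detector, and both models are invented, and the paper's actual attack trains an out-of-distribution classifier rather than using a hand-written rule.

```python
import numpy as np

rng = np.random.default_rng(1)

# Real data lies near a line: feature 2 is roughly 2 * feature 1.
f1 = rng.normal(size=(200, 1))
X_real = np.hstack([f1, 2 * f1 + rng.normal(scale=0.05, size=(200, 1))])

def is_in_distribution(X):
    # Crude OOD detector: real points satisfy x2 ~ 2 * x1.
    return np.abs(X[:, 1] - 2 * X[:, 0]) < 0.3

def scaffolded_model(X):
    # Biased behavior on real inputs, innocuous behavior on everything else.
    biased = (X[:, 0] > 0).astype(float)      # decision driven by a proxy feature
    innocuous = (X[:, 1] > 0).astype(float)   # what the explainer gets to see
    return np.where(is_in_distribution(X), biased, innocuous)

# LIME/SHAP-style perturbations around a real point mostly leave the
# manifold, so the explainer almost only observes the innocuous branch.
x0 = X_real[0]
perturbed = x0 + rng.normal(scale=1.0, size=(500, 2))
print("perturbations still in-distribution:", is_in_distribution(perturbed).mean())
```

Because nearly all real points pass the detector while most perturbations fail it, an audit built on perturbation-based explanations would describe the innocuous branch, not the model actually applied to people. This is the governance lesson: explanation fidelity must be validated on-manifold, not assumed.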
8. Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., & Kim, B. (2018). Sanity checks for saliency maps. Advances in Neural Information Processing Systems (NeurIPS), 31.
The paper showing that many popular gradient-based saliency methods fail basic validity tests. Adebayo et al. tested whether saliency maps change when the model is re-initialized randomly or when training labels are permuted — tests that reveal whether the maps are actually reflecting model behavior or merely input data structure. Multiple widely-used methods failed these tests. This paper is essential reading for practitioners using saliency-based explanation in any computer vision application, including medical imaging. Available freely on arXiv.
9. Jain, S., & Wallace, B. C. (2019). Attention is not explanation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 3543-3556.
The paper establishing that attention weights are not reliable explanations of model behavior (the experiments use recurrent encoders with attention, though the finding fueled a broader debate that extends to transformers). Jain and Wallace show that attention distributions can often be substantially perturbed, or replaced with adversarially constructed alternatives, without meaningfully changing model outputs. This finding undermined a common practice of presenting attention heatmaps as explanations of language model behavior. Critically important for anyone deploying LLM-based systems in contexts where explanation is required. Available on arXiv.
10. Lipton, Z. C. (2018). The mythos of model interpretability. Queue, 16(3), 31-57.
An influential conceptual review arguing that "interpretability" in machine learning is an ill-defined concept used to mean very different things by different researchers and practitioners. Lipton distinguishes transparency (understanding the mechanism) from post-hoc interpretability (explaining outputs), and argues that many interpretability claims are not backed by evidence that the explanations are actually useful to humans or faithful to model behavior. A useful corrective to naive enthusiasm about XAI. Accessible to non-technical readers. Available on arXiv.
Interpretable Models and Applications
11. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., & Elhadad, N. (2015). Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721-1730.
The paper introducing the predecessor to the EBM (Explainable Boosting Machine) and one of the canonical papers on interpretable models in healthcare. Caruana et al. found that their interpretable model (a generalized additive model with pairwise interactions) learned a counterintuitive pattern: a history of asthma lowered a pneumonia patient's predicted mortality risk. The cause was a treatment artifact in the training data — asthmatic pneumonia patients were typically admitted directly to intensive care, which reduced their observed mortality. The interpretable model made this problem visible and correctable; a black-box model that learned the same pattern would have hidden it. This case is widely cited as evidence that interpretability is not just a communication value but a safety value.
12. Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., & Rudin, C. (2017). Learning certifiably optimal rule lists for categorical features. Journal of Machine Learning Research, 18(234), 1-78.
Technical paper introducing an algorithm (CORELS) for learning optimal rule lists — the class of simple if-then models used in criminal justice risk assessment and similar scoring applications. The "certifiably optimal" aspect means the algorithm provably finds the most accurate rule list within defined constraints, addressing the criticism that rule lists sacrifice too much accuracy. Relevant background for understanding the technical landscape of interpretable alternatives to black-box models.
Fairness, Bias, and Proxy Discrimination
13. Barocas, S., & Selbst, A. D. (2016). Big data's disparate impact. California Law Review, 104(3), 671-732.
The foundational legal scholarship on how machine learning systems can produce disparate impact even without discriminatory intent. Barocas and Selbst analyze how proxy variables, biased training data, and feedback loops can create illegal discrimination under existing US civil rights law. This paper is the appropriate reference for the legal analysis of disparate impact in credit, employment, and housing AI. Essential for compliance officers and attorneys advising on AI deployment.
14. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
A landmark empirical study documenting racial bias in a widely deployed commercial healthcare algorithm used by hospitals and insurers to identify high-risk patients for care management programs. The algorithm used healthcare costs as a proxy for healthcare need — but because Black patients historically received less care for the same health conditions (due to unequal access), they had lower costs at the same level of need, causing the algorithm to systematically underestimate their risk. The paper demonstrates the proxy variable problem in a non-credit context with real population health consequences. SHAP-style analysis of this model would have identified cost as a high-importance feature, prompting exactly the type of proxy investigation described in Chapter 14.
15. Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153-163.
The mathematical paper demonstrating that several definitions of fairness for recidivism prediction instruments are mutually incompatible when base rates differ across groups. Specifically, a model cannot simultaneously satisfy predictive parity (equal positive predictive values across groups), equal false positive rates, and equal false negative rates when base rates differ. This impossibility result is directly relevant to the COMPAS controversy (referenced in Chapter 14) and has broad implications for how XAI-based fairness auditing should be designed.
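The arithmetic of the impossibility result is short enough to verify directly. Holding the false positive and false negative rates equal across two groups, Bayes' rule forces the positive predictive values apart whenever base rates differ (the rates below are invented for illustration):

```python
# Equal error rates for both groups; only the base rate differs.
fpr = 0.2            # false positive rate, shared by both groups
fnr = 0.3            # false negative rate, shared by both groups
tpr = 1 - fnr        # true positive rate

def ppv(base_rate):
    # P(true outcome | flagged positive), by Bayes' rule.
    return (tpr * base_rate) / (tpr * base_rate + fpr * (1 - base_rate))

print(ppv(0.5))      # group A, 50% base rate -> ~0.778
print(ppv(0.2))      # group B, 20% base rate -> ~0.467
```

Equalizing the PPVs instead would force the error rates apart, which is the tradeoff at the heart of the COMPAS dispute: ProPublica and Northpointe were each measuring a fairness criterion the other's criterion mathematically rules out.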
Regulatory and Policy Sources
16. Board of Governors of the Federal Reserve System & Office of the Comptroller of the Currency. (2011). Supervisory Guidance on Model Risk Management (SR 11-7). Federal Reserve System.
The primary US banking regulatory guidance on model risk management. SR 11-7 requires banks to maintain comprehensive governance of model development, validation, and use — including documentation of model limitations, assumptions, and appropriate uses. The guidance does not address machine learning specifically (it predates the widespread adoption of ML in banking) but has been interpreted by regulators to apply to ML models, including requirements for validating explanation outputs. The most recent OCC and Fed guidance documents on ML (referenced in regulatory updates since 2021) build directly on SR 11-7's framework.
17. Consumer Financial Protection Bureau. (2022). Supervisory Highlights: Fair Lending, Fall 2022. CFPB.
One of the CFPB's supervisory highlight reports addressing machine learning in consumer credit. These reports (published periodically) describe the types of issues examiners are finding in the field and the standards regulators are applying. The fair lending sections address adverse action notice requirements for ML-based decisions, the use of post-hoc explanation tools, and the CFPB's expectations for disparate impact analysis. Practitioners in financial services should monitor these publications as a primary source for understanding evolving regulatory expectations.
18. National Institute of Standards and Technology. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1.
The US government's primary voluntary framework for AI risk management, developed with broad stakeholder input. The AI RMF treats "explainable and interpretable" as one of its characteristics of trustworthy AI, operationalized through its GOVERN, MAP, MEASURE, and MANAGE core functions. While voluntary and less prescriptive than EU requirements, the AI RMF reflects regulatory expectations and is being referenced in agency-specific AI guidance across the federal government. Available free from NIST.
19. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., ... & Gebru, T. (2019). Model cards for model reporting. Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 220-229.
The paper introducing model cards — a documentation format for AI systems that includes intended use, performance metrics across demographic groups, limitations, and ethical considerations. Model cards have become a de facto standard for responsible AI documentation and are referenced in regulatory guidance and internal AI governance policies across major technology companies. This paper provides the framework and rationale for model card design.
20. European Parliament and Council. (2024). Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union.
The EU AI Act, the world's first comprehensive AI regulation. Relevant provisions for Chapter 14 include: Article 13 (transparency and provision of information to deployers), Article 14 (human oversight requirements), Annex IV (technical documentation requirements for high-risk AI systems), and Article 86 (right to explanation for affected persons for certain high-risk AI systems). Organizations deploying AI in the EU must understand these provisions and design their XAI infrastructure to satisfy them. The full text is available on EUR-Lex.
Additional Recommended Reading
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. A systematic attempt to define interpretability rigorously and develop methods for evaluating whether interpretability methods are actually useful. Provides the conceptual grounding for distinguishing application-grounded, human-grounded, and functionally-grounded evaluation of XAI methods.
Ghassemi, M., Oakden-Rayner, L., & Beam, A. L. (2021). The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health, 3(11), e745-e750. A clinical perspective on the limitations of current XAI methods in healthcare, arguing that existing methods are insufficient to support clinical decision-making and that the field's optimism about XAI is not justified by the evidence. A useful counterweight to promotional claims about XAI in clinical applications.
Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., & Vertesi, J. (2019). Fairness and abstraction in sociotechnical systems. Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 59-68. Identifies "traps" that researchers and practitioners fall into when trying to operationalize fairness in technical systems — including the "abstraction trap" of treating technical solutions (like XAI) as adequate responses to problems that are fundamentally social and political. Directly relevant to the "explanation placebo" concept in Chapter 14.