Chapter 35: Further Reading

Essential Sources

1. Scott M. Lundberg and Su-In Lee, "A Unified Approach to Interpreting Model Predictions" (NeurIPS, 2017)

The paper that introduced SHAP — SHapley Additive exPlanations — by connecting Shapley values from cooperative game theory with local feature attribution for machine learning models. Lundberg and Lee prove that Shapley values are the unique solution satisfying a set of desirable properties (local accuracy/efficiency, missingness, and consistency — which imply the classical Shapley axioms). They show that LIME, DeepLIFT, and classical Shapley regression are all members of a broader class of additive feature attribution methods, and that Shapley values are the only attributions in this class that satisfy all three properties simultaneously.

Reading guidance: Section 2 defines the class of additive feature attribution methods and proves the uniqueness theorem — this is the theoretical core. Section 3 introduces KernelSHAP (the model-agnostic approximation via weighted regression) and DeepSHAP (the neural network approximation via DeepLIFT's backpropagation rules). Section 4's experiments compare SHAP with LIME and classical methods on consistency, computational cost, and human subject evaluations. The supplementary material contains the proof of the SHAP kernel derivation — worth reading if you want to understand why the specific kernel weights recover Shapley values.
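To make the Shapley definition concrete before tackling the paper's approximations, here is a brute-force sketch. The toy model and the convention of "removing" a feature by substituting its baseline value are illustrative assumptions, not the paper's estimators; KernelSHAP exists precisely because this coalition enumeration is exponential in the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating every feature coalition.

    A feature 'absent' from a coalition is replaced by its baseline
    value -- one simple removal convention. Cost is O(2^n), so this is
    only viable for toy models.
    """
    n = len(x)
    phi = [0.0] * n

    def value(subset):
        # Evaluate f with only the features in `subset` taken from x.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Classical Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Toy linear model: each attribution should equal w_i * (x_i - baseline_i),
# and by local accuracy the attributions sum to f(x) - f(baseline).
f = lambda z: 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]
phi = shapley_values(f, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

Running this yields attributions of exactly the model's weights, which is the sanity check the paper's local accuracy property demands.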

For the TreeSHAP extension (exact polynomial-time computation for tree models), see Lundberg, Erion, Chen et al., "From Local Explanations to Global Understanding with Explainable AI for Trees" (Nature Machine Intelligence, 2020). This follow-up paper introduces TreeSHAP's $O(TLD^2)$ algorithm and demonstrates the distinction between tree-path-dependent and interventional modes. The global aggregation methods (summary plots, dependence plots, interaction values) make this paper essential for practitioners who need to generate model documentation or regulatory reports.

For a critical perspective on Shapley values for feature attribution, see Kumar, Vaidyanathan, Patel, and Talwalkar, "Problems with Shapley-Value-Based Explanations as Feature Importance Measures" (ICML, 2020), which argues that the game-theoretic axioms do not always correspond to the properties practitioners want from explanations. The tension between Shapley's efficiency axiom (attributions sum to the prediction) and the desire for causal rather than associational attribution is a live research debate.

2. Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viégas, and Rory Sayres, "Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)" (ICML, 2018)

The paper that introduced TCAV — a method for testing whether a neural network's predictions are sensitive to human-defined concepts rather than individual features. Kim et al. define the Concept Activation Vector (CAV) as the normal to a linear decision boundary learned in a neural network's activation space, and the TCAV score as the fraction of target-class inputs for which the model's output increases when moving in the CAV direction. The statistical significance test against random CAVs provides a principled criterion for determining whether a concept genuinely influences the model.

Reading guidance: Section 3 defines CAVs and the TCAV score with formal precision. Section 4's experiments on GoogLeNet with medical imaging concepts (texture, color, shape) and object classification concepts (presence of specific objects in scenes) demonstrate the method's practical utility. The key finding: TCAV can confirm that a medical imaging model relies on clinically meaningful features (tissue texture, staining pattern) rather than artifacts (image orientation, slide preparation). Section 5's user study shows that TCAV explanations improve domain experts' ability to identify model failure modes.
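The TCAV score itself can be sketched in a few lines of NumPy. Everything here is hypothetical scaffolding: the synthetic activations, the tiny network head standing in for the layers above the probed activation, and the difference-of-means CAV (the paper trains a linear classifier and takes the normal to its boundary). The significance test against random CAVs is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # width of the hypothetical activation layer being probed

# Synthetic activations for concept examples vs. random counterexamples.
concept_acts = rng.normal(loc=1.0, size=(50, d))
random_acts = rng.normal(loc=0.0, size=(50, d))

# CAV = normal to a linear boundary separating the two sets. The paper
# trains a linear classifier; the difference of class means is a crude
# stand-in that points the same way for well-separated clusters.
cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
cav /= np.linalg.norm(cav)

# Hypothetical "rest of the network": activations -> target-class logit.
W1, W2 = rng.normal(size=(d, 4)), rng.normal(size=4)
def class_logit(a):
    return float(np.tanh(a @ W1) @ W2)

def tcav_score(acts, eps=1e-3):
    """Fraction of inputs whose logit rises when their activations are
    nudged along the CAV (i.e. positive directional derivative)."""
    derivs = [(class_logit(a + eps * cav) - class_logit(a)) / eps
              for a in acts]
    return float(np.mean([dv > 0 for dv in derivs]))

score = tcav_score(rng.normal(size=(100, d)))  # synthetic target-class inputs
```

A score far from 0.5, and distinguishable from scores computed with random CAVs, is what licenses the claim that the concept influences the class.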

For the concept bottleneck model extension, see Koh et al., "Concept Bottleneck Models" (ICML, 2020), which builds concepts directly into the model architecture rather than testing them post-hoc. For completeness-aware concept explanations, see Yeh et al., "On Completeness-Aware Concept-Based Explanations in Deep Neural Networks" (NeurIPS, 2020), which addresses the question of whether the defined concepts are sufficient to explain the model's behavior.

3. Sandra Wachter, Brent Mittelstadt, and Chris Russell, "Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR" (Harvard Journal of Law & Technology, 2018)

The paper that formalized counterfactual explanations for machine learning and connected them to legal requirements under the GDPR. Wachter et al. argue that counterfactual explanations — "the smallest change to the input that would change the decision" — are more useful to individuals affected by automated decisions than traditional feature attribution, because counterfactuals are actionable (they tell you what to change) while attributions are diagnostic (they tell you what mattered). The paper provides both the optimization formulation and a legal analysis of why counterfactual explanations satisfy the GDPR's transparency requirements.

Reading guidance: Section II provides the legal analysis of GDPR Articles 13-15 and 22, arguing that the "right to explanation" (Recital 71) is best satisfied by counterfactual explanations rather than model internals. This legal analysis is essential context for practitioners building explanation systems for European markets. Section III formalizes the counterfactual optimization problem and discusses distance metrics, feasibility constraints, and the challenge of generating diverse counterfactuals (multiple paths to a different outcome). Section IV's experiments on real-world datasets demonstrate the method's practical applicability.
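The optimization in Section III reduces to a simple gradient descent in the differentiable case. The logistic model, the squared-L2 proximity term (the paper favors an MAD-weighted L1 distance), and all constants below are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np

# Hypothetical differentiable classifier: logistic regression.
w, b = np.array([1.5, -2.0, 0.5]), -0.25
def predict(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def counterfactual(x, target=0.5, lam=100.0, lr=0.005, steps=10_000):
    """Wachter-style search: minimize
        lam * (f(x') - target)^2 + ||x' - x||^2
    by gradient descent. Squared L2 keeps the gradient simple; the
    paper recommends an MAD-weighted L1 distance for sparser, more
    interpretable changes."""
    xp = x.astype(float).copy()
    for _ in range(steps):
        p = predict(xp)
        # d sigmoid(w.x + b) / dx = p * (1 - p) * w
        grad = lam * 2.0 * (p - target) * p * (1.0 - p) * w + 2.0 * (xp - x)
        xp -= lr * grad
    return xp

x = np.array([-1.0, 1.0, 0.0])  # original input; predict(x) is about 0.02
x_cf = counterfactual(x)        # small change pushing the score toward 0.5
delta = x_cf - x                # "what to change" -- the actionable part
```

Feasibility constraints (immutable features, causal dependencies) and diversity, the paper's harder open problems, are exactly what the Karimi et al. and Mothilal et al. follow-ups below address.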

For extensions addressing causal consistency in counterfactuals, see Karimi, Schölkopf, and Valera, "Algorithmic Recourse: From Counterfactual Explanations to Interventions" (FAccT, 2021), which argues that counterfactuals should respect causal structure — changing income should propagate to debt-to-income ratio. For diverse counterfactuals, see Mothilal, Sharma, and Tan, "Explaining Machine Learning Classifiers Through Diverse Counterfactual Explanations" (FAccT, 2020), which generates multiple diverse counterfactuals representing different "paths to recourse."

4. Mukund Sundararajan, Ankur Taly, and Qiqi Yan, "Axiomatic Attribution for Deep Networks" (ICML, 2017)

The paper that introduced integrated gradients — the gradient-based attribution method for neural networks built to satisfy the sensitivity and implementation invariance axioms. Sundararajan et al. formalize two properties that any attribution method for neural networks should satisfy: (1) sensitivity (if the input and baseline differ in one feature and the output differs, that feature must receive nonzero attribution) and (2) implementation invariance (two networks that compute the same function should receive the same attributions). They prove that integrated gradients — the path integral of gradients from a baseline to the input — is the unique method satisfying both properties (up to the choice of path).

Reading guidance: Section 2 defines the axioms and proves that vanilla gradients, LRP, and DeepLIFT violate at least one. Section 3 introduces integrated gradients with the path integral formulation and proves the completeness property (attributions sum to $f(x) - f(x')$). Section 4 discusses baseline choice — a critical practical decision, since the baseline defines "absence" of a feature and different baselines produce different attributions. The authors recommend a zero baseline for images and the training mean for tabular data, but acknowledge that baseline selection remains partially a domain-specific choice.
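In practice the path integral becomes a Riemann sum, which also makes the completeness property easy to check numerically. The tiny tanh network and finite-difference gradients below are stand-ins for a real model and its autodiff gradients.

```python
import numpy as np

# Hypothetical smooth model: a tiny two-layer network on 3 features.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=5)
def f(x):
    return float(np.tanh(x @ W1) @ W2)

def grad_f(x, eps=1e-5):
    # Central-difference gradient; an autodiff framework does this exactly.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def integrated_gradients(x, baseline, steps=200):
    """IG_i = (x_i - x'_i) * integral over alpha in [0,1] of the gradient
    d f / d x_i evaluated along the straight path x' + alpha (x - x'),
    approximated by a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([1.0, -0.5, 2.0])
baseline = np.zeros(3)  # the zero baseline the paper suggests for images
attr = integrated_gradients(x, baseline)
# Completeness: attr.sum() approximates f(x) - f(baseline).
```

Swapping in a different baseline changes `attr`, which is exactly the practical sensitivity Section 4 warns about.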

For the connection between integrated gradients and Shapley values, see Sundararajan and Najmi, "The Many Shapley Values for Model Explanation" (ICML, 2020), which shows that integrated gradients correspond to a specific Shapley value formulation (the Aumann-Shapley value for continuous games). For GradCAM and other spatial attribution methods for CNNs, see Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization" (ICCV, 2017). For a unified implementation of all gradient-based methods in PyTorch, see the Captum documentation (captum.ai).

5. Cynthia Rudin, "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead" (Nature Machine Intelligence, 2019)

The most influential argument against the dominant paradigm of this chapter. Rudin argues that for high-stakes decisions — criminal sentencing, medical diagnosis, lending — interpretable models should always be preferred over post-hoc explanations of black-box models. Her central claim: post-hoc explanations are fundamentally unreliable because they are separate models that approximate the original model's behavior, and this approximation can fail in precisely the cases that matter most (unusual inputs, edge cases, adversarial manipulation). She presents evidence that for many high-stakes domains, interpretable models can match the accuracy of black-box models, making the accuracy-interpretability tradeoff a false dilemma.

Reading guidance: Section 2's critique of post-hoc explanations is the intellectual core. Rudin identifies three failure modes: (1) explanations can be unfaithful (they do not accurately reflect the model), (2) explanations can be incomplete (they summarize but do not fully characterize the model), and (3) explanations provide false confidence (users trust the explanation without knowing whether it is faithful). Section 3 presents case studies where interpretable models match black-box accuracy: criminal recidivism prediction (the COMPAS debate), medical diagnostic rules, and credit scoring. Section 4 provides a research agenda for improving interpretable models rather than improving explanations of black boxes.

This paper is essential reading for any practitioner who designs explanation systems. Even if you ultimately use complex models with post-hoc explanations (as this chapter recommends when the performance gap is material), understanding Rudin's argument sharpens your awareness of the limitations of those explanations and motivates the adversarial testing, faithfulness audits, and multi-method cross-validation that Section 35.12 recommends.