Chapter 38: Further Reading
Foundational Texts
- Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed. A comprehensive, practical guide covering SHAP, LIME, partial dependence, and more. Freely available at https://christophm.github.io/interpretable-ml-book/.
- Olah, C. et al. (2020). "Zoom In: An Introduction to Circuits." Distill. An accessible introduction to the circuits approach in mechanistic interpretability, with interactive visualizations. Available at https://distill.pub/2020/circuits/.
Feature Attribution Methods
SHAP
- Lundberg, S. M. and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." NeurIPS 2017. The foundational SHAP paper, unifying six existing attribution methods under the Shapley value framework.
- Lundberg, S. M., Erion, G., Chen, H., et al. (2020). "From Local Explanations to Global Understanding with Explainable AI for Trees." Nature Machine Intelligence, 2, 56--67. TreeSHAP and its application to global explanations. A minimal usage sketch follows these entries.
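The sketch below illustrates, under assumptions, how the shap library (listed under Software) is typically applied with TreeSHAP; the random-forest model and synthetic data are placeholders for whatever model and dataset you want to explain.

```python
# Minimal TreeSHAP sketch using the shap library (see Software).
# The model and data are illustrative placeholders.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)          # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X[:100])   # (100, 8): one attribution per feature per row

# Global importance (Lundberg et al., 2020): mean absolute attribution per feature.
print(np.abs(shap_values).mean(axis=0))
```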
LIME
- Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "'Why Should I Trust You?' Explaining the Predictions of Any Classifier." KDD 2016. The original LIME paper, introducing local interpretable model-agnostic explanations. A minimal tabular-data sketch follows these entries.
- Ribeiro, M. T., Singh, S., and Guestrin, C. (2018). "Anchors: High-Precision Model-Agnostic Explanations." AAAI 2018. A follow-up to LIME that provides rule-based explanations with coverage guarantees.
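A minimal sketch of LIME on tabular data, assuming the lime package and a scikit-learn classifier; the data, feature names, and class names are placeholders.

```python
# Minimal LIME sketch for tabular data; model and data are placeholders.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=[f"f{i}" for i in range(X.shape[1])],
    class_names=["neg", "pos"],
    mode="classification",
)
# Fit a sparse local surrogate around one instance and report the top features.
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(exp.as_list())   # [(feature condition, local weight), ...]
```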
Gradient-Based Methods
- Sundararajan, M., Taly, A., and Yan, Q. (2017). "Axiomatic Attribution for Deep Networks." ICML 2017. Integrated Gradients: principled gradient attribution satisfying completeness and sensitivity axioms. A from-scratch sketch follows these entries.
- Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). "SmoothGrad: Removing Noise by Adding Noise." arXiv:1706.03825. Averaging gradients over noisy copies of the input reduces noise in saliency maps.
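The following sketch implements the Integrated Gradients formula from Sundararajan et al. (2017) directly in PyTorch; the toy linear model, the zero baseline, and the step count are assumptions for illustration rather than recommendations.

```python
# Integrated Gradients, written out from the formula: interpolate along the
# straight-line path from a baseline to the input and average gradients.
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    if baseline is None:
        baseline = torch.zeros_like(x)             # assumed baseline; domain-specific in practice
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)      # (steps, *x.shape) interpolated inputs
    path.requires_grad_(True)
    output = model(path)[:, target].sum()
    grads = torch.autograd.grad(output, path)[0]   # dF/dx at each point on the path
    avg_grad = grads.mean(dim=0)                   # Riemann approximation of the path integral
    return (x - baseline) * avg_grad               # completeness: sums to ~ F(x) - F(baseline)

# Toy example: a linear model on a single input vector.
model = torch.nn.Linear(4, 3)
x = torch.randn(4)
attr = integrated_gradients(model, x, target=0)
print(attr, attr.sum())
```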
Attention Analysis
- Jain, S. and Wallace, B. C. (2019). "Attention is not Explanation." NAACL 2019. The influential paper demonstrating that attention weights do not provide faithful explanations.
- Wiegreffe, S. and Pinter, Y. (2019). "Attention is not not Explanation." EMNLP 2019. A nuanced response arguing that attention is more informative than Jain and Wallace suggested, under certain conditions.
- Abnar, S. and Zuidema, W. (2020). "Quantifying Attention Flow in Transformers." ACL 2020. Introduced attention rollout and attention flow methods for tracing information through multi-layer Transformers. A standalone rollout sketch follows these entries.
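A standalone sketch of attention rollout as described by Abnar and Zuidema (2020): average attention over heads, mix in the identity matrix to account for residual connections, renormalize, and compose across layers. The attention tensors here are random placeholders for patterns exported from a real Transformer.

```python
# Attention rollout over a list of per-layer attention tensors of shape
# (heads, seq_len, seq_len).
import numpy as np

def attention_rollout(attentions):
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_attn in attentions:
        attn = layer_attn.mean(axis=0)              # average over heads
        attn = 0.5 * attn + 0.5 * np.eye(seq_len)   # residual connection as identity attention
        attn = attn / attn.sum(axis=-1, keepdims=True)
        rollout = attn @ rollout                    # compose information flow across layers
    return rollout   # rollout[i, j]: estimated contribution of position j to position i

# Toy example: 3 layers, 2 heads, sequence length 5, random row-stochastic attention.
rng = np.random.default_rng(0)
attns = [rng.dirichlet(np.ones(5), size=(2, 5)) for _ in range(3)]
print(attention_rollout(attns).round(3))
```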
Probing
- Belinkov, Y. (2022). "Probing Classifiers: Promises, Shortcomings, and Advances." Computational Linguistics, 48(1), 207--219. A comprehensive survey of probing methodology, limitations, and best practices.
- Hewitt, J. and Liang, P. (2019). "Designing and Interpreting Probes with Control Tasks." EMNLP 2019. Introduced control tasks and selectivity to distinguish what a representation encodes from what an expressive probe can learn on its own. A minimal probing sketch follows these entries.
- Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. (2018). "What You Can Cram into a Single $&!#* Vector: Probing Sentence Embeddings for Linguistic Properties." ACL 2018. Systematic probing of sentence representations for ten linguistic properties.
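A minimal probing-classifier sketch in the spirit of this literature: fit a linear probe on frozen representations, then refit on shuffled labels as a crude stand-in for Hewitt and Liang's control tasks. The representations and labels are synthetic placeholders for activations extracted from a real model.

```python
# Linear probe on frozen representations, plus a shuffled-label baseline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 256))   # stand-in for cached model activations
labels = rng.integers(0, 5, size=2000)         # stand-in for a 5-class linguistic property

X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# Crude control baseline: the gap between the two accuracies ("selectivity")
# indicates how much the probe relies on the representation rather than
# on its own capacity.
y_ctrl = rng.permutation(labels)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(hidden_states, y_ctrl, random_state=0)
control = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("control accuracy:", control.score(Xc_te, yc_te))
```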
Mechanistic Interpretability
Superposition and Features
- Elhage, N., Hume, T., Olsson, C., et al. (2022). "Toy Models of Superposition." Anthropic. The foundational paper on superposition in neural networks, demonstrating how models represent more features than neurons.
- Templeton, A., Conerly, T., Marcus, J., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic. Demonstrated that sparse autoencoders can discover millions of interpretable features in production language models.
- Bricken, T., Templeton, A., Batson, J., et al. (2023). "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning." Anthropic. The initial paper on using sparse autoencoders for feature discovery in language models. A minimal SAE training sketch follows these entries.
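A minimal sparse-autoencoder training step in PyTorch, loosely following the dictionary-learning setup of Bricken et al. (2023): reconstruct activations through an overcomplete ReLU bottleneck with an L1 sparsity penalty. The dimensions, L1 coefficient, and activation batch are illustrative assumptions, not the papers' actual settings.

```python
# Sparse autoencoder sketch: overcomplete ReLU features + L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, (ideally) interpretable feature activations
        x_hat = self.decoder(f)           # reconstruction of the original activation
        return x_hat, f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                           # trades reconstruction quality against sparsity

acts = torch.randn(1024, 512)             # stand-in for cached language-model activations
opt.zero_grad()
x_hat, f = sae(acts)
loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
loss.backward()
opt.step()
print(float(loss))
```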
Circuits
- Olsson, C., Elhage, N., Nanda, N., et al. (2022). "In-context Learning and Induction Heads." Anthropic. Discovery and analysis of induction heads, a key circuit for in-context learning in Transformers. An illustrative induction-score sketch follows these entries.
- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2023). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." ICLR 2023. Reverse-engineering a complete circuit in a pretrained language model.
- Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. (2023). "Progress Measures for Grokking via Mechanistic Interpretability." ICLR 2023. Mechanistic analysis of how Transformers learn modular addition, revealing Fourier-based algorithms.
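An illustrative induction-score computation in the spirit of Olsson et al. (2022): on a sequence whose second half repeats its first half, an induction head at position i attends back to position i - T + 1, where T is the repeat period. The attention pattern below is a random placeholder, and the scoring rule is a simplified stand-in for the diagnostics used in the paper.

```python
# Simplified per-head induction score over an attention pattern of shape
# (heads, seq_len, seq_len) for a sequence that repeats with period T.
import numpy as np

def induction_scores(attn, period):
    heads, seq_len, _ = attn.shape
    scores = []
    for h in range(heads):
        # Attention mass on the "token after the previous occurrence" offset,
        # measured only on positions where that offset is defined.
        mass = [attn[h, i, i - period + 1] for i in range(period, seq_len)]
        scores.append(float(np.mean(mass)))
    return scores   # values near 1.0 suggest induction-head behavior

rng = np.random.default_rng(0)
toy_attn = rng.dirichlet(np.ones(16), size=(4, 16))   # random baseline pattern, 4 heads
print(induction_scores(toy_attn, period=8))
```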
Activation Patching and Causal Analysis
- Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). "Locating and Editing Factual Associations in GPT." NeurIPS 2022. Introduced causal tracing and ROME for identifying and editing factual knowledge. A hook-based patching sketch follows these entries.
- Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. (2023). "Mass-Editing Memory in a Transformer." ICLR 2023. MEMIT: scaling model editing to thousands of facts.
- Geiger, A., Lu, H., Icard, T., and Potts, C. (2021). "Causal Abstractions of Neural Networks." NeurIPS 2021. A formal framework for causal analysis of neural network computations.
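A hook-based activation-patching sketch in plain PyTorch: cache an intermediate activation from a clean run and substitute it during a corrupted run, then compare outputs. The two-layer toy model stands in for a specific Transformer component; the papers' machinery (causal tracing over token positions, noising schemes, rank-one edits) is omitted.

```python
# Activation patching with forward hooks: clean-run activation -> corrupted run.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
target = model[1]                              # patch the post-ReLU activation

clean_x, corrupted_x = torch.randn(1, 8), torch.randn(1, 8)
cache = {}

def save_hook(module, inputs, output):
    cache["act"] = output.detach()             # cache the clean activation

def patch_hook(module, inputs, output):
    return cache["act"]                        # returning a value overwrites the output

handle = target.register_forward_hook(save_hook)
clean_out = model(clean_x)
handle.remove()

handle = target.register_forward_hook(patch_hook)
patched_out = model(corrupted_x)               # corrupted input, clean activation
handle.remove()

corrupted_out = model(corrupted_x)
# In this toy everything downstream flows through the patched activation, so the
# patched output matches the clean one; in a Transformer, the size of the shift
# toward the clean output measures the component's causal contribution.
print(clean_out, corrupted_out, patched_out)
```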
Surveys and Overviews
- Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. (2023). "Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks." arXiv:2207.13243. A broad survey covering feature visualization, circuits, probing, and mechanistic interpretability.
- Madsen, A., Reddy, S., and Chandar, S. (2022). "Post-hoc Interpretability for Neural NLP: A Survey." ACM Computing Surveys, 55(8). A thorough survey of interpretability methods for NLP models.
Software
- SHAP Library: https://github.com/slundberg/shap. The reference implementation of SHAP, with TreeSHAP, KernelSHAP, and visualization tools.
- Captum: https://captum.ai/. PyTorch library for model interpretability, implementing Integrated Gradients, GradientSHAP, LIME, and more.
- TransformerLens: https://github.com/neelnanda-io/TransformerLens. A library for mechanistic interpretability of Transformers, with hooks for activation access and patching. A brief usage sketch follows this list.
- SAELens: https://github.com/jbloomAus/SAELens. Library for training and analyzing sparse autoencoders on language model activations.
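A brief sketch of typical TransformerLens usage; the class, method, and hook names below reflect recent versions of the library and should be checked against its documentation.

```python
# Load a small model, run it with activation caching, and read one hook.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
logits, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog")

# Cached activations are keyed by hook name, e.g. the layer-0 attention pattern.
attn_pattern = cache["blocks.0.attn.hook_pattern"]
print(attn_pattern.shape)   # (batch, heads, query_pos, key_pos)
```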