Chapter 38: Further Reading

Foundational Texts

  • Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed. A comprehensive, practical guide covering SHAP, LIME, partial dependence, and more. Freely available at https://christophm.github.io/interpretable-ml-book/.

  • Olah, C. et al. (2020). "Zoom In: An Introduction to Circuits." Distill. An accessible introduction to the circuits approach in mechanistic interpretability, with interactive visualizations. Available at https://distill.pub/2020/circuits/.

Feature Attribution Methods

SHAP

  • Lundberg, S. M. and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." NeurIPS 2017. The foundational SHAP paper, unifying six existing attribution methods under the Shapley value framework.

  • Lundberg, S. M., Erion, G., Chen, H., et al. (2020). "From Local Explanations to Global Understanding with Explainable AI for Trees." Nature Machine Intelligence, 2, 56--67. TreeSHAP and its application to global explanations.
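
The workflow these papers describe is compact in practice. A minimal sketch, assuming the shap and scikit-learn packages are installed; the dataset and model are illustrative:

    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)

    # TreeExplainer implements the polynomial-time TreeSHAP algorithm.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X.iloc[:100])   # (100, n_features) attributions

    # Local accuracy: for each row, the attributions plus the expected value
    # sum to the model's prediction -- the efficiency property of Shapley values.
    print(shap_values.sum(axis=1)[:3] + explainer.expected_value)
    print(model.predict(X.iloc[:3]))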

LIME

  • Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "'Why Should I Trust You?' Explaining the Predictions of Any Classifier." KDD 2016. The original LIME paper, introducing local interpretable model-agnostic explanations.

  • Ribeiro, M. T., Singh, S., and Guestrin, C. (2018). "Anchors: High-Precision Model-Agnostic Explanations." AAAI 2018. A follow-up to LIME that provides high-precision, rule-based explanations and reports the coverage of each rule.
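
A minimal LIME sketch on tabular data, assuming the lime and scikit-learn packages; the dataset, feature names, and classifier are illustrative:

    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(random_state=0).fit(X, y)

    explainer = LimeTabularExplainer(
        X,
        feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"],
        class_names=["setosa", "versicolor", "virginica"],
        discretize_continuous=True,
    )

    # LIME perturbs the instance, queries the black box, and fits a weighted
    # sparse linear model locally; its coefficients are the explanation.
    exp = explainer.explain_instance(X[0], clf.predict_proba, num_features=4)
    print(exp.as_list())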

Gradient-Based Methods

  • Sundararajan, M., Taly, A., and Yan, Q. (2017). "Axiomatic Attribution for Deep Networks." ICML 2017. Integrated Gradients: gradient attribution derived from the sensitivity and implementation-invariance axioms, with attributions that satisfy completeness (they sum to the difference between the model's output at the input and at the baseline).

  • Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). "SmoothGrad: Removing Noise by Adding Noise." arXiv:1706.03825. Averaging gradients over noisy copies of the input reduces noise in saliency maps.
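
Both ideas are short enough to implement directly. A from-scratch PyTorch sketch of Integrated Gradients with a SmoothGrad-style noise average; the model is assumed to map a batch of inputs to class logits, and the step count and noise scale are illustrative:

    import torch

    def integrated_gradients(model, x, baseline, target, steps=50):
        # Riemann-sum approximation of IG_i = (x_i - x'_i) * integral of
        # dF/dx_i along the straight line from baseline x' to input x.
        alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
        path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
        out = model(path)[:, target].sum()
        grads = torch.autograd.grad(out, path)[0]       # dF/dx at each path point
        return (x - baseline) * grads.mean(dim=0)       # sums approximately to F(x) - F(baseline)

    def smoothgrad_ig(model, x, baseline, target, n_samples=20, sigma=0.1):
        # SmoothGrad: average the attribution map over noisy copies of the input.
        attrs = [integrated_gradients(model, x + sigma * torch.randn_like(x), baseline, target)
                 for _ in range(n_samples)]
        return torch.stack(attrs).mean(dim=0)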

Attention Analysis

  • Jain, S. and Wallace, B. C. (2019). "Attention is not Explanation." NAACL 2019. The influential paper arguing that attention weights often fail to provide faithful explanations: alternative attention distributions can yield the same predictions.

  • Wiegreffe, S. and Pinter, Y. (2019). "Attention is not not Explanation." EMNLP 2019. A nuanced response arguing that attention is more informative than Jain and Wallace suggested, under certain conditions.

  • Abnar, S. and Zuidema, W. (2020). "Quantifying Attention Flow in Transformers." ACL 2020. Introduced attention rollout and attention flow methods for tracing information through multi-layer Transformers.
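
Attention rollout in particular is easy to state in code. A minimal NumPy sketch, assuming attn holds the attention probabilities from one forward pass with shape (n_layers, n_heads, seq, seq):

    import numpy as np

    def attention_rollout(attn):
        # Average heads, add the identity to account for residual connections,
        # renormalize rows, and compose the per-layer matrices bottom to top.
        layer_mats = attn.mean(axis=1)                   # (n_layers, seq, seq)
        eye = np.eye(layer_mats.shape[-1])
        rollout = eye
        for a in layer_mats:
            a_hat = 0.5 * a + 0.5 * eye
            a_hat = a_hat / a_hat.sum(axis=-1, keepdims=True)
            rollout = a_hat @ rollout
        return rollout   # rollout[i, j]: how much of input token j reaches position i

Attention flow replaces this matrix product with a max-flow computation over the same graph, which is more faithful but more expensive.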

Probing

  • Belinkov, Y. (2022). "Probing Classifiers: Promises, Shortcomings, and Advances." Computational Linguistics, 48(1), 207--219. A comprehensive survey of probing methodology, limitations, and best practices.

  • Hewitt, J. and Liang, P. (2019). "Designing and Interpreting Probes with Control Tasks." EMNLP 2019. Introduced control tasks and the selectivity metric to separate what a representation encodes from what an expressive probe can learn on its own.

  • Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. (2018). "What You Can Cram into a Single $&!#* Vector: Probing Sentence Embeddings for Linguistic Properties." ACL 2018. Systematic probing of sentence representations for ten linguistic properties.
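
A minimal probing sketch in the spirit of these papers, using scikit-learn. Here reps and labels stand in for frozen model representations and the linguistic property of interest; for simplicity the control task shuffles labels rather than assigning them per word type as Hewitt and Liang do:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def probe_accuracy(reps, labels, seed=0):
        X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, random_state=seed)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return probe.score(X_te, y_te)

    def selectivity(reps, labels, seed=0):
        # High accuracy on randomized labels can only come from the probe
        # memorizing, so the gap estimates what the representation contributes.
        control = np.random.default_rng(seed).permutation(labels)
        return probe_accuracy(reps, labels) - probe_accuracy(reps, control)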

Mechanistic Interpretability

Superposition and Features

  • Elhage, N., Hume, T., Olsson, C., et al. (2022). "Toy Models of Superposition." Anthropic. The foundational paper on superposition, demonstrating in small models how networks can represent more features than they have dimensions by storing them as overlapping directions.

  • Templeton, A., Conerly, T., Marcus, J., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic. Demonstrated that sparse autoencoders can discover millions of interpretable features in production language models.

  • Bricken, T., Templeton, A., Batson, J., et al. (2023). "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning." Anthropic. The initial paper on using sparse autoencoders for feature discovery in language models.
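
The core architecture in these papers is small. A minimal PyTorch sketch of a sparse autoencoder trained on model activations; dimensions and the sparsity coefficient are illustrative, and details such as decoder weight normalization are omitted:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=512, d_dict=4096):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_dict)    # overcomplete dictionary
            self.decoder = nn.Linear(d_dict, d_model)

        def forward(self, acts):
            features = torch.relu(self.encoder(acts))    # sparse feature activations
            return self.decoder(features), features

    sae = SparseAutoencoder()
    acts = torch.randn(64, 512)                          # stand-in for residual-stream activations
    recon, features = sae(acts)
    # Reconstruction loss plus an L1 penalty that pushes features toward sparsity.
    loss = ((recon - acts) ** 2).mean() + 3e-4 * features.abs().sum(dim=-1).mean()
    loss.backward()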

Circuits

  • Olsson, C., Elhage, N., Nanda, N., et al. (2022). "In-context Learning and Induction Heads." Anthropic. Discovery and analysis of induction heads, a key circuit for in-context learning in Transformers.

  • Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2023). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." ICLR 2023. Reverse-engineering a complete circuit for a naturally occurring task in a pretrained language model (GPT-2 small).

  • Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. (2023). "Progress Measures for Grokking via Mechanistic Interpretability." ICLR 2023. Mechanistic analysis of how Transformers learn modular addition, revealing Fourier-based algorithms.
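
Induction heads come with a simple diagnostic: on a repeated random sequence, they attend from each token in the second half back to the token that followed its first occurrence. A sketch of that check using the TransformerLens library listed under Software below; the model choice and sequence length are illustrative:

    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    seq_len = 50
    first = torch.randint(1000, 20000, (1, seq_len))
    tokens = torch.cat([first, first], dim=1)            # repeated random tokens

    _, cache = model.run_with_cache(tokens)
    for layer in range(model.cfg.n_layers):
        pattern = cache["pattern", layer]                # (batch, heads, query, key)
        # Induction offset: query at position t attends to key at t - (seq_len - 1).
        diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
        print(layer, diag.mean(dim=-1).squeeze(0))       # per-head induction score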

Activation Patching and Causal Analysis

  • Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). "Locating and Editing Factual Associations in GPT." NeurIPS 2022. Introduced causal tracing and ROME for identifying and editing factual knowledge.

  • Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. (2023). "Mass-Editing Memory in a Transformer." ICLR 2023. MEMIT: scaling model editing to thousands of facts.

  • Geiger, A., Lu, H., Icard, T., and Potts, C. (2021). "Causal Abstractions of Neural Networks." NeurIPS 2021. A formal framework for causal analysis of neural network computations.
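
The basic patching loop behind this line of work fits in a few lines: run the model on a corrupted prompt while splicing in a cached clean activation, and measure how much of the clean behavior returns. A sketch using TransformerLens (see Software below); the prompts, layer, and patched position are illustrative:

    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gpt2")
    clean = model.to_tokens("The Eiffel Tower is located in the city of")
    corrupt = model.to_tokens("The Colosseum is located in the city of")

    _, clean_cache = model.run_with_cache(clean)

    def patch_resid(resid, hook, pos=4):
        # Overwrite the residual stream at one position with its clean value.
        resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
        return resid

    layer = 8
    patched_logits = model.run_with_hooks(
        corrupt,
        fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
    )
    # Compare the patched run with the clean and corrupted runs, e.g. via the
    # logit of the clean answer token at the final position.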

Surveys and Overviews

  • Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. (2023). "Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks." arXiv:2207.13243. A broad survey covering feature visualization, circuits, probing, and mechanistic interpretability.

  • Madsen, A., Reddy, S., and Chandar, S. (2022). "Post-hoc Interpretability for Neural NLP: A Survey." ACM Computing Surveys, 55(8). A thorough survey of interpretability methods for NLP models.

Software

  • SHAP Library: https://github.com/slundberg/shap. The reference implementation of SHAP, with TreeSHAP, KernelSHAP, and visualization tools.

  • Captum: https://captum.ai/. PyTorch library for model interpretability, implementing Integrated Gradients, GradientSHAP, LIME, and more.

  • TransformerLens: https://github.com/neelnanda-io/TransformerLens. A library for mechanistic interpretability of Transformers, with hooks for activation access and patching.

  • SAELens: https://github.com/jbloomAus/SAELens. Library for training and analyzing sparse autoencoders on language model activations.
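
To give a flavor of these libraries' APIs, a minimal Captum sketch of Integrated Gradients (the method from Sundararajan et al. above); the toy classifier and input shapes here are placeholders:

    import torch
    import torch.nn as nn
    from captum.attr import IntegratedGradients

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))  # placeholder classifier
    model.eval()

    x = torch.randn(8, 20, requires_grad=True)           # a batch of inputs
    baseline = torch.zeros_like(x)

    ig = IntegratedGradients(model)
    attributions, delta = ig.attribute(
        x, baselines=baseline, target=0, n_steps=50, return_convergence_delta=True
    )
    # `delta` measures how far the attributions are from exactly satisfying
    # completeness; small values indicate a good approximation.
    print(attributions.shape, delta.abs().max())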