Chapter 38: Further Reading

Foundational Texts

  • Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed. A comprehensive, practical guide covering SHAP, LIME, partial dependence, and more. Freely available at https://christophm.github.io/interpretable-ml-book/.

  • Olah, C. et al. (2020). "Zoom In: An Introduction to Circuits." Distill. An accessible introduction to the circuits approach in mechanistic interpretability, with interactive visualizations. Available at https://distill.pub/2020/circuits/.

Feature Attribution Methods

SHAP

  • Lundberg, S. M. and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." NeurIPS 2017. The foundational SHAP paper, unifying six existing attribution methods under the Shapley value framework.

  • Lundberg, S. M., Erion, G., Chen, H., et al. (2020). "From Local Explanations to Global Understanding with Explainable AI for Trees." Nature Machine Intelligence, 2, 56--67. TreeSHAP and its application to global explanations.
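
The workflow these papers describe is compact in practice. A minimal sketch, assuming the shap and scikit-learn packages are installed; the dataset and model are illustrative:

    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)

    # TreeExplainer implements the polynomial-time TreeSHAP algorithm.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X.iloc[:100])   # (100, n_features) attributions

    # Local accuracy: for each row, the attributions plus the expected value
    # sum to the model's prediction -- the efficiency property of Shapley values.
    print(shap_values.sum(axis=1)[:3] + explainer.expected_value)
    print(model.predict(X.iloc[:3]))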

LIME

  • Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "'Why Should I Trust You?' Explaining the Predictions of Any Classifier." KDD 2016. The original LIME paper, introducing local interpretable model-agnostic explanations.

  • Ribeiro, M. T., Singh, S., and Guestrin, C. (2018). "Anchors: High-Precision Model-Agnostic Explanations." AAAI 2018. A follow-up to LIME that provides high-precision, rule-based explanations and reports the coverage of each rule.
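
A minimal LIME sketch on tabular data, assuming the lime and scikit-learn packages; the dataset, feature names, and classifier are illustrative:

    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(random_state=0).fit(X, y)

    explainer = LimeTabularExplainer(
        X,
        feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"],
        class_names=["setosa", "versicolor", "virginica"],
        discretize_continuous=True,
    )

    # LIME perturbs the instance, queries the black box, and fits a weighted
    # sparse linear model locally; its coefficients are the explanation.
    exp = explainer.explain_instance(X[0], clf.predict_proba, num_features=4)
    print(exp.as_list())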

Gradient-Based Methods

  • Sundararajan, M., Taly, A., and Yan, Q. (2017). "Axiomatic Attribution for Deep Networks." ICML 2017. Integrated Gradients: gradient attribution derived from the sensitivity and implementation-invariance axioms, with attributions that satisfy completeness (they sum to the difference between the model's output at the input and at the baseline).

  • Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). "SmoothGrad: Removing Noise by Adding Noise." arXiv:1706.03825. Averaging gradients over noisy copies of the input reduces noise in saliency maps.
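
Both ideas are short enough to implement directly. A from-scratch PyTorch sketch of Integrated Gradients with a SmoothGrad-style noise average; the model is assumed to map a batch of inputs to class logits, and the step count and noise scale are illustrative:

    import torch

    def integrated_gradients(model, x, baseline, target, steps=50):
        # Riemann-sum approximation of IG_i = (x_i - x'_i) * integral of
        # dF/dx_i along the straight line from baseline x' to input x.
        alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
        path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
        out = model(path)[:, target].sum()
        grads = torch.autograd.grad(out, path)[0]       # dF/dx at each path point
        return (x - baseline) * grads.mean(dim=0)       # sums approximately to F(x) - F(baseline)

    def smoothgrad_ig(model, x, baseline, target, n_samples=20, sigma=0.1):
        # SmoothGrad: average the attribution map over noisy copies of the input.
        attrs = [integrated_gradients(model, x + sigma * torch.randn_like(x), baseline, target)
                 for _ in range(n_samples)]
        return torch.stack(attrs).mean(dim=0)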

Attention Analysis

  • Jain, S. and Wallace, B. C. (2019). "Attention is not Explanation." NAACL 2019. The influential paper arguing that attention weights often fail to provide faithful explanations: alternative attention distributions can yield the same predictions.

  • Wiegreffe, S. and Pinter, Y. (2019). "Attention is not not Explanation." EMNLP 2019. A nuanced response arguing that attention is more informative than Jain and Wallace suggested, under certain conditions.

  • Abnar, S. and Zuidema, W. (2020). "Quantifying Attention Flow in Transformers." ACL 2020. Introduced attention rollout and attention flow methods for tracing information through multi-layer Transformers.
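
Attention rollout in particular is easy to state in code. A minimal NumPy sketch, assuming attn holds the attention probabilities from one forward pass with shape (n_layers, n_heads, seq, seq):

    import numpy as np

    def attention_rollout(attn):
        # Average heads, add the identity to account for residual connections,
        # renormalize rows, and compose the per-layer matrices bottom to top.
        layer_mats = attn.mean(axis=1)                   # (n_layers, seq, seq)
        eye = np.eye(layer_mats.shape[-1])
        rollout = eye
        for a in layer_mats:
            a_hat = 0.5 * a + 0.5 * eye
            a_hat = a_hat / a_hat.sum(axis=-1, keepdims=True)
            rollout = a_hat @ rollout
        return rollout   # rollout[i, j]: how much of input token j reaches position i

Attention flow replaces this matrix product with a max-flow computation over the same graph, which is more faithful but more expensive.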

Probing

  • Belinkov, Y. (2022). "Probing Classifiers: Promises, Shortcomings, and Advances." Computational Linguistics, 48(1), 207--219. A comprehensive survey of probing methodology, limitations, and best practices.

  • Hewitt, J. and Liang, P. (2019). "Designing and Interpreting Probes with Control Tasks." EMNLP 2019. Introduced control tasks and the selectivity metric to separate what a representation encodes from what an expressive probe can learn on its own.

  • Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. (2018). "What You Can Cram into a Single $&!#* Vector: Probing Sentence Embeddings for Linguistic Properties." ACL 2018. Systematic probing of sentence representations for ten linguistic properties.
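
A minimal probing sketch in the spirit of these papers, using scikit-learn. Here reps and labels stand in for frozen model representations and the linguistic property of interest; for simplicity the control task shuffles labels rather than assigning them per word type as Hewitt and Liang do:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def probe_accuracy(reps, labels, seed=0):
        X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, random_state=seed)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return probe.score(X_te, y_te)

    def selectivity(reps, labels, seed=0):
        # High accuracy on randomized labels can only come from the probe
        # memorizing, so the gap estimates what the representation contributes.
        control = np.random.default_rng(seed).permutation(labels)
        return probe_accuracy(reps, labels) - probe_accuracy(reps, control)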

Mechanistic Interpretability

Superposition and Features

  • Elhage, N., Hume, T., Olsson, C., et al. (2022). "Toy Models of Superposition." Anthropic. The foundational paper on superposition, demonstrating in small models how networks can represent more features than they have dimensions by storing them as overlapping directions.

  • Templeton, A., Conerly, T., Marcus, J., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic. Demonstrated that sparse autoencoders can discover millions of interpretable features in production language models.

  • Bricken, T., Templeton, A., Batson, J., et al. (2023). "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning." Anthropic. The initial paper on using sparse autoencoders for feature discovery in language models.
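
The core architecture in these papers is small. A minimal PyTorch sketch of a sparse autoencoder trained on model activations; dimensions and the sparsity coefficient are illustrative, and details such as decoder weight normalization are omitted:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=512, d_dict=4096):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_dict)    # overcomplete dictionary
            self.decoder = nn.Linear(d_dict, d_model)

        def forward(self, acts):
            features = torch.relu(self.encoder(acts))    # sparse feature activations
            return self.decoder(features), features

    sae = SparseAutoencoder()
    acts = torch.randn(64, 512)                          # stand-in for residual-stream activations
    recon, features = sae(acts)
    # Reconstruction loss plus an L1 penalty that pushes features toward sparsity.
    loss = ((recon - acts) ** 2).mean() + 3e-4 * features.abs().sum(dim=-1).mean()
    loss.backward()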

Circuits

  • Olsson, C., Elhage, N., Nanda, N., et al. (2022). "In-context Learning and Induction Heads." Anthropic. Discovery and analysis of induction heads, a key circuit for in-context learning in Transformers.

  • Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2023). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." ICLR 2023. Reverse-engineering a complete circuit for a naturally occurring task in a pretrained language model (GPT-2 small).

  • Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. (2023). "Progress Measures for Grokking via Mechanistic Interpretability." ICLR 2023. Mechanistic analysis of how Transformers learn modular addition, revealing Fourier-based algorithms.
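
Induction heads come with a simple diagnostic: on a repeated random sequence, they attend from each token in the second half back to the token that followed its first occurrence. A sketch of that check using the TransformerLens library listed under Software below; the model choice and sequence length are illustrative:

    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    seq_len = 50
    first = torch.randint(1000, 20000, (1, seq_len))
    tokens = torch.cat([first, first], dim=1)            # repeated random tokens

    _, cache = model.run_with_cache(tokens)
    for layer in range(model.cfg.n_layers):
        pattern = cache["pattern", layer]                # (batch, heads, query, key)
        # Induction offset: query at position t attends to key at t - (seq_len - 1).
        diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
        print(layer, diag.mean(dim=-1).squeeze(0))       # per-head induction score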

Activation Patching and Causal Analysis

  • Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). "Locating and Editing Factual Associations in GPT." NeurIPS 2022. Introduced causal tracing and ROME for identifying and editing factual knowledge.

  • Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. (2023). "Mass-Editing Memory in a Transformer." ICLR 2023. MEMIT: scaling model editing to thousands of facts.

  • Geiger, A., Lu, H., Icard, T., and Potts, C. (2021). "Causal Abstractions of Neural Networks." NeurIPS 2021. A formal framework for causal analysis of neural network computations.
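
The basic patching loop behind this line of work fits in a few lines: run the model on a corrupted prompt while splicing in a cached clean activation, and measure how much of the clean behavior returns. A sketch using TransformerLens (see Software below); the prompts, layer, and patched position are illustrative:

    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gpt2")
    clean = model.to_tokens("The Eiffel Tower is located in the city of")
    corrupt = model.to_tokens("The Colosseum is located in the city of")

    _, clean_cache = model.run_with_cache(clean)

    def patch_resid(resid, hook, pos=4):
        # Overwrite the residual stream at one position with its clean value.
        resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
        return resid

    layer = 8
    patched_logits = model.run_with_hooks(
        corrupt,
        fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
    )
    # Compare the patched run with the clean and corrupted runs, e.g. via the
    # logit of the clean answer token at the final position.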

Surveys and Overviews

  • Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. (2023). "Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks." arXiv:2207.13243. A broad survey covering feature visualization, circuits, probing, and mechanistic interpretability.

  • Madsen, A., Reddy, S., and Chandar, S. (2022). "Post-hoc Interpretability for Neural NLP: A Survey." ACM Computing Surveys, 55(8). A thorough survey of interpretability methods for NLP models.

Software

  • SHAP Library: https://github.com/slundberg/shap. The reference implementation of SHAP, with TreeSHAP, KernelSHAP, and visualization tools.

  • Captum: https://captum.ai/. PyTorch library for model interpretability, implementing Integrated Gradients, GradientSHAP, LIME, and more.

  • TransformerLens: https://github.com/neelnanda-io/TransformerLens. A library for mechanistic interpretability of Transformers, with hooks for activation access and patching.

  • SAELens: https://github.com/jbloomAus/SAELens. Library for training and analyzing sparse autoencoders on language model activations.
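
To give a flavor of these libraries' APIs, a minimal Captum sketch of Integrated Gradients (the method from Sundararajan et al. above); the toy classifier and input shapes here are placeholders:

    import torch
    import torch.nn as nn
    from captum.attr import IntegratedGradients

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))  # placeholder classifier
    model.eval()

    x = torch.randn(8, 20, requires_grad=True)           # a batch of inputs
    baseline = torch.zeros_like(x)

    ig = IntegratedGradients(model)
    attributions, delta = ig.attribute(
        x, baselines=baseline, target=0, n_steps=50, return_convergence_delta=True
    )
    # `delta` measures how far the attributions are from exactly satisfying
    # completeness; small values indicate a good approximation.
    print(attributions.shape, delta.abs().max())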