Chapter 38: Key Takeaways
Core Concepts
- Interpretability is a spectrum from practical explanations to fundamental understanding. Feature attribution methods (SHAP, LIME, Integrated Gradients) explain individual predictions; probing classifiers test what information representations encode; mechanistic interpretability reverse-engineers the algorithms neural networks learn.
- Interpretability is not optional for high-stakes applications. Healthcare, criminal justice, and finance require explanations for automated decisions. Beyond compliance, interpretability enables debugging, trust calibration, and scientific discovery.
- The accuracy-interpretability trade-off is often overstated. Post-hoc methods can explain complex models without sacrificing accuracy. Mechanistic interpretability aims for full transparency of unmodified models.
Feature Attribution
- SHAP is the most principled attribution method. Grounded in Shapley values from cooperative game theory, SHAP satisfies the efficiency axiom (attributions sum to the difference between the prediction and the expected baseline output) along with symmetry, dummy, and linearity. KernelSHAP is model-agnostic, TreeSHAP is exact for tree models, and GradientSHAP targets neural networks (see the SHAP sketch after this list).
- LIME provides model-agnostic local explanations. It fits a weighted linear surrogate to the black-box model's predictions on perturbed samples around an input, yielding intuitive per-feature weights. However, results can be inconsistent across runs and may miss nonlinear interactions (see the LIME sketch after this list).
- Integrated Gradients fix the problems of vanilla gradients. By integrating gradients along a straight-line path from a baseline to the input, Integrated Gradients satisfy the completeness axiom (attributions sum to the difference between the output at the input and at the baseline) and avoid the saturation problem vanilla gradients suffer in ReLU networks (see the sketch after this list).
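A minimal TreeSHAP sketch, assuming the open-source `shap` package and scikit-learn are installed; the regression data and model are illustrative stand-ins. The final check demonstrates the efficiency axiom: the base value plus the attributions recovers each prediction.

```python
# Minimal sketch: TreeSHAP attributions for a small tree ensemble, plus a check
# of the efficiency axiom (base value + attributions ~= model output).
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # exact Shapley values for tree models
shap_values = explainer.shap_values(X[:10])  # shape (10, 5): per-feature attributions

reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X[:10])))  # True: efficiency holds
```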
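A from-scratch sketch of the LIME idea rather than the official `lime` package: perturb around the input, weight samples by proximity, and fit a weighted linear surrogate. The Gaussian perturbation scheme, kernel width, and toy black-box function are illustrative assumptions.

```python
# LIME-style local surrogate: perturb, weight by proximity, fit a linear model.
import numpy as np
from sklearn.linear_model import Ridge

def lime_tabular(predict_fn, x, n_samples=2000, kernel_width=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=1.0, size=(n_samples, x.shape[0]))  # perturbed inputs
    y = predict_fn(Z)                                  # black-box outputs
    d = np.linalg.norm(Z - x, axis=1)                  # distance to the explained point
    w = np.exp(-(d ** 2) / (kernel_width ** 2))        # proximity kernel weights
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
    return surrogate.coef_                             # local feature weights

# Toy usage with any black-box scorer (here a simple nonlinear function):
black_box = lambda Z: np.sin(Z[:, 0]) + 0.5 * Z[:, 1] ** 2
print(lime_tabular(black_box, np.array([0.3, -1.2, 0.0])))
```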
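A minimal Integrated Gradients sketch in PyTorch, using a zero baseline, a straight-line path, and a simple Riemann-sum approximation of the path integral; the tiny model and step count are illustrative, and production implementations (e.g. Captum) add batching and convergence checks.

```python
# Minimal Integrated Gradients sketch: average gradients along a straight-line
# path from a baseline to the input, scaled by (input - baseline).
import torch

def integrated_gradients(model, x, baseline=None, steps=64):
    baseline = torch.zeros_like(x) if baseline is None else baseline
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)       # interpolated inputs, (steps, *x.shape)
    path.requires_grad_(True)
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)       # completeness: sums to ~F(x) - F(baseline)

# Toy usage: the attribution total should approximate the output difference.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
x = torch.randn(4)
attr = integrated_gradients(model, x)
print(attr.sum().item(), (model(x) - model(torch.zeros_like(x))).item())
```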
Attention and Probing
- Attention weights are not reliable explanations. Jain and Wallace (2019) showed that substantially different attention distributions can yield essentially the same predictions. Attention rollout partially addresses this by accounting for residual connections when aggregating attention across layers, but interpretation still requires care (a rollout sketch follows this list).
- Probing classifiers test information encoding, not usage. A successful linear probe shows that information is linearly decodable from the representation, but does not prove the model uses that information. Control tasks (probes trained on random labels) are essential for establishing a meaningful baseline (a probing sketch follows this list).
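A minimal attention-rollout sketch over per-layer attention matrices, following the head-averaging and residual-mixing recipe of Abnar and Zuidema (2020); the random attention weights stand in for a real model's.

```python
# Attention rollout: average heads, mix in the identity for the residual path,
# renormalize rows, and compose layer matrices from the bottom up.
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer arrays of shape (n_heads, seq, seq)."""
    seq = attentions[0].shape[-1]
    rollout = np.eye(seq)
    for layer_att in attentions:
        a = layer_att.mean(axis=0)                # average over heads
        a = 0.5 * a + 0.5 * np.eye(seq)           # account for the residual connection
        a /= a.sum(axis=-1, keepdims=True)        # rows sum to one again
        rollout = a @ rollout                     # compose with earlier layers
    return rollout                                # effective token-to-token attention

# Toy usage with random attention weights standing in for a real model's.
rng = np.random.default_rng(0)
fake = [rng.dirichlet(np.ones(6), size=(4, 6)) for _ in range(3)]  # 3 layers, 4 heads, seq len 6
print(attention_rollout(fake).round(2))
```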
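A minimal probing sketch on synthetic "representations" where the label signal lives along a single direction; the shuffled-label control here is a simplification of the control tasks of Hewitt and Liang (2019), which use random but consistent label mappings.

```python
# Linear probe vs. a shuffled-label control on synthetic representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
reps = rng.normal(size=(1000, 64))      # stand-in for frozen model representations
reps[:, 0] += 2.0 * labels              # the label signal lives along one direction

def probe_accuracy(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

real = probe_accuracy(reps, labels)
control = probe_accuracy(reps, rng.permutation(labels))   # control: random labels
print(f"probe={real:.2f}  control={control:.2f}  selectivity={real - control:.2f}")
```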
Mechanistic Interpretability
- Neural networks use superposition to represent more features than they have neurons. Features are encoded as nearly orthogonal directions in activation space, exploiting high-dimensional geometry and feature sparsity. This produces polysemanticity (neurons responding to multiple unrelated concepts) and is a core challenge for interpretability (a geometric sketch follows this list).
- Activation patching identifies causally important components. Running the model on a corrupted input while substituting one component's activation with the activation recorded on a clean input tests whether that component is causally responsible for the model's behavior, going beyond correlational methods (a hook-based sketch follows this list).
- Sparse autoencoders decompose superposition into interpretable features. An overcomplete autoencoder trained with an L1 sparsity penalty on its hidden activations learns individual features from model activations. Anthropic's research has demonstrated that this can surface millions of interpretable features in large language models (a minimal sketch follows this list).
- Circuits are the building blocks of model computation. Mechanistic interpretability has identified specific circuits in Transformers, such as induction heads (pattern matching for in-context learning) and indirect object identification circuits. Understanding these circuits advances our ability to predict and control model behavior.
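A numerical sketch of the geometric point behind superposition, with an arbitrary choice of 512 dimensions and 2,000 random unit directions: far more directions than dimensions can coexist while staying nearly orthogonal.

```python
# Random unit directions in high dimensions are nearly orthogonal, so an
# activation space can host far more feature directions than it has neurons.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 2000                       # 2000 "features" in a 512-dim space
W = rng.normal(size=(n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

cos = W @ W.T                                   # pairwise cosine similarities
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])
print(f"mean |cos| = {off_diag.mean():.3f}, max |cos| = {off_diag.max():.3f}")
```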
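A hook-based activation-patching sketch in PyTorch on a toy MLP; the module choice, inputs, and model are illustrative. In this sequential toy the patch fully restores the clean output, whereas a real analysis would compare a behavioral metric (e.g. a logit difference) across many patched components.

```python
# Activation patching with forward hooks: cache a module's activation on the
# clean input, then rerun the corrupted input with that activation patched in.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)
target = model[2]                                # the component being tested

# 1) Record the target's activation on the clean run (hook returns None).
cache = {}
handle = target.register_forward_hook(lambda m, i, out: cache.update(act=out.detach()))
clean_out = model(clean)
handle.remove()

# 2) Rerun on the corrupted input; returning a tensor from the hook replaces
#    the module's output with the cached clean activation.
handle = target.register_forward_hook(lambda m, i, out: cache["act"])
patched_out = model(corrupted)
handle.remove()

print(model(corrupted).item(), patched_out.item(), clean_out.item())
```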
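A minimal sparse-autoencoder sketch in PyTorch on random stand-in activations; the dictionary width, L1 coefficient, and training loop are placeholder choices, and real pipelines add details such as decoder-norm constraints and dead-feature resampling.

```python
# Overcomplete autoencoder with an L1 penalty on the hidden code, trained to
# reconstruct (here: random stand-in) activations.
import torch

class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_hidden)
        self.dec = torch.nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))               # sparse feature activations
        return self.dec(f), f

torch.manual_seed(0)
acts = torch.randn(4096, 64)                       # stand-in for cached model activations
sae = SparseAutoencoder(d_model=64, d_hidden=512)  # 8x overcomplete dictionary
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(200):
    recon, f = sae(acts)
    loss = torch.nn.functional.mse_loss(recon, acts) + 1e-3 * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item(), (f > 0).float().mean().item())  # reconstruction loss, feature density
```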
Model Editing
- Model editing enables targeted knowledge modification. ROME applies rank-one updates to MLP weights to change specific factual associations, and MEMIT extends this to mass editing. However, current methods suffer from ripple effects (failing to update logically related facts) and limited scalability (a simplified sketch follows).
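A heavily simplified rank-one edit sketch in NumPy, not the actual ROME derivation (which chooses the update by solving a least-squares problem against a covariance estimate of the keys the layer sees). It only illustrates the core idea: a rank-one change can redirect one key vector to a new value while being the smallest such change in the Frobenius-norm sense.

```python
# Minimum-norm rank-one edit: choose dW so that (W + dW) @ k == v_new while
# keeping ||dW|| as small as possible. ROME's actual update additionally
# accounts for the statistics of keys seen by the edited layer.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))          # stand-in for an MLP projection matrix
k = rng.normal(size=16)                # key vector for the edited subject
v_new = rng.normal(size=32)            # value vector encoding the new association

dW = np.outer(v_new - W @ k, k) / (k @ k)
W_edited = W + dW

print(np.allclose(W_edited @ k, v_new))   # the edited key now maps to the new value
print(np.linalg.matrix_rank(dW))          # and the change to W is rank one
```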
Practical Guidance
- Use multiple interpretability methods and validate. No single method gives the full picture. Combine feature attributions with perturbation tests, use probing alongside activation patching, and always validate explanations against domain knowledge.
- Choose the right tool for the audience and goal. Stakeholders need SHAP waterfall plots, not sparse autoencoder analyses. Debugging requires different tools than compliance. Match the interpretability method to the question being asked and the person who needs the answer.