Chapter 38 Exercises: Interpretability, Explainability, and Mechanistic Understanding

Conceptual Exercises

Exercise 1: Interpretability Taxonomy

Classify each of the following methods as (a) inherently interpretable, (b) post-hoc local explanation, (c) post-hoc global explanation, or (d) mechanistic interpretability. Justify each classification.

  1. Decision tree
  2. SHAP values for a single prediction
  3. SHAP summary plot across 1,000 predictions
  4. Probing classifier on hidden states
  5. Sparse autoencoder for feature discovery
  6. Logistic regression coefficients
  7. LIME explanation
  8. Activation patching
  9. Attention rollout
  10. ROME model editing

Exercise 2: SHAP Properties

a) State the four axioms that Shapley values satisfy (efficiency, symmetry, dummy, linearity). b) For a model $f(x_1, x_2) = 3x_1 + 2x_2 + 5$, compute exact SHAP values for the input $(x_1=2, x_2=3)$ with baseline $(x_1=0, x_2=0)$. c) Verify that the efficiency property holds for your answer. d) Why is exact Shapley value computation intractable for high-dimensional inputs?
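
For parts (a), (b), and (d), it helps to have the general Shapley value definition in front of you. In the notation below (the value function $v$ is our own shorthand), $F$ is the full feature set and $v(S)$ is the model's output when only the features in $S$ are set to their input values, with the rest held at the baseline:

$$
\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ v(S \cup \{j\}) - v(S) \right]
$$

The sum over all subsets of features is what makes exact computation expensive, which is relevant to part (d).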

Exercise 3: LIME Analysis

a) Explain why LIME explanations can be inconsistent across runs. b) How does the kernel width parameter affect LIME explanations? What happens with very small or very large kernel widths? c) Design a scenario where LIME would give a misleading explanation. (Hint: consider a model with strong feature interactions.) d) How would you choose between LIME and SHAP for explaining predictions from an XGBoost model?
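
For parts (a) and (b), recall LIME's objective from Ribeiro et al. (2016): the explanation is the interpretable model $g$ that minimizes a locality-weighted loss plus a complexity penalty $\Omega(g)$, where the proximity kernel $\pi_x$ contains the kernel width $\sigma$ referred to in part (b):

$$
\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g), \qquad \pi_x(z) = \exp\!\left( -\frac{D(x, z)^2}{\sigma^2} \right)
$$

The run-to-run inconsistency in part (a) enters through the random perturbations sampled to estimate $\mathcal{L}$.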

Exercise 4: Integrated Gradients

a) Explain the completeness axiom: $\sum_j \text{IG}_j(x) = f(x) - f(x')$. Why is this important? b) How does the choice of baseline affect Integrated Gradients attributions? For an image classifier, compare using a black image vs. a blurred image as the baseline. c) Why do vanilla gradients fail on ReLU networks? Give a concrete example where a feature clearly matters but has zero gradient. d) Implement a "noisy baseline" version of Integrated Gradients that averages over multiple random baselines. How does this compare to GradientSHAP?
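
For reference, Integrated Gradients along the straight-line path from baseline $x'$ to input $x$ is

$$
\text{IG}_j(x) = (x_j - x'_j) \int_0^1 \frac{\partial f\big(x' + \alpha (x - x')\big)}{\partial x_j} \, d\alpha,
$$

approximated in practice by a Riemann sum over a fixed number of steps along the path. The completeness axiom in part (a) follows from applying the gradient theorem to this path integral.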

Exercise 5: Attention Interpretation

a) In your own words, explain the Jain and Wallace (2019) finding that attention is not explanation. b) Design an experiment to test whether attention weights in a sentiment classifier are faithful explanations. What would you measure? c) When is it valid to interpret attention weights? List three conditions that make attention-based explanations more reliable. d) Compare attention rollout with raw attention weights from the last layer. Which is more reliable, and why?
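
For part (d), a minimal NumPy sketch of attention rollout may be useful, assuming you can extract per-layer attention matrices (for Hugging Face models, via `output_attentions=True`); the head-averaging and the 0.5 residual mixing follow Abnar and Zuidema (2020), while the function name is just illustrative:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer attention arrays, each of shape
    (num_heads, seq_len, seq_len), already softmax-normalized.
    Returns a (seq_len, seq_len) rolled-out attention matrix."""
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_attn in attentions:
        avg = layer_attn.mean(axis=0)                # average over heads
        avg = 0.5 * avg + 0.5 * np.eye(seq_len)      # account for the residual stream
        avg = avg / avg.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = avg @ rollout                      # compose with earlier layers
    return rollout
```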

Exercise 6: Probing Classifiers

a) A linear probe achieves 85% accuracy predicting part-of-speech tags from BERT's layer 4 representations. What can you conclude from this, and what can you not conclude? b) Explain the probing paradox: why might a successful probe not mean the model uses the encoded information? c) What is a control task, and how does it address the probing paradox? d) Design a probing experiment to test whether a language model's representations encode syntactic dependency distance.
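
For part (c), the sketch below illustrates the control-task idea of Hewitt and Liang (2019), assuming you already have per-token hidden states, gold POS tags, and word ids; the function name and the 80/20 split are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_with_control(reps, pos_labels, word_ids, n_tags, seed=0):
    """reps: (n_tokens, hidden_dim); pos_labels, word_ids: (n_tokens,)."""
    rng = np.random.default_rng(seed)
    split = int(0.8 * len(pos_labels))

    # Real probe: can the POS tag be linearly decoded from the representations?
    real = LogisticRegression(max_iter=1000).fit(reps[:split], pos_labels[:split])
    real_acc = real.score(reps[split:], pos_labels[split:])

    # Control task: each word *type* gets a consistent but random tag, so high
    # accuracy here reflects probe memorization rather than linguistic structure.
    random_tag = {w: rng.integers(n_tags) for w in np.unique(word_ids)}
    ctrl_labels = np.array([random_tag[w] for w in word_ids])
    ctrl = LogisticRegression(max_iter=1000).fit(reps[:split], ctrl_labels[:split])
    ctrl_acc = ctrl.score(reps[split:], ctrl_labels[split:])

    return real_acc, ctrl_acc, real_acc - ctrl_acc  # selectivity = real - control
```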

Exercise 7: Superposition

a) Explain why neural networks use superposition. What trade-off does it involve? b) If a model has 512 neurons and represents 2,000 features in superposition, what must be true about feature sparsity for this to work? c) Why does superposition make interpretability harder? What is polysemanticity? d) How do sparse autoencoders address the superposition problem?
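
A quick NumPy experiment can build intuition for parts (a) and (b): random directions in 512 dimensions are nearly, but not exactly, orthogonal, and the interference hitting any one feature grows with how many other features are active at the same time. The neuron and feature counts below mirror the exercise; everything else is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 512, 2000

# Random unit vectors standing in for feature directions packed in superposition.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Pairwise interference: off-diagonal dot products are small but nonzero.
overlaps = W @ W.T
np.fill_diagonal(overlaps, 0.0)
print("max |interference|:", np.abs(overlaps).max())  # small, but not 0

# With k features active at once, the total interference on one active feature
# grows with k -- sparsity (small k) is what makes the packing usable.
for k in (5, 50, 500):
    active = rng.choice(n_features, size=k, replace=False)
    noise = np.abs(overlaps[active[0], active[1:]]).sum()
    print(f"k={k:4d}  interference on one active feature: {noise:.3f}")
```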

Exercise 8: Model Editing

a) Describe the ROME procedure for editing a factual association in a Transformer. b) What is a "ripple effect" in model editing? Give an example. c) If you edit "The capital of France is Paris" to "The capital of France is Lyon," what related queries should also change? What queries should remain unchanged? d) Discuss the ethical implications of model editing. When is it appropriate and when is it dangerous?


Programming Exercises

Exercise 9: Implementing GradientSHAP

Implement GradientSHAP for a neural network classifier: a) Train a 3-layer MLP on a synthetic dataset with 10 features, 3 of which are informative. b) Implement the GradientSHAP algorithm using random baselines from the training data. c) Plot the mean absolute SHAP values for all features. Do the top 3 features match the informative ones? d) Compare with vanilla gradients. Which method correctly identifies the important features?
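
As a starting point for part (b), here is one possible PyTorch sketch, assuming `model` is an `nn.Module` returning logits and `baselines` is a tensor of training rows; the function name, the `n_samples` default, and the choice to explain the predicted class are assumptions rather than a fixed recipe:

```python
import torch

def gradient_shap(model, x, baselines, n_samples=50):
    """x: (n_features,) input to explain; baselines: (n_baselines, n_features).
    Returns (n_features,) attributions for the model's predicted class."""
    target = model(x.unsqueeze(0)).argmax(dim=-1).item()
    attributions = torch.zeros_like(x)
    for _ in range(n_samples):
        # Sample a random baseline and a random point on the path toward x.
        base = baselines[torch.randint(len(baselines), (1,))].squeeze(0)
        alpha = torch.rand(1)
        point = (base + alpha * (x - base)).detach().requires_grad_(True)

        # Gradient of the target logit at the interpolated point.
        out = model(point.unsqueeze(0))[0, target]
        grad, = torch.autograd.grad(out, point)

        # Expected-gradients estimate: average grad * (x - baseline).
        attributions += grad * (x - base) / n_samples
    return attributions.detach()
```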

Exercise 10: LIME for Text

Implement a simplified LIME explainer for text classification: a) Train a simple text classifier (bag-of-words + logistic regression) on positive/negative movie reviews. b) Implement LIME by randomly masking words, getting predictions, and fitting a linear model. c) For 5 example reviews, show the top 5 most important words according to LIME. d) Verify the explanations against your intuition. Do they make sense?
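
A compact sketch of part (b), assuming `predict_proba` takes a list of strings and returns an array of P(positive); the sampling scheme, distance measure, and ridge surrogate are simplifications rather than the `lime` library's exact procedure:

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_text(predict_proba, words, n_samples=1000, kernel_width=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Binary masks: 1 keeps a word, 0 removes it; row 0 is the original review.
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1

    texts = [" ".join(w for w, m in zip(words, row) if m) for row in masks]
    preds = np.asarray(predict_proba(texts))

    # Weight samples by proximity to the original (fraction of words removed).
    distances = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # Fit a weighted linear surrogate; its coefficients are the explanation.
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return sorted(zip(words, surrogate.coef_), key=lambda t: -abs(t[1]))
```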

Exercise 11: Probing a Transformer

a) Train or load a pre-trained Transformer model (you may use a small one for efficiency). b) Extract hidden representations from each layer for a dataset labeled with a known property (e.g., part-of-speech tags or sentiment). c) Train linear probes at each layer. Plot probe accuracy vs. layer number. d) At which layer does the information peak? Does it decline in later layers?
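
One way to set up parts (a)-(c), assuming a small Hugging Face encoder (the model name below is only an example) and a list of labeled sentences; using the first token's vector as the sentence representation is a simplifying choice:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

name = "prajjwal1/bert-tiny"  # example small model; any encoder works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

def layer_representations(sentences):
    """Return per-layer first-token vectors, shape (n_layers+1, n_sentences, dim)."""
    reps = []
    with torch.no_grad():
        for s in sentences:
            out = model(**tokenizer(s, return_tensors="pt"))
            # hidden_states: embeddings plus one tensor per layer, each (1, seq, dim)
            reps.append(torch.stack(out.hidden_states)[:, 0, 0, :])
    return torch.stack(reps, dim=1).numpy()

def probe_accuracy_per_layer(sentences, labels):
    reps = layer_representations(sentences)
    split = int(0.8 * len(labels))
    accs = []
    for layer_reps in reps:  # one linear probe per layer
        clf = LogisticRegression(max_iter=1000).fit(layer_reps[:split], labels[:split])
        accs.append(clf.score(layer_reps[split:], labels[split:]))
    return accs  # plot against layer index for part (c)
```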

Exercise 12: Building a Sparse Autoencoder

a) Train a simple language model (2-layer Transformer) on synthetic data. b) Collect activations from the MLP layer. c) Train a sparse autoencoder on these activations with varying sparsity coefficients. d) Examine the top-activating examples for the 10 most active features. Are they interpretable?
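
A minimal PyTorch sketch for part (c); the expansion factor, L1 coefficient, and training loop are illustrative defaults rather than tuned settings:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, expansion=8):
        super().__init__()
        self.encoder = nn.Linear(d_model, expansion * d_model)
        self.decoder = nn.Linear(expansion * d_model, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # intended to be sparse
        return self.decoder(features), features

def train_sae(activations, d_model, l1_coef=1e-3, epochs=10, lr=1e-3):
    """activations: (n_samples, d_model) tensor collected from the MLP layer."""
    sae = SparseAutoencoder(d_model)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(activations, batch_size=256, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            recon, feats = sae(batch)
            # Reconstruction loss plus L1 sparsity penalty on the features.
            loss = ((recon - batch) ** 2).mean() + l1_coef * feats.abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return sae
```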

Exercise 13: Activation Patching Experiment

a) Train a model on a task where you control the data generating process (e.g., "output the larger of two numbers"). b) Implement activation patching across all layers. c) Identify which layers are most causally important for the task. d) Does the causal importance align with probing accuracy across layers?
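
For part (b), PyTorch forward hooks are one way to cache and patch activations; the helper names below are hypothetical, and `layer_module` stands for whatever sub-module (e.g., one Transformer block) you want to intervene on:

```python
import torch

def cache_activation(model, layer_module, clean_input):
    """Run the model on the clean input and save this layer's output."""
    cache = {}
    def save_hook(module, inputs, output):
        cache["act"] = output.detach()
    handle = layer_module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_input)
    handle.remove()
    return cache["act"]

def patched_logit(model, layer_module, clean_act, corrupted_input, target):
    """Run on the corrupted input, but overwrite this layer's output with the
    clean activation; return the target logit to see how much is restored."""
    def patch_hook(module, inputs, output):
        return clean_act  # returned value replaces the module's output
    handle = layer_module.register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            logits = model(corrupted_input)
    finally:
        handle.remove()
    return logits[0, target].item()
```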

Exercise 14: Comparing Attribution Methods

For a single model and dataset: a) Compute attributions using vanilla gradients, Integrated Gradients, and GradientSHAP. b) Measure the overlap between the top-5 attributed features across methods. c) Perform a perturbation test: mask the top-K features and measure the drop in model confidence. Which attribution method identifies the most impactful features? d) Report the computational cost (wall-clock time) of each method.
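
For part (c), a simple deletion-style check is sketched below, assuming a tabular model with an sklearn-like `predict_proba` and a per-feature `baseline` vector to mask with; replacing masked features with baseline values (rather than zeros or noise) is one of several reasonable conventions:

```python
import numpy as np

def perturbation_drop(predict_proba, x, attributions, baseline, ks=(1, 3, 5)):
    """Mask the top-k attributed features and record the confidence drop."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    order = np.argsort(-np.abs(attributions))          # most important first
    target = int(np.argmax(predict_proba(x[None, :])[0]))
    original = predict_proba(x[None, :])[0, target]

    drops = {}
    for k in ks:
        masked = x.copy()
        masked[order[:k]] = baseline[order[:k]]        # remove top-k features
        drops[k] = original - predict_proba(masked[None, :])[0, target]
    return drops  # larger drop => the attribution found more impactful features
```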