Chapter 38 Quiz

Test your understanding of interpretability and explainability in AI. Each question has one best answer unless stated otherwise.


Question 1

What is the key difference between interpretability and explainability?

  • A) Interpretability is for models; explainability is for data
  • B) Interpretability refers to inherently understandable models; explainability refers to post-hoc methods for black-box models
  • C) They are synonyms used interchangeably
  • D) Interpretability is qualitative; explainability is quantitative
**Answer: B) Interpretability refers to inherently understandable models; explainability refers to post-hoc methods for black-box models.** Interpretable models (linear regression, decision trees) are understandable by design. Explainability methods (SHAP, LIME) provide post-hoc explanations for models that are not inherently interpretable.

Question 2

SHAP values are based on which concept from game theory?

  • A) Nash equilibrium
  • B) Shapley values from cooperative game theory
  • C) Minimax strategy
  • D) Prisoner's dilemma payoffs
**Answer: B) Shapley values from cooperative game theory.** SHAP values use Shapley values, which assign each "player" (feature) a fair share of the "payout" (prediction) based on their marginal contribution across all possible coalitions.

Question 3

Which property of Shapley values guarantees that attributions sum to the difference between the prediction and the baseline?

  • A) Symmetry
  • B) Linearity
  • C) Efficiency
  • D) Dummy
**Answer: C) Efficiency.** The efficiency axiom states that $\sum_j \phi_j(x) = f(x) - E[f(X)]$, meaning the attributions for all features sum exactly to the difference between the model's prediction and the average prediction.
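
As a quick sanity check, the sketch below (assuming the `shap` and scikit-learn packages; the model and data are purely illustrative) verifies the efficiency axiom numerically: the attributions for one instance plus the explainer's expected value recover the model's prediction.

```python
# A sanity check of the efficiency axiom with shap (illustrative model/data).
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X[:1])[0]        # attributions for one instance
prediction = model.predict(X[:1])[0]
baseline = explainer.expected_value          # E[f(X)] as estimated by the explainer

# Efficiency: attributions sum to prediction minus baseline (up to float error).
print(np.isclose(phi.sum(), prediction - baseline))
```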

Question 4

What makes KernelSHAP different from TreeSHAP?

  • A) KernelSHAP is faster
  • B) KernelSHAP is model-agnostic; TreeSHAP is specific to tree-based models
  • C) KernelSHAP computes exact Shapley values; TreeSHAP approximates them
  • D) KernelSHAP only works for regression; TreeSHAP works for classification
**Answer: B) KernelSHAP is model-agnostic; TreeSHAP is specific to tree-based models.** KernelSHAP uses weighted linear regression to approximate Shapley values for any model, while TreeSHAP exploits tree structure to compute exact Shapley values in polynomial time.
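
The sketch below (assuming the `shap` package; the synthetic data and model are illustrative) shows how the two explainers are typically invoked: KernelSHAP needs only a prediction function and background samples, while TreeSHAP takes the fitted tree model directly.

```python
# KernelSHAP (model-agnostic) vs. TreeSHAP (tree-specific) with the shap package.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Model-agnostic: needs only a prediction function plus background samples,
# and approximates Shapley values via weighted linear regression.
kernel_explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X, 50))
kernel_sv = kernel_explainer.shap_values(X[:1])

# Tree-specific: exploits the tree structure to compute exact values efficiently.
tree_explainer = shap.TreeExplainer(model)
tree_sv = tree_explainer.shap_values(X[:1])
```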

Question 5

In the LIME algorithm, what is the role of the kernel function?

  • A) To transform input features into a higher-dimensional space
  • B) To weight perturbed samples by their proximity to the original input
  • C) To regularize the interpretable model
  • D) To select which features to include
**Answer: B) To weight perturbed samples by their proximity to the original input.** The kernel function (typically exponential) assigns higher weights to perturbed samples that are more similar to the original input, ensuring the linear explanation is most faithful in the local neighborhood.
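
Below is a hand-rolled sketch of this weighting using NumPy only; `kernel_width` is a hyperparameter, and the exponential form mirrors the kind of kernel LIME uses by default.

```python
# LIME-style proximity weighting of perturbed samples (NumPy only).
import numpy as np

def proximity_weights(x, perturbed, kernel_width=0.75):
    # Distance of each perturbed sample from the original input.
    d = np.linalg.norm(perturbed - x, axis=1)
    # Exponential kernel: nearby samples get weights near 1, distant ones near 0.
    return np.sqrt(np.exp(-(d ** 2) / kernel_width ** 2))

x = np.array([1.0, 2.0])
perturbed = x + np.random.normal(scale=0.5, size=(100, 2))
weights = proximity_weights(x, perturbed)
# These weights go into the weighted linear regression fit on the perturbed samples.
```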

Question 6

Why do vanilla gradients produce noisy attributions for ReLU networks?

  • A) ReLU networks are not differentiable
  • B) Gradients are zero in saturated regions, producing incomplete and noisy attributions
  • C) The gradient computation is numerically unstable
  • D) Vanilla gradients cannot handle multi-class outputs
**Answer: B) Gradients are zero in saturated regions, producing incomplete and noisy attributions.** ReLU activations have zero gradient when the input is negative (saturated region). This means features that have been "gated off" by ReLU receive zero attribution, even if they were important in determining the activation pattern.
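
A tiny PyTorch example makes this concrete: a feature whose pre-activation is negative is gated off by ReLU and receives exactly zero gradient.

```python
# A feature gated off by ReLU receives exactly zero gradient (PyTorch).
import torch

x = torch.tensor([2.0, -3.0], requires_grad=True)
y = torch.relu(x).sum()   # the second entry is in the flat (zero-gradient) region
y.backward()
print(x.grad)             # tensor([1., 0.]): the second feature gets no attribution
```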

Question 7

What is the completeness axiom of Integrated Gradients?

  • A) All features must receive non-zero attribution
  • B) Attributions sum to $f(x) - f(x')$, the difference between input and baseline outputs
  • C) The integral must converge for any input
  • D) Attributions are invariant to the choice of baseline
**Answer: B) Attributions sum to $f(x) - f(x')$, the difference between input and baseline outputs.** Completeness guarantees that Integrated Gradients produces a complete accounting of the prediction: the sum of all feature attributions exactly equals the difference between the output for the input and the output for the baseline.
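
The sketch below approximates Integrated Gradients with a Riemann sum in PyTorch (the two-layer model is illustrative) and checks completeness numerically.

```python
# Integrated Gradients via a Riemann sum, plus a numerical completeness check (PyTorch).
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
x, baseline = torch.randn(4), torch.zeros(4)

def integrated_gradients(model, x, baseline, steps=256):
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)   # (steps, 1)
    points = baseline + alphas * (x - baseline)             # straight-line path
    points.requires_grad_(True)
    model(points).sum().backward()
    avg_grad = points.grad.mean(dim=0)                      # average gradient along the path
    return (x - baseline) * avg_grad

attr = integrated_gradients(model, x, baseline)
# Completeness: sum of attributions ~= f(x) - f(baseline), up to discretization error.
print(attr.sum().item(), (model(x) - model(baseline)).item())
```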

Question 8

According to Jain and Wallace (2019), why is attention NOT a reliable explanation?

  • A) Attention is too computationally expensive to interpret
  • B) Different attention patterns can produce identical outputs
  • C) Attention weights are always uniform
  • D) Attention only applies to the first layer
**Answer: B) Different attention patterns can produce identical outputs.** Jain and Wallace showed that adversarial attention distributions exist that are very different from the learned attention weights yet produce the same predictions, demonstrating that attention weights are not uniquely determined by the model's behavior.

Question 9

What does attention rollout address that raw attention weights do not?

  • A) Edge cases with very long sequences
  • B) The composition of attention across layers and residual connections
  • C) The softmax temperature parameter
  • D) Multi-head attention aggregation
**Answer: B) The composition of attention across layers and residual connections.** Raw attention from a single layer ignores the fact that information flows through multiple layers via residual connections. Attention rollout multiplicatively combines attention matrices across layers (with identity residual corrections) to trace how information flows from input to output.
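
Here is a NumPy sketch of one common rollout formulation: average over heads, mix in the identity for the residual connection, renormalize, and multiply the per-layer matrices together.

```python
# Attention rollout: compose per-layer attention with residual (identity) mixing (NumPy).
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer arrays of shape (heads, seq_len, seq_len)."""
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)                 # average over heads
        a = 0.5 * a + 0.5 * np.eye(seq_len)         # account for the residual connection
        a = a / a.sum(axis=-1, keepdims=True)       # renormalize rows
        rollout = a @ rollout                       # compose with earlier layers
    return rollout  # rollout[i, j]: estimated flow from input token j to position i
```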

Question 10

What is a probing classifier?

  • A) A model trained to find bugs in other models
  • B) A simple classifier trained on frozen representations to test what information they encode
  • C) A method for automatically testing model robustness
  • D) A technique for reducing model size
**Answer: B) A simple classifier trained on frozen representations to test what information they encode.** A probing classifier (typically linear) is trained on top of frozen model representations to predict a specific property (e.g., POS tags, syntax). Success indicates the information is encoded in the representations.
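
A minimal sketch with scikit-learn; `reps` and `pos_tags` are random stand-ins for activations extracted from a frozen model and the property being probed.

```python
# A linear probe on frozen representations (scikit-learn); inputs are random stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

reps = np.random.randn(1000, 768)            # stand-in for frozen hidden states
pos_tags = np.random.randint(0, 17, 1000)    # stand-in for per-token POS labels

X_tr, X_te, y_tr, y_te = train_test_split(reps, pos_tags, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # high accuracy => property is linearly decodable
```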

Question 11

What is the probing paradox?

  • A) Probes work better on worse models
  • B) A successful probe does not prove the model uses the encoded information
  • C) Probes cannot work on deep networks
  • D) Linear probes are always as good as MLP probes
**Answer: B) A successful probe does not prove the model uses the encoded information.** A representation might incidentally encode information (e.g., syntactic structure) without the model ever using it for its downstream task. The probe detects encoding, not usage.

Question 12

How do control tasks address the probing paradox?

  • A) By testing the probe on held-out data
  • B) By training the probe to predict random labels; if accuracy is similar, the probe is too powerful
  • C) By using a simpler model for probing
  • D) By testing the probe on adversarial examples
**Answer: B) By training the probe to predict random labels; if accuracy is similar, the probe is too powerful.** Hewitt and Liang (2019) proposed control tasks where the probe predicts random (non-linguistic) labels. If the probe achieves similar accuracy on random labels as on linguistic labels, the probe's expressive power, not the representation's information content, explains the result.
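
A sketch of the selectivity comparison follows; the representations and labels here are random stand-ins, and in practice the control labels are assigned randomly but consistently per word type.

```python
# Control-task selectivity: probe accuracy on real labels minus accuracy on random labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(reps, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

reps = np.random.randn(2000, 256)                 # stand-in for frozen representations
real_labels = np.random.randint(0, 10, 2000)      # stand-in for linguistic labels
control_labels = np.random.randint(0, 10, 2000)   # random labels with the same cardinality

selectivity = probe_accuracy(reps, real_labels) - probe_accuracy(reps, control_labels)
# Low selectivity suggests the probe's own capacity, not the representation, drives accuracy.
```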

Question 13

What is superposition in the context of neural networks?

  • A) When multiple models are ensembled
  • B) When the network represents more features than it has neurons, using nearly orthogonal directions
  • C) When gradients from multiple losses are combined
  • D) When attention heads attend to the same positions
**Answer: B) When the network represents more features than it has neurons, using nearly orthogonal directions.** Superposition allows networks to represent many more features than their dimensionality by encoding features as nearly orthogonal directions, exploiting the fact that high-dimensional spaces have many approximately orthogonal directions and features are sparse.
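
A quick NumPy illustration of why this works: random unit vectors in a few hundred dimensions are nearly orthogonal, so far more "features" than dimensions can coexist with little interference.

```python
# In high dimensions, random unit vectors are nearly orthogonal (NumPy).
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 4096                      # 8x more "features" than dimensions
V = rng.normal(size=(n_features, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])
print("max |cosine| between distinct directions:", off_diag.max())  # small, vs. 1.0 for full overlap
```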

Question 14

What is polysemanticity?

  • A) A model producing multiple outputs
  • B) A single neuron participating in representing multiple unrelated features
  • C) A word having multiple meanings
  • D) Multiple models solving the same task
**Answer: B) A single neuron participating in representing multiple unrelated features.** Polysemanticity is a consequence of superposition: because features are distributed across neurons, a single neuron may activate for unrelated concepts (e.g., a neuron that responds to both "cat images" and "car text").

Question 15

What is the purpose of activation patching?

  • A) To fix bugs in model weights
  • B) To identify which components are causally responsible for a specific behavior
  • C) To speed up model inference
  • D) To reduce model size
**Answer: B) To identify which components are causally responsible for a specific behavior.** Activation patching replaces a component's activation on a corrupted input with its activation from a clean input. If this restores the original behavior, that component is causally important.
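
A PyTorch sketch of patching a single module with forward hooks; `model` and `layer` are placeholders for your own network and the component under test.

```python
# Activation patching with PyTorch forward hooks; `model` and `layer` are placeholders.
import torch

def patch_and_run(model, layer, clean_input, corrupted_input):
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()        # record the clean activation

    def patch_hook(module, inputs, output):
        return cache["clean"]                   # returning a value overrides the output

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_input)                      # pass 1: cache the clean activation
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched = model(corrupted_input)        # pass 2: corrupted run with the patch
    handle.remove()
    return patched                              # compare against the unpatched corrupted run
```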

Question 16

In a sparse autoencoder for mechanistic interpretability, why is the latent dimension typically larger than the input dimension?

  • A) For better compression
  • B) Because there are more interpretable features than neurons (overcomplete representation)
  • C) To increase reconstruction accuracy
  • D) For numerical stability
**Answer: B) Because there are more interpretable features than neurons (overcomplete representation).** Due to superposition, the number of distinct features in a model's activations exceeds the activation dimension. An overcomplete SAE (e.g., 4x to 64x larger latent space) can decompose superposed activations into individual interpretable features.
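
A minimal PyTorch sketch of such an SAE with an 8x overcomplete latent space and an L1 sparsity penalty; all dimensions and coefficients are illustrative.

```python
# A minimal overcomplete sparse autoencoder with an L1 sparsity penalty (PyTorch).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, expansion=8):
        super().__init__()
        d_latent = expansion * d_model               # overcomplete: more features than neurons
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, acts):
        z = torch.relu(self.encoder(acts))           # sparse, non-negative feature activations
        return self.decoder(z), z

sae = SparseAutoencoder()
acts = torch.randn(32, 768)                          # stand-in for model activations
recon, z = sae(acts)
l1_coef = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coef * z.abs().sum(dim=-1).mean()
```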

Question 17

What are induction heads?

  • A) The first layer of a neural network
  • B) Attention heads that implement pattern matching: after seeing [A][B]...[A], they predict [B]
  • C) Neurons that detect edges in images
  • D) Heads that determine which tokens to attend to first
**Answer: B) Attention heads that implement pattern matching: after seeing [A][B]...[A], they predict [B].** Olsson et al. (2022) identified induction heads as a key circuit in Transformer language models. They implement a simple algorithm: find an earlier occurrence of the current token and copy the token that followed it, enabling in-context learning.

Question 18

What does ROME (Rank-One Model Editing) modify in a Transformer?

  • A) The attention weights
  • B) The embedding layer
  • C) MLP layer weights via a rank-one update
  • D) The layer normalization parameters
**Answer: C) MLP layer weights via a rank-one update.** ROME identifies the MLP layer that stores a specific factual association and applies a rank-one update to its weight matrix to change the stored fact while minimally affecting other behaviors.
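
The following NumPy sketch is not the ROME procedure itself, only the algebra of a rank-one update: adding $u v^\top$ changes the matrix along a single direction, leaving inputs orthogonal to $v$ untouched.

```python
# The algebra of a rank-one update (NumPy); not the full ROME procedure.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))                       # stand-in for an MLP weight matrix
v = rng.normal(size=128); v /= np.linalg.norm(v)     # "key" direction selecting the fact
u = rng.normal(size=64)                              # change written into the output ("value")

W_edited = W + np.outer(u, v)                        # rank-one edit

x_orth = rng.normal(size=128)
x_orth -= (x_orth @ v) * v                           # project out the key direction
print(np.allclose(W @ x_orth, W_edited @ x_orth))    # True: unrelated inputs are unaffected
```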

Question 19

What is a "ripple effect" in model editing?

  • A) Edits propagate backward through the network
  • B) Editing one fact fails to update logically entailed facts
  • C) The model's output oscillates after editing
  • D) Editing causes training instability
**Answer: B) Editing one fact fails to update logically entailed facts.** Editing "The Eiffel Tower is in London" should also change the answer to "What country is the Eiffel Tower in?" from "France" to "England." Failure to update related facts is a ripple effect.

Question 20

Which interpretability method is most appropriate for explaining a single prediction to a non-technical stakeholder?

  • A) Sparse autoencoder analysis
  • B) Activation patching
  • C) SHAP with a waterfall plot
  • D) Probing classifiers
**Answer: C) SHAP with a waterfall plot.** SHAP waterfall plots visually show which features pushed the prediction up or down, providing an intuitive explanation accessible to non-technical audiences. Mechanistic methods (SAEs, patching) are too technical for most stakeholders.
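
A sketch using shap's Explanation and plotting API (the regression model and bundled dataset are illustrative):

```python
# A SHAP waterfall plot for a single prediction (illustrative regression model).
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes(as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(data.data, data.target)

explainer = shap.Explainer(model, data.data)   # dispatches to a tree explainer here
explanation = explainer(data.data.iloc[:1])
shap.plots.waterfall(explanation[0])           # features pushing this one prediction up or down
```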

Question 21

What is the purpose of the L1 penalty in a sparse autoencoder's loss function?

  • A) Preventing overfitting
  • B) Encouraging sparsity in the latent activations so each input activates few features
  • C) Ensuring the decoder weights have unit norm
  • D) Improving reconstruction quality
**Answer: B) Encouraging sparsity in the latent activations so each input activates few features.** The L1 penalty $\lambda \|\mathbf{z}\|_1$ encourages most latent dimensions to be zero for any given input, ensuring that each input is explained by a small number of active features, which aids interpretability.

Question 22

Which of the following is NOT a valid criticism of using attention weights as explanations?

  • A) Attention is not unique (multiple patterns give the same output)
  • B) Attention does not imply causal importance
  • C) Attention weights always sum to 1, creating zero-sum competition
  • D) Attention is only computed in the final layer
**Answer: D) Attention is only computed in the final layer.** Attention is computed in every layer of a Transformer, not just the final layer, so this statement is simply false rather than a criticism. The other three options are all valid criticisms of interpreting attention as explanation.

Question 23

When performing a perturbation test to validate feature attributions, what should you measure?

  • A) Whether the attribution values are positive
  • B) The drop in model confidence when top-attributed features are removed
  • C) The correlation between attributions across different methods
  • D) The training loss after perturbation
**Answer: B) The drop in model confidence when top-attributed features are removed.** A good attribution method should identify the most important features. Removing (masking) the top-K attributed features should cause a larger drop in confidence than removing random features. This validates that the attributions capture genuine importance.
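
A sketch of such a deletion test; `model`, `x`, and `attributions` are placeholders, and zero-imputation is a crude stand-in for feature removal.

```python
# A deletion-style perturbation test; `model`, `x`, and `attributions` are placeholders.
import numpy as np

def confidence_drops(model, x, attributions, k=5, fill_value=0.0, target_class=1):
    base = model.predict_proba(x.reshape(1, -1))[0, target_class]

    top_k = np.argsort(-np.abs(attributions))[:k]             # most important features
    rand_k = np.random.choice(len(x), size=k, replace=False)  # random comparison set

    def masked_confidence(idx):
        x_masked = x.copy()
        x_masked[idx] = fill_value                            # crude "removal" by imputation
        return model.predict_proba(x_masked.reshape(1, -1))[0, target_class]

    return base - masked_confidence(top_k), base - masked_confidence(rand_k)
# A faithful attribution method should yield a much larger drop for top_k than for rand_k.
```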

Question 24

What is the relationship between SHAP and Integrated Gradients?

  • A) They are identical algorithms
  • B) Averaging Integrated Gradients over baselines sampled from the data distribution approximates SHAP values
  • C) SHAP is a generalization of Integrated Gradients to non-differentiable models
  • D) There is no relationship
**Answer: B) Averaging Integrated Gradients over baselines sampled from the data distribution approximates SHAP values.** GradientSHAP (a SHAP variant) computes expected gradients by averaging Integrated Gradients over multiple baselines sampled from the training distribution, connecting the two frameworks.

Question 25

What is the primary safety motivation for mechanistic interpretability?

  • A) To make models faster
  • B) To reduce model size
  • C) To understand and verify model behavior before deployment in critical systems
  • D) To improve model accuracy
**Answer: C) To understand and verify model behavior before deployment in critical systems.** The safety case for mechanistic interpretability is that understanding *how* models work, not just *what* they output, is essential for verifying they behave safely, detecting deceptive or misaligned behavior, and ensuring they remain aligned with human values.