Chapter 35: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
Shapley Values and SHAP Fundamentals
Exercise 35.1 (*)
A cooperative game has three players $\{A, B, C\}$ with the following value function:
| Coalition | $v(S)$ |
|---|---|
| $\emptyset$ | 0 |
| $\{A\}$ | 4 |
| $\{B\}$ | 2 |
| $\{C\}$ | 0 |
| $\{A, B\}$ | 10 |
| $\{A, C\}$ | 6 |
| $\{B, C\}$ | 5 |
| $\{A, B, C\}$ | 15 |
(a) Compute the Shapley value for each player by enumerating all permutations and computing the marginal contribution of each player in each permutation.
(b) Verify the efficiency axiom: do the Shapley values sum to $v(\{A, B, C\}) - v(\emptyset) = 15$?
(c) Player $C$ contributes 0 when alone. Is $C$ a dummy player? Why or why not?
(d) This game has a synergy between $A$ and $B$: $v(\{A, B\}) - v(\{A\}) - v(\{B\}) = 4$. How is this synergy reflected in the Shapley values?
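The permutation enumeration in part (a) takes only a few lines of Python; the value function below encodes the table above, keyed by frozensets of players:

```python
from itertools import permutations

# Value function v(S) from the table above.
v = {
    frozenset(): 0, frozenset("A"): 4, frozenset("B"): 2, frozenset("C"): 0,
    frozenset("AB"): 10, frozenset("AC"): 6, frozenset("BC"): 5,
    frozenset("ABC"): 15,
}

def shapley_values(players, v):
    """Average each player's marginal contribution over all join orders."""
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = frozenset()
        for p in order:
            phi[p] += v[coalition | {p}] - v[coalition]
            coalition = coalition | {p}
    return {p: total / len(perms) for p, total in phi.items()}

print(shapley_values("ABC", v))  # {'A': 7.0, 'B': 5.5, 'C': 2.5}
```

The three values sum to 15, as the efficiency check in part (b) requires.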
Exercise 35.2 (*)
A credit scoring model uses 5 features: income, debt_to_income, credit_history_years, num_credit_inquiries, and employment_years. For a specific denied applicant, the TreeSHAP values are:
| Feature | SHAP Value | Feature Value |
|---|---|---|
| income | -0.02 | $68,000 |
| debt_to_income | +0.15 | 52% |
| credit_history_years | +0.08 | 1.5 years |
| num_credit_inquiries | +0.04 | 7 |
| employment_years | -0.03 | 4 years |
The base value (expected prediction) is 0.12 (probability of default).
(a) What is the model's prediction for this applicant? Verify using the efficiency axiom.
(b) Which feature contributes most to the denial? Write an ECOA-compliant adverse action reason for this feature.
(c) The applicant asks: "What can I do to get approved?" Based on the SHAP values, which feature would provide the most benefit if improved? Is this a causal claim? Why or why not?
(d) Write the top-4 adverse action reasons, ordered by SHAP magnitude, in the format required by Regulation B.
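A quick sanity check for parts (a) and (d): the efficiency axiom pins down the prediction, and sorting by absolute SHAP value orders the reasons. (Whether favorable factors belong in an adverse action notice is part of what part (d) asks you to decide.)

```python
base_value = 0.12
shap_values = {
    "income": -0.02, "debt_to_income": 0.15, "credit_history_years": 0.08,
    "num_credit_inquiries": 0.04, "employment_years": -0.03,
}

# Efficiency axiom: prediction = base value + sum of SHAP values.
prediction = base_value + sum(shap_values.values())
print(f"predicted default probability: {prediction:.2f}")  # 0.34

# Candidate adverse action reasons, ordered by |SHAP value|.
top4 = sorted(shap_values, key=lambda f: abs(shap_values[f]), reverse=True)[:4]
print(top4)  # debt_to_income first, with |SHAP| = 0.15
```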
Exercise 35.3 (*)
Using the shap library, compute TreeSHAP explanations for an XGBoost model trained on the UCI Adult Income dataset.
(a) Train an XGBoost classifier to predict whether income exceeds $50,000. Report test AUC.
(b) Compute TreeSHAP values for 1,000 test instances. Generate a SHAP summary plot (beeswarm plot). Which three features have the highest mean absolute SHAP value?
(c) Generate a SHAP dependence plot for the top feature. Is the relationship monotonic? Are there interaction effects (indicated by the color of the dots)?
(d) Select one instance predicted as high-income and one predicted as low-income. Generate SHAP waterfall plots for both. Explain the key differences in terms a non-technical audience would understand.
Exercise 35.4 (**)
Compare TreeSHAP's two modes — tree-path-dependent and interventional — on a dataset with highly correlated features.
(a) Generate a synthetic dataset with 10,000 samples and 5 features where features 1 and 2 are correlated (Pearson $r = 0.9$) and features 3-5 are independent. The target $y = 3x_1 + 2x_2 + x_3 + \epsilon$.
(b) Train an XGBoost model. Compute TreeSHAP values using feature_perturbation="tree_path_dependent" and feature_perturbation="interventional" for 500 test instances.
(c) Compare the mean absolute SHAP values for features 1 and 2 under both modes. Which mode attributes more importance to feature 2? Explain why in terms of the conditional vs. marginal distribution.
(d) Which mode would you recommend for: (i) a regulatory audit where you need to identify which features the model actually uses, (ii) a causal analysis where you want to know the effect of intervening on each feature?
Exercise 35.5 (**)
KernelSHAP approximates Shapley values by weighted linear regression. Implement a simplified version from scratch.
(a) For a model with $p = 4$ features and a single instance, enumerate all $2^4 = 16$ coalition vectors $z \in \{0, 1\}^4$.
(b) For each coalition vector, compute the SHAP kernel weight $\pi(z) = \frac{(p-1)}{\binom{p}{|z|} \cdot |z| \cdot (p - |z|)}$. Handle the edge cases $|z| = 0$ and $|z| = p$ by assigning a large weight (e.g., $10^6$).
(c) Assume a model prediction function $f$ and a background dataset. For each coalition vector, construct the masked input (replacing absent features with background values), evaluate $f$, and collect the (coalition, prediction) pairs.
(d) Fit a weighted linear regression using the SHAP kernel weights to recover the Shapley values. Compare your results with the shap.KernelExplainer output. They should match within sampling noise.
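The kernel weights in part (b) can be tabulated directly; note the symmetry $\pi(|z|) = \pi(p - |z|)$, which the sketch below makes visible for $p = 4$:

```python
from math import comb

def shap_kernel_weight(p, s, edge_weight=1e6):
    """SHAP kernel weight for a coalition of size s out of p features."""
    if s == 0 or s == p:        # fully absent / fully present coalitions:
        return edge_weight      # a large weight enforces the two constraints
    return (p - 1) / (comb(p, s) * s * (p - s))

p = 4
weights = {s: shap_kernel_weight(p, s) for s in range(p + 1)}
print(weights)  # sizes 1 and 3 get 0.25, size 2 gets 0.125
```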
LIME, PDP, and ALE
Exercise 35.6 (*)
Apply LIME to the same XGBoost model from Exercise 35.3.
(a) Compute LIME explanations for 100 test instances. For each instance, record the top-5 feature attributions.
(b) Run LIME twice on the same 100 instances with different random seeds. For how many instances do the top-3 features change? Compute the Kendall rank correlation between the two runs' feature rankings for each instance. Report the mean and standard deviation of the correlation.
(c) Compare LIME and TreeSHAP attributions for 10 instances. For each instance, compute the Spearman rank correlation between the two methods' feature rankings. Are they consistent?
(d) Based on your results, would you recommend LIME for: (i) a one-time model audit, (ii) real-time adverse action notices, (iii) internal model debugging? Justify each answer.
Exercise 35.7 (*)
Generate PDP, ALE, and ICE plots for the top 3 features from Exercise 35.3.
(a) For each feature, generate a PDP and overlay ICE curves for 200 random instances. Are the ICE curves parallel (indicating no interaction) or divergent (indicating interactions)?
(b) Generate ALE plots for the same 3 features. Compare the ALE and PDP shapes. For which features do they differ most, and why?
(c) Identify a pair of features with the highest interaction (use SHAP interaction values or visual inspection of ICE plots). Generate a 2D PDP for this pair. Describe the interaction in one paragraph.
Exercise 35.8 (**)
PDP can be misleading when features are correlated. Demonstrate this.
(a) Create a synthetic dataset where $x_1 \sim \text{Uniform}(0, 10)$, $x_2 = x_1 + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 0.5)$, and $y = x_1 \cdot x_2$. Train a gradient-boosted tree model.
(b) Generate the PDP for $x_1$. The PDP marginalizes over all values of $x_2$, including values of $x_2$ that never co-occur with extreme values of $x_1$ in the data. Identify a specific region of the PDP where the plot shows predictions based on unrealistic $(x_1, x_2)$ combinations.
(c) Generate the ALE plot for $x_1$. Does the ALE avoid the extrapolation problem? Explain mechanically how ALE's binning approach handles correlated features differently from PDP.
(d) When should you use PDP, and when should you use ALE? Write a decision rule that a junior data scientist could follow.
Gradient-Based Methods and Captum
Exercise 35.9 (*)
Use Captum to compute integrated gradients for a simple PyTorch MLP trained on the Iris dataset.
(a) Train a 3-class MLP (2 hidden layers, 64 units, ReLU) on the Iris dataset. Report test accuracy.
(b) For 10 test instances (at least 3 from each class), compute integrated gradients with a zero baseline and 50 interpolation steps. Which features receive the highest absolute attribution for each class?
(c) Change the baseline from zero to the training set mean. Do the attributions change? Explain why baseline choice matters for integrated gradients.
(d) Verify the completeness axiom: for each instance, check that the sum of the integrated gradient attributions equals $f(x) - f(x')$ within numerical tolerance.
Exercise 35.10 (**)
Compare saliency maps, integrated gradients, and GradCAM for a pretrained ResNet-18 on 5 images from ImageNet.
(a) Load a pretrained ResNet-18 from torchvision. For each image, compute: (i) vanilla gradient saliency, (ii) integrated gradients (50 steps, black baseline), (iii) GradCAM (target layer: layer4[-1]).
(b) For each image, overlay all three attribution maps on the original image. Which method produces the most visually coherent (least noisy) attribution?
(c) Apply SmoothGrad (10 noise samples, std=0.1) to the vanilla gradient saliency. Does it improve the visual quality? Why does averaging over noisy copies reduce high-frequency noise in saliency maps?
(d) For one image, change the target class from the predicted class to the second-most-likely class. How do the attribution maps change? What does this tell you about class-conditional attribution?
Exercise 35.11 (**)
Implement integrated gradients from scratch (without Captum).
(a) Write a function integrated_gradients(model, input, baseline, n_steps) that: (i) generates n_steps linearly interpolated inputs between the baseline and the input, (ii) computes the gradient of the model output with respect to each interpolated input, (iii) numerically integrates the gradients using the trapezoidal rule, (iv) multiplies by $(x - x')$.
(b) Apply your implementation to the Iris MLP from Exercise 35.9. Compare your results with Captum's IntegratedGradients. They should match within $10^{-3}$.
(c) The completeness axiom requires that $\sum_j \text{IG}_j(x) = f(x) - f(x')$. Test this on 100 instances. What is the maximum absolute error? How does the error change as you increase n_steps from 10 to 50 to 200?
(d) Integrated gradients depend on the path from baseline to input. The standard implementation uses a straight-line path. Propose an alternative path (e.g., following the data manifold) and discuss whether it would produce more meaningful attributions.
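Before wiring your implementation to the MLP, it helps to exercise it on a function with a closed-form answer. For $f(x) = \sum_j x_j^2$ the exact attribution is $x_j^2 - x_j'^2$, so completeness holds exactly; a NumPy sketch using a hand-rolled trapezoidal rule:

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """IG along the straight-line path, trapezoidal-rule integration."""
    alphas = np.linspace(0.0, 1.0, n_steps + 1)
    path = baseline + alphas[:, None] * (x - baseline)   # (n_steps+1, d)
    grads = np.array([grad_fn(p) for p in path])
    deltas = np.diff(alphas)[:, None]
    # Integral of the gradient over alpha in [0, 1].
    avg_grad = np.sum((grads[:-1] + grads[1:]) / 2 * deltas, axis=0)
    return (x - baseline) * avg_grad

# Toy model with an analytic gradient: f(x) = sum(x**2), grad f = 2x.
f = lambda x: float(np.sum(x**2))
grad_f = lambda x: 2 * x

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros(3)
ig = integrated_gradients(grad_f, x, baseline, n_steps=200)

# Completeness: attributions sum to f(x) - f(baseline).
print(ig, ig.sum(), f(x) - f(baseline))  # [1. 4. 9.], 14.0, 14.0
```

The gradient along the path is linear in $\alpha$ here, so the trapezoidal rule is exact even with few steps; for a nonlinear network, part (c) asks you to measure how the error shrinks as `n_steps` grows.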
Exercise 35.12 (***)
Implement GradCAM from scratch for a CNN.
(a) Train a small CNN (3 conv layers, 2 FC layers) on CIFAR-10. Report test accuracy.
(b) Implement gradcam(model, input, target_class, target_layer) that: (i) registers forward and backward hooks on the target layer, (ii) performs a forward pass and computes the class score, (iii) performs a backward pass to get gradients of the class score with respect to the target layer activations, (iv) computes channel importance weights via global average pooling of the gradients, (v) computes the weighted sum of activation maps, (vi) applies ReLU (to keep only positive contributions).
(c) Generate GradCAM heatmaps for 20 correctly classified images and 10 misclassified images. Is there a qualitative difference in where the model "looks" for correct vs. incorrect predictions?
(d) GradCAM produces a coarse heatmap (resolution of the final conv layer). Apply bilinear upsampling to match the input resolution. Discuss the limitations of this upsampling for fine-grained attribution.
Concept-Based Explanations
Exercise 35.13 (**)
Implement a simplified TCAV analysis for a pretrained image classifier.
(a) Define two concepts for a pretrained ResNet-18: "striped texture" and "round shape." Collect 50 examples of each concept from texture/shape datasets (or generate synthetic examples).
(b) For each concept, train a linear classifier in the activation space of layer3 to separate concept examples from random examples. Report the classification accuracy of each CAV (it should be well above chance).
(c) Compute the TCAV score for the "striped texture" concept for the "zebra" class and the "goldfish" class. The score should be high for "zebra" and low for "goldfish."
(d) Run the statistical significance test with 10 random CAVs. Is the "striped texture" TCAV score significantly different from 0.5 for the "zebra" class?
Exercise 35.14 (***)
Build a concept bottleneck model for a tabular dataset.
(a) Using the UCI Heart Disease dataset, define 5 clinical concepts: (i) blood pressure risk (derived from resting BP), (ii) cholesterol risk (derived from serum cholesterol), (iii) heart rate abnormality (derived from max heart rate), (iv) exercise intolerance (derived from exercise-induced angina and ST depression), (v) vessel disease (derived from number of major vessels colored by fluoroscopy).
(b) Train the concept predictor: a neural network that maps the 13 raw features to 5 concept scores. Train with the concept labels you defined in (a).
(c) Train the task predictor: a logistic regression that maps the 5 concept scores to the disease prediction. Report test AUC.
(d) Compare the CBM's AUC with a standard MLP trained end-to-end on the raw features. What is the accuracy gap? Is the gap justified by the interpretability gain?
(e) Simulate a "concept intervention": for 20 test instances, manually override one concept prediction (e.g., set "blood pressure risk" to the correct value) and observe the change in the final prediction. How many predictions change?
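The intervention in part (e) is just a forward pass through the task predictor with one concept overridden. A sketch with hypothetical concept weights (the actual weights come from your logistic regression in part (c)):

```python
import math

def predict_from_concepts(concepts, weights, bias):
    """Logistic task predictor over concept scores."""
    z = bias + sum(weights[k] * concepts[k] for k in weights)
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted weights for the five concepts.
weights = {"bp_risk": 1.2, "chol_risk": 0.8, "hr_abnormal": 1.0,
           "exercise_intol": 1.5, "vessel_disease": 2.0}
predicted = {"bp_risk": 0.9, "chol_risk": 0.3, "hr_abnormal": 0.2,
             "exercise_intol": 0.1, "vessel_disease": 0.0}

before = predict_from_concepts(predicted, weights, bias=-2.0)
# Intervention: a clinician corrects the mispredicted blood-pressure concept.
corrected = {**predicted, "bp_risk": 0.2}
after = predict_from_concepts(corrected, weights, bias=-2.0)
print(before, after)  # the prediction moves with the corrected concept
```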
Exercise 35.15 (***)
Design a concept-based explanation system for a pharmaceutical application.
(a) You are building a model to predict drug response for a cancer therapy. The model uses 50 molecular and clinical features. Define 8 clinically meaningful concepts (e.g., "tumor mutation burden," "immune infiltration," "prior treatment response") that an oncologist would recognize.
(b) Describe how you would collect concept annotations. What is the annotation cost? How many annotations per concept would you need for a reliable CAV?
(c) Design the TCAV analysis pipeline: which layer(s) to probe, how many random CAV runs for the statistical test, how to present results to clinicians.
(d) A clinician says: "I trust the model when it uses concepts I recognize, but I don't trust it when it doesn't." Design a concept completeness metric that measures what fraction of the model's prediction is explained by the defined concepts. (Hint: compare the TCAV scores of real concepts with random concepts.)
Counterfactual Explanations
Exercise 35.16 (*)
Generate counterfactual explanations for a credit scoring model.
(a) Train a logistic regression on the German Credit dataset. Identify 10 instances that are classified as "bad credit."
(b) For each instance, generate a counterfactual explanation using gradient-based optimization with the following constraints: age is immutable, all features must stay within observed data ranges, and the counterfactual must change the prediction to "good credit" (probability > 0.5).
(c) For each counterfactual, report: the number of features changed, the L1 distance, and whether the counterfactual is plausible (all feature values within 2 standard deviations of the training data).
(d) Write an actionable explanation for one applicant based on the counterfactual: "If you [specific action], your application would likely be approved."
Exercise 35.17 (**)
Counterfactual explanations should respect causal structure.
(a) Define a simple causal graph for credit scoring: $\text{Education} \to \text{Income} \to \text{Debt-to-Income}$; $\text{Age} \to \text{Credit History Length}$; $\text{Employment Years}$ is exogenous.
(b) A naive counterfactual changes income from $40,000 to $65,000 while keeping debt_to_income at 55%. Is this causally consistent? If income increases, what should happen to debt-to-income (assuming debt stays constant)?
(c) Implement a causal counterfactual generator that, when changing a parent feature, propagates the change to its descendants using the structural equations of the causal graph.
(d) Compare a causally consistent counterfactual with a naive counterfactual for the same instance. Which requires more features to change? Which is more actionable?
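The propagation step in part (c) can be sketched with the toy graph from part (a) and an assumed structural equation: holding total debt fixed, debt_to_income = total_debt / income, so changing the parent must recompute the child:

```python
def propagate(features, changes, total_debt):
    """Apply a counterfactual change, propagating through the structural
    equation debt_to_income = total_debt / income (debt held fixed)."""
    cf = dict(features)
    cf.update(changes)
    if "income" in changes:
        # Income is a causal parent of debt_to_income: recompute the child
        # instead of leaving it at its factual value.
        cf["debt_to_income"] = total_debt / cf["income"]
    return cf

factual = {"income": 40_000, "debt_to_income": 0.55}
naive = {**factual, "income": 65_000}          # causally inconsistent: DTI frozen
causal = propagate(factual, {"income": 65_000}, total_debt=0.55 * 40_000)
print(causal["debt_to_income"])  # 22,000 / 65,000 ≈ 0.338
```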
Exercise 35.18 (**)
Evaluate the diversity and plausibility of counterfactual explanations.
(a) For one denied applicant, generate 20 diverse counterfactual explanations by running the optimization with different random initializations.
(b) Cluster the 20 counterfactuals. How many distinct "paths to approval" exist (e.g., "reduce debt," "increase credit history," "increase income")?
(c) For each counterfactual, compute a plausibility score: the log-likelihood under a Gaussian density estimated on the training data. Rank the counterfactuals by plausibility. Are the most plausible counterfactuals also the most actionable?
(d) Design a system that presents the applicant with the top-3 most diverse and plausible counterfactuals, representing three different "paths to approval."
Production Infrastructure
Exercise 35.19 (**)
Design an explanation monitoring system.
(a) Define 5 metrics for monitoring explanation quality over time: (i) mean number of unique features appearing in top-4 explanations per day, (ii) distribution shift in the most-common top factor, (iii) explanation computation time p50/p95/p99, (iv) percentage of explanations where the efficiency axiom is violated by > 1%, (v) one additional metric of your choice.
(b) For each metric, define an alerting threshold and an escalation policy.
(c) Describe a scenario where explanation monitoring would catch a problem that prediction monitoring (Chapter 30) would miss.
(d) Implement the monitoring metrics as a Python class that ingests a stream of ExplanationResponse objects and computes rolling statistics.
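A skeleton for part (d). `ExplanationResponse` is represented here as a plain dict with assumed keys (`top_factors`, `latency_ms`, `efficiency_error`), since the actual class is defined elsewhere in the chapter:

```python
from collections import Counter, deque

class ExplanationMonitor:
    """Rolling statistics over a stream of explanation responses."""

    def __init__(self, window=10_000):
        self.latencies = deque(maxlen=window)
        self.top_factor_counts = Counter()
        self.unique_features = set()
        self.n_total = 0
        self.n_efficiency_violations = 0

    def ingest(self, response):
        self.n_total += 1
        self.latencies.append(response["latency_ms"])
        self.unique_features.update(response["top_factors"])
        self.top_factor_counts[response["top_factors"][0]] += 1
        if abs(response["efficiency_error"]) > 0.01:   # metric (iv)
            self.n_efficiency_violations += 1

    def summary(self):
        lat = sorted(self.latencies)
        pct = lambda q: lat[min(len(lat) - 1, int(q * len(lat)))]
        return {
            "n_unique_top_features": len(self.unique_features),
            "most_common_top_factor": self.top_factor_counts.most_common(1)[0][0],
            "latency_p50": pct(0.50), "latency_p95": pct(0.95),
            "latency_p99": pct(0.99),
            "efficiency_violation_rate": self.n_efficiency_violations / self.n_total,
        }
```

The percentile computation here is a crude index lookup; a production version would use a streaming quantile sketch so the window never has to be fully sorted.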
Exercise 35.20 (**)
Build a natural language explanation generator for three audiences.
(a) Define a feature-to-plain-English mapping for a credit scoring model with 20 features.
(b) Implement three NL templates:
- Applicant: "The primary factors in this decision were: 1. [factor] 2. [factor] ..."
- Underwriter: "Score: [score]. Key risk drivers: [factor]: [SHAP value], ..."
- Regulator: "Model [version], method [SHAP variant], prediction [value]. Full attribution: ..."
(c) Test your generator on 50 denied applicants. For each, generate all three formats. Verify that the applicant format never reveals raw SHAP values and the regulator format always includes the model version and method.
(d) A product manager asks: "Can we use GPT-4 to generate more natural-sounding explanations?" Discuss the risks of using a language model to generate regulatory-grade explanations. Consider faithfulness, consistency, hallucination, and auditability.
Exercise 35.21 (***)
Design and implement an explanation API service.
(a) Define a REST API with the following endpoints:
- POST /explain — generate an explanation for a single instance
- GET /explain/{request_id} — retrieve a previously generated explanation
- GET /audit?model_version={v}&start={t1}&end={t2} — retrieve audit log entries
- GET /global/{model_id} — retrieve global feature importance
(b) Implement the POST /explain endpoint using FastAPI. The endpoint should: accept the instance as JSON, load the correct model version, compute TreeSHAP values, format the explanation for the specified audience, log to the audit trail, and return the response in < 50ms for a model with 200 features.
(c) Add a caching layer: if the same (model_version, input_hash) pair has been explained before, return the cached explanation. What invalidation strategy is appropriate?
(d) Load test the endpoint: generate 1,000 concurrent requests and report p50, p95, p99 latency. Identify the bottleneck.
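The cache key in part (c) can be built from a canonical JSON serialization of the instance. A minimal sketch (the LRU eviction policy is an assumption, not a requirement of the exercise):

```python
import hashlib
import json
from collections import OrderedDict

class ExplanationCache:
    """LRU cache keyed by (model_version, hash of the canonicalized input)."""

    def __init__(self, max_entries=100_000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    @staticmethod
    def key(model_version, instance):
        # sort_keys makes the hash invariant to JSON key ordering.
        payload = json.dumps(instance, sort_keys=True).encode()
        return (model_version, hashlib.sha256(payload).hexdigest())

    def get(self, model_version, instance):
        k = self.key(model_version, instance)
        if k in self._store:
            self._store.move_to_end(k)       # mark as recently used
            return self._store[k]
        return None

    def put(self, model_version, instance, explanation):
        k = self.key(model_version, instance)
        self._store[k] = explanation
        self._store.move_to_end(k)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

One answer to the invalidation question falls out of the key design: because the model version is part of the key, deploying a new model can never serve stale explanations; entries for retired versions simply age out of the LRU.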
Exercise 35.22 (***)
Implement a complete audit trail system with hash chain integrity.
(a) Implement the ExplanationAuditLogger from Section 35.10 with an SQLite storage backend. Each entry should include the fields defined in AuditLogEntry.
(b) Implement hash chain verification: a function that reads all entries and verifies that each entry's previous_entry_hash matches the hash of the previous entry.
(c) Simulate a tampering attack: modify one entry in the middle of the chain. Verify that your verification function detects the tampering and identifies the first corrupted entry.
(d) Compute storage requirements: if Meridian Financial processes 15,000 applications per day and retains audit logs for 25 months, how many entries and how many gigabytes of storage are needed? Assume 2 KB per entry.
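The core of parts (a) through (c) is the chained hash; a minimal in-memory sketch (the SQLite backend and the AuditLogEntry fields from Section 35.10 are omitted here):

```python
import hashlib
import json

GENESIS = "0" * 64

def entry_hash(entry):
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

def append(chain, payload):
    prev = entry_hash(chain[-1]) if chain else GENESIS
    chain.append({"payload": payload, "previous_entry_hash": prev})

def verify(chain):
    """Return the index of the first corrupted entry, or None if intact."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        if entry["previous_entry_hash"] != prev:
            return i
        prev = entry_hash(entry)
    return None

chain = []
for i in range(5):
    append(chain, {"request_id": i})
assert verify(chain) is None

chain[2]["payload"]["request_id"] = 999   # tamper with the middle of the chain
print(verify(chain))  # 3: entry 3's back-pointer no longer matches entry 2
```

For part (d), at roughly 30.4 days per month, 25 months is about 760 days, so 15,000 entries/day gives about 11.4 million entries, or roughly 23 GB at 2 KB per entry.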
Integration and Research-Level Problems
Exercise 35.23 (***)
Combine fairness (Chapter 31) and explainability.
(a) Using the Meridian Financial credit scoring model, compute TreeSHAP values for 5,000 denied applicants, disaggregated by race.
(b) For each racial group, compute the mean SHAP value for each feature. Which features have the largest difference in mean SHAP value across racial groups?
(c) If zip_code has a high mean SHAP value for Black applicants but a low mean SHAP value for White applicants, what does this suggest about proxy discrimination? How does this relate to the proxy discrimination analysis in Chapter 31?
(d) Design a "fairness-aware explanation" system that: (i) detects when a denial explanation is driven primarily by proxy features, (ii) flags such cases for human review, and (iii) provides the reviewer with a comparison of the applicant's explanation to the average explanation for their demographic group.
Exercise 35.24 (***)
Combine uncertainty quantification (Chapter 34) and explainability.
(a) For a conformal prediction interval (Chapter 34), the model produces both a prediction and an uncertainty estimate. Can SHAP explain why the model is uncertain (not just why the prediction is high or low)?
(b) Implement "SHAP for uncertainty": compute SHAP values where the value function is the model's predictive uncertainty (e.g., the width of the conformal interval or the standard deviation of an ensemble). Which features contribute most to high uncertainty?
(c) For a credit scoring model, generate a combined explanation: "Your predicted default probability is 0.18. The main factors driving this prediction are [X, Y, Z]. The model's uncertainty about this prediction is [low/moderate/high], primarily because [A, B]."
(d) Discuss whether "explaining uncertainty" is more or less useful to an applicant than "explaining the prediction." When would each be more valuable?
Exercise 35.25 (****)
Adversarial attacks on explanation methods (Slack et al., 2020).
(a) Implement a "scaffolded model" that: (i) detects whether an input is a real instance or a SHAP/LIME perturbation (based on the distributional properties of perturbations), (ii) when it detects a real instance, uses a biased classifier (e.g., one that uses zip code as a proxy for race), (iii) when it detects a perturbation, uses a fair classifier (one that does not use zip code).
(b) Show that SHAP and LIME explanations of the scaffolded model attribute importance to the features used by the fair classifier, not the biased classifier that actually drives predictions on real data.
(c) Propose a defense: a detection method that identifies scaffolded models by comparing explanations of real instances with explanations of realistic (in-distribution) perturbations. Implement and test your defense.
(d) Discuss the implications for regulatory compliance. If a lender deploys a scaffolded model, the ECOA-mandated adverse action notice would be based on the fair classifier's features, not the biased features actually used. What technical audit procedures could detect this?
Exercise 35.26 (****)
Explanation methods for large language models.
(a) For a fine-tuned BERT model on a sentiment classification task, compute integrated gradients at the token embedding level. Visualize the attribution for 5 sentences, highlighting tokens with positive (pushing toward positive sentiment) and negative (pushing toward negative sentiment) attributions.
(b) Compute attention rollout (combining attention across all layers) for the same 5 sentences. Compare the attention-based explanation with the integrated gradients explanation. Do they agree on which tokens are most important?
(c) SHAP for text requires defining a "reference" for absent tokens. The two common choices are: (i) mask token ([MASK]), (ii) empty string (remove the token). Compute KernelSHAP with both reference types for 3 sentences. Do the attributions differ? Which reference is more semantically meaningful?
(d) For generative LLMs (e.g., GPT-style), explanations must explain which input tokens influenced each output token. Discuss the challenges of applying SHAP or integrated gradients to autoregressive generation where the output length is variable and each output token depends on all previous output tokens.
Exercise 35.27 (****)
Faithfulness evaluation of explanation methods.
(a) Define a "faithfulness" metric for local feature attributions: given an attribution that ranks features by importance, iteratively remove features in order of decreasing importance and measure the change in the model's prediction. A faithful attribution should cause the prediction to change rapidly when the most important features are removed.
(b) Implement this metric as the Area Over the Perturbation Curve (AOPC): $\text{AOPC} = \frac{1}{K} \sum_{k=1}^{K} [f(x) - f(x_{\setminus 1:k})]$ where $x_{\setminus 1:k}$ is the input with the top-$k$ features removed.
(c) Compute AOPC for TreeSHAP, LIME (10 runs averaged), and random feature ordering on the credit scoring model from Exercise 35.3 for 500 test instances. Report mean AOPC for each method.
(d) TreeSHAP should have the highest AOPC (most faithful), random should have the lowest, and LIME should fall in between. Do your results confirm this? If LIME sometimes outperforms TreeSHAP, explain why this might happen and whether it invalidates the theoretical superiority of Shapley values.
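The AOPC metric in part (b) can be sketched with a mask-and-replace removal scheme; replacing a "removed" feature with its background mean is one common choice, assumed here:

```python
import numpy as np

def aopc(model, x, ranking, background, K):
    """Area Over the Perturbation Curve for one instance.

    ranking: feature indices ordered by decreasing attributed importance.
    'Removing' a feature replaces it with its background value.
    """
    f_x = model(x)
    x_pert = x.copy()
    drops = []
    for k in range(K):
        x_pert[ranking[k]] = background[ranking[k]]
        drops.append(f_x - model(x_pert))
    return float(np.mean(drops))

# Toy linear model: the true importance ranking is by |coefficient * x|.
w = np.array([3.0, -2.0, 0.5, 0.0])
model = lambda x: float(w @ x)
x = np.ones(4)
bg = np.zeros(4)

faithful = aopc(model, x, ranking=[0, 1, 2, 3], background=bg, K=4)
reversed_ = aopc(model, x, ranking=[3, 2, 1, 0], background=bg, K=4)
print(faithful, reversed_)  # the faithful ranking yields the larger AOPC
```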
Exercise 35.28 (****)
Design an explanation system for a multi-model pipeline.
(a) The StreamRec pipeline has three stages: retrieval (two-tower model), ranking (transformer), and re-ranking (gradient-boosted tree with fairness constraints). A user sees an item that passed through all three stages. How should the explanation be attributed across stages?
(b) Propose a "pipeline SHAP" method: compute the contribution of each stage to the final recommendation. Hint: treat each stage as a player in a cooperative game.
(c) Implement a simplified version for a 2-stage pipeline (retrieval + ranking). For a recommended item, compute: (i) how much of the recommendation is due to the retrieval score, (ii) how much is due to the ranking score.
(d) The user sees: "Recommended because you watched X." But the retrieval model never used the user's watch history (it used collaborative filtering embeddings), and the watch history signal entered at the ranking stage via the session transformer. The explanation is technically correct (the item was ranked higher because of the watch history) but misleading (it implies the item was found because of the watch history). Discuss how pipeline explanations can be misleading and propose a solution.
Exercise 35.29 (****)
Scaling KernelSHAP with approximation techniques.
(a) KernelSHAP requires $O(n_{\text{samples}} \times n_{\text{background}})$ model evaluations per instance. For a model with 1-second inference time, 2,048 coalition samples, and 100 background instances, compute the wall-clock time per explanation.
(b) Implement three acceleration strategies: (i) reduce background to 10 samples via k-medoids clustering, (ii) use paired sampling (for each coalition $z$, also evaluate $1-z$) to reduce variance by 50%, (iii) early stopping when the SHAP values converge (L2 norm of value changes < 0.01 across 100 consecutive samples).
(c) Compare the accuracy (vs. full KernelSHAP with 2,048 samples and 100 background) and speed of each acceleration. Report mean absolute error in SHAP values and wall-clock time.
(d) At what model inference latency does KernelSHAP become feasible for real-time serving (< 1 second per explanation) with your accelerations?
Exercise 35.30 (****)
Open research question: explanation methods for graph neural networks.
(a) The StreamRec LightGCN model (Chapter 14) makes recommendations based on graph structure. Standard SHAP treats each feature independently, but graph features are inherently relational (a user's embedding depends on their neighbors' embeddings). Describe why standard feature-level attribution is insufficient for GNNs.
(b) GNNExplainer (Ying et al., 2019) finds a subgraph and feature mask that maximizes the mutual information between the explanation and the model's prediction. Describe the optimization problem and discuss how it differs from SHAP's cooperative game formulation.
(c) For the StreamRec bipartite graph, a "subgraph explanation" would identify which user-item edges (past interactions) were most important for a recommendation. Design the data structures and API for a GNN explanation service.
(d) The computational cost of GNN explanations scales with the $k$-hop neighborhood size, which can be very large for popular items. Propose a scalable approximation and analyze its faithfulness-efficiency tradeoff.