In This Chapter
- Opening: The Black Box Problem
- Learning Objectives
- Section 14.1: What Is Explainability and Why Does It Matter?
- Section 14.2: Interpretable Models — When Transparency Is Built In
- Section 14.3: LIME — Local Interpretable Model-Agnostic Explanations
- Section 14.4: SHAP — SHapley Additive exPlanations
- Section 14.5: Counterfactual Explanations
- Section 14.6: Saliency Maps and Visual Explainability
- Section 14.7: Attention Mechanisms and Language Model Explanation
- Section 14.8: The Faithfulness Problem — Do Explanations Actually Reflect Model Behavior?
- Section 14.9: XAI in Organizational Practice
- Section 14.10: The Limits of XAI — What Explanation Cannot Fix
- Discussion Questions
Chapter 14: Explainable AI (XAI) Techniques
Opening: The Black Box Problem
When a bank denies a loan application, US law — specifically the Equal Credit Opportunity Act (ECOA) and Regulation B — requires it to tell the applicant why. For decades, this was straightforward. A loan officer reviewed the file and issued an adverse action notice that cited specific, articulable reasons: "your income is insufficient relative to the requested loan amount," "your debt-to-income ratio exceeds our threshold," "your credit history contains a delinquency within the past 24 months." These explanations were imperfect — they could still encode bias, and the thresholds themselves could be set discriminatorily — but they were at least legible. A consumer who understood the reason could, in principle, take corrective action or challenge the decision.
The 2010s changed this in ways that regulators, ethicists, and technologists are still sorting out. As banks began replacing or supplementing traditional credit scoring models with machine learning systems — gradient-boosted decision trees, deep neural networks, ensemble models with hundreds of interacting components — the gap between model performance and model legibility widened dramatically. A well-trained ensemble classifier might outperform a logistic regression on default prediction by a meaningful margin. It might also be genuinely, technically unknowable why it denied any particular application. Not unknowable in the sense that the bank was hiding something. Unknowable in the sense that the computation produced a number — a probability of default — through a process of millions of nonlinear interactions among hundreds of features that no human can summarize in plain English without losing essential fidelity.
This is the black box problem, and it sits at the intersection of technical reality, legal obligation, and ethical accountability. The 2020s have produced a new class of tools specifically designed to address it — Explainable AI, or XAI. These tools use various mathematical strategies to peer inside black-box models and extract human-interpretable explanations: which features mattered, how much, in what direction, and what would have had to be different for the outcome to change.
This chapter examines what those tools are, how they work, and — crucially — what they can and cannot do. The honest conclusion is that XAI is genuinely valuable and genuinely limited in ways that matter for governance. Explanation is not the same as justification. A biased model with an excellent explanation is still a biased model. The power of XAI tools should not become an excuse for deploying systems that ought not to be deployed at all.
Learning Objectives
By the end of this chapter, you should be able to:
- Distinguish between model interpretability and model explainability, and explain why the distinction matters for different audiences — data scientists, regulators, and affected individuals.
- Describe how LIME (Local Interpretable Model-Agnostic Explanations) works and identify its principal limitations, including instability and the potential for adversarial manipulation.
- Describe how SHAP (SHapley Additive exPlanations) works, explain its game-theoretic foundations, and distinguish between global and local SHAP analyses.
- Explain what counterfactual explanations are and why they provide actionable information that other explanation types often do not.
- Identify the faithfulness problem in post-hoc XAI methods and explain why a technically accurate explanation does not guarantee a fair or accountable process.
- Match explanation types to the appropriate audience — data scientist, compliance officer, affected individual, executive — and identify when each type of explanation is appropriate.
- Analyze a realistic organizational scenario and identify where XAI tools could help and where they are insufficient to address governance concerns.
- Critically evaluate claims that deploying XAI tools satisfies ethical or legal obligations around AI transparency and accountability.
Section 14.1: What Is Explainability and Why Does It Matter?
Defining the Terms
Before examining specific techniques, it is worth being precise about language, because the field uses several terms interchangeably in ways that obscure important distinctions.
Interpretability refers to the degree to which a human can understand the internal workings of a model — its structure, its parameters, its decision logic. A linear regression model is highly interpretable: you can look at the coefficients, understand their signs and magnitudes, and follow the arithmetic from input to output. A 500-tree random forest is not interpretable in this sense: the internal logic is spread across millions of nodes and branches in a way that no human can follow directly.
Explainability refers to the degree to which humans can understand the cause of an AI decision — not necessarily by understanding the model's internals, but by receiving an explanation that is accurate and comprehensible. A black-box model can, in principle, be made explainable through post-hoc analysis: tools that examine the model's behavior and produce summaries that non-experts can understand.
The distinction matters because it points to two fundamentally different strategies for responsible AI deployment. The first strategy is to use inherently interpretable models. The second is to use black-box models paired with explanation tools. These strategies carry different risks, different costs, and different governance implications — a theme we will return to throughout this chapter.
Why the Audience Matters
Critically, there is no such thing as a universal explanation. What constitutes an adequate explanation depends entirely on the audience and their purpose.
A data scientist building a model wants to understand global feature importance: which variables matter most overall? How do features interact? Where does the model make systematic errors? The appropriate explanation for a data scientist might be a SHAP summary plot showing the distribution of feature contributions across thousands of predictions, or a partial dependence plot showing how model output changes as one variable shifts while others are held constant.
A compliance officer or regulator auditing a model wants to verify that the model does not violate applicable law — that it does not use protected characteristics or their proxies, that its error rates are consistent across demographic groups, that there is documentation supporting model validation. The appropriate explanation involves fairness metrics, model cards, documentation of training data, and evidence that the model was tested for disparate impact.
An affected individual — the loan applicant, the job candidate, the benefits recipient — wants to understand why the system treated them the way it did, and what they could do differently. The appropriate explanation is local (specific to their case), actionable (pointing to things they can change), and accessible (not requiring data science expertise). The ECOA adverse action notice requirement exists to serve this population.
A business decision-maker evaluating whether to deploy an AI system or how to oversee one already deployed wants high-level model summaries: what is the model's overall accuracy? Where does it fail? How confident are its predictions? What are the edge cases?
Designing XAI governance for organizations means designing for all these audiences simultaneously — a substantially harder problem than designing for any one.
Types of Explanations
XAI explanations are typically classified along two dimensions.
Global explanations describe how the model works in general: which features it relies on most heavily, what patterns drive predictions across the full distribution of inputs. Global explanations are useful for model auditing, for understanding systematic behavior, and for detecting categories of potential harm.
Local explanations describe why the model made a specific prediction for a specific input. Local explanations are what affected individuals typically need, and they are what post-hoc explanation tools like LIME and SHAP primarily provide.
A third type, contrastive or counterfactual explanations, addresses the question "what would have been different if the input had been different?" — which is often the most actionable form of explanation for affected individuals and is examined in Section 14.5.
Causality, Correlation, and the Explanation Trap
One of the most important — and most frequently misunderstood — limitations of XAI is that explanations derived from correlational models describe correlations, not causes. A model that uses zip code to predict creditworthiness is learning a correlation between zip code and repayment behavior. An XAI tool that identifies zip code as a top feature is correctly describing this correlation. But neither the model nor the explanation tells you whether zip code causes creditworthiness variations, or whether both are caused by something else — for instance, a history of racially discriminatory lending that systematically excluded Black homebuyers from wealth-building zip codes. Explaining the model accurately is not the same as justifying it ethically.
This point has governance implications that are easy to miss. A bank that uses SHAP to identify its top features and confirms that zip code is highly predictive has learned something real. But "highly predictive" is not an ethical justification for use — especially when predictive power stems from historical injustice. The explanation tool has done its job. The ethical work of deciding whether the model should be deployed at all is a human decision that cannot be delegated to the explanation tool.
Vocabulary Builder
XAI (Explainable AI): A field of techniques and practices aimed at making AI model decisions understandable to humans.
LIME: Local Interpretable Model-Agnostic Explanations; a technique that explains individual predictions by fitting a simple model to local perturbations.
SHAP: SHapley Additive exPlanations; a framework that attributes prediction outcomes to individual features using Shapley values from cooperative game theory.
Feature importance: A measure of how much each input variable contributes to a model's predictions, either globally or locally.
Counterfactual explanation: An explanation that describes what would have had to be different about the input for the model to produce a different output.
Saliency map: A visualization technique, primarily for image models, that highlights which pixels or regions most influenced a model's prediction.
Model distillation: The process of training a simpler, interpretable model to approximate the behavior of a more complex model.
Attention: A mechanism in transformer-based language models that assigns weights to different parts of the input when producing an output; frequently, and mistakenly, interpreted as a direct explanation of model reasoning.
Section 14.2: Interpretable Models — When Transparency Is Built In
The Case for Interpretability First
Before discussing tools for explaining black-box models, it is important to acknowledge the argument that black-box models should not be used in high-stakes settings in the first place. This argument, advanced most forcefully by Cynthia Rudin (2019) in her paper "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead," deserves serious attention.
Rudin's argument is not that black-box models are always worse in high-stakes settings. Her argument is that (1) for many problems, interpretable models achieve comparable accuracy to black-box models; (2) the common assumption that there is a fundamental accuracy-interpretability tradeoff is empirically weak; and (3) the risks of post-hoc explanation — specifically, that explanations can be wrong, gameable, or misleading — are sufficiently serious that building interpretability into the model is preferable when it is feasible. If you can solve a problem well with an interpretable model, you should not introduce the additional risks of opacity by using a black box and then trying to explain it after the fact.
This is a consequential argument for business leaders. The choice of whether to deploy an interpretable model or a black-box model is not merely a technical choice — it is a governance choice. And that choice should be made deliberately, with awareness of the tradeoffs, rather than by default to whatever method achieves the highest accuracy on a validation set.
Linear and Logistic Regression
The oldest and most widespread interpretable models in business are linear regression (for continuous outcomes) and logistic regression (for binary classification). Their interpretability is direct and precise: each coefficient tells you the change in the output associated with a one-unit change in that feature, holding all other features constant. A logistic regression predicting loan default with a coefficient of 0.15 on debt-to-income ratio tells you exactly how much that variable increases the log-odds of default per unit increase.
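The arithmetic is compact enough to write out directly. The sketch below uses the hypothetical 0.15 coefficient from the text; the intercept and the function itself are illustrative stand-ins, not a real underwriting model:

```python
import math

# Hypothetical logistic regression for default prediction.
# All numbers are illustrative, not taken from a real model.
intercept = -2.0
coef_dti = 0.15  # coefficient on debt-to-income ratio, per unit increase

def default_probability(dti):
    log_odds = intercept + coef_dti * dti
    return 1.0 / (1.0 + math.exp(-log_odds))

# A one-unit increase in DTI raises the log-odds by exactly 0.15 --
# equivalently, it multiplies the odds of default by exp(0.15) -- and this
# holds regardless of the values of every other feature in the model.
odds_ratio = math.exp(coef_dti)
```

That per-coefficient guarantee is precisely what makes the model auditable: the effect of each variable can be stated without reference to any other variable.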
This transparency has practical governance value. A regulator asking why the model denied an application can receive a mathematically precise answer. A compliance officer checking for disparate impact can audit which variables are used and what their coefficients are. A data scientist debugging the model can identify which variables have counterintuitive coefficients — a potential sign of data problems or multicollinearity.
The limitations are equally transparent. Linear models cannot capture nonlinear relationships or complex interactions among features without manual feature engineering. In domains where true underlying relationships are highly nonlinear — such as the interaction among credit history, income volatility, employment sector, and default risk — linear models may leave meaningful predictive accuracy on the table.
Decision Trees
Decision trees produce a flowchart-like structure of if-then rules that is among the most visually accessible model types for non-technical audiences. A decision tree that classifies a loan application might say: if income > $60,000 AND debt-to-income < 0.35 AND no defaults in past 24 months, approve; otherwise, evaluate next branch. Individual decisions can be traced from root to leaf, and the full decision logic is visible.
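The if-then logic above can be encoded literally. The thresholds come from the text's hypothetical example, and the fallback branch is a stand-in for the tree's remaining logic:

```python
# Direct encoding of the illustrative loan rule from the text.
# Thresholds are hypothetical; a real tree would be learned from data.
def classify_application(income, dti, defaults_past_24m):
    if income > 60_000 and dti < 0.35 and defaults_past_24m == 0:
        return "approve"
    # "evaluate next branch" -- a stand-in fallback for this sketch
    return "refer_for_review"
```

The governance value is that any individual decision can be replayed: given the inputs, the path from root to leaf is fully determined and fully visible.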
Shallow decision trees (with limited depth) are genuinely interpretable. Deep decision trees are not — a tree with hundreds of nodes becomes as opaque as any black box. Decision trees also tend to overfit unless carefully constrained or pruned. For these reasons, they are typically more useful as components of ensemble methods (like random forests) than as standalone models in high-stakes production systems.
Rule Lists and Scorecard Models
Rule lists — models that apply a sequence of if-then conditions, with each prediction following from the first applicable rule — have a long history in high-stakes decision-making. Scorecard models, common in credit underwriting and criminal justice risk assessment, assign points to feature values and sum them to produce a risk score. These models are explicitly designed for human interpretation: a credit analyst or parole officer can follow the logic with a pen and paper.
The COMPAS recidivism prediction tool at the center of the ProPublica fairness controversy (discussed in Chapter 12) was, in its public-facing form, a scorecard model — and this interpretability was one of the reasons it attracted scrutiny that would have been harder to apply to a fully opaque system. The ability to examine a model's logic enables criticism of that logic, which is precisely the kind of accountability that governance requires.
Generalized Additive Models (GAMs) and EBMs
Generalized Additive Models extend linear models by allowing each feature to have a nonlinear effect on the outcome, modeled as a smooth function, while maintaining the additive structure that makes predictions interpretable. The key insight is that while the effect of each feature can be nonlinear, features do not interact in the prediction — each contributes its own term, which can be visualized separately.
Microsoft Research has developed the Explainable Boosting Machine (EBM), which extends GAMs to include pairwise interaction terms while remaining interpretable. EBMs use gradient boosting to learn complex feature shapes but present results in a form where each feature's contribution can be visualized and examined independently. In several benchmark comparisons, EBMs achieve accuracy comparable to gradient-boosted tree ensembles while maintaining substantially greater interpretability. For many business applications, EBMs represent a compelling middle path between the accuracy of black-box methods and the governance value of transparent models.
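A minimal sketch of the additive structure, with hypothetical hand-written shape functions standing in for the curves a GAM or EBM would learn from data:

```python
import math

# GAM-style additive score: each feature contributes its own (possibly
# nonlinear) term, and the terms simply add. The shape functions below
# are hypothetical stand-ins for learned curves.
def f_income(income):          # diminishing effect of higher income
    return -0.5 * math.log1p(income / 10_000)

def f_dti(dti):                # sharply increasing risk at high DTI
    return 2.0 * dti ** 2

def f_history_years(years):    # longer history lowers risk, flattening out
    return -0.3 * min(years, 10)

def additive_score(income, dti, years):
    # No interaction terms: each feature's contribution can be inspected
    # (or plotted as a curve) independently of the others.
    return f_income(income) + f_dti(dti) + f_history_years(years)
```

The additivity is the interpretability guarantee: the prediction is exactly the sum of per-feature contributions, so each curve can be audited on its own.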
The Rashomon Set and the Decision Framework
The concept of the Rashomon set (named for the Kurosawa film about multiple conflicting accounts of the same event) refers to the collection of models that achieve similar accuracy on a given problem. The empirically important point is that for most real-world datasets this set is large — there are many models with similar accuracy, and some of them are interpretable. The existence of a large Rashomon set undercuts the argument that accuracy demands opacity.
The practical question for organizations deploying AI in high-stakes settings is: before choosing a black-box model, have you verified that interpretable alternatives do not achieve comparable accuracy? If you have verified this and found a meaningful accuracy gap that has material business consequences, the choice of a black-box model may be defensible. If you have not performed this comparison — or if the accuracy gap is small — choosing opacity by default is difficult to defend on governance grounds.
Case Example: The FICO Score
The FICO credit score is one of the most widely used interpretable models in financial services. It aggregates credit bureau data into a single number using a relatively transparent methodology — payment history, amounts owed, length of credit history, credit mix, and new credit each contribute with known weights. Consumers can access information about which factors most positively and negatively affected their score.
The FICO score's interpretability has real governance value: it can be explained to consumers, audited for disparate impact, and challenged in regulatory proceedings. Its limitations are also instructive. The score's input data — derived from credit bureau records — reflects historical patterns of lending discrimination. Interpretability does not cure the upstream bias in the data; it merely makes the bias more visible. This is a microcosm of a broader principle: interpretability is necessary but not sufficient for ethical AI.
Section 14.3: LIME — Local Interpretable Model-Agnostic Explanations
How LIME Works
LIME, introduced by Ribeiro, Singh, and Guestrin in 2016, is built on a deceptively simple insight: even if a complex model is globally opaque, it may be locally approximable. In other words, even if you cannot describe the model's behavior across all possible inputs, you may be able to describe its behavior in the neighborhood of any specific input well enough to produce a useful explanation.
The algorithm works in five steps:
Step 1: Select the instance to explain. Begin with the specific prediction you want to understand — for instance, a loan application that was denied with a 72% predicted probability of default.
Step 2: Create perturbations. Generate many variations of the original input by randomly changing feature values — slightly higher income, slightly different credit history, different zip code. These perturbations explore the neighborhood around the original input in feature space.
Step 3: Get predictions for perturbations. Run each perturbed version through the original black-box model and collect its predictions. You now have a small dataset: many similar-but-different inputs, each with a model output.
Step 4: Weight by proximity. Give greater weight to perturbations that are closer to the original input, so the local approximation emphasizes behavior near the instance of interest.
Step 5: Fit an interpretable model to the perturbations. Train a simple model — typically a linear regression — on the weighted perturbations. The coefficients of this simple model describe which features drove the original prediction in which direction.
The result is an explanation that says, in effect: "In the neighborhood of this specific input, the model's behavior is approximately described by the following linear model, and these were the features that mattered most."
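The five steps can be sketched in a few lines of NumPy. The black-box function, the instance, and the kernel width are all hypothetical; real LIME implementations add sampling strategies and feature selection on top of this core idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in black box: any callable mapping feature vectors to scores.
# Here, a hypothetical nonlinear function of two features.
def black_box(X):
    return 1 / (1 + np.exp(-(0.8 * X[:, 0] ** 2 - 1.2 * X[:, 1])))

x0 = np.array([1.0, 0.5])                 # Step 1: instance to explain

Z = x0 + rng.normal(scale=0.3, size=(500, 2))   # Step 2: perturbations

y = black_box(Z)                          # Step 3: query the black box

d2 = ((Z - x0) ** 2).sum(axis=1)          # Step 4: proximity weights
w = np.exp(-d2 / 0.25)

# Step 5: fit a weighted linear model; its coefficients are the explanation
A = np.hstack([np.ones((len(Z), 1)), Z])  # intercept column + features
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
intercept, local_effects = coef[0], coef[1:]
```

Rerunning with a different seed changes `Z` and therefore the fitted coefficients, which is the instability limitation discussed below in miniature.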
What LIME Tells You
For the denied loan applicant, a LIME explanation might produce output like: "The top factors pushing toward denial were: (1) debt-to-income ratio of 0.48, which increased predicted default probability significantly; (2) credit history length of 2.3 years, which is below that of the typical approved applicant; (3) two late payments in the past 12 months. The top factors pushing toward approval were: (1) income of $72,000, which is above the median for this loan type."
This is a local explanation — it describes why this specific application received this specific score, not how the model works in general. It is model-agnostic — LIME works the same way regardless of whether the underlying model is a random forest, a neural network, or any other architecture. And it is post-hoc — it does not change the model, it examines it from outside.
LIME in the Business Context: Adverse Action Notices
The most immediate business application of LIME in financial services is generating adverse action notices that satisfy ECOA/Regulation B requirements. Instead of a compliance team manually reviewing which factors drove each denial, LIME (or similar tools) can be integrated into the production pipeline to automatically generate feature-level explanations for each denied application.
This application has significant practical appeal, and several major credit bureaus and fintech lenders have explored or implemented versions of it. However, regulators have been cautious. The Consumer Financial Protection Bureau (CFPB) and the Office of the Comptroller of the Currency (OCC) have indicated that adverse action notices based on post-hoc explanation tools must be validated to ensure they accurately reflect the model's actual decision logic — a requirement that points directly to LIME's central limitation.
The Limitations of LIME
LIME has several important limitations that every practitioner and business leader should understand.
Instability. Because LIME uses random sampling to generate perturbations, running LIME twice on the same instance can produce different explanations. In experiments, researchers have found that LIME explanations for the same prediction can vary substantially across runs. For a compliance application requiring consistent, reproducible adverse action notices, this instability is a significant problem.
The approximation gap. LIME fits a linear model to the local neighborhood of the prediction. If the true model is highly nonlinear in that region — if the model's behavior changes sharply as features vary — the linear approximation may be poor. The explanation may be easy to understand while being inaccurate about what actually drove the prediction.
Perturbation strategy dependence. The distribution from which LIME samples perturbations matters substantially. If perturbed inputs are sampled in a way that does not reflect realistic inputs — for instance, if the perturbation strategy treats features as independent when they are in fact correlated — the resulting explanation may not reflect model behavior on realistic inputs.
Cannot detect all discrimination. LIME shows which features were locally important. It cannot directly detect all forms of discriminatory model behavior, particularly discrimination that manifests through complex feature interactions rather than simple feature weights.
When LIME is appropriate: For generating human-readable explanations of individual predictions, for debugging specific model errors, and as one component of a broader explainability strategy. LIME is particularly useful when you need model-agnostic explanations and when instability can be managed through repeated sampling and aggregation.
When LIME is insufficient: For regulatory compliance as a standalone tool, for detecting systematic discrimination, for providing legally robust adverse action notices without additional validation. LIME is a useful diagnostic tool, not a governance solution by itself.
Section 14.4: SHAP — SHapley Additive exPlanations
Game Theory Foundations
SHAP, introduced by Lundberg and Lee in 2017, is grounded in a concept from cooperative game theory: the Shapley value, named for Nobel laureate Lloyd Shapley, who developed it in 1953. The Shapley value answers a specific question: in a cooperative game where players work together to produce a joint payoff, how much of that payoff should each player be credited with?
The answer Shapley proposed has elegant mathematical properties. You consider every possible ordering in which players might join the game, and for each ordering, you calculate how much each player's entry increases the payoff. You then average these marginal contributions across all possible orderings. The result is a fair attribution of the total payoff to individual players.
Lundberg and Lee recognized that this framework maps naturally onto machine learning prediction. The "players" are the input features. The "payoff" is the difference between the model's prediction for this specific instance and the model's average prediction across all instances (the baseline). The Shapley value for each feature is its average marginal contribution to the prediction across all possible feature orderings — in other words, how much credit does this feature deserve for the model's prediction being what it is?
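For a small number of features, Shapley values can be computed exactly by enumerating orderings. The sketch below uses a hypothetical additive model and treats an absent feature as contributing nothing; production SHAP implementations instead average over a background dataset, and use algorithms like TreeSHAP because this enumeration grows factorially with the number of features:

```python
from itertools import permutations

# Hypothetical additive scoring model. `features` holds only the
# features treated as "present"; absent features contribute nothing
# (a simplification -- real SHAP averages over background data).
def model(features):
    return (0.4
            + 0.2 * features.get("late_payments", 0)
            + 0.1 * features.get("dti", 0))

def shapley_values(feature_values, value_fn):
    names = list(feature_values)
    phi = {n: 0.0 for n in names}
    orderings = list(permutations(names))
    for order in orderings:
        present = {}
        prev = value_fn(present)          # start from the baseline
        for name in order:
            present[name] = feature_values[name]
            now = value_fn(present)
            phi[name] += now - prev       # marginal contribution
            prev = now
    return {n: phi[n] / len(orderings) for n in names}

phi = shapley_values({"late_payments": 1, "dti": 0.5}, model)
```

By construction, the values sum to the difference between the full prediction and the baseline, which is the additivity property described next.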
What SHAP Tells You
SHAP values have several mathematically guaranteed properties that make them more principled than LIME:
Additivity: The SHAP values for all features sum exactly to the difference between the prediction and the baseline. This means explanations are internally consistent — you can verify that the pieces add up to the whole.
Consistency: If changing a model makes a feature's contribution unambiguously higher, that feature's SHAP value will not decrease.
Dummy feature property: A feature that has no effect on the model's predictions receives a SHAP value of zero.
For a denied loan application, a SHAP explanation might say: "This applicant's predicted default probability is 0.71. The baseline probability for all applicants is 0.42. The following features contributed to the difference: (+0.18) two recent late payments; (+0.09) debt-to-income ratio; (+0.07) short credit history; (-0.04) above-median income; (-0.01) stable employment sector." These values add to +0.29, which equals the difference between the specific prediction (0.71) and the baseline (0.42).
Types of SHAP
TreeSHAP computes exact Shapley values for tree-based models — random forests, gradient-boosted trees (XGBoost, LightGBM, CatBoost) — in polynomial time. This makes it fast and exact for the most widely used model families in production business applications. TreeSHAP is the recommended approach whenever you are working with tree-based models.
KernelSHAP is model-agnostic — it works with any model architecture — using a sampling approach similar in spirit to LIME but with a kernel weighting scheme derived from the Shapley value axioms. KernelSHAP is slower than TreeSHAP and provides approximate rather than exact values, but it can be applied to neural networks, linear models, and any other architecture.
DeepSHAP is designed for deep neural networks and uses a backpropagation-based approach to compute approximate SHAP values efficiently. It is faster than KernelSHAP for neural networks but makes additional assumptions about the model's structure.
SHAP Visualizations
SHAP's practical impact on the field has been amplified by a set of visualizations that make its output accessible to non-technical audiences.
Waterfall plots show, for a single prediction, how each feature's SHAP value pushes the prediction from the baseline toward the final output. Features are shown as bars, with positive contributions in one color and negative contributions in another, stacked to illustrate how the prediction was built up.
Beeswarm plots (also called summary plots) show global feature importance across many predictions simultaneously. Each dot represents one prediction, positioned horizontally by its SHAP value for a given feature, and color-coded by the feature's actual value. This allows visualization of not just which features matter most but how they matter — whether high feature values push predictions up or down, and whether there are complex nonlinear patterns.
Force plots show a single prediction's SHAP values as a horizontal force diagram, with features pushing the prediction left (toward low values) and right (toward high values), giving an immediate visual intuition about what drove the outcome.
Dependence plots show how a single feature's SHAP contribution varies as the feature's value changes, and can reveal interaction effects by color-coding a second feature.
Business Application: SHAP for Model Auditing and Bias Detection
Beyond individual prediction explanations, SHAP's most powerful business application is model auditing. By computing SHAP values across a large sample of predictions, organizations can identify patterns that raise governance concerns.
The most important application in financial services is proxy variable detection. If a protected characteristic — race, gender, national origin — is not included in the model but is correlated with a feature that is included (zip code, for instance, correlates with race due to residential segregation), that feature may function as a proxy. A SHAP audit that shows zip code as a top feature, combined with geographic analysis showing that high-SHAP zip codes correspond to predominantly minority neighborhoods, is evidence of potential disparate impact — even if no protected characteristic was ever directly used.
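A minimal sketch of what such an audit computes, on synthetic data; the SHAP values and demographic labels here are simulated for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical audit data, one row per applicant. `shap_zip` is the SHAP
# contribution of the zip-code feature to predicted default risk; `group`
# is a demographic label obtained for audit purposes only and never fed
# to the model. Values are simulated to build in a disparity.
n = 2000
group = rng.integers(0, 2, size=n)                  # 0/1 demographic flag
shap_zip = rng.normal(loc=0.05 * group, scale=0.02, size=n)

# If the zip-code feature systematically pushes one group's predicted
# risk upward, its mean SHAP contribution will differ by group -- a red
# flag for proxy discrimination that warrants deeper investigation.
gap = shap_zip[group == 1].mean() - shap_zip[group == 0].mean()
```

A nonzero gap is evidence, not proof: the follow-up work of establishing disparate impact and deciding whether the feature belongs in the model at all remains a human judgment.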
This application is examined in detail in Case Study 1 of this chapter.
SHAP's Limitations
SHAP's mathematical foundations are sounder than LIME's, but it has its own limitations.
Computational cost. For large datasets with many features, computing SHAP values for all predictions is computationally expensive. KernelSHAP is particularly slow. For production systems requiring real-time explanations for each prediction, performance optimization is non-trivial.
Approximation for complex models. For non-tree models, SHAP values are approximate, not exact. The approximation quality depends on the sampling strategy and the number of samples used.
Feature independence assumption. Like LIME, SHAP can produce misleading attributions when features are highly correlated, because it involves examining how the model behaves when some features are "missing" — and the way it handles missing features assumes independence that may not hold in practice.
Still a post-hoc method. SHAP describes model behavior but does not guarantee that the model is fair or that the explanation is a causal account of why the outcome occurred. The adversarial explanation problem (discussed in Section 14.8) applies to SHAP as well as to LIME.
Section 14.5: Counterfactual Explanations
The Intuition
Of all the explanation types covered in this chapter, counterfactual explanations are perhaps the most intuitively accessible and the most directly useful for affected individuals. The core idea is simple: instead of telling you why the model made the decision it made, a counterfactual explanation tells you what would have had to be different about your application for the model to have made a different decision.
"Your loan was denied. If your income had been $5,200 higher, your application would have been approved."
"Your job application was not selected for the next round. If your resume had included three years of experience in the relevant software stack, the outcome would have been different."
This form of explanation was developed formally in an AI context by Wachter, Mittelstadt, and Russell (2017). It underpins the idea of algorithmic recourse: the ability to understand what you could do differently to achieve a different algorithmic outcome in the future.
Why Counterfactuals Matter
Counterfactuals provide two things that feature-importance explanations typically do not: actionability and accessibility.
Actionability: knowing that your debt-to-income ratio was the top negative factor (a feature-importance explanation) tells you something about why you were denied, but not specifically what to do about it. Knowing that if your debt-to-income ratio had been 0.34 rather than 0.42, you would have been approved, tells you something specific: reduce your monthly debt by approximately $X. This is information you can act on.
Accessibility: "Your predicted default probability was 0.71 due to SHAP contributions of +0.18 from recent late payments" is accurate but requires explanation to understand. "If you had no late payments in the past 12 months, you would have been approved" is immediately comprehensible.
The Actionability Problem
Counterfactuals that describe actionable changes are useful. Counterfactuals that describe changes to immutable characteristics are not — and can be harmful. A counterfactual that says "If you were a different race, you would have been approved" is technically a valid counterfactual (it describes what would have led to a different outcome) but is both non-actionable and direct evidence of illegal discrimination.
More subtle is the case of characteristics that are technically mutable but practically difficult or impossible to change: age (slowly mutable), neighborhood (mutable for those with resources), credit history length (mutable only with time). The quality of a counterfactual explanation depends not just on its proximity to the original input in mathematical space but on whether the described changes are genuinely available to the affected individual.
Good counterfactual generation algorithms therefore need to incorporate constraints: prefer counterfactuals that require changes to mutable, actionable features; penalize counterfactuals that require changes to protected characteristics; prefer counterfactuals that require smaller, more plausible changes.
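A minimal version of constrained counterfactual search can be brute-forced for a low-dimensional model. The scorer, weights, step grids, and the crude L1 proximity measure below are all invented for illustration (real implementations would scale each feature before comparing costs); the point is the structure: only mutable features are searched, and multiple options are returned:

```python
from itertools import product

# Hypothetical denial model: approve when the risk score falls below 0.5.
# Feature names, weights, and step grids are all invented for illustration.
WEIGHTS = {"income_k": -0.004, "dti": 1.2, "late_payments": 0.15, "age": -0.002}
MUTABLE = ["dti", "income_k", "late_payments"]     # age excluded: non-actionable
STEPS = {
    "income_k": [0, 5, 10, 20],          # raise annual income by $k
    "dti": [0, -0.05, -0.10, -0.15],     # reduce debt-to-income ratio
    "late_payments": [0, -1, -2],        # clear recent late payments
}

def risk(x):
    return sum(WEIGHTS[f] * v for f, v in x.items())

def counterfactuals(x, threshold=0.5, top_k=3):
    """Brute-force search over mutable-feature changes that flip the decision,
    ranked by a crude L1 proximity (real systems would scale features first)."""
    found = []
    for deltas in product(*(STEPS[f] for f in MUTABLE)):
        change = dict(zip(MUTABLE, deltas))
        candidate = {f: x[f] + change.get(f, 0) for f in x}
        if risk(candidate) < threshold:
            cost = sum(abs(d) for d in deltas)
            found.append((cost, {f: d for f, d in change.items() if d}))
    found.sort(key=lambda pair: pair[0])
    return found[:top_k]

applicant = {"income_k": 50, "dti": 0.45, "late_payments": 2, "age": 35}
assert risk(applicant) >= 0.5                      # the model denies this applicant
options = counterfactuals(applicant)
print(options[0])  # cheapest flip: reduce debt-to-income by 0.10
```

Returning the top few options rather than the single nearest one is what gives the affected individual genuine choice among changes, which is the diversity property discussed next.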
The Proximity and Diversity Problems
The nearest counterfactual in feature space is mathematically well-defined — it is the point closest to the original input (by some distance metric) that crosses the decision boundary. But nearest in feature space may not be most useful in practice, and there may be multiple equally near counterfactuals pointing in different directions.
A loan applicant who is told "increase your income by $5,000 OR reduce your debt by $3,000 OR eliminate your most recent late payment" receives more useful information than one told the single nearest counterfactual, because they can evaluate which of the options is most available to them. Diverse counterfactual generation algorithms therefore aim to produce multiple counterfactuals that vary in which features they change, giving the affected individual genuine choice.
Counterfactuals and Gaming
A concern sometimes raised about counterfactual explanations is that they can be "gamed" — that a bad actor who knows the counterfactual could manipulate their inputs to achieve approval while not genuinely being lower-risk. This concern is most acute in adversarial settings where applicants are actively trying to game a classifier.
In practice, the concern is real but should not be overstated. The same concern applies to any transparent system: a person who knows that FICO scores above 720 get better interest rates can work to raise their FICO score through legitimate behavior. Transparency that enables gaming also enables genuine improvement. The distinction between gaming and genuine improvement is worth monitoring — but it does not generally argue against providing counterfactual explanations to affected individuals.
Section 14.6: Saliency Maps and Visual Explainability
How Visual Explanation Works
In computer vision applications — radiology AI that reads scans, quality control systems that inspect products, facial recognition systems — the question "why did the model make this prediction" naturally takes a visual form. Which part of the image drove the classification? Where was the model "looking"?
Saliency maps answer this question by producing a heatmap overlay on the original image, highlighting pixels or regions that contributed most strongly to the model's output. The standard interpretation is that high-saliency regions are those to which the model gave the most weight — where it "attended" most.
Several methods produce saliency maps:
Grad-CAM (Gradient-weighted Class Activation Mapping) computes the gradient of the model's output with respect to the activations of the final convolutional layer, then uses those gradients to weight the feature maps and produce a coarse localization map. Grad-CAM explanations are fast to compute and have been widely adopted in medical imaging AI.
Integrated Gradients computes the average gradient of the model's output along a straight path in input space from a baseline (e.g., a black image) to the actual input, assigning each pixel a contribution equal to its difference from the baseline multiplied by that average gradient. Integrated Gradients has stronger theoretical properties than simple gradient-based methods and satisfies an axiom set similar to SHAP's.
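The path-integral construction can be checked numerically on a toy differentiable function. The function and inputs below are invented, and a central-difference gradient stands in for the autodiff a real framework would provide; the completeness axiom (attributions sum to the change in output relative to the baseline) falls out of the construction:

```python
# Toy differentiable "model" on three inputs (invented for illustration).
def f(x):
    return x[0] * x[1] + x[2] ** 2

def grad(x, eps=1e-6):
    """Central-difference gradient: a stand-in for a framework's autodiff."""
    g = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

def integrated_gradients(x, baseline, steps=200):
    """Average the gradient along the straight path baseline -> x (midpoint rule),
    then scale each input's average gradient by its distance from the baseline."""
    n = len(x)
    avg = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(n)]
        g = grad(point)
        for i in range(n):
            avg[i] += g[i] / steps
    return [(x[i] - baseline[i]) * avg[i] for i in range(n)]

x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
attrib = integrated_gradients(x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline) = 11 - 0.
assert abs(sum(attrib) - (f(x) - f(baseline))) < 1e-3
print([round(a, 3) for a in attrib])
```

For an image model the same loop runs over pixels rather than three scalars, with a black image as the baseline; the choice of baseline materially affects the attributions and is itself a modeling decision.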
The Faithfulness Problem in Visual Explanation
Saliency maps look compelling. They draw boxes around tumors in radiology scans, they highlight the faces of detected people, they illuminate text regions that drove document classification. This visual plausibility creates a strong intuition that the explanation is correct.
But appearances can deceive. A saliency map shows where a gradient is large — where small changes in pixel values would most change the model's output. This is not necessarily the same as where the model understood something meaningful. A model can achieve correct predictions for spurious reasons.
The canonical example: a model trained to detect skin cancer from photographs achieved high accuracy — but a subsequent analysis found that it was partially using the presence of a ruler or a dermatoscope in the image as a predictive signal. These objects appear more frequently in images of malignant lesions because clinicians are more likely to photograph and measure suspicious lesions. The model learned a real correlation but not the right one. Saliency maps might highlight regions around the ruler, producing an explanation that accurately describes what drove the prediction but reveals a spurious and fragile basis for classification.
Case Study: The Stethoscope Correlation
A study of a chest X-ray classification AI found that the model learned to associate the presence of a stethoscope — visible in some X-rays taken in certain clinical settings — with particular diagnostic labels, because the imaging protocols at hospitals that took X-rays with stethoscopes in frame happened to correlate with the disease prevalence in that dataset. The saliency maps for these predictions highlighted the stethoscope region. The explanation was technically accurate — the stethoscope did drive the prediction — but revealed that the model had learned a spurious shortcut that would fail on images from hospitals with different protocols.
This example illustrates a principle with broad application: saliency maps can be diagnostic tools that reveal model problems, but only if someone is looking at them critically. A saliency map presented to regulators or patients as evidence that "the model is looking at the right things" can be misleading if the audience interprets visual plausibility as technical validity.
Section 14.7: Attention Mechanisms and Language Model Explanation
What Attention Is
Transformer-based language models — the architecture underlying GPT, BERT, and their successors — use attention mechanisms to process text. At each layer of the transformer, each position in the input sequence attends to every other position, with learned weights (attention weights) determining how much influence each position has on each other position's representation. The intuition is that attention allows the model to relate words to each other across arbitrary distances in the sequence — allowing "the trophy" to be connected to "it" in a coreference resolution task even if they are separated by many words.
Because attention weights are explicit and interpretable — they are probability distributions over input positions — researchers and practitioners quickly adopted them as explanation tools: "the model predicted X because it attended heavily to words Y and Z."
Attention Is Not Explanation
This interpretation was challenged directly and forcefully by Jain and Wallace (2019) in a paper with the unambiguous title "Attention is not Explanation." Their argument has two components.
First, attention weights do not uniquely determine model behavior. Different attention distributions can produce the same model output, meaning that the specific pattern of attention weights is not causally necessary for the prediction. In many cases attention weights can be permuted substantially without changing the prediction, which means they lack the causal relationship to outputs that a genuine explanation requires.
Second, adversarial attention patterns exist. It is possible to construct attention distributions that look very different from the original while producing the same prediction, and vice versa. This means that attention weights can look meaningful while having no stable relationship to what the model is actually doing.
The implication for governance is significant: presentations of attention heatmaps as explanations of language model behavior — which were common in both research and product documentation — are not reliable. Telling a user that "the model focused on these words to reach its conclusion" may be misleading when that claim rests on attention weights alone.
Better Approaches to LLM Explainability
More principled approaches to explaining text model predictions apply the same frameworks as tabular models. SHAP can be applied to text models by treating the presence or absence of words or phrases as features and computing Shapley values for their contributions to the prediction. Feature attribution methods like Integrated Gradients, applied to the token embeddings, provide more faithful explanations than raw attention weights.
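One simple and more faithful alternative for text is occlusion, or leave-one-out, attribution: delete each token and measure the change in the model's score. The keyword-weight "classifier" and sentence below are invented stand-ins; for a real model the same loop works, though each deletion costs one forward pass and the attributions are only approximate when tokens interact:

```python
# Toy sentiment scorer: a keyword-weight lexicon stands in for a real classifier.
# The lexicon and the example sentence are invented for illustration.
LEXICON = {"great": 2.0, "terrible": -2.5, "boring": -1.0, "movie": 0.1}

def score(tokens):
    return sum(LEXICON.get(t, 0.0) for t in tokens)

def occlusion_attributions(tokens):
    """Leave-one-out: a token's contribution is the score change when it is removed.
    One forward pass per token; exact here only because the toy model is additive."""
    base = score(tokens)
    return {i: base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))}

tokens = "a great movie".split()
attr = occlusion_attributions(tokens)
best = max(attr, key=attr.get)
print(tokens[best], round(attr[best], 2))  # → great 2.0
```

Unlike an attention heatmap, this attribution is behavioral by construction: it reports what actually happens to the output when the token is absent, which is the causal question an explanation is supposed to answer.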
For large language models with hundreds of billions of parameters — GPT-4, Claude, Gemini, and their successors — explainability is an open and actively researched problem. These models exhibit emergent behaviors (capabilities that appear at large scale that were not present in smaller versions of the same architecture) that are not well understood even by their developers. The concept of a local explanation may be applicable at the token level, but global explanations of how these models work as systems remain beyond current XAI capability.
The governance implication is uncomfortable: the models that are being most aggressively deployed in high-stakes applications — large language models for legal document analysis, medical information retrieval, financial advice, and hiring — are also the models for which we have the least reliable explanation methods.
Section 14.8: The Faithfulness Problem — Do Explanations Actually Reflect Model Behavior?
Post-Hoc Explanations Are Models of Models
The deepest challenge for XAI is the faithfulness problem: post-hoc explanations are themselves approximate models of the underlying model's behavior. LIME fits a linear approximation. SHAP attributes predictions to features using an averaging procedure. Saliency maps identify high-gradient regions. None of these directly reads out the model's decision logic — they observe the model's behavior from outside and produce a summary.
This means a post-hoc explanation can be wrong — not through any malice, but simply because the approximation is imperfect. If LIME's local linear model is a poor approximation of the true model's local behavior, the explanation it provides will be inaccurate. If SHAP's feature independence assumptions are violated, SHAP attributions will be distorted. If a saliency method highlights pixels that have high gradients but do not reflect human-meaningful features, the visualization misleads rather than explains.
This is not a hypothetical concern. Adebayo et al. (2018), in "Sanity Checks for Saliency Maps," found that many popular gradient-based saliency methods failed basic validation tests: their saliency maps did not change significantly when the model was randomly re-initialized or when the training labels were randomized, which should — if the maps were truly reflecting learned model behavior — produce entirely different maps. The saliency maps were, in effect, reflecting properties of the input data more than properties of the model, making them unreliable as model explanations.
The Adversarial Explanation Problem
The faithfulness problem becomes an active governance threat when it can be exploited intentionally. Slack et al. (2020) demonstrated that machine learning classifiers can be designed to give fair-looking LIME and SHAP explanations while actually discriminating in their predictions.
The attack works by training the model with an awareness of when it is being queried by an explanation tool. Explanation tools generate specific kinds of perturbations — out-of-distribution inputs that differ from real data in characteristic ways. A model that recognizes these perturbation patterns can behave differently when queried by an explanation tool than when processing real inputs: providing benign explanations to auditors while discriminating in production.
This finding has profound governance implications. If a regulated institution were to deploy a discriminatory model with an adversarially crafted explanation layer, regulatory review of LIME or SHAP outputs would not detect the discrimination. The explanation would be passing a compliance check while the underlying model was violating anti-discrimination law.
The Slack et al. result does not mean that XAI is useless — it means that XAI alone is insufficient for compliance auditing. Effective regulatory oversight of AI systems requires access to the model itself, the training data, and outcome data on real predictions — not just explanation outputs generated on demand.
What Faithful Explanation Means and How to Test It
A faithful explanation accurately describes the behavior of the model for the instance being explained. Testing faithfulness requires going beyond accepting explanation outputs at face value.
Practical faithfulness tests include:
Completeness checks: Do the explanation components sum to the difference between prediction and baseline? (SHAP guarantees this by construction; LIME does not.)
Consistency checks: Do similar inputs receive similar explanations? Does changing a feature that the explanation says is unimportant actually leave the prediction unchanged?
Perturbation tests: If the explanation says feature X was most important, does removing or randomizing feature X actually change the prediction significantly?
Randomization sanity checks: Does the explanation change substantially when the model is replaced with a random model? If not, the explanation is reflecting input structure rather than model behavior.
Organizations deploying XAI tools in production should build these validation tests into their model governance processes — not treat explanation outputs as authoritative without verification.
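Two of these checks — the perturbation test and the randomization sanity check — can be sketched in a few lines. The model, its hand-computed "explanation" (exact only because the model is linear), and the perturbation scheme are invented for illustration; in a real governance pipeline the explainer output would be substituted for the hand-computed attributions:

```python
import random

random.seed(0)

# Hypothetical linear model with a hand-computed explanation (exact in this case).
# In practice, `explanation` would be replaced by a LIME/SHAP/gradient explainer.
WEIGHTS = [2.0, 0.01, -1.5]

def model(x, w=WEIGHTS):
    return sum(wi * xi for wi, xi in zip(w, x))

def explanation(x, w=WEIGHTS):
    return [wi * xi for wi, xi in zip(w, x)]   # weight * value attribution

def perturbation_test(x, n_trials=200):
    """The feature ranked most important should move the prediction the most."""
    attr = explanation(x)
    ranked = sorted(range(len(x)), key=lambda i: abs(attr[i]), reverse=True)
    def avg_shift(i):
        total = 0.0
        for _ in range(n_trials):
            y = list(x)
            y[i] = random.gauss(0, 1)          # randomize a single feature
            total += abs(model(y) - model(x))
        return total / n_trials
    return avg_shift(ranked[0]) > avg_shift(ranked[-1])

def randomization_check(x):
    """Sanity check: a randomly re-weighted model must yield a different explanation.
    If it did not, the 'explanation' would reflect the input, not the model."""
    random_w = [random.gauss(0, 1) for _ in WEIGHTS]
    return explanation(x) != explanation(x, random_w)

x = [1.0, 1.0, 1.0]
assert perturbation_test(x)
assert randomization_check(x)
print("faithfulness checks passed")
```

The randomization check is a direct miniature of the Adebayo et al. (2018) test: an explanation that survives replacing the model with a random one is not explaining the model at all.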
Regulatory Implications
The faithfulness problem has direct implications for how regulatory requirements around explainability should be designed. A requirement that says "the AI system must provide explanations" is satisfied by any explanation tool, regardless of whether those explanations are accurate. A well-designed requirement would specify that explanations must be validated to be faithful representations of actual model behavior — a substantially higher bar.
The EU AI Act's transparency provisions, and the CFPB's guidance on explainability in credit decisions, both acknowledge that explanations must be accurate but leave validation methodology largely unspecified. As regulators gain sophistication about XAI limitations, the expectation is that future guidance will be more specific about what constitutes adequate validation. Organizations that invest in explanation faithfulness now are likely to be better positioned for more demanding future requirements.
Section 14.9: XAI in Organizational Practice
Matching Explanation Types to Audiences
Effective XAI governance is not about deploying one explanation tool and calling the problem solved. It requires mapping different types of explanations to the different audiences who need them, and ensuring that each audience receives information that is accurate, relevant, and actionable for their role.
Data scientists building and validating models need global feature importance (SHAP beeswarm plots, permutation importance), partial dependence plots showing how features affect predictions across their range, and error analysis tools that identify where the model fails and why. For model debugging, LIME and SHAP local explanations help data scientists understand specific mispredictions.
Affected individuals need local explanations specific to their case, presented in plain language without jargon, with actionable information about what they could do differently. For this audience, counterfactual explanations are often more useful than feature-importance explanations. The information should be available on request and not require technical expertise to interpret.
Compliance officers and regulators need documentation of the model's global behavior, fairness metrics broken down by demographic group, evidence that proxy variables have been identified and addressed, and validation that explanation tools are faithful. For this audience, SHAP-based model audits (Section 14.4 and Case Study 1) and adverse action notice validation processes are the relevant tools.
Business decision-makers — C-suite executives and board members overseeing AI systems — need high-level summaries of model behavior, confidence intervals, known failure modes, and escalation protocols for cases where the model's reliability is low. For this audience, the appropriate explanation is an accessible model card or AI system summary, not raw SHAP values.
XAI in the Model Development Lifecycle
XAI should not be bolted on after a model is deployed. It should be integrated throughout the development process.
In the problem formulation phase, teams should ask: what explanation capability will this model need to have, for which audiences, under which regulatory frameworks? The answer to this question constrains model architecture choices — if you need legally robust adverse action notices, you may need to prioritize interpretable models or models for which faithful explanations can be validated.
In the model selection and training phase, XAI analysis of candidate models can reveal unexpected behavior before deployment. SHAP audits can identify proxy variable usage. Partial dependence plots can reveal counterintuitive relationships that warrant investigation. Discovering these issues before deployment is dramatically cheaper than discovering them after.
In the validation phase, XAI outputs should be validated for faithfulness using the methods described in Section 14.8. Explanations that fail faithfulness tests should not be used for compliance purposes.
In production monitoring, XAI tools can detect drift in model behavior over time. If the SHAP importance of certain features changes substantially without a corresponding model retrain, it may indicate that the input data distribution has shifted — which can degrade model performance and explanation accuracy.
XAI Tools in Production
Several commercial and open-source tools have matured to support XAI in production environments.
The SHAP library (available for Python) is the standard implementation for SHAP-based explanation and includes support for TreeSHAP, KernelSHAP, and several visualization types. It integrates with major ML frameworks including scikit-learn, XGBoost, LightGBM, and TensorFlow.
LIME is available as an open-source Python library and supports text, tabular, and image explanation.
InterpretML from Microsoft Research packages the EBM alongside SHAP and LIME in an integrated framework designed for both model training and explanation.
What-If Tool from Google provides an interactive web interface for exploring model behavior, counterfactuals, and fairness metrics.
AI Fairness 360 from IBM, while primarily a fairness toolkit, includes explanation capabilities and integrates with LIME and SHAP.
For organizations deploying at scale, explanation generation adds computational overhead. TreeSHAP is fast enough for real-time explanation in many production systems. KernelSHAP and counterfactual generation are slower and may require batch processing or dedicated explanation infrastructure.
Documentation Standards
Regulatory frameworks are increasingly specific about what AI documentation should include. The EU AI Act requires high-risk AI systems to maintain technical documentation that includes "the logic followed by the AI system" — a provision that points directly to explainability. The NIST AI Risk Management Framework includes "explainability and interpretability" as components of trustworthy AI, with specific guidance on documenting explanation capabilities.
Model cards, introduced by Mitchell et al. (2019), have become a standard documentation format that includes intended use, performance metrics, fairness evaluations, and limitations. Several major technology companies now publish model cards for their AI systems, and regulatory expectations for model documentation are converging toward this format.
For organizations in regulated industries — financial services, healthcare, employment — the minimum viable documentation should include: the model's overall architecture and performance metrics; a SHAP global analysis identifying top features; a fairness analysis showing performance across demographic groups; documentation of any proxy variables identified and how they were addressed; a description of the explanation methodology and its validation; and a process for providing individual explanations to affected persons.
Section 14.10: The Limits of XAI — What Explanation Cannot Fix
Explanation Is Not Justification
Perhaps the most important point in this chapter — the one most frequently obscured in both technical and policy discussions of XAI — is that explaining a decision is not the same as justifying it. A fully accurate, perfectly faithful, beautifully visualized SHAP analysis of a decision can reveal with precision that the decision was wrong, discriminatory, or based on factors that should not have been used.
The explanation tool does its job. The model is explained. The ethical failure remains unexplained and unremedied.
This is not hypothetical. The SHAP analysis in Case Study 1 of this chapter reveals — accurately — that zip code was a top feature in a credit model, and that high-SHAP zip codes corresponded to minority neighborhoods. That is a correct explanation. It is also, without remediation, a description of potential illegal discrimination. The explanation created the evidence for accountability; it did not by itself produce accountability.
Explanation Is Not Fairness
A biased model can have an excellent explanation. A model that systematically disadvantages applicants from certain zip codes will, when analyzed with SHAP, produce a clear, faithful, internally consistent explanation that zip code is a key feature — and this explanation will accurately describe the mechanism of discrimination without eliminating it.
XAI tools can identify bias. They can measure its magnitude. They can describe its mechanisms. They cannot remedy bias. That requires changing the model, the training data, the deployment context, or the decision to deploy at all. Explanation is a diagnostic tool. Remedy requires human judgment and action.
Explanation Is Not Accountability
Knowing how an AI system made a decision does not automatically tell you who is responsible for that decision, whether they can be held to account, or what remedies are available. Accountability requires not just explanation but clear lines of human responsibility, mechanisms for challenge and appeal, and consequences for systems that cause harm.
The risk, which this textbook has flagged under the concept of "ethics washing" (see Chapter 3), is that organizations deploy sophisticated XAI tools precisely to create the appearance of accountability without the substance. "We provide SHAP-based explanations for every decision" is a meaningful statement about transparency. It is not a statement about accountability, fairness, or justice.
The Explanation Placebo
The concept of the "explanation placebo" describes the tendency to treat the availability of explanations as a substitute for genuine governance — to provide explanations as a way of satisfying regulatory requirements or public expectations without actually ensuring that the underlying system is fair, accurate, or appropriate for its deployment context.
The explanation placebo is particularly dangerous in high-stakes settings where affected individuals have limited recourse. A denied loan applicant who receives a technically accurate but practically useless SHAP-based adverse action notice — "your loan was denied because feature_427 had a SHAP value of +0.23" — has received an explanation in the technical sense without receiving the actionable information they needed. The legal requirement has been satisfied; the spirit of the requirement has not.
Organizations that use XAI tools to generate explanations that technically satisfy regulatory requirements while providing no genuine transparency to affected individuals are engaging in a form of ethics washing that is likely to face increasing regulatory and reputational scrutiny as XAI literacy among regulators and consumer advocates grows.
Looking Ahead
The chapters that follow build on the foundation established here. Chapter 15 examines the communication of AI decisions: how to present explanations to affected individuals in ways that are genuinely comprehensible and useful, not just technically compliant. Chapter 17 addresses the legal landscape around the right to explanation: what GDPR Article 22, the EU AI Act, and US sector-specific requirements actually require, and how organizations should design explanation processes that satisfy those requirements. Chapter 18 addresses accountability: how to build organizational structures that ensure genuine human responsibility for AI decisions, with XAI as one component of a broader accountability framework.
Discussion Questions
- A financial services firm argues that deploying SHAP-based explanations for its credit model satisfies its ECOA adverse action notice obligations. A consumer advocacy group argues that SHAP explanations are not legally adequate because they are post-hoc approximations that may not faithfully reflect the model's actual decision logic. How should a regulator evaluate this dispute? What additional information would be needed?
- Cynthia Rudin argues that high-stakes AI systems should use interpretable models rather than black-box models paired with explanation tools. Where do you think this argument is strongest? Where is it weakest? Are there high-stakes domains where you believe black-box models with post-hoc explanation are nonetheless the right choice? What would justify that choice?
- The Slack et al. (2020) finding that LIME and SHAP can be deceived by adversarial models is concerning, but some argue that the attack is unlikely in practice because it requires sophisticated engineering and is detectable with appropriate auditing. How should organizations weigh this concern when deciding how to use XAI tools for compliance purposes?
- Consider the different explanation needs of the following audiences for the same credit model: (a) the data scientist who built the model; (b) the applicant who was denied; (c) the OCC examiner auditing the bank; (d) the bank's board of directors. Design a brief explanation framework that addresses each audience's needs. What are the tensions among these different needs?
- The "explanation placebo" problem suggests that making explanations available can actually reduce genuine accountability by creating a false impression of transparency. Do you think this concern justifies reducing explanation requirements? Or does it argue for more rigorous explanation requirements? What would "more rigorous" mean in practice?
- Some AI ethics advocates argue that the focus on XAI is misplaced — that instead of trying to explain black-box models, we should be asking whether those models should be deployed in specific high-stakes contexts at all. How should organizations decide when explainability is sufficient governance and when it is not?
- How might the appropriate explanation methodology differ across cultural and regulatory contexts? The EU's GDPR imposes specific requirements around automated decision-making that differ from US sector-specific requirements. How should a multinational corporation deploying AI in both contexts design its explanation strategy?
Chapter 14 continues with three case studies, an exercises section, quiz, and further reading. See the chapter directory for all accompanying files.