Case Study 14.2: When Explanation Goes Wrong — Adversarial XAI and the Limits of Post-Hoc Methods

Based on: Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2020). "Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods." Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES).


The Promise of Regulatory Compliance Through XAI

By the mid-2020s, explainable AI had become, in many regulatory and corporate governance discussions, something close to a panacea. The logic was appealing in its simplicity: if an AI system can explain its decisions in human-intelligible terms, then its behavior can be audited, its biases can be detected, and its accountability can be ensured. Several major regulatory frameworks reinforced this narrative. The European Union's General Data Protection Regulation (GDPR) Article 22 was widely interpreted as creating a right to explanation for automated decisions. The EU AI Act proposed transparency requirements for high-risk AI systems. In the United States, the CFPB and the OCC issued guidance indicating that banks using ML in credit decisions would need to provide explanations comparable in specificity to those provided by traditional models.

Industry responded to this regulatory signal in a predictable way: by deploying explanation tools. LIME wrappers were integrated into loan origination systems. SHAP dashboards were built for model auditing. Model documentation began including "explanation methodology" sections. Regulatory submissions cited XAI implementations as evidence of responsible AI deployment. A new category of enterprise software emerged: "AI explainability platforms" that promised to automate the explanation process for any model architecture.

The implicit assumption underlying all of this activity was that explanation tools could be trusted: that when a LIME or SHAP explanation said a model was using legitimate features and making principled decisions, this was actually true. This assumption, it turns out, is not warranted. And a research team at Harvard University, in collaboration with colleagues at the University of California, Irvine, showed why in a 2020 paper that should have been — and arguably was not — more disruptive to regulatory practice than it proved to be.

The Slack et al. Finding

Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju published "Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods" at the AAAI/ACM Conference on AI, Ethics, and Society in February 2020. The paper's central finding is stated clearly in its abstract: they demonstrate that classifiers can be constructed to use sensitive features (such as race or gender) when making predictions on real data, while producing innocuous explanations when queried by explanation methods such as LIME and SHAP.

This is not a subtle finding. It means that a discriminatory model can pass LIME and SHAP audits. It means that an organization could deploy a model that actually discriminates on the basis of race, present SHAP explanations to regulators showing that race-correlated features have low SHAP values, and pass the examination. The XAI tools would not be lying in any simple sense — they would be producing technically valid explanations of the model's behavior on the query inputs. But the model's behavior on query inputs would be systematically different from its behavior on real applicant data.

How the Attack Works

Understanding the attack requires understanding one property of how LIME and SHAP generate their explanations. Both methods, in their standard implementations, examine the model's behavior on out-of-distribution inputs: LIME generates random perturbations around the instance of interest; KernelSHAP samples feature values from their marginal distributions. These perturbed or sampled inputs are not typical real-world data — they are synthetic combinations that may violate correlations and constraints that exist in the real data distribution.
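This off-manifold behavior is easy to demonstrate directly. The sketch below uses hypothetical data (the feature names, distributions, and coupling are assumed for illustration) and mimics KernelSHAP's standard sampler by drawing each feature independently from its marginal, which destroys the correlation present in the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: two strongly correlated features,
# e.g. "income" and "loan_amount" (the coupling is engineered here).
n = 1000
income = rng.normal(50, 10, n)
loan = 0.8 * income + rng.normal(0, 1, n)
X = np.column_stack([income, loan])

# KernelSHAP-style marginal sampling: each feature is drawn
# independently from its own marginal, breaking the correlation.
synthetic = np.column_stack([
    X[rng.integers(0, n, 500), 0],
    X[rng.integers(0, n, 500), 1],
])

# The correlation survives in real data but vanishes in the
# synthetic queries, which therefore lie off the data manifold.
print(np.corrcoef(X.T)[0, 1])          # close to 1
print(np.corrcoef(synthetic.T)[0, 1])  # close to 0
```

A classifier that can recognize points like `synthetic` as unlike its training data has everything it needs to treat explanation queries differently from real inputs.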

The Slack et al. attack exploits this property. The adversarial classifier is engineered to behave differently depending on whether its input looks like a real-world data point or a synthetic explanation-method query.

The implementation has two components:

Component 1: An outlier detector. The classifier includes a mechanism for detecting whether a given input resembles real-world data. The detector learns the boundary of the real data distribution during training. Inputs that fall within this boundary are classified as "real" queries; inputs that fall outside it are classified as "explanation queries."

Component 2: Conditional discrimination. When the classifier receives a real query (a genuine applicant data point), it uses a discriminatory decision rule — one that relies on race, gender, or another protected characteristic, or on a highly correlated proxy. When the classifier receives an explanation query (a LIME perturbation or SHAP sample), it uses a fair, innocuous decision rule that does not rely on protected characteristics.

The explanation tool observes the classifier's behavior on explanation queries — the fair decision rule — and produces explanations reflecting that behavior. The explanations are accurate descriptions of what the model does to explanation queries. They are inaccurate descriptions of what the model does to actual applicants.
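The two components can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: a crude z-score outlier check stands in for the trained out-of-distribution detector Slack et al. use, and the feature indices and decision rules are assumed for exposition:

```python
import numpy as np

class AdversarialClassifier:
    """Two-component construction in the spirit of Slack et al.:
    an outlier check routes synthetic explanation queries to an
    innocuous rule, while in-distribution inputs are scored by a
    rule that uses a protected feature. Indices are illustrative."""

    def __init__(self, X_train, threshold=3.0, protected_idx=0):
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0)
        self.threshold = threshold
        self.protected_idx = protected_idx

    def _looks_real(self, x):
        # Crude z-score stand-in for the paper's trained detector:
        # points far from the training distribution are treated as
        # explanation-method perturbations.
        return np.max(np.abs((x - self.mean) / self.std)) < self.threshold

    def predict(self, X):
        out = []
        for x in X:
            if self._looks_real(x):
                # Real-looking input: discriminatory rule.
                out.append(int(x[self.protected_idx] > 0))
            else:
                # Explanation query: innocuous rule on another feature.
                out.append(int(x[-1] > 0))
        return np.array(out)

# The protected feature is positive in both queries below, yet only
# the in-distribution point is scored by the discriminatory rule.
rng = np.random.default_rng(1)
clf = AdversarialClassifier(rng.normal(0, 1, (500, 3)))
print(clf.predict(np.array([[1.0, 0.5, -1.0]])))  # [1]: discriminatory rule
print(clf.predict(np.array([[9.0, 0.5, -1.0]])))  # [0]: innocuous rule
```

An explanation method that only ever sees off-manifold queries like the second input will attribute the model's decisions entirely to the innocuous feature.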

The attack is not hypothetical or theoretical. Slack et al. implemented it, demonstrated that it works against both LIME and SHAP with high reliability, and showed that the adversarial explanations are qualitatively indistinguishable from legitimate ones: a reviewer examining the SHAP values for a range of queries would see feature attributions consistent with a fair model.

The Implication: XAI as a Compliance Shield

The governance implication of the Slack et al. finding is one that industry and regulators have been slow to fully internalize. If an adversarial classifier of the type Slack et al. describe can be built — and it can — then any regulatory framework that requires explanation as its primary compliance mechanism is vulnerable to gaming. An institution that is sufficiently motivated to continue discriminating, and that has the technical sophistication to implement the attack (which is well within the capabilities of any organization that can build the discriminatory model in the first place), could in principle use XAI as a compliance shield: deploying explanation tools not to create genuine transparency, but to create the appearance of transparency while continuing to discriminate.

This is not a hypothetical concern about bad actors operating outside normal governance structures. It is a concern about how governance structures themselves can be designed to produce the appearance of compliance rather than its substance — a dynamic that this textbook has discussed under the heading of "ethics washing." The ethics washing risk is particularly acute here because the tools being used to wash are the same tools that are supposed to be preventing washing. LIME and SHAP, deployed as compliance mechanisms, can be the instrument through which compliance obligations are evaded.

It is important to be precise about the scope of this concern. The Slack et al. paper does not show that LIME and SHAP are useless. It shows that they are insufficient as the sole or primary mechanism for detecting discrimination in AI systems. The attack requires deliberate engineering; it does not happen accidentally. For most real-world AI deployments, the primary risks of bias come from uncorrected patterns in training data, poor feature selection, and inadequate testing — not from deliberate adversarial construction. These sources of bias are, in principle, detectable with standard LIME and SHAP analysis.

But "most real-world deployments are not adversarially constructed" is not the same as "XAI-based compliance is adequate." Regulatory frameworks that are adequate only for non-adversarial actors fail precisely in the cases where enforcement matters most — against actors who are motivated to circumvent the regulation.

What Auditors Need Beyond Explanation

The Slack et al. finding points toward a more robust framework for AI compliance auditing — one that treats explanation as one tool among several rather than as a sufficient mechanism by itself.

Model access, not just explanation access. The adversarial classifier works by behaving differently on explanation queries than on real data. An auditor who can access the model's code, architecture, and training procedure can potentially detect the outlier detector and the conditional discrimination logic. An auditor who can only access explanation outputs cannot. Effective regulatory auditing of AI systems requires the ability to examine the model itself — its source code, its training data, its weights — not just the explanations it generates on demand.

Training data access. Discriminatory models typically learn from discriminatory patterns in training data. Auditors who can examine training data can identify the encoding of protected characteristics, the presence of high-correlation proxies, and the patterns of historical outcomes that the model is trained to replicate. Training data access is a more powerful discriminatory-bias detection tool than explanation access.

Real outcome monitoring. The most robust protection against discriminatory models — whether adversarially constructed or not — is monitoring actual outcomes. If a model produces significantly different approval rates for similarly qualified applicants from different demographic groups, this is detectable through outcome data regardless of what the model's explanations say. Fair lending examiners have long used statistical analysis of lending outcomes as the primary detection tool for discrimination, and this approach remains valid for ML-based systems.
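As a sketch of the idea, the function below computes an adverse impact ratio from raw outcome data. The function name and toy data are hypothetical; the 0.8 screening threshold follows the conventional four-fifths rule rather than anything specific to Slack et al.:

```python
import numpy as np

def adverse_impact_ratio(approved, group):
    """Ratio of the lowest group approval rate to the highest.
    The four-fifths rule treats a ratio below 0.8 as a screening
    flag; a real fair-lending analysis would also control for
    legitimate creditworthiness factors."""
    rates = {g: approved[group == g].mean() for g in np.unique(group)}
    return min(rates.values()) / max(rates.values()), rates

# Hypothetical outcomes: group "b" is approved at half the rate of "a".
approved = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1, 0])
group = np.array(["a"] * 5 + ["b"] * 5)
ratio, rates = adverse_impact_ratio(approved, group)
print(ratio)  # 0.5, below the 0.8 screening threshold
```

Nothing in this computation consults the model or its explanations, which is precisely why an adversarially constructed classifier cannot hide from it.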

Adversarial testing. Regulators and auditors can apply their own adversarial tests: querying the model with inputs designed to probe for the explanation-query detection mechanism that the Slack et al. attack requires. If a classifier's behavior on inputs that resemble SHAP samples differs from its behavior on inputs that resemble real applicant data, this is a detectable signal that warrants further investigation.
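One way an auditor might implement such a probe, sketched under the assumption that SHAP-style queries can be approximated by marginal resampling (the function name and the choice of statistic are illustrative):

```python
import numpy as np

def probe_for_query_detection(model, X_real, rng, n=200):
    """Compare a model's positive rate on real data points against its
    rate on marginal-resampled (SHAP-style) synthetic points. A large
    gap suggests the model may special-case explanation queries."""
    real = X_real[rng.integers(0, len(X_real), n)]
    synthetic = np.column_stack([
        X_real[rng.integers(0, len(X_real), n), j]
        for j in range(X_real.shape[1])
    ])
    return model(real).mean(), model(synthetic).mean()

# A well-behaved model should score both populations similarly.
rng = np.random.default_rng(2)
X_real = rng.normal(0, 1, (1000, 3))
honest = lambda X: (X[:, 0] > 0).astype(float)
rate_real, rate_synth = probe_for_query_detection(honest, X_real, rng)
print(abs(rate_real - rate_synth))  # small gap for an honest model
```

A model of the Slack et al. type, by construction, would show a substantial gap between the two rates, since it applies different decision rules to the two populations.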

Ensemble validation. No single explanation method should be relied upon exclusively. Cross-validating LIME and SHAP results against each other, against partial dependence plots, against outcome analysis, and against model architecture inspection provides more robust assurance than any single method.
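As one concrete cross-check, an auditor can compare the importance rankings two methods assign to the same instance. The Spearman rank correlation below is a simple sketch (the function name is assumed, and ties are ignored for brevity):

```python
import numpy as np

def rank_agreement(attr_a, attr_b):
    """Spearman rank correlation between two feature-attribution
    vectors (e.g. LIME vs. SHAP for the same instance), ranking
    features by absolute importance. Assumes no ties."""
    ra = np.argsort(np.argsort(-np.abs(attr_a)))
    rb = np.argsort(np.argsort(-np.abs(attr_b)))
    d = ra - rb
    n = len(ra)
    return 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# 1.0 when both methods rank features identically,
# -1.0 when the rankings are exactly reversed.
print(rank_agreement(np.array([3.0, 2.0, 1.0]),
                     np.array([3.0, 2.0, 1.0])))  # 1.0
print(rank_agreement(np.array([3.0, 2.0, 1.0]),
                     np.array([1.0, 2.0, 3.0])))  # -1.0
```

High agreement between methods does not prove faithfulness (both can be fooled by the same off-manifold queries), but low agreement is a cheap signal that something about the model or the explanations deserves scrutiny.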

The Faithfulness Problem in Broader Context

The adversarial explanation problem is an extreme case of the faithfulness problem that Section 14.8 of this chapter addresses more generally. Every post-hoc explanation is a model of the model — an approximation of the model's behavior based on a sample of queries. The faithfulness problem is that this approximation can be wrong: LIME's local linear fit may be a poor approximation of the model's local behavior; SHAP's independence assumptions may be violated; saliency methods may reflect input structure rather than model behavior.

In the Slack et al. attack, the faithfulness problem is deliberately engineered. The adversarial classifier is specifically constructed to make the approximation wrong in a systematic way. But even without adversarial construction, post-hoc explanations can be unfaithful due to the natural limitations of the approximation methods.

Adebayo et al. (2018) demonstrated this for gradient-based saliency methods, finding that several widely used methods produced saliency maps that were essentially unchanged when the model's trained weights were replaced with random ones — evidence that those methods were reflecting input structure rather than model behavior. Rudin (2019) argued that the faithfulness problem is inherent in post-hoc explanation and cannot be fully resolved without using inherently interpretable models. These arguments remain relevant and have not been adequately refuted.

What Good Explainability Requirements Look Like

The Slack et al. finding, combined with the broader faithfulness literature, points toward what a well-designed explainability requirement should include:

1. Explanation as one requirement, not the only requirement. Explainability should be required as part of a broader set of transparency and accountability mechanisms, not as a standalone compliance mechanism. Good explainability requirements specify what explanations must be provided and to whom, while also requiring outcome monitoring, model access for regulators, and documentation of training data.

2. Validation requirements for explanation tools. A regulation that requires explanations should also require that those explanations be validated for faithfulness using specified methods. This is analogous to how the SEC requires that financial statements not just be prepared according to GAAP, but also audited by independent auditors using specified audit standards. Unvalidated explanations are analogous to unaudited financial statements.

3. Regulatory access to model internals. Regulatory frameworks should establish clear authority for examiners to access model source code, training data, and documentation — not just model outputs and explanations. The Federal Reserve's SR 11-7 guidance on model risk management (adopted by the OCC as Bulletin 2011-12) moves in this direction for banking, but similar authority is less clearly established in other regulated sectors.

4. Prohibition on adversarial behavior. Regulatory frameworks should explicitly prohibit the design or deployment of models that behave differently when queried for compliance purposes than when processing real decisions. This is analogous to prohibitions on dual books in accounting. The standard should be clear: the model that regulators examine must be the model that applicants face.

5. Outcome-based testing as a floor. Disparate impact analysis based on actual outcomes — the approval and denial rates by demographic group for actual applicants, controlled for legitimate creditworthiness factors — should be a minimum required element of any AI fairness compliance framework. Outcome-based testing cannot be fooled by adversarially constructed explanations, because it measures what actually happens to real applicants.

The Regulatory Response

Regulatory responses to the adversarial explanation problem have been cautious and incomplete as of this writing. The EU AI Act's transparency provisions require that high-risk AI systems be capable of providing explanations to affected individuals, but do not specify validation requirements or explicitly address the adversarial explanation risk. The CFPB's guidance on ML in credit decisions acknowledges the limitations of post-hoc explanation but does not mandate specific validation methodologies. The UK's Information Commissioner's Office has published guidance on meaningful information for automated decisions under UK GDPR that acknowledges the faithfulness question without resolving it.

The most specific regulatory engagement with these issues has come from prudential banking regulators in the United States, particularly through the model risk management framework of SR 11-7, which does require validation of model outputs including explanation tools. Banking regulators have also been more willing to require access to model internals during examinations. Other regulatory sectors — employment, housing, healthcare — have been slower to develop comparably specific frameworks.

The implication for organizations is that the regulatory landscape around XAI is evolving, and that the standard of care is likely to become more demanding as regulatory sophistication about XAI limitations grows. Organizations that treat current requirements as a floor and invest in more robust compliance frameworks now are likely to be better positioned for future regulatory expectations than those that do the minimum required today.


Discussion Questions

  1. The Slack et al. attack requires a deliberate decision to engineer a classifier that behaves differently for explanation queries than for real data. How should regulators distinguish between adversarial and non-adversarial violations of explanation faithfulness? What evidentiary standards should apply? Should the legal standard be intent-based (like fraud) or outcome-based (like disparate impact)?

  2. The Slack et al. paper was published in 2020 and has been widely cited in the academic literature. Yet regulatory frameworks in most jurisdictions have not yet explicitly addressed the adversarial explanation risk it describes. What explains this gap between research findings and regulatory response? Who has the incentive to close it, and who has the incentive to keep it open?

  3. The paper recommends that regulators require access to model internals rather than relying on explanation outputs. What are the practical challenges of implementing this recommendation? How should regulators develop the technical capacity to audit model code and training data? What privacy or intellectual property concerns might institutions raise as objections, and how should those concerns be weighed?

  4. The concept of "ethics washing" describes using ethical tools or language to create the appearance of ethical behavior without the substance. How does the adversarial XAI problem illustrate this dynamic? Are there analogous cases from other domains — environmental reporting, financial auditing, workplace safety — where compliance mechanisms have been similarly gamed? What does the history of those domains suggest about how the XAI compliance problem is likely to evolve?