Case Study 23-2: When the Algorithm Got the Law Wrong — LLM Hallucination in a Compliance Context
Background
Cornerstone Financial Group is a mid-tier, UK-headquartered financial services group operating a retail bank, a small asset management division, and a payment services entity. The group is regulated by both the FCA and the PRA, and its payment services entity holds authorization under the UK Payment Services Regulations. When DORA — the EU Digital Operational Resilience Act — began to apply, Cornerstone's compliance team undertook a significant review of its ICT risk management framework in 2024. Although the UK has its own operational resilience framework, the review was a precautionary exercise to prepare for client and counterparty expectations around DORA compliance.
The compliance team at Cornerstone, led by Chief Compliance Officer Harriet Voss, had adopted an LLM-based compliance Q&A tool in early 2024 as part of a broader digital transformation initiative. The tool — a commercial product built on GPT-4 — was deployed for internal use by compliance and risk staff to answer questions about the regulatory framework, firm policies, and the content of the firm's GRC system. It had been available for six months and had built a reputation for giving fast, articulate, generally reliable answers.
The tool had not been implemented with a Retrieval-Augmented Generation architecture. It answered questions from the model's training data, not from a corpus of current regulatory text. It did not cite sources.
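The architectural difference can be made concrete. Below is a minimal, illustrative sketch (all function names and strings are hypothetical, not from any real product) contrasting the prompt the deployed tool effectively sent — the bare question, answered from training memory — with the prompt a retrieval-grounded system would send:

```python
# Hypothetical sketch: context-free prompting vs. retrieval-grounded prompting.
# Names and prompt wording are illustrative only.

def build_context_free_prompt(question: str) -> str:
    """What the deployed tool effectively did: the model answers from
    training memory alone, with nothing to ground the response."""
    return f"Answer this compliance question:\n{question}"

def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """What a RAG system does: retrieved regulatory text is placed in the
    prompt, and the model is instructed to answer only from that text."""
    sources = "\n\n".join(f"[{p['ref']}] {p['text']}" for p in passages)
    return (
        "Answer ONLY from the passages below and cite the passage reference.\n"
        "If the passages do not answer the question, say so.\n\n"
        f"Passages:\n{sources}\n\nQuestion: {question}"
    )

question = "Does Cornerstone need to comply with DORA Article 28 obligations?"
passage = {
    "ref": "DORA Art. 28",
    "text": "Financial entities shall adopt a strategy on ICT third-party risk.",
}
print(build_context_free_prompt(question))
print(build_grounded_prompt(question, [passage]))
```

In the first prompt, any error in the model's internal synthesis goes straight to the user; in the second, the answer is constrained to text the user can inspect.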
The Incident
In November 2024, James Oduya — a junior compliance officer, eighteen months into his role at Cornerstone — was assigned to complete a preliminary DORA applicability assessment for the group's technology leadership team. The task was to identify which articles of DORA applied to Cornerstone as a "financial entity" under the regulation, and to flag any articles that required significant process or technology changes.
James had read a briefing document on DORA that the team's external law firm had provided in early 2024. The briefing covered Articles 5 through 16 in depth — the core ICT risk management requirements. It was less detailed on Chapter V, which addresses ICT third-party risk management and runs from Article 28 through Article 44. The law firm had noted, in the briefing, that Chapter V applied to financial entities in their relationships with ICT third-party service providers, and recommended that the team review Article 28 in particular.
James had the article number but had not read the full text of Article 28. He turned to the compliance Q&A tool and typed: "Does Cornerstone need to comply with DORA Article 28 obligations?"
The tool responded:
"DORA Article 28 establishes requirements for ICT third-party service providers (TPPs) — specifically, for critical ICT third-party service providers (CTPPs) that are designated by the European Supervisory Authorities. Article 28 creates an oversight framework for these providers and sets out the powers of lead overseers. As a financial entity, Cornerstone is not within the scope of Article 28 — that article applies to the TPPs themselves, not to the financial entities that use them. Your obligations as a financial entity relate primarily to Articles 29 and 30 of Chapter V, which concern how you manage your relationships with ICT TPPs, including due diligence, contractual requirements, and monitoring. But Article 28 itself is addressed to the providers, not to you."
The answer was confidently phrased, used correct regulatory terminology, and distinguished between financial entities and TPPs in a way that sounded technically sophisticated. It was also wrong, in a way that mattered.
DORA Article 28 — "General Principles of Sound Management of ICT Third-Party Risk" — is addressed specifically to financial entities. It requires financial entities to maintain an ICT third-party risk management framework as part of their overall ICT risk management framework under Article 6. It requires financial entities to adopt and maintain a strategy on ICT third-party risk, maintain an information register of all ICT third-party arrangements, assess the concentration risk arising from their ICT third-party relationships, and establish exit strategies for critical ICT third-party services. These are obligations on Cornerstone — directly, explicitly, and without ambiguity in the text of the regulation.
The LLM had confused Article 28 with the oversight framework articles that apply to designated critical ICT third-party providers — a conceptually plausible confusion given that Chapter V of DORA does establish such an oversight regime, and that the distinction between articles applying to financial entities and articles applying to TPPs is one of the genuinely complex structural features of the regulation. The model's training data may have included commentary or analysis that conflated these provisions, or the model may have reconstructed a plausible-sounding but inaccurate answer from multiple sources.
James, not yet deeply familiar with DORA's structure, accepted the answer. He completed the preliminary applicability assessment without flagging Article 28 as applying to Cornerstone. The technology leadership team's DORA readiness roadmap, produced in December 2024, did not include the Article 28 requirements: no strategy on ICT third-party risk, no information register, no concentration risk assessment, no exit strategy framework.
Discovery
The error was discovered in February 2025, when Harriet Voss attended a DORA implementation workshop organized by a trade association and heard a presentation that specifically addressed Article 28 obligations for financial entities. She returned to the office and reviewed the preliminary applicability assessment. The gap was immediately apparent.
The Article 28 requirements were not trivial to implement. The information register — a comprehensive inventory of all ICT third-party arrangements — required coordination across seven business areas and the technology team. The concentration risk assessment required analysis of the firm's dependencies on a small number of critical cloud providers. The exit strategy documentation required legal and technology collaboration. What had been, in November 2024, a twelve-month implementation program became a compressed six-month remediation effort.
Harriet convened a post-incident review. The Q&A tool's response was retrieved from James's conversation history. Reading it, she felt two things simultaneously: she understood completely how James had accepted the answer, and she was clear that the tool had failed its users at exactly the moment that mattered most.
The post-incident review did not result in disciplinary action against James. He had used the tool in good faith, for exactly the purpose it had been positioned for. The failure was systemic.
Root Cause Analysis
Priya Nair, brought in to support the post-incident review, identified four contributing factors:
Architecture: The tool operated without RAG. It answered from training memory, not from current regulatory text. The final DORA regulation had been published in the EU Official Journal in December 2022 and applied in full from 17 January 2025. The model's training data may have included early commentary on DORA that predated the final version, or commentary that was simply incorrect. Without RAG, there was no mechanism to ground the answer in the actual text of the regulation.
No citation requirement: The tool produced answers without citing the specific article text it drew from. This meant that a user receiving an answer could not verify it against the primary source without independently locating and reading the article — which largely defeats the purpose of a Q&A tool for compliance staff who are relying on the tool to save time.
Absence of uncertainty signaling: The tool's response was expressed with full confidence. A well-designed compliance Q&A tool should signal its uncertainty — flagging answers to questions about specific article interpretation as requiring verification, or noting when its training data may not reflect the most current version of a regulation. The commercial tool Cornerstone had deployed did not include this feature.
Positioning and training: The tool had been positioned to staff as a reliable Q&A resource. No training had been provided to compliance staff on the hallucination risk, the importance of verifying answers against primary sources, or the types of questions for which the tool was not appropriate. James's use of the tool for a specific article applicability question was a predictable and reasonable use given the positioning.
Redesign: Principles for Responsible LLM Deployment in Compliance
Following the post-incident review, Cornerstone replaced the context-free LLM Q&A tool with a RAG-based compliance Q&A system. The redesign was guided by five principles that Priya documented in the implementation brief:
Principle 1: Every answer must cite a source. The RAG system retrieves specific passages from the regulatory corpus before answering. Every response includes the passage retrieved, the document it came from, and the article reference. The user can read the retrieved passage immediately and assess whether the answer is consistent with the primary text. Answers not grounded in retrieved text are not displayed.
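One way to see what Principle 1 requires is a minimal sketch of the retrieve-and-cite rule. This is illustrative only: a trivial keyword-overlap score stands in for real embedding-based retrieval, and the corpus, field names, and threshold are hypothetical.

```python
# Hypothetical sketch of Principle 1: every answer carries its retrieved
# source, and ungrounded answers are suppressed rather than displayed.

CORPUS = [
    {
        "ref": "DORA (EU) 2022/2554, Article 28",
        "version_date": "2022-12-27",
        "text": ("As part of their ICT risk management framework, financial "
                 "entities shall adopt and regularly review a strategy on "
                 "ICT third-party risk."),
    },
]

def retrieve(question: str, corpus: list[dict], min_overlap: int = 3):
    """Return the best-matching passage, or None if nothing overlaps enough.
    A real system would use embedding similarity, not word overlap."""
    q_words = set(question.lower().split())
    best, best_score = None, 0
    for passage in corpus:
        score = len(q_words & set(passage["text"].lower().split()))
        if score > best_score:
            best, best_score = passage, score
    return best if best_score >= min_overlap else None

def answer(question: str, corpus: list[dict]) -> dict:
    passage = retrieve(question, corpus)
    if passage is None:
        # Principle 1: answers not grounded in retrieved text are not shown.
        return {"status": "no_grounded_answer",
                "message": "No passage in the corpus supports an answer."}
    return {"status": "ok",
            "source": passage["ref"],
            "version_date": passage["version_date"],
            "retrieved_text": passage["text"]}

print(answer("Does the ICT third-party risk strategy apply to financial entities?", CORPUS))
print(answer("What is the capital of France?", CORPUS))
```

The key design choice is the refusal path: when retrieval finds nothing, the system returns no answer at all, rather than falling back to ungrounded generation.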
Principle 2: Outputs are marked "For guidance only — verify against primary source." This label appears on every response from the Q&A tool. It is not boilerplate. It is a substantive instruction that every answer is a starting point for analysis, not a definitive compliance determination. The label is styled to be visually prominent, not buried in fine print.
Principle 3: Questions about specific article interpretation are escalated. The system classifies incoming questions by type. Questions that ask whether a specific article of a regulation applies to the firm — or that request a legal interpretation — are flagged with a yellow warning: "This question involves regulatory interpretation and may require specialist review. The answer below is based on retrieved regulatory text only; verify with a qualified compliance professional before relying on it for compliance purposes."
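The classification step in Principle 3 can be sketched as a simple rule-based router. This is a toy illustration under stated assumptions: the regex patterns, warning text, and function names are hypothetical, and a production system would use a trained classifier rather than three patterns.

```python
import re

# Hypothetical sketch of Principle 3: questions that ask about article
# applicability or legal interpretation are routed to a flagged path.
# Patterns are illustrative, not an exhaustive classifier.

INTERPRETATION_PATTERNS = [
    r"\barticle\s+\d+",                    # mentions a specific article number
    r"\bapply\b|\bapplies\b|\bapplicab",   # applicability language
    r"\binterpret",                        # interpretation language
]

WARNING = ("This question involves regulatory interpretation and may require "
           "specialist review. Verify with a qualified compliance "
           "professional before relying on the answer.")

def classify(question: str) -> dict:
    q = question.lower()
    flagged = any(re.search(p, q) for p in INTERPRETATION_PATTERNS)
    return {"escalate": flagged, "warning": WARNING if flagged else None}

print(classify("Does Cornerstone need to comply with DORA Article 28 obligations?"))
print(classify("Where is the firm's outsourcing policy stored?"))
```

James's original question would match on both the article number and the applicability language, so it would reach him with the warning attached.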
Principle 4: The regulatory corpus is version-controlled and dated. The document corpus that underlies the RAG system is maintained by the compliance team, not by the technology vendor. Each document in the corpus is tagged with its publication date and version. When a regulation is updated, the old version is archived (not deleted) and the new version is added. The tool answers from the current corpus, and responses include the version date of the retrieved document.
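The archive-not-delete rule in Principle 4 can be sketched with a small versioned-corpus structure. All class names, document IDs, and dates below are hypothetical, chosen only to illustrate the mechanism.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of Principle 4: each corpus document carries a
# version date shown in responses; superseded versions are archived,
# never deleted, and the tool answers only from the current version.

@dataclass
class CorpusDocument:
    doc_id: str        # e.g. "dora-2022-2554" (illustrative ID)
    version_date: str  # ISO date of this version, included in responses
    text: str

@dataclass
class RegulatoryCorpus:
    current: dict = field(default_factory=dict)  # doc_id -> latest version
    archive: list = field(default_factory=list)  # superseded versions

    def add_version(self, doc: CorpusDocument) -> None:
        old = self.current.get(doc.doc_id)
        if old is not None:
            self.archive.append(old)  # archived, not deleted
        self.current[doc.doc_id] = doc

corpus = RegulatoryCorpus()
corpus.add_version(CorpusDocument("dora-2022-2554", "2022-12-27", "Final text."))
corpus.add_version(CorpusDocument("dora-2022-2554", "2024-06-01", "Consolidated text."))

print(corpus.current["dora-2022-2554"].version_date)  # date cited in answers
print(len(corpus.archive))                            # one archived version
```

Keeping the archive means an auditor can reconstruct which version of a regulation any past answer was grounded in.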
Principle 5: Staff receive explicit training on LLM limitations. All compliance staff who have access to the tool receive a thirty-minute training module covering: what hallucination is and how it occurs, the types of questions that are appropriate for the tool and those that require independent verification, how to read and evaluate the cited sources in a response, and the escalation path for questions where the tool's answer is uncertain or contradictory. The training is refreshed annually.
The Corrected Answer
Under the RAG-based system, the same question — "Does Cornerstone need to comply with DORA Article 28 obligations?" — produces a response that includes the following retrieved passage:
Retrieved from: DORA (EU) 2022/2554, Article 28(2) — Published Official Journal 27/12/2022, in force 17/01/2025: "As part of their ICT third-party risk management framework, financial entities shall adopt and regularly review a strategy on ICT third-party risk, taking into account the multi-vendor strategy referred to in Article 6(9). The strategy on ICT third-party risk shall include a policy on the use of ICT services supporting critical or important functions provided by ICT third-party service providers and shall apply at individual and, where relevant, at consolidated and sub-consolidated level."
The system's answer, generated from this passage, reads:
"Based on the retrieved text of DORA Article 28(2), financial entities — which includes Cornerstone as an FCA- and PRA-regulated financial entity — are required to adopt and regularly review a strategy on ICT third-party risk as part of their ICT third-party risk management framework. This obligation applies to Cornerstone directly. Note: this answer is based on the retrieved regulatory text above; please verify applicability with your compliance team before drawing final conclusions."
The difference between the two answers — the context-free LLM's confident and incorrect "Article 28 applies to providers, not to you" and the RAG system's accurate and cited "as a financial entity, this applies to you" — is the difference between a tool that created a compliance gap and a tool that would have prevented one.
Broader Implications: LLM Deployment in High-Stakes Regulatory Contexts
The Cornerstone incident is not unusual. In the period 2023–2025, multiple financial institutions experienced incidents — most never publicly disclosed — in which LLM-generated regulatory guidance was acted on without verification and subsequently found to be incorrect. The errors ranged from the minor (incorrect threshold values) to the significant (incorrect applicability assessments for major regulatory requirements).
The pattern across these incidents is consistent. The errors occurred in organizations where LLM tools had been deployed with insufficient attention to:
- Architecture (RAG vs. context-free generation)
- Output design (cited vs. uncited responses)
- User training (hallucination risk awareness)
- Appropriate use boundaries (what the tool is and is not appropriate for)
These are not technology failures in the traditional sense. The LLM performed exactly as a context-free LLM without RAG would be expected to perform: it generated a plausible, articulate, confident response based on a synthesis of training data that may or may not have been accurate. The failure was in the deployment design — in treating a technology with known hallucination risk as if it were reliable, without the architectural safeguards that make it so.
Harriet Voss, reflecting on the incident a year later, offered a summary that has since been cited in several RegTech presentations: "We thought the tool was reading the regulation. It was synthesizing things that sounded like the regulation. That sounds like a subtle distinction until it costs you six months of remediation work."
Discussion Questions
1. The LLM's incorrect response on DORA Article 28 was not random — it was a coherent, plausible-sounding answer that confused two adjacent concepts in the regulation. Why does this type of confident, plausible-but-wrong hallucination present a higher risk than a response that is obviously wrong or incoherent?

2. RAG substantially reduces hallucination by grounding answers in retrieved source text. However, Priya notes that RAG "does not eliminate hallucination." What hallucination risks remain even in a RAG-based system, and how should the Q&A tool's output design account for these residual risks?

3. The post-incident review found that James had used the tool "in good faith, for exactly the purpose it had been positioned for." The review did not result in disciplinary action. Do you agree with this conclusion? What accountability framework should govern the use of LLM tools in compliance contexts, and how should responsibility be distributed between the tool, the user, and the compliance function leadership?

4. One of the five redesign principles requires that the regulatory corpus underlying the RAG system be maintained by the compliance team. This is resource-intensive: regulations change, new publications must be indexed, outdated versions must be managed. Design a governance process for corpus maintenance that is systematic enough to be reliable but proportionate for a compliance team of eight professionals.

5. The incident occurred because a compliance officer relied on an LLM for a task that required regulatory interpretation — assessing the applicability of a specific article to a specific firm. Draft a brief policy (one page) for Cornerstone that defines which types of compliance questions are appropriate for the LLM Q&A tool and which are not, including the escalation path for questions that fall outside the tool's appropriate use.