Case Study 39.2: Priya's LLM Experiment — What Worked, What Didn't, and the Lesson


Background

Priya had been thinking about large language models in compliance since the first time she used one to summarize a 140-page ESMA consultation paper in twelve minutes and got an answer that was, as far as she could tell, accurate. That had been in 2023. By 2024, she had become the de facto internal resource at her firm on LLM use in regulatory advisory — not because she had formal expertise, but because she was the person who had asked the most questions and accumulated the most field evidence.

The formal pilot came together over six months. Priya proposed it, her practice lead approved it, and three client firms — at different stages of their RegTech maturity — agreed to participate. The pilot used a commercial LLM platform configured with a regulatory knowledge base: a curated collection drawn from the FCA Handbook, the PRA Rulebook, ESA technical standards, FCA consultation papers, and key guidance documents. The platform also had access to a live RSS feed of regulatory publications. The intended use cases were: regulatory horizon scanning and summarization; policy gap analysis; and compliance training content generation.

The pilot ran for four months, with structured evaluation at week four, week eight, and week twelve. Priya's evaluation methodology was deliberately simple: take a sample of LLM outputs, evaluate them against the primary source documents, and categorize errors as minor (presentational, inconsequential), moderate (incomplete, misleading but correctable), or material (factually wrong in a way that would produce an incorrect compliance determination if acted upon).
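To make the taxonomy concrete, here is a minimal sketch of what an evaluation record might look like in code. The class names and fields are illustrative assumptions, not artifacts of the pilot:

```python
from dataclasses import dataclass
from enum import Enum


class ErrorSeverity(Enum):
    """Severity taxonomy from the pilot's evaluation methodology."""
    MINOR = "minor"        # presentational, inconsequential
    MODERATE = "moderate"  # incomplete, misleading but correctable
    MATERIAL = "material"  # would produce an incorrect compliance
                           # determination if acted upon


@dataclass
class SummaryEvaluation:
    """One LLM output checked against its primary source document."""
    output_id: str
    source_document: str
    errors: list[ErrorSeverity]

    @property
    def has_material_error(self) -> bool:
        return ErrorSeverity.MATERIAL in self.errors


def material_error_rate(evaluations: list[SummaryEvaluation]) -> float:
    """Share of evaluated outputs containing at least one material error."""
    flagged = sum(1 for e in evaluations if e.has_material_error)
    return flagged / len(evaluations)
```

Applied to the week-twelve sample described below, 24 flagged outputs out of 200 would produce the 12% figure.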

The results were more complicated than she had expected.


What the Data Showed

Horizon scanning efficiency: The headline number was compelling. Across all three firms, the time compliance analysts spent reading and summarizing regulatory publications fell by approximately 60% during the pilot period compared to the baseline quarter. Analysts reported that the LLM summaries were, in most cases, a useful starting point that directed their attention to the right parts of the source documents.

This was the use case that worked best, and Priya thought she understood why: horizon scanning is inherently a triage function. The goal is not to produce a definitive interpretation of a regulatory publication; it is to identify what needs human attention. An LLM that correctly identifies the three most important developments in a 60-page consultation paper and summarizes them intelligibly — even if the summary is slightly incomplete — has done its job, provided a human expert reads the relevant sections before acting.

Policy gap analysis: The policy gap analysis results were mixed. The LLM was effective at identifying areas of a firm's policy documentation that were not addressed by an existing policy — the structural gaps. It was less reliable at assessing whether the policies that did exist were actually adequate to meet regulatory requirements. The distinction matters: a tool that tells you "you have no policy on X" is doing something different and less interpretive than a tool that tells you "your policy on X meets the regulatory requirement."

Priya documented three cases across the pilot period where the gap analysis produced a "no gap identified" conclusion for a policy area where a human reviewer subsequently identified a material inadequacy. In each case, the LLM had identified that a policy document existed and covered the relevant topic, but had not correctly assessed whether the policy's substantive provisions met the regulatory standard. The tool was accurate about coverage; it was less accurate about adequacy.

Error rate across summaries: The structured evaluation at week twelve produced the finding that Priya was most careful about characterizing. Testing a sample of 200 LLM-generated regulatory summaries against the primary source documents, the evaluation team classified 12% of summaries (24 of the 200) as containing at least one material error — a factual inaccuracy about a regulatory requirement that, if acted upon without verification, would produce an incorrect compliance outcome.

This number required context. Priya spent considerable time in the pilot report explaining what it meant and what it did not mean. It did not mean that the LLM was unreliable. It meant that the LLM's outputs required expert verification before being acted upon — which was what the pilot's workflow had specified from the outset. The 12% figure was the cost of not verifying; in a workflow where verification was mandatory, the material errors were caught and corrected before they influenced any compliance decision.
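A quick way to see why the pilot treated the 12% as the cost of not verifying: if errors were independent across summaries (an assumption made here for illustration, not something the pilot measured), the per-summary rate compounds rapidly once outputs are acted on unverified. A sketch:

```python
def p_any_material_error(per_summary_rate: float, n_summaries: int) -> float:
    """Probability that at least one of n unverified summaries contains a
    material error, assuming errors are independent across summaries."""
    return 1 - (1 - per_summary_rate) ** n_summaries


# At the pilot's measured 12% per-summary rate:
print(round(p_any_material_error(0.12, 1), 2))   # 0.12
print(round(p_any_material_error(0.12, 10), 2))  # 0.72
print(round(p_any_material_error(0.12, 50), 2))  # 1.0 (approx. 0.998)
```

Under that independence assumption, a firm acting on ten unverified summaries has better-than-even odds of acting on at least one material error.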

"The tool was not designed to operate without verification," Priya wrote in the pilot report. "The 12% figure is the risk of using it as if it were. In the workflow we specified, that risk was managed. The question is whether workflows without verification controls will be used with this tool in other contexts."


The Two Submissions

The two submissions that gave this case its hardest lesson did not come from one of the three pilot firms. They came from a fourth firm — one that had heard about the pilot from a peer and had implemented a similar LLM configuration independently, without Priya's firm's involvement and without the workflow controls the pilot had specified.

The firm — a UK asset manager — had used the LLM platform to assist with drafting two regulatory submissions: a response to an FCA consultation on sustainable investment disclosure and a notification to the FCA regarding a change in a controlled function. The compliance analyst who drafted both submissions had used the LLM to generate the substantive content and had made edits, but had not systematically verified the LLM's characterizations of the relevant regulatory requirements against the primary sources.

Both submissions were returned by the FCA. The sustainable investment disclosure response contained a characterization of the FCA's existing disclosure requirements that was incorrect — the LLM had apparently drawn on an earlier consultation draft rather than the finalized rules, and the characterization did not reflect the amendments made in the final version. The controlled function notification cited the wrong section of the FCA's rules and described the notification obligation in terms that did not match the current rulebook.

The FCA's returns were professional and factual — the errors were identified, and corrected submissions were requested. There was no enforcement action. But the compliance director at the asset manager, when she learned what had happened, called Priya.

"She wanted to understand how the tool had produced confident, specific, wrong answers," Priya said. "The answer is that this is what these tools do. They produce confident, fluent, specific text. The confidence of the output tells you nothing about its accuracy."

Priya briefed her firm's internal LLM governance group on the incident, with client details anonymized. She also included it in the updated pilot report that she distributed to the three participating firms.

"The honest lesson," she said in that briefing, "is not that LLMs should not be used in compliance. The lesson is that the verification requirement is not optional. It is not a nice-to-have. It is the thing that makes the tool safe to use. A workflow that uses an LLM to draft regulatory submissions without expert review of every substantive regulatory claim is not a safe workflow."


What Changed After the Pilot

All three pilot firms continued using the LLM platform for regulatory horizon scanning after the pilot ended. The efficiency gains were real and the tool was well-suited to the triage function.

One firm — the most sophisticated of the three, a UK bank with a well-staffed regulatory affairs team — expanded use to policy gap analysis, with a structured two-person review process for any gap analysis output that would influence a compliance remediation decision. Priya regarded this as an appropriate control structure, and the firm had been operationally disciplined in maintaining it.

The second firm extended use to training content generation, with a subject-matter expert review process. This was also working well. The LLM's drafting speed had materially accelerated the compliance training calendar.

The third firm — the smallest, a fintech with five compliance staff — was the one that Priya was most cautious about. Under time pressure, the verification steps were being compressed. The head of compliance at the firm acknowledged this in a review meeting.

"We know the risk. But when it's just me and one analyst and we're looking at fifteen publications a week, and we're doing six other things at the same time — the verification doesn't always happen the way it should."

Priya had heard this before. It was the perennial compliance resource problem in a different form. The LLM tool had been adopted because it saved time. The verification requirement that made it safe to use cost time. Under pressure, the verification was the first thing to slip.

"The compliance director who came back to me after the submission returns was embarrassed," Priya said, in a conversation with Rafael after the pilot had concluded. "She shouldn't have been. Her firm did what a lot of firms are doing. They adopted a tool that saves time, removed the control that costs time, and got the result you'd expect." She paused. "The good news is that the result, in their case, was a returned submission, not an enforcement action. The lesson was available cheaply. That's not always the case."


Priya's Framework: Research Accelerator, Not Decision-Maker

Out of the pilot and the incidents that followed it, Priya developed a framework for LLM use in compliance that she began deploying in client advisory engagements and that became a standard output of her firm's RegTech advisory practice.

The framework had four components:

Role clarity: Every LLM deployment in compliance must have explicit documentation of the tool's role in each workflow. The permitted roles are: first-pass research assistant; drafting accelerator; and structural analysis triage. The prohibited roles are: regulatory decision-maker; autonomous author of regulatory submissions or client advice; and substitute for expert review.

Verification by default: Any LLM output that will influence a compliance decision, a regulatory submission, or regulatory advice to a business line must be verified against primary sources by a human expert before it is acted upon. This requirement must be embedded in the workflow, not left to individual discretion; a code sketch of one way to embed it, together with the error-tracking component below, follows the framework.

Error tracking: The rate of material errors in LLM outputs should be tracked over time. If the error rate increases — for instance, because the model's training data has become stale relative to regulatory developments — that is a signal to adjust the verification intensity or the use case scope.

Model governance: LLM deployments in compliance must be treated as models under the firm's model risk management framework. This requires documentation of the tool's purpose, limitations, and performance metrics; periodic validation; and a process for updating or retiring the tool when it is no longer adequate.
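Here is a minimal sketch of how the verification-by-default and error-tracking components might be embedded in a workflow rather than left to discretion. Everything in it (the class names, the gate mechanism, the 10% alert threshold) is an illustrative assumption, not a specification from Priya's framework:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VerificationRecord:
    """Evidence that a human expert checked the output against primary sources."""
    reviewer: str
    primary_sources_checked: list[str]
    verified_at: datetime
    material_errors_found: int


@dataclass
class LLMOutput:
    output_id: str
    text: str
    verification: VerificationRecord | None = None


class ReleaseGate:
    """Blocks any output from influencing a compliance decision until verified,
    and tracks the material error rate that reviewers observe over time."""

    def __init__(self, alert_threshold: float = 0.10):  # threshold is illustrative
        self.alert_threshold = alert_threshold
        self.released: list[LLMOutput] = []

    def release(self, output: LLMOutput) -> LLMOutput:
        # Verification by default: enforced structurally, not left to
        # individual discretion. An unverified output cannot pass this point.
        if output.verification is None:
            raise PermissionError(
                f"{output.output_id}: no verification record; this output "
                "cannot influence a compliance decision or submission."
            )
        self.released.append(output)
        return output

    def material_error_rate(self) -> float:
        """Share of released outputs in which the reviewer found material errors."""
        if not self.released:
            return 0.0
        flagged = sum(
            1 for o in self.released if o.verification.material_errors_found > 0
        )
        return flagged / len(self.released)

    def needs_escalation(self) -> bool:
        # Error tracking: a rising rate is the signal to adjust verification
        # intensity or narrow the use-case scope.
        return self.material_error_rate() > self.alert_threshold
```

The point of the sketch is not the code but where the control lives. At the pilot firms that kept their discipline, verification sat in the workflow; at the fintech, it sat in individual judgment, which is why it slipped under pressure.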

"The firms that will use LLMs well in compliance," Priya told the conference panel audience, many months after the pilot had concluded, "are the ones that are clear about what the tool is for. It is for acceleration. It is not for elimination of expert judgment. The day compliance stops requiring expert judgment is the day I will need a new career. I'm not worried about that day."

She smiled slightly.

"What I am worried about is the firms that are using these tools as if that day has already arrived."


Discussion Questions

  1. The pilot documented a 12% material error rate in LLM-generated regulatory summaries. In what circumstances would this error rate be acceptable, and in what circumstances would it be unacceptable? What factors determine the answer?

  2. Priya's framework specifies that LLM outputs must be verified by a human expert before influencing compliance decisions or regulatory submissions. The case study also documents that, under time pressure, verification steps are the first to be cut. What organizational controls would best protect the verification requirement from being eroded in this way?

  3. The asset manager that used the LLM to draft regulatory submissions without systematic verification did so without Priya's firm's involvement or the pilot's workflow controls. What does this imply about how LLM-related compliance risk management needs to be governed — through training, through workflow design, or through controls that make unverified use harder rather than merely discouraged?

  4. The case study describes LLMs as accurate about coverage (whether a policy exists) but less reliable about adequacy (whether the policy meets the regulatory standard). Why might this distinction exist technically, and what does it imply for which compliance tasks are safe to delegate to LLM assistance?


Case Study 39.2 complete.