Chapter 23: Exercises — NLP for Regulatory Intelligence and Horizon Scanning
Exercise 1: Classification Practice — Sorting the Regulatory Queue
Difficulty: Introductory | Format: Analysis and written response
Background
Below are five regulatory document titles and abstracts from five different regulatory bodies, representing a single week's output relevant to a UK-headquartered investment manager with asset management and trading operations.
Task
For each document, classify it across three dimensions:
- Topic (select all that apply from: AML/KYC, Market Conduct, Reporting, Algorithmic Trading, Data Privacy, Capital/Prudential, Governance, Technology/Operational Risk, Other)
- Business Line (select all that apply from: Banking, Asset Management, Trading, Payments, All)
- Urgency (High, Medium, or Low — with justification)
Additionally, for each document, write one sentence explaining whether you would immediately escalate the alert to a business line owner or file it for scheduled review.
Document A
Regulator: FCA | Type: Final Policy Statement
Title: PS25/3 — Changes to the SMCR: Streamlining the Senior Managers and Certification Regime for Enhanced Accountability
Abstract: This policy statement confirms amendments to the Senior Managers and Certification Regime (SMCR) following Consultation Paper CP24/8. Key changes include: a revised list of Senior Management Functions (SMFs) that consolidates two existing functions into a new SMF "Chief Responsibility Officer"; updated Certification Regime requirements that extend the timeline for annual fitness and propriety assessments from twelve to eighteen months; and new guidance on the responsibilities of SMF holders for oversight of material third-party service providers. Implementation date: 1 October 2025.
Document B
Regulator: ESMA | Type: Q&A Update
Title: Questions and Answers on the Application of the Market Abuse Regulation (MAR) — Update 27
Abstract: This Q&A updates ESMA's existing guidance on MAR. New questions addressed include: Q.27.1 — whether cross-venue algorithmic order placement strategies involving simultaneous bids and offers constitute market manipulation; Q.27.2 — the application of the market sounding regime to ESG-linked private placements; Q.27.3 — the STOR reporting timeline for firms that detect suspicious orders originating from a client's automated execution system. The update does not constitute new regulation but clarifies the application of existing MAR provisions.
Document C
Regulator: ICO (UK Information Commissioner's Office) | Type: Guidance
Title: Guidance on the Use of Automated Decision-Making in Financial Services: Data Subject Rights and Legitimate Interests
Abstract: This guidance addresses data controllers in financial services that use automated decision-making for credit, insurance, fraud detection, and compliance purposes. The ICO clarifies that Article 22 of UK GDPR applies to automated decisions that produce legal or similarly significant effects on data subjects, including the denial of financial services on the basis of an automated risk score. Firms must provide meaningful information about the logic of automated decisions and must have processes for human review on request. The ICO notes that compliance-related processing (AML screening, sanctions matching) may be conducted on the basis of legal obligation grounds but must still satisfy data minimization requirements.
Document D
Regulator: Basel Committee on Banking Supervision | Type: Consultative Document
Title: Consultative Document: Principles for Climate-Related Financial Risk Disclosures for International Banks
Abstract: The Basel Committee is consulting on proposed principles for climate-related financial risk disclosures for internationally active banks. The proposed principles would apply to banks using the internal models approach (IMA) or standardized approach under the Basel framework. They address the disclosure of climate-related physical and transition risk exposures across the trading book and banking book, including scope 3 emissions financed by the institution. The consultation period closes 30 April 2025. Comments are invited on whether the proposed principles should be mandatory minimum standards or voluntary guidance.
Document E
Regulator: SEC | Type: Final Rule
Title: Release No. 33-11312: Cybersecurity Risk Management, Strategy, Governance, and Incident Disclosure — Technical Amendment to Form 8-K Item 1.05
Abstract: This technical amendment clarifies the SEC's cybersecurity incident disclosure rules adopted in July 2023. The amendment specifies that foreign private issuers with US-registered securities are subject to equivalent cybersecurity incident disclosure obligations on Form 20-F, with a 72-hour notification period from the time the issuer determines that a cybersecurity incident is material. The amendment confirms that the materiality assessment must consider aggregate harm across all systems affected by a single incident, not system by system. Effective date: 30 days from publication in the Federal Register.
Guidance Notes
- A document can reasonably be classified differently by different firms depending on their business model. Justify your choices.
- The distinction between "Medium" and "High" urgency often hinges on effective date proximity. State any assumptions you make about the current date.
- Consider the difference between a Q&A clarification (no new regulation, but changes interpretation of existing rules) and a final rule (new binding obligation).
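A production triage pipeline typically automates this first-pass classification before a human reviews it. The sketch below shows one simple rule-based approach; the keyword lists and the classify_topics function are illustrative assumptions, not the chapter's taxonomy:

```python
# Minimal sketch of rule-based topic tagging for the regulatory queue.
# Keyword lists here are illustrative placeholders, not a complete taxonomy.
TOPIC_KEYWORDS = {
    "AML/KYC": ["aml", "sanctions", "screening", "money laundering"],
    "Market Conduct": ["market abuse", "manipulation", "stor", "market sounding"],
    "Data Privacy": ["gdpr", "data subject", "automated decision", "data controller"],
    "Governance": ["senior managers", "smcr", "accountability", "fitness and propriety"],
    "Technology/Operational Risk": ["cybersecurity", "incident", "operational resilience"],
}

def classify_topics(abstract: str) -> list[str]:
    """Return every topic whose keywords appear in the abstract (multi-label)."""
    text = abstract.lower()
    return [
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    ]

print(classify_topics(
    "This policy statement confirms amendments to the Senior Managers and "
    "Certification Regime (SMCR) following consultation."
))  # -> ['Governance']
```

A real platform would combine keyword rules like these with a trained classifier, but even this sketch illustrates why multi-label output matters: Document C, for example, should match both Data Privacy and AML/KYC keywords.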
Exercise 2: Obligation Extraction from Regulatory Text
Difficulty: Intermediate | Format: Structured extraction
Background
The following paragraph is excerpted from a fictional FCA policy statement on operational resilience (PS25/7). Your task is to extract every firm obligation contained in the paragraph and present each as a structured obligation record.
Regulatory Text
Investment firms and credit institutions shall, by 31 March 2026, complete a comprehensive mapping of their important business services and the people, processes, technology, facilities, and information required to support those services. Firms must identify the maximum tolerable downtime (MTD) for each important business service and shall validate that their operational resilience capabilities are sufficient to remain within MTD even under severe but plausible disruption scenarios. Where a firm identifies that it cannot remain within its MTD for a particular important business service, it is required to notify the FCA within 30 business days and submit a remediation plan that specifies the steps to be taken and the expected timeline for achieving MTD adherence. Investment firms shall ensure that their important business service mapping is reviewed and updated at least annually and following any material change to the firm's technology, personnel, or third-party arrangements. A firm must retain evidence of its operational resilience testing and its MTD validation for a minimum of six years and make it available to the FCA on request. Investment firms with assets under management exceeding £10 billion shall additionally conduct an annual exercise simulating a full service disruption for their most critical important business services, involving senior management and, where appropriate, key third-party service providers.
Task
For each obligation you identify, complete the following fields:
- Obligation text: The verbatim text of the obligation (may be condensed to the core requirement)
- Obligation type: Explicit ("shall/must"), Conditional ("where X, the firm must"), or Threshold-based ("firms with X must")
- Firm type affected: Which category of firm does this apply to?
- Effective date / deadline: What date must this be completed by?
- Regulatory reference: If an article or rule number is cited, note it; otherwise note the source document
- Linked obligation: Is this obligation connected to completing another obligation first?
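These fields map naturally onto a structured record in code. A minimal sketch, with field and type names that are assumptions for illustration rather than the chapter's schema:

```python
# Minimal sketch of an obligation record mirroring the fields listed above.
from __future__ import annotations
from dataclasses import dataclass
from enum import Enum

class ObligationType(Enum):
    EXPLICIT = "explicit"          # "shall" / "must"
    CONDITIONAL = "conditional"    # "where X, the firm must"
    THRESHOLD = "threshold-based"  # "firms with X must"

@dataclass
class ObligationRecord:
    """One extracted obligation for the obligation register."""
    obligation_text: str                  # condensed core requirement
    obligation_type: ObligationType
    firm_type_affected: str               # e.g. "investment firms and credit institutions"
    deadline: str | None = None           # effective date or deadline, if stated
    regulatory_reference: str = "PS25/7"  # source document when no rule number is cited
    linked_obligation: str | None = None  # prerequisite obligation, if any

# Example: the mapping obligation from the first sentence of the excerpt.
mapping = ObligationRecord(
    obligation_text="Complete a comprehensive mapping of important business services",
    obligation_type=ObligationType.EXPLICIT,
    firm_type_affected="Investment firms and credit institutions",
    deadline="31 March 2026",
)
print(mapping.obligation_type.value)  # -> explicit
```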
Extension Question
Two of the obligations in the paragraph are conditional — they only apply when a specific circumstance is met. Identify these two and explain how a regulatory intelligence system should handle conditional obligations differently from absolute obligations in the obligation register.
Exercise 3: Regulatory Taxonomy Design
Difficulty: Intermediate | Format: Design exercise with written justification
Scenario
You are advising the compliance team at NorthBridge Bank, a mid-sized UK bank with the following profile:
- UK-authorized deposit-taking institution, regulated by the FCA and PRA
- Mortgage lending, personal loans, and SME credit products
- Treasury function operating in UK gilt and money markets
- Payment services entity processing domestic and international transfers
- No investment management or trading on behalf of clients
- Single UK jurisdiction — no EU passporting or US operations
NorthBridge's head of compliance has asked you to design a regulatory taxonomy for their new regulatory intelligence platform. The taxonomy will be used to classify all regulatory publications automatically, route alerts to the correct team, and populate the obligation register.
Task
Design a regulatory taxonomy for NorthBridge Bank covering four dimensions:
- Topics (provide 8–12 topic categories, each with 3–4 example keywords that would indicate a document belongs to this category)
- Business Lines (provide 4–6 business line categories that reflect NorthBridge's actual structure, with brief descriptions of which regulatory obligations primarily apply to each)
- Jurisdictions (which regulatory bodies should NorthBridge monitor, and how should they be grouped?)
- Urgency Levels (define 3–4 urgency levels with precise criteria for when each applies — include effective date thresholds, document type rules, and any topic-specific rules)
For each dimension, briefly justify your design choices in terms of how they support the routing and tracking workflow.
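Expressing the taxonomy as configuration makes the routing criteria explicit and testable. A minimal sketch; the categories, keywords, and urgency rule below are illustrative placeholders, not a model answer:

```python
# Minimal sketch: taxonomy as configuration driving alert routing.
# Categories, keywords, and thresholds are illustrative, not a model answer.
from datetime import date

TAXONOMY = {
    "topics": {
        "Credit & Lending": ["mortgage", "consumer credit", "sme lending"],
        "Payments": ["payment services", "safeguarding", "faster payments"],
        "Prudential": ["capital", "liquidity", "pra rulebook"],
    },
    "business_lines": ["Retail Lending", "SME Banking", "Treasury", "Payments"],
    "jurisdictions": ["FCA", "PRA", "Bank of England", "Basel Committee"],
}

def urgency(effective_date: date, today: date, is_final_rule: bool) -> str:
    """Example urgency rule: final rules effective within 90 days are High."""
    days_until = (effective_date - today).days
    if is_final_rule and days_until <= 90:
        return "High"
    if days_until <= 365:
        return "Medium"
    return "Low"

print(urgency(date(2025, 6, 1), date(2025, 4, 1), is_final_rule=True))  # -> High
```

Keeping the rules in data rather than code is one way to answer the expansion question below: adding an EU jurisdiction then becomes a configuration change rather than a rebuild.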
Discussion Question
NorthBridge is planning to expand into EU payment services within two years. How should the taxonomy be designed now to accommodate this future expansion without requiring a complete rebuild when the expansion occurs?
Exercise 4: Coding — Extending the Platform with Change Detection
Difficulty: Intermediate Coding | Format: Python extension exercise
Background
The RegulatoryIntelligencePlatform implemented in Chapter 23 processes new regulatory documents. It does not yet handle version comparisons — comparing an old version of a regulation against a new version to identify what changed.
Task
Extend the RegulatoryIntelligencePlatform class by implementing a compare_versions method that, given two versions of a regulatory document as strings, identifies:
- Sentences present in the new version but not the old (additions)
- Sentences present in the old version but not the new (removals)
- Sentences that appear in both but differ significantly (modifications — use a simple word-overlap metric)
Starter Code
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class DeltaReport:
    """Represents the changes between two document versions."""
    document_id: str
    added_sentences: list[str]
    removed_sentences: list[str]
    modified_pairs: list[tuple[str, str]]  # (old_sentence, new_sentence)
    similarity_score: float  # 0 = completely different, 1 = identical


def tokenize_sentences(text: str) -> list[str]:
    """
    Split text into sentences using basic punctuation rules.

    In production, use nltk.sent_tokenize or spaCy for better accuracy.
    """
    # Split on period, exclamation, or question mark followed by whitespace
    raw = re.split(r'(?<=[.!?])\s+', text.strip())
    # Filter empty and very short fragments
    return [s.strip() for s in raw if len(s.strip()) > 20]


def word_overlap_similarity(sentence_a: str, sentence_b: str) -> float:
    """
    Compute word overlap similarity between two sentences.

    Returns a value between 0 (no overlap) and 1 (identical word sets).
    In production, use sentence-transformer cosine similarity for semantic matching.
    """
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    intersection = words_a & words_b
    union = words_a | words_b
    return len(intersection) / len(union)  # Jaccard similarity


def compare_versions(
    document_id: str,
    old_text: str,
    new_text: str,
    similarity_threshold: float = 0.7,
    modification_threshold: float = 0.4,
) -> DeltaReport:
    """
    Compare two versions of a regulatory document and produce a delta report.

    Parameters
    ----------
    document_id : str
        Identifier for the document being compared.
    old_text : str
        The full text of the older document version.
    new_text : str
        The full text of the newer document version.
    similarity_threshold : float
        Jaccard similarity at or above which two sentences are considered the
        "same" (allowing for minor edits). Default: 0.7.
    modification_threshold : float
        Minimum Jaccard similarity for two sentences to be treated as
        corresponding at all. Pairs scoring between this value and
        similarity_threshold are flagged as significant modifications.
        Default: 0.4.

    Returns
    -------
    DeltaReport
        Structured report of additions, removals, and modifications.
    """
    # TODO: Implement this function.
    #
    # Suggested algorithm:
    # 1. Tokenize both documents into sentence lists using tokenize_sentences().
    # 2. For each sentence in the new version, find the most similar sentence
    #    in the old version using word_overlap_similarity().
    # 3. If the best match is at or above modification_threshold, the sentence
    #    corresponds to a sentence in the old version.
    #    - If the similarity is also below similarity_threshold, record it as a
    #      (old_sentence, new_sentence) pair in modified_pairs.
    # 4. Sentences in new_text with no match at or above modification_threshold
    #    are additions.
    # 5. Sentences in old_text that were not matched to any new sentence are removals.
    # 6. Compute an overall similarity_score as:
    #    (number of matched sentence pairs) / (total unique sentences across both versions)
    #
    # Hint: Use a greedy matching approach — once a sentence has been matched,
    # remove it from the candidate pool to avoid double-matching.
    pass
Your Implementation
Implement the compare_versions function using the structure and hints provided. Your implementation should:
- Return a DeltaReport with correctly populated added_sentences, removed_sentences, modified_pairs, and similarity_score
- Handle edge cases: empty documents, completely identical documents, completely different documents
- Include inline comments explaining your logic at key decision points
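Before implementing, it can help to calibrate the two thresholds against the Jaccard helper. The snippet below repeats the word_overlap_similarity logic from the starter code and scores a minor edit against an unrelated sentence (the example sentences are hypothetical):

```python
# Calibration check: how the Jaccard helper behaves on a minor edit versus an
# unrelated sentence, to motivate the 0.7 and 0.4 defaults.
def word_overlap_similarity(sentence_a: str, sentence_b: str) -> float:
    # Mirrors the helper defined in the starter code above.
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

minor_edit = word_overlap_similarity(
    "Firms must retain records for a period of five years.",
    "Firms must retain records for a period of seven years.",
)
unrelated = word_overlap_similarity(
    "Firms must retain records for a period of five years.",
    "The consultation period closes on 30 April 2025.",
)
print(f"minor edit: {minor_edit:.2f}")  # well above 0.7: treated as the same sentence
print(f"unrelated:  {unrelated:.2f}")   # well below 0.4: no match at all
```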
Test Case
Test your implementation against the following abbreviated document versions:
old_version = """
Investment firms must submit transaction reports to the FCA within one business day
of the execution of the transaction. Reports shall include the full LEI of the
executing firm and the counterparty. Firms are required to maintain records of all
submitted reports for a period of five years.
"""
new_version = """
Investment firms must submit transaction reports to the FCA within one business day
of the execution of the transaction. Reports shall include the full LEI of the
executing firm, the counterparty, and any transmission chain intermediaries.
Firms are required to maintain records of all submitted reports for a period of
seven years. Investment firms shall validate their transaction reports against
the FCA's validation rules before submission and retain evidence of validation.
"""
report = compare_versions("TEST-DOC-001", old_version, new_version)

print(f"Added sentences ({len(report.added_sentences)}):")
for s in report.added_sentences:
    print(f"  + {s[:100]}")

print(f"Removed sentences ({len(report.removed_sentences)}):")
for s in report.removed_sentences:
    print(f"  - {s[:100]}")

print(f"Modified pairs ({len(report.modified_pairs)}):")
for old_s, new_s in report.modified_pairs:
    print(f"  OLD: {old_s[:80]}")
    print(f"  NEW: {new_s[:80]}")

print(f"Overall similarity: {report.similarity_score:.2f}")
Expected Output (approximate — exact output depends on your implementation)
The test case should identify approximately:
- 1–2 additions (the transmission chain intermediary requirement; the validation requirement)
- 0–1 removals (none — the old sentences are retained with modifications)
- 1–2 modified pairs (the counterparty LEI sentence; the record retention period sentence, changed from five to seven years)
Extension Question
The word_overlap_similarity function uses a bag-of-words Jaccard metric. This metric is blind to paraphrasing: "firms must submit" and "investment firms are required to file" would score low similarity despite expressing the same obligation. How would you replace this metric with a semantic similarity measure using sentence-transformers? Describe the change in pseudocode (you do not need to implement it if the sentence-transformers library is not available in your environment).
Exercise 5: Vendor Evaluation — Selecting a Regulatory Intelligence Platform
Difficulty: Applied | Format: Structured evaluation matrix with recommendation
Scenario
Hartley Global Partners is a London-headquartered multi-strategy asset manager with £42 billion AUM, operating across equity long/short, credit, and macro strategies. It has offices in London, New York, and Hong Kong, and its regulatory footprint covers: FCA (UK), ESMA and CBI (EU/Ireland), SEC and CFTC (US), and SFC (Hong Kong). The compliance team has eight professionals globally.
Hartley is selecting a regulatory intelligence platform. The following five criteria have been agreed as the primary evaluation dimensions, weighted by importance:
| Criterion | Weight | Description |
|---|---|---|
| Jurisdictional coverage | 30% | Must cover FCA, ESMA, SEC, CFTC, SFC, and Basel/IOSCO comprehensively |
| Obligation extraction and workflow | 25% | Must include structured obligation extraction, status tracking (from identified through to compliant), and GRC integration capability |
| Classification precision | 20% | Low false-positive rate; alerts should be genuinely relevant to Hartley's business lines |
| LLM / Q&A capability | 15% | Must offer cited, RAG-based regulatory Q&A with source traceability |
| Total cost of ownership | 10% | Licensing, implementation, and ongoing maintenance within budget constraints |
Vendors Under Consideration
Using the vendor information provided in Chapter 23's key takeaways and Section 23.7, and drawing on publicly available knowledge of these platforms, evaluate the following five vendors against Hartley's five criteria:
- Thomson Reuters Regulatory Intelligence
- Wolters Kluwer FRR
- Corlytics
- Compliance.ai
- Ascent
Task
Construct a weighted evaluation matrix. For each vendor, assign a score from 1 (poor fit) to 5 (excellent fit) against each criterion, apply the weights, and compute a total weighted score.
Your matrix should look like this:
| Vendor | Jurisdictional Coverage (30%) | Obligation Workflow (25%) | Classification Precision (20%) | LLM/Q&A (15%) | TCO (10%) | Weighted Total |
|---|---|---|---|---|---|---|
| Thomson Reuters RI | ? | ? | ? | ? | ? | ? |
| Wolters Kluwer FRR | ? | ? | ? | ? | ? | ? |
| Corlytics | ? | ? | ? | ? | ? | ? |
| Compliance.ai | ? | ? | ? | ? | ? | ? |
| Ascent | ? | ? | ? | ? | ? | ? |
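The weighted total for each row is a simple dot product of scores and weights. A minimal sketch with hypothetical placeholder scores (not an actual vendor assessment):

```python
# Minimal sketch: computing one vendor's weighted total from criterion scores (1-5).
# The scores below are hypothetical placeholders, not an assessment of any vendor.
weights = {
    "jurisdictional_coverage": 0.30,
    "obligation_workflow": 0.25,
    "classification_precision": 0.20,
    "llm_qa": 0.15,
    "tco": 0.10,
}
scores = {  # hypothetical scores for one vendor
    "jurisdictional_coverage": 4,
    "obligation_workflow": 3,
    "classification_precision": 4,
    "llm_qa": 2,
    "tco": 3,
}
weighted_total = sum(weights[c] * scores[c] for c in weights)
print(f"Weighted total: {weighted_total:.2f}")  # -> Weighted total: 3.35
```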
After completing the matrix, write a 300-word recommendation that:
- Names your recommended vendor based on the weighted scores
- Identifies the one criterion where the recommended vendor is weakest and proposes how Hartley should address this gap
- Notes any condition under which a different vendor would be preferable
Discussion Question
Hartley's CTO has proposed building a custom regulatory intelligence platform rather than licensing a vendor, arguing that none of the vendors cover the SFC (Hong Kong) comprehensively and that custom development would allow tight integration with Hartley's proprietary portfolio management system. Evaluate this argument. What additional information would you need to assess whether a custom build is justified, and what are the key risks of that approach?