
Chapter 5: Data Architecture for Regulatory Compliance


Opening: The Data Problem Underneath Everything Else

When Cornerstone Financial Group's AML analytics team built the graph that uncovered the trade finance fraud network described in Chapter 4, they spent three days building the model. Before that, they spent eleven days finding, cleaning, and integrating the data.

This is not unusual. In data-intensive compliance work — which is to say, in all serious RegTech implementation — the data problem is typically larger than the technology problem. The algorithms are well-understood. The cloud infrastructure is available. The bottleneck is almost always data: data that is incomplete, inconsistent, duplicated, untraced, or simply unavailable from the systems where it lives.

The Basel Committee recognized this when it issued BCBS 239 — its principles for risk data aggregation and risk reporting — in 2013. The principles were specifically motivated by the finding that major financial institutions, during the 2008 crisis, could not quickly aggregate their own risk exposures across legal entities and geographies. They could not answer basic questions — "What is our total exposure to Lehman Brothers? How much do we stand to lose if Lehman fails?" — because their data was fragmented across systems that didn't talk to each other, maintained in formats that weren't consistent, and governed by organizational silos that prevented enterprise-wide analysis.

BCBS 239 made data governance a regulatory obligation for significant financial institutions. For RegTech practitioners, it articulated a truth that data professionals had known for decades: no analytics system, however sophisticated, can produce reliable compliance outputs from unreliable data.

This chapter is about building data architecture that is fit for regulatory compliance — not eventually, but reliably and auditably, from day one.


5.1 Why Data Is the Foundation of Every RegTech Solution

Before examining the specific components of compliance data architecture, it is worth establishing clearly why data quality and governance are prerequisite to everything else in RegTech.

The Garbage In, Garbage Out Problem

Machine learning models learn from data. If the training data contains systematic errors — missing records, incorrect classifications, duplicated entries — the model will learn those errors and reproduce them in production. A KYC verification system trained on data where "verified" is sometimes mis-labeled will verify some customers it should not. An AML model trained on data where certain typologies were consistently missed will continue to miss them.

More insidiously, data quality problems in ML systems are often not immediately obvious. The model may produce plausible-looking outputs while systematically misclassifying a specific subset of transactions in ways that are not apparent until an enforcement investigation, a regulatory examination, or a significant fraud event surfaces the problem.

The Regulatory Demonstration Problem

Compliance is not just about doing the right thing — it is about being able to demonstrate that you are doing the right thing. When a regulator asks "How did you calculate this capital ratio?" or "Why did you file (or not file) a SAR on this transaction?" or "Show me the evidence that you verified this customer's identity," the answer requires data that is:

- Complete: all required records exist
- Accurate: the data correctly reflects the underlying facts
- Traceable: you can show where the data came from and how it was transformed
- Auditable: you can demonstrate what was known at what time

None of this is possible without deliberate data architecture.

The Integration Problem

Most compliance functions consume data from multiple source systems — core banking, CRM, trading systems, document management, external data feeds. These systems were typically built independently, without a common data model, at different times by different vendors. Integrating them for compliance analytics purposes is a substantial engineering challenge that most institutions underestimate.

🔧 Practitioner Note: A common pattern in RegTech implementations: a vendor demonstrates its solution using its own sample data and it works beautifully. Then implementation begins with the actual customer data and immediately reveals that the data is not in the format the solution expects, that key fields are missing, or that the same field is coded differently in different source systems. Allow for 50–100% more time than vendors quote for data integration. This is not an exaggeration; it is the industry norm.
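When the same field is coded differently in different source systems, one workable pattern is an explicit per-system code map applied at ingestion, so unknown codes fail visibly instead of propagating silently. The sketch below is illustrative; the system names and code values are hypothetical:

```python
import pandas as pd

# Hypothetical per-system codings for the same "customer type" field.
SOURCE_CODE_MAPS = {
    'core_banking': {'P': 'INDIVIDUAL', 'C': 'CORPORATE', 'T': 'TRUST'},
    'crm':          {'Person': 'INDIVIDUAL', 'Company': 'CORPORATE', 'Trust': 'TRUST'},
    'trading':      {'1': 'INDIVIDUAL', '2': 'CORPORATE', '3': 'TRUST'},
}

def normalize_customer_type(df: pd.DataFrame, source_system: str) -> pd.DataFrame:
    """Map one source system's customer_type codes onto a canonical vocabulary.

    Unknown codes become NaN, so they surface in completeness checks
    rather than being silently mis-mapped.
    """
    mapping = SOURCE_CODE_MAPS[source_system]
    out = df.copy()
    out['customer_type'] = out['customer_type'].map(mapping)
    return out

crm_extract = pd.DataFrame({'customer_id': [101, 102],
                            'customer_type': ['Person', 'Company']})
normalized = normalize_customer_type(crm_extract, 'crm')
print(normalized['customer_type'].tolist())  # ['INDIVIDUAL', 'CORPORATE']
```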


5.2 Data Governance Frameworks for Compliance

Data governance refers to the policies, processes, and organizational structures that ensure data is managed as a compliance asset rather than an IT byproduct.

BCBS 239 as a Governance Framework

While BCBS 239 applies specifically to significant banks, its eleven principles articulate data governance requirements that are relevant to any institution managing compliance data.

The eleven bank-facing principles address:

- Governance (Principle 1): Senior management accountability for data quality
- Data architecture and IT infrastructure (Principle 2): Integrated, scalable infrastructure supporting risk data aggregation
- Accuracy and integrity (Principle 3): Automated, reconciled data with clear audit trails
- Completeness (Principle 4): Capture and aggregation of all material risk data
- Timeliness (Principle 5): Ability to produce data quickly in normal and stress conditions
- Adaptability (Principle 6): Ability to generate ad hoc data requests
- Accuracy in reporting (Principle 7): Reports that accurately reflect the risk data
- Comprehensiveness (Principle 8): Coverage of all material risk areas
- Clarity and usefulness (Principle 9): Reports that support decision-making
- Frequency (Principle 10): Regular production of regulatory reports
- Distribution (Principle 11): Reports distributed to those who need them, while maintaining confidentiality

For compliance practitioners working outside banking, these principles translate directly: you need someone accountable for data quality, systems that can aggregate across sources, automated data pipelines with reconciliation, complete capture of required data, and the ability to produce reports quickly in normal and stressed conditions.

The Data Governance Roles

A functional compliance data governance framework requires four types of roles:

Data owners: Business-side managers responsible for the accuracy and completeness of data within their domain. A data owner is accountable when data quality problems occur.

Data stewards: Operational staff who maintain data quality on a day-to-day basis — creating records, correcting errors, enforcing standards.

Data custodians: IT professionals responsible for the technical infrastructure — databases, pipelines, storage — that maintains and protects the data.

Data governance committee: A cross-functional body (compliance, risk, technology, legal) that sets data standards, resolves conflicts, and provides governance oversight.


5.3 The Regulatory Data Taxonomy

Compliance data is not homogeneous. Different types of data are governed by different requirements, have different quality standards, and require different architectural treatment.

Customer Data

Customer data is the foundation of KYC and AML compliance. It includes:

- Identification data: Legal name, date of birth, address, nationality, identification document numbers
- Risk attributes: Customer risk rating, PEP status, adverse media flags, SAR history
- Relationship data: Account ownership, authorized signatories, beneficial ownership
- Behavioral data: Transaction history, product usage, communication patterns

Quality requirements: Customer data must be verified at onboarding and kept current. The FCA expects firms to have processes to identify and update stale customer data — particularly for high-risk customers.

Architecture implications: Customer data should be maintained in a single authoritative customer record (sometimes called the "golden record") that serves as the source of truth for all compliance functions. Duplicates and inconsistencies across systems are a common source of KYC failures.

Transaction Data

Transaction data is the raw material of AML monitoring. It includes:

- Basic transaction attributes: Amount, currency, date, time, instrument type
- Party information: Originator and beneficiary accounts, counterparty identifiers
- Narrative information: Payment references, descriptions, codes
- Metadata: Processing system, channel, approval status

Quality requirements: Transaction data must be complete (all transactions captured), accurate (amounts, parties, and dates correct), and timely (available for monitoring with minimal delay).

Architecture implications: Transaction data volumes can be very large (millions of transactions per day for large institutions). Architecture must support efficient querying for monitoring purposes while also supporting historical analysis for investigation.
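In practice this usually means organizing transaction storage around a partition key, almost always the transaction date, so the monitoring window is a cheap range scan while investigations can still filter the full history. A minimal in-memory sketch with pandas (synthetic data; the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic transaction data: 90 days of activity across 500 accounts.
rng = np.random.default_rng(0)
n = 10_000
tx = pd.DataFrame({
    'tx_id': np.arange(n),
    'tx_date': pd.Timestamp('2024-01-01')
               + pd.to_timedelta(rng.integers(0, 90, n), unit='D'),
    'amount': rng.uniform(10, 50_000, n).round(2),
    'account_id': rng.integers(1, 500, n),
})

# Sorting and indexing on the partition key makes both access patterns cheap:
# range scans for the monitoring window, filtered scans for investigations.
tx = tx.set_index('tx_date').sort_index()

# Monitoring query: last 7 days only (a small slice of the partitioned data).
monitoring_window = tx.loc['2024-03-24':'2024-03-30']

# Investigation query: full history for one account.
account_history = tx[tx['account_id'] == 42]

print(len(monitoring_window), len(account_history))
```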

Reference Data

Reference data provides the context for interpreting transaction and customer data. It includes:

- Sanctions lists: OFAC SDN, EU consolidated list, UN Security Council list, HM Treasury list
- PEP lists: Politically exposed persons databases
- Country risk classifications: High-risk jurisdictions for AML purposes
- Legal entity identifiers (LEIs): ISO standard identifiers for legal entities in financial transactions
- Currency codes, instrument codes, counterparty classification codes

Quality requirements: Reference data must be current. Sanctions lists change daily; using a stale list creates regulatory and legal risk. PEP databases require regular refresh.

Architecture implications: Reference data management is an often-overlooked compliance risk. The process for ingesting, validating, and publishing updated reference data must be automated and auditable.
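A minimal sketch of such a validation gate, using pandas. The field names, threshold, and feed structure are assumptions for illustration; the point is that an update failing basic sanity checks (missing fields, duplicate identifiers, an implausibly shrunken list) is held for manual review rather than published:

```python
from datetime import date

import pandas as pd

def validate_sanctions_update(current: pd.DataFrame,
                              incoming: pd.DataFrame,
                              publish_date: date,
                              max_shrinkage: float = 0.10) -> dict:
    """Sanity-check an incoming sanctions list before publishing it.

    A refreshed list that is dramatically smaller than the current one,
    or that is missing required fields, usually signals a broken feed
    rather than genuine delistings, so the update is held for review.
    """
    required = {'entity_id', 'name', 'programme'}
    issues = []
    missing = required - set(incoming.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    elif incoming['entity_id'].duplicated().any():
        issues.append("duplicate entity_id values in incoming list")
    if len(incoming) < len(current) * (1 - max_shrinkage):
        issues.append(f"list shrank from {len(current)} to {len(incoming)} entries")
    return {'publish_date': publish_date.isoformat(),
            'approved': not issues,
            'issues': issues}

current = pd.DataFrame({'entity_id': range(1000), 'name': 'x', 'programme': 'y'})
truncated_feed = current.head(200)          # simulates a broken download
report = validate_sanctions_update(current, truncated_feed, date(2024, 6, 1))
print(report['approved'], report['issues'])
```

An update that passes all checks returns `approved: True` and can be published automatically, with the validation report retained as the audit record.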

Regulatory Reporting Data

Some compliance data is specifically created for regulatory reporting — risk metrics, position data, capital calculations. This data:

- May require complex calculation logic applied to underlying transaction and position data
- Must reconcile to other data sources (the regulatory report must agree with the internal management accounts)
- Requires complete audit trails (the regulator must be able to trace a reported number back to its component source data)
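The reconciliation requirement can be sketched as a comparison of reported line items against internal ledger totals, with breaks above a tolerance flagged before sign-off. The line-item names and the tolerance are illustrative:

```python
import pandas as pd

def reconcile_report_to_ledger(report_totals: pd.Series,
                               ledger_totals: pd.Series,
                               tolerance: float = 0.01) -> pd.DataFrame:
    """Compare regulatory report line items against internal ledger totals.

    Every reported figure should reconcile to the ledger within a small
    tolerance; breaks above tolerance are flagged for investigation
    before the report is signed off.
    """
    recon = pd.DataFrame({'report': report_totals, 'ledger': ledger_totals})
    recon['difference'] = (recon['report'] - recon['ledger']).abs()
    recon['reconciled'] = recon['difference'] <= tolerance
    return recon

report = pd.Series({'retail_loans': 1_250_000.00, 'corporate_loans': 840_000.00})
ledger = pd.Series({'retail_loans': 1_250_000.00, 'corporate_loans': 839_500.00})
recon = reconcile_report_to_ledger(report, ledger)
print(recon[~recon['reconciled']].index.tolist())  # ['corporate_loans']
```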


5.4 Data Quality: The Silent Failure Mode

Data quality problems are the most common cause of RegTech implementation failure — and the most commonly underestimated. This section examines the most important data quality dimensions in a compliance context.

The Six Data Quality Dimensions

Completeness: Are all required records present? Missing transaction records, missing customer data, and missing beneficial ownership information are completeness failures.

Accuracy: Does the data correctly represent the underlying reality? An incorrect customer address, a wrong transaction amount, and a mislabeled account type are all accuracy failures.

Consistency: Does the same fact have the same representation across all systems? If the customer's name is "John Smith" in the CRM, "J. Smith" in the transaction system, and "Jonathan Smith" in the KYC system, consistency has failed — and entity matching will struggle.

Timeliness: Is the data available when it is needed? Transaction data that arrives for monitoring 24 hours after the transaction occurred may be too late for real-time fraud prevention.

Validity: Does the data conform to the required format and domain? An invalid LEI, a transaction date in an incorrect format, and a country code that doesn't exist in the reference table are all validity failures.

Uniqueness: Is each entity represented exactly once? Duplicate customer records — the same customer appearing as two separate entities — cause monitoring systems to miss patterns that span the duplicates.
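The consistency example above ("John Smith" vs "J. Smith" vs "Jonathan Smith") is the classic obstacle to entity matching. A crude first-pass mitigation is to compare normalized name keys rather than raw strings. The normalization rules below are illustrative; production entity resolution relies on much richer techniques (phonetic keys, edit distance, additional attributes such as date of birth):

```python
import re

def normalize_name(name: str) -> str:
    """Reduce a personal name to a crude canonical key for matching:
    lowercase, strip punctuation, drop single-letter initials,
    and sort the remaining tokens."""
    tokens = re.sub(r'[^a-z ]', ' ', name.lower()).split()
    return ' '.join(sorted(t for t in tokens if len(t) > 1))

# Trivial variations collapse to the same key...
assert normalize_name('Smith, John') == normalize_name('john  SMITH')
# ...but abbreviations and nicknames do not, which is exactly why
# entity matching needs richer techniques than string normalization.
assert normalize_name('J. Smith') != normalize_name('Jonathan Smith')
print(normalize_name('Smith, John'))  # john smith
```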

A Data Quality Assessment in Python

import pandas as pd
import numpy as np
from datetime import datetime


def assess_data_quality(df: pd.DataFrame,
                        required_columns: list[str]) -> dict:
    """
    Perform a data quality assessment on a DataFrame.

    Returns a dictionary of quality metrics for each column.
    """
    quality_report = {}

    for column in df.columns:
        col_report = {
            'total_records': len(df),
            'null_count': df[column].isnull().sum(),
            'null_rate': df[column].isnull().mean(),
            'unique_count': df[column].nunique(),
            'duplicate_rate': 1 - (df[column].nunique() / len(df))
                             if len(df) > 0 else 0,
            'is_required': column in required_columns,
        }

        # Flag data quality issues
        issues = []
        if column in required_columns and col_report['null_rate'] > 0:
            issues.append(
                f"REQUIRED field has {col_report['null_rate']:.1%} nulls"
            )
        if col_report['null_rate'] > 0.05:
            issues.append(f"High null rate: {col_report['null_rate']:.1%}")

        col_report['issues'] = issues
        quality_report[column] = col_report

    return quality_report


def check_customer_data_quality(customers_df: pd.DataFrame) -> None:
    """
    Specific quality checks for customer compliance data.
    These mirror the checks an FCA examination might perform.
    """
    print("CUSTOMER DATA QUALITY ASSESSMENT")
    print("="*50)
    print(f"Total customer records: {len(customers_df):,}")

    # 1. Completeness checks
    required_fields = [
        'customer_id', 'full_name', 'date_of_birth',
        'nationality', 'address', 'kyc_verified_date',
        'risk_rating', 'customer_type'
    ]

    print("\n1. COMPLETENESS")
    for field in required_fields:
        if field in customers_df.columns:
            null_count = customers_df[field].isnull().sum()
            null_rate = null_count / len(customers_df)
            status = "✓" if null_count == 0 else "⚠️ " if null_rate < 0.02 else "❌"
            print(f"  {status} {field}: {null_count:,} missing "
                  f"({null_rate:.1%})")
        else:
            print(f"  ❌ {field}: FIELD MISSING FROM DATASET")

    # 2. Freshness checks (how current is KYC data?)
    if 'kyc_verified_date' in customers_df.columns:
        print("\n2. DATA FRESHNESS (KYC CURRENCY)")
        customers_df['kyc_verified_date'] = pd.to_datetime(
            customers_df['kyc_verified_date'], errors='coerce'
        )
        now = pd.Timestamp.now()
        customers_df['days_since_kyc'] = (
            now - customers_df['kyc_verified_date']
        ).dt.days

        # FCA expects high-risk customers reviewed annually
        # Standard customers may have longer review cycles
        stale_threshold_days = 365  # 1 year for this example

        stale_records = (
            customers_df['days_since_kyc'] > stale_threshold_days
        ).sum()
        stale_rate = stale_records / len(customers_df)

        status = "✓" if stale_rate < 0.05 else "⚠️ " if stale_rate < 0.15 else "❌"
        print(f"  {status} Records with KYC older than "
              f"{stale_threshold_days} days: "
              f"{stale_records:,} ({stale_rate:.1%})")

    # 3. Duplicate detection
    print("\n3. DUPLICATE DETECTION")
    if 'customer_id' in customers_df.columns:
        dup_ids = customers_df['customer_id'].duplicated().sum()
        print(f"  {'✓' if dup_ids == 0 else '❌'} Duplicate customer_id: "
              f"{dup_ids}")

    # Exact name + DOB duplicate check (same name + same DOB = likely duplicate)
    if 'full_name' in customers_df.columns and 'date_of_birth' in customers_df.columns:
        key = customers_df['full_name'].str.lower() + '|' + \
              customers_df['date_of_birth'].astype(str)
        likely_dups = key.duplicated().sum()
        print(f"  {'✓' if likely_dups == 0 else '⚠️ '} "
              f"Likely duplicate persons (same name+DOB): {likely_dups}")

    print("\nAssessment complete.")


# Example usage
if __name__ == "__main__":
    # Generate synthetic customer dataset with intentional quality issues
    np.random.seed(42)
    n = 1000

    customers = pd.DataFrame({
        'customer_id': range(1, n + 1),
        'full_name': [f"Customer {i}" for i in range(n)],
        'date_of_birth': pd.date_range('1960-01-01', periods=n, freq='3D'),
        'nationality': np.random.choice(['GB', 'US', 'DE', None], n,
                                        p=[0.4, 0.3, 0.28, 0.02]),
        'address': [f"{i} Main Street" if i % 50 != 0 else None
                    for i in range(n)],
        'kyc_verified_date': pd.date_range('2020-01-01', periods=n, freq='D'),
        'risk_rating': np.random.choice(['LOW', 'MEDIUM', 'HIGH', None], n,
                                        p=[0.6, 0.3, 0.09, 0.01]),
        'customer_type': np.random.choice(
            ['INDIVIDUAL', 'CORPORATE', 'TRUST', None], n,
            p=[0.7, 0.2, 0.09, 0.01]
        )
    })

    # Introduce a specific quality issue: backdate the first 50 records'
    # KYC verification dates so they show up as stale in the freshness check
    customers.loc[:49, 'kyc_verified_date'] = pd.Timestamp('2021-01-01')

    check_customer_data_quality(customers)

5.5 Data Lineage and Audit Trails

Data lineage — the ability to trace data from its source through every transformation to its final use in a compliance output — is both a regulatory expectation and a practical necessity.

Why Lineage Matters

When a regulator asks "How did you calculate this capital ratio?", you need to be able to trace every component:

- Where did the position data come from?
- How was it transformed (aggregated, risk-weighted)?
- What reference data (risk weights, FX rates) was applied?
- What calculation logic was used, and when was it last updated?
- Who approved the final output?

Without data lineage, this question cannot be answered. The inability to answer it is itself a regulatory finding.

Building Audit Trails

An effective audit trail captures:

- What data was used: Source records and their values at the time of calculation
- When it was used: Timestamps for data extraction, transformation, and output generation
- What logic was applied: The calculation rules, their version, and who approved them
- Who reviewed the output: Sign-off records for reported outputs
- What changed: A history of corrections and the reasons for them

import hashlib
from datetime import datetime
from typing import Any

import pandas as pd


class AuditableDataTransformation:
    """
    A wrapper that records an audit trail for data transformations.
    Every transformation is logged with:
    - Input data hash (for integrity verification)
    - Transformation applied
    - Output data hash
    - Timestamp
    - User ID
    """

    def __init__(self, transformation_name: str, user_id: str):
        self.transformation_name = transformation_name
        self.user_id = user_id
        self.audit_log = []

    def _hash_data(self, data: Any) -> str:
        """Create a hash of the data for integrity verification."""
        return hashlib.sha256(
            str(data).encode('utf-8')
        ).hexdigest()[:16]

    def transform(self, input_data: pd.DataFrame,
                  transformation_func,
                  **kwargs) -> pd.DataFrame:
        """
        Apply a transformation and record the audit trail.
        """
        input_hash = self._hash_data(input_data.to_dict())
        input_shape = input_data.shape

        # Apply the transformation
        output_data = transformation_func(input_data, **kwargs)

        output_hash = self._hash_data(output_data.to_dict())
        output_shape = output_data.shape

        # Record the audit entry
        audit_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'transformation': self.transformation_name,
            'user_id': self.user_id,
            'input_hash': input_hash,
            'input_records': input_shape[0],
            'output_hash': output_hash,
            'output_records': output_shape[0],
            'records_added': output_shape[0] - input_shape[0],
            'kwargs': str(kwargs)
        }

        self.audit_log.append(audit_entry)
        return output_data

    def print_audit_log(self):
        """Display the audit trail."""
        print(f"\nAudit Trail for: {self.transformation_name}")
        print("-" * 50)
        for entry in self.audit_log:
            print(f"  Timestamp: {entry['timestamp']}")
            print(f"  User: {entry['user_id']}")
            print(f"  Input: {entry['input_records']} records "
                  f"[hash: {entry['input_hash']}]")
            print(f"  Output: {entry['output_records']} records "
                  f"[hash: {entry['output_hash']}]")
            print()

5.6 Master Data Management in Financial Institutions

Master data management (MDM) refers to the processes and technology for creating and maintaining a single, authoritative source of core business entities — customers, counterparties, products, accounts — across a financial institution's systems.

In compliance, MDM is critical for two reasons:

Entity matching: To monitor a customer's activity across all their accounts and products, you need to be able to identify that account A and account B belong to the same customer. This requires a reliable customer identifier (the "golden record") that links across systems.

Beneficial ownership: To identify the ultimate beneficial owner of a corporate account, you need a complete and accurate corporate hierarchy — which companies own which other companies, and who controls each company. This data is typically fragmented across onboarding documents, company registries, and internal relationship management systems.
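Once the hierarchy data has been assembled, identifying UBOs amounts to walking the ownership graph and multiplying stakes through each chain. A minimal sketch, assuming a simple dictionary representation of direct ownership and the common 25% threshold (the entity names are hypothetical):

```python
def ultimate_beneficial_owners(company: str,
                               ownership: dict[str, dict[str, float]],
                               threshold: float = 0.25) -> dict[str, float]:
    """Walk an ownership hierarchy to find natural persons whose effective
    (multiplied-through) stake in `company` meets the UBO threshold.

    `ownership` maps each entity to its direct owners and their stakes;
    entities absent from the map are treated as natural persons.
    """
    effective: dict[str, float] = {}

    def walk(entity: str, stake: float) -> None:
        owners = ownership.get(entity)
        if owners is None:                  # a natural person: terminal node
            effective[entity] = effective.get(entity, 0.0) + stake
            return
        for owner, share in owners.items():
            walk(owner, stake * share)

    walk(company, 1.0)
    return {p: s for p, s in effective.items() if s >= threshold}

# Hypothetical structure: TradeCo is mostly owned via a holding company.
ownership = {
    'TradeCo Ltd': {'Holdings BV': 0.8, 'Alice': 0.2},
    'Holdings BV': {'Bob': 0.5, 'Carol': 0.5},
}
print(ultimate_beneficial_owners('TradeCo Ltd', ownership))
# {'Bob': 0.4, 'Carol': 0.4} -- Alice's 20% falls below the 25% threshold
```

Real hierarchies add circular ownership, nominee arrangements, and control without ownership, all of which this sketch ignores.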

The Customer Golden Record

The customer golden record is the single authoritative representation of a customer's identity and relationship with the institution. It includes:

- A universal customer identifier used across all systems
- The verified identity attributes (from KYC)
- Account relationships (all accounts held by this customer)
- Risk attributes (risk rating, PEP status, SAR history)
- Relationship history (products held, relationship managers)

Creating and maintaining the golden record requires:

- Identity resolution: determining that different records in different systems refer to the same customer
- Deduplication: removing duplicate records
- Survivorship rules: deciding which record's attributes "win" when two records have conflicting data
- Ongoing refresh: updating the golden record as data changes
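Survivorship rules work best when they are explicit and testable rather than buried in ETL code. The sketch below assumes a hypothetical per-attribute source priority and one matched record per source system:

```python
import pandas as pd

# Hypothetical survivorship rules: for each attribute, which source system
# "wins" when candidate records disagree (most trusted source first).
SURVIVORSHIP_PRIORITY = {
    'full_name':   ['kyc', 'crm', 'core_banking'],
    'address':     ['crm', 'kyc', 'core_banking'],
    'risk_rating': ['kyc', 'core_banking', 'crm'],
}

def build_golden_record(candidates: pd.DataFrame) -> dict:
    """Merge matched records for one customer into a single golden record.

    `candidates` holds one row per source system (column 'source'); for each
    attribute, the value from the highest-priority source that is non-null wins.
    """
    by_source = candidates.set_index('source')
    golden = {}
    for attribute, priority in SURVIVORSHIP_PRIORITY.items():
        for source in priority:
            value = (by_source.at[source, attribute]
                     if source in by_source.index else None)
            if pd.notna(value):
                golden[attribute] = value
                break
    return golden

matched = pd.DataFrame([
    {'source': 'core_banking', 'full_name': 'J. Smith',
     'address': None, 'risk_rating': 'LOW'},
    {'source': 'crm', 'full_name': 'John Smith',
     'address': '1 Main Street', 'risk_rating': None},
    {'source': 'kyc', 'full_name': 'John Smith',
     'address': '1 Main St', 'risk_rating': 'MEDIUM'},
])
print(build_golden_record(matched))
# {'full_name': 'John Smith', 'address': '1 Main Street', 'risk_rating': 'MEDIUM'}
```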


5.7 Cloud vs. On-Premise vs. Hybrid: Architectural Choices

The architecture of a compliance data environment — where data is stored and processed — has both technical and regulatory implications.

On-Premise Architecture

Historically, financial institutions processed compliance data on internal servers within their own data centers. This gave them complete control over data security and residency but imposed significant capital and operational costs.

Advantages: Complete data sovereignty; full control over security configuration; no dependency on cloud provider availability. Disadvantages: Significant capital expenditure; limited scalability; slower deployment of new capabilities.

Cloud Architecture

Cloud-hosted compliance data environments are now the norm for new implementations and increasingly for migrations.

Advantages: Scalability; faster deployment; access to ML and analytics services; typically more cost-effective at scale. Disadvantages: Data residency requirements may constrain where data can be stored; dependency on cloud provider; regulatory attention to concentration risk.

Data Residency Requirements

GDPR and some national data protection laws impose requirements on where personal data can be stored and processed. Financial regulations in some jurisdictions require that certain types of data remain within the jurisdiction. Cloud architecture must be designed to comply with these requirements.

⚖️ Regulatory Alert: The FCA (PS23/3) and ECB/SSM have both issued guidance on cloud adoption for regulated financial services firms. The FCA requires firms to: assess concentration risk from cloud provider dependence; ensure appropriate exit strategies exist; maintain audit rights over cloud providers; comply with data residency requirements. Cloud adoption that does not address these requirements may result in regulatory findings. Chapter 27 covers this in detail.

The Hybrid Architecture

Most large financial institutions use a hybrid architecture: cloud for new data workloads and analytics, on-premise for core systems and highly sensitive data, with a data integration layer connecting both.

Practical Architecture for Compliance Data

A practical compliance data architecture for a mid-size institution:

                    ┌──────────────────────────────────────┐
                    │             DATA SOURCES             │
                    │  Core Banking │ Trading │ CRM │ Docs │
                    └──────────────────┬───────────────────┘
                                       │ ETL / API Connectors
                    ┌──────────────────▼───────────────────┐
                    │         COMPLIANCE DATA LAKE         │
                    │  Raw data storage; complete history  │
                    │  (Cloud: AWS S3, Azure Data Lake)    │
                    └──────────────────┬───────────────────┘
                                       │ Data Quality & Transformation
                    ┌──────────────────▼───────────────────┐
                    │      COMPLIANCE DATA WAREHOUSE       │
                    │   Cleaned, governed, modeled data    │
                    │   (Customer golden records, etc.)    │
                    └──────────────────┬───────────────────┘
                                       │
                   ┌───────────────────┼───────────────────┐
                   │                   │                   │
           ┌───────▼───────┐   ┌───────▼───────┐   ┌───────▼───────┐
           │      AML      │   │  Regulatory   │   │      KYC      │
           │  Monitoring   │   │  Reporting    │   │    System     │
           └───────────────┘   └───────────────┘   └───────────────┘

Chapter Summary

Data architecture is the foundation of effective RegTech. Every technology solution for compliance — ML models, regulatory reports, KYC systems — is only as reliable as the data it runs on.

BCBS 239 principles provide the most authoritative framework for compliance data governance, even for institutions not directly subject to BCBS 239.

The regulatory data taxonomy distinguishes customer data, transaction data, reference data, and regulatory reporting data — each with different quality requirements and architectural implications.

Data quality has six dimensions: completeness, accuracy, consistency, timeliness, validity, and uniqueness. Each has compliance implications when it fails.

Data lineage — the ability to trace data from source through transformation to compliance output — is both a regulatory expectation and a practical operational necessity.

Master data management — particularly the customer golden record — is the enabler of enterprise-wide compliance monitoring.

Architecture choices (cloud, on-premise, hybrid) have both technical and regulatory implications that must be addressed explicitly in any compliance data strategy.


Continue to Part 2: Identity, KYC, and AML →