Chapter 12: Operational Risk and Technology Risk Management

DataField.Dev

Prerequisites

Chapter 2: The Regulatory Landscape — prudential and operational regulatory mandate
Chapter 5: Data Architecture — control-state telemetry and lineage
Awareness of Basel operational-risk framework and key incidents (Knight, SocGen)

Learning Objectives

Classify operational-risk events under the Basel event-type taxonomy
Design a Risk and Control Self-Assessment (RCSA) workflow with Key Risk Indicators (KRIs)
Apply the Standardised Measurement Approach (SMA) for operational-risk capital under Basel IV
Implement technology-risk controls aligned with DORA, FFIEC, and PRA SS2/21 expectations
Operate operational-risk loss data collection that withstands supervisory challenge

In This Chapter

Opening: The System That Failed at the Wrong Moment
12.1 Operational Risk: Definition, Scope, and the Basel Framework
12.2 Technology Risk: From Residual Category to Core Risk
12.3 The Operational Risk Management Framework
12.4 Third-Party Risk Management
12.5 Cybersecurity Risk in Financial Institutions
12.6 Model Risk Management
12.7 Priya's Technology Risk Audit: A Practitioner's Approach
Chapter Summary

Opening: The System That Failed at the Wrong Moment

On August 1, 2012, Knight Capital Group, a major US market maker and electronic trading firm, deployed a new trading algorithm to production. The deployment went wrong. An old, deactivated piece of trading software was inadvertently reactivated. For about 45 minutes, Knight's systems executed millions of unintended trades, accumulating massive positions in 154 stocks (the figure in the SEC's subsequent findings) that the firm had not intended to hold.

By the time the trading was stopped, Knight had lost roughly $440 million; the SEC later put the loss at more than $460 million. The company was effectively insolvent. It survived only through an emergency capital injection from outside investors arranged within days, and it was acquired by Getco the following year.

The root cause was not fraud, not market losses, not credit defaults. It was an operational failure: a software deployment error, a configuration management failure, and an inadequate monitoring system that did not alert human operators quickly enough to prevent catastrophic loss.

Knight Capital's collapse is the canonical example of technology operational risk in financial services — the risk that systems, processes, and human factors, rather than market movements, will cause institutional harm. It also illustrates why operational risk management, once treated as the residual category after "real" risks (market and credit) were addressed, has moved to the center of regulatory attention.

12.1 Operational Risk: Definition, Scope, and the Basel Framework

The Basel Definition

The Basel Committee on Banking Supervision defines operational risk as:

"The risk of loss resulting from inadequate or failed internal processes, people and systems, or from external events."

This definition is deliberately broad. It encompasses:

Process failures: Inadequate procedures, errors in execution, failed controls, documentation gaps

People failures: Human error, fraud (internal and external), inadequate training, key person dependency

System failures: Technology failures, software errors, cybersecurity incidents, data corruption

External events: Natural disasters, regulatory changes, fraud by third parties, power failures

The Basel definition includes legal risk but explicitly excludes strategic risk (the risk of poor business decisions) and reputational risk, though both may result from operational events.

The Basel Capital Framework for Operational Risk

Banks have been required to hold capital against operational risk since Basel II, and the requirement carries through Basel III and its final standards (often called Basel IV). The measurement approaches have evolved significantly:

Basic Indicator Approach (BIA): Capital = 15% × average annual gross income over the previous three years, counting only years in which gross income was positive. Simple but crude: gross income is a proxy for scale, so the approach takes no account of the firm's actual risk profile or control quality.

The Standardised Approach (TSA): Similar to BIA but maps gross income to eight business lines, each with its own multiplier (12–18%). Slightly more risk-sensitive.

Advanced Measurement Approaches (AMA): Permitted banks to use internal models to calculate operational risk capital. Highly sophisticated but produced inconsistent, often unverifiable results across firms.

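Standardized Measurement Approach (SMA), Basel IV: Replaces BIA, the earlier standardized approach, and AMA with a single standardized approach. Capital is a Business Indicator Component (a Business Indicator built from income and expense items, multiplied by marginal coefficients that rise with size) scaled by an Internal Loss Multiplier derived from the institution's own historical loss data. The Basel Committee's implementation date was January 1, 2023; national timelines vary, with the EU applying the approach from January 1, 2025.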

The trajectory reflects a regulatory judgment: AMA produced regulatory arbitrage (banks could minimize capital through model choice); the SMA provides greater comparability and prudential reliability.
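To make the SMA mechanics concrete, here is a minimal sketch of the calculation as set out in the Basel Committee's December 2017 standard. The Business Indicator Component (BIC) applies marginal coefficients of 12%, 15%, and 18% to the Business Indicator (BI) in three size buckets, and the Internal Loss Multiplier (ILM) compares a Loss Component (15 times average annual losses over ten years) with the BIC. The function names and the sample bank are illustrative assumptions, and national rules may override parts of this (for example by fixing the ILM at 1).

```python
import math

# Marginal coefficients by Business Indicator (BI) bucket, in EUR billions.
BI_BUCKETS = [(1.0, 0.12), (30.0, 0.15), (math.inf, 0.18)]


def business_indicator_component(bi: float) -> float:
    """BIC: apply each marginal coefficient to the slice of BI in its bucket."""
    bic, lower = 0.0, 0.0
    for upper, coefficient in BI_BUCKETS:
        if bi > lower:
            bic += (min(bi, upper) - lower) * coefficient
        lower = upper
    return bic


def internal_loss_multiplier(bic: float, avg_annual_loss: float) -> float:
    """ILM = ln(e - 1 + (LC / BIC)^0.8), with LC = 15 x average annual loss."""
    loss_component = 15.0 * avg_annual_loss
    return math.log(math.e - 1.0 + (loss_component / bic) ** 0.8)


def sma_capital(bi: float, avg_annual_loss: float) -> float:
    """Operational risk capital = BIC x ILM (all amounts in EUR billions)."""
    bic = business_indicator_component(bi)
    # Bucket 1 banks (BI <= EUR 1bn): loss history does not affect capital.
    ilm = 1.0 if bi <= 1.0 else internal_loss_multiplier(bic, avg_annual_loss)
    return bic * ilm


# Hypothetical bank: BI of EUR 2bn, average annual losses of EUR 18m.
capital = sma_capital(bi=2.0, avg_annual_loss=0.018)
print(f"BIC = {business_indicator_component(2.0):.3f}bn, capital = {capital:.3f}bn")
```

In this example the Loss Component equals the BIC, so the ILM is 1 and capital equals the BIC. Heavier loss experience pushes the ILM above 1; lighter loss experience pulls it below.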

12.2 Technology Risk: From Residual Category to Core Risk

Technology risk — the risk of loss arising from the failure, inadequacy, or malicious exploitation of technology systems — was historically treated as a sub-category of operational risk. This framing has become inadequate as financial institutions have become essentially technology companies.

The DORA Framework (EU)

The EU's Digital Operational Resilience Act (DORA), which has applied since January 17, 2025, represents the most comprehensive regulatory treatment of technology risk for financial services to date. DORA applies to:

- Banks, investment firms, payment institutions, e-money institutions
- Critical third-party ICT providers serving financial institutions
- Insurance companies, pension funds

DORA's five pillars:

ICT risk management: Institutions must implement a comprehensive ICT risk management framework — identifying, classifying, and managing ICT risks with board-level accountability
ICT incident reporting: Material ICT incidents must be reported to competent authorities; a three-phase reporting structure (initial notification → intermediate report → final report) with prescribed timelines
Digital operational resilience testing: Mandatory periodic testing of ICT systems; significant institutions must also undergo Threat-Led Penetration Testing (TLPT) at least every three years
Third-party risk management: Comprehensive risk assessment of ICT third-party providers; registers of all arrangements; contractual requirements for cloud providers and other critical services
Information sharing: Voluntary but encouraged information sharing on cyber threats across the financial sector

DORA represents a significant expansion of supervisory expectations: technology resilience is now a primary regulatory obligation, not a secondary operational consideration.

UK OCIR and US Regulatory Expectations

UK: The PRA's Operational Continuity in Resolution (OCIR) rules require major UK financial institutions to keep critical services running even through a resolution scenario. Separately, the PRA and FCA operational resilience policies require institutions to identify "important business services" and set impact tolerances (the maximum tolerable level of disruption, typically expressed as a maximum downtime).
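As a rough illustration of how an impact tolerance works in practice, the sketch below records a tolerance per important business service and checks an outage against it. The class name, the service, and the four-hour tolerance are hypothetical; real tolerances are set by each firm and may use measures other than duration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ImportantBusinessService:
    name: str
    impact_tolerance: timedelta  # maximum tolerable disruption (hypothetical value)


def tolerance_headroom(service: ImportantBusinessService,
                       outage_start: datetime,
                       outage_end: datetime) -> timedelta:
    """Time left before the tolerance is breached; negative means a breach."""
    return service.impact_tolerance - (outage_end - outage_start)


payments = ImportantBusinessService("Outbound payments", timedelta(hours=4))
start = datetime(2024, 3, 5, 9, 0)

headroom = tolerance_headroom(payments, start, start + timedelta(hours=6))
breached = headroom < timedelta(0)
print(f"{payments.name}: breached={breached}, headroom={headroom}")
```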

US: The Federal Reserve, OCC, and FDIC have issued interagency guidance on technology risk, including specific guidance on sound practices for model risk management (SR 11-7) and on third-party relationships (interagency guidance finalized 2023).

12.3 The Operational Risk Management Framework

A sound operational risk management (ORM) framework has four core components:

1. Risk Identification and Assessment

Risk and Control Self-Assessment (RCSA): Business units systematically identify the operational risks in their processes, assess their inherent risk (before controls) and residual risk (after controls), and document the controls that mitigate each risk. The RCSA is the foundation of the ORM framework.

RCSA Structure (simplified)

Process: Customer onboarding
↓
Risk: Incorrect identity verification allows fraudulent account opening
↓
Inherent risk: HIGH (financial loss, regulatory penalty, reputational damage)
↓
Controls:
  - Automated document verification system
  - eIDV database check
  - Adverse media and sanctions screening
  - Analyst review for flagged applications
↓
Residual risk after controls: MEDIUM
↓
Control effectiveness assessment: Strong (documented test results)
Control gaps: [Description of any gaps]
Action items: [Remediation plans with owners and dates]
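The RCSA structure above maps naturally onto a small record type. The sketch below is one possible encoding, not a standard: the rule that "strong" controls remove exactly one rating notch is an assumption chosen so that the onboarding example reproduces the HIGH to MEDIUM outcome shown above.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Rating(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical rule: controls assessed as "strong" remove one rating notch;
# "weak" controls give no credit. Real methodologies use richer matrices.
NOTCHES_REMOVED = {"strong": 1, "weak": 0}


@dataclass
class RCSAEntry:
    process: str
    risk: str
    inherent: Rating
    controls: list[str]
    control_effectiveness: str  # "strong" or "weak" in this sketch
    gaps: list[str] = field(default_factory=list)

    def residual(self) -> Rating:
        """Residual risk = inherent risk less credit for controls, floored at LOW."""
        notches = NOTCHES_REMOVED[self.control_effectiveness]
        return Rating(max(int(Rating.LOW), int(self.inherent) - notches))


onboarding = RCSAEntry(
    process="Customer onboarding",
    risk="Incorrect identity verification allows fraudulent account opening",
    inherent=Rating.HIGH,
    controls=[
        "Automated document verification system",
        "eIDV database check",
        "Adverse media and sanctions screening",
        "Analyst review for flagged applications",
    ],
    control_effectiveness="strong",
)
print(onboarding.residual().name)
```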

Risk taxonomy: A consistent classification system for operational risk events enables aggregation and trend analysis across business units. The Basel Committee's seven-category taxonomy is widely used:

1. Internal fraud
2. External fraud
3. Employment practices and workplace safety
4. Clients, products, and business practices
5. Damage to physical assets
6. Business disruption and system failures
7. Execution, delivery, and process management

2. Loss Data Collection and Analysis

Internal loss data: Institutions collect data on operational risk events — incidents that resulted in financial loss, near-misses that could have resulted in loss, or control failures that created risk. The data includes: event date, event description, business line, risk category, gross loss, recovery, and root cause.

Loss data serves multiple functions:

- Informs RCSA risk assessments (are control failures resulting in actual losses?)
- Provides input to capital models (particularly under the SMA)
- Identifies patterns and trends for risk management
- Supports post-incident remediation
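A loss event record with the fields listed above, tagged with a Basel event type, might be sketched as follows. The class and function names are illustrative, and the two sample events are invented purely to show the aggregation.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from enum import Enum


class EventType(Enum):
    INTERNAL_FRAUD = "Internal fraud"
    EXTERNAL_FRAUD = "External fraud"
    EMPLOYMENT_PRACTICES = "Employment practices and workplace safety"
    CLIENTS_PRODUCTS = "Clients, products, and business practices"
    PHYSICAL_ASSETS = "Damage to physical assets"
    BUSINESS_DISRUPTION = "Business disruption and system failures"
    EXECUTION_DELIVERY = "Execution, delivery, and process management"


@dataclass
class LossEvent:
    event_date: date
    description: str
    business_line: str
    event_type: EventType
    gross_loss: float
    recovery: float
    root_cause: str

    @property
    def net_loss(self) -> float:
        return self.gross_loss - self.recovery


def net_loss_by_event_type(events: list[LossEvent]) -> dict[EventType, float]:
    """Aggregate net losses per Basel event type for trend analysis."""
    totals: dict[EventType, float] = defaultdict(float)
    for event in events:
        totals[event.event_type] += event.net_loss
    return dict(totals)


# Invented sample events, for illustration only.
events = [
    LossEvent(date(2024, 2, 1), "Duplicate payment batch", "Retail",
              EventType.EXECUTION_DELIVERY, 120_000.0, 90_000.0, "Manual rerun"),
    LossEvent(date(2024, 5, 9), "Card fraud ring", "Retail",
              EventType.EXTERNAL_FRAUD, 40_000.0, 0.0, "Stolen credentials"),
]
totals = net_loss_by_event_type(events)
for event_type, total in totals.items():
    print(f"{event_type.value}: {total:,.0f}")
```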

External loss data: Institutions supplement their own loss history with industry-wide loss databases — ORX (Operational Riskdata eXchange Association) is the primary global industry utility. External data is particularly important for low-frequency, high-severity events (like Knight Capital) where individual institutions may have no internal history but face the risk.

3. Scenario Analysis

For large, low-probability operational risk events that don't appear in historical loss data, institutions use scenario analysis: systematic consideration of "what could go wrong" scenarios based on:

- Expert elicitation (asking experienced business leaders to identify plausible severe scenarios)
- External data on industry events
- Regulatory and supervisory concerns

Output: A set of severity estimates for extreme scenarios, used in capital calculation and business continuity planning.
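One simple way to turn expert elicitation into a set of severity estimates is to summarize each scenario by the median and range of the experts' figures. The sketch below assumes that approach; the scenarios and amounts are invented, and institutions may use more formal elicitation methods.

```python
from statistics import median

# Invented workshop output: scenario -> expert severity estimates (GBP millions).
expert_estimates = {
    "Erroneous algorithmic order flow": [250.0, 400.0, 600.0],
    "Core banking outage beyond 48 hours": [20.0, 35.0, 80.0],
}


def summarize_scenarios(estimates: dict[str, list[float]]) -> list[dict]:
    """Return one row per scenario, most severe (by median) first."""
    rows = [
        {
            "scenario": name,
            "median": median(values),
            "low": min(values),
            "high": max(values),
        }
        for name, values in estimates.items()
    ]
    return sorted(rows, key=lambda row: row["median"], reverse=True)


summary = summarize_scenarios(expert_estimates)
for row in summary:
    print(f'{row["scenario"]}: median {row["median"]} (range {row["low"]} to {row["high"]})')
```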

4. Key Risk Indicators

Key Risk Indicators (KRIs) are quantitative metrics that provide early warning of increasing operational risk:

KRI	Business Line	Risk Dimension	Alert Threshold
Failed trade settlement rate	Markets	Execution risk	> 0.5%
IT incidents per month	Technology	System stability	> 15 critical
Overdue audit actions	All	Control effectiveness	> 10 overdue >30 days
Staff turnover in compliance	Compliance	People risk	> 25% per quarter
Customer complaints (AML-related)	Retail	Conduct risk	> 3× prior quarter
Third-party SLA breaches	Technology	Vendor risk	> 5% of SLAs

"""
KRI Monitoring and Alert Generation

Tracks Key Risk Indicators against defined thresholds
and generates alerts when indicators exceed warning levels.
"""

from dataclasses import dataclass
from datetime import date
from enum import IntEnum


class AlertLevel(IntEnum):
    GREEN = 1   # Within tolerance
    AMBER = 2   # Approaching threshold — enhanced monitoring
    RED = 3     # Threshold exceeded — management action required


@dataclass
class KRI:
    name: str
    description: str
    business_line: str
    risk_category: str
    unit: str
    amber_threshold: float
    red_threshold: float
    threshold_direction: str  # 'above' (high is bad) or 'below' (low is bad)


@dataclass
class KRIReading:
    kri_name: str
    reading_date: date
    value: float
    source: str  # e.g., 'ops_data_warehouse', 'manual_submission'


def evaluate_kri(kri: KRI, value: float) -> AlertLevel:
    """Evaluate a KRI reading against thresholds."""
    if kri.threshold_direction == "above":
        if value >= kri.red_threshold:
            return AlertLevel.RED
        elif value >= kri.amber_threshold:
            return AlertLevel.AMBER
        return AlertLevel.GREEN
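    elif kri.threshold_direction == "below":
        if value <= kri.red_threshold:
            return AlertLevel.RED
        elif value <= kri.amber_threshold:
            return AlertLevel.AMBER
        return AlertLevel.GREEN
    # Reject typos such as "Above" rather than silently treating them as "below"
    raise ValueError(f"Unknown threshold_direction: {kri.threshold_direction!r}")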


class KRIDashboard:
    """
    Aggregates KRI readings and produces risk dashboard summaries.
    """

    def __init__(self, kris: list[KRI]):
        self.kris = {k.name: k for k in kris}
        self.readings: list[KRIReading] = []

    def add_reading(self, reading: KRIReading):
        self.readings.append(reading)

    def current_status(self, as_of: date) -> list[dict]:
        """Return current status of all KRIs as of the given date."""
        # Get the most recent reading for each KRI up to as_of date
        latest: dict[str, KRIReading] = {}
        for reading in self.readings:
            if reading.reading_date <= as_of:
                if (reading.kri_name not in latest or
                        reading.reading_date > latest[reading.kri_name].reading_date):
                    latest[reading.kri_name] = reading

        results = []
        for kri_name, kri in self.kris.items():
            if kri_name in latest:
                reading = latest[kri_name]
                level = evaluate_kri(kri, reading.value)
                results.append({
                    "kri": kri_name,
                    "business_line": kri.business_line,
                    "value": reading.value,
                    "unit": kri.unit,
                    "status": level.name,
                    "reading_date": reading.reading_date.isoformat(),
                    "amber_threshold": kri.amber_threshold,
                    "red_threshold": kri.red_threshold,
                })
            else:
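                # Same keys as the populated branch so consumers can rely on one schema
                results.append({
                    "kri": kri_name,
                    "business_line": kri.business_line,
                    "value": None,
                    "unit": kri.unit,
                    "status": "NO_DATA",
                    "reading_date": None,
                    "amber_threshold": kri.amber_threshold,
                    "red_threshold": kri.red_threshold,
                })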

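        # Most severe first: RED, AMBER, then NO_DATA (a missing feed needs follow-up), then GREEN
        severity = {"RED": 0, "AMBER": 1, "NO_DATA": 2, "GREEN": 3}
        return sorted(results, key=lambda x: severity[x["status"]])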
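Threshold checks like those above only fire once a limit is reached. Because KRIs are meant to give early warning of increasing risk, a trend check can complement them. The following standalone sketch flags a KRI whose last few readings have all moved toward its threshold; the function name and the three-period window are assumptions.

```python
def is_deteriorating(values: list[float], direction: str = "above", window: int = 3) -> bool:
    """True if each of the last `window` period-on-period changes moved toward the threshold.

    direction="above" means high values are bad; "below" means low values are bad.
    """
    if len(values) < window + 1:
        return False  # not enough history to judge a trend
    recent = values[-(window + 1):]
    changes = [later - earlier for earlier, later in zip(recent, recent[1:])]
    if direction == "above":
        return all(change > 0 for change in changes)
    if direction == "below":
        return all(change < 0 for change in changes)
    raise ValueError(f"Unknown direction: {direction!r}")


# Invented monthly failed-settlement rates (%): still under a 0.5% alert
# threshold, but rising for three consecutive months.
failed_settlement_rate = [0.20, 0.25, 0.31, 0.38]
print(is_deteriorating(failed_settlement_rate))
```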

12.4 Third-Party Risk Management

Financial institutions increasingly rely on third parties — cloud providers, RegTech vendors, core banking platforms, data aggregators, cybersecurity providers — for critical functions. This creates third-party risk: the operational, legal, and reputational exposure arising from the performance, security, and resilience of these external providers.

The Regulatory Evolution

Third-party risk management has received increasing regulatory attention:

US: The 2023 Interagency Guidance on Third-Party Relationships (OCC, Fed, FDIC) replaced prior agency-specific guidance with a unified framework covering the full lifecycle of third-party relationships: planning, due diligence, contract negotiation, ongoing monitoring, and termination.

EU (DORA): DORA's third-party pillar requires institutions to:

- Maintain a register of all ICT arrangements (cloud and non-cloud)
- Identify which arrangements support critical or important functions (the European Supervisory Authorities separately designate critical ICT third-party providers, or CTPPs, for direct oversight)
- Ensure contractual provisions meet DORA's mandatory requirements (audit rights, sub-contracting disclosure, business continuity)
- Develop exit strategies for all critical providers

UK: The FCA and PRA have published operational resilience guidance specifically addressing third-party risks, particularly cloud concentration risk — the risk that a failure at a dominant cloud provider (AWS, Microsoft Azure, Google Cloud) could simultaneously impair multiple financial institutions.
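A register of ICT arrangements also makes concentration risk measurable. The sketch below computes, for each provider, the share of critical-function arrangements that depend on it. The record fields and provider names are hypothetical and far simpler than a regulatory register template.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IctArrangement:
    provider: str
    service: str                      # business service the arrangement supports
    supports_critical_function: bool


def provider_concentration(register: list[IctArrangement]) -> dict[str, float]:
    """Share of critical-function arrangements that depend on each provider."""
    critical = [a for a in register if a.supports_critical_function]
    if not critical:
        return {}
    counts = Counter(a.provider for a in critical)
    return {provider: n / len(critical) for provider, n in counts.items()}


# Hypothetical register entries.
register = [
    IctArrangement("Cloud Provider A", "Payments processing", True),
    IctArrangement("Cloud Provider A", "Transaction monitoring", True),
    IctArrangement("Cloud Provider B", "Customer onboarding", True),
    IctArrangement("Cloud Provider B", "Intranet hosting", False),
]
shares = provider_concentration(register)
for provider, share in shares.items():
    print(f"{provider}: {share:.0%} of critical-function arrangements")
```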

The Due Diligence Process

Third-party risk due diligence applies a risk-based assessment:

Initial classification: Is this a critical third party? Critical means: failure of the third party would materially impair the institution's ability to deliver an important business service, meet regulatory obligations, or protect customer data.

Pre-contract due diligence for critical third parties:

- Financial health of the vendor
- Security posture: SOC 2 reports, penetration test results, vulnerability management practices
- Business continuity and disaster recovery capabilities
- Sub-contractor chain (does the vendor rely on other third parties for critical services?)
- Concentration risk: does the institution have alternatives if this provider fails?
- Regulatory compliance: does the provider operate in compliance with relevant regulations in the jurisdictions of service delivery?

Contractual provisions:

- Service level agreements (SLAs) with financial penalties
- Audit rights: the institution's right to audit the vendor's controls
- Incident notification requirements
- Data handling and data residency
- Sub-contracting approval
- Exit assistance: vendor must support orderly transition to an alternative provider
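The due diligence items and contractual provisions above can be tracked as a simple checklist per vendor. This is a minimal sketch: the check and clause labels are hypothetical shorthand for the items listed, and returning nothing for non-critical vendors is a simplification of a risk-based program, which would apply a lighter checklist instead.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the due diligence items and contract clauses above.
DUE_DILIGENCE_CHECKS = (
    "financial_health",
    "security_posture",
    "bcp_dr",
    "subcontractor_chain",
    "concentration_risk",
    "regulatory_compliance",
)
REQUIRED_CLAUSES = (
    "sla_penalties",
    "audit_rights",
    "incident_notification",
    "data_handling",
    "subcontracting_approval",
    "exit_assistance",
)


@dataclass
class Vendor:
    name: str
    critical: bool
    checks_done: set[str] = field(default_factory=set)
    clauses_agreed: set[str] = field(default_factory=set)


def outstanding_items(vendor: Vendor) -> list[str]:
    """List missing checks, then missing clauses, for a critical vendor."""
    if not vendor.critical:
        return []  # simplification: see the note above
    missing = [c for c in DUE_DILIGENCE_CHECKS if c not in vendor.checks_done]
    missing += [c for c in REQUIRED_CLAUSES if c not in vendor.clauses_agreed]
    return missing


vendor = Vendor(
    name="Core banking platform vendor",
    critical=True,
    checks_done=set(DUE_DILIGENCE_CHECKS) - {"security_posture"},
    clauses_agreed=set(REQUIRED_CLAUSES) - {"exit_assistance"},
)
print(outstanding_items(vendor))
```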

12.5 Cybersecurity Risk in Financial Institutions

Cybersecurity risk — increasingly the dominant component of technology operational risk — deserves specific treatment.

The Regulatory Framework

US (NIST Cybersecurity Framework): The National Institute of Standards and Technology Cybersecurity Framework (CSF) — updated to CSF 2.0 in February 2024 — provides the primary voluntary framework for cybersecurity risk management. Six functions: Identify, Protect, Detect, Respond, Recover, Govern.

EU (DORA): DORA's ICT risk management requirements include specific provisions for cybersecurity — network and information security controls, encryption, multi-factor authentication, and incident detection and response.

UK (CBEST and STAR-FS): The Bank of England's CBEST framework (threat intelligence-led ethical red team testing) and the PRA/FCA STAR-FS (Simulated Targeted Attack and Response, Financial Services) program subject UK financial institutions to rigorous adversarial testing.

Cyber Incident Reporting

Financial institutions are subject to increasingly strict cyber incident reporting requirements:

US SEC: Registered public companies must disclose material cybersecurity incidents within 4 business days of determining materiality (effective December 2023)
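EU DORA: Major incidents must be reported to the competent authority: initial notification within 4 hours of classifying the incident as major (and no later than 24 hours after becoming aware of it); intermediate report within 72 hours of the initial notification; final report within one month of the latest intermediate report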
UK: FCA-regulated firms must notify the FCA of significant cybersecurity incidents "as soon as reasonably practicable"
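For incident response runbooks it helps to compute the DORA reporting deadlines mechanically. The sketch below is an illustration, not legal advice. It assumes each report is submitted exactly at its deadline, approximates the final-report window as 30 days, and includes a 24-hour backstop from detection for the initial notification, which reflects my reading of the DORA technical standards and should be verified against the current text.

```python
from datetime import datetime, timedelta


def dora_reporting_deadlines(detected_at: datetime,
                             classified_major_at: datetime) -> dict[str, datetime]:
    """Latest permissible submission times for the three DORA reports."""
    # Initial notification: 4 hours from classification as major,
    # capped at 24 hours from detection (assumed backstop).
    initial = min(classified_major_at + timedelta(hours=4),
                  detected_at + timedelta(hours=24))
    # Intermediate report: 72 hours after the initial notification.
    intermediate = initial + timedelta(hours=72)
    # Final report: approximated here as 30 days after the intermediate report.
    final = intermediate + timedelta(days=30)
    return {"initial": initial, "intermediate": intermediate, "final": final}


# Hypothetical incident: detected at 08:00, classified as major at 10:00.
deadlines = dora_reporting_deadlines(
    detected_at=datetime(2025, 3, 3, 8, 0),
    classified_major_at=datetime(2025, 3, 3, 10, 0),
)
for report, due in deadlines.items():
    print(f"{report}: due {due.isoformat()}")
```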

12.6 Model Risk Management

Model risk is the risk of loss arising from incorrect, inappropriate, or misused models. It is a component of operational risk that has received its own dedicated regulatory framework.

SR 11-7 and the Federal Reserve's Model Risk Framework

The Federal Reserve's Supervisory Guidance SR 11-7 (2011), supplemented by OCC Bulletin 2011-12, remains the dominant global reference for model risk management governance. Its core requirements:

Model inventory: Institutions must maintain a comprehensive inventory of all models, including: model name, purpose, business line, model developer, owner, validation status, and materiality classification.

Conceptual soundness: Models must be based on theoretically sound methodology, appropriate for their intended use, implemented correctly, and regularly tested.

Model validation: Every model must be validated by a party independent of the model development team. Validation includes: conceptual review, outcomes analysis (comparing predicted with actual outcomes, for example through back-testing), benchmarking against alternative models, and sensitivity testing.

Ongoing monitoring: Model performance must be monitored continuously; models showing performance deterioration must be flagged for re-validation.

Model governance: A Model Risk Committee or equivalent governance body with appropriate seniority must oversee the model risk framework.

The SR 11-7 framework, designed originally for credit and market risk models, is increasingly applied to:

- AML/KYC machine learning models
- Fraud detection models
- Sanctions screening ML systems
- Regulatory capital models
- Stress testing models
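A model inventory with the fields described above lends itself to automated checks. The sketch below flags models that were never validated, whose validation is stale, or whose validator was also the developer. The field names, the sample model, and the 365-day revalidation cycle are assumptions; each institution sets its own validation frequency by materiality.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ModelRecord:
    name: str
    purpose: str
    business_line: str
    developer: str
    owner: str
    materiality: str                      # e.g. "high", "medium", "low"
    last_validated: Optional[date] = None
    validator: Optional[str] = None


def validation_issues(model: ModelRecord, as_of: date,
                      max_age_days: int = 365) -> list[str]:
    """Return governance issues for one inventory entry (assumed annual cycle)."""
    if model.last_validated is None:
        return ["never validated"]
    issues = []
    if as_of - model.last_validated > timedelta(days=max_age_days):
        issues.append("validation overdue")
    if model.validator == model.developer:
        issues.append("validator not independent of developer")
    return issues


# Hypothetical inventory entry: an ML model deployed without validation.
tm_model = ModelRecord(
    name="Transaction monitoring ML model",
    purpose="AML alert scoring",
    business_line="Compliance",
    developer="Data science team",
    owner="Head of Financial Crime",
    materiality="high",
)
print(validation_issues(tm_model, as_of=date(2024, 6, 30)))
```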

12.7 Priya's Technology Risk Audit: A Practitioner's Approach

When Priya conducts a technology risk assessment for a new client, she works through a structured framework that covers all major dimensions of operational and technology risk. Her assessment for a mid-size UK challenger bank client illustrates the practitioner approach.

Scope: Priya was engaged to conduct a pre-FCA-examination technology risk assessment. The scope covered all technology systems material to regulatory compliance: core banking platform, KYC/AML systems, sanctions screening, transaction monitoring, regulatory reporting infrastructure, and key third-party relationships.

Assessment methodology:

Documentation review: Priya reviewed existing operational risk frameworks, IT policies, and governance documentation. Key gaps identified: the incident management policy had not been updated since 2019; there was no formal model inventory for the three ML-based compliance systems deployed in 2021-2022.
Interview program: Interviews with CTO, CISO, CCO, Head of Operations, and key system owners. Focus: understanding the actual operational practice, not just what the policies said.
Technical walkthrough: For each critical system, Priya reviewed the last 12 months of system performance data, incident logs, and vendor service reports.
Third-party assessment: Priya mapped all critical third-party relationships and assessed the adequacy of each vendor's contract, monitoring, and exit strategy.
Gap report: A structured report identifying gaps against regulatory expectations (FCA operational resilience guidance, PRA expectations, DORA requirements for EU-connected operations), with risk-rated remediation recommendations.

Key findings:

- The KYC automation system had experienced two significant outages in 12 months; neither had been reported to the FCA despite meeting the reporting threshold
- No formal exit strategy existed for the core banking platform vendor
- The ML-based transaction monitoring model had not been validated since deployment, a clear gap against SR 11-7-style model validation expectations
- Third-party security assessments had not been conducted for 4 of 8 critical vendors

Priya's memo to the board after delivering the report: "The good news is that none of these gaps is unique — every institution I review has some version of most of them. The bad news is that the FCA is now specifically looking for model validation documentation and DORA-readiness, and this institution doesn't have either. The remediation program is achievable in 6 months if it's prioritized."

Chapter Summary

Operational risk — the risk of loss from failed processes, people, systems, or external events — has evolved from a residual category to a central regulatory and management concern, particularly as financial institutions have become technology-dependent.

The Basel framework categorizes and provides capital treatment for operational risk; the Standardized Measurement Approach (SMA) under Basel IV brings greater consistency and uses historical loss data as a key input.

Technology risk, once a sub-category of operational risk, now has dedicated regulatory frameworks: DORA (EU) is the most comprehensive, establishing ICT risk management, incident reporting, resilience testing, and third-party risk requirements as primary regulatory obligations.

The ORM framework — RCSA, loss data collection, scenario analysis, and KRI monitoring — provides the management infrastructure for operational risk identification, assessment, and oversight.

Third-party risk management has expanded significantly as institutions outsource critical functions to cloud providers and RegTech vendors; regulatory requirements now cover the full lifecycle of critical third-party relationships.

Model risk management — governed in the US by SR 11-7 — applies to all models including the ML-based compliance systems at the heart of modern RegTech programs.
