Case Study 29.2: Multi-Site Replication for Global Operations

Background

Meridian National Bank has grown through acquisition and now operates in three regions:

  • Americas (New York): Core banking, 15 million customers, primary data center
  • Europe (London): Acquired European retail bank, 8 million customers
  • Asia-Pacific (Singapore): New market entry, 2 million customers and growing rapidly

Each region has its own DB2 LUW database serving local operations. However, the bank's executive team has mandated a unified global platform:

  1. Customer data must be accessible from any region
  2. Regulatory compliance requires data residency (EU data stays in EU)
  3. If any region's data center goes offline, the other two must absorb its workload
  4. Real-time analytics must cover all three regions

The chief architect, Dr. Yuki Tanaka, leads the design of a multi-site replication strategy.

Challenge

Design and implement a replication architecture that:

  • Provides low-latency read access to global customer data from any region
  • Maintains data residency compliance (GDPR for Europe, MAS for Singapore)
  • Enables regional failover with RPO < 10 seconds and RTO < 30 minutes
  • Feeds a global analytics platform with real-time data from all regions
  • Handles conflict resolution for shared reference data

Architecture Design

Tier 1: Regional HADR (Within Each Data Center)

Each region implements HADR for local high availability:

Region         Primary    Standby    Sync Mode   Purpose
Americas       NY-PROD    NY-STBY    NEARSYNC    Local HA + reads-on-standby
Europe         LDN-PROD   LDN-STBY   NEARSYNC    Local HA + reads-on-standby
Asia-Pacific   SG-PROD    SG-STBY    NEARSYNC    Local HA + reads-on-standby

Tier 2: Q Replication (Cross-Region)

Q Replication provides bidirectional replication of shared reference data and unidirectional replication of regional transactional data:

           ┌────────────────┐
           │   New York     │
           │   (Americas)   │
           │   NY-PROD      │
           └───┬──────┬─────┘
               │      │
    Q Rep      │      │     Q Rep
    (bidir)    │      │     (bidir)
               │      │
     ┌─────────┘      └──────────┐
     │                           │
     ▼                           ▼
┌────────────────┐      ┌────────────────┐
│    London      │      │   Singapore    │
│   (Europe)     │◄────►│  (Asia-Pac)    │
│   LDN-PROD     │Q Rep │   SG-PROD      │
└────────────────┘(bidir)└────────────────┘

Bidirectional Tables (Reference Data)

These tables are replicated bidirectionally between all three regions:

Table              Description                 Conflict Strategy
EXCHANGE_RATES     Currency exchange rates     Timestamp wins (source of truth: NY)
PRODUCT_CATALOG    Banking products            Source wins (products defined centrally)
BRANCH_DIRECTORY   Global branch information   Region of origin wins
COMPLIANCE_RULES   Regulatory rules            Source wins (compliance team in NY)
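
In code form, these strategies amount to a small dispatch table. The sketch below is illustrative only; none of these function names are Q Replication interfaces, and real conflict handling is configured on the subscription itself.

```python
# Illustrative dispatch of the per-table conflict strategies above.
# These helpers are NOT Q Replication APIs; they just restate the rules.

def timestamp_wins(source, target):
    # EXCHANGE_RATES: NY is authoritative; otherwise the newest update wins.
    if source["region"] == "NY":
        return source
    if target["region"] == "NY":
        return target
    return source if source["ts"] > target["ts"] else target

def source_wins(source, target):
    # PRODUCT_CATALOG / COMPLIANCE_RULES: the replicating source wins.
    return source

def origin_region_wins(source, target):
    # BRANCH_DIRECTORY: the update made in the row's region of origin wins.
    return source if source["region"] == source["origin"] else target

STRATEGY = {
    "EXCHANGE_RATES": timestamp_wins,
    "PRODUCT_CATALOG": source_wins,
    "BRANCH_DIRECTORY": origin_region_wins,
    "COMPLIANCE_RULES": source_wins,
}

def resolve(table, source, target):
    return STRATEGY[table](source, target)
```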

Unidirectional Tables (Regional Data)

Regional transactional data is replicated outward to a global read-only copy:

Source     Tables                          Targets                      Purpose
NY-PROD    ACCOUNTS_US, TRANSACTIONS_US    LDN-READONLY, SG-READONLY    Global visibility
LDN-PROD   ACCOUNTS_EU, TRANSACTIONS_EU    NY-READONLY, SG-READONLY     Global visibility
SG-PROD    ACCOUNTS_AP, TRANSACTIONS_AP    NY-READONLY, LDN-READONLY    Global visibility

The READONLY databases are separate DB2 instances that receive replicated data for query purposes. They do not participate in the HADR configuration.

Tier 3: CDC to Kafka (Global Analytics)

Each region streams changes via CDC to a regional Kafka cluster. A Kafka MirrorMaker 2 setup consolidates all three regional Kafka clusters into a global analytics Kafka cluster in New York:

NY-PROD ──CDC──► Kafka-NY ────┐
                               │
LDN-PROD ──CDC──► Kafka-LDN ──┼──MirrorMaker──► Kafka-Global
                               │                    │
SG-PROD ──CDC──► Kafka-SG ────┘                    │
                                                    ▼
                                          Global Analytics Platform
                                          (Db2 Warehouse on Cloud)

Implementation

Phase 1: Regional HADR (Month 1)

Each region implements HADR independently. This is straightforward — the same pattern as Case Study 29.1, replicated three times.

Key decisions:

  • NEARSYNC mode for all regions (standbys are in the same data center)
  • Pacemaker for automatic failover in Americas and Asia-Pacific
  • PowerHA for automatic failover in Europe (AIX-based infrastructure)

Phase 2: Q Replication for Reference Data (Months 2-3)

MQ Infrastructure

Dr. Tanaka deploys IBM MQ queue managers at each site with channels connecting all three:

Queue Managers:
  QMNY01 (New York)
  QMLDN01 (London)
  QMSG01 (Singapore)

Channels:
  QMNY01 ←→ QMLDN01  (London-NY dedicated fiber: 70 ms RTT)
  QMNY01 ←→ QMSG01   (NY-Singapore via submarine cable: 240 ms RTT)
  QMLDN01 ←→ QMSG01  (London-Singapore: 170 ms RTT)

All channels use TLS 1.3 encryption and MQ message authentication.
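
These RTTs put a hard floor under cross-region replication latency: a committed change must cross the link at least once before it can be applied. A back-of-envelope check (the 50 ms processing figure is an assumption, not a measurement):

```python
# Latency floor per link: one-way network latency (~RTT/2) plus an
# ASSUMED 50 ms of capture + MQ put/get + apply processing overhead.
RTT_MS = {"NY-LDN": 70, "NY-SG": 240, "LDN-SG": 170}
PROCESSING_MS = 50

def latency_floor_ms(link):
    return RTT_MS[link] / 2 + PROCESSING_MS
```

The Q Replication latencies reported later under Performance Metrics sit at or above these floors, as expected for asynchronous replication.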

Bidirectional Replication Setup

For the EXCHANGE_RATES table (bidirectional, NY-London):

-- On NY-PROD: Q Capture subscription
INSERT INTO ASN.IBMQREP_SUBS (
    SUBNAME, SOURCE_OWNER, SOURCE_NAME,
    TARGET_OWNER, TARGET_NAME,
    SENDQ, RECVQ,
    SUB_TYPE, CONFLICT_RULE
) VALUES (
    'EXRATE_NY_TO_LDN', 'MERIDIAN', 'EXCHANGE_RATES',
    'MERIDIAN', 'EXCHANGE_RATES',
    'NY_TO_LDN_Q', 'LDN_FROM_NY_Q',
    'B', 'C'  -- B=Bidirectional, C=Custom conflict resolution
);

Conflict Resolution Procedure

For EXCHANGE_RATES, a custom stored procedure resolves conflicts:

CREATE PROCEDURE MERIDIAN.RESOLVE_EXRATE_CONFLICT (
    IN p_source_region VARCHAR(10),
    IN p_target_region VARCHAR(10),
    IN p_source_timestamp TIMESTAMP,
    IN p_target_timestamp TIMESTAMP,
    IN p_source_rate DECIMAL(15,6),
    IN p_target_rate DECIMAL(15,6),
    OUT p_winner VARCHAR(10)
)
BEGIN
    -- New York is the authoritative source for exchange rates.
    -- If NY is the source, NY always wins.
    -- Otherwise, the most recent update wins.
    IF p_source_region = 'NY' THEN
        SET p_winner = 'SOURCE';
    ELSEIF p_target_region = 'NY' THEN
        SET p_winner = 'TARGET';
    ELSEIF p_source_timestamp > p_target_timestamp THEN
        SET p_winner = 'SOURCE';
    ELSE
        SET p_winner = 'TARGET';
    END IF;
END;
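
Because the procedure's decision logic is pure, it can be mirrored and unit-tested outside the database. The sketch below restates the same rules in Python; it is not the deployed procedure.

```python
def resolve_exrate_conflict(source_region, target_region, source_ts, target_ts):
    # Same rules as MERIDIAN.RESOLVE_EXRATE_CONFLICT: NY always wins;
    # otherwise the most recent update wins (ties go to the target).
    if source_region == "NY":
        return "SOURCE"
    if target_region == "NY":
        return "TARGET"
    return "SOURCE" if source_ts > target_ts else "TARGET"
```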

Phase 3: CDC to Kafka (Month 4)

Each region deploys a CDC capture engine reading the local DB2 transaction log:

Kafka Topic Structure:

Region-specific topics:
  meridian.ny.accounts
  meridian.ny.transactions
  meridian.ldn.accounts
  meridian.ldn.transactions
  meridian.sg.accounts
  meridian.sg.transactions

Global topics (after MirrorMaker):
  meridian.global.accounts      (merged from all regions)
  meridian.global.transactions  (merged from all regions)
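
One wrinkle worth noting: MirrorMaker 2's default ReplicationPolicy prefixes replicated topics with the source cluster alias rather than merging them, so producing the meridian.global.* topics implies a custom replication policy or a downstream re-routing step. A sketch of the implied topic mapping (assumed naming convention, not MirrorMaker configuration):

```python
import re

# Route a region-specific topic to its merged global topic. The naming
# convention is taken from the topic structure above; this is a sketch,
# not MirrorMaker 2 configuration.
REGION_TOPIC = re.compile(r"^meridian\.(ny|ldn|sg)\.(accounts|transactions)$")

def global_topic(regional_topic):
    m = REGION_TOPIC.match(regional_topic)
    if m is None:
        raise ValueError(f"not a replicated regional topic: {regional_topic}")
    return f"meridian.global.{m.group(2)}"
```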

Message Format (Avro):

{
  "operation": "UPDATE",
  "region": "NY",
  "table": "ACCOUNTS",
  "timestamp": "2026-03-15T14:30:00.123Z",
  "transaction_id": "0x00000A21F3400000",
  "before": {
    "account_id": "US-1234567",
    "balance": 52340.00,
    "last_updated": "2026-03-15T14:29:55.000Z"
  },
  "after": {
    "account_id": "US-1234567",
    "balance": 51840.00,
    "last_updated": "2026-03-15T14:30:00.123Z"
  }
}
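
A downstream consumer can fold these events into a keyed view. A minimal sketch, assuming the field names shown above and events already parsed from Avro into dicts:

```python
def apply_cdc_event(state, event):
    # Fold one change event into a keyed in-memory view:
    # state maps account_id -> latest row image.
    op = event["operation"]
    if op in ("INSERT", "UPDATE"):
        row = event["after"]
        state[row["account_id"]] = row
    elif op == "DELETE":
        state.pop(event["before"]["account_id"], None)
    return state
```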

Testing: Regional Failover

Scenario: London Data Center Goes Offline

Dr. Tanaka conducts a full disaster recovery test simulating a London data center failure.

Preparation:

  • European customers' traffic is served by LDN-PROD via a European load balancer
  • Q Replication is active between all three regions
  • The global analytics Kafka pipeline is operational

Execution:

  1. T+0 seconds: London data center simulated offline (network cut)
  2. T+5 seconds: MQ channels to London disconnect. Q Capture/Apply queues begin buffering.
  3. T+10 seconds: Monitoring alerts fire: "London region unreachable"
  4. T+2 minutes: DR declared. Operations team initiates failover.
  5. T+3 minutes: DNS updated to route European customer traffic to NY-PROD
  6. T+5 minutes: NY-PROD's READONLY copy of European data is promoted to a writable database (Q Replication's stored copy)
  7. T+8 minutes: Application servers in the European edge locations reconnect to NY
  8. T+10 minutes: European customers resume banking operations via NY
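
The pass/fail criteria for such a test reduce to two comparisons against the targets from the Challenge section. A trivial runbook check (illustrative only) might look like:

```python
def dr_test_passed(replication_lag_s, restore_minutes,
                   rpo_target_s=10, rto_target_min=30):
    # RPO = replication lag at the moment of failure;
    # RTO = elapsed time until service is restored.
    return replication_lag_s < rpo_target_s and restore_minutes < rto_target_min
```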

Results:

  • RPO: 6 seconds (the Q Replication lag at the time of failure)
  • RTO: 10 minutes (from failure to service restoration)
  • 847 European transactions in the Q Replication queue were lost (within the 10-second RPO tolerance)
  • All other transactions were preserved

Recovery (London Restored):

When London comes back online:

  1. The HADR standby (LDN-STBY) is promoted to primary
  2. Q Replication resumes, re-synchronizing London with NY's changes from the outage period
  3. Conflict resolution handles any dual-region updates made during the failover period
  4. European traffic is gradually migrated back to London
  5. The original LDN-PROD is rebuilt as the new standby

Data Residency Compliance

GDPR Compliance (Europe)

European customer PII (Personally Identifiable Information) resides on LDN-PROD and is replicated only to:

  • LDN-STBY (same data center, same jurisdiction) via HADR
  • NY-READONLY (anonymized/pseudonymized subset only)

The Q Replication subscription for EU data to NY applies a transformation:

-- Column mapping in Q Apply subscription
-- Personal data is hashed before replication to NY
SOURCE_COL          TARGET_COL          TRANSFORMATION
CUSTOMER_NAME       CUSTOMER_NAME_HASH  SHA256(CUSTOMER_NAME)
EMAIL               EMAIL_HASH          SHA256(EMAIL)
PHONE               PHONE_HASH          SHA256(PHONE)
ADDRESS             COUNTRY_CODE        SUBSTR(ADDRESS, 1, 2)
ACCOUNT_BALANCE     ACCOUNT_BALANCE     (no transformation)
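
The same mapping can be expressed in a few lines. SHA256 here is Python's hashlib standing in for whatever expression Q Apply actually evaluates; treat this as a sketch of the transformation, not the subscription configuration.

```python
import hashlib

def sha256_hex(value):
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def pseudonymize_eu_row(row):
    # Mirrors the column-mapping table above: PII is hashed, the address
    # is reduced to a country code, and the balance passes through.
    return {
        "CUSTOMER_NAME_HASH": sha256_hex(row["CUSTOMER_NAME"]),
        "EMAIL_HASH": sha256_hex(row["EMAIL"]),
        "PHONE_HASH": sha256_hex(row["PHONE"]),
        "COUNTRY_CODE": row["ADDRESS"][:2],  # SUBSTR(ADDRESS, 1, 2)
        "ACCOUNT_BALANCE": row["ACCOUNT_BALANCE"],
    }
```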

MAS Compliance (Singapore)

The Monetary Authority of Singapore (MAS) requires that customer data for Singapore-regulated accounts remain in Singapore. Similar transformations apply for SG-to-NY and SG-to-LDN replication.

Performance Metrics

After three months of production operation:

Metric                               Americas   Europe    Asia-Pacific
HADR log gap (avg)                   0 bytes    0 bytes   0 bytes
HADR log gap (max)                   12 KB      8 KB      15 KB
Q Rep latency (to nearest region)    85 ms      95 ms     190 ms
Q Rep latency (to farthest region)   260 ms     210 ms    275 ms
Q Rep conflict rate                  0.003%     0.002%    0.001%
CDC-to-Kafka latency                 <1 sec     <1 sec    <2 sec
Global analytics freshness           <5 sec     <5 sec    <5 sec

Lessons Learned

  1. Network latency is the dominant factor in cross-region replication. The NY-Singapore link (240 ms RTT) makes synchronous replication impossible. Q Replication's asynchronous nature is essential for cross-continent scenarios.

  2. Conflict resolution requires deep domain knowledge. The initial "timestamp wins" strategy for all tables caused subtle issues with exchange rates, where an older NY value should always override a newer London value. Custom conflict resolution procedures were necessary.

  3. Data residency complicates the architecture significantly. About 30% of the design effort went into ensuring PII was properly handled during cross-region replication. Automated compliance checks validate that no unmasked PII reaches unauthorized regions.

  4. MQ channel reliability matters. Two MQ channel failures during the first month caused Q Replication to fall behind by several minutes. Implementing MQ channel monitoring with automatic restart and redundant channels resolved this.

  5. Global analytics requires schema harmonization. Each region had slightly different column names and data types (a legacy of the European acquisition). Significant effort went into creating a unified schema for the global analytics platform.

Discussion Questions

  1. Why was Q Replication chosen over HADR ASYNC for cross-region replication? What advantages does Q Replication offer for this use case?

  2. The conflict rate is very low (0.003%). Under what circumstances could it spike? What business process changes might reduce conflicts to zero?

  3. If Meridian acquires a bank in Brazil, what changes would be needed to extend this architecture to a fourth region?

  4. The CDC-to-Kafka pipeline feeds a centralized analytics platform in New York. What are the data sovereignty implications? How might a federated analytics approach work instead?

  5. Calculate the total cost of this architecture in terms of servers, licenses, and network bandwidth. Compare it to a single-site architecture with cross-region backup shipping. What availability improvement justifies the additional cost?