Case Study 31.2: Hybrid Architecture — z/OS Core + Cloud Digital

Background

Continental Trust Corporation is a large commercial bank with $85 billion in assets, 4 million customer accounts, and operations across 15 states. The bank's core banking system has run on DB2 for z/OS for 28 years. The mainframe processes 45 million transactions daily through CICS online applications and batch processing.

The Business Imperative

Continental's board has mandated the launch of a digital banking platform — mobile app and web portal — within 12 months. The platform must:

  1. Support 2 million mobile users with sub-second API response times.
  2. Display real-time account balances and transaction history.
  3. Enable fund transfers, bill payments, and mobile check deposits.
  4. Provide personalized product recommendations based on transaction patterns.
  5. Scale elastically during peak periods (Monday mornings, payroll days, tax season).

The Constraint

The CTO has made it clear: "The core ledger stays on the mainframe. It processes 45 million transactions daily with five-nines reliability. We are not re-platforming it."

This constraint shapes the entire architecture: the digital platform must be built on cloud infrastructure but tightly integrated with the z/OS core.

Architecture Design

Three-Tier Hybrid Model

Tier 1: z/OS Core (Existing — No Changes)
    DB2 for z/OS v13
    - Account master (4M accounts)
    - Transaction processing (45M daily)
    - General ledger
    - CICS online programs
    - Batch processing (nightly cycle)

Tier 2: Cloud Digital Platform (New)
    Db2 on Cloud Enterprise HA (IBM Cloud, Dallas)
    - Customer digital profiles
    - Session management
    - Notification preferences
    - Mobile-specific data (device tokens, biometric enrollment)
    - Transaction cache (replicated from z/OS)

Tier 3: Cloud Analytics (New)
    Db2 Warehouse on Cloud (IBM Cloud, Dallas)
    - Customer 360 views
    - Product recommendation engine data
    - Fraud scoring models
    - Regulatory analytics

Data Flow Design

The architecture uses three data movement patterns:

Pattern 1: CDC Replication (z/OS to Cloud)

Account balances and transaction history are replicated from z/OS to Db2 on Cloud in near-real-time using IBM Data Replication:

  • Source: DB2 z/OS ACCOUNT_MASTER and TRANSACTION_HISTORY tables.
  • Target: Db2 on Cloud account_cache and transaction_cache tables.
  • Latency: 2-5 seconds under normal load; up to 15 seconds during peak batch processing.
  • Volume: Approximately 45 million change rows per day.

The mobile app reads from the cloud cache for display purposes. Each cached row carries a replicated_ts timestamp so the app can display "Balance as of [time]" to set user expectations.
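
That freshness banner can be computed directly from the replication timestamp. A minimal sketch in Python (the function name, staleness threshold, and formatting are illustrative, not Continental's app code):

```python
from datetime import datetime

def balance_label(balance: float, replicated_ts: datetime,
                  now: datetime, stale_after_secs: int = 30) -> str:
    """Format a cached balance for display, flagging it when replication
    lag has made the cached value older than the staleness threshold."""
    age_secs = (now - replicated_ts).total_seconds()
    label = f"Balance as of {replicated_ts:%H:%M:%S}: ${balance:,.2f}"
    if age_secs > stale_after_secs:
        label += " (may be out of date)"
    return label
```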

Pattern 2: API-Mediated Writes (Cloud to z/OS)

When a customer initiates a fund transfer, bill payment, or other financial transaction through the mobile app:

  1. The mobile API validates the request in the cloud.
  2. The API calls a z/OS-hosted REST service (via z/OS Connect EE).
  3. The z/OS service executes the CICS transaction, which updates DB2 z/OS.
  4. CDC replicates the change back to the cloud cache within seconds.
  5. The mobile app polls the cloud cache until the updated balance appears.

Mobile App → Cloud API → z/OS Connect EE → CICS → DB2 z/OS
                                                      |
                                                      v (CDC)
Mobile App ← Cloud API ← Db2 on Cloud (cache updated)
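
Step 5's polling loop needs a bounded wait, since CDC lag can reach 15 seconds during batch processing. A sketch of the pattern (fetch_cached_balance and the timing constants are illustrative assumptions, not the production data access layer):

```python
import time

def poll_for_update(fetch_cached_balance, account_id: int,
                    submitted_ts: float, timeout_secs: float = 20.0,
                    interval_secs: float = 0.5):
    """Poll the cloud cache until a row replicated after the transfer was
    submitted appears, or give up at the timeout.

    fetch_cached_balance(account_id) -> (balance, replicated_ts) stands in
    for a read against continental.account_cache."""
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        balance, replicated_ts = fetch_cached_balance(account_id)
        if replicated_ts > submitted_ts:  # CDC has applied the change
            return balance
        time.sleep(interval_secs)
    return None  # caller falls back to a synchronous z/OS balance read
```

Returning None instead of blocking indefinitely keeps the mobile request from hanging when replication lags; the caller can then fetch the authoritative balance from z/OS directly.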

Pattern 3: Batch ETL (z/OS + Cloud to Warehouse)

Nightly batch ETL feeds the analytics warehouse:

  1. Extract account and transaction data from z/OS (UNLOAD utility).
  2. Extract digital profile data from Db2 on Cloud (EXPORT).
  3. Transform and load into Db2 Warehouse on Cloud (LOAD).
  4. Refresh materialized query tables for the recommendation engine.

Cloud Database Schema

Db2 on Cloud — Digital Platform Schema:

-- Customer digital profile (cloud-native data)
CREATE TABLE continental.customer_digital_profile (
    customer_id         BIGINT       NOT NULL PRIMARY KEY,
    email               VARCHAR(255),
    mobile_phone        VARCHAR(20),
    preferred_channel   CHAR(3)      CHECK (preferred_channel IN ('MOB','WEB','ALL')),
    notification_prefs  VARCHAR(500), -- JSON format
    mfa_method          CHAR(4)      CHECK (mfa_method IN ('SMS','TOTP','PUSH','NONE')),
    biometric_enrolled  BOOLEAN      DEFAULT FALSE,
    device_tokens       VARCHAR(2000), -- JSON array
    last_login_ts       TIMESTAMP,
    created_ts          TIMESTAMP    DEFAULT CURRENT_TIMESTAMP,
    updated_ts          TIMESTAMP    DEFAULT CURRENT_TIMESTAMP
);

-- Account cache (replicated from z/OS)
CREATE TABLE continental.account_cache (
    account_id          BIGINT       NOT NULL,
    customer_id         BIGINT       NOT NULL,
    account_type        CHAR(3),
    account_status      CHAR(1),
    current_balance     DECIMAL(15,2),
    available_balance   DECIMAL(15,2),
    last_activity_date  DATE,
    replicated_ts       TIMESTAMP    NOT NULL,
    PRIMARY KEY (account_id)
);

-- Transaction cache (replicated from z/OS, last 90 days)
CREATE TABLE continental.transaction_cache (
    trans_id            BIGINT       NOT NULL,
    account_id          BIGINT       NOT NULL,
    trans_date          DATE         NOT NULL,
    trans_type          CHAR(3),
    amount              DECIMAL(15,2),
    description         VARCHAR(200),
    replicated_ts       TIMESTAMP    NOT NULL,
    PRIMARY KEY (trans_date, trans_id)
)
PARTITION BY RANGE (trans_date)
(
    -- Db2 requires literal partition bounds (expressions such as
    -- CURRENT DATE are not allowed). The example boundaries below assume
    -- a June 2024 roll; a nightly job advances them with
    -- ALTER TABLE ... ATTACH/DETACH PARTITION.
    PARTITION p_archive       STARTING ('2024-04-01') ENDING ('2024-05-01') EXCLUSIVE,
    PARTITION p_prev_month    STARTING ('2024-05-01') ENDING ('2024-06-01') EXCLUSIVE,
    PARTITION p_current_month STARTING ('2024-06-01') ENDING ('2024-07-01') EXCLUSIVE
);
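
One operational consequence of this design: Db2 range-partition bounds must be literal values, so the rolling 90-day window has to be maintained by a scheduled job that recomputes the boundaries and rolls partitions with ALTER TABLE ... ATTACH/DETACH PARTITION. A sketch of the boundary computation, assuming calendar-month windows (names and layout are illustrative):

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    """First day of the month n months away from d."""
    y, m = divmod(d.year * 12 + d.month - 1 + n, 12)
    return date(y, m + 1, 1)

def cache_partitions(today: date):
    """Boundary literals for the three rolling calendar-month partitions
    of transaction_cache (about 90 days of retention). A maintenance job
    would turn these into ATTACH/DETACH PARTITION statements."""
    cur = today.replace(day=1)
    return [
        ("p_archive",       add_months(cur, -2), add_months(cur, -1)),
        ("p_prev_month",    add_months(cur, -1), cur),
        ("p_current_month", cur,                 add_months(cur, 1)),
    ]
```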

-- Session management (cloud-native data)
CREATE TABLE continental.user_sessions (
    session_id          VARCHAR(64)  NOT NULL PRIMARY KEY,
    customer_id         BIGINT       NOT NULL,
    device_type         VARCHAR(20),
    ip_address          VARCHAR(45),
    login_ts            TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_activity_ts    TIMESTAMP,
    expiry_ts           TIMESTAMP,
    is_active           BOOLEAN      DEFAULT TRUE
);

Federation Configuration

For ad-hoc queries that need to join z/OS and cloud data:

-- Federation server on the cloud Db2 instance
CREATE WRAPPER drda_wrapper LIBRARY 'libdb2drda.so';

CREATE SERVER zos_core
    TYPE DB2/ZOS VERSION 13
    WRAPPER drda_wrapper
    OPTIONS (DBNAME 'DSNP', NODE 'ZOSNODE');

CREATE USER MAPPING FOR continental_dba
    SERVER zos_core
    OPTIONS (REMOTE_AUTHID 'CONTDBA', REMOTE_PASSWORD '***');

-- Nicknames for z/OS tables
CREATE NICKNAME continental.zos_account_master
    FOR zos_core."CONTINENTAL"."ACCOUNT_MASTER";

CREATE NICKNAME continental.zos_gl_summary
    FOR zos_core."CONTINENTAL"."GL_DAILY_SUMMARY";

-- Example federated query: Find customers eligible for premium upgrade
SELECT a.CUSTOMER_ID,
       a.CUSTOMER_NAME,
       a.TOTAL_RELATIONSHIP_VALUE,
       p.preferred_channel,
       p.last_login_ts
FROM continental.zos_account_master a
JOIN continental.customer_digital_profile p
    ON a.CUSTOMER_ID = p.customer_id
WHERE a.TOTAL_RELATIONSHIP_VALUE > 250000
  AND a.ACCOUNT_STATUS = 'A'
  AND p.last_login_ts > CURRENT_TIMESTAMP - 30 DAYS;

Security Architecture

Network Security

On-Premises z/OS Data Center
    |
    | IBM Direct Link (10 Gbps, encrypted)
    | Dedicated physical circuit — no internet
    |
IBM Cloud VPC (Dallas)
    |
    ├── Private Endpoint (Db2 on Cloud)
    │   No public IP — accessible only from VPC
    |
    ├── Private Endpoint (Db2 Warehouse)
    │   No public IP — accessible only from VPC
    |
    └── Application Subnet (Kubernetes)
        - Mobile API pods
        - Internal load balancer
        - Egress only through Cloud Internet Services (CDN + WAF)

Data Security

  • Encryption at rest: AES-256 with BYOK. Keys stored in Hyper Protect Crypto Services (FIPS 140-2 Level 4).
  • Encryption in transit: TLS 1.3 for all connections. Mutual TLS (mTLS) between the API layer and Db2.
  • Column-level encryption: PII columns (email, phone, SSN) are additionally encrypted at the column level using Db2's ENCRYPT() scalar function with a column-specific key.

-- Column-level encryption for PII
CREATE TABLE continental.customer_pii (
    customer_id     BIGINT NOT NULL PRIMARY KEY,
    ssn_encrypted   VARCHAR(128) FOR BIT DATA,
    email_encrypted VARCHAR(512) FOR BIT DATA,
    phone_encrypted VARCHAR(128) FOR BIT DATA
);

-- Insert with encryption
INSERT INTO continental.customer_pii VALUES (
    12345,
    ENCRYPT('123-45-6789', 'column-key-ssn'),
    ENCRYPT('customer@email.com', 'column-key-email'),
    ENCRYPT('555-0100', 'column-key-phone')
);

Access Control

Mobile API Service (scope: production)
    - Cloud DB: SELECT, INSERT, UPDATE on digital profile and session tables; SELECT on cache tables.
    - z/OS: read-only via the z/OS Connect API.

CDC Replication Agent (scope: replication)
    - Cloud DB: INSERT, UPDATE, DELETE on cache tables.
    - z/OS: read (log capture) on source tables.

Analytics ETL (scope: nightly batch)
    - Cloud DB: SELECT on all cloud tables.
    - z/OS: SELECT on source tables.

DBA (scope: administration)
    - Cloud DB: full DBADM.
    - z/OS: full SYSADM.

Auditor (scope: compliance)
    - Cloud DB: SELECT on audit views only.
    - z/OS: SELECT on audit views only.

Performance Results

Mobile API Response Times

Operation                   Target     Actual (P95)   Method
View balance                < 200 ms   85 ms          Read from cloud cache
View recent transactions    < 300 ms   180 ms         Read from cloud cache (partitioned)
Fund transfer (initiate)    < 500 ms   320 ms         Cloud validation + z/OS API call
Fund transfer (confirm)     < 2 sec    1.1 sec        z/OS processing + CDC replication
Digital profile update      < 200 ms   45 ms          Direct cloud write

Replication Metrics

Metric                    Normal     Peak (Payroll)   Batch Window
CDC latency               2.1 sec    8.4 sec          14.7 sec
Rows replicated/sec       520        1,850            3,200
Network bandwidth used    12 Mbps    45 Mbps          78 Mbps

Scaling Events

During the first tax season (April), the mobile app saw a 340% surge in concurrent users:

  • Before auto-scaling: Db2 on Cloud with 16 vCPUs, 64 GB RAM.
  • During peak: Scaled to 32 vCPUs, 128 GB RAM (via IBM Cloud CLI script triggered by monitoring alert).
  • Scaling duration: 8 minutes (no downtime — online vertical scaling).
  • After peak: Scaled back to baseline configuration.

Cost Profile

Monthly Costs

Component                                                    Monthly Cost
Db2 on Cloud Enterprise HA (3-year reserved, 16 vCPU base)   $9,200
Db2 Warehouse on Cloud (reserved)                            $4,800
IBM Direct Link 10 Gbps                                      $5,500
Cloud Object Storage (analytics staging)                     $150
Key Protect / HPCS                                           $600
Kubernetes cluster (API layer)                               $3,200
Data transfer (CDC + ETL)                                    $280
Monthly total                                                $23,730
Annual total                                                 $284,760

Comparison with Full On-Premises Alternative

Building the equivalent digital platform entirely on-premises would have required:

  • New LUW server cluster: $400,000 (hardware + licenses).
  • Network infrastructure upgrades: $120,000.
  • Additional DBA headcount: $160,000/year.
  • Data center expansion: $80,000.
  • Total first-year: $760,000.
  • Annual ongoing: $320,000.

The cloud approach saves $35,000 annually and avoids $480,000 in capital expenditure.

Challenges and Resolutions

Challenge 1: CDC Latency During z/OS Batch Window

The nightly z/OS batch cycle generates 120 million log records in 4 hours. CDC replication lag spiked to 45 seconds during this period.

Resolution:

  • Increased CDC apply parallelism from 4 to 16 threads.
  • Configured CDC to use "batch apply" mode during the known batch window (22:00-02:00), buffering changes and applying them in larger batches for higher throughput.
  • Reduced peak lag from 45 seconds to 15 seconds.

Challenge 2: z/OS Connect Timeout Under Load

The z/OS Connect REST APIs that mediate fund transfers timed out under high load (>500 concurrent requests), returning HTTP 504 errors to the mobile app.

Resolution:

  • Increased the z/OS Connect Liberty server thread pool from 50 to 200.
  • Implemented a circuit breaker in the cloud API layer: when z/OS Connect errors exceed 5% in a 10-second window, the circuit opens and returns a "please try again" message to users instead of queuing requests.
  • Added a retry queue (IBM MQ) for failed transfer requests, with automatic retry after 30 seconds.
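
The circuit-breaker rule in this resolution (open when errors exceed 5% of calls in a 10-second window, then probe again after a cooldown) can be sketched as follows; the class and its parameters are illustrative, not the production implementation:

```python
import collections
import time

class CircuitBreaker:
    """Open the circuit when errors exceed a threshold fraction of the
    calls seen in a sliding time window: here, >5% errors in 10 seconds."""

    def __init__(self, error_rate=0.05, window_secs=10.0,
                 cooldown_secs=30.0, clock=time.monotonic):
        self.error_rate = error_rate
        self.window_secs = window_secs
        self.cooldown_secs = cooldown_secs
        self.clock = clock
        self.events = collections.deque()  # (timestamp, is_error)
        self.opened_at = None

    def record(self, is_error: bool) -> None:
        """Record one call outcome and re-evaluate the error rate."""
        now = self.clock()
        self.events.append((now, is_error))
        while self.events and self.events[0][0] < now - self.window_secs:
            self.events.popleft()  # drop outcomes older than the window
        errors = sum(1 for _, e in self.events if e)
        if errors / len(self.events) > self.error_rate:
            self.opened_at = now

    def allow_request(self) -> bool:
        """False while open: the API returns 'please try again' instead
        of queuing the request behind a struggling z/OS Connect server."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_secs:
            self.opened_at = None   # half-open: let traffic probe again
            self.events.clear()
            return True
        return False
```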

Challenge 3: Data Consistency Between Cache and Core

Customers occasionally saw "stale" balances in the mobile app because the CDC replication lag meant the cloud cache was 2-15 seconds behind z/OS.

Resolution:

  • Added a replicated_ts column to cache tables.
  • The mobile app displays "Balance as of [time]" using this timestamp.
  • For balance-critical operations (fund transfer confirmation), the API makes a synchronous call to z/OS to fetch the authoritative balance rather than using the cache.
  • Customer complaints about stale data dropped to zero after the UI change.
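
The routing decision in this resolution (cache reads for display, a synchronous z/OS read when the balance is critical) reduces to a small dispatch function. A sketch, with both data-access callables as hypothetical placeholders:

```python
def get_balance(account_id: int, balance_critical: bool,
                read_cache, read_zos):
    """Route balance reads: display paths tolerate replication lag and
    read the cloud cache; fund-transfer confirmation goes synchronously
    to z/OS for the authoritative balance.

    read_cache(id) -> (balance, replicated_ts); read_zos(id) -> balance.
    Both stand in for the real data access layer."""
    if balance_critical:
        return read_zos(account_id), "authoritative"
    balance, replicated_ts = read_cache(account_id)
    return balance, f"as of {replicated_ts}"
```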

Lessons Learned

  1. Design for eventual consistency from day one: The hybrid architecture is inherently eventually consistent. Rather than trying to eliminate latency (impossible), design the user experience to accommodate it gracefully.

  2. Federation is for exploration, not production traffic: The federation setup was invaluable during development and for ad-hoc analytics queries, but production traffic should always read from the replicated cache. A single federated query adds 30-50 ms of z/OS round-trip latency.

  3. Monitor replication lag as a first-class metric: CDC replication lag should be on the same monitoring dashboard as API response time and error rate. When lag exceeds 10 seconds, it triggers an alert.

  4. Plan for z/OS batch impact on the cloud: The nightly batch cycle on z/OS directly affects the cloud platform through CDC. The cloud team must understand and plan for the z/OS batch schedule.
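
Lesson 3's alert rule is simple to state precisely. A sketch of the lag computation and threshold check (the event shape and names are illustrative):

```python
def max_lag_secs(events) -> float:
    """events: iterable of (source_commit_ts, cloud_apply_ts) pairs for
    recently applied CDC rows; returns the worst apply lag in seconds."""
    return max((apply_ts - commit_ts
                for commit_ts, apply_ts in events), default=0.0)

def lag_alert(events, threshold_secs: float = 10.0) -> bool:
    """True when CDC lag breaches the 10-second alert threshold, so it
    can sit on the same dashboard as API latency and error rate."""
    return max_lag_secs(events) > threshold_secs
```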

Discussion Questions

  1. Continental chose CDC replication over event-driven synchronization (Kafka). Under what circumstances would Kafka have been a better choice?

  2. The transaction cache in the cloud retains 90 days of data. If the regulatory requirement changes to 2 years, how would the cloud schema and cost projection change?

  3. If Continental acquires a bank that runs Oracle instead of DB2, how could the hybrid architecture be extended to federate with the Oracle system?

  4. The z/OS Connect timeout issue was resolved with a circuit breaker and retry queue. What is the risk of this approach for financial transactions, and how would you mitigate it?