Case Study 31.2: Hybrid Architecture — z/OS Core + Cloud Digital
Background
Continental Trust Corporation is a large commercial bank with $85 billion in assets, 4 million customer accounts, and operations across 15 states. The bank's core banking system has run on DB2 for z/OS for 28 years. The mainframe processes 45 million transactions daily through CICS online applications and batch processing.
The Business Imperative
Continental's board has mandated the launch of a digital banking platform — mobile app and web portal — within 12 months. The platform must:
- Support 2 million mobile users with sub-second API response times.
- Display real-time account balances and transaction history.
- Enable fund transfers, bill payments, and mobile check deposits.
- Provide personalized product recommendations based on transaction patterns.
- Scale elastically during peak periods (Monday mornings, payroll days, tax season).
The Constraint
The CTO has made clear: "The core ledger stays on the mainframe. It processes 45 million transactions daily with five-nines reliability. We are not re-platforming it."
This constraint shapes the entire architecture: the digital platform must be built on cloud infrastructure but tightly integrated with the z/OS core.
Architecture Design
Three-Tier Hybrid Model
Tier 1: z/OS Core (Existing — No Changes)
DB2 for z/OS v13
- Account master (4M accounts)
- Transaction processing (45M daily)
- General ledger
- CICS online programs
- Batch processing (nightly cycle)
Tier 2: Cloud Digital Platform (New)
Db2 on Cloud Enterprise HA (IBM Cloud, Dallas)
- Customer digital profiles
- Session management
- Notification preferences
- Mobile-specific data (device tokens, biometric enrollment)
- Transaction cache (replicated from z/OS)
Tier 3: Cloud Analytics (New)
Db2 Warehouse on Cloud (IBM Cloud, Dallas)
- Customer 360 views
- Product recommendation engine data
- Fraud scoring models
- Regulatory analytics
Data Flow Design
The architecture uses three data movement patterns:
Pattern 1: CDC Replication (z/OS to Cloud)
Account balances and transaction history are replicated from z/OS to Db2 on Cloud in near-real-time using IBM Data Replication:
- Source: DB2 z/OS ACCOUNT_MASTER and TRANSACTION_HISTORY tables.
- Target: Db2 on Cloud account_cache and transaction_cache tables.
- Latency: 2-5 seconds under normal load; up to 15 seconds during peak batch processing.
- Volume: Approximately 45 million change rows per day.
The mobile app reads from the cloud cache for display purposes. Cached rows carry a replicated_ts column so the app can display "Balance as of [time]" to set user expectations.
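The cache-read path can be sketched as follows. This is an illustrative sketch only: it uses an in-memory SQLite database as a stand-in for the Db2 on Cloud cache, with a trimmed-down version of the account_cache schema, and the account data is invented.

```python
import sqlite3

# In-memory stand-in for the Db2 on Cloud account_cache table (illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account_cache (
        account_id        INTEGER PRIMARY KEY,
        current_balance   REAL,
        available_balance REAL,
        replicated_ts     TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO account_cache VALUES (?, ?, ?, ?)",
    (1001, 2543.75, 2443.75, "2024-04-15 09:30:02"),
)

def get_balance_display(account_id: int) -> dict:
    """Read the replicated balance and surface its freshness to the UI."""
    balance, replicated_ts = conn.execute(
        "SELECT available_balance, replicated_ts"
        " FROM account_cache WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    # Surfacing replicated_ts tells the user the data may lag z/OS slightly.
    return {
        "balance": balance,
        "as_of": replicated_ts,
        "label": f"Balance as of {replicated_ts}",
    }

result = get_balance_display(1001)
print(result["label"])  # Balance as of 2024-04-15 09:30:02
```

The key design point is that the staleness indicator is computed from replicated data itself, not from the API server's clock, so it remains honest even when CDC lag spikes.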
Pattern 2: API-Mediated Writes (Cloud to z/OS)
When a customer initiates a fund transfer, bill payment, or other financial transaction through the mobile app:
- The mobile API validates the request in the cloud.
- The API calls a z/OS-hosted REST service (via z/OS Connect EE).
- The z/OS service executes the CICS transaction, which updates DB2 z/OS.
- CDC replicates the change back to the cloud cache within seconds.
- The mobile app polls the cloud cache until the updated balance appears.
Mobile App → Cloud API → z/OS Connect EE → CICS → DB2 z/OS
                                                     |
                                                     v (CDC)
Mobile App ← Cloud API ← Db2 on Cloud (cache updated)
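The final polling step can be sketched as below. The fetch_cache_row callback, the fake cache used in the usage example, and the timing values are all hypothetical; only the overall poll-until-replicated pattern comes from the text.

```python
import time
from datetime import datetime

def wait_for_replication(fetch_cache_row, initiated_at: datetime,
                         timeout_s: float = 10.0, interval_s: float = 0.5):
    """Poll the cloud cache until CDC has applied a row newer than the
    transfer's initiation time, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        row = fetch_cache_row()
        # A replicated_ts newer than the transfer means CDC has caught up.
        if row["replicated_ts"] > initiated_at:
            return row
        time.sleep(interval_s)
    return None  # caller falls back to "balance update pending" messaging

# Illustrative usage: a fake cache that "catches up" on the third poll.
calls = {"n": 0}
t0 = datetime(2024, 4, 15, 9, 30, 0)

def fake_fetch():
    calls["n"] += 1
    ts = t0 if calls["n"] < 3 else datetime(2024, 4, 15, 9, 30, 4)
    return {"available_balance": 2318.75, "replicated_ts": ts}

row = wait_for_replication(fake_fetch, initiated_at=t0, interval_s=0.01)
```

Bounding the poll with a timeout matters: when CDC lag exceeds the timeout (as it can during the batch window), the app must degrade to a pending message rather than spin indefinitely.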
Pattern 3: Batch ETL (z/OS + Cloud to Warehouse)
Nightly batch ETL feeds the analytics warehouse:
- Extract account and transaction data from z/OS (UNLOAD utility).
- Extract digital profile data from Db2 on Cloud (EXPORT).
- Transform and load into Db2 Warehouse on Cloud (LOAD).
- Refresh materialized query tables for the recommendation engine.
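The four ETL steps above can be sequenced as a simple fail-fast pipeline; the step functions here are hypothetical placeholders for the actual UNLOAD, EXPORT, LOAD, and refresh jobs.

```python
def run_nightly_etl(steps):
    """Run each named step in order; stop on the first failure so the
    warehouse is never refreshed from a partial extract."""
    completed = []
    for name, step in steps:
        if not step():
            return {"status": "failed", "failed_step": name,
                    "completed": completed}
        completed.append(name)
    return {"status": "ok", "completed": completed}

# Placeholder steps; in practice each would submit and await a real job.
steps = [
    ("unload_zos", lambda: True),      # 1. UNLOAD from DB2 z/OS
    ("export_cloud", lambda: True),    # 2. EXPORT from Db2 on Cloud
    ("load_warehouse", lambda: True),  # 3. LOAD into Db2 Warehouse on Cloud
    ("refresh_mqts", lambda: True),    # 4. Refresh materialized query tables
]
result = run_nightly_etl(steps)
```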
Cloud Database Schema
Db2 on Cloud — Digital Platform Schema:
-- Customer digital profile (cloud-native data)
CREATE TABLE continental.customer_digital_profile (
customer_id BIGINT NOT NULL PRIMARY KEY,
email VARCHAR(255),
mobile_phone VARCHAR(20),
preferred_channel CHAR(3) CHECK (preferred_channel IN ('MOB','WEB','ALL')),
notification_prefs VARCHAR(500), -- JSON format
mfa_method CHAR(4) CHECK (mfa_method IN ('SMS','TOTP','PUSH','NONE')),
biometric_enrolled BOOLEAN DEFAULT FALSE,
device_tokens VARCHAR(2000), -- JSON array
last_login_ts TIMESTAMP,
created_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Account cache (replicated from z/OS)
CREATE TABLE continental.account_cache (
account_id BIGINT NOT NULL,
customer_id BIGINT NOT NULL,
account_type CHAR(3),
account_status CHAR(1),
current_balance DECIMAL(15,2),
available_balance DECIMAL(15,2),
last_activity_date DATE,
replicated_ts TIMESTAMP NOT NULL,
PRIMARY KEY (account_id)
);
-- Transaction cache (replicated from z/OS, last 90 days)
CREATE TABLE continental.transaction_cache (
trans_id BIGINT NOT NULL,
account_id BIGINT NOT NULL,
trans_date DATE NOT NULL,
trans_type CHAR(3),
amount DECIMAL(15,2),
description VARCHAR(200),
replicated_ts TIMESTAMP NOT NULL,
PRIMARY KEY (trans_date, trans_id)
)
PARTITION BY RANGE (trans_date)
(
    -- Boundary values are fixed dates (shown here for an April 2024 window);
    -- the 90-day window is rolled monthly by adding a new partition and
    -- detaching the oldest one.
    PARTITION p_2024_04 STARTING ('2024-04-01') ENDING ('2024-05-01') EXCLUSIVE,
    PARTITION p_2024_03 STARTING ('2024-03-01') ENDING ('2024-04-01') EXCLUSIVE,
    PARTITION p_2024_02 STARTING ('2024-02-01') ENDING ('2024-03-01') EXCLUSIVE
);
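Rolling windows over range partitions are typically maintained by periodically adding a partition for the new period and detaching the one that has aged out. A sketch of a generator for that monthly rotation DDL follows; the partition naming scheme, the exact ALTER syntax, and the detach-target table name are illustrative assumptions, not Continental's actual jobs.

```python
from datetime import date, timedelta

def rotation_ddl(table: str, run_date: date) -> list:
    """Generate monthly roll DDL for a 90-day rolling window: add a
    partition covering the next 30 days and detach the partition that
    started 120 days ago (illustrative names and syntax)."""
    new_start = run_date
    new_end = run_date + timedelta(days=30)
    old_start = run_date - timedelta(days=120)
    return [
        # New partition for the incoming month.
        f"ALTER TABLE {table} ADD PARTITION p_{new_start:%Y_%m} "
        f"STARTING ('{new_start}') ENDING ('{new_end}') EXCLUSIVE",
        # Detach the aged-out partition into a side table for archiving.
        f"ALTER TABLE {table} DETACH PARTITION p_{old_start:%Y_%m} "
        f"INTO {table}_aged_{old_start:%Y_%m}",
    ]

ddl = rotation_ddl("continental.transaction_cache", date(2024, 5, 1))
```

Generating the DDL from the run date keeps the rotation job idempotent and auditable: the statements can be logged and reviewed before execution.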
-- Session management (cloud-native data)
CREATE TABLE continental.user_sessions (
session_id VARCHAR(64) NOT NULL PRIMARY KEY,
customer_id BIGINT NOT NULL,
device_type VARCHAR(20),
ip_address VARCHAR(45),
login_ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
last_activity_ts TIMESTAMP,
expiry_ts TIMESTAMP,
is_active BOOLEAN DEFAULT TRUE
);
Federation Configuration
For ad-hoc queries that need to join z/OS and cloud data:
-- Federation server on the cloud Db2 instance
CREATE WRAPPER drda_wrapper LIBRARY 'libdb2drda.so';
CREATE SERVER zos_core
TYPE DB2/ZOS VERSION 13
WRAPPER drda_wrapper
OPTIONS (DBNAME 'DSNP', NODE 'ZOSNODE');
CREATE USER MAPPING FOR continental_dba
SERVER zos_core
OPTIONS (REMOTE_AUTHID 'CONTDBA', REMOTE_PASSWORD '***');
-- Nicknames for z/OS tables
CREATE NICKNAME continental.zos_account_master
FOR zos_core."CONTINENTAL"."ACCOUNT_MASTER";
CREATE NICKNAME continental.zos_gl_summary
FOR zos_core."CONTINENTAL"."GL_DAILY_SUMMARY";
-- Example federated query: Find customers eligible for premium upgrade
SELECT a.CUSTOMER_ID,
a.CUSTOMER_NAME,
a.TOTAL_RELATIONSHIP_VALUE,
p.preferred_channel,
p.last_login_ts
FROM continental.zos_account_master a
JOIN continental.customer_digital_profile p
ON a.CUSTOMER_ID = p.customer_id
WHERE a.TOTAL_RELATIONSHIP_VALUE > 250000
AND a.ACCOUNT_STATUS = 'A'
AND p.last_login_ts > CURRENT_TIMESTAMP - 30 DAYS;
Security Architecture
Network Security
On-Premises z/OS Data Center
|
| IBM Direct Link (10 Gbps, encrypted)
| Dedicated physical circuit — no internet
|
IBM Cloud VPC (Dallas)
|
├── Private Endpoint (Db2 on Cloud)
│ No public IP — accessible only from VPC
|
├── Private Endpoint (Db2 Warehouse)
│ No public IP — accessible only from VPC
|
└── Application Subnet (Kubernetes)
- Mobile API pods
- Internal load balancer
- Egress only through Cloud Internet Services (CDN + WAF)
Data Security
- Encryption at rest: AES-256 with BYOK. Keys stored in Hyper Protect Crypto Services (FIPS 140-2 Level 4).
- Encryption in transit: TLS 1.3 for all connections. Mutual TLS (mTLS) between the API layer and Db2.
- Column-level encryption: PII columns (email, phone, SSN) are encrypted with the ENCRYPT function using a column-specific key.
-- Column-level encryption for PII
CREATE TABLE continental.customer_pii (
customer_id BIGINT NOT NULL PRIMARY KEY,
ssn_encrypted VARCHAR(128) FOR BIT DATA,
email_encrypted VARCHAR(512) FOR BIT DATA,
phone_encrypted VARCHAR(128) FOR BIT DATA
);
-- Insert with encryption
INSERT INTO continental.customer_pii VALUES (
12345,
ENCRYPT('123-45-6789', 'column-key-ssn'),
ENCRYPT('customer@email.com', 'column-key-email'),
ENCRYPT('555-0100', 'column-key-phone')
);
Access Control
| Role | Cloud DB Access | z/OS Access | Scope |
|---|---|---|---|
| Mobile API Service | SELECT, INSERT, UPDATE on digital profile and session tables. SELECT on cache tables. | Read-only via z/OS Connect API | Production |
| CDC Replication Agent | INSERT, UPDATE, DELETE on cache tables | Read (log capture) on source tables | Replication |
| Analytics ETL | SELECT on all cloud tables | SELECT on z/OS source tables | Nightly batch |
| DBA | Full DBADM | Full SYSADM | Administration |
| Auditor | SELECT on audit views only | SELECT on audit views only | Compliance |
Performance Results
Mobile API Response Times
| Operation | Target | Actual (P95) | Method |
|---|---|---|---|
| View balance | < 200 ms | 85 ms | Read from cloud cache |
| View recent transactions | < 300 ms | 180 ms | Read from cloud cache (partitioned) |
| Fund transfer (initiate) | < 500 ms | 320 ms | Cloud validation + z/OS API call |
| Fund transfer (confirm) | < 2 sec | 1.1 sec | z/OS processing + CDC replication |
| Digital profile update | < 200 ms | 45 ms | Direct cloud write |
Replication Metrics
| Metric | Normal | Peak (Payroll) | Batch Window |
|---|---|---|---|
| CDC latency | 2.1 sec | 8.4 sec | 14.7 sec |
| Rows replicated/sec | 520 | 1,850 | 3,200 |
| Network bandwidth used | 12 Mbps | 45 Mbps | 78 Mbps |
Scaling Events
During the first tax season (April), the mobile app saw a 340% surge in concurrent users:
- Before auto-scaling: Db2 on Cloud with 16 vCPUs, 64 GB RAM.
- During peak: Scaled to 32 vCPUs, 128 GB RAM (via IBM Cloud CLI script triggered by monitoring alert).
- Scaling duration: 8 minutes (no downtime — online vertical scaling).
- After peak: Scaled back to baseline configuration.
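The alert-triggered scaling decision can be sketched as follows; the CPU and user thresholds are illustrative assumptions (the case study states only the 16 → 32 vCPU change), and the actual resize would be issued through the IBM Cloud CLI as described above.

```python
def scaling_decision(cpu_util: float, concurrent_users: int,
                     current_vcpus: int, baseline_vcpus: int = 16):
    """Decide whether to scale the Db2 on Cloud instance up, back down,
    or hold. Thresholds are illustrative, not Continental's runbook."""
    # Scale up: sustained high CPU while still at baseline capacity.
    if cpu_util > 0.80 and current_vcpus == baseline_vcpus:
        return ("scale_up", baseline_vcpus * 2)
    # Scale down: load has subsided and we are above baseline.
    if (cpu_util < 0.40 and current_vcpus > baseline_vcpus
            and concurrent_users < 100_000):
        return ("scale_down", baseline_vcpus)
    return ("hold", current_vcpus)

# Tax-season peak: high CPU at baseline capacity triggers the doubling.
action, target_vcpus = scaling_decision(
    cpu_util=0.92, concurrent_users=450_000, current_vcpus=16)
```

Keeping the scale-down condition stricter than the scale-up condition (hysteresis) prevents the instance from oscillating around a single threshold.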
Cost Profile
Monthly Costs
| Component | Monthly Cost |
|---|---|
| Db2 on Cloud Enterprise HA (3-year reserved, 16 vCPU base) | $9,200 |
| Db2 Warehouse on Cloud (reserved) | $4,800 |
| IBM Direct Link 10 Gbps | $5,500 |
| Cloud Object Storage (analytics staging) | $150 |
| Key Protect / HPCS | $600 |
| Kubernetes cluster (API layer) | $3,200 |
| Data transfer (CDC + ETL) | $280 |
| Monthly total | $23,730 |
| Annual total | $284,760 |
Comparison with Full On-Premises Alternative
Building the equivalent digital platform entirely on-premises would have required:
- New LUW server cluster: $400,000 (hardware + licenses).
- Network infrastructure upgrades: $120,000.
- Additional DBA headcount: $160,000/year.
- Data center expansion: $80,000.
- Total first-year: $760,000.
- Annual ongoing: $320,000.
The cloud approach saves approximately $35,000 annually and avoids $480,000 in capital expenditure.
Challenges and Resolutions
Challenge 1: CDC Latency During z/OS Batch Window
The nightly z/OS batch cycle generates 120 million log records in 4 hours. CDC replication lag spiked to 45 seconds during this period.
Resolution:
- Increased CDC apply parallelism from 4 to 16 threads.
- Configured CDC to use "batch apply" mode during the known batch window (22:00-02:00), buffering changes and applying them in larger batches for higher throughput.
- Reduced peak lag from 45 seconds to 15 seconds.
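The batch-apply behavior can be illustrated with a small buffering sketch: changes accumulate and are applied in groups, trading per-row latency for throughput. The flush size and the apply callback are illustrative, not actual CDC internals.

```python
class BatchApplier:
    """Buffer replicated changes and apply them in bulk groups, in the
    spirit of CDC batch-apply mode. flush_size is illustrative."""

    def __init__(self, apply_fn, flush_size: int = 1000):
        self.apply_fn = apply_fn
        self.flush_size = flush_size
        self.buffer = []

    def on_change(self, row: dict):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # One bulk apply instead of N single-row applies.
            self.apply_fn(self.buffer)
            self.buffer = []

applied_batches = []
applier = BatchApplier(applied_batches.append, flush_size=500)
for i in range(1200):
    applier.on_change({"trans_id": i})
applier.flush()  # drain the partial tail at the end of the window
```

With 1,200 changes and a flush size of 500, the target sees three bulk applies (500, 500, 200) instead of 1,200 individual ones, which is why throughput rises even though individual rows wait slightly longer.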
Challenge 2: z/OS Connect Timeout Under Load
The z/OS Connect REST APIs that mediate fund transfers timed out under high load (>500 concurrent requests), returning HTTP 504 errors to the mobile app.
Resolution:
- Increased the z/OS Connect Liberty server thread pool from 50 to 200.
- Implemented a circuit breaker in the cloud API layer: when z/OS Connect errors exceed 5% in a 10-second window, the circuit opens and returns a "please try again" message to users instead of queuing requests.
- Added a retry queue (IBM MQ) for failed transfer requests, with automatic retry after 30 seconds.
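The circuit-breaker rule described above (open when the error rate in a trailing 10-second window exceeds 5%) can be sketched as follows. The window and threshold match the text; the minimum-sample guard and the rest of the structure are illustrative assumptions.

```python
import time
from collections import deque

class CircuitBreaker:
    """Open the circuit when errors exceed error_threshold of requests
    observed in the trailing window_s seconds."""

    def __init__(self, window_s: float = 10.0, error_threshold: float = 0.05,
                 min_requests: int = 20):
        self.window_s = window_s
        self.error_threshold = error_threshold
        self.min_requests = min_requests  # avoid tripping on tiny samples
        self.events = deque()             # (timestamp, was_error)

    def record(self, was_error: bool, now: float = None):
        now = time.monotonic() if now is None else now
        self.events.append((now, was_error))
        cutoff = now - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def is_open(self) -> bool:
        if len(self.events) < self.min_requests:
            return False
        errors = sum(1 for _, was_error in self.events if was_error)
        return errors / len(self.events) > self.error_threshold

# Illustrative traffic: 95 successes then 7 errors in the same window
# gives an error rate of about 6.9%, above the 5% threshold.
cb = CircuitBreaker()
for i in range(95):
    cb.record(False, now=i * 0.05)
for i in range(7):
    cb.record(True, now=5.0 + i * 0.1)
```

When is_open() returns True, the API layer short-circuits with the "please try again" response instead of queuing more work behind an overloaded z/OS Connect endpoint.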
Challenge 3: Data Consistency Between Cache and Core
Customers occasionally saw "stale" balances in the mobile app because the CDC replication lag meant the cloud cache was 2-15 seconds behind z/OS.
Resolution:
- Added a replicated_ts column to cache tables.
- The mobile app displays "Balance as of [time]" using this timestamp.
- For balance-critical operations (fund transfer confirmation), the API makes a synchronous call to z/OS to fetch the authoritative balance rather than using the cache.
- Customer complaints about stale data dropped to zero after the UI change.
Lessons Learned
- Design for eventual consistency from day one: The hybrid architecture is inherently eventually consistent. Rather than trying to eliminate latency (impossible), design the user experience to accommodate it gracefully.
- Federation is for exploration, not production traffic: The federation setup was invaluable during development and for ad-hoc analytics queries, but production traffic should always read from the replicated cache. A single federated query adds 30-50 ms of z/OS round-trip latency.
- Monitor replication lag as a first-class metric: CDC replication lag should be on the same monitoring dashboard as API response time and error rate. When lag exceeds 10 seconds, it triggers an alert.
- Plan for z/OS batch impact on the cloud: The nightly batch cycle on z/OS directly affects the cloud platform through CDC. The cloud team must understand and plan for the z/OS batch schedule.
Discussion Questions
1. Continental chose CDC replication over event-driven synchronization (Kafka). Under what circumstances would Kafka have been a better choice?
2. The transaction cache in the cloud retains 90 days of data. If the regulatory requirement changes to 2 years, how would the cloud schema and cost projection change?
3. If Continental acquires a bank that runs Oracle instead of DB2, how could the hybrid architecture be extended to federate with the Oracle system?
4. The z/OS Connect timeout issue was resolved with a circuit breaker and retry queue. What is the risk of this approach for financial transactions, and how would you mitigate it?