Case Study 1: Architecture Comparison — A Bank Running Both z/OS and LUW
Background
Meridian National Bank has grown through acquisition. Its core banking system — the "system of record" for customer accounts, transactions, and regulatory reporting — runs on DB2 for z/OS on an IBM z16 mainframe at the primary data center in Charlotte, North Carolina. This system has been in operation since 1997, has been upgraded through multiple DB2 versions, and currently runs DB2 13 for z/OS.
Three years ago, Meridian acquired Pacific Coast Savings, a smaller regional bank that ran its entire technology stack on Linux servers. Pacific Coast's systems were built on DB2 11.5 for LUW running on Red Hat Enterprise Linux. Rather than immediately migrating Pacific Coast's systems to the mainframe — a project estimated at 18 months and $4.2 million — Meridian's CTO made the pragmatic decision to operate both platforms in parallel while planning a longer-term consolidation.
Today, Meridian's infrastructure team manages both environments. The core banking system on z/OS handles 85% of the transaction volume. The Pacific Coast LUW system handles the remaining 15% and is also used for the bank's newer digital banking channels (mobile app, online portal), which were originally developed against the LUW platform.
This case study examines how the same architectural concepts manifest differently on each platform and the real challenges of operating both.
The Architectural Landscape
z/OS Environment
Meridian's z/OS DB2 subsystem, MBPD, runs on an LPAR with 256 GB of central storage allocated to DB2. The configuration:
Subsystem: MBPD (DB2 13 for z/OS)
Address Spaces:
MBPDMSTR - System Services
MBPDDBM1 - Database Services
MBPDDIST - DDF (remote access)
IRLM - Lock Manager (dedicated to MBPD; the test subsystem MBPT runs its own IRLM, since an IRLM cannot be shared between DB2 subsystems)
Buffer Pools:
BP0 (4K, 20,000 pages) - Catalog, small reference tables
BP2 (8K, 200,000 pages) - Customer/Account data (~1.6 GB)
BP3 (8K, 150,000 pages) - Transaction data (~1.2 GB)
BP4 (4K, 100,000 pages) - All indexes (~400 MB)
BP5 (32K, 10,000 pages) - Work/temp tablespace (~320 MB)
BP8K1 (8K, 50,000 pages) - LOB auxiliary data (~400 MB)
EDM Pool: 512 MB
RID Pool: 128 MB
Sort Pool: 256 MB
Logging:
32 active log data sets (dual), each 2 GB
Archive to virtual tape (IBM TS7770)
BSDS dual copies on separate DASD volumes
Tablespaces:
All UTS (universal table spaces): partition-by-range (PBR) for CUSTOMER, ACCOUNT, TRANSACTION; partition-by-growth (PBG) for reference tables
TRANSACTION partitioned by month (24 monthly partitions + 2 growth)
LUW Environment
The Pacific Coast system runs on two Linux servers in an active-passive high availability configuration using DB2 HADR (High Availability Disaster Recovery):
Instance: db2pcst (DB2 11.5 for LUW, RHEL 8)
Primary Server: pcst-db-prod01 (128 GB RAM, 32 cores)
Standby Server: pcst-db-prod02 (128 GB RAM, 32 cores)
Database: PCSTBANK
SELF_TUNING_MEM = ON
DATABASE_MEMORY = 80 GB
Buffer Pools (all AUTOMATIC):
BP_CORE_8K - Customer/Account data (initial 400,000 pages)
BP_TRANS_8K - Transaction data (initial 250,000 pages)
BP_INDEX_4K - All indexes (initial 200,000 pages)
BP_TEMP_32K - Temporary tablespace (initial 20,000 pages)
IBMDEFAULTBP - Catalog and defaults
Logging:
LOGARCHMETH1 = DISK:/db2archlog/
LOGPRIMARY = 30 (each 50 MB = 1.5 GB active)
LOGSECOND = 20
LOGBUFSZ = 4096 (16 MB)
Tablespaces:
All automatic storage
TS_CUSTOMER, TS_ACCOUNT, TS_TRANSACTION (range partitioned by date)
TS_REFERENCE, TS_INDEX, TEMPSPACE1
Storage paths: /db2data/ssd01, /db2data/ssd02 (NVMe SSD)
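The sizes quoted in parentheses in both configurations follow directly from page counts times page size (and, for LUW logging, from the fact that LOGBUFSZ is counted in 4K pages). A quick arithmetic check, requiring no DB2 at all:

```python
# Sanity-check the memory figures stated in the two configurations.

def pool_bytes(pages: int, page_kb: int) -> int:
    """Total bytes for a buffer pool: page count x page size."""
    return pages * page_kb * 1024

# z/OS pools from the MBPD config (sizes as stated, decimal GB/MB)
assert pool_bytes(200_000, 8) == 1_638_400_000   # BP2: ~1.6 GB
assert pool_bytes(150_000, 8) == 1_228_800_000   # BP3: ~1.2 GB
assert pool_bytes(100_000, 4) == 409_600_000     # BP4: ~400 MB
assert pool_bytes(10_000, 32) == 327_680_000     # BP5: ~320 MB

# LUW log configuration: LOGBUFSZ = 4096 is a count of 4K pages
log_buffer_mb = 4096 * 4 * 1024 / (1024 * 1024)  # 16.0 MB
active_log_gb = 30 * 50 / 1000                   # LOGPRIMARY x 50 MB files = 1.5 GB
assert log_buffer_mb == 16.0
assert active_log_gb == 1.5
print("all stated sizes check out")
```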
Operational Challenges
Challenge 1: Buffer Pool Management Philosophy
The most visible day-to-day difference is in buffer pool management.
On z/OS, senior DBA Marcus Chen manually manages every buffer pool. He monitors hit ratios daily using OMEGAMON, adjusts pool sizes based on workload trends, and has a change-control process for any buffer pool modification. He knows that BP2 needs exactly 200,000 pages because he has tested this across every quarterly peak for the past four years. When business volumes increase during tax season (January through April), he submits a change request to temporarily increase BP3 from 150,000 to 220,000 pages to handle the surge in transaction activity.
On LUW, DBA Sarah Okonkwo relies heavily on STMM. When she first took over the Pacific Coast system, she was surprised to find that buffer pool sizes fluctuated throughout the day. During peak online hours, STMM grew BP_CORE_8K and shrank BP_TRANS_8K. During overnight batch windows, the allocation reversed. She initially found this unsettling — on her previous z/OS assignment, buffer pools stayed exactly the size you set them — but after six months of monitoring, she found that STMM consistently maintained hit ratios above 98% without manual intervention.
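Sarah's observation can be illustrated with a toy model. This is emphatically not the real STMM algorithm (which uses cost-benefit estimates from simulated buffer pool extensions); it is only a minimal sketch of the descriptive idea: a fixed total budget, with pages shifted toward whichever pool is currently missing more.

```python
def rebalance(pools: dict[str, int], misses: dict[str, int],
              step: int = 10_000) -> dict[str, int]:
    """Toy STMM-like step: move `step` pages from the pool with the fewest
    misses to the pool with the most, keeping the total budget constant."""
    hot = max(misses, key=misses.get)    # suffering the most physical reads
    cold = min(misses, key=misses.get)   # can spare memory
    pools = dict(pools)
    moved = min(step, pools[cold])
    pools[cold] -= moved
    pools[hot] += moved
    return pools

# Daytime online peak: customer/account pool is under pressure
day = rebalance({"BP_CORE_8K": 400_000, "BP_TRANS_8K": 250_000},
                misses={"BP_CORE_8K": 5_000, "BP_TRANS_8K": 200})
# Overnight batch window: the allocation pressure reverses
night = rebalance({"BP_CORE_8K": 400_000, "BP_TRANS_8K": 250_000},
                  misses={"BP_CORE_8K": 200, "BP_TRANS_8K": 5_000})
print(day)    # core pool grows
print(night)  # transaction pool grows
assert sum(day.values()) == sum(night.values()) == 650_000  # budget held constant
```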
The philosophical difference is stark: z/OS DBAs tend to be prescriptive ("I tell DB2 how much memory to use"), while LUW DBAs with STMM are more descriptive ("I tell DB2 the total memory budget and let it allocate within that budget").
Neither approach is inherently superior. Marcus's prescriptive method gives him absolute predictability — he knows exactly how the system will behave during a given workload. Sarah's descriptive method adapts automatically to changing workloads but occasionally makes surprising decisions that require investigation.
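The two philosophies show up directly in the commands each DBA issues. A hedged sketch, using the pool names from the case study (exact syntax and options vary by version):

```sql
-- z/OS (prescriptive): Marcus's tax-season change request sets an explicit
-- size, issued as a DB2 command from the console:
-ALTER BUFFERPOOL(BP3) VPSIZE(220000)

-- LUW (descriptive): Sarah leaves sizing to STMM within the overall
-- DATABASE_MEMORY budget:
ALTER BUFFERPOOL BP_TRANS_8K SIZE AUTOMATIC;
```

Reversing Marcus's change after tax season requires a second change request; Sarah's pool resizes itself continuously with no further action.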
Challenge 2: The Distributed Access Problem
Meridian's digital banking applications (mobile app, online portal) need data from BOTH platforms. When a customer who originated at Pacific Coast checks their account on the Meridian mobile app, the request may need to:
- Hit the LUW system for the customer's original Pacific Coast account data.
- Hit the z/OS system for any new Meridian products the customer has been cross-sold.
This dual-platform access is handled through DB2's distributed capabilities:
- The z/OS system accepts DRDA connections through DDF (DIST address space) on port 446.
- The LUW system accepts connections through the TCP/IP listener (db2tcpcm) on port 50000.
- The application middleware (WebSphere Liberty on Linux) maintains connection pools to both systems.
The architectural challenge is that a single user action may require two separate database connections, two separate optimizers making independent access path decisions, and two separate lock managers managing concurrency. There is no unified buffer pool, no shared lock manager, and no single transaction coordinator unless explicit distributed transaction (two-phase commit) is used.
The team initially implemented two-phase commit for cross-platform transactions but found the overhead unacceptable for real-time mobile banking. They switched to an eventual consistency model for read-only cross-platform queries and reserve two-phase commit only for the nightly account reconciliation batch job.
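The eventual-consistency read path can be sketched as follows. This is a minimal illustration with hypothetical names (`fetch_luw` and `fetch_zos` stand in for queries issued through the two connection pools); the point is that each source is read under its own local transaction, so the combined view carries two independent as-of times rather than one coordinated snapshot.

```python
from datetime import datetime, timezone

def read_combined_view(customer_id: str, fetch_luw, fetch_zos) -> dict:
    """Read-only fan-out across both platforms -- no two-phase commit.

    Each fetch runs as its own local transaction on its own connection,
    so the two result sets may reflect slightly different moments in time.
    """
    luw = fetch_luw(customer_id)   # original Pacific Coast accounts
    zos = fetch_zos(customer_id)   # cross-sold Meridian products
    return {
        "customer_id": customer_id,
        "accounts": luw["rows"] + zos["rows"],
        # Two independent as-of timestamps: the consistency window the
        # team accepted in exchange for dropping two-phase commit.
        "as_of": {"luw": luw["as_of"], "zos": zos["as_of"]},
    }

# Stub fetchers standing in for the real JDBC connection pools:
now = datetime.now(timezone.utc).isoformat()
view = read_combined_view(
    "C1001",
    fetch_luw=lambda cid: {"rows": [{"acct": "PC-001", "bal": 250.0}], "as_of": now},
    fetch_zos=lambda cid: {"rows": [{"acct": "MN-777", "bal": 90.0}], "as_of": now},
)
print(len(view["accounts"]))  # one account from each platform
```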
Challenge 3: Monitoring and Tooling
Marcus uses IBM OMEGAMON for z/OS performance monitoring. Sarah uses IBM Data Server Manager and the built-in MON_GET_* table functions for LUW monitoring. The monitoring tools are completely different, with different metrics, different presentation, and different alerting mechanisms.
The operations team wanted a single dashboard showing the health of both environments. After evaluating options, they built a custom Grafana dashboard that:
- Pulls z/OS metrics from OMEGAMON via its REST API.
- Pulls LUW metrics from the MON_GET_BUFFERPOOL, MON_GET_TABLESPACE, and MON_GET_CONNECTION table functions via JDBC.
- Normalizes the metrics into comparable formats (e.g., buffer pool hit ratio is calculated the same way on both platforms, but the raw counters have different names).
This was a six-week project that highlighted how architecturally similar but operationally different the two platforms are.
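The normalization step is the interesting part. On LUW, MON_GET_BUFFERPOOL exposes logical- and physical-read monitor elements (POOL_DATA_L_READS, POOL_DATA_P_READS, and their index equivalents); on z/OS, OMEGAMON reports getpage and synchronous read I/O counts. The z/OS field names below are illustrative placeholders rather than exact OMEGAMON identifiers, but the normalized formula, one minus physical reads over logical reads, is the same on both sides:

```python
def hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Shared definition: fraction of page requests satisfied from the pool."""
    if logical_reads == 0:
        return 1.0
    return 1.0 - physical_reads / logical_reads

def normalize_luw(row: dict) -> float:
    # Counter names as returned by MON_GET_BUFFERPOOL on LUW.
    logical = row["POOL_DATA_L_READS"] + row["POOL_INDEX_L_READS"]
    physical = row["POOL_DATA_P_READS"] + row["POOL_INDEX_P_READS"]
    return hit_ratio(logical, physical)

def normalize_zos(row: dict) -> float:
    # Illustrative names for the z/OS-side counters pulled from OMEGAMON.
    return hit_ratio(row["getpages"], row["sync_read_io"])

luw = normalize_luw({"POOL_DATA_L_READS": 9_000, "POOL_INDEX_L_READS": 1_000,
                     "POOL_DATA_P_READS": 150, "POOL_INDEX_P_READS": 50})
zos = normalize_zos({"getpages": 10_000, "sync_read_io": 200})
print(luw, zos)  # identical metric despite differently named raw counters
```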
Challenge 4: Logging and Recovery Asymmetry
The recovery strategies differ significantly:
On z/OS, Marcus relies on dual active logging, the BSDS, and IBM's RECOVER utility to restore individual tablespaces to a point in time from image copies and log application. He can recover a single tablespace without affecting the rest of the subsystem.
On LUW, Sarah uses database-level backup and ROLLFORWARD. While LUW supports tablespace-level backup and restore, the process is different — she must RESTORE the tablespace from a backup and then ROLLFORWARD to a point in time. HADR provides continuous replication to the standby server, which can be activated in seconds if the primary fails.
A significant incident occurred in February when a batch job on the LUW system corrupted data in the TRANSACTION table. Sarah needed to recover just that tablespace to a point in time 30 minutes before the corruption. The process involved:
- Taking the tablespace offline.
- Restoring it from the most recent tablespace backup (6 hours old).
- Rolling forward using archive logs to the target point in time.
- Bringing the tablespace back online.
Total recovery time: 47 minutes. On z/OS, Marcus estimated a similar recovery would take 15-20 minutes using RECOVER with image copies and log apply, partly because z/OS image copy granularity and log processing are optimized for this scenario.
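Sarah's four steps map roughly onto the LUW RESTORE and ROLLFORWARD commands below. This is a hedged sketch: the database and tablespace names come from the case study, the timestamps are placeholders, and the exact options depend on the backup strategy in use.

```shell
# Steps 1-2: restore just the affected tablespace from the 6-hour-old backup
# (an ONLINE restore keeps the rest of PCSTBANK available).
db2 "RESTORE DATABASE PCSTBANK TABLESPACE (TS_TRANSACTION) ONLINE TAKEN AT <backup-timestamp>"

# Steps 3-4: roll forward through the archive logs to 30 minutes before the
# corruption; AND STOP brings the tablespace back online.
db2 "ROLLFORWARD DATABASE PCSTBANK TO <point-in-time> USING LOCAL TIME AND STOP TABLESPACE (TS_TRANSACTION) ONLINE"
```

Note that after a tablespace point-in-time rollforward, LUW typically requires a fresh backup of that tablespace before it is fully recoverable again, which adds to the effective recovery window.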
Lessons Learned
1. Architecture Concepts Transfer, Implementation Details Do Not
The team found that a DBA who understands buffer pools, logging, and lock management on one platform can learn the other platform relatively quickly. The concepts are the same. But the commands, tools, configuration parameters, and operational procedures are completely different. Cross-training took approximately three months of hands-on work.
2. Memory Management Philosophy Matters
The z/OS approach of precise, manual memory management works well when you have highly experienced DBAs, stable workloads, and rigorous change control. The LUW approach with STMM works well when workloads are variable, DBA staffing is limited, or the organization prefers automated operations. Neither is wrong — they reflect different operational philosophies.
3. Distributed Access Is an Architectural Decision, Not Just a Technical One
Running on two platforms means every data-access request has an implicit architecture question: "Which platform holds this data?" The middleware layer, the application design, and the data distribution strategy all become critical architectural concerns that would not exist on a single platform.
4. Recovery Procedures Must Be Tested Regularly
Sarah's 47-minute recovery during the incident was fast because she had practiced the procedure quarterly. Marcus does the same on z/OS. Both platforms have robust recovery capabilities, but those capabilities are only valuable if the team knows how to use them under pressure.
Discussion Questions
1. If Meridian decides to consolidate onto a single platform, what factors should drive the choice between z/OS and LUW? Consider workload characteristics, staff expertise, cost, and future application direction.
2. The team chose eventual consistency over two-phase commit for mobile banking queries. What are the risks of this approach? Under what circumstances might a customer see inconsistent data?
3. Marcus and Sarah have very different buffer pool management approaches. If you were the manager over both DBAs, would you standardize on one approach? Why or why not?
4. The custom Grafana monitoring dashboard took six weeks to build. Was this investment justified? What would you have done differently?
5. If the TRANSACTION table corruption incident had occurred on a system with circular logging (no archive logs), what would the recovery options have been? What is the business impact?
Architecture Mapping Exercise
Complete the following mapping based on the case study:
| Meridian z/OS Component | Pacific Coast LUW Equivalent | Key Difference |
|---|---|---|
| MBPDMSTR | ||
| MBPDDBM1 | ||
| MBPDDIST | ||
| IRLM | ||
| EDM Pool (512 MB) | ||
| BP2 (8K, 200K pages, manual) | ||
| BSDS (dual) | ||
| Image copy + RECOVER | ||
| OMEGAMON |