Case Study 2: SecureFirst's Mobile API Performance Tuning
From 800ms to 120ms — A Systematic Approach to CICS Performance
Background
SecureFirst Retail Bank is a mid-size retail bank executing a mobile-first modernization strategy. Their architecture uses a strangler fig pattern: an API gateway (z/OS Connect) fronts CICS-hosted COBOL services, exposing them as RESTful APIs consumed by their mobile banking application.
The mobile API CICS region — SFAORM1 — handles three primary transactions:
| Transaction | Function | Target SLA | Volume |
|---|---|---|---|
| MBAL | Mobile balance inquiry | 150ms | 1,200 TPS |
| MXFR | Mobile fund transfer | 300ms | 200 TPS |
| MHST | Mobile transaction history | 500ms | 400 TPS |
When the mobile app launched, Carlos Vega — SecureFirst's mobile API architect — was pleased that the API gateway latency was under 30ms. But the end-to-end response times were disappointing:
| Transaction | Target | Actual P50 | Actual P95 |
|---|---|---|---|
| MBAL | 150ms | 280ms | 800ms |
| MXFR | 300ms | 450ms | 1,200ms |
| MHST | 500ms | 600ms | 2,500ms |
No SLA was being met. The mobile app felt sluggish. Customer satisfaction scores for the mobile channel were 20 points below the industry benchmark.
Yuki Nakamura — SecureFirst's DevOps lead — took ownership of the performance optimization project. Her approach was systematic, data-driven, and executed over 6 weeks.
Week 1: Baseline and Diagnosis
Collecting the Data
Yuki's first step was establishing a measurement baseline. She enabled CMF at 15-minute intervals and collected SMF 110 Type 1 records for all three transactions over a full business week (Monday–Friday).
She also activated auxiliary trace filtered to the dispatcher and DB2 domains for a 30-minute window during peak hours, capturing approximately 50,000 trace entries.
The Wait-Time Breakdown
The SMF 110 analysis revealed the following average wait-time breakdown for MBAL (the highest-volume transaction):
| Wait Category | Time (ms) | % of Elapsed |
|---|---|---|
| Dispatcher wait | 85 | 30% |
| DB2 wait | 120 | 43% |
| MRO wait | 45 | 16% |
| Program load | 15 | 5% |
| CPU (QR TCB) | 8 | 3% |
| Other | 7 | 3% |
| Total elapsed | 280 | 100% |
Three findings stood out:
- Dispatcher wait was 30% of elapsed time. The QR TCB was congested — tasks waited an average of 85ms for dispatch. QR TCB busy was 82%.
- DB2 wait was 43%. Each MBAL transaction made 3 DB2 calls. The calls themselves averaged 12ms, but thread wait time (waiting for a CMDT-limited thread) added an average of 40ms.
- MRO wait was 16%. MBAL called a program on a downstream AOR via DPL for fraud-check processing. The MRO round-trip added 45ms.
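The percentages in the breakdown follow directly from the raw wait times. A quick check, using the figures from the table, confirms the categories account for the full 280ms of elapsed time:

```python
# Average MBAL wait-time breakdown from the SMF 110 analysis (ms),
# taken directly from the table above.
waits = {
    "Dispatcher wait": 85,
    "DB2 wait": 120,
    "MRO wait": 45,
    "Program load": 15,
    "CPU (QR TCB)": 8,
    "Other": 7,
}

elapsed = sum(waits.values())  # total elapsed time in ms

for category, ms in waits.items():
    print(f"{category:16s} {ms:4d} ms  {ms / elapsed:4.0%}")
print(f"{'Total elapsed':16s} {elapsed:4d} ms")
```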
The Configuration
Yuki reviewed SFAORM1's configuration:
MXT=150
EDSALIM=500M
DSALIM=5M
CMDT=30 (on DB2CONN)
TRANCLASS: none defined
All programs: CONCURRENCY(QUASIRENT) — the default
Every program was QUASIRENT. No TRANCLASS. CMDT of 30 for a region processing 1,800 TPS with 85% DB2 usage. These were "set it and forget it" configurations from the initial deployment — the values had never been tuned.
Week 2: THREADSAFE Conversion
The Biggest Win
Yuki identified THREADSAFE conversion as the highest-impact change. With all programs as QUASIRENT, every DB2 call blocked the QR TCB. At 1,800 TPS with an average of 3.5 DB2 calls per transaction at 12ms each:
QR TCB blocking from DB2 = 1,800 × 3.5 × 0.012 = 75.6 seconds per second
This is impossible — the QR TCB can only be busy for 1.0 seconds per second. The implication: massive queuing. The QR TCB was the bottleneck.
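The arithmetic behind that conclusion can be checked directly. A short sketch using the figures from the text (1,800 TPS, 3.5 DB2 calls per transaction, 12ms per call):

```python
# QR TCB demand from DB2 calls under QUASIRENT, where every
# DB2 call runs on (and blocks) the single QR TCB.
tps = 1_800            # transactions per second across the region
db2_calls_per_txn = 3.5
db2_call_secs = 0.012  # 12ms average per DB2 call

# Seconds of QR TCB time demanded per wall-clock second.
qr_demand = tps * db2_calls_per_txn * db2_call_secs
print(f"QR TCB demand: {qr_demand:.1f} s per second")

# A single TCB can supply at most 1.0 s of busy time per second.
# Demand above 1.0 means the work cannot keep up: tasks queue for dispatch.
print("Saturated - queuing is inevitable" if qr_demand > 1.0 else "Within capacity")
```

With demand at roughly 75x what one TCB can supply, the 85ms dispatcher waits in the baseline data are not surprising; they are the queue forming in front of the QR TCB.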
The Conversion Process
Yuki's team reviewed the top 3 programs (PGMMBAL, PGMMXFR, PGMMHST) for THREADSAFE eligibility:
Prerequisites for THREADSAFE:
1. Program must be reentrant (compiled with RENT option) — all three were.
2. No use of CICS commands that are not threadsafe (e.g., EXEC CICS ADDRESS CWA) — PGMMHST had one CWA reference that was refactored.
3. No shared writeable storage between tasks (global WORKING-STORAGE used as cross-task communication) — none found.
4. All called subprograms must also be threadsafe — the DB2 call interface (EXEC SQL) is threadsafe.
After code review, all three programs were eligible. The changes:
DEFINE PROGRAM(PGMMBAL) CONCURRENCY(THREADSAFE) ...
DEFINE PROGRAM(PGMMXFR) CONCURRENCY(THREADSAFE) ...
DEFINE PROGRAM(PGMMHST) CONCURRENCY(THREADSAFE) ...
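Part of the eligibility review (prerequisite 2) can be automated with a first-pass source scan. The sketch below is a hypothetical illustration: `NON_THREADSAFE` is a small, non-authoritative subset (the full list of non-threadsafe commands must come from the CICS documentation), and `scan_source` is an invented helper, not a tool Yuki's team actually used.

```python
import re

# Illustrative subset of EXEC CICS commands that warrant review for
# threadsafe conversion; NOT an exhaustive or authoritative list.
NON_THREADSAFE = ["ADDRESS CWA", "EXTRACT EXIT", "WAIT EVENT"]

def scan_source(source: str) -> list[tuple[int, str]]:
    """Return (line number, command) pairs that need manual review."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for cmd in NON_THREADSAFE:
            # Allow any whitespace between the command's keywords.
            if re.search(r"EXEC\s+CICS\s+" + cmd.replace(" ", r"\s+"), line):
                findings.append((lineno, cmd))
    return findings

cobol = """\
           EXEC CICS ADDRESS CWA(WS-CWA-PTR) END-EXEC.
           EXEC SQL SELECT BAL INTO :WS-BAL FROM ACCT END-EXEC.
"""
print(scan_source(cobol))  # flags the ADDRESS CWA reference on line 1
```

A scan like this only finds candidates; the shared-storage and subprogram checks (prerequisites 3 and 4) still require human code review.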
Results After THREADSAFE
| Metric | Before | After | Change |
|---|---|---|---|
| QR TCB busy | 82% | 31% | -62% |
| Dispatcher wait (MBAL) | 85ms | 12ms | -86% |
| MBAL P50 elapsed | 280ms | 185ms | -34% |
| MBAL P95 elapsed | 800ms | 380ms | -53% |
THREADSAFE alone brought MBAL P50 from 280ms to 185ms. The QR TCB was no longer the bottleneck. But 185ms still exceeded the 150ms target.
Week 3: DB2 Thread Tuning
The Thread Wait Problem
With the dispatcher bottleneck resolved, the next dominant wait was DB2 thread acquisition. CMDT was 30, but the region needed more concurrent threads after THREADSAFE: with tasks no longer serialized behind the QR TCB, more of them reached their DB2 calls simultaneously, so concurrent thread demand rose even as overall queuing fell.
Yuki calculated the required CMDT:
CMDT = Peak_TPS × Fraction_DB2 × Avg_DB2_Elapsed × Safety
= 1,800 × 0.85 × 0.012 × 1.5
= ~28
Wait — the formula suggested 28, but the observed thread wait was 40ms. What was wrong?
The answer: the formula assumes uniform arrival. In reality, mobile API traffic is bursty — the API gateway batches requests from the mobile app's connection pool. Peak instantaneous throughput was 2,800 TPS, not the average 1,800. Recalculating with peak:
CMDT = 2,800 × 0.85 × 0.012 × 1.5 = ~43
Yuki raised CMDT to 60 (providing headroom above the burst peak).
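Both calculations share one sizing function; making the traffic-rate parameter explicit shows exactly where the burstiness correction enters. A sketch with the figures from the text:

```python
import math

def size_cmdt(tps: float, db2_fraction: float, db2_elapsed_s: float,
              safety: float = 1.5) -> int:
    """Little's-law thread sizing: concurrent threads needed =
    arrival rate x fraction using DB2 x time each thread is held,
    times a safety factor. Rounded up to a whole thread."""
    return math.ceil(tps * db2_fraction * db2_elapsed_s * safety)

avg = size_cmdt(1_800, 0.85, 0.012)   # uniform-arrival assumption
peak = size_cmdt(2_800, 0.85, 0.012)  # observed burst peak from the gateway
print(avg, peak)
```

The formula itself is fine; the error was feeding it average TPS when thread demand is driven by the instantaneous peak.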
Results After CMDT Tuning
| Metric | Before | After | Change |
|---|---|---|---|
| DB2 thread waits/hour | 4,200 | 35 | -99% |
| DB2 wait (MBAL avg) | 120ms | 42ms | -65% |
| MBAL P50 elapsed | 185ms | 138ms | -25% |
| MBAL P95 elapsed | 380ms | 210ms | -45% |
MBAL P50 was now at 138ms — below the 150ms target. P95 was 210ms — still above target. The long tail needed attention.
Week 4: MRO Optimization
Eliminating the Fraud Check Round-Trip
The 45ms MRO wait for the fraud-check DPL call was the next target. Yuki investigated the fraud-check program:
- It ran on a separate AOR (SFAORC1) for isolation
- It performed a single DB2 read (fraud rule lookup) and a comparison
- Total CPU time: 0.8ms
- Total elapsed time on the remote AOR: 8ms
- MRO round-trip overhead: 37ms
The MRO overhead (37ms) dwarfed the actual work (8ms). For a program that was essentially a DB2 read and a comparison, the inter-region hop was pure overhead.
Yuki's recommendation: move the fraud-check program to SFAORM1 (the mobile API AOR) and call it via a local LINK instead of DPL. The fraud-check program was stateless and read-only — there was no isolation benefit from running it on a separate AOR.
Carlos pushed back: "We separated it for modularity." Yuki's response: "Modularity is a code concern. Region topology is a performance concern. You can have a separate program without a separate region."
After the change:
| Metric | Before | After | Change |
|---|---|---|---|
| MRO wait (MBAL) | 45ms | 0ms | -100% |
| Local LINK overhead | 0ms | 2ms | N/A |
| MBAL P50 elapsed | 138ms | 98ms | -29% |
| MBAL P95 elapsed | 210ms | 155ms | -26% |
MBAL P50 at 98ms. P95 at 155ms. Target met.
Week 5: Storage and TRANCLASS
EDSALIM Right-Sizing
With the performance improvements, task concurrency dropped (faster response times = fewer concurrent tasks):
Before tuning: 1,800 TPS × 0.280s avg = 504 concurrent tasks (frequently hitting MXT 150)
After tuning: 1,800 TPS × 0.098s avg = 176 concurrent tasks (MXT 150 was a bottleneck!)
Yuki realized the pre-tuning MXT of 150 had itself been a performance bottleneck — the region was frequently in MAXT, and the resulting queuing was contributing to the high response times. A feedback loop: slow response times cause task accumulation, which causes MAXT, which causes queuing, which causes slower response times.
She recalculated MXT:
MXT = 1,800 × 0.098 × 2.0 = 353 → set to 400
She also measured EUDSA peak usage and found it at 310MB: EDSALIM of 500M left only 190MB free above that peak, 38% of the limit. She raised EDSALIM to 650M, bringing the free headroom to 52%.
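Both recalculations can be sketched with the figures above; `headroom` here is an illustrative helper expressing free storage as a fraction of the EDSALIM limit:

```python
import math

# MXT: concurrent tasks = arrival rate x residence time (Little's law),
# with a 2x safety factor for bursts.
tps = 1_800
avg_elapsed_s = 0.098   # post-tuning average response time
mxt = math.ceil(tps * avg_elapsed_s * 2.0)
print(f"Calculated MXT: {mxt} -> rounded up to 400 for operational margin")

# EDSALIM headroom: fraction of the limit left free above the EUDSA peak.
def headroom(limit_mb: float, peak_mb: float) -> float:
    return (limit_mb - peak_mb) / limit_mb

print(f"At 500M: {headroom(500, 310):.0%} free")
print(f"At 650M: {headroom(650, 310):.0%} free")
```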
TRANCLASS Implementation
Yuki implemented a three-class model:
DEFINE TRANCLASS(CLSBAL) MAXACTIVE(200) ... *> MBAL — highest volume
DEFINE TRANCLASS(CLSXFR) MAXACTIVE(80)  ... *> MXFR — highest value
DEFINE TRANCLASS(CLSHST) MAXACTIVE(80)  ... *> MHST — lowest priority
CLSHST was given the lowest MAXACTIVE because transaction history queries are the most expensive (many DB2 reads) and the most tolerant of latency (500ms SLA vs. 150ms for balance).
Self-Aware Transaction Instrumentation
Yuki added performance instrumentation to all three programs. Each program captures start and end timestamps and writes a performance record to a TD queue if elapsed time exceeds the alert threshold:
| Transaction | SLA | Alert Threshold |
|---|---|---|
| MBAL | 150ms | 500ms (3.3x SLA) |
| MXFR | 300ms | 900ms (3x SLA) |
| MHST | 500ms | 1,500ms (3x SLA) |
The alert records are processed by an automated monitoring transaction that runs every 60 seconds, counts alerts per transaction type, and triggers operator notifications if the count exceeds a threshold.
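The monitoring transaction's aggregation logic can be sketched as follows. The record format, the `NOTIFY_COUNT` threshold, and the `process_window` helper are illustrative assumptions, not SecureFirst's actual implementation:

```python
from collections import Counter

# Alert thresholds from the table above (ms).
ALERT_THRESHOLD_MS = {"MBAL": 500, "MXFR": 900, "MHST": 1_500}
NOTIFY_COUNT = 10  # hypothetical: alerts per 60s window before notifying operators

def process_window(alert_records: list[tuple[str, int]]) -> list[str]:
    """Count alert records per transaction for one 60s window and
    return the transaction IDs that warrant an operator notification."""
    counts = Counter(txn for txn, elapsed_ms in alert_records
                     if elapsed_ms > ALERT_THRESHOLD_MS.get(txn, 0))
    return [txn for txn, n in counts.items() if n > NOTIFY_COUNT]

# Example window: 12 slow MBAL responses and 3 slow MHST responses.
window = [("MBAL", 620)] * 12 + [("MHST", 1_600)] * 3
print(process_window(window))  # only MBAL exceeds the notification count
```

Keeping the per-transaction instrumentation cheap (timestamps plus a conditional TD write) and doing the counting in a separate 60-second transaction keeps the alerting overhead off the critical path.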
Week 6: Final Results and Monitoring
The Final Numbers
| Transaction | Target | Before | After | Improvement |
|---|---|---|---|---|
| MBAL P50 | 150ms | 280ms | 98ms | 65% |
| MBAL P95 | 150ms | 800ms | 155ms | 81% |
| MXFR P50 | 300ms | 450ms | 145ms | 68% |
| MXFR P95 | 300ms | 1,200ms | 290ms | 76% |
| MHST P50 | 500ms | 600ms | 210ms | 65% |
| MHST P95 | 500ms | 2,500ms | 480ms | 81% |
All SLAs met at P50. MHST P95 at 480ms — within the 500ms target. The mobile app customer satisfaction score improved by 35 points in the quarter following the optimization.
The Changes — Summarized
| Change | Impact | Effort |
|---|---|---|
| THREADSAFE conversion (3 programs) | QR TCB busy 82% → 31%, P50 -34% | 2 weeks (code review, test, deploy) |
| CMDT increase (30 → 60) | DB2 thread waits -99%, P50 -25% | 1 hour (parameter change + test) |
| MRO elimination (fraud check local) | MRO wait eliminated, P50 -29% | 1 week (code move, test, deploy) |
| MXT increase (150 → 400) | MAXT eliminated, reduces feedback loop | 5 minutes (SIT parameter) |
| EDSALIM increase (500M → 650M) | Storage headroom from 38% to 52% | 5 minutes (SIT parameter) |
| TRANCLASS implementation | Protects MBAL/MXFR from MHST surge | 1 hour (CSD definitions) |
| Self-aware instrumentation | Real-time alerting for SLA breaches | 1 week (code changes) |
Total effort: approximately 4 weeks of engineering work spread over 6 calendar weeks. No hardware changes. No software purchases. No architecture redesign.
Lessons for the Reader
Lesson 1: Measure Before You Tune
Yuki did not start by guessing. She collected a week of SMF 110 data, performed a wait-time breakdown analysis, and identified the three dominant bottlenecks in priority order. Every change was driven by data, and every change was measured afterward to confirm the impact.
Lesson 2: THREADSAFE Is Not Optional
For any CICS region processing more than a few hundred TPS with DB2, QUASIRENT programs are a performance tax. The QR TCB becomes the bottleneck, and every DB2 call blocks every other task. THREADSAFE conversion is the single highest-return-on-investment optimization in modern CICS.
Lesson 3: Beware the Feedback Loop
The pre-tuning state had a feedback loop: slow response → high task count → MAXT → queuing → slower response. This loop masks the root cause because the observed symptoms (MAXT, queuing) look like capacity problems rather than configuration problems. Breaking the loop at any point (THREADSAFE, CMDT, MRO) revealed that the underlying capacity was adequate.
Lesson 4: Region Topology Is a Performance Decision
Moving the fraud-check program from a remote AOR to the local AOR eliminated 37ms of MRO overhead per transaction. The architectural decision (separate region for modularity) had a 37ms performance cost that was invisible until someone measured it. Every inter-region call has a latency cost. Ensure the isolation benefit justifies that cost.
Lesson 5: Quick Wins Are Real
Of the 7 changes Yuki made, two (CMDT increase, MXT increase) took less than an hour each and provided measurable improvement. Do not defer quick wins while pursuing larger optimizations. Apply them in parallel.
Lesson 6: Carlos's Aha Moment
Carlos Vega came into the project with a distributed systems mindset: separate services, separate deployments, separate regions. He learned that in CICS, the cost of inter-region communication is higher than in microservices (where a local network call is sub-millisecond). The mainframe's strength is running many things efficiently in one address space — the same address space. Distributing work across regions should be driven by failure isolation and security requirements, not by coding modularity preferences.
This was Carlos's "aha moment" about mainframe performance: co-location is a feature, not a limitation.
Discussion Questions
1. Yuki's THREADSAFE conversion required code review of only 3 programs. In an environment with 200 CICS programs, how would you prioritize which programs to convert first? What criteria would you use?
2. The fraud-check program was moved from a separate AOR to the mobile API AOR. Under what circumstances would you keep it on a separate AOR despite the 37ms MRO overhead?
3. CMDT was initially calculated at 28 using average TPS, but the actual requirement was 43 due to bursty traffic. How should capacity formulas account for burstiness? Propose a modification to the CMDT formula.
4. The feedback loop (slow response → high tasks → MAXT → queuing → slower response) is self-reinforcing. Once entered, it rarely resolves on its own. Why? What mechanism would need to change for self-resolution?
5. Carlos described the fraud-check separation as "modularity." Yuki called it a "performance concern." They are both right — modularity and performance are in tension. Propose a design pattern that preserves code modularity (separate programs) while avoiding the MRO performance cost (same region). Does CICS provide such a mechanism?
6. The project took 6 weeks and required no hardware changes. Estimate the cost of the alternative: buying enough hardware to brute-force the performance problem. At approximately $100,000 per additional MIPS, how many MIPS would have been needed to halve the response times through CPU alone (assuming the QR TCB saturation was the bottleneck)?