Case Study 2: SecureFirst's Mobile API Performance Tuning

From 800ms to 120ms — A Systematic Approach to CICS Performance


Background

SecureFirst Retail Bank is a mid-size retail bank executing a mobile-first modernization strategy. Their architecture uses a strangler fig pattern: an API gateway (z/OS Connect) fronts CICS-hosted COBOL services, exposing them as RESTful APIs consumed by their mobile banking application.

The mobile API CICS region — SFAORM1 — handles three primary transactions:

Transaction   Function                     Target SLA   Volume
MBAL          Mobile balance inquiry       150ms        1,200 TPS
MXFR          Mobile fund transfer         300ms        200 TPS
MHST          Mobile transaction history   500ms        400 TPS

When the mobile app launched, Carlos Vega — SecureFirst's mobile API architect — was pleased that the API gateway latency was under 30ms. But the end-to-end response times were disappointing:

Transaction   Target   Actual P50   Actual P95
MBAL          150ms    280ms        800ms
MXFR          300ms    450ms        1,200ms
MHST          500ms    600ms        2,500ms

No SLA was being met. The mobile app felt sluggish. Customer satisfaction scores for the mobile channel were 20 points below the industry benchmark.

Yuki Nakamura — SecureFirst's DevOps lead — took ownership of the performance optimization project. Her approach was systematic, data-driven, and executed over 6 weeks.


Week 1: Baseline and Diagnosis

Collecting the Data

Yuki's first step was establishing a measurement baseline. She enabled CMF at 15-minute intervals and collected SMF 110 Type 1 records for all three transactions over a full business week (Monday–Friday).

She also activated auxiliary trace filtered to the dispatcher and DB2 domains for a 30-minute window during peak hours, capturing approximately 50,000 trace entries.

The Wait-Time Breakdown

The SMF 110 analysis revealed the following average wait-time breakdown for MBAL (the highest-volume transaction):

Wait Category     Time (ms)   % of Elapsed
Dispatcher wait   85          30%
DB2 wait          120         43%
MRO wait          45          16%
Program load      15          5%
CPU (QR TCB)      8           3%
Other             7           3%
Total elapsed     280         100%

Three findings stood out:

  1. Dispatcher wait was 30% of elapsed time. The QR TCB was congested — tasks waited an average of 85ms for dispatch. QR TCB busy was 82%.

  2. DB2 wait was 43%. Each MBAL transaction made 3 DB2 calls. The calls themselves averaged 12ms, but thread wait time (waiting for a CMDT-limited thread) added an average of 40ms.

  3. MRO wait was 16%. MBAL called a program on a downstream AOR via DPL for fraud-check processing. The MRO round-trip added 45ms.

The Configuration

Yuki reviewed SFAORM1's configuration:

MXT=150
EDSALIM=500M
DSALIM=5M
CMDT=30 (on DB2CONN)
TRANCLASS: none defined
All programs: CONCURRENCY(QUASIRENT) — the default

Every program was QUASIRENT. No TRANCLASS. CMDT of 30 for a region processing 1,800 TPS with 85% DB2 usage. These were "set it and forget it" configurations from the initial deployment — the values had never been tuned.


Week 2: THREADSAFE Conversion

The Biggest Win

Yuki identified THREADSAFE conversion as the highest-impact change. With all programs as QUASIRENT, every DB2 call blocked the QR TCB. At 1,800 TPS with an average of 3.5 DB2 calls per transaction at 12ms each:

QR TCB blocking from DB2 = 1,800 × 3.5 × 0.012 = 75.6 seconds per second

This is impossible — the QR TCB can only be busy for 1.0 seconds per second. The implication: massive queuing. The QR TCB was the bottleneck.
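The arithmetic can be replayed as a one-line sanity check (figures taken from the SMF analysis above):

```python
# Back-of-envelope check of QR TCB demand. With every program
# QUASIRENT, each EXEC SQL call blocks the single QR TCB.
tps = 1800                 # region throughput, transactions/second
db2_calls_per_txn = 3.5    # average DB2 calls per transaction
db2_call_elapsed = 0.012   # seconds of elapsed time per DB2 call

qr_demand = tps * db2_calls_per_txn * db2_call_elapsed
print(f"QR TCB demand: {qr_demand:.1f} busy-seconds per second")  # 75.6
# A single TCB supplies at most 1.0 busy-second per second, so a
# demand of ~75.6 means tasks queue heavily for dispatch.
```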

The Conversion Process

Yuki's team reviewed the top 3 programs (PGMMBAL, PGMMXFR, PGMMHST) for THREADSAFE eligibility:

Prerequisites for THREADSAFE:

  1. The program must be reentrant (compiled with the RENT option) — all three were.

  2. No use of CICS commands that are not threadsafe (e.g., EXEC CICS ADDRESS CWA) — PGMMHST had one CWA reference that was refactored.

  3. No writeable storage shared between tasks (global WORKING-STORAGE used for cross-task communication) — none found.

  4. All called subprograms must also be threadsafe — the DB2 call interface (EXEC SQL) is threadsafe.

After code review, all three programs were eligible. The changes:

DEFINE PROGRAM(PGMMBAL) CONCURRENCY(THREADSAFE) ...
DEFINE PROGRAM(PGMMXFR) CONCURRENCY(THREADSAFE) ...
DEFINE PROGRAM(PGMMHST) CONCURRENCY(THREADSAFE) ...

Results After THREADSAFE

Metric                   Before   After   Change
QR TCB busy              82%      31%     -62%
Dispatcher wait (MBAL)   85ms     12ms    -86%
MBAL P50 elapsed         280ms    185ms   -34%
MBAL P95 elapsed         800ms    380ms   -53%

THREADSAFE alone brought MBAL P50 from 280ms to 185ms. The QR TCB was no longer the bottleneck. But 185ms still exceeded the 150ms target.


Week 3: DB2 Thread Tuning

The Thread Wait Problem

With the dispatcher bottleneck resolved, the next dominant wait was DB2 thread acquisition. CMDT was 30, but the region needed more concurrent threads after THREADSAFE: with the QR TCB no longer serializing DB2 calls, more tasks reached DB2 at the same moment, raising concurrent thread demand.

Yuki calculated the required CMDT:

CMDT = Peak_TPS × Fraction_DB2 × Avg_DB2_Elapsed × Safety
     = 1,800 × 0.85 × 0.012 × 1.5
     = ~28

Wait — the formula suggested 28, but the observed thread wait was 40ms. What was wrong?

The answer: the formula assumes uniform arrival. In reality, mobile API traffic is bursty — the API gateway batches requests from the mobile app's connection pool. Peak instantaneous throughput was 2,800 TPS, not the average 1,800. Recalculating with peak:

CMDT = 2,800 × 0.85 × 0.012 × 1.5 = ~43

Yuki raised CMDT to 60 (providing headroom above the burst peak).
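The sizing rule is easy to parameterize; a small sketch using the text's figures (the function name and the default safety factor are illustrative):

```python
def cmdt_estimate(tps, db2_fraction, avg_db2_elapsed_s, safety=1.5):
    """Concurrent DB2 thread demand (Little's law) times a safety factor."""
    return tps * db2_fraction * avg_db2_elapsed_s * safety

avg_based = cmdt_estimate(1_800, 0.85, 0.012)   # sized from average TPS
peak_based = cmdt_estimate(2_800, 0.85, 0.012)  # sized from burst peak
print(round(avg_based), round(peak_based))      # 28 43
```

Sizing from the burst peak rather than the average is what explained the observed 40ms thread waits; the final CMDT of 60 added headroom above even the burst-based figure.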

Results After CMDT Tuning

Metric                  Before   After   Change
DB2 thread waits/hour   4,200    35      -99%
DB2 wait (MBAL avg)     120ms    42ms    -65%
MBAL P50 elapsed        185ms    138ms   -25%
MBAL P95 elapsed        380ms    210ms   -45%

MBAL P50 was now at 138ms — below the 150ms target. P95 was 210ms — still above target. The long tail needed attention.


Week 4: MRO Optimization

Eliminating the Fraud Check Round-Trip

The 45ms MRO wait for the fraud-check DPL call was the next target. Yuki investigated the fraud-check program:

  • It ran on a separate AOR (SFAORC1) for isolation
  • It performed a single DB2 read (fraud rule lookup) and a comparison
  • Total CPU time: 0.8ms
  • Total elapsed time on the remote AOR: 8ms
  • MRO round-trip overhead: 37ms

The MRO overhead (37ms) dwarfed the actual work (8ms). For a program that was essentially a DB2 read and a comparison, the inter-region hop was pure overhead.

Yuki's recommendation: move the fraud-check program to SFAORM1 (the mobile API AOR) and call it via a local LINK instead of DPL. The fraud-check program was stateless and read-only — there was no isolation benefit from running it on a separate AOR.

Carlos pushed back: "We separated it for modularity." Yuki's response: "Modularity is a code concern. Region topology is a performance concern. You can have a separate program without a separate region."

After the change:

Metric                Before   After   Change
MRO wait (MBAL)       45ms     0ms     -100%
Local LINK overhead   0ms      2ms     N/A
MBAL P50 elapsed      138ms    98ms    -29%
MBAL P95 elapsed      210ms    155ms   -26%

MBAL P50 at 98ms, well under the 150ms target; P95 at 155ms, within 5ms of it.


Week 5: Storage and TRANCLASS

EDSALIM Right-Sizing

With the performance improvements, task concurrency dropped (faster response times = fewer concurrent tasks):

Before tuning: 1,800 TPS × 0.280s avg = 504 concurrent tasks (frequently hitting MXT 150)
After tuning: 1,800 TPS × 0.098s avg = 176 concurrent tasks (MXT 150 was a bottleneck!)

Yuki realized the pre-tuning MXT of 150 had itself been a performance bottleneck — the region was frequently in MAXT, and the resulting queuing was contributing to the high response times. A feedback loop: slow response times cause task accumulation, which causes MAXT, which causes queuing, which causes slower response times.

She recalculated MXT:

MXT = 1,800 × 0.098 × 2.0 = 353 → set to 400

She also measured peak EUDSA usage and found it at 310MB; after allowing for CICS kernel storage, the EDSALIM of 500M left only 38% of the limit free above that peak. She raised EDSALIM to 650M.
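Both concurrency figures and the new MXT are applications of Little's law (concurrent tasks = throughput × residence time); replaying them with the case's numbers:

```python
tps = 1_800  # region throughput, transactions/second

# Concurrent-task demand before and after tuning (Little's law):
before = tps * 0.280    # demand with 280ms average response time
after = tps * 0.098     # demand with 98ms average response time

# MXT re-sized with the 2x safety factor used in the text:
new_mxt = tps * 0.098 * 2.0
print(round(before), round(after), round(new_mxt))  # 504 176 353
```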

TRANCLASS Implementation

Yuki implemented a three-class model:

TRANCLASS CLSBAL   MAXACTIVE(200)   *> MBAL — highest volume
TRANCLASS CLSXFR   MAXACTIVE(80)    *> MXFR — highest value
TRANCLASS CLSHST   MAXACTIVE(80)    *> MHST — lowest priority

CLSHST was given the lowest MAXACTIVE because transaction history queries are the most expensive (many DB2 reads) and the most tolerant of latency (500ms SLA vs. 150ms for balance).

Self-Aware Transaction Instrumentation

Yuki added performance instrumentation to all three programs. Each program captures start and end timestamps and writes a performance record to a TD queue if elapsed time exceeds the alert threshold:

Transaction   SLA     Alert Threshold
MBAL          150ms   500ms (3.3x SLA)
MXFR          300ms   900ms (3x SLA)
MHST          500ms   1,500ms (3x SLA)

The alert records are processed by an automated monitoring transaction that runs every 60 seconds, counts alerts per transaction type, and triggers operator notifications if the count exceeds a threshold.
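The threshold check itself is simple; a Python sketch of the equivalent logic (the production code is COBOL writing to a transient data queue, and all names here are illustrative):

```python
import time

# Alert thresholds from the table above, in milliseconds.
ALERT_THRESHOLDS_MS = {"MBAL": 500, "MXFR": 900, "MHST": 1500}

def check_and_alert(tran_id, start_ns, end_ns, alert_queue):
    """Append an alert record if elapsed time exceeds the threshold."""
    elapsed_ms = (end_ns - start_ns) / 1_000_000
    if elapsed_ms > ALERT_THRESHOLDS_MS[tran_id]:
        alert_queue.append((tran_id, round(elapsed_ms)))
    return elapsed_ms

alerts = []
start = time.monotonic_ns()
# ... transaction work would run here ...
check_and_alert("MBAL", start, start + 600_000_000, alerts)  # simulated 600ms
print(alerts)  # one MBAL alert, since 600ms > the 500ms threshold
```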


Week 6: Final Results and Monitoring

The Final Numbers

Transaction   Target   Before    After   Improvement
MBAL P50      150ms    280ms     98ms    65%
MBAL P95      150ms    800ms     155ms   81%
MXFR P50      300ms    450ms     145ms   68%
MXFR P95      300ms    1,200ms   290ms   76%
MHST P50      500ms    600ms     210ms   65%
MHST P95      500ms    2,500ms   480ms   81%

All SLAs met at P50. MHST P95 at 480ms — within the 500ms target. The mobile app customer satisfaction score improved by 35 points in the quarter following the optimization.

The Changes — Summarized

Change                               Impact                                 Effort
THREADSAFE conversion (3 programs)   QR TCB busy 82% → 31%, P50 -34%        2 weeks (code review, test, deploy)
CMDT increase (30 → 60)              DB2 thread waits -99%, P50 -25%        1 hour (parameter change + test)
MRO elimination (fraud check local)  MRO wait eliminated, P50 -29%          1 week (code move, test, deploy)
MXT increase (150 → 400)             MAXT eliminated, reduces feedback loop 5 minutes (SIT parameter)
EDSALIM increase (500M → 650M)       Storage headroom from 38% to 52%       5 minutes (SIT parameter)
TRANCLASS implementation             Protects MBAL/MXFR from MHST surge     1 hour (CSD definitions)
Self-aware instrumentation           Real-time alerting for SLA breaches    1 week (code changes)

Total effort: approximately 4 weeks of engineering work spread over 6 calendar weeks. No hardware changes. No software purchases. No architecture redesign.


Lessons for the Reader

Lesson 1: Measure Before You Tune

Yuki did not start by guessing. She collected a week of SMF 110 data, performed a wait-time breakdown analysis, and identified the three dominant bottlenecks in priority order. Every change was driven by data, and every change was measured afterward to confirm the impact.

Lesson 2: THREADSAFE Is Not Optional

For any CICS region processing more than a few hundred TPS with DB2, QUASIRENT programs are a performance tax. The QR TCB becomes the bottleneck, and every DB2 call blocks every other task. THREADSAFE conversion is the single highest-return-on-investment optimization in modern CICS.

Lesson 3: Beware the Feedback Loop

The pre-tuning state had a feedback loop: slow response → high task count → MAXT → queuing → slower response. This loop masks the root cause because the observed symptoms (MAXT, queuing) look like capacity problems rather than configuration problems. Breaking the loop at any point (THREADSAFE, CMDT, MRO) revealed that the underlying capacity was adequate.
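The self-reinforcing nature of the loop can be seen in a toy model: once demanded concurrency exceeds MXT, the region's completion ceiling falls below the arrival rate and the backlog can only grow. The numbers below are the case's pre-tuning figures; real CICS dispatching is far more nuanced.

```python
# Toy model of the MAXT feedback loop. Shows only why the loop
# cannot resolve itself once demand exceeds the MXT ceiling.
arrival_rate = 1800    # offered load, transactions/second
service_time = 0.28    # seconds per transaction once dispatched
mxt = 150              # task ceiling

max_throughput = mxt / service_time   # completion ceiling, ~536 TPS

backlog = 0.0
for _ in range(10):   # simulate 10 seconds of peak load
    completed = min(arrival_rate + backlog, max_throughput)
    backlog += arrival_rate - completed

print(f"ceiling ~{max_throughput:.0f} TPS, backlog after 10s ~{backlog:.0f}")
```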

Lesson 4: Region Topology Is a Performance Decision

Moving the fraud-check program from a remote AOR to the local AOR eliminated 37ms of MRO overhead per transaction. The architectural decision (separate region for modularity) had a 37ms performance cost that was invisible until someone measured it. Every inter-region call has a latency cost. Ensure the isolation benefit justifies that cost.

Lesson 5: Quick Wins Are Real

Of the 7 changes Yuki made, two (CMDT increase, MXT increase) took less than an hour each and provided measurable improvement. Do not defer quick wins while pursuing larger optimizations. Apply them in parallel.

Lesson 6: Carlos's Aha Moment

Carlos Vega came into the project with a distributed systems mindset: separate services, separate deployments, separate regions. He learned that in CICS, the cost of inter-region communication is higher than in microservices (where a local network call is sub-millisecond). The mainframe's strength is running many things efficiently in the same address space. Distributing work across regions should be driven by failure isolation and security requirements, not by coding modularity preferences.

This was Carlos's "aha moment" about mainframe performance: co-location is a feature, not a limitation.


Discussion Questions

  1. Yuki's THREADSAFE conversion required code review of only 3 programs. In an environment with 200 CICS programs, how would you prioritize which programs to convert first? What criteria would you use?

  2. The fraud-check program was moved from a separate AOR to the mobile API AOR. Under what circumstances would you keep it on a separate AOR despite the 37ms MRO overhead?

  3. CMDT was initially calculated at 28 using average TPS, but the actual requirement was 43 due to bursty traffic. How should capacity formulas account for burstiness? Propose a modification to the CMDT formula.

  4. The feedback loop (slow response → high tasks → MAXT → queuing → slower response) is self-reinforcing. Once entered, it rarely resolves on its own. Why? What mechanism would need to change for self-resolution?

  5. Carlos described the fraud-check separation as "modularity." Yuki called it a "performance concern." They are both right — modularity and performance are in tension. Propose a design pattern that preserves code modularity (separate programs) while avoiding the MRO performance cost (same region). Does CICS provide such a mechanism?

  6. The project took 6 weeks and required no hardware changes. Estimate the cost of the alternative: buying enough hardware to brute-force the performance problem. At approximately $100,000 per additional MIPS, how many MIPS would have been needed to halve the response times through CPU alone (assuming the QR TCB saturation was the bottleneck)?