Case Study 2: SecureFirst's Mobile API Performance Tuning
From 800ms to 120ms — A Systematic Approach to CICS Performance
Background
SecureFirst Retail Bank is a mid-size retail bank executing a mobile-first modernization strategy. Their architecture uses a strangler fig pattern: an API gateway (z/OS Connect) fronts CICS-hosted COBOL services, exposing them as RESTful APIs consumed by their mobile banking application.
The mobile API CICS region — SFAORM1 — handles three primary transactions:
| Transaction | Function | Target SLA | Volume |
|---|---|---|---|
| MBAL | Mobile balance inquiry | 150ms | 1,200 TPS |
| MXFR | Mobile fund transfer | 300ms | 200 TPS |
| MHST | Mobile transaction history | 500ms | 400 TPS |
When the mobile app launched, Carlos Vega — SecureFirst's mobile API architect — was pleased that the API gateway latency was under 30ms. But the end-to-end response times were disappointing:
| Transaction | Target | Actual P50 | Actual P95 |
|---|---|---|---|
| MBAL | 150ms | 280ms | 800ms |
| MXFR | 300ms | 450ms | 1,200ms |
| MHST | 500ms | 600ms | 2,500ms |
No SLA was being met. The mobile app felt sluggish. Customer satisfaction scores for the mobile channel were 20 points below the industry benchmark.
Yuki Nakamura — SecureFirst's DevOps lead — took ownership of the performance optimization project. Her approach was systematic, data-driven, and executed over 6 weeks.
Week 1: Baseline and Diagnosis
Collecting the Data
Yuki's first step was establishing a measurement baseline. She enabled CMF at 15-minute intervals and collected SMF 110 Type 1 records for all three transactions over a full business week (Monday–Friday).
She also activated auxiliary trace filtered to the dispatcher and DB2 domains for a 30-minute window during peak hours, capturing approximately 50,000 trace entries.
The Wait-Time Breakdown
The SMF 110 analysis revealed the following average wait-time breakdown for MBAL (the highest-volume transaction):
| Wait Category | Time (ms) | % of Elapsed |
|---|---|---|
| Dispatcher wait | 85 | 30% |
| DB2 wait | 120 | 43% |
| MRO wait | 45 | 16% |
| Program load | 15 | 5% |
| CPU (QR TCB) | 8 | 3% |
| Other | 7 | 3% |
| Total elapsed | 280 | 100% |
Three findings stood out:
- Dispatcher wait was 30% of elapsed time. The QR TCB was congested — tasks waited an average of 85ms for dispatch. QR TCB busy was 82%.
- DB2 wait was 43%. Each MBAL transaction made 3 DB2 calls. The calls themselves averaged 12ms, but thread wait time (waiting for a CMDT-limited thread) added an average of 40ms.
- MRO wait was 16%. MBAL called a program on a downstream AOR via DPL for fraud-check processing. The MRO round-trip added 45ms.
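The percentages in the breakdown follow directly from the raw wait times. A quick check, using the figures from the table, confirms the categories account for the full 280ms of elapsed time:

```python
# Average MBAL wait-time breakdown from the SMF 110 analysis (ms),
# taken directly from the table above.
waits = {
    "Dispatcher wait": 85,
    "DB2 wait": 120,
    "MRO wait": 45,
    "Program load": 15,
    "CPU (QR TCB)": 8,
    "Other": 7,
}

elapsed = sum(waits.values())  # total elapsed time in ms

for category, ms in waits.items():
    print(f"{category:16s} {ms:4d} ms  {ms / elapsed:4.0%}")
print(f"{'Total elapsed':16s} {elapsed:4d} ms")
```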
The Configuration
Yuki reviewed SFAORM1's configuration:
MXT=150
EDSALIM=500M
DSALIM=5M
CMDT=30 (on DB2CONN)
TRANCLASS: none defined
All programs: CONCURRENCY(QUASIRENT) — the default
Every program was QUASIRENT. No TRANCLASS. CMDT of 30 for a region processing 1,800 TPS with 85% DB2 usage. These were "set it and forget it" configurations from the initial deployment — the values had never been tuned.
Week 2: THREADSAFE Conversion
The Biggest Win
Yuki identified THREADSAFE conversion as the highest-impact change. With all programs as QUASIRENT, every DB2 call blocked the QR TCB. At 1,800 TPS with an average of 3.5 DB2 calls per transaction at 12ms each:
QR TCB blocking from DB2 = 1,800 × 3.5 × 0.012 = 75.6 seconds per second
This is impossible — the QR TCB can only be busy for 1.0 seconds per second. The implication: massive queuing. The QR TCB was the bottleneck.
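The arithmetic behind that conclusion can be checked directly. A short sketch using the figures from the text (1,800 TPS, 3.5 DB2 calls per transaction, 12ms per call):

```python
# QR TCB demand from DB2 calls under QUASIRENT, where every
# DB2 call runs on (and blocks) the single QR TCB.
tps = 1_800            # transactions per second across the region
db2_calls_per_txn = 3.5
db2_call_secs = 0.012  # 12ms average per DB2 call

# Seconds of QR TCB time demanded per wall-clock second.
qr_demand = tps * db2_calls_per_txn * db2_call_secs
print(f"QR TCB demand: {qr_demand:.1f} s per second")

# A single TCB can supply at most 1.0 s of busy time per second.
# Demand above 1.0 means the work cannot keep up: tasks queue for dispatch.
print("Saturated - queuing is inevitable" if qr_demand > 1.0 else "Within capacity")
```

With demand at roughly 75x what one TCB can supply, the 85ms dispatcher waits in the baseline data are not surprising; they are the queue forming in front of the QR TCB.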
The Conversion Process
Yuki's team reviewed the top 3 programs (PGMMBAL, PGMMXFR, PGMMHST) for THREADSAFE eligibility:
Prerequisites for THREADSAFE:
1. Program must be reentrant (compiled with RENT option) — all three were.
2. No use of CICS commands that are not threadsafe (e.g., EXEC CICS ADDRESS CWA) — PGMMHST had one CWA reference that was refactored.
3. No shared writeable storage between tasks (global WORKING-STORAGE used as cross-task communication) — none found.
4. All called subprograms must also be threadsafe — the DB2 call interface (EXEC SQL) is threadsafe.
After code review, all three programs were eligible. The changes:
DEFINE PROGRAM(PGMMBAL) CONCURRENCY(THREADSAFE) ...
DEFINE PROGRAM(PGMMXFR) CONCURRENCY(THREADSAFE) ...
DEFINE PROGRAM(PGMMHST) CONCURRENCY(THREADSAFE) ...
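Part of the eligibility review (prerequisite 2) can be automated with a first-pass source scan. The sketch below is a hypothetical illustration: `NON_THREADSAFE` is a small, non-authoritative subset (the full list of non-threadsafe commands must come from the CICS documentation), and `scan_source` is an invented helper, not a tool Yuki's team actually used.

```python
import re

# Illustrative subset of EXEC CICS commands that warrant review for
# threadsafe conversion; NOT an exhaustive or authoritative list.
NON_THREADSAFE = ["ADDRESS CWA", "EXTRACT EXIT", "WAIT EVENT"]

def scan_source(source: str) -> list[tuple[int, str]]:
    """Return (line number, command) pairs that need manual review."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for cmd in NON_THREADSAFE:
            # Allow any whitespace between the command's keywords.
            if re.search(r"EXEC\s+CICS\s+" + cmd.replace(" ", r"\s+"), line):
                findings.append((lineno, cmd))
    return findings

cobol = """\
           EXEC CICS ADDRESS CWA(WS-CWA-PTR) END-EXEC.
           EXEC SQL SELECT BAL INTO :WS-BAL FROM ACCT END-EXEC.
"""
print(scan_source(cobol))  # flags the ADDRESS CWA reference on line 1
```

A scan like this only finds candidates; the shared-storage and subprogram checks (prerequisites 3 and 4) still require human code review.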
Results After THREADSAFE
| Metric | Before | After | Change |
|---|---|---|---|
| QR TCB busy | 82% | 31% | -62% |
| Dispatcher wait (MBAL) | 85ms | 12ms | -86% |
| MBAL P50 elapsed | 280ms | 185ms | -34% |
| MBAL P95 elapsed | 800ms | 380ms | -53% |
THREADSAFE alone brought MBAL P50 from 280ms to 185ms. The QR TCB was no longer the bottleneck. But 185ms still exceeded the 150ms target.
Week 3: DB2 Thread Tuning
The Thread Wait Problem
With the dispatcher bottleneck resolved, the next dominant wait was DB2 thread acquisition. CMDT was 30, but the region needed more concurrent threads after THREADSAFE: with tasks no longer serialized behind the QR TCB, more of them reached their DB2 calls simultaneously, so concurrent thread demand rose even as overall queuing fell.
Yuki calculated the required CMDT:
CMDT = Peak_TPS × Fraction_DB2 × Avg_DB2_Elapsed × Safety
= 1,800 × 0.85 × 0.012 × 1.5
= ~28
Wait — the formula suggested 28, but the observed thread wait was 40ms. What was wrong?
The answer: the formula assumes uniform arrival. In reality, mobile API traffic is bursty — the API gateway batches requests from the mobile app's connection pool. Peak instantaneous throughput was 2,800 TPS, not the average 1,800. Recalculating with peak:
CMDT = 2,800 × 0.85 × 0.012 × 1.5 = ~43
Yuki raised CMDT to 60 (providing headroom above the burst peak).
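Both calculations share one sizing function; making the traffic-rate parameter explicit shows exactly where the burstiness correction enters. A sketch with the figures from the text:

```python
import math

def size_cmdt(tps: float, db2_fraction: float, db2_elapsed_s: float,
              safety: float = 1.5) -> int:
    """Little's-law thread sizing: concurrent threads needed =
    arrival rate x fraction using DB2 x time each thread is held,
    times a safety factor. Rounded up to a whole thread."""
    return math.ceil(tps * db2_fraction * db2_elapsed_s * safety)

avg = size_cmdt(1_800, 0.85, 0.012)   # uniform-arrival assumption
peak = size_cmdt(2_800, 0.85, 0.012)  # observed burst peak from the gateway
print(avg, peak)
```

The formula itself is fine; the error was feeding it average TPS when thread demand is driven by the instantaneous peak.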
Results After CMDT Tuning
| Metric | Before | After | Change |
|---|---|---|---|
| DB2 thread waits/hour | 4,200 | 35 | -99% |
| DB2 wait (MBAL avg) | 120ms | 42ms | -65% |
| MBAL P50 elapsed | 185ms | 138ms | -25% |
| MBAL P95 elapsed | 380ms | 210ms | -45% |
MBAL P50 was now at 138ms — below the 150ms target. P95 was 210ms — still above target. The long tail needed attention.
Week 4: MRO Optimization
Eliminating the Fraud Check Round-Trip
The 45ms MRO wait for the fraud-check DPL call was the next target. Yuki investigated the fraud-check program:
- It ran on a separate AOR (SFAORC1) for isolation
- It performed a single DB2 read (fraud rule lookup) and a comparison
- Total CPU time: 0.8ms
- Total elapsed time on the remote AOR: 8ms
- MRO round-trip overhead: 37ms
The MRO overhead (37ms) dwarfed the actual work (8ms). For a program that was essentially a DB2 read and a comparison, the inter-region hop was pure overhead.
Yuki's recommendation: move the fraud-check program to SFAORM1 (the mobile API AOR) and call it via a local LINK instead of DPL. The fraud-check program was stateless and read-only — there was no isolation benefit from running it on a separate AOR.
Carlos pushed back: "We separated it for modularity." Yuki's response: "Modularity is a code concern. Region topology is a performance concern. You can have a separate program without a separate region."
After the change:
| Metric | Before | After | Change |
|---|---|---|---|
| MRO wait (MBAL) | 45ms | 0ms | -100% |
| Local LINK overhead | 0ms | 2ms | N/A |
| MBAL P50 elapsed | 138ms | 98ms | -29% |
| MBAL P95 elapsed | 210ms | 155ms | -26% |
MBAL P50 at 98ms. P95 at 155ms. Target met.
Week 5: Storage and TRANCLASS
EDSALIM Right-Sizing
With the performance improvements, task concurrency dropped (faster response times = fewer concurrent tasks):
Before tuning: 1,800 TPS × 0.280s avg = 504 concurrent tasks (frequently hitting MXT 150)
After tuning: 1,800 TPS × 0.098s avg = 176 concurrent tasks (MXT 150 was a bottleneck!)
Yuki realized the pre-tuning MXT of 150 had itself been a performance bottleneck — the region was frequently in MAXT, and the resulting queuing was contributing to the high response times. A feedback loop: slow response times cause task accumulation, which causes MAXT, which causes queuing, which causes slower response times.
She recalculated MXT:
MXT = 1,800 × 0.098 × 2.0 = 353 → set to 400
She also measured EUDSA peak usage and found it at 310MB: EDSALIM of 500M left only 190MB free above that peak, 38% of the limit. She raised EDSALIM to 650M, bringing the free headroom to 52%.
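Both recalculations can be sketched with the figures above; `headroom` here is an illustrative helper expressing free storage as a fraction of the EDSALIM limit:

```python
import math

# MXT: concurrent tasks = arrival rate x residence time (Little's law),
# with a 2x safety factor for bursts.
tps = 1_800
avg_elapsed_s = 0.098   # post-tuning average response time
mxt = math.ceil(tps * avg_elapsed_s * 2.0)
print(f"Calculated MXT: {mxt} -> rounded up to 400 for operational margin")

# EDSALIM headroom: fraction of the limit left free above the EUDSA peak.
def headroom(limit_mb: float, peak_mb: float) -> float:
    return (limit_mb - peak_mb) / limit_mb

print(f"At 500M: {headroom(500, 310):.0%} free")
print(f"At 650M: {headroom(650, 310):.0%} free")
```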
TRANCLASS Implementation
Yuki implemented a three-class model:
DEFINE TRANCLASS(CLSBAL) MAXACTIVE(200) ... *> MBAL — highest volume
DEFINE TRANCLASS(CLSXFR) MAXACTIVE(80)  ... *> MXFR — highest value
DEFINE TRANCLASS(CLSHST) MAXACTIVE(80)  ... *> MHST — lowest priority
CLSHST was given the lowest MAXACTIVE because transaction history queries are the most expensive (many DB2 reads) and the most tolerant of latency (500ms SLA vs. 150ms for balance).
Self-Aware Transaction Instrumentation
Yuki added performance instrumentation to all three programs. Each program captures start and end timestamps and writes a performance record to a TD queue if elapsed time exceeds the alert threshold:
| Transaction | SLA | Alert Threshold |
|---|---|---|
| MBAL | 150ms | 500ms (3.3x SLA) |
| MXFR | 300ms | 900ms (3x SLA) |
| MHST | 500ms | 1,500ms (3x SLA) |
The alert records are processed by an automated monitoring transaction that runs every 60 seconds, counts alerts per transaction type, and triggers operator notifications if the count exceeds a threshold.
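The monitoring transaction's aggregation logic can be sketched as follows. The record format, the `NOTIFY_COUNT` threshold, and the `process_window` helper are illustrative assumptions, not SecureFirst's actual implementation:

```python
from collections import Counter

# Alert thresholds from the table above (ms).
ALERT_THRESHOLD_MS = {"MBAL": 500, "MXFR": 900, "MHST": 1_500}
NOTIFY_COUNT = 10  # hypothetical: alerts per 60s window before notifying operators

def process_window(alert_records: list[tuple[str, int]]) -> list[str]:
    """Count alert records per transaction for one 60s window and
    return the transaction IDs that warrant an operator notification."""
    counts = Counter(txn for txn, elapsed_ms in alert_records
                     if elapsed_ms > ALERT_THRESHOLD_MS.get(txn, 0))
    return [txn for txn, n in counts.items() if n > NOTIFY_COUNT]

# Example window: 12 slow MBAL responses and 3 slow MHST responses.
window = [("MBAL", 620)] * 12 + [("MHST", 1_600)] * 3
print(process_window(window))  # only MBAL exceeds the notification count
```

Keeping the per-transaction instrumentation cheap (timestamps plus a conditional TD write) and doing the counting in a separate 60-second transaction keeps the alerting overhead off the critical path.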
Week 6: Final Results and Monitoring
The Final Numbers
| Transaction | Target | Before | After | Improvement |
|---|---|---|---|---|
| MBAL P50 | 150ms | 280ms | 98ms | 65% |
| MBAL P95 | 150ms | 800ms | 155ms | 81% |
| MXFR P50 | 300ms | 450ms | 145ms | 68% |
| MXFR P95 | 300ms | 1,200ms | 290ms | 76% |
| MHST P50 | 500ms | 600ms | 210ms | 65% |
| MHST P95 | 500ms | 2,500ms | 480ms | 81% |
All SLAs met at P50. MHST P95 at 480ms — within the 500ms target. The mobile app customer satisfaction score improved by 35 points in the quarter following the optimization.
The Changes — Summarized
| Change | Impact | Effort |
|---|---|---|
| THREADSAFE conversion (3 programs) | QR TCB busy 82% → 31%, P50 -34% | 2 weeks (code review, test, deploy) |
| CMDT increase (30 → 60) | DB2 thread waits -99%, P50 -25% | 1 hour (parameter change + test) |
| MRO elimination (fraud check local) | MRO wait eliminated, P50 -29% | 1 week (code move, test, deploy) |
| MXT increase (150 → 400) | MAXT eliminated, reduces feedback loop | 5 minutes (SIT parameter) |
| EDSALIM increase (500M → 650M) | Storage headroom from 38% to 52% | 5 minutes (SIT parameter) |
| TRANCLASS implementation | Protects MBAL/MXFR from MHST surge | 1 hour (CSD definitions) |
| Self-aware instrumentation | Real-time alerting for SLA breaches | 1 week (code changes) |
Total effort: approximately 4 weeks of engineering work spread over 6 calendar weeks. No hardware changes. No software purchases. No architecture redesign.
Lessons for the Reader
Lesson 1: Measure Before You Tune
Yuki did not start by guessing. She collected a week of SMF 110 data, performed a wait-time breakdown analysis, and identified the three dominant bottlenecks in priority order. Every change was driven by data, and every change was measured afterward to confirm the impact.
Lesson 2: THREADSAFE Is Not Optional
For any CICS region processing more than a few hundred TPS with DB2, QUASIRENT programs are a performance tax. The QR TCB becomes the bottleneck, and every DB2 call blocks every other task. THREADSAFE conversion is the single highest-return-on-investment optimization in modern CICS.
Lesson 3: Beware the Feedback Loop
The pre-tuning state had a feedback loop: slow response → high task count → MAXT → queuing → slower response. This loop masks the root cause because the observed symptoms (MAXT, queuing) look like capacity problems rather than configuration problems. Breaking the loop at any point (THREADSAFE, CMDT, MRO) revealed that the underlying capacity was adequate.
Lesson 4: Region Topology Is a Performance Decision
Moving the fraud-check program from a remote AOR to the local AOR eliminated 37ms of MRO overhead per transaction. The architectural decision (separate region for modularity) had a 37ms performance cost that was invisible until someone measured it. Every inter-region call has a latency cost. Ensure the isolation benefit justifies that cost.
Lesson 5: Quick Wins Are Real
Of the 7 changes Yuki made, two (CMDT increase, MXT increase) took less than an hour each and provided measurable improvement. Do not defer quick wins while pursuing larger optimizations. Apply them in parallel.
Lesson 6: Carlos's Aha Moment
Carlos Vega came into the project with a distributed systems mindset: separate services, separate deployments, separate regions. He learned that in CICS, the cost of inter-region communication is higher than in microservices (where a local network call is sub-millisecond). The mainframe's strength is running many things efficiently in one address space — the same address space. Distributing work across regions should be driven by failure isolation and security requirements, not by coding modularity preferences.
This was Carlos's "aha moment" about mainframe performance: co-location is a feature, not a limitation.
Discussion Questions
1. Yuki's THREADSAFE conversion required code review of only 3 programs. In an environment with 200 CICS programs, how would you prioritize which programs to convert first? What criteria would you use?
2. The fraud-check program was moved from a separate AOR to the mobile API AOR. Under what circumstances would you keep it on a separate AOR despite the 37ms MRO overhead?
3. CMDT was initially calculated at 28 using average TPS, but the actual requirement was 43 due to bursty traffic. How should capacity formulas account for burstiness? Propose a modification to the CMDT formula.
4. The feedback loop (slow response → high tasks → MAXT → queuing → slower response) is self-reinforcing. Once entered, it rarely resolves on its own. Why? What mechanism would need to change for self-resolution?
5. Carlos described the fraud-check separation as "modularity." Yuki called it a "performance concern." They are both right — modularity and performance are in tension. Propose a design pattern that preserves code modularity (separate programs) while avoiding the MRO performance cost (same region). Does CICS provide such a mechanism?
6. The project took 6 weeks and required no hardware changes. Estimate the cost of the alternative: buying enough hardware to brute-force the performance problem. At approximately $100,000 per additional MIPS, how many MIPS would have been needed to halve the response times through CPU alone (assuming the QR TCB saturation was the bottleneck)?