

Learning Objectives

  • Analyze CICS task management including dispatching, MAXT conditions, and task priority
  • Diagnose CICS performance problems using statistics, transaction dumps, and monitoring data
  • Tune CICS storage (DSA, EDSA, GCDSA) for optimal performance
  • Configure CICS task-related system resources (MXT, CMDT, threads)
  • Design performance monitoring for the HA banking system's CICS regions

Chapter 17: CICS Performance and Tuning

Task Management, Storage Tuning, and Diagnosing MAXT Conditions

"I've been in three production incidents where a MAXT condition brought down a billion-dollar payment network. Every single time, someone said 'just raise MXT.' Every single time, that was the wrong answer. The MAXT is a symptom. The dispatcher queue, the storage profile, the transaction mix — those are the disease." — Kwame Mensah, Chief Mainframe Architect, Continental National Bank


Performance tuning in CICS is not a separate discipline. It is the discipline. Every architectural decision you made in Chapter 13 — region topology, routing, MRO connections — and every programming pattern you adopted in Chapters 14 through 16 ultimately manifests as a performance characteristic that either meets your SLA or triggers a 2 AM page.

This chapter is where architecture meets reality. We are going to walk through the CICS performance model from the ground up: how the dispatcher schedules tasks, how storage is allocated and reclaimed, how MAXT conditions cascade, and — most critically — how you diagnose problems in production when the symptoms are ambiguous and the pressure is absolute.

I need to be direct about something: CICS performance tuning has more folklore than any other mainframe discipline. Half the "best practices" circulating in shops today were valid for CICS TS 3.2 and are actively harmful on CICS TS 5.6. We will separate the signal from the noise.

🔗 SPACED REVIEW — This chapter integrates concepts from several earlier chapters:

  • Chapter 5 (WLM): CICS regions are classified into WLM service classes, and the region's dispatch priority is driven by WLM velocity goals. If your WLM configuration is wrong, no amount of CICS tuning will save you.
  • Chapter 13 (CICS Architecture): Your region topology determines your performance ceiling. A badly designed topology cannot be tuned — it can only be redesigned.
  • Chapter 15 (Channels/Containers): The data passing mechanism you choose (COMMAREA vs. channels) has direct performance implications for inter-region communication.


17.1 When CICS Stops Processing — Understanding MAXT

Let me tell you about the worst Tuesday afternoon of Rob Calloway's career.

The CNB Production Incident

It was 2:47 PM on a Tuesday — peak transaction volume for CNB's core banking AORs. Kwame was in a capacity planning meeting. Lisa was running a DB2 REORG on the general ledger tablespace (scheduled, approved, off-peak for DB2 but not for CICS). Rob's batch monitoring console was clear.

Then CICS message DFHZC0101 appeared on the system log for CNBAORB1:

DFHZC0101 CNBAORB1 MAXT condition - number of tasks has reached the MXT limit of 250

Within 90 seconds, the same message appeared for CNBAORB2. The TOR's dynamic routing was distributing work across both AORs — when AORB1 hit MAXT, the routing shifted load to AORB2, which promptly hit its own MAXT. The TOR began queuing transactions. Response times for the core banking channel went from 200 milliseconds to 45 seconds in under three minutes.

The operations team's first instinct: raise MXT. Wrong answer. Here is why.

What MXT Actually Controls

MXT (Maximum Tasks) is a CICS system initialization table (SIT) parameter that sets an upper bound on the number of user tasks that can be concurrently active in a CICS region. When the active task count reaches MXT, new transaction requests are queued — they do not receive a task control area (TCA) and cannot begin execution. This is the MAXT condition.

The key word is active, not executing. A CICS task is active from the moment it receives a TCA until it issues its final RETURN. During that lifetime, the task may be:

  • Running — actually executing on a processor (dispatched by the CICS dispatcher, which in turn was dispatched by the z/OS dispatcher)
  • Ready — waiting for the CICS dispatcher to give it control
  • Waiting — suspended on an I/O operation (DB2 call, file read, MRO request, GETMAIN, etc.)

Most tasks spend the majority of their lifetime in the Waiting state. A task that reads from DB2, processes the result, and writes an audit log might execute for 2 milliseconds of CPU but hold its TCA for 50 milliseconds of elapsed time. At any given moment, a region with 200 active tasks might have only 15 actually running.

This distinction is critical because raising MXT does not give you more CPU capacity. It gives you more TCAs — more tasks that can simultaneously wait. If your MAXT is caused by tasks accumulating in the Waiting state because DB2 is slow, raising MXT just allows more tasks to accumulate, consuming more storage, holding more DB2 threads, and making the underlying problem worse.

⚠️ COMMON PITFALL — "Just raise MXT" is the most dangerous reflex in CICS operations. Raising MXT without understanding why tasks are accumulating is like raising the speed limit on a highway where cars are stopped because of a bridge collapse. You're not fixing the bottleneck — you're allowing more cars to pile up in front of it.

The Anatomy of a MAXT Cascade

Back to CNB. The MAXT was not caused by a volume spike. Transaction rates were normal — 4,200 TPS across the two AORs. The cause was Lisa's DB2 REORG.

The REORG acquired a drain lock on the general ledger tablespace. Transactions that read the general ledger — about 35% of the core banking workload — now waited for the drain lock to release, which happened in brief windows between REORG phases. Each transaction that would normally complete in 50ms was now taking 2–4 seconds. Tasks accumulated. At 4,200 TPS with 35% affected and a 40x response time increase:

  • Normal steady state: 4,200 x 0.35 x 0.050s = ~74 concurrent GL-reading tasks
  • During REORG: 4,200 x 0.35 x 2.0s = ~2,940 concurrent GL-reading tasks
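The arithmetic here is Little's Law: average concurrency equals arrival rate times time in system. A minimal sketch of the two steady states, using the figures from the incident (the function name is illustrative):

```python
def concurrent_tasks(tps: float, fraction_affected: float, elapsed_s: float) -> float:
    """Little's Law: average concurrency = arrival rate x time in system."""
    return tps * fraction_affected * elapsed_s

# Normal steady state: ~74 GL-reading tasks in flight
normal = concurrent_tasks(4200, 0.35, 0.050)

# During the REORG drain: ~2,940 tasks, far beyond MXT=250
during_reorg = concurrent_tasks(4200, 0.35, 2.0)

print(round(normal), round(during_reorg))  # 74 2940
```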

MXT of 250 was obliterated. Even raising it to 1,000 would not have helped — the system would have run out of storage long before reaching 2,940 tasks.

The actual fix was immediate: Lisa paused the REORG (TERM UTILITY). Within 30 seconds, the backlog drained and task counts returned to normal.

The real fix was procedural: DB2 REORGs on hot tablespaces were rescheduled to the batch window, and a CICS health check was added that monitors CICS task counts whenever a REORG is running on a tablespace with CICS affinity.

💡 INSIGHT — The CNB incident illustrates a principle that governs all CICS performance work: the CICS region does not exist in isolation. A DB2 utility, a DASD volume switch, a coupling facility structure rebuild, a z/OS WLM policy change — any of these external events can cascade into CICS performance degradation. The best CICS performance engineers understand the entire z/OS ecosystem, not just the CICS SIT parameters.

MXT Sizing — The Right Way

The formula from Chapter 13 bears repeating with more precision:

MXT = (Peak_TPS × Avg_Elapsed_Response_Time × Safety_Factor)

For CNB's core banking AOR:

  • Peak TPS per AOR: 2,100
  • Average elapsed response time: 0.050 seconds (50ms)
  • Safety factor: 2.0

MXT = 2,100 × 0.050 × 2.0 = 210

CNB rounds up to 250 for headroom. Note several things about this calculation:

  1. It uses elapsed time, not CPU time. A transaction that uses 2ms of CPU but 50ms of elapsed time holds its TCA for 50ms.
  2. The safety factor accounts for normal variance, not catastrophic events. A 2.0 factor means you can handle response times doubling. It does not protect against a 40x slowdown from a REORG.
  3. Peak TPS is the per-region peak, not the system total. With dynamic routing, the per-region peak can spike above the theoretical even split if one AOR is temporarily preferred.
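The sizing rule reduces to a few lines that are worth scripting for capacity reviews. A sketch using the CNB figures above (size_mxt is an illustrative name, not a CICS API):

```python
import math

def size_mxt(peak_tps: float, avg_elapsed_s: float, safety: float = 2.0) -> int:
    """MXT = peak TPS x average ELAPSED response time x safety factor.
    Elapsed time, not CPU time: a task holds its TCA for its whole lifetime."""
    return math.ceil(peak_tps * avg_elapsed_s * safety)

print(size_mxt(2100, 0.050))  # 210 -- CNB rounds up to 250 for headroom
```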

After the incident, Kwame added a second check — the storage-bounded MXT:

Storage_Bounded_MXT = Available_EDSA / Avg_Task_Storage

For CNB:

  • Available EDSA (above the line, after CICS's own usage): ~180 MB
  • Average task storage (program, TWA, TCA, user storage): ~400 KB

Storage_Bounded_MXT = 180,000 KB / 400 KB = 450

MXT should never exceed the storage-bounded limit. CNB's MXT of 250 is well within the 450 ceiling. But if they had naively raised MXT to 500 during the incident, they would have hit SOS (short-on-storage) before hitting the new MXT — a worse outcome, because SOS triggers storage compression that degrades all tasks, not just new ones.
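Both checks belong together: compute the load-based MXT, then cap it at the storage bound. A sketch with illustrative helper names and CNB's figures:

```python
import math

def storage_bounded_mxt(available_edsa_kb: float, avg_task_storage_kb: float) -> int:
    """Upper bound on MXT imposed by storage: more concurrent tasks than
    this and the region hits SOS before it ever reaches the MXT limit."""
    return math.floor(available_edsa_kb / avg_task_storage_kb)

def effective_mxt(calculated_mxt: int, available_edsa_kb: float,
                  avg_task_storage_kb: float) -> int:
    """MXT must never exceed the storage-bounded ceiling."""
    return min(calculated_mxt,
               storage_bounded_mxt(available_edsa_kb, avg_task_storage_kb))

print(storage_bounded_mxt(180_000, 400))   # 450
print(effective_mxt(500, 180_000, 400))    # 450 -- a naive MXT=500 gets capped
```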

📊 BY THE NUMBERS — CNB's post-incident analysis found that across their 16 CICS regions, 4 had MXT settings that exceeded the storage-bounded limit. Kwame's team corrected all four. The lesson: MXT should be calculated, not inherited from the previous system programmer who set it in 2009 and never revisited it.


17.2 Task Management and Dispatching

Understanding the CICS dispatcher is prerequisite to every tuning decision you will make. Let me strip away the abstractions and describe what actually happens when a transaction arrives.

The CICS Dispatcher — What It Actually Does

The CICS dispatcher runs within the CICS address space and schedules CICS tasks onto z/OS subtasks (TCBs). In modern CICS TS 5.6 it manages a whole family of TCBs, but the conceptual model starts with the quasi-reentrant (QR) TCB.

The QR TCB is the heart of CICS. It runs one task at a time, in a cooperative multitasking model. When a task issues an EXEC CICS command that requires a wait (READ FILE, LINK to a DPL program, DB2 call via the task-related user exit), the task suspends and returns control to the dispatcher. The dispatcher selects the next ready task and dispatches it on the QR TCB.

This is fundamentally different from z/OS's preemptive multitasking. z/OS can interrupt a running program and dispatch another. CICS cannot — at least not on the QR TCB. A task running on the QR TCB runs until it voluntarily yields by issuing a CICS command. This means:

  • A CPU-bound loop in application code monopolizes the QR TCB. Every other task in the region waits. This is the single most destructive application-level performance bug in CICS.
  • The dispatcher's scheduling is cooperative, not time-sliced. Task priority affects which task the dispatcher picks next from the ready queue, but it does not preempt a running task.

🔴 PRODUCTION RULE — Never write a COBOL program that loops without issuing a CICS command. A PERFORM VARYING loop that processes 100,000 records from a working storage table will hold the QR TCB for seconds. Every other transaction in the region starves. If you must process large in-memory data, issue a periodic EXEC CICS SUSPEND to yield control.

The Open TCB Model

Modern CICS TS (from 4.1 onward) provides additional TCBs beyond the QR:

  • L8 TCBs — For programs defined as CONCURRENCY(THREADSAFE). These run on open TCBs that operate concurrently with the QR TCB. A THREADSAFE program making a DB2 call runs on an L8 TCB; the DB2 call executes without blocking the QR TCB. This is the single most impactful performance optimization available in modern CICS.
  • L9 TCBs — For user-key OPENAPI programs.
  • X8/X9 TCBs — For XPLink C/C++ programs.
  • T8 TCBs — For JVM server (including Liberty) threads.
  • S8 TCBs — For SSL processing.

The performance implication is stark. In a region where every program runs as QUASIRENT (the default), all application logic is serialized on the single QR TCB, and every DB2 call costs a pair of TCB switches: QR to L8 for the SQL call, then back again when it completes. At 2,000 TPS with 3 DB2 calls per transaction averaging 5ms each, the workload generates:

2,000 × 3 × 0.005 = 30 task-seconds of DB2 wait per wall-clock second

That is roughly 30 tasks inside DB2 at any instant, plus 12,000 TCB switches per second of pure dispatching overhead. The QR TCB saturates long before the processors do. The result: heavy queuing, high dispatch waits, and a throughput ceiling well below 2,000 TPS.

Define those programs as CONCURRENCY(THREADSAFE) and they stay on the L8 TCB after the first DB2 call. The TCB switches disappear, and application CPU spreads across the open TCBs instead of funneling through the QR. Throughput can double or triple.
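The concurrency behind that arithmetic is Little's Law applied to the DB2 calls themselves. A quick sketch with the numbers from the example (the function name is illustrative):

```python
def concurrent_db2_tasks(tps: float, calls_per_txn: int, avg_call_s: float) -> float:
    """Average number of tasks inside a DB2 call at any instant:
    call arrival rate x average call duration (Little's Law)."""
    return tps * calls_per_txn * avg_call_s

# ~30 tasks are in DB2 at any moment at this rate; with THREADSAFE,
# the open TCBs absorb that wait concurrently
print(concurrent_db2_tasks(2000, 3, 0.005))  # 30.0
```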

💡 INSIGHT — At CNB, Kwame's team converted the top 20 CICS transactions (by volume) to THREADSAFE over a 6-month period. The result: 40% reduction in QR TCB busy time, 25% improvement in average response time, and they deferred a hardware upgrade worth $2.8 million. This is the kind of return on investment that gets architects invited to the CIO's budget meetings.

Task Priority and the Ready Queue

When multiple tasks are ready for dispatch, the CICS dispatcher uses a priority scheme:

  1. Task priority — a value from 0 (lowest) to 255 (highest), derived from the transaction definition's PRIORITY parameter (plus any terminal and operator priorities)
  2. Within the same priority — first-in-first-out (FIFO)

The CICS dispatcher checks the ready queue after each task yields. It dispatches the highest-priority ready task. If a low-priority task is running when a high-priority task becomes ready, the low-priority task is not preempted — it continues until it yields.

This means priority only matters when there is queuing. In a healthy region where the QR TCB has idle time between dispatches, priority is irrelevant — every ready task gets dispatched immediately. Priority becomes critical when the region is saturated and the ready queue is non-empty.

For CNB's core banking AORs, the priority scheme is:

Transaction Type Priority Rationale
ATM authorization (real-time, customer waiting) 200 SLA: 200ms
Balance inquiry (online banking) 180 SLA: 500ms
Fund transfer (online banking) 180 SLA: 500ms
Statement generation (background) 100 SLA: 5 seconds
Audit log write (system) 80 SLA: 30 seconds

This ensures that during congestion, ATM authorizations — where a customer is physically standing at a machine — get dispatched before background statement generation.
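The selection rule, highest priority first and FIFO within a priority, behaves exactly like a priority queue. A toy model for intuition only — this is an illustration, not CICS internals (transaction names echo the CNB table):

```python
import heapq
from itertools import count

class ReadyQueue:
    """Toy model of the dispatcher's ready queue: highest priority first,
    FIFO among equal priorities. Note there is no preemption -- a running
    task keeps the QR TCB until it yields."""
    def __init__(self):
        self._heap, self._seq = [], count()

    def make_ready(self, name: str, priority: int):
        # heapq is a min-heap, so negate priority; seq preserves FIFO order
        heapq.heappush(self._heap, (-priority, next(self._seq), name))

    def dispatch(self) -> str:
        return heapq.heappop(self._heap)[2]

q = ReadyQueue()
q.make_ready("STMT1", 100)   # statement generation (background)
q.make_ready("XFER", 180)    # fund transfer
q.make_ready("ATMA", 200)    # ATM authorization
q.make_ready("BALQ", 180)    # balance inquiry, same priority as XFER
print([q.dispatch() for _ in range(4)])  # ['ATMA', 'XFER', 'BALQ', 'STMT1']
```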

⚠️ COMMON PITFALL — Do not set all transactions to priority 255 "just to be safe." This defeats the purpose of priority entirely — if everything is highest priority, nothing is. Worse, it removes your only mechanism for protecting critical transactions during congestion.

CMDT — The Ceiling Below the Ceiling

CMDT (CICS Maximum concurrent tasks for DB2 threads) is a parameter on the CICS-DB2 connection (the CICS DB2CONN definition) that limits the number of concurrent tasks that can have an active DB2 thread. This is a separate limit from MXT.

Why does this exist? Because each active DB2 thread consumes DB2 resources — thread storage, EDM pool entries, buffer pool pages. If every CICS task simultaneously holds a DB2 thread, and MXT is 250, that is 250 DB2 threads from one CICS region alone. Multiply by 8 AOR regions and you have 2,000 threads. DB2 will not be pleased.

CMDT creates a gating mechanism: if CMDT is 100, then at most 100 of the region's 250 tasks can hold DB2 threads simultaneously. The 101st task that needs DB2 waits for a thread to become available. This bounds DB2 resource consumption but introduces a new queuing point.

Sizing CMDT requires measuring the fraction of transactions that use DB2 and the average DB2 elapsed time per transaction:

CMDT = Peak_TPS × Fraction_DB2 × Avg_DB2_Elapsed × Safety_Factor

For CNB:

  • Peak TPS: 2,100
  • Fraction using DB2: 0.85
  • Average DB2 elapsed per transaction: 0.015s (15ms across all DB2 calls)
  • Safety factor: 1.5

CMDT = 2,100 × 0.85 × 0.015 × 1.5 = ~40

CNB sets CMDT to 60 with some extra headroom. This means their MXT of 250 tasks is partitioned: up to 60 can use DB2 concurrently, and the remaining tasks are either non-DB2 transactions or DB2 transactions waiting for a thread.
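The CMDT arithmetic is easy to script for quarterly capacity reviews. A sketch with CNB's figures (size_cmdt is an illustrative name, not a CICS API):

```python
def size_cmdt(peak_tps: float, fraction_db2: float,
              avg_db2_elapsed_s: float, safety: float = 1.5) -> int:
    """CMDT = peak TPS x DB2-using fraction x avg DB2 elapsed x safety.
    Bounds how many tasks can hold a DB2 thread concurrently."""
    return round(peak_tps * fraction_db2 * avg_db2_elapsed_s * safety)

print(size_cmdt(2100, 0.85, 0.015))  # 40 -- CNB sets 60 for headroom
```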

The DB2CONN definition also includes THREADLIMIT (the maximum number of pool threads) and THREADWAIT (YES to queue, NO to abend when no thread is available). CNB uses THREADWAIT(YES) — a task that cannot get a thread waits rather than abending, because a brief wait is preferable to an abend that the user sees as a failure.


17.3 CICS Storage Architecture and Tuning

If the dispatcher is the brain of a CICS region, storage is the bloodstream. Every task, every program, every control block, every data buffer consumes storage. When storage is exhausted, the region enters SOS — short on storage — and performance collapses.

The Storage Domains

CICS manages its own storage within the z/OS address space. This storage is divided into domains based on addressing:

DSA (Dynamic Storage Area) — Below the 16 MB line. In 2026, you might think below-the-line storage is irrelevant. It is not. CICS still uses below-the-line storage for BMS maps, some terminal control blocks, and 24-bit AMODE programs. DSALIM controls the upper bound.

EDSA (Extended Dynamic Storage Area) — Below the 2 GB bar but above the 16 MB line. This is where the bulk of CICS's work happens: program load areas, task control areas, working storage for most programs, channel data, container data. EDSALIM controls the upper bound.

GCDSA (Dynamic Storage Area above the bar) — Above the 2 GB bar, using 64-bit addressing. Introduced in CICS TS 5.1 for coupling facility data table caches, and expanded in subsequent releases. The above-the-bar storage is bounded by the region's z/OS MEMLIMIT value rather than by a CICS SIT parameter. This area is becoming increasingly important as CICS migrates more internal structures above the bar.

Each domain has a set of sub-pools:

Sub-pool Domain Purpose
CDSA DSA CICS-key storage below 16MB
UDSA DSA User-key storage below 16MB
ECDSA EDSA CICS-key storage below 2GB
EUDSA EDSA User-key storage below 2GB
ERDSA EDSA Read-only program storage
ESDSA EDSA Shared (reentrant) program storage
ETDSA EDSA Terminal-related storage
GCDSA GCDSA Above-the-bar CICS-key storage
GUDSA GCDSA Above-the-bar user-key storage

DSALIM, EDSALIM, and MEMLIMIT — Setting the Boundaries

These parameters define how much address space CICS can use for each domain:

DSALIM — Maximum size of the DSA. Default is 5M. For most modern workloads, 5M is sufficient. Only increase it if you have a significant number of 24-bit AMODE programs or heavy BMS map usage. CNB runs DSALIM=5M across all regions.

EDSALIM — Maximum size of the EDSA. This is the critical parameter; the shipped default is almost always wrong for production. CNB runs EDSALIM=900M on their core banking AORs after careful measurement.

MEMLIMIT — The z/OS bound on the region's above-the-bar storage, including the GCDSA; there is no separate CICS SIT limit for it. Increase it for regions that heavily use coupling facility data table caching, or that benefit from above-the-bar program placement (CICS TS 5.5+).

The sizing process for EDSALIM follows this methodology:

  1. Measure current usage. Use EXEC CICS INQUIRE DSAS or CEMT I DSAS to see current and peak allocation for each sub-pool.
  2. Identify the peak. Run during peak transaction volume for at least one business cycle (a full day for online regions, a full month for month-end-sensitive regions).
  3. Calculate the target. Peak usage + 30% headroom. The 30% is not arbitrary — it accounts for program expansion during maintenance windows, unexpected workload spikes, and CICS's internal fragmentation overhead.
  4. Set EDSALIM. Round up to the nearest 50 MB. Do not set EDSALIM to the maximum — an excessively large EDSALIM can mask storage leaks by delaying the SOS condition until the leak is massive and the region is unstable.
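Steps 3 and 4 of the methodology reduce to a small calculation. A sketch, assuming a hypothetical measured peak of 680 MB (a figure chosen for illustration; it happens to land on CNB's 900M setting):

```python
import math

def size_edsalim(peak_usage_mb: float, headroom: float = 0.30,
                 granule_mb: int = 50) -> int:
    """Peak measured usage plus 30% headroom, rounded up to the nearest 50 MB."""
    target = peak_usage_mb * (1 + headroom)
    return math.ceil(target / granule_mb) * granule_mb

print(size_edsalim(680))  # 900
```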

⚠️ COMMON PITFALL — Setting EDSALIM=1500M because "more is better." An artificially high EDSALIM creates three problems: (1) it delays detection of storage leaks, (2) it increases the region's working set size, which can cause z/OS paging, and (3) it can conflict with other address space occupants (DB2 stored procedures, Language Environment heap). Size to actual need plus headroom, not to maximum possible.

SOS — Short on Storage

When a DSA domain's free storage falls low enough that CICS can no longer maintain the storage cushion for that domain, CICS enters the SOS condition. The behavior depends on where the shortage is:

SOS Below the Line — CICS suspends new task creation for transactions that require below-the-line storage. Existing tasks continue but may wait on GETMAIN requests. The dispatcher prioritizes task completion to free storage.

SOS Above the Line — Similar behavior for EDSA exhaustion, but more impactful because most of the workload uses EDSA.

What SOS Feels Like in Production: Response times spike for all transactions, not just new ones. The dispatcher shifts into a mode that prioritizes freeing storage over throughput. Program compression runs (CICS attempts to release storage for programs that are loaded but not currently in use). If the SOS persists, CICS may begin purging tasks — a last resort that results in transaction abends.

At CNB, SOS is a Severity 1 event. The runbook specifies:

  1. Identify which DSA sub-pool is exhausted (CEMT I DSAS)
  2. Check for storage leaks (sudden increase in EUDSA without corresponding task increase)
  3. Check for abnormal task accumulation (MAXT causing tasks to pile up with their storage)
  4. If caused by a specific transaction, PURGE or FORCEPURGE its tasks
  5. If caused by growth in a specific program's storage, issue CEMT SET PROGRAM NEWCOPY (or PHASEIN) for it — the old copy's program storage is released once its use count drops to zero
  6. If none of the above resolves it, emergency recycle the region (last resort)

🔴 PRODUCTION RULE — A CICS SOS event that resolves on its own is not an event to ignore. It indicates you are operating closer to the storage boundary than you realize. Investigate the cause, measure the headroom, and adjust EDSALIM or MXT accordingly. The next SOS may not resolve on its own.

Storage and the Task Lifecycle

Every CICS task consumes storage from the moment it is created until the moment it terminates. The typical storage profile:

  1. Task creation: TCA (Task Control Area) allocated from ECDSA — approximately 4 KB
  2. Program load: If the program is not already loaded, CICS loads it into ERDSA (read-only) or ESDSA (shared reentrant). This is a one-time cost amortized across all tasks that use the program.
  3. Working storage: Each task gets its own copy of the program's WORKING-STORAGE SECTION, allocated from EUDSA. For a program with 500 KB of working storage, that is 500 KB per concurrent task.
  4. User storage: EXEC CICS GETMAIN FLENGTH requests allocate from EUDSA. This is the most dangerous category — a program that GETMAINs without FREEMAINing creates a storage leak.
  5. Channel/container data: Channel data allocates from EUDSA. Large containers can consume significant storage, especially if passed across multiple LINKed programs.
  6. Terminal storage: TIOA (Terminal I/O Area) and other terminal-related control blocks from ETDSA.

The tuning implication: working storage is the biggest per-task cost. A program with 2 MB of working storage supporting 200 concurrent tasks consumes 400 MB of EUDSA — just for working storage. This is why Kwame enforces a code review standard at CNB: no CICS program may have more than 500 KB of working storage without architect approval. Programs that need large data areas must use EXEC CICS GETMAIN/FREEMAIN or, better, channels and containers that are freed between LINK calls.

💡 INSIGHT — The single most effective CICS storage optimization is reducing working storage size in high-volume programs. A 100 KB reduction in a program running 200 concurrent tasks saves 20 MB of EUDSA. Multiply across the top 10 programs and you can reclaim hundreds of megabytes.
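The per-task cost multiplies out quickly, which is what makes working storage the dominant lever. A sketch of the arithmetic, with an illustrative helper and the figures from the paragraphs above:

```python
def eudsa_cost_mb(working_storage_mb: float, concurrent_tasks: int) -> float:
    """Each concurrent task gets its own copy of WORKING-STORAGE from EUDSA,
    so the region-wide cost is per-task size times concurrency."""
    return working_storage_mb * concurrent_tasks

print(eudsa_cost_mb(2.0, 200))   # 400.0 -- the 2 MB program at 200 tasks
print(eudsa_cost_mb(0.1, 200))   # 20.0  -- EUDSA saved by a 100 KB reduction
```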


17.4 Transaction Class Management

Transaction classes (TRANCLASS) are CICS's mechanism for partitioning MXT into slices — ensuring that one type of transaction cannot consume all available tasks at the expense of others.

Why TRANCLASS Exists

Without TRANCLASS, MXT is a single pool shared by all transactions. If MXT is 250, and a batch-like inquiry transaction spawns 200 concurrent tasks during a month-end report, only 50 tasks remain for real-time ATM authorizations. The ATM customers wait.

TRANCLASS allows you to say: "Of the 250 tasks, no more than 40 can be this inquiry transaction." When the inquiry hits 40 concurrent tasks, additional inquiry requests queue — but ATM authorizations continue to receive tasks from the remaining 210.

Defining Transaction Classes

Transaction classes are defined via CSD (CICS System Definition):

DEFINE TRANCLASS(CLSINQR)
       GROUP(BANKGRP)
       MAXACTIVE(40)
       PURGETHRESH(NO)

Key parameters:

  • MAXACTIVE — Maximum concurrent tasks in this class. When reached, new tasks in this class queue.
  • PURGETHRESH — If set to a number, CICS purges tasks in this class when the queue depth reaches the threshold. Set to NO to queue indefinitely (up to the region's overall MXT).

Associate a transaction with a class via its TRANSACTION definition:

DEFINE TRANSACTION(INQR)
       GROUP(BANKGRP)
       TRANCLASS(CLSINQR)
       PRIORITY(100)
       ...

CNB's TRANCLASS Architecture

Kwame's team uses a four-tier TRANCLASS model:

Class MAXACTIVE Transactions Rationale
CLSCRIT 80 ATM auth, wire transfer Never starved. These have customer-facing SLAs.
CLSONLN 100 Balance inquiry, fund transfer, statement Main online workload. Largest slice.
CLSBULK 30 Month-end inquiries, batch-initiated CICS Bounded so they cannot flood the region.
CLSSYS 20 CICS system transactions, monitoring Guaranteed capacity for operational tasks.

Total MAXACTIVE across classes: 230, with MXT at 250. The remaining 20 tasks are unclassed transactions — a buffer for any transaction that does not have a TRANCLASS assignment (which should be zero in a well-managed environment, but defensive configuration assumes exceptions).
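A configuration sanity check for the class budget might look like this (illustrative; the class names and limits come from the CNB table):

```python
def unclassed_buffer(mxt: int, maxactive_by_class: dict) -> int:
    """Tasks left over for transactions with no TRANCLASS assignment.
    Fails fast if the class limits oversubscribe MXT."""
    total = sum(maxactive_by_class.values())
    assert total <= mxt, "class MAXACTIVE totals must not exceed MXT"
    return mxt - total

cnb = {"CLSCRIT": 80, "CLSONLN": 100, "CLSBULK": 30, "CLSSYS": 20}
print(unclassed_buffer(250, cnb))  # 20
```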

💡 INSIGHT — TRANCLASS is the CICS equivalent of WLM service classes. Just as WLM prevents batch jobs from starving online transactions at the z/OS level, TRANCLASS prevents low-priority CICS transactions from starving high-priority ones at the region level. The two mechanisms are complementary — WLM manages between address spaces, TRANCLASS manages within a single address space.

TRANCLASS and MAXT Interaction

When a TRANCLASS hits its MAXACTIVE, the queued transactions generate a TRANCLASS wait, not a MAXT condition. MAXT only occurs when the overall MXT is reached. This distinction matters for monitoring:

  • TRANCLASS waits indicate that a specific workload is saturated but the region has capacity. This is often the desired behavior — the class limit is protecting higher-priority work.
  • MAXT indicates the entire region is saturated. This is an emergency.

CICS statistics distinguish between TRANCLASS waits and MAXT waits. In your monitoring, track both. A region with zero MAXT waits but frequent TRANCLASS waits for the CLSBULK class is working exactly as designed. A region with MAXT waits needs immediate investigation regardless of which classes are affected.
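That monitoring distinction can be encoded as a simple triage rule. A sketch only — the wording and logic here are illustrative, not a CICS facility:

```python
def triage(maxt_waits: int, tclass_waits: dict) -> str:
    """Illustrative monitoring rule: MAXT waits are an emergency;
    TRANCLASS waits on a bounded class usually mean the limits are
    doing their job."""
    if maxt_waits > 0:
        return "EMERGENCY: region saturated -- investigate immediately"
    busy = [cls for cls, waits in tclass_waits.items() if waits]
    if busy:
        return f"class limit active for {busy} -- working as designed, verify"
    return "healthy"

print(triage(0, {"CLSBULK": 12, "CLSONLN": 0}))
```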

Dynamic TRANCLASS Adjustment

TRANCLASS MAXACTIVE can be changed dynamically without restarting the region:

CEMT SET TRANCLASS(CLSBULK) MAXACTIVE(15)

This is invaluable during production incidents. If a runaway batch-initiated workload is threatening the region, you can throttle it in real time by reducing its TRANCLASS MAXACTIVE. Kwame keeps a "throttle playbook" that specifies, for each incident type, which TRANCLASS limits to adjust and by how much.

🔵 CNB PRACTICE — Kwame's team reviews TRANCLASS utilization quarterly. If a class consistently operates below 50% of its MAXACTIVE, the limit may be too generous. If a class frequently hits its MAXACTIVE, either the limit is too low or the transaction volume has grown beyond expectations. Either finding triggers a tuning cycle.


17.5 Performance Monitoring and Statistics

You cannot tune what you cannot measure. CICS provides multiple layers of performance data, each suited to different diagnostic scenarios.

CICS Statistics — The Foundation

CICS statistics are the broadest source of performance data. They are collected continuously and can be written to SMF (System Management Facility) at intervals, at shutdown, and on demand.

The key statistics record types for performance work:

Dispatcher Statistics — How busy is the dispatcher? Key fields:

  • QR TCB busy percentage — if this exceeds 80%, the QR is saturated
  • Number of task switches per interval
  • Maximum task count reached (peak active tasks)
  • Number of MAXT conditions
  • Total MAXT wait time

Storage Statistics — How much storage is in use? Key fields:

  • Current and peak allocation for each DSA sub-pool
  • Number of GETMAIN requests and failures
  • Number of SOS events and their duration
  • Storage cushion (free space) for each domain
  • Program compression count (indicates storage pressure)

Transaction Statistics — Per-transaction-type aggregates. Key fields:

  • Transaction count (number of completions)
  • Average, minimum, and maximum response time
  • Average CPU time (QR and open TCB)
  • Average wait time, broken down by wait type (dispatcher, DB2, file I/O, MRO)
  • Average storage used
  • Abend count

DB2 Connection Statistics — CICS-DB2 thread usage. Key fields:

  • Current and peak active threads
  • Thread wait count (tasks waiting for a DB2 thread)
  • CMDT limit reached count
  • Average DB2 elapsed time per call

These statistics can be viewed in real time via CEMT PERFORM STATISTICS or collected via the CICS monitoring facility (CMF) into SMF type 110 records for historical analysis.

📊 BY THE NUMBERS — CNB collects CICS statistics at 15-minute intervals for all 16 regions. The data is written to SMF, extracted daily by their SMF processing job, and loaded into a DB2 performance warehouse. Kwame's team uses this historical data to identify trends — a gradual increase in QR TCB busy, a slow creep in EUDSA usage, a seasonal pattern in TRANCLASS waits. Trend analysis catches problems before they become incidents.

SMF 110 Records — The Gold Standard

SMF type 110 records are CICS's contribution to the z/OS System Management Facility. There are several sub-types:

  • Type 1 (Transaction Performance) — One record per transaction completion. Contains response time, CPU time, wait times (broken down by wait type), storage used, DB2 call counts, file I/O counts, and hundreds of other fields. This is the richest source of individual transaction performance data.
  • Type 2 (System Exception) — Written for exceptional events: MAXT, SOS, program abends.
  • Type 3 (Statistics) — Interval, requested, and shutdown statistics.

SMF 110 Type 1 records are the foundation of CICS performance analysis. Every commercial CICS monitor (IBM CICS Performance Analyzer, Broadcom CICS Detector, BMC MainView) reads these records. But even without commercial tools, you can analyze them with SAS, Python (via z/OS Connect or SFTP), or a COBOL program that reads the SMF dataset.

At CNB, every SMF 110 Type 1 record for the top 50 transactions (by volume) is retained for 90 days. This allows Kwame's team to answer questions like: "What was the average response time for transaction XFER at 2 PM last Tuesday?" — which is exactly the question they needed to answer during the MAXT incident.
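Answering that kind of question from retained records takes only a few lines once the SMF data is decoded. A sketch that assumes the Type 1 records have already been parsed into dicts — the field names here are illustrative, not real SMF 110 field names:

```python
from datetime import datetime

def avg_response_ms(records, tran, start, end):
    """Average response time for one transaction ID within a time window."""
    samples = [r["response_ms"] for r in records
               if r["tran"] == tran and start <= r["stop_time"] < end]
    return sum(samples) / len(samples) if samples else None

# Decoded SMF 110 Type 1 records -- the dict shape is for illustration only
records = [
    {"tran": "XFER", "stop_time": datetime(2026, 3, 10, 14, 1), "response_ms": 180},
    {"tran": "XFER", "stop_time": datetime(2026, 3, 10, 14, 5), "response_ms": 220},
    {"tran": "BALQ", "stop_time": datetime(2026, 3, 10, 14, 2), "response_ms": 90},
]
window = (datetime(2026, 3, 10, 14, 0), datetime(2026, 3, 10, 15, 0))
print(avg_response_ms(records, "XFER", *window))  # 200.0
```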

CEDF — The Interactive Debugger

CEDF (CICS Execution Diagnostic Facility) is the interactive debugger for CICS transactions. It intercepts every EXEC CICS command and displays the command, its parameters, and the result (including the EIBRESP and EIBRESP2 codes).

For performance analysis, CEDF reveals:

  • Which EXEC CICS commands a transaction executes (and in what order)
  • The response time of each command (not built in, but you can note timestamps)
  • Whether commands are succeeding or failing
  • The actual data being passed in channels, containers, and COMMAREAs

CEDF is invaluable for diagnosing individual transaction performance problems, but it has limitations:

  • It can only intercept one transaction at a time
  • It adds overhead (interception, display, user interaction)
  • It does not capture DB2 internals (use DB2 trace for that)
  • It requires a terminal session — not suitable for automated monitoring

⚠️ COMMON PITFALL — Never leave CEDF enabled on a production transaction without a clear diagnostic purpose and an exit plan. CEDF intercepts every EXEC CICS command, which adds processing overhead. In a high-volume transaction, CEDF can itself cause performance degradation.

Auxiliary Trace — The Deep Dive

When CEDF is not sufficient and you need a detailed trace of CICS internal processing, auxiliary trace provides a sequential record of every CICS operation — dispatching, storage management, file control, program control, terminal control, and all EXEC CICS commands.

Auxiliary trace is written to a CICS auxiliary trace dataset (DFHAUXT). It can be activated and deactivated dynamically:

CEMT SET AUXTRACE START

The trace can be filtered by transaction, by domain (e.g., only file control operations), or by component. Without filtering, auxiliary trace generates enormous volumes of data — gigabytes per hour in a busy region. Always filter.

For performance diagnosis, the most useful trace entries are:

  • Dispatcher entries — Show task state transitions (running, waiting, ready)
  • File control entries — Show VSAM I/O timing
  • DB2 interface entries — Show thread allocation, SQL preparation, and execution
  • Storage manager entries — Show GETMAIN/FREEMAIN activity

Rob's team at CNB maintains a "trace kit" — a set of pre-built CEMT commands for common diagnostic scenarios (trace transaction XFER only, trace DB2 domain only, etc.). During an incident, the on-call engineer can activate the appropriate trace within seconds.

The EIB — Transaction-Level Diagnostics

Every CICS task has an EIB (Execute Interface Block) that contains diagnostic information about the most recent EXEC CICS command. The relevant fields for performance work:

  • EIBRESP / EIBRESP2 — Response code from the last command. Non-zero responses indicate errors that may be causing retries or fallback logic — a common hidden performance problem.
  • EIBTIME / EIBDATE — Current time and date. Can be captured at transaction start and end to measure elapsed time within the program.
  • EIBTASKN — Task number. Useful for correlating trace entries.
  • EIBTRNID — Transaction ID. Useful for filtering.

Programs can capture EIB data and write it to a performance log — a lightweight, application-level monitoring mechanism that does not require CICS trace or CMF:

       EXEC CICS ASKTIME
            ABSTIME(WS-START-TIME)
       END-EXEC.

      * ... business logic ...

       EXEC CICS ASKTIME
            ABSTIME(WS-END-TIME)
       END-EXEC.

      * ABSTIME is returned in milliseconds (ms since 1900-01-01),
      * so the difference is already an elapsed time in milliseconds.
       COMPUTE WS-ELAPSED-MS =
           WS-END-TIME - WS-START-TIME.

       IF WS-ELAPSED-MS > WS-THRESHOLD-MS
           PERFORM WRITE-PERF-LOG
       END-IF.

This pattern — instrument the program to detect its own slow execution — is what Kwame calls "self-aware transactions." At CNB, every critical transaction has a built-in performance threshold. If the transaction exceeds its threshold, it writes a record to a CICS TD queue that triggers an alert.

💡 INSIGHT — The best performance monitoring systems combine top-down data (CMF/SMF 110 aggregates) with bottom-up data (application-level instrumentation). Top-down tells you what is slow. Bottom-up tells you why. Neither alone is sufficient.


17.6 Diagnosing Production Performance Problems

Theory is useful. War stories are better. Let me walk you through four diagnostic scenarios that represent 80% of the production CICS performance problems I have encountered in 25 years.

Scenario 1: Response Time Spike — Is It CICS or DB2?

Symptom: Transaction XFER average response time jumps from 50ms to 800ms at 10:15 AM. No MAXT. No SOS.

Diagnostic steps:

  1. Pull SMF 110 Type 1 records for XFER. Compare the 10:00–10:15 period (normal) with 10:15–10:30 (degraded). Look at the wait-time breakdown.

  2. If dispatcher wait time increased: The region's QR TCB is saturated. Check QR TCB busy percentage. A new high-volume transaction, a CPU-bound loop, or a THREADSAFE regression (a program that was THREADSAFE but was recompiled without the attribute) can cause this.

  3. If DB2 wait time increased: The problem is in DB2, not CICS. Check DB2 performance data — was there a lock timeout spike? Did RUNSTATS make the optimizer choose a new access path? Did a tablespace REORG cause drain locks?

  4. If file I/O wait time increased: Check VSAM LSR pool statistics. Was the pool hit ratio degraded? Was there a DASD volume switch or a dataset migration to ML2?

  5. If MRO wait time increased: A downstream region (FOR, another AOR) is slow. The problem cascades back through MRO.

At CNB, this triage takes Rob's team less than 5 minutes using their SMF 110 dashboard. The wait-time breakdown is the single most diagnostic field in the SMF 110 Type 1 record.
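The triage sequence above can be sketched as code: compare the wait-time breakdowns from the normal and degraded periods, and name the component whose wait grew the most. This is a sketch only; the field names (`dispatch_wait`, `db2_wait`, and so on) are illustrative stand-ins, not the actual SMF 110 dictionary names, and the hint table condenses steps 2 through 5.

```python
# Illustrative triage of an SMF 110 wait-time breakdown.
# Field names and values are hypothetical, not real SMF field names.

WAIT_HINTS = {
    "dispatch_wait": "QR TCB saturation: check TCB busy %, THREADSAFE regressions",
    "db2_wait": "DB2-side problem: check lock timeouts, access paths, REORG drains",
    "file_wait": "VSAM/DASD: check LSR pool hit ratio, volume switches, ML2 recalls",
    "mro_wait": "Downstream region (FOR or another AOR) is slow: follow the MRO link",
}

def triage(baseline_ms, degraded_ms):
    """Return (wait_type, delta_ms, hint) for the largest wait-time growth."""
    deltas = {k: degraded_ms[k] - baseline_ms.get(k, 0.0) for k in degraded_ms}
    worst = max(deltas, key=deltas.get)
    return worst, deltas[worst], WAIT_HINTS.get(worst, "unclassified")

# XFER example from the scenario: 50 ms -> 800 ms, dominated by DB2 wait
baseline = {"dispatch_wait": 5, "db2_wait": 20, "file_wait": 10, "mro_wait": 5}
degraded = {"dispatch_wait": 8, "db2_wait": 740, "file_wait": 12, "mro_wait": 6}
wait_type, delta, hint = triage(baseline, degraded)
print(wait_type, delta)   # db2_wait 720
```

The point of the sketch is the decision structure, not the numbers: whichever wait category grew the most tells you which of steps 2 through 5 to pursue.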

Scenario 2: Gradual Degradation Over Hours

Symptom: Response times for all transactions in a region gradually increase from 50ms to 200ms over 6 hours. No sudden event.

Diagnostic steps:

  1. Check EUDSA usage over time. A storage leak manifests as a slow, steady increase in EUDSA allocation without a corresponding increase in task count. Plot EUDSA usage and task count on the same timeline. If EUDSA rises while tasks remain stable, you have a leak.

  2. Identify the leaking program. Use CICS statistics to find which program's storage allocation is growing. A program that GETMAINs without FREEMAINing, or that repeatedly opens containers without closing them, will show increasing storage attribution.

  3. Check for fragmentation. Even without a leak, heavy GETMAIN/FREEMAIN activity can fragment EUDSA. The total free storage may be adequate, but no single contiguous block is large enough for a new allocation. CICS's storage compression (program compression) attempts to defragment, but it adds overhead that further degrades performance.

  4. Check program compression count. An increasing program compression count indicates CICS is struggling to find storage. This is a leading indicator of SOS.

🧪 DIAGNOSTIC TECHNIQUE — To identify a storage leak in production, take CICS storage snapshots (CEMT I DSAS) at 30-minute intervals during the degradation. Plot EUDSA current allocation over time. If it is monotonically increasing, you have a leak. Cross-reference with CICS program statistics to identify which program's storage is growing. Then review that program's GETMAIN/FREEMAIN pairs.
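The snapshot test in that technique can be expressed as a minimal check: storage rising monotonically while the task count stays flat. The numbers below are illustrative (EUDSA in MB, sampled at 30-minute intervals); the 10% "stable tasks" band is an assumed tolerance, not a CICS-defined threshold.

```python
# Sketch of the leak-detection heuristic: EUDSA climbing while the
# concurrent task count stays flat suggests a leak, not load growth.

def looks_like_leak(eudsa_mb, task_counts, min_growth_mb=10):
    """True if storage rises monotonically while task counts stay within 10%."""
    storage_rising = (
        all(b > a for a, b in zip(eudsa_mb, eudsa_mb[1:]))
        and (eudsa_mb[-1] - eudsa_mb[0]) >= min_growth_mb
    )
    tasks_stable = max(task_counts) - min(task_counts) <= 0.1 * max(task_counts)
    return storage_rising and tasks_stable

eudsa = [120, 155, 190, 228, 261]      # climbing steadily (MB, illustrative)
tasks = [140, 145, 138, 142, 141]      # flat concurrent task counts
print(looks_like_leak(eudsa, tasks))   # True
```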

Scenario 2b: The Pinnacle Health Storage Leak

Diane Okoye at Pinnacle Health encountered a textbook storage leak in their claims adjudication AOR. Response times for all transactions gradually degraded over 8–10 hours, resetting only when the region was recycled overnight.

The culprit: a claims validation program that used EXEC CICS GETMAIN to allocate a 32 KB buffer for each claim's attachment data. The program processed the attachment, but the FREEMAIN was inside a paragraph that was only executed on the "happy path." If the claim failed validation — approximately 12% of claims — the program branched to an error-handling paragraph that wrote the rejection record but never freed the 32 KB buffer.

At 800 claims per minute with a 12% rejection rate, the leak was:

800 claims/min × 0.12 × 32 KB = 3,072 KB per minute ≈ 51 KB per second ≈ 180 MB per hour

With 400 MB of available EUDSA, the region hit SOS in approximately 2.2 hours after peak load began. The overnight recycle masked the problem for weeks — the region ran clean each morning and degraded each afternoon.
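The same arithmetic, as a quick sanity check using only the figures quoted above:

```python
# Reproducing the Pinnacle Health leak arithmetic: claims per minute,
# rejection rate, and leaked buffer size give the leak rate and time to SOS.

claims_per_min = 800
reject_rate = 0.12
buffer_kb = 32
eudsa_available_mb = 400

leak_kb_per_min = claims_per_min * reject_rate * buffer_kb   # 3,072 KB/min
leak_mb_per_hour = leak_kb_per_min * 60 / 1024               # 180 MB/hour
hours_to_sos = eudsa_available_mb / leak_mb_per_hour         # ~2.2 hours

print(round(leak_mb_per_hour), round(hours_to_sos, 1))       # 180 2.2
```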

Ahmad Rashidi's audit trail records eventually provided the clue: the rejection count correlated precisely with the EUDSA growth rate. The fix was a single PERFORM FREEMAIN-BUFFER in the error-handling paragraph — a one-line code change that eliminated a daily SOS event.

⚠️ COMMON PITFALL — Storage leaks in CICS are insidious because they reset on region restart. If your region "needs" a nightly recycle to stay healthy, you almost certainly have a storage leak. A healthy CICS region should run indefinitely without storage growth beyond normal program loading.

Scenario 3: Periodic Spikes at Predictable Intervals

Symptom: Every 15 minutes, response times spike for 30 seconds, then return to normal.

Diagnostic steps:

  1. Check for CICS statistics collection. If CICS statistics are collected at 15-minute intervals (the default), the statistics collection itself can cause a brief CPU spike. On a saturated region, this is enough to cause a visible blip.

  2. Check for log archival. CICS system log archival (triggered by log stream buffer overflow) can cause periodic I/O spikes.

  3. Check for coupling facility activity. If the region uses shared temporary storage or named counters, coupling facility structure rebuild or alter operations can cause periodic pauses.

  4. Check for external schedulers. Is a batch job or external monitoring tool hitting the region at 15-minute intervals? A health check that runs a CICS transaction every 15 minutes is benign; a monitoring tool that runs 50 EXEC CICS INQUIRE commands every 15 minutes is not.
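Before walking the checklist above, it helps to confirm the interval itself from monitoring data. A sketch with illustrative per-minute P95 samples; real data is noisy, so a production version would need a tolerance band rather than the exact-gap match used here:

```python
# Detect a fixed-interval spike pattern from per-minute P95 samples.
# Values are illustrative; the exact-gap check is a simplification.

def spike_period_minutes(p95_by_minute, threshold_ms):
    """Return the spike interval in minutes if spikes are perfectly regular."""
    spikes = [i for i, v in enumerate(p95_by_minute) if v > threshold_ms]
    gaps = [b - a for a, b in zip(spikes, spikes[1:])]
    if gaps and len(set(gaps)) == 1:   # perfectly regular spacing
        return gaps[0]
    return None

# 60 minutes of samples: 50 ms baseline, 400 ms spikes at minutes 0, 15, 30, 45
samples = [400 if m % 15 == 0 else 50 for m in range(60)]
print(spike_period_minutes(samples, threshold_ms=200))   # 15
```

A 15-minute period points at checks 1 and 4 first (statistics intervals and external schedulers), since both commonly default to 15 minutes.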

Scenario 4: Sudden MAXT After a Code Deployment

Symptom: A new version of transaction INQR is deployed at 2 PM. By 2:30 PM, the region hits MAXT. The previous version of INQR had no issues.

Diagnostic steps:

  1. Compare the new code to the old code. Look for:
     • New DB2 calls (more calls = longer elapsed time = more concurrent tasks)
     • Changed GETMAIN sizes (larger working storage = more storage per task)
     • A removed THREADSAFE attribute (regression from L8 to QR TCB)
     • New EXEC CICS LINK calls (DPL to remote regions adds MRO wait time)

  2. Pull SMF 110 for INQR before and after deployment. Compare elapsed time, CPU time, and wait-time breakdown.

  3. Check if the new program loaded successfully. A NEWCOPY that fails to load correctly can cause every transaction to abend and retry, creating a cascade.

  4. Compare working storage size. If the new program's WORKING-STORAGE grew from 200 KB to 2 MB, the per-task storage consumption increased 10x. At 200 concurrent INQR tasks, that is 360 MB of additional EUDSA — likely pushing the region into SOS territory.

🔴 PRODUCTION RULE — Every code deployment to a CICS production region should include a pre-deployment checklist: working storage size comparison (old vs. new), THREADSAFE attribute verification, DB2 call count comparison (via code inspection or test trace), and a rollback plan that can execute in under 5 minutes. At CNB, this checklist is part of the change management process — no CICS NEWCOPY without it.


17.7 Capacity Planning for CICS Regions

Performance tuning is reactive — you fix what is broken. Capacity planning is proactive — you ensure nothing breaks in the first place. For a Tier-1 bank processing 500 million transactions per day, the distinction is measured in millions of dollars.

The Capacity Planning Cycle

Kwame runs a quarterly capacity planning cycle for CNB's CICS environment:

1. Collect baseline metrics (Month 1 of quarter)
   • Peak TPS per AOR (from SMF 110 aggregates)
   • Average and P95 response time per transaction type
   • Peak task count per region (from CICS statistics)
   • Peak EUDSA usage per region
   • QR TCB busy percentage at peak
   • DB2 thread usage at peak

2. Project growth (Month 2)
   • Business forecast: projected transaction growth (from the business unit, typically 10–15% annually for CNB)
   • New features: any new transactions planned for deployment (from the development team)
   • Seasonal factors: year-end, tax season, promotional campaigns
   • Technology changes: CICS version upgrades, DB2 version upgrades, hardware refreshes

3. Model and validate (Month 3)
   • Apply growth projections to baseline metrics
   • Identify bottlenecks: which resource hits its limit first?
   • Validate with the performance test environment (replay scaled-up traffic)
   • Present findings to the architecture review board with specific recommendations

The Bottleneck Sequence

In CNB's experience, the bottlenecks hit in this order as transaction volume increases:

  1. DB2 thread limits (CMDT) — Threads exhaust before MXT because most transactions use DB2.
  2. MXT — Task accumulation hits the MXT limit.
  3. EUDSA — Storage exhaustion from too many concurrent tasks with large working storage.
  4. QR TCB saturation — The dispatcher cannot keep up (only if THREADSAFE is not fully deployed).
  5. CPU — The LPAR's processor capacity is exhausted (typically the last bottleneck, thanks to z/OS efficiency).

The order matters for planning. If your next bottleneck is CMDT, the fix is increasing CMDT (or reducing DB2 elapsed time through SQL tuning). If it is MXT, the fix may be adding another AOR (horizontal scaling) rather than raising MXT (vertical scaling). If it is EUDSA, the fix is reducing per-task storage or increasing EDSALIM.

Modeling the Sandra Chen Problem

Sandra Chen at Federal Benefits Administration faces a different capacity challenge. FBA's CICS regions process benefits calculations that vary dramatically by season. January through March (tax season) sees 3x normal volume as citizens verify benefits for tax filing. The annual open enrollment period in November creates another 2.5x spike.

Sandra's capacity plan must accommodate these seasonal spikes without over-provisioning for the 9 months of normal traffic. Her approach:

Elastic capacity model: FBA maintains 2 "hot standby" AOR regions that are started and added to the CICSPlex SM workload group during peak seasons. These regions are fully configured but normally quiesced. Starting them takes 90 seconds; adding them to the routing group takes another 30 seconds. Total time from "we need more capacity" to "capacity is online": under 3 minutes.

Pre-emptive scaling: FBA's operational calendar marks peak periods. The hot standby regions are activated one week before the expected peak and deactivated one week after. This is not reactive scaling — it is planned scaling based on known business cycles.

The Marcus Whitfield factor: Marcus, retiring in 2 years, is the only person who knows the JCL and procedures for starting the hot standby regions. Sandra has documented the procedures and cross-trained two team members — but this is a knowledge transfer challenge as much as a technical one. The hot standby regions are an operational capability that must be preserved beyond any single person's tenure.

Horizontal vs. Vertical Scaling

Vertical scaling — Increasing limits within a single region (raise MXT, raise EDSALIM, raise CMDT). Cheaper but has diminishing returns. Each limit increase moves you closer to the next bottleneck.

Horizontal scaling — Adding another AOR and distributing workload via dynamic routing. More expensive (another region to manage, another set of resources) but provides linear scalability and improved failure isolation.

Kwame's rule of thumb: "Vertical scale until you can't. Horizontal scale when you must. But plan for horizontal scaling 6 months before you need it, because deploying a new AOR in production is a 4-week change management process."

CNB's scaling triggers:

| Metric | Vertical Tuning | Horizontal Scaling |
|---|---|---|
| Peak tasks | > 70% of MXT | > 85% of MXT for 2 consecutive quarters |
| EUDSA | > 70% of EDSALIM | > 80% sustained after EDSALIM increase |
| QR TCB busy | > 70% | > 85% after THREADSAFE optimization |
| Response time | P95 > 80% of SLA | P95 > 90% of SLA after tuning |

Projecting Into the Future

The simplest capacity model for CICS is linear extrapolation:

Future_Resource = Current_Resource × (1 + Annual_Growth_Rate) ^ Years

For CNB's core banking AOR:

  • Current peak tasks: 175 (out of MXT 250)
  • Annual growth rate: 12%
  • Year 1: 175 × 1.12 = 196 tasks
  • Year 2: 175 × 1.12^2 = 220 tasks
  • Year 3: 175 × 1.12^3 = 246 tasks — within 2% of MXT

This tells Kwame that within 3 years, the current two-AOR-per-LPAR configuration will hit its limits. The plan: deploy a third AOR per LPAR in Year 2, before the limit is reached.
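The extrapolation is simple enough to script. The sketch below uses the figures above and CNB's 85%-of-MXT horizontal-scaling trigger from the table earlier in this section; the loop finds the first year the projection crosses that threshold.

```python
# Linear extrapolation of peak task count against a planning threshold.
# Figures are the chapter's: 175 peak tasks, 12% annual growth, MXT 250.

def years_until(current, annual_growth, limit):
    """Number of whole years of growth until `current` first exceeds `limit`."""
    years = 0
    while current <= limit:
        current *= 1 + annual_growth
        years += 1
    return years

peak_tasks, mxt = 175, 250
print(round(peak_tasks * 1.12 ** 3))              # Year 3 projection: 246
print(years_until(peak_tasks, 0.12, 0.85 * mxt))  # first year above 85% of MXT: 2
```

The second result matches the plan in the text: the 85% trigger fires in Year 2, which is when the third AOR should be deployed.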

But linear extrapolation is the floor of capacity planning. Kwame's team also models:

  • Spike scenarios: What if Black Friday transaction volume doubles? Can the topology absorb a 2x spike with existing headroom?
  • Failure scenarios: If one AOR fails and its workload shifts to the surviving AOR, does the surviving AOR hit MAXT? (It should not, if MXT was sized with a 2x safety factor for this exact reason.)
  • Technology shifts: If SecureFirst's mobile banking integration doubles API transaction volume, how does that change the profile?

💡 INSIGHT — The best capacity plans include a failure scenario model. If your region cannot absorb the workload of a failed peer, you are one failure away from a cascading outage. This is the connection between capacity planning and high availability — they are the same discipline.

WLM and CICS Performance Integration

Recall from Chapter 5 that z/OS Workload Manager (WLM) assigns dispatching priorities to address spaces based on service class goals. CICS regions are WLM service classes. The interaction between WLM and CICS performance is bidirectional:

WLM affects CICS — A CICS region in a high-priority WLM service class gets more CPU dispatching priority. If the region is CPU-bound, this directly improves response times. If the region is I/O-bound, WLM priority has minimal impact (the region is waiting for I/O, not for CPU).

CICS affects WLM — CICS reports transaction response times to WLM through its work manager interface. WLM uses these response times to evaluate whether the service class is meeting its goals. If transactions miss their goals, WLM increases the region's dispatching priority — but only if the miss is due to CPU contention (WLM can influence CPU dispatching but not I/O speed).

The practical implication: ensure your CICS regions are in the correct WLM service classes with appropriate goals. (Note that velocity goals are expressed as percentages; a millisecond target for a percentage of transactions is a percentile response-time goal.) At CNB:

  • Core banking AOR transactions: percentile response-time goal — 200ms for 95% of transactions
  • Web banking AOR transactions: percentile response-time goal — 500ms for 90%
  • Batch-initiated CICS work: DISCRETIONARY (lowest priority, runs when resources are available)

A misconfigured WLM service class — for example, putting a core banking AOR in a DISCRETIONARY class — can cause severe performance degradation that is invisible from within CICS. The region looks healthy (no MAXT, no SOS) but transactions are slow because z/OS is not giving the region enough CPU.

⚠️ COMMON PITFALL — If your CICS performance looks clean from CICS statistics but transactions are still slow, check WLM. A region in the wrong service class, or a service class with an inappropriate goal, is one of the hardest CICS performance problems to diagnose because all the CICS-level metrics look normal.

The SecureFirst Monitoring Dashboard

Yuki Nakamura at SecureFirst Retail Bank built a lightweight monitoring approach that demonstrates the principle of layered observability. Rather than purchasing a commercial CICS monitor (which the budget did not support), Yuki implemented three tiers:

Tier 1 — Real-time pulse (every 60 seconds): A CICS transaction (HLTH) runs on a timer and captures five critical metrics via EXEC CICS INQUIRE commands: active task count, peak task count since last check, EUDSA current allocation, QR TCB busy percentage (computed from dispatcher statistics), and DB2 thread count. These five numbers are written to a CICS TS queue and also sent to a Splunk forwarder via a z/OS Connect REST call.

Tier 2 — Transaction-level instrumentation (continuous): The self-aware transaction pattern (Section 17.5) is deployed on all three mobile API transactions. Alert records written to TD queue PFAL are forwarded to Splunk. Yuki set up Splunk dashboards showing P50, P95, and P99 response times in 5-minute rolling windows.

Tier 3 — Historical trending (weekly): SMF 110 Type 1 records are extracted weekly by a batch job, loaded into a DB2 analytics table, and analyzed via a Python script that runs on Carlos's laptop via z/OS Connect. The script produces trend charts for each transaction type showing response time, CPU time, and wait-time breakdown over the previous 4 weeks.

The total cost of this monitoring system: zero dollars in software licensing. Approximately 3 weeks of Yuki's time to build. The ongoing overhead: less than 0.1% of the region's CPU capacity.

💡 INSIGHT — You do not need a six-figure monitoring product to have effective CICS performance visibility. CICS provides the data (statistics, CMF, SMF 110, EXEC CICS INQUIRE). You need a plan for collecting it, a plan for alerting on it, and a plan for trending it. The best monitoring system is the one your team actually looks at every day.


17.8 Progressive Project: HA Banking System — CICS Tuning

🏗️ PROJECT CHECKPOINT — In Chapter 13, you designed your HA banking system's CICS region topology (TOR/AOR/FOR with MRO). In Chapter 15, you designed the channel/container architecture for the multi-step transfer transaction. In Chapter 16, you designed the CICS security model. Now you will tune that topology for performance.

Your HA banking transaction processing system handles three primary CICS transactions:

  1. BINQ — Balance inquiry. Reads one DB2 row. Average response time: 30ms. Working storage: 150 KB.
  2. XFER — Fund transfer. Reads 2 DB2 rows, writes 3 DB2 rows, writes 1 audit record. Average response time: 80ms. Working storage: 300 KB.
  3. STMT — Statement generation. Reads 50–500 DB2 rows. Average response time: 2 seconds. Working storage: 800 KB.

Peak transaction mix per AOR: BINQ 1,500 TPS, XFER 400 TPS, STMT 20 TPS.

Task 1: Calculate MXT

Using the formula from Section 17.1, calculate the required MXT for a single AOR:

MXT = SUM(TPS_i × Avg_Response_i) × Safety_Factor

For each transaction type:

  • BINQ: 1,500 × 0.030 = 45 concurrent tasks
  • XFER: 400 × 0.080 = 32 concurrent tasks
  • STMT: 20 × 2.0 = 40 concurrent tasks

Total: 117 concurrent tasks × 2.0 safety factor = 234

Set MXT to 250 (round up for operational headroom).

Task 2: Calculate Storage-Bounded MXT

Estimate per-task storage using weighted average:

  • BINQ: 150 KB × (45/117) = 57.7 KB weighted
  • XFER: 300 KB × (32/117) = 82.1 KB weighted
  • STMT: 800 KB × (40/117) = 273.5 KB weighted

Average per-task storage: ~414 KB (including TCA overhead, round to 450 KB).

With 900 MB EDSALIM and approximately 400 MB available after CICS kernel (conservative):

Storage_Bounded_MXT = 400,000 KB / 450 KB = 889

MXT of 250 is well within the storage-bounded limit. Good.

Task 3: Design TRANCLASS Configuration

Define three transaction classes:

DEFINE TRANCLASS(CLSRTL)  MAXACTIVE(100)    *> BINQ, XFER
DEFINE TRANCLASS(CLSBULK) MAXACTIVE(50)     *> STMT
DEFINE TRANCLASS(CLSSYS)  MAXACTIVE(20)     *> System tasks

Task 4: Calculate CMDT

CMDT = Peak_TPS_DB2 × Fraction_DB2 × Avg_DB2_Elapsed × Safety
     = 1,920 × 1.0 × 0.012 × 1.5
     = ~35

Set CMDT to 50 for headroom.
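The arithmetic for Tasks 1, 2, and 4 can be sanity-checked in a few lines. All inputs are the chapter's figures; the 400 MB of EDSALIM available after the CICS kernel is the chapter's stated conservative assumption, and the 450 KB per-task figure is the weighted average rounded up for TCA overhead.

```python
# Sanity check of the MXT, storage-bounded MXT, and CMDT sizing above.

mix = {  # transaction: (TPS, avg response seconds, working storage KB)
    "BINQ": (1500, 0.030, 150),
    "XFER": (400, 0.080, 300),
    "STMT": (20, 2.000, 800),
}

# Task 1: concurrency-based MXT (Little's law per transaction, then sum)
concurrent = {tx: tps * rt for tx, (tps, rt, _) in mix.items()}
total = sum(concurrent.values())          # 117 concurrent tasks
mxt = total * 2.0                         # 234, set MXT to 250

# Task 2: storage-bounded MXT (weighted per-task storage, rounded to 450 KB)
weighted_kb = sum(ws * concurrent[tx] / total
                  for tx, (_, _, ws) in mix.items())   # ~413 KB/task
storage_bounded_mxt = 400_000 / 450       # ~889, well above MXT 250

# Task 4: CMDT from DB2 TPS, DB2 fraction, avg DB2 elapsed, and safety factor
cmdt = 1920 * 1.0 * 0.012 * 1.5           # ~35, set CMDT to 50

print(round(total), round(mxt), round(weighted_kb),
      round(storage_bounded_mxt), round(cmdt))   # 117 234 413 889 35
```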

Task 5: Design Monitoring

Implement the self-aware transaction pattern from Section 17.5 for all three transactions. Define performance thresholds:

  • BINQ: alert if elapsed > 200ms (6.7x normal)
  • XFER: alert if elapsed > 500ms (6.25x normal)
  • STMT: alert if elapsed > 10 seconds (5x normal)

Write performance records to a CICS TD queue. Create an alerting transaction that reads the queue and triggers operator notifications.

See code/project-checkpoint.md for the detailed implementation specification.


Summary

CICS performance tuning is a discipline built on measurement, not intuition. The key principles:

  1. MAXT is a symptom, not a cause. When you hit MAXT, the question is not "what should MXT be?" but "why are tasks accumulating?" The answer is always in the wait-time breakdown.

  2. The dispatcher is cooperative, not preemptive. A single CPU-bound program can monopolize the QR TCB and starve an entire region. THREADSAFE programs on L8 TCBs are the most important performance optimization in modern CICS.

  3. Storage exhaustion is worse than MAXT. SOS degrades all tasks, not just new ones. Size EDSALIM to actual need plus 30%, and ensure MXT does not exceed the storage-bounded limit.

  4. TRANCLASS prevents starvation. Without transaction classes, any workload can consume all available tasks. With transaction classes, critical transactions are protected.

  5. Measure everything, tune selectively. SMF 110 Type 1 records, CICS statistics, and application-level instrumentation provide the data. Change one parameter at a time and measure the impact.

  6. Capacity planning prevents incidents. The cheapest outage is the one that never happens. A quarterly capacity planning cycle that projects growth, models failure scenarios, and validates with performance testing keeps you ahead of the curve.

📊 CHAPTER METRICS — This chapter covered the CICS performance model from dispatcher through storage through capacity planning. In the next chapter (Chapter 18: CICS Failure and Recovery), we will shift from performance to resilience — designing CICS systems that recover from failures without losing data or breaking SLAs. The skills from this chapter — particularly MAXT diagnosis and storage monitoring — are prerequisites for understanding how CICS detects and recovers from failure conditions.


Spaced Review Questions

🔄 FROM CHAPTER 5 (WLM): How does z/OS WLM's velocity goal for a CICS service class interact with CICS's internal task priority? If a region's transactions are missing their WLM velocity goal due to DB2 wait time (not CPU contention), what can WLM do? What can it not do?

🔄 FROM CHAPTER 13 (CICS Architecture): In a topology with 2 TORs and 4 AORs with dynamic routing, how does TRANCLASS interact with CICSPlex SM workload management? If one AOR's CLSBULK class is full but another AOR has capacity, does CICSPlex SM route the transaction to the available AOR?

🔄 FROM CHAPTER 15 (Channels/Containers): When a program passes a 2 MB container via a channel to a LINKed program, where is that 2 MB stored in CICS memory? How does this impact EUDSA sizing for the region? Compare the storage impact of a 2 MB COMMAREA versus a 2 MB channel.