Appendix E: Capacity Planning Formulas and Worksheets

Capacity planning is the discipline of ensuring your mainframe has enough horsepower to run tomorrow's workload, not just today's. For COBOL developers, it means understanding how your code consumes resources — CPU, I/O, memory, and time — and how those consumption patterns translate into infrastructure costs. This appendix provides the formulas, rules of thumb, and worked examples you need to participate intelligently in capacity conversations with systems programmers and management.

You do not need to become a capacity planner. You do need to understand these concepts well enough to answer questions like: "If we double the input volume, how much longer will the batch window take?" and "What happens to CICS response time if we add 500 concurrent users?"


MSU Calculations

What Is an MSU?

MSU (Million Service Units) is IBM's measure of mainframe processing capacity. It is used for software licensing (many IBM products are priced by MSU), workload management, and capacity reporting. One MSU equals one million "service units" per hour of processing capacity.

Key Relationships

MSU Rating = Machine capacity (from IBM's LSPR tables)
CPU Utilization = (Used CPU seconds / Available CPU seconds) × 100
Used MSU = MSU Rating × CPU Utilization (expressed as a fraction)
Rolling 4-Hour Average (R4HA) = MSU consumption averaged over each
                                rolling 4-hour window

IBM software pricing (Sub-Capacity Reporting Tool, SCRT) is based on the R4HA — the highest rolling 4-hour average of MSU consumption within the reporting period. This makes peak management a financial concern, not just a performance concern.

Worked Example: MSU Impact of a New Batch Job

Scenario: Your z16 LPAR is rated at 400 MSU. Current peak utilization is 65%. A new batch job consumes 120 CPU seconds per run in a 4-hour peak window.

Current peak MSU consumption:
  400 MSU × 0.65 = 260 MSU

New job's MSU consumption:
  CPU time = 120 seconds
  In a 4-hour window (14,400 seconds), additional utilization:
  120 / 14,400 = 0.0083 = 0.83%

  Additional MSU = 400 × 0.0083 = 3.33 MSU

New peak: 260 + 3.33 = 263.33 MSU
New utilization: 263.33 / 400 = 65.83%

At an illustrative cost of $1,500/MSU/month for IBM software licensing, that 3.33 MSU increase costs roughly $5,000/month. This is why performance tuning COBOL batch programs has direct financial impact.
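The arithmetic above can be packaged as a small helper. This is a sketch of the worked example, and the $1,500/MSU/month rate is illustrative (as in the text), not a real price:

```python
def msu_impact(lpar_msu, peak_util, job_cpu_seconds, window_seconds=4 * 3600):
    """Return (added MSU, new utilization) when a job is added to a peak window.

    peak_util is a fraction (0.65 for 65%); the added utilization is the
    job's CPU seconds as a share of the 4-hour window.
    """
    added_util = job_cpu_seconds / window_seconds
    added_msu = lpar_msu * added_util
    return added_msu, peak_util + added_util

added, util = msu_impact(lpar_msu=400, peak_util=0.65, job_cpu_seconds=120)
print(f"Added MSU: {added:.2f}")                       # ~3.33 MSU
print(f"New utilization: {util:.2%}")                  # ~65.83%
print(f"Monthly cost at $1,500/MSU: ${added * 1500:,.0f}")
```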

MSU Estimation for Code Changes

When estimating the MSU impact of a code change:

CPU time change = (New path length / Old path length) × Old CPU time

Where path length is estimated from:
  - Number of COBOL statements executed
  - Number of DB2 calls
  - Number of I/O operations
  - Compiler optimization level

Rule of thumb: A single DB2 SQL call costs roughly 10,000 to 100,000 instructions, depending on complexity. A COBOL MOVE costs roughly 5-20 instructions. Adding one SQL call per record to a million-record batch job has a massively disproportionate impact compared to adding ten COBOL statements.
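The disproportion in that rule of thumb is worth quantifying. The sketch below uses midpoint instruction costs (12 for a MOVE, 50,000 for an SQL call) chosen from the ranges above purely for illustration:

```python
# Order-of-magnitude instruction costs, taken from the rules of thumb
# in the text -- not measured values.
COST = {"move": 12, "sql_call": 50_000}

def added_instructions(records, moves_per_rec=0, sql_per_rec=0):
    """Estimate total instructions added by a per-record code change."""
    per_record = moves_per_rec * COST["move"] + sql_per_rec * COST["sql_call"]
    return records * per_record

million = 1_000_000
print(added_instructions(million, moves_per_rec=10))  # 120,000,000
print(added_instructions(million, sql_per_rec=1))     # 50,000,000,000
```

Ten MOVEs per record adds roughly 1.2 × 10^8 instructions; one SQL call per record adds roughly 5 × 10^10 — about 400 times more.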


Throughput Formulas

Records Per Second

Throughput (records/sec) = Total records / Elapsed time (seconds)

CPU per record = Total CPU seconds / Total records

Elapsed per record = Total elapsed seconds / Total records

I/O Rate

I/O rate (EXCP/sec) = Total EXCPs / Elapsed time (seconds)

EXCPs per record = Total EXCPs / Total records

An EXCP (Execute Channel Program) is a single physical I/O operation. Modern DASD subsystems can handle 10,000-50,000 IOPS (I/O operations per second) per device, but I/O response times vary dramatically based on cache hit ratios.

Elapsed Time Estimation

Elapsed Time = MAX(CPU Time, I/O Time, Wait Time)

Where:
  CPU Time = Records × CPU per record
  I/O Time = Records × I/Os per record × Avg I/O response time
  Wait Time = Lock waits + Tape mounts + Other waits

In practice, batch jobs are either CPU-bound or I/O-bound:
  - CPU-bound: Elapsed time is approximately equal to CPU time.
    Optimization means reducing instructions per record.
  - I/O-bound: Elapsed time is much greater than CPU time.
    Optimization means reducing I/O count (larger block sizes,
    buffering, sequential access patterns) or improving I/O response
    time (buffer pool tuning, DS8000 cache).
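The MAX model above can be sketched directly. This is a simplification — real jobs overlap CPU and I/O to varying degrees — and the inputs in the example run are made-up numbers for illustration:

```python
def estimate_elapsed(records, cpu_per_rec, ios_per_rec, io_resp, wait=0.0):
    """Estimate elapsed time as the dominant of CPU, I/O, and wait components.

    Returns (elapsed seconds, which component dominates).
    """
    cpu_time = records * cpu_per_rec
    io_time = records * ios_per_rec * io_resp
    bound = "CPU" if cpu_time >= io_time else "I/O"
    return max(cpu_time, io_time, wait), bound

# Hypothetical job: 1M records, 0.5 ms CPU and 0.2 I/Os per record, 3 ms I/O
elapsed, bound = estimate_elapsed(1_000_000, cpu_per_rec=0.0005,
                                  ios_per_rec=0.2, io_resp=0.003)
print(f"{elapsed:.0f} s, {bound}-bound")   # 600 s, I/O-bound
```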

Worked Example: Estimating Batch Job Duration

Scenario: A new batch job will process 5 million customer records. From a pilot test with 10,000 records:

Pilot results:
  Records: 10,000
  CPU time: 8.5 seconds
  Elapsed: 12.3 seconds
  EXCPs: 2,150

Per-record metrics:
  CPU per record: 8.5 / 10,000 = 0.00085 seconds
  Elapsed per record: 12.3 / 10,000 = 0.00123 seconds
  EXCPs per record: 2,150 / 10,000 = 0.215

Full-volume projection:
  CPU time: 5,000,000 × 0.00085 = 4,250 seconds = 70.8 minutes
  EXCPs: 5,000,000 × 0.215 = 1,075,000
  Elapsed (estimate): 5,000,000 × 0.00123 = 6,150 seconds = 102.5 minutes

Important caveat: Linear scaling assumes no contention effects. At full volume, contention for CPU, I/O channels, DB2 locks, and buffer pools will likely increase elapsed time beyond the linear projection. Apply a contention factor of 1.2x to 1.5x for conservative estimates:

Adjusted elapsed: 102.5 × 1.3 = 133.25 minutes ≈ 2 hours 13 minutes
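The pilot-to-full-volume projection above, with the contention factor applied to elapsed time only (CPU tends to scale linearly more reliably than elapsed time):

```python
def project(pilot_records, pilot_cpu, pilot_elapsed, full_records,
            contention=1.3):
    """Linearly scale pilot measurements to full volume.

    The contention factor (1.2-1.5 per the text) inflates elapsed time
    to account for CPU, channel, lock, and buffer-pool contention.
    """
    scale = full_records / pilot_records
    return pilot_cpu * scale, pilot_elapsed * scale * contention

cpu_s, elapsed_s = project(10_000, 8.5, 12.3, 5_000_000)
print(f"CPU:     {cpu_s / 60:.1f} min")    # ~70.8 min
print(f"Elapsed: {elapsed_s / 60:.2f} min")  # ~133.25 min
```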

Queueing Theory Basics

Queueing theory is the mathematical framework for understanding how contention affects response time. You do not need to derive these formulas. You need to understand what they tell you about system behavior.

Utilization and Response Time

The fundamental relationship (for a single-server queue with random arrivals and service times — the M/M/1 model):

Response Time = Service Time / (1 - Utilization)

Where:
  Service Time = time to process one request with no waiting
  Utilization = Arrival Rate × Service Time (must be < 1.0)

This formula explains the "hockey stick" curve: response time increases slowly as utilization rises from 0% to 50%, moderately from 50% to 75%, and explosively above 80%.

Utilization   Response Time Factor   Example (10 ms service)
   10%              1.11x                  11.1 ms
   30%              1.43x                  14.3 ms
   50%              2.00x                  20.0 ms
   70%              3.33x                  33.3 ms
   80%              5.00x                  50.0 ms
   90%             10.00x                 100.0 ms
   95%             20.00x                 200.0 ms

The lesson: Never plan to run above 80% sustained utilization on any single resource (CPU, I/O path, DB2 thread pool). Above 80%, small increases in load cause large increases in response time.
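The table above falls directly out of the M/M/1 formula; a few lines of code reproduce it:

```python
def mm1_response(service_time, utilization):
    """M/M/1 mean response time: service time stretched by queueing."""
    assert 0 <= utilization < 1.0, "queue is unstable at utilization >= 1"
    return service_time / (1.0 - utilization)

service = 0.010  # 10 ms service time, as in the table
for util in (0.10, 0.30, 0.50, 0.70, 0.80, 0.90, 0.95):
    r = mm1_response(service, util)
    print(f"{util:4.0%}  {r / service:5.2f}x  {r * 1000:6.1f} ms")
```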

Little's Law

L = λ × W

Where:
  L = Average number of requests in the system (in-progress + waiting)
  λ = Arrival rate (requests per second)
  W = Average time a request spends in the system (response time)

Little's Law is universal — it applies to any stable queueing system. It is extremely useful for capacity validation:

Worked example:

Scenario: A CICS region processes 50 transactions per second.
          Average response time is 0.2 seconds.

L = 50 × 0.2 = 10 transactions in the system at any moment

If MXT (Maximum Tasks) is set to 50, utilization is:
  10 / 50 = 20% — comfortable.

If response time degrades to 1.0 second:
  L = 50 × 1.0 = 50 — MXT is fully utilized.
  New arrivals must wait. Response time degrades further.
  This is a "meltdown" — the system has hit its capacity wall.
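The CICS scenario above is one multiplication, but wrapping it in a function makes the MXT check explicit:

```python
def tasks_in_system(arrival_rate, response_time):
    """Little's Law: L = lambda x W."""
    return arrival_rate * response_time

MXT = 50
healthy = tasks_in_system(50, 0.2)   # 10 tasks -> 20% of MXT
melted = tasks_in_system(50, 1.0)    # 50 tasks -> MXT exhausted
print(f"Healthy: {healthy / MXT:.0%} of MXT")
print(f"Degraded: {melted / MXT:.0%} of MXT -- capacity wall")
```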

Multi-Server Queue (M/M/c)

Real mainframe subsystems have multiple servers (multiple CPUs, multiple I/O paths, multiple DB2 threads). The exact response-time formula for c servers involves the Erlang C function, which is complex. For practical purposes, use the rule of thumb:

Per-server utilization = Offered load (Erlangs) / Number of servers

If total arrival rate = 80 requests/sec
   Service time = 50 ms per request
   Number of servers = 5

Total offered load = 80 × 0.050 = 4.0 Erlangs
Utilization per server = 4.0 / 5 = 0.80 = 80%

At 80% per-server utilization, response time is approximately:
  Service Time × 2 to 3 (depending on variability)
  50 ms × 2.5 ≈ 125 ms

For precise calculations, use a queueing theory calculator or the Erlang C tables. For capacity planning conversations, the M/M/1 approximation with per-server utilization gives you a reasonable estimate.
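For readers who want the exact number, the Erlang C computation fits in a short function. Note that pooled servers do better than c independent M/M/1 queues, so the exact answer for the scenario above comes out lower than the rule-of-thumb estimate:

```python
from math import factorial

def erlang_c_response(arrival_rate, service_time, servers):
    """M/M/c mean response time via the Erlang C formula."""
    a = arrival_rate * service_time                  # offered load (Erlangs)
    assert a < servers, "system is unstable at offered load >= servers"
    top = (a ** servers / factorial(servers)) * (servers / (servers - a))
    bottom = sum(a ** k / factorial(k) for k in range(servers)) + top
    p_wait = top / bottom                            # probability of queueing
    wait = p_wait * service_time / (servers - a)     # mean time in queue
    return service_time + wait

# Scenario from the text: 80 req/sec, 50 ms service, 5 servers
r = erlang_c_response(80, 0.050, 5)
print(f"{r * 1000:.0f} ms")   # ~78 ms -- the ~125 ms rule of thumb is conservative
```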


Batch Window Math

Critical Path

The batch window is the time available for overnight processing (typically between the end of online operations and the start of the next business day). The critical path is the longest chain of dependent jobs.

Batch Window Duration = End Time - Start Time - Buffer

Critical Path Duration = Sum of elapsed times on the longest
                         dependency chain

Slack = Batch Window Duration - Critical Path Duration

If slack is negative, the batch window will not complete on time. Options:
  1. Reduce critical-path job elapsed times (performance tuning).
  2. Parallelize independent jobs that are currently serialized.
  3. Move non-critical jobs outside the window.
  4. Extend the batch window (business impact).

Parallel Speedup

If you split a single-threaded job into N parallel streams:

Theoretical elapsed time = (Single-stream elapsed) / N + Overhead

Where overhead includes:
  - Job submission/termination time per stream
  - Sort/merge of parallel outputs
  - Lock contention between parallel streams
  - I/O contention on shared files

Practical limits: Parallel speedup rarely exceeds 4-6x regardless of the number of streams, because I/O and lock contention become the bottleneck. The sweet spot is usually 3-5 parallel streams.

Worked Example: Batch Window Analysis

Batch window: 22:00 to 06:00 = 8 hours = 480 minutes
Buffer for reruns: 60 minutes
Available: 420 minutes

Current critical path:
  Job A: 45 minutes (extract)
  Job B: 120 minutes (transform, depends on A)
  Job C: 90 minutes (DB2 load, depends on B)
  Job D: 60 minutes (reporting, depends on C)
  Critical path: 45 + 120 + 90 + 60 = 315 minutes

Slack: 420 - 315 = 105 minutes — comfortable.

If input volume doubles (Job B and C scale linearly):
  Job A: 60 minutes (I/O bound, scales ~1.3x)
  Job B: 240 minutes (CPU bound, scales ~2x)
  Job C: 180 minutes (I/O bound, scales ~2x)
  Job D: 60 minutes (reporting volume doesn't change)
  Critical path: 60 + 240 + 180 + 60 = 540 minutes

Slack: 420 - 540 = -120 minutes — DOES NOT FIT.

Options:
  1. Tune Job B: target 150 minutes (37.5% reduction through
     DB2 multi-row operations, OPT(2), FASTSRT)
  2. Parallelize Job B into 3 streams: 240/3 + 15 = 95 minutes
  3. Both: 150/3 + 15 = 65 minutes

  With option 3: 60 + 65 + 180 + 60 = 365 minutes
  Slack: 420 - 365 = 55 minutes — acceptable but tight.
  Also need to tune Job C.
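The slack arithmetic in this worked example is simple enough to script, which helps when testing many what-if scenarios:

```python
def slack(window_min, buffer_min, critical_path_minutes):
    """Slack = (window - buffer) - sum of critical-path elapsed times."""
    available = window_min - buffer_min
    return available - sum(critical_path_minutes)

current = slack(480, 60, [45, 120, 90, 60])   # Jobs A-D today
doubled = slack(480, 60, [60, 240, 180, 60])  # doubled input volume
tuned = slack(480, 60, [60, 65, 180, 60])     # option 3 applied to Job B
print(current, doubled, tuned)   # 105 -120 55
```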

CICS Sizing

MXT (Maximum Tasks)

MXT controls the maximum number of concurrent tasks in a CICS region. Setting it requires balancing throughput against storage consumption.

MXT = (Available DSA) / (Average task storage)

Where:
  Available DSA = Total DSA - CICS overhead
  Average task storage = LE stack + heap + Working-Storage/Local-Storage
                         + CICS overhead per task (~8-12 KB)

Worked example:

Total CDSA + UDSA = 256 MB
CICS overhead: 40 MB
Available for tasks: 216 MB = 221,184 KB

Average task storage (measured): 150 KB

MXT = 221,184 / 150 = 1,474

Conservative setting: 1,200 (leaves headroom for peak variation)

Transaction Rate and Response Time

Maximum Transaction Rate = MXT / Average Response Time

If MXT = 500 and average response time = 0.1 seconds:
  Max rate = 500 / 0.1 = 5,000 transactions/second

If response time degrades to 0.5 seconds:
  Max rate = 500 / 0.5 = 1,000 transactions/second

This shows why response time matters so much in CICS — a 5x increase in response time causes a 5x decrease in throughput capacity.
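The throughput ceiling above is Little's Law rearranged, and scripting it makes the sensitivity to response time obvious:

```python
def max_tps(mxt, avg_response_time):
    """Maximum sustainable transaction rate for a given MXT (Little's Law)."""
    return mxt / avg_response_time

print(max_tps(500, 0.1))   # ~5,000 tx/sec at 100 ms response time
print(max_tps(500, 0.5))   # ~1,000 tx/sec at 500 ms response time
```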

DSA Sizing

DSA     Region Contents                           Typical Range
CDSA    CICS nucleus, system programs             16-64 MB
UDSA    User programs (COBOL), LE stacks/heaps    128-512 MB
ECDSA   Extended CICS programs (above the line)   256 MB - 2 GB
EUDSA   Extended user programs (above the line)   256 MB - 2 GB
RDSA    Read-only programs (reentrant)            64-256 MB
ERDSA   Extended read-only programs               256 MB - 2 GB
SDSA    Shared (non-reentrant) programs           64-128 MB
ESDSA   Extended shared programs                  128-512 MB

Modern CICS programs compiled with RENT execute in ERDSA (above the 16 MB line in 31-bit storage), where storage is far more plentiful than below the line. The critical constraint is usually EUDSA (for LE runtime storage) rather than program storage.


DB2 Sizing

Buffer Pool Sizing

DB2 buffer pools cache data and index pages in memory, avoiding physical I/O. Buffer pool hit ratio is the single most important DB2 performance metric.

Hit Ratio = (Getpages - Physical Reads) / Getpages × 100

Target: > 95% for data pools, > 99% for index pools

Buffer Pool Size Estimation

Minimum buffer pool size (pages) =
    Active tablespace pages accessed concurrently

Practical sizing:
    Frequently accessed tables should fit entirely in the buffer pool.
    Index non-leaf pages should always be in the buffer pool.

For a table with 100,000 rows at 50 rows per page:
    Data pages = 100,000 / 50 = 2,000 pages
    At 4 KB per page = 8 MB buffer pool space
    At 8 KB per page = 16 MB buffer pool space
    At 16 KB per page = 32 MB buffer pool space
    At 32 KB per page = 64 MB buffer pool space

Thread Limits

CICS DB2 Connection:
  THREADLIMIT = Maximum concurrent DB2 threads from this CICS region

  Sizing: THREADLIMIT = MXT × DB2-usage-ratio

  If MXT = 500 and 60% of transactions use DB2:
  THREADLIMIT = 500 × 0.60 = 300 threads

Batch:
  Each batch DB2 program consumes one thread.
  MAXDBAT = Maximum concurrent batch threads

  Sizing: MAXDBAT = Maximum concurrent batch programs that use DB2

Thread wait time: If all threads are in use, new requests wait. Thread wait time appears in DB2 statistics as "thread create not satisfied." If you see this counter incrementing, increase THREADLIMIT or MAXDBAT.

Worked Example: DB2 Buffer Pool Tuning

Current statistics:
  BP0 (4KB data pool):
    Size: 50,000 pages (200 MB)
    Getpages: 10,000,000
    Physical reads: 1,500,000
    Hit ratio: 85%

  BP1 (4KB index pool):
    Size: 20,000 pages (80 MB)
    Getpages: 5,000,000
    Physical reads: 100,000
    Hit ratio: 98%

Analysis:
  BP0 hit ratio is too low (target > 95%).
  Each physical read costs ~2-5ms vs. ~0.01ms for a buffer hit.

  Additional I/O time from poor caching:
  1,500,000 × 3ms = 4,500 seconds = 75 minutes of I/O wait

  If we increase BP0 to 200,000 pages (800 MB):
    Expected hit ratio improvement: 85% → 97%
    New physical reads: 10,000,000 × 0.03 = 300,000
    I/O time reduction: (1,500,000 - 300,000) × 3ms = 3,600 seconds
    = 60 minutes saved across all programs using these tables

  Cost: 600 MB additional real memory
  Benefit: 60 minutes of aggregate I/O wait time saved per measurement interval
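The hit-ratio and savings arithmetic from this example, scripted. The 3 ms physical-read cost and the projected 97% hit ratio are the assumptions stated above, not guaranteed outcomes:

```python
def hit_ratio(getpages, physical_reads):
    """DB2 buffer pool hit ratio as a fraction."""
    return (getpages - physical_reads) / getpages

def io_seconds_saved(reads_before, reads_after, read_cost_s=0.003):
    """I/O wait eliminated by avoided physical reads (assumed 3 ms each)."""
    return (reads_before - reads_after) * read_cost_s

print(f"{hit_ratio(10_000_000, 1_500_000):.0%}")      # 85% -- below target
saved = io_seconds_saved(1_500_000, 300_000)
print(f"{saved / 60:.0f} minutes of I/O wait saved")  # ~60 minutes
```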

Quick Reference: Resource Consumption Rules of Thumb

Operation                 Approximate CPU Cost         Approximate I/O Cost
COBOL MOVE                5-20 instructions            0
COBOL COMPUTE (simple)    20-50 instructions           0
COBOL STRING/UNSTRING     100-500 instructions         0
Sequential READ (QSAM)    500-1,000 instructions       1 EXCP per block
VSAM READ (keyed)         2,000-5,000 instructions     1-3 EXCPs (index levels + data)
DB2 singleton SELECT      10,000-50,000 instructions   0-5 I/Os (depending on buffer pool)
DB2 cursor FETCH          5,000-20,000 instructions    0-2 I/Os per fetch
DB2 INSERT                15,000-50,000 instructions   1-5 I/Os (data + index + log)
DB2 COMMIT                5,000-15,000 instructions    1-2 I/Os (log force)
CICS SEND MAP             5,000-15,000 instructions    0 (terminal I/O is separate)
EXEC CICS READ FILE       5,000-20,000 instructions    0-3 I/Os (VSAM through CICS)

These are order-of-magnitude estimates. Actual costs vary by machine model, z/OS level, DB2 version, buffer pool configuration, and data characteristics. Measure your actual workload — do not rely on these numbers for precise planning.


Capacity Planning Worksheet Template

Use this template when estimating resources for a new program or a significant change to an existing one.

Program: _______________
Type: [ ] Batch  [ ] CICS  [ ] DB2 Stored Procedure

Input:
  Source: _______________
  Volume: _______ records per run / per day
  Growth rate: _______% per year

Processing:
  DB2 calls per record: _______
  File I/Os per record: _______
  Estimated CPU per record: _______ ms

Output:
  Destination: _______________
  Volume: _______ records per run / per day

Resource estimates:
  CPU time per run: _______ seconds
  Elapsed time per run: _______ minutes
  I/Os per run: _______
  DB2 getpages per run: _______
  Peak memory: _______ MB

  MSU impact: _______ MSU (during peak window)
  Monthly SW license cost impact: $_______ (at $___/MSU)

Batch window impact:
  Which batch stream: _______________
  Critical path? [ ] Yes  [ ] No
  Added elapsed time: _______ minutes

CICS impact (if applicable):
  Transaction rate: _______ /second
  Response time target: _______ ms
  Additional MXT requirement: _______
  Additional storage per task: _______ KB

Validation:
  [ ] Pilot tested with _______ records
  [ ] Scaled linearly with ___x contention factor
  [ ] Reviewed by capacity planning team
  [ ] Batch window fit confirmed
  [ ] MSU impact within budget

Tools for Measurement

Tool                             What It Measures                                When to Use
SMF Records (Type 30)            Job-level CPU, elapsed, I/O                     Post-run analysis of batch jobs
RMF / RMF Monitor III            System-wide CPU, storage, I/O, paging           Capacity trending and analysis
DB2 PM / OMEGAMON for DB2        DB2 thread activity, SQL costs, buffer pools    DB2 performance tuning
CICS PA / OMEGAMON for CICS      Transaction response times, task rates, storage CICS performance tuning
Batch Accounting (SMF 30)        Aggregate batch resource consumption            Month-over-month trending
IBM SCRT                         Sub-capacity MSU reporting                      Software license optimization
zBNA (z Batch Network Analyzer)  Batch dependency chain analysis                 Batch window optimization

The golden rule of capacity planning: measure, do not guess. Every estimate in this appendix should be validated with actual measurements before being used for procurement or architectural decisions.