Appendix E: Capacity Planning Formulas and Worksheets
Capacity planning is the discipline of ensuring your mainframe has enough horsepower to run tomorrow's workload, not just today's. For COBOL developers, it means understanding how your code consumes resources — CPU, I/O, memory, and time — and how those consumption patterns translate into infrastructure costs. This appendix provides the formulas, rules of thumb, and worked examples you need to participate intelligently in capacity conversations with systems programmers and management.
You do not need to become a capacity planner. You do need to understand these concepts well enough to answer questions like: "If we double the input volume, how much longer will the batch window take?" and "What happens to CICS response time if we add 500 concurrent users?"
MSU Calculations
What Is an MSU?
MSU (Million Service Units) is IBM's measure of mainframe processing capacity. It is used for software licensing (many IBM products are priced by MSU), workload management, and capacity reporting. One MSU equals one million "service units" per hour of processing capacity.
Key Relationships
MSU Rating = Machine capacity (from IBM's LSPR tables)
CPU Utilization = (Used CPU seconds / Available CPU seconds) × 100
Used MSU = MSU Rating × CPU Utilization
Rolling 4-Hour Average (R4HA) = MSU consumption averaged over a sliding 4-hour window
IBM software pricing (Sub-Capacity Reporting Tool, SCRT) is based on the R4HA — the highest rolling 4-hour average of MSU consumption within the reporting period. This makes peak management a financial concern, not just a performance concern.
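The R4HA idea can be sketched in a few lines of Python. This is a simplified illustration only — actual SCRT reporting works from SMF data at finer granularity; hourly MSU samples and a 4-sample window are assumed here.

```python
# Rolling 4-hour average (R4HA) from hourly MSU samples.
# Simplified sketch: real SCRT reporting uses finer-grained SMF intervals.

def r4ha_peak(hourly_msu):
    """Return the highest average over any 4 consecutive hourly samples."""
    window = 4
    averages = [
        sum(hourly_msu[i:i + window]) / window
        for i in range(len(hourly_msu) - window + 1)
    ]
    return max(averages)

# A day with a sharp one-hour spike: the rolling average dilutes it.
samples = [200, 210, 205, 390, 215, 210, 200, 205]
print(r4ha_peak(samples))  # peak R4HA is 255.0, far below the 390 spike
```

This is why a short CPU spike costs less than a sustained plateau: the 4-hour window averages brief peaks away, but a plateau raises every window it touches.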
Worked Example: MSU Impact of a New Batch Job
Scenario: Your z16 LPAR is rated at 400 MSU. Current peak utilization is 65%. A new batch job consumes 120 CPU seconds per run in a 4-hour peak window.
Current peak MSU consumption:
400 MSU × 0.65 = 260 MSU
New job's MSU consumption:
CPU time = 120 seconds
In a 4-hour window (14,400 seconds), additional utilization:
120 / 14,400 = 0.0083 = 0.83%
Additional MSU = 400 × 0.0083 = 3.33 MSU
New peak: 260 + 3.33 = 263.33 MSU
New utilization: 263.33 / 400 = 65.83%
At an illustrative cost of $1,500/MSU/month for IBM software licensing, that 3.33 MSU increase costs roughly $5,000/month. This is why performance tuning COBOL batch programs has direct financial impact.
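The worked example above can be captured as a small sketch; the rating, utilization, and CPU seconds are the example's illustrative figures, not real pricing data.

```python
# MSU impact of adding a batch job to a peak window.
# All inputs are the illustrative figures from the worked example.

def msu_impact(msu_rating, peak_util, job_cpu_seconds, window_seconds=4 * 3600):
    """Return (added_msu, new_utilization) for a job run inside the peak window."""
    added_util = job_cpu_seconds / window_seconds
    added_msu = msu_rating * added_util
    return added_msu, peak_util + added_util

added, new_util = msu_impact(msu_rating=400, peak_util=0.65, job_cpu_seconds=120)
print(f"added MSU: {added:.2f}, new utilization: {new_util:.2%}")
# added MSU: 3.33, new utilization: 65.83%
```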
MSU Estimation for Code Changes
When estimating the MSU impact of a code change:
CPU time change = (New path length / Old path length) × Old CPU time
Where path length is estimated from:
- Number of COBOL statements executed
- Number of DB2 calls
- Number of I/O operations
- Compiler optimization level
Rule of thumb: A single DB2 SQL call costs roughly 10,000 to 100,000 instructions, depending on complexity. A COBOL MOVE costs roughly 5-20 instructions. Adding one SQL call per record to a million-record batch job has a massively disproportionate impact compared to adding ten COBOL statements.
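The disproportion in that rule of thumb is easy to quantify. A back-of-envelope comparison, using assumed midpoints of the instruction ranges above:

```python
# Path-length comparison using rough midpoints of the instruction
# costs quoted above. Purely illustrative, not measured values.

COST = {"move": 12, "sql_call": 50_000}  # instructions per operation (assumed)

records = 1_000_000
extra_moves = 10 * COST["move"] * records   # ten extra MOVEs per record
extra_sql = 1 * COST["sql_call"] * records  # one extra SQL call per record

print(f"{extra_moves:,} vs {extra_sql:,} extra instructions")
print(f"the SQL change is ~{extra_sql / extra_moves:.0f}x more expensive")
```

One added SQL call per record dwarfs ten added COBOL statements by more than two orders of magnitude.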
Throughput Formulas
Records Per Second
Throughput (records/sec) = Total records / Elapsed time (seconds)
CPU per record = Total CPU seconds / Total records
Elapsed per record = Total elapsed seconds / Total records
I/O Rate
I/O rate (EXCP/sec) = Total EXCPs / Elapsed time (seconds)
EXCPs per record = Total EXCPs / Total records
An EXCP (Execute Channel Program) count records one channel program — roughly, one physical I/O request, which for sequential files may transfer an entire block of records. Modern DASD subsystems can handle 10,000-50,000 IOPS (I/O operations per second) per device, but I/O response times vary dramatically based on cache hit ratios.
Elapsed Time Estimation
Elapsed Time = MAX(CPU Time, I/O Time, Wait Time)
Where:
CPU Time = Records × CPU per record
I/O Time = Records × I/Os per record × Avg I/O response time
Wait Time = Lock waits + Tape mounts + Other waits
In practice, batch jobs are either CPU-bound or I/O-bound:
- CPU-bound: Elapsed time is approximately equal to CPU time. Optimization means reducing instructions per record.
- I/O-bound: Elapsed time is much greater than CPU time. Optimization means reducing I/O count (larger block sizes, buffering, sequential access patterns) or improving I/O response time (buffer pool tuning, DS8000 cache).
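The elapsed-time model can be sketched directly; the input figures below are invented for illustration.

```python
# Sketch of the elapsed-time model: elapsed is roughly the MAX of CPU
# time and I/O time (they overlap), plus any non-overlapped wait time.

def estimate_elapsed(records, cpu_per_record, ios_per_record,
                     io_response, wait_seconds=0.0):
    """Return (elapsed_seconds, bottleneck) for a batch job."""
    cpu_time = records * cpu_per_record
    io_time = records * ios_per_record * io_response
    bound = "CPU" if cpu_time >= io_time else "I/O"
    return max(cpu_time, io_time) + wait_seconds, bound

elapsed, bound = estimate_elapsed(
    records=1_000_000, cpu_per_record=0.0005,   # 0.5 ms CPU per record
    ios_per_record=0.2, io_response=0.004)      # 4 ms per physical I/O
print(f"{elapsed:.0f} s, {bound}-bound")
# 800 s, I/O-bound: tuning instructions here would barely move elapsed time
```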
Worked Example: Estimating Batch Job Duration
Scenario: A new batch job will process 5 million customer records. From a pilot test with 10,000 records:
Pilot results:
Records: 10,000
CPU time: 8.5 seconds
Elapsed: 12.3 seconds
EXCPs: 2,150
Per-record metrics:
CPU per record: 8.5 / 10,000 = 0.00085 seconds
Elapsed per record: 12.3 / 10,000 = 0.00123 seconds
EXCPs per record: 2,150 / 10,000 = 0.215
Full-volume projection:
CPU time: 5,000,000 × 0.00085 = 4,250 seconds = 70.8 minutes
EXCPs: 5,000,000 × 0.215 = 1,075,000
Elapsed (estimate): 5,000,000 × 0.00123 = 6,150 seconds = 102.5 minutes
Important caveat: Linear scaling assumes no contention effects. At full volume, contention for CPU, I/O channels, DB2 locks, and buffer pools will likely increase elapsed time beyond the linear projection. Apply a contention factor of 1.2x to 1.5x for conservative estimates:
Adjusted elapsed: 102.5 × 1.3 = 133.25 minutes ≈ 2 hours 13 minutes
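The pilot-to-full-volume projection above, contention factor included, as a sketch:

```python
# Scale pilot measurements to full volume, then apply a contention
# factor to the elapsed-time projection. Figures are from the example.

def project(pilot_records, pilot_cpu, pilot_elapsed,
            target_records, contention=1.3):
    """Return (projected_cpu_s, projected_elapsed_s) at target volume."""
    scale = target_records / pilot_records
    cpu = pilot_cpu * scale                      # CPU scales ~linearly
    elapsed = pilot_elapsed * scale * contention # elapsed rarely does
    return cpu, elapsed

cpu_s, elapsed_s = project(10_000, 8.5, 12.3, 5_000_000)
print(f"CPU: {cpu_s / 60:.1f} min, adjusted elapsed: {elapsed_s / 60:.2f} min")
# CPU: 70.8 min, adjusted elapsed: 133.25 min
```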
Queueing Theory Basics
Queueing theory is the mathematical framework for understanding how contention affects response time. You do not need to derive these formulas. You need to understand what they tell you about system behavior.
Utilization and Response Time
The fundamental relationship (for a single-server queue with random arrivals and service times — the M/M/1 model):
Response Time = Service Time / (1 - Utilization)
Where:
Service Time = time to process one request with no waiting
Utilization = Arrival Rate × Service Time (must be < 1.0)
This formula explains the "hockey stick" curve: response time increases slowly as utilization rises from 0% to 50%, moderately from 50% to 75%, and explosively above 80%.
| Utilization | Response Time Factor | Example (10ms service) |
|---|---|---|
| 10% | 1.11× | 11.1 ms |
| 30% | 1.43× | 14.3 ms |
| 50% | 2.0× | 20.0 ms |
| 70% | 3.33× | 33.3 ms |
| 80% | 5.0× | 50.0 ms |
| 90% | 10.0× | 100.0 ms |
| 95% | 20.0× | 200.0 ms |
The lesson: Never plan to run above 80% sustained utilization on any single resource (CPU, I/O path, DB2 thread pool). Above 80%, small increases in load cause large increases in response time.
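The table above comes straight out of the M/M/1 formula, which can be computed directly:

```python
# The M/M/1 response-time curve: Response = Service / (1 - Utilization).

def mm1_response(service_time, utilization):
    """Response time for a single-server queue (M/M/1 model)."""
    if utilization >= 1.0:
        raise ValueError("system is unstable at utilization >= 100%")
    return service_time / (1.0 - utilization)

# Reproduce the table for a 10 ms service time.
for util in (0.10, 0.50, 0.80, 0.90, 0.95):
    rt = mm1_response(0.010, util)
    print(f"{util:4.0%}: {rt * 1000:6.1f} ms")
```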
Little's Law
L = λ × W
Where:
L = Average number of requests in the system (in-progress + waiting)
λ = Arrival rate (requests per second)
W = Average time a request spends in the system (response time)
Little's Law is universal — it applies to any stable queueing system. It is extremely useful for capacity validation:
Worked example:
Scenario: A CICS region processes 50 transactions per second.
Average response time is 0.2 seconds.
L = 50 × 0.2 = 10 transactions in the system at any moment
If MXT (Maximum Tasks) is set to 50, utilization is:
10 / 50 = 20% — comfortable.
If response time degrades to 1.0 second:
L = 50 × 1.0 = 50 — MXT is fully utilized.
New arrivals must wait. Response time degrades further.
This is a "meltdown" — the system has hit its capacity wall.
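The CICS example above, expressed via Little's Law:

```python
# Little's Law: L = lambda x W. Applied to the MXT example above.

def tasks_in_flight(arrival_rate, response_time):
    """Average number of transactions in the system at any moment."""
    return arrival_rate * response_time

mxt = 50
for resp in (0.2, 1.0):
    in_flight = tasks_in_flight(50, resp)  # 50 transactions/second
    print(f"response {resp}s -> {in_flight:.0f} tasks "
          f"({in_flight / mxt:.0%} of MXT)")
# 0.2s response: 10 tasks (20% of MXT); 1.0s response: 50 tasks (MXT full)
```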
Multi-Server Queue (M/M/c)
Real mainframe subsystems have multiple servers (multiple CPUs, multiple I/O paths, multiple DB2 threads). The exact response-time formula for c servers involves the Erlang C function, which is complex. For practical purposes, use this rule of thumb:
Effective Utilization per server = Total utilization / Number of servers
If total arrival rate = 80 requests/sec
Service time = 50 ms per request
Number of servers = 5
Total offered load = 80 × 0.050 = 4.0 Erlangs
Utilization per server = 4.0 / 5 = 0.80 = 80%
At 80% per-server utilization, response time is approximately:
Service Time × 2 to 3 (depending on variability)
50 ms × 2.5 ≈ 125 ms
For precise calculations, use a queueing theory calculator or the Erlang C tables. For capacity planning conversations, the M/M/1 approximation with per-server utilization gives you a reasonable estimate.
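For readers who want the exact M/M/c number, the Erlang C formula fits in a few lines. Applied to the example above (80 requests/sec, 50 ms service, 5 servers), it gives roughly 78 ms — lower than the per-server approximation, which is deliberately conservative, because pooled servers absorb bursts better than independent single-server queues.

```python
import math

def erlang_c_response(arrival_rate, service_time, servers):
    """Mean response time for an M/M/c queue via the Erlang C formula."""
    offered = arrival_rate * service_time          # load in Erlangs
    rho = offered / servers                        # per-server utilization
    if rho >= 1.0:
        raise ValueError("unstable: offered load exceeds server capacity")
    # Erlang C: probability that an arriving request must queue.
    top = offered ** servers / math.factorial(servers)
    below = sum(offered ** k / math.factorial(k) for k in range(servers))
    p_wait = top / ((1.0 - rho) * below + top)
    # Mean queueing delay, plus one service time.
    wait = p_wait * service_time / (servers * (1.0 - rho))
    return wait + service_time

rt = erlang_c_response(arrival_rate=80, service_time=0.050, servers=5)
print(f"{rt * 1000:.1f} ms")  # 77.7 ms
```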
Batch Window Math
Critical Path
The batch window is the time available for overnight processing (typically between the end of online operations and the start of the next business day). The critical path is the longest chain of dependent jobs.
Batch Window Duration = End Time - Start Time - Buffer
Critical Path Duration = Sum of elapsed times on the longest
dependency chain
Slack = Batch Window Duration - Critical Path Duration
If slack is negative, the batch window will not complete on time. Options:
1. Reduce critical-path job elapsed times (performance tuning).
2. Parallelize independent jobs that are currently serialized.
3. Move non-critical jobs outside the window.
4. Extend the batch window (business impact).
Parallel Speedup
If you split a single-threaded job into N parallel streams:
Theoretical elapsed time = (Single-stream elapsed) / N + Overhead
Where overhead includes:
- Job submission/termination time per stream
- Sort/merge of parallel outputs
- Lock contention between parallel streams
- I/O contention on shared files
Practical limits: Parallel speedup rarely exceeds 4-6x regardless of the number of streams, because I/O and lock contention become the bottleneck. The sweet spot is usually 3-5 parallel streams.
Worked Example: Batch Window Analysis
Batch window: 22:00 to 06:00 = 8 hours = 480 minutes
Buffer for reruns: 60 minutes
Available: 420 minutes
Current critical path:
Job A: 45 minutes (extract)
Job B: 120 minutes (transform, depends on A)
Job C: 90 minutes (DB2 load, depends on B)
Job D: 60 minutes (reporting, depends on C)
Critical path: 45 + 120 + 90 + 60 = 315 minutes
Slack: 420 - 315 = 105 minutes — comfortable.
If input volume doubles (Job B and C scale linearly):
Job A: 60 minutes (extract; substantial fixed overhead, scales ~1.3x)
Job B: 240 minutes (CPU bound, scales ~2x)
Job C: 180 minutes (I/O bound, scales ~2x)
Job D: 60 minutes (reporting volume doesn't change)
Critical path: 60 + 240 + 180 + 60 = 540 minutes
Slack: 420 - 540 = -120 minutes — DOES NOT FIT.
Options:
1. Tune Job B: target 150 minutes (37.5% reduction through
DB2 multi-row operations, OPT(2), FASTSRT)
2. Parallelize Job B into 3 streams: 240/3 + 15 = 95 minutes
3. Both: 150/3 + 15 = 65 minutes
With option 3: 60 + 65 + 180 + 60 = 365 minutes
Slack: 420 - 365 = 55 minutes — acceptable but tight.
Also need to tune Job C.
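The batch-window arithmetic above can be sketched as a simple slack calculation; the job times are the example's figures.

```python
# Batch window slack: available window minus the serial critical path.

def slack(window_min, buffer_min, job_minutes):
    """Positive slack means the critical path fits in the window."""
    available = window_min - buffer_min
    return available - sum(job_minutes)

# After volume doubles: Jobs A, B, C, D on the critical path.
doubled = [60, 240, 180, 60]
print(slack(480, 60, doubled))                  # -120: does not fit

# Tune Job B to 150 min, then split into 3 streams (+15 min overhead).
tuned_b = 150 / 3 + 15                          # 65 minutes
print(slack(480, 60, [60, tuned_b, 180, 60]))   # 55: fits, but tight
```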
CICS Sizing
MXT (Maximum Tasks)
MXT controls the maximum number of concurrent tasks in a CICS region. Setting it requires balancing throughput against storage consumption.
MXT = (Available DSA) / (Average task storage)
Where:
Available DSA = Total DSA - CICS overhead
Average task storage = LE stack + heap + Working-Storage/Local-Storage
+ CICS overhead per task (~8-12 KB)
Worked example:
Total CDSA + UDSA = 256 MB
CICS overhead: 40 MB
Available for tasks: 216 MB = 221,184 KB
Average task storage (measured): 150 KB
MXT = 221,184 / 150 = 1,474
Conservative setting: 1,200 (leaves headroom for peak variation)
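The MXT ceiling calculation above as a sketch, using the example's figures:

```python
# MXT ceiling: DSA available for tasks divided by measured average
# storage per task. Set the actual MXT below this with headroom.

def mxt_ceiling(total_dsa_mb, cics_overhead_mb, avg_task_kb):
    """Maximum concurrent tasks that fit in the available DSA."""
    available_kb = (total_dsa_mb - cics_overhead_mb) * 1024
    return available_kb // avg_task_kb

ceiling = mxt_ceiling(total_dsa_mb=256, cics_overhead_mb=40, avg_task_kb=150)
print(f"MXT ceiling: {ceiling}")  # 1474; a conservative setting sits below this
```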
Transaction Rate and Response Time
Maximum Transaction Rate = MXT / Average Response Time
If MXT = 500 and average response time = 0.1 seconds:
Max rate = 500 / 0.1 = 5,000 transactions/second
If response time degrades to 0.5 seconds:
Max rate = 500 / 0.5 = 1,000 transactions/second
This shows why response time matters so much in CICS — a 5x increase in response time causes a 5x decrease in throughput capacity.
DSA Sizing
| DSA Region | Contents | Typical Range |
|---|---|---|
| CDSA | CICS nucleus, system programs | 16-64 MB |
| UDSA | User programs (COBOL), LE stacks/heaps | 128-512 MB |
| ECDSA | Extended CICS programs (above the line) | 256 MB - 2 GB |
| EUDSA | Extended user programs (above the line) | 256 MB - 2 GB |
| RDSA | Read-only programs (reentrant) | 64-256 MB |
| ERDSA | Extended read-only programs | 256 MB - 2 GB |
| SDSA | Shared (non-reentrant) programs | 64-128 MB |
| ESDSA | Extended shared programs | 128-512 MB |
Modern CICS programs compiled with RENT execute in ERDSA (above the 16 MB line, in 31-bit storage), where storage is relatively abundant. The critical constraint is usually EUDSA (for LE runtime storage) rather than program storage.
DB2 Sizing
Buffer Pool Sizing
DB2 buffer pools cache data and index pages in memory, avoiding physical I/O. Buffer pool hit ratio is the single most important DB2 performance metric.
Hit Ratio = (Getpages - Physical Reads) / Getpages × 100
Target: > 95% for data pools, > 99% for index pools
Buffer Pool Size Estimation
Minimum buffer pool size (pages) =
Active tablespace pages accessed concurrently
Practical sizing:
Frequently accessed tables should fit entirely in the buffer pool.
Index non-leaf pages should always be in the buffer pool.
For a table with 100,000 rows at 50 rows per 4 KB page:
Data pages = 100,000 / 50 = 2,000 pages
Buffer space = 2,000 × 4 KB = 8 MB
Larger page sizes hold proportionally more rows per page (~100 per 8 KB page, ~200 per 16 KB page), so the total buffer space needed to cache the same data stays close to 8 MB regardless of page size.
Thread Limits
CICS DB2 Connection:
THREADLIMIT = Maximum concurrent DB2 threads from this CICS region
Sizing: THREADLIMIT = MXT × DB2-usage-ratio
If MXT = 500 and 60% of transactions use DB2:
THREADLIMIT = 500 × 0.60 = 300 threads
Batch:
Each batch DB2 program consumes one thread.
MAXDBAT = Maximum concurrent batch threads
Sizing: MAXDBAT = Maximum concurrent batch programs that use DB2
Thread wait time: If all threads are in use, new requests wait. Thread wait time appears in DB2 statistics as "thread create not satisfied." If you see this counter incrementing, increase THREADLIMIT or MAXDBAT.
Worked Example: DB2 Buffer Pool Tuning
Current statistics:
BP0 (4KB data pool):
Size: 50,000 pages (200 MB)
Getpages: 10,000,000
Physical reads: 1,500,000
Hit ratio: 85%
BP1 (4KB index pool):
Size: 20,000 pages (80 MB)
Getpages: 5,000,000
Physical reads: 100,000
Hit ratio: 98%
Analysis:
BP0 hit ratio is too low (target > 95%).
Each physical read costs ~2-5ms vs. ~0.01ms for a buffer hit.
Additional I/O time from poor caching:
1,500,000 × 3ms = 4,500 seconds = 75 minutes of I/O wait
If we increase BP0 to 200,000 pages (800 MB):
Expected hit ratio improvement: 85% → 97%
New physical reads: 10,000,000 × 0.03 = 300,000
I/O time reduction: (1,500,000 - 300,000) × 3ms = 3,600 seconds
= 60 minutes saved across all programs using these tables
Cost: 600 MB additional real memory
Benefit: 60 minutes of aggregate I/O wait time saved per measurement interval
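The BP0 analysis above can be sketched directly; the 3 ms per physical read is the assumed midpoint used in the example.

```python
# Buffer pool hit ratio and the I/O wait attributable to misses.
# 3 ms per physical read is an assumed midpoint, not a measured value.

def hit_ratio(getpages, physical_reads):
    """Buffer pool hit ratio as a percentage."""
    return (getpages - physical_reads) / getpages * 100

def miss_io_minutes(physical_reads, ms_per_read=3.0):
    """Aggregate I/O wait (minutes) caused by physical reads."""
    return physical_reads * ms_per_read / 1000 / 60

print(f"BP0 hit ratio: {hit_ratio(10_000_000, 1_500_000):.0f}%")  # 85%
print(f"I/O wait: {miss_io_minutes(1_500_000):.0f} min")          # 75 min
saved = miss_io_minutes(1_500_000) - miss_io_minutes(300_000)
print(f"saved if reads drop to 300,000: {saved:.0f} min")         # 60 min
```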
Quick Reference: Resource Consumption Rules of Thumb
| Operation | Approximate CPU Cost | Approximate I/O Cost |
|---|---|---|
| COBOL MOVE | 5-20 instructions | 0 |
| COBOL COMPUTE (simple) | 20-50 instructions | 0 |
| COBOL STRING/UNSTRING | 100-500 instructions | 0 |
| Sequential READ (QSAM) | 500-1,000 instructions | 1 EXCP per block |
| VSAM READ (keyed) | 2,000-5,000 instructions | 1-3 EXCPs (depending on index levels + data) |
| DB2 singleton SELECT | 10,000-50,000 instructions | 0-5 I/Os (depending on buffer pool) |
| DB2 cursor FETCH | 5,000-20,000 instructions | 0-2 I/Os per fetch |
| DB2 INSERT | 15,000-50,000 instructions | 1-5 I/Os (data + index + log) |
| DB2 COMMIT | 5,000-15,000 instructions | 1-2 I/Os (log force) |
| CICS SEND MAP | 5,000-15,000 instructions | 0 (terminal I/O is separate) |
| EXEC CICS READ FILE | 5,000-20,000 instructions | 0-3 I/Os (VSAM through CICS) |
These are order-of-magnitude estimates. Actual costs vary by machine model, z/OS level, DB2 version, buffer pool configuration, and data characteristics. Measure your actual workload — do not rely on these numbers for precise planning.
Capacity Planning Worksheet Template
Use this template when estimating resources for a new program or a significant change to an existing one.
Program: _______________
Type: [ ] Batch [ ] CICS [ ] DB2 Stored Procedure
Input:
Source: _______________
Volume: _______ records per run / per day
Growth rate: _______% per year
Processing:
DB2 calls per record: _______
File I/Os per record: _______
Estimated CPU per record: _______ ms
Output:
Destination: _______________
Volume: _______ records per run / per day
Resource estimates:
CPU time per run: _______ seconds
Elapsed time per run: _______ minutes
I/Os per run: _______
DB2 getpages per run: _______
Peak memory: _______ MB
MSU impact: _______ MSU (during peak window)
Monthly SW license cost impact: $_______ (at $___/MSU)
Batch window impact:
Which batch stream: _______________
Critical path? [ ] Yes [ ] No
Added elapsed time: _______ minutes
CICS impact (if applicable):
Transaction rate: _______ /second
Response time target: _______ ms
Additional MXT requirement: _______
Additional storage per task: _______ KB
Validation:
[ ] Pilot tested with _______ records
[ ] Scaled linearly with ___x contention factor
[ ] Reviewed by capacity planning team
[ ] Batch window fit confirmed
[ ] MSU impact within budget
Tools for Measurement
| Tool | What It Measures | When to Use |
|---|---|---|
| SMF Records (Type 30) | Job-level CPU, elapsed, I/O | Post-run analysis of batch jobs |
| RMF / RMF Monitor III | System-wide CPU, storage, I/O, paging | Capacity trending and analysis |
| DB2 PM / OMEGAMON for DB2 | DB2 thread activity, SQL costs, buffer pools | DB2 performance tuning |
| CICS PA / OMEGAMON for CICS | Transaction response times, task rates, storage | CICS performance tuning |
| Batch Accounting (SMF 30) | Aggregate batch resource consumption | Month-over-month trending |
| IBM SCRT | Sub-capacity MSU reporting | Software license optimization |
| zBNA (z Batch Network Analyzer) | Batch dependency chain analysis | Batch window optimization |
The golden rule of capacity planning: measure, do not guess. Every estimate in this appendix should be validated with actual measurements before being used for procurement or architectural decisions.