> "Capacity planning is the only discipline where being right means nothing happens. You never get credit for the outage that didn't occur. You only hear about it when you're wrong." — Kwame Mensah, Chief Architect, Continental National Bank
Learning Objectives
- Develop capacity forecasting models using historical SMF/RMF data and business growth projections
- Calculate MSU consumption and budget for COBOL workloads
- Analyze the capacity impact of application changes, DB2 changes, and workload shifts
- Design capacity planning processes that align technical capacity with business planning
- Create the capacity plan for the HA banking system
In This Chapter
- Forecasting, MSU Budgeting, and Right-Sizing
- 29.1 Why Capacity Planning Matters — The Cost of Getting It Wrong
- 29.2 Measuring Capacity: MSU, MIPS, CPU Seconds, zIIP Eligibility, and RMF Data
- 29.3 Workload Characterization — Different Workloads, Different Capacity Profiles
- 29.4 Forecasting Models — From Historical Data to Future Projections
- 29.5 MSU Budgeting and Optimization — Where Capacity Meets Finance
- 29.6 Capacity Impact Analysis — What Happens When Things Change
- 29.7 The Capacity Planning Process — Annual Cycle, Quarterly Reviews, Exception Reporting
- 29.8 Progressive Project: The HA Banking System Capacity Plan
- 29.9 Summary — The Discipline That Prevents 3:00 AM Phone Calls
Chapter 29: Capacity Planning and Growth Management
Forecasting, MSU Budgeting, and Right-Sizing
29.1 Why Capacity Planning Matters — The Cost of Getting It Wrong
I have been involved in exactly two mainframe capacity emergencies in twenty-five years. Both were entirely preventable. Both cost more than a full year of competent capacity planning would have. One of them nearly cost someone their job.
The first was at a regional bank in 2009. They had been running their z10 at 87% peak utilization for six months. The capacity planner — a systems programmer who did capacity work "when he had time," which meant never — had raised the issue once in a quarterly review. Management deferred the upgrade because the budget was tight and "we've been fine so far." Then October arrived. The bank had just acquired a smaller institution, and the merged account base hit the system on the first combined month-end cycle. Peak utilization hit 99.2%. WLM started throttling batch work. The nightly posting cycle blew the batch window by three hours. Online response times the next morning averaged 4.7 seconds instead of the 200ms SLA. The emergency upgrade — expedited hardware delivery, unplanned weekend installation, overtime for every systems programmer in the shop — cost $2.3 million. The planned upgrade they had deferred would have cost $800,000.
The second was at a health insurer in 2016. The opposite problem. A well-meaning but overly cautious capacity team had been requesting upgrades based on peak-hour utilization that included a 40% "safety margin." Over five years, they had accumulated roughly $4 million in excess capacity — processors that were never utilized above 35%, memory that was never touched, DASD that sat empty. When a new CIO arrived and asked why the mainframe cost $18 million a year to run while averaging 28% utilization, the capacity team couldn't produce a business justification. The result: a poorly conceived "cloud migration" initiative driven by cost perception rather than architectural analysis. Three years and $12 million later, the core claims processing was still on the mainframe, but the organization had lost trust in the platform.
Both stories illustrate the same truth: capacity planning is a business discipline that uses technical data. Get it wrong in either direction and you pay — either in outages or in wasted capital. The technical measurement is the easy part. The hard part is translating MSU figures and CPU utilization percentages into language that finance and business leadership can act on.
The Capacity Planning Mandate
At the architect level — which is where you are if you're reading this book — capacity planning is not optional. It's as fundamental as security architecture (Chapter 28) or disaster recovery (Chapter 30). Here's why:
Financial exposure. IBM mainframe costs are directly tied to consumption. Under IBM's Monthly License Charge (MLC) model, the software bill for z/OS, CICS, DB2, MQ, and the rest is calculated from your rolling four-hour average MSU consumption. Every MSU you consume is money. Every MSU you waste is also money. At CNB, Kwame Mensah estimates that 1 MSU of sustained capacity costs approximately $1,500–$2,000 per month in combined hardware depreciation and software MLC charges. That means a 10% forecasting error on a 3,000 MSU machine is $450,000–$600,000 per month in either direction.
SLA commitments. CNB's service level agreements specify 200ms online response time for 95% of transactions. Pinnacle Health's claims adjudication must process 50 million claims per month within regulatory timelines. Federal Benefits Administration processes payments for 4.2 million recipients on a fixed schedule — late payments trigger congressional inquiries. None of these SLAs care about your capacity plan. They care about whether the system performs. Capacity planning is how you ensure it does.
Lead times. You cannot add mainframe capacity overnight. Even with Capacity on Demand (CoD) features, activating additional capacity requires planning, testing, and WLM reconfiguration. A full hardware upgrade — moving from a z15 to a z16, for example — involves months of planning, procurement, installation, migration, and validation. If you start capacity planning when utilization hits 85%, you're already behind.
Regulatory expectations. In banking, FFIEC examiners expect documented capacity planning processes. In healthcare, HIPAA requires that systems be available and performant. In government, OMB Circular A-130 mandates IT resource planning. Ahmad Rashidi at Pinnacle Health learned this when an auditor asked for the capacity plan and got a blank stare. The finding cost Pinnacle three months of remediation work.
💡 KEY INSIGHT: The goal of capacity planning is not to predict the future perfectly. It's to reduce the variance of outcomes to a manageable range. A capacity plan that says "we need between 2,800 and 3,200 MSUs by Q4" is infinitely more useful than no plan at all, even if the actual number turns out to be 3,050.
Where Capacity Planning Fits in the Architecture
Capacity planning connects to nearly every other topic in this book:
- WLM (Chapter 5): Service classes and workload classification determine how capacity is consumed. A poorly configured WLM policy can waste 10–15% of your capacity through unnecessary dispatching priority conflicts.
- DB2 (Chapters 6–12): Database access patterns — sequential scans, index lookups, sort operations, utility processing — are major capacity consumers. A single bad SQL statement can consume more CPU than an entire batch stream.
- CICS (Chapters 13–18): Online transaction volumes directly drive MSU consumption. CICS tuning decisions (MAXT, thread pooling, compression) affect how efficiently that consumption translates into business transactions.
- Batch (Chapters 23–27): Batch workloads have fundamentally different capacity profiles than online workloads. The batch window represents your peak consumption period for most shops. Batch performance optimization (Chapter 26) directly reduces capacity requirements.
- Security (Chapter 28): Encryption — particularly pervasive encryption on z15/z16 — consumes CPU cycles. RACF processing adds overhead. Security is not free, and it must be budgeted for.
29.2 Measuring Capacity: MSU, MIPS, CPU Seconds, zIIP Eligibility, and RMF Data
Before you can plan capacity, you have to measure it. And on the mainframe, measurement is simultaneously precise and confusing, because IBM uses multiple units that don't translate cleanly into each other, and the industry has layered decades of historical baggage on top of them.
Let's cut through the confusion.
29.2.1 The Units — What They Mean and What They Don't
CPU Seconds. The most fundamental unit. A CPU second is one second of processor time consumed by a task or address space. This is what SMF records report. This is what RMF measures. It's precise, unambiguous, and useless for capacity planning by itself — because a CPU second on a z13 is not the same as a CPU second on a z16. Processor speeds differ across generations and models.
MIPS (Millions of Instructions Per Second). The traditional capacity unit. A 10,000 MIPS machine can theoretically execute 10 billion instructions per second. In practice, MIPS ratings are derived from IBM's Large Systems Performance Reference (LSPR) benchmarks, not from literal instruction counting. MIPS is still widely used in conversation ("we need a 15,000 MIPS box") but IBM discourages it because MIPS varies by instruction mix, and modern processors do dramatically more work per instruction than older ones.
MSU (Millions of Service Units). IBM's official capacity unit since the early 2000s. An MSU is derived from the LSPR benchmark ratings and represents a hardware-independent measure of processing capacity. One MSU on a z14 represents the same nominal workload capacity as one MSU on a z16 — at least in theory. MSU is what IBM uses for software pricing, what SCRT reports, and what your CFO sees on the bill.
The conversion between MIPS and MSU is approximately:
1 MSU ≈ 8.5 MIPS (varies by processor model)
Example: z16 Model A01-706
Rated capacity: 6,534 MSU
Approximate MIPS: ~55,500
Ratio: ~8.5 MIPS per MSU
⚠️ CRITICAL: Never mix units in the same analysis. If your capacity plan uses MSU, use MSU everywhere. If your performance team thinks in MIPS, convert consistently. I've seen capacity plans that mixed MSU for software licensing and MIPS for hardware sizing — the resulting confusion cost a team two months of rework.
Service Units. The most granular IBM capacity measure. A service unit combines CPU time, I/O activity, and storage usage into a single weighted value. SMF type 72 records report service units consumed by WLM service classes. For capacity planning, you'll typically work with either MSU or CPU seconds, but service units appear in RMF reports and you need to understand them.
29.2.2 zIIP and the Specialty Engine Factor
Here is where capacity planning gets strategically interesting.
IBM's Integrated Information Processor (zIIP) is a specialty engine that runs specific workloads at zero software licensing cost. Work that executes on a zIIP does not count toward your MSU consumption for MLC pricing. Read that again, because it is the single most important cost optimization lever on the platform.
Workloads eligible for zIIP include:
- DB2 DRDA distributed processing — queries from remote clients, JDBC/ODBC access
- DB2 parallel query processing — CPU consumed by DB2 parallelism
- XML processing — XML parsing and validation within DB2
- IPSec encryption — network encryption processing
- z/OS Container Extensions (zCX) — Linux workloads on z/OS
- Java and JVM workloads — when running under z/OS (significant for z/OS Connect, Liberty)
- Certain IMS and MQ processing — specific subsystem functions
What is not zIIP-eligible:
- Traditional COBOL batch processing — your mainline batch jobs run on general purpose (GP) processors
- CICS COBOL transactions — CICS application processing runs on GPs
- DB2 locally attached processing — when CICS or batch COBOL calls DB2 through a local thread, the DB2 work runs on GPs (mostly)
The capacity planning implication is significant. If you can restructure workloads to shift processing to zIIP-eligible paths, you reduce MSU consumption without reducing throughput. This is not theoretical — it's a standard optimization technique:
Example: CNB's Balance Inquiry
Current architecture:
CICS transaction → local DB2 call → GP processor
CPU: 0.0035 seconds per transaction
Volume: 12M transactions/day
Daily GP CPU: 42,000 CPU seconds
Restructured architecture:
API call → z/OS Connect (zIIP) → DRDA to DB2 (zIIP) → response
CPU: 0.0042 seconds per transaction (slightly higher due to DRDA overhead)
zIIP eligible: ~70% of CPU
Daily GP CPU: 15,120 CPU seconds (0.0042 × 0.30 × 12M)
Daily zIIP CPU: 35,280 CPU seconds
GP CPU reduction: 64% → direct MSU and MLC savings
Trade-off: 20% more total CPU, but 70% runs on free engines
Lisa Tran at CNB ran exactly this analysis when evaluating the z/OS Connect API layer for the mobile banking initiative. The total CPU increased by 20%, but the MSU-rated consumption dropped by 55% for that workload. The annual MLC savings paid for the z/OS Connect implementation in nine months.
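The arithmetic behind this trade-off is simple enough to sanity-check in a few lines. The sketch below hard-codes the example's illustrative figures; `daily_cpu_seconds` is a hypothetical helper, not a real tool:

```python
def daily_cpu_seconds(cpu_per_txn, txns, ziip_share):
    """Return (gp_seconds, ziip_seconds) for a daily transaction volume."""
    total = cpu_per_txn * txns
    return total * (1 - ziip_share), total * ziip_share

# Current architecture: every CPU second lands on GP processors
gp_before, _ = daily_cpu_seconds(0.0035, 12_000_000, 0.0)

# Restructured: 20% more CPU per transaction, but ~70% is zIIP-eligible
gp_after, ziip_after = daily_cpu_seconds(0.0042, 12_000_000, 0.70)

gp_reduction = 1 - gp_after / gp_before  # ≈ 0.64
```

The point the numbers make: total CPU goes up, but the MSU-rated (GP) portion drops by roughly two-thirds, which is what drives the MLC savings.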
💡 KEY INSIGHT: Capacity planning is not just about "how much do we need." It's about "how much do we need on which engine type." Every workload should be evaluated for zIIP offload potential as part of the capacity planning process.
29.2.3 RMF — The Data Source for Capacity Planning
Resource Measurement Facility (RMF) is the z/OS performance and capacity measurement subsystem. If SMF is the telemetry system (raw data), RMF is the analysis engine (processed metrics). For capacity planning, three RMF data sources are essential:
RMF Monitor I (Long-Term Reporting). Collects interval-based data — typically at 15-minute or 1-hour intervals — and writes SMF type 70–79 records. This is your primary capacity planning data source. Key record types:
| SMF Type | Content | Capacity Use |
|---|---|---|
| 70 | Processor activity | CPU utilization by LPAR, by engine type (GP, zIIP) |
| 70.2 | Crypto hardware | Encryption capacity utilization |
| 71 | Paging activity | Real storage consumption and constraint indicators |
| 72 | Workload activity | Service units consumed by WLM service class |
| 73 | Channel path activity | I/O channel utilization |
| 74 | Device activity | DASD device utilization and response time |
| 75 | Page/swap dataset activity | Paging subsystem capacity indicators |
| 78 | Virtual storage | Virtual storage consumption by address space |
RMF Monitor III (Real-Time). Provides real-time capacity data through ISPF panels or the RMF Distributed Data Server. Useful for immediate capacity assessment but not for planning — you need historical data for that.
RMF Post-Processor (PP). Processes RMF Monitor I data to produce formatted reports. The post-processor can generate:
- Duration reports: Capacity metrics aggregated over a time period (day, week, month)
- Exception reports: Intervals where metrics exceeded defined thresholds
- Trend reports: Capacity metrics plotted over time to show growth patterns
For capacity planning, the duration report is your workhorse. Here's what a typical processor utilization duration report tells you:
RMF Post-Processor Duration Report
Period: 2025-01-01 to 2025-01-31
System: CNBPLEX (4 LPARs)
LPAR Engine Avg% Peak% Peak Time MSU(Avg) MSU(Peak)
------ ------ ----- ------ ------------------ -------- ---------
CNBA GP 62.3 89.7 2025-01-15 02:15 823 1,185
CNBA zIIP 41.2 78.4 2025-01-15 02:30 n/a n/a
CNBB GP 58.1 84.2 2025-01-31 01:45 768 1,112
CNBB zIIP 38.7 71.3 2025-01-31 02:00 n/a n/a
CNBC GP 47.5 72.8 2025-01-15 13:30 628 962
CNBC zIIP 29.4 54.1 2025-01-15 14:00 n/a n/a
CNBD GP 43.2 68.4 2025-01-31 02:15 571 904
CNBD zIIP 25.8 49.7 2025-01-31 02:30 n/a n/a
Sysplex GP 52.8 89.7 2025-01-15 02:15 2,790 4,163
Sysplex zIIP 33.8 78.4 2025-01-15 02:30 n/a n/a
Several things jump out of this report to a capacity planner:
- Peak utilization of 89.7% on CNBA during batch processing (02:15 = middle of batch window). That's too high. Industry best practice is to keep peak GP utilization below 85% to preserve WLM's ability to manage workloads. Above 85%, WLM starts making hard choices about what to delay.
- The peak occurs during batch, not online. This is typical for most mainframe shops. Batch processing concentrates more CPU per unit time than distributed online transactions.
- zIIP utilization peaks at 78.4%. Headroom exists on zIIP, but not as much as you'd like. If the shop is planning to shift more workloads to zIIP (z/OS Connect, Java), they'll need to account for this.
- LPAR imbalance. CNBA at 62.3% average while CNBD at 43.2%. This suggests potential for workload rebalancing across LPARs before buying more capacity.
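Checks like these are easy to automate once the duration report lands in the capacity database. A minimal sketch, assuming rows shaped like the report above (the names `ROWS`, `over_ceiling`, and `avg_spread` are illustrative; the 85% ceiling comes from this section):

```python
# Rows distilled from the duration report above: (lpar, avg_gp_pct, peak_gp_pct)
ROWS = [
    ("CNBA", 62.3, 89.7),
    ("CNBB", 58.1, 84.2),
    ("CNBC", 47.5, 72.8),
    ("CNBD", 43.2, 68.4),
]

PEAK_CEILING = 85.0  # best-practice ceiling for peak GP utilization

def over_ceiling(rows, limit=PEAK_CEILING):
    """LPARs whose peak exceeds the ceiling and need attention first."""
    return [lpar for lpar, _avg, peak in rows if peak > limit]

def avg_spread(rows):
    """Spread of average utilization across LPARs; a large spread hints at rebalancing potential."""
    avgs = [avg for _lpar, avg, _peak in rows]
    return max(avgs) - min(avgs)
```

Here `over_ceiling` would flag CNBA, and the 19-point spread from `avg_spread` is the imbalance signal noted above.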
29.2.4 Building the Capacity Data Repository
Raw RMF data is useless for planning. You need it organized, historicized, and queryable. Here's the data pipeline that Kwame Mensah built at CNB — and that I've seen replicated at every well-run shop:
SMF Records (Type 70-78)
│
▼
SMF Dump Job (runs every 4 hours)
│
▼
RMF Post-Processor (daily duration reports)
│
▼
COBOL Load Program (CAPDLOAD)
│
▼
DB2 Capacity Database
├── CAPACITY_CPU_DAILY (1 row per LPAR per day)
├── CAPACITY_CPU_HOURLY (1 row per LPAR per hour)
├── CAPACITY_WKL_DAILY (1 row per service class per day)
├── CAPACITY_IO_DAILY (1 row per device group per day)
└── CAPACITY_STORAGE (1 row per storage group per week)
The key design decision: store hourly data for 90 days, daily data for 3 years. Hourly granularity is essential for identifying peak patterns and seasonal effects within a day. Daily data over three years gives you enough history for meaningful trend analysis and year-over-year comparison. Anything less than 18 months of daily data and your seasonal models will be garbage.
🔗 CROSS-REFERENCE: The SMF data pipeline here follows the same architecture as the batch monitoring pipeline in Chapter 27, Section 27.2. In fact, at CNB, the capacity data load runs as part of the same morning analysis batch stream. One SMF dump feeds both monitoring and capacity analysis.
29.3 Workload Characterization — Different Workloads, Different Capacity Profiles
Not all MSUs are created equal. A CICS transaction that consumes 0.003 CPU seconds and a batch job that consumes 3,000 CPU seconds have radically different capacity planning implications — different peak patterns, different growth drivers, different optimization levers. You cannot build a meaningful capacity plan without first understanding what your workload is made of.
29.3.1 The Four Workload Families
Every mainframe shop's workload decomposes into four families, each with distinct capacity characteristics:
Online Transactions (CICS/IMS). High volume, low CPU per transaction, sustained throughout the business day. Capacity driven by transaction volume and per-transaction cost.
CNB Online Profile:
Volume: 500M transactions/day (~35,000/second peak)
Avg CPU: 0.003 seconds per transaction
Peak hour: 12:00-1:00 PM (lunch hour mobile banking)
Peak MSU: ~960 MSU (online workload only)
Growth: 12-15% annually (mobile banking driving volume)
zIIP share: 5% (DB2 thread scheduling, minimal)
Batch Processing. High CPU per job, concentrated in the batch window, driven by data volume and processing complexity.
CNB Batch Profile:
Jobs/night: 2,500
Total CPU: 85,000 CPU seconds
Peak hour: 01:00-02:00 AM (account posting cycle)
Peak MSU: ~1,185 MSU (single LPAR) / ~3,800 MSU (sysplex)
Growth: 8-10% annually (account base growth)
zIIP share: 2% (minimal — mostly traditional COBOL)
DB2 Subsystem Processing. Utility processing (REORG, RUNSTATS, COPY), distributed access (DRDA), and parallel query processing. Often scheduled in the batch window but with different characteristics than application batch.
CNB DB2 Profile:
Utility CPU: 12,000 CPU seconds/night
DRDA CPU: 8,500 CPU seconds/day (API and distributed access)
Peak utility: 03:00-04:00 AM (REORG window)
Peak DRDA: 12:00-1:00 PM (aligned with online peak)
zIIP share: 45% of DRDA (significant offload opportunity)
Growth: 15-20% annually (API growth driving DRDA)
MQ and Integration Processing. Message-driven workloads — inter-system communication, event processing, file transfers.
CNB MQ Profile:
Messages/day: 8M (persistent) + 22M (non-persistent)
CPU/message: 0.0008 seconds (average)
Peak: Aligned with online peak + batch extract window
Peak MSU: ~120 MSU
zIIP share: 15% (channel initiator processing)
Growth: 25-30% annually (API and event-driven growth)
29.3.2 The Workload Mix Shift — Why It Matters
Here's the capacity planning insight that separates practitioners from number-crunchers: the mix is shifting.
Ten years ago, batch processing was 65% of a typical mainframe shop's capacity consumption. Today, at shops like CNB, batch is 45% and falling. Online and integration workloads are growing at 15–25% annually while batch grows at 8–10%. The workload that's growing fastest — API and distributed access via z/OS Connect — happens to be the most zIIP-eligible.
This has profound implications for capacity planning:
- Peak hour is shifting. Traditional mainframe peak was 2:00 AM (batch). Modern mainframe peak may be 12:00 PM (online + API) on some LPARs. Your capacity plan must address both peaks.
- Engine mix is shifting. You may need more zIIP capacity and less GP capacity — but IBM caps the number of zIIPs relative to GP engines (a 2:1 ratio on current hardware). The ratio matters.
- Growth rate varies by workload type. A single "system growth rate" is meaningless. You need growth projections for each workload family independently.
At Pinnacle Health, Diane Okoye discovered this shift the hard way. Her capacity plan projected 8% overall growth based on historical trends. What actually happened: claims batch grew 5% (slower than historical because they'd optimized the adjudication engine), but the new real-time eligibility API grew 40% in its first year because provider adoption exceeded projections. The aggregate was close to 8%, but the capacity was needed on different LPARs at different times than the plan assumed.
📊 DATA POINT: Across the twelve mainframe shops I've worked with in the last five years, the average workload mix shift has been: batch declining 3–5% of total capacity per year, online steady or growing 2–3%, API/integration growing 15–25%. If your capacity plan doesn't account for this shift, it's wrong.
29.3.3 Characterizing Your Workloads — The WLM Service Class Approach
The most efficient way to characterize workloads for capacity planning is through WLM service classes. If your WLM policy is well-designed (Chapter 5), each service class represents a meaningful workload category. RMF reports service units consumed by service class (SMF type 72), which gives you a clean decomposition.
At CNB, Kwame configured the following service class structure for capacity planning purposes:
Service Class Workload Type Capacity Category
─────────────────────────────────────────────────────────────
SC_CICS_HIGH CICS Tier-1 trans Online-Critical
SC_CICS_NORMAL CICS Tier-2 trans Online-Standard
SC_CICS_LOW CICS inquiry/browse Online-Deferrable
SC_BATCH_CRIT Critical batch Batch-Critical
SC_BATCH_STD Standard batch Batch-Standard
SC_BATCH_LOW Low priority batch Batch-Deferrable
SC_DB2_UTIL DB2 utility DB2-Maintenance
SC_DB2_DRDA DB2 DRDA access API-Integration
SC_MQ_HIGH MQ critical channels Integration-Critical
SC_MQ_STD MQ standard channels Integration-Standard
SC_TSO TSO users Development
SC_OMVS Unix System Services Infrastructure
Each service class maps to a capacity category. The capacity plan tracks and forecasts each category independently. This alignment between WLM and capacity planning is not accidental — Kwame designed them together. If your shop has "one big service class" for all batch work, your capacity planning data will be too coarse to be useful.
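Given SMF type 72 data keyed by service class, the rollup into capacity categories is a straightforward aggregation. A sketch with a truncated version of the mapping above (the `CATEGORY` table and `rollup` function are illustrative names):

```python
from collections import defaultdict

# Truncated version of the service-class mapping above (illustrative)
CATEGORY = {
    "SC_CICS_HIGH":   "Online-Critical",
    "SC_CICS_NORMAL": "Online-Standard",
    "SC_BATCH_CRIT":  "Batch-Critical",
    "SC_DB2_DRDA":    "API-Integration",
}

def rollup(smf72_rows):
    """Aggregate (service_class, service_units) pairs into capacity categories."""
    totals = defaultdict(float)
    for service_class, service_units in smf72_rows:
        totals[CATEGORY.get(service_class, "Uncategorized")] += service_units
    return dict(totals)
```

Anything that lands in "Uncategorized" is a signal that the WLM policy and the capacity plan have drifted apart.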
🔄 SPACED REVIEW — Chapter 5 (WLM): Remember that WLM service classes determine dispatching priority, which determines resource allocation, which determines throughput. A workload that's capped at Importance 4 will never consume the same capacity as the same workload at Importance 1 — WLM will throttle it when resources are constrained. Your capacity plan must reflect WLM's actual resource allocation, not the theoretical maximum.
29.4 Forecasting Models — From Historical Data to Future Projections
You have the data. You have the workload characterization. Now comes the part where most capacity planning efforts either succeed or become expensive fantasy: forecasting.
I'll be direct. Capacity forecasting on the mainframe is not rocket science. It uses the same statistical techniques you'd apply in any forecasting domain. What makes it hard is not the math — it's the combination of technical data and business intelligence that you need to get the inputs right.
29.4.1 Linear Regression — The Foundation
For workloads with steady growth, linear regression is sufficient and interpretable. You're fitting a line through historical data points and extending it forward.
Model: MSU(t) = a + b × t
Where:
MSU(t) = projected MSU consumption at time t
a = intercept (baseline MSU)
b = slope (MSU growth per period)
t = time period (months from baseline)
CNB Batch Example (24 months of data):
Baseline (t=0): 2,450 MSU average daily peak
Month 24 actual: 2,890 MSU average daily peak
Slope: 18.3 MSU per month
Projection for Month 36: 2,450 + (18.3 × 36) = 3,109 MSU
Projection for Month 48: 2,450 + (18.3 × 48) = 3,328 MSU
Linear regression works well when:
- Growth has been steady over the historical period
- No major business changes are anticipated
- The forecast horizon is 12–18 months (beyond that, linear becomes unreliable)

Linear regression fails when:
- Growth is accelerating or decelerating
- Business events (acquisitions, new products, regulatory changes) will change the trajectory
- Seasonal patterns dominate the data and you're forecasting at a granularity that doesn't account for them
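As a concrete sketch, the trend model can be fitted and projected with ordinary least squares; `fit_linear` and `project` are illustrative names, and the projection reproduces the CNB batch example:

```python
def fit_linear(points):
    """Ordinary least-squares fit of (t, msu) points; returns (a, b) for MSU(t) = a + b*t."""
    n = len(points)
    st = sum(t for t, _ in points)
    sy = sum(y for _, y in points)
    stt = sum(t * t for t, _ in points)
    sty = sum(t * y for t, y in points)
    b = (n * sty - st * sy) / (n * stt - st * st)  # slope: MSU growth per month
    a = (sy - b * st) / n                          # intercept: baseline MSU
    return a, b

def project(a, b, t):
    """Extend the fitted line to month t."""
    return a + b * t
```

Using the chapter's fitted parameters, `project(2450, 18.3, 36)` gives roughly 3,109 MSU, matching the Month 36 projection above.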
29.4.2 Seasonal Adjustment — Accounting for Predictable Patterns
Every mainframe shop has seasonal patterns. End-of-month processing is heavier than mid-month. Q4 is heavier than Q1 in retail banking. Open enrollment drives healthcare workloads in Q4. Tax season drives government workloads in Q1.
If you fit a linear model to data that includes seasonal effects, the model will be wrong during every seasonal peak and trough. Seasonal adjustment separates the trend from the pattern:
Model: MSU(t) = Trend(t) × SeasonalIndex(t)
Step 1: Calculate monthly averages over 2+ years
Step 2: Compute overall average
Step 3: Seasonal index = monthly average / overall average
CNB Monthly Seasonal Indices (batch MSU):
Jan: 0.95 Feb: 0.93 Mar: 1.02 Apr: 0.98
May: 0.96 Jun: 1.01 Jul: 0.94 Aug: 0.97
Sep: 1.04 Oct: 1.08 Nov: 1.05 Dec: 1.12
Interpretation:
December batch MSU is typically 12% above the annual trend
February batch MSU is typically 7% below
Range: 19 percentage points (significant)
With seasonal adjustment, the forecast becomes:
Quarter-End Forecast (March, Month 36):
Trend projection: 3,109 MSU
Seasonal index: 1.02
Adjusted forecast: 3,171 MSU
Year-End Forecast (December, Month 48):
Trend projection: 3,328 MSU
Seasonal index: 1.12
Adjusted forecast: 3,727 MSU — 12% higher than the unadjusted trend
That 12% difference between the trended and seasonally adjusted December forecast? That's the difference between a capacity plan that works and one that results in a 3:00 AM phone call to the CIO on December 31st.
⚠️ CRITICAL: You need at least two full years of data to calculate meaningful seasonal indices. One year gives you one data point per month — not enough to distinguish signal from noise. Three years is better. If your shop has less than two years of data, use industry benchmarks for seasonal patterns and flag them as assumptions.
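The index calculation and the adjustment itself are one-liners. A sketch using the CNB figures above (function names are illustrative):

```python
def seasonal_indices(monthly_averages):
    """Index per month: monthly average divided by the overall average (needs 2+ years of data)."""
    overall = sum(monthly_averages) / len(monthly_averages)
    return [m / overall for m in monthly_averages]

def seasonally_adjust(trend_msu, index):
    """Apply the month's seasonal index to the trend projection."""
    return trend_msu * index
```

With the chapter's indices, `seasonally_adjust(3109, 1.02)` reproduces the March figure of about 3,171 MSU, and `seasonally_adjust(3328, 1.12)` the December figure of about 3,727 MSU.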
29.4.3 Business-Driven Projections — When History Isn't Enough
Here's where capacity planning becomes an art informed by science, rather than pure science.
Linear regression and seasonal adjustment assume the future will look like the past, only more so. But businesses don't work that way. Acquisitions, new products, regulatory changes, technology migrations, and strategic initiatives create step-function changes that no historical trend can predict.
At CNB, Kwame maintains a "capacity events calendar" — a forward-looking list of business initiatives with estimated capacity impacts:
CNB Capacity Events Calendar — 2025-2026
Event Target Date Estimated Impact
──────────────────────────────────────────────────────────────
Mobile banking v3.0 launch 2025-Q2 +150 MSU online
+80 MSU zIIP (API)
Acquire Pacific Trust Bank 2025-Q3 +400 MSU batch
+200 MSU online
+120 MSU DB2 utility
Real-time fraud detection 2025-Q4 +100 MSU online
+250 MSU zIIP (ML)
ISO 20022 migration 2026-Q1 +80 MSU batch
+40 MSU MQ
DB2 13 upgrade 2026-Q2 -5% batch CPU (optimizer improvements)
+10% zIIP offload
Pervasive encryption phase 2 2026-Q3 +60 MSU crypto processing
Each event has an estimated capacity impact — some positive (new workloads), some negative (optimizations), some shifting between engine types. These estimates come from three sources:
- Analogous workloads. If CNB is acquiring Pacific Trust Bank, and Pacific Trust runs a 1,200 MSU workload today, the merged workload won't be 1,200 MSU additional — some processing will consolidate. Kwame estimates 60% of Pacific Trust's standalone capacity will be additive.
- Vendor benchmarks. IBM publishes capacity impact estimates for z/OS version upgrades, DB2 upgrades, and feature enablements like pervasive encryption. These are starting points, not gospel.
- Proof-of-concept measurements. For high-impact changes like the fraud detection system, CNB runs a capacity POC — deploying a test instance and measuring actual CPU consumption under simulated production volume.
The composite forecast combines trend, seasonality, and business events:
Composite Forecast = Trend(t) × SeasonalIndex(t) + Σ(EventImpact(t))
Example: 2025-Q4 forecast
Trend projection: 3,240 MSU
Seasonal (October avg): × 1.08 = 3,499 MSU
Mobile banking v3.0: + 150 MSU (launched Q2, full effect by Q4)
Pacific Trust merge: + 400 MSU (batch, phased by Q4)
Fraud detection: + 100 MSU (GP) + 250 MSU (zIIP)
Total GP forecast: 4,149 MSU
Total zIIP forecast: current zIIP + 250 MSU
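The composite formula translates directly into code. A sketch reproducing the 2025-Q4 GP calculation above (`composite_forecast` is an illustrative name):

```python
def composite_forecast(trend_msu, seasonal_index, event_impacts):
    """Composite = Trend(t) × SeasonalIndex(t) + Σ EventImpact(t)."""
    return trend_msu * seasonal_index + sum(event_impacts)

# 2025-Q4 GP example: trend 3,240 MSU, October index 1.08,
# plus mobile banking (+150), Pacific Trust (+400), fraud detection GP (+100)
q4_gp = composite_forecast(3240, 1.08, [150, 400, 100])  # ≈ 4,149 MSU
```

Note that event impacts are added after the seasonal multiplication: seasonality applies to the historical trend, not to step-function changes that have no history yet.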
Sandra Chen at Federal Benefits Administration uses a variant of this approach. Instead of acquisition-driven events, her capacity events are regulatory: new benefit programs mandated by Congress, annual cost-of-living adjustments that increase payment processing volume, and modernization initiatives that shift processing between IMS and DB2. Her capacity events calendar is driven by the legislative calendar — when a new benefits program passes, she has 12–18 months to deploy it, and capacity planning starts the day the legislation is signed.
🔗 CROSS-REFERENCE — Chapter 23 (Batch Window): Remember the Q4 crisis at CNB where batch window capacity was exhausted? That crisis was fundamentally a capacity planning failure. The 30% transaction volume increase from the mobile banking partnership should have appeared on the capacity events calendar six months before go-live. If it had, the batch window could have been re-engineered before it broke.
29.4.4 Confidence Intervals and the Planning Range
A point forecast is a lie. The honest answer to "how much capacity will we need in Q4?" is a range, not a number. Presenting a single number implies certainty that doesn't exist.
At CNB, Kwame presents capacity forecasts with three scenarios:
CNB Q4 2025 Capacity Forecast (GP MSU):
Scenario MSU Assumptions
─────────────────────────────────────────────────────────
Conservative 3,650 Trend only, no events early
Expected 4,149 Trend + seasonal + events (on schedule)
Aggressive 4,580 Expected + 10% buffer + events early
+ higher-than-projected mobile growth
Planning target: 4,300 MSU (between Expected and Aggressive)
Current capacity: 4,200 MSU
Gap: 100 MSU (within CoD activation range)
The planning target is deliberately set between Expected and Aggressive — biased toward having more capacity than you need rather than less. The cost asymmetry justifies this: an unexpected capacity shortage costs 5–10x more than the equivalent over-provisioning. Kwame's rule of thumb: plan for the 75th percentile scenario, not the 50th.
29.5 MSU Budgeting and Optimization — Where Capacity Meets Finance
This is the section that will make you valuable in meetings with the CFO. Every section before this was about measuring and forecasting capacity. This section is about money.
29.5.1 IBM Software Pricing Models
IBM mainframe software pricing is complex, evolving, and the single largest controllable cost in most mainframe shops. Understanding the pricing models is not optional for a capacity planner — it's the core of the job.
Monthly License Charge (MLC). The traditional pricing model. You pay a monthly fee for each IBM software product (z/OS, CICS, DB2, MQ, etc.) based on your MSU consumption. The fee is calculated from the Rolling Four-Hour Average (R4HA) — the highest four-hour average MSU consumption in the month, measured across all LPARs running the product.
MLC Pricing Mechanics:
Step 1: RMF collects MSU consumption every measurement interval (typically 5 min)
Step 2: SCRT (Sub-Capacity Reporting Tool) calculates 4-hour rolling averages
Step 3: The highest R4HA in the month = your billable MSU
Step 4: Billable MSU × per-MSU rate = monthly charge
Example:
z/OS base charge at 3,000 MSU: approximately $35,000/month
(Actual rates depend on customer agreement — this is illustrative)
Total MLC stack (z/OS + CICS + DB2 + MQ + utilities):
Typical range: $12–18 per MSU per month
At 3,000 MSU: $36,000–$54,000/month = $432,000–$648,000/year
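The billing arithmetic above is simple enough to sketch directly. The rates here are the illustrative $12–18/MSU figures from this example, not actual IBM pricing:

```python
def mlc_monthly_charge(billable_msu, rate_per_msu):
    """Monthly MLC charge: the month's peak R4HA times the per-MSU rate."""
    return billable_msu * rate_per_msu

# Illustrative stack rates from the example: $12-18 per MSU per month
low, high = mlc_monthly_charge(3000, 12), mlc_monthly_charge(3000, 18)
print(f"Monthly: ${low:,} - ${high:,}")            # Monthly: $36,000 - $54,000
print(f"Annual:  ${low * 12:,} - ${high * 12:,}")  # Annual:  $432,000 - $648,000
```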
Sub-Capacity Pricing. Before sub-capacity pricing (introduced in 2000), IBM charged based on the full capacity of the machine, regardless of how much you used. Sub-capacity pricing charges based on actual LPAR utilization. This was a game-changer — it meant you could share a physical machine across LPARs and only pay for what each LPAR consumed.
The Sub-Capacity Reporting Tool (SCRT) is the mechanism. SCRT reads SMF type 89 records (which contain LPAR utilization data) and type 70 records (processor activity) to calculate the billable MSU for each product on each LPAR. You submit SCRT reports to IBM monthly.
⚠️ CRITICAL: If your SCRT data is inaccurate — due to SMF collection gaps, clock synchronization issues, or misconfigured LPAR definitions — you may be over-reporting (paying too much) or under-reporting (which IBM will eventually audit and bill you retroactively with interest). SCRT accuracy is a financial control, not just a technical nicety.
Tailored Fit Pricing (TFP). IBM's newer pricing model, introduced in 2019, offers alternatives to MLC:
- Enterprise Consumption Solution (ECS): You commit to a total MSU budget for all eligible products. If you stay under the cap, the per-MSU rate is significantly lower than MLC. If you exceed the cap, overage rates apply.
- Enterprise Capacity Solution: Based on machine capacity rather than consumption. Useful for shops with high utilization where the consumption-based model is expensive.
The choice between MLC and TFP depends on your consumption patterns:
Decision Framework:
MLC favors shops with:
- Low utilization (paying only for what you use)
- Highly variable workloads (low R4HA relative to peak)
- Many products with different usage patterns
TFP/ECS favors shops with:
- High, steady utilization (bulk discount on committed capacity)
- Growing workloads (lock in rates before growth increases billing)
- Consolidated product stacks (one cap covers everything)
At CNB, Kwame moved to TFP/ECS two years ago. The committed capacity is 3,200 MSU across all LPARs. With average consumption around 2,800 MSU and peak R4HA around 3,400 MSU, the ECS model saves approximately 15% compared to traditional MLC — because the few peak hours that push above 3,200 MSU are less expensive under ECS overage rates than they would be as the R4HA billing point under MLC.
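To see why the comparison tips toward ECS for CNB's profile, here is a toy cost model. All per-MSU rates below are hypothetical placeholders — real MLC and ECS rates come from the customer agreement:

```python
def mlc_cost(peak_r4ha, rate):
    # MLC: billed on the month's highest R4HA
    return peak_r4ha * rate

def ecs_cost(committed, consumption, committed_rate, overage_rate):
    # ECS (toy model): committed MSU at a discounted rate, plus any
    # consumption above the commitment billed at an overage rate
    overage = max(0, consumption - committed)
    return committed * committed_rate + overage * overage_rate

# CNB's profile: committed 3,200 MSU, peak R4HA around 3,400 MSU
mlc = mlc_cost(peak_r4ha=3400, rate=15)             # hypothetical $15/MSU
ecs = ecs_cost(committed=3200, consumption=3400,
               committed_rate=12, overage_rate=18)  # hypothetical rates
print(f"MLC ${mlc:,}/mo vs ECS ${ecs:,}/mo -> {1 - ecs / mlc:.0%} cheaper")
```

With these toy rates the gap lands in the same ballpark as CNB's observed ~15%. The structural point survives any specific rate card: under ECS, the few peak hours above the commitment are billed as overage, instead of setting the R4HA billing point for the entire month.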
29.5.2 The R4HA — Your Most Important Number
Under MLC pricing, the Rolling Four-Hour Average is literally the number that determines your bill. Understanding what drives it — and how to manage it — is the most valuable capacity planning skill you can have.
The R4HA calculation:
For each 5-minute SMF interval:
1. Calculate LPAR MSU consumption (from SMF type 70)
2. For each 4-hour window that includes this interval:
- Average all interval MSUs within the window
3. The month's billable MSU = highest 4-hour average
Visual example (simplified to hourly):
Hour: 00 01 02 03 04 05 06 07 08
MSU: 2400 2800 3200 3600 3400 3000 2600 2200 1800
4hr windows:
00-03: (2400+2800+3200+3600)/4 = 3000
01-04: (2800+3200+3600+3400)/4 = 3250
02-05: (3200+3600+3400+3000)/4 = 3300 ← highest
03-06: (3600+3400+3000+2600)/4 = 3150
04-07: (3400+3000+2600+2200)/4 = 2800
05-08: (3000+2600+2200+1800)/4 = 2400
Month's R4HA = 3300 MSU (this is what you're billed for)
The capacity planning implication: your bill is determined by your worst four-hour period. A single month-end batch cycle that pushes utilization up for four hours can set the R4HA for the entire month. This creates a strong financial incentive to manage peak consumption.
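The rolling-window arithmetic in the example above can be sketched in a few lines (hourly readings for readability; SCRT actually works from 5-minute SMF intervals):

```python
def r4ha_peak(hourly_msu, window=4):
    """Highest rolling N-hour average over a series of hourly MSU readings."""
    averages = [sum(hourly_msu[i:i + window]) / window
                for i in range(len(hourly_msu) - window + 1)]
    return max(averages)

# Hourly MSU readings from the worked example (hours 00-08)
readings = [2400, 2800, 3200, 3600, 3400, 3000, 2600, 2200, 1800]
print(r4ha_peak(readings))  # 3300.0 -- the 02:00-05:00 window
```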
29.5.3 R4HA Management Techniques
Here are the techniques that actually work in production. I've seen each of these save real money:
1. Batch Scheduling Optimization (from Chapter 23). Spread batch workloads across the window rather than concentrating them. If your peak four hours are 01:00–05:00, and you can shift 15% of that workload to start at 23:00 or extend to 06:00, you flatten the peak and reduce the R4HA.
Before optimization:
01:00-05:00 average: 3,300 MSU ← R4HA
After shifting 15% to earlier/later:
23:00-01:00 average: 2,600 MSU (was 2,200)
01:00-05:00 average: 2,805 MSU ← new R4HA
05:00-07:00 average: 2,400 MSU (was 2,000)
R4HA reduction: 495 MSU
Annual MLC savings at $15/MSU/month: $89,100
Rob Calloway at CNB did exactly this analysis. By rescheduling non-critical batch work (report generation, archiving, statistics) to start two hours earlier in the window, he reduced the peak R4HA by over 400 MSU. The batch window critical path was unaffected — only non-critical work moved.
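The savings arithmetic behind this kind of rescheduling is straightforward — a sketch using the numbers from the example above, with the illustrative $15/MSU/month rate from earlier in the section:

```python
def annual_mlc_savings(old_r4ha, new_r4ha, rate_per_msu_month=15):
    """Annual MLC saving from lowering the month's billing R4HA."""
    return (old_r4ha - new_r4ha) * rate_per_msu_month * 12

# Shifting 15% of the 01:00-05:00 workload lowers the peak window
# average from 3,300 MSU to 2,805 MSU
print(annual_mlc_savings(3300, 2805))  # 89100
```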
2. zIIP Offload Maximization. Every CPU second shifted from GP to zIIP reduces GP MSU consumption and therefore the R4HA. The highest-value offload targets:
- DB2 DRDA processing (convert local threads to DRDA where feasible)
- Java-based API processing (z/OS Connect runs on zIIP)
- XML/JSON transformation (DB2 XML processing is zIIP-eligible)
- Encryption workloads (IPSec processing on zIIP)
3. WLM Capping. WLM can cap the capacity consumed by specific workloads. Use this surgically:
- Cap development/test LPARs during production peak hours
- Cap low-priority batch service classes during the R4HA window
- Cap TSO usage during batch peak (developers shouldn't be running interactive queries at 2:00 AM anyway)
🔗 CROSS-REFERENCE — Chapter 5 (WLM): WLM capping (defined capacity and soft capping) was introduced in Chapter 5. For capacity planning, capping is a financial tool, not just a technical one. By capping non-essential workloads during peak hours, you can reduce the R4HA without affecting business-critical processing.
4. Workload Balancing Across LPARs. Under sub-capacity pricing, each LPAR's MSU is calculated independently, and a product's billable MSU reflects the LPARs where that product runs. Simply rebalancing simultaneous peaks doesn't help: if one LPAR peaks at 1,200 MSU while another runs at 600 MSU in the same window, the combined R4HA is 1,800 — and it's still 1,800 if both run at 900. The real win comes when LPAR-level peaks don't coincide, or when a product is licensed on only a subset of LPARs: balancing then keeps any single LPAR from hitting a disproportionate peak that drives up the product-level R4HA.
5. Capacity on Demand (CoD) for Planned Peaks. IBM offers temporary capacity activation for planned peak periods (month-end, quarter-end, year-end, annual enrollment). You can activate additional MSUs for a defined period — typically measured in days — at a per-day rate that's lower than the monthly MLC equivalent. For predictable seasonal peaks, CoD is more cost-effective than permanently provisioning for the peak.
29.5.4 The MSU Budget — Connecting Capacity to Finance
At mature shops, the capacity plan produces an MSU budget that feeds directly into the IT financial plan. Here's the structure Kwame uses at CNB:
CNB MSU Budget — FY2026
Q1 Q2 Q3 Q4 Annual
──── ──── ──── ──── ──────
GP (Base R4HA): 2,950 3,050 3,200 3,450 3,450*
GP (CoD days): 5 5 10 15 35
zIIP (Avg): 680 720 800 880 880*
MLC Software: $162K $168K $176K $190K $696K
TFP Commitment: - - - - $640K
Hardware Depr: $185K $185K $185K $185K $740K
CoD Charges: $8K $8K $18K $28K $62K
───── ───── ───── ───── ──────
Total: $355K $361K $379K $403K $1,498K
* Annual figure is peak quarter for planning purposes
The MSU budget is reviewed quarterly. Actual R4HA is compared to the forecast. Variances above 5% trigger investigation — either the forecast was wrong (update the model) or something unexpected happened (understand what and adjust).
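The 5% variance trigger is mechanical enough to automate — a minimal sketch of the check described above:

```python
def r4ha_variance(actual, forecast, threshold=0.05):
    """Return the signed variance vs. forecast and whether it breaches
    the investigation trigger (5% in CNB's process)."""
    variance = (actual - forecast) / forecast
    return variance, abs(variance) > threshold

# Example quarter: forecast R4HA 3,200 MSU, actual 3,400 MSU
var, investigate = r4ha_variance(actual=3400, forecast=3200)
print(f"variance {var:+.2%}, investigate: {investigate}")  # +6.25%, True
```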
At Federal Benefits, Sandra Chen aligns her MSU budget to the federal fiscal year (October–September) and maps it to the agency's IT capital plan. Every MSU dollar has to be justified against the agency's mission — processing benefits for 4.2 million recipients. Marcus Whitfield, her legacy SME, once observed dryly: "Nobody in Congress cares about MSUs. They care about whether veterans get paid on time. The MSU budget is how we make sure they do."
29.6 Capacity Impact Analysis — What Happens When Things Change
The steady-state forecast is the baseline. But the most valuable capacity planning skill is impact analysis — answering the question: "If we do X, what happens to capacity?"
29.6.1 Application Changes
Every application change has a capacity signature. Here are the common ones and how to estimate their impact:
New COBOL program deployment. Measure CPU time per transaction or per batch execution in the test environment. Scale by production volume. Apply a test-to-production factor (typically 1.1–1.3x, because production data is messier than test data).
Impact Estimation — New Wire Transfer Module:
Test measurement:
CPU per transaction: 0.0045 seconds
DB2 getpages: 47 per transaction
Transactions/day: 120,000 (projected)
Production projection:
CPU per transaction: 0.0045 × 1.2 (test-to-prod factor) = 0.0054 seconds
Daily CPU: 0.0054 × 120,000 = 648 CPU seconds
Equivalent MSU impact: ~2 MSU on peak (negligible individually)
A single new program is rarely a capacity concern. The danger is cumulative: twenty new programs deployed over a year, each adding 2 MSU, add up to 40 MSU — and nobody noticed because each one was "negligible."
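Both the single-change estimate and the cumulative trap are worth making explicit — a sketch using the wire-transfer figures above (the 1.2 test-to-production factor is the illustrative midpoint from this section):

```python
def prod_daily_cpu(test_cpu_per_txn, txns_per_day, test_to_prod=1.2):
    """Scale a test-environment CPU-per-transaction figure to production."""
    return test_cpu_per_txn * test_to_prod * txns_per_day

# New wire transfer module
print(prod_daily_cpu(0.0045, 120_000))  # ~648 CPU seconds/day

# The cumulative trap: twenty "negligible" ~2 MSU deployments in a year
print(20 * 2)  # 40 MSU of unplanned peak growth
```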
COBOL compiler upgrade. Upgrading from Enterprise COBOL V5 to V6, or V6.3 to V6.4, can change CPU consumption in either direction. IBM's compiler optimizations generally reduce CPU usage — V6.4 typically shows 5–10% improvement over V5 for compute-intensive programs. But occasionally, a compiler change interacts poorly with specific coding patterns. Always benchmark critical programs before and after a compiler upgrade.
29.6.2 DB2 Changes
DB2 changes have the largest and most unpredictable capacity impact of any single factor. Lisa Tran at CNB has a rule: "No DB2 change goes to production without a capacity impact assessment. No exceptions."
New indexes. An index reduces CPU for SELECT statements that use it (by replacing table scans with index lookups) but increases CPU for INSERT, UPDATE, and DELETE operations (which must maintain the index). For high-volume insert tables — like CNB's transaction log — a single unnecessary index can add thousands of CPU seconds per day.
Impact of New Index on TRANSACTION_LOG table:
Benefit (SELECT optimization):
Queries affected: TXNQUERY, TXNRPT, TXNAUDIT
Current access: tablespace scan (avg 12,000 getpages/query)
With index: matching index scan (avg 45 getpages/query)
CPU reduction per query: ~0.8 seconds
Queries/day: 5,000
Daily CPU savings: 4,000 seconds
Cost (INSERT overhead):
Current insert CPU: 0.0012 seconds/insert
Additional index: +0.0003 seconds/insert
Inserts/day: 8,000,000
Daily CPU increase: 2,400 seconds
Net impact: 4,000 - 2,400 = 1,600 CPU seconds savings/day
Lisa runs this analysis for every index change request. She's rejected index requests that had negative net impact — where the insert cost exceeded the query benefit.
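Lisa's accept/reject rule reduces to a benefit-minus-cost calculation — a sketch with the TRANSACTION_LOG figures above:

```python
def index_net_daily_cpu(cpu_saved_per_query, queries_per_day,
                        extra_cpu_per_insert, inserts_per_day):
    """Daily CPU seconds saved (positive) or added (negative) by a new index."""
    select_benefit = cpu_saved_per_query * queries_per_day
    insert_cost = extra_cpu_per_insert * inserts_per_day
    return select_benefit - insert_cost

net = index_net_daily_cpu(0.8, 5_000, 0.0003, 8_000_000)
print(round(net), "approve" if net > 0 else "reject")  # 1600 approve
```

The same function flags the requests Lisa rejects: drop the query volume to 1,000/day and the insert overhead outweighs the SELECT benefit.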
SQL changes. A single SQL statement change can swing capacity by hundreds of CPU seconds per day. The most dangerous changes:
- Removing a predicate from a WHERE clause (turns an index scan into a table scan)
- Adding ORDER BY to a high-volume query (forces a sort)
- Changing a static SQL program to use dynamic SQL (loses access path stability)
- Adding a function to a predicate (prevents index use: WHERE YEAR(TX_DATE) = 2025 versus WHERE TX_DATE BETWEEN '2025-01-01' AND '2025-12-31')
🔄 SPACED REVIEW — Chapter 26 (Batch Performance): The performance decomposition from Chapter 26 (CPU time = code + I/O wait + DB2 wait + other) applies directly here. A capacity impact analysis for a DB2 change should assess impact on all four components, not just CPU. An index that reduces CPU but increases I/O (because the index requires additional DASD reads) may not deliver the expected benefit.
DB2 version upgrade. DB2 upgrades typically improve performance — IBM invests heavily in optimizer improvements each release. DB2 13, for example, introduced AI-assisted access path optimization that can reduce CPU for complex queries by 10–20%. But upgrades also change optimizer behavior, which can cause access path regressions for specific queries. Plan for a 2–4 week stabilization period after a DB2 upgrade where capacity behavior may be unpredictable.
29.6.3 Workload Shifts
The most challenging impact analysis is when workloads shift between platforms or processing modes:
Batch to online conversion. When a batch process is converted to real-time (e.g., converting nightly batch posting to real-time posting), the total CPU usually increases (real-time processing has per-transaction overhead that batch amortizes), but the peak shifts from nighttime to daytime. This may reduce the batch R4HA while increasing the online R4HA.
On-premises to hybrid. When workloads move between the mainframe and cloud (Chapter 34), the mainframe capacity decreases but integration workload (MQ, API calls, data synchronization) increases. The net mainframe impact depends on whether the integration overhead exceeds the processing reduction.
Sandra Chen at Federal Benefits is managing exactly this scenario. Her modernization program (see Chapter 32) is moving the public-facing benefits inquiry from IMS to a cloud-hosted microservice. The IMS inquiry workload will decrease by approximately 800 MSU, but the new z/OS Connect API layer that feeds the cloud service adds approximately 200 MSU — mostly zIIP-eligible, so only about 100 MSU of it lands on GP. Net GP reduction: ~700 MSU. But the cloud service costs $180,000/year, and the IMS mainframe savings are approximately $350,000/year in MLC. Net savings: $170,000/year — enough to justify the project on cost alone, with the added benefit of a modern API surface.
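Sandra's business case is a cost delta — trivial arithmetic, but worth writing down because it's the number leadership actually decides on. A sketch with the figures from the paragraph above:

```python
def hybrid_net_annual_savings(mainframe_mlc_savings, cloud_run_cost):
    """Net annual saving from moving a workload off the mainframe."""
    return mainframe_mlc_savings - cloud_run_cost

net = hybrid_net_annual_savings(mainframe_mlc_savings=350_000,
                                cloud_run_cost=180_000)
print(f"${net:,}/year")  # $170,000/year
```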
29.6.4 The Capacity Impact Assessment Template
At CNB, every change that could affect capacity by more than 10 MSU goes through a formal Capacity Impact Assessment (CIA). Kwame's template:
CAPACITY IMPACT ASSESSMENT — Template
1. Change Description:
[What is changing, when, why]
2. Affected Workloads:
[Service classes impacted, LPARs affected]
3. Volume Assumptions:
[Transaction/record volumes at deployment]
[Projected growth rate post-deployment]
4. CPU Impact Estimate:
GP change: +/- ___ MSU (peak) / +/- ___ CPU seconds (daily)
zIIP change: +/- ___ MSU (peak) / +/- ___ CPU seconds (daily)
Method: [Measurement / Benchmark / Analogy / Vendor estimate]
Confidence: [High / Medium / Low]
5. Non-CPU Impact:
Storage: +/- ___ GB DASD
Memory: +/- ___ GB real storage
I/O: +/- ___ EXCP/day
6. R4HA Impact:
Current R4HA: ___ MSU
Projected R4HA after change: ___ MSU
Within current capacity: [Yes / No]
CoD required: [Yes / No, ___ days]
7. Approval:
Capacity planner: ___________
Systems programmer: ___________
Application owner: ___________
29.7 The Capacity Planning Process — Annual Cycle, Quarterly Reviews, Exception Reporting
Tools and models are worthless without a process. The best capacity plans I've seen are embedded in the organization's planning rhythm — they're not separate technical exercises but integral components of the business planning cycle.
29.7.1 The Annual Capacity Planning Cycle
Here's the annual cycle that works. I've seen variations of this at CNB, Pinnacle Health, Federal Benefits, and a dozen other shops:
Month 1–2 (October–November for calendar-year shops): Data Collection and Trend Analysis.
- Extract 24–36 months of historical capacity data from the DB2 capacity repository
- Calculate trend lines for each workload category (online, batch, DB2, MQ/integration)
- Calculate seasonal indices
- Identify anomalies and structural breaks in the data (acquisitions, migrations, major application changes)
- Document current capacity (hardware, LPAR configuration, engine counts, defined capacity)
Month 2–3 (November–December): Business Input.
- Meet with business unit leaders to gather growth projections (account growth, transaction volume, new products)
- Meet with application teams to gather the capacity events calendar (new deployments, migrations, retirements, DB2 changes)
- Meet with the modernization team to understand planned platform shifts (cloud migration, API strategy, data platform changes)
- Meet with the security team to understand planned security enhancements (encryption, audit changes)
- Document assumptions and confidence levels for each input
Month 3–4 (December–January): Modeling and Scenario Analysis.
- Build the composite forecast (trend + seasonal + business events) for each workload category
- Produce three scenarios (conservative, expected, aggressive) with R4HA projections by quarter
- Calculate MSU budget for each scenario
- Calculate financial impact (MLC/TFP costs) for each scenario
- Identify capacity constraints (hardware ceiling, engine count, LPAR weight limits) and when they're hit under each scenario
- Document the procurement timeline for any required hardware upgrades
Month 4 (January): Review and Approval.
- Present the capacity plan to IT leadership and finance
- Obtain budget approval for planned capacity investments
- Secure Capacity on Demand authorizations for seasonal peaks
- Publish the plan and distribute to operations, application teams, and business units
Months 5–12 (February–September): Execution and Monitoring.
- Monthly: compare actual R4HA to forecast, calculate variance, update projections if variance exceeds 5%
- Quarterly: formal capacity review with updated forecast, revised business inputs, adjusted procurement timeline
- Ad hoc: capacity impact assessments for unplanned changes
- Exception: alert when actual consumption exceeds the Aggressive scenario threshold
29.7.2 Quarterly Capacity Reviews
The quarterly review is where the plan meets reality. At CNB, the quarterly review follows a standard agenda:
CNB Quarterly Capacity Review — Agenda
1. Actuals vs. Forecast (15 min)
- R4HA: plan vs. actual, by LPAR
- Workload category breakdown
- Variance analysis (what drove the difference?)
2. Updated Forecast (15 min)
- Revised trend lines incorporating new actuals
- Updated capacity events calendar
- Revised MSU budget for remaining quarters
3. Risk Register (10 min)
- Capacity risks identified since last review
- Risk mitigation status
- New risks from business pipeline
4. Optimization Opportunities (10 min)
- zIIP offload candidates identified
- R4HA management improvements
- WLM tuning opportunities
5. Procurement Status (10 min)
- Hardware upgrade timeline
- CoD utilization (actual vs. authorized)
- Contract renewal planning
6. Actions and Decisions (10 min)
At Pinnacle Health, Diane Okoye expanded the quarterly review to include a "capacity cost per claim" metric — dividing total mainframe cost by claims processed. This metric gives the business a tangible efficiency measure and has been trending downward for three years, which helps justify continued mainframe investment to leadership that might otherwise see only the total cost growing.
Ahmad Rashidi insists that the quarterly review also cover regulatory capacity — the capacity needed for audit processing, compliance reporting, and regulatory examination support. "The worst time to discover you don't have capacity for the HIPAA audit extract is the week the auditors arrive," he notes.
29.7.3 Exception Reporting
Between quarterly reviews, capacity exceptions need to be caught and escalated. The exception framework:
Tier 1 — Immediate (within 4 hours):
- GP utilization exceeds 90% for any 1-hour period
- R4HA exceeds planned Aggressive scenario for the month
- zIIP utilization exceeds 85% for any 1-hour period
- Any LPAR hits defined capacity ceiling
Tier 2 — Same Business Day:
- GP utilization exceeds 85% for any 4-hour period
- R4HA exceeds planned Expected scenario by more than 5%
- Batch window elapsed time exceeds 95% of available window
- Storage growth rate exceeds forecast by more than 20%
Tier 3 — Weekly Review:
- Monthly R4HA trend deviating from forecast by more than 3%
- Any workload category growing faster than projected
- zIIP utilization trending higher than forecast
- CoD usage exceeding budget
The exception reporting is automated. A COBOL program (CAPEXRPT) runs daily, reads the previous day's capacity data from the DB2 repository, compares it to the forecast, and generates exception alerts through the shop's monitoring infrastructure (the same framework from Chapter 27).
IDENTIFICATION DIVISION.
PROGRAM-ID. CAPEXRPT.
*================================================================*
* CAPACITY EXCEPTION REPORTER *
* Compares daily capacity actuals to forecast and generates *
* exception alerts for threshold breaches. *
*================================================================*
DATA DIVISION.
WORKING-STORAGE SECTION.
01 WS-THRESHOLD-VALUES.
05 WS-GP-UTIL-TIER1 PIC 9V99 VALUE 0.90.
05 WS-GP-UTIL-TIER2 PIC 9V99 VALUE 0.85.
05 WS-ZIIP-UTIL-TIER1 PIC 9V99 VALUE 0.85.
05 WS-R4HA-VARIANCE-T2 PIC 9V99 VALUE 0.05.
05 WS-R4HA-VARIANCE-T3 PIC 9V99 VALUE 0.03.
05 WS-BATCH-WIN-TIER2 PIC 9V99 VALUE 0.95.
05 WS-STORAGE-VAR-T2 PIC 9V99 VALUE 0.20.
01 WS-ACTUAL-VALUES.
05 WS-ACT-GP-UTIL-PEAK PIC 9V9999.
05 WS-ACT-GP-UTIL-4HR PIC 9V9999.
05 WS-ACT-ZIIP-PEAK PIC 9V9999.
05 WS-ACT-R4HA PIC 9(5)V99.
05 WS-ACT-BATCH-PCT PIC 9V9999.
05 WS-ACT-STORAGE-RATE PIC 9V9999.
01 WS-FORECAST-VALUES.
05 WS-FCST-R4HA-EXPECT PIC 9(5)V99.
05 WS-FCST-R4HA-AGGR PIC 9(5)V99.
05 WS-FCST-STORAGE-RATE PIC 9V9999.
01 WS-EXCEPTION-RECORD.
05 WS-EXC-DATE PIC X(10).
05 WS-EXC-TIER PIC 9.
05 WS-EXC-TYPE PIC X(20).
05 WS-EXC-LPAR PIC X(8).
05 WS-EXC-ACTUAL PIC X(15).
05 WS-EXC-THRESHOLD PIC X(15).
05 WS-EXC-MESSAGE PIC X(80).
The exception reporter integrates with the same alerting infrastructure that Rob Calloway uses for batch monitoring (Chapter 27). A capacity exception at 2:00 AM might be Rob's first indication that tonight's batch workload is abnormally heavy — information that helps him decide whether to activate Capacity on Demand or adjust the batch schedule.
29.7.4 Organizational Ownership — Who Does Capacity Planning?
This is the question that determines whether capacity planning actually happens or becomes another good intention that dies of neglect.
In small shops (under 2,000 MSU), capacity planning is typically part of the senior systems programmer's responsibilities. This works if — and only if — that person has dedicated time for it. "Do capacity planning when you have time" means "never do capacity planning."
In medium shops (2,000–8,000 MSU), a dedicated capacity planner is justified. This person combines technical skills (RMF analysis, SMF processing, statistical modeling) with business skills (financial modeling, vendor negotiation, executive communication). At CNB, Kwame delegates the technical data collection to a systems programmer but owns the forecasting, business liaison, and executive presentation himself.
In large shops (8,000+ MSU), capacity planning is a team function — typically 2–3 people with different specializations (technical measurement, financial modeling, vendor management). Federal Benefits Administration has a three-person capacity team that reports to Sandra Chen's modernization office.
Regardless of organizational structure, effective capacity planning requires:
- Access to SMF/RMF data — without it, you're guessing
- A seat at the business planning table — without business input, your technical forecast is incomplete
- Budget authority or influence — the capacity plan must connect to procurement decisions
- Credibility with both technical staff and management — the capacity planner must speak both languages
At SecureFirst, Yuki Nakamura brought a DevOps perspective to capacity planning. She integrated capacity monitoring into the CI/CD pipeline — every deployment to the mainframe includes a capacity impact estimate, and the production monitoring dashboard shows real-time R4HA alongside the forecast. Carlos Vega, the API architect, was initially skeptical: "Capacity planning sounds like something from the 1990s." Six months later, after watching the R4HA dashboard help him right-size the z/OS Connect thread pool, he admitted: "Okay, knowing your actual consumption in real time is actually useful. Who knew."
29.8 Progressive Project: The HA Banking System Capacity Plan
Time to apply everything in this chapter to the progressive project. You've been building the HA Banking Transaction Processing System across 28 chapters. In Chapter 28, you secured it. Now you're going to plan its capacity.
29.8.1 Project Context
The HA Banking System has the following workload profile, aggregated from previous chapters:
HA Banking System — Workload Summary
Online (CICS):
Transactions/day: 500M (35,000/sec peak)
Avg CPU/transaction: 0.0035 seconds
Peak hour: 12:00-1:00 PM
Service class: SC_CICS_HIGH, SC_CICS_NORMAL
Batch:
Jobs/night: 2,500
Total CPU/night: 85,000 seconds
Critical path: 310 minutes (designed in Ch. 23)
Peak hour: 01:00-02:00 AM
Service class: SC_BATCH_CRIT, SC_BATCH_STD
DB2:
Utility CPU/night: 12,000 seconds
DRDA CPU/day: 8,500 seconds (API access)
Service class: SC_DB2_UTIL, SC_DB2_DRDA
MQ/Integration:
Messages/day: 30M
CPU/message: 0.0008 seconds
Service class: SC_MQ_HIGH, SC_MQ_STD
Security (from Ch. 28):
Encryption overhead: ~3% of total CPU
RACF overhead: ~1.5% of total CPU
Audit logging: ~0.5% of total CPU
29.8.2 Capacity Sizing
Step 1: Convert to MSU.
Using the workload data and the LSPR conversion for a z16 Model A01-706:
Online peak hour MSU:
35,000 txn/sec × 0.0035 CPU sec × 3600 sec/hr = 441,000 CPU seconds/hour
Applying the installation's CPU-seconds-to-MSU conversion for this
configuration: online peak ≈ 960 MSU
(Note: the conversion factor depends on the specific processor model,
workload instruction mix, and cache behavior — derive it from LSPR
data and your own RMF measurements. These figures are illustrative.)
Batch peak hour MSU:
85,000 CPU seconds over ~6 hours = ~14,167 CPU seconds/hour average
Peak hour concentration factor: 1.5x (account posting cycle)
Peak hour CPU: 21,250 seconds
Batch peak MSU: ~1,180 MSU
DB2 + MQ + Security overhead: ~400 MSU at peak
Total system peak MSU (concurrent online + DB2/MQ, no batch):
960 + 400 = 1,360 MSU (daytime peak)
Total system peak MSU (batch + DB2/MQ, minimal online):
1,180 + 400 + 100 (overnight online) = 1,680 MSU (nighttime peak)
Sysplex total with 4 LPARs:
Daytime: ~4,200 MSU across sysplex
Nighttime: ~5,100 MSU across sysplex (batch drives the peak)
Step 2: Apply Growth Projections.
Growth rates (from business planning):
Online transactions: 15%/year (mobile banking growth)
Batch volume: 8%/year (account base growth)
API/DRDA: 25%/year (digital transformation)
MQ messages: 20%/year (event-driven architecture)
Year 1 projection:
Online peak: 960 × 1.15 = 1,104 MSU
Batch peak: 1,180 × 1.08 = 1,274 MSU
DB2/MQ: 400 × 1.20 = 480 MSU (blended growth)
Year 2 projection:
Online peak: 1,104 × 1.15 = 1,270 MSU
Batch peak: 1,274 × 1.08 = 1,376 MSU
DB2/MQ: 480 × 1.20 = 576 MSU
Step 3: Apply Seasonal Adjustment.
Year 1, Q4 (December — peak seasonal):
Online peak: 1,104 × 1.12 = 1,236 MSU
Batch peak: 1,274 × 1.12 = 1,427 MSU
DB2/MQ: 480 × 1.08 = 518 MSU
Total Year 1 Q4 peak, rolled up to the sysplex from the current
4,200 / 5,100 MSU baselines:
Daytime: ~5,040 MSU sysplex
Nighttime: ~5,950 MSU sysplex
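Steps 2 and 3 compound a baseline peak by annual growth and then apply the seasonal index — a sketch reproducing the online component figures above:

```python
def projected_peak(base_msu, annual_growth, years, seasonal_index=1.0):
    """Compound a baseline peak MSU by annual growth, then apply seasonality."""
    return base_msu * (1 + annual_growth) ** years * seasonal_index

print(round(projected_peak(960, 0.15, years=1)))                       # 1104
print(round(projected_peak(960, 0.15, years=1, seasonal_index=1.12)))  # 1236
print(round(projected_peak(960, 0.15, years=2)))                       # 1270
```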
Step 4: Capacity Plan Document.
HA BANKING SYSTEM — CAPACITY PLAN (Year 1)
Current Y1-Q2 Y1-Q4 Y2-Q2 Y2-Q4
─────── ───── ───── ───── ─────
GP MSU (day peak): 4,200 4,550 5,040 4,930 5,470
GP MSU (night peak): 5,100 5,400 5,950 5,720 6,350
GP MSU (R4HA est): 4,800 5,100 5,600 5,400 6,000
zIIP MSU (peak): 1,200 1,380 1,530 1,590 1,760
Hardware Capacity:
z16 A01-706 rated: 6,534 MSU (GP)
Current headroom: 6,534 - 5,100 = 1,434 MSU (22% of rated capacity at night peak)
Y1-Q4 headroom: 6,534 - 5,950 = 584 MSU (9% — below threshold)
Y2-Q2 trigger: Hardware upgrade required by Y2-Q1
Procurement Timeline:
Y1-Q1: Begin hardware evaluation for z16 upgrade
Y1-Q2: RFP and pricing negotiation
Y1-Q3: Order placement (12-week lead time)
Y1-Q4: Installation and migration (with CoD bridge if needed)
Y2-Q1: New capacity operational
MSU Budget (Annual):
Y1: GP $840K MLC + $62K CoD = $902K
Y2: GP $960K MLC + $48K CoD = $1,008K (post-upgrade, lower CoD)
29.8.3 Checkpoint Deliverable
For the progressive project, your Chapter 29 deliverable is the capacity plan document for the HA Banking System. This document includes:
- Workload characterization for each component (online, batch, DB2, MQ, security)
- Current capacity baseline from RMF data (or the simulated data above)
- Growth forecast with three scenarios (conservative, expected, aggressive)
- Seasonal adjustment based on banking industry patterns
- Business events calendar with capacity impact estimates
- MSU budget with quarterly projections for GP and zIIP
- R4HA management strategy (scheduling optimization, zIIP offload, WLM capping)
- Procurement timeline for hardware upgrades
- Exception thresholds aligned to the three-tier framework
This capacity plan feeds directly into the DR design (Chapter 30) — because your DR site must have sufficient capacity to run the recovery workload — and the operational automation (Chapter 31), because automated capacity management reduces the manual intervention required during peak periods.
29.9 Summary — The Discipline That Prevents 3:00 AM Phone Calls
Capacity planning is the quiet discipline. When it works, nobody notices — the system performs, the budget is accurate, and the upgrades arrive before they're needed. When it fails, everyone notices — the CIO gets a 3:00 AM phone call, the CFO gets an unplanned capital request, and the auditors get a finding.
The discipline requires three things:
Good data. RMF/SMF records, properly collected, properly historicized, properly analyzed. Without data, you're making expensive guesses.
Business context. Technical trend analysis alone is insufficient. You need the business growth projections, the acquisition pipeline, the regulatory calendar, and the technology roadmap. The capacity planner must have a seat at the planning table.
A repeatable process. Annual planning, quarterly reviews, monthly monitoring, exception alerting. The process ensures that capacity planning happens consistently, not just when someone remembers to do it.
And one more thing — the thing that separates the capacity planners who save their organizations millions from the ones who produce pretty charts that nobody reads: the courage to present bad news. When the forecast says you'll exceed capacity in Q3 and the CFO says the budget is frozen, the capacity planner's job is not to make the numbers look better. It's to present the risk clearly, quantify the cost of inaction, and let leadership make an informed decision. Kwame Mensah puts it this way: "My job is to make sure nobody can say they weren't warned."
🔄 SPACED REVIEW — Chapter 23 (Batch Window): The batch window crisis at CNB that opened Chapter 23 — the one that blew the 6:00 AM deadline because transaction volume grew 30% — was a capacity planning failure as much as a scheduling failure. The capacity plan should have identified the volume growth and its impact on the batch window months before the deadline was missed. Capacity planning and batch window engineering are two sides of the same coin.
🔄 SPACED REVIEW — Chapter 26 (Batch Performance): The batch performance optimization techniques from Chapter 26 — buffer tuning, SORT optimization, I/O reduction — are also capacity optimization techniques. Every CPU second saved by performance tuning is a CPU second that doesn't have to be purchased. At CNB, Lisa Tran's DB2 tuning work saved an estimated 200 MSU of peak capacity — the equivalent of deferring a hardware upgrade by six months. Performance tuning is capacity creation.
Next chapter: Chapter 30 will cover Disaster Recovery and Business Continuity, where we design the GDPS configuration and DR test plan for the HA Banking System. One of the first inputs to the DR design? The capacity plan you just built — because your DR site needs enough capacity to run the recovery workload.