> "Every COBOL developer who's ever stared at a JES2 log wondering why their batch job took three hours instead of forty minutes has already encountered WLM. They just didn't know it." — Kwame Mensah, Systems Architect, Continental National Bank
Learning Objectives
- Explain the WLM goal-based model including service classes, workload types, classification rules, and service definition
- Analyze how WLM dispatching priorities affect COBOL batch job elapsed time and CICS transaction response time
- Design WLM service class configurations that balance online transaction priorities with batch window requirements
- Interpret WLM reporting data (RMF/SMF type 72) to diagnose workload performance issues
- Define WLM service classes and classification rules for the progressive project's HA banking system
In This Chapter
- 5.1 Why Your Batch Job Ran Slowly Last Night — It Wasn't the Code
- 5.2 The WLM Service Definition — Service Classes, Workloads, and Classification Rules
- 5.3 How WLM Makes Decisions — Dispatching, Goals, and Performance Index
- 5.4 WLM and CICS — How Transaction Priorities Are Managed
- 5.5 WLM and Batch — Initiators, Enclaves, and the Battle for the Batch Window
- 5.6 WLM and DB2 — Stored Procedure Priorities and DDF Workloads
- 5.7 Reading WLM Data — RMF Reports and SMF Type 72 Analysis
- 5.8 Designing Service Policies — Balancing Competing Workloads
- Project Checkpoint: WLM Service Classes for the HA Banking System
- Production Considerations
- Summary
- What's Next
- Key Terms Glossary
Chapter 5: z/OS Workload Manager — How WLM Decides When Your Batch Job Runs (and How to Influence It)
5.1 Why Your Batch Job Ran Slowly Last Night — It Wasn't the Code
Here is the conversation I have had roughly six hundred times in twenty-five years of mainframe work.
A COBOL developer walks into my office — or, more recently, pings me on Slack — and says: "Something is wrong with my batch job. It ran in forty-two minutes on Tuesday. Last night it ran in two hours and fifty minutes. I didn't change anything."
They are telling the truth. They did not change anything. The COBOL code is identical. The input file is the same size, give or take a few thousand records. The DB2 access paths haven't shifted. The job stream is the same JCL it has been since 2019. And yet the elapsed time ballooned by a factor of four.
The answer, almost every single time, is Workload Manager.
WLM is the z/OS subsystem that decides, every few seconds, which work is important and which work can wait. It controls dispatching priorities, storage allocation bias, I/O priority, and processor resource distribution across every address space on the LPAR. When your batch job suddenly runs slowly, it is usually because WLM decided that something else — an online CICS region processing customer transactions, a DB2 stored procedure handling API calls, another batch job classified at higher importance — deserved the resources more than your job did.
This is not a bug. This is the system working exactly as designed.
Why This Chapter Matters for COBOL Architects
If you are reading this book, you are not a junior developer anymore. You are moving toward architecture, toward the decisions that determine whether a system handles 500 million transactions a day or falls over at 50 million. And WLM is one of the most consequential architectural levers on the mainframe.
Consider what Kwame Mensah deals with at Continental National Bank:
- Four LPARs in a Parallel Sysplex, each running CICS, DB2, IMS, and MQ
- 500 million transactions per day across online banking, ATM networks, wire transfers, and ACH processing
- A batch window that has shrunk from eight hours to four as the bank added real-time fraud detection and 24/7 mobile banking
- Regulatory reporting that must complete by 6:00 AM Eastern or the bank faces fines
Every one of those competing demands flows through WLM. The service definition — the document that tells WLM how to prioritize work — is one of the most important architectural artifacts at CNB. It is reviewed quarterly by a committee that includes Kwame, Lisa Tran (the DBA), Rob Calloway (batch operations), the capacity planning team, and a representative from the business side.
🔄 Retrieval Practice (from Chapter 1): Recall the z/OS subsystem architecture we covered in Chapter 1. Where does WLM fit in that hierarchy? What other subsystems does it interact with? Before reading further, sketch a diagram showing WLM's relationship to JES2, CICS, DB2, and the z/OS dispatcher. If you cannot recall the dispatcher's role, review Section 1.3.
The Evolution from Manual to Goal-Based Management
Before WLM, z/OS (then MVS) used a manual performance management approach. System programmers assigned fixed dispatching priorities to address spaces through the Installation Performance Specification (IPS) and Installation Control Specification (ICS) parmlib members. If you wanted CICS to run at a higher priority than batch, you gave CICS a higher dispatching priority number, and that was that.
The problem was obvious: the system could not adapt. If a critical batch job needed more resources at 2:00 AM when CICS was idle, it still ran at its assigned priority. If an unexpected spike in online transactions overwhelmed the system, the only fix was for a system programmer to manually adjust priorities — assuming they were awake and available.
WLM, introduced with MVS/ESA SP 5.1 in 1994 and running in goal mode on virtually every production z/OS system since the early 2000s, replaced this static model with a dynamic, goal-based approach. Instead of telling the system how to manage resources, you tell it what you want — response time targets, throughput goals, completion deadlines — and WLM figures out how to achieve those goals by dynamically adjusting dispatching priorities, storage management, and I/O scheduling.
💡 Key Insight: The shift from compatibility mode (static priorities) to goal mode (dynamic management) is the single most important resource management change in z/OS history. If your shop is still running in compatibility mode — and yes, I have encountered this as recently as 2024 — stop reading this chapter and go fix that first. You are leaving enormous performance capacity on the table.
The WLM Conceptual Model
WLM operates on a simple but powerful conceptual model:
- Work enters the system — a CICS transaction, a batch job, a DB2 stored procedure call, an MQ message
- WLM classifies the work — using classification rules, it assigns the work to a service class
- The service class defines the goal — response time, velocity, or discretionary
- WLM monitors actual performance against the goal
- WLM adjusts dispatching priorities dynamically to meet the goal
The beauty of this model is that WLM manages the trade-offs automatically. When the system is lightly loaded, everything runs at high priority. When contention appears, WLM makes intelligent decisions about which work to favor, based on the goals and importance levels you defined.
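A toy sketch of steps 3 through 5 may help make the loop concrete. Everything here is illustrative: WLM exposes no such interface, and the class names, goals, and one-step priority nudges are invented for this example.

```python
# Illustrative only: WLM has no Python API. Goals are target response
# times in seconds; priorities are arbitrary starting values.
service_classes = {
    "CICSPROD": {"importance": 1, "goal": 0.25, "priority": 200},
    "BATCHSTD": {"importance": 3, "goal": 5.00, "priority": 130},
}

def adjust(observed: dict) -> None:
    """Steps 3-5: compare observed response time to goal, nudge priority."""
    for name, sc in service_classes.items():
        pi = observed[name] / sc["goal"]   # performance index (Section 5.3)
        if pi > 1.0:
            sc["priority"] += 1            # favor work that is missing its goal
        elif pi < 0.8:
            sc["priority"] -= 1            # reclaim resources from over-achievers

adjust({"CICSPROD": 0.50, "BATCHSTD": 2.00})
print(service_classes["CICSPROD"]["priority"])  # 201: missing goal, raised
print(service_classes["BATCHSTD"]["priority"])  # 129: beating goal, lowered
```

Real WLM reasons per service class period, weighs donor and receiver candidates, and manages far more than response time, but the shape of the loop is the same: measure against the goal, then shift priority toward the laggards.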
⚠️ Warning: WLM is not magic. It cannot create resources that do not exist. If your LPAR is genuinely capacity-constrained — CPUs pegged at 100%, real storage exhausted, I/O channels saturated — WLM can only decide who suffers least. It cannot make everyone happy. This is why capacity planning (which we will touch on in Chapter 9) and WLM tuning must work together.
5.2 The WLM Service Definition — Service Classes, Workloads, and Classification Rules
The service definition is the central configuration document for WLM. It contains everything WLM needs to classify and manage work. Think of it as the constitution of your z/OS system's resource management — it establishes the rules, the priorities, and the goals that govern how every piece of work is treated.
A service definition contains four major components:
Service Classes
A service class is a container for work with similar performance goals. Each service class has:
- A name (up to 8 characters)
- An importance level (1 through 5, where 1 is highest)
- One or more service class periods, each with a goal type
- An optional description
At CNB, Kwame maintains approximately thirty service classes. Here are the key ones:
| Service Class | Importance | Goal | What It Contains |
|---|---|---|---|
| CICSPROD | 1 | Response time: 0.25 sec (average) | Production CICS online transactions |
| CICSHIGH | 1 | Response time: 0.10 sec (average) | High-priority CICS transactions (wire transfers, ATM) |
| DB2PROD | 1 | Response time: 0.50 sec (average) | Production DB2 DDF workload (API calls) |
| BATCHCRT | 2 | Velocity: 50% | Critical batch (EOD settlement, regulatory reports) |
| BATCHSTD | 3 | Velocity: 30% | Standard batch (non-critical processing) |
| BATCHLOW | 4 | Discretionary | Low-priority batch (data extracts, test reruns) |
| MQPROD | 2 | Response time: 1.0 sec (average) | Production MQ workloads |
| RPTPROD | 3 | Velocity: 40% | Reporting workloads |
| STCHIGH | 2 | Velocity: 60% | Critical started tasks (automation, monitoring) |
| DISCRTNY | 5 | Discretionary | Everything else (TSO users, dev work) |
Service Class Periods
A service class can have up to eight periods. This is where WLM gets sophisticated. Each period defines a goal and a duration, and work moves through periods as it consumes service.
For example, CNB's CICSPROD service class has two periods:
- Period 1: Response time goal of 0.25 seconds, importance 1 — applies until the transaction has consumed the period's defined duration in service units
- Period 2: Velocity goal of 50%, importance 2 — applies after a transaction has consumed more service than expected
This is powerful because it means a CICS transaction that responds quickly gets top priority, but a runaway transaction that consumes excessive resources gets demoted automatically. WLM protects the system from poorly performing transactions without manual intervention.
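A minimal sketch of period movement, assuming a two-period class shaped like CICSPROD's. The duration value (500 service units) is invented for illustration; real durations come from the service definition.

```python
# Two periods modeled after CICSPROD: a response-time period with a
# finite duration, then an unbounded velocity period. Illustrative only.
PERIODS = [
    {"goal": "response time 0.25s", "importance": 1, "duration": 500},
    {"goal": "velocity 50%",        "importance": 2, "duration": None},
]

def current_period(service_consumed: int) -> int:
    """Return the 1-based period a transaction currently occupies."""
    boundary = 0
    for i, period in enumerate(PERIODS, start=1):
        if period["duration"] is None:
            return i                     # the last period has no upper bound
        boundary += period["duration"]
        if service_consumed < boundary:
            return i
    return len(PERIODS)

print(current_period(100))    # 1: a well-behaved transaction stays in period 1
print(current_period(5000))   # 2: a runaway transaction is demoted
```

The demotion is one-way for the life of the transaction: once work has consumed its way into a later period, it is managed to that period's goal and importance.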
🧩 Productive Struggle: Before reading the next section, consider this scenario: At Pinnacle Health Insurance, Diane Okoye discovers that a small percentage of claims processing transactions — roughly 2% — are taking 15 seconds instead of the expected 0.3 seconds. These slow transactions are consuming so many resources that they are degrading response time for the other 98% of transactions. How would you use multi-period service classes to address this? Write down your approach before continuing.
Workloads
A workload is a logical grouping of service classes that represent a business function. Workloads exist primarily for reporting purposes — they let you view performance data by business function rather than by individual service class.
CNB defines workloads like:
- ONLINE — all CICS and DB2 DDF service classes
- BATCH — all batch service classes
- INFRASTR — infrastructure started tasks
- REPORTS — reporting and analytics workloads
Classification Rules
Classification rules are the mechanism that tells WLM which service class a piece of work belongs to. They are evaluated top-down, and the first matching rule wins.
Classification rules can match on a wide variety of attributes:
- Subsystem type: CICS, DB2, JES, STC, TSO, OMVS
- Transaction name or transaction class (for CICS)
- Job name or job class (for batch)
- Accounting information string
- User ID
- Plan name (for DB2)
- Stored procedure name (for DB2)
- Scheduling environment name
- LPAR name (for sysplex-wide definitions)
Here is a simplified excerpt from CNB's classification rules, expressed in the format you would see in the WLM ISPF panels:
Subsystem Type: CICS
Subsystem Instance: CICSPRD1, CICSPRD2, CICSPRD3, CICSPRD4
Transaction Name: XFRI, XFRO, XFRW → Service Class: CICSHIGH
Transaction Name: * → Service Class: CICSPROD
Subsystem Type: JES
Job Name: EOD* → Service Class: BATCHCRT
Job Name: REG* → Service Class: BATCHCRT
Job Name: RPT* → Service Class: RPTPROD
Job Name: STD* → Service Class: BATCHSTD
Job Class: Z → Service Class: BATCHLOW
Job Name: * → Service Class: BATCHSTD
Subsystem Type: DB2
Subsystem Instance: DB2P
Plan Name: APIPLAN* → Service Class: DB2PROD
Plan Name: * → Service Class: DB2PROD
Subsystem Type: STC
Procedure Name: CICS*, DB2*, MQ* → Service Class: (managed by subsystem)
Procedure Name: AUTOMON, NETVIEW, OPC → Service Class: STCHIGH
Procedure Name: * → Service Class: DISCRTNY
✅ Best Practice: Always end each classification group with a wildcard (*) rule that catches unclassified work. Work that matches no rule falls into the SYSOTHER service class, which has a discretionary goal — the lowest priority on the system. If a new job or transaction is added to production without updating the classification rules, it will silently run at minimum priority. This has caused more late-night pages than I can count.
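The top-down, first-match evaluation can be sketched with Python's `fnmatch` for the wildcard patterns. The rules mirror the JES excerpt above; the function itself is an illustration, not a WLM interface.

```python
from fnmatch import fnmatch

# (attribute, pattern, service class), evaluated top-down; first match wins.
JES_RULES = [
    ("jobname",  "EOD*", "BATCHCRT"),
    ("jobname",  "REG*", "BATCHCRT"),
    ("jobname",  "RPT*", "RPTPROD"),
    ("jobname",  "STD*", "BATCHSTD"),
    ("jobclass", "Z",    "BATCHLOW"),
    ("jobname",  "*",    "BATCHSTD"),   # the catch-all best practice
]

def classify(jobname: str, jobclass: str) -> str:
    attrs = {"jobname": jobname, "jobclass": jobclass}
    for attribute, pattern, service_class in JES_RULES:
        if fnmatch(attrs[attribute], pattern):
            return service_class
    return "SYSOTHER"   # where unmatched work lands without a catch-all

print(classify("EODSETL1", "A"))   # BATCHCRT: matched by the EOD* rule
print(classify("NEWJOB01", "Z"))   # BATCHLOW: matched by job class Z
print(classify("NEWJOB01", "A"))   # BATCHSTD: caught by the wildcard
```

Delete the final wildcard rule and `classify("NEWJOB01", "A")` returns SYSOTHER, which is exactly the silent minimum-priority failure mode described above.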
🔍 Elaborative Interrogation: Why does CNB classify by job name prefix (EOD, REG, RPT*) rather than by job class? What are the advantages and disadvantages of each approach? Consider what happens when a new critical batch job is added — which approach requires less coordination?
Service Policies
A service definition can contain multiple service policies. A service policy is a named set of service class goals that can be activated at different times. This allows you to shift priorities without changing the classification rules.
CNB uses three service policies:
- DAYTIME — Online transactions at importance 1, batch at importance 3-4
- BATCHWIN — Batch critical-path at importance 1, online at importance 2 (activated at 11:00 PM)
- MONTHEND — Month-end batch at importance 1 (activated on the last business day)
The policy switch can be automated: system automation tooling can issue the VARY WLM,POLICY= operator command on a schedule, or the switch can be triggered by OPC (Operations Planning and Control) scheduling.
💡 Key Insight: Service policies are the most under-used feature of WLM. Many shops maintain a single policy 24/7, forcing the same priority structure regardless of workload patterns. If your shop does this, you are leaving significant performance headroom on the table. The cost of maintaining two or three policies is minimal — a few hours of design and testing — and the benefit is a system that automatically adapts to predictable workload shifts without human intervention.
At Pinnacle Health Insurance, Diane Okoye maintains four service policies: BUSDAY (weekday business hours), EVENING (6 PM to midnight, when provider batch uploads arrive), NIGHTRUN (midnight to 6 AM, heavy claims processing), and WEEKEND (reduced staffing, all batch elevated). Ahmad Rashidi initially resisted the complexity, arguing that four policies were too many to audit. Diane convinced him by pointing out that each policy is a documented, testable, auditable artifact — far better than the alternative of operators manually adjusting priorities based on tribal knowledge.
Report Classes
There is one more component of the service definition worth mentioning: report classes. A report class collects performance data about work without affecting its service class assignment. Think of report classes as an observation layer — you can track the performance of a specific subset of transactions (say, all account inquiry transactions from the mobile platform) without creating a separate service class for them.
Report classes are useful when you want to measure before you manage. At Federal Benefits Administration, Sandra Chen used report classes to instrument the eligibility recalculation workload for three months before proposing service class changes. The report class data showed exactly how much CPU and I/O the eligibility work consumed, what its response time profile looked like, and how it interacted with other workloads — all without changing a single dispatching priority.
To create a report class, you define it in the service definition and associate it with classification rules, just like a service class. The difference is that the report class has no goals and no importance level — it purely collects data.
5.3 How WLM Makes Decisions — Dispatching, Goals, and Performance Index
Now we reach the core of WLM: how it actually makes decisions about resource allocation. Understanding this mechanism is what separates an architect who can diagnose performance issues from one who just guesses.
The WLM Decision Cycle
WLM runs a decision cycle approximately every 10 seconds (the "policy adjustment interval"). During each cycle, WLM:
- Collects performance data for every active service class
- Calculates the Performance Index (PI) for each service class
- Compares PIs across all service classes, weighted by importance
- Adjusts dispatching priorities to move resources toward underperforming service classes
Performance Index (PI)
The Performance Index is the single most important metric in WLM. It is a ratio that tells you whether a service class is meeting its goal:
PI = Actual Performance / Goal
For response time goals:
PI = Actual Average Response Time / Goal Response Time
PI < 1.0 → better than goal
PI = 1.0 → exactly meeting goal
PI > 1.0 → missing goal
For velocity goals:
PI = Goal Velocity / Actual Velocity
(inverted so that PI > 1.0 still means "missing goal")
A PI of 0.5 means the service class is performing twice as well as needed. A PI of 3.0 means it is missing its goal by a factor of three. WLM's job is to keep all PIs as close to 1.0 as possible, prioritizing by importance level.
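The two formulas, written out directly (a transcription of the definitions above, with invented sample values):

```python
def pi_response_time(actual_seconds: float, goal_seconds: float) -> float:
    """PI for a response time goal: actual / goal."""
    return actual_seconds / goal_seconds

def pi_velocity(actual_velocity: float, goal_velocity: float) -> float:
    """PI for a velocity goal, inverted so PI > 1.0 still means 'missing'."""
    return goal_velocity / actual_velocity

print(pi_response_time(0.125, 0.25))   # 0.5: twice as fast as the goal requires
print(pi_velocity(25.0, 50.0))         # 2.0: achieving half the goal velocity
```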
💡 Key Insight: WLM does not simply give all resources to importance-1 work. It gives importance-1 work priority in the sense that it will sacrifice importance-5 performance to maintain importance-1 performance. But if importance-1 work is already meeting its goal (PI < 1.0), WLM will happily allocate excess resources to lower-importance work. This is the genius of goal-based management — it prevents over-allocation.
How WLM Assigns Dispatching Priorities
z/OS dispatching priorities range from 0 (lowest) to 255 (highest). WLM manages priorities in the range from approximately 1 to 223 (the exact range depends on system configuration). Priorities above 223 are reserved for z/OS system components.
WLM assigns dispatching priorities based on a combination of:
- Importance level — higher importance gets a higher base priority range
- Performance Index — within an importance level, work with a higher PI (further from its goal) gets higher priority
- Service class period — work in period 1 gets higher priority than work in period 2
The priority ranges are approximately:
| Importance | Approximate Priority Range |
|---|---|
| 1 | 192–223 |
| 2 | 160–191 |
| 3 | 128–159 |
| 4 | 96–127 |
| 5 | 64–95 |
| System (SYSSTC) | Configurable, typically high |
⚠️ Warning: These ranges are approximate and dynamic. WLM constantly adjusts within and across these ranges. Do not design your service definition around specific priority numbers. Design around goals and importance levels, and let WLM figure out the priorities.
The Importance Hierarchy in Action
Let us walk through a concrete scenario at CNB to see how WLM makes trade-off decisions.
Scenario: 11:30 PM on a Tuesday night. The batch window has just started.
- CICSPROD (Importance 1): Only 5% of normal transaction volume (late-night mobile banking). PI = 0.3 (well under goal).
- BATCHCRT (Importance 2): EOD settlement jobs just started. PI = 1.5 (behind goal — settlement is taking longer than planned because a large commercial client submitted a late wire batch).
- BATCHSTD (Importance 3): Standard nightly batch is running. PI = 2.0 (well behind goal).
- RPTPROD (Importance 3): Regulatory reports are running. PI = 1.2 (slightly behind goal).
WLM's decision: CICSPROD is far below its goal, so WLM does not need to give it extra resources. BATCHCRT is the highest-importance work that is missing its goal, so WLM raises its dispatching priority, potentially above where CICSPROD currently sits. BATCHSTD and RPTPROD both have the same importance (3), but BATCHSTD has a higher PI (2.0 vs. 1.2), so it gets slightly more priority within the importance-3 band.
The result: EOD settlement gets the lion's share of resources, followed by standard batch and reporting, while the trickle of online transactions still runs fine because it needs so few resources.
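The ranking logic in this scenario can be expressed in a few lines. The tuples restate the scenario's numbers; the sort key (importance first, then distance past the goal) is a simplification of what WLM actually weighs each cycle.

```python
# (service class, importance, current PI) from the 11:30 PM scenario
classes = [
    ("CICSPROD", 1, 0.3),
    ("BATCHCRT", 2, 1.5),
    ("BATCHSTD", 3, 2.0),
    ("RPTPROD",  3, 1.2),
]

# Only classes missing their goal (PI > 1.0) need help; CICSPROD drops out.
needs_help = [c for c in classes if c[2] > 1.0]

# Help the highest importance first; within an importance, the worst PI first.
needs_help.sort(key=lambda c: (c[1], -c[2]))

print([name for name, _, _ in needs_help])
# ['BATCHCRT', 'BATCHSTD', 'RPTPROD']
```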
🔄 Retrieval Practice (from Chapter 2): In Chapter 2, we discussed how LPAR weighting and processor allocation affect capacity. How does LPAR weight interact with WLM? If CNB's production LPAR has a weight of 800 and their test LPAR has a weight of 200, does WLM operate independently on each LPAR, or does it coordinate across the sysplex? Write down your answer before reading the next paragraph.
Answer: WLM operates independently on each LPAR for dispatching decisions, but the service definition is shared across the sysplex. LPAR weights determine how much physical processor capacity each LPAR receives from PR/SM (Processor Resource/Systems Manager). WLM then manages the work within each LPAR's allocated capacity. In a sysplex, WLM can also make cross-system routing decisions — directing work to the LPAR that has the most available capacity — but the dispatching priority adjustments happen per-LPAR.
5.4 WLM and CICS — How Transaction Priorities Are Managed
CICS is the most demanding consumer of WLM services at most mainframe shops. At CNB, CICS processes roughly 5,800 transactions per second during peak hours. Every one of those transactions is classified, prioritized, and managed by WLM.
How CICS Integrates with WLM
CICS integrates with WLM through the transaction class and transaction name classification. When a CICS transaction starts, CICS notifies WLM, which classifies it based on the classification rules in the service definition.
The CICS-WLM integration works at two levels:
- Region level: The CICS region (address space) runs at a dispatching priority determined by WLM based on the highest-priority work it is processing
- Task level: Within the CICS region, CICS uses WLM information to prioritize individual tasks through the CICS task dispatcher
CICS Region Address Space
├── WLM assigns region priority = MAX(task priorities)
├── Task: INQA (inquiry) → Service Class CICSPROD → Priority 195
├── Task: XFRW (wire transfer) → Service Class CICSHIGH → Priority 210
├── Task: RPTA (report gen) → Service Class RPTPROD → Priority 140
└── Region dispatching priority → 210 (highest task)
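The max rule in the diagram is simple enough to state in two lines (a simplification; WLM's actual region management considers the service classes the region serves, not raw task priorities):

```python
# Region dispatching priority follows its highest-priority in-flight task.
task_priorities = {"INQA": 195, "XFRW": 210, "RPTA": 140}
region_priority = max(task_priorities.values())
print(region_priority)   # 210: one wire transfer lifts the whole region
```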
⚠️ Warning: This means a single high-priority transaction elevates the entire CICS region's priority. If your high-priority wire transfer transactions run in the same CICS region as low-priority reporting transactions, the reporting transactions benefit from the elevated region priority. This is one reason CNB runs separate CICS regions for different workload types — CICSPRD1/CICSPRD2 for general online, CICSPRD3 for high-value transactions, CICSPRD4 for internal/reporting.
CICS Transaction Classification Strategies
There are two primary approaches to classifying CICS transactions:
Approach 1: By Transaction Name
CICS Subsystem: CICSPRD*
Transaction Name: XFRI, XFRO, XFRW → CICSHIGH
Transaction Name: INQ* → CICSPROD
Transaction Name: UPD* → CICSPROD
Transaction Name: RPT* → RPTPROD
Transaction Name: * → CICSPROD
Approach 2: By CICS Transaction Class
CICS Subsystem: CICSPRD*
Transaction Class: HIGHVAL → CICSHIGH
Transaction Class: ONLINE → CICSPROD
Transaction Class: REPORTS → RPTPROD
Transaction Class: * → CICSPROD
Approach 2 is more flexible because you can change a transaction's WLM classification by changing its CICS transaction class definition — no WLM service definition change required. However, Approach 1 is more transparent for debugging.
💡 Key Insight: At Pinnacle Health Insurance, Diane Okoye uses a hybrid approach. Claims adjudication transactions (the high-volume, business-critical path) are classified by transaction name for maximum control. Lower-volume administrative transactions are classified by transaction class for flexibility. She calls this the "name the critical, class the rest" strategy.
COBOL Implications: Why Your EXEC CICS Code Matters
As a COBOL architect, you might think WLM is purely an infrastructure concern. It is not. Your COBOL code directly affects how WLM manages your transactions.
Consider this common COBOL pattern:
EXEC CICS LINK
PROGRAM('ACCTINQ')
COMMAREA(WS-COMMAREA)
LENGTH(WS-COMM-LEN)
END-EXEC
This LINK executes within the same CICS task and therefore inherits the same WLM service class. Now consider the alternative:
EXEC CICS START
TRANSID('AINQ')
FROM(WS-START-DATA)
LENGTH(WS-START-LEN)
END-EXEC
This START creates a new CICS task with transaction ID AINQ. That new task will be independently classified by WLM. If AINQ maps to a different service class than the originating transaction, the spawned work runs at a different priority.
✅ Best Practice: When designing COBOL transaction flows in CICS, be intentional about whether subordinate work should inherit the parent's WLM classification (use LINK) or receive its own classification (use START). At CNB, wire transfer processing uses LINK for all steps in the critical path — ensuring the entire flow runs at CICSHIGH priority — but uses START for the audit trail write, which can run at a lower priority without affecting the customer experience.
CICS Task-Related User Exits and WLM
For architects who need fine-grained control, CICS provides task-related user exits (TRUEs) that can influence WLM behavior. The DFHXCURR exit, for example, allows you to programmatically change a transaction's WLM classification after it starts — useful when a transaction's priority should change based on runtime conditions (e.g., a general inquiry transaction that discovers it is processing a VIP customer account).
At SecureFirst Retail Bank, Yuki Nakamura implemented a TRUE that reclassifies transactions based on the customer tier associated with the account being accessed. A balance inquiry for a private banking customer (account prefix 9000) is elevated to CICSHIGH, while the same inquiry for a standard retail customer stays at CICSPROD. This is a powerful pattern, but it comes with a caution: the exit runs on every transaction, so it must be extremely efficient. Yuki's exit adds approximately 0.002 milliseconds of overhead per transaction — negligible at SecureFirst's volume, but potentially significant at CNB's 5,800 TPS peak.
⚠️ Warning: Programmatic WLM reclassification through TRUEs is a sharp tool. Use it only when the classification cannot be determined statically from transaction name or class. Every additional code path in the transaction flow is a potential point of failure, and a bug in a TRUE can degrade every transaction in the region. At Federal Benefits Administration, Marcus Whitfield once deployed a TRUE that contained a DB2 call to look up customer priority. Under load, the DB2 calls backed up, the TRUE stalled, and every transaction in the region suffered. The TRUE was removed within an hour and replaced with a static classification based on transaction name prefixes.
5.5 WLM and Batch — Initiators, Enclaves, and the Battle for the Batch Window
For COBOL architects, batch is where WLM becomes deeply personal. Your batch jobs, your elapsed times, your SLAs — they all depend on how WLM manages batch workloads.
WLM-Managed Initiators
In the old world, JES2 initiators were statically defined. A system programmer would configure, say, fifteen initiators in the JES2 initialization deck and assign them to specific job classes:
INIT(1-5)   CLASS=A    (general batch)
INIT(6-10)  CLASS=B    (high-priority batch)
INIT(11-15) CLASS=C    (low-priority batch)
This was rigid. If all the Class A initiators were busy and Class B initiators were idle, Class A jobs waited even though resources were available.
WLM-managed initiators solved this. With WLM-managed initiators, JES2 starts initiators dynamically based on demand and WLM goals. When a job is submitted and classified to a service class that is missing its goal, WLM tells JES2 to start additional initiators. When jobs complete and initiators are idle, WLM lets them drain.
//EODSETL JOB (ACCT),'EOD SETTLEMENT',
// CLASS=A,MSGCLASS=X,
// SCHENV=PRODENV
/*JOBPARM SYSAFF=*
The job class and scheduling environment in the JCL are used for WLM classification, but the initiator is dynamically assigned.
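The start/drain behavior can be sketched as a toy policy. The thresholds and single-initiator steps are invented; real WLM decides how many initiators to start based on queue depth, goal distance, and available capacity.

```python
def adjust_initiators(pi: float, queued_jobs: int, idle_initiators: int,
                      active_initiators: int) -> int:
    """Toy WLM-managed initiator decision for one service class."""
    if pi > 1.0 and queued_jobs > 0:
        return active_initiators + 1   # missing goal, work waiting: start one
    if pi < 1.0 and idle_initiators > 0:
        return active_initiators - 1   # beating goal, idle capacity: drain one
    return active_initiators

print(adjust_initiators(pi=1.6, queued_jobs=12, idle_initiators=0,
                        active_initiators=8))   # 9
print(adjust_initiators(pi=0.7, queued_jobs=0, idle_initiators=3,
                        active_initiators=8))   # 7
```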
🧩 Productive Struggle: Rob Calloway at CNB faces this problem every month: on the last business day, the month-end batch suite (roughly 2,400 jobs) must complete within the same batch window as the regular nightly batch (roughly 800 jobs). The combined workload exceeds what can physically run in four hours if all jobs are at the same priority. How would you structure the WLM service definition to handle this? Think about service policies, importance levels, and the relationship between the month-end critical path and the regular nightly batch. Write down your strategy before reading CNB's actual approach in Section 5.8.
Batch Job Classification
Batch jobs are classified based on attributes available in the JCL:
- Job name — the most common classifier
- Job class — useful for broad categories
- Accounting information — the ACCT parameter on the JOB statement
- Scheduling environment — the SCHENV parameter
- User ID — the submitter's identity (from RACF/ACF2/Top Secret)
- Procedure name — the cataloged procedure (if used)
Here is a real-world example of how CNB classifies batch work:
JES Subsystem:
Scheduling Environment: CRITPATH → BATCHCRT (Importance 2, Velocity 50%)
Job Name: EOD* → BATCHCRT
Job Name: REG* → BATCHCRT
Job Name: FEDWIRE* → BATCHCRT
Job Name: ACH* → BATCHCRT
Scheduling Environment: RPTENV → RPTPROD (Importance 3, Velocity 40%)
Job Name: RPT* → RPTPROD
Job Name: EXT* → BATCHSTD (Importance 3, Velocity 30%)
Job Class: Z → BATCHLOW (Importance 4, Discretionary)
Job Name: * → BATCHSTD
Enclaves: When Batch Work Runs Inside Other Subsystems
An enclave is a WLM concept that allows work running inside one address space to be classified and managed as if it were independent work. This is critical for DB2 stored procedures, CICS asynchronous processing, and WebSphere Liberty server workloads.
For batch-like work, enclaves matter when your COBOL batch job calls a DB2 stored procedure:
EXEC SQL
CALL VALIDATE_ACCT(:WS-ACCT-NUM,
:WS-RESULT-CODE,
:WS-MSG-TEXT)
END-EXEC
Without enclaves, the DB2 stored procedure runs under DB2's address space priority, not the batch job's priority. With WLM enclaves, the stored procedure can be classified to the same service class as the calling batch job, ensuring consistent priority treatment.
💡 Key Insight: At Federal Benefits Administration, Sandra Chen discovered that their nightly eligibility calculation — a batch job calling hundreds of DB2 stored procedures — was running slowly because the stored procedures were classified as general DB2 work (importance 3) while the batch job itself was classified as critical batch (importance 2). The stored procedures were getting less priority than the job that called them, creating a bottleneck. Reclassifying the stored procedures to match the calling job's importance cut elapsed time by 35%.
The Batch Window Problem
The "batch window" — the period when online transaction volume is low enough to run resource-intensive batch processing — is under pressure at every mainframe shop. Twenty years ago, most shops had eight hours (10 PM to 6 AM). Today, with 24/7 mobile banking and global operations, the window is shrinking.
At CNB, the batch window is nominally 11:00 PM to 3:00 AM — four hours. But "window" is misleading because online transactions never fully stop. At 2:00 AM, CNB still processes roughly 200 mobile banking transactions per second.
WLM manages this coexistence through the service policy switch. At 11:00 PM, Rob Calloway's automation switches from the DAYTIME policy to the BATCHWIN policy:
- DAYTIME policy: CICSHIGH at importance 1, CICSPROD at importance 1, BATCHCRT at importance 3
- BATCHWIN policy: CICSHIGH at importance 1, CICSPROD at importance 2, BATCHCRT at importance 1
Note that CICSHIGH stays at importance 1 in both policies — wire transfers and ATM transactions are always top priority. But general online transactions drop from importance 1 to importance 2, making room for critical batch work.
⚠️ Warning: Never set all batch to importance 1 during the batch window. If you do, WLM cannot differentiate between critical-path batch and non-critical batch, and everything runs at the same priority. The critical path — the longest chain of dependent jobs that determines total batch duration — must be at a higher importance than non-critical batch.
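The critical path called out in the warning is simply the longest chain through the batch job dependency graph. A hypothetical Python sketch (job names and durations are invented for illustration; a real shop would pull dependencies from its scheduler):

```python
# Sketch: the batch critical path is the longest chain of dependent jobs.
# Job names and durations (minutes) are hypothetical.
from functools import lru_cache

durations = {"EXTRACT": 40, "VALIDATE": 25, "SETTLE": 90, "REPORT": 30, "ARCHIVE": 20}
depends_on = {                      # job -> jobs that must finish first
    "VALIDATE": ["EXTRACT"],
    "SETTLE":   ["VALIDATE"],
    "REPORT":   ["SETTLE"],
    "ARCHIVE":  ["EXTRACT"],        # off the critical path
}

@lru_cache(maxsize=None)
def finish_time(job: str) -> int:
    """Earliest completion time for a job, assuming unlimited initiators."""
    preds = depends_on.get(job, [])
    return durations[job] + max((finish_time(p) for p in preds), default=0)

critical_path_minutes = max(finish_time(j) for j in durations)
print(critical_path_minutes)   # EXTRACT -> VALIDATE -> SETTLE -> REPORT = 185
```

The chain EXTRACT, VALIDATE, SETTLE, REPORT determines the total window; ARCHIVE can run at lower importance without affecting it, which is exactly the distinction the warning asks you to preserve.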
Scheduling Environments: Controlling Where and When Batch Runs
A scheduling environment is a WLM construct that defines the resource conditions under which a batch job can run. It acts as a gate: JES will not start a job with a scheduling environment until WLM confirms that the required resources are available.
Scheduling environments are defined in the service definition and referenced in the JCL:
//EODSETL1 JOB (ACCT),'EOD SETTLEMENT',
// CLASS=A,MSGCLASS=X,
// SCHENV=EODCRIT
The EODCRIT scheduling environment might require that DB2 subsystem DB2P is active and that the CICS regions are running (because the settlement process uses the CICS-DB2 bridge). If DB2P is down for maintenance, JES holds the job rather than starting it and letting it fail.
At Pinnacle Health Insurance, Diane Okoye uses scheduling environments to enforce a strict separation between claims processing batch and provider network updates. The claims batch scheduling environment requires exclusive access to a set of DB2 tablespaces (no concurrent provider updates), while the provider update environment requires its own exclusive access. This prevents the two workloads from running simultaneously and creating lock contention — a problem that plagued Pinnacle for months before Diane implemented the scheduling environment solution.
The combination of scheduling environments and service class classification gives you two-dimensional control over batch: what priority it runs at (service class) and when and where it can run (scheduling environment). Most architects underuse scheduling environments, relying solely on job scheduling tools like OPC or TWS to control dependencies. Scheduling environments add a system-level enforcement layer that operates even if the scheduler is misconfigured.
5.6 WLM and DB2 — Stored Procedure Priorities and DDF Workloads
DB2 is the other major WLM consumer on most mainframe systems. At CNB, DB2 handles both the backend for CICS transactions and a growing volume of distributed data facility (DDF) workload from API calls.
DB2 DDF Classification
When an external application connects to DB2 through DDF — whether it is a Java application on a distributed server, a REST API, or a COBOL CICS transaction from another LPAR — the work is classified based on:
- Connection type (DDF)
- Correlation ID (derived from the client connection)
- Authorization ID (the user or service account)
- Plan name or Package collection ID
- Stored procedure name
At SecureFirst Retail Bank, Carlos Vega's API layer connects to DB2 through DDF. His team uses plan names to differentiate workloads:
DB2 Subsystem: DB2P
Connection Type: DDF
Plan Name: APICORE* → DB2API1 (Importance 1, RT 0.3 sec)
Plan Name: APIBULK* → DB2API2 (Importance 3, RT 2.0 sec)
Plan Name: APISRCH* → DB2API2 (Importance 3, RT 2.0 sec)
Plan Name: * → DB2PROD (Importance 2, RT 1.0 sec)
This means the core API calls (account balance, transfer initiation) get top priority, while bulk operations and search queries get lower priority. The customer sees sub-second response times for critical operations even when bulk API calls are consuming significant resources.
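The first-match wildcard behavior in the table above can be modeled in a few lines. This is a simplified Python sketch of the matching logic only (the real rules live in the WLM ISPF application; `classify` and `RULES` are our own names):

```python
import fnmatch

# Plan-name rules from the SecureFirst example, evaluated top to bottom;
# the first matching pattern wins, with "*" as the catch-all.
RULES = [
    ("APICORE*", "DB2API1"),
    ("APIBULK*", "DB2API2"),
    ("APISRCH*", "DB2API2"),
    ("*",        "DB2PROD"),
]

def classify(plan_name: str) -> str:
    for pattern, service_class in RULES:
        if fnmatch.fnmatch(plan_name, pattern):
            return service_class
    return "SYSOTHER"   # unreachable here because of the catch-all rule

print(classify("APICORE1"))   # DB2API1
print(classify("APIBULKX"))   # DB2API2
print(classify("RPTPLAN"))    # DB2PROD
```

Note how removing the final `"*"` rule would silently send unmatched plans to SYSOTHER, which is the "Missing Catch-All" anti-pattern discussed later in this chapter.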
DB2 Stored Procedures and Enclaves
As mentioned in Section 5.5, DB2 stored procedures run in WLM-managed stored procedure address spaces (SPASes). Each SPAS is associated with a WLM environment, and the stored procedures within it inherit that environment's classification.
At CNB, the DB2 stored procedure setup looks like this:
-- High-priority stored procedure for wire transfers
CREATE PROCEDURE CNB.WIRE_VALIDATE
(IN ACCT_NUM CHAR(12),
OUT RESULT_CODE INTEGER,
OUT MSG_TEXT VARCHAR(200))
LANGUAGE COBOL
EXTERNAL NAME 'WIREVAL'
WLM ENVIRONMENT WLMHIGH
PARAMETER STYLE GENERAL;
-- Standard-priority stored procedure for account inquiry
CREATE PROCEDURE CNB.ACCT_INQUIRY
(IN ACCT_NUM CHAR(12),
OUT ACCT_DATA VARCHAR(4000))
LANGUAGE COBOL
EXTERNAL NAME 'ACCTINQ'
WLM ENVIRONMENT WLMSTD
PARAMETER STYLE GENERAL;
The WLM ENVIRONMENT clause determines which WLM-managed address space runs the stored procedure, and that address space's service class determines the priority.
🔍 Elaborative Interrogation: Why does DB2 use separate WLM-managed address spaces for stored procedures rather than running them in the DB2 main address space? Think about what would happen if a runaway stored procedure consumed excessive CPU or storage. How does the separate address space protect DB2?
DB2 Workload Manager Application Environments
The WLM application environment is the bridge between DB2 stored procedures and WLM classification. Each application environment defines:
- A startup procedure name — the cataloged JCL procedure that starts the SPAS
- The number of TCBs — how many concurrent stored procedure executions the SPAS supports
- The associated service class — how WLM prioritizes the SPAS
At CNB, Lisa Tran maintains three DB2 WLM application environments:
Application Environment: WLMHIGH
Startup Procedure: DB2SPHI
NUMTCB: 20
Service Class: DB2PROD (Importance 1)
Used by: WIRE_VALIDATE, ACCT_AUTH, FRAUD_CHECK
Application Environment: WLMSTD
Startup Procedure: DB2SPST
NUMTCB: 40
Service Class: DB2PROD (Importance 2)
Used by: ACCT_INQUIRY, STMT_RETRIEVE, HIST_LOOKUP
Application Environment: WLMBATCH
Startup Procedure: DB2SPBT
NUMTCB: 10
Service Class: BATCHSTD (Importance 3)
Used by: BATCH_VALIDATE, BATCH_CALC, REPORT_GEN
The NUMTCB parameter deserves special attention. It controls the maximum number of concurrent stored procedure executions within a single SPAS. Set it too low, and stored procedures queue up waiting for a TCB. Set it too high, and the SPAS consumes excessive storage. Lisa calibrates NUMTCB based on observed peak concurrency plus a 25% buffer.
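Lisa's calibration rule fits in one line. The function name and the ceiling-rounding are our choices, but the sample inputs reproduce the WLMHIGH and WLMSTD NUMTCB values shown above:

```python
import math

def calibrate_numtcb(observed_peak_concurrency: int, buffer_pct: float = 0.25) -> int:
    """Observed peak concurrent executions plus a 25% buffer, rounded up.
    (A sketch of the rule of thumb described in the text.)"""
    return math.ceil(observed_peak_concurrency * (1 + buffer_pct))

print(calibrate_numtcb(16))   # 20 -> matches WLMHIGH's NUMTCB
print(calibrate_numtcb(32))   # 40 -> matches WLMSTD's NUMTCB
```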
COBOL Stored Procedures: Performance Implications
For COBOL architects, the WLM treatment of stored procedures has direct design implications:
1. Group stored procedures by priority, not by function. If you have ten stored procedures and three are critical path, put those three in a high-priority WLM environment and the other seven in a standard environment.
2. Minimize cross-priority calls. If a high-priority stored procedure calls a low-priority one, the chain of execution drops to the lower priority for that segment. Design your stored procedure call graphs to stay within the same priority tier when possible.
3. Watch the SPAS startup cost. WLM-managed stored procedure address spaces take 2-5 seconds to start. If your workload is bursty and the SPAS drains between bursts, the startup cost adds latency. Configure the NUMTCB parameter in the WLM application environment to keep enough address spaces warm.
5.7 Reading WLM Data — RMF Reports and SMF Type 72 Analysis
You cannot manage what you cannot measure, and WLM provides extensive measurement data through RMF (Resource Measurement Facility) reports and SMF (System Management Facility) type 72 records.
RMF Workload Activity Report
The RMF Workload Activity Report is the primary tool for analyzing WLM performance. It shows, for each service class period:
- Transaction count or job count
- Average response time (for response time goals)
- Average velocity (for velocity goals)
- Performance Index (PI)
- Average dispatching priority
- CPU service consumed (in service units)
- I/O service consumed
- Storage usage
Here is an annotated excerpt from a CNB RMF Workload Activity Report (see code/example-01-rmf-analysis.txt for the full annotated report):
WORKLOAD ACTIVITY
SERVICE TRANS AVG ---GOAL--- PERF AVG ----SERVICE---- ---USING%---
CLASS COUNT RESP TYPE VALUE INDEX DPRTY CPU IOC CPU STR
-------- ------ ------ ---- ------ ------ ------ ------ ------ ---- ----
CICSHIGH 42,891 0.082 RT 0.100 0.82 211 1,245 892 12.3 8.1
CICSPROD 485,220 0.198 RT 0.250 0.79 205 14,892 11,234 45.1 22.4
DB2PROD 38,442 0.412 RT 0.500 0.82 198 2,891 4,521 8.2 6.8
BATCHCRT 28 N/A VEL 50.0% 1.22 175 8,234 12,891 18.4 14.2
BATCHSTD 142 N/A VEL 30.0% 1.85 148 4,521 8,234 10.1 18.8
RPTPROD 12 N/A VEL 40.0% 2.10 138 1,892 3,456 3.2 8.4
BATCHLOW 45 N/A VEL N/A N/A 95 892 1,234 1.8 4.1
How to read this:
- CICSHIGH and CICSPROD: PIs below 1.0 (0.82 and 0.79) — performing better than goal. This is the expected state during normal operations.
- BATCHCRT: PI of 1.22 — 22% behind its velocity goal. This bears watching. If it were 1.5 or higher, Rob Calloway would be investigating.
- BATCHSTD: PI of 1.85 — significantly behind goal, but at importance 3, WLM is correctly prioritizing online and critical batch over standard batch.
- RPTPROD: PI of 2.10 — the worst performer. Reports are running slowly because resources are allocated to higher-importance work. This is acceptable as long as reports complete before the 6:00 AM deadline.
- BATCHLOW: Discretionary, no PI — gets whatever resources are left over.
💡 Key Insight: A common mistake is treating any PI > 1.0 as a problem. It is not. WLM is a zero-sum game on a loaded system. If every service class had a PI below 1.0, it would mean your goals are too easy — you are not utilizing your system efficiently. Expect importance 3-5 service classes to have PIs above 1.0 during peak periods. The question is whether the high-importance work is meeting its goals.
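The PI values in the report follow directly from the goal definitions: for a response-time goal, PI is the achieved average response time divided by the goal; for a velocity goal, PI is the goal velocity divided by the achieved velocity. A quick Python sketch (the 41% achieved velocity for BATCHCRT is back-computed from its PI, not taken from the report):

```python
def pi_response_time(actual_rt: float, goal_rt: float) -> float:
    """Response-time PI: achieved average response time / goal."""
    return actual_rt / goal_rt

def pi_velocity(goal_velocity: float, actual_velocity: float) -> float:
    """Velocity PI: goal velocity / achieved velocity."""
    return goal_velocity / actual_velocity

# Figures from the CNB report excerpt above:
print(round(pi_response_time(0.082, 0.100), 2))   # 0.82 -> CICSHIGH, beating goal
print(round(pi_velocity(50.0, 41.0), 2))          # 1.22 -> BATCHCRT, behind goal
```

Either way, PI = 1.0 means exactly on goal, below 1.0 means beating it, and above 1.0 means missing it, which is why the same threshold applies to both goal types in the report.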
SMF Type 72 Records
SMF type 72 records contain the raw WLM performance data that RMF reports summarize. They are written at configurable intervals (typically every 5 or 15 minutes) and contain:
- Subtype 3: Workload activity (the data behind the RMF Workload Activity Report)
- Subtype 4: Storage data
- Subtype 5: Serialization delay data
(Channel and device activity are recorded in SMF types 73 and 74, not in type 72.)
For historical analysis, you process SMF type 72 records with tools like SAS, MXG, or IntelliMagic Vision. At Federal Benefits Administration, Marcus Whitfield built a set of COBOL programs decades ago that read SMF 72 records and produce custom performance reports. Sandra Chen is in the process of replacing these with Python scripts, but the COBOL programs still run every morning.
Here is the structure of a type 72 subtype 3 record header that a COBOL program would process:
01  SMF72-HEADER.
    05  SMF72-LEN           PIC S9(4) COMP.
    05  SMF72-SEG           PIC S9(4) COMP.
    05  SMF72-FLG           PIC X(1).
    05  SMF72-RTY           PIC X(1).
        88  SMF72-TYPE-72   VALUE X'48'.
    05  SMF72-TME           PIC S9(8) COMP.
    05  SMF72-DTE           PIC S9(7) COMP-3.
    05  SMF72-SID           PIC X(4).
    05  SMF72-SSI           PIC X(4).
    05  SMF72-STY           PIC S9(4) COMP.
        88  SMF72-WKLD-ACT  VALUE 3.
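A Python replacement of the kind Sandra Chen is writing would decode the same header. A minimal sketch mirroring the COBOL picture clauses (SMF records are big-endian with EBCDIC text; the sample record and all its values are fabricated for illustration):

```python
import struct

def parse_smf72_header(rec: bytes) -> dict:
    """Decode the SMF72-HEADER fields from the COBOL layout above.
    The packed-decimal date is left as raw bytes here."""
    length, seg, flg, rty, tme = struct.unpack(">HHBBI", rec[:10])
    dte = rec[10:14]                      # packed decimal date, raw
    sid = rec[14:18].decode("cp037")      # system ID, EBCDIC
    ssi = rec[18:22].decode("cp037")      # subsystem ID, EBCDIC
    (sty,) = struct.unpack(">H", rec[22:24])
    return {"len": length, "type": rty, "subtype": sty, "sid": sid,
            "time_hundredths": tme, "date_packed": dte, "ssi": ssi}

# Hand-built sample header (hypothetical values): type X'48' = 72,
# time 12:00:00 in hundredths of seconds, subtype 3 = workload activity.
sample = struct.pack(">HHBBI", 24, 0, 0x1E, 0x48, 4_320_000)
sample += bytes.fromhex("0124001f")                 # packed date, raw
sample += "CNBP".encode("cp037") + "1   ".encode("cp037")
sample += struct.pack(">H", 3)
hdr = parse_smf72_header(sample)
print(hdr["type"] == 0x48, hdr["subtype"])
```

This is only the record header; the subtype 3 payload (service class sections, goal data, service units) requires walking the triplet pointers that follow, which is where most of the real parsing work lives.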
🔄 Retrieval Practice (from Chapter 3): In Chapter 3, we discussed Language Environment runtime options and their performance overhead. How would you use RMF WLM data to determine whether LE overhead is contributing to a batch job's elapsed time? Specifically, what would the CPU service and I/O service numbers tell you about whether the bottleneck is CPU-bound (possibly LE overhead) or I/O-bound?
Diagnosing Performance Issues with WLM Data
When a batch job or CICS transaction is not meeting its performance target, here is the diagnostic flowchart:
Step 1: Check the Performance Index.
- PI < 1.0 → WLM is giving this work adequate resources. The issue is in the application code, DB2 access paths, or I/O configuration. Stop looking at WLM.
- PI > 1.0 → WLM is not meeting the goal. Proceed to Step 2.
Step 2: Check the dispatching priority.
- Priority is in the expected range for its importance → WLM is doing what you asked. The system may be capacity-constrained. Check CPU utilization.
- Priority is lower than expected → Check classification rules. The work may be misclassified.
Step 3: Check for contention.
- Is there higher-importance work that is also missing its goal? If so, WLM is correctly prioritizing that work over yours.
- Is the system at high CPU utilization (>90%)? If so, WLM cannot help — you need more capacity or less work.
Step 4: Check for resource bottlenecks outside WLM's control.
- I/O waits → storage subsystem issue, not WLM
- DB2 lock waits → application design issue, not WLM
- ENQ waits → serialization issue, not WLM
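The four steps can be encoded directly. This sketch is our own encoding of the flow for illustration, not a real diagnostic tool; the inputs would come from RMF data:

```python
def diagnose(pi: float, dp_as_expected: bool,
             higher_importance_also_missing: bool, cpu_util_pct: float) -> str:
    """Encode the four-step WLM triage flow described above."""
    # Step 1: Performance Index
    if pi < 1.0:
        return "not WLM: check application code, DB2 access paths, I/O config"
    # Step 2: dispatching priority
    if not dp_as_expected:
        return "check classification rules: work may be misclassified"
    # Step 3: contention
    if higher_importance_also_missing:
        return "WLM is correctly prioritizing higher-importance work"
    if cpu_util_pct > 90.0:
        return "capacity constrained: add capacity or shed work"
    # Step 4: bottlenecks outside WLM's control
    return "check I/O waits, DB2 lock waits, ENQ waits (not WLM)"

print(diagnose(0.85, True, False, 72.0))   # stop looking at WLM
print(diagnose(1.60, True, False, 96.0))   # capacity problem
```

The value of writing it down this way is the ordering: the Performance Index check always comes first, because it cleanly separates "WLM is not delivering" from "WLM is delivering and the problem is elsewhere."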
✅ Best Practice: At CNB, Lisa Tran maintains a WLM performance dashboard that shows PI trends over time for all production service classes. This dashboard — built on MXG-processed SMF 72 data — is the first thing the team checks during any performance incident. Trending is far more valuable than point-in-time snapshots because it shows when a service class started deviating from its goal.
5.8 Designing Service Policies — Balancing Competing Workloads
Designing a WLM service definition is an exercise in managed conflict. Every workload owner believes their work is the most important. Your job as an architect is to establish a priority framework that reflects business reality, not political clout.
Principles of Service Definition Design
Principle 1: Start from the business, not the technology.
Before defining a single service class, ask: "What work must complete on time, or the business loses money / faces regulatory action / damages customer trust?" That work gets importance 1 or 2. Everything else gets importance 3 or lower.
At CNB:
- Wire transfers must complete in under 2 seconds → CICSHIGH, Importance 1
- Online banking transactions must respond in under 0.5 seconds → CICSPROD, Importance 1
- EOD settlement must complete by 3:00 AM → BATCHCRT, Importance 2
- Regulatory reports must complete by 6:00 AM → RPTPROD, Importance 3
- Data extracts for analytics team → BATCHLOW, Importance 4
Principle 2: Fewer service classes is better.
Every service class you add increases complexity. Most shops need 15-25 service classes. If you have more than 40, you are almost certainly over-engineering. WLM makes better decisions when it has clear, distinct priority tiers, not dozens of fine-grained distinctions.
Principle 3: Use importance levels to create clear tiers.
Do not try to differentiate within an importance level using different goal values. Use importance levels for coarse-grained priority and goals for fine-grained performance targets.
Bad design:
CICS_TYPE_A: Importance 2, RT 0.20 sec
CICS_TYPE_B: Importance 2, RT 0.30 sec
CICS_TYPE_C: Importance 2, RT 0.40 sec
Better design:
CICSHIGH: Importance 1, RT 0.10 sec (wire transfers, ATM)
CICSPROD: Importance 1, RT 0.25 sec (general online)
CICSLOW: Importance 2, RT 1.00 sec (internal, reporting)
Principle 4: Plan for the worst day, not the average day.
Your service definition must handle month-end, quarter-end, and year-end spikes. Use service policies to shift priorities when workload patterns change. Test your service policies before month-end, not during.
🧩 Productive Struggle: You are designing the WLM service definition for Pinnacle Health Insurance. Their workload includes: (a) real-time claims adjudication (must respond in < 1 second), (b) batch claims processing (50M claims/month, must complete nightly), (c) provider network updates (weekly batch, large but not time-critical), (d) regulatory compliance reporting (monthly, must complete by 5th business day), and (e) ad-hoc queries from actuaries. Design the service classes, importance levels, and goals. Then compare your design with the case study in case-study-01.md.
CNB's Service Definition: A Detailed Walkthrough
Let us walk through CNB's complete service definition design, which Kwame Mensah presented at SHARE in 2024.
Service Classes (production):
CICSHIGH - Imp 1, Period 1: RT 0.10s, Period 2: VEL 60% Imp 2
CICSPROD - Imp 1, Period 1: RT 0.25s, Period 2: VEL 40% Imp 2
CICSINTN - Imp 2, Period 1: RT 1.00s, Period 2: VEL 30% Imp 3
DB2PROD - Imp 1, Period 1: RT 0.50s, Period 2: VEL 40% Imp 2
DB2DDF - Imp 1, Period 1: RT 0.30s, Period 2: VEL 50% Imp 2
MQPROD - Imp 2, Period 1: RT 1.00s
BATCHCRT - Imp 2, Period 1: VEL 50%
BATCHSTD - Imp 3, Period 1: VEL 30%
BATCHLOW - Imp 4, Discretionary
RPTPROD - Imp 3, Period 1: VEL 40%
STCHIGH - Imp 2, Period 1: VEL 60%
STCSTD - Imp 3, Period 1: VEL 30%
TSOPROD - Imp 3, Period 1: RT 0.50s
OMVSPROD - Imp 3, Period 1: VEL 30%
DISCRTNY - Imp 5, Discretionary
Service Policies:
DAYTIME (Active 6:00 AM - 11:00 PM):
CICSHIGH Imp 1, CICSPROD Imp 1, DB2PROD Imp 1, DB2DDF Imp 1
BATCHCRT Imp 3, BATCHSTD Imp 4, BATCHLOW Imp 5
BATCHWIN (Active 11:00 PM - 6:00 AM):
CICSHIGH Imp 1, CICSPROD Imp 2, DB2PROD Imp 2, DB2DDF Imp 2
BATCHCRT Imp 1, BATCHSTD Imp 3, BATCHLOW Imp 4
MONTHEND (Activated last business day, 11:00 PM - 6:00 AM):
CICSHIGH Imp 1, CICSPROD Imp 2
BATCHCRT Imp 1, BATCHSTD Imp 2 (elevated from Imp 3!)
RPTPROD Imp 2 (elevated from Imp 3!)
BATCHLOW Imp 5
The MONTHEND policy is notable because it elevates standard batch and reporting to importance 2, recognizing that month-end batch includes additional critical-path jobs (interest calculation, statement generation) that normally run as standard batch.
Classification Rules:
See code/example-02-wlm-policy.txt for the full classification rule set.
The Design Review Process
At CNB, any change to the WLM service definition goes through a formal review process:
- Change request — business justification required
- Technical review — Kwame and the sysprog team assess impact
- Capacity analysis — can the system support the new priority?
- Test — applied in the test sysplex first
- Implementation — applied during a scheduled maintenance window
- Monitoring — PI trends watched for 72 hours post-change
🔍 Elaborative Interrogation: Why does CNB require business justification for WLM changes? What would happen if individual application teams could modify the service definition without coordination? Think about game theory — if every team can elevate their own importance, what is the equilibrium?
The answer, which Kwame learned the hard way early in his career, is that without governance, every team escalates their importance to 1, and WLM degrades to the equivalent of no priority management at all. The service definition is a shared resource that requires collaborative governance.
Project Checkpoint: WLM Service Classes for the HA Banking System
It is time to apply what you have learned to the progressive project. Your HA Banking Transaction Processing System needs a WLM service definition that supports:
- Online transactions — real-time banking transactions through CICS (target: < 0.3 second response time)
- Batch critical path — EOD settlement, regulatory reporting, fraud detection (must complete in 4-hour window)
- Batch non-critical — data extracts, analytics feeds, archive processing
- Reporting — management and regulatory reports
- API workloads — DB2 DDF connections from the mobile banking API layer
- Infrastructure — monitoring, automation, MQ message processing
See code/project-checkpoint.md for the detailed checkpoint exercise with design templates, evaluation criteria, and a reference solution.
Production Considerations
WLM in a Sysplex Environment
In a Parallel Sysplex, the WLM service definition is shared across all systems through the WLM couple data set. Changes made on one system are automatically propagated to all others. However, dispatching decisions are made independently on each system.
This creates an important architectural consideration: if you have workload-specific LPARs (e.g., one LPAR optimized for online, another for batch), the same service definition applies to both, but the actual dispatching priorities will differ because the workload mix differs.
At CNB, the four LPARs have different workload profiles:
- CNBP1, CNBP2: Primary online (CICS, DB2 DDF) — CICSHIGH and CICSPROD dominate
- CNBP3: Mixed online and batch — balanced workload
- CNBP4: Primary batch — BATCHCRT and BATCHSTD dominate
The same service definition works across all four because WLM adapts to the local workload. On CNBP4, batch gets higher dispatching priorities because there is less online competition. On CNBP1, online dominates because there is little batch.
WLM and z/OS Container Extensions (zCX)
For shops running Linux containers on z/OS via zCX, WLM manages the zCX address space as a started task. The containers within zCX inherit the address space's WLM priority. You cannot differentiate priority within a zCX instance at the WLM level — that requires container-level resource management (cgroups within the zCX guest).
Yuki Nakamura at SecureFirst is running API gateway containers in zCX. Her team classifies the zCX address space at importance 2 (same as their MQ workload), which provides adequate priority for the API layer without competing with CICS for importance-1 resources.
WLM and Coupling Facility Structures
In a sysplex, Coupling Facility (CF) structures — used for DB2 data sharing, CICS shared data tables, and cross-system communication — are accessed by work running in various service classes. WLM manages the dispatching priority of CF requests based on the originating service class.
This means a CF structure request from an importance-1 CICS transaction gets higher priority than a CF request from an importance-3 batch job, even though both are accessing the same CF structure. This is one more reason to be precise about service class classification — it affects CF performance as well.
Capacity Planning Integration
WLM data is the foundation of mainframe capacity planning. The service consumed by each service class, tracked over time, tells you:
- Which workloads are growing
- When you will need additional capacity (MIPS, memory, I/O bandwidth)
- Whether a workload shift (like moving batch to a different LPAR) would improve overall performance
At CNB, the capacity planning team reviews WLM data monthly and produces a rolling 12-month forecast. This forecast drives hardware procurement decisions — a $2M+ annual spend that depends on accurate WLM measurement data.
⚠️ Warning: If your WLM classification rules are sloppy — with significant work falling into catch-all categories — your capacity planning data is unreliable. You cannot plan for workload growth if you cannot measure individual workloads accurately. Clean classification rules are not just a performance concern; they are a financial concern.
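A minimal sketch of the trending idea behind the rolling forecast: fit a linear trend to monthly service consumption per service class and project it forward. All numbers here are hypothetical; a real shop would feed this from SMF 72 data via MXG or similar tooling:

```python
# Monthly CPU service units (thousands) for one service class, last 12 months.
months = list(range(1, 13))
cicsprod_units = [410, 415, 423, 431, 440, 446,
                  455, 462, 471, 480, 488, 497]

# Ordinary least-squares slope and intercept, computed by hand to stay
# dependency-free.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(cicsprod_units) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, cicsprod_units))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x

forecast_month_24 = intercept + slope * 24   # 12 months beyond the data
print(f"growth per month: {slope:.1f}; month-24 forecast: {forecast_month_24:.1f}")
```

Real capacity planning uses far more sophisticated models (seasonality, month-end peaks, latent demand), but even a simple per-service-class trend exposes which workloads are growing, and it is only as good as the classification rules feeding it.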
WLM and Security Considerations
WLM configuration is a security-sensitive operation. The ability to modify the service definition — changing importance levels, goals, or classification rules — can directly impact production system performance. At CNB, WLM administrative authority is controlled through RACF FACILITY class profiles:
RDEFINE FACILITY MVSADMIN.WLM.POLICY UACC(NONE)
PERMIT MVSADMIN.WLM.POLICY CLASS(FACILITY) ID(WLMADM) ACCESS(UPDATE)
PERMIT MVSADMIN.WLM.POLICY CLASS(FACILITY) ID(KWAMEM) ACCESS(UPDATE)
Only two user IDs — the WLM administration group ID and Kwame Mensah's personal ID — have authority to modify the service definition. All changes are logged in SMF type 80 records for audit purposes.
Ahmad Rashidi at Pinnacle Health Insurance requires that WLM changes follow the same change management process as application code deployments: change ticket, risk assessment, approval chain, and post-implementation review. This is not bureaucratic overhead — it reflects the reality that a misconfigured WLM service definition can cause production outages as severe as a code defect.
Common WLM Anti-Patterns
After twenty-five years of reviewing WLM configurations across dozens of shops, I have seen the same mistakes repeatedly. Here are the most damaging anti-patterns:
Anti-Pattern 1: The "Everything Is Important" Trap. Every workload at importance 1 or 2. WLM cannot differentiate, and performance is indistinguishable from no WLM management at all. Fix: force-rank workloads through a business-stakeholder workshop.
Anti-Pattern 2: The "Set and Forget" Definition. A service definition designed five years ago for a different workload profile. The system has changed, but the WLM configuration has not. Fix: quarterly reviews with performance data analysis.
Anti-Pattern 3: The "Manual Override" Culture. Operators routinely use VARY WLM,POLICY to boost individual workloads. This indicates the base service definition does not reflect actual business priorities. Fix: redesign the service definition so manual overrides are the exception, not the norm.
Anti-Pattern 4: The "Missing Catch-All" Omission. No wildcard rules, so new workloads fall into SYSOTHER at importance 5. The failure mode is silent — nobody notices until a new production job misses its SLA. Fix: every classification group ends with * pointing to an appropriate default.
Anti-Pattern 5: The "Aspirational Goals" Mistake. Velocity goals set at levels the system cannot physically achieve, causing WLM to perpetually chase unattainable targets. The result is constant priority thrashing with no performance improvement. Fix: model goals against actual system capacity.
Summary
WLM is the control plane of z/OS performance management. It decides, dynamically and continuously, which work gets resources and which work waits. As a COBOL architect, understanding WLM is essential because:
1. Your code does not run in isolation. Every COBOL program — online or batch — competes for resources with every other program on the LPAR. WLM arbitrates that competition.
2. Performance is not just about code quality. A well-optimized COBOL program classified at importance 5 will run slower than a mediocre program classified at importance 1. Architecture includes ensuring your work is correctly classified.
3. WLM design reflects business priorities. The service definition is a formal encoding of the question "what matters most?" Getting this wrong means the wrong work gets resources during contention.
4. Diagnosis requires WLM data. When performance degrades, the first question is always "is WLM giving this work adequate resources?" RMF reports and SMF type 72 records answer that question.
5. WLM governance is a team sport. No single team should control the service definition. It requires collaboration between architects, system programmers, DBAs, operations, and the business.
The key concepts from this chapter:
- Service definition → the complete WLM configuration (service classes, workloads, classification rules, service policies)
- Service class → a container for work with similar performance goals
- Classification rules → how WLM assigns work to service classes
- Performance Index (PI) → the ratio of actual performance to goal (< 1.0 means beating the goal, > 1.0 means missing it)
- Importance levels → 1-5 priority tiers that determine resource allocation during contention
- Service policies → named sets of goals that can be activated for different time periods
- WLM-managed initiators → dynamic batch initiator allocation based on WLM goals
- Enclaves → mechanism for classifying work within subsystem address spaces
- RMF/SMF type 72 → the measurement data that makes WLM observable
What's Next
In Chapter 6, we turn to the DB2 Optimizer — the subsystem that decides how your SQL executes. Just as WLM decides when your work runs and at what priority, the DB2 Optimizer decides which access paths, join methods, and index strategies to use. And just as with WLM, understanding the Optimizer's decisions is the difference between a query that runs in milliseconds and one that runs in hours.
We will see how CNB's Lisa Tran uses EXPLAIN output to diagnose access path problems, how Diane Okoye at Pinnacle Health designs DB2 access strategies for high-volume claims processing, and how the WLM service classes we defined in this chapter interact with DB2's internal resource management.
Key Terms Glossary
| Term | Definition |
|---|---|
| Workload Manager (WLM) | The z/OS component that dynamically manages system resources based on business goals rather than static priority assignments |
| Service Class | A named container in the WLM service definition that groups work with similar performance goals and importance |
| Service Definition | The complete WLM configuration comprising service classes, workloads, classification rules, and service policies |
| Workload | A logical grouping of service classes representing a business function, used primarily for reporting |
| Classification Rules | The rules WLM uses to assign incoming work to a service class based on attributes like job name, transaction ID, or plan name |
| Service Class Period | A phase within a service class that defines a specific goal; work moves through periods as it consumes more service |
| Velocity Goal | A WLM goal type that targets the percentage of time work spends using resources vs. waiting (specified as 1-99%) |
| Response Time Goal | A WLM goal type that targets average transaction response time in seconds |
| Discretionary | A WLM goal type indicating work should receive only leftover resources after all goal-based work is satisfied |
| Importance Level | A priority ranking (1=highest to 5=lowest) that determines which service classes receive resources first during contention |
| Performance Index (PI) | The ratio of actual performance to goal; < 1.0 means better than goal, > 1.0 means missing goal |
| RMF (Resource Measurement Facility) | The z/OS component that collects and reports system performance data including WLM metrics |
| SMF Type 72 | System Management Facility record type containing WLM performance data, written at configurable intervals |
| Dispatching Priority | The numeric priority (0-255) z/OS uses to determine which address space gets the next available CPU cycle |
| Enclave | A WLM construct that allows work within a subsystem (like a DB2 stored procedure) to be independently classified and managed |
| WLM-Managed Initiator | A JES2/JES3 batch initiator that is dynamically started and stopped by WLM based on workload demand |
| Batch Window | The period (typically overnight) when reduced online volume allows intensive batch processing |
| Service Policy | A named set of service class goal overrides that can be activated to shift priorities for different operational periods |
| Report Class | An optional WLM construct for gathering performance data about work without affecting its service class assignment |