Case Study 1: CNB's WLM Service Definition — Balancing 500 Million Online Transactions with End-of-Day Batch
The Situation
Continental National Bank processes 500 million transactions per day across four LPARs in a Parallel Sysplex. The workload breaks down as follows:
| Workload | Daily Volume | SLA | Criticality |
|---|---|---|---|
| ATM/wire transfers (CICS) | 15M transactions | < 0.15 sec RT | Regulatory, reputational |
| Online banking (CICS) | 380M transactions | < 0.30 sec RT | Customer-facing |
| API/mobile banking (DB2 DDF) | 95M requests | < 0.50 sec RT | Customer-facing |
| MQ messaging | 10M messages | < 1.0 sec delivery | Operational |
| Critical batch (EOD) | 28 job streams | Complete by 3:00 AM | Regulatory, operational |
| Standard batch | 142 job streams | Complete by 6:00 AM | Business |
| Reporting | 12 report suites | Complete by 6:00 AM | Management, regulatory |
| Month-end batch (additional) | 2,400 jobs | Complete by 3:00 AM | Regulatory |
The four LPARs are configured with differentiated workload profiles:
- CNBP1, CNBP2: Primary online processing (CICS, DB2 DDF). Each handles roughly 45% of online volume during the day.
- CNBP3: Mixed workload. Handles overflow online during the day, batch during the window.
- CNBP4: Primary batch processing. Handles the majority of EOD batch and all reporting.
The Challenge
In early 2024, CNB's batch window shrank from five hours to four. The mobile banking platform, launched in 2023, generates significant transaction volume late in the evening, when the batch window traditionally began. The old window (10:00 PM to 3:00 AM) was no longer viable because mobile traffic between 10:00 PM and 11:00 PM now rivals daytime volume, so the start of the window moved to 11:00 PM.
Rob Calloway reported that three times in January, the EOD settlement critical path missed the 3:00 AM deadline. Kwame Mensah convened a cross-functional team to redesign the WLM service definition.
The Analysis
Step 1: Characterize the Workload Patterns
Lisa Tran pulled six months of SMF type 72 data and produced hourly transaction volume profiles:
| Hour | CICS TPS | DB2 DDF TPS | Batch Jobs Active | CPU Util % |
|---|---|---|---|---|
| 06:00 | 800 | 120 | 5 | 35% |
| 09:00 | 5,200 | 890 | 12 | 72% |
| 12:00 | 5,800 | 980 | 8 | 78% |
| 15:00 | 5,500 | 920 | 10 | 75% |
| 18:00 | 4,200 | 750 | 6 | 62% |
| 21:00 | 2,800 | 620 | 4 | 48% |
| 22:00 | 2,400 | 580 | 18 | 55% |
| 23:00 | 1,800 | 420 | 45 | 82% |
| 00:00 | 1,200 | 380 | 62 | 91% |
| 01:00 | 600 | 280 | 58 | 88% |
| 02:00 | 350 | 180 | 42 | 76% |
| 03:00 | 250 | 120 | 15 | 52% |
| 04:00 | 200 | 100 | 8 | 38% |
| 05:00 | 280 | 110 | 10 | 42% |
Key finding: At midnight, CPU utilization hits 91%. The LPAR is capacity-constrained during the early batch window. WLM is making correct trade-off decisions, but there simply are not enough resources for all workloads at their current priority levels.
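The key finding can be reproduced directly from the hourly profile. The following is a minimal sketch, not CNB's tooling: the function names are illustrative, and the 85% alert threshold is a hypothetical value chosen for the example, not a figure from the SMF analysis.

```python
# Hourly profile transcribed from the table above:
# (hour, CICS TPS, DB2 DDF TPS, active batch jobs, CPU utilization %)
PROFILE = [
    ("06:00", 800, 120, 5, 35),
    ("09:00", 5200, 890, 12, 72),
    ("12:00", 5800, 980, 8, 78),
    ("15:00", 5500, 920, 10, 75),
    ("18:00", 4200, 750, 6, 62),
    ("21:00", 2800, 620, 4, 48),
    ("22:00", 2400, 580, 18, 55),
    ("23:00", 1800, 420, 45, 82),
    ("00:00", 1200, 380, 62, 91),
    ("01:00", 600, 280, 58, 88),
    ("02:00", 350, 180, 42, 76),
    ("03:00", 250, 120, 15, 52),
    ("04:00", 200, 100, 8, 38),
    ("05:00", 280, 110, 10, 42),
]

CPU_ALERT_PCT = 85  # hypothetical threshold, not taken from the case study


def peak_hour(profile):
    """Return the row with the highest CPU utilization."""
    return max(profile, key=lambda row: row[4])


def constrained_hours(profile, threshold=CPU_ALERT_PCT):
    """Return the hours whose CPU utilization is at or above the threshold."""
    return [row[0] for row in profile if row[4] >= threshold]


def online_drop_pct(profile, hour):
    """Percentage drop in CICS TPS at `hour` relative to the daily CICS peak."""
    peak_cics = max(row[1] for row in profile)
    cics_at_hour = next(row[1] for row in profile if row[0] == hour)
    return round(100 * (1 - cics_at_hour / peak_cics))


print(peak_hour(PROFILE)[0])              # 00:00
print(constrained_hours(PROFILE))         # ['00:00', '01:00']
print(online_drop_pct(PROFILE, "00:00"))  # 79
```

The midnight and 1:00 AM samples are the only ones above the assumed threshold, and both fall in the hours when the active batch job count is highest while online volume is still well above its overnight floor.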
Step 2: Identify the Critical Path
Rob Calloway mapped the EOD batch critical path:
EODSETL1 (settlement extract)        → 45 min
  └→ EODSETL2 (settlement calc)      → 30 min
       ├→ FEDWIRE1 (Fed wire file)   → 20 min
       ├→ ACHGEN01 (ACH file gen)    → 25 min
       └→ REGFED01 (regulatory)      → 60 min
            └→ REGFED02 (filing)     → 15 min
Critical path duration: 45 + 30 + 60 + 15 = 150 minutes (2.5 hours)
With parallelism: 45 + 30 + 60 + 15 = 150 minutes (no improvement — linear dependency)
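The critical path requires a minimum of 2.5 hours, leaving only 1.5 hours of slack in the 4-hour window. Because the chain is strictly sequential, every minute of delay in a critical-path job delays the whole chain, and delays totaling more than 90 minutes push it past 3:00 AM.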
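The 150-minute figure is the longest chain through the job dependency graph. The sketch below recomputes it, assuming (as the 45 + 30 + 60 + 15 sum implies) that FEDWIRE1, ACHGEN01, and REGFED01 all start after EODSETL2 and run in parallel, with REGFED02 following REGFED01. The `SUCCESSORS` map and function name are illustrative, not part of CNB's scheduler definition.

```python
# Job durations in minutes, from the critical-path map above.
DURATION_MIN = {
    "EODSETL1": 45,
    "EODSETL2": 30,
    "FEDWIRE1": 20,
    "ACHGEN01": 25,
    "REGFED01": 60,
    "REGFED02": 15,
}

# Assumed dependency shape: three parallel branches after EODSETL2.
SUCCESSORS = {
    "EODSETL1": ["EODSETL2"],
    "EODSETL2": ["FEDWIRE1", "ACHGEN01", "REGFED01"],
    "FEDWIRE1": [],
    "ACHGEN01": [],
    "REGFED01": ["REGFED02"],
    "REGFED02": [],
}


def longest_path(job):
    """Return (total minutes, job chain) for the longest chain starting at job."""
    best_minutes, best_chain = 0, []
    for nxt in SUCCESSORS[job]:
        minutes, chain = longest_path(nxt)
        if minutes > best_minutes:
            best_minutes, best_chain = minutes, chain
    return DURATION_MIN[job] + best_minutes, [job] + best_chain


WINDOW_MIN = 4 * 60  # 11:00 PM to 3:00 AM

critical_minutes, critical_chain = longest_path("EODSETL1")
slack_minutes = WINDOW_MIN - critical_minutes

print(critical_chain)    # ['EODSETL1', 'EODSETL2', 'REGFED01', 'REGFED02']
print(critical_minutes)  # 150
print(slack_minutes)     # 90
```

FEDWIRE1 and ACHGEN01 finish well before the REGFED branch under this assumption, which is why running them in parallel does not shorten the overall elapsed time.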
Step 3: Analyze the Previous WLM Configuration
The previous service definition ran critical batch, standard batch, and reporting all at importance 3 during the batch window:
Previous BATCHWIN Policy:
CICSHIGH Imp 1, RT 0.10s
CICSPROD Imp 1, RT 0.25s ← Still Importance 1!
DB2PROD Imp 1, RT 0.50s ← Still Importance 1!
BATCHCRT Imp 3, VEL 50% ← Same as standard batch!
BATCHSTD Imp 3, VEL 30%
RPTPROD Imp 3, VEL 40%
BATCHLOW Imp 4, Discretionary
The problem was clear: Online transactions retained importance 1 during the batch window, even though volume dropped by 75%. Batch critical-path jobs competed at importance 3 against reduced but still-present online workloads. When mobile traffic spiked at midnight, batch was starved.
The Redesigned Service Definition
New Service Classes
The team made minimal changes to the service class definitions, focusing instead on the service policies:
| Service Class | Goal Type | Goal Value | Description |
|---|---|---|---|
| CICSHIGH | RT | 0.10 sec | Wire transfers, ATM (always priority) |
| CICSPROD | RT | 0.25 sec | General online banking |
| CICSINTN | RT | 1.00 sec | Internal CICS transactions |
| DB2DDF | RT | 0.30 sec | API/mobile banking (DB2 DDF) |
| DB2PROD | RT | 0.50 sec | General DB2 workload |
| MQPROD | RT | 1.00 sec | MQ message processing |
| BATCHCRT | VEL | 50% | EOD critical path batch |
| BATCHSTD | VEL | 30% | Standard nightly batch |
| BATCHLOW | DISC | N/A | Non-critical batch |
| RPTPROD | VEL | 40% | Production reporting |
| STCHIGH | VEL | 60% | Critical started tasks |
| STCSTD | VEL | 30% | Standard started tasks |
| TSOPROD | RT | 0.50 sec | TSO interactive users |
| OMVSPROD | VEL | 30% | USS workloads |
| DISCRTNY | DISC | N/A | Everything else |
New Service Policies
DAYTIME Policy (6:00 AM – 11:00 PM):
| Service Class | Importance | Rationale |
|---|---|---|
| CICSHIGH | 1 | Wire/ATM always top priority |
| CICSPROD | 1 | Online banking is the primary revenue driver |
| DB2DDF | 1 | Mobile banking API response time is SLA-bound |
| DB2PROD | 1 | DB2 supports CICS, so it must match priority |
| MQPROD | 2 | MQ supports async flows: important but not customer-facing |
| BATCHCRT | 3 | Daytime critical batch (rare; only ad-hoc reprocessing) |
| BATCHSTD | 4 | Standard batch should not impact online |
| RPTPROD | 3 | Daytime reports should not impact online |
| STCHIGH | 2 | Monitoring and automation must run |
| BATCHLOW | 5 | Non-critical work gets leftovers |
| DISCRTNY | 5 | Development, TSO, etc. |
BATCHWIN Policy (11:00 PM – 6:00 AM):
| Service Class | Importance | Rationale |
|---|---|---|
| CICSHIGH | 1 | Wire/ATM NEVER drops below Imp 1 |
| CICSPROD | 2 | Online drops: lower volume, can tolerate slight delay |
| DB2DDF | 2 | Mobile API drops to match CICSPROD |
| DB2PROD | 2 | DB2 matches CICS |
| MQPROD | 2 | MQ stays at 2 for batch message flow |
| BATCHCRT | 1 | ← ELEVATED: Critical path gets maximum priority |
| BATCHSTD | 3 | Standard batch is important but not critical path |
| RPTPROD | 3 | Reports run after critical path |
| STCHIGH | 2 | Automation is critical during batch window |
| BATCHLOW | 4 | Non-critical gets low priority |
| DISCRTNY | 5 | No change |
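A quick way to review a policy switch is to diff the importance levels between the two policies. The sketch below does this for DAYTIME and BATCHWIN using the values listed above; the dictionaries and function name are illustrative only and are not an export of the actual WLM service definition. Lower numbers mean higher importance.

```python
# Importance by service class, transcribed from the DAYTIME and BATCHWIN policies.
DAYTIME = {
    "CICSHIGH": 1, "CICSPROD": 1, "DB2DDF": 1, "DB2PROD": 1,
    "MQPROD": 2, "BATCHCRT": 3, "BATCHSTD": 4, "RPTPROD": 3,
    "STCHIGH": 2, "BATCHLOW": 5, "DISCRTNY": 5,
}
BATCHWIN = {
    "CICSHIGH": 1, "CICSPROD": 2, "DB2DDF": 2, "DB2PROD": 2,
    "MQPROD": 2, "BATCHCRT": 1, "BATCHSTD": 3, "RPTPROD": 3,
    "STCHIGH": 2, "BATCHLOW": 4, "DISCRTNY": 5,
}


def importance_changes(old, new):
    """Return {service class: (old importance, new importance)} for changed classes."""
    return {sc: (old[sc], new[sc]) for sc in old if old[sc] != new[sc]}


changes = importance_changes(DAYTIME, BATCHWIN)

# A smaller number is a higher importance, so new < old is a promotion.
promoted = sorted(sc for sc, (old, new) in changes.items() if new < old)
demoted = sorted(sc for sc, (old, new) in changes.items() if new > old)

print(promoted)  # ['BATCHCRT', 'BATCHLOW', 'BATCHSTD']
print(demoted)   # ['CICSPROD', 'DB2DDF', 'DB2PROD']
```

The diff makes the design intent visible at a glance: CICSHIGH is untouched, general online and DB2 work step down one level, and all three batch classes step up, with BATCHCRT moving the furthest.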
MONTHEND Policy (Last business day, 11:00 PM – 6:00 AM):
| Service Class | Importance | Rationale |
|---|---|---|
| CICSHIGH | 1 | Wire/ATM NEVER drops below Imp 1 |
| CICSPROD | 2 | Online drops during batch window |
| DB2DDF | 2 | Mobile API drops to match |
| DB2PROD | 2 | DB2 matches |
| MQPROD | 2 | MQ stays for batch messaging |
| BATCHCRT | 1 | Critical path at maximum |
| BATCHSTD | 2 | ← ELEVATED: Month-end standard batch includes interest calc and statement gen, which are business critical |
| RPTPROD | 2 | ← ELEVATED: Month-end reports are regulatory |
| STCHIGH | 2 | Automation critical |
| BATCHLOW | 5 | Non-critical suppressed during month-end |
| DISCRTNY | 5 | No change |
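The three policies are selected by time of day and calendar date. The sketch below shows one way automation might decide which policy name should be active, under two stated assumptions: "last business day" means the last Monday-to-Friday date of the month (bank holidays are ignored), and a batch night is identified by the date on which it starts at 11:00 PM. The function names are hypothetical. The case study does not say how CNB issues the switch; on z/OS a policy is typically activated with the `VARY WLM,POLICY=` operator command, which this sketch does not attempt to issue.

```python
from datetime import date, datetime, timedelta


def last_business_day(year, month):
    """Last Monday-to-Friday date of the month (holidays ignored: an assumption)."""
    if month == 12:
        first_of_next = date(year + 1, 1, 1)
    else:
        first_of_next = date(year, month + 1, 1)
    day = first_of_next - timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


def select_policy(now):
    """Return the policy name that should be active at local time `now`."""
    if 6 <= now.hour < 23:
        return "DAYTIME"
    # The batch night spans midnight; key it to the date on which it began.
    if now.hour >= 23:
        night_start = now.date()
    else:
        night_start = now.date() - timedelta(days=1)
    if night_start == last_business_day(night_start.year, night_start.month):
        return "MONTHEND"
    return "BATCHWIN"


print(select_policy(datetime(2024, 1, 31, 23, 30)))  # MONTHEND
print(select_policy(datetime(2024, 2, 1, 2, 0)))     # MONTHEND
print(select_policy(datetime(2024, 2, 1, 12, 0)))    # DAYTIME
print(select_policy(datetime(2024, 2, 1, 23, 0)))    # BATCHWIN
```

Keying the night to its start date matters because the MONTHEND window crosses midnight: 2:00 AM on the first of the month still belongs to the month-end run.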
Updated Classification Rules
The team added scheduling environment-based classification to ensure month-end jobs were correctly categorized:
JES Subsystem:
Scheduling Environment: EODCRIT → BATCHCRT
Job Name: EOD* → BATCHCRT
Job Name: FEDWIRE* → BATCHCRT
Job Name: ACH* → BATCHCRT
Job Name: REG* → BATCHCRT
Scheduling Environment: RPTPROD → RPTPROD
Job Name: RPT* → RPTPROD
Scheduling Environment: MTHEND → BATCHSTD (month-end standard, elevated by policy)
Job Name: MTH* → BATCHSTD
Job Name: INT* → BATCHSTD
Job Name: STM* → BATCHSTD
Job Name: EXT* → BATCHLOW
Job Class: Z → BATCHLOW
Job Name: * → BATCHSTD
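To sanity-check a rule list like this before installing it, the rules can be modeled as an ordered table where the first matching rule wins. The sketch below is a simplified model, not WLM itself: it assumes the rules are a flat list evaluated top to bottom (the original nesting is not recoverable from the listing), and it uses Python's `fnmatch` wildcards as a stand-in for WLM's own matching. The `classify` function and its parameters are hypothetical.

```python
from fnmatch import fnmatchcase

# (qualifier type, pattern, service class), in the order listed above.
RULES = [
    ("sched_env", "EODCRIT", "BATCHCRT"),
    ("job_name", "EOD*", "BATCHCRT"),
    ("job_name", "FEDWIRE*", "BATCHCRT"),
    ("job_name", "ACH*", "BATCHCRT"),
    ("job_name", "REG*", "BATCHCRT"),
    ("sched_env", "RPTPROD", "RPTPROD"),
    ("job_name", "RPT*", "RPTPROD"),
    ("sched_env", "MTHEND", "BATCHSTD"),
    ("job_name", "MTH*", "BATCHSTD"),
    ("job_name", "INT*", "BATCHSTD"),
    ("job_name", "STM*", "BATCHSTD"),
    ("job_name", "EXT*", "BATCHLOW"),
    ("job_class", "Z", "BATCHLOW"),
    ("job_name", "*", "BATCHSTD"),  # default for JES work
]


def classify(job_name, job_class="", sched_env=""):
    """Return the service class of the first rule that matches the job."""
    attrs = {"job_name": job_name, "job_class": job_class, "sched_env": sched_env}
    for qualifier, pattern, service_class in RULES:
        if fnmatchcase(attrs[qualifier], pattern):
            return service_class
    return None  # unreachable here because of the catch-all rule


print(classify("EODSETL1"))                      # BATCHCRT
print(classify("RPTDAILY"))                      # RPTPROD
print(classify("INTCALC1", sched_env="MTHEND"))  # BATCHSTD
print(classify("PAYROLL1", job_class="Z"))       # BATCHLOW
print(classify("EODTEST", job_class="Z"))        # BATCHCRT
```

The last call illustrates a risk of broad job-name wildcards under first-match ordering: in this model, a class Z job that happens to be named `EOD*` is classified as critical-path batch, so naming standards need to be enforced alongside the rules.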
The Results
The team deployed the new service definition in February 2024 after two weeks of testing in the QA sysplex.
Before (January 2024)
| Metric | Value |
|---|---|
| Critical path average duration | 175 minutes |
| Critical path missed 3:00 AM deadline | 3 times |
| CICS response time during batch window | 0.22 sec (within goal) |
| LPAR peak CPU during batch window | 91% |
| Standard batch completion by 6:00 AM | 94% |
After (March 2024)
| Metric | Value |
|---|---|
| Critical path average duration | 138 minutes |
| Critical path missed 3:00 AM deadline | 0 times |
| CICS response time during batch window | 0.31 sec (slightly above 0.25 goal, but acceptable) |
| LPAR peak CPU during batch window | 91% (unchanged — same capacity) |
| Standard batch completion by 6:00 AM | 97% |
Key outcomes:
- Critical path duration dropped by 37 minutes (21%). Elevating BATCHCRT to importance 1 during the batch window gave critical-path jobs the dispatching priority they needed to compete effectively with residual online workload.
- CICS response time increased slightly during the batch window (from 0.22s to 0.31s), exceeding the 0.25s goal. The team decided this was acceptable because: (a) the volume is 75% lower at night, (b) the slight delay is imperceptible to mobile banking users, and (c) the online SLA is measured over 24 hours, not per-interval.
- No additional capacity was needed. The same hardware, reorganized by WLM, produced a 21% improvement in critical-path batch elapsed time. Kwame estimated this deferred a $1.8M capacity upgrade by 12–18 months.
- Month-end reliability improved dramatically. The MONTHEND policy, which elevated standard batch and reporting to importance 2, ensured month-end processing completed consistently within the window.
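The headline numbers follow from simple arithmetic on the before and after tables. The sketch below recomputes them. The ratio of observed to goal response time is how a WLM performance index is formed for an average response time goal; the case study does not state whether CNB's goals are average or percentile goals, so treat that ratio as illustrative. All variable names are hypothetical.

```python
WINDOW_MIN = 4 * 60      # 11:00 PM to 3:00 AM
CICSPROD_GOAL_SEC = 0.25

before = {"critical_path_min": 175, "cics_rt_sec": 0.22}  # January 2024
after = {"critical_path_min": 138, "cics_rt_sec": 0.31}   # March 2024

# Critical path improvement.
saved_min = before["critical_path_min"] - after["critical_path_min"]
saved_pct = round(100 * saved_min / before["critical_path_min"], 1)

# Average margin remaining before the 3:00 AM deadline.
margin_before = WINDOW_MIN - before["critical_path_min"]
margin_after = WINDOW_MIN - after["critical_path_min"]

# Observed / goal ratio for CICSPROD; above 1.0 means the goal is being missed.
ratio_before = round(before["cics_rt_sec"] / CICSPROD_GOAL_SEC, 2)
ratio_after = round(after["cics_rt_sec"] / CICSPROD_GOAL_SEC, 2)

print(saved_min, saved_pct)          # 37 21.1
print(margin_before, margin_after)   # 65 102
print(ratio_before, ratio_after)     # 0.88 1.24
```

On average the critical path went from 65 minutes of margin to 102, while CICSPROD moved from comfortably inside its goal to about 24% over it during the batch window. That is the trade-off the team chose to accept.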
Lessons Learned
Lesson 1: Do not keep online at importance 1 during the batch window if online volume is low. The previous configuration was a legacy from when the batch window started at 10:00 PM and online traffic was negligible. The mobile banking platform changed the economics, and the WLM configuration needed to reflect that.
Lesson 2: Differentiate critical-path batch from standard batch. The original configuration treated all batch the same. Once the critical path was identified and classified separately, WLM could make intelligent trade-offs between critical and non-critical work.
Lesson 3: Accept measured degradation in low-priority windows. The slight increase in CICS response time during the batch window was a deliberate trade-off, not a failure. The team documented this decision and communicated it to the business.
Lesson 4: WLM changes can be as impactful as hardware upgrades. A $0 configuration change deferred a $1.8M hardware purchase. This is why WLM design is an architecture skill, not just a sysprog task.
Lesson 5: Service definition changes need governance. Kwame instituted a formal quarterly review process after this incident. Every stakeholder — online, batch, DBA, operations, business — has a seat at the table.
Discussion Questions
- CNB accepted a slight CICS response time degradation during the batch window. Under what circumstances would this trade-off be unacceptable? How would you design around it?
- The MONTHEND policy elevates BATCHSTD to importance 2. What would happen if month-end coincided with a major online banking promotion (e.g., a marketing campaign driving midnight mobile traffic)? How would you handle the conflict?
- CNB has four LPARs with differentiated workload profiles. Would the redesigned service definition work differently on CNBP1 (online-heavy) vs. CNBP4 (batch-heavy)? Why or why not?
- The analysis revealed 91% CPU utilization at midnight. If traffic continues to grow, when will WLM optimization no longer be sufficient? What leading indicators would you monitor?
- Rob Calloway automates the policy switch at 11:00 PM. What risks does automated policy switching introduce? What safeguards should be in place?