Case Study 1: CNB's WLM Service Definition — Balancing 500 Million Online Transactions with End-of-Day Batch


The Situation

Continental National Bank processes 500 million transactions per day across four LPARs in a Parallel Sysplex. The workload breaks down as follows:

Workload                       Daily Volume        SLA                    Criticality
--------                       ------------        ---                    -----------
ATM/wire transfers (CICS)      15M transactions    < 0.15 sec RT          Regulatory, reputational
Online banking (CICS)          380M transactions   < 0.30 sec RT          Customer-facing
API/mobile banking (DB2 DDF)   95M requests        < 0.50 sec RT          Customer-facing
MQ messaging                   10M messages        < 1.0 sec delivery     Operational
Critical batch (EOD)           28 job streams      Complete by 3:00 AM    Regulatory, operational
Standard batch                 142 job streams     Complete by 6:00 AM    Business
Reporting                      12 report suites    Complete by 6:00 AM    Management, regulatory
Month-end batch (additional)   2,400 jobs          Complete by 3:00 AM    Regulatory

The four LPARs are configured with differentiated workload profiles:

  • CNBP1, CNBP2: Primary online processing (CICS, DB2 DDF). Each handles roughly 45% of online volume during the day.
  • CNBP3: Mixed workload. Handles overflow online during the day, batch during the window.
  • CNBP4: Primary batch processing. Handles the majority of EOD batch and all reporting.

The Challenge

In early 2024, CNB's batch window shrank from five hours to four hours. The mobile banking platform, launched in 2023, generates significant transaction volume around midnight — the traditional start of the batch window. The old window (10:00 PM to 3:00 AM) was no longer viable because mobile traffic between 10:00 PM and 11:00 PM now rivals daytime volume.

Rob Calloway reported that three times in January, the EOD settlement critical path missed the 3:00 AM deadline. Kwame Mensah convened a cross-functional team to redesign the WLM service definition.

The Analysis

Step 1: Characterize the Workload Patterns

Lisa Tran pulled six months of SMF type 72 data and produced hourly transaction volume profiles:

Hour     CICS TPS    DB2 DDF TPS    Batch Jobs Active    CPU Util%
------   --------    -----------    -----------------    ---------
06:00       800          120              5                 35%
09:00     5,200          890             12                 72%
12:00     5,800          980              8                 78%
15:00     5,500          920             10                 75%
18:00     4,200          750              6                 62%
21:00     2,800          620              4                 48%
22:00     2,400          580             18                 55%
23:00     1,800          420             45                 82%
00:00     1,200          380             62                 91%
01:00       600          280             58                 88%
02:00       350          180             42                 76%
03:00       250          120             15                 52%
04:00       200          100              8                 38%
05:00       280          110             10                 42%

Key finding: At midnight, CPU utilization hits 91%. The LPAR is capacity-constrained during the early batch window. WLM is making correct trade-off decisions, but there simply are not enough resources for all workloads at their current priority levels.
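
This kind of scan is easy to automate once the SMF data is summarized. A minimal Python sketch, using the evening figures from the table above (the 85% "constrained" threshold is an illustrative assumption, not a CNB standard):

```python
# Hourly profile from the SMF type 72 summary above (subset of columns):
# (hour, CICS TPS, batch jobs active, CPU utilization %)
profile = [
    ("21:00", 2800,  4, 48), ("22:00", 2400, 18, 55),
    ("23:00", 1800, 45, 82), ("00:00", 1200, 62, 91),
    ("01:00",  600, 58, 88), ("02:00",  350, 42, 76),
    ("03:00",  250, 15, 52),
]

# Assumption: an LPAR running above ~85% busy has little headroom left
# for latency-sensitive work, so flag those hours as capacity-constrained.
CONSTRAINED = 85
hot_hours = [hour for hour, _tps, _jobs, util in profile
             if util >= CONSTRAINED]
print(hot_hours)  # the 00:00 and 01:00 hours exceed the threshold
```

Running this against the full six-month profile, rather than one representative day, is what let Lisa confirm the midnight constraint was chronic rather than incidental.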

Step 2: Identify the Critical Path

Rob Calloway mapped the EOD batch critical path:

EODSETL1 (settlement extract)      → 45 min
  └→ EODSETL2 (settlement calc)    → 30 min
       ├→ FEDWIRE1 (Fed wire file) → 20 min
       ├→ ACHGEN01 (ACH file gen)  → 25 min
       └→ REGFED01 (regulatory)    → 60 min
            └→ REGFED02 (filing)   → 15 min

Fully serial duration:  45 + 30 + 20 + 25 + 60 + 15 = 195 minutes (3.25 hours)
With parallelism:       45 + 30 + 60 + 15 = 150 minutes (FEDWIRE1 and ACHGEN01
                        overlap REGFED01; the REGFED chain itself is strictly linear
                        and cannot be shortened further)

The critical path requires 2.5 hours minimum, leaving only 1.5 hours of slack in a 4-hour window. Any delay in any critical-path job pushes the entire chain past 3:00 AM.
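
The same arithmetic generalizes to any job net. A short Python sketch (durations and dependencies taken from the map above; the `finish` helper is illustrative) computes the critical path as the longest path through the dependency graph:

```python
from functools import lru_cache

# EOD job durations (minutes) and predecessor lists from the map above.
DURATION = {"EODSETL1": 45, "EODSETL2": 30, "FEDWIRE1": 20,
            "ACHGEN01": 25, "REGFED01": 60, "REGFED02": 15}
PREDS = {"EODSETL2": ["EODSETL1"], "FEDWIRE1": ["EODSETL2"],
         "ACHGEN01": ["EODSETL2"], "REGFED01": ["EODSETL2"],
         "REGFED02": ["REGFED01"]}

@lru_cache(maxsize=None)
def finish(job):
    """Earliest finish time: longest predecessor chain plus own duration."""
    return DURATION[job] + max((finish(p) for p in PREDS.get(job, [])),
                               default=0)

critical_path = max(finish(j) for j in DURATION)
print(critical_path)  # 150: EODSETL1 -> EODSETL2 -> REGFED01 -> REGFED02
```

Keeping the job net in data like this makes it cheap to re-run the analysis whenever the scheduler adds a dependency, which is exactly when critical paths quietly grow.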

Step 3: Analyze the Previous WLM Configuration

The previous service definition treated all batch at importance 3 during the batch window:

Previous BATCHWIN Policy:
  CICSHIGH  Imp 1, RT 0.10s
  CICSPROD  Imp 1, RT 0.25s   ← Still Importance 1!
  DB2PROD   Imp 1, RT 0.50s   ← Still Importance 1!
  BATCHCRT  Imp 3, VEL 50%    ← Same as standard batch!
  BATCHSTD  Imp 3, VEL 30%
  RPTPROD   Imp 3, VEL 40%
  BATCHLOW  Imp 4, Discretionary

The problem was clear: Online transactions retained importance 1 during the batch window, even though volume dropped by 75%. Batch critical-path jobs competed at importance 3 against reduced but still-present online workloads. When mobile traffic spiked at midnight, batch was starved.

The Redesigned Service Definition

New Service Classes

The team made minimal changes to the service class definitions, focusing instead on the service policies:

Service Class    Goal Type    Goal Value    Description
-----------      ---------    ----------    -----------
CICSHIGH         RT           0.10 sec      Wire transfers, ATM (always priority)
CICSPROD         RT           0.25 sec      General online banking
CICSINTN         RT           1.00 sec      Internal CICS transactions
DB2DDF           RT           0.30 sec      API/mobile banking (DB2 DDF)
DB2PROD          RT           0.50 sec      General DB2 workload
MQPROD           RT           1.00 sec      MQ message processing
BATCHCRT         VEL          50%           EOD critical path batch
BATCHSTD         VEL          30%           Standard nightly batch
BATCHLOW         DISC         N/A           Non-critical batch
RPTPROD          VEL          40%           Production reporting
STCHIGH          VEL          60%           Critical started tasks
STCSTD           VEL          30%           Standard started tasks
TSOPROD          RT           0.50 sec      TSO interactive users
OMVSPROD         VEL          30%           USS workloads
DISCRTNY         DISC         N/A           Everything else

New Service Policies

DAYTIME Policy (6:00 AM – 11:00 PM):

Service Class    Importance    Rationale
CICSHIGH         1             Wire/ATM always top priority
CICSPROD         1             Online banking is the primary revenue driver
DB2DDF           1             Mobile banking API response time is SLA-bound
DB2PROD          1             DB2 supports CICS — must match priority
MQPROD           2             MQ supports async flows — important but not customer-facing
BATCHCRT         3             Daytime critical batch (rare — only ad-hoc reprocessing)
BATCHSTD         4             Standard batch should not impact online
RPTPROD          3             Daytime reports should not impact online
STCHIGH          2             Monitoring and automation must run
BATCHLOW         5             Non-critical work gets leftovers
DISCRTNY         5             Development and other unclassified work

BATCHWIN Policy (11:00 PM – 6:00 AM):

Service Class    Importance    Rationale
CICSHIGH         1             Wire/ATM NEVER drops below Imp 1
CICSPROD         2             Online drops — lower volume, can tolerate slight delay
DB2DDF           2             Mobile API drops to match CICSPROD
DB2PROD          2             DB2 matches CICS
MQPROD           2             MQ stays at 2 for batch message flow
BATCHCRT         1             ← ELEVATED: Critical path gets maximum priority
BATCHSTD         3             Standard batch is important but not critical path
RPTPROD          3             Reports run after critical path
STCHIGH          2             Automation is critical during batch window
BATCHLOW         4             Non-critical gets low priority
DISCRTNY         5             No change

MONTHEND Policy (Last business day, 11:00 PM – 6:00 AM):

Service Class    Importance    Rationale
CICSHIGH         1             Wire/ATM NEVER drops below Imp 1
CICSPROD         2             Online drops during batch window
DB2DDF           2             Mobile API drops to match
DB2PROD          2             DB2 matches
MQPROD           2             MQ stays for batch messaging
BATCHCRT         1             Critical path at maximum
BATCHSTD         2             ← ELEVATED: Month-end standard batch includes
                                  interest calc, statement gen — business critical
RPTPROD          2             ← ELEVATED: Month-end reports are regulatory
STCHIGH          2             Automation critical
BATCHLOW         5             Non-critical suppressed during month-end
DISCRTNY         5             No change
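
Because the three policies differ only in per-class importance overrides, they are easy to model and sanity-check offline. A hedged sketch (importance values copied from the tables above; the dictionary layering is an illustration, not how WLM stores policies):

```python
# Importance by service class under each policy, from the tables above.
DAYTIME  = {"CICSHIGH": 1, "CICSPROD": 1, "DB2DDF": 1, "DB2PROD": 1,
            "MQPROD": 2, "BATCHCRT": 3, "BATCHSTD": 4, "RPTPROD": 3,
            "STCHIGH": 2, "BATCHLOW": 5, "DISCRTNY": 5}
# Each off-shift policy is the previous one plus a handful of overrides.
BATCHWIN = dict(DAYTIME, CICSPROD=2, DB2DDF=2, DB2PROD=2,
                BATCHCRT=1, BATCHSTD=3, BATCHLOW=4)
MONTHEND = dict(BATCHWIN, BATCHSTD=2, RPTPROD=2, BATCHLOW=5)

# Invariants the team relied on: wire/ATM work is importance 1 in every
# policy, and critical-path batch never ranks below standard batch.
for policy in (DAYTIME, BATCHWIN, MONTHEND):
    assert policy["CICSHIGH"] == 1
    assert policy["BATCHCRT"] <= policy["BATCHSTD"]
print("policy invariants hold")
```

Encoding invariants like "CICSHIGH never drops below importance 1" as executable checks is a cheap guard against a future policy edit silently violating the design.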

Updated Classification Rules

The team added scheduling environment-based classification to ensure month-end jobs were correctly categorized:

JES Subsystem:
  Scheduling Environment: EODCRIT   → BATCHCRT
  Job Name: EOD*                    → BATCHCRT
  Job Name: FEDWIRE*                → BATCHCRT
  Job Name: ACH*                    → BATCHCRT
  Job Name: REG*                    → BATCHCRT
  Scheduling Environment: RPTPROD   → RPTPROD
  Job Name: RPT*                    → RPTPROD
  Scheduling Environment: MTHEND    → BATCHSTD  (month-end standard, elevated by policy)
  Job Name: MTH*                    → BATCHSTD
  Job Name: INT*                    → BATCHSTD
  Job Name: STM*                    → BATCHSTD
  Job Name: EXT*                    → BATCHLOW
  Job Class: Z                      → BATCHLOW
  Job Name: *                       → BATCHSTD
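
WLM evaluates classification rules in order and takes the first match. A simplified Python sketch of that behavior for the JES rules above (the flat rule list and `classify` helper are illustrative; real WLM rules are hierarchical and support more qualifiers):

```python
from fnmatch import fnmatch

# Ordered (attribute, pattern, service class) rules mirroring the JES
# subsystem rules above; the first matching rule wins.
RULES = [
    ("schenv",   "EODCRIT",  "BATCHCRT"),
    ("jobname",  "EOD*",     "BATCHCRT"),
    ("jobname",  "FEDWIRE*", "BATCHCRT"),
    ("jobname",  "ACH*",     "BATCHCRT"),
    ("jobname",  "REG*",     "BATCHCRT"),
    ("schenv",   "RPTPROD",  "RPTPROD"),
    ("jobname",  "RPT*",     "RPTPROD"),
    ("schenv",   "MTHEND",   "BATCHSTD"),
    ("jobname",  "MTH*",     "BATCHSTD"),
    ("jobname",  "INT*",     "BATCHSTD"),
    ("jobname",  "STM*",     "BATCHSTD"),
    ("jobname",  "EXT*",     "BATCHLOW"),
    ("jobclass", "Z",        "BATCHLOW"),
    ("jobname",  "*",        "BATCHSTD"),  # catch-all default
]

def classify(jobname, schenv=None, jobclass=None):
    """Return the service class for a job; first matching rule wins."""
    attrs = {"jobname": jobname, "schenv": schenv, "jobclass": jobclass}
    for attr, pattern, svclass in RULES:
        value = attrs[attr]
        if value is not None and fnmatch(value, pattern):
            return svclass

print(classify("EODSETL1"))                   # BATCHCRT via the EOD* mask
print(classify("INTCALC1", schenv="MTHEND"))  # BATCHSTD via MTHEND
print(classify("PAYROLL1", jobclass="Z"))     # BATCHLOW via job class Z
```

Note how rule order matters: a job named EXTRACT1 lands in BATCHLOW only because the EXT* rule precedes the catch-all, which is why the team reviewed rule ordering as carefully as the rules themselves.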

The Results

The team deployed the new service definition in February 2024 after two weeks of testing in the QA sysplex.

Before (January 2024)

Metric                                     Value
------                                     -----
Critical path average duration             175 minutes
Critical path missed 3:00 AM deadline      3 times
CICS response time during batch window     0.22 sec (within goal)
LPAR peak CPU during batch window          91%
Standard batch completion by 6:00 AM       94%

After (March 2024)

Metric                                     Value
------                                     -----
Critical path average duration             138 minutes
Critical path missed 3:00 AM deadline      0 times
CICS response time during batch window     0.31 sec (slightly above 0.25 goal, but acceptable)
LPAR peak CPU during batch window          91% (unchanged — same capacity)
Standard batch completion by 6:00 AM       97%

Key outcomes:

  1. Critical path duration dropped by 37 minutes (21%). Elevating BATCHCRT to importance 1 during the batch window gave critical-path jobs the dispatching priority they needed to compete effectively with residual online workload.

  2. CICS response time increased slightly during the batch window (from 0.22s to 0.31s), exceeding the 0.25s goal. The team decided this was acceptable because: (a) the volume is 75% lower at night, (b) the slight delay is imperceptible to mobile banking users, and (c) the online SLA is measured over 24 hours, not per-interval.

  3. No additional capacity was needed. The same hardware, reorganized by WLM, produced a 21% improvement in critical-path batch elapsed time. Kwame estimated this deferred a $1.8M capacity upgrade by 12–18 months.

  4. Month-end reliability improved dramatically. The MONTHEND policy, which elevated standard batch and reporting to importance 2, ensured month-end processing completed consistently within the window.

Lessons Learned

Lesson 1: Do not keep online at importance 1 during the batch window if online volume is low. The previous configuration was a legacy from when the batch window started at 10:00 PM and online traffic was negligible. The mobile banking platform changed the economics, and the WLM configuration needed to reflect that.

Lesson 2: Differentiate critical-path batch from standard batch. The original configuration treated all batch the same. Once the critical path was identified and classified separately, WLM could make intelligent trade-offs between critical and non-critical work.

Lesson 3: Accept measured degradation in low-priority windows. The slight increase in CICS response time during the batch window was a deliberate trade-off, not a failure. The team documented this decision and communicated it to the business.

Lesson 4: WLM changes can be as impactful as hardware upgrades. A $0 configuration change deferred a $1.8M hardware purchase. This is why WLM design is an architecture skill, not just a sysprog task.

Lesson 5: Service definition changes need governance. Kwame instituted a formal quarterly review process after this incident. Every stakeholder — online, batch, DBA, operations, business — has a seat at the table.

Discussion Questions

  1. CNB accepted a slight CICS response time degradation during the batch window. Under what circumstances would this trade-off be unacceptable? How would you design around it?

  2. The MONTHEND policy elevates BATCHSTD to importance 2. What would happen if month-end coincided with a major online banking promotion (e.g., a marketing campaign driving midnight mobile traffic)? How would you handle the conflict?

  3. CNB has four LPARs with differentiated workload profiles. Would the redesigned service definition work differently on CNBP1 (online-heavy) vs. CNBP4 (batch-heavy)? Why or why not?

  4. The analysis revealed 91% CPU utilization at midnight. If traffic continues to grow, when will WLM optimization no longer be sufficient? What leading indicators would you monitor?

  5. Rob Calloway automates the policy switch at 11:00 PM. What risks does automated policy switching introduce? What safeguards should be in place?