Case Study 2: Federal Benefits Administration — When Every Batch Job Thinks It Is Critical
The Organization
The Federal Benefits Administration (FBA) manages retirement, disability, and health benefits for 15 million federal employees and retirees. Its mainframe environment (15 million lines of COBOL, some dating to the 1980s) processes eligibility determinations, payment calculations, enrollment changes, and regulatory reporting.
Unlike Continental National Bank's workload, FBA's is overwhelmingly batch. Its CICS online presence is limited to call center applications used by 4,000 agents during business hours. The real work, the calculations that determine whether millions of people receive their benefits correctly and on time, runs in batch.
This makes WLM design at FBA fundamentally different from a bank or retailer. The trade-off is not between online and batch. The trade-off is between competing batch workloads, each with its own deadline, each with a constituency that insists it is the most important work on the system.
The Problem
Sandra Chen inherited a WLM service definition that Marcus Whitfield designed in 2016. At the time, it was adequate. The batch workload was smaller, the deadlines were less aggressive, and there was enough capacity headroom that WLM did not need to make difficult trade-offs.
By 2025, the situation had changed:
- Batch volume had grown 40% due to new benefit programs enacted by Congress
- Three new regulatory reporting mandates added deadline-driven workloads
- The annual enrollment period (November–January) now generated 3x normal batch volume
- Two LPARs had been consolidated to one as a cost-saving measure
The symptom was cascading batch failures. Every few weeks, a sequence of events would unfold:
1. One batch job stream would run long due to higher-than-expected volume
2. The long-running stream would delay dependent downstream jobs
3. The delayed jobs would collide with the next processing window
4. Multiple streams would compete for the same resources, all running slowly
5. Operators would escalate, and a sysprog would manually boost dispatching priorities
6. The manual boost would starve other workloads, causing secondary failures
Marcus called this the "priority escalation death spiral." Sandra recognized it as a WLM design problem.
The Analysis
Inventory of the Existing Service Definition
Sandra's first step was to document what existed. Marcus's service definition had evolved organically over nine years, with service classes added reactively whenever a new workload appeared:
SERVICE CLASS  IMP  GOAL      DESCRIPTION (from Marcus's notes)
-------------  ---  --------  ---------------------------------
BATCHELIG      2    VEL 60%   Eligibility determination
BATCHPAY       2    VEL 60%   Payment calculation
BATCHENRL      2    VEL 50%   Enrollment processing
BATCHREG1      2    VEL 50%   Regulatory reporting - OPM
BATCHREG2      2    VEL 50%   Regulatory reporting - Treasury
BATCHREG3      2    VEL 50%   Regulatory reporting - OMB
BATCHCONV      3    VEL 40%   Data conversion (modernization)
BATCHEXT       3    VEL 30%   Data extracts
BATCHTEST      4    VEL 20%   Test batch
BATCHMISC      4    DISC      Miscellaneous
CICSPROD       1    RT 0.50s  Call center CICS
STCPROD        2    VEL 60%   Production started tasks
TSOPROD        3    RT 1.0s   TSO users
DISCRTNY       5    DISC      Default
Problems Sandra identified:
- Six batch service classes at importance 2. When all six had work running simultaneously, which happened every night, WLM could not differentiate between them. They all competed equally, and all of them missed their velocity goals.
- No time-based policy switching. The same service definition ran 24/7. During business hours, when CICS call center work was active, batch at importance 2 competed with CICS at importance 1. At night, when CICS was idle, the importance-2 batch classes still competed with each other.
- Velocity goals that were never achievable. BATCHELIG and BATCHPAY both had velocity goals of 60%. On the consolidated LPAR, with all other workloads running, both classes could not reach 60% velocity at the same time. WLM was perpetually trying and failing to meet the goals, which produced constant priority thrashing.
- No critical-path identification. All batch was classified by application area (eligibility, payment, enrollment) rather than by business criticality. A daily eligibility recalculation for 15 million records and a weekly data quality audit were both in BATCHELIG at the same priority.
- Regulatory reporting split into three classes at the same importance. There was no way to prioritize one regulatory report over another, even though OPM reporting had a hard deadline (10:00 AM) while OMB reporting was due by end of business.
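The first finding can be checked mechanically. The following is a minimal sketch, not Sandra's actual tooling: it encodes the legacy definition from the listing above and flags importance levels that hold more batch classes than a chosen limit. The function name `crowded_tiers` and the limit of 2 are assumptions made for illustration.

```python
from collections import defaultdict

# Marcus's definition as listed above: (service class, importance, goal type)
LEGACY_DEFINITION = [
    ("BATCHELIG", 2, "VEL"), ("BATCHPAY", 2, "VEL"), ("BATCHENRL", 2, "VEL"),
    ("BATCHREG1", 2, "VEL"), ("BATCHREG2", 2, "VEL"), ("BATCHREG3", 2, "VEL"),
    ("BATCHCONV", 3, "VEL"), ("BATCHEXT", 3, "VEL"), ("BATCHTEST", 4, "VEL"),
    ("BATCHMISC", 4, "DISC"), ("CICSPROD", 1, "RT"), ("STCPROD", 2, "VEL"),
    ("TSOPROD", 3, "RT"), ("DISCRTNY", 5, "DISC"),
]


def crowded_tiers(definition, prefix="BATCH", limit=2):
    """Return {importance: [classes]} for every importance level that holds
    more than `limit` service classes whose names start with `prefix`.
    The limit is an arbitrary review threshold, not a WLM rule."""
    tiers = defaultdict(list)
    for name, importance, _goal in definition:
        if name.startswith(prefix):
            tiers[importance].append(name)
    return {imp: names for imp, names in tiers.items() if len(names) > limit}


print(crowded_tiers(LEGACY_DEFINITION))
```

Run against the legacy definition, only importance 2 is reported, with all six batch classes in it. A check like this could be rerun whenever a service class is added, so that a crowded tier is noticed before it shows up as missed goals.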
Workload Pattern Analysis
Sandra pulled 90 days of SMF type 72 data and mapped the actual batch execution patterns:
Time Window  Active Batch Classes                Avg CPU%
-----------  ----------------------------------  --------
06:00–08:00  BATCHELIG, BATCHPAY (prelim runs)   42%
08:00–17:00  BATCHEXT, BATCHCONV, BATCHTEST      55%
17:00–19:00  BATCHENRL (daily enrollment close)  62%
19:00–22:00  BATCHELIG (main elig run)           78%
22:00–01:00  BATCHPAY (main payment run)         85%
01:00–04:00  BATCHREG1/2/3, BATCHEXT             72%
04:00–06:00  BATCHCONV, BATCHMISC                45%
Key finding: The period from 22:00 to 01:00, when the main payment calculation ran, was the most resource-constrained. The payment calculation was the single most important batch workload — it directly determined whether 15 million people got paid correctly — and it was competing with lingering eligibility jobs and early-starting regulatory reports.
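The ranking behind that finding is simple enough to reproduce. The sketch below copies the window averages from the table above and sorts them by CPU pressure. The helper names are hypothetical, and "headroom" is taken here to mean 100 minus the average, which hides short-lived peaks inside a window.

```python
# (time window, average CPU %) copied from the workload pattern table above
WINDOWS = [
    ("06:00-08:00", 42), ("08:00-17:00", 55), ("17:00-19:00", 62),
    ("19:00-22:00", 78), ("22:00-01:00", 85), ("01:00-04:00", 72),
    ("04:00-06:00", 45),
]


def rank_by_pressure(windows):
    """Sort windows from most to least CPU-constrained."""
    return sorted(windows, key=lambda w: w[1], reverse=True)


def headroom(cpu_pct):
    """Average capacity left in a window (assumption: 100 minus the average)."""
    return 100 - cpu_pct


for window, cpu in rank_by_pressure(WINDOWS)[:3]:
    print(f"{window}  avg CPU {cpu}%  headroom {headroom(cpu)}%")
```

The three tightest windows are consecutive (19:00 through 04:00), which is consistent with the observation that late eligibility jobs and early regulatory reports both overlap the payment run.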
The Marcus Factor
Marcus Whitfield, scheduled to retire in four months, was the only person who understood why the service definition was designed the way it was. Sandra scheduled a series of knowledge transfer sessions.
"I put everything at importance 2 because the business directors couldn't agree on priorities," Marcus explained. "Every time I tried to lower one group's importance, their director would call the CIO. So I put them all at 2 and let the velocity goals sort it out."
This was a governance failure, not a technical one. Marcus had the right instinct — different workloads should have different priorities — but lacked the organizational support to enforce it.
Sandra realized that redesigning the WLM service definition required a parallel effort: establishing a governance process that the business would accept.
The Redesigned Service Definition
Step 1: Business Priority Workshop
Sandra convened a workshop with business stakeholders from each program area. Instead of asking "Is your work important?" (answer: always yes), she asked: "If we can only run one thing, which one?" and then "If we can run two things, what's second?"
The result was a clear priority stack:
1. Payment calculation: "If payments are late, we make national news."
2. Eligibility determination: "If eligibility is wrong, payments are wrong."
3. OPM regulatory reporting: "Hard 10:00 AM deadline, fines for late filing."
4. Enrollment processing: "Must complete daily or call center cannot function."
5. Treasury and OMB reporting: "End-of-business deadline, important but some flexibility."
6. Everything else: extracts, conversion, test, miscellaneous.
Step 2: New Service Class Design
Sandra redesigned around five tiers instead of scattering everything at importance 2:
SERVICE CLASS  IMP  GOAL      DESCRIPTION
-------------  ---  --------  -----------
CICSPROD       1    RT 0.50s  Call center CICS (business hours only)
BATCHPAY       1    VEL 50%   Payment calculation (NIGHTRUN policy only)
BATCHELIG      1    VEL 50%   Eligibility determination (NIGHTRUN policy only)
BATCHREGH      2    VEL 45%   High-priority regulatory (OPM deadline)
BATCHENRL      2    VEL 40%   Enrollment processing
MQPROD         2    RT 1.0s   MQ messaging
STCPROD        2    VEL 50%   Production started tasks
BATCHREGL      3    VEL 35%   Lower-priority regulatory (Treasury, OMB)
BATCHCONV      3    VEL 25%   Data conversion
BATCHEXT       3    VEL 20%   Data extracts
TSOPROD        3    RT 1.0s   TSO users (business hours)
BATCHTEST      4    VEL 15%   Test workloads
DISCRTNY       5    DISC      Everything else
Key design decisions:
- BATCHPAY and BATCHELIG at importance 1 — but only during the night run. During the day, they run at importance 3 (preliminary/ad-hoc runs are not critical).
- Regulatory reporting split by deadline urgency, not by agency. OPM (hard deadline) gets importance 2. Treasury and OMB (flexible deadline) get importance 3.
- Velocity goals set to achievable levels. Sandra ran capacity modeling to determine what velocity each service class could actually achieve when all were active. Achievable goals keep WLM from perpetually thrashing priorities.
- BATCHCONV (data conversion for modernization) explicitly at importance 3. Sandra argued — and the CIO agreed — that modernization work should not compete with production batch during constrained periods.
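The point about achievable velocity goals rests on how velocity is measured. Execution velocity is the share of sampled states in which work was using a resource rather than being delayed for one, and the performance index for a velocity goal is the goal divided by the achieved value. The sketch below works through that arithmetic. The sample counts are hypothetical, not FBA measurements.

```python
def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """Velocity % = using / (using + delay) * 100. Idle samples are excluded."""
    total = using_samples + delay_samples
    if total == 0:
        return 0.0  # no samples at all: report 0 rather than divide by zero
    return 100.0 * using_samples / total


def velocity_pi(goal_pct: float, achieved_pct: float) -> float:
    """Performance index for a velocity goal: goal / achieved.
    A value above 1.0 means the goal is being missed."""
    if achieved_pct == 0:
        return float("inf")
    return goal_pct / achieved_pct


# Hypothetical samples for one service class on a busy night
achieved = execution_velocity(using_samples=300, delay_samples=700)
print(achieved)                   # 30.0
print(velocity_pi(60, achieved))  # 2.0
```

A class that can only reach 30% velocity under load will report a performance index of 2.0 against a 60% goal every night, no matter how WLM shuffles resources. Lowering the goal toward what capacity modeling says is reachable moves the index toward 1.0, so that a value well above 1.0 once again signals a real problem.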
Step 3: Service Policies
Sandra defined three service policies:
BUSDAY (Business Hours, 6:00 AM – 5:00 PM):
  CICSPROD   Imp 1  (call center is active)
  BATCHPAY   Imp 3  (prelim runs, not critical)
  BATCHELIG  Imp 3  (prelim runs, not critical)
  BATCHREGH  Imp 2  (may have morning deadline)
  BATCHENRL  Imp 2  (enrollment must be current for call center)
  BATCHREGL  Imp 3
  BATCHCONV  Imp 4  (conversion should not impact call center)
  BATCHEXT   Imp 3
  BATCHTEST  Imp 4
NIGHTRUN (Night Processing, 5:00 PM – 6:00 AM):
  CICSPROD   Imp 2  (call center closed, minimal online)
  BATCHPAY   Imp 1  (critical: payment calculation)
  BATCHELIG  Imp 1  (critical: eligibility determination)
  BATCHREGH  Imp 2  (must complete by morning)
  BATCHENRL  Imp 2  (enrollment daily close)
  BATCHREGL  Imp 3  (flexible deadline)
  BATCHCONV  Imp 3  (can run at night)
  BATCHEXT   Imp 3
  BATCHTEST  Imp 5  (test should not run during night processing)
OPENENRL (Open Enrollment Period, November 1 – January 15):
  CICSPROD   Imp 1  (call center volume triples during enrollment)
  BATCHPAY   Imp 1  (in NIGHTRUN hours only; handled by time-based automation)
  BATCHELIG  Imp 1  (in NIGHTRUN hours only)
  BATCHREGH  Imp 2
  BATCHENRL  Imp 1  (enrollment processing is critical during this period)
  BATCHREGL  Imp 3
  BATCHCONV  Imp 5  (conversion STOPS during open enrollment)
  BATCHEXT   Imp 4  (extracts demoted to make room)
  BATCHTEST  Imp 5
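Policies only help if the right one is active at the right time. The sketch below shows one way an automation product might choose a policy from the clock and calendar and build the operator command to activate it. The function names are hypothetical. The sketch also makes two assumptions the source does not settle: weekends follow the same hours as weekdays, and OPENENRL stays active around the clock during the enrollment period (the source mentions further time-based automation within that period without describing it).

```python
from datetime import date, datetime, time


def in_open_enrollment(d: date) -> bool:
    """November 1 through January 15, wrapping across the year end."""
    return d.month in (11, 12) or (d.month == 1 and d.day <= 15)


def select_policy(now: datetime) -> str:
    """Pick the service policy name for a given moment."""
    if in_open_enrollment(now.date()):
        return "OPENENRL"
    if time(6, 0) <= now.time() < time(17, 0):
        return "BUSDAY"    # 6:00 AM up to, but not including, 5:00 PM
    return "NIGHTRUN"      # 5:00 PM through 5:59 AM


def activation_command(policy: str) -> str:
    """Operator command that activates a named WLM service policy."""
    return f"VARY WLM,POLICY={policy}"


print(activation_command(select_policy(datetime(2025, 9, 3, 23, 30))))
```

For 11:30 PM on an ordinary September night this prints `VARY WLM,POLICY=NIGHTRUN`. Keeping the selection logic in one small function makes the switch times easy to review in the governance meeting.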
Step 4: Classification Rules
Sandra redesigned the classification rules to use a combination of job name prefix and scheduling environment:
JES Subsystem:
  Scheduling Environment: PAYCRIT   → BATCHPAY
  Job Name:               PAY*      → BATCHPAY
  Job Name:               PAYRL*    → BATCHPAY
  Scheduling Environment: ELIGCRIT  → BATCHELIG
  Job Name:               ELIG*     → BATCHELIG
  Job Name:               ELGRC*    → BATCHELIG
  Scheduling Environment: REGURENT  → BATCHREGH
  Job Name:               REGOPM*   → BATCHREGH
  Scheduling Environment: ENRLPROC  → BATCHENRL
  Job Name:               ENRL*     → BATCHENRL
  Job Name:               REGTRS*   → BATCHREGL
  Job Name:               REGOMB*   → BATCHREGL
  Scheduling Environment: CONVWORK  → BATCHCONV
  Job Name:               CONV*     → BATCHCONV
  Job Name:               MODZ*     → BATCHCONV
  Job Name:               EXT*      → BATCHEXT
  Job Name:               TST*      → BATCHTEST
  Job Class:              T         → BATCHTEST
  Job Name:               *         → BATCHEXT (catch-all → low priority)
Important: The catch-all rule sends unclassified work to BATCHEXT (importance 3), not to a discretionary class. Sandra's reasoning: "If someone adds a new production job without telling us, I'd rather it run slowly than not at all. Discretionary work gets starved on busy nights."
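To see how a rule list like this behaves, it can help to model it outside WLM. The sketch below is an approximation under stated assumptions: rules are evaluated top to bottom with the first match winning, and Python's `fnmatchcase` stands in for WLM's own wildcard matching. The qualifier labels (`SCHENV`, `JOBNAME`, `JOBCLASS`) and function names are illustrative, not WLM syntax.

```python
from fnmatch import fnmatchcase

# (qualifier, pattern, service class), in the order listed above
RULES = [
    ("SCHENV", "PAYCRIT", "BATCHPAY"), ("JOBNAME", "PAY*", "BATCHPAY"),
    ("JOBNAME", "PAYRL*", "BATCHPAY"), ("SCHENV", "ELIGCRIT", "BATCHELIG"),
    ("JOBNAME", "ELIG*", "BATCHELIG"), ("JOBNAME", "ELGRC*", "BATCHELIG"),
    ("SCHENV", "REGURENT", "BATCHREGH"), ("JOBNAME", "REGOPM*", "BATCHREGH"),
    ("SCHENV", "ENRLPROC", "BATCHENRL"), ("JOBNAME", "ENRL*", "BATCHENRL"),
    ("JOBNAME", "REGTRS*", "BATCHREGL"), ("JOBNAME", "REGOMB*", "BATCHREGL"),
    ("SCHENV", "CONVWORK", "BATCHCONV"), ("JOBNAME", "CONV*", "BATCHCONV"),
    ("JOBNAME", "MODZ*", "BATCHCONV"), ("JOBNAME", "EXT*", "BATCHEXT"),
    ("JOBNAME", "TST*", "BATCHTEST"), ("JOBCLASS", "T", "BATCHTEST"),
    ("JOBNAME", "*", "BATCHEXT"),  # catch-all
]


def classify(job_name, sched_env=None, job_class=None):
    """Return the service class of the first rule that matches the job."""
    values = {"SCHENV": sched_env, "JOBNAME": job_name, "JOBCLASS": job_class}
    for qualifier, pattern, service_class in RULES:
        value = values[qualifier]
        if value is not None and fnmatchcase(value, pattern):
            return service_class
    return "DISCRTNY"  # not reached while the "*" catch-all is present


def shadowed_job_rules(rules):
    """Job-name rules that can never fire because an earlier, broader
    prefix rule matches first. Only handles simple trailing-* patterns."""
    seen, shadowed = [], []
    for qualifier, pattern, _cls in rules:
        if qualifier != "JOBNAME":
            continue
        if any(fnmatchcase(pattern.rstrip("*"), earlier) for earlier in seen):
            shadowed.append(pattern)
        seen.append(pattern)
    return shadowed


print(classify("NEWJOB01"))        # BATCHEXT
print(shadowed_job_rules(RULES))   # ['PAYRL*']
```

Under these assumptions, an unknown job such as `NEWJOB01` lands in BATCHEXT, as Sandra intended. The model also shows that `PAYRL*` is shadowed by the earlier `PAY*` rule. That is harmless here because both map to BATCHPAY, but the same ordering would matter if the two rules ever pointed at different service classes.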
Step 5: Governance Process
Sandra established a monthly WLM governance meeting with:
- Sandra Chen (chair) — technical authority for service definition changes
- Business directors from each program area — priority input
- Operations team — batch schedule impact analysis
- Capacity planning — resource availability assessment
- Ahmad Rashidi (Pinnacle Health compliance consultant on loan) — regulatory deadline verification
The governance rules:
- No WLM change without business justification. "My jobs are slow" is not sufficient. The requestor must explain the business impact of current performance.
- Priority changes require trade-off analysis. Elevating one workload means another gets fewer resources. The affected team must be consulted.
- Emergency changes are permitted but must be reviewed at the next meeting.
- All changes are tested first in the QA LPAR for at least one full batch cycle.
The Results
Sandra deployed the new service definition over a three-week phased rollout in August 2025.
Before (June 2025)
| Metric | Value |
|---|---|
| Payment calculation avg elapsed time | 4.2 hours |
| Payment calculation missed deadline | 6 times in 90 days |
| Eligibility determination avg elapsed time | 3.8 hours |
| Regulatory reporting (OPM) missed 10 AM | 4 times in 90 days |
| Priority escalation incidents | 11 in 90 days |
| Manual sysprog intervention required | 11 times in 90 days |
After (November 2025)
| Metric | Value |
|---|---|
| Payment calculation avg elapsed time | 3.1 hours |
| Payment calculation missed deadline | 0 times in 90 days |
| Eligibility determination avg elapsed time | 2.9 hours |
| Regulatory reporting (OPM) missed 10 AM | 0 times in 90 days |
| Priority escalation incidents | 1 in 90 days (root cause: hardware issue) |
| Manual sysprog intervention required | 1 time in 90 days |
Key outcomes:
- Payment calculation elapsed time dropped 26%. Elevating to importance 1 during the night window gave it the resources it needed without competition from five other importance-2 workloads.
- Zero missed deadlines for three months. The clear priority hierarchy ensured that when the system was under pressure, the right work got resources first.
- Priority escalation incidents dropped from 11 to 1. The "death spiral" stopped because WLM now made correct automatic decisions. Operators no longer needed to manually intervene.
- Conversion work slowed by 15%. This was the explicit trade-off Sandra negotiated. The modernization team accepted slower conversion runs in exchange for reliable production batch. During the open enrollment period, conversion work stops entirely.
- Marcus Whitfield retired with confidence. The governance process and clear documentation meant the service definition was no longer tribal knowledge. Marcus handed off a system that Sandra and her team could maintain and evolve.
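The percentage figures follow directly from the before and after tables. As a quick check, a sketch using the reported values (the helper name `pct_reduction` is hypothetical):

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


print(round(pct_reduction(4.2, 3.1)))  # payment elapsed hours: 26
print(round(pct_reduction(3.8, 2.9)))  # eligibility elapsed hours: 24
print(round(pct_reduction(11, 1)))     # escalation incidents: 91
```

The payment figure matches the 26% quoted above. By the same arithmetic, eligibility determination improved by roughly 24%, and escalation incidents fell by about 91%.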
The Open Enrollment Stress Test
The real test came in November 2025 — the first open enrollment period under the new service definition. Enrollment volume tripled, call center transactions doubled, and the batch workload included both regular processing and enrollment-specific batch.
The OPENENRL policy handled it:
- CICS call center stayed at importance 1 (response time remained under 0.50s)
- Enrollment batch joined payment and eligibility at importance 1 during night processing
- Conversion work was suspended entirely (importance 5, and the team paused submissions)
- Data extracts were deferred to weekends
The system processed 3x normal volume without a single missed deadline. Sandra called it "the most boring open enrollment in twenty years." Marcus Whitfield, reached by phone in his retirement, said: "That's the highest compliment you can give a mainframe."
Lessons Learned
Lesson 1: The WLM service definition is a political document as much as a technical one. Marcus's original design was technically sound but politically constrained. Sandra succeeded because she built organizational support for priority decisions before making technical changes.
Lesson 2: Achievable goals prevent thrashing. Setting velocity goals that the system could not physically achieve caused WLM to constantly adjust priorities without improving performance. Realistic goals based on capacity modeling produced stable, predictable behavior.
Lesson 3: Batch-heavy shops need different WLM strategies than online-heavy shops. The CNB pattern (online at importance 1, batch at importance 2-3) does not work when batch is the primary workload. FBA's pattern (multiple batch tiers at importance 1-3, online at importance 1-2 depending on time) reflects their actual business priorities.
Lesson 4: Knowledge transfer must include the "why." Marcus knew why every service class existed and what trade-offs it represented. Without the knowledge transfer sessions, Sandra would have redesigned the definition without understanding the historical context — and might have repeated mistakes that Marcus had already learned from.
Lesson 5: Governance prevents the priority escalation death spiral. When every team can independently request priority increases, the system converges on everyone at importance 1 — which is equivalent to no priority management at all. A governance process with trade-off analysis prevents this.
Discussion Questions
- Sandra set the catch-all classification rule to BATCHEXT (importance 3) rather than DISCRTNY (importance 5). What are the trade-offs of this decision? When would you prefer the more aggressive approach?
- The OPENENRL policy suspends conversion work entirely (importance 5). What if the modernization project has a hard deadline? How would you handle the conflict between production stability and project timelines?
- Marcus's original design put everything at importance 2 because business directors could not agree on priorities. Is there a technical design that can handle this political problem without requiring a governance process? Why or why not?
- Sandra's velocity goals are based on capacity modeling. What happens when capacity changes (e.g., an LPAR upgrade or consolidation)? How should velocity goals be maintained over time?
- Compare the FBA WLM challenge (competing batch workloads) with the CNB challenge (online vs. batch). Which is harder to manage, and why? What principles are common to both?