Case Study 2: Federal Benefits Administration — When Every Batch Job Thinks It Is Critical


The Organization

The Federal Benefits Administration (FBA) manages retirement, disability, and health benefits for 15 million federal employees and retirees. Their mainframe environment — 15 million lines of COBOL, some dating to the 1980s — processes eligibility determinations, payment calculations, enrollment changes, and regulatory reporting.

Unlike Continental National Bank, FBA's workload is overwhelmingly batch. Their CICS online presence is limited to call center applications used by 4,000 agents during business hours. The real work — the calculations that determine whether millions of people receive their benefits correctly and on time — runs in batch.

This makes WLM design at FBA fundamentally different from a bank or retailer. The trade-off is not between online and batch. The trade-off is between competing batch workloads, each with its own deadline, each with a constituency that insists it is the most important work on the system.

The Problem

Sandra Chen inherited a WLM service definition that Marcus Whitfield designed in 2016. At the time, it was adequate. The batch workload was smaller, the deadlines were less aggressive, and there was enough capacity headroom that WLM did not need to make difficult trade-offs.

By 2025, the situation had changed:

  • Batch volume had grown 40% due to new benefit programs enacted by Congress
  • Three new regulatory reporting mandates added deadline-driven workloads
  • The annual enrollment period (November–January) now generated 3x normal batch volume
  • Two LPARs had been consolidated to one as a cost-saving measure

The symptom was cascading batch failures. Every few weeks, a sequence of events would unfold:

  1. One batch job stream would run long due to higher-than-expected volume
  2. The long-running stream would delay dependent downstream jobs
  3. The delayed jobs would collide with the next processing window
  4. Multiple streams would compete for the same resources, all running slowly
  5. Operators would escalate, and a sysprog would manually boost dispatching priorities
  6. The manual boost would starve other workloads, causing secondary failures

Marcus called this the "priority escalation death spiral." Sandra recognized it as a WLM design problem.

The Analysis

Inventory of the Existing Service Definition

Sandra's first step was to document what existed. Marcus's service definition had evolved organically over nine years, with service classes added reactively whenever a new workload appeared:

SERVICE CLASS     IMP  GOAL         DESCRIPTION (from Marcus's notes)
-----------       ---  ----         -----------
BATCHELIG         2    VEL 60%      Eligibility determination
BATCHPAY          2    VEL 60%      Payment calculation
BATCHENRL         2    VEL 50%      Enrollment processing
BATCHREG1         2    VEL 50%      Regulatory reporting - OPM
BATCHREG2         2    VEL 50%      Regulatory reporting - Treasury
BATCHREG3         2    VEL 50%      Regulatory reporting - OMB
BATCHCONV         3    VEL 40%      Data conversion (modernization)
BATCHEXT          3    VEL 30%      Data extracts
BATCHTEST         4    VEL 20%      Test batch
BATCHMISC         4    DISC         Miscellaneous
CICSPROD          1    RT 0.50s     Call center CICS
STCPROD           2    VEL 60%      Production started tasks
TSOPROD           3    RT 1.0s      TSO users
DISCRTNY          5    DISC         Default

Problems Sandra identified:

  1. Six batch service classes at importance 2. When all six had work running simultaneously — which happened every night — WLM could not differentiate between them. They all competed equally, and all of them missed their velocity goals.

  2. No time-based policy switching. The same service definition ran 24/7. During business hours, when CICS call center work was active, batch at importance 2 competed with CICS at importance 1. At night, when CICS was idle, the importance-2 batch classes still competed with each other.

  3. Velocity goals that were never achievable. BATCHELIG and BATCHPAY both had velocity goals of 60%. On the consolidated LPAR, with all other workloads running, 60% velocity was physically impossible for both simultaneously. The goals were aspirational, not achievable. WLM was perpetually trying and failing to meet them, resulting in constant priority thrashing.

  4. No critical-path identification. All batch was classified by application area (eligibility, payment, enrollment) rather than by business criticality. A daily eligibility recalculation for 15 million records and a weekly data quality audit were both in BATCHELIG at the same priority.

  5. Regulatory reporting split into three classes at the same importance. There was no way to prioritize one regulatory report over another, even though OPM reporting had a hard deadline (10:00 AM) while OMB reporting was due by end of business.
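The thrashing in problem 3 follows directly from how WLM measures velocity: the percentage of sampled states in which ready-to-run work is actually using a processor or I/O path, rather than waiting. A minimal sketch of the arithmetic — the sample counts below are illustrative, not from FBA's RMF data:

```python
def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """WLM execution velocity: percentage of samples in which
    ready-to-run work was using CPU/I-O rather than delayed."""
    total = using_samples + delay_samples
    return 100.0 * using_samples / total if total else 0.0

# Illustrative: on a saturated LPAR, two equal-importance classes
# split the engine, so each spends roughly half its samples delayed.
batchelig = execution_velocity(using_samples=500, delay_samples=500)  # 50.0
batchpay  = execution_velocity(using_samples=500, delay_samples=500)  # 50.0
# Both classes sit below their 60% goal, so WLM keeps shuffling
# priorities between them without either goal ever being met.
```

The point of the sketch: when two CPU-bound classes share one engine, their velocities must sum to roughly 100%, so two 60% goals cannot both hold — exactly the situation Sandra found with BATCHELIG and BATCHPAY.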

Workload Pattern Analysis

Sandra pulled 90 days of SMF type 72 data and mapped the actual batch execution patterns:

Time Window          Active Batch Classes                  Avg CPU%
-----------          --------------------                  --------
06:00–08:00          BATCHELIG, BATCHPAY (prelim runs)     42%
08:00–17:00          BATCHEXT, BATCHCONV, BATCHTEST        55%
17:00–19:00          BATCHENRL (daily enrollment close)    62%
19:00–22:00          BATCHELIG (main elig run)             78%
22:00–01:00          BATCHPAY (main payment run)           85%
01:00–04:00          BATCHREG1/2/3, BATCHEXT               72%
04:00–06:00          BATCHCONV, BATCHMISC                  45%

Key finding: The period from 22:00 to 01:00, when the main payment calculation ran, was the most resource-constrained. The payment calculation was the single most important batch workload — it directly determined whether 15 million people got paid correctly — and it was competing with lingering eligibility jobs and early-starting regulatory reports.
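A profile like the table above can be rebuilt by bucketing interval records into named time windows. A minimal sketch, assuming the SMF type 72 data has already been exported to simple (hour, service class, CPU%) rows — the export format here is hypothetical, not the actual RMF record layout:

```python
# Hypothetical rows exported from SMF type 72 interval data:
# (interval start hour, service class, CPU% for that interval)
rows = [
    (19, "BATCHELIG", 80.0), (20, "BATCHELIG", 76.0), (21, "BATCHELIG", 78.0),
    (22, "BATCHPAY", 86.0), (23, "BATCHPAY", 84.0), (0, "BATCHPAY", 85.0),
]

# Named windows; a window crossing midnight is listed hour by hour.
windows = {"19:00-22:00": range(19, 22), "22:00-01:00": (22, 23, 0)}

def window_profile(rows, windows):
    """For each window: sorted list of active classes and average CPU%."""
    out = {}
    for name, hours in windows.items():
        hits = [(sc, cpu) for h, sc, cpu in rows if h in hours]
        classes = sorted({sc for sc, _ in hits})
        avg = sum(cpu for _, cpu in hits) / len(hits) if hits else 0.0
        out[name] = (classes, round(avg, 1))
    return out
```

Run over 90 days of intervals instead of the six sample rows, this yields exactly the window/class/CPU% table Sandra produced.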

The Marcus Factor

Marcus Whitfield, scheduled to retire in four months, was the only person who understood why the service definition was designed the way it was. Sandra scheduled a series of knowledge transfer sessions.

"I put everything at importance 2 because the business directors couldn't agree on priorities," Marcus explained. "Every time I tried to lower one group's importance, their director would call the CIO. So I put them all at 2 and let the velocity goals sort it out."

This was a governance failure, not a technical one. Marcus had the right instinct — different workloads should have different priorities — but lacked the organizational support to enforce it.

Sandra realized that redesigning the WLM service definition required a parallel effort: establishing a governance process that the business would accept.

The Redesigned Service Definition

Step 1: Business Priority Workshop

Sandra convened a workshop with business stakeholders from each program area. Instead of asking "Is your work important?" (answer: always yes), she asked: "If we can only run one thing, which one?" and then "If we can run two things, what's second?"

The result was a clear priority stack:

  1. Payment calculation — "If payments are late, we make national news."
  2. Eligibility determination — "If eligibility is wrong, payments are wrong."
  3. OPM regulatory reporting — "Hard 10:00 AM deadline, fines for late filing."
  4. Enrollment processing — "Must complete daily or call center cannot function."
  5. Treasury and OMB reporting — "End-of-business deadline, important but some flexibility."
  6. Everything else — extracts, conversion, test, miscellaneous.

Step 2: New Service Class Design

Sandra redesigned around five tiers instead of scattering everything at importance 2:

SERVICE CLASS     IMP  GOAL         DESCRIPTION
-----------       ---  ----         -----------
CICSPROD          1    RT 0.50s     Call center CICS (business hours only)
BATCHPAY          1    VEL 50%      Payment calculation (NIGHTRUN policy only)
BATCHELIG         1    VEL 50%      Eligibility determination (NIGHTRUN policy only)
BATCHREGH         2    VEL 45%      High-priority regulatory (OPM deadline)
BATCHENRL         2    VEL 40%      Enrollment processing
MQPROD            2    RT 1.0s      MQ messaging
BATCHREGL         3    VEL 35%      Lower-priority regulatory (Treasury, OMB)
BATCHCONV         3    VEL 25%      Data conversion
BATCHEXT          3    VEL 20%      Data extracts
STCPROD           2    VEL 50%      Production started tasks
TSOPROD           3    RT 1.0s      TSO users (business hours)
BATCHTEST         4    VEL 15%      Test workloads
DISCRTNY          5    DISC         Everything else

Key design decisions:

  • BATCHPAY and BATCHELIG at importance 1 — but only during the night run. During the day, they run at importance 3 (preliminary/ad-hoc runs are not critical).
  • Regulatory reporting split by deadline urgency, not by agency. OPM (hard deadline) gets importance 2. Treasury and OMB (flexible deadline) get importance 3.
  • Velocity goals set to achievable levels. Sandra ran capacity modeling to determine what velocity each service class could actually achieve when all were active. Setting achievable goals prevents WLM from perpetual priority thrashing.
  • BATCHCONV (data conversion for modernization) explicitly at importance 3. Sandra argued — and the CIO agreed — that modernization work should not compete with production batch during constrained periods.
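The case study does not detail Sandra's capacity model, but one simple sanity check follows from the velocity arithmetic: on a saturated system, a CPU-bound class's velocity cannot exceed its share of the engine, so goals must fit within the shares the classes can actually get. A rough sketch under that assumption — the relative demand weights are illustrative, not WLM's dispatch algorithm:

```python
def achievable_velocity(shares: dict[str, float]) -> dict[str, float]:
    """Rough upper bound on each class's velocity when all classes are
    CPU-bound: velocity cannot exceed the class's fraction of the engine."""
    total = sum(shares.values())
    return {sc: round(100.0 * w / total, 1) for sc, w in shares.items()}

# Illustrative relative demand during the constrained 22:00-01:00 window:
night = achievable_velocity({
    "BATCHPAY": 5.0,   # highest-priority work gets the largest share
    "BATCHELIG": 3.0,
    "BATCHREGH": 1.5,
    "BATCHREGL": 0.5,
})
# The shares sum to 100%, so no two classes can both exceed 50% --
# which is why the old pair of 60% goals could never both be met,
# and why Sandra's new goals (50/50/45/35) stay within reach.
```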

Step 3: Service Policies

Sandra defined three service policies:

BUSDAY (Business Hours, 6:00 AM – 5:00 PM):

CICSPROD   Imp 1  (call center is active)
BATCHPAY   Imp 3  (prelim runs, not critical)
BATCHELIG  Imp 3  (prelim runs, not critical)
BATCHREGH  Imp 2  (may have morning deadline)
BATCHENRL  Imp 2  (enrollment must be current for call center)
BATCHREGL  Imp 3
BATCHCONV  Imp 4  (conversion should not impact call center)
BATCHEXT   Imp 3
BATCHTEST  Imp 4

NIGHTRUN (Night Processing, 5:00 PM – 6:00 AM):

CICSPROD   Imp 2  (call center closed, minimal online)
BATCHPAY   Imp 1  (critical — payment calculation)
BATCHELIG  Imp 1  (critical — eligibility determination)
BATCHREGH  Imp 2  (must complete by morning)
BATCHENRL  Imp 2  (enrollment daily close)
BATCHREGL  Imp 3  (flexible deadline)
BATCHCONV  Imp 3  (can run at night)
BATCHEXT   Imp 3
BATCHTEST  Imp 5  (test should not run during night processing)

OPENENRL (Open Enrollment Period, November 1 – January 15):

CICSPROD   Imp 1  (call center volume triples during enrollment)
BATCHPAY   Imp 1  (in NIGHTRUN hours only — handled by time-based automation)
BATCHELIG  Imp 1  (in NIGHTRUN hours only)
BATCHREGH  Imp 2
BATCHENRL  Imp 1  (enrollment processing is critical during this period)
BATCHREGL  Imp 3
BATCHCONV  Imp 5  (conversion STOPS during open enrollment)
BATCHEXT   Imp 4  (extracts demoted to make room)
BATCHTEST  Imp 5
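On z/OS the active service policy is switched with the operator command `V WLM,POLICY=name`, typically issued by scheduled automation. A sketch of the selection logic that automation would encode, using the dates and cutover times above — simplified so that OPENENRL covers the whole enrollment period, whereas the actual policy notes say day/night distinctions within it are still handled by time-based automation:

```python
from datetime import datetime

def select_policy(now: datetime) -> str:
    """Pick the WLM service policy for a timestamp, per the
    BUSDAY / NIGHTRUN / OPENENRL definitions in this case study."""
    # Open enrollment runs November 1 through January 15.
    in_open_enrl = now.month in (11, 12) or (now.month == 1 and now.day <= 15)
    if in_open_enrl:
        return "OPENENRL"
    # Business day is 06:00-16:59; night processing covers the rest.
    return "BUSDAY" if 6 <= now.hour < 17 else "NIGHTRUN"

# Automation would then issue:  V WLM,POLICY=<selected policy>
```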

Step 4: Classification Rules

Sandra redesigned the classification rules to use a combination of job name prefix and scheduling environment:

JES Subsystem:
  Scheduling Environment: PAYCRIT  → BATCHPAY
  Job Name: PAY*                   → BATCHPAY
  Job Name: PAYRL*                 → BATCHPAY

  Scheduling Environment: ELIGCRIT → BATCHELIG
  Job Name: ELIG*                  → BATCHELIG
  Job Name: ELGRC*                 → BATCHELIG

  Scheduling Environment: REGURENT → BATCHREGH
  Job Name: REGOPM*                → BATCHREGH

  Scheduling Environment: ENRLPROC → BATCHENRL
  Job Name: ENRL*                  → BATCHENRL

  Job Name: REGTRS*                → BATCHREGL
  Job Name: REGOMB*                → BATCHREGL

  Scheduling Environment: CONVWORK → BATCHCONV
  Job Name: CONV*                  → BATCHCONV
  Job Name: MODZ*                  → BATCHCONV

  Job Name: EXT*                   → BATCHEXT
  Job Name: TST*                   → BATCHTEST
  Job Class: T                     → BATCHTEST
  Job Name: *                      → BATCHEXT  (catch-all → low priority)

Important: The catch-all rule sends unclassified work to BATCHEXT (importance 3), not to a discretionary class. Sandra's reasoning: "If someone adds a new production job without telling us, I'd rather it run slowly than not at all. Discretionary work gets starved on busy nights."
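Classification rules are evaluated in order, first match wins, with the catch-all last. A sketch of the equivalent matching logic for the job-name rules above (the scheduling-environment checks are omitted for brevity):

```python
# Ordered (job name prefix, service class) rules; first match wins.
RULES = [
    ("PAY",    "BATCHPAY"),
    ("PAYRL",  "BATCHPAY"),   # subsumed by PAY*, kept to mirror the rules
    ("ELIG",   "BATCHELIG"),
    ("ELGRC",  "BATCHELIG"),
    ("REGOPM", "BATCHREGH"),
    ("ENRL",   "BATCHENRL"),
    ("REGTRS", "BATCHREGL"),
    ("REGOMB", "BATCHREGL"),
    ("CONV",   "BATCHCONV"),
    ("MODZ",   "BATCHCONV"),
    ("EXT",    "BATCHEXT"),
    ("TST",    "BATCHTEST"),
]

def classify(jobname: str) -> str:
    """First-match classification. Unmatched work falls to BATCHEXT,
    not discretionary, so an unannounced production job still runs."""
    for prefix, svclass in RULES:
        if jobname.startswith(prefix):
            return svclass
    return "BATCHEXT"  # catch-all, per Sandra's design
```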

Step 5: Governance Process

Sandra established a monthly WLM governance meeting with:

  • Sandra Chen (chair) — technical authority for service definition changes
  • Business directors from each program area — priority input
  • Operations team — batch schedule impact analysis
  • Capacity planning — resource availability assessment
  • Ahmad Rashidi (Pinnacle Health compliance consultant on loan) — regulatory deadline verification

The governance rules:

  1. No WLM change without business justification. "My jobs are slow" is not sufficient. The requestor must explain the business impact of current performance.
  2. Priority changes require trade-off analysis. Elevating one workload means another gets fewer resources. The affected team must be consulted.
  3. Emergency changes are permitted but must be reviewed at the next meeting.
  4. All changes are tested first in the QA LPAR for at least one full batch cycle.

The Results

Sandra deployed the new service definition over a three-week phased rollout in August 2025.

Before (June 2025)

Metric                                       Value
------                                       -----
Payment calculation avg elapsed time         4.2 hours
Payment calculation missed deadline          6 times in 90 days
Eligibility determination avg elapsed time   3.8 hours
Regulatory reporting (OPM) missed 10 AM      4 times in 90 days
Priority escalation incidents                11 in 90 days
Manual sysprog intervention required         11 times in 90 days

After (November 2025)

Metric                                       Value
------                                       -----
Payment calculation avg elapsed time         3.1 hours
Payment calculation missed deadline          0 times in 90 days
Eligibility determination avg elapsed time   2.9 hours
Regulatory reporting (OPM) missed 10 AM      0 times in 90 days
Priority escalation incidents                1 in 90 days (root cause: hardware issue)
Manual sysprog intervention required         1 time in 90 days

Key outcomes:

  1. Payment calculation elapsed time dropped 26%. Elevating to importance 1 during the night window gave it the resources it needed without competition from five other importance-2 workloads.

  2. Zero missed deadlines for three months. The clear priority hierarchy ensured that when the system was under pressure, the right work got resources first.

  3. Priority escalation incidents dropped from 11 to 1. The "death spiral" stopped because WLM now made correct automatic decisions. Operators no longer needed to manually intervene.

  4. Conversion work slowed by 15%. This was the explicit trade-off Sandra negotiated. The modernization team accepted slower conversion runs in exchange for reliable production batch. During the open enrollment period, conversion work stops entirely.

  5. Marcus Whitfield retired with confidence. The governance process and clear documentation meant the service definition was no longer tribal knowledge. Marcus handed off a system that Sandra and her team could maintain and evolve.

The Open Enrollment Stress Test

The real test came in November 2025 — the first open enrollment period under the new service definition. Enrollment volume tripled, call center transactions doubled, and the batch workload included both regular processing and enrollment-specific batch.

The OPENENRL policy handled it:

  • CICS call center stayed at importance 1 (response time remained under 0.50s)
  • Enrollment batch joined payment and eligibility at importance 1 during night processing
  • Conversion work was suspended entirely (importance 5, and the team paused submissions)
  • Data extracts were deferred to weekends

The system processed 3x normal volume without a single missed deadline. Sandra called it "the most boring open enrollment in twenty years." Marcus Whitfield, reached by phone in his retirement, said: "That's the highest compliment you can give a mainframe."

Lessons Learned

Lesson 1: The WLM service definition is a political document as much as a technical one. Marcus's original design was technically sound but politically constrained. Sandra succeeded because she built organizational support for priority decisions before making technical changes.

Lesson 2: Achievable goals prevent thrashing. Setting velocity goals that the system could not physically achieve caused WLM to constantly adjust priorities without improving performance. Realistic goals based on capacity modeling produced stable, predictable behavior.

Lesson 3: Batch-heavy shops need different WLM strategies than online-heavy shops. The CNB pattern (online at importance 1, batch at importance 2–3) does not work when batch is the primary workload. FBA's pattern (multiple batch tiers at importance 1–3, online at importance 1–2 depending on time) reflects their actual business priorities.

Lesson 4: Knowledge transfer must include the "why." Marcus knew why every service class existed and what trade-offs it represented. Without the knowledge transfer sessions, Sandra would have redesigned the definition without understanding the historical context — and might have repeated mistakes that Marcus had already learned from.

Lesson 5: Governance prevents the priority escalation death spiral. When every team can independently request priority increases, the system converges on everyone at importance 1 — which is equivalent to no priority management at all. A governance process with trade-off analysis prevents this.

Discussion Questions

  1. Sandra set the catch-all classification rule to BATCHEXT (importance 3) rather than DISCRTNY (importance 5). What are the trade-offs of this decision? When would you prefer the more aggressive approach?

  2. The OPENENRL policy suspends conversion work entirely (importance 5). What if the modernization project has a hard deadline? How would you handle the conflict between production stability and project timelines?

  3. Marcus's original design put everything at importance 2 because business directors could not agree on priorities. Is there a technical design that can handle this political problem without requiring a governance process? Why or why not?

  4. Sandra's velocity goals are based on capacity modeling. What happens when capacity changes (e.g., an LPAR upgrade or consolidation)? How should velocity goals be maintained over time?

  5. Compare the FBA WLM challenge (competing batch workloads) with the CNB challenge (online vs. batch). Which is harder to manage, and why? What principles are common to both?