> "You can't tune your way out of a bad architecture. But once the architecture is right, tuning separates the good shops from the great ones."
In This Chapter
- I/O Optimization, Buffer Tuning, SORT Optimization, and DFSORT Tricks
- 26.1 The Performance Mindset: Measure First, Then Optimize
- 26.2 I/O Optimization: Buffers, BLKSIZE, Access Methods, and VSAM Tuning
- 26.3 DFSORT Mastery: SORT Optimization, INCLUDE/OMIT, OUTREC, and ICETOOL Operations
- 26.4 COBOL Compiler Optimization: OPT Levels, FASTSRT, and Generated Code Analysis
- 26.5 DB2 Batch Performance: Commit Frequency, Prefetch, and Parallelism
- 26.6 Performance Analysis with SMF and RMF
- 26.7 Advanced Techniques: Hiperbatch, Data-in-Memory, and zIIP Offload
- Chapter Summary
Chapter 26: Batch Performance at Scale
I/O Optimization, Buffer Tuning, SORT Optimization, and DFSORT Tricks
"You can't tune your way out of a bad architecture. But once the architecture is right, tuning separates the good shops from the great ones."
26.1 The Performance Mindset: Measure First, Then Optimize
Rob Calloway keeps a sign above his desk at Continental National Bank: "Where did the time go?" It's not existential. It's operational. Every batch job that runs in the overnight window consumes clock time — and clock time is the one resource you cannot buy more of between 11:00 PM and 6:00 AM.
In Chapter 23, we established that the batch window is a scheduling problem, not a performance problem. That remains true. But within that framework, individual job performance determines where the critical path falls and how much margin you have when things go sideways. This chapter is about making every minute count.
Here's the rule that separates practitioners from amateurs: measure before you touch anything.
I've watched junior systems programmers spend three days rewriting a COBOL batch program for "performance" — restructuring loops, eliminating PERFORM THRU, hand-optimizing paragraph flow — only to discover the program was 85% I/O-bound and the code changes reduced elapsed time by eleven seconds on a job that ran for two hours. They optimized the 15% while ignoring the 85%.
The Batch Performance Decomposition
Every batch program's elapsed time decomposes into four components:
Elapsed Time = CPU Time + I/O Wait + DB2 Wait + Other Wait
Where:
CPU Time = Processor time consumed by your COBOL code
I/O Wait = Time waiting for DASD, tape, or channel operations
DB2 Wait = Time waiting for DB2 (locks, prefetch, thread switch)
Other Wait = ENQ contention, paging, WLM delays, operator replies
The relative proportions determine your optimization strategy:
Profile CPU% I/O% DB2% Other% Strategy
─────────────────────────────────────────────────────────────
I/O-bound batch 10-20 60-80 0-10 5-10 Buffer tuning, BLKSIZE, access method
DB2-bound batch 10-20 10-20 50-70 5-10 SQL tuning, commit frequency, prefetch
CPU-bound batch 50-70 15-25 0-10 5-10 Compiler options, algorithm changes
Contention-bound 10-20 10-20 10-20 40-60 ENQ analysis, scheduling changes
⚠️ CRITICAL: If you don't know which profile your program fits, you are guessing. Guessing costs money and wastes the time of people who could be solving real problems.
How to Get the Numbers
At CNB, Rob Calloway requires every batch job on the critical path to have a performance profile. Here's how his team builds them:
SMF Type 30 records provide the definitive accounting data for every job step:
Key fields from SMF Type 30 (subtype 4, step-level):
SMF30CPT — CPU time (TCB + SRB)
SMF30SIO — EXCP count (total I/O operations)
SMF30AET — Elapsed time
SMF30_DB2_CLASS2 — DB2 elapsed time (Class 2 accounting)
SMF30TEP — Total I/O connect time
SMF30TIS — Total I/O disconnect time (pending I/O)
From these fields, the decomposition is straightforward:
CPU% = SMF30CPT / SMF30AET × 100
I/O% = (SMF30TEP + SMF30TIS) / SMF30AET × 100
DB2% = SMF30_DB2_CLASS2 / SMF30AET × 100
Other% = 100 - CPU% - I/O% - DB2%
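The arithmetic is simple enough to script once the field values are in hand. Here's a back-of-the-envelope sketch in Python — the function and its sample numbers are illustrative (SMF extraction itself is site-specific and not shown), with inputs mirroring the fields listed above:

```python
def decompose(cpu, connect, disconnect, db2, elapsed):
    """Decompose step elapsed time into the four profile components.

    All inputs are in the same unit (e.g. seconds), taken from SMF
    Type 30 subtype 4: cpu=SMF30CPT, connect=SMF30TEP,
    disconnect=SMF30TIS, db2=DB2 Class 2 elapsed, elapsed=SMF30AET.
    """
    cpu_pct = cpu / elapsed * 100
    io_pct = (connect + disconnect) / elapsed * 100
    db2_pct = db2 / elapsed * 100
    other_pct = 100 - cpu_pct - io_pct - db2_pct
    return cpu_pct, io_pct, db2_pct, other_pct

# Roughly the CNBEOD-POST profile: 43-minute (2,580 s) elapsed time
profile = decompose(cpu=310, connect=1500, disconnect=254,
                    db2=361, elapsed=2580)
```

Rounded, that sample yields 12 / 68 / 14 / 6 — the I/O-bound profile from the baseline table.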
💡 KEY INSIGHT: The "Other" category is often the most revealing. A 30% "Other" means something besides your program's actual work is consuming nearly a third of the elapsed time. Common culprits: dataset ENQ contention from concurrent jobs, WLM delays from over-committed LPARs, GRS (Global Resource Serialization) ring delays in sysplex environments, and operator reply waits on GDG catalog management.
🔄 SPACED REVIEW — Chapter 4 (Dataset Management): You learned about BLKSIZE and physical record format in Chapter 4. This chapter builds on that foundation — the BLKSIZE you chose back then determines the I/O efficiency we optimize here. If your BLKSIZE choices in Chapter 4 were careless, every batch job that reads those datasets pays the penalty every night.
The CNB Performance Baseline
When Kwame Mensah initiated the batch performance project after the Q4 crisis (Chapter 23), his first directive was: "No one touches a program until we have a performance profile for every critical-path job."
Lisa Tran built the performance baseline over a two-week window, collecting SMF data from 23 consecutive overnight runs. Here's what they found for the top 10 critical-path jobs:
Job Name Elapsed CPU I/O DB2 Other Profile
─────────────────────────────────────────────────────────────
CNBEOD-POST 43 min 12% 68% 14% 6% I/O-bound
CNBEOD-BAL 18 min 22% 8% 62% 8% DB2-bound
CNBEOD-INTST 23 min 15% 11% 67% 7% DB2-bound
CNBEOD-GL03 12 min 9% 71% 12% 8% I/O-bound
CNBEOD-STMT 60 min 35% 52% 3% 10% Mixed I/O+CPU
CNBEOD-SORT 5 min 8% 88% 0% 4% I/O-bound
CNBEOD-VALID 38 min 41% 32% 21% 6% CPU-bound
CNBEOD-REG 19 min 18% 62% 12% 8% I/O-bound
CNBEOD-ACH 5 min 14% 56% 22% 8% I/O-bound
CNBEOD-RECON 15 min 25% 18% 48% 9% DB2-bound
This baseline was the single most important artifact in the entire performance project. It told them exactly where to invest effort:
- I/O tuning would benefit CNBEOD-POST, CNBEOD-GL03, CNBEOD-SORT, CNBEOD-REG, CNBEOD-ACH (five jobs, 84 minutes of critical path)
- DB2 tuning would benefit CNBEOD-BAL, CNBEOD-INTST, CNBEOD-RECON (three jobs, 56 minutes)
- COBOL optimization would benefit CNBEOD-VALID and CNBEOD-STMT (two jobs, 98 minutes — but only the CPU portion)
Without this baseline, they'd have been shooting in the dark.
The Performance Optimization Priority Stack
After 25 years of tuning batch programs, here's my priority stack. Start at the top. Don't move to the next level until you've exhausted the current one:
Priority 1: Eliminate unnecessary work (remove jobs, skip steps, filter early)
Priority 2: Reduce I/O operations (buffer tuning, BLKSIZE, access method)
Priority 3: Optimize SORT (DFSORT tricks, FASTSRT, eliminate unnecessary sorts)
Priority 4: Tune DB2 access (SQL, commit frequency, prefetch, parallelism)
Priority 5: Optimize COBOL code (compiler options, algorithm changes)
Priority 6: Hardware/configuration (zIIP offload, Hiperbatch, data-in-memory)
Priority 1 is almost always overlooked. The fastest I/O is the one you never issue. The fastest job is the one you don't run. Before you tune anything, ask: "Does this step actually need to run tonight?"
At Pinnacle Health Insurance, Diane Okoye discovered that 14% of their batch processing time went to producing reports that nobody had read in two years. Fourteen percent. The "recipient" had transferred to a different department in 2022, and the distribution list was never updated. Removing those seven jobs didn't just save the CPU cycles — it freed three initiators during the peak concurrency window and reduced the critical path by 22 minutes.
26.2 I/O Optimization: Buffers, BLKSIZE, Access Methods, and VSAM Tuning
I/O is the dominant cost component for most COBOL batch programs. If your performance profile shows 50% or more I/O wait time, this section will deliver the largest gains.
The True Cost of Bad I/O Configuration
Before we dive into the mechanics, let me quantify what bad I/O configuration actually costs. At SecureFirst Retail Bank, Yuki Nakamura ran an audit of the batch I/O configuration as part of her DevOps modernization initiative. She found that 73% of batch datasets were using BLKSIZE values inherited from JCL templates written in the 1990s — typically BLKSIZE=4096 or the single-record BLKSIZE that some older JCL generators produced.
Carlos Vega, the mobile API architect who was learning mainframe fundamentals, was stunned: "You're telling me that the batch window runs for four hours, and a third of that time is wasted on I/O operations that could be eliminated by changing a JCL parameter?"
Yuki's answer: "Yes. And it's not even a hard change. It's just that nobody ever looked."
This is a pattern I've seen at every shop I've consulted with over 25 years. I/O configuration is set when a dataset is first created, and then it's never revisited. The programmer who created the JCL in 2003 used whatever BLKSIZE was in the shop's JCL template, and nobody questioned it because the job ran successfully. Twenty years later, the job is still running successfully — just three times slower than it needs to be.
The financial impact is real. Every unnecessary EXCP consumes:
- Channel bandwidth that could serve other I/O
- Storage controller CPU cycles for cache management
- FICON port capacity
- Mainframe CPU cycles for access method overhead and I/O interrupt processing
At CNB's scale — 48.2 million EXCP per night in the pre-optimization baseline — the aggregate cost of suboptimal I/O configuration was approximately 45 minutes of batch window time and 1,200 MSU per month. That's real money: at typical LPAR pricing, 1,200 MSU translates to $120,000-$200,000 annually in software licensing costs, paid every year for two decades because nobody examined the JCL.
Understanding the I/O Path
When your COBOL program executes a READ or WRITE statement, here's the path that I/O request travels:
COBOL READ → Access Method (QSAM/BSAM/VSAM) → I/O Supervisor → Channel Program
→ FICON Channel → Storage Controller → DASD Cache → Physical Disk (maybe)
Each hop adds latency:
Access method overhead: 0.01–0.05 ms
Channel program setup: 0.01–0.02 ms
FICON transfer: 0.02–0.10 ms (depends on block size)
Cache hit: 0.05–0.20 ms
Cache miss (physical I/O): 2.0–5.0 ms
The key insight: a cache hit is 10–100x faster than a cache miss. Modern DS8900F controllers have 256 GB or more of cache, and sequential workloads achieve cache hit ratios of 95%+ when properly configured. But "properly configured" means getting the buffer and block size right.
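To see why the hit ratio dominates, work the expected value using mid-range numbers from the latency list above (the specific figures in this sketch are my illustrative picks, not measurements):

```python
def avg_io_ms(hit_ratio, hit_ms=0.1, miss_ms=3.5, overhead_ms=0.05):
    """Expected per-request latency: fixed access-method and channel
    overhead, plus cache-hit or physical-disk service time."""
    return overhead_ms + hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms

well_tuned = avg_io_ms(0.95)    # ~0.32 ms per I/O
poorly_tuned = avg_io_ms(0.80)  # ~0.83 ms per I/O
```

Dropping from a 95% to an 80% hit ratio roughly 2.5x's the average I/O time — which is why buffer and block size choices that preserve sequential cache behavior matter so much.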
Buffer Management: BUFNO, BUFNI, BUFND
Buffers are the memory areas where the access method stages data between your program and physical I/O. More buffers mean the access method can prefetch more data ahead of your program, reducing I/O wait.
For QSAM (sequential files):
//TRANFILE DD DSN=CNB.EOD.TRANS,DISP=SHR,
// BUFNO=30
BUFNO specifies the number of buffers. The default is 5. For a large sequential file read in batch, 5 buffers is criminal negligence.
Here's what happens with different BUFNO values on a 50-million-record file with BLKSIZE=27998:
BUFNO EXCP Count Elapsed Channel Commands CPU Overhead
─────────────────────────────────────────────────────────────────
5 1,786,000 42 min 1,786,000 Baseline
10 893,000 31 min 893,000 +2%
20 446,500 24 min 446,500 +4%
30 298,334 21 min 298,334 +6%
50 178,600 19 min 178,600 +10%
The relationship isn't linear — there are diminishing returns past about 20-30 buffers. The sweet spot depends on the channel bandwidth, controller cache characteristics, and whether other jobs are competing for the same volumes.
💡 KEY INSIGHT: BUFNO isn't free. Each buffer consumes memory equal to one BLKSIZE. For BUFNO=30 and BLKSIZE=27998, that's 30 × 27,998 = 839,940 bytes — under a megabyte. For a batch job processing billions of dollars in transactions, a megabyte of buffer memory is the cheapest performance improvement you'll ever make.
For VSAM files:
VSAM uses two separate buffer parameters:
//ACCTMSTR DD DSN=CNB.VSAM.ACCOUNTS,DISP=SHR,
// AMP=('BUFND=30,BUFNI=10')
- BUFND — number of data buffers (default: 2). Controls how much data VSAM prefetches.
- BUFNI — number of index buffers (default: 1). Controls how much of the KSDS index stays cached in memory.
For VSAM KSDS files, the index buffers are often more important than data buffers. A VSAM KSDS with a three-level index requires three I/O operations just to locate a record — one for each index level. If you keep the entire index in memory (BUFNI sufficient to hold all index records), those three I/Os become zero.
VSAM Cluster: CNB.VSAM.ACCOUNTS
Records: 5,000,000
CI Size (Data): 4,096
CI Size (Index): 2,048
Index levels: 3
Index records: 12,400
BUFNI Settings and Effect:
BUFNI=1 → 3 index I/Os per random read (default — terrible for batch)
BUFNI=10 → 2 index I/Os per random read (top index level cached)
BUFNI=200 → 1 index I/O per random read (top two levels cached)
BUFNI=12400 → 0 index I/Os per random read (full index in memory)
Memory cost of BUFNI=12400: 12,400 × 2,048 = 25.4 MB
Savings: 3 × 5,000,000 = 15,000,000 EXCP eliminated per batch run
At CNB, Lisa Tran mandated full index buffering for all VSAM KSDS clusters accessed during batch. The 25 MB memory cost for the accounts master file eliminated 15 million EXCP per night. Rob Calloway called it "the best twenty-five megabytes we've ever spent."
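The sizing decision is pure arithmetic, so it's worth scripting for every KSDS you tune. A minimal sketch (the helper is mine, fed with the CNB.VSAM.ACCOUNTS numbers from the example above):

```python
def full_index_buffering(index_records, index_ci_size,
                         records, index_levels):
    """Memory cost of caching the entire KSDS index, and the EXCP
    saved per run versus the BUFNI=1 default, which pays
    index_levels index I/Os on every random read."""
    memory_bytes = index_records * index_ci_size
    excp_saved = index_levels * records
    return memory_bytes, excp_saved

# CNB.VSAM.ACCOUNTS: 12,400 index CIs of 2,048 bytes,
# 5M random reads against a 3-level index
mem, saved = full_index_buffering(12_400, 2_048, 5_000_000, 3)
```

That's about 25 MB of buffer memory trading for 15 million EXCP per night — the trade Lisa Tran made below.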
BLKSIZE Optimization and Half-Track Blocking
BLKSIZE determines how many logical records fit in a single physical block. Larger blocks mean fewer I/O operations to read the same amount of data.
🔄 SPACED REVIEW — Chapter 4 (Dataset Management): Chapter 4 introduced BLKSIZE calculation and the System-Determined Blocksize (SDB) feature. Here we focus on the performance implications and the critical concept of half-track blocking.
A 3390 DASD track holds 56,664 bytes. The optimal BLKSIZE is one that wastes the least space on the track. The magic numbers:
BLKSIZE Blocks/Track Bytes Used Track Utilization
───────────────────────────────────────────────────────────────────
27998 2 55,996 98.8% ← OPTIMAL
23476 2 46,952 82.9%
15476 3 46,428 81.9%
13300 4 53,200 93.9%
9076 6 54,456 96.1%
6233 8 49,864 88.0%
4096 12 49,152 86.7%
Half-track: 27998 = optimal for 2 blocks per track
Full-track: 56,664 (only for BSAM with NTM or track-overflow)
Half-track blocking (BLKSIZE=27998) is the gold standard for sequential batch files on 3390. Two blocks per track. 98.8% track utilization. It's the number every mainframe performance engineer has memorized.
⚠️ COMMON PITFALL: System-Determined Blocksize (SDB) doesn't always choose half-track blocking. SDB picks the largest BLKSIZE that fits in a half-track given the LRECL, but for variable-length records, it may select a suboptimal size. Always verify. For critical batch datasets, specify BLKSIZE explicitly:
//TRANFILE DD DSN=CNB.EOD.TRANS,
// DISP=(NEW,CATLG),
// SPACE=(CYL,(500,50)),
// DCB=(RECFM=FB,LRECL=200,BLKSIZE=27800)
For LRECL=200 with RECFM=FB, BLKSIZE=27800 packs 139 records per block — the maximum that fits in a half-track. BLKSIZE=27998 would leave a 198-byte fragment that can't hold another record, so 27800 is actually optimal here.
The formula: Optimal BLKSIZE = FLOOR(27998 / LRECL) × LRECL
LRECL Optimal BLKSIZE Records/Block Track Util
──────────────────────────────────────────────────────
80 27920 349 98.5%
100 27900 279 98.5%
150 27900 186 98.5%
200 27800 139 98.1%
250 27750 111 98.0%
500 27500 55 97.1%
1000 27000 27 95.3%
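The formula is worth encoding so nobody computes it by hand (or worse, guesses). A small sketch of the calculation above — the helper names are mine:

```python
HALF_TRACK = 27_998  # largest block usable at 2 blocks per 3390 track
TRACK = 56_664       # 3390 track capacity in bytes

def optimal_blksize(lrecl):
    """Largest RECFM=FB BLKSIZE that fits a half-track and is an
    exact multiple of LRECL: FLOOR(27998 / LRECL) x LRECL."""
    return (HALF_TRACK // lrecl) * lrecl

def track_utilization(blksize):
    """Track utilization percentage at 2 blocks per track."""
    return 2 * blksize / TRACK * 100

best = optimal_blksize(200)  # 27800, matching the table above
```

Running it across the LRECL values in the table reproduces each BLKSIZE shown.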
BSAM vs. QSAM: When the Access Method Matters
Most COBOL programs use QSAM (Queued Sequential Access Method) by default — it's what you get with standard READ/WRITE statements. QSAM handles buffering automatically, uses anticipatory read-ahead, and requires no application awareness of block boundaries.
BSAM (Basic Sequential Access Method) gives you direct control over I/O operations. You manage blocks, issue READ/WRITE at the block level, and control the NCP (Number of Channel Programs) parameter.
For standard COBOL batch? QSAM is almost always the right choice. BSAM matters in two scenarios:
- When you need to overlap I/O with processing explicitly — BSAM with the CHECK macro lets you issue multiple concurrent I/Os. But QSAM's anticipatory buffering achieves similar overlap without code complexity.
- When you're writing a sort exit or utility — these bypass COBOL's file handling entirely.
The practical difference in batch:
Access Method Programmer Effort I/O Efficiency Typical Use
───────────────────────────────────────────────────────────────────
QSAM Low (standard I/O) Good (95%+ with Most batch programs
proper BUFNO)
BSAM High (block mgmt) Excellent (99%) Sort exits, utilities
VSAM Sequential Very low Good (with buffers) VSAM ESDS sequential
VSAM Random Low Depends on BUFNI KSDS random access
VSAM Batch Tuning: LSR and NSR
For VSAM files in batch, the buffering decision is between NSR (Non-Shared Resources) and LSR (Local Shared Resources).
NSR (the default) gives each open VSAM file its own dedicated buffer pool. Simple, predictable, but wasteful when multiple programs or steps access overlapping VSAM clusters.
LSR creates a shared buffer pool that multiple VSAM opens share:
//STEPLIB DD DSN=CNB.LOADLIB,DISP=SHR
//*
//* Batch LSR via the BLSR subsystem: each program DD points BLSR
//* at an alternate DD holding the real cluster, and all three
//* opens share buffer pool 1
//ACCTMSTR DD SUBSYS=(BLSR,'DDNAME=ACCTMSTX','SHRPOOL=1')
//ACCTMSTX DD DSN=CNB.VSAM.ACCOUNTS,DISP=SHR
//ACCTIDX DD SUBSYS=(BLSR,'DDNAME=ACCTIDXX','SHRPOOL=1')
//ACCTIDXX DD DSN=CNB.VSAM.ACCTINDEX,DISP=SHR
//CUSTMSTR DD SUBSYS=(BLSR,'DDNAME=CUSTMSTX','SHRPOOL=1')
//CUSTMSTX DD DSN=CNB.VSAM.CUSTOMERS,DISP=SHR
LSR is particularly effective when:
- Multiple VSAM files are accessed in the same job step
- Access patterns create locality of reference (same records accessed repeatedly)
- Memory is available for a large shared pool
In CICS, LSR is the standard. In batch, NSR is more common but LSR should be evaluated for high-volume VSAM batch programs that access multiple clusters.
At Pinnacle Health Insurance, Diane Okoye's claims processing batch used NSR for 12 VSAM clusters. Switching to LSR with a 64 MB shared pool reduced VSAM EXCP count by 34% and cut the batch step from 28 minutes to 19 minutes. The claims records are heavily clustered by provider ID, and LSR's shared pool captured the locality that NSR's per-file pools couldn't exploit.
Practical I/O Tuning Checklist
Here's the checklist Rob Calloway uses at CNB for every critical-path batch job:
□ BLKSIZE verified as half-track optimal for the LRECL
□ BUFNO set to at least 20 for large sequential files
□ BUFNI set to cache full VSAM index (or at least top 2 levels)
□ BUFND set to at least 10 for VSAM data components
□ No DISP=OLD conflicts with concurrent jobs (check ENQ report)
□ SMS data class assigns to high-performance storage pool
□ Sequential datasets on volumes with >200 MB/sec throughput
□ EXCP count baselined and tracked trend-over-trend
□ Cache hit ratio >95% for sequential access (check RMF)
□ No channel contention (check RMF channel utilization)
26.3 DFSORT Mastery: SORT Optimization, INCLUDE/OMIT, OUTREC, and ICETOOL Operations
DFSORT is the most powerful utility on z/OS that most COBOL programmers barely use. I've seen shops write thousand-line COBOL programs to do what DFSORT can accomplish in a 15-line control statement — running in a tenth of the time.
Why DFSORT Is Faster Than COBOL
DFSORT isn't faster because IBM's programmers are smarter than you (though the DFSORT team has been optimizing this product since 1965). It's faster because:
- DFSORT operates below the access method layer. It reads and writes physical blocks directly, bypassing QSAM overhead entirely.
- DFSORT uses memory-mapped I/O and parallel I/O scheduling that COBOL programs can't access through standard language features.
- DFSORT's sort algorithm is hardware-aware — it knows the CPU cache line size, the number of available processors, and the memory hierarchy.
- DFSORT can exploit zIIP processors for eligible work, reducing GP (General Purpose) processor cost.
Benchmark comparison for sorting 50 million 200-byte records:
Method Elapsed CPU (GP) EXCP Count
──────────────────────────────────────────────────────────
COBOL SORT verb (default) 18 min 9.2 min 2,400,000
COBOL SORT with FASTSRT 12 min 5.8 min 1,600,000
DFSORT standalone JCL 5 min 2.1 min 890,000
DFSORT with HIPERSPACE 4 min 1.8 min 620,000
The standalone DFSORT is 3.6x faster than default COBOL SORT. Even with FASTSRT (which we'll cover in Section 26.4), DFSORT standalone is still 2.4x faster. The EXCP count tells the story — standalone DFSORT issues barely a third of the I/O operations of the default COBOL SORT.
DFSORT Control Statement Mastery
Every z/OS performance engineer needs to be fluent in DFSORT control statements. Here are the key operations:
SORT — The core operation:
SORT FIELDS=(1,10,CH,A,15,8,PD,D)
This sorts by two keys: positions 1-10 (character, ascending) and positions 15-22 (packed decimal, descending). The field format matters for performance — CH (character) is faster than BI (binary) for keys that are actually character data, because DFSORT optimizes character comparison paths.
INCLUDE/OMIT — Filter before sort (critical for performance):
SORT FIELDS=(1,10,CH,A)
INCLUDE COND=(85,2,CH,EQ,C'TX',AND,
100,8,PD,GT,+0)
This filters to only records where position 85-86 is 'TX' and position 100-107 (packed decimal) is positive — and the filtering happens before the sort. If INCLUDE eliminates 40% of the records, you just reduced your sort time by roughly 40%.
💡 KEY INSIGHT: Always filter before sorting. An INCLUDE or OMIT that runs before SORT reduces the record count that enters the sort phase. Every record you exclude is a record you don't sort, don't write to work datasets, and don't write to the output file. At CNB, adding an INCLUDE to the EOD transaction sort that excluded zero-value adjustment records reduced the sort input from 50 million to 41 million records — an 18% reduction that saved 54 seconds.
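The principle is language-independent. Here's a Python analogue (not DFSORT itself — just the same INCLUDE-before-SORT idea, with toy data standing in for transaction records):

```python
def sort_then_filter(records, keep):
    """The wasteful order: every record passes through the sort."""
    return [r for r in sorted(records) if keep(r)]

def filter_then_sort(records, keep):
    """The INCLUDE-before-SORT order: the sort only sees survivors."""
    return sorted(r for r in records if keep(r))

data = [("TX", 5), ("NY", 9), ("TX", 0), ("TX", 3)]
keep = lambda r: r[0] == "TX" and r[1] > 0

# Identical output either way -- but the second sorts 2 records, not 4
assert sort_then_filter(data, keep) == filter_then_sort(data, keep)
```

Same answer, smaller sort. In DFSORT the saving compounds: excluded records also skip the work datasets and the output write.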
OUTREC — Reformat output records:
SORT FIELDS=(1,10,CH,A)
OUTREC FIELDS=(1,10,15,8,50,20,C' ',80:X'00')
OUTREC reformats the output record, extracting, rearranging, and padding fields. It replaces the need for a separate COBOL reformatting program after the sort.
INREC — Reformat input records before sort:
INREC FIELDS=(1,10,15,8,50,20)
INREC reformats records before the sort phase. If your input record is 500 bytes but you only need 38 bytes of it for the sort and output, INREC shrinks the record to 38 bytes before sort. Sorting 38-byte records is dramatically faster than sorting 500-byte records.
ICETOOL: The Multi-Operation Powerhouse
ICETOOL is DFSORT's Swiss Army knife. It executes multiple DFSORT operations in a single job step, with the ability to chain outputs and apply conditional logic.
//STEP01 EXEC PGM=ICETOOL
//TOOLMSG DD SYSOUT=*
//DFSMSG DD SYSOUT=*
//IN1 DD DSN=CNB.EOD.TRANS,DISP=SHR
//OUT1 DD DSN=CNB.WORK.DEBITS,DISP=(NEW,PASS),
// SPACE=(CYL,(100,10)),
// DCB=(RECFM=FB,LRECL=200,BLKSIZE=27800)
//OUT2 DD DSN=CNB.WORK.CREDITS,DISP=(NEW,PASS),
// SPACE=(CYL,(100,10)),
// DCB=(RECFM=FB,LRECL=200,BLKSIZE=27800)
//OUT3 DD DSN=CNB.WORK.SUMMARY,DISP=(NEW,PASS),
// SPACE=(CYL,(5,1)),
// DCB=(RECFM=FB,LRECL=100,BLKSIZE=27900)
//CTL1CNTL DD *
SORT FIELDS=(1,10,CH,A)
INCLUDE COND=(85,1,CH,EQ,C'D')
OUTREC FIELDS=(1,10,15,8,PD,EDIT=(STTTTTTTTTT.TT),
50,20,C' DEBIT')
/*
//CTL2CNTL DD *
SORT FIELDS=(1,10,CH,A)
INCLUDE COND=(85,1,CH,EQ,C'C')
/*
//CTL3CNTL DD *
SORT FIELDS=COPY
OUTFIL REMOVECC,
SECTIONS=(1,10,
TRAILER3=(1,10,COUNT=(M10,LENGTH=8),
TOT=(15,8,PD,EDIT=(STTTTTTTTTT.TT))))
/*
//TOOLIN DD *
SORT FROM(IN1) TO(OUT1) USING(CTL1)
SORT FROM(IN1) TO(OUT2) USING(CTL2)
SORT FROM(IN1) TO(OUT3) USING(CTL3)
/*
This single ICETOOL invocation reads the input once and produces three outputs: sorted debits with reformatted amounts, sorted credits, and a summary with record counts and totals by account. In COBOL, this would be three separate READ passes through the file or a complex single-pass program with three outputs. DFSORT does it in a fraction of the elapsed time.
Essential DFSORT Patterns Every Practitioner Needs
Beyond the basic operations, there are several DFSORT patterns that I use constantly in production environments. These are the patterns that separate someone who knows DFSORT from someone who thinks in DFSORT.
Pattern 1: SUM FIELDS=NONE for Deduplication
SORT FIELDS=(1,10,CH,A)
SUM FIELDS=NONE
This sorts by key and keeps only the first record for each unique key, discarding all duplicates. It replaces the classic COBOL "previous key comparison" pattern that typically requires 15-20 lines of WORKING-STORAGE and 25-30 lines of PROCEDURE DIVISION. DFSORT does it in two lines. At CNB, the account deduplication step uses this pattern to ensure the nightly reconciliation processes each account exactly once.
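If the DFSORT behavior is unfamiliar, this Python analogue shows the effect (which record survives a duplicate group in DFSORT depends on OPTION EQUALS; this sketch keeps the first after a stable sort):

```python
from itertools import groupby

def dedup_first(records, key):
    """SUM FIELDS=NONE analogue: sort by key, keep the first record
    of each key group, discard the rest."""
    ordered = sorted(records, key=key)  # stable, like EQUALS
    return [next(grp) for _, grp in groupby(ordered, key=key)]

rows = [("ACCT2", "b"), ("ACCT1", "a"), ("ACCT1", "z")]
unique = dedup_first(rows, key=lambda r: r[0])
# One record per account: ACCT1 keeps "a", its first occurrence
```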
Pattern 2: SUM FIELDS for Aggregation
SORT FIELDS=(1,10,CH,A)
SUM FIELDS=(15,8,PD,25,8,PD)
This sorts by key and for each unique key, sums the packed decimal fields at positions 15-22 and 25-32 across all records with that key. The output has one record per unique key with accumulated totals. It replaces COBOL control-break logic — typically 60-100 lines of COBOL — with two lines of DFSORT.
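The control-break logic it replaces looks like this in miniature (a Python analogue of the aggregation, with records modeled as a key plus two amounts rather than byte positions):

```python
from collections import defaultdict

def sum_by_key(records):
    """SUM FIELDS analogue: one output record per unique key, with
    both amount fields accumulated across the key's records."""
    totals = defaultdict(lambda: [0, 0])
    for key, amt1, amt2 in records:
        totals[key][0] += amt1
        totals[key][1] += amt2
    return sorted((k, a, b) for k, (a, b) in totals.items())

rows = [("ACCT1", 100, 5), ("ACCT2", 50, 1), ("ACCT1", 25, 2)]
summary = sum_by_key(rows)  # ACCT1 collapses to (125, 7)
```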
Pattern 3: IFTHEN for Conditional Processing
OUTREC IFTHEN=(WHEN=(19,1,CH,EQ,C'D'),
BUILD=(1,10,C' DEBIT ',11,8,PD,
EDIT=(STTTTTTTTTTTT.TT))),
IFTHEN=(WHEN=(19,1,CH,EQ,C'C'),
BUILD=(1,10,C' CREDIT ',11,8,PD,
EDIT=(STTTTTTTTTTTT.TT))),
IFTHEN=(WHEN=NONE,
BUILD=(1,10,C' OTHER ',11,8,PD,
EDIT=(STTTTTTTTTTTT.TT)))
IFTHEN applies different transformations based on record content — it's DFSORT's equivalent of COBOL's EVALUATE. The WHEN=NONE clause acts as the default case. At Federal Benefits, Sandra Chen used nested IFTHEN statements to handle 12 different record types in the EDI 834 reformatter, replacing 1,100 lines of COBOL.
Pattern 4: OUTFIL SPLIT for Parallel Processing
SORT FIELDS=COPY
OUTFIL FNAMES=SPLIT1,STARTREC=1,ENDREC=10000000
OUTFIL FNAMES=SPLIT2,STARTREC=10000001,ENDREC=20000000
OUTFIL FNAMES=SPLIT3,STARTREC=20000001,ENDREC=30000000
OUTFIL FNAMES=SPLIT4,STARTREC=30000001,ENDREC=40000000
OUTFIL FNAMES=SPLIT5,STARTREC=40000001
🔄 SPACED REVIEW — Chapter 25 (Parallel Batch): In Chapter 25, we designed parallel batch processing by splitting work across multiple jobs. DFSORT's OUTFIL with STARTREC/ENDREC provides the splitting mechanism — one DFSORT pass creates all the partitioned input files for the parallel jobs. This is far more efficient than running a COBOL program that reads the entire file N times to produce N splits.
Pattern 5: JOINKEYS for File Matching
ICETOOL's SPLICE operation and DFSORT's JOINKEYS can join two files by key — matching records without DB2, without COBOL's complex dual-file read logic, and without loading either file into memory:
JOINKEYS FILE=F1,FIELDS=(1,10,A)
JOINKEYS FILE=F2,FIELDS=(1,10,A)
JOIN UNPAIRED,F1,F2
REFORMAT FIELDS=(F1:1,80,F2:11,50)
This joins two files on positions 1-10, producing matched, unmatched-F1, and unmatched-F2 records. The REFORMAT clause constructs the output record from fields in both files. I've seen COBOL programs of 400+ lines that do nothing but match two sorted files and produce a joined output. DFSORT JOINKEYS does it in four lines and runs 3-5x faster because it uses optimized merge logic instead of COBOL's READ/COMPARE/READ pattern.
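The merge logic underneath is worth understanding even if DFSORT does it for you. A sketch in Python (simplified: it pairs records one-to-one per key, whereas JOINKEYS produces a cross-product when a key repeats on both sides):

```python
def join_keys(f1, f2, key=lambda r: r[0]):
    """Sorted-merge join keeping unpaired records from both sides --
    the JOIN UNPAIRED,F1,F2 analogue. Like JOINKEYS, each input is
    sorted on its key first, then merged in a single pass."""
    f1, f2 = sorted(f1, key=key), sorted(f2, key=key)
    i = j = 0
    out = []
    while i < len(f1) or j < len(f2):
        if j >= len(f2) or (i < len(f1) and key(f1[i]) < key(f2[j])):
            out.append((f1[i], None))   # unpaired F1 record
            i += 1
        elif i >= len(f1) or key(f2[j]) < key(f1[i]):
            out.append((None, f2[j]))   # unpaired F2 record
            j += 1
        else:
            out.append((f1[i], f2[j]))  # matched pair
            i += 1
            j += 1
    return out

masters = [("ACCT1", "Smith"), ("ACCT3", "Jones")]
trans = [("ACCT2", 50), ("ACCT3", 75)]
joined = join_keys(masters, trans)
```

One pass, no rewinds, no dual-read bookkeeping — which is exactly why the optimized merge beats COBOL's READ/COMPARE/READ pattern.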
DFSORT Performance Tuning Parameters
Beyond the control statement, DFSORT's runtime parameters make a significant difference:
DFSORT OPTION CONTROL CARDS:
OPTION MAINSIZE=MAX - Use all available memory
OPTION FILSZ=E50000000 - Estimated file size (helps workspace allocation)
OPTION DYNALLOC=(SYSDA,5) - Dynamic work dataset allocation (5 datasets)
OPTION HIPRMAX=OPTIMAL - Use Hiperspace for work storage
OPTION EXPOLD=MAX - Cap on old expanded storage DFSORT may use (legacy systems)
OPTION SPANINC=RC4 - Set RC=4 (not RC=16) for spanned record issues
MAINSIZE=MAX is the single most impactful parameter. DFSORT's sort time is directly proportional to the number of merge passes required, and merge passes are determined by available memory. With MAX, DFSORT uses all available region memory minus a reserve. The formula:
Merge passes = CEIL( log(N) / log(M/R) )
Where:
M = Available memory (after MAINSIZE allocation)
R = Record length
N = Number of records
For 50M records, 200-byte LRECL:
MAINSIZE=64M → 3 merge passes → ~5 min
MAINSIZE=256M → 2 merge passes → ~3.5 min
MAINSIZE=1G → 1 merge pass → ~2.5 min
Going from 3 merge passes to 1 cuts elapsed time in half. And merge passes determine work dataset I/O — fewer passes means fewer EXCP. This is why Rob Calloway allocates 1 GB regions for the EOD sort jobs.
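The formula is easy to play with numerically. This is an idealized model only — real DFSORT pass counts also depend on work-dataset geometry, Hiperspace, and internal overheads, so treat it as directional rather than a predictor of the CNB timings:

```python
import math

def merge_passes(records, lrecl, mainsize_bytes):
    """Idealized merge-pass count: each pass merges using roughly
    mainsize/lrecl in-memory record slots, per the formula above."""
    slots = mainsize_bytes // lrecl
    return max(1, math.ceil(math.log(records) / math.log(slots)))

# 50M 200-byte records with a 64 MB MAINSIZE
p = merge_passes(50_000_000, 200, 64 * 2**20)
```

The useful property the model does capture: pass count never increases as MAINSIZE grows, and each pass eliminated removes a full read-write cycle of the work datasets.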
DYNALLOC for work datasets is critical. Let DFSORT allocate its own work datasets dynamically rather than pre-allocating them in JCL. DFSORT knows exactly how many work datasets it needs and how large they should be. Pre-allocated work datasets are almost always either too small (causing B37 abends) or too large (wasting DASD space and catalog entries).
Sandra's DFSORT Revolution at Federal Benefits
At Federal Benefits Administration, Sandra Chen discovered that 23 COBOL batch programs existed solely to sort, merge, or reformat files. These programs had accumulated over 20 years — each written by a different programmer, each with its own idiosyncratic file handling, each maintained separately.
She replaced all 23 programs with DFSORT/ICETOOL JCL. The results:
Metric Before (COBOL) After (DFSORT)
────────────────────────────────────────────────────────────
Programs maintained 23 0 (JCL only)
Lines of code 31,400 1,180 (control stmts)
Total elapsed time 147 min 38 min
Total CPU time 89 min 14 min
Total EXCP count 14.2M 3.8M
Maintenance incidents/yr 12 1
Marcus Whitfield was initially skeptical. "I wrote seven of those programs," he said. "They work." Sandra's response: "They work. DFSORT works four times faster." Marcus examined the DFSORT control statements, verified the output was identical byte-for-byte, and approved the change. "Should have done this fifteen years ago," he admitted.
The maintenance reduction was almost more important than the performance gain. Those 23 COBOL programs required COBOL compilation, load module management, copybook changes when record layouts changed, and someone who understood each program's quirks. The DFSORT control statements are JCL — no compilation, no load modules, and record layout changes are straightforward field-position adjustments.
26.4 COBOL Compiler Optimization: OPT Levels, FASTSRT, and Generated Code Analysis
When I/O and SORT are optimized and the program is still too slow, it's time to look at the COBOL itself. But be precise about what "COBOL optimization" means: it means changing how the Enterprise COBOL compiler generates machine code, not rewriting your PROCEDURE DIVISION.
Compiler Optimization Levels
Enterprise COBOL for z/OS supports three optimization levels:
OPT(0) — No optimization (default for many shops)
Fastest compilation
Worst runtime performance
Best for debugging (DWARF symbols accurate)
Code generated: literal translation of COBOL to machine code
OPT(1) — Moderate optimization
Moderate compilation time increase (+30-50%)
Good runtime improvement (10-25% faster than OPT(0))
Reasonable debugging support
Optimizations: dead code elimination, constant folding,
simple register allocation, branch optimization
OPT(2) — Aggressive optimization
Significant compilation time increase (+100-200%)
Best runtime performance (20-40% faster than OPT(0))
Debugging compromised (source-level stepping unreliable)
Additional optimizations: global register allocation,
loop optimization, strength reduction,
in-line PERFORM, common subexpression elimination
The performance difference is real. At CNB, Kwame Mensah ran benchmarks on the EOD validation program (CNBEOD-VALID, 12,000 lines of COBOL):
OPT Level CPU Time Elapsed Object Size
────────────────────────────────────────────────
OPT(0) 15.6 min 38.0 min 2.1 MB
OPT(1) 12.8 min 34.2 min 1.9 MB
OPT(2) 10.9 min 31.1 min 1.7 MB
OPT(2) saved 4.7 minutes of CPU and 6.9 minutes of elapsed time over OPT(0). On a critical-path job that runs every night, 365 nights a year, that's 42 hours of elapsed time saved annually — and a meaningful reduction in MSU consumption.
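The annualization is trivial but worth making explicit, because it's the number that justifies the recompile to management (a quick sketch; the helper is mine):

```python
def annual_hours_saved(minutes_per_run, runs_per_year=365):
    """Annualized elapsed-time saving for a job that runs nightly."""
    return minutes_per_run * runs_per_year / 60

# OPT(2) vs OPT(0) on CNBEOD-VALID: 6.9 minutes per night
hours = annual_hours_saved(6.9)  # ~42 hours per year
```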
⚠️ COMMON PITFALL: Don't use OPT(2) for programs you're actively debugging. The aggressive optimizations move code around, eliminate dead branches, and inline PERFORMs, making source-level debugging nearly impossible. At CNB, the standard is OPT(1) for development/test and OPT(2) for production load modules. The build pipeline handles the switch automatically.
FASTSRT: Let DFSORT Do the Heavy Lifting
FASTSRT is the single most impactful compiler option for COBOL programs that contain SORT or MERGE verbs. When FASTSRT is active, the COBOL runtime delegates SORT/MERGE I/O directly to DFSORT instead of using COBOL's internal file handling.
Without FASTSRT:
COBOL program → COBOL file handler → QSAM → I/O Supervisor → DASD
(Double buffering, format conversion, extra memory copies)
With FASTSRT:
COBOL program → DFSORT → Direct I/O → DASD
(DFSORT's optimized I/O path, no intermediate buffering)
The improvement is typically 30-50% for SORT-intensive batch programs.
To activate FASTSRT, you need both the compiler option and proper coding:
*-------------------------------------------------------*
* COBOL SORT that qualifies for FASTSRT optimization *
*-------------------------------------------------------*
SORT SORT-WORK-FILE
ON ASCENDING KEY SW-ACCOUNT-NUMBER
ON ASCENDING KEY SW-TRANS-DATE
USING TRANSACTION-FILE
GIVING SORTED-TRANS-FILE.
*
* This qualifies for FASTSRT because:
* 1. USING/GIVING (not INPUT/OUTPUT PROCEDURE)
* 2. No special registers modified before SORT
* 3. Files have standard sequential organization
FASTSRT disqualifiers — any of these prevent FASTSRT activation:
1. INPUT PROCEDURE or OUTPUT PROCEDURE specified
(DFSORT can't intercept procedural I/O)
2. LINAGE clause on output file
3. APPLY WRITE-ONLY on the sort file
4. SAME AREA or SAME RECORD AREA for sort files
5. USING file opened before SORT (must let SORT open it)
6. Variable-length records with NOSEQCHK interaction issues
💡 KEY INSIGHT: If you have an INPUT PROCEDURE that only filters records (no complex processing), replace it with a DFSORT INCLUDE statement and switch to USING. You'll get FASTSRT and DFSORT's filter optimization. Double win.
At CNB, the EOD posting program originally used an OUTPUT PROCEDURE to add trailer records after the sort. Kwame Mensah refactored it to use GIVING with a separate DFSORT OUTFIL step for the trailer. The SORT step went from 8.4 minutes to 4.1 minutes — a 51% reduction.
NOSEQCHK: Skip the Sequence Check
When you specify NOSEQCHK as a compiler option (or runtime option), the COBOL runtime skips the sequence check on MERGE input files and SORT output. This check verifies that records are in the correct sequence — useful for debugging, useless in production when you trust your sort.
The savings are modest (1-3%) but real, and they cost nothing. Every production batch COBOL program at CNB compiles with NOSEQCHK.
Generated Code Analysis: When to Look Under the Hood
For CPU-bound batch programs, examining the compiler's generated code can reveal optimization opportunities that aren't visible at the COBOL source level.
Use the LIST compiler option to produce an assembler listing:
CBL LIST,OFFSET,MAP,OPT(2)
Key things to look for in the generated code:
1. Expensive MOVE operations:
* COBOL source:
MOVE CORRESPONDING WS-INPUT-RECORD
TO WS-OUTPUT-RECORD.
MOVE CORRESPONDING generates a separate MOVE for each matching field. If WS-INPUT-RECORD and WS-OUTPUT-RECORD have 50 matching fields, that's 50 MVC (Move Character) instructions. If the records are identical layouts and you want a wholesale copy, a single MOVE WS-INPUT-RECORD TO WS-OUTPUT-RECORD generates one MVC or MVCL.
2. Packed decimal arithmetic in tight loops:
* Mixed-usage operands can generate CVB/CVD conversion pairs in the loop:
PERFORM VARYING WS-INDEX FROM 1 BY 1
UNTIL WS-INDEX > WS-MAX-RECORDS
COMPUTE WS-TOTAL = WS-TOTAL +
WS-AMOUNT(WS-INDEX)
END-PERFORM.
The conversions come from mixing usages in the arithmetic itself: if WS-TOTAL is binary (COMP) while WS-AMOUNT is PIC S9(9)V99 COMP-3 (packed decimal), every iteration converts between formats. Keep all operands of the COMPUTE in one usage — typically COMP-3 for money fields — to eliminate the conversions. The subscript WS-INDEX should stay binary (COMP), which is what the compiler needs for address arithmetic.
3. Reference modification overhead:
* Reference modification generates bounds checking:
MOVE WS-DATA(WS-START:WS-LENGTH)
TO WS-OUTPUT
Every reference modification with variable start or length generates runtime bounds checking (unless NOSSRANGE is specified). In a 50-million-iteration loop, that bounds checking adds up. If you know the ranges are safe, NOSSRANGE eliminates the checks. But be absolutely certain — an undetected subscript error in a batch program that processes 50 million records can corrupt your entire dataset.
The Compiler Option Stack for Production Batch
Here's the compiler option set that CNB uses for production batch programs:
OPT(2) — Maximum optimization
FASTSRT — DFSORT-delegated SORT I/O
NOSEQCHK — Skip sort sequence verification
NUMPROC(PFD) — Assume valid packed decimal signs
TRUNC(OPT) — Optimize binary truncation
SSRANGE — Keep subscript checking (safety over speed)
RENT — Reentrant code (required for LE)
RMODE(ANY) — 31-bit residency (above-the-line)
LIST — Generate assembler listing (for analysis)
MAP — Data map (for debugging/analysis)
OFFSET — Condensed verb listing
Note that SSRANGE stays on even in production. Kwame Mensah's position: "I'll take the 2% CPU hit over the risk of a subscript error corrupting the general ledger." That's the right call. A 2% CPU increase on a 10-minute job is 12 seconds. A subscript error on the GL posting job is a Sev-1 incident, a regulatory filing delay, and Rob Calloway's phone ringing at 3 AM.
26.5 DB2 Batch Performance: Commit Frequency, Prefetch, and Parallelism
For DB2-bound batch programs — and at CNB, that's 40% of the critical path — SQL performance is the dominant factor. Chapter 6 covered the DB2 optimizer in depth. Here we focus specifically on batch-mode DB2 performance patterns.
🔄 SPACED REVIEW — Chapter 6 (DB2 Optimizer): The DB2 optimizer chooses access paths based on catalog statistics. For batch programs, the critical access paths are different from CICS. Batch favors sequential prefetch and tablespace scans; CICS favors index access. Make sure your RUNSTATS strategy captures batch access patterns, not just online patterns.
Commit Frequency: The Throughput-Recovery Tradeoff
Every DB2 COMMIT in a batch program is a synchronization point. It:
- Writes log records to the active log
- Releases all locks held by the application
- Makes all changes visible to other users
- Establishes a recovery point
The cost of a COMMIT is typically 2-5 milliseconds. On a batch program processing 50 million records, the commit strategy dramatically affects elapsed time:
Commit Strategy Commits Lock Duration Elapsed Recovery Risk
─────────────────────────────────────────────────────────────────────────
Every record 50M Minimal 240 min None
Every 100 records 500K Low 85 min 100 records
Every 1,000 records 50K Moderate 52 min 1,000 records
Every 10,000 records 5K High 48 min 10,000 records
Every 100,000 records 500 Very High 47 min 100,000 records
No commit (single UOW) 1 Maximum 46 min Entire run
The curve flattens dramatically after every-1,000. Going from every-100 to every-1,000 saves 33 minutes. Going from every-1,000 to no-commit saves 6 minutes. That last 6 minutes isn't worth the risk.
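The shape of that curve is easy to model. A minimal sketch — the per-commit cost here is an assumed effective value, and the model deliberately ignores the lock-wait and logging effects that the measured numbers in the table also capture:

```python
# Simple model of elapsed time vs. commit interval:
# elapsed = base work time + (number of commits * per-commit cost).

RECORDS = 50_000_000
BASE_MIN = 46.0      # elapsed with a single commit, from the table above
COMMIT_MS = 0.25     # assumed effective cost per commit (illustrative)

def elapsed_minutes(commit_interval: int) -> float:
    """Elapsed minutes for a given commit interval under the simple model."""
    commits = RECORDS / commit_interval
    return BASE_MIN + commits * COMMIT_MS / 60_000.0

for interval in (1, 100, 1_000, 10_000, 100_000):
    print(f"commit every {interval:>7}: {elapsed_minutes(interval):7.1f} min")
```

The output shows the same flattening: almost all of the commit overhead disappears by the every-1,000 mark, and widening the interval further buys very little.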
⚠️ CRITICAL: Commit frequency must balance throughput against two constraints:
- Lock escalation threshold. DB2 will escalate from row/page locks to table locks when the number of locks on a tablespace exceeds LOCKMAX (or the installation default). A table lock in batch blocks every CICS transaction that touches that table. At CNB, LOCKMAX is 10,000 — so commit must happen before 10,000 page locks accumulate.
- Recovery granularity from Chapter 24. If the program abends between commits, all work since the last commit must be re-processed. Commit every 5,000 records means worst-case re-processing is 5,000 records — typically under a minute for most batch programs.
Lisa Tran's rule at CNB: Commit every 5,000 records for standard batch. Commit every 1,000 for high-contention tables. Never commit every record.
Batch-Online Contention: The 11 PM Problem
At shops that don't fully quiesce CICS during the batch window — and that's increasingly common as 24/7 availability expectations grow — batch DB2 programs compete with online transactions for locks and buffer pool pages.
At Pinnacle Health Insurance, the eligibility verification system runs 24/7. Claims processing batch starts at 8 PM while CICS is still serving real-time eligibility inquiries. Ahmad Rashidi flagged a recurring pattern: every night at 8:15 PM, CICS response times spiked from 0.2 seconds to 3.5 seconds for eligibility queries. The cause? The batch claims posting job was acquiring IX (intent exclusive) locks on the eligibility tablespace, forcing CICS transactions to wait.
Diane Okoye's fix was three-fold:
- Commit every 1,000 records (down from 10,000) to release locks faster
- Lock at ROW level (not PAGE) to reduce the lock footprint. Row locking uses more locks per commit interval but each lock blocks fewer concurrent readers
- Separate buffer pool for batch — BP2 for batch access, BP0 for CICS, preventing batch sequential prefetch from flushing CICS's hot pages
The result: CICS response time during batch dropped from 3.5 seconds back to 0.3 seconds. The batch job itself slowed by 4 minutes (more frequent commits), but the business accepted that tradeoff. "Four extra minutes of batch is invisible," Ahmad said. "Three seconds of eligibility latency means a provider hangs up the phone."
This is the kind of tradeoff that batch performance engineers must navigate. Optimizing batch in isolation is straightforward. Optimizing batch while maintaining online service levels is architecture.
Sequential Prefetch and List Prefetch
DB2's prefetch engine is the mechanism that makes batch SQL perform at I/O speeds rather than random-access speeds.
Sequential prefetch reads ahead in the tablespace, loading pages into the buffer pool before your cursor fetches them. It activates automatically when DB2 detects sequential access patterns (typically after 8 consecutive page accesses in the same direction).
For batch cursors that process entire tables or large ranges:
SELECT ACCOUNT_NUMBER, BALANCE, LAST_ACTIVITY_DATE
FROM CNB.ACCOUNTS
WHERE LAST_ACTIVITY_DATE >= CURRENT DATE - 30 DAYS
ORDER BY ACCOUNT_NUMBER
FOR FETCH ONLY
FOR FETCH ONLY (or FOR READ ONLY) is critical — it tells DB2 this cursor won't update rows, enabling:
- Sequential prefetch without lock compatibility concerns
- Block fetch (multiple rows per DRDA network message, though irrelevant in local batch)
- Avoidance of intent exclusive locks
List prefetch is used when DB2 accesses an index, collects a list of qualifying RIDs (Row IDs), sorts them into page order, and then prefetches data pages in physical sequence. This eliminates the random I/O pattern that index access normally produces.
For batch programs that access a subset of rows by index:
SELECT T.TRANS_ID, T.AMOUNT, T.TRANS_DATE,
A.ACCOUNT_NAME, A.ACCOUNT_TYPE
FROM CNB.TRANSACTIONS T
JOIN CNB.ACCOUNTS A
ON T.ACCOUNT_NUMBER = A.ACCOUNT_NUMBER
WHERE T.TRANS_DATE = CURRENT DATE - 1 DAY
ORDER BY T.ACCOUNT_NUMBER
If the TRANSACTIONS table has an index on TRANS_DATE, DB2 can use list prefetch to collect all matching RIDs, sort them, and read data pages sequentially. The access path goes from random I/O (0.5-2 ms per page) to sequential I/O (0.05-0.2 ms per page) — a 10x improvement.
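The mechanism can be sketched in a few lines. Assuming per-page costs in the ranges quoted above (random ~1.0 ms, sequential ~0.1 ms — both illustrative), sorting the page list before reading converts most of the random I/Os into sequential ones:

```python
import random

def read_time_ms(pages, random_ms=1.0, seq_ms=0.1):
    """First page of each physically-consecutive run costs a random I/O;
    each page that follows its predecessor costs only a sequential read."""
    total, prev = 0.0, None
    for p in pages:
        total += seq_ms if prev is not None and p == prev + 1 else random_ms
        prev = p
    return total

rng = random.Random(7)
pages = rng.sample(range(2_500), 2_000)   # data pages touched, in index-key order
print(f"index order: {read_time_ms(pages):7.1f} ms")
print(f"RID-sorted : {read_time_ms(sorted(pages)):7.1f} ms")
```

The denser the qualifying pages, the more of the sorted list collapses into sequential runs — which is exactly why list prefetch pays off for date-clustered batch access.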
Partition-Level Parallelism
DB2 query parallelism can split a single SQL statement across multiple CP (Central Processor) engines, processing different partitions simultaneously. For batch queries against partitioned tablespaces:
-- This query can exploit partition-level parallelism
-- if TRANSACTIONS is partitioned by TRANS_DATE
SELECT ACCOUNT_NUMBER, SUM(AMOUNT) AS DAILY_TOTAL
FROM CNB.TRANSACTIONS
WHERE TRANS_DATE BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY ACCOUNT_NUMBER
DB2 can process each date partition on a separate CP, then merge the results. The degree of parallelism is controlled by:
DSNZPARM parameters:
PARAMDEG (MAX DEGREE) — Maximum parallelism degree (0 = let DB2 choose)
plus an installation cost threshold below which DB2 keeps the query serial
BIND parameters:
DEGREE(ANY) — Allow DB2 to choose parallel degree
DEGREE(1) — Force serial execution (no parallelism)
At CNB, Lisa Tran uses DEGREE(ANY) for batch plans and DEGREE(1) for CICS plans. Batch programs benefit from parallelism because they're I/O-bound and have the entire overnight window. CICS programs should never go parallel — the overhead of coordinating parallel tasks outweighs the benefit for sub-second transactions.
The Batch SQL Checklist
□ Commit every 1,000–5,000 records (adjust based on lock monitoring)
□ FOR FETCH ONLY on all read cursors
□ OPTIMIZE FOR n ROWS on cursors with known row counts
□ DEGREE(ANY) in batch plan BIND
□ RUNSTATS current on batch-accessed tablespaces
□ Sequential prefetch verified in EXPLAIN (PLAN_TABLE PREFETCH = S or D)
□ No stage 2 predicates in high-volume cursor WHEREs
□ Lock escalation monitored via IFCID 0223/0224
□ Tablespace partitioned by the batch key (usually date or account range)
□ Buffer pool sized for batch working set (BP2 at CNB)
26.6 Performance Analysis with SMF and RMF
You've optimized I/O, tuned DFSORT, set compiler options, and adjusted DB2 parameters. How do you know it worked? You measure. Again.
SMF Records for Batch Performance
SMF (System Management Facility) is z/OS's comprehensive measurement infrastructure. For batch performance analysis, these record types are essential:
SMF Type 14/15 — Dataset activity (open/close, EXCP count per dataset)
Key fields: dataset name, EXCP count, block count, device type
Use: Verify BLKSIZE optimization (fewer EXCPs = larger blocks)
SMF Type 30 — Job/step accounting (the master performance record)
Key fields: CPU time, elapsed time, EXCP count, page-ins, I/O connect time
Use: Performance decomposition (CPU/IO/DB2/Other split)
SMF Type 16 — DFSORT statistics
Key fields: records sorted, merge passes, memory used, elapsed time
Use: Verify DFSORT tuning (merge passes, memory allocation)
SMF Type 101 — DB2 accounting (Class 1 and Class 2)
Key fields: elapsed time, CPU time, getpages, synchronous reads,
lock waits, commit count, SQL call count
Use: DB2 performance decomposition
SMF Type 110 — CICS statistics (not used in batch, but relevant
when batch affects online)
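The performance decomposition that Type 30 enables can be sketched directly. Parsing raw SMF records requires the documented record layouts, so this illustrative sketch takes the already-extracted field values as inputs:

```python
# Decompose a job step's elapsed time into CPU / I/O wait / DB2 wait / other,
# expressed as percentages. Inputs are seconds, as extracted upstream from
# SMF Type 30 (CPU, elapsed) and DB2 accounting (wait) records.

def decompose(elapsed_s: float, cpu_s: float, io_wait_s: float, db2_wait_s: float) -> dict:
    other = elapsed_s - cpu_s - io_wait_s - db2_wait_s
    parts = {"CPU": cpu_s, "I/O wait": io_wait_s,
             "DB2 wait": db2_wait_s, "Other": max(other, 0.0)}
    return {k: round(100.0 * v / elapsed_s, 1) for k, v in parts.items()}

# CNBEOD-POST from the dashboard below: 36:22 elapsed, 4:21 CPU, 24:41 I/O, 5:08 DB2
print(decompose(36*60 + 22, 4*60 + 21, 24*60 + 41, 5*60 + 8))
```

A job that comes out ~68% I/O wait, like this one, belongs in the Section 26.2 queue before anyone touches the COBOL.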
Building a Performance Dashboard
At CNB, Rob Calloway's team built a batch performance dashboard from SMF data. The dashboard runs every morning at 6:30 AM (after the batch window closes) and reports:
CNB BATCH PERFORMANCE DASHBOARD — 2026-03-15 (Saturday)
═══════════════════════════════════════════════════════════
Window opened: 23:00:14 Window closed: 05:38:22
Total elapsed: 6h 38m 08s Margin: 21m 38s
Critical path: 5h 12m 44s Non-critical: various
TOP 10 ELAPSED TIME JOBS:
Job Name Elapsed CPU I/O Wait DB2 Wait Delta(Prev)
──────────────────────────────────────────────────────────────────
CNBEOD-STMT 52:14 18:16 28:33 2:04 -7:46 ▼
CNBEOD-POST 36:22 4:21 24:41 5:08 -6:38 ▼
CNBEOD-VALID 31:18 10:55 9:58 6:33 -6:42 ▼
CNBEOD-INTST 19:44 2:57 2:10 13:14 -3:16 ▼
CNBEOD-BAL 15:33 3:25 1:14 9:38 -2:27 ▼
CNBEOD-REG 14:22 2:35 8:56 1:42 -4:38 ▼
CNBEOD-RECON 13:08 3:16 2:21 6:18 -1:52 ▼
CNBEOD-GL03 8:14 0:44 5:51 0:58 -3:46 ▼
CNBEOD-ACH 3:55 0:33 2:12 0:52 -1:05 ▼
CNBEOD-SORT 2:18 0:11 2:01 0:00 -2:42 ▼
ALERTS:
⚠ Margin below 30-minute threshold (21m 38s)
✓ No lock escalations detected
✓ All checkpoint/restart points verified
✓ EXCP trending down 23% week-over-week (optimization project)
The "Delta(Prev)" column shows the improvement from the performance optimization project. Those downward arrows represent the cumulative result of everything covered in this chapter — buffer tuning, BLKSIZE optimization, DFSORT replacement, compiler options, and DB2 tuning.
The margin alert is still concerning — 21 minutes is below Rob's 30-minute minimum. But the trend is positive, and the optimization project has more changes in the pipeline.
RMF Reports for Systemic Analysis
RMF (Resource Measurement Facility) provides system-wide resource utilization data that SMF alone can't provide:
Key RMF reports for batch performance:
CPU Activity Report — Overall processor utilization during batch
Look for: CP utilization > 85% (throttling risk)
zIIP utilization vs. GP (offload opportunities)
Channel Activity Report — I/O channel utilization
Look for: Channel busy > 70% (throughput bottleneck)
Channel path delays > 5% (controller contention)
Device Activity Report — Individual DASD volume utilization
Look for: Device busy > 40% (hot volume)
Pending time > 2ms (controller queue depth)
Response time > 5ms (cache miss rate high)
Paging Activity Report — Real storage pressure
Look for: Page-in rate > 10/sec during batch (memory pressure)
Auxiliary slots > 80% (swap space critical)
Workload Activity Report — WLM service class performance
Look for: Batch service class velocity < 80 (WLM throttling)
Execution delays (WLM holding work for online priority)
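The "look for" thresholds above translate naturally into automated checks. A sketch — the metric names are our own shorthand, and the values would come from an RMF post-processor report, not from this code:

```python
# Evaluate batch-window resource metrics against the RMF rule-of-thumb
# thresholds listed above. Metric names are illustrative shorthand.

THRESHOLDS = {
    "cp_util_pct":      (85.0, "CP utilization > 85% (throttling risk)"),
    "channel_busy_pct": (70.0, "Channel busy > 70% (throughput bottleneck)"),
    "device_busy_pct":  (40.0, "Device busy > 40% (hot volume)"),
    "page_in_per_sec":  (10.0, "Page-in rate > 10/sec (memory pressure)"),
}

def rmf_alerts(metrics: dict) -> list:
    """Return the alert text for every metric that exceeds its threshold."""
    return [msg for name, (limit, msg) in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]

sample = {"cp_util_pct": 78.0, "channel_busy_pct": 92.0, "device_busy_pct": 31.0}
for alert in rmf_alerts(sample):
    print("⚠", alert)   # only the channel-busy check fires for this sample
```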
At Pinnacle Health Insurance, Ahmad Rashidi used RMF channel activity reports to discover that their two largest batch jobs were allocated to VSAM clusters on the same DASD volume group. Channel utilization hit 92% when both jobs ran concurrently. Simply moving one cluster to a different volume group dropped channel utilization to 54% and reduced combined elapsed time by 35%.
🔄 SPACED REVIEW — Chapter 23 (Batch Window): The batch window is a scheduling problem (Chapter 23), but the scheduling math assumes you know how long each job takes. The SMF and RMF measurements in this section are how you validate those assumptions. If your critical path estimate says 310 minutes and actual completion is 375 minutes, the SMF data tells you exactly where the 65 minutes went.
26.7 Advanced Techniques: Hiperbatch, Data-in-Memory, and zIIP Offload
These techniques are for shops that have already optimized the fundamentals and need to push further. Don't skip the basics to get here — 90% of batch performance gains come from Sections 26.2 through 26.5.
Hiperbatch (Data Lookaside Facility)
Hiperbatch caches sequential dataset blocks in hiperspaces managed by the Data Lookaside Facility, allowing multiple job steps to share cached data without re-reading from DASD.
How Hiperbatch works:
Step 1: Job A reads CNB.EOD.TRANS (50 million records)
→ QSAM reads from DASD, Hiperbatch caches blocks in a hiperspace
Step 2: Job B reads CNB.EOD.TRANS (same dataset)
→ QSAM finds blocks in Hiperbatch cache → No DASD I/O
Step 3: Job C reads CNB.EOD.TRANS (same dataset)
→ QSAM finds blocks in Hiperbatch cache → No DASD I/O
The benefit is enormous when the same sequential dataset is read by multiple jobs in the batch window — and this is extremely common. At CNB, the transaction extract file (CNB.EOD.TRANS) is read by six different jobs: validation, fraud detection, regulatory extract, ACH processing, statement generation, and reconciliation. Without Hiperbatch, that's six full reads from DASD. With Hiperbatch, it's one DASD read and five cache reads.
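The arithmetic behind the six-reader example is trivial but worth pinning down. A sketch, with an assumed per-pass EXCP count (the real number depends on BLKSIZE and dataset size):

```python
# DASD EXCPs for a dataset read by N jobs, with and without a
# Hiperbatch-style shared cache. EXCPS_PER_PASS is illustrative.

def dasd_excps(readers: int, excps_per_pass: int, cached: bool) -> int:
    """Without a cache every reader drives a full pass of DASD EXCPs;
    with one, only the first reader touches DASD."""
    passes = 1 if cached else readers
    return passes * excps_per_pass

EXCPS_PER_PASS = 400_000   # assumed EXCPs for one pass of CNB.EOD.TRANS
print(dasd_excps(6, EXCPS_PER_PASS, cached=False))   # 2,400,000
print(dasd_excps(6, EXCPS_PER_PASS, cached=True))    # 400,000
```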
Hiperbatch setup requires:
1. SMFPRMxx: Enable SMF Type 14/15 recording (already standard)
2. DLF (Data Lookaside Facility) configuration in COFDLFxx parmlib member
3. Dataset registration in the DLF configuration
4. No JCL changes required — transparent to applications
COFDLFxx parmlib configuration:
OBJECT(CNB.EOD.TRANS)
CONNECT(YES)
RLSQ(YES)
OBJECT(CNB.EOD.SORTED.*)
CONNECT(YES)
RLSQ(YES)
At CNB, Hiperbatch for the five most-read batch datasets eliminated 2.8 billion EXCP per month and reduced aggregate batch I/O wait time by 18%. Kwame Mensah called it "the closest thing to free performance I've seen in 30 years."
Data-in-Memory (z/OS Data Set Storage)
Data-in-memory takes Hiperbatch a step further by pre-loading entire datasets into memory before the batch window starts:
Implementation via SMS data class:
DATACLAS(INMEMORY)
SPACE TYPE: MEMORY
BUFFERING: SYSTEM-MANAGED
PRELOAD: YES
PRIORITY: HIGH
The dataset is loaded into real storage (or auxiliary storage with fast page-in) before the first job accesses it. All subsequent access is memory-speed — effectively zero I/O latency.
This technique is appropriate for:
- Reference tables read by many jobs (code tables, rate tables, parameter files)
- Small-to-medium datasets (under 2 GB) that are accessed repeatedly
- Datasets where DASD I/O is the dominant performance component
At Federal Benefits Administration, Sandra Chen used data-in-memory for the benefits rate tables (1.2 GB, accessed by 34 batch jobs every night). The rate tables were loaded at 10:45 PM, before the batch window opened, and remained in memory until 6:15 AM. Total I/O savings: 680 million EXCP per month for those 34 jobs.
zIIP Offload for Batch Processing
zIIP (z Integrated Information Processor) engines are specialty processors that run at the same speed as GP (General Purpose) processors but at a fraction of the software licensing cost. Work that runs on zIIP doesn't count toward your MIPS-based software bill.
In batch, zIIP-eligible work includes:
zIIP-eligible batch workload:
DB2 processing — DRDA-attached SQL and parallel query child tasks
XML processing — XML PARSE, XML GENERATE
DFSORT processing — Sort processing (partial offload)
Java processing — JVM work under LE (hybrid programs)
z/OS XML System Svcs — Schema validation
z/OS Communications — TCP/IP-related work
NOT zIIP-eligible:
COBOL compute logic — All COBOL code runs on GP
QSAM/BSAM I/O — Access method processing is GP
VSAM I/O — Access method processing is GP
LE runtime services — LE runs on GP
The key insight: DB2 work in batch becomes zIIP-eligible chiefly when it runs as parallel query child tasks or arrives over DRDA — one more reason DEGREE(ANY) matters for batch plans. For a batch program whose DB2 work parallelizes well, a large share of the DB2 CPU cost can shift to zIIP — reducing your GP MIPS consumption (and your IBM software bill) substantially.
To maximize zIIP offload:
1. Ensure zIIP engines are installed and activated
2. Review DSNZPARM IIPHONORPRIORITY (whether CPs help when zIIPs are busy)
3. Verify with SMF Type 30 that zIIP time is being recorded
4. Monitor zIIP utilization — if zIIPs are saturated, work falls back to GP
SMF Type 30 zIIP fields:
SMF30_zIIP_TIME — Time consumed on zIIP
SMF30_zIIP_ON_CP — zIIP-eligible work that ran on CP (overflow)
SMF30_zIIP_QUALIFY — Total zIIP-qualified time
Offload ratio = SMF30_zIIP_TIME / SMF30_zIIP_QUALIFY
Target: > 90% (if below 90%, zIIP capacity insufficient)
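The ratio check above is simple enough to script. A sketch — extracting the fields from raw SMF records is assumed to have happened upstream:

```python
# Compute the zIIP offload ratio from SMF Type 30 fields, per the
# formula above: qualified time = time on zIIP + eligible time that
# overflowed to a CP.

def ziip_offload_ratio(ziip_time: float, ziip_on_cp: float) -> float:
    qualify = ziip_time + ziip_on_cp
    return ziip_time / qualify if qualify else 1.0

ratio = ziip_offload_ratio(ziip_time=540.0, ziip_on_cp=36.0)   # seconds
print(f"offload ratio {ratio:.1%}",
      "OK" if ratio > 0.90 else "ADD zIIP CAPACITY")
```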
At SecureFirst Retail Bank, Yuki Nakamura's analysis showed that adding two zIIP engines could offload 35% of the overnight batch GP workload. The capital cost of the zIIP engines was recovered in 8 months through reduced software licensing fees. Carlos Vega, who had been pushing for cloud migration of batch workloads, examined the numbers and conceded: "I can't beat that ROI with any cloud provider."
Implementation Pitfalls and War Stories
Advanced techniques are powerful, but they come with operational complexity that can bite you in production. Here are the pitfalls I've seen — and caused — over 25 years.
Hiperbatch and GDG (Generation Data Group) Interactions
Hiperbatch caches by dataset name. GDG datasets change names every generation — CNB.EOD.TRANS.G0045V00 today becomes CNB.EOD.TRANS.G0046V00 tomorrow. If your DLF configuration specifies the absolute dataset name, Hiperbatch stops caching after one day. You must configure Hiperbatch with relative GDG references or use pattern matching in the COFDLFxx parmlib member.
Rob Calloway learned this the hard way when EXCP counts jumped back to pre-optimization levels on the second night after Hiperbatch deployment. The fix took 10 minutes. The diagnosis took 3 hours.
zIIP Overflow Under Load
When zIIP engines are saturated, eligible work falls back to GP processors automatically. This is by design — z/OS ensures workload completion. But it means your GP CPU consumption can spike unexpectedly during peak periods. If your capacity plan assumes a certain zIIP offload ratio, a busy month (like CNB's Q4) can blow your GP budget.
Lisa Tran monitors the zIIP overflow ratio (SMF30_zIIP_ON_CP / SMF30_zIIP_QUALIFY) weekly. When it exceeds 15%, she flags it for capacity review. At that point, either additional zIIP capacity is needed or some batch workloads need to be rescheduled to spread the zIIP demand across the window.
Data-in-Memory and Real Storage Pressure
Pre-loading datasets into real storage is only free if you have the memory to spare. If the pre-loaded data displaces other programs' working sets, you trade I/O savings for paging overhead. The net effect can be negative — more total I/O, not less — because paging I/O is random and cannot exploit sequential prefetch.
At one shop I consulted with (not one of our anchor examples), the systems programmer loaded 8 GB of reference tables into data-in-memory. Batch jobs ran beautifully — zero I/O for reference lookups. But CICS regions started paging because the 8 GB consumed real storage that CICS needed for its buffer pools and dynamic storage areas. Online response times degraded from 0.3 seconds to 1.2 seconds. The change was backed out the same night.
The lesson: data-in-memory requires coordination with the LPAR's real storage configuration and WLM's storage management goals. It's a system-wide decision, not a batch-only decision.
Buffer Tuning and Region Size
Increasing BUFNO and BUFNI allocates memory from the program's region. If the region is too small, the additional buffers cause S878 abends (insufficient virtual storage). At CNB, the batch JCL templates specify REGION=0 (unlimited) for critical-path jobs, but some older JCL still has REGION=64M or REGION=128M from the days when real storage was scarce. Jerome Washington added a pre-production check to the deployment pipeline that verifies region size is sufficient for the specified buffer allocations.
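A check in the spirit of Jerome's pipeline gate can be sketched in a few lines. The 75% headroom factor and the DD list are our own illustrative assumptions, not CNB's actual rules:

```python
# Pre-production sanity check: do the requested buffers fit in the
# job step's region? Each DD consumes roughly BUFNO * BLKSIZE of
# region storage; leave headroom for the program itself.

def buffers_fit(region_bytes: int, dds: list, headroom: float = 0.75) -> bool:
    buffer_bytes = sum(bufno * blksize for _, bufno, blksize in dds)
    return buffer_bytes <= region_bytes * headroom

dds = [("TRANSIN", 30, 27_998), ("SORTOUT", 20, 27_998)]  # (ddname, BUFNO, BLKSIZE)
print(buffers_fit(64 * 1024 * 1024, dds))   # True: ~1.4 MB of buffers in 64 MB
```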
Combining Advanced Techniques
The advanced techniques are most powerful when combined:
Technique Stack for Maximum Batch Performance:
Layer 1: Architecture (Chapter 23)
✓ Critical path minimized
✓ Dependencies cleaned
✓ Parallel streams identified
Layer 2: I/O Optimization (Section 26.2)
✓ Half-track BLKSIZE
✓ BUFNO=20-30 for sequential
✓ Full index buffering for VSAM
Layer 3: SORT Optimization (Section 26.3)
✓ DFSORT for all sort/merge/reformat
✓ INCLUDE/OMIT to filter before sort
✓ INREC to reduce record size before sort
✓ MAINSIZE=MAX, HIPRMAX=OPTIMAL
Layer 4: COBOL Optimization (Section 26.4)
✓ OPT(2) for production
✓ FASTSRT for SORT-containing programs
✓ NOSEQCHK for production
Layer 5: DB2 Optimization (Section 26.5)
✓ Commit every 1,000-5,000
✓ FOR FETCH ONLY on read cursors
✓ Sequential/list prefetch verified
✓ DEGREE(ANY) for batch plans
Layer 6: Advanced (Section 26.7)
✓ Hiperbatch for multi-reader datasets
✓ Data-in-memory for reference tables
✓ zIIP offload for DB2-heavy batch
The CNB Performance Project: Final Results
After implementing optimizations from all six layers over a 12-week period, here are the final results for CNB's batch window:
Before After Improvement
──────────────────────────────────────────────────────────
Critical path: 310 min 188 min 39.4%
Total elapsed: 375 min 228 min 39.2%
Total EXCP: 48.2M 18.7M 61.2%
Total CPU (GP): 142 min 89 min 37.3%
Total CPU (zIIP): 0 min 31 min (new capacity)
Window margin: 30 min 162 min 440%
Monthly MSU: 3,840 2,590 32.6%
The critical path dropped from 310 minutes to 188 minutes — a 39.4% improvement. That's not one optimization; it's the cumulative effect of disciplined, measurement-driven work across every layer of the stack.
The EXCP reduction of 61.2% came primarily from buffer tuning (Section 26.2) and Hiperbatch (Section 26.7). The MSU reduction of 32.6% came from zIIP offload and OPT(2) compilation.
But the number that made Rob Calloway smile was the margin: 162 minutes. From a window that was blowing at 375 minutes to one that finishes with nearly three hours to spare. "That's 12 months of volume growth at 3%," Rob calculated. "I won't have to re-engineer this window for another year."
Kwame Mensah's perspective was broader: "We didn't just fix the batch window. We proved that the mainframe platform can scale to handle the mobile banking volumes that were supposed to require a cloud migration. The hardware was never the bottleneck — our configuration was."
That's the real lesson of batch performance at scale. The z/OS platform, properly configured, is the highest-throughput batch processing engine ever built. But "properly configured" requires understanding every layer of the I/O path, every DFSORT parameter, every compiler option, and every DB2 access path. There are no magic switches. There is only disciplined measurement, informed configuration, and systematic optimization.
And that, as Kwame would say, is architecture.
Chapter Summary
Batch performance optimization follows a strict priority stack: eliminate unnecessary work first, then optimize I/O (buffers, BLKSIZE, access methods), then tune DFSORT, then adjust compiler options, then tune DB2 access paths, and only then consider advanced techniques like Hiperbatch and zIIP offload.
The foundation is measurement. Without a performance decomposition of CPU, I/O, DB2, and other wait time, you're guessing — and in a production batch window, guessing is expensive. SMF Type 30 records provide the definitive data. RMF reports provide the system-wide context. Together, they tell you exactly where the time goes and where optimization effort should be invested.
DFSORT is the most underused performance tool on z/OS. It operates below the access method layer, exploits hardware-aware sort algorithms, and replaces thousands of lines of COBOL with dozens of lines of control statements. The shops that use DFSORT aggressively — replacing COBOL sort/merge/reformat programs with DFSORT/ICETOOL JCL — consistently outperform shops that treat DFSORT as "just the sort utility."
The compiler matters. OPT(2) can save 20-40% of CPU time on compute-intensive batch programs. FASTSRT delegates SORT I/O to DFSORT for a 30-50% improvement. These are configuration changes, not code changes — the highest-ROI optimization category.
For DB2-bound batch, commit frequency is the pivotal decision. Too frequent (every record) wastes time on synchronization. Too infrequent (never) risks lock escalation and long recovery. The sweet spot is every 1,000–5,000 records, depending on contention patterns.
Advanced techniques — Hiperbatch, data-in-memory, zIIP offload — provide the final 15-20% of improvement for shops that have already mastered the fundamentals. They're powerful but they're not substitutes for proper BLKSIZE, buffer tuning, and DFSORT usage.
The CNB performance project demonstrated that a disciplined, measurement-driven approach across all six optimization layers can reduce batch elapsed time by 40% and EXCP count by 60%. That's not theory. That's Tuesday night at Continental National Bank.