In This Chapter
- 36.1 The Performance Mindset
- 36.2 CPU Optimization
- 36.3 I/O Reduction
- 36.4 WORKING-STORAGE Layout
- 36.5 Compiler Options
- 36.6 SQL Performance
- 36.7 CICS Performance
- 36.8 Batch Job Tuning
- 36.9 Profiling Tools
- 36.10 GlobalBank Case Study: Optimizing the Nightly Batch
- 36.11 MedClaim Case Study: Tuning Claim Adjudication
- 36.12 Performance Tuning Checklist
- 36.13 VSAM Tuning Deep Dive
- 36.14 Compiler Optimization Flags Deep Dive
- 36.15 SQL EXPLAIN Analysis
- 36.16 Memory Layout Optimization
- 36.17 Batch I/O Optimization Patterns
- 36.18 MedClaim Performance Case Study: Tuning the Daily Eligibility Batch
- 36.19 Performance Monitoring and Regression Detection
- 36.20 GlobalBank Post-Optimization Monitoring
- 36.21 Summary
Chapter 36: Performance Tuning
"Premature optimization is the root of all evil." — Donald Knuth
"But ignoring performance until your batch window is blown is the root of weekend overtime." — Maria Chen (probably)
GlobalBank's nightly batch cycle had a problem. The cycle — which included account updates, interest calculations, report generation, and regulatory feeds — had a strict window: it had to start at 11:00 PM and complete by 5:00 AM, when online banking came back up. For years, the cycle completed comfortably within that window, finishing around 3:30 AM. But as the bank grew — more accounts, more transaction types, more regulatory requirements — the cycle crept later. 3:45 AM. Then 4:00. Then 4:15.
One Friday night, the cycle didn't finish until 5:47 AM. Online banking was delayed for 47 minutes. The CIO's phone rang. By Monday morning, Maria Chen had a new assignment: cut the batch cycle from four hours to ninety minutes.
"That's a 62% reduction," Derek Washington said, doing the math. "Is that even possible without rewriting everything?"
"We're not rewriting anything," Maria replied. "We're going to make what we have run faster. And the first thing we're going to do is measure."
This chapter teaches you how to think about COBOL performance — what matters, what doesn't, and how to find and fix the bottlenecks that actually slow your programs down.
36.1 The Performance Mindset
Before you optimize a single line of code, internalize three principles:
Principle 1: Measure First
The most dangerous performance optimization is one based on a guess. Developers are notoriously bad at predicting where their programs spend time. A paragraph you think is slow may execute in microseconds; a paragraph you never considered may be the bottleneck because it runs a million times.
Always profile before optimizing. We'll discuss profiling tools in Section 36.9.
Principle 2: I/O Dominates
On a mainframe, a single disk I/O operation takes approximately 5-10 milliseconds. A single CPU instruction takes approximately 1 nanosecond. That means one I/O operation takes as long as 5-10 million CPU instructions. In most COBOL batch programs, 80-95% of elapsed time is spent waiting for I/O. Optimizing CPU-bound logic in a program that is I/O-bound is like polishing the hubcaps on a car with a blown engine.
Principle 3: Optimize the Hot Path
The "hot path" is the code that executes most frequently. In a batch program processing 2 million records, an optimization that saves 1 millisecond per record saves 33 minutes. The same optimization in a paragraph that runs once per job saves 1 millisecond total — not worth the effort or the risk.
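The hot-path arithmetic is worth checking mechanically. This is a Python back-of-the-envelope sketch of the chapter's own figures (1 ms saved per call, 2 million records), not mainframe code:

```python
# Back-of-the-envelope: what a fixed per-call saving is worth,
# depending on how often the code runs.

def total_saving_minutes(saving_ms_per_call: float, calls: int) -> float:
    """Total elapsed-time saving, in minutes."""
    return saving_ms_per_call * calls / 1000.0 / 60.0

hot_path  = total_saving_minutes(1.0, 2_000_000)  # runs once per record
cold_path = total_saving_minutes(1.0, 1)          # runs once per job

print(f"hot path:  {hot_path:.1f} minutes saved")   # ~33.3 minutes
print(f"cold path: {cold_path:.7f} minutes saved")  # negligible
```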
⚠️ Defensive Programming: Every performance optimization is a change to working code. Every change introduces risk. Never optimize code that doesn't need optimization, and always have a regression test suite (Chapter 34) in place before you start. The worst outcome of performance tuning is a faster program that produces wrong results.
36.2 CPU Optimization
While I/O usually dominates, CPU optimization matters in compute-intensive code — financial calculations, data transformations, and high-iteration loops.
Efficient Data Types
The choice of data type has a dramatic effect on arithmetic performance:
| Data Type | PICTURE | Storage | Arithmetic Speed |
|---|---|---|---|
| DISPLAY (zoned decimal) | PIC 9(9) | 9 bytes | Slowest — CPU must convert to binary |
| COMP-3 (packed decimal) | PIC 9(9) COMP-3 | 5 bytes | Moderate — hardware decimal support |
| COMP / BINARY | PIC 9(9) COMP | 4 bytes | Fastest — native binary arithmetic |
For fields used in arithmetic, always use COMP-3 or COMP:
* SLOW: Arithmetic on DISPLAY fields
01 WS-AMOUNT-DISP PIC 9(9)V99.
01 WS-RATE-DISP PIC V9(4).
01 WS-RESULT-DISP PIC 9(9)V99.
MULTIPLY WS-AMOUNT-DISP BY WS-RATE-DISP
GIVING WS-RESULT-DISP.
* CPU must: unpack -> convert -> multiply -> convert -> pack
* FAST: Arithmetic on COMP-3 fields
01 WS-AMOUNT-PKD PIC 9(9)V99 COMP-3.
01 WS-RATE-PKD PIC V9(4) COMP-3.
01 WS-RESULT-PKD PIC 9(9)V99 COMP-3.
MULTIPLY WS-AMOUNT-PKD BY WS-RATE-PKD
GIVING WS-RESULT-PKD.
* CPU: hardware decimal multiply (one instruction)
📊 By the Numbers: In benchmarks on a z15 processor, COMP-3 arithmetic is approximately 3x faster than DISPLAY arithmetic. COMP (binary) arithmetic is approximately 5x faster than DISPLAY. For a program performing 10 million calculations per run, converting from DISPLAY to COMP-3 can save significant CPU time.
COMPUTE vs. Arithmetic Verbs
The COMPUTE statement is generally as fast or faster than individual arithmetic verbs, because the compiler can optimize the entire expression:
* SLOWER: Individual arithmetic verbs
MULTIPLY WS-RATE BY WS-PRINCIPAL
GIVING WS-TEMP
DIVIDE WS-TEMP BY 365
GIVING WS-DAILY-AMT
MULTIPLY WS-DAILY-AMT BY WS-DAYS
GIVING WS-INTEREST
* FASTER: Single COMPUTE (compiler optimizes entire expression)
COMPUTE WS-INTEREST ROUNDED =
WS-PRINCIPAL * WS-RATE * WS-DAYS / 365
The COMPUTE version generates fewer intermediate storage operations and allows the compiler to use registers more efficiently.
SEARCH ALL vs. SEARCH
For table lookups, the choice between SEARCH (linear) and SEARCH ALL (binary) has profound performance implications:
Linear search (SEARCH): O(n) — checks each entry in sequence
Binary search (SEARCH ALL): O(log n) — requires sorted table, halves search space each step
| Table Size | Linear Search (avg) | Binary Search (max) | Speedup |
|---|---|---|---|
| 10 | 5 comparisons | 4 comparisons | 1.25x |
| 100 | 50 comparisons | 7 comparisons | 7x |
| 1,000 | 500 comparisons | 10 comparisons | 50x |
| 10,000 | 5,000 comparisons | 14 comparisons | 357x |
| 100,000 | 50,000 comparisons | 17 comparisons | 2,941x |
* LINEAR SEARCH: O(n) — fine for small tables
SET WS-IDX TO 1
SEARCH WS-STATE-TABLE
AT END
MOVE "UNKNOWN" TO WS-OUTPUT-NAME
WHEN WS-STATE-CODE(WS-IDX) = WS-INPUT-STATE
MOVE WS-STATE-NAME(WS-IDX) TO WS-OUTPUT-NAME
END-SEARCH
* BINARY SEARCH: O(log n) — required for large tables
* Table MUST be sorted by key (ASCENDING KEY clause)
SEARCH ALL WS-STATE-TABLE
AT END
MOVE "UNKNOWN" TO WS-OUTPUT-NAME
WHEN WS-STATE-CODE(WS-IDX) = WS-INPUT-STATE
MOVE WS-STATE-NAME(WS-IDX) TO WS-OUTPUT-NAME
END-SEARCH
For a table of 50 US states, linear search is fine. For a table of 100,000 procedure codes, binary search is essential.
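The comparison counts in the table above follow directly from the two complexities. A short Python sketch reproduces them:

```python
import math

def linear_avg_comparisons(n: int) -> float:
    """Average comparisons for a serial SEARCH: hit halfway through."""
    return n / 2

def binary_max_comparisons(n: int) -> int:
    """Worst-case comparisons for SEARCH ALL: halve the space each step."""
    return math.ceil(math.log2(n))

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>7}: linear {linear_avg_comparisons(n):>8.0f}, "
          f"binary {binary_max_comparisons(n):>3}")
```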
Loop Optimization
Minimize work inside high-iteration loops:
* SLOW: Redundant computation inside loop
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-RECORD-COUNT
COMPUTE WS-TAX-RATE =
FUNCTION NUMVAL(FUNCTION CURRENT-DATE(1:4)) * 0.001
* ^^^ Recomputed every iteration but never changes!
COMPUTE WS-TAX(WS-IDX) =
WS-AMOUNT(WS-IDX) * WS-TAX-RATE
END-PERFORM
* FAST: Move invariant computation outside loop
COMPUTE WS-TAX-RATE =
FUNCTION NUMVAL(FUNCTION CURRENT-DATE(1:4)) * 0.001
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-RECORD-COUNT
COMPUTE WS-TAX(WS-IDX) =
WS-AMOUNT(WS-IDX) * WS-TAX-RATE
END-PERFORM
Conditional Ordering
In compound conditions, put the most likely-to-fail condition first:
* If 95% of claims have STATUS = "A" but only 5% exceed
* 50,000, check the amount first
* SLOWER: Common (usually true) condition checked first
IF CLM-STATUS = "A"
AND CLM-AMOUNT > 50000
...
* FASTER: Selective condition checked first (short-circuit)
IF CLM-AMOUNT > 50000
AND CLM-STATUS = "A"
...
COBOL evaluates AND conditions left-to-right. If the first condition is false, the second is not evaluated. Putting the most selective (most likely to be false) condition first reduces total comparisons.
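Under left-to-right short-circuit evaluation, the expected number of condition tests for A AND B is 1 plus the probability that A is true. A small Python model, using the text's illustrative 95%/5% split:

```python
def expected_tests(p_first_true: float) -> float:
    """Expected condition evaluations for A AND B with left-to-right
    short-circuit: B is tested only when A is true."""
    return 1 + p_first_true

# 95% of claims have STATUS = "A"; assume only 5% exceed the threshold.
status_first = expected_tests(0.95)  # common condition first
amount_first = expected_tests(0.05)  # selective condition first

print(status_first, amount_first)  # ~1.95 vs ~1.05 tests per record
```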
✅ Try It Yourself: Write a program that performs a table lookup 1,000,000 times using both SEARCH and SEARCH ALL on a 1,000-entry sorted table. Time both approaches. On GnuCOBOL, you can use FUNCTION CURRENT-DATE before and after to measure elapsed time.
36.3 I/O Reduction
Since I/O dominates most COBOL programs, I/O optimization yields the greatest returns.
Buffering and Block Size
When you read a sequential file, the operating system reads one block at a time from disk. A block contains multiple logical records. The larger the block, the fewer I/O operations needed to read the entire file.
File: 1,000,000 records, each 100 bytes
Block Size Records/Block Blocks to Read I/O Operations
----------- ------------- -------------- ---------------
100 bytes 1 1,000,000 1,000,000
800 bytes 8 125,000 125,000
8,000 bytes 80 12,500 12,500
32,000 bytes 320 3,125 3,125
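The blocking arithmetic generalizes to any record and block length. A Python sketch of the table's calculation:

```python
def sequential_reads(records: int, lrecl: int, blksize: int) -> int:
    """One physical I/O per block: ceil(records / records_per_block)."""
    records_per_block = blksize // lrecl
    return -(-records // records_per_block)  # ceiling division

# The chapter's file: 1,000,000 records of 100 bytes each.
for blksize in (100, 800, 8_000, 32_000):
    n = sequential_reads(1_000_000, 100, blksize)
    print(f"BLKSIZE {blksize:>6}: {n:>9,} I/Os")
```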
In JCL, specify block size on the DD statement:
//INPUT DD DSN=PROD.TRANS.FILE,DISP=SHR,
// DCB=(RECFM=FB,LRECL=100,BLKSIZE=32000)
For VSAM files, the Control Interval (CI) size serves a similar purpose. Larger CI sizes reduce I/O for sequential access patterns.
VSAM Tuning
VSAM performance depends heavily on these parameters:
Buffer allocation: More buffers = more data cached in memory = fewer disk I/Os.
//ACCTMSTR DD DSN=PROD.ACCT.MASTER,DISP=SHR,
// AMP=('BUFND=20,BUFNI=10')
BUFND: Number of data buffers (for data Control Intervals)
BUFNI: Number of index buffers (for index records)
General rule: For sequential access, set BUFND high (20+). For random access, set BUFNI high to cache the index.
CI/CA Split tuning: When a VSAM KSDS runs out of space in a Control Interval, it splits — moving half the records to a new CI. Excessive splits degrade performance. Monitor split frequency and reorganize datasets periodically.
Sequential vs. Random Access
If you need to process more than 20-30% of a VSAM file's records, sequential access is faster than random access, even if you skip records:
* SLOW: Random access for 500,000 out of 1,000,000 records
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-KEY-COUNT
MOVE WS-KEY-TABLE(WS-IDX) TO ACCT-KEY
READ ACCT-MASTER
KEY IS ACCT-KEY
INVALID KEY CONTINUE
NOT INVALID KEY
PERFORM 3000-PROCESS-ACCOUNT
END-READ
END-PERFORM
* Each READ is a random I/O: ~500,000 I/O operations
* FAST: Sequential read with skip logic
PERFORM UNTIL END-OF-FILE
READ ACCT-MASTER NEXT
AT END SET END-OF-FILE TO TRUE
NOT AT END
PERFORM 2500-CHECK-IF-NEEDED
END-READ
END-PERFORM
* Sequential read uses buffering: ~3,125 I/O operations
* (with 32K block size)
Minimize File Opens
Each OPEN/CLOSE cycle has overhead. If you process the same file in multiple program sections, open it once and close it once:
* SLOW: Open/close for each processing phase
PERFORM 1000-PHASE-ONE
PERFORM 2000-PHASE-TWO
1000-PHASE-ONE.
OPEN INPUT MASTER-FILE
...process...
CLOSE MASTER-FILE.
2000-PHASE-TWO.
OPEN INPUT MASTER-FILE
...process...
CLOSE MASTER-FILE.
* FAST: Single open/close
0000-MAIN.
OPEN INPUT MASTER-FILE
PERFORM 1000-PHASE-ONE
PERFORM 2000-PHASE-TWO
CLOSE MASTER-FILE.
36.4 WORKING-STORAGE Layout
How you arrange data in WORKING-STORAGE affects performance through alignment and locality effects.
Alignment and Slack Bytes
On IBM mainframes, the hardware accesses memory most efficiently when data items are aligned on their natural boundaries:
| Data Type | Alignment | Slack if Misaligned |
|---|---|---|
| COMP (halfword, PIC S9(4)) | 2-byte boundary | Up to 1 slack byte |
| COMP (fullword, PIC S9(9)) | 4-byte boundary | Up to 3 slack bytes |
| COMP (doubleword, PIC S9(18)) | 8-byte boundary | Up to 7 slack bytes |
| COMP-1 (float) | 4-byte boundary | Up to 3 slack bytes |
| COMP-2 (double) | 8-byte boundary | Up to 7 slack bytes |
When binary and floating-point items carry the SYNCHRONIZED (SYNC) clause, the compiler inserts invisible "slack bytes" to align them. You can minimize slack by ordering fields from largest to smallest alignment requirement:
* POOR LAYOUT: Slack bytes between fields
01 WS-RECORD.
05 WS-FLAG PIC X. *> 1 byte
* (3 slack bytes inserted here for alignment)
05 WS-AMOUNT PIC S9(9) COMP SYNC. *> 4 bytes, fullword
05 WS-CODE PIC X(3). *> 3 bytes
* (1 slack byte inserted here)
05 WS-COUNTER PIC S9(4) COMP SYNC. *> 2 bytes, halfword
* Total: 1+3+4+3+1+2 = 14 bytes (4 wasted on slack)
* OPTIMAL LAYOUT: No slack bytes
01 WS-RECORD.
05 WS-AMOUNT PIC S9(9) COMP SYNC. *> 4 bytes (fullword)
05 WS-COUNTER PIC S9(4) COMP SYNC. *> 2 bytes (halfword)
05 WS-CODE PIC X(3). *> 3 bytes
05 WS-FLAG PIC X. *> 1 byte
* Total: 4+2+3+1 = 10 bytes (0 wasted)
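The slack-byte totals can be computed mechanically from each field's size and alignment requirement. A Python sketch (remember that Enterprise COBOL inserts slack bytes only for SYNCHRONIZED items):

```python
def record_size(fields) -> int:
    """Bytes for a record when each field must start on its alignment
    boundary; slack bytes are the padding inserted to get there.
    fields = [(size_in_bytes, alignment_in_bytes), ...]"""
    offset = 0
    for size, align in fields:
        offset += -offset % align  # slack bytes up to the boundary
        offset += size
    return offset

poor    = [(1, 1), (4, 4), (3, 1), (2, 2)]  # FLAG, AMOUNT, CODE, COUNTER
optimal = [(4, 4), (2, 2), (3, 1), (1, 1)]  # AMOUNT, COUNTER, CODE, FLAG

print(record_size(poor), record_size(optimal))  # 14 vs 10 bytes
```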
Frequently-Used Fields First
Place the most frequently accessed fields at the beginning of WORKING-STORAGE. While modern hardware caching minimizes this effect, it can matter for very hot loops:
WORKING-STORAGE SECTION.
* Most frequently used fields first
01 WS-HOT-FIELDS.
05 WS-RECORD-COUNT PIC 9(9) COMP.
05 WS-CURRENT-KEY PIC X(10).
05 WS-PROCESS-FLAG PIC X.
05 WS-RUNNING-TOTAL PIC S9(11)V99 COMP-3.
* Less frequently used fields later
01 WS-REPORT-FIELDS.
05 WS-PAGE-COUNT PIC 9(5).
05 WS-LINE-COUNT PIC 9(3).
...
Group MOVE vs. Field-by-Field MOVE
A single group MOVE is faster than moving individual fields:
* SLOWER: Multiple individual MOVEs
MOVE WS-NAME TO OUT-NAME
MOVE WS-ADDRESS TO OUT-ADDRESS
MOVE WS-CITY TO OUT-CITY
MOVE WS-STATE TO OUT-STATE
MOVE WS-ZIP TO OUT-ZIP
* FASTER: Single group MOVE (if layouts match)
MOVE WS-CUSTOMER-DATA TO OUT-CUSTOMER-DATA
However, this only works when the source and target group items have identical layouts. If they differ, use MOVE CORRESPONDING:
* MODERATE: MOVE CORRESPONDING
MOVE CORRESPONDING WS-CUSTOMER TO OUT-RECORD
36.5 Compiler Options
Enterprise COBOL's compiler options significantly affect performance. The key options are:
OPTIMIZE
The OPTIMIZE option controls the compiler's optimization level:
| Option | Effect | Trade-off |
|---|---|---|
| NOOPTIMIZE | No optimization | Fastest compilation, easiest debugging |
| OPTIMIZE(STD) | Standard optimization | Good performance, reasonable compile time |
| OPTIMIZE(FULL) | Aggressive optimization | Best performance, longer compilation, harder to debug |
OPTIMIZE(FULL) can improve CPU performance by 10-30% through:
- Eliminating redundant computations
- Optimizing register usage
- Eliminating dead code
- Inlining small paragraphs
NUMPROC
Controls how the compiler handles sign processing for packed decimal:
| Option | Effect |
|---|---|
| NUMPROC(NOPFD) | Validates signs on every operation (safe but slow) |
| NUMPROC(PFD) | Assumes data has preferred signs (fast but requires clean data) |
| NUMPROC(MIG) | Migration mode — accepts any valid sign |
NUMPROC(PFD) can improve decimal arithmetic performance by 10-15%, but will produce incorrect results if data has non-preferred sign codes. Use only when you can guarantee clean data.
TRUNC
Controls truncation behavior for COMP (binary) fields:
| Option | Effect |
|---|---|
| TRUNC(STD) | Truncates to PIC size (safe, matches language standard) |
| TRUNC(OPT) | Truncates to native binary size (faster, may differ from PIC) |
| TRUNC(BIN) | Treats all COMP as native binary (fastest, non-standard) |
Example: PIC S9(4) COMP occupies a halfword (2 bytes = range -32768 to 32767). With TRUNC(STD), values are truncated to -9999 to 9999. With TRUNC(OPT), values use the full halfword range. This matters for loop counters and indices.
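A rough Python model of the two truncation behaviors for a PIC S9(4) COMP item. This approximates the semantics for illustration; it is not the exact compiler rule set:

```python
def trunc_std(value: int, digits: int = 4) -> int:
    """TRUNC(STD)-style: keep only as many decimal digits as the PICTURE."""
    sign = -1 if value < 0 else 1
    return sign * (abs(value) % 10 ** digits)

def trunc_bin(value: int) -> int:
    """TRUNC(BIN)-style: wrap into the signed halfword range."""
    return (value + 32_768) % 65_536 - 32_768

print(trunc_std(12345), trunc_bin(12345))  # 2345 vs 12345
print(trunc_std(70000), trunc_bin(70000))  # 0 vs 4464
```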
SSRANGE
SSRANGE — Runtime subscript range checking (safe, 10-20% slower)
NOSSRANGE — No range checking (fast, risk of storage overlays)
Use SSRANGE during development and testing. Consider NOSSRANGE for production if performance is critical — but only if your tests are thorough.
💡 The Modernization Spectrum: Compiler options represent one of the easiest performance wins — changing a JCL compile step can improve performance by 10-30% with zero code changes. This is the lowest-risk, highest-return optimization available.
36.6 SQL Performance
For programs with embedded SQL (DB2), SQL performance often dominates everything else. A single poorly-written query can be slower than the entire rest of the program.
Avoid Full Table Scans
* SLOW: Full table scan (no index on CLAIM_DATE)
EXEC SQL
SELECT COUNT(*)
INTO :WS-CLAIM-COUNT
FROM CLAIMS
WHERE CLAIM_DATE >= '2025-01-01'
END-EXEC
* FAST: Use index on CLAIM_DATE
* (Ensure index exists: CREATE INDEX IX_CLAIMS_DATE
* ON CLAIMS (CLAIM_DATE))
The query itself doesn't change — the performance difference comes from the index. Use EXPLAIN to verify your query uses an index.
FETCH FIRST for Existence Checks
When you only need to know if a row exists, don't retrieve all matching rows:
* SLOW: Fetches potentially thousands of rows
EXEC SQL
SELECT MEMBER_ID
INTO :WS-MEMBER-ID
FROM MEMBERS
WHERE MEMBER_STATUS = 'A'
AND MEMBER_STATE = :WS-STATE
END-EXEC
* FAST: Stop after first match
EXEC SQL
SELECT MEMBER_ID
INTO :WS-MEMBER-ID
FROM MEMBERS
WHERE MEMBER_STATUS = 'A'
AND MEMBER_STATE = :WS-STATE
FETCH FIRST 1 ROW ONLY
END-EXEC
Use Host Variable Arrays for Bulk Operations
Instead of fetching one row at a time, fetch blocks of rows:
01 WS-CLAIM-ARRAY.
05 WS-CLAIM-ID PIC X(8) OCCURS 100 TIMES.
05 WS-CLAIM-AMT PIC S9(9)V99 COMP-3
OCCURS 100 TIMES.
01 WS-FETCH-COUNT PIC S9(4) COMP.
* SLOW: One row per FETCH
PERFORM UNTIL SQLCODE NOT = 0
EXEC SQL
FETCH CLAIM-CURSOR
INTO :WS-SINGLE-ID, :WS-SINGLE-AMT
END-EXEC
IF SQLCODE = 0
PERFORM 3000-PROCESS-CLAIM
END-IF
END-PERFORM
* FAST: 100 rows per FETCH
PERFORM UNTIL SQLCODE NOT = 0
EXEC SQL
FETCH CLAIM-CURSOR
FOR 100 ROWS
INTO :WS-CLAIM-ID, :WS-CLAIM-AMT
END-EXEC
MOVE SQLERRD(3) TO WS-FETCH-COUNT
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-FETCH-COUNT
PERFORM 3000-PROCESS-CLAIM
END-PERFORM
END-PERFORM
Multi-row FETCH reduces the number of DB2 interactions by a factor of 100, dramatically reducing overhead.
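The same block-fetch pattern exists in most database APIs. As an illustration outside COBOL, here is a Python DB-API sketch using an in-memory SQLite table; the table name and data are invented for the demo:

```python
import sqlite3

# Build a throwaway in-memory table standing in for the CLAIM cursor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id TEXT, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)",
                 [(f"C{i:07d}", i * 1.5) for i in range(250)])

cur = conn.cursor()
cur.execute("SELECT claim_id, amount FROM claims")

fetch_calls = 0
rows_seen = 0
while True:
    block = cur.fetchmany(100)  # 100 rows per call, like FOR 100 ROWS
    if not block:
        break
    fetch_calls += 1
    rows_seen += len(block)

print(fetch_calls, rows_seen)  # 3 fetch calls cover all 250 rows
```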
Avoid SQL in Loops
* TERRIBLE: SQL inside a loop — N queries for N records
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-CLAIM-COUNT
EXEC SQL
SELECT PROVIDER_NAME
INTO :WS-PROV-NAME
FROM PROVIDERS
WHERE PROVIDER_ID = :WS-PROV-ID(WS-IDX)
END-EXEC
END-PERFORM
* BETTER: Join in a single query
EXEC SQL
DECLARE CLAIM-PROV-CURSOR CURSOR FOR
SELECT C.CLAIM_ID, P.PROVIDER_NAME
FROM CLAIMS C
JOIN PROVIDERS P ON C.PROVIDER_ID = P.PROVIDER_ID
WHERE C.BATCH_DATE = :WS-BATCH-DATE
END-EXEC
36.7 CICS Performance
For online programs running under CICS, performance tuning has different priorities than batch.
COMMAREA Sizing
The COMMAREA (Communication Area) passes data between CICS transactions. Keep it as small as possible:
* POOR: Oversized COMMAREA
01 DFHCOMMAREA.
05 CA-CUSTOMER-DATA.
10 CA-CUST-NAME PIC X(100).
10 CA-CUST-ADDR PIC X(200).
10 CA-CUST-HISTORY PIC X(5000).
* 5,300 bytes copied on every RETURN TRANSID
* Most of it unchanged between interactions
* BETTER: Minimal COMMAREA with key references
01 DFHCOMMAREA.
05 CA-CUST-ID PIC X(10).
05 CA-SCREEN-STATE PIC X(2).
05 CA-LAST-ACTION PIC X.
05 CA-ERROR-CODE PIC X(4).
* 17 bytes — re-read customer data from DB2/VSAM when needed
BMS Map Optimization
Reduce BMS (Basic Mapping Support) overhead by sending only changed fields:
* SLOW: Send entire map every time
EXEC CICS SEND MAP('ACCTMAP')
MAPSET('ACCTMS')
ERASE
END-EXEC
* FAST: Send only data (not format) when map already displayed
EXEC CICS SEND MAP('ACCTMAP')
MAPSET('ACCTMS')
DATAONLY
END-EXEC
Avoid Excessive CICS Commands
Each CICS command (READ, WRITE, LINK, etc.) has overhead for command-level processing. Batch CICS operations when possible:
* SLOW: Multiple READQ TS for individual fields
EXEC CICS READQ TS QUEUE('MYQUEUE')
INTO(WS-FIELD-1) ITEM(1) END-EXEC
EXEC CICS READQ TS QUEUE('MYQUEUE')
INTO(WS-FIELD-2) ITEM(2) END-EXEC
EXEC CICS READQ TS QUEUE('MYQUEUE')
INTO(WS-FIELD-3) ITEM(3) END-EXEC
* FAST: Single READQ TS for a group item
EXEC CICS READQ TS QUEUE('MYQUEUE')
INTO(WS-ALL-FIELDS) ITEM(1) END-EXEC
36.8 Batch Job Tuning
Beyond program-level optimization, batch job JCL tuning can yield significant improvements.
REGION Size
The REGION parameter controls how much memory the job step can use. Too little causes ABENDs; too much wastes resources:
//* Too small: May cause S878 ABEND
//STEP1 EXEC PGM=BALCALC,REGION=2M
//* Appropriate: Enough for buffers and working storage
//STEP1 EXEC PGM=BALCALC,REGION=64M
//* Excessive: Wastes memory
//STEP1 EXEC PGM=BALCALC,REGION=0M
//* REGION=0M means "give me everything" — avoid in production
Sort Optimization
DFSORT (or SyncSort) operations often account for a large fraction of batch elapsed time. Key tuning parameters:
//SORT EXEC PGM=SORT
//SORTIN DD DSN=PROD.UNSORTED.FILE,DISP=SHR
//SORTOUT DD DSN=PROD.SORTED.FILE,DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(100,50),RLSE),
// DCB=(RECFM=FB,LRECL=200,BLKSIZE=32000)
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(50))
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(50))
//SORTWK03 DD UNIT=SYSDA,SPACE=(CYL,(50))
//SYSIN DD *
SORT FIELDS=(1,10,CH,A)
OPTION MAINSIZE=MAX,FILSZ=E2000000
/*
Key optimizations:
- Multiple SORTWK DDs: Allow parallel sort work — 3 work files is optimal for most sorts.
- MAINSIZE=MAX: Use as much memory as possible for in-memory sorting.
- FILSZ: Estimate file size so SORT can choose optimal algorithm.
- Large BLKSIZE on SORTOUT: Reduces output I/O.
Checkpoint/Restart
For long-running batch jobs, periodic checkpoints allow restart from the last checkpoint rather than from the beginning:
5000-CHECKPOINT.
IF WS-RECORD-COUNT >= WS-CHECKPOINT-INTERVAL
PERFORM 5100-WRITE-CHECKPOINT
MOVE 0 TO WS-RECORD-COUNT
END-IF.
5100-WRITE-CHECKPOINT.
EXEC SQL COMMIT END-EXEC
DISPLAY "Checkpoint at record: "
WS-TOTAL-PROCESSED
" Time: " FUNCTION CURRENT-DATE.
For DB2 batch, COMMIT frequency is critical:
Commits too rarely: Long-running locks, log space issues, long restart
Commits too often: Overhead of commit processing
Sweet spot: Every 1,000-10,000 records (depends on workload)
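One way to see why a sweet spot exists: total overhead is commit-processing cost plus the expected rework after a mid-run failure. In this Python sketch the per-commit and per-record timings (10 ms and 1 ms) are assumed purely for illustration:

```python
def commit_cost_seconds(records: int, interval: int,
                        t_commit: float = 0.010) -> float:
    """Total time spent in COMMIT processing over the run."""
    return (records // interval) * t_commit

def restart_rework_seconds(interval: int,
                           t_record: float = 0.001) -> float:
    """Expected records redone after a failure: half an interval."""
    return (interval / 2) * t_record

# Overhead is high at both extremes, lowest in the middle band.
for interval in (100, 1_000, 10_000, 100_000):
    total = (commit_cost_seconds(1_000_000, interval)
             + restart_rework_seconds(interval))
    print(f"commit every {interval:>7,}: {total:8.2f} s overhead")
```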
Mathematical Formulation: I/O Cost Model
We can model the cost of a batch program as:
Total_Time = CPU_Time + I/O_Time + Wait_Time
I/O_Time = N_reads * T_read + N_writes * T_write
Where:
N_reads = File_Size / Block_Size (for sequential access)
N_reads = N_records (for random access)
T_read ≈ 5ms (disk) or 0.1ms (SSD/cache)
T_write ≈ 5ms (disk) or 0.1ms (SSD/cache)
For GlobalBank's BAL-CALC processing 2.3 million accounts:
With 100-byte records and 800-byte blocks (old configuration):
N_reads = 2,300,000 / 8 = 287,500 I/Os
I/O_Time = 287,500 * 5ms = 1,437.5 seconds = 24 minutes
With 100-byte records and 32,000-byte blocks (optimized):
N_reads = 2,300,000 / 320 = 7,188 I/Os
I/O_Time = 7,188 * 5ms = 35.9 seconds = 0.6 minutes
Changing the block size alone reduced I/O time from 24 minutes to under 1 minute — a 40x improvement with zero code changes.
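The cost model above can be verified in a few lines; this is a Python transcription of the chapter's own formula and figures:

```python
def sequential_io(records: int, records_per_block: int,
                  t_read: float = 0.005) -> tuple:
    """Returns (N_reads, I/O_Time) with I/O_Time = N_reads * T_read."""
    n_reads = -(-records // records_per_block)  # ceiling division
    return n_reads, n_reads * t_read

old_reads, old_time = sequential_io(2_300_000, 8)    # 800-byte blocks
new_reads, new_time = sequential_io(2_300_000, 320)  # 32,000-byte blocks

print(old_reads, round(old_time / 60, 1))  # 287,500 reads, ~24 minutes
print(new_reads, round(new_time, 1))       # 7,188 reads, ~35.9 seconds
print(round(old_time / new_time))          # ~40x
```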
📊 Big-O for COBOL Operations: Understanding algorithmic complexity helps predict how performance scales with data volume:
| Operation | Complexity | 1K Records | 1M Records | 1B Records |
|---|---|---|---|---|
| Sequential file read | O(n) | Fast | Moderate | Slow |
| VSAM random read | O(log n) | Fast | Fast | Fast |
| Linear table search | O(n) | Fast | Slow | Impossible |
| Binary table search | O(log n) | Fast | Fast | Fast |
| Nested loop match | O(n*m) | Moderate | Impossible | — |
| Sort | O(n log n) | Fast | Moderate | Slow |
36.9 Profiling Tools
You cannot optimize what you cannot measure. Mainframe profiling tools tell you exactly where your program spends its time.
IBM Strobe
Strobe is the most widely used mainframe profiling tool. It samples the program counter at regular intervals, building a statistical profile of time spent in each paragraph:
STROBE Performance Profile: BAL-CALC
=====================================
Run Date: 2025-10-20
Total CPU Time: 847.3 seconds
Total Elapsed: 3,612.0 seconds
CPU/Elapsed Ratio: 23.5% (I/O bound)
Paragraph Profile (Top 10 by CPU):
Paragraph CPU Secs %CPU Calls
--------- -------- ---- -----
3110-COMPOUND-DAILY 312.4 36.9% 2,300,000
2000-READ-ACCOUNT 198.7 23.5% 2,300,000
4000-WRITE-OUTPUT 145.2 17.1% 2,300,000
3200-CALC-TIERED-RATE 67.8 8.0% 180,000
3120-COMPOUND-MONTHLY 45.1 5.3% 450,000
3000-CALC-INTEREST 32.6 3.8% 2,300,000
1000-INIT 0.4 0.0% 1
9000-CLEANUP 0.1 0.0% 1
Other 45.0 5.3%
This profile immediately reveals that 3110-COMPOUND-DAILY consumes 37% of CPU time. This is the hot path — the paragraph to optimize first.
SMF Records
System Management Facility (SMF) records provide job-level performance data:
- SMF Type 30: Job/step level CPU and elapsed time
- SMF Type 42: VSAM dataset statistics (I/O counts, splits, etc.)
- SMF Type 101: DB2 accounting (SQL execution time, rows processed)
RMF (Resource Measurement Facility)
RMF provides system-wide performance data, helping identify contention and resource bottlenecks at the system level rather than the program level.
36.10 GlobalBank Case Study: Optimizing the Nightly Batch
Maria Chen's assignment: reduce the nightly batch from 4 hours to 90 minutes. Here's how she did it.
Step 1: Profile
Maria ran Strobe on each job in the batch cycle:
| Job | Elapsed | CPU | Primary Bottleneck |
|---|---|---|---|
| BAL-CALC | 72 min | 14 min | CPU (compound interest calculation) |
| TXN-POST | 55 min | 3 min | I/O (VSAM random reads) |
| RPT-DAILY | 45 min | 8 min | Sort (5 million records) |
| REG-FEED | 38 min | 2 min | I/O (sequential write, small blocks) |
| ACCT-MAINT | 22 min | 5 min | DB2 (SELECT in loop) |
| Other | 8 min | 2 min | Mixed |
| Total | 240 min | 34 min | |
Step 2: Prioritize
I/O optimization would have the biggest impact. CPU optimization would address BAL-CALC.
Step 3: Optimize
BAL-CALC (72 min → 25 min):
- Changed DISPLAY arithmetic fields to COMP-3: saved 5 min CPU
- Precomputed daily rate outside the main loop: saved 3 min CPU
- Changed ACCT-MASTER block size from 4K to 32K: saved 18 min I/O
- Used OPTIMIZE(FULL) compiler option: saved 4 min CPU
- Increased VSAM buffers (BUFND=30): saved 17 min I/O
TXN-POST (55 min → 15 min):
- Sorted transaction file by account key before processing: converted random VSAM reads to sequential skip-reads
- Increased block sizes on all files: reduced I/O count by 90%
- Net result: 40 minutes saved
RPT-DAILY (45 min → 12 min):
- Added 3 SORTWK DD statements (was using 1): enabled parallel sort
- Increased MAINSIZE to MAX: more in-memory sorting
- Optimized output block sizes: faster output write
- Net result: 33 minutes saved
REG-FEED (38 min → 8 min):
- Block size was 800 bytes (LRECL = 800, BLKSIZE = 800): records not blocked at all!
- Changed to BLKSIZE=32000 (40 records per block)
- Net result: 30 minutes saved from a one-line JCL change
ACCT-MAINT (22 min → 8 min):
- Replaced SELECT-in-a-loop with a JOIN query: 50,000 DB2 calls became 1
- Net result: 14 minutes saved
Results
| Job | Before | After | Savings |
|---|---|---|---|
| BAL-CALC | 72 min | 25 min | 47 min |
| TXN-POST | 55 min | 15 min | 40 min |
| RPT-DAILY | 45 min | 12 min | 33 min |
| REG-FEED | 38 min | 8 min | 30 min |
| ACCT-MAINT | 22 min | 8 min | 14 min |
| Other | 8 min | 7 min | 1 min |
| Total | 240 min | 75 min | 165 min |
The batch cycle went from 240 minutes to 75 minutes — a 69% reduction, exceeding the 90-minute target. The single largest improvement (REG-FEED, saving 30 minutes) required changing one line of JCL.
"The block size thing still kills me," Derek said. "Thirty minutes wasted every night because someone in 1998 forgot to specify BLKSIZE."
Maria shrugged. "That's why we measure. You never know where the time is going until you look."
36.11 MedClaim Case Study: Tuning Claim Adjudication
James Okafor needed to increase CLM-ADJUD's throughput from 8,000 claims per hour to 20,000 claims per hour to meet a new SLA with a large provider network.
The Bottleneck
Profiling revealed that 65% of elapsed time was spent in DB2 operations — specifically, a SELECT inside the claim processing loop that looked up the provider's fee schedule:
* Original: One DB2 call per claim
3000-LOOKUP-FEE-SCHEDULE.
EXEC SQL
SELECT ALLOWED_AMOUNT
INTO :WS-ALLOWED-AMT
FROM FEE_SCHEDULE
WHERE PROVIDER_ID = :CLM-PROVIDER-ID
AND PROCEDURE_CODE = :CLM-PROCEDURE-CODE
AND EFFECTIVE_DATE <= :CLM-SERVICE-DATE
ORDER BY EFFECTIVE_DATE DESC
FETCH FIRST 1 ROW ONLY
END-EXEC.
At 8,000 claims per hour, this query executed once per claim. The query itself was quick, but each execution also paid DB2 call overhead for thread switching, cursor management, and cross-memory transitions. Multiplied across every claim, that per-call overhead, not the raw query time, is what consumed most of the elapsed time.
The Solution
James implemented a three-level caching strategy:
Level 1: In-memory table — Cache the 500 most common provider/procedure combinations in a WORKING-STORAGE table:
01 WS-FEE-CACHE.
05 WS-CACHE-ENTRY OCCURS 500 TIMES
ASCENDING KEY IS WS-CACHE-KEY
INDEXED BY WS-CACHE-IDX.
10 WS-CACHE-KEY.
15 WS-CACHE-PROV PIC X(8).
15 WS-CACHE-PROC PIC X(5).
10 WS-CACHE-AMOUNT PIC S9(7)V99 COMP-3.
10 WS-CACHE-DATE PIC X(10).
01 WS-LOOKUP-KEY.
05 WS-LOOKUP-PROV PIC X(8).
05 WS-LOOKUP-PROC PIC X(5).
3000-LOOKUP-FEE-SCHEDULE.
MOVE CLM-PROVIDER-ID TO WS-LOOKUP-PROV
MOVE CLM-PROCEDURE-CODE TO WS-LOOKUP-PROC
SEARCH ALL WS-CACHE-ENTRY
AT END
PERFORM 3100-DB2-LOOKUP
PERFORM 3200-UPDATE-CACHE
WHEN WS-CACHE-KEY(WS-CACHE-IDX) = WS-LOOKUP-KEY
MOVE WS-CACHE-AMOUNT(WS-CACHE-IDX)
TO WS-ALLOWED-AMT
END-SEARCH.
Level 2: Multi-row FETCH — When a DB2 lookup was needed, fetch multiple fee schedule entries at once.
Level 3: Preload — At program start, load the top 500 combinations from a pre-computed table.
Results
The cache hit rate was 78% — meaning 78% of claims were resolved without any DB2 call. The remaining 22% hit DB2 but with optimized queries (proper index usage, FETCH FIRST).
Throughput increased from 8,000 to 26,000 claims per hour — exceeding the 20,000 target by 30%.
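The effect of a cache on average lookup cost is a simple blend of hit and miss costs. In this Python sketch the 78% hit rate comes from the case study; the per-lookup timings are assumed for illustration:

```python
def avg_lookup_ms(hit_rate: float, t_cache_ms: float,
                  t_db2_ms: float) -> float:
    """Blended cost per lookup with an in-memory cache in front of DB2."""
    return hit_rate * t_cache_ms + (1 - hit_rate) * t_db2_ms

no_cache   = avg_lookup_ms(0.00, 0.001, 3.0)
with_cache = avg_lookup_ms(0.78, 0.001, 3.0)  # 78% hit rate from the text

print(round(no_cache, 3), round(with_cache, 3))
print(f"{no_cache / with_cache:.1f}x cheaper per lookup on average")
```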
36.12 Performance Tuning Checklist
Use this checklist when optimizing COBOL programs:
PERFORMANCE TUNING CHECKLIST
===============================
BEFORE STARTING:
[ ] Profile the program to identify actual bottlenecks
[ ] Establish baseline measurements (elapsed, CPU, I/O counts)
[ ] Verify regression test suite exists
[ ] Set a specific performance target
I/O OPTIMIZATION:
[ ] Sequential file block sizes >= 32,000
[ ] VSAM buffer counts appropriate (BUFND/BUFNI)
[ ] Sequential access used when processing >20% of file
[ ] File OPEN/CLOSE minimized
[ ] No unnecessary file reads (cache when possible)
CPU OPTIMIZATION:
[ ] Arithmetic fields are COMP-3 or COMP (not DISPLAY)
[ ] COMPUTE used for complex expressions
[ ] SEARCH ALL used for tables > 50 entries
[ ] Invariant computations moved outside loops
[ ] Conditions ordered by selectivity
COMPILER OPTIONS:
[ ] OPTIMIZE(STD) or OPTIMIZE(FULL)
[ ] NUMPROC(PFD) if data is clean
[ ] TRUNC(OPT) if binary fields are within PIC range
[ ] NOSSRANGE in production (SSRANGE in test)
SQL OPTIMIZATION:
[ ] No SELECT inside loops (use JOINs or cursors)
[ ] Multi-row FETCH for bulk processing
[ ] FETCH FIRST for existence checks
[ ] Indexes exist for WHERE clause predicates
[ ] EXPLAIN used to verify access paths
BATCH JOB TUNING:
[ ] REGION size appropriate
[ ] Sort work files (3 SORTWK DDs)
[ ] Sort MAINSIZE=MAX
[ ] Checkpoint frequency balanced
[ ] Job sequencing minimizes wait time
36.13 VSAM Tuning Deep Dive
VSAM performance tuning deserves special attention because VSAM files are the backbone of most COBOL batch and online systems. The three parameters that matter most are buffer allocation, Control Interval size, and free space management.
BUFND and BUFNI: The Buffer Equation
BUFND (data buffers) and BUFNI (index buffers) control how much of a VSAM dataset is cached in memory during processing. The relationship between buffer count and I/O reduction is dramatic:
VSAM KSDS: ACCT-MASTER
Records: 2,300,000
CI Size (Data): 4,096 bytes
CI Size (Index): 2,048 bytes
Record Size: 200 bytes
Records per CI: 20
Index Levels: 3
Buffer Scenarios for Sequential Access:
──────────────────────────────────────────────────────
BUFND   Index Cached?   Data I/Os   Elapsed (est)
──────────────────────────────────────────────────────
   2    No                115,000       575 sec
   5    Partial            92,000       460 sec
  10    Partial            69,000       345 sec
  20    Yes                23,000       115 sec
  30    Yes                11,500        58 sec
  50    Yes                 5,750        29 sec
──────────────────────────────────────────────────────
For sequential processing, each additional data buffer reduces I/O because the system reads ahead. The rule of thumb: set BUFND to at least the number of data CIs that fit in one CA (Control Area), plus a few more for read-ahead.
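A quick back-of-the-envelope check (Python here, purely because the arithmetic is what matters) confirms the first row of the table: 2,300,000 records at 20 per CI is 115,000 data CIs, and with minimal buffering each CI costs roughly one physical I/O.

```python
# Sequential I/O estimate for the ACCT-MASTER figures above.
records = 2_300_000
records_per_ci = 20

total_data_cis = records // records_per_ci
print(total_data_cis)            # 115000 data CIs

# BUFND=2 leaves no room for read-ahead, so the job pays roughly
# one I/O per CI -- the ~115,000 data I/Os in the first table row.
io_with_minimal_buffers = total_data_cis
print(io_with_minimal_buffers)   # 115000
```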
For random access, BUFNI is more important than BUFND:
Buffer Scenarios for Random Access (100,000 lookups):
──────────────────────────────────────────────────────
BUFNI   BUFND   Index I/Os   Data I/Os   Total I/Os
──────────────────────────────────────────────────────
   3       2      200,000      100,000      300,000
   5       2       80,000      100,000      180,000
  10       2       20,000      100,000      120,000
  20       5        5,000      100,000      105,000
  20      20        5,000       60,000       65,000
──────────────────────────────────────────────────────
With 20 index buffers, the entire index set is typically cached in memory after the first few accesses, eliminating index I/O entirely. Adding more data buffers helps if the same records are accessed repeatedly (locality of reference).
JCL Buffer Specification
//* Sequential batch processing — maximize BUFND
//ACCTMSTR DD DSN=PROD.ACCT.MASTER,DISP=SHR,
// AMP=('BUFND=30,BUFNI=5')
//* Random lookup in CICS — maximize BUFNI
//* (CICS FCT controls buffers, not JCL)
//* In CICS:
//* DEFINE FILE(ACCTMST)
//* DSNAME(PROD.ACCT.MASTER)
//* STRINGS(10)
//* DATABUFFERS(20)
//* INDEXBUFFERS(20)
//* Mixed access pattern — balance both
//ACCTMSTR DD DSN=PROD.ACCT.MASTER,DISP=SHR,
// AMP=('BUFND=20,BUFNI=15')
CI/CA Splits and Reorganization
When a VSAM KSDS needs to insert a record into a full CI, it performs a CI split — moving half the records to a new CI. If the CA is also full, a CA split occurs, which is even more expensive. Excessive splitting degrades both sequential and random access performance.
Monitor splits using IDCAMS LISTCAT:
//LISTCAT EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
LISTCAT ENT(PROD.ACCT.MASTER) ALL
/*
Key statistics to watch in the LISTCAT output:
STATISTICS
  CI-SPLITS ---------- 12,847      <<<< Warning if > 5% of CIs
  CA-SPLITS ---------- 23          <<<< Warning if ANY
  EXTENTS ------------ 4
  REC-TOTAL ---------- 2,300,000
  REC-DELETED -------- 450
  REC-INSERTED ------- 85,000
  REC-UPDATED -------- 1,200,000
  FREESPACE-CI% ------ 0           <<<< Exhausted!
  FREESPACE-CA% ------ 0           <<<< Exhausted!
When CI splits exceed 5% of total CIs, or CA splits appear at all, reorganize the dataset:
//*----------------------------------------------------------
//* Reorganize VSAM KSDS to eliminate splits
//*----------------------------------------------------------
//REORG EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//BACKUP DD DSN=TEMP.ACCT.BACKUP,DISP=(NEW,CATLG),
// SPACE=(CYL,(200,50),RLSE)
//SYSIN DD *
  REPRO INDATASET(PROD.ACCT.MASTER) OUTFILE(BACKUP)
  DELETE PROD.ACCT.MASTER PURGE
  DEFINE CLUSTER ( -
NAME(PROD.ACCT.MASTER) -
RECORDSIZE(200 200) -
KEYS(10 0) -
CYLINDERS(250 50) -
FREESPACE(20 10) -
SHAREOPTIONS(2 3) -
) -
DATA (NAME(PROD.ACCT.MASTER.DATA)) -
INDEX (NAME(PROD.ACCT.MASTER.INDEX))
  REPRO INFILE(BACKUP) OUTDATASET(PROD.ACCT.MASTER)
/*
Note that the cluster is referenced through INDATASET/OUTDATASET rather than a JCL DD statement. A DD allocated with DISP=SHR would be held for the entire step and block the DELETE; IDCAMS releases a dynamically allocated dataset as soon as each command completes.
The FREESPACE(20 10) parameter reserves 20% free space in each CI and 10% free CIs in each CA, providing room for insertions without immediate splitting.
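The effect of FREESPACE(20 10) on load density follows from the cluster's own numbers (20 records per CI, from the statistics earlier in this section); a small Python restatement of the arithmetic:

```python
import math

records_per_ci = 20    # 4,096-byte data CI / ~200-byte records
ci_free_pct = 20       # first FREESPACE operand: % kept free in each CI
ca_free_pct = 10       # second operand: % of CIs left empty per CA

# At load/reorg time VSAM fills each CI only to 80%:
loaded_per_ci = math.floor(records_per_ci * (1 - ci_free_pct / 100))
insert_slots = records_per_ci - loaded_per_ci
print(loaded_per_ci, insert_slots)   # 16 loaded, 4 slots for inserts

# The CA-level 10% keeps whole CIs empty in each CA, so the CI
# splits that do occur can be absorbed without a costly CA split.
```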
⚠️ Caution: VSAM reorganization requires exclusive access to the dataset. Schedule reorganizations during maintenance windows when no programs are accessing the file. Always create a backup before deleting and redefining the cluster.
VSAM Local Shared Resources (LSR)
For CICS environments, Local Shared Resources (LSR) pools allow multiple files to share the same buffer pool, improving overall memory utilization:
CICS LSR Pool Configuration:
Pool 1 (CI Size 4096): 512 buffers shared across 15 files
Pool 2 (CI Size 2048): 256 buffers shared across 8 index files
Pool 3 (CI Size 32768): 64 buffers for large-CI sequential files
LSR is particularly effective when many VSAM files are accessed intermittently — the buffers serve whichever file needs them most at any given moment, rather than being dedicated to idle files.
36.14 Compiler Optimization Flags Deep Dive
Enterprise COBOL's compiler options interact with each other in ways that are not always obvious. Understanding these interactions is essential for squeezing maximum performance from your programs.
The ARCH Option
The ARCH option tells the compiler which z/Architecture level to target. Higher ARCH levels unlock hardware instructions that are not available on older processors:
| ARCH Level | z/Architecture | Key Feature |
|---|---|---|
| ARCH(8) | z196 | Distinct operands, high-word facility |
| ARCH(9) | zEC12 | Transactional execution |
| ARCH(10) | z13 | Vector facility, extended immediate |
| ARCH(11) | z14 | Vector packed decimal, DEFLATE |
| ARCH(12) | z15 | Miscellaneous instruction enhancements |
| ARCH(13) | z16 | AI acceleration, sort acceleration |
For financial COBOL programs, ARCH(11) or higher is particularly valuable because vector packed decimal instructions perform COMP-3 arithmetic in hardware at speeds previously only available for binary arithmetic.
Performance comparison: COMP-3 arithmetic at different ARCH levels
(10 million multiply operations)
ARCH(8): 4.2 seconds
ARCH(10): 3.1 seconds
ARCH(11): 1.8 seconds <<<< Vector packed decimal
ARCH(12): 1.6 seconds
ARCH(13): 1.4 seconds
Interaction Between OPTIMIZE and Other Options
Option Combination Effects:
OPTIMIZE(FULL) + ARCH(12):
Maximum optimization with modern hardware instructions.
Best performance. May produce code that does not run
on older hardware.
OPTIMIZE(FULL) + SSRANGE:
The optimizer cannot fully optimize subscript operations
because range checks prevent certain transformations.
Performance impact: 15-25% slower than OPTIMIZE(FULL)
+ NOSSRANGE.
OPTIMIZE(FULL) + TEST(ALL):
Debug hooks reduce optimization effectiveness.
Performance impact: 20-40% slower than OPTIMIZE(FULL)
without TEST.
OPTIMIZE(FULL) + NUMPROC(PFD) + TRUNC(OPT):
The "maximum performance" combination. Use only when:
- All numeric data has preferred signs
- Binary fields stay within PIC range
- You have thorough regression tests
Compiler Option Selection Guide
┌────────────────────┬─────────────────────────────────────┐
│ ENVIRONMENT        │ RECOMMENDED OPTIONS                 │
├────────────────────┼─────────────────────────────────────┤
│ Development        │ NOOPTIMIZE, SSRANGE, TEST(ALL)      │
│                    │ Priority: Debugging ease            │
│                    │                                     │
│ Unit Test          │ OPTIMIZE(STD), SSRANGE, TEST(SEP)   │
│                    │ Priority: Catch boundary errors     │
│                    │                                     │
│ Integration Test   │ OPTIMIZE(STD), NOSSRANGE            │
│                    │ Priority: Match production behavior │
│                    │                                     │
│ Performance Test   │ OPTIMIZE(FULL), NOSSRANGE,          │
│                    │ NUMPROC(PFD), ARCH(12)              │
│                    │ Priority: Maximum speed             │
│                    │                                     │
│ Production         │ OPTIMIZE(FULL), NOSSRANGE,          │
│                    │ NUMPROC(PFD), ARCH(current HW)      │
│                    │ Priority: Performance + stability   │
└────────────────────┴─────────────────────────────────────┘
✅ Try It Yourself: If you have access to GnuCOBOL, compile the same program twice: once with no optimization (plain cobc) and once with cobc -O2. Run both versions on a loop that performs 1,000,000 arithmetic operations and compare elapsed times, capturing FUNCTION CURRENT-DATE before and after the loop. You should see a measurable difference, especially for COMP-3 arithmetic.
36.15 SQL EXPLAIN Analysis
For COBOL programs with embedded DB2 SQL, the EXPLAIN statement is the most powerful tool for understanding query performance. EXPLAIN populates a plan table showing exactly how DB2 will access data for your query.
Running EXPLAIN
* Explain a query before running it
EXEC SQL
EXPLAIN PLAN SET QUERYNO = 1 FOR
SELECT C.CLAIM_ID, C.CLAIM_STATUS,
P.PROVIDER_NAME, P.SPECIALTY
FROM CLAIMS C
JOIN PROVIDERS P
ON C.PROVIDER_ID = P.PROVIDER_ID
WHERE C.BATCH_DATE = :WS-BATCH-DATE
AND C.CLAIM_STATUS = 'N'
END-EXEC
Reading the Plan Table
PLAN_TABLE output for QUERYNO = 1:
─────────────────────────────────────────────────────────────
QUERY   TABLE       ACCESS   MATCH   INDEX          PREFETCH
BLOCK   NAME        TYPE     COLS    NAME           TYPE
─────────────────────────────────────────────────────────────
  1     CLAIMS        I        2     IX_CLM_BATCH      S
  1     PROVIDERS     I        1     PK_PROVIDER       —
─────────────────────────────────────────────────────────────
ACCESS TYPE KEY:
  I  = Index access (good)
  R  = Tablespace scan (bad for large tables)
  M  = Multiple index access
  MX = Intersecting index
  N  = Index access with an IN-list predicate
(Join methods such as nested loop and merge scan are reported in the METHOD column, not in ACCESSTYPE.)
Interpreting the output: Both tables use index access (type "I"), which means DB2 is using indexes to find rows — good. The CLAIMS table matches on 2 columns (BATCH_DATE and CLAIM_STATUS) using the IX_CLM_BATCH index. The PROVIDERS table uses its primary key index.
Common EXPLAIN Red Flags
| Access Type | Meaning | Action |
|---|---|---|
| R (tablespace scan) | Full table scan — every row examined | Add an index on WHERE columns |
| S in PREFETCH | Sequential prefetch — reading many pages ahead | May be normal for range queries |
| SORTN_ORDERBY = 'Y' | Sort required for ORDER BY | Check whether an index can provide the ordering |
| MATCHCOLS = 0 | Index used but no columns match | Index is not useful for this query |
Optimizing a Slow Query
James Okafor found that CLM-ADJUD's fee schedule lookup was performing a tablespace scan. Here is the EXPLAIN analysis and fix:
BEFORE (tablespace scan):
TABLE: FEE_SCHEDULE ACCESS: R MATCHCOLS: 0
Estimated cost: 45,000 I/Os
Query:
SELECT ALLOWED_AMOUNT
FROM FEE_SCHEDULE
WHERE PROVIDER_ID = :WS-PROV-ID
AND PROCEDURE_CODE = :WS-PROC-CODE
AND EFFECTIVE_DATE <= :WS-SRVDATE
ORDER BY EFFECTIVE_DATE DESC
FETCH FIRST 1 ROW ONLY
The problem: no index existed on the combination of PROVIDER_ID, PROCEDURE_CODE, and EFFECTIVE_DATE. DB2 was scanning the entire 500,000-row table for each lookup.
-- Create composite index
CREATE INDEX IX_FEE_SCHED_LOOKUP
ON FEE_SCHEDULE
(PROVIDER_ID, PROCEDURE_CODE, EFFECTIVE_DATE DESC);
AFTER (index access):
TABLE: FEE_SCHEDULE ACCESS: I MATCHCOLS: 3
Estimated cost: 4 I/Os
Improvement: 11,250x reduction in I/O per query
📊 By the Numbers: In the MedClaim environment, the fee schedule query was executed approximately 8,000 times per hour. Before the index, each execution required an average of 250 I/Os (tablespace scan). After the index, each execution required 4 I/Os. Total I/O reduction: 8,000 * 246 = 1,968,000 fewer I/Os per hour. At 5ms per I/O, this saved 9,840 seconds (2.7 hours) of elapsed I/O wait time per hour of processing — explaining why the program could not meet the 20,000-claims-per-hour SLA.
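The arithmetic in that calculation is easy to verify; a Python restatement of the same numbers:

```python
# Reproduce the fee-schedule "By the Numbers" arithmetic.
executions_per_hour = 8_000
ios_before = 250      # avg I/Os per tablespace-scan execution
ios_after = 4         # I/Os per execution with the composite index
ms_per_io = 5

saved_ios_per_hour = executions_per_hour * (ios_before - ios_after)
saved_seconds = saved_ios_per_hour * ms_per_io / 1000

print(saved_ios_per_hour)   # 1968000 fewer I/Os per hour
print(saved_seconds)        # 9840.0 seconds (~2.7 hours) of I/O wait
```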
36.16 Memory Layout Optimization
Beyond WORKING-STORAGE alignment (Section 36.4), memory layout optimization extends to how data flows through the program. Efficient layout reduces cache misses and memory bandwidth consumption.
Hot/Cold Data Separation
Separate frequently accessed fields ("hot" data) from rarely accessed fields ("cold" data). This improves CPU cache utilization because the hot data fits in fewer cache lines:
* POOR: Hot and cold data interleaved
01 WS-ACCOUNT-RECORD.
05 ACCT-NUMBER PIC X(10). *> Hot
05 ACCT-OPEN-DATE PIC X(10). *> Cold
05 ACCT-BALANCE PIC S9(11)V99
COMP-3. *> Hot
05 ACCT-LAST-STMT-DATE PIC X(10). *> Cold
05 ACCT-TYPE PIC X(3). *> Hot
05 ACCT-BRANCH-CODE PIC X(5). *> Cold
05 ACCT-ANNUAL-RATE PIC V9(6)
COMP-3. *> Hot
05 ACCT-MARKETING-CODE PIC X(4). *> Cold
* BETTER: Hot data grouped for cache locality
01 WS-ACCT-HOT-FIELDS.
05 ACCT-NUMBER PIC X(10).
05 ACCT-BALANCE PIC S9(11)V99
COMP-3.
05 ACCT-TYPE PIC X(3).
05 ACCT-ANNUAL-RATE PIC V9(6)
COMP-3.
* Total hot data: ~24 bytes — fits in one cache line
01 WS-ACCT-COLD-FIELDS.
05 ACCT-OPEN-DATE PIC X(10).
05 ACCT-LAST-STMT-DATE PIC X(10).
05 ACCT-BRANCH-CODE PIC X(5).
05 ACCT-MARKETING-CODE PIC X(4).
This technique matters most when the hot fields are accessed millions of times (inside a high-iteration processing loop) while the cold fields are accessed only occasionally (for reporting or error handling).
Table Structure Optimization
For large internal tables, the choice between arrays-of-structures and structures-of-arrays affects performance:
* Array of Structures (standard COBOL pattern)
01 WS-RATE-TABLE.
05 WS-RATE-ENTRY OCCURS 10000 TIMES
INDEXED BY WS-RATE-IDX.
10 WS-RATE-CODE PIC X(5).
10 WS-RATE-EFF-DATE PIC X(10).
10 WS-RATE-AMOUNT PIC S9(5)V99 COMP-3.
10 WS-RATE-DESC PIC X(40).
* Total per entry: ~59 bytes
* Searching requires loading 59 bytes per comparison
* even though we only compare the 5-byte code
* Optimized: Split key from data
01 WS-RATE-KEYS.
05 WS-RATE-CODE OCCURS 10000 TIMES
INDEXED BY WS-KEY-IDX
PIC X(5).
* Total: 50,000 bytes — fits in L2 cache
01 WS-RATE-DATA.
05 WS-RATE-DETAIL OCCURS 10000 TIMES.
10 WS-RATE-EFF-DATE PIC X(10).
10 WS-RATE-AMOUNT PIC S9(5)V99 COMP-3.
10 WS-RATE-DESC PIC X(40).
By separating the search key from the full record, SEARCH ALL only touches the 50,000-byte key array during comparison, not the full 590,000-byte table. On modern hardware with 256KB L2 caches, the key array fits entirely in cache, making binary search extremely fast.
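The same structure-of-arrays idea, sketched in Python with parallel lists standing in for WS-RATE-KEYS and WS-RATE-DATA (the key values and amounts here are invented for illustration):

```python
import bisect

entries = 10_000
print(entries * 5)    # 50000-byte key array (PIC X(5) keys)
print(entries * 59)   # 590000 bytes if keys and data stay combined

# Parallel arrays: binary-search the small key array, then use the
# found index to reach the detail array -- the Python analog of
# SEARCH ALL against WS-RATE-KEYS.
keys = [f"R{i:04d}" for i in range(entries)]            # sorted keys
amounts = [round(i * 0.01, 2) for i in range(entries)]  # parallel data

def lookup(code):
    i = bisect.bisect_left(keys, code)
    if i < len(keys) and keys[i] == code:
        return amounts[i]
    return None

print(lookup("R0042"))   # 0.42
print(lookup("XXXXX"))   # None -- code not in table
```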
🧪 Lab Exercise: Create a COBOL program with a 5,000-entry lookup table. Implement two versions: one where the table has a single group item with key and data combined, and one where the key is in a separate array. Perform 1,000,000 lookups against each version and compare elapsed times. The difference may surprise you, especially if you use SEARCH ALL (binary search).
36.17 Batch I/O Optimization Patterns
Beyond the basic buffering and block size optimization discussed in Section 36.3, several advanced I/O patterns can dramatically reduce batch job elapsed time.
The Sort-Merge Elimination Pattern
Many batch programs read a master file and a transaction file, then process transactions against the master. If the transaction file is unsorted, the program must either perform random reads against the master or sort the transactions first. But sometimes you can eliminate the sort entirely by restructuring the processing logic:
* TRADITIONAL: Sort transactions, then sequential match
* Step 1: SORT transactions by account key
* Step 2: Sequential read both files, match on key
* Total I/O: Read trans + sort work + read master
* OPTIMIZED: Load transactions into memory table
* Step 1: Read all transactions into WORKING-STORAGE
* Step 2: Sort in memory (no I/O)
* Step 3: Sequential read master, lookup in memory
* Total I/O: Read trans + read master (no sort work I/O)
01 WS-TXN-TABLE.
05 WS-TXN-ENTRY OCCURS 50000 TIMES
ASCENDING KEY WS-TXN-ACCT-KEY
INDEXED BY WS-TXN-IDX.
10 WS-TXN-ACCT-KEY PIC X(10).
10 WS-TXN-AMOUNT PIC S9(9)V99 COMP-3.
10 WS-TXN-TYPE PIC X.
01 WS-TXN-COUNT PIC 9(5) COMP VALUE 0.
1000-LOAD-TRANSACTIONS.
* Read all transactions into memory
PERFORM UNTIL END-OF-TXN-FILE
READ TXN-FILE INTO WS-TXN-RECORD
AT END SET END-OF-TXN-FILE TO TRUE
NOT AT END
ADD 1 TO WS-TXN-COUNT
MOVE TXN-ACCT-KEY
TO WS-TXN-ACCT-KEY(WS-TXN-COUNT)
MOVE TXN-AMOUNT
TO WS-TXN-AMOUNT(WS-TXN-COUNT)
MOVE TXN-TYPE
TO WS-TXN-TYPE(WS-TXN-COUNT)
END-READ
END-PERFORM
* Sort in memory with the table SORT verb (Enterprise COBOL
* V6.2 and later). Sort the OCCURS item, not its parent group.
* With a fixed OCCURS, unused entries participate in the sort;
* use OCCURS ... DEPENDING ON WS-TXN-COUNT to sort only the
* loaded entries.
SORT WS-TXN-ENTRY ON ASCENDING KEY WS-TXN-ACCT-KEY.
This pattern works when the transaction file fits in memory (typically up to 50,000-100,000 records, depending on record size and available REGION). For larger transaction files, the external sort approach is necessary.
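The shape of the pattern, in Python terms (the account keys and amounts are invented; what matters is that the external sort step disappears):

```python
import bisect

# Steps 1-2: read transactions and sort them in memory (no SORTWK I/O)
transactions = [
    ("ACCT000042", 100.00, "D"),
    ("ACCT000007", 25.50, "C"),
    ("ACCT000042", 12.00, "C"),
]
transactions.sort(key=lambda t: t[0])
txn_keys = [t[0] for t in transactions]

def txns_for(acct_key):
    """All transactions for one master key, found by binary search."""
    lo = bisect.bisect_left(txn_keys, acct_key)
    hi = bisect.bisect_right(txn_keys, acct_key)
    return transactions[lo:hi]

# Step 3: a single sequential pass over the master file
for acct_key in ("ACCT000007", "ACCT000042", "ACCT000099"):
    matched = txns_for(acct_key)
    print(acct_key, len(matched))   # apply matched txns to the record
```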
Multi-File Processing with Balanced Merge
When a batch job reads multiple input files and produces a merged output, the order of file processing matters:
* SLOW: Process files sequentially, write output each time
PERFORM 1000-PROCESS-CHECKING-FILE
PERFORM 2000-PROCESS-SAVINGS-FILE
PERFORM 3000-PROCESS-CD-FILE
PERFORM 4000-PROCESS-MMA-FILE
* Total: 4 passes over the output file
* FAST: Read all inputs in parallel, write output once
PERFORM UNTIL ALL-FILES-AT-END
PERFORM 1000-FIND-LOWEST-KEY
PERFORM 2000-WRITE-MERGED-RECORD
PERFORM 3000-READ-NEXT-FROM-SOURCE
END-PERFORM
* Total: 1 pass over the output file
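The FIND-LOWEST-KEY loop is a k-way merge, and Python's standard library implements the same algorithm, which makes the pattern easy to see (the file contents are invented):

```python
import heapq

# Each product file is already sorted by account key.
checking = [("A001", "CHK"), ("A007", "CHK")]
savings  = [("A003", "SAV"), ("A007", "SAV")]
cds      = [("A002", "CD ")]

# heapq.merge does what the FIND-LOWEST-KEY loop does: repeatedly
# emit the smallest key among the front records of all open inputs.
merged = list(heapq.merge(checking, savings, cds))
print([key for key, _src in merged])
# ['A001', 'A002', 'A003', 'A007', 'A007'] -- one output pass
```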
Deferred Write Pattern
For programs that compute running totals or multi-pass calculations, defer writing the output until all processing is complete:
* SLOW: Write partial results, then update
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-ACCT-COUNT
PERFORM 3000-FIRST-PASS-CALC
WRITE OUTPUT-REC FROM WS-ACCT(WS-IDX)
END-PERFORM
* Second pass: REWRITE records that need adjustment
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-ADJUST-COUNT
READ OUTPUT-FILE KEY IS WS-ADJ-KEY(WS-IDX)
PERFORM 4000-ADJUST-CALC
REWRITE OUTPUT-REC FROM WS-ACCT(WS-IDX)
END-PERFORM
* Double I/O: write + read + rewrite
* FAST: Compute everything in memory, write once
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-ACCT-COUNT
PERFORM 3000-FIRST-PASS-CALC
END-PERFORM
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-ADJUST-COUNT
PERFORM 4000-ADJUST-CALC
END-PERFORM
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-ACCT-COUNT
WRITE OUTPUT-REC FROM WS-ACCT(WS-IDX)
END-PERFORM
* Single I/O: write only (no read-back, no rewrite)
💡 Memory vs. I/O Tradeoff: Most batch I/O optimization boils down to the same principle: trade memory for I/O. Load data into WORKING-STORAGE, process it in memory, and write it once. Memory operations are measured in nanoseconds; I/O operations are measured in milliseconds — a factor of one million. Use every byte of available REGION to avoid unnecessary I/O.
36.18 MedClaim Performance Case Study: Tuning the Daily Eligibility Batch
Tomás Rivera identified a performance issue with MedClaim's daily eligibility batch (ELIG-BATCH). The program verified member eligibility for all claims received that day — checking policy dates, benefit coverage, and provider network status. As claim volumes grew from 15,000 to 45,000 per day, the job's elapsed time grew from 20 minutes to over 90 minutes, threatening to delay the downstream adjudication cycle.
Profiling Results
STROBE Performance Profile: ELIG-BATCH
========================================
Total CPU Time: 142.3 seconds
Total Elapsed: 5,412.0 seconds (90.2 minutes)
CPU/Elapsed: 2.6% (severely I/O bound)
Paragraph Profile (Top 5):
Paragraph              CPU Secs   %CPU     Calls
---------              --------   ----     -----
3000-CHECK-POLICY          12.8    9.0%   45,000
3100-CHECK-NETWORK         89.4   62.8%   45,000  <<<
3200-CHECK-BENEFITS        18.7   13.1%   45,000
2000-READ-CLAIM             8.4    5.9%   45,000
4000-WRITE-RESULT           6.2    4.4%   45,000
The bottleneck was 3100-CHECK-NETWORK, which consumed 63% of the CPU time — but, crucially, the job as a whole had a CPU/elapsed ratio of only 2.6%, meaning it spent 97.4% of its time waiting for I/O. Examining the paragraph revealed a DB2 query inside the processing loop:
3100-CHECK-NETWORK.
EXEC SQL
SELECT NETWORK_STATUS, EFF_DATE, TERM_DATE
INTO :WS-NET-STATUS, :WS-NET-EFF, :WS-NET-TERM
FROM PROVIDER_NETWORK
WHERE PROVIDER_ID = :CLM-PROVIDER-ID
AND PLAN_CODE = :MBR-PLAN-CODE
AND EFF_DATE <= :CLM-SERVICE-DATE
ORDER BY EFF_DATE DESC
FETCH FIRST 1 ROW ONLY
END-EXEC.
At 45,000 claims, this query executed 45,000 times. EXPLAIN showed a tablespace scan (no index on the PROVIDER_NETWORK table for the provider/plan combination).
The Three-Part Fix
Fix 1: Add composite index
CREATE INDEX IX_PROV_NET_LOOKUP
ON PROVIDER_NETWORK
(PROVIDER_ID, PLAN_CODE, EFF_DATE DESC);
Result: Query I/O dropped from ~200 per execution to 3. Elapsed time: 90 minutes down to 35 minutes.
Fix 2: Implement in-memory cache
Since many claims share the same provider/plan combination, Tomás added a 1,000-entry cache (similar to James's fee schedule cache in Section 36.11):
01 WS-NET-CACHE.
05 WS-NET-CACHE-ENTRY OCCURS 1000 TIMES
ASCENDING KEY WS-NC-LOOKUP-KEY
INDEXED BY WS-NC-IDX.
10 WS-NC-LOOKUP-KEY.
15 WS-NC-PROV-ID PIC X(8).
15 WS-NC-PLAN-CODE PIC X(5).
10 WS-NC-STATUS PIC X.
10 WS-NC-EFF-DATE PIC X(10).
10 WS-NC-TERM-DATE PIC X(10).
Cache hit rate: 82%. Only 8,100 DB2 queries instead of 45,000. Elapsed time: 35 minutes down to 12 minutes.
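The cache logic, reduced to its essence in Python (a dict keyed on provider/plan stands in for the SEARCH ALL table, and query_db2 is a stand-in for the embedded SELECT):

```python
# Memoize the network lookup so repeated provider/plan pairs
# never hit the database a second time.
db2_queries = 0

def query_db2(provider_id, plan_code):
    global db2_queries
    db2_queries += 1
    return "IN-NETWORK"        # stand-in for the real SELECT

cache = {}

def check_network(provider_id, plan_code):
    key = (provider_id, plan_code)
    if key not in cache:       # cache miss: one DB2 query
        cache[key] = query_db2(provider_id, plan_code)
    return cache[key]          # cache hit: no I/O at all

# 10 claims across 3 distinct provider/plan pairs -> only 3 queries
pairs = [("P1", "A"), ("P2", "A"), ("P1", "A"), ("P3", "B"),
         ("P1", "A"), ("P2", "A"), ("P3", "B"), ("P1", "A"),
         ("P2", "A"), ("P1", "A")]
for p, c in pairs:
    check_network(p, c)
print(db2_queries)   # 3
```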
Fix 3: Multi-row FETCH for remaining lookups
For the 18% of claims that missed the cache, Tomás batched the DB2 lookups using a temporary table and a single JOIN query:
* Batch uncached lookups into a temp table.
* (DB2 does not allow subscripted host variables in SQL,
* so copy each table entry to plain host variables first.)
MOVE WS-BATCH-PROV(WS-B-IDX) TO WS-HV-PROV-ID
MOVE WS-BATCH-PLAN(WS-B-IDX) TO WS-HV-PLAN-CODE
MOVE WS-BATCH-DATE(WS-B-IDX) TO WS-HV-SRV-DATE
EXEC SQL
INSERT INTO SESSION.LOOKUP_BATCH
(PROVIDER_ID, PLAN_CODE, SERVICE_DATE)
VALUES (:WS-HV-PROV-ID, :WS-HV-PLAN-CODE, :WS-HV-SRV-DATE)
END-EXEC
* Single query to resolve all lookups
EXEC SQL
DECLARE BATCH-CURSOR CURSOR FOR
SELECT B.PROVIDER_ID, B.PLAN_CODE,
N.NETWORK_STATUS, N.EFF_DATE
FROM SESSION.LOOKUP_BATCH B
JOIN PROVIDER_NETWORK N
ON B.PROVIDER_ID = N.PROVIDER_ID
AND B.PLAN_CODE = N.PLAN_CODE
AND N.EFF_DATE <= B.SERVICE_DATE
END-EXEC
Elapsed time: 12 minutes down to 6 minutes.
Final Results
| Optimization | Elapsed | Reduction |
|---|---|---|
| Original (tablespace scan) | 90.2 min | — |
| + Composite index | 35.0 min | 61% |
| + In-memory cache | 12.0 min | 87% |
| + Batch DB2 lookup | 6.0 min | 93% |
The total optimization reduced elapsed time from 90 minutes to 6 minutes — a 15x improvement. The downstream adjudication cycle now starts 84 minutes earlier, providing comfortable margin for the overall batch window.
"The lesson," Tomás told his team, "is that performance tuning is almost never about making the CPU go faster. It is about making the program stop waiting."
36.19 Performance Monitoring and Regression Detection
Optimizing a program once is not enough. Data volumes grow, access patterns change, and new code introduces performance regressions. Continuous performance monitoring catches degradation before it causes batch window overruns.
Establishing Performance Baselines
Record key metrics for every production batch run:
* Add timing instrumentation to batch programs
WORKING-STORAGE SECTION.
01 WS-START-TIME PIC X(21).
01 WS-END-TIME PIC X(21).
01 WS-RECORDS-PROCESSED PIC 9(9) COMP VALUE 0.
01 WS-IO-COUNT PIC 9(9) COMP VALUE 0.
0000-MAIN.
MOVE FUNCTION CURRENT-DATE TO WS-START-TIME
DISPLAY "BAL-CALC START: " WS-START-TIME
PERFORM 1000-INIT
PERFORM 2000-PROCESS-ALL-ACCOUNTS
PERFORM 8000-FINALIZE
MOVE FUNCTION CURRENT-DATE TO WS-END-TIME
DISPLAY "BAL-CALC END: " WS-END-TIME
DISPLAY "RECORDS: " WS-RECORDS-PROCESSED
DISPLAY "I/O OPS: " WS-IO-COUNT
STOP RUN.
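The same instrumentation idea translates to any language. A minimal Python analog (the one-I/O-per-record accounting is an assumption of this sketch, not a rule):

```python
import time

def run_batch(records):
    """Process records, returning the metrics the DISPLAYs above log."""
    start = time.monotonic()
    processed = 0
    io_count = 0
    for _rec in records:
        processed += 1      # business logic would go here
        io_count += 1       # one read per record in this sketch
    elapsed = time.monotonic() - start
    return {"records": processed, "io": io_count, "elapsed": elapsed}

stats = run_batch(range(100_000))
print(stats["records"], stats["io"])   # 100000 100000
```

Logging these three numbers on every run is what makes the trend analysis in the next subsection possible.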
Building a Trend Dashboard
Track elapsed time, CPU time, and record count over weeks to identify trends:
BAL-CALC Performance Trend (last 30 days):
────────────────────────────────────────────────
Date         Records     Elapsed    CPU    I/O
────────────────────────────────────────────────
2025-10-16   2,300,000   25.1 min   8.2m   23,400
2025-10-17   2,302,000   25.2 min   8.2m   23,420
2025-10-18   2,305,000   25.3 min   8.3m   23,450
...
2025-11-10   2,340,000   25.8 min   8.4m   23,800
2025-11-11   2,342,000   26.1 min   8.5m   23,820
2025-11-12   2,344,000   31.4 min   8.5m   48,200  <<<
2025-11-13   2,346,000   31.6 min   8.5m   48,400  <<<
────────────────────────────────────────────────
ALERT: I/O count doubled on 2025-11-12
Elapsed increased 21% with only 0.1% record growth
The spike on November 12 indicates a performance regression — I/O doubled while record count barely changed. Investigation revealed that a code change on November 11 introduced an additional READ statement inside the main processing loop, doubling the I/O count. The fix was straightforward: cache the result of the extra read rather than re-reading for each record.
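A detector for exactly this pattern can be a few lines: compare each run's I/O-per-record rate to a rolling baseline and flag sharp jumps. This is a sketch; the 1.5x threshold is an arbitrary choice for illustration, using the trend numbers above.

```python
# (date, records, io_count) per nightly run, from the trend log
runs = [
    ("2025-11-10", 2_340_000, 23_800),
    ("2025-11-11", 2_342_000, 23_820),
    ("2025-11-12", 2_344_000, 48_200),
    ("2025-11-13", 2_346_000, 48_400),
]

def regressions(runs, threshold=1.5):
    flagged = []
    baseline = runs[0][2] / runs[0][1]     # I/O per record
    for date, records, io in runs[1:]:
        rate = io / records
        if rate > baseline * threshold:
            flagged.append(date)           # abnormal: do not absorb
        else:
            baseline = rate                # normal: update baseline
    return flagged

print(regressions(runs))   # ['2025-11-12', '2025-11-13']
```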
Automated Alerting
Set thresholds that trigger alerts when performance degrades beyond acceptable bounds:
//*----------------------------------------------------------
//* Performance gate: Alert if elapsed exceeds threshold
//*----------------------------------------------------------
//PERFGATE EXEC PGM=PERFCHEK
//STEPLIB DD DSN=TOOLS.LOAD,DISP=SHR
//TIMING DD DSN=PROD.PERF.LOG,DISP=SHR
//THRESHLD DD *
BAL-CALC,ELAPSED,30.0
TXN-POST,ELAPSED,20.0
RPT-DAILY,ELAPSED,15.0
REG-FEED,ELAPSED,10.0
/*
//ALERT DD SYSOUT=*
//* RC=0: Within threshold
//* RC=4: Warning (within 10% of threshold)
//* RC=8: Exceeded threshold — alert operations
✅ Try It Yourself: Add timing instrumentation to any COBOL program you have written. Record FUNCTION CURRENT-DATE at the start and end of the program, and count the number of records processed. Run the program with different data volumes (100, 1,000, 10,000 records) and observe how elapsed time scales. Does it scale linearly with data volume? If not, you may have an O(n^2) algorithm hiding in your code.
36.20 GlobalBank Post-Optimization Monitoring
After Maria Chen's performance optimization project (Section 36.10), GlobalBank institutionalized performance monitoring to prevent regression. Priya Kapoor designed a monitoring framework that tracked three key indicators.
The Three Performance Pillars
Pillar 1: Batch Window Utilization. The ratio of actual batch elapsed time to the available batch window. Target: below 60%, allowing headroom for growth:
Batch Window: 23:00 - 05:00 (6 hours = 360 minutes)
Current Elapsed: 75 minutes
Utilization: 20.8%
Headroom: 285 minutes (79.2%)
Monthly Growth Rate: 0.4 minutes/month
Time Until 60% Threshold: ~353 months (about 29 years)
Status: GREEN — no concern
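The Pillar 1 numbers come straight from three inputs; a Python restatement (the months-to-threshold figure is sensitive to how the growth rate is rounded):

```python
window_min = 360         # 23:00-05:00 batch window
elapsed_min = 75         # current nightly elapsed time
growth_per_month = 0.4   # observed minutes/month of creep
threshold_pct = 60       # target ceiling on utilization

utilization = elapsed_min / window_min * 100
print(round(utilization, 1))    # 20.8 (% of the window used)

threshold_min = window_min * threshold_pct / 100   # 216 minutes
months_left = (threshold_min - elapsed_min) / growth_per_month
print(round(months_left, 1), round(months_left / 12, 1))
# ~352.5 months, ~29.4 years of headroom at the current growth rate
```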
Pillar 2: Per-Job CPU Efficiency. The ratio of CPU time to elapsed time, tracked per job. A declining ratio indicates increasing I/O wait — often the first sign of a growing dataset or degraded VSAM organization:
Job          CPU/Elapsed (Oct)   CPU/Elapsed (Nov)   Trend
BAL-CALC          32.8%               32.6%           Stable
TXN-POST          20.0%               19.7%           Stable
RPT-DAILY         17.8%               11.2%           DECLINING  <<<
REG-FEED          25.0%               25.1%           Stable
ACCT-MAINT        62.5%               62.5%           Stable
RPT-DAILY's declining CPU/elapsed ratio triggered an investigation. The cause: a VSAM dataset used by the report had not been reorganized in four months. CI splits had fragmented the data, increasing I/O. After reorganization, the ratio returned to 17.5%.
Pillar 3: Records-Per-Second Throughput. Track processing throughput over time. A declining throughput with constant or growing record volumes indicates performance degradation:
BAL-CALC Throughput Trend:
Date         Records     Elapsed (sec)   Records/Sec
2025-10-01   2,300,000       1,500          1,533
2025-10-15   2,310,000       1,508          1,532
2025-11-01   2,320,000       1,520          1,526
2025-11-15   2,330,000       1,528          1,525
Stable throughput indicates that performance scales linearly with data growth — a healthy sign. Declining throughput would indicate an algorithmic issue (e.g., a quadratic search emerging as the table grows).
📊 By the Numbers: In the year following Maria's optimization project, the batch window utilization remained below 25% despite a 4.3% growth in data volume. The monitoring framework caught two potential regressions (the VSAM fragmentation issue above and a new program with an unindexed DB2 query) before they affected production. The total investment in ongoing monitoring was approximately 2 hours per month of Priya's time reviewing the dashboards — a trivial cost for preventing the kind of crisis that started the optimization project.
36.21 Summary
Performance tuning is a discipline of measurement, analysis, and targeted optimization. The key concepts from this chapter:
- Measure first — profile before optimizing. The bottleneck is rarely where you think it is.
- I/O dominates — in most COBOL batch programs, 80-95% of elapsed time is I/O. Optimize I/O first.
- Data types matter — COMP-3 and COMP arithmetic are 3-5x faster than DISPLAY arithmetic.
- Block size is critical — proper blocking can reduce I/O operations by 40x.
- SEARCH ALL for large tables — binary search is O(log n) vs. linear search's O(n).
- Compiler options provide 10-30% improvement with zero code changes.
- SQL optimization is essential for DB2 programs — avoid queries in loops, use multi-row FETCH.
- CICS optimization focuses on COMMAREA sizing and minimizing command overhead.
- Batch tuning includes sort optimization, checkpoint strategy, and REGION sizing.
- Profile continuously — performance degrades as data volumes grow. Regular profiling catches degradation before it causes batch window overruns.
As Maria's batch optimization proved: the biggest wins often come from the simplest changes. A one-line JCL change saved 30 minutes per night. The lesson is not that complex optimizations are unnecessary — sometimes they are — but that you should always pick the lowest-hanging fruit first.
In the next chapter, we'll address the ultimate performance question: when should you stop optimizing COBOL and start considering migration to modern platforms?