Chapter 36: Performance Tuning

"Premature optimization is the root of all evil." — Donald Knuth

"But ignoring performance until your batch window is blown is the root of weekend overtime." — Maria Chen (probably)

GlobalBank's nightly batch cycle had a problem. The cycle — which included account updates, interest calculations, report generation, and regulatory feeds — had a strict window: it had to start at 11:00 PM and complete by 5:00 AM, when online banking came back up. For years, the cycle completed comfortably within that window, finishing around 3:30 AM. But as the bank grew — more accounts, more transaction types, more regulatory requirements — the cycle crept later. 3:45 AM. Then 4:00. Then 4:15.

One Friday night, the cycle didn't finish until 5:47 AM. Online banking was delayed for 47 minutes. The CIO's phone rang. By Monday morning, Maria Chen had a new assignment: cut the batch cycle from four hours to ninety minutes.

"That's a 62% reduction," Derek Washington said, doing the math. "Is that even possible without rewriting everything?"

"We're not rewriting anything," Maria replied. "We're going to make what we have run faster. And the first thing we're going to do is measure."

This chapter teaches you how to think about COBOL performance — what matters, what doesn't, and how to find and fix the bottlenecks that actually slow your programs down.

36.1 The Performance Mindset

Before you optimize a single line of code, internalize three principles:

Principle 1: Measure First

The most dangerous performance optimization is one based on a guess. Developers are notoriously bad at predicting where their programs spend time. A paragraph you think is slow may execute in microseconds; a paragraph you never considered may be the bottleneck because it runs a million times.

Always profile before optimizing. We'll discuss profiling tools in Section 36.8.

Principle 2: I/O Dominates

On a mainframe, a single disk I/O operation takes approximately 5-10 milliseconds. A single CPU instruction takes approximately 1 nanosecond. That means one I/O operation takes as long as 5-10 million CPU instructions. In most COBOL batch programs, 80-95% of elapsed time is spent waiting for I/O. Optimizing CPU-bound logic in a program that is I/O-bound is like polishing the hubcaps on a car with a blown engine.

Principle 3: Optimize the Hot Path

The "hot path" is the code that executes most frequently. In a batch program processing 2 million records, an optimization that saves 1 millisecond per record saves 33 minutes. The same optimization in a paragraph that runs once per job saves 1 millisecond total — not worth the effort or the risk.
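The arithmetic behind that claim is worth checking once; a tiny Python sketch makes it concrete:

```python
# Savings from shaving 1 ms per record off a 2-million-record batch run.
records = 2_000_000
saved_ms_per_record = 1
saved_minutes = records * saved_ms_per_record / 1000 / 60
print(round(saved_minutes, 1))   # 33.3
```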

⚠️ Defensive Programming: Every performance optimization is a change to working code. Every change introduces risk. Never optimize code that doesn't need optimization, and always have a regression test suite (Chapter 34) in place before you start. The worst outcome of performance tuning is a faster program that produces wrong results.

36.2 CPU Optimization

While I/O usually dominates, CPU optimization matters in compute-intensive code — financial calculations, data transformations, and high-iteration loops.

Efficient Data Types

The choice of data type has a dramatic effect on arithmetic performance:

Data Type                 PICTURE            Storage    Arithmetic Speed
---------                 -------            -------    ----------------
DISPLAY (zoned decimal)   PIC 9(9)           9 bytes    Slowest — CPU must convert to binary
COMP-3 (packed decimal)   PIC 9(9) COMP-3    5 bytes    Moderate — hardware decimal support
COMP / BINARY             PIC 9(9) COMP      4 bytes    Fastest — native binary arithmetic

For fields used in arithmetic, always use COMP-3 or COMP:

      * SLOW: Arithmetic on DISPLAY fields
       01  WS-AMOUNT-DISP    PIC 9(9)V99.
       01  WS-RATE-DISP      PIC V9(4).
       01  WS-RESULT-DISP    PIC 9(9)V99.
           MULTIPLY WS-AMOUNT-DISP BY WS-RATE-DISP
               GIVING WS-RESULT-DISP.
      * CPU must: unpack -> convert -> multiply -> convert -> pack

      * FAST: Arithmetic on COMP-3 fields
       01  WS-AMOUNT-PKD     PIC 9(9)V99   COMP-3.
       01  WS-RATE-PKD       PIC V9(4)     COMP-3.
       01  WS-RESULT-PKD     PIC 9(9)V99   COMP-3.
           MULTIPLY WS-AMOUNT-PKD BY WS-RATE-PKD
               GIVING WS-RESULT-PKD.
      * CPU: hardware decimal multiply (one instruction)

📊 By the Numbers: In benchmarks on a z15 processor, COMP-3 arithmetic is approximately 3x faster than DISPLAY arithmetic. COMP (binary) arithmetic is approximately 5x faster than DISPLAY. For a program performing 10 million calculations per run, converting from DISPLAY to COMP-3 can save significant CPU time.

COMPUTE vs. Arithmetic Verbs

The COMPUTE statement is generally as fast or faster than individual arithmetic verbs, because the compiler can optimize the entire expression:

      * SLOWER: Individual arithmetic verbs
           MULTIPLY WS-RATE BY WS-PRINCIPAL
               GIVING WS-TEMP
           DIVIDE WS-TEMP BY 365
               GIVING WS-DAILY-AMT
           MULTIPLY WS-DAILY-AMT BY WS-DAYS
               GIVING WS-INTEREST

      * FASTER: Single COMPUTE (compiler optimizes entire expression)
           COMPUTE WS-INTEREST ROUNDED =
               WS-PRINCIPAL * WS-RATE * WS-DAYS / 365

The COMPUTE version generates fewer intermediate storage operations and allows the compiler to use registers more efficiently.

Table Lookups: SEARCH vs. SEARCH ALL

For table lookups, the choice between SEARCH (linear) and SEARCH ALL (binary) has profound performance implications:

Linear search (SEARCH):     O(n) — checks each entry in sequence
Binary search (SEARCH ALL): O(log n) — requires sorted table, halves search space each step

Table Size    Linear Search (avg)    Binary Search (max)    Speedup
----------    -------------------    -------------------    -------
10            5 comparisons          4 comparisons          1.25x
100           50 comparisons         7 comparisons          7x
1,000         500 comparisons        10 comparisons         50x
10,000        5,000 comparisons      14 comparisons         357x
100,000       50,000 comparisons     17 comparisons         2,941x

      * LINEAR SEARCH: O(n) — fine for small tables
       SEARCH WS-STATE-TABLE
           AT END
               MOVE "UNKNOWN" TO WS-OUTPUT-NAME
           WHEN WS-STATE-CODE(WS-IDX) = WS-INPUT-STATE
               MOVE WS-STATE-NAME(WS-IDX) TO WS-OUTPUT-NAME
       END-SEARCH

      * BINARY SEARCH: O(log n) — required for large tables
      * Table MUST be sorted by key (ASCENDING KEY clause)
       SEARCH ALL WS-STATE-TABLE
           AT END
               MOVE "UNKNOWN" TO WS-OUTPUT-NAME
           WHEN WS-STATE-CODE(WS-IDX) = WS-INPUT-STATE
               MOVE WS-STATE-NAME(WS-IDX) TO WS-OUTPUT-NAME
       END-SEARCH

For a table of 50 US states, linear search is fine. For a table of 100,000 procedure codes, binary search is essential.
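To see how quickly the gap grows, here is a short Python sketch (illustrative, not compiler internals) that counts the comparisons each strategy performs against a sorted 100,000-entry table:

```python
def linear_comparisons(table, key):
    """Count the comparisons a linear SEARCH would make."""
    for i, entry in enumerate(table, start=1):
        if entry == key:
            return i
    return len(table)

def binary_comparisons(table, key):
    """Count the comparisons a binary SEARCH ALL would make (table must be sorted)."""
    lo, hi, count = 0, len(table) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        count += 1
        if table[mid] == key:
            return count
        if table[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return count

codes = list(range(100_000))               # sorted 100,000-entry table
print(linear_comparisons(codes, 50_000))   # 50001
print(binary_comparisons(codes, 50_000))   # 16 (at most 17 for any key)
```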

Loop Optimization

Minimize work inside high-iteration loops:

      * SLOW: Redundant computation inside loop
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-RECORD-COUNT
           COMPUTE WS-TAX-RATE =
               FUNCTION CURRENT-DATE(1:4) * 0.001
      *    ^^^ Recomputed every iteration but never changes!
           COMPUTE WS-TAX(WS-IDX) =
               WS-AMOUNT(WS-IDX) * WS-TAX-RATE
       END-PERFORM

      * FAST: Move invariant computation outside loop
       COMPUTE WS-TAX-RATE =
           FUNCTION CURRENT-DATE(1:4) * 0.001
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-RECORD-COUNT
           COMPUTE WS-TAX(WS-IDX) =
               WS-AMOUNT(WS-IDX) * WS-TAX-RATE
       END-PERFORM

Conditional Ordering

In compound conditions, put the most likely-to-fail condition first:

      * If only 5% of claims have CLM-AMOUNT > 50000, that test
      * usually fails; check it first
      * SLOWER: Common (usually-true) condition checked first
           IF CLM-STATUS = "A"
              AND CLM-AMOUNT > 50000
                ...

      * FASTER: Rare condition checked first (short-circuit)
           IF CLM-AMOUNT > 50000
              AND CLM-STATUS = "A"
                ...

COBOL evaluates AND conditions left-to-right. If the first condition is false, the second is not evaluated. Putting the most selective (most likely to be false) condition first reduces total comparisons.

Try It Yourself: Write a program that performs a table lookup 1,000,000 times using both SEARCH and SEARCH ALL on a 1,000-entry sorted table. Time both approaches. On GnuCOBOL, you can use FUNCTION CURRENT-DATE before and after to measure elapsed time.

36.3 I/O Reduction

Since I/O dominates most COBOL programs, I/O optimization yields the greatest returns.

Buffering and Block Size

When you read a sequential file, the operating system reads one block at a time from disk. A block contains multiple logical records. The larger the block, the fewer I/O operations needed to read the entire file.

File: 1,000,000 records, each 100 bytes

Block Size     Records/Block    Blocks to Read    I/O Operations
-----------    -------------    --------------    ---------------
100 bytes      1                1,000,000         1,000,000
800 bytes      8                125,000           125,000
8,000 bytes    80               12,500            12,500
32,000 bytes   320              3,125             3,125
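The blocks-to-read column above is just ceiling division; a small Python helper reproduces it:

```python
def io_operations(n_records, record_len, block_size):
    """One physical I/O per block for a sequential read."""
    records_per_block = block_size // record_len
    return -(-n_records // records_per_block)    # ceiling division

for blk in (100, 800, 8_000, 32_000):
    print(f"BLKSIZE {blk:>6}: {io_operations(1_000_000, 100, blk):>9,} I/Os")
```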

In JCL, specify block size on the DD statement:

//INPUT    DD DSN=PROD.TRANS.FILE,DISP=SHR,
//            DCB=(RECFM=FB,LRECL=100,BLKSIZE=32000)

For VSAM files, the Control Interval (CI) size serves a similar purpose. Larger CI sizes reduce I/O for sequential access patterns.

VSAM Tuning

VSAM performance depends heavily on these parameters:

Buffer allocation: More buffers = more data cached in memory = fewer disk I/Os.

//ACCTMSTR DD DSN=PROD.ACCT.MASTER,DISP=SHR,
//            AMP=('BUFND=20,BUFNI=10')
  • BUFND: Number of data buffers (for data Control Intervals)
  • BUFNI: Number of index buffers (for index records)

General rule: For sequential access, set BUFND high (20+). For random access, set BUFNI high to cache the index.

CI/CA Split tuning: When a VSAM KSDS runs out of space in a Control Interval, it splits — moving half the records to a new CI. Excessive splits degrade performance. Monitor split frequency and reorganize datasets periodically.

Sequential vs. Random Access

If you need to process more than 20-30% of a VSAM file's records, sequential access is faster than random access, even if you skip records:

      * SLOW: Random access for 500,000 out of 1,000,000 records
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-KEY-COUNT
           MOVE WS-KEY-TABLE(WS-IDX) TO ACCT-KEY
           READ ACCT-MASTER
               KEY IS ACCT-KEY
               INVALID KEY CONTINUE
               NOT INVALID KEY
                   PERFORM 3000-PROCESS-ACCOUNT
           END-READ
       END-PERFORM
      * Each READ is a random I/O: ~500,000 I/O operations

      * FAST: Sequential read with skip logic
       PERFORM UNTIL END-OF-FILE
           READ ACCT-MASTER NEXT
               AT END SET END-OF-FILE TO TRUE
               NOT AT END
                   PERFORM 2500-CHECK-IF-NEEDED
           END-READ
       END-PERFORM
      * Sequential read uses buffering: ~3,125 I/O operations
      * (with 32K block size)
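A back-of-envelope Python model shows why sequential wins here. The 5 ms per I/O and 320 records per block figures are the assumptions used throughout this chapter:

```python
T_IO = 0.005                 # assumed ~5 ms per physical I/O (rotational disk)
RECS_PER_BLOCK = 320         # 100-byte records in 32,000-byte blocks

def random_read_seconds(n_hits):
    return n_hits * T_IO                      # one I/O per keyed READ

def sequential_read_seconds(n_records):
    blocks = -(-n_records // RECS_PER_BLOCK)  # one I/O per block
    return blocks * T_IO

print(random_read_seconds(500_000))           # 2500.0 seconds
print(sequential_read_seconds(1_000_000))     # 15.625 seconds for the full scan
```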

Minimize File Opens

Each OPEN/CLOSE cycle has overhead. If you process the same file in multiple program sections, open it once and close it once:

      * SLOW: Open/close for each processing phase
       PERFORM 1000-PHASE-ONE
       PERFORM 2000-PHASE-TWO

       1000-PHASE-ONE.
           OPEN INPUT MASTER-FILE
           ...process...
           CLOSE MASTER-FILE.

       2000-PHASE-TWO.
           OPEN INPUT MASTER-FILE
           ...process...
           CLOSE MASTER-FILE.

      * FAST: Single open/close
       0000-MAIN.
           OPEN INPUT MASTER-FILE
           PERFORM 1000-PHASE-ONE
           PERFORM 2000-PHASE-TWO
           CLOSE MASTER-FILE.

36.4 WORKING-STORAGE Layout

How you arrange data in WORKING-STORAGE affects performance through alignment and locality effects.

Alignment and Slack Bytes

On IBM mainframes, the hardware accesses memory most efficiently when data items are aligned on their natural boundaries:

Data Type                       Alignment          Slack if Misaligned
---------                       ---------          -------------------
COMP (halfword, PIC S9(4))      2-byte boundary    Up to 1 slack byte
COMP (fullword, PIC S9(9))      4-byte boundary    Up to 3 slack bytes
COMP (doubleword, PIC S9(18))   8-byte boundary    Up to 7 slack bytes
COMP-1 (float)                  4-byte boundary    Up to 3 slack bytes
COMP-2 (double)                 8-byte boundary    Up to 7 slack bytes

The compiler inserts invisible "slack bytes" to align fields declared with the SYNCHRONIZED (SYNC) clause; without SYNC, fields are left unaligned and binary accesses may be slower. You can minimize slack by ordering SYNC fields from largest to smallest alignment requirement:

      * POOR LAYOUT: Slack bytes between fields
       01  WS-RECORD.
           05 WS-FLAG         PIC X.                *> 1 byte
      *    (3 slack bytes inserted here for alignment)
           05 WS-AMOUNT       PIC S9(9) COMP SYNC.  *> 4 bytes, fullword
           05 WS-CODE         PIC X(3).             *> 3 bytes
      *    (1 slack byte inserted here)
           05 WS-COUNTER      PIC S9(4) COMP SYNC.  *> 2 bytes, halfword
      * Total: 1+3+4+3+1+2 = 14 bytes (4 wasted on slack)

      * OPTIMAL LAYOUT: No slack bytes
       01  WS-RECORD.
           05 WS-AMOUNT       PIC S9(9) COMP SYNC.  *> 4 bytes (fullword)
           05 WS-COUNTER      PIC S9(4) COMP SYNC.  *> 2 bytes (halfword)
           05 WS-CODE         PIC X(3).             *> 3 bytes
           05 WS-FLAG         PIC X.                *> 1 byte
      * Total: 4+2+3+1 = 10 bytes (0 wasted)
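You can check a proposed layout mechanically. This Python sketch applies the same pad-to-boundary rule to the field sizes and alignments from the table above (helper names are ours):

```python
def layout(fields):
    """fields: (name, size, alignment) tuples; returns (offsets, total, slack)."""
    offset = slack = 0
    offsets = []
    for name, size, align in fields:
        pad = (-offset) % align          # slack bytes to reach the boundary
        slack += pad
        offset += pad
        offsets.append((name, offset))
        offset += size
    return offsets, offset, slack

poor = [("WS-FLAG", 1, 1), ("WS-AMOUNT", 4, 4),
        ("WS-CODE", 3, 1), ("WS-COUNTER", 2, 2)]
good = [("WS-AMOUNT", 4, 4), ("WS-COUNTER", 2, 2),
        ("WS-CODE", 3, 1), ("WS-FLAG", 1, 1)]

print(layout(poor)[1:])   # (14, 4): 14 bytes total, 4 of them slack
print(layout(good)[1:])   # (10, 0): 10 bytes total, no slack
```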

Frequently-Used Fields First

Place the most frequently accessed fields at the beginning of WORKING-STORAGE. While modern hardware caching minimizes this effect, it can matter for very hot loops:

       WORKING-STORAGE SECTION.
      * Most frequently used fields first
       01  WS-HOT-FIELDS.
           05 WS-RECORD-COUNT     PIC 9(9)   COMP.
           05 WS-CURRENT-KEY      PIC X(10).
           05 WS-PROCESS-FLAG     PIC X.
           05 WS-RUNNING-TOTAL    PIC S9(11)V99 COMP-3.

      * Less frequently used fields later
       01  WS-REPORT-FIELDS.
           05 WS-PAGE-COUNT       PIC 9(5).
           05 WS-LINE-COUNT       PIC 9(3).
           ...

Group MOVE vs. Field-by-Field MOVE

A single group MOVE is faster than moving individual fields:

      * SLOWER: Multiple individual MOVEs
           MOVE WS-NAME    TO OUT-NAME
           MOVE WS-ADDRESS TO OUT-ADDRESS
           MOVE WS-CITY    TO OUT-CITY
           MOVE WS-STATE   TO OUT-STATE
           MOVE WS-ZIP     TO OUT-ZIP

      * FASTER: Single group MOVE (if layouts match)
           MOVE WS-CUSTOMER-DATA TO OUT-CUSTOMER-DATA

However, this only works when the source and target group items have identical layouts. If they differ, use MOVE CORRESPONDING:

      * MODERATE: MOVE CORRESPONDING
           MOVE CORRESPONDING WS-CUSTOMER TO OUT-RECORD

36.5 Compiler Options

Enterprise COBOL's compiler options significantly affect performance. The key options are:

OPTIMIZE

The OPTIMIZE option controls the compiler's optimization level:

Option           Effect                    Trade-off
------           ------                    ---------
NOOPTIMIZE       No optimization           Fastest compilation, easiest debugging
OPTIMIZE(STD)    Standard optimization     Good performance, reasonable compile time
OPTIMIZE(FULL)   Aggressive optimization   Best performance, longer compilation, harder to debug

OPTIMIZE(FULL) can improve CPU performance by 10-30% through:

  • Eliminating redundant computations
  • Optimizing register usage
  • Eliminating dead code
  • Inlining small paragraphs

NUMPROC

Controls how the compiler handles sign processing for packed decimal:

Option           Effect
------           ------
NUMPROC(NOPFD)   Validates signs on every operation (safe but slow)
NUMPROC(PFD)     Assumes data has preferred signs (fast but requires clean data)
NUMPROC(MIG)     Migration mode — accepts any valid sign

NUMPROC(PFD) can improve decimal arithmetic performance by 10-15%, but will produce incorrect results if data has non-preferred sign codes. Use only when you can guarantee clean data.

TRUNC

Controls truncation behavior for COMP (binary) fields:

Option       Effect
------       ------
TRUNC(STD)   Truncates to PIC size (safe, matches language standard)
TRUNC(OPT)   Truncates to native binary size (faster, may differ from PIC)
TRUNC(BIN)   Treats all COMP as native binary (fastest, non-standard)

Example: PIC S9(4) COMP occupies a halfword (2 bytes = range -32768 to 32767). With TRUNC(STD), values are truncated to -9999 to 9999. With TRUNC(OPT), values use the full halfword range. This matters for loop counters and indices.

SSRANGE

SSRANGE      — Runtime subscript range checking (safe, 10-20% slower)
NOSSRANGE    — No range checking (fast, risk of storage overlays)

Use SSRANGE during development and testing. Consider NOSSRANGE for production if performance is critical — but only if your tests are thorough.

💡 The Modernization Spectrum: Compiler options represent one of the easiest performance wins — changing a JCL compile step can improve performance by 10-30% with zero code changes. This is the lowest-risk, highest-return optimization available.

36.6 SQL Performance

For programs with embedded SQL (DB2), SQL performance often dominates everything else. A single poorly-written query can be slower than the entire rest of the program.

Avoid Full Table Scans

      * SLOW: Full table scan (no index on CLAIM_DATE)
           EXEC SQL
               SELECT COUNT(*)
               INTO :WS-CLAIM-COUNT
               FROM CLAIMS
               WHERE CLAIM_DATE >= '2025-01-01'
           END-EXEC

      * FAST: Use index on CLAIM_DATE
      * (Ensure index exists: CREATE INDEX IX_CLAIMS_DATE
      *  ON CLAIMS (CLAIM_DATE))

The query itself doesn't change — the performance difference comes from the index. Use EXPLAIN to verify your query uses an index.

FETCH FIRST for Existence Checks

When you only need to know if a row exists, don't retrieve all matching rows:

      * SLOW: Fetches potentially thousands of rows
           EXEC SQL
               SELECT MEMBER_ID
               INTO :WS-MEMBER-ID
               FROM MEMBERS
               WHERE MEMBER_STATUS = 'A'
                 AND MEMBER_STATE = :WS-STATE
           END-EXEC

      * FAST: Stop after first match
           EXEC SQL
               SELECT MEMBER_ID
               INTO :WS-MEMBER-ID
               FROM MEMBERS
               WHERE MEMBER_STATUS = 'A'
                 AND MEMBER_STATE = :WS-STATE
               FETCH FIRST 1 ROW ONLY
           END-EXEC

Use Host Variable Arrays for Bulk Operations

Instead of fetching one row at a time, fetch blocks of rows:

       01  WS-CLAIM-ARRAY.
           05 WS-CLAIM-ID     PIC X(8) OCCURS 100 TIMES.
           05 WS-CLAIM-AMT    PIC S9(9)V99 COMP-3
                               OCCURS 100 TIMES.
       01  WS-FETCH-COUNT     PIC S9(4) COMP.

      * SLOW: One row per FETCH
       PERFORM UNTIL SQLCODE NOT = 0
           EXEC SQL
               FETCH CLAIM-CURSOR
               INTO :WS-SINGLE-ID, :WS-SINGLE-AMT
           END-EXEC
           IF SQLCODE = 0
               PERFORM 3000-PROCESS-CLAIM
           END-IF
       END-PERFORM

      * FAST: 100 rows per FETCH
       PERFORM UNTIL SQLCODE NOT = 0
           EXEC SQL
               FETCH CLAIM-CURSOR
               FOR 100 ROWS
               INTO :WS-CLAIM-ID, :WS-CLAIM-AMT
           END-EXEC
           MOVE SQLERRD(3) TO WS-FETCH-COUNT
           PERFORM VARYING WS-IDX FROM 1 BY 1
               UNTIL WS-IDX > WS-FETCH-COUNT
               PERFORM 3000-PROCESS-CLAIM
           END-PERFORM
       END-PERFORM

Multi-row FETCH reduces the number of DB2 interactions by a factor of 100, dramatically reducing overhead.
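The payoff follows from a simple per-call overhead model; the cost constants below are illustrative assumptions, not DB2 measurements:

```python
PER_CALL_OVERHEAD = 0.0005    # assumed 0.5 ms fixed cost per FETCH call
PER_ROW_COST = 0.00001        # assumed 10 microseconds per row returned

def fetch_seconds(n_rows, rows_per_fetch):
    """Elapsed time to retrieve n_rows, given a block size per FETCH."""
    calls = -(-n_rows // rows_per_fetch)        # ceiling division
    return calls * PER_CALL_OVERHEAD + n_rows * PER_ROW_COST

print(round(fetch_seconds(1_000_000, 1), 1))    # 510.0 — one row per FETCH
print(round(fetch_seconds(1_000_000, 100), 1))  # 15.0 — 100 rows per FETCH
```

The per-row work is identical in both cases; only the number of times you pay the fixed call overhead changes.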

Avoid SQL in Loops

      * TERRIBLE: SQL inside a loop — N queries for N records
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-CLAIM-COUNT
           EXEC SQL
               SELECT PROVIDER_NAME
               INTO :WS-PROV-NAME
               FROM PROVIDERS
               WHERE PROVIDER_ID = :WS-PROV-ID(WS-IDX)
           END-EXEC
       END-PERFORM

      * BETTER: Join in a single query
           EXEC SQL
               DECLARE CLAIM-PROV-CURSOR CURSOR FOR
               SELECT C.CLAIM_ID, P.PROVIDER_NAME
               FROM CLAIMS C
               JOIN PROVIDERS P ON C.PROVIDER_ID = P.PROVIDER_ID
               WHERE C.BATCH_DATE = :WS-BATCH-DATE
           END-EXEC

36.7 CICS Performance

For online programs running under CICS, performance tuning has different priorities than batch.

COMMAREA Sizing

The COMMAREA (Communication Area) passes data between CICS transactions. Keep it as small as possible:

      * POOR: Oversized COMMAREA
       01  DFHCOMMAREA.
           05 CA-CUSTOMER-DATA.
              10 CA-CUST-NAME      PIC X(100).
              10 CA-CUST-ADDR      PIC X(200).
              10 CA-CUST-HISTORY   PIC X(5000).
      * 5,300 bytes copied on every RETURN TRANSID
      * Most of it unchanged between interactions

      * BETTER: Minimal COMMAREA with key references
       01  DFHCOMMAREA.
           05 CA-CUST-ID          PIC X(10).
           05 CA-SCREEN-STATE     PIC X(2).
           05 CA-LAST-ACTION      PIC X.
           05 CA-ERROR-CODE       PIC X(4).
      * 17 bytes — re-read customer data from DB2/VSAM when needed

BMS Map Optimization

Reduce BMS (Basic Mapping Support) overhead by sending only changed fields:

      * SLOW: Send entire map every time
           EXEC CICS SEND MAP('ACCTMAP')
                MAPSET('ACCTMS')
                ERASE
           END-EXEC

      * FAST: Send only data (not format) when map already displayed
           EXEC CICS SEND MAP('ACCTMAP')
                MAPSET('ACCTMS')
                DATAONLY
           END-EXEC

Avoid Excessive CICS Commands

Each CICS command (READ, WRITE, LINK, etc.) has overhead for command-level processing. Batch CICS operations when possible:

      * SLOW: Multiple READQ TS for individual fields
           EXEC CICS READQ TS QUEUE('MYQUEUE')
               INTO(WS-FIELD-1) ITEM(1) END-EXEC
           EXEC CICS READQ TS QUEUE('MYQUEUE')
               INTO(WS-FIELD-2) ITEM(2) END-EXEC
           EXEC CICS READQ TS QUEUE('MYQUEUE')
               INTO(WS-FIELD-3) ITEM(3) END-EXEC

      * FAST: Single READQ TS for a group item
           EXEC CICS READQ TS QUEUE('MYQUEUE')
               INTO(WS-ALL-FIELDS) ITEM(1) END-EXEC

36.8 Batch Job Tuning

Beyond program-level optimization, batch job JCL tuning can yield significant improvements.

REGION Size

The REGION parameter controls how much memory the job step can use. Too little causes ABENDs; too much wastes resources:

//* Too small: May cause S878 ABEND
//STEP1    EXEC PGM=BALCALC,REGION=2M

//* Appropriate: Enough for buffers and working storage
//STEP1    EXEC PGM=BALCALC,REGION=64M

//* Excessive: Wastes memory
//STEP1    EXEC PGM=BALCALC,REGION=0M
//* REGION=0M means "give me everything" — avoid in production

Sort Optimization

DFSORT (or SyncSort) operations often account for a large fraction of batch elapsed time. Key tuning parameters:

//SORT     EXEC PGM=SORT
//SORTIN   DD DSN=PROD.UNSORTED.FILE,DISP=SHR
//SORTOUT  DD DSN=PROD.SORTED.FILE,DISP=(NEW,CATLG,DELETE),
//            SPACE=(CYL,(100,50),RLSE),
//            DCB=(RECFM=FB,LRECL=200,BLKSIZE=32000)
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(50))
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(50))
//SORTWK03 DD UNIT=SYSDA,SPACE=(CYL,(50))
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
  OPTION MAINSIZE=MAX,FILSZ=E2000000
/*

Key optimizations:

  • Multiple SORTWK DDs: Allow parallel sort work — 3 work files is optimal for most sorts.
  • MAINSIZE=MAX: Use as much memory as possible for in-memory sorting.
  • FILSZ: Estimate file size so SORT can choose the optimal algorithm.
  • Large BLKSIZE on SORTOUT: Reduces output I/O.

Checkpoint/Restart

For long-running batch jobs, periodic checkpoints allow restart from the last checkpoint rather than from the beginning:

       5000-CHECKPOINT.
           IF WS-RECORD-COUNT >= WS-CHECKPOINT-INTERVAL
               PERFORM 5100-WRITE-CHECKPOINT
               MOVE 0 TO WS-RECORD-COUNT
           END-IF.

       5100-WRITE-CHECKPOINT.
           EXEC SQL COMMIT END-EXEC
           DISPLAY "Checkpoint at record: "
               WS-TOTAL-PROCESSED
               " Time: " FUNCTION CURRENT-DATE.

For DB2 batch, COMMIT frequency is critical:

Commits too rarely: Long-running locks, log space issues, long restart
Commits too often: Overhead of commit processing

Sweet spot: Every 1,000-10,000 records (depends on workload)
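The trade-off can be sketched numerically. The per-record and per-commit costs below are illustrative assumptions, but the shape of the curve is general:

```python
T_RECORD = 0.001     # assumed processing time per record (1 ms)
T_COMMIT = 0.05      # assumed cost of one commit (50 ms)

def run_seconds(n_records, commit_interval):
    """Total run time including commit overhead."""
    commits = n_records // commit_interval
    return n_records * T_RECORD + commits * T_COMMIT

def rework_seconds(commit_interval):
    """Average work redone after a failure: half an interval."""
    return (commit_interval / 2) * T_RECORD

for interval in (10, 1_000, 10_000, 1_000_000):
    print(interval, round(run_seconds(1_000_000, interval), 2),
          rework_seconds(interval))
```

Very small intervals inflate run time; very large intervals inflate restart exposure. The middle of the range minimizes the sum.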

Mathematical Formulation: I/O Cost Model

We can model the cost of a batch program as:

Total_Time = CPU_Time + I/O_Time + Wait_Time

I/O_Time = N_reads * T_read + N_writes * T_write

Where:
  N_reads  = File_Size / Block_Size    (for sequential access)
  N_reads  = N_records                  (for random access)
  T_read   ≈ 5ms (disk) or 0.1ms (SSD/cache)
  T_write  ≈ 5ms (disk) or 0.1ms (SSD/cache)

For GlobalBank's BAL-CALC processing 2.3 million accounts:

With 100-byte records and 800-byte blocks (old configuration):
  N_reads = 2,300,000 / 8 = 287,500 I/Os
  I/O_Time = 287,500 * 5ms = 1,437.5 seconds = 24 minutes

With 100-byte records and 32,000-byte blocks (optimized):
  N_reads = 2,300,000 / 320 = 7,188 I/Os
  I/O_Time = 7,188 * 5ms = 35.9 seconds = 0.6 minutes

Changing the block size alone reduced I/O time from 24 minutes to under 1 minute — a 40x improvement with zero code changes.
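The model is easy to reproduce; this Python helper recomputes the BAL-CALC figures above under the same 5 ms-per-read assumption:

```python
def io_time(n_records, recs_per_block, t_read_ms=5):
    """Return (I/O count, I/O minutes) for one sequential pass."""
    n_ios = -(-n_records // recs_per_block)      # ceiling division
    return n_ios, n_ios * t_read_ms / 1000 / 60

print(io_time(2_300_000, 8))      # (287500, ~24 min)  — 800-byte blocks
print(io_time(2_300_000, 320))    # (7188, ~0.6 min)   — 32,000-byte blocks
```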

📊 Big-O for COBOL Operations: Understanding algorithmic complexity helps predict how performance scales with data volume:

Operation              Complexity    1K Records    1M Records    1B Records
---------              ----------    ----------    ----------    ----------
Sequential file read   O(n)          Fast          Moderate      Slow
VSAM random read       O(log n)      Fast          Fast          Fast
Linear table search    O(n)          Fast          Slow          Impossible
Binary table search    O(log n)      Fast          Fast          Fast
Nested loop match      O(n*m)        Moderate      Impossible    Impossible
Sort                   O(n log n)    Fast          Moderate      Slow

36.9 Profiling Tools

You cannot optimize what you cannot measure. Mainframe profiling tools tell you exactly where your program spends its time.

IBM Strobe

Strobe is the most widely used mainframe profiling tool. It samples the program counter at regular intervals, building a statistical profile of time spent in each paragraph:

STROBE Performance Profile: BAL-CALC
=====================================
Run Date: 2025-10-20
Total CPU Time: 847.3 seconds
Total Elapsed: 3,612.0 seconds
CPU/Elapsed Ratio: 23.5% (I/O bound)

Paragraph Profile (Top 10 by CPU):
Paragraph                CPU Secs    %CPU    Calls
---------                --------    ----    -----
3110-COMPOUND-DAILY      312.4       36.9%   2,300,000
2000-READ-ACCOUNT        198.7       23.5%   2,300,000
4000-WRITE-OUTPUT        145.2       17.1%   2,300,000
3200-CALC-TIERED-RATE     67.8        8.0%     180,000
3120-COMPOUND-MONTHLY     45.1        5.3%     450,000
3000-CALC-INTEREST        32.6        3.8%   2,300,000
1000-INIT                  0.4        0.0%         1
9000-CLEANUP               0.1        0.0%         1
Other                      45.0        5.3%

This profile immediately reveals that 3110-COMPOUND-DAILY consumes 37% of CPU time. This is the hot path — the paragraph to optimize first.

SMF Records

System Management Facility (SMF) records provide job-level performance data:

  • SMF Type 30: Job/step level CPU and elapsed time
  • SMF Type 42: VSAM dataset statistics (I/O counts, splits, etc.)
  • SMF Type 101: DB2 accounting (SQL execution time, rows processed)

RMF (Resource Measurement Facility)

RMF provides system-wide performance data, helping identify contention and resource bottlenecks at the system level rather than the program level.

36.10 GlobalBank Case Study: Optimizing the Nightly Batch

Maria Chen's assignment: reduce the nightly batch from 4 hours to 90 minutes. Here's how she did it.

Step 1: Profile

Maria ran Strobe on each job in the batch cycle:

Job           Elapsed    CPU       Primary Bottleneck
---           -------    ---       ------------------
BAL-CALC      72 min     14 min    CPU (compound interest calculation)
TXN-POST      55 min     3 min     I/O (VSAM random reads)
RPT-DAILY     45 min     8 min     Sort (5 million records)
REG-FEED      38 min     2 min     I/O (sequential write, small blocks)
ACCT-MAINT    22 min     5 min     DB2 (SELECT in loop)
Other         8 min      2 min     Mixed
Total         240 min    34 min

Step 2: Prioritize

With only 34 of the 240 elapsed minutes spent on CPU, the cycle was overwhelmingly I/O-bound. I/O optimization would therefore have the biggest impact; CPU optimization mattered mainly for BAL-CALC.

Step 3: Optimize

BAL-CALC (72 min → 25 min):
  • Changed DISPLAY arithmetic fields to COMP-3: saved 5 min CPU
  • Precomputed the daily rate outside the main loop: saved 3 min CPU
  • Changed ACCT-MASTER block size from 4K to 32K: saved 18 min I/O
  • Used the OPTIMIZE(FULL) compiler option: saved 4 min CPU
  • Increased VSAM buffers (BUFND=30): saved 17 min I/O

TXN-POST (55 min → 15 min):
  • Sorted the transaction file by account key before processing: converted random VSAM reads to sequential skip-reads
  • Increased block sizes on all files: reduced I/O count by 90%
  • Net result: 40 minutes saved

RPT-DAILY (45 min → 12 min):
  • Added 3 SORTWK DD statements (was using 1): enabled parallel sort work
  • Increased MAINSIZE to MAX: more in-memory sorting
  • Optimized output block sizes: faster output writes
  • Net result: 33 minutes saved

REG-FEED (38 min → 8 min):
  • Block size was 800 bytes (LRECL = 800, BLKSIZE = 800): records not blocked at all!
  • Changed to BLKSIZE=32000 (40 records per block)
  • Net result: 30 minutes saved from a one-line JCL change

ACCT-MAINT (22 min → 8 min):
  • Replaced a SELECT-in-a-loop with a JOIN query: 50,000 DB2 calls became 1
  • Net result: 14 minutes saved

Results

Job           Before     After     Savings
---           ------     -----     -------
BAL-CALC      72 min     25 min    47 min
TXN-POST      55 min     15 min    40 min
RPT-DAILY     45 min     12 min    33 min
REG-FEED      38 min     8 min     30 min
ACCT-MAINT    22 min     8 min     14 min
Other         8 min      7 min     1 min
Total         240 min    75 min    165 min

The batch cycle went from 240 minutes to 75 minutes — a 69% reduction, exceeding the 90-minute target. The single largest improvement (REG-FEED, saving 30 minutes) required changing one line of JCL.

"The block size thing still kills me," Derek said. "Thirty minutes wasted every night because someone in 1998 forgot to specify BLKSIZE."

Maria shrugged. "That's why we measure. You never know where the time is going until you look."

36.11 MedClaim Case Study: Tuning Claim Adjudication

James Okafor needed to increase CLM-ADJUD's throughput from 8,000 claims per hour to 20,000 claims per hour to meet a new SLA with a large provider network.

The Bottleneck

Profiling revealed that 65% of elapsed time was spent in DB2 operations — specifically, a SELECT inside the claim processing loop that looked up the provider's fee schedule:

      * Original: One DB2 call per claim
       3000-LOOKUP-FEE-SCHEDULE.
           EXEC SQL
               SELECT ALLOWED_AMOUNT
               INTO :WS-ALLOWED-AMT
               FROM FEE_SCHEDULE
               WHERE PROVIDER_ID = :CLM-PROVIDER-ID
                 AND PROCEDURE_CODE = :CLM-PROCEDURE-CODE
                 AND EFFECTIVE_DATE <= :CLM-SERVICE-DATE
               ORDER BY EFFECTIVE_DATE DESC
               FETCH FIRST 1 ROW ONLY
           END-EXEC.

At 8,000 claims per hour, this query executed 8,000 times per hour. The query itself took approximately 3ms, only about 24 seconds per hour, but each execution also paid the fixed cost of a DB2 interaction: thread switching, statement scheduling, and related overhead. That per-call cost, multiplied across every claim, was the real killer.

The Solution

James implemented a three-level caching strategy:

Level 1: In-memory table — Cache the 500 most common provider/procedure combinations in a WORKING-STORAGE table:

       01  WS-FEE-CACHE.
           05 WS-CACHE-ENTRY OCCURS 500 TIMES
                             ASCENDING KEY WS-CACHE-KEY
                             INDEXED BY WS-CACHE-IDX.
              10 WS-CACHE-KEY.
                 15 WS-CACHE-PROV  PIC X(8).
                 15 WS-CACHE-PROC  PIC X(5).
              10 WS-CACHE-AMOUNT   PIC S9(7)V99 COMP-3.
              10 WS-CACHE-DATE     PIC X(10).

      * Search argument must be a field outside the table
       01  WS-SEARCH-KEY.
           05 WS-SEARCH-PROV    PIC X(8).
           05 WS-SEARCH-PROC    PIC X(5).

       3000-LOOKUP-FEE-SCHEDULE.
           MOVE CLM-PROVIDER-ID TO WS-SEARCH-PROV
           MOVE CLM-PROCEDURE-CODE TO WS-SEARCH-PROC
           SEARCH ALL WS-CACHE-ENTRY
               AT END
                   PERFORM 3100-DB2-LOOKUP
                   PERFORM 3200-UPDATE-CACHE
               WHEN WS-CACHE-KEY(WS-CACHE-IDX) = WS-SEARCH-KEY
                   MOVE WS-CACHE-AMOUNT(WS-CACHE-IDX)
                       TO WS-ALLOWED-AMT
           END-SEARCH.

Level 2: Multi-row FETCH — When a DB2 lookup was needed, fetch multiple fee schedule entries at once.

Level 3: Preload — At program start, load the top 500 combinations from a pre-computed table.

Results

The cache hit rate was 78% — meaning 78% of claims were resolved without any DB2 call. The remaining 22% hit DB2 but with optimized queries (proper index usage, FETCH FIRST).

Throughput increased from 8,000 to 26,000 claims per hour — exceeding the 20,000 target by 30%.
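The arithmetic behind that result is worth a quick check. The sketch below (Python, with illustrative per-lookup costs: the 3ms DB2 figure comes from the profile above, the in-memory SEARCH ALL cost is an assumed value) shows how the hit rate drives the average lookup cost:

```python
# Back-of-the-envelope model of how a cache hit rate changes the
# average cost of a fee-schedule lookup. Costs are illustrative.

def effective_lookup_ms(hit_rate, hit_cost_ms, miss_cost_ms):
    """Weighted average cost per lookup given a cache hit rate."""
    return hit_rate * hit_cost_ms + (1.0 - hit_rate) * miss_cost_ms

# Assumed costs: an in-memory SEARCH ALL is microseconds; a DB2
# call (with thread switching and I/O wait) is milliseconds.
no_cache = effective_lookup_ms(0.0, 0.005, 3.0)   # every claim hits DB2
cached   = effective_lookup_ms(0.78, 0.005, 3.0)  # 78% hit rate

print(f"avg cost without cache: {no_cache:.3f} ms")
print(f"avg cost with cache:    {cached:.3f} ms")
print(f"speedup on lookups:     {no_cache / cached:.1f}x")
```

At a 78% hit rate, the average lookup drops from 3ms to about 0.66ms, a roughly 4.5x reduction on the dominant code path. That is what moved overall throughput past the target.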

36.12 Performance Tuning Checklist

Use this checklist when optimizing COBOL programs:

PERFORMANCE TUNING CHECKLIST
===============================

BEFORE STARTING:
[ ] Profile the program to identify actual bottlenecks
[ ] Establish baseline measurements (elapsed, CPU, I/O counts)
[ ] Verify regression test suite exists
[ ] Set a specific performance target

I/O OPTIMIZATION:
[ ] Sequential file block sizes >= 32,000
[ ] VSAM buffer counts appropriate (BUFND/BUFNI)
[ ] Sequential access used when processing >20% of file
[ ] File OPEN/CLOSE minimized
[ ] No unnecessary file reads (cache when possible)

CPU OPTIMIZATION:
[ ] Arithmetic fields are COMP-3 or COMP (not DISPLAY)
[ ] COMPUTE used for complex expressions
[ ] SEARCH ALL used for tables > 50 entries
[ ] Invariant computations moved outside loops
[ ] Conditions ordered by selectivity

COMPILER OPTIONS:
[ ] OPTIMIZE(STD) or OPTIMIZE(FULL)
[ ] NUMPROC(PFD) if data is clean
[ ] TRUNC(OPT) if binary fields are within PIC range
[ ] NOSSRANGE in production (SSRANGE in test)

SQL OPTIMIZATION:
[ ] No SELECT inside loops (use JOINs or cursors)
[ ] Multi-row FETCH for bulk processing
[ ] FETCH FIRST for existence checks
[ ] Indexes exist for WHERE clause predicates
[ ] EXPLAIN used to verify access paths

BATCH JOB TUNING:
[ ] REGION size appropriate
[ ] Sort work files (3 SORTWK DDs)
[ ] Sort MAINSIZE=MAX
[ ] Checkpoint frequency balanced
[ ] Job sequencing minimizes wait time

36.13 VSAM Tuning Deep Dive

VSAM performance tuning deserves special attention because VSAM files are the backbone of most COBOL batch and online systems. The three parameters that matter most are buffer allocation, Control Interval size, and free space management.

BUFND and BUFNI: The Buffer Equation

BUFND (data buffers) and BUFNI (index buffers) control how much of a VSAM dataset is cached in memory during processing. The relationship between buffer count and I/O reduction is dramatic:

VSAM KSDS: ACCT-MASTER
Records: 2,300,000
CI Size (Data): 4,096 bytes
CI Size (Index): 2,048 bytes
Record Size: 200 bytes
Records per CI: 20
Index Levels: 3

Buffer Scenarios for Sequential Access:
──────────────────────────────────────────────────────
BUFND    Index Cached?    Data I/Os      Elapsed (est)
──────────────────────────────────────────────────────
2        No              115,000        575 sec
5        Partial          92,000        460 sec
10       Partial          69,000        345 sec
20       Yes              23,000        115 sec
30       Yes              11,500         58 sec
50       Yes               5,750         29 sec
──────────────────────────────────────────────────────

For sequential processing, each additional data buffer reduces I/O because the system reads ahead. The rule of thumb: set BUFND to at least the number of data CIs that fit in one CA (Control Area), plus a few more for read-ahead.

For random access, BUFNI is more important than BUFND:

Buffer Scenarios for Random Access (100,000 lookups):
──────────────────────────────────────────────────────
BUFNI    BUFND    Index I/Os   Data I/Os   Total I/Os
──────────────────────────────────────────────────────
3        2        200,000      100,000     300,000
5        2         80,000      100,000     180,000
10       2         20,000      100,000     120,000
20       5          5,000      100,000     105,000
20       20         5,000       60,000      65,000
──────────────────────────────────────────────────────

With 20 index buffers, the entire index set is typically cached in memory after the first few accesses, eliminating index I/O entirely. Adding more data buffers helps if the same records are accessed repeatedly (locality of reference).
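The shape of that table can be reproduced with a deliberately simplified model: each lookup pays one I/O per index level not held in a buffer, plus one data-CI read. Real VSAM buffering is more nuanced, and the `cached_levels` and `data_hit_rate` parameters are modeling assumptions rather than VSAM tuning knobs, but the sketch shows why BUFNI dominates random access:

```python
# Simplified model of VSAM random-access I/O: each lookup walks the
# index tree (one I/O per level not held in a buffer), then reads
# one data CI unless that CI is already buffered.

def random_access_ios(lookups, index_levels, cached_levels,
                      data_hit_rate=0.0):
    """Return (index_ios, data_ios) under the simplified model."""
    index_ios = lookups * max(index_levels - cached_levels, 0)
    data_ios = int(lookups * (1.0 - data_hit_rate))
    return index_ios, data_ios

# 100,000 lookups against the 3-level index from the table above
few_buffers  = random_access_ios(100_000, 3, 1)  # only top level cached
many_buffers = random_access_ios(100_000, 3, 3)  # whole index set cached

print(few_buffers)    # (200000, 100000) -- matches the BUFNI=3 row
print(many_buffers)   # (0, 100000) -- index I/O eliminated
```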

JCL Buffer Specification

//* Sequential batch processing — maximize BUFND
//ACCTMSTR DD DSN=PROD.ACCT.MASTER,DISP=SHR,
//            AMP=('BUFND=30,BUFNI=5')

//* Random lookup in CICS — maximize BUFNI
//* (CICS FCT controls buffers, not JCL)
//* In CICS:
//*   DEFINE FILE(ACCTMST)
//*     DSNAME(PROD.ACCT.MASTER)
//*     STRINGS(10)
//*     DATABUFFERS(20)
//*     INDEXBUFFERS(20)

//* Mixed access pattern — balance both
//ACCTMSTR DD DSN=PROD.ACCT.MASTER,DISP=SHR,
//            AMP=('BUFND=20,BUFNI=15')

CI/CA Splits and Reorganization

When a VSAM KSDS needs to insert a record into a full CI, it performs a CI split — moving half the records to a new CI. If the CA is also full, a CA split occurs, which is even more expensive. Excessive splitting degrades both sequential and random access performance.

Monitor splits using IDCAMS LISTCAT:

//LISTCAT  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  LISTCAT ENT(PROD.ACCT.MASTER) ALL
/*

Key statistics to watch in the LISTCAT output:

STATISTICS
  CI-SPLITS --------- 12,847    <<<< Warning if > 5% of CIs
  CA-SPLITS ---------     23    <<<< Warning if ANY
  EXTENTS -----------      4
  REC-TOTAL --------- 2,300,000
  REC-DELETED ---------   450
  REC-INSERTED -------- 85,000
  REC-UPDATED --------- 1,200,000
  FREESPACE-CI% ------- 0       <<<< Exhausted!
  FREESPACE-CA% ------- 0       <<<< Exhausted!

When CI splits exceed 5% of total CIs, or CA splits appear at all, reorganize the dataset:

//*----------------------------------------------------------
//* Reorganize VSAM KSDS to eliminate splits
//* INDATASET/OUTDATASET let IDCAMS allocate the cluster
//* dynamically, so the restore opens the newly defined file
//*----------------------------------------------------------
//REORG    EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//BACKUP   DD DSN=TEMP.ACCT.BACKUP,DISP=(NEW,CATLG),
//            SPACE=(CYL,(200,50),RLSE)
//SYSIN    DD *
  REPRO INDATASET(PROD.ACCT.MASTER) OUTFILE(BACKUP)
  DELETE PROD.ACCT.MASTER PURGE
  DEFINE CLUSTER (                         -
      NAME(PROD.ACCT.MASTER)               -
      RECORDSIZE(200 200)                  -
      KEYS(10 0)                           -
      CYLINDERS(250 50)                    -
      FREESPACE(20 10)                     -
      SHAREOPTIONS(2 3)                    -
  )                                        -
  DATA (NAME(PROD.ACCT.MASTER.DATA))       -
  INDEX (NAME(PROD.ACCT.MASTER.INDEX))
  REPRO INFILE(BACKUP) OUTDATASET(PROD.ACCT.MASTER)
/*

The FREESPACE(20 10) parameter reserves 20% free space in each CI and 10% free CIs in each CA, providing room for insertions without immediate splitting.

⚠️ Caution: VSAM reorganization requires exclusive access to the dataset. Schedule reorganizations during maintenance windows when no programs are accessing the file. Always create a backup before deleting and redefining the cluster.

VSAM Local Shared Resources (LSR)

For CICS environments, Local Shared Resources (LSR) pools allow multiple files to share the same buffer pool, improving overall memory utilization:

CICS LSR Pool Configuration:
Pool 1 (CI Size 4096):   512 buffers shared across 15 files
Pool 2 (CI Size 2048):   256 buffers shared across 8 index files
Pool 3 (CI Size 32768):  64 buffers for large-CI sequential files

LSR is particularly effective when many VSAM files are accessed intermittently — the buffers serve whichever file needs them most at any given moment, rather than being dedicated to idle files.

36.14 Compiler Optimization Flags Deep Dive

Enterprise COBOL's compiler options interact with each other in ways that are not always obvious. Understanding these interactions is essential for squeezing maximum performance from your programs.

The ARCH Option

The ARCH option tells the compiler which z/Architecture level to target. Higher ARCH levels unlock hardware instructions that are not available on older processors:

ARCH Level   Machine   Key Feature
ARCH(9)      z196      Distinct operands, high-word facility
ARCH(10)     zEC12     Transactional execution
ARCH(11)     z13       Vector (SIMD) facility
ARCH(12)     z14       Vector packed decimal
ARCH(13)     z15       Vector packed-decimal enhancements, DEFLATE
ARCH(14)     z16       AI acceleration (on-chip inference)

For financial COBOL programs, ARCH(12) or higher is particularly valuable because vector packed decimal instructions perform COMP-3 arithmetic in hardware at speeds previously only available for binary arithmetic.

Performance comparison: COMP-3 arithmetic at different ARCH levels
(10 million multiply operations)

ARCH(9):   4.2 seconds
ARCH(11):  3.1 seconds
ARCH(12):  1.8 seconds   <<<< Vector packed decimal
ARCH(13):  1.6 seconds
ARCH(14):  1.4 seconds

Interaction Between OPTIMIZE and Other Options

Option Combination Effects:

OPTIMIZE(FULL) + ARCH(12):
  Maximum optimization with modern hardware instructions.
  Best performance. May produce code that does not run
  on older hardware.

OPTIMIZE(FULL) + SSRANGE:
  The optimizer cannot fully optimize subscript operations
  because range checks prevent certain transformations.
  Performance impact: 15-25% slower than OPTIMIZE(FULL)
  + NOSSRANGE.

OPTIMIZE(FULL) + TEST(ALL):
  Debug hooks reduce optimization effectiveness.
  Performance impact: 20-40% slower than OPTIMIZE(FULL)
  without TEST.

OPTIMIZE(FULL) + NUMPROC(PFD) + TRUNC(OPT):
  The "maximum performance" combination. Use only when:
  - All numeric data has preferred signs
  - Binary fields stay within PIC range
  - You have thorough regression tests

Compiler Option Selection Guide

┌────────────────────┬───────────────────────────────────────┐
│ ENVIRONMENT        │ RECOMMENDED OPTIONS                   │
├────────────────────┼───────────────────────────────────────┤
│ Development        │ NOOPTIMIZE, SSRANGE, TEST(ALL)        │
│                    │ Priority: Debugging ease              │
│                    │                                       │
│ Unit Test          │ OPTIMIZE(STD), SSRANGE, TEST(SEP)     │
│                    │ Priority: Catch boundary errors       │
│                    │                                       │
│ Integration Test   │ OPTIMIZE(STD), NOSSRANGE              │
│                    │ Priority: Match production behavior   │
│                    │                                       │
│ Performance Test   │ OPTIMIZE(FULL), NOSSRANGE,            │
│                    │ NUMPROC(PFD), ARCH(12)                │
│                    │ Priority: Maximum speed               │
│                    │                                       │
│ Production         │ OPTIMIZE(FULL), NOSSRANGE,            │
│                    │ NUMPROC(PFD), ARCH(current HW)        │
│                    │ Priority: Performance + stability     │
└────────────────────┴───────────────────────────────────────┘

Try It Yourself: If you have access to GnuCOBOL, compile the same program with cobc -O0 (no optimization) and cobc -O2 (full optimization). Run both versions on a loop that performs 1,000,000 arithmetic operations and compare elapsed times using FUNCTION CURRENT-DATE before and after the loop. You should see a measurable difference, especially for COMP-3 arithmetic.

36.15 SQL EXPLAIN Analysis

For COBOL programs with embedded DB2 SQL, the EXPLAIN statement is the most powerful tool for understanding query performance. EXPLAIN populates a plan table showing exactly how DB2 will access data for your query.

Running EXPLAIN

      * Explain a query before running it
       EXEC SQL
           EXPLAIN PLAN SET QUERYNO = 1 FOR
           SELECT C.CLAIM_ID, C.CLAIM_STATUS,
                  P.PROVIDER_NAME, P.SPECIALTY
           FROM CLAIMS C
           JOIN PROVIDERS P
               ON C.PROVIDER_ID = P.PROVIDER_ID
           WHERE C.BATCH_DATE = :WS-BATCH-DATE
             AND C.CLAIM_STATUS = 'N'
       END-EXEC

Reading the Plan Table

PLAN_TABLE output for QUERYNO = 1:
─────────────────────────────────────────────────────────────
QUERY  TABLE      ACCESS  MATCH  INDEX        PREFETCH
BLOCK  NAME       TYPE    COLS   NAME         TYPE
─────────────────────────────────────────────────────────────
  1    CLAIMS     I         2    IX_CLM_BATCH  S
  1    PROVIDERS  I         1    PK_PROVIDER   —
─────────────────────────────────────────────────────────────

ACCESS TYPE KEY:
  I  = Index access (good)
  R  = Tablespace scan (bad for large tables)
  M  = Multiple-index access
  MX = One index scan within multiple-index access
  N  = Index access using an IN-list predicate

Interpreting the output: Both tables use index access (type "I"), which means DB2 is using indexes to find rows — good. The CLAIMS table matches on 2 columns (BATCH_DATE and CLAIM_STATUS) using the IX_CLM_BATCH index. The PROVIDERS table uses its primary key index.

Common EXPLAIN Red Flags

Indicator             Meaning                               Action
────────────────────  ────────────────────────────────────  ──────────────────────────────────
ACCESSTYPE = 'R'      Tablespace scan — every row examined  Add an index on WHERE columns
PREFETCH = 'S'        Sequential prefetch across many CIs   May be normal for range queries
SORTC_ORDERBY = 'Y'   Sort required to satisfy ORDER BY     Check if an index can supply order
MATCHCOLS = 0         Index used but no key columns match   Index is not useful for this query

Optimizing a Slow Query

James Okafor found that CLM-ADJUD's fee schedule lookup was performing a tablespace scan. Here is the EXPLAIN analysis and fix:

BEFORE (tablespace scan):
  TABLE: FEE_SCHEDULE    ACCESS: R    MATCHCOLS: 0
  Estimated cost: 45,000 I/Os

Query:
  SELECT ALLOWED_AMOUNT
  FROM FEE_SCHEDULE
  WHERE PROVIDER_ID = :WS-PROV-ID
    AND PROCEDURE_CODE = :WS-PROC-CODE
    AND EFFECTIVE_DATE <= :WS-SRVDATE
  ORDER BY EFFECTIVE_DATE DESC
  FETCH FIRST 1 ROW ONLY

The problem: no index existed on the combination of PROVIDER_ID, PROCEDURE_CODE, and EFFECTIVE_DATE. DB2 was scanning the entire 500,000-row table for each lookup.

-- Create composite index
CREATE INDEX IX_FEE_SCHED_LOOKUP
    ON FEE_SCHEDULE
    (PROVIDER_ID, PROCEDURE_CODE, EFFECTIVE_DATE DESC);

AFTER (index access):
  TABLE: FEE_SCHEDULE    ACCESS: I    MATCHCOLS: 3
  Estimated cost: 4 I/Os

Improvement: 11,250x reduction in I/O per query

📊 By the Numbers: In the MedClaim environment, the fee schedule query was executed approximately 8,000 times per hour. Before the index, each execution required an average of 250 I/Os (tablespace scan). After the index, each execution required 4 I/Os. Total I/O reduction: 8,000 * 246 = 1,968,000 fewer I/Os per hour. At 5ms per I/O, this saved 9,840 seconds (2.7 hours) of elapsed I/O wait time per hour of processing — explaining why the program could not meet the 20,000-claims-per-hour SLA.
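Those figures fall out of straightforward arithmetic, re-derived here as a check:

```python
# Re-deriving the I/O savings quoted in the callout above.
executions_per_hour = 8_000
ios_before = 250          # avg I/Os per execution (tablespace scan)
ios_after = 4             # avg I/Os per execution (index access)
ms_per_io = 5

saved_ios = executions_per_hour * (ios_before - ios_after)
saved_seconds = saved_ios * ms_per_io / 1000

print(f"{saved_ios:,} fewer I/Os per hour")               # 1,968,000
print(f"{saved_seconds:,.0f} seconds of I/O wait saved")  # 9,840
print(f"= {saved_seconds / 3600:.1f} hours per hour of processing")
```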

36.16 Memory Layout Optimization

Beyond WORKING-STORAGE alignment (Section 36.4), memory layout optimization extends to how data flows through the program. Efficient layout reduces cache misses and memory bandwidth consumption.

Hot/Cold Data Separation

Separate frequently accessed fields ("hot" data) from rarely accessed fields ("cold" data). This improves CPU cache utilization because the hot data fits in fewer cache lines:

      * POOR: Hot and cold data interleaved
       01  WS-ACCOUNT-RECORD.
           05 ACCT-NUMBER          PIC X(10).      *> Hot
           05 ACCT-OPEN-DATE       PIC X(10).      *> Cold
           05 ACCT-BALANCE         PIC S9(11)V99
                                   COMP-3.         *> Hot
           05 ACCT-LAST-STMT-DATE  PIC X(10).      *> Cold
           05 ACCT-TYPE            PIC X(3).        *> Hot
           05 ACCT-BRANCH-CODE     PIC X(5).        *> Cold
           05 ACCT-ANNUAL-RATE     PIC V9(6)
                                   COMP-3.         *> Hot
           05 ACCT-MARKETING-CODE  PIC X(4).        *> Cold

      * BETTER: Hot data grouped for cache locality
       01  WS-ACCT-HOT-FIELDS.
           05 ACCT-NUMBER          PIC X(10).
           05 ACCT-BALANCE         PIC S9(11)V99
                                   COMP-3.
           05 ACCT-TYPE            PIC X(3).
           05 ACCT-ANNUAL-RATE     PIC V9(6)
                                   COMP-3.
      *    Total hot data: ~24 bytes — fits in one cache line

       01  WS-ACCT-COLD-FIELDS.
           05 ACCT-OPEN-DATE       PIC X(10).
           05 ACCT-LAST-STMT-DATE  PIC X(10).
           05 ACCT-BRANCH-CODE     PIC X(5).
           05 ACCT-MARKETING-CODE  PIC X(4).

This technique matters most when the hot fields are accessed millions of times (inside a high-iteration processing loop) while the cold fields are accessed only occasionally (for reporting or error handling).
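The "~24 bytes" figure follows from packed-decimal storage rules: a COMP-3 field occupies one nibble per digit plus one sign nibble, two nibbles per byte, rounded up. A quick sketch of the arithmetic (the 64-byte cache line is a typical size, assumed here):

```python
# COMP-3 (packed decimal) storage: one nibble per digit plus a sign
# nibble, packed two nibbles per byte, rounded up.
def comp3_bytes(digits):
    return (digits + 1 + 1) // 2   # digits + sign nibble, then round up

hot_fields = {
    "ACCT-NUMBER":      10,               # PIC X(10), one byte per char
    "ACCT-BALANCE":     comp3_bytes(13),  # S9(11)V99 = 13 digits -> 7
    "ACCT-TYPE":        3,                # PIC X(3)
    "ACCT-ANNUAL-RATE": comp3_bytes(6),   # V9(6) = 6 digits -> 4
}

total = sum(hot_fields.values())
print(total, "bytes of hot data")                      # 24
print("fits one 64-byte cache line:", total <= 64)     # True
```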

Table Structure Optimization

For large internal tables, the choice between arrays-of-structures and structures-of-arrays affects performance:

      * Array of Structures (standard COBOL pattern)
       01  WS-RATE-TABLE.
           05 WS-RATE-ENTRY OCCURS 10000 TIMES
              INDEXED BY WS-RATE-IDX.
              10 WS-RATE-CODE     PIC X(5).
              10 WS-RATE-EFF-DATE PIC X(10).
              10 WS-RATE-AMOUNT   PIC S9(5)V99 COMP-3.
              10 WS-RATE-DESC     PIC X(40).
      *    Total per entry: ~59 bytes
      *    Searching requires loading 59 bytes per comparison
      *    even though we only compare the 5-byte code

      * Optimized: Split key from data
       01  WS-RATE-KEYS.
           05 WS-RATE-CODE PIC X(5)
              OCCURS 10000 TIMES
              ASCENDING KEY IS WS-RATE-CODE
              INDEXED BY WS-KEY-IDX.
      *    Total: 50,000 bytes — fits in L2 cache

       01  WS-RATE-DATA.
           05 WS-RATE-DETAIL OCCURS 10000 TIMES.
              10 WS-RATE-EFF-DATE PIC X(10).
              10 WS-RATE-AMOUNT   PIC S9(5)V99 COMP-3.
              10 WS-RATE-DESC     PIC X(40).

By separating the search key from the full record, SEARCH ALL only touches the 50,000-byte key array during comparison, not the full 590,000-byte table. On modern hardware with 256KB L2 caches, the key array fits entirely in cache, making binary search extremely fast.
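The byte counts quoted above are easy to verify. Entry sizes are taken from the PIC clauses; the 59-byte figure counts the S9(5)V99 COMP-3 amount as 4 bytes:

```python
# Memory footprint of the two table layouts above.
ENTRIES = 10_000
KEY = 5                  # WS-RATE-CODE, PIC X(5)
DATA = 10 + 4 + 40       # eff-date + COMP-3 amount (4 bytes) + description

combined = ENTRIES * (KEY + DATA)   # array of structures
key_only = ENTRIES * KEY            # separate key array

print(f"combined table: {combined:,} bytes")   # 590,000
print(f"key array only: {key_only:,} bytes")   # 50,000
print("key array fits a 256KB L2 cache:", key_only <= 256 * 1024)
```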

🧪 Lab Exercise: Create a COBOL program with a 5,000-entry lookup table. Implement two versions: one where the table has a single group item with key and data combined, and one where the key is in a separate array. Perform 1,000,000 lookups against each version and compare elapsed times. The difference may surprise you, especially if you use SEARCH ALL (binary search).

36.17 Batch I/O Optimization Patterns

Beyond the basic buffering and block size optimization discussed in Section 36.3, several advanced I/O patterns can dramatically reduce batch job elapsed time.

The Sort-Merge Elimination Pattern

Many batch programs read a master file and a transaction file, then process transactions against the master. If the transaction file is unsorted, the program must either perform random reads against the master or sort the transactions first. But sometimes you can eliminate the sort entirely by restructuring the processing logic:

      * TRADITIONAL: Sort transactions, then sequential match
      *   Step 1: SORT transactions by account key
      *   Step 2: Sequential read both files, match on key
      *   Total I/O: Read trans + sort work + read master

      * OPTIMIZED: Load transactions into memory table
      *   Step 1: Read all transactions into WORKING-STORAGE
      *   Step 2: Sort in memory (no I/O)
      *   Step 3: Sequential read master, lookup in memory
      *   Total I/O: Read trans + read master (no sort work I/O)

       01  WS-TXN-TABLE.
           05 WS-TXN-ENTRY OCCURS 50000 TIMES
              ASCENDING KEY WS-TXN-ACCT-KEY
              INDEXED BY WS-TXN-IDX.
              10 WS-TXN-ACCT-KEY     PIC X(10).
              10 WS-TXN-AMOUNT       PIC S9(9)V99 COMP-3.
              10 WS-TXN-TYPE         PIC X.
       01  WS-TXN-COUNT              PIC 9(5)  COMP VALUE 0.

       1000-LOAD-TRANSACTIONS.
      *    Read all transactions into memory
           PERFORM UNTIL END-OF-TXN-FILE
               READ TXN-FILE INTO WS-TXN-RECORD
                   AT END SET END-OF-TXN-FILE TO TRUE
                   NOT AT END
                       ADD 1 TO WS-TXN-COUNT
                       MOVE TXN-ACCT-KEY
                           TO WS-TXN-ACCT-KEY(WS-TXN-COUNT)
                       MOVE TXN-AMOUNT
                           TO WS-TXN-AMOUNT(WS-TXN-COUNT)
                       MOVE TXN-TYPE
                           TO WS-TXN-TYPE(WS-TXN-COUNT)
               END-READ
           END-PERFORM
      *    Sort in memory with the table SORT verb
      *    (initialize unused entries to HIGH-VALUES so they
      *     collate after all real transactions)
           SORT WS-TXN-ENTRY ON ASCENDING KEY WS-TXN-ACCT-KEY.

This pattern works when the transaction file fits in memory (typically up to 50,000-100,000 records, depending on record size and available REGION). For larger transaction files, the external sort approach is necessary.
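The same load-sort-lookup shape, sketched in Python for readers without a mainframe handy. Here `bisect` plays the role of SEARCH ALL's binary search, and the parallel `keys` list mirrors the ASCENDING KEY declaration:

```python
import bisect

# In-memory version of the sort-merge elimination pattern:
# load transactions, sort once, then look up by key while
# streaming the master file. No external sort work files.
transactions = [
    ("ACCT000042", 125.00, "D"),
    ("ACCT000007",  30.50, "C"),
    ("ACCT000042",  12.25, "C"),
]
transactions.sort(key=lambda t: t[0])     # the in-memory sort
keys = [t[0] for t in transactions]       # parallel key array

def lookups_for(acct_key):
    """Return all transactions matching one master record's key."""
    lo = bisect.bisect_left(keys, acct_key)    # binary search
    hi = bisect.bisect_right(keys, acct_key)
    return transactions[lo:hi]

print(lookups_for("ACCT000042"))   # two matching transactions
print(lookups_for("ACCT000001"))   # []
```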

Multi-File Processing with Balanced Merge

When a batch job reads multiple input files and produces a merged output, the order of file processing matters:

      * SLOW: Process files sequentially, write output each time
       PERFORM 1000-PROCESS-CHECKING-FILE
       PERFORM 2000-PROCESS-SAVINGS-FILE
       PERFORM 3000-PROCESS-CD-FILE
       PERFORM 4000-PROCESS-MMA-FILE
      * Total: 4 passes over the output file

      * FAST: Read all inputs in parallel, write output once
       PERFORM UNTIL ALL-FILES-AT-END
           PERFORM 1000-FIND-LOWEST-KEY
           PERFORM 2000-WRITE-MERGED-RECORD
           PERFORM 3000-READ-NEXT-FROM-SOURCE
       END-PERFORM
      * Total: 1 pass over the output file
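The "find lowest key" loop is a classic k-way merge. In Python the standard library does it directly (a sketch with toy data; each input list must already be in key order, just as each input file must be):

```python
import heapq

# Single-pass k-way merge: read all sorted inputs "in parallel",
# always emitting the record with the lowest key, so the output
# file is written exactly once.
checking = [("A001", "CHK"), ("A005", "CHK")]
savings  = [("A002", "SAV"), ("A005", "SAV")]
cds      = [("A003", "CD")]

merged = list(heapq.merge(checking, savings, cds))
print(merged)   # A001, A002, A003, then both A005 records
```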

Deferred Write Pattern

For programs that compute running totals or multi-pass calculations, defer writing the output until all processing is complete:

      * SLOW: Write partial results, then update
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-ACCT-COUNT
           PERFORM 3000-FIRST-PASS-CALC
           WRITE OUTPUT-REC FROM WS-ACCT(WS-IDX)
       END-PERFORM
      * Second pass: REWRITE records that need adjustment
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-ADJUST-COUNT
           READ OUTPUT-FILE KEY IS WS-ADJ-KEY(WS-IDX)
           PERFORM 4000-ADJUST-CALC
           REWRITE OUTPUT-REC FROM WS-ACCT(WS-IDX)
       END-PERFORM
      * Double I/O: write + read + rewrite

      * FAST: Compute everything in memory, write once
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-ACCT-COUNT
           PERFORM 3000-FIRST-PASS-CALC
       END-PERFORM
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-ADJUST-COUNT
           PERFORM 4000-ADJUST-CALC
       END-PERFORM
       PERFORM VARYING WS-IDX FROM 1 BY 1
           UNTIL WS-IDX > WS-ACCT-COUNT
           WRITE OUTPUT-REC FROM WS-ACCT(WS-IDX)
       END-PERFORM
      * Single I/O: write only (no read-back, no rewrite)

💡 Memory vs. I/O Tradeoff: Most batch I/O optimization boils down to the same principle: trade memory for I/O. Load data into WORKING-STORAGE, process it in memory, and write it once. Memory operations are measured in nanoseconds; I/O operations are measured in milliseconds — a factor of one million. Use every byte of available REGION to avoid unnecessary I/O.

36.18 MedClaim Performance Case Study: Tuning the Daily Eligibility Batch

Tomás Rivera identified a performance issue with MedClaim's daily eligibility batch (ELIG-BATCH). The program verified member eligibility for all claims received that day — checking policy dates, benefit coverage, and provider network status. As claim volumes grew from 15,000 to 45,000 per day, the job's elapsed time grew from 20 minutes to over 90 minutes, threatening to delay the downstream adjudication cycle.

Profiling Results

STROBE Performance Profile: ELIG-BATCH
========================================
Total CPU Time:  142.3 seconds
Total Elapsed:  5,412.0 seconds (90.2 minutes)
CPU/Elapsed:     2.6% (severely I/O bound)

Paragraph Profile (Top 5):
Paragraph                CPU Secs    %CPU    Calls
---------                --------    ----    -----
3000-CHECK-POLICY        12.8        9.0%    45,000
3100-CHECK-NETWORK       89.4       62.8%    45,000    <<<
3200-CHECK-BENEFITS      18.7       13.1%    45,000
2000-READ-CLAIM           8.4        5.9%    45,000
4000-WRITE-RESULT          6.2        4.4%    45,000

The bottleneck was 3100-CHECK-NETWORK, which consumed 63% of the CPU time. More telling, the job's overall CPU/elapsed ratio was only 2.6%, meaning the program spent 97.4% of its time waiting for I/O. Examining the paragraph revealed a DB2 query inside the processing loop:

       3100-CHECK-NETWORK.
           EXEC SQL
               SELECT NETWORK_STATUS, EFF_DATE, TERM_DATE
               INTO :WS-NET-STATUS, :WS-NET-EFF, :WS-NET-TERM
               FROM PROVIDER_NETWORK
               WHERE PROVIDER_ID = :CLM-PROVIDER-ID
                 AND PLAN_CODE = :MBR-PLAN-CODE
                 AND EFF_DATE <= :CLM-SERVICE-DATE
               ORDER BY EFF_DATE DESC
               FETCH FIRST 1 ROW ONLY
           END-EXEC.

At 45,000 claims, this query executed 45,000 times. EXPLAIN showed a tablespace scan (no index on the PROVIDER_NETWORK table for the provider/plan combination).

The Three-Part Fix

Fix 1: Add composite index

CREATE INDEX IX_PROV_NET_LOOKUP
    ON PROVIDER_NETWORK
    (PROVIDER_ID, PLAN_CODE, EFF_DATE DESC);

Result: Query I/O dropped from ~200 per execution to 3. Elapsed time: 90 minutes down to 35 minutes.

Fix 2: Implement in-memory cache

Since many claims share the same provider/plan combination, Tomás added a 1,000-entry cache (similar to James's fee schedule cache in Section 36.11):

       01  WS-NET-CACHE.
           05 WS-NET-CACHE-ENTRY OCCURS 1000 TIMES
              ASCENDING KEY WS-NC-LOOKUP-KEY
              INDEXED BY WS-NC-IDX.
              10 WS-NC-LOOKUP-KEY.
                 15 WS-NC-PROV-ID   PIC X(8).
                 15 WS-NC-PLAN-CODE PIC X(5).
              10 WS-NC-STATUS       PIC X.
              10 WS-NC-EFF-DATE     PIC X(10).
              10 WS-NC-TERM-DATE    PIC X(10).

Cache hit rate: 82%. Only 8,100 DB2 queries instead of 45,000. Elapsed time: 35 minutes down to 12 minutes.

Fix 3: Multi-row FETCH for remaining lookups

For the 18% of claims that missed the cache, Tomás batched the DB2 lookups using a temporary table and a single JOIN query:

      * Batch uncached lookups into a temp table
      * (one INSERT per uncached claim; DB2 host variables
      *  cannot be subscripted inside an SQL statement, so
      *  each entry is moved to scalar fields first)
           PERFORM VARYING WS-B-IDX FROM 1 BY 1
               UNTIL WS-B-IDX > WS-BATCH-COUNT
               MOVE WS-BATCH-PROV(WS-B-IDX) TO WS-INS-PROV
               MOVE WS-BATCH-PLAN(WS-B-IDX) TO WS-INS-PLAN
               MOVE WS-BATCH-DATE(WS-B-IDX) TO WS-INS-DATE
               EXEC SQL
                   INSERT INTO SESSION.LOOKUP_BATCH
                   (PROVIDER_ID, PLAN_CODE, SERVICE_DATE)
                   VALUES (:WS-INS-PROV, :WS-INS-PLAN,
                           :WS-INS-DATE)
               END-EXEC
           END-PERFORM

      * Single query to resolve all lookups
           EXEC SQL
               DECLARE BATCH-CURSOR CURSOR FOR
               SELECT B.PROVIDER_ID, B.PLAN_CODE,
                      N.NETWORK_STATUS, N.EFF_DATE
               FROM SESSION.LOOKUP_BATCH B
               JOIN PROVIDER_NETWORK N
                   ON B.PROVIDER_ID = N.PROVIDER_ID
                  AND B.PLAN_CODE = N.PLAN_CODE
                  AND N.EFF_DATE <= B.SERVICE_DATE
           END-EXEC

Elapsed time: 12 minutes down to 6 minutes.
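The batching idea, staging the pending keys in a temporary table and resolving them all with one JOIN instead of thousands of single-row queries, can be demonstrated with any SQL engine. A sketch using Python's built-in sqlite3 (the table and column names mirror the example, but this is not DB2 temp-table syntax):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE provider_network (
        provider_id TEXT, plan_code TEXT, network_status TEXT);
    INSERT INTO provider_network VALUES
        ('PROV0001', 'PLANA', 'I'),
        ('PROV0002', 'PLANA', 'O');
    CREATE TEMP TABLE lookup_batch (provider_id TEXT, plan_code TEXT);
""")

# Stage the uncached keys, then resolve them all with one JOIN
pending = [("PROV0001", "PLANA"), ("PROV0002", "PLANA")]
con.executemany("INSERT INTO lookup_batch VALUES (?, ?)", pending)

rows = con.execute("""
    SELECT b.provider_id, b.plan_code, n.network_status
    FROM lookup_batch b
    JOIN provider_network n
      ON b.provider_id = n.provider_id
     AND b.plan_code = n.plan_code
""").fetchall()
print(rows)   # one row per pending lookup
```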

Final Results

Optimization Elapsed Reduction
Original (tablespace scan) 90.2 min
+ Composite index 35.0 min 61%
+ In-memory cache 12.0 min 87%
+ Batch DB2 lookup 6.0 min 93%

The total optimization reduced elapsed time from 90 minutes to 6 minutes — a 15x improvement. The downstream adjudication cycle now starts 84 minutes earlier, providing comfortable margin for the overall batch window.

"The lesson," Tomás told his team, "is that performance tuning is almost never about making the CPU go faster. It is about making the program stop waiting."

36.19 Performance Monitoring and Regression Detection

Optimizing a program once is not enough. Data volumes grow, access patterns change, and new code introduces performance regressions. Continuous performance monitoring catches degradation before it causes batch window overruns.

Establishing Performance Baselines

Record key metrics for every production batch run:

      * Add timing instrumentation to batch programs
       WORKING-STORAGE SECTION.
       01  WS-START-TIME          PIC X(21).
       01  WS-END-TIME            PIC X(21).
       01  WS-RECORDS-PROCESSED   PIC 9(9)   COMP VALUE 0.
       01  WS-IO-COUNT            PIC 9(9)   COMP VALUE 0.

       0000-MAIN.
           MOVE FUNCTION CURRENT-DATE TO WS-START-TIME
           DISPLAY "BAL-CALC START: " WS-START-TIME
           PERFORM 1000-INIT
           PERFORM 2000-PROCESS-ALL-ACCOUNTS
           PERFORM 8000-FINALIZE
           MOVE FUNCTION CURRENT-DATE TO WS-END-TIME
           DISPLAY "BAL-CALC END:   " WS-END-TIME
           DISPLAY "RECORDS: " WS-RECORDS-PROCESSED
           DISPLAY "I/O OPS: " WS-IO-COUNT
           STOP RUN.

Building a Trend Dashboard

Track elapsed time, CPU time, and record count over weeks to identify trends:

BAL-CALC Performance Trend (last 30 days):
────────────────────────────────────────────────
Date       Records     Elapsed   CPU     I/O
────────────────────────────────────────────────
2025-10-16 2,300,000   25.1 min  8.2m    23,400
2025-10-17 2,302,000   25.2 min  8.2m    23,420
2025-10-18 2,305,000   25.3 min  8.3m    23,450
...
2025-11-10 2,340,000   25.8 min  8.4m    23,800
2025-11-11 2,342,000   26.1 min  8.5m    23,820
2025-11-12 2,344,000   31.4 min  8.5m    48,200 <<<
2025-11-13 2,346,000   31.6 min  8.5m    48,400 <<<
────────────────────────────────────────────────
ALERT: I/O count doubled on 2025-11-12
       Elapsed increased 21% with only 0.1% record growth

The spike on November 12 indicates a performance regression — I/O doubled while record count barely changed. Investigation revealed that a code change on November 11 introduced an additional READ statement inside the main processing loop, doubling the I/O count. The fix was straightforward: cache the result of the extra read rather than re-reading for each record.
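The detection rule that caught this, I/O per record jumping while record count stays flat, is simple enough to automate. A sketch of the check (the 1.5x tolerance is an assumed threshold, not a standard):

```python
# Regression check: alert when I/O per record jumps relative to the
# previous run, the signal that caught the November 12 spike.
def io_regression(history, threshold=1.5):
    """history: list of (date, records, io_count). Returns alert dates."""
    alerts = []
    for prev, cur in zip(history, history[1:]):
        prev_ratio = prev[2] / prev[1]   # I/O per record, previous run
        cur_ratio = cur[2] / cur[1]      # I/O per record, this run
        if cur_ratio > prev_ratio * threshold:
            alerts.append(cur[0])
    return alerts

history = [
    ("2025-11-10", 2_340_000, 23_800),
    ("2025-11-11", 2_342_000, 23_820),
    ("2025-11-12", 2_344_000, 48_200),   # extra READ in the loop
    ("2025-11-13", 2_346_000, 48_400),
]
print(io_regression(history))   # ['2025-11-12']
```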

Automated Alerting

Set thresholds that trigger alerts when performance degrades beyond acceptable bounds:

//*----------------------------------------------------------
//* Performance gate: Alert if elapsed exceeds threshold
//*----------------------------------------------------------
//PERFGATE EXEC PGM=PERFCHEK
//STEPLIB  DD DSN=TOOLS.LOAD,DISP=SHR
//TIMING   DD DSN=PROD.PERF.LOG,DISP=SHR
//THRESHLD DD *
BAL-CALC,ELAPSED,30.0
TXN-POST,ELAPSED,20.0
RPT-DAILY,ELAPSED,15.0
REG-FEED,ELAPSED,10.0
/*
//ALERT    DD SYSOUT=*
//* RC=0: Within threshold
//* RC=4: Warning (within 10% of threshold)
//* RC=8: Exceeded threshold — alert operations
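PERFCHEK is a hypothetical in-house tool, but the return-code convention in the comments above can be pinned down precisely. A sketch of the gate logic, with the warning band set at 10% below the threshold:

```python
# Return-code convention of the (hypothetical) PERFCHEK gate:
# RC=0 within threshold, RC=4 within 10% of it, RC=8 over it.
def perf_gate_rc(elapsed_min, threshold_min):
    if elapsed_min > threshold_min:
        return 8                      # exceeded: alert operations
    if elapsed_min > threshold_min * 0.9:
        return 4                      # warning: close to the limit
    return 0                          # comfortably inside

print(perf_gate_rc(25.0, 30.0))   # 0
print(perf_gate_rc(28.0, 30.0))   # 4
print(perf_gate_rc(31.0, 30.0))   # 8
```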

Try It Yourself: Add timing instrumentation to any COBOL program you have written. Record FUNCTION CURRENT-DATE at the start and end of the program, and count the number of records processed. Run the program with different data volumes (100, 1,000, 10,000 records) and observe how elapsed time scales. Does it scale linearly with data volume? If not, you may have an O(n^2) algorithm hiding in your code.

36.20 GlobalBank Post-Optimization Monitoring

After Maria Chen's performance optimization project (Section 36.10), GlobalBank institutionalized performance monitoring to prevent regression. Priya Kapoor designed a monitoring framework that tracked three key indicators.

The Three Performance Pillars

Pillar 1: Batch Window Utilization. The ratio of actual batch elapsed time to the available batch window. Target: below 60%, allowing headroom for growth:

Batch Window: 23:00 - 05:00 (6 hours = 360 minutes)
Current Elapsed: 75 minutes
Utilization: 20.8%
Headroom: 285 minutes (79.2%)
Monthly Growth Rate: 0.4 minutes/month
Time Until 60% Threshold: ~352 months (29 years)
Status: GREEN — no concern
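The pillar-1 numbers fall out of straightforward arithmetic:

```python
# Batch-window utilization arithmetic from the pillar-1 report.
window_min = 6 * 60            # 23:00 - 05:00
elapsed_min = 75
growth_per_month = 0.4         # minutes of growth per month

utilization = elapsed_min / window_min
headroom = window_min - elapsed_min
threshold_min = window_min * 0.60
months_to_threshold = (threshold_min - elapsed_min) / growth_per_month

print(f"utilization: {utilization:.1%}")     # 20.8%
print(f"headroom:    {headroom} minutes")    # 285
print(f"months to 60% threshold: {months_to_threshold:.0f}")
```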

Pillar 2: Per-Job CPU Efficiency. The ratio of CPU time to elapsed time, tracked per job. A declining ratio indicates increasing I/O wait — often the first sign of a growing dataset or degraded VSAM organization:

Job         CPU/Elapsed (Oct)  CPU/Elapsed (Nov)  Trend
BAL-CALC    32.8%              32.6%              Stable
TXN-POST    20.0%              19.7%              Stable
RPT-DAILY   17.8%              11.2%              DECLINING <<<
REG-FEED    25.0%              25.1%              Stable
ACCT-MAINT  62.5%              62.5%              Stable

RPT-DAILY's declining CPU/elapsed ratio triggered an investigation. The cause: a VSAM dataset used by the report had not been reorganized in four months. CI splits had fragmented the data, increasing I/O. After reorganization, the ratio returned to 17.5%.
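Detecting this kind of drift is mechanical once the ratios are captured. A minimal sketch, using the October/November figures from the table above (the one-percentage-point tolerance is an assumed setting, not part of GlobalBank's framework):

```python
def flag_declining(prev, curr, tolerance_pts=1.0):
    """Return jobs whose CPU/elapsed percentage dropped by more than
    tolerance_pts percentage points between two measurement periods."""
    return [job for job, prev_pct in prev.items()
            if prev_pct - curr.get(job, prev_pct) > tolerance_pts]

oct_ratios = {"BAL-CALC": 32.8, "TXN-POST": 20.0, "RPT-DAILY": 17.8,
              "REG-FEED": 25.0, "ACCT-MAINT": 62.5}
nov_ratios = {"BAL-CALC": 32.6, "TXN-POST": 19.7, "RPT-DAILY": 11.2,
              "REG-FEED": 25.1, "ACCT-MAINT": 62.5}
# flag_declining(oct_ratios, nov_ratios) → ["RPT-DAILY"]
```

Normal month-to-month noise (BAL-CALC's 0.2-point dip, REG-FEED's 0.1-point rise) stays below the tolerance; only RPT-DAILY's 6.6-point drop is flagged.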

Pillar 3: Records-Per-Second Throughput. Track processing throughput over time. A declining throughput with constant or growing record volumes indicates performance degradation:

BAL-CALC Throughput Trend:
Date        Records   Elapsed (sec)  Records/Sec
2025-10-01  2,300,000     1,500       1,533
2025-10-15  2,310,000     1,508       1,532
2025-11-01  2,320,000     1,520       1,526
2025-11-15  2,330,000     1,528       1,525

Stable throughput indicates that performance scales linearly with data growth — a healthy sign. Declining throughput would indicate an algorithmic issue (e.g., a quadratic search emerging as the table grows).
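The throughput column above is just records divided by elapsed seconds, and a simple percentage-drop rule is enough to turn the trend into an alert. A sketch using the BAL-CALC samples (the 5% alert band is an assumed setting):

```python
samples = [  # (date, records, elapsed seconds) from the BAL-CALC trend
    ("2025-10-01", 2_300_000, 1_500),
    ("2025-10-15", 2_310_000, 1_508),
    ("2025-11-01", 2_320_000, 1_520),
    ("2025-11-15", 2_330_000, 1_528),
]

rates = [records / elapsed for _, records, elapsed in samples]
# rates ≈ [1533, 1532, 1526, 1525] records/sec — essentially flat.

# Alert if the latest rate has fallen more than 5% below the baseline.
degraded = rates[-1] < rates[0] * 0.95
```

Here `degraded` is false: a 0.5% drift over six weeks is data-volume noise, not the steep falloff a quadratic algorithm would produce.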

📊 By the Numbers: In the year following Maria's optimization project, the batch window utilization remained below 25% despite a 4.3% growth in data volume. The monitoring framework caught two potential regressions (the VSAM fragmentation issue above and a new program with an unindexed DB2 query) before they affected production. The total investment in ongoing monitoring was approximately 2 hours per month of Priya's time reviewing the dashboards — a trivial cost for preventing the kind of crisis that started the optimization project.

36.21 Summary

Performance tuning is a discipline of measurement, analysis, and targeted optimization. The key concepts from this chapter:

  • Measure first — profile before optimizing. The bottleneck is rarely where you think it is.
  • I/O dominates — in most COBOL batch programs, 80-95% of elapsed time is I/O. Optimize I/O first.
  • Data types matter — COMP-3 and COMP arithmetic are 3-5x faster than DISPLAY arithmetic.
  • Block size is critical — proper blocking can reduce I/O operations by 40x.
  • SEARCH ALL for large tables — binary search is O(log n) vs. linear search's O(n).
  • Compiler options provide 10-30% improvement with zero code changes.
  • SQL optimization is essential for DB2 programs — avoid queries in loops, use multi-row FETCH.
  • CICS optimization focuses on COMMAREA sizing and minimizing command overhead.
  • Batch tuning includes sort optimization, checkpoint strategy, and REGION sizing.
  • Profile continuously — performance degrades as data volumes grow. Regular profiling catches degradation before it causes batch window overruns.

As Maria's batch optimization proved: the biggest wins often come from the simplest changes. A one-line JCL change saved 30 minutes per night. The lesson is not that complex optimizations are never worthwhile (sometimes they are) but that you should always pick the lowest-hanging fruit first.

In the next chapter, we'll address the ultimate performance question: when should you stop optimizing COBOL and start considering migration to modern platforms?