Chapter 45: Capstone 3 — From Batch to Real-Time: A Full Migration Project

"The batch window is not a technical limitation. It's a business agreement — and the business just changed the terms." — Priya Kapoor, in the project kickoff meeting

Introduction: When Batch is Not Fast Enough

Capstone 1 taught you to build a COBOL system from scratch. Capstone 2 taught you to modernize a legacy system incrementally. This final capstone brings together everything you have learned in this textbook — every technique, every pattern, every design principle — to tackle the most complex challenge in enterprise COBOL: migrating a batch system to real-time event-driven processing while keeping the business running.

This is not a hypothetical exercise. Across the financial services and healthcare industries, organizations are facing the same pressure: customers, partners, and regulators want information now, not tomorrow morning. The nightly batch run that was acceptable in 2010 is a competitive disadvantage in 2025. But the batch system works. It is reliable, auditable, and understood. Replacing it with something faster cannot come at the cost of those qualities.

💡 The Fundamental Tension. Batch processing is inherently reliable because it is simple: read a file, process each record, write the results. There is no concurrency, no race conditions, no distributed state. Real-time processing introduces all of these complexities. The challenge is to gain the speed of real-time without losing the reliability of batch. This capstone shows you how.

The Business Problem

GlobalBank and MedClaim Health Services have entered a partnership. GlobalBank will process medical expense claims for MedClaim's members who hold GlobalBank health savings accounts (HSAs). When a MedClaim claim is adjudicated, the payment should be deducted from the member's GlobalBank HSA automatically.

Currently, this works through a nightly batch process:

  1. MedClaim's nightly batch produces a flat file of adjudicated claims
  2. The file is transmitted to GlobalBank via SFTP at 2 AM
  3. GlobalBank's morning batch reads the file and processes HSA debits
  4. Results are transmitted back to MedClaim in the afternoon
  5. MedClaim's next nightly batch posts the payment confirmations

Total elapsed time from claim adjudication to HSA debit: 24-48 hours.

The business wants this reduced to under 30 seconds.

"Thirty seconds," Derek Washington repeats when Priya Kapoor presents the requirement. "From claim adjudication to money moving?"

"Twenty-nine, if you want to underpromise," Priya replies. "The business case is simple: real-time HSA processing improves member satisfaction and reduces MedClaim's float. Both organizations benefit."

"And both organizations' batch systems need to keep running while we build this?"

"Obviously."

⚖️ The Stakes. This migration affects two production systems at two different organizations. A failure in the real-time system could cause incorrect HSA debits (taking money from members' accounts erroneously), duplicate payments, or lost transactions. The parallel-run period must prove that the real-time system produces exactly the same results as the batch system before batch is decommissioned.


Project Assessment

Current Architecture

Priya begins with a thorough assessment of both systems' current architecture. She has spent two months embedded with both teams — Maria Chen's at GlobalBank and James Okafor's at MedClaim — learning how the systems work.

MedClaim Side (Current):

Claims → [CLM-ADJUD] → Adjudicated Claims File
                              │
                    [CLM-EXTRACT] → HSA Payment File
                              │
                         SFTP Transfer (2 AM)
                              │
                    Flat file on GlobalBank LPAR

GlobalBank Side (Current):

HSA Payment File → [HSA-PROC] → VSAM Account Master Updated
                        │
                   [HSA-CONFIRM] → Confirmation File
                        │
                   SFTP Transfer (4 PM)
                        │
                   Flat file on MedClaim LPAR
                        │
              [CLM-CONFIRM] → Claim Status Updated

Six programs, two SFTP transfers, and a minimum of 24 hours.

Target Architecture:

[CLM-ADJUD] ──→ MQ Message ──→ [HSA-EVENTS] ──→ HSA Updated
                                      │
                                 MQ Confirmation
                                      │
                              [CLM-EVENTS] ──→ Claim Updated

Two new programs, one message queue, under 30 seconds.

Risk Assessment

Priya's risk assessment identifies the following concerns:

| Risk | Probability | Impact | Mitigation |
| --- | --- | --- | --- |
| MQ message loss | Low | Critical | Persistent messages with dead-letter queue |
| Duplicate processing | Medium | High | Idempotent design with transaction IDs |
| Network failure between LPARs | Medium | Medium | MQ guaranteed delivery with retry |
| Real-time slower than 30 seconds | Medium | Medium | Performance testing with production volumes |
| Data inconsistency during parallel run | High | Medium | Reconciliation program runs daily |
| Rollback to batch needed | Low | Low | Batch infrastructure preserved for 6 months |

📊 The Parallel-Run Requirement. Both organizations require a parallel-run period where the batch and real-time systems process the same transactions simultaneously. A reconciliation program compares the results daily. Only when the reconciliation shows zero discrepancies for 30 consecutive business days will batch be decommissioned.
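The reconciliation logic itself is conceptually simple: a set comparison keyed by claim ID. A minimal Python sketch of the idea (record shapes and field names are illustrative, not taken from either system):

```python
def reconcile(batch_results, realtime_results):
    """Compare batch and real-time outputs keyed by claim ID.

    Each input maps claim ID -> debited amount in cents.
    Returns the three discrepancy buckets the daily report flags.
    """
    batch_only = sorted(set(batch_results) - set(realtime_results))
    realtime_only = sorted(set(realtime_results) - set(batch_results))
    amount_mismatch = sorted(
        claim for claim in set(batch_results) & set(realtime_results)
        if batch_results[claim] != realtime_results[claim]
    )
    return batch_only, realtime_only, amount_mismatch

batch = {"CLM001": 12500, "CLM002": 9900, "CLM003": 4500}
realtime = {"CLM001": 12500, "CLM003": 4600, "CLM004": 700}

b_only, r_only, mismatch = reconcile(batch, realtime)
# CLM002 ran only in batch, CLM004 only in real time,
# and CLM003 was debited with different amounts
```

Zero discrepancies means all three buckets come back empty; thirty consecutive days of that is the decommissioning gate.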


Phase 1: Message Infrastructure

IBM MQ Design

The message queue is the backbone of the real-time system. Priya designs a two-queue architecture:

Queue 1: MEDCLAIM.HSA.PAYMENTS — Carries adjudicated claim messages from MedClaim to GlobalBank.

Queue 2: GLOBALBANK.HSA.CONFIRMS — Carries payment confirmations from GlobalBank to MedClaim.

Both queues use persistent messages (messages survive queue manager restart), which ensures that no transaction is lost even if the MQ infrastructure fails.

MQ Object Definitions:

DEFINE QLOCAL('MEDCLAIM.HSA.PAYMENTS') +
    DEFPSIST(YES) +
    MAXDEPTH(100000) +
    MAXMSGL(4096) +
    BOTHRESH(5) +
    BOQNAME('MEDCLAIM.HSA.PAYMENTS.DLQ') +
    DESCR('HSA payment requests from MedClaim')

DEFINE QLOCAL('MEDCLAIM.HSA.PAYMENTS.DLQ') +
    DEFPSIST(YES) +
    MAXDEPTH(10000) +
    DESCR('Dead letter queue for failed payment messages')

DEFINE QLOCAL('GLOBALBANK.HSA.CONFIRMS') +
    DEFPSIST(YES) +
    MAXDEPTH(100000) +
    MAXMSGL(4096) +
    BOTHRESH(5) +
    BOQNAME('GLOBALBANK.HSA.CONFIRMS.DLQ') +
    DESCR('HSA payment confirmations to MedClaim')

DEFINE QLOCAL('GLOBALBANK.HSA.CONFIRMS.DLQ') +
    DEFPSIST(YES) +
    MAXDEPTH(10000) +
    DESCR('Dead letter queue for failed confirm messages')

Key design decisions:

DEFPSIST(YES): All messages are persistent by default. This means MQ writes them to disk before acknowledging the PUT. It is slower than non-persistent messaging but guarantees no data loss.

BOTHRESH(5): The backout threshold. If a message is read and rolled back 5 times (indicating the consuming program keeps failing), it is requeued to the backout queue named in BOQNAME instead of being retried a sixth time. This prevents a "poison message" from blocking the queue. Note that the queue manager only stores these attributes; it is the consuming application (or an adapter acting on its behalf) that inspects the message's backout count and performs the move.

MAXDEPTH(100000): If GlobalBank's consumer program is down, MQ can hold up to 100,000 messages before rejecting new ones. At 500,000 claims per month (approximately 20,000 per business day, of which maybe 5,000 involve HSAs), this provides nearly a full month of buffer.
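The backout mechanics can be illustrated outside of MQ entirely. A Python sketch of the BOTHRESH(5) behavior (the queue objects and counter are illustrative; real MQ carries the count in the message descriptor's BackoutCount field):

```python
from collections import deque

BACKOUT_THRESHOLD = 5  # mirrors BOTHRESH(5)

def consume(queue, backout_queue, process, backout_counts):
    """Attempt one message; on failure, requeue it until the
    backout threshold is reached, then divert it aside."""
    msg = queue.popleft()
    try:
        process(msg)
        backout_counts.pop(msg, None)       # success: forget the count
    except Exception:
        backout_counts[msg] = backout_counts.get(msg, 0) + 1
        if backout_counts[msg] >= BACKOUT_THRESHOLD:
            backout_queue.append(msg)       # poison message: divert
        else:
            queue.appendleft(msg)           # rolled back: retry later

queue, dlq, counts = deque(["poison"]), deque(), {}

def always_fails(msg):
    raise ValueError("cannot parse")

for _ in range(5):
    consume(queue, dlq, always_fails, counts)
# after 5 failed attempts the message sits on the backout queue
# and the main queue is free for other traffic
```

The payoff is the last comment: one unparseable message costs five attempts, not an indefinitely blocked queue.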

Message Format

The message payload uses JSON. This is a deliberate choice: JSON is human-readable, widely supported, and handled natively in Enterprise COBOL (JSON GENERATE since V6.1, JSON PARSE since V6.2).

Payment Request Message:

{
    "messageId": "MSG20240315143022001",
    "messageType": "HSA_PAYMENT",
    "timestamp": "2024-03-15T14:30:22.001",
    "claimId": "CLM000098765",
    "memberId": "MBR100045678",
    "hsaAccountId": "HSA0045678",
    "paymentAmount": 1250.00,
    "diagnosisCode": "J06.9",
    "procedureCode": "99213",
    "serviceDate": "2024-03-10",
    "providerName": "City Medical Center",
    "adjudicationDate": "2024-03-15"
}

Payment Confirmation Message:

{
    "messageId": "CFM20240315143023456",
    "correlationId": "MSG20240315143022001",
    "messageType": "HSA_CONFIRM",
    "timestamp": "2024-03-15T14:30:23.456",
    "claimId": "CLM000098765",
    "hsaAccountId": "HSA0045678",
    "status": "SUCCESS",
    "newBalance": 3750.00,
    "transactionRef": "TXN20240315001234"
}

The correlationId in the confirmation links back to the original payment request, enabling end-to-end traceability.

⚠️ Idempotent Design. Every message includes a unique messageId. The consuming program checks whether it has already processed a message with this ID before applying it. This makes the system idempotent — processing the same message twice produces the same result as processing it once. This is critical because MQ's guaranteed delivery means messages may be delivered more than once in failure scenarios.
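The duplicate-message guard reduces to a small amount of logic. A Python sketch (the `processed_ids` set stands in for the DB2 transaction table keyed by MESSAGE_ID; amounts are in cents, and all names are illustrative):

```python
def apply_payment(balances, processed_ids, msg):
    """Apply an HSA debit at most once per messageId.

    balances: account ID -> balance in cents
    processed_ids: set of messageIds already applied
    """
    if msg["messageId"] in processed_ids:
        return "DUPLICATE"                 # already applied: do nothing
    balances[msg["hsaAccountId"]] -= msg["paymentAmount"]
    processed_ids.add(msg["messageId"])
    return "SUCCESS"

balances = {"HSA0045678": 500000}
seen = set()
msg = {"messageId": "MSG20240315143022001",
       "hsaAccountId": "HSA0045678",
       "paymentAmount": 125000}

first = apply_payment(balances, seen, msg)
second = apply_payment(balances, seen, msg)    # redelivered copy
# first == "SUCCESS", second == "DUPLICATE",
# and the balance is debited exactly once
```

The critical property: recording the ID and applying the debit must happen atomically, which is why the COBOL consumer inserts the MESSAGE_ID row in the same unit of work as the balance update.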


Phase 2: MedClaim Event Producer

Modifying CLM-ADJUD

The existing CLM-ADJUD program processes claims and writes results to a flat file. To enable real-time processing, James Okafor modifies CLM-ADJUD to also put a message on the MQ queue for each HSA-eligible claim.

The modification follows the "and" pattern — the program does everything it did before AND puts a message on the queue. During the parallel-run period, both the flat file and the MQ message carry the same data. This allows the batch and real-time paths to process the same transactions.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. CLM-ADJUD.
      *================================================================*
      * Program:  CLM-ADJUD (Modified for real-time HSA processing)    *
      * Purpose:  Adjudicate claims; send HSA events via MQ            *
      * Author:   James Okafor (real-time additions)                   *
      * Date:     Modified 2024-07-01                                  *
      *================================================================*
      * MQ additions use MQI CALLs; MQCONN is issued once during       *
      * program initialization (not shown), handle in WS-MQ-HCONN.     *
      *================================================================*

       DATA DIVISION.
       WORKING-STORAGE SECTION.

      *--- Existing working storage preserved ---
       COPY CLMREC.
       01  WS-ADJUD-WORK              PIC X(500).

      *--- New MQ-related fields ---
       01  WS-MQ-MESSAGE              PIC X(2000).
       01  WS-MQ-MSG-LENGTH           PIC S9(08) COMP VALUE 0.
       01  WS-MQ-QUEUE-NAME           PIC X(48)
               VALUE 'MEDCLAIM.HSA.PAYMENTS'.
       01  WS-MQ-HCONN                PIC S9(08) COMP VALUE 0.
       01  WS-MQ-RESP                 PIC S9(08) COMP.
       01  WS-MQ-COMPCODE             PIC S9(08) COMP.
       01  WS-MQ-REASON               PIC S9(08) COMP.

      *--- MQ structures and constants (MQ copybooks) ---
           COPY CMQV.
           COPY CMQODV.
           COPY CMQMDV.
           COPY CMQPMOV.

       01  WS-HSA-FLAG                PIC X(01).
           88  WS-IS-HSA-ELIGIBLE         VALUE 'Y'.
           88  WS-NOT-HSA-ELIGIBLE        VALUE 'N'.

       01  WS-MSG-ID-WORK.
           05  WS-MSG-PREFIX          PIC X(03) VALUE 'MSG'.
           05  WS-MSG-DATE            PIC 9(08).
           05  WS-MSG-TIME            PIC 9(06).
           05  WS-MSG-SEQ             PIC 9(03).

       01  WS-MSG-SEQUENCE            PIC 9(03) VALUE 0.

      *--- JSON data structure for MQ message ---
       01  WS-HSA-PAYMENT-MSG.
           05  HSA-MSG-ID             PIC X(20).
           05  HSA-MSG-TYPE           PIC X(15)
               VALUE 'HSA_PAYMENT'.
           05  HSA-TIMESTAMP          PIC X(23).
           05  HSA-CLAIM-ID           PIC X(15).
           05  HSA-MEMBER-ID          PIC X(12).
           05  HSA-ACCOUNT-ID         PIC X(10).
           05  HSA-PAYMENT-AMOUNT     PIC 9(07)V99.
           05  HSA-DIAG-CODE          PIC X(07).
           05  HSA-PROC-CODE          PIC X(05).
           05  HSA-SERVICE-DATE       PIC X(10).
           05  HSA-PROVIDER-NAME      PIC X(30).
           05  HSA-ADJUD-DATE         PIC X(10).

       01  WS-CURRENT-TIMESTAMP       PIC X(23).

       PROCEDURE DIVISION.
      * ... (existing adjudication logic preserved) ...

      *================================================================*
      * NEW PARAGRAPH: Send HSA payment event via MQ                    *
      * Called after successful claim adjudication if HSA-eligible      *
      *================================================================*
       5000-SEND-HSA-EVENT.
      *    Generate unique message ID
           ACCEPT WS-MSG-DATE FROM DATE YYYYMMDD
           ACCEPT WS-MSG-TIME FROM TIME
           ADD 1 TO WS-MSG-SEQUENCE
           MOVE WS-MSG-SEQUENCE TO WS-MSG-SEQ
           STRING WS-MSG-PREFIX DELIMITED BY SIZE
                  WS-MSG-DATE   DELIMITED BY SIZE
                  WS-MSG-TIME   DELIMITED BY SIZE
                  WS-MSG-SEQ    DELIMITED BY SIZE
                  INTO HSA-MSG-ID
           END-STRING

      *    Build message payload
           MOVE CLM-CLAIM-ID      TO HSA-CLAIM-ID
           MOVE CLM-MEMBER-ID     TO HSA-MEMBER-ID
           MOVE CLM-PAID-AMOUNT   TO HSA-PAYMENT-AMOUNT
           MOVE CLM-DIAGNOSIS-CODE TO HSA-DIAG-CODE
           MOVE CLM-PROCEDURE-CODE TO HSA-PROC-CODE

      *    Generate JSON from COBOL data structure (the NAME phrase
      *    mapping data names to the camelCase keys is omitted)
           JSON GENERATE WS-MQ-MESSAGE
               FROM WS-HSA-PAYMENT-MSG
               COUNT WS-MQ-MSG-LENGTH
           END-JSON

      *    Put message on MQ queue (MQPUT1 opens, puts, and closes
      *    in one call; WS-MQ-HCONN was set by MQCONN at startup)
           MOVE WS-MQ-QUEUE-NAME TO MQOD-OBJECTNAME
           CALL 'MQPUT1' USING WS-MQ-HCONN
                               MQOD
                               MQMD
                               MQPMO
                               WS-MQ-MSG-LENGTH
                               WS-MQ-MESSAGE
                               WS-MQ-COMPCODE
                               WS-MQ-REASON

           IF WS-MQ-COMPCODE NOT = MQCC-OK
               DISPLAY 'MQ PUT FAILED FOR CLAIM: '
                       CLM-CLAIM-ID
                       ' REASON: ' WS-MQ-REASON
      *        Log failure but DO NOT fail the adjudication
      *        The batch path will still process this claim
               PERFORM 5100-LOG-MQ-FAILURE
           END-IF
           .

       5100-LOG-MQ-FAILURE.
      *    Write to error log - batch will handle this claim
           DISPLAY 'MQ-FAIL: ' CLM-CLAIM-ID
                   ' AMOUNT: ' CLM-PAID-AMOUNT
                   ' REASON: ' WS-MQ-REASON
           .

Critical Design Decision: MQ Failure Does Not Fail Adjudication.

Notice paragraph 5000-SEND-HSA-EVENT: if the MQ PUT fails, the program logs the failure but does NOT reject the claim or abort processing. The claim is still adjudicated and written to the flat file. The batch path will process it normally. This is essential during the parallel-run period — the real-time path is additive, not replacing batch.

🔗 Theme: Defensive Programming. The "belt and suspenders" approach — sending the message AND writing the flat file — ensures that no transaction is lost even if the real-time path fails completely. During the parallel-run period, both paths process every transaction. After cutover, the flat file path is disabled, but the error handling in paragraph 5100 ensures that MQ failures are always logged and can trigger fallback processing.
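The belt-and-suspenders producer can be sketched in a few lines (the function, queue object, and log shapes are illustrative, not MedClaim's actual interfaces):

```python
def adjudicate_and_publish(claim, flat_file, mq_queue, error_log):
    """Write the claim to the batch file unconditionally, then
    attempt the real-time publish; an MQ failure is logged but
    never fails the adjudication."""
    flat_file.append(claim)                 # batch path: always
    try:
        mq_queue.put(claim)                 # real-time path: best effort
    except Exception as exc:
        error_log.append((claim["claimId"], str(exc)))

class DownQueue:
    """Simulates an unavailable queue manager."""
    def put(self, msg):
        raise ConnectionError("queue manager unavailable")

flat_file, errors = [], []
adjudicate_and_publish({"claimId": "CLM000098765"},
                       flat_file, DownQueue(), errors)
# the claim still reaches the flat file; the MQ failure is logged
```

The ordering matters: the durable path is written first, so a crash mid-routine can lose the real-time event but never the claim.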

The HSA Account Lookup Problem

One detail that consumed more design time than expected was determining the HSA account ID for a given claim. The CLM-ADJUD program knows the member ID and the claim details, but it does not know which GlobalBank HSA account corresponds to that member. This information lives on GlobalBank's side, not MedClaim's.

Priya's team considered three approaches:

Option A: Include the HSA Account ID in the MQ message. This requires MedClaim to maintain a cross-reference table mapping member IDs to HSA account IDs. The table would need to be synchronized whenever GlobalBank creates or closes an HSA account.

Option B: Let GlobalBank look up the HSA account. The MQ message includes only the member ID, and GlobalBank's consumer program looks up the corresponding HSA account using a DB2 query.

Option C: Include the HSA Account ID in MedClaim's member file. Add a field to MedClaim's member record that stores the GlobalBank HSA account ID. This requires a copybook change and a one-time data migration.

The team chose Option B — letting GlobalBank's consumer perform the lookup. The reasoning:

  1. Data ownership. The HSA account ID belongs to GlobalBank. MedClaim should not maintain a copy that could become stale.
  2. Simplicity. No cross-reference table to maintain, no synchronization process to build.
  3. Performance. The lookup is a simple indexed DB2 query — less than 1 millisecond.
  4. Isolation. If GlobalBank changes their account numbering scheme, only their consumer program changes. MedClaim is unaffected.

This decision exemplifies a core principle of event-driven design: the message should contain what the producer knows, not what the consumer needs. The consumer is responsible for enriching the message with data from its own domain.
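Option B's consumer-side enrichment is easy to sketch (a Python dict stands in for the indexed DB2 lookup; all names are illustrative):

```python
def enrich_and_debit(msg, member_to_hsa, balances):
    """Enrich the message with GlobalBank-owned data: resolve the
    member ID to an HSA account, then apply the debit."""
    account = member_to_hsa.get(msg["memberId"])   # the DB2 lookup
    if account is None:
        return "ACCT_NF"                # member has no HSA here
    balances[account] -= msg["paymentAmount"]
    return "SUCCESS"

member_to_hsa = {"MBR100045678": "HSA0045678"}
balances = {"HSA0045678": 500000}      # cents

status = enrich_and_debit(
    {"memberId": "MBR100045678", "paymentAmount": 125000},
    member_to_hsa, balances)
# status == "SUCCESS"; the account is debited
```

Note that the message itself never carried the account ID: the producer sent what it knew, and the consumer supplied the rest from its own domain.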

Message Sequencing and Ordering

An important question for any messaging system is whether message ordering matters. For HSA payments, it does — but not in the way you might expect.

If the same member has two claims adjudicated within seconds, MedClaim puts two messages on the queue. GlobalBank might process them in any order. Does this matter?

For HSA debits, the order does not affect correctness — $100 deducted then $200 deducted produces the same final balance as $200 then $100. However, the order can determine which claim succeeds: if the account holds only $250, whichever debit arrives first succeeds, and the other fails for insufficient funds.

The team decided that message ordering is not guaranteed and not required. Each message is processed independently. If an HSA account has insufficient funds, the debit fails and the message is handled as an error — regardless of whether other messages for the same account are waiting on the queue.

This decision simplifies the architecture enormously. Guaranteeing message ordering across two LPARs connected by MQ would require single-threaded processing, eliminating the scalability benefits of message queuing. By designing each message to be independently processable, the team can run multiple consumer instances if volume grows.

💡 The Independence Principle. When designing event-driven systems, strive for messages that can be processed independently. If Message B can only be processed after Message A, your system has an implicit ordering dependency that will cause problems under load, during recovery, and when scaling. Design your messages so that each one carries enough context to be processed in isolation.

Testing the Producer in Isolation

Before connecting to GlobalBank's consumer, James tests the producer in isolation. He configures a test queue on MedClaim's LPAR and runs CLM-ADJUD with test claims that have HSA-eligible flags.

The test plan:

| Test Case | Input | Expected Result |
| --- | --- | --- |
| Normal HSA claim | HSA-eligible claim, $500 | Message on queue with correct JSON |
| Non-HSA claim | Non-HSA claim | No message on queue |
| Large amount | HSA-eligible, $99,999.99 | Message with correct amount formatting |
| Zero amount | HSA-eligible, $0.00 | No message (zero amounts filtered) |
| MQ down | HSA-eligible claim, queue unavailable | Claim adjudicated, MQ failure logged |
| Rapid fire | 100 HSA claims in quick succession | 100 unique messages, no duplicates |
| Special characters | Provider name with apostrophe | JSON properly escaped |

James runs each test case and examines the messages on the queue using the MQ Explorer utility. He verifies that:

  1. Each message is valid JSON
  2. The message ID is unique for every message
  3. The claim amount matches the adjudicated amount exactly (COMP-3 to display conversion)
  4. The timestamp reflects the actual time of adjudication, not some default value
  5. No message is generated when the MQ PUT fails (the failure is logged instead)

The "MQ down" test is particularly important. James stops the queue manager, submits a batch of HSA-eligible claims, and verifies that every claim is still adjudicated correctly and written to the flat file. The only difference is the MQ failure messages in the job log. When he restarts the queue manager, the missed claims will be caught by the reconciliation process — they appear in the batch output but not in the real-time DB2 table, and the reconciliation report flags them as "batch only."

📊 Testing Philosophy: Trust But Verify. Testing the producer in isolation before connecting it to the consumer follows the same principle as unit testing before integration testing. If the producer generates malformed messages, debugging will be much harder when the consumer is involved. By verifying message format, uniqueness, and error handling before the consumer exists, James eliminates an entire class of integration problems.
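James's checklist maps naturally onto automated assertions. A Python sketch of the shape of such checks (the `build_message` producer is an illustrative stand-in, not CLM-ADJUD; `json.dumps` provides the escaping that the special-characters test case verifies):

```python
import json

def build_message(seq, claim):
    """Illustrative producer: unique message ID plus JSON payload."""
    return json.dumps({
        "messageId": f"MSG20240315143022{seq:03d}",
        "claimId": claim["claimId"],
        "paymentAmount": claim["amount"],
        "providerName": claim["provider"],
    })

# "Rapid fire": 100 claims in quick succession
claims = [{"claimId": f"CLM{i:09d}", "amount": 500.0,
           "provider": "O'Brien's Clinic"} for i in range(100)]
messages = [build_message(i, c) for i, c in enumerate(claims)]

parsed = [json.loads(m) for m in messages]      # 1. every message is valid JSON
ids = {p["messageId"] for p in parsed}          # 2. message IDs are unique
names = {p["providerName"] for p in parsed}     # 3. special characters survive
```

The round trip through `json.loads` is the point: a message the consumer cannot parse is worthless no matter how fast it was produced.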


Phase 3: GlobalBank Event Consumer

The HSA-EVENTS Program

At GlobalBank, Derek Washington builds the event consumer under Maria Chen's supervision. HSA-EVENTS is a CICS program that is triggered when messages arrive on the MQ queue. It reads the message, parses the JSON, validates the HSA account, applies the debit, and sends a confirmation message back.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. HSA-EVENTS.
      *================================================================*
      * Program:  HSA-EVENTS                                           *
      * Purpose:  Process real-time HSA payment events from MedClaim   *
      * Trigger:  MQ message arrival on MEDCLAIM.HSA.PAYMENTS          *
      * Author:   Derek Washington (supervised by Maria Chen)          *
      * Date:     2024-07-15                                           *
      * System:   GlobalBank Core Banking                              *
      *================================================================*

       DATA DIVISION.
       WORKING-STORAGE SECTION.

       01  WS-MQ-MESSAGE              PIC X(2000).
       01  WS-MQ-MSG-LENGTH           PIC S9(08) COMP.
       01  WS-MQ-BUFLEN               PIC S9(08) COMP VALUE 2000.
       01  WS-MQ-OPENOPTS             PIC S9(08) COMP.
       01  WS-MQ-HOBJ                 PIC S9(08) COMP.
       01  WS-MQ-COMPCODE             PIC S9(08) COMP.
       01  WS-MQ-REASON               PIC S9(08) COMP.
       01  WS-RESP-CODE               PIC S9(08) COMP.
       01  WS-PAYMENT-QUEUE           PIC X(48)
               VALUE 'MEDCLAIM.HSA.PAYMENTS'.
       01  WS-CONFIRM-QUEUE           PIC X(48)
               VALUE 'GLOBALBANK.HSA.CONFIRMS'.

      *--- MQ structures and constants (MQ copybooks) ---
           COPY CMQV.
           COPY CMQODV.
           COPY CMQMDV.
           COPY CMQGMOV.
           COPY CMQPMOV.

      *--- Parsed payment request ---
       01  WS-PAYMENT-REQUEST.
           05  WS-REQ-MSG-ID          PIC X(20).
           05  WS-REQ-MSG-TYPE        PIC X(15).
           05  WS-REQ-TIMESTAMP       PIC X(23).
           05  WS-REQ-CLAIM-ID        PIC X(15).
           05  WS-REQ-MEMBER-ID       PIC X(12).
           05  WS-REQ-HSA-ACCT        PIC X(10).
           05  WS-REQ-PAY-AMOUNT      PIC S9(7)V99 COMP-3.
           05  WS-REQ-DIAG-CODE       PIC X(07).
           05  WS-REQ-PROC-CODE       PIC X(05).
           05  WS-REQ-SVC-DATE        PIC X(10).
           05  WS-REQ-PROVIDER        PIC X(30).
           05  WS-REQ-ADJUD-DATE      PIC X(10).

      *--- Confirmation response ---
       01  WS-CONFIRM-RESPONSE.
           05  WS-CFM-MSG-ID          PIC X(20).
           05  WS-CFM-CORREL-ID       PIC X(20).
           05  WS-CFM-MSG-TYPE        PIC X(15)
               VALUE 'HSA_CONFIRM'.
           05  WS-CFM-TIMESTAMP       PIC X(23).
           05  WS-CFM-CLAIM-ID        PIC X(15).
           05  WS-CFM-HSA-ACCT        PIC X(10).
           05  WS-CFM-STATUS          PIC X(10).
           05  WS-CFM-NEW-BALANCE     PIC S9(9)V99 COMP-3.
           05  WS-CFM-TXN-REF         PIC X(20).

       01  WS-CONFIRM-JSON            PIC X(2000).
       01  WS-CONFIRM-LENGTH          PIC S9(08) COMP.

      *--- HSA Account fields (from DB2) ---
       01  WS-HSA-FIELDS.
           05  WS-HSA-BALANCE         PIC S9(9)V99 COMP-3.
           05  WS-HSA-STATUS          PIC X(01).
               88  WS-HSA-ACTIVE          VALUE 'A'.
           05  WS-HSA-MEMBER-ID       PIC X(12).
           05  WS-HSA-NEW-BALANCE     PIC S9(9)V99 COMP-3.

      *--- Duplicate check ---
       01  WS-DUP-CHECK-COUNT         PIC S9(08) COMP.

      *--- Work fields ---
       01  WS-TXN-REF                 PIC X(20).
       01  WS-CURRENT-TS              PIC X(23).
       01  WS-PROCESS-STATUS          PIC X(10).

           EXEC SQL INCLUDE SQLCA END-EXEC.

       PROCEDURE DIVISION.

       0000-MAIN.
           PERFORM 1000-RECEIVE-MESSAGE
           PERFORM 2000-PARSE-REQUEST
           PERFORM 3000-VALIDATE-REQUEST
           PERFORM 4000-PROCESS-PAYMENT
           PERFORM 5000-SEND-CONFIRMATION
           EXEC CICS RETURN END-EXEC
           .

       1000-RECEIVE-MESSAGE.
      *    Open the triggering queue, then get the message under
      *    syncpoint; the CICS-MQ adapter supplies the connection,
      *    so the default handle MQHC-DEF-HCONN is used
           MOVE WS-PAYMENT-QUEUE TO MQOD-OBJECTNAME
           COMPUTE WS-MQ-OPENOPTS = MQOO-INPUT-SHARED
           CALL 'MQOPEN' USING MQHC-DEF-HCONN
                               MQOD
                               WS-MQ-OPENOPTS
                               WS-MQ-HOBJ
                               WS-MQ-COMPCODE
                               WS-MQ-REASON

           COMPUTE MQGMO-OPTIONS = MQGMO-SYNCPOINT + MQGMO-NO-WAIT
           CALL 'MQGET'  USING MQHC-DEF-HCONN
                               WS-MQ-HOBJ
                               MQMD
                               MQGMO
                               WS-MQ-BUFLEN
                               WS-MQ-MESSAGE
                               WS-MQ-MSG-LENGTH
                               WS-MQ-COMPCODE
                               WS-MQ-REASON

           IF WS-MQ-COMPCODE NOT = MQCC-OK
               DISPLAY 'HSA-EVENTS: MQ GET FAILED, REASON='
                       WS-MQ-REASON
               EXEC CICS RETURN END-EXEC
           END-IF
           .

       2000-PARSE-REQUEST.
      *    NAME phrase mapping the camelCase JSON keys to the
      *    COBOL data names is omitted for brevity
           JSON PARSE WS-MQ-MESSAGE
               INTO WS-PAYMENT-REQUEST
           END-JSON
           .

       3000-VALIDATE-REQUEST.
      *    Check for duplicate message (idempotent processing)
           EXEC SQL
               SELECT COUNT(*)
               INTO :WS-DUP-CHECK-COUNT
               FROM GLOBALBANK.HSA_TRANSACTIONS
               WHERE MESSAGE_ID = :WS-REQ-MSG-ID
           END-EXEC

           IF WS-DUP-CHECK-COUNT > 0
               MOVE 'DUPLICATE' TO WS-PROCESS-STATUS
               DISPLAY 'HSA-EVENTS: DUPLICATE MESSAGE '
                       WS-REQ-MSG-ID
               GO TO 3000-EXIT
           END-IF

      *    Validate HSA account exists and is active
           EXEC SQL
               SELECT HSA_BALANCE,
                      HSA_STATUS,
                      MEMBER_ID
               INTO  :WS-HSA-BALANCE,
                     :WS-HSA-STATUS,
                     :WS-HSA-MEMBER-ID
               FROM  GLOBALBANK.HSA_ACCOUNTS
               WHERE HSA_ACCOUNT_ID = :WS-REQ-HSA-ACCT
           END-EXEC

           EVALUATE SQLCODE
               WHEN 0
                   IF NOT WS-HSA-ACTIVE
                       MOVE 'INACTIVE' TO WS-PROCESS-STATUS
                   ELSE
                       IF WS-REQ-PAY-AMOUNT > WS-HSA-BALANCE
                           MOVE 'NSF' TO WS-PROCESS-STATUS
                       ELSE
                           MOVE 'VALIDATED' TO WS-PROCESS-STATUS
                       END-IF
                   END-IF
               WHEN +100
                   MOVE 'ACCT_NF' TO WS-PROCESS-STATUS
               WHEN OTHER
                   MOVE 'DB2_ERROR' TO WS-PROCESS-STATUS
           END-EVALUATE
           .
       3000-EXIT.
           EXIT
           .

       4000-PROCESS-PAYMENT.
           IF WS-PROCESS-STATUS NOT = 'VALIDATED'
               GO TO 4000-EXIT
           END-IF

      *    Debit the HSA account
           COMPUTE WS-HSA-NEW-BALANCE =
               WS-HSA-BALANCE - WS-REQ-PAY-AMOUNT
           END-COMPUTE

           EXEC SQL
               UPDATE GLOBALBANK.HSA_ACCOUNTS
               SET    HSA_BALANCE = :WS-HSA-NEW-BALANCE,
                      LAST_ACTIVITY_TS = CURRENT TIMESTAMP
               WHERE  HSA_ACCOUNT_ID = :WS-REQ-HSA-ACCT
                 AND  HSA_BALANCE = :WS-HSA-BALANCE
           END-EXEC

           IF SQLCODE = 0 AND SQLERRD(3) = 1
      *        Exactly one row updated - success
               MOVE 'SUCCESS' TO WS-PROCESS-STATUS
               MOVE WS-HSA-NEW-BALANCE TO WS-CFM-NEW-BALANCE
      *        Record the transaction
               PERFORM 4100-INSERT-TRANSACTION
           ELSE
      *        Optimistic lock failure or error - back out the
      *        CICS unit of work (SQL ROLLBACK is not valid
      *        in a CICS program)
               MOVE 'CONFLICT' TO WS-PROCESS-STATUS
               EXEC CICS SYNCPOINT ROLLBACK END-EXEC
           END-IF
           .
       4000-EXIT.
           EXIT
           .

       4100-INSERT-TRANSACTION.
           EXEC SQL
               INSERT INTO GLOBALBANK.HSA_TRANSACTIONS
               (MESSAGE_ID, CLAIM_ID, HSA_ACCOUNT_ID,
                PAYMENT_AMOUNT, NEW_BALANCE, PROCESS_STATUS,
                PROCESS_TS)
               VALUES
               (:WS-REQ-MSG-ID, :WS-REQ-CLAIM-ID,
                :WS-REQ-HSA-ACCT, :WS-REQ-PAY-AMOUNT,
                :WS-HSA-NEW-BALANCE, 'SUCCESS',
                CURRENT TIMESTAMP)
           END-EXEC

           IF SQLCODE = 0
               EXEC CICS SYNCPOINT END-EXEC
           ELSE
               EXEC CICS SYNCPOINT ROLLBACK END-EXEC
               MOVE 'LOG_FAIL' TO WS-PROCESS-STATUS
           END-IF
           .

       5000-SEND-CONFIRMATION.
      *    Build confirmation message
           MOVE WS-REQ-MSG-ID     TO WS-CFM-CORREL-ID
           MOVE WS-REQ-CLAIM-ID   TO WS-CFM-CLAIM-ID
           MOVE WS-REQ-HSA-ACCT   TO WS-CFM-HSA-ACCT
           MOVE WS-PROCESS-STATUS TO WS-CFM-STATUS

      *    Generate unique confirmation message ID
           STRING 'CFM' DELIMITED BY SIZE
                  WS-REQ-MSG-ID(4:17) DELIMITED BY SIZE
                  INTO WS-CFM-MSG-ID
           END-STRING

      *    Generate JSON confirmation
           JSON GENERATE WS-CONFIRM-JSON
               FROM WS-CONFIRM-RESPONSE
               COUNT WS-CONFIRM-LENGTH
           END-JSON

      *    Put confirmation on return queue (MQPUT1; the CICS-MQ
      *    adapter supplies the connection)
           MOVE WS-CONFIRM-QUEUE TO MQOD-OBJECTNAME
           MOVE MQMI-NONE        TO MQMD-MSGID
           COMPUTE MQPMO-OPTIONS = MQPMO-SYNCPOINT
           CALL 'MQPUT1' USING MQHC-DEF-HCONN
                               MQOD
                               MQMD
                               MQPMO
                               WS-CONFIRM-LENGTH
                               WS-CONFIRM-JSON
                               WS-MQ-COMPCODE
                               WS-MQ-REASON

           IF WS-MQ-COMPCODE NOT = MQCC-OK
               DISPLAY 'HSA-EVENTS: CONFIRM PUT FAILED, REASON='
                       WS-MQ-REASON
      *        Log failure - reconciliation will catch this
           END-IF
           .

Key Design Patterns in HSA-EVENTS:

Idempotent processing with duplicate check. Before processing any message, the program checks whether a message with the same ID has already been processed (by querying the HSA_TRANSACTIONS table). If it has, the message is skipped. This makes the system safe against duplicate message delivery.

Optimistic locking on the UPDATE. The UPDATE statement includes AND HSA_BALANCE = :WS-HSA-BALANCE — a condition that fails if another transaction modified the balance between the SELECT and the UPDATE. This is optimistic locking: instead of taking a database lock during the SELECT, the program assumes it will succeed and verifies at UPDATE time. If the balance changed, the UPDATE affects zero rows (SQLERRD(3) = 0), and the program handles it as a conflict.

Syncpoint after transaction recording. The payment and its transaction record are committed together in a single unit of work (under CICS, via EXEC CICS SYNCPOINT rather than SQL COMMIT, which is not permitted in a CICS program). If either fails, both are rolled back. This ensures that the transaction log is always consistent with the account balance.
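The optimistic-lock pattern is worth seeing in isolation. A Python sketch of the compare-and-set that the guarded UPDATE performs (a dict stands in for the HSA_ACCOUNTS row; names are illustrative):

```python
def debit_with_optimistic_lock(accounts, acct_id, expected_balance, amount):
    """Mimics UPDATE ... WHERE HSA_ACCOUNT_ID = :id
                            AND HSA_BALANCE   = :expected.
    The update succeeds only if nobody changed the balance
    since our earlier SELECT."""
    if accounts.get(acct_id) != expected_balance:
        return "CONFLICT"                  # zero rows updated: retry path
    accounts[acct_id] = expected_balance - amount
    return "SUCCESS"

accounts = {"HSA0045678": 500000}          # cents
snapshot = accounts["HSA0045678"]          # the earlier SELECT

# another transaction sneaks in between our SELECT and UPDATE
accounts["HSA0045678"] -= 10000

status = debit_with_optimistic_lock(
    accounts, "HSA0045678", snapshot, 125000)
# status == "CONFLICT": the stale snapshot is detected, nothing is
# debited, and the message can be retried with a fresh SELECT
```

No lock is held between read and write; the cost is an occasional retry, which is cheap when conflicts are rare.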

🔵 Why Not VSAM? Notice that HSA-EVENTS uses DB2, not VSAM, for account data. This is a deliberate choice for a real-time system. DB2 provides row-level locking (VSAM locks at the CI level), SQL access for ad-hoc queries, and built-in recovery logging. For a high-concurrency, real-time system, DB2's capabilities are essential.


Understanding IBM MQ for COBOL Developers

What is Message Queuing?

If you come from a batch background (as most COBOL developers do), message queuing requires a mental shift. In batch, programs communicate through files: Program A writes a file, and Program B reads it. The file sits on disk until the next job step runs. Communication is synchronous with the job stream — Program B cannot run until Program A completes.

Message queuing is different. Program A puts a message on a queue and continues processing immediately. Program B reads from the queue whenever it is ready — which might be milliseconds later or hours later. The queue manager (IBM MQ) guarantees that the message is delivered, even if Program B is temporarily unavailable.

This decoupling is the fundamental advantage of message queuing:

  • Program A does not need to know if Program B is running. If B is down, messages accumulate on the queue and are processed when B comes back up.
  • Program A does not wait for Program B to finish. A puts the message and moves on to the next claim.
  • Multiple instances of Program B can read from the same queue. If processing is slow, you can start additional consumers to handle the load.
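A toy in-process illustration of this decoupling, using Python's queue module as a stand-in for the queue manager (a real MQ queue is persistent and survives restarts; this one is memory-only):

```python
from queue import Queue

# The producer puts messages and moves on; the consumer drains the
# queue whenever it runs, whether that is milliseconds or hours later.
q = Queue()

def producer(claims):
    for claim in claims:
        q.put(claim)              # never blocks waiting for the consumer

def consumer():
    processed = []
    while not q.empty():          # drain whatever has accumulated
        processed.append(q.get())
    return processed
```

If the consumer is "down," calls to `producer` still succeed and messages simply accumulate, which is the behavior the bullet list above describes.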

MQ from COBOL: Two Approaches

COBOL programs can interact with MQ in two ways:

1. MQ API (MQI — Message Queue Interface): Direct calls to MQ using CALL statements. This provides full control over message options (persistence, priority, expiry, correlation) but requires managing connection handles, queue handles, and message descriptors.

      * MQI approach - full control, more code
           CALL 'MQCONN' USING WS-QM-NAME
                               WS-HCONN
                               WS-COMP-CODE
                               WS-REASON
           CALL 'MQOPEN' USING WS-HCONN
                               WS-OBJ-DESC
                               WS-OPEN-OPTIONS
                               WS-HOBJ
                               WS-COMP-CODE
                               WS-REASON
           CALL 'MQPUT'  USING WS-HCONN
                               WS-HOBJ
                               WS-MSG-DESC
                               WS-PUT-OPTIONS
                               WS-MSG-LENGTH
                               WS-MSG-BUFFER
                               WS-COMP-CODE
                               WS-REASON

2. CICS MQ Bridge: In a CICS environment, MQ operations can be performed through transient data queues that are routed to MQ (EXEC CICS WRITEQ TD and READQ TD) or through the CICS-MQ adapter. This is simpler but offers less control.

      * CICS approach - simpler, CICS manages the connection
           EXEC CICS WRITEQ TD
               QUEUE(WS-QUEUE-NAME)
               FROM(WS-MESSAGE)
               LENGTH(WS-MSG-LENGTH)
               RESP(WS-RESP-CODE)
           END-EXEC

For Derek's HSA-EVENTS program, the CICS approach is appropriate because the program runs in a CICS region. For a batch program that puts messages on a queue, the MQI approach is typically used.

Message Design Principles

Priya establishes three message design principles for the HSA system:

Principle 1: Messages are self-contained. Each message contains all information needed to process it. The consumer should not need to make additional calls to get missing data. This is why the payment request includes member name, diagnosis code, and provider name — even though the consumer could look them up. Self-contained messages reduce coupling between systems and improve reliability.

Principle 2: Messages are immutable. Once a message is put on the queue, its content does not change. If an error is discovered, a new corrective message is sent — the original message is never modified. This ensures audit trail integrity and simplifies debugging.

Principle 3: Messages are versioned. The message includes a version field (implicit in the message type, like "HSA_PAYMENT"). If the message format changes in the future, the consumer can detect the version and process accordingly. This allows rolling upgrades where producers and consumers are updated at different times.
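Version-aware dispatch can be sketched as a small handler table. The field names and the hypothetical HSA_PAYMENT_V2 type below are invented for illustration; the chapter only defines HSA_PAYMENT:

```python
import json

def handle(message_text):
    """Dispatch on the message-type field so old and new producers
    can coexist during a rolling upgrade."""
    msg = json.loads(message_text)
    handlers = {
        "HSA_PAYMENT":    lambda m: ("v1", m["amount"]),
        "HSA_PAYMENT_V2": lambda m: ("v2", m["amount"]),  # hypothetical future format
    }
    handler = handlers.get(msg["type"])
    if handler is None:
        raise ValueError("unknown message type: " + msg["type"])
    return handler(msg)
```

An upgraded consumer carries both handlers; producers can then switch to the new format at their own pace.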

Dead-Letter Queue Processing

The dead-letter queue (DLQ) is where messages go when they cannot be processed. This can happen for several reasons:

  • The consuming program ABENDs repeatedly while processing the message (BOTHRESH exceeded)
  • The message format is invalid (the JSON PARSE fails)
  • The target queue is full (MAXDEPTH reached)
  • The message has expired (EXPIRY time exceeded)

Dead-letter queue processing is an operational concern, not an application concern. But the development team must design for it:

      * In HSA-EVENTS: handle JSON parse failure gracefully
           JSON PARSE WS-MQ-MESSAGE
               INTO WS-PAYMENT-REQUEST
           END-JSON

      *    Check the JSON-CODE special register for parse errors
           IF JSON-CODE NOT = 0
               DISPLAY 'HSA-EVENTS: JSON PARSE FAILED FOR MSG'
      *        Do NOT process this message
      *        Allow MQ backout threshold to move it to DLQ
               EXEC CICS ABEND ABCODE('JSNP') END-EXEC
           END-IF

By ABENDing with a specific code ('JSNP'), the program tells MQ that this message could not be processed. After BOTHRESH attempts, MQ moves it to the DLQ. Operations can then examine the DLQ, investigate the malformed message, and take corrective action.

📊 DLQ Monitoring. In production, the DLQ depth should always be zero. Any message on the DLQ represents a transaction that was not processed — which in a financial system means money that was not moved, a payment that was not made, or an account that was not updated. Priya configures the HSA-MONITOR program to alert immediately if the DLQ contains any messages.

Transactional Messaging

In the HSA-EVENTS program, the DB2 update and the MQ confirmation PUT must be coordinated. If the DB2 update succeeds but the MQ PUT fails, the account has been debited but MedClaim does not know it. If the MQ PUT succeeds but the DB2 update fails, MedClaim thinks the payment was made but it was not.

The solution is transactional messaging: MQ and DB2 participate in the same CICS unit of work. When CICS commits, both the DB2 change and the MQ PUT are committed atomically. If either fails, both are rolled back.

      * Both DB2 and MQ participate in the CICS unit of work
      * When we issue EXEC CICS SYNCPOINT, both commit together

      *    Step 1: Update DB2 (within CICS UOW)
           EXEC SQL UPDATE GLOBALBANK.HSA_ACCOUNTS ... END-EXEC

      *    Step 2: Put MQ message (within same CICS UOW)
           EXEC CICS WRITEQ TD
               QUEUE(WS-CONFIRM-QUEUE) ...
           END-EXEC

      *    Step 3: Commit both atomically
           EXEC CICS SYNCPOINT
               RESP(WS-RESP-CODE)
           END-EXEC

      *    If SYNCPOINT fails, both DB2 and MQ are rolled back
           IF WS-RESP-CODE NOT = DFHRESP(NORMAL)
               DISPLAY 'HSA-EVENTS: SYNCPOINT FAILED'
      *        Both the DB2 update and the MQ message are
      *        rolled back - data integrity is preserved
           END-IF

This is the mainframe's answer to the distributed transaction problem. CICS acts as a transaction coordinator, and both DB2 and MQ are resource managers that participate in the two-phase commit protocol. The COBOL programmer does not need to understand the protocol details — they just use EXEC CICS SYNCPOINT and CICS handles the rest.

⚠️ The Two-Phase Commit Overhead. Transactional messaging adds overhead: each commit requires coordination between CICS, DB2, and MQ. For the HSA system, this overhead is negligible (a few milliseconds per transaction). But for high-volume systems processing thousands of messages per second, the overhead can be significant. In such cases, careful design — batching commits, using non-persistent messages for intermediate steps — can reduce the impact.

Error Handling Patterns in Event-Driven COBOL

Event-driven systems introduce error handling patterns that batch COBOL programmers may not have encountered. In batch, an error typically means writing an error record and continuing to the next input record, or in severe cases, ABENDing the job. In event-driven processing, errors must be handled with more nuance because messages are independent and the system must continue processing other messages even when one fails.

Pattern 1: Retry with Backoff.

Some errors are transient — a DB2 deadlock, a temporary resource contention, a network glitch. The correct response is to retry the operation after a brief delay. In CICS, this is accomplished by rolling back the current message (allowing MQ to redeliver it) and using the backout threshold to limit retries:

      * Pattern: Detect transient error and allow retry
           EVALUATE SQLCODE
               WHEN -911
      *            Deadlock or timeout - transient error
      *            Rollback and let MQ redeliver
                   EXEC CICS SYNCPOINT ROLLBACK
                       RESP(WS-RESP-CODE)
                   END-EXEC
      *            The message goes back to the queue
      *            MQ will redeliver after a brief delay
      *            After BOTHRESH attempts, DLQ
               WHEN -904
      *            Resource unavailable - also transient
                   EXEC CICS SYNCPOINT ROLLBACK
                       RESP(WS-RESP-CODE)
                   END-EXEC
               WHEN OTHER
      *            Permanent error - do not retry
                   PERFORM 9100-LOG-PERMANENT-ERROR
      *            Commit the GET (remove message from queue)
      *            The error is logged; reconciliation will catch it
                   EXEC CICS SYNCPOINT
                       RESP(WS-RESP-CODE)
                   END-EXEC
           END-EVALUATE

The key distinction is between transient and permanent errors. Retrying a permanent error (like a member not found, or an invalid claim ID) will never succeed and wastes resources. The program must classify each error and respond appropriately.
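The classification logic in the EVALUATE above boils down to a small decision table, sketched here in Python:

```python
# Classify DB2 SQLCODEs the way the EVALUATE above does:
# transient errors are retried (rollback, MQ redelivers);
# permanent errors are logged and the message is consumed.
TRANSIENT_SQLCODES = {-911, -904}   # deadlock/timeout, resource unavailable

def error_action(sqlcode):
    if sqlcode == 0:
        return "COMMIT"
    if sqlcode in TRANSIENT_SQLCODES:
        return "ROLLBACK-AND-RETRY"
    return "LOG-AND-CONSUME"
```

The set of transient codes is a policy decision; a production system would maintain and review this list as new failure modes are observed.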

Pattern 2: Compensating Transactions.

What happens if the HSA debit succeeds but the confirmation message fails to send? The account has been debited, but MedClaim does not know it. A compensating transaction reverses the debit:

      * Pattern: Compensating transaction
      *    DB2 update succeeded but MQ confirm failed
      *    Reverse the DB2 update
           EXEC SQL
               UPDATE GLOBALBANK.HSA_ACCOUNTS
               SET HSA_BALANCE = HSA_BALANCE
                   + :WS-PAYMENT-AMOUNT
               WHERE HSA_ACCOUNT_ID = :WS-HSA-ACCOUNT-ID
           END-EXEC

      *    Log the compensation
           DISPLAY 'HSA-EVENTS: COMPENSATING TXN FOR '
                   WS-HSA-ACCOUNT-ID
                   ' AMOUNT: ' WS-PAYMENT-AMOUNT

In practice, the HSA-EVENTS program uses transactional messaging (SYNCPOINT) to avoid this scenario. But compensating transactions are important in systems where the two resources (DB2 and MQ) cannot participate in the same unit of work — for example, when communicating with an external system via HTTP.

Pattern 3: The Error Event.

Instead of silently swallowing errors or ABENDing, event-driven systems can publish error events. These are messages placed on a dedicated error queue that describe what went wrong, when, and for which transaction:

      * Pattern: Publish error event
       9200-PUBLISH-ERROR-EVENT.
           INITIALIZE WS-ERROR-EVENT
           MOVE WS-ORIGINAL-MSG-ID TO ERR-ORIGINAL-MSG-ID
           MOVE WS-CLAIM-ID        TO ERR-CLAIM-ID
           MOVE WS-ERROR-CODE      TO ERR-ERROR-CODE
           MOVE WS-ERROR-DESC      TO ERR-ERROR-DESC
           MOVE FUNCTION CURRENT-DATE TO ERR-TIMESTAMP

           JSON GENERATE WS-ERROR-JSON
               FROM WS-ERROR-EVENT
               COUNT WS-ERROR-JSON-LEN
           END-JSON

           EXEC CICS WRITEQ TD
               QUEUE('GLOBALBANK.HSA.ERRORS')
               FROM(WS-ERROR-JSON)
               LENGTH(WS-ERROR-JSON-LEN)
               RESP(WS-RESP-CODE)
           END-EXEC
           .

Error events are invaluable for operational monitoring. Instead of grepping through CICS logs for error messages, the operations team monitors the error queue. Automated tooling can consume error events and create tickets, send alerts, or trigger corrective processes.

Performance Considerations for Real-Time COBOL

Moving from batch to real-time changes the performance characteristics of a COBOL program in fundamental ways.

Batch Performance: Measured in throughput — records per second. A batch program processes millions of records over hours. Individual record processing time does not matter as long as the total batch window is met.

Real-Time Performance: Measured in latency — milliseconds per transaction. Every millisecond counts because the end user or partner system is waiting for a response. A batch program that processes 10,000 records per second (0.1ms each) is fast. A real-time program that takes 100ms per transaction may be too slow.

The HSA-EVENTS program targets a per-transaction latency of under 50ms. Priya's performance analysis breaks this down:

  Component                  Target (ms)   Notes
  MQ GET                     2-5           Network + disk I/O for persistent message
  JSON PARSE                 1-2           CPU-bound, depends on message size
  Duplicate check (DB2)      3-8           Index lookup on MESSAGE_ID
  Account SELECT (DB2)       3-8           Primary key lookup
  Account UPDATE (DB2)       5-10          Row lock + log write
  Transaction INSERT (DB2)   3-8           Index maintenance
  JSON GENERATE              1-2           CPU-bound
  MQ PUT (confirm)           2-5           Network + disk I/O
  SYNCPOINT                  5-15          Two-phase commit (DB2 + MQ)
  Total                      25-63         Target: < 50ms average
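A quick way to sanity-check the budget is to sum the component ranges, which reproduces the 25-63ms total:

```python
# Per-component latency budget (best-case ms, worst-case ms),
# transcribed from the table above.
budget = {
    "MQ GET": (2, 5),
    "JSON PARSE": (1, 2),
    "Duplicate check (DB2)": (3, 8),
    "Account SELECT (DB2)": (3, 8),
    "Account UPDATE (DB2)": (5, 10),
    "Transaction INSERT (DB2)": (3, 8),
    "JSON GENERATE": (1, 2),
    "MQ PUT (confirm)": (2, 5),
    "SYNCPOINT": (5, 15),
}
best = sum(lo for lo, hi in budget.values())    # 25 ms
worst = sum(hi for lo, hi in budget.values())   # 63 ms
```

Note that the worst case (63ms) exceeds the 50ms target; the target is an average, and the tuning strategies below are what keep typical transactions near the best case.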

Tuning Strategies:

  1. DB2 Buffer Pool Sizing. The HSA_ACCOUNTS table is small enough to fit entirely in the DB2 buffer pool. If every row is cached in memory, the SELECT avoids physical I/O entirely. Priya works with GlobalBank's DBA to allocate a dedicated buffer pool (BP1) for the HSA tables with enough pages to hold the entire dataset.

  2. MQ Non-Persistent for Confirmations. While payment requests must be persistent (losing a payment message is unacceptable), confirmation messages could theoretically be non-persistent. If a confirmation is lost, the reconciliation process will detect the missing confirmation and regenerate it. However, the team decides to keep confirmations persistent — the performance difference (2-3ms) is not worth the operational complexity of missing confirmations.

  3. DB2 Static SQL. The HSA-EVENTS program uses static SQL (embedded in EXEC SQL blocks), not dynamic SQL. Static SQL is precompiled and optimized at BIND time, avoiding the overhead of runtime SQL parsing. For a program that executes the same queries millions of times, static SQL can be 2-5x faster than dynamic SQL.

  4. CICS Storage Management. Each invocation of HSA-EVENTS allocates working storage. By keeping working storage small and avoiding unnecessary GETMAIN/FREEMAIN calls, the program minimizes CICS storage overhead. The WS-MQ-MESSAGE field is sized at 2000 bytes — large enough for the largest expected message, small enough to avoid waste.

Capacity Planning

Priya builds a capacity model to ensure the real-time system can handle current and projected volumes:

Current volumes:

  • 20,000 claims adjudicated per business day
  • ~5,000 are HSA-eligible (25%)
  • Processing concentrated in business hours (10 hours)
  • Average rate: 500 HSA messages per hour = ~8.3 per minute

Projected growth (3 years):

  • Claims volume expected to grow 15% annually
  • New partner integrations may add 30% more HSA-eligible claims
  • Projected peak: ~15,000 HSA messages per day = 25 per minute

Capacity analysis:

  • At 50ms per transaction, one CICS transaction instance can process 20 per second = 1,200 per minute
  • Current volume (8.3/minute) uses less than 1% of capacity
  • Even at projected peak (25/minute), utilization is approximately 2%
  • The system has enormous headroom — a factor of 50x between projected peak and single-instance capacity
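Priya's capacity arithmetic is easy to reproduce; this short Python sketch mirrors the numbers above:

```python
# Single CICS transaction instance at 50 ms per transaction.
latency_ms = 50
per_second = 1000 // latency_ms          # 20 transactions per second
per_minute = per_second * 60             # 1,200 per minute

# Volumes spread over a 10-hour (600-minute) business day.
current_rate = 5000 / 600                # ~8.3 messages per minute today
projected_rate = 15000 / 600             # 25 per minute at projected peak

current_util = current_rate / per_minute      # well under 1%
projected_util = projected_rate / per_minute  # ~2%
headroom = per_minute / projected_rate        # ~48x, "a factor of 50x"
```
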

"This is one of the advantages of the mainframe," Priya explains to the project steering committee. "The z/OS hardware and CICS infrastructure can handle transaction rates that would require a cluster of distributed servers. We have capacity for growth that goes well beyond our 3-year horizon."

Burst capacity is the more relevant concern. Claims are not adjudicated evenly throughout the day. The peak hour may see 3x the average rate. And if MedClaim runs a re-adjudication batch (reprocessing previously denied claims), thousands of messages may arrive in minutes.

The team establishes monitoring thresholds:

  Metric                     Normal    Warning     Critical
  Queue depth                < 100     100-1000    > 1000
  Processing latency (avg)   < 30ms    30-100ms    > 100ms
  Error rate                 0%        < 0.1%      > 0.1%
  DLQ depth                  0         1-5         > 5

The HSA-MONITOR program checks these thresholds every 5 minutes and sends alerts when warning or critical levels are reached.


Phase 4: Parallel Running

The Reconciliation Architecture

During the parallel-run period, both the batch and real-time paths process every HSA payment. The reconciliation program compares the results nightly.

    MedClaim Adjudication
         │            │
    [MQ Message]  [Flat File]
         │            │
    [HSA-EVENTS]  [HSA-PROC Batch]
         │            │
    DB2: HSA_TXN  VSAM: Audit Trail
         │            │
         └──── [HSA-RECON] ────┘
                    │
            Reconciliation Report

The Reconciliation Program

HSA-RECON reads both the DB2 transaction log (from the real-time path) and the VSAM audit trail (from the batch path) and compares them claim-by-claim.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. HSA-RECON.
      *================================================================*
      * Program:  HSA-RECON                                            *
      * Purpose:  Reconcile real-time and batch HSA processing         *
      * Author:   Priya Kapoor                                        *
      * Date:     2024-08-01                                           *
      *================================================================*
      * Compares DB2 transaction log (real-time) with VSAM audit       *
      * trail (batch) to verify both paths produce identical results.  *
      *================================================================*

       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT BATCH-AUDIT
               ASSIGN TO BCHAUDIT
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-BATCH-STATUS.

           SELECT RECON-REPORT
               ASSIGN TO RECONRPT
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-RPT-STATUS.

       DATA DIVISION.
       FILE SECTION.

       FD  BATCH-AUDIT
           RECORDING MODE IS F
           RECORD CONTAINS 150 CHARACTERS.
       01  BATCH-AUDIT-REC.
           05  BA-CLAIM-ID            PIC X(15).
           05  BA-HSA-ACCT            PIC X(10).
           05  BA-AMOUNT              PIC S9(7)V99 COMP-3.
           05  BA-STATUS              PIC X(01).
           05  BA-PROCESS-DATE        PIC 9(08).
           05  FILLER                 PIC X(111).

       FD  RECON-REPORT
           RECORDING MODE IS F
           RECORD CONTAINS 132 CHARACTERS.
       01  RECON-LINE                  PIC X(132).

       WORKING-STORAGE SECTION.

       01  WS-FILE-STATUSES.
           05  WS-BATCH-STATUS        PIC X(02).
               88  WS-BATCH-OK           VALUE '00'.
               88  WS-BATCH-EOF          VALUE '10'.
           05  WS-RPT-STATUS          PIC X(02).

      *--- DB2 cursor for real-time transactions ---
       01  WS-RT-FIELDS.
           05  WS-RT-CLAIM-ID         PIC X(15).
           05  WS-RT-HSA-ACCT         PIC X(10).
           05  WS-RT-AMOUNT           PIC S9(7)V99 COMP-3.
           05  WS-RT-STATUS           PIC X(10).
           05  WS-RT-PROCESS-DATE     PIC X(10).

       01  WS-COUNTERS.
           05  WS-BATCH-COUNT         PIC 9(07) VALUE ZERO.
           05  WS-REALTIME-COUNT      PIC 9(07) VALUE ZERO.
           05  WS-MATCH-COUNT         PIC 9(07) VALUE ZERO.
           05  WS-MISMATCH-COUNT      PIC 9(07) VALUE ZERO.
           05  WS-BATCH-ONLY          PIC 9(07) VALUE ZERO.
           05  WS-REALTIME-ONLY       PIC 9(07) VALUE ZERO.

       01  WS-RECON-DATE              PIC 9(08).

       01  WS-FLAGS.
           05  WS-EOF-FLAG            PIC X(01) VALUE 'N'.
               88  WS-END-OF-BATCH       VALUE 'Y'.
               88  WS-MORE-BATCH         VALUE 'N'.
           05  WS-CURSOR-FLAG         PIC X(01) VALUE 'N'.
               88  WS-END-OF-CURSOR      VALUE 'Y'.
               88  WS-MORE-CURSOR        VALUE 'N'.

       01  WS-RPT-HEADER.
           05  FILLER  PIC X(01) VALUE SPACES.
           05  FILLER  PIC X(50)
               VALUE 'HSA RECONCILIATION REPORT - BATCH VS REAL-TIME'.
           05  FILLER  PIC X(30) VALUE SPACES.
           05  FILLER  PIC X(06) VALUE 'DATE: '.
           05  RH-DATE PIC X(10).
           05  FILLER  PIC X(35) VALUE SPACES.

       01  WS-RPT-DETAIL.
           05  FILLER  PIC X(01) VALUE SPACES.
           05  RD-TYPE PIC X(10).
           05  FILLER  PIC X(02) VALUE SPACES.
           05  RD-CLAIM PIC X(15).
           05  FILLER  PIC X(02) VALUE SPACES.
           05  RD-ACCT PIC X(10).
           05  FILLER  PIC X(02) VALUE SPACES.
           05  RD-B-AMT PIC -(7)9.99.
           05  FILLER  PIC X(02) VALUE SPACES.
           05  RD-R-AMT PIC -(7)9.99.
           05  FILLER  PIC X(02) VALUE SPACES.
           05  RD-B-STAT PIC X(01).
           05  FILLER  PIC X(02) VALUE SPACES.
           05  RD-R-STAT PIC X(10).
           05  FILLER  PIC X(02) VALUE SPACES.
           05  RD-RESULT PIC X(10).
           05  FILLER  PIC X(40) VALUE SPACES.

       01  WS-RPT-SUMMARY.
           05  FILLER  PIC X(01) VALUE SPACES.
           05  FILLER  PIC X(50)
               VALUE '================================================'.
           05  FILLER  PIC X(81) VALUE SPACES.

           EXEC SQL INCLUDE SQLCA END-EXEC.

           EXEC SQL DECLARE RT-CURSOR CURSOR FOR
               SELECT CLAIM_ID,
                      HSA_ACCOUNT_ID,
                      PAYMENT_AMOUNT,
                      PROCESS_STATUS,
                       CHAR(DATE(PROCESS_TS), ISO)
               FROM   GLOBALBANK.HSA_TRANSACTIONS
               WHERE  DATE(PROCESS_TS) = :WS-RECON-DATE
               ORDER BY CLAIM_ID
           END-EXEC.

       PROCEDURE DIVISION.

       0000-MAIN.
           PERFORM 1000-INITIALIZE
           PERFORM 2000-RECONCILE
           PERFORM 3000-TERMINATE
           STOP RUN
           .

       1000-INITIALIZE.
           OPEN INPUT BATCH-AUDIT
           OPEN OUTPUT RECON-REPORT

           ACCEPT WS-RECON-DATE FROM DATE YYYYMMDD

           MOVE WS-RECON-DATE TO RH-DATE
           WRITE RECON-LINE FROM WS-RPT-HEADER
               AFTER ADVANCING PAGE

           EXEC SQL OPEN RT-CURSOR END-EXEC

           PERFORM 2100-READ-BATCH
           PERFORM 2200-FETCH-REALTIME
           .

       2000-RECONCILE.
           PERFORM UNTIL WS-END-OF-BATCH AND WS-END-OF-CURSOR
               EVALUATE TRUE
                   WHEN WS-END-OF-BATCH AND WS-END-OF-CURSOR
                       CONTINUE
                   WHEN WS-END-OF-CURSOR
      *                Batch has records, real-time doesn't
                       PERFORM 2300-BATCH-ONLY
                       PERFORM 2100-READ-BATCH
                   WHEN WS-END-OF-BATCH
      *                Real-time has records, batch doesn't
                       PERFORM 2400-REALTIME-ONLY
                       PERFORM 2200-FETCH-REALTIME
                   WHEN BA-CLAIM-ID = WS-RT-CLAIM-ID
      *                Both have this claim - compare
                       PERFORM 2500-COMPARE-RECORDS
                       PERFORM 2100-READ-BATCH
                       PERFORM 2200-FETCH-REALTIME
                   WHEN BA-CLAIM-ID < WS-RT-CLAIM-ID
      *                Batch has a claim that real-time doesn't
                       PERFORM 2300-BATCH-ONLY
                       PERFORM 2100-READ-BATCH
                   WHEN BA-CLAIM-ID > WS-RT-CLAIM-ID
      *                Real-time has a claim that batch doesn't
                       PERFORM 2400-REALTIME-ONLY
                       PERFORM 2200-FETCH-REALTIME
               END-EVALUATE
           END-PERFORM
           .

       2100-READ-BATCH.
           READ BATCH-AUDIT
           EVALUATE TRUE
               WHEN WS-BATCH-OK
                   ADD 1 TO WS-BATCH-COUNT
               WHEN WS-BATCH-EOF
                   SET WS-END-OF-BATCH TO TRUE
               WHEN OTHER
                   DISPLAY 'BATCH READ ERROR: ' WS-BATCH-STATUS
                   SET WS-END-OF-BATCH TO TRUE
           END-EVALUATE
           .

       2200-FETCH-REALTIME.
           EXEC SQL
               FETCH RT-CURSOR
               INTO :WS-RT-CLAIM-ID,
                    :WS-RT-HSA-ACCT,
                    :WS-RT-AMOUNT,
                    :WS-RT-STATUS,
                    :WS-RT-PROCESS-DATE
           END-EXEC

           EVALUATE SQLCODE
               WHEN 0
                   ADD 1 TO WS-REALTIME-COUNT
               WHEN +100
                   SET WS-END-OF-CURSOR TO TRUE
               WHEN OTHER
                   DISPLAY 'DB2 FETCH ERROR: ' SQLCODE
                   SET WS-END-OF-CURSOR TO TRUE
           END-EVALUATE
           .

       2300-BATCH-ONLY.
           ADD 1 TO WS-BATCH-ONLY
           MOVE 'BATCH-ONLY' TO RD-TYPE
           MOVE BA-CLAIM-ID TO RD-CLAIM
           MOVE BA-HSA-ACCT TO RD-ACCT
           MOVE BA-AMOUNT TO RD-B-AMT
           MOVE ZERO TO RD-R-AMT
           MOVE BA-STATUS TO RD-B-STAT
           MOVE SPACES TO RD-R-STAT
           MOVE 'MISMATCH' TO RD-RESULT
           WRITE RECON-LINE FROM WS-RPT-DETAIL
               AFTER ADVANCING 1 LINE
           .

       2400-REALTIME-ONLY.
           ADD 1 TO WS-REALTIME-ONLY
           MOVE 'RT-ONLY' TO RD-TYPE
           MOVE WS-RT-CLAIM-ID TO RD-CLAIM
           MOVE WS-RT-HSA-ACCT TO RD-ACCT
           MOVE ZERO TO RD-B-AMT
           MOVE WS-RT-AMOUNT TO RD-R-AMT
           MOVE SPACES TO RD-B-STAT
           MOVE WS-RT-STATUS TO RD-R-STAT
           MOVE 'MISMATCH' TO RD-RESULT
           WRITE RECON-LINE FROM WS-RPT-DETAIL
               AFTER ADVANCING 1 LINE
           .

       2500-COMPARE-RECORDS.
           MOVE 'BOTH' TO RD-TYPE
           MOVE BA-CLAIM-ID TO RD-CLAIM
           MOVE BA-HSA-ACCT TO RD-ACCT
           MOVE BA-AMOUNT TO RD-B-AMT
           MOVE WS-RT-AMOUNT TO RD-R-AMT
           MOVE BA-STATUS TO RD-B-STAT
           MOVE WS-RT-STATUS TO RD-R-STAT

           IF BA-AMOUNT = WS-RT-AMOUNT
              AND BA-HSA-ACCT = WS-RT-HSA-ACCT
               MOVE 'MATCH' TO RD-RESULT
               ADD 1 TO WS-MATCH-COUNT
           ELSE
               MOVE 'MISMATCH' TO RD-RESULT
               ADD 1 TO WS-MISMATCH-COUNT
               WRITE RECON-LINE FROM WS-RPT-DETAIL
                   AFTER ADVANCING 1 LINE
           END-IF
           .

       3000-TERMINATE.
           WRITE RECON-LINE FROM WS-RPT-SUMMARY
               AFTER ADVANCING 3 LINES

           DISPLAY '======================================='
           DISPLAY 'HSA RECONCILIATION SUMMARY'
           DISPLAY '======================================='
           DISPLAY 'BATCH TRANSACTIONS:     ' WS-BATCH-COUNT
           DISPLAY 'REAL-TIME TRANSACTIONS: ' WS-REALTIME-COUNT
           DISPLAY 'MATCHES:                ' WS-MATCH-COUNT
           DISPLAY 'MISMATCHES:             ' WS-MISMATCH-COUNT
           DISPLAY 'BATCH-ONLY:             ' WS-BATCH-ONLY
           DISPLAY 'REAL-TIME-ONLY:         ' WS-REALTIME-ONLY
           DISPLAY '======================================='

           EXEC SQL CLOSE RT-CURSOR END-EXEC

           CLOSE BATCH-AUDIT
                 RECON-REPORT

           IF WS-MISMATCH-COUNT > ZERO OR
              WS-BATCH-ONLY > ZERO OR
              WS-REALTIME-ONLY > ZERO
               MOVE 4 TO RETURN-CODE
           ELSE
               MOVE 0 TO RETURN-CODE
           END-IF
           .

The Reconciliation Algorithm:

HSA-RECON uses the classic merge-compare pattern — the same pattern used in sequential file matching throughout the COBOL world. Both sources are sorted by claim ID. The program advances through both sources simultaneously, comparing claim IDs at each step:

  • If both have the same claim ID, compare amounts and statuses (match or mismatch)
  • If batch has a claim that real-time does not, it is a "batch-only" record
  • If real-time has a claim that batch does not, it is a "real-time-only" record

This merge pattern is O(n) — it processes both sources in a single pass, regardless of volume.
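The same merge-compare logic, reduced to a language-neutral sketch in Python (tuples of claim ID and amount stand in for the audit records; both inputs must be sorted by claim ID, just as in HSA-RECON):

```python
def reconcile(batch, realtime):
    """Single-pass merge-compare of two claim-ID-sorted lists of
    (claim_id, amount) tuples, mirroring HSA-RECON's EVALUATE logic."""
    i = j = 0
    matches, mismatches, batch_only, rt_only = [], [], [], []
    while i < len(batch) or j < len(realtime):
        if j >= len(realtime) or (i < len(batch) and batch[i][0] < realtime[j][0]):
            batch_only.append(batch[i][0])      # claim seen only by batch
            i += 1
        elif i >= len(batch) or batch[i][0] > realtime[j][0]:
            rt_only.append(realtime[j][0])      # claim seen only by real-time
            j += 1
        else:                                   # same claim ID in both sources
            if batch[i][1] == realtime[j][1]:
                matches.append(batch[i][0])
            else:
                mismatches.append(batch[i][0])
            i += 1
            j += 1
    return matches, mismatches, batch_only, rt_only
```

Because each iteration advances at least one of the two cursors, the loop runs at most len(batch) + len(realtime) times, which is the O(n) single-pass property claimed above.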

📊 The 30-Day Gate. The reconciliation runs every night during the parallel-run period. The go-live criterion is 30 consecutive business days with zero mismatches, zero batch-only records, and zero real-time-only records. If any day shows a discrepancy, the counter resets to zero and investigation begins. This is a demanding criterion, but Priya insists: "We are moving money. Zero is the only acceptable error rate."


Phase 5: Monitoring and Alerting

The HSA-MONITOR Program

In production, real-time systems require active monitoring. Unlike batch, where you check the results in the morning, real-time problems must be detected and addressed immediately.

Priya designs a monitoring CICS transaction that runs every 5 minutes and checks system health:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. HSA-MONITOR.
      *================================================================*
      * Program:  HSA-MONITOR                                          *
      * Purpose:  Real-time health monitoring for HSA event system     *
      * Schedule: Every 5 minutes via CICS interval control            *
      * Author:   Priya Kapoor                                        *
      * Date:     2024-08-15                                           *
      *================================================================*

       DATA DIVISION.
       WORKING-STORAGE SECTION.

       01  WS-MONITOR-RESULTS.
           05  WS-QUEUE-DEPTH         PIC S9(08) COMP.
           05  WS-DLQ-DEPTH           PIC S9(08) COMP.
           05  WS-OLDEST-MSG-AGE      PIC S9(08) COMP.
           05  WS-PROCESSED-LAST-5M   PIC S9(08) COMP.
           05  WS-ERRORS-LAST-5M      PIC S9(08) COMP.

       01  WS-THRESHOLDS.
           05  WS-MAX-QUEUE-DEPTH     PIC S9(08) COMP
               VALUE 1000.
           05  WS-MAX-DLQ-DEPTH       PIC S9(08) COMP
               VALUE 0.
           05  WS-MAX-MSG-AGE-SEC     PIC S9(08) COMP
               VALUE 300.
           05  WS-MIN-THROUGHPUT      PIC S9(08) COMP
               VALUE 10.

       01  WS-ALERT-FLAG              PIC X(01).
           88  WS-ALERT-NEEDED            VALUE 'Y'.
           88  WS-NO-ALERT                VALUE 'N'.

       01  WS-ALERT-MESSAGE           PIC X(200).
       01  WS-RESP-CODE               PIC S9(08) COMP.

           EXEC SQL INCLUDE SQLCA END-EXEC.

       PROCEDURE DIVISION.

       0000-MAIN.
           SET WS-NO-ALERT TO TRUE
           PERFORM 1000-CHECK-QUEUE-DEPTH
           PERFORM 2000-CHECK-DLQ
           PERFORM 3000-CHECK-PROCESSING-RATE
           PERFORM 4000-CHECK-ERROR-RATE

           IF WS-ALERT-NEEDED
               PERFORM 5000-SEND-ALERT
           END-IF

      *    Reschedule for next interval
           EXEC CICS START
               TRANSID('HMON')
               INTERVAL(000500)
               RESP(WS-RESP-CODE)
           END-EXEC

           EXEC CICS RETURN END-EXEC
           .

       1000-CHECK-QUEUE-DEPTH.
      *    Check how many messages are waiting
      *    If depth > threshold, consumer may be down
           EXEC CICS INQUIRE TDQUEUE('MEDCLAIM.HSA.PAYMENTS')
               NUMITEMS(WS-QUEUE-DEPTH)
               RESP(WS-RESP-CODE)
           END-EXEC

           IF WS-QUEUE-DEPTH > WS-MAX-QUEUE-DEPTH
               SET WS-ALERT-NEEDED TO TRUE
               STRING 'ALERT: Payment queue depth = '
                   DELIMITED BY SIZE
                   WS-QUEUE-DEPTH DELIMITED BY SIZE
                   ' (threshold: '
                   DELIMITED BY SIZE
                   WS-MAX-QUEUE-DEPTH DELIMITED BY SIZE
                   ')' DELIMITED BY SIZE
                   INTO WS-ALERT-MESSAGE
               END-STRING
           END-IF
           .

       2000-CHECK-DLQ.
      *    Any messages on the dead-letter queue need attention
           EXEC CICS INQUIRE TDQUEUE(
               'MEDCLAIM.HSA.PAYMENTS.DLQ')
               NUMITEMS(WS-DLQ-DEPTH)
               RESP(WS-RESP-CODE)
           END-EXEC

           IF WS-DLQ-DEPTH > WS-MAX-DLQ-DEPTH
               SET WS-ALERT-NEEDED TO TRUE
               STRING 'CRITICAL: Dead letter queue has '
                   DELIMITED BY SIZE
                   WS-DLQ-DEPTH DELIMITED BY SIZE
                   ' messages - investigate immediately'
                   DELIMITED BY SIZE
                   INTO WS-ALERT-MESSAGE
               END-STRING
           END-IF
           .

       3000-CHECK-PROCESSING-RATE.
       *    Query DB2 for transactions in the last 5 minutes.
       *    (A fuller version would also compare this count with
       *    an expected minimum rate and alert on a sudden drop,
       *    as the prose below describes.)
           EXEC SQL
               SELECT COUNT(*)
               INTO :WS-PROCESSED-LAST-5M
               FROM GLOBALBANK.HSA_TRANSACTIONS
               WHERE PROCESS_TS > CURRENT TIMESTAMP
                                  - 5 MINUTES
           END-EXEC
           .

       4000-CHECK-ERROR-RATE.
      *    Query DB2 for failed transactions in the last 5 minutes
           EXEC SQL
               SELECT COUNT(*)
               INTO :WS-ERRORS-LAST-5M
               FROM GLOBALBANK.HSA_TRANSACTIONS
               WHERE PROCESS_TS > CURRENT TIMESTAMP
                                  - 5 MINUTES
                 AND PROCESS_STATUS <> 'SUCCESS'
           END-EXEC

           IF WS-ERRORS-LAST-5M > 0
               SET WS-ALERT-NEEDED TO TRUE
               STRING 'WARNING: '
                   DELIMITED BY SIZE
                   WS-ERRORS-LAST-5M DELIMITED BY SIZE
                   ' failed transactions in last 5 minutes'
                   DELIMITED BY SIZE
                   INTO WS-ALERT-MESSAGE
               END-STRING
           END-IF
           .

       5000-SEND-ALERT.
           DISPLAY 'HSA-MONITOR: ' WS-ALERT-MESSAGE
      *    In production, this would send to an alerting system
      *    (email, Slack, PagerDuty, etc.) via MQ or API
           .

Monitoring Metrics:

The monitoring program checks four health indicators every 5 minutes:

  1. Queue depth: If the payment queue has more than 1,000 messages waiting, the consumer may be down or overloaded.
  2. Dead-letter queue depth: Any messages on the DLQ indicate processing failures that need manual investigation.
  3. Processing rate: The number of successful transactions in the last 5 minutes. A sudden drop indicates a problem.
  4. Error rate: The number of failed transactions in the last 5 minutes. Any errors warrant investigation.

⚠️ Monitoring is Not Optional. Batch systems are self-monitoring in a sense: if the job fails, the operator sees it in the morning. Real-time systems can fail silently — messages accumulate on queues, error rates climb, and nobody notices until a customer complains. Active monitoring with automated alerting is essential for any real-time system.


Understanding the Reconciliation in Depth

Why Reconciliation is Hard

On the surface, reconciliation seems simple: compare two lists of transactions and report differences. In practice, it is one of the most challenging programs to write correctly, because of edge cases that do not appear in textbooks:

Timing mismatches. The batch path and real-time path may process the same claim at different times. A claim adjudicated at 11:55 PM might appear in the real-time log with a timestamp of 11:55 PM but in the batch extract for the following day. The reconciliation must handle cross-date matching.

Rounding differences. COBOL COMP-3 arithmetic and DB2 DECIMAL arithmetic use different internal representations. For most amounts, they produce identical results. But for certain calculations (especially those involving division), they may differ by one cent. The reconciliation must decide whether a one-cent difference constitutes a mismatch.
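The one-cent effect is easy to reproduce with Python's decimal module, used here as a neutral stand-in for the two engines: ROUND_HALF_UP and ROUND_DOWN stand in for rounded versus truncated arithmetic, and the one-cent tolerance is the policy decision the text describes. (The exact behavior of COMP-3 and DB2 DECIMAL depends on statement options and scale rules.)

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_DOWN

# A claim amount divided across parties: 100.00 split 7 ways.
raw = Decimal("100.00") / Decimal("7")         # 14.2857142857...

# One path rounds half-up; the other truncates the extra digits.
rounded = raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)   # 14.29
truncated = raw.quantize(Decimal("0.01"), rounding=ROUND_DOWN)    # 14.28
assert rounded - truncated == Decimal("0.01")  # a one-cent mismatch

# A reconciliation that tolerates one cent treats these as a match.
TOLERANCE = Decimal("0.01")                    # hypothetical policy choice
assert abs(rounded - truncated) <= TOLERANCE
```

Whether a one-cent difference counts as a mismatch is a business decision, not a technical one; the tolerance constant simply makes that decision explicit and auditable.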

Order of operations. If two claims for the same member arrive simultaneously, the real-time path might process them in a different order than the batch path. The final balance should be the same, but the intermediate states differ. The reconciliation compares final results, not intermediate states.

Partial processing. If the real-time path processes 4,999 of 5,000 claims before MQ goes down, the reconciliation will show one "batch-only" record. This is not a bug — it is an expected consequence of the channel outage. The reconciliation report must distinguish between expected timing differences and actual processing errors.

The Reconciliation Algorithm in Detail

HSA-RECON uses the merge-compare algorithm. This algorithm requires both sources to be sorted by the same key (claim ID). The algorithm processes both sources in a single pass:

Initialize: Read first record from both sources

LOOP until both sources exhausted:
    IF both sources have current record:
        IF batch_claim_id = realtime_claim_id:
            COMPARE amounts and statuses
            ADVANCE both sources
        ELSE IF batch_claim_id < realtime_claim_id:
            REPORT "batch only" for batch record
            ADVANCE batch source
        ELSE:
            REPORT "realtime only" for realtime record
            ADVANCE realtime source
    ELSE IF only batch has records:
        REPORT "batch only" for remaining batch records
        ADVANCE batch source
    ELSE IF only realtime has records:
        REPORT "realtime only" for remaining realtime records
        ADVANCE realtime source
END LOOP

This is the same algorithm used in sequential file matching throughout COBOL — the merge pattern. It is O(n + m) where n and m are the sizes of the two sources. For 5,000 batch records and 5,000 real-time records, it requires at most 10,000 comparisons — far more efficient than the naive O(n * m) approach of searching one source for every record in the other.
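The merge-compare pass above can be sketched compactly; Python serves as executable pseudocode here, and the tuple format and status labels are illustrative, not HSA-RECON's actual record layout:

```python
def reconcile(batch, realtime):
    """Merge-compare two lists of (claim_id, amount) tuples,
    both pre-sorted by claim_id. Single pass, O(n + m)."""
    i = j = 0
    results = []
    while i < len(batch) or j < len(realtime):
        if i < len(batch) and j < len(realtime):
            b_id, b_amt = batch[i]
            r_id, r_amt = realtime[j]
            if b_id == r_id:                      # same claim in both sources
                results.append((b_id, "MATCH" if b_amt == r_amt else "MISMATCH"))
                i += 1
                j += 1
            elif b_id < r_id:                     # claim only in the batch source
                results.append((b_id, "BATCH-ONLY"))
                i += 1
            else:                                 # claim only in the real-time source
                results.append((r_id, "REALTIME-ONLY"))
                j += 1
        elif i < len(batch):                      # real-time source exhausted
            results.append((batch[i][0], "BATCH-ONLY"))
            i += 1
        else:                                     # batch source exhausted
            results.append((realtime[j][0], "REALTIME-ONLY"))
            j += 1
    return results

batch = [("C001", 100), ("C002", 250), ("C004", 75)]
realtime = [("C001", 100), ("C003", 40), ("C004", 80)]
print(reconcile(batch, realtime))
```

Note that the sorted-input precondition is doing the real work: it is what lets a single forward pass classify every record without ever searching backward.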

Reconciliation Reporting

The reconciliation report serves multiple audiences:

Operations team: Needs a summary — how many matched, how many mismatched, are we clean today?

Development team: Needs details — which specific claims mismatched, what were the batch and real-time values, when were they processed?

Management: Needs trends — are mismatches decreasing over time? Are we on track for the 30-day gate?

Auditors: Need complete history — every reconciliation result for every day of the parallel run, with the ability to drill into any specific mismatch.

The HSA-RECON report is designed for all four audiences: the summary at the bottom serves operations and management, the detail lines serve the development team, and the complete output file (written to a GDG for retention) serves auditors.

Handling Reconciliation Failures

When the reconciliation shows a mismatch, the investigation follows a standard procedure:

  1. Identify the claim. Use the claim ID from the reconciliation report.
  2. Check the batch audit trail. Find the claim in the batch audit file. Note the processing date, amount, and status.
  3. Check the DB2 transaction log. Find the claim in GLOBALBANK.HSA_TRANSACTIONS. Note the processing timestamp, amount, and status.
  4. Compare. Determine what differs: amount, status, or presence.
  5. Root cause. Common causes include: (a) timezone mismatch, (b) message delivery delay, (c) claim amended between batch extract and real-time processing, (d) bug in either path.
  6. Resolve. Fix the root cause. If necessary, manually adjust the affected account and update the transaction log.

During the parallel-run period, each mismatch resets the 30-day counter. This creates urgency: every mismatch must be investigated and resolved quickly, or the cutover date slips.

The 30-Day Gate: Statistical Confidence

Why 30 consecutive clean days? The number is not arbitrary. With 5,000 HSA transactions per day and a 30-day window, the parallel run processes approximately 150,000 transactions. If all 150,000 match, the team can say with high confidence that the real-time system produces identical results to the batch system.

Specifically, if the true error rate were 0.01% (one error per 10,000 transactions), the probability of seeing zero errors in 150,000 transactions is approximately 0.0000003 — about three in ten million, essentially zero. So 30 clean days with 5,000 daily transactions effectively proves the error rate is well below 0.01%.
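The arithmetic is easy to check directly; the 0.01% rate is the hypothesis from the text, and the Poisson comparison is an added sanity check:

```python
import math

daily_volume = 5_000
days = 30
n = daily_volume * days            # 150,000 transactions in the window
error_rate = 0.0001                # hypothesized true rate: 0.01%

# Probability of observing zero errors if the true rate were 0.01%:
p_zero = (1 - error_rate) ** n
print(f"{p_zero:.1e}")             # ~3.1e-07

# Poisson approximation: P(0 errors) ~= exp(-n * rate) = exp(-15)
assert abs(p_zero - math.exp(-15)) < 1e-8
```

The same formula shows why low-volume systems need a longer window: at 100 transactions per day, 30 days gives n = 3,000 and a zero-error probability of about 0.74 under the same hypothesis, which proves very little.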

If the daily volume were lower — say 100 transactions per day — 30 days would only test 3,000 transactions, and the confidence interval would be wider. For low-volume systems, a longer parallel-run period (60 or 90 days) would be appropriate.

Priya explains this to management using a simple analogy: "If you flip a coin 150,000 times and it comes up heads every time, you can be pretty confident it's not a fair coin. If our system processes 150,000 transactions with zero mismatches, we can be pretty confident it's working correctly."


Phase 6: Cutover Planning

The Cutover Sequence

After 30 consecutive clean reconciliation days, the team prepares for cutover — the moment when the batch path is disabled and the real-time path becomes the sole processing method.

Cutover Plan:

Step  Time           Action                                Responsibility    Rollback
  1   T-7 days       Final parallel-run review             Priya             N/A
  2   T-1 day        Notify all stakeholders               Sarah Kim         N/A
  3   T-0, 6:00 PM   Disable batch flat file generation    James Okafor      Re-enable flat file
  4   T-0, 6:15 PM   Verify MQ messages flowing            Derek Washington  Rollback to step 3
  5   T-0, 6:30 PM   Run 10-transaction smoke test         Maria Chen        Rollback to step 3
  6   T-0, 7:00 PM   Monitor first real-time evening       Priya             Full rollback
  7   T+1, 8:00 AM   Morning reconciliation                Priya             Full rollback
  8   T+1, 9:00 AM   Go/no-go decision                     All leads         Full rollback
  9   T+7 days       Decommission batch flat file JCL      James Okafor      N/A
 10   T+30 days      Remove batch code from production     All               N/A

Rollback Strategy:

The key to a safe cutover is a fast rollback. If anything goes wrong during or after cutover, the team can:

  1. Re-enable the batch flat file generation in CLM-ADJUD (a one-line JCL change)
  2. The next nightly batch will process all claims through the old path
  3. Real-time MQ messages continue to flow and are processed by HSA-EVENTS, but the reconciliation will show duplicates
  4. A cleanup program removes duplicate transactions from the DB2 table

🧪 Theme: Defensive Programming at the System Level. The cutover plan embodies defensive programming applied to an entire system. Every step has a defined rollback action. The rollback is tested before the cutover (a "rollback rehearsal" runs the previous weekend). The batch infrastructure is preserved for 30 days after cutover, not immediately decommissioned. This means the team can roll back to batch at any point during the first month if the real-time system proves unreliable.


Phase 7: The Cutover

Friday, 6:00 PM

The cutover begins. Both teams are on a conference bridge. James Okafor is at the MedClaim data center. Derek Washington and Maria Chen are at GlobalBank. Priya Kapoor is monitoring from her laptop, connected to both environments.

"Step 3: Disabling batch flat file generation," James announces. He modifies the CLM-ADJUD JCL to point the flat file DD statement to a DUMMY dataset. The program still runs — it still adjudicates claims — but it no longer writes the file that GlobalBank's batch reads.

"Step 4: Verifying MQ flow." Derek checks the queue depth. "I see 47 messages on MEDCLAIM.HSA.PAYMENTS. Messages are arriving."

"Step 5: Smoke test." Maria Chen submits 10 test claims through MedClaim's CICS interface. Within seconds, she sees the corresponding HSA debits in GlobalBank's DB2.

"All 10 processed. Amounts match. Confirmations received."

Priya checks the monitoring dashboard. Queue depth is stable. DLQ is empty. Processing rate is nominal.

"Step 6: Monitoring." Over the next two hours, 1,247 real-time HSA payments are processed. Zero errors. Average latency: 14 seconds from adjudication to HSA debit — well within the 30-second target.

Saturday, 8:00 AM

The morning reconciliation runs. Priya checks the results:

=======================================
HSA RECONCILIATION SUMMARY
=======================================
BATCH TRANSACTIONS:     0
REAL-TIME TRANSACTIONS: 4,891
MATCHES:                0
MISMATCHES:             0
BATCH-ONLY:             0
REAL-TIME-ONLY:         4,891
=======================================

All 4,891 transactions from the previous evening were processed exclusively through the real-time path. Zero batch transactions (as expected, since batch was disabled). Zero mismatches.

"We're clean," Priya tells the group.

Monday, 9:00 AM — Go/No-Go

After monitoring the weekend processing (12,340 real-time transactions, zero errors), the team meets for the go/no-go decision.

"The system processed 17,231 transactions since Friday's cutover with zero errors, zero DLQ messages, and average latency of 12 seconds," Priya reports. "I recommend we proceed."

"Agreed," says James.

"Agreed," says Maria.

The cutover is declared successful. The batch flat file JCL will be decommissioned in 7 days. The batch programs will remain available (but unused) for 30 days.


Post-Cutover Operations

The First Week

The first week after cutover is the most critical. Even though the parallel run proved the system is correct under normal conditions, production inevitably brings conditions that testing did not anticipate.

Day 1 (Saturday): Low volume. 4,891 transactions processed. Zero errors. The team monitors continuously but finds nothing unusual.

Day 2 (Sunday): Even lower volume. 2,449 transactions. Zero errors. Derek notices that the average processing latency is 8 seconds — faster than during the parallel run because the batch system is no longer competing for DB2 locks.

Day 3 (Monday): First full business day. Volume spikes to 7,823 transactions — 50% higher than the average during parallel run. Peak hour (11 AM - 12 PM) processes 1,247 transactions. Queue depth briefly reaches 45 messages before the consumer catches up. Zero errors.

Day 4 (Tuesday): A new edge case appears. A partner hospital submits a claim with a diagnosis code that includes a special character (an en-dash instead of a hyphen). The JSON PARSE fails, and the message goes to the DLQ. HSA-MONITOR detects the DLQ message within 5 minutes and alerts the team. James traces the problem to the partner hospital's system generating non-standard characters. Fix: add a character validation step in the MedClaim producer before the JSON GENERATE.

Day 5 (Wednesday): MedClaim deploys the character validation fix. The DLQ message is manually reprocessed after correcting the diagnosis code. Zero new errors.

Day 6 (Thursday): Normal operations. 5,412 transactions. Zero errors. The team begins to relax slightly. Derek runs a DB2 performance report and confirms that buffer pool hit ratios are at 99.8% — the HSA tables are entirely cached in memory.

Day 7 (Friday): End of the first full business week. Total transactions for the week: 31,247. Total errors: 1 (the en-dash character issue, now fixed). Priya writes the first weekly operations report and distributes it to both management chains. The report includes performance metrics, error summaries, capacity utilization, and recommendations.

The en-dash incident also shows why strict validation at system boundaries matters: the bad character would have been invisible in batch (the flat file would have contained the same character, and COBOL would have processed it without complaint). It surfaced only because JSON has stricter character-encoding rules than fixed-format flat files.

Capacity Planning for Growth

With the real-time system in production, Priya turns her attention to capacity planning. The system must handle projected growth without degradation.

Current capacity:

  - MQ: 100,000 message queue depth (20 days of buffer at current volume)
  - DB2: 500 transactions per second theoretical max (current peak: 3.5 per second)
  - CICS: 200 concurrent tasks (current peak: 12)

Projected growth:

  - MedClaim expects to add 3 new partner insurers in the next 12 months, each contributing ~2,000 transactions per day
  - Total projected daily volume in 12 months: 13,000 transactions
  - Total projected daily volume in 24 months: 20,000 transactions

At 20,000 transactions per day (peak hour ~2,500), the system is still well within capacity. The bottleneck, if one emerges, will be DB2 I/O — not MQ or CICS. Priya recommends monitoring DB2 buffer pool hit ratios and adjusting buffer pools if they fall below 95%.
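The headroom claim is easy to verify with back-of-envelope arithmetic; the peak-hour share below is inferred from the ~2,500 figure in the text, and the DB2 ceiling is the stated theoretical maximum:

```python
# Back-of-envelope capacity check for the 24-month projection.
projected_daily = 20_000
peak_hour_txns = 2_500             # busiest hour of the projected day
db2_max_tps = 500                  # stated theoretical DB2 maximum

peak_tps = peak_hour_txns / 3600   # average rate across the peak hour
print(f"peak ~{peak_tps:.2f} tps")

headroom = db2_max_tps / peak_tps
assert headroom > 100              # orders of magnitude of headroom remain
```

Even the projected peak hour averages under one transaction per second, which is why the practical concern shifts from throughput ceilings to DB2 I/O behavior (buffer pool hit ratios) rather than raw transaction rates.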

Batch Decommissioning

After 30 days of clean production operation, the batch infrastructure is scheduled for decommissioning. But "decommission" does not mean "delete":

T+30 days: Remove the batch JCL steps from the production schedule. The JCL is archived (not deleted) in a "decommissioned" library.

T+60 days: Remove the batch program load modules from the production load library. They are archived in a backup library.

T+90 days: Review the decommissioned batch programs. If no issues have arisen in 90 days, the archive can be considered cold storage. It remains available but is no longer maintained.

T+365 days: Final review. If the real-time system has operated without reverting to batch for one full year, the batch archive can be formally retired. Even at this point, the code is not deleted — it is moved to long-term archive. In regulated industries like healthcare and banking, source code may need to be retained for 7+ years for audit purposes.

Maria Chen insists on this conservative timeline. "I've seen systems that ran perfectly for six months and then failed on year-end processing — because year-end volumes are three times normal, and the real-time system had never been tested at that scale. Keep the batch safety net until you've been through every seasonal peak."


Lessons from the Migration

Lesson 1: The Parallel Run is Everything

The 30-day parallel run was the most expensive phase of the project — it required running both systems simultaneously, building a reconciliation program, and investigating every discrepancy. It was also the most valuable phase. The parallel run found three bugs in the real-time system that would have caused production failures:

  1. A timezone issue where MedClaim's timestamp used Eastern time but GlobalBank's expected UTC
  2. A rounding difference where COMP-3 arithmetic and DB2 DECIMAL arithmetic produced slightly different results for certain amounts
  3. A message ordering issue where rapid-fire claims for the same member arrived out of order, causing optimistic lock failures

All three were found and fixed during the parallel run, before they could affect production.

Lesson 2: Idempotent Design Saves Lives

During the parallel run, there were several instances where MQ delivered duplicate messages (typically after a queue manager restart). Because HSA-EVENTS checked for duplicate message IDs before processing, these duplicates were silently ignored. Without idempotent design, each duplicate would have caused a double debit — taking money from a member's HSA account twice.
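The duplicate-check pattern can be sketched in a few lines; an in-memory set stands in for the DB2 duplicate table here, and the class and field names are illustrative. In the real system the message-ID insert and the debit must commit in the same unit of work, or a crash between them reintroduces the double-debit risk:

```python
class HsaConsumer:
    """Sketch of idempotent message handling: every message carries a
    unique messageId, and a processed-ID store is checked before the
    debit is applied."""

    def __init__(self):
        self.processed_ids = set()   # stand-in for the DB2 duplicate table
        self.balances = {}           # stand-in for HSA account balances

    def handle(self, msg):
        if msg["messageId"] in self.processed_ids:
            return "DUPLICATE-IGNORED"           # redelivery: silently skip
        self.processed_ids.add(msg["messageId"])
        member = msg["memberId"]
        self.balances[member] = self.balances.get(member, 0) - msg["amount"]
        return "DEBITED"

consumer = HsaConsumer()
msg = {"messageId": "M-001", "memberId": "MBR-42", "amount": 125}
consumer.handle(msg)       # first delivery: debit applied
consumer.handle(msg)       # redelivery after a queue manager restart: ignored
```

The key property: delivering the same message once, twice, or ten times leaves the account in exactly the same state.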

Lesson 3: Monitoring Must Be Built-In, Not Bolted On

The monitoring program (HSA-MONITOR) was built in Phase 5, before the parallel run. This meant the team had monitoring data from day one. They could see processing rates, error rates, and queue depths in real time, which made investigating reconciliation discrepancies much faster.

Lesson 4: Both Teams Must Understand Both Systems

Priya insisted that Derek Washington spend time learning MedClaim's system and that James Okafor spend time learning GlobalBank's. "You cannot debug a message that crosses organizational boundaries if you only understand half the path." This cross-training proved invaluable during the parallel run when the timezone bug required understanding both systems' date handling.

Lesson 5: Batch is Not the Enemy

The migration did not make batch obsolete; it narrowed batch's role to the work where overnight processing remains the right tool: reconciliation, reporting, archival, and the standby fallback path.

💡 A Note on Distributed Transactions. The two-phase commit that CICS provides between DB2 and MQ is a luxury that many distributed systems do not have. If GlobalBank's consumer were a microservice running in a cloud environment, coordinating the database update and the message send would require the Outbox Pattern (write the message to a database table, then have a separate process read the table and send the message) or the Saga Pattern (a sequence of local transactions with compensating actions for rollback). The mainframe's integrated transaction manager makes this much simpler — but understanding the distributed alternatives helps you appreciate what CICS does behind the scenes.

Lesson 6: Design for Operations, Not Just Development

The monitoring program, the runbooks, the reconciliation reports — these are not afterthoughts. They are as important as the core processing programs. A system that works perfectly but cannot be monitored, debugged, or rolled back is a system that will eventually cause a production crisis.

Priya estimates that 30% of the project effort went into "operational infrastructure" — monitoring, reconciliation, alerting, runbook creation, and capacity planning. This ratio is typical for production-grade real-time systems. Development teams that allocate 100% of their time to "the application" and 0% to operations invariably pay for it later, usually at 3 AM on a Saturday.

Lesson 7: Cross-Training is a Risk Mitigation Strategy

At the end of the project, Derek Washington understands MedClaim's claim adjudication system nearly as well as James Okafor does. Maria Chen understands GlobalBank's HSA processing as well as Derek. This cross-training was not a nice-to-have — it was an explicit project deliverable.

Consider the alternative: if only James understands the MedClaim side and only Derek understands the GlobalBank side, a problem that spans both systems (like the timezone bug) requires both people to be available simultaneously. Cross-training reduces this dependency and improves the team's resilience.


Working with the Student Mainframe Lab

Simulating Message Queuing Without MQ

The Student Mainframe Lab does not have IBM MQ installed. But the core concepts of this capstone can be practiced using simulated message passing through sequential files or VSAM queues.

Approach 1: File-Based Message Simulation.

Replace the MQ PUT with a sequential file WRITE and the MQ GET with a sequential file READ. The "queue" is a sequential file. The "producer" writes JSON-formatted records; the "consumer" reads them.

      * Simulated MQ PUT (producer side): a plain record WRITE
      * appends to the queue file (AFTER ADVANCING applies only
      * to print files, so it is not used here)
           WRITE MSG-RECORD FROM WS-MQ-MESSAGE

      * Simulated MQ GET (consumer side)
           READ MSG-FILE INTO WS-MQ-MESSAGE
               AT END SET WS-QUEUE-EMPTY TO TRUE
           END-READ

This approach loses the asynchronous, guaranteed-delivery properties of MQ, but it preserves the fundamental pattern: one program generates messages, another program consumes them, and the message format is JSON.

Approach 2: VSAM Queue Simulation.

Use a VSAM ESDS (Entry-Sequenced Data Set) as a message queue. The producer adds records to the end of the ESDS; the consumer reads from the beginning, tracking its position with a "cursor" record in a separate VSAM KSDS.

This approach more closely simulates MQ behavior because VSAM ESDS supports concurrent access from multiple programs — the producer can write while the consumer reads. It does not provide guaranteed delivery or dead-letter queue functionality, but it is a useful approximation.
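For local experimentation, the cursor idea can be sketched in a few lines of Python; the file paths are illustrative, and this gives none of MQ's delivery guarantees, only the resume-where-you-left-off behavior that the KSDS cursor record provides in the VSAM approach:

```python
import os

def put(queue_path: str, message: str) -> None:
    """Producer side: append one message per line (simulated MQPUT)."""
    with open(queue_path, "a") as q:
        q.write(message + "\n")

def get(queue_path: str, cursor_path: str):
    """Consumer side: return the next unconsumed message, or None.
    The byte offset is persisted to cursor_path so a restarted
    consumer resumes where it left off (the role the KSDS cursor
    record plays in the VSAM approach)."""
    offset = 0
    if os.path.exists(cursor_path):
        with open(cursor_path) as c:
            offset = int(c.read() or 0)
    if not os.path.exists(queue_path):
        return None
    with open(queue_path) as q:
        q.seek(offset)
        line = q.readline()
        if not line:
            return None          # queue empty for now; cursor unchanged
        new_offset = q.tell()
    with open(cursor_path, "w") as c:
        c.write(str(new_offset))
    return line.rstrip("\n")
```

A producer process calls put() while a separate consumer loop calls get(). Because the cursor survives restarts, the consumer never rereads a message it has already consumed, but (unlike MQ) a crash between reading a message and acting on it can still lose work.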

GnuCOBOL Adaptations

For students using GnuCOBOL on their local machines:

  1. JSON GENERATE/PARSE: GnuCOBOL does not support the JSON GENERATE and JSON PARSE statements (these are IBM Enterprise COBOL v6+ features). Instead, build the JSON string manually using STRING:
           STRING '{"messageId":"' DELIMITED BY SIZE
                  WS-MSG-ID DELIMITED BY SPACES
                  '","claimId":"' DELIMITED BY SIZE
                  WS-CLAIM-ID DELIMITED BY SPACES
                  '","amount":' DELIMITED BY SIZE
                  WS-AMOUNT-DISPLAY DELIMITED BY SPACES
                  '}' DELIMITED BY SIZE
                  INTO WS-JSON-OUTPUT
           END-STRING
  2. DB2 Queries: Replace EXEC SQL with file I/O against indexed files. The duplicate check becomes a VSAM KSDS READ by message ID; the account lookup becomes a VSAM KSDS READ by member ID.

  3. CICS Commands: Replace EXEC CICS with standard batch processing. The trigger mechanism becomes a polling loop that checks for new records in the simulated queue file.

  4. Monitoring: Replace the CICS interval control with a simple batch program that runs as a scheduled job (cron job on Linux, Task Scheduler on Windows). The program reads the simulated queue files and reports on message counts and processing statistics.
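The scheduled monitor described in the last item might look like this sketch. The file names and thresholds are hypothetical, and depth here counts total lines ever written; a fuller version would subtract the consumer's cursor position to get the true backlog:

```python
import os

# Hypothetical thresholds mirroring HSA-MONITOR's checks
MAX_QUEUE_DEPTH = 1000
MAX_DLQ_DEPTH = 0

def count_lines(path: str) -> int:
    """One line per message in the simulated queue files."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return sum(1 for _ in f)

def check_health(queue_path: str, dlq_path: str) -> list:
    """Return alert strings; intended to run as a cron/Task Scheduler job."""
    alerts = []
    depth = count_lines(queue_path)
    if depth > MAX_QUEUE_DEPTH:
        alerts.append(f"ALERT: queue depth {depth} exceeds {MAX_QUEUE_DEPTH}")
    dlq_depth = count_lines(dlq_path)
    if dlq_depth > MAX_DLQ_DEPTH:
        alerts.append(f"CRITICAL: {dlq_depth} messages on the dead-letter queue")
    return alerts
```

As in the CICS version, the alert threshold for the main queue is generous while any dead-letter message at all triggers an alert: the DLQ is supposed to be empty.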

The key learning objective is not the specific APIs (MQ, DB2, CICS) but the patterns: event-driven design, idempotent processing, optimistic locking, reconciliation, and monitoring. These patterns apply regardless of the underlying technology.


Architectural Alternatives: What Else Could They Have Built?

The MQ-based event-driven architecture is not the only way to migrate from batch to real-time. Priya evaluated three alternatives before recommending MQ. Understanding why she chose MQ — and why the alternatives were rejected — provides valuable architectural perspective.

Alternative 1: Shared Database

Instead of message queuing, both organizations could share a single DB2 database. MedClaim writes adjudicated claims to a shared table; GlobalBank reads from the same table and processes HSA debits.

Advantages: Simple. No message infrastructure. No format conversion.

Disadvantages: Tight coupling — both organizations depend on the same database. A schema change by one organization breaks the other. Security is complex (both organizations need access to the same DB2 subsystem). Performance suffers because both organizations compete for the same database resources.

Priya rejected this approach because it creates a single point of failure that spans organizational boundaries. "If that database goes down, both organizations stop processing. With MQ, each organization can continue independent processing — messages accumulate and are processed when connectivity is restored."

Alternative 2: REST API with Polling

MedClaim exposes a REST API that returns adjudicated HSA-eligible claims. GlobalBank polls this API every few seconds, retrieves new claims, and processes them.

Advantages: Uses standard HTTP. Easy to implement with CICS web services (as demonstrated in Chapter 44). No message infrastructure required.

Disadvantages: Polling is wasteful — most requests return no new data. Latency is limited by the polling interval (if you poll every 30 seconds, average latency is 15 seconds). Error handling is complex — if GlobalBank's polling process fails, it must remember where it left off when it restarts. No guaranteed delivery — if a claim is adjudicated between polls and GlobalBank misses a poll cycle, the claim could be missed.

This approach would work for lower-volume, less critical integrations. For financial transactions where every claim must be processed exactly once, the lack of guaranteed delivery is a dealbreaker.

Alternative 3: Direct CICS-to-CICS Communication

CICS supports distributed program link (DPL) — a CICS program on one LPAR can call a CICS program on another LPAR as if it were a local CALL. MedClaim's CLM-ADJUD could directly invoke GlobalBank's HSA-EVENTS using DPL.

Advantages: Synchronous — the adjudication waits for the HSA debit to complete, ensuring real-time confirmation. Simple — no message infrastructure, no reconciliation needed.

Disadvantages: Tight coupling — CLM-ADJUD cannot complete until HSA-EVENTS responds. If GlobalBank's CICS region is down, MedClaim's adjudication stops. Performance — the synchronous call adds latency to every adjudication, even for claims that are not HSA-eligible. Scalability — each concurrent adjudication ties up a CICS task on both LPARs.

Priya rejected this approach for its tight coupling. "If GlobalBank has a CICS outage, MedClaim stops adjudicating claims. That's 500,000 claims per month that would be delayed. The business cannot accept that risk."

Why MQ Won

MQ provides the best combination of loose coupling (each organization operates independently), guaranteed delivery (no message loss), and scalability (messages can be processed at the consumer's pace). The cost is complexity — MQ infrastructure, message format design, idempotent processing, reconciliation — but this complexity is manageable and well-understood on the mainframe platform.

"Every architecture is a set of tradeoffs," Priya tells the steering committee. "MQ trades simplicity for resilience. For financial transactions between two organizations, resilience wins."

Even after the cutover, batch processing continues to play a role. The reconciliation program is a batch job. The monitoring program's historical reports are batch jobs. The cleanup and archival of old transaction data are batch jobs. Real-time does not eliminate batch — it reduces the dependency on batch for time-sensitive operations.

🔴 Theme: Legacy != Obsolete. The batch programs that were "replaced" by real-time processing are still in the load library. They still work. If the real-time system experienced a catastrophic failure (MQ down, DB2 down, network outage), the batch path could be reactivated within minutes. The legacy batch system is not obsolete — it is a safety net. It earned that role through 18 years of reliable operation, and it will keep that role for as long as the real-time system needs a fallback.

Theme: The Modernization Spectrum. This project moved the HSA payment system from one end of the modernization spectrum (batch, flat files, SFTP) to the other (real-time, MQ, DB2, JSON). But it did so incrementally: first the message infrastructure, then the producer, then the consumer, then the parallel run, then the monitoring, then the cutover. At every phase, the system was fully operational. At no point did the migration require downtime or data loss.

🔗 Theme: The Human Factor. The migration succeeded because two organizations trusted each other's teams. Maria Chen trusted James Okafor's MedClaim changes. James trusted Derek's GlobalBank changes. Priya bridged both teams. The technical architecture was well-designed, but the human architecture — trust, communication, shared understanding — was what made the project work.


Operational Runbooks

After cutover, Priya creates operational runbooks — step-by-step procedures for handling common and uncommon situations. These runbooks are essential because the event-driven system operates 24/7, and the on-call engineer may not be someone who built the system.

Runbook 1: DLQ Message Investigation

Trigger: HSA-MONITOR alert for DLQ depth > 0.

Steps:

  1. Connect to MQ Explorer and browse the DLQ (MEDCLAIM.HSA.PAYMENTS.DLQ).
  2. Examine the dead-letter header (MQDLH) that MQ prepends when it moves a message to the DLQ; its Reason field records why the message was routed there. Also check the MQMD BackoutCount field, which shows how many processing attempts were backed out before the move.
  3. View the message body. Is it valid JSON? If not, the problem is on the MedClaim producer side. Contact MedClaim operations.
  4. If the JSON is valid, attempt to identify the claim ID and member ID. Check whether the corresponding HSA account exists in GlobalBank's DB2.
  5. If the account does not exist, this is a data mismatch — MedClaim has an HSA-eligible member that GlobalBank does not recognize. Contact the HSA account team to investigate.
  6. If the account exists and the JSON is valid, the failure is likely a transient error that exhausted retries. Verify that the root cause (DB2 availability, CICS region status) has been resolved.
  7. Manually resubmit the message by moving it from the DLQ back to the main queue using a dead-letter queue handler (CSQUDLQH on z/OS) or a custom resubmission program.
  8. Monitor to confirm the resubmitted message is processed successfully.
  9. Document the incident in the HSA operations log.

Runbook 2: Queue Depth Growing

Trigger: HSA-MONITOR alert for queue depth > 1,000 or sustained growth.

Steps:

  1. Check the CICS region hosting HSA-EVENTS. Use CEMT I TRAN(HEVT) to confirm the transaction is enabled, and CEMT I TASK to see whether its tasks are actually running.
  2. If the transaction is not running, investigate the CICS system log (CSMT) for ABENDs. Restart the transaction if appropriate.
  3. If the transaction is running but processing slowly, check DB2 performance. Use the DB2 Performance Monitor to look for lock contention, buffer pool misses, or long-running queries.
  4. If DB2 is healthy and the transaction is running, check MQ channel status. The channel connecting MedClaim and GlobalBank LPARs may be down.
  5. If all components are healthy but volume is higher than expected, this may be a legitimate spike (e.g., MedClaim re-adjudication batch). Monitor but do not act unless the queue depth exceeds MAXDEPTH.
  6. If MAXDEPTH is approaching, increase it temporarily using ALTER QLOCAL. Do not restart the queue manager.
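Step 6's temporary increase would be issued through runmqsc on the GlobalBank queue manager. The queue name below is the one used throughout this chapter; the new depth of 100000 is only an example — size it to the observed backlog, and remember to restore the original value once the spike drains.

```
ALTER QLOCAL(MEDCLAIM.HSA.PAYMENTS) MAXDEPTH(100000)
DISPLAY QLOCAL(MEDCLAIM.HSA.PAYMENTS) CURDEPTH MAXDEPTH
```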

Runbook 3: Emergency Rollback to Batch

Trigger: Real-time system is completely unavailable and cannot be restored within 4 hours.

Steps:

  1. Notify both GlobalBank and MedClaim management chains.
  2. On MedClaim's LPAR, restore the batch flat file JCL from the archive library. This is a JCL change — no program changes are needed.
  3. Run the evening batch cycle. All claims adjudicated since the real-time system failed will be included in the batch flat file.
  4. On GlobalBank's LPAR, verify that the batch HSA-PROC program is still in the production load library (it should be, per the decommissioning timeline).
  5. Run GlobalBank's batch cycle against the flat file.
  6. After both batches complete, run HSA-RECON to reconcile any transactions that were partially processed by the real-time system before the failure.
  7. Any duplicates (transactions processed by real-time before the failure AND by batch after the rollback) will appear in the reconciliation report. Process reversals as needed.
  8. Keep batch running until the real-time system is fully restored and tested.
  9. When the real-time system is restored, run a special reconciliation to verify it processes correctly before disabling batch again.
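The duplicate detection in step 7 can be sketched as a Db2 query. HSA_TRANSACTIONS is the table from this chapter; HSA_BATCH_STAGE and all column names are hypothetical, standing in for wherever the batch cycle records the claims it processed.

```sql
-- Claims debited by BOTH the real-time path (before the failure)
-- and the batch rollback. HSA_BATCH_STAGE and the column names
-- are illustrative.
SELECT RT.CLAIM_ID,
       RT.TXN_AMOUNT
  FROM HSA_TRANSACTIONS RT
  JOIN HSA_BATCH_STAGE  BT
    ON BT.CLAIM_ID = RT.CLAIM_ID
 WHERE RT.TXN_STATUS = 'SUCCESS'
```

Every row this query returns is a candidate double debit and feeds the reversal processing described in step 7.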

These runbooks are stored in the operations knowledge base and reviewed quarterly. Every new team member must walk through each runbook as part of their onboarding.

⚖️ Runbooks as Documentation of Design Intent. A good runbook does more than list steps — it explains why each step matters and what to look for. The DLQ investigation runbook, for example, distinguishes between data mismatches, transient errors, and producer-side problems. This classification helps the on-call engineer understand the system's design, not just its operation. Over time, runbooks become the primary way that system knowledge transfers from the builders to the operators.


Understanding the End-to-End Transaction Flow

To fully appreciate the migration from batch to real-time, it helps to trace a single transaction through the entire system, from the moment a doctor submits a claim to the moment the HSA debit appears on the member's account.

The Journey of Claim CLM000098765

9:15 AM — Claim Submission. A medical office submits a claim for patient Sarah Mitchell through MedClaim's CICS provider portal. The claim is for a routine office visit: diagnosis code J06.9 (upper respiratory infection), procedure code 99213 (established patient office visit), charged amount $175.00.

9:15:02 AM — Claim Receipt. MedClaim's CLM-INTAKE program (modernized in Chapter 44) receives the claim, validates the provider, validates the member, and writes it to the claims DB2 table with status 'RCV' (received).

9:15:05 AM — Adjudication. The CLM-ADJUD program processes the claim. It checks Sarah Mitchell's coverage: she has a MedClaim PPO plan with a $30 copay for office visits. The allowed amount for procedure 99213 under her plan is $150.00. After the $30 copay, MedClaim's payment to the provider is $120.00. The member's responsibility (charged amount minus allowed amount plus copay) is $55.00.

9:15:05 AM — HSA Eligibility Check. CLM-ADJUD checks whether Sarah Mitchell's member record indicates HSA eligibility. It does — she has a high-deductible health plan with a linked GlobalBank HSA. The member's out-of-pocket amount ($55.00) is eligible for HSA payment.

9:15:06 AM — MQ Message Published. CLM-ADJUD builds a JSON message containing the claim ID, member ID, payment amount ($55.00), and other details. It PUTs the message on the MEDCLAIM.HSA.PAYMENTS queue with message ID MSG20240315091506001.
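Such a message body might look like the following. The field names and the member ID are illustrative — the actual schema is whatever the two teams agreed on — but the message ID, claim ID, and amount are the values from this trace.

```json
{
  "messageId": "MSG20240315091506001",
  "claimId": "CLM000098765",
  "memberId": "MBR-EXAMPLE-001",
  "paymentAmount": 55.00,
  "currency": "USD",
  "adjudicatedAt": "2024-03-15T09:15:05"
}
```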

9:15:06 AM — MQ Delivery. The IBM MQ channel between MedClaim and GlobalBank picks up the message. The message is transmitted across the private network connecting the two organizations. MQ acknowledges delivery to the MedClaim queue manager.

9:15:07 AM — Consumer Triggered. On GlobalBank's LPAR, the arrival of the message triggers the HSA-EVENTS CICS transaction. The CICS trigger monitor detects the message and starts the transaction.

9:15:07 AM — Duplicate Check. HSA-EVENTS reads the message and checks DB2: has a transaction with message ID MSG20240315091506001 already been processed? No — this is a new message.

9:15:07 AM — Account Lookup. HSA-EVENTS queries the HSA_ACCOUNTS table using Sarah Mitchell's member ID. It finds her HSA account: HSA0045678, current balance $3,805.00, status Active.

9:15:08 AM — Debit Applied. HSA-EVENTS updates Sarah Mitchell's HSA balance: $3,805.00 - $55.00 = $3,750.00. The UPDATE uses optimistic locking — it includes AND HSA_BALANCE = 3805.00 to ensure no other transaction has modified the balance since the SELECT.

9:15:08 AM — Transaction Recorded. HSA-EVENTS inserts a row in the HSA_TRANSACTIONS table recording the debit: message ID, claim ID, account ID, amount, new balance, status SUCCESS.

9:15:08 AM — Confirmation Sent. HSA-EVENTS generates a JSON confirmation message and PUTs it on the GLOBALBANK.HSA.CONFIRMS queue.

9:15:08 AM — SYNCPOINT. CICS commits the unit of work. The DB2 update, the DB2 insert, and the MQ PUT are all committed atomically. If any had failed, all would have been rolled back.

9:15:09 AM — Confirmation Received. On MedClaim's LPAR, the CLM-EVENTS program picks up the confirmation message. It updates the claim record in MedClaim's DB2 to reflect that the HSA payment has been processed.

Total elapsed time: 4 seconds — from claim adjudication to HSA debit confirmed.

Under the old batch system, this would have taken 24-48 hours. Sarah Mitchell would have seen the charge on her HSA account the next day (or the day after). Now she sees it in under 5 seconds — while she is still at the doctor's office.

This trace illustrates every major concept in this capstone: JSON messaging, MQ delivery, idempotent processing, optimistic locking, transactional messaging, and cross-organizational coordination. It also shows how the five programs in the real-time system (CLM-ADJUD, HSA-EVENTS, CLM-EVENTS, HSA-RECON, HSA-MONITOR) work together as a cohesive system — each program handling one responsibility, communicating through well-defined messages.
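The duplicate check (9:15:07), the optimistic-locking UPDATE (9:15:08), and the SYNCPOINT (9:15:08) from the trace can be sketched as the core of HSA-EVENTS. The table names are the chapter's; the column names, host variables, and paragraph names are assumptions.

```cobol
      * Duplicate check: has this message ID been processed already?
           EXEC SQL
               SELECT COUNT(*) INTO :WS-DUP-COUNT
                 FROM HSA_TRANSACTIONS
                WHERE MSG_ID = :WS-MSG-ID
           END-EXEC
           IF WS-DUP-COUNT > 0
      *        Already processed -- discard silently. This is what
      *        makes redelivery and resubmission safe.
               PERFORM 9000-DISCARD-DUPLICATE
           END-IF

      * Optimistic locking: the UPDATE succeeds only if the balance
      * is still the value read by the earlier SELECT.
           EXEC SQL
               UPDATE HSA_ACCOUNTS
                  SET HSA_BALANCE = :WS-NEW-BALANCE
                WHERE ACCOUNT_ID  = :WS-ACCOUNT-ID
                  AND HSA_BALANCE = :WS-OLD-BALANCE
           END-EXEC
           IF SQLCODE = +100
      *        Another transaction changed the balance first --
      *        re-read and recompute rather than overwrite.
               PERFORM 8000-RETRY-DEBIT
           END-IF

      * After the transaction INSERT and the confirmation MQPUT,
      * one SYNCPOINT commits DB2 and MQ as a single unit of work.
           EXEC CICS SYNCPOINT END-EXEC
```

The three techniques reinforce each other: idempotency makes retries safe, optimistic locking makes concurrency safe, and the syncpoint guarantees that the debit, the audit row, and the confirmation message are all-or-nothing.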


The Final Architecture

After cutover, the production system looks like this:

MedClaim LPAR                     GlobalBank LPAR
┌─────────────────┐               ┌─────────────────┐
│                 │               │                 │
│  CLM-ADJUD      │    ──MQ──→   │  HSA-EVENTS     │
│  (adjudicates   │               │  (processes     │
│   claims, puts  │               │   HSA debits,   │
│   MQ messages)  │               │   sends         │
│                 │    ←──MQ──   │   confirmations)│
│  CLM-EVENTS     │               │                 │
│  (processes     │               │  HSA-MONITOR    │
│   confirmations)│               │  (health checks)│
│                 │               │                 │
│                 │               │  HSA-RECON      │
│                 │               │  (reconciliation)│
│                 │               │                 │
└─────────────────┘               └─────────────────┘

Performance metrics after 30 days of production:

| Metric                       | Target       | Actual           |
|------------------------------|--------------|------------------|
| End-to-end latency           | < 30 seconds | 12 seconds (avg) |
| Message delivery reliability | 100%         | 100%             |
| Processing accuracy          | 100%         | 100%             |
| System availability          | 99.9%        | 99.97%           |
| DLQ messages (30 days)       | 0            | 0                |
| Reconciliation mismatches    | 0            | 0                |

Summary: The Complete Journey

This capstone brought together every topic in this textbook:

  • Data definition (Parts I-II): Copybooks, COMP-3 fields, 88-level conditions
  • File processing (Parts III-IV): Sequential files, VSAM, DB2, file status handling
  • Program design (Parts V-VI): Structure charts, subprograms, modular architecture
  • CICS programming (Part VII): Online transactions, BMS maps, pseudo-conversational design
  • Modern techniques (Part VIII): JSON, web services, MQ, event-driven architecture
  • Testing and deployment (Part VIII): JCL, parallel runs, reconciliation, CI/CD

All five themes converge in this final capstone:

  • Legacy != Obsolete: The batch system remains as a safety net; COBOL proves capable of modern real-time processing
  • Readability is a Feature: Every program uses consistent naming, 88-levels, and clear structure
  • The Modernization Spectrum: The migration was incremental, reversible, and delivered value at every phase
  • Defensive Programming: Idempotent processing, duplicate checking, optimistic locking, dead-letter queues, rollback plans
  • The Human Factor: Cross-organizational trust, team cross-training, and clear communication made the technical solution possible

The Three Capstones: A Retrospective

Looking back across all three capstones, a clear progression emerges — not just in technical complexity, but in what it means to be a professional COBOL developer.

Capstone 1: Learning to Build

In Capstone 1, Derek Washington built a banking system from scratch. He controlled every decision: the data design, the program structure, the error handling, the JCL. The system had no history, no legacy constraints, no competing stakeholders. This is the simplest kind of engineering — greenfield development with full autonomy.

The key lesson: building a system teaches you how the parts fit together. Before you can maintain, modernize, or migrate a system, you must understand how programs share data through copybooks, how JCL orchestrates job streams, how CICS provides online access, and how VSAM stores data. Capstone 1 taught those fundamentals.

Capstone 2: Learning to Improve

In Capstone 2, James Okafor modernized a legacy insurance system. He did not control the original design — he inherited it. He could not start over — the system was in production, processing half a million claims per month. Every change had to preserve existing behavior while improving maintainability, testability, and accessibility.

The key lesson: improving a system teaches you humility and discipline. The legacy code was not written by bad programmers — it was written by people solving problems with the tools they had at the time. James's job was not to judge the original design but to evolve it. The five-phase modernization (document, refactor, DB2, API, CI/CD) is a template that applies to any legacy system.

Capstone 3: Learning to Integrate

In Capstone 3, Priya Kapoor migrated a batch process to real-time event-driven processing across two organizations. She controlled neither the MedClaim system nor the GlobalBank system — she had to work within the constraints of both. The technical challenges (MQ messaging, idempotent processing, reconciliation) were significant, but the organizational challenges (cross-team trust, coordinated cutover, shared monitoring) were equally demanding.

The key lesson: integrating systems teaches you that technology is the easy part. Getting MQ to deliver messages is straightforward. Getting two organizations to agree on message formats, error handling procedures, cutover timing, and rollback criteria requires diplomacy, patience, and clear communication. Priya's role as the bridge between GlobalBank and MedClaim was as important as her technical design.

The Career Arc

These three capstones mirror a typical mainframe developer's career arc:

Years 1-2: Building and maintaining individual programs. Understanding copybooks, file handling, CICS, and JCL. This is Capstone 1 territory — learning the fundamentals by building things.

Years 3-7: Taking ownership of subsystems. Leading modernization efforts. Designing for testability and maintainability. This is Capstone 2 territory — improving existing systems while keeping them running.

Years 7+: Architecting cross-system integrations. Making technology decisions with multi-year implications. Mentoring junior developers. This is Capstone 3 territory — thinking beyond individual programs to systems of systems.

Derek Washington entered this textbook as a Capstone 1 developer. By participating in Priya's Capstone 3 project, he has glimpsed where his career can go. The path from "I can write a COBOL program" to "I can architect a cross-organizational real-time system" is long, but every step is built on the fundamentals.


Closing Thoughts

You began this textbook as a student who had completed a first COBOL course. You end it as someone who has designed a banking system from scratch, modernized a legacy insurance system, and migrated a batch process to real-time event-driven architecture.

These are not academic exercises. They are the kinds of projects that mainframe COBOL developers work on every day at banks, insurance companies, government agencies, and healthcare organizations around the world. The systems you have learned to build, maintain, and modernize process trillions of dollars, serve billions of people, and form the invisible infrastructure of modern society.

What You Have Learned

Take a moment to reflect on the breadth of knowledge you have acquired:

  • Data design: COMP-3 packed decimal for monetary precision, 88-level conditions for self-documenting code, copybooks for shared definitions, FILLER bytes for forward compatibility
  • File processing: Sequential files for batch I/O, VSAM KSDS for keyed access, DB2 for relational data, file status checking for defensive programming
  • Program design: Structure charts for planning, subprograms for modularity, LINKAGE SECTION for parameter passing, the read-ahead pattern for EOF handling
  • Online programming: CICS pseudo-conversational design, BMS maps for screen I/O, COMMAREA for conversation state, EXEC CICS for system services
  • Modern integration: JSON GENERATE and JSON PARSE for web interoperability, IBM MQ for guaranteed message delivery, CICS web services for API exposure, event-driven architecture for real-time processing
  • Testing and deployment: JCL job streams with conditional execution, GDGs for version management, parallel runs for migration safety, reconciliation for data verification, CI/CD for automated testing
  • Professional practice: Code reviews, documentation, error handling, audit trails, capacity planning, operational runbooks, cross-team collaboration

Each of these topics could fill a textbook on its own. Together, they form the toolkit of a professional mainframe COBOL developer — someone who can not only write code but design systems, manage migrations, and make architectural decisions that affect entire organizations.

The Demand for COBOL Skills

As of this writing, the demand for COBOL developers exceeds the supply by a significant margin. Major banks, insurance companies, government agencies, and healthcare organizations are actively recruiting COBOL developers — not because they are nostalgic for the past, but because their mission-critical systems run on COBOL and need skilled people to maintain, modernize, and extend them.

The retirement wave among experienced COBOL developers is accelerating. Developers like Maria Chen (15+ years of experience) and James Okafor are approaching the later stages of their careers. The knowledge they carry — not just COBOL syntax, but deep understanding of business processes, system architecture, and operational practices — is at risk of being lost.

You are the solution to this problem. Every concept in this textbook, every program you have written, every design decision you have analyzed brings you closer to being the developer that these organizations need. The path from student to production-ready developer is not easy, but it is well-defined: learn the fundamentals, build complete systems, understand legacy code, and practice modern integration techniques. You have done all of these things.

A Final Word from the Team

COBOL is not a historical curiosity. It is a living, working language that powers the systems you depend on — whether you know it or not. The skills you have learned in this textbook are not just relevant today; they will be relevant for decades to come.

The project is complete. The real-time HSA system is in production. The batch safety net is in place. The monitoring is active. The runbooks are written. Both teams are cross-trained.

Priya Kapoor closes her laptop and looks at the project dashboard one last time. 150,000 transactions processed in the first 30 days. Zero errors. Zero data loss. 12-second average latency against a 30-second target.

She turns to the group on the conference bridge. "I want to thank everyone — James and the MedClaim team, Maria and Derek at GlobalBank. We migrated a financial processing system from batch to real-time across two organizations without a single production incident. That does not happen by accident. It happens because every person on this team did their job with discipline and care."

"And because the batch safety net never had to be used," Derek adds.

Maria Chen smiles. "The best safety net is the one you never need. But the second best safety net is the one that is there when you do."

As Maria tells Derek Washington at the end of the project: "You came here thinking you'd be working on old technology. Now you know — there's nothing old about building systems that the world depends on."